# Sentiment Analysis
This notebook gathers tweets about a specific pair of stocks. It is the second step in a two-part data-processing pipeline.

After data is collected for the given time frame (with the post dates matching up), it can be fed into our model, which we still need to build.

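- textblob's default `sentiment` property, which is what the code below calls, is a lexicon-based analyzer (`PatternAnalyzer`); the Naive Bayes classifier trained on movie reviews is the optional `NaiveBayesAnalyzer`. We use the polarity score in an attempt to estimate public sentiment on a given stock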
- ideally, we want to gather 6 things from this file for every day for each pair of stocks:

| 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | 
| % tweets positive | % tweets negative | % tweets neutral | # likes among pos tweets | # likes among negative tweets | # likes among neutral tweets |
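
As a rough sketch of what those six values look like in code, the snippet below computes them from a list of `(sentiment, likes)` pairs for one day. The function name `daily_features` and the input shape are assumptions made for illustration; they are not part of the pipeline in this notebook.

```python
from collections import Counter

# same column order as the table above
LABELS = ("positive", "negative", "neutral")

def daily_features(items):
    """items: iterable of (sentiment_label, like_count) pairs for one day.

    Returns [pct_pos, pct_neg, pct_neut, likes_pos, likes_neg, likes_neut].
    """
    items = list(items)
    counts = Counter(label for label, _ in items)
    likes = Counter()
    for label, n_likes in items:
        likes[label] += n_likes
    total = len(items)
    if total == 0:
        # no tweets that day: report all zeros rather than dividing by zero
        return [0.0] * 6
    percents = [100 * counts[label] / total for label in LABELS]
    like_totals = [float(likes[label]) for label in LABELS]
    return percents + like_totals

example = daily_features([("positive", 3), ("negative", 1), ("positive", 2), ("neutral", 0)])
```

Returning a flat list keeps one row per day per stock, which should be easy to stack into a training matrix later.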

### Issues and concerns

So... what issue did I run into that caused me great grief during this project?

- Twitter's free API does not let you retrieve tweets older than 7 days, as previously discussed
- Twitter firehose is needed to do this, and that would cost a pretty penny, not to mention take some time to get set up

Workaround?

- Can currently only train on data from the past week for our given stocks... not enough data for real training.
- Perhaps we can look to reddit instead in the meantime?

In [1]:
!pip install tweepy textblob



#### Gather our corpora of text for interpretation
**Note:** We may want to come back to this and find a corpus better suited to stocks or to the specific stock sector that we are looking at

In [2]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /home/jovyan/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


## Twitter

In [43]:
import re
import tweepy
from tweepy import OAuthHandler
from textblob import TextBlob
from datetime import datetime, date, timezone

from twitter_auth import API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET, BEARER_TOKEN
  
class TwitterClient(object):
    '''
    Generic Twitter Class for sentiment analysis.
    '''
    def __init__(self):
        '''
        Class constructor or initialization method.
        '''
        # keys and tokens from the Twitter Dev Console
        consumer_key = API_KEY
        consumer_secret = API_SECRET
        access_token = ACCESS_TOKEN
        access_token_secret = ACCESS_TOKEN_SECRET
  
        # attempt authentication
        try:
            # create OAuthHandler object
#             self.auth = OAuthHandler(consumer_key, consumer_secret)
            # set access token and secret
#             self.auth.set_access_token(access_token, access_token_secret)
            # create tweepy API object to fetch tweets
#             self.api = tweepy.API(self.auth)
            self.api = tweepy.Client(bearer_token=BEARER_TOKEN)
        except Exception as e:
            # catch Exception rather than everything, and show what went wrong
            print("Error: Authentication Failed: " + str(e))
  
    def clean_tweet(self, tweet):
        '''
        Utility function to clean tweet text by removing links, special characters
        using simple regex statements.
        '''
        # raw string so the regex escapes (\w, \S, \/) are not treated as string escapes
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
  
    def get_tweet_sentiment(self, tweet):
        '''
        Utility function to classify sentiment of passed tweet
        using textblob's sentiment method
        '''
        # create TextBlob object of passed tweet text
        analysis = TextBlob(self.clean_tweet(tweet))
        # set sentiment
        if analysis.sentiment.polarity > 0:
            return 'positive'
        elif analysis.sentiment.polarity == 0:
            return 'neutral'
        else:
            return 'negative'
  
    def get_tweets(self, query, count = 10):
        '''
        Main function to fetch tweets and parse them.
        '''
        # empty list to store parsed tweets
        tweets = []
  
#         try:
        # call twitter api to fetch tweets
        
        # search_recent_tweets (Twitter API v2) only searches tweets from roughly the last 7 days
        
        # Sadly, the start_time parameter has the same 7-day limit, so we are unable to reach anything before then
        
        # The time that we are making our prediction, time t
        # UTC Time (Union[datetime.datetime, str]) – YYYY-MM-DDTHH:mm:ssZ
        
        end_time = datetime(2021, 12, 8, hour=16, minute=0, second=0, microsecond=0, tzinfo=timezone.utc).isoformat()
        
        # The time at which our look-back window begins, time t-window_size
        # YYYY-MM-DDTHH:mm:ssZ
        start_time = datetime(2021,12,3, hour=16, minute=0, second=0, microsecond=0, tzinfo=timezone.utc).isoformat()
        
        # Max results between 10 and 100
        fetched_tweets = self.api.search_recent_tweets(query, max_results=count, start_time=start_time, end_time=end_time)

        # tweet_fields = [created_at, text] 
        
        
        # parsing tweets one by one
        for tweet in fetched_tweets[0]:
            # empty dictionary to store required params of a tweet
            parsed_tweet = {}

            
            print(tweet)
            # saving text of tweet
            parsed_tweet['text'] = tweet.text
            # saving sentiment of tweet
            parsed_tweet['sentiment'] = self.get_tweet_sentiment(tweet.text)

            # appending parsed tweet to tweets list
            if tweet.retweet_count > 0:
                # if tweet has retweets, ensure that it is appended only once
                if parsed_tweet not in tweets:
                    tweets.append(parsed_tweet)
            else:
                tweets.append(parsed_tweet)

        # return parsed tweets
        return tweets
  
#         except Exception as e:
#             # print error (if any)
#             print("Error : " + str(e))
  
def main():
    # creating object of TwitterClient Class
    api = TwitterClient()
    # calling function to get tweets
    tweets = api.get_tweets(query = 'Cloudflare', count = 10)
    print(tweets)
    
    # picking positive tweets from tweets
    ptweets = [tweet for tweet in tweets if tweet['sentiment'] == 'positive']
    # percentage of positive tweets
    print("Positive tweets percentage: {} %".format(100*len(ptweets)/len(tweets)))
    # picking negative tweets from tweets
    ntweets = [tweet for tweet in tweets if tweet['sentiment'] == 'negative']
    # percentage of negative tweets
    print("Negative tweets percentage: {} %".format(100*len(ntweets)/len(tweets)))
    # percentage of neutral tweets
    print("Neutral tweets percentage: {} %".format(100*(len(tweets) - (len(ntweets) + len(ptweets)))/len(tweets)))

    # printing first 10 positive tweets
    print("\n\nPositive tweets:")
    for tweet in ptweets[:10]:
        print(tweet['text'])

    # printing first 10 negative tweets
    print("\n\nNegative tweets:")
    for tweet in ntweets[:10]:
        print(tweet['text'])

if __name__ == "__main__":
    # calling main function
    main()

RT @StockMKTNewz: Cloudflare $NET announced today it has acquired Zaraz "a company that has developed technology to speed up and secure web…


AttributeError: 
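
The traceback above is cut off, so the exact cause isn't recorded. One likely culprit, and this is an assumption, is `tweet.retweet_count`: that attribute belonged to the old v1.1 status objects, while the v2 `Tweet` objects returned by `tweepy.Client` keep their counts in a `public_metrics` dict, and only when `public_metrics` is requested through `tweet_fields`. Below is a minimal sketch of a parse step that doesn't rely on that attribute. `parse_tweets` and its `classify` argument are hypothetical names, and stand-in objects are used in place of real API results.

```python
from types import SimpleNamespace

def parse_tweets(tweets, classify):
    """Turn tweet-like objects into dicts, keeping each distinct text once."""
    seen = set()
    parsed = []
    for tweet in tweets or []:          # tolerate an empty or None result
        if tweet.text in seen:          # retweets repeat the same text
            continue
        seen.add(tweet.text)
        # public_metrics may be missing if it was not requested
        metrics = getattr(tweet, "public_metrics", None) or {}
        parsed.append({
            "text": tweet.text,
            "sentiment": classify(tweet.text),
            "likes": metrics.get("like_count", 0),
        })
    return parsed

# stand-ins for v2 Tweet objects (not real API output)
fake = [
    SimpleNamespace(text="RT great quarter", public_metrics={"like_count": 4, "retweet_count": 2}),
    SimpleNamespace(text="RT great quarter", public_metrics={"like_count": 0, "retweet_count": 0}),
    SimpleNamespace(text="meh"),
]
result = parse_tweets(fake, classify=lambda text: "neutral")
```

Deduplicating on text instead of on a retweet counter also covers the case where the same retweet shows up several times in one page of results.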

### Shame on you Twitter
But it's okay. We will go with our backup, Reddit!
Plan B:

| 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | 
| % posts positive | % posts negative | % posts neutral | # likes among pos posts | # likes among negative posts | # likes among neutral posts |

## Reddit

In [5]:
import re
import requests
import pandas as pd
from datetime import datetime, date, timedelta
from textblob import TextBlob

from reddit_auth import CLIENT_ID, CLIENT_SECRET, USERNAME, PASSWORD

class RedditScraper(object):
    def clean_post(self, post):
        '''
        Utility function to clean post text by removing links, special characters
        using simple regex statements.
        '''
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", post).split())

    def get_sentiment(self, post):
        '''
        Utility function to classify sentiment of passed post
        using textblob's sentiment method
        '''
        # create TextBlob object of passed post text
        analysis = TextBlob(self.clean_post(post))
        # set sentiment
        if analysis.sentiment.polarity > 0:
            return 'positive'
        elif analysis.sentiment.polarity == 0:
            return 'neutral'
        else:
            return 'negative'

    # we use this function to convert responses to dataframes
    def df_from_response(self, res):
        # collect one dict per post, then build the dataframe in one go
        # (DataFrame.append was removed in pandas 2.0)
        rows = []

        # loop through each post pulled from res
        for post in res.json()['data']['children']:
    #         if post['data']['link_flair_css_class'] not in ['news', 'meme', 'discussion', 'dd']:
    #             continue

            rows.append({
    #             'subreddit': post['data']['subreddit'],
                'title': post['data']['title'],
                'selftext': post['data']['selftext'],
                'upvote_ratio': post['data']['upvote_ratio'],
    #             'ups': post['data']['ups'],
    #             'downs': post['data']['downs'],
                'score': post['data']['score'],
    #             'link_flair_css_class': post['data']['link_flair_css_class'],
                # convert in UTC (datetime.fromtimestamp would use the local timezone)
                'created_utc': pd.to_datetime(post['data']['created_utc'], unit='s', utc=True).strftime('%Y-%m-%d'), #T%H:%M:%SZ'),
                'id': post['data']['id'],
                'kind': post['kind']
            })

        return pd.DataFrame(rows)
    
    def parse_data(self, df):
        df = df.sort_values(by=['created_utc'], ignore_index=True)
        df['sentiment'] = df['title'].apply(self.get_sentiment)
        return df
    
    def gather_data(self, ticker):
        # authenticate API
        client_auth = requests.auth.HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET)
        data = {
            'grant_type': 'password',
            'username': USERNAME,
            'password': PASSWORD
        }
        headers = {'User-Agent': 'myBot/0.0.1'}

        # send authentication request for OAuth token
        res = requests.post('https://www.reddit.com/api/v1/access_token',
                            auth=client_auth, data=data, headers=headers)
        # extract token from response and format correctly
        token = f"bearer {res.json()['access_token']}"
        # update API headers with authorization (bearer token)
        headers = {**headers, **{'Authorization': token}}

        # initialize list of per-page dataframes and parameters for pulling data in loop
        frames = []
        params = {'q': ticker, 'limit': 100}

        # loop through up to 10 times (returning up to 1K posts)
        for i in range(10):
            # make request
            # NOTE: without restrict_sr=1 in params, Reddit searches all subreddits, not just this one
            res = requests.get("https://oauth.reddit.com/r/wallstreetbets/search/", #/hot",
                               headers=headers,
                               params=params)

            # get dataframe from response
            new_df = self.df_from_response(res)

            # Empty/no results, stop here
            if new_df.shape[0] == 0:
                break

            # take the final row (last entry on this page)
            row = new_df.iloc[len(new_df)-1]
            # create fullname
            fullname = row['kind'] + '_' + row['id']
            # add/update fullname in params
            params['after'] = fullname

            # keep this page; DataFrame.append was removed in pandas 2.0
            frames.append(new_df)

        # nothing came back at all
        if not frames:
            return pd.DataFrame()

        data = pd.concat(frames, ignore_index=True)
        data = self.parse_data(data)
        
        return data
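
The loop in `gather_data` pages through results with Reddit's `after` cursor: each request passes the fullname (`kind` + `_` + `id`) of the last item seen, and the loop stops once a page comes back empty. Here is that pattern on its own as a small sketch. `paginate` and `fetch_page` are hypothetical names, and the fake fetcher stands in for the real HTTP call.

```python
def paginate(fetch_page, max_pages=10):
    """Collect items page by page using an 'after' cursor.

    fetch_page(after) must return a list of dicts with 'kind' and 'id' keys;
    after is None for the first page.
    """
    items, after = [], None
    for _ in range(max_pages):
        page = fetch_page(after)
        if not page:                      # empty page: nothing left to fetch
            break
        items.extend(page)
        last = page[-1]
        after = f"{last['kind']}_{last['id']}"   # fullname, e.g. 't3_r85gfk'
    return items

# fake listing of 5 posts served 2 at a time (stand-in for the API)
POSTS = [{"kind": "t3", "id": f"p{i}"} for i in range(5)]

def fake_fetch(after, page_size=2):
    start = 0
    if after is not None:
        fullnames = [f"{p['kind']}_{p['id']}" for p in POSTS]
        start = fullnames.index(after) + 1
    return POSTS[start:start + page_size]

all_posts = paginate(fake_fetch)
```

Keeping the cursor logic separate from the request code makes it possible to test the stopping condition without hitting the API.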

In [4]:
scraper = RedditScraper()
data = scraper.gather_data("COHR")

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>


In [93]:
data

Unnamed: 0,created_utc,id,kind,score,selftext,title,upvote_ratio,sentiment
0,2007-04-17T13:14:42Z,1ilb8,t3,0.0,,ALLIRA COHRS,0.50,neutral
1,2011-01-11T16:58:42Z,f0ax0,t3,0.0,,"COHR, ENZ, BIIB, PWRM, PCLN - CRWEWallstreet.c...",0.50,neutral
2,2013-09-25T21:01:26Z,1n4ph5,t3,7.0,,"WIPO Approves 15 New Observers, Including DNDi...",0.82,positive
3,2013-09-25T21:18:42Z,1n4qvr,t3,2.0,,"WIPO Approves 15 New Observers, Including DNDi...",0.63,positive
4,2015-10-26T18:04:28Z,3qb4zy,t3,1.0,,Relationship between Blood Myostatin Levels an...,1.00,neutral
...,...,...,...,...,...,...,...,...
238,2021-11-29T11:54:46Z,r4u8n3,t3,1.0,,$COHR Awaiting short signal. Algo Trading Idea...,1.00,neutral
239,2021-11-30T19:17:00Z,r5uymx,t3,1.0,,$COHR Waiting for Short signal.,1.00,neutral
240,2021-12-03T18:14:00Z,r85dcd,t3,1.0,,$COHR Awaiting Short Signal. Stock Trading Ide...,1.00,neutral
241,2021-12-03T18:17:48Z,r85gfk,t3,1.0,,$COHR Awaiting Short Signal. Stock Trading Ide...,1.00,neutral


In [81]:
date(2021,5,10)- timedelta(7)

datetime.date(2021, 5, 3)

In [87]:
str(date(2021,5,10) - timedelta(7))

'2021-05-03'

In [97]:
data.loc[(data['created_utc'] < str(date(2021,6,5))) & (data['created_utc'] > str(date(2021,6,5) - timedelta(7)))]

Unnamed: 0,created_utc,id,kind,score,selftext,title,upvote_ratio,sentiment
142,2021-06-03T01:12:52Z,nr1dkn,t3,5367.0,I understand that most people in this thread i...,Beware of what AMC shorts are holding!,0.91,neutral
143,2021-06-03T10:15:31Z,nr9zni,t3,2.0,"**Author**: u/Shark_Bones(**Karma:** 311078, *...",Are we headed for the Mother of All Crashes? H...,0.67,negative
144,2021-06-03T19:35:46Z,nrlwci,t3,10.0,"Reposting something I found in r/stocks, basic...",HODL my smooth brained apes 💎🙌🏽🦍,0.92,positive


In [114]:
import numpy as np
def get_sentiment_data(data, for_date, window=7):
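    '''
    Input:
        data: dataframe of posts with 'created_utc', 'sentiment' and 'score' columns
        for_date: a datetime.date object for the max_date when a post was created (exclusive)
        window: how far back to gather data on posts (days)
    Output:
        list of 6 values for that time interval:
        [fraction positive, fraction neutral, fraction negative,
         positive score, neutral score, negative score]
    '''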
    window_data = data.loc[(data['created_utc'] < str(for_date)) & (data['created_utc'] > str(for_date - timedelta(window)))]
    s = window_data['sentiment']
    pos_count, pos_score = s[s == 'positive'].count(), window_data['score'][s == 'positive'].sum()
    neut_count, neut_score = s[s == 'neutral'].count(), window_data['score'][s == 'neutral'].sum()
    neg_count, neg_score = s[s == 'negative'].count(), window_data['score'][s == 'negative'].sum()
    
    tot = pos_count + neut_count + neg_count
    if tot == 0:
        return [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    pos_p = pos_count / tot
    neut_p = neut_count / tot
    neg_p = neg_count / tot

    cols = ['Percent Positive', 'Percent Neutral', 'Percent Negative', 'Positive Score', 'Neutral Score', 'Negative Score']
    stats = np.array([pos_p, neut_p, neg_p, pos_score, neut_score, neg_score])
#     out = pd.DataFrame(data=stats.reshape(1,-1), index=[for_date], columns=cols)
    out = [pos_p, neut_p, neg_p, pos_score, neut_score, neg_score]
    return out

get_sentiment_data(data, date(2021,5,12))

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [57]:
# Let's take a peek at our data
data = data.sort_values(by=['created_utc'], ignore_index=True)
data

Unnamed: 0,created_utc,id,kind,score,selftext,title,upvote_ratio
0,2007-04-17T13:14:42Z,1ilb8,t3,0.0,,ALLIRA COHRS,0.50
1,2011-01-11T16:58:42Z,f0ax0,t3,0.0,,"COHR, ENZ, BIIB, PWRM, PCLN - CRWEWallstreet.c...",0.50
2,2013-09-25T21:01:26Z,1n4ph5,t3,9.0,,"WIPO Approves 15 New Observers, Including DNDi...",0.99
3,2013-09-25T21:18:42Z,1n4qvr,t3,2.0,,"WIPO Approves 15 New Observers, Including DNDi...",0.63
4,2015-10-26T18:04:28Z,3qb4zy,t3,1.0,,Relationship between Blood Myostatin Levels an...,1.00
...,...,...,...,...,...,...,...
238,2021-11-29T11:54:46Z,r4u8n3,t3,1.0,,$COHR Awaiting short signal. Algo Trading Idea...,1.00
239,2021-11-30T19:17:00Z,r5uymx,t3,1.0,,$COHR Waiting for Short signal.,1.00
240,2021-12-03T18:14:00Z,r85dcd,t3,1.0,,$COHR Awaiting Short Signal. Stock Trading Ide...,1.00
241,2021-12-03T18:17:48Z,r85gfk,t3,1.0,,$COHR Awaiting Short Signal. Stock Trading Ide...,1.00


In [58]:
# get_sentiment is a method of RedditScraper, so call it through the scraper instance
data['sentiment'] = data['title'].apply(scraper.get_sentiment)
data

Unnamed: 0,created_utc,id,kind,score,selftext,title,upvote_ratio,sentiment
0,2007-04-17T13:14:42Z,1ilb8,t3,0.0,,ALLIRA COHRS,0.50,neutral
1,2011-01-11T16:58:42Z,f0ax0,t3,0.0,,"COHR, ENZ, BIIB, PWRM, PCLN - CRWEWallstreet.c...",0.50,neutral
2,2013-09-25T21:01:26Z,1n4ph5,t3,9.0,,"WIPO Approves 15 New Observers, Including DNDi...",0.99,positive
3,2013-09-25T21:18:42Z,1n4qvr,t3,2.0,,"WIPO Approves 15 New Observers, Including DNDi...",0.63,positive
4,2015-10-26T18:04:28Z,3qb4zy,t3,1.0,,Relationship between Blood Myostatin Levels an...,1.00,neutral
...,...,...,...,...,...,...,...,...
238,2021-11-29T11:54:46Z,r4u8n3,t3,1.0,,$COHR Awaiting short signal. Algo Trading Idea...,1.00,neutral
239,2021-11-30T19:17:00Z,r5uymx,t3,1.0,,$COHR Waiting for Short signal.,1.00,neutral
240,2021-12-03T18:14:00Z,r85dcd,t3,1.0,,$COHR Awaiting Short Signal. Stock Trading Ide...,1.00,neutral
241,2021-12-03T18:17:48Z,r85gfk,t3,1.0,,$COHR Awaiting Short Signal. Stock Trading Ide...,1.00,neutral


In [59]:
s = data['sentiment']
pos_count, pos_score = s[s == 'positive'].count(), data['score'][s == 'positive'].sum()
neut_count, neut_score = s[s == 'neutral'].count(), data['score'][s == 'neutral'].sum()
neg_count, neg_score = s[s == 'negative'].count(), data['score'][s == 'negative'].sum()
print('positives for this time frame: count={}, score={}'.format(pos_count, pos_score))
print('neutral for this time frame: count={}, score={}'.format(neut_count, neut_score))
print('negatives for this time frame: count={}, score={}'.format(neg_count, neg_score))

positives for this time frame: count=32, score=1475.0
neutral for this time frame: count=206, score=7334.0
negatives for this time frame: count=5, score=24.0


In [60]:
tot = pos_count + neut_count + neg_count
pos_p = pos_count / tot
neut_p = neut_count / tot
neg_p = neg_count / tot
print("Pos: {} Neut: {} Neg: {}".format(pos_p, neut_p, neg_p))

Pos: 0.13168724279835392 Neut: 0.8477366255144033 Neg: 0.0205761316872428


#### Sample Output:

In [61]:
import numpy as np
cols = ['Percent Positive', 'Percent Neutral', 'Percent Negative', 'Positive Score', 'Neutral Score', 'Negative Score']
# use a separate name so the posts dataframe in `data` is not overwritten
stats = np.array([pos_p, neut_p, neg_p, pos_score, neut_score, neg_score])
out = pd.DataFrame(data=stats.reshape(1,-1), index=[date(2000,1,1)], columns=cols)

In [62]:
out

Unnamed: 0,Percent Positive,Percent Neutral,Percent Negative,Positive Score,Neutral Score,Negative Score
2000-01-01,0.131687,0.847737,0.020576,1475.0,7334.0,24.0
