# TwtrConvo

TwtrConvo is a Python package that uses tweepy, pandas, TextBlob, and plotly to query for the latest top tweets about a public company (given its ticker symbol) and provides interactive plots that display the overall sentiment and metrics of the Twitter conversation surrounding the company.

To accomplish this, the library:
    - first, queries for tweets using tweepy and the user's Twitter API keys
    - second, organizes the tweets and their metadata using pandas DataFrames
    - third, parses and cleans the text, then calculates the sentiment using TextBlob and regex
    - fourth and finally, graphically displays the results using plotly.

nbviewer link: https://nbviewer.jupyter.org/github/LAdaKid/TwtrConvo/blob/master/README.ipynb

### Setup

In order to use this module you will first need to set up your Twitter API keys.  If you don't have Twitter API keys, get them by following this guide:

https://developer.twitter.com/en/docs/basics/authentication/guides/access-tokens.html

Once you get your Twitter API keys you will need to add them to your environment with the variable names:

    TWITTER_CONSUMER_KEY
    TWITTER_CONSUMER_SECRET
    TWITTER_ACCESS_TOKEN
    TWITTER_ACCESS_TOKEN_SECRET

This will allow the TwtrConvo "tweets" module access to the Twitter API in order to query for tweets.
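
Before running any queries it can help to confirm that all four variables are visible to Python. The snippet below is a minimal sketch of such a check; `missing_twitter_keys` is a hypothetical helper written for this README and is not part of TwtrConvo.

```python
import os

# Variable names taken from the list above.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def missing_twitter_keys(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of TwtrConvo.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


missing = missing_twitter_keys()
if missing:
    print("Missing Twitter API variables:", ", ".join(missing))
else:
    print("All Twitter API variables are set.")
```

If anything is reported as missing, export the variable in the shell that launches the notebook and restart the kernel.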

### Importing packages

For this Jupyter notebook we'll import the TwtrConvo package, the os library for path manipulation, and some plotly utilities for creating offline plots within the notebook.

In [1]:
import os
import TwtrConvo
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Now let's initialize the ticker for the company we'd like to analyze
ticker = 'TSLA'

## Tweets module (tweets.py)

The tweets module acts as a wrapper layer around tweepy, with "get_tweets" as its main function.  This function queries for up to "n" tweets, then by default filters out tweets tagged as retweets and organizes the remaining tweets into two lists: regular tweets and replies.  You can change the total number of tweets returned by the query with the "max_tweets" keyword argument.  Both lists contain dictionaries that match the JSON format of the Twitter API.  For more documentation on this format, visit the link below.

Twitter developer API docs: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

In [2]:
# Get the raw tweets and replies using the "tweets.get_tweets" method
tweet_list, reply_list = TwtrConvo.tweets.get_tweets(ticker, max_tweets=10)

# Let's take a look at one of the tweets that gets returned directly from Twitter
print(tweet_list[0])

{'created_at': 'Thu May 09 06:39:40 +0000 2019', 'id': 1126376168222072832, 'id_str': '1126376168222072832', 'full_text': "You can't make this up.\n\n$TSLA\nH/T: @lazygetter https://t.co/vaeYl13gV9", 'truncated': False, 'display_text_range': [0, 47], 'entities': {'hashtags': [], 'symbols': [{'text': 'TSLA', 'indices': [25, 30]}], 'user_mentions': [{'screen_name': 'lazygetter', 'name': 'Bubble Boy', 'id': 920833801, 'id_str': '920833801', 'indices': [36, 47]}], 'urls': [], 'media': [{'id': 1126376154062032897, 'id_str': '1126376154062032897', 'indices': [48, 71], 'media_url': 'http://pbs.twimg.com/media/D6GxJPJWkAEXb1V.png', 'media_url_https': 'https://pbs.twimg.com/media/D6GxJPJWkAEXb1V.png', 'url': 'https://t.co/vaeYl13gV9', 'display_url': 'pic.twitter.com/vaeYl13gV9', 'expanded_url': 'https://twitter.com/ravenvanderrave/status/1126376168222072832/photo/1', 'type': 'photo', 'sizes': {'medium': {'w': 592, 'h': 880, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'sma
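
The filtering described above can be pictured with plain dictionaries. This is a sketch only and may differ from what "get_tweets" does internally: it assumes retweets are recognised by the `retweeted_status` key and replies by a non-null `in_reply_to_status_id`, both of which are fields of the Twitter tweet object. `split_tweets` is a hypothetical name.

```python
def split_tweets(statuses, include_retweets=False):
    """Split raw tweet dictionaries into (tweets, replies).

    Hypothetical helper for illustration; not the package's own code.
    """
    tweets, replies = [], []
    for status in statuses:
        # Retweets carry the original tweet under "retweeted_status".
        if not include_retweets and "retweeted_status" in status:
            continue
        # Replies have a non-null "in_reply_to_status_id".
        if status.get("in_reply_to_status_id") is not None:
            replies.append(status)
        else:
            tweets.append(status)
    return tweets, replies


sample = [
    {"id": 1, "in_reply_to_status_id": None},
    {"id": 2, "in_reply_to_status_id": 1},
    {"id": 3, "in_reply_to_status_id": None, "retweeted_status": {"id": 1}},
]
tweets, replies = split_tweets(sample)
print([t["id"] for t in tweets], [r["id"] for r in replies])
```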

## TwtrConvo module (twtrconvo.py)

The twtrconvo module houses the main logic of the package including the methods to either build a new dataset or load an existing dataset.  The datasets are composed of:

    tweet_df: the "n" top tweets ranked by retweets, favorites, and net influence (followers - following)
    reply_df: the replies to the top tweets (includes the same fields as tweet_df, plus the id of the original tweet the reply corresponds to)
    user_df: user information corresponding to the top tweets
    
Let's step through each portion of the main functionality (which can be found within the "twtrconvo.main" method) and show each step in the analysis of the recent Twitter conversation surrounding the company.
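
The ranking used for tweet_df can be sketched in plain Python so that it runs without the package's dependencies. The sketch assumes that each metric is ranked in ascending order with ties sharing their mean rank, and that the three ranks are summed into a single score. The sample output later in this README is consistent with that reading, but the package's actual implementation may differ. `average_ranks` and `rank_tweets` are hypothetical names.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def rank_tweets(tweets):
    """Add net_influence and a combined rank to each tweet, then sort best-first.

    Note: the input dictionaries are updated in place.
    """
    for tweet in tweets:
        tweet["net_influence"] = tweet["followers"] - tweet["following"]
    totals = [0.0] * len(tweets)
    for field in ("retweets", "favorites", "net_influence"):
        for i, rank in enumerate(average_ranks([t[field] for t in tweets])):
            totals[i] += rank
    for tweet, total in zip(tweets, totals):
        tweet["rank"] = total
    return sorted(tweets, key=lambda t: t["rank"], reverse=True)


rows = [
    {"username": "LanceRoberts", "favorites": 62, "retweets": 14,
     "followers": 22514, "following": 6679},
    {"username": "Teslarati", "favorites": 266, "retweets": 31,
     "followers": 90709, "following": 71},
]
print([t["username"] for t in rank_tweets(rows)])
```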

### Loading datasets

Generally you would use the "build_dataset" method to get the tweet_df, reply_df, and user_df.  However, with the default set to 2500 tweets, calling build_dataset multiple times within an hour can quickly max out your hourly queries to the Twitter API (the hourly limit is 100 queries without paying for a premium account).  For this reason, you can also load previously built data and use it to conduct statistical analysis.

In [3]:
# Build new dataset
#tweet_df, reply_df, user_df = TwtrConvo.twtrconvo.build_dataset(ticker)

# Load existing dataset
tweet_df, reply_df, user_df = TwtrConvo.twtrconvo.load_dataset(
    os.path.join(os.getcwd(), 'datasets', ticker)
)
# Let's take a look at what fields the tweet_df contains
tweet_df.head()

Unnamed: 0,index,id,username,user_id,tweet,text,favorites,retweets,followers,following,polarity,subjectivity,net_influence,net_influencerank,retweetsrank,favoritesrank,rank
0,739,1126143473789407233,Teslarati,1308211178,Tesla $TSLA completes $2.7B funding round as B...,Tesla TSLA completes 2 7B funding round as BMW...,266,31,90709,71,0.15,0.45,90638,791.0,791.0,794.0,2376.0
1,357,1126230448101703680,GerberKawasaki,349249475,Have an appointment with the Audi Etron on Fri...,Have an appointment with the Audi Etron on Fri...,138,8,58945,4665,0.465341,0.513636,54280,786.5,776.5,791.0,2354.0
2,312,1126240263632969729,GerberKawasaki,349249475,You wonder why they attack Tesla relentlessly....,You wonder why they attack Tesla relentlessly ...,57,15,58945,4665,0.0,0.1,54280,786.5,785.0,775.5,2347.0
3,310,1126241351656398854,GerberKawasaki,349249475,It's as if my soul mate took over the @Tesla t...,It s as if my soul mate took over the twitter ...,91,7,58945,4665,0.0,0.0,54280,786.5,771.0,787.0,2344.5
4,600,1126172969506619393,LanceRoberts,49958733,"So, @realDonaldTrump showed tax losses which m...",So showed tax losses which makes him a poor bu...,62,14,22514,6679,-0.16,0.64,15835,770.0,784.0,779.0,2333.0
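
Comparing the `tweet` and `text` columns above suggests that `text` is a cleaned copy of the tweet with mentions and punctuation removed. The regex sketch below reproduces that effect for the rows shown. It is an assumption about the cleaning step, not the package's actual code, and `clean_tweet_text` is a hypothetical name. Note that this version also drops non-ASCII letters.

```python
import re


def clean_tweet_text(text):
    """Strip URLs, @mentions, and punctuation from a tweet (hypothetical helper)."""
    text = re.sub(r"https?://\S+", " ", text)   # drop links
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # punctuation and symbols become spaces
    return " ".join(text.split())               # collapse repeated whitespace


print(clean_tweet_text("So, @realDonaldTrump showed tax losses"))
# prints: So showed tax losses
```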


In [4]:
# Now let's take a look at what fields are contained within the user_df
user_df.head()

Unnamed: 0,username,favorites,followers,following,full_description,tweet_count,user_id,description,net_influence
0,4xRevenue,8279,876,460,The best PMs are analysts who have yet to mana...,6215,457483584,The best PMs are analysts who have yet to mana...,416
1,AlterViggo,13725,2195,246,"I like, you know, facts. And science. And bett...",4823,983380841484042241,I like you know facts And science And better cars,1949
2,Andreas_Hopf,4115,1486,15,"Designer in Berlin, BA/MA educator and design ...",707,937065664128475138,Designer in Berlin BA MA educator and design r...,1471
3,BagholderQuotes,18119,19990,676,"You don't loose if you don't sell. A satire, p...",20301,3424441204,You don t loose if you don t sell A satire par...,19314
4,BarkMSmeagol,15327,1759,337,Poking fun and documenting the TSLA fake bears...,5567,980554784133427200,Poking fun and documenting the TSLA fake bears...,1422


### Top 10 Tweets and their stats

Let's take a look at some of the top tweets.  This is a good starting point to get a feel for any recent news about the company and the sentiment surrounding it.

In [5]:
for i in range(10):
    tweet = tweet_df.iloc[i]
    print(
        tweet['username'],
        '({} Favorites, {} Retweets, Net Influence {}, Polarity {:.3f}, Subjectivity {:.3f}): \n'.format(
            tweet['favorites'], tweet['retweets'], tweet['net_influence'], tweet['polarity'],
            tweet['subjectivity']),
        tweet['tweet'], '\n')

Teslarati (266 Favorites, 31 Retweets, Net Influence 90638, Polarity 0.150, Subjectivity 0.450): 
 Tesla $TSLA completes $2.7B funding round as BMW pledges more EV competition, including a plan for 25 electric and electrified vehicles by 2025
https://t.co/vtOmor8zuP 

GerberKawasaki (138 Favorites, 8 Retweets, Net Influence 54280, Polarity 0.465, Subjectivity 0.514): 
 Have an appointment with the Audi Etron on Friday morning. Will be giving the full review after. Excited to drive this new EV and compare it to the best. $TSLA #Tesla 

GerberKawasaki (57 Favorites, 15 Retweets, Net Influence 54280, Polarity 0.000, Subjectivity 0.100): 
 You wonder why they attack Tesla relentlessly. #Big #Oil #Money $TSLA https://t.co/CBjy5XrpLQ 

GerberKawasaki (91 Favorites, 7 Retweets, Net Influence 54280, Polarity 0.000, Subjectivity 0.000): 
 It's as if my soul mate took over the @Tesla twitter. $TSLA 

LanceRoberts (62 Favorites, 14 Retweets, Net Influence 15835, Polarity -0.160, Subjectivity 0.64

### User Data

Next, let's take a look at some data on the types of Twitter users that are tweeting about the company with the most influence.  By creating a visualization that displays the relationship between the number of times a term is mentioned in user profile descriptions and the average net influence of those users, we can get an idea of the type of profiles participating in the conversation.

In [6]:
user_blob = TwtrConvo.twtrconvo.get_blob(ticker, user_df, header='description')
user_word_count = TwtrConvo.twtrconvo.get_word_count(user_blob)
user_word_count = TwtrConvo.twtrconvo.add_user_data(user_word_count, user_df)

fig = TwtrConvo.plots.create_user_description_scatter(user_word_count)

iplot(fig)
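
To make the data behind this scatter plot concrete, the sketch below pairs each description term with the number of users who mention it and the mean net influence of those users. It is a hypothetical stand-in for what "get_word_count" and "add_user_data" produce together. Counting a term once per user is an assumption made here, and `term_influence` is not part of TwtrConvo.

```python
from collections import defaultdict


def term_influence(users):
    """Map each description term to (user_count, mean_net_influence)."""
    totals = defaultdict(lambda: [0, 0])
    for user in users:
        # Count each term once per user so a repeated word is not over-weighted.
        for term in set(user["description"].lower().split()):
            totals[term][0] += 1
            totals[term][1] += user["net_influence"]
    return {term: (count, total / count) for term, (count, total) in totals.items()}


users = [
    {"description": "Tesla investor", "net_influence": 100},
    {"description": "tesla fan", "net_influence": 300},
]
print(term_influence(users)["tesla"])
# prints: (2, 200.0)
```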

Now let's get an idea of the general influence of most of our top tweeters by looking at a distribution of net influence of our users.

In [7]:
fig = TwtrConvo.plots.create_distplot(user_df)
iplot(fig)

## Tweet Sentiment Analysis

Now that we've got our data organized and an idea of the type of users contributing to the data set, we can dive into the tweets themselves.

### Word Frequency

The first thing we'd like to look at is word frequency within the top ranked tweets and their replies.  This could identify patterns and point out keywords that will affect the current social sentiment that we observe.  We'll use TextBlob and our functions "get_blob" and "get_word_count" to do this, then display the word count data using plotly pie charts.

In [8]:
# Get text blobs and word count
tweet_blob = TwtrConvo.twtrconvo.get_blob(ticker, tweet_df)
tweet_word_count = TwtrConvo.twtrconvo.get_word_count(tweet_blob)
reply_blob = TwtrConvo.twtrconvo.get_blob(ticker, reply_df)
reply_word_count = TwtrConvo.twtrconvo.get_word_count(reply_blob)

# The pie chart will default to the top ten words for each word count unless n is
# specified to be different
fig = TwtrConvo.plots.create_pie_chart(tweet_word_count, reply_word_count)

iplot(fig)

### Bigrams and Trigrams

Looking over word frequency can give us some great insight, but it may lack context.  To add some contextual analysis let's create the same visualization with the most frequently used bigrams and trigrams (these are groups of 2 and 3 words used together).  We'll do this by using the "n" keyword argument of the "get_word_count" function.

In [9]:
tweet_bigram_count = TwtrConvo.twtrconvo.get_word_count(tweet_blob, n=2)
tweet_trigram_count = TwtrConvo.twtrconvo.get_word_count(tweet_blob, n=3)

fig = TwtrConvo.plots.create_pie_chart(
    tweet_bigram_count, tweet_trigram_count, name_1='bigrams',
    name_2='trigrams')
iplot(fig)
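
For readers unfamiliar with n-grams, the counting idea can be sketched with only the standard library. This is not how "get_word_count" is implemented (it works on TextBlob objects). `ngram_counts` is a hypothetical helper, and splitting on whitespace is an assumption that stands in for TextBlob's tokenizer.

```python
from collections import Counter


def ngram_counts(text, n=1):
    """Count runs of n consecutive words in text (n=1 gives plain word counts)."""
    words = text.lower().split()
    # Slide a window of n words across the text and join each window into one key.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


text = "tesla stock up tesla stock down"
print(ngram_counts(text, n=2).most_common(1))
# prints: [('tesla stock', 2)]
```

Passing `n=3` counts trigrams in the same way, and a text shorter than `n` words yields an empty counter.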

### Sentiment Gauge

Now that we've got a general idea of frequently used words and terms, let's take a look at the overall sentiment of all the tweets using TextBlob.  We'll start by grouping all the text together as a single text blob, then calculate the polarity and subjectivity of that large string.  Then we'll display the measured polarity and subjectivity as a sentiment gauge using the "create_sentiment_gauge" function of the TwtrConvo.plots module.

    - polarity: The polarity score is within the range [-1.0, 1.0] and represents positive or negative sentiment.
    - subjectivity: The subjectivity is within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

In [10]:
fig = TwtrConvo.plots.create_sentiment_gauge(tweet_blob.sentiment.polarity,
                                             tweet_blob.sentiment.subjectivity)
iplot(fig)

### Tweet Metadata

Let's take a look at some of the metadata related to the tweets so that we can get a revised sentiment gauge that weights the higher ranked tweets more heavily.  First, we'll dive into some of the metadata, then we'll create weights and regenerate the gauge.

#### Retweet and Favorite Distributions (Using Boxplots)

In [11]:
fig = TwtrConvo.plots.create_boxplot(tweet_df)
iplot(fig)

#### Polarity and Subjectivity Distributions (Using Violin Plots)

In [12]:
fig = TwtrConvo.plots.create_violin_plot(
    tweet_df, ['polarity', 'subjectivity'], 'Sentiment Violin Plot')
iplot(fig)

#### Correlation Between these Distributions (Using 2D Contour Plots)

In [13]:
fig = TwtrConvo.plots.create_contour(tweet_df)
iplot(fig)

In [14]:
fig = TwtrConvo.plots.create_contour(tweet_df, title='Favorites vs. Polarity', xaxes=['favorites'],
                                     colors=['Electric'])
iplot(fig)

In [15]:
fig = TwtrConvo.plots.create_contour(tweet_df, title='Net Influence vs. Polarity', xaxes=['net_influence'],
                                     colors=['Greens'])
iplot(fig)

#### Weight the sentiment and regenerate the gauge with our weighted values

Finally, we'll weight the polarity and subjectivity based on our metadata of retweets, favorites, and net_influence using the "get_weighted_sentiment" function in the twtrconvo module.  For the sake of the example we'll use the default, in which each is weighted equally.  However, by using the "weights" keyword you could weight each parameter differently based on previous analysis results.

In [16]:
weighted_sentiment = TwtrConvo.twtrconvo.get_weighted_sentiment(tweet_df)

fig = TwtrConvo.plots.create_sentiment_gauge(
        weighted_sentiment[0], weighted_sentiment[1])
iplot(fig)
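
One way to picture the weighting is a weighted average in which each tweet's weight is its share of the total retweets, favorites, and net influence. The sketch below follows that reading under stated assumptions and is not the implementation of "get_weighted_sentiment": the normalisation, the clamping of negative net influence to zero, and the name `weighted_sentiment_sketch` are all choices made here for illustration.

```python
def weighted_sentiment_sketch(tweets, weights=(1.0, 1.0, 1.0)):
    """Return (polarity, subjectivity) averaged with per-tweet weights."""
    fields = ("retweets", "favorites", "net_influence")
    tweet_weights = [0.0] * len(tweets)
    for field, field_weight in zip(fields, weights):
        # Clamp at zero: net_influence is negative when following > followers.
        values = [max(t[field], 0) for t in tweets]
        total = sum(values)
        if total == 0:
            continue  # this metric carries no information for the sample
        for i, value in enumerate(values):
            # Each metric contributes the tweet's share of its total, scaled by its weight.
            tweet_weights[i] += field_weight * value / total
    weight_sum = sum(tweet_weights)
    if weight_sum == 0:
        return 0.0, 0.0
    polarity = sum(w * t["polarity"] for w, t in zip(tweet_weights, tweets)) / weight_sum
    subjectivity = sum(w * t["subjectivity"] for w, t in zip(tweet_weights, tweets)) / weight_sum
    return polarity, subjectivity


sample = [
    {"retweets": 3, "favorites": 3, "net_influence": 3, "polarity": 1.0, "subjectivity": 0.5},
    {"retweets": 1, "favorites": 1, "net_influence": 1, "polarity": -1.0, "subjectivity": 0.5},
]
print(weighted_sentiment_sketch(sample))
```

With these sample values the more widely shared positive tweet pulls the polarity above zero, whereas an unweighted mean of the two polarities would be exactly zero.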

## Top 5 Future Upgrades

1. Backend database to store data so that time series sentiment analysis can be added
2. Pair with financial data and build Neural Network for predicting price movement based on sentiment
3. Create financial and sentiment comparisons between multiple companies
4. Create a web application by embedding plotly plots into dash dashboard with a drop down for supported companies
5. Set up unit testing and organize dependencies so the library can be added to PyPI and pip installed