# Natural Language Processing Project: Obama Tweets

## Intro
Following up on the "Project Intro Data Extraction" notebook, this notebook focuses on the sentiment analysis of the Obama tweets that I extracted and stored in the MySQL database.

**Hindsight**: Tweets naturally reflect trending news. The analysis that follows is based on tweets that were extracted on **25 May 2020**. A key Twitter topic that day was the amount of time that Presidents Obama and Trump each spent *golfing* during their tenure, and this will be apparent in the outputs below.

![](obama_speaking.JPG)

## Imports
Importing the necessary libraries and modules. These include, among others, sqlalchemy for **retrieving the data from MySQL**, nltk modules and TextBlob for **NLP processing**, and various **viz libraries** (e.g. seaborn, plotly, wordcloud) for different sorts of graphs.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
from string import punctuation
from textblob import TextBlob
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize

from sqlalchemy import create_engine

#set no limit for string printing (None replaces the deprecated -1)
pd.set_option('display.max_colwidth', None)

import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.graph_objects as go

## Fetching the data from the database
Now it is time to fetch and explore the data:

In [None]:
#Create Engine for SQL fetching
engine = create_engine('mysql://USERNAME:PASSWORD@HOSTNAME/TwitterObamaDB?charset=utf8', echo = True)

In [None]:
#writing data to dataframe
obama_df = pd.read_sql_query('select * from obamadata', con= engine)
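
If the MySQL server is not at hand, the same fetch pattern can be tried against an in-memory SQLite database. This is only a sketch: the table contents and the two columns below are made up for illustration and are not the real `obamadata` schema.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the MySQL table: an in-memory SQLite database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE obamadata (Text TEXT, State TEXT)')
conn.executemany(
    'INSERT INTO obamadata VALUES (?, ?)',
    [('obama speech today', 'Texas'), ('', 'Florida'), ('golfing again', 'Texas')],
)
conn.commit()

# Same pandas call as in the notebook, only the connection object differs
sample_df = pd.read_sql_query('select * from obamadata', con=conn)
conn.close()
print(sample_df.shape)
```

The rest of the pipeline only needs a dataframe with a `Text` column, so a stand-in like this can be handy for testing the cleaning steps offline.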


In [None]:
#checking the end of the dataframe
obama_df.tail()

By looking at the above extract, we observe a few things:
- **User location**: as mentioned in the *'Project Intro' notebook*, users have manually specified their location with different levels of granularity, e.g. City ('Chicago'), City - State ('Dunnellon, FL'), City - State - Country ('Wynne, Arkansas, USA'). In the extraction phase, I standardized this for consistency (columns GeoName, Country, State).
- **Mentions**: There are user mentions (@...) which need to be removed before the analysis.

Let's have a look at the dataframe composition:

In [None]:
obama_df.info()

From the above, we see that some users have not specified a 'User description', but that is not an issue for our analysis. It also looks like some tweets have no text (blank), so I will make sure to filter them out. Otherwise, the data is consistent.

Let's print the *tweet text* of the first tweets (**reminder**: punctuation, usernames, hashtags, etc., have been cleaned already at the stage of extraction).

In [None]:
#Lets look at some of the tweets
print(obama_df['Text'].head(6))

## Processing

Let's start by cleaning the 'twitter mentions':

In [None]:
#removing the Twitter mentions (@username)
obama_df['Text'] = obama_df['Text'].str.replace(r'@\S+', '', regex=True)

In [None]:
print(obama_df['Text'].head(6))
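
To see what the mention-stripping pattern does in isolation, here is a minimal sketch using the standard `re` module. The sample tweets and usernames are made up for illustration.

```python
import re

# Made-up tweets, for illustration only
samples = [
    '@user1 obama played golf',
    'thanks @some_user for sharing',
    'no mentions here',
]

# '@' followed by one or more non-whitespace characters
mention_pattern = re.compile(r'@\S+')

cleaned = [mention_pattern.sub('', s).strip() for s in samples]
print(cleaned)
```

Removing a mention in the middle of a tweet leaves a double space behind. That is harmless here, because the later steps split the text on whitespace.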

Now, I will drop the tweets with no text in them. To do that, I first need to convert the empty strings to NaN values, and then execute the drop.

In [None]:
# we convert empty strings to NaN, so we can drop them
obama_df['Text'] = obama_df['Text'].replace('', np.nan)

In [None]:
# we drop rows where there is no text!
obama_df.dropna(subset=['Text'], inplace=True)

In [None]:
obama_df.info()
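
The convert-then-drop step can be checked on a toy dataframe. The sketch below additionally strips surrounding whitespace first, on the assumption that whitespace-only rows (which can appear once mentions are removed) should also count as empty. The sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows: one empty string and one whitespace-only string
demo = pd.DataFrame({'Text': ['obama speech', '', '   ', 'golf']})

# Strip surrounding whitespace, turn empty strings into NaN, then drop them
demo['Text'] = demo['Text'].str.strip().replace('', np.nan)
demo = demo.dropna(subset=['Text']).reset_index(drop=True)
print(demo)
```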

For an effective NLP analysis, we need to execute a list of actions:
- **Remove stopwords** (e.g. 'i', 'a', 'are', 'on', 'from')
- **Tokenization**: split tweets into smaller 'tokens' (the words)
- **Lemmatization**: convert (as much as possible) these tokens to their 'canonical form', as per the <a href="https://en.wikipedia.org/wiki/Lemma_(morphology)" target="_blank">definition of 'Lemma'</a>

This is done with the following code:

### Removing Stopwords

In [None]:
# Firstly, we import the english stopwords
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

In [None]:
# Then, for each tweet in the dataframe, we split it in words, remove stopwords,
# and re-join it as a text string.
obama_df['Text'] = obama_df['Text'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))
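
A tiny sketch of the stopword filter on a single made-up sentence. The mini stopword set is hypothetical (the notebook uses NLTK's English list), and the lookup lower-cases each word because NLTK's list is all lower-case.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
demo_stop_words = {'i', 'a', 'are', 'on', 'from', 'the'}

def remove_stopwords(text, stop_words):
    """Drop stopwords from a whitespace-separated string, ignoring case."""
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords('I saw The president on a golf course', demo_stop_words))
```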


### Tokenization & Lemmatization

In [None]:
obama_df.head()

WordNetLemmatizer takes an optional **POS tag** (Part-Of-Speech tag) to tell whether a word is e.g. a verb, noun, adjective or adverb, and processes it accordingly. Without a tag, it treats every word as a noun.

The below function **get_wordnet_pos** helps with that:

First, it takes a word as input and assigns it a tag using the **nltk.pos_tag** tagger. The tagger follows the tagging conventions of the <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" target="_blank">'Penn Treebank Project'</a>: in that tag set, *adjective* tags start with J, *noun* tags with N, *verb* tags with V, and *adverb* tags with R.

Then, the function maps the treebank tags to the wordnet corpus, a large lexical database in English - and returns a value as input for the Lemmatizer.

In [None]:
#TOKENIZING AND LEMMATIZING

# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map the Penn Treebank tag of a single word to a WordNet POS constant."""
    # Tag the word on its own and keep only the first letter of the tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    # Fall back to noun for any other tag
    return tag_dict.get(tag, wordnet.NOUN)

#initialize tokenizer and lemmatizer
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

# apply tokenizer and lemmatizer to each tweet
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in w_tokenizer.tokenize(text)]
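
The tag mapping itself can be tried without NLTK or its data files. The sketch below uses plain one-letter strings, which are the values of WordNet's POS constants (`'a'`, `'n'`, `'v'`, `'r'`). The helper name `treebank_to_wordnet` is made up for this example.

```python
# WordNet's POS constants are single letters: 'a' (adjective), 'n' (noun),
# 'v' (verb), 'r' (adverb). Plain strings avoid the NLTK dependency here.
TAG_PREFIX_TO_WORDNET = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1].upper(), 'n')

for tag in ['JJ', 'NNS', 'VBG', 'RB', 'IN']:
    print(tag, '->', treebank_to_wordnet(tag))
```

Tags outside the four groups (such as `IN`, a preposition) fall back to noun.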

In [None]:
# Lemmatize each tweet (the 'Text' column now holds a list of tokens) and check the result
obama_df['Text'] = obama_df['Text'].apply(lemmatize_text)
obama_df.head()

Looking at the lemmatized results, we see an improvement towards obtaining the canonical form - e.g. in the first tweet, the word 'rising' has now been transformed to 'rise'. 

On top of that, I also evaluated a *'stemmer'* (PorterStemmer) to simplify the words, but the results after processing were not satisfactory. Hence, I will proceed with the Lemmatizer as per above.

## Data Exploration

The first thing worth looking into is the frequency of words. **Which words are repeated most often** in Barack Obama-related tweets? For this, I will plot a wordcloud, and also list the word frequencies in descending order.

In [None]:
# get individual words
words = []
for word_list in obama_df['Text']: 
    words.extend(word_list)

# create a word frequency dictionary
wordfreq = Counter(words)

#WORD CLOUD plot

plt.subplots(figsize = (12,10))

wordcloud = WordCloud(
    background_color = 'white',
    width = 1000,
    height = 800).generate_from_frequencies(wordfreq)

plt.imshow(wordcloud)
plt.axis('off')
plt.show() 


In [None]:
# print word counts of 10 top words, in descending order
word_df = pd.DataFrame(words,columns =['names'])
word_df['names'].value_counts().nlargest(10)

Unsurprisingly, 'obama', 'trump', 'president' are the most common words. As mentioned in the intro, 'golf' and 'golfing' have a high frequency as well, due to the topic that emerged on that day.
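
As a side note, the top-N listing can also be read straight from a `Counter`, without building an intermediate dataframe. A minimal sketch on made-up token lists shaped like the ones in the 'Text' column:

```python
from collections import Counter

# Made-up tokenized tweets, shaped like the lists in the 'Text' column
token_lists = [
    ['obama', 'golf', 'trump'],
    ['obama', 'president'],
    ['trump', 'golf', 'obama'],
]

# Flatten the lists and count in one pass
demo_freq = Counter(word for tokens in token_lists for word in tokens)

# most_common(n) returns (word, count) pairs, highest count first
print(demo_freq.most_common(2))
```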


## Calculating Sentiment

Now it is time to **calculate the sentiment** of our tweets. The TextBlob library allows me to extract the polarity & subjectivity of tweet texts: **Polarity** is a float in the range [-1, 1], where -1 means negative, 0 means neutral, and 1 means positive. **Subjectivity** is a float in the range [0, 1], where 0 is very objective and 1 is very subjective.

Based on this value, I create extra columns in the dataframe, marking each tweet as 'Negative', 'Neutral', 'Positive', and subsequently, I calculate the % of negative/ neutral/ positive tweets.
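
The labelling rule described above can be sketched as a small function. The name `label_sentiment` and the sample scores are made up for illustration.

```python
def label_sentiment(polarity):
    """Return 'Negative', 'Neutral' or 'Positive' from a polarity in [-1, 1]."""
    if polarity < 0:
        return 'Negative'
    if polarity > 0:
        return 'Positive'
    return 'Neutral'

# Made-up polarity scores, for illustration only
for score in (-0.4, 0.0, 0.25):
    print(score, label_sentiment(score))
```

With a strict zero threshold, even a marginal score such as 0.01 counts as positive. A small neutral band around zero would be a possible alternative, but that is a design choice and not what this notebook does.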

In [None]:
# create a new column (cleaned tweet), joining the words that resulted after the preprocessing, per dataframe row:
obama_df['Clean Text'] = obama_df['Text'].apply(lambda x: ' '.join(map(str, x)))

# create new dataframe columns for polarity and subjectivity
obama_df['Polarity'] = np.nan
obama_df['Subjectivity'] = np.nan
obama_df['Sentiment'] = np.nan

In [None]:
#reset indexing of the dataframe
obama_df.reset_index(drop=True, inplace=True)

# Compute 'Polarity' & 'Subjectivity' for each cleaned tweet
# (column-wise assignment avoids chained indexing such as df[col].iloc[i] = ...)
sentiments = obama_df['Clean Text'].apply(lambda text: TextBlob(text).sentiment)
obama_df['Polarity'] = sentiments.apply(lambda s: s.polarity)
obama_df['Subjectivity'] = sentiments.apply(lambda s: s.subjectivity)

# Label each tweet by the sign of its polarity
obama_df['Sentiment'] = 'Neutral'
obama_df.loc[obama_df['Polarity'] < 0, 'Sentiment'] = 'Negative'
obama_df.loc[obama_df['Polarity'] > 0, 'Sentiment'] = 'Positive'


# Show the new dataframe with columns 'Subjectivity' & 'Polarity'
obama_df.tail()

In [None]:
# counting tweets per sentiment
obama_df['Sentiment'].value_counts()
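
The same call can return shares instead of counts via `normalize=True`. A minimal sketch on made-up labels:

```python
import pandas as pd

# Made-up labels, for illustration only
labels = pd.Series(['Positive', 'Negative', 'Positive', 'Neutral'], name='Sentiment')

# normalize=True returns fractions instead of counts
shares = labels.value_counts(normalize=True).mul(100).round(1)
print(shares)
```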

In [None]:
# lets plot this as well
groupped_sentiment = obama_df.groupby(['Sentiment'])['Text'].count().reset_index()
#groupped_sentiment.head()
ax =sns.barplot(x='Sentiment',y='Text',data=groupped_sentiment)

In [None]:
print('Percentage of positive tweets: {0:.1f}%'.format(100*len(obama_df[obama_df['Sentiment']=='Positive'])/len(obama_df)))
print('Percentage of negative tweets: {0:.1f}%'.format(100*len(obama_df[obama_df['Sentiment']=='Negative'])/len(obama_df)))
print('Percentage of neutral tweets: {0:.1f}%'.format(100*len(obama_df[obama_df['Sentiment']=='Neutral'])/len(obama_df)))

From the above, we see that the tweets are quite opinionated, as *more than 73% of them are either positive or negative*, with positive tweets slightly outnumbering negative ones. Below we also check the mean Polarity and Subjectivity: on average, the tweets do not seem strongly polarized.

In [None]:
# mean of polarity and subjectivity is low.
obama_df[['Polarity','Subjectivity']].describe()

## Negative tweets
We can dive deeper into the negative tweets. Let's have a look at the most frequent words here.

In [None]:
#Lets print WORDCLOUD OF ONLY NEGATIVE TWEETS

# get individual words from the negative tweets only
words2 = []
for word_list in obama_df.loc[obama_df['Sentiment'] == 'Negative', 'Text']:
    words2.extend(word_list)
        
        
# create a word frequency dictionary
wordfreq = Counter(words2)

#WORD CLOUD plot

plt.subplots(figsize = (12,10))

wordcloud = WordCloud(
    background_color = 'white',
    width = 1000,
    height = 800).generate_from_frequencies(wordfreq)

plt.imshow(wordcloud)
plt.axis('off')
plt.show() 

In [None]:
# print word counts of 10 most repeated words, in descending order
word_df2 = pd.DataFrame(words2,columns =['names'])
word_df2['names'].value_counts().nlargest(10)

No surprises here, with 'obama', 'golfing', 'trump' appearing in the top words. However, since we are looking only at the negative tweets now, we can also see that other words with negative connotation come up: 'outrage', 'spent'.
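
The filter-and-count step behind the word cloud can be wrapped in a small helper, so that the same code serves any sentiment label. This is a sketch on made-up data, and the helper name `word_counts_for` is hypothetical.

```python
from collections import Counter

import pandas as pd

def word_counts_for(df, label):
    """Count words across the token lists of all tweets with the given label."""
    counts = Counter()
    for tokens in df.loc[df['Sentiment'] == label, 'Text']:
        counts.update(tokens)
    return counts

# Made-up data, shaped like obama_df after lemmatization
demo_df = pd.DataFrame({
    'Text': [['obama', 'golf'], ['great', 'speech'], ['golf', 'outrage']],
    'Sentiment': ['Negative', 'Positive', 'Negative'],
})
print(word_counts_for(demo_df, 'Negative').most_common(1))
```

The returned `Counter` can be passed directly to `WordCloud.generate_from_frequencies`, as in the cells above.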

## Positive tweets
We can also have a look at the positive tweets. Below is the corresponding wordcloud:

In [None]:
#Lets print WORDCLOUD OF ONLY POSITIVE TWEETS

# get individual words from the positive tweets only
words3 = []
for word_list in obama_df.loc[obama_df['Sentiment'] == 'Positive', 'Text']:
    words3.extend(word_list)
        
        
# create a word frequency dictionary
wordfreq = Counter(words3)

#WORD CLOUD plot

plt.subplots(figsize = (12,10))

wordcloud = WordCloud(
    background_color = 'white',
    width = 1000,
    height = 800).generate_from_frequencies(wordfreq)

plt.imshow(wordcloud)
plt.axis('off')
plt.show() 

## Analysis by US state
It is very interesting to check where the negative sentiment is coming from. For this, we will group the dataframe by 'State', and aggregate at 'Sentiment' level:

In [None]:
negative_groupped = obama_df[obama_df['Sentiment']=='Negative'].groupby(['State']).agg({'Sentiment':'count'}).reset_index()
negative_groupped = negative_groupped.sort_values(by=['Sentiment'],ascending=[False])
negative_groupped.reset_index(drop=True, inplace=True)

negative_groupped.head(10)

We will do the same for positive tweets:

In [None]:
positive_groupped = obama_df[obama_df['Sentiment']=='Positive'].groupby(['State']).agg({'Sentiment':'count'}).reset_index()
positive_groupped = positive_groupped.sort_values(by=['Sentiment'],ascending=[False])
positive_groupped.reset_index(drop=True, inplace=True)

positive_groupped.head(10)

We see that *Texas, Florida, California, and New York* are the top four sources of *both positive and negative tweets* (in different orders). This might simply indicate that these states have the most active Twitter users. Of course, we need to keep in mind that our sample of tweets was collected over a single day, so it might not be representative.
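
Since raw counts largely track how many users tweet from each state, one possible complement is the *share* of each sentiment within a state. The sketch below uses `pd.crosstab` on made-up data. It is not part of the original analysis.

```python
import pandas as pd

# Made-up data, for illustration only
demo_df = pd.DataFrame({
    'State': ['Texas', 'Texas', 'Texas', 'Florida', 'Florida', 'Ohio'],
    'Sentiment': ['Positive', 'Negative', 'Positive', 'Negative', 'Negative', 'Neutral'],
})

# Rows are states, columns are sentiment labels; each row sums to 1
state_shares = pd.crosstab(demo_df['State'], demo_df['Sentiment'], normalize='index')
print(state_shares.round(2))
```

Shares for states with only a handful of tweets are noisy, so a minimum tweet count per state would probably be needed before reading much into them.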

Lastly, it would be great to visualize the above on a US map. Below is the visualization of positive tweets for Barack Obama, colour-coded on a blue scale.

In [None]:
#import state abbreviations (the script defines us_state_dict)
%run US_state_dictionary.py

positive_groupped['Abbr'] = positive_groupped['State'].map(us_state_dict)
positive_groupped.head()
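
A small sketch of the abbreviation lookup, assuming the dictionary maps full state names to two-letter codes. `demo_state_dict` is a hypothetical three-entry subset (the notebook loads the full `us_state_dict` from `US_state_dictionary.py`), and the counts are made up.

```python
import pandas as pd

# Hypothetical subset; the notebook loads the full us_state_dict from a script
demo_state_dict = {'Texas': 'TX', 'Florida': 'FL', 'California': 'CA'}

demo_counts = pd.DataFrame({
    'State': ['Texas', 'Florida', 'Unknown'],
    'Sentiment': [12, 9, 3],
})

# map() leaves NaN where a state name has no entry in the dictionary
demo_counts['Abbr'] = demo_counts['State'].map(demo_state_dict)

# Drop unmapped rows so only valid state codes reach the map
demo_counts = demo_counts.dropna(subset=['Abbr'])
print(demo_counts)
```

Checking for unmapped rows before plotting helps catch location names that were not standardized to a US state.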

In [None]:
fig = go.Figure(data=go.Choropleth(
    locations= positive_groupped['Abbr'], # Spatial coordinates
    z = positive_groupped['Sentiment'].astype(int), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Blues',
    colorbar_title = "Count of positive tweets",
))

fig.update_layout(
    title_text = 'Obama Twitter Sentiment - 25 May 2020',
    geo_scope='usa', # limit map scope to USA
)

fig.show()

![](obama_graph.JPG)

![obama_figure.jpg](attachment:obama_figure.jpg)