# Data Analysis

At this stage of the project, we have a fully working POC of an application that processes streaming data in real time and uses machine learning to make online predictions. The results of the predictions are currently stored in a MySQL database. 

In particular, this application reads a stream of tweets related to Covid-19, pre-processes them and uses state-of-the-art NLP models to predict their sentiment and topic. Finally, these results are post-processed and written into a relational database in AWS. 

In this dashboard, we analyze our results.

In [51]:
########### Database Connection ##################

import numpy as np
import pandas as pd
import pymysql

ENDPOINT = "database-kaka.c8wdpocz3thc.us-east-1.rds.amazonaws.com"
PASSWORD = "Bf2TiD4M4aOpbglEd9lM"
DBNAME = "databasekafka"
USR = "admin"
PORT = 3306

# Open the connection, load the whole table into a DataFrame, then close it
connection = pymysql.connect(host=ENDPOINT, user=USR, password=PASSWORD, port=PORT, db=DBNAME)
query = 'SELECT * FROM covid_tweets'
result = pd.read_sql(query, connection)
connection.close()  # close the connection

display(f"There are {len(result)} classified tweets in the database")
result[["tweet", "date_creation", "sentiment_prediction", "topic_prediction"]].head()

'There are 1967 classified tweets in the database'

Unnamed: 0,tweet,date_creation,sentiment_prediction,topic_prediction
0,RT @CanadaDev: True or false: #CO2Emissions ha...,2021-02-05 15:13:33,NEGATIVE,work from home
1,RT @nprpolitics: The Senate approved a budget ...,2021-02-05 15:13:33,NEGATIVE,politics
2,@JustinTrudeau @DLeBlancNB Is @JustinTrudeau ...,2021-02-05 15:13:33,NEGATIVE,politics
3,Shrewsbury Severn Bridges 10k Road Race will t...,2021-02-05 15:13:33,POSITIVE,work from home
4,RT @SpillerOfTea: Me taking a post-Covid trip ...,2021-02-05 15:13:33,NEGATIVE,health
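
The connection cell keeps the endpoint and password inline. As a hedged alternative, the settings could be read from environment variables instead. The sketch below is only an illustration: the helper `load_db_settings` and the `TWEETS_DB_*` variable names are hypothetical and not part of the original project.

```python
import os


def load_db_settings(env=os.environ):
    """Collect connection settings from environment variables.

    The TWEETS_DB_* names are hypothetical; pick whatever your deployment uses.
    """
    return {
        "host": env["TWEETS_DB_HOST"],              # required
        "user": env.get("TWEETS_DB_USER", "admin"),  # optional, defaults to "admin"
        "password": env["TWEETS_DB_PASSWORD"],      # required
        "port": int(env.get("TWEETS_DB_PORT", "3306")),
        "database": env["TWEETS_DB_NAME"],          # required
    }


# Usage, assuming the variables are set and pymysql is installed:
# connection = pymysql.connect(**load_db_settings())
```

A missing required variable raises a `KeyError` immediately, which is usually easier to diagnose than a failed login.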


## Sentiment Analysis

We expect many more negative predictions than positive predictions:

In [2]:
result.value_counts('sentiment_prediction')

sentiment_prediction
NEGATIVE    1173
POSITIVE     414
dtype: int64

In [3]:
%matplotlib widget
import matplotlib.pyplot as plt  
import ipympl
import seaborn as sns

plt.rcParams['axes.edgecolor']='#333F4B'
plt.rcParams['axes.linewidth']=0.8
plt.rcParams['xtick.color']='#333F4B'
plt.rcParams['ytick.color']='#333F4B'

plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['font.sans-serif'] = 'DejaVu Sans'

plt.rcParams.update({'axes.spines.top': False, 'axes.spines.right': False})
sns.set_palette("Set2")

In [52]:
result["second"] = result.date_creation.dt.second
result.groupby(["second", "sentiment_prediction"]).agg({"sentiment_score": np.sum})

second,sentiment_prediction,sentiment_score
0,NEGATIVE,24.0
0,POSITIVE,8.0
1,NEGATIVE,27.0
1,POSITIVE,17.0
2,NEGATIVE,18.0
...,...,...
57,POSITIVE,4.0
58,NEGATIVE,16.0
58,POSITIVE,5.0
59,NEGATIVE,15.0


And indeed there are. In the sentiment counts above, 1173 tweets are classified as negative and only 414 as positive. Let's check the word clouds for both batches of predictions.
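
To put the imbalance in relative terms, the counts printed by `value_counts` can be turned into shares. This is a minimal sketch that copies the two counts from the output above rather than querying the database. On the real DataFrame, `result['sentiment_prediction'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Counts copied from the value_counts output shown earlier in this notebook.
counts = pd.Series({"NEGATIVE": 1173, "POSITIVE": 414}, name="tweets")

# Share of each sentiment among the counted tweets.
shares = (counts / counts.sum()).round(3)
print(shares)
```

Roughly three out of four of the counted tweets are negative.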

In [26]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

def word_cloud_sentiment(sentiment):

    text = " ".join(tweet for tweet in result.query(f'sentiment_prediction=="{sentiment}"')["tweet"])

    stopwords = set(STOPWORDS)
    stopwords.update(["RT", "https", "t", "COVID", "coronavirus", "Covid-19", "pandemic", "co",
                     "s"])
    print(f"{sentiment} tweets: Word cloud \n")
    wordcloud = WordCloud(stopwords=stopwords, background_color="white",
                         width=2000, height= 1200).generate(text)
    plt.figure(figsize=(20,10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

In [17]:
result.value_counts('topic_prediction')

topic_prediction
health            568
politics          472
work from home    206
vaccines          181
the economy        93
education          67
dtype: int64

Here are the word clouds for each of the topics:

In [18]:
def create_word_cloud(topic):
    text = " ".join(tweet for tweet in result.query(f'topic_prediction=="{topic}"')["tweet"])

    stopwords = set(STOPWORDS)
    stopwords.update(["RT", "https", "t", "COVID", "coronavirus", "Covid-19", "pandemic", "co",
                 "s", "sound", "starting", "lot", "u"])
    wordcloud = WordCloud(stopwords=stopwords, background_color="white",
                     width=2000, height= 1200).generate(text)
    plt.figure(figsize=(20,10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
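
A word cloud sizes each word by how often it appears once the stop words are removed. As a rough cross-check of what a cloud shows, the counting step can be sketched with the standard library alone. The helper `top_words` and the sample tweets are made up for illustration, and `WordCloud` applies its own tokenization rules, so its counts may differ.

```python
import re
from collections import Counter


def top_words(tweets, stopwords, n=5):
    """Return the n most frequent words across tweets, ignoring stopwords (case-insensitive)."""
    blocked = {word.lower() for word in stopwords}
    words = []
    for tweet in tweets:
        # Keep alphabetic tokens only; a simplification of real tokenization.
        for token in re.findall(r"[A-Za-z']+", tweet.lower()):
            if token not in blocked:
                words.append(token)
    return Counter(words).most_common(n)


# Made-up sample tweets, not taken from the database.
sample_tweets = [
    "RT vaccines are arriving https t co",
    "Vaccines rollout starts, vaccines work",
]
print(top_words(sample_tweets, ["RT", "https", "t", "co", "are"], n=1))
```

On the notebook's data, the same helper could be called with `result["tweet"]` filtered by topic and the `stopwords` set built in `create_word_cloud`.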

# Sentiment and Topic Prediction

Let's try to analyze how sentiment and topic prediction are related. For example, which topics were mentioned most in the positive tweets relative to the negative tweets? Are vaccines tweeted about mostly in a positive or in a negative fashion?

The following tables and plots will help us answer those questions. 

## Sentiments within topics

With the following table, we can analyze the sentiment within each topic. For example, what percentage of the vaccines tweets were positive? What percentage were negative? 

And so on for the different topics predicted. 

In [19]:
pivot_values = pd.crosstab(result['topic_prediction'], result['sentiment_prediction'],normalize='index') \
                 .round(2)
pivot_values['difference'] = pivot_values.eval('NEGATIVE-POSITIVE')
pivot_values.sort_values('difference', ascending = False)

topic_prediction,NEGATIVE,POSITIVE,difference
education,0.82,0.18,0.64
health,0.8,0.2,0.6
vaccines,0.8,0.2,0.6
the economy,0.71,0.29,0.42
politics,0.69,0.31,0.38
work from home,0.62,0.38,0.24


In the table above, each cell gives the share of a sentiment among all the predictions for the given topic. For example, of all the tweets about vaccines, 80% had a negative sentiment and 20% had a positive sentiment. 

Education tweets were the most negative (82%), whereas the balance evens out the most for work from home (62% negative, 38% positive). 
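
To make the row normalization concrete, here is a small sketch on made-up data (the six rows below are illustrative and not taken from the database). With `normalize="index"`, `pd.crosstab` divides each row by its own total, so every topic's shares add up to 1.

```python
import pandas as pd

# Toy data: four health tweets (three negative) and two politics tweets (one negative).
toy = pd.DataFrame({
    "topic_prediction": ["health", "health", "health", "health", "politics", "politics"],
    "sentiment_prediction": ["NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
})

# Each row (topic) is divided by its own total.
by_topic = pd.crosstab(toy["topic_prediction"], toy["sentiment_prediction"], normalize="index")
print(by_topic)
```

Here 3 of the 4 health tweets are negative, so the health row reads 0.75 and 0.25.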

In [20]:
pivot_values = pd.crosstab(result['topic_prediction'], result['sentiment_prediction'],normalize='columns') \
                 .round(2)
pivot_values['difference'] = pivot_values.eval('NEGATIVE-POSITIVE')
pivot_values.sort_values('difference', ascending = False)

topic_prediction,NEGATIVE,POSITIVE,difference
health,0.39,0.28,0.11
vaccines,0.12,0.09,0.03
education,0.05,0.03,0.02
the economy,0.06,0.07,-0.01
politics,0.28,0.35,-0.07
work from home,0.11,0.19,-0.08


## Topics across Sentiments

With the table above, which is normalized by column, we can analyze which topics were the most popular within each sentiment. For example, which was the most popular topic in the positive tweets? Which was it for the negative tweets?

Health is the most popular topic among the negative tweets (39%), while politics leads among the positive tweets (35%). Health represents only 28% of the positive tweets. Conversely, work from home tweets represent 19% of the positive tweets but only 11% of the negative tweets. 
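
The column-normalized shares can also be derived by hand, which is a useful sanity check on the crosstab. The sketch below uses made-up data (not the database contents): it groups by sentiment, takes the relative frequency of each topic within the group, and compares the result with `pd.crosstab(..., normalize="columns")`.

```python
import pandas as pd

# Toy data: four negative tweets (three about health) and two positive tweets.
toy = pd.DataFrame({
    "topic_prediction": ["health", "health", "health", "health", "politics", "politics"],
    "sentiment_prediction": ["NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
})

# Relative frequency of each topic within each sentiment, computed by hand.
by_hand = (
    toy.groupby("sentiment_prediction")["topic_prediction"]
       .value_counts(normalize=True)
       .unstack(level=0, fill_value=0)  # sentiments become columns
)

# The same shares through crosstab: each column is divided by its own total.
by_sentiment = pd.crosstab(toy["topic_prediction"], toy["sentiment_prediction"], normalize="columns")

print(by_hand)
print(by_sentiment)
```

Both tables should agree, and each column adds up to 1, because the shares are taken within a sentiment rather than within a topic.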

In [45]:
import matplotlib as mpl
from statsmodels.graphics.mosaicplot import mosaic

# Draw text in white (set globally, so it also applies to later plots)
mpl.rcParams['text.color'] = 'white'

# Use the larger figure size only while drawing the mosaic plot
with mpl.rc_context():
    mpl.rc("figure", figsize=(10, 15))
    mosaic(result, ['sentiment_prediction', 'topic_prediction'],
           title="Sentiment vs Topics")

[Figure: mosaic plot "Sentiment vs Topics" of sentiment_prediction against topic_prediction]

# Creating widgets out of the analysis

In [24]:
import ipywidgets as widgets

## Word Clouds: Different Topics

With the following widget, you can see the word cloud for the tweets assigned to each predicted topic by the Zero Shot Learning models from Hugging Face. 

In [27]:
widgets.interact(create_word_cloud, topic = result.value_counts('topic_prediction').index)


## Word Clouds: Different Sentiments

With the following widget, you can see the word cloud for the tweets in each sentiment class, as predicted by the Zero Shot Learning model from Hugging Face. 

In [15]:
widgets.interact(word_cloud_sentiment, sentiment=["NEGATIVE", "POSITIVE"])
