# Big Data Analytic Project CA-2 Repeat
 by Muhammad Haseeb Sheikh 2023120 CA-2

 > In this project, I explore Twitter data, perform sentiment analysis on each tweet, and predict its emotion using natural language processing techniques. The pipeline first analyses the polarity and subjectivity of each tweet and then scores its emotion on six metrics: happy, bad, encouraged, joy, loving, and depressed. In addition, I build a binary classifier that detects whether a tweet contains insulting speech. There are approximately five hundred thousand tweets per day and 9,841,743 in total (as of submission); to handle them, we built a four-node cluster that processes the tweets every day. This notebook presents the visualization results, which are also available online at https://covid19.yaonotes.org. The site also has a short introduction explaining why we built it (in the upper right corner). This notebook was tested with Google Colab. All data are available on Google Drive: https://drive.google.com/drive/folders/1g1ZqcZ3xRtDKrR3aIl1bLs2vzetCa0g0?usp=sharing. The code for the system and website is available on GitHub: https://github.com/xzyaoi/covid-sentiment.

<!-- ![ChessUrl](https://raw.githubusercontent.com/xzyaoi/covid-sentiment/master/ezgif.com-gif-maker.gif "full-visualizer") -->


This notebook will be organized into the following parts:

* **Dependencies Installation**. In this code block, we will declare all our dependencies and provide code to install them.

* **Data Exploration.** We will explore the sample data we have crawled and show what the dataset contains.

* **Build Sentiment Analyser.** In this part, we will do three things. <br>1) We will use an existing library, TextBlob, to predict polarity and subjectivity. <br>2) We will use a neural network to estimate the emotions of a text. <br>3) We will use another neural network to detect insulting speech.

* **Sentiment Analysis.** In this part, we will demonstrate how we actually perform the prediction on our dataset. We use the apply function in pandas and parallelize it to improve speed. We also draw a word cloud to use as the background.

* **Data Collection and Merging.** We then collect the results and resample them to minute-level granularity.

* **Data Visualization.** In this part, we will visualize the results as a line chart to show how people's emotions changed during the pandemic.

* **Data Warehousing.** We will talk about how we store the data in our data warehouse.

* **Conclusion and Discussion.** In this part, we will discuss the results from the visualization and exploration parts, and conclude how people's emotions changed over the period.

* **Implementation, Limitations and Possible Improvements.** We will introduce how we actually implement this in our cluster, the limitations and possible improvements.

In [1]:
## Files needed
# !wget https://github.com/aidmodels/sentiment-analysis/releases/download/v0.1/model.h5 -O /content/drive/MyDrive/ColabNotebooks/CCT/sentiment.h5
# !wget https://raw.githubusercontent.com/aidmodels/sentiment-analysis/master/pretrained/tokenizer.pickle -O /content/drive/MyDrive/ColabNotebooks/CCT/tokenizer.pickle
# !wget https://raw.githubusercontent.com/deepmipt/DeepPavlov/0.10.0/deeppavlov/configs/classifiers/insults_kaggle_conv_bert.json -P /content/drive/MyDrive/ColabNotebooks/CCT

In [2]:
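## Point the download links in the configuration file at GitHub mirrors, because the original server is too slow.
## Note: gsed is GNU sed as installed by Homebrew on macOS; on Linux or Colab, use sed instead.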
!gsed -i 's,http://files.deeppavlov.ai/datasets/insults_data.tar.gz,https://github.com/aidmodels/insult_detection/releases/download/v0.1/insults_data.tar.gz,g' ./insults_kaggle_conv_bert.json
!gsed -i 's,http://files.deeppavlov.ai/deeppavlov_data/bert/conversational_cased_L-12_H-768_A-12.tar.gz,https://github.com/aidmodels/insult_detection/releases/download/v0.1/conversational_cased_L-12_H-768_A-12.tar.gz,g' ./insults_kaggle_conv_bert.json
!gsed -i 's,http://files.deeppavlov.ai/deeppavlov_data/classifiers/insults_kaggle_v4.tar.gz,https://github.com/aidmodels/insult_detection/releases/download/v0.1/insults_kaggle_v4.tar.gz,g' ./insults_kaggle_conv_bert.json

## Data Exploration

In this section, we perform some basic exploration of the data we have. At this point we cannot do much, as there are only two properties of interest: retweets_count and likes_count.

After we obtain the sentiment data, we will explore further, with more informative metrics.

In [3]:
# If you do not want to crawl the data again, use this code block to download a sample.
## It contains the tweets crawled from 2020-04-14 00:00:00 to 2020-04-14 00:30:00
import pandas as pd
df_sample = pd.read_csv('https://raw.githubusercontent.com/CConstance/tweets_sentiment/master/test.csv')
display(df_sample.head())

Unnamed: 0,id,date,time,tweet,retweets_count,likes_count
0,1249857397839073291,2020-04-14,00:29:59,Could Coronavirus Trigger Force Majeure Contra...,0,0
1,1249857397771862026,2020-04-14,00:29:59,[Get the Infographic] With the COVID-19 situat...,0,0
2,1249857397721575429,2020-04-14,00:29:59,IMF approves $500m in debt relief for 25 count...,0,1
3,1249857397700481025,2020-04-14,00:29:59,@JeffBezos is our modem Louis XVI. Richer than...,0,0
4,1249857397427974144,2020-04-14,00:29:59,Dr. Michael Wilkes breaks down why nursing hom...,2,6


In [4]:
abs_path = '/Users/macbook/Work/CCT'

In [5]:
# Data Exploration and cleaning
# Take a look at the original crawled data
import os
import pandas as pd
import seaborn as sns
df = pd.read_csv(os.path.join(abs_path, 'ProjectTweets.csv'),
                 names = ['id', 'time', 'query', 
                          'username', 'raw_tweet']).drop(columns=['query','username'])

display(df.head())

Unnamed: 0,id,time,raw_tweet
0,1467810369,Mon Apr 06 22:19:45 PDT 2009,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,1467810672,Mon Apr 06 22:19:49 PDT 2009,is upset that he can't update his Facebook by ...
2,1467810917,Mon Apr 06 22:19:53 PDT 2009,@Kenichan I dived many times for the ball. Man...
3,1467811184,Mon Apr 06 22:19:57 PDT 2009,my whole body feels itchy and like its on fire
4,1467811193,Mon Apr 06 22:19:57 PDT 2009,"@nationwideclass no, it's not behaving at all...."


In [7]:
from datetime import datetime, timezone, timedelta

def convert_to_utc(input_str):
    if 'PDT' in input_str:
        input_str = input_str.replace('PDT', 'UTC')
    # Define the input format based on the provided example
    input_format = "%a %b %d %H:%M:%S %Z %Y"

    # Parse the input string into a datetime object
    input_datetime = datetime.strptime(input_str, input_format)

    # Attach UTC as the timezone label. The clock time is NOT shifted, so the
    # values returned below are still the original PDT wall-clock times.
    input_datetime_utc = input_datetime.replace(tzinfo=timezone.utc)

    # Format the output as per your requirement
    output_time = input_datetime_utc.strftime("%H:%M:%S")
    output_date = input_datetime_utc.strftime("%Y-%m-%d")
#     output_day = input_datetime_utc.strftime("%A")

    return output_time, output_date
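
The function above keeps the PDT wall-clock time and only relabels it. If genuine UTC timestamps are needed, one possible approach is sketched below. It assumes every timestamp is in PDT (a fixed UTC-7 offset), as in the rows shown; the name `pdt_string_to_utc` and the fixed offset are illustrative choices, not part of the original pipeline.

```python
from datetime import datetime, timedelta, timezone

# Assumption: all timestamps are Pacific Daylight Time, i.e. a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7), name="PDT")

def pdt_string_to_utc(input_str):
    """Parse e.g. 'Mon Apr 06 22:19:45 PDT 2009' and return an aware UTC datetime."""
    # Drop the zone label, because strptime's %Z does not reliably recognise 'PDT'.
    naive = datetime.strptime(input_str.replace(" PDT", ""), "%a %b %d %H:%M:%S %Y")
    # Attach the real offset, then shift the clock to UTC.
    return naive.replace(tzinfo=PDT).astimezone(timezone.utc)

utc_dt = pdt_string_to_utc("Mon Apr 06 22:19:45 PDT 2009")
print(utc_dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2009-04-07 05:19:45
```

Note that the shift can move a tweet to the next calendar day, which matters when grouping by date.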

In [8]:
# Split the raw timestamp into separate time and date columns
df[['output_time', 'output_date']] = df['time'].apply(convert_to_utc).apply(pd.Series)

In [9]:
display(df.head())

Unnamed: 0,id,time,raw_tweet,output_time,output_date
0,1467810369,Mon Apr 06 22:19:45 PDT 2009,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",22:19:45,2009-04-06
1,1467810672,Mon Apr 06 22:19:49 PDT 2009,is upset that he can't update his Facebook by ...,22:19:49,2009-04-06
2,1467810917,Mon Apr 06 22:19:53 PDT 2009,@Kenichan I dived many times for the ball. Man...,22:19:53,2009-04-06
3,1467811184,Mon Apr 06 22:19:57 PDT 2009,my whole body feels itchy and like its on fire,22:19:57,2009-04-06
4,1467811193,Mon Apr 06 22:19:57 PDT 2009,"@nationwideclass no, it's not behaving at all....",22:19:57,2009-04-06


In [10]:
df = df.drop(columns=['time'])
df = df.rename(columns={'output_time': 'time', 'output_date': 'date'})

# Rearrange columns
column_order = ['id', 'date', 'time', 'raw_tweet']
df = df[column_order]
# df = df.drop(columns=['day'])
# Print the resulting DataFrame
display(df)

Unnamed: 0,id,date,time,raw_tweet
0,1467810369,2009-04-06,22:19:45,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,1467810672,2009-04-06,22:19:49,is upset that he can't update his Facebook by ...
2,1467810917,2009-04-06,22:19:53,@Kenichan I dived many times for the ball. Man...
3,1467811184,2009-04-06,22:19:57,my whole body feels itchy and like its on fire
4,1467811193,2009-04-06,22:19:57,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...
1599995,2193601966,2009-06-16,08:40:49,Just woke up. Having no school is the best fee...
1599996,2193601969,2009-06-16,08:40:49,TheWDB.com - Very cool to hear old Walt interv...
1599997,2193601991,2009-06-16,08:40:49,Are you ready for your MoJo Makeover? Ask me f...
1599998,2193602064,2009-06-16,08:40:49,Happy 38th Birthday to my boo of alll time!!! ...


The original crawled data has 34 columns. For our sentiment analysis, we want to analyse the emotion in tweets over a period of time, so we need to crawl tweets every day, and there are several hundred thousand tweets per day. To reduce the amount of fetched data, we filter out irrelevant columns while fetching and keep only the ID, the tweet text and the time. The ID is used to track the source of a tweet, the text is used to analyse the users' emotion, and the time lets us follow how emotion changes.

Even though most tweets do not get any responses, they are still worth investigating, as they reflect the feelings and emotions of people at that time.

## Build Sentiment Analyser

In this section, we will build and test three sentiment analysers, each measuring something different:

* **Emotion Recognition.** We previously built a neural-network-based emotion extractor and use it here to extract emotion information from tweets. It scores seven emotions: bad, depressed, encouraged, happy, joy, sad, loving. Each score is a real value in $[0,1]$. The neural network architecture and the corresponding training code will be provided as an appendix.

* **Polarity and Subjectivity.** We use [TextBlob](https://textblob.readthedocs.io/en/dev/) to analyse the polarity and subjectivity of a single sentence. The ```sentiment``` property used below relies on TextBlob's default lexicon-based analyser (```PatternAnalyzer```); the optional ```NaiveBayesAnalyzer``` returns a positive/negative classification instead of polarity and subjectivity.

* **Insult Detection.** We use [DeepPavlov](http://docs.deeppavlov.ai/en/master/features/models/classifiers.html) to find out whether a sentence contains insulting content. It is also a neural-network-based text classifier.

In [12]:
"""
We use an existing neural network to perform sentiment analysis, 
in this notebook, we will present how it works.
Inside our system, we deployed the neural network into a http server, 
and perform http requests to get the prediction.
"""

from mlpm.solver import Solver
import os
import pickle
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

MAX_LEN=140

class SentimentSolver(Solver):
    def __init__(self, toml_file=None):
        super().__init__(toml_file)
        # Do your init work here
        with open(os.path.join(abs_path, 'tokenizer.pickle'), 'rb') as handle:
            self.loaded_tokenizer = pickle.load(handle)
        self.model = tf.keras.models.load_model(os.path.join(abs_path,"sentiment.h5"))
        self.ready()

    def infer(self, data):
        # if you need to get file uploaded, get the path from input_file_path in data
        sequences = self.loaded_tokenizer.texts_to_sequences([data['text']])
        padding = pad_sequences(sequences, maxlen=MAX_LEN)
        result = self.model.predict(padding, batch_size=1, verbose=1)
        return {"output": result.tolist()} # return a dict

ss = SentimentSolver()
print(ss.infer({'text':"I'm depressed to hear that...!"}))
print(ss.infer({'text':"I felt sorrow...!"}))
## The outputs are ordered as: bad, depressed, encouraged, happy, joy, sad, loving

{'output': [[0.0009565531508997083, 0.9974063038825989, 9.468352146768666e-08, 1.0750953833849053e-06, 8.217542131205846e-07, 0.002043356653302908, 4.878527306573233e-06]]}
{'output': [[0.2471114844083786, 0.22646766901016235, 0.03660554811358452, 0.05048996955156326, 0.060417741537094116, 0.2546912133693695, 0.040342219173908234]]}
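
The raw output is easier to read once each score is paired with its emotion name. The sketch below assumes the label order given in the comment above; `label_scores` and `top_emotion` are hypothetical helpers, and the example scores are the first output above, rounded for brevity.

```python
# Label order as stated for the model output.
EMOTION_LABELS = ["bad", "depressed", "encouraged", "happy", "joy", "sad", "loving"]

def label_scores(result):
    """Pair each score in {'output': [[...]]} with its emotion name."""
    return dict(zip(EMOTION_LABELS, result["output"][0]))

def top_emotion(result):
    """Return the name of the highest-scoring emotion."""
    scores = label_scores(result)
    return max(scores, key=scores.get)

# Scores copied from the first example above, rounded for brevity.
example = {"output": [[0.00096, 0.99741, 9.5e-08, 1.1e-06, 8.2e-07, 0.00204, 4.9e-06]]}
print(top_emotion(example))  # depressed
```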


In [14]:
from textblob import TextBlob
def tb_analyse(sentence):
    tb = TextBlob(str(sentence))
    return tb.sentiment.polarity, tb.sentiment.subjectivity
tb_analyse("I am good to go")

(0.7, 0.6000000000000001)

In [15]:
## Insult Detection
### build_model(..., download=True) fetches the required models from the URLs in the configuration file
from deeppavlov import build_model
model = build_model('./insults_kaggle_conv_bert.json', download=True)
print(model(['Hi, how are you?']))
print(model(['You asshole!']))

2023-12-27 15:01:06.470 INFO in 'deeppavlov.core.data.utils'['utils'] at line 95: Downloading from https://github.com/aidmodels/insult_detection/releases/download/v0.1/insults_kaggle_v4.tar.gz to /Users/macbook/.deeppavlov/models/insults_kaggle_v4.tar.gz
100%|█████████████████████████████████████████████████████| 401M/401M [06:13<00:00, 1.07MB/s]
2023-12-27 15:07:21.663 INFO in 'deeppavlov.core.data.utils'['utils'] at line 276: Extracting /Users/macbook/.deeppavlov/models/insults_kaggle_v4.tar.gz archive into /Users/macbook/.deeppavlov/models/classifiers
2023-12-27 15:07:25.51 INFO in 'deeppavlov.core.data.utils'['utils'] at line 95: Downloading from https://github.com/aidmodels/insult_detection/releases/download/v0.1/insults_data.tar.gz to /Users/macbook/.deeppavlov/insults_data.tar.gz
100%|█████████████████████████████████████████████████████| 682k/682k [00:00<00:00, 1.48MB/s]
2023-12-27 15:07:26.707 INFO in 'deeppavlov.core.data.utils'['utils'] at line 276: Extracting /Users/macbook

ConfigError: 'Model bert_preprocessor is not registered.'
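
The `ConfigError` means the installed DeepPavlov has no component registered under the name `bert_preprocessor`. One plausible cause (an assumption, not verified here) is that the 0.10.0 configuration file downloaded earlier does not match the installed DeepPavlov release. A minimal sketch for listing the component names a config asks for, so they can be compared against the installed version's documentation, follows. It assumes the usual `chainer.pipe` layout with one `class_name` per component; `sample_config` is a made-up, heavily shortened stand-in for the real JSON file.

```python
import json

def list_component_names(config):
    """Return the class_name of every component in the config's chainer pipe."""
    pipe = config.get("chainer", {}).get("pipe", [])
    return [step["class_name"] for step in pipe if "class_name" in step]

# Hypothetical, shortened stand-in for insults_kaggle_conv_bert.json.
sample_config = json.loads("""
{
  "chainer": {
    "in": ["x"],
    "pipe": [
      {"class_name": "bert_preprocessor", "in": ["x"]},
      {"class_name": "bert_classifier", "in": ["bert_features"]}
    ],
    "out": ["y_pred_labels"]
  }
}
""")

print(list_component_names(sample_config))  # ['bert_preprocessor', 'bert_classifier']
```

With the real file, the dictionary loaded by `json.load` from `./insults_kaggle_conv_bert.json` would take the place of `sample_config`.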

## Sentiment Analysis

In this section, we use the analysers above to analyse each tweet in our dataset. We provide a function that reads a row of the dataframe and returns the row extended with the outputs of the sentiment analysis, and then map that function over every row. In ```pandas```, this can be done via ```applymap``` (for dataframes), ```map``` (for series) and ```apply``` (for both). Here we use ```apply```.

Although we parallelize the processing, it can still be slow on such a large dataset. Therefore, we provide a post-processed ```.csv``` file for quickly exploring the results of the sentiment analysis.
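
Because a plain `apply` handles one row at a time, the work can be spread over several workers by splitting the tweets into chunks. The sketch below shows the idea using only the standard library. `fake_analyse` is a stand-in for the real analysers, and the chunk size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_analyse(text):
    """Stand-in for the real analysers: returns the word count of the tweet."""
    return len(str(text).split())

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def analyse_chunk(chunk):
    return [fake_analyse(text) for text in chunk]

def parallel_analyse(texts, chunk_size=2, workers=4):
    """Analyse chunks concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the submitted chunks.
        chunk_results = pool.map(analyse_chunk, chunked(texts, chunk_size))
        return [score for chunk in chunk_results for score in chunk]

tweets = ["stay safe everyone", "so tired", "good morning", "what a day this is"]
print(parallel_analyse(tweets))  # [3, 2, 2, 5]
```

Threads suit a setup where each prediction is an HTTP request to a model server; a CPU-bound model running in the same process would need separate processes instead.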


In [None]:
# Apply all three analysers to every tweet in a pandas dataframe.
## Each row is processed on its own, so the models are called once per tweet rather than on a batch.
## Unfortunately, this is really slow.
from tqdm import tqdm
import pandas as pd

tqdm.pandas()
df = pd.read_csv("./test.csv")

def apply_row(row):
  row['emotion'] = ss.infer({'text':row['tweet']})['output'][0]
  row['is_insult'] = int(model([row['tweet']])[0] == 'Insult')
  row['polarity'], row['subjectivity'] = tb_analyse(row['tweet'])
  return row

df = df.progress_apply(apply_row, axis = 1)
df.to_csv("./test_result.csv")
df.head()
total_insult = df['is_insult'].sum()
print('Total Insult Tweets:' +str(total_insult))

In [None]:
# If the above code block takes too long, consider downloading the result directly.
!wget https://raw.githubusercontent.com/CConstance/tweets_sentiment/master/test_result.csv

In [None]:
# Demo Visualization - Data prepare
import pandas as pd
df_result = pd.read_csv("./test_result.csv")

df_result['emotion'].tolist()
df_result['bad']=df_result['emotion'].apply(lambda x:float(x.split(",")[0][1:]))
df_result['depressed']=df_result['emotion'].apply(lambda x:float(x.split(",")[1]))
df_result['encouraged']=df_result['emotion'].apply(lambda x:float(x.split(",")[2]))
df_result['happy']=df_result['emotion'].apply(lambda x:float(x.split(",")[3]))
df_result['joy']=df_result['emotion'].apply(lambda x:float(x.split(",")[4]))
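# 'loving' is the last score; index from the end so this still works when the list also contains 'sad' before it
df_result['loving']=df_result['emotion'].apply(lambda x:float(x.split(",")[-1][:-1]))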

# Group by time
df_result['time'] = df_result['date']+" "+df_result['time']
df_result['time'] = pd.to_datetime(df_result['time'])

df_result.index = df_result['time']
df_result = df_result.drop(['emotion'], axis=1)
df_result.head()

## Further Data Exploration

In this section, we will explore the sentiment data after the analysis.

In [None]:
## Summary of insult speech
total_num = df_result['is_insult'].sum()
print("There are "+str(total_num)+" insulting tweets in the dataset.")
## We found only 24 insulting tweets, a small fraction of all the tweets we have (a bit surprising to us).
## Let's see what they are:
is_insult =  df_result[df_result['is_insult']==1]

pd.set_option('display.max_colwidth', None)  # None shows the full text; -1 is no longer accepted by recent pandas
print(is_insult['tweet'])

## We found some keywords in these tweets that make them insult others,
## such as: 'a complete moron', 'you cold busted', 'you were incompetent', 'Blood on your hands you', 'immoral as you', 'You are disrespectful and ignorant'
##

In [None]:
## Relations between bad/happy, assumption: they should be negatively correlated.
var = 'bad/happy'
data = pd.concat([df_result['bad'], df_result['happy']], axis=1)
data.plot.scatter(x='happy', y='bad', ylim=(0,1));

## The result aligns with our assumption.

In [None]:
## Relations between polarity and happy, assumption: they should be positively correlated.
var = 'polarity/happy'
data = pd.concat([df_result['polarity'], df_result['happy']], axis=1)
data.plot.scatter(x='polarity', y='happy', ylim=(0,1));

## We found that as the polarity grows, especially in $[0,1]$, there are more sentences classified as happy.

In [None]:
## Relations between encouraged and happy, assumption: they should be positively correlated.
var = 'encouraged/happy'
data = pd.concat([df_result['encouraged'], df_result['happy']], axis=1)
data.plot.scatter(x='encouraged', y='happy', ylim=(0,1));
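
The scatter plots show each trend visually; a correlation coefficient puts a number on it. Below is a minimal sketch of Pearson's r in plain Python. The scores are made-up values standing in for the `happy` and `bad` columns, so the result says nothing about the real dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up scores: 'bad' falls as 'happy' rises.
happy = [0.1, 0.2, 0.3, 0.4]
bad = [0.9, 0.7, 0.6, 0.2]

r = pearson_r(happy, bad)
print(round(r, 3))  # negative, since the two toy series move in opposite directions
```

With pandas, the same quantity is available as `df_result['happy'].corr(df_result['bad'])`.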

## Data Visualization

In this section, we will visualize how emotions change over time. It is hard to show the whole dataset in this notebook, so we only illustrate the sample dataset; the same technique applies directly to the whole dataset.

We first resample the data by taking the mean of each emotion, polarity and subjectivity in every minute, and the sum of insulting tweets in every minute. The resampled data can be interpreted as:

* Average emotions, polarity and subjectivity per minute.
* Total insulting tweets per minute.
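
The two aggregations can be illustrated on a tiny dataframe. The rows below are made-up values, not real tweets; the sketch only shows how a per-minute mean and a per-minute sum behave on a datetime index.

```python
import pandas as pd

# Made-up rows standing in for analysed tweets, indexed by timestamp.
demo = pd.DataFrame(
    {"polarity": [0.2, 0.4, -0.1, 0.5], "is_insult": [0, 1, 0, 1]},
    index=pd.to_datetime([
        "2020-04-14 00:00:05", "2020-04-14 00:00:40",
        "2020-04-14 00:01:10", "2020-04-14 00:01:50",
    ]),
)

# Average polarity per minute, and number of insulting tweets per minute.
per_minute_mean = demo[["polarity"]].resample("min").mean()
per_minute_insults = demo[["is_insult"]].resample("min").sum()

print(per_minute_mean)
print(per_minute_insults)
```

Each output has one row per minute: the first minute averages the polarities 0.2 and 0.4, and each minute contains exactly one insulting tweet.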




In [None]:
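# Resample data
## Count insult tweets by minute
df_insult = df_result[['is_insult']].resample('min').sum()
## Average the numeric metrics by minute ('min' replaces the deprecated 'T' alias;
## numeric_only=True skips text columns such as the tweet itself)
df_result = df_result.resample('min').mean(numeric_only=True)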

In [None]:
df_result.plot(y=["polarity", "loving", "joy", "happy", "bad", "depressed", "subjectivity", "encouraged"])

In [None]:
# Statistics of insult tweets per minute
#df_insult.head()
df_insult.plot(y=["is_insult"])

In [None]:
## Wordcloud of the tweets

from wordcloud import WordCloud
from nltk.corpus import stopwords
import nltk

import matplotlib.pyplot as plt
nltk.download('stopwords')

stopwords = stopwords.words('english')
new_stopwords = ['twitter','utm_campaign','bit','bit.ly','Covid','pic','utm_source','utm_social','utm_medium','https','http','COVID','html','instagram','covid','covid19']
stopwords.extend(new_stopwords)
# stopwords=load_stopwords(stopwords)
stopwords = set(stopwords)

def generate_worldcloud(texts):
  # Build the word cloud from one space-joined string of tweets

  wordcloud = WordCloud(width = 300, height = 300,
                  background_color ='white',
                  max_words=2048,
                  stopwords=stopwords,
                  max_font_size=25).generate(texts)
  return wordcloud
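# df_result was resampled to per-minute averages above and no longer holds the raw
# text, so reload the tweets from the saved results file.
tweets = pd.read_csv("./test_result.csv")['tweet']
texts = " ".join(str(text) for text in tweets if not any(x in str(text) for x in new_stopwords))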
wordcloud = generate_worldcloud(texts)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

## Data Warehousing

After the sentiment analysis and word cloud generation, our program uploads the word cloud image, along with the emotion data, to the data warehouse, so that our frontend visualizer can read the data, draw the line graph and render the background. We store the data at two different endpoints:

* **OSF (Open Science Framework)**. It provides an easy-to-use API for querying files in storage. We store all the word cloud images in OSF, so that our frontend can find out which dates are available without loading the whole dataset. On top of that, we built a Cloud Function hosted on Google Cloud Platform to avoid cross-origin issues.

* **GitHub**. We use GitHub to serve all the ```.csv``` files, i.e. the emotion data per day. Each file has the same name as the corresponding image, and our frontend reads the data when needed. You can find all the data at https://github.com/xzyaoi/covid-sentiment/tree/master/data

## Conclusion and Discussion

In this project, we crawled nearly 10 million tweets. We investigated how to build a pipeline that analyses these tweets, visualizes the results and stores them properly. We built three analysers using different techniques to **extract** the polarity, the subjectivity, six emotions and whether a tweet insults others. We also drew a word cloud per day to show what people care about. We then **transform** the emotion data per minute to compute the average emotion metrics and the number of insulting tweets. After that, we **load** the data into the data warehouse. To conclude, we built an ETL system that analyses Twitter data every day.

To our surprise, we found that people's emotions were fairly stable during the pandemic. Overall, more encouraging and happy tweets appear in May and June than in April. Another finding is that insulting tweets are rare (in our sample, 24/5352 ≈ 0.4%).


## Implementation, Limitations and Possible Improvements

Besides this notebook, we deployed the system on a four-node cluster. The nodes fall into two types:

* **Crawler**: We have a single node for crawling, hosted on Google Cloud Platform. A cron job starts at 3:00 AM every day to crawl the previous day's data.

* **Analyser**: The other nodes are hosted on a Chinese provider. To improve performance, we balance requests across these nodes: we run the sentiment analyser and the insult detector as HTTP services and deploy HAProxy as a load balancer in front of them. This distributes the tweets across the nodes and parallelizes the processing.
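
A load balancer such as HAProxy can hand incoming requests to the backend nodes in turn (round-robin). The sketch below imitates that distribution in plain Python to show how evenly the work spreads. The node URLs are hypothetical, since the real addresses are not given in this notebook, and no request is actually sent.

```python
from collections import Counter
from itertools import cycle

# Hypothetical analyser endpoints: one crawler out of four nodes leaves three analysers.
NODES = ["http://node-1:8080", "http://node-2:8080", "http://node-3:8080"]

def assign_round_robin(tweets, nodes):
    """Pair each tweet with the next node in turn, as a round-robin balancer would."""
    return list(zip(tweets, cycle(nodes)))

tweets = [f"tweet {i}" for i in range(7)]
assignment = assign_round_robin(tweets, NODES)

# Count how many tweets each node would receive.
load = Counter(node for _, node in assignment)
print(load)
```

With seven tweets and three nodes, no node receives more than one tweet beyond any other, which is the property that lets the analysis scale with the number of nodes.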

Even though we tried our best to parallelize the analysis, it still took longer than expected, and we could not finish the insult analysis on time. Thus, the online version only includes the emotion and polarity analysis. With more nodes, we could deploy the insult detector and complete the insult analysis.

Another limitation is that we did not consider other search keywords. It might be useful and interesting to look into other keywords, such as #China or #Trump, and compare how people's emotions change.


