
### Social Media Tweet Analysis on Twitter Dataset

*   Topic Modeling on Twitter Dataset

*   Sentiment analysis on Twitter Dataset






### **Business understanding**

### **Topic modeling**
Topic modeling is a statistical technique for discovering the abstract "topics" that occur in a collection of texts.
It is an unsupervised approach that finds groups of co-occurring words (called “topics”) in large collections of text.
**Topic models** are built around the idea that the semantics of a document are governed by hidden, or “latent,” variables that we do not observe directly.

The important libraries used to perform the Topic Modelling are: Pandas, Gensim, pyLDAvis

*   Our task here is to discover abstract topics from tweets.


### **Sentiment analysis**
Sentiment analysis is used in social media monitoring. It lets businesses gain insight into how customers feel about certain topics and detect urgent issues in real time, before they spiral out of control.


*   Our task here is to classify each tweet as having positive or negative sentiment.




**Topic modeling** is a machine learning technique that automatically analyzes text data to find clusters of words that characterize a set of documents.

*   It is unsupervised machine learning because it doesn’t require a predefined list of tags or training data that has been previously classified by humans.
*   Because it doesn’t require labeled training data, it’s a quick and easy way to start analyzing your data.

## Data Understanding
### Loading necessary packages

In [1]:
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import STOPWORDS,WordCloud
import gensim
from gensim.models import CoherenceModel
from gensim import corpora
import pandas as pd
from pprint import pprint
import string
import os
import re

Data acquisition

For this example we have two options for data acquisition:

*   Download a Twitter dataset directly from Twitter by registering as a developer [here](https://developer.twitter.com/en).
*   Or use the already downloaded data found at Week0/data/cleaned_fintech_data.csv.



In [46]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
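# Data loader class
class DataLoader:
  def __init__(self, dir_name, file_name):
    self.dir_name = dir_name
    self.file_name = file_name

  def read_csv(self):
    # Note: os.chdir changes the working directory for the rest of the notebook,
    # so later relative paths are resolved inside dir_name
    os.chdir(self.dir_name)
    tweets_df = pd.read_csv(self.file_name)
    return tweets_df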
  
    

In [3]:
#object creation
DataLoader_obj= DataLoader('data','cleaned_fintech_data.csv')


In [4]:
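tweets_df=DataLoader_obj.read_csv()
# dropna() returns a new DataFrame and the result is not assigned, so tweets_df itself keeps all rows
tweets_df.dropna()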


Unnamed: 0.1,Unnamed: 0,created_at,source,original_text,clean_text,sentiment,polarity,subjectivity,lang,favorite_count,...,original_author,screen_count,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place,place_coord_boundaries,timestamp


In [5]:
len(tweets_df)

5621

In [6]:
tweets_df.head()

Unnamed: 0.1,Unnamed: 0,created_at,source,original_text,clean_text,sentiment,polarity,subjectivity,lang,favorite_count,...,original_author,screen_count,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place,place_coord_boundaries,timestamp
0,0.0,Thu Jun 17 06:26:34 +0000 2021,"<a href=""https://mobile.twitter.com"" rel=""nofo...",Giving forth life is becoming a burden in Keny...,Giving forth life becoming burden Kenya This m...,"Sentiment(polarity=0.3194444444444445, subject...",0.3194444444444445,0.5305555555555556,en,0,...,reen_law,398,70,223,,,janetmachuka_,,,2021-06-17 06:26:34+00:00
1,1.0,Thu Jun 17 06:26:37 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",Teenmaar - 26cr\nPanja - 32.5cr\nGabbarsingh -...,Teenmaar crPanja crGabbarsingh cr Khaleja Kuda...,"Sentiment(polarity=0.0, subjectivity=0.0)",0.0,0.0,in,0,...,Amigo9999_,19047,132,1084,,,maheshblood,,India,2021-06-17 06:26:37+00:00
2,2.0,Thu Jun 17 06:26:42 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",Rei chintu 2013 lo Vachina Ad Nizam ne 2018 lo...,Rei chintu lo Vachina Ad Nizam ne lo kottaru f...,"Sentiment(polarity=0.0, subjectivity=0.0)",0.0,0.0,hi,0,...,MallaSuhaas,47341,2696,2525,,,Hail_Kalyan,,Vizag,2021-06-17 06:26:42+00:00
3,3.0,Thu Jun 17 06:26:44 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...",Today is World Day to Combat #Desertification ...,Today World Day Combat Restoring degraded land...,"Sentiment(polarity=0.25, subjectivity=0.65)",0.25,0.65,en,0,...,CIACOceania,7039,343,387,,"Desertification, Drought, resilience",EdwardVrkic,,Papua New Guinea,2021-06-17 06:26:44+00:00
4,4.0,Thu Jun 17 06:26:47 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",Hearing #GregHunt say he's confident vaccines ...,Hearing say 's confident vaccines delivered li...,"Sentiment(polarity=0.5, subjectivity=0.8333333...",0.5,0.8333333333333334,en,0,...,MccarronWendy,26064,419,878,,"GregHunt, Morrison",WriteWithDave,,"Sydney, New South Wales",2021-06-17 06:26:47+00:00


In [7]:
class PrepareData:
  def __init__(self, df):
    self.df = df

  def preprocess_data(self):
    # Keep English tweets only; .copy() avoids pandas' SettingWithCopyWarning
    tweets_df = self.df.loc[self.df['lang'] == "en"].copy()

    # Text preprocessing: cast to str, lowercase, strip punctuation
    tweets_df['clean_text'] = tweets_df['clean_text'].astype(str)
    tweets_df['clean_text'] = tweets_df['clean_text'].apply(lambda x: x.lower())
    tweets_df['clean_text'] = tweets_df['clean_text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

    # Convert each tweet to a list of words for feature engineering
    sentence_list = [tweet for tweet in tweets_df['clean_text']]
    word_list = [sent.split() for sent in sentence_list]

    # Create a dictionary that maps each unique token to an integer id
    # (inspect it with word_to_id.token2id)
    word_to_id = corpora.Dictionary(word_list)
    # Using bag of words (bow), build a corpus of (word id, frequency) pairs per document
    corpus_1 = [word_to_id.doc2bow(tweet) for tweet in word_list]

    return word_list, word_to_id, corpus_1


In [8]:
PrepareData_obj=PrepareData(tweets_df)
word_list ,id2word,corpus=PrepareData_obj.preprocess_data()

In [9]:
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1)], [(29, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 2), (53, 1), (54, 1), (55, 2), (56, 2), (57, 1)], [(58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1)], [(26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1)], [(29, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1),

In [10]:
# word_id avoids shadowing Python's built-in id()
id_words = [[(id2word[word_id], count) for word_id, count in line] for line in corpus]

In [None]:
print(id_words)

[[('babies', 1), ('baby', 1), ('becoming', 1), ('birth', 1), ('bundles', 1), ('burden', 1), ('coz', 1), ('douglas', 1), ('expensiveturn', 1), ('formula', 1), ('forth', 1), ('gave', 1), ('giving', 2), ('handmpesa', 1), ('hard', 1), ('joy', 1), ('kenya', 1), ('life', 1), ('meeting', 1), ('mother', 1), ('needs', 1), ('nutritional', 1), ('nyaoko', 1), ('tears', 1), ('this', 1), ('time', 1)], [('all', 1), ('away', 1), ('brings', 1), ('carbon', 1), ('combat', 1), ('critical', 1), ('day', 1), ('degraded', 1), ('food', 1), ('helps', 1), ('jobs', 1), ('land', 1), ('lifting', 1), ('locking', 1), ('many', 1), ('poverty', 1), ('recover', 1), ('restoring', 1), ('security', 1), ('slows', 1), ('today', 1), ('world', 1)], [('carbon', 1), ('confident', 1), ('delivered', 1), ('emissions', 1), ('g7', 1), ('hearing', 2), ('like', 1), ('reducing', 1), ('s', 2), ('say', 2), ('vaccines', 1)], [('account', 1), ('across', 1), ('airtime', 1), ('arteta', 1), ('buy', 1), ('even', 1), ('fuliza', 1), ('kabando', 1)
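
To make the `(id, count)` pairs above concrete, here is a minimal pure-Python sketch of what a bag-of-words mapping does. It only mimics the idea behind `corpora.Dictionary` and `doc2bow`: the helper names `build_vocab` and `to_bow` and the toy tweets are made up for illustration, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens that are in vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical, already-tokenized tweets
toy_tweets = [["carbon", "tax", "carbon"], ["carbon", "emissions"]]
vocab = build_vocab(toy_tweets)
toy_corpus = [to_bow(doc, vocab) for doc in toy_tweets]
print(vocab)
print(toy_corpus)
```

Each inner list describes one tweet: the first element of a pair is the word id, the second is how often that word occurs in the tweet.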

<gensim.corpora.dictionary.Dictionary at 0x7f4dde3b1b90>

### Topic Modeling using Latent Dirichlet Allocation 
LDA is based on the distributional hypothesis (i.e. similar topics make use of similar words) and the statistical mixture hypothesis (i.e. documents talk about several topics), for which a statistical distribution can be determined.

*  The purpose of LDA is to map each tweet in our corpus to a set of topics that covers a good deal of the words in the tweet.
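
The statistical mixture hypothesis can be illustrated with a few lines of plain Python. Under a topic model of this kind, the probability of a word in a document is the sum, over topics, of the document's topic weight times that topic's word probability. The topic names and all numbers below are invented for illustration and are not taken from the model fitted in this notebook.

```python
# Hypothetical topic-word probabilities p(word | topic)
topic_word = {
    "climate": {"carbon": 0.5, "emissions": 0.4, "money": 0.1},
    "finance": {"carbon": 0.1, "emissions": 0.1, "money": 0.8},
}
# Hypothetical topic mixture p(topic | document) for one tweet
doc_topics = {"climate": 0.7, "finance": 0.3}

def word_probability(word, doc_topics, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())

p_carbon = word_probability("carbon", doc_topics, topic_word)
print(round(p_carbon, 2))
```

Fitting LDA amounts to estimating both tables (the topic-word probabilities and each document's topic mixture) from the observed word counts.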



In [11]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [12]:
pprint(lda_model.print_topics())

[(0,
  '0.035*"carbon" + 0.024*"climate" + 0.024*"gt" + 0.014*"bp" + 0.013*"change" '
  '+ 0.010*"future" + 0.010*"land" + 0.009*"large" + 0.008*"intense" + '
  '0.008*"focus"'),
 (1,
  '0.025*"money" + 0.018*"account" + 0.015*"like" + 0.014*"ur" + 0.014*"june" '
  '+ 0.013*"follow" + 0.013*"giveaway" + 0.013*"giving" + 0.013*"would" + '
  '0.013*"rt"'),
 (2,
  '0.024*"carbon" + 0.017*"nt" + 0.016*"i" + 0.014*"meat" + 0.013*"uk" + '
  '0.011*"footprint" + 0.009*"s" + 0.009*"go" + 0.009*"low" + 0.009*"great"'),
 (3,
  '0.043*"carbon" + 0.026*"tax" + 0.025*"protecting" + 0.025*"rights" + '
  '0.025*"amp" + 0.021*"oil" + 0.021*"emissions" + 0.016*"new" + 0.014*"the" + '
  '0.013*"covid"'),
 (4,
  '0.032*"government" + 0.030*"carbon" + 0.025*"s" + 0.023*"emissions" + '
  '0.023*"zero" + 0.018*"i" + 0.018*"net" + 0.017*"gas" + 0.016*"cost" + '
  '0.015*"target"')]


In [13]:
pprint(lda_model.show_topics(formatted=False))

[(0,
  [('carbon', 0.034708492),
   ('climate', 0.024370829),
   ('gt', 0.024116779),
   ('bp', 0.014439735),
   ('change', 0.012907535),
   ('future', 0.00954764),
   ('land', 0.009546837),
   ('large', 0.008649602),
   ('intense', 0.008456626),
   ('focus', 0.007898002)]),
 (1,
  [('money', 0.024699865),
   ('account', 0.017653052),
   ('like', 0.014913489),
   ('ur', 0.014439795),
   ('june', 0.0141965635),
   ('follow', 0.013304844),
   ('giveaway', 0.013050717),
   ('giving', 0.01302125),
   ('would', 0.012878241),
   ('rt', 0.0125428075)]),
 (2,
  [('carbon', 0.024277098),
   ('nt', 0.01659092),
   ('i', 0.016169818),
   ('meat', 0.0144863175),
   ('uk', 0.013002309),
   ('footprint', 0.011360549),
   ('s', 0.0094936825),
   ('go', 0.009141455),
   ('low', 0.008937164),
   ('great', 0.008675807)]),
 (3,
  [('carbon', 0.043205258),
   ('tax', 0.026049877),
   ('protecting', 0.025413387),
   ('rights', 0.02533519),
   ('amp', 0.024646819),
   ('oil', 0.021254148),
   ('emissions', 

Each entry is a topic with its top terms and their weights. Topic 0 can be labelled climate change, and Topic 4 government and carbon emissions.

# **Model Analysis**

Perplexity is one measure of model quality; in natural language processing it is usually normalized per word. It describes how well a model predicts a sample, i.e. how much it is “perplexed” by a sample from the observed data. The lower the perplexity, the better the model fits the given data.

Topic coherence is a second measure. It compares topic models based on the human interpretability of their topics, and the ‘C_V’ coherence score gives a numerical value for that interpretability. It is not a classification accuracy.

In [14]:
# Compute Perplexity

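# log_perplexity returns the per-word likelihood bound (a negative number), not the perplexity itself.
# gensim defines perplexity as 2 ** (-bound), so a higher (less negative) bound means lower perplexity.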
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
doc_lda = lda_model[corpus]


# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=word_list, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\n Ldamodel Coherence Score/Accuracy on Tweets: ', coherence_lda)


Perplexity:  -6.7493288814452335

 Ldamodel Coherence Score/Accuracy on Tweets:  0.6172103452083265


The LDA model reaches a coherence score of about 0.62, which suggests that it has performed reasonably well at topic modeling.
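
The value printed as "Perplexity" above is negative because gensim's `log_perplexity` returns a per-word likelihood bound; the gensim documentation gives the corresponding perplexity as 2 to the power of minus that bound. A small sketch of the conversion, using the number from the output above (the variable names are only illustrative):

```python
# Value printed by lda_model.log_perplexity(corpus) in the output above
log_perplexity_bound = -6.7493288814452335

# gensim describes perplexity as 2 ** (-bound)
perplexity = 2 ** (-log_perplexity_bound)
print(perplexity)
```

When comparing models on the same corpus, a bound closer to zero therefore corresponds to a lower perplexity.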

In [None]:
!pip install pyLDAvis 


**Analyzing results**

Exploring the Intertopic Distance Plot can help you learn how topics relate to each other, including potential higher-level structure between groups of topics.

In [None]:
import pyLDAvis.gensim_models as gensimvis
import pickle 
import pyLDAvis
# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)
LDAvis_prepared



# Sentiment Analysis

[Notebook reference](https://github.com/lazuxd/simple-imdb-sentiment-analysis/blob/master/sentiment-analysis.ipynb)

## Building a Sentiment Classifier using Scikit-Learn

<center><img src="https://raw.githubusercontent.com/lazuxd/simple-imdb-sentiment-analysis/master/smiley.jpg"/></center>
<center><i>Image by AbsolutVision @ <a href="https://pixabay.com/ro/photos/smiley-emoticon-furie-sup%C4%83rat-2979107/">pixabay.com</a></i></center>

> &nbsp;&nbsp;&nbsp;&nbsp;**Sentiment analysis**, an important area in Natural Language Processing, is the process of automatically detecting affective states of text. Sentiment analysis is widely applied to voice-of-customer materials such as product reviews in online shopping websites like Amazon, movie reviews or social media. It can be just a basic task of classifying the polarity of a text as being positive/negative or it can go beyond polarity, looking at emotional states such as "happy", "angry", etc.

&nbsp;&nbsp;&nbsp;&nbsp;Here we will build a classifier of [tweets](https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv) about six US airlines. The task is to predict whether a tweet expresses positive, negative, or neutral sentiment about the airline. This is a typical supervised learning task: given a text string, we have to assign it to one of a set of predefined categories.



In [33]:
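# Note: this github.com/.../blob/... address returns GitHub's HTML page for the file (the log below reports text/html), not the CSV itself; the raw.githubusercontent.com link in the paragraph above serves the CSV directly.
!wget https://github.com/satyajeetkrjha/kaggle-Twitter-US-Airline-Sentiment-/blob/master/Tweets.csv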

--2022-08-10 06:52:41--  https://github.com/satyajeetkrjha/kaggle-Twitter-US-Airline-Sentiment-/blob/master/Tweets.csv
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘Tweets.csv’

Tweets.csv              [   <=>              ] 136.44K   214KB/s    in 0.6s    

2022-08-10 06:52:43 (214 KB/s) - ‘Tweets.csv’ saved [139717]



In [34]:
!ls

cleaned_fintech_data.csv  Tweets.csv


### Import required libraries

In [35]:
import numpy as np 
import pandas as pd 
import re
import nltk 
import matplotlib.pyplot as plt
%matplotlib inline

In [42]:
# Read the Data
data = pd.read_csv('/content/airlines.csv')

In [45]:
data.to_csv('drive/MyDrive/data/airlines_data.csv', index=False)

FileNotFoundError: ignored

In [None]:
print(imdb_train.shape)
print(imdb_test.shape)

NameError: ignored

### Data Preprocessing

&nbsp;&nbsp;&nbsp;&nbsp;After the dataset has been downloaded and extracted from archive we have to transform it into a more suitable form for feeding it into a machine learning model for training. We will start by combining all review data into 2 pandas Data Frames representing the train and test datasets, and then saving them as csv files: *imdb_train.csv* and *imdb_test.csv*.  

&nbsp;&nbsp;&nbsp;&nbsp;The Data Frames will have the following form:  

|text       |label      |
|:---------:|:---------:|
|review1    |0          |
|review2    |1          |
|review3    |1          |
|.......    |...        |
|reviewN    |0          |  

&nbsp;&nbsp;&nbsp;&nbsp;where:  
- review1, review2, ... = the actual text of movie review  
- 0 = negative review  
- 1 = positive review

<b>But machine learning algorithms work only with numerical values.</b> We can't just input the text itself into a machine learning model and have it learn from that. We have to, somehow, <b>represent the text by numbers or vectors of numbers</b>. One way of doing this is by using the **Bag-of-words** model<sup>(3)</sup>, in which a piece of text (often called a **document**) is represented by a <b>vector of the counts of words from a vocabulary in that document. This model doesn't take into account grammar rules or word ordering; all it considers is the frequency of words</b>. If we use the counts of each word independently we name this representation a **unigram**. In general, in an **n-gram** we take into account the counts of <b>each combination of n words from the vocabulary that appears in a given document</b>.  

&nbsp;&nbsp;&nbsp;&nbsp;For example, consider these two documents:  
<br>  
<div style="font-family: monospace;"><center><b>d1: "I am learning"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</b></center></div>  
<div style="font-family: monospace;"><center><b>d2: "Machine learning is cool"</b></center></div>  
<br>
The vocabulary of all words encountered in these two sentences is: 

<br/>  
<div style="font-family: monospace;"><center><b>v: [ I, am, learning, machine, is, cool ]</b></center></div>   
<br>
&nbsp;&nbsp;&nbsp;&nbsp;The unigram representations of d1 and d2:  
<br>  

|unigram(d1)|I       |am      |learning|machine |is      |cool    |
|:---------:|:------:|:------:|:------:|:------:|:------:|:------:|
|           |1       |1       |1       |0       |0       |0       |  

|unigram(d2)|I       |am      |learning|machine |is      |cool    |
|:---------:|:------:|:------:|:------:|:------:|:------:|:------:|
|           |0       |0       |1       |1       |1       |1       |
  
&nbsp;&nbsp;&nbsp;&nbsp;And, the bigrams of d1 and d2 are:
  
|bigram(d1) |I I     |I am    |I learning|...|machine am|machine learning|...|cool is|cool cool|
|:---------:|:------:|:------:|:--------:|:-:|:--------:|:--------------:|:-:|:-----:|:-------:|
|           |0       |1       |0         |...|0         |0               |...|0      |0        |  

|bigram(d2) |I I     |I am    |I learning|...|machine am|machine learning|...|cool is|cool cool|
|:---------:|:------:|:------:|:--------:|:-:|:--------:|:--------------:|:-:|:-----:|:-------:|
|           |0       |0       |0         |...|0         |1               |...|0      |0        |
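
The tables above can be reproduced with a short pure-Python sketch. The helper `ngram_counts` is a hypothetical name introduced here, and lowercasing plus whitespace splitting are simplifying assumptions. It counts only the n-grams that actually occur, so every combination missing from its output corresponds to a 0 in the tables.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams (runs of n adjacent words) in a lowercased text."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams)

d1 = "I am learning"
d2 = "Machine learning is cool"

print(ngram_counts(d1, 1))  # unigram counts of d1
print(ngram_counts(d2, 2))  # bigram counts of d2
```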

&nbsp;&nbsp;&nbsp;&nbsp;Often, we can achieve slightly better results if instead of counts of words we use something called **term frequency times inverse document frequency** (or **tf-idf**). Maybe it sounds complicated, but it is not. Bear with me, I will explain this. The intuition behind this is the following. So, what's the problem of using just the frequency of terms inside a document? <b>Although some terms may have a high frequency inside documents they may not be so relevant for describing a given document in which they appear. That's because those terms may also have a high frequency across the collection of all documents</b>. For example, a collection of movie reviews may have terms specific to movies/cinematography that are present in almost all documents (they have a high **document frequency**). So, when we encounter those terms in a document this doesn't tell much about whether it is a positive or negative review. We need a way of relating **term frequency** (how frequent a term is inside a document) to **document frequency** (how frequent a term is across the whole collection of documents). That is:  
  
$$\begin{align}\frac{\text{term frequency}}{\text{document frequency}} &= \text{term frequency} \cdot \frac{1}{\text{document frequency}} \\ &= \text{term frequency} \cdot \text{inverse document frequency} \\ &= \text{tf} \cdot \text{idf}\end{align}$$  
  
&nbsp;&nbsp;&nbsp;&nbsp;Now, there are more ways used to describe both term frequency and inverse document frequency. But the most common way is by putting them on a logarithmic scale:  
  
$$tf(t, d) = \log(1+f_{t,d})$$  
$$idf(t) = \log\left(\frac{1+N}{1+n_t}\right)$$  
  
&nbsp;&nbsp;&nbsp;&nbsp;where:  
$$\begin{align}f_{t,d} &= \text{count of term } \textbf{t} \text{ in document } \textbf{d} \\  
N &= \text{total number of documents} \\  
n_t &= \text{number of documents that contain term } \textbf{t}\end{align}$$  
  
<b>We added 1 in the first logarithm to avoid getting $-\infty$ when $f_{t,d}$ is 0. In the second logarithm we added one fake document to avoid division by zero.</b>
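
As a worked example of the two formulas above, the following sketch computes tf, idf and their product on three tiny, made-up documents. The function names `tf` and `idf` and the toy data are assumptions for illustration. Note also that scikit-learn's `TfidfTransformer` uses a somewhat different variant by default (raw counts for tf, an extra +1 added to idf, and L2 normalisation of each row), so its numbers will not match these exactly.

```python
import math

# Hypothetical, already-tokenized documents
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def tf(term, doc):
    """tf(t, d) = log(1 + f_{t,d})"""
    return math.log(1 + doc.count(term))

def idf(term, docs):
    """idf(t) = log((1 + N) / (1 + n_t))"""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + n_t))

# tf-idf of a few terms in the first document
for term in ["good", "movie", "plot"]:
    print(term, round(tf(term, docs[0]) * idf(term, docs), 3))
```

A term that appears in fewer documents (here "plot") gets a larger idf than one spread across the collection (here "movie"), which is exactly the re-weighting described above.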

Before we transform our data into vectors of counts or tf-idf values we should remove English **stopwords**<sup>(6)(7)</sup>. <b>Stopwords are words that are very common in a language</b> and are usually removed in the preprocessing stage of natural text-related tasks like sentiment analysis or search.

<b>Note that we should construct our vocabulary only based on the training set. When we will process the test data in order to make predictions we should use only the vocabulary constructed in the training phase, the rest of the words will be ignored.</b>

&nbsp;&nbsp;&nbsp;&nbsp;Now, let's create the data frames and save them as csv files:

### Text Vectorization

Fortunately, for the text vectorization part all the hard work is already done in the Scikit-Learn classes `CountVectorizer`<sup>(8)</sup> and `TfidfTransformer`<sup>(5)</sup>. We will use these classes to transform our csv files into unigram and bigram matrices (using both counts and tf-idf values). (<b>It turns out that if we only use n-grams for a large n we don't get good accuracy, so we usually use all n-grams up to some n. So, when we say bigrams here we actually mean uni+bigrams, and when we say unigrams we mean just unigrams.</b>) Each row in those matrices will represent a document (review) in our dataset, and each column will represent values associated with each word in the vocabulary (in the case of unigrams) or values associated with each combination of at most 2 words in the vocabulary (bigrams).  

&nbsp;&nbsp;&nbsp;&nbsp;`CountVectorizer` has a parameter `ngram_range` which expects a tuple of size 2 that controls what n-grams to include. After we constructed a `CountVectorizer` object we should call `.fit()` method with the actual text as a parameter, in order for it to learn the required statistics of our collection of documents. Then, by calling `.transform()` method with our collection of documents it returns the matrix for the n-gram range specified. As the class name suggests, this matrix will contain just the counts. To obtain the tf-idf values, the class `TfidfTransformer` should be used. It has the `.fit()` and `.transform()` methods that are used in a similar way with those of `CountVectorizer`, but they take as input the counts matrix obtained in the previous step and `.transform()` will return a matrix with tf-idf values. We should use `.fit()` only on training data and then store these objects. When we want to evaluate the test score or whenever we want to make a prediction we should use these objects to transform the data before feeding it into our classifier.  
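
The fit-on-train, transform-on-test workflow described above can be sketched on a handful of made-up sentences. The toy documents and variable names are assumptions for illustration; only `CountVectorizer`, `TfidfTransformer` and their `.fit()` / `.transform()` methods come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical training and test documents
train_docs = ["the flight was great", "the flight was late", "great crew"]
test_docs = ["late flight and rude crew"]

# Learn the uni+bigram vocabulary from the training data only
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(train_docs)
X_train_counts = vectorizer.transform(train_docs)

# Learn the idf weights from the training counts only
tfidf = TfidfTransformer()
tfidf.fit(X_train_counts)
X_train_tfidf = tfidf.transform(X_train_counts)

# Reuse the fitted objects on the test data; unseen words are ignored
X_test_counts = vectorizer.transform(test_docs)
X_test_tfidf = tfidf.transform(X_test_counts)

print(X_train_counts.shape, X_test_counts.shape)
print("rude" in vectorizer.vocabulary_)
```

Both matrices have the same number of columns because the vocabulary is fixed at fit time, which is why the fitted objects have to be stored and reused at prediction time.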

&nbsp;&nbsp;&nbsp;&nbsp;Note that the matrices generated for our train or test data will be huge, and if we store them as normal numpy arrays they may not even fit into RAM. But most of the entries in these matrices will be zero. So, these Scikit-Learn classes use Scipy sparse matrices<sup>(9)</sup> (`csr_matrix`<sup>(10)</sup>, to be more exact), which store just the non-zero entries and save a LOT of space.  

&nbsp;&nbsp;&nbsp;&nbsp;We will use a linear classifier with stochastic gradient descent, `sklearn.linear_model.SGDClassifier`<sup>(11)</sup>, as our model. First we will generate and save our data in 4 forms: unigram and bigram matrix (with both counts and tf-idf values for each). Then we will train and evaluate our model for each of these 4 data representations using `SGDClassifier` with the default parameters. After that, we choose the data representation which led to the best score and tune the hyper-parameters of our model with this data form using cross-validation in order to obtain the best results.

<b>Refs:</b> 
* Convert a collection of text documents to a matrix of token counts: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
* Convert a collection of raw documents to a matrix of TF-IDF features: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from joblib import dump, load # used for saving and loading sklearn objects
from scipy.sparse import save_npz, load_npz # used for saving and loading sparse matrices

In [None]:
!mkdir 'data_preprocessors'
!mkdir 'vectorized_data'

#### Unigram Counts

In [None]:
%%time
# TRAINING
# unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
# unigram_vectorizer.fit(imdb_train['text'].values)
# dump(unigram_vectorizer, 'data_preprocessors/unigram_vectorizer.joblib')

# TESTING
unigram_vectorizer = load('data_preprocessors/unigram_vectorizer.joblib')

In [None]:
unigram_vectorizer.vocabulary_

In [None]:
# vocabulary_ maps each term to its column index in the count matrix, not to a frequency
df = pd.DataFrame(unigram_vectorizer.vocabulary_.items(), columns=['Vocabulary', 'Index'])

In [None]:
df.info()

In [None]:
df.sort_values(by="Index", ascending=False, inplace=True)

In [None]:
df.head(n=20)

In [None]:
df.tail(n=20)

In [None]:
%%time
# TRAINING
# X_train_unigram = unigram_vectorizer.transform(imdb_train['text'].values)
# save_npz('vectorized_data/X_train_unigram.npz', X_train_unigram)

# TESTING
X_train_unigram = load_npz('vectorized_data/X_train_unigram.npz')

<b>Note:</b> `fit_transform` combines the `.fit()` and `.transform()` steps in a single call.

#### Unigram Tf-Idf

In [None]:
%%time
# TRAINING
# unigram_tf_idf_transformer = TfidfTransformer()
# unigram_tf_idf_transformer.fit(X_train_unigram)
# dump(unigram_tf_idf_transformer, 'data_preprocessors/unigram_tf_idf_transformer.joblib')

# TESTING
unigram_tf_idf_transformer = load('data_preprocessors/unigram_tf_idf_transformer.joblib') 

In [None]:
%%time
# TRAINING
# X_train_unigram_tf_idf = unigram_tf_idf_transformer.transform(X_train_unigram)
# save_npz('vectorized_data/X_train_unigram_tf_idf.npz', X_train_unigram_tf_idf)

# TESTING
X_train_unigram_tf_idf = load_npz('vectorized_data/X_train_unigram_tf_idf.npz')

#### Bigram Counts

In [None]:
%%time
# TRAINING
# bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
# bigram_vectorizer.fit(imdb_train['text'].values)
# dump(bigram_vectorizer, 'data_preprocessors/bigram_vectorizer.joblib')

# TESTING
bigram_vectorizer = load('data_preprocessors/bigram_vectorizer.joblib')

In [None]:
%%time
# TRAINING
X_train_bigram = bigram_vectorizer.transform(imdb_train['text'].values)
save_npz('vectorized_data/X_train_bigram.npz', X_train_bigram)

# TESTING
X_train_bigram = load_npz('vectorized_data/X_train_bigram.npz')

#### Bigram Tf-Idf

In [None]:
%%time
# TRAINING
bigram_tf_idf_transformer = TfidfTransformer()
bigram_tf_idf_transformer.fit(X_train_bigram)
dump(bigram_tf_idf_transformer, 'data_preprocessors/bigram_tf_idf_transformer.joblib')

# TESTING
bigram_tf_idf_transformer = load('data_preprocessors/bigram_tf_idf_transformer.joblib')

In [None]:
X_train_bigram_tf_idf = bigram_tf_idf_transformer.transform(X_train_bigram)
save_npz('vectorized_data/X_train_bigram_tf_idf.npz', X_train_bigram_tf_idf)

# X_train_bigram_tf_idf = load_npz('vectorized_data/X_train_bigram_tf_idf.npz')

#### Choosing data format

&nbsp;&nbsp;&nbsp;&nbsp;Now, for each data form we split it into train & validation sets, train a `SGDClassifier` and output the score.

In [None]:
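from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

def train_and_show_scores(X: csr_matrix, y: np.ndarray, title: str) -> None:
    # Hold out 25% of the rows for validation, keeping the class balance
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.75, stratify=y
    )

    clf = SGDClassifier()
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)
    print(f'{title}\nTrain score: {round(train_score, 2)} ; Validation score: {round(valid_score, 2)}\n')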

In [None]:
y_train = imdb_train['label'].values

In [None]:
train_and_show_scores(X_train_unigram, y_train, 'Unigram Counts')
train_and_show_scores(X_train_unigram_tf_idf, y_train, 'Unigram Tf-Idf')
train_and_show_scores(X_train_bigram, y_train, 'Bigram Counts')
train_and_show_scores(X_train_bigram_tf_idf, y_train, 'Bigram Tf-Idf')

&nbsp;&nbsp;&nbsp;&nbsp;The best data form seems to be **bigram with tf-idf** as it gets the highest validation accuracy: **0.9**; we will use it next for hyper-parameter tuning.

#### Task (Optional)

<h2>Using the processed Twitter data from yesterday's challenge</h2>


- Form a new data frame (named `cleanTweet`), containing columns $\textbf{clean-text}$ and $\textbf{polarity}$.

- Write a function `text_category` that takes a value `p` and returns, depending on the value of p, a string `'positive'`, `'negative'` or `'neutral'`.

- Apply this function (`text_category`) to the $\textbf{polarity}$ column of `cleanTweet` to form a new column called $\textbf{score}$ in `cleanTweet`.

- Visualize the $\textbf{score}$ column using a pie chart and a bar chart.
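
One possible shape for `text_category`, as a minimal sketch: it assumes the category is decided by the sign of the polarity, which matches the later step where polarity 0 is treated as neutral. The sample polarity values are made up.

```python
def text_category(p):
    """Map a polarity value to 'positive', 'negative' or 'neutral' by its sign."""
    if p > 0:
        return "positive"
    if p < 0:
        return "negative"
    return "neutral"

# Hypothetical polarity values standing in for the polarity column
polarities = [0.32, 0.0, -0.5]
scores = [text_category(p) for p in polarities]
print(scores)
```

On a data frame, the same function can be passed to the `apply` method of the polarity column to build the new column.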

<h5>Now we want to build a classification model on the clean tweet following the steps below:</h5>

* Remove rows from `cleanTweet` where $\textbf{polarity}$ $= 0$ (i.e. where $\textbf{score}$ is `'neutral'`) and reset the frame index.
* Construct a column $\textbf{scoremap}$ by applying the mapping {'positive':1, 'negative':0} to the $\textbf{score}$ column.
* Create feature and target variables `(X,y)` from the $\textbf{clean-text}$ and $\textbf{scoremap}$ columns respectively.
* Use the `train_test_split` function to construct `(X_train, y_train)` and `(X_test, y_test)` from `(X,y)`.

* Build an `SGDClassifier` model from the vectorized training text. Use `CountVectorizer()` configured for $\textit{trigrams}$.

* Evaluate your model on the test data.


#### EXTENSION

#### Using Cross-Validation for hyperparameter tuning

&nbsp;&nbsp;&nbsp;&nbsp;For this part we will use `RandomizedSearchCV`<sup>(12)</sup>, which chooses the parameters randomly from the list that we give, or according to the distribution that we specify from `scipy.stats` (e.g. uniform). It then estimates the test error by doing cross-validation, and after all iterations we can find the best estimator, the best parameters and the best score in the attributes `best_estimator_`, `best_params_` and `best_score_`.  

&nbsp;&nbsp;&nbsp;&nbsp;Because the search space for the parameters that we want to test is very large, and finding the best combination may need a huge number of iterations, we will split the set of parameters in two and tune in two phases. First we will find the best combination of `loss`, `learning_rate` and `eta0` (i.e. the initial learning rate), and then the best combination of `penalty` and `alpha`.
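
One detail is worth spelling out before running the search: `scipy.stats.uniform(loc, scale)` describes the interval `[loc, loc + scale]`, not `[loc, scale]`. The short sketch below checks this for the `eta0` distribution used in Phase 1 and draws a few values by calling `rvs`, which is how `RandomizedSearchCV` samples from a distribution. The sample count and the seed are arbitrary choices.

```python
from scipy.stats import uniform

# Same distribution as the one used for eta0 in Phase 1
eta0_dist = uniform(loc=1e-7, scale=1e-2)

# The ends of the interval are loc and loc + scale
lower, upper = eta0_dist.ppf(0), eta0_dist.ppf(1)
print(lower, upper)

# Draw a few candidate values; random_state is fixed only for repeatability
samples = eta0_dist.rvs(size=5, random_state=0)
print(samples)
```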

In [None]:
X_train = X_train_bigram_tf_idf

##### Phase 1: loss, learning rate and initial learning rate

In [None]:
clf = SGDClassifier()

In [None]:
distributions = dict(
    # 'log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
    loss=['hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron'],
    learning_rate=['optimal', 'invscaling', 'adaptive'],
    eta0=uniform(loc=1e-7, scale=1e-2)
)

In [None]:
random_search_cv = RandomizedSearchCV(
    estimator=clf,
    param_distributions=distributions,
    cv=5,
    n_iter=50
)
random_search_cv.fit(X_train, y_train)
print(f'Best params: {random_search_cv.best_params_}')
print(f'Best score: {random_search_cv.best_score_}')

&nbsp;&nbsp;&nbsp;&nbsp;Because `learning_rate='optimal'` came out best, we will ignore `eta0` (the initial learning rate): it is not used when `learning_rate='optimal'`, and the value reported for it is just an artifact of the random sampling.
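
A quick way to convince yourself that `eta0` plays no role under `learning_rate='optimal'` is to fit two classifiers that differ only in `eta0` and compare their weights. The sketch below does this on synthetic data; the data set, the two `eta0` values and the `random_state` are assumptions made purely for illustration. If `eta0` is indeed ignored, the two coefficient vectors should match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data, used only to illustrate the point
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two models that differ only in eta0; the seed is shared so that the
# shuffling of samples is identical in both fits
clf_small = SGDClassifier(learning_rate='optimal', eta0=1e-4, random_state=0).fit(X, y)
clf_large = SGDClassifier(learning_rate='optimal', eta0=1e-2, random_state=0).fit(X, y)

# Compare the learned weights
same_weights = np.allclose(clf_small.coef_, clf_large.coef_)
print(same_weights)
```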

##### Phase 2: Penalty and alpha

In [None]:
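# Carry over the Phase 1 winners (see the summary below) so that only penalty and alpha vary here
clf = SGDClassifier(loss='squared_hinge', learning_rate='optimal')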

In [None]:
distributions = dict(
    penalty=['l1', 'l2', 'elasticnet'],
    alpha=uniform(loc=1e-6, scale=1e-4)
)

In [None]:
random_search_cv = RandomizedSearchCV(
    estimator=clf,
    param_distributions=distributions,
    cv=5,
    n_iter=50
)
random_search_cv.fit(X_train, y_train)
print(f'Best params: {random_search_cv.best_params_}')
print(f'Best score: {random_search_cv.best_score_}')

So, the best parameters that I got are:

* `loss`: squared_hinge
* `learning_rate`: optimal
* `penalty`: l2
* `alpha`: 1.2101013664295101e-05

##### Saving the best classifier

In [None]:
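from joblib import dump, load

sgd_classifier = random_search_cv.best_estimator_

# Make sure the target directory exists before writing the model to disk
os.makedirs('classifiers', exist_ok=True)
dump(sgd_classifier, 'classifiers/sgd_classifier.joblib')

# To reload the saved model later:
# sgd_classifier = load('classifiers/sgd_classifier.joblib')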

##### Testing Model

In [None]:
X_test = bigram_vectorizer.transform(imdb_test['text'].values)
X_test = bigram_tf_idf_transformer.transform(X_test)
y_test = imdb_test['label'].values

In [None]:
score = sgd_classifier.score(X_test, y_test)
print(score)

&nbsp;&nbsp;&nbsp;&nbsp;We got **90.18%** test accuracy, which is not bad for our simple linear model. More advanced methods give better results: the state-of-the-art reported on this dataset in <sup>(13)</sup> is **97.42%**.

## References

<sup>(1)</sup> &nbsp;[Sentiment Analysis - Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis)  
<sup>(2)</sup> &nbsp;[Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)  
<sup>(3)</sup> &nbsp;[Bag-of-words model - Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)  
<sup>(4)</sup> &nbsp;[Tf-idf - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)  
<sup>(5)</sup> &nbsp;[TfidfTransformer - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)  
<sup>(6)</sup> &nbsp;[Stop words - Wikipedia](https://en.wikipedia.org/wiki/Stop_words)  
<sup>(7)</sup> &nbsp;[A list of English stopwords](https://gist.github.com/sebleier/554280)  
<sup>(8)</sup> &nbsp;[CountVectorizer - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)  
<sup>(9)</sup> &nbsp;[Scipy sparse matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html)  
<sup>(10)</sup> [Compressed Sparse Row matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix)  
<sup>(11)</sup> [SGDClassifier - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)  
<sup>(12)</sup> [RandomizedSearchCV - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)  
<sup>(13)</sup> [Sentiment Classification using Document Embeddings trained with Cosine Similarity](https://www.aclweb.org/anthology/P19-2057.pdf)  