<a href="https://colab.research.google.com/github/Aaron-anayst/Twitter-Data-Analysis/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of Contents
0. [Introduction](#introduction)
1. [pandas](#Pandas)
2. [Matplotlib](#Matplotlib)
3. [sklearn](#sklearn)
4. [Modeling](#Modeling)
5. [Deployment](#Deployment)
6. [References](#References)

# Introduction

## Task 2 Steps
5. ML training and validation - a code base to train topic-modelling and sentiment-analysis models.
6. Deployment - a code base to build and deploy a dashboard that exposes the trained models.
7. Model performance analyser - integrate MLWatcher or a similar tool to monitor model performance at prediction time.
8. A data-drift or model-underperformance trigger - a mechanism that raises an alert when model performance falls below a threshold or when incoming data has drifted from the data used to train the model.

## Data Gathering

### Downloading & extracting data

In [1]:
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=9bb7440ce4b4de24c686059b4eebfd4c5f0a89551e9be2897515c0dc6b1c13fd
  Stored in directory: /root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [2]:
!wget "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" 

--2022-04-27 08:45:13--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-04-27 08:45:20 (11.8 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [3]:
!curl "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz" -o aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  16.1M      0  0:00:04  0:00:04 --:--:-- 16.4M


In [4]:
!tar -xzf "aclImdb_v1.tar.gz"

In [5]:
!7z x aclImdb_v1.tar.gz


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 84125825 bytes (81 MiB)

Extracting archive: aclImdb_v1.tar.gz
--
Path = aclImdb_v1.tar.gz
Type = gzip
Headers Size = 10


In [6]:
!ls -la

total 373356
drwxr-xr-x 1 root root      4096 Apr 27 08:45 .
drwxr-xr-x 1 root root      4096 Apr 27 08:44 ..
drwxr-xr-x 4 7297 1000      4096 Jun 26  2011 aclImdb
-rw-r--r-- 1 root root 298168320 Jun 26  2011 aclImdb_v1.tar
-rw-r--r-- 1 root root  84125825 Apr 27 08:45 aclImdb_v1.tar.gz
drwxr-xr-x 4 root root      4096 Apr 25 13:45 .config
drwxr-xr-x 1 root root      4096 Apr 25 13:46 sample_data


In [7]:
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [8]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [9]:
!ls aclImdb/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [10]:
!head -10 aclImdb/imdbEr.txt

0.0490972013402
0.201363575849
0.0333946807184
0.099837669572
-0.0790210365788
0.188660139871
0.00712569582356
0.109215821589
-0.154986397986
-0.222690363917


In [11]:
!head -10 aclImdb/imdb.vocab

the
and
a
of
to
is
it
in
i
this


In [12]:
!cat aclImdb/README

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 

Dataset 

The core dataset contains 50,000 reviews split evenly into 25k train
and 25k test sets. The overall distribution of labels is balanced (25k
pos and 25k neg). We also include an additional 50,000 unlabeled
documents for unsupervised learning. 

In the entire collection, no more than 30 reviews are allowed for any
given movie because reviews for the same movie tend to have correlated
ratings. Further, the train and test sets contain a disjoint set of
movies, so no significant performance is obtained by memorizing
movie-unique terms and their associated with observed labels.  In the
labeled train/test sets, a negative review has a score <= 4 out of 10,
and a positive review has a scor

## Environment Setup

* Download Week 0 Tuesday folder onto your local filesystem (or wherever you're working)
* Create folders you will later need

In [13]:


!mkdir 'csv' 
!mkdir 'data_preprocessors' 
!mkdir 'vectorized_data' 
!mkdir 'classifiers'

In [14]:

!mkdir csv 
!mkdir data_preprocessors 
!mkdir vectorized_data 
!mkdir classifiers

mkdir: cannot create directory ‘csv’: File exists
mkdir: cannot create directory ‘data_preprocessors’: File exists
mkdir: cannot create directory ‘vectorized_data’: File exists
mkdir: cannot create directory ‘classifiers’: File exists


In [15]:
!mkdir -p csv data_preprocessors vectorized_data classifiers

## Imports

In [16]:
# pandas library and other Python modules
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
from os import listdir
from os.path import isfile, join
from random import shuffle

In [17]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

In [18]:
import numpy as np # linear algebra
from joblib import dump, load # used for saving and loading sklearn objects
from scipy.sparse import save_npz, load_npz # used for saving and loading sparse matrices
from scipy.stats import uniform
from scipy.sparse import csr_matrix

* <b>joblib</b>: for scikit-learn objects it may be better to use joblib's replacement for pickle (`dump` and `load`). It is more efficient on objects that carry large numpy arrays internally, as fitted scikit-learn estimators often do, but it can only pickle to disk and not to a string.

# pandas

Pandas is a Python package for querying and analysing data. It is well known for its versatility in reading many types of data files, and it also supports simple data munging and visualization.

https://pandas.pydata.org/

https://pandas.pydata.org/docs/getting_started/index.html

https://pandas.pydata.org/docs/getting_started/intro_tutorials/

https://pandas.pydata.org/docs/reference/index.html

<b> Pandas is heavily dependent on numpy. It borrows its philosophy from R dataframes. </b>

### Data Preprocessing

In [19]:
def create_data_frame(folder: str) -> pd.DataFrame:
    '''
    folder - the root folder of train or test dataset
    Returns: a DataFrame with the combined data from the input folder
    '''
    pos_folder = f'{folder}/pos' # positive reviews
    neg_folder = f'{folder}/neg' # negative reviews
    
    def get_files(fld: str) -> list:
        '''
        fld - positive or negative reviews folder
        Returns: a list with all files in input folder
        '''
        return [join(fld, f) for f in listdir(fld) if isfile(join(fld, f))]
    
    def append_files_data(data_list: list, files: list, label: int) -> None:
        '''
        Appends to 'data_list' tuples of form (file content, label)
        for each file in 'files' input list
        '''
        for file_path in files:
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()
                data_list.append((text, label))
    
    pos_files = get_files(pos_folder)
    neg_files = get_files(neg_folder)
    
    data_list = []
    append_files_data(data_list, pos_files, 1)
    append_files_data(data_list, neg_files, 0)
    shuffle(data_list)
    
    text, label = tuple(zip(*data_list))
    # replacing runs of HTML line-break tags (<br>, <br/>, <br />) with a single space
    text = list(map(lambda txt: re.sub(r'(<br\s*/?>)+', ' ', txt), text))
    
    return pd.DataFrame({'text': text, 'label': label})
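
The clean-up step at the end of `create_data_frame` can be tried on its own. The following is a minimal sketch using the same regular expression; the helper name `strip_line_breaks` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

# Same pattern as in create_data_frame: one or more consecutive
# <br>, <br/> or <br /> tags are treated as a single run.
BR_TAGS = re.compile(r'(<br\s*/?>)+')

def strip_line_breaks(text: str) -> str:
    """Replace each run of HTML line-break tags with a single space."""
    return BR_TAGS.sub(' ', text)

# Hypothetical review snippet, for illustration only.
sample = 'Great film.<br /><br />Would watch again.<br>Really.'
cleaned = strip_line_breaks(sample)
print(cleaned)
```

Because the pattern ends in `+`, two adjacent tags collapse into one space rather than two.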

In [24]:
%%time
imdb_train = create_data_frame('aclImdb/train')
imdb_test = create_data_frame('aclImdb/test')

imdb_train.to_csv('csv/imdb_train.csv', index=False)
imdb_test.to_csv('csv/imdb_test.csv', index=False)

imdb_train = pd.read_csv('csv/imdb_train.csv')
imdb_test = pd.read_csv('csv/imdb_test.csv')

CPU times: user 5.58 s, sys: 719 ms, total: 6.3 s
Wall time: 6.33 s


In [25]:
imdb_train.head(n=10)

Unnamed: 0,text,label
0,"What's wrong with this film? Many, many things...",0
1,it is of course very nice to see improvements ...,0
2,LES CONVOYEURS ATTENDENT was the first film I ...,1
3,"I heard and read many praising things about ""M...",0
4,If you were ever sad for not being able to get...,1
5,As an old white housewife I can still apprecia...,1
6,I am glad being able to say almost only positi...,1
7,DO NOT WATCH THIS MOVIE IF YOU LOVED THE CLASS...,0
8,Lisa Baumer (Ida Galli) is the adulteress wife...,1
9,Sometimes a movie is so comprehensively awful ...,0


In [26]:
imdb_test.head(n=10)

Unnamed: 0,text,label
0,"Although Flatliners is 15 years old, tonight w...",1
1,This is a total piece of crap. It is an insult...,0
2,"...un-funny and un-entertaining, possibly the ...",0
3,It was high time a movie about the situation i...,1
4,"My Caddy Limo was destroyed!!! Well, I had one...",0
5,Watching this on Comcast On-Demand. Every time...,1
6,This is a cult film for many reasons. First be...,1
7,One of the Message Boards threads at IMDb had ...,1
8,This movie is bad as we all knew it would be. ...,0
9,"Once upon a time, in Sweden, there was a poor ...",1


In [27]:
print(imdb_train.loc[0, "text"])

What's wrong with this film? Many, many things. The editing tries too hard to look good, and does nothing but confuse the viewer whilst also supplying him/her with a powerful headache. The plot is muddled and obviously prolonged from what started as a short (story or film). The plot only makes for less than ten minutes of good story, and this is just stretched out painfully until it reached the minimum length for a feature film. We all know what happens to things when we stretch them, right? Exactly. They get thinner. In the end, the plot is just so paper-thin that you might even miss it, if you aren't paying attention, which is hard to do when watching this movie. The acting is not even slightly impressive. The characters are poorly written and dull, uninteresting. One of the worst things that are wrong with this film is that apparently, whoever was in charge of the score/soundtrack had no idea what the movie was, or what it was supposed to be about(not that I blame him, I couldn't fi

### loc vs iloc
s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])

49    a
48    b
47    c
0     d
1     e
2     f

s.loc[0]    # value at index label 0

'd'

s.iloc[0]   # value at index location 0

'a'

s.loc[0:1]  # rows at index labels between 0 and 1 (inclusive)

0    d
1    e

s.iloc[0:1] # rows at integer positions from 0 up to, but not including, 1

49    a

Ref: https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different

# Matplotlib

Matplotlib is a Python library dedicated to visualization. It is used for 2D and 3D plots and for animations, and it serves as a dependency for numerous other libraries.

https://matplotlib.org/1.3.1/index.html

https://github.com/rougier/matplotlib-tutorial

https://matplotlib.org/3.2.2/gallery/index.html

The `pyplot` module is the most commonly used part of the library.

<b> Other visualisation libraries built on top of Matplotlib: </b>
* seaborn



# sklearn

Also known as Scikit-Learn, <b>sklearn</b> is an almost-complete library for data analysis and modelling. It contains a wide range of statistical models and a few neural-network models that one might need in an analytics problem. While there are many modelling libraries, <b>sklearn</b> on its own houses numerous modelling techniques, organised as modules and functions that together make up the library.

https://scikit-learn.org/stable/

https://scikit-learn.org/stable/getting_started.html

### Other libraries for NLP
* HuggingFace
* Spacy

# Modeling

Unlike data wrangling, modelling is widely regarded as the interesting part of data analysis. The steps are as follows:

$\bullet$ determine the set of algorithms to try on the data (classification, regression, neural-net etc).

$\bullet$ model design - data splitting.

$\bullet$ model building

$\bullet$ evaluation (metrics)

$\bullet$ model review

<h2><a href="https://github.com/lazuxd/simple-imdb-sentiment-analysis/blob/master/sentiment-analysis.ipynb"> Notebook reference </a></h2>

# Building a Sentiment Classifier using Scikit-Learn

<center><img src="https://raw.githubusercontent.com/lazuxd/simple-imdb-sentiment-analysis/master/smiley.jpg"/></center>
<center><i>Image by AbsolutVision @ <a href="https://pixabay.com/ro/photos/smiley-emoticon-furie-sup%C4%83rat-2979107/">pixabay.com</a></i></center>

> &nbsp;&nbsp;&nbsp;&nbsp;**Sentiment analysis**, an important area in Natural Language Processing, is the process of automatically detecting affective states of text. Sentiment analysis is widely applied to voice-of-customer materials such as product reviews in online shopping websites like Amazon, movie reviews or social media. It can be just a basic task of classifying the polarity of a text as being positive/negative or it can go beyond polarity, looking at emotional states such as "happy", "angry", etc.

&nbsp;&nbsp;&nbsp;&nbsp;Here we will build a classifier that is able to distinguish movie reviews as being either positive or negative. For that, we will use [Large Movie Review Dataset v1.0](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz)<sup>(2)</sup> of IMDB movie reviews.
This dataset contains 50,000 movie reviews divided evenly into 25k train and 25k test. The labels are balanced between the two classes (positive and negative). <b>Reviews with a score <= 4 out of 10 are labeled negative and those with score >= 7 out of 10 are labeled positive. Neutral reviews are not included in the labeled data.</b> This dataset also contains unlabeled reviews for unsupervised learning; we will not use them here. <b>There are no more than 30 reviews for a particular movie because the ratings of the same movie tend to be correlated. All reviews for a given movie are either in train or test set but not in both, in order to avoid test accuracy gain by memorizing movie-specific terms.</b>



In [28]:
print(imdb_train.shape)
print(imdb_test.shape)

(25000, 2)
(25000, 2)


## Data preprocessing

&nbsp;&nbsp;&nbsp;&nbsp;After the dataset has been downloaded and extracted from archive we have to transform it into a more suitable form for feeding it into a machine learning model for training. We will start by combining all review data into 2 pandas Data Frames representing the train and test datasets, and then saving them as csv files: *imdb_train.csv* and *imdb_test.csv*.  

&nbsp;&nbsp;&nbsp;&nbsp;The Data Frames will have the following form:  

|text       |label      |
|:---------:|:---------:|
|review1    |0          |
|review2    |1          |
|review3    |1          |
|.......    |...        |
|reviewN    |0          |  

&nbsp;&nbsp;&nbsp;&nbsp;where:  
- review1, review2, ... = the actual text of movie review  
- 0 = negative review  
- 1 = positive review

<b>But machine learning algorithms work only with numerical values.</b> We can't just input the text itself into a machine learning model and have it learn from that. We have to, somehow, <b>represent the text by numbers or vectors of numbers</b>. One way of doing this is by using the **Bag-of-words** model<sup>(3)</sup>, in which a piece of text (often called a **document**) is represented by a <b>vector of the counts of words from a vocabulary in that document. This model doesn't take into account grammar rules or word ordering; all it considers is the frequency of words</b>. If we use the counts of each word independently we name this representation a **unigram**. In general, in an **n-gram** representation we take into account the counts of <b>each sequence of n consecutive words that appears in a given document</b>.  

&nbsp;&nbsp;&nbsp;&nbsp;For example, consider these two documents:  
<br>  
<div style="font-family: monospace;"><center><b>d1: "I am learning"&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</b></center></div>  
<div style="font-family: monospace;"><center><b>d2: "Machine learning is cool"</b></center></div>  
<br>
The vocabulary of all words encountered in these two sentences is: 

<br/>  
<div style="font-family: monospace;"><center><b>v: [ I, am, learning, machine, is, cool ]</b></center></div>   
<br>
&nbsp;&nbsp;&nbsp;&nbsp;The unigram representations of d1 and d2:  
<br>  

|unigram(d1)|I       |am      |learning|machine |is      |cool    |
|:---------:|:------:|:------:|:------:|:------:|:------:|:------:|
|           |1       |1       |1       |0       |0       |0       |  

|unigram(d2)|I       |am      |learning|machine |is      |cool    |
|:---------:|:------:|:------:|:------:|:------:|:------:|:------:|
|           |0       |0       |1       |1       |1       |1       |
  
&nbsp;&nbsp;&nbsp;&nbsp;And, the bigrams of d1 and d2 are:
  
|bigram(d1) |I I     |I am    |I learning|...|machine am|machine learning|...|cool is|cool cool|
|:---------:|:------:|:------:|:--------:|:-:|:--------:|:--------------:|:-:|:-----:|:-------:|
|           |0       |1       |0         |...|0         |0               |...|0      |0        |  

|bigram(d2) |I I     |I am    |I learning|...|machine am|machine learning|...|cool is|cool cool|
|:---------:|:------:|:------:|:--------:|:-:|:--------:|:--------------:|:-:|:-----:|:-------:|
|           |0       |0       |0         |...|0         |1               |...|0      |0        |
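
The tables above can be reproduced with a few lines of plain Python. This is only a sketch under simple assumptions (lower-casing and whitespace tokenization, which are choices made here and not necessarily what a real vectorizer does); the helper names are hypothetical.

```python
from collections import Counter

docs = {
    'd1': 'I am learning',
    'd2': 'Machine learning is cool',
}
# Vocabulary in the same order as in the text, lower-cased.
vocabulary = ['i', 'am', 'learning', 'machine', 'is', 'cool']

def tokenize(text):
    """Assumed tokenizer: lower-case, then split on whitespace."""
    return text.lower().split()

def unigram_vector(text):
    """Count of each vocabulary word in the document."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def bigrams(text):
    """Pairs of consecutive tokens that actually occur in the document."""
    tokens = tokenize(text)
    return list(zip(tokens, tokens[1:]))

for name, text in docs.items():
    print(name, unigram_vector(text), bigrams(text))
```

Only the bigrams that occur in a document get a non-zero count; every other column of the bigram table stays at 0, which is why these matrices are so sparse.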

&nbsp;&nbsp;&nbsp;&nbsp;Often, we can achieve slightly better results if instead of counts of words we use something called **term frequency times inverse document frequency** (or **tf-idf**). Maybe it sounds complicated, but it is not. Bear with me, I will explain this. The intuition behind this is the following. So, what's the problem of using just the frequency of terms inside a document? <b>Although some terms may have a high frequency inside documents they may not be so relevant for describing a given document in which they appear. That's because those terms may also have a high frequency across the collection of all documents</b>. For example, a collection of movie reviews may have terms specific to movies/cinematography that are present in almost all documents (they have a high **document frequency**). So, when we encounter those terms in a document this doesn't tell much about whether it is a positive or negative review. We need a way of relating **term frequency** (how frequent a term is inside a document) to **document frequency** (how frequent a term is across the whole collection of documents). That is:  
  
$$\begin{align}\frac{\text{term frequency}}{\text{document frequency}} &= \text{term frequency} \cdot \frac{1}{\text{document frequency}} \\ &= \text{term frequency} \cdot \text{inverse document frequency} \\ &= \text{tf} \cdot \text{idf}\end{align}$$  
  
&nbsp;&nbsp;&nbsp;&nbsp;There are several ways to define both term frequency and inverse document frequency, but a common one is to put them on a logarithmic scale:  
  
$$\text{tf}(t, d) = \log(1+f_{t,d})$$  
$$\text{idf}(t) = \log\left(\frac{1+N}{1+n_t}\right)$$  
  
&nbsp;&nbsp;&nbsp;&nbsp;where:  
$$\begin{align}f_{t,d} &= \text{count of term } \textbf{t} \text{ in document } \textbf{d} \\  
N &= \text{total number of documents} \\  
n_t &= \text{number of documents that contain term } \textbf{t}\end{align}$$  
  
<b>We added 1 in the first logarithm to avoid getting $-\infty$ when $f_{t,d}$ is 0. In the second logarithm we added one fake document to avoid division by zero.</b>
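
As a worked example, the two formulas above can be evaluated directly in plain Python. This is a sketch of the definitions as written here, on a made-up three-document corpus; scikit-learn's `TfidfTransformer` uses its own variant of these formulas, so its numbers will generally differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
documents = [
    'the movie was great',
    'the movie was awful',
    'the acting was great',
]
tokenized = [doc.split() for doc in documents]
N = len(tokenized)  # total number of documents

# n_t: number of documents that contain term t
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf(term, tokens):
    """tf(t, d) = log(1 + f_{t,d})"""
    return math.log(1 + tokens.count(term))

def idf(term):
    """idf(t) = log((1 + N) / (1 + n_t))"""
    return math.log((1 + N) / (1 + doc_freq[term]))

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

for term in ('the', 'great', 'awful'):
    print(term, round(tf_idf(term, tokenized[0]), 3))
```

In this corpus "the" appears in every document, so its idf, and therefore its tf-idf, is zero in the first document, while "great" keeps a positive weight. "awful" does not occur in the first document at all, so its term frequency is zero there.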

Before we transform our data into vectors of counts or tf-idf values we should remove English **stopwords**<sup>(6)(7)</sup>. <b>Stopwords are words that are very common in a language</b> and are usually removed in the preprocessing stage of natural-language tasks like sentiment analysis or search.
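
A minimal sketch of stopword removal follows. The stopword set here is a deliberately tiny, hypothetical list chosen for illustration; real lists are much longer. `CountVectorizer` also accepts a `stop_words` argument for this purpose, which the cells in this notebook do not set.

```python
# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {'the', 'a', 'an', 'is', 'it', 'of', 'and', 'to', 'this'}

def remove_stopwords(text):
    """Lower-case, split on whitespace, and drop tokens found in STOPWORDS."""
    return [token for token in text.lower().split() if token not in STOPWORDS]

tokens = remove_stopwords('This is the best film of the year')
print(tokens)
```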

<b>Note that we should construct our vocabulary based only on the training set. When we process the test data in order to make predictions, we should use only the vocabulary constructed in the training phase; words outside it are ignored.</b>
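
To make the train-only vocabulary rule concrete, here is a toy vectorizer written for this illustration. It is a simplified stand-in with hypothetical names, not scikit-learn's implementation; it only mimics the fit-then-transform pattern.

```python
from collections import Counter

class ToyCountVectorizer:
    """Toy stand-in for illustration; not scikit-learn's implementation."""

    def fit(self, documents):
        # Learn the vocabulary from the given (training) documents only.
        terms = sorted({t for doc in documents for t in doc.lower().split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            row = [0] * len(self.vocabulary_)
            for term, count in counts.items():
                index = self.vocabulary_.get(term)
                if index is not None:  # words unseen during fit are ignored
                    row[index] = count
            rows.append(row)
        return rows

train_docs = ['good movie', 'bad movie']
test_docs = ['good good soundtrack']

vectorizer = ToyCountVectorizer().fit(train_docs)
X_test = vectorizer.transform(test_docs)
print(vectorizer.vocabulary_, X_test)
```

The word "soundtrack" never appeared in the training documents, so it has no column and is dropped when the test document is transformed.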

&nbsp;&nbsp;&nbsp;&nbsp;The data frames themselves were created and saved as csv files in the Data Preprocessing step of the pandas section.

### Text vectorization

Fortunately, for the text vectorization part all the hard work is already done in the Scikit-Learn classes `CountVectorizer`<sup>(8)</sup> and `TfidfTransformer`<sup>(5)</sup>. We will use these classes to transform our csv files into unigram and bigram matrices (using both counts and tf-idf values). (<b>It turns out that using only n-grams of a single large n does not give good accuracy, so we usually use all n-grams up to some n. When we say bigrams here we actually mean unigrams plus bigrams, and when we say unigrams we mean just unigrams.</b>) Each row in those matrices will represent a document (review) in our dataset, and each column will represent values associated with a word in the vocabulary (in the case of unigrams) or with a sequence of at most 2 words (in the case of bigrams).  

&nbsp;&nbsp;&nbsp;&nbsp;`CountVectorizer` has a parameter `ngram_range`, which expects a tuple of size 2 that controls which n-grams to include. After constructing a `CountVectorizer` object, we call its `.fit()` method with the actual text as a parameter so that it learns the vocabulary of our collection of documents. Calling `.transform()` on a collection of documents then returns the matrix for the specified n-gram range. As the class name suggests, this matrix contains just the counts. To obtain the tf-idf values, we use the class `TfidfTransformer`. Its `.fit()` and `.transform()` methods are used in much the same way as those of `CountVectorizer`, but they take as input the counts matrix obtained in the previous step, and `.transform()` returns a matrix of tf-idf values. We should call `.fit()` only on training data and then store these objects. To evaluate the test score, or to make a prediction, we use the stored objects to transform the data before feeding it into our classifier.  

&nbsp;&nbsp;&nbsp;&nbsp;Note that the matrices generated for our train or test data will be huge; stored as ordinary dense numpy arrays they may not even fit into RAM. But most of the entries in these matrices are zero, so these Scikit-Learn classes use Scipy sparse matrices<sup>(9)</sup> (`csr_matrix`<sup>(10)</sup>, to be exact), which store just the non-zero entries and save a lot of space.  
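
The idea behind compressed sparse row (CSR) storage can be sketched in plain Python. The function below is a hypothetical helper written for this illustration; it builds the three arrays that a CSR matrix is made of (`data`, `indices`, `indptr`) from a small dense matrix and is not how Scipy itself constructs them.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix into CSR arrays.

    data    - the non-zero values, row by row
    indices - the column index of each value in data
    indptr  - indptr[i]:indptr[i + 1] is the slice of data belonging to row i
    """
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr

# Small made-up count matrix: 3 documents, 4 vocabulary terms.
dense = [
    [1, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
```

Only the three non-zero entries are stored, plus one pointer per row. For a review matrix with tens of thousands of columns and a few hundred distinct words per review, that difference is what makes the data fit in memory.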

&nbsp;&nbsp;&nbsp;&nbsp;We will use a linear classifier with stochastic gradient descent, `sklearn.linear_model.SGDClassifier`<sup>(11)</sup>, as our model. First we will generate and save our data in 4 forms: unigram and bigram matrices, each with both counts and tf-idf values. Then we will train and evaluate our model on each of these 4 data representations using `SGDClassifier` with the default parameters. After that, we choose the data representation that led to the best score and tune the hyper-parameters of our model on it using cross-validation in order to obtain the best results.

<b>Refs:</b> 
* Convert a collection of text documents to a matrix of token counts: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
* Convert a collection of raw documents to a matrix of TF-IDF features: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html


#### Unigram Counts

In [32]:
%%time
# TRAINING
unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
unigram_vectorizer.fit(imdb_train['text'].values)
dump(unigram_vectorizer, 'data_preprocessors/unigram_vectorizer.joblib')

# TESTING
unigram_vectorizer = load('data_preprocessors/unigram_vectorizer.joblib')

CPU times: user 4.73 s, sys: 76.9 ms, total: 4.81 s
Wall time: 4.82 s


In [33]:
unigram_vectorizer.vocabulary_

{'what': 72703,
 'wrong': 73842,
 'with': 73342,
 'this': 66562,
 'film': 24536,
 'many': 40829,
 'things': 66524,
 'the': 66339,
 'editing': 20934,
 'tries': 68148,
 'too': 67324,
 'hard': 29821,
 'to': 67125,
 'look': 39364,
 'good': 28068,
 'and': 3258,
 'does': 19421,
 'nothing': 46074,
 'but': 9881,
 'confuse': 14135,
 'viewer': 71334,
 'whilst': 72787,
 'also': 2821,
 'supplying': 64539,
 'him': 31002,
 'her': 30646,
 'powerful': 51151,
 'headache': 30213,
 'plot': 50428,
 'is': 34585,
 'muddled': 44260,
 'obviously': 46550,
 'prolonged': 51981,
 'from': 26180,
 'started': 62903,
 'as': 4465,
 'short': 59806,
 'story': 63422,
 'or': 47142,
 'only': 46957,
 'makes': 40431,
 'for': 25450,
 'less': 38451,
 'than': 66299,
 'ten': 66008,
 'minutes': 42972,
 'of': 46680,
 'just': 35787,
 'stretched': 63606,
 'out': 47449,
 'painfully': 48052,
 'until': 70237,
 'it': 34683,
 'reached': 53759,
 'minimum': 42913,
 'length': 38350,
 'feature': 24077,
 'we': 72365,
 'all': 2662,
 'know': 36

In [34]:
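# Note: vocabulary_ maps each term to its column index in the count matrix,
# not to how often the term occurs, so the 'Frequency' column below holds feature indices.
df = pd.DataFrame(list(unigram_vectorizer.vocabulary_.items()), columns=['Vocabulary', 'Frequency'])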

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74849 entries, 0 to 74848
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Vocabulary  74849 non-null  object
 1   Frequency   74849 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.1+ MB


In [36]:
df.sort_values(by="Frequency", axis=0, ascending=False, inplace=True, kind='quicksort', na_position='last')

In [37]:
df.head(n=20)

Unnamed: 0,Vocabulary,Frequency
68294,üvegtigris,74848
27210,über,74847
65842,østbye,74846
58654,ísnt,74845
36430,ís,74844
56874,êxtase,74843
17275,évery,74842
48490,étc,74841
37066,état,74840
68463,était,74839


In [38]:
df.tail(n=20)

Unnamed: 0,Vocabulary,Frequency
28841,02,19
62974,01pm,18
23930,01,17
10618,00s,16
64118,00pm,15
32178,00am,14
27368,0093638,13
61314,0083,12
61313,0080,11
61762,0079,10


In [40]:
%%time
# TRAINING
X_train_unigram = unigram_vectorizer.transform(imdb_train['text'].values)
save_npz('vectorized_data/X_train_unigram.npz', X_train_unigram)

# TESTING
X_train_unigram = load_npz('vectorized_data/X_train_unigram.npz')

CPU times: user 7.02 s, sys: 23.4 ms, total: 7.04 s
Wall time: 8.35 s


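<b>fit_transform</b>: the separate `.fit()` and `.transform()` calls on the training data can also be combined into a single `.fit_transform()` call.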

#### Unigram Tf-Idf

In [41]:
%%time
# TRAINING
unigram_tf_idf_transformer = TfidfTransformer()
unigram_tf_idf_transformer.fit(X_train_unigram)
dump(unigram_tf_idf_transformer, 'data_preprocessors/unigram_tf_idf_transformer.joblib')

# TESTING
unigram_tf_idf_transformer = load('data_preprocessors/unigram_tf_idf_transformer.joblib') 

CPU times: user 22.9 ms, sys: 4.9 ms, total: 27.8 ms
Wall time: 31.4 ms


In [42]:
%%time
# TRAINING
X_train_unigram_tf_idf = unigram_tf_idf_transformer.transform(X_train_unigram)
save_npz('vectorized_data/X_train_unigram_tf_idf.npz', X_train_unigram_tf_idf)

# TESTING
X_train_unigram_tf_idf = load_npz('vectorized_data/X_train_unigram_tf_idf.npz')

CPU times: user 3.43 s, sys: 57.8 ms, total: 3.48 s
Wall time: 3.48 s


#### Bigram Counts

In [45]:
%%time
# TRAINING
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(imdb_train['text'].values)
dump(bigram_vectorizer, 'data_preprocessors/bigram_vectorizer.joblib')

# TESTING
bigram_vectorizer = load('data_preprocessors/bigram_vectorizer.joblib')

CPU times: user 33.8 s, sys: 872 ms, total: 34.7 s
Wall time: 34.5 s


In [46]:
%%time
# TRAINING
X_train_bigram = bigram_vectorizer.transform(imdb_train['text'].values)
save_npz('vectorized_data/X_train_bigram.npz', X_train_bigram)

# TESTING
X_train_bigram = load_npz('vectorized_data/X_train_bigram.npz')

CPU times: user 12.7 s, sys: 71.8 ms, total: 12.8 s
Wall time: 13 s


#### Bigram Tf-Idf

In [47]:
%%time
# TRAINING
bigram_tf_idf_transformer = TfidfTransformer()
bigram_tf_idf_transformer.fit(X_train_bigram)
dump(bigram_tf_idf_transformer, 'data_preprocessors/bigram_tf_idf_transformer.joblib')

# TESTING
bigram_tf_idf_transformer = load('data_preprocessors/bigram_tf_idf_transformer.joblib')

CPU times: user 165 ms, sys: 31 ms, total: 196 ms
Wall time: 197 ms


In [48]:
X_train_bigram_tf_idf = bigram_tf_idf_transformer.transform(X_train_bigram)
save_npz('vectorized_data/X_train_bigram_tf_idf.npz', X_train_bigram_tf_idf)

# X_train_bigram_tf_idf = load_npz('vectorized_data/X_train_bigram_tf_idf.npz')

### Choosing data format

&nbsp;&nbsp;&nbsp;&nbsp;Now, for each data form we split it into train & validation sets, train a `SGDClassifier` and output the score.

In [49]:
def train_and_show_scores(X: csr_matrix, y: np.ndarray, title: str) -> None:
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.75, stratify=y
    )

    clf = SGDClassifier()
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)
    print(f'{title}\nTrain score: {round(train_score, 2)} ; Validation score: {round(valid_score, 2)}\n')

In [50]:
y_train = imdb_train['label'].values

In [51]:
train_and_show_scores(X_train_unigram, y_train, 'Unigram Counts')
train_and_show_scores(X_train_unigram_tf_idf, y_train, 'Unigram Tf-Idf')
train_and_show_scores(X_train_bigram, y_train, 'Bigram Counts')
train_and_show_scores(X_train_bigram_tf_idf, y_train, 'Bigram Tf-Idf')

Unigram Counts
Train score: 1.0 ; Validation score: 0.88

Unigram Tf-Idf
Train score: 0.95 ; Validation score: 0.89

Bigram Counts
Train score: 1.0 ; Validation score: 0.9

Bigram Tf-Idf
Train score: 0.98 ; Validation score: 0.9



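&nbsp;&nbsp;&nbsp;&nbsp;At this rounding the two bigram forms tie for the highest validation accuracy, **0.9**. We will use **bigram with tf-idf** next for hyper-parameter tuning, as it shows the smaller gap between train and validation scores (a train score of 0.98, against 1.0 for bigram counts).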

<h1> TUTORIAL </h1>

<h2>Using the processed Twitter data from yesterday's challenge</h2>


- Form a new data frame (named `cleanTweet`), containing columns $\textbf{clean-text}$ and $\textbf{polarity}$.

- Write a function `text_category` that takes a value `p` and returns, depending on the value of p, a string `'positive'`, `'negative'` or `'neutral'`.

- Apply this function (`text_category`) to the $\textbf{polarity}$ column of the `cleanTweet` frame from the first step to form a new column called $\textbf{score}$ in `cleanTweet`.

- Visualize the $\textbf{score}$ column using a pie chart and a bar chart.
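
One possible shape for `text_category` is sketched below. It assumes that a polarity above zero means positive, below zero means negative, and exactly zero means neutral; the zero case matches the later step that treats polarity 0 as neutral, while the other two thresholds are an assumption made for this sketch.

```python
def text_category(p: float) -> str:
    """Map a polarity value to a sentiment label (assumed thresholds at 0)."""
    if p > 0:
        return 'positive'
    if p < 0:
        return 'negative'
    return 'neutral'

# Made-up polarity values, for illustration only.
labels = [text_category(p) for p in (0.6, -0.25, 0.0)]
print(labels)
```

With pandas, a function like this would typically be applied to a whole column with `Series.apply`.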

<h5>Now we want to build a classification model on the clean tweet following the steps below:</h5>

* Remove rows from `cleanTweet` where $\textbf{polarity}$ $= 0$ (i.e. where $\textbf{score}$ is `'neutral'`) and reset the frame index.
* Construct a column $\textbf{scoremap}$ by applying the mapping `{'positive': 1, 'negative': 0}` to the $\textbf{score}$ column.
* Create feature and target variables `(X, y)` from the $\textbf{clean-text}$ and $\textbf{scoremap}$ columns respectively.
* Use the `train_test_split` function to construct `(X_train, y_train)` and `(X_test, y_test)` from `(X, y)`.

* Build an `SGDClassifier` model from the vectorized training text. Use `CountVectorizer()` with a $\textit{trigram}$ setting for its `ngram_range` parameter.

* Evaluate your model on the test data.
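
One possible shape for these steps is sketched below. The toy frame is hypothetical, and `ngram_range=(1, 3)` is an assumption about what "trigram" means here (unigrams up to trigrams); `ngram_range=(3, 3)` would keep trigrams only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real cleanTweet
positive = ['really love this new phone', 'what a great day out',
            'the service was very good', 'so happy with my order',
            'this is a great phone', 'had a very good day']
negative = ['really hate this new phone', 'what a terrible day out',
            'the service was very bad', 'so angry with my order',
            'this is a terrible phone', 'had a very bad day']
neutral = ['the store opens at nine', 'flight lands at noon']
cleanTweet = pd.DataFrame({
    'clean-text': positive + negative + neutral,
    'polarity': [0.5] * 6 + [-0.5] * 6 + [0.0] * 2,
    'score': ['positive'] * 6 + ['negative'] * 6 + ['neutral'] * 2,
})

# Drop neutral rows and encode the remaining labels as 1 / 0
cleanTweet = cleanTweet[cleanTweet['polarity'] != 0].reset_index(drop=True)
cleanTweet['scoremap'] = cleanTweet['score'].map({'positive': 1, 'negative': 0})

X = cleanTweet['clean-text']
y = cleanTweet['scoremap']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the vocabulary on the training text only, then reuse it for the test text
vectorizer = CountVectorizer(ngram_range=(1, 3))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = SGDClassifier(random_state=0)
clf.fit(X_train_vec, y_train)
test_accuracy = clf.score(X_test_vec, y_test)
print(f'Test accuracy: {test_accuracy:.2f}')
```

The vectorizer is fitted on the training split only, so the test text is encoded with the same vocabulary and no information from the test set leaks into training.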


# EXTENSION

### Using Cross-Validation for hyperparameter tuning

&nbsp;&nbsp;&nbsp;&nbsp;For this part we will use `RandomizedSearchCV`<sup>(12)</sup>, which samples parameter values at random from the lists we give it or from the distributions we specify with `scipy.stats` (e.g. uniform). It then estimates the test error of each sampled combination by cross-validation. After all iterations, the best estimator, the best parameters and the best score are available in the attributes `best_estimator_`, `best_params_` and `best_score_`.

&nbsp;&nbsp;&nbsp;&nbsp;Because the search space is very big and may need a huge number of iterations before the best combination turns up, we will split the parameters into two groups and tune them in two phases. First we will find the best combination of `loss`, `learning_rate` and `eta0` (the initial learning rate); then the best combination of `penalty` and `alpha`.
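
As a rough illustration of the two-phase idea, the sketch below runs both searches on a small synthetic dataset. Everything here is an assumption made for the example: the data from `make_classification`, the reduced `n_iter` and `cv`, the shortened list of losses, and in particular the choice to pass the phase-1 winners into the phase-2 estimator so that the second search builds on the first.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the sketch runs in seconds (not the IMDB data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Phase 1: search over the loss and the learning-rate settings
phase1 = RandomizedSearchCV(
    estimator=SGDClassifier(random_state=0),
    param_distributions=dict(
        loss=['hinge', 'modified_huber', 'perceptron'],
        learning_rate=['optimal', 'invscaling', 'adaptive'],
        eta0=uniform(loc=1e-7, scale=1e-2),
    ),
    cv=3,
    n_iter=5,
    random_state=0,
)
phase1.fit(X_demo, y_demo)

# Phase 2: fix the phase-1 winners on the estimator, search the regularization
phase2 = RandomizedSearchCV(
    estimator=SGDClassifier(random_state=0, **phase1.best_params_),
    param_distributions=dict(
        penalty=['l1', 'l2', 'elasticnet'],
        alpha=uniform(loc=1e-6, scale=1e-4),
    ),
    cv=3,
    n_iter=5,
    random_state=0,
)
phase2.fit(X_demo, y_demo)

print(f'Phase 1 best params: {phase1.best_params_}')
print(f'Phase 2 best params: {phase2.best_params_}')
```

Searching the two groups separately costs far fewer fits than a joint search, at the price of missing interactions between, say, the loss and the regularization strength.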

In [52]:
X_train = X_train_bigram_tf_idf

#### Phase 1: loss, learning rate and initial learning rate

In [53]:
clf = SGDClassifier()

In [54]:
distributions = dict(
    # 'log' was renamed to 'log_loss' in scikit-learn 1.1; use 'log' on older versions
    loss=['hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron'],
    learning_rate=['optimal', 'invscaling', 'adaptive'],
    eta0=uniform(loc=1e-7, scale=1e-2)
)
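
A detail that is easy to misread: `scipy.stats.uniform(loc, scale)` is uniform on the interval `[loc, loc + scale]`, so `scale` is the width of the range, not its upper bound. The short check below illustrates this for the `eta0` distribution; the variable names are only for the example.

```python
from scipy.stats import uniform

# Same distribution as used for eta0 in the search
eta0_dist = uniform(loc=1e-7, scale=1e-2)

# The 0% and 100% quantiles give the ends of the sampled interval
lower, upper = eta0_dist.ppf(0), eta0_dist.ppf(1)
samples = eta0_dist.rvs(size=1000, random_state=0)

print(f'interval: [{lower}, {upper}]')
print(f'sample min: {samples.min():.6f}, sample max: {samples.max():.6f}')
```

Every sampled `eta0` therefore lies between `1e-7` and `1e-7 + 1e-2`.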

In [55]:
random_search_cv = RandomizedSearchCV(
    estimator=clf,
    param_distributions=distributions,
    cv=5,
    n_iter=50
)
random_search_cv.fit(X_train, y_train)
print(f'Best params: {random_search_cv.best_params_}')
print(f'Best score: {random_search_cv.best_score_}')

Best params: {'eta0': 0.001175586099480908, 'learning_rate': 'optimal', 'loss': 'modified_huber'}
Best score: 0.9043599999999999


&nbsp;&nbsp;&nbsp;&nbsp;Because `learning_rate='optimal'` came out best, we can ignore `eta0` (the initial learning rate): it is not used by the `'optimal'` schedule. The search still reports a value for it only because one is sampled in every iteration.
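
A quick way to check that `eta0` plays no role under the `'optimal'` schedule is to train two classifiers that differ only in `eta0`. This is a sketch on synthetic data; the dataset and the two `eta0` values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data, used only to compare the two fits
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Identical settings and seed; only eta0 differs
clf_small = SGDClassifier(learning_rate='optimal', eta0=0.001, random_state=0)
clf_large = SGDClassifier(learning_rate='optimal', eta0=0.5, random_state=0)
clf_small.fit(X_demo, y_demo)
clf_large.fit(X_demo, y_demo)

# If eta0 is unused, both models should end up with the same weights
same_weights = np.allclose(clf_small.coef_, clf_large.coef_)
print(f'Same weights: {same_weights}')
```

Repeating the comparison with `learning_rate='constant'` should give different weights, since that schedule uses `eta0` directly as the step size.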

#### Phase 2: penalty and alpha

In [None]:
clf = SGDClassifier()

In [None]:
distributions = dict(
    penalty=['l1', 'l2', 'elasticnet'],
    alpha=uniform(loc=1e-6, scale=1e-4)
)

In [None]:
random_search_cv = RandomizedSearchCV(
    estimator=clf,
    param_distributions=distributions,
    cv=5,
    n_iter=50
)
random_search_cv.fit(X_train, y_train)
print(f'Best params: {random_search_cv.best_params_}')
print(f'Best score: {random_search_cv.best_score_}')

&nbsp;&nbsp;&nbsp;&nbsp;So, the best parameters that I got are (the search is randomized, so they can differ from run to run):  
`loss: squared_hinge`  
`learning_rate: optimal`  
`penalty: l2`  
`alpha: 1.2101013664295101e-05`

#### Saving the best classifier

In [56]:
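import os

sgd_classifier = random_search_cv.best_estimator_

# joblib's dump does not create missing directories, so make sure the target folder exists
os.makedirs('classifiers', exist_ok=True)
dump(sgd_classifier, 'classifiers/sgd_classifier.joblib')

# To reload the classifier later:
# sgd_classifier = load('classifiers/sgd_classifier.joblib')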

['classifiers/sgd_classifier.joblib']
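
The saved classifier only accepts feature vectors produced by the already fitted `bigram_vectorizer` and `bigram_tf_idf_transformer`, so those objects are needed at prediction time too. One option, sketched here on a hypothetical toy corpus, is to wrap all three steps in a scikit-learn `Pipeline` and persist that instead. The `ngram_range=(1, 2)` setting, the step names and the file name are assumptions made for the example.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = positive review, 0 = negative review
texts = ['a great movie', 'really great acting', 'great fun to watch',
         'a terrible movie', 'really terrible acting', 'terrible waste of time']
labels = [1, 1, 1, 0, 0, 0]

# Vectorizer, tf-idf weighting and classifier bundled as one object
pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf_idf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=0)),
])
pipeline.fit(texts, labels)

# Round trip through joblib in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sentiment_pipeline.joblib')
    dump(pipeline, path)
    restored = load(path)

# The restored pipeline takes raw text directly
new_reviews = ['a great movie', 'a terrible movie']
original_pred = pipeline.predict(new_reviews)
restored_pred = restored.predict(new_reviews)
print(original_pred, restored_pred)
```

Persisting one object means a dashboard or API only has to load a single file and can call `predict` on raw strings.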

### Testing model

In [57]:
X_test = bigram_vectorizer.transform(imdb_test['text'].values)
X_test = bigram_tf_idf_transformer.transform(X_test)
y_test = imdb_test['label'].values

In [58]:
score = sgd_classifier.score(X_test, y_test)
print(score)

0.9014


&nbsp;&nbsp;&nbsp;&nbsp;And we got **90.14%** test accuracy. That's not bad for our simple linear model. More advanced methods give better results: the paper cited in <sup>(13)</sup> reports **97.42%** on this dataset.

# Deployment
* Flask
* Streamlit

# References

<sup>(1)</sup> &nbsp;[Sentiment Analysis - Wikipedia](https://en.wikipedia.org/wiki/Sentiment_analysis)  
<sup>(2)</sup> &nbsp;[Learning Word Vectors for Sentiment Analysis](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)  
<sup>(3)</sup> &nbsp;[Bag-of-words model - Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)  
<sup>(4)</sup> &nbsp;[Tf-idf - Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)  
<sup>(5)</sup> &nbsp;[TfidfTransformer - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)  
<sup>(6)</sup> &nbsp;[Stop words - Wikipedia](https://en.wikipedia.org/wiki/Stop_words)  
<sup>(7)</sup> &nbsp;[A list of English stopwords](https://gist.github.com/sebleier/554280)  
<sup>(8)</sup> &nbsp;[CountVectorizer - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)  
<sup>(9)</sup> &nbsp;[Scipy sparse matrices](https://docs.scipy.org/doc/scipy/reference/sparse.html)  
<sup>(10)</sup> [Compressed Sparse Row matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix)  
<sup>(11)</sup> [SGDClassifier - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)  
<sup>(12)</sup> [RandomizedSearchCV - Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)  
<sup>(13)</sup> [Sentiment Classification using Document Embeddings trained with Cosine Similarity](https://www.aclweb.org/anthology/P19-2057.pdf)  