# Building a Sentiment Analysis Model for the Web App
We shall build this model in the following steps:
1. Load the dataset
2. Define a Preprocessor and a Lemmatizer function
3. Building the model
4. Train our model
5. Validate and save model

#### Import a general-purpose package: pandas

In [4]:
import pandas as pd # Used to load data

#### Download the dataset
Download the dataset into a `data` folder by running the cell below. This dataset is available at http://thinknook.com/Twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

In [2]:
!mkdir -p data  # create a folder to store the data (no error if it already exists)
!wget http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip -nc -P ./data/

--2020-12-24 02:05:10--  http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
Resolving thinknook.com (thinknook.com)... 208.109.47.128
Connecting to thinknook.com (thinknook.com)|208.109.47.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56427677 (54M) [application/zip]
Saving to: ‘./data/Sentiment-Analysis-Dataset.zip’


2020-12-24 02:06:03 (1.04 MB/s) - ‘./data/Sentiment-Analysis-Dataset.zip’ saved [56427677/56427677]



### 1. Load the dataset
Load the dataset from the file path ``data/Sentiment-Analysis-Dataset.zip`` using ``pd.read_csv`` and assign it to a variable ``df``. The parameter ``compression='zip'`` tells ``pandas`` that the file is a zipped archive, so it is decompressed while being read. We set ``error_bad_lines=False`` because some rows in the dataset have more columns/fields than expected; this parameter tells ``pandas`` to skip such rows and proceed to the next one.

In [5]:
# Load the dataset, skipping rows that have too many fields
df = pd.read_csv('data/Sentiment-Analysis-Dataset.zip', compression='zip', error_bad_lines=False)

b'Skipping line 8836: expected 4 fields, saw 5\n'
b'Skipping line 535882: expected 4 fields, saw 7\n'
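
In newer pandas releases, `error_bad_lines` was deprecated (version 1.3) and later removed (version 2.0) in favor of `on_bad_lines`. The sketch below shows the equivalent skip behavior on a small in-memory CSV. The sample rows are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has 6 fields instead of 4.
csv_text = (
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,Sentiment140,a well formed row\n"
    "2,1,Sentiment140,too,many,fields\n"
    "3,1,Sentiment140,another well formed row\n"
)

# on_bad_lines='skip' drops rows with too many fields instead of raising an error.
sample = pd.read_csv(io.StringIO(csv_text), on_bad_lines="skip")
print(len(sample))
```

If you run the tutorial on a recent pandas version, replacing `error_bad_lines=False` with `on_bad_lines='skip'` in the `read_csv` call should give the same result.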


As you can see, only two rows had extra fields and were skipped. This is not a problem because we still have more than 1.5 million rows of data to work with. Use `df.head()` to view the dataframe and `len(df)` to count the remaining rows.

In [6]:
print('Number of rows in dataset: {}'.format(len(df)))

Number of rows in dataset: 1578612


In [7]:
df.head(10)  # show the first 10 rows

| | ItemID | Sentiment | SentimentSource | SentimentText |
|---|---|---|---|---|
| 0 | 1 | 0 | Sentiment140 | is so sad for my APL frie... |
| 1 | 2 | 0 | Sentiment140 | I missed the New Moon trail... |
| 2 | 3 | 1 | Sentiment140 | omg its already 7:30 :O |
| 3 | 4 | 0 | Sentiment140 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 5 | 0 | Sentiment140 | i think mi bf is cheating on me!!! ... |
| 5 | 6 | 0 | Sentiment140 | or i just worry too much? |
| 6 | 7 | 1 | Sentiment140 | Juuuuuuuuuuuuuuuuussssst Chillin!! |
| 7 | 8 | 0 | Sentiment140 | Sunny Again Work Tomorrow :-\| ... |
| 8 | 9 | 1 | Sentiment140 | handed in my uniform today . i miss you ... |
| 9 | 10 | 1 | Sentiment140 | hmmmm.... i wonder how she my number @-) |


From our dataframe, we shall use the `Sentiment` column and the `SentimentText` column. However, some texts in the `SentimentText` column contain slang and HTML character entities such as `&lt`, so we must clean this kind of data for our model to learn well.

### 2. Define a Preprocessor and a Lemmatizer function
The preprocessor takes a document `doc`, converts HTML character entities to the characters they stand for (for example, `&lt` is converted to `<`), and lowercases the text.

In [8]:
from html import unescape
def preprocessor(doc):
    # Takes in a document (a row from the SentimentText column),
    # decodes HTML entities, and lowercases the result
    return unescape(doc).lower()

In [9]:
preprocessor('&lt')

'<'
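
To see the preprocessor on fuller text, here is a small sketch that applies it to two tweet-like strings. The strings are invented for illustration and do not come from the dataset.

```python
from html import unescape


def preprocessor(doc):
    # Decode HTML entities, then lowercase the text.
    return unescape(doc).lower()


# Hypothetical examples containing common HTML entities.
examples = ["I &lt;3 this &amp; that", "&quot;Great&quot; day &gt; bad day"]
cleaned = [preprocessor(text) for text in examples]
print(cleaned)
```

Entities such as `&amp;`, `&quot;`, and `&gt;` are decoded in the same way as `&lt`.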

Load the `spacy` package, which is used for natural language processing, and load the English pipeline `en_core_web_sm`. `STOP_WORDS` is a set of common words that usually carry little meaning, so we exclude them when building features. Note the argument `disable=['ner','parser','tagger']`: it switches off pipeline components that we do not need here, which speeds up processing.

In [10]:
# Load the English language pipeline and disable unused components to make it faster
import spacy
from spacy.lang.en import STOP_WORDS
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])

The lemmatizer takes a document (one or more sentences) `doc` and returns the lemma of each word. For example, the lemma of "running" is "run", which is shorter but keeps the same meaning.

In [11]:
# Define a lemmatizer function
def lemmatizer(doc):
    return [word.lemma_ for word in nlp(doc)]

Lemmatize the stop words as well. Since the training documents are turned into lemmas, the stop words must be in lemma form to match them.

In [12]:
# Create the lemmas of our stop words
STOP_WORDS_lemma = [word.lemma_ for word in nlp(" ".join(STOP_WORDS))]
# Add ',', '.' and ';' to the stop words
STOP_WORDS_lemma = set(STOP_WORDS_lemma).union(['.', ';', ','])

### 3. Building the model
We shall use a naive Bayes model because we want a model that returns the probability of a sentiment being positive. There are other probabilistic models that you can try out.

In [13]:
# Import the building blocks for our model
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer,HashingVectorizer

Uncomment the `TfidfVectorizer` and comment out the `HashingVectorizer` for better accuracy; the trade-off is that the model will take more time to train and score. We shall use a `Pipeline` to chain the vectorizer and the classifier into a single model.

In [14]:
# Alternative: more accurate, but slower to train and score
# vectorizer = TfidfVectorizer(preprocessor=preprocessor,
#                              tokenizer=lemmatizer,
#                              ngram_range=(1, 2),
#                              stop_words=STOP_WORDS_lemma)
vectorizer = HashingVectorizer(preprocessor=preprocessor,
                               # tokenizer=lemmatizer,
                               alternate_sign=False,  # keep features non-negative for MultinomialNB
                               # ngram_range=(1, 2),
                               stop_words=STOP_WORDS)  # a set; newer scikit-learn versions may expect list(STOP_WORDS)
clf = MultinomialNB()
model = Pipeline([('vectorizer', vectorizer),
                  ('classifier', clf)])
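
Since the goal is a probability of positive sentiment, it may help to see how a fitted pipeline of this shape exposes it. The following is a minimal sketch on toy data: the six sentences, their labels, and the name `toy_model` are invented for illustration, and the pipeline omits the custom preprocessor and stop words to stay self-contained.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration (1 = positive, 0 = negative).
texts = ["i love this", "what a great day", "so happy today",
         "i hate this", "what a bad day", "so sad today"]
labels = [1, 1, 1, 0, 0, 0]

toy_model = Pipeline([
    # alternate_sign=False keeps hashed features non-negative,
    # which MultinomialNB requires.
    ("vectorizer", HashingVectorizer(alternate_sign=False)),
    ("classifier", MultinomialNB()),
])
toy_model.fit(texts, labels)

# predict_proba returns one column per class; find the column for class 1.
proba = toy_model.predict_proba(["love this great day", "hate this bad day"])
classes = list(toy_model.named_steps["classifier"].classes_)
p_positive = proba[:, classes.index(1)]
print(p_positive)
```

Looking up the column through `classes_` avoids assuming that the positive class is always the last column.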

Let's split our data into training and test sets using `train_test_split`, holding out 20% of the rows for testing.

In [15]:
# Split the data into training and test sets
X = df['SentimentText']
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
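
The effect of `test_size` and `random_state` can be checked on a small list. This is a sketch with made-up data (ten placeholder strings), not the tweet dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: ten placeholder documents with alternating labels.
docs = ["tweet {}".format(i) for i in range(10)]
targets = [i % 2 for i in range(10)]

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing.
d_train, d_test, t_train, t_test = train_test_split(
    docs, targets, test_size=0.2, random_state=0)

# Reusing the same random_state reproduces the same split.
d_train_again, d_test_again, _, _ = train_test_split(
    docs, targets, test_size=0.2, random_state=0)

print(len(d_train), len(d_test), d_test == d_test_again)
```

Fixing `random_state` matters here because the scores reported later are only comparable across runs if the split is the same.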

### 4. Train our model

In [16]:
# Train the model
model.fit(X_train,y_train)


Pipeline(steps=[('vectorizer',
                 HashingVectorizer(alternate_sign=False,
                                   preprocessor=<function preprocessor at 0x7f897a9f2b90>,
                                   stop_words={"'d", "'ll", "'m", "'re", "'s",
                                               "'ve", 'a', 'about', 'above',
                                               'across', 'after', 'afterwards',
                                               'again', 'against', 'all',
                                               'almost', 'alone', 'along',
                                               'already', 'also', 'although',
                                               'always', 'am', 'among',
                                               'amongst', 'amount', 'an', 'and',
                                               'another', 'any', ...})),
                ('classifier', MultinomialNB())])

In [17]:
# Check model accuracy on the training data
model.score(X_train,y_train)

0.8073322358497065

### 5. Validate and save model

In [18]:
# Check model accuracy on the test data
model.score(X_test,y_test)

0.7699090658583632

As you can see, our model reaches about 77% accuracy on the test data (compared with about 81% on the training data), which is not bad. We now compress and save it so that we can use it to build our sentiment web app.
We shall use `dill` to save the model and `gzip` to compress it and reduce its size.

In [19]:
import gzip
import dill

with gzip.open('SentimentModel.dill.gz', 'wb') as f:
    # recurse=True to make sure everything the model refers to is saved
    dill.dump(model, f, recurse=True)

Reload the model from the saved file `SentimentModel.dill.gz` to be sure that it works the same, and test its performance on the test data.

In [20]:
import gzip
import dill

with gzip.open('SentimentModel.dill.gz','rb') as f:
    sentiment_model = dill.load(f)

In [21]:
sentiment_model.score(X_test,y_test)


0.7699090658583632
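
The same save-and-reload pattern can be tried on a small object before trusting it with a trained model. The sketch below uses the standard-library `pickle` module as a stand-in for `dill`, an assumption made so that it runs without extra packages (`dill.dump` and `dill.load` are called the same way). The file name `toy.pkl.gz` and the `settings` object are hypothetical.

```python
import gzip
import os
import pickle
import tempfile

# Hypothetical object standing in for the trained model.
settings = {"name": "toy-model", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl.gz")

    # Write the object through a gzip stream so the file is compressed.
    with gzip.open(path, "wb") as f:
        pickle.dump(settings, f)

    # Read it back through the same kind of stream.
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)

round_trip_ok = restored == settings
print(round_trip_ok)
```

For the real model, the matching test-set score before saving and after reloading plays the role of this equality check.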