# Introduction 
[CheatSheet](https://www.kaggle.com/code/raenish/cheatsheet-text-helper-functions/notebook)

Data Source: [Sentiment140](http://help.sentiment140.com/for-students)

In this notebook we tune our logistic regression (LR) model, this time on a larger and more general dataset.

Some readings I found helpful:
  - GridSearchCV [link](https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/)
  - Sentiment Analysis Series by Kim [link](https://medium.com/towards-data-science/another-twitter-sentiment-analysis-with-python-part-11-cnn-word2vec-41f5e28eda74)
  - Mathematical Intuition behind Logistic Regression [link](https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/)
   

## Import data & packages

In [53]:
# basic
import numpy as np
import pandas as pd
import re
import string
import time
from tqdm import tqdm

# Preprocessing
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

### Read data

* 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
* 1 - the id of the tweet (2087)
* 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
* 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
* 4 - the user that tweeted (robotickilldozr)
* 5 - the text of the tweet (Lyx is cool)


In [13]:
# the dataset has no header row, so we define the column names ourselves
cols = [str(i) for i in range(6)]
df = pd.read_csv('data/sent140/140noemoticon.csv', encoding='latin-1', names=cols)

In [20]:
df = df[['0','5']]
df = df.rename(mapper={'0':'target','5':'text'},axis=1)
df.head()

Unnamed: 0,target,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [21]:
print("The original dataset has {} entries".format(df.shape[0]))

The original dataset has 1600000 entries


# Preprocessing Text 
<a id="1"></a>
The usual steps are:

1. Scrape text from raw documents
2. Remove punctuation
3. Convert to lower case
4. Tokenize and remove stop words
5. Lemmatize or stem

We use lemmatization here.

In [23]:
def twit_preproc(df, column, now, tokenized=False):
    """Preprocess df[column] and store the result in df[now].

    Steps:
        - lowercase; remove bracketed text, HTML tags, links, punctuation,
          newlines, and words containing digits
        - tokenize and remove stop words
        - lemmatize
        - optionally join the tokens of each tweet back into one string
    """
    def clean_text(text):
        '''Lowercase the text and remove text in square brackets, HTML tags,
        links, punctuation, newlines, and words containing numbers.'''
        text = str(text).lower()
        text = re.sub(r'\[.*?\]', '', text)  # remove text in square brackets
        text = re.sub(r'<.*?>+', '', text)  # remove HTML tags
        text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove links
        text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
        text = re.sub(r'\n', '', text)  # remove newlines
        text = re.sub(r'\w*\d\w*', '', text)  # remove words containing numbers
        return text
    df[now] = df[column].apply(clean_text)

    # tokenize (the text is already lowercased by clean_text)
    tokenizer = RegexpTokenizer(r'\w+')
    df[now] = df[now].apply(tokenizer.tokenize)

    # build the stop word set once; calling stopwords.words() per token is very slow
    stop_words = set(stopwords.words('english'))
    def remove_stopword(x):
        return [y for y in x if y not in stop_words]
    df[now] = df[now].apply(remove_stopword)

    # lemmatize each token
    lemmatizer = WordNetLemmatizer()
    def sentence_lemmatize(text):
        return [lemmatizer.lemmatize(x) for x in text]
    df[now] = df[now].apply(sentence_lemmatize)

    # join the tokens back into a single string
    if not tokenized:
        df[now] = df[now].apply(" ".join)

    return df

In [24]:
%%time
twit_preproc(df,'text','clean_text')

CPU times: user 33min, sys: 8min 46s, total: 41min 47s
Wall time: 42min 8s


Unnamed: 0,target,text,clean_text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",switchfoot awww thats bummer shoulda got david...
1,0,is upset that he can't update his Facebook by ...,upset cant update facebook texting might cry r...
2,0,@Kenichan I dived many times for the ball. Man...,kenichan dived many time ball managed save res...
3,0,my whole body feels itchy and like its on fire,whole body feel itchy like fire
4,0,"@nationwideclass no, it's not behaving at all....",nationwideclass behaving im mad cant see
...,...,...,...
1599995,4,Just woke up. Having no school is the best fee...,woke school best feeling ever
1599996,4,TheWDB.com - Very cool to hear old Walt interv...,thewdbcom cool hear old walt interview â
1599997,4,Are you ready for your MoJo Makeover? Ask me f...,ready mojo makeover ask detail
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...,happy birthday boo alll time tupac amaru shakur


In [26]:
# save cleaned data to the same path it is read back from below
clean_df = df[['target','clean_text']]
clean_df.to_csv('data/sent140/sent140_clean.csv',encoding='utf-8')

In [27]:
path = 'data/sent140/sent140_clean.csv'
df = pd.read_csv(path,index_col=0)
df.head()

Unnamed: 0,target,clean_text
0,0,switchfoot awww thats bummer shoulda got david...
1,0,upset cant update facebook texting might cry r...
2,0,kenichan dived many time ball managed save res...
3,0,whole body feel itchy like fire
4,0,nationwideclass behaving im mad cant see


### Train Test Split
Since our dataset is fairly large (1.6 million tweets), a 2% test set gives us about 32k tweets, which is sufficient.
We will use cross-validation to tune the model, so we do not need a separate validation set; a test set is enough.

In [28]:
train, test = train_test_split(df,test_size=0.02)

In [29]:
train = train.dropna()
test = test.dropna()
print("Training set has {} rows, and testing set has {} rows".
     format(train.shape[0],test.shape[0]))

Training set has 1566517 rows, and testing set has 31968 rows


In [88]:
X,x_test = train['clean_text'],test['clean_text']
y,y_test = train['target'],test['target']
X.head()

598739                                     cant fall asleep
132635                        woke knee throbbing cant good
747420    hostessojr ooh well update mister hahaha jk th...
720305                                        kaceyfish aww
288605                                        ughhh im sick
Name: clean_text, dtype: object

# Vectorize

I ran the same test as in notebook 1 on this new dataset. TF-IDF with 10,000 features and n-grams up to trigrams (`ngram_range=(1,3)`) gave the best result.
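
To make the n-gram setting concrete, here is a minimal pure-Python sketch of which word n-grams a range of (1, 3) produces for one cleaned tweet. The helper `word_ngrams` is hypothetical and only illustrates the idea; it splits on whitespace, whereas `TfidfVectorizer` uses its own token pattern and then keeps only the most frequent features.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of `text` for n in the inclusive ngram_range."""
    tokens = text.split()  # simplification: whitespace tokenization
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

example = "cant fall asleep"  # a cleaned tweet like those in the training sample
print(word_ngrams(example))
# ['cant', 'fall', 'asleep', 'cant fall', 'fall asleep', 'cant fall asleep']
```

Each of these strings is a candidate feature, which is why multi-word features such as "cant wait" can appear among the model weights.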

In [86]:
import multiprocessing

tvec = TfidfVectorizer(max_features=10000, ngram_range=(1,3))

In [89]:
start = time.time()

X = tvec.fit_transform(X)
x_test = tvec.transform(x_test)

end = time.time()
print(end-start)

110.28662204742432


# Modelling
<a id="4"></a>
As we decided in the previous notebook, the model we will use is logistic regression.

In [93]:
mlr = LogisticRegression(C=5e1,max_iter=1000,multi_class='multinomial',solver='lbfgs',random_state=47,n_jobs=4)

In [94]:
%%time
mlr = mlr.fit(X,y)

CPU times: user 93.8 ms, sys: 60.8 ms, total: 155 ms
Wall time: 2min 55s


In [97]:
preds = mlr.predict(x_test)
print("Model Accuracy for L2")
print(accuracy_score(y_test, preds))

Model Accuracy for L2
0.7860985985985987


In [121]:
import eli5

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
eli5.show_weights(estimator=mlr, 
                  feature_names=list(tvec.get_feature_names_out()),
                  top=(50,5))

Weight?,Feature
+4.296,cant wait
+3.294,cannot wait
+3.153,banksyart
+2.964,wish luck
+2.719,nothing wrong
+2.641,mileymonday
+2.468,smiling
+2.454,meits simple
+2.414,cant get enough
+2.395,isnt bad


In [129]:
lgr = LogisticRegression(C=1.0, class_weight=None, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=47, solver='liblinear',
          verbose=0, warm_start=False)

In [130]:
%%time
lgr = lgr.fit(X,y)



CPU times: user 44.2 s, sys: 6.1 s, total: 50.3 s
Wall time: 13.4 s


In [131]:
preds = lgr.predict(x_test)
print("Model Accuracy for L2")
print(accuracy_score(y_test, preds))

Model Accuracy for L2
0.7865052552552553


## HyperParameter Tuning

Here we explore two methods that extend logistic regression to a multi-class classifier:
- One-versus-Rest (OvR)
- Softmax (Multinomial)

### One Versus Rest (OvR)

The question of whether a tweet is positive, negative or neutral is divided into 3 binary problems:
- Binary Classification1: `neutral` vs `[positive, negative]` 
- Binary Classification2: `positive` vs `[neutral, negative]` 
- Binary Classification3: `negative` vs `[positive, neutral]` 
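
The decomposition above can be sketched in plain Python. This is only an illustration of the idea, not how scikit-learn implements it internally; the helper names and the toy labels and scores are hypothetical.

```python
def one_vs_rest_targets(labels):
    """Map each class to a binary target list: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def predict_one_vs_rest(scores):
    """Pick the class whose binary classifier reports the highest score."""
    return max(scores, key=scores.get)

# Sentiment140 polarity codes: 0 = negative, 2 = neutral, 4 = positive
labels = [0, 4, 2, 4, 0]  # hypothetical toy labels
for cls, binary in one_vs_rest_targets(labels).items():
    print(cls, binary)

# hypothetical scores from the three binary classifiers for one tweet
print(predict_one_vs_rest({0: 0.2, 2: 0.1, 4: 0.7}))
```

One binary classifier is trained per target list, and at prediction time the class with the most confident classifier wins.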

In [132]:
# define search space
param_dict = {
    'C' : [1.0,0.1,0.01],
    'penalty':['l1','l2']}


In [133]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
# define model
mlgr = LogisticRegression(max_iter=1000,multi_class='ovr',solver='liblinear')
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
search = GridSearchCV(mlgr, param_dict, scoring='accuracy', n_jobs=-1, cv=cv)

In [134]:
%%time
result = search.fit(X,y)

CPU times: user 19.9 s, sys: 3.86 s, total: 23.8 s
Wall time: 12min 14s


In [135]:
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Best Score: 0.7848775340810589
Best Hyperparameters: {'C': 1.0, 'penalty': 'l1'}


### Multinomial 

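In the multinomial (softmax) approach, a single model is fit jointly over all classes. Instead of the log-odds of one class against the rest, it models relative log-odds, that is, the log-odds of each class relative to another class. Each class also has its own set of weights rather than one shared weight vector.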

Read [more](https://qr.ae/pG0T3c).
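
As a small illustration of relative log-odds, the sketch below applies a softmax to three per-class scores and checks that the log-odds between two classes equal the difference of their scores. The scores are hypothetical stand-ins for the per-class linear outputs; nothing here is taken from the fitted model.

```python
import math

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical linear scores for negative, neutral, positive
scores = [1.0, 0.5, 2.0]
probs = softmax(scores)
print(probs)

# relative log-odds of positive vs negative equal the score difference
log_odds_pos_vs_neg = math.log(probs[2] / probs[0])
print(log_odds_pos_vs_neg, scores[2] - scores[0])
```

Because only score differences matter, each class needs its own weight vector to produce its own score.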

In [137]:
param_dict1 = {'solver': ['saga', 'lbfgs'],
              'penalty' :['elasticnet', 'l1', 'l2', 'none']}

In [138]:
# define model
mlgr = LogisticRegression(max_iter=500,multi_class='multinomial',C=1.0)
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search
search = GridSearchCV(mlgr, param_dict1, scoring='accuracy', n_jobs=-1, cv=cv)

In [139]:
%%time
result = search.fit(X,y)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
90 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the fa

CPU times: user 13h 39min 8s, sys: 3min 14s, total: 13h 42min 23s
Wall time: 3d 1h 15min 7s
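
The truncated log does not show why the fits failed, but the count of 90 out of 240 is consistent with scikit-learn's documented solver and penalty compatibility. The sketch below is a rough accounting under that assumption: `lbfgs` supports only `l2` and no penalty, and `saga` supports `elasticnet` only when `l1_ratio` is set, which this grid never does. The function `is_supported` is a hypothetical helper encoding those assumed rules.

```python
from itertools import product

solvers = ['saga', 'lbfgs']
penalties = ['elasticnet', 'l1', 'l2', 'none']
folds_per_candidate = 10 * 3  # n_splits * n_repeats

def is_supported(solver, penalty):
    """Assumed compatibility rules for this grid (see the note above)."""
    if solver == 'lbfgs':
        return penalty in ('l2', 'none')
    # saga: elasticnet additionally needs l1_ratio, which the grid never sets
    return penalty != 'elasticnet'

candidates = list(product(solvers, penalties))
failing = [c for c in candidates if not is_supported(*c)]

print(len(candidates) * folds_per_candidate)  # total fits: 240
print(len(failing) * folds_per_candidate)     # fits expected to fail: 90
print(failing)
```

Restricting the grid to valid solver and penalty pairs would avoid spending cross-validation folds on combinations that cannot be fit.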




In [140]:
print("Result for MULTINOMIAL class")
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

Result for MULTINOMIAL class
Best Score: 0.7848779596500978
Best Hyperparameters: {'penalty': 'l1', 'solver': 'saga'}


[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#:~:text=estimator%20which%20gave%20highest%20score%20(or%20smallest%20loss%20if%20specified)%20on%20the%20left%20out%20data.) selects the estimator with the best score on the left-out data, so for the final model we retrain that configuration on the complete dataset (training and test sets combined).

*Note:*
- Across multiple projects on the same dataset, C=1.0 gave the best result, so I skipped tuning it here.
- I expected L1 to perform better since it discards unimportant features (of which we have many), but this needs more research.

In [141]:
final_lr = LogisticRegression(max_iter=2000,multi_class='multinomial',C=1.0,penalty='l1',solver='saga',
                                random_state=47)


In [170]:
# concatenate the training and testing sets
# pd.concat does not support sparse matrices, so we use scipy's vstack for X
from scipy.sparse import vstack 
X_all = vstack((X,x_test))
y_all = pd.concat([y,y_test])

In [174]:
final_lr = final_lr.fit(X_all,y_all)

## Save Model

In [177]:
import pickle
# serialize the final model to disk; the with-block closes the file handle
filename = 'finalized_mlr_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(final_lr, f)

In [None]:
# load saved model
#with open('finalized_mlr_model.sav', 'rb') as f:
    #lr = pickle.load(f)