### Implementing Sentiment Analysis for the Machine Hack Hackathon using the TF-IDF model

#### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
import dask
import dask.dataframe as dd

In [3]:
import dask.distributed 

In [4]:
from distributed import Client

In [5]:
client = Client()

In [6]:
twitter_df = pd.read_csv("train.csv")

In [7]:
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44100 entries, 0 to 44099
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         44100 non-null  int64 
 1   author     44100 non-null  object
 2   Review     44100 non-null  object
 3   Sentiment  44100 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 1.3+ MB


In [8]:
twitter_df.head(10)

| | ID | author | Review | Sentiment |
|---|---|---|---|---|
| 0 | 39467 | rayinstirling | Today I'm working on my &quot;Quirky Q&quot; c... | 2 |
| 1 | 30154 | DirtyRose17 | @ShannonElizab dont ya know? people love the h... | 1 |
| 2 | 16767 | yoliemichelle | ughhh rejected from the 09 mediation program. ... | 0 |
| 3 | 9334 | jayamelwani | @petewentz im so jealous. i want an octo drive | 0 |
| 4 | 61178 | aliisanoun | I remember all the hype around this movie when... | 0 |
| 5 | 54688 | empressjazzy1 | I liked this quite a bit but I have friends th... | 2 |
| 6 | 34838 | lorrief | loving that spring definitely seems to be here... | 2 |
| 7 | 28520 | GE0RGIE | @jeg007jeg yay coutch:couch | 2 |
| 8 | 31974 | BrandonCarlson | Working on the store's Facebook group, getting... | 2 |
| 9 | 14323 | allshookup | @falselove OH THAT'S GOOD! My top 4 are: The H... | 0 |


The preview shows that the reviews need cleaning before modelling:
- Some reviews contain user handles of the form @username.
- Many words do not contribute to the sentiment expressed.
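
The notebook does not strip the @username handles explicitly; once punctuation is removed they survive as ordinary tokens (for example `petewentz`). A minimal sketch of one way to drop them is shown here. `MENTION_PATTERN` and `remove_mentions` are hypothetical names, not part of the original notebook, and the step would have to run before punctuation removal, while the `@` is still present. The pattern would also clip the domain part of an e-mail address, so it is only a starting point.

```python
import re

# Hypothetical helper, not from the original notebook:
# an "@" followed by one or more word characters (letters, digits, underscore).
MENTION_PATTERN = re.compile(r"@\w+")


def remove_mentions(review: str) -> str:
    """Drop @username handles and collapse the whitespace left behind."""
    without_mentions = MENTION_PATTERN.sub("", review)
    return " ".join(without_mentions.split())


# Sample taken from row 3 of the preview above.
sample = "@petewentz im so jealous. i want an octo drive"
cleaned = remove_mentions(sample)
print(cleaned)
```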

In [9]:
import nltk

In [10]:
from nltk.corpus import stopwords

In [11]:
corpus = []

In [12]:
review_series = twitter_df["Review"]

#### Text cleaning follows these steps:
1) Converting all data to lower case<br>
2) Removing punctuation<br>
3) Removing HTML tags<br>
4) Removing stopwords<br>
5) Performing lemmatization<br>
> a) Using the spaCy lemmatizer<br>
> b) Using the TextBlob lemmatizer
 

In [13]:
review_series = review_series.str.lower()

In [14]:
import string
import re

In [15]:
punctuations = string.punctuation

In [16]:
def remove_punctuations(review):
    return review.translate(str.maketrans('','',punctuations))

In [17]:
review_series = review_series.apply(remove_punctuations)

In [18]:
review_series

0        today im working on my quotquirky qquot cue or...
1        shannonelizab dont ya know people love the hum...
2        ughhh rejected from the 09 mediation program s...
3             petewentz im so jealous i want an octo drive
4        i remember all the hype around this movie when...
                               ...                        
44095    the mother is a weird lowbudget movie touching...
44096    it started off weird the middle was weird and ...
44097    i was amazed at the quick arrival of the two o...
44098    attractive marjoriefarrah fawcettlives in fear...
44099    refugee me gets quotyour video will start in 1...
Name: Review, Length: 44100, dtype: object

In [19]:
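def remove_html_tags(review):
    # Caution: "<" and ">" are part of string.punctuation and were already
    # stripped in In [17], so this pattern finds nothing to remove at this point.
    review = re.sub('<.*?>','',review)
    return review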

In [20]:
review_series = review_series.apply(remove_html_tags)
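
The order of the cleaning steps matters. Because punctuation is stripped first, HTML entities such as `&quot;` turn into the stray token `quot` (visible in the output of In [18]), and the tag pattern has no angle brackets left to match. The sketch below shows one possible reordering: decode entities and strip tags first, then lower-case and remove punctuation. `clean_review`, `naive_clean`, and the sample string are made up for illustration and are not part of the original notebook.

```python
import html
import re
import string

TAG_PATTERN = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(review: str) -> str:
    """Hypothetical reordered pipeline: entities and tags first, punctuation last."""
    review = html.unescape(review)           # "&quot;" becomes a plain double quote
    review = TAG_PATTERN.sub(" ", review)    # tags are removed while "<" and ">" still exist
    review = review.lower().translate(PUNCT_TABLE)
    return " ".join(review.split())          # collapse repeated whitespace


def naive_clean(review: str) -> str:
    """Same order as the notebook: lower-case, punctuation, then tags."""
    review = review.lower().translate(PUNCT_TABLE)
    return TAG_PATTERN.sub("", review)       # nothing left to match here


# Made-up sample containing an HTML entity and a tag.
raw = "I loved the &quot;Quirky Q&quot; cue.<br />Great!"
reordered = clean_review(raw)
naive = naive_clean(raw)
print(reordered)
print(naive)
```

With the reordered version the entity and the tag disappear cleanly, whereas the naive order leaves fragments such as `quotquirky` and `cuebr` in the text.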

#### Steps 4 and 5 are combined rather than run separately: words that are not stopwords are lemmatized and retained.

In [21]:
import spacy

!pip install textblob

In [22]:
from textblob import TextBlob,Word

#### Performing lemmatization with spaCy

In [23]:
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS

nlp = en_core_web_sm.load()

In [24]:
dask_review = dd.from_pandas(review_series,npartitions=3)

In [25]:
def remove_stopwords_and_lemmatize_spacy(review):
    # Keep the lemma of every token that spaCy does not flag as a stopword.
    doc = nlp(review)
    return " ".join([token.lemma_ for token in doc if not token.is_stop])

In [26]:
dask_review = dask_review.apply(lambda review: remove_stopwords_and_lemmatize_spacy(review)) 

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('Review', 'object'))



In [27]:
type(dask_review)

dask.dataframe.core.Series

In [28]:
dask_review = client.persist(dask_review)

In [29]:
dask_review.head(10)

0    today be work quotquirky qquot cue maybe concerto
1    shannonelizab not ya know people love human so...
2           ughhh reject 09 mediation program suckssss
3                 petewentz be jealous want octo drive
4    remember hype movie aaliyah kill fan ms rice n...
5    like bit friend hate s sex s little nudity epi...
6                               love spring definitely
7                          jeg007jeg yay   coutchcouch
8    work store facebook group get ready relax play...
9    falselove oh s good 4 haunted housesound baker...
Name: Review, dtype: object

#### Vectorizing the text with TF-IDF

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
tfidf_vectorizer = TfidfVectorizer(max_df = 0.7)

In [32]:
X = tfidf_vectorizer.fit_transform(dask_review)
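
To make the vectorization step more concrete, here is a simplified pure-Python sketch of what `TfidfVectorizer` computes with its default settings: raw term counts multiplied by the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy corpus, the whitespace tokenization, and names such as `tfidf_row` are assumptions made for illustration; the real vectorizer tokenizes differently and returns a sparse matrix.

```python
import math
from collections import Counter

# Made-up toy corpus, not taken from the dataset.
docs = [
    "love this movie",
    "hate this movie",
    "love this spring",
    "this weird movie",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# max_df=0.7 drops terms that occur in more than 70% of the documents.
max_df = 0.7
vocabulary = sorted(t for t, df in doc_freq.items() if df <= max_df * n_docs)

# Smoothed inverse document frequency (scikit-learn's default, smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF weights of one document as a dict."""
    counts = Counter(t for t in tokens if t in idf)
    raw = {t: count * idf[t] for t, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else {}


weights = [tfidf_row(tokens) for tokens in tokenized]
print(vocabulary)
print(weights[2])
```

In this toy corpus `this` and `movie` exceed the `max_df` threshold and are dropped, and the rarer term `spring` receives a larger weight than `love` in the third document.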

In [33]:
type(X)

scipy.sparse.csr.csr_matrix

In [34]:
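# Caution: TF-IDF weights are floats between 0 and 1, so casting to uint8
# truncates almost all of them to 0. np.float32 would reduce memory without
# discarding the weights. The results below were produced with this cast.
X = X.astype(np.uint8)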
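
The effect of an integer cast on TF-IDF weights can be checked on a handful of values. The sketch below uses made-up weights in the range that L2-normalized TF-IDF produces (`tfidf_like` is a hypothetical name); it is an illustration of the cast, not a re-run of the notebook.

```python
import numpy as np

# Made-up weights in the [0, 1] range that L2-normalized TF-IDF produces.
tfidf_like = np.array([0.0, 0.31, 0.58, 1.0])

# Casting to an unsigned integer type truncates the fractional part.
as_uint8 = tfidf_like.astype(np.uint8)

# float32 halves the memory of float64 while keeping the weights.
as_float32 = tfidf_like.astype(np.float32)

print(as_uint8)
print(as_float32)
```

Only a weight of exactly 1 survives the `uint8` cast, so a matrix converted this way carries almost no information for a classifier. This may be one reason for the near-constant predictions reported further down.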

In [35]:
X.shape

(44100, 105281)

In [36]:
Y = twitter_df['Sentiment']

In [37]:
Y

0        2
1        1
2        0
3        0
4        0
        ..
44095    2
44096    2
44097    2
44098    2
44099    0
Name: Sentiment, Length: 44100, dtype: int64

#### Splitting the dataset into training and test set

In [38]:
from sklearn.model_selection import train_test_split as tts

In [39]:
X_train,X_test,Y_train,Y_test = tts(X,Y,test_size = 0.30,random_state = 0)

#### Building the different models:
1) Logistic Regression -> Base Model<br>
2) Random Forest Classifier<br>
3) XGBoost Classifier<br>

In [40]:
from sklearn.linear_model import LogisticRegression

In [41]:
logit_classifier = LogisticRegression(random_state = 0)

In [42]:
logit_classifier.fit(X_train,Y_train)

LogisticRegression(random_state=0)

In [43]:
Y_pred = logit_classifier.predict(X_test)

In [44]:
from sklearn.metrics import confusion_matrix 

In [45]:
cm = confusion_matrix(Y_test,Y_pred)

In [46]:
cm

array([[5692,    1,    6],
       [1844,    8,    4],
       [5664,    6,    5]], dtype=int64)

In [47]:
from sklearn.metrics import classification_report 

In [48]:
print(classification_report(Y_test,Y_pred))

              precision    recall  f1-score   support

           0       0.43      1.00      0.60      5699
           1       0.53      0.00      0.01      1856
           2       0.33      0.00      0.00      5675

    accuracy                           0.43     13230
   macro avg       0.43      0.33      0.20     13230
weighted avg       0.40      0.43      0.26     13230
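
The figures in the report can be reproduced by hand from the confusion matrix printed in In [46], which also makes the failure mode easier to see. The sketch below is plain Python; the variable names are illustrative.

```python
# Confusion matrix from In [46]: rows are true classes, columns are predictions.
cm = [
    [5692, 1, 6],
    [1844, 8, 4],
    [5664, 6, 5],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

# Recall per class: diagonal entry over the row sum (all true members of the class).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision per class: diagonal entry over the column sum (all predictions of the class).
column_sums = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]
precision = [cm[i][i] / column_sums[i] for i in range(n_classes)]

# Share of test samples assigned to each class by the model.
predicted_share = [column_sum / total for column_sum in column_sums]

print(round(accuracy, 2))
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
print([round(s, 4) for s in predicted_share])
```

Nearly every test sample is assigned to class 0, so the accuracy of 0.43 is essentially the share of class 0 in the test set rather than evidence of learned structure.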



#### Performing hyperparameter tuning for the logistic regression classifier<br>
 - Two hyperparameters are tuned:<br>
     (1) Solver<br>
     (2) C (the inverse of the regularization strength)
 - Grid search with cross-validation (GridSearchCV) is used to carry out the tuning.


#### Setting up stratified 10-fold cross-validation on the data

In [49]:
from sklearn.model_selection import StratifiedKFold

In [50]:
from sklearn.model_selection import GridSearchCV

In [51]:
cv = StratifiedKFold(n_splits = 10,random_state = 0, shuffle = True)

In [52]:
solvers = ['newton-cg', 'lbfgs', 'liblinear']

In [53]:
c_values = [100,10,1,0.1,0.01]

In [54]:
grid_params = dict(solver = solvers,C = c_values)

In [55]:
grid_cv = GridSearchCV(estimator = logit_classifier,param_grid = grid_params,scoring='accuracy',cv=cv,n_jobs=-1)
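
It helps to estimate the cost of the search before starting it. The sketch below counts the model fits implied by the grid and the 10-fold split defined above; it is plain Python and does not run the search itself.

```python
from itertools import product

# Same grid and fold count as in the cells above.
solvers = ['newton-cg', 'lbfgs', 'liblinear']
c_values = [100, 10, 1, 0.1, 0.01]
n_splits = 10

# Every (solver, C) pair is one candidate; each candidate is fitted once per fold.
candidates = list(product(solvers, c_values))
n_fits = len(candidates) * n_splits

print(len(candidates), n_fits)
```

With its default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set.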

In [56]:
result_grid_cv = grid_cv.fit(X_train,Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [57]:
print(f"Best score and optimal hyperparameter values are: {result_grid_cv.best_score_,result_grid_cv.best_params_}")

Best score and optimal hyperparameter values are: (0.4409782960803369, {'C': 100, 'solver': 'lbfgs'})


#### Even with tuning, the model performs poorly (best cross-validated accuracy of about 0.44). The next step is to try bagging and boosting algorithms to build a better model.