### Implementing Sentiment Analysis for the Machine Hack Hackathon using the Tf-IDF model

#### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
twitter_df = pd.read_csv("train.csv")

In [3]:
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44100 entries, 0 to 44099
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         44100 non-null  int64 
 1   author     44100 non-null  object
 2   Review     44100 non-null  object
 3   Sentiment  44100 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 1.3+ MB


In [4]:
twitter_df.head(10)

Unnamed: 0,ID,author,Review,Sentiment
0,39467,rayinstirling,Today I'm working on my &quot;Quirky Q&quot; c...,2
1,30154,DirtyRose17,@ShannonElizab dont ya know? people love the h...,1
2,16767,yoliemichelle,ughhh rejected from the 09 mediation program. ...,0
3,9334,jayamelwani,@petewentz im so jealous. i want an octo drive,0
4,61178,aliisanoun,I remember all the hype around this movie when...,0
5,54688,empressjazzy1,I liked this quite a bit but I have friends th...,2
6,34838,lorrief,loving that spring definitely seems to be here...,2
7,28520,GE0RGIE,@jeg007jeg yay coutch:couch,2
8,31974,BrandonCarlson,"Working on the store's Facebook group, getting...",2
9,14323,allshookup,@falselove OH THAT'S GOOD! My top 4 are: The H...,0


It can be seen that data cleaning is required. This is because:
- Some reviews have the user's info seen as: @username
- There are words which do not necessarily contribute to the sentiment expressed.

In [5]:
import nltk

In [8]:
from nltk.corpus import stopwords

In [9]:
corpus = []

In [10]:
review_series = twitter_df["Review"]

#### Starting with the text cleaning and following these steps:
1) Converting all data to lower case<br>
2) Removing punctuations<br>
3) Removing HTML tags<br>
4) Removing stopwords<br>
5) Performing lemmatization<br>
> a) Using Spacy lemmatizer<br>
> b) Using textblob lemmatizer
 

In [11]:
review_series = review_series.str.lower()

In [19]:
import string
import re

In [13]:
punctuations = string.punctuation

In [14]:
def remove_punctuations(review):
    return review.translate(str.maketrans('','',punctuations))

In [15]:
review_series = review_series.apply(lambda review: remove_punctuations(review))

In [16]:
review_series

0        today im working on my quotquirky qquot cue or...
1        shannonelizab dont ya know people love the hum...
2        ughhh rejected from the 09 mediation program s...
3             petewentz im so jealous i want an octo drive
4        i remember all the hype around this movie when...
                               ...                        
44095    the mother is a weird lowbudget movie touching...
44096    it started off weird the middle was weird and ...
44097    i was amazed at the quick arrival of the two o...
44098    attractive marjoriefarrah fawcettlives in fear...
44099    refugee me gets quotyour video will start in 1...
Name: Review, Length: 44100, dtype: object

In [17]:
def remove_html_tags(review):
    review = re.sub('<.*?>','',review)
    return review

In [20]:
review_series = review_series.apply(lambda review:remove_html_tags(review))

#### Instead of performing steps 4 and 5 separately, they will be combined in such a way that words not part of the stopwords will be lemmatized and retained.

In [21]:
import spacy

!pip install textblob

In [22]:
from textblob import TextBlob,Word

#### Performing lemmatization with spacy

In [23]:
import en_core_web_sm

nlp = en_core_web_sm.load()

In [25]:
def remove_stopwords_and_lemmatize_spacy(review):
    doc = nlp(review)
    return " ".join([token.lemma_ for token in doc if token not in set(stopwords.words('english'))])

In [26]:
review_series = review_series.apply(lambda review: remove_stopwords_and_lemmatize_spacy(review)) 

In [27]:
review_series

0        today -PRON- be work on -PRON- quotquirky qquo...
1        shannonelizab do not ya know people love the h...
2        ughhh reject from the 09 mediation program suc...
3        petewentz -PRON- be so jealous i want an octo ...
4        i remember all the hype around this movie when...
                               ...                        
44095    the mother be a weird lowbudget movie touch at...
44096    -PRON- start off weird the middle be weird and...
44097    i be amazed at the quick arrival of the two or...
44098    attractive marjoriefarrah fawcettlive in fear ...
44099    refugee -PRON- get quotyour video will start i...
Name: Review, Length: 44100, dtype: object

#### Performing word-embeddings with TfIDF