# <font color="#0084B4">Twitter Sentiment Analysis </font>

#### Dataset sourced from: [Sentiment140](http://help.sentiment140.com/for-students/)


The data is a CSV file with emoticons removed. Each row has 6 fields: <br>
0 - the polarity of the tweet (0 = negative, 4 = positive) <br>
1 - the id of the tweet (2087) <br>
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009) <br>
3 - the query (lyx). If there is no query, then this value is
NO_QUERY. <br>
4 - the user that tweeted (robotickilldozr) <br>
5 - the text of the tweet (Lyx is cool) <br>
   [Link](http://help.sentiment140.com/for-students/)
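
As a rough illustration of the six-field layout described above, the sketch below parses two made-up rows with the standard library `csv` module and maps the polarity codes to readable labels. The rows, the `FIELDS` and `LABELS` names, and the assumption that every field is double-quoted are illustrative only.

```python
import csv
import io

# Two made-up rows in the six-field layout (not real records from the dataset).
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
    '"0","2088","Sat May 16 23:59:10 UTC 2009","NO_QUERY","example_user","Missed the bus, again"\n'
)

FIELDS = ["sentiment", "id", "date", "query_string", "user", "tweet"]
LABELS = {"0": "negative", "4": "positive"}  # polarity codes from the field list

# csv.reader handles the quoted comma inside the second tweet.
rows = [dict(zip(FIELDS, record)) for record in csv.reader(sample)]
for row in rows:
    row["label"] = LABELS[row["sentiment"]]
    print(row["label"], "|", row["tweet"])
```

In the notebook itself, the `names=cols` argument passed to `pd.read_csv` plays the same role as `FIELDS` here.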

In [None]:
import sklearn

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install --upgrade --no-cache-dir gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gdown
  Downloading gdown-4.6.0-py3-none-any.whl (14 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.4.0
    Uninstalling gdown-4.4.0:
      Successfully uninstalled gdown-4.4.0
Successfully installed gdown-4.6.0


In [None]:
!gdown 1y7-125i4kI0F308iVFDlcFg4MoOPheuJ

Downloading...
From: https://drive.google.com/uc?id=1y7-125i4kI0F308iVFDlcFg4MoOPheuJ
To: /content/misc_data
100% 84.9M/84.9M [00:02<00:00, 33.7MB/s]


In [None]:
!unzip misc_data

Archive:  misc_data
  inflating: training.1600000.processed.noemoticon.csv  


In [None]:
cols = ['sentiment','id','date','query_string','user','tweet']
BASE_DIR = ""
df_tweets = pd.read_csv(os.path.join(BASE_DIR,'training.1600000.processed.noemoticon.csv'),encoding="latin-1",names=cols)

In [None]:
df_tweets.tail()

Unnamed: 0,sentiment,id,date,query_string,user,tweet
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


In [None]:
df_tweets.columns

Index(['sentiment', 'id', 'date', 'query_string', 'user', 'tweet'], dtype='object')

<font color="fuchsia">Clean Tweets</font>

In [None]:
def clean(raw):
  # Replace anchor elements with the placeholder 'Link'
  result = re.sub("<[a][^>]*>(.+?)</[a]>",'Link',raw)
  # Drop or decode HTML entities
  result = re.sub('&gt;',"",result)
  result = re.sub('&#x27;',"'",result)
  result = re.sub('&quot;','"',result)
  result = re.sub('&#x2F;'," ",result)
  result = re.sub('&#62;','',result)
  # Strip simple formatting tags and newlines
  result = re.sub('<p>'," ",result)
  result = re.sub('</i>','',result)
  result = re.sub('<i>','',result)
  result = re.sub("\n","",result)
  return result

In [None]:
df_tweets['clean_tweet'] = df_tweets.tweet.apply(func = clean)

In [None]:
df_tweets.head()

Unnamed: 0,sentiment,id,date,query_string,user,tweet,clean_tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t...","@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all....","@nationwideclass no, it's not behaving at all...."
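
Entity handling like that in `clean` could also be delegated to the standard library. The sketch below is a hypothetical variant, not the notebook's function: `clean_with_unescape` and the sample string are made up, and unlike `clean` it decodes every entity (so `&gt;` becomes `>` instead of being dropped) and replaces newlines with a space.

```python
import html
import re

def clean_with_unescape(raw: str) -> str:
    """Hypothetical variant of clean(): strip simple tags, then decode entities."""
    text = re.sub(r"<a[^>]*>(.+?)</a>", "Link", raw)  # anchor element -> 'Link'
    text = re.sub(r"</?(?:p|i)>", " ", text)          # drop <p>, <i> and </i>
    text = html.unescape(text)                        # decode &quot;, &#x27;, &#x2F;, ...
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace and newlines

sample = 'I &quot;love&quot; this <i>song</i>\nit&#x27;s great'
print(clean_with_unescape(sample))
```

Tags are removed before decoding so that escaped markup such as `&lt;i&gt;` is kept as literal text rather than being stripped as a tag.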


Split Dataset

In [None]:
df_train, df_test = train_test_split(df_tweets,test_size=0.3, stratify=df_tweets['sentiment'],random_state=21)
print(df_train.shape,df_test.shape)

(1120000, 7) (480000, 7)


TF-IDF Vector
Create a TF-IDF vector from the cleaned tweet text
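
To make the vectorization step concrete, here is a minimal sketch on three made-up tweets. The `docs` list and the small `max_features=5` value are assumptions chosen only to keep the output readable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up tweets, not taken from the dataset.
docs = ["good morning world", "bad morning traffic", "good good coffee"]

# Keep only the 5 most frequent terms, analogous to max_features=1000 on the real data.
vectorizer = TfidfVectorizer(lowercase=True, max_features=5)
matrix = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)  # the terms that were kept
print(vocab)
print(matrix.shape)                     # (number of documents, number of kept terms)
print(matrix.toarray().round(2))
```

Each row of the resulting sparse matrix is one document and each column is one vocabulary term, which is the shape the classifier later receives.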

In [None]:
tfidf_vectorizer = TfidfVectorizer(lowercase=True,max_features=1000,stop_words=ENGLISH_STOP_WORDS)
tfidf_vectorizer.fit(df_train.clean_tweet)

TfidfVectorizer(max_features=1000,
                stop_words=frozenset({'a', 'about', 'above', 'across', 'after',
                                      'afterwards', 'again', 'against', 'all',
                                      'almost', 'alone', 'along', 'already',
                                      'also', 'although', 'always', 'am',
                                      'among', 'amongst', 'amoungst', 'amount',
                                      'an', 'and', 'another', 'any', 'anyhow',
                                      'anyone', 'anything', 'anyway',
                                      'anywhere', ...}))

Transform the train and test data

In [None]:
#vectorize
train_idf = tfidf_vectorizer.transform(df_train.clean_tweet)
test_idf = tfidf_vectorizer.transform(df_test.clean_tweet)

#### Random Forest Model
Create the Random Forest model object  <br>
Fit the model on the training data      <br>
Predict the labels for the training data    <br>
Predict the labels for the test data        <br>
Compute a score (precision here) on both sets

In [None]:
model_rf = RandomForestClassifier(n_estimators=20)
model_rf.fit(train_idf, df_train.sentiment)
predict_train = model_rf.predict(train_idf)
predict_test = model_rf.predict(test_idf)

This cell took about one hour to run.
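
Training time for a random forest can often be reduced by building trees in parallel and limiting their depth. The sketch below uses synthetic data as a stand-in for the TF-IDF matrix; the `max_depth=10` value is an assumption that would need tuning, and no speed-up on the real dataset has been measured here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TF-IDF features; sizes are arbitrary.
X, y = make_classification(n_samples=500, n_features=20, random_state=21)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=21
)

forest = RandomForestClassifier(
    n_estimators=20,
    max_depth=10,     # cap tree depth (assumed value, needs tuning)
    n_jobs=-1,        # build trees on all available CPU cores
    random_state=21,  # make the run reproducible
)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)
print(predictions.shape)
```

Limiting depth also tends to narrow the gap between training and test scores, at the cost of some training accuracy.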

In [None]:
print(sklearn.metrics.precision_score(y_true=df_train.sentiment,y_pred=predict_train, pos_label=4))
print(sklearn.metrics.precision_score(y_true=df_test.sentiment,y_pred=predict_test, pos_label=4))

0.907308645428451
0.7253043775093874


The model reaches 90.7% precision on the training data
but only 72.5% on the test data.

The gap suggests some overfitting, but the test score is good enough to work with.
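
For reference, precision for the positive class (label 4) is the share of predicted positives that are truly positive, TP / (TP + FP). The sketch below computes it by hand on made-up labels; the `precision` helper is hypothetical and only mirrors what `precision_score(..., pos_label=4)` reports.

```python
def precision(y_true, y_pred, pos_label=4):
    """Fraction of predicted positives that are truly positive: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == pos_label and t == pos_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == pos_label and t != pos_label)
    return tp / (tp + fp) if (tp + fp) else 0.0

# Made-up labels: three tweets are predicted positive, two of them correctly.
y_true = [4, 4, 0, 0, 4, 0]
y_pred = [4, 0, 4, 0, 4, 0]
print(precision(y_true, y_pred))  # 2 true positives / 3 predicted positives
```

Precision ignores positives the model missed, so recall or F1 would be needed for a fuller picture.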

### Pipeline


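A pipeline combines the previous objects (vectorizer and model) into one object.<br>
It can be saved to disk as a single unit, <br>
which makes deployment easier.

When we call fit() on a pipeline object, both steps are executed: the vectorizer is fitted and transforms the text, then the model is fitted on the result.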
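
A minimal sketch of that behaviour on made-up tweets: a single `fit` call trains both steps, and a single `predict` call accepts raw text. The `texts`, `labels`, and `toy_pipeline` names are illustrative, and four tweets are far too few for a meaningful model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up tweets using the dataset's label convention (0 = negative, 4 = positive).
texts = ["love this sunny day", "hate the rain", "love my friends", "hate waiting in traffic"]
labels = [4, 0, 4, 0]

toy_pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("model", RandomForestClassifier(n_estimators=10, random_state=21)),
])

# fit() fits the vectorizer, transforms the text, then fits the classifier.
toy_pipeline.fit(texts, labels)

# predict() applies the fitted vectorizer before calling the classifier.
prediction = toy_pipeline.predict(["love the sunny traffic"])
print(prediction)
```

Because the vectorizer lives inside the pipeline, there is no separate `transform` call to remember at prediction time.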

In [None]:
pipeline = Pipeline(steps = [('tfidf',TfidfVectorizer(lowercase=True,max_features=1000,stop_words=ENGLISH_STOP_WORDS)),
                             ('model',RandomForestClassifier(n_estimators=10))
                             ])

pipeline.fit(df_train.clean_tweet, df_train.sentiment)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(max_features=1000,
                                 stop_words=frozenset({'a', 'about', 'above',
                                                       'across', 'after',
                                                       'afterwards', 'again',
                                                       'against', 'all',
                                                       'almost', 'alone',
                                                       'along', 'already',
                                                       'also', 'although',
                                                       'always', 'am', 'among',
                                                       'amongst', 'amoungst',
                                                       'amount', 'an', 'and',
                                                       'another', 'any',
                                                       'anyhow', 'anyone',
           

##### <font color="fuchsia">Dump Model Using Joblib</font>
<font color="turquoise">
joblib.dump() takes any Python object and stores it to disk.<br>
joblib.dump() and joblib.load() are based on the Python pickle serialization model,<br>
 which means that arbitrary Python code can be executed when loading a serialized object with joblib.load(), so only load files from trusted sources. <br>
Save the preprocessing parameters and model parameters of this pipeline to disk and load them whenever needed.
</font>
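
A minimal sketch of the save and load round trip, assuming a small pipeline fitted on made-up tweets. The file name `sentiment_pipeline.joblib` and the temporary directory are hypothetical choices for the example.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up tweets using the dataset's label convention (0 = negative, 4 = positive).
texts = ["love this sunny day", "hate the rain", "love my friends", "hate waiting in traffic"]
labels = [4, 0, 4, 0]

toy_pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("model", RandomForestClassifier(n_estimators=10, random_state=21)),
])
toy_pipeline.fit(texts, labels)

# Hypothetical file name, written to a temporary directory to avoid clutter.
path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.joblib")
joblib.dump(toy_pipeline, path)

# Only load files you created yourself or otherwise trust.
restored = joblib.load(path)
same = list(restored.predict(texts)) == list(toy_pipeline.predict(texts))
print(same)
```

The restored object carries both the fitted vocabulary and the fitted forest, so it can score raw text without refitting.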