## Twitter Sentiment Analysis

### Part 2: sentiment140 Pre-processing for ML

The code was originally inspired by Gaurav Singhal's guide: [Building a Twitter Sentiment Analysis in Python.](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python)

The data comes from Marios Michailidis' sentiment140 dataset hosted on [Kaggle.](https://www.kaggle.com/kazanova/sentiment140/)


### Load Cleaned Data

In [1]:
import time
import load_data as ld

start = time.perf_counter()

df = ld.run_processes()

finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s)')

Finished in 9.34 second(s)


In [2]:
df.head()

| | target | text | tokenized | filtered | stemmed |
|---|---|---|---|---|---|
| 0 | 0 | is upset that he can't update his Facebook by ... | is upset that he cant update his facebook by t... | upset cant update his facebook texting might c... | upset cant updat hi facebook text might cri re... |
| 1 | 0 | @Kenichan I dived many times for the ball. Man... | i dived many times for the ball managed to sav... | i dived many times ball managed save 50 rest g... | i dive mani time ball manag save 50 rest go ou... |
| 2 | 0 | my whole body feels itchy and like its on fire | my whole body feels itchy and like its on fire | my whole body feels itchy like fire | my whole bodi feel itchi like fire |
| 3 | 0 | @nationwideclass no, it's not behaving at all.... | no its not behaving at all im mad why am i her... | no not behaving all im mad why am i here becau... | no not behav all im mad whi am i here becaus i... |
| 4 | 0 | @Kwesidei not the whole crew | not the whole crew | not whole crew | not whole crew |


In [3]:
df.tail()

| | target | text | tokenized | filtered | stemmed |
|---|---|---|---|---|---|
| 1599994 | 1 | Just woke up. Having no school is the best fee... | just woke up having no school is the best feel... | just woke up having no school best feeling ever | just woke up have no school best feel ever |
| 1599995 | 1 | TheWDB.com - Very cool to hear old Walt interv... | very cool to hear old walt interviews | very cool hear old walt interviews | veri cool hear old walt interview |
| 1599996 | 1 | Are you ready for your MoJo Makeover? Ask me f... | are you ready for your mojo makeover ask me fo... | you ready your mojo makeover ask me details | you readi your mojo makeov ask me detail |
| 1599997 | 1 | Happy 38th Birthday to my boo of alll time!!! ... | happy 38th birthday to my boo of alll time tup... | happy 38th birthday my boo alll time tupac ama... | happi 38th birthday my boo alll time tupac ama... |
| 1599998 | 1 | happy #charitytuesday @theNSPCC @SparksCharity... | happy charitytuesday | happy charitytuesday | happi charitytuesday |


### Preprocessing Data for ML

In [4]:
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

### Drop NAs Created During Cleanup

In [33]:
df.isnull().sum()

target          0
text            0
tokenized    3696
filtered     3819
stemmed      3819
dtype: int64

In [35]:
dfm = df.dropna()

In [36]:
dfm.isnull().sum()

target       0
text         0
tokenized    0
filtered     0
stemmed      0
dtype: int64

In [37]:
df.shape, dfm.shape

((1599999, 5), (1596180, 5))
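
The shapes above show that 1,599,999 - 1,596,180 = 3,819 rows were dropped, which equals the NA count of `filtered` and `stemmed`. Every row missing `tokenized` is therefore also missing `filtered`. The same bookkeeping can be sketched on a toy frame; the rows below are hypothetical, and the idea that the NaNs come from tweets left empty by cleanup is an assumption based on the heading above.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the cleaned columns (hypothetical rows).
# NaN stands in for text that became empty during cleanup (assumed cause).
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "text": ["@user", "the and", "so happy", "bad day"],
    "tokenized": [np.nan, "the and", "so happy", "bad day"],
    "filtered": [np.nan, np.nan, "happy", "bad day"],
})

# Drop every row that has a missing value in any column, like df.dropna()
toy_clean = toy.dropna()

# Rows dropped should match the largest per-column NA count when the
# NA rows of the other columns are a subset of that column's NA rows
rows_dropped = len(toy) - len(toy_clean)
max_na = int(toy.isnull().sum().max())
print(rows_dropped, max_na)
```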

### Vectorize with TF-IDF

Bag of Words and N-grams are alternative representations worth trying as well.
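
As a rough illustration of those two alternatives, the sketch below builds a plain Bag of Words matrix with `CountVectorizer` and a TF-IDF matrix that also includes bigrams via `ngram_range=(1, 2)`. The three-document corpus is a stand-in for one of the cleaned text columns, and the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for one of the cleaned text columns (illustrative only)
docs = [
    "not whole crew",
    "my whole body feels itchy like fire",
    "very cool hear old walt interviews",
]

# Bag of Words: raw term counts over unigrams
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# TF-IDF over unigrams and bigrams
tfidf_ngrams = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))
X_ngrams = tfidf_ngrams.fit_transform(docs)

# Adding bigrams widens the feature space
print(X_bow.shape, X_ngrams.shape)
```

On the full dataset the bigram vocabulary would be far larger than the unigram one, so parameters such as `min_df` or `max_features` may be needed to keep it manageable.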

In [76]:
# sublinear_tf=True replaces each raw term count tf with 1 + log(tf)
vectorizer = TfidfVectorizer(sublinear_tf=True)
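
To make the weighting concrete, here is a small sketch that recomputes one document's TF-IDF vector by hand and compares it with the vectorizer's output. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`); the two-document corpus is made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good good day", "bad day"]
vec = TfidfVectorizer(sublinear_tf=True)
X = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_  # maps term -> column index

# Recompute the weights of docs[0] by hand
n_docs = len(docs)
tf = {"good": 2, "day": 1}  # raw counts in docs[0]
df = {"good": 1, "day": 2}  # number of documents containing each term

# Sublinear tf: 1 + ln(tf); smoothed idf: ln((1 + n) / (1 + df)) + 1
raw = {
    t: (1 + np.log(tf[t])) * (np.log((1 + n_docs) / (1 + df[t])) + 1)
    for t in tf
}

# L2-normalize the document vector
norm = np.sqrt(sum(v ** 2 for v in raw.values()))
manual = {t: v / norm for t, v in raw.items()}

for term, weight in manual.items():
    print(term, weight, X[0, vocab[term]])
```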

In [56]:
# Labels: column 0 is `target` (0 or 1 in the rows shown above)
y = np.array(dfm.iloc[:, 0]).ravel()

### Tokenized

In [38]:
# Features: column 2 is `tokenized`
X = vectorizer.fit_transform(np.array(dfm.iloc[:, 2]).ravel())

In [61]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [65]:
# Naive Bayes
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

0.775495244897192


In [71]:
# Logistic Regression
from sklearn import preprocessing

LR_model = LogisticRegression(solver='lbfgs', max_iter=1000)

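# Scale each feature to unit variance (train and test are scaled separately here);
# with_mean=False skips centering, which keeps the matrices sparse
X_train_scaled = preprocessing.scale(X_train, with_mean=False)
X_test_scaled = preprocessing.scale(X_test, with_mean=False)
LR_model.fit(X_train_scaled, y_train)

y_predict_lr = LR_model.predict(X_test_scaled)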

print(accuracy_score(y_test, y_predict_lr))

0.7751099500056384
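
The cell above scales the training and test matrices independently with `preprocessing.scale`, so each split is divided by its own per-feature standard deviation. A common alternative is to fit a `StandardScaler` on the training split and reuse it on the test split. This is not what produced the accuracy above, and the matrices below are toy stand-ins for `X_train` and `X_test`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Tiny sparse matrices standing in for X_train and X_test (illustrative values)
X_train_toy = csr_matrix(np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]))
X_test_toy = csr_matrix(np.array([[2.0, 1.0]]))

# with_mean=False keeps the data sparse: features are divided by their
# standard deviation but not centered
scaler = StandardScaler(with_mean=False)
X_train_toy_scaled = scaler.fit_transform(X_train_toy)  # learn the scale from train only
X_test_toy_scaled = scaler.transform(X_test_toy)        # reuse the same scale on test

print(scaler.scale_)
print(X_test_toy_scaled.toarray())
```

Whether this changes the reported accuracy on the full dataset is untested here; the point is only that both splits then share one set of scaling statistics.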


On the tokenized text, Naive Bayes reaches about 77.5% accuracy (0.7755) and Logistic Regression is essentially tied at about 77.5% (0.7751). These figures are recorded before stop-word filtering and stemming, which the following sections test separately. Using better techniques, you might get better accuracy.


### Filtered

In [72]:
# Features: column 3 is `filtered`
X = vectorizer.fit_transform(np.array(dfm.iloc[:, 3]).ravel())

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [74]:
# Naive Bayes
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

0.7747685098171886


In [75]:
# Logistic Regression
LR_model = LogisticRegression(solver='lbfgs', max_iter=1000)

X_train_scaled = preprocessing.scale(X_train, with_mean=False)
X_test_scaled = preprocessing.scale(X_test, with_mean=False)
LR_model.fit(X_train_scaled, y_train)

y_predict_lr = LR_model.predict(X_test_scaled)

print(accuracy_score(y_test, y_predict_lr))

0.7742391209011515


### Stemmed

In [77]:
# Features: column 4 is `stemmed`
X = vectorizer.fit_transform(np.array(dfm.iloc[:, 4]).ravel())

In [78]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [79]:
# Naive Bayes
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

0.7701355736821661


In [80]:
# Logistic Regression
LR_model = LogisticRegression(solver='lbfgs', max_iter=1000)

X_train_scaled = preprocessing.scale(X_train, with_mean=False)
X_test_scaled = preprocessing.scale(X_test, with_mean=False)
LR_model.fit(X_train_scaled, y_train)

y_predict_lr = LR_model.predict(X_test_scaled)

print(accuracy_score(y_test, y_predict_lr))

0.773697202069942


### Raw Text (No Preprocessing)

In [81]:
# Features: column 1 is the raw `text`
X = vectorizer.fit_transform(np.array(dfm.iloc[:, 1]).ravel())

In [82]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [83]:
# Naive Bayes
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

0.7747309200716711


In [84]:
# Logistic Regression
LR_model = LogisticRegression(solver='lbfgs', max_iter=1000)

X_train_scaled = preprocessing.scale(X_train, with_mean=False)
X_test_scaled = preprocessing.scale(X_test, with_mean=False)
LR_model.fit(X_train_scaled, y_train)

y_predict_lr = LR_model.predict(X_test_scaled)

print(accuracy_score(y_test, y_predict_lr))

0.7620788382262652
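
The four experiments above repeat the same vectorize, split, fit, and score steps on different columns. One way to cut the repetition is a small helper, sketched below with Naive Bayes only. `evaluate_column` is a hypothetical name, and the toy frame is a stand-in for `dfm` so the example runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def evaluate_column(frame, column, label="target"):
    """Vectorize one text column with TF-IDF and return Naive Bayes test accuracy."""
    X = TfidfVectorizer(sublinear_tf=True).fit_transform(frame[column])
    y = frame[label].to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = MultinomialNB().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Toy stand-in for dfm (hypothetical rows, only to make the sketch runnable)
toy = pd.DataFrame({
    "target": [0, 1] * 5,
    "text": ["so sad today", "so happy today"] * 5,
    "stemmed": ["sad today", "happi today"] * 5,
})

# One accuracy per column, mirroring the per-section cells above
scores = {col: evaluate_column(toy, col) for col in ["text", "stemmed"]}
print(scores)
```

With the real data, the call would presumably loop over `["text", "tokenized", "filtered", "stemmed"]` on `dfm`, which also makes the column choice explicit instead of relying on positional `iloc` indexes.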


Next steps: try other vectorization techniques, apply feature extraction and feature selection, and implement other ML models.
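
As one possible starting point for those ideas, the sketch below chains TF-IDF with bigrams, chi-squared feature selection, and a linear SVM in a single scikit-learn pipeline. The tweets, labels, and the choice of `k=10` are made up for illustration. `LinearSVC` is used instead of the `SVC` imported earlier on the assumption that it tends to scale better to large sparse data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy tweets and labels (0 = negative, 1 = positive)
texts = [
    "so sad today", "this is awful", "i feel terrible", "worst day ever",
    "so happy today", "this is great", "i feel wonderful", "best day ever",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# TF-IDF on unigrams and bigrams, then keep the k features with the highest
# chi-squared score against the label, then fit a linear SVM
pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
    SelectKBest(chi2, k=10),
    LinearSVC(),
)
pipeline.fit(texts, labels)

predictions = pipeline.predict(["what a great day", "i feel awful"])
print(predictions)
```

Because the vectorizer and selector live inside the pipeline, they are fitted on the training data only when the pipeline is used with `train_test_split` or cross-validation.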