## Twitter Sentiment Analysis

### Part 2: sentiment140 pre-processing for ML

The code was originally inspired by Gaurav Singhal's guide: [Building a Twitter Sentiment Analysis in Python.](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python)

The data comes from Marios Michailidis' sentiment140 dataset, hosted on [Kaggle.](https://www.kaggle.com/kazanova/sentiment140/)


### Load Cleaned Data

In [1]:
import time
import load_data as ld

start = time.perf_counter()

df = ld.run_processes()

finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s)')

Finished in 9.77 second(s)


In [2]:
df.head()

|   | target | text | tokenized | filtered | stemmed |
|---|---|---|---|---|---|
| 0 | 0 | is upset that he can't update his Facebook by ... | is upset that he cant update his facebook by t... | upset cant update his facebook texting might c... | upset cant updat hi facebook text might cri re... |
| 1 | 0 | @Kenichan I dived many times for the ball. Man... | i dived many times for the ball managed to sav... | i dived many times ball managed save 50 rest g... | i dive mani time ball manag save 50 rest go ou... |
| 2 | 0 | my whole body feels itchy and like its on fire | my whole body feels itchy and like its on fire | my whole body feels itchy like fire | my whole bodi feel itchi like fire |
| 3 | 0 | @nationwideclass no, it's not behaving at all.... | no its not behaving at all im mad why am i her... | no not behaving all im mad why am i here becau... | no not behav all im mad whi am i here becaus i... |
| 4 | 0 | @Kwesidei not the whole crew | not the whole crew | not whole crew | not whole crew |


In [3]:
df.tail()

|   | target | text | tokenized | filtered | stemmed |
|---|---|---|---|---|---|
| 1599994 | 1 | Just woke up. Having no school is the best fee... | just woke up having no school is the best feel... | just woke up having no school best feeling ever | just woke up have no school best feel ever |
| 1599995 | 1 | TheWDB.com - Very cool to hear old Walt interv... | very cool to hear old walt interviews | very cool hear old walt interviews | veri cool hear old walt interview |
| 1599996 | 1 | Are you ready for your MoJo Makeover? Ask me f... | are you ready for your mojo makeover ask me fo... | you ready your mojo makeover ask me details | you readi your mojo makeov ask me detail |
| 1599997 | 1 | Happy 38th Birthday to my boo of alll time!!! ... | happy 38th birthday to my boo of alll time tup... | happy 38th birthday my boo alll time tupac ama... | happi 38th birthday my boo alll time tupac ama... |
| 1599998 | 1 | happy #charitytuesday @theNSPCC @SparksCharity... | happy charitytuesday | happy charitytuesday | happi charitytuesday |


### Import ML pre-processing modules

In [4]:
import numpy as np
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

### Drop NAs created during cleanup

These rows hold tweets whose text became empty during cleanup. There is nothing meaningful to impute, and an empty text cannot be used for prediction, so we simply drop them.

In [5]:
df.isnull().sum()

target          0
text            0
tokenized    3696
filtered     3819
stemmed      3819
dtype: int64

In [47]:
dfm = df.dropna()
dfm.index = range(1,len(dfm) + 1)

In [48]:
len(df)-len(dfm)

3819
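
As a minimal sketch of the same step on a toy frame (the rows below are made up for illustration and are not taken from sentiment140), `dropna()` removes every row that has a missing value in any column. When the rows missing in one column are a subset of those missing in another, the number of dropped rows equals the largest per-column count, which is consistent with the 3819 seen above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "text": ["@user", "so tired", "the", "great day"],
    "tokenized": [np.nan, "so tired", "the", "great day"],
    "filtered": [np.nan, "tired", np.nan, "great day"],
})

# Drop rows with a missing value in any column, then renumber from 1,
# mirroring the cell above.
clean = toy.dropna()
clean.index = range(1, len(clean) + 1)

dropped = len(toy) - len(clean)
print(toy.isnull().sum())
print(f"rows dropped: {dropped}")
```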

### Vectorize with TF-IDF

`[TODO: Bag of Words, N-grams]`
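
One way the TODO above could be approached, as a sketch rather than the notebook's actual plan: scikit-learn's `CountVectorizer` produces plain bag-of-words counts, and the `ngram_range` parameter (available on both `CountVectorizer` and `TfidfVectorizer`) adds word n-grams. The two-tweet corpus is made up for illustration; it shows how a bigram such as "not happy" becomes its own feature.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus for illustration only.
docs = ["not happy today", "happy today"]

# Bag of words: one column per word, values are raw counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)
print(sorted(bow.vocabulary_))

# Unigrams plus bigrams, weighted with TF-IDF.
ngram = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))
X_ngram = ngram.fit_transform(docs)
print(sorted(ngram.vocabulary_))
```

On the full dataset, adding bigrams would enlarge the vocabulary considerably, so memory use and fit time would likely grow.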

In [85]:
#import pandas as pd
#test0 = dfm.loc[:50000, ].copy()
#test1 = dfm.loc[1546181:1596180, ].copy()
#test = pd.concat([test0,test1])
#test0.shape, test1.shape

# use function that feeds data to see learning curves... overfitting? underfitting?
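
The note about learning curves could be explored with scikit-learn's `learning_curve`, which refits a model on growing slices of the training data and reports training and validation scores for each size. The sketch below is an assumption about how that might look, not code from this notebook: it uses a made-up toy corpus and a TF-IDF plus Naive Bayes pipeline. A large gap between the two score curves would hint at overfitting; two low, close curves would hint at underfitting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; in the notebook this would be a
# text column of dfm and the target array y.
texts = np.array(["love this", "so happy today", "great fun", "best day ever",
                  "hate this", "so sad today", "awful day", "worst day ever"] * 5)
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0] * 5)

model = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())

# Fit on 25%, 50% and 100% of each training fold, with 4-fold CV.
sizes, train_scores, val_scores = learning_curve(
    model, texts, labels, train_sizes=[0.25, 0.5, 1.0], cv=4,
    scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:>3} samples  train={tr:.3f}  validation={va:.3f}")
```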

In [93]:
# get np array for target values (column 0 of dfm is 'target')
y = np.array(dfm.iloc[:, 0]).ravel()

# instantiate a vectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True)
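
To make `sublinear_tf=True` concrete: per the scikit-learn documentation, it replaces the raw term count tf with 1 + ln(tf), so a word repeated several times in one tweet counts only slightly more than a word used once. The sketch below reproduces that weighting in plain Python on a toy corpus, together with scikit-learn's default smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 normalization. The helper name `sublinear_tfidf` and the whitespace tokenization are simplifications for illustration; the real vectorizer uses its own tokenizer.

```python
import math
from collections import Counter

def sublinear_tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    rows = []
    for tokens in tokenized:
        weights = {}
        for term, tf in Counter(tokens).items():
            sub_tf = 1.0 + math.log(tf)                           # sublinear tf
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0   # smoothed idf
            weights[term] = sub_tf * idf
        # L2-normalize so each document vector has unit length
        # (fall back to 1.0 for an empty document).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["happy happy happy day", "sad day"]
rows = sublinear_tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first toy tweet, "happy" occurs three times, but its tf factor is 1 + ln(3), roughly 2.1, instead of 3.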

In [94]:
def get_accuracy(vectorizer, X, y):
    """Vectorize X, then fit and score Naive Bayes and Logistic Regression."""
    start = time.perf_counter()

    # vectorize (note: the vectorizer is fitted on all of X, before the split)
    X_fit = vectorizer.fit_transform(X)

    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X_fit, y, test_size=0.2, random_state=42)

    print('Accuracy')

    # Naive Bayes: fit and make predictions
    NB_model = MultinomialNB()
    NB_model.fit(X_train, y_train)
    y_predict_nb = NB_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_predict_nb)
    print(f'NB: {round(accuracy, 4)}')

    # Logistic Regression: fit and make predictions
    LR_model = LogisticRegression(solver='lbfgs', max_iter=1000)
    LR_model.fit(X_train, y_train)
    y_predict_lr = LR_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_predict_lr)
    print(f'LR: {round(accuracy, 4)}')

    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} secs')
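
Note that `get_accuracy` fits the vectorizer on all of `X` before splitting, so the vocabulary and IDF weights also see the test tweets. A common alternative is to split the raw texts first and fit the vectorizer on the training portion only. Whether this would change the scores reported in this notebook has not been checked here. A minimal sketch with a scikit-learn pipeline, using a made-up toy corpus in place of `dfm`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; in the notebook these would be a
# text column of dfm and the target array y.
texts = ["love this", "so happy today", "great fun", "best day ever",
         "hate this", "so sad today", "awful day", "worst day ever"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Split the raw texts first, so the test tweets stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# The pipeline fits TF-IDF on the training texts only, then the model.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    LogisticRegression(solver='lbfgs', max_iter=1000),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f'LR: {round(accuracy, 4)}')
```

Calling `model.predict` on the test texts reuses the vocabulary learned from the training texts; words that only occur in the test set are ignored.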

### Original

In [95]:
X = np.array(dfm.iloc[:, 1]).ravel()
get_accuracy(vectorizer, X, y)

Accuracy
NB: 0.7747
LR: 0.8045
Finished in 298.03 secs


### Tokenized

In [96]:
X = np.array(dfm.iloc[:, 2]).ravel()
get_accuracy(vectorizer, X, y)

Accuracy
NB: 0.7755
LR: 0.8006
Finished in 243.62 secs


### Filtered

In [97]:
X = np.array(dfm.iloc[:, 3]).ravel()
get_accuracy(vectorizer, X, y)

Accuracy
NB: 0.7748
LR: 0.7997
Finished in 220.1 secs


### Stemmed

In [98]:
X = np.array(dfm.iloc[:, 4]).ravel()
get_accuracy(vectorizer, X, y)

Accuracy
NB: 0.7701
LR: 0.7946
Finished in 247.27 secs


Next steps: use other vectorization techniques, apply feature extraction and feature selection, and implement other ML models.
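
As a rough sketch of what such a next step might look like (an assumption, not something run in this notebook), the pipeline below adds chi-squared feature selection with `SelectKBest` and swaps in `LinearSVC` as another model. `LinearSVC` is used here instead of the imported `SVC` because kernel SVMs scale poorly with the number of samples. The toy corpus and `k=5` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data for illustration only.
texts = ["love this", "so happy today", "great fun", "best day ever",
         "hate this", "so sad today", "awful day", "worst day ever"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    SelectKBest(chi2, k=5),  # keep the 5 terms most associated with the labels
    LinearSVC(),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
kept = int(model.named_steps["selectkbest"].get_support().sum())
print(f'SVM: {round(accuracy, 4)} using {kept} features')
```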