## Twitter Sentiment Analysis

### Part 2: sentiment140 dataset pre-processing for ML


The data comes from Marios Michailidis' sentiment140 dataset, hosted on [Kaggle](https://www.kaggle.com/kazanova/sentiment140/).

### Load Cleaned Data

In [1]:
import time
import load_data as ld

start = time.perf_counter()

df = ld.run_processes()

finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s)')

Finished in 9.64 second(s)


In [2]:
df.head()

| | target | text | tokenized | filtered | stemmed |
|---|---|---|---|---|---|
| 0 | 0 | is upset that he can't update his Facebook by ... | is upset that he cant update his facebook by t... | upset cant update his facebook texting might c... | upset cant updat hi facebook text might cri re... |
| 1 | 0 | @Kenichan I dived many times for the ball. Man... | i dived many times for the ball managed to sav... | i dived many times ball managed save 50 rest g... | i dive mani time ball manag save 50 rest go ou... |
| 2 | 0 | my whole body feels itchy and like its on fire | my whole body feels itchy and like its on fire | my whole body feels itchy like fire | my whole bodi feel itchi like fire |
| 3 | 0 | @nationwideclass no, it's not behaving at all.... | no its not behaving at all im mad why am i her... | no not behaving all im mad why am i here becau... | no not behav all im mad whi am i here becaus i... |
| 4 | 0 | @Kwesidei not the whole crew | not the whole crew | not whole crew | not whole crew |


### Import ML pre-processing modules

In [3]:
import numpy as np
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

### Impute NAs created during cleanup

Ideally we would not drop these rows: a Tweet that is reduced to an empty string during cleanup may itself be informative (for example, such Tweets might skew positive).

TODO: examine them to see what pattern could be used to impute NAs.
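One way to start on this TODO is to compare the class balance of the rows that came out empty against the rest, and to look at their raw text. The sketch below uses a small made-up frame in place of `df`; the rows, the 0/4 label coding, and the `is_empty` column name are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows; 0 = negative, 4 = positive is assumed).
toy = pd.DataFrame({
    "target": [0, 0, 4, 4, 4, 0],
    "text": ["@user", "so tired today", "http://t.co/x", "love this", "@a @b", "bad day"],
    "stemmed": [None, "so tire today", None, "love thi", None, "bad day"],
})

# Flag rows whose cleaned text ended up missing.
toy["is_empty"] = toy["stemmed"].isnull()

# Share of each class among empty vs. non-empty rows (each row sums to 1).
class_share = pd.crosstab(toy["is_empty"], toy["target"], normalize="index")
print(class_share)

# Raw text of the affected rows, to spot patterns such as mention-only Tweets.
empty_texts = toy.loc[toy["is_empty"], "text"].tolist()
print(empty_texts)
```

If the class shares differ clearly between the two groups, a simple "was empty after cleanup" indicator feature might be worth keeping instead of dropping the rows.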

In [4]:
df.isnull().sum()

target          0
text            0
tokenized    3696
filtered     3819
stemmed      3819
dtype: int64

In [5]:
dfm = df.dropna() # drop for now
dfm.index = range(1,len(dfm) + 1)

In [6]:
round((len(df)-len(dfm))/len(df), 4) # proportion of dropped rows is low

0.0024

### Vectorize with BoW and TF-IDF


In [7]:
# target
y = np.array(dfm.iloc[:, 0]).ravel() 

In [11]:
# features
X_text = np.array(dfm.iloc[:, 1]).ravel() # original text
X_tokn = np.array(dfm.iloc[:, 2]).ravel() # tokenized
X_filt = np.array(dfm.iloc[:, 3]).ravel() # filtered
X_stem = np.array(dfm.iloc[:, 4]).ravel() # stemmed

In [12]:
# instantiate vectorizers
bow_vectorizer = CountVectorizer() # simple BoW 
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=False) # default Tfidf
log_tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True) # log(tf) version
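To make the difference between the three vectorizers concrete, the sketch below isolates the term-frequency part on two made-up documents. With `sublinear_tf=True`, scikit-learn replaces a raw count `tf` with `1 + log(tf)`, which dampens repeated words. Turning off `use_idf` and `norm` here is only to expose that effect; the notebook's vectorizers keep the defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good good good movie", "bad movie"]  # toy documents

# use_idf=False and norm=None leave only the term-frequency component.
raw_tf = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=False)
log_tf = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)

counts = CountVectorizer().fit_transform(docs).toarray()
raw = raw_tf.fit_transform(docs).toarray()
log = log_tf.fit_transform(docs).toarray()

# Column index of the token "good" (same tokenization in all three vectorizers).
good = raw_tf.vocabulary_["good"]
print(counts[0, good], raw[0, good], log[0, good])
```

The first two printed values are the raw count of "good" in the first document; the third is the dampened value `1 + ln(3)`.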

In [16]:
X_text_bow = bow_vectorizer.fit_transform(X_text)
X_tokn_bow = bow_vectorizer.fit_transform(X_tokn)
X_filt_bow = bow_vectorizer.fit_transform(X_filt)
X_stem_bow = bow_vectorizer.fit_transform(X_stem)
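Calling `fit_transform` repeatedly on one vectorizer object re-fits it each time, so after the cell above only the vocabulary of the last column remains on the object. If the fitted vocabularies are needed later (for example to transform new Tweets), one option is to keep a separate fitted copy per column. A minimal sketch, using hypothetical miniature columns and `sklearn.base.clone`:

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature versions of two text columns.
columns = {
    "text": ["I loved the movies", "I hated the ending"],
    "stem": ["love movi", "hate end"],
}

base = CountVectorizer()
fitted = {}    # column name -> fitted vectorizer
matrices = {}  # column name -> sparse document-term matrix

for name, docs in columns.items():
    vec = clone(base)  # unfitted copy with the same parameters
    matrices[name] = vec.fit_transform(docs)
    fitted[name] = vec

# Each vectorizer keeps its own vocabulary, so new text can be transformed later.
new_row = fitted["stem"].transform(["love end"])
print(matrices["text"].shape, matrices["stem"].shape, new_row.toarray())
```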

In [None]:
X_text_tfidf = tfidf_vectorizer.fit_transform(X_text)
X_tokn_tfidf = tfidf_vectorizer.fit_transform(X_tokn)
X_filt_tfidf = tfidf_vectorizer.fit_transform(X_filt)
X_stem_tfidf = tfidf_vectorizer.fit_transform(X_stem)

In [None]:
X_text_log_tfidf = log_tfidf_vectorizer.fit_transform(X_text)
X_tokn_log_tfidf = log_tfidf_vectorizer.fit_transform(X_tokn)
X_filt_log_tfidf = log_tfidf_vectorizer.fit_transform(X_filt)
X_stem_log_tfidf = log_tfidf_vectorizer.fit_transform(X_stem)
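Since vectorizing the full dataset is slow, the resulting sparse matrices could be written to disk once and reloaded in a modeling notebook. The sketch below shows one possible approach with `scipy.sparse.save_npz` and `load_npz` on toy data; the file name is hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["upset cant updat facebook", "whole bodi feel itchi like fire"]  # toy rows
X = TfidfVectorizer().fit_transform(docs)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_stem_tfidf.npz")  # hypothetical file name
    sparse.save_npz(path, X)          # compressed .npz by default
    X_loaded = sparse.load_npz(path)  # loaded fully into memory

# True when the reloaded matrix stores exactly the same values.
same = (X != X_loaded).nnz == 0
print(X_loaded.shape, same)
```

This stores only the matrix. The fitted vectorizer would have to be saved separately if new text needs to be transformed with the same vocabulary.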

TODO: cleanup problem: hashtags and mentions (@) are usually useful in Tweets

TODO: examine NAs and use some kind of imputation method

TODO: split the notebook into Pre-Processing and Modeling as originally intended

TODO: save sparse matrices in the Pre-Processing notebook

TODO: Modeling - define a function that feeds in data to plot learning curves and assess fit (see ageron)

TODO: use CV folds? hyperparameter tuning? SVC, etc...

TODO: feature extraction (text length? do some visualizations)

TODO: feature selection?

TODO: N-grams?

TODO: redo get_accuracy(), has many issues - look at sklearn examples
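Several of the TODOs above (CV folds, hyperparameter tuning, N-grams) can be combined in one scikit-learn `Pipeline` searched with `GridSearchCV`. The sketch below is only an illustration on a tiny made-up corpus; the parameter grid, `cv=2`, and the choice of Naive Bayes are assumptions, not settings taken from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 1 = positive, 0 = negative.
texts = ["love this", "so happy today", "great day", "really good fun",
         "hate this", "so sad today", "awful day", "really bad pain"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is re-fit inside each CV fold, so no test text leaks into it.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "tfidf__sublinear_tf": [False, True],    # raw tf vs. 1 + log(tf)
    "nb__alpha": [0.1, 1.0],                 # smoothing strength
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 4))
```

On the full dataset a larger `cv` and `n_jobs=-1` would probably be more appropriate, at the cost of run time.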

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def get_accuracy(vectorizer, model, X, y):
    """Vectorize X, fit the chosen model ('NB' or 'LR'), and print its test accuracy."""
    start = time.perf_counter()

    # this is bad, fit every time? Just fit once!
    X_fit = vectorizer.fit_transform(X)

    # move to Modeling phase
    X_train, X_test, y_train, y_test = train_test_split(
        X_fit, y, test_size=0.2, random_state=42)

    # just... nah. We want to use a pipeline, save model params
    if model == 'NB':
        NB = MultinomialNB()
        NB.fit(X_train, y_train)
        y_preds = NB.predict(X_test)
        accuracy = accuracy_score(y_test, y_preds)
        print('Accuracy')
        print(f'NB: {round(accuracy, 4)}')

    elif model == 'LR':
        LR = LogisticRegression(solver='lbfgs', max_iter=1000)
        LR.fit(X_train, y_train)
        y_preds = LR.predict(X_test)
        accuracy = accuracy_score(y_test, y_preds)
        print(f'LR: {round(accuracy, 4)}')

    else:
        # fail loudly instead of silently doing nothing
        raise ValueError(f"Unknown model {model!r}; expected 'NB' or 'LR'")

    finish = time.perf_counter()
    print(f'Finished in {round(finish-start, 2)} secs')
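The comments in `get_accuracy()` point at two issues: the vectorizer is re-fit on every call, and models are selected by string flag. A possible restructuring is sketched below under those assumptions. `score_models` is a hypothetical helper, not part of this project; it vectorizes once per split (fitting on the training texts only) and takes the estimators as objects.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def score_models(vectorizer, models, texts, labels, test_size=0.2, random_state=42):
    """Return {model name: test accuracy} using one shared split and one vectorizer fit."""
    txt_train, txt_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    vec = clone(vectorizer)                 # leave the caller's object untouched
    X_train = vec.fit_transform(txt_train)  # fit once, on training texts only
    X_test = vec.transform(txt_test)

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Toy corpus (hypothetical): 1 = positive, 0 = negative.
texts = ["love this", "so happy today", "great day", "really good fun", "best time",
         "hate this", "so sad today", "awful day", "really bad pain", "worst time"]
labels = [1] * 5 + [0] * 5

scores = score_models(
    CountVectorizer(),
    {"NB": MultinomialNB(), "LR": LogisticRegression(max_iter=1000)},
    texts,
    labels,
)
print(scores)
```

Returning the scores rather than printing them makes it easier to collect results across text variants into a single table.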

### BoW with original text

In [16]:
# Naive Bayes
get_accuracy(bow_vectorizer, 'NB', X_text, y)

Accuracy
NB: 0.7825
Finished in 53.67 secs


In [18]:
# Logistic Regression
get_accuracy(bow_vectorizer, 'LR', X_text, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LR: 0.8022
Finished in 586.5 secs
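The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows, on toy data: `MaxAbsScaler` divides each feature by its maximum absolute value and keeps the matrix sparse. Whether this is enough for lbfgs to converge on the full dataset is untested here; the check on `n_iter_` only shows how one might verify it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Toy corpus (hypothetical): 1 = positive, 0 = negative.
texts = ["love this", "so happy today", "great day", "really good fun",
         "hate this", "so sad today", "awful day", "really bad pain"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = make_pipeline(
    CountVectorizer(),
    MaxAbsScaler(),  # per-feature scaling that preserves sparsity
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
pipe.fit(texts, labels)

lr = pipe.named_steps["logisticregression"]
# n_iter_ below max_iter means the solver stopped on its tolerance, not on the cap.
print(lr.n_iter_, lr.max_iter)
```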


### BoW with tokens

In [96]:
X = np.array(dfm.iloc[:, 2]).ravel()
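# NOTE: this cell and the two that follow were run with an earlier
# get_accuracy(vectorizer, X, y) that fit both NB and LR in one call
get_accuracy(vectorizer, X, y)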

Accuracy
NB: 0.7755
LR: 0.8006
Finished in 243.62 secs


### BoW with filtered tokens

In [97]:
X = np.array(dfm.iloc[:, 3]).ravel()
get_accuracy(vectorizer, X, y)

Accuracy
NB: 0.7748
LR: 0.7997
Finished in 220.1 secs


### BoW with stemmed tokens

In [98]:
X = np.array(dfm.iloc[:, 4]).ravel()
get_accuracy(vectorizer, X, y)

Accuracy
NB: 0.7701
LR: 0.7946
Finished in 247.27 secs
