## Experimenting with Naive Bayes, Logistic Regression, Linear SVM, and Random Forest Classifiers

* **NB**: Naive Bayes. `BernoulliNB` implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions, i.e. binary-valued features.
* **LR**: Logistic Regression (aka logit, MaxEnt) classifier.
* **LSVM**: Linear Support Vector Classification.
* **RF**: Random forest, a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.


In [2]:
import os 
datasets = ["testdata.manual.2009.06.14.csv", "training.1600000.processed.noemoticon.csv"]
train_path = os.path.join("dataset", datasets[1])

**Loading Data**

In [5]:
import os
from tqdm import trange
from dataloader import DataLoader

data_loader = DataLoader()

In [6]:
data = data_loader.read_df(train_path, 
                           df_type='csv', encoding='latin-1',
                           names=["Sentiment", "ID", "Date", "Query","UserID","Tweet"])
data.head(2)

|   | Sentiment | ID | Date | Query | UserID | Tweet |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
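
`DataLoader` comes from a `dataloader` module that is not shown in this notebook. As a rough sketch, assuming `read_df` is a thin wrapper around `pandas.read_csv` that forwards keyword arguments such as `encoding` and `names` (an assumption; the real helper may do more), an equivalent loader might look like this. `SimpleDataLoader` and the two sample rows are hypothetical.

```python
import io

import pandas as pd

COLUMNS = ["Sentiment", "ID", "Date", "Query", "UserID", "Tweet"]


class SimpleDataLoader:
    """Hypothetical stand-in for the notebook's DataLoader (assumed behaviour)."""

    def read_df(self, path, df_type="csv", **kwargs):
        # Only the CSV case used in this notebook is sketched here.
        if df_type != "csv":
            raise ValueError("unsupported df_type: {}".format(df_type))
        # Passing `names` tells pandas the file has no header row.
        return pd.read_csv(path, **kwargs)


# Two made-up rows in the same six-column, header-less layout as the training file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","a made-up sad tweet"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","a made-up happy tweet"\n'
)
demo = SimpleDataLoader().read_df(sample, df_type="csv", names=COLUMNS)
print(demo[["Sentiment", "Tweet"]])
```

For the real file, the call would additionally pass `encoding='latin-1'`, as in the cell above.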


**Preprocessing using texthero**

In [9]:
import texthero as hero

data['CleanTweet'] = data['Tweet'].pipe(hero.clean)

data.head(2)

|   | Sentiment | ID | Date | Query | UserID | Tweet | CleanTweet |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | switchfoot http twitpic com 2y1zl awww bummer ... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | upset update facebook texting might cry result... |
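
To give a feel for what `hero.clean` does to a tweet, here is a rough standard-library approximation. It assumes the default pipeline lowercases the text, drops standalone digit runs, replaces punctuation with spaces, strips accents, removes stopwords, and collapses whitespace. The function `rough_clean` and its tiny stopword list are hypothetical and will not match texthero's output exactly.

```python
import re
import string
import unicodedata

# Tiny illustrative stopword list; texthero relies on a much larger one.
STOPWORDS = {"a", "and", "as", "by", "can", "he", "his", "is", "it", "t", "that", "the", "to"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def rough_clean(text):
    """Approximate the assumed default cleaning steps on a single string."""
    text = text.lower()
    # Drop standalone runs of digits but keep mixed tokens such as "2y1zl".
    text = re.sub(r"\b\d+\b", " ", text)
    # Replace punctuation with spaces so "can't" becomes "can t".
    text = text.translate(PUNCT_TO_SPACE)
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove stopwords; split/join also collapses repeated whitespace.
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


print(rough_clean("is upset that he can't update his Facebook by texting it"))
# upset update facebook texting
```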


**Extracting top words for TF-IDF representation**

In [10]:
def get_vocabs(pos_words_dic, neg_words_dic, ft):
    """Collect words whose frequency exceeds ft in either class, without duplicates."""
    vocabs = []
    seen = set()  # set membership is O(1); scanning the list is slow for large vocabularies
    for words_dic in (pos_words_dic, neg_words_dic):
        for word, freq in words_dic.items():
            if freq > ft and word not in seen:
                seen.add(word)
                vocabs.append(word)
    return vocabs
    
ft = 5
top_pos_words = hero.top_words(data[data['Sentiment'] == 4]['CleanTweet'])
top_neg_words = hero.top_words(data[data['Sentiment'] == 0]['CleanTweet'])
vocabs5 = get_vocabs(top_pos_words.to_dict(), top_neg_words.to_dict(), ft=ft)
print("ft:{}, vocab length:{}".format(ft, len(vocabs5)))

ft:5, vocab length:51758
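
To make the frequency threshold concrete, here is a small standalone sketch on made-up sentences. It uses `collections.Counter` in place of `hero.top_words` (a substitution for illustration only), and the hypothetical `build_vocab` mirrors the logic of `get_vocabs`: a word is kept if it occurs more than `ft` times in either class.

```python
from collections import Counter


def word_counts(texts):
    """Count whitespace-separated tokens across a list of cleaned texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def build_vocab(pos_counts, neg_counts, ft):
    """Keep words seen more than ft times in either class, in first-seen order."""
    vocab = {}  # dict keys act as an ordered set
    for counts in (pos_counts, neg_counts):
        for word, freq in counts.items():
            if freq > ft:
                vocab.setdefault(word, None)
    return list(vocab)


# Made-up cleaned texts for each class.
pos = ["good day", "good good fun", "fun day"]
neg = ["bad day", "bad bad luck"]

vocab = build_vocab(word_counts(pos), word_counts(neg), ft=1)
print(vocab)  # ['good', 'day', 'fun', 'bad']
```

Raising `ft` shrinks the vocabulary: with `ft=2` only `good` and `bad` survive in this toy example.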


**Train test split**

In [11]:
def transform_labels(label):
    # Positive tweets are labelled 4 in the raw data; remap them to 1.
    if label == 4:
        return 1
    return label

data['Sentiment'] = data['Sentiment'].apply(transform_labels)

In [12]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data['CleanTweet'].tolist(), 
                                                    data['Sentiment'].tolist(),
                                                    test_size=0.3, random_state=40)

print("Train size:", len(x_train))
print("Test size:", len(x_test))

Train size: 1120000
Test size: 480000


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from model import ModelPipeline
from sklearn.metrics import classification_report
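
`ModelPipeline` is imported from a `model` module that is not shown here. A minimal sketch of what such a wrapper could look like, assuming it simply chains the transformer and the estimator and exposes `fit` and `predict` (the class name `SimpleModelPipeline` and the toy data are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class SimpleModelPipeline:
    """Hypothetical stand-in for ModelPipeline: vectorize text, then classify."""

    def __init__(self, estimator, transformer):
        self.pipeline = Pipeline([
            ("transformer", transformer),
            ("estimator", estimator),
        ])

    def fit(self, x, y):
        # Fits the vectorizer on the raw texts, then the classifier on its output.
        self.pipeline.fit(x, y)
        return self

    def predict(self, x):
        return self.pipeline.predict(x)


# Made-up toy data, only to show the call pattern used in the cells below.
texts = ["good fun day", "bad sad day", "fun good", "sad bad"]
labels = [1, 0, 1, 0]

toy_model = SimpleModelPipeline(
    estimator=LogisticRegression(),
    transformer=TfidfVectorizer(vocabulary=["good", "fun", "bad", "sad", "day"]),
)
toy_model.fit(texts, labels)
print(toy_model.predict(["good fun", "bad sad"]))
```

Wrapping both steps in one object means the vectorizer is fitted only on the training texts and reused unchanged at prediction time.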

## TFIDFNB Model
TF-IDF representation with a Naive Bayes model

In [56]:
from sklearn.naive_bayes import BernoulliNB

model = ModelPipeline(estimator=BernoulliNB(),
                      transformer=TfidfVectorizer(vocabulary=vocabs5))

model.fit(x_train, y_train)

In [57]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.77      0.77    239943
           1       0.77      0.77      0.77    240057

    accuracy                           0.77    480000
   macro avg       0.77      0.77      0.77    480000
weighted avg       0.77      0.77      0.77    480000
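
One detail worth keeping in mind when reading these numbers: `BernoulliNB` binarizes its input with a default threshold of `binarize=0.0`, so the TF-IDF weights are reduced to word presence or absence before fitting. The sketch below uses made-up toy data to check that TF-IDF features and binary count features over the same vocabulary give identical predicted probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up toy corpus and labels, for illustration only.
texts = ["good fun day", "bad sad day", "fun good good", "sad bad luck", "good day"]
labels = [1, 0, 1, 0, 1]
vocab = ["good", "fun", "bad", "sad", "day", "luck"]

tfidf = TfidfVectorizer(vocabulary=vocab)
binary = CountVectorizer(vocabulary=vocab, binary=True)

# Same estimator settings, different feature weighting.
nb_tfidf = BernoulliNB().fit(tfidf.fit_transform(texts), labels)
nb_binary = BernoulliNB().fit(binary.fit_transform(texts), labels)

new_texts = ["good luck", "sad day"]
proba_tfidf = nb_tfidf.predict_proba(tfidf.transform(new_texts))
proba_binary = nb_binary.predict_proba(binary.transform(new_texts))

# Both models see the same 0/1 matrix after binarization.
print(np.allclose(proba_tfidf, proba_binary))  # True
```

In other words, the TF-IDF weighting matters for the logistic regression, linear SVM, and random forest models, but not for this estimator with its default settings.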



## TFIDFLR Model
TF-IDF representation with a Logistic Regression model

In [63]:
from sklearn.linear_model import LogisticRegression

model = ModelPipeline(estimator=LogisticRegression(C=0.1, max_iter=200),
                      transformer=TfidfVectorizer(vocabulary=vocabs5))

model.fit(x_train, y_train)

In [65]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.75      0.77    239943
           1       0.76      0.80      0.78    240057

    accuracy                           0.78    480000
   macro avg       0.78      0.78      0.78    480000
weighted avg       0.78      0.78      0.78    480000



## TFIDFLSVM Model
TF-IDF representation with a Linear SVM model

In [70]:
from sklearn.svm import LinearSVC

model = ModelPipeline(estimator=LinearSVC(C=1),
                      transformer=TfidfVectorizer(vocabulary=vocabs5))

model.fit(x_train, y_train)

In [71]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.76      0.77    239943
           1       0.77      0.80      0.78    240057

    accuracy                           0.78    480000
   macro avg       0.78      0.78      0.78    480000
weighted avg       0.78      0.78      0.78    480000



## TFIDFRF Model
TF-IDF representation with a Random Forest model

In [78]:
from sklearn.ensemble import RandomForestClassifier

model = ModelPipeline(estimator=RandomForestClassifier(max_depth=100),
                      transformer=TfidfVectorizer(vocabulary=vocabs5))

model.fit(x_train, y_train)

In [79]:
y_pred = model.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.70      0.74    239943
           1       0.73      0.80      0.76    240057

    accuracy                           0.75    480000
   macro avg       0.75      0.75      0.75    480000
weighted avg       0.75      0.75      0.75    480000

