## Sentiment text analysis
The purpose of this project is to build a machine learning model that predicts the sentiment of a tweet regarding the COVID-19 pandemic, using both "classical" machine learning (logistic regression etc.) and deep learning methods.

The dataset used in this notebook comes from here: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification
<br>It was collected and manually tagged by a Kaggle user named Aman Miglani.  

Let's import the necessary modules.

In [12]:
import numpy as np
import pandas as pd
import re
from string import punctuation
import nltk
from nltk.corpus import stopwords, words
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import lightgbm as lgb
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

Load the training and test sets and build the stop-word and English-vocabulary sets.

In [13]:
STOPWORDS = set(stopwords.words('english'))
ENGLISH_WORDS = set(words.words())
df_train = pd.read_csv(r"data\Corona_NLP_train.csv", encoding='latin1')
df_test = pd.read_csv(r"data\Corona_NLP_test.csv", encoding='latin1')

In [14]:
df_train.head(10)

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative
5,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,As news of the regionÂs first confirmed COVID...,Positive
6,3805,48757,"35.926541,-78.753267",16-03-2020,Cashier at grocery store was sharing his insig...,Positive
7,3806,48758,Austria,16-03-2020,Was at the supermarket today. Didn't buy toile...,Neutral
8,3807,48759,"Atlanta, GA USA",16-03-2020,Due to COVID-19 our retail store and classroom...,Positive
9,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,"For corona prevention,we should stop to buy th...",Negative


In [15]:
print("Size of the train dataset: {}".format(df_train.shape))
print("Size of the test dataset: {}".format(df_test.shape))

Size of the train dataset: (41157, 6)
Size of the test dataset: (3798, 6)


Three sentiment classes are usually sufficient, so I will recode the five-level target variable into Positive, Negative and Neutral.

In [16]:
def recode_sentiment(y):

    if y in ['Extremely Positive', 'Positive']:
        return 'Positive'
    elif y in ['Extremely Negative', 'Negative']:
        return 'Negative'
    else:
        return 'Neutral'

In [17]:
df_train['Sentiment'] = df_train['Sentiment'].apply(recode_sentiment)
df_test['Sentiment'] = df_test['Sentiment'].apply(recode_sentiment)

In [18]:
df_train.head(10)

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Negative
5,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,As news of the regionÂs first confirmed COVID...,Positive
6,3805,48757,"35.926541,-78.753267",16-03-2020,Cashier at grocery store was sharing his insig...,Positive
7,3806,48758,Austria,16-03-2020,Was at the supermarket today. Didn't buy toile...,Neutral
8,3807,48759,"Atlanta, GA USA",16-03-2020,Due to COVID-19 our retail store and classroom...,Positive
9,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,"For corona prevention,we should stop to buy th...",Negative


Here we clean the tweets: we remove hashtags, URLs, HTML tags, Twitter mentions, numbers and stop words, and lemmatize the remaining words. Tokens that are not in the NLTK English word list are dropped as well.

We remove stop words because they occur very frequently and carry little information for our algorithms.

Lemmatization is the process of reducing a word to its dictionary form (lemma), for example: running -> run.
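To make the regex part of the cleaning concrete, here is a minimal sketch that applies the same patterns to a made-up tweet. Note the assumptions: `strip_tweet_noise` and the sample text are hypothetical, and the patterns are applied to the whole string, whereas the notebook applies them token by token after tokenization.

```python
import re

def strip_tweet_noise(text):
    """Apply the notebook's regex patterns to a whole string (illustrative helper)."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)                  # HTML tags
    text = re.sub(r'@\w+', '', text)                   # Twitter mentions
    text = re.sub(r'#\w+', '', text)                   # hashtags
    text = re.sub(r'\d+', '', text)                    # digits
    return ' '.join(text.split())                      # collapse leftover whitespace

# Made-up sample tweet, not taken from the dataset
sample = "@user Stock up <b>now</b> #COVID19 https://t.co/abc 24 hours"
cleaned = strip_tweet_noise(sample)
print(cleaned)  # Stock up now hours
```

Stop-word removal and lemmatization are not covered by this sketch, since they depend on the NLTK corpora.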


In [20]:
def remove_url(string):
    return re.sub(r'https?://\S+|www\.\S+', '', string)

def remove_html(string):
    return re.sub(r'<.*?>', '', string)

def remove_numbers(string):
    return re.sub(r'\d+', '', string)

def remove_mentions(string):
    return re.sub(r'@\w+', '', string)

def remove_hashtags(string):
    return re.sub(r'#\w+', '', string)

def clean_data(tweet, return_tokenized=True):
    """Tokenize a tweet, strip noise, lemmatize, and keep known English non-stop-words."""

    # Create the tokenizer and lemmatizer once per tweet rather than once per token
    tokenizer = TweetTokenizer()
    lemmatizer = WordNetLemmatizer()
    tokens = tokenizer.tokenize(tweet)

    cleaned_tweet = []

    for token, tag in pos_tag(tokens):

        # Clean the token with regular expressions
        token = remove_url(token)
        token = remove_html(token)
        token = remove_numbers(token)
        token = remove_mentions(token)
        token = remove_hashtags(token)

        # Map the Penn Treebank tag to a WordNet part of speech
        # (nouns and verbs are recognized; everything else is treated as an adjective)
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)
        token = token.lower()

        # Keep only non-punctuation, non-stop-word tokens found in the English word list
        if token not in punctuation and token not in STOPWORDS and token in ENGLISH_WORDS:
            cleaned_tweet.append(token)

    # TfidfVectorizer expects strings rather than lists of tokens
    if not return_tokenized:
        cleaned_tweet = ' '.join(cleaned_tweet)

    return cleaned_tweet

In [21]:
df_train['OriginalTweet'] = df_train['OriginalTweet'].apply(lambda x: clean_data(x, return_tokenized=False))
df_test['OriginalTweet'] = df_test['OriginalTweet'].apply(lambda x: clean_data(x, return_tokenized=False))

In [22]:
df_train.head(10)

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,,Neutral
1,3800,48752,UK,16-03-2020,advice talk family exchange phone number creat...,Positive
2,3801,48753,Vagabonds,16-03-2020,give elderly disable dedicate shopping hour am...,Positive
3,3802,48754,,16-03-2020,food stock one empty please panic enough food ...,Positive
4,3803,48755,,16-03-2020,ready go supermarket outbreak paranoid food st...,Negative
5,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,news first confirm covid case come county last...,Positive
6,3805,48757,"35.926541,-78.753267",16-03-2020,cashier grocery store share insight prove cred...,Positive
7,3806,48758,Austria,16-03-2020,supermarket today buy toilet paper,Neutral
8,3807,48759,"Atlanta, GA USA",16-03-2020,due covid retail store classroom open business...,Positive
9,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,corona prevention stop buy thing cash use paym...,Negative


Count the number of words left in each cleaned tweet and drop the tweets that became empty.

In [23]:
df_train['NumberOfWords'] = df_train['OriginalTweet'].apply(lambda x: len(x.split()))
df_test['NumberOfWords'] = df_test['OriginalTweet'].apply(lambda x: len(x.split()))

In [24]:
df_train = df_train.loc[df_train['NumberOfWords'] > 0,]
df_test = df_test.loc[df_test['NumberOfWords'] > 0,]

In [25]:
df_train.head(10)

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,NumberOfWords
1,3800,48752,UK,16-03-2020,advice talk family exchange phone number creat...,Positive,22
2,3801,48753,Vagabonds,16-03-2020,give elderly disable dedicate shopping hour am...,Positive,9
3,3802,48754,,16-03-2020,food stock one empty please panic enough food ...,Positive,15
4,3803,48755,,16-03-2020,ready go supermarket outbreak paranoid food st...,Negative,14
5,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,news first confirm covid case come county last...,Positive,22
6,3805,48757,"35.926541,-78.753267",16-03-2020,cashier grocery store share insight prove cred...,Positive,12
7,3806,48758,Austria,16-03-2020,supermarket today buy toilet paper,Neutral,5
8,3807,48759,"Atlanta, GA USA",16-03-2020,due covid retail store classroom open business...,Positive,20
9,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,corona prevention stop buy thing cash use paym...,Negative,19
10,3809,48761,"Makati, Manila",16-03-2020,month crowd supermarket restaurant however red...,Neutral,16


In [26]:
print("Size of the train dataset: {}".format(df_train.shape))
print("Size of the test dataset: {}".format(df_test.shape))

Size of the train dataset: (41052, 7)
Size of the test dataset: (3792, 7)


In [27]:
df_train.drop('NumberOfWords', axis=1, inplace=True)
df_test.drop('NumberOfWords', axis=1, inplace=True)

In [28]:
df_train.head(10)

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
1,3800,48752,UK,16-03-2020,advice talk family exchange phone number creat...,Positive
2,3801,48753,Vagabonds,16-03-2020,give elderly disable dedicate shopping hour am...,Positive
3,3802,48754,,16-03-2020,food stock one empty please panic enough food ...,Positive
4,3803,48755,,16-03-2020,ready go supermarket outbreak paranoid food st...,Negative
5,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,news first confirm covid case come county last...,Positive
6,3805,48757,"35.926541,-78.753267",16-03-2020,cashier grocery store share insight prove cred...,Positive
7,3806,48758,Austria,16-03-2020,supermarket today buy toilet paper,Neutral
8,3807,48759,"Atlanta, GA USA",16-03-2020,due covid retail store classroom open business...,Positive
9,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,corona prevention stop buy thing cash use paym...,Negative
10,3809,48761,"Makati, Manila",16-03-2020,month crowd supermarket restaurant however red...,Neutral


In [29]:
y_train, y_test = df_train['Sentiment'].copy(), df_test['Sentiment'].copy()

X_train_org, X_test_org = df_train['OriginalTweet'].copy(), df_test['OriginalTweet'].copy()

The `evaluate_model` function returns the train and test accuracy of a fitted model.

In [30]:
def evaluate_model(model, X_train=X_train_org, X_test=X_test_org, y_train=y_train, y_test=y_test):
    
    preds_train = model.predict(X_train)
    preds_test = model.predict(X_test)
    
    train_acc = accuracy_score(y_train, preds_train)
    test_acc = accuracy_score(y_test, preds_test)
    
    return {'Train accuracy':train_acc, 'Test accuracy':test_acc}

## Logistic Regression

In [32]:
tdidf_logistic_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__max_features':[100, 200, 300, 400, 600],
               'vect__min_df':[5, 7, 9, 11],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__max_features':[100, 200, 300, 400, 600],
               'vect__min_df':[5, 7, 9, 11],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

tfidf_logistic_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', stop_words='english')),
    ('clf', LogisticRegression())
])

cv = StratifiedKFold(n_splits=10)

tfidf_logistic_grid = GridSearchCV(tfidf_logistic_pipeline, param_grid=tdidf_logistic_grid, cv=cv,
                                  verbose=10, n_jobs=-1)

In [20]:
tfidf_logistic_grid.fit(X_train_org, y_train)

Fitting 10 folds for each of 240 candidates, totalling 2400 fits


1200 fits failed out of a total of 2400.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1200 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\base.py",

In [21]:
print(tfidf_logistic_grid.best_params_)

{'clf__C': 1.0, 'clf__penalty': 'l2', 'vect__max_features': 600, 'vect__min_df': 7, 'vect__ngram_range': (1, 1), 'vect__norm': None, 'vect__use_idf': False}
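Half of the fits failed. The traceback above is cut off, but a likely cause is that the default `lbfgs` solver of `LogisticRegression` does not support `penalty='l1'`, which would account for exactly the 120 candidates (1200 fits) that use it. Below is a minimal sketch of one way around this, assuming the `saga` solver (which accepts both penalties) and a made-up toy corpus in place of the tweets. It is an illustration, not the configuration used for the results in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy three-class corpus (made up), repeated so every fold contains every class
texts = ["good great help thanks", "bad panic crisis fear", "store open today hours",
         "great thanks good people", "fear bad empty panic", "hours store today open"] * 4
labels = ["Positive", "Negative", "Neutral"] * 8

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    # saga handles both 'l1' and 'l2'; a higher max_iter gives it room to converge
    ("clf", LogisticRegression(solver="saga", max_iter=2000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"clf__penalty": ["l1", "l2"], "clf__C": [1.0, 10.0]},
    cv=StratifiedKFold(n_splits=2),
    error_score="raise",  # fail loudly instead of silently scoring NaN
)
grid.fit(texts, labels)

scores = grid.cv_results_["mean_test_score"]
print(scores)
```

Setting `error_score="raise"` is optional, but it surfaces an unsupported parameter combination immediately instead of after the whole search.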


In [22]:
tfidf_logistic_pipeline = tfidf_logistic_grid.best_estimator_

In [23]:
tfidf_logistic_pipeline.fit(X_train_org, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [24]:
print(evaluate_model(model=tfidf_logistic_pipeline))

{'Train accuracy': 0.7292214752021826, 'Test accuracy': 0.696993670886076}


In [25]:
print(classification_report(y_train, tfidf_logistic_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test, tfidf_logistic_pipeline.predict(X_test_org)))

              precision    recall  f1-score   support

    Negative       0.76      0.70      0.73     15392
     Neutral       0.57      0.72      0.64      7620
    Positive       0.79      0.76      0.77     18040

    accuracy                           0.73     41052
   macro avg       0.71      0.73      0.71     41052
weighted avg       0.74      0.73      0.73     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

    Negative       0.76      0.67      0.71      1633
     Neutral       0.52      0.67      0.58       613
    Positive       0.73      0.73      0.73      1546

    accuracy                           0.70      3792
   macro avg       0.67      0.69      0.68      3792
weighted avg       0.71      0.70      0.70      3792



## MultinomialNB

In [26]:
tdidf_naivebayes_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__max_features':[100, 200, 300, 400, 600],
               'vect__min_df':[5, 7, 9, 11],
               'nb__alpha': np.arange(1, 11, 1)},
              {'vect__ngram_range': [(1, 1)],
               'vect__max_features':[100, 200, 300, 400, 600],
               'vect__min_df':[5, 7, 9, 11],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'nb__alpha': np.arange(1, 11, 1)},
              ]
tfidf_naivebayes_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', stop_words='english')),
    ('nb', MultinomialNB())
])

tfidf_naivebayes_grid = GridSearchCV(tfidf_naivebayes_pipeline, param_grid=tdidf_naivebayes_grid, cv=cv,
                                  verbose=10, n_jobs=-1)

In [27]:
tfidf_naivebayes_grid.fit(X_train_org, y_train)

Fitting 10 folds for each of 400 candidates, totalling 4000 fits


In [28]:
tfidf_naivebayes_pipeline = tfidf_naivebayes_grid.best_estimator_
tfidf_naivebayes_pipeline.fit(X_train_org, y_train)

In [29]:
print(evaluate_model(model=tfidf_naivebayes_pipeline))

{'Train accuracy': 0.6422829581993569, 'Test accuracy': 0.6220991561181435}


In [30]:
print(classification_report(y_train, tfidf_naivebayes_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test, tfidf_naivebayes_pipeline.predict(X_test_org)))

              precision    recall  f1-score   support

    Negative       0.67      0.67      0.67     15392
     Neutral       0.46      0.40      0.43      7620
    Positive       0.68      0.73      0.70     18040

    accuracy                           0.64     41052
   macro avg       0.60      0.60      0.60     41052
weighted avg       0.64      0.64      0.64     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

    Negative       0.69      0.66      0.67      1633
     Neutral       0.36      0.34      0.35       613
    Positive       0.65      0.69      0.67      1546

    accuracy                           0.62      3792
   macro avg       0.57      0.56      0.57      3792
weighted avg       0.62      0.62      0.62      3792



## SVC

In [31]:
tdidf_svm_grid = {
    'svm__penalty': ['l1', 'l2'],
    'svm__C': np.arange(1, 11, 1)
}
tdidf_svm_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', max_features=600, min_df=7,
                                 norm=None, stop_words='english',
                                 use_idf=False)),
    ('svm', LinearSVC())
])

tdidf_svm_grid = GridSearchCV(tdidf_svm_pipeline, param_grid=tdidf_svm_grid, cv=cv,
                                  verbose=10, n_jobs=-1)

In [32]:
tdidf_svm_grid.fit(X_train_org, y_train)

Fitting 10 folds for each of 20 candidates, totalling 200 fits


100 fits failed out of a total of 200.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
100 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "c:\Users\Dawid\anaconda3\Lib\site-packages\sklearn\base.py", li
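Again half of the fits failed, and again the traceback is truncated. A plausible explanation is that `LinearSVC` rejects `penalty='l1'` when the dual problem is being solved (`dual=True`, the default in the scikit-learn version this traceback appears to come from). A minimal sketch, assuming `dual=False` and a made-up toy corpus, in which both penalties can be fitted:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy three-class corpus (made up), repeated so every fold contains every class
texts = ["good great help thanks", "bad panic crisis fear", "store open today hours",
         "great thanks good people", "fear bad empty panic", "hours store today open"] * 4
labels = ["Positive", "Negative", "Neutral"] * 8

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    # dual=False solves the primal problem, which supports both 'l1' and 'l2'
    ("svm", LinearSVC(dual=False, max_iter=5000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"svm__penalty": ["l1", "l2"], "svm__C": [1.0, 10.0]},
    cv=StratifiedKFold(n_splits=2),
    error_score="raise",  # fail loudly instead of silently scoring NaN
)
grid.fit(texts, labels)

svm_scores = grid.cv_results_["mean_test_score"]
print(svm_scores)
```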

In [33]:
tdidf_svm_pipeline = tdidf_svm_grid.best_estimator_
tdidf_svm_pipeline.fit(X_train_org, y_train)



In [34]:
print(evaluate_model(model=tdidf_svm_pipeline))

{'Train accuracy': 0.7284663353795187, 'Test accuracy': 0.6988396624472574}


In [35]:
print(classification_report(y_train, tdidf_svm_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test, tdidf_svm_pipeline.predict(X_test_org)))

              precision    recall  f1-score   support

    Negative       0.75      0.71      0.73     15392
     Neutral       0.57      0.70      0.63      7620
    Positive       0.79      0.76      0.77     18040

    accuracy                           0.73     41052
   macro avg       0.71      0.72      0.71     41052
weighted avg       0.74      0.73      0.73     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

    Negative       0.76      0.68      0.72      1633
     Neutral       0.52      0.65      0.58       613
    Positive       0.73      0.74      0.73      1546

    accuracy                           0.70      3792
   macro avg       0.67      0.69      0.68      3792
weighted avg       0.71      0.70      0.70      3792



## Random Forest 

In [36]:
tdidf_rf_grid = {
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [None, 10, 20, 30],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4]
}

tdidf_rf_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', max_features=600, min_df=7,
                                 norm=None, stop_words='english',
                                 use_idf=False)),
    ('rf', RandomForestClassifier())
])

tdidf_rf_grid = GridSearchCV(tdidf_rf_pipeline, param_grid=tdidf_rf_grid, cv=cv,
                                  verbose=10, n_jobs=-1)

In [37]:
tdidf_rf_grid.fit(X_train_org, y_train)
tdidf_rf_pipeline = tdidf_rf_grid.best_estimator_
tdidf_rf_pipeline.fit(X_train_org, y_train)

Fitting 10 folds for each of 108 candidates, totalling 1080 fits


In [38]:
print(evaluate_model(model=tdidf_rf_pipeline))
print(classification_report(y_train, tdidf_rf_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test, tdidf_rf_pipeline.predict(X_test_org)))

{'Train accuracy': 0.8353795186592614, 'Test accuracy': 0.6809071729957806}
              precision    recall  f1-score   support

    Negative       0.86      0.82      0.84     15392
     Neutral       0.73      0.81      0.76      7620
    Positive       0.87      0.86      0.87     18040

    accuracy                           0.84     41052
   macro avg       0.82      0.83      0.82     41052
weighted avg       0.84      0.84      0.84     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

    Negative       0.72      0.68      0.70      1633
     Neutral       0.53      0.60      0.56       613
    Positive       0.71      0.72      0.71      1546

    accuracy                           0.68      3792
   macro avg       0.65      0.66      0.66      3792
weighted avg       0.68      0.68      0.68      3792



## KNN

In [39]:
## K-Nearest Neighbors
knn_params = {
    'knn__n_neighbors': [3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

knn_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', max_features=600, min_df=7,
                             norm=None, stop_words='english', use_idf=False)),
    ('knn', KNeighborsClassifier())
])

knn_grid = GridSearchCV(knn_pipeline, param_grid=knn_params, cv=cv,
                        verbose=10, n_jobs=-1)

In [40]:
knn_grid.fit(X_train_org, y_train)

Fitting 10 folds for each of 16 candidates, totalling 160 fits


In [41]:
best_knn_pipeline = knn_grid.best_estimator_
best_knn_pipeline.fit(X_train_org, y_train)

In [42]:
print(evaluate_model(model=best_knn_pipeline))
print(classification_report(y_train, best_knn_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test, best_knn_pipeline.predict(X_test_org)))

{'Train accuracy': 0.9821445970963656, 'Test accuracy': 0.43037974683544306}
              precision    recall  f1-score   support

    Negative       0.98      0.99      0.98     15392
     Neutral       0.95      0.97      0.96      7620
    Positive       1.00      0.98      0.99     18040

    accuracy                           0.98     41052
   macro avg       0.98      0.98      0.98     41052
weighted avg       0.98      0.98      0.98     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

    Negative       0.58      0.49      0.53      1633
     Neutral       0.23      0.63      0.33       613
    Positive       0.63      0.29      0.40      1546

    accuracy                           0.43      3792
   macro avg       0.48      0.47      0.42      3792
weighted avg       0.54      0.43      0.44      3792



## Decision Tree

In [43]:
## Decision Tree
dt_params = {
    'dt__max_depth': [None, 5, 10, 15],
    'dt__min_samples_split': [2, 5, 10],
    'dt__min_samples_leaf': [1, 2, 4],
    'dt__criterion': ['gini', 'entropy']
}

dt_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', max_features=600, min_df=7,
                             norm=None, stop_words='english', use_idf=False)),
    ('dt', DecisionTreeClassifier())
])

dt_grid = GridSearchCV(dt_pipeline, param_grid=dt_params, cv=cv,
                       verbose=10, n_jobs=-1)

In [44]:
dt_grid.fit(X_train_org, y_train)

Fitting 10 folds for each of 72 candidates, totalling 720 fits


In [45]:
best_dt_pipeline = dt_grid.best_estimator_
best_dt_pipeline.fit(X_train_org, y_train)

In [46]:
print(evaluate_model(model=best_dt_pipeline))
print(classification_report(y_train, best_dt_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test, best_dt_pipeline.predict(X_test_org)))

{'Train accuracy': 0.7723131637922634, 'Test accuracy': 0.6112869198312236}
              precision    recall  f1-score   support

    Negative       0.76      0.80      0.78     15392
     Neutral       0.67      0.71      0.69      7620
    Positive       0.83      0.78      0.80     18040

    accuracy                           0.77     41052
   macro avg       0.76      0.76      0.76     41052
weighted avg       0.78      0.77      0.77     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

    Negative       0.63      0.63      0.63      1633
     Neutral       0.47      0.51      0.49       613
    Positive       0.66      0.63      0.64      1546

    accuracy                           0.61      3792
   macro avg       0.58      0.59      0.59      3792
weighted avg       0.61      0.61      0.61      3792



## LightGBM

In [47]:
## LightGBM
lgb_params = {
    'lgb__num_leaves': [31, 50, 100],
    'lgb__max_depth': [-1, 10, 20, 30],
    'lgb__learning_rate': [0.01, 0.05, 0.1],
    'lgb__n_estimators': [50, 100, 200]
}

lgb_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', max_features=600, min_df=7,
                             norm=None, stop_words='english', use_idf=False)),
    ('lgb', lgb.LGBMClassifier())
])

lgb_grid = GridSearchCV(lgb_pipeline, param_grid=lgb_params, cv=cv,
                        verbose=10, n_jobs=-1)

In [48]:
lgb_grid.fit(X_train_org, y_train)

Fitting 10 folds for each of 108 candidates, totalling 1080 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.040222 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2139
[LightGBM] [Info] Number of data points in the train set: 41052, number of used features: 600
[LightGBM] [Info] Start training from score -0.980992
[LightGBM] [Info] Start training from score -1.684063
[LightGBM] [Info] Start training from score -0.822248


In [49]:
best_lgb_pipeline = lgb_grid.best_estimator_
best_lgb_pipeline.fit(X_train_org, y_train)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.036818 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2139
[LightGBM] [Info] Number of data points in the train set: 41052, number of used features: 600
[LightGBM] [Info] Start training from score -0.980992
[LightGBM] [Info] Start training from score -1.684063
[LightGBM] [Info] Start training from score -0.822248


In [50]:
print(evaluate_model(model=best_lgb_pipeline))
print(classification_report(y_train, best_lgb_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test, best_lgb_pipeline.predict(X_test_org)))

{'Train accuracy': 0.7536295430186105, 'Test accuracy': 0.6919831223628692}
              precision    recall  f1-score   support

    Negative       0.79      0.72      0.75     15392
     Neutral       0.59      0.77      0.67      7620
    Positive       0.82      0.77      0.80     18040

    accuracy                           0.75     41052
   macro avg       0.73      0.76      0.74     41052
weighted avg       0.77      0.75      0.76     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

    Negative       0.76      0.67      0.71      1633
     Neutral       0.50      0.68      0.58       613
    Positive       0.73      0.72      0.73      1546

    accuracy                           0.69      3792
   macro avg       0.66      0.69      0.67      3792
weighted avg       0.71      0.69      0.70      3792



## XGBoost

In [33]:
## XGBoost
xgb_params = {
    'xgb__n_estimators': [50, 100, 200],
    'xgb__max_depth': [3, 4, 5],
    'xgb__learning_rate': [0.01, 0.05, 0.1],
    'xgb__subsample': [0.8, 0.9, 1.0]
}

xgb_pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='latin1', max_features=600, min_df=7,
                             norm=None, stop_words='english', use_idf=False)),
    ('xgb', xgb.XGBClassifier(objective='multi:softmax', num_class=3, eval_metric='merror', seed=42))
])

xgb_grid = GridSearchCV(xgb_pipeline, param_grid=xgb_params, cv=cv,
                        verbose=10, n_jobs=-1)

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)  # reuse the mapping fitted on the training labels


In [34]:
xgb_grid.fit(X_train_org, y_train_encoded)

Fitting 10 folds for each of 81 candidates, totalling 810 fits


In [35]:
best_xgb_pipeline = xgb_grid.best_estimator_
best_xgb_pipeline.fit(X_train_org, y_train_encoded)

In [36]:
print(evaluate_model(model=best_xgb_pipeline, y_train = y_train_encoded, y_test =  y_test_encoded))
print(classification_report(y_train_encoded, best_xgb_pipeline.predict(X_train_org)))
print('-'*80)
print(classification_report(y_test_encoded, best_xgb_pipeline.predict(X_test_org)))

{'Train accuracy': 0.7252996199941537, 'Test accuracy': 0.6743143459915611}
              precision    recall  f1-score   support

           0       0.77      0.69      0.73     15392
           1       0.55      0.74      0.63      7620
           2       0.79      0.75      0.77     18040

    accuracy                           0.73     41052
   macro avg       0.70      0.73      0.71     41052
weighted avg       0.74      0.73      0.73     41052

--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.75      0.64      0.69      1633
           1       0.48      0.65      0.55       613
           2       0.71      0.72      0.72      1546

    accuracy                           0.67      3792
   macro avg       0.65      0.67      0.65      3792
weighted avg       0.69      0.67      0.68      3792
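Because XGBoost was trained on encoded labels, the reports above show 0, 1 and 2 instead of the class names. `LabelEncoder` sorts its classes, so here 0 = Negative, 1 = Neutral and 2 = Positive. A small sketch of how the names could be restored, using made-up labels and predictions rather than the model output:

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Made-up labels and predictions, for illustration only
y_true = ["Negative", "Neutral", "Positive", "Positive", "Negative", "Neutral"]
encoder = LabelEncoder()
y_true_enc = encoder.fit_transform(y_true)   # Negative=0, Neutral=1, Positive=2
y_pred_enc = [0, 1, 2, 0, 0, 1]

# Option 1: pass the class names to the report
report = classification_report(y_true_enc, y_pred_enc,
                               target_names=list(encoder.classes_))
print(report)

# Option 2: decode the predictions back to strings
decoded = encoder.inverse_transform(y_pred_enc)
print(list(decoded))
```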



## Neural Networks

In [55]:
## Neural Network
def build_nn_model(input_dim):
    model = Sequential([
        Dense(128, input_dim=input_dim, activation='relu'),
        Dropout(0.5),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(3, activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
    return model

In [56]:
# Convert labels to one-hot encoding
y_train_onehot = pd.get_dummies(y_train)
y_test_onehot = pd.get_dummies(y_test)

# Convert text to TF-IDF vectors
tfidf_vectorizer = TfidfVectorizer(encoding='latin1', max_features=600, min_df=7,
                                   norm=None, stop_words='english', use_idf=False)


In [57]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_org)
X_test_tfidf = tfidf_vectorizer.transform(X_test_org)
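A note on these features: with `use_idf=False` and `norm=None`, `TfidfVectorizer` applies neither IDF weighting nor normalization, so the resulting matrix holds plain term counts. The following sketch checks this on three made-up documents by comparing against `CountVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents, for illustration only
docs = ["food stock empty food", "supermarket toilet paper", "food price panic"]

tf_only = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs)
counts = CountVectorizer().fit_transform(docs)

# Same vocabulary and same values; only the dtype differs (float vs int)
same = np.array_equal(tf_only.toarray(), counts.toarray())
print(same)  # True
```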

In [58]:
# Build neural network model
nn_model = build_nn_model(X_train_tfidf.shape[1])

# Train neural network model
history = nn_model.fit(X_train_tfidf.toarray(), y_train_onehot.values, 
                       epochs=10, batch_size=32, validation_split=0.1, verbose=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [59]:
# Evaluate neural network model
train_loss, train_accuracy = nn_model.evaluate(X_train_tfidf.toarray(), y_train_onehot.values, verbose=0)
test_loss, test_accuracy = nn_model.evaluate(X_test_tfidf.toarray(), y_test_onehot.values, verbose=0)

print(f'Train accuracy: {train_accuracy}')
print(f'Test accuracy: {test_accuracy}')

Train accuracy: 0.8278037905693054
Test accuracy: 0.6904008388519287
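To get a per-class report for the network like the ones shown for the other models, the softmax outputs need to be mapped back to class names. `pd.get_dummies` orders its columns alphabetically, so the column index of the largest probability identifies the class. A minimal sketch with a hand-written probability matrix standing in for `nn_model.predict(...)`:

```python
import numpy as np
import pandas as pd

# Made-up labels; get_dummies yields the columns Negative, Neutral, Positive
y = pd.Series(["Positive", "Negative", "Neutral", "Positive"])
onehot = pd.get_dummies(y)
class_names = onehot.columns.to_numpy()

# Stand-in for the softmax output of the network (one row per tweet)
probs = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.6, 0.3, 0.1]])

# Pick the most probable class per row and translate it to its name
pred_labels = class_names[probs.argmax(axis=1)]
print(list(pred_labels))  # ['Positive', 'Negative', 'Neutral', 'Negative']
```

This relies on the train and test labels both containing all three classes, so that the two one-hot encodings share the same column order.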


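Apart from KNN, which overfits badly (train accuracy 0.98 versus 0.43 on the test set), train accuracy ranges from about 0.64 to 0.84 and test accuracy from about 0.61 to 0.70.
The neural network and the random forest fit the training data best (about 0.83), but on the test set LinearSVC (0.699) and logistic regression (0.697) are marginally ahead of LightGBM (0.692) and the neural network (0.690).

The results are not bad, but there is clearly room for improvement.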
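For a side-by-side view, the accuracies reported in the cells above can be collected and ranked by test accuracy. The numbers below are copied from those outputs and rounded to three decimals; the gap column is simply train minus test accuracy, used here as a rough indicator of overfitting.

```python
# (train accuracy, test accuracy) per model, copied from the outputs above
results = {
    "Logistic Regression": (0.729, 0.697),
    "MultinomialNB": (0.642, 0.622),
    "LinearSVC": (0.728, 0.699),
    "Random Forest": (0.835, 0.681),
    "KNN": (0.982, 0.430),
    "Decision Tree": (0.772, 0.611),
    "LightGBM": (0.754, 0.692),
    "XGBoost": (0.725, 0.674),
    "Neural Network": (0.828, 0.690),
}

# Rank models by test accuracy, best first
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for name, (train_acc, test_acc) in ranked:
    gap = train_acc - test_acc
    print(f"{name:<20} train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```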