# Training Sentiment Analysis Models

This notebook trains two machine learning models for sentiment classification:

1. Logistic Regression
2. Multinomial Naive Bayes

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd

## Importing Dataset

In [2]:
df = pd.read_csv('data/transformed_data.csv')
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9991,9992,9993,9994,9995,9996,9997,9998,9999,target
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
df.dropna(inplace=True)

In [4]:
df.isna().sum().sum()

0

## Train Test Split

In [5]:
X = df.drop('target', axis=1)
y = df['target']

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [8]:
print('Train Sample Size: ', X_train.shape, y_train.shape)
print('Test Sample Size: ', X_test.shape, y_test.shape)

Train Sample Size:  (21983, 10000) (21983,)
Test Sample Size:  (5496, 10000) (5496,)
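
The split above samples rows at random, so the class proportions in the train and test sets can drift from those of the full dataset. A minimal sketch of a stratified split using `train_test_split`'s `stratify` parameter follows; `X_demo` and `y_demo` are small synthetic stand-ins introduced for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 imbalanced classes (20/30/50).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 20 + [1] * 30 + [2] * 50)

# stratify=y_demo keeps each class's share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo
)

# Count how many samples of each class landed in each split.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts, "test:", test_counts)
```

In the notebook itself this would amount to passing `stratify=y` to the existing call, assuming every class has enough samples to appear in both splits.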


## Metrics Libraries

In [9]:
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score

## Logistic Regression Model

In [10]:
from sklearn.linear_model import LogisticRegression

In [11]:
logistic_regression = LogisticRegression(max_iter=100)

In [12]:
logistic_regression.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
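
The warning above means the solver used all 100 iterations before converging, so the fitted coefficients may not be the best the model can reach. A minimal sketch of the first remedy the warning suggests, raising `max_iter`, is shown here on synthetic data. `X_demo` and `y_demo` come from `make_classification` and are assumptions for illustration; the value 1000 is an arbitrary choice, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the notebook's features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=42,
)

# A larger iteration budget gives the solver room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

# n_iter_ reports how many iterations were actually used; a value
# below max_iter indicates the solver stopped on its own.
iterations_used = int(clf.n_iter_.max())
print("iterations used:", iterations_used, "of", clf.max_iter)
```

Checking `n_iter_` after fitting is a quick way to confirm whether the limit was reached without relying on the warning text.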

In [13]:
logistic_regression.score(X_train, y_train)

0.6237092298594369

### Evaluation

In [14]:
y_pred = logistic_regression.predict(X_test)

In [15]:
confusion_matrix(y_test, y_pred)

array([[ 244,  374,  954],
       [ 264,  404, 1037],
       [ 375,  512, 1332]])
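
Each row of the confusion matrix is a true class and each column a predicted class, so dividing the diagonal by the row sums gives per-class recall. A short sketch using the matrix printed above (the name `cm` is introduced here for illustration):

```python
# Confusion matrix from the logistic regression evaluation above
# (rows: true class, columns: predicted class).
cm = [
    [244, 374, 954],
    [264, 404, 1037],
    [375, 512, 1332],
]

# Recall for class i = correct predictions for i / all true samples of i.
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

for i, r in enumerate(recall):
    print(f"row {i}: recall = {r:.3f}")
```

Recall rises from the first row to the last, which shows that most of the correct predictions come from the largest class.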

In [16]:
f1_score(y_test, y_pred, average='weighted')

0.3346281865672697
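
`average='weighted'` computes one F1 score per class and averages them, weighting each class by its support (its number of true samples). The sketch below reproduces the value above from the confusion matrix, using the identity F1 = 2·TP / (row sum + column sum); the variable names are illustrative.

```python
# Logistic regression confusion matrix from the evaluation above.
cm = [
    [244, 374, 954],
    [264, 404, 1037],
    [375, 512, 1332],
]
n_classes = len(cm)

# True samples per class (row sums) and predictions per class (column sums).
support = [sum(row) for row in cm]
predicted = [sum(row[j] for row in cm) for j in range(n_classes)]

# Per-class F1: 2*TP / ((TP + FN) + (TP + FP)) = 2*TP / (support + predicted).
f1_per_class = [
    2 * cm[i][i] / (support[i] + predicted[i]) for i in range(n_classes)
]

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1_per_class, support)) / sum(support)
print(round(weighted_f1, 4))  # 0.3346
```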

In [17]:
accuracy_score(y_test, y_pred)

0.36026200873362446

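The model performs poorly: training accuracy is about 0.62, but test accuracy is only 0.36, below the roughly 0.40 that always predicting the most frequent class would achieve (2219 of 5496 test samples). The gap between training and test accuracy suggests overfitting, and the convergence warning indicates that the solver stopped before it finished optimizing.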
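
A useful reference point for the accuracy above is the majority-class baseline: the accuracy obtained by always predicting the most frequent class. A sketch that derives it from the row sums of the confusion matrix (the variable names are illustrative):

```python
# Logistic regression confusion matrix (rows: true class).
cm = [
    [244, 374, 954],
    [264, 404, 1037],
    [375, 512, 1332],
]

support = [sum(row) for row in cm]  # test samples per true class
total = sum(support)

# Accuracy of the fitted model: diagonal entries over all samples.
model_accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Accuracy of always predicting the largest class.
baseline_accuracy = max(support) / total

print(f"model:    {model_accuracy:.4f}")
print(f"baseline: {baseline_accuracy:.4f}")
```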

## Multinomial Naive Bayes Model

In [18]:
from sklearn.naive_bayes import MultinomialNB

In [19]:
multinomialNB = MultinomialNB()

In [20]:
multinomialNB.fit(X_train, y_train)

MultinomialNB()

In [21]:
multinomialNB.score(X_train, y_train)

0.5759450484465268

### Evaluation

In [22]:
y_pred = multinomialNB.predict(X_test)

In [23]:
confusion_matrix(y_test, y_pred)

array([[  93,  208, 1271],
       [ 112,  256, 1337],
       [ 160,  287, 1772]])

In [24]:
f1_score(y_test, y_pred, average='weighted')

0.30897143652720316

In [25]:
accuracy_score(y_test, y_pred)

0.3859170305676856
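
Comparing the two models, Multinomial Naive Bayes has the higher accuracy (0.386 vs 0.360) but the lower weighted F1 (0.309 vs 0.335). One way to see why is to look at how each model distributes its predictions across classes, which the column sums of the confusion matrices give directly. A sketch (the helper `predicted_share` is hypothetical, written for this illustration):

```python
def predicted_share(cm):
    """Return the fraction of all predictions assigned to each class."""
    total = sum(sum(row) for row in cm)
    n_classes = len(cm)
    return [sum(row[j] for row in cm) / total for j in range(n_classes)]

# Confusion matrices from the two evaluation sections above.
cm_logreg = [[244, 374, 954], [264, 404, 1037], [375, 512, 1332]]
cm_nb = [[93, 208, 1271], [112, 256, 1337], [160, 287, 1772]]

share_logreg = predicted_share(cm_logreg)
share_nb = predicted_share(cm_nb)

print("logistic regression:", [round(s, 3) for s in share_logreg])
print("multinomial NB:     ", [round(s, 3) for s in share_nb])
```

Naive Bayes sends roughly 80% of its predictions to the last class, against about 60% for logistic regression. That raises accuracy on the largest class while lowering F1 on the other two.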