## Logistic Regression Model

### Logistic Regression with Tf-Idf Vectorization

In [140]:
import pandas as pd
import numpy as np

In [141]:
# read data as pandas dataframe
data = pd.read_csv('../raw_data/fulltrain.csv', names=['label', 'text'])
data.head()

|   | label | text |
|---|---|---|
| 0 | 1 | A little less than a decade ago, hockey fans w... |
| 1 | 1 | The writers of the HBO series The Sopranos too... |
| 2 | 1 | Despite claims from the TV news outlet to offe... |
| 3 | 1 | After receiving 'subpar' service and experienc... |
| 4 | 1 | After watching his beloved Seattle Mariners pr... |


In [142]:
# fulltrain.csv contains 202 duplicate rows; remove them before proceeding
data = data.drop_duplicates()

In [143]:
from collections import Counter
Counter(data['label'])

Counter({3: 17870, 1: 13911, 4: 9932, 2: 6939})

In [144]:
# create tf-idf matrix
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=10000) # HYPERPARAMETERS

In [145]:
from sklearn.model_selection import train_test_split

full_train_data = data.copy()
train_data, eval_data = train_test_split(full_train_data, test_size=0.2, random_state=42)
print(train_data.shape)
print(eval_data.shape)

(38921, 2)
(9731, 2)


In [146]:
X_train = vectorizer.fit_transform(train_data['text'])
X_eval = vectorizer.transform(eval_data['text'])

In [147]:
LABEL = 'label'
TEXT = 'text'

train_label = train_data[LABEL]
eval_label = eval_data[LABEL]

In [148]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_train_balanced, train_label_balanced = sm.fit_resample(X_train, train_label)

In [149]:
# note: this counts the full dataset before the train/eval split, not train_data alone
print("original training data:", Counter(full_train_data[LABEL]))
print("balanced training data:", Counter(train_label_balanced))
print("evaluation data:", Counter(eval_label))

original training data: Counter({3: 17870, 1: 13911, 4: 9932, 2: 6939})
balanced training data: Counter({4: 14276, 1: 14276, 2: 14276, 3: 14276})
evaluation data: Counter({3: 3594, 1: 2764, 4: 2007, 2: 1366})
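
The balanced counts above come from SMOTE, which creates synthetic minority samples by interpolating between an existing sample and one of its nearest same-class neighbours. The sketch below illustrates only that interpolation step on toy 2-D points. `smote_like_sample` is a hypothetical helper, not imblearn's implementation, and it works on dense lists rather than the sparse tf-idf matrix.

```python
import random

def smote_like_sample(points, k=2, rng=None):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(points)
    # nearest neighbours of `base` among the other points (squared Euclidean distance)
    others = [p for p in points if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# toy minority-class points, for illustration only
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

Because each synthetic point is a convex combination of two real points, it always lies on the segment between them. This is why SMOTE adds no information beyond what the minority class already contains.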


In [150]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

In [151]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_balanced, train_label_balanced)

In [152]:
y_pred = model.predict(X_eval)

In [153]:
# print evaluation metrics
print('Accuracy: ', accuracy_score(eval_label, y_pred))
print('F1: ', f1_score(eval_label, y_pred, average='macro'))
print('Precision: ', precision_score(eval_label, y_pred, average='macro'))
print('Recall: ', recall_score(eval_label, y_pred, average='macro'))
print(classification_report(eval_label, y_pred))

Accuracy:  0.9610523070599116
F1:  0.9586741666106686
Precision:  0.9576758937382174
Recall:  0.95976250616417
              precision    recall  f1-score   support

           1       0.96      0.96      0.96      2764
           2       0.94      0.97      0.95      1366
           3       0.97      0.97      0.97      3594
           4       0.95      0.94      0.95      2007

    accuracy                           0.96      9731
   macro avg       0.96      0.96      0.96      9731
weighted avg       0.96      0.96      0.96      9731



### Tracking Evaluation Metrics on `X_eval` while Tuning Hyperparameters

- Attempt 1
    
    `ngram_range` = (1,1)

    `max_features` = 10000

    no SMOTE
    
    Accuracy:  0.9592672193224849
    
    F1:  0.9591777858793994
    
    Precision:  0.9595104433629937
    
    Recall:  0.9592672193224849

- Attempt 2
    
    `ngram_range` = (1,1)

    `max_features` = 10000

    SMOTE is performed

    Accuracy:  0.9610523070599116
    
    F1:  0.9586741666106686

    Precision:  0.9576758937382174

    Recall:  0.95976250616417

Attempt 1 scores slightly higher on F1 without SMOTE, but this is because the training data is highly imbalanced and the evaluation split shares that imbalance. We still **need** SMOTE to balance the training data, assuming the test data is balanced.

Without SMOTE, we expect a higher score on the evaluation data (since it is also imbalanced), but a much lower score on the test data.
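
When comparing attempts on an imbalanced split, it helps to keep the averaging mode fixed. Macro averaging weights every class equally, while weighted averaging weights each class by its support, and weighted recall always equals accuracy. A reported recall that matches the accuracy digit for digit on imbalanced data, as in Attempt 1, may therefore indicate `average='weighted'` rather than `'macro'`. The stdlib sketch below shows the difference on made-up labels (not the notebook's data); `per_class_recall` is a hypothetical helper.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / true instances of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in support}, support

# made-up imbalanced labels, for illustration only
y_true = [1] * 8 + [2] * 2
y_pred = [1] * 8 + [1, 2]  # one of the two class-2 samples is missed

recall, support = per_class_recall(y_true, y_pred)
n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in recall) / n

print(accuracy, macro_recall, weighted_recall)
```

Here the macro recall is pulled down by the minority class, while the weighted recall simply reproduces the accuracy.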

Something odd is going on:

- Evaluation scores are suspiciously high, with and without SMOTE (~0.96)
- Test scores are much lower, with and without SMOTE (~0.73)

This suggests that the test data is fundamentally different from the evaluation (and training) data.
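
One cheap way to probe this suspicion is to measure how many tokens in each split fall outside the training vocabulary. The sketch below uses toy strings and a regex matching scikit-learn's default token pattern. `oov_rate` is a hypothetical helper. In the notebook, the vocabulary could come from `vectorizer.vocabulary_` (which is capped at `max_features`), and the texts from `eval_data[TEXT]` and `test_data[TEXT]`.

```python
import re

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token_pattern

def tokenize(text):
    return TOKEN.findall(text.lower())

def oov_rate(vocabulary, texts):
    """Fraction of tokens in `texts` that are absent from `vocabulary`."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    return sum(tok not in vocabulary for tok in tokens) / len(tokens)

# toy texts, for illustration only
train_texts = ["the team won the game", "fans watched the game"]
eval_texts = ["the fans won"]
test_texts = ["senate passed the budget bill"]

vocab = {tok for text in train_texts for tok in tokenize(text)}
print("eval OOV rate:", oov_rate(vocab, eval_texts))
print("test OOV rate:", oov_rate(vocab, test_texts))
```

A clearly higher out-of-vocabulary rate on the test set than on the evaluation split would be consistent with the two coming from different sources, though it would not prove it.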

In [154]:
# sanity check for test data
test_data = pd.read_csv('../raw_data/balancedtest.csv', names=['label', 'text'])
test_data.head()

|   | label | text |
|---|---|---|
| 0 | 1 | When so many actors seem content to churn out ... |
| 1 | 1 | In what football insiders are calling an unex... |
| 2 | 1 | In a freak accident following Game 3 of the N.... |
| 3 | 1 | North Koreas official news agency announced to... |
| 4 | 1 | The former Alaska Governor Sarah Palin would b... |


In [155]:
X_test = vectorizer.transform(test_data['text'])
test_label = test_data[LABEL]

In [156]:
test_pred = model.predict(X_test)
test_pred

array([1, 1, 1, ..., 4, 4, 4])

In [157]:
# print evaluation metrics for test data
print('Accuracy: ', accuracy_score(test_label, test_pred))
print('F1: ', f1_score(test_label, test_pred, average='macro'))
print('Precision: ', precision_score(test_label, test_pred, average='macro'))
print('Recall: ', recall_score(test_label, test_pred, average='macro'))
print(classification_report(test_label, test_pred))

Accuracy:  0.7383333333333333
F1:  0.7314550185745465
Precision:  0.7556497275574477
Recall:  0.7383333333333333
              precision    recall  f1-score   support

           1       0.86      0.75      0.80       750
           2       0.78      0.47      0.59       750
           3       0.61      0.80      0.69       750
           4       0.77      0.93      0.84       750

    accuracy                           0.74      3000
   macro avg       0.76      0.74      0.73      3000
weighted avg       0.76      0.74      0.73      3000
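
The report shows recall for class 2 dropping to 0.47 on the test set. A confusion table would show which classes absorb those errors. The sketch below builds one with `collections.Counter` on made-up labels. `confusion_counts` is a hypothetical helper. In the notebook, `test_label` and `test_pred` would take the place of the toy lists, and scikit-learn's `confusion_matrix` provides the same counts.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# made-up labels for illustration; not the notebook's results
y_true = [1, 1, 2, 2, 2, 3, 3, 4]
y_pred = [1, 1, 2, 3, 3, 3, 3, 4]

counts = confusion_counts(y_true, y_pred)
labels = sorted(set(y_true) | set(y_pred))
# one row per true label, one column per predicted label
for t in labels:
    print(t, [counts[(t, p)] for p in labels])
```

Each row lists how the samples of one true class were distributed across predicted classes, so off-diagonal entries point at the specific confusions.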

