## Logistic Regression Model

### Logistic Regression with Tf-Idf Vectorization

In [None]:
import pandas as pd
import numpy as np

In [None]:
# read data as pandas dataframe
data = pd.read_csv('../raw_data/fulltrain.csv', names=['label', 'text'])
data.head()

In [None]:
# found out that fulltrain.csv has 202 duplicate rows => remove them before proceeding
data = data.drop_duplicates()

In [None]:
from collections import Counter
Counter(data['label'])

In [None]:
# create tf-idf matrix
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=9500, max_df=0.6) # HYPERPARAMETERS

In [None]:
from sklearn.model_selection import train_test_split

full_train_data = data.copy()
train_data, eval_data = train_test_split(full_train_data, test_size=0.2, random_state=42)
print(train_data.shape)
print(eval_data.shape)

In [None]:
X_train = vectorizer.fit_transform(train_data['text'])
X_eval = vectorizer.transform(eval_data['text'])

In [None]:
LABEL = 'label'
TEXT = 'text'

train_label = train_data[LABEL]
eval_label = eval_data[LABEL]

In [None]:
# from imblearn.over_sampling import SMOTE

# sm = SMOTE(random_state=42)
# X_train_balanced, train_label_balanced = sm.fit_resample(X_train, train_label)

In [None]:
print("original training data:", Counter(train_label))  # count the training split, not the full dataset
# print("balanced training data:", Counter(train_label_balanced))
print("evaluation data:", Counter(eval_label))

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report

In [None]:
model = LogisticRegression(max_iter=1000, C=0.15, class_weight='balanced', penalty="l2")
model.fit(X_train, train_label)
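
Since we skip SMOTE and rely on `class_weight='balanced'` instead, it helps to see what that option does. scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`, so rare classes get proportionally larger weights. Below is a minimal sketch of that formula; the label counts are toy values made up for illustration, not the counts from `fulltrain.csv`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (made up for illustration): class 4 is the rarest
toy_labels = [1, 1, 1, 1, 2, 2, 3, 3, 3, 4]
weights = balanced_class_weights(toy_labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

In the notebook, the same helper could be called on `train_label` to see how strongly each class is up-weighted.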

In [None]:
y_pred = model.predict(X_eval)

In [None]:
# print evaluation metrics
print('Accuracy: ', accuracy_score(eval_label, y_pred))
print('F1: ', f1_score(eval_label, y_pred, average='macro'))
print('Precision: ', precision_score(eval_label, y_pred, average='macro'))
print('Recall: ', recall_score(eval_label, y_pred, average='macro'))
print(classification_report(eval_label, y_pred))

The evaluation scores above are much higher than the scores on the held-out test set below. This makes me feel that the test data is somehow fundamentally different from the evaluation (and training) data.

In [None]:
# sanity check for test data
test_data = pd.read_csv('../raw_data/balancedtest.csv', names=['label', 'text'])
test_data.head()

In [None]:
X_test = vectorizer.transform(test_data['text'])
test_label = test_data[LABEL]

In [None]:
test_pred = model.predict(X_test)
test_pred

In [None]:
# print evaluation metrics for test data
print('Accuracy: ', accuracy_score(test_label, test_pred))
print('F1: ', f1_score(test_label, test_pred, average='macro'))
print('Precision: ', precision_score(test_label, test_pred, average='macro'))
print('Recall: ', recall_score(test_label, test_pred, average='macro'))
print(classification_report(test_label, test_pred))
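
One cheap way to check whether the test data differs from the training data is to measure how many test tokens fall outside the fitted vocabulary. The sketch below is an assumption-laden illustration: it uses a regex with the same shape as `TfidfVectorizer`'s default token pattern, and the vocabulary and sentences are toy values, not the notebook's data.

```python
import re

# Same shape as TfidfVectorizer's default token pattern: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def out_of_vocabulary_rate(texts, vocabulary):
    """Fraction of tokens in `texts` that are not keys of `vocabulary`."""
    total = 0
    missing = 0
    for text in texts:
        for token in tokenize(text):
            total += 1
            if token not in vocabulary:
                missing += 1
    return missing / total if total else 0.0

# Toy stand-ins for vectorizer.vocabulary_, eval_data['text'] and test_data['text']
toy_vocabulary = {"president": 0, "senate": 1, "vote": 2}
toy_eval = ["The president won the vote", "Senate vote delayed"]
toy_test = ["New diet cures everything", "Climate vote delayed"]

print("eval OOV rate:", out_of_vocabulary_rate(toy_eval, toy_vocabulary))
print("test OOV rate:", out_of_vocabulary_rate(toy_test, toy_vocabulary))
```

With the real `vectorizer.vocabulary_`, words dropped by `max_df` and `max_features` count as missing in both sets, so the useful signal is the difference between the two rates rather than either rate on its own.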

## Deep Dive into Logistic Regression

The goal of this deep dive is to figure out WHY the model performs so much worse on the test data compared to the evaluation data.

We aim to analyze:

- Which features does the model consider important?
- Does it give too much weight to named entities?
- Which sentences does the model misclassify, and why?
- Does the confusion matrix show the model misclassifying one class more than the others?

In [None]:
vocabulary = vectorizer.vocabulary_ # word: index
inverse_vocabulary = {v: k for k, v in vocabulary.items()} # index: word

For each sentence (document), we want to know which words the model is paying more attention to. We want to find the coefficients of the model for each word in the sentence.

In [None]:
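# model.coef_ has one row per class; row 0 holds only the weights for the first class in model.classes_
coefficients = model.coef_[0]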
word_coefficients = [(inverse_vocabulary[i], coefficients[i]) for i in range(len(coefficients))]

In [None]:
sorted_word_coefficients = sorted(word_coefficients, key=lambda x: abs(x[1]), reverse=True)

In [None]:
for word, coef in sorted_word_coefficients[:10]:
    print(word, coef)
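
Because this is a multi-class model, `model.coef_` has one row of weights per class, and `model.coef_[0]` only covers the first one. A sketch of how the top features could be listed for every class is shown below. The coefficient matrix, feature names, and class labels are toy stand-ins (assumptions for illustration), not values from the fitted model.

```python
import numpy as np

def top_features_per_class(coef, feature_names, classes, k=3):
    """Return {class: [(feature, weight), ...]} with the k largest positive weights."""
    result = {}
    for row, cls in zip(coef, classes):
        top = np.argsort(row)[::-1][:k]  # indices of the k largest weights
        result[cls] = [(feature_names[i], float(row[i])) for i in top]
    return result

# Toy stand-ins for model.coef_, the vocabulary, and model.classes_
toy_coef = np.array([
    [ 0.9, -0.2,  0.1, -0.8],
    [-0.1,  0.7, -0.3, -0.3],
    [-0.4, -0.1,  0.8, -0.3],
    [-0.4, -0.4, -0.6,  1.4],
])
toy_features = ["tuesday", "trump", "vaccine", "washington"]
toy_classes = [1, 2, 3, 4]

top = top_features_per_class(toy_coef, toy_features, toy_classes, k=1)
for cls, features in top.items():
    print(cls, features)
```

In the notebook, the inputs would be `model.coef_`, `vectorizer.get_feature_names_out()` (available in recent scikit-learn versions), and `model.classes_`.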

In [None]:
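# identical sentences except for the day of the week
day_sentences = [
    'Tuesday is a good day',
    'Wednesday is a good day',
    'Thursday is a good day',
    'Friday is a good day',
]
# keep the predictions in their own variable instead of overwriting the inputs
day_predictions = model.predict(vectorizer.transform(day_sentences))
day_predictions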

In [None]:
# find all sentences with the word "tuesday" and count their labels
tuesday_sentences = full_train_data[full_train_data['text'].str.contains('tuesday', case=False)]
tuesday_sentences['label'].value_counts()

To us, it seems very strange that the model treats days of the week so differently: the predicted class of a sentence changes depending on which day it mentions. This is clearly not a robust strategy. A likely explanation is that "Tuesday" occurred most commonly in satirical sentences, so the model learnt to treat the word itself as a signal.
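
To see how a single word can flip the class, recall that each class score in a linear model is a sum of per-token terms (tf-idf value times that class's coefficient), plus an intercept. The sketch below breaks a score into those terms. All numbers are toy values chosen for illustration, the intercept is left out, and the function name is our own.

```python
import numpy as np

def token_contributions(x, coef, feature_names):
    """For each non-zero feature in x, return its additive contribution to every class score."""
    contributions = {}
    for j in np.flatnonzero(x):
        contributions[feature_names[j]] = coef[:, j] * x[j]
    return contributions

toy_features = ["tuesday", "friday", "good", "day"]
toy_coef = np.array([
    [ 1.5, -0.5, 0.1, 0.0],   # weights for the first class
    [-0.5,  1.0, 0.2, 0.1],   # weights for the second class
])
# Two toy feature rows that differ only in the day token
x_tuesday = np.array([0.8, 0.0, 0.4, 0.4])
x_friday = np.array([0.0, 0.8, 0.4, 0.4])

scores_tuesday = toy_coef @ x_tuesday
scores_friday = toy_coef @ x_friday
print(token_contributions(x_tuesday, toy_coef, toy_features))
print(np.argmax(scores_tuesday), np.argmax(scores_friday))
```

Here the shared words "good" and "day" contribute almost nothing, so the day token alone decides which class wins.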

In [None]:
# trump vs biden, WOW this is a big deal!
print(word_coefficients[vocabulary['trump']])
print(word_coefficients[vocabulary['biden']])

In [None]:
trump_biden_sentences = [
    'Trump is the best president.',
    'Biden is the best president.',
    'Trump was a president.',
]
trump_biden_predictions = model.predict(vectorizer.transform(trump_biden_sentences))
print(trump_biden_predictions)
trump_biden_probabilities = model.predict_proba(vectorizer.transform(trump_biden_sentences))
print(trump_biden_probabilities) # the model seems to be quite confident (>90%) when classifying a sentence with "Trump" to be a hoax.

The above result can be unsettling. The only difference between the first two sentences is that I've replaced Trump with Biden, and the model proceeds to change its classification from satire to hoax. At least we can take comfort knowing that it doesn't classify it as reliable :O

Moreover, the model seems to be quite confident (>90%) when classifying a sentence with "Trump" to be a hoax.
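
One possible reason for such high confidence on short sentences: `TfidfVectorizer` L2-normalises each row by default, so when only a few words survive the vocabulary filter, each of them gets a large value. The sketch below is a stylised illustration of that effect, assuming a softmax over the class scores (the multinomial formulation) and entirely hypothetical coefficients; it is not a reconstruction of the notebook's model.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical per-class weights for a single entity token (strongly favours the second class)
entity_weights = np.array([0.5, 4.0, 0.5, -1.0])

def confidence(n_filler_tokens):
    """Max class probability when the entity token shares an L2-normalised row
    with n_filler_tokens zero-weight tokens of equal tf-idf value."""
    value = 1.0 / np.sqrt(1 + n_filler_tokens)   # each token's normalised value
    scores = entity_weights * value              # filler tokens contribute nothing
    return float(softmax(scores).max())

short_doc = confidence(2)     # e.g. a three-word sentence
long_doc = confidence(200)    # e.g. a full article
print(short_doc, long_doc)
```

Under these assumptions the same token produces a far more confident prediction in the short sentence, which suggests that probabilities on toy sentences may overstate how the model behaves on full articles.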

In [None]:
# some more pairs of weird words
print(word_coefficients[vocabulary['washington']])
print(word_coefficients[vocabulary['moscow']])
print(word_coefficients[vocabulary['china']])

city_sentences = [
    'Washington is a good place to work',
    'Moscow is a good place to work',
    'China is a good place to work',
]
# vectorize once and reuse the matrix for both predictions and probabilities
X_city = vectorizer.transform(city_sentences)
city_predictions = model.predict(X_city)
print(city_predictions)
city_probabilities = model.predict_proba(X_city)
print(city_probabilities)

The above example shows that the model is biased towards "Washington", possibly because Washington was a common word in the reliable news articles it was trained on.

Of course, in reality a sentence involving Washington does not automatically become more reliable than a sentence involving China or Moscow.
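
The entity-swap checks above can be wrapped in a small helper so that more templates and entities are easy to try. This is only a sketch: the function names are our own, and `stub_predict` is a hypothetical stand-in for the real model so that the example runs on its own.

```python
def entity_swap_report(template, entities, predict):
    """Fill `template` with each entity and return {entity: predicted_label}."""
    sentences = [template.format(entity=e) for e in entities]
    labels = predict(sentences)
    return dict(zip(entities, labels))

def is_entity_sensitive(report):
    """True if swapping the entity alone changed the predicted label."""
    return len(set(report.values())) > 1

# Stub predictor standing in for the notebook's model (hypothetical behaviour)
def stub_predict(sentences):
    return [4 if "washington" in s.lower() else 2 for s in sentences]

report = entity_swap_report(
    "{entity} is a good place to work",
    ["Washington", "Moscow", "China"],
    stub_predict,
)
print(report, is_entity_sensitive(report))
```

In the notebook, `predict` could be `lambda s: model.predict(vectorizer.transform(s))`.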

In [None]:
# we also want to figure out what the model is getting wrong, i.e., which class does it get most confused by
# for this, we can use a confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

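# the labels are already 1-4 (3 = propaganda, 4 = reliable), so no shifting is needed;
# passing labels= fixes the row/column order so that it matches the tick labels
class_labels = [1, 2, 3, 4]
conf_matrix = confusion_matrix(test_label, test_pred, labels=class_labels)
sns.heatmap(conf_matrix, annot=True, fmt='d', xticklabels=class_labels, yticklabels=class_labels)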
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

We can make the following observations from the above confusion matrix:
- Even though the overall F1 score of the model is not very high (~0.70), it classifies the majority of the reliable news articles as reliable. This means we have a low false positive rate (it doesn't "catch" many reliable news articles as being unreliable).
- There are 2 main issues that the model faces:
  - It classifies many hoax articles as propaganda but, surprisingly, it doesn't classify many propaganda articles as hoax.
  - It also classifies many propaganda articles as reliable.
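
Raw counts can hide how bad a class is when class sizes differ, so it can help to row-normalise the confusion matrix and read off per-class recall. Below is a minimal sketch; the counts are made up for illustration and are not the notebook's results.

```python
import numpy as np

def per_class_recall(conf_matrix):
    """Row-normalise a confusion matrix (rows = true class) and return (normalised, recall)."""
    conf_matrix = np.asarray(conf_matrix, dtype=float)
    row_totals = conf_matrix.sum(axis=1, keepdims=True)
    # leave rows with no true samples as zeros instead of dividing by zero
    normalised = np.divide(conf_matrix, row_totals,
                           out=np.zeros_like(conf_matrix), where=row_totals > 0)
    return normalised, np.diag(normalised)

# Made-up counts for illustration only
toy_matrix = [
    [80, 10,  5,  5],
    [ 5, 40, 50,  5],
    [ 0,  5, 60, 35],
    [ 0,  0, 10, 90],
]
normalised, recall = per_class_recall(toy_matrix)
print(recall)
```

Each row of `normalised` then shows where the articles of one true class end up, which makes asymmetric confusions (such as hoax to propaganda but not the reverse) easier to spot.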

Honestly, the second kind of issue is more worrisome, because the model fails to catch propaganda articles as being unreliable. More generally speaking, propaganda articles tend to use authoritative language and are more likely to be longer, making them sound more "convincing". This is also why humans find it difficult to distinguish between propaganda and reliable news articles.

It's not just humans though. It has been reported that the YouTube recommendation algorithm also ranks more authoritative-sounding videos higher, even if they are spreading misinformation. This is a very difficult problem to solve, and it's not clear if we can solve it with a simple logistic regression model, or any algorithmic model for that matter.

There is no algorithm for truth.

#### Looking at which categories of sentences are actually misclassified

Do most of the misclassified sentences come from health, the environment, politics, or something else?

In [None]:
# get the indices of all the test data that were misclassified
misclassified_indices = np.where(test_label != test_pred)[0]

# out of these, find the ones whose ground truth is 3 (propaganda), but the model predicted 4 (reliable)
misclassified_indices_3_4 = [i for i in misclassified_indices if test_label[i] == 3 and test_pred[i] == 4]

print(len(misclassified_indices_3_4))
# and then print those sentences
for i in misclassified_indices_3_4:
    print(test_data.iloc[i][TEXT])

Reading the sentences above, it's clear that nearly all of the 105 misclassified sentences are about health (they discuss topics such as diets, food, medicine, etc.) or the environment (they discuss topics such as climate change, pollution, etc.). This is a very interesting observation. It seems that the model is not able to distinguish between reliable and unreliable news articles on these topics. This makes some sense because the majority of the sentences in the dataset are about politics, so the model is unable to generalize beyond politics.

In fact, the model performs well only on sentences relating to _American_ (or Western) politics and business, not other countries. Again, this is unsurprising because the dataset is primarily about American politics and business.
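
Reading 105 sentences by hand works, but a rough count per topic makes the observation easier to check. The sketch below tags each text with the topic whose keywords it mentions most often. It is a crude heuristic: the keyword lists are hypothetical and the example sentences are invented, so treat the output as a first pass rather than a proper topic model.

```python
from collections import Counter

# Hypothetical keyword lists; extend or replace them for a real analysis
TOPIC_KEYWORDS = {
    "health": {"vaccine", "diet", "cancer", "medicine", "food"},
    "environment": {"climate", "pollution", "emissions"},
    "politics": {"senate", "president", "election", "congress"},
}

def tag_topic(text):
    """Return the topic whose keywords occur most often in `text`, or 'other'."""
    tokens = [tok.strip(".,!?\"'") for tok in text.lower().split()]
    hits = {topic: sum(tok in words for tok in tokens)
            for topic, words in TOPIC_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "other"

# Invented sentences standing in for the misclassified test texts
toy_misclassified = [
    "This diet cures cancer, doctors say.",
    "Climate pollution data was faked.",
    "The senate delayed the election.",
    "Nothing relevant here.",
]
topic_counts = Counter(tag_topic(t) for t in toy_misclassified)
print(topic_counts)
```

In the notebook, the same function could be mapped over `test_data.iloc[misclassified_indices_3_4][TEXT]`.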

In [None]:
vaccine_sentences = [
    'Vaccines are useful',
    'Vaccines are not useful',
]
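# note: the vectorizer does no stemming, so "vaccines" in the sentences above is a different token from "vaccine"
print(word_coefficients[vocabulary['vaccine']])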
vaccine_predictions = model.predict(vectorizer.transform(vaccine_sentences))
print(vaccine_predictions) # both are classified as propaganda
vaccine_probabilities = model.predict_proba(vectorizer.transform(vaccine_sentences))
vaccine_probabilities

In [None]:
# print all training sentences containing "vaccine" (case-insensitive), along with their label counts
vaccine_rows = train_data[train_data[TEXT].str.contains('vaccine', case=False)]
print(Counter(vaccine_rows[LABEL]))
for text, label in zip(vaccine_rows[TEXT], vaccine_rows[LABEL]):
    print(text, label)

Interestingly, the model is over 90% confident in its prediction, and is still wrong!
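
To find such confidently wrong predictions systematically, we can filter on the maximum predicted probability. Below is a minimal sketch with made-up probabilities; the function name and the 0.9 threshold are our own choices.

```python
import numpy as np

def confident_errors(probabilities, classes, true_labels, threshold=0.9):
    """Indices of rows predicted wrongly with max probability >= threshold."""
    probabilities = np.asarray(probabilities)
    predicted = np.asarray(classes)[probabilities.argmax(axis=1)]
    confidence = probabilities.max(axis=1)
    wrong = predicted != np.asarray(true_labels)
    return np.flatnonzero(wrong & (confidence >= threshold))

# Made-up values standing in for model.classes_, model.predict_proba(X_test) and test_label
toy_classes = [1, 2, 3, 4]
toy_proba = [
    [0.02, 0.03, 0.93, 0.02],   # predicts 3 with high confidence
    [0.10, 0.20, 0.30, 0.40],   # predicts 4 with low confidence
    [0.01, 0.01, 0.03, 0.95],   # predicts 4 with high confidence
]
toy_true = [4, 3, 4]
errors = confident_errors(toy_proba, toy_classes, toy_true)
print(errors)
```

Only the first row is both wrong and above the threshold, so it is the only index returned.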