# Fake News Detection via Text Classification

## Setup

In [1]:
import pandas as pd
import nltk
from nltk import pos_tag
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string
from collections import Counter
from tqdm.auto import tqdm

from IPython.display import display, Markdown

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_score, recall_score


In [2]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dskra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dskra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dskra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dskra\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\dskra\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Load the Dataset

In [3]:
df_fake = pd.read_csv('fake.csv')
df_true = pd.read_csv('true.csv')

In [4]:
df_fake['authenticity'] = 0
df_true['authenticity'] = 1

In [5]:
df = pd.concat([df_fake, df_true])
df = df.sample(frac=1, random_state=42)

df = df.sample(frac=1, random_state=42).reset_index(drop=True)

In [6]:
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return words

df_processed = df
tqdm.pandas(desc="Processing Text")
df_processed['text_clean'] = df_processed['text'].progress_apply(preprocess_text)
df_fake_processed = df_processed[df_processed['authenticity'] == 0]
df_true_processed = df_processed[df_processed['authenticity'] == 1]

# 1) Explore Essential Information from Text Data and Preprocessing

##### 1. What are the most commonly used words (top 100) in the collection, the most commonly used words (top 100) in the real news and most commonly used words (top 100) in the fake news?

In [7]:
all_words = [word for sublist in df_processed['text_clean'] for word in sublist]
fake_words = [word for sublist in df_fake_processed['text_clean'] for word in sublist]
true_words = [word for sublist in df_true_processed['text_clean'] for word in sublist]

word_counts_all = Counter(all_words).most_common(100)
word_counts_fake = Counter(fake_words).most_common(100)
word_counts_true = Counter(true_words).most_common(100)

df_word_counts = pd.DataFrame({
    'Rank': range(1, len(word_counts_all) + 1),
    'word_counts_all': word_counts_all,
    'word_counts_fake': word_counts_fake,
    'word_counts_true': word_counts_true
})

In [8]:
pd.set_option('display.max_rows', 100)
df_word_counts

| Rank | word_counts_all | word_counts_fake | word_counts_true |
|------|-----------------|------------------|------------------|
| 1 | (said, 130050) | (trump, 73744) | (said, 99042) |
| 2 | (trump, 128096) | (said, 31008) | (’, 70768) |
| 3 | (’, 70768) | (president, 26073) | (trump, 54352) |
| 4 | (u, 63450) | (people, 26031) | (“, 54140) |
| 5 | (state, 58336) | (one, 23682) | (”, 53861) |
| 6 | (would, 54945) | (would, 23420) | (u, 41166) |
| 7 | (“, 54140) | (u, 22284) | (state, 36385) |
| 8 | (”, 53861) | (state, 21951) | (would, 31525) |
| 9 | (president, 53070) | (clinton, 18595) | (reuters, 28403) |
| 10 | (people, 41354) | (like, 18139) | (president, 26997) |
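
The curly quotation marks (’, “, ”) rank among the top tokens because `string.punctuation` contains only ASCII punctuation, so `preprocess_text` leaves typographic quotes in place. The sketch below illustrates this on a made-up sentence; the `extended` character set is an assumption for illustration, not part of the original preprocessing.

```python
import string

# Made-up sentence containing typographic (curly) quotes
sample = "“We won’t comment,” the spokesman said."

# What preprocess_text does: strip ASCII punctuation only
ascii_only = sample.translate(str.maketrans('', '', string.punctuation))

# Hypothetical extension: also strip curly single and double quotes
extended = string.punctuation + "‘’“”"
fully_stripped = sample.translate(str.maketrans('', '', extended))

print(ascii_only)       # curly quotes are still present
print(fully_stripped)   # curly quotes are gone
```

Whether to strip these characters is a modelling choice: they carry signal about quoting style, but they may also reflect the typography of a particular source rather than the content.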


##### 2. By reading the preprocessed textual data, can you easily tell the difference between the real news and fake news? What does the strongest feature set (for machine learning) look like?

The word "said" appears far more often in the real news (99,042 occurrences against 31,008 in the fake news), which suggests that real news articles rely much more heavily on direct quotes. The high counts of single and double quotation marks in the real news point the same way. Fake news appears to use emphatic or absolute words such as "never", "even", or "really" at a higher rate than the real news. The fake news texts also seem to use more words meant to provoke a reaction, such as "attack", and more demographic words such as "black" or "muslim".

As for the strongest feature set, POS tagging with noun, verb, and adjective/adverb filters may help highlight these differences. Examining the context in which high-frequency words are used might also be helpful.
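
One way to make the comparison between the two classes less dependent on reading raw counts is to rank words by how much more frequent they are in one class than in the other. The following is a minimal sketch, assuming a smoothed log-ratio of relative frequencies; `contrast_words` and the toy token lists are hypothetical and stand in for `fake_words` and `true_words`.

```python
import math
from collections import Counter

def contrast_words(fake_tokens, true_tokens, top_n=3, smoothing=1.0):
    """Rank words by the smoothed log-ratio of their relative frequency (fake vs. real)."""
    fake_counts, true_counts = Counter(fake_tokens), Counter(true_tokens)
    vocab = set(fake_counts) | set(true_counts)
    # Add-one style smoothing so words missing from one class do not produce log(0)
    fake_total = sum(fake_counts.values()) + smoothing * len(vocab)
    true_total = sum(true_counts.values()) + smoothing * len(vocab)
    scores = {
        word: math.log((fake_counts[word] + smoothing) / fake_total)
        - math.log((true_counts[word] + smoothing) / true_total)
        for word in vocab
    }
    # Sort by descending score; break ties alphabetically for a stable order
    ranked = sorted(vocab, key=lambda w: (-scores[w], w))
    return ranked[:top_n], ranked[-top_n:][::-1]

# Toy token lists (made up for illustration)
toy_fake = ["trump", "never", "really", "attack", "never", "said"]
toy_true = ["said", "reuters", "said", "state", "said", "trump", "said"]

fake_leaning, true_leaning = contrast_words(toy_fake, toy_true, top_n=2)
print("Leans fake:", fake_leaning)
print("Leans real:", true_leaning)
```

Words shared by both classes at similar rates (such as "trump" here) end up near the middle of the ranking, which is the point of using a ratio rather than a raw count.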

# 2) Build Machine Learning Model

## Test/Train Split Setup 

In [9]:
X_train, X_test, y_train, y_test = train_test_split(df_processed['text'], df_processed['authenticity'], test_size=0.3, random_state=42)

## Feature Set TF

In [10]:
count_vectorizer = CountVectorizer()
tf_train = count_vectorizer.fit_transform(X_train)
tf_test = count_vectorizer.transform(X_test)

## Feature Set TF-IDF

In [11]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
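
To make the difference between the two feature sets concrete, the sketch below applies both vectorizers to three made-up documents. `CountVectorizer` keeps raw term counts, while `TfidfVectorizer` down-weights terms that occur in many documents, and `max_df=0.7` drops any term found in more than 70% of them. The toy documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents; "said" occurs in all three
toy_docs = [
    "trump said the senate would vote",
    "the senate said the vote passed",
    "sources said trump never lies",
]

# TF: raw term counts per document
count_vec = CountVectorizer()
tf_matrix = count_vec.fit_transform(toy_docs)

# TF-IDF: counts reweighted by inverse document frequency, very common terms dropped
tfidf_vec = TfidfVectorizer(max_df=0.7)
tfidf_matrix = tfidf_vec.fit_transform(toy_docs)

the_count = tf_matrix[1, count_vec.vocabulary_["the"]]
idf_never = tfidf_vec.idf_[tfidf_vec.vocabulary_["never"]]
idf_the = tfidf_vec.idf_[tfidf_vec.vocabulary_["the"]]

print("Count of 'the' in second document:", the_count)
print("'said' kept by CountVectorizer:", "said" in count_vec.vocabulary_)
print("'said' kept by TfidfVectorizer(max_df=0.7):", "said" in tfidf_vec.vocabulary_)
print("IDF of 'never' vs 'the':", idf_never, idf_the)
```

On the real corpus, the same `max_df` setting could remove a word like "said" if it appears in more than 70% of articles, even though the word counts above suggest it separates the two classes well.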

## 2.1) Logistic Regression

### 2.1.1) Using TF

In [20]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(tf_train, y_train)

predictions = lr_model.predict(tf_test)

accuracy_tf_reg = accuracy_score(y_test, predictions)
conf_matrix_tf_reg = confusion_matrix(y_test, predictions)
report_tf_reg = classification_report(y_test, predictions)
precision_tf_reg = precision_score(y_test, predictions, average='macro')
recall_tf_reg = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tf_reg}')
print('Confusion Matrix:')
print(conf_matrix_tf_reg)
print('Classification Report:')
print(report_tf_reg)

Accuracy: 0.9952487008166295
Confusion Matrix:
[[6914   39]
 [  25 6492]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00      6953
           1       0.99      1.00      1.00      6517

    accuracy                           1.00     13470
   macro avg       1.00      1.00      1.00     13470
weighted avg       1.00      1.00      1.00     13470
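
The fit, predict, and score steps in the cell above recur for every model and feature set in this section. A small helper can keep those steps in one place so that each result is always computed from the predictions of the model just fitted. This is a minimal sketch, not the notebook's own code; `evaluate_model` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model`, predict on the test set, and return the metrics used in this notebook."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, average="macro"),
        "recall": recall_score(y_test, predictions, average="macro"),
        "conf_matrix": confusion_matrix(y_test, predictions),
    }

# Toy data (made up); 0 = fake, 1 = real
toy_train = [
    "shocking secret they never told you",
    "you will not believe this hoax",
    "senate passes budget bill officials said",
    "minister said talks will resume",
]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["officials said the bill passed", "shocking hoax they never admit"]
toy_test_labels = [1, 0]

vectorizer = CountVectorizer()
results = evaluate_model(
    LogisticRegression(max_iter=1000),
    vectorizer.fit_transform(toy_train), toy_train_labels,
    vectorizer.transform(toy_test), toy_test_labels,
)
print(results["accuracy"])
print(results["conf_matrix"])
```

Storing each returned dictionary under a key such as `("reg", "tf")` would also let a summary table be built from a single dictionary rather than from individually named variables.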



### 2.1.2) Using TF-IDF

In [21]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(tfidf_train, y_train)

predictions = lr_model.predict(tfidf_test)

accuracy_tfidf_reg = accuracy_score(y_test, predictions)
conf_matrix_tfidf_reg = confusion_matrix(y_test, predictions)
report_tfidf_reg = classification_report(y_test, predictions)
precision_tfidf_reg = precision_score(y_test, predictions, average='macro')
recall_tfidf_reg = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tfidf_reg}')
print('Confusion Matrix:')
print(conf_matrix_tfidf_reg)
print('Classification Report:')
print(report_tfidf_reg)

Accuracy: 0.9846325167037862
Confusion Matrix:
[[6838  115]
 [  92 6425]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      6953
           1       0.98      0.99      0.98      6517

    accuracy                           0.98     13470
   macro avg       0.98      0.98      0.98     13470
weighted avg       0.98      0.98      0.98     13470



## 2.2) SVM

### 2.2.1) Using TF

In [14]:
svm_model = SVC(kernel='linear')
svm_model.fit(tf_train, y_train)

predictions = svm_model.predict(tf_test)

accuracy_tf_svm = accuracy_score(y_test, predictions)
conf_matrix_tf_svm = confusion_matrix(y_test, predictions)
report_tf_svm = classification_report(y_test, predictions)
precision_tf_svm = precision_score(y_test, predictions, average='macro')
recall_tf_svm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tf_svm}')
print('Confusion Matrix:')
print(conf_matrix_tf_svm)
print('Classification Report:')
print(report_tf_svm)

Accuracy: 0.9949517446176689
Confusion Matrix:
[[6911   42]
 [  26 6491]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00      6953
           1       0.99      1.00      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



### 2.2.2) Using TF-IDF

In [15]:
svm_model = SVC(kernel='linear')
svm_model.fit(tfidf_train, y_train)

predictions = svm_model.predict(tfidf_test)

accuracy_tfidf_svm = accuracy_score(y_test, predictions)
conf_matrix_tfidf_svm = confusion_matrix(y_test, predictions)
report_tfidf_svm = classification_report(y_test, predictions)
precision_tfidf_svm = precision_score(y_test, predictions, average='macro')
recall_tfidf_svm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tfidf_svm}')
print('Confusion Matrix:')
print(conf_matrix_tfidf_svm)
print('Classification Report:')
print(report_tfidf_svm)

Accuracy: 0.9933184855233853
Confusion Matrix:
[[6900   53]
 [  37 6480]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      6953
           1       0.99      0.99      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 2.3) Random Forest Classifier

### 2.3.1) Using TF

In [16]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(tf_train, y_train)

predictions = rf_model.predict(tf_test)

accuracy_tf_rf = accuracy_score(y_test, predictions)
conf_matrix_tf_rf = confusion_matrix(y_test, predictions)
report_tf_rf = classification_report(y_test, predictions)
precision_tf_rf = precision_score(y_test, predictions, average='macro')
recall_tf_rf = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tf_rf}')
print('Confusion Matrix:')
print(conf_matrix_tf_rf)
print('Classification Report:')
print(report_tf_rf)

Accuracy: 0.9896807720861173
Confusion Matrix:
[[6879   74]
 [  65 6452]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      6953
           1       0.99      0.99      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



### 2.3.2) Using TF-IDF

In [17]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(tfidf_train, y_train)

predictions = rf_model.predict(tfidf_test)

accuracy_tfidf_rf = accuracy_score(y_test, predictions)
conf_matrix_tfidf_rf = confusion_matrix(y_test, predictions)
report_tfidf_rf = classification_report(y_test, predictions)
precision_tfidf_rf = precision_score(y_test, predictions, average='macro')
recall_tfidf_rf = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tfidf_rf}')
print('Confusion Matrix:')
print(conf_matrix_tfidf_rf)
print('Classification Report:')
print(report_tfidf_rf)

Accuracy: 0.9844840386043059
Confusion Matrix:
[[6859   94]
 [ 115 6402]]
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      6953
           1       0.99      0.98      0.98      6517

    accuracy                           0.98     13470
   macro avg       0.98      0.98      0.98     13470
weighted avg       0.98      0.98      0.98     13470



## 2.4) Gradient Boosting Machine

### 2.4.1) Using TF

In [19]:
gbm_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm_model.fit(tf_train, y_train)

predictions = gbm_model.predict(tf_test)

accuracy_tf_gbm = accuracy_score(y_test, predictions)
conf_matrix_tf_gbm = confusion_matrix(y_test, predictions)
report_tf_gbm = classification_report(y_test, predictions)
precision_tf_gbm = precision_score(y_test, predictions, average='macro')
recall_tf_gbm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tf_gbm}')
print('Confusion Matrix:')
print(conf_matrix_tf_gbm)
print('Classification Report:')
print(report_tf_gbm)

Accuracy: 0.9941351150705271
Confusion Matrix:
[[6892   61]
 [  18 6499]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      6953
           1       0.99      1.00      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



### 2.4.2) Using TF-IDF

In [22]:
gbm_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm_model.fit(tfidf_train, y_train)

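# Store the GBM predictions in `predictions`, the variable the metrics below read.
# The recorded output was computed from the stale Logistic Regression (TF-IDF)
# predictions, which is why it matches section 2.1.2 exactly.
predictions = gbm_model.predict(tfidf_test)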

accuracy_tfidf_gbm = accuracy_score(y_test, predictions)
conf_matrix_tfidf_gbm = confusion_matrix(y_test, predictions)
report_tfidf_gbm = classification_report(y_test, predictions)
precision_tfidf_gbm = precision_score(y_test, predictions, average='macro')
recall_tfidf_gbm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tfidf_gbm}')
print('Confusion Matrix:')
print(conf_matrix_tfidf_gbm)
print('Classification Report:')
print(report_tfidf_gbm)

Accuracy: 0.9846325167037862
Confusion Matrix:
[[6838  115]
 [  92 6425]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      6953
           1       0.98      0.99      0.98      6517

    accuracy                           0.98     13470
   macro avg       0.98      0.98      0.98     13470
weighted avg       0.98      0.98      0.98     13470



## 2.5) MultinomialNB

### 2.5.1) Using TF

In [23]:
nb_model = MultinomialNB()
nb_model.fit(tf_train, y_train)

predictions = nb_model.predict(tf_test)

accuracy_tf_mnb = accuracy_score(y_test, predictions)
conf_matrix_tf_mnb = confusion_matrix(y_test, predictions)
report_tf_mnb = classification_report(y_test, predictions)
precision_tf_mnb = precision_score(y_test, predictions, average='macro')
recall_tf_mnb = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tf_mnb}')
print('Confusion Matrix:')
print(conf_matrix_tf_mnb)
print('Classification Report:')
print(report_tf_mnb)

Accuracy: 0.9537490720118782
Confusion Matrix:
[[6620  333]
 [ 290 6227]]
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.95      0.96      6953
           1       0.95      0.96      0.95      6517

    accuracy                           0.95     13470
   macro avg       0.95      0.95      0.95     13470
weighted avg       0.95      0.95      0.95     13470



### 2.5.2) Using TF-IDF

In [24]:
nb_model = MultinomialNB()
nb_model.fit(tfidf_train, y_train)

predictions = nb_model.predict(tfidf_test)

accuracy_tfidf_mnb = accuracy_score(y_test, predictions)
conf_matrix_tfidf_mnb = confusion_matrix(y_test, predictions)
report_tfidf_mnb = classification_report(y_test, predictions)
precision_tfidf_mnb = precision_score(y_test, predictions, average='macro')
recall_tfidf_mnb = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_tfidf_mnb}')
print('Confusion Matrix:')
print(conf_matrix_tfidf_mnb)
print('Classification Report:')
print(report_tfidf_mnb)

Accuracy: 0.9373422420193022
Confusion Matrix:
[[6611  342]
 [ 502 6015]]
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.95      0.94      6953
           1       0.95      0.92      0.93      6517

    accuracy                           0.94     13470
   macro avg       0.94      0.94      0.94     13470
weighted avg       0.94      0.94      0.94     13470



## 2.6) Performance Report

In [40]:
header = "| ML Model                       | Feature | Precision                    | Recall                      | Accuracy                    |"
separator = "|--------------------------------|---------|------------------------------|-----------------------------|-----------------------------|"
row_template = "| {model:<30} | {feature:<6} | {precision:.4f}              | {recall:.4f}               | {accuracy:.4f}               |"

markdown_table = f"{header}\n{separator}\n"

model_accuracies = []

models_features = [
    ("Logistic Regression", "reg", "tf"),
    ("Logistic Regression", "reg", "tfidf"),
    ("Support Vector Machine", "svm", "tf"),
    ("Support Vector Machine", "svm", "tfidf"),
    ("Random Forest", "rf", "tf"),
    ("Random Forest", "rf", "tfidf"),
    ("Gradient Boosting Machine", "gbm", "tf"),
    ("Gradient Boosting Machine", "gbm", "tfidf"),
    ("Multinomial Naive Bayes", "mnb", "tf"),
    ("Multinomial Naive Bayes", "mnb", "tfidf"),
]

for model, model_short, feature in models_features:
    suffix = f"{feature}_{model_short}"

    # The metrics live in notebook-level variables named <metric>_<feature>_<model>;
    # skip any combination whose cell has not been run
    if f"accuracy_{suffix}" not in globals():
        continue

    precision = float(globals()[f"precision_{suffix}"])
    recall = float(globals()[f"recall_{suffix}"])
    accuracy = float(globals()[f"accuracy_{suffix}"])
    conf_matrix = globals()[f"conf_matrix_{suffix}"]

    model_accuracies.append((accuracy, model, model_short, feature, conf_matrix))

    markdown_table += row_template.format(model=model, feature=feature.upper(), precision=precision, recall=recall, accuracy=accuracy) + "\n"

display(Markdown(markdown_table))

| ML Model                       | Feature | Precision                    | Recall                      | Accuracy                    |
|--------------------------------|---------|------------------------------|-----------------------------|-----------------------------|
| Logistic Regression            | TF     | 0.9952              | 0.9953               | 0.9952               |
| Logistic Regression            | TFIDF  | 0.9846              | 0.9847               | 0.9846               |
| Support Vector Machine         | TF     | 0.9949              | 0.9950               | 0.9950               |
| Support Vector Machine         | TFIDF  | 0.9933              | 0.9933               | 0.9933               |
| Random Forest                  | TF     | 0.9897              | 0.9897               | 0.9897               |
| Random Forest                  | TFIDF  | 0.9845              | 0.9844               | 0.9845               |
| Gradient Boosting Machine      | TF     | 0.9940              | 0.9942               | 0.9941               |
| Gradient Boosting Machine      | TFIDF  | 0.9846              | 0.9847               | 0.9846               |
| Multinomial Naive Bayes        | TF     | 0.9536              | 0.9538               | 0.9537               |
| Multinomial Naive Bayes        | TFIDF  | 0.9378              | 0.9369               | 0.9373               |

Note: the Gradient Boosting Machine (TFIDF) row is identical to the Logistic Regression (TFIDF) row because cell In [22] computed its metrics from the `predictions` variable left over from cell In [21], not from the GBM predictions. It does not reflect the GBM model's actual TF-IDF performance.


In [41]:
top_2_models = sorted(model_accuracies, key=lambda x: x[0], reverse=True)[:2]

top_models_md = "\n## 2.7) Top 2 Models based on Accuracy\n"
for _, model, model_short, feature, conf_matrix in top_2_models:
    top_models_md += f"\n### {model} ({feature.upper()})\n"
    top_models_md += "Confusion Matrix:\n"
    
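    # confusion_matrix puts actual classes on rows and predicted classes on columns,
    # in label order (0 = fake, 1 = real); fake is treated as the positive class here
    conf_matrix_table = f"""
|               | Predicted Fake (0) | Predicted Real (1) |
|---------------|--------------------|--------------------|
| Actual Fake (0) | {conf_matrix[0][0]} (TP)    | {conf_matrix[0][1]} (FN)    |
| Actual Real (1) | {conf_matrix[1][0]} (FP)    | {conf_matrix[1][1]} (TN)    |
"""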
    
    top_models_md += conf_matrix_table

display(Markdown(top_models_md))


## 2.7) Top 2 Models based on Accuracy

### Logistic Regression (TF)
Confusion Matrix:

|               | Predicted Fake (0) | Predicted Real (1) |
|---------------|--------------------|--------------------|
| Actual Fake (0) | 6914 (TP)    | 39 (FN)    |
| Actual Real (1) | 25 (FP)    | 6492 (TN)    |

### Support Vector Machine (TF)
Confusion Matrix:

|               | Predicted Fake (0) | Predicted Real (1) |
|---------------|--------------------|--------------------|
| Actual Fake (0) | 6911 (TP)    | 42 (FN)    |
| Actual Real (1) | 26 (FP)    | 6491 (TN)    |
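
When reading these tables, it helps to recall how scikit-learn lays out a confusion matrix: rows are actual classes, columns are predicted classes, and both follow the sorted label order (here 0 = fake, 1 = real). The sketch below checks that layout on made-up labels. Treating fake (0) as the positive class is an assumption for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = fake, 1 = real
toy_actual = [0, 0, 0, 1, 1]
toy_predicted = [0, 0, 1, 1, 0]

cm = confusion_matrix(toy_actual, toy_predicted, labels=[0, 1])
print(cm)

# cm[i][j] counts samples whose actual class is i and predicted class is j.
# With fake (0) as the positive class:
tp, fn = (int(v) for v in cm[0])  # actual fake: predicted fake / predicted real
fp, tn = (int(v) for v in cm[1])  # actual real: predicted fake / predicted real
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

If real (1) were taken as the positive class instead, the same four cells would read TN, FP, FN, TP in row-major order, which is the order returned by `cm.ravel()`.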


# 3) Enhanced NLP Features

In [27]:
all_filtered = {}

# Penn Treebank tag prefixes kept by each filter (NN* nouns, JJ* adjectives, VB* verbs)
POS_FILTERS = {
    'noun+adj': ('NN', 'JJ'),
    'noun+verb': ('NN', 'VB'),
    'noun+adj+verb': ('NN', 'JJ', 'VB')
}

def filter_pos(sentences, filter_type, data_type):
    # Cache results per (filter, split) so each corpus is tagged only once
    filter_key = filter_type + ', ' + data_type

    if filter_key in all_filtered:
        return all_filtered[filter_key]

    filtered_sentences = []
    for sentence in tqdm(sentences, desc=f'Filtering {filter_type}'):
        tokenized = word_tokenize(sentence.lower())

        if filter_type in POS_FILTERS:
            # Tag only when a filter applies; str.startswith accepts a tuple of prefixes
            tagged = pos_tag(tokenized)
            filtered = [word for word, tag in tagged if tag.startswith(POS_FILTERS[filter_type])]
        else:
            filtered = tokenized

        filtered_sentences.append(" ".join(filtered))

    all_filtered[filter_key] = filtered_sentences

    return filtered_sentences

def vectorize_data(data, filter_type, vectorizer_type, data_type, vectorizer=None):
    filtered_texts = filter_pos(data, filter_type, data_type)
    
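    if vectorizer is None:
        # No fitted vectorizer supplied: fit a new one (training data)
        if vectorizer_type == 'tf':
            vectorizer = CountVectorizer()
        elif vectorizer_type == 'tfidf':
            vectorizer = TfidfVectorizer()
        else:
            raise ValueError(f"Unknown vectorizer_type: {vectorizer_type!r}")
        vectorized_data = vectorizer.fit_transform(filtered_texts)
    else:
        # Reuse the vectorizer fitted on the training data (test data)
        vectorized_data = vectorizer.transform(filtered_texts)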
    
    return vectorized_data, vectorizer
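
The filtering step keeps a token when its Penn Treebank tag starts with one of the chosen prefixes, so `NN` also matches `NNS` and `NNP`, and `VB` also matches `VBD` and `VBZ`. The sketch below shows that prefix matching in isolation on hand-tagged tokens, so it runs without the NLTK tagger. `keep_by_tag_prefix` and the tags assigned here are hypothetical.

```python
def keep_by_tag_prefix(tagged_tokens, prefixes):
    """Keep the words whose POS tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

# Hand-assigned tags for illustration (not produced by pos_tag)
toy_tagged = [
    ("the", "DT"), ("angry", "JJ"), ("senators", "NNS"), ("quickly", "RB"),
    ("rejected", "VBD"), ("the", "DT"), ("bill", "NN"),
]

noun_adj = keep_by_tag_prefix(toy_tagged, ["NN", "JJ"])
noun_verb = keep_by_tag_prefix(toy_tagged, ["NN", "VB"])

print(noun_adj)
print(noun_verb)
```

None of the three filters defined above keeps adverbs (`RB`), so emphatic words such as "really" are dropped by every filtered feature set.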

## 3.1) Test 1: Noun + Adjective, TFIDF, Regression

In [28]:
filter_type = 'noun+adj'  
vectorizer_type = 'tfidf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_vec, y_train)

predictions = lr_model.predict(X_test_vec)

accuracy_na_tfidf_reg = accuracy_score(y_test, predictions)
conf_matrix_na_tfidf_reg = confusion_matrix(y_test, predictions)
report_na_tfidf_reg = classification_report(y_test, predictions)
precision_na_tfidf_reg = precision_score(y_test, predictions, average='macro')
recall_na_tfidf_reg = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_na_tfidf_reg}')
print('Confusion Matrix:')
print(conf_matrix_na_tfidf_reg)
print('Classification Report:')
print(report_na_tfidf_reg)

Accuracy: 0.987305122494432
Confusion Matrix:
[[6863   90]
 [  81 6436]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      6953
           1       0.99      0.99      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 3.2) Test 2: Noun + Adjective, TF, SVM

In [29]:
filter_type = 'noun+adj'  
vectorizer_type = 'tf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

svm_model = SVC(kernel='linear')
svm_model.fit(X_train_vec, y_train)

predictions = svm_model.predict(X_test_vec)

accuracy_na_tf_svm = accuracy_score(y_test, predictions)
conf_matrix_na_tf_svm = confusion_matrix(y_test, predictions)
report_na_tf_svm = classification_report(y_test, predictions)
precision_na_tf_svm = precision_score(y_test, predictions, average='macro')
recall_na_tf_svm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_na_tf_svm}')
print('Confusion Matrix:')
print(conf_matrix_na_tf_svm)
print('Classification Report:')
print(report_na_tf_svm)

Accuracy: 0.9947290274684484
Confusion Matrix:
[[6911   42]
 [  29 6488]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      6953
           1       0.99      1.00      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 3.3) Test 3: Noun + Adjective + Verb, TF, GBM

In [30]:
filter_type = 'noun+adj+verb'  
vectorizer_type = 'tf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

gbm_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm_model.fit(X_train_vec, y_train)

predictions = gbm_model.predict(X_test_vec)

accuracy_nav_tf_gbm = accuracy_score(y_test, predictions)
conf_matrix_nav_tf_gbm = confusion_matrix(y_test, predictions)
report_nav_tf_gbm = classification_report(y_test, predictions)
precision_nav_tf_gbm = precision_score(y_test, predictions, average='macro')
recall_nav_tf_gbm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_nav_tf_gbm}')
print('Confusion Matrix:')
print(conf_matrix_nav_tf_gbm)
print('Classification Report:')
print(report_nav_tf_gbm)

Accuracy: 0.995025983667409
Confusion Matrix:
[[6907   46]
 [  21 6496]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00      6953
           1       0.99      1.00      0.99      6517

    accuracy                           1.00     13470
   macro avg       0.99      1.00      1.00     13470
weighted avg       1.00      1.00      1.00     13470



## 3.4) Test 4: Noun + Adjective, TFIDF, GBM

In [31]:
filter_type = 'noun+adj'  
vectorizer_type = 'tfidf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

gbm_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm_model.fit(X_train_vec, y_train)

predictions = gbm_model.predict(X_test_vec)

accuracy_na_tfidf_gbm = accuracy_score(y_test, predictions)
conf_matrix_na_tfidf_gbm = confusion_matrix(y_test, predictions)
report_na_tfidf_gbm = classification_report(y_test, predictions)
precision_na_tfidf_gbm = precision_score(y_test, predictions, average='macro')
recall_na_tfidf_gbm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_na_tfidf_gbm}')
print('Confusion Matrix:')
print(conf_matrix_na_tfidf_gbm)
print('Classification Report:')
print(report_na_tfidf_gbm)

Accuracy: 0.9947290274684484
Confusion Matrix:
[[6906   47]
 [  24 6493]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      6953
           1       0.99      1.00      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 3.5) Test 5: Noun + Verb, TFIDF, SVM

In [32]:
filter_type = 'noun+verb'  
vectorizer_type = 'tfidf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

svm_model = SVC(kernel='linear')
svm_model.fit(X_train_vec, y_train)

predictions = svm_model.predict(X_test_vec)

accuracy_nv_tfidf_svm = accuracy_score(y_test, predictions)
conf_matrix_nv_tfidf_svm = confusion_matrix(y_test, predictions)
report_nv_tfidf_svm = classification_report(y_test, predictions)
precision_nv_tfidf_svm = precision_score(y_test, predictions, average='macro')
recall_nv_tfidf_svm = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_nv_tfidf_svm}')
print('Confusion Matrix:')
print(conf_matrix_nv_tfidf_svm)
print('Classification Report:')
print(report_nv_tfidf_svm)

Accuracy: 0.9939866369710467
Confusion Matrix:
[[6905   48]
 [  33 6484]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      6953
           1       0.99      0.99      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 3.6) Test 6: Noun + Adjective + Verb, TF, Random Forest

In [33]:
filter_type = 'noun+adj+verb'  
vectorizer_type = 'tf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_vec, y_train)

predictions = rf_model.predict(X_test_vec)

accuracy_nav_tf_rf = accuracy_score(y_test, predictions)
conf_matrix_nav_tf_rf = confusion_matrix(y_test, predictions)
report_nav_tf_rf = classification_report(y_test, predictions)
precision_nav_tf_rf = precision_score(y_test, predictions, average='macro')
recall_nav_tf_rf = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_nav_tf_rf}')
print('Confusion Matrix:')
print(conf_matrix_nav_tf_rf)
print('Classification Report:')
print(report_nav_tf_rf)

Accuracy: 0.9920564216778025
Confusion Matrix:
[[6901   52]
 [  55 6462]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      6953
           1       0.99      0.99      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 3.7) Test 7: Noun + Adjective, TFIDF, Random Forest

In [34]:
filter_type = 'noun+adj'  
vectorizer_type = 'tfidf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_vec, y_train)

predictions = rf_model.predict(X_test_vec)

accuracy_na_tfidf_rf = accuracy_score(y_test, predictions)
conf_matrix_na_tfidf_rf = confusion_matrix(y_test, predictions)
report_na_tfidf_rf = classification_report(y_test, predictions)
precision_na_tfidf_rf = precision_score(y_test, predictions, average='macro')
recall_na_tfidf_rf = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_na_tfidf_rf}')
print('Confusion Matrix:')
print(conf_matrix_na_tfidf_rf)
print('Classification Report:')
print(report_na_tfidf_rf)

Accuracy: 0.9920564216778025
Confusion Matrix:
[[6900   53]
 [  54 6463]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      6953
           1       0.99      0.99      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 3.8) Test 8: Noun + Verb, TFIDF, MultinomialNB

In [35]:
filter_type = 'noun+verb'  
vectorizer_type = 'tfidf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)

predictions = nb_model.predict(X_test_vec)

accuracy_nv_tfidf_mnb = accuracy_score(y_test, predictions)
conf_matrix_nv_tfidf_mnb = confusion_matrix(y_test, predictions)
report_nv_tfidf_mnb = classification_report(y_test, predictions)
precision_nv_tfidf_mnb = precision_score(y_test, predictions, average='macro')
recall_nv_tfidf_mnb = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_nv_tfidf_mnb}')
print('Confusion Matrix:')
print(conf_matrix_nv_tfidf_mnb)
print('Classification Report:')
print(report_nv_tfidf_mnb)

Accuracy: 0.9404602821083891
Confusion Matrix:
[[6626  327]
 [ 475 6042]]
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.95      0.94      6953
           1       0.95      0.93      0.94      6517

    accuracy                           0.94     13470
   macro avg       0.94      0.94      0.94     13470
weighted avg       0.94      0.94      0.94     13470



## 3.9) Test 9: Noun + Adjective + Verb, TF, Logistic Regression

In [36]:
filter_type = 'noun+adj+verb'  
vectorizer_type = 'tf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_vec, y_train)

predictions = lr_model.predict(X_test_vec)

accuracy_nav_tf_reg = accuracy_score(y_test, predictions)
conf_matrix_nav_tf_reg = confusion_matrix(y_test, predictions)
report_nav_tf_reg = classification_report(y_test, predictions)
precision_nav_tf_reg = precision_score(y_test, predictions, average='macro')
recall_nav_tf_reg = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_nav_tf_reg}')
print('Confusion Matrix:')
print(conf_matrix_nav_tf_reg)
print('Classification Report:')
print(report_nav_tf_reg)

Accuracy: 0.9948775055679288
Confusion Matrix:
[[6919   34]
 [  35 6482]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6953
           1       0.99      0.99      0.99      6517

    accuracy                           0.99     13470
   macro avg       0.99      0.99      0.99     13470
weighted avg       0.99      0.99      0.99     13470



## 3.10) Test 10: Noun + Adjective, TFIDF, MultinomialNB

In [37]:
filter_type = 'noun+adj'  
vectorizer_type = 'tfidf' 

X_train_vec, fitted_vectorizer = vectorize_data(X_train, filter_type, vectorizer_type, "train")
X_test_vec, _ = vectorize_data(X_test, filter_type, vectorizer_type, "test", fitted_vectorizer)

nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)

predictions = nb_model.predict(X_test_vec)

accuracy_na_tfidf_mnb = accuracy_score(y_test, predictions)
conf_matrix_na_tfidf_mnb = confusion_matrix(y_test, predictions)
report_na_tfidf_mnb = classification_report(y_test, predictions)
precision_na_tfidf_mnb = precision_score(y_test, predictions, average='macro')
recall_na_tfidf_mnb = recall_score(y_test, predictions, average='macro')

print(f'Accuracy: {accuracy_na_tfidf_mnb}')
print('Confusion Matrix:')
print(conf_matrix_na_tfidf_mnb)
print('Classification Report:')
print(report_na_tfidf_mnb)

Accuracy: 0.9354862657757981
Confusion Matrix:
[[6591  362]
 [ 507 6010]]
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.95      0.94      6953
           1       0.94      0.92      0.93      6517

    accuracy                           0.94     13470
   macro avg       0.94      0.94      0.94     13470
weighted avg       0.94      0.94      0.94     13470
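
The tests in this section repeat the same fit, predict, and score sequence, changing only the filter, the vectorizer, and the classifier. One way to cut the repetition is a small helper that returns the three recorded scores, stored in a dictionary keyed by the same abbreviations the variable names use. This is a sketch, not the notebook's code: `evaluate_model` and `results` are hypothetical names, and the toy corpus with its labels is invented so that the block runs without the notebook's data or its `vectorize_data` function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score


def evaluate_model(model, X_train_vec, y_train, X_test_vec, y_test):
    """Fit the model and return accuracy plus macro precision and recall (hypothetical helper)."""
    model.fit(X_train_vec, y_train)
    predictions = model.predict(X_test_vec)
    return {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, average="macro", zero_division=0),
        "recall": recall_score(y_test, predictions, average="macro", zero_division=0),
    }


# Toy corpus invented for illustration; labels 0 and 1 are arbitrary here.
train_texts = [
    "senate passes budget bill",
    "aliens endorse candidate",
    "court issues ruling on appeal",
    "miracle cure hidden by doctors",
]
y_train = [0, 1, 0, 1]
test_texts = ["senate court ruling", "aliens miracle cure"]
y_test = [0, 1]

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)
X_test_vec = vectorizer.transform(test_texts)

# Key each run by (filter, vectorizer, model) instead of encoding them in variable names.
results = {}
results[("nav", "tf", "reg")] = evaluate_model(
    LogisticRegression(max_iter=1000), X_train_vec, y_train, X_test_vec, y_test
)
print(results)
```

With the scores held in one dictionary, a report can iterate over `results.items()` rather than rebuilding variable names as strings.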



## 3.11) Performance Report

In [39]:
header = "| ML Model                       | Feature | Filter                     | Precision  | Recall    | Accuracy  |"
separator = "|--------------------------------|---------|----------------------------|------------|-----------|-----------|"

# Column widths match the header; metric values are passed in as pre-formatted strings.
row_template = "| {model:<30} | {feature:<7} | {filter_name:<26} | {precision:<10} | {recall:<9} | {accuracy:<9} |"

markdown_table = f"{header}\n{separator}\n"

# (model name, filter abbreviation, filter name, model abbreviation, vectorizer)
models_features = [
    ("Logistic Regression", "na", "Noun + Adjective", "reg", "tfidf"),
    ("Support Vector Machine", "na", "Noun + Adjective", "svm", "tf"),
    ("Gradient Boosting Machine", "nav", "Noun + Adjective + Verb", "gbm", "tf"),
    ("Gradient Boosting Machine", "na", "Noun + Adjective", "gbm", "tfidf"),
    ("Support Vector Machine", "nv", "Noun + Verb", "svm", "tfidf"),
    ("Random Forest", "nav", "Noun + Adjective + Verb", "rf", "tf"),
    ("Random Forest", "na", "Noun + Adjective", "rf", "tfidf"),
    ("Multinomial Naive Bayes", "nv", "Noun + Verb", "mnb", "tfidf"),
    ("Logistic Regression", "nav", "Noun + Adjective + Verb", "reg", "tf"),
    ("Multinomial Naive Bayes", "na", "Noun + Adjective", "mnb", "tfidf"),
]


def format_metric(name):
    # Look up a metric set by an earlier test cell; show "N/A" if that cell was not run.
    # Formatting the "N/A" fallback with ".4f" would raise a ValueError, so format here.
    value = globals().get(name)
    return f"{value:.4f}" if value is not None else "N/A"


# filter_name avoids shadowing the built-in filter().
for model, filter_short, filter_name, model_short, feature in models_features:
    suffix = f"{filter_short}_{feature}_{model_short}"
    markdown_table += row_template.format(
        model=model,
        feature=feature.upper(),
        filter_name=filter_name,
        precision=format_metric(f"precision_{suffix}"),
        recall=format_metric(f"recall_{suffix}"),
        accuracy=format_metric(f"accuracy_{suffix}"),
    ) + "\n"

display(Markdown(markdown_table))

| ML Model                       | Feature | Filter                     | Precision  | Recall    | Accuracy  |
|--------------------------------|---------|----------------------------|------------|-----------|-----------|
| Logistic Regression            | TFIDF   | Noun + Adjective           | 0.9873     | 0.9873    | 0.9873    |
| Support Vector Machine         | TF      | Noun + Adjective           | 0.9947     | 0.9948    | 0.9947    |
| Gradient Boosting Machine      | TF      | Noun + Adjective + Verb    | 0.9950     | 0.9951    | 0.9950    |
| Gradient Boosting Machine      | TFIDF   | Noun + Adjective           | 0.9947     | 0.9948    | 0.9947    |
| Support Vector Machine         | TFIDF   | Noun + Verb                | 0.9939     | 0.9940    | 0.9940    |
| Random Forest                  | TF      | Noun + Adjective + Verb    | 0.9921     | 0.9920    | 0.9921    |
| Random Forest                  | TFIDF   | Noun + Adjective           | 0.9921     | 0.9920    | 0.9921    |
| Multinomial Naive Bayes        | TFIDF   | Noun + Verb                | 0.9409     | 0.9400    | 0.9405    |
| Logistic Regression            | TF      | Noun + Adjective + Verb    | 0.9949     | 0.9949    | 0.9949    |
| Multinomial Naive Bayes        | TFIDF   | Noun + Adjective           | 0.9359     | 0.9351    | 0.9355    |
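
The table lists the runs in test order, which makes the ranking hard to read at a glance. A minimal sketch of sorting by accuracy follows, with the rows copied from the table above. The `rows` and `ranked` names are illustrative and not part of the notebook.

```python
# (model, feature, filter, precision, recall, accuracy), copied from the table above.
rows = [
    ("Logistic Regression", "TFIDF", "Noun + Adjective", 0.9873, 0.9873, 0.9873),
    ("Support Vector Machine", "TF", "Noun + Adjective", 0.9947, 0.9948, 0.9947),
    ("Gradient Boosting Machine", "TF", "Noun + Adjective + Verb", 0.9950, 0.9951, 0.9950),
    ("Gradient Boosting Machine", "TFIDF", "Noun + Adjective", 0.9947, 0.9948, 0.9947),
    ("Support Vector Machine", "TFIDF", "Noun + Verb", 0.9939, 0.9940, 0.9940),
    ("Random Forest", "TF", "Noun + Adjective + Verb", 0.9921, 0.9920, 0.9921),
    ("Random Forest", "TFIDF", "Noun + Adjective", 0.9921, 0.9920, 0.9921),
    ("Multinomial Naive Bayes", "TFIDF", "Noun + Verb", 0.9409, 0.9400, 0.9405),
    ("Logistic Regression", "TF", "Noun + Adjective + Verb", 0.9949, 0.9949, 0.9949),
    ("Multinomial Naive Bayes", "TFIDF", "Noun + Adjective", 0.9359, 0.9351, 0.9355),
]

# Sort by accuracy, highest first; sorted() is stable, so ties keep their test order.
ranked = sorted(rows, key=lambda row: row[5], reverse=True)

for rank, (model, feature, pos_filter, precision, recall, accuracy) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {model:<26} {feature:<6} {pos_filter:<24} {accuracy:.4f}")
```

On these numbers every configuration except the two Multinomial Naive Bayes runs lands above 0.98 accuracy, and the top five runs are within 0.001 of each other, so the ordering among them should be read with caution.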
