## Environment Setup
This notebook reads two input files, one containing true news articles and one containing fake ones. Each record has a title, text, subject, and date.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# %matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

df_true = pd.read_csv("/Users/sanaketabchi/project/DS/NLP/True.csv")
df_fake = pd.read_csv("/Users/sanaketabchi/project/DS/NLP/Fake.csv")

# df_true.shape
# df_fake.shape

df_true.head()

| Unnamed: 0 | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


## Preprocessing
Add a column that specifies whether each article is real news or fake news.

In [2]:
# Assigning a scalar broadcasts it to every row
df_true['TrueNews?'] = True
df_fake['TrueNews?'] = False


Combine the real and fake news sets into one dataframe.

In [3]:
df_all = pd.concat([df_true, df_fake], axis = 0)

Since the subject column contains categorical data, it needs to be encoded using `OneHotEncoder`.

In [4]:
# df_all['subject'].unique()

one_hot = OneHotEncoder()
subjEncoded = one_hot.fit_transform(df_all[['subject']])
print(subjEncoded.shape)

(44898, 8)
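
To make the encoding step concrete, here is a minimal sketch on a hypothetical toy frame (the `toy` dataframe and its values are illustrative assumptions, not the real data). It shows that the encoder returns a sparse matrix with one column per distinct subject, which is why the real output has 8 columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for df_all[['subject']]
toy = pd.DataFrame({"subject": ["politicsNews", "worldnews", "News", "politicsNews"]})

# handle_unknown="ignore" encodes unseen categories as all zeros at transform time
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[["subject"]])  # sparse matrix, one column per category

print(encoded.shape)           # (rows, number of distinct subjects)
print(encoder.categories_[0])  # column order of the encoded matrix
print(encoded.toarray())       # dense view, only sensible for small data
```

Each row contains exactly one 1, in the column that matches its subject.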


Vectorize the text and title of each article using TF-IDF.

In [5]:
vectorizer = TfidfVectorizer()
title = vectorizer.fit_transform(df_all.title)
feature_names_title = vectorizer.get_feature_names_out()
text = vectorizer.fit_transform(df_all.text)
feature_names_text = vectorizer.get_feature_names_out()

print(text.shape)
print(title.shape)


(44898, 122002)
(44898, 20896)
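
The cell above reuses one `TfidfVectorizer`, so after the second `fit_transform` the object only remembers the text vocabulary. The sketch below is one possible variant, assuming you may later want to transform unseen titles: it keeps a separate fitted vectorizer per column. The sample strings and the names `title_vectorizer` and `text_vectorizer` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents standing in for df_all.title and df_all.text
titles = ["budget fight looms", "senator comments on probe", "budget vote delayed"]
texts = [
    "the head of a conservative group spoke",
    "the special counsel probe continues",
    "lawmakers delayed the budget vote",
]

# One vectorizer per column, so each fitted vocabulary stays available
title_vectorizer = TfidfVectorizer()
text_vectorizer = TfidfVectorizer()

title_tfidf = title_vectorizer.fit_transform(titles)
text_tfidf = text_vectorizer.fit_transform(texts)

# Unseen titles can be mapped into the same title feature space
new_title = title_vectorizer.transform(["budget probe"])

print(title_tfidf.shape, text_tfidf.shape, new_title.shape)
```

The column count of `new_title` matches `title_tfidf`, which is what a downstream model trained on the title features would expect.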


The vectorized text and title contain about 122k and 21k columns respectively. That is too many to process efficiently, and not all of these columns contain useful information. Truncated SVD can be used to reduce the dimensionality. Unlike standard PCA, it does not center the data, so it works directly on sparse matrices.

In [6]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=1000)
text = svd.fit_transform(text)

svd = TruncatedSVD(n_components=100)
title = svd.fit_transform(title)
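
The component counts above (1000 and 100) are fixed by hand. One way to judge such a choice is to look at how much variance the kept components retain. The following is a minimal sketch on a synthetic sparse matrix (an assumption standing in for the TF-IDF output); the sizes are arbitrary and chosen only so the example runs quickly.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for a TF-IDF matrix (assumption)
X_sparse = sparse_random(200, 500, density=0.02, random_state=0, format="csr")

# random_state makes the randomized solver reproducible
svd_demo = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd_demo.fit_transform(X_sparse)  # dense array of shape (rows, n_components)

# Fraction of the total variance captured by the kept components
retained = svd_demo.explained_variance_ratio_.sum()

print(X_reduced.shape)
print(f"variance retained: {retained:.3f}")
```

If the retained fraction is low for the real data, a larger `n_components` may be worth the extra cost.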

Now concatenate the reduced title and text features and the encoded subject columns into a new dataframe.

In [7]:
title = pd.DataFrame(title)
text = pd.DataFrame(text)
# print(subjEncoded.shape)
subjEncoded = pd.DataFrame.sparse.from_spmatrix(subjEncoded)
# print(l.shape)
# subjEncoded = pd.DataFrame(subjEncoded)
X = pd.concat([title, text, subjEncoded], axis=1)
# subjEncoded.head()


In [8]:
X = X.to_numpy()
y = df_all['TrueNews?'].to_numpy()
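
`pd.concat(..., axis=1)` aligns rows by index, which works here because all three frames carry a fresh default index. As an alternative sketch, the same feature matrix can be built positionally with `np.hstack`, skipping the intermediate dataframes. The arrays below are small random stand-ins (assumptions), not the real features.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Stand-ins for the SVD outputs and the one-hot encoded subject (assumptions)
title_reduced = rng.normal(size=(5, 3))
text_reduced = rng.normal(size=(5, 4))
subj_sparse = sparse.csr_matrix(np.eye(2)[[0, 1, 0, 1, 0]])  # 5 rows, 2 categories

# Stack column blocks side by side; rows are matched by position, not by index
X_demo = np.hstack([title_reduced, text_reduced, subj_sparse.toarray()])

print(X_demo.shape)
```

Converting the subject block to dense is cheap here because it only has a handful of columns.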

Split the data into train and test samples and train a baseline random forest classifier on the training set.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.3, random_state=100)
forest = RandomForestClassifier(min_samples_split=100, bootstrap=True, oob_score=True, random_state=100)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)
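
The baseline forest is created with `oob_score=True`, but the out-of-bag estimate is never read. A minimal sketch of how it could be inspected follows, using synthetic data from `make_classification` (an assumption, so the snippet runs without the news files). The `stratify` argument is an optional addition that keeps the class ratio equal in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for X and y (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=100
)

forest_demo = RandomForestClassifier(
    n_estimators=100, min_samples_split=10, oob_score=True, random_state=100
)
forest_demo.fit(X_tr, y_tr)

# oob_score_ is the accuracy on samples each tree did not see during bootstrapping
print("OOB accuracy: ", forest_demo.oob_score_)
print("Test accuracy:", forest_demo.score(X_te, y_te))
```

The out-of-bag accuracy gives a validation estimate from the training data alone, so it can be compared with the held-out test accuracy.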

Some performance metrics for this baseline model:

In [13]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
print("Accuracy:", metrics.accuracy_score(y_test, predictions))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, predictions))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00      7056
        True       1.00      1.00      1.00      6414

    accuracy                           1.00     13470
   macro avg       1.00      1.00      1.00     13470
weighted avg       1.00      1.00      1.00     13470

Accuracy: 0.9991833704528582
Confusion Matrix:
 [[7045   11]
 [   0 6414]]
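
Because the report rounds to two decimals, every score prints as 1.00. The sketch below recomputes the metrics from the confusion matrix shown above, which makes the 11 misclassified articles visible. It relies on scikit-learn's convention that rows are actual labels and columns are predicted labels, in sorted order (`False`, then `True`).

```python
import numpy as np

# Baseline confusion matrix copied from the output above
# rows: actual False, actual True; columns: predicted False, predicted True
cm = np.array([[7045, 11],
               [0, 6414]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_true = tp / (tp + fp)  # of articles predicted True, fraction that are True
recall_true = tp / (tp + fn)     # of actual True articles, fraction that were found

print("errors:", fp + fn)
print("accuracy:", accuracy)
print("precision (True):", precision_true)
print("recall (True):", recall_true)
```

All 11 errors are fake articles predicted as true; no true article was labelled fake by the baseline.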


## Hyperparameter Tuning

First, build a grid of candidate hyperparameter values for the randomized search to sample from.

In [18]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 200, num = 10)]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [50, 100, 200]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}
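
To put the search budget in context, this short sketch rebuilds the same grid and counts its combinations. The variable `n_iter` mirrors the value passed to the search in the next cell.

```python
import numpy as np

# Same candidate values as in the grid above
n_estimators = [int(x) for x in np.linspace(start=50, stop=200, num=10)]
max_depth = [int(x) for x in np.linspace(10, 110, num=11)] + [None]
min_samples_split = [50, 100, 200]

# Size of the full grid: product of the number of options per hyperparameter
total = len(n_estimators) * len(max_depth) * len(min_samples_split)

n_iter = 50  # number of combinations the randomized search will sample
print("total combinations:", total)
print(f"fraction sampled: {n_iter / total:.1%}")
```

The grid holds 10 x 12 x 3 = 360 combinations, so 50 iterations cover roughly one in seven of them.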

Now use the random grid produced above to find the best set of hyperparameters.

In [19]:
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 50,
                               cv = 3, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


Here are the parameters of the best model found by the search:

In [24]:
rf_random.best_params_

{'n_estimators': 133, 'min_samples_split': 50, 'max_depth': 110}
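
Besides `best_params_`, the fitted search object exposes the mean cross-validated score of the winning combination. The sketch below shows this on synthetic data with a deliberately tiny grid (both are assumptions, so it runs in seconds). It also sets `random_state` on the estimator, which the cell above does not, so that repeated runs give the same result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train and y_train (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Small illustrative grid, not the one used in this notebook
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10, 20],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("mean CV accuracy of best candidate:", search.best_score_)
print("candidates evaluated:", len(search.cv_results_["params"]))
```

Comparing `best_score_` with the baseline's validation estimate shows whether the tuned settings actually help before the test set is touched.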

Evaluating the performance of this model on the test set, we have:

In [26]:
optimalPredictions = rf_random.predict(X_test)

print(classification_report(y_test, optimalPredictions))
print("Accuracy:", metrics.accuracy_score(y_test, optimalPredictions))
print("Confusion Matrix:\n", metrics.confusion_matrix(y_test, optimalPredictions))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00      7056
        True       1.00      1.00      1.00      6414

    accuracy                           1.00     13470
   macro avg       1.00      1.00      1.00     13470
weighted avg       1.00      1.00      1.00     13470

Accuracy: 0.9991833704528582
Confusion Matrix:
 [[7046   10]
 [   1 6413]]
