**Abhina Premachandran Bindu**

**May 6 2024**

# Preprocessing the dataset using Gensim Library
  <p> The goal of this notebook is to explain how the classifier works. A Decision Tree classifier is trained on TF-IDF features of the cleaned text. To understand its predictions, SHAP waterfall plots are drawn for individual test samples. Feature importance is then read from the feature_importances_ attribute of the DecisionTree classifier. </p>
  
  
## Loading and initial cleaning of data

In [None]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# importing nlp
import nltk
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# importing sklearn for model building
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# importing shap
import shap
shap.initjs()

In [None]:
# combining the two separate csv files with fake and real data to a single dataframe 
# df1 --> Fake , df2 --> Real
df1 = pd.read_csv(input("Enter the file path for the fake dataset"))
df2 = pd.read_csv(input("Enter the file path for the real dataset"))

# adding the labels Fake --> 0 and Real --> 1
df1['target'] = 0
df2['target'] = 1

In [None]:
# combining the dataframes
combined_df = pd.concat([df1, df2], ignore_index=True)
# shuffling the indices
data = combined_df.sample(frac=1, random_state=42)
data.reset_index(inplace=True, drop=True)
data.to_csv('fake_real_final.csv', index = False)
print(data.head())

In [None]:
data.head()

In [None]:
data.info()

In [None]:
# checking the value counts of 'target' to check for data imbalance
data.target.value_counts()

Since the numbers of fake and real samples are almost the same, there is no significant class imbalance.
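
One way to make "almost the same" concrete is to compare class proportions rather than raw counts. The sketch below uses a small made-up frame (`toy`, not the notebook's data), and the 60% cutoff is an arbitrary rule of thumb chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up)
toy = pd.DataFrame({"target": [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]})

# Share of each class instead of raw counts
class_share = toy["target"].value_counts(normalize=True)

# Hypothetical rule of thumb: treat the data as balanced
# if no class holds more than 60% of the rows
is_balanced = class_share.max() <= 0.6
print(class_share.to_dict(), bool(is_balanced))
```

The same two lines can be applied to `data["target"]` in place of `toy["target"]`.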

In [None]:
data.subject.value_counts()

In [None]:
# visualize the distribution of subjects
data_real = data[data['target']==1]
data_fake = data[data['target']==0]
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(10, 12))
# Plot the distribution of subjects in real news on the first subplot
axs[0].hist(data_real['subject'], bins=len(data_real.subject.unique()), align = 'mid', edgecolor='black', color='blue')
axs[0].set_xlabel('Subjects')
axs[0].set_ylabel('Frequency')
axs[0].set_title('Real News Subjects')
axs[0].tick_params(axis='x') 
axs[0].legend(['Real News'])

axs[1].hist(data_fake['subject'], bins=len(data_fake.subject.unique()), align = 'mid', edgecolor='black', color = 'orange')
axs[1].set_xlabel('Subjects')
axs[1].set_ylabel('Frequency')
axs[1].set_title('Fake News Subjects')
axs[1].tick_params(axis='x', rotation=45) 
axs[1].legend(['Fake News'])
plt.tight_layout()
plt.show()

## Data Preprocessing

### cleaning and tokenizing

In [None]:
# Tokenize, remove stop words, lemmatize and stem
stop_words = set(stopwords.words('english'))
# create the lemmatizer and stemmer once instead of on every call
lemmatizer = WordNetLemmatizer()
porter = PorterStemmer()

def clean_text(text):
    # Tokenize
    tokens = word_tokenize(text)
    # drop stopwords, then strip non-alphabetical characters and lowercase
    cleaned_tokens = [re.sub(r'[^a-zA-Z]', '', token).lower() for token in tokens if token.lower() not in stop_words]
    # discard tokens left empty after stripping (pure punctuation or digits)
    cleaned_tokens = [token for token in cleaned_tokens if token]
    # Lemmatize, then stem each token
    stemmed_tokens = [porter.stem(lemmatizer.lemmatize(token)) for token in cleaned_tokens]
    # Join the tokens back into a string
    return ' '.join(stemmed_tokens)

# Apply the function across the DataFrame
data['cleaned_text'] = data['text'].apply(clean_text)

In [None]:
data.tail()

## Classifying the data using DecisionTreeClassifier

In [None]:
# defining X (cleaned text) and y (labels) arrays
X = data['cleaned_text'].values
y = data['target'].values

In [None]:
# Create training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=44)

In [None]:
# defining the Tfidf vectorizer
vectorizer = TfidfVectorizer(min_df=10)
X_train_vec = vectorizer.fit_transform(X_train).toarray()
X_test_vec = vectorizer.transform(X_test).toarray()

In [None]:
# defining the classification model
tree_clf = tree.DecisionTreeClassifier()
tree_clf.fit(X_train_vec,y_train)

In [None]:
# predicting the test values
y_pred = tree_clf.predict(X_test_vec)
# printing the classification report
print(classification_report(y_test, y_pred))

## Understanding the classification model

### 1. Using Shap for Decision Tree clf

In [None]:
# getting the feature names from tfidf vectorizer
feature_names = vectorizer.get_feature_names_out()
# getting the shap values
explainer = shap.Explainer(tree_clf, X_train_vec, feature_names=feature_names)
shap_values = explainer(X_test_vec)
print(shap_values.values.shape)

In [None]:
# getting the shap waterfall plot for the 7th test data
shap.initjs()

ind = 6
print(X_test[ind])

shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
# getting the shap waterfall plot for the 11th test data
shap.initjs()

ind = 10
print(X_test[ind])

shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
# getting the shap waterfall plot for the 2nd test data
shap.initjs()

ind = 1
print(X_test[ind])

shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
X_test[1]

  From the three waterfall plots above, the model appears to use the word 'reuter' as the primary indicator of whether a text is fake or real. When the SHAP value of 'reuter' is greater than 0, it pushes the prediction toward real; when it is below 0, it pushes the prediction toward fake.
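
A simple sanity check on a finding like this is to measure how often the token occurs in each class. The sketch below uses a made-up frame with the same column names as the notebook (`cleaned_text`, `target`); the rows and the helper name `token_share_by_class` are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

def token_share_by_class(frame, token, text_col="cleaned_text", label_col="target"):
    """Return the fraction of documents per class that contain `token` as a whole word."""
    has_token = frame[text_col].str.split().apply(lambda words: token in words)
    return has_token.groupby(frame[label_col]).mean()

# Toy rows for illustration only
toy = pd.DataFrame({
    "cleaned_text": [
        "washington reuter senat vote bill",
        "reuter presid sign order",
        "video trump tweet go viral",
        "watch obama speech video",
    ],
    "target": [1, 1, 0, 0],
})

share = token_share_by_class(toy, "reuter")
print(share.to_dict())
```

If the share is close to 1 for one class and close to 0 for the other, the token may be acting as a marker of the news source rather than of the content.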

In [None]:
shap.plots.scatter(shap_values[:,feature_names.tolist().index("reuter"),1])


In [None]:
shap.plots.scatter(shap_values[:,feature_names.tolist().index("reuter"),0])

The scatter plots above show how the 'reuter' feature influences the model's prediction for class 1 (real) and class 0 (fake), respectively. For class 1, most of the SHAP values for this feature lie close to 0.5, which indicates how important the feature is to the classifier when predicting a text as real.

In [None]:
shap.summary_plot(
    shap_values[:,:,1], X_test_vec, feature_names=feature_names
)

### 2. Using Decision Tree classifier features

In [None]:
class_names = ["Fake", "Real"]

fig = plt.figure(figsize=(20, 12))
vis = tree.plot_tree(
    tree_clf,
    class_names=class_names,
    feature_names = vectorizer.get_feature_names_out(),
    max_depth=3,
    fontsize=9,
    proportion=True,
    filled=True,
    rounded=True
)
plt.show()


In [None]:
feature_names = vectorizer.get_feature_names_out()
feature_importance = tree_clf.feature_importances_
inds = np.argsort(np.abs(feature_importance))[::-1]
top_10_inds = inds[:10]
fig, ax = plt.subplots()
rank = np.arange(10)
ax.bar(rank, feature_importance[top_10_inds])
ax.set_xticks(rank)
ax.set_xticklabels(np.array(feature_names)[top_10_inds], rotation=45, ha='right')
ax.set_ylabel("Feature importance")
ax.set_title("Top 10 most important features")
plt.tight_layout()
plt.show()

  The tree visualization above indicates that the classifier uses the 'reuter' feature as one of its main features for deciding whether a text is fake or real. At the next level, 'zika' and 'wiretv' are used to split the data into the respective classes based on threshold values for those features. The bar chart of feature importances also indicates that the 'reuter' feature has far more influence on the model's decision than the other features.
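
To see how `feature_importances_` behaves when a single feature almost determines the label, as 'reuter' appears to here, the following sketch fits a tree on synthetic data. The data, the `leaky` column and all names in it are illustrative assumptions and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n_samples = 200

# Random binary labels and three pure-noise features
y_toy = rng.integers(0, 2, size=n_samples)
noise = rng.random((n_samples, 3))

# Column 0 copies the label (a deliberately "leaky" feature)
X_toy = np.column_stack([y_toy.astype(float), noise])
toy_names = ["leaky", "noise_a", "noise_b", "noise_c"]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Importances are normalized, so they sum to 1 across features
importances = toy_tree.feature_importances_
print(dict(zip(toy_names, importances.round(3))))

# Text view of the fitted tree: the root split is on the leaky feature
print(export_text(toy_tree, feature_names=toy_names))
```

Because one split on the leaky column separates the classes completely, that column receives essentially all of the importance and the tree never needs the noise features. A real feature that dominates in a similar way is worth inspecting for leakage.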