**Abhina Premachandran Bindu**

**May 22 2024**

# Preprocessing the dataset using Gensim Library
  <p>The goal of this notebook is to explain how the classifier works. A Decision Tree classifier is trained on TF-IDF vectors built from the cleaned text. To understand how the classifier behaves, SHAP plots are drawn for individual test samples. In addition, feature importance is obtained from the feature_importances_ attribute of the DecisionTree classifier.</p>
  <p>This is similar to the previous notebook, except that here the analysis is done after removing 'reuters' from the data, since it provides no new insights.</p>
  
  
## Loading and initial cleaning of data

In [None]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# importing nlp
import nltk
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# importing sklearn for model building
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# importing shap
import shap
shap.initjs()

In [None]:
# combining the two separate csv files with fake and real data to a single dataframe 
# df1 --> Fake , df2 --> Real
df1 = pd.read_csv(input("Enter the file path for the fake dataset"))
df2 = pd.read_csv(input("Enter the file path for the real dataset"))

# adding the labels Fake --> 0 and Real --> 1
df1['target'] = 0
df2['target'] = 1

In [None]:
# combining the dataframes
combined_df = pd.concat([df1, df2], ignore_index=True)
# shuffling the indices
data = combined_df.sample(frac=1, random_state=42)
data.reset_index(inplace=True, drop=True)
data.to_csv('fake_real_final.csv', index = False)
print(data.head())

In [None]:
data.head()

In [None]:
data.info()

In [None]:
# checking the value counts of 'target' to check for data imbalance
data.target.value_counts()

Since the Fake and Real classes have almost the same number of samples, there is no significant class imbalance.
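
The balance check can also be expressed as class proportions rather than raw counts. The sketch below is only an illustration: the `toy_target` series is made up (it is not the notebook's data), and the 0.1 tolerance is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical labels for illustration only: 0 = Fake, 1 = Real
toy_target = pd.Series([0, 1, 0, 1, 1, 0, 0, 1, 0, 1], name="target")

# Share of each class instead of raw counts
proportions = toy_target.value_counts(normalize=True).sort_index()
print(proportions)

# Treat the data as balanced if no class deviates from 50% by more than
# an arbitrarily chosen tolerance (an assumption, not a standard value)
tolerance = 0.1
is_balanced = bool((proportions - 0.5).abs().max() <= tolerance)
print("balanced:", is_balanced)
```

On the real data, the same two lines could be applied to `data.target`.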

In [None]:
data.subject.value_counts()

## Data Preprocessing

### cleaning and tokenizing

In [None]:
# Text cleaning
stop_words = set(stopwords.words('english'))
remove_words = {'reuters', 'reuter'}
punctuation = set(string.punctuation)
lemmatizer = WordNetLemmatizer()
porter = PorterStemmer()

def clean_text(text):
    # Tokenizing
    tokens = word_tokenize(text)
    # dropping stopwords, then stripping non-alphabetical characters and lowercasing
    cleaned_tokens = [re.sub(r'[^a-zA-Z ]', '', token).lower() for token in tokens if token.lower() not in stop_words]
    # dropping punctuation and tokens left empty by the substitution above
    cleaned_tokens = [token for token in cleaned_tokens if token and token not in punctuation]
    # removing the news media name - 'reuters' from the text
    cleaned_tokens = [token for token in cleaned_tokens if token not in remove_words]
    # Lemmatizing the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]
    # Stemming the lemmatized tokens and joining them back into a single string
    return " ".join(porter.stem(token) for token in lemmatized_tokens)

# Applying the function across the DataFrame
data['cleaned_text'] = data['text'].apply(clean_text)

In [None]:
data.tail()

## Classifying the data using DecisionTreeClassifier

In [None]:
# defining X (cleaned text) and y (labels) arrays
X = data['cleaned_text'].values
y = data['target'].values

In [None]:
# Create training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=44)

In [None]:
# defining the Tfidf vectorizer
vectorizer = TfidfVectorizer(min_df=10)
X_train_vec = vectorizer.fit_transform(X_train).toarray()
X_test_vec = vectorizer.transform(X_test).toarray()

In [None]:
# defining the classification model
tree_clf = tree.DecisionTreeClassifier()
tree_clf.fit(X_train_vec,y_train)

In [None]:
# predicting the test values
y_pred = tree_clf.predict(X_test_vec)
# printing the classification report
print(classification_report(y_test, y_pred))

## Understanding the classification model

### 1. Using Shap for Decision Tree clf

In [None]:
# getting the feature names from tfidf vectorizer
feature_names = vectorizer.get_feature_names_out()
# getting the shap values
explainer = shap.Explainer(tree_clf, X_train_vec, feature_names=feature_names)
shap_values = explainer(X_test_vec)
print(shap_values.values.shape)

In [None]:
# getting the shap waterfall plot for the 7th test data
shap.initjs()

ind = 6
print(X_test[ind])

shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
# getting the shap waterfall plot for the 11th test data
shap.initjs()

ind = 10
print(X_test[ind])

shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
# getting the shap waterfall plot for the 2nd test data
shap.initjs()

ind = 1
print(X_test[ind])

shap.plots.waterfall(shap_values[ind,:,1])

In [None]:
X_test[1]

  From the three waterfall plots above, the model appears to use the word 'reuter' as the primary indicator of whether a text is classified as fake or real. If the SHAP value of 'reuter' is greater than 0, the prediction is pushed toward the real class, and vice versa.
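
When reading the waterfall plots, it may help to recall the additivity property that SHAP explanations satisfy. The notation here is chosen for illustration: $f(x)$ is the model output for the Real class (index 1 in the plots above), $\mathbb{E}[f(X)]$ is the base value at which the waterfall starts, and $\phi_i$ is the SHAP value of feature $i$ out of $M$ features.

$$
f(x) = \mathbb{E}[f(X)] + \sum_{i=1}^{M} \phi_i
$$

A feature with $\phi_i > 0$ moves the output toward the Real class, and a feature with $\phi_i < 0$ moves it toward the Fake class.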

In [None]:
shap.summary_plot(
    shap_values[:,:,1], X_test_vec, feature_names=feature_names
)

### 2. Using Decision Tree classifier features

In [None]:
class_names = ["Fake", "Real"]

fig = plt.figure(figsize=(20, 12))
vis = tree.plot_tree(
    tree_clf,
    class_names=class_names,
    feature_names = vectorizer.get_feature_names_out(),
    max_depth=3,
    fontsize=9,
    proportion=True,
    filled=True,
    rounded=True
)
plt.show()


In [None]:
feature_names = vectorizer.get_feature_names_out()
feature_importance = tree_clf.feature_importances_
inds = np.argsort(np.abs(feature_importance))[::-1]
top_10_inds = inds[:10]
fig, ax = plt.subplots()
rank = np.arange(10)
ax.bar(rank, feature_importance[top_10_inds])
ax.set_xticks(rank)
ax.set_xticklabels(np.array(feature_names)[top_10_inds], rotation=45, ha='right')
ax.set_ylabel("Feature importance")
ax.set_title("Top 10 most important features")
plt.tight_layout()
plt.show()

  The tree visualization above indicates that the classifier uses the 'reuter' feature as one of its main features for deciding whether a text is fake or real. At the next level, 'zika' and 'wiretv' are used to split the data into the respective classes based on threshold values for those features. The feature importance bar chart also indicates that the 'reuter' feature has far more influence on the model's decision than the other features.
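
To make the idea of threshold-based splits and impurity-based importance concrete, here is a minimal sketch on a made-up feature matrix. The `toy_*` names, the `word_a`/`word_b`/`word_c` features and all numbers are hypothetical and unrelated to the notebook's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical TF-IDF-like matrix (rows = documents, columns = words)
toy_feature_names = ["word_a", "word_b", "word_c"]
toy_X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.7, 0.3, 0.1],
    [0.0, 0.2, 0.0],
    [0.1, 0.0, 0.3],
    [0.0, 0.1, 0.1],
])
toy_y = np.array([1, 1, 1, 0, 0, 0])  # 0 = Fake, 1 = Real

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Text view of the learned splits and their thresholds
print(export_text(toy_tree, feature_names=toy_feature_names))

# Feature and threshold used at the root node
root_feature = toy_feature_names[toy_tree.tree_.feature[0]]
root_threshold = toy_tree.tree_.threshold[0]
print(root_feature, root_threshold)

# Impurity-based importances are normalized to sum to 1
toy_importances = toy_tree.feature_importances_
print(dict(zip(toy_feature_names, toy_importances)))
```

Because `word_a` alone separates the two toy classes, it is chosen at the root and receives all of the importance. A single dominant bar in the chart above can be read the same way.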

## Testing the Python libraries for the above code

In [None]:
import model_explainability_shap as shap_model
import decision_tree_visualization as dt_visual

In [None]:
shap_model.plot_waterfall(shap_values, 6, X_test)

In [None]:
shap_model.plot_waterfall(shap_values, 10, X_test)

In [None]:
shap_model.plot_summary(shap_values, X_test_vec, feature_names)

In [None]:
class_names = ["Fake", "Real"]
dt_visual.plot_tree_and_feature_importance(tree_clf, vectorizer, class_names)

The waterfall plots above give an idea of which features the model relies on for class prediction. For the test sample at index 6, the features 'said', 'via', and 'washington' push the prediction toward the real class. For the test sample at index 10, the features 'washington' and 'breitbart' add weight toward the fake class. These plots indicate how the model behaves at the local level.
The summary plot above shows how the model works on a global scale. It shows that 'said' and 'via' are two important word features that the model relies on heavily when deciding which class a sample belongs to.
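
The global ranking in a summary plot can be approximated by averaging the absolute SHAP values of each feature over all samples. The sketch below assumes a small made-up SHAP matrix. The numbers in `toy_shap` are hypothetical and are not results from this notebook; only the feature names are borrowed from the text above.

```python
import numpy as np

# Hypothetical SHAP values (rows = test samples, columns = features),
# made up for illustration only
toy_features = np.array(["said", "via", "washington"])
toy_shap = np.array([
    [0.30, -0.10, 0.05],
    [-0.25, 0.20, -0.02],
    [0.35, -0.15, 0.01],
])

# Global importance as the mean absolute SHAP value per feature
mean_abs = np.abs(toy_shap).mean(axis=0)

# Rank features from most to least influential
order = np.argsort(mean_abs)[::-1]
ranking = toy_features[order].tolist()
print(ranking)
```

With the notebook's objects, the same computation could be applied to `shap_values.values[:, :, 1]`, assuming the three-dimensional shape printed earlier.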