<a href="https://colab.research.google.com/github/Mosle963/AI_Projects/blob/main/Detecting_Fake_News.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Connecting to drive to access dataset and save models there**


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Importing the dataset as a pandas DataFrame**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
df = pd.read_csv('/content/drive/MyDrive/PR2/true-fake-news-processed.csv')

**Viewing the first few rows of the DataFrame**

In [None]:
df.head()

Unnamed: 0,text,label
0,donald trump ha white house republican control...,0
1,sick tired hearing donald trump whine fake new...,0
2,secret gop brass le thrilled donald trump pres...,0
3,glenn beck man described forbes someone manage...,0
4,former fbi agent navy seal jonathan gilliam sa...,0


**View the shape of the DataFrame: 34330 rows, each containing**
*   text  : the text of the news article
*   label : 0 for fake news, 1 for true news

In [None]:
df.shape

(34330, 2)

In [None]:
df.tail()

Unnamed: 0,text,label
34325,mexico city reuters mexico wa pitched deep unc...,1
34326,washington reuters mexican finance minister jo...,1
34327,united nation reuters united nation security c...,1
34328,washington reuters president donald trump said...,1
34329,riyadh reuters oh arab oh muslim slaughtered o...,1


**The text preprocessing function below was used to preprocess the dataset and will be used later to preprocess new text before prediction**

The following steps are applied to the text:

*   collapse extra whitespace and remove special characters, single letters, and non-alphabetic characters
*   convert all letters to lower case
*   lemmatize each word with the NLTK WordNet lemmatizer
*   remove English stopwords with NLTK

In [None]:
import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.data.path.append("/content/drive/MyDrive/PR2/nltk_data/")

# Build these once instead of on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def process_text(text):
    text = str(text)  # Guard against non-string input such as NaN
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space
    text = re.sub(r'\W', ' ', text)  # Replace special (non-word) characters with spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # Remove single letters surrounded by whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove any remaining non-alphabetic characters
    text = text.lower()

    words = word_tokenize(text)
    # Lemmatization runs before stopword removal, so tokens such as 'wa' and 'ha'
    # in the dataset above likely come from 'was' and 'has'
    words = [lemmatizer.lemmatize(word) for word in words]
    words = [word for word in words if word not in stop_words]  # Drop English stopwords

    return ' '.join(words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
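
As a rough illustration of the regex steps only, the sketch below applies them to a made-up sentence without NLTK. The helper `regex_clean` and the sample sentence are assumptions for illustration and are not part of the notebook; tokenizing, lemmatizing, and stopword removal are skipped.

```python
import re

def regex_clean(text):
    """Apply only the regex steps of process_text (hypothetical helper)."""
    text = str(text)
    text = re.sub(r'\s+', ' ', text)             # collapse whitespace
    text = re.sub(r'\W', ' ', text)              # special characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop isolated single letters
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # drop digits and anything non-alphabetic
    return text.lower()

sample = "Breaking:  Senator X said 42% of voters agree!"
# split/join squeezes the spaces left behind by the substitutions
cleaned = " ".join(regex_clean(sample).split())
print(cleaned)  # breaking senator said of voters agree
```

Note that "of" survives here because stopword removal is an NLTK step, which this sketch leaves out.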


Defining a Tfidf class with the following functions:


*   init : create a TF-IDF model with the passed parameters
*   fit_transform : fit the model on the passed data, save the fitted vectorizer to Drive, and return the transformed data
*   transform : use the fitted model to transform the passed data
*   set and get for train and test data : because the TF-IDF vectors are reused repeatedly, the class stores them for later use; these functions make that possible


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

class TfIdf:
    """Wrapper around TfidfVectorizer that also caches the train and test vectors."""

    def __init__(self, name, max_features=1000, min_df=1, max_df=1000000):
        self.max_features = max_features
        self.min_df = min_df
        self.max_df = max_df
        self.name = name
        self.vectorizer = TfidfVectorizer(max_features=max_features, min_df=min_df, max_df=max_df)
        self.train_vectors = None
        self.test_vectors = None

    def fit_transform(self, text_data):
        tfidf_vectors = self.vectorizer.fit_transform(text_data)
        # Save the fitted vectorizer; the with-block closes the file handle
        with open(f"/content/drive/MyDrive/PR2/models/{self.name}.pkl", 'wb') as f:
            pickle.dump(self.vectorizer, f)
        return tfidf_vectors.toarray().tolist()

    def transform(self, text_data):
        tfidf_vectors = self.vectorizer.transform(text_data)
        return tfidf_vectors.toarray().tolist()

    def set_train_vectors(self, train_vectors):
        self.train_vectors = train_vectors

    def set_test_vectors(self, test_vectors):
        self.test_vectors = test_vectors

    def get_train_vectors(self):
        # Compare with None so that only a missing value counts as "not set"
        if self.train_vectors is None:
            raise ValueError("Train vectors not set. Call set_train_vectors() first.")
        return self.train_vectors

    def get_test_vectors(self):
        if self.test_vectors is None:
            raise ValueError("Test vectors not set. Call set_test_vectors() first.")
        return self.test_vectors
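
To make the TF-IDF weighting more concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is scikit-learn's default as far as the documented formula goes. The function `tfidf` and the three toy documents are illustrative assumptions, and options such as `max_features`, `min_df`, and `max_df` are not modeled.

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF over whitespace-tokenized documents (hypothetical helper)."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # Document frequency: in how many documents each word appears
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows

vocab, rows = tfidf(["trump said fake news", "reuters said trump", "reuters reported"])
# Words that occur in fewer documents get a larger weight within the same row
print(rows[0][vocab.index("fake")] > rows[0][vocab.index("said")])
```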


Defining a word2vec class with the following functions:


*   init : create a word2vec model with the passed parameters
*   make_corpus_iterable : tokenize each text so it can be passed to the word2vec training and transform functions
*   fit_transform : fit the model on the passed data, save it to Drive, and return the transformed data
*   transform : use the fitted model to transform the passed data
*   set and get for train and test data : because the word2vec vectors are reused repeatedly, the class stores them for later use; these functions make that possible


In [None]:
import gensim
import numpy as np

class myword2vec:

    def __init__(self,name,window_size=10,word_min_count=1,vector_size=200):
      self.window_size = window_size
      self.word_min_count = word_min_count
      self.vector_size = vector_size
      self.name = name
      self.word2vecmodel = gensim.models.Word2Vec(
          window = window_size,
          min_count = word_min_count,
          vector_size = vector_size)
      self.train_vectors = None
      self.test_vectors = None

    def make_corpus_iterable(self,text_data):
      corpus_iterable =[]
      for text in text_data:
        vector = gensim.utils.simple_preprocess(text)
        corpus_iterable.append(vector)
      return corpus_iterable

    def fit_transform(self,text_data):
        corpus_iterable = self.make_corpus_iterable(text_data)
        # Build the vocabulary and train the word2vec model
        self.word2vecmodel.build_vocab(corpus_iterable)
        self.word2vecmodel.train(corpus_iterable,
                        total_examples=self.word2vecmodel.corpus_count,
                        epochs = self.word2vecmodel.epochs)
        # Save the trained model; the with-block closes the file handle
        with open(f'/content/drive/MyDrive/PR2/models/{self.name}.pkl', 'wb') as f:
            pickle.dump(self.word2vecmodel, f)

        # Represent each document by the mean of its word vectors
        vectors=[]
        for text in corpus_iterable:
          vectors.append(self.word2vecmodel.wv.get_mean_vector(text))

        # Stack the per-document vectors into one 2D array (documents x vector_size)
        vectors_2d = np.stack(vectors)
        return vectors_2d

    def transform(self,text_data):
        corpus_iterable = self.make_corpus_iterable(text_data)
        # Represent each document by the mean of its word vectors
        vectors=[]
        for text in corpus_iterable:
          vectors.append(self.word2vecmodel.wv.get_mean_vector(text))

        # Stack the per-document vectors into one 2D array (documents x vector_size)
        vectors_2d = np.stack(vectors)

        return vectors_2d

    def set_train_vectors(self,train_vectors):
        self.train_vectors = train_vectors

    def set_test_vectors(self,test_vectors):
        self.test_vectors = test_vectors

    def get_train_vectors(self):
        return self.train_vectors

    def get_test_vectors(self):
        return self.test_vectors
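
The document-vector idea used above, one vector per article computed as the mean of its word vectors, can be sketched in plain Python. The `mean_vector` helper and the two-dimensional toy embeddings are assumptions for illustration. The sketch uses a plain unweighted mean with no normalization, so its numbers are not expected to match what gensim returns.

```python
def mean_vector(tokens, embeddings, size):
    """Average the vectors of known tokens; unknown tokens are skipped (hypothetical helper)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known word at all: fall back to a zero vector
        return [0.0] * size
    return [sum(v[i] for v in known) / len(known) for i in range(size)]

# Toy 2-dimensional embeddings, made up for this example
toy = {"fake": [1.0, 0.0], "news": [0.0, 1.0], "reuters": [1.0, 1.0]}
doc_vec = mean_vector(["fake", "news", "unknown"], toy, 2)
print(doc_vec)  # [0.5, 0.5]
```

Every document ends up with the same length (`size`), regardless of how many words it contains, which is what lets the classifiers consume it as a fixed-width feature row.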



**We define classes for the machine learning models, each with the following functions**



*   init : create a model with the passed parameters
*   fit : fit the model on the passed data and save it to Drive
*   predict : use the fitted model to predict the labels of the passed news vectors
*   report : return a report of the model's performance metrics on the passed data, compared with the passed correct labels



In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import classification_report
class PA:
  def __init__(self,name,max_iter=1000):
    self.max_iter = max_iter
    self.name = name
    self.model = PassiveAggressiveClassifier(max_iter=max_iter)

  def fit(self,X,y):
    self.model.fit(X,y)
    # Save the fitted model; the with-block closes the file handle
    with open(f'/content/drive/MyDrive/PR2/models/{self.name}.pkl', 'wb') as f:
      pickle.dump(self.model, f)

  def predict(self,X):
    return self.model.predict(X)

  def report(self,X_test,y_test):
    return classification_report(y_test,self.predict(X_test))

In [None]:
from sklearn.ensemble import RandomForestClassifier
class RF:
  def __init__(self,name,n_estimators=100):
    self.n_estimators = n_estimators
    self.name = name
    self.model = RandomForestClassifier(n_estimators=n_estimators)

  def fit(self,X,y):
    self.model.fit(X,y)
    # Save the fitted model; the with-block closes the file handle
    with open(f'/content/drive/MyDrive/PR2/models/{self.name}.pkl', 'wb') as f:
      pickle.dump(self.model, f)

  def predict(self,X):
    return self.model.predict(X)


  def report(self,X_test,y_test):
    return classification_report(y_test,self.predict(X_test))

In [None]:
from sklearn.svm import SVC
class SVM:
  def __init__(self,name,C=1.0,kernel='rbf'):
    self.C = C
    self.kernel = kernel
    self.name = name
    self.model = SVC(C=C,kernel=kernel)

  def fit(self,X,y):
    self.model.fit(X,y)
    # Save the fitted model; the with-block closes the file handle
    with open(f'/content/drive/MyDrive/PR2/models/{self.name}.pkl', 'wb') as f:
      pickle.dump(self.model, f)

  def predict(self,X):
    return self.model.predict(X)

  def report(self,X_test,y_test):
    return classification_report(y_test,self.predict(X_test))
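
The per-class numbers in these reports (precision, recall, f1-score) follow the standard definitions. A minimal sketch of how they are computed for one class is given below; the helper `prf` and the six toy labels are assumptions for illustration, not data from this notebook.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall and F1 for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of the predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of the real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
precision, recall, f1 = prf(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

A low recall for class 1 therefore means that most true articles are being labeled as fake, even when precision for that class looks acceptable.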

**An LSTM class with the following functions:**

*   init : set the parameters of the LSTM layer and of training
*   build : build the network, train it on the passed training data, save it to Drive, and return the final-epoch accuracy on the passed test data
*   predict : load the saved network and predict the label of the first passed vector



In [None]:
from keras.models import Sequential
from keras.layers import Dense , LSTM ,Input
import tensorflow as tf
import numpy as np
np.random.seed(42)
tf.random.set_seed(42)

class my_LSTM:
  def __init__(self,name,units = 32,epochs = 20,batch_size = 256):
    self.units = units
    self.epochs = epochs
    self.batch_size = batch_size
    self.name = name

  def build(self,X_train_vector,y_train,X_test_vector,y_test,Input_shape=1000):
    model = Sequential()
    model.add(LSTM(units = self.units , input_shape = (Input_shape,1) ))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    history = model.fit(X_train_vector, y_train, epochs=self.epochs, batch_size=self.batch_size,validation_data=(X_test_vector,y_test))
    model.save(f'/content/drive/MyDrive/PR2/models/{self.name}.h5')
    return history.history['val_accuracy'][-1]

  def predict(self,X_vector):
    model = tf.keras.models.load_model(f'/content/drive/MyDrive/PR2/models/{self.name}.h5')
    predict = model.predict(X_vector)
    res = 1 if predict[0] > 0.5 else 0
    return res
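
The `predict` method above returns a label for the first row only. A minimal sketch of thresholding a whole batch is shown here, assuming the sigmoid outputs are shaped as one single-element row per sample; `threshold_predictions` and the probabilities are hypothetical and not part of the notebook.

```python
def threshold_predictions(probs, threshold=0.5):
    """Turn rows of sigmoid outputs into 0/1 labels (hypothetical helper)."""
    return [1 if row[0] > threshold else 0 for row in probs]

# Made-up outputs for four samples
probs = [[0.91], [0.12], [0.50], [0.73]]
labels = threshold_predictions(probs)
print(labels)  # [1, 0, 0, 1]
```

A value of exactly 0.5 maps to 0 here, matching the strict `>` comparison used in the class.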


**The following function fits a vectorizer class (Tfidf or word2vec) on the passed train data and stores the train and test vectors in it**

In [None]:
def set_vectorizers(X_train,X_test,vectorizer):
  train_v = vectorizer.fit_transform(X_train)
  test_v = vectorizer.transform(X_test)
  vectorizer.set_train_vectors(train_v)
  vectorizer.set_test_vectors(test_v)
  return
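
The point of calling `fit_transform` on the train data and only `transform` on the test data is that the vocabulary is learned from the train set alone. A toy sketch of that separation follows; `ToyVectorizer` is a made-up bag-of-words stand-in, not the Tfidf or word2vec class from this notebook.

```python
class ToyVectorizer:
    """Bag-of-words stand-in used only to illustrate fit vs. transform (hypothetical)."""

    def fit_transform(self, docs):
        # The vocabulary is learned here, from the training documents only
        self.vocab = sorted({w for d in docs for w in d.split()})
        return self.transform(docs)

    def transform(self, docs):
        # Words outside the learned vocabulary are silently ignored
        return [[d.split().count(w) for w in self.vocab] for d in docs]

vec = ToyVectorizer()
train_v = vec.fit_transform(["fake news", "real news"])
test_v = vec.transform(["fake gossip"])
print(vec.vocab)  # ['fake', 'news', 'real']
print(test_v)     # [[1, 0, 0]]
```

The unseen word "gossip" contributes nothing to the test vector, which is one reason text from a different source can look nearly empty to a fitted vectorizer.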

**This function does the following**
* get the train and test vectors from the passed vectorizer class (the raw X_train and X_test arguments are not used)
* fit the machine learning class on the train vectors
* return a report with metrics computed on the test vectors



In [None]:
def train_predict_score(X_train,y_train,X_test,y_test,model,vectorizer):
  X_train = vectorizer.get_train_vectors()
  X_test = vectorizer.get_test_vectors()
  model.fit(X_train,y_train)
  score = model.report(X_test,y_test)
  return score

**This function uses the passed model to predict the label of the passed text after preprocessing it and transforming it with the passed vectorizer**

In [None]:
def predict(X,model,vectorizer):
  X = process_text(X)
  # The vectorizers expect an iterable of documents, so wrap the single text in a list
  X = vectorizer.transform([X])
  return model.predict(X)

**We split the data into train and test sets, using stratify to keep the classes balanced**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

X = df['text']
y = df['label']

# Shuffle the data before splitting
X, y = shuffle(X, y, random_state=42)
# Stratify the split on the labels so both sets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
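
What stratification does can be sketched in plain Python: split each class separately, then merge the pieces, so both sets keep the original class proportions. The function `stratified_split` and the 50/50 toy labels are assumptions for illustration; this is not how scikit-learn implements it internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)                       # shuffle within the class
        n_test = round(len(idxs) * test_size)   # same share of every class goes to test
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 80 20
```

This mirrors the equal support of 3433 per class seen in the test reports later on.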

**Define the vectorizer classes and set their train and test vectors using the function defined above**

In [None]:
v1_tfidf = TfIdf('Tfidf')
v1_word2vec = myword2vec('Word2vec')

In [None]:
set_vectorizers(X_train,X_test,v1_tfidf)
set_vectorizers(X_train,X_test,v1_word2vec)

**We define the model classes, two for each approach: one using Tfidf and another using word2vec**

In [None]:
PA_tfidf = PA('PA_tfidf')
PA_word2vec = PA('PA_word2vec')

RF_tfidf = RF('RF_tfidf')
RF_word2vec = RF('RF_word2vec')

SVM_tfidf = SVM('SVM_tfidf')
SVM_word2vec = SVM('SVM_word2vec')

LSTM_tfidf = my_LSTM('LSTM_tfidf')
LSTM_word2vec = my_LSTM('LSTM_word2vec')


**In the following we generate the reports and print them**

In [None]:
PA_tfidf_report = train_predict_score(X_train,y_train,X_test,y_test,PA_tfidf,v1_tfidf)
PA_word2vec_report = train_predict_score(X_train,y_train,X_test,y_test,PA_word2vec,v1_word2vec)

print("PA_tfidf_report" , ":" ,PA_tfidf_report)
print("PA_word2vec_report" ,":",PA_word2vec_report)

PA_tfidf_report :               precision    recall  f1-score   support

           0       0.99      0.99      0.99      3433
           1       0.99      0.99      0.99      3433

    accuracy                           0.99      6866
   macro avg       0.99      0.99      0.99      6866
weighted avg       0.99      0.99      0.99      6866

PA_word2vec_report :               precision    recall  f1-score   support

           0       0.96      0.98      0.97      3433
           1       0.98      0.96      0.97      3433

    accuracy                           0.97      6866
   macro avg       0.97      0.97      0.97      6866
weighted avg       0.97      0.97      0.97      6866



In [None]:
RF_tfidf_report =  train_predict_score(X_train,y_train,X_test,y_test,RF_tfidf,v1_tfidf)
RF_word2vec_report = train_predict_score(X_train,y_train,X_test,y_test,RF_word2vec,v1_word2vec)

print("RF_tfidf_report" , ":" ,RF_tfidf_report)
print("RF_word2vec_report" ,":",RF_word2vec_report)

RF_tfidf_report :               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3433
           1       1.00      1.00      1.00      3433

    accuracy                           1.00      6866
   macro avg       1.00      1.00      1.00      6866
weighted avg       1.00      1.00      1.00      6866

RF_word2vec_report :               precision    recall  f1-score   support

           0       0.96      0.95      0.95      3433
           1       0.95      0.96      0.96      3433

    accuracy                           0.96      6866
   macro avg       0.96      0.96      0.96      6866
weighted avg       0.96      0.96      0.96      6866



In [None]:
SVM_tfidf_report = train_predict_score(X_train,y_train,X_test,y_test,SVM_tfidf,v1_tfidf)
SVM_word2vec_report = train_predict_score(X_train,y_train,X_test,y_test,SVM_word2vec,v1_word2vec)
print("SVM_tfidf_report" , ":" ,SVM_tfidf_report)
print("SVM_word2vec_report" ,":",SVM_word2vec_report)

SVM_tfidf_report :               precision    recall  f1-score   support

           0       0.99      0.99      0.99      3433
           1       0.99      0.99      0.99      3433

    accuracy                           0.99      6866
   macro avg       0.99      0.99      0.99      6866
weighted avg       0.99      0.99      0.99      6866

SVM_word2vec_report :               precision    recall  f1-score   support

           0       0.99      0.97      0.98      3433
           1       0.98      0.99      0.98      3433

    accuracy                           0.98      6866
   macro avg       0.98      0.98      0.98      6866
weighted avg       0.98      0.98      0.98      6866



In [None]:
LSTM_tfidf_report = LSTM_tfidf.build(X_train_vector = np.array(v1_tfidf.get_train_vectors()),y_train = y_train.to_numpy(),X_test_vector=np.array(v1_tfidf.get_test_vectors()),y_test=y_test.to_numpy(),Input_shape=1000)

print("LSTM_tfidf_report" , ":" ,LSTM_tfidf_report)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
LSTM_tfidf_report : 0.6354500651359558


In [None]:
LSTM_word2vec_report = LSTM_word2vec.build(X_train_vector = v1_word2vec.get_train_vectors(),y_train = y_train,X_test_vector=v1_word2vec.get_test_vectors(),y_test=y_test,Input_shape=200)
print("LSTM_word2vec_report" ,":",LSTM_word2vec_report)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
LSTM_word2vec_report : 0.8486746549606323


**In the following lines we test the trained models on a dataset generated using ChatGPT**

**Objective:** Fake news on social media, news sites, or in newspapers is written by humans with the intention to lie, but its authors also try to avoid being too definite, so certain vague language may be common.

ChatGPT and other models, however, can generate fake news with the same language structure as true news, which in theory makes fake news harder to detect, especially in our case, where the models were not trained on similar data.

In [None]:
gpt = pd.read_csv('/content/drive/MyDrive/PR2/Chat_GPT_news.csv')

In [None]:
gpt['preprocessed_text'] = gpt['text'].apply(process_text)

In [None]:
gpt_tfidf_test_vectors = v1_tfidf.transform(gpt['preprocessed_text'])
gpt_word2vec_test_vectors = v1_word2vec.transform(gpt['preprocessed_text'])

In [None]:
gpt_PA_tfidf_report = PA_tfidf.report(gpt_tfidf_test_vectors,gpt['label'])
gpt_PA_word2vec_report = PA_word2vec.report(gpt_word2vec_test_vectors,gpt['label'])

print("gpt_PA_tfidf_report" , ":" ,gpt_PA_tfidf_report)
print("gpt_PA_word2vec_report" ,":",gpt_PA_word2vec_report)

gpt_PA_tfidf_report :               precision    recall  f1-score   support

         0.0       0.52      0.95      0.67       150
         1.0       0.71      0.11      0.20       150

    accuracy                           0.53       300
   macro avg       0.61      0.53      0.43       300
weighted avg       0.61      0.53      0.43       300

gpt_PA_word2vec_report :               precision    recall  f1-score   support

         0.0       0.59      0.68      0.63       150
         1.0       0.62      0.53      0.57       150

    accuracy                           0.60       300
   macro avg       0.61      0.60      0.60       300
weighted avg       0.61      0.60      0.60       300



In [None]:
gpt_RF_tfidf_report = RF_tfidf.report(gpt_tfidf_test_vectors,gpt['label'])
gpt_RF_word2vec_report = RF_word2vec.report(gpt_word2vec_test_vectors,gpt['label'])

print("gpt_RF_tfidf_report" , ":" ,gpt_RF_tfidf_report)
print("gpt_RF_word2vec_report" ,":",gpt_RF_word2vec_report)

gpt_RF_tfidf_report :               precision    recall  f1-score   support

         0.0       0.50      1.00      0.67       150
         1.0       0.00      0.00      0.00       150

    accuracy                           0.50       300
   macro avg       0.25      0.50      0.33       300
weighted avg       0.25      0.50      0.33       300

gpt_RF_word2vec_report :               precision    recall  f1-score   support

         0.0       0.67      0.71      0.69       150
         1.0       0.69      0.65      0.67       150

    accuracy                           0.68       300
   macro avg       0.68      0.68      0.68       300
weighted avg       0.68      0.68      0.68       300





In [None]:
gpt_SVM_tfidf_report = SVM_tfidf.report(gpt_tfidf_test_vectors,gpt['label'])
gpt_SVM_word2vec_report = SVM_word2vec.report(gpt_word2vec_test_vectors,gpt['label'])

print("gpt_SVM_tfidf_report" , ":" ,gpt_SVM_tfidf_report)
print("gpt_SVM_word2vec_report" ,":",gpt_SVM_word2vec_report)

gpt_SVM_tfidf_report :               precision    recall  f1-score   support

         0.0       0.51      0.98      0.67       150
         1.0       0.77      0.07      0.12       150

    accuracy                           0.52       300
   macro avg       0.64      0.52      0.40       300
weighted avg       0.64      0.52      0.40       300

gpt_SVM_word2vec_report :               precision    recall  f1-score   support

         0.0       0.60      0.77      0.68       150
         1.0       0.68      0.49      0.57       150

    accuracy                           0.63       300
   macro avg       0.64      0.63      0.62       300
weighted avg       0.64      0.63      0.62       300



In [None]:
# Load the saved LSTM trained on TF-IDF vectors and score it on the ChatGPT set
model = tf.keras.models.load_model('/content/drive/MyDrive/PR2/models/LSTM_tfidf.h5')
# Named probs so the predict() helper defined earlier is not overwritten
probs = model.predict(np.array(gpt_tfidf_test_vectors))
# Threshold the sigmoid outputs at 0.5 and compare with the true labels
pred_labels = (probs[:, 0] > 0.5).astype(int)
accuracy = np.mean(pred_labels == gpt['label'].to_numpy())
print(accuracy)

0.5


In [None]:
# Load the saved LSTM trained on word2vec vectors and score it on the ChatGPT set
model = tf.keras.models.load_model('/content/drive/MyDrive/PR2/models/LSTM_word2vec.h5')
# Named probs so the predict() helper defined earlier is not overwritten
probs = model.predict(np.array(gpt_word2vec_test_vectors))
# Threshold the sigmoid outputs at 0.5 and compare with the true labels
pred_labels = (probs[:, 0] > 0.5).astype(int)
accuracy = np.mean(pred_labels == gpt['label'].to_numpy())
print(accuracy)

0.5733333333333334
