# Creating the Fake News Detection Model using Machine Learning & Neural Networks

##### In this notebook we compare machine learning (ML) and deep learning (DL) techniques to find which gives the best accuracy for a fake news detector

In [10]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/categories_one.csv
/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/final.csv
/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/with_subject.csv
/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/no_subject.csv


In [11]:
df = pd.read_csv("/kaggle/input/sc1015dsai-final-fce2-team-1-23-24/final.csv")

In [12]:
df.describe()

Unnamed: 0,TruthRating
count,361363.0
mean,1.357978
std,1.900495
min,0.0
25%,0.0
50%,0.0
75%,2.0
max,5.0


#### Removing null values (if any)

In [13]:
df = df.dropna()

In [14]:
df.head()

Unnamed: 0,Text,Subject,TruthRating,Country,clean_text
0,"WHO praises India's Aarogya Setu app, says it ...",COVID-19,5,India,praises india aarogya setu app says helped ide...
1,"In Delhi, Deputy US Secretary of State Stephen...",VIOLENCE,5,India,delhi deputy us secretary state stephen biegun...
2,LAC tensions: China's strategy behind delibera...,TERROR,5,India,lac tensions china strategy behind deliberatel...
3,India has signed 250 documents on Space cooper...,COVID-19,5,India,india signed documents space cooperation count...
4,Tamil Nadu chief minister's mother passes away...,ELECTION,5,India,tamil nadu chief minister mother passes away


#### Creating functions to train and test the ML models (Logistic Regression and Random Forest)

#### Logistic regression is a foundational classification algorithm in machine learning and statistics. It models the probability that an input belongs to each class of the dependent variable (also known as the target or outcome, here TruthRating) as a function of one or more independent variables (also known as predictors, features, or explanatory variables)

#### Random Forest is a robust ensemble learning algorithm widely used in machine learning, especially for classification and regression problems. It combines the predictions of multiple decision trees. Bagging (bootstrap aggregating) is the technique Random Forest uses to give each tree a different training set: each tree is trained on a random sample of the training data drawn with replacement
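
To make the bagging idea concrete, here is a minimal sketch in plain Python. It is not how scikit-learn implements Random Forest internally; the helper names `bootstrap_sample` and `majority_vote` and the ten stand-in rows are assumptions made purely for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bagged training set)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))  # hypothetical stand-in for 10 training rows

# Each tree would be trained on its own bootstrap sample,
# so some rows repeat and others are left out.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    print(f"tree {i} trains on rows {sorted(sample)}")

# For classification, the forest combines the trees' votes.
print(majority_vote([5, 0, 5]))  # prints 5
```

Because every tree sees a slightly different dataset, the trees make different errors, and combining their votes tends to be more stable than relying on any single tree.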

In [15]:
from sklearn.linear_model import LogisticRegression #Importing the ML model
from sklearn.ensemble import RandomForestClassifier #Needed by random_forest() below
from sklearn.model_selection import train_test_split #To split data into training and testing sets
from sklearn.metrics import accuracy_score, f1_score #Accuracy and F1 Score are 2 evaluation Metrics
from sklearn.feature_extraction.text import CountVectorizer #To convert string into vectors for computation
import matplotlib.pyplot as plt #For visualization, if any

def logistic_regression(df):    
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['clean_text']) #getting the vectorized data.
    #Splitting data into testing and training sets
    X_train, X_test, y_train, y_test = train_test_split(X, df['TruthRating'], test_size=0.2, random_state=42)
    
    #Initializing & Fitting the ML Model
    model = LogisticRegression(max_iter = 500) #Max iterations = 500
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test) #Obtaining predictions  
    test_accuracy = accuracy_score(y_test, y_pred) #Testing Accuracy
    print("F1 Score: ",f1_score(y_test, y_pred, average='weighted')) #Getting F1 score, see below
    print(model, test_accuracy)

def random_forest(df):    
    vectorizer = CountVectorizer()
    X_train, X_test, y_train, y_test = train_test_split(df[['Text', 'Subject', 'Country']], df['TruthRating'], test_size=0.2, random_state=42)    
    X_train_v = vectorizer.fit_transform(X_train['Text'])
    X_test_v = vectorizer.transform(X_test['Text'])
    
    rf = RandomForestClassifier(n_estimators=50, random_state=42) #50 trees; more trees usually help up to a point but cost more to train
    rf.fit(X_train_v, y_train)
    
    y_pred = rf.predict(X_test_v)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test Accuracy: {accuracy:.2f}")
    

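### The F1 score is a metric commonly used in binary classification tasks to evaluate the performance of a model. It combines precision and recall into a single value. The F1 score is the harmonic mean of precision and recall. Because TruthRating has more than two classes, the function above passes `average='weighted'`, which computes F1 per class and averages the results weighted by the number of true instances of each class.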

##### Precision measures the proportion of true positive predictions (correctly classified positive instances) out of all positive predictions made by the model. 

##### Recall measures the proportion of true positive predictions out of all actual positive instances in the data. 
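
As a small worked example of these definitions, the sketch below computes precision, recall and F1 by hand for a binary problem. The function name `precision_recall_f1` and the toy labels are assumptions for illustration; they are not taken from the dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # prints 0.667 0.5 0.571
```

Note that the harmonic mean (0.571) sits closer to the lower of the two values than the arithmetic mean (0.583) would, which is why F1 penalises a model that is strong on only one of precision or recall.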


In [16]:
logistic_regression(df)

F1 Score:  0.7623939920103936
LogisticRegression(max_iter=500) 0.777351911555119


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Neural Networks

### Neural networks are a type of machine learning algorithm inspired by the structure and functioning of the human brain. A neural network consists of interconnected nodes (neurons) organized into layers. Each neuron takes inputs, applies a transformation (such as a weighted sum), and produces an output. The key elements of neural networks include:

#### Layers: Neural networks consist of an input layer, one or more hidden layers, and an output layer. Hidden layers can include various transformations, like dense layers (fully connected), convolutional layers, recurrent layers, etc.

#### Weights and Biases: Each connection between neurons has an associated weight, and each neuron has a bias. These parameters are adjusted during training to optimize the network's performance.
#### Activation Functions: After applying the weighted sum and bias, activation functions introduce non-linearity to the network. Common activation functions include ReLU (Rectified Linear Unit), sigmoid, and softmax.

#### Training: Neural networks are typically trained using a process called backpropagation, which involves calculating gradients of a loss function with respect to the weights and updating them to minimize the loss.
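
The elements above (layers, weights and biases, activation functions) can be sketched as a single forward pass in plain Python. This is only an illustration with hand-picked, hypothetical weights; it is not the LSTM model trained in this notebook, and training via backpropagation is not shown.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: weighted sum plus bias for each neuron."""
    return [sum(w * x for w, x in zip(neuron_w, inputs)) + b
            for neuron_w, b in zip(weights, biases)]

def relu(values):
    """ReLU activation: negative values become zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Softmax activation: turn scores into probabilities that sum to 1."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 2 output classes
x = [1.0, -2.0]
hidden_w = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
hidden_b = [0.0, 0.5, 0.0]
out_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
out_b = [0.0, 0.0]

hidden = relu(dense(x, hidden_w, hidden_b))
probs = softmax(dense(hidden, out_w, out_b))

print(hidden)  # prints [1.5, 0.0, 0.0]
print(probs)   # two class probabilities that sum to 1
```

Training would adjust `hidden_w`, `hidden_b`, `out_w` and `out_b` so that the output probabilities move toward the correct labels.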

**With nearly 300,000 training data points, the neural network takes around 30 minutes to complete 7 epochs at optimal capacity. In the run logged below, each epoch took roughly 14 minutes (825 to 853 s).**

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

X_train, X_test, y_train, y_test = train_test_split(df[['Text', 'Subject', 'Country']], df['TruthRating'], test_size=0.2, random_state=42)
max_words = 10000 

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train["Text"])
X_train_seq = tokenizer.texts_to_sequences(X_train["Text"])
X_test_seq = tokenizer.texts_to_sequences(X_test["Text"])


max_len = 100 
X_train_padded = pad_sequences(X_train_seq, maxlen=max_len)
X_test_padded = pad_sequences(X_test_seq, maxlen=max_len)


label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)


embedding_dim = 100
model = Sequential()
model.add(Embedding(max_words, embedding_dim))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(6, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 32 
epochs = 7  # Number of passes over the training data
model.fit(X_train_padded, y_train_encoded, batch_size=batch_size, epochs=epochs, validation_data=(X_test_padded, y_test_encoded))

loss, accuracy = model.evaluate(X_test_padded, y_test_encoded)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

Epoch 1/7
9034/9034 ━━━━━━━━━━━━━━━━━━━━ 846s 93ms/step - accuracy: 0.7589 - loss: 0.7265 - val_accuracy: 0.8248 - val_loss: 0.5152
Epoch 2/7
9034/9034 ━━━━━━━━━━━━━━━━━━━━ 841s 93ms/step - accuracy: 0.8413 - loss: 0.4670 - val_accuracy: 0.8360 - val_loss: 0.4838
Epoch 3/7
9034/9034 ━━━━━━━━━━━━━━━━━━━━ 841s 93ms/step - accuracy: 0.8624 - loss: 0.4019 - val_accuracy: 0.8393 - val_loss: 0.4759
Epoch 4/7
9034/9034 ━━━━━━━━━━━━━━━━━━━━ 853s 92ms/step - accuracy: 0.8807 - loss: 0.3496 - val_accuracy: 0.8381 - val_loss: 0.4808
Epoch 5/7
9034/9034 ━━━━━━━━━━━━━━━━━━━━ 825s 91ms/step - accuracy: 0.8933 - loss: 0.3144 - val_accuracy: 0.8398 - val_loss: 0.4984
Epoch 6/7
9034/9034 ━━━━━━━━━━━━━━━━━━━━ 825s 91ms/step - accuracy: 0.9031 - loss: 0.2831 - val_accuracy: 0.8391 - val_loss: 0.5146
Epoc

## Evaluation

### The Neural Network reaches an accuracy of 0.835, about 6 percentage points higher than the 0.777 of Logistic Regression, so it is the better model for detecting Fake News

But if we factor in computational cost, the NN reached that accuracy only after 7 epochs of training, which is far more expensive than fitting the Logistic Regression model. If accuracy is the main requirement (as it is for Fake News Detection), neural networks are a very good choice.
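
One way to weigh the cost side of this trade-off is to check where validation loss stopped improving. The sketch below uses only the six epochs visible in the training log above (the seventh is cut off) and assumes validation loss is the criterion of interest, which the notebook does not state.

```python
# Values copied from the six epochs visible in the training log
val_loss = [0.5152, 0.4838, 0.4759, 0.4808, 0.4984, 0.5146]
val_acc = [0.8248, 0.8360, 0.8393, 0.8381, 0.8398, 0.8391]
epoch_seconds = [846, 841, 841, 853, 825, 825]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Training time spent up to and including that epoch, in minutes
minutes_to_best = sum(epoch_seconds[:best_epoch]) / 60

print(best_epoch)                   # prints 3
print(val_acc[best_epoch - 1])      # prints 0.8393
print(round(minutes_to_best, 1))    # prints 42.1
```

In this log, validation loss rises after epoch 3 while training accuracy keeps climbing, so stopping earlier might have given similar validation accuracy at a lower cost. Whether that holds in general would need to be checked on a separate validation split.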