### Fake News Classification Using Machine Learning and Deep Learning

This Jupyter Notebook addresses the problem of classifying fake news using various machine learning and deep learning techniques. The notebook is structured into several stages, including data preprocessing, model experimentation, and evaluation of results.

#### Summary:

1. **Introduction**:
   - The notebook begins with an introduction to the problem of fake news classification. It outlines the importance of identifying fake news and provides an overview of the steps involved in the analysis.

2. **Data Loading**:
   - The dataset is loaded from CSV files using pandas. The training and test datasets are read into DataFrames for further processing.
   ```python
   train = pd.read_csv("../input/fake-news/train.csv")
   test = pd.read_csv("../input/fake-news/test.csv")
   train.head()
   ```

3. **Data Exploration**:
   - The first rows of the training and test DataFrames are displayed, the number of rows and columns is printed, and missing values are counted per column.

4. **Data Preprocessing**:
   - Missing values are replaced with a single space, and the `title` and `author` columns are concatenated into a `merged` column.
   - The text is cleaned by removing non-alphabetic characters, lowercasing, and splitting into words. The Natural Language Toolkit (nltk) supplies the English stopword list and the Porter stemmer.

5. **Feature Engineering**:
   - Each cleaned document is converted to a sequence of integer word indices with Keras `one_hot` (vocabulary size 5000) and padded to a fixed length of 20 with `pad_sequences`.
   - The LSTM model learns word embeddings from these indices through an `Embedding` layer.

6. **Model Training**:
   - Six classical models are trained on the padded sequences: Logistic Regression, Multinomial Naive Bayes, Decision Tree, Random Forest, XGBoost, and CatBoost.
   - A deep learning model, an LSTM (Long Short-Term Memory) network, is also trained using TensorFlow.

7. **Hyperparameters**:
   - No systematic hyperparameter search is performed. The classical models use library defaults apart from `max_iter=900` for Logistic Regression and `iterations=200` for CatBoost.
   - The LSTM uses an embedding size of 40, 100 units, a dropout rate of 0.3, 10 epochs, and a batch size of 64.

8. **Model Evaluation**:
   - The trained models are evaluated on a held-out split with `classification_report`, which shows precision, recall, and F1-score.
   - The accuracy of every model is collected into a single results table.

9. **Results and Conclusion**:
   - The accuracies are compared, and the LSTM is reported as the best-performing model.
   - The LSTM is used to predict labels for `test.csv`, and the predictions are written to `Submission.csv`.

### Key Steps and Outputs:

- **Data Loading and Exploration**: Load and explore the dataset to understand its structure.
- **Data Preprocessing**: Fill missing values, merge `title` and `author`, and clean the text.
- **Feature Engineering**: Convert the text to padded sequences of integer word indices.
- **Model Training**: Train six classical machine learning models and an LSTM on the processed data.
- **Model Evaluation**: Evaluate the models with classification reports and an accuracy table.
- **Results and Conclusion**: Select the best model and generate the submission file.

This notebook provides an end-to-end workflow for fake news classification, from data preprocessing to model evaluation and prediction.



In [2]:
# Importing libraries
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout, Embedding, GRU
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import one_hot

# Download the NLTK stopword list used during preprocessing
nltk.download('stopwords')

ModuleNotFoundError: No module named 'nltk'

In [4]:
# Reading data from csv
import pandas as pd
train = pd.read_csv("data/input/fake-news/train.csv")
test  = pd.read_csv("data/input/fake-news/test.csv")
train.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/input/fake-news/train.csv'

In [None]:
test.head()

In [None]:
# Displaying the number of rows and columns in each dataset
print("Training set: {} rows, {} columns.".format(train.shape[0], train.shape[1]))
print("Test set: {} rows, {} columns.".format(test.shape[0], test.shape[1]))

**Checking Null Values**

In [None]:
# Checking the null values in training data.
train.isnull().sum()

In [None]:
# Checking the null values in testing data.
test.isnull().sum()

In [7]:
# Handling NaN values by replacing them with a single space
def handle_nan(train_data, test_data):
    '''Input: Training and test DataFrames that may contain NaN values.
       Output: Copies of both DataFrames with every NaN replaced by " ".
    '''
    train = train_data.fillna(" ")
    test = test_data.fillna(" ")
    return train, test

train,test = handle_nan(train,test)


In [8]:
# Creating a variable "merged" by merging columns "title" and "author"
train["merged"] = train["title"]+" "+train["author"]
test["merged"]  = test["title"]+" "+test["author"]

In [9]:
# Separating independent and dependent features
X = train.drop(columns=['label'])
y = train['label']

In [10]:
# Copying the data and resetting the index so rows can be accessed by position
messages = X.copy()
messages.reset_index(inplace=True)
messages_test = test.copy()
messages_test.reset_index(inplace=True)

# Data Pre-processing
**The following pre-processing steps are applied:**
**1. Every character that is not an English letter is replaced with a space.**
**2. The text is converted to lowercase so that the same word in different cases is treated as one token.**
**3. Each string is split on whitespace into words.**
**4. English stopwords are removed, and the remaining words are reduced to their stems with the Porter stemmer, which shrinks the vocabulary.**
**5. The stemmed words are joined back into a single string and appended to the corpus.**

**Note: The "merged" column (title plus author) is used for the classification task, and the loop inside the function runs over every row of that column.**

In [None]:
# Performing data preprocessing on column 'merged'
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def perform_preprocess(data):
    '''Input: DataFrame with a "merged" text column
       Output: List of cleaned, stemmed strings, one per row
    '''
    corpus = []
    for text in data['merged']:
        review = re.sub('[^a-zA-Z]', ' ', text)   # keep letters only
        review = review.lower().split()           # lowercase and split into words
        review = [ps.stem(word) for word in review if word not in stop_words]
        corpus.append(' '.join(review))
    return corpus

train_corpus = perform_preprocess(messages)
test_corpus  = perform_preprocess(messages_test)
train_corpus[1]

In [None]:
test_corpus[1]
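
To make the effect of the cleaning steps concrete, here is a minimal standard-library sketch applied to one made-up headline. The `STOP_WORDS` set and the `clean_text` helper are illustrative assumptions, not part of the notebook: the notebook uses NLTK's full English stopword list and also applies Porter stemming, which this sketch leaves out.

```python
import re

# Tiny illustrative stopword set (assumption; the notebook uses NLTK's full English list)
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "and", "is", "to"}

def clean_text(text):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "BREAKING: The Senate votes on Bill 42!"
print(clean_text(sample))  # breaking senate votes bill
```

Punctuation and digits disappear, case differences are removed, and common function words are dropped before the text is turned into numbers.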

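**The code below uses Keras `one_hot` to hash every word of each pre-processed document to an integer index between 1 and vocab_size - 1 (vocabulary size = 5000). Despite the function's name, the result is a list of integer indices per document rather than one-hot vectors. This gives the numerical feature matrix the models need.**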

In [13]:
# Converting each document to a list of integer word indices
vocab_size = 5000
one_hot_train = [one_hot(doc, vocab_size) for doc in train_corpus]
one_hot_test  = [one_hot(doc, vocab_size) for doc in test_corpus]

In [None]:
one_hot_test[1]
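
The hashing step can be sketched in plain Python. Keras documents `one_hot` as a wrapper around `hashing_trick` that uses Python's built-in `hash`, and string hashes in Python 3 can differ between processes unless `PYTHONHASHSEED` is fixed, so the indices printed above may change from run to run. The hypothetical `hash_words` helper below uses MD5 instead, purely as an assumption to keep the illustration reproducible; it is not the notebook's code.

```python
import hashlib

def hash_words(text, vocab_size):
    """Map each whitespace-separated word to an index in [1, vocab_size - 1]."""
    indices = []
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is left free so it can be used for padding
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

encoded = hash_words("senat vote bill senat", 5000)
print(encoded)
```

The same word always receives the same index, but two different words can collide on one index, which is the price of not storing an explicit vocabulary.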

**The code below pads the integer sequences to a fixed length of 20 using "pre" padding: zeros are added at the front of shorter sequences, and longer sequences are truncated from the front (the `pad_sequences` default). Every sequence in the dataset then has the same length, which the models require.**

In [None]:
# Padding the sequences to a fixed length
sent_length = 20
embedd_docs_train = pad_sequences(one_hot_train,padding='pre',maxlen=sent_length)
embedd_docs_test  = pad_sequences(one_hot_test,padding='pre',maxlen=sent_length)
print(embedd_docs_train)

In [None]:
print(embedd_docs_test)
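
As a rough illustration of what "pre" padding does, the pure-Python sketch below mimics the behaviour of `pad_sequences` with `padding='pre'` and its default `truncating='pre'` on a single sequence. The `pad_pre` helper is hypothetical and only meant to show the idea.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([7, 3, 9], 5))              # [0, 0, 7, 3, 9]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

With a length of 20, any document longer than 20 words therefore keeps only its last 20 word indices.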

In [17]:
# Converting the padded sequences to arrays
x_final = np.array(embedd_docs_train)
y_final = np.array(y)
x_test_final = np.array(embedd_docs_test)

In [None]:
# Dimensions of prev. array repr.
x_final.shape,y_final.shape,x_test_final.shape

**Dividing the dataset into training, validation and testing data using `train_test_split`. Holding out 10% for testing and then 10% of the remainder for validation gives a split of roughly 81/9/10.**

In [19]:
from sklearn.model_selection import train_test_split
# Hold out 10% for testing, then 10% of the remainder for validation (both stratified by label)
x_train, x_test, y_train, y_test = train_test_split(x_final, y_final, test_size=0.1, random_state=42, stratify=y_final)
X_train, x_valid, Y_train, y_valid = train_test_split(x_train, y_train, test_size=0.1, random_state=42, stratify=y_train)
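
The proportions produced by two consecutive calls with `test_size=0.1` can be checked with a little arithmetic. The `two_stage_split` helper below is a hypothetical illustration, not part of the notebook.

```python
from fractions import Fraction

def two_stage_split(test_frac, valid_frac_of_rest):
    """Return (train, valid, test) fractions of the full dataset."""
    rest = 1 - test_frac
    valid = rest * valid_frac_of_rest
    return rest - valid, valid, test_frac

# Fractions avoid floating-point noise in the arithmetic
as_used = two_stage_split(Fraction(1, 10), Fraction(1, 10))
print([float(f) for f in as_used])   # [0.81, 0.09, 0.1]

# For an exact 80/10/10 split the second test_size would need to be 1/9
exact = two_stage_split(Fraction(1, 10), Fraction(1, 9))
print([float(f) for f in exact])     # [0.8, 0.1, 0.1]
```

The difference between 81/9/10 and 80/10/10 is small, so this mainly matters when the split is reported.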

# Creating Models
**In this phase, several models are trained and then evaluated on the held-out test split. Precision, recall and F1-score for each model are shown with `classification_report`.**

**1. Logistic Regression**

In [None]:
model_1 = LogisticRegression(max_iter=900)
model_1.fit(X_train,Y_train)
pred_1 = model_1.predict(x_test)
cr1    = classification_report(y_test,pred_1)
print(cr1)
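
For readers unfamiliar with the columns of the classification report, the sketch below computes precision, recall and F1-score for the positive class by hand. The labels are made-up toy data and `binary_report` is a hypothetical helper, so the numbers say nothing about the models in this notebook.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 3 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_report(y_true_toy, y_pred_toy))  # (0.75, 0.75, 0.75)
```

Precision is the share of predicted positives that are correct, recall is the share of actual positives that are found, and F1 is their harmonic mean.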

**2. Naive Bayes**

In [None]:
model_2 = MultinomialNB()
model_2.fit(X_train,Y_train)
pred_2 = model_2.predict(x_test)
cr2    = classification_report(y_test,pred_2)
print(cr2)

**3. Decision Trees**

In [None]:
model_3 = DecisionTreeClassifier()
model_3.fit(X_train,Y_train)
pred_3 = model_3.predict(x_test)
cr3    = classification_report(y_test,pred_3)
print(cr3)

**4. Random Forest**

In [None]:
model_4 = RandomForestClassifier()
model_4.fit(X_train,Y_train)
pred_4 = model_4.predict(x_test)
cr4    = classification_report(y_test,pred_4)
print(cr4)

**5. XGBoost**

In [None]:
model_5 = XGBClassifier()
model_5.fit(X_train,Y_train)
pred_5 = model_5.predict(x_test)
cr5    = classification_report(y_test,pred_5)
print(cr5)

**6. CatBoost**

In [None]:
model_6 = CatBoostClassifier(iterations=200)
model_6.fit(X_train,Y_train)
pred_6 = model_6.predict(x_test)
cr6    = classification_report(y_test,pred_6)
print(cr6)

**7. LSTM**

**In this model: 1) The embedding layer maps each word index to a 40-dimensional feature vector. 2) A single LSTM layer with 100 units is used. 3) A Dense layer with 1 neuron and a sigmoid activation produces the output, since this is a binary classification problem. 4) Dropout is applied to reduce overfitting, and the Adam optimizer minimizes the binary cross-entropy loss.**

In [None]:
# Creating the LSTM Model for prediction
embedding_feature_vector = 40
model = Sequential()
model.add(Embedding(vocab_size,embedding_feature_vector,input_length=sent_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
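
The parameter counts reported by the model summary can be cross-checked by hand. The sketch below assumes the standard Keras layer definitions with bias terms (an LSTM has four gates, each with input weights, recurrent weights and a bias); the helper functions are hypothetical and the Dropout layers contribute no parameters.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total = embedding_params(5000, 40) + lstm_params(40, 100) + dense_params(100, 1)
print(embedding_params(5000, 40), lstm_params(40, 100), dense_params(100, 1), total)
# 200000 56400 101 256501
```

Most of the parameters sit in the embedding table, so the vocabulary size is the main driver of model size here.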

In [None]:
# Training the model
model.fit(X_train,Y_train,validation_data=(x_valid,y_valid),epochs=10,batch_size=64)

In [None]:
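# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
predictions = (model.predict(x_test) > 0.5).astype("int32").ravel()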
cr = classification_report(y_test,predictions)
print(cr)

# Evaluation of Models

**Tabulating the results of various implemented models.**

In [None]:
score_1 = accuracy_score(y_test,pred_1)
score_2 = accuracy_score(y_test,pred_2)
score_3 = accuracy_score(y_test,pred_3)
score_4 = accuracy_score(y_test,pred_4)
score_5 = accuracy_score(y_test,pred_5)
score_6 = accuracy_score(y_test,pred_6)
score_7 = accuracy_score(y_test,predictions)
results = pd.DataFrame([["Logistic Regression",score_1],["Naive Bayes",score_2],["Decision Tree",score_3],
                       ["Random Forest",score_4],["XGBOOST",score_5],["CatBoost",score_6],["LSTM",score_7]],columns=["Model","Accuracy"])
results

**Discussion: From the results above, the LSTM model gives the highest accuracy among the models tested. It is therefore selected as the final model for making predictions on the final testing data.**

**Predictions on Testing Data**

In [None]:
# Making Predictions on test data
predictions_test = pd.DataFrame((model.predict(x_test_final) > 0.5).astype("int32"))
test_id = pd.DataFrame(test["id"])
submission = pd.concat([test_id,predictions_test],axis=1)
submission.columns = ["id","label"]
submission.to_csv("Submission.csv",index=False)

In [None]:
submission.head()