#Fake News Classifier Using LSTM
Dataset: https://www.kaggle.com/c/fake-news/data#

###Loading Data From Kaggle

In [None]:
! pip install -q kaggle
from google.colab import files
files.upload()
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle (1).json
mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [None]:
!kaggle competitions download -c fake-news

Downloading train.csv.zip to /content
 94% 35.0M/37.0M [00:00<00:00, 73.6MB/s]
100% 37.0M/37.0M [00:00<00:00, 83.5MB/s]
Downloading submit.csv to /content
  0% 0.00/40.6k [00:00<?, ?B/s]
100% 40.6k/40.6k [00:00<00:00, 42.0MB/s]
Downloading test.csv.zip to /content
 96% 9.00M/9.42M [00:00<00:00, 91.4MB/s]
100% 9.42M/9.42M [00:00<00:00, 86.9MB/s]


In [None]:
!unzip train.csv.zip
!unzip test.csv.zip

Archive:  train.csv.zip
  inflating: train.csv               
Archive:  test.csv.zip
  inflating: test.csv                


###Importing Data 

In [2]:
import pandas as pd
data=pd.read_csv("fake-news/train.csv")
data.head(10)

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
5,5,Jackie Mason: Hollywood Would Love Trump if He...,Daniel Nussbaum,"In these trying times, Jackie Mason is the Voi...",0
6,6,Life: Life Of Luxury: Elton John’s 6 Favorite ...,,Ever wonder how Britain’s most iconic pop pian...,1
7,7,Benoît Hamon Wins French Socialist Party’s Pre...,Alissa J. Rubin,"PARIS — France chose an idealistic, traditi...",0
8,8,Excerpts From a Draft Script for Donald Trump’...,,Donald J. Trump is scheduled to make a highly ...,0
9,9,"A Back-Channel Plan for Ukraine and Russia, Co...",Megan Twohey and Scott Shane,A week before Michael T. Flynn resigned as nat...,0


###Removing NaN Values from the dataset

In [3]:
data=data.dropna()
data.reset_index(inplace=True)

###Extracting Dependent and Independent variables from data

In [4]:
X = data.drop("label",axis=1)
y = data["label"]

###Checking The Shape of X and y

In [None]:
print(X.shape)
print(y.shape)

(18285, 5)
(18285,)


###Importing the required Libraries

In [5]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential


### 1. `import tensorflow as tf`

- **TensorFlow**: It's an open-source machine learning library developed by Google. It's widely used for various kinds of ML tasks, including deep learning models like neural networks. In NLP, TensorFlow is often used for building and training models that work with text data.

### 2. `from tensorflow.keras.layers import Embedding`

- **Embedding Layer**: 
  - **Usage**: In NLP, this layer is used to convert numerical representations of words (like word indices) into dense vectors of fixed size. This is more efficient than one-hot encoded vectors as it captures more information (like semantic relationships) in fewer dimensions.
  - **ML Project Context**: In text classification or sentiment analysis, an Embedding layer is often the first layer of the neural network, processing sequences of word indices and turning them into vectors that the model can learn from.
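
To make the idea concrete, here is a minimal pure-Python sketch of what an embedding lookup does: each integer index selects one row of a table of vectors. The tiny sizes, the random initial values, and the `embed` helper are illustrative assumptions; in Keras the table is a weight matrix that is updated during training.

```python
import random

rng = random.Random(42)

# Tiny hypothetical sizes; the notebook itself uses 5000 indices and 40 dimensions
vocab_size, embedding_dim = 10, 4

# One row of small random values per word index
table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def embed(indices):
    """Replace every word index with its row of the embedding table."""
    return [table[i] for i in indices]

vectors = embed([0, 3, 3, 7])
print(len(vectors), len(vectors[0]))
```

The same index always yields the same vector, which is why repeated words get identical representations at this stage.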

### 3. `from tensorflow.keras.layers import LSTM`

- **LSTM (Long Short-Term Memory) Layer**: 
  - **Usage**: LSTM is a type of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies in sequence data. It's particularly useful for NLP tasks where the context or order of words is important, like in language modeling or text generation.
  - **ML Project Context**: An LSTM layer can be used after an Embedding layer to process the word vectors and capture sequential dependencies between words in text data.

### 4. `from tensorflow.keras.layers import Dense`

- **Dense Layer**: 
  - **Usage**: A dense layer is a fully connected neural network layer where each input node is connected to each output node. It’s used for outputting predictions for the task at hand (like a classification or regression output).
  - **ML Project Context**: In an NLP model, a Dense layer is often used after LSTM or other types of layers to interpret the features extracted from the text and make predictions.

### 5. `from tensorflow.keras.preprocessing.text import one_hot`

- **One-Hot Encoding for Text**:
  - **Usage**: Despite its name, Keras's `one_hot` does not produce binary vectors. It splits the text into words and hashes each word to an integer index in the range `[1, n)`, where `n` is the vocabulary size passed in. Because it relies on hashing, two different words can be assigned the same index.
  - **ML Project Context**: Before feeding text into an Embedding layer, it must be converted into sequences of integer indices. `one_hot` is a quick way to do this without building an explicit vocabulary; a larger `n` lowers the chance of collisions.
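
The hashing behaviour behind `one_hot` can be sketched in plain Python. This is an illustration of the idea, not Keras's exact implementation: the helper names `hash_word` and `encode` are hypothetical, and MD5 is used here only so the indices are reproducible across runs.

```python
import hashlib

def hash_word(word, n):
    """Map a word to a stable integer index in the range [1, n)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(text, n):
    """Hash every whitespace-separated word of `text` to an index."""
    return [hash_word(word, n) for word in text.lower().split()]

indices = encode("truth might get fire", 5000)
print(indices)

# With only 3 available indices, 4 distinct words must collide somewhere
tiny = encode("a b c d", 4)
print(len(set(tiny)) < 4)
```

Index 0 is never produced, which leaves it free to act as the padding value later on.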

### 6. `from tensorflow.keras.preprocessing.sequence import pad_sequences`

- **Padding Sequences**:
  - **Usage**: This function is used to ensure that all sequences in a list have the same length by padding them with zeros (or truncating them) to a specified length. This uniformity is required for batch processing in neural networks.
  - **ML Project Context**: When working with text data, different texts might have different lengths. Padding is essential to create consistent input sizes for training neural network models.
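
As a rough sketch of what padding does, the hypothetical helper `pad_pre` below mimics `pad_sequences` with `padding='pre'` and front truncation on plain Python lists. It is meant only to show the behaviour, not to replace the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; drop leading items if it is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Every returned row has exactly `maxlen` entries, so the result can be stacked into a rectangular array.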

### 7. `from tensorflow.keras.models import Sequential`

- **Sequential Model**:
  - **Usage**: The Sequential model in Keras is a linear stack of layers. It's the simplest kind of Keras model for neural networks, where you can just add layers to the model in the order that they should be executed.
  - **ML Project Context**: In NLP, a Sequential model might start with an Embedding layer, followed by LSTM layers or other types of layers, and end with a Dense layer for output. It's a straightforward way to build models layer by layer.

In a typical NLP project, these components work together to process text data, extract features, and make predictions or analyze text. The specific architecture and choice of layers depend on the nature of the task (e.g., classification, sentiment analysis, language modeling) and the complexity of the dataset.

###Preprocessing The Text 



In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\E\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
messages=X.copy()
ps = PorterStemmer()
corpus = []
for i in range(len(messages)):
  review = re.sub("[^a-zA-Z]"," ",messages['title'][i])
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if not word in stopwords.words('english')]  # (word not in stopwords) and stem(word)
  review = ' '.join(review)
  corpus.append(review)

The cell above performs text preprocessing, a standard first step in Natural Language Processing (NLP), using the Natural Language Toolkit (NLTK). Let's go through the code to understand and potentially optimize it.

1. **Import Statements**: 
   - `import re` imports the regular expressions library, useful for text cleaning.
   - `import nltk` loads NLTK, and `nltk.download("stopwords")` fetches the stopword lists used in the filtering step.

2. **Preprocessing Steps**:
   - **Copying Data**: You're creating a copy of `X` into `messages`. It's good practice to avoid modifying the original dataset directly.
   - **PorterStemmer Initialization**: `ps = PorterStemmer()` initializes the Porter Stemmer, a popular stemming algorithm.

3. **Loop for Text Preprocessing**:
   - You loop through each message and perform several steps:
     - **Regular Expression Cleaning**: `re.sub("[^a-zA-Z]"," ",messages['title'][i])` replaces every character that is not an ASCII letter with a space. This is a common way to remove numbers, punctuation, and special characters; note that it also strips accented letters.
     - **Lowercasing**: Converting the text to lower case to maintain uniformity.
     - **Tokenization**: Splitting the text into individual words (`review.split()`).
     - **Stopword Removal and Stemming**: Removing common English stopwords and applying stemming. This step reduces each word to its root form, removing inflections.

4. **Creating the Corpus**:
   - Finally, the processed review is joined back into a string and added to the `corpus` list.
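
The cleaning steps above can be traced on a single title. This sketch uses only the standard library: the small `STOP_WORDS` set is a hypothetical stand-in for NLTK's English stopword list, and stemming is left out because it needs NLTK.

```python
import re

# Hypothetical, hand-picked subset standing in for NLTK's stopword list
STOP_WORDS = {"the", "you", "might", "get", "why"}

def clean_title(title):
    """Keep letters only, lowercase, tokenize, and drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", title)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

print(clean_title("Why the Truth Might Get You Fired"))
```

With the Porter stemmer applied afterwards, each remaining token would additionally be reduced to its stem.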

### Suggestions for Improvement:

- **Efficiency in Stopword Removal**: `stopwords.words('english')` is called inside the list comprehension, so the stopword list is rebuilt and scanned linearly for every word. Build a set of stopwords once, outside the loop; set lookups are much faster.
  
  ```python
  stop_words = set(stopwords.words('english'))
  ```

- **Handling Empty Strings**: After preprocessing, some strings might become empty. It's good to add a condition to handle or filter them out.

- **Parameterizing Function**: Consider encapsulating this logic into a function for reusability. This function can take parameters like the dataset and column names, making it more versatile.

- **Error Handling**: Your code assumes that `X` has a column named 'title'. In practice, it’s good to add checks and error handling for cases where 'title' might not exist.

- **Commenting**: Adding comments to your code can make it more readable and maintainable.

Here's a revised version of your code with these suggestions:



In [16]:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords")

def preprocess_text(data, column_name):
    """Clean, stopword-filter, and stem every entry of `column_name`."""
    # Validate the column once, before looping over the rows
    if column_name not in data.columns:
        raise ValueError(f"Column '{column_name}' not found in the dataset")

    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # set gives fast membership tests
    corpus = []

    for text in data[column_name]:
        review = re.sub("[^a-zA-Z]", " ", text)  # keep ASCII letters only
        review = review.lower().split()
        review = [ps.stem(word) for word in review if word not in stop_words]
        review = ' '.join(review)
        corpus.append(review if review else 'empty')  # Add 'empty' for empty reviews

    return corpus

corpus = preprocess_text(X, 'title')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\E\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### One Hot Encoding

This step converts the preprocessed text into numerical form using Keras's `one_hot` function. Machine learning models operate on numbers, so this conversion is required before the text can be fed to the network. Let's discuss this step in detail.

1. **One-Hot Encoding**: 
   - Despite its name, the `one_hot` function represents words as integer indices rather than binary vectors. Each word in the corpus is hashed to an integer in the range `[1, vocab_size)`, and distinct words can collide on the same index.
   - The `vocab_size` chosen is 5000, so every word is mapped to one of 4999 possible indices. Words are never dropped; if the corpus contains more distinct words than that, some of them share an index.

2. **Applying One-Hot Encoding**:
   - A list comprehension applies the `one_hot` function to each preprocessed text (each 'review') in `corpus`.
   - This results in `onehot_repr`, a list where each element is a list of integers, one index per word of the corresponding text in the corpus.

### Points to Consider:

- **Vocabulary Size**: The choice of `vocab_size` is crucial. A value that is too small for a large dataset leads to many hash collisions, where unrelated words share an index. Conversely, a very large value enlarges the embedding matrix and increases memory usage.

- **Out-of-Vocabulary Words**: Because `one_hot` hashes words, it has no out-of-vocabulary case: unseen words still receive an index. Vocabulary-based tokenizers, by contrast, typically map unknown words to a special token like `<UNK>` (unknown).

- **Importing Necessary Functions**: Ensure that the `one_hot` function is imported in your script. In this notebook it comes from `tensorflow.keras.preprocessing.text`.

- **Further Steps**: After one-hot encoding, consider converting these sequences into equal-length vectors using padding, especially if you plan to use models like CNNs or RNNs for the NLP task.

Here's how to proceed, assuming you are using Keras:

```python
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Number of hash buckets; `corpus` is the list of preprocessed titles built earlier
vocab_size = 5000

# One-Hot Encoding: hash every word to an integer index in [1, vocab_size)
onehot_repr = [one_hot(words, vocab_size) for words in corpus]

# Padding Sequences (Optional, based on your model requirement)
max_sentence_length = 20  # Example length, choose according to your data
padded_corpus = pad_sequences(onehot_repr, maxlen=max_sentence_length, padding='post')
```

This code first applies one-hot encoding and then pads each sequence to ensure that they all have the same length, which is a common requirement for neural network inputs.

Remember, each preprocessing step affects the performance of the final NLP model.

In [17]:
vocab_size=5000
onehot_repr=[one_hot(words,vocab_size)for words in corpus] 


###Applying Pad Sequences To Make Sentence Length Equal

In [18]:
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0 ... 4718   47 4454]
 [   0    0    0 ...  342 2472  485]
 [   0    0    0 ... 4768 3973  363]
 ...
 [   0    0    0 ... 3498 2113 2272]
 [   0    0    0 ... 2703 2396  606]
 [   0    0    0 ... 4335 2402 4220]]


This brings us to the deep learning part of the NLP project: building a neural network model using Keras for a text classification task. Let's break down the steps and the code:

1. **Padding Sequences**: 
   - `pad_sequences` ensures all sequences in `embedded_docs` have the same length (`sent_length=20`). Neural networks are trained on batches of uniform-size inputs, so this is required.
   - Padding with 'pre' means any sequences shorter than 20 tokens will be prepended with zeros. Sequences longer than 20 tokens are truncated from the front by default.

2. **Building the Neural Network**:
   - **Embedding Layer**: The `Embedding` layer is set up with `vocab_size` as the input dimension and `embedding_vector_features=40`. This layer learns a 40-dimensional vector for each of the 5000 word indices.
   - **LSTM Layer**: An LSTM (Long Short-Term Memory) layer with 100 units follows. LSTM is effective for sequence data like text as it can capture long-term dependencies.
   - **Dense Layer**: A Dense layer with a sigmoid activation function is used, indicating this model is for binary classification.

3. **Compiling the Model**:
   - The model uses `binary_crossentropy` as the loss function and `adam` as the optimizer, which are standard choices for binary classification tasks.
   - The metric chosen is 'accuracy'.

4. **Model Summary**:
   - `model.summary()` will print a summary representation of your model, showing each layer, its type, output shape, and number of parameters.

### Things to Consider:

- **Input Length Consistency**: Ensure that the input length specified in the `Embedding` layer (`input_length=sent_length`) matches the length of the sequences in `embedded_docs`.

- **Model Architecture Tuning**: The architecture (number of LSTM units, number of embedding features, etc.) can greatly influence the model's performance. Consider experimenting with these parameters.

- **Overfitting Check**: With deep learning models, especially LSTMs, there's a risk of overfitting. Monitor the training and validation accuracy, and consider using techniques like dropout or regularization if necessary.

- **Training the Model**: After compiling the model, it must be trained using the `model.fit()` method, supplying training data and labels, and specifying the number of epochs and batch size.



###Creating Model for Train Set

In [None]:
'''embedding_vector_features=40
model=Sequential()
model.add(Embedding(vocab_size,embedding_vector_features,input_length=sent_length))
model.add(LSTM(100))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())'''

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 lstm (LSTM)                 (None, 100)               56400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
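
The parameter counts in the summary above can be reproduced by hand. The short calculation below is a worked check using the standard formulas for each layer type; the variable names are illustrative.

```python
vocab_size = 5000      # rows of the embedding table
embedding_dim = 40     # embedding_vector_features
lstm_units = 100

# Embedding: one vector of length embedding_dim per word index
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 200000 56400 101 256501
```

Most of the parameters sit in the embedding table, so `vocab_size` and the embedding dimension dominate the model size.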


In [19]:
# Here's a slight enhancement to include a Dropout layer for regularization:


from tensorflow.keras.layers import Dropout  # same package as the other layers

embedding_vector_features=40
model=Sequential()
model.add(Embedding(vocab_size,embedding_vector_features,input_length=sent_length))
model.add(LSTM(100))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

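# Note: the model already ends in a sigmoid Dense layer at this point, so the two
# layers below are appended after it (Embedding -> LSTM -> Dense -> Dropout -> Dense).
model.add(Dropout(0.3))  # Dropout layer for regularization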
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Model Summary
print(model.summary())


# The Dropout layer will help in preventing overfitting by randomly setting a fraction of input units to 0 at each update during training time.



Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20, 40)            200000    
                                                                 
 lstm_1 (LSTM)               (None, 100)               56400     
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 256,501
Trainable params: 256,501
Non-trainable params: 0
_________________________________________________________________
None
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20, 40)            200000    
                                                                 
 lstm_1 (LSTM)       
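
To illustrate what a dropout layer does during training, here is a minimal pure-Python sketch. The `dropout` helper is hypothetical: it zeroes each value with probability `rate` and scales the survivors by `1 / (1 - rate)`, which is also how Keras's `Dropout` keeps the expected sum of activations unchanged. At inference time dropout is disabled and values pass through untouched.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale
            for value in values]

rng = random.Random(0)  # fixed seed so the mask is reproducible
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)
```

Each element of the result is either `0.0` or the original value multiplied by `1 / 0.7`.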

In [20]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [21]:
print("Length of X_final:", len(X_final))
print("Length of y_final:", len(y_final))


Length of X_final: 18285
Length of y_final: 18285


In [22]:
y_final = y_final[:len(X_final)]  # Truncate y_final to match X_final's length

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)
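
As a quick sanity check on the split sizes, the arithmetic below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample. That assumption agrees with the 6035 predictions counted in the confusion matrix later in this notebook.

```python
import math

n_samples = 18285   # rows left after dropping NaN values
test_size = 0.33

n_test = math.ceil(test_size * n_samples)  # test rows, rounded up
n_train = n_samples - n_test               # everything else is used for training

print(n_train, n_test)
# prints: 12250 6035
```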

In [24]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1b9b6303dc0>

###Performance Metrics And Accuracy



In [25]:
y_pred = model.predict(X_test)
y_pred = np.round(y_pred).astype(int)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)



array([[3091,  328],
       [ 208, 2408]], dtype=int64)

In [26]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)


0.9111847555923778
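
Accuracy alone hides how the errors are distributed, so it can help to derive further metrics from the confusion matrix above. This sketch treats class 1 as the positive class and relies on scikit-learn's layout, where rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix reported above: rows = true label, columns = predicted label
cm = [[3091, 328],
      [208, 2408]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted class-1 items that are correct
recall = tp / (tp + fn)      # share of true class-1 items that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The computed accuracy matches the value returned by `accuracy_score`.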