# Introduction
Hello everyone, how's it going? Today I am going to predict whether a news article is real or fake. To do this, I will train a deep learning model.

Before starting, let's take a look at the contents.

# Notebook Content
1. Importing Libraries and The Data
1. Data Overview
1. Data Preprocessing
    1. Other Preprocessings
    1. Natural Language Processing
1. Building Model Using Pytorch
1. Fitting Model Using Pytorch
1. Evaluating Results
1. What Did We Do?
1. Conclusion

# Importing Libraries and The Data
In this section I am going to import libraries and the data that I need.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load


"""
DATA MANIPULATION
"""
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

"""
NATURAL LANGUAGE PROCESSING
"""
import re 
import nltk 
from sklearn.feature_extraction.text import CountVectorizer

"""
PYTORCH
"""

import torch
import torch.nn as nn


"""
VISUALIZATION TOOLS
"""

import matplotlib.pyplot as plt
import seaborn as sns

"""
TRAIN TEST SPLIT
"""
from sklearn.model_selection import train_test_split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
true_data = pd.read_csv('/kaggle/input/fake-and-real-news-dataset/True.csv')
fake_data = pd.read_csv('/kaggle/input/fake-and-real-news-dataset/Fake.csv')
true_data.head()

* Our data is not split into train and test sets. It is split into true and fake news, so we will first concatenate the two files and then split the result into train and test sets.

In [None]:
true_data.info()

In [None]:
fake_data.info()

In [None]:
# Adding labels 
true_data["label"] = np.ones(len(true_data),dtype=int)
fake_data["label"] = np.zeros(len(fake_data),dtype=int)

true_data.head()

* Now we will concatenate and shuffle them.

In [None]:
data = pd.concat((true_data,fake_data),axis=0)
print(data.info())

In [None]:
data = data.sample(frac=1)
data.head(10)

* Our data is ready, let's examine it!
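
The concatenate-and-shuffle step above can be sketched on a toy frame. This is only an illustration: the toy frames are made up, and the fixed `random_state` (42 is an arbitrary choice) and the `reset_index(drop=True)` call are assumptions added here to make the shuffle reproducible and to give the rows a clean 0..n-1 index. The notebook itself shuffles without a seed.

```python
import pandas as pd

# Toy stand-ins for the real and fake frames (hypothetical data).
toy_true = pd.DataFrame({"title": ["a", "b"], "label": [1, 1]})
toy_fake = pd.DataFrame({"title": ["c", "d", "e"], "label": [0, 0, 0]})

# Stack the two frames, then shuffle all rows.
toy = pd.concat((toy_true, toy_fake), axis=0)

# random_state makes the order repeatable; reset_index drops the old,
# duplicated row indices that survive the concat.
toy_shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
print(toy_shuffled)
```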

# Data Overview

In this section we will get to know the data. We will check the following:

* Is the data unbalanced?
* How many classes are there in the subject feature?


## Is The Data Unbalanced?

In [None]:
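# Pass the column as x explicitly; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=data["label"])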
plt.show()

* There are slightly more fake articles than real ones, but the difference is small enough that we can treat the data as balanced.
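
Besides reading the plot, the balance can be checked numerically. The snippet below is a small sketch on a made-up label column, not the real counts: `value_counts(normalize=True)` gives the share of each class, and the ratio of the largest to the smallest class is one simple way to quantify imbalance.

```python
import pandas as pd

# Hypothetical label column: 1 = real, 0 = fake.
labels = pd.Series([0, 0, 0, 1, 1], name="label")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()  # 1.0 means perfectly balanced

print(shares)
print(imbalance_ratio)
```

On the real data the same three lines can be run on `data["label"]`.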

## How Many Classes Are in the Subject Feature?


In [None]:
data["subject"].value_counts()

* There are 8 types of subjects in the dataset.
* The most common subject is politicsNews.
* We should encode this feature.
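
As a quick illustration of what encoding means here, one-hot encoding replaces the subject column with one 0/1 column per subject. The sketch below uses a toy frame with two made-up rows per subject; passing `dtype=int` is an assumption made here so the new columns are integers rather than booleans in recent pandas versions.

```python
import pandas as pd

# Toy frame with a categorical column (hypothetical rows).
toy = pd.DataFrame({
    "subject": ["politicsNews", "worldnews", "politicsNews"],
    "label": [1, 1, 0],
})

# Each distinct subject becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["subject"], dtype=int)
print(encoded)
```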

# Data Preprocessing

In this section I will prepare the dataset for deep learning. I will follow two main steps:

1. Other Preprocessings
1. Natural Language Processing

## Other Preprocessings

This step has two parts:

1. Subject Feature - One Hot Encoding
1. Dropping Date

In [None]:
data = pd.get_dummies(data,columns=["subject"])
data.head()

In [None]:
data = data.drop("date",axis=1)
data.info()

## Natural Language Processing
Finally we have come to the most important step, natural language processing. In this step, I will process the text and title features. I will start with the text.
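
To see what the cleaning does to a single string, here is a simplified sketch. It covers only the regex cleaning and lowercasing; whitespace splitting stands in for `nltk.word_tokenize`, and lemmatization is left out so that the example runs without any NLTK data. The helper name `basic_clean` and the sample headline are made up for illustration.

```python
import re

# Same idea as the notebook's pattern: anything that is not a letter.
NON_LETTERS = re.compile(r"[^a-zA-Z]")

def basic_clean(text):
    """Replace non-letters with spaces, lowercase, and collapse whitespace."""
    text = NON_LETTERS.sub(" ", text)
    tokens = text.lower().split()  # simple stand-in for a real tokenizer
    return " ".join(tokens)

print(basic_clean("U.S. Senate passes $1.5 trillion bill, 51-49!"))
# -> u s senate passes trillion bill
```

Note that digits and punctuation disappear entirely, so "U.S." turns into the two tokens "u" and "s".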

In [None]:
new_text = []
pattern = "[^a-zA-Z]"
lemma = nltk.WordNetLemmatizer()

for txt in data.text:
    
    txt = re.sub(pattern," ",txt) # Cleaning
    txt = txt.lower() # Lowering
    txt = nltk.word_tokenize(txt) # Tokenizing
    txt = [lemma.lemmatize(word) for word in txt] # Lemmatizing
    txt = " ".join(txt)
    new_text.append(txt)
    
    
new_text[0]
    

In [None]:
new_title = []
for txt in data.title:
    
    txt = re.sub(pattern," ",txt) # Cleaning
    txt = txt.lower() # Lowering
    txt = nltk.word_tokenize(txt) # Tokenizing
    txt = [lemma.lemmatize(word) for word in txt] # Lemmatizing
    txt = " ".join(txt)
    new_title.append(txt)
new_title[0]


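* Now I am going to create bag-of-words count matrices. `CountVectorizer` returns sparse matrices, which I convert to dense arrays with `toarray()`.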
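
A toy example may help to show what `CountVectorizer` produces. The three documents below are made up; `max_features=3` keeps only the three most frequent words after English stop words are removed, in the same way the notebook keeps 1000 title words and 4000 text words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus.
docs = [
    "senate passes budget",
    "president signs budget",
    "the budget vote is a hoax",
]

vec = CountVectorizer(stop_words="english", max_features=3)
counts = vec.fit_transform(docs)  # scipy sparse matrix, one row per document
dense = counts.toarray()          # dense numpy array of word counts

print(vec.vocabulary_)  # maps each kept word to its column index
print(dense.shape)
```

Each row is one document and each column is one vocabulary word, so the word order inside a document is lost.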


In [None]:
vectorizer_title = CountVectorizer(stop_words="english",max_features=1000)
vectorizer_text = CountVectorizer(stop_words="english",max_features=4000)

title_matrix = vectorizer_title.fit_transform(new_title).toarray() 
text_matrix = vectorizer_text.fit_transform(new_text).toarray()

print("Finished")

* And now let's concatenate everything.

In [None]:
data.head()

In [None]:
data.drop(["title","text"],axis=1,inplace=True)
data.info()

In [None]:
print(data.shape)
print(title_matrix.shape)
print(text_matrix.shape)

In [None]:
# Creating Y
y = data.label
# Creating X
x = np.concatenate((np.array(data.drop("label",axis=1)),title_matrix,text_matrix),axis=1)



* Let's check the shapes. After that I am going to split X and Y into train and test sets.
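
The number of columns in X is simply the sum of the widths of its parts. The sketch below uses zero-filled placeholder arrays and assumes 8 subject columns, 1000 title columns, and 4000 text columns, which holds only if both vectorizers actually find at least `max_features` distinct words.

```python
import numpy as np

n_rows = 4  # arbitrary number of toy rows

# Zero-filled placeholders with the assumed widths of each feature group.
subject_dummies = np.zeros((n_rows, 8), dtype=int)
title_counts = np.zeros((n_rows, 1000), dtype=int)
text_counts = np.zeros((n_rows, 4000), dtype=int)

# axis=1 glues the groups side by side, so the widths add up.
features = np.concatenate((subject_dummies, title_counts, text_counts), axis=1)
print(features.shape)  # (4, 5008)
```

The resulting width of 5008 is the input size that the first linear layer of the model has to match.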

In [None]:
print(x.shape)
print(y.shape)

In [None]:
from sklearn.model_selection import train_test_split

# Train Test Split
X_train,X_test,Y_train,Y_test = train_test_split(x,np.array(y),test_size=0.25,random_state=1)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)


# Building Model Using Pytorch

Our data is ready, so now we are going to build our ANN model using PyTorch. We will use ReLU as the activation function, Adam as the optimizer, and cross-entropy as the loss. Let's start.
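
For comparison, the same kind of stack of linear layers and ReLU activations can be written more compactly with `nn.Sequential`. This is only a sketch of an alternative, not the model used in this notebook: `build_mlp` is a hypothetical helper, and its default layer widths merely mirror the ones chosen below. Taking the input width as a parameter avoids hard-coding it.

```python
import torch
import torch.nn as nn

def build_mlp(in_features, hidden=(2000, 500, 100, 20), n_classes=2):
    """Build Linear -> ReLU blocks followed by a final Linear output layer."""
    layers = []
    prev = in_features
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, n_classes))  # raw logits, no activation
    return nn.Sequential(*layers)

# Small sizes here just to keep the demonstration fast.
sketch_model = build_mlp(in_features=12, hidden=(8, 4))
logits = sketch_model(torch.zeros(3, 12))
print(logits.shape)
```

The output has one row per sample and one column per class, which is the shape `nn.CrossEntropyLoss` expects.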

In [None]:
class ANN(nn.Module):
    
    def __init__(self):
        
        super(ANN,self).__init__() # Initializing the nn.Module base class
        
        self.linear1 = nn.Linear(5008,2000) # IN 5008 OUT 2000
        self.relu1 = nn.ReLU() # Actfunc 1
        
        self.linear2 = nn.Linear(2000,500) # IN 2000 OUT 500
        self.relu2 = nn.ReLU()
        
        self.linear3 = nn.Linear(500,100) # IN 500 OUT 100
        self.relu3 = nn.ReLU()
        
        self.linear4 = nn.Linear(100,20) # IN 100 OUT 20
        self.relu4 = nn.ReLU()
        
        self.linear5 = nn.Linear(20,2) # IN 20 OUT 2 | OUTPUT 
        
    
    def forward(self,x):
        
        out = self.linear1(x) # Input Layer 
        out = self.relu1(out)
        
        out = self.linear2(out) # Hidden Layer 1 
        out = self.relu2(out)
        
        out = self.linear3(out) # Hidden Layer 2 
        out = self.relu3(out)
        
        out = self.linear4(out) # Hidden Layer 3 
        out = self.relu4(out)

        
        out = self.linear5(out) # Output Layer
        
        return out
    

model = ANN()
optimizer = torch.optim.Adam(model.parameters(),lr=0.01)
error = nn.CrossEntropyLoss()

* Our model is built. Now let's train it on our data.

# Fitting Model Using Pytorch

In this stage, I will fit the model to the prepared data. I will use a for loop that performs one full-batch update per epoch.
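
If the full training set does not fit comfortably in memory, a common alternative is mini-batch training with `DataLoader`. The sketch below shows that variant on random toy data; it is not what this notebook does, and the toy model, batch size, and seed are all arbitrary assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random toy data: the label depends on the sign of the first feature.
features = torch.randn(64, 10)
targets = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

toy_model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(5):
    running = 0.0
    for batch_x, batch_y in loader:
        toy_optimizer.zero_grad()                      # clear old gradients
        loss = criterion(toy_model(batch_x), batch_y)  # forward pass + loss
        loss.backward()                                # backpropagation
        toy_optimizer.step()                           # parameter update
        running += loss.item() * len(batch_x)
    epoch_losses.append(running / len(features))       # mean loss per sample

print(epoch_losses)
```

Each epoch now performs several parameter updates instead of one, at the cost of noisier gradients.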

In [None]:
# Converting numpy arrays into pytorch tensors
X_train = torch.Tensor(X_train)

# nn.CrossEntropyLoss expects the targets as class indices of type int64 (LongTensor)
Y_train = torch.Tensor(Y_train).type(torch.LongTensor)

X_test = torch.Tensor(X_test)
Y_test = torch.Tensor(Y_test)

EPOCHS = 20

for epoch in range(EPOCHS):
    
    # Clearing gradients
    optimizer.zero_grad()
    
    # Forward Propagation
    outs = model(X_train)
    
    # Computing Loss
    loss = error(outs,Y_train)
    
    # Backward propagation
    loss.backward()
    
    # Updating parameters
    optimizer.step()
    
    # Printing loss
    print(f"Loss after iteration {epoch} is {loss.item():.4f}")
    
    

# Evaluating Results

Our model is trained. Now I am going to predict on X_test and then evaluate the results.
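
The model returns two raw scores (logits) per article, not probabilities. As a small sketch with made-up logits, softmax turns each row into probabilities that sum to 1, and taking the index of the largest value gives the predicted class. Because softmax preserves the ordering, applying argmax to the raw logits gives the same labels.

```python
import torch

# Hypothetical logits for two articles; column 0 = fake, column 1 = real.
logits = torch.tensor([[2.0, -1.0],
                       [0.5, 1.5]])

probs = torch.softmax(logits, dim=1)  # each row now sums to 1
labels = probs.argmax(dim=1)          # index of the most likely class

print(probs)
print(labels)  # tensor([0, 1])
```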

In [None]:
# Importing metrics
from sklearn.metrics import accuracy_score,confusion_matrix


# Prediction (gradients are not needed at inference time)
model.eval()
with torch.no_grad():
    y_head = model(X_test)
print(y_head[0])
# Converting predictions (logits) into class labels
y_pred = torch.max(y_head,1)[1]
print(y_pred[0])

# Accuracy score (scikit-learn expects y_true first, then y_pred)
print("Model accuracy is ",accuracy_score(Y_test,y_pred))


* Our accuracy is 98.4%. Let's check the confusion matrix.
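
A confusion matrix contains more information than the accuracy alone. The sketch below uses a made-up 2x2 matrix, not the notebook's results, laid out the way scikit-learn does it (rows are true classes, columns are predicted classes, class 0 first), and derives accuracy, precision, and recall for class 1 from it.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[50, 2],
               [3, 45]])

# With labels ordered 0, 1 the flattened matrix is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # of the predicted "real", how many are real
recall = tp / (tp + fn)          # of the actual "real", how many were found

print(accuracy, precision, recall)
```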

In [None]:
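# Use a new name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_true=Y_test,y_pred=y_pred)

fig,ax = plt.subplots(figsize=(6,6))
# The cells are integer counts, so format them as integers
sns.heatmap(cm,annot=True,fmt="d",linewidths=1.5,ax=ax)
plt.show()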

# What Did We Do?

Our kernel is finished. Let's review what we did in it.

1. We imported our data and libraries.
1. We processed the data using NLP methods.
1. We created our bag-of-words matrices.
1. We created X and Y.
1. We created train and test arrays.
1. We built a model with four hidden layers using PyTorch.
1. We trained the model that we built.
1. We made predictions and evaluated them.


# Conclusion
Thanks for your attention. If you have any questions, or if anything is unclear, feel free to ask. I am waiting for your questions.

If you liked this kernel or found it useful, an upvote would be appreciated.

Greetings.