# Inspiration
Can we use this dataset to build an algorithm that determines whether an article is fake news?

# Workflows

Obtaining Data > Feature Engineering > Scrubbing Data > Exploratory Data Analysis > NLP > Models Training > Models Evaluation

## Step 1 : Make necessary imports

In [1]:
# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# NLP processing
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [2]:
nltk.download('wordnet')
nltk.download('stopwords')

## Step 2 : Obtaining Data

In [3]:
# Extracting data into dataframes
real_news = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')
fake_news = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')

In [4]:
# Previewing a few sample records
real_news.head()

In [5]:
fake_news.head()

## Step 3 : Feature Engineering

In [6]:
# Adding a new column "isReal" to label real and fake news articles
real_news['isReal'] = 1   # Denote real news as 1
fake_news['isReal'] = 0   # Denote fake news as 0

In [7]:
# Combining both datasets
# so that we can analyse the complete dataset
news = pd.concat([real_news, fake_news], ignore_index=True)
news.head()

## Step 4 : Scrubbing Data

In [8]:
# Checking for missing values
news.isnull().any()

**As you can see, there are no missing values, so no rows need to be dropped or imputed.**
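Note that `isnull()` only flags true missing values (`NaN`/`None`); a field holding an empty or whitespace-only string still counts as present. Here is a minimal sketch of how both kinds of gap could be counted. It uses a small made-up frame rather than the real dataset, so the frame `toy` and its contents are illustrative assumptions:

```python
import pandas as pd

# Hypothetical toy frame standing in for the news data
toy = pd.DataFrame({
    "title": ["Budget vote delayed", "   ", "Senate passes bill", None],
    "text": ["Full article text", "More text", "", "Another article"],
})

# True missing values (NaN/None) per column
missing = toy.isnull().sum()

# Empty or whitespace-only strings per column (not caught by isnull);
# NaN is filled with a placeholder first so it is not counted twice
blank = toy.apply(lambda col: col.fillna("x").str.strip().eq("").sum())

print(missing)
print(blank)
```

If the second count were non-zero on the real data, those rows would deserve a closer look before modelling.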

In [9]:
# The analysis uses only the news titles,
# so the remaining columns are not required.
news = news.drop(['text', 'subject', 'date'], axis=1)
news.head()

## Step 5 : Exploratory Data Analysis

In [10]:
# Displaying the size of datasets
print(f"Real news size : {real_news.shape}")
print(f"Fake news size : {fake_news.shape}")

In [11]:
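# Real/Fake news distribution
# value_counts() sorts by frequency, so look each count up by its label
label_counts = news.isReal.value_counts()
fake_news_count = label_counts[0]   # isReal == 0
real_news_count = label_counts[1]   # isReal == 1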
print(f"Fake news count : {fake_news_count}")
print(f"Real news count : {real_news_count}")

In [12]:
sns.countplot(data = news, x = "isReal")
plt.show()

**The labels appear to be roughly evenly distributed. This is a good sign: neither class dominates the dataset.**
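The balance can also be checked numerically. The sketch below uses `value_counts(normalize=True)` on a made-up label column (the series `labels` is an assumption standing in for `news.isReal`) and looks each share up by label rather than by position:

```python
import pandas as pd

# Hypothetical labels: 1 = real, 0 = fake
labels = pd.Series([1, 0, 0, 1, 0, 1, 0, 0], name="isReal")

# Share of each class; the result is sorted by frequency, not by label
shares = labels.value_counts(normalize=True)

fake_share = shares.loc[0]
real_share = shares.loc[1]
print(f"Fake: {fake_share:.1%}, Real: {real_share:.1%}")
```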

## Step 6 : Natural Language Processing

### 1 - Processing text

In [13]:
wordnet = WordNetLemmatizer()
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
corpus = []

titles = news['title'].values
for title in titles:
    # Cleaning text: keep letters only
    review = re.sub('[^a-zA-Z]', ' ', title)
    
    # Formatting text: lowercase and split into words
    review = review.lower()
    review = review.split()
    
    # Drop stop words and lemmatize each remaining word
    review = [wordnet.lemmatize(word) 
              for word in review
              if word not in stop_words]
    
    review = ' '.join(review)
    corpus.append(review)

**First title Before processing :**

In [14]:
print(news.loc[0, 'title'])

**First title After processing :**

In [15]:
print(corpus[0])
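The cleaning steps can be wrapped in a small function so they are easy to try on a single title. This is a minimal sketch, assuming a tiny hand-written stop-word set (`STOP_WORDS`) in place of NLTK's English list and leaving out lemmatization so that it runs without any downloads. `clean_title` is a hypothetical helper name:

```python
import re

# Illustrative stop-word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "is", "s"}

def clean_title(title: str) -> str:
    """Keep letters only, lowercase, split, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", title)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)

cleaned = clean_title("U.S. budget fight looms in 2018!")
print(cleaned)  # u budget fight looms
```

The example also shows a side effect of the letters-only regex: abbreviations such as "U.S." are split into single letters, and digits are dropped entirely.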

### 2 - Feature Extraction

In [16]:
# Bag of Words using CountVectorizer
cv = CountVectorizer()
X_bw = cv.fit_transform(corpus).toarray()
X_bw.shape

In [17]:
# TF-IDF using TfidfVectorizer
tfidf_v = TfidfVectorizer()
X_tfidf = tfidf_v.fit_transform(corpus).toarray()
X_tfidf.shape

In [18]:
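# Bag of Words matrix as a DataFrame, one column per vocabulary word
# (get_feature_names() was removed from recent scikit-learn releases)
count_df = pd.DataFrame(X_bw, columns=cv.get_feature_names_out())
count_df.head()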

In [19]:
# TF-IDF matrix as a DataFrame, one column per vocabulary word
tfidf_df = pd.DataFrame(X_tfidf, columns=tfidf_v.get_feature_names_out())
tfidf_df.head()

## Step 8 : Models Training

In [20]:
# Target feature
y = news["isReal"]

### 1 - Bag of Words

In [21]:
# Divide the dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(
    X_bw, y, test_size=0.33, random_state=0)

classifier = MultinomialNB()         # Choose model hyperparameters
classifier.fit(X_train, y_train)     # Fit model to data
bw_pred = classifier.predict(X_test) # Predict on new data 

### 2 - TF-IDF

In [22]:
# Divide the dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.33, random_state=0)

classifier = MultinomialNB()            # Choose model hyperparameters
classifier.fit(X_train, y_train)        # Fit model to data
tfidf_pred = classifier.predict(X_test) # Predict on new data 
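In the cells above the vectorizers are fitted on the full corpus before the split, so the vocabulary and the IDF weights also reflect the test titles. One common alternative, sketched here on made-up titles, is to split first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that only the training part is used for fitting. The titles, labels, and variable names below are illustrative assumptions, not the notebook's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned titles: 1 = real, 0 = fake
toy_titles = [
    "senate passes budget bill", "court blocks travel order",
    "minister visits trade summit", "central bank holds rate",
    "watch host destroys senator", "shocking video goes viral",
    "wow celebrity slams president", "unbelievable tweet breaks internet",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class ratio in both parts
train_titles, test_titles, train_y, test_y = train_test_split(
    toy_titles, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)

# The vectorizer is fitted inside the pipeline, on training titles only
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(train_titles, train_y)
toy_pred = model.predict(test_titles)
print(toy_pred)
```

Words that occur only in the test titles are simply ignored at prediction time, which is closer to how the model would behave on unseen articles.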

## Step 9 : Models Evaluation

### 1 - Bag of Words

In [23]:
bw_score = accuracy_score(y_test, bw_pred)
cm = confusion_matrix(y_test, bw_pred)
print(f"Accuracy: {round(bw_score, 2)}")
print(f"Confusion matrix:\n {cm}")

*With Bag of Words we achieved an accuracy of 94%.*

### 2 - TF-IDF

In [24]:
tfidf_score = accuracy_score(y_test, tfidf_pred)
cm = confusion_matrix(y_test, tfidf_pred)
print(f"Accuracy: {round(tfidf_score, 2)}")
print(f"Confusion matrix:\n {cm}")

*With TF-IDF we achieved an accuracy of 93%.*
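Accuracy alone does not say which kind of mistake a model makes. As a sketch of how the confusion matrix can be read, the example below uses made-up labels and predictions (`y_true` and `y_pred` are assumptions, not the notebook's results). In scikit-learn the rows are the true labels and the columns the predicted labels, in sorted label order, so with `1 = real` the second row and column describe real news:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions: 1 = real, 0 = fake
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Flattened order for binary labels [0, 1]: tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision and recall for the positive class (real news)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

Here `fp` counts fake titles predicted as real and `fn` counts real titles predicted as fake.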

## Step 10 : Summary
We compared the accuracy of both feature sets: Bag of Words reached 94% and TF-IDF reached 93%, so Bag of Words performed slightly better than TF-IDF on this train/test split.

In [25]:
models = ['Bag of Words', 'TF-IDF']
score = [bw_score, tfidf_score]
sns.barplot(x=models, y=score)
plt.title('Models Performance', fontsize=15)
plt.xlabel('Model', fontsize=15)
plt.ylabel('Performance', fontsize=15)
plt.ylim(0, 1)
plt.show()
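A gap of one percentage point on a single split is small, so it can help to compare the two feature sets over several splits. The following is a minimal sketch using `cross_val_score` on made-up titles. The data, the dictionary `candidates`, and the choice of two folds are assumptions made to keep the example tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned titles: 1 = real, 0 = fake
cv_titles = [
    "senate passes budget bill", "court blocks travel order",
    "minister visits trade summit", "central bank holds rate",
    "watch host destroys senator", "shocking video goes viral",
    "wow celebrity slams president", "unbelievable tweet breaks internet",
]
cv_labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "Bag of Words": make_pipeline(CountVectorizer(), MultinomialNB()),
    "TF-IDF": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

# One accuracy score per fold for each feature set
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, cv_titles, cv_labels, cv=2, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.2f}")
```

On the real dataset a larger number of folds (for example `cv=5`) would be a more usual choice.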