## Fake and Real News Dataset


# **<span style="color:#6daa9f;">IMPORT LIBRARIES & PACKAGES </span>**


### Hi there!😄 I am new to data science, and this is my attempt at the Fake and Real News dataset. Feel free to comment if you have any questions, insights or advice on this notebook or anything data science related :) Upvote if you find my work useful! Thank you!

In [None]:
# Import packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

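import nltk
nltk.download('punkt')
nltk.download('stopwords')  # needed by stopwords.words('english')
nltk.download('wordnet')    # needed by WordNetLemmatizer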
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize,sent_tokenize
import re
import string

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer

# **<span style="color:#6daa9f;">EXPLORATORY DATA ANALYSIS </span>**


In [None]:
# Reading from file 
fake = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')
true = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')

In [None]:
print(true.shape)
print(true.info())
true.head()

In [None]:
print(fake.shape)
print(fake.info())
fake.head()

In [None]:
# Label encoding: 1 = fake news, 0 = real news
fake['Label'] = 1
true['Label'] = 0

In [None]:
data = pd.concat([true,fake],axis=0,ignore_index=True)
print(data.shape)
data.head()

In [None]:
data.describe()

In [None]:
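# Join title and body with a space so the last title word and the first body word do not merge
data['text'] = data['title'] + ' ' + data['text']
data = data.drop(['title'], axis=1)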

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Label', data=data)

In [None]:
data.isnull().sum()

In [None]:
data.subject.value_counts()

# **<span style="color:#6daa9f;">DATA CLEANING </span>**

**Lowercase the text and remove the word 'Reuters', square-bracketed text, links, HTML tags, punctuation and words containing numbers**

* Cleaning the text data is important so that the model is not fed noise that does not help with the prediction.
* The word 'Reuters' was removed because it always appears in the real news articles, which makes it an obvious indicator of the label for the model.

In [None]:
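def clean_text(text):
    """Lowercase the text and strip brackets, links, tags, punctuation, digits and 'reuters'."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # links
    text = re.sub(r'<.*?>+', '', text)                               # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', ' ', text)                                  # newlines become spaces so words do not merge
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing digits
    text = re.sub(r'reuters', '', text)                              # lowercase pattern, since the text was lowered above
    return text

data['text'] = data['text'].apply(clean_text)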
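A quick way to sanity-check the cleaning steps is to run them on a single made-up headline. The sketch below uses a hypothetical stand-alone helper, `clean_text_demo`, and an invented sample string (neither is part of the notebook). It assumes the 'reuters' pattern is written in lowercase because the text has already been lowercased by that point, that newlines become spaces, and that leftover whitespace is collapsed at the end.

```python
import re
import string

def clean_text_demo(text):
    """Hypothetical stand-alone copy of the cleaning steps, for a quick check."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                              # drop [bracketed] notes
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # drop links
    text = re.sub(r'<.*?>+', '', text)                               # drop HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\n', ' ', text)                                  # newlines become spaces
    text = re.sub(r'\w*\d\w*', '', text)                             # drop tokens containing digits
    text = re.sub(r'reuters', '', text)                              # assumption: lowercase pattern
    return ' '.join(text.split())                                    # assumption: collapse whitespace

# Invented sample, not taken from the dataset
sample = "WASHINGTON (Reuters) - The Senate [VIDEO] voted 51-49, see https://example.com"
cleaned = clean_text_demo(sample)
print(cleaned)
```

If 'reuters' still shows up in the printed result, the removal pattern is not matching the lowercased text.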


**Remove stop words**

In [None]:
stop = set(stopwords.words('english'))  # set lookup is faster than scanning a list for every word
data['text'] = data['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))



**Lemmatize words**

Words are lemmatized so that only their base forms (lemmas) are retained in the data that is fed into the model.

In [None]:
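wnl = nltk.stem.WordNetLemmatizer()  # create the lemmatizer once instead of on every call

def lemmatize_words(text):
    # lemmatize() defaults to pos='n', so every word is treated as a noun
    return ' '.join(wnl.lemmatize(word) for word in text.split())

data['text'] = data['text'].apply(lemmatize_words)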

**Split data into train and test set**

In [None]:
y = data['Label']
X_train, X_test, y_train, y_test = train_test_split(data['text'], y,test_size=0.33,random_state=53)

**Using a bag-of-words model for data transformation**

Since we are dealing with text data, we cannot feed it directly to our model. Therefore, I am using a bag-of-words model to extract features from the text and convert them into numerical feature vectors that can be fed directly to the algorithm.

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)
print(count_train.shape)
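To make the bag-of-words idea concrete, here is a minimal plain-Python sketch of the counting step. It is a simplified illustration, not how `CountVectorizer` is implemented: the three documents are made up, tokens are split on whitespace only, and `to_counts` is a hypothetical helper.

```python
from collections import Counter

# Invented toy documents, not taken from the dataset
docs = ["trump says tax plan good", "senate vote tax bill", "trump trump tweet"]

# Vocabulary: every distinct token, sorted so each word gets a stable column index
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def to_counts(doc):
    """Turn one document into a vector of word counts over the vocabulary."""
    row = [0] * len(vocab)
    for word, n in Counter(doc.split()).items():
        if word in index:  # words not in the vocabulary are ignored
            row[index[word]] = n
    return row

matrix = [to_counts(doc) for doc in docs]
print(vocab)
print(matrix)
```

Each row has one column per vocabulary word, which is why the vocabulary is built from the training texts only and then reused unchanged for the test texts.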

# **<span style="color:#6daa9f;">MODEL </span>**

Two different models are trained, each with two parameter settings, to investigate the effect of alpha (Naive Bayes) and C (SVM).

**Naive Bayes**

In [None]:
# Model 1 - default parameters (alpha=1.0)
from sklearn.metrics import classification_report

nb_classifier1 = MultinomialNB()
nb_classifier1.fit(count_train, y_train)

pred1 = nb_classifier1.predict(count_test)

# target_names follow the sorted label values: 0 = True, 1 = Fake
print(classification_report(y_test, pred1, target_names=['True', 'Fake']))

In [None]:
# Model 2 - very strong smoothing (alpha=1000)
nb_classifier2 = MultinomialNB(alpha = 1000)
nb_classifier2.fit(count_train, y_train)

pred2 = nb_classifier2.predict(count_test)

# target_names follow the sorted label values: 0 = True, 1 = Fake
print(classification_report(y_test, pred2, target_names=['True', 'Fake']))
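The two Naive Bayes runs differ only in alpha, the additive smoothing parameter. A rough sketch of what it does to the per-word likelihoods is shown below, using the standard smoothed estimate (count + alpha) / (total + alpha * vocabulary size). The word counts are made-up numbers for illustration and `smoothed_probs` is a hypothetical helper, not part of scikit-learn.

```python
def smoothed_probs(counts, alpha):
    """Smoothed per-word likelihoods: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}

# Hypothetical word counts for one class (made-up numbers, not from the dataset)
class_counts = {"said": 90, "video": 9, "hillary": 1, "unseen": 0}

for alpha in (1, 1000):
    probs = smoothed_probs(class_counts, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```

With a small alpha the likelihoods stay close to the observed frequencies. With alpha = 1000 the added pseudo-counts swamp the real counts, so every word ends up with nearly the same probability and the classifier loses most of the signal in the text.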

**Support Vector Machine (SVM)**

In [None]:
# Model 1 - C=1
from sklearn.svm import SVC

# gamma is dropped: it only affects the 'rbf', 'poly' and 'sigmoid' kernels
svc_model1 = SVC(C=1, kernel='linear')
svc_model1.fit(count_train, y_train)

prediction1 = svc_model1.predict(count_test)

# target_names follow the sorted label values: 0 = True, 1 = Fake
print(classification_report(y_test, prediction1, target_names=['True', 'Fake']))

In [None]:
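# Model 2 - C=100 (misclassified training points are penalized more heavily)
# gamma is dropped: it only affects the 'rbf', 'poly' and 'sigmoid' kernels
svc_model2 = SVC(C=100, kernel='linear')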
svc_model2.fit(count_train, y_train)

prediction2 = svc_model2.predict(count_test)

# target_names follow the sorted label values: 0 = True, 1 = Fake
print(classification_report(y_test, prediction2, target_names=['True', 'Fake']))
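The two SVM runs differ only in C. As a rough illustration of what C controls, the sketch below evaluates the usual soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses), for one fixed separating line. The points, weights and the `svm_objective` helper are all made up for this example, and the class labels are written as -1 and +1 as in the textbook formulation.

```python
def svm_objective(w, b, X, y, C):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses (y in {-1, +1})."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    )
    return margin_term + C * hinge

# Toy 2-D points (made-up values); the last point sits on the wrong side of the line
X = [[2.0, 0.0], [-2.0, 0.0], [-0.5, 0.0]]
y = [1, -1, 1]
w, b = [1.0, 0.0], 0.0

for C in (1, 100):
    print(C, svm_objective(w, b, X, y, C))
```

The same misclassified point costs far more at C = 100 than at C = 1, so a larger C pushes the optimizer to fit the training points more tightly at the expense of a wider margin.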