# Problem statement: Detecting Fake News with Python and Machine Learning

## What is Fake News?
### Fake news is a type of yellow journalism: news items that may be hoaxes, generally spread through social media and other online media. It is often published to promote or impose certain ideas, frequently in service of a political agenda. Such items may contain false or exaggerated claims, may be amplified by recommendation algorithms until they go viral, and can leave users inside a filter bubble.

#### Importing Required libraries

In [1]:
import pandas as pd
# pandas provides the DataFrame, used to load, filter, and merge tabular data

import numpy as np
# NumPy provides fast array-based numerical computation

import re
# re is Python's regular expression module

from nltk.corpus import stopwords
# NLTK is the Natural Language Toolkit; stopwords provides lists of common words

from nltk.stem.porter import PorterStemmer
# Porter stemming algorithm, reduces words to their stems

from sklearn.feature_extraction.text import TfidfVectorizer
# converts text into TF-IDF feature vectors

from sklearn.model_selection import train_test_split
# splits data into training and test sets in a given ratio

from sklearn.linear_model import LogisticRegression
# classifier used for this binary classification problem

from sklearn.metrics import accuracy_score
# fraction of predictions that match the true labels

from sklearn.preprocessing import LabelEncoder
# encodes string labels as integers

#### - re stands for Regular Expressions: a sequence of characters that defines a search pattern, which you can use to match or substitute text with very little code. E.g. for sentence = "i was born in *year @&1995", `re.sub('[*@&]', '', sentence)` replaces each of the listed special characters with nothing, giving "i was born in year 1995".
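A minimal sketch of the substitution described above, using the sample sentence from the note. The second pattern, `[^a-zA-Z0-9 ]`, is only one illustrative way to strip every character that is not a letter, digit, or space; the notebook itself uses a stricter pattern later on.

```python
import re

sentence = "i was born in *year @&1995"

# Remove only the three listed special characters. Inside [...] the
# characters *, @ and & are treated as literals, so no escaping is needed.
cleaned = re.sub(r"[*@&]", "", sentence)
print(cleaned)  # i was born in year 1995

# Remove anything that is not a letter, a digit, or a space.
cleaned_all = re.sub(r"[^a-zA-Z0-9 ]", "", sentence)
print(cleaned_all)  # i was born in year 1995
```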

#### - nltk: from this package we import stopwords, common words that carry little meaning on their own. A corpus is a collection of text documents; here each row of the dataset is one document.

#### - PorterStemmer is a stemming algorithm that strips suffixes from words to reduce them to a common stem, e.g. journalism becomes journal. The stem is not always a dictionary word: secretary becomes secretari.
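As a quick check of what the stemmer does, the sketch below stems a few words whose lower-cased forms occur in the first rows of the dataset. It assumes only that NLTK is installed; the Porter stemmer itself needs no extra data download.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Lower-cased words taken from the first rows of the dataset.
words = ["journalism", "secretary", "google", "november"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

The results match the stems visible in the notebook's stemmed output (journal, secretari, googl, novemb), which shows that a stem is a shared key for related word forms rather than a readable word.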

#### - TfidfVectorizer (term frequency, inverse document frequency) converts text to numeric vectors. A word gets a high weight when it occurs often in a document but in few documents overall, so words that appear everywhere are down-weighted. The idf part uses the logarithm, a monotonic function, to dampen the effect of document counts. TF-IDF does not capture semantics: words with related meanings, such as taste and delicious, are still treated as unrelated, separate dimensions.
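To make the weighting concrete, here is a hand-rolled sketch in plain Python. It follows the defaults documented for scikit-learn's TfidfVectorizer (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization); the three toy documents are invented for illustration and are not taken from the dataset.

```python
import math
from collections import Counter

# Toy corpus (assumption): three tiny, already tokenizable documents.
docs = [
    "budget vote delayed",
    "senate vote today",
    "vote fraud claim",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Term count times idf, then scale the vector to unit L2 length.
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

The word "vote" occurs in every document, so it ends up with the lowest weight in the first document, while "budget" and "delayed" score higher.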

#### - Stopwords are words that do not add much value to the data

In [2]:
# downloading stopwords from nltk library
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dines\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# printing the stopwords to get a general idea of what they are

print(stopwords.words('english'))
# prints the English stopword list

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### The stopword list also includes 'not'. Removing 'not' can change the meaning of a sentence, so it may be worth excluding 'not' from the stopword list.
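The effect of keeping 'not' can be sketched as follows. To stay self-contained, the example uses a small hand-picked stopword set and a hypothetical helper `remove_stopwords`; with NLTK the same idea would be `set(stopwords.words('english')) - {'not'}`.

```python
# Small hand-picked sample (assumption); the full NLTK list is much longer.
sample_stopwords = {"the", "is", "this", "a", "not"}

def remove_stopwords(sentence, stop_words):
    """Lower-case the sentence and drop every word found in stop_words."""
    return " ".join(w for w in sentence.lower().split() if w not in stop_words)

sentence = "This story is not true"

# With 'not' treated as a stopword the negation disappears.
with_not_removed = remove_stopwords(sentence, sample_stopwords)
print(with_not_removed)  # story true

# Excluding 'not' from the stopword set preserves the negation.
keep_negation = remove_stopwords(sentence, sample_stopwords - {"not"})
print(keep_negation)  # story not true
```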

### Data/Text Preprocessing

In [4]:
#reading csv data
news = pd.read_csv('news.csv')
news.head(8)

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL


In [5]:
#checking shape of dataset
news.shape

(6335, 4)

In [6]:
#checking null values
news.isnull().sum()

Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

#### - No null values are present in any column

In [7]:
#creating input (features) and output (labels) variables; despite the names, train holds the article text and test holds the labels
train = news['text']
test = news['label']

In [8]:
# always check input and output shape and data
train.shape, test.shape

((6335,), (6335,))

In [9]:
# the target column holds strings, so encode it as integers
le = LabelEncoder()
test = le.fit_transform(test)
test # here, 0 represents FAKE and 1 represents REAL

array([0, 0, 1, ..., 0, 1, 1])
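Why FAKE becomes 0 and REAL becomes 1: LabelEncoder sorts the distinct labels and numbers them from 0. The plain-Python sketch below mimics that behavior on the labels of the first five rows; it is an illustration, not the library's implementation. With the fitted encoder, the same information is available through `le.classes_` and `le.inverse_transform`.

```python
# Labels of the first five rows shown by news.head().
labels = ["FAKE", "FAKE", "REAL", "FAKE", "REAL"]

# Sort the distinct labels, then number them from 0.
classes = sorted(set(labels))
to_int = {label: index for index, label in enumerate(classes)}

encoded = [to_int[label] for label in labels]
decoded = [classes[index] for index in encoded]

print(classes)  # ['FAKE', 'REAL']
print(encoded)  # [0, 0, 1, 0, 1]
```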

In [10]:
#creating a PorterStemmer instance
port_stem = PorterStemmer()

In [11]:
# text cleaning and stemming
stop_words = set(stopwords.words('english'))
# build the stopword set once; a set gives fast membership tests

def stemming(content):

    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # '^' inside [] negates the class: every character that is not a letter is replaced with a space

    stemmed_content = stemmed_content.lower()
    # convert to lower case

    stemmed_content = stemmed_content.split()
    # split on whitespace into a list of words

    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    # drop stopwords and stem the remaining words

    stemmed_content = ' '.join(stemmed_content)
    # join the stems back into a single string

    return stemmed_content

In [12]:
#applying stemming on train dataset
train = train.apply(stemming)

In [13]:
#checking after applying stemming process
train.head(20)

0     daniel greenfield shillman journal fellow free...
1     googl pinterest digg linkedin reddit stumbleup...
2     u secretari state john f kerri said monday sto...
3     kayde king kaydeek novemb lesson tonight dem l...
4     primari day new york front runner hillari clin...
5     immigr grandpar year ago arriv new york citi i...
6     share bayle luciani left screenshot bayle caug...
7     czech stockbrok save jewish children nazi germ...
8     hillari clinton donald trump made inaccur clai...
9     iranian negoti reportedli made last ditch push...
10    cedar rapid iowa one wonder ralli entir career...
11    donald trump organiz problem gone bad wors fla...
12    click learn alexandra person essenc psychic pr...
13    octob pretti factual except women select servi...
14    kill obama administr rule dismantl obamacar pu...
15    women move high offic often bring style approa...
16    shock michel obama hillari caught glamor date ...
17    hillari clinton bare lost presidenti elect

In [14]:
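# the text is still strings, so convert it to numerical TF-IDF features that the model can work with
# note: fitting on all rows before the train/test split lets test-set statistics influence the features; fitting on the training split only avoids this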
vectorizer = TfidfVectorizer()
train = vectorizer.fit_transform(train)

In [15]:
#checking the sparse matrix produced by TfidfVectorizer
print(train)

  (0, 11498)	0.018669004977501614
  (0, 4952)	0.016492946252976797
  (0, 106)	0.02761351227547292
  (0, 17587)	0.02761351227547292
  (0, 5293)	0.028910867392933547
  (0, 6325)	0.01281969349793845
  (0, 34872)	0.022143906642919344
  (0, 26227)	0.01375386384385865
  (0, 8751)	0.0226780256967325
  (0, 26965)	0.024519594414306525
  (0, 42772)	0.023731581997339293
  (0, 36349)	0.01529282822765254
  (0, 21457)	0.011020597155694964
  (0, 23657)	0.011689867576623285
  (0, 33684)	0.022127993048050038
  (0, 39761)	0.02827181627824194
  (0, 31214)	0.022305525686593712
  (0, 7097)	0.01411426717822184
  (0, 2591)	0.015117514678878584
  (0, 36474)	0.01864302037416216
  (0, 5797)	0.02274843807685857
  (0, 40952)	0.01971379102038718
  (0, 34237)	0.018712576655497182
  (0, 1089)	0.018102417756284926
  (0, 18761)	0.02906905996945774
  :	:
  (6334, 14047)	0.023167179277208322
  (6334, 38604)	0.012077162102509642
  (6334, 36852)	0.027692187669159592
  (6334, 9454)	0.06398857267178765
  (6334, 7806)	0.0177

In [16]:
# splitting data into train and test

x_train, x_test, y_train, y_test = train_test_split(train, test, test_size=0.2, stratify=test, random_state=2)
# stratify=test keeps the proportion of real and fake news the same in both splits

In [17]:
#checking shapes after splitting
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((5068, 43733), (1267, 43733), (5068,), (1267,))
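The effect of `stratify` is easiest to see on a toy label list with a known class balance. The sketch below uses invented data (60 samples of class 0, 40 of class 1) and `toy_`-prefixed names so that it does not clash with the notebook's variables.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples, 60 of class 0 and 40 of class 1.
toy_x = list(range(100))
toy_y = [0] * 60 + [1] * 40

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=2
)

# Count how many samples of each class landed in each split.
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts)
print(test_counts)
```

Both splits keep the 60/40 ratio of the full data: 48 and 32 samples in the training part, 12 and 8 in the test part.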

#### Training the model

In [18]:
model = LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression()

#### Evaluation

In [19]:
#checking accuracy score on training data

x_train_pred = model.predict(x_train)
training_pred = accuracy_score(y_train, x_train_pred)
print('training accuracy score', training_pred)

training accuracy score 0.9522494080505131


In [20]:
#checking accuracy score on test data

x_test_pred = model.predict(x_test)
test_pred = accuracy_score(y_test, x_test_pred)
print('test accuracy score', test_pred)

test accuracy score 0.9171270718232044
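Accuracy is simply the share of predictions that equal the true label. The sketch below computes it by hand on a small invented example and also counts the two kinds of error, which a single accuracy number hides. The label values follow the notebook's encoding (0 = fake, 1 = real); the eight predictions are made up.

```python
# Invented true labels and predictions (0 = fake, 1 = real).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Accuracy: number of matching positions divided by the total.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.75

# The two error types: fake news passed off as real, and real news flagged as fake.
fake_as_real = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
real_as_fake = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
print(fake_as_real, real_as_fake)  # 1 1
```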


#### Making a predictive system

In [21]:
# checking the predicted label for the first test sample

x_news = x_test[0]
prediction = model.predict(x_news)

print(prediction)

if prediction[0] == 0:
    print('News is Fake')
else:
    print('News is Real')

print(y_test[0])
# comparing with the true label in y_test: the prediction is correct

[1]
News is Real
1


In [22]:
# checking the predicted label for another test sample

x_news = x_test[1]
prediction = model.predict(x_news)

print(prediction)

if prediction[0] == 0:
    print('News is Fake')
else:
    print('News is Real')

print(y_test[1])
# comparing with the true label in y_test: the prediction is correct

[0]
News is Fake
0


#### Conclusion: the model performs well, with about 95% accuracy on the training data and about 92% on the test data. Accuracy might be improved further by trying other vectorization and stemming methods.

In [23]:
import pickle
#to save trained model

In [24]:
with open('fake_news_text', 'wb') as f:
    pickle.dump(model, f)
# saving the trained model to a file named "fake_news_text"; 'wb' opens the file for writing in binary mode, which pickle requires

In [27]:
with open('fake_news_text.pkl', 'wb') as f:
    pickle.dump(model, f)
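A saved model on its own cannot classify new raw text, because the text must first pass through the same fitted vectorizer. One possible approach, sketched here on invented toy data, is to pickle the vectorizer and the model together and reuse both after loading. The file name `fake_news_bundle.pkl`, the dictionary keys, and the `toy_` names are assumptions made for this illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the stemmed articles (0 = fake, 1 = real).
texts = [
    "shock secret cure doctor hate",
    "alien land white hous shock",
    "senat pass budget bill vote",
    "court rule tax law vote",
]
labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(texts), labels)

# Save the vectorizer and the model in one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "fake_news_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": toy_vectorizer, "model": toy_model}, f)

# Load both back and classify new, already stemmed text.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_text = ["senat vote budget"]
features = bundle["vectorizer"].transform(new_text)  # transform only, no refit
restored_prediction = bundle["model"].predict(features)
original_prediction = toy_model.predict(toy_vectorizer.transform(new_text))
print(restored_prediction)
```

In the notebook's setting, new articles would also need to go through the `stemming` function before being passed to the vectorizer's `transform`.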