About the Dataset
  1. id: unique id for a news article
  2. title: the title of a news article
  3. author: author of the news article
  4. text: the text of the article; could be incomplete
  5. label: a label that marks whether the news article is real or fake:

```
  1: fake news
  0: real news
```


importing the dependencies 

In [None]:
import numpy as np 
import pandas as pd
import re # regular expressions, used for searching and replacing patterns in text
from nltk.corpus import stopwords # words that don't add much meaning to a sentence, e.g. "where", "what"
# nltk = Natural Language Toolkit, a library for working with text
from nltk.stem.porter import PorterStemmer
# stemming takes a word and strips its suffixes to give the root word; PorterStemmer does that for us
from sklearn.feature_extraction.text import TfidfVectorizer
# convert text to feature vectors, which are basically just numbers
from sklearn.model_selection import train_test_split
# split data into training data and test data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [None]:
# first we need to remove stopwords from the dataset
# then stem all words; stemming gives the root word for any word
# feature extraction turns the words into feature vectors
# train_test_split = splitting the data into training and test data

In [None]:
# download the stopwords from the nltk library
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# printing the English words that don't add value (stopwords)
print (stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

data pre-processing

In [None]:
#loading datasets into a pandas Dataframe
news_dataset = pd.read_csv('/content/train.csv.zip') #loads our training data to a new variable

In [None]:
# check the size of the dataset
news_dataset.shape
# outcome = 20800 rows and 5 columns, i.e. 20800 news articles

(20800, 5)

In [None]:
# print the first 5 rows of dataframe 
news_dataset.head()
#prints data: 

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [None]:
# count the number of missing data/values in the dataset
news_dataset.isnull().sum()
# counts the missing values in each column; isnull is the command. NOTE: always mention the dataset first

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [None]:
# if a lot of values are missing, you would impute them with proper values; since our dataset is large, we simply replace the missing values with empty strings

In [None]:
# replacing the null values with empty string 
news_dataset = news_dataset.fillna('')

In [None]:
# for our prediction we use the author and the title, combined into one column; we don't use the text column
# you can use the text sometimes, but title + author can already give good accuracy

In [None]:
# merging author name and news title 
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [None]:
print(news_dataset['content'])
# the output shows the author name first and then the title; this new content column is what we use to make the predictions

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [None]:
# separating the data & label
X = news_dataset.drop(columns='label')
# we want to remove the label and store all the other data without it
  # columns='label' already says we are dropping a column, so axis=1 is not needed
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

**stemming:**

stemming is the process of reducing a word to its root word - it strips suffixes

e.g - acting, acted, acts --> act

(helps with better performance in model)

- after stemming we vectorize: convert the words to feature vectors = numerical data, which we can then feed to the ml model
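
To make the idea of stemming concrete, here is a minimal sketch using only the standard library. It is a crude suffix stripper, not the Porter algorithm that the notebook uses; the `toy_stem` function and its tiny `SUFFIXES` rule set are hypothetical and exist only for illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration
# of how related word forms can collapse to one root.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 letters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["acting", "acted", "acts", "act"]
roots = [toy_stem(w) for w in words]
print(roots)
```

All four forms end up as the same token, which is the point: after stemming, the vectorizer treats them as one feature instead of four.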

In [None]:
port_stem = PorterStemmer()

In [None]:
# build the stopword set once, so it is not reloaded for every word
stop_words = set(stopwords.words('english'))

# def defines a function; this one is called stemming and takes one content string as input
def stemming(content):
  # re.sub substitutes matches of a pattern; inside [ ], "^" negates the set,
  # so '[^a-zA-Z]' matches everything that is not a letter (numbers, punctuation)
  # and replaces it with a space
  # content = author + title joined together
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  # make everything lowercase so the same word always looks the same
  stemmed_content = stemmed_content.lower()
  # split the string into a list of words
  stemmed_content = stemmed_content.split()
  # drop the stopwords (insignificant words) and stem every remaining word to its root form
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  # then join all of the words back into one string
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content
  

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)
# apply the stemming function to the content column (author + title) and store the result back in the same column of the dataframe

In [None]:
# all insignificant words (stopwords) are removed during the stemming process
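
The cleaning steps that run before the stemmer can be sketched with the standard library alone. This is only an illustration under stated assumptions: `DEMO_STOPWORDS` is a hypothetical, tiny stand-in for the full nltk list, the `clean` helper is not part of the notebook, the sample headline is made up, and no stemming is applied here.

```python
import re

# Hypothetical, tiny stopword set for illustration; the notebook uses
# the full list from nltk.corpus.stopwords instead.
DEMO_STOPWORDS = {"the", "a", "we", "didn", "t", "s"}


def clean(text):
    """Keep letters only, lowercase, split into words, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in DEMO_STOPWORDS]


sample = "Breaking: The 3 Senators Didn't Vote!"  # made-up headline
tokens = clean(sample)
print(tokens)
```

Note how the apostrophe splits "Didn't" into "didn" and "t"; both fragments are then removed because they are in the stopword set.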

In [None]:
# using the text column would take a long time to process, but the author and title work just fine
X = news_dataset['content'].values 
Y = news_dataset['label'].values

In [None]:
print(X) # this is our data; "..." in the output stands for the many values that are not shown
# this is what we will feed our model

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [None]:
print(Y) # labels 

[1 0 1 ... 0 1 1]


In [None]:
Y.shape # we have 20800 labels

(20800,)

In [None]:
# a model can't work with raw text like the above, so we have to vectorize all of the text

In [None]:
# converting the textual data to numerical data
# tf = term frequency, idf = inverse document frequency
  # tf counts how many times a word appears in a document
    # idf lowers the weight of words that appear in many documents
    # --> a high tf-idf value means the word is important for that document
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
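
To show roughly what the vectorizer computes, here is a hand-rolled sketch on three made-up, already-stemmed documents. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which matches the `TfidfVectorizer` defaults; the `tfidf_row` helper and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Made-up, already-stemmed documents for illustration only.
docs = ["storm hit coast", "storm miss coast", "alien land coast"]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    # Smoothed idf, assumed to match the TfidfVectorizer defaults.
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the first document, "coast" appears in every document and gets the lowest weight, "storm" appears in two and sits in the middle, and "hit" appears only once and gets the highest weight.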

In [None]:
print(X)
# now it has converted the text from above to numbers

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

splitting the dataset to training and test data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
# splitting the data into training data and testing data
  # 80% is used for training and 20% for testing - that's why test_size is 0.2
  # the labels for X are stored in Y
  # stratify=Y keeps the proportion of fake and real news the same in both splits
  # random_state=2 fixes the random seed so the same split is reproduced on every run; the value 2 itself is arbitrary
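
What stratification means can be sketched without scikit-learn. The hypothetical `stratified_split` helper below is not how `train_test_split` is implemented; it only illustrates that each class is shuffled and split separately, so both parts keep the same label proportions, and that a fixed seed makes the split repeatable.

```python
import random


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_indices, test_indices) with equal class proportions."""
    rng = random.Random(seed)  # fixed seed -> same split on every run
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 60 + [1] * 40  # made-up labels: 60% real, 40% fake
train_idx, test_idx = stratified_split(labels)

fake_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
fake_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), fake_share_train, fake_share_test)
```

Both shares come out at 40%, the same as in the full label list, which is the property `stratify=Y` asks for.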

training the model: logistic regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)

LogisticRegression()

evaluation

In [None]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9790865384615385
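
The accuracy score is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels, where the `accuracy` helper is a hypothetical stand-in for `accuracy_score`:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [1, 0, 1, 1, 0]  # made-up true labels
y_pred = [1, 0, 0, 1, 0]  # made-up predictions, one mistake
score = accuracy(y_true, y_pred)
print(score)
```

Four of the five made-up predictions are correct, so the score is 0.8. Accuracy on its own does not say whether the mistakes are fake articles marked real or the other way round.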


In [None]:
X_new = X_test[4]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real


making a predictive system

In [None]:
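print(Y_test[10]) # true label of test sample 10; the prediction above used X_test[4], so compare with Y_test[4] to check it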

0
