<a href="https://colab.research.google.com/github/1500-Shubham/ML_With_Python_GoogleColab/blob/main/FakeNewsDetectionProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake News Detection
- Uses text dataset preprocessing
  - stopwords, regex, string manipulation, stemming, and TF-IDF vectorization to convert text to numbers
- Vectorizer
  - some words add little importance to a document
    - e.g. if the word "avenger" appears in every document, it does not help tell documents apart, so it adds little to training
  - measures the significance of words
  - creates a feature vector from an array of text
  - before vectorizing, x has shape (20800,), i.e. an array of 20800 strings
  - after vectorizing, x has shape (20800, 17128): for each document, the importance of each of the 17128 unique words
- Logistic regression model for training
  - from sklearn.linear_model import LogisticRegression
  - uses the sigmoid function $y = 1 / (1 + e^{-z})$, where $z = w \cdot x + b$
  - w: weights, b: bias
  - the sigmoid output is thresholded to give a label y = 0 or 1
  - it is a classification model
- Accuracy measurement
  - from sklearn.metrics import accuracy_score
- Model training and accuracy
  - model.fit(x_train, y_train)
  - x_train_prediction = model.predict(x_train)
  - training_data_accuracy = accuracy_score(y_train, x_train_prediction)
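
As a rough illustration of the sigmoid step described above, the sketch below computes z = w·x + b and passes it through the sigmoid. The weights, bias, and feature values are made up for the example and are not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vector, for illustration only.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.0, 2.0])

z = np.dot(w, x) + b               # linear score z = w.x + b
probability = sigmoid(z)           # estimated probability of class 1
label = int(probability >= 0.5)    # threshold at 0.5 to get 0 or 1

print(z, probability, label)
```

A negative z gives a probability below 0.5 and therefore label 0; a positive z gives label 1.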

### Text Dataset Preprocessing
- re library for regular expressions
- nltk
  - Natural Language Toolkit, used for text processing
  - a corpus is a collection of texts
  - Stopwords: English words that appear frequently in English sentences
    - they are removed during preprocessing because they convey little information
  - PorterStemmer: used for stemming
- Stemming: the process of reducing a word to its root word
  - e.g. enjoy, enjoying, and enjoyable can all be replaced with enjoy
- re.sub('[^a-zA-Z]', ' ', text) uses a regular expression to replace every non-letter character with a space
- Text data to numerical data
  - use TfidfVectorizer
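
A minimal sketch of the regex cleaning step, using only the standard library. The helper name `clean_text` is hypothetical; the sample string is one of the titles from the dataset with a made-up trailing year added to show digits being removed.

```python
import re

def clean_text(text):
    # Replace every character that is not an ASCII letter with a space.
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # Lowercase, then split on whitespace to get a list of tokens.
    return letters_only.lower().split()

tokens = clean_text("FLYNN: Hillary Clinton, Big Woman on Campus - 2016")
print(tokens)
# ['flynn', 'hillary', 'clinton', 'big', 'woman', 'on', 'campus']
```

Note that apostrophes are removed as well, so a word like "Didn’t" is split into the two tokens "didn" and "t".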

Steps:
- Select the input fields and merge the columns
- Text -> remove stopwords and apply stemming
- Text -> numbers using the vectorizer
  - the vectorizer expects each document as a single string, not a list of words, for fit and transform
- Now everything is numerical
- Split into train and test sets, then train

In [2]:
import numpy as np
import pandas as pd
import re
import nltk  # import the whole library
from nltk.corpus import stopwords
# nltk: Natural Language Toolkit, used for text processing
# a corpus is a collection of texts
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

##### Downloading stopwords

In [3]:
nltk.download('stopwords')
# Printing the Swedish stopword list as an example; the preprocessing below uses the English list
print(stopwords.words('swedish'))

['och', 'det', 'att', 'i', 'en', 'jag', 'hon', 'som', 'han', 'på', 'den', 'med', 'var', 'sig', 'för', 'så', 'till', 'är', 'men', 'ett', 'om', 'hade', 'de', 'av', 'icke', 'mig', 'du', 'henne', 'då', 'sin', 'nu', 'har', 'inte', 'hans', 'honom', 'skulle', 'hennes', 'där', 'min', 'man', 'ej', 'vid', 'kunde', 'något', 'från', 'ut', 'när', 'efter', 'upp', 'vi', 'dem', 'vara', 'vad', 'över', 'än', 'dig', 'kan', 'sina', 'här', 'ha', 'mot', 'alla', 'under', 'någon', 'eller', 'allt', 'mycket', 'sedan', 'ju', 'denna', 'själv', 'detta', 'åt', 'utan', 'varit', 'hur', 'ingen', 'mitt', 'ni', 'bli', 'blev', 'oss', 'din', 'dessa', 'några', 'deras', 'blir', 'mina', 'samma', 'vilken', 'er', 'sådan', 'vår', 'blivit', 'dess', 'inom', 'mellan', 'sådant', 'varför', 'varje', 'vilka', 'ditt', 'vem', 'vilket', 'sitta', 'sådana', 'vart', 'dina', 'vars', 'vårt', 'våra', 'ert', 'era', 'vilkas']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


##### Data preprocessing

In [4]:
# Load the training data from /content/train.csv
df = pd.read_csv("/content/train.csv")
df.head()
# Label: 0 = real, 1 = fake
# Goal: train a model to tell which news items are fake and which are real
df.shape

(20800, 5)

##### Label 0 - Real
##### Label 1 - Fake

##### Checking for missing values

In [5]:
df.isnull().sum()

Unnamed: 0,0
id,0
title,558
author,1957
text,39
label,0


##### Handling missing values
- Missing text is replaced with an empty string ''
- For numerical data, missing values are usually replaced with the mean, median, or mode

In [6]:
df = df.fillna('')
df.isnull().sum()

Unnamed: 0,0
id,0
title,0
author,0
text,0
label,0
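
The two strategies mentioned for missing values can be sketched on a small made-up frame. The column `word_count` is hypothetical and does not exist in the dataset; it is only there to show the numeric case.

```python
import numpy as np
import pandas as pd

# Made-up frame with one text column and one numeric column.
toy = pd.DataFrame({
    "title": ["Headline A", None, "Headline C"],
    "word_count": [120.0, np.nan, 80.0],
})

# Text column: an empty string keeps later string concatenation working.
toy["title"] = toy["title"].fillna("")
# Numeric column: fill with the column mean (median or mode work the same way).
toy["word_count"] = toy["word_count"].fillna(toy["word_count"].mean())

print(toy)
print(toy.isnull().sum().sum())  # total number of remaining missing values
```

Filling text with an empty string matters here because concatenating a string with a missing value in pandas produces a missing value, which would otherwise carry over into any merged column.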


##### Merging two features to use as input
- author and title are the inputs used to predict the output label

In [7]:
df['content']= df['author'] + ' '+ df['title']
df.head()

Unnamed: 0,id,title,author,text,label,content
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Consortiumnews.com Why the Truth Might Get You...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,Jessica Purkiss 15 Civilians Killed In Single ...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Howard Portnoy Iranian woman jailed for fictio...


##### Separating target and features
- feature: content
- target: label

In [8]:
x= df['content']
y=df['label']
x.head()

Unnamed: 0,content
0,Darrell Lucus House Dem Aide: We Didn’t Even S...
1,"Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo..."
2,Consortiumnews.com Why the Truth Might Get You...
3,Jessica Purkiss 15 Civilians Killed In Single ...
4,Howard Portnoy Iranian woman jailed for fictio...


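##### Stemming
- The process of reducing a word to its root word
- The idea: actor, actress, acting are all treated as forms of act (the actual PorterStemmer output may differ for some words)
- Uses PorterStemmer
- re.sub('[^a-zA-Z]', ' ', content) keeps only a-z and A-Z; ^ inside the brackets means "not", so every other character is replaced with a space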

In [11]:
port_stem= PorterStemmer()

In [12]:
# Build the stopword set once; a set lookup is much faster than rebuilding the list for every word
english_stopwords = set(stopwords.words('english'))

def stemming(content):
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)  # replace non-letter characters with ' '
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()  # split on whitespace into words
  # drop stopwords and reduce every remaining word to its stem
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
  # join the list back into one string, which the vectorizer needs later
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content
# Summary of the steps for a raw content string such as "string , : like this":
# 1. replace non-letter characters and convert to lowercase
# 2. split into words separated by ' '
# 3. ignore words present in the English stopword list
# 4. convert each remaining word to its root with the Porter stemmer


In [13]:
# Apply the stemming function to every row of the content column
df['content'] = df['content'].apply(stemming)
print(df['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


##### Taking x and y data as numpy arrays

In [19]:
x= df['content'].values
y=df['label'].values
print(x)
print(y)
print(x.shape)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']
[1 0 1 ... 0 1 1]
(20800,)


##### Converting text to a feature vector
- A model needs numbers as input, so the text data is converted to numerical data

In [22]:
vectorizer= TfidfVectorizer()
vectorizer.fit(x)
x= vectorizer.transform(x)
# the vectorizer expects a flat collection of strings, e.g. ["string", "string"], not nested lists like [[""], [""]]

In [24]:
print(x.shape)
print(x[0])


(20800, 17128)
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 12 stored elements and shape (1, 17128)>
  Coords	Values
  (0, 267)	0.2701012497770876
  (0, 2483)	0.36765196867972083
  (0, 2959)	0.24684501285337127
  (0, 3600)	0.3598939188262558
  (0, 3792)	0.27053324808454915
  (0, 4973)	0.23331696690935097
  (0, 7005)	0.2187416908935914
  (0, 7692)	0.24785219520671598
  (0, 8630)	0.2921251408704368
  (0, 8909)	0.36359638063260746
  (0, 13473)	0.2565896679337956
  (0, 15686)	0.2848506356272864


### Model Training
- x has been transformed into feature vectors
- y is already numeric (0 or 1)

#### Splitting data

In [25]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,stratify=y,random_state=2)


In [27]:
print(x_train.shape,x_test.shape)

(16640, 17128) (4160, 17128)
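
The `stratify=y` argument asks `train_test_split` to keep the proportion of each label roughly the same in the train and test sets. A small sketch with made-up, imbalanced labels (not the notebook's data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones.
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# The fraction of ones should be about the same in both splits.
print(l_train.mean(), l_test.mean())
```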


#### Training the model: Logistic Regression

In [28]:
model = LogisticRegression()

In [29]:
model.fit(x_train,y_train)

#### Evaluation
- Accuracy Score

##### Accuracy score on training data

In [31]:
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)  # argument order is (y_true, y_pred)
print(training_data_accuracy)

0.9863581730769231


##### Accuracy score on test data

In [32]:
x_test_prediction = model.predict(x_test)
testing_data_accuracy = accuracy_score(y_test, x_test_prediction)  # argument order is (y_true, y_pred)
print(testing_data_accuracy)

0.9790865384615385
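
Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this by hand against `accuracy_score` on made-up labels; `y_true` and `y_pred` are illustrative arrays, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Fraction of positions where the prediction equals the true label.
manual_accuracy = np.mean(y_true == y_pred)

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
```

Both lines print 0.8, since four of the five predictions are correct.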


### Making a Predictive System

In [37]:
x_new= x_test[3]
print(x_new)
print(y_test[3])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 13 stored elements and shape (1, 17128)>
  Coords	Values
  (0, 1894)	0.10717190161988106
  (0, 2407)	0.22941655373198822
  (0, 2510)	0.19575583865343713
  (0, 2806)	0.22586908309808346
  (0, 3782)	0.5146474416282094
  (0, 5529)	0.2895160407717207
  (0, 6868)	0.2729431954288567
  (0, 7668)	0.18748490820807123
  (0, 11217)	0.33024740742035963
  (0, 11246)	0.340188979812318
  (0, 11888)	0.25336998543209355
  (0, 12693)	0.21157740447059398
  (0, 14632)	0.2333702891139994
0


In [38]:
prediction = model.predict(x_new)
print(prediction)

[0]
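
The cells above predict on a row that is already vectorized. For a new raw headline, the text would first need the same cleaning and then `transform` (not `fit`) on the already fitted vectorizer, so that the columns line up with what the model was trained on. The following is a self-contained sketch of that idea on a tiny made-up corpus; the texts, labels, and the helper `predict_label` are assumptions for illustration, and the inputs are assumed to be cleaned and stemmed already.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus, already cleaned; labels follow the notebook (0 real, 1 fake).
train_texts = [
    "senat pass budget bill",
    "court rule tax case",
    "alien secretli control govern",
    "miracl cure doctor hate",
]
train_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(train_texts)
toy_model = LogisticRegression()
toy_model.fit(toy_features, train_labels)

def predict_label(cleaned_text):
    # transform (not fit_transform) reuses the vocabulary learned in training
    features = toy_vectorizer.transform([cleaned_text])
    return int(toy_model.predict(features)[0])

result = predict_label("senat pass tax bill")
print("Fake" if result == 1 else "Real")
```

Words that were not seen during training are ignored by `transform`, so a headline made up entirely of unseen words gives an all-zero feature row.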
