# Following the CRISP-DM Methodology

## Business Understanding

Suppose the following:

**Goal**: Classify news articles as fake or true

**Stakeholders**: News agencies, social media platforms, decision makers

**Value Proposition**: Improve credibility and trustworthiness

**Data Source**: https://drive.google.com/drive/folders/1ByadNwMrPyds53cA6SDCHLelTAvIdoF_

**Metrics**: Accuracy score and classification report

---
*Note: This is an example scenario to frame what I am building.*
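
As a rough illustration of the two metrics named above, the following sketch computes them on a handful of made-up labels. The lists `y_true` and `y_pred` are hypothetical and not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels for illustration only: 1 = true news, 0 = fake news
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Accuracy is the share of predictions that match the true label
accuracy = accuracy_score(y_true, y_pred)

# The report adds per-class precision, recall and F1
report = classification_report(y_true, y_pred, target_names=["fake", "true"])

print(accuracy)
print(report)
```

Accuracy alone can hide class-specific errors, which is why the per-class report is read alongside it.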

# Data Understanding 

In [52]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import re
import string

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

In [2]:
fake_df = pd.read_csv("Dataset/Fake.csv")
true_df = pd.read_csv("Dataset/True.csv")

In [3]:
true_df.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [4]:
print(fake_df.shape)
print(true_df.shape)
print(fake_df.size)
print(true_df.size)

(23481, 4)
(21417, 4)
93924
85668


The fake and true datasets differ in size: 23,481 rows versus 21,417 rows.
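
The gap is modest. A quick calculation with the row counts printed above shows how the two classes are balanced; the variable names are illustrative.

```python
# Row counts taken from the shapes printed above
n_fake = 23481
n_true = 21417

total = n_fake + n_true
fake_share = n_fake / total
true_share = n_true / total

print(f"fake: {fake_share:.1%}, true: {true_share:.1%}")
```

With shares this close, plain accuracy remains a reasonable headline metric.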

In [5]:
print(fake_df.info())
print(true_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
None


All four columns in both frames are stored as `object` (string) dtype, and none contains null values.
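
Because every column is read in as a string, the `date` column would need parsing before any time-based analysis. The sketch below uses a made-up frame and assumes dates follow the `December 31, 2017` pattern seen in the sample rows; `sample_df` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the last value is deliberately malformed
sample_df = pd.DataFrame({
    "date": ["December 31, 2017", "December 29, 2017", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(sample_df["date"], format="%B %d, %Y", errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

Counting the `NaT` values afterwards is a cheap way to find rows whose date field holds something other than a date.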


In [6]:
fake_df.subject.unique()

array(['News', 'politics', 'Government News', 'left-news', 'US_News',
       'Middle-east'], dtype=object)

In [7]:
fake_df.subject.value_counts()

subject
News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: count, dtype: int64

In [8]:
true_df.subject.unique()

array(['politicsNews', 'worldnews'], dtype=object)

In [9]:
true_df.subject.value_counts()

subject
politicsNews    11272
worldnews       10145
Name: count, dtype: int64
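
The subject labels of the two files do not overlap, which can be checked directly from the values listed above. This matters because a model given `subject` as a feature could separate the classes without reading the text at all. The sketch below uses only the printed label names.

```python
# Subject labels copied from the outputs above
fake_subjects = {"News", "politics", "Government News",
                 "left-news", "US_News", "Middle-east"}
true_subjects = {"politicsNews", "worldnews"}

# An empty intersection means `subject` alone identifies the class
overlap = fake_subjects & true_subjects
print(overlap)
```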

# Data Preparation

In [10]:
true_df['class'] = 1
fake_df['class'] = 0

Add a `class` column that labels each row: 1 for true news, 0 for fake news.

In [11]:
fake_df.head(2)

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0


In [12]:
true_fake_df = pd.concat([true_df,fake_df],axis=0,ignore_index=True)

In [13]:
true_fake_df.head(2)

Unnamed: 0,title,text,subject,date,class
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1


In [14]:
true_fake_df.tail(2)

Unnamed: 0,title,text,subject,date,class
44896,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0
44897,10 U.S. Navy Sailors Held by Iranian Military ...,21st Century Wire says As 21WIRE predicted in ...,Middle-east,"January 12, 2016",0


The two frames are now stacked into one, with the true news on top and the fake news at the bottom.
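
The labelling and stacking steps can be sketched on two tiny made-up frames. `toy_true` and `toy_fake` are hypothetical stand-ins for the real data.

```python
import pandas as pd

toy_true = pd.DataFrame({"text": ["real story a", "real story b"]})
toy_fake = pd.DataFrame({"text": ["made-up story c"]})

# Label each source before combining so the origin is not lost
toy_true["class"] = 1
toy_fake["class"] = 0

# ignore_index=True rebuilds a continuous 0..n-1 index
combined = pd.concat([toy_true, toy_fake], axis=0, ignore_index=True)

print(combined)
print(combined["class"].value_counts())
```

Without `ignore_index=True`, both source indexes would be kept and index values would repeat in the combined frame.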

In [15]:
true_fake_df.shape

(44898, 5)

In [16]:
true_fake_df.isnull().sum()

title      0
text       0
subject    0
date       0
class      0
dtype: int64

No column contains null values.

In [17]:
true_fake_df.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

In [18]:
_column = ['title','subject','date']
true_fake_df.drop(columns=_column, inplace=True)

In [19]:
true_fake_df.columns

Index(['text', 'class'], dtype='object')

Only the columns needed for modelling, `text` and `class`, remain.

In [20]:
true_fake_df.duplicated().value_counts()

False    38647
True      6251
Name: count, dtype: int64

In [21]:
true_df.duplicated().value_counts()

False    21211
True       206
Name: count, dtype: int64

In [22]:
fake_df.duplicated().value_counts()

False    23478
True         3
Name: count, dtype: int64

In [23]:
true_fake_df.duplicated(subset=['text']).value_counts()

False    38646
True      6252
Name: count, dtype: int64

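After merging and dropping columns there are 6,251 duplicated rows, although the original frames contained only 206 (true) and 3 (fake).

The dropped columns were masking the duplicates: many articles share identical text but differ in `title`, `subject`, or `date`, so they only count as duplicates once those columns are removed. This is a surprisingly useful finding. The one-row gap between 6,252 (`text` only) and 6,251 (`text` and `class`) also suggests that at least one text appears under both labels.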
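
The effect can be reproduced on a small made-up frame: rows that differ only in `title` are not duplicates until that column is dropped. The frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Headline A", "Headline B", "Headline C"],
    "text":  ["same body", "same body", "other body"],
    "class": [1, 1, 0],
})

# With the title included, every row is unique
dupes_with_title = toy.duplicated().sum()

# Without the title, the second row repeats the first
dupes_without_title = toy.drop(columns=["title"]).duplicated().sum()

print(dupes_with_title, dupes_without_title)
```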

In [24]:
true_fake_df.drop_duplicates(inplace=True)

In [25]:
true_fake_df.duplicated().value_counts()

False    38647
Name: count, dtype: int64

The duplicates have been removed successfully.

### Random Sampling

In [26]:
rdm_data = true_fake_df.sample(frac=1)

In [27]:
rdm_data.head(3)

Unnamed: 0,text,class
20532,MOSCOW (Reuters) - U.S. Undersecretary of Stat...,1
26211,Donald Trump recently gave a speech to the Ame...,0
15005,TUNIS (Reuters) - A group of 25 refugees have ...,1


In [28]:
rdm_data.reset_index(inplace=True)

In [29]:
rdm_data.columns

Index(['index', 'text', 'class'], dtype='object')

In [30]:
rdm_data.drop(columns = ['index'],inplace=True,axis=1)

In [31]:
rdm_data.head(2)

Unnamed: 0,text,class
0,MOSCOW (Reuters) - U.S. Undersecretary of Stat...,1
1,Donald Trump recently gave a speech to the Ame...,0


### Function to process the text

In [32]:
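def wordopt(txt):
    """Lowercase the text and strip markup, URLs, punctuation and digits."""
    txt = txt.lower()
    txt = re.sub(r'\[.*?\]', '', txt)                # drop bracketed spans
    # URLs and HTML tags are removed before \W is replaced;
    # afterwards these patterns could no longer match
    txt = re.sub(r'https?://\S+|www\.\S+', '', txt)
    txt = re.sub(r'<.*?>+', '', txt)
    txt = re.sub(r'\W', ' ', txt)                    # non-word characters -> space
    # remaining punctuation, i.e. underscores, which \W does not cover
    txt = re.sub('[%s]' % re.escape(string.punctuation), '', txt)
    txt = re.sub(r'\n', '', txt)
    txt = re.sub(r'\w*\d\w*', '', txt)               # drop tokens containing digits
    return txt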
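
The order of the substitutions matters: once `\W` has replaced every non-word character with a space, a URL pattern can no longer match. A small illustration follows, using a made-up sentence and the same URL pattern as the function.

```python
import re

URL_PATTERN = r'https?://\S+|www\.\S+'
raw = "Read more at https://example.com/story now"   # made-up sentence

# URL removed while its punctuation is still intact
url_first = re.sub(URL_PATTERN, '', raw)

# Non-word characters replaced first: the URL pattern no longer matches
nonword_first = re.sub(URL_PATTERN, '', re.sub(r'\W', ' ', raw))

print(repr(url_first))
print(repr(nonword_first))
```

In the second case the URL fragments survive as ordinary tokens and end up in the TF-IDF vocabulary.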

In [33]:
true_fake_df.text = true_fake_df.text.apply(wordopt)

In [34]:
true_fake_df.head(2)

Unnamed: 0,text,class
0,washington reuters the head of a conservat...,1
1,washington reuters transgender people will...,1


In [35]:
x = true_fake_df['text']
y = true_fake_df['class']

`x` holds the input feature (the cleaned text) and `y` the target (the class label); both are needed for the split.

### Train/Test Split

In [36]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
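
The split above draws a fresh random shuffle on every run, so the scores further down will vary slightly between runs. A sketch of a reproducible, class-balanced split on made-up data follows; `random_state=0` and `stratify` are additions for illustration, not settings from the original notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 documents, half per class
texts = [f"document {i}" for i in range(10)]
labels = [0] * 5 + [1] * 5

# stratify keeps the class ratio equal in both parts;
# random_state makes the split repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

print(len(x_tr), len(x_te))
print(sorted(y_te))
```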

### Text into Vector

In [37]:
vectorization  = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
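
Fitting the vectorizer on the training text only, then reusing it on the test text, keeps test vocabulary from leaking into training. A small sketch with made-up sentences shows that words unseen during fitting are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents
train_docs = ["reuters reports budget vote", "shocking video goes viral"]
test_docs = ["budget video unseen"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)         # reuses them, learns nothing new

print(X_train.shape, X_test.shape)
print("unseen" in vectorizer.vocabulary_)
```

Both matrices share the same number of columns, which is what lets a model trained on one be scored on the other.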

# Modelling and Evaluation

In [38]:
logistic = LogisticRegression()
logistic.fit(xv_train,y_train)

In [39]:
y_pred = logistic.predict(xv_test)

In [40]:
logistic.score(xv_test,y_test)

0.986287192755498

In [41]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      3561
           1       0.98      0.99      0.99      4169

    accuracy                           0.99      7730
   macro avg       0.99      0.99      0.99      7730
weighted avg       0.99      0.99      0.99      7730
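
The vectorizer and classifier can also be bundled into a single scikit-learn `Pipeline`, which guarantees that the same fitted vocabulary is applied at prediction time. This is an alternative arrangement shown on made-up data, not the structure used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data: 1 = true news, 0 = fake news
docs = ["officials confirmed the budget vote",
        "senate passed the bill on tuesday",
        "you will not believe this shocking secret",
        "shocking trick doctors hate"]
labels = [1, 1, 0, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(docs, labels)

# Raw strings go in; vectorization happens inside the pipeline
predictions = model.predict(["senate vote on the budget"])
print(predictions)
```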



In [42]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(xv_train,y_train)

In [43]:
y1_pred = decision_tree.predict(xv_test)

In [44]:
decision_tree.score(xv_test,y_test)

0.9965071151358345

In [50]:
print(classification_report(y_test,y1_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3561
           1       1.00      1.00      1.00      4169

    accuracy                           1.00      7730
   macro avg       1.00      1.00      1.00      7730
weighted avg       1.00      1.00      1.00      7730
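
Near-perfect scores deserve a second look. Every true-news sample printed earlier opens with a dateline such as `WASHINGTON (Reuters) -`, so a classifier may be keying on that one token rather than on the content. One way to test this, sketched below, is to strip the dateline from the raw text before cleaning and then retrain. The helper `strip_dateline` is hypothetical, the example strings are made up, and the pattern assumes the dateline sits within the first 80 characters.

```python
import re

# Lazily match up to 80 leading characters, then "(Reuters)" and the dash
DATELINE = re.compile(r'^.{0,80}?\(Reuters\)\s*-\s*')

def strip_dateline(text):
    """Remove a leading '<CITY> (Reuters) - ' prefix if one is present."""
    return DATELINE.sub('', text, count=1)

print(strip_dateline("WASHINGTON (Reuters) - Example sentence."))
print(strip_dateline("No dateline here"))
```

If accuracy drops noticeably after this step, the original score was partly a product of the source formatting rather than of the article content.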



In [47]:
gradient_classifier = GradientBoostingClassifier(random_state=0)
gradient_classifier.fit(xv_train,y_train)

In [48]:
y2_pred = gradient_classifier.predict(xv_test)

In [49]:
gradient_classifier.score(xv_test,y_test)

0.9961190168175937

In [51]:
print(classification_report(y_test,y2_pred))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      3561
           1       0.99      1.00      1.00      4169

    accuracy                           1.00      7730
   macro avg       1.00      1.00      1.00      7730
weighted avg       1.00      1.00      1.00      7730



In [54]:
rdm_forest_classifier = RandomForestClassifier(random_state=0)
rdm_forest_classifier.fit(xv_train,y_train)

In [55]:
y3_pred = rdm_forest_classifier.predict(xv_test)

In [56]:
rdm_forest_classifier.score(xv_test,y_test)

0.9865459249676585

In [57]:
print(classification_report(y_test,y3_pred))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      3561
           1       0.99      0.99      0.99      4169

    accuracy                           0.99      7730
   macro avg       0.99      0.99      0.99      7730
weighted avg       0.99      0.99      0.99      7730
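
Collecting the four accuracy scores reported above in one place makes the comparison easier. The values are copied from the outputs; the keys are just labels chosen here.

```python
import pandas as pd

# Test-set accuracies as printed by each model's .score() call above
scores = pd.Series({
    "LogisticRegression": 0.986287192755498,
    "DecisionTreeClassifier": 0.9965071151358345,
    "GradientBoostingClassifier": 0.9961190168175937,
    "RandomForestClassifier": 0.9865459249676585,
}).sort_values(ascending=False)

print(scores.round(4))
```

All four scores come from a single unseeded split, so differences of a fraction of a percentage point may not be stable across runs.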

