# Hoax News Detection
A project that aims to develop a Machine Learning model capable of accurately identifying and distinguishing fake news (hoaxes) from real news. The model is built from an analysis of collected news data, using features that separate the characteristics of fake news from those of real news.

## Dataset
The dataset is "FAKE-REAL NEWS", credited here to Kaggle; it can be accessed at the following link (hosted on Mendeley Data): https://data.mendeley.com/datasets/p3hfgr5j3m/1

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import re
import string

In [2]:
df_valid = pd.read_csv("True.csv")

In [3]:
df_hoax = pd.read_csv("Fake.csv")

In [4]:
df_valid.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [5]:
df_hoax.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [6]:
rows_valid = df_valid.shape
rows_hoax = df_hoax.shape
print('Shape (rows, columns) of the valid news dataset : ' + str(rows_valid) + '\nShape (rows, columns) of the hoax news dataset : ' + str(rows_hoax))

Shape (rows, columns) of the valid news dataset : (21417, 4)
Shape (rows, columns) of the hoax news dataset : (23481, 4)


In [7]:
df_valid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


In [8]:
df_hoax.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB


### Data Validation (Missing Values)

In [9]:
df_valid.isna().sum()

title      0
text       0
subject    0
date       0
dtype: int64

In [10]:
df_hoax.isna().sum()

title      0
text       0
subject    0
date       0
dtype: int64

### Data Validation (Duplicated)

In [11]:
duplikat_valid = df_valid.duplicated().sum()
duplikat_hoax = df_hoax.duplicated().sum()
print('Number of duplicate rows in valid news : ' + str(duplikat_valid) + '\nNumber of duplicate rows in hoax news : ' + str(duplikat_hoax))

Number of duplicate rows in valid news : 206
Number of duplicate rows in hoax news : 3


In [12]:
duplicate_mask = df_valid.duplicated(keep=False)
duplicates = df_valid[duplicate_mask]
print(duplicates)

                                                   title  \
416    Senate tax bill stalls on deficit-focused 'tri...   
445    Senate tax bill stalls on deficit-focused 'tri...   
762    Trump warns 'rogue regime' North Korea of grav...   
778    Trump warns 'rogue regime' North Korea of grav...   
850    Republicans unveil tax cut bill, but the hard ...   
...                                                  ...   
21290  Europeans, Africans agree renewed push to tack...   
21345  Thailand's ousted PM Yingluck has fled abroad:...   
21353  Thailand's ousted PM Yingluck has fled abroad:...   
21406  U.S., North Korea clash at U.N. forum over nuc...   
21408  U.S., North Korea clash at U.N. forum over nuc...   

                                                    text       subject  \
416    WASHINGTON (Reuters) - The U.S. Senate on Thur...  politicsNews   
445    WASHINGTON (Reuters) - The U.S. Senate on Thur...  politicsNews   
762    BEIJING (Reuters) - U.S. President Donald Trum... 

After checking the dataset directly, the rows listed above, with indices ranging from 416 to 21408, are confirmed to contain identical data.
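The difference between counting duplicates and listing them can be seen on a small example. This is a minimal sketch on a made-up frame (the `toy` data is an assumption, not taken from the dataset): `duplicated()` flags only the later copies, while `duplicated(keep=False)` flags every member of a duplicate group.

```python
import pandas as pd

# Toy frame (hypothetical data): rows 0 and 2 are identical.
toy = pd.DataFrame({
    "title": ["a", "b", "a"],
    "text": ["x", "y", "x"],
})

# Default keep="first": only the later copies are flagged.
n_extra_copies = int(toy.duplicated().sum())

# keep=False: every member of a duplicate group is flagged.
all_members = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each group.
deduped = toy.drop_duplicates()

print(n_extra_copies, list(all_members.index), list(deduped.index))
```

This is why the count reported earlier (206) is smaller than the number of rows printed with `keep=False`.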

In [13]:
# Remove duplicate rows
df_valid = df_valid.drop_duplicates()

In [14]:
df_valid.duplicated().sum()

0

In [15]:
# Check duplicate rows in df_hoax
duplicate_mask = df_hoax.duplicated(keep=False)
duplicates = df_hoax[duplicate_mask]
print(duplicates)

                                                   title  \
9941   HILLARY TWEETS MESSAGE In Defense Of DACA…OOPS...   
9942   HILLARY TWEETS MESSAGE In Defense Of DACA…OOPS...   
11445  FORMER DEMOCRAT WARNS Young Americans: “Rioter...   
11446  FORMER DEMOCRAT WARNS Young Americans: “Rioter...   
14924  [VIDEO] #BlackLivesMatter Terrorists Storm Dar...   
14925  [VIDEO] #BlackLivesMatter Terrorists Storm Dar...   

                                                    text   subject  \
9941   No time to waste   we've got to fight with eve...  politics   
9942   No time to waste   we've got to fight with eve...  politics   
11445   Who is silencing political speech, physically...  politics   
11446   Who is silencing political speech, physically...  politics   
14924  They were probably just looking for a  safe sp...  politics   
14925  They were probably just looking for a  safe sp...  politics   

               date  
9941    Sep 9, 2017  
9942    Sep 9, 2017  
11445  Mar 10, 2017  


In [16]:
# Remove duplicate rows
df_hoax = df_hoax.drop_duplicates()

In [17]:
duplikat_valid = df_valid.duplicated().sum()
duplikat_hoax = df_hoax.duplicated().sum()
print('Number of duplicate rows in valid news : ' + str(duplikat_valid) + '\nNumber of duplicate rows in hoax news : ' + str(duplikat_hoax))

Number of duplicate rows in valid news : 0
Number of duplicate rows in hoax news : 0


### Data Transform

Declare the target column (label): hoax news is tagged 0 and valid news is tagged 1.

In [18]:
df_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21211 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21211 non-null  object
 1   text     21211 non-null  object
 2   subject  21211 non-null  object
 3   date     21211 non-null  object
dtypes: object(4)
memory usage: 828.6+ KB


In [19]:
df_valid.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [20]:
df_hoax["Tag"] = 0
df_valid["Tag"] = 1

In [21]:
df_valid.head()

Unnamed: 0,title,text,subject,date,Tag
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [22]:
df_hoax.head()

Unnamed: 0,title,text,subject,date,Tag
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [23]:
# Set aside 10 rows from each dataset for manual testing
df_valid.shape, df_hoax.shape

((21211, 5), (23478, 5))

In [24]:
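# Keep the last 10 rows of each frame for manual testing.
df_valid_manual_testing = df_valid.tail(10)
# Caution: drop() removes rows by index label, not by position. After
# drop_duplicates() the labels have gaps, so labels 21202-21211 are not the
# tail(10) rows (whose labels run from 21406 to 21416).
for i in range(21211,21201,-1):
    df_valid.drop([i], axis = 0, inplace = True)

df_hoax_manual_testing = df_hoax.tail(10)
# Same caveat: this drops labels 23469-23478, so labels 23479 and 23480
# stay in df_hoax even though they are part of the manual-testing rows.
for i in range(23478,23468,-1):
    df_hoax.drop([i], axis = 0, inplace = True)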
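Because `drop()` works on index labels and `drop_duplicates()` leaves gaps in the index, slicing by position is a more robust way to hold out the last rows. A minimal sketch on a toy frame (the data and the name `n_holdout` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Toy frame whose index has gaps, as happens after drop_duplicates().
toy = pd.DataFrame({"text": list("abcdef")}, index=[0, 1, 3, 4, 7, 9])

n_holdout = 2  # the notebook holds out 10 rows per class

# Select by position, not by label, so gaps in the index do not matter.
holdout = toy.iloc[-n_holdout:]
remaining = toy.iloc[:-n_holdout]

# Sanity check: no row appears in both parts.
overlap = remaining.index.intersection(holdout.index)
print(list(holdout.index), list(remaining.index), len(overlap))
```

Applied to the notebook, the same idea would be `df_valid.iloc[:-10]` and `df_hoax.iloc[:-10]`, which keeps the held-out articles out of the training data.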

In [25]:
df_hoax_manual_testing.head()

Unnamed: 0,title,text,subject,date,Tag
23471,Seven Iranians freed in the prisoner swap have...,"21st Century Wire says This week, the historic...",Middle-east,"January 20, 2016",0
23472,#Hashtag Hell & The Fake Left,By Dady Chery and Gilbert MercierAll writers ...,Middle-east,"January 19, 2016",0
23473,Astroturfing: Journalist Reveals Brainwashing ...,Vic Bishop Waking TimesOur reality is carefull...,Middle-east,"January 19, 2016",0
23474,The New American Century: An Era of Fraud,Paul Craig RobertsIn the last years of the 20t...,Middle-east,"January 19, 2016",0
23475,Hillary Clinton: ‘Israel First’ (and no peace ...,Robert Fantina CounterpunchAlthough the United...,Middle-east,"January 18, 2016",0


In [26]:
df_valid_manual_testing.head()

Unnamed: 0,title,text,subject,date,Tag
21406,"U.S., North Korea clash at U.N. forum over nuc...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21407,"Mata Pires, owner of embattled Brazil builder ...","SAO PAULO (Reuters) - Cesar Mata Pires, the ow...",worldnews,"August 22, 2017",1
21409,"U.S., North Korea clash at U.N. arms forum on ...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21410,Headless torso could belong to submarine journ...,COPENHAGEN (Reuters) - Danish police said on T...,worldnews,"August 22, 2017",1
21411,North Korea shipments to Syria chemical arms a...,UNITED NATIONS (Reuters) - Two North Korean sh...,worldnews,"August 21, 2017",1


In [27]:
# Export the manual-testing rows to manual-testing.csv
df_manual_testing = pd.concat([df_valid_manual_testing, df_hoax_manual_testing], axis=0)
df_manual_testing.to_csv("manual-testing.csv")

### Concat Dataframe

In [28]:
df = pd.concat([df_valid, df_hoax], axis=0)

In [29]:
df.head()

Unnamed: 0,title,text,subject,date,Tag
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [30]:
df.shape

(44669, 5)

In [31]:
df.columns

Index(['title', 'text', 'subject', 'date', 'Tag'], dtype='object')

### Dropping Unused Columns

In [32]:
df = df.drop(["title", "subject", "date"], axis = 1)

In [33]:
df.shape

(44669, 2)

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44669 entries, 0 to 23480
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    44669 non-null  object
 1   Tag     44669 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.0+ MB


In [35]:
df = df.rename(columns={'Tag' : 'tag'})

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44669 entries, 0 to 23480
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    44669 non-null  object
 1   tag     44669 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.0+ MB


In [37]:
df.tail()

Unnamed: 0,text,tag
23466,21st Century Wire says For those who still ref...,0
23467,"21st Century Wire says So far, after nearly 20...",0
23468,21st Century Wire says If you ve been followin...,0
23479,21st Century Wire says Al Jazeera America will...,0
23480,21st Century Wire says As 21WIRE predicted in ...,0


In [38]:
df.head()

Unnamed: 0,text,tag
0,WASHINGTON (Reuters) - The head of a conservat...,1
1,WASHINGTON (Reuters) - Transgender people will...,1
2,WASHINGTON (Reuters) - The special counsel inv...,1
3,WASHINGTON (Reuters) - Trump campaign adviser ...,1
4,SEATTLE/WASHINGTON (Reuters) - President Donal...,1


As shown above, the rows are ordered by class: all real news comes first, followed by all hoax news. To keep this ordering from affecting training (the concern here is overfitting), we apply random shuffling so that the two classes are mixed.

### Random Shuffling

In [39]:
df = df.sample(frac = 1)

In [40]:
df.head()

Unnamed: 0,text,tag
12805,Donald Trump bragged in vulgar terms about kis...,0
11851,"#disruptj20 protesters on the move, many leavi...",0
23251,21st Century Wire says If you haven t seen Hil...,0
22350,Tune in to the Alternate Current Radio Network...,0
18939,BEIJING (Reuters) - China said on Wednesday th...,1


In [41]:
df.tail()

Unnamed: 0,text,tag
10691,WASHINGTON (Reuters) - The White House named a...,1
12733,,0
4029,If Trump supporters seriously want black peopl...,0
12624,Remember Combetta is Hillary s Oh Sh*t IT guy:...,0
15626,RIYADH (Reuters) - The deputy governor of Saud...,1


In [42]:
df.columns

Index(['text', 'tag'], dtype='object')

In [43]:
df.reset_index(inplace = True)
df.drop(["index"], axis = 1, inplace = True)
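The shuffle and index reset can also be written as one reproducible step. A minimal sketch on a toy frame, where the seed value `42` is an arbitrary assumption: `random_state` makes the shuffle repeatable, and `reset_index(drop=True)` discards the old index without creating an `index` column.

```python
import pandas as pd

# Toy frame (hypothetical data): real rows first, then hoax rows.
toy = pd.DataFrame({"text": ["r1", "r2", "h1", "h2"], "tag": [1, 1, 0, 0]})

# Shuffle all rows with a fixed seed, then renumber the index from 0.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Running the same call again gives the same order.
again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled.equals(again), list(shuffled.columns))
```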

In [44]:
df.head()

Unnamed: 0,text,tag
0,Donald Trump bragged in vulgar terms about kis...,0
1,"#disruptj20 protesters on the move, many leavi...",0
2,21st Century Wire says If you haven t seen Hil...,0
3,Tune in to the Alternate Current Radio Network...,0
4,BEIJING (Reuters) - China said on Wednesday th...,1


### Wordopt

In [70]:
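def wordopt(text):
    # Normalise an article: lowercase it, then strip symbols and digit tokens.
    text = text.lower()
    # This substitution turns every non-word character into a space, so of the
    # patterns below only the punctuation one (underscore) and the digit one
    # can still match.
    text = re.sub(r"\W", " ", text)
    text = re.sub(r'\[.*?\]', '', text)                 # [bracketed] fragments
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                      # newlines
    text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
    return text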

In [71]:
df["text"] = df["text"].apply(wordopt)
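The order of the substitutions in `wordopt` matters. The sketch below compares the notebook's order with a reordered variant on one made-up sentence. Both function names and the sample string are hypothetical, and the reordered version is an alternative to consider, not what the models above were trained with.

```python
import re
import string

PUNCT = '[%s]' % re.escape(string.punctuation)

def wordopt_original_order(text):
    # Same steps and order as the notebook's wordopt.
    text = text.lower()
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>+", "", text)
    text = re.sub(PUNCT, "", text)
    text = re.sub(r"\n", "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    return text

def wordopt_reordered(text):
    # Hypothetical variant: structured patterns first, generic \W last.
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>+", "", text)
    text = re.sub(PUNCT, "", text)
    text = re.sub(r"\n", "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    text = re.sub(r"\W", " ", text)
    return text

sample = "Visit https://example.com NOW! [video] 2017 was <b>big</b>\n"
tokens_original = wordopt_original_order(sample).split()
tokens_reordered = wordopt_reordered(sample).split()
print(tokens_original)
print(tokens_reordered)
```

In the original order the URL and tag names survive as separate tokens, because their delimiters have already been replaced by spaces before the URL and HTML patterns run.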

### Declaring the Independent & Dependent Variables

In [47]:
x = df["text"]
y = df["tag"]

### Data Splitting
- Training data = 75%
- Test data = 25%

In [48]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25)
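The split above is random on every run, so the scores reported later can vary slightly between executions. A minimal sketch of a repeatable, class-balanced split on toy data, where the seed `42` and the use of `stratify` are assumptions rather than settings used in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 8 documents, 4 per class.
toy = pd.DataFrame({"text": [f"doc {i}" for i in range(8)], "tag": [0, 1] * 4})

x_tr, x_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["tag"],
    test_size=0.25,       # same 75/25 ratio as the notebook
    random_state=42,      # assumed seed, makes the split repeatable
    stratify=toy["tag"],  # keep the class ratio equal in both parts
)

print(len(x_tr), len(x_te), sorted(y_te))
```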

### Text -> Vector

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

vector = TfidfVectorizer()
xv_train = vector.fit_transform(x_train)
xv_test = vector.transform(x_test)
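The vectorizer is fitted on the training texts only, and the test texts are then transformed with that fixed vocabulary. A minimal sketch with made-up sentences shows the effect: a word that appears only in the test text gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, only for illustration.
train_docs = ["trump signs budget bill", "senate passes budget"]
test_docs = ["senate rejects bill"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

vocab = sorted(vec.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have one column per training-vocabulary term, which is what lets the fitted models score unseen articles.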

### Modelling

### Decision Tree Classification

In [50]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)

In [51]:
pred_dt = DT.predict(xv_test)

In [52]:
DT.score(xv_test, y_test)

0.9951647564469914

In [53]:
print(classification_report(y_test, pred_dt))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5828
           1       1.00      0.99      0.99      5340

    accuracy                           1.00     11168
   macro avg       1.00      1.00      1.00     11168
weighted avg       1.00      1.00      1.00     11168



### Gradient Boosting Classifier

In [54]:
from sklearn.ensemble import GradientBoostingClassifier

GBC = GradientBoostingClassifier(random_state=0)
GBC.fit(xv_train, y_train)

In [55]:
pred_gbc = GBC.predict(xv_test)

In [56]:
GBC.score(xv_test, y_test)

0.9950752148997135

In [57]:
print(classification_report(y_test, pred_gbc))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      5828
           1       0.99      1.00      0.99      5340

    accuracy                           1.00     11168
   macro avg       0.99      1.00      1.00     11168
weighted avg       1.00      1.00      1.00     11168



### Logistic Regression

In [58]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(xv_train,y_train)

In [59]:
pred_lr=LR.predict(xv_test)
LR.score(xv_test, y_test)

0.985852435530086

In [60]:
print(classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      5828
           1       0.98      0.99      0.99      5340

    accuracy                           0.99     11168
   macro avg       0.99      0.99      0.99     11168
weighted avg       0.99      0.99      0.99     11168



### Random Forest Classifier

In [61]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(random_state=0)
RFC.fit(xv_train, y_train)

In [62]:
pred_rfc = RFC.predict(xv_test)
RFC.score(xv_test, y_test)

0.9890759312320917

In [63]:
print(classification_report(y_test, pred_rfc))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5828
           1       0.99      0.99      0.99      5340

    accuracy                           0.99     11168
   macro avg       0.99      0.99      0.99     11168
weighted avg       0.99      0.99      0.99     11168



### Linear SVC

In [64]:
from sklearn.svm import LinearSVC

SVC = LinearSVC(dual=False)
SVC.fit(xv_train, y_train)

In [65]:
pred_svc = SVC.predict(xv_test)
SVC.score(xv_test, y_test)

0.9947170487106017

In [66]:
print(classification_report(y_test, pred_svc))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5828
           1       0.99      0.99      0.99      5340

    accuracy                           0.99     11168
   macro avg       0.99      0.99      0.99     11168
weighted avg       0.99      0.99      0.99     11168



### Testing Model

In [67]:
def output_label(n):
    # Map the numeric class back to a readable label (0 = hoax, 1 = real).
    if n == 0:
        return "Hoax News"
    elif n == 1:
        return "Real News"

def manual_testing(news):
    # Clean the input with wordopt and reuse the fitted TF-IDF vectorizer.
    testing_news = {"text":[news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    new_xv_test = vector.transform(new_x_test)  # the vectorizer fitted above is named `vector`
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_GBC = GBC.predict(new_xv_test)
    pred_SVC = SVC.predict(new_xv_test)
    pred_RFC = RFC.predict(new_xv_test)

    # One placeholder per model, in the same order as the arguments.
    print("\n\nLR Prediction: {} \nDT Prediction: {} \nGBC Prediction: {} \nSVC Prediction: {} \nRFC Prediction: {}".
          format(output_label(pred_LR[0]),
                 output_label(pred_DT[0]),
                 output_label(pred_GBC[0]),
                 output_label(pred_SVC[0]),
                 output_label(pred_RFC[0])))

In [68]:
news = str(input())
manual_testing(news)

 21st Century Wire says As 21WIRE predicted in its new year s look ahead, we have a new  hostage  crisis underway.Today, Iranian military forces report that two small riverine U.S. Navy boats were seized in Iranian waters, and are currently being held on Iran s Farsi Island in the Persian Gulf. A total of 10 U.S. Navy personnel, nine men and one woman, have been detained by Iranian authorities. NAVY STRAYED: U.S. Navy patrol boat in the Persian Gulf (Image Source: USNI)According to the Pentagon, the initial narrative is as follows: The sailors were on a training mission around noon ET when their boat experienced mechanical difficulty and drifted into Iranian-claimed waters and were detained by the Iranian Coast Guard, officials added. The story has since been slightly revised by White House spokesman Josh Earnest to follow this narrative:The 2 boats were traveling en route from Kuwait to Bahrain, when they were stopped and detained by the Iranians.According to USNI, search and rescue t

NameError: name 'wordopt' is not defined

In [69]:
news = str(input())
manual_testing(news)

 SAO PAULO (Reuters) - Cesar Mata Pires, the owner and co-founder of Brazilian engineering conglomerate OAS SA, one of the largest companies involved in Brazil s corruption scandal, died on Tuesday. He was 68. Mata Pires died of a heart attack while taking a morning walk in an upscale district of S o Paulo, where OAS is based, a person with direct knowledge of the matter said. Efforts to contact his family were unsuccessful. OAS declined to comment. The son of a wealthy cattle rancher in the northeastern state of Bahia, Mata Pires  links to politicians were central to the expansion of OAS, which became Brazil s No. 4 builder earlier this decade, people familiar with his career told Reuters last year. His big break came when he befriended Antonio Carlos Magalh es, a popular politician who was Bahia governor several times, and eventually married his daughter Tereza. Brazilians joked that OAS stood for  Obras Arranjadas pelo Sogro  - or  Work Arranged by the Father-In-Law.   After years o

NameError: name 'wordopt' is not defined
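The `NameError` above appears when the cell that defines `wordopt` has not been executed before `manual_testing` is called. One way to reduce this kind of cell-order dependency is to bundle cleaning, vectorization and the classifier into a single scikit-learn pipeline. The sketch below is an assumption about how that could look, not the notebook's method: `clean` is a simplified stand-in for `wordopt`, `predict_label` is a hypothetical helper, and the training sentences are made up.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def clean(text):
    """Simplified stand-in for wordopt (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)
    return re.sub(r"\w*\d\w*", "", text)

LABELS = {0: "Hoax News", 1: "Real News"}

# Made-up training data (0 = hoax, 1 = real), only to make the sketch runnable.
docs = [
    "shocking secret video exposed",
    "you will not believe this shocking claim",
    "senate passes budget bill on tuesday",
    "minister announces new trade agreement",
]
tags = [0, 0, 1, 1]

# The cleaning function travels with the vectorizer inside the pipeline.
model = make_pipeline(TfidfVectorizer(preprocessor=clean), LogisticRegression())
model.fit(docs, tags)

def predict_label(news):
    # Raw text in, readable label out; no separate cleaning step to forget.
    return LABELS[int(model.predict([news])[0])]

result = predict_label("Senate passes new trade bill")
print(result)
```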