# FAKE NEWS

The goal of this project is to develop a machine learning classification model that predicts whether a given news article is real or fake (reliable or unreliable).

## Data Preprocessing

In [2]:
# Import required libraries
import zipfile, os
import pandas as pd

### Retrieve datasets from Kaggle

In [5]:
# Import dataset as zip
! kaggle competitions download -c fake-news

# Extract datasets from imported zip
with zipfile.ZipFile('./fake-news.zip', 'r') as zip_ref:
  zip_ref.extractall('./data/')

# Delete imported zip
os.remove('./fake-news.zip')

Downloading fake-news.zip to /Users/mariainigo/Documents/GitHub/Maria-Inigo/Fake-News
100%|██████████████████████████████████████| 46.5M/46.5M [00:02<00:00, 19.0MB/s]
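
The cell above fails if it is re-run after the zip has already been deleted. A minimal sketch of a re-runnable variant is shown here, assuming a hypothetical helper named `extract_and_remove` (not part of the original notebook) and demonstrated on a throwaway archive rather than the Kaggle download:

```python
import os
import tempfile
import zipfile

def extract_and_remove(zip_path, target_dir):
    """Extract zip_path into target_dir, delete the archive, return the member names.

    Returns an empty list when the archive is absent (for example, when it was
    already extracted and removed), so the cell can be re-run safely.
    """
    if not os.path.exists(zip_path):
        return []
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(target_dir)
    os.remove(zip_path)
    return names

# Demo on a temporary archive so nothing is downloaded
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("train.csv", "id,label\n0,1\n")
    first = extract_and_remove(demo_zip, os.path.join(tmp, "data"))
    second = extract_and_remove(demo_zip, os.path.join(tmp, "data"))
    extracted_exists = os.path.exists(os.path.join(tmp, "data", "train.csv"))
    print(first, second, extracted_exists)
```

In the notebook, the equivalent call would presumably be `extract_and_remove('./fake-news.zip', './data/')`.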


### Read datasets

We are given 3 datasets:
+ train.csv: A full training dataset with the following attributes:
  - id: unique id for a news article
  - title: the title of a news article
  - author: author of the news article
  - text: the text of the article; could be incomplete
  - label: a label that marks the article as potentially unreliable
    - 1: unreliable
    - 0: reliable
+ test.csv: A testing dataset with the same attributes as train.csv, except the label.
+ submit.csv: A sample submission file.

In [3]:
df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')
df_submit = pd.read_csv('./data/submit.csv')
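
Before going further, it can help to confirm that a loaded file matches the schema described above. The following is a minimal sketch under stated assumptions: `missing_columns` and `EXPECTED_TRAIN_COLUMNS` are hypothetical names introduced for illustration, and the check runs on a one-row toy CSV rather than the real `train.csv`.

```python
import io
import pandas as pd

# Column names taken from the dataset description; the constant name is illustrative
EXPECTED_TRAIN_COLUMNS = ["id", "title", "author", "text", "label"]

def missing_columns(df, expected):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]

# One-row toy CSV standing in for train.csv (made-up content)
toy_csv = io.StringIO(
    "id,title,author,text,label\n"
    "0,Some title,Some author,Some text,1\n"
)
toy_train = pd.read_csv(toy_csv)

absent = missing_columns(toy_train, EXPECTED_TRAIN_COLUMNS)
labels_valid = bool(toy_train["label"].isin([0, 1]).all())
print(absent, labels_valid)
```

The same helper could be applied to `df_train`, and to `df_test` with `label` dropped from the expected list.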

In [4]:
df_train.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [5]:
df_test.head()

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [6]:
df_submit.head()

Unnamed: 0,id,label
0,20800,0
1,20801,1
2,20802,0
3,20803,1
4,20804,1


#### Feature Extraction

By studying the training dataset, we can identify the characteristics that best distinguish reliable from unreliable news and extract them as features to feed the model.

In [9]:
df_train.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

Several rows are missing data. It is reasonable to suspect that unreliable articles are more likely to be missing a title, an author, or text.

In [18]:
print('Reliability of articles missing titles: ', df_train[df_train.title.isnull()].label.unique())
print('Reliability of articles missing authors: ', df_train[df_train.author.isnull()].label.unique())
print('Reliability of articles missing text: ', df_train[df_train.text.isnull()].label.unique())

Reliability of articles missing titles:  [1]
Reliability of articles missing authors:  [1 0]
Reliability of articles missing text:  [1]


All articles missing a title or text are unreliable. However, some of the articles missing an author are reliable. From here, we can create dummy variables indicating whether each article has a title, an author, and text.
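
Listing the unique labels shows which classes occur among the incomplete rows, but not in what proportion. One possible way to quantify this is sketched below: since `label` is 1 for unreliable articles, its mean is the share of unreliable articles in a group. The helper name `unreliable_share_by_missing` and the toy rows are assumptions made for illustration; they are not the Kaggle data.

```python
import pandas as pd

# Toy stand-in for df_train (made-up rows)
toy = pd.DataFrame({
    "title":  ["a", None, "c", "d", None],
    "author": ["x", None, "q", None, "z"],
    "text":   ["t", "u", None, "w", "v"],
    "label":  [0, 1, 1, 0, 1],
})

def unreliable_share_by_missing(df, columns, label="label"):
    """For each column, compare the mean label of rows where it is missing vs present."""
    rows = {}
    for col in columns:
        missing = df[col].isnull()
        rows[col] = {
            "n_missing": int(missing.sum()),
            "share_unreliable_if_missing": df.loc[missing, label].mean(),
            "share_unreliable_if_present": df.loc[~missing, label].mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

summary = unreliable_share_by_missing(toy, ["title", "author", "text"])
print(summary)
```

A large gap between the two shares for a column suggests that its missingness indicator carries signal for the model.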

In [45]:
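# Start the feature table from the target column
final_df_train = df_train[['label']].copy()

# 1 if the article has a title, 0 if it is missing
final_df_train['has_title'] = 1
final_df_train.loc[df_train.title.isnull(), 'has_title'] = 0

# 1 if the article has an author, 0 if it is missing
final_df_train['has_author'] = 1
final_df_train.loc[df_train.author.isnull(), 'has_author'] = 0

# 1 if the article has text, 0 if it is missing
final_df_train['has_text'] = 1
final_df_train.loc[df_train.text.isnull(), 'has_text'] = 0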

final_df_train.head()

Unnamed: 0,label,has_title,has_author,has_text
0,1,1,1,1
1,0,1,1,1
2,1,1,1,1
3,1,1,1,1
4,1,1,1,1
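
The same indicators will also be needed for the test set, so it may be convenient to build them in one place. The following is a minimal sketch of an alternative formulation using `notna().astype(int)`; the function name `add_presence_flags` is hypothetical and the input is a two-row toy frame, not the competition data.

```python
import pandas as pd

def add_presence_flags(df, columns=("title", "author", "text")):
    """Return a frame with one has_<column> indicator per column (1 = present, 0 = missing)."""
    flags = pd.DataFrame(index=df.index)
    for col in columns:
        flags[f"has_{col}"] = df[col].notna().astype(int)
    return flags

# Toy frame with one missing value pattern per row (made-up content)
toy = pd.DataFrame({
    "title":  ["a", None],
    "author": [None, "y"],
    "text":   ["t", None],
})

flags = add_presence_flags(toy)
print(flags)
```

Deriving each flag from its own column name avoids copy-paste mistakes when the three blocks are written by hand, and the function can be applied unchanged to `df_train` and `df_test`.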
