# _Trial 1: Fake News_

Goal: build a system that identifies unreliable news articles. The data comes from the [Kaggle fake-news competition](https://www.kaggle.com/c/fake-news/data).

In [4]:
!pip install kaggle

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/e9/fc/0de659ea1f2096563204925b6660ae141f3d85bbe9e8a1571c3eb6cc1fdd/kaggle-1.5.5.tar.gz (56kB)
Collecting python-slugify (from kaggle)
  Downloading https://files.pythonhosted.org/packages/a2/5d/bd30413c00bbed3945558aca07c55944073e1e30abeee1f06515281f9811/python-slugify-3.0.3.tar.gz
Collecting text-unidecode==1.2 (from python-slugify->kaggle)
  Downloading https://files.pythonhosted.org/packages/79/42/d717cc2b4520fb09e45b344b1b0b4e81aa672001dd128c180fabc655c341/text_unidecode-1.2-py2.py3-none-any.whl (77kB)
Building wheels for collected packages: kaggle, python-slugify
  Building wheel for kaggle (setup.py) ... done
  Stored in directory: /Users/jai/Library/Caches/pip/wheels/db/6a/80/6cd1892eb9b9b136333db3c74e16cba4e17e2c700f51541f06
  Building wheel f

In [1]:
# import libraries
import pandas as pd
pd.options.display.max_columns = None
import numpy as np
import random
import os

# Matplotlib
%matplotlib inline
%config InlineBackend.figure_format='retina'
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

In [2]:
os.getcwd()

'/Users/jai/Documents/projects/fake-news'

In [6]:
os.listdir()

['Untitled.ipynb', '.ipynb_checkpoints', 'data']

In [9]:
# set the environment variables the Kaggle API uses to authenticate
# (uncomment and fill in your own credentials)
#os.environ['KAGGLE_USERNAME'] = "insert-here"
#os.environ['KAGGLE_KEY'] = "insert-here"
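
The Kaggle API looks for credentials in the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables (a `~/.kaggle/kaggle.json` file is the usual alternative). As a minimal sketch, a check like the following could run before attempting the download; `missing_kaggle_credentials` is a hypothetical helper written for this notebook, not part of the `kaggle` package, and the values shown are placeholders.

```python
import os

# Names of the environment variables the Kaggle API reads for authentication.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None):
    """Return the names of required variables that are unset or blank.

    Hypothetical helper for this notebook; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Placeholder values for illustration only; never commit real keys.
example_env = {"KAGGLE_USERNAME": "some-user", "KAGGLE_KEY": ""}
print(missing_kaggle_credentials(example_env))
```

An empty list means both variables are set, so the `kaggle competitions download` command has what it needs to authenticate.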

In [10]:
# !kaggle competitions download -c fake-news -p 'data'

In [11]:
os.listdir()

['Untitled.ipynb', '.ipynb_checkpoints', 'data']

In [12]:
from pathlib import Path

# path to the project's root directory (the current working directory)
path = Path.cwd()
path

PosixPath('/Users/jai/Documents/projects/fake-news')

In [13]:
# make a dataframe from train.csv
train_df = pd.read_csv(path/'data/train.csv')

# make a dataframe from test.csv
test_df = pd.read_csv(path/'data/test.csv')

In [14]:
train_df.head()

|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


### _Data Description on `train_df`_

`train.csv`: the full training dataset, with the following attributes:

- `id`: unique id for a news article
- `title`: the title of a news article
- `author`: author of the news article
- `text`: the text of the article; could be incomplete
- `label`: a label that marks the article as potentially unreliable
    - `1`: unreliable
    - `0`: reliable
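
The description above can be turned into a quick sanity check on the loaded frame. The sketch below uses a small made-up stand-in for `train_df` (the rows are illustrative, not taken from the dataset) to confirm the expected columns, confirm that `label` only holds `0` or `1`, and look at the class balance.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "title": ["a", "b", None, "d"],
    "author": ["x", None, "y", "z"],
    "text": ["t0", "t1", "t2", None],
    "label": [1, 0, 1, 1],
})

expected_columns = ["id", "title", "author", "text", "label"]

# Confirm the schema matches the data description.
assert list(toy_train.columns) == expected_columns

# Confirm every label is 0 (reliable) or 1 (unreliable).
assert toy_train["label"].isin([0, 1]).all()

# Share of each class, ordered by label value.
label_share = toy_train["label"].value_counts(normalize=True).sort_index()
print(label_share)
```

Running the same checks on the real `train_df` would show whether the two classes are roughly balanced, which matters later when choosing an evaluation metric.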

In [15]:
test_df.head()

|   | id | title | author | text |
|---|----|-------|--------|------|
| 0 | 20800 | Specter of Trump Loosens Tongues, if Not Purse... | David Streitfeld | PALO ALTO, Calif. — After years of scorning... |
| 1 | 20801 | Russian warships ready to strike terrorists ne... | NaN | Russian warships ready to strike terrorists ne... |
| 2 | 20802 | #NoDAPL: Native American Leaders Vow to Stay A... | Common Dreams | Videos #NoDAPL: Native American Leaders Vow to... |
| 3 | 20803 | Tim Tebow Will Attempt Another Comeback, This ... | Daniel Victor | If at first you don’t succeed, try a different... |
| 4 | 20804 | Keiser Report: Meme Wars (E995) | Truth Broadcast Network | 42 mins ago 1 Views 0 Comments 0 Likes 'For th... |


In [16]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
id        20800 non-null int64
title     20242 non-null object
author    18843 non-null object
text      20761 non-null object
label     20800 non-null int64
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [17]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5200 entries, 0 to 5199
Data columns (total 4 columns):
id        5200 non-null int64
title     5078 non-null object
author    4697 non-null object
text      5193 non-null object
dtypes: int64(1), object(3)
memory usage: 162.6+ KB
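
`info()` reports non-null counts, so the number of missing values in a column is the row count minus that figure. For the training set that gives 558 missing titles, 1957 missing authors and 39 missing texts; for the test set, 122, 503 and 7. A minimal sketch of getting the same numbers directly with `isna()`, shown on made-up toy data rather than the real frames:

```python
import pandas as pd

# Toy frame with a few deliberately missing values (illustrative only).
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "title": ["a", None, "c", "d"],
    "author": [None, None, "y", "z"],
    "text": ["t0", "t1", None, "t3"],
})

# isna() marks missing cells; summing counts them, averaging gives the share.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

Applied to `train_df` and `test_df`, the `missing` column should match the differences computed from the `info()` output.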


In [29]:
# drop rows whose text column is missing
drop_train_df = train_df.dropna(subset=['text'])
drop_test_df = test_df.dropna(subset=['text'])
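
Dropping rows with a missing `text` can be written either as a boolean mask or with `dropna(subset=...)`. The sketch below checks on made-up toy data that the two forms keep the same rows, and counts how many rows were removed. Going by the `info()` output above, the same step on the real data should remove 39 training rows (20800 to 20761) and 7 test rows (5200 to 5193).

```python
import pandas as pd

# Toy frame: two of the four rows have no text (illustrative only).
toy = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "text": ["first", None, "third", None],
})

# Form 1: boolean mask that keeps rows where text is present.
by_mask = toy[~toy["text"].isnull()]

# Form 2: dropna restricted to the text column.
by_dropna = toy.dropna(subset=["text"])

# Both forms keep the same rows and the same original index labels.
assert by_mask.equals(by_dropna)

rows_dropped = len(toy) - len(by_dropna)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```

Note that the surviving rows keep their original index labels, which is why the filtered frames report an index running up to 20799 and 5199 despite having fewer entries.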

In [30]:
drop_train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20761 entries, 0 to 20799
Data columns (total 5 columns):
id        20761 non-null int64
title     20203 non-null object
author    18843 non-null object
text      20761 non-null object
label     20761 non-null int64
dtypes: int64(2), object(3)
memory usage: 973.2+ KB


In [31]:
drop_test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5193 entries, 0 to 5199
Data columns (total 4 columns):
id        5193 non-null int64
title     5071 non-null object
author    4697 non-null object
text      5193 non-null object
dtypes: int64(1), object(3)
memory usage: 202.9+ KB


In [34]:
# use the id column as the index
drop_train_df.set_index('id', inplace=True)

In [35]:
drop_test_df.set_index('id', inplace=True)
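
Using `id` as the index assumes the ids are unique. As a hedged sketch on made-up toy data, `set_index` can check that assumption with `verify_integrity=True`, and returning a new frame instead of using `inplace=True` leaves the original untouched:

```python
import pandas as pd

# Toy frame with a gap in the ids, as happens after rows are dropped.
toy = pd.DataFrame({
    "id": [20800, 20801, 20803],
    "text": ["a", "b", "c"],
})

# verify_integrity=True raises ValueError if the new index has duplicates.
indexed = toy.set_index("id", verify_integrity=True)

# Label-based lookup by article id.
print(indexed.loc[20801, "text"])

# A frame with a repeated id is rejected instead of silently accepted.
duplicated = pd.DataFrame({"id": [1, 1], "text": ["a", "b"]})
duplicate_rejected = False
try:
    duplicated.set_index("id", verify_integrity=True)
except ValueError as err:
    duplicate_rejected = True
    print("duplicate ids:", err)
```

With `id` as the index, rows are addressed by article id through `.loc`, which stays stable even after rows have been dropped.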