## Data Cleaning and Preprocessing Notebook

This notebook is strictly for data cleaning and preprocessing. Steps:

1. Read the dataset.
2. Handle missing values (if any).
3. Visualize the data as required.
4. Explore the data.
5. Save the cleaned and processed dataset as `data/final_dataset.csv`.
6. Split the dataset from step 5 into `input/train.csv`, `input/test.csv`, and `input/validation.csv`.

NO MODELLING WILL BE DONE IN THIS NOTEBOOK!
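
Step 2 (handling missing values) could look roughly like the sketch below. It assumes that rows with a missing `title`, `text`, or `label` can simply be dropped; the toy frame and the helper name `drop_incomplete_rows` are illustrative and not part of this project.

```python
import pandas as pd


def drop_incomplete_rows(df, required_cols):
    """Return a copy of df without rows that have NaN in any required column."""
    return df.dropna(subset=required_cols).reset_index(drop=True)


# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "title": ["a", None, "c"],
    "text": ["x", "y", None],
    "label": ["REAL", "FAKE", "REAL"],
})

print(toy.isna().sum())  # missing-value count per column
clean = drop_incomplete_rows(toy, ["title", "text", "label"])
print(clean.shape)
```

Dropping is only one option; filling a missing `title` with an empty string may be preferable if the `text` alone is still useful.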

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

In [3]:
import re
import string
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [4]:
df=pd.read_csv('../data/TARP_Project_Final_Dataset.csv')

In [5]:
df.head()

Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [6]:
df.shape

(61144, 3)

In [7]:
df['label'].value_counts()/len(df['label'])

REAL    0.523224
FAKE    0.476776
Name: label, dtype: float64

Before wrangling the data, let's glance at one example title from each class.

In [8]:
print("Tagged REAL:",df[df['label']=="REAL"]['title'].values[0])
print("Tagged FAKE:",df[df['label']=="FAKE"]['title'].values[0])

Tagged REAL: Kerry to go to Paris in gesture of sympathy
Tagged FAKE: You Can Smell Hillary’s Fear


In [9]:
stratify=StratifiedShuffleSplit(test_size=0.3,random_state=42)

In [10]:
stratify.get_n_splits(df[['text','title']],df['label'])

10
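
The `10` is the default `n_splits` of `StratifiedShuffleSplit`: unless told otherwise, the splitter generates ten independent train/test partitions. When only one partition is needed, passing `n_splits=1` avoids generating the other nine. A small sketch on toy data (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 20 samples, two balanced classes (illustrative only).
features = np.arange(20).reshape(-1, 1)
labels = np.array(["REAL", "FAKE"] * 10)

default_splitter = StratifiedShuffleSplit(test_size=0.3, random_state=42)
single_splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

print(default_splitter.get_n_splits())  # number of partitions with the default setting
print(single_splitter.get_n_splits())   # number of partitions with n_splits=1

# With n_splits=1 the generator yields exactly one (train, test) index pair.
train_idx, test_idx = next(single_splitter.split(features, labels))
print(len(train_idx), len(test_idx))
```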

In [11]:
X=['title','text']
y='label'

In [12]:
def stratified_shuffle_split(df, X, y, test_size=0.3):
    """Split df into two stratified parts and return them as (train, test)."""
    # n_splits=1: only one partition is needed, so skip the default ten.
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=42)
    train_index, test_index = next(splitter.split(df[X], df[y]))
    return df.iloc[train_index], df.iloc[test_index]

In [13]:
train,test=stratified_shuffle_split(df,X,y, test_size=0.4)
test,validation=stratified_shuffle_split(test,X,y,test_size=0.5)

In [14]:
print(train.shape)
print(test.shape)
print(validation.shape)

(39743, 3)
(10700, 3)
(10701, 3)
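
Row counts alone do not confirm that stratification worked. One way to check is to compare the class proportions of each split with those of the full dataset. The sketch below does this on toy data; the helper name `label_share` and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit


def label_share(frame, label_col="label"):
    """Fraction of rows per class, as a dict."""
    return frame[label_col].value_counts(normalize=True).to_dict()


# Toy frame: 60% REAL, 40% FAKE (illustrative values only).
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": ["REAL"] * 60 + ["FAKE"] * 40,
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, rest_idx = next(splitter.split(toy[["text"]], toy["label"]))
toy_train, toy_rest = toy.iloc[train_idx], toy.iloc[rest_idx]

# The proportions in each part should be close to those of the full frame.
for name, part in [("full", toy), ("train", toy_train), ("rest", toy_rest)]:
    print(name, label_share(part))
```

The same helper could be applied to `train`, `test`, and `validation` and compared against the proportions printed for `df` earlier.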


In [15]:
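# index=False keeps the row index out of the files, so reloading them
# does not produce an extra "Unnamed: 0" column.
train.to_csv('../input/train.csv', index=False)
test.to_csv('../input/test.csv', index=False)
validation.to_csv('../input/validation.csv', index=False)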
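
By default, `to_csv` writes the row index as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back. The sketch below shows the round trip with and without the index, using a temporary directory and a toy frame (both are illustrative assumptions, not the project's files):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for one of the splits (values are made up).
toy = pd.DataFrame({
    "title": ["a", "b"],
    "text": ["x", "y"],
    "label": ["REAL", "FAKE"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    toy.to_csv(with_index_path)                  # default: the index is written
    toy.to_csv(without_index_path, index=False)  # the index is left out

    cols_with_index = list(pd.read_csv(with_index_path).columns)
    cols_without_index = list(pd.read_csv(without_index_path).columns)

print(cols_with_index)
print(cols_without_index)
```

If the original row positions are needed downstream, `pd.read_csv(path, index_col=0)` restores them from a file that was saved with the index.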