# Project V2

### Summary
1. Claim and Related Article Data
    - Converted text to lowercase and removed punctuation, links, non-ASCII characters, and miscellaneous items such as curly quotes, dashes, and hyphens
    - Removed stopwords and tokenized
        - To run this section, you may have to download the NLTK `stopwords` and `punkt` packages. The code is included; you just have to uncomment 2 lines on the first run (section 1.3)
2. Date
    - Converted from string to datetime format (for practicality)
    - Created 3 features:
        1. Days since Jan 1st 1986
        2. The month
        3. The year
3. Claimant
    - Replaced missing values with "Unknown"
    - Replaced claimants that appear fewer than 100 times with "Other"
4. Final Frame
    - 2 final frames:
        - `final_data` = the frame that holds the claims, claimant, date, label, and related article IDs
        - `articles_cleaned2` = the frame that holds the cleaned and tokenized related articles, indexed by article ID

## Preliminaries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
from datetime import datetime
import os
import math
from IPython.display import clear_output, display
import time
import warnings
warnings.filterwarnings('ignore')
import string
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.metrics import accuracy_score

# The following line is needed to show plots inline in notebooks
%matplotlib inline 

In [2]:
# function to convert strings to a numpy array - used to convert the related_articles column into arrays
# for practicality
def str2array(value):
    str_list = re.findall(r'\d+', value)
    int_list = list(map(int, str_list))
    article_array = np.array(int_list)
    return article_array
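
As a quick sanity check of the parsing idea used in `str2array`, the sketch below applies the same `re.findall(r'\d+', ...)` call to a string shaped like a `related_articles` entry. It returns a plain Python list rather than a NumPy array so that it runs without third-party packages; `parse_article_ids` is a hypothetical name used only for this illustration.

```python
import re

def parse_article_ids(value):
    """Extract every run of digits from a string such as "[122094, 122580]"."""
    return [int(token) for token in re.findall(r'\d+', value)]

ids = parse_article_ids("[122094, 122580, 130685, 134765]")
print(ids)
```

Because the pattern only looks for digit runs, an empty list string such as `"[]"` yields an empty result instead of raising an error.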

In [3]:
data = pd.read_csv('train.csv')
data.shape

(15555, 7)

In [4]:
# create a new column with the related articles saved as an array called "article_array"
data['article_array'] = data['related_articles'].apply(str2array)
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array
0,0,A line from George Orwell's novel 1984 predict...,,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]"
1,1,Maine legislature candidate Leslie Gibson insu...,,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]"
2,2,A 17-year-old girl named Alyssa Carson is bein...,,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]"
3,3,In 1988 author Roald Dahl penned an open lette...,,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]"
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]"


In [5]:
# set paths
cur_path = os.path.dirname(os.path.abspath("Project_Data.ipynb"))
articles_dir = cur_path + '/train_articles/'

In [6]:
# create a dictionary of article ID and content
article_dict = {}
for filename in os.listdir(articles_dir):
    filenumber = filename.replace('.txt', '')
    # the context manager closes each file after reading
    with open(articles_dir + filename, "r") as file_open:
        text = file_open.read()
    article_dict[filenumber] = text
# use the dictionary created to create a dataframe of articles
articles  = pd.DataFrame.from_dict(article_dict, orient='index')
articles.columns = ['Article']
# a dataframe that holds all the articles
articles.head()

Unnamed: 0,Article
6741,Risk Corridors And Budget Neutrality\nScott Ha...
126507,FACT CHECK: Did Rand Paul Tweet That the Purpo...
113460,Baboon Beauty Queen South Africa\nDescription\...
79633,Obama Going Negative? Mailer Hits Clinton On T...
12252,"Michelle Obama 'concerned about us' as women, ..."


## 1. Related Articles and Claim Data

### 1.1 Basic Cleaning for Related Articles

In [7]:
%%time
# CLEAN ARTICLE DATA - takes a few minutes to run (about 2 minutes in the run recorded below)
# convert all string values to lower case
articles_cleaned = articles.apply(lambda x: x.str.lower())
# replace new line with space
articles_cleaned = articles_cleaned.replace('\n', ' ', regex=True)
# get rid of all links
articles_cleaned = articles_cleaned.Article.replace(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', regex = True).to_frame()
# get rid of unicode hex
articles_cleaned = articles_cleaned.Article.replace({r'[^\x00-\x7F]+':''}, regex=True).to_frame()
# remove punctuation
articles_cleaned = articles_cleaned.Article.str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True).to_frame()
# remove misc items
articles_cleaned = articles_cleaned.replace(' — ', ' ', regex=True)
articles_cleaned = articles_cleaned.replace('-', '', regex=True)
articles_cleaned = articles_cleaned.replace('’', '', regex=True)
articles_cleaned = articles_cleaned.replace('‘', '', regex=True)
articles_cleaned = articles_cleaned.replace('”', '', regex=True)
articles_cleaned = articles_cleaned.replace('“', '', regex=True)
# replace consecutive spaces with just one space
articles_cleaned = articles_cleaned.replace(r'\s+', ' ', regex=True)

CPU times: user 1min 58s, sys: 895 ms, total: 1min 58s
Wall time: 1min 58s


In [8]:
articles_cleaned.head()

Unnamed: 0,Article
6741,risk corridors and budget neutrality scott har...
126507,fact check did rand paul tweet that the purpos...
113460,baboon beauty queen south africa description t...
79633,obama going negative mailer hits clinton on tr...
12252,michelle obama concerned about us as women pra...
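
To make the order of the cleaning steps easier to follow, here is a minimal sketch that applies them to a single string with the standard library only. It is an illustration rather than the notebook's exact code: `clean_text` is a hypothetical helper, and `URL_RE` is a deliberately simplified link pattern standing in for the longer regex used above.

```python
import re
import string

# Simplified link pattern, assumed for this sketch only.
URL_RE = re.compile(r'(?:https?://|www\.)\S+')
# Character class matching any ASCII punctuation mark (escaped so that
# characters such as ']' and '\' are treated literally).
PUNCT_RE = re.compile('[{}]'.format(re.escape(string.punctuation)))

def clean_text(text):
    text = text.lower()                         # lower case
    text = text.replace('\n', ' ')              # new lines to spaces
    text = URL_RE.sub('', text)                 # drop links
    text = re.sub(r'[^\x00-\x7F]+', '', text)   # drop non-ASCII characters
    text = PUNCT_RE.sub('', text)               # drop punctuation
    text = re.sub(r'\s+', ' ', text)            # collapse repeated whitespace
    return text

sample = "FACT CHECK: Did it happen?\nSee https://example.com/story for “details”."
print(clean_text(sample))
```

Links are removed before punctuation on purpose: once the dots and slashes are stripped, a URL can no longer be recognized as one.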


### 1.2 Basic Cleaning for Claims 

In [9]:
%%time
# CLEAN CLAIM DATA
# create a new dataframe of just claims
cleaned_claim = data.claim.to_frame()
# convert all string values to lower case
cleaned_claim = cleaned_claim.apply(lambda x: x.str.lower())
# replace new line with space
cleaned_claim = cleaned_claim.replace('\n', ' ', regex=True)
# get rid of all links
cleaned_claim = cleaned_claim.claim.replace(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', regex = True).to_frame()
# get rid of unicode hex
cleaned_claim = cleaned_claim.claim.replace({r'[^\x00-\x7F]+':''}, regex=True).to_frame()
# remove punctuation
cleaned_claim = cleaned_claim.claim.str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True).to_frame()
# remove misc items
cleaned_claim = cleaned_claim.replace(' — ', ' ', regex=True)
cleaned_claim = cleaned_claim.replace('-', ' ', regex=True)
cleaned_claim = cleaned_claim.replace('’', '', regex=True)
cleaned_claim = cleaned_claim.replace('‘', '', regex=True)
cleaned_claim = cleaned_claim.replace('”', '', regex=True)
cleaned_claim = cleaned_claim.replace('“', '', regex=True)
# replace consecutive spaces with just one space
cleaned_claim = cleaned_claim.replace(r'\s+', ' ', regex=True)

CPU times: user 475 ms, sys: 8.08 ms, total: 483 ms
Wall time: 481 ms


In [10]:
# concatenate cleaned_claims with label and article_array
cleaned_claim = pd.concat([cleaned_claim, data.label, data.article_array], axis=1)
# cleaned_claim now holds the claims that are cleaned, the label, and the article array
cleaned_claim.head()

Unnamed: 0,claim,label,article_array
0,a line from george orwells novel 1984 predicts...,0,"[122094, 122580, 130685, 134765]"
1,maine legislature candidate leslie gibson insu...,2,"[106868, 127320, 128060]"
2,a 17yearold girl named alyssa carson is being ...,1,"[132130, 132132, 149722]"
3,in 1988 author roald dahl penned an open lette...,2,"[123254, 123418, 127464]"
4,when it comes to fighting terrorism another th...,2,"[41099, 89899, 72543, 82644, 95344, 88361]"


### 1.3 Stop words and Tokenization 

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

# the first time running - you may need to uncomment the two lines below to download the necessary packages
# (they must run before the stop words are loaded)
# nltk.download('stopwords')
# nltk.download('punkt')

stop_words = set(stopwords.words('english'))

#### 1.3.1 Create a LIST of Tokenized Claims

In [12]:
# create a list of claims
claim_list = cleaned_claim.claim.tolist()

# tokenize every claim in the claim list and drop the stop words
# the result is a list of tokenized claims: tokenized_claims
tokenized_claims = []
for claim in claim_list:
    word_tokens = word_tokenize(claim)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    tokenized_claims.append(filtered_sentence)

In [13]:
# list of tokenized claims
tokenized_claims

[['line',
  'george',
  'orwells',
  'novel',
  '1984',
  'predicts',
  'power',
  'smartphones'],
 ['maine',
  'legislature',
  'candidate',
  'leslie',
  'gibson',
  'insulted',
  'parkland',
  'shooting',
  'survivor',
  'activist',
  'emma',
  'gonzalez',
  'via',
  'twitter'],
 ['17yearold',
  'girl',
  'named',
  'alyssa',
  'carson',
  'trained',
  'nasa',
  'become',
  'astronaut'],
 ['1988',
  'author',
  'roald',
  'dahl',
  'penned',
  'open',
  'letter',
  'urging',
  'parents',
  'children',
  'vaccinated',
  'measles'],
 ['comes',
  'fighting',
  'terrorism',
  'another',
  'thing',
  'know',
  'work',
  'based',
  'lots',
  'empirical',
  'evidence',
  'torture'],
 ['rhode',
  'island',
  'almost',
  'dead',
  'last',
  'among',
  'northeastern',
  'states',
  'length',
  'time',
  'firstdegree',
  'murderers',
  'must',
  'spend',
  'prison',
  'theyre',
  'eligible',
  'parole'],
 ['poorest',
  'counties',
  'us',
  'appalachia',
  'happen',
  '90',
  'percent',
  'whi

#### 1.3.2 Create a LIST of Tokenized Related Articles

In [14]:
# create a list of tokenized, non-stop-word articles ~ may take a few minutes to run
tokenized_articles = []
n_articles = articles_cleaned.shape[0]
for i, article in enumerate(articles_cleaned.Article):
    word_tokens = word_tokenize(article)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    tokenized_articles.append(filtered_sentence)
    progress = round((i / n_articles) * 100, 2)
    clear_output(wait=True)
    print("progress: " + str(progress) + "%")

progress: 100.0%


In [15]:
# list of tokenized related articles - below is showing only the first entry of the list
tokenized_articles[0]

['risk',
 'corridors',
 'budget',
 'neutrality',
 'scott',
 'harrington',
 'may',
 '14',
 '2014',
 'editors',
 'note',
 'topic',
 'risk',
 'corridors',
 'also',
 'addressed',
 'upcoming',
 'health',
 'policy',
 'brief',
 'briefs',
 'produced',
 'health',
 'affairs',
 'grant',
 'robert',
 'wood',
 'johnson',
 'foundation',
 'department',
 'health',
 'human',
 'services',
 'hhs',
 'stated',
 'march',
 '11',
 'preamble',
 'final',
 'notice',
 'benefit',
 'payment',
 'parameters',
 '2015',
 'intends',
 'implement',
 'acas',
 'risk',
 'corridor',
 'program',
 'budget',
 'neutral',
 'manner',
 'hhs',
 'previously',
 'indicated',
 'march',
 '2013',
 'notice',
 'insurers',
 'making',
 'decisions',
 'marketplace',
 'participation',
 'plan',
 'design',
 'pricing',
 'budget',
 'neutrality',
 'programs',
 'threeyear',
 'life',
 'required',
 'statute',
 'payments',
 'made',
 'regardless',
 'balance',
 'receipts',
 'payments',
 'april',
 '11',
 'hhs',
 'issued',
 'faq',
 'explaining',
 'achieve',
 '

#### Convert the Tokenized Claims List into a DataFrame

In [16]:
# create a dataframe of the tokenized claims and add it to the cleaned_claim
tok_claims_df = pd.Series(tokenized_claims).to_frame()
tok_claims_df.columns = ['tokenized_claims']
cleaned_claim = pd.concat([cleaned_claim, tok_claims_df], axis=1)
# this is the dataframe with the cleaned claim, label, article array and tokenized_claims
cleaned_claim.head()

Unnamed: 0,claim,label,article_array,tokenized_claims
0,a line from george orwells novel 1984 predicts...,0,"[122094, 122580, 130685, 134765]","[line, george, orwells, novel, 1984, predicts,..."
1,maine legislature candidate leslie gibson insu...,2,"[106868, 127320, 128060]","[maine, legislature, candidate, leslie, gibson..."
2,a 17yearold girl named alyssa carson is being ...,1,"[132130, 132132, 149722]","[17yearold, girl, named, alyssa, carson, train..."
3,in 1988 author roald dahl penned an open lette...,2,"[123254, 123418, 127464]","[1988, author, roald, dahl, penned, open, lett..."
4,when it comes to fighting terrorism another th...,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[comes, fighting, terrorism, another, thing, k..."


#### Convert the Tokenized Articles List into a DataFrame

In [17]:
# create a dataframe of the tokenized articles, indexed by article number
tok_article_df = pd.Series(tokenized_articles, index=articles_cleaned.index).to_frame('tokenized_articles')
# concatenate the cleaned articles with their tokenized version (both share the same index)
articles_cleaned2 = pd.concat([articles_cleaned, tok_article_df], axis=1)
# this is the dataframe with the cleaned article and tokenized_articles, indexed by article number
articles_cleaned2.head()

Unnamed: 0,Article,tokenized_articles
6741,risk corridors and budget neutrality scott har...,"[risk, corridors, budget, neutrality, scott, h..."
126507,fact check did rand paul tweet that the purpos...,"[fact, check, rand, paul, tweet, purpose, seco..."
113460,baboon beauty queen south africa description t...,"[baboon, beauty, queen, south, africa, descrip..."
79633,obama going negative mailer hits clinton on tr...,"[obama, going, negative, mailer, hits, clinton..."
12252,michelle obama concerned about us as women pra...,"[michelle, obama, concerned, us, women, practi..."


## 2. Date

In [18]:
# Convert date column to datetime format
data['new_date'] = pd.to_datetime(data['date'], dayfirst=True)
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array,new_date
0,0,A line from George Orwell's novel 1984 predict...,,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]",2017-07-17
1,1,Maine legislature candidate Leslie Gibson insu...,,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]",2018-03-17
2,2,A 17-year-old girl named Alyssa Carson is bein...,,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]",2018-07-18
3,3,In 1988 author Roald Dahl penned an open lette...,,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]",2019-02-04
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]",2016-03-22


In [19]:
# create new feature with consecutive days since January 1st, 1986
data['cont_days'] = (data['new_date'] - pd.Timestamp('1986-01-01')).dt.days

In [20]:
# Extract Year and Month as integer features; they can be kept as int since they are ordinal

# Year
data['Year'] = data['new_date'].dt.year
# Month
data['Month'] = data['new_date'].dt.month

In [21]:
# 4 columns at the end show the new_date (which is the date in a date format), the continuous days, 
# the year and month
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array,new_date,cont_days,Year,Month
0,0,A line from George Orwell's novel 1984 predict...,,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]",2017-07-17,11520,2017,7
1,1,Maine legislature candidate Leslie Gibson insu...,,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]",2018-03-17,11763,2018,3
2,2,A 17-year-old girl named Alyssa Carson is bein...,,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]",2018-07-18,11886,2018,7
3,3,In 1988 author Roald Dahl penned an open lette...,,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]",2019-02-04,12087,2019,2
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]",2016-03-22,11038,2016,3
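
The three date features can be verified by hand for a single row. The sketch below recomputes them with the standard library `datetime` module for the first claim's date; `date_features` and `EPOCH` are hypothetical names introduced only for this check.

```python
from datetime import datetime

# Reference date used for the continuous-days feature.
EPOCH = datetime(1986, 1, 1)

def date_features(date_string):
    """Parse a day-first date string and derive the three date features."""
    parsed = datetime.strptime(date_string, '%d/%m/%Y')
    return {
        'cont_days': (parsed - EPOCH).days,
        'Year': parsed.year,
        'Month': parsed.month,
    }

print(date_features('17/07/2017'))
# {'cont_days': 11520, 'Year': 2017, 'Month': 7}
```

The result matches the `cont_days`, `Year`, and `Month` values in the first row of the table above.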


## 3. Claimant 

In [22]:
# fill missing claimants with "Unknown"
data['claimant'] = data['claimant'].fillna('Unknown')

In [23]:
# Group together all claimants with fewer than 100 claims into "Other"
claimant_count = data['claimant'].value_counts()
value_mask = data.claimant.isin(claimant_count.index[claimant_count < 100]) 
data.loc[value_mask,'claimant'] = "Other"

In [24]:
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array,new_date,cont_days,Year,Month
0,0,A line from George Orwell's novel 1984 predict...,Unknown,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]",2017-07-17,11520,2017,7
1,1,Maine legislature candidate Leslie Gibson insu...,Unknown,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]",2018-03-17,11763,2018,3
2,2,A 17-year-old girl named Alyssa Carson is bein...,Unknown,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]",2018-07-18,11886,2018,7
3,3,In 1988 author Roald Dahl penned an open lette...,Unknown,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]",2019-02-04,12087,2019,2
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]",2016-03-22,11038,2016,3
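
The claimant grouping logic can be tried on a small list without pandas. This is a minimal sketch under stated assumptions: `group_rare` is a hypothetical helper, the toy claimant names are invented, and the threshold is lowered to 2 so the effect is visible on seven values.

```python
from collections import Counter

def group_rare(values, threshold, missing_label='Unknown', other_label='Other'):
    """Fill missing values, then relabel every value seen fewer than `threshold` times."""
    filled = [missing_label if v is None else v for v in values]
    counts = Counter(filled)
    return [v if counts[v] >= threshold else other_label for v in filled]

claimants = ['A', 'A', 'A', None, None, 'B', 'C']
print(group_rare(claimants, threshold=2))
# ['A', 'A', 'A', 'Unknown', 'Unknown', 'Other', 'Other']
```

Missing values are filled before counting, so "Unknown" is treated like any other claimant and would itself be folded into "Other" if it fell below the threshold.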


## 4. Final Frame

There are 2 final dataframes: one for the data (claims, claimant, date, label, related_articles) and another for the related articles.

### 4.1 Final Data 

In [25]:
# concatenate all work done above to a single frame
final_data = pd.concat([cleaned_claim.claim, cleaned_claim.tokenized_claims, data.claimant, data.new_date, data.cont_days, data.Year, data.Month, cleaned_claim.label, cleaned_claim.article_array], axis=1)

In [26]:
# this is equivalent to the "train.csv" that we were given, but cleaned and with a few additional features
final_data.head()

Unnamed: 0,claim,tokenized_claims,claimant,new_date,cont_days,Year,Month,label,article_array
0,a line from george orwells novel 1984 predicts...,"[line, george, orwells, novel, 1984, predicts,...",Unknown,2017-07-17,11520,2017,7,0,"[122094, 122580, 130685, 134765]"
1,maine legislature candidate leslie gibson insu...,"[maine, legislature, candidate, leslie, gibson...",Unknown,2018-03-17,11763,2018,3,2,"[106868, 127320, 128060]"
2,a 17yearold girl named alyssa carson is being ...,"[17yearold, girl, named, alyssa, carson, train...",Unknown,2018-07-18,11886,2018,7,1,"[132130, 132132, 149722]"
3,in 1988 author roald dahl penned an open lette...,"[1988, author, roald, dahl, penned, open, lett...",Unknown,2019-02-04,12087,2019,2,2,"[123254, 123418, 127464]"
4,when it comes to fighting terrorism another th...,"[comes, fighting, terrorism, another, thing, k...",Hillary Clinton,2016-03-22,11038,2016,3,2,"[41099, 89899, 72543, 82644, 95344, 88361]"


### 4.2 Related Articles Dataframe 

In [27]:
# this frame holds the cleaned articles and their tokenized version; the index is the article ID
articles_cleaned2.head()

Unnamed: 0,Article,tokenized_articles
6741,risk corridors and budget neutrality scott har...,"[risk, corridors, budget, neutrality, scott, h..."
126507,fact check did rand paul tweet that the purpos...,"[fact, check, rand, paul, tweet, purpose, seco..."
113460,baboon beauty queen south africa description t...,"[baboon, beauty, queen, south, africa, descrip..."
79633,obama going negative mailer hits clinton on tr...,"[obama, going, negative, mailer, hits, clinton..."
12252,michelle obama concerned about us as women pra...,"[michelle, obama, concerned, us, women, practi..."
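
To use the two final frames together, each ID in `article_array` has to be matched to a row of `articles_cleaned2`. One detail worth checking: the article IDs come from file names, so they are strings in the article index, whereas `str2array` produces integers. The sketch below illustrates the lookup with plain dictionaries standing in for the two frames; the toy data and the `related_tokens` helper are hypothetical.

```python
# Toy stand-in for the article frame (hypothetical data).
# Article IDs are strings because they are taken from file names.
article_tokens = {
    '101': ['risk', 'corridors', 'budget'],
    '102': ['fact', 'check', 'tweet'],
}
# Integer IDs, as produced by str2array for one claim.
claim_article_ids = [101, 102, 999]

def related_tokens(article_ids, tokens_by_id):
    """Return the token lists for the IDs that exist, skipping missing articles."""
    found = []
    for article_id in article_ids:
        key = str(article_id)  # cast the integer ID to the string key
        if key in tokens_by_id:
            found.append(tokens_by_id[key])
    return found

print(related_tokens(claim_article_ids, article_tokens))
```

With the real frames, the same cast would be needed before indexing into `articles_cleaned2`, or the article index could be converted to integers once up front.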
