# Project_Data

### Summary
1. Claim and Related Article Data:
    - Lowercased the text and removed punctuation, links, non-ASCII characters, and miscellaneous items such as ", ', and -.
    - Stemmed, removed stop words, and tokenized
        - To run this section, you may have to download the NLTK stopwords and punkt packages. The code is included; you just have to uncomment 2 lines on the first run (section 1.3).
2. Date
    - Converted from string to datetime format (for practicality)
    - Created 3 features:
        - Days since Jan 1st 1986
        - The Month
        - The Year
3. Claimant
    - Replaced missing values with "Unknown"
    - Replaced claimants whose counts fall below a threshold (100) with "Other"
4. Final Frame
    - 2 final frames:
        - final_data = the frame that holds the claims, claimant, date, label, and related article IDs
        - final_articles = the frame that holds the related articles
    - I have included a few extra lines of code as an example of how to work with the frames
        - I have saved the 2 dataframes to csv files

## Preliminaries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
from datetime import datetime
import os
import math
from IPython.display import clear_output, display
import time
import warnings
warnings.filterwarnings('ignore')
import string
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.metrics import accuracy_score

# The following line is needed to show plots inline in notebooks
%matplotlib inline 

In [2]:
# function to convert strings to numpy arrays - used to convert the related_articles column into arrays
# for practicality
def str2array(value):
    str_list = re.findall(r'\d+', value)
    int_list = list(map(int, str_list))
    article_array = np.array(int_list)
    return article_array
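
As a quick check of the idea behind `str2array`, the sketch below runs the same digit-extraction regex on one `related_articles` value and compares the result with `ast.literal_eval`. The helper name `str2list` is hypothetical and returns a plain list instead of a numpy array so the example needs only the standard library.

```python
import ast
import re

def str2list(value):
    """Extract every run of digits from a string and return them as ints."""
    return [int(s) for s in re.findall(r'\d+', value)]

raw = "[122094, 122580, 130685, 134765]"  # a related_articles value from train.csv
ids_regex = str2list(raw)
ids_literal = ast.literal_eval(raw)  # parses the string as a Python list literal

print(ids_regex)
print(ids_regex == ids_literal)
```

The regex approach only sees digits, so it would silently drop minus signs or decimal points; that is fine for article IDs but worth keeping in mind if the helper is reused elsewhere.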

In [3]:
data = pd.read_csv('train.csv')
data.shape

(15555, 7)

In [4]:
# create a new column with the related articles saved as an array called "article_array"
data['article_array'] = data['related_articles'].apply(str2array)
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array
0,0,A line from George Orwell's novel 1984 predict...,,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]"
1,1,Maine legislature candidate Leslie Gibson insu...,,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]"
2,2,A 17-year-old girl named Alyssa Carson is bein...,,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]"
3,3,In 1988 author Roald Dahl penned an open lette...,,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]"
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]"


In [5]:
# set paths
cur_path = os.path.dirname(os.path.abspath("Project_Data.ipynb"))
articles_dir = cur_path + '/train_articles/'

In [6]:
%%time
# create a dictionary of article ID and content
article_dict = {}
for filename in os.listdir(articles_dir):
    filenumber = filename.replace('.txt', '')
    # use a context manager so each file handle is closed after reading
    with open(articles_dir + filename, "r") as file_open:
        text = file_open.read()
    article_dict[filenumber] = text
# use the dictionary created to create a dataframe of articles
articles  = pd.DataFrame.from_dict(article_dict, orient='index')
articles.columns = ['Article']
# a dataframe that holds all the articles
articles.head()

CPU times: user 2.86 s, sys: 2.03 s, total: 4.89 s
Wall time: 13.8 s


Unnamed: 0,Article
131504,L'oreal abusing monkey during animal testing\n...
62141,Remarks by President Trump Before Marine One D...
55267,Scott likes Rubio's push on immigration reform...
150721,Effectiveness of fluoride in preventing caries...
159743,Prisoners to get a monthly salary and a bonus ...


## 1. Related Articles and Claim Data

### 1.1 Basic Cleaning for Related Articles

In [7]:
%%time
# CLEAN ARTICLE DATA - a few minutes to run (about 2 minutes in the run below)
# convert all string values to lower case
articles_cleaned = articles.apply(lambda x: x.str.lower())
# replace new line with space
articles_cleaned = articles_cleaned.replace('\n', ' ', regex=True)
# get rid of all links
articles_cleaned = articles_cleaned.Article.replace(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', regex = True).to_frame()
# get rid of unicode hex
articles_cleaned = articles_cleaned.Article.replace({r'[^\x00-\x7F]+':''}, regex=True).to_frame()
# remove punctuation
articles_cleaned = articles_cleaned.Article.str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True).to_frame()
# remove misc items
articles_cleaned = articles_cleaned.replace(' — ', ' ', regex=True)
articles_cleaned = articles_cleaned.replace('-', '', regex=True)
articles_cleaned = articles_cleaned.replace('’', '', regex=True)
articles_cleaned = articles_cleaned.replace('‘', '', regex=True)
articles_cleaned = articles_cleaned.replace('”', '', regex=True)
articles_cleaned = articles_cleaned.replace('“', '', regex=True)
# replace consecutive spaces with just one space
articles_cleaned = articles_cleaned.replace(r'\s+', ' ', regex=True)

CPU times: user 1min 56s, sys: 459 ms, total: 1min 56s
Wall time: 1min 56s
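
The same cleaning steps are applied to the claims in section 1.2, so they could be collected in one function that works on a single string. The sketch below is one way to do that; `clean_text` and the compiled pattern names are hypothetical, and `LINK_RE` is a deliberately simplified link pattern used for illustration, not the regex from the cell above.

```python
import re
import string

# simplified link pattern for illustration only (an assumption, not the notebook's regex)
LINK_RE = re.compile(r'(?:https?://|www\.)\S+')
NON_ASCII_RE = re.compile(r'[^\x00-\x7F]+')
PUNCT_RE = re.compile('[{}]'.format(re.escape(string.punctuation)))
SPACE_RE = re.compile(r'\s+')

def clean_text(text):
    """Lowercase, then strip links, non-ASCII characters, punctuation and extra spaces."""
    text = text.lower().replace('\n', ' ')
    text = LINK_RE.sub('', text)
    text = NON_ASCII_RE.sub('', text)
    text = PUNCT_RE.sub('', text)
    # unlike the notebook cell, this also trims leading and trailing spaces
    return SPACE_RE.sub(' ', text).strip()

print(clean_text("A 17-year-old girl's\nstory: see https://example.com/page now!"))
```

With a helper like this, both columns could be cleaned with something along the lines of `articles['Article'].map(clean_text)` and `data['claim'].map(clean_text)`, which keeps the two sections from drifting apart.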


In [8]:
articles_cleaned.head()

Unnamed: 0,Article
131504,loreal abusing monkey during animal testing da...
62141,remarks by president trump before marine one d...
55267,scott likes rubios push on immigration reform ...
150721,effectiveness of fluoride in preventing caries...
159743,prisoners to get a monthly salary and a bonus ...


### 1.2 Basic Cleaning for Claims 

In [9]:
%%time
# CLEAN CLAIM DATA
# create a new dataframe of just claims
cleaned_claim = data.claim.to_frame()
# convert all string values to lower case
cleaned_claim = cleaned_claim.apply(lambda x: x.str.lower())
# replace new line with space
cleaned_claim = cleaned_claim.replace('\n', ' ', regex=True)
# get rid of all links
cleaned_claim = cleaned_claim.claim.replace(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', regex = True).to_frame()
# get rid of unicode hex
cleaned_claim = cleaned_claim.claim.replace({r'[^\x00-\x7F]+':''}, regex=True).to_frame()
# remove punctuation
cleaned_claim = cleaned_claim.claim.str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True).to_frame()
# remove misc items
cleaned_claim = cleaned_claim.replace(' — ', ' ', regex=True)
cleaned_claim = cleaned_claim.replace('-', ' ', regex=True)
cleaned_claim = cleaned_claim.replace('’', '', regex=True)
cleaned_claim = cleaned_claim.replace('‘', '', regex=True)
cleaned_claim = cleaned_claim.replace('”', '', regex=True)
cleaned_claim = cleaned_claim.replace('“', '', regex=True)
# replace consecutive spaces with just one space
cleaned_claim = cleaned_claim.replace(r'\s+', ' ', regex=True)

CPU times: user 464 ms, sys: 3.99 ms, total: 468 ms
Wall time: 467 ms


In [10]:
# concatenate cleaned_claims with label and article_array
cleaned_claim = pd.concat([cleaned_claim, data.label, data.article_array], axis=1)
# cleaned_claim now holds the claims that are cleaned, the label, and the article array
cleaned_claim.head()

Unnamed: 0,claim,label,article_array
0,a line from george orwells novel 1984 predicts...,0,"[122094, 122580, 130685, 134765]"
1,maine legislature candidate leslie gibson insu...,2,"[106868, 127320, 128060]"
2,a 17yearold girl named alyssa carson is being ...,1,"[132130, 132132, 149722]"
3,in 1988 author roald dahl penned an open lette...,2,"[123254, 123418, 127464]"
4,when it comes to fighting terrorism another th...,2,"[41099, 89899, 72543, 82644, 95344, 88361]"


### 1.3 Stemming, Stop Words and Tokenization 

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk
# the first time running - you may need to uncomment the next two lines to download the necessary packages
# (the download has to happen before the stop word list is loaded)
# nltk.download('stopwords')
# nltk.download('punkt')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

#### 1.3.1 Claims

In [12]:
# create a list of claims
claim_list = cleaned_claim.claim.tolist()

In [13]:
%%time
# tokenize every claim in the claim list generated from above
# the result is a list of tokenized claims: tokenized_claims
tokenized_claims = []
stemmed_claims = []
stemmed_sw_claims = []
for i in range(cleaned_claim.shape[0]):

    #--------------------------------------------------------------
    # stemming
    word_tokens = word_tokenize(claim_list[i])
    stemmed_tok_claims = []
    for w in word_tokens:
        stemmed_tok_claims.append(ps.stem(w))
    stemmed_string = ' '.join(stemmed_tok_claims)
    # stemmed_claims is a list of stemmed strings
    stemmed_claims.append(stemmed_string)
    
    #--------------------------------------------------------------
    # remove stop words
    word_tokens = word_tokenize(stemmed_claims[i])
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    stemmed_sw_string = ' '.join(filtered_sentence)
    # stemmed_sw_claims is a list of stemmed strings without stopwords
    stemmed_sw_claims.append(stemmed_sw_string)
        
    #--------------------------------------------------------------    
    # tokenize
    tokenized_ = word_tokenize(stemmed_sw_claims[i])
    tokenized_claims.append(tokenized_)
    
    # print progress
    progress = round((i/cleaned_claim.shape[0])*100,2)
    clear_output(wait=True)
    print("progress: " + str(progress) + "%")

progress: 99.99%
CPU times: user 34.2 s, sys: 2.3 s, total: 36.5 s
Wall time: 33.6 s
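
The loop above tokenizes each claim three times (before stemming, before stop word removal, and once more at the end). A minimal sketch of a single-pass variant is shown below. The function `process_text` is hypothetical, and it takes the stemmer and the stop word set as arguments so the example runs without NLTK; in the notebook these would be `ps.stem` and `stop_words`. It also assumes the text is already cleaned, so splitting on whitespace stands in for `word_tokenize`.

```python
def process_text(text, stem, stop_words):
    """Return (stemmed string, stemmed string without stop words, token list)."""
    # assumption: text is already cleaned, so whitespace splitting is a
    # reasonable stand-in for word_tokenize
    stemmed_tokens = [stem(w) for w in text.split()]
    kept_tokens = [w for w in stemmed_tokens if w not in stop_words]
    return ' '.join(stemmed_tokens), ' '.join(kept_tokens), kept_tokens

# toy stand-ins so the sketch runs without NLTK
def toy_stem(word):
    """Drop a trailing 's' from longer words (not a real stemmer)."""
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

toy_stop_words = {'a', 'the', 'of', 'from'}

stemmed, stemmed_sw, tokens = process_text(
    'a line from the novel predicts the power of smartphones',
    toy_stem, toy_stop_words)
print(stemmed)
print(stemmed_sw)
print(tokens)
```

One thing to keep in mind with either version: stop words are filtered after stemming, so a stop word whose stem differs from its list entry is kept. The outputs in this notebook show examples such as `doe`, `dure`, and `befor`. Filtering before stemming is one possible alternative.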


In [14]:
# list of tokenized claims
# tokenized_claims

#### Create Claims Dataframe

In [17]:
# zip together all the claim lists and create a dataframe
zipped_claims = list(zip(stemmed_claims, stemmed_sw_claims, tokenized_claims))
claims_ = pd.DataFrame(zipped_claims, columns = ['stemmed_claims', 'stemmed_stopword_claims', 'tokenized_claims'])

In [18]:
claims_.head()

Unnamed: 0,stemmed_claims,stemmed_stopword_claims,tokenized_claims
0,a line from georg orwel novel 1984 predict the...,line georg orwel novel 1984 predict power smar...,"[line, georg, orwel, novel, 1984, predict, pow..."
1,main legislatur candid lesli gibson insult par...,main legislatur candid lesli gibson insult par...,"[main, legislatur, candid, lesli, gibson, insu..."
2,a 17yearold girl name alyssa carson is be trai...,17yearold girl name alyssa carson train nasa b...,"[17yearold, girl, name, alyssa, carson, train,..."
3,in 1988 author roald dahl pen an open letter u...,1988 author roald dahl pen open letter urg par...,"[1988, author, roald, dahl, pen, open, letter,..."
4,when it come to fight terror anoth thing we kn...,come fight terror anoth thing know doe work ba...,"[come, fight, terror, anoth, thing, know, doe,..."


#### 1.3.2 Related Articles

In [15]:
%%time
# create a list of tokenized articles without stop words ~ slow: about 31 minutes in the run below
tokenized_articles = []
stemmed_art = []
stemmed_sw_art = []

for i in range(articles_cleaned.shape[0]):

    #--------------------------------------------------------------
    # stemming
    word_tokens = word_tokenize(articles_cleaned.Article[articles_cleaned.index[i]])
    stemmed_tok_art = []
    for w in word_tokens:
        stemmed_tok_art.append(ps.stem(w))
    stemmed_string = ' '.join(stemmed_tok_art)
    # stemmed_art is a list of stemmed strings
    stemmed_art.append(stemmed_string)
    
    #--------------------------------------------------------------
    # remove stop words
    word_tokens = word_tokenize(stemmed_art[i])
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    stemmed_sw_string = ' '.join(filtered_sentence)
    # stemmed_sw_art is a list of stemmed strings without stopwords
    stemmed_sw_art.append(stemmed_sw_string)
        
    #--------------------------------------------------------------    
    # tokenize
    tokenized_ = word_tokenize(stemmed_sw_art[i])
    tokenized_articles.append(tokenized_)
    
    # print progress
    progress = round((i/articles_cleaned.shape[0])*100,2)
    clear_output(wait=True)
    print("progress: " + str(progress) + "%")

progress: 100.0%
CPU times: user 31min 11s, sys: 20.9 s, total: 31min 32s
Wall time: 31min 17s


In [16]:
# list of tokenized related articles - below is showing only the first entry of the list
# tokenized_articles[0]

#### Create Related Articles DataFrame

In [19]:
# zip together all the articles and create a dataframe
zipped_articles = list(zip(stemmed_art, stemmed_sw_art, tokenized_articles))
articles_ = pd.DataFrame(zipped_articles, columns = ['stemmed_articles', 'stemmed_stopword_articles', 'tokenized_articles'])
# index the articles based on article ID
# (assign the index directly; wrapping it in a list would create a one-level MultiIndex)
articles_.index = articles_cleaned.index

In [20]:
articles_.head()

Unnamed: 0,stemmed_articles,stemmed_stopword_articles,tokenized_articles
131504,loreal abus monkey dure anim test dan crespo y...,loreal abus monkey dure anim test dan crespo y...,"[loreal, abus, monkey, dure, anim, test, dan, ..."
62141,remark by presid trump befor marin one departu...,remark presid trump befor marin one departur s...,"[remark, presid, trump, befor, marin, one, dep..."
55267,scott like rubio push on immigr reform or at l...,scott like rubio push immigr reform least part...,"[scott, like, rubio, push, immigr, reform, lea..."
150721,effect of fluorid in prevent cari in adult to ...,effect fluorid prevent cari adult date systema...,"[effect, fluorid, prevent, cari, adult, date, ..."
159743,prison to get a monthli salari and a bonu when...,prison get monthli salari bonu done serv time ...,"[prison, get, monthli, salari, bonu, done, ser..."


## 2. Date

In [21]:
# Convert date column to datetime format
data['new_date'] = pd.to_datetime(data['date'], dayfirst=True)
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array,new_date
0,0,A line from George Orwell's novel 1984 predict...,,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]",2017-07-17
1,1,Maine legislature candidate Leslie Gibson insu...,,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]",2018-03-17
2,2,A 17-year-old girl named Alyssa Carson is bein...,,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]",2018-07-18
3,3,In 1988 author Roald Dahl penned an open lette...,,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]",2019-02-04
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]",2016-03-22


In [22]:
# create new feature with consecutive days since January 1st, 1986
start_date = pd.Timestamp('1986-01-01')
data['cont_days'] = (data['new_date'] - start_date).dt.days

In [23]:
# Extract Year and Month as int features; they can be kept as int since they are ordinal

# Year
data['Year'] = data['new_date'].dt.year
# Month
data['Month'] = data['new_date'].dt.month
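
To sanity-check the three date features on a single value, the sketch below recomputes them with the standard library only. The function name `date_features` is hypothetical; the date format `%d/%m/%Y` is assumed from the `date` column shown earlier.

```python
from datetime import datetime

START = datetime(1986, 1, 1)

def date_features(date_str):
    """Parse a day-first date string and return (days since START, year, month)."""
    parsed = datetime.strptime(date_str, '%d/%m/%Y')
    return (parsed - START).days, parsed.year, parsed.month

print(date_features('17/07/2017'))  # (11520, 2017, 7)
print(date_features('22/03/2016'))  # (11038, 2016, 3)
```

Both results agree with the `cont_days`, `Year`, and `Month` values shown for rows 0 and 4 in the output of this section.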

In [24]:
# 4 columns at the end show the new_date (which is the date in a date format), the continuous days, 
# the year and month
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array,new_date,cont_days,Year,Month
0,0,A line from George Orwell's novel 1984 predict...,,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]",2017-07-17,11520,2017,7
1,1,Maine legislature candidate Leslie Gibson insu...,,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]",2018-03-17,11763,2018,3
2,2,A 17-year-old girl named Alyssa Carson is bein...,,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]",2018-07-18,11886,2018,7
3,3,In 1988 author Roald Dahl penned an open lette...,,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]",2019-02-04,12087,2019,2
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]",2016-03-22,11038,2016,3


## 3. Claimant 

In [25]:
# fill missing claimants with "Unknown"
data['claimant'] = data['claimant'].fillna('Unknown')

In [26]:
# Group all claimants with counts less than 100 into "Other"
claimant_count = data['claimant'].value_counts()
value_mask = data.claimant.isin(claimant_count.index[claimant_count < 100]) 
data.loc[value_mask,'claimant'] = "Other"
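
The grouping rule is easier to see on a toy example. The sketch below applies the same two steps (fill missing values, then relabel rare claimants) to a plain list with a smaller threshold; `group_rare` and its parameters are hypothetical names, and it uses `collections.Counter` in place of `value_counts`.

```python
from collections import Counter

def group_rare(values, threshold, other='Other', missing='Unknown'):
    """Fill missing entries, then replace labels seen fewer than `threshold` times."""
    filled = [missing if v is None else v for v in values]
    counts = Counter(filled)
    return [v if counts[v] >= threshold else other for v in filled]

claimants = ['A', 'A', 'A', None, None, 'B', 'C']
print(group_rare(claimants, threshold=2))
# ['A', 'A', 'A', 'Unknown', 'Unknown', 'Other', 'Other']
```

As in the notebook, missing values are filled before counting, so "Unknown" is treated like any other claimant when the threshold is applied.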

In [27]:
data.head()

Unnamed: 0.1,Unnamed: 0,claim,claimant,date,id,label,related_articles,article_array,new_date,cont_days,Year,Month
0,0,A line from George Orwell's novel 1984 predict...,Unknown,17/07/2017,0,0,"[122094, 122580, 130685, 134765]","[122094, 122580, 130685, 134765]",2017-07-17,11520,2017,7
1,1,Maine legislature candidate Leslie Gibson insu...,Unknown,17/03/2018,1,2,"[106868, 127320, 128060]","[106868, 127320, 128060]",2018-03-17,11763,2018,3
2,2,A 17-year-old girl named Alyssa Carson is bein...,Unknown,18/07/2018,4,1,"[132130, 132132, 149722]","[132130, 132132, 149722]",2018-07-18,11886,2018,7
3,3,In 1988 author Roald Dahl penned an open lette...,Unknown,04/02/2019,5,2,"[123254, 123418, 127464]","[123254, 123418, 127464]",2019-02-04,12087,2019,2
4,4,"When it comes to fighting terrorism, ""Another ...",Hillary Clinton,22/03/2016,6,2,"[41099, 89899, 72543, 82644, 95344, 88361]","[41099, 89899, 72543, 82644, 95344, 88361]",2016-03-22,11038,2016,3


## 4. Final Frame

There are 2 final dataframes: one for the data (claims, claimant, date, label, related_articles) and another for the related articles.

### 4.1 Final Data 

In [28]:
# concatenate all work done above to a single frame
final_data = pd.concat([data.claim, cleaned_claim.claim, claims_.stemmed_claims, claims_.stemmed_stopword_claims, claims_.tokenized_claims, data.claimant, data.new_date, data.cont_days, data.Year, data.Month, cleaned_claim.label, cleaned_claim.article_array], axis=1)
# rename columns for clarity
final_data.columns = ['raw_claim', 'cleaned_claim', 'stemmed_claims', 'stemmed_stopword_claims', 'tokenized_claim', 'claimant', 'date', 'cont_days', 'year', 'month', 'label', 'article_array']

In [29]:
# this is equivalent to the "train.csv" that we were given, but cleaned and with a few additional features
final_data.head()

Unnamed: 0,raw_claim,cleaned_claim,stemmed_claims,stemmed_stopword_claims,tokenized_claim,claimant,date,cont_days,year,month,label,article_array
0,A line from George Orwell's novel 1984 predict...,a line from george orwells novel 1984 predicts...,a line from georg orwel novel 1984 predict the...,line georg orwel novel 1984 predict power smar...,"[line, georg, orwel, novel, 1984, predict, pow...",Unknown,2017-07-17,11520,2017,7,0,"[122094, 122580, 130685, 134765]"
1,Maine legislature candidate Leslie Gibson insu...,maine legislature candidate leslie gibson insu...,main legislatur candid lesli gibson insult par...,main legislatur candid lesli gibson insult par...,"[main, legislatur, candid, lesli, gibson, insu...",Unknown,2018-03-17,11763,2018,3,2,"[106868, 127320, 128060]"
2,A 17-year-old girl named Alyssa Carson is bein...,a 17yearold girl named alyssa carson is being ...,a 17yearold girl name alyssa carson is be trai...,17yearold girl name alyssa carson train nasa b...,"[17yearold, girl, name, alyssa, carson, train,...",Unknown,2018-07-18,11886,2018,7,1,"[132130, 132132, 149722]"
3,In 1988 author Roald Dahl penned an open lette...,in 1988 author roald dahl penned an open lette...,in 1988 author roald dahl pen an open letter u...,1988 author roald dahl pen open letter urg par...,"[1988, author, roald, dahl, pen, open, letter,...",Unknown,2019-02-04,12087,2019,2,2,"[123254, 123418, 127464]"
4,"When it comes to fighting terrorism, ""Another ...",when it comes to fighting terrorism another th...,when it come to fight terror anoth thing we kn...,come fight terror anoth thing know doe work ba...,"[come, fight, terror, anoth, thing, know, doe,...",Hillary Clinton,2016-03-22,11038,2016,3,2,"[41099, 89899, 72543, 82644, 95344, 88361]"


In [30]:
# save to csv
final_data.to_csv("final_data.csv")

#### 4.1.1 How to work with final_data.csv

In [31]:
# to go to a specific cleaned claim
final_data.cleaned_claim.loc[0]

'a line from george orwells novel 1984 predicts the power of smartphones'

In [32]:
# to go to a specific stemmed_stopword_claims
final_data.stemmed_stopword_claims.loc[0]

'line georg orwel novel 1984 predict power smartphon'

In [33]:
# sample code to read elements from each article_array

# # to iterate the article array
# for i in range(final_data.shape[0]):
#     # i iterates row by row till the end
#     for u in range(len(final_data.article_array[i])):
#         # u holds the index of each element, within each array. Uncomment the following to understand
#         # print(u)
#         art_array = final_data.article_array[i]
#         # print specific elements of each array
#         print(art_array[u])
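
Building on the iteration sample above, a common next step is to look up the text of each related article for a claim. The sketch below shows one way to do this with toy data; `related_texts`, `toy_arrays`, and `toy_articles` are hypothetical stand-ins for `final_data.article_array` and `article_dict`. It assumes the lookup is keyed by the article ID as a string, which is how `article_dict` is built from the file names.

```python
def related_texts(article_ids, articles_by_id):
    """Return the text of every related article that exists in the lookup."""
    texts = []
    for article_id in article_ids:
        key = str(article_id)  # article_dict is keyed by the file name stem, a string
        if key in articles_by_id:
            texts.append(articles_by_id[key])
    return texts

# toy stand-ins for final_data.article_array and article_dict
toy_arrays = [[122094, 122580], [106868]]
toy_articles = {'122094': 'first article', '106868': 'second article'}

for ids in toy_arrays:
    print(related_texts(ids, toy_articles))
```

IDs without a matching article are skipped here rather than raising an error; whether that is the right behavior depends on how complete the article folder is.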

### 4.2 Related Articles Dataframe 

In [34]:
raw_article_list = articles.Article.tolist()
cleaned_article_list = articles_cleaned.Article.tolist()
final_articles_zipped = list(zip(raw_article_list, cleaned_article_list, stemmed_art, stemmed_sw_art, tokenized_articles))
final_articles = pd.DataFrame(final_articles_zipped, columns = ['raw_articles', 'cleaned_articles', 'stemmed_articles', 'stemmed_stopwords_articles', 'tokenized_articles'])
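# note: wrapping the index in a list creates a one-level MultiIndex, which is why
# reset_index() below produces a column named level_0
final_articles.index = [articles_cleaned.index]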

In [35]:
final_articles.head()

Unnamed: 0,raw_articles,cleaned_articles,stemmed_articles,stemmed_stopwords_articles,tokenized_articles
131504,L'oreal abusing monkey during animal testing\n...,loreal abusing monkey during animal testing da...,loreal abus monkey dure anim test dan crespo y...,loreal abus monkey dure anim test dan crespo y...,"[loreal, abus, monkey, dure, anim, test, dan, ..."
62141,Remarks by President Trump Before Marine One D...,remarks by president trump before marine one d...,remark by presid trump befor marin one departu...,remark presid trump befor marin one departur s...,"[remark, presid, trump, befor, marin, one, dep..."
55267,Scott likes Rubio's push on immigration reform...,scott likes rubios push on immigration reform ...,scott like rubio push on immigr reform or at l...,scott like rubio push immigr reform least part...,"[scott, like, rubio, push, immigr, reform, lea..."
150721,Effectiveness of fluoride in preventing caries...,effectiveness of fluoride in preventing caries...,effect of fluorid in prevent cari in adult to ...,effect fluorid prevent cari adult date systema...,"[effect, fluorid, prevent, cari, adult, date, ..."
159743,Prisoners to get a monthly salary and a bonus ...,prisoners to get a monthly salary and a bonus ...,prison to get a monthli salari and a bonu when...,prison get monthli salari bonu done serv time ...,"[prison, get, monthli, salari, bonu, done, ser..."


In [36]:
# reset the index so the article ID becomes a regular column, then save to a local csv file
final_articles2 = final_articles.reset_index()
final_articles2.to_csv("final_articles.csv")

#### 4.2.1 How to work with final_articles.csv

In [37]:
# import the csv
aaron = pd.read_csv('final_articles.csv')
aaron.head()

Unnamed: 0.1,Unnamed: 0,level_0,raw_articles,cleaned_articles,stemmed_articles,stemmed_stopwords_articles,tokenized_articles
0,0,131504,L'oreal abusing monkey during animal testing\n...,loreal abusing monkey during animal testing da...,loreal abus monkey dure anim test dan crespo y...,loreal abus monkey dure anim test dan crespo y...,"['loreal', 'abus', 'monkey', 'dure', 'anim', '..."
1,1,62141,Remarks by President Trump Before Marine One D...,remarks by president trump before marine one d...,remark by presid trump befor marin one departu...,remark presid trump befor marin one departur s...,"['remark', 'presid', 'trump', 'befor', 'marin'..."
2,2,55267,Scott likes Rubio's push on immigration reform...,scott likes rubios push on immigration reform ...,scott like rubio push on immigr reform or at l...,scott like rubio push immigr reform least part...,"['scott', 'like', 'rubio', 'push', 'immigr', '..."
3,3,150721,Effectiveness of fluoride in preventing caries...,effectiveness of fluoride in preventing caries...,effect of fluorid in prevent cari in adult to ...,effect fluorid prevent cari adult date systema...,"['effect', 'fluorid', 'prevent', 'cari', 'adul..."
4,4,159743,Prisoners to get a monthly salary and a bonus ...,prisoners to get a monthly salary and a bonus ...,prison to get a monthli salari and a bonu when...,prison get monthli salari bonu done serv time ...,"['prison', 'get', 'monthli', 'salari', 'bonu',..."
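
Note that in the output above the `tokenized_articles` values are quoted: after a round trip through csv, each token list comes back as a string rather than a Python list. A minimal sketch of converting one such value back, using only the standard library:

```python
import ast

cell = "['loreal', 'abus', 'monkey']"  # a token list as stored in the csv
tokens = ast.literal_eval(cell)        # back to a real Python list

print(type(cell).__name__, '->', type(tokens).__name__)
print(tokens)
```

Applied to the whole column this would look something like `aaron['tokenized_articles'].apply(ast.literal_eval)`. The `article_array` column in `final_data.csv` is a different case: numpy arrays are likely written without commas, so the `str2array` helper from the Preliminaries is probably the better fit there.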


In [38]:
# I prefer to reindex based on level_0 so that I can call each article directly as follows
aaron.index = [aaron.level_0]
aaron.head()

level_0,Unnamed: 0,level_0,raw_articles,cleaned_articles,stemmed_articles,stemmed_stopwords_articles,tokenized_articles
131504,0,131504,L'oreal abusing monkey during animal testing\n...,loreal abusing monkey during animal testing da...,loreal abus monkey dure anim test dan crespo y...,loreal abus monkey dure anim test dan crespo y...,"['loreal', 'abus', 'monkey', 'dure', 'anim', '..."
62141,1,62141,Remarks by President Trump Before Marine One D...,remarks by president trump before marine one d...,remark by presid trump befor marin one departu...,remark presid trump befor marin one departur s...,"['remark', 'presid', 'trump', 'befor', 'marin'..."
55267,2,55267,Scott likes Rubio's push on immigration reform...,scott likes rubios push on immigration reform ...,scott like rubio push on immigr reform or at l...,scott like rubio push immigr reform least part...,"['scott', 'like', 'rubio', 'push', 'immigr', '..."
150721,3,150721,Effectiveness of fluoride in preventing caries...,effectiveness of fluoride in preventing caries...,effect of fluorid in prevent cari in adult to ...,effect fluorid prevent cari adult date systema...,"['effect', 'fluorid', 'prevent', 'cari', 'adul..."
159743,4,159743,Prisoners to get a monthly salary and a bonus ...,prisoners to get a monthly salary and a bonus ...,prison to get a monthli salari and a bonu when...,prison get monthli salari bonu done serv time ...,"['prison', 'get', 'monthli', 'salari', 'bonu',..."


In [39]:
# to go to a specific cleaned article
aaron.cleaned_articles.loc[75770].iloc[0]

'charterschool bill deserves action ohio senate president keith faber who is scouting a run for state auditor or attorney general in two years has a chance to demonstrate an affinity for defending ohioans against misspending and fraud by advancing a charterschool accountability bill ohio senate president keith faber who is scouting a run for state auditor or attorney general in two years has a chance to demonstrate an affinity for defending ohioans against misspending and fraud by advancing a charterschool accountability bill as it stands it appears that fabers senate is playing games with senate bill 298 a measure to ensure that charter schools are actually educating students this crucial reform legislation was oddly assigned to the senate finance committee rather than being sent to the education committee where sen peggy lehner rkettering had made it clear that shed fasttrack the bill senate minority leader joe schiavoni dboardman reasonably suspects his bill is being slotted for ina

In [40]:
# to go to a specific raw article
aaron.raw_articles.loc[74306].iloc[0]

'Defining \'reasonable compensation\'\nState House\n\nWhen Andy Sanborn started his first company, he recalls sleeping on the floor on an air mattress eating ramen noodles. Now, Sanborn owns The Draft in Concord, along with a couple of real estate companies. He won\'t say how much money he makes, but he is outraged at the thought of the government telling him what salary is considered "reasonable."\n\n"Should government have the right and authority to tell people how much money they can make?" Sanborn said. "It\'s insulting. If my wife works at Concord Hospital and makes $300,000 a year, no one blinks. If I make $300,000, I have to justify it to New Hampshire."\n\nFor decades, business owners have been required to calculate "reasonable compensation" for federal and state tax purposes. But as state legislators work to clarify those laws, the discussion about what is reasonable compensation has come to the forefront. To the chagrin of those involved in the technicalities of tax law, the 