# Data Cleaning and Processing
- This small data set uses only 1 folder from the NYT Corpus [Articles from Jan 1 2007]
- This is to make sure that the script works before scaling up to the entire corpus

This script focuses mainly on extracting and cleaning up the data, then importing it into a pandas DataFrame. From there I can begin to manipulate the data into a form that will be more useful for the sentiment analysis.  
  
Because the NYT Corpus is already annotated and has a well-defined structure, cleaning is less of an issue than extracting the desired data for processing.

In [1]:
# import libraries
import xml.etree.ElementTree as Et
import glob
import pandas as pd
import nltk

In [2]:
# create new dataframe with empty data
columns = ['DOCID', 'Date', 'Month', 'Year', 'Name', 'Text']
data = pd.DataFrame(columns=columns)

#### Append data to DataFrames
Directly from the XML files, I want to retrieve data for the following columns: document ID, date, month, year, mentioned names, and article text.  
Each mentioned person is then appended to a pandas DataFrame as a separate row.

In [3]:
# open each xml file in the specified folder and add a row to the dataframe for each mentioned person
for file in glob.glob("../data/NYT Corpus/nyt_corpus/data/2007/01/01/*.xml"):
    # parse the xml file into an element tree to extract data
    tree = Et.parse(file)
    root = tree.getroot()
    
    # get document id information (not sure if I need this yet, seems like it could be helpful)
    docid = root.find('.//doc-id[@id-string]').attrib['id-string']
    
    # get publication date information
    date = root.find(".//meta[@name='publication_day_of_month']").attrib['content']
    month = root.find(".//meta[@name='publication_month']").attrib['content']
    year = root.find(".//meta[@name='publication_year']").attrib['content']
    
    # get article text information
    # note: find() returns only the first <p> of the full_text block
    # some articles seem to lack text - this is caught and handled in the if/else
    article = root.find(".//block[@class='full_text']/p")
    if article is not None and article.text is not None:
        text = article.text.lower()
    else:
        text = None
        
    # for each person mentioned, create a new row of data for them in the dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    for c in root.iter('person'):
        name = str(c.text).upper()
        row = pd.DataFrame([{'DOCID': docid, 'Date': date, 'Month': month, 'Year': year, 'Name': name, 'Text': text}])
        data = pd.concat([data, row])
data.head()

,DOCID,Date,Month,Year,Name,Text
0,1815718,1,1,2007,"BLUMENTHAL, MARTIN",blumenthal--martin. a new york business man an...
0,1815719,1,1,2007,"BRADLEY, CAROL L.","bradley--carol l., 84, of tinton falls, nj die..."
0,1815720,1,1,2007,"CRAWFORD, PERRY JR.","crawford--perry jr., died at 89 on december 13..."
0,1815721,1,1,2007,"FLOOD, ROBERT FRANCIS","flood--robert francis, husband of the late cat..."
0,1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)","geisler--enid (friedman), on december 29, 2006..."
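
When scaling up to the entire corpus, growing the DataFrame one row at a time will probably become slow. A minimal sketch of an alternative is below: collect plain row dictionaries first and build the DataFrame once at the end. The sample XML is a hypothetical, heavily simplified stand-in for a real corpus file, and `meta_content` and `extract_rows` are illustrative helper names rather than part of the script above. Unlike the loop above, this sketch joins every paragraph of the full text instead of keeping only the first.

```python
import xml.etree.ElementTree as Et

# Hypothetical, heavily simplified stand-in for one NYT Corpus file;
# real files contain many more elements.
SAMPLE_XML = """
<nitf>
  <head>
    <meta name="publication_day_of_month" content="1"/>
    <meta name="publication_month" content="1"/>
    <meta name="publication_year" content="2007"/>
    <docdata>
      <doc-id id-string="0000001"/>
      <identified-content>
        <person>Doe, Jane</person>
        <person>Roe, Richard</person>
      </identified-content>
    </docdata>
  </head>
  <body>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>
"""


def meta_content(root, name):
    """Return the content attribute of a <meta> tag, or None if it is absent."""
    node = root.find(f".//meta[@name='{name}']")
    return node.get('content') if node is not None else None


def extract_rows(root):
    """Build one row dict per <person> element in a parsed article."""
    docid_node = root.find('.//doc-id[@id-string]')
    docid = docid_node.get('id-string') if docid_node is not None else None

    # join every paragraph of the full text, not only the first one
    paragraphs = root.findall(".//block[@class='full_text']/p")
    text = ' '.join((p.text or '').strip() for p in paragraphs).lower() or None

    rows = []
    for person in root.iter('person'):
        rows.append({
            'DOCID': docid,
            'Date': meta_content(root, 'publication_day_of_month'),
            'Month': meta_content(root, 'publication_month'),
            'Year': meta_content(root, 'publication_year'),
            'Name': str(person.text).upper(),
            'Text': text,
        })
    return rows


rows = extract_rows(Et.fromstring(SAMPLE_XML))
```

With real files, the rows from every article could be gathered into one list and passed to `pd.DataFrame(rows, columns=columns)` in a single step.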


#### Start tweaking the DataFrame to make it more useful

In [4]:
# make the docid the index
data = data.set_index('DOCID')
data.head()

DOCID,Date,Month,Year,Name,Text
1815718,1,1,2007,"BLUMENTHAL, MARTIN",blumenthal--martin. a new york business man an...
1815719,1,1,2007,"BRADLEY, CAROL L.","bradley--carol l., 84, of tinton falls, nj die..."
1815720,1,1,2007,"CRAWFORD, PERRY JR.","crawford--perry jr., died at 89 on december 13..."
1815721,1,1,2007,"FLOOD, ROBERT FRANCIS","flood--robert francis, husband of the late cat..."
1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)","geisler--enid (friedman), on december 29, 2006..."


#### Create a new column with parsed text
This creates a new column for each person that contains the text of the article they are mentioned in, tokenized with nltk.word_tokenize. I can then use this column later for further analysis.

In [5]:
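# create function to tokenize the Text
# note: nltk.word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first
def tokenizeText(col):
    # articles without text are stored as missing values; return an empty
    # token list for them instead of tokenizing the string "None"
    if pd.isna(col):
        return []
    return nltk.word_tokenize(str(col))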

data['Tokenized'] = data['Text'].apply(tokenizeText)
data.head()

DOCID,Date,Month,Year,Name,Text,Tokenized
1815718,1,1,2007,"BLUMENTHAL, MARTIN",blumenthal--martin. a new york business man an...,"[blumenthal, --, martin, ., a, new, york, busi..."
1815719,1,1,2007,"BRADLEY, CAROL L.","bradley--carol l., 84, of tinton falls, nj die...","[bradley, --, carol, l., ,, 84, ,, of, tinton,..."
1815720,1,1,2007,"CRAWFORD, PERRY JR.","crawford--perry jr., died at 89 on december 13...","[crawford, --, perry, jr., ,, died, at, 89, on..."
1815721,1,1,2007,"FLOOD, ROBERT FRANCIS","flood--robert francis, husband of the late cat...","[flood, --, robert, francis, ,, husband, of, t..."
1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)","geisler--enid (friedman), on december 29, 2006...","[geisler, --, enid, (, friedman, ), ,, on, dec..."


## Future Processing Tasks
The next task is to process the tokenized values and strip out any stop words or words irrelevant to sentiment. I also need to look into bringing in another corpus that has words associated with positive/negative/neutral sentiment in order to make decisions about the words that remain. Additionally, I am noticing duplicate names being entered. They are slightly different strings, so simply running set() won't remove them. I need a way to standardize the names so that they can fold into each other nicely.
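
As a rough starting point for two of these tasks, here is a minimal sketch of stop-word filtering and name standardization. Everything in it is an assumption for illustration: the tiny `STOP_WORDS` set stands in for a fuller list (such as the one that ships with NLTK), and `remove_stop_words` and `normalize_name` are hypothetical helpers, not existing functions. The normalization is deliberately lossy (it drops parenthesized parts and any character outside A-Z), so it would need checking against the real duplicate names before being relied on.

```python
import re

# Illustrative stop-word list only (an assumption); a fuller list would
# likely replace it.
STOP_WORDS = {'a', 'an', 'and', 'at', 'in', 'of', 'on', 'the', 'to'}


def remove_stop_words(tokens):
    """Drop stop words and tokens with no letters or digits (e.g. '--', ',')."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and any(ch.isalnum() for ch in t)]


def normalize_name(name):
    """Collapse case, punctuation, and spacing differences in a name string."""
    name = name.upper()
    name = re.sub(r'\([^)]*\)', ' ', name)  # drop parenthesized parts
    name = re.sub(r'[^A-Z, ]', ' ', name)   # drop periods and other characters (ASCII-only)
    name = re.sub(r'\s*,\s*', ', ', name)   # exactly one space after each comma
    return re.sub(r'\s+', ' ', name).strip()


tokens = ['flood', '--', 'robert', 'francis', ',', 'husband', 'of', 'the', 'late']
filtered = remove_stop_words(tokens)
same = normalize_name('Crawford,  Perry Jr.') == normalize_name('CRAWFORD, PERRY JR.')
```

Functions like these could be applied with `data['Tokenized'].apply(remove_stop_words)` and `data['Name'].apply(normalize_name)`, after which names that differ only in case, spacing, or punctuation would compare equal.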

## Data Sharing Plan
Because this is a paid corpus, I am thinking of making a single XML file available publicly, while making a month's worth of data available to the class, since they are also associated with Pitt. I could potentially make the entire corpus available to the class, but that would be a very large data set.