# Data Cleaning and Processing
- This small data set uses only 1 folder from the NYT Corpus [Articles from Jan 1 2007]
- This is to make sure that the script works before scaling up to the entire corpus

This script focuses mainly on extracting and cleaning the data, then importing it into a pandas DataFrame. From there I can begin manipulating the data into a form that will be more useful for the sentiment analysis.  
  
Because the NYT Corpus is already annotated and has a well-defined structure, cleaning is less of an issue than extracting the desired data for processing.

For Progress Report 2, I'm using my existing script and adding onto it.

In [1]:
# from IPython.core.interactiveshell import InteractiveShell
# InteractiveShell.ast_node_interactivity = "all"

In [2]:
# import libraries
import xml.etree.ElementTree as Et
import glob
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
import matplotlib.pyplot as plt

In [3]:
# create new dataframe with empty data
columns = ['DOCID', 'Date', 'Month', 'Year', 'Name', 'Text']
data = pd.DataFrame(columns=columns)

#### Append data to DataFrames
Directly from the XML files, I want to retrieve data for the following columns: document ID, date, month, year, mentioned names, article text.  
This data is to be appended to a pandas DataFrame.

In [4]:
# open each xml file in the specified folder and add one row per mentioned person to the dataframe
for file in glob.glob("../data/NYT Corpus/nyt_corpus/data/2007/01/*/*.xml"):
    # parse the xml file into an element tree to extract data
    tree = Et.parse(file)
    root = tree.getroot()
    
    # get document id information (not sure if I need this yet, seems like it could be helpful)
    docid = root.find('.//doc-id[@id-string]').attrib['id-string']
    
    # get publication date information
    date = root.find(".//meta[@name='publication_day_of_month']").attrib['content']
    month = root.find(".//meta[@name='publication_month']").attrib['content']
    year = root.find(".//meta[@name='publication_year']").attrib['content']
    
    # get article text information
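    # some articles seem to lack text - this is caught and handled in the if/else
    # note: find() returns only the first <p> of the full_text block
    article = root.find(".//block[@class='full_text']/p")
    # also guard against an empty <p>, whose .text is None
    if article is not None and article.text is not None:
        text = article.text.lower()
    else:
        text = None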
        
    # for each person mentioned, create a new row of data for them in the dataframe    
    for c in root.iter('person'):
        name = str(c.text).upper()
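        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        row = pd.DataFrame([{'DOCID': docid, 'Date': date, 'Month': month, 'Year': year, 'Name': name, 'Text': text}])
        data = pd.concat([data, row])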
data.head()

,DOCID,Date,Month,Year,Name,Text
0,1815718,1,1,2007,"BLUMENTHAL, MARTIN",blumenthal--martin. a new york business man an...
0,1815719,1,1,2007,"BRADLEY, CAROL L.","bradley--carol l., 84, of tinton falls, nj die..."
0,1815720,1,1,2007,"CRAWFORD, PERRY JR.","crawford--perry jr., died at 89 on december 13..."
0,1815721,1,1,2007,"FLOOD, ROBERT FRANCIS","flood--robert francis, husband of the late cat..."
0,1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)","geisler--enid (friedman), on december 29, 2006..."
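
Growing a DataFrame one row at a time copies it on every iteration, and `DataFrame.append` was removed in pandas 2.0. One possible alternative, sketched below, is to collect plain dictionaries and build the frame once at the end. The XML string is a hypothetical, heavily simplified stand-in for an NYT Corpus file, and `extract_rows` is an illustrative helper name that is not part of the script above. The sketch also joins every `<p>` of the full text rather than keeping only the first one, which is an assumption about what is wanted.

```python
import xml.etree.ElementTree as Et

# Hypothetical, simplified stand-in for one NYT Corpus document.
SAMPLE_XML = """
<nitf>
  <head>
    <meta name="publication_day_of_month" content="1"/>
    <meta name="publication_month" content="1"/>
    <meta name="publication_year" content="2007"/>
    <docdata>
      <doc-id id-string="0000001"/>
      <identified-content>
        <person>Doe, Jane</person>
        <person>Roe, Richard</person>
      </identified-content>
    </docdata>
  </head>
  <body>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>
"""


def extract_rows(root):
    """Return one dict per mentioned person (illustrative helper)."""
    def meta(name):
        return root.find(f".//meta[@name='{name}']").attrib['content']

    docid = root.find('.//doc-id[@id-string]').attrib['id-string']

    # findall() collects every <p>; join them so later paragraphs are kept
    paragraphs = root.findall(".//block[@class='full_text']/p")
    text = ' '.join((p.text or '') for p in paragraphs).lower() or None

    return [{'DOCID': docid,
             'Date': meta('publication_day_of_month'),
             'Month': meta('publication_month'),
             'Year': meta('publication_year'),
             'Name': (person.text or '').upper(),
             'Text': text}
            for person in root.iter('person')]


rows = extract_rows(Et.fromstring(SAMPLE_XML))

# With pandas available, the full table can then be built in one step:
#   data = pd.DataFrame(rows, columns=['DOCID', 'Date', 'Month', 'Year', 'Name', 'Text'])
```

When looping over many files, the lists returned for each file can be concatenated with `rows.extend(...)` before the single `pd.DataFrame` call.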


#### Start tweaking the DataFrame to make it more useful

In [5]:
# make the docid the index
data = data.set_index('DOCID')
data.head()

DOCID,Date,Month,Year,Name,Text
1815718,1,1,2007,"BLUMENTHAL, MARTIN",blumenthal--martin. a new york business man an...
1815719,1,1,2007,"BRADLEY, CAROL L.","bradley--carol l., 84, of tinton falls, nj die..."
1815720,1,1,2007,"CRAWFORD, PERRY JR.","crawford--perry jr., died at 89 on december 13..."
1815721,1,1,2007,"FLOOD, ROBERT FRANCIS","flood--robert francis, husband of the late cat..."
1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)","geisler--enid (friedman), on december 29, 2006..."


#### Create a new column with parsed text
This creates a new column for each person containing the text of the article they are mentioned in, tokenized with `nltk.word_tokenize`. I can then use this column later for further analysis.

In [6]:
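# create function to tokenize the Text
def tokenizeText(col):
    # articles without text are stored as None; give them an empty token list
    # instead of tokenizing the string "None"
    if pd.isna(col):
        return []
    return nltk.word_tokenize(str(col))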

data['Tokenized'] = data['Text'].apply(tokenizeText)
data.head()

DOCID,Date,Month,Year,Name,Text,Tokenized
1815718,1,1,2007,"BLUMENTHAL, MARTIN",blumenthal--martin. a new york business man an...,"[blumenthal, --, martin, ., a, new, york, busi..."
1815719,1,1,2007,"BRADLEY, CAROL L.","bradley--carol l., 84, of tinton falls, nj die...","[bradley, --, carol, l., ,, 84, ,, of, tinton,..."
1815720,1,1,2007,"CRAWFORD, PERRY JR.","crawford--perry jr., died at 89 on december 13...","[crawford, --, perry, jr., ,, died, at, 89, on..."
1815721,1,1,2007,"FLOOD, ROBERT FRANCIS","flood--robert francis, husband of the late cat...","[flood, --, robert, francis, ,, husband, of, t..."
1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)","geisler--enid (friedman), on december 29, 2006...","[geisler, --, enid, (, friedman, ), ,, on, dec..."


#### Right now, I want to be able to do the sentiment analysis on individual people
Once I can do sentiment analysis on individual people, I can broaden it to multiple people. When the script is later modified to process the entire NYT Corpus, I can aggregate frequent names and then analyze those names over time. Baby steps for now, I suppose.

In [7]:
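# make a function that removes stopwords and punctuation tokens
# (note: this name shadows Python's built-in filter for the rest of the notebook)
def filter(toks):
    sw = set(stopwords.words('english'))
    # '--' is not caught by the punctuation check below, so list it explicitly
    others = ['--']
    filtered = [w for w in toks
                if w not in sw
                and w not in string.punctuation
                and w not in others]
    return filtered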


t = data.iloc[0]['Tokenized']
filter(t)

['blumenthal',
 'martin',
 'new',
 'york',
 'business',
 'man',
 'philanthropist',
 'died',
 'last',
 'saturday',
 'manhattan',
 'long',
 'illness',
 '90.',
 'mr.',
 'blumenthal',
 'born',
 'frankfurt',
 'germany',
 'immigrated',
 'new',
 'york',
 'city',
 '1935.',
 'president',
 'a.j',
 'hollander',
 'company',
 'commodities',
 'trading',
 'firm',
 'retirement',
 'devoted',
 'philanthropic',
 'activities',
 'trustee',
 'ymha',
 'served',
 'chairman',
 'bezalel',
 'charitable',
 'organization',
 'supports',
 'arts',
 'israel',
 'also',
 'active',
 'human',
 'rights',
 'watch',
 'mr.',
 'blumenthal',
 'survived',
 'wife',
 'sallie',
 'blumenthal',
 'children',
 'richard',
 'greenwich',
 'david',
 'boston',
 'six',
 'grandchildren',
 'brother',
 'fred',
 'sister',
 'edith',
 'levisohn',
 'first',
 'wife',
 'jane',
 'died',
 '1969.',
 'funeral',
 'take',
 'place',
 'riverside',
 'chapel',
 '11:30am',
 'tuesday',
 'january',
 '2nd',
 '6',
 'blumenthal',
 'martin',
 'park',
 'avenue',
 'syn
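
One subtlety in the filter above: `string.punctuation` is a single string, so `w in string.punctuation` is a substring test rather than a membership test, which is why `--` has to be listed separately. A possible alternative, sketched here with a small hypothetical stopword set in place of NLTK's list, drops any token made up entirely of punctuation characters. The names `is_punctuation` and `remove_noise` are illustrative only.

```python
import string

PUNCT = set(string.punctuation)

# Hypothetical miniature stopword list; the notebook uses stopwords.words('english').
STOPWORDS = {'a', 'an', 'and', 'of', 'on', 'the'}


def is_punctuation(token):
    """True if the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


def remove_noise(tokens, stopwords=STOPWORDS):
    """Drop stopwords and punctuation-only tokens, keeping the original order."""
    return [t for t in tokens if t not in stopwords and not is_punctuation(t)]


tokens = ['blumenthal', '--', 'martin', '.', 'a', 'new', 'york', '...']
cleaned = remove_noise(tokens)
```

Tokens that merely end in punctuation, such as `90.` or `1935.` in the output above, are left untouched by this sketch.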

In [8]:
# map my new filter function to the dataframe
# this will clean up the Tokenized column to do work on
data["Tokenized"] = data["Tokenized"].apply(filter)
data.head()

DOCID,Date,Month,Year,Name,Text,Tokenized
1815718,1,1,2007,"BLUMENTHAL, MARTIN",blumenthal--martin. a new york business man an...,"[blumenthal, martin, new, york, business, man,..."
1815719,1,1,2007,"BRADLEY, CAROL L.","bradley--carol l., 84, of tinton falls, nj die...","[bradley, carol, l., 84, tinton, falls, nj, di..."
1815720,1,1,2007,"CRAWFORD, PERRY JR.","crawford--perry jr., died at 89 on december 13...","[crawford, perry, jr., died, 89, december, 13t..."
1815721,1,1,2007,"FLOOD, ROBERT FRANCIS","flood--robert francis, husband of the late cat...","[flood, robert, francis, husband, late, cather..."
1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)","geisler--enid (friedman), on december 29, 2006...","[geisler, enid, friedman, december, 29, 2006.,..."


#### I'm going to try NLTK's VADER SentimentIntensityAnalyzer
Because this returns intensity scores, I can possibly use early data to improve the analyzer later on. I have an idea for how to try this, but will come back to it later.

In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# data.iloc[0]['Tokenized']
sia.polarity_scores(" ".join(data.iloc[0]['Tokenized']))



{'compound': 0.8934, 'neg': 0.109, 'neu': 0.688, 'pos': 0.203}

In [10]:
# create a new dataframe with the polarities
columns = ['DOCID', 'Date', 'Month', 'Year', 'Name', 'COM', 'NEG', 'NEU', 'POS']
rows = []

# score the raw article text of every row; the index of `data` is the DOCID
for docid, row in data.iterrows():
    scores = sia.polarity_scores(str(row['Text']))
    rows.append({'DOCID': docid, 'Date': row['Date'], 'Month': row['Month'],
                 'Year': row['Year'], 'Name': row['Name'],
                 'COM': scores['compound'], 'NEG': scores['neg'],
                 'NEU': scores['neu'], 'POS': scores['pos']})

# build the frame once instead of appending row by row
polarities = pd.DataFrame(rows, columns=columns).set_index('DOCID')
polarities

DOCID,Date,Month,Year,Name,COM,NEG,NEU,POS
1815718,1,1,2007,"BLUMENTHAL, MARTIN",0.8934,0.078,0.775,0.147
1815719,1,1,2007,"BRADLEY, CAROL L.",0.9186,0.058,0.675,0.267
1815720,1,1,2007,"CRAWFORD, PERRY JR.",0.0000,0.083,0.833,0.083
1815721,1,1,2007,"FLOOD, ROBERT FRANCIS",0.4404,0.020,0.944,0.036
1815722,1,1,2007,"GEISLER, ENID (FRIEDMAN)",0.7783,0.023,0.893,0.085
1815723,1,1,2007,"GIUDICE, EMILY",-0.4215,0.151,0.767,0.082
1815724,1,1,2007,"HIRSCH, TRUDE",0.9118,0.042,0.673,0.285
1815725,1,1,2007,"KERRIGAN, MARGARET H. M. (MIMI)",0.5423,0.049,0.835,0.117
1815726,1,1,2007,"KLEIN, ABRAHAM E., PH.D.",0.8689,0.063,0.702,0.235
1815727,1,1,2007,"LONG, WILLIAM ALBERS OF POTOMAC, MD",0.9287,0.046,0.845,0.109
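
To make the compound scores easier to compare, they can be bucketed into labels. The sketch below uses cutoffs of ±0.05, which follow the convention suggested in the VADER documentation; treat them as an assumption to tune rather than a fixed rule. `label_compound` is an illustrative helper, and the three scores are copied from the table above.

```python
# Cutoffs follow the convention suggested in the VADER documentation;
# they are an assumption here and may need tuning for obituaries.
POS_THRESHOLD = 0.05
NEG_THRESHOLD = -0.05


def label_compound(score):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if score >= POS_THRESHOLD:
        return 'positive'
    if score <= NEG_THRESHOLD:
        return 'negative'
    return 'neutral'


# Compound scores taken from the polarities table above.
compound_by_name = {
    'BLUMENTHAL, MARTIN': 0.8934,
    'CRAWFORD, PERRY JR.': 0.0,
    'GIUDICE, EMILY': -0.4215,
}

labels = {name: label_compound(score) for name, score in compound_by_name.items()}
```

On the DataFrame itself, the same function could be applied with `polarities['COM'].apply(label_compound)`.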


#### Additional task that will need to be done
I need larger corpora that have more words associated with positive and negative sentiment. The values don't seem quite right, perhaps because the words that remain don't have a particular connotation associated with them.

## Future Processing Tasks
- I need more words that are positive/negative/neutral so that I can get better estimates about the sentiment of each article. 
- As my data grows with scale, I need to start saving my data structures instead of regenerating them each time I run the program.
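
For the last point, one way to avoid regenerating data on every run is a small load-or-build cache. This is a minimal sketch using only the standard library; `load_or_build` and `build_rows` are hypothetical names, and the temporary directory stands in for wherever the cache files would really live.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at `path`; build and save it if the file is absent."""
    if os.path.exists(path):
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    obj = build()
    with open(path, 'wb') as fh:
        pickle.dump(obj, fh)
    return obj


calls = []  # records how many times the expensive step actually runs


def build_rows():
    """Stand-in for the slow XML parsing step (hypothetical)."""
    calls.append(1)
    return [{'DOCID': '1815718', 'Name': 'BLUMENTHAL, MARTIN'}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'rows.pkl')
    first = load_or_build(cache_path, build_rows)   # builds and saves
    second = load_or_build(cache_path, build_rows)  # loads from disk
```

pandas DataFrames can be pickled the same way, and pandas also provides `DataFrame.to_pickle` and `pd.read_pickle` for this purpose.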