# Data Cleaning and Processing
- This small data set uses only 1 folder from the NYT Corpus [Articles from January 2007]
- This is to make sure that the script works before scaling up to the entire corpus

This script mainly focuses on extracting and cleaning up the data, then importing it into a pandas DataFrame. From there I can begin manipulating the data into a form that will be more useful for the sentiment analysis.  
  
Because the NYT Corpus is already annotated and has a well-defined structure, cleaning is less of an issue than extracting the desired data for processing. 

----For Progress Report 2, I'm using my existing script and adding onto it. 

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# import libraries
import xml.etree.ElementTree as Et
import glob
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
import matplotlib.pyplot as plt

In [3]:
# create new dataframe with empty data
columns = ['DOCID', 'Date', 'Month', 'Year', 'Name', 'Text']
data = pd.DataFrame(columns=columns)

#### Append data to DataFrames
Immediately from the XML files, I want to retrieve data for the following columns: document ID, date, month, year, mentioned names, article text.  
This data is to be appended to a pandas DataFrame.
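
As a quick sanity check of the lookups before running them over a whole folder, the sketch below parses a tiny XML string. The element and attribute names mirror the ones queried in this notebook, but the sample document itself is invented for illustration and is far simpler than a real NYT Corpus file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one article file.
SAMPLE = """
<nitf>
  <head>
    <meta name="publication_day_of_month" content="3"/>
    <meta name="publication_month" content="1"/>
    <meta name="publication_year" content="2007"/>
    <docdata><doc-id id-string="0000001"/></docdata>
  </head>
  <body>
    <person>Doe, Jane</person>
    <block class="full_text"><p>Example text.</p><p>More text.</p></block>
  </body>
</nitf>
""".strip()

root = ET.fromstring(SAMPLE)

# [@attr] matches elements that have the attribute at all,
# [@attr='value'] matches elements whose attribute equals the value.
docid = root.find(".//doc-id[@id-string]").attrib["id-string"]
day = root.find(".//meta[@name='publication_day_of_month']").attrib["content"]

# iter() walks every <person> element in the tree.
names = [p.text.upper() for p in root.iter("person")]

# find() returns only the first matching <p>, or None when nothing matches.
first_par = root.find(".//block[@class='full_text']/p")
missing = root.find(".//block[@class='no_such_class']/p")

print(docid, day, names, first_par.text, missing)
```

Note that `find()` stops at the first match, so a block with several `<p>` children yields only its first paragraph; `findall()` would return all of them.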

In [4]:
# open each xml file in the specified folder and collect one row per mentioned person
rows = []
for file in glob.glob("../data/NYT Corpus/nyt_corpus/data/2007/01/*/*.xml"):
    # parse the xml file into an element tree to extract data
    tree = Et.parse(file)
    root = tree.getroot()
    
    # get document id information (not sure if I need this yet, seems like it could be helpful)
    docid = root.find('.//doc-id[@id-string]').attrib['id-string']
    
    # get publication date information
    date = root.find(".//meta[@name='publication_day_of_month']").attrib['content']
    month = root.find(".//meta[@name='publication_month']").attrib['content']
    year = root.find(".//meta[@name='publication_year']").attrib['content']
    
    # get article text information (find() returns only the first <p> of the block)
    # some articles seem to lack text - this is caught and handled in the if/else
    article = root.find(".//block[@class='full_text']/p")
    if article is not None and article.text is not None:
        text = article.text.lower()
    else:
        text = None
        
    # for each person mentioned, create a new row of data for them
    for c in root.iter('person'):
        name = str(c.text).upper()
        rows.append({'DOCID': docid, 'Date': date, 'Month': month, 'Year': year, 'Name': name, 'Text': text})

# build the dataframe once from the collected rows (DataFrame.append was removed in pandas 2.0)
data = pd.DataFrame(rows, columns=columns)
data.head()

,DOCID,Date,Month,Year,Name,Text
0,1816122,3,1,2007,"FORD, GERALD RUDOLPH JR",
1,1816122,3,1,2007,"FORD, BETTY",
2,1816122,3,1,2007,"BUSH, GEORGE W (PRES)",
3,1816136,3,1,2007,"BENBROOK, CHARLES M",to the editor:
4,1816095,3,1,2007,"TAPLIN, JONATHAN T","in 1997, jonathan t. taplin, a veteran film an..."


#### Start tweaking the DataFrame to make it more useful

In [5]:
# make the docid the index
data = data.set_index('DOCID')
data.head()

DOCID,Date,Month,Year,Name,Text
1816122,3,1,2007,"FORD, GERALD RUDOLPH JR",
1816122,3,1,2007,"FORD, BETTY",
1816122,3,1,2007,"BUSH, GEORGE W (PRES)",
1816136,3,1,2007,"BENBROOK, CHARLES M",to the editor:
1816095,3,1,2007,"TAPLIN, JONATHAN T","in 1997, jonathan t. taplin, a veteran film an..."


#### Create a new column with parsed text
This creates a new column for each person that contains the text of the article they are mentioned in, tokenized with `nltk.word_tokenize`. I can then use this column later for further analysis.

In [6]:
# create function to tokenize the Text
def tokenizeText(col):
    return nltk.word_tokenize(str(col))

data['Tokenized'] = data['Text'].apply(tokenizeText)
data.head()

DOCID,Date,Month,Year,Name,Text,Tokenized
1816122,3,1,2007,"FORD, GERALD RUDOLPH JR",,[None]
1816122,3,1,2007,"FORD, BETTY",,[None]
1816122,3,1,2007,"BUSH, GEORGE W (PRES)",,[None]
1816136,3,1,2007,"BENBROOK, CHARLES M",to the editor:,"[to, the, editor, :]"
1816095,3,1,2007,"TAPLIN, JONATHAN T","in 1997, jonathan t. taplin, a veteran film an...","[in, 1997, ,, jonathan, t., taplin, ,, a, vete..."
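
The first rows above show a side effect worth noting: an article with no text ends up as the single token `None`, because `str(None)` is the string `'None'`. One possible guard is sketched below. `tokenize_or_empty` is a hypothetical helper, and its `tokenizer` argument defaults to `str.split` only so the sketch runs without NLTK data; in this notebook the tokenizer would be `nltk.word_tokenize`.

```python
def tokenize_or_empty(text, tokenizer=str.split):
    """Tokenize text, returning an empty list when the article has no text."""
    if text is None:
        return []
    return tokenizer(text)


# A missing article yields no tokens instead of the token 'None'.
no_text = tokenize_or_empty(None)

# With the stand-in whitespace tokenizer, punctuation stays attached to words;
# nltk.word_tokenize would split it off.
some_text = tokenize_or_empty("to the editor:")

print(no_text, some_text)
```

Applied to the DataFrame, this would look like `data['Text'].apply(tokenize_or_empty, tokenizer=nltk.word_tokenize)`, since `Series.apply` forwards extra keyword arguments to the function.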


#### Right now, I want to be able to do the sentiment analysis on individual people
Once I can do sentiment analysis on individual people, I can broaden it to multiple people. When the script is later modified to process the entire NYT corpus, I can aggregate frequent names and then analyze those names over time. Baby steps for now, I suppose.
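
A rough sketch of what "aggregate frequent names, then analyze them over time" could look like, using only the standard library. The rows, the scores, and the `min_mentions` threshold are all made up for illustration; none of them come from the corpus.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (name, year, month, compound score) rows; values are invented.
rows = [
    ("DOE, JANE", 2007, 1, 0.50),
    ("DOE, JANE", 2007, 1, 0.10),
    ("DOE, JANE", 2007, 2, -0.20),
    ("ROE, RICHARD", 2007, 1, 0.00),
]

# Step 1: keep only names mentioned at least `min_mentions` times.
min_mentions = 2
counts = Counter(name for name, _, _, _ in rows)
frequent = {name for name, n in counts.items() if n >= min_mentions}

# Step 2: group the scores of frequent names by (name, year, month).
buckets = defaultdict(list)
for name, year, month, score in rows:
    if name in frequent:
        buckets[(name, year, month)].append(score)

# Step 3: average each group to get one score per person per month.
monthly_mean = {key: mean(scores) for key, scores in sorted(buckets.items())}

for key, value in monthly_mean.items():
    print(key, round(value, 3))
```

With the data already in pandas, a `groupby` on name, year, and month followed by `mean()` would be the more natural route; the plain-Python version is shown only to make the steps explicit.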

In [7]:
# make a function that removes stopwords, punctuation and stray dashes
# (named filter_tokens so it does not shadow Python's built-in filter)
sw = set(stopwords.words('english'))
others = ['--']

def filter_tokens(toks):
    filtered = [w for w in toks
                if w not in sw
                and w not in string.punctuation
                and w not in others
               ]
    return filtered


t = data.iloc[0]['Tokenized']
filter_tokens(t)

['None']

In [8]:
# map my new filter function to the dataframe
# this will clean up the Tokenized column to do work on
data["Tokenized"] = data["Tokenized"].apply(filter_tokens)
data.head()

DOCID,Date,Month,Year,Name,Text,Tokenized
1816122,3,1,2007,"FORD, GERALD RUDOLPH JR",,[None]
1816122,3,1,2007,"FORD, BETTY",,[None]
1816122,3,1,2007,"BUSH, GEORGE W (PRES)",,[None]
1816136,3,1,2007,"BENBROOK, CHARLES M",to the editor:,[editor]
1816095,3,1,2007,"TAPLIN, JONATHAN T","in 1997, jonathan t. taplin, a veteran film an...","[1997, jonathan, t., taplin, veteran, film, te..."


#### Try NLTK's SentimentIntensityAnalyzer (VADER)
I'm going to try NLTK's `SentimentIntensityAnalyzer`, which implements the VADER model. Because it returns intensity scores rather than a single label, I can possibly use early results to improve the analyzer later on. I have an idea for how to do this, but will try it later.

In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# data.iloc[0]['Tokenized']
sia.polarity_scores(" ".join(data.iloc[0]['Tokenized']))



{'compound': 0.0, 'neg': 0.0, 'neu': 1.0, 'pos': 0.0}

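#### The SentimentIntensityAnalyzer returns 4 scores:
pos = the proportion of the text rated positive  
neg = the proportion of the text rated negative  
neu = the proportion of the text rated neutral  
compound = the overall sentiment, normalized to range from -1 (most negative) to +1 (most positive)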
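
To make the `compound` score less of a black box: to my understanding, VADER sums the valence of the words in the text and squashes that sum into [-1, 1] with x / sqrt(x² + α), where α = 15. The sketch below reproduces only that normalization step, plus the ±0.05 cutoffs commonly suggested for turning `compound` into a label. `label` is a hypothetical helper, not part of NLTK, and the cutoff is an assumption that may need tuning for news text.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


def label(compound, cutoff=0.05):
    """Turn a compound score into a coarse label (cutoff is an assumption)."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


# A valence sum of 1 maps to 1 / sqrt(1 + 15) = 0.25.
print(normalize(1))

# Compound scores taken from the polarities output in this notebook.
print(label(0.4588), label(0.0))
```

Because the squashing is steep near zero and flat near the extremes, long articles with many sentiment-bearing words tend to land close to -1 or +1.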

In [10]:
# create a new dataframe with the polarities
columns = ['DOCID', 'Date', 'Month', 'Year', 'Name', 'COM' ,'NEG', 'NEU', 'POS']
rows = []

# note: the scores are computed on the raw Text column, not on Tokenized
for docid, row in data.iterrows():
    scores = sia.polarity_scores(str(row['Text']))
    rows.append({'DOCID': docid, 'Date': row['Date'], 'Month': row['Month'], 'Year': row['Year'], 'Name': row['Name'],
                 'COM': scores['compound'], 'NEG': scores['neg'], 'NEU': scores['neu'], 'POS': scores['pos']})

# build the dataframe once from the collected rows (DataFrame.append was removed in pandas 2.0)
polarities = pd.DataFrame(rows, columns=columns)

polarities = polarities.set_index('DOCID')
polarities.head()
polarities.tail()

DOCID,Date,Month,Year,Name,COM,NEG,NEU,POS
1816122,3,1,2007,"FORD, GERALD RUDOLPH JR",0.0,0.0,1.0,0.0
1816122,3,1,2007,"FORD, BETTY",0.0,0.0,1.0,0.0
1816122,3,1,2007,"BUSH, GEORGE W (PRES)",0.0,0.0,1.0,0.0
1816136,3,1,2007,"BENBROOK, CHARLES M",0.0,0.0,1.0,0.0
1816095,3,1,2007,"TAPLIN, JONATHAN T",0.4588,0.0,0.925,0.075


DOCID,Date,Month,Year,Name,COM,NEG,NEU,POS
1821230,25,1,2007,"WALAT, KATHRYN",0.0,0.074,0.885,0.042
1821230,25,1,2007,"GRECO, LORETTA",0.0,0.074,0.885,0.042
1821230,25,1,2007,"CAMPBELL, JESSI",0.0,0.074,0.885,0.042
1821230,25,1,2007,"GRECO, LORETTA",0.0,0.074,0.885,0.042
1821218,25,1,2007,"KUCZYNSKI, ALEX",0.1901,0.0,0.941,0.059


In [11]:
# check the size: .size counts cells (rows x columns), not rows
polarities.size

109992

Now, I want to see how many names are repeated in the current dataframe for January 2007, by comparing the number of unique names with the total number of rows

In [12]:
len(set(polarities.Name))
len(polarities.Name)

6456

13749
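
Some of these repeats occur inside a single article: the tail of the polarities table lists "GRECO, LORETTA" twice for document 1821230. A minimal sketch of dropping such within-article repeats is shown below, using the (DOCID, Name) pairs from that output. Whether repeats should be dropped at all is an open question, since a repeated tag might carry meaning.

```python
# (DOCID, Name) pairs copied from the tail of the polarities output.
pairs = [
    ("1821230", "WALAT, KATHRYN"),
    ("1821230", "GRECO, LORETTA"),
    ("1821230", "CAMPBELL, JESSI"),
    ("1821230", "GRECO, LORETTA"),
    ("1821218", "KUCZYNSKI, ALEX"),
]

# dict.fromkeys keeps the first occurrence of each pair and preserves order.
unique_pairs = list(dict.fromkeys(pairs))

print(len(pairs), len(unique_pairs))
```

On the DataFrame itself, `polarities.reset_index().drop_duplicates(subset=['DOCID', 'Name'])` should have the same effect.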

## Future Processing Tasks
- I need a larger lexicon of positive/negative/neutral words so that I can get better estimates of the sentiment of each article. 
- As the data grows with scaling, I need to start saving my data structures to disk instead of regenerating them each time I run the program
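
For the second task, one simple option is to cache the parsed rows on disk and rebuild them only when the cache file is missing. The sketch below uses the standard-library `pickle` module; `load_or_build`, the file name, and the stand-in build function are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(path, build):
    """Return the object pickled at `path`, building and saving it first if needed."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    obj = build()
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj


calls = []


def expensive_build():
    """Stand-in for the slow XML parsing loop; records each time it runs."""
    calls.append(1)
    return [{"DOCID": "1816136", "Name": "BENBROOK, CHARLES M"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "rows.pkl"
    first = load_or_build(cache, expensive_build)   # parses and saves
    second = load_or_build(cache, expensive_build)  # loads from disk

print(first == second, len(calls))
```

For DataFrames specifically, pandas offers `DataFrame.to_pickle` and `pd.read_pickle`, which could replace the manual `pickle` calls here. Pickle files should only be loaded from trusted sources.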