# Unpack

`unpack.ipynb` is a utility that parses the NYT Annotated Corpus XML files, extracts selected tags, and collects them in a DataFrame. The DataFrame is then written to a file that later notebooks (`processing.ipynb` and `analysis.ipynb`) use for further processing.

Running time can be lengthy, depending on the values entered for year, month, and day.

### Import Libraries
The XML library is used to parse and traverse the .xml files provided in the corpus.  
The glob library is used to find files by matching path patterns with shell-style wildcards (such as `*`), so that multiple files can be looped through.  
The pandas library is used to hold all of the information that is extracted from the corpus.  
The pickle library is used to serialize the DataFrame object into a file, to be loaded and used by another script.

In [None]:
import xml.etree.ElementTree as Et
import glob
import pandas as pd
import pickle

### Create a New DataFrame
This creates a new, empty DataFrame to hold the information read from the NYT Annotated Corpus.

In [None]:
columns = ['DOCID', 'Date', 'Month', 'Year', 'Name', 'Text']
data = pd.DataFrame(columns=columns)

### Set the Dates of Desired Files
Modify the variables below to choose the year, month, and day of the files to extract and process.  
All values are numeric strings. Single-digit values are written as `01`, `02`, `...`, `09`.  
To match every value of a given field, use `*` instead of a number.

In [None]:
year = "2005"
month = "*"
day = "*"
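As a minimal sketch of how these settings select files, the example below builds a small throwaway directory tree and globs it. The file names (`a.xml`, `b.xml`, `c.xml`) and the temporary root are hypothetical stand-ins for the real corpus layout of `year/month/day/*.xml`.

```python
import glob
import os
import tempfile

year, month, day = "2005", "*", "*"

with tempfile.TemporaryDirectory() as corpus_root:
    # create empty placeholder files in a year/month/day layout (hypothetical names)
    for rel in ("2005/01/01/a.xml", "2005/02/15/b.xml", "2006/01/01/c.xml"):
        path = os.path.join(corpus_root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    # "*" matches any single directory name at that level
    pattern = os.path.join(corpus_root, year, month, day, "*.xml")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)  # only the two files under 2005 are matched
```

Note that `*` is a shell-style wildcard that matches within one path component; it does not cross directory separators.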

### Begin Processing Files from NYT Annotated Corpus
Using the values for `year`, `month`, and `day`, the glob library returns the file names that match a path pattern written with shell-style wildcards.  
  
The data being extracted are:
- docid
- date
- month
- year
- identified name
- article text

  
Values are stored in a DataFrame called `data`, with one row per identified person per article.  

In [None]:
# build the glob pattern from the date settings above ("*" matches any value)
pattern = "../data/NYT Corpus/nyt_corpus/data/" + year + "/" + month + "/" + day + "/*.xml"

# collect one dict per (article, person) pair and build the DataFrame once at the end;
# DataFrame.append is slow when called in a loop and was removed in pandas 2.0
rows = []

for file in glob.glob(pattern):
    # parse the xml file into an element tree to extract data
    tree = Et.parse(file)
    root = tree.getroot()

    # get document id information
    docid = root.find('.//doc-id[@id-string]').attrib['id-string']

    # get publication date information
    # (separate names keep the year/month/day settings above from being overwritten)
    pub_day = root.find(".//meta[@name='publication_day_of_month']").attrib['content']
    pub_month = root.find(".//meta[@name='publication_month']").attrib['content']
    pub_year = root.find(".//meta[@name='publication_year']").attrib['content']

    # get article text information (find returns only the first <p> of the full_text block)
    # some articles seem to lack text - this is caught and handled in the if/else
    article = root.find(".//block[@class='full_text']/p")
    if article is not None and article.text is not None:
        text = article.text.lower()
    else:
        text = None

    # for each person mentioned, create a new row of data for them
    for person in root.iter('person'):
        name = str(person.text).upper()
        rows.append({'DOCID': docid, 'Date': pub_day, 'Month': pub_month,
                     'Year': pub_year, 'Name': name, 'Text': text})

# build the DataFrame in one step, using the column order defined earlier
data = pd.DataFrame(rows, columns=columns)
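To illustrate the extraction step in isolation, here is a small sketch that runs the same ElementTree queries against an in-memory XML string. The `sample` document is a hypothetical, heavily simplified fragment written for this example; real corpus files contain many more elements, and the ID and names shown are made up.

```python
import xml.etree.ElementTree as Et

# hypothetical, simplified stand-in for one corpus article
sample = """<nitf>
  <head>
    <meta content="15" name="publication_day_of_month"/>
    <meta content="3" name="publication_month"/>
    <meta content="2005" name="publication_year"/>
    <docdata>
      <doc-id id-string="0000001"/>
      <identified-content>
        <person>Doe, Jane</person>
        <person>Roe, Richard</person>
      </identified-content>
    </docdata>
  </head>
  <body>
    <body.content>
      <block class="full_text">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </block>
    </body.content>
  </body>
</nitf>"""

root = Et.fromstring(sample)

# attribute lookups, as in the loop above
docid = root.find(".//doc-id[@id-string]").attrib["id-string"]
pub_year = root.find(".//meta[@name='publication_year']").attrib["content"]

# find() returns the first matching <p> only
first_p = root.find(".//block[@class='full_text']/p")
text = first_p.text.lower() if first_p is not None and first_p.text else None

# one entry per <person> element anywhere in the tree
names = [str(p.text).upper() for p in root.iter("person")]

print(docid, pub_year, text, names)
```

Because `find` stops at the first match, only the first paragraph of the article body ends up in `text`; `findall` would be needed to collect every paragraph.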


### Sort the DataFrame
For readability, the DataFrame is sorted below by Month and then by Date. A single year is the largest batch recommended for one run. Anything larger and the script would take too long to execute.

In [None]:
data = data.sort_values(ascending=[True, True], by=['Month', 'Date'])
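One caveat worth sketching: the Month and Date values are read from XML attributes, so they are strings. If they are not zero-padded (an assumption here; this depends on the corpus metadata), a plain sort orders them lexicographically, placing `"10"` before `"2"`. The example below uses a hypothetical `demo` frame and the `key` argument of `sort_values` (available in pandas 1.1 and later) to compare the values as integers.

```python
import pandas as pd

# hypothetical string-valued dates, as they would come out of XML attributes
demo = pd.DataFrame({"Month": ["10", "2", "2", "1"],
                     "Date": ["5", "11", "3", "20"]})

# default sort compares strings character by character
lexicographic = demo.sort_values(by=["Month", "Date"])

# key is applied to each sort column; values are compared as integers
numeric = demo.sort_values(by=["Month", "Date"], key=lambda col: col.astype(int))

print(list(lexicographic["Month"]))
print(list(numeric["Month"]))
```

The stored values remain strings either way; `key` only changes how they are compared during the sort.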

### Verify that the DataFrame is Sorted at the Beginning

In [None]:
data.head()

### Verify that the DataFrame is Sorted at the End

In [None]:
data.tail()

### Write Out the Resulting DataFrame to a File
The DataFrame is serialized below using the pickle library. The filename is taken from the `year` variable. Pickle files from this script carry the `.p` extension.

In [None]:
# use a context manager so the file is flushed and closed after writing
with open("nyt-" + year + ".p", "wb") as f:
    pickle.dump(data, f)
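For completeness, a later script could load the file back roughly as follows. This is a sketch of the round trip only: the one-row `demo` frame and the temporary directory are hypothetical, and the actual loading code in the other notebooks may differ.

```python
import os
import pickle
import tempfile

import pandas as pd

# hypothetical one-row frame with the same columns as `data`
demo = pd.DataFrame(
    [{"DOCID": "0000001", "Date": "15", "Month": "3", "Year": "2005",
      "Name": "DOE, JANE", "Text": "first paragraph."}],
    columns=["DOCID", "Date", "Month", "Year", "Name", "Text"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nyt-2005.p")

    # write the frame out, as this notebook does
    with open(path, "wb") as f:
        pickle.dump(demo, f)

    # read it back, as a downstream script might
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(demo))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.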