# Unpack

`unpack.ipynb` is a utility that parses the XML files of the NYT Annotated Corpus, extracts selected tags, and collects the results in a pandas DataFrame. The DataFrame is written to a file that later notebooks (`processing.ipynb` and `analysis.ipynb`) load for further processing.

Running time can be lengthy, depending on the values entered for year, month, and day.
  
  ---
  
### Table of Contents  
- [Import Libraries](#Import-Libraries)
- [Create a List to Collect Data](#Create-a-List-to-Collect-Data)
- [Set the Dates of Desired Files](#Set-the-Dates-of-Desired-Files)
- [Begin Processing Files from NYT Annotated Corpus](#Begin-Processing-Files-from-NYT-Annotated-Corpus)
- [Create and Sort the DataFrame](#Create-and-Sort-the-DataFrame)
- [Verify that the DataFrame is Sorted at the Beginning](#Verify-that-the-DataFrame-is-Sorted-at-the-Beginning)
- [Verify that the DataFrame is Sorted at the End](#Verify-that-the-DataFrame-is-Sorted-at-the-End)
- [Write Out the Resulting DataFrame to a File](#Write-Out-the-Resulting-DataFrame-to-a-File)

### Import Libraries
The `xml.etree.ElementTree` module is used to parse and traverse the `.xml` files provided in the corpus.  
The `glob` module is used to find files whose paths match a shell-style wildcard pattern (not a regular expression), so that multiple files can be looped over.  
The pandas library is used to hold all of the information that is extracted from the corpus.  
The `pickle` module is used to serialize the DataFrame object into a file, to be loaded and used by another notebook.

In [1]:
import xml.etree.ElementTree as Et
import glob
import pandas as pd
import pickle

### Create a List to Collect Data
Appending rows to a list and converting the list to a DataFrame once has proven _much_ faster than appending rows to a DataFrame directly, because growing a DataFrame row by row creates a new copy each time.

In [2]:
gather_data = []

### Set the Dates of Desired Files
Modify the variables below to choose which year, month, and day to extract and process files from.  
All values are numeric strings. Single-digit values are zero-padded: `01`, `02`, `...`, `09`.  
To match every value at a given level, use `*` instead of a number.

In [3]:
year = "2000"
month = "*"
day = "*"
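
To see how the `*` wildcard selects files, the sketch below builds a throwaway `year/month/day` directory tree and runs the same kind of pattern against it. The helper names `build_pattern` and `demo` and the file names are hypothetical, made up for illustration. Only the `year/month/day/*.xml` layout comes from the notebook.

```python
import glob
import os
import tempfile


def build_pattern(base, year, month, day):
    """Join the date parts into a glob pattern; "*" matches any single folder name."""
    return os.path.join(base, year, month, day, "*.xml")


def demo():
    # Create a small, temporary stand-in for the corpus folder layout.
    with tempfile.TemporaryDirectory() as base:
        for rel in ("2000/01/01/a.xml", "2000/02/15/b.xml", "2001/01/01/c.xml"):
            path = os.path.join(base, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()

        # Fixed year, wildcard month and day: only the two files from 2000 match.
        matches = glob.glob(build_pattern(base, "2000", "*", "*"))
        return sorted(os.path.relpath(m, base).replace(os.sep, "/") for m in matches)


print(demo())
```

Because `*` never crosses a directory separator, each wildcard stands for exactly one folder level.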

### Begin Processing Files from NYT Annotated Corpus
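Using the values for year, month, and day, the glob library collects the file names that match a path pattern containing shell-style wildcards.  
  
The data being extracted are:
- docid
- date
- month
- year
- identified name
- article text (first paragraph of the `full_text` block, lowercased)
- classifiers (uppercased and joined with `|`)

Each identified person becomes one row in the `gather_data` list, which is converted to a DataFrame called `data` in the next section.  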

In [4]:
# build the glob pattern once; "*" in year, month, or day matches every folder at that level
pattern = "../data/NYT Corpus/nyt_corpus/data/" + year + "/" + month + "/" + day + "/*.xml"

# parse each matching xml file and collect one row per person mentioned
for path in glob.glob(pattern):
    # parse the xml file into an element tree to extract data
    tree = Et.parse(path)
    root = tree.getroot()

    # get document id information (not sure if I need this yet, seems like it could be helpful)
    docid = root.find('.//doc-id[@id-string]').attrib['id-string']

    # get publication date information
    # (named pub_* so the year/month/day settings from the previous cell are not overwritten)
    pub_day = root.find(".//meta[@name='publication_day_of_month']").attrib['content']
    pub_month = root.find(".//meta[@name='publication_month']").attrib['content']
    pub_year = root.find(".//meta[@name='publication_year']").attrib['content']

    # get article text information (first paragraph of the full_text block)
    # some articles seem to lack text - this is caught and handled in the if/else
    article = root.find(".//block[@class='full_text']/p")
    if article is not None and article.text is not None:
        text = article.text.lower()
    else:
        text = None

    # get all of the classifier information as one "|"-terminated string
    doctypes = ""
    for d in root.iter('classifier'):
        doctypes += str(d.text).upper() + "|"

    # for each person mentioned, add a new row of data for them to the list
    for c in root.iter('person'):
        name = str(c.text).upper()
        cur = [docid, pub_day, pub_month, pub_year, name, text, doctypes]
        gather_data.append(cur)
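
The extraction logic can be tried without the corpus. The sketch below runs the same XPath queries on a small inline document. The XML in `SAMPLE` is a hypothetical stand-in built only from the tag and attribute names the queries above look for; it is not a real corpus file, and the doc id, name, and classifier are invented. The function name `extract_rows` is also an assumption.

```python
import xml.etree.ElementTree as Et

# Hypothetical document: just enough structure to satisfy the queries below.
SAMPLE = """
<nitf>
  <head>
    <meta name="publication_day_of_month" content="1"/>
    <meta name="publication_month" content="1"/>
    <meta name="publication_year" content="2000"/>
    <doc-id id-string="0000001"/>
    <person>Doe, Jane</person>
    <person>Roe, Richard</person>
    <classifier>Example Topic</classifier>
  </head>
  <body>
    <block class="full_text"><p>Hello World.</p><p>Second paragraph.</p></block>
  </body>
</nitf>
"""


def extract_rows(xml_text):
    """Return one row per <person> element, mirroring the columns used above."""
    root = Et.fromstring(xml_text)
    docid = root.find(".//doc-id[@id-string]").attrib["id-string"]
    day = root.find(".//meta[@name='publication_day_of_month']").attrib["content"]
    month = root.find(".//meta[@name='publication_month']").attrib["content"]
    year = root.find(".//meta[@name='publication_year']").attrib["content"]

    # find() returns only the first matching <p>, or None when there is no text block.
    para = root.find(".//block[@class='full_text']/p")
    text = para.text.lower() if para is not None and para.text else None

    doctypes = "".join(str(c.text).upper() + "|" for c in root.iter("classifier"))
    return [[docid, day, month, year, str(p.text).upper(), text, doctypes]
            for p in root.iter("person")]


for row in extract_rows(SAMPLE):
    print(row)
```

Note that every extracted value, including day, month, and year, is a string, since it comes straight from an XML attribute or text node.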

### Create and Sort the DataFrame
This creates a DataFrame from the rows gathered from the NYT Annotated Corpus.  
For readability, the DataFrame is sorted below by Month and then by Date. Because both columns hold strings, the order is lexicographic (`1`, `10`, `11`, `12`, `2`, ...), which is why the last rows shown further down belong to month 9, day 9. Process at most one year per run; anything larger than that and the script would take too long to execute.

In [5]:
columns = ['DOCID', 'Date', 'Month', 'Year', 'Name', 'Text', 'Doctypes']
data = pd.DataFrame(gather_data, columns=columns)
data = data.sort_values(ascending=[True, True], by=['Month', 'Date'])
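
Since `Month` and `Date` are read from XML attributes, they are strings, and sorting strings compares them character by character. If calendar order is wanted instead, one option is to cast both columns to integers before sorting. The sketch below shows the difference on a few hypothetical rows; the variable names and values are made up for illustration and are not corpus data.

```python
import pandas as pd

# Hypothetical rows: only the two columns needed to show the sort order.
sample = pd.DataFrame({"Month": ["9", "10", "1", "1"],
                       "Date": ["9", "2", "10", "2"]})

# String sort: "1" < "10" < "9" for Month, and "10" < "2" for Date.
as_strings = sample.sort_values(by=["Month", "Date"])

# Integer sort: cast first, then sort, which gives calendar order.
as_numbers = sample.astype({"Month": int, "Date": int}).sort_values(by=["Month", "Date"])

print(as_strings.values.tolist())
print(as_numbers.values.tolist())
```

Casting changes the column types stored in the DataFrame, so any later notebook that expects strings would need to be checked before adopting this.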

### Verify that the DataFrame is Sorted at the Beginning

In [6]:
data.head()

| | DOCID | Date | Month | Year | Name | Text | Doctypes |
|---|---|---|---|---|---|---|---|
| 115964 | 1165229 | 1 | 1 | 2000 | BERAKHA, ESTHER (NEE ROSSI) | berakha-esther (nee rossi). of manhattan, died... | PAID DEATH NOTICE\|TOP/CLASSIFIEDS/PAID DEATH N... |
| 115965 | 1165201 | 1 | 1 | 2000 | MUNOZ MARQUEZ, SANDRA | the spare farmsteads and bony milk cows of san... | IMMIGRATION AND REFUGEES\|HISTORY\|CHRONOLOGY\|TO... |
| 115966 | 1165201 | 1 | 1 | 2000 | MARQUEZ MARTINEZ, ENRIQUETA | the spare farmsteads and bony milk cows of san... | IMMIGRATION AND REFUGEES\|HISTORY\|CHRONOLOGY\|TO... |
| 115967 | 1165201 | 1 | 1 | 2000 | PRESTON, JULIA | the spare farmsteads and bony milk cows of san... | IMMIGRATION AND REFUGEES\|HISTORY\|CHRONOLOGY\|TO... |
| 115968 | 1165163 | 1 | 1 | 2000 | FITZGERALD, F SCOTT (1896-1940) | to the editor: | BOOKS AND LITERATURE\|FORECASTS\|LETTER\|TOP/OPIN... |


### Verify that the DataFrame is Sorted at the End

In [7]:
data.tail()

| | DOCID | Date | Month | Year | Name | Text | Doctypes |
|---|---|---|---|---|---|---|---|
| 92583 | 1228940 | 9 | 9 | 2000 | WONG, EDWARD | yasir arafat had deadlines to meet. take, for ... | AIRLINES AND AIRPLANES\|KENNEDY INTERNATIONAL A... |
| 92584 | 1228940 | 9 | 9 | 2000 | ASPARRO, VITO (SGT) | yasir arafat had deadlines to meet. take, for ... | AIRLINES AND AIRPLANES\|KENNEDY INTERNATIONAL A... |
| 92585 | 1229069 | 9 | 9 | 2000 | CHANG, DAVID J.Y. | chang-david j.y., passed on september 7, in ne... | PAID DEATH NOTICE\|TOP/CLASSIFIEDS/PAID DEATH N... |
| 92586 | 1229055 | 9 | 9 | 2000 | BUSH, GEORGE W (GOV) | the 2000 campaign | PRESIDENTIAL ELECTION OF 2000\|CAPTION\|TOP/NEWS... |
| 92587 | 1229082 | 9 | 9 | 2000 | SHAPIRO, IRVING | shapiro-irving. beloved father of barbara meye... | PAID DEATH NOTICE\|TOP/CLASSIFIEDS/PAID DEATH N... |


### Write Out the Resulting DataFrame to a File
The DataFrame is serialized below using the pickle library. The filename is taken from the `year` variable. Pickle files from this script carry the `.p` extension.

In [8]:
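# use a context manager so the file is flushed and closed once the dump finishes
with open("nyt-" + year + ".p", "wb") as out_file:
    pickle.dump(data, out_file)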
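
A later notebook can presumably read the file back with `pickle.load`; how `processing.ipynb` actually loads it is not shown here. The sketch below is a minimal round trip under that assumption. It writes a small, hypothetical stand-in DataFrame to a temporary folder, so the file path and the row values are illustrative only.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for the DataFrame built above.
frame = pd.DataFrame([["0000001", "1", "1", "2000", "DOE, JANE"]],
                     columns=["DOCID", "Date", "Month", "Year", "Name"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nyt-2000.p")

    # Serialize the DataFrame, as the cell above does.
    with open(path, "wb") as fh:
        pickle.dump(frame, fh)

    # Load it back, as a downstream notebook might.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored.equals(frame))
```

Pickle files can execute code when loaded, so only unpickle files that come from a trusted source.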