# Convert to Dataframe Using Multiprocessing Pool

#### Background Context
**Multiprocessing** is a module in the Python standard library that lets a program use multiple processors on a given machine. Its Pool object exploits data parallelism by distributing work across a pool of worker processes that all run the same function. For CPU-bound work on large inputs, this can substantially reduce overall runtime.

Multiprocessing is most useful when calling a function on large datasets that exhibit data parallelism. Data parallelism means breaking a dataset into smaller subsets, which are then processed by multiple processes applying the same function without communicating with each other. Joining the outputs of these processes should produce the same result as if one process had applied the function to the entire dataset.
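
As a minimal sketch of this idea, the example below applies the built-in `len` (a stand-in for a real parsing function) to a few made-up strings, once serially and once through a `Pool`. The helper name `parallel_lengths` and the sample data are illustrative assumptions, not part of this script.

```python
from multiprocessing import Pool


def parallel_lengths(docs, processes=2):
    """Apply len to every item in docs using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map splits docs into chunks, sends them to the workers,
        # and returns the results in the original order
        return pool.map(len, docs)


if __name__ == "__main__":
    docs = ["alpha", "be", "gamma delta", "", "epsilon"]
    serial = [len(d) for d in docs]
    # The joined parallel output matches the single-process output
    assert parallel_lengths(docs) == serial
```

Because each item is processed independently, the order and content of the results do not depend on how the work was divided among the processes.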

#### Script Purpose
This script is the multiprocessing variation of the **Convert to Dataframe** script. In this sample script, we will demonstrate the use of Multiprocessing Pool in parsing large numbers of XML files. We will create the function for parsing, create a pool object, and then call the function using that pool object to run via multiprocessing.

#### Expected Runtime and Sample Size
Mileage will vary depending on document size and resources available. On the standard notebook instance with 4 cores, expect around 8 minutes of runtime to process 53k New York Times articles and around 2 minutes of runtime to process 5k dissertations. The beginning input size is set to the corpus `SAMPLEDATA`, which includes around 53k articles.

## Import Libraries

In [1]:
# Libraries for parsing data
from lxml import etree
from bs4 import BeautifulSoup
import pandas as pd
import os

# Libraries for multiprocessing
import multiprocessing as mp
from multiprocessing import Pool

## Load Data

In [3]:
# Set corpus to the folder of files you want to use
corpus = '/home/ec2-user/SageMaker/data/Global_Newsstream_2013_bernanke/'

# Read in files
input_files = os.listdir(corpus)

print("Loaded", len(input_files), "documents.")

Loaded 15516 documents.


## Specify Output File

Set the `output_file` variable to the desired save location and file name. This variable is used at the end of the script to save the processed data.

In [4]:
# Modify output_file to desired save name
output_file = 'output_files/Global_Newsstream_2013_bernanke.csv'

## Check Total Cores

Check the total number of cores on your current machine. The multiprocessing cells below use this variable to size the pool.

In [5]:
# Check core count
num_cores = mp.cpu_count()
print(num_cores)

4
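
The sketch below shows one way to turn the core count into a safe pool size. It assumes you want to leave one core free for the notebook itself; the helper name `choose_worker_count` and the `reserve` parameter are hypothetical, not part of this script.

```python
import os


def choose_worker_count(reserve=1):
    """Return how many worker processes to start, never fewer than one."""
    # os.cpu_count() can return None when the count cannot be determined
    total = os.cpu_count() or 1
    return max(1, total - reserve)


print(choose_worker_count())
```

Clamping to a minimum of one matters on single-core machines, where `num_cores - 1` would be zero and `Pool` would reject it.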


## Define Functions

In [6]:
# Function to strip html tags from text portion
def strip_html_tags(text):
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    stripped = BeautifulSoup(text, 'lxml').get_text().replace('\n', ' ').replace('\\', '').strip()
    return stripped

In [None]:
# Retrieve metadata from XML document
def getxmlcontent(corpus, file, strip_html=True):
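    # Initialize every field so the return statement works even if parsing fails
    goid = title = date = publisher = text = None

    try:
        tree = etree.parse(os.path.join(corpus, file))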
        root = tree.getroot()

        if root.find('.//GOID') is not None:
            goid = root.find('.//GOID').text
        else:
            goid = None

        if root.find('.//Title') is not None:
            title = root.find('.//Title').text
        else:
            title = None

        if root.find('.//NumericDate') is not None:
            date = root.find('.//NumericDate').text
        else:
            date = None
            
        if root.find('.//PublisherName') is not None:
            publisher = root.find('.//PublisherName').text
        else:
            publisher = None

        if root.find('.//FullText') is not None:
            text = root.find('.//FullText').text

        elif root.find('.//HiddenText') is not None:
            text = root.find('.//HiddenText').text

        elif root.find('.//Text') is not None:
            text = root.find('.//Text').text

        else:
            text = None

        # Strip html from text portion
        if text is not None and strip_html:
            text = strip_html_tags(text)
    
    except Exception as e:
        print(f"Error while parsing file {file}: {e}")
    
    return goid, title, date, publisher, text
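
The tag lookups above follow one pattern: try a list of candidate tags in order and take the first one present. The sketch below isolates that pattern. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so it runs without extra dependencies, and the XML snippet and the helper `first_text` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; real documents will have a different layout
SAMPLE_XML = """<Record>
  <GOID>12345</GOID>
  <Obj><Title>Sample headline</Title></Obj>
  <HiddenText>Body text here.</HiddenText>
</Record>"""


def first_text(root, tags):
    """Return the text of the first tag in tags found anywhere under root, else None."""
    for tag in tags:
        element = root.find(f".//{tag}")
        if element is not None:
            return element.text
    return None


root = ET.fromstring(SAMPLE_XML)
goid = first_text(root, ["GOID"])
title = first_text(root, ["Title"])
date = first_text(root, ["NumericDate"])  # tag is absent in the sample
text = first_text(root, ["FullText", "HiddenText", "Text"])

print(goid, title, date, text)
```

Here `date` is `None` because the sample has no `NumericDate` tag, and `text` comes from `HiddenText` because `FullText` is missing.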

In [None]:
# Parse a single document and return the fields to keep (called once per file by the multiprocessing pool)
def make_lists(file):
    
    goid, title, date, publisher, text = getxmlcontent(corpus, file, strip_html=True)
    
    return goid, text, date

## Run Multiprocessing to parse XML files

In [None]:
# Test function on single document
make_lists(input_files[1])

In [None]:
# When using multiple processes, it is important to shut them down to avoid memory/resource leaks.
# The context manager below cleans up the pool when the block exits, even if an error occurs.
try:
    # Define a process Pool to parse multiple XML files in parallel
    # Default is num_cores - 1 (minimum 1); adjust the number of processes to suit your instance
    with Pool(processes=max(1, num_cores - 1)) as p:

        # Apply function with Pool to corpus
        processed_lists = p.map(make_lists, input_files)

except Exception as e:
    print(f"Error in processing document: {e}")
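
One way to make the cleanup step explicit is to wrap it in a small helper that closes and joins the pool in a `finally` block. This is a sketch under stated assumptions: the helper name `run_in_pool` is hypothetical, and the built-in `abs` stands in for the parsing function.

```python
from multiprocessing import Pool


def run_in_pool(func, items, processes=2):
    """Map func over items in a process pool and always release the workers."""
    pool = Pool(processes=processes)
    try:
        return pool.map(func, items)
    finally:
        # close() stops the pool from accepting new work;
        # join() then waits for the worker processes to exit
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(run_in_pool(abs, [-1, 2, -3]))
```

Creating the pool before the `try` means the `finally` block never runs against a pool that failed to start.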

In [None]:
# Transform processed data into a dataframe
df = pd.DataFrame(processed_lists, columns=['GOID', 'Text', 'Date'])
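
The sketch below shows the same conversion on a few made-up tuples, including a row whose fields are all `None` (for example, a document with none of the expected tags). The sample values and the decision to drop rows without a GOID are assumptions for illustration; keep such rows if you need to audit them.

```python
import pandas as pd

# Hypothetical parsed results: one (GOID, text, date) tuple per document
sample_results = [
    ("1001", "First article text.", "2013-01-15"),
    (None, None, None),  # no fields were recovered for this document
    ("1003", "Third article text.", "2013-02-01"),
]

sample_df = pd.DataFrame(sample_results, columns=["GOID", "Text", "Date"])

# Keep only rows where a GOID was recovered, then renumber the index
clean_df = sample_df.dropna(subset=["GOID"]).reset_index(drop=True)

print(clean_df)
```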

In [None]:
# View dataframe
df

## Save Dataframe as CSV

Make sure to set the `output_file` variable (defined near the top of the script) to the desired output file name before running this cell.

In [None]:
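# Save output to file (create the output folder first if it does not exist)
os.makedirs(os.path.dirname(output_file) or '.', exist_ok=True)
df.to_csv(output_file)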