# Convert to Dataframe
#### Script Purpose
In this sample script, we convert the input dataset into a pandas dataframe containing the title and publication date of each document, together with counts of how often the terms mercury, sulphur, and salt occur and co-occur in its text.

#### Expected Runtime and Sample Size
Runtime will vary depending on document size and the resources available. For 1000 New York Times articles, expect this script to take around 16 seconds. For 1000 dissertations, expect around 20 seconds. The sample size below is set to 180, which covers every entry in the corpus folder used here and is small enough to run interactively.

## Import Libraries

In [1]:
# Libraries for parsing data
import os
import pandas as pd
import re
import random
from lxml import etree
from bs4 import BeautifulSoup

## Load and Sample Data

Depending on the size and vocabulary of the input dataset, runtime of this script may vary. To process the entire dataset, set `sample_size` to `len(input_files)`. Larger datasets can be run on the multiprocessing version of this script.

In [2]:
# Set corpus to the folder of files you want to use
corpus = '/home/ec2-user/SageMaker/data/AlchemyData/'

# Read in files
input_files = os.listdir(corpus)

print("Loaded", len(input_files), "documents.")

Loaded 180 documents.
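
`os.listdir` returns every entry in the folder, including hidden directories such as `.ipynb_checkpoints`, which is why one parse error further down names that entry. A minimal sketch of a filter follows, assuming that corpus documents are regular files whose names do not start with a dot; the helper name `list_corpus_files` is hypothetical and not part of the original script.

```python
import os
import tempfile

def list_corpus_files(folder):
    """Return the sorted names of regular, non-hidden files in `folder`."""
    return sorted(
        name for name in os.listdir(folder)
        if not name.startswith('.') and os.path.isfile(os.path.join(folder, name))
    )

# Demonstration on a throwaway folder (the file names are made up)
with tempfile.TemporaryDirectory() as demo:
    os.mkdir(os.path.join(demo, '.ipynb_checkpoints'))
    for name in ('b.xml', 'a.xml'):
        open(os.path.join(demo, name), 'w').close()
    demo_files = list_corpus_files(demo)

print(demo_files)
```

Sorting the names also makes the listing order stable between runs, since `os.listdir` returns entries in arbitrary order.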


In [3]:
# Select the number of articles to sample
sample_size = 180

# Take the first sample_size files. Slicing past the end of a list is safe,
# so a sample_size larger than the corpus simply returns every file.
sample_input_files = input_files[:sample_size]

print("Currently sampling", len(sample_input_files), "documents.")

Currently sampling 180 documents.
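
The slice above takes the first `sample_size` names in listing order. When a smaller, representative sample is wanted, a seeded random sample is one option. The sketch below is an illustration under that assumption; `sample_files` and its `seed` parameter are hypothetical names, not part of the original script.

```python
import random

def sample_files(files, sample_size, seed=0):
    """Return a reproducible random sample of at most `sample_size` names."""
    rng = random.Random(seed)            # private generator, so the global state is untouched
    k = min(sample_size, len(files))     # never ask for more items than exist
    return rng.sample(files, k)

# Made-up file names for demonstration
demo_names = [f'doc{i}.xml' for i in range(10)]
first_draw = sample_files(demo_names, 4, seed=1)
second_draw = sample_files(demo_names, 4, seed=1)
print(first_draw == second_draw)
```

Using the same seed returns the same sample, which keeps results comparable between runs.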


## Specify Output File

Set the `output_file` variable to the desired save location and file name. This variable is used at the end of the script to save the processed data.

In [13]:
# Modify output_file to desired save name
output_file = 'output_files/coocc.csv.gz'
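
`DataFrame.to_csv` does not create missing folders, so saving to `output_files/coocc.csv.gz` fails if `output_files` does not exist yet. A minimal sketch of a guard follows; the helper name `ensure_parent_dir` is hypothetical.

```python
import os
import tempfile

def ensure_parent_dir(path):
    """Create the folder that will hold `path` if needed, then return `path`."""
    parent = os.path.dirname(path)
    if parent:                                  # an empty parent means the current folder
        os.makedirs(parent, exist_ok=True)
    return path

# Demonstration inside a throwaway folder
with tempfile.TemporaryDirectory() as demo:
    target = ensure_parent_dir(os.path.join(demo, 'output_files', 'coocc.csv.gz'))
    parent_exists = os.path.isdir(os.path.dirname(target))

print(parent_exists)
```

In the notebook this could be used as `output_file = ensure_parent_dir('output_files/coocc.csv.gz')`.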


## Gather Metadata Fields

This section gathers fields from each document and adds them to lists that are later used to build a dataframe. By default, the script collects the title, publication date, publisher, and full text of each document. The text is then searched for the terms mercury, sulphur, and salt.

In [5]:
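# Function to strip html tags from text portion
def strip_html_tags(text):
    # Name the parser explicitly; 'lxml' is the one BeautifulSoup picks by default when lxml is installed
    stripped = BeautifulSoup(text, 'lxml').get_text().replace('\n', ' ').replace('\\', '').strip()
    return stripped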

In [6]:
# Retrieve metadata from XML document
def getxmlcontent(corpus, file, strip_html=True):
    # Default every field to None so the return statement is safe even if parsing fails
    title = date = publisher = text = None

    try:
        tree = etree.parse(os.path.join(corpus, file))
        root = tree.getroot()

        title_node = root.find('.//Title')
        if title_node is not None:
            title = title_node.text

        date_node = root.find('.//NumericDate')
        if date_node is not None:
            date = date_node.text

        publisher_node = root.find('.//PublisherName')
        if publisher_node is not None:
            publisher = publisher_node.text

        # The body text may be stored under any of these tags; use the first one present
        for tag in ('FullText', 'HiddenText', 'Text'):
            text_node = root.find('.//' + tag)
            if text_node is not None:
                text = text_node.text
                break

        # Strip html from text portion
        if text is not None and strip_html:
            text = strip_html_tags(text)

    except Exception as e:
        print(f"Error while parsing file {file}: {e}")

    return title, date, publisher, text
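
The lookup pattern used above (find a tag, fall back to `None` when it is missing, and try several tags for the body text) can be tried on a small record. The sketch below uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; `lxml.etree` offers the same `findtext` method. The sample record and the helper `first_text` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up record that loosely mimics the tags queried above
SAMPLE = (
    "<Record><GOID>123</GOID>"
    "<Obj><Title>Example title</Title><NumericDate>1693-01-01</NumericDate></Obj>"
    "<HiddenText>Some &lt;b&gt;body&lt;/b&gt; text</HiddenText></Record>"
)

def first_text(root, tags):
    """Return the text of the first tag in `tags` found under `root`, else None."""
    for tag in tags:
        # findtext returns None when the tag is absent
        # (and an empty string when the tag exists but has no text)
        value = root.findtext('.//' + tag)
        if value is not None:
            return value
    return None

root = ET.fromstring(SAMPLE)
title = first_text(root, ['Title'])
publisher = first_text(root, ['PublisherName'])
text = first_text(root, ['FullText', 'HiddenText', 'Text'])

print(title, publisher, text, sep=' | ')
```

The body text still contains HTML markup at this point, which is why the notebook passes it through `strip_html_tags` afterwards.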

In [7]:
def remove_numbers(words):
    # Delete every digit from each word; a word made only of digits becomes an empty string
    pattern = '[0-9]'
    return [re.sub(pattern, '', word) for word in words]

In [8]:
def get_fragments(text):
    #Turns the file contents into a list of words, and makes them all lowercase
    casedtext = re.findall(r'\w+', text)
    lowertext = [item.lower() for item in casedtext]
    lowertext = [value for value in lowertext if value != "page" and value != "image"]
    lowertext = remove_numbers(lowertext)

    # Number of words being searched around each term occurrence
    neighbors_range = 5

    # Words that equate to mercury
    mercury_terms = ['argent', 'mercury', 'quicksilver']

    # Number of times each term appears in the text
    m_occ = sum('mercury' in s for s in lowertext) + sum('argent' in s for s in lowertext) + sum('quicksilver' in s for s in lowertext)
    s_occ = sum('sulphur' in s for s in lowertext)
    l_occ = sum('salt' in s for s in lowertext)

    text_fragments = []
    fragment_sample = []
    sample_num = 7

    # Tracks co-occurrences of the terms. The leading letter is the "main term".
    m_s_coocc = 0
    m_l_coocc = 0
    m_s_l_coocc = 0

    s_l_coocc = 0
    s_m_coocc = 0
    s_m_l_coocc = 0

    l_s_coocc = 0
    l_m_coocc = 0
    l_s_m_coocc = 0

    # Lists storing the index of each occurrence of mercury or sulphur
    mercury_indexes = []
    sulphur_indexes = []

    # Lists storing the distances between each occurrence of the first term and the next occurrence of the second term
    m_m_distances = []
    m_s_distances = []
    s_m_distances = []
    s_s_distances = []

    def check_mercury(word):
      for i in mercury_terms: 
        if i in word: 
          return True
      return False

    # Iterates through each word in the text. If a word contains a mercury term, it creates a neighborhood (the 5 words before + the term + the 5 words after) and tracks how many times salt or sulphur occur in the same set of words as mercury.
    # The window start is clamped at 0 so that a term near the beginning of the text does not produce a negative slice index.
    # MERCURY
    for i in range(len(lowertext)):
        if check_mercury(lowertext[i]):
            neighbors = ' '.join(lowertext[max(0, i - neighbors_range):i + neighbors_range + 1])
            mercury_indexes.append(i)
            text_fragments.append(neighbors)
            if 'salt' in neighbors and 'sulphur' in neighbors:
                m_s_l_coocc += 1
                m_s_coocc += 1
                m_l_coocc += 1
            elif 'salt' in neighbors:
                m_l_coocc += 1
            elif 'sulphur' in neighbors:
                m_s_coocc += 1

    # SULPHUR
    for i in range(len(lowertext)):
        if 'sulphur' in lowertext[i]:
            neighbors = ' '.join(lowertext[max(0, i - neighbors_range):i + neighbors_range + 1])
            sulphur_indexes.append(i)
            text_fragments.append(neighbors)
            if 'salt' in neighbors and check_mercury(neighbors):
                s_m_l_coocc += 1
                s_l_coocc += 1
                s_m_coocc += 1
            elif 'salt' in neighbors:
                s_l_coocc += 1
            elif check_mercury(neighbors):
                s_m_coocc += 1

    # SALT
    for i in range(len(lowertext)):
        if 'salt' in lowertext[i]:
            neighbors = ' '.join(lowertext[max(0, i - neighbors_range):i + neighbors_range + 1])
            text_fragments.append(neighbors)
            if 'sulphur' in neighbors and check_mercury(neighbors):
                l_s_m_coocc += 1
                l_s_coocc += 1
                l_m_coocc += 1
            elif 'sulphur' in neighbors:
                l_s_coocc += 1
            elif check_mercury(neighbors):
                l_m_coocc += 1
    
    # Documents with fewer than sample_num fragments return an empty list.
    # The caller unpacks 13 values, so the unpacking fails and the document is skipped.
    if len(text_fragments) < sample_num:
        return []

    fragment_sample = random.sample(text_fragments, sample_num)
    return fragment_sample, m_occ, s_occ, l_occ, m_s_coocc, m_l_coocc, m_s_l_coocc, s_l_coocc, s_m_coocc, s_m_l_coocc, l_s_coocc, l_m_coocc, l_s_m_coocc
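
The neighborhood around a term is a slice of the word list. One detail worth illustrating is what happens near the start of the text: a negative start index does not mean "from the beginning" in Python, it counts from the end of the list. The sketch below shows a clamped window on a made-up sentence; the helper name `neighborhood` is hypothetical.

```python
def neighborhood(tokens, i, radius=5):
    """Return up to `radius` words on each side of position `i`, plus the word itself."""
    start = max(0, i - radius)          # clamp so the slice never starts at a negative index
    return tokens[start:i + radius + 1]

# Made-up example sentence, already lowercased and split into words
tokens = 'mercury is joined with sulphur and salt in the work of the philosophers'.split()

clamped = neighborhood(tokens, 0)
unclamped = tokens[0 - 5:0 + 5 + 1]     # the same window without the clamp

print(clamped)
print(unclamped)
```

Without the clamp, the window for the first word comes back empty, so a term in the first few words of a document would be recorded with no neighbors at all.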

In [9]:
# Create lists to store titles, dates, publishers, and term counts
title_list = []
date_list = []
text_list = []
fragments_list = []
publisher_list = []
m_list = []
s_list = []
l_list = []
m_s_list = []
m_l_list = []
m_s_l_list = []
s_l_list = []
s_m_list = []
s_m_l_list = []
l_s_list = []
l_m_list = []
l_s_m_list = []

docs_count = 0


# Parse files and add data to lists
for file in sample_input_files:
    try:
        # Retrieve the metadata
        title, date, publisher, text = getxmlcontent(corpus, file, strip_html=True)

        fragments, m, s, l, m_s, m_l, m_s_l, s_l, s_m, s_m_l, l_s, l_m, l_s_m = get_fragments(text)

        # Store metadata to lists
        title_list.append(title)
        date_list.append(date)
        publisher_list.append(publisher)

        docs_count += 1
        print(docs_count)

        # Co-occurrences
        m_list.append(m)
        s_list.append(s)
        l_list.append(l)
        m_s_list.append(m_s)
        m_l_list.append(m_l)
        m_s_l_list.append(m_s_l)
        s_l_list.append(s_l)
        s_m_list.append(s_m)
        s_m_l_list.append(s_m_l)
        l_s_list.append(l_s)
        l_m_list.append(l_m)
        l_s_m_list.append(l_s_m)

    # Files that fail to parse, or that yield too few fragments, raise here and are skipped
    except Exception:
        print('Error')

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
Error
41
42
Error
Error
43
44
45
46
47
48
Error
Error
Error
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
Error
76
77
78
79
80
81
82
83
84
85
86
87
Error
88
89
90
91
92
93
Error
94
95
Error
96
97
98
99
100
101
Error
102
103
104
105
106
Error
107
108
109
110
111
112
113
Error
114
Error
Error while parsing file .ipynb_checkpoints: Document is empty, line 1, column 1 (.ipynb_checkpoints, line 1)
Error
115
116
Error
Error
117
118
Error
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
Error
Error
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
Error
157
158
159


## Create Dataframe

This section uses the collected fields to make a dataframe.

In [10]:
# Create a dataframe, setting each of the columns to one of the lists made in the cell above
df = pd.DataFrame({
    'Title': title_list,
    'Date': date_list,
    'Mercury Occurrences': m_list,
    'Sulphur Occurrences': s_list,
    'Salt Occurrences': l_list,
    'M-S': m_s_list,
    'M-L': m_l_list,
    'M-S-L': m_s_l_list,
    'S-L': s_l_list,
    'S-M': s_m_list,
    'S-M-L': s_m_l_list,
    'L-S': l_s_list,
    'L-M': l_m_list,
    'L-S-M': l_s_m_list,
})
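
Keeping one list per column works, but every list has to be appended to in step. A possible alternative, sketched below in plain Python, is to collect one dictionary per document and record why a document was skipped; `pd.DataFrame` also accepts a list of dictionaries. The names `collect_rows` and `toy_analyse` are hypothetical, and `toy_analyse` is only a stand-in for the real parsing and counting functions.

```python
def collect_rows(documents, analyse):
    """Build one dict per document; record the reason for each skipped document."""
    rows, skipped = [], []
    for name, text in documents:
        try:
            counts = analyse(text)
        except (TypeError, ValueError) as err:
            skipped.append((name, str(err)))
            continue
        rows.append({'Title': name, **counts})
    return rows, skipped

def toy_analyse(text):
    """Stand-in analysis: count words containing 'mercury' or 'salt'."""
    if text is None:
        raise ValueError('no text found')
    words = text.lower().split()
    return {
        'Mercury Occurrences': sum('mercury' in w for w in words),
        'Salt Occurrences': sum('salt' in w for w in words),
    }

# Made-up documents: one with text, one without
docs = [('doc A', 'Mercury and salt and more salt'), ('doc B', None)]
rows, skipped = collect_rows(docs, toy_analyse)

print(rows)
print(skipped)
```

With this layout, `pd.DataFrame(rows)` would produce one row per successfully processed document, and `skipped` replaces the bare `Error` lines with a named reason.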

In [11]:
# View dataframe
df

| | Title | Date | Mercury Occurrences | Sulphur Occurrences | Salt Occurrences | M-S | M-L | M-S-L | S-L | S-M | S-M-L | L-S | L-M | L-S-M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Medicinal experiments, or, A collection of cho... | 1693-01-01 | 5 | 14 | 26 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| 1 | Chymical secrets and rare experiments in physi... | 1683-01-01 | 12 | 154 | 356 | 2 | 3 | 2 | 7 | 2 | 2 | 8 | 4 | 2 |
| 2 | Observations on the mineral waters of France m... | 1684-01-01 | 18 | 64 | 360 | 1 | 3 | 1 | 25 | 1 | 1 | 30 | 3 | 1 |
| 3 | A letter in answer to certain quæries and obje... | 1670-01-01 | 5 | 5 | 9 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 3 | 2 |
| 4 | Paracelsus his Aurora, & treasure of the philo... | 1659-01-01 | 47 | 43 | 33 | 12 | 3 | 1 | 7 | 13 | 2 | 7 | 3 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154 | A treatise of Lewisham (but vulgarly miscalled... | 1680-01-01 | 4 | 15 | 52 | 1 | 2 | 1 | 3 | 1 | 1 | 3 | 2 | 1 |
| 155 | Vindiciæ literarum, the schools guarded, or, T... | 1655-01-01 | 6 | 2 | 4 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 1 | 1 |
| 156 | Tryon's letters upon several occasions ... by ... | 1700-01-01 | 5 | 11 | 36 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 157 | The anatomy of human bodies, comprehending the... | 1694-01-01 | 30 | 215 | 477 | 4 | 3 | 2 | 71 | 4 | 3 | 70 | 3 | 3 |


## Save Dataframe as CSV

Make sure to change the `output_file` variable (defined near the top of the script) to the desired output file name before running this cell.

In [14]:
# Save output to file
df.to_csv(output_file, compression='gzip')
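
`to_csv` writes the dataframe index as the first column with an empty header, and `pd.read_csv` labels such a column `Unnamed: 0` unless it is called with `index_col=0`. The sketch below inspects a gzip-compressed CSV with the standard library only. The file it reads is a small hand-written stand-in that mimics that layout, not the notebook's real output.

```python
import csv
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as demo:
    path = os.path.join(demo, 'demo.csv.gz')

    # Write a tiny stand-in file: empty first header cell, then one data row
    with gzip.open(path, 'wt', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['', 'Title', 'M-S'])
        writer.writerow([0, 'Example title, with a comma', 2])

    # Read it back in text mode; the csv module handles the quoted comma
    with gzip.open(path, 'rt', newline='', encoding='utf-8') as handle:
        header, first_row = list(csv.reader(handle))[:2]

print(header)
print(first_row)
```

All values come back as strings here; pandas would infer numeric types when reading the same file.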