# Summary
This notebook removes comments containing words irrelevant to Title IX from a database. The database was scraped from the regulations.gov public request for comments on Title IX.

The cleaned data is then saved for use in later notebooks.

In [1]:
import json
import pandas
# Import from the top-level namespace; the pandas.io.json path has been deprecated since pandas 1.0.
from pandas import json_normalize

In [2]:
# Load the database file; the context manager closes the file handle afterwards.
with open('./data/db3.json') as f:
    json_data = json.load(f)

# Convert the data from regular JSON to a more easily-manipulated dataframe
data = json_normalize(json_data)
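
The dotted column names that appear in this notebook (`doc._id`, `value.rev`, and so on) come from how `json_normalize` flattens nested objects: each nested key is joined to its parent with a `.`. The following is a minimal sketch of that behaviour. The `sample_rows` record is hypothetical, shaped to resemble a CouchDB export row, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical record resembling one row of a CouchDB export.
# The ids and field values are invented for this example.
sample_rows = [
    {
        "id": "abc123",
        "key": "abc123",
        "value": {"rev": "1-xyz"},
        "doc": {"_id": "abc123", "_rev": "1-xyz", "name": "Jane Doe", "state": "MN"},
    },
]

# Nested dictionaries become top-level columns named "<parent>.<child>".
flat = pd.json_normalize(sample_rows)
print(sorted(flat.columns))
```

Each nested field ends up as its own column, which is why the CouchDB bookkeeping fields show up alongside the comment fields after loading.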

With the database imported, let's inspect it and make sure it's ready for analysis and use in further notebooks.

In [3]:
# Let's see how many comments (data samples) are present.
# print() is needed because only a cell's last expression is echoed automatically.
print(len(data))

# Take a look at the first 3 comments.
display(data[:3])

Unnamed: 0,doc._id,doc._rev,doc.attachment_download,doc.attachment_download -href,doc.attachment_name,doc.category,doc.city,doc.comment_body,doc.country,doc.name,doc.state,doc.zip,id,key,value.rev
0,0f3b11691179a9abe35b8f18a9000950,1-4773e83804ba65b25953a680828dc134,,,,,United States,"Dear Assistant General Counsel Hilary Malawer,...",Parent/Relative,Heather Hirsch,MN,55016,0f3b11691179a9abe35b8f18a9000950,0f3b11691179a9abe35b8f18a9000950,1-4773e83804ba65b25953a680828dc134
1,0f3b11691179a9abe35b8f18a9001152,1-49b2757833fa575401bca5e42cbae985,,,,,United States,"Dear Assistant General Counsel Hilary Malawer,...",Other,Maryann Decker,UT,84737,0f3b11691179a9abe35b8f18a9001152,0f3b11691179a9abe35b8f18a9001152,1-49b2757833fa575401bca5e42cbae985
2,0f3b11691179a9abe35b8f18a900209a,1-72d2132b690f9c0de839db6908f376b7,,,,,United States,"Dear Assistant General Counsel Hilary Malawer,...",Other,Greg Lofgren,WI,53704,0f3b11691179a9abe35b8f18a900209a,0f3b11691179a9abe35b8f18a900209a,1-72d2132b690f9c0de839db6908f376b7


The output above contains a few unnecessary columns, namely the ones created when the database was imported from a CouchDB instance. The next code block removes these columns.

In [4]:
# Drop the CouchDB bookkeeping columns from the dataframe.
data.drop(labels=['doc._id', 'doc._rev', 'id', 'key', 'value.rev'], inplace=True, axis='columns')
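
One caveat with an in-place drop in a notebook: running the cell a second time raises a `KeyError`, because the columns are already gone. A possible alternative, sketched below, returns a new dataframe and passes `errors='ignore'` so that repeated runs are harmless. The helper name `drop_couchdb_columns` and the `demo` dataframe are hypothetical and only serve the illustration.

```python
import pandas as pd

# Column names taken from the drop call above.
COUCHDB_COLUMNS = ['doc._id', 'doc._rev', 'id', 'key', 'value.rev']


def drop_couchdb_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without the CouchDB bookkeeping columns."""
    # errors='ignore' skips any listed column that is not present.
    return frame.drop(columns=COUCHDB_COLUMNS, errors='ignore')


# Hypothetical miniature dataframe with made-up values.
demo = pd.DataFrame({'doc._id': ['a'], 'doc.name': ['Jane Doe'], 'id': ['a']})

once = drop_couchdb_columns(demo)
twice = drop_couchdb_columns(once)  # second call changes nothing and does not raise
print(list(twice.columns))
```

Because the helper does not modify its argument, the original dataframe stays available for comparison if something looks wrong after the drop.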

With the dataset "cleaned up", we inspect it again for any further issues.

In [5]:
display(data[:1])

Unnamed: 0,doc.attachment_download,doc.attachment_download -href,doc.attachment_name,doc.category,doc.city,doc.comment_body,doc.country,doc.name,doc.state,doc.zip
0,,,,,United States,"Dear Assistant General Counsel Hilary Malawer,...",Parent/Relative,Heather Hirsch,MN,55016


The cleaned-up database is exported to `./data/data_cleaned.json` for use in other notebooks.

In [6]:
data.to_json('./data/data_cleaned.json', orient='records')
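
A later notebook could read the exported file back with `pandas.read_json`. The sketch below shows one way to do that, using a temporary directory and a hypothetical two-row dataframe with made-up values instead of the real data. It passes `dtype=False` on the assumption that fields such as zip codes should stay strings; with the default dtype inference, pandas may convert numeric-looking strings to numbers, which would drop leading zeros.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned dataframe; the values are made up.
cleaned = pd.DataFrame({
    'doc.name': ['Jane Doe', 'John Roe'],
    'doc.zip': ['05501', '84737'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data_cleaned.json')

    # Same export call as above: one JSON object per row.
    cleaned.to_json(path, orient='records')

    # dtype=False turns off dtype inference, so strings stay strings.
    restored = pd.read_json(path, orient='records', dtype=False)

print(restored['doc.zip'].tolist())
```

Checking a few values after the round trip is a cheap way to confirm that the export preserved what downstream notebooks expect.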