# Summary
This notebook takes the `data_cleaned` file produced by the Data Cleanup notebook and separates its comments into two new datasets: `has_attachment` and `no_attachment`.
It then downloads the attachments for the comments in the `has_attachment` dataset from the regulations.gov website.
The downloaded attachments are saved in this repository under the `resources/attachments` folder.

The `has_attachment` and `no_attachment` datasets are then exported for use in other notebooks.

In [13]:
import json
import pandas
import urllib.request
import tempfile
import os
import time
from subprocess import (PIPE, Popen)

## Separating the two datasets

The dataset from the Data Cleanup notebook, `data_cleaned`, is imported. The following two code blocks then separate this dataset into two new ones, `has_attachment` and `no_attachment`.

In [2]:
# Import the cleaned data. dtype=False keeps pandas from inferring column types.
data = pandas.read_json('./data/data_cleaned.json', orient='records', dtype=False)

In [3]:
# Get the subset of the dataset whose comments have an attachment URL.
has_attachment = data.dropna(subset=['doc.attachment_download -href'])

# Count the comments in this subset.
len(has_attachment)

520

In [4]:
# Create the complementary subset: comments that _don't_ have an attachment.
no_attachment = data[pandas.isnull(data['doc.attachment_download -href'])]

# Count the comments in this subset.
len(no_attachment)

16012

To confirm that we aren't missing any comments, we check that the lengths of `has_attachment` and `no_attachment` sum to the length of the original dataset.

In [5]:
len(has_attachment) + len(no_attachment) == len(data)

True
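
The length check works because `dropna(subset=...)` and the `isnull` mask select complementary rows. A minimal sketch on a toy frame, where the column name `href` and the sample values are made up for illustration:

```python
import pandas

# Toy data: the column name `href` and its values are illustrative only.
toy = pandas.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'href': ['https://example.org/1', None, 'https://example.org/2', None],
})

# Rows where `href` is present, and rows where it is missing.
with_href = toy.dropna(subset=['href'])
without_href = toy[pandas.isnull(toy['href'])]

# Every row lands in exactly one of the two subsets.
split_is_complete = len(with_href) + len(without_href) == len(toy)
no_overlap = with_href.index.intersection(without_href.index).empty
```

Checking that the two index sets do not overlap is a slightly stronger test than comparing lengths alone.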

## Inspecting the new datasets

The first 3 comments of each dataset are displayed below.

In [6]:
display(has_attachment[:3])

Unnamed: 0,doc.attachment_download,doc.attachment_download -href,doc.attachment_name,doc.category,doc.city,doc.comment_body,doc.country,doc.name,doc.state,doc.zip
2406,,https://www.regulations.gov/contentStreamer?do...,,Law firm,Arlington,As a lawyer who once practiced education and c...,United States,Hans Bader,VA,
2425,,https://www.regulations.gov/contentStreamer?do...,,,National Advocacy Organization,"Attached, please find comments from The Educat...",,Daria Hall,,
2427,,https://www.regulations.gov/contentStreamer?do...,,Community Organization,Oakton,Dear Members of the United State Department of...,United States,Janet Samuelson,VA,


In [7]:
display(no_attachment[:3])

Unnamed: 0,doc.attachment_download,doc.attachment_download -href,doc.attachment_name,doc.category,doc.city,doc.comment_body,doc.country,doc.name,doc.state,doc.zip
0,,,,,United States,"Dear Assistant General Counsel Hilary Malawer,...",Parent/Relative,Heather Hirsch,MN,55016
1,,,,,United States,"Dear Assistant General Counsel Hilary Malawer,...",Other,Maryann Decker,UT,84737
2,,,,,United States,"Dear Assistant General Counsel Hilary Malawer,...",Other,Greg Lofgren,WI,53704


Because the row indices of both datasets are no longer contiguous after filtering, they are reset.

The two datasets are then saved for later use.

In [8]:
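# reset_index returns a new DataFrame, so the result has to be assigned back.
has_attachment = has_attachment.reset_index(drop=True)
no_attachment = no_attachment.reset_index(drop=True)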
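
`reset_index` returns a new DataFrame unless `inplace=True` is passed, so its result needs to be assigned for the change to stick. A small sketch on a toy frame (the values are made up):

```python
import pandas

# Toy frame with a non-contiguous index, similar to a filtered subset.
subset = pandas.DataFrame({'value': ['x', 'y', 'z']}, index=[2406, 2425, 2427])

# Without assignment the original frame keeps its old index.
subset.reset_index(drop=True)
index_unchanged = list(subset.index)

# Assigning the result gives the new 0-based index.
subset = subset.reset_index(drop=True)
index_reset = list(subset.index)
```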

In [9]:
# Saving the `has_attachment` dataset as `has_attachment.json`
has_attachment.to_json('./data/has_attachment.json', orient='records')

# Saving the `no_attachment` dataset as `no_attachment.json`
no_attachment.to_json('./data/no_attachment.json', orient='records')

## Downloading the attachments

Downloading the attachments is a multi-step process.

Attachments are either PDFs or Microsoft Word documents. The document type is first identified from the URL, and the file is then downloaded and saved with the matching extension. Each file is named with a running counter that starts at 1 and follows the row order of `has_attachment`.

In [10]:
# Header information for downloading the PDF attachment.
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0'
headers = {'User-Agent': user_agent}

In [11]:
# Download all attachments as .pdf or .docx and name them with a running counter.
def download_attachments(row):
    global counter
    url = str(row['doc.attachment_download -href'])

    # If the URL ends in `msw12` the file is a Microsoft Word document,
    # otherwise it's a PDF.
    if url.endswith('msw12'):
        extension = '.docx'
    else:
        extension = '.pdf'

    name = str(counter) + extension

    try:
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request) as response:
            content = response.read()
        # The `with` block closes the file even if the write fails.
        with open(name, 'wb') as file:
            file.write(content)

    except Exception:
        # Catch ordinary errors only, so a keyboard interrupt still stops the run.
        print("failed to download", str(counter))

    # To prevent the download server from blocking our IP, we wait 3 seconds before the next download.
    time.sleep(3)
    counter = counter + 1
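
The naming rule can also be pulled out into a small pure function, which makes it easy to check without touching the network. This is only a sketch: `attachment_filename` is a hypothetical helper, and the example URLs are placeholders rather than real regulations.gov links.

```python
def attachment_filename(url, number):
    """Build a filename from a download URL and a running number.

    Mirrors the rule used in the notebook: URLs ending in `msw12` are
    treated as Word documents, everything else as PDF.
    """
    extension = '.docx' if url.endswith('msw12') else '.pdf'
    return str(number) + extension

# Placeholder URLs, used only to exercise both branches.
word_name = attachment_filename('https://example.org/stream?contentType=msw12', 1)
pdf_name = attachment_filename('https://example.org/stream?contentType=pdf', 2)
```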

In [None]:
# Counter for filenames in download_attachments.
counter = 1

# Save files in resources/attachments.
os.chdir("./resources/attachments")

# Download all the attachments in has_attachment.
has_attachment.apply(download_attachments, axis=1)

failed to download 21
failed to download 78
failed to download 79
failed to download 80
failed to download 81
failed to download 82
failed to download 83
failed to download 84
failed to download 85
failed to download 86
failed to download 87
failed to download 88
failed to download 89
failed to download 90
failed to download 91
failed to download 92
failed to download 93
failed to download 94
failed to download 95
failed to download 96
failed to download 97
failed to download 98
failed to download 99
failed to download 100
failed to download 101
failed to download 102
failed to download 103
failed to download 104
failed to download 105
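
The log above lists which downloads failed but not why. One option is to collect the failed numbers and retry them with a growing delay. The sketch below rests on assumptions: `retry_failed` and `fetch` are hypothetical names, and `fetch` stands in for whatever function downloads a single attachment and raises an exception on failure.

```python
import time

def retry_failed(failed, fetch, attempts=3, base_delay=3.0):
    """Retry each failed item up to `attempts` times.

    `fetch` is a hypothetical callable that downloads one item and raises
    an exception on failure. Returns the items that still failed.
    """
    still_failed = []
    for item in failed:
        for attempt in range(attempts):
            try:
                fetch(item)
                break
            except Exception:
                # Wait longer after each failed attempt.
                time.sleep(base_delay * (attempt + 1))
        else:
            # The inner loop never hit `break`, so every attempt failed.
            still_failed.append(item)
    return still_failed

# Stand-in fetcher for demonstration: item 21 always fails,
# every other item succeeds on its second call.
calls = {}

def fake_fetch(item):
    calls[item] = calls.get(item, 0) + 1
    if item == 21 or calls[item] < 2:
        raise IOError('simulated failure')

# base_delay=0 keeps the demonstration fast; a real run would keep a pause.
remaining = retry_failed([21, 78, 79], fake_fetch, base_delay=0)
```

Passing the fetcher in as an argument keeps the retry logic separate from the download code, so it can be tried out with a stand-in before being pointed at the real server.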
