<h1>Women Data Scientists, DC Meetup, November 2015</h1>
<h2>Jennifer A Stark</h2>
<h3>Web-scraping code: how I got the data from the NamUs website and put it into a pandas DataFrame</h3>

<h3>Step 1: grab all the HTML files from the website and save them on your hard drive</h3>

* **First, import everything we need**

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from time import sleep

* **Next, define the `namus_scrape` function**
-> **This function loops through a range of numeric values (case numbers). For each value it prints it and passes it to the function `get_html`. If `get_html` fails for that value, `namus_scrape` moves on to the next case number.**

* Calling `sleep(2)` is polite: the function waits 2 seconds before trying the next case number, so it does not overwhelm the website.

In [None]:
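def namus_scrape(cases):
    for case in cases:
        try:
            print(case)
            get_html(case)
        except Exception:
            #this case number could not be fetched, so move on to the next one
            pass
        
        #wait 2 seconds after every case (inside the loop) before requesting the next one
        sleep(2)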

* **Define the `get_html` function -> This function requests the page for the current case number from the website and writes the HTML it gets back to a file on your machine.**

In [None]:
def get_html(case_id):
    
    #using the current case number passed to it from namus_scrape, we create a filename eg `case_30.html` and 
    #assign it the variable name `case_html` to be used in this function
    case_html = 'case_' + str(case_id) + '.html'
    
    #request the URL for the case we're currently on ('case_id') and call the response 'r'
    #(do this before opening the file, so a failed request does not leave an empty file behind)
    r = requests.get('https://identifyus.org/en/cases/' + str(case_id) + '/')
    
    #open this new file as f (f for 'file') in binary mode
    with open(case_html, 'wb') as f:
        
        #for each 1024-byte chunk of data, write it to our html file 'f'
        for chunk in r.iter_content(1024):
            f.write(chunk)
            
    #return the name of our newly written html file
    return case_html

* **With these 2 functions defined, we can now call `namus_scrape` with the range of cases we want.**
* **I did this in chunks of ~500 because it takes a while**
* **The range below represents the <em>whole</em> range of values searched**

In [None]:
namus_scrape(range(0, 16000))
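
* Splitting the full range into chunks of ~500 can be done by hand, or with a small helper. The sketch below is one possible way to do it; `chunked_ranges` is a hypothetical helper name and not something I used at the time.

```python
def chunked_ranges(start, stop, size=500):
    """Yield consecutive range objects that together cover start..stop."""
    for lo in range(start, stop, size):
        # the last chunk is cut short if the total is not a multiple of `size`
        yield range(lo, min(lo + size, stop))

# build the chunks for the whole range of case numbers searched
chunks = list(chunked_ranges(0, 16000))
print(len(chunks), chunks[0], chunks[-1])
```

* Each chunk could then be passed to `namus_scrape` in turn, for example `for chunk in chunked_ranges(0, 16000): namus_scrape(chunk)`, stopping and restarting between chunks as needed.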

<h3>Step 2: Trawl through each HTML file on your machine extracting all the data you want, and pickle it (save it as a dataframe for later)</h3>

* **First we make an empty list to store the data from each case**

In [None]:
#Initialize an empty list to put things into
namus = []

* **Next we define a function that tries to open each case HTML file on your machine.**
* **When it finds a file that exists, it passes the contents to the `get_namus_fromFile` function**

In [None]:
def namus_fromFile(cases):
    for case in cases:
        
        #use 'try' because not every case number has a case-file associated with it
        try:
            
            #open the file if it exists, and read it once
            #(plain 'r' mode: the old 'rU' mode is deprecated and no longer accepted by recent Python 3 versions)
            with open("../../Namus_data/html_files2/" + 'case_' + str(case) + '.html', 'r') as f:
                lines = f.readlines()
                
            #this next bit checks that the file is long. Some files were 'placeholders' and had no actual case data
            if len(lines) > 500:
                
                #if it is long enough to have case information, join the lines back into one string
                html = ''.join(lines)
                
                # I like printing things because I can see it's working!
                print(case)
                
                #call `get_namus_fromFile` and pass it the html we just read. Then append everything 
                # `get_namus_fromFile` collected to our `namus` list
                namus.append(get_namus_fromFile(html))
                
        #if there is no file for that case number (or the file cannot be parsed), continue to the next case number
        except Exception:
            continue
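
* The "is this file long enough?" check can also be pulled out into its own function so it is easy to test. This is only a sketch: `has_case_data` is a hypothetical name, and the 500-line threshold is simply the one used above.

```python
import os
import tempfile

def has_case_data(path, min_lines=500):
    """Return True if the file exists and has more than `min_lines` lines."""
    try:
        with open(path, 'r') as f:
            # count lines without keeping the whole file in memory
            return sum(1 for _ in f) > min_lines
    except FileNotFoundError:
        # no file was saved for this case number
        return False

# quick demonstration with throwaway files
with tempfile.TemporaryDirectory() as tmp:
    short_file = os.path.join(tmp, 'case_1.html')
    long_file = os.path.join(tmp, 'case_2.html')
    missing_file = os.path.join(tmp, 'case_3.html')
    with open(short_file, 'w') as f:
        f.write('<html></html>\n')
    with open(long_file, 'w') as f:
        f.write('<p>row</p>\n' * 600)
    results = (has_case_data(short_file),
               has_case_data(long_file),
               has_case_data(missing_file))

print(results)
```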

* **The file contents are passed to `get_namus_fromFile`, which goes through the HTML and extracts the data described below**

In [None]:
#read in the html file that function `namus_fromFile` found
def get_namus_fromFile(html):
    
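    #open with beautiful soup so you can use its functionality to extract all the data you want
    #(naming the parser explicitly avoids a warning and keeps results the same on every machine)
    b = BeautifulSoup(html, 'html.parser')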
    
    #for each individual case, an empty dictionary is initiated. 
    #Every line afterwards stores data (values) from each case to keys named on the left of the '=' for that one case
    individual={}
    
    individual['case_rating'] = b.find(name='dt', attrs={'class':'rating'}).find(name='span').text.strip() 
    individual['case_status'] = b.find(name='label', attrs={'for':'case_status'}).find_next('td').text.strip() 
    individual['case_number'] = b.find(name='label', attrs={'for':'case_case_number'}).find_next('td').text.strip() 
    individual['date_found'] = b.find(name='label', attrs={'for':'case_date_found_2i'}).find_next('td').text.strip()
   
    individual['est_age'] = b.find(name='label', attrs={'for':'case_estimated_age'}).find_next('td').text.strip()
    individual['min_age']= b.find(name='label', attrs={'for':'case_minimum_age'}).find_next('td').text.strip()
    individual['max_age'] = b.find(name='label', attrs={'for':'case_maximum_age'}).find_next('td').text.strip()
    individual['race'] = b.find(name='label', attrs={'for':'case_race'}).find_next('td').text.strip() 
    individual['ethnicity'] = b.find(name='label', attrs={'for':'case_ethnicity'}).find_next('td').text.strip()
    individual['sex'] = b.find(name='label', attrs={'for':'case_sex'}).find_next('td').text.strip() 
    individual['weight'] = b.find(name='label', attrs={'for':'case_weight'}).find_next('td').text.strip() 
    individual['height'] = b.find(name='label', attrs={'for':'case_height'}).find_next('td').text.strip()
    
    individual['all_parts_recovered'] = int(b.find(name='input', id='case_body_inventory_all_parts_recovered').find_next('input')['value'])
    individual['head_not_recovered'] = int(b.find(name='input', id='case_body_inventory_head_not_recovered').find_next('input')['value'])
    individual['torso_not_recovered'] = int(b.find(name='input', id='case_body_inventory_torso_not_recovered').find_next('input')['value'])
    individual['n-limbs_not_recovered'] = int(b.find(name='input', id='case_body_inventory_limbs_not_recovered').find_next('input')['value'])
    individual['n-hands_not_recovered'] = int(b.find(name='input', id='case_body_inventory_hands_not_recovered').find_next('input')['value'])
    individual['recognizable_face'] = b.find(name='label', attrs={'for':'case_body_condition'}).find_next('td').text.strip() 
    
    individual['min_year_of_death'] = b.find(name='label', attrs={'for':'case_minimum_year_of_death'}).find_next('td').text.strip()
    individual['postmortem_interval'] = b.find(name='label', attrs={'for':'case_postmortem_interval'}).find_next('td').text.strip()
    
    individual['address_1'] = b.find(name='label', attrs={'for':'case_address_found_1'}).find_next('td').text.strip()
    individual['address_2'] = b.find(name='label', attrs={'for':'case_address_found_2'}).find_next('td').text.strip()
    individual['city'] = b.find(name='label', attrs={'for':'case_city_found'}).find_next('td').text.strip()
    individual['state'] = b.find(name='label', attrs={'for':'case_state_found_id'}).find_next('td').text.strip()
    individual['zip'] = b.find(name='label', attrs={'for':'case_zip_found'}).find_next('td').text.strip()
    individual['county'] = b.find(name='label', attrs={'for':'case_county_found_id'}).find_next('td').text.strip()
    
    individual['circumstances'] = b.find(name='div', id="case_circumstances").text.strip()
    
    individual['hair_color'] = b.find(name='label', attrs={'for':'case_hair_color'}).find_next('td').text.strip()
    individual['head_hair'] = b.find(name='div', id='case_head_hair').text.strip() 
    individual['body_hair'] = b.find(name='div', id='case_body_hair').text.strip() 
    individual['facial_hair'] = b.find(name='div', id='case_facial_hair').text.strip() 
    individual['left_eye_color'] = b.find(name='label', attrs = {'for':'case_eye_color_left'}).find_next('td').text.strip() 
    individual['right_eye_color'] = b.find(name='label', attrs = {'for':'case_eye_color_right'}).find_next('td').text.strip() 
    individual['eye_description'] = b.find(name='div', id='case_eye_description').text.strip() 
    
    individual['amputations'] = int(b.find(name='input', id='case_amputations').find_next('input')['value'])
    if individual['amputations'] == 1:
        individual['amputations_description'] = b.find(name='div', id='case_amputations_details').text.strip() 
    else:
        individual['amputations_description'] = 'NA'
        
    individual['deformities'] = int(b.find(name='input', id='case_deformities').find_next('input')['value'])
    if individual['deformities'] == 1:
        individual['deformities_description'] = b.find(name='div', id='case_deformities_details').text.strip() 
    else:
        individual['deformities_description'] = 'NA'
        
    individual['scars_and_marks'] = int(b.find(name='input', id='case_scars_and_marks').find_next('input')['value'])
    if individual['scars_and_marks'] == 1:
        individual['scars_and_marks_description'] = b.find(name='div', id='case_scars_and_marks_details').text.strip() 
    else:
        individual['scars_and_marks_description'] = 'NA'
        
    individual['tattoos'] = int(b.find(name='input', id='case_tattoos').find_next('input')['value'])
    if individual['tattoos'] == 1:
        individual['tattoos_description'] = b.find(name='div', id='case_tattoos_details').text.strip() 
    else:
        individual['tattoos_description'] = 'NA'
    
    individual['piercings'] = int(b.find(name='input', id='case_piercings').find_next('input')['value'])
    if individual['piercings'] == 1:
        individual['piercings_description'] = b.find(name='div', id='case_piercings_details').text.strip() 
    else:
        individual['piercings_description'] = 'NA'
        
    individual['artificial_parts_aids'] = int(b.find(name='input', id='case_artificial_body_parts_and_aids').find_next('input')['value'])
    if individual['artificial_parts_aids'] == 1:
        individual['artificial_parts_aids_description'] = b.find(name='div', id='case_artificial_body_parts_and_aids_details').text.strip() 
    else:
        individual['artificial_parts_aids_description'] = 'NA'
        
    individual['finger_toe_nails'] = int(b.find(name='input', id='case_finger_and_toe_nails').find_next('input')['value'])
    if individual['finger_toe_nails'] == 1:
        individual['finger_toe_nails_description'] = b.find(name='div', id='case_finger_and_toe_nails_details').text.strip() 
    else:
        individual['finger_toe_nails_description'] = 'NA'
        
    individual['other_distinctive_features'] = int(b.find(name='input', id='case_physical_other').find_next('input')['value'])
    if individual['other_distinctive_features'] == 1:
        individual['other_distinctive_features_description'] = b.find(name='div', id='case_physical_other_details').text.strip() 
    else:
        individual['other_distinctive_features_description'] = 'NA'
        
    individual['medical_implants'] = int(b.find(name='input', id='case_medical_implants').find_next('input')['value'])
    if individual['medical_implants'] == 1:
        individual['medical_implants_description'] = b.find(name='div', id='case_medical_implants_details').text.strip() 
    else:
        individual['medical_implants_description'] = 'NA'
        
    individual['foreign_objects'] = int(b.find(name='input', id='case_foreign_objects').find_next('input')['value'])
    if individual['foreign_objects'] == 1:
        individual['foreign_objects_description'] = b.find(name='div', id='case_foreign_objects_details').text.strip() 
    else:
        individual['foreign_objects_description'] = 'NA'
        
    individual['skeletal_findings'] = int(b.find(name='input', id='case_skeletal_findings').find_next('input')['value'])
    if individual['skeletal_findings'] == 1:
        individual['skeletal_findings_description'] = b.find(name='div', id='case_skeletal_findings_details').text.strip() 
    else:
        individual['skeletal_findings_description'] = 'NA'
       
    individual['organ_absent'] = int(b.find(name='input', id='case_organ_absent').find_next('input')['value'])
    if individual['organ_absent'] == 1:
        individual['organ_absent_description'] = b.find(name='div', id='case_organ_absent_details').text.strip() 
    else:
        individual['organ_absent_description'] = 'NA'
        
    individual['prior_surgery'] = int(b.find(name='input', id='case_prior_surgery').find_next('input')['value'])
    if individual['prior_surgery'] == 1:
        individual['prior_surgery_description'] = b.find(name='div', id='case_prior_surgery_details').text.strip() 
    else:
        individual['prior_surgery_description'] = 'NA'
        
    individual['other_medical_information'] = int(b.find(name='input', id='case_medical_other').find_next('input')['value'])
    if individual['other_medical_information'] == 1:
        individual['other_medical_information_description'] = b.find(name='div', id='case_medical_other_details').text.strip() 
    else:
        individual['other_medical_information_description'] = 'NA'
    
    individual['fingerprints'] = b.find(name='div', id='fingerprints').find_next('td', attrs={'class':'view_field'}).text.strip()
   
    individual['clothing_on_body'] = b.find('div', id='case_clothing_on_body').text.strip() 
    individual['clothing_with_body'] = b.find('div', id='case_clothing_with_body').text.strip() 
    individual['footwear'] = b.find('div', id='case_footwear').text.strip() 
    individual['jewelry'] = b.find('div', id='case_jewelry').text.strip() 
    individual['eyewear'] = b.find('div', id='case_eyewear').text.strip() 
    individual['other_items_with_body'] = b.find('div', id='case_other_items_with_body').text.strip() 
    
    individual['dental'] = b.find(name='div', id='dental').find_next('td', attrs={'class':'view_field'}).text.strip()
    
    individual['dna'] = b.find(name='div', id='dna').find_next('td', attrs={'class':'view_field'}).text.strip()
    
    individual['images'] = len(b.find('div', id='image_box').find_all('img'))
    
    #return the 'individual' dictionary so that it can be appended to the `namus` list specified in `namus_fromFile` 
    return individual
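
* Most lines in `get_namus_fromFile` repeat one of two lookups: "find the label, take the next table cell" and "read a 0/1 flag, then read its `_details` div if the flag is 1". Below is a sketch of how those two patterns could be wrapped in helpers. `label_value` and `flag_and_description` are hypothetical names, and the HTML snippet is made up so that the same selectors work on it; the real pages may be laid out differently.

```python
from bs4 import BeautifulSoup

def label_value(soup, label_for):
    """Return the text of the first <td> after <label for=label_for>."""
    label = soup.find(name='label', attrs={'for': label_for})
    return label.find_next('td').text.strip()

def flag_and_description(soup, field_id):
    """Return (flag, description) for a 0/1 field and its '<field_id>_details' div."""
    flag = int(soup.find(name='input', id=field_id).find_next('input')['value'])
    if flag == 1:
        description = soup.find(name='div', id=field_id + '_details').text.strip()
    else:
        description = 'NA'
    return flag, description

# made-up snippet that imitates the structure the selectors expect
sample = """
<table><tr><td><label for="case_sex">Sex</label></td><td> Female </td></tr></table>
<input id="case_tattoos" type="hidden"><input type="checkbox" value="1">
<div id="case_tattoos_details"> example description </div>
"""
soup = BeautifulSoup(sample, 'html.parser')

sex = label_value(soup, 'case_sex')
tattoos, tattoos_description = flag_and_description(soup, 'case_tattoos')
print(sex, tattoos, tattoos_description)
```

* With helpers like these, a line in the function above could shrink to something like `individual['sex'] = label_value(b, 'case_sex')`. The fields whose dictionary key differs from the HTML id (for example `artificial_parts_aids`) would still need both names spelled out.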

* **Call the function `namus_fromFile`.**
* **I ran this all in one go because reading local files is much faster than downloading the HTML files from the web. It took about 30 minutes in total**

In [None]:
namus_fromFile(range(0,16000))

* **Finally, convert the list of dictionaries into a Pandas dataframe, and save it as both a csv and pickle (options are nice)**

In [None]:
# Convert the list of dictionaries to a pandas dataframe
namusdb = pd.DataFrame(namus)

#dataframes only exist in memory while you are working on them. You lose everything once you restart your kernel. Therefore, you 
# have to store the data somewhere as a csv (typically) or pickle it. 
namusdb.to_csv('namus_html.csv')
namusdb.to_pickle('namus_html.pkl')
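
* To pick the data up again in a later session, both files can be read back with pandas. The sketch below uses a tiny stand-in dataframe and temporary files (the values and the `example.*` filenames are made up for illustration) rather than the real `namus_html` files.

```python
import os
import tempfile
import pandas as pd

# stand-in for the real `namus` list of dictionaries
rows = [{'case_number': 'A-1', 'images': 0},
        {'case_number': 'A-2', 'images': 3}]
df = pd.DataFrame(rows)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'example.csv')
    pkl_path = os.path.join(tmp, 'example.pkl')
    df.to_csv(csv_path)
    df.to_pickle(pkl_path)

    # to_csv wrote the row index as the first column, so read it back as the index
    from_csv = pd.read_csv(csv_path, index_col=0)
    # the pickle restores the dataframe as it was, dtypes included
    from_pickle = pd.read_pickle(pkl_path)

print(from_csv.shape, from_pickle.shape)
```

* The CSV is readable by other tools but its column types are re-inferred when it is loaded; the pickle keeps the types but should only be loaded from sources you trust.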