<h3>Data collection stage</h3>

In this notebook, we gather the data by scraping a New York State government website with the BeautifulSoup package.
The scraped data is then processed and stored in a pandas dataframe.

In [1]:
#Packages used for Data Manipulation
import pandas as pd 
import numpy as np

Next, we install the BeautifulSoup package (beautifulsoup4) for web scraping.

In [2]:
!conda install -c conda-forge beautifulsoup4 --yes

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.7.1       |        py36_1001         140 KB  conda-forge
    soupsieve-1.7.1            |        py36_1000          49 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         189 KB

The following NEW packages will be INSTALLED:

    soupsieve:      1.7.1-py36_1000 conda-forge

The following packages will be UPDATED:

    beautifulsoup4: 4.6.3-py36_0    anaconda    --> 4.7.1-py36_1001 conda-forge


Downloading and Extracting Packages
beautifulsoup4-4.7.1 | 140 KB    | ##################################### | 100% 
soupsieve-1.7.1      | 49 KB     | ##################################### | 100% 
Preparin

<b>Here, we pass the website link to the 'urlopen' function. The returned page is then parsed using the BeautifulSoup package.</b>

In [9]:
# Read the page from the URL and parse it
import urllib.request  # import the submodule explicitly; a bare `import urllib` may not load it
from bs4 import BeautifulSoup

link = "https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm"
page = urllib.request.urlopen(link)

# parse the HTML using BeautifulSoup and store it in the variable `soup`
soup = BeautifulSoup(page, 'html.parser')

# soup.prettify() returns the HTML as an indented string
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- INCLUDE HEADER Version 1.05 7/26/2007 PAGE LAST MODIFIED Monday, 23-Feb-2015 12:15:40 EST -->
<html lang="en-us" xml:lang="en-us" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   NYC Neighborhood ZIP Code Definitions
  </title>
  <meta content="Definitions of New York City Neighborhoods" name="description"/>
  <meta content="neighborhood, Neighborhood, New York City, new york city,new york state, New York State" name="keywords"/>
  <!-- THE FOLLOWING STYLE TAG IS FOR IMPORTING STYLE ONLY -->
  <style type="text/css">
   <!--
-->
  </style>
  <!-- -->
  <!-- -->
  <!-- -->
  <!-- -->
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="IE=edge" http-equiv="x-ua-compatible"/>
  <link href="/style/twenty16/main.css" media="screen" rel="stylesheet"/>
  <link hre
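
The fetch step above can also be written with a timeout and explicit decoding. This is a minimal sketch, not the notebook's own method: the helper name `fetch_html` and its 10-second default are assumptions made for illustration.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Download `url` and return its body as text.

    `fetch_html` and the `timeout` default are hypothetical names/values,
    not part of the original notebook.
    """
    # The context manager closes the connection once the body has been read.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if there is one.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can be handed to the parser in the same way as the page object, for example `BeautifulSoup(fetch_html(link), 'html.parser')`.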

In [7]:
# extract the table of ZIP codes from the parsed page
# (the table is identified by its summary=" " attribute)
data = soup.find('table', attrs={'summary': " "})
print(data)

<table summary=" ">
<tr>
<th abbr="Borough" id="header1">Borough</th>
<th abbr="Neighborhood" id="header2">Neighborhood</th>
<th abbr="ZIP Codes" id="header3">ZIP Codes</th>
</tr><tr>
<td headers="header1" rowspan="7">Bronx</td>
<td headers="header2"> Central Bronx</td>
<td headers="header3"> 10453, 10457, 10460</td>
</tr><tr>
<td headers="header2"> Bronx Park and Fordham</td>
<td headers="header3"> 10458, 10467, 10468</td>
</tr><tr>
<td headers="header2"> High Bridge and Morrisania</td>
<td headers="header3"> 10451, 10452, 10456</td>
</tr><tr>
<td headers="header2"> Hunts Point and Mott Haven</td>
<td headers="header3"> 10454, 10455, 10459, 10474</td>
</tr><tr>
<td headers="header2"> Kingsbridge and Riverdale</td>
<td headers="header3"> 10463, 10471</td>
</tr><tr>
<td headers="header2"> Northeast Bronx</td>
<td headers="header3"> 10466, 10469, 10470, 10475</td>
</tr><tr>
<td headers="header2"> Southeast Bronx</td>
<td headers="header3"> 10461, 10462,10464, 10465, 10472, 10473</td>
</t

In [98]:
# define dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood']

# convert the HTML table into a pandas dataframe
table_rows = data.find_all('tr')

borough = None  # last borough seen; its cell spans several rows (rowspan)

result = []
# iterate through the HTML table and build one row per postal code,
# each with its Borough and Neighborhood names
for tr in table_rows:
    cells = [td.text.strip() for td in tr.find_all('td') if td.text.strip()]
    if not cells:
        continue  # the header row has only <th> cells
    if len(cells) == 3:
        # first row of a borough: Borough, Neighborhood, ZIP Codes
        borough, neighborhood, zip_codes = cells
    else:
        # later rows omit the Borough cell, so reuse the last one seen
        neighborhood, zip_codes = cells

    for postcode in zip_codes.split(','):
        # strip the space that follows each comma in the source table
        result.append([postcode.strip(), borough, neighborhood])

# write the data to a dataframe
Newyorkdata = pd.DataFrame(result, columns=column_names)

# print the dataframe
print(Newyorkdata)

    PostalCode        Borough                Neighborhood
0        10453          Bronx               Central Bronx
1        10457          Bronx               Central Bronx
2        10460          Bronx               Central Bronx
3        10458          Bronx      Bronx Park and Fordham
4        10467          Bronx      Bronx Park and Fordham
5        10468          Bronx      Bronx Park and Fordham
6        10451          Bronx  High Bridge and Morrisania
7        10452          Bronx  High Bridge and Morrisania
8        10456          Bronx  High Bridge and Morrisania
9        10454          Bronx  Hunts Point and Mott Haven
10       10455          Bronx  Hunts Point and Mott Haven
11       10459          Bronx  Hunts Point and Mott Haven
12       10474          Bronx  Hunts Point and Mott Haven
13       10463          Bronx   Kingsbridge and Riverdale
14       10471          Bronx   Kingsbridge and Riverdale
15       10466          Bronx             Northeast Bronx
16       10469
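
To make the row-expansion logic easier to follow, here is a small sketch that applies the same idea to plain lists instead of HTML. The function name `expand_rows` and the `sample` list are hypothetical; the sample values are copied from the table printed earlier. It assumes that a row with three cells starts a new borough and a row with two cells belongs to the previous one.

```python
def expand_rows(rows):
    """Return one (postal_code, borough, neighborhood) tuple per ZIP code.

    `rows` is a list of lists of cell texts. A three-cell row carries the
    borough name; a two-cell row inherits the borough from the row before it.
    """
    records = []
    borough = None  # carried forward across rows covered by the rowspan
    for cells in rows:
        if len(cells) == 3:
            borough, neighborhood, zip_codes = cells
        else:
            neighborhood, zip_codes = cells
        for code in zip_codes.split(","):
            # strip the space left after each comma
            records.append((code.strip(), borough, neighborhood))
    return records


sample = [
    ["Bronx", "Central Bronx", "10453, 10457, 10460"],
    ["Kingsbridge and Riverdale", "10463, 10471"],
]
records = expand_rows(sample)
```

Stripping each code matters because splitting on a comma alone leaves a leading space on every code after the first.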

<b>Write the dataframe to a CSV file for future use</b>

In [99]:
Newyorkdata.to_csv('Newyork_data.csv', index=False)
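
When the file is read back later, pandas will infer the postal codes as integers unless told otherwise. The sketch below illustrates this with an in-memory stand-in for the CSV contents (the `csv_text` string is an assumption for illustration, built from rows shown earlier); with the real file, the same `dtype` argument would be passed to `pd.read_csv('Newyork_data.csv', ...)`.

```python
import io

import pandas as pd

# Stand-in for the contents of the CSV file written above (illustrative only).
csv_text = (
    "PostalCode,Borough,Neighborhood\n"
    "10453,Bronx,Central Bronx\n"
    "10457,Bronx,Central Bronx\n"
)

# With no dtype given, PostalCode is inferred as an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Requesting str keeps the codes as text, as they were in the dataframe.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"PostalCode": str})
```

Keeping the codes as text is a design choice: they are identifiers rather than quantities, and text avoids accidental arithmetic or loss of leading zeros in other datasets.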