
<h1 style="text-align: center">Battle: Neighborhoods in New York - data preparation</h1>

<h2>Introduction</h2>
<p>I will use this notebook to prepare data about New York neighborhoods and save it as a CSV file. The prepared data will be used as the starting point in another notebook.</p>

Before getting the data, let's import all the dependencies that we will need.

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

<h3>1.1 Download Dataset</h3>

New York has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.

This dataset exists for free on the web. Here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572 (https://cocl.us/new_york_dataset).

Run a `wget` command to download the data.

In [None]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!
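
Where `wget` is not available (for example, on some Windows setups), the same download could be sketched in plain Python. This is only a minimal sketch using the standard library; `download_file` is a hypothetical helper name, not part of the original notebook.

```python
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Fetch `url` and write the response body to `destination` (hypothetical helper)."""
    destination = Path(destination)
    # urlopen follows HTTP redirects, which matters for short links
    with urllib.request.urlopen(url) as response:
        destination.write_bytes(response.read())
    return destination


# Usage (needs network access), mirroring the wget call above:
# download_file('https://cocl.us/new_york_dataset', 'newyork_data.json')
```

The call is left commented out so that the cell does not touch the network unless you want it to.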


Let's load the data using `json.load`.

In [None]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Let's take a quick look at the data for one neighborhood. The *features* key contains a list of neighborhoods, so let's slice out only the first one.

In [None]:
newyork_data['features'][0:1]

[{'geometry': {'coordinates': [-73.84720052054902, 40.89470517661],
   'type': 'Point'},
  'geometry_name': 'geom',
  'id': 'nyu_2451_34572.1',
  'properties': {'annoangle': 0.0,
   'annoline1': 'Wakefield',
   'annoline2': None,
   'annoline3': None,
   'bbox': [-73.84720052054902,
    40.89470517661,
    -73.84720052054902,
    40.89470517661],
   'borough': 'Bronx',
   'name': 'Wakefield',
   'stacked': 1},
  'type': 'Feature'}]

In [None]:
neighborhoods_data = newyork_data['features']

<h3>1.2 Transform the data into a pandas dataframe</h3>

The next task is to transform this list of nested Python dictionaries into a *pandas* dataframe. Let's start by defining the column names and creating an empty dataframe.

In [None]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Loop through the data, collect one row per neighborhood, and then build the dataframe from the collected rows.

In [None]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so fill the dataframe in one step instead
neighborhoods = pd.DataFrame(rows, columns=column_names)
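
The same table could also be built without an explicit loop, using `json_normalize` (imported at the top of this notebook but not otherwise used). The sketch below is an illustration, not the method used in this notebook. It runs on two sample features trimmed to the fields needed here; the values come from this notebook's outputs, and the second feature uses the rounded coordinates shown in the dataframe preview. The names `sample_features`, `flat`, and `neighborhoods_alt` are hypothetical.

```python
import pandas as pd

# Two sample features, trimmed to the fields used here (assumption: the
# full dataset has the same nested layout for every feature)
sample_features = [
    {'geometry': {'coordinates': [-73.84720052054902, 40.89470517661], 'type': 'Point'},
     'properties': {'borough': 'Bronx', 'name': 'Wakefield'}},
    {'geometry': {'coordinates': [-73.829939, 40.874294], 'type': 'Point'},
     'properties': {'borough': 'Bronx', 'name': 'Co-op City'}},
]

# Nested dictionaries become dotted column names such as 'properties.name';
# the coordinate lists stay as single list-valued cells
flat = pd.json_normalize(sample_features)

neighborhoods_alt = pd.DataFrame({
    'Borough': flat['properties.borough'],
    'Neighborhood': flat['properties.name'],
    # GeoJSON points are ordered [longitude, latitude]
    'Latitude': flat['geometry.coordinates'].map(lambda c: c[1]),
    'Longitude': flat['geometry.coordinates'].map(lambda c: c[0]),
})
print(neighborhoods_alt)
```

On the real data, `sample_features` would be replaced by `neighborhoods_data`.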

Examine the resulting dataframe.

In [None]:
neighborhoods.head()

  Borough Neighborhood   Latitude  Longitude
0   Bronx    Wakefield  40.894705 -73.847201
1   Bronx   Co-op City  40.874294 -73.829939
2   Bronx  Eastchester  40.887556 -73.827806
3   Bronx    Fieldston  40.895437 -73.905643
4   Bronx    Riverdale  40.890834 -73.912585


Make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.
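
Beyond printing the counts, the same check could be turned into a small reusable function that also flags missing coordinates. This is a hedged sketch: `check_neighborhoods` is a hypothetical helper, and the sample below only contains the five Bronx rows shown in the preview, so it is checked against 1 borough and 5 rows rather than the full 5 and 306.

```python
import pandas as pd


def check_neighborhoods(df, expected_boroughs, expected_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    n_boroughs = df['Borough'].nunique()
    if n_boroughs != expected_boroughs:
        problems.append(f'expected {expected_boroughs} boroughs, found {n_boroughs}')
    if len(df) != expected_rows:
        problems.append(f'expected {expected_rows} neighborhoods, found {len(df)}')
    if df[['Latitude', 'Longitude']].isna().any().any():
        problems.append('some neighborhoods have missing coordinates')
    return problems


# Sample rows copied from the dataframe preview in this notebook
sample = pd.DataFrame({
    'Borough': ['Bronx'] * 5,
    'Neighborhood': ['Wakefield', 'Co-op City', 'Eastchester', 'Fieldston', 'Riverdale'],
    'Latitude': [40.894705, 40.874294, 40.887556, 40.895437, 40.890834],
    'Longitude': [-73.847201, -73.829939, -73.827806, -73.905643, -73.912585],
})

print(check_neighborhoods(sample, expected_boroughs=1, expected_rows=5))  # prints []
```

For the full dataset the call would be `check_neighborhoods(neighborhoods, expected_boroughs=5, expected_rows=306)`.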


<h3>1.3 Save New York neighborhoods as CSV</h3>

In [None]:
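# save to project storage (`project` is assumed to be a Watson Studio project object defined in a cell not shown here)
project.save_data(data=neighborhoods.to_csv(index=False),
                  file_name='neighborhood_new_york_geospatial_data.csv',
                  overwrite=True)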

{'asset_id': '60bc572d-d440-496a-b620-d17ac8b92e8f',
 'bucket_name': 'courseracapstone-donotdelete-pr-2afewjmpcmomni',
 'file_name': 'neighborhood_new_york_geospatial_data.csv',
 'message': 'File saved to project storage.'}
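
`project.save_data` appears to rely on a project object supplied by the hosting environment (presumably IBM Watson Studio), so it will not work elsewhere as written. As a hedged alternative, the CSV could be written to the local file system with `DataFrame.to_csv` and read back to confirm the round trip. The sketch below uses two sample rows and a temporary directory; both are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two sample rows copied from the dataframe preview in this notebook
sample = pd.DataFrame({
    'Borough': ['Bronx', 'Bronx'],
    'Neighborhood': ['Wakefield', 'Co-op City'],
    'Latitude': [40.894705, 40.874294],
    'Longitude': [-73.847201, -73.829939],
})

# Assumption: a temporary directory stands in for wherever you keep project data
output_path = Path(tempfile.gettempdir()) / 'neighborhood_new_york_geospatial_data.csv'

# index=False keeps the row index out of the file, matching the call above
sample.to_csv(output_path, index=False)

# Read the file back to confirm that the columns and rows survived
restored = pd.read_csv(output_path, float_precision='round_trip')
print(restored.head())
```

On the real data, `sample` would be replaced by `neighborhoods`.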

<h2>Summary</h2>
<p>I've prepared a CSV file with New York neighborhood data for later use in the Coursera Capstone assignment.</p>
<ul>
    <li>neighborhood_new_york_geospatial_data.csv - <b>Borough, Neighborhood, Latitude, Longitude</b> for each New York neighborhood</li>
</ul>
<p><b>Note:</b> The <b>neighborhood_new_york_geospatial_data.csv</b> will be used as initial data for New York neighborhoods.</p>