## Data Collection

Data from three main sources will be used for this project:
1. A dataset listing the boroughs and neighborhoods of New York City with their geographical coordinates. The dataset is available at: https://geo.nyu.edu/catalog/nyu_2451_34572
2. A dataset listing the population of New York City by Neighborhood Tabulation Area (NTA), which can be aggregated per borough. The file New_York_City_Population_By_Neighborhood_Tabulation_Areas.csv is used for this purpose and is available at: https://data.cityofnewyork.us/api/views/swpk-hqdp/rows.csv
3. The Foursquare API, to get the venues and their locations in each neighborhood.

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

print('Libraries imported.')

Libraries imported.


### 1. Downloading the dataset and creating a dataframe for boroughs and neighborhoods in New York

!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Loading the data from the JSON file

In [4]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [5]:
# assign the value of the 'features' key, which is a list of neighborhoods, to a variable
neighborhoods_data = newyork_data['features'] 

In [7]:
# transform the data into a pandas dataframe
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood in a list
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step
neighborhoods = pd.DataFrame(rows, columns=column_names)

# verify the first 5 rows
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
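
The row-by-row loop above can also be written with `json_normalize`, which the notebook imports but does not otherwise use. The following is a minimal sketch, not the notebook's own method: the helper name `features_to_frame` and the two-feature `sample_features` list are hypothetical, and the sample only mimics the shape of `newyork_data['features']` (its values are copied from the first two rows shown above).

```python
import pandas as pd

# Hypothetical two-feature sample shaped like newyork_data['features'];
# values copied from the first two rows of the output above.
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]


def features_to_frame(features):
    """Flatten GeoJSON point features into the four columns used in this project."""
    # Nested dict keys become dotted column names, e.g. 'properties.borough';
    # the coordinates list is kept as a single list-valued column.
    flat = pd.json_normalize(features)
    coords = flat["geometry.coordinates"]
    return pd.DataFrame({
        "Borough": flat["properties.borough"],
        "Neighborhood": flat["properties.name"],
        # GeoJSON orders coordinates as [longitude, latitude]
        "Latitude": coords.apply(lambda c: c[1]),
        "Longitude": coords.apply(lambda c: c[0]),
    })


sample_df = features_to_frame(sample_features)
print(sample_df)
```

Passing `neighborhoods_data` instead of the sample should give the same four columns, assuming every feature carries the `properties` and `geometry` keys read by the loop.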


In [8]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.
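
The overall count can be broken down per borough with `value_counts`. This is a small sketch on a hypothetical stand-in frame with placeholder neighborhood names; it does not reproduce the real per-borough counts.

```python
import pandas as pd

# Hypothetical stand-in for the `neighborhoods` dataframe (placeholder names)
toy_neighborhoods = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "Neighborhood": ["N1", "N2", "N3", "N4", "N5"],
})

# Number of neighborhood rows per borough, largest first
per_borough = toy_neighborhoods["Borough"].value_counts()
print(per_borough)
```

Running the same call on `neighborhoods['Borough']` would show how the 306 rows are split across the 5 boroughs.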


### 2. Downloading the population dataset and finding the three boroughs with the highest population

In [9]:
pop_df = pd.read_csv('https://data.cityofnewyork.us/api/views/swpk-hqdp/rows.csv')
pop_df.head()

| | Borough | Year | FIPS County Code | NTA Code | NTA Name | Population |
|---|---|---|---|---|---|---|
| 0 | Bronx | 2000 | 5 | BX01 | Claremont-Bathgate | 28149 |
| 1 | Bronx | 2000 | 5 | BX03 | Eastchester-Edenwald-Baychester | 35422 |
| 2 | Bronx | 2000 | 5 | BX05 | Bedford Park-Fordham North | 55329 |
| 3 | Bronx | 2000 | 5 | BX06 | Belmont | 25967 |
| 4 | Bronx | 2000 | 5 | BX07 | Bronxdale | 34309 |


In [17]:
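# sum the numeric columns per borough; only the Population total is meaningful
# (numeric_only=True keeps the text columns NTA Code and NTA Name out of the sum)
pop_df.groupby('Borough').sum(numeric_only=True)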

| Borough | Year | FIPS County Code | Population |
|---|---|---|---|
| Bronx | 152380 | 380 | 2717758 |
| Brooklyn | 204510 | 4794 | 4970026 |
| Manhattan | 116290 | 3538 | 3123068 |
| Queens | 232580 | 9396 | 4460101 |
| Staten Island | 76190 | 3230 | 912458 |


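The three boroughs with the highest population are Brooklyn, Queens and Manhattan, in that order. Note that the totals above add up every row in the file, which holds counts for both 2000 and 2010 (each Year sum is a multiple of 2000 + 2010 = 4010), so they are not single-year populations.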
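
The ranking can also be computed directly instead of being read off the table. The sketch below assumes a frame with the same `Borough`, `Year` and `Population` columns as `pop_df`. The helper `top_boroughs` and the toy values are hypothetical and made up for illustration. It restricts the data to one year before summing, so each borough total is a single-year figure.

```python
import pandas as pd

# Hypothetical miniature of pop_df: made-up values, same column names as the file
toy_pop = pd.DataFrame({
    "Borough": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Year": [2000, 2010, 2000, 2010, 2000, 2010, 2000, 2010],
    "Population": [100, 110, 300, 320, 200, 190, 50, 60],
})


def top_boroughs(df, year, n=3):
    """Return the n boroughs with the largest total population in one year."""
    one_year = df[df["Year"] == year]                       # keep a single year
    totals = one_year.groupby("Borough")["Population"].sum()
    return totals.nlargest(n)                               # sorted, largest first


top_2010 = top_boroughs(toy_pop, 2010)
print(top_2010)
```

Calling `top_boroughs(pop_df, 2010)` would apply the same steps to the real data. Whether a single-year ranking is preferable to the combined totals is a choice for the analysis.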