# Business Problem - how do venues influence the price of real estate?

For this project, I decided to model the relationship between the price of a property and the types of venues in its neighborhood. 
The objective is to determine whether some venues have a direct impact on the price and, if so, how the price is affected. 

Concretely, I will have to: 
1. Geolocate a set of real-estate transactions. As I cannot use a geocoder API at such a large scale (too much data), I will run the analysis at the neighborhood level. 
2. Calculate the average price per square meter for each neighborhood in my data set. 
3. List the venues (and their types) for each neighborhood.
4. Combine my two data sources into one DataFrame.
5. Split the data into a training set and a test set. 
6. Choose and design an algorithm to predict the price of a property given the venues in its neighborhood.
7. Evaluate the accuracy of the model.
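
Steps 3 to 5 can be sketched roughly as follows. This is only an illustration on toy data: the venue rows, the `PRICE_PER_SQM` values, and the column names `VENUE_CATEGORY` and `PRICE_PER_SQM` are all hypothetical, and the split uses plain pandas sampling rather than any particular library helper.

```python
import pandas as pd

# Hypothetical toy data: one row per venue found in a neighborhood (step 3 input).
venues = pd.DataFrame({
    "NEIGHBORHOOD": ["CHELSEA", "CHELSEA", "CHELSEA", "INWOOD", "INWOOD", "CHINATOWN"],
    "VENUE_CATEGORY": ["Park", "Cafe", "Cafe", "Park", "Supermarket", "Cafe"],
})

# Hypothetical average price per square meter per neighborhood (step 2 output).
prices = pd.DataFrame({
    "NEIGHBORHOOD": ["CHELSEA", "INWOOD", "CHINATOWN"],
    "PRICE_PER_SQM": [20000.0, 8000.0, 15000.0],
})

# Step 3: count the venues of each category per neighborhood.
venue_counts = pd.crosstab(venues["NEIGHBORHOOD"], venues["VENUE_CATEGORY"]).reset_index()

# Step 4: combine prices and venue counts into one dataframe.
dataset = prices.merge(venue_counts, on="NEIGHBORHOOD", how="inner")

# Step 5: hold out part of the rows as a test set (fixed seed for repeatability).
test_set = dataset.sample(frac=0.34, random_state=0)
train_set = dataset.drop(test_set.index)
```

Each row of `dataset` then describes one neighborhood by its price and by one count column per venue category, which is the shape most regression models expect.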

Notes: 
1. To make this analysis more accurate and relevant, I would need to work at the level of each transaction (i.e. geolocate each property): not only does the immediate neighborhood have an impact, but the distance to each venue also matters, and its effect varies (e.g. it may be convenient to have a supermarket near your home, but perhaps not to see it from your living room, where you would rather have a view of a beautiful park). 
2. Some other factors are disregarded to keep this study simple, although they probably have a large impact on the price (e.g. the construction date of the building, the materials used for its construction, and economic and social indicators of the population living in the neighborhood).
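
To illustrate the first note, the distance between a property and a venue could be computed with the haversine formula once both are geolocated. The sketch below is not part of the analysis in this notebook; the function name `haversine_km` is my own, and the two points are simply two neighborhood coordinates used as stand-ins for a property and a venue.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical example: a property and a venue placed at two neighborhood coordinates
property_location = (40.744035, -74.003116)
venue_location = (40.715618, -73.994279)
distance_km = haversine_km(*property_location, *venue_location)
```

With per-transaction coordinates, such a distance could become a feature of its own instead of a simple count of venues per neighborhood.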

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans



I will first import my main data source as a CSV file. The data set is described as follows: 

Manhattan Rolling Sales File. All Sales From Oct 2018 - Sep 2019.
For sales prior to the Final, Neighborhood Name and Descriptive Data reflect the Final Roll 2019/20.
Sales after the Final Roll, Neighborhood Name and Descriptive Data reflect current data.
Building Class Category is based on Building Class at Time of Sale.
Note: Condominium and cooperative sales are on the unit level and understood to have a count of one.


If you are interested in this data set, you can download it from this URL: https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

In [2]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,NEIGHBORHOOD,ADDRESS,ZIP CODE,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE
0,ALPHABET CITY,743 EAST 6TH STREET,10009,3.68,1940,1,S1,3200000
1,ALPHABET CITY,189 EAST 7TH STREET,10009,2.183,1860,1,A4,0
2,ALPHABET CITY,526 EAST 5TH STREET,10009,5.2,1900,1,A4,6100000
3,ALPHABET CITY,166 AVENUE A,10009,4.52,1900,1,B9,0
4,ALPHABET CITY,166 AVENUE A,10009,4.52,1900,1,B9,0
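
Since the loading code is hidden, here is a minimal sketch of how such a file might be read with `pd.read_csv`. It is an assumption, not the removed cell: the in-memory text stands in for the real file, and it only contains two rows copied from the output above.

```python
import io

import pandas as pd

# Stand-in for the real file: two rows copied from the output above.
# With the downloaded file, its path would be passed to pd.read_csv instead.
csv_text = """NEIGHBORHOOD,ADDRESS,ZIP CODE,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE
ALPHABET CITY,743 EAST 6TH STREET,10009,3.68,1940,1,S1,3200000
ALPHABET CITY,189 EAST 7TH STREET,10009,2.183,1860,1,A4,0
"""

df_data_0 = pd.read_csv(io.StringIO(csv_text))

# Checking the column types right after loading helps catch numbers read as text.
print(df_data_0.dtypes)
```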


I will clean up my DataFrame a bit by removing the columns I do not need.

In [3]:
df_data_0 = df_data_0.drop(["ADDRESS","ZIP CODE","YEAR BUILT","TAX CLASS AT TIME OF SALE", "BUILDING CLASS AT TIME OF SALE"], axis=1)
df_data_0.head()

Unnamed: 0,NEIGHBORHOOD,GROSS SQUARE FEET,SALE PRICE
0,ALPHABET CITY,3.68,3200000
1,ALPHABET CITY,2.183,0
2,ALPHABET CITY,5.2,6100000
3,ALPHABET CITY,4.52,0
4,ALPHABET CITY,4.52,0


Because I ran into performance issues with the geocoder API, I will use a ready-made data set from which I can extract the geographic coordinates of each neighborhood in Manhattan.

In [4]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [5]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [6]:
neighborhoods_data = newyork_data['features']

In [7]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [8]:
# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
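
As an alternative to the loop, the same table could probably be built with `pd.json_normalize`, which flattens nested dictionaries into dotted column names. The sketch below is an assumption on my side: the two features are hand-written stand-ins that only mimic the keys read by the loop above, and `neighborhoods_alt` is a hypothetical name.

```python
import pandas as pd

# Two hand-written features mimicking the keys read by the loop above
features = [
    {"geometry": {"coordinates": [-73.91066, 40.876551]},
     "properties": {"name": "Marble Hill", "borough": "Manhattan"}},
    {"geometry": {"coordinates": [-73.994279, 40.715618]},
     "properties": {"name": "Chinatown", "borough": "Manhattan"}},
]

# Nested keys become dotted column names such as "properties.name"
flat = pd.json_normalize(features)

# Each coordinates entry is a [longitude, latitude] list; split it into two columns
coords = pd.DataFrame(flat["geometry.coordinates"].tolist(),
                      columns=["Longitude", "Latitude"])

neighborhoods_alt = pd.DataFrame({
    "Borough": flat["properties.borough"],
    "Neighborhood": flat["properties.name"],
    "Latitude": coords["Latitude"],
    "Longitude": coords["Longitude"],
})
```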

In [9]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [10]:
manhattan_data.rename(columns={'Neighborhood': 'NEIGHBORHOOD'},inplace=True)
manhattan_data = manhattan_data.drop("Borough", axis=1)
manhattan_data['NEIGHBORHOOD'] = manhattan_data['NEIGHBORHOOD'].str.upper() 
manhattan_data.head()

Unnamed: 0,NEIGHBORHOOD,Latitude,Longitude
0,MARBLE HILL,40.876551,-73.91066
1,CHINATOWN,40.715618,-73.994279
2,WASHINGTON HEIGHTS,40.851903,-73.9369
3,INWOOD,40.867684,-73.92121
4,HAMILTON HEIGHTS,40.823604,-73.949688


In [11]:
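# inner join on the neighborhood name; sales whose neighborhood has no coordinates are dropped
Lastdf = df_data_0.merge(manhattan_data, on='NEIGHBORHOOD', how='inner')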
Lastdf.head()

Unnamed: 0,NEIGHBORHOOD,GROSS SQUARE FEET,SALE PRICE,Latitude,Longitude
0,CHELSEA,0.0,3469075,40.744035,-74.003116
1,CHELSEA,0.0,3063553,40.744035,-74.003116
2,CHELSEA,0.0,3809780,40.744035,-74.003116
3,CHELSEA,5.39,0,40.744035,-74.003116
4,CHELSEA,5.39,0,40.744035,-74.003116
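
The merged table starts at CHELSEA, so sales from neighborhoods without a matching name in the coordinates table (ALPHABET CITY, for instance) no longer appear. A possible way to list such names is a left merge with `indicator=True`. The sketch below uses small hypothetical stand-ins (`sales`, `coords`) rather than the real dataframes.

```python
import pandas as pd

# Hypothetical stand-ins for the sales table and the coordinates table
sales = pd.DataFrame({
    "NEIGHBORHOOD": ["ALPHABET CITY", "CHELSEA", "CHELSEA"],
    "SALE PRICE": [3200000, 3469075, 3063553],
})
coords = pd.DataFrame({
    "NEIGHBORHOOD": ["CHELSEA", "CHINATOWN"],
    "Latitude": [40.744035, 40.715618],
    "Longitude": [-74.003116, -73.994279],
})

# A left merge keeps every sale; the _merge column tells where each row was found
checked = sales.merge(coords, on="NEIGHBORHOOD", how="left", indicator=True)

# Neighborhood names present in the sales data but absent from the coordinates
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "NEIGHBORHOOD"].unique())
print(unmatched)  # ['ALPHABET CITY']
```

Knowing which names fail to match makes it possible to decide whether to rename them or to accept the loss of those sales.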


In [12]:
g = Lastdf.groupby('NEIGHBORHOOD')

In [13]:
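# average only the numeric columns (recent pandas versions raise an error on text columns otherwise)
g.mean(numeric_only=True)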

NEIGHBORHOOD,Latitude,Longitude
CHELSEA,40.744035,-74.003116
CHINATOWN,40.715618,-73.994279
CIVIC CENTER,40.715229,-74.005415
CLINTON,40.759101,-73.996119
EAST VILLAGE,40.727847,-73.982226
FLATIRON,40.739673,-73.990947
GRAMERCY,40.73721,-73.981376
INWOOD,40.867684,-73.92121
LITTLE ITALY,40.719324,-73.997305
LOWER EAST SIDE,40.717807,-73.98089
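
The result above contains only the coordinate columns, which suggests that `GROSS SQUARE FEET` and `SALE PRICE` are not stored as numbers. The sketch below shows one possible way to convert them and to compute the average price per square meter per neighborhood. It rests on several assumptions: the sample values are invented, the text format with thousands separators is a guess, and rows with a zero price or area are treated as unusable.

```python
import pandas as pd

SQFT_TO_SQM = 0.09290304  # one square foot expressed in square meters

# Invented sample: numbers stored as text, as can happen after a CSV import
sample = pd.DataFrame({
    "NEIGHBORHOOD": ["CHELSEA", "CHELSEA", "INWOOD"],
    "GROSS SQUARE FEET": ["5,390", "4,000", "2,500"],
    "SALE PRICE": ["3,469,075", "0", "1,250,000"],
})

# Strip thousands separators and convert; unparseable values become NaN
for col in ["GROSS SQUARE FEET", "SALE PRICE"]:
    sample[col] = pd.to_numeric(sample[col].str.replace(",", "", regex=False),
                                errors="coerce")

# Keep only rows with a positive price and a positive area (assumption)
valid = sample[(sample["SALE PRICE"] > 0) & (sample["GROSS SQUARE FEET"] > 0)].copy()

# Price per square meter for each sale, then the mean per neighborhood
valid["PRICE_PER_SQM"] = valid["SALE PRICE"] / (valid["GROSS SQUARE FEET"] * SQFT_TO_SQM)
avg_price = valid.groupby("NEIGHBORHOOD")["PRICE_PER_SQM"].mean()
```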
