 <img src="https://yardnyc.com/wp-content/uploads/2018/09/YARD_SITE_2018_NEWS_HERO_FORBES_092318.jpg" alt="New York Landscape" style="width:600px;height:300px;">

# Purchase recommendation for real estate in New York
### Where should you purchase your next house in New York? Unsupervised clustering and statistical regression analysis.

### Table of contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#Introduction">1.1 Introduction</a>

2.  <a href="#ETL">2.1 Data ETL Process</a>

3.  <a href="#item3"></a>

4.  <a href="#item4"></a>

5.  <a href="#item5"></a>  
    </font>
    </div>

### <a id="Introduction"> </a> 1.1 Introduction
House prices vary widely across the neighborhoods and boroughs of New York, so both current prices and predicted future prices should inform the decision of which house to buy. For this scenario, we will assume the investor is interested in buying residential homes in New York. Specifically, I am interested in analyzing house prices in each neighborhood, the number of houses bought and sold, location, an unsupervised clustering of neighborhoods, and a regression that predicts future house values for the prospective neighborhoods.
Machine learning tools such as unsupervised clustering and statistical regression are used to identify similar neighborhoods and to analyze current and future property prices.

### 1.2 Selection of datasets
I will utilize the following tools and datasets to perform my analysis:
- NYU "2016 New York City Neighborhood Tabulation Areas" GeoJson
- NYC "DOF: Summary of Neighborhood Sales by Neighborhood Citywide by Borough"
- CognitiveClass.ai "newyork_data"

The following libraries and APIs will be utilized to analyze and visualize the data:

Data analysis:
- Pandas
- scikit-learn
    - k-means clustering
    - Statistical regression
- NumPy

Visualization:
- Folium
- Matplotlib
- Seaborn

Data:
- Foursquare API
- NYC OpenData API

Data analysis tools are chosen to read and manipulate DataFrames, perform mathematical analysis, and build machine learning models. K-means clustering groups neighborhoods so that comparable investment opportunities in different areas can be identified; once a viable investment is found in one area, its cluster suggests alternatives elsewhere. Statistical regression predicts future neighborhood house prices, which reveals trends and indicates the future value of the investment.
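
To make the two techniques concrete, here is a minimal sketch on made-up numbers. The feature values, the choice of two clusters, and the use of ordinary linear regression as the "statistical regression" are all assumptions for illustration, not the configuration used later in this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical per-neighborhood features: [median sale price (USD), number of sales]
features = np.array([
    [450_000, 120],
    [470_000, 110],
    [1_900_000, 15],
    [2_100_000, 12],
])

# Standardize so that the price column does not dominate the distance metric
scaled = StandardScaler().fit_transform(features)

# Group the neighborhoods into two clusters (k=2 is an arbitrary choice here)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Hypothetical yearly median prices for a single neighborhood
years = np.array([[2015], [2016], [2017], [2018], [2019]])
prices = np.array([500_000, 520_000, 540_000, 560_000, 580_000])

# Fit a straight line through the yearly prices and extrapolate one year ahead
model = LinearRegression().fit(years, prices)
predicted_2020 = model.predict(np.array([[2020]]))[0]

print(labels)
print(round(predicted_2020))
```

Neighborhoods that share a cluster label are candidates for "similar" investments, and the fitted line gives a rough first guess at next year's price. A straight-line fit is the simplest possible model and ignores market shocks.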

For visualization, Folium is used for geospatial data analysis and the matplotlib and seaborn libraries are used for graphical analysis of numerical and categorical data.

Lastly, the Foursquare and NYC OpenData APIs are queried to obtain surrounding venue data and geospatial data, respectively.

### <a id="ETL"> </a>  2.1 Data ETL process

First, let's analyze the House Sales by Neighborhood for different boroughs in New York. This information comes from the NYC OpenData API, containing information for house sales between 2010 and 2019. 

In [12]:
# Import the required libraries for data gathering, management, visualization and analysis
import requests # Handle API requests
import pandas as pd # Analyze dataframe
# json_normalize is called below as pd.json_normalize (the pandas.io.json import path is deprecated)
import numpy as np # Numerical operations
import matplotlib.pyplot as plt # Visualization
import seaborn as sns # Visualization

import folium # Geographical data visualization
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

### 2.1.1 NYC House sales data

In [27]:
# Get home sales data from NYC OpenData API
urlNYC = 'https://data.cityofnewyork.us/resource/5ebm-myj7.json?$limit=50000'
NYC = requests.get(urlNYC).json()
NYC

[{'borough': 'MANHATTAN',
  'neighborhood': 'ALPHABET CITY',
  'type_of_home': '01 ONE FAMILY HOMES',
  'number_of_sales': '1',
  'lowest_sale_price': '593362',
  'average_sale_price': '593362',
  'median_sale_price': '593362',
  'highest_sale_price': '593362.00',
  'year': '2010'},
 {'borough': 'MANHATTAN',
  'neighborhood': 'ALPHABET CITY',
  'type_of_home': '02 TWO FAMILY HOMES',
  'number_of_sales': '1',
  'lowest_sale_price': '1320000',
  'average_sale_price': '1320000',
  'median_sale_price': '1320000',
  'highest_sale_price': '1320000.00',
  'year': '2010'},
 {'borough': 'MANHATTAN',
  'neighborhood': 'ALPHABET CITY',
  'type_of_home': '03 THREE FAMILY HOMES',
  'number_of_sales': '1',
  'lowest_sale_price': '900000',
  'average_sale_price': '900000',
  'median_sale_price': '900000',
  'highest_sale_price': '900000.00',
  'year': '2010'},
 {'borough': 'MANHATTAN',
  'neighborhood': 'CHELSEA',
  'type_of_home': '01 ONE FAMILY HOMES',
  'number_of_sales': '2',
  'lowest_sale_price

### NYC "DOF: Summary of Neighborhood Sales by Neighborhood Citywide by Borough"
The following table describes the 9 columns of the data set:

| Column | Description | Type |
|---|---|---|
| ***BOROUGH*** | Borough in which the neighborhood is located | Plain Text |
| ***NEIGHBORHOOD*** | Department of Finance determines the neighborhood name in the course of valuing properties. The common name of the neighborhood is generally the same as the name Finance designates. However, there may be slight differences in neighborhood boundary lines. | Plain Text |
| ***TYPE OF HOME*** | Type of home sold (one-, two-, or three-family home) | Plain Text |
| ***NUMBER OF SALES*** | Total number of sales for that particular neighborhood | Number |
| ***LOWEST SALE PRICE*** | Lowest sale price for that particular neighborhood | Number |
| ***AVERAGE SALE PRICE*** | Average sale price for that particular neighborhood | Number |
| ***MEDIAN SALE PRICE*** | Median sale price for that particular neighborhood | Number |
| ***HIGHEST SALE PRICE*** | Highest sale price for that particular neighborhood | Number |
| ***YEAR*** | Year of Summary Report | Plain Text |

In [32]:
# Flatten json into DataFrame
dfNYC = pd.json_normalize(NYC)
dfNYC.head(10)

|  | borough | neighborhood | type_of_home | number_of_sales | lowest_sale_price | average_sale_price | median_sale_price | highest_sale_price | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MANHATTAN | ALPHABET CITY | 01 ONE FAMILY HOMES | 1 | 593362 | 593362 | 593362 | 593362.0 | 2010 |
| 1 | MANHATTAN | ALPHABET CITY | 02 TWO FAMILY HOMES | 1 | 1320000 | 1320000 | 1320000 | 1320000.0 | 2010 |
| 2 | MANHATTAN | ALPHABET CITY | 03 THREE FAMILY HOMES | 1 | 900000 | 900000 | 900000 | 900000.0 | 2010 |
| 3 | MANHATTAN | CHELSEA | 01 ONE FAMILY HOMES | 2 | 500000 | 2875000 | 2875000 | 5250000.0 | 2010 |
| 4 | MANHATTAN | CHELSEA | 02 TWO FAMILY HOMES | 2 | 1306213 | 2603107 | 2603107 | 3900000.0 | 2010 |
| 5 | MANHATTAN | CHELSEA | 03 THREE FAMILY HOMES | 1 | 6400000 | 6400000 | 6400000 | 6400000.0 | 2010 |
| 6 | MANHATTAN | CLINTON | 01 ONE FAMILY HOMES | 1 | 3850000 | 3850000 | 3850000 | 3850000.0 | 2010 |
| 7 | MANHATTAN | EAST VILLAGE | 01 ONE FAMILY HOMES | 2 | 3100000 | 5800000 | 5800000 | 8500000.0 | 2010 |
| 8 | MANHATTAN | EAST VILLAGE | 02 TWO FAMILY HOMES | 2 | 477500 | 2738750 | 2738750 | 5000000.0 | 2010 |
| 9 | MANHATTAN | EAST VILLAGE | 03 THREE FAMILY HOMES | 1 | 3290000 | 3290000 | 3290000 | 3290000.0 | 2010 |


In [35]:
'The NYC House sales data set contains {} rows and {} columns'.format(dfNYC.shape[0],dfNYC.shape[1])

'The NYC House sales data set contains 5979 rows and 9 columns'

The data has 5979 rows, with the lowest, average, median and highest sale price per year in each neighborhood. It also distinguishes between types of home: one-family, two-family and three-family homes.

Let's confirm the data types for each column.

In [39]:
dfNYC.dtypes

borough               object
neighborhood          object
type_of_home          object
number_of_sales       object
lowest_sale_price     object
average_sale_price    object
median_sale_price     object
highest_sale_price    object
year                  object
dtype: object

Since all the columns contain object variables, let's transform them to more appropriate data types (as described in the data set category introduction).

In [49]:
numCols = ["number_of_sales","lowest_sale_price","average_sale_price","median_sale_price","highest_sale_price","year"]
dfNYC[numCols]=dfNYC[numCols].apply(pd.to_numeric)
dfNYC.dtypes

borough                object
neighborhood           object
type_of_home           object
number_of_sales         int64
lowest_sale_price       int64
average_sale_price      int64
median_sale_price       int64
highest_sale_price    float64
year                    int64
dtype: object

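Now all the numeric columns are correctly converted. `highest_sale_price` becomes `float64` rather than `int64` because its source strings carry decimals (for example `'593362.00'`).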
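
The conversion above succeeds only if every string parses as a number. As a hedged sketch, the snippet below shows how `errors="coerce"` could be used to surface malformed values instead of raising. The three sample records are shaped like the API response, and the `"n/a"` price is deliberately invented for illustration; it does not come from the real data set.

```python
import pandas as pd

# Hypothetical sample shaped like the API response; the last price is deliberately malformed
records = [
    {"neighborhood": "ALPHABET CITY", "number_of_sales": "1",
     "highest_sale_price": "593362.00", "year": "2010"},
    {"neighborhood": "CHELSEA", "number_of_sales": "2",
     "highest_sale_price": "5250000.00", "year": "2010"},
    {"neighborhood": "CLINTON", "number_of_sales": "1",
     "highest_sale_price": "n/a", "year": "2010"},
]
df = pd.json_normalize(records)

num_cols = ["number_of_sales", "highest_sale_price", "year"]

# errors="coerce" turns unparseable strings into NaN instead of raising an exception
converted = df[num_cols].apply(pd.to_numeric, errors="coerce")

# Count values that became NaN during conversion (they were not missing before)
bad_counts = converted.isna().sum() - df[num_cols].isna().sum()

df[num_cols] = converted
print(bad_counts)
print(df.dtypes)
```

A non-zero entry in `bad_counts` points to a column that needs cleaning before any statistics are computed on it.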

### 2.1.2 Descriptive analysis for NYC House sales data
Now that we have all the data correctly transformed, let's perform some exploratory descriptive analysis of the data set. Our objective is to find any initial patterns and gain insights on the data distribution, characteristics or trends for each neighborhood, borough or type of home as well as yearly trends.
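
As a first, minimal sketch of this kind of summary, the snippet below aggregates the ten rows previewed earlier by type of home. The aggregation choices and the names `sample`, `total_sales`, and `typical_median_price` are assumptions for illustration. The same pattern would apply to the full `dfNYC` with `borough` or `year` as the grouping key.

```python
import pandas as pd

# The ten preview rows shown earlier: (neighborhood, type_of_home, number_of_sales, median_sale_price)
rows = [
    ("ALPHABET CITY", "01 ONE FAMILY HOMES", 1, 593362),
    ("ALPHABET CITY", "02 TWO FAMILY HOMES", 1, 1320000),
    ("ALPHABET CITY", "03 THREE FAMILY HOMES", 1, 900000),
    ("CHELSEA", "01 ONE FAMILY HOMES", 2, 2875000),
    ("CHELSEA", "02 TWO FAMILY HOMES", 2, 2603107),
    ("CHELSEA", "03 THREE FAMILY HOMES", 1, 6400000),
    ("CLINTON", "01 ONE FAMILY HOMES", 1, 3850000),
    ("EAST VILLAGE", "01 ONE FAMILY HOMES", 2, 5800000),
    ("EAST VILLAGE", "02 TWO FAMILY HOMES", 2, 2738750),
    ("EAST VILLAGE", "03 THREE FAMILY HOMES", 1, 3290000),
]
sample = pd.DataFrame(
    rows,
    columns=["neighborhood", "type_of_home", "number_of_sales", "median_sale_price"],
)

# Total sales and the median of the neighborhood-level median prices, per type of home
summary = (
    sample.groupby("type_of_home")
    .agg(
        total_sales=("number_of_sales", "sum"),
        typical_median_price=("median_sale_price", "median"),
    )
    .sort_values("typical_median_price", ascending=False)
)
print(summary)
```

The median is used rather than the mean because a few very expensive sales can pull a neighborhood's average far from a typical price.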