<h1 align="center"> Peer-graded Assignment: Capstone Project - The Battle of Neighborhoods</h1>
<h2 align="center"> Is a real estate sale price good, fair, or above market?</h2>

### Background and Proposal

When you want to buy a car, you go to [AutoTrader](https://www.autotrader.ca/) or [CarGurus](https://ca.cargurus.com/). One nice thing about these websites is that they tell you whether the car you are interested in is priced below or above market, based on their own algorithms.

However, when you are looking to rent or buy real estate on websites such as [Centris](https://www.centris.ca/en/) or [DuProprio](https://duproprio.com/en), you cannot find out whether similar properties exist elsewhere in the city, or whether the price is below market (i.e., a good deal).

Should such a thing exist? I certainly think so. One of my friends recently wanted to buy a condominium in Montreal. He told me it was exhausting to find the ideal condo. Sometimes the location is acceptable but the price is way above budget; sometimes the price is fair but the location isn't ideal. He literally has to view every available condo and compare them from memory to decide which one is best.

My friend's experience has prompted me to develop a tool that can:

1. find similar real estate for sale based on the one you are interested in, and
2. build a price regression model and predict whether the sale price is above or below market.

This tool can help you quickly locate similar properties once you have one in mind. It also provides a price reference to help you decide which one is worth buying.

How am I going to do it? The first and foremost step is to find the real estate that is for sale. I intend to use a web crawler to get data from [DuProprio](https://duproprio.com/en). Each listing shall include the following information:

- address
- sale price
- number of rooms
- area
- year built

Then I will use the address to extract the neighbourhood information from [Foursquare](https://foursquare.com/city-guide). I will group the neighbourhood venues into several main categories such as:

- parks
- grocery stores
- schools
- clinic/hospital
- public transportation
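
The grouping could be sketched as a simple keyword lookup. This is only a minimal illustration: the keyword lists, the `other` bucket, and the helper names `main_category` and `count_main_categories` are assumptions made here, and the venue category names in the example (such as `Grocery Store`) are assumed inputs rather than a confirmed part of the Foursquare taxonomy.

```python
# Hypothetical keyword lists per main category (illustrative assumptions only).
MAIN_CATEGORIES = {
    "parks": {"park", "playground", "garden"},
    "grocery stores": {"grocery", "supermarket"},
    "schools": {"school", "college", "university"},
    "clinic/hospital": {"clinic", "hospital", "medical"},
    "public transportation": {"metro", "bus", "train", "subway"},
}


def main_category(venue_category):
    """Return the first main category sharing a whole word with the name, else 'other'."""
    words = set(venue_category.lower().replace("/", " ").split())
    for main, keywords in MAIN_CATEGORIES.items():
        if words & keywords:
            return main
    return "other"


def count_main_categories(venue_categories):
    """Count how many venues of a neighbourhood fall into each main category."""
    counts = {main: 0 for main in MAIN_CATEGORIES}
    counts["other"] = 0
    for venue_category in venue_categories:
        counts[main_category(venue_category)] += 1
    return counts


# Assumed venue category names for one neighbourhood.
print(count_main_categories(["Park", "Grocery Store", "Metro Station", "Parking"]))
```

Matching on whole words rather than substrings keeps a name like `Parking` out of the parks bucket. The resulting counts per neighbourhood could then serve as features for clustering.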

With that, all the data is ready. The tool takes one specific listing as input and finds all similar ones using segmentation and clustering techniques. A price regression model is then built within that cluster. Using the regression model to predict the price of every similar listing gives a reference price. Finally, the tool displays all the candidates in ascending order of (actual price - predicted price), so that the listings priced furthest below the prediction come first.
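
The final ranking step could look roughly like the sketch below. It assumes the listings already belong to one cluster, and it swaps in a one-feature least-squares line (price against area) for the fuller regression model described above. The field names `id`, `area_m2`, and `price`, as well as the listing values, are made up for illustration and are not taken from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rank_by_price_gap(listings):
    """Sort the listings of one cluster by (actual price - predicted price), ascending."""
    intercept, slope = fit_line(
        [item["area_m2"] for item in listings],
        [item["price"] for item in listings],
    )
    ranked = []
    for item in listings:
        predicted = intercept + slope * item["area_m2"]
        ranked.append({**item, "predicted": predicted, "gap": item["price"] - predicted})
    return sorted(ranked, key=lambda row: row["gap"])


# Made-up listings, for illustration only.
cluster = [
    {"id": "A", "area_m2": 60.0, "price": 250000},
    {"id": "B", "area_m2": 80.0, "price": 300000},
    {"id": "C", "area_m2": 100.0, "price": 420000},
    {"id": "D", "area_m2": 120.0, "price": 450000},
]
ranked = rank_by_price_gap(cluster)
for row in ranked:
    print(row["id"], round(row["gap"]))
```

A negative gap means the asking price is below what the model predicts for a listing of that size, so the first rows are the candidates worth a closer look.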

### Data

I used a [web crawler](https://github.com/Alcander-Z/MTL_house) to collect data about real estate for sale from [DuProprio](https://duproprio.com/en/search/list?search=true&regions%5B0%5D=6&is_for_sale=1&with_builders=1&parent=1&pageNumber=1&sort=-published_at). The raw data is saved as a JSON file. It is available [here]().

Let's load it and convert it to a pandas DataFrame.

In [1]:
import json
import pandas as pd

In [3]:
with open(r'duproprio-20190517.json', 'r') as f:
    jsf = json.load(f)
df = pd.DataFrame.from_dict(jsf)

Let's see the number of rows and columns of the DataFrame:

In [4]:
df.shape

(1099, 13)

Let's look at the column names:

In [5]:
df.columns.values

array(['address', 'areas', 'backyard', 'bathrooms', 'bedrooms',
       'category', 'floor_if_condo', 'levels', 'municipal', 'ownership',
       'postalcode', 'price', 'year'], dtype=object)

There are 13 attributes (columns) for each listing. Among them:

- 'address' and 'postalcode' shall be used to define the neighbourhood;
- 'areas', 'bathrooms', 'bedrooms', and 'year' shall be used for the price regression model.
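
One possible way to turn these two fields into neighbourhood keys is sketched below. It assumes that addresses follow a "street, borough, QC" pattern and that postal codes are six-character Canadian codes such as `H2S 2H8`, whose first three characters form the forward sortation area. The helper names are hypothetical, and the analysis may end up defining neighbourhoods differently.

```python
def borough_from_address(address):
    """Return the second-to-last comma-separated part of the address, or None."""
    parts = [part.strip() for part in address.split(",")]
    return parts[-2] if len(parts) >= 3 else None


def fsa_from_postalcode(postalcode):
    """Return the forward sortation area (first three characters), or None."""
    compact = postalcode.replace(" ", "").upper()
    return compact[:3] if len(compact) == 6 else None


print(borough_from_address("5985 Boyer, Rosemont / La Petite Patrie, QC"))
print(fsa_from_postalcode("H2S 2H8"))
```

Either key could be used to group listings before venue information is attached to each group.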

Let's further look into the 'category' column:

In [6]:
df.category.value_counts()

Condominium      517
New              110
2                 80
Duplex            72
Bungalow          72
Townhouse         63
Triplex           54
Semi-detached     39
Split             22
Quadruplex        19
Quintuplex        14
Commercial         8
6                  6
Loft               6
Penthouse          4
Residential        4
Storey             4
Raised             2
3                  2
Bi-generation      1
Name: category, dtype: int64

Category defines the type of property. 'Condominium' clearly dominates, with 517 of the 1099 listings. I will use only __condominiums__ for the rest of the analysis.

In [7]:
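# Drop listings that are missing any of the fields listed in `subsets`.
subsets = ['price', 'address', 'category', 'year', 'bedrooms', 'bathrooms', 'areas', 'postalcode']
df.dropna(subset=subsets, axis=0, inplace=True)
# Remove the 'ownership' and 'levels' columns.
df.drop(columns=['ownership', 'levels'], inplace=True)
# Keep condominiums only and renumber the rows from 0.
condo = df[df.category == 'Condominium'].reset_index(drop=True)
condo.head()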

| | address | areas | backyard | bathrooms | bedrooms | category | floor_if_condo | municipal | postalcode | price | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5985 Boyer, Rosemont / La Petite Patrie, QC | 2,000 ft² (185.81 m²) | East | 2.5 | 3.0 | Condominium | 1.0 | 394000.0 | H2S 2H8 | 599000 | 1928 |
| 1 | 1008-3581 boulevard Gouin Est, Montréal-Nord, QC | 1,100 ft² (102.19 m²) | | 1.5 | 2.0 | Condominium | 10.0 | 316600.0 | H1H 0A1 | 349000 | 2006 |
| 2 | 16107 rue Forsyth, Pointe-Aux-Trembles / Montr... | 911 ft² (84.63 m²) | North-West | 1.0 | 2.0 | Condominium | 3.0 | 207200.0 | H1A 5R8 | 200000 | 1999 |
| 3 | 711-680 rue de Courcelles, Le Sud-Ouest, QC | 950 ft² (88.26 m²) | South | 1.0 | 2.0 | Condominium | 7.0 | 339500.0 | H4C 0B8 | 415000 | 2011 |
| 4 | 1-5230 rue Resther, Le Plateau-Mont-Royal, QC | 106.4 m² (1 145.28 ft²) | | 2.5 | 2.0 | Condominium | 1.0 | 350700.0 | H2J 2W3 | 415000 | 2005 |


In [8]:
print("invalid backyard numbers: ", condo.backyard.isna().sum())
print("invalid municipal numbers: ", condo.municipal.isna().sum())

invalid backyard numbers:  231
invalid municipal numbers:  241


A few things need to be done before city segmentation and price regression modeling:

1. extract the numeric area from the 'areas' string (e.g., using $m^2$ as the unit);
2. add neighbourhood information for each condo;
3. use Foursquare to get venue information for each neighbourhood;
4. drop 'category';
5. drop 'backyard' and 'municipal', since nearly half of their values are missing.
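
The first item in the list could be handled with a small regular-expression helper, sketched here under the assumption that every 'areas' string looks like the samples in the preview: one value in ft² and one in m², in either order, with commas or spaces as thousands separators. The function name `area_in_m2` and the fallback conversion from ft² are illustrative choices and not part of the original notebook.

```python
import re

SQFT_TO_M2 = 0.09290304  # exact, since 1 ft = 0.3048 m

# A number that may contain comma or space thousands separators.
_NUMBER = r"(\d[\d,\s]*(?:\.\d+)?)"
_M2 = re.compile(_NUMBER + r"\s*m²")
_FT2 = re.compile(_NUMBER + r"\s*ft²")


def _to_float(text):
    """Strip thousands separators (commas, spaces) and convert to float."""
    return float(re.sub(r"[,\s]", "", text))


def area_in_m2(areas):
    """Return the area in square metres, or None if no area can be parsed."""
    match = _M2.search(areas)
    if match:
        return _to_float(match.group(1))
    # Fall back to the ft² value when no m² value is present.
    match = _FT2.search(areas)
    if match:
        return round(_to_float(match.group(1)) * SQFT_TO_M2, 2)
    return None


print(area_in_m2("2,000 ft² (185.81 m²)"))
print(area_in_m2("106.4 m² (1 145.28 ft²)"))
```

Applied to the DataFrame, this might look like `condo['areas'].apply(area_in_m2)`, with the result stored in a new numeric column.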

In [9]:
condo.to_csv('condo_montreal.csv')