### IBM Data Science Professional Certificate: Capstone Project 
#### <div style="text-align: right"> Author: Tianji Cai </div>   <div style="text-align: right"> Email: tianjiC@outlook.com </div> <div style="text-align: right"> Date: Feb 24th, 2019 </div> 
# <div style="text-align: center"> *Where is your Second Home?* </div> 
## <div style="text-align: center"> A Recommendation System for Finding Home-like Neighborhoods in a New City</div> 
***

## Introduction

This project aims to build a __recommendation system__ that helps people find the most home-like neighborhoods in a new city. In this notebook, I will use myself as an example and look for the neighborhoods in __New York and San Francisco__ that are most similar to my hometown in __Shanghai__. Venues around each neighborhood will be obtained from __Foursquare__ and used to construct a feature set that characterizes that neighborhood. Afterwards, the neighborhoods in the new cities will be compared to one's hometown based on their respective feature sets, and those that share the most similarities with the hometown will be recommended to the customer and marked on an __interactive Folium map__. From the number of recommended neighborhoods in each city, one can also see which city as a whole most resembles the hometown, at least as far as living is concerned. In the case studied here, we will see whether New York or San Francisco is more similar to Shanghai.

***

## Importing Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import geocoder # import geocoder

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

#!conda install -c conda-forge beautifulsoup4 --yes
from bs4 import BeautifulSoup # library for web scraping

#!conda install -c conda-forge lxml --yes

import csv # library for working with csv file

print('Libraries imported.')

Libraries imported.


***

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Background and Business Problem</a>

2. <a href="#item2">Data Acquisition</a>

3. <a href="#item3">Methodology and Analysis</a>

4. <a href="#item4">Results</a>

5. <a href="#item5">Discussions and Conclusions</a>

</font>
</div>

***

## 1. Background and Business Problem

### 1.1 Background

As the saying goes, _EAST OR WEST, HOME IS BEST_. After spending years moving from one city to another, I'm often struck by nostalgia and always wish to find a place to live that resembles my childhood home. Now that globalization has made people more mobile than ever before, finding a second home in a foreign city where we truly feel comfortable is likely to become a concern for a growing number of young people around the world. Yet it is no easy task, especially for someone who has to sign a lease online before even visiting the city.

This is where I think data science can help. Using the vast amount of location and venue information available online, we can build a feature set for every neighborhood in the cities concerned, use these feature sets to compare each neighborhood with the customer's hometown (or any other neighborhood he/she picks), and come up with the neighborhoods that most resemble it. The result can help our customers make a well-informed choice about where to live.

As an analogy, the recommendation system built here is similar in essence to the one Netflix uses to suggest the next movie for you to watch.

### 1.2 Business Problem

As said in the _Introduction_, I will use myself as an example for this project. As a child, I lived in a district called __Hongkou__ in __Shanghai, China__, which used to be a cultural center and is located not far from the commercial heart of Shanghai. It is a nice place to live and I will use the neighborhoods in this district/borough as a benchmark in my search for a second home far far away on the other side of the globe. 

Since I'm an aspiring data scientist, the next city that I move to will probably be a hub for data science. I decided to pick one city from the east coast of the US and one from the west coast, hoping that this will enable me to make a crude comparison of the two coasts in addition to my neighborhood analysis. The two cities chosen in this study are __New York and San Francisco__. __The main objective is to recommend a total of 20 neighborhoods in these two cities__. From the recommendation, we can also learn which city has the largest number of recommended neighborhoods and is thus most similar to my hometown in Shanghai.

Choosing the right neighborhood is a daunting task. The analysis done here should therefore be of interest to anyone who plans to move to a new city yet is at a loss as to where to find a new place to call home.

***

## 2. Data Acquisition

### 2.1 Description of Data

Based on the business problem defined above, we need two sets of data:
* Neighborhoods in a given city with latitudes and longitudes
* Venues information in a given neighborhood

#### 2.1.1 Neighborhood Location Data

For the first set, i.e., the location data of the neighborhoods, we need tables of the following form:

| City | Borough | Neighborhood | Latitude | Longitude |
| --- | --- | --- | --- | --- |
| --- | --- | --- | --- | --- |

We need 3 such tables, one for each city considered in the project. In addition, we will create a master table _TwoCities_ for the two US cities combined, and a grand master dataframe _ThreeCities_ for all three cities combined.

* __Shanghai__: I'm only interested in neighborhoods in the district/borough of Hongkou, which are listed on this [Wikipedia page](https://en.wikipedia.org/wiki/Hongkou_District). Since there are only 8 neighborhoods and I could not find their latitudes and longitudes in a single table, I decided to create the dataframe myself and enter the values by hand. In this case the brute-force method takes less time than scraping the web, so I will go with it. Below is what the table looks like; all the location data were obtained from direct Google searches.

| City | Borough | Neighborhood | Latitude | Longitude |
| --- | --- | --- | --- | --- |
| Shanghai | Hongkou | Ouyang | 31.2695 | 121.4943 |
| Shanghai | Hongkou | Quyang | 31.2890 | 121.4953 |
| Shanghai | Hongkou | Guangzhong | 31.2669 | 121.4775 |
| Shanghai | Hongkou | Jiaxing | 31.2714 | 121.5002 |
| Shanghai | Hongkou | Liangcheng Xincun| 31.2988 | 121.4719 |
| Shanghai | Hongkou | Sichuan North | 31.2584 | 121.4806 |
| Shanghai | Hongkou | Tilanqiao | 31.2564 | 121.5137 |
| Shanghai | Hongkou | Jiangwanzhen | 31.3049 | 121.4732 |


* __New York__: We've already created such a dataframe in our example project of exploring neighborhoods in New York, and there the raw data was fetched from this [json file](https://cocl.us/new_york_dataset). I'll simply repeat the procedures in that project, which makes life much easier.

* __San Francisco__: It is not easy to find an existing dataset for the neighborhoods in San Francisco, and after much work I did find one from [DataSF.org](https://data.sfgov.org/Geographic-Locations-and-Boundaries/Realtor-Neighborhoods/5gzd-g9ns). The data downloaded from their csv/json file is not well organized and needs to be cleaned before the analysis. In addition, since the data was originally collected in 2010, it might not be fully up to date, and comparing it with the [Wikipedia Page](https://en.wikipedia.org/wiki/List_of_neighborhoods_in_San_Francisco) did reveal some discrepancies. However, since this project does not require exact neighborhood boundaries, the dataset should suffice for our purpose.

#### 2.1.2 Venue Information Data

This data is relatively easy to obtain. We will use the __Foursquare__ API to fetch it, as we did before. We will create two dataframes of venue categories, one for the neighborhoods in Hongkou and the other for those in New York and San Francisco.

In order to get the same number of venue categories for each city, we need to fetch the venue data for all three cities at the same time. Thus, we will use our grand master _ThreeCities_ dataframe here. Afterwards, the venue categories dataframe will be split into two, and the one for Shanghai will be averaged to get a _HomeFeatures_ dataframe, which defines the characteristics of my hometown and will be used in the same way as a user profile. The other dataframe, named _NewNeighborFeatures_, gives us the venue characteristics of each neighborhood in NY and SF.  
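As a rough sketch of the feature construction just described, assume a venue table with one row per venue and the columns `City`, `Neighborhood` and `Venue Category`. The toy rows and category names below are made up for illustration and are not actual Foursquare results; the intermediate names `venues`, `onehot` and `features` are likewise only placeholders.

```python
import pandas as pd

# Hypothetical venue table: one row per venue found around a neighborhood.
venues = pd.DataFrame({
    'City': ['Shanghai', 'Shanghai', 'Shanghai',
             'New York', 'New York', 'San Francisco'],
    'Neighborhood': ['Ouyang', 'Ouyang', 'Quyang',
                     'Wakefield', 'Wakefield', 'Alamo Square'],
    'Venue Category': ['Noodle House', 'Park', 'Park',
                       'Pizza Place', 'Park', 'Coffee Shop'],
})

# One-hot encode the categories for all three cities at once, so that every
# neighborhood ends up with the same set of category columns.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
onehot.insert(0, 'City', venues['City'])

# Frequency of each category within each neighborhood.
features = onehot.groupby(['City', 'Neighborhood']).mean()

# Split: average the Shanghai neighborhoods into a single home profile
# and keep the remaining neighborhoods as they are.
is_home = features.index.get_level_values('City') == 'Shanghai'
HomeFeatures = features[is_home].mean(axis=0)
NewNeighborFeatures = features[~is_home]
```

Encoding all cities in one pass is what guarantees that `HomeFeatures` and `NewNeighborFeatures` share the same category columns, even for categories that never occur in Shanghai.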

The _HomeFeatures_ dataframe will then be multiplied by the _NewNeighborFeatures_ dataframe to produce a sorted _NewNeighborRating_ dataframe, whose top few rows tell us the most home-like neighborhoods in NY and SF.  

**The final result will be the first 20 rows of _NewNeighborRating_ dataframe.**
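A minimal sketch of this scoring step, assuming the multiplication is a dot product over venue categories (my reading of the description above). The category frequencies and the neighborhood labels are invented for illustration.

```python
import pandas as pd

categories = ['Coffee Shop', 'Noodle House', 'Park']

# Hypothetical home profile: category frequencies averaged over the home neighborhoods.
HomeFeatures = pd.Series([0.2, 0.5, 0.3], index=categories)

# Hypothetical candidate neighborhoods with their own category frequencies.
NewNeighborFeatures = pd.DataFrame(
    [[0.6, 0.0, 0.4],
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]],
    index=['Neighborhood A', 'Neighborhood B', 'Neighborhood C'],
    columns=categories,
)

# Weighted sum over categories: each neighborhood's row dotted with the home profile.
scores = NewNeighborFeatures.dot(HomeFeatures)

# Sort from most to least home-like; the top rows are the recommendations.
NewNeighborRating = scores.sort_values(ascending=False).rename('Rating').to_frame()
top = NewNeighborRating.head(2)
```

In this toy case, Neighborhood B ranks first because it shares the home profile's emphasis on noodle houses. In the real analysis, `head(20)` would give the final list.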

### 2.2 Cleaning Neighborhood Location Data

#### 2.2.1 Shanghai

Let's build the dataframe for _Hongkou, Shanghai_ by hand.

In [2]:
# define the dataframe columns
column_names = ['City', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
HomeLocation = pd.DataFrame(columns=column_names)
HomeLocation

Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude


In [3]:
HomeNeigh_name = ['Ouyang', 'Quyang', 'Guangzhong', 'Jiaxing', 'Liangcheng Xincun', 'Sichuan North', 'Tilanqiao', 'Jiangwanzhen']
HomeNeigh_lat = [31.2695, 31.2890, 31.2669, 31.2714, 31.2988, 31.2584, 31.2564, 31.3049]
HomeNeigh_lon = [121.4943, 121.4953, 121.4775, 121.5002, 121.4719, 121.4806, 121.5137, 121.4732]

# Build the dataframe in one step from the three lists
# (DataFrame.append was removed in pandas 2.0)
HomeLocation = pd.DataFrame({'City': 'Shanghai',
                             'Borough': 'Hongkou',
                             'Neighborhood': HomeNeigh_name,
                             'Latitude': HomeNeigh_lat,
                             'Longitude': HomeNeigh_lon}, columns=column_names)
HomeLocation

Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,Shanghai,Hongkou,Ouyang,31.2695,121.4943
1,Shanghai,Hongkou,Quyang,31.289,121.4953
2,Shanghai,Hongkou,Guangzhong,31.2669,121.4775
3,Shanghai,Hongkou,Jiaxing,31.2714,121.5002
4,Shanghai,Hongkou,Liangcheng Xincun,31.2988,121.4719
5,Shanghai,Hongkou,Sichuan North,31.2584,121.4806
6,Shanghai,Hongkou,Tilanqiao,31.2564,121.5137
7,Shanghai,Hongkou,Jiangwanzhen,31.3049,121.4732


We now have __HomeLocation__ as our dataframe for the location data for neighborhoods in Hongkou, Shanghai.

#### 2.2.2 New York

Let's build the dataframe for _New York_, using the same procedure as in the project Neighborhoods-New-York.

In [4]:
url = 'https://cocl.us/new_york_dataset'
newyork_data = requests.get(url).json()
NY_data = newyork_data['features']
NY_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

In [5]:
# define the dataframe columns
column_names = ['City', 'Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
NYLocation = pd.DataFrame(columns=column_names)
NYLocation

Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude


In [6]:
NY_rows = []
for data in NY_data:
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates'][:2]

    NY_rows.append({'City': 'New York',
                    'Borough': data['properties']['borough'],
                    'Neighborhood': data['properties']['name'],
                    'Latitude': neighborhood_lat,
                    'Longitude': neighborhood_lon})

# Build the dataframe once instead of appending row by row
# (DataFrame.append was removed in pandas 2.0)
NYLocation = pd.DataFrame(NY_rows, columns=column_names)

NYLocation.head()

Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,New York,Bronx,Wakefield,40.894705,-73.847201
1,New York,Bronx,Co-op City,40.874294,-73.829939
2,New York,Bronx,Eastchester,40.887556,-73.827806
3,New York,Bronx,Fieldston,40.895437,-73.905643
4,New York,Bronx,Riverdale,40.890834,-73.912585


In [7]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(NYLocation['Borough'].unique()),
        NYLocation.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


We now have __NYLocation__ as our dataframe for the location data for neighborhoods in New York, US.

#### 2.2.3 San Francisco

Let's read in the local csv file for the neighborhoods data in San Francisco and take a quick look at it.

In [136]:
SFLocation = pd.read_csv("SF_Neighborhoods.csv")
SFLocation

Unnamed: 0,sfar_distr,the_geom,nbrhood,nid
0,District 6 - Central North,MULTIPOLYGON (((-122.42948394891741 37.7750962...,Alamo Square,6e
1,District 6 - Central North,MULTIPOLYGON (((-122.44746439135872 37.7798633...,Anza Vista,6a
2,District 4 - Twin Peaks West,MULTIPOLYGON (((-122.46450886214802 37.7322084...,Balboa Terrace,4a
3,District 10 - Southeast,MULTIPOLYGON (((-122.38758527038996 37.7502633...,Bayview,10a
4,District 9 - Central East,MULTIPOLYGON (((-122.40375492236231 37.7491900...,Bernal Heights,9a
5,District 5 - Central,MULTIPOLYGON (((-122.43562414065005 37.7673279...,Buena Vista Park/Ashbury Heights,5f
6,District 1 - Northwest,MULTIPOLYGON (((-122.49167862383683 37.7721233...,Central Richmond,1a
7,District 2 - Central West,MULTIPOLYGON (((-122.4774733239755 37.76531008...,Central Sunset,2e
8,District 5 - Central,MULTIPOLYGON (((-122.44635254035374 37.7603195...,Clarendon Heights,5h
9,District 5 - Central,MULTIPOLYGON (((-122.43575809313332 37.7672311...,Corona Heights,5g


We need the names of the neighborhoods and boroughs from the file, as well as their locations.

In [137]:
# Separate the location data from the df
location = SFLocation['the_geom']

# Renaming the two useful columns
SFLocation.rename(columns ={'sfar_distr': 'Borough', 'nbrhood': 'Neighborhood'}, inplace=True) 
# Dropping the other columns
SFLocation.drop(columns=['the_geom', 'nid'], inplace=True)
# Deleting the prefix "District # - " from the names of boroughs
SFLocation['Borough'] = SFLocation['Borough'].str.replace(r'^District \d+ - ', '', regex=True)
# Retaining only one neighborhood if several are present in one cell    
SFLocation['Neighborhood'] = SFLocation['Neighborhood'].apply(lambda x: x.split('/')[0])
# Add the other three empty columns 
SFLocation = SFLocation.reindex(['City'] + SFLocation.columns.tolist() + ['Latitude', 'Longitude'], axis=1) 
# Set all the values of 'City' to 'San Francisco'
SFLocation['City'] = 'San Francisco'

# Inspect the data
rows = SFLocation.shape[0]
print('The number of rows of SFLocation is: ', rows)
SFLocation

The number of rows of SFLocation is:  92


Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,San Francisco,Central North,Alamo Square,,
1,San Francisco,Central North,Anza Vista,,
2,San Francisco,Twin Peaks West,Balboa Terrace,,
3,San Francisco,Southeast,Bayview,,
4,San Francisco,Central East,Bernal Heights,,
5,San Francisco,Central,Buena Vista Park,,
6,San Francisco,Northwest,Central Richmond,,
7,San Francisco,Central West,Central Sunset,,
8,San Francisco,Central,Clarendon Heights,,
9,San Francisco,Central,Corona Heights,,


Now extract a latitude and longitude from the location data (the first vertex of each polygon) and write them back into SFLocation.

In [138]:
# Keep only the first "longitude latitude" vertex of each polygon
first_vertex = (location.str.replace('MULTIPOLYGON (((', '', regex=False)
                        .str.split(',').str[0]
                        .str.split(' ', expand=True))

# Assign whole columns at once, which avoids chained assignment
# and the SettingWithCopyWarning it triggers
SFLocation['Longitude'] = first_vertex[0].astype(float)
SFLocation['Latitude'] = first_vertex[1].astype(float)

SFLocation


Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,San Francisco,Central North,Alamo Square,37.775096,-122.429484
1,San Francisco,Central North,Anza Vista,37.779863,-122.447464
2,San Francisco,Twin Peaks West,Balboa Terrace,37.732208,-122.464509
3,San Francisco,Southeast,Bayview,37.750263,-122.387585
4,San Francisco,Central East,Bernal Heights,37.74919,-122.403755
5,San Francisco,Central,Buena Vista Park,37.767328,-122.435624
6,San Francisco,Northwest,Central Richmond,37.772123,-122.491679
7,San Francisco,Central West,Central Sunset,37.76531,-122.477473
8,San Francisco,Central,Clarendon Heights,37.76032,-122.446353
9,San Francisco,Central,Corona Heights,37.767231,-122.435758


In [142]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(SFLocation['Borough'].unique()),
        SFLocation.shape[0]
    )
)

The dataframe has 11 boroughs and 92 neighborhoods.


We now have __SFLocation__ as our dataframe for the location data for neighborhoods in San Francisco, US.
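Note that the coordinates above come from the first vertex of each neighborhood's boundary polygon, i.e. a point on its edge rather than its center. If a more central point is wanted, one rough option is to average all the vertices instead. The sketch below assumes the `the_geom` strings follow the usual WKT layout of `longitude latitude` pairs; the helper name `mean_vertex` and the sample polygon are hypothetical.

```python
import re

def mean_vertex(wkt):
    """Return (latitude, longitude) averaged over every vertex in a WKT string."""
    # Each vertex is written as "<lon> <lat>"; pull out all such number pairs.
    pairs = re.findall(r'(-?\d+(?:\.\d+)?) (-?\d+(?:\.\d+)?)', wkt)
    lons = [float(lon) for lon, _ in pairs]
    lats = [float(lat) for _, lat in pairs]
    return sum(lats) / len(lats), sum(lons) / len(lons)

# Made-up square, not a real neighborhood boundary.
sample = ('MULTIPOLYGON (((-122.44 37.76, -122.42 37.76, '
          '-122.42 37.78, -122.44 37.78, -122.44 37.76)))')
lat, lon = mean_vertex(sample)
```

This is a plain vertex mean, not a true area centroid: the closing vertex of each ring is counted twice and densely sampled edges weigh more. For the purpose of querying venues around a neighborhood, that is probably accurate enough. It would have to be applied to the original `the_geom` strings, before they are trimmed to the first vertex.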

#### _HomeLocation_, _NYLocation_, _SFLocation_ are the three location dataframes that will be used in the subsequent analysis!

#### 2.2.4 Master Dataframes _TwoCities_ and _ThreeCities_

Let's merge _NYLocation_ and _SFLocation_ into one dataframe _TwoCities_.

In [144]:
TwoCities = pd.concat([NYLocation, SFLocation], ignore_index=True)
print('The shape of TwoCities is: ', TwoCities.shape)
TwoCities

The shape of TwoCities is:  (398, 5)


Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,New York,Bronx,Wakefield,40.894705,-73.847201
1,New York,Bronx,Co-op City,40.874294,-73.829939
2,New York,Bronx,Eastchester,40.887556,-73.827806
3,New York,Bronx,Fieldston,40.895437,-73.905643
4,New York,Bronx,Riverdale,40.890834,-73.912585
5,New York,Bronx,Kingsbridge,40.881687,-73.902818
6,New York,Manhattan,Marble Hill,40.876551,-73.91066
7,New York,Bronx,Woodlawn,40.898273,-73.867315
8,New York,Bronx,Norwood,40.877224,-73.879391
9,New York,Bronx,Williamsbridge,40.881039,-73.857446


Let's merge _HomeLocation_ and _TwoCities_ into one dataframe _ThreeCities_.

In [145]:
ThreeCities = pd.concat([HomeLocation, TwoCities], ignore_index=True)
print('The shape of ThreeCities is: ', ThreeCities.shape)
ThreeCities

The shape of ThreeCities is:  (406, 5)


Unnamed: 0,City,Borough,Neighborhood,Latitude,Longitude
0,Shanghai,Hongkou,Ouyang,31.2695,121.4943
1,Shanghai,Hongkou,Quyang,31.289,121.4953
2,Shanghai,Hongkou,Guangzhong,31.2669,121.4775
3,Shanghai,Hongkou,Jiaxing,31.2714,121.5002
4,Shanghai,Hongkou,Liangcheng Xincun,31.2988,121.4719
5,Shanghai,Hongkou,Sichuan North,31.2584,121.4806
6,Shanghai,Hongkou,Tilanqiao,31.2564,121.5137
7,Shanghai,Hongkou,Jiangwanzhen,31.3049,121.4732
8,New York,Bronx,Wakefield,40.894705,-73.847201
9,New York,Bronx,Co-op City,40.874294,-73.829939


### Notice: 

In our records, New York has a total of 306 neighborhoods, while San Francisco has only 92. Therefore, statistically speaking, New York should have more neighborhoods (__roughly 3 times as many__) that are similar to my home than San Francisco, if the home-like neighborhoods were randomly distributed between the two cities. This should be kept in mind when we later perform our analysis.
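To make this baseline concrete, here is a small calculation of how 20 recommendations would be expected to split if they were drawn at random from the 398 candidate neighborhoods. It is only a reference point under that random-draw assumption, not a result of the analysis.

```python
# Neighborhood counts taken from the dataframes above.
n_ny, n_sf = 306, 92
n_recommended = 20

# Share of all candidate neighborhoods that belongs to each city.
share_ny = n_ny / (n_ny + n_sf)
share_sf = n_sf / (n_ny + n_sf)

# Expected number of recommendations per city if the picks were random.
expected_ny = n_recommended * share_ny
expected_sf = n_recommended * share_sf

print(f'NY: {share_ny:.1%} of candidates, about {expected_ny:.1f} of {n_recommended} expected')
print(f'SF: {share_sf:.1%} of candidates, about {expected_sf:.1f} of {n_recommended} expected')
```

Under this assumption, about 15 of the 20 recommendations would fall in New York and about 5 in San Francisco, so a city only stands out as more home-like if it clearly exceeds its share.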

***

## 3. Methodology and Analysis