# Coursera capstone

#### Name: Qing Li
#### Date: 27/10/2019

## Table of contents:
- Introduction: Background and Business Problem
- Data Description
- Methodology
- Analysis
- Results and Discussion
- Conclusion

## Introduction: Background and Business Problem

As we know, Shanghai is a  promising international city, which is the financial and trade center of China. During recent years, a large number of companies, including both those mature ones who have strong backgrounds and strat-up companies, perfer to open a branch in Shanghai. This project aims at finding an optimal location for stakeholders interested in starting a company.

Since this is a start-up company, I am going to **avoid neighborhoods which located too closed to the city center** because of the high rent of offices. However, locating too far away form the business center of the city brings no advantages to the company and employees. Therefore, the location will be set in areas which are certainly distant from the city center. Also, locations with **multiple restaurants around**, which are convenient for employees to have lunch everyday, will be considered preferentially. Besides,  considering  of commuting factors, I would also like to choose locations which **have subway stations nearby**.

Based on this criteria, I am going to create a map and information charts to show the promising locations and their advantages.

## Data Description

To consider about our business problem, we can list the factors that will influence our data:

- distance of neighborhood from the city center (not too closed)
- number of existing restaurants in the neighborhood (any type of restaurant
- distance of neighborhood from neareast subway station

Based on these limitations, I decide to use regularly spaced grid of locations, centered around 3-12km from city center, to define our neighborhoods.

Datas we used will be extracted from following resources:

- Google Maps API reverse geocoding will be used to generate the addresses of centers of candidate areas
- number of restaurants and their per capita consumption and location in every neighborhood will be obtained using Foursquare API
- coordinate of Shanghai center will be obtained using Google Maps API geocoding of well known Shanghai location (Huangpu, which is the most bustling area in Shanghai).

## Methodology

### Neighborhood Candidates

For the first step, I create latitude & longitude coordinates for centroids of the candidate neighborhoods. I will create a grid of cells covering the area of interest which is aprox. 3x3-12x12 killometers away and centered around Shanghai city center (Huangpu).

Now, Google Maps geocoding API is used to generate the latitude & longtitude of Shanghai city center.

In [81]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [82]:
def get_coordinates(api_key, address, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(api_key, address)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except:
        return [None, None]
    
address = 'Huangpu, Shanghai, China'
google_api_key = 'AIzaSyCxrGdn5VOxLZ8G8h41KlVuf7_QSflfL80'
shanghai_center = get_coordinates(google_api_key, address)
print('Coordinate of {}: {}'.format(address, shanghai_center))

Coordinate of Huangpu, Shanghai, China: [31.231763, 121.484443]


Next, I will create a grid of area candidates, equaly spaced, 3x3-12x12 killometers away and centered around city center, Huangpu District. Our neighborhoods will be defined as circular areas with a radius of 300 meters, so our neighborhood centers will be 600 meters apart.

To accurately calculate distances, a grid of locations in Cartesian 2D coordinate system which allows us to calculate distances in meters (not in latitude/longitude degrees) must be created first. Then those coordinates will be projected back to latitude/longitude degrees to be shown on Folium map. 

Now, I am going to create functions to convert between WGS84 spherical coordinate system (latitude/longitude degrees) and UTM Cartesian coordinate system (X/Y coordinates in  meters).

In [83]:
import shapely.geometry

import pyproj

import math

def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=51, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=51, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Shanghai center longitude={}, latitude={}'.format(shanghai_center[1], shanghai_center[0]))
x, y = lonlat_to_xy(shanghai_center[1], shanghai_center[0])
print('Shanghai center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Shanghai center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Shanghai center longitude=121.484443, latitude=31.231763
Shanghai center UTM X=355659.0073027079, Y=3456277.527307382
Shanghai center longitude=121.48444300000001, latitude=31.231762999999997


Now, it's time to create a hexagonal grid of cells. The method is offseting every other row, then adjust vertical row spacing to make sure that every cell is equally distant from all of its neighbors.

In [84]:
shanghai_center_x, shanghai_center_y = lonlat_to_xy(shanghai_center[1], shanghai_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = shanghai_center_x - 6000
x_step = 600
y_min = shanghai_center_y - 6000 - (int(21/k)*k*600-12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(shanghai_center_x, shanghai_center_y, x, y)
        if (3000 <= distance_from_center <= 12001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

412 candidate neighborhood centers generated.


In [85]:
map_shanghai = folium.Map(location=shanghai_center, zoom_start=13)
folium.Marker(shanghai_center, popup='Huangpu').add_to(map_shanghai)
for lat, lon in zip(latitudes, longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_shanghai) 
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_shanghai)
    #folium.Marker([lat, lon]).add_to(map_shanghai)
map_shanghai

After evaluating the coordinates of centers of neighbors I choose, I am going to get the addresses of these centers by using Google Maps API

In [86]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except:
        return None

addr = get_address(google_api_key, shanghai_center[0], shanghai_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(shanghai_center[0], shanghai_center[1], addr))

Reverse geocoding check
-----------------------
Address of [31.231763, 121.484443] is: China, 码头


In [87]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', China', '') # We don't need country part of address
    addresses.append(address)
    print(' .', end='')
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


Now I have already got the addresses of these centers of chosen neighbors, which are shown in the next cell.

In [88]:
addresses[0:10]

['Unnamed Road, Tian Lin, Xuhui Qu, Shanghai Shi',
 'China, Shanghai, Xuhui, 田林东路158号',
 '258 Tiandong Rd, Xuhui Qu, Shanghai Shi, 200042',
 'China, Shanghai, Xuhui, 龙华西路323号',
 'Long Hua Su Zhai Guan ( Long Hua Lu Dian ), Xuhui Qu, Shanghai Shi',
 'Yun Jin Lu, Xuhui Qu, Shanghai Shi',
 '147 Feng Xi Lu, Xuhui Qu, Shanghai Shi, 200050',
 'China, Shanghai, Pudong, 耀华支路170号上海港复兴船务公司',
 '21 Tongyao Rd, Pudong Xinqu, Shanghai Shi',
 'Unnamed Road, Pudong Xinqu, Shanghai Shi']

In order to facilitate the processing of subsequent data, the addresses and their corresponding latitudes and longtitudes will be organized into a dataframe.

In [89]:
df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center
0,"Unnamed Road, Tian Lin, Xuhui Qu, Shanghai Shi",31.174804,121.425545,349959.007303,3450042.0,8448.076704
1,"China, Shanghai, Xuhui, 田林东路158号",31.17488,121.431839,350559.007303,3450042.0,8055.432949
2,"258 Tiandong Rd, Xuhui Qu, Shanghai Shi, 200042",31.174957,121.438133,351159.007303,3450042.0,7689.603371
3,"China, Shanghai, Xuhui, 龙华西路323号",31.175033,121.444427,351759.007303,3450042.0,7354.590403
4,"Long Hua Su Zhai Guan ( Long Hua Lu Dian ), Xu...",31.175109,121.450721,352359.007303,3450042.0,7054.78561
5,"Yun Jin Lu, Xuhui Qu, Shanghai Shi",31.175185,121.457015,352959.007303,3450042.0,6794.850992
6,"147 Feng Xi Lu, Xuhui Qu, Shanghai Shi, 200050",31.17526,121.46331,353559.007303,3450042.0,6579.51366
7,"China, Shanghai, Pudong, 耀华支路170号上海港复兴船务公司",31.175335,121.469604,354159.007303,3450042.0,6413.267498
8,"21 Tongyao Rd, Pudong Xinqu, Shanghai Shi",31.17541,121.475898,354759.007303,3450042.0,6300.0
9,"Unnamed Road, Pudong Xinqu, Shanghai Shi",31.175484,121.482192,355359.007303,3450042.0,6242.595614


In [90]:
df_locations.to_pickle('./locations.pkl')  

## Foursquare

For now, I have already got the specific information of the neighborhoods meeting my needs initially. Then I am going to use Foursquare API to get info on restaurants and subway stations in each neighborhood.

