# Capstone Project Part 1
### Applied Data Science Capstone by IBM/Coursera
Edited by Qingzhu Yuan

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data Acquisition and Cleaning](#data)
* [Methodology](#methodology)
* [Exploratory Data Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

#### A. Background
Shanghai is a well-known metropolis with a population of 24.28 million as of 2019 [1]. It is often called the Paris of the East for its blend of historic architecture and modern skyscrapers and for its vibrant economy, which has drawn many people and businesses to the city, bringing their hometown cuisines with them. Residents and visitors can easily find almost any kind of cuisine. Business owners, however, face fierce competition and must think carefully about what kind of restaurant to open and how to run it. This study focuses on choosing the location of a new bubble milk tea shop.

#### B. Problems
As mentioned, the stakeholder wants to choose a location for a new bubble milk tea shop. The venue should be in an area with comparatively high population density and close to potential customers such as students and young people, ideally in or near shopping malls or schools.

#### C. Interest
The owner of the milk tea shop is the stakeholder in this case. Potential customers and competitors may also be interested in this study. In addition, owners of other small businesses, such as coffee shops or fast-food outlets, may find this research useful.

## Data Acquisition and Cleaning <a name="data"></a>

#### A. Data Acquisition
The data was obtained by web scraping.

In [1]:
#import library
import pandas as pd #dataframe
import numpy as np # library to handle data in a vectorized manner

import folium # map rendering

import requests # library to handle requests
from bs4 import BeautifulSoup # Web Scraping

import matplotlib as mpl
import matplotlib.pyplot as plt

In [2]:
!conda install -c conda-forge geocoder -y
import geocoder # get coordinates library

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



First, the Shanghai district data is scraped from Wikipedia and stored in a data frame.

In [3]:
# get page under url
def get_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text,'lxml')
    return soup

In [4]:
# create shanghai district table
tabledis = get_page("https://en.wikipedia.org/wiki/List_of_administrative_divisions_of_Shanghai").findAll('table',class_='wikitable')[1]
tabledis

<table cellspacing="0" class="wikitable" style="padding: 24em 0; border: 1px #aaa solid; border-collapse: collapse; font-size: 90%;">
<tbody><tr>
<th rowspan="2"></th>
<th colspan="8">County Level
</th></tr>
<tr>
<th>Name</th>
<th><a class="mw-redirect" href="/wiki/Simplified_Chinese" title="Simplified Chinese">Chinese</a></th>
<th><a class="mw-redirect" href="/wiki/Hanyu_Pinyin" title="Hanyu Pinyin">Hanyu Pinyin</a></th>
<th colspan="2"><a href="/wiki/Administrative_division_codes_of_the_People%27s_Republic_of_China" title="Administrative division codes of the People's Republic of China">Division code</a><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup></th>
<th>Area (km²)<sup class="reference" id="cite_ref-census2018_3-0"><a href="#cite_note-census2018-3">[3]</a></sup></th>
<th>Population (2018 census)<sup class="reference" id="cite_ref-census2018_3-1"><a href="#cite_note-census2018-3">[3]</a></sup></th>
<th>Density (/km²)
</th></tr>
<tr>
<td bgcolor="#ffd35

In [5]:
tabledis_rows = tabledis.find_all('tr')

res = []
for tr in tabledis_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td if cell.text.strip()]  # keep non-empty cells only
    if row:
        res.append(row)
        
df = pd.DataFrame(res)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,Huangpu District[4](City seat),黄浦区,Huángpǔ Qū,310101,HGP,20.46,653800,31955
1,Xuhui District,徐汇区,Xúhuì Qū,310104,XHI,54.76,1084400,19803
2,Changning District,长宁区,Chángníng Qū,310105,CNQ,38.3,694000,18120
3,Jing'an District,静安区,Jìng'ān Qū,310106,JAQ,36.88,1062800,28818
4,Putuo District,普陀区,Pǔtuó Qū,310107,PTQ,54.83,1281900,23380


In [6]:
# The District Name and Density is what we are interested in.
District = df[[0,7]].rename(columns={0:"District",7:"District Density(/km^2)"})
District.iloc[0,0]=District.iloc[0,0].replace('[4](City seat)','')
District.head()

Unnamed: 0,District,District Density(/km^2)
0,Huangpu District,31955
1,Xuhui District,19803
2,Changning District,18120
3,Jing'an District,28818
4,Putuo District,23380


In [7]:
District = pd.DataFrame(District)
District.dtypes

District                   object
District Density(/km^2)    object
dtype: object

In [8]:
# Convert values from string to float so that population density can be used in calculations.
import locale
from locale import atof
locale.setlocale(locale.LC_NUMERIC, '')

DD = District[['District Density(/km^2)']].applymap(atof)

The Shanghai district names and density data are now stored in the **District Dataframe**.

In [9]:
District = pd.concat([District[['District']],DD],axis=1)
District.head()

Unnamed: 0,District,District Density(/km^2)
0,Huangpu District,31955.0
1,Xuhui District,19803.0
2,Changning District,18120.0
3,Jing'an District,28818.0
4,Putuo District,23380.0
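
The conversion above depends on the machine's locale settings. As a hedged alternative, here is a minimal sketch that strips thousands separators explicitly and lets pandas parse the numbers. The helper name `to_float` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def to_float(series):
    """Convert strings such as '31,955' or '19803' to floats (hypothetical helper)."""
    # remove thousands separators so parsing does not depend on the locale
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

# illustrative sample values, not scraped data
sample = pd.Series(["31,955", "19803", "n/a"])
converted = to_float(sample)
print(converted)
```

Coercing unparseable cells to NaN makes missing values visible in a later `isnull()` check rather than stopping the notebook with an exception.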


Second, the sub-district data is also scraped from the web, then stored and merged into a data frame.

In [10]:
links_a = tabledis.find_all('a')

In [11]:
links_a[6].get('href')

'/wiki/Huangpu_District,_Shanghai'

In [12]:
# Collect the links to all district pages, which list the sub-districts
link_subd = []
for i in range(6,23):
    link_subd.append(links_a[i].get('href')) 
link_subd

['/wiki/Huangpu_District,_Shanghai',
 '#cite_note-4',
 '/wiki/Xuhui_District',
 '/wiki/Changning_District',
 '/wiki/Jing%27an_District',
 '/wiki/Putuo_District,_Shanghai',
 '/wiki/Hongkou_District',
 '/wiki/Yangpu_District',
 '/wiki/Pudong',
 '/wiki/Minhang_District',
 '/wiki/Baoshan_District,_Shanghai',
 '/wiki/Jiading_District',
 '/wiki/Jinshan_District',
 '/wiki/Songjiang_District',
 '/wiki/Qingpu_District,_Shanghai',
 '/wiki/Fengxian_District',
 '/wiki/Chongming_District']

In [13]:
# remove certain wrong item in the list
link_subd.remove( '#cite_note-4')
link_subd

['/wiki/Huangpu_District,_Shanghai',
 '/wiki/Xuhui_District',
 '/wiki/Changning_District',
 '/wiki/Jing%27an_District',
 '/wiki/Putuo_District,_Shanghai',
 '/wiki/Hongkou_District',
 '/wiki/Yangpu_District',
 '/wiki/Pudong',
 '/wiki/Minhang_District',
 '/wiki/Baoshan_District,_Shanghai',
 '/wiki/Jiading_District',
 '/wiki/Jinshan_District',
 '/wiki/Songjiang_District',
 '/wiki/Qingpu_District,_Shanghai',
 '/wiki/Fengxian_District',
 '/wiki/Chongming_District']

In [14]:
# Try on finding the related district to the subdistrict
get_page('https://en.wikipedia.org'+link_subd[0]).find('title').text.replace(", Shanghai - Wikipedia","")

'Huangpu District'

In [15]:
# create a big table including all subdistricts
ressub =[]
Dis = []
for i in range(len(link_subd)):
    # fetch each district page once and reuse it for both the title and the table
    page = get_page('https://en.wikipedia.org'+link_subd[i])
    # scrape the district name from the page title and add it to the table
    d = page.find('title').text.replace(", Shanghai - Wikipedia","")
    Dis.append(d)
    Dis[i] = Dis[i].replace("- Wikipedia","")
    
    tablesub = page.find('table',class_='wikitable')
    table_rows = tablesub.find_all('tr')

    for tr in table_rows:
        td = tr.find_all('td')
        row = [cell.text.strip() for cell in td if cell.text.strip()]  # keep non-empty cells only
        if row:
            ressub.append([Dis[i]]+row)

tablesubdf = pd.DataFrame(ressub)
tablesubdf.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Huangpu District,Bansongyuan Road Subdistrict,半淞园路街道,Bànsōngyuánlù Jiēdào,peu son yeu lu ka do,89776,2.87
1,Huangpu District,Huaihai Central Road Subdistrict,淮海中路街道,Huáihǎi Zhōnglù Jiēdào,wa he tzon lu ka do,57931,1.41
2,Huangpu District,Dapuqiao Subdistrict,打浦桥街道,Dǎpǔqiáo Jiēdào,tan phu djio ka do,59085,1.59
3,Huangpu District,Nanjing East Road Subdistrict,南京东路街道,Nánjīng Dōnglù Jiēdào,neu cin ton lu ka do,66285,2.41
4,Huangpu District,Laoximen Subdistrict,老西门街道,Lǎoxīmén Jiēdào,lo sij men ka do,72898,1.24


In [16]:
# Convert values from string to float in order to calculate population density.
a = tablesubdf[[5]].applymap(atof)
b = tablesubdf[6].astype(float) 

# We are interested in subdistrict name, population and area.
Subdis= tablesubdf[[0,1,5,6]].rename(columns={0:'District',1:'Subdistrict',5:'Population(2010)',6:'Area(km^2)'})

Density = pd.DataFrame([a.iloc[:,0]/b]).T.rename(columns={0:'Density(/km^2)'})
Subdis = pd.concat([Subdis,Density],axis=1)

print('There are totally {} subdistricts in Shanghai.'.format(Subdis.shape[0]))
Subdis.head()

There are totally 234 subdistricts in Shanghai.


Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2)
0,Huangpu District,Bansongyuan Road Subdistrict,89776,2.87,31280.836237
1,Huangpu District,Huaihai Central Road Subdistrict,57931,1.41,41085.815603
2,Huangpu District,Dapuqiao Subdistrict,59085,1.59,37160.377358
3,Huangpu District,Nanjing East Road Subdistrict,66285,2.41,27504.149378
4,Huangpu District,Laoximen Subdistrict,72898,1.24,58788.709677


In [17]:
# check if there is missing data in the table
Subdis.describe()

Unnamed: 0,Density(/km^2)
count,228.0
mean,13681.62752
std,15554.105494
min,9.859155
25%,1196.161692
50%,4927.204233
75%,22931.833451
max,72255.147059


In [23]:
# The count of Density is less than the count of subdistricts, so we need to find out which data is missing and which area it is related to.
Subdis[Subdis.isnull().any(axis=1)]

Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2)
55,Putuo District,Changzheng town,7.67,,23380.0
123,Minhang District,Pujiang town,78.51,,6860.0
173,Songjiang District,Guangfulin Subdistrict,19.05,,2910.0
174,Songjiang District,Jiuliting Subdistrict,6.79,,2910.0
188,Songjiang District,Songjiang Industrial Zone,43.69,,2910.0
201,Fengxian District,Nanqiao town,114.68,,1676.0


In [19]:
# import the math library to test whether a value is NaN
import math
math.isnan(Subdis.loc[174,'Density(/km^2)'])

True

In [20]:
# The missing density for row 174 (Jiuliting Subdistrict, Songjiang District) is replaced with the density of Songjiang District.
# .strip() removes any surrounding whitespace from the district name
# .loc[row condition, column label] selects the matching rows of that column
# .iloc[0] takes the first (and only) value of that selection
District.loc[District['District']==Subdis.iloc[174,0].strip(),'District Density(/km^2)'].iloc[0]

2910.0

In [21]:
# Fill every missing subdistrict density with the density of its parent district.
for i in range(0,Subdis.shape[0]):
    if math.isnan(Subdis.loc[i,'Density(/km^2)']):
        Subdis.loc[i,'Density(/km^2)'] = District.loc[District['District']==Subdis.iloc[i,0].strip(),'District Density(/km^2)'].iloc[0]
        
Subdis[Subdis.isnull().any(axis=1)]

Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2)
55,Putuo District,Changzheng town,7.67,,23380.0
123,Minhang District,Pujiang town,78.51,,6860.0
173,Songjiang District,Guangfulin Subdistrict,19.05,,2910.0
174,Songjiang District,Jiuliting Subdistrict,6.79,,2910.0
188,Songjiang District,Songjiang Industrial Zone,43.69,,2910.0
201,Fengxian District,Nanqiao town,114.68,,1676.0
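
The row-by-row loop above can also be written as a vectorized lookup. The following is a minimal sketch under stated assumptions: `district_df` and `sub_df` are small stand-ins for the `District` and `Subdis` dataframes, and the subdistrict names and the 5000.0 density are made up for illustration.

```python
import numpy as np
import pandas as pd

# miniature stand-ins for the District and Subdis dataframes (illustrative values)
district_df = pd.DataFrame({
    "District": ["Songjiang District", "Putuo District"],
    "District Density(/km^2)": [2910.0, 23380.0],
})
sub_df = pd.DataFrame({
    "District": ["Songjiang District ", "Putuo District", "Putuo District"],
    "Subdistrict": ["Subdistrict A", "Subdistrict B", "Subdistrict C"],
    "Density(/km^2)": [np.nan, np.nan, 5000.0],
})

# build a district -> density lookup, then map it onto each subdistrict row
lookup = district_df.set_index("District")["District Density(/km^2)"]
fallback = sub_df["District"].str.strip().map(lookup)

# fillna only touches rows where the subdistrict density is missing
sub_df["Density(/km^2)"] = sub_df["Density(/km^2)"].fillna(fallback)
print(sub_df)
```

Because `fillna` aligns on the index, rows that already have a density keep their own value and only the gaps receive the district-level figure.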


The Shanghai sub-district names and density data are now stored in the **Subdis Dataframe**.

In [22]:
print(Subdis.shape[0])
Subdis.head()

234


Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2)
0,Huangpu District,Bansongyuan Road Subdistrict,89776,2.87,31280.836237
1,Huangpu District,Huaihai Central Road Subdistrict,57931,1.41,41085.815603
2,Huangpu District,Dapuqiao Subdistrict,59085,1.59,37160.377358
3,Huangpu District,Nanjing East Road Subdistrict,66285,2.41,27504.149378
4,Huangpu District,Laoximen Subdistrict,72898,1.24,58788.709677


Then, the geographic coordinates of each sub-district are obtained with geocoder, stored, and merged into the data frame.

In [24]:
def get_latlng(location,district):
    lat_lng_coords = None
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{},{},Shanghai'.format(location,district))
        lat_lng_coords = g.latlng
    return lat_lng_coords
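
The `while` loop above retries forever if the service never answers for a given query. A bounded retry is one possible safeguard. The sketch below is an assumption-laden illustration: `get_latlng_bounded`, `max_tries`, `pause`, and the injected `geocode` callable are hypothetical names, and a fake geocoder stands in for the real service so the example runs offline.

```python
import time

def get_latlng_bounded(query, geocode, max_tries=3, pause=0.0):
    """Return [lat, lng], or None after at most max_tries attempts (hypothetical helper)."""
    for _ in range(max_tries):
        coords = geocode(query)
        if coords is not None:
            return coords
        time.sleep(pause)  # wait before asking again
    return None

# stand-in geocoder for demonstration: fails twice, then succeeds
calls = {"n": 0}
def fake_geocode(query):
    calls["n"] += 1
    return [31.2304, 121.4737] if calls["n"] >= 3 else None

found = get_latlng_bounded("Example Subdistrict,Example District,Shanghai", fake_geocode)
missing = get_latlng_bounded("nowhere", lambda q: None, max_tries=2)
print(found, missing)
```

With the geocoder library used in this notebook, the callable could be `lambda q: geocoder.arcgis(q).latlng`. Rows that come back as `None` can then be reviewed by hand instead of blocking the loop.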

In [25]:
import time
latitude =[]
longitude=[]

for i in range(0,Subdis.shape[0]):
    district = Subdis.iloc[i,0]
    # geocode once per row and reuse the result for both coordinates
    lat, lng = get_latlng(Subdis.iloc[i,1],district)
    latitude.append(lat)
    longitude.append(lng)
    time.sleep(2)

Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Max retries exceeded with url: /arcgis/rest/services/World/GeocodeServer/find?f=json&text=Laoximen+Subdistrict%2CHuangpu+District%2CShanghai&maxLocations=1 (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Max retries exceeded with url: /arcgis/rest/services/World/GeocodeServer/find?f=json&text=Pujiang+town%2CMinhang+District+%2CShanghai&maxLocations=1 (Caused by ProxyError('Cannot connect to proxy.', timeout('select timed out')))
Status code Unknown from https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/find: ERROR - HTTPSConnectionPool(host='geocode.arcgis.com', port=443): Max retries exceeded with url: /arcg

In [26]:
Subdis_latlng = pd.DataFrame({'latitude':latitude,'longitude':longitude})
print(Subdis_latlng.shape)
Subdis_latlng.head()

(234, 2)


Unnamed: 0,latitude,longitude
0,31.2378,121.4781
1,31.24055,121.19262
2,31.2378,121.4781
3,31.2378,121.4781
4,31.2378,121.4781


In [27]:
# concatenate dataframes
SubDistrict = pd.concat([Subdis,Subdis_latlng],axis=1)
SubDistrict.head()

Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2),latitude,longitude
0,Huangpu District,Bansongyuan Road Subdistrict,89776,2.87,31280.836237,31.2378,121.4781
1,Huangpu District,Huaihai Central Road Subdistrict,57931,1.41,41085.815603,31.24055,121.19262
2,Huangpu District,Dapuqiao Subdistrict,59085,1.59,37160.377358,31.2378,121.4781
3,Huangpu District,Nanjing East Road Subdistrict,66285,2.41,27504.149378,31.2378,121.4781
4,Huangpu District,Laoximen Subdistrict,72898,1.24,58788.709677,31.2378,121.4781


In [28]:
# double-check for outlier points
SubDistrict.describe()

Unnamed: 0,Density(/km^2),latitude,longitude
count,234.0,234.0,234.0
mean,13504.517412,31.390507,116.279895
std,15439.710628,1.033089,31.389813
min,9.859155,30.72333,-76.73595
25%,1292.23372,31.096268,121.280403
50%,4829.483439,31.22,121.41844
75%,22821.64903,31.299568,121.500618
max,72255.147059,37.58806,121.8601


In [29]:
# The minimum longitude and the maximum latitude fall far outside Shanghai, so we should inspect those rows.
SubDistrict[SubDistrict['longitude']<121]

Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2),latitude,longitude
46,Jing'an District,Pengpu town,152725,7.88,19381.345178,37.58806,-76.73595
151,Jiading District,Malu town,172864,57.16,3024.212736,37.58806,-76.73595
208,Fengxian District,Haiwan town,28457,100.6,282.872763,37.58806,-76.73595
214,Chongming District,Bu town,60111,63.48,946.928166,37.58806,-76.73595
216,Chongming District,Miao town,45926,95.7,479.895507,37.58806,-76.73595
226,Chongming District,Xinhai town,11646,105.04,110.872049,37.58806,-76.73595
227,Chongming District,Dongping town,15112,119.7,126.248956,31.40344,108.7503


In [30]:
SubDistrict[SubDistrict['latitude']>32]

Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2),latitude,longitude
46,Jing'an District,Pengpu town,152725,7.88,19381.345178,37.58806,-76.73595
151,Jiading District,Malu town,172864,57.16,3024.212736,37.58806,-76.73595
208,Fengxian District,Haiwan town,28457,100.6,282.872763,37.58806,-76.73595
214,Chongming District,Bu town,60111,63.48,946.928166,37.58806,-76.73595
216,Chongming District,Miao town,45926,95.7,479.895507,37.58806,-76.73595
226,Chongming District,Xinhai town,11646,105.04,110.872049,37.58806,-76.73595
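
The two filters above can be combined into a single bounding-box check. This is a minimal sketch; the box limits `LAT_RANGE` and `LNG_RANGE` are rough assumptions about the extent of Shanghai, and `outside_box` is a hypothetical helper. The sample coordinates are taken from the tables above.

```python
import pandas as pd

# rough bounding box around Shanghai; these limits are an assumption
LAT_RANGE = (30.6, 31.9)
LNG_RANGE = (120.8, 122.0)

def outside_box(df, lat_col="latitude", lng_col="longitude"):
    """Return the rows whose coordinates fall outside the bounding box."""
    ok_lat = df[lat_col].between(*LAT_RANGE)
    ok_lng = df[lng_col].between(*LNG_RANGE)
    return df[~(ok_lat & ok_lng)]

coords = pd.DataFrame({
    "Subdistrict": ["Bansongyuan Road Subdistrict", "Pengpu town", "Dongping town"],
    "latitude": [31.2378, 37.58806, 31.40344],
    "longitude": [121.4781, -76.73595, 108.7503],
})
suspect = outside_box(coords)
print(suspect)
```

A box check catches rows such as Dongping town, whose latitude looks plausible while its longitude does not.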


In [31]:
# Only a few rows are wrong, so they can be corrected manually.
# Re-running the geocoder would take much longer.
Pengpu = [31.306678, 121.449] # checked on Google
Malu = [31.337447, 121.233592] # not found, use the Jiading District coordinates
Haiwan = [30.883, 121.567] # not found, use the Fengxian District coordinates
# not found, use the Chongming District coordinates
Bu = [31.666667, 121.5]
Miao = Bu
Xinhai = Bu
Dongping = Bu

SubDistrict.iloc[46,-2] = Pengpu[0]
SubDistrict.iloc[46,-1] = Pengpu[1]
SubDistrict.iloc[151,-2] = Malu[0]
SubDistrict.iloc[151,-1] = Malu[1]
SubDistrict.iloc[208,-2] = Haiwan[0]
SubDistrict.iloc[208,-1] = Haiwan[1]
SubDistrict.iloc[214,-2] = Bu[0]
SubDistrict.iloc[214,-1] = Bu[1]
SubDistrict.iloc[216,-2] = Bu[0]
SubDistrict.iloc[216,-1] = Bu[1]
SubDistrict.iloc[226,-2] = Bu[0]
SubDistrict.iloc[226,-1] = Bu[1]
SubDistrict.iloc[227,-2] = Bu[0]
SubDistrict.iloc[227,-1] = Bu[1]

SubDistrict.describe()

Unnamed: 0,Density(/km^2),latitude,longitude
count,234.0,234.0,234.0
mean,13504.517412,31.233507,121.416284
std,15439.710628,0.2364,0.172806
min,9.859155,30.72333,121.00982
25%,1292.23372,31.094205,121.32363
50%,4829.483439,31.22,121.42105
75%,22821.64903,31.29848,121.511438
max,72255.147059,31.83195,121.8601


The **SubDistrict Dataframe** above is the Shanghai sub-district table with density and coordinate data.

In [32]:
# save the dataframe to a CSV file
SubDistrict.to_csv (r'C:\Users\Administrator\Documents\IBM data science\8 Applied Data Science Capstone\shanghai_subdistricts_dataframe.csv', index = False, header=True)

Next, the sub-districts are plotted on a map.

In [33]:
# take Downtown Shanghai as map center
latitude = 31.2304
longitude = 121.4737

# create map of Shanghai using latitude and longitude values
map_shanghai = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, district, subdistrict in zip(SubDistrict['latitude'], SubDistrict['longitude'], SubDistrict['District'], SubDistrict['Subdistrict']):
    label = '{},{}'.format(subdistrict,district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_shanghai)  
    
map_shanghai

With the above table, we can explore the venues in each region.

In [34]:
# Define FourSquare Credentials and Version
CLIENT_ID = 'your client ID' # your Foursquare ID
CLIENT_SECRET = 'your client secret' # your Foursquare Secret
VERSION = '20200624' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your client ID
CLIENT_SECRET:your client secret


In [35]:
# Let's try on the first subdistrict.
SubDistrict.loc[0,'Subdistrict']

'Bansongyuan Road Subdistrict'

In [36]:
Subdistrict_latitude = SubDistrict.loc[0, 'latitude'] # latitude value
Subdistrict_longitude = SubDistrict.loc[0, 'longitude'] # longitude value

Subdistrict_name = SubDistrict.loc[0,'Subdistrict'] # SubDistrict name

print('Latitude and longitude values of {} are {}, {}.'.format(Subdistrict_name, 
                                                               Subdistrict_latitude, 
                                                               Subdistrict_longitude))

Latitude and longitude values of Bansongyuan Road Subdistrict are 31.23780000000005, 121.47810000000004.


In [38]:
# First, let's create the GET request URL and store it in the variable url.
LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, Subdistrict_latitude, Subdistrict_longitude, VERSION, radius, LIMIT)

results = requests.get(url).json()
results['response']['groups'][0]['items'][0]

{'reasons': {'count': 0,
  'items': [{'summary': 'This spot is popular',
    'type': 'general',
    'reasonName': 'globalInteractionReason'}]},
 'venue': {'id': '545718a4498ec939e658e228',
  'name': 'Épices & Foie-gras',
  'location': {'address': '309-A1-05 Hangkou Rd.',
   'crossStreet': 'Shandong Rd.',
   'lat': 31.23755713089388,
   'lng': 121.47958042912362,
   'labeledLatLngs': [{'label': 'display',
     'lat': 31.23755713089388,
     'lng': 121.47958042912362}],
   'distance': 143,
   'cc': 'CN',
   'neighborhood': 'Běixīn qiáo',
   'city': 'Huangpu',
   'state': '上海市',
   'country': '中国',
   'formattedAddress': ['309-A1-05 Hangkou Rd. (Shandong Rd.), Běixīn qiáo',
    'Huangpu',
    '上海市',
    '中国']},
  'categories': [{'id': '4bf58dd8d48988d10c941735',
    'name': 'French Restaurant',
    'pluralName': 'French Restaurants',
    'shortName': 'French',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/french_',
     'suffix': '.png'},
    'primary': True}],
  'ph
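
The request URL in this cell is assembled with plain string formatting, which does not escape special characters in the parameter values. As a hedged alternative, the query string can be built with the standard library so that every value is URL-encoded. `build_explore_url` is a hypothetical helper, and the placeholder credentials are the same ones used above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, lat, lng, version, radius=500, limit=100):
    """Build the venues/explore URL with URL-encoded parameters (hypothetical helper)."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    # urlencode escapes spaces, commas and other reserved characters
    return base + "?" + urlencode(params)

example_url = build_explore_url("your client ID", "your client secret",
                                31.2378, 121.4781, "20200624")
print(example_url)
```

Passing the same dictionary to `requests.get(base, params=params)` would achieve a similar effect without building the string by hand.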

In [39]:
# Let's borrow the get_category_type function from the Foursquare lab.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [40]:
import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; older versions expose it under pandas.io.json)

In [41]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Épices & Foie-gras,French Restaurant,31.237557,121.47958
1,Bund Plaza,Department Store,31.239211,121.479741
2,M1NT Restaurant & Grill,Restaurant,31.23692,121.479641
3,Grand Central Hotel Shanghai (上海大酒店),Hotel,31.237379,121.476754
4,Nanjing Road Pedestrian Street (南京路步行街),Pedestrian Plaza,31.238273,121.476807


In [42]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

50 venues were returned by Foursquare.


#### Explore Subdistricts in Shanghai Downtown
Let's create a function that repeats the same process for all the subdistricts in downtown Shanghai.

In [50]:
# Foursquare limits the number of calls for free personal accounts, so let's restrict the search to the downtown area of Shanghai.
shanghaidowntown = SubDistrict.loc[0:118,:]
shanghaidowntown.tail()

Unnamed: 0,District,Subdistrict,Population(2010),Area(km^2),Density(/km^2),latitude,longitude
114,Pudong,Donghai Farm,508,15.2,33.421053,31.09611,121.79556
115,Pudong,Chaoyang Farm,862,10.67,80.787254,31.4715,121.79315
116,Pudong,Waigaoqiao Free-trade Zone,1349,10.0,134.9,31.35589,121.5727
117,Pudong,Jinqiao Export Processing Zone,5514,67.79,81.339431,31.27092,121.59331
118,Pudong,Zhangjiang Hi-tech Park,23617,75.9,311.15942,31.20861,121.60889


In [43]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
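
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. The sketch below shows one defensive way to flatten the items. `venue_rows` is a hypothetical helper, and the second sample item is invented to exercise the empty-category case.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # fall back to None when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows

items = [
    {"venue": {"name": "Épices & Foie-gras",
               "location": {"lat": 31.23755713089388, "lng": 121.47958042912362},
               "categories": [{"name": "French Restaurant"}]}},
    # invented item with an empty category list
    {"venue": {"name": "Uncategorised place (hypothetical)",
               "location": {"lat": 31.0, "lng": 121.0},
               "categories": []}},
]
rows = venue_rows("Bansongyuan Road Subdistrict", 31.2378, 121.4781, items)
print(rows)
```

The resulting tuples have the same seven fields, in the same order, as the columns assigned in `getNearbyVenues`.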

In [51]:
shanghai_venues = getNearbyVenues(names=shanghaidowntown['Subdistrict'],
                                   latitudes=shanghaidowntown['latitude'],
                                   longitudes=shanghaidowntown['longitude']
                                  )

In [52]:
print('{} venues were returned by Foursquare for all subdistricts in Shanghai downtown.'.format(shanghai_venues.shape[0]))
shanghai_venues.head()

1618 venues were returned by Foursquare for all subdistricts in Shanghai downtown.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bansongyuan Road Subdistrict,31.2378,121.4781,Épices & Foie-gras,31.237557,121.47958,French Restaurant
1,Bansongyuan Road Subdistrict,31.2378,121.4781,Bund Plaza,31.239211,121.479741,Department Store
2,Bansongyuan Road Subdistrict,31.2378,121.4781,M1NT Restaurant & Grill,31.23692,121.479641,Restaurant
3,Bansongyuan Road Subdistrict,31.2378,121.4781,Grand Central Hotel Shanghai (上海大酒店),31.237379,121.476754,Hotel
4,Bansongyuan Road Subdistrict,31.2378,121.4781,Nanjing Road Pedestrian Street (南京路步行街),31.238273,121.476807,Pedestrian Plaza


In [53]:
# Let's check how many venues were returned for each neighborhood
shanghai_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bansongyuan Road Subdistrict,50,50,50,50,50,50
Baoshan Road Subdistrict,47,47,47,47,47,47
Beicai town,1,1,1,1,1,1
Beixinjing Subdistrict,5,5,5,5,5,5
Beizhan Subdistrict[10],47,47,47,47,47,47
...,...,...,...,...,...,...
Zhenruzhen Subdistrict,6,6,6,6,6,6
Zhijiang West Road Subdistrict,47,47,47,47,47,47
Zhoujiadu Subdistrict,4,4,4,4,4,4
Zhoujiaqiao Subdistrict,27,27,27,27,27,27


In [54]:
# Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(shanghai_venues['Venue Category'].unique())))

There are 97 unique categories.


In [55]:
# save the dataframe to a CSV file
shanghai_venues.to_csv (r'C:\Users\Administrator\Documents\IBM data science\8 Applied Data Science Capstone\shanghai_venues_dataframe.csv', index = False, header=True)

## Methodology <a name="methodology"></a>

## Exploratory Data Analysis <a name="analysis"></a>

## Results and Discussion <a name="results"></a>

## Conclusion <a name="conclusion"></a>

Reference:
1. https://en.wikipedia.org/wiki/Shanghai