<h1 align=center><font size = 5>Capstone project - The best place to open a bakery in Chengdu, China</font></h1>

## Introduction & Business Problem

With globalization, more and more Chinese people, especially college students and white-collar workers, buy products from western-style bakeries. A friend of mine wants to open a western-style bakery in one of the divisions of Chengdu, China, and asked me to help him find the best place for it. To do so, I need information about the divisions of Chengdu: there must be enough customers for the bakery, and there should not already be too many similar bakeries in the same division. In the following, I use data science to analyze which division of Chengdu is the best place to open a bakery.

The main business problem addressed in this work is how to determine the optimum location for a new business, in this case a bakery. The problem can be approached using data about existing businesses. Certain types of enterprises tend to be built in the same areas because of economic incentives or public regulations. It is also important to note that the absence of a certain type of enterprise can mean that there is no demand for its services, so venue data without additional socioeconomic information about the regions is not sufficient to construct a complete picture. Nevertheless, it is possible to construct a profile that is sufficient for kickstarting the plan for a new business, and this is my main goal.

## Data 

This work analyzes the city of Chengdu. The names and postal codes of its divisions are extracted from a Wikipedia page¹. With this information at hand, the Google Geocoder API is used to obtain the geographical coordinates of each division from its postal code. The coordinates are used for map generation and as input for the Foursquare API, which provides venue information for each division.

In the following, I will mainly focus on the venue category parameter, grouping the venue categories into a few major groups to simplify the analysis and improve the visualization. Clustering algorithms such as K-Means will be used to automatically group similar divisions. The Plotly, Seaborn and Folium Python packages are used for data rendering and visualization.

[1] https://en.wikipedia.org/wiki/Chengdu

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Get the latitude and longitude coordinates of a given postal code </a>

3. <a href="#item3">Explore the divisions in Chengdu</a>

4. <a href="#item4">Determine the optimum location for a bakery</a>

5. <a href="#item5">Conclusions</a>    
</font>
</div>

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')

## 1. Download and Explore Dataset

In order to explore the divisions in Chengdu, we need a dataset that contains the divisions as well as the latitude and longitude coordinates of each division.

Unfortunately, the division data is not readily available as a structured dataset. However, a Wikipedia page has all the information we need to explore and cluster the divisions in Chengdu. We therefore scrape the Wikipedia page, clean the data, and read it into a pandas dataframe so that it is in a structured format.

In [2]:
import requests

from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Chengdu'

results = requests.get(url)

soup = BeautifulSoup(results.content, 'html5lib')
#print(soup.prettify())

In [3]:
import csv

wiki_table = soup.find_all('table', class_="wikitable")

# the with-block closes the file even if parsing fails; newline='' avoids blank rows on Windows
with open('Chengdu.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Divisioncode', 'Division', 'Postalcode'])

    # wiki_table[1] is the table of divisions; its data rows have 2 header cells and 9 data cells
    for tr in wiki_table[1].find_all('tr'):
        ths = tr.find_all('th')
        tds = tr.find_all('td')
        if len(ths) == 2 and len(tds) == 9:
            Divisioncode = ths[0].text.strip()
            Division = ths[1].text.strip()
            Postalcode = tds[3].text.strip()
            csv_writer.writerow([Divisioncode, Division, Postalcode])

Now that the data is scraped from the website and saved into a csv file, let's read it into a pandas dataframe.

In [4]:
chengdu_df=pd.read_csv('Chengdu.csv')
chengdu_df.head()

Unnamed: 0,Divisioncode,Division,Postalcode
0,510100,Chengdu,610000
1,510104,Jinjiang,610000
2,510105,Qingyang,610000
3,510106,Jinniu,610000
4,510107,Wuhou,610000


## 2. Get the latitude and longitude coordinates of a given postal code 

In [5]:
# import geocoder and geopy for geographic coordinates extraction
!conda install -c conda-forge geocoder --yes
import geocoder
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim 

In [6]:
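GEOCODER_GOOGLE_KEY = 'YOUR_GOOGLE_API_KEY' # your Google Geocoding API key (do not publish the real key)

Lat = []
Lon = []

for postalcode in chengdu_df['Postalcode']:
    lat_lng_coords = None
    # retry a few times, since the geocoder occasionally returns no result
    for attempt in range(5):
        g = geocoder.google('{}, Chengdu, China'.format(postalcode), key=GEOCODER_GOOGLE_KEY)
        lat_lng_coords = g.latlng
        if lat_lng_coords:
            break
    if not lat_lng_coords:
        raise RuntimeError('No coordinates returned for postal code {}'.format(postalcode))

    # append coordinates only once a valid result is available
    Lat.append(lat_lng_coords[0])
    Lon.append(lat_lng_coords[1])

# note: the query uses only the postal code, so divisions sharing a code get identical coordinates
chengdu_df['Latitude'] = Lat
chengdu_df['Longitude'] = Lon
chengdu_df.head()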

Unnamed: 0,Divisioncode,Division,Postalcode,Latitude,Longitude
0,510100,Chengdu,610000,30.652658,104.074725
1,510104,Jinjiang,610000,30.652658,104.074725
2,510105,Qingyang,610000,30.652658,104.074725
3,510106,Jinniu,610000,30.652658,104.074725
4,510107,Wuhou,610000,30.652658,104.074725
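
The geocoder query uses only the postal code, so divisions that share a code (610000 in the output above) come back with the same point. The following is a minimal sketch of how such collisions could be detected before querying Foursquare. It uses a few rows copied from the notebook's tables, and the helper name `find_shared_coordinates` is hypothetical, not part of the original notebook.

```python
from collections import defaultdict

# (division, latitude, longitude) rows copied from the notebook's tables;
# Wenjiang is included for contrast because it has its own postal code
rows = [
    ('Chengdu', 30.652658, 104.074725),
    ('Jinjiang', 30.652658, 104.074725),
    ('Qingyang', 30.652658, 104.074725),
    ('Wenjiang', 30.685184, 103.832723),
]

def find_shared_coordinates(rows):
    """Group division names by coordinate pair and keep pairs used more than once."""
    by_point = defaultdict(list)
    for division, lat, lng in rows:
        by_point[(lat, lng)].append(division)
    return {point: names for point, names in by_point.items() if len(names) > 1}

shared = find_shared_coordinates(rows)
for point, names in shared.items():
    print(point, names)
```

Any divisions grouped together here are queried at the same point, so they will receive identical Foursquare results.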


In [7]:
address = 'Chengdu, China'

geolocator = Nominatim(user_agent="capstoneProject")
location = geolocator.geocode(address, timeout=60, exactly_one=True)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Chengdu are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Chengdu are 30.6765553, 104.0612783.


## 3. Explore the divisions in Chengdu

In [8]:
#Access to FourSquare

CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


#### Create a map of Chengdu with divisions superimposed on top.

In [9]:
# create map of Chengdu using latitude and longitude values
chengdu_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, division in zip(chengdu_df['Latitude'], chengdu_df['Longitude'], chengdu_df['Division']):
    label = '{}'.format(division)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(chengdu_map)  
    
chengdu_map

#### Get other venue information in Chengdu from Foursquare

In [10]:
# function to repeat the venue search for every division in Chengdu
# (LIMIT is a global that is defined in the next cell, before the first call)
def getNearbyVenues(names, latitudes, longitudes, radius=5000, categoryIds=''):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # optionally restrict the search to one venue category
        if categoryIds != '':
            url = url + '&categoryId={}'.format(categoryIds)

        # make the GET request
        response = requests.get(url).json()
        results = response['response']['venues']

        # keep only venues that have at least one category
        for v in results:
            if not v.get('categories'):
                continue
            venues_list.append((
                name,
                lat,
                lng,
                v['name'],
                v['location']['lat'],
                v['location']['lng'],
                v['categories'][0]['name']))

    # build the dataframe with named columns (also works when no venue was found)
    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Division',
        'Division Latitude',
        'Division Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues

In [11]:
LIMIT = 500 # maximum number of venues requested per query (the counts reported later never exceed 50)
radius = 5000 # define radius
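
The request URL in `getNearbyVenues` is assembled by string formatting. As an alternative sketch (not the notebook's original approach), the same query string can be built with the standard library's `urllib.parse.urlencode`, which escapes each value. The function name `build_search_url`, its default values, and the placeholder credentials are assumptions for illustration only.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, version, lat, lng,
                     radius=5000, limit=50, category_id=''):
    """Return a Foursquare v2 venues/search URL for one coordinate pair."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # only add the category filter when one is requested
    if category_id:
        params['categoryId'] = category_id
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# placeholder credentials; the coordinates and category id are taken from the notebook
url = build_search_url('YOUR_ID', 'YOUR_SECRET', '20180604',
                       30.652658, 104.074725,
                       category_id='4bf58dd8d48988d16a941735')
print(url)
```

Keeping the parameters in a dictionary also makes it easy to see which filters are active for a given request.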

#### Add the bakeries in the different divisions of Chengdu to the map

In [12]:
# Use category id 4bf58dd8d48988d16a941735 to only get the bakeries
chengdu_venues_bakery = getNearbyVenues(names=chengdu_df['Division'], latitudes=chengdu_df['Latitude'], longitudes=chengdu_df['Longitude'], radius = 5000, categoryIds='4bf58dd8d48988d16a941735')
chengdu_venues_bakery.head()

Unnamed: 0,Division,Division Latitude,Division Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Chengdu,30.652658,104.074725,85ºC (85度C),30.64969,104.073272,Bakery
1,Chengdu,30.652658,104.074725,YECLIP COFFEE,30.650794,104.078914,Coffee Shop
2,Chengdu,30.652658,104.074725,邱公馆(伊势丹店),30.657023,104.076616,Bakery
3,Chengdu,30.652658,104.074725,面包新语 Bread Talk,30.657891,104.075713,Bakery
4,Chengdu,30.652658,104.074725,猫眼蛋糕,30.655321,104.076941,Bakery


In [13]:
# function to add markers for given venues to map
def addToMap(df, color, existingMap):
    for lat, lng, Division, venue, venueCat in zip(df['Venue Latitude'], df['Venue Longitude'], df['Division'], df['Venue'], df['Venue Category']):
        label = '{} ({}) - {}'.format(venue, venueCat, Division)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7,
            parse_html=False).add_to(existingMap)

In [14]:
addToMap(chengdu_venues_bakery, 'red', chengdu_map)
chengdu_map

#### Add the universities in the different divisions of Chengdu to the map

In [15]:
chengdu_venues_schools = getNearbyVenues(names=chengdu_df['Division'], latitudes=chengdu_df['Latitude'], longitudes=chengdu_df['Longitude'], radius=5000, categoryIds='4d4b7105d754a06372d81259')
chengdu_venues_schools.head()

Unnamed: 0,Division,Division Latitude,Division Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Chengdu,30.652658,104.074725,Chengdu Institute Of Sports,30.647921,104.076554,College Gym
1,Chengdu,30.652658,104.074725,四川音乐学院 Sichuan Conservatory of Music,30.640023,104.076729,University
2,Chengdu,30.652658,104.074725,电子科技大学 University of Electronic Science and Te...,30.675937,104.100308,University
3,Chengdu,30.652658,104.074725,四川大学 华西医学院,30.643613,104.063511,College Quad
4,Chengdu,30.652658,104.074725,四川大学小北门,30.635846,104.076418,University


In [16]:
addToMap(chengdu_venues_schools, 'green', chengdu_map)
chengdu_map

#### Add the office areas in the different divisions of Chengdu to the map

In [17]:
chengdu_venues_offices = getNearbyVenues(names=chengdu_df['Division'], latitudes=chengdu_df['Latitude'], longitudes=chengdu_df['Longitude'], radius=5000, categoryIds='4bf58dd8d48988d124941735')
chengdu_venues_offices.head()

Unnamed: 0,Division,Division Latitude,Division Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Chengdu,30.652658,104.074725,The Atrium (晶融汇),30.654282,104.07914,Office
1,Chengdu,30.652658,104.074725,Regus Yanlord Landmark,30.654433,104.064407,Office
2,Chengdu,30.652658,104.074725,中国水电顾问集团成都勘测设计研究院 HydroChina Chengdu Engineeri...,30.655477,104.087289,Office
3,Chengdu,30.652658,104.074725,WeWork (睿东中心),30.652513,104.080047,Building
4,Chengdu,30.652658,104.074725,Shangri-La Office Tower,30.645275,104.084603,Office


In [18]:
addToMap(chengdu_venues_offices, 'yellow', chengdu_map)
chengdu_map

## 4. Determine the optimum location for a bakery

#### Calculate the number of bakeries, universities and office areas in each division of Chengdu

In [19]:
def addColumn(startDf, columnTitle, dataDf):
    # number of venues found for each division
    counts = dataDf.groupby('Division')['Venue'].count()

    # look up the count for every division; divisions without any venue get 0
    startDf[columnTitle] = startDf['Division'].map(counts).fillna(0).astype(float)

In [20]:
chengdu_data = chengdu_df.copy()
addColumn(chengdu_data, 'Bakeries', chengdu_venues_bakery)
addColumn(chengdu_data, 'Universities', chengdu_venues_schools)
addColumn(chengdu_data, 'Office areas', chengdu_venues_offices)
chengdu_data

Unnamed: 0,Divisioncode,Division,Postalcode,Latitude,Longitude,Bakeries,Universities,Office areas
0,510100,Chengdu,610000,30.652658,104.074725,50.0,50.0,50.0
1,510104,Jinjiang,610000,30.652658,104.074725,50.0,50.0,50.0
2,510105,Qingyang,610000,30.652658,104.074725,50.0,50.0,50.0
3,510106,Jinniu,610000,30.652658,104.074725,50.0,50.0,50.0
4,510107,Wuhou,610000,30.652658,104.074725,50.0,50.0,50.0
5,510108,Chenghua,610000,30.652658,104.074725,50.0,50.0,50.0
6,510112,Longquanyi,610100,30.556413,104.274661,0.0,4.0,3.0
7,510113,Qingbaijiang,610300,30.878478,104.251192,0.0,0.0,0.0
8,510114,Xindu,610500,30.823212,104.158803,0.0,4.0,1.0
9,510115,Wenjiang,611100,30.685184,103.832723,1.0,13.0,7.0


#### Define a weight according to the effect of the venues on your choice

In [21]:
# negative weight, because my friend wants to open a bakery and thus wants to avoid competition as much as possible
weight_bakeries = -1

# positive weight, because university students are good customers
weight_universities = 1

# positive weight because employees are even better customers
weight_offices = 1.5

In [22]:
chengdu_weighted = chengdu_data[['Divisioncode', 'Division']].copy()

#### Based on the chosen weights, compute the score of each division

In [23]:
chengdu_weighted['Score'] = chengdu_data['Bakeries'] * weight_bakeries + chengdu_data['Universities'] * weight_universities + chengdu_data['Office areas'] * weight_offices
chengdu_weighted = chengdu_weighted.sort_values(by=['Score'], ascending=False)
chengdu_weighted

Unnamed: 0,Divisioncode,Division,Score
0,510100,Chengdu,75.0
1,510104,Jinjiang,75.0
2,510105,Qingyang,75.0
3,510106,Jinniu,75.0
4,510107,Wuhou,75.0
5,510108,Chenghua,75.0
15,510132,Xinjin Co.,68.0
14,510131,Pujiang Co.,68.0
13,510129,Dayi Co.,68.0
20,510185,Jianyang,68.0
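
The score is a plain weighted sum: Score = -1 × Bakeries + 1 × Universities + 1.5 × Office areas. As a small sketch, the snippet below recomputes it by hand for three rows of the counts table; the function name `division_score` is hypothetical and the counts are copied from that table.

```python
WEIGHT_BAKERIES = -1      # existing bakeries are competitors
WEIGHT_UNIVERSITIES = 1   # students are potential customers
WEIGHT_OFFICES = 1.5      # office workers weigh more than students

def division_score(bakeries, universities, offices):
    """Weighted sum of the three venue counts for one division."""
    return (bakeries * WEIGHT_BAKERIES
            + universities * WEIGHT_UNIVERSITIES
            + offices * WEIGHT_OFFICES)

# counts copied from the table: (bakeries, universities, office areas)
counts = {
    'Chengdu': (50, 50, 50),
    'Longquanyi': (0, 4, 3),
    'Wenjiang': (1, 13, 7),
}
scores = {name: division_score(*c) for name, c in counts.items()}
print(scores)
```

The Chengdu row reaches 75 even though it has the highest bakery count, because its 50 bakeries are offset by 50 universities and 50 × 1.5 office areas.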


## 5. Conclusions

In [24]:
map_chengdu_result = folium.Map(location=[latitude, longitude], zoom_start=12)

# the six divisions tied for the highest score (rows 0 to 5 of the score table)
winners = ['Chengdu', 'Jinjiang', 'Qingyang', 'Jinniu', 'Wuhou', 'Chenghua']
chengdu_win = chengdu_df[chengdu_df['Division'].isin(winners)]

for lat, lng, division in zip(chengdu_win['Latitude'], chengdu_win['Longitude'], chengdu_win['Division']):
    label = '{}'.format(division)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_chengdu_result)

# overlay bakeries (red), universities (green) and office areas (yellow) for the winning divisions
for venues, color in [(chengdu_venues_bakery, 'red'),
                      (chengdu_venues_schools, 'green'),
                      (chengdu_venues_offices, 'yellow')]:
    addToMap(venues[venues['Division'].isin(winners)], color, map_chengdu_result)

map_chengdu_result