# 1. Introduction Section: 
## Discussion of the business problem and the audience who would be interested in this project. 
## Description of the Problem and Background 
## Scenario: 
    
    I am currently a Business student from University of Alberta (UA) and desire to figure out any possible business plans for fresh graduated students. University of Alberta (UA) is such an international academic institute in Edmonton that attract a lot of students from more than 100 countries every year. While hard to find jobs and work in domestic companies has been the challenge for most international students. Thus, more and more fresh graduated students decide to start up their own business. The restaurant is such a kind of business that easy to start and develop.
    
    More and more people from other countries decide to live in Edmonton, which make the demographic composition becoming more and more comprehensive and diversified and bring more demands for different styles of food. However, the existing restaurants may not keep tracks on the changes on the demands. 
    
    My project will select the neighborhood around University of Alberta as the sample, aims to figure out where and what kinds of restaurants could be open to meet the needs for different food based on the changing trends of source of students. What is more, my project would also figure out whether existing some low-evaluated restaurants but good earning just because customers do not have other substitute restaurants.
    
## Business Problem: 
    To identify whether the existing restaurants could meet the demands for more and more kinds of food as the demographic composition in Edmonton has been more comprehensive and diversified.
    To figure whether existing some low-evaluated restaurants could be replaced.
	
## Interested Audience 
    The fresh international graduated students who would like to start up their first business.
    The new immigrates who would like to start up their first business in Canada.



# 2. Data Section: 
## Description of the data and its sources that will be used to solve the problem 
 
## The data will be used as follows:

    A table with the data about the borough, neighborhood and postal codes: This table would be scraped from Wikipedia and well organized in order to get the specific location data for Edmonton. In this project, I would use X-path package to scrape data.
    
    A table with the data about each borough’s coordinates data: This table would use the above table as the input to find out corresponding coordinates data. The table could provide the required information to map each region point. I would use geopy package to catch their coordinates data.
    
    A table with the data about the existing restaurant’s types, location, evaluation etc.: This table would contain the information existing restaurant’s types, location, evaluation to help me analyze the current situation and figure whether existing low-evaluated restaurants could be replaced by new ones. I would use Foursquare API to get the data.
    
    A table with the data about the source of students in University of Alberta: This table would help me contain the information about the changing trends of source of students in University of Alberta.
    
    I would use K-means to divide all the restaurants into different clusters and try to figure out the whether the existing types of restaurants could meet the needs for recent or even the future students. I would also try to use K-means to figure out what kinds of restaurants with low evaluation should be improved by the entering of new competitors.


In [None]:
# !pip install lxml

In [1]:
#1.Scraping data from Wiki Pedia
    
#1.1 import required modules
import requests
from lxml import etree
import pandas as pd

#1.2 set url and headers
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T"

headers = {
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
            }
    
#1.3 get html strings
response = requests.get(url,headers = headers)
content = response.content.decode("utf8")

#1.4 etree parse
html = etree.HTML(content)


In [2]:
#1.5 Xpath extract data
# /html/body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr[1]/td[3]/b
trs = html.xpath("///body/div[3]/div[3]/div[5]/div[1]/table[1]/tbody/tr")
#an empty list to input the dic-s
dic_list = []
for tr in trs:
    temp_dic = {}#temp dic for put values
    tds = tr.xpath("./td")
    for td in tds:
        Postalcode = td.xpath("./b/text()[1]")
        Borough = td.xpath("./span/a[1]/text()")
        Neighborhood = td.xpath("./span/a/text()")[1:] # because the first item is the borough name
        
        
        
        #to convert list elements of Neighborhood into str seperated by, 
        temp_str = ""
        if Neighborhood != []:
            counter = len(Neighborhood)
            for each in Neighborhood:
                counter -= 1
                temp_str = temp_str + each 
                if counter > 1:
                    temp_str = temp_str +  ", "
            Neighborhood = temp_str
         #to convert list elements of Neighborhood into str seperated by, 
        
        
        
        temp_dic = {
                "Postalcode":Postalcode,
                "Borough":Borough,
                "Neighborhood":Neighborhood, #since we may have one more items in the list 
                }
        dic_list.append(temp_dic)

dic_list

[{'Postalcode': ['T1A'], 'Borough': ['Medicine Hat'], 'Neighborhood': []},
 {'Postalcode': ['T2A'],
  'Borough': ['Calgary'],
  'Neighborhood': 'Penbrooke MeadowsMarlborough'},
 {'Postalcode': ['T3A'],
  'Borough': ['Calgary'],
  'Neighborhood': 'Dalhousie, Edgemont, HamptonsHidden Valley'},
 {'Postalcode': ['T4A'], 'Borough': ['Airdrie'], 'Neighborhood': []},
 {'Postalcode': ['T5A'],
  'Borough': ['Edmonton'],
  'Neighborhood': 'ClareviewLondonderry'},
 {'Postalcode': ['T6A'], 'Borough': ['Edmonton'], 'Neighborhood': 'Capilano'},
 {'Postalcode': ['T7A'], 'Borough': ['Drayton Valley'], 'Neighborhood': []},
 {'Postalcode': ['T8A'], 'Borough': ['Sherwood Park'], 'Neighborhood': []},
 {'Postalcode': ['T9A'], 'Borough': ['Wetaskiwin'], 'Neighborhood': []},
 {'Postalcode': ['T1B'], 'Borough': ['Medicine Hat'], 'Neighborhood': []},
 {'Postalcode': ['T2B'],
  'Borough': ['Calgary'],
  'Neighborhood': 'Forest Lawn, DoverErin Woods'},
 {'Postalcode': ['T3B'],
  'Borough': ['Calgary'],
  'Neighb

In [3]:
# creat a new Dataframe
columns= ['Postalcode','Borough','Neighborhood']
df_edm = pd.DataFrame(columns = columns,data = dic_list)
df_edm

Unnamed: 0,Postalcode,Borough,Neighborhood
0,[T1A],[Medicine Hat],[]
1,[T2A],[Calgary],Penbrooke MeadowsMarlborough
2,[T3A],[Calgary],"Dalhousie, Edgemont, HamptonsHidden Valley"
3,[T4A],[Airdrie],[]
4,[T5A],[Edmonton],ClareviewLondonderry
...,...,...,...
175,[T5Z],[Edmonton],Lake District
176,[T6Z],[],[]
177,[T7Z],[Stony Plain],[]
178,[T8Z],[],[]


In [4]:
#get the value from the [list]
df_edm["Postalcode"] = df_edm["Postalcode"].apply(lambda x:x[0]) 
#get the value from the [list] and make all the [] into ""
df_edm["Borough"] = df_edm["Borough"].apply(lambda x:x[0]if x != [] else "") 
#make all the [] into ""
df_edm["Neighborhood"] = df_edm["Neighborhood"].apply(lambda x:x if x != [] else "")


In [5]:
#clean all the empty string: ""
df_edm = df_edm[df_edm["Borough"]!= ""]
df_edm = df_edm[df_edm["Neighborhood"]!= ""]

In [6]:
df_edm = df_edm.reset_index(drop = True)
df_edm.shape

(72, 3)

In [7]:
df_edm

Unnamed: 0,Postalcode,Borough,Neighborhood
0,T2A,Calgary,Penbrooke MeadowsMarlborough
1,T3A,Calgary,"Dalhousie, Edgemont, HamptonsHidden Valley"
2,T5A,Edmonton,ClareviewLondonderry
3,T6A,Edmonton,Capilano
4,T2B,Calgary,"Forest Lawn, DoverErin Woods"
...,...,...,...
67,T5Y,Edmonton,Horse HillLake District
68,T6Y,Edmonton,Southeast Edmonton
69,T7Y,Spruce Grove,Parkland CountyCarvel
70,T2Z,Calgary,"Douglas Glen, McKenzie Lake, CopperfieldEast S..."


In [8]:
!pip install geocoder



In [12]:
import geocoder #free and faster API to get coordinates data

# define the function get_coords to get latt 
def get_coords(postal_code):

    coords = None

    while(coords == None):

        geocoder_1 = geocoder.arcgis('%s, Edmonton, Alberta'%format(postal_code))

        coords = geocoder_1.latlng

    return coords
#reference:https://www.coursera.org/learn/applied-data-science-capstone/discussions/all/threads/GUbn5pVFEemtYQ4WTQex_A

In [15]:
# get_coords("T3A")

[53.54624000000007, -113.49036999999998]

In [16]:
#set the list for catch temp_dic 
coords_list = []
Postalcodes_list = df_edm["Postalcode"].tolist()
# print(Postalcodes_list)

for each in Postalcodes_list:
#     print(each)
    temp_dic = {}
    temp_coord = get_coords(each) 
    longitude = temp_coord[1]
    latitude = temp_coord[0]
    temp_dic = {
                "Postalcode":each,
                "latitude":latitude,
                "longitude":longitude,
                }
    coords_list.append(temp_dic)
#     break

df_edm_coords = pd.DataFrame(coords_list)
print("finished!!!!!!!!!")

finished!!!!!!!!!


In [17]:
df_edm_coords

Unnamed: 0,Postalcode,latitude,longitude
0,T2A,53.54624,-113.49037
1,T3A,53.54624,-113.49037
2,T5A,53.59443,-113.40844
3,T6A,53.55094,-113.43550
4,T2B,53.54624,-113.49037
...,...,...,...
67,T5Y,53.65196,-113.35747
68,T6Y,53.36686,-113.60763
69,T7Y,53.49069,-114.14729
70,T2Z,53.54624,-113.49037


In [18]:
df_merge = pd.merge(df_edm,df_edm_coords, on = "Postalcode")
df_merge
# df_merge.to_csv("week3_df_merge.csv")


Unnamed: 0,Postalcode,Borough,Neighborhood,latitude,longitude
0,T2A,Calgary,Penbrooke MeadowsMarlborough,53.54624,-113.49037
1,T3A,Calgary,"Dalhousie, Edgemont, HamptonsHidden Valley",53.54624,-113.49037
2,T5A,Edmonton,ClareviewLondonderry,53.59443,-113.40844
3,T6A,Edmonton,Capilano,53.55094,-113.43550
4,T2B,Calgary,"Forest Lawn, DoverErin Woods",53.54624,-113.49037
...,...,...,...,...,...
67,T5Y,Edmonton,Horse HillLake District,53.65196,-113.35747
68,T6Y,Edmonton,Southeast Edmonton,53.36686,-113.60763
69,T7Y,Spruce Grove,Parkland CountyCarvel,53.49069,-114.14729
70,T2Z,Calgary,"Douglas Glen, McKenzie Lake, CopperfieldEast S...",53.54624,-113.49037


In [20]:
# show the basic information of Toronto
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_merge['Borough'].unique()),
        df_merge.shape[0]
    )
)

The dataframe has 4 boroughs and 72 neighborhoods.


In [19]:
#install required packages
!pip install geopy
!pip install folium==0.5.0



In [21]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [23]:
#get the coordinates inf of Toronto
geocoder_Edm = geocoder.arcgis('Edmonton, Alberta')

coords_Edm = geocoder_Edm.latlng
coords_Edm

[53.54624000000007, -113.49036999999998]

In [25]:
# create map of Toronto using latitude and longitude values
map_Edm = folium.Map(location=[coords_Edm[0], coords_Edm[1]], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_merge['latitude'], df_merge['longitude'], df_merge['Borough'], df_merge['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False
                        ).add_to(map_Edm) 

map_Edm