# Final Assignment for the IBM C9 Course: Data & Code Section

#### The REPORT section can be found here: https://github.com/Bobushka/Coursera_Capstone

## 1. Preamble

Let's assume that someone is looking to open a restaurant in Toronto, Ontario, Canada. 
Which exact location should I recommend to this person?

Here we will collect and wrangle the data:

- scrape the neighbourhood data from the Wikipedia page "Demographics of Toronto neighbourhoods",
- process the scraped data to make it workable,
- get the geolocation coordinates of the neighbourhoods.

_________________________________________________________________________________________

## 2. Scraping data from Wikipedia

Getting Toronto neighborhood data from the Wikipedia page: https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods

In [435]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [436]:
# Prepare the "soup"
URL = 'https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser') 

# there are 5 tables on the wiki page, so we need to use the find_all method
table_all = soup.find_all('table',{'class':'wikitable sortable'})

# we want to scrape the 5th table (named "Scarborough", but covering all Toronto boroughs) 
table = table_all[4].tbody 

# prepare the list of column names
rows = table.find_all('tr')
column_names = [v.text.replace('\n','') for v in rows[0].find_all('th')]

# Create the Dataframe with defined columns
df_0 = pd.DataFrame(columns=column_names)

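# Fill the table with data
records = []
for i in range(1, len(rows)):
    tds = rows[i].find_all('td')
    # chain the two replace calls so that both '\n' and '\xa0' are removed
    if len(tds) == 4:
        values = [tds[0].text, tds[1].text, tds[2].text, tds[3].text.replace('\n', '').replace('\xa0', '')]
    else:
        values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]
    records.append(pd.Series(values, index=column_names))
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df_0 = pd.DataFrame(records, columns=column_names)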

In [437]:
df_0.head()

Unnamed: 0,Name,FM,Census Tracts,Population,Land area (km2),Density (people/km2),% Change in Population since 2001,Average Income,Transit Commuting %,% Renters,Second most common language (after English) by name,Second most common language (after English) by percentage,Map
0,Toronto CMA Average,,All,5113149,5903.63,866,9.0,40704,10.6,11.4,,,
1,Agincourt,S,"0377.01, 0377.02, 0377.03, 0377.04, 0378.02, 0...",44577,12.45,3580,4.6,25750,11.1,5.9,Cantonese (19.3%),19.3% Cantonese,
2,Alexandra Park,OCoT,0039.00,4355,0.32,13609,0.0,19687,13.8,28.0,Cantonese (17.9%),17.9% Cantonese,
3,Allenby,OCoT,0140.00,2513,0.58,4333,-1.0,245592,5.2,3.4,Russian (1.4%),01.4% Russian,
4,Amesbury,NY,"0280.00, 0281.01, 0281.02",17318,3.51,4934,1.1,27546,16.4,19.7,Spanish (6.1%),06.1% Spanish,


In [439]:
df_0.shape

(156, 13)

##### DONE! Now we have all the Toronto neighborhood data parsed from Wikipedia
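When cleaning scraped cell text, the order and nesting of `str.replace` calls matters. The sketch below uses a hypothetical helper, `clean_cell` (not part of the original notebook), to chain the two replacements, and contrasts it with a nested call that silently leaves non-breaking spaces in place. The sample cell content is made up for illustration.

```python
def clean_cell(text):
    """Remove newlines and non-breaking spaces from a scraped table cell."""
    # Each replace returns a new string, so the calls are chained one after another.
    return text.replace('\n', '').replace('\xa0', '')


raw = '19.3%\xa0Cantonese\n'   # made-up cell content for illustration

chained = clean_cell(raw)
# Nested form: the inner ''.replace('\xa0', '') evaluates to '' first,
# so only '\n' is removed and '\xa0' survives.
nested = raw.replace('\n', ''.replace('\xa0', ''))

print(repr(chained))
print(repr(nested))
```

Printing with `repr` makes the invisible `\xa0` character visible, which is a convenient way to check scraped strings before converting them to numbers.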

________________________________________________________________________________________________________

## 3. Wrangling the Neighborhood Data

### 3.1. Let's remove the columns we don't need and rename some of the remaining ones for a better look

In [441]:
df_1 = df_0.drop(labels=["FM", "Census Tracts", "Map", "Second most common language (after English) by percentage"], axis=1)

df_1.rename(columns={
    "Name": "Neighborhood", 
    "% Change in Population since 2001": "Change in Population since 2001 (%)",
    "Average Income": "Average Income (CAD)",
    "Transit Commuting %": "Transit Commuting (%)",
    "% Renters": "Renters (%)",
    "Second most common language (after English) by name": "Second language after English"}, 
    inplace=True)
df_1.head()

# The "Transit Commuting %" column behaves strangely: it is not renamed above and cannot be converted to float below. Let's ignore it

Unnamed: 0,Neighborhood,Population,Land area (km2),Density (people/km2),Change in Population since 2001 (%),Average Income (CAD),Transit Commuting %,Renters (%),Second language after English
0,Toronto CMA Average,5113149,5903.63,866,9.0,40704,10.6,11.4,
1,Agincourt,44577,12.45,3580,4.6,25750,11.1,5.9,Cantonese (19.3%)
2,Alexandra Park,4355,0.32,13609,0.0,19687,13.8,28.0,Cantonese (17.9%)
3,Allenby,2513,0.58,4333,-1.0,245592,5.2,3.4,Russian (1.4%)
4,Amesbury,17318,3.51,4934,1.1,27546,16.4,19.7,Spanish (6.1%)


### 3.2. Do we have any empty rows? Let's check by creating a list with the indexes of the empty rows

In [267]:
Neighborhood = df_1["Neighborhood"]
list_of_empty_rows = []
for i in range(1, df_1.shape[0]):
    if Neighborhood[i] == '':
        list_of_empty_rows.append(i)
list_of_empty_rows

[65, 68, 95, 96]

##### Yes, we have 4 empty rows. Let's delete them.

In [268]:
# Deleting the empty rows
df_1.drop(labels=list_of_empty_rows, inplace=True)
df_1.drop(labels=0, inplace=True)                    # Also remove first row with average information
df_1.reset_index(drop=True, inplace=True)
df_1.shape

(151, 9)

### 3.3. What about the data types?

In [269]:
# Let's check the data types in the dataframe
df_1.dtypes

Neighborhood                           object
Population                             object
Land area (km2)                        object
Density (people/km2)                   object
Change in Population since 2001 (%)    object
Average Income (CAD)                   object
Transit Commuting %                    object
Renters (%)                            object
Second language after English          object
dtype: object

##### All the data were parsed as objects. Since object-type data can't be used in numeric processing, we need to fix it.

In [270]:
# Let's try to use the "convert_dtypes" method:
df_1 = df_1.convert_dtypes()
df_1.dtypes

Neighborhood                           string
Population                             string
Land area (km2)                        string
Density (people/km2)                   string
Change in Population since 2001 (%)    string
Average Income (CAD)                   string
Transit Commuting %                    string
Renters (%)                            string
Second language after English          string
dtype: object

##### Now every column has the string type. Let's convert the strings to float for the columns with numerical data:

In [271]:
# Step_1. Handle the commas:
# NOTE: replacing the comma with a dot turns a value such as "44,577" into 44.577,
# so Population and Average Income end up 1000 times smaller than on the wiki page (see the tables below)
df_1["Population"] = df_1["Population"].str.replace(',','.')
df_1["Density (people/km2)"] = df_1["Density (people/km2)"].str.replace(',','')  # Note the "Density" column has mixed formatting, so its commas are removed instead
df_1["Average Income (CAD)"] = df_1["Average Income (CAD)"].str.replace(',','.')

# Step_2. Convert all numeric data to the float type:
df_1["Population"] = df_1["Population"].astype(float)
df_1["Land area (km2)"] = df_1["Land area (km2)"].astype(float)
df_1["Density (people/km2)"] = df_1["Density (people/km2)"].astype(float)
df_1["Change in Population since 2001 (%)"] = df_1["Change in Population since 2001 (%)"].astype(float)
df_1["Average Income (CAD)"] = df_1["Average Income (CAD)"].astype(float)
# df_1["Transit Commuting %"] = df_1["Transit Commuting %"].astype(float)  Doesn't work. Let's ignore it
df_1["Renters (%)"] = df_1["Renters (%)"].astype(float)
df_1.dtypes

Neighborhood                            string
Population                             float64
Land area (km2)                        float64
Density (people/km2)                   float64
Change in Population since 2001 (%)    float64
Average Income (CAD)                   float64
Transit Commuting %                     string
Renters (%)                            float64
Second language after English           string
dtype: object
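The string-to-float conversion can also be wrapped in a small helper. This is a minimal sketch, assuming that a comma in the scraped numbers is a thousands separator (an assumption about the Wikipedia table) and that blank cells should become `None`; `parse_number` is a hypothetical name, not something the notebook defines.

```python
def parse_number(text):
    """Convert a scraped numeric string such as '44,577' to a float.

    Assumes a comma is a thousands separator, so it is removed
    rather than replaced by a decimal point.
    """
    cleaned = text.replace(',', '').replace('\xa0', '').strip()
    if cleaned == '':
        return None          # blank cell: no value to convert
    return float(cleaned)


print(parse_number('44,577'))   # thousands separator removed
print(parse_number('12.45'))    # plain decimal is unchanged
print(parse_number('-1.0'))     # negative values pass through float()
```

Under this reading, Population and Average Income keep their original scale (44577 rather than 44.577). The notebook's own conversion yields values a factor of 1000 smaller, which is what the tables that follow show.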

### 3.4. Let's remove the neighborhoods whose names the geolocator cannot resolve correctly:

In [272]:
df_2 = df_1.drop(labels=[34, 65, 101, 105, 123])
df_2.reset_index(drop=True, inplace=True)
df_2.shape

(146, 9)

##### DONE! Now the whole dataframe is ready to be analyzed.

___________________________________________________________________________________________________________________

## 4. Getting geographical coordinates of Toronto neighborhoods

### 4.1. Import additional libraries

In [273]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim        # convert an address into latitude and longitude values

#It will take a couple of minutes!

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



### 4.2. Get the neighborhood geocoordinates from the geolocator

In [274]:
n = df_2.shape[0]                    # number of neighborhoods
Lat = pd.Series([0.0] * n)           # zero-filled Series for latitudes
Lng = pd.Series([0.0] * n)           # zero-filled Series for longitudes
Neighborhood = df_2["Neighborhood"]  # Series of neighborhood names

geolocator = Nominatim(user_agent="Toronto_explorer")
for i in range(n):
    print(i, Neighborhood[i], " "*100, end="\r")
    address = Neighborhood[i] + ', Toronto, Ontario'
    location = geolocator.geocode(address)
    Lat[i] = location.latitude
    Lng[i] = location.longitude  

145 Yorkville                                                                                                                    
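The geocoding loop stops with an error if the geocoder returns `None` for a name it cannot resolve. The sketch below shows one way to collect such names instead of dropping rows by hand. The helper `geocode_all` and the stand-in `fake_geocode` are hypothetical and exist only for this illustration; no network call is made.

```python
def geocode_all(names, geocode, suffix=', Toronto, Ontario'):
    """Return {name: (lat, lng)} plus a list of names that were not found.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude / .longitude, or None when nothing matches.
    """
    coords, missing = {}, []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            missing.append(name)       # record the failure instead of crashing
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords, missing


# Stand-in geocoder for demonstration only; it knows a single address.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {'Yorkville, Toronto, Ontario': FakeLocation(43.671386, -79.390168)}
    return known.get(address)


coords, missing = geocode_all(['Yorkville', 'Nowhere Special'], fake_geocode)
print(coords)
print(missing)
```

In the notebook, `geocode` would be `geolocator.geocode`. geopy also provides `geopy.extra.rate_limiter.RateLimiter`, which can wrap that method to space out requests to the Nominatim service.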

##### Let's see what we obtained in the Latitude and Longitude Series

In [275]:
Lat

0      43.785353
1      43.650787
2      43.712849
3      43.706162
4      43.743944
         ...    
141    43.759824
142    43.682171
143    43.744039
144    43.758781
145    43.671386
Length: 146, dtype: float64

In [276]:
Lng

0     -79.278549
1     -79.404318
2     -79.547065
3     -79.483492
4     -79.430851
         ...    
141   -79.225291
142   -79.423113
143   -79.406657
144   -79.519434
145   -79.390168
Length: 146, dtype: float64

##### Concatenate the Lat and Lng Series as columns to the df_2 dataframe

In [277]:
df_2 = df_2.assign(Latitude=Lat.values) 
df_2 = df_2.assign(Longitude=Lng.values) 
df_2

Unnamed: 0,Neighborhood,Population,Land area (km2),Density (people/km2),Change in Population since 2001 (%),Average Income (CAD),Transit Commuting %,Renters (%),Second language after English,Latitude,Longitude
0,Agincourt,44.577,12.45,3580.0,4.6,25.750,11.1,5.9,Cantonese (19.3%),43.785353,-79.278549
1,Alexandra Park,4.355,0.32,13609.0,0.0,19.687,13.8,28.0,Cantonese (17.9%),43.650787,-79.404318
2,Allenby,2.513,0.58,4333.0,-1.0,245.592,5.2,3.4,Russian (1.4%),43.712849,-79.547065
3,Amesbury,17.318,3.51,4934.0,1.1,27.546,16.4,19.7,Spanish (6.1%),43.706162,-79.483492
4,Armour Heights,4.384,2.29,1914.0,2.0,116.651,10.8,16.1,Russian (9.4%),43.743944,-79.430851
...,...,...,...,...,...,...,...,...,...,...,...
141,Woburn,48.507,13.34,3636.0,-1.5,26.190,13.3,16.0,Gujarati (9.1%),43.759824,-79.225291
142,Wychwood,4.182,0.68,6150.0,-2.0,53.613,17.1,20.1,Portuguese (2.7%),43.682171,-79.423113
143,York Mills,17.564,7.29,2409.0,2.0,92.099,10.0,11.8,Korean (4.0%),43.744039,-79.406657
144,York University Heights,26.140,13.21,1979.0,-1.2,24.432,15.2,20.4,Italian (6.6%),43.758781,-79.519434


### 4.3. Let's save the resulting Toronto neighborhood dataframe to a csv file and read it back

In [278]:
df_2.to_csv(r'/Users/borisyushenkov/Desktop/DS_ML_NN/IBM/C9_Capstone/Toronto_Neighborhoods.csv')

In [280]:
df_Toronto_Neighborhoods = pd.read_csv(r'/Users/borisyushenkov/Desktop/DS_ML_NN/IBM/C9_Capstone/Toronto_Neighborhoods.csv', index_col=0)
df_Toronto_Neighborhoods

Unnamed: 0,Neighborhood,Population,Land area (km2),Density (people/km2),Change in Population since 2001 (%),Average Income (CAD),Transit Commuting %,Renters (%),Second language after English,Latitude,Longitude
0,Agincourt,44.577,12.45,3580.0,4.6,25.750,11.1,5.9,Cantonese (19.3%),43.785353,-79.278549
1,Alexandra Park,4.355,0.32,13609.0,0.0,19.687,13.8,28.0,Cantonese (17.9%),43.650787,-79.404318
2,Allenby,2.513,0.58,4333.0,-1.0,245.592,5.2,3.4,Russian (1.4%),43.712849,-79.547065
3,Amesbury,17.318,3.51,4934.0,1.1,27.546,16.4,19.7,Spanish (6.1%),43.706162,-79.483492
4,Armour Heights,4.384,2.29,1914.0,2.0,116.651,10.8,16.1,Russian (9.4%),43.743944,-79.430851
...,...,...,...,...,...,...,...,...,...,...,...
141,Woburn,48.507,13.34,3636.0,-1.5,26.190,13.3,16.0,Gujarati (9.1%),43.759824,-79.225291
142,Wychwood,4.182,0.68,6150.0,-2.0,53.613,17.1,20.1,Portuguese (2.7%),43.682171,-79.423113
143,York Mills,17.564,7.29,2409.0,2.0,92.099,10.0,11.8,Korean (4.0%),43.744039,-79.406657
144,York University Heights,26.140,13.21,1979.0,-1.2,24.432,15.2,20.4,Italian (6.6%),43.758781,-79.519434


### 4.4. It's also useful to select only the columns we need for further analysis. 

##### We can also combine two columns (Density and Income) into one column (PP). We can do this because we are interested in the neighborhoods whose inhabitants have the maximum purchasing power (PP), where PP = Density x Income. In other words, this is a way to reduce the dataframe from 4 to 3 dimensions.
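As a quick arithmetic check of the PP definition, the sketch below recomputes two values by hand. The helper name `purchasing_power` is hypothetical. The inputs are the density and income figures as stored in the dataframe shown above, so the result is in the same scaled units as the PP column.

```python
def purchasing_power(density, income):
    """PP = Density x Income, rounded to the nearest whole number."""
    return round(density * income)


# Density (people/km2) and Average Income as stored in the dataframe above.
print(purchasing_power(3580.0, 25.750))    # Agincourt
print(purchasing_power(13609.0, 19.687))   # Alexandra Park
```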

In [449]:
# We can't use the 'drop' method because of the strange "Transit Commuting %" column. Let's use the 'assign' method to select the columns we need.
df_pp = pd.DataFrame()
df_pp = df_pp.assign(
    Neighborhood=df_Toronto_Neighborhoods['Neighborhood'],
    PP=(df_Toronto_Neighborhoods['Density (people/km2)'].values * df_Toronto_Neighborhoods['Average Income (CAD)'].values).round(),
    Latitude=df_Toronto_Neighborhoods['Latitude'].values, 
    Longitude=df_Toronto_Neighborhoods['Longitude'].values)
df_pp

Unnamed: 0,Neighborhood,PP,Latitude,Longitude
0,Agincourt,92185.0,43.785353,-79.278549
1,Alexandra Park,267920.0,43.650787,-79.404318
2,Allenby,1064150.0,43.712849,-79.547065
3,Amesbury,135912.0,43.706162,-79.483492
4,Armour Heights,223270.0,43.743944,-79.430851
...,...,...,...,...
141,Woburn,95227.0,43.759824,-79.225291
142,Wychwood,329720.0,43.682171,-79.423113
143,York Mills,221866.0,43.744039,-79.406657
144,York University Heights,48351.0,43.758781,-79.519434


##### DONE! Now we have all the Toronto neighborhood data ready for modeling

_____________________________________________________________________________________

## 5. Scraping data from Foursquare

Getting Foursquare data for competing venues in Toronto neighborhoods

### 5.1. Define Foursquare Credentials 

In [281]:
CLIENT_ID = 'YOUR_CLIENT_ID'          # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'  # replace with your own Foursquare client secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'    # replace with your own Foursquare access token
VERSION = '20180604'
LIMIT = 50
radius = 1000
search_query = ['food']

### 5.2. Prepare a function to get reference information

In [283]:
# function extracts the category of the venue from the 'categories' column
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
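To see what the category extraction does, here is a minimal sketch on plain dictionaries. The helper `first_category_name` and the two sample records are hypothetical; they only mirror the fields the function above reads (`categories`, a list of dicts with a `name` key).

```python
def first_category_name(record):
    """Return the name of the first category of a venue record, or None."""
    # Accept either key, mirroring the two column names handled above.
    categories = record.get('categories', record.get('venue.categories', []))
    return categories[0]['name'] if categories else None


# Hypothetical venue records shaped like the fields the notebook reads.
venues = [
    {'name': 'Midtown Food Court', 'categories': [{'name': 'Food Court'}]},
    {'name': 'Unlabelled Venue', 'categories': []},
]
print([first_category_name(v) for v in venues])
```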

### 5.3 Preparing the main loop

In [284]:
# list of columns we need in dataframes of venues
filtered_columns = ['id', 'name', 'categories', 'location.lat', 'location.lng']

# empty dataframe to collect results
df_Toronto_venues = pd.DataFrame(columns=filtered_columns)

names=df_pp['Neighborhood']
latitudes=df_pp['Latitude']
longitudes=df_pp['Longitude']

### 5.4. Run the loop to get the locations of competing venues across all Toronto neighborhoods

In [285]:
for name, lat, lng in zip(names, latitudes, longitudes):
    print(name)
    # Define the corresponding URL
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, lat, lng, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
    # Send the GET request
    results = requests.get(url).json()
    venues = results['response']['venues']
    if len(venues)==0:
        # If the result has no information, use an empty local results dataframe
        df_local_venues = pd.DataFrame(columns=filtered_columns)
    else: 
        # Otherwise, get the relevant part of the results
        dataframe = pd.json_normalize(venues)
        # Define information of interest and filter dataframe (copy to avoid modifying a slice)
        df_local_venues = dataframe.loc[:, filtered_columns].copy()
        # Extract the category of the venue from the 'categories' column
        df_local_venues['categories'] = df_local_venues.apply(get_category_type, axis=1)
    print(df_local_venues.shape[0])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df_Toronto_venues = pd.concat([df_Toronto_venues, df_local_venues])
df_Toronto_venues.shape

Agincourt
20
Alexandra Park
50
Allenby
1
Amesbury
5
Armour Heights
3
Banbury
0
Bathurst Manor
27
Bay Street Corridor
45
Bayview Village
3
Bayview Woods – Steeles
0
Bedford Park
10
Bendale
4
Birch Cliff
1
Bloor West Village
9
Bracondale Hill
15
Branson
0
Bridle Path
0
Brockton
19
Cabbagetown
20
Caribou Park
5
Carleton Village
12
Casa Loma
18
Chaplin Estates
22
Christie Pits
13
Church and Wellesley
50
Clairlea
2
Cliffcrest
1
Cliffside
1
Corktown
13
Cricket Club
0
Davenport
11
Davisville
20
Deer Park
9
Discovery District
50
Don Mills
8
Don Valley Village
3
Dorset Park
6
Dovercourt Park
17
Downsview
2
Dufferin Grove
18
Earlscourt
8
East Danforth
14
Eglinton East
1
Elia (Jane and Finch)
7
Fashion District
50
Financial District
50
Flemingdon Park
1
Forest Hill
12
Fort York/Liberty Village
20
Garden District
50
Glen Park
7
Grange Park
50
Graydon Hall
3
Guildwood
3
Harbord Village
36
Harbourfront / CityPlace
50
Harwood
6
Henry Farm
3
High Park North
17
Highland Creek
1
Hillcrest
15
Hoggs Hollo

(1740, 5)
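The request URL in the loop above is assembled with `str.format`, which does not escape special characters and turns a list such as `['food']` into the literal text `['food']`. The sketch below shows an alternative using the standard library's `urlencode`. The endpoint and parameter names are the ones used in the loop; the helper `build_search_url` and the placeholder credentials are assumptions for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://api.foursquare.com/v2/venues/search'


def build_search_url(lat, lng, query, radius, limit, version, credentials):
    """Assemble the search URL with properly encoded query parameters."""
    params = dict(credentials)            # client_id, client_secret, oauth_token
    params.update({
        'll': '{},{}'.format(lat, lng),   # latitude,longitude pair
        'v': version,
        'query': query,                   # a plain string, not a list
        'radius': radius,
        'limit': limit,
    })
    return BASE_URL + '?' + urlencode(params)


# Placeholder credentials, for illustration only.
credentials = {'client_id': 'MY_ID', 'client_secret': 'MY_SECRET',
               'oauth_token': 'MY_TOKEN'}
url = build_search_url(43.785353, -79.278549, 'food', 1000, 50,
                       '20180604', credentials)
print(url)

# Round-trip check: decode the query string back into a dict.
decoded = parse_qs(urlsplit(url).query)
print(decoded['ll'], decoded['query'])
```

With `requests`, passing the same dictionary as `requests.get(BASE_URL, params=params)` performs the encoding automatically.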

### 5.5. Let's check how many unique venues we obtained

In [286]:
df_Toronto_venues.groupby('id').count()

Unnamed: 0_level_0,name,categories,location.lat,location.lng
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
4ad4c05ff964a5200cf720e3,5,5,5,5
4ad4c05ff964a52014f720e3,3,3,3,3
4ad4c062f964a52000f820e3,1,1,1,1
4ad4c062f964a52011f820e3,4,4,4,4
4ad4c063f964a5203ff820e3,4,4,4,4
...,...,...,...,...
5fa59dbcb9755669c261a8f1,1,1,1,1
5faa56a1f561ff59324b4001,3,3,3,3
5febba6d14d53c1e12bf5372,1,1,1,1
6015d71b38deb728dff4bde2,1,1,1,1


##### We have 684 unique venues. Let's keep only those

In [288]:
df_Toronto_venues.drop_duplicates(subset=['id'], keep='first', inplace=True, ignore_index=True)
df_Toronto_venues.shape

(684, 5)

##### Let's look at what we finally got

In [289]:
df_Toronto_venues

Unnamed: 0,id,name,categories,location.lat,location.lng
0,50fb1994e4b0d5fd052ed832,King's Vegetarian Food 觀自在,Grocery Store,43.786749,-79.270004
1,4d862b8ef1e56ea85e38988a,Midtown Food Court,Food Court,43.785134,-79.278878
2,4dbc40d7815439392f9cf0ce,大泉港式快餐 Great Fountain Fast Food,Food Court,43.786835,-79.277400
3,4bdc7e67fed22d7ffd6c58c9,Dynasty Centre Food Court,Food Court,43.786869,-79.277407
4,4c6dd5dd06ed6dcbd338a522,Rainbow Food,Chinese Restaurant,43.784946,-79.277958
...,...,...,...,...,...
679,4c5704a330d82d7f30dbd862,M&M Food Market,Grocery Store,43.757220,-79.234778
680,4f380d00e4b039c3c31e1dd4,Harry's West Indian Fine Foods,Grocery Store,43.759580,-79.223818
681,5856dbaf52a0510ab8aaf845,Jian Hing Foodmart,Supermarket,43.760999,-79.226707
682,4bd2111477b29c7455e88d82,Kitchen Food Fair,Convenience Store,43.751298,-79.401393


### 5.6. Let's save the resulting Toronto venues dataframe to a csv file

In [292]:
df_Toronto_venues.to_csv(r'/Users/borisyushenkov/Desktop/DS_ML_NN/IBM/C9_Capstone/Toronto_venues_food_1km.csv')

In [293]:
df_venues = pd.read_csv(r'/Users/borisyushenkov/Desktop/DS_ML_NN/IBM/C9_Capstone/Toronto_venues_food_1km.csv', index_col=0)
df_venues.shape

(684, 5)

##### DONE! Now we have the location and category data for all competing venues in Toronto

_____________________________________________________________________________________________

## 6. Plotting maps to visualise the collected data

Now we will plot:
- a map of Toronto with the neighborhood locations; 
- a map of Toronto with the purchasing power (PP) of each neighborhood;
- a map of Toronto with the locations of competing venues.

In [294]:
import folium

### 6.1. Convert the Toronto central point to its latitude and longitude coordinates

In [446]:
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
lat_Toronto = location.latitude + 0.06  # adding 0.06 to the latitude centers the map better
lng_Toronto = location.longitude

### 6.2. Create a map of Toronto with neighborhoods superimposed on top

In [457]:
map_1 = folium.Map(location=[lat_Toronto, lng_Toronto], zoom_start=11)   

for lat, lng, neighborhood in zip(
    df_pp['Latitude'], 
    df_pp['Longitude'], 
    df_pp['Neighborhood']):
    
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_1)

map_1.save("Toronto_with_Neighborhoods.html")
map_1

# NOTE! If you do not see the map here, please find it in the repo folder https://github.com/Bobushka/Coursera_Capstone under the corresponding name

### 6.3 Create a map of Toronto with purchasing power (PP) of each Neighborhood

In [501]:
map_2 = folium.Map(location=[lat_Toronto, lng_Toronto], zoom_start=12, tiles='CartoDB positron')

for lat, lng, ppower, neighborhood in zip(
    df_pp['Latitude'], 
    df_pp['Longitude'], 
    df_pp['PP'],
    df_pp['Neighborhood']):
    
    label = '{}, {}'.format(neighborhood, ppower)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=0.00005*ppower,  # radius of the circles corresponds to ppower value
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_2)

map_2.save("Toronto_with_ppower.html")
map_2
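The marker radius on this map is a linear function of PP, so the largest values produce very large circles. The sketch below works through the scaling and adds an optional clamp. The factor 0.00005 comes from the map code above; the helper `marker_radius` and its clamp limits are illustrative assumptions, not values from the notebook.

```python
def marker_radius(ppower, scale=0.00005, min_radius=2.0, max_radius=40.0):
    """Scale a PP value to a marker radius in pixels, clamped to a range.

    `scale` is the factor used in the map code; the clamp limits are
    illustrative assumptions, not values from the notebook.
    """
    return max(min_radius, min(max_radius, scale * ppower))


# PP values copied from the tables in this notebook.
for name, pp in [('Agincourt', 92185.0), ('Allenby', 1064150.0),
                 ('Bay Street Corridor', 1766744.0)]:
    print(name, round(marker_radius(pp), 1))
```

Without the clamp, Bay Street Corridor would get a radius of about 88 pixels, which hides the neighboring markers at zoom level 12.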

### 6.4. Create a map of Toronto with the locations of competing venues

In [499]:
map_4 = folium.Map(location=[lat_Toronto, lng_Toronto], zoom_start=12, tiles='CartoDB positron')  

for i, (lat, lng, name, category) in enumerate(zip(
    df_venues['location.lat'], 
    df_venues['location.lng'], 
    df_venues['name'], 
    df_venues['categories']), start=1):
    
    # With a list of hundreds of venues, let's print some progress information
    print(i, name, " "*100, end="\r")
    
    label = '{}, {}'.format(name, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        popup=label,
        color='red',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.5,
        parse_html=False).add_to(map_4)

# Save the map once, after the loop: saving inside the loop rewrites the file for every venue and makes the loop take minutes
map_4.save("Toronto_with_competition.html")
map_4

684 Delight Food                                                                                                                                                                                      

__________________________

## 7. Analysing the data with the k-means approach

Now we will apply k-means to the dataset of neighborhoods. Conceptually it is a 4-dimensional dataset (latitude, longitude, density, income) with 146 rows; in section 4.4 density and income were combined into PP, so the model works with 3 dimensions. Our expectation is that we will find clusters of neighborhoods with high density and high income simultaneously. After selecting such clusters and defining their geographical borders, we can calculate the number of venues lying inside these clusters. The cluster with the minimal number of competing venues is the best place for a new restaurant. That's the idea.
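The step "calculate the number of venues lying inside" an area can be sketched with a great-circle distance. This is a minimal illustration, assuming an area is approximated by a circle around a neighborhood's coordinates; the helpers `haversine_km` and `count_venues_within` are hypothetical and are not part of the notebook.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def count_venues_within(center, venues, radius_km=1.0):
    """Count venues whose (lat, lng) lies within radius_km of center."""
    return sum(1 for lat, lng in venues
               if haversine_km(center[0], center[1], lat, lng) <= radius_km)


# Coordinates taken from the tables in this notebook.
agincourt = (43.785353, -79.278549)
venues = [
    (43.786749, -79.270004),   # King's Vegetarian Food
    (43.785134, -79.278878),   # Midtown Food Court
    (43.757220, -79.234778),   # M&M Food Market (farther away)
]
print(count_venues_within(agincourt, venues))
```

Running the same count for each neighborhood in a candidate cluster would give a number that can be compared directly, instead of judging the competition from the map by eye.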

### 7.1. Load additional libraries

In [219]:
import random # library for random number generation
import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

- prepare the data for k-means (take df_pp, drop the index and name columns, normalize the remaining ones)
- choose the model parameters (number of clusters)
- train the model and get the array of labels
- add the label column to the df_pp table
- examine how the clusters differ (which set of parameters each label corresponds to) and select the clusters we need
- draw the map taking the labels into account
- select the cluster that contains the fewest competing restaurants: it is the best one for placing a new restaurant
- write up the Report

### 7.2. Prepare cluster_dataset and fit the k-means model

##### Delete "Neighborhood" column from df_pp dataframe

In [460]:
df_pp_ = df_pp.drop('Neighborhood', axis=1)

##### Normalize the dataset. Normalization is a statistical method that helps mathematics-based algorithms treat features with different magnitudes and distributions equally.

In [462]:
X = df_pp_.values[:,:]
cluster_dataset = StandardScaler().fit_transform(X)
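What the scaling step does to a single column can be written out by hand: subtract the mean and divide by the population standard deviation. The sketch below is a pure-Python illustration that assumes a non-constant column; `standardize` is a hypothetical helper, not the scikit-learn implementation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)      # population std (ddof=0), as StandardScaler uses
    return [(x - mu) / sigma for x in values]


# Three PP values from the tables in this notebook, on very different scales.
pp = [92185.0, 267920.0, 1064150.0]
z = standardize(pp)
print([round(v, 3) for v in z])
```

After scaling, each column has zero mean and unit variance, so PP (in the hundreds of thousands) no longer dominates latitude and longitude (which vary by fractions of a degree) in the distance computation.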

##### Run our model to group neighborhoods into different clusters

In [463]:
# Let's apply the K-means method with default parameters apart from the number of clusters.
k_means = KMeans(n_clusters=5).fit(cluster_dataset)  # Number of clusters equal to 5
labels = k_means.labels_
print(labels)

[1 0 2 3 3 3 0 2 3 3 4 1 1 0 4 3 3 0 4 4 0 4 4 0 2 1 1 1 4 1 0 4 2 4 1 1 1
 0 3 0 0 4 1 3 0 0 1 4 0 4 3 4 1 1 4 4 0 1 4 1 0 3 3 3 4 1 0 0 0 1 3 3 3 4
 4 4 0 0 4 1 3 1 1 4 1 0 3 4 2 1 0 0 1 1 3 4 1 0 1 4 0 4 0 4 1 1 0 1 1 1 4
 0 0 4 2 1 4 0 0 4 4 4 3 0 0 0 3 0 0 4 1 0 0 1 1 3 0 3 1 3 3 1 4 3 3 2]


In [464]:
# Add Labels to dataframe
df_pp.insert(0, 'Clusters', labels)
df_pp

Unnamed: 0,Clusters,Neighborhood,PP,Latitude,Longitude
0,1,Agincourt,92185.0,43.785353,-79.278549
1,0,Alexandra Park,267920.0,43.650787,-79.404318
2,2,Allenby,1064150.0,43.712849,-79.547065
3,3,Amesbury,135912.0,43.706162,-79.483492
4,3,Armour Heights,223270.0,43.743944,-79.430851
...,...,...,...,...,...
141,1,Woburn,95227.0,43.759824,-79.225291
142,4,Wychwood,329720.0,43.682171,-79.423113
143,3,York Mills,221866.0,43.744039,-79.406657
144,3,York University Heights,48351.0,43.758781,-79.519434


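### 7.3. Now let's understand what we got

##### Let's compare the mean values of "purchasing power" for each cluster

In [486]:
# numeric_only=True skips the text column "Neighborhood" (required in pandas 2.0 and later)
df_pp.groupby('Clusters').mean(numeric_only=True).round().sort_values(by=["PP"], ascending=False)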

Unnamed: 0_level_0,PP,Latitude,Longitude
Clusters,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,1203623.0,44.0,-79.0
4,443798.0,44.0,-79.0
0,209020.0,44.0,-79.0
3,142459.0,44.0,-79.0
1,128323.0,44.0,-79.0


##### Cluster no. 2 is definitely very interesting for us: its mean PP is almost 3 times bigger than that of the next one. How many neighborhoods belong to that cluster?

In [491]:
df_prime_cluster = df_pp.loc[df_pp['Clusters'] == 2]
df_prime_cluster.shape

(7, 5)

In [492]:
df_prime_cluster

Unnamed: 0,Clusters,Neighborhood,PP,Latitude,Longitude
2,2,Allenby,1064150.0,43.712849,-79.547065
7,2,Bay Street Corridor,1766744.0,43.668865,-79.389126
24,2,Church and Wellesley,917152.0,43.665524,-79.383801
32,2,Deer Park,838272.0,43.68809,-79.394094
88,2,North York City Centre,1278415.0,43.739396,-79.513131
114,2,St. James Town,1424574.0,43.669403,-79.372704
145,2,Yorkville,1136055.0,43.671386,-79.390168


##### There are 7 neighborhoods in cluster 2. 
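The cluster of interest is selected above by its label number. A label-independent way to pick it is to compute the mean PP per cluster and take the largest. The sketch below does this in plain Python on a handful of (cluster, PP) pairs copied from the tables above; the helper `cluster_with_highest_mean` is hypothetical.

```python
from collections import defaultdict


def cluster_with_highest_mean(labels, values):
    """Return (label, mean) for the cluster whose mean value is largest."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    means = {label: sum(v) / len(v) for label, v in groups.items()}
    best = max(means, key=means.get)
    return best, means[best]


# A few (cluster, PP) pairs copied from the tables above:
# Agincourt, Alexandra Park, Allenby, Amesbury, Bay Street Corridor, Wychwood.
labels = [1, 0, 2, 3, 2, 4]
pp = [92185.0, 267920.0, 1064150.0, 135912.0, 1766744.0, 329720.0]

best_label, best_mean = cluster_with_highest_mean(labels, pp)
print(best_label, best_mean)
```

`KMeans` starts from random centers unless its `random_state` parameter is set, so the label numbers can change between runs. Selecting the cluster by its mean PP avoids relying on the number 2.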

### 7.4. Let's show cluster number two on the map of competing venues (map_4)

In [500]:
map_5 = map_4  # note: map_5 is the same object as map_4, so the markers below are added to map_4 as well

#loop for cluster 2
for lat, lng, ppower, cluster, name in zip(
    df_prime_cluster['Latitude'], 
    df_prime_cluster['Longitude'], 
    df_prime_cluster['PP'], 
    df_prime_cluster['Clusters'], 
    df_prime_cluster['Neighborhood']):
    
    label = '{}, {}'.format(name, ppower)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=0.00005*ppower,  # radius of the circles corresponds to "purchasing power" value in defined Neighborhood
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_5)

map_5.save("Toronto_final_decision.html")
map_5

#### Now the answer is obvious. If somebody intends to open a new food venue in Toronto, my suggestion to that person is to focus on the Allenby and North York City Centre neighbourhoods. Basis for the decision:
- very high purchasing power of the neighborhood's population
- comparatively low competition in the vicinity

## Thanks for your attention!