# Final IBM Capstone Project

### Introduction 
Welcome to the final IBM Capstone Project. 
Now that you have been equipped with the skills and the tools to use location data to explore a geographical location, over the course of two weeks, you will have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to explore or compare neighborhoods or cities of your choice or to come up with a problem that you can use the Foursquare location data to solve. 

## A. Introduction 
#### Discuss the business problem and who would be interested in this project

### Opening a Chinese Restaurant in Toronto, Canada 
The provincial capital of Ontario. With a recorded population of 2,731,571 in 2016, it is the most populous city in Canada and the fourth most populous city in North America. Toronto is an international centre of business, finance, arts, and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world.

Chinese food is a great, delectable meal! Starting a restaurant with a recognized and popular cuisine can have great potential.  

We will analyze the neighborhoods in Toronto to identify the most profitable area based on population density and ethnicity diversity. Toronto is a great place to start the restaurant, but we just need to make sure whether it is a profitable idea or not.

### Target Audience
- Business personal who are looking to open a restaraunt in Toronto
- Investors looking for a potentially successful restauraunt
- Freelancers looking to start a franchise 
- Data Scientists who wish to analyze Toronto's neighborhoods 
- New visitors to Toronto who love to eat sushi often 

## B. Data
#### Describe the data that will be used to solve the problem and the source of the data.

### B.1 Data Sources

a. Toronto's Neighborhood information such as 
- Postal Codes
- Boroughs
- Neighborhood Names 
Source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

b. Toronto's Neighborhood Geographical Information
- Latitude 
- Longitude
Source: https://cocl.us/Geospatial_data

c. Population Distribution by Ethnic Diversity 
- Ethnic Origin 
Source: https://en.m.wikipedia.org/wiki/Demographics_of_Toronto#Ethnic_diversity)

d. Toronto's Venues Locations, Names, Categories, Location (in Latitude and Longitude) 
via Foursquare's explore API 
Source: https://developer.foursquare.com

#### Explanation: 
By combining all of these data sources, we can create a data summary that will allow target audiences to make the best educated decision for their restaurant location.

### B.2 Data Frame 

#### B.2.a. Toronto's Neighborhood information

Goal: Create a Data Frame with the following columns:
- Postal Code
- Borough
- Neighborhood 

*Note
- *Only the cells that have an assigned borough will be processed. Borough that is not assigned are ignored.
- *More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- *If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [3]:
#Install Dependencies 
!conda install -c conda-forge wikipedia --yes 

import pandas as pd
import numpy as np
import wikipedia as wp

Solving environment: done

# All requested packages already installed.



In [None]:
#Download Source from Wikipedia
source = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
df = pd.read_html(source, header = 0)[0]

#Data Cleaning Part 1: Unassigned Boroughs will be ignored
df = df[df.Borough != 'Not assigned']
df = df.rename(columns={'Postcode': 'Postal Code'})

#Data Cleaning Part 2: Unassigned Neighborhoods will share same name as their Assigned Boroughs
for index, row in df.iterrows():
    if row['Neighbourhood'] == 'Not assigned':
        row['Neighbourhood'] = row['Borough']
        
#Data Cleaning Part 3: Place Multiple Neighborhoods in one Borough         
df = df.groupby(['Borough', 'Postal Code'])['Neighbourhood'].apply(list).apply(
    lambda x:', '.join(x)).to_frame().reset_index()

In [None]:
#Data Sample 
df.head()

#### B.2.b. Toronto's Neighborhood Geographical Information

Goal: Add to the Data Frame (Postal Code, Borough, Neighborhood) with the following columns:
- Latitude
- Longitude

In [None]:
#Download Dependencies 
import io
import requests

#Extract data from csv file 
url = "https://cocl.us/Geospatial_data"
geo_list = requests.get(url).text
geo_list_df=pd.read_csv(io.StringIO(geo_list))

In [None]:
#Data Sample 
geo_list_df.head()

In [None]:
#Merge Dataframe (DF) [Postal Code, Borough, Neighborhood] + Dataframe (GeoList) [Latitude, Longitude]
toronto_DF = pd.merge(df,geo_list_df, on='Postal Code')

#Change Neighbo'u'rhood to Neighborhood 
toronto_DF = toronto_DF.rename(columns={'Neighbourhood':'Neighborhood'})

In [None]:
#Data Sample 
toronto_DF.head()

#### B.2.c. Population Distribution by Ethnic Diversity 

Goal: Obtain data of each neighborhood's population in term of ethnic diversity and transfer it into the Jupyter notebook.By seeing each federal electoral districts, we can see the most populous ethnic group in each riding (AKA Neighborhood). 

In [None]:
#overall population distribution 
html = wp.page("Demographics of Toronto").html().encode("UTF-8")

In [None]:
#TORONTO & EAST YORK population distribution by ethnicity 
TEY_population_df = pd.read_html(html, header = 0)[13]
TEY_population_df = TEY_population_df.rename(columns={'%':'Ethnic Origin 1 in %', 
                                                      '%.1':'Ethnic Origin 2 in %',
                                                     '%.2':'Ethnic Origin 3 in %',
                                                     '%.3':'Ethnic Origin 4 in %',
                                                     '%.4':'Ethnic Origin 5 in %',
                                                     '%.5':'Ethnic Origin 6 in %',
                                                     '%.6':'Ethnic Origin 7 in %',
                                                     '%.7':'Ethnic Origin 8 in %',
                                                     '%.8':'Ethnic Origin 9 in %'})

In [None]:
#TORONTO & EAST YORK
TEY_population_df

In [None]:
#NORTH YORK population distribution by ethnicity 
North_population_df = pd.read_html(html, header = 0)[14]
North_population_df = North_population_df.rename(columns={'%':'Ethnic Origin 1 in %', 
                                                      '%.1':'Ethnic Origin 2 in %',
                                                     '%.2':'Ethnic Origin 3 in %',
                                                     '%.3':'Ethnic Origin 4 in %',
                                                     '%.4':'Ethnic Origin 5 in %',
                                                     '%.5':'Ethnic Origin 6 in %',
                                                     '%.6':'Ethnic Origin 7 in %',
                                                     '%.7':'Ethnic Origin 8 in %'})

In [None]:
#NORTH YORK 
North_population_df

In [None]:
#SCARBOROUGH population distribution by ethnicity 
Scar_population_df = pd.read_html(html, header = 0)[15]
Scar_population_df = Scar_population_df.rename(columns={'%':'Ethnic Origin 1 in %', 
                                                      '%.1':'Ethnic Origin 2 in %',
                                                     '%.2':'Ethnic Origin 3 in %',
                                                     '%.3':'Ethnic Origin 4 in %',
                                                     '%.4':'Ethnic Origin 5 in %',
                                                     '%.5':'Ethnic Origin 6 in %',
                                                     '%.6':'Ethnic Origin 7 in %',
                                                     '%.7':'Ethnic Origin 8 in %'})

In [None]:
#SCARBOROUGH 
Scar_population_df

In [None]:
#ETOBICOKE & YORK population distribution by ethnicity 
ETY_population_df = pd.read_html(html, header = 0)[16]
ETY_population_df = ETY_population_df.rename(columns={'%':'Ethnic Origin 1 in %', 
                                                      '%.1':'Ethnic Origin 2 in %',
                                                     '%.2':'Ethnic Origin 3 in %',
                                                     '%.3':'Ethnic Origin 4 in %',
                                                     '%.4':'Ethnic Origin 5 in %',
                                                     '%.5':'Ethnic Origin 6 in %',
                                                     '%.6':'Ethnic Origin 7 in %',
                                                     '%.7':'Ethnic Origin 8 in %'})

In [None]:
#ETOBICOKE & YORK 
ETY_population_df

#### B.2.d. Toronto's Venues Locations, Names, Categories, Location (in Latitude and Longitude) 
via Foursquare's explore API 

Using FourSquare API, we can find explore neighborhoods in Toronto and what kind of venues reside in each neighborhood. 

In [None]:
#Install Dependencies 
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [None]:
#Use geopy library to get the latitude and longitude values of Toronto 
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
#Folium is a great visualization library. 
#It has the ability zoom in/out of map, and click on each circle mark to reveal the name of the 
#neighborhood and its respective borough.

!conda install -c conda-forge folium=0.5.0 --yes
import folium 

In [None]:
#We are using a 1 Km Radius 
radius=1000
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius)
results = requests.get(url).json()

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [None]:
#The API returned a JSON file. 
#Now we turn it into a pandas data frame. 

In [None]:
#Install Dependencies 
import json
from pandas.io.json import json_normalize

#Panda Data Frame 
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

In [None]:
#Data Sample 
nearby_venues.sorted.head()

In [None]:
#Continue to look for nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [None]:
#Im going to look for the Top 100 venues 
LIMIT = 100
toronto_venues = getNearbyVenues(names=toronto_DF['Neighborhood'],
                                   latitudes=toronto_DF['Latitude'],
                                   longitudes=toronto_DF['Longitude'])

In [None]:
#Data Sample 
toronto_venues.head(10)

In [None]:
#I want to see how many existing Venue Categories (e.g. Park, Swim School, Hotel, Gym, etc)
toronto_venues.groupby('Neighborhood').count()

In [None]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

In [None]:
#Calculate the mean of all venue groupby in each neighborhood
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

In [None]:
#Let's see how many categories there are
print (toronto_venues['Venue Category'].value_counts())

## C. Methodology 
#### Represents the main component of the report where you discuss
#### Describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why.

### C.1 Folium
- Folium is a great visualization library. 
- It has the ability zoom in/out of map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

Goal: Create a interactive leaflet map using our coordinate data 

In [None]:
# create map of Toronto using latitude and longitude
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_DF['Latitude'], toronto_DF['Longitude'], toronto_DF['Borough'], toronto_DF['Neighborhood']):
    label = '{},{}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='yellow',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### C.2 Relationship between Neighborhood and Chinese Restaurant 


#### C.2.a Data Frame 
Goal: Add to our dataframe (Borough, Postal Code, Neighborhood, Latitude, Longitude) with the following columns: 
- Cluster Labels 
- Chinese Restaurant (the mean of venue groupby) 

In [None]:
#Create a Dataframe with the columns : Neighborhood, Chinese Restaurant 
toronto_part = toronto_grouped[['Neighborhood', 'Chinese Restaurant']]
#Data Sample
toronto_part.head()

In [None]:
#Add Dataframe (Toronto Part)[Neighborhood, Chinese Restaurant] to 
# Dataframe(Toronto DF)[Borough, Postal Code, Neighborhood, Latitude, Longitude]
toronto_merged = pd.merge(toronto_DF, toronto_part, on='Neighborhood')
#Data Sample
toronto_merged

#### C.2.b Bar Charts 
Goal: Identify which specific neighborhoods have the highest mean of chinese restaurants 

In [None]:
#Install Dependencies 
#%matplotlib inline
#import matplotlib as mpl
#import matplotlib.pyplot as plt
#import seaborn as sns

#Create Plot with Neighborhood vs. Chinese Restaurant (Mean)
#fig = plt.figure(figsize=(19,9))

#sns.set(font_scale=1.1)
#sns.violinplot(y="Chinese Restaurant", x="Borough", data=toronto_merged, cut=0);

#plt.title('Mean of Chinese restaurants in each Borough (Toronto, Canada)', fontsize=15)
#plt.show()

In [None]:
#With boroughs visualized, now we will continue with Neighborhood

graph = pd.DataFrame(toronto_onehot.groupby('Neighborhood')['Chinese Restaurant'].sum())
graph = graph.sort_values(by ='Chinese Restaurant', ascending=False)
graph.iloc[:14].plot(kind='bar', figsize=(20,10))
plt.xlabel("Neighborhoods")
plt.ylabel("Number of Chinese Restaurant")
plt.title("Neighborhoods vs Number of Chinese Restaurant")
plt.show()

### C.2 Relationship between Neighborhood and Chinese Population
Fun Fact: The Chinese Population Group made up for 11.1% of Toronto's total population (in 2016) 

Goal: Identify which specific neighborhoods have the highest ethnic percentage 

In Section B.2.c. Population Distribution by Ethnic Diversity, we created four dataframes for each federal electoral district:
- #TORONTO & EAST YORK (TEY_population_df)
- #NORTH YORK (North_population_df)
- #SCARBOROUGH (Scar_population_df)
- #ETOBICOKE & YORK (ETY_population_df)

In [None]:
#Merge all four dataframes 
ET = ETY_population_df.append(TEY_population_df,sort=True).reset_index()
ET.drop('index',axis=1,inplace=True)
SN = North_population_df.append(Scar_population_df,sort=True).reset_index()
SN.drop('index',axis=1,inplace=True)
pop_ethnic_df = SN.append(ET,sort=True).reset_index()
pop_ethnic_df.drop('index',axis=1,inplace=True)
pop_ethnic_df = pop_ethnic_df[['Riding', 'Population','Ethnic Origin #1', 'Ethnic Origin 1 in %','Ethnic Origin #2', 'Ethnic Origin 2 in %',
                               'Ethnic Origin #3','Ethnic Origin 3 in %','Ethnic Origin #4', 'Ethnic Origin 4 in %','Ethnic Origin #5','Ethnic Origin 5 in %', 
                               'Ethnic Origin #6','Ethnic Origin 6 in %','Ethnic Origin #7', 'Ethnic Origin 7 in %','Ethnic Origin #8', 'Ethnic Origin 8 in %',
                               'Ethnic Origin #9','Ethnic Origin 9 in %',
                              ]]
#Now we have a dataframe with important columns: 
#Riding (aka Neighborhood), Population, Ethnic Origin # in Percentage
pop_ethnic_df

In [None]:
From the above dataframe we can pickout the neighborhoods with highest Indian population percentage by using the below given method.

In [None]:
#We're going to create a new dataframe(pop_chinese_df) 
#where the Ethnic Origin (from #1-#9) has at least one "Chinese" group
temp = pop_ethnic_df.loc[(pop_ethnic_df['Ethnic Origin #1'] == 'Chinese')| 
                                      (pop_ethnic_df['Ethnic Origin #2'] == 'Chinese')|
                                      (pop_ethnic_df['Ethnic Origin #3'] == 'Chinese')|
                                      (pop_ethnic_df['Ethnic Origin #4'] == 'Chinese')|
                                      (pop_ethnic_df['Ethnic Origin #5'] == 'Chinese')|
                                      (pop_ethnic_df['Ethnic Origin #6'] == 'Chinese')|
                                      (pop_ethnic_df['Ethnic Origin #7'] == 'Chinese')|
                                      (pop_ethnic_df['Ethnic Origin #8'] == 'Chinese')|
                                      (pop_ethnic_df['Ethnic Origin #9'] == 'Chinese')]
pop_chinese_df = pd.DataFrame(temp).reset_index()
pop_chinese_df.drop('index',axis=1,inplace=True)

#Data Sample 
pop_chinese_df.head()

In [None]:
#retaining only Indian ethnic percentage & the neighborhood name 
columns_list = pop_indian_df.columns.to_list()
pop_indian_DF_with_percent = pd.DataFrame()

#removing Riding & Population from the column names list
del columns_list[0]
del columns_list[0]


for i in range(0,pop_indian_df.shape[0]):
    for j in columns_list:
        print(j)
        if pop_indian_df.at[i, j] == 'East Indian':
            k = columns_list.index(j) + 1
            percent_col = columns_list[k]
            pop_indian_DF_with_percent = pop_indian_DF_with_percent.append({'Riding':pop_indian_df.at[i, 'Riding'], 'Population':pop_indian_df.at[i, 'Population']
                                                                           , 'Ethnicity': pop_indian_df.at[i, j], 'Percentage': pop_indian_df.at[i, percent_col]},ignore_index=True)

pop_indian_DF_with_percent

In [None]:
pop_indian_DF_with_percent['Indian Population'] = (pop_indian_DF_with_percent['Percentage'] * pop_indian_DF_with_percent['Population'])/100
pop_indian_DF_with_percent.drop(columns={'Percentage','Population','Ethnicity'},axis=1, inplace =True)
pop_indian_DF_with_percent.drop_duplicates(keep='first',inplace=True) 
pop_indian_DF_with_percent

In [None]:
bar_graph = pop_indian_DF_with_percent.sort_values(by='Indian Population', ascending=False)
bar_graph.plot(kind='bar',x='Riding', y='Indian Population',figsize=(12,8), color='brown')
plt.title("Indian Population in each Neighborhood")
plt.xlabel("Neighborhoods")
plt.ylabel("Population")
plt.show()


This analysis & visualization of the relationship between neighborhoods & indian population present in those neighborhoods helps us in identifying the highly populated indian neighborhoods. Once we identify those neighborhoods it helps us in deciding where to place the new Indian restaurant. Indian restaurant placed in an densely populated Indian neighborhood is more likely to get more Indian customers than a restaurant placed in a neighborhood with less or no Indian population. Thus this analysis helps in the determining the success of the new Indian restaurant.

In [None]:

3.4 Relationship between Indian poplation and Indian restaurant

First get the list of neighborhoods present in the riding using the wikipedia geography section for each riding. Altering the riding names to match the wikipedia page so we can retrieve the neighborhoods present in those ridings

In [None]:
#Altering the list to match the wikipedia page so we can retrieve the neighborhoods present in those Ridings
riding_list = pop_indian_DF_with_percent['Riding'].to_list()
riding_list[riding_list.index('Scarborough Centre')] = 'Scarborough Centre (electoral district)'
riding_list[riding_list.index('Scarborough North')] = 'Scarborough North (electoral district)'
riding_list

In [None]:
#Scraping wiki page to get the neighborhoods of ech Ridings
import wikipedia

Riding_neighborhood_df = pd.DataFrame()

for item in riding_list:
    section = wikipedia.WikipediaPage(item).section('Geography')
    start = section.index('neighbourhoods of') + 17
    stop = section.index('.',start)
    Riding_neighborhood_df = Riding_neighborhood_df.append({'Riding':item, 'Neighborhoods':section[start:stop]},ignore_index=True)
    

Riding_neighborhood_df = Riding_neighborhood_df[['Riding','Neighborhoods']]
Riding_neighborhood_df

In [None]:
#Merging the pop_indian_DF_with_percent dataframe containing population information with the Riding_neighborhood_df dataframe.

Neigh_pop = pd.merge(pop_indian_DF_with_percent, Riding_neighborhood_df, on='Riding')

Neigh_pop.drop(columns=['Riding'],inplace =True)
Neigh_pop

In [None]:
Neigh_pop['split_neighborhoods'] = Neigh_pop['Neighborhoods'].str.split(',') 
Neigh_pop.drop(columns=['Neighborhoods'],inplace=True,axis=1)
Neigh_pop = Neigh_pop.split_neighborhoods.apply(pd.Series).merge(Neigh_pop, left_index = True, right_index = True).drop(["split_neighborhoods"], axis = 1)\
                    .melt(id_vars = ['Indian Population'], value_name = "Neighborhood").drop("variable", axis = 1).dropna()

Neigh_pop.reset_index()
Neigh_pop

In [None]:
toronto_part['split_neighborhoods'] = toronto_part['Neighborhood'].str.split(',') 
toronto_part.drop(columns=['Neighborhood'],inplace=True,axis=1)
toronto_part = toronto_part.split_neighborhoods.apply(pd.Series).merge(toronto_part, left_index = True, right_index = True).drop(["split_neighborhoods"], axis = 1)\
                    .melt(id_vars = ['Indian Restaurant'], value_name = "Neighborhood").drop("variable", axis = 1).dropna()

toronto_part.reset_index()
toronto_part

In [None]:
pop_merged_restaurant_percent = pd.merge(Neigh_pop, toronto_part, on='Neighborhood')
pop_merged_restaurant_percent.head()


After performing the data cleaning & data analysis we can identify that their no big relationship established in terms of the Indian population & the popular Indian restaurants.

Thus this marks end of the data cleaning & analyses step in this project. Next we will look into the predictive modeling. In the predictive modelling we are going to use Clustering techniques since this is analysis of unlabelled data. K-Means clustering is used to perform the analysis of the data at hand.


4. Predictive Modeling
4.1 Clustering Neighborhoods of Toronto:
First step in K-means clustering is to identify best K value meaning the number of clusters in a given dataset. To do so we are going to use the elbow method on the Toronto dataset with Indian restaurant percentage (i.e. toronto_merged dataframe).

In [None]:

from sklearn.cluster import KMeans

toronto_part_clustering = toronto_part.drop('Neighborhood', 1)


error_cost = []

for i in range(3,11):
    KM = KMeans(n_clusters = i, max_iter = 100)
    try:
        KM.fit(toronto_part_clustering)
    except ValueError:
        print("error on line",i)
    
    
    
    
    #calculate squared error for the clustered points
    error_cost.append(KM.inertia_/100)

#plot the K values aganist the squared error cost
plt.plot(range(3,11), error_cost, color='r', linewidth='3')
plt.xlabel('K values')
plt.ylabel('Squared Error (Cost)')
plt.grid(color='white', linestyle='-', linewidth=2)
plt.show()

In [None]:
!conda install -c districtdatalabs yellowbrick

from yellowbrick.cluster import KElbowVisualizer

In [None]:

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,13))

visualizer.fit(toronto_part_clustering)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure


After analysing using elbow method using distortion score & Squared error for each K value, looks like K = 6 is the best value.¶
Clustering the Toronto Neighborhood Using K-Means with K = 6

In [None]:
kclusters = 6

toronto_part_clustering = toronto_part.drop('Neighborhood', 1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_part_clustering)

kmeans.labels_

In [None]:
#sorted_neighborhoods_venues.drop(['Cluster Labels'],axis=1,inplace=True)
toronto_part.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = toronto_DF
# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(toronto_part.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.dropna(subset=["Cluster Labels"], axis=0, inplace=True)
toronto_merged.reset_index(drop=True, inplace=True)
toronto_merged['Cluster Labels'].astype(int)
toronto_merged.head()

Let us see the clusters visually on the map with the help of Folium.

In [None]:
:
import matplotlib.cm as cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11, width='90%', height='70%')

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels'].astype(int)):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters

4.2 Examing the Clusters:
We have total of 6 clusters such as 0,1,2,3,4,5. Let us examine one after the other.

Cluster 0 contains all the neighborhoods which has least number of Indian restaurants. It is shown in red color in the map

In [None]:
#Cluster 0
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0]

In [None]:

# Cluster 1 contains the neighborhoods which is sparsely populated with Indian restaurants. 
# It is shown in purple color in the map.


#Cluster 1
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1]

In [None]:
#Cluster 2 has no rows meaning no data points or neighborhood was near to this centroid.
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2]


In [None]:
# Cluster 3 contains all the neighborhoods which is medium populated with Indian restaurants. 
# It is shown in blue color in the map.

#Cluster 3
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3]

In [None]:
#Cluster 4 has no rows meaning no data points or neighborhood was near to this centroid.
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4]


In [2]:

#Cluster 5 contains all the neighborhoods which is densely populated with Indian restaurants. 
#It is shown in Orange color in the map

#Cluster 5
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5]

In [None]:
5. Results and Discussion:

5.1 Results
We have reached the end of the analysis, in the result section we can document all the findinds from above clustering & visualization of the datas. In this project, as the business problem started with identifying a good neighborhood to open a new Indian restaurant, we looked into all the neighborhoods in Toronto, analysed the Indian population in each neighborhood & spread of Indian restaurants in those neighborhoods to come to conclusion about which neighborhood would be a better spot for opening a new Indian restaurant. I have used data from web resources like Wikipedia, geospatial coordinates of Toronto neighborhoods, and Foursquare API, to set up a very realistic data-analysis scenario. We have found out that —

In those 11 boroughs we identified that only Central Toronto, Downtown Tronto, East Toronto, East York, North York & Scarborough boroughs have high amount of Indian restaurants with the help of Violin plots between Number of Indian restaurants in Borough of Toronto.
In all the ridings, Scarborough-Guildwood, Scarborough-Rouge Park, Scarborough Centre, Scarborough North, Humber River-Black Creek, Don Valley East, Scarborough Southwest, Don Valley North & Scarborough-Agincourt are the densely populated with Indian crowd ridings.
With the help of clusters examing & violin plots looks like Downtown Toronto, Central Toronto, East York are already densely populated with Indian restaurants. So it is better idea to leave those boroughs out and consider only Scarborough, East Toronto & North York for the new restaurant's location.
After careful consideration it is a good idea to open a new Indian restaurant in Scarborough borough since it has high number of Indian population which gives a higher number of customers possibility and lower competition since very less Indian restaurants in the neighborhoods.

5.2 Discussion
According to this analysis, Scarborough borough will provide least competition for the new upcoming Indian restaurant as there is very little Indian restaurants spread or no Indian restaurants in neighborhoods. Also looking at the population distribution looks like it is densely populated with Indian crowd which helps the new restaurant by providing hig customer visit possibilty. So, definitely this region could potentially be a perfect place for starting a quality Indian restaurants. Some of the drawbacks of this analysis are — the clustering is completely based only on data obtained from Foursquare API. Also the Indian population distribution in each neighborhood is also based on the 2016 census which is not up-to date. Thus population distribution would have definitely changed by 2019 given 3 years gap in the data. Since population distribution of Indian crowd in each neighborhood & number of Indian restaurants are the major feature in this analysis and it is not fully up-to date data, this analysis is definitely not far from being conclusory & it has lot of areas where it can be imporved. However, it certainly provides us with some good insights, preliminary information on possibilites & a head start into this business problem by setting the step stones properly. Furthermore, this may also potentially vary depending on the type of clustering techniques that we use to examine the data.

6. Conclusion:
Finally to conclude this project, We have got a chance to on a business problem like how a real like data scientists would do. We have used many python libraries to fetch the data , to manipulate the contents & to analyze and visualize those datasets. We have made use of Foursquare API to explore the venues in enighborhoods of Toronto, then get good amount of data from Wikipedia which we scraped with help of Wikipedia python library and visualized using various plots present in seaborn & matplotlib. We also applied machine learning technique to to predict the output given the data and used Folium to visualize it on a map. Also, some of the drawbacks or areas of improvements shows us that this analysis can further be improved with help more data and different machine learning technique. Similarly we can use this project to analysis any scenario such opening a different cuisine or success of opening a new gym and etc. Hopefully, this project helps acts as initial guidance to take more complex real-life challenges using data-science.

In [None]:

toronto_part.drop('Cluster Labels',axis=1, inplace=True)

In [None]:
Sources

https://en.wikipedia.org/wiki/Toronto
    
    