# <center>**The Battle Of Neighborhoods**</center>
# <center>**Capstone Project - IBM Data Science Professional Certificate**</center>

**Author : Sidharth Kumar Mohanty**

# **1. Introduction**

### **Objective**
The objective of this project is to find the best neighbourhood in Toronto (a city in Canada) to open a start-up or an Italian restaurant, using Foursquare location data. We'll work through a solution that favours locations with low risk and a high chance of success.
### **Target Audience**
* Business people who want to invest in or open a start-up company or restaurant.
* Bachelors who want to live in a good area that offers the facilities they want, such as a gym, playground, parlour or movie theatre.
* Freelancers who would like to run their own small company or restaurant as a side business.
* Marketing companies that want to launch a new product in the best location.
* Researchers who want to set up a camp for a survey.
* Tourists who want to eat Italian food.


### **Data Description**
For this project we need the following data:
1. ***Toronto city data that contains boroughs and neighborhoods along with their latitudes and longitudes***
* **Data Source:** https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
* **Description:** This Wikipedia page contains all the information we need to explore and cluster the neighborhoods in Toronto. We scrape the page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format.
2. ***Geographical location data using the Geocoder package***
* **Data Source:** https://cocl.us/Geospatial_data
* **Description:** The second data source provides the geographical coordinates of the neighbourhoods, keyed by their postal codes.
3. ***Venue data using the Foursquare API***
* **Data Source:** https://foursquare.com/developers/apps
* **Description:** From the Foursquare API we can get the name, category, latitude and longitude of each venue.

### **Tech Stack Used**
Machine Learning, Web Scraping, Foursquare API, Geocoder, Beautiful Soup, Folium

### **Table of Contents**
1. Introduction
1. Import Libraries
1. Scrape Neighborhoods data
1. Data Pre-processing
1. Data Analysis
1. Clustering
1. Map Visualization
1. Conclusion
1. Future Work


# **2. Import Libraries**

In [None]:
# install geopy to access geocoder package
!pip install geopy

In [None]:
# install beautifulsoup4 for web scraping
!pip install beautifulsoup4

In [None]:
# install requests to access a URL
!pip install requests

In [None]:
# install folium for visualization
!pip install folium

In [None]:
# install scikit-learn, which provides the KMeans implementation used for clustering
# (no separate "kmeans" package is needed)
!pip install -U scikit-learn

In [None]:
# import all necessary libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from bs4 import BeautifulSoup
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt 

# import k-means from clustering stage
from sklearn.cluster import KMeans

# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

# **3. Scrape Neighborhoods Data**
As no ready-made dataset is available, we will create a dataset of all neighborhoods of Toronto by **web scraping**.

In [None]:
# Get the neighborhood data using Beautiful Soup
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
result = requests.get(url)

# parse the HTML once, naming the parser explicitly to avoid a BeautifulSoup warning
soup = BeautifulSoup(result.content, 'html.parser')

In [None]:
# loop through table, grab each of the 3 columns shown
# Scrape the neighborhood data from the table in the wikipedia page of Toronto
table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        # Create three columns named "PostalCode", "Borough" and "Neighborhood"
        cell['PostalCode'] = row.p.text[:3] # keep only the first three characters of the <p> tag text (e.g. M3A)
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        # strip symbols such as "(", ")" and "/" from the neighborhood name (e.g. (Parkview Hill / Woodbine Gardens))
        table_contents.append(cell)

df=pd.DataFrame(table_contents)
# replace some long borough names with shorter ones
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df.head()

This is the dataset that we're going to use. It has 3 columns: "PostalCode", "Borough" and "Neighborhood". As the scraped data is still unstructured and dirty, we need some pre-processing to clean it.
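
To make the string handling in the scraping loop easier to follow, here is a minimal sketch that applies the same split/strip/replace chain to a single cell text. The helper name `parse_cell` and the sample string are illustrative assumptions, not part of the notebook; the sample is modeled on the "Parkview Hill / Woodbine Gardens" example mentioned in the code comment.

```python
def parse_cell(span_text):
    """Split 'Borough(Neighborhood A / Neighborhood B)' into borough and neighborhoods."""
    # everything before the first "(" is the borough
    borough = span_text.split('(')[0]
    # everything after it is the neighborhood list; drop ")" and turn " /" into ","
    neighborhood = (span_text.split('(')[1]
                    .strip(')')
                    .replace(' /', ',')
                    .replace(')', ' ')
                    .strip(' '))
    return borough, neighborhood

# hypothetical cell text, shaped like the <span> content in the Wikipedia table
sample = 'East York(Parkview Hill / Woodbine Gardens)'
borough, neighborhood = parse_cell(sample)
print(borough, '|', neighborhood)
```

Keeping the parsing in a small function like this makes it possible to check the cleaning rules on a few strings before running them over the whole table.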

In [None]:
# save this dataframe in a CSV file
df.to_csv('Neighborhood Data.csv')

# **4. Data Pre-processing**
In this step we'll do the following:
- Only process the cells that have an assigned borough. Ignore cells with a borough that is "Not assigned".
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A has two neighborhoods: Harbourfront and Regent Park. Such neighborhoods are combined into one row, separated by commas.
- If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
- Finally, use the `.shape` attribute to print the number of rows of the dataframe.
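
The rules above can be sketched on a toy table as follows. This is only an illustration under assumptions: the rows in `raw` are hypothetical (apart from the M5A example taken from the text), and the notebook's own cells below handle the real data by dropping "Not assigned" rows.

```python
import pandas as pd

# toy rows (hypothetical) mimicking the raw postal code table
raw = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. keep only rows with an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. a "Not assigned" neighborhood falls back to the borough name
mask = clean['Neighborhood'] == 'Not assigned'
clean.loc[mask, 'Neighborhood'] = clean.loc[mask, 'Borough']

# 3. combine neighborhoods that share a postal code into one comma-separated row
clean = (clean.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood']
              .agg(', '.join))

print(clean)
print(clean.shape)
```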




In [None]:
# drop rows having null value and value assigned as "Not assigned"
df_dropna = df.dropna()
empty = 'Not assigned'
df_dropna = df_dropna[(df_dropna.PostalCode != empty ) & (df_dropna.Borough != empty) & (df_dropna.Neighborhood != empty)].reset_index(drop=True)

In [None]:
# check for missing value
df_dropna.isnull().sum()

In [None]:
# Check if we still have any boroughs that are Not assigned
df_dropna.loc[df_dropna['Borough'].isin(["Not assigned"])]

In [None]:
df = df_dropna
df.head()

In [None]:
# shape of dataframe
df.shape

Now the data is cleaned and all the requirements are met, so we just have to add the latitude and longitude of each location.

Now that we have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data. We are going to create a new table with the latitudes and longitudes corresponding to the different postal codes.

In [None]:
# get the latitude and the longitude coordinates of each Postal code
geo_url = "https://cocl.us/Geospatial_data"

geo_df = pd.read_csv(geo_url)
geo_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
geo_df.head()

Now we'll merge the **geographical dataframe** with **neighborhood dataframe** according to the **Postal Code**

In [None]:
# Merging the Data
df = pd.merge(df, geo_df, on='PostalCode')
df.head()
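
Because `pd.merge` defaults to an inner join, a postal code without coordinates would silently disappear from the result. A small sketch of how this could be checked with a left join and `indicator=True` follows; the two toy dataframes, the postal code `M9Z` and the coordinate values are made up for illustration.

```python
import pandas as pd

# toy neighborhood table; 'M9Z' is a hypothetical code with no coordinates
hoods = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                      'Neighborhood': ['Parkwoods', 'Victoria Village', 'Example']})

# toy coordinate table (illustrative values, not the real CSV contents)
coords = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                       'Latitude': [43.75, 43.73],
                       'Longitude': [-79.33, -79.31]})

# a left join keeps every neighborhood; the _merge column records where each row matched
checked = pd.merge(hoods, coords, on='PostalCode', how='left', indicator=True)

# rows present only on the left side have no coordinates
missing = checked[checked['_merge'] == 'left_only']
print(missing[['PostalCode', 'Neighborhood']])
```

If `missing` is empty, the inner join used in the notebook loses nothing.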

In [None]:
# let's find out how many neighborhoods are present in each borough
df.groupby('Borough').count()['Neighborhood']

### 4.1. Now we will visualize all the boroughs present in Toronto

In [None]:
df_toronto = df
df_toronto.head()

In [None]:
# Create a list and store all unique borough names
boroughs = df_toronto['Borough'].unique().tolist()

In [None]:
# Approximate the latitude and longitude of Toronto by taking the mean latitude/longitude of all postal codes
lat_toronto = df_toronto['Latitude'].mean()
lon_toronto = df_toronto['Longitude'].mean()
print('The geographical coordinates of Toronto are {}, {}'.format(lat_toronto, lon_toronto))

In [None]:
# This will color categorize each borough
borough_color = {}
for borough in boroughs:
    borough_color[borough]= '#%02X%02X%02X' % tuple(np.random.choice(range(256), size=3)) #Random color
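
The random colors above change on every run, which makes maps hard to compare between runs. A possible variation, sketched here with a seeded NumPy generator, keeps the colors stable; the seed value and the short borough list are assumptions for illustration.

```python
import numpy as np

# illustrative subset of borough names
demo_boroughs = ['North York', 'Scarborough', 'Etobicoke']

# a seeded generator returns the same colors on every run
rng = np.random.default_rng(42)

demo_colors = {}
for name in demo_boroughs:
    # draw three channel values in 0..255 and format them as a hex color string
    r, g, b = (int(v) for v in rng.integers(0, 256, size=3))
    demo_colors[name] = '#%02X%02X%02X' % (r, g, b)

print(demo_colors)
```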

In [None]:
map_toronto = folium.Map(location=[lat_toronto, lon_toronto], zoom_start=10.5)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], 
                                           df_toronto['Longitude'],
                                           df_toronto['Borough'], 
                                           df_toronto['Neighborhood']):
    label_text = borough + ' - ' + neighborhood
    label = folium.Popup(label_text)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=borough_color[borough],
        fill_color=borough_color[borough],
        fill_opacity=0.8).add_to(map_toronto)  
    
map_toronto

### 4.2. Next we will define the Foursquare credentials

In [None]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder: never publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = 20200514 # Foursquare API version

print('Credentials Stored')
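
Credentials written directly into a notebook are published along with it. One common alternative is to read them from environment variables. The sketch below assumes two variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and a helper `load_credentials`; all three are hypothetical and not part of the Foursquare API.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping such as os.environ."""
    # the variable names below are assumptions chosen for this sketch
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret

# demonstration with a fake mapping, so the example runs without real secrets
fake_env = {'FOURSQUARE_CLIENT_ID': 'demo-id', 'FOURSQUARE_CLIENT_SECRET': 'demo-secret'}
demo_id, demo_secret = load_credentials(fake_env)
print('Credentials loaded for', demo_id)
```

In the notebook itself, calling `load_credentials()` without arguments would read the real environment.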

### 4.3. Now, let's get the top 100 venues that are in each neighborhood within a radius of 500 meters.

First, let's create the GET request URL

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 100 # limit of number of venues returned by Foursquare API
    # the search radius comes from the `radius` argument (default: 500 meters)
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
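
The function above indexes straight into the JSON response, so a neighborhood with no results or a venue without a category would raise an error. Below is a hedged sketch of a more defensive parser. It assumes the same response layout that the code above relies on (`response` → `groups[0]` → `items` → `venue`); the helper name `parse_explore_response` and the sample payload are made up for illustration.

```python
def parse_explore_response(payload, name, lat, lng):
    """Turn one explore-style JSON payload into a list of venue rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        if not categories:
            continue  # skip venues that carry no category information
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name']))
    return rows

# hypothetical payload with one complete venue and one venue without categories
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Trattoria Demo', 'location': {'lat': 43.70, 'lng': -79.40},
               'categories': [{'name': 'Italian Restaurant'}]}},
    {'venue': {'name': 'No Category Place', 'location': {'lat': 43.71, 'lng': -79.41},
               'categories': []}},
]}]}}

rows = parse_explore_response(sample_payload, 'Demo Neighborhood', 43.70, -79.40)
print(rows)
```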

In [None]:
#Get venues for all neighborhoods in our dataset
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                latitudes=df_toronto['Latitude'],
                                longitudes=df_toronto['Longitude'])

In [None]:
toronto_venues.tail()

Let's check how many venues there are per neighborhood

In [None]:
toronto_venues.groupby('Neighborhood').count()

### 4.4. How many unique venue categories are there across all neighborhoods?

In [None]:
print('There are {} unique venue categories.'.format(len(toronto_venues['Venue Category'].unique())))

In [None]:
print("The Unique Venue Categories are", toronto_venues['Venue Category'].unique())

### 4.5. Are there any Italian Restaurants present in the venues?

In [None]:
"Italian Restaurant" in toronto_venues['Venue Category'].unique()

# **5. Data Analysis**

### 5.1. Now we will analyze each neighborhood

The "Venue Category" column contains categorical values, so we need to convert it to numerical values by one-hot encoding.

In [None]:
# one hot encoding
to_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
to_onehot['Neighborhoods'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [to_onehot.columns[-1]] + list(to_onehot.columns[:-1])
to_onehot = to_onehot[fixed_columns]

print("shape of dataset after one hot encoding is : ",to_onehot.shape)
to_onehot.head()

Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
to_grouped = to_onehot.groupby(["Neighborhoods"]).mean().reset_index() 

print(to_grouped.shape)
to_grouped.head()
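
To see what the grouped mean represents, here is a small sketch on made-up venue rows: after one-hot encoding, the mean of a category column within a neighborhood is the share of that neighborhood's venues belonging to the category. The toy dataframe and the names `demo_venues`, `demo_onehot` and `demo_grouped` are illustrative.

```python
import pandas as pd

# toy venue list: neighborhood A has 4 venues, B has 2 (hypothetical data)
demo_venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Italian Restaurant', 'Cafe', 'Cafe', 'Park', 'Cafe', 'Park'],
})

# one column per category, holding 1.0 where the venue belongs to it
demo_onehot = pd.get_dummies(demo_venues[['Venue Category']],
                             prefix='', prefix_sep='', dtype=float)
demo_onehot.insert(0, 'Neighborhoods', demo_venues['Neighborhood'])

# the mean of the 0/1 columns is the frequency of each category per neighborhood
demo_grouped = demo_onehot.groupby('Neighborhoods').mean().reset_index()
print(demo_grouped)
```

Here one of the four venues in neighborhood A is an Italian restaurant, so its frequency is 0.25, while neighborhood B has none and gets 0.0.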

Here we only require the "Neighborhoods" and "Italian Restaurant" columns for the clustering, so we'll select these two columns.

In [None]:
ita = to_grouped[["Neighborhoods","Italian Restaurant"]]
ita.head()

In [None]:
# rename column "Neighborhoods" to "Neighborhood"
ita = ita.rename(columns={'Neighborhoods':'Neighborhood'})


# **6. Clustering**
We will use k-means clustering. But first we will find the best _K_ value using the **Elbow Point** method.

## 6.1. Elbow Method

In [None]:
# drop "Neighborhood" column from the dataframe
X = ita.drop(['Neighborhood'], axis=1)

In [None]:
# find 'k' value by Elbow Method
plt.figure(figsize=[10, 8])
inertia=[]
range_val=range(2,20)
for i in range_val:
    kmean=KMeans(n_clusters=i, random_state=0) # fixed seed so the curve is reproducible
    kmean.fit(X)
    inertia.append(kmean.inertia_)
plt.plot(range_val,inertia,'bx-')
plt.xlabel('Values of K') 
plt.ylabel('Inertia') 
plt.title('The Elbow Method using Inertia') 
plt.show()

Here we see that the optimum K value is 4, so we will build 4 clusters.
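
For reference, the inertia plotted in the elbow curve is the sum of squared distances from each point to its nearest cluster center. The sketch below recomputes it by hand on a toy one-feature dataset and compares it with scikit-learn's `inertia_`. The data values in `X_demo` are made up, and `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy one-feature data (hypothetical frequencies) with three visible groups
X_demo = np.array([0.0, 0.0, 0.01, 0.10, 0.10, 0.11, 0.30, 0.30]).reshape(-1, 1)

inertias = []
manual_inertias = []
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # squared distance from every point to every center, shape (n_points, k)
    sq_dists = (X_demo - model.cluster_centers_.T) ** 2
    # inertia by hand: each point contributes its distance to the nearest center
    manual_inertias.append(sq_dists.min(axis=1).sum())
    inertias.append(model.inertia_)

print([round(v, 5) for v in inertias])
```

The "elbow" is the value of K after which the inertia stops dropping sharply; adding more clusters beyond that point brings little improvement.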

In [None]:
kclusters = 4

toronto_grouped_clustering = ita.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# unique value in target column
np.unique(kmeans.labels_)
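
Note that k-means numbers its clusters arbitrarily, so label 0 is not necessarily the cluster with the fewest Italian restaurants. If ordered labels are wanted, one option (not used in this notebook) is to renumber the clusters by their center value. The sketch below uses made-up centers and labels to show the idea.

```python
import numpy as np

# hypothetical cluster centers (mean Italian restaurant frequency) and labels
demo_centers = np.array([[0.08], [0.00], [0.03]])
demo_labels = np.array([1, 1, 2, 0, 2, 1])

# old labels sorted from the smallest to the largest center value
order = np.argsort(demo_centers[:, 0])

# remap[old_label] = new_label, so new label 0 is the lowest-frequency cluster
remap = np.empty_like(order)
remap[order] = np.arange(len(order))

ordered_labels = remap[demo_labels]
print(ordered_labels.tolist())
```

With a fitted model, `kmeans.cluster_centers_` and `kmeans.labels_` would take the place of the two demo arrays.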

Now create a new dataframe that includes the cluster label for each neighborhood.

In [None]:
to_merged = ita.copy()

# add clustering labels
to_merged["Cluster Labels"] = kmeans.labels_

In [None]:
to_merged.head()

In [None]:
# join the clustered dataframe with toronto_venues to add latitude/longitude and venue details for each neighborhood
to_merged = to_merged.join(toronto_venues.set_index("Neighborhood"), on="Neighborhood")

print(to_merged.shape)
to_merged.head()

In [None]:
# sort the results by Cluster Labels
print(to_merged.shape)
to_merged.sort_values(["Cluster Labels"], inplace=True)
to_merged.tail()

Let's check how many Italian Restaurants there are

In [None]:
to_merged['Venue Category'].value_counts()['Italian Restaurant']

We see that there are a total of **46** locations with Italian Restaurants in Toronto  
We will create a new dataframe with the Neighborhood and Italian Restaurants

## 6.2. Visualize Clustering on a Folium Map

In [None]:
# create map
map_clusters = folium.Map(location=[lat_toronto, lon_toronto], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(to_merged['Neighborhood Latitude'], to_merged['Neighborhood Longitude'], to_merged['Neighborhood'], to_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill_color=rainbow[cluster-1],
        fill_opacity=0.8).add_to(map_clusters)
       
map_clusters

### ***Warning :***
*If we run the above cell, we can see the interactive Folium map, but once this notebook is uploaded to **GitHub** the map will not be displayed, as GitHub does not render interactive Folium maps.*
*So I've included an image of the map visualization in the next cell from my drive.*


<img src="https://drive.google.com/uc?export=view&id=1Q_kkL6SA_VysraN1kDWfRL2EL45NoAwy" alt="Folium Map Visualization">



## 6.3. How many Neighborhoods per Cluster?


In [None]:
ita["Cluster Labels"] = kmeans.labels_
ita.head()

In [None]:
objects = (1,2,3,4)
y_pos = np.arange(len(objects))
# count neighborhoods per cluster, ordered by cluster label (works across pandas versions)
performance = ita['Cluster Labels'].value_counts().sort_index(ascending=True)
perf = performance.tolist()
plt.bar(y_pos, perf, align='center', alpha=0.8, color=['red', 'purple','aquamarine', 'darkkhaki'])
plt.xticks(y_pos, objects)
plt.ylabel('No of Neighborhoods')
plt.xlabel('Cluster')
plt.title('How many Neighborhoods per Cluster')

plt.show()

In [None]:
# How many neighborhoods in each cluster

ita['Cluster Labels'].value_counts()

## 6.4. Analysis of each Cluster

In [None]:
# This will create a dataframe with borough of each neighborhood which we will merge with each cluster dataframe

df_new = df[['Borough', 'Neighborhood']]
df_new.head()

### Cluster 1

In [None]:
# Red 

cluster1 = to_merged.loc[to_merged['Cluster Labels'] == 0]
df_cluster1 = pd.merge(df_new, cluster1, on='Neighborhood')
df_cluster1.head()

### Cluster 2

In [None]:
# Purple 
cluster2=to_merged.loc[to_merged['Cluster Labels'] == 1]
df_cluster2 = pd.merge(df_new, cluster2, on='Neighborhood')
df_cluster2.head()

### Cluster 3

In [None]:
# Blue
cluster3 = to_merged.loc[to_merged['Cluster Labels'] == 2]
df_cluster3 = pd.merge(df_new, cluster3, on='Neighborhood')
df_cluster3.head()

### Cluster 4

In [None]:
# Turquoise
cluster4 = to_merged.loc[to_merged['Cluster Labels'] == 3]
df_cluster4 = pd.merge(df_new, cluster4, on='Neighborhood')
df_cluster4.head()

## 6.5. Number of neighborhoods per cluster *vs* Average number of Italian Restaurants in each Cluster

In [None]:
plt.figure(figsize=(15,5))

# Plot-1 ( Number of Neighborhoods per Cluster )

plt.subplot(1,2,1)
objects = (1,2,3,4)
y_pos = np.arange(len(objects))
# count neighborhoods per cluster, ordered by cluster label (works across pandas versions)
performance = ita['Cluster Labels'].value_counts().sort_index(ascending=True)
perf_1 = performance.tolist()
plt.bar(y_pos, perf_1, align='center', alpha=0.8, color=['red', 'purple','aquamarine', 'darkkhaki'])
plt.xticks(y_pos, objects)
plt.ylabel('No of Neighborhoods')
plt.xlabel('Cluster')
plt.title('Number of Neighborhoods per Cluster')

# Plot-2 ( Average number of Italian Restaurants per Cluster )

plt.subplot(1, 2, 2)
clusters_mean = [df_cluster1['Italian Restaurant'].mean(),df_cluster2['Italian Restaurant'].mean(),df_cluster3['Italian Restaurant'].mean(),
                df_cluster4['Italian Restaurant'].mean()]
y_pos = np.arange(len(objects))
perf_2 = clusters_mean
plt.bar(y_pos, perf_2, align='center', alpha=0.8, color=['red', 'purple','aquamarine', 'darkkhaki'])
plt.xticks(y_pos, objects)
plt.ylabel('Mean')
plt.xlabel('Cluster')
plt.title('Average number of Italian Restaurants per Cluster')
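
The two quantities plotted above can also be collected in a single summary table with one `groupby`. The sketch below uses a made-up stand-in for `ita` called `ita_demo`. Note one difference: it averages over neighborhoods, while the plot above averages over the merged venue rows, so the numbers would not be identical on the real data.

```python
import pandas as pd

# hypothetical stand-in for the `ita` dataframe with cluster labels attached
ita_demo = pd.DataFrame({
    'Neighborhood':       ['A', 'B', 'C', 'D', 'E', 'F'],
    'Italian Restaurant': [0.0, 0.0, 0.02, 0.03, 0.10, 0.0],
    'Cluster Labels':     [0, 0, 1, 1, 2, 0],
})

# one row per cluster: number of neighborhoods and mean Italian restaurant frequency
summary = (ita_demo.groupby('Cluster Labels')['Italian Restaurant']
                   .agg(neighborhoods='count', mean_frequency='mean'))
print(summary)
```

A table like this makes it easy to spot clusters that combine many neighborhoods with a low Italian restaurant frequency.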


# **7. Conclusion**

The neighborhoods located in the East Toronto area (Cluster 3), shown in aquamarine, have the highest average number of Italian restaurants. North York has the second-highest number of Italian restaurants. Looking at the nearby venues, the optimum place to open a new Italian restaurant is Victoria Village, North York (Cluster 1), as there are many neighborhoods in that area but only a few Italian restaurants, which means little competition. The second-best opportunity lies in areas such as Queen's Park, which is in Cluster 4. Having 70 neighborhoods in the area with no Italian restaurants gives a good opportunity for opening a new restaurant. This concludes the findings of this project, which recommends that the entrepreneur open an authentic Italian restaurant in these locations with little to no competition. Nonetheless, if the food is authentic, affordable and tasty, I am confident that it will attract a great following anywhere.

**Here we take an Italian restaurant as an example. The same process can be used to find the best place or neighborhood**
- to open a start-up company
- for bachelors to rent a place to stay
- to start a side business for middle-class people
- to set up a camp for any kind of survey
- to release a new product and check its success rate
  

# **8. Future Work**

* Apply different types of clustering algorithms to cluster the neighborhoods.
* Consider other food venues, market areas etc. as features for clustering.
* Consider more than 100 venues per neighborhood for analysis using the Foursquare API.


### **The complete code of "The Battle Of Neighborhoods" is available on my GitHub profile.**

### **Click [here](https://github.com/sidharth178/The-Battle-of-Neighborhoods-Capstone-Project) to access the code.**

### **Follow me on [GitHub](https://github.com/sidharth178). I regularly upload data science projects.**

### Happy Learning!!!