| <h1> Data Science Coursera Capstone Project <h1/> |
:---:

This project is for the IBM Data Science course offered on Coursera. This Capstone project will be analyzing neighborhoods in Toronto Canada to determine where to move. Location data and Machine Learning will be used to come to the best conclusion for the situation. 

##### First step is to download and import the necessary packages

In [1]:
import pandas as pd
import numpy as np
!pip install beautifulsoup4



In [2]:
from bs4 import BeautifulSoup
import requests

In [3]:
!pip install lxml



##### Here I am getting the URL where the information is stored

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_content = requests.get(url).text

##### Here I am loading the information from the site and converting it into parsed data using the BeautifulSoup package.

In [5]:
ParseData = BeautifulSoup(html_content, "lxml")

##### Now that I retrieved the html file and parsed the data. I need to find the table located within the file and retrieve each row in that table. Since the table is the only thing I care about we only extract the table information and put it in a variable. If you look at the output of this variable and compare it to the table on the site you will see that each individual row is separated by "tr" and each column within each row is separated by "th". So basically I created 3 empty lists and ran a loop through the parsed data searching for each row. When the row was found, I located each value in each column in put them in their respective list. Each row was appended to the other and unnecessary information was removed ("\n").

In [6]:
TorontoTable = ParseData.find('table', attrs={'class': 'wikitable sortable'}) # Find the Table

In [7]:
Postalcode = []
Borough = []
Neighborhood = []

for row in TorontoTable.find_all('tr'):
    cells = row.find_all('td')
    if len(cells)==3:
        Postalcode.append(cells[0].find(text=True).replace('\n', ' ').strip())
        Borough.append(cells[1].find(text=True).replace('\n', ' ').strip())
        Neighborhood.append(cells[2].find(text=True).replace('\n', ' ').strip())

##### Now that I have the data and it is almost fully cleaned up, I need to convert it into a dataframe and change the column names.

In [8]:
TorontoNeighborhoods = pd.DataFrame(Postalcode, columns=['PostalCode'])
TorontoNeighborhoods['Borough'] = Borough
TorontoNeighborhoods['Neighborhood'] = Neighborhood
TorontoNeighborhoods.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


##### The last thing I need to do to clean it up is to filter out all data points where there is no Borough assigned to the postal code.

In [9]:
NewToronto = TorontoNeighborhoods[TorontoNeighborhoods['Borough'] != "Not assigned"]
NewToronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


##### Here I am simply getting the dimensions of the dataframe.

In [10]:
NewToronto.shape

(103, 3)

##### Because geocoder was not working for me, could not find my postal codes, I decided to read in the csv file provided that contains all the coordinates of the Boroughs.

In [11]:
lonlng = pd.read_csv('http://cocl.us/Geospatial_data')
lonlng.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


##### I then merged the two dataframes, specifying the columns to find the matches and dropping the duplicate column. 

In [12]:
lnlngToronto = pd.merge(NewToronto, lonlng, left_on = "PostalCode", right_on = "Postal Code").drop('Postal Code', axis=1)

In [13]:
lnlngToronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [14]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



##### First I got the location of Toronto so that, before we get into a specific Borough, we can look at the Postal Areas in the data.

In [15]:
address = "Toronto, Canada"
geolocator = Nominatim(user_agent= "my_application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [16]:
!conda install -c conda-forge folium=0.5.0 --yes
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



##### This map is then displayed here, showing all the postal areas in Toronto.

In [17]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, borough, neighborhood in zip(lnlngToronto['Latitude'], lnlngToronto['Longitude'], lnlngToronto['Borough'], lnlngToronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

##### After this we just want to look at Downtown Toronto. To do this, we filter the dataset to only include postal codes that are from Downtown Toronto.

In [18]:
downtown_data = lnlngToronto[lnlngToronto['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


address = "Downtown Toronto, Canada"
geolocator = Nominatim(user_agent= "my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

##### Next I got data from Foursquare using their API information.

In [19]:
import json
from pandas.io.json import json_normalize

In [20]:
CLIENT_ID = 'OPWJ3S00V5SN1DSUOC3AVHF2MO2R1Q4D32PRRU1S24ZHL1CN' # your Foursquare ID
CLIENT_SECRET = 'MJSXI1XR23PGDGPCE2IGHXDC5DZIRSETBUC4O4K1OZ3FO13N' # your Foursquare Secret
VERSION = '20180605'

In [21]:
borough_lat = downtown_data.loc[0, 'Latitude']
borough_lon = downtown_data.loc[0, 'Longitude']
borough_name = downtown_data.loc[0, 'Neighborhood']
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, borough_lat, borough_lon, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=OPWJ3S00V5SN1DSUOC3AVHF2MO2R1Q4D32PRRU1S24ZHL1CN&client_secret=MJSXI1XR23PGDGPCE2IGHXDC5DZIRSETBUC4O4K1OZ3FO13N&v=20180605&ll=43.6542599,-79.3606359&radius=500&limit=100'

In [22]:
results = requests.get(url).json()

##### After getting the results from Foursquare, I turned that data into a dataframe by filtering the results and cleaning the data.

In [23]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [24]:
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149
3,Cooper Koo Family YMCA,Distribution Center,43.653249,-79.358008
4,Body Blitz Spa East,Spa,43.654735,-79.359874


##### Here I got the venue and Neighborhood information and combined them into one dataframe. 

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [None]:
downtown_venues = getNearbyVenues(names=downtown_data['Neighborhood'],
                                  latitudes=downtown_data['Latitude'],
                                  longitudes=downtown_data['Longitude']
                                 )
downtown_venues.head()

##### Next I created dummy variables for each of the venues and then got their average frequency for each Neighborhood.

In [None]:
downtown_onehot = pd.get_dummies(downtown_venues[['Venue Category']], prefix = "", prefix_sep="")
downtown_onehot['Neighborhood'] = downtown_venues['Neighborhood']
fixed_columns = [downtown_onehot.columns[-1]] + list(downtown_onehot.columns[:-1])
downtown_onehot = downtown_onehot[fixed_columns]

downtown_onehot.head()

In [None]:
downtown_onehot.Neighborhood = downtown_onehot.Neighborhood.apply(str)

In [None]:
downtown_grouped = downtown_onehot.groupby('Neighborhood').mean().reset_index()
downtown_grouped.head()

In [None]:
num_top_venues = 5

for hood in downtown_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = downtown_grouped[downtown_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending = False).reset_index(drop=True).head(num_top_venues))
    print('\n')

##### The rest of the code is Clustering the data based on commonality of each venue.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending =False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
        
neighborhoods_venue_sorted = pd.DataFrame(columns=columns)
neighborhoods_venue_sorted['Neighborhood'] = downtown_grouped['Neighborhood']

for ind in np.arange(downtown_grouped.shape[0]):
    neighborhoods_venue_sorted.iloc[ind, 1:] = return_most_common_venues(downtown_grouped.iloc[ind, :], num_top_venues)
    
neighborhoods_venue_sorted.head()

##### Here I assigned each of the Neighborhoods to one of five clusters and displayed the results on a map.

In [None]:
from sklearn.cluster import KMeans
kclusters = 5

downtown_grouped_clustering = downtown_grouped.drop('Neighborhood', 1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(downtown_grouped_clustering)

kmeans.labels_[0:10]

In [None]:
neighborhoods_venue_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

downtown_merged = downtown_data

downtown_merged = downtown_merged.join(neighborhoods_venue_sorted.set_index('Neighborhood'), on='Neighborhood')

In [None]:
downtown_merged.head()

In [None]:
import matplotlib.cm as cm
import matplotlib.colors as colors

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x= np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(downtown_merged['Latitude'], downtown_merged['Longitude'], downtown_merged['Neighborhood'], downtown_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters

##### The following tables show each cluster and the neighborhoods that belong to each cluster.

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 0, downtown_merged.columns[[2] + list(range(5, downtown_merged.shape[1]))]]

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 1, downtown_merged.columns[[2] + list(range(5, downtown_merged.shape[1]))]]

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 2, downtown_merged.columns[[2] + list(range(5, downtown_merged.shape[1]))]]

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 3, downtown_merged.columns[[2] + list(range(5, downtown_merged.shape[1]))]]

In [None]:
downtown_merged.loc[downtown_merged['Cluster Labels'] == 4, downtown_merged.columns[[2] + list(range(5, downtown_merged.shape[1]))]]

##### Based on venue commonality, these clusters can be split into the following groups:
* Group 1 consists of Coffee Shops
* Group 2 consists of Parks
* Group 3 consists of Grocery Stores
* Group 4 consists of Airports
* Group 5 consists of Coffee Shops/Cafe's