# Capstone Project
### IBM Data Science Professional Certificate
This notebook contains the Capstone Project for the IBM Data Science Professional Certificate on Coursera.
<hr>

## Step 1:  Scraping and Cleaning the Data
I used the Scrapy framework to scrape the data from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Here is the code I've written for it:

To run the spider, execute the following command in the console, specifying the output file name and extension:

Import the data from the .csv file into a dataframe:

In [None]:
import pandas as pd

df = pd.read_csv('toronto_neighborhoods.csv')
# Reorder the columns to match the example
df = df[['post_code', 'borough', 'neighborhood']]
# Rename the columns to match the example code
df.rename(columns={'post_code': 'PostalCode', 'borough': 'Borough', 'neighborhood': 'Neighborhood'}, inplace=True)
# Remove first row
df = df[1:]

Performing the required data pre-processing tasks:

In [None]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
df = df[df.Borough != 'Not assigned']
df.head()

In [None]:
df.shape

## Step 2: Find Coordinates and Merge Dataframes

To use the Foursquare location data, I need the latitude and longitude of each neighborhood. These come from the provided Geospatial_Coordinates.csv file, which is keyed by postal code.

In [None]:
# Load data from provided .csv file into dataframe
df_coord = pd.read_csv('Geospatial_Coordinates.csv')

In [None]:
# Rename column to match with previous label
df_coord.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

In [None]:
# Merge the two dataframes
df = df.merge(df_coord, on='PostalCode')
df.head()
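
`merge` performs an inner join by default, so any postal code that is missing from the coordinates file is dropped without a warning. The sketch below is one way to check for that before merging. The helper name `unmatched_codes` and the sample values are made up for illustration.

```python
def unmatched_codes(left_codes, right_codes):
    """Return the codes in left_codes that have no match in right_codes, sorted."""
    right = set(right_codes)
    return sorted({code for code in left_codes if code not in right})


# Made-up sample values, used only to show the behaviour
neighborhood_codes = ['M3A', 'M4A', 'M5A', 'M9Z']
coordinate_codes = ['M5A', 'M3A', 'M4A']
print(unmatched_codes(neighborhood_codes, coordinate_codes))
```

In this notebook the call would be `unmatched_codes(df['PostalCode'], df_coord['PostalCode'])`, run before the merge cell. An empty list means no rows are lost.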

## Step 3: Clustering Neighborhoods in Toronto


Import dependencies:

In [None]:
import numpy as np
import json
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas.io.json path is deprecated)

# Install and load modules that are not available
#!pip install folium requests matplotlib scikit-learn
import folium # map rendering library
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

print('Libraries imported.')

Create a map of Toronto with neighborhoods superimposed on top.

In [None]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[43.6532, -79.3832], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='yellow',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_tor)
    
map_tor

Explore the neighborhoods and segment them using the Foursquare API. I start by defining a function that retrieves the venues near each neighborhood.

In [None]:
# Foursquare credentials (redacted; never commit real keys to a shared notebook)
CLIENT_ID = 'your-foursquare-client-id'
CLIENT_SECRET = 'your-foursquare-client-secret'
VERSION = '20180604'
LIMIT = '500'
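
These constants are sent as query parameters of the Foursquare explore request. As a minimal sketch, the URL can also be assembled with `urllib.parse.urlencode`, which escapes special characters in the values. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration. The endpoint is the one used in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore request URL with URL-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return '{}?{}'.format(EXPLORE_ENDPOINT, urlencode(params))


# Placeholder credentials; substitute real values before sending a request
url = build_explore_url('my-id', 'my-secret', '20180604', 43.6532, -79.3832, 500, 500)

# Decode the query string again to confirm the parameters survived encoding
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['radius'], query['limit'])
```

Building the query from a dictionary keeps each parameter next to its name, which is easier to review than a long positional `format` call.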

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
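
`getNearbyVenues` indexes straight into the response, so it raises a `KeyError` or `IndexError` if a response has no `groups` entry or a venue has an empty `categories` list. Whether the API actually returns such responses for these coordinates is an assumption. The sketch below shows one defensive way to flatten a single response. The helper name `venue_rows` and the sample payload are hand-written for illustration and mimic only the keys the function above reads.

```python
def venue_rows(name, lat, lng, payload):
    """Flatten one explore response into a list of venue tuples.

    Each tuple is (neighborhood, lat, lng, venue, venue lat, venue lng, category).
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of failing
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hand-written payload, not real API output
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Spot B', 'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]}]}}
print(venue_rows('Downtown', 43.6532, -79.3832, sample))
print(venue_rows('Nowhere', 0.0, 0.0, {'meta': {'code': 400}}))
```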

Next, I run the function for all neighborhoods to find the nearby venues.

In [None]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'], latitudes=df['Latitude'], longitudes=df['Longitude'])
# Check the shape of the dataframe
toronto_venues.shape


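# One-hot encode the venue categories
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add the neighborhood column back so the rows can be grouped by it
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood']

# Group rows by neighborhood (mean frequency of occurrence of each category)
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()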
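
Taking the mean of one-hot columns per neighborhood gives, for each category, the share of that neighborhood's venues that fall into it. The plain-Python sketch below spells out the same arithmetic. The helper name `category_frequencies` and the sample venues are made up for illustration.

```python
from collections import Counter, defaultdict


def category_frequencies(pairs):
    """Map each neighborhood to {category: share of its venues in that category}."""
    counts = defaultdict(Counter)
    for neighborhood, category in pairs:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighborhood, counter in counts.items()
    }


# Made-up venues, used only to show the arithmetic
pairs = [('Downtown', 'Cafe'), ('Downtown', 'Cafe'), ('Downtown', 'Park'), ('Midtown', 'Gym')]
freqs = category_frequencies(pairs)
print(freqs['Downtown'])
print(freqs['Midtown'])
```

Unlike the dataframe version, this sketch omits categories with a share of zero. The dataframe keeps one column per category so that every neighborhood has a feature vector of the same length, which k-means requires.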


### Clustering Neighborhoods


Run k-means to group the neighborhoods into 6 clusters.

In [None]:
# Set number of clusters
kclusters = 6

# Drop the label column so only the numeric category frequencies remain
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
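
The first ten labels say little about how balanced the clusters are. A quick follow-up check is to count how many neighborhoods fall into each cluster, sketched below. The helper name `cluster_sizes` is hypothetical and the label list is made up to stand in for `kmeans.labels_`.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return {cluster label: number of rows assigned to it}, ordered by label."""
    counts = Counter(int(label) for label in labels)
    return dict(sorted(counts.items()))


# Made-up labels standing in for kmeans.labels_
print(cluster_sizes([0, 2, 0, 1, 0, 2]))
```

In this notebook the call would be `cluster_sizes(kmeans.labels_)`. One very large cluster next to several single-member clusters would suggest trying a different value of `kclusters`.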