<h1>Coursera IBM's Applied Data Science Capstone - Week 3</h1>

In this notebook, we use pandas to collect the data from the 
<a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List of postal codes of Canada: M (Toronto)</a> and put it into a DataFrame.

Then, we use the geocoder package to append the geographic coordinates to each neighbourhood in our DataFrame.

Finally, we create a folium map of Toronto with a marker for each neighbourhood, and we apply k-means clustering to these neighbourhoods based on the categories of the venues nearby.

In [7]:
import pandas as pd
pd.set_option('display.max_rows', 20)

import numpy as np
import json
import requests


#!pip install geocoder
import geocoder

#!pip install folium
import folium

from sklearn.cluster import KMeans

print("Dependencies imported!")

ImportError: No module named 'sklearn.__check_build._check_build'
___________________________________________________________________________
Contents of c:\users\clems\desktop\desktop temp\coursera\applied data science specialization - ibm\course 4\coursera_capstone\myvenv\lib\site-packages\sklearn\__check_build:
setup.py                  _check_build.pyx          __init__.py
__pycache__
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.

<h1>1. Data collecting and cleaning</h1>

Collect the data and put it in a DataFrame:

In [None]:
link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

tables = pd.read_html(link)

# take an independent copy of the first table so later in-place edits are safe
df = tables[0].copy()

df.head()

Clean the data:
<ol>
    <li>Drop the rows whose Borough is 'Not assigned'</li>
    <li>Give Neighbourhood the value of Borough when the former is 'Not assigned'</li>
    <li>Group the neighbourhoods that share the same Postcode</li>
</ol>

In [None]:
df.drop(df.loc[df['Borough']=='Not assigned'].index, inplace=True)
df.reset_index(drop=True , inplace=True)

In [None]:
# rows whose neighbourhood is missing take the name of their borough
neighbourhood_na = df['Neighbourhood'] == 'Not assigned'
df.loc[neighbourhood_na, 'Neighbourhood'] = df.loc[neighbourhood_na, 'Borough']

In [None]:
df = df.groupby(['Postcode', 'Borough'], as_index=False).agg(lambda x: ', '.join(x))
df
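
To make the three cleaning steps concrete, here is a small sketch on a toy DataFrame. The rows are illustrative stand-ins laid out like the Wikipedia table, and the column name 'Neighbourhood' is an assumption about how the table is labelled.

```python
import pandas as pd

# Illustrative rows only, laid out like the Wikipedia table
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop the rows whose borough is not assigned
toy = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2. Where the neighbourhood is not assigned, reuse the borough name
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']

# 3. Merge the neighbourhoods that share a postcode into one comma-separated cell
toy = toy.groupby(['Postcode', 'Borough'], as_index=False).agg({'Neighbourhood': ', '.join})

print(toy)
```

After these steps each postcode appears on exactly one row, which is what the geocoding step expects.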

<h1>2. Collection of the geographical coordinates</h1>

<ol>
    <li>Collect the corresponding geographical coordinates</li>
    <li>Append the geographical coordinates to the dataframe</li>
</ol>

In [None]:
lat = []
lng = []

postal_codes = df['Postcode']

# Store latitude and longitude values in lat and lng
for postal_code in postal_codes:
    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
    temp = g.latlng
    lat.append(temp[0])
    lng.append(temp[1])
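
A geocoding request can come back empty, in which case `g.latlng` is `None` and indexing it fails. The sketch below shows one possible way to retry a few times before giving up. The helper `geocode_with_retry` and its parameters are hypothetical names, and the geocoder is passed in as a plain callable so the example runs without network access.

```python
import time

def geocode_with_retry(query, geocode, max_attempts=3, delay=0.0):
    """Return (lat, lng) for query, or (None, None) if every attempt fails.

    `geocode` is any callable that returns a [lat, lng] pair or None.
    """
    for _ in range(max_attempts):
        latlng = geocode(query)
        if latlng is not None:
            return latlng[0], latlng[1]
        time.sleep(delay)  # brief pause before the next attempt
    return None, None

# Stand-in geocoder for illustration: fails twice, then succeeds
calls = {'n': 0}

def flaky_geocode(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.65, -79.38]

result = geocode_with_retry('M5A, Toronto, Ontario', flaky_geocode)
print(result)
```

In the notebook, the callable could be `lambda q: geocoder.arcgis(q).latlng`, which reuses the same call as the loop above.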

In [None]:
df['Latitude'] = lat
df['Longitude'] = lng

df.head()

<h1>3. Exploration and clustering of the neighbourhoods</h1>

We restrict our analysis to the boroughs whose name contains the word Toronto, so we reduce the dataframe to these rows.

In [None]:
df = df[df['Borough'].str.contains('Toronto')].reset_index(drop=True)
df.shape

We use the geocoder API to get the geographical coordinates of Toronto, then we initialize a folium map centered on Toronto.

In [None]:
g = geocoder.arcgis('Toronto, Ontario')
lat_tor = g.latlng[0]
lng_tor = g.latlng[1]
print('The geographical coordinates of Toronto are {}, {}.'.format(lat_tor, lng_tor))

map = folium.Map(location=[lat_tor, lng_tor], zoom_start=11)

map

We then add a marker for each neighbourhood.

In [None]:
for lat, lng, borough, postcode in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Postcode']):
    label = '{}, {}'.format(postcode, borough)        # popup labels with postcode and borough
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color='blue',
                        fill=True,
                        fill_color='#3186cc',
                        fill_opacity=0.7,
                        parse_html=False).add_to(map)
    
map

We define our Foursquare credentials. We will use this API to get a list of recommended venues for each neighbourhood.

In [None]:
CLIENT_ID = '#' # your Foursquare ID
CLIENT_SECRET = '#' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print("Credentials defined!")

We define a function to get the recommended venues near each given neighbourhood.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
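
The function above indexes straight into the JSON response, so a neighbourhood with no result group or a venue with an empty category list would raise an error. Below is a minimal sketch of a more defensive parsing step. The helper name `extract_venues` is hypothetical, and the sample payload is made-up data that only mirrors the keys read by `getNearbyVenues`.

```python
def extract_venues(name, lat, lng, response_json):
    """Flatten one explore-style response into rows of venue information."""
    groups = response_json.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Made-up payload with the same nesting as the keys used above
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 43.651, 'lng': -79.381},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.652, 'lng': -79.382},
               'categories': []}},
]}]}}

rows = extract_venues('Example', 43.65, -79.38, sample)
empty = extract_venues('Example', 43.65, -79.38, {'response': {}})
print(rows)
print(empty)
```

Returning an empty list for an empty response lets the loop continue with the next neighbourhood instead of stopping on the first failure.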

We define a limit of 100 venues for each API call.
We then apply this function to each of our neighbourhoods.

In [None]:
LIMIT = 100

toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

We check the shape and the first values of our resulting dataframe.

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

We count the number of venues for each neighbourhood.

In [None]:
# one groupby pass: number of venues per neighbourhood
count_venues = toronto_venues.groupby('Neighbourhood')['Venue'].count().reset_index(name='Venue Count')
count_venues

We check how many unique venue categories we have in our dataframe.

In [None]:
print('There are {} unique categories.'.format(toronto_venues['Venue Category'].nunique()))

Let's compute the frequency of each venue category for each neighbourhood.

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# the mean of each one-hot column is the share of venues in that category
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()
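
Taking the mean of one-hot columns per neighbourhood gives the share of that neighbourhood's venues falling in each category. The sketch below reproduces the same arithmetic in plain Python on made-up pairs, as a way to check the idea by hand. It is an equivalent computation, not the pandas code used above.

```python
from collections import Counter, defaultdict

# Made-up (neighbourhood, venue category) pairs
venues = [
    ('A', 'Cafe'), ('A', 'Cafe'), ('A', 'Park'), ('A', 'Bakery'),
    ('B', 'Park'), ('B', 'Gym'),
]

# count the venues of each category per neighbourhood
counts = defaultdict(Counter)
for neighbourhood, category in venues:
    counts[neighbourhood][category] += 1

# divide each count by the neighbourhood's total number of venues
frequencies = {
    n: {cat: c / sum(cnt.values()) for cat, c in cnt.items()}
    for n, cnt in counts.items()
}
print(frequencies)
```

One difference from the one-hot version: categories absent from a neighbourhood are simply missing here, whereas the grouped dataframe stores them with a frequency of 0.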

We check the shape of our resulting dataframe.

In [None]:
toronto_grouped.shape

We now put the most common venue categories of each neighbourhood into a pandas dataframe, ranked by frequency.

<ol>
    <li>We create a function that returns the N most common venue categories for a given neighbourhood</li>
    <li>We use that function to create a dataframe of the 10 most common venue categories for each of our neighbourhoods</li>
</ol>

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for i in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(i+1, indicators[i]))
    except IndexError:
        # ranks beyond the third take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(i+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
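
The column labels and the ranking can also be written without relying on an exception. The sketch below is one possible alternative in plain Python. The helpers `ordinal` and `top_categories` are hypothetical names, and the frequencies are made-up values.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

def top_categories(frequencies, n):
    """Return the n category names with the highest frequency, highest first."""
    ranked = sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]

columns = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
top = top_categories({'Park': 0.25, 'Cafe': 0.5, 'Gym': 0.1, 'Bakery': 0.15}, 2)
print(columns)
print(top)
```

Unlike the fixed list of suffixes, `ordinal` also handles ranks such as 21 or 22, which matters only if `num_top_venues` grows beyond 20.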

We then apply k-means clustering to our neighbourhoods, in order to group them by the categories of their most common venues.

<ol>
    <li>We set the hyperparameter k, the number of clusters, to 5</li>
    <li>We fit the k-means model on the venue category frequencies of each neighbourhood</li>
</ol>

In [None]:
# Set number of clusters
kclusters = 5

In [None]:
# keep only the numeric frequency columns for clustering
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
print(kmeans.labels_[0:10])
print("done")
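
To show what the clustering step does with the frequency vectors, here is a minimal sketch of the basic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is a simplified illustration on made-up two-dimensional data with caller-supplied starting centroids. scikit-learn's `KMeans` uses a more careful initialization and is what the notebook relies on.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means on tuples of floats. Returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # assignment step: index of the nearest centroid (squared Euclidean distance)
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the loop has converged
        centroids = new_centroids
    return labels, centroids

# Made-up frequency vectors: (share of cafes, share of parks)
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
print(labels)
print(centroids)
```

Neighbourhoods with similar venue profiles end up with the same label, which is the grouping that `kmeans.labels_` holds in the cell above.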