## Data Science Capstone Project
There are three chapters in this project:

<b> (1) Web Scraping the List of Postal Codes of Canada</b>

<b> (2) Adding Geolocation to Postal Codes</b>

<b> (3) Explore the Neighborhoods in Toronto</b>


### (1) Web Scraping the List of Postal Codes of Canada



In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
import requests
from bs4 import BeautifulSoup



Download the HTML page from Wikipedia.

In [2]:
response = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
# A status code of 200 means the page was loaded successfully
#response.status_code

Use BeautifulSoup to parse the HTML and retrieve the table information.
The data of interest is in the first table, so only this table is scraped.

In [3]:
soup_obj = BeautifulSoup(response.text,'lxml')
#soup_obj.prettify
tables = soup_obj.find_all('tbody')

Extract the data into an Nx5 array named 'bits'.

In [4]:
snip = tables[0].get_text()
# drop the trailing newline, split the text into fields, and reshape to 5 fields per table row
bits = np.reshape(snip[0:-1].split("\n"), (-1, 5))
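The reshape step can be tried on a small stand-in for the scraped text. This is a minimal sketch: `rows` and `flat` are hypothetical, and the assumption that every table row is flattened into an empty field, three cell values, and another empty field is inferred from the Nx5 shape used above, not checked against the live page.

```python
import numpy as np

# Hypothetical sample rows (header plus two data rows).
rows = [
    ("Postcode", "Borough", "Neighbourhood"),
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
]

# Assumed layout of the flattened table text: each row is wrapped in empty fields.
flat = "\n".join("\n".join(["", *row, ""]) for row in rows) + "\n"

# Same call as in the notebook: strip the last newline, split, reshape to N x 5.
bits = np.reshape(flat[0:-1].split("\n"), (-1, 5))

print(bits.shape)
print(bits[0])
```

Columns 1 to 3 of `bits` hold the actual values; columns 0 and 4 are the empty padding fields.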

Create three DataFrames from the data table, one for each column, as they will be treated differently later. The empty first and last columns of the Nx5 data array are discarded later, when only the named columns are selected.

In [5]:
bits_p_df = pd.DataFrame(bits[1:],columns = bits[0])
bits_p_df.drop(['Borough','Neighbourhood'],axis=1,inplace=True)

bits_n_df = pd.DataFrame(bits[1:],columns = bits[0])
bits_n_df.drop('Borough',axis=1,inplace=True)

bits_b_df = pd.DataFrame(bits[1:],columns = bits[0])
bits_b_df.drop('Neighbourhood',axis=1,inplace=True)

Group the data by postal code. Create concatenated entries for the variable 'Neighbourhood'.


In [6]:
bits_p_li = list(bits_p_df.groupby(['Postcode'], as_index=True)['Postcode'].apply(lambda x: (','.join(x)).split(',')[0] ))
bits_n_li = list(bits_n_df.groupby(['Postcode'], as_index=True)['Neighbourhood'].apply(lambda x: ','.join(x)))
bits_b_li = list(bits_b_df.groupby(['Postcode'], as_index=True)['Borough'].apply(lambda x: (','.join(x)).split(',')[0] ))
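The grouping step can also be written as one aggregation. The sketch below is an alternative to the three separate `apply` calls, not the notebook's own method, and `sample` is a hypothetical three-row table.

```python
import pandas as pd

# Hypothetical sample: one postal code that appears twice, one that appears once.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Keep the first borough per postal code and join all neighbourhood names.
grouped = sample.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ",".join}
)

print(grouped)
```

Aggregating in a single call keeps the three columns aligned by construction, so no intermediate lists have to be reassembled afterwards.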

Create a dictionary with three data columns and read it into a dataframe. 

In [7]:
postcode_dic = { 'Postcode':bits_p_li , 'Borough':bits_b_li , 'Neighbourhood':bits_n_li }
postcode_df = pd.DataFrame.from_dict(postcode_dic)

First, keep only the rows whose 'Borough' value is not 'Not assigned'.

In [8]:
bo_is_assigned = (postcode_df['Borough'] != 'Not assigned') 
postcode_df = postcode_df[bo_is_assigned]

Then copy the 'Borough' value to 'Neighbourhood' if the variable 'Neighbourhood' has the value 'Not assigned'. Reset the index.

In [9]:
ne_not_assigned = (postcode_df['Neighbourhood'] == 'Not assigned') 
# use .loc instead of chained indexing so the assignment reliably modifies postcode_df
postcode_df.loc[ne_not_assigned, 'Neighbourhood'] = postcode_df.loc[ne_not_assigned, 'Borough']
postcode_df.reset_index(drop=True, inplace=True)
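The two cleaning steps can be checked on a small example. This is a sketch on a hypothetical three-row table (`sample`); it uses boolean masks with `.loc`, which avoids the ambiguity of chained indexing.

```python
import pandas as pd

# Hypothetical sample with one unassigned borough and one unassigned neighbourhood.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M7A", "M1B"],
    "Borough": ["Not assigned", "Queen's Park", "Scarborough"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Rouge,Malvern"],
})

# Step 1: keep only rows with an assigned borough (copy to get an independent frame).
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Step 2: where the neighbourhood is missing, reuse the borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

cleaned.reset_index(drop=True, inplace=True)
print(cleaned)
```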

Print the DataFrame as a table. Index 85 is postcode M7A, which is table entry 9 on the Wikipedia page; its Neighbourhood name was filled in by the previous cell.

In [10]:
postcode_df.head()
#postcode_df # use this line to look at all data

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


Finally, check the shape of the DataFrame:

In [11]:
postcode_df.shape

(103, 3)

Save DataFrame postcode_df to file.

In [12]:
# to_csv returns None when a file path is given, so the result is not assigned
postcode_df.to_csv('postcode_df.csv', index=False, header=True)

### (2) Adding Geolocation to Postal Codes

This can be achieved in just three lines of code using method b) below.

Read the DataFrame postcode_df from file. This data was saved in the previous section (1).

In [13]:
postcode_df = pd.read_csv('postcode_df.csv')
postcode_df.head(5)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


##### a) Get geo data using geocoder

This step is skipped because requests via geocoder never returned a response. The csv file is used instead.

##### b) Alternative: Load geo data from file

Read the csv file with the geo data.

In [14]:
geodata_df = pd.read_csv('http://cocl.us/Geospatial_data')
geodata_df.shape

(103, 3)

Print DataFrame as a table.

In [15]:
geodata_df.head()
#geodata_df # use this line to look at all data

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Add the 'Latitude' and 'Longitude' columns by merging the two DataFrames on the postal code: column 'Postcode' in postcode_df and column 'Postal Code' in geodata_df. Because the merge matches rows on postal code values rather than on index numbers, matching rows do not need to have the same index in the two DataFrames.

In [16]:
postcode_df = postcode_df.merge(geodata_df, left_on='Postcode', right_on='Postal Code', how='left')
postcode_df.drop('Postal Code', axis=1, inplace=True)
postcode_df.head()
#postcode_df # use this line to look at all data

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
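Matching rows by postal code value, independent of row order, can be illustrated with `DataFrame.merge`. This is a minimal sketch on two hypothetical two-row tables (`codes` and `geo`) whose rows are deliberately in opposite order; the coordinates are taken from the table above.

```python
import pandas as pd

# Hypothetical postal code table.
codes = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})

# Hypothetical geo table, listed in the opposite order.
geo = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Match on the postal code values; a left merge keeps every row of `codes`.
merged = codes.merge(geo, left_on="Postcode", right_on="Postal Code", how="left")
merged = merged.drop(columns="Postal Code")

print(merged)
```

An element-wise comparison such as `geo['Postal Code'] == codes['Postcode']` would instead compare rows by index label, so it only works when both tables are already in the same order.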


Save the DataFrame postcode_df, which now includes the geo data, to the file postcode_geo_df.csv.

In [17]:
postcode_df.to_csv('postcode_geo_df.csv', index=False, header=True)

### (3) Explore the Neighborhoods in Toronto

### a) Get the Foursquare data

First, read the output DataFrame postcode_geo_df from file.

In [18]:
postcode_geo_df = pd.read_csv('postcode_geo_df.csv')
postcode_geo_df.head(5)

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


Load new libraries

In [19]:
!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

Create a function to explore the neighbourhoods in Toronto. The Foursquare credentials (CLIENT_ID, CLIENT_SECRET and VERSION) are defined in a hidden cell.

In [20]:
# The code was removed by Watson Studio for sharing.

Parameters for exploration

In [21]:
LIMIT = 100

Define function to loop over neighbourhoods.

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    print('')
    print('--- Start neighbourhood list. ---')
    print('')

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('')
    print('--- End neighbourhood list. ---')
    print('')
    return nearby_venues

Code to run the above function on each neighbourhood and create a new dataframe called *toronto_venues*.

In [23]:
toronto_venues = getNearbyVenues(names=postcode_geo_df['Neighbourhood'],
                                   latitudes=postcode_geo_df['Latitude'],
                                   longitudes=postcode_geo_df['Longitude']
                                  )


--- Start neighbourhood list. ---

Rouge,Malvern


KeyError: 'groups'
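The `KeyError` above means the parsed JSON had no `'groups'` entry under `'response'`. A likely cause is that the API rejected the request, for example because of invalid credentials or an exhausted quota, but the notebook does not show the raw reply, so that is an assumption. The sketch below shows one defensive way to read the reply; `extract_items` is a hypothetical helper and both payloads are made-up examples, with the error payload's shape assumed rather than taken from the API documentation.

```python
def extract_items(payload):
    """Return the venue items of a parsed explore reply, or [] if there are none."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    return groups[0].get("items", [])


# Hypothetical successful reply with one venue.
ok_payload = {
    "response": {
        "groups": [
            {"items": [{"venue": {"name": "Example Cafe"}}]}
        ]
    }
}

# Hypothetical failed reply: an empty 'response' and an error code in 'meta' (assumed shape).
error_payload = {"meta": {"code": 400}, "response": {}}

print(len(extract_items(ok_payload)))
print(extract_items(error_payload))
```

Inside the loop of `getNearbyVenues`, a helper like this would let the function skip a neighbourhood with no usable reply instead of stopping at the first failure.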

Export to file and print out summary info about the venues

In [None]:
toronto_venues.to_csv('toronto_venues.csv', index=False, header=True)
print(toronto_venues.shape)
toronto_venues.head()

Check how many venues were returned for each neighbourhood

In [None]:
toronto_venues.groupby('Neighbourhood').count()

Find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

### b) Further Analysis

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Examine the new dataframe size.

In [None]:
toronto_onehot.shape

Group the rows by neighbourhood and take the mean of each one-hot column, which gives the frequency of occurrence of each category.

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped.head()
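The one-hot encoding and the per-neighbourhood mean can be followed on a tiny example. This is a sketch with a hypothetical three-venue table (`venues`); `dtype=float` is passed to `get_dummies` here so that the mean is taken over plain numbers, which is a choice made for this example and not part of the notebook's code.

```python
import pandas as pd

# Hypothetical venue list: two venues in one neighbourhood, one in another.
venues = pd.DataFrame({
    "Neighbourhood": ["Woburn", "Woburn", "Cedarbrae"],
    "Venue Category": ["Park", "Cafe", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Neighbourhood").mean().reset_index()

print(grouped)
```

For Woburn, half of the venues are parks and half are cafes, so both frequencies are 0.5; for Cedarbrae the only venue is a park, giving 1.0 and 0.0.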

Confirm the new size

In [None]:
toronto_grouped.shape

Print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

<b>Put results into a *pandas* dataframe</b>

First, write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
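As a quick check of the function above, it can be applied to a single hand-made row. The row below is hypothetical; its first entry is the neighbourhood name, which the function skips before sorting.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the neighbourhood name) and sort the frequencies.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Hypothetical frequency row for one neighbourhood.
row = pd.Series({"Neighbourhood": "Woburn", "Gym": 0.2, "Cafe": 0.5, "Park": 0.3})

top_two = return_most_common_venues(row, 2)
print(list(top_two))
```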

Then create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head(5)
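The column names above rely on an `IndexError` to switch from 'st', 'nd', 'rd' to 'th'. An explicit helper is a possible alternative; `ordinal` below is a hypothetical function, not part of the notebook, and it also handles numbers such as 11 or 22 that the list-based approach does not distinguish.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 always take 'th', whatever their last digit is.
    if n % 100 in (11, 12, 13):
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Build the same kind of column names as in the cell above.
num_top_venues = 10
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i + 1)) for i in range(num_top_venues)
]
print(columns[:4])
```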

### c) Cluster Neighborhoods

Run k-means to cluster the neighbourhoods into 3 clusters (the value of kclusters).

In [None]:
# set number of clusters
kclusters = 3

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
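What the clustering step does can be seen on a toy input. This is a sketch with four hypothetical frequency rows (`freqs`), two dominated by the first category and two by the second; `n_init` is set explicitly here, which the notebook's call does not do.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category frequencies for four neighbourhoods (columns: Park, Cafe).
freqs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# Two clusters are enough for this toy input.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(freqs)
labels = model.labels_

print(labels)
```

The label numbers themselves are arbitrary; what matters is which rows share a label.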

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels (needed later for the 'Cluster Labels' column of toronto_merged)
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = postcode_geo_df

# join neighbourhoods_venues_sorted onto toronto_merged to add the cluster label and top venues for each neighbourhood
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!
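If a neighbourhood returned no venues, it has no row in the clustered table, and the join leaves its 'Cluster Labels' entry as NaN; the column then becomes floating point and can no longer be used directly as a list index. The sketch below shows this effect and one possible way to handle it, on two hypothetical miniature tables (`locations` and `clustered`).

```python
import pandas as pd

# Hypothetical location table with three neighbourhoods.
locations = pd.DataFrame({
    "Neighbourhood": ["Woburn", "Cedarbrae", "Guildwood,Morningside,West Hill"],
    "Latitude": [43.770992, 43.773136, 43.763573],
})

# Hypothetical cluster table: one neighbourhood is missing (no venues found).
clustered = pd.DataFrame({
    "Neighbourhood": ["Woburn", "Cedarbrae"],
    "Cluster Labels": [0, 1],
})

merged = locations.join(clustered.set_index("Neighbourhood"), on="Neighbourhood")
print(merged["Cluster Labels"].dtype)  # float, because of the NaN entry

# Drop neighbourhoods without a cluster and restore integer labels.
cleaned = merged.dropna(subset=["Cluster Labels"]).copy()
cleaned["Cluster Labels"] = cleaned["Cluster Labels"].astype(int)
print(cleaned)
```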

### d) Visualize the resulting clusters

Toronto geo coordinates

In [None]:
latitude = 43.715383
longitude = -79.405678

Map it.

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

# display the map (must be at top level, outside the loop)
map_clusters
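The colour scheme boils down to picking one evenly spaced colour per cluster from the rainbow colormap. A minimal sketch follows, assuming matplotlib 3.5 or later for the `matplotlib.colormaps` registry; unlike the cell above, it indexes the colour list directly with the cluster label.

```python
import numpy as np
import matplotlib
from matplotlib import colors

kclusters = 3

# Sample the colormap at kclusters evenly spaced points between 0 and 1.
cmap = matplotlib.colormaps["rainbow"]
rainbow = [colors.rgb2hex(rgba) for rgba in cmap(np.linspace(0, 1, kclusters))]

# Cluster labels run from 0 to kclusters - 1, so they can index the list directly.
for label in range(kclusters):
    print(label, rainbow[label])
```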