## Segmenting and Clustering Neighbourhoods in Toronto City

I will connect to the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and scrape the required data, using BeautifulSoup to extract the table and then parsing it into a DataFrame. For coordinates, I will use Google geodata (if applicable) or the provided CSV file, which contains the latitude and longitude for each assigned postal code.
I will use the Foursquare API to explore neighbourhoods in Toronto City. The set is narrowed down to the neighbourhoods whose name contains the word "Toronto". Then I will use the explore function to get the most common venue categories in each neighbourhood, and use this feature to group the neighbourhoods into clusters with the k-means clustering algorithm. Finally, I will use the Folium library to visualize the neighbourhoods in Toronto City and their emerging clusters.

Data sources : https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and 
               https://cocl.us/Geospatial_data


In [1]:
# The code was removed by Watson Studio for sharing.

In [None]:
!conda install -c conda-forge folium=0.5.0 --yes 

import folium 


Connect to the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and scrape the required data. BeautifulSoup is used to obtain the data from the table, which is then parsed into a DataFrame.




In [None]:
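import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO
res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
# Wrap the HTML in StringIO; newer pandas versions deprecate passing a literal HTML string.
df = pd.read_html(StringIO(str(table)))[0]
df.head()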

Check the shape of the total initial data.  


In [None]:
df.shape

The data contains 287 rows and 3 columns. Check how many rows have the value 'Not assigned' in the Borough column.

In [None]:
df[df.Borough == 'Not assigned'].shape

We only want to process rows that have an assigned Borough, so we will remove the 77 rows whose Borough is 'Not assigned'.

In [None]:
# Remove all rows whose Borough is 'Not assigned'.
df = df[df.Borough != 'Not assigned']
#df.drop(df.loc[df['Borough'] == 'Not assigned'].index, inplace=True)
df.shape

Check how many postal codes contain multiple neighbourhoods. For example, M9V and M8Y each have 8 neighbourhoods.

In [None]:
df['Postcode'].value_counts()

We want to join all neighborhoods under one postal code into one cell separated by ", ". 

In [None]:
df_group =  pd.DataFrame(df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join))

After grouping by 'Postcode' and 'Borough', these two columns become a MultiIndex. We reset the index so that the dataframe is back to 3 columns.

In [None]:
df_group.reset_index(inplace=True)
df_group.head()
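
As a quick check of the grouping step, the sketch below applies the same groupby, join and index reset to a tiny hand-made frame. The three rows are illustrative only and are not taken from the Wikipedia table.

```python
import pandas as pd

# Toy rows (made up for illustration): two neighbourhoods share one postcode.
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Join the neighbourhood names per (Postcode, Borough) pair,
# then flatten the resulting MultiIndex back into ordinary columns.
sample_group = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)
print(sample_group)
```

The result has one row per postcode, with the two M5A neighbourhoods combined into a single comma-separated cell.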

Now check for 'Not assigned' values in Neighbourhood. If a cell has a borough but a 'Not assigned' neighbourhood, then the neighbourhood will be the same as the borough.

In [None]:
df_group[df_group['Neighbourhood']=='Not assigned'].size
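
If the check above ever finds such rows, one possible way to apply the rule is a boolean mask with `.loc`. This is a minimal sketch on a made-up two-row frame, not the notebook's own data.

```python
import pandas as pd

# Toy frame (illustrative values) with one 'Not assigned' neighbourhood.
sample = pd.DataFrame({
    'Postcode': ['M7A', 'M5A'],
    'Borough': ["Queen's Park", 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Harbourfront'],
})

# Where the neighbourhood is 'Not assigned', copy the borough name into it.
missing = sample['Neighbourhood'] == 'Not assigned'
sample.loc[missing, 'Neighbourhood'] = sample.loc[missing, 'Borough']
print(sample)
```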

Check the shape of the processed data. 

In [None]:
df_group.shape


## Finished the first part
Scraped the Wikipedia page for Toronto City and wrangled the data as instructed.



Install geocoder package 

In [None]:
!pip install geocoder

I tried to use geocoder in several different ways, but it kept returning 'None', and the while loop below kept running without finishing. We will therefore use the CSV file provided for the course, located at https://cocl.us/Geospatial_data.

In [None]:

import geocoder
lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format("M5G"))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
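
Because the unbounded `while` loop never finished, a bounded retry may be a safer pattern. The sketch below is an assumption-laden illustration: `geocode_with_retry` and `always_none` are hypothetical helpers, and the lookup function is passed in so the example runs without any network call.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)
    # Give up after max_attempts instead of looping forever.
    return None

# Hypothetical stand-in for a geocoding call that always fails.
def always_none(query):
    return None

result = geocode_with_retry(always_none, 'M5G, Toronto, Ontario', max_attempts=3)
print(result)
```

With the real package, the lookup could be something like `lambda q: geocoder.google(q).latlng`, mirroring the commented-out call above. A `None` result would then signal that the CSV fallback is needed.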

In [None]:
geocode_df = pd.read_csv('https://cocl.us/Geospatial_data')
geocode_df.head()


In [None]:
geocode_df.shape

Notice that the CSV file has a column named 'Postal Code', while the data from Wikipedia uses 'Postcode'. Because the column names do not match, we rename 'Postal Code' to 'Postcode'.

In [None]:
geocode_df.rename(columns = {"Postal Code":"Postcode"}, inplace=True)
geocode_df.head()

We now have two data frames, which will be joined on Postcode. For the join, the Postcode column of geocode_df is set as its index.

In [None]:
toronto = df_group.join(geocode_df.set_index('Postcode'), on='Postcode')
toronto.head()
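
To confirm that every postcode found a coordinate, one option is a left merge with `indicator=True`, which marks rows that had no match. The frames below are made up for illustration (including the postcode 'M9Z' and all coordinates).

```python
import pandas as pd

# Illustrative data only: 'M9Z' has no entry in the coordinates table.
left = pd.DataFrame({
    'Postcode': ['M5A', 'M6A', 'M9Z'],
    'Borough': ['Downtown Toronto', 'North York', 'Etobicoke'],
})
right = pd.DataFrame({
    'Postcode': ['M5A', 'M6A'],
    'Latitude': [43.65, 43.72],
    'Longitude': [-79.36, -79.46],
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = left.merge(right, on='Postcode', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['Postcode', 'Borough']])
```

An empty `unmatched` frame would mean that all postcodes received a latitude and longitude.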



In [None]:
#toronto.drop("index", axis=1, inplace=True)
toronto.rename(columns={"Postcode":"PostalCode"}, inplace=True)

In [None]:
toronto.head()

## Finished the second part
Read the CSV file that contains latitude and longitude.

In [None]:

toronto[toronto['Neighbourhood'].str.contains('University of Toronto', regex=False)]


In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto['Borough'].unique()),
        toronto.shape[0]
    )
)

In [None]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="Toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto['Latitude'], toronto['Longitude'], toronto['Borough'], toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Define Foursquare Credentials and Version

In [None]:
# The code was removed by Watson Studio for sharing.

We will explore the neighbourhood that contains the University of Toronto. Its row index is 66.

In [None]:
uofT = toronto[toronto['Neighbourhood'].str.contains('University of Toronto', regex=False)]
uofT

In [None]:
neighbourhood_latitude = toronto.loc[66, 'Latitude'] # neighborhood latitude value
neighbourhood_longitude = toronto.loc[66, 'Longitude'] # neighborhood longitude value

neighbourhood_name = toronto.loc[66, 'Neighbourhood'] # neighborhood name
neighbourhood_pcode = toronto.loc[66, 'PostalCode'] # neighborhood postal code


print('Postal Code, Latitude and longitude values of {} are {}, {}, {}.'.format(neighbourhood_name, 
                                                                                neighbourhood_pcode, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

#### Now, let's get the top 100 venues that are in U of T within a radius of 500 meters



First, let's create the GET request URL and store it in a variable named url.

In [None]:
LIMIT = 100
radius = 500
#url = "https://api.foursquare.com/v2/venues/explore?client_id=CLIENT_ID&client_secret=CLIENT_SECRET&ll=neighborhood_latitude,neighborhood_longitude&v=VERSION&limit=LIMIT"
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}\
&ll={},{}&v={}&radius={}&limit={}'.format(
CLIENT_ID, CLIENT_SECRET, neighbourhood_latitude, neighbourhood_longitude, VERSION, radius, LIMIT)


#url
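
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is only a sketch: the credentials are placeholders, and the version string and coordinates are assumed values for illustration.

```python
from urllib.parse import urlencode

# Placeholder values; the real credentials live in the hidden cell.
params = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
    'v': '20180605',                          # assumed version string
    'll': '{},{}'.format(43.6627, -79.3957),  # illustrative coordinates
    'radius': 500,
    'limit': 100,
}

# urlencode escapes special characters (the comma in 'll' becomes %2C).
example_url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
print(example_url)
```

Passing the same dictionary as `params=` to `requests.get` should produce an equivalent request without building the string manually.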

Send the GET request and examine the results

In [None]:
results = requests.get(url).json()
results

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's clean the json and structure it into a pandas dataframe

In [None]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()
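
The flattening step can be tried offline on a mock list shaped like the `items` used above. All values here are made up, and the second venue deliberately has an empty category list to show how that case might be handled.

```python
import pandas as pd

# Mock items (made-up values) shaped like the 'items' list in the response.
mock_items = [
    {'venue': {'name': 'Cafe A', 'categories': [{'name': 'Cafe'}],
               'location': {'lat': 43.66, 'lng': -79.39}}},
    {'venue': {'name': 'Unknown B', 'categories': [],
               'location': {'lat': 43.67, 'lng': -79.40}}},
]

# Nested dictionaries become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(mock_items)

# Keep the first category name, or None when the list is empty.
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None
)

# Keep only the last part of each dotted column name.
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat[['name', 'categories', 'lat', 'lng']])
```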

In [None]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

## Explore Neighborhoods in Toronto (Neighbourhood contains "Toronto")

In [None]:
toronto_data = toronto[toronto['Neighbourhood'].str.contains('Toronto', regex=False)]

In [None]:
toronto_data.shape

#### Let's create a function to repeat the same process for all the neighborhoods in Toronto

In [None]:
def getNearbyVenues(pcodes, names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for pcode, name, lat, lng in zip(pcodes, names, latitudes, longitudes):
        print(pcode + ' '+ name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            pcode,
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode',
                             'Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
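
The function above indexes `categories[0]` and `['groups'][0]` directly, which raises an error if a venue has no category or a response has no groups. A more tolerant extraction might look like the sketch below. `extract_venue_rows` is a hypothetical helper and the payload is mock data, so no API call is made.

```python
def extract_venue_rows(payload, pcode, name, lat, lng):
    """Flatten one explore-style payload into tuples, tolerating missing parts."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((pcode, name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Mock payload with made-up values, shaped like the response indexed above.
mock_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'categories': [{'name': 'Cafe'}],
               'location': {'lat': 43.66, 'lng': -79.39}}},
    {'venue': {'name': 'Unknown B', 'categories': [],
               'location': {'lat': 43.67, 'lng': -79.40}}},
]}]}}

rows = extract_venue_rows(mock_payload, 'M5S', 'Example', 43.66, -79.40)
empty_rows = extract_venue_rows({'response': {}}, 'M5S', 'Example', 43.66, -79.40)
print(rows)
print(empty_rows)
```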



#### The code to run the above function on each neighborhood and create a new dataframe called *toronto_venues*

In [None]:
toronto_venues = getNearbyVenues(pcodes = toronto_data['PostalCode'],
                                    names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )



In [None]:
print(toronto_venues.shape)
toronto_venues.head()

Let's check how many venues were returned for each neighborhood

In [None]:
toronto_venues.groupby('PostalCode').count()

#### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

## Analyze Each Neighborhood

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

In [None]:
# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

In [None]:
toronto_onehot.shape

In [None]:
toronto_onehot.groupby('Neighbourhood').mean()

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped
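
To see why the mean of one-hot columns gives a category frequency, consider this small made-up example. Neighbourhood 'A' has four venues, two of which are cafes, so its 'Cafe' value should come out as 0.5.

```python
import pandas as pd

# Toy venue list (illustrative values only).
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Bar', 'Park', 'Park'],
})

# One column per category, with 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# The mean of a 0/1 column is the share of venues in that category.
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, which is what makes the rows comparable between neighbourhoods with different venue counts.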

In [None]:
toronto_grouped.shape

In [None]:
toronto_grouped[toronto_grouped['Neighbourhood'] == 'CFB Toronto, Downsview East'].T

#### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    # print the top venues inside the loop so every neighbourhood is reported
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Let's put that into a *pandas* dataframe
First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
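
The same sort-and-slice logic can be checked on a single hand-made row. The frequencies below are invented for illustration. The first entry is the neighbourhood name, which is skipped with `iloc[1:]` just as in the function above.

```python
import pandas as pd

# One made-up row: a name followed by category frequencies.
row = pd.Series({'Neighbourhood': 'A', 'Bar': 0.25, 'Cafe': 0.5, 'Park': 0.1, 'Gym': 0.15})

# Drop the name, sort the frequencies in descending order, keep the top two labels.
frequencies = row.iloc[1:].astype(float)
top_two = frequencies.sort_values(ascending=False).index.values[0:2]
print(list(top_two))  # ['Cafe', 'Bar']
```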

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted

## Cluster Neighborhoods
Run k-means to cluster the neighbourhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
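
For intuition about what the labels mean, here is a minimal sketch of k-means on four made-up frequency rows. The first two rows resemble each other, as do the last two, so with two clusters they should be paired accordingly. The data and the choice of `n_init=10` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented frequency vectors: rows 0-1 are similar, rows 2-3 are similar.
features = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

# n_init is set explicitly because its default differs between scikit-learn versions.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary. Only the grouping matters, so cluster 0 in one run is not necessarily comparable to cluster 0 in another.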

In [None]:
toronto_data.head()

In [None]:
neighbourhoods_venues_sorted.head()

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [None]:
toronto_merged = toronto_data

# join neighbourhoods_venues_sorted onto toronto_data so each neighbourhood has its cluster label, top venues and latitude/longitude
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
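
One caveat worth checking before plotting: if any neighbourhood returned no venues, the join leaves its 'Cluster Labels' as NaN, the column becomes float, and a float cannot be used to index the colour list. A possible guard, sketched here on a made-up frame, is to drop those rows and cast the labels back to integers.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative) where neighbourhood 'B' received no cluster label.
merged = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Cluster Labels': [0.0, np.nan, 2.0],
})

# Keep only rows with a label, then restore the integer dtype.
plottable = merged.dropna(subset=['Cluster Labels']).copy()
plottable['Cluster Labels'] = plottable['Cluster Labels'].astype(int)
print(plottable)
```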



## Examine Clusters


Cluster 1

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 2

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 3

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 4

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Cluster 5

In [None]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
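
Before reading the per-cluster tables, it may help to see how many neighbourhoods fall into each cluster. The sketch below counts labels on a made-up frame. The same `value_counts` call could be applied to the 'Cluster Labels' column of `toronto_merged`.

```python
import pandas as pd

# Toy labels (illustrative only).
merged = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Cluster Labels': [0, 0, 1, 2],
})

# Count neighbourhoods per cluster, ordered by cluster label.
sizes = merged['Cluster Labels'].value_counts().sort_index()
print(sizes)
```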