#### Question 1  
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a _pandas_ dataframe  
  
* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood  
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.  
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.  
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.  
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.  
* In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.  
  
Submit a link to your Notebook on your Github repository.  

Import all essential coding tools, including Pandas, Numpy, BeautifulSoup, and mapping tools

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
!pip install folium
import folium
from geopy.geocoders import Nominatim 
import matplotlib.cm as cm
import matplotlib.colors as colors



In [2]:
List_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(List_url).text

In [3]:
soup = BeautifulSoup(source, 'html.parser')  # the page is HTML, so use an HTML parser rather than 'xml'
table=soup.find('table')

Set up labels for table columns

In [4]:
column_names = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

In [5]:
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data 
df.head()        

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Remove rows whose 'Borough' is 'Not assigned'

In [6]:
df=df[df['Borough']!='Not assigned']
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
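
The assignment also asks that a row with a borough but a 'Not assigned' neighborhood take the borough name as its neighborhood. The notebook has no cell for that rule, so the following is only a minimal sketch of one way to apply it. It assumes the same three-column layout as `df`; the `demo` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy rows in the same layout as df (illustrative values, not scraped data)
demo = pd.DataFrame({
    'Postalcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighborhood is 'Not assigned', copy the borough name into it
mask = demo['Neighborhood'] == 'Not assigned'
demo.loc[mask, 'Neighborhood'] = demo.loc[mask, 'Borough']

print(demo)
```

Applied to the real `df`, the same two lines would run right after the 'Not assigned' boroughs are filtered out.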


In [7]:
# Join the neighborhoods that share a postal code into one comma-separated string
x1=df.groupby('Postalcode')['Neighborhood'].apply(', '.join)
x1=x1.reset_index(drop=False)
x1.rename(columns={'Neighborhood':'Neighborhood_joined'},inplace=True)

Merge the joined neighborhood names back onto the filtered rows, then drop the original 'Neighborhood' column and any duplicate rows so that each postal code appears once

In [8]:
m = pd.merge(df, x1, on='Postalcode')
m.drop(['Neighborhood'],axis=1,inplace=True)
m.drop_duplicates(inplace=True)
m.rename(columns={'Neighborhood_joined':'Neighborhood'},inplace=True)
m.head(12)

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


Use the shape attribute to determine the number of rows in the dataframe

In [9]:
m.shape

(103, 3)

#### Question 2  
Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.  
Recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.  
The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code.  
Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data  
Submit a link to the new Notebook on your Github repository

Create a function to look up the latitude and longitude of a postal code

In [10]:
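import geocoder  # not imported in the first cell; install with `pip install geocoder`

def get_geocode(postal_code, max_tries=10):
    # the geocoder can return None for a valid postal code, so retry,
    # but stop after max_tries attempts instead of looping forever
    lat_lng_coords = None
    tries = 0
    while lat_lng_coords is None and tries < max_tries:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        tries += 1
    if lat_lng_coords is None:
        return None, None
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude,longitude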

In [11]:
geo=pd.read_csv('http://cocl.us/Geospatial_data')
geo.head()

HTTPError: HTTP Error 501: Not Implemented

Merge the latitude and longitude columns into the postal code table

In [None]:
geo.rename(columns={'Postal Code':'Postalcode'},inplace=True)
geom = pd.merge(geo, m, on='Postalcode')
data=geom[['Postalcode','Borough','Neighborhood','Latitude','Longitude']]
data.head()

#### Question 3  
Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data.  
Make sure:  
* to add enough Markdown cells to explain what you decided to do and to report any observations you make.  
* to generate maps to visualize your neighborhoods and how they cluster together.

In [None]:
toronto=data[data['Borough'].str.contains("Toronto")]
toronto.head()

Input credentials from Foursquare account

In [None]:
# Placeholders: supply your own credentials and keep real values out of a public notebook
ID= 'YOUR_FOURSQUARE_CLIENT_ID'
secret= 'YOUR_FOURSQUARE_CLIENT_SECRET'
version= '20201028'

Create function that will search for nearby venues in Foursquare using dataset coordinates

In [None]:
def getNearbyVenues(names, latitudes, longitudes):
    radius=500
    LIMIT=100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # Use Foursquare's base URI followed by 'venue' and Foursquare credentials to form the API request
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            ID, 
            secret, 
            version, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Create the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
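
The request and parsing steps inside `getNearbyVenues` can be tried without network access. The sketch below is not part of the original notebook: `build_explore_url` and `extract_venues` are hypothetical helper names, and `sample` is a hand-written dictionary that only imitates the nesting the function above reads (`response` → `groups` → `items` → `venue`).

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; empty list if nothing was found."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # a venue without categories gets None instead of raising IndexError
        categories = venue.get('categories') or [{}]
        rows.append((venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0].get('name')))
    return rows

# Hand-written payload for illustration only (not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Coffee Shop'}]}}
]}]}}

print(build_explore_url('ID', 'SECRET', '20201028', 43.65, -79.38))
print(extract_venues(sample))
```

Guarding against a missing `groups` or `categories` entry keeps one sparse neighborhood from stopping the whole loop.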

In [None]:
toronto_venues = getNearbyVenues(names=toronto['Neighborhood'],
                                   latitudes=toronto['Latitude'],
                                   longitudes=toronto['Longitude']
                                  )

In [None]:
toronto_venues.head()

Use groupby and count to determine the number of venues returned for each 'Neighborhood'

In [None]:
toronto_venues.groupby('Neighborhood').count()

In [None]:
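# one-hot encode the venue categories
y = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# a venue category may itself be named 'Neighborhood'; drop that dummy column if present
y.drop(columns=['Neighborhood'],errors='ignore',inplace=True)
# put the neighborhood name back as the first column
y.insert(loc=0, column='Neighborhood', value=toronto_venues['Neighborhood'] )
y.shape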

Group the rows by 'Neighborhood' and take the mean of each one-hot column, which gives the frequency of each venue category in that neighborhood

In [None]:
toronto_grouped = y.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()
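
To see why the mean of one-hot columns is a category frequency, here is a small sketch on made-up data. The neighborhood names 'A' and 'B' and the categories are hypothetical; `dtype=float` is passed so the dummy columns are numeric regardless of the pandas version's default.

```python
import pandas as pd

# Made-up venues: neighborhood A has two cafes and one park, B has one park
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of each 0/1 column within a neighborhood is that category's share
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

In this toy case two of A's three venues are cafes, so its 'Cafe' value is 2/3, while B's only venue is a park, so its 'Park' value is 1.0.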

Create a function to return the most common venue categories in each 'Neighborhood'

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
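
A quick check of the function on a single hand-written row may help; the row mimics one line of `toronto_grouped`, with the neighborhood name first and category frequencies after it. The values are invented for illustration.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the neighborhood name) and sort the frequencies
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Invented frequencies for one neighborhood
row = pd.Series({'Neighborhood': 'A', 'Cafe': 0.5, 'Park': 0.3, 'Bar': 0.2})

top = return_most_common_venues(row, 2)
print(list(top))
```

The function returns category names, not the frequencies themselves, which is why they can be written straight into the '1st Most Common Venue' style columns.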

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions past the third all take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
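
The try/except suffix lookup above works for the ten columns used here. If more columns were ever needed, a small helper could build the labels explicitly; `ordinal` is a hypothetical name introduced for this sketch, not something from the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11th, 12th and 13th are exceptions to the st/nd/rd pattern
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this should produce the same column names as the loop above.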

Cluster the neighborhoods by their venue-category frequencies using k-means

In [None]:
kclusters = 5

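# drop the label column so only numeric features remain (the positional axis argument was removed in pandas 2.0)
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')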

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10]

Label clusters and merge data with 'Neighborhood/Venue' dataset

In [None]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

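toronto_merged = toronto.copy()

# add the cluster label and top venues to each neighborhood's coordinates
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# neighborhoods with no venues have no cluster label; drop them and restore integer labels
toronto_merged = toronto_merged.dropna(subset=['Cluster Labels'])
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)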

toronto_merged.head()

Use geopy's **Nominatim** geocoder to get the coordinates of Toronto for centering the map

In [None]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="coursera_capstone_project")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

Use folium to draw the clustered neighborhoods on a map

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Color in clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Create markers for locations
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters