#  Capstone Project

## Introduction/Business Problem  

### Objective:
The goal of this machine learning project is to build a recommendation system that suggests neighbourhoods similar to a user's current one.
### Target Audience:
This project targets people who change jobs for better opportunities. A job change often means moving to a new neighbourhood, city, or country, and most people want the new location to offer the same services, entertainment, clubs, restaurants, and hangout places they are used to. The question is whether we can recommend the best-matching neighbourhoods near their new office.
### Introduction / Business Problem:
In this project we recommend neighbourhoods that are similar to the user's current neighbourhood in terms of available services. The same data can also be used to look for possible explanations of why a neighbourhood is popular, what causes complaints in another neighbourhood, or other neighbourhood-related questions.

#### Success criteria of the project are:
- define common cluster/class values for similar neighborhoods in London / New York
- deliver optimized model for these classes
- provide a list of similar neighborhoods within the chosen cities
- show the recommended neighborhood on a map


## Data Gathering, Cleansing and Exploratory Data Analysis

### Importing libraries

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

from bs4 import BeautifulSoup # library of Beautifulsoup

from folium.plugins import MarkerCluster 
import folium  # plotting library
 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import pgeocode
nomi = pgeocode.Nominatim('CA')


# libraries for displaying images
from IPython.display import Image, HTML

# transforming a JSON file into a pandas DataFrame
# (json_normalize is exposed at the top level since pandas 1.0)
from pandas import json_normalize
pd.set_option('display.max_columns', None)

print('Folium installed')
print('Libraries imported.')


colors = [
    'red',
    'blue',
    'gray',
    'darkred',
    'lightred',
    'orange',
    'beige',
    'green',
    'darkgreen',
    'lightgreen',
    'darkblue',
    'lightblue',
    'purple',
    'darkpurple',
    'pink',
    'cadetblue',
    'lightgray',
    'black'
]


CLIENT_ID = 'WJALSUPQARDIIVEU4NV1RWEEFGT0DZNNX0KQTVCSX5LZNIGI' # your Foursquare ID
CLIENT_SECRET = 'RUZPK1EYFBKDLKNTDF1QKHBOMGYWG3C0JICCP0Y1C5S2LCPK' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
radius = 500

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)


Folium installed
Libraries imported.
Your credentials:
CLIENT_ID: WJALSUPQARDIIVEU4NV1RWEEFGT0DZNNX0KQTVCSX5LZNIGI
CLIENT_SECRET:RUZPK1EYFBKDLKNTDF1QKHBOMGYWG3C0JICCP0Y1C5S2LCPK
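The cell above hard-codes the Foursquare credentials and prints them into the notebook output. One common alternative, sketched here under the assumption that the keys are exported as environment variables, is to read them at runtime and display only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are hypothetical and not part of the original notebook.

```python
import os


def load_credentials(env=None):
    """Read the Foursquare credentials from a mapping (os.environ by default).

    The variable names are hypothetical; use whatever names you export.
    """
    env = os.environ if env is None else env
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Keep the first few characters of a secret and star out the rest."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demo with a stand-in mapping instead of the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "ABCD1234",
    "FOURSQUARE_CLIENT_SECRET": "WXYZ5678",
}
demo_id, demo_secret = load_credentials(demo_env)
print("CLIENT_ID:", mask(demo_id))
print("CLIENT_SECRET:", mask(demo_secret))
```

In the notebook itself, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would then replace the two literal assignments.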


### Getting Wikipage Table Data (Web Scraping part)

In [2]:
# getting html text

wikipage = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

soup = BeautifulSoup(wikipage.text,"html.parser")

# find all <table> elements on the page
tables = soup.find_all("table")

# only the first table is needed
final_table = tables[0]

# table rows: calling a tag is shorthand for find_all, so row("td") returns the cells of a row
table_data = [[cell.text for cell in row("td")]
              for row in final_table("tr")]
# table headers
table_columns = [x.text for x in final_table("th")]
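To see what the two list comprehensions produce without fetching the live page, the same calls can be run on a small inline table. The HTML snippet below is made up for illustration and only mimics the header and cell layout assumed for the Wikipedia table.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the assumed layout of the Wikipedia table
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M1B</td><td>Scarborough</td><td>Rouge
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

# Calling a tag like a function is shorthand for tag.find_all(...)
rows = [[cell.text for cell in row("td")] for row in table("tr")]
headers = [th.text for th in table("th")]

# The header row has no <td> cells, so it yields an empty list
print(headers)
print(rows)
```

The empty first row is why the DataFrame is built from `table_data[1:]`, and the trailing newline in the last header and cell is why `'\n'` is stripped in the cleaning step.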

### Cleaning and formatting the cluster DataFrame

In [3]:
# Creating cluster dataframe using table data and table columns and removing '\n'
cluster_df = pd.DataFrame(table_data[1:],columns=table_columns).rename(columns={"Neighbourhood\n":"Neighbourhood"})

cluster_df.loc[:,"Neighbourhood"] = cluster_df["Neighbourhood"].str.replace('\n','').values

# Drop rows whose borough is not assigned
cluster_assigned_df = cluster_df.loc[cluster_df["Borough"]!="Not assigned",:].reset_index(drop=True)

# Join the neighbourhoods that share a postcode into one comma-separated string
cluster_assigned_df = cluster_assigned_df.groupby(["Postcode","Borough"],as_index=True).apply(lambda x : ','.join(x["Neighbourhood"].tolist()))

cluster_assigned_df = cluster_assigned_df.reset_index().rename(columns={0:"Neighbourhood"})

# If the neighbourhood is 'Not assigned', use the borough name instead
cluster_assigned_df.loc[:,"Neighbourhood"] = cluster_assigned_df.apply(lambda x : x["Borough"] if x["Neighbourhood"]=="Not assigned" else x["Neighbourhood"], axis=1).values
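A minimal sketch of the grouping step on made-up rows, assuming one row per (postcode, neighbourhood) pair as in the scraped table. It uses `agg` with `",".join` on the `Neighbourhood` column, which should give the same strings as the `apply`/`lambda` form and returns a named column directly, so no rename of column `0` is needed.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped table
raw = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M5A", "M7A"],
    "Borough": ["Scarborough", "Scarborough", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Rouge", "Malvern", "Harbourfront", "Not assigned"],
})

# One row per postcode, neighbourhoods joined into a comma-separated string
grouped = (
    raw.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
       .agg(",".join)
)

# Fall back to the borough name where no neighbourhood was assigned
not_assigned = grouped["Neighbourhood"] == "Not assigned"
grouped.loc[not_assigned, "Neighbourhood"] = grouped.loc[not_assigned, "Borough"]

print(grouped)
```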

####  Total Rows and Columns 

In [4]:
print("Rows : {}  Columns: {}".format(cluster_assigned_df.shape[0],cluster_assigned_df.shape[1]))

Rows : 103  Columns: 3


### Getting Latitude and Longitude and merging with cluster dataframe 

In [5]:
# Look up latitude and longitude for each postcode with pgeocode

def get_long_lat(postcode):
    """Take a single postcode string and return
    [latitude, longitude] for it (latitude first, despite the name),
    using the pgeocode library.
    The postcode M7R is hard-coded to fixed coordinates."""
    
    if postcode=="M7R":
        latitude = 43.637
        longitude = -79.616
    
    else:
        location = nomi.query_postal_code(postcode)
        latitude = location.latitude
        longitude = location.longitude
    return [latitude, longitude]

In [8]:
clusters_df = cluster_assigned_df.copy()

latt_longs = clusters_df["Postcode"].apply(get_long_lat)

clusters_df.loc[:,"Latitude"] = latt_longs.apply(lambda col: col[0])
clusters_df.loc[:,"Longitude"] = latt_longs.apply(lambda col: col[1])

# Showing first 5 rows 

clusters_df.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.8113 | -79.193 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.7878 | -79.1564 |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.7678 | -79.1866 |
| 3 | M1G | Scarborough | Woburn | 43.7712 | -79.2144 |
| 4 | M1H | Scarborough | Cedarbrae | 43.7686 | -79.2389 |
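The lookup function special-cases `M7R`, presumably because the library does not return usable coordinates for that code. A quick check for any other postcodes that came back without coordinates can be sketched as follows. The toy frame stands in for `clusters_df`, and the missing values for `M7R` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for clusters_df; the NaN row is invented for the demo
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M7R"],
    "Latitude": [43.8113, 43.7878, np.nan],
    "Longitude": [-79.193, -79.1564, np.nan],
})

# Rows where either coordinate is missing
missing = coords[coords[["Latitude", "Longitude"]].isna().any(axis=1)]
print(missing["Postcode"].tolist())
```

Running the same filter on the real frame before plotting would show whether any marker is about to be placed at an undefined location.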


### Visualizing clusters

In [9]:
some_map = folium.Map(location=(43.65,-79.38) , zoom_start=11)

# one colour per borough; wrap around if there are more boroughs than colours
for clr_i, city in enumerate(clusters_df.Borough.unique()):
    city_neighbs = clusters_df.loc[clusters_df.Borough == city,['Borough','Neighbourhood',"Postcode",'Latitude', 'Longitude']]
    clr = colors[clr_i % len(colors)]
    for br,nm,pc,lat,lng in city_neighbs.values:
        folium.CircleMarker(
            [lat, lng],
            radius=3,
            color=clr,
            fill=True,
            popup= "<br>Borough ==> {} And <br>Neighbourhood ==> {}".format(br,nm),
            fill_opacity=0.8
        ).add_to(some_map)

some_map

### Exploring the venues around each postcode

In [11]:
def get_category_type(row):
    # the category list is stored under 'categories' or 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [12]:
# search_query = borough_df["Borough"]
# search_url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
#     CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)

In [13]:
def explore_venues(neigbh , latitude , longitude):
    explore_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET,  VERSION, latitude, longitude,  radius,  LIMIT)

    result = requests.get(explore_url).json()
    
#     print(result.keys())
    venues = result["response"]["groups"][0]["items"]
    
    
#     print(len(venues))
    
    if len(venues)==0:
        return pd.DataFrame()
    
    nearby_venues = json_normalize(venues)

    # keep only the columns of interest; reindex fills columns missing from the response with NaN
    filtered_columns = ['venue.name', 'venue.categories','venue.location.city', 'venue.location.lat', 'venue.location.lng', 'venue.location.distance']
    nearby_venues = nearby_venues.reindex(columns=filtered_columns)

    # reduce the category list of each row to its first category name
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
    nearby_venues["venue.location.city"] = nearby_venues["venue.location.city"].ffill()

    # # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    
    nearby_venues.loc[:,"Neighbourhood"] = neigbh
    nearby_venues.loc[:,"latitude"] = latitude
    nearby_venues.loc[:,"longitude"] = longitude

    return nearby_venues
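`json_normalize` is what turns the nested venue records into the dotted column names (`venue.name`, `venue.location.lat`, and so on) that the function filters on. The sketch below runs it on two made-up records shaped like the fields the function reads; the values are invented and a real response contains many more fields. It selects columns with `reindex` rather than `.loc`, an assumption made here so that a field absent from the records yields NaN instead of an error.

```python
import pandas as pd

# Made-up records shaped like the fields explore_venues reads
items = [
    {"venue": {"name": "Example Cafe",
               "categories": [{"name": "Cafe"}],
               "location": {"lat": 43.65, "lng": -79.38, "distance": 120}}},
    {"venue": {"name": "Example Park",
               "categories": [],
               "location": {"city": "Toronto", "lat": 43.66,
                            "lng": -79.39, "distance": 300}}},
]

# Nested dicts become dotted column names; lists are kept as they are
flat = pd.json_normalize(items)

wanted = ["venue.name", "venue.categories", "venue.location.city",
          "venue.location.lat", "venue.location.lng", "venue.location.distance"]
flat = flat.reindex(columns=wanted)

# Keep the first category name, or None when the list is empty
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Drop the dotted prefixes: 'venue.location.lat' -> 'lat'
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```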

In [111]:
def get_top_five_common_places(borough):
    borough_venues = pd.DataFrame()
    for neigbh,latitude,longitude  in  clusters_df.loc[clusters_df["Borough"] == borough,["Neighbourhood","Latitude","Longitude"]].values:
        print(neigbh,latitude,longitude )
        venue_df = explore_venues(neigbh , latitude , longitude)

        borough_venues = pd.concat([borough_venues,venue_df],sort=False)


    borough_categories = pd.get_dummies(borough_venues["categories"], prefix="", prefix_sep="")
    borough_categories.loc[:,"Neighbourhood"] = borough_venues["Neighbourhood"].values
    borough_venues_grouped = borough_categories.groupby("Neighbourhood").mean()

    five_most_common_venues = borough_venues_grouped.T.apply(lambda x : x.sort_values(ascending=False).index[:5]).T
    
    five_most_common_venues.columns = ["1st Most Common Venue" ,"2nd Most Common Venue","3rd Most Common Venue","4th Most Common Venue","5th Most Common Venue"]
    
    five_most_common_venues.loc[:,"Borough"] = borough
    borough_venues_grouped.loc[:,"Borough"] = borough
    
    five_most_common_venues =five_most_common_venues.reset_index()
    borough_venues_grouped =borough_venues_grouped.reset_index()
    
    return five_most_common_venues,borough_venues_grouped
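The ranking logic can be checked on a toy table. This sketch uses made-up venues and keeps the top two categories per neighbourhood instead of five. The name `top_n` and the column labels are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up venues: one row per venue found in a neighbourhood
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B", "B"],
    "categories": ["Cafe", "Cafe", "Park", "Gym", "Gym", "Cafe"],
})

# One-hot encode the category, then average per neighbourhood
# to get the share of each category among that neighbourhood's venues
one_hot = pd.get_dummies(venues["categories"], prefix="", prefix_sep="", dtype=float)
one_hot["Neighbourhood"] = venues["Neighbourhood"].values
freq = one_hot.groupby("Neighbourhood").mean()

# For each neighbourhood (row), keep the names of the most frequent categories
top_n = 2
top = freq.apply(
    lambda row: pd.Series(row.sort_values(ascending=False).index[:top_n]),
    axis=1,
)
top.columns = [f"Most common #{i + 1}" for i in range(top_n)]
print(top)
```

Using `axis=1` on the frequency table avoids the double transpose while producing one row per neighbourhood.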

### Top 5 most common Venues

In [113]:
torronto_common_venues = pd.DataFrame()
torronto_grouped_venues = pd.DataFrame()


for borough in clusters_df.Borough.unique():
    print("---------------------------")
    print("Getting Data of {} ".format(borough))
    print()
    
    commn_df,grouped_df = get_top_five_common_places(borough)
    
    torronto_common_venues = pd.concat([torronto_common_venues,commn_df],sort=False)
    torronto_grouped_venues = pd.concat([torronto_grouped_venues,grouped_df],sort=False)
    torronto_grouped_venues = torronto_grouped_venues.fillna(0.0)

---------------------------
Getting Data of Scarborough 

Rouge,Malvern 43.8113 -79.193
Highland Creek,Rouge Hill,Port Union 43.7878 -79.1564
Guildwood,Morningside,West Hill 43.7678 -79.1866
Woburn 43.7712 -79.2144
Cedarbrae 43.7686 -79.2389
Scarborough Village 43.7464 -79.2323
East Birchmount Park,Ionview,Kennedy Park 43.7298 -79.2639
Clairlea,Golden Mile,Oakridge 43.7122 -79.2843
Cliffcrest,Cliffside,Scarborough Village West 43.7247 -79.2312
Birch Cliff,Cliffside West 43.6952 -79.2646
Dorset Park,Scarborough Town Centre,Wexford Heights 43.7612 -79.2707
Maryvale,Wexford 43.7507 -79.3003
Agincourt 43.7946 -79.2644
Clarks Corners,Sullivan,Tam O'Shanter 43.7812 -79.3036
Agincourt North,L'Amoreaux East,Milliken,Steeles East 43.8177 -79.2819
L'Amoreaux West 43.8016 -79.3216
Upper Rouge 43.834 -79.2069
---------------------------
Getting Data of North York 

Hillcrest Village 43.8015 -79.3577
Fairview,Henry Farm,Oriole 43.7801 -79.3479
Bayview Village 43.7797 -79.3813
Silver Hills,York Mill

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


Downsview Northwest 43.7568 -79.521
Victoria Village 43.7276 -79.3148
Bedford Park,Lawrence Manor East 43.7335 -79.4177
Lawrence Heights,Lawrence Manor 43.7223 -79.4504
Glencairn 43.7081 -79.4479
Downsview,North Park,Upwood Park 43.7137 -79.4869
Humber Summit 43.7598 -79.5565
Emery,Humberlea 43.7366 -79.5401
---------------------------
Getting Data of East York 

Woodbine Gardens,Parkview Hill 43.7063 -79.3094
Woodbine Heights 43.6913 -79.3116
Leaside 43.7124 -79.3644
Thorncliffe Park 43.7059 -79.3464
East Toronto 43.6872 -79.3368
---------------------------
Getting Data of East Toronto 

The Beaches 43.6784 -79.2941
The Danforth West,Riverdale 43.6803 -79.3538
The Beaches West,India Bazaar 43.6693 -79.3155
Studio District 43.6561 -79.3406
Business Reply Mail Processing Centre 969 Eastern 43.7804 -79.2505
---------------------------
Getting Data of Central Toronto 

Lawrence Park 43.7301 -79.3935
Davisville North 43.7135 -79.3887
North Toronto West 43.7143 -79.4065
Davisville 43.702 -7

### Cluster Neighbourhoods 

In [123]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [130]:
# set number of clusters
kclusters = 5

# keep only the numeric category frequencies as clustering features
c_toronto_grouped_clustering = torronto_grouped_venues.drop(columns=['Neighbourhood','Borough'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(c_toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 0, 2, 0, 0, 0, 0, 0, 0], dtype=int32)
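A small sketch of the same clustering call on made-up frequency vectors where the grouping is obvious by eye. The cluster numbers produced by k-means are arbitrary labels, so the interesting part is which rows share a label, not the label values themselves. `n_init=10` is set explicitly here as an assumption, since the default differs between scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up category frequencies: N1/N2 are cafe-heavy, N3/N4 are park-heavy
freq = pd.DataFrame(
    {"Cafe": [0.9, 0.8, 0.1, 0.0], "Park": [0.1, 0.2, 0.9, 1.0]},
    index=["N1", "N2", "N3", "N4"],
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)

# Attach the labels to the neighbourhood names
labels = pd.Series(model.labels_, index=freq.index, name="cluster_no")
print(labels)
```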

In [132]:
torronto_grouped_venues.loc[:,"cluster_no"] = kmeans.labels_

In [137]:
updt_clusters_df = clusters_df.merge(torronto_grouped_venues.loc[:,["Neighbourhood","cluster_no"]],on="Neighbourhood")

In [138]:
updt_clusters_df

| | Postcode | Borough | Neighbourhood | Latitude | Longitude | cluster_no |
|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern | 43.8113 | -79.1930 | 1 |
| 1 | M1E | Scarborough | Guildwood,Morningside,West Hill | 43.7678 | -79.1866 | 0 |
| 2 | M1G | Scarborough | Woburn | 43.7712 | -79.2144 | 1 |
| 3 | M1H | Scarborough | Cedarbrae | 43.7686 | -79.2389 | 2 |
| 4 | M1J | Scarborough | Scarborough Village | 43.7464 | -79.2323 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 94 | M9N | York | Weston | 43.7068 | -79.5170 | 1 |
| 95 | M9P | Etobicoke | Westmount | 43.6949 | -79.5323 | 0 |
| 96 | M9R | Etobicoke | Kingsview Village,Martin Grove Gardens,Richvie... | 43.6898 | -79.5582 | 0 |
| 97 | M9V | Etobicoke | Albion Gardens,Beaumond Heights,Humbergate,Jam... | 43.7432 | -79.5876 | 0 |
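Postcode `M1C`, which appears in the earlier coordinate table, is absent from the merged table, so the default inner merge dropped at least one postcode, presumably because no venues (and therefore no cluster label) were collected for it. A sketch of how such rows could be listed, using a left merge with `indicator=True` on made-up frames standing in for `clusters_df` and `torronto_grouped_venues`:

```python
import pandas as pd

# Stand-in for clusters_df
areas = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Neighbourhood": ["Rouge,Malvern",
                      "Highland Creek,Rouge Hill,Port Union",
                      "Guildwood,Morningside,West Hill"],
})

# Stand-in for the clustered venue table; one neighbourhood has no label
clustered = pd.DataFrame({
    "Neighbourhood": ["Rouge,Malvern", "Guildwood,Morningside,West Hill"],
    "cluster_no": [1, 0],
})

# A left merge keeps every area; _merge records where each row was found
checked = areas.merge(clustered, on="Neighbourhood", how="left", indicator=True)
unclustered = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unclustered)
```

Rows flagged `left_only` are the ones that would vanish from `updt_clusters_df`; they can then be reported or assigned a placeholder cluster instead of disappearing silently.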
