# Battle of Neighbourhoods - Week 1


This notebook is divided into two parts:
    
   **1 Introduction/Business Problem**
   
       1.a Discussion of the business problem and the audience who would be interested in this project.
   
   **2 Data Section**
       
       2.a What data is used?
       2.b Which libraries are needed to accomplish this task?
       

## 1. Introduction/Business Problem

### Discussion of the business problem and the audience who would be interested in this project.

#### Something about Toronto

Toronto, the capital of the province of Ontario, is one of Canada's major cities, located on the northwestern shore of Lake Ontario. This dynamic metropolis has a skyline of towering skyscrapers dominated by the iconic CN Tower. Toronto is also home to many green spaces, from the oval of Queen's Park to High Park, which covers about 1.6 km² and offers trails, sports facilities and a zoo.


#### Expected / Interested Audience


This project is intended for anyone who plans to move to Toronto and wants to know:
1. Which neighbourhoods have the cheapest houses for sale.
2. Which neighbourhoods have the least crime and are therefore best suited to family life.
3. Which places in a neighbourhood are most interesting to visit, and how the venues cluster.

## 2. Data Section

### 2.a What data is used?

1. Scrape <a href="https://www.point2homes.com/CA/Real-Estate-Listings/ON/Toronto.html?location=Toronto%2C+ON&search_mode=location&page=3&SelectedView=listings&LocationGeoId=783094&location_changed=&ajax=1">the Toronto listings on point2homes.com</a> to predict house sales in Toronto neighbourhoods.
2. Use <a href='http://data.torontopolice.on.ca/datasets/mci-2014-to-2018'>the MCI dataset</a>, which includes all Major Crime Indicators (MCI) occurrences from 2014 to 2018 by reported date and related offences, to predict crime in Toronto neighbourhoods.
3. Use <a href='https://developer.foursquare.com/'>Foursquare data</a> to explore the venues near a prospective house, such as churches, restaurants, bars, hotels, museums and memorials.

### 2.b Importing Libraries

In [2]:
import warnings
warnings.filterwarnings('ignore') 
import requests # HTTP library
import pandas as pd # for data analysis
import numpy as np  # vectorized data manipulation
# Matplotlib and associated plotting modules for visualization
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

import statsmodels.api as sm # statistical models
import time # pauses between requests
from geopy.geocoders import Nominatim  # geocoding addresses
import geopandas as gpd # spatial datasets
import seaborn as sns # plotting and visualization
from scipy import stats # statistical computation
from bs4 import BeautifulSoup # web scraping
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import re # regular expressions

#### 1. Scrape the point2homes.com Toronto listings to predict house sales in Toronto neighbourhoods

In [4]:
import requests
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.100.0'
headers={'User-Agent':user_agent} 

In [None]:
# https://www.point2homes.com/CA/Real-Estate-Listings/ON/Toronto.html?location=Toronto%2C+ON&search_mode=location&page=3&SelectedView=listings&LocationGeoId=783094&location_changed=&ajax=1
# http://data.torontopolice.on.ca/datasets/mci-2014-to-2018

In [None]:
long_list = []
lat_list = []
address_list = []
price_list= []
data_url_list = []
bed_list = []
bath_list = []
sqft_list = []
type_list = []
# iterate over result pages 1 to 147 (range() excludes its upper bound)
for index in range(1,148):
    url = 'https://www.point2homes.com/CA/Real-Estate-Listings/ON/Toronto.html?location=Toronto%2C+ON&search_mode=location&page={}&SelectedView=listings&LocationGeoId=783094&location_changed=&ajax=1'.format(index)
    page = requests.get(url,headers=headers)
    soup = BeautifulSoup(page.content,'html.parser')
    time.sleep(3)
    articles = soup.find_all('article')
           
    for article in articles:
        bed = article.find(class_ = 'ic-beds')
        if bed != None:
            bed_list.append(bed.get_text().replace('\n','').strip())
        else:
            bed_list.append('--')
        bath = article.find(class_ = 'ic-baths')
        if bath != None:
            bath_list.append(bath.get_text().replace('\n','').strip())
        else:
            bath_list.append('--')

        sqft = article.find(class_ = 'ic-sqft')
        if sqft != None:
            sqft_list.append(sqft.get_text().replace('\n','').strip())
        else:
            sqft_list.append('--')

        typ = article.find(class_ = 'ic-proptype')
        if typ != None:
            type_list.append(typ.get_text().replace('\n','').strip())
        else:
            type_list.append('--')
            
        price = article.find(class_ ='price')
        
        if price != None:
            price_list.append(price['data-price'])
        else:
            price_list.append('--')
        
        address = article.find(class_ ='item-address')
        if address != None:
            data_url_list.append(address['data-url'])
        else:
            # keep data_url_list aligned with the other lists
            data_url_list.append('--')
        
        inputs = article.find_all('input')
        if len(inputs) == 3:
            address_list.append(inputs[0]['value'])
            long_list.append(inputs[2]['value'])
            lat_list.append(inputs[1]['value'])
        elif len(inputs) == 2  :
            if 'ShortAddress' in inputs[0]['id']:
                address_list.append(inputs[0]['value'])
                long_list.append(0)
                lat_list.append(0)
            elif 'ShortAddress' in inputs[1]['id']:
                address_list.append(inputs[1]['value'])
                long_list.append(0)
                lat_list.append(0)
            else:
                address_list.append('--')
                long_list.append(0)
                lat_list.append(0)
        elif len(inputs) == 1  :
            if 'ShortAddress' in inputs[0]['id']:
                address_list.append(inputs[0]['value'])
                long_list.append(0)
                lat_list.append(0)
            else :
                address_list.append('--')
                long_list.append(0)
                lat_list.append(0)
        elif len(inputs) == 0:
            address_list.append('--')
            long_list.append(0)
            lat_list.append(0)
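
The `ic-beds`, `ic-baths`, `ic-sqft` and `ic-proptype` lookups in the loop above all follow the same find-then-fallback pattern. A minimal sketch of how that pattern could be factored out is given here. The helper names `build_page_url` and `text_or_default` are hypothetical and do not appear in the notebook; the query parameters are copied from the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.point2homes.com/CA/Real-Estate-Listings/ON/Toronto.html'


def build_page_url(page):
    """Return the listing URL for one result page (hypothetical helper)."""
    params = {
        'location': 'Toronto, ON',
        'search_mode': 'location',
        'page': page,
        'SelectedView': 'listings',
        'LocationGeoId': 783094,
        'location_changed': '',
        'ajax': 1,
    }
    # urlencode percent-encodes the comma and turns the space into '+'
    return BASE_URL + '?' + urlencode(params)


def text_or_default(node, default='--'):
    """Return the cleaned text of a parsed tag, or `default` when the tag is missing."""
    if node is None:
        return default
    return node.get_text().replace('\n', '').strip()


print(build_page_url(3))
```

With such a helper, a call like `bed_list.append(text_or_default(article.find(class_='ic-beds')))` could stand in for each four-line `if`/`else`, which also makes it harder to append to the wrong list by accident.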
        

In [None]:
# collect the lists in a dictionary before building the dataframe
data_dict = dict()
data_dict['address'] = address_list
data_dict['long'] = long_list
data_dict['lat'] = lat_list
data_dict['data_url'] = data_url_list
data_dict['price_$CAN'] = price_list
data_dict['beds'] = bed_list
data_dict['baths'] =bath_list
data_dict['sqft'] = sqft_list
data_dict['type'] = type_list
# convert dictionary to  dataframe
house_data = pd.DataFrame(data_dict)
# save dataframe as csv file
house_data.to_csv('./dataset/house_data.csv')
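
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length, which can happen whenever one branch of a scraping loop skips an append. A small check run before the conversion can name the columns that disagree. This is only a sketch: `find_length_mismatches` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import Counter


def find_length_mismatches(columns):
    """Return {name: length} for columns whose length differs from the most common one."""
    lengths = {name: len(values) for name, values in columns.items()}
    if not lengths:
        return {}
    # treat the most frequent length as the expected number of rows
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {name: n for name, n in lengths.items() if n != expected}


# illustrative data only: 'data_url' is one entry short
sample = {
    'address': ['a', 'b', 'c'],
    'price_$CAN': ['1', '2', '3'],
    'data_url': ['/x', '/y'],
}
print(find_length_mismatches(sample))
```

Calling `find_length_mismatches(data_dict)` before `pd.DataFrame(data_dict)` would return an empty dictionary when every list lines up.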

In [3]:
# use the dataframe
house_data = pd.read_csv('./dataset/house_data.csv')
print(house_data.shape)

(3525, 10)


In [4]:
print(house_data.loc[0,'data_url'])

/CA/Condo-For-Sale/ON/Toronto/Humber-Bay/33-Shore-Breeze-Dr/83854942.html
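
The path printed above appears to encode the listing type, city, neighbourhood and street in fixed positions. The following sketch splits such a path into labelled fields. The seven-part layout is an assumption inferred from this single example, and `parse_listing_path` is a hypothetical helper, so paths with a different shape return `None` rather than a guess.

```python
def parse_listing_path(path):
    """Split a listing path into labelled parts, or return None if the layout differs."""
    parts = path.strip('/').split('/')
    # assumed layout: country / listing type / province / city / neighbourhood / street / id.html
    if len(parts) != 7:
        return None
    country, listing_type, province, city, neighbourhood, street, page = parts
    return {
        'country': country,
        'listing_type': listing_type,
        'province': province,
        'city': city,
        'neighbourhood': neighbourhood,
        'street': street,
        'listing_id': page.rsplit('.', 1)[0],
    }


example = '/CA/Condo-For-Sale/ON/Toronto/Humber-Bay/33-Shore-Breeze-Dr/83854942.html'
print(parse_listing_path(example))
```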


#### 2. Use <a href='http://data.torontopolice.on.ca/datasets/mci-2014-to-2018'>the MCI dataset</a> (all Major Crime Indicators occurrences from 2014 to 2018, by reported date and related offences) to predict crime in Toronto neighbourhoods

In [34]:
mci_2014_2018 = pd.read_csv('./dataset/mci_2014_to_2018.csv')
print(mci_2014_2018.shape)

(167525, 29)


In [35]:
set(mci_2014_2018.Neighbourhood)

{'Agincourt North (129)',
 'Agincourt South-Malvern West (128)',
 'Alderwood (20)',
 'Annex (95)',
 'Banbury-Don Mills (42)',
 'Bathurst Manor (34)',
 'Bay Street Corridor (76)',
 'Bayview Village (52)',
 'Bayview Woods-Steeles (49)',
 'Bedford Park-Nortown (39)',
 'Beechborough-Greenbrook (112)',
 'Bendale (127)',
 'Birchcliffe-Cliffside (122)',
 'Black Creek (24)',
 'Blake-Jones (69)',
 'Briar Hill-Belgravia (108)',
 'Bridle Path-Sunnybrook-York Mills (41)',
 'Broadview North (57)',
 'Brookhaven-Amesbury (30)',
 'Cabbagetown-South St.James Town (71)',
 'Caledonia-Fairbank (109)',
 'Casa Loma (96)',
 'Centennial Scarborough (133)',
 'Church-Yonge Corridor (75)',
 'Clairlea-Birchmount (120)',
 'Clanton Park (33)',
 'Cliffcrest (123)',
 'Corso Italia-Davenport (92)',
 'Danforth (66)',
 'Danforth East York (59)',
 'Don Valley Village (47)',
 'Dorset Park (126)',
 'Dovercourt-Wallace Emerson-Junction (93)',
 'Downsview-Roding-CFB (26)',
 'Dufferin Grove (83)',
 'East End-Danforth (62)',
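
Each label in the output above combines a neighbourhood name with a numeric identifier in parentheses. Separating the two makes it easier to join the crime data with other tables keyed on a neighbourhood number. The sketch below uses a regular expression for this; `split_neighbourhood` is a hypothetical helper, and whether the number matches the identifiers used in other datasets is an assumption that would need checking.

```python
import re

# a name, optional whitespace, then digits in parentheses at the end of the label
HOOD_PATTERN = re.compile(r'^(?P<name>.+?)\s*\((?P<hood_id>\d+)\)$')


def split_neighbourhood(label):
    """Return (name, id) for labels like 'Annex (95)'; id is None if no number is found."""
    match = HOOD_PATTERN.match(label.strip())
    if match is None:
        return label.strip(), None
    return match.group('name'), int(match.group('hood_id'))


for label in ['Agincourt North (129)', 'Cabbagetown-South St.James Town (71)']:
    print(split_neighbourhood(label))
```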
 

In [34]:
from geopy.geocoders import Nominatim  # for geocoders referencing

In [53]:
CLIENT_ID = 'XXXXXXX' # your Foursquare ID
CLIENT_SECRET = 'XXXXXXX' # your Foursquare Secret
VERSION = '20191028'
LIMIT = 150

# coordinates of the first house in the dataset
latitude = house_data.loc[0, 'lat']
longitude = house_data.loc[0, 'long']
toronto='Toronto location : {},{}'.format(latitude,longitude)
print(toronto)


Toronto location : 43.623426,-79.47897900000001


In [54]:
# Querying for hotels and restaurants

search_query = 'hotel'
search_query_res = 'restaurant'

radius = 1000
url_hotel = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url_restaurant = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query_res, radius, LIMIT)
#url
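
The two URLs above differ only in the `query` value, so the string formatting can be wrapped in one function. The sketch below builds the same query string with `urllib.parse.urlencode` from the standard library. `build_search_url` is a hypothetical helper, and the placeholder credentials are not real.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'


def build_search_url(client_id, client_secret, lat, lng, query,
                     radius=1000, limit=150, version='20191028'):
    """Assemble a venue-search URL from its parameters (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'query': query,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in `ll` literal instead of percent-encoding it
    return SEARCH_ENDPOINT + '?' + urlencode(params, safe=',')


for term in ['hotel', 'restaurant']:
    print(build_search_url('ID', 'SECRET', 43.623426, -79.478979, term))
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids assembling the URL by hand.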

In [55]:
results_hotel = requests.get(url_hotel).json()
results_restaurant = requests.get(url_restaurant).json()


In [57]:
results_restaurant

{'meta': {'code': 200, 'requestId': '5e502bdc216785001c9065bf'},
 'response': {'venues': [{'id': '4e6fac43aeb74b05755eceb8',
    'name': 'Hoaikung Restaurant',
    'location': {'address': '716 The Queensway',
     'crossStreet': 'Royal York Road',
     'lat': 43.624089,
     'lng': -79.49124,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.624089,
       'lng': -79.49124}],
     'distance': 990,
     'postalCode': 'M8Y 1L3',
     'cc': 'CA',
     'city': 'Etobicoke',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['716 The Queensway (Royal York Road)',
      'Etobicoke ON M8Y 1L3',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d142941735',
      'name': 'Asian Restaurant',
      'pluralName': 'Asian Restaurants',
      'shortName': 'Asian',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/asian_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1582312426',
    'hasPerk': False},
   {'id': '

In [64]:
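# the venue records sit under response -> venues in the Foursquare reply
venues_restaurant = results_restaurant['response']['venues']
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
dataframe_restaurant = pd.json_normalize(venues_restaurant)
print("There are {} restaurants  at this location".format(dataframe_restaurant.shape[0]))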

There are 3 restaurants  at this location


In [70]:
import folium # map rendering library
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe_restaurant.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe_restaurant.loc[:, filtered_columns]

# function that extracts the category of the venue

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # fall back to the nested column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

    
# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

  
# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

#dataframe_filtered
resto_df=dataframe_filtered[['name','categories','distance','lat','lng','id']]
resto_df.head()

|   | name | categories | distance | lat | lng | id |
|---|---|---|---|---|---|---|
| 0 | Hoaikung Restaurant | Asian Restaurant | 990 | 43.624089 | -79.49124 | 4e6fac43aeb74b05755eceb8 |
| 1 | Palace Pier Restaurant | American Restaurant | 1017 | 43.631868 | -79.474137 | 4ee93b13e3001405fced8c59 |
| 2 | La Vinia Restaurant | Spanish Restaurant | 1080 | 43.6163 | -79.488077 | 5169ed9de4b02cfb7f63597a |
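
The `distance` column appears to be measured in metres from the query point. One way to sanity-check it is to recompute the great-circle distance with the haversine formula. In the sketch below, the coordinates are copied from the results above and from the house location printed earlier; the mean Earth radius of 6,371 km is an assumption, and the API may use a different distance model.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # assumed mean Earth radius in metres


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


house = (43.623426, -79.478979)
# name -> (lat, lng, distance reported in the results)
venues = {
    'Hoaikung Restaurant': (43.624089, -79.49124, 990),
    'Palace Pier Restaurant': (43.631868, -79.474137, 1017),
    'La Vinia Restaurant': (43.6163, -79.488077, 1080),
}

for name, (lat, lng, reported) in venues.items():
    computed = haversine_m(house[0], house[1], lat, lng)
    print('{}: computed {:.0f} m, reported {} m'.format(name, computed, reported))
```

The recomputed values land within a few metres of the reported ones. Two of the three venues lie slightly beyond the requested `radius` of 1000, so an explicit filter on `distance` may be needed when a strict cutoff matters.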


In [77]:
import folium # map rendering library
resto_map = folium.Map(location=[latitude, longitude], zoom_start=16) # generate map centred around the Toronto house

# add a red circle marker to represent the core location of Toronto house 1
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Location House 1',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(resto_map)

resto_map

In [None]:
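# data_geom is not defined in this part of the notebook; it is assumed to hold
# x, y, HOODNUM and 'Total houses' columns
maps = folium.Map(location=[data_geom.y.mean(), data_geom.x.mean()], zoom_start=11)
# folium.Choropleth replaces the older Map.choropleth method
folium.Choropleth(
    geo_data=r'./dataset/simple.json',
    data=data_geom,
    columns=['HOODNUM', 'Total houses'],
    key_on='feature.properties.HOODNUM',  # folium expects the path to start with 'feature'
    fill_color='Reds',
    fill_opacity=0.9,
    line_opacity=0.4,
).add_to(maps)
maps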

In [None]:
# the neighbourhood is the sixth '/'-separated field of data_url,
# e.g. 'Humber-Bay' in /CA/Condo-For-Sale/ON/Toronto/Humber-Bay/...
# rows with a malformed data_url get NaN instead of raising an IndexError
house_data['neighborhood'] = house_data['data_url'].str.split('/').str[5]
neighbourhood = house_data.groupby('neighborhood').size().to_frame()
neighbourhood.sort_values(by=[0], ascending=False).head(25)

## Data Cleaning

In [None]:
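ax = data_geom.plot(figsize=(30, 15), cmap='Reds')
# label each polygon with its district code at the centroid; the text is passed
# positionally because newer matplotlib versions no longer accept the `s` keyword
for _, row in data_geom.iterrows():
    ax.annotate(row.district_code, xy=row.geometry.centroid.coords[0], ha='center', fontsize=12)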
