# Capstone Project - Battle of Neighborhoods Week1

# Part 1 - A description of the problem and a discussion of the background.

# 1.1. Background


Cities around the world host many kinds of venues, and the mix of venue categories in a city says a lot about its local culture. This project looks at two cities that attract visitors from all over the world: New York City, the most populous city of the United States and a major international center, and Toronto, the financial capital and most populous city of Canada.


# 1.2. Business Problem

My friend and I run a travel agency in Munich, Germany. We work as travel consultants for destinations around the world and try to offer our customers the best options and leave them satisfied with the result. Many people in Munich hold well-paid jobs and enjoy a high standard of living, and as a result many retirees receive comfortable pensions.
The main target audience of this research is older people, aged roughly 50 to 60 and above. Every month a number of them visit our travel agency and say that they want to travel to the USA or Canada. They ask us for advice on the details of the trip and for the locations in those countries that best match their preferences and tastes. To give them useful travel tips and the best options, from hotels to entertainment, we run a short survey that helps us learn about their preferences. We prepared the following survey questions:
1. How old are you, and what is the purpose of your trip?
2. Are you travelling with family or friends, or alone?
3. Are you interested in entertainment, such as going to pubs, clubs, the cinema or similar?
4. Are you interested in sports, and would you like to do winter sports?
5. Would you mainly like to visit old places, such as unexplored remote sites and historical museums?

For this research I will pick one borough from each of the two cities mentioned above: Brooklyn from New York City and York from Toronto, both among the best-known parts of their cities. Then, based on the customers’ answers to the survey questions, I will recommend which of the two places to visit.


# 1.3. Target Audience

As stated above, the main target audience is older people, aged roughly 50 to 60 and above, who visit our travel agency every month and want to travel to the USA or Canada.

# 1.4. Objective

The purpose of this project is to segment the neighborhoods of Brooklyn (New York City) and York (Toronto) into major clusters and to examine those clusters in order to find suitable travel destinations that match the preferences and tastes of our customers aged roughly 50 to 60 and above.

# Part 2 - A description of the data and how it will be used to solve the problem

# 2.1. Data Sources



In this project I will work with two datasets. The first, for New York City, lists the 5 boroughs, the neighborhoods in each borough, and the geographic coordinates (latitude and longitude) of each neighborhood. It is available at: https://geo.nyu.edu/catalog/nyu_2451_34572

The second dataset, for Toronto, lists the boroughs, the neighborhoods in each borough, and their postal codes. It is taken from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The geographical coordinates for each Toronto postal code are provided in the following csv file: https://cocl.us/Geospatial_data


# Import libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


# Download and explore the data set for New York City

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [3]:
#load the data
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Notice how all the relevant data is in the features key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [4]:
neighborhoods_data = newyork_data['features']

Let's take a look at the first item in this list.

In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
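As a small illustration of how one such feature maps to a flat record, here is a minimal sketch. The helper name `feature_to_row` and the trimmed `sample_feature` are assumptions made for this example only; the point to notice is that GeoJSON stores point coordinates in `[longitude, latitude]` order.

```python
def feature_to_row(feature):
    """Flatten one GeoJSON point feature into a flat record (hypothetical helper)."""
    properties = feature['properties']
    # GeoJSON stores point coordinates as [longitude, latitude]
    longitude, latitude = feature['geometry']['coordinates']
    return {
        'Borough': properties['borough'],
        'Neighborhood': properties['name'],
        'Latitude': latitude,
        'Longitude': longitude,
    }

# Trimmed copy of the first feature shown above
sample_feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point',
                 'coordinates': [-73.84720052054902, 40.89470517661]},
    'properties': {'name': 'Wakefield', 'borough': 'Bronx'},
}

row = feature_to_row(sample_feature)
print(row)
```

Swapping the two coordinates is an easy mistake to make, and it would place every neighborhood at the wrong spot on a map.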

The next task is essentially transforming this data of nested Python dictionaries into a pandas dataframe. So let's start by creating an empty dataframe.

In [8]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data, collect one record per neighborhood, and build the dataframe from those records.

In [10]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# Building from a list of records avoids DataFrame.append, which pandas 2.0 removed
neighborhoods = pd.DataFrame(rows, columns=column_names)

In [11]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


Let's segment and cluster only the neighborhoods in Brooklyn. So let's slice the original dataframe and create a new dataframe of the Brooklyn data.



In [12]:
brooklyn_data = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
brooklyn_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Brooklyn | Bay Ridge | 40.625801 | -74.030621 |
| 1 | Brooklyn | Bensonhurst | 40.611009 | -73.99518 |
| 2 | Brooklyn | Sunset Park | 40.645103 | -74.010316 |
| 3 | Brooklyn | Greenpoint | 40.730201 | -73.954241 |
| 4 | Brooklyn | Gravesend | 40.59526 | -73.973471 |


# Download and explore the data set for Toronto

Download the data, scrape the web page and transform the data into a pandas dataframe.

In [13]:
from bs4 import BeautifulSoup
import lxml

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
# Parse the HTML of the page
canada_df = BeautifulSoup(source, 'lxml')

In [14]:
column_names = ['Postal Code', 'Borough', 'Neighborhood']

# The first table inside the main content div holds the postal codes
content = canada_df.find('div', class_='mw-parser-output')
table = content.table.tbody

# Collect one record per table row; header rows have no "td" cells and are skipped
rows = []
for tr in table.find_all('tr'):
    cells = [td.text.strip('\n') for td in tr.find_all('td')]
    if len(cells) < 3:
        continue
    postcode = cells[0].replace('\n', ',')
    borough = cells[1].replace('\n', ',')
    neighborhood = cells[2].replace('/', ',')
    rows.append({'Postal Code': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

Toronto = pd.DataFrame(rows, columns=column_names)

# Drop postal codes without an assigned borough
Toronto = Toronto[Toronto['Borough'] != 'Not assigned'].reset_index(drop=True)

# Where the neighborhood is not assigned, fall back to the borough name
unassigned = Toronto['Neighborhood'] == 'Not assigned'
Toronto.loc[unassigned, 'Neighborhood'] = Toronto.loc[unassigned, 'Borough']

# Combine neighborhoods that share a postal code into one comma-separated string
df = Toronto.groupby(['Postal Code', 'Borough'])['Neighborhood'].apply(','.join).reset_index()

df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
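To make the cleaning rules used on the scraped table concrete, here is a minimal sketch on a tiny hand-made table. All postal codes, boroughs and neighborhoods in it are hypothetical placeholders, not rows from the Wikipedia page; only the three rules (drop unassigned boroughs, fall back to the borough name, join neighborhoods per postal code) follow the notebook.

```python
import pandas as pd

# Hypothetical rows that mimic the three cases found in the scraped table
raw = pd.DataFrame({
    'Postal Code': ['X1A', 'X2B', 'X2B', 'X3C'],
    'Borough': ['Not assigned', 'East', 'East', 'North'],
    'Neighborhood': ['Not assigned', 'Alpha', 'Beta', 'Not assigned'],
})

# 1. Drop postal codes that have no borough
clean = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2. Use the borough name where the neighborhood is missing
missing = clean['Neighborhood'] == 'Not assigned'
clean.loc[missing, 'Neighborhood'] = clean.loc[missing, 'Borough']

# 3. One row per postal code, neighborhoods joined with commas
grouped = (clean.groupby(['Postal Code', 'Borough'])['Neighborhood']
                .apply(', '.join)
                .reset_index())
print(grouped)
```

Assigning through `.loc` with a boolean mask writes into the dataframe itself, whereas chained indexing such as `df.iloc[i][2] = value` may silently modify a temporary copy.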


Then load the geographical coordinates from the csv file.

In [15]:
geo_csv = 'https://cocl.us/Geospatial_data'
df_geo = pd.read_csv(geo_csv)
df_geo.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Next, I merge the web-scraped data with the csv file of geographical coordinates.

In [16]:
df_geonew = pd.merge(df_geo, df)
df_geonew.head()

| | Postal Code | Latitude | Longitude | Borough | Neighborhood |
|---|---|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 | Scarborough | Malvern, Rouge |
| 1 | M1C | 43.784535 | -79.160497 | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | 43.763573 | -79.188711 | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | 43.770992 | -79.216917 | Scarborough | Woburn |
| 4 | M1H | 43.773136 | -79.239476 | Scarborough | Cedarbrae |
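The merge above relies on pandas finding the shared `Postal Code` column by itself. A slightly more defensive variant, sketched here on two rows copied from the tables above, names the key explicitly and asks pandas to check that each postal code occurs only once on either side. The variable names `coords`, `areas` and `merged` are assumptions made for this example.

```python
import pandas as pd

# Two rows copied from the tables above
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})
areas = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern, Rouge', 'Rouge Hill, Port Union, Highland Creek'],
})

# Name the join key and let pandas verify that it is unique on both sides
merged = pd.merge(coords, areas, on='Postal Code', how='inner', validate='one_to_one')
print(merged)
```

With `validate='one_to_one'`, pandas raises a `MergeError` if a postal code is duplicated in either table, so such a problem shows up immediately instead of as unexpected extra rows.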


In [17]:
df_geonew = df_geonew[['Postal Code', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]
df_geonew.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


Let's segment and cluster only the neighborhoods in York. So let's slice the original dataframe and create a new dataframe of the York data.

In [18]:
#slice the original dataframe and create a new dataframe of the York data.
toronto_data = df_geonew[df_geonew['Borough'] == 'York'].reset_index(drop=True)
print(toronto_data.shape)
toronto_data.head()

(5, 5)


| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M6C | York | Humewood-Cedarvale | 43.693781 | -79.428191 |
| 1 | M6E | York | Caledonia-Fairbanks | 43.689026 | -79.453512 |
| 2 | M6M | York | Del Ray, Mount Dennis, Keelsdale and Silverthorn | 43.691116 | -79.476013 |
| 3 | M6N | York | Runnymede, The Junction North | 43.673185 | -79.487262 |
| 4 | M9N | York | Weston | 43.706876 | -79.518188 |


# 2.1.2. Data Source


The Foursquare API will be used to obtain location data for the two areas of interest, Brooklyn in New York City and York in Toronto. This data will be used to explore the venues in the neighborhoods of Brooklyn and York, respectively.

The venues will provide the categories needed for my dataset analysis.



In [22]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version
LIMIT = 50

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET
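The settings above are the pieces needed to query venues around a neighborhood. As a hedged sketch, the function below assembles a request URL for the legacy Foursquare v2 `venues/explore` endpoint. The function name `build_explore_url`, the 500 m radius and the placeholder credentials are assumptions made for illustration, and the v2 endpoint may no longer be available for new Foursquare accounts. The code only builds the URL and sends no request.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=50):
    """Return a request URL for the legacy Foursquare v2 'explore' endpoint."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',   # latitude,longitude of the neighborhood
        'radius': radius,       # metres; 500 is an assumed value
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials; Bay Ridge coordinates taken from the Brooklyn table
url = build_explore_url('MY_ID', 'MY_SECRET', '20180605', 40.625801, -74.030621)
print(url)
```

Passing the parameters through `urlencode` keeps the query string correctly escaped, which matters for the comma inside the `ll` value.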


# 2.2 How data will be used to solve the problem

A multi-dimensional dataset covering two regions, Brooklyn in New York City and York in Toronto, will be used to find the main differences between them. Combined with the survey answers, these differences will let me advise my target customers on whether Brooklyn or York better matches their preferences.
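One possible way to compare the two regions, sketched here under stated assumptions, is to compute the share of each venue category per borough. The venue records below are hypothetical and stand in for what the Foursquare responses would provide; the column names `Borough` and `Venue Category` are also assumptions.

```python
import pandas as pd

# Hypothetical venue records; real ones would come from the Foursquare responses
venues = pd.DataFrame({
    'Borough': ['Brooklyn', 'Brooklyn', 'Brooklyn', 'York', 'York'],
    'Venue Category': ['Pizza Place', 'Bar', 'Pizza Place', 'Park', 'Museum'],
})

# One-hot encode the categories, then average per borough to get category shares
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Borough', venues['Borough'])
profile = onehot.groupby('Borough').mean()
print(profile)
```

Each row of `profile` sums to 1, so the two boroughs can be compared directly even if they return different numbers of venues. The same table, computed per neighborhood instead of per borough, could serve as input to the `KMeans` clustering imported earlier.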