# Business Problem

This report asks whether big multicultural cities tend to develop in similar ways, share a common cultural scene, and show common preferences among the people living there.

By analyzing how venues cluster by category, one can find whether a specific type of venue tends to be located in the city center or in its suburban areas.

By analyzing user preferences, one can draw conclusions about trends across different latitudes and longitudes, i.e. whether a venue's location makes it more or less popular when venues of the same category are located differently in the chosen datasets.
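
As a rough illustration of the first question, the sketch below measures how far venues of each category lie from a city center, using the haversine formula. The venue list, the category names, and the center coordinates are made-up assumptions for illustration, not data from this report.

```python
import math
from collections import defaultdict
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def median_distance_by_category(venues, center):
    """Map each category to the median distance of its venues from the center."""
    distances = defaultdict(list)
    for category, lat, lon in venues:
        distances[category].append(haversine_km(center[0], center[1], lat, lon))
    return {category: median(values) for category, values in distances.items()}


# Hypothetical center point and venues (assumed values, not taken from the datasets)
assumed_center = (40.7128, -74.0060)
sample_venues = [
    ("Coffee Shop", 40.7150, -74.0020),
    ("Coffee Shop", 40.7200, -74.0100),
    ("Park", 40.8500, -73.8800),
    ("Park", 40.8900, -73.9100),
]

result = median_distance_by_category(sample_venues, assumed_center)
for category, km in sorted(result.items(), key=lambda item: item[1]):
    print(f"{category}: {km:.1f} km from the assumed center")
```

A category with a small median distance would sit mostly in the center, a large one mostly in the suburbs. The median is used here only because it is less sensitive to a few outlying venues than the mean.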


This report uses the data for the two cities already explored: New York and Toronto.

New York City comprises five boroughs sitting where the Hudson River meets the Atlantic Ocean. At its core is Manhattan, a densely populated borough that’s among the world’s major commercial, financial and cultural centers. With an estimated 2018 population of 8,398,748 distributed over a land area of about 302.6 square miles (784 km²), New York is also the most densely populated major city in the United States. According to the latest census data, it is the only one of the five biggest cities in the US in which each of the four major racial and ethnic groups makes up at least 10 percent of the population. It is also a relatively young city, with most of the population aged between 18 and 55.

Toronto, on the other hand, is the capital of the province of Ontario and a major Canadian city along Lake Ontario’s northwestern shore. It is an international centre of business, finance, arts, and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world. Toronto covers an area of 630 square kilometres (243 sq miles) with many green spaces, from the orderly oval of Queen’s Park to the 400-acre High Park with its trails, sports facilities and zoo. It has a population of over 2.93 million, and people from over 250 ethnicities and 16 countries are represented in Toronto.


# Data

### 1. Data for New York - from the dataset provided in Module/Week 3

In [11]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

In [9]:
import json
import pandas as pd

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
nw_data = newyork_data['features']

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one row per neighborhood and build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in nw_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})
nw_dataset = pd.DataFrame(rows, columns=column_names)
nw_dataset.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [10]:
nw_dataset.shape

(306, 4)
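
A quick sanity check on the resulting frame can catch swapped coordinates or missing values before any venue lookups. The sketch below uses only the five rows shown above. The helper name `check_coordinates` and the bounding box values are assumptions for illustration (a loose box around the New York area), not part of the dataset.

```python
import pandas as pd

# The five rows printed by nw_dataset.head() above
sample = pd.DataFrame(
    {
        "Borough": ["Bronx"] * 5,
        "Neighborhood": ["Wakefield", "Co-op City", "Eastchester", "Fieldston", "Riverdale"],
        "Latitude": [40.894705, 40.874294, 40.887556, 40.895437, 40.890834],
        "Longitude": [-73.847201, -73.829939, -73.827806, -73.905643, -73.912585],
    }
)


def check_coordinates(df, lat_range, lon_range):
    """Return the rows whose coordinates are missing or fall outside the given ranges."""
    lat_ok = df["Latitude"].between(*lat_range)
    lon_ok = df["Longitude"].between(*lon_range)
    return df[~(lat_ok & lon_ok)]


# Loose, assumed bounding box around New York City
suspicious = check_coordinates(sample, lat_range=(40.4, 41.0), lon_range=(-74.3, -73.6))
print(len(suspicious), "rows outside the assumed bounding box")
```

If latitude and longitude were swapped while reading the GeoJSON, every row would be flagged, which makes the mistake easy to spot.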

### 2. Data for Toronto - from the dataset created during the assignment in Module/Week 4

In [13]:
import types
import requests
import pandas as pd
from botocore.client import Config
import ibm_boto3
from bs4 import BeautifulSoup


website_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_html, 'lxml')
postcodes_table = soup.find('table',{'class':'wikitable sortable'})


# Read every table row after the header into a list of cell texts
rows = []
for tr in postcodes_table.find_all('tr')[1:]:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    rows.append(row)


toronto_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
# Remove the newline characters carried over from the table cells
toronto_df['Neighborhood'] = toronto_df['Neighborhood'].str.replace('\n', '', regex=False)
# Drop postal codes that have no assigned borough
toronto_df = toronto_df.drop(toronto_df[toronto_df.Borough == 'Not assigned'].index)
# Where only the neighborhood is unassigned, fall back to the borough name
toronto_df.loc[toronto_df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = toronto_df['Borough']
# Combine neighborhoods that share a postal code into one comma-separated row
toronto_df = toronto_df.groupby(['PostalCode'], as_index=False).agg({'Borough': 'first', 'Neighborhood': ', '.join})


def __iter__(self): return 0

# @hidden_cell
client_add727566460447cb4943451ba9a2c4d = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=<ID>,
    ibm_auth_endpoint="https://iam.eu-gb.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

body = client_add727566460447cb4943451ba9a2c4d.get_object(Bucket='courseraibmdatascience-donotdelete-pr-2dj0rbysng8mnt',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
geo_data_df = pd.read_csv(body)


toronto_df = pd.merge(toronto_df, geo_data_df, left_on=['PostalCode'], right_on = ['Postal Code'], how = 'left').drop('Postal Code', axis=1)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [14]:
toronto_df.shape

(103, 5)
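
Because the coordinates are attached with a left join, a postal code that is absent from the coordinates file would silently end up with missing latitude and longitude. One way to check for that is the `indicator` option of `pd.merge`. The sketch below uses small made-up frames, and the `M9Z` code is a hypothetical value chosen only to show an unmatched row.

```python
import pandas as pd

# Made-up stand-ins for toronto_df and geo_data_df; "M9Z" is a hypothetical code
codes = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# indicator=True adds a "_merge" column saying where each row came from
merged = pd.merge(
    codes,
    coords,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",
    indicator=True,
)

# Rows marked "left_only" had no match in the coordinates table
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched)
```

An empty list here would mean that every postal code received coordinates.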