# Comparison of multicultural cities

## Introduction: Description & Discussion of the Background

New York City is one of the largest cities in the USA and its most densely populated major city. The city is the premier gateway for immigration to the USA. More than 800 languages are spoken in New York City, making it one of the most multicultural cities in the world.
Toronto is Canada's largest city, the fourth largest in North America, and home to a diverse population. That diversity reflects Toronto's role as an important destination for immigrants to Canada. Over 160 languages are spoken in the city, and Toronto is also one of the most multicultural cities in the world.
Los Angeles is one of the most stylish cities in the world. Its residents speak almost 100 languages, so the city offers a wide range of styles, dress and cuisines. Los Angeles is a welcoming place for immigrants. It is also one of the most multicultural cities in the world.

New York, Toronto and Los Angeles are all multicultural cities that attract thousands of people from all over the world. The cities share some similarities, but each may appeal to different people depending on lifestyle and preferences. In this project we compare these three multicultural cities to see how similar they are. There are many ways to compare cities. Here we use Foursquare data to compare venues (i.e. density, distribution and categories of venues) in these cities and in their clustered neighborhoods. We will use data science techniques to determine in which ways the cities are similar or dissimilar.
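
One way to make "how similar" concrete is to turn each city's venue categories into a frequency profile and compare the profiles with cosine similarity. The sketch below is only an illustration of that idea, not necessarily the measure used later in this project. The two category lists are hypothetical and the function names are made up for this example.

```python
from collections import Counter
from math import sqrt


def category_profile(venue_categories):
    """Return the relative frequency of each venue category in a city."""
    counts = Counter(venue_categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse category profiles (0 to 1)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = sqrt(sum(v * v for v in profile_a.values()))
    norm_b = sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical venue category lists, for illustration only
# (not real Foursquare results).
city_a = ["Coffee Shop", "Pizza Place", "Park", "Coffee Shop"]
city_b = ["Coffee Shop", "Park", "Sushi Restaurant", "Coffee Shop"]

similarity = cosine_similarity(category_profile(city_a), category_profile(city_b))
print(f"Cosine similarity: {similarity:.3f}")
```

A value near 1 means the two cities have almost the same mix of venue categories; a value near 0 means they share few categories.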

## Data Description

Based on the definition of our problem, the factors that will influence our decision are:
<br>
•	Population of the neighborhood/city
<br>
•	Population density of the neighborhood/city
<br>
•	Number of venue categories in the neighborhood/city
<br>
•	Density of venues in the neighborhood/city
<br>
We can use these factors to determine whether the cities are similar to each other.
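
As a rough sketch of how these factors could be computed with pandas, the example below derives venue counts, category counts and densities per neighborhood. Both input tables are hypothetical, the column names are chosen for this example, and the area unit (square miles) is an assumption.

```python
import pandas as pd

# Hypothetical inputs, for illustration only: one row per venue, plus a
# per-neighborhood table with population and area.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Park", "Cafe", "Bar", "Gym"],
})
neighborhoods = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Population": [12000, 8000],
    "Area": [1.5, 4.0],  # assumed to be in square miles
})

# Number of venues and number of distinct categories per neighborhood.
venue_stats = (
    venues.groupby("Neighborhood")["Venue Category"]
    .agg(["count", "nunique"])
    .rename(columns={"count": "venue_count", "nunique": "category_count"})
    .reset_index()
)

# Attach the venue statistics and derive the two density factors.
factors = neighborhoods.merge(venue_stats, on="Neighborhood", how="left")
factors["population_density"] = factors["Population"] / factors["Area"]
factors["venue_density"] = factors["venue_count"] / factors["Area"]
print(factors)
```

The left merge keeps neighborhoods that have no venues at all; their venue columns would simply be missing rather than dropped.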

The data sources for this problem are listed below:
<br>
I use the https://geo.nyu.edu/catalog/nyu_2451_34572 data to get the neighborhoods of New York City, along with the latitude and longitude coordinates of each neighborhood. Toronto's neighborhoods are listed on Wikipedia at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, and a separate CSV file contains the latitude and longitude information for Toronto. Los Angeles neighborhood information is available from SOCR Data LA Neighborhoods Data: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_LA_Neighborhoods_Data#SOCR_Data_-_Los_Angeles_City_Neighborhoods_Data.

I used the Foursquare API to get the most common venues in the neighborhoods of the three cities.
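
The sketch below shows one way such a venue lookup could be structured. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its usual response layout (`response` → `groups` → `items` → `venue`). The helper names, the version date, the radius and the limit are placeholders chosen for this example, and the sample payload is hand-written rather than a real API response.

```python
FOURSQUARE_EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_params(lat, lng, client_id, client_secret,
                         version="20180605", radius=500, limit=100):
    """Query parameters for one 'explore' request around a coordinate."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (placeholder value)
        "ll": f"{lat},{lng}",
        "radius": radius,        # search radius in meters (placeholder)
        "limit": limit,          # maximum venues per request (placeholder)
    }


def parse_explore_response(payload):
    """Flatten an explore response into one dict per venue."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories", [])
            rows.append({
                "Venue": venue["name"],
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
                "Venue Category": categories[0]["name"] if categories else None,
            })
    return rows


def fetch_venues(lat, lng, client_id, client_secret):
    """Call the API and parse the result (needs network and valid credentials)."""
    import requests  # imported here so the helpers above work without it
    params = build_explore_params(lat, lng, client_id, client_secret)
    response = requests.get(FOURSQUARE_EXPLORE_URL, params=params, timeout=10)
    response.raise_for_status()
    return parse_explore_response(response.json())


# Hand-written sample payload in the assumed layout, for illustration only.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": 40.0, "lng": -73.0},
                           "categories": [{"name": "Coffee Shop"}]}},
                {"venue": {"name": "Example Plaza",
                           "location": {"lat": 40.001, "lng": -73.001},
                           "categories": []}},
            ]
        }]
    }
}
rows = parse_explore_response(sample_payload)
print(rows)
```

Keeping the parsing separate from the network call makes it possible to check the flattening logic on a saved or hand-written response before spending API quota.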

In [4]:
!conda install -c conda-forge geopy --yes

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1c             |       h516909a_0         2.1 MB  conda-forge
    ca-certificates-2019.9.11  |       hecc5488_0         144 KB  conda-forge
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    certifi-2019.9.11          |           py36_0         147 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.49-py_0         conda-forge
    geopy:           1.20.0-py_0       conda-forge

The following packages will be UPDATED:

    ca-

In [20]:
!conda install -c conda-forge folium=0.5.0 --yes

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    folium-0.5.0               |             py_0          45 KB  conda-forge
    altair-3.2.0               |           py36_0         770 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    branca-0.3.1               |             py_0          25 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         868 KB

The following NEW packages will be INSTALLED:

    altair:  3.2.0-py36_0 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge


Downloading and Extracting Packages
folium-0.5.0         | 45 KB    

In [21]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0 exposes this as pd.json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Downloading and Cleaning New York City Data

In [22]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [23]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [24]:
ny_neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
ny_neighborhoods = pd.DataFrame(columns=column_names)

In [25]:
# collect one record per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in ny_neighborhoods_data:
    properties = data['properties']
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates'][:2]

    rows.append({'Borough': properties['borough'],
                 'Neighborhood': properties['name'],
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

ny_neighborhoods = pd.DataFrame(rows, columns=column_names)

In [26]:
ny_neighborhoods.head(5)

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---------|--------------|----------|-----------|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |


In [27]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(ny_neighborhoods['Borough'].unique()),
        ny_neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


### Downloading and Cleaning Toronto Data

In [29]:
toronto_df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]
# remove rows if Borough is Not assigned
toronto_df = toronto_df[toronto_df.Borough != 'Not assigned']
# rows will be combined into one row with the neighborhoods separated with a comma if Postcode is the same
df1=toronto_df.groupby('Postcode').agg({'Borough':'first',
                               'Neighbourhood': ', '.join}).reset_index()
# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
df1.loc[df1['Neighbourhood'] =='Not assigned', 'Neighbourhood'] = df1['Borough']

In [30]:
# read in csv file
df_part = pd.read_csv("http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv")
df_part=df_part.rename(columns={'Postal Code':'Postcode','Latitude':'Latitude','Longitude':'Longitude'})

In [31]:
# append latitude and longitude to each postcode
df_toronto = pd.merge(df1,df_part,on='Postcode')
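
`pd.merge` defaults to an inner join, so any postcode missing from the coordinates file would be dropped silently. The sketch below shows one way to check for that with a left join and `indicator=True`. The two small frames are stand-ins for `df1` and `df_part`, and the postcode `M9Z` is a made-up value used only to trigger the mismatch.

```python
import pandas as pd

# Stand-ins for df1 and df_part; "M9Z" is hypothetical and has no coordinates.
postcodes = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example Borough"],
})
coordinates = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every postcode; the _merge column records where each row matched.
checked = postcodes.merge(coordinates, on="Postcode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print("Postcodes without coordinates:", missing)
```

If the list is empty, the inner join and the left join give the same rows and nothing was lost.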

In [32]:
df_toronto.head(5)

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|----------|---------|---------------|----------|-----------|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [33]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(df_toronto['Borough'].unique()),
        df_toronto.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


### Downloading and Cleaning Los Angeles Data

In [18]:
Los_df = pd.read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_LA_Neighborhoods_Data#SOCR_Data_-_Los_Angeles_City_Neighborhoods_Data")[2]

In [19]:
Los_df.head(5)

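|   | LA_Nbhd | Income | Schools | Diversity | Age | Homes | Vets | Asian | Black | Latino | White | Population | Area | Longitude | Latitude |
|---|---------|--------|---------|-----------|-----|-------|------|-------|-------|--------|-------|------------|------|-----------|----------|
| 0 | Adams_Normandie | 29606 | 691 | 0.6 | 26 | 0.26 | 0.05 | 0.05 | 0.25 | 0.62 | 0.06 | 31068 | 0.8 | -118.30027 | 34.03097 |
| 1 | Arleta | 65649 | 719 | 0.4 | 29 | 0.29 | 0.07 | 0.11 | 0.02 | 0.72 | 0.13 | 31068 | 3.1 | -118.430015 | 34.240603 |
| 2 | Arlington_Heights | 31423 | 687 | 0.8 | 31 | 0.31 | 0.05 | 0.13 | 0.25 | 0.57 | 0.05 | 22106 | 1.0 | -118.320109 | 34.043611 |
| 3 | Atwater_Village | 53872 | 762 | 0.9 | 34 | 0.34 | 0.06 | 0.2 | 0.01 | 0.51 | 0.22 | 14888 | 1.8 | -118.265808 | 34.124908 |
| 4 | Baldwin_Hills/Crenshaw | 37948 | 656 | 0.4 | 36 | 0.36 | 0.1 | 0.05 | 0.71 | 0.17 | 0.03 | 30123 | 3.0 | -118.3667 | 34.01909 |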
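
Because this table already carries `Population` and `Area`, a population density per neighborhood can be derived directly. The sketch below uses two rows copied from the output above. The new column names (`Neighborhood`, `Density`) are chosen for this example, and the density is expressed in whatever area unit the source table uses.

```python
import pandas as pd

# Two rows copied from the table above, kept small for illustration.
la_sample = pd.DataFrame({
    "LA_Nbhd": ["Adams_Normandie", "Atwater_Village"],
    "Population": [31068, 14888],
    "Area": [0.8, 1.8],
})

# Readable names: replace the underscores used in the source table.
la_sample["Neighborhood"] = la_sample["LA_Nbhd"].str.replace("_", " ", regex=False)

# Residents per unit of area, in the source table's own area unit.
la_sample["Density"] = (la_sample["Population"] / la_sample["Area"]).round(1)
print(la_sample[["Neighborhood", "Population", "Area", "Density"]])
```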
