# Comparing 2 cities of dreams: Mumbai and New York

 Rikki Mohanty                  
 31 May, 2020

## 1. Introduction

### 1.1 Background

##### Let’s find out about the cities that are part of this project

Mumbai is the second most populous city in India and the seventh most populous city in the world with a population of 19.98 million in 2018. Mumbai is the financial, commercial and entertainment capital of India. 

New York City (NYC) is the most populous city in the United States, with an estimated 2019 population of 8,336,817 distributed over about 302.6 square miles (784 km²). New York City has been described as the cultural, financial, and media capital of the world.

I therefore chose these two cities to explore and compare their neighborhoods and to find out whether the cities share any similarities.

### 1.2 Target Audience

Which clients or stakeholder groups would be interested in this project?
1. People visiting these cities, who can make the most of their stay and find similar places for comfort.
2. Business people who want to invest. This analysis will give them an idea of where to invest.
3. People migrating to these cities, who will get a better idea of where to settle down and which places have the right resources.

## 2. Data Description

### 2.1 Data acquisition and cleaning 

1. I took the New York dataset from the Capstone Project and extracted the coordinates of each neighborhood.
2. For Mumbai, data is sparse and scattered across many sources, so I scraped the list of neighborhoods from this web page: http://zipcodepincode.com/India/Maharashtra/Mumbai/Mumbai/index.html. I used the requests and BeautifulSoup4 libraries to scrape it, and built a dataframe of pin codes and coordinates that were collected manually from the web.
3. I used the Foursquare API to explore and segment the neighborhoods of both cities.
4. The venues are then clustered using k-means. I then found the most common venues (MCV) and finally compared the MCV of both cities to look for similarities.
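
The "most common venues" step in the list above can be sketched in plain Python. This is only an illustration of the ranking idea, not the notebook's actual implementation: the venue pairs are made up to stand in for Foursquare results, and `most_common_venues` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical (neighborhood, venue category) pairs standing in for Foursquare results.
venues = [
    ("Wakefield", "Pizza Place"),
    ("Wakefield", "Pizza Place"),
    ("Wakefield", "Coffee Shop"),
    ("Andheri", "Indian Restaurant"),
    ("Andheri", "Indian Restaurant"),
    ("Andheri", "Indian Restaurant"),
    ("Andheri", "Coffee Shop"),
    ("Andheri", "Coffee Shop"),
    ("Andheri", "Gym"),
]


def most_common_venues(pairs, top_n=2):
    """Return {neighborhood: [top_n most frequent venue categories]}."""
    counts = {}
    for neighborhood, category in pairs:
        counts.setdefault(neighborhood, Counter())[category] += 1
    return {
        name: [category for category, _ in counter.most_common(top_n)]
        for name, counter in counts.items()
    }


mcv = most_common_venues(venues)

# Categories that appear in the top list of both neighborhoods
shared = set(mcv["Wakefield"]) & set(mcv["Andheri"])
print(mcv)
print(shared)
```

Comparing the top categories as sets is one simple way to look for overlap between a New York and a Mumbai neighborhood; the clustering step would operate on the full frequency profile rather than only the top few entries.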

### 2.2 Data with Examples

#### 2.2.1 Programming Section (Initial Processing: Scraping from the web and getting the coordinates)

##### New York: Importing the libraries, getting the coordinates, and refining the data

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans



print('Libraries imported.')

Libraries imported.


In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
# the 'features' list holds one entry per neighborhood
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}
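
The feature shown above follows the GeoJSON convention of storing a point as `[longitude, latitude]`, which is easy to get backwards. As a small sketch of how one feature maps to a dataframe row, the snippet below uses a trimmed copy of the Wakefield feature; `feature_to_row` is a hypothetical helper name, not part of the original notebook.

```python
def feature_to_row(feature):
    """Flatten one GeoJSON Point feature into a neighborhood record."""
    # GeoJSON stores coordinates as [longitude, latitude]
    lon, lat = feature["geometry"]["coordinates"]
    return {
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


# Trimmed copy of the first feature in the dataset
sample = {
    "type": "Feature",
    "geometry": {"type": "Point",
                 "coordinates": [-73.84720052054902, 40.89470517661]},
    "properties": {"name": "Wakefield", "borough": "Bronx"},
}

row = feature_to_row(sample)
print(row)
```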

In [5]:
# define the dataframe columns
column_names = ['Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
neighborhoods

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|


In [7]:
# Collect one record per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0).
rows = []
for data in neighborhoods_data:
    neighborhood_latlon = data['geometry']['coordinates']  # GeoJSON order: [longitude, latitude]
    rows.append({'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [8]:
neighborhoods.head()


| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 |
| 1 | Co-op City | 40.874294 | -73.829939 |
| 2 | Eastchester | 40.887556 | -73.827806 |
| 3 | Fieldston | 40.895437 | -73.905643 |
| 4 | Riverdale | 40.890834 | -73.912585 |


##### Mumbai: Importing the libraries, getting the coordinates, and refining the data

In [9]:
#importing the libraries
from bs4 import BeautifulSoup
import lxml       
import requests

In [10]:
# Scraping the website
source = requests.get("http://zipcodepincode.com/India/Maharashtra/Mumbai/Mumbai/index.html").text
soup = BeautifulSoup(source, 'html.parser')

table = soup.find("table", {'class' : 'table table-bordered'})
table_rows = table.tbody.find_all("tr")

# Collect the text of every cell, one list per table row
res = []
for tr in table_rows:
    cells = tr.find_all("td")
    row = [cell.text for cell in cells]
    res.append(row)
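
The row-extraction loop depends on the live page, so it is hard to check in isolation. As a rough, dependency-free sketch of the same idea (collect the text of every `td` in each `tr` and skip rows that contain no `td`, such as header rows), the snippet below uses the standard library's `html.parser` instead of BeautifulSoup. The HTML string and the `TableRowParser` class are made up for illustration and are not the real page's markup.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the stripped text of <td> cells, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells (e.g. header rows)
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Hypothetical markup, loosely shaped like a pin-code listing
html = """
<table class="table table-bordered">
  <tbody>
    <tr><th>Location</th><th>Pincode</th></tr>
    <tr><td>Agripada</td><td>400011</td></tr>
    <tr><td> Andheri </td><td>400053</td></tr>
  </tbody>
</table>
"""

parser = TableRowParser()
parser.feed(html)
print(parser.rows)
```

Stripping whitespace and dropping empty rows at this stage helps keep stray blanks out of the dataframe that is built from the result.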


In [11]:
# build the dataframe; rows stay in the order of the scraped page
df = pd.DataFrame(res, columns = ["Neighborhood","Postal Code"])
df.head()

| | Neighborhood | Postal Code |
|---|---|---|
| 0 | A I Staff Colony | 400029 |
| 1 | Agripada | 400011 |
| 2 | Airport (Mumbai) | 400099 |
| 3 | Ambewadi (Mumbai) | 400004 |
| 4 | Andheri | 400053 |


In [12]:
# Combine rows that share a postal code; sort=False keeps groups in order of first appearance
df_group = df.groupby(['Postal Code'], sort=False).agg(','.join)
df_new = df_group.reset_index()

df_new.head()

| | Postal Code | Neighborhood |
|---|---|---|
| 0 | 400029 | A I Staff Colony,Santacruz P&t Colony |
| 1 | 400011 | Agripada,Chinchpokli,Haines Road,Jacob Circle |
| 2 | 400099 | Airport (Mumbai),International Airport,Sahar P... |
| 3 | 400004 | Ambewadi (Mumbai),Charni Road,Chaupati,Girgaon... |
| 4 | 400053 | Andheri,Azad Nagar (Mumbai) |
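
To make the grouping step concrete, here is a plain-Python sketch of what `groupby(..., sort=False)` followed by a comma join does: neighborhoods that share a pin code are merged into one string, and groups appear in the order their pin code was first seen. The four input pairs are taken from the tables above; the variable names are illustrative.

```python
# (neighborhood, pin code) pairs in page order
rows = [
    ("A I Staff Colony", "400029"),
    ("Agripada", "400011"),
    ("Chinchpokli", "400011"),
    ("Santacruz P&t Colony", "400029"),
]

# dicts preserve insertion order, which mirrors sort=False
grouped = {}
for neighborhood, pin in rows:
    grouped.setdefault(pin, []).append(neighborhood)

combined = [(pin, ",".join(names)) for pin, names in grouped.items()]
print(combined)
```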


In [13]:

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage.
# The API key below is a placeholder; supply your own credentials to run this cell.
client_2c57d2b927814c1e9ded5712fefe02b9 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

body = client_2c57d2b927814c1e9ded5712fefe02b9.get_object(Bucket='capstoneproject-donotdelete-pr-stuswmbfrolyp4',Key='Mumbai_coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_3 = pd.read_csv(body)
df_data_3.head()


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | 400001 | 18.958 | 72.8214 |
| 1 | 400002 | 19.1599 | 72.9992 |
| 2 | 400003 | 19.1018 | 72.8904 |
| 3 | 400004 | 19.2072 | 72.8348 |
| 4 | 400005 | 19.1121 | 72.8611 |


In [14]:
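# Cast every column to str so 'Postal Code' matches the string pin codes in df_new
# (this also turns Latitude and Longitude into strings)
df_data_3 = df_data_3.astype(str)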

In [15]:
df_mumbai = pd.merge(df_new, df_data_3, how='left', on='Postal Code')
# remove the "Postal Code" column
df_mumbai.drop("Postal Code", axis=1, inplace=True)
df_mumbai.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | A I Staff Colony,Santacruz P&t Colony | 19.0797 | 72.8679 |
| 1 | Agripada,Chinchpokli,Haines Road,Jacob Circle | 19.1721 | 72.9483 |
| 2 | Airport (Mumbai),International Airport,Sahar P... | 18.9104 | 72.8198 |
| 3 | Ambewadi (Mumbai),Charni Road,Chaupati,Girgaon... | 19.2072 | 72.8348 |
| 4 | Andheri,Azad Nagar (Mumbai) | 19.163 | 72.8393 |
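
A left merge keeps every scraped pin code even when no coordinates are found for it, so it can be worth checking for unmatched rows. The sketch below is one possible check, not part of the original notebook: it uses toy stand-ins for `df_new` and `df_data_3`, casts only the key column to `str` (an assumption about how the key types could be aligned while keeping coordinates numeric), and uses the `indicator=True` option of `merge` to flag rows without a match.

```python
import pandas as pd

# Toy stand-ins for df_new and df_data_3; the missing pin code is deliberate.
left = pd.DataFrame({
    "Postal Code": ["400029", "400011", "400099"],
    "Neighborhood": ["A I Staff Colony", "Agripada", "Airport (Mumbai)"],
})
right = pd.DataFrame({
    "Postal Code": [400029, 400011],  # integers, as read_csv would infer
    "Latitude": [19.0797, 19.1721],
    "Longitude": [72.8679, 72.9483],
})

# Align only the key's dtype, leaving the coordinates numeric
right["Postal Code"] = right["Postal Code"].astype(str)

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge
merged = left.merge(right, on="Postal Code", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

Any pin codes listed in `unmatched` would end up with missing coordinates, which matters for later steps that query venues by latitude and longitude.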
