# Capstone Project - The Battle of Neighbourhoods

# Introduction / Business Problem

Which locality, across all cities in the United States, would be the best place to start a water park?   
I met a businessman who is interested in opening a water park in the best locality among all US cities.   
He defines the best locality using the following criteria:   
1. Population density of the locality
2. Per capita income
3. Population of each location
4. Venues in each locality   

The venue categories he is interested in are:

1. Arts and Entertainment
2. Shops & Service
3. College and University
4. Event
5. Food
6. Nightlife Spot
7. Outdoors & Recreation
8. Professional & Other Places
9. Residence
10. Travel & Transport

# Data

To help set up a water park, we will get data from the following sources:   
List of US cities with population density and coordinates: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population   

List of US counties with per capita income: https://en.wikipedia.org/wiki/List_of_United_States_counties_by_per_capita_income   

The Foursquare API, to get the following:
1. List of venues in each city
2. List of venues in each locality of the selected city

Using this data, we first select the best city based on its population density and its venues (each venue is weighted by its category), and then check the per capita income for that city's state.

Once a city is selected, we look for the best locality within it using the same approach, i.e. based on the scores of the venues in each locality.

In [116]:
# Importing all the libraries we will need for the analysis

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# for webscraping import Beautiful Soup 
from bs4 import BeautifulSoup

import xml

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


# Extracting the content of the wiki page 'List of US cities by population' as text

In [118]:
link = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'
page = requests.get(link) 
soup = BeautifulSoup(page.text, 'html.parser') # name the parser explicitly to avoid the "no parser specified" warning

### Finding the table that has the data we need, i.e. the list of all cities with their population, area and location (coordinates)

In [119]:
table = soup.find_all('table')[4]

### Extracting the table from the webpage into a data frame by specifying the column names

In [130]:
table_rows = table.find_all('tr')
res = []
for tr in table_rows:
    cells = tr.find_all('td')
    # keep the non-empty cell texts (a separate loop variable avoids shadowing `tr`)
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)
df = pd.DataFrame(res, columns=["Rank", "City", "State", "del1", "del2", "del3", "Sq.Area", "del5", "population density in Sq Mi", "Population density in Km2", "Location"])
df.head()

Unnamed: 0,Rank,City,State,del1,del2,del3,Sq.Area,del5,population density in Sq Mi,Population density in Km2,Location
0,1,New York[d],New York,8398748,8175133,+2.74%,301.5 sq mi,780.9 km2,"28,317/sq mi","10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W﻿...
1,2,Los Angeles,California,3990456,3792621,+5.22%,468.7 sq mi,"1,213.9 km2","8,484/sq mi","3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°...
2,3,Chicago,Illinois,2705994,2695598,+0.39%,227.3 sq mi,588.7 km2,"11,900/sq mi","4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W﻿...
3,4,Houston[3],Texas,2325502,2100263,+10.72%,637.5 sq mi,"1,651.1 km2","3,613/sq mi","1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W﻿...
4,5,Phoenix,Arizona,1660272,1445632,+14.85%,517.6 sq mi,"1,340.6 km2","3,120/sq mi","1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°...


### Approximating the radius of each city from Sq.Area: this step preprocesses the Sq.Area column (converting it to float) and then takes its square root (a rough proxy: the side length, in miles, of a square with the same area)

In [131]:
new= df["Sq.Area"].str.split("s", n=1, expand = True)
new = new[0].str.replace(u'\xa0',u'')
df["Sq.Area"] = new.str.replace(',','')
df["Sq.Area"] = df["Sq.Area"].astype(float)
df["Radius"] = np.sqrt(df["Sq.Area"])
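
The square root of the area is only a rough stand-in for a radius. As a hedged alternative, the sketch below computes the radius of a circle with the same area and converts it from miles to metres. The function name `equal_area_radius_m` is hypothetical and is not part of the original notebook.

```python
import math

MILES_TO_METRES = 1609.344  # length of one international mile in metres


def equal_area_radius_m(area_sq_mi):
    """Radius in metres of a circle whose area equals `area_sq_mi` square miles."""
    radius_mi = math.sqrt(area_sq_mi / math.pi)  # from A = pi * r**2
    return radius_mi * MILES_TO_METRES


# New York's area (301.5 sq mi) as listed in the table above
new_york_radius_m = equal_area_radius_m(301.5)
```

For New York this gives a radius of a little under 16 km, somewhat smaller than the value the notebook uses. Either way it is only an approximation, because cities are not circular.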

### Dropping unnecessary columns (Rank, del1, del2 ... del5, Sq.Area, population density in Sq Mi) from the data frame we extracted from the webpage

In [132]:
df.drop(columns = ["Rank", "del1", "del2", "del3", "del5", "Sq.Area", "population density in Sq Mi"], inplace = True)

In [135]:
df.head()

Unnamed: 0,City,State,Population density in Km2,Location,Radius
0,New York[d],New York,"10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W﻿...,17.363755
1,Los Angeles,California,"3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°...,21.64948
2,Chicago,Illinois,"4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W﻿...,15.076472
3,Houston[3],Texas,"1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W﻿...,25.248762
4,Phoenix,Arizona,"1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°...,22.750824


### Splitting the coordinates into latitude and longitude for each city

In [137]:
#Splitting the location into latitudes and longitudes 
df["Location"]= df["Location"].str.split("/", n = 2, expand = True)[1]
df.head()

Unnamed: 0,City,State,Population density in Km2,Location,Radius
0,New York[d],New York,"10,933/km2",﻿40.6635°N 73.9387°W﻿,17.363755
1,Los Angeles,California,"3,276/km2",﻿34.0194°N 118.4108°W﻿,21.64948
2,Chicago,Illinois,"4,600/km2",﻿41.8376°N 87.6818°W﻿,15.076472
3,Houston[3],Texas,"1,395/km2",﻿29.7866°N 95.3909°W﻿,25.248762
4,Phoenix,Arizona,"1,200/km2",﻿33.5722°N 112.0901°W﻿,22.750824


In [138]:
new = df["Location"].str.split(" ", n = 0, expand = False)
k = df.copy(deep = True)

In [139]:
Latitude = []
Longitude = []
for i in range(len(new)):
    Latitude.append(new[i][1][:-2])
    Longitude.append(new[i][2][:-3]) 

k["Latitude"] = Latitude
k["Longitude"] = Longitude
k["Latitude"] = k["Latitude"].str.replace(u'\ufeff',u'')
k.drop(columns = ["Location"], inplace = True)
k.head()
df = k.copy(deep = True)

In [140]:
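df['Longitude'] = -df['Longitude'].astype(float) # all cities lie west of Greenwich, so longitudes are negative
df['Latitude'] = df['Latitude'].astype(float)
df['Radius'] = df['Radius']* 1000 # scale up for use as the Foursquare search radius (rough: Sq.Area is in sq mi, so this is not an exact conversion to metres)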
df.head()

Unnamed: 0,City,State,Population density in Km2,Radius,Latitude,Longitude
0,New York[d],New York,"10,933/km2",17363.755354,40.6635,-73.9387
1,Los Angeles,California,"3,276/km2",21649.480363,34.0194,-118.4108
2,Chicago,Illinois,"4,600/km2",15076.471736,41.8376,-87.6818
3,Houston[3],Texas,"1,395/km2",25248.762346,29.7866,-95.3909
4,Phoenix,Arizona,"1,200/km2",22750.824161,33.5722,-112.0901
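
The positional slicing above depends on the exact layout of the Location string. As a more defensive sketch, a regular expression can pull the decimal coordinates out directly and apply the sign from the hemisphere letter. The helper `parse_decimal_coords` is a hypothetical name, and the pattern assumes the "40.6635°N 73.9387°W" form shown in the output.

```python
import re

# Matches the decimal form, e.g. "40.6635°N 73.9387°W".
# The degrees-minutes-seconds form does not match, because its degree sign
# is followed by digits rather than a hemisphere letter.
_COORD = re.compile(r"(\d+(?:\.\d+)?)°([NS])\s+(\d+(?:\.\d+)?)°([EW])")


def parse_decimal_coords(text):
    """Return (latitude, longitude) as signed floats from a Location string."""
    match = _COORD.search(text)
    if match is None:
        raise ValueError("no decimal coordinates in {!r}".format(text))
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (1 if ns == "N" else -1)
    longitude = float(lon) * (1 if ew == "E" else -1)
    return latitude, longitude


# Stray byte-order marks around the value are ignored by the search
coords = parse_decimal_coords("\ufeff40.6635°N 73.9387°W\ufeff")
```

This also avoids negating every longitude unconditionally, which is only valid because all the listed cities are in the western hemisphere.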


# Getting the per capita income by county (with its state) for the USA

In [153]:
link1 = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_by_per_capita_income'
page1 = requests.get(link1) 
soup1 = BeautifulSoup(page1.text, 'html.parser') # name the parser explicitly to avoid the "no parser specified" warning

In [162]:
table = soup1.find_all('table')[2]

In [163]:
table_rows = table.find_all('tr')
res = []
for tr in table_rows:
    cells = tr.find_all('td')
    # keep the non-empty cell texts (a separate loop variable avoids shadowing `tr`)
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        res.append(row)
df_state = pd.DataFrame(res, columns=["Rank", "Country-equivalent", "State", "Per capita income", "del2", "del3", "Population", "del5"])
df_state.head()

Unnamed: 0,Rank,Country-equivalent,State,Per capita income,del2,del3,Population,del5
0,1,New York County,New York,"$62,498","$69,659","$84,627",1605272,736192
1,2,Arlington,Virginia,"$62,018","$103,208","$139,244",214861,94454
2,3,Falls Church City,Virginia,"$59,088","$120,000","$152,857",12731,5020
3,4,Marin,California,"$56,791","$90,839","$117,357",254643,102912
4,5,Alexandria City,Virginia,"$54,608","$85,706","$107,511",143684,65369


### Dropping unnecessary columns (Rank, del2, del3, del5) from the table we extracted from the webpage, which has population and per capita income

In [164]:
df_state.drop(columns = ['Rank','del2', 'del3', 'del5'], inplace = True)
df_state.head()

Unnamed: 0,Country-equivalent,State,Per capita income,Population
0,New York County,New York,"$62,498",1605272
1,Arlington,Virginia,"$62,018",214861
2,Falls Church City,Virginia,"$59,088",12731
3,Marin,California,"$56,791",254643
4,Alexandria City,Virginia,"$54,608",143684


### Plotting all the US cities that we extracted from the wiki page, using their coordinates

In [165]:
# create map of USA cities that we have using latitude and longitude values
map_tohood = folium.Map(location=[37.0902,-95.7129], zoom_start=3)

# add markers to map
for lat, lng, state, city in zip(df['Latitude'], df['Longitude'], df['State'], df['City']):
    label = '{}, {}'.format(city, state)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_tohood)  
    
map_tohood

# Define Foursquare credentials and version

In [166]:
# Real keys are redacted here; never publish or print API secrets
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 20 # maximum number of venues requested per city


### Function that queries the Foursquare API for each city in our data frame and collects the relevant fields of the JSON responses into a data frame

In [167]:
def getNearbyVenues(names, latitudes, longitudes, radius):
    
    venues_list=[]
    # `r` is the search radius of one city (avoids shadowing the `radius` argument)
    for name, lat, lng, r in zip(names, latitudes, longitudes, radius):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            r, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-city lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
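
The function above indexes straight into `["response"]['groups'][0]['items']`, so an empty or failed response would raise an exception. The sketch below shows one defensive way to flatten a response into rows. The helper `venue_rows` is hypothetical, and the payload shape is inferred only from the indexing used in the function above, not from the API reference.

```python
def venue_rows(city, lat, lng, payload):
    """Flatten an explore-style response into
    (city, lat, lng, venue, venue_lat, venue_lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to a placeholder when a venue has no category
        categories = venue.get("categories") or [{}]
        rows.append((
            city, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Hand-written payload in the assumed shape, using one venue from the sample output
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brooklyn Museum",
               "location": {"lat": 40.671521, "lng": -73.963677},
               "categories": [{"name": "Art Museum"}]}},
]}]}}

rows = venue_rows("New York", 40.6635, -73.9387, sample)
```

With this shape, a city for which the API returns nothing simply contributes no rows instead of stopping the whole loop.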

In [168]:
df_venues = getNearbyVenues(names = df['City'], latitudes = df['Latitude'],longitudes = df['Longitude'], radius = df['Radius'])
df_venues.head()

Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,New York[d],40.6635,-73.9387,Super Power,40.673952,-73.950184,Tiki Bar
1,New York[d],40.6635,-73.9387,Brooklyn Botanic Garden,40.667622,-73.963191,Botanical Garden
2,New York[d],40.6635,-73.9387,Covenhoven,40.675143,-73.960203,Beer Bar
3,New York[d],40.6635,-73.9387,Brooklyn Museum,40.671521,-73.963677,Art Museum
4,New York[d],40.6635,-73.9387,Kings Theatre,40.64611,-73.957175,Theater


### Assigning weights to some of the categories that the client wants to consider (these are sub-categories of the ones he initially listed)

In [171]:
k = df_venues.copy(deep = True)
weights_dict={'Movie Theater':3,'Beach':3,'Concert Hall':2.5,'Playground':3,'Coffee Shop':3.5,'Food Court':4,'Nightclub':4,'Toy / Game Store':4.5,'Theme Park Ride / Attraction':4,'Pub':4}
data = df_venues['Venue Category']
allVenues = data.astype (str)

In [172]:
# look up the weight of each venue's category; categories without a weight get 0
weights = [weights_dict.get(category, 0) for category in allVenues]
df_venues['weights'] = weights
df_venues.head()

Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,weights
0,New York[d],40.6635,-73.9387,Super Power,40.673952,-73.950184,Tiki Bar,0.0
1,New York[d],40.6635,-73.9387,Brooklyn Botanic Garden,40.667622,-73.963191,Botanical Garden,0.0
2,New York[d],40.6635,-73.9387,Covenhoven,40.675143,-73.960203,Beer Bar,0.0
3,New York[d],40.6635,-73.9387,Brooklyn Museum,40.671521,-73.963677,Art Museum,0.0
4,New York[d],40.6635,-73.9387,Kings Theatre,40.64611,-73.957175,Theater,0.0


In [173]:
# Dropping the rows that we are not giving any weight
df_venues.drop(df_venues[df_venues.weights < 1.0].index, inplace=True)
df_venues.head()

Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,weights
21,Los Angeles,34.0194,-118.4108,Blue Bottle Coffee,34.027115,-118.387637,Coffee Shop,3.5
28,Los Angeles,34.0194,-118.4108,Blue Bottle Coffee,34.05931,-118.419797,Coffee Shop,3.5
29,Los Angeles,34.0194,-118.4108,Blue Bottle Coffee,33.980027,-118.40802,Coffee Shop,3.5
39,Los Angeles,34.0194,-118.4108,iPic Theatres,34.059093,-118.441475,Movie Theater,3.0
57,Chicago,41.8376,-87.6818,Sawada Coffee,41.88373,-87.648726,Coffee Shop,3.5


### Copying only the relevant columns (city and weights), grouping the venues by city and calculating the mean weight for each city

In [174]:
citywise_venues_weights = df_venues[['City','weights']].copy()
citywise_venues_weights_means = citywise_venues_weights.groupby(['City']).mean()
citywise_venues_weights_means = citywise_venues_weights_means.reset_index(drop=False)
citywise_venues_weights_means.head()

Unnamed: 0,City,weights
0,Abilene,3.5
1,Alexandria[m],3.5
2,Allen,2.5
3,Amarillo,3.5
4,Anaheim,3.5
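
To make the group-and-average step concrete, here is a small library-free sketch of the same calculation. It uses the weighted Los Angeles rows shown in the earlier output; the function name `mean_weight_by_city` is hypothetical.

```python
from collections import defaultdict


def mean_weight_by_city(rows):
    """Average the weights per city from (city, weight) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum of weights, count]
    for city, weight in rows:
        totals[city][0] += weight
        totals[city][1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


# Three coffee shops (3.5 each) and one movie theater (3.0) in Los Angeles
rows = [("Los Angeles", 3.5), ("Los Angeles", 3.5),
        ("Los Angeles", 3.5), ("Los Angeles", 3.0)]
means = mean_weight_by_city(rows)
```

Note that zero-weight venues were dropped before averaging, so this mean reflects which kinds of weighted venues a city has, not how many of them there are.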


### Merging the table for which we calculated the means of weights city wise to the actual table that we got from the wiki page.

In [175]:
city_selection = pd.merge(df, citywise_venues_weights_means, on='City')
city_selection = city_selection[['City','Population density in Km2','weights']].copy()
city_selection.head()

Unnamed: 0,City,Population density in Km2,weights
0,Los Angeles,"3,276/km2",3.375
1,Chicago,"4,600/km2",3.5
2,Houston[3],"1,395/km2",2.5
3,Phoenix,"1,200/km2",3.5
4,Philadelphia[e],"4,511/km2",3.5


In [176]:
# Preprocessing the 'Population density in Km2' column, as we have to normalize these values
city_selection['Population density in Km2'] = (
    city_selection['Population density in Km2']
    .str.split('/').str[0]   # keep the number before '/km2'
    .str.replace(',', '')    # drop the thousands separators
    .astype(float)
)
city_selection.head()

Unnamed: 0,City,Population density in Km2,weights
0,Los Angeles,3276.0,3.375
1,Chicago,4600.0,3.5
2,Houston[3],1395.0,2.5
3,Phoenix,1200.0,3.5
4,Philadelphia[e],4511.0,3.5


# Normalizing our data frame

In [177]:
# Min-max normalizing both columns to the [0, 1] range
from sklearn import preprocessing
column_names_to_normalize = ['Population density in Km2', 'weights']
x = city_selection[column_names_to_normalize].values # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
city_selection[column_names_to_normalize] = x_scaled # assign the array directly, so no index alignment is involved
city_selection.head()

Unnamed: 0,City,Population density in Km2,weights
0,Los Angeles,0.470174,0.4375
1,Chicago,0.664224,0.5
2,Houston[3],0.194489,0.0
3,Phoenix,0.165909,0.5
4,Philadelphia[e],0.65118,0.5
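
Min-max scaling maps each value `v` to `(v - min) / (max - min)`. The sketch below reproduces the weights column by hand. The minimum of 2.5 appears in the output above, while the maximum of 4.5 is an assumption inferred from 3.5 being scaled to 0.5.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# 2.5 (Houston), 3.375 (Los Angeles), 3.5 (Chicago), 4.5 (assumed maximum)
scaled = min_max_scale([2.5, 3.375, 3.5, 4.5])
```

The first three results (0.0, 0.4375 and 0.5) match the normalized weights shown for Houston, Los Angeles and Chicago.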


### Calculating the sum of the normalized columns to find the city with the maximum sum, and concluding that a locality in that city would be the best fit

In [178]:
# idxmax returns the index label of the row with the largest sum
city_selection['sum'] = city_selection['Population density in Km2'] + city_selection['weights']
row_num = city_selection['sum'].idxmax()
city_name = city_selection.loc[row_num, 'City']
city_name


'Jersey City'

In [179]:
# Finding the state to which that city belongs
row = df.loc[df['City']== city_name].index[0]
state_name = df['State'].iloc[row]
state_name

'New Jersey'

### The client wants the state in which he sets up the Water Park to have a per capita income of at least 45,000 USD; let us check whether that is the case here

In [182]:
# checking the per capita income for New Jersey
# (the first matching row, i.e. the highest-ranked New Jersey county in the table)
p_row = df_state.loc[df_state['State'] == state_name].index[0]
per_capital_income = df_state['Per capita income'].iloc[p_row]
print("Per capita income of New Jersey is :", per_capital_income)

Per capita income of New Jersey is : $50,349
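
The value printed above is still a string, so the comparison with the client's threshold is done by eye. A minimal sketch of making the check explicit follows; `parse_usd` and `MIN_PER_CAPITA_INCOME` are hypothetical names introduced for illustration.

```python
MIN_PER_CAPITA_INCOME = 45000  # the client's minimum, in USD


def parse_usd(text):
    """Convert a string such as '$50,349' to an integer number of dollars."""
    return int(text.replace("$", "").replace(",", "").strip())


income = parse_usd("$50,349")
meets_threshold = income >= MIN_PER_CAPITA_INCOME
```

Doing the comparison in code keeps the decision reproducible if a different city is selected later.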


### Having settled on Jersey City, we checked the per capita income for its state. Since it is more than 45,000 USD, Jersey City is considered one of the best places to proceed with setting up a Water Park.

# Let us now check which location in Jersey City would be the best to start a Water Park

In [183]:
# Getting the coordinates of Jersey City
lat_newJercy = df['Latitude'].iloc[row]
long_newJercy = df['Longitude'].iloc[row]
print(lat_newJercy, long_newJercy)

40.7114 -74.0648


### Getting all the venues of Jersey City using the Foursquare API within our chosen radius and limit

In [184]:
# Getting the venues of Jersey City using the Foursquare API 

def getNearbyVenues1(name, latitudes, longitudes, radius):
    
    LIMIT = 150
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitudes, 
            longitudes, 
            radius, 
            LIMIT)
            
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']

    # use the function's own coordinates (not the leftover globals lat/lng from the earlier map loop)
    venues_list = [(name, latitudes, longitudes, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results]
    nearby_venues = pd.DataFrame(venues_list)
    nearby_venues.columns = ['City', 'Latitude', 'Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude','Venue Category']
    return nearby_venues


new_jersey_venues = getNearbyVenues1(name = 'Jersey City', latitudes = lat_newJercy ,longitudes = long_newJercy, radius = 2500)
new_jersey_venues.head()

Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Jersey City,40.7114,-74.0648,The Grind Shop,40.71167,-74.062872,Coffee Shop
1,Jersey City,40.7114,-74.0648,Harry’s Daughter,40.710904,-74.062071,Caribbean Restaurant
2,Jersey City,40.7114,-74.0648,Corgi Spirits at The Jersey City Distillery,40.708304,-74.064803,Distillery
3,Jersey City,40.7114,-74.0648,Hooked JC,40.714709,-74.067009,Fish & Chips Shop
4,Jersey City,40.7114,-74.0648,Liberty Science Center,40.707881,-74.055121,Science Museum


In [185]:
venues_in_newjersey = new_jersey_venues.copy(deep = True)
venues_in_newjersey.shape

(100, 7)

### Since we have got 100 venues, we now assign a weight to each venue category for better results

In [187]:

# copying the data frame and giving a weight to each category
# (a dict keeps only the last value given for a repeated key, so each category
# is listed once here with the value that actually took effect)

k = new_jersey_venues.copy(deep = True)
new_weightage_dict= {'Coffee Shop':1,
'Caribbean Restaurant':3,
'Distillery':2,
'Fish & Chips Shop':3,
'Science Museum':3,
'Latin American Restaurant':4,
'Restaurant':5,
'State / Provincial Park':1,
'Diner':1,
'Supermarket':1,
'Bar':3,
'Jazz Club':1,
'Golf Course':3,
'Park':2,
'Cajun / Creole Restaurant':2,
'Bakery':3,
'Go Kart Track':3,
'Taco Place':3,
'Hot Dog Joint':2,
'Food Truck':3,
'Beer Garden':3,
'Boutique':4,
'Café':2,
'Bagel Shop':1,
'Record Shop':1,
'Pizza Place':3,
'Ramen Restaurant':1,
'Wine Bar':3,
'Middle Eastern Restaurant':3,
'French Restaurant':2,
'Theater':2,
'Lounge':3,
'Wine Shop':3,
'Cocktail Bar':2,
'New American Restaurant':1,
'Residential Building (Apartment / Condo)':3,
'Pool':4,
'Burger Joint':2,
'Cheese Shop':1,
'Vietnamese Restaurant':2,
'Portuguese Restaurant':1,
'Ice Cream Shop':2,
'Italian Restaurant':3,
'Gym':1,
'Farmers Market':4,
'Bookstore':3,
'Asian Restaurant':5,
'Tea Room':1,
'Donut Shop':1,
'Historic Site':1,
'Gym / Fitness Center':3,
'Mexican Restaurant':2,
'Plaza':2,
'Gay Bar':2,
'College Administrative Building':3,
'American Restaurant':1,
'Chocolate Shop':1,
'Grocery Store':4,
'Frozen Yogurt Shop':2,
'Japanese Restaurant':2,
'Liquor Store':3,
'Fish Market':3,
'Indie Movie Theater':3,
'Modern European Restaurant':5,
'Poke Place':1,
'Brewery':1,
'Fried Chicken Joint':2,
'Pet Store':3}
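
A Python dict holds one value per key, so when a category appears more than once in a hand-written weight table, only the last value takes effect. The sketch below shows that behaviour and one way to spot repeated keys before building the dict. The sample pairs are illustrative, not taken from the table.

```python
from collections import Counter

# (category, weight) pairs as they might be typed in, with repeats
pairs = [("Bar", 1), ("Bakery", 2), ("Bar", 3), ("Bakery", 3)]

# Later pairs overwrite earlier ones, exactly as in a dict literal
effective = dict(pairs)

# Categories that were listed more than once
counts = Counter(category for category, _ in pairs)
repeated = sorted(category for category, n in counts.items() if n > 1)
```

Keeping the weights as a list of pairs until they have been checked makes conflicting entries visible instead of silently resolved.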

### Plotting all the venues that we got from the Foursquare API

In [188]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

# create map of the venues that we have using latitude and longitudes
venues_map = folium.Map(location=[lat_newJercy, long_newJercy], zoom_start=15) # generate map centred around Jersey city


# add Jersey City as a red circle mark
folium.features.CircleMarker(
    [lat_newJercy, long_newJercy],
    radius=10,
    popup='Jersey city',
    fill=True,
    color='red',
    fill_color='red',
    fill_opacity=0.6
    ).add_to(venues_map)

In [189]:
# add all the venues of Jersey City to the map as blue circle markers
for lat, lng, label in zip(venues_in_newjersey['Venue Latitude'], venues_in_newjersey['Venue Longitude'], venues_in_newjersey['Venue']):
    label=folium.Popup(label,parse_html=True)
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.6,
        parse_html = False).add_to(venues_map)
venues_map

### Assigning weights to each category, same as we gave for each city

In [190]:
# Look up the weight of each venue's category (0 for categories without a weight)

allVenuesinCity1 = k['Venue Category']

f_weights1 = [new_weightage_dict.get(category, 0) for category in allVenuesinCity1]
k['weights'] = f_weights1
k.head()

Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,weights
0,Jersey City,40.7114,-74.0648,The Grind Shop,40.71167,-74.062872,Coffee Shop,1
1,Jersey City,40.7114,-74.0648,Harry’s Daughter,40.710904,-74.062071,Caribbean Restaurant,3
2,Jersey City,40.7114,-74.0648,Corgi Spirits at The Jersey City Distillery,40.708304,-74.064803,Distillery,2
3,Jersey City,40.7114,-74.0648,Hooked JC,40.714709,-74.067009,Fish & Chips Shop,3
4,Jersey City,40.7114,-74.0648,Liberty Science Center,40.707881,-74.055121,Science Museum,3


In [191]:
# Averaging the venue coordinates and weights per category
# (only the numeric venue columns are selected; the city-level Latitude/Longitude are left out)

newframe = k.groupby('Venue Category')[['Venue Latitude', 'Venue Longitude', 'weights']].mean()
newframe

Unnamed: 0_level_0,Venue Latitude,Venue Longitude,weights
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
American Restaurant,40.715969,-74.041594,1
Australian Restaurant,40.717187,-74.044216,0
Bagel Shop,40.72299,-74.058068,1
Bakery,40.721297,-74.048836,3
Bar,40.718,-74.056019,3
Beer Garden,40.715149,-74.046633,3
Bookstore,40.719984,-74.043205,3
Boutique,40.717606,-74.044299,4
Brewery,40.72066,-74.040287,1
Burger Joint,40.724225,-74.048478,2


# Using the K-Means algorithm to cluster the venue categories and calculating the mean weight of each cluster to decide which cluster would be the best area for a Water Park

In [192]:
# Cluster the venue categories using the K-Means algorithm 
from scipy import stats
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Standardize the features so that no single column dominates the distance
clmns = ['weights','Venue Latitude', 'Venue Longitude']
df_tr_std = stats.zscore(newframe[clmns])
# Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
newframe['clusters'] = labels
# Add the cluster column to our list of columns
clmns.extend(['clusters'])
# Let's analyze the clusters
kframe = newframe[clmns].groupby(['Venue Category']).mean()
kframe = kframe.reset_index(drop = False)
kframe.head()

Unnamed: 0,Venue Category,weights,Venue Latitude,Venue Longitude,clusters
0,American Restaurant,1,40.715969,-74.041594,0
1,Australian Restaurant,0,40.717187,-74.044216,0
2,Bagel Shop,1,40.72299,-74.058068,0
3,Bakery,3,40.721297,-74.048836,2
4,Bar,3,40.718,-74.056019,2
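
K-Means groups points by Euclidean distance, so the features are standardized first. Otherwise the weights (which span several units) would swamp the coordinates (which differ by hundredths of a degree). The sketch below applies the same z-score idea in plain Python to the five rows shown above. The function `zscore` here is a hypothetical stand-in for `scipy.stats.zscore`, which also uses the population standard deviation by default.

```python
import statistics


def zscore(values):
    """Standardize values to zero mean and unit population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    return [(v - mean) / std for v in values]


# Weights and venue latitudes of the five categories shown in the output above
weights = [1, 0, 1, 3, 3]
latitudes = [40.715969, 40.717187, 40.72299, 40.721297, 40.718]

z_weights = zscore(weights)
z_latitudes = zscore(latitudes)
```

After standardization both columns have the same spread, so each contributes comparably to the distances that K-Means minimizes. A column with zero variance would need special handling, since this sketch would divide by zero.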


In [193]:
# Group by cluster and average the weights and coordinates of each cluster
finalWeight = kframe.groupby('clusters')[['weights', 'Venue Latitude', 'Venue Longitude']].mean()
finalWeight

Unnamed: 0_level_0,weights,Venue Latitude,Venue Longitude
clusters,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1.290323,40.720093,-74.046701
1,2.411765,40.711615,-74.066077
2,3.3,40.719682,-74.047152
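
The best cluster can also be picked programmatically rather than read off the table. The sketch below does this on a plain dict holding the three rows above; the variable names are illustrative. On the data frame itself, `finalWeight['weights'].idxmax()` would return the same cluster label.

```python
# Mean weight and centre of each cluster, copied from the table above
clusters = {
    0: {"weights": 1.290323, "lat": 40.720093, "lng": -74.046701},
    1: {"weights": 2.411765, "lat": 40.711615, "lng": -74.066077},
    2: {"weights": 3.3, "lat": 40.719682, "lng": -74.047152},
}

# Label of the cluster with the highest mean weight
best_cluster = max(clusters, key=lambda label: clusters[label]["weights"])
best_lat = clusters[best_cluster]["lat"]
best_lng = clusters[best_cluster]["lng"]
```

This selects cluster 2, whose centre is the location used in the next step.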


In [195]:
# Final coordinates for the water park: the centre of the cluster with the maximum mean weight (cluster 2) in the above data frame
lat1 = 40.719682
long1 = -74.047152

# As we have a location with the maximum weight, let us plot it on the map with a circle marker of radius 50, so that our client can set up his water park within that area

In [197]:
# create map of the venues that we have using latitude and longitudes
final_map = folium.Map(location=[lat1, long1], zoom_start=15) # generate map centred around Jersey city


# add the preferred location in the city as a green circle marker
folium.features.CircleMarker(
    [lat1, long1],
    radius=50, # note: a CircleMarker radius is measured in screen pixels, not metres
    popup='Water Park can be installed within this circle',
    fill=True,
    color='green',
    fill_color='green',
    fill_opacity=0.6
    ).add_to(final_map)
final_map

## So, we finally found a suitable place in Jersey City
This place is between Grove Street and Grand Street.

## Further enhancements and drawbacks of this approach:
This project can be enhanced by considering more attributes when defining the weights, and by extending the LIMIT and radius of the search used to fetch venues. Because the free trial of the Foursquare API limits the number of calls, we had to restrict our search to a small radius.

Another limitation is that we hard-coded the search radius around Jersey City to 2500 m, while the actual city radius may be more or less than that (in fact, more). If the search radius were instead larger than the city itself, the highest-weighted cluster could end up in a neighbouring city.
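
One way to reason about the hard-coded radius is to measure how far a candidate point lies from the search centre. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6,371 km. The function name `haversine_m` is hypothetical, and the two points are the Jersey City centre and the selected cluster centre from above.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Jersey City centre -> selected cluster centre
distance_m = haversine_m(40.7114, -74.0648, 40.719682, -74.047152)
inside_search_radius = distance_m <= 2500
```

The selected point lies roughly 1.75 km from the centre, so it is inside the 2500 m search area. The same check could be used to flag candidates that fall outside a city's real extent.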