# Capstone Project - The Battle of the Neighborhoods (Week 2)

## Introduction / Business Problem

In this project, we will try to find the optimal areas in Ahmedabad city: those where facilities such as a School, a Hospital and an Indian Restaurant are closest to the center of the area.

There are around 48 different areas in Ahmedabad city and hundreds of facilities in their proximity. The aim of this project is to pull out the 4 most desirable areas that satisfy the above criteria, so that stakeholders can choose among them.

### Import the required libraries
We will import the libraries that will be used for data collection, data normalization, analysis and getting the results.

In [37]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import matplotlib.pyplot as plt # library for plotting
from sklearn.cluster import KMeans # library for K-Means algorithm

# pd.json_normalize (pandas >= 1.0) is used below to transform JSON into a pandas dataframe
import json # library for json operations

#!conda install -c conda-forge folium=0.11.0 --yes
import folium # plotting library

print('Libraries imported.')

Libraries imported.


### Define the credentials to connect to the FourSquare API
Initialize the credentials of the FourSquare API that will be used later in the notebook.

**This is confidential information and hence will be masked when published for review by other users.**

In [19]:
CLIENT_ID = '<client_id>' # Foursquare ID
CLIENT_SECRET = '<client_secret>' # Foursquare Secret
VERSION = '20190531'
LIMIT = 500
RADIUS = 5000
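
Since the credentials are confidential, one option is to keep them out of the notebook entirely and read them from the environment. The following is a minimal sketch under that assumption; the function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and not part of the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names used here are hypothetical; any names can be chosen
    as long as they match what is exported in the shell.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        # Fail early instead of sending requests with empty credentials
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret
```

With this in place, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` could replace the hard-coded placeholders, and nothing needs to be masked before publishing.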

## Data

In this section we will collect, process/extract and normalize the data from these data sources:
* Coordinates of the cities within the country (csv file)
* List of facilities along with their category ids in FourSquare database (csv file)
* GeoJson file containing the areas and their data of Ahmedabad city (geojson file)
* Facilities information from FourSquare database (using APIs)

### Load country location data

The code uses the location data of cities within India. 

In [20]:
df_location = pd.read_csv(r"F:\Raj\Study\Coursera\IBM Datascience\Course 9. Applied Data Science Capstone\Capstone-project\India.csv")
df_location.head(3)

Unnamed: 0,city,lat,lng,country,iso2,admin,capital,population,population_proper
0,Mumbai,18.987807,72.836447,India,IN,Maharashtra,admin,18978000.0,12691836.0
1,Delhi,28.651952,77.231495,India,IN,Delhi,admin,15926000.0,7633213.0
2,Kolkata,22.562627,88.363044,India,IN,West Bengal,admin,14787000.0,4631392.0


### Define the city you want to use for analysis 

Define the city name (Ahmedabad) and retrieve the latitude and longitude of its center, around which the facilities/neighborhoods will be searched.

In [21]:
cityname = "Ahmedabad"

# Extract the location information of the given city (select columns by name, not position)
city_row = df_location[df_location['city'] == cityname].iloc[0]
latitude = city_row['lat']
longitude = city_row['lng']

print("Latitude and Longitude of", cityname, "city center are:", latitude, "and", longitude)

Latitude and Longitude of Ahmedabad city center are: 23.025793 and 72.587265


### Load facilities/neighborhood data

The code allows a customized list of facilities that are desired in the neighborhood. Load the list of facilities from the CSV file into a dataframe.

In [38]:
df_neighorhood = pd.read_csv(r"F:\Raj\Study\Coursera\IBM Datascience\Course 9. Applied Data Science Capstone\Capstone-project\Neighborhood-facilities v2.csv")
df_neighorhood.head()

Unnamed: 0,Neighborhood facilities,cat_id
0,Airport,4bf58dd8d48988d1eb931735
1,Bank,4bf58dd8d48988d10a951735
2,Bus station,4bf58dd8d48988d1fe931735
3,Bus Stop,52f2ab2ebcbc57f1066b8b4f
4,Cinema,4bf58dd8d48988d17f941735


### Define the list of facilities

In this section, we will define the list of desired facilities in the neighborhood. For the current scope, we are using 3 facilities: Hospital, Indian Restaurant and School.

In [39]:
# Choose the desired venues required in locality
nb1_name, nb1 = df_neighorhood['Neighborhood facilities'][10], (df_neighorhood['cat_id'][10]).replace("-",",")
nb2_name, nb2 = df_neighorhood['Neighborhood facilities'][12], (df_neighorhood['cat_id'][12]).replace("-",",")
nb3_name, nb3 = df_neighorhood['Neighborhood facilities'][21], (df_neighorhood['cat_id'][21]).replace("-",",")

nb = nb1 + ',' + nb2 + ',' + nb3

print("Category 1:", nb1_name)
print("Category 2:", nb2_name)
print("Category 3:", nb3_name)

#print("Category 1:", nb1_name, "\nCategory Ids:", nb1, "\n")
#print("Category 2:", nb2_name, "\nCategory Ids:", nb2, "\n")
#print("Category 3:", nb3_name, "\nCategory Ids:", nb3, "\n")
#print("All categories combined\nCategory Ids:", nb)

Category 1: Hospital
Category 2: Indian Restaurant
Category 3: School


## Methodology

After gathering all the required data sources, we start the process of processing, analysis and identification of candidate areas.

We will start by displaying the city map and highlighting the city center using plotting libraries. Once done, we will clearly highlight the different areas of the city using the GeoJSON file.

Now that we have a clear map view, we will extract the list of nearby facilities within a 5 km radius of each area center using the FourSquare API. We will plot these facilities on the map to understand the distribution of venues over the different areas.

The next step is to go through each area and check whether each facility type is available in its proximity. We then process the available data and find the distance to the nearest facility of each type. Once the distances are calculated, they tell us in which areas the nearest facility of every type is closest to the area center.

Finally, we pick the 4 optimal areas and plot them on the map.
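
The notebook takes the distance to each facility from the `location.distance` field returned by the API. If an independent check of those distances is wanted, a great-circle (haversine) distance between an area center and a venue could be computed as in the sketch below. The function name and the Earth radius constant are assumptions for illustration and are not used elsewhere in the notebook.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    # Haversine formula
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

For example, `haversine_m(area_lat, area_lng, venue_lat, venue_lng)` gives a value that can be compared against the API's reported distance for the same venue.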

### Plot different areas and city center coordinates on the map

1. Plot the different city areas using the GeoJSON file.
2. Plot the coordinates of the city center on the map.

In [27]:
# Define the map of the city on which the facilities and areas are to be plotted
venues_map = folium.Map(location=[latitude, longitude], zoom_start=11.5) # generate map centred around the city center

# Load the Geojson file
geojson_url = 'https://raw.githubusercontent.com/datameet/Municipal_Spatial_Data/master/Ahmedabad/Wards.geojson'
geojson_map = requests.get(geojson_url).json()

def map_style(feature):
    return { 'color': 'orange', 'fill': False }

# Define type of map layer and plot all the city areas on map  
folium.TileLayer('cartodbpositron').add_to(venues_map)
folium.GeoJson(geojson_map, style_function=map_style, name='geojson').add_to(venues_map)

# add the city center as a red circle marker
label = cityname + ' city center'
#folium.features.CircleMarker(
folium.CircleMarker(
    [latitude, longitude],
    radius=8,
    popup=label,
    fill=True,
    color='red',
    fill_color='red',
    fill_opacity=0.6
    ).add_to(venues_map)

venues_map

### Find out the coordinates of each area center

1. Find the lowest and highest coordinates of each area polygon and calculate the center of the area (the center of its bounding box)
2. Store the details in a dataframe for further use

In [28]:
columns = ['area name','lat','lng']

# Collect one record per area; DataFrame.append was removed in pandas 2.0,
# so the dataframe is built once from this list
area_records = []

for feature in geojson_map['features']:
    # Extract the area name from JSON
    area_name = feature['properties']['Name']

    try:
        # The first ring of the polygon holds [longitude, latitude] pairs
        ring = feature['geometry']['coordinates'][0]
        lat_list = [point[1] for point in ring] # latitude
        lng_list = [point[0] for point in ring] # longitude
    except (KeyError, IndexError, TypeError):
        print("exception occurred while extracting coordinates of", area_name)
        continue

    # Extract the lowest and highest latitude/longitude of given area from JSON
    min_lat = min(lat_list)
    max_lat = max(lat_list)
    min_lng = min(lng_list)
    max_lng = max(lng_list)

    # Calculate the central co-ordinates of given area
    center_lat = min_lat + ((max_lat - min_lat) / 2)
    center_lng = min_lng + ((max_lng - min_lng) / 2)

    # Store the area name and area center coordinates for further use
    area_records.append({'area name': area_name, 'lat': center_lat, 'lng': center_lng})

df_area = pd.DataFrame(area_records, columns=columns)

# Sort the dataframe by area name
df_area.sort_values(by=["area name"], inplace=True, ignore_index=True)
df_area.head()

Unnamed: 0,area name,lat,lng
0,01 GOTA,23.099003,72.520454
1,02 CHANDLODIA,23.1067,72.550119
2,03 CHANDKHEDA,23.114959,72.589485
3,04 SABARMATI,23.090009,72.597304
4,05 RANIP,23.086959,72.570963
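
The center used here is the midpoint of each polygon's bounding box, which is simple but can drift away from the visual middle of an irregularly shaped ward. The calculation can be sketched on a toy polygon as follows; `bbox_center` and the sample ring are hypothetical and only illustrate the arithmetic.

```python
def bbox_center(ring):
    """Return (lat, lng) of the bounding-box midpoint of a GeoJSON ring.

    `ring` is a list of [longitude, latitude] pairs, as stored in GeoJSON.
    """
    lngs = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    center_lat = (min(lats) + max(lats)) / 2
    center_lng = (min(lngs) + max(lngs)) / 2
    return center_lat, center_lng


# Toy rectangle: longitude spans 0..4, latitude spans 0..2
toy_ring = [[0, 0], [4, 0], [4, 2], [0, 2], [0, 0]]
toy_center = bbox_center(toy_ring)
print(toy_center)
```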


### FourSquare API
Here, we will extract the facilities data from FourSquare for each city area.

Iterate through each city area and perform the following operations:
1. Create FourSquare search URLs to fetch the results of these facilities (venues).
2. Execute the API request and retrieve the results.
3. Parse the JSON object, extract the required elements and store the results for each facility.
4. Normalize and clean up the data and keep only the required data.
5. Extract the number of facilities and the distance of the nearest facility to the area center, and store them.
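
Step 1 can also be written with a parameter dictionary instead of one long format string, which keeps the three per-facility URLs consistent. This is a sketch only: `build_search_url` is a hypothetical helper, and the parameter names are the ones already used in the search URLs of this notebook.

```python
from urllib.parse import urlencode, urlparse, parse_qs

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, lat, lng, category_ids,
                     version="20190531", limit=500, radius=5000):
    """Build a venue search URL for one area center and one facility type."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),   # area center as "lat,lng"
        "v": version,
        "categoryId": category_ids,       # comma-separated category ids
        "limit": limit,
        "radius": radius,                 # search radius in metres
    }
    # urlencode takes care of escaping special characters in the values
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Example with placeholder credentials and the center of area "01 GOTA"
url = build_search_url("<client_id>", "<client_secret>",
                       23.099003, 72.520454, "4bf58dd8d48988d1eb931735")
query = parse_qs(urlparse(url).query)
```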

In [40]:
## Version 2

# Define a function that extracts the category name from the categories field
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


# Define empty lists - to be populated dynamically
nb1_count = [] # to store the number of facility 1
nb2_count = [] # to store the number of facility 2
nb3_count = [] # to store the number of facility 3 

nb1_lowest_distance = [] # to store the distance of nearest venue of facility 1
nb2_lowest_distance = [] # to store the distance of nearest venue of facility 2
nb3_lowest_distance = [] # to store the distance of nearest venue of facility 3

# Iterate each area from the file and extract the required data
for indx, row in df_area.iterrows():
    # create FourSquare Request URL
    areaName = df_area['area name'][indx]
    #print(areaName)
    searchUrl_nb1 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&limit={}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, df_area.lat[indx], df_area.lng[indx], VERSION, nb1, LIMIT, RADIUS)
    searchUrl_nb2 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&limit={}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, df_area.lat[indx], df_area.lng[indx], VERSION, nb2, LIMIT, RADIUS)
    searchUrl_nb3 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&limit={}&radius={}'.format(CLIENT_ID, CLIENT_SECRET, df_area.lat[indx], df_area.lng[indx], VERSION, nb3, LIMIT, RADIUS)
    
    if (areaName != ""):
        # Send request and retrieve the list of venues in the vicinity
        results_nb1 = requests.get(searchUrl_nb1).json()
        results_nb2 = requests.get(searchUrl_nb2).json()
        results_nb3 = requests.get(searchUrl_nb3).json()

        #print(results_nb1)
        #print(results_nb2)
        #print(results_nb3)

        venues_nb1 = results_nb1['response']['venues']
        venues_nb2 = results_nb2['response']['venues']
        venues_nb3 = results_nb3['response']['venues']

        # Extract the data of facility 1, perform clean up and store them 
        if (len(venues_nb1)):
            df_nb1 = pd.json_normalize(venues_nb1)
            df_nb1['categories'] = df_nb1.apply(get_category_type, axis=1)

            if indx == 0:
                df_nb1_consolidated = df_nb1
            else:
                # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
                df_nb1_consolidated = pd.concat([df_nb1_consolidated, df_nb1], ignore_index=True)
            # Add the count of facilities to individual list
            nb1_count.append(len(df_nb1.index))

            if (len(df_nb1.index) > 0):
                nb1_lowest_distance.append(min(df_nb1['location.distance']))
            else:
                nb1_lowest_distance.append(999999)
        else:
            nb1_count.append(0)
            nb1_lowest_distance.append(999999)

        # Extract the data of facility 2, perform clean up and store them
        if (len(venues_nb2)):
            df_nb2 = pd.json_normalize(venues_nb2)
            df_nb2['categories'] = df_nb2.apply(get_category_type, axis=1)

            # Extract the records of facility 2 and store them in a facility 2 dataframe
            if indx == 0:
                df_nb2_consolidated = df_nb2
            else:
                # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
                df_nb2_consolidated = pd.concat([df_nb2_consolidated, df_nb2], ignore_index=True)
            # Add the count of facilities to individual list
            nb2_count.append(len(df_nb2.index))
            #print(len(df_temp.index))

            if (len(df_nb2.index) > 0):
                nb2_lowest_distance.append(min(df_nb2['location.distance']))
            else:
                nb2_lowest_distance.append(999999)
        else:
            nb2_count.append(0)
            nb2_lowest_distance.append(999999)

        # Extract the data of facility 3, perform clean up and store them
        if (len(venues_nb3)):
            df_nb3 = pd.json_normalize(venues_nb3)
            df_nb3['categories'] = df_nb3.apply(get_category_type, axis=1)

            # Extract the records of facility 3 and store them in a facility 3 dataframe
            if indx == 0:
                df_nb3_consolidated = df_nb3
            else:
                # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
                df_nb3_consolidated = pd.concat([df_nb3_consolidated, df_nb3], ignore_index=True)
            # Add the count of facilities to individual list
            nb3_count.append(len(df_nb3.index))

            if (len(df_nb3.index) > 0):
                nb3_lowest_distance.append(min(df_nb3['location.distance']))
            else:
                nb3_lowest_distance.append(999999)
        else:
            nb3_count.append(0)
            nb3_lowest_distance.append(999999)

print("Data retrieved from FourSquare API")

Data retrieved from FourSquare API
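
The loop above repeats the same "count the venues, find the nearest one" logic for each of the three facility types. As a sketch of how that step could be factored out, the hypothetical helper below works directly on the list of venue dictionaries under `response.venues`; it assumes each venue carries a `location.distance` value and reuses the notebook's 999999 placeholder when no venue is found.

```python
def summarise_venues(venues, missing_distance=999999):
    """Return (count, nearest_distance) for a list of venue dictionaries.

    `missing_distance` is the placeholder used when no venue reports a distance.
    """
    distances = [
        venue["location"]["distance"]
        for venue in venues
        if "distance" in venue.get("location", {})
    ]
    nearest = min(distances) if distances else missing_distance
    return len(venues), nearest


# Toy input shaped like the 'venues' list of a search response
sample_venues = [
    {"id": "a", "name": "Venue A", "location": {"distance": 823}},
    {"id": "b", "name": "Venue B", "location": {"distance": 19}},
]
sample_summary = summarise_venues(sample_venues)
empty_summary = summarise_venues([])
```

The returned pair could be appended to `nb1_count` and `nb1_lowest_distance` (and their counterparts) in place of the nested `if`/`else` branches.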


### Cleaning of data 

Perform data clean up for further use

In [41]:
# Clean column names by keeping only last term
df_nb1_consolidated.columns = [column.split('.')[-1] for column in df_nb1_consolidated.columns]
df_nb2_consolidated.columns = [column.split('.')[-1] for column in df_nb2_consolidated.columns]
df_nb3_consolidated.columns = [column.split('.')[-1] for column in df_nb3_consolidated.columns]

# Remove the duplicate rows
df_nb1_consolidated.drop_duplicates(subset ="id", keep = 'first', inplace = True)
df_nb2_consolidated.drop_duplicates(subset ="id", keep = 'first', inplace = True)
df_nb3_consolidated.drop_duplicates(subset ="id", keep = 'first', inplace = True)

print("Data clean up completed")

Data clean up completed
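
To make the two clean-up steps concrete, here is a small sketch on made-up venue records (the ids, names and coordinates are invented for illustration). One caveat worth keeping in mind: keeping only the last part of a dotted column name can produce duplicate column names if two nested fields share the same leaf name.

```python
import pandas as pd

# Made-up venues; "v1" appears twice, as it would when two area searches overlap
toy_venues = [
    {"id": "v1", "name": "City Hospital", "location": {"lat": 23.03, "lng": 72.55, "distance": 120}},
    {"id": "v2", "name": "Town School", "location": {"lat": 23.01, "lng": 72.56, "distance": 450}},
    {"id": "v1", "name": "City Hospital", "location": {"lat": 23.03, "lng": 72.55, "distance": 900}},
]

# Flatten nested JSON: 'location.lat', 'location.lng', 'location.distance', ...
toy_df = pd.json_normalize(toy_venues)

# Keep only the last term of each column name
toy_df.columns = [column.split(".")[-1] for column in toy_df.columns]

# Drop venues that were returned more than once, keeping the first occurrence
toy_df = toy_df.drop_duplicates(subset="id", keep="first")
```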


### Plot all the facilities on the map

Plot all the facilities with different colors on the map

In [32]:
# Define function to plot the coordinates on map
def plot_facility_points(venues_map, df, colorCode, rds):
    # add spots to the map as circle markers
    for lat, lng, categories, name in zip(df.lat, df.lng, df.categories, df.name):
        label = '{}, {}'.format(categories, name)
        label = folium.Popup(label, parse_html=True)

        #folium.features.CircleMarker(
        folium.CircleMarker(
            [lat, lng],
            radius=rds,
            popup=label,
            fill=True,
            color=colorCode,
            fill_color=colorCode,
            fill_opacity=0.7
            ).add_to(venues_map)


# Add each facility spots to map in different color (each facility will have a single color code)
nb1_color = "brown"
nb2_color = "orange"
nb3_color = "green"

plot_facility_points(venues_map, df_nb1_consolidated, nb1_color, 2)
plot_facility_points(venues_map, df_nb2_consolidated, nb2_color, 2)
plot_facility_points(venues_map, df_nb3_consolidated, nb3_color, 2)

print(nb1_name, "will be plotted in", nb1_color, "color")
print(nb2_name, "will be plotted in", nb2_color, "color")
print(nb3_name, "will be plotted in", nb3_color, "color")

venues_map

Hospital will be plotted in brown color
Indian Restaurant will be plotted in orange color
School will be plotted in green color


### Collate the data for analyzing and identifying the results

Now that we have all the required data to define the candidate areas, we will identify the areas where each neighborhood facility is nearest to the area center.

We will update the dataframe with the number of facilities and the distance to the nearest facility of each type.

Then we sum the distances to the nearest facility of each type and pull out the first 4 areas with the lowest total. Areas where a facility type was not found carry the placeholder distance 999999 and therefore sink to the bottom of the ranking.

In [35]:
# Add the count of facilities to respective areas
df_area[nb1_name] = nb1_count
df_area[nb2_name] = nb2_count
df_area[nb3_name] = nb3_count

temp_col_name_1 = 'nearest ' + nb1_name + ' distance'
temp_col_name_2 = 'nearest ' + nb2_name + ' distance'
temp_col_name_3 = 'nearest ' + nb3_name + ' distance'

df_area[temp_col_name_1] = nb1_lowest_distance
df_area[temp_col_name_2] = nb2_lowest_distance
df_area[temp_col_name_3] = nb3_lowest_distance

#df_area["average facilities"] = (np.array(nb1_count) * np.array(nb2_count) * np.array(nb3_count)) / 3
df_area["total distance"] = (np.array(nb1_lowest_distance) + np.array(nb2_lowest_distance) + np.array(nb3_lowest_distance))

df_area.head()

Unnamed: 0,area name,lat,lng,Hospital,Indian Restaurant,School,nearest Hospital distance,nearest Indian Restaurant distance,nearest School distance,total distance
0,44 KHOKHRA,22.996573,72.615554,26,49,12,1742,912,3202,5856
1,30 PALDI,23.009716,72.560894,26,49,6,1341,1553,2931,5825
2,19 BODAKDEV,23.033983,72.50978,19,47,3,230,493,2940,3663
3,18 NAVRANGPURA,23.028733,72.553691,40,49,4,789,726,746,2261
4,37 MANINAGAR,23.000279,72.603619,45,49,11,1621,1211,1976,4808


## Result
### Plot the candidate areas on the map

As per the analysis, the following 4 areas have a hospital, a school and an Indian restaurant closest to the area center, i.e. the lowest total distance in the sorted table below:
1. Lambha
2. Indrapuri
3. Gomtipur
4. Chandkheda

Note: the result is based on the distance of facilities from the **center of the area** and not from each other.

In [36]:
# Sort the rows
df_area.sort_values(by=["total distance"], inplace=True, ignore_index=True)
df_area.head(4)

Unnamed: 0,area name,lat,lng,Hospital,Indian Restaurant,School,nearest Hospital distance,nearest Indian Restaurant distance,nearest School distance,total distance
0,46 LAMBHA,22.94801,72.562056,18,48,11,142,357,165,664
1,42 INDRAPURI,22.999274,72.640249,38,50,31,160,219,642,1021
2,38 GOMTIPUR,23.01961,72.619931,40,50,34,823,19,267,1109
3,03 CHANDKHEDA,23.114959,72.589485,41,50,34,498,460,185,1143
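
The ranking step can also be expressed with `nsmallest`, with areas that are missing a facility type filtered out explicitly instead of relying on the large placeholder value. The sketch below uses a toy dataframe with made-up area names; the column naming follows the notebook, and `MISSING` mirrors its 999999 placeholder.

```python
import pandas as pd

MISSING = 999999  # placeholder distance used when no venue was found

toy_area = pd.DataFrame({
    "area name": ["A", "B", "C", "D"],
    "nearest Hospital distance": [142, 160, MISSING, 498],
    "nearest Indian Restaurant distance": [357, 219, 19, 460],
    "nearest School distance": [165, 642, 267, 185],
})

distance_cols = [c for c in toy_area.columns if c.startswith("nearest ")]

# Keep only areas where every facility type was found
has_all = (toy_area[distance_cols] != MISSING).all(axis=1)
complete = toy_area[has_all].copy()

# Sum the nearest distances and take the areas with the lowest totals
complete["total distance"] = complete[distance_cols].sum(axis=1)
top_areas = complete.nsmallest(2, "total distance")
```

Whether to drop incomplete areas or merely rank them last is a design choice; dropping them avoids an area with a missing facility ever appearing among the candidates when fewer complete areas exist than requested.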


In [34]:
for indx, row in df_area.head(4).iterrows():
    #folium.features.CircleMarker(
    folium.CircleMarker(
        [df_area['lat'][indx], df_area['lng'][indx]],
        radius=5,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=1
        ).add_to(venues_map)

    #folium.features.Marker(
    folium.Marker(
        [df_area['lat'][indx], df_area['lng'][indx]],
        popup=df_area['area name'][indx],
        ).add_to(venues_map)
    
venues_map

## Conclusion
The purpose of this project was to identify candidate areas that have the desired facilities (School, Hospital, Indian restaurant) closest to the center of the area, in order to help a person narrow down the search for an optimal location to settle down.

Using FourSquare data, we first identified the facilities within a 5 km radius of each area center that satisfy the criteria of the user. We then identified the lowest distance from the center of the area for each facility type.

The data of all facilities could not be retrieved fully due to a FourSquare API restriction (maximum 100 records returned). The result can be further optimized by getting the full data. The final decision on the optimal area will be made by the end user based on various factors, e.g. the location of the office or the type of school (nursery, primary, high school, etc.).