# Coursera - Capstone Project for IBM Data Science Certificate
## Opening of a new Peruvian Restaurant in Barranquilla, Colombia
### "The battle of the neighborhoods" **by Carlos Barros**

## 1. Introduction

### 1.1 Problem 
According to economic projections for Peruvian restaurants in Colombia, their sales were expected to exceed 62 million dollars by 2019. As of 2018, 81 Peruvian food outlets were registered in the country: 33 in Bogotá, 29 in Medellín, and the rest in other regions.

A franchise of **Peruvian restaurants** has decided to open a new location in the city of Barranquilla. Among its reasons are the city's strategic geographic position, with access to the Magdalena River, and its role as one of Colombia's business hubs. The central objective of this study is to analyze and select the best locations in Barranquilla, Colombia, for the franchise's new restaurant, using data science methodology and machine learning techniques such as clustering.

My proposal is to analyze the sectors of Barranquilla as candidates for new Peruvian restaurants, according to the requirements set by the franchise:

- Proximity and ease of access for customers.
- No competitors within a radius of at least 1 km.
- Surrounded by companies from different economic sectors.
- Close to high-traffic areas and major roads.
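
The 1 km competition requirement can be checked numerically. The following is a minimal sketch, assuming candidate sites and competitors are available as (latitude, longitude) pairs; the function names and the coordinates are hypothetical and only illustrate the idea with the haversine great-circle distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def has_competitor_within(candidate, competitors, radius_km=1.0):
    """Return True if any competitor lies within radius_km of the candidate site."""
    return any(
        haversine_km(candidate[0], candidate[1], lat, lon) <= radius_km
        for lat, lon in competitors
    )


# Hypothetical coordinates, for illustration only.
candidate_site = (11.0100, -74.8200)
peruvian_restaurants = [(11.0150, -74.8200), (10.9900, -74.7900)]
print(has_competitor_within(candidate_site, peruvian_restaurants))  # True (about 0.56 km apart)
```

A site passes the requirement when the function returns `False` for the list of known Peruvian restaurants.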

###  1.2 Interest 
This study is of interest to any investor who is thinking of opening a new Peruvian restaurant in the city, and to Big Data students who want to expand their knowledge and carry out similar projects.


## 2. Data acquisition and cleaning
### 2.1 Data sources 
To make a good choice to open Peruvian restaurants in Barranquilla, the following information is required:
- List / Information on the Barranquilla sectors with their Geodata (latitude and longitude).
- List / Information on the main roads in Barranquilla with geographic data.

### 2.2 Data cleaning 
The data used for this analysis combines a CSV file prepared from multiple sources for this study (Barrios_coord.csv) with venue location information from Foursquare.

The file is read directly in the Jupyter notebook for convenience and to save space; the clustering of city sectors and the mapping are still shown. Nominatim was used to obtain the geodata. The code for that step can be shown in a 'markdown' cell; keep in mind that it takes a while to run.
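
As an illustration of the geocoding step, here is a minimal sketch and not the code actually used to build the CSV. It assumes a list of neighborhood names and any geocoding callable that returns an object with `latitude` and `longitude` attributes (geopy's `Nominatim(...).geocode` behaves this way). The function name, the address suffix and the pause length are assumptions made for this example.

```python
import time


def geocode_neighborhoods(names, geocode, suffix=", Barranquilla, Colombia", pause_s=1.0):
    """Return one row per neighborhood with Neighborhood, Latitude and Longitude keys.

    `geocode` is any callable that takes an address string and returns an object
    with .latitude and .longitude attributes, or None when nothing is found.
    """
    rows = []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            # Keep the row so that missing results are visible and can be filled in by hand.
            rows.append({"Neighborhood": name, "Latitude": None, "Longitude": None})
        else:
            rows.append({"Neighborhood": name,
                         "Latitude": location.latitude,
                         "Longitude": location.longitude})
        time.sleep(pause_s)  # pause between requests to avoid overloading the public service
    return rows


# Usage in the notebook (needs network access, geopy and pandas):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="my-application")
# rows = geocode_neighborhoods(["Miramar", "Alto Prado"], geolocator.geocode)
# pd.DataFrame(rows).to_csv("Barrios_coord.csv", index=False)
```

The pause between requests is why this step takes a while to run for a full list of neighborhoods.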

**Source 1: Barrios_coord.csv**

## 3. Exploratory Data Analysis 
Sixteen commercially strategic neighborhoods in the city of Barranquilla were selected, and the latitude and longitude of each were looked up manually with the help of Google Maps (see Figure 1) and with data provided by DANE (the National Administrative Department of Statistics, by its Spanish acronym). This was necessary because no list of Barranquilla's neighborhoods could be found on Wikipedia. An example of the table generated from this data follows.



## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<li><a href="#ref1">Import Libraries</a>
<li><a href="#ref2">Sectors of the city of Barranquilla - Data and mapping </a>
<li><a href="#ref3">Use the Foursquare API to explore the neighborhoods</a>
<li><a href="#ref3">Exploratory Data Analysis</a>
<li><a href="#ref3">Recommendations</a>
<li><a href="#ref3">Conclusion</a>

    
</div>
 
<hr>



## 1. Import Libraries

In [None]:

import json  # library to handle JSON files
import time
import warnings

import numpy as np  # library to handle data in a vectorized manner
import pandas as pd  # library for data analysis
import requests  # library to handle HTTP requests
from pandas import json_normalize  # transform JSON into a pandas dataframe (pandas >= 1.0)
from bs4 import BeautifulSoup

# mapping tools (only needed if the packages are not installed yet)
!pip install geopy
!pip install folium

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# map rendering library
import folium
from folium import plugins

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import seaborn as sns

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# silence library warnings in the notebook output
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print('Libraries imported.')



## 2. Sectors of the city of Barranquilla - Data and mapping.

The clustered neighborhood data was produced with Foursquare during the course lab work. A CSV file was produced containing the neighborhoods around the 17 main commercial districts of the city of Barranquilla. Here the CSV file is simply read, for convenience and to keep the report compact.

In [None]:
import os
cwd = os.getcwd()
cwd

In [None]:
# Read csv file with clustered neighborhoods with geodata
bq_data  = pd.read_csv('Barrios_coord.csv') 
bq_data

In [None]:
# get the coordinates of Barranquilla
address = 'Barranquilla,Atlántico,Colombia'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Barranquilla, Colombia are {}, {}.'.format(latitude, longitude))

In [None]:
# fixed coordinates used to center the map (these override the geocoder result above)
latitude = 11.008799
longitude = -74.805004

# create map of Barranquilla using latitude and longitude values
map_bq = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, neighborhood in zip(bq_data['Latitude'], bq_data['Longitude'], bq_data['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_bq)  
    
map_bq

In [None]:
# save the map as HTML file
map_bq.save('map_bq.html')

## 3. Use the Foursquare API to explore the neighborhoods

In [None]:
# credentials
CLIENT_ID = 'N1FXT1RQOUXXXXXXXXXXPFJYM5VGHNSMZQHK4JXWVQ5DDI1Y' # your Foursquare ID
CLIENT_SECRET = 'PLQF0OEC3F2QSJXXXXXXXXXXTOAWWAAP324E1EWBFA0AGHTG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
radius = 2000
LIMIT = 250

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
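
Hard-coding and printing API credentials makes them easy to leak when a notebook is shared. One possible alternative, sketched below, reads them from environment variables and prints only a masked prefix. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are hypothetical and chosen for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters so a value can be checked without exposing it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demonstration with made-up values; in the notebook, call the function with no argument.
demo_env = {"FOURSQUARE_CLIENT_ID": "ABCD1234", "FOURSQUARE_CLIENT_SECRET": "WXYZ5678"}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(CLIENT_ID))  # CLIENT_ID: ABCD****
```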

##### **Let's explore 'Barranquilla'; that sounds like a cool spot**

In [None]:
# define objects for the neighborhood at index 15 in bq_data
neighborhood_latitude = bq_data.loc[15, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bq_data.loc[15, 'Longitude'] # neighborhood longitude value
neighborhood_name = bq_data.loc[15, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

##### **Now, let's get the top venues (up to `LIMIT`) that are in Barranquilla within a radius of 2000 m**

In [None]:
#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display GET request URL

In [None]:
results = requests.get(url).json()
results  # display the JSON response

## 4. Exploratory Data Analysis

##### **Clean the JSON and structure it into a pandas dataframe**

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened explore results use the prefixed column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [None]:
venues = results['response']['groups'][0]['items']
    
df_bq = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
df_bq = df_bq.loc[:, filtered_columns]

# filter the category for each row
df_bq['venue.categories'] = df_bq.apply(get_category_type, axis=1)

# clean columns

df_bq.columns = [col.split(".")[-1] for col in df_bq.columns]
df_bq.insert(0, 'neighborhood', 'Barranquilla')

print('{} venues were returned by Foursquare.'.format(df_bq.shape[0]))
df_bq.head()

##### **Create a map of the Barranquilla district and highlight nearby venues**

In [None]:
map_bq = folium.Map(location=[neighborhood_latitude, neighborhood_longitude], zoom_start=14)

# add markers to map
for lat, lng, name, categories in zip(df_bq['lat'], df_bq['lng'], df_bq['name'], df_bq['categories']):
  label = '{},{}'.format(categories,name)
  label = folium.Popup(label, parse_html=True)
  folium.CircleMarker(
      [lat, lng],
      radius=5,
      popup=label,
      color='blue',
      fill=True,
      fill_color='#3186cc',
      fill_opacity=0.7).add_to(map_bq) 
    
map_bq

In [None]:
df_bq['categories'].value_counts()

#### Let's create a similar dataframe for each neighborhood:

**Index # 1 - Miramar**

In [None]:
#define objects for 'Miramar' index [1] in bq_data
neighborhood_latitude = bq_data.loc[1, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bq_data.loc[1, 'Longitude'] # neighborhood longitude value
neighborhood_name = bq_data.loc[1, 'Neighborhood'] # neighborhood name

#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

venues = results['response']['groups'][0]['items']
    
df_miramar = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
df_miramar = df_miramar.loc[:, filtered_columns]

# filter the category for each row
df_miramar['venue.categories'] = df_miramar.apply(get_category_type, axis=1)

# clean columns

df_miramar.columns = [col.split(".")[-1] for col in df_miramar.columns]
df_miramar.insert(0, 'neighborhood', 'Miramar')

print('{} venues were returned by Foursquare.'.format(df_miramar.shape[0]))
df_miramar.head()

**Index # 3 - Altos del limón**


In [None]:
#define objects for 'Altos del limón' index [3] in bq_data
neighborhood_latitude = bq_data.loc[3, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bq_data.loc[3, 'Longitude'] # neighborhood longitude value
neighborhood_name = bq_data.loc[3, 'Neighborhood'] # neighborhood name

#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

venues = results['response']['groups'][0]['items']
    
df_altos_del_limon = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
df_altos_del_limon = df_altos_del_limon.loc[:, filtered_columns]

# filter the category for each row
df_altos_del_limon['venue.categories'] = df_altos_del_limon.apply(get_category_type, axis=1)

# clean columns

df_altos_del_limon.columns = [col.split(".")[-1] for col in df_altos_del_limon.columns]
df_altos_del_limon.insert(0, 'neighborhood', 'Altos del limón')

print('{} venues were returned by Foursquare.'.format(df_altos_del_limon.shape[0]))
df_altos_del_limon.head()

**Index # 6 - Villa Country**


In [None]:
#define objects for 'Villa Country' index [6] in bq_data
neighborhood_latitude = bq_data.loc[6, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bq_data.loc[6, 'Longitude'] # neighborhood longitude value
neighborhood_name = bq_data.loc[6, 'Neighborhood'] # neighborhood name

#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

venues = results['response']['groups'][0]['items']
    
df_villa_country = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
df_villa_country = df_villa_country.loc[:, filtered_columns]

# filter the category for each row
df_villa_country['venue.categories'] = df_villa_country.apply(get_category_type, axis=1)

# clean columns

df_villa_country.columns = [col.split(".")[-1] for col in df_villa_country.columns]
df_villa_country.insert(0, 'neighborhood', 'Villa Country')

print('{} venues were returned by Foursquare.'.format(df_villa_country.shape[0]))
df_villa_country.head()

**Index # 8 - Alto Prado**


In [None]:
#define objects for 'Alto Prado' index [8] in bq_data
neighborhood_latitude = bq_data.loc[8, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bq_data.loc[8, 'Longitude'] # neighborhood longitude value
neighborhood_name = bq_data.loc[8, 'Neighborhood'] # neighborhood name

#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

venues = results['response']['groups'][0]['items']
    
df_alto_prado = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
df_alto_prado = df_alto_prado.loc[:, filtered_columns]

# filter the category for each row
df_alto_prado['venue.categories'] = df_alto_prado.apply(get_category_type, axis=1)

# clean columns

df_alto_prado.columns = [col.split(".")[-1] for col in df_alto_prado.columns]
df_alto_prado.insert(0, 'neighborhood', 'Alto Prado')

print('{} venues were returned by Foursquare.'.format(df_alto_prado.shape[0]))
df_alto_prado.head()

**Index # 10 - Villa Campestre**


In [None]:
#define objects for 'Villa Campestre' index [10] in bq_data
neighborhood_latitude = bq_data.loc[10, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bq_data.loc[10, 'Longitude'] # neighborhood longitude value
neighborhood_name = bq_data.loc[10, 'Neighborhood'] # neighborhood name

#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

venues = results['response']['groups'][0]['items']
    
df_villa_campestre = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
df_villa_campestre = df_villa_campestre.loc[:, filtered_columns]

# filter the category for each row
df_villa_campestre['venue.categories'] = df_villa_campestre.apply(get_category_type, axis=1)

# clean columns

df_villa_campestre.columns = [col.split(".")[-1] for col in df_villa_campestre.columns]
df_villa_campestre.insert(0, 'neighborhood', 'Villa Campestre')

print('{} venues were returned by Foursquare.'.format(df_villa_campestre.shape[0]))
df_villa_campestre.head()

**Index # 13 - Ciudad Jardin**


In [None]:
#define objects for 'Ciudad Jardin' index [13] in bq_data
neighborhood_latitude = bq_data.loc[13, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bq_data.loc[13, 'Longitude'] # neighborhood longitude value
neighborhood_name = bq_data.loc[13, 'Neighborhood'] # neighborhood name

#step 1 - create the correct GET request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

results = requests.get(url).json()

venues = results['response']['groups'][0]['items']
    
df_ciudad_jardin = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
df_ciudad_jardin = df_ciudad_jardin.loc[:, filtered_columns]

# filter the category for each row
df_ciudad_jardin['venue.categories'] = df_ciudad_jardin.apply(get_category_type, axis=1)

# clean columns

df_ciudad_jardin.columns = [col.split(".")[-1] for col in df_ciudad_jardin.columns]
df_ciudad_jardin.insert(0, 'neighborhood', 'Ciudad Jardin')

print('{} venues were returned by Foursquare.'.format(df_ciudad_jardin.shape[0]))
df_ciudad_jardin.head()
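
The per-neighborhood cells above repeat the same request-and-clean steps with only the index and name changed. As a sketch of how that repetition could be factored out, the two helpers below build the explore URL and flatten a response into rows that use the same column names as the dataframes above. The helper names are hypothetical, and the response layout is assumed to be the one accessed in the cells above (`response` → `groups` → `items` → `venue`).

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the GET URL for one neighborhood, with the same parameters as the cells above."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll parameter readable
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


def venues_to_rows(results, neighborhood):
    """Flatten an explore response into rows: neighborhood, name, categories, lat, lng."""
    rows = []
    for item in results["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "neighborhood": neighborhood,
            "name": venue["name"],
            "categories": categories[0]["name"] if categories else None,
            "lat": venue["location"]["lat"],
            "lng": venue["location"]["lng"],
        })
    return rows


# Usage in the notebook (needs requests, pandas and valid credentials):
# frames = []
# for idx in [1, 3, 6, 8, 10, 13]:
#     row = bq_data.loc[idx]
#     url = build_explore_url(CLIENT_ID, CLIENT_SECRET, VERSION,
#                             row['Latitude'], row['Longitude'], radius, LIMIT)
#     frames.append(pd.DataFrame(venues_to_rows(requests.get(url).json(), row['Neighborhood'])))
# df_venues = pd.concat(frames, ignore_index=True)
```

This version labels each row with the name stored in `bq_data`, which assumes the CSV names match the labels typed by hand in the cells above.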

**Analysis of venue distribution**

In [None]:
df_venues = pd.concat([df_miramar, df_altos_del_limon, df_villa_country, df_alto_prado, df_villa_campestre, df_ciudad_jardin])
df_venues['count'] = 1
df_venues.shape

In [None]:
total_venues = pd.pivot_table(df_venues, index=["neighborhood"], values=["count"], aggfunc="sum")
total_venues

In [None]:
# keep only restaurants and hotels; na=False skips venues without a category
df_venues_rest = df_venues[df_venues['categories'].str.contains('Restaurant', na=False)].reset_index(drop=True)
df_venues_rest['Venue Type'] = 'Restaurant'
df_venues_hotel = df_venues[df_venues['categories'].str.contains('Hotel', na=False)].reset_index(drop=True)
df_venues_hotel['Venue Type'] = 'Hotel'
df_venues_final = pd.concat([df_venues_rest, df_venues_hotel]).reset_index(drop=True)
df_venues_final.shape

In [None]:
pivot = pd.pivot_table(df_venues_final, index=["neighborhood", "Venue Type"], values=["count"], aggfunc="sum")
pivot

In [None]:
df_venues_final.groupby('neighborhood')['Venue Type']\
    .value_counts()\
    .unstack(level=1)\
    .plot.bar(stacked=True)

**Create 'one hot' file with dummy values by venue category**

In [None]:
# one hot encoding
bq_onehot = pd.get_dummies(df_venues_final[['categories']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
bq_onehot['neighborhood'] = df_venues_final['neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [bq_onehot.columns[-1]] + list(bq_onehot.columns[:-1])
bq_onehot = bq_onehot[fixed_columns]

bq_onehot.head()

**Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category**

In [None]:
bq_grouped = bq_onehot.groupby('neighborhood').mean().reset_index()
bq_grouped
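
Taking the mean of a one-hot column within a neighborhood gives the share of that neighborhood's venues that fall in the category. The plain-Python sketch below computes the same quantity on a small made-up list of venues. The function name and the sample data are hypothetical, and, unlike the pandas table, categories that do not occur in a neighborhood are simply absent instead of being stored as 0.

```python
from collections import Counter, defaultdict


def category_shares(venues):
    """Map each neighborhood to {category: share of its venues in that category}.

    `venues` is an iterable of (neighborhood, category) pairs.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return shares


# Hypothetical venues, for illustration only.
sample = [
    ("Miramar", "Hotel"),
    ("Miramar", "Peruvian Restaurant"),
    ("Miramar", "Seafood Restaurant"),
    ("Miramar", "Hotel"),
    ("Alto Prado", "Italian Restaurant"),
    ("Alto Prado", "Hotel"),
]
shares = category_shares(sample)
print(shares["Miramar"]["Peruvian Restaurant"])  # 0.25, one of four Miramar venues
```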

In [None]:
bq_grouped.shape

In [None]:
len(bq_grouped[bq_grouped["Peruvian Restaurant"] > 0])

##### **Create a new DataFrame for Peruvian Restaurant data only**

In [None]:
bq_peruvian = bq_grouped[["neighborhood","Peruvian Restaurant"]]

In [None]:
bq_peruvian

In [None]:
# set number of clusters
kclusters = 4

bq_clustering = bq_peruvian.drop(columns=["neighborhood"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bq_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
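
Because the clustering uses a single feature (the share of Peruvian restaurants), it helps to see what k-means does on scalar values. The sketch below is a minimal Lloyd's algorithm for one-dimensional data. It is not scikit-learn's implementation: the initial centroids are picked at evenly spaced positions in the sorted data to keep the example deterministic, whereas `KMeans` defaults to k-means++ initialization. The input values are made up.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster scalar values into k groups with Lloyd's algorithm.

    Returns (labels, centroids), where labels[i] is the cluster index of values[i].
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    step = (len(ordered) - 1) / max(k - 1, 1)
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids


# Hypothetical shares of Peruvian restaurants for six neighborhoods.
example_shares = [0.0, 0.0, 0.05, 0.06, 0.2, 0.21]
example_labels, example_centroids = kmeans_1d(example_shares, k=3)
print(example_labels)  # [0, 0, 1, 1, 2, 2]
```

Neighborhoods with the same label have similar shares, which is how the cluster labels in the notebook should be read.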

In [None]:
# create a new dataframe that includes the cluster label for each neighborhood
bq_merged = bq_peruvian.copy()


# add clustering labels
bq_merged["Cluster Labels"] = kmeans.labels_

In [None]:
bq_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
bq_merged.head()

In [None]:
# join bq_merged with bq_data to add latitude/longitude for each neighborhood
bq_merged = bq_merged.join(bq_data.set_index("Neighborhood"), on="neighborhood")

print(bq_merged.shape)
bq_merged.head() # check the last columns!

In [None]:
# sort the results by Cluster Labels
print(bq_merged.shape)
bq_merged.sort_values(["Cluster Labels"], inplace=True)
bq_merged

In [None]:
bq_merged= bq_merged.drop([1])
bq_merged

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(bq_merged['Latitude'], bq_merged['Longitude'], bq_merged['neighborhood'], bq_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## 5. Recommendations

Calle 98 between Carreras 53 and 56 is currently experiencing strong population growth (more and more buildings are going up in the sector), and the sector also has considerable commercial influence in the city. At present there is no Peruvian restaurant in the immediate area; the closest one is 644 m away. Even so, the opportunity is still considered worth taking because of the sector's strategic geographic position. As can be seen in Figure 5, Altos del Limón, like Miramar, is where the major hotels are located, but it has few restaurants compared with other sectors. This also represents a great opportunity, since many tourists would gain easier access to foreign food, especially Peruvian food.

## 6. Conclusion

This study recommends the sector of Altos del Limón, at Calle 98 and Carrera 56, for the reasons given in the Recommendations section. It is considered the best option across all the factors: a relatively high volume of activities for tourists and locals, its role as a business center, its proximity to other high-rent neighborhoods, and a hotel-to-restaurant ratio that is not too high. The other neighborhoods considered were Alto Prado, Ciudad Jardín, Miramar, Villa Campestre and Villa Country; because they did not meet most of the requirements, Altos del Limón was chosen. Distance from the competition was also essential to the final decision. Finally, it is concluded that Barranquilla is not saturated with restaurants serving the Peruvian food niche: there are only 3 restaurants in this category in the whole city.

**End of Project and Course. Thanks to the Coursera Team, IBM and Students!**