# Capstone project

This notebook is used mainly for the IBM Data Science capstone project.

In [70]:
import pandas as pd
import numpy as np

## Hello Capstone Project Course!

In this project we look at where to open a new Mexican restaurant in the city of San Diego, CA. Why is this useful? San Diego receives many visitors from Mexico and has a large number of Mexicans living or working there. Because of this, there is already a variety of Mexican restaurants spread across the city. Our task is to use Foursquare data and machine learning algorithms to find the best places for a new Mexican restaurant. For the data, we will use only the best neighborhoods of San Diego according to this page: https://www.zumper.com/blog/2018/05/7-best-san-diego-neighborhoods/. So let's get started.

After getting the names of the best neighborhoods, I collected the latitude and longitude of each one so that I could link them with Foursquare.
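
The coordinates used in this notebook were collected by hand. As a minimal sketch of how that lookup could be scripted instead, the helper below takes any `geocode` callable that returns an object with `latitude` and `longitude` attributes (or `None` when nothing is found), which is how the `geocode` method of geopy's `Nominatim` behaves. The names `lookup_coordinates`, `suffix` and `fake_geocode` are hypothetical and introduced only for this illustration; the stand-in geocoder exists so the sketch runs without network access.

```python
from types import SimpleNamespace


def lookup_coordinates(names, geocode, suffix=", San Diego, CA"):
    """Return (name, latitude, longitude) tuples, with None where no match is found."""
    rows = []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            rows.append((name, None, None))
        else:
            rows.append((name, location.latitude, location.longitude))
    return rows


# Offline stand-in for a real geocoder (hypothetical, for illustration only).
# With geopy, one would pass Nominatim(user_agent="...").geocode instead.
def fake_geocode(query):
    known = {
        "Hillcrest, San Diego, CA": SimpleNamespace(latitude=32.749997, longitude=-117.166666),
    }
    return known.get(query)


coords = lookup_coordinates(["Hillcrest", "Not A Real Place"], fake_geocode)
print(coords)
```

Rows with `None` coordinates flag names the geocoder could not resolve, so they can be checked by hand before building the dataframe.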

In [71]:
d = {
    'Neighborhood': ["Hillcrest", "Little Italy", "North Park", "Gaslamp Quarter",
                     "Ocean Beach", "La Jolla", "Normal Heigths"],
    'Latitude': [32.749997, 32.721163782, 32.7408842, 32.7075795,
                 32.741947, 32.842674, 32.7580119679],
    'Longitude': [-117.166666, -117.166999332, -117.1305877, -117.1601285,
                  -117.239571, -117.257767, -117.117999528],
}


In [72]:
df=pd.DataFrame(data=d)
df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Hillcrest | 32.749997 | -117.166666 |
| 1 | Little Italy | 32.721164 | -117.166999 |
| 2 | North Park | 32.740884 | -117.130588 |
| 3 | Gaslamp Quarter | 32.70758 | -117.160128 |
| 4 | Ocean Beach | 32.741947 | -117.239571 |


Now let's import the remaining libraries that we will use in the project.

In [73]:
import requests # library to handle requests

from sklearn.cluster import KMeans # k-means clustering

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; formerly in pandas.io.json)

import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

Let's load our Foursquare credentials.

In [74]:
import os

# Read the Foursquare credentials from environment variables instead of hard-coding
# or printing them. The variable names below are a naming choice for this notebook.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604' # Foursquare API version
LIMIT = 100 # maximum number of venues per request


And now let's get the geographical coordinates of San Diego so we can create a folium map.

In [75]:
address = 'San Diego, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Diego are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Diego are 32.7174209, -117.1627714.


And now we create a folium map with a marker for each of the top neighborhoods that we selected at the start.

In [76]:
map_sd = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sd)  
    
map_sd

Now we define a function that extracts each venue's name, latitude, longitude and category.

In [77]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe of Foursquare venues within `radius` meters of each neighborhood."""
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue_Latitude', 
                  'Venue_Longitude', 
                  'Venue_Category']
    
    return nearby_venues
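
The parsing step in this function assumes that every response contains at least one group and that every venue has at least one category. As a hedged sketch of a more defensive variant, the hypothetical helper `extract_venue_rows` below flattens one response dictionary into the same seven fields and labels venues without a category as "Unknown" (an assumption made here, not something the notebook does). The sample response only mimics the shape read by the function above; its second venue is made up.

```python
def extract_venue_rows(neighborhood, lat, lng, response_json):
    """Flatten one explore-style response into tuples of seven venue fields."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumption: venues without a category are kept and labeled "Unknown".
        category = categories[0]["name"] if categories else "Unknown"
        rows.append((
            neighborhood,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Sample data shaped like the fields used above; "Example Venue" is invented.
sample_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Toma Sol Tavern",
                   "location": {"lat": 32.749877, "lng": -117.166341},
                   "categories": [{"name": "Sports Bar"}]}},
        {"venue": {"name": "Example Venue",
                   "location": {"lat": 32.75, "lng": -117.167},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows("Hillcrest", 32.749997, -117.166666, sample_response)
for row in rows:
    print(row)
```

Keeping the parsing separate from the HTTP request also makes it possible to test the flattening logic without calling the API.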

And now we call the previous function with the data from our dataframe.

In [78]:
sd_venues_best_neighborhood = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude'])

Hillcrest
Little Italy
North Park
Gaslamp Quarter
Ocean Beach
La Jolla
Normal Heigths


Now let's take a look at the shape of our dataframe and the information we got for our neighborhoods.

In [79]:
print(sd_venues_best_neighborhood.shape)
sd_venues_best_neighborhood.head()

(263, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue_Latitude | Venue_Longitude | Venue_Category |
|---|---|---|---|---|---|---|---|
| 0 | Hillcrest | 32.749997 | -117.166666 | Toma Sol Tavern | 32.749877 | -117.166341 | Sports Bar |
| 1 | Hillcrest | 32.749997 | -117.166666 | Lazy Acres Natural Market | 32.75021 | -117.167797 | Organic Grocery |
| 2 | Hillcrest | 32.749997 | -117.166666 | Vons | 32.74938 | -117.168194 | Grocery Store |
| 3 | Hillcrest | 32.749997 | -117.166666 | RK Sushi | 32.749992 | -117.16707 | Sushi Restaurant |
| 4 | Hillcrest | 32.749997 | -117.166666 | Sushi Deli 1 | 32.74995 | -117.165757 | Sushi Restaurant |


Now let's see how many unique categories exist in our dataframe.

In [80]:
print('There are {} unique categories.'.format(len(sd_venues_best_neighborhood['Venue_Category'].unique())))

There are 99 unique categories.


In [82]:
# Count taco places as Mexican restaurants (avoids chained assignment on the dataframe)
sd_venues_best_neighborhood['Venue_Category'] = sd_venues_best_neighborhood['Venue_Category'].replace(
    'Taco Place', 'Mexican Restaurant')
            
sd_venues_best_neighborhood.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue_Latitude | Venue_Longitude | Venue_Category |
|---|---|---|---|---|---|---|---|
| 0 | Hillcrest | 32.749997 | -117.166666 | Toma Sol Tavern | 32.749877 | -117.166341 | Sports Bar |
| 1 | Hillcrest | 32.749997 | -117.166666 | Lazy Acres Natural Market | 32.75021 | -117.167797 | Organic Grocery |
| 2 | Hillcrest | 32.749997 | -117.166666 | Vons | 32.74938 | -117.168194 | Grocery Store |
| 3 | Hillcrest | 32.749997 | -117.166666 | RK Sushi | 32.749992 | -117.16707 | Sushi Restaurant |
| 4 | Hillcrest | 32.749997 | -117.166666 | Sushi Deli 1 | 32.74995 | -117.165757 | Sushi Restaurant |


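Now we need to see how many venues of each kind exist in our dataframe. For this analysis, taco places and Mexican restaurants count as the same category, which is why the previous cell relabeled every "Taco Place" as "Mexican Restaurant". We first create the dummy variables, then add the neighborhood column and average the dummies by neighborhood.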

In [83]:
sd_dummies = pd.get_dummies(sd_venues_best_neighborhood[['Venue_Category']], prefix="", prefix_sep="")
sd_dummies.head()

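| | ATM | Accessories Store | American Restaurant | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | BBQ Joint | Bakery | Bar | ... | Steakhouse | Sushi Restaurant | Tapas Restaurant | Thai Restaurant | Tiki Bar | Vegetarian / Vegan Restaurant | Video Store | Whisky Bar | Wine Bar | Wine Shop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |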


In [84]:
sd_dummies['Neighborhood'] = sd_venues_best_neighborhood['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sd_dummies.columns[-1]] + list(sd_dummies.columns[:-1])
sd_dummies = sd_dummies[fixed_columns]

In [85]:
sd_top_grouped = sd_dummies.groupby('Neighborhood').mean().reset_index()
sd_top_grouped

| | Neighborhood | Wine Shop | ATM | Accessories Store | American Restaurant | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | BBQ Joint | ... | Sports Bar | Steakhouse | Sushi Restaurant | Tapas Restaurant | Thai Restaurant | Tiki Bar | Vegetarian / Vegan Restaurant | Video Store | Whisky Bar | Wine Bar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gaslamp Quarter | 0.01 | 0.0 | 0.01 | 0.05 | 0.01 | 0.0 | 0.0 | 0.02 | 0.01 | ... | 0.02 | 0.05 | 0.02 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 |
| 1 | Hillcrest | 0.0 | 0.0 | 0.0 | 0.116279 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.023256 | 0.0 | 0.069767 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | La Jolla | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Little Italy | 0.0 | 0.0 | 0.0 | 0.042254 | 0.0 | 0.014085 | 0.014085 | 0.0 | 0.0 | ... | 0.0 | 0.014085 | 0.014085 | 0.014085 | 0.0 | 0.014085 | 0.014085 | 0.0 | 0.0 | 0.042254 |
| 4 | Normal Heigths | 0.0 | 0.052632 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.052632 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | North Park | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.090909 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | Ocean Beach | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 |
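
Each value in this grouped table is the mean of a 0/1 dummy column within a neighborhood, which is the same as the share of that neighborhood's venues falling in that category. A minimal sketch of the same computation with plain counters may make this easier to see. The function name `category_frequencies`, the `merge` argument and the toy rows are all hypothetical; the neighborhoods "A" and "B" are invented and do not come from the Foursquare data.

```python
from collections import Counter, defaultdict


def category_frequencies(rows, merge=None):
    """Map each neighborhood to {category: share of its venues in that category}."""
    merge = merge or {}
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        # Relabel merged categories (e.g. taco places) before counting.
        counts[neighborhood][merge.get(category, category)] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }


# Toy (neighborhood, category) pairs, invented for illustration.
toy_rows = [
    ("A", "Taco Place"),
    ("A", "Mexican Restaurant"),
    ("A", "Bar"),
    ("A", "Bar"),
    ("B", "Park"),
]

freqs = category_frequencies(toy_rows, merge={"Taco Place": "Mexican Restaurant"})
print(freqs)
```

In the toy data, two of the four venues in "A" end up as "Mexican Restaurant" after the merge, so its share is 0.5, exactly what the mean of the corresponding dummy column would give.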


In [86]:
num_top_venues=10
for hood in sd_top_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = sd_top_grouped[sd_top_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Gaslamp Quarter----
                     venue  freq
0                      Bar  0.10
1                    Hotel  0.10
2       Mexican Restaurant  0.06
3      American Restaurant  0.05
4               Steakhouse  0.05
5                   Lounge  0.04
6  New American Restaurant  0.04
7       Seafood Restaurant  0.03
8              Pizza Place  0.03
9                     Café  0.03


----Hillcrest----
                      venue  freq
0        Mexican Restaurant  0.12
1       American Restaurant  0.12
2               Pizza Place  0.07
3               Coffee Shop  0.07
4          Sushi Restaurant  0.07
5        Salon / Barbershop  0.07
6  Mediterranean Restaurant  0.02
7         Mobile Phone Shop  0.02
8       Moroccan Restaurant  0.02
9             Grocery Store  0.02


----La Jolla----
                     venue  freq
0       Photography Studio  0.33
1           Scenic Lookout  0.33
2                     Park  0.33
3                Wine Shop  0.00
4        Mobile Phone Shop  0.00
5 

As we can see from the grouped data, La Jolla has no Mexican restaurants at all. However, there is a problem: this neighborhood lies far to the north of the city, so few tourists venture that far. Because of this, La Jolla would not be a good choice for a new Mexican restaurant.

For this reason, we will pay special attention to four specific neighborhoods: Little Italy, Hillcrest, Normal Heights and North Park.

Recall from the raw frequencies that Little Italy had a Mexican restaurant frequency of 0.03, while Hillcrest had 0.12, Normal Heights had 0.11 and North Park had 0.04. That makes Hillcrest the riskiest place to open a Mexican restaurant, since it has the most competition. But let's see whether more information can help us select which of these is the best place for our restaurant.

According to https://statisticalatlas.com/place/California/San-Diego/Race-and-Ethnicity, the neighborhoods with the highest density of Hispanic population are Hillcrest, Normal Heights and North Park. And, as noted, North Park had the lowest frequency of Mexican restaurants apart from Little Italy. Because of this, North Park would be a great neighborhood in which to start a new Mexican restaurant.
We must not forget that, even though Hillcrest has the highest frequency of Mexican restaurants, it is also the neighborhood with the greatest density of Hispanic residents. So even if opening a new restaurant there is risky, it may be more rewarding if things are done correctly.
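
The reasoning in the last paragraphs can be written down as a small ranking, shown here only as an illustrative sketch. The frequencies and the set of neighborhoods with a high density of Hispanic population are the ones quoted in the text; the variable names are made up for this example, and "lowest competition first" is a simplification of the argument, not a full scoring model.

```python
# Mexican restaurant frequencies quoted in the text above.
mexican_freq = {
    "Little Italy": 0.03,
    "Hillcrest": 0.12,
    "Normal Heights": 0.11,
    "North Park": 0.04,
}

# Neighborhoods the text lists as having the highest density of Hispanic population.
high_hispanic_density = {"Hillcrest", "Normal Heights", "North Park"}

# Keep only the high-density neighborhoods and sort them by competition, lowest first.
candidates = sorted(
    (hood for hood in mexican_freq if hood in high_hispanic_density),
    key=mexican_freq.get,
)
print(candidates)  # ['North Park', 'Normal Heights', 'Hillcrest']
```

North Park comes first because it combines a dense Hispanic population with little competition, while Hillcrest comes last on competition alone even though its customer base may be the largest.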

Just to be sure, let's create a map showing the Mexican restaurants listed in our previous dataframes.

In [87]:
venue_cor=sd_venues_best_neighborhood[["Venue","Venue_Latitude","Venue_Longitude","Venue_Category"]]

In [88]:
venue_cor=venue_cor.loc[venue_cor["Venue_Category"]=="Mexican Restaurant"]
venue_cor.reset_index(drop=True)

| | Venue | Venue_Latitude | Venue_Longitude | Venue_Category |
|---|---|---|---|---|
| 0 | La Posta de Acapulco's | 32.749763 | -117.162973 | Mexican Restaurant |
| 1 | El Cuervo | 32.750104 | -117.164221 | Mexican Restaurant |
| 2 | Ortega's, A Mexican Bistro | 32.74824 | -117.162993 | Mexican Restaurant |
| 3 | Eat Mexican Food | 32.749653 | -117.169822 | Mexican Restaurant |
| 4 | Fiesta Cantina | 32.748292 | -117.162864 | Mexican Restaurant |
| 5 | King and Queen Cantina | 32.720818 | -117.169177 | Mexican Restaurant |
| 6 | The Taco Stand | 32.74111 | -117.129632 | Mexican Restaurant |
| 7 | The Blind Burro | 32.709269 | -117.158356 | Mexican Restaurant |
| 8 | Tacos El Cabron | 32.710852 | -117.16113 | Mexican Restaurant |
| 9 | La Puerta | 32.711093 | -117.161181 | Mexican Restaurant |


In [92]:
Venue_lat_lon=venue_cor[["Venue_Latitude","Venue_Longitude"]].copy() # copy so the cluster labels can be inserted later without a SettingWithCopyWarning
Venue_lat_lon.reset_index(drop=True)

| | Venue_Latitude | Venue_Longitude |
|---|---|---|
| 0 | 32.749763 | -117.162973 |
| 1 | 32.750104 | -117.164221 |
| 2 | 32.74824 | -117.162993 |
| 3 | 32.749653 | -117.169822 |
| 4 | 32.748292 | -117.162864 |
| 5 | 32.720818 | -117.169177 |
| 6 | 32.74111 | -117.129632 |
| 7 | 32.709269 | -117.158356 |
| 8 | 32.710852 | -117.16113 |
| 9 | 32.711093 | -117.161181 |


In [93]:
kclusters = 1

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0, n_init=10).fit(Venue_lat_lon) # n_init set explicitly so behavior does not depend on the scikit-learn version
Venue_lat_lon.insert(0, 'Cluster Labels', kmeans.labels_)

In [98]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=15)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Venue_lat_lon['Venue_Latitude'], Venue_lat_lon['Venue_Longitude'], venue_cor['Venue'], Venue_lat_lon['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,  # attach the popup built above so each marker shows its venue name
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

As we can see from this map, almost all of the restaurants are clustered close together, so picking a spot fairly far from them in one of the listed neighborhoods (like North Park) should be fairly safe!
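
To make "fairly far from them" concrete, a candidate location can be compared against the existing restaurants by great-circle distance. The sketch below is one possible way to do that, not part of the original analysis: it uses the haversine formula with the restaurant coordinates from the table above and, as an example candidate, the North Park coordinates from the first dataframe. The function names `haversine_m` and `nearest_competitor` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Mexican restaurants and coordinates as listed in the table above.
mexican_restaurants = [
    ("La Posta de Acapulco's", 32.749763, -117.162973),
    ("El Cuervo", 32.750104, -117.164221),
    ("Ortega's, A Mexican Bistro", 32.74824, -117.162993),
    ("Eat Mexican Food", 32.749653, -117.169822),
    ("Fiesta Cantina", 32.748292, -117.162864),
    ("King and Queen Cantina", 32.720818, -117.169177),
    ("The Taco Stand", 32.74111, -117.129632),
    ("The Blind Burro", 32.709269, -117.158356),
    ("Tacos El Cabron", 32.710852, -117.16113),
    ("La Puerta", 32.711093, -117.161181),
]


def nearest_competitor(lat, lon, venues):
    """Return (name, distance in meters) of the venue closest to the given point."""
    return min(
        ((name, haversine_m(lat, lon, vlat, vlon)) for name, vlat, vlon in venues),
        key=lambda pair: pair[1],
    )


# Example candidate: the North Park coordinates used at the start of the notebook.
name, dist = nearest_competitor(32.7408842, -117.1305877, mexican_restaurants)
print(f"Nearest Mexican restaurant: {name} ({dist:.0f} m away)")
```

Running the same check for several candidate addresses inside a neighborhood would show which of them keeps the largest distance from the nearest existing restaurant.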