## IBM Applied Data Science Capstone Course by Coursera
### Opening a new truly Italian restaurant in Milan (Italy)

### I. Import libraries

In [1]:
# Library to handle data in a vectorized manner
import numpy as np 
# Library for data analysis
import pandas as pd 
# Library to apply the k-means algorithm
from sklearn.cluster import KMeans

# Library to handle JSON files
import json 
# Function to transform JSON data into a pandas DataFrame
from pandas import json_normalize

# Library to convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
# Library to get coordinates
import geocoder 

# Library to handle requests
import requests 
# Library to parse HTML and XML documents
from bs4 import BeautifulSoup 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# Library to render a map
import folium 

### II. Scrape data from Wikipedia page of Milan districts

In [2]:
# Send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Districts_of_Milan").text

# Parse data from the HTML into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

# Create a list to store neighborhood data
districtList = []

# Append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    districtList.append(row.text)
    
# Create a new DataFrame from the list
milan_df = pd.DataFrame({"District": districtList})
milan_df.head()

Unnamed: 0,District
0,Affori
1,Assiano
2,Baggio (district of Milan)
3,Barona
4,Bicocca (district of Milan)


In [3]:
# Print the number of rows of the dataframe
milan_df.shape

(76, 1)
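Several scraped names carry a Wikipedia disambiguation suffix such as "(district of Milan)". The notebook keeps these names as they are; the sketch below shows one possible way to strip a trailing parenthetical before geocoding. The helper `clean_district_name` is hypothetical and not part of the original notebook, and whether it improves the geocoding results has not been checked here.

```python
import re

# Matches a trailing parenthetical such as " (district of Milan)" or " (Milan)".
_PAREN_SUFFIX = re.compile(r"\s*\([^)]*\)\s*$")

def clean_district_name(name):
    """Return the district name without a trailing parenthetical qualifier."""
    return _PAREN_SUFFIX.sub("", name).strip()

# A few names as they appear in the scraped list.
names = ["Affori", "Baggio (district of Milan)", "Porta Nuova (Milan)"]
cleaned = [clean_district_name(n) for n in names]
print(cleaned)
```

In the notebook this could be applied with `milan_df["District"].map(clean_district_name)`, at the cost of no longer matching the Wikipedia titles exactly.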

### III. Get the geographical coordinates

In [5]:
# Define a function to get coordinates:
def get_latlng(district):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Milan, Italy'.format(district))
        lat_lng_coords = g.latlng
    return lat_lng_coords

# Call the function to get the coordinates and store it in a new list using list comprehension
coords = [get_latlng(district) for district in milan_df["District"].tolist()]

coords[:5]

[[45.51410000000004, 9.173530000000028],
 [45.45058966499236, 9.06163771478597],
 [45.46324000000004, 9.092700000000036],
 [45.433710000000076, 9.15160000000003],
 [45.52149000000003, 9.213260000000048]]
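The `while` loop in `get_latlng` never ends if the geocoder keeps returning `None` for a district. A minimal sketch of a bounded variant follows. The function `get_latlng_bounded`, its `lookup` parameter and the stand-in `fake_lookup` are assumptions made for illustration; in the notebook, `lookup` would wrap the `geocoder.arcgis(...).latlng` call used above.

```python
import time

def get_latlng_bounded(district, lookup, max_tries=3, pause=0.0):
    """Call lookup(query) up to max_tries times; return [lat, lng] or None."""
    query = "{}, Milan, Italy".format(district)
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause)  # optional pause between attempts
    return None  # give up instead of looping forever

# Stand-in lookup for demonstration: fails once, then succeeds.
calls = {"n": 0}

def fake_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 2 else [45.5141, 9.17353]

result = get_latlng_bounded("Affori", fake_lookup)
missing = get_latlng_bounded("Nowhere", lambda q: None)
print(result, missing)
```

Districts that come back as `None` can then be inspected or dropped before the coordinates are merged into `milan_df`.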

In [7]:
# Create a temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

df_coords.head()

Unnamed: 0,Latitude,Longitude
0,45.5141,9.17353
1,45.45059,9.061638
2,45.46324,9.0927
3,45.43371,9.1516
4,45.52149,9.21326


In [9]:
# Merge the coordinates into the original dataframe
milan_df['Latitude'] = df_coords['Latitude']
milan_df['Longitude'] = df_coords['Longitude']

# check the neighborhoods and the coordinates
print(milan_df.shape)
milan_df.head()

(76, 3)


Unnamed: 0,District,Latitude,Longitude
0,Affori,45.5141,9.17353
1,Assiano,45.45059,9.061638
2,Baggio (district of Milan),45.46324,9.0927
3,Barona,45.43371,9.1516
4,Bicocca (district of Milan),45.52149,9.21326


In [10]:
# Save the DataFrame as CSV file
milan_df.to_csv("milan_df.csv", index=False)

### IV. Create a map of Milan with districts superimposed on top

In [11]:
# Get the coordinates of Milan
address = 'Milan, Italy'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Milan (Italy) are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Milan (Italy) are 45.4667971, 9.1904984.


In [12]:
# Create map of Milan using latitude and longitude values
map_milan = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add markers to map
for lat, lng, district in zip(milan_df['Latitude'], milan_df['Longitude'], milan_df['District']):
    label = '{}'.format(district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_milan)  
    
map_milan

In [13]:
# Save the map as HTML file
map_milan.save('map_milan.html')

### V. Use the Foursquare API to explore the neighborhoods

In [14]:
# Define Foursquare Credentials and Version
CLIENT_ID = 'AAAAA' # your Foursquare ID
CLIENT_SECRET = 'BBBBB' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: AAAAA
CLIENT_SECRET: BBBBB
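Printing API credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables and masks them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are assumptions for illustration, not part of the original notebook or of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read credentials from environment variables; raise if one is missing."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing))

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demonstration with a plain dict standing in for os.environ.
demo_env = {"FOURSQUARE_CLIENT_ID": "AAAAA", "FOURSQUARE_CLIENT_SECRET": "BBBBB"}
CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials(demo_env)
print("CLIENT_ID: " + mask(CLIENT_ID))
```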


In [15]:
# Request up to LIMIT venues within `radius` meters of each district (the API may return fewer than requested)
radius = 5000
LIMIT = 500

venues = []

for lat, long, district in zip(milan_df['Latitude'], milan_df['Longitude'], milan_df['District']):  
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            district,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
        
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['District', 'Latitude', 'Longitude', 'VenueName', 'VenueLat', 'VenueLong', 'VenueCat']

print(venues_df.shape)
venues_df.head()

(7437, 7)


Unnamed: 0,District,Latitude,Longitude,VenueName,VenueLat,VenueLong,VenueCat
0,Affori,45.5141,9.17353,Spirit de Milan,45.506678,9.159744,Ballroom
1,Affori,45.5141,9.17353,Parco di Villa Litta,45.516414,9.167165,Park
2,Affori,45.5141,9.17353,La Scighera (arci),45.502966,9.161845,Music Venue
3,Affori,45.5141,9.17353,Virgin Active,45.502018,9.18259,Gym / Fitness Center
4,Affori,45.5141,9.17353,Il Bucatino con Giardino,45.502088,9.165959,Italian Restaurant


In [16]:
# Find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(venues_df['VenueCat'].unique())))

There are 213 unique categories.


In [17]:
# print out the list of categories
venues_df['VenueCat'].unique()

array(['Ballroom', 'Park', 'Music Venue', 'Gym / Fitness Center',
       'Italian Restaurant', 'Paper / Office Supplies Store', 'Creperie',
       'Gym', 'Ice Cream Shop', 'Café', 'Jazz Club', 'Cemetery',
       'Ramen Restaurant', 'Sushi Restaurant', 'Pub',
       'Japanese Restaurant', 'Pizza Place', 'Hotel',
       'Brazilian Restaurant', 'Theater', 'Nightclub', 'Sports Bar',
       'Dessert Shop', 'Plaza', 'Art Gallery', 'Event Space', 'Bistro',
       'Wine Shop', 'Club House', 'Dim Sum Restaurant',
       'Persian Restaurant', 'Government Building', 'Seafood Restaurant',
       'Boutique', 'Steakhouse', 'Cocktail Bar', 'Coffee Shop',
       'Restaurant', 'Historic Site', 'Gym Pool', 'Butcher',
       'Chinese Restaurant', 'Hostel', 'Diner', 'Bar',
       'Mongolian Restaurant', 'Winery', 'Beer Bar',
       'Monument / Landmark', 'Grocery Store', 'Gourmet Shop',
       'Bookstore', 'Pastry Shop', 'Mediterranean Restaurant', 'Wine Bar',
       'Scandinavian Restaurant', 'Flower Sho

In [18]:
# Check whether the results contain the "Italian Restaurant" category
"Italian Restaurant" in venues_df['VenueCat'].unique()

True

### VI. Analyze Each Neighborhood

In [19]:
# One hot encoding
milan_onehot = pd.get_dummies(venues_df[['VenueCat']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
milan_onehot['Districts'] = venues_df['District'] 

# Move neighborhood column to the first column
fixed_columns = [milan_onehot.columns[-1]] + list(milan_onehot.columns[:-1])
milan_onehot = milan_onehot[fixed_columns]

print(milan_onehot.shape)
milan_onehot.head()

(7437, 214)


Unnamed: 0,Districts,Accessories Store,African Restaurant,Agriturismo,Airport,American Restaurant,Amphitheater,Arcade,Argentinian Restaurant,Art Gallery,...,Tuscan Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Volleyball Court,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Women's Store
0,Affori,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Affori,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Affori,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Affori,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Affori,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# Group rows by district and take the mean of each one-hot column, i.e. the share of each category among the district's venues
milan_grouped = milan_onehot.groupby(["Districts"]).mean().reset_index()

print(milan_grouped.shape)
milan_grouped

(76, 214)


Unnamed: 0,Districts,Accessories Store,African Restaurant,Agriturismo,Airport,American Restaurant,Amphitheater,Arcade,Argentinian Restaurant,Art Gallery,...,Tuscan Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Volleyball Court,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Women's Store
0,Affori,0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.02,...,0.00,0.00,0.00,0.00,0.00,0.0,0.00,0.01,0.01,0.00
1,Assiano,0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.01,0.0,0.01,0.00,0.00,0.00
2,Baggio (district of Milan),0.00,0.00,0.00,0.00,0.0,0.00,0.01,0.00,0.00,...,0.00,0.00,0.00,0.01,0.01,0.0,0.01,0.01,0.00,0.00
3,Barona,0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.02,0.02,...,0.01,0.00,0.00,0.00,0.00,0.0,0.01,0.01,0.00,0.00
4,Bicocca (district of Milan),0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.01,...,0.00,0.00,0.01,0.00,0.00,0.0,0.00,0.00,0.00,0.00
5,Bovisa,0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.02,...,0.00,0.00,0.00,0.00,0.00,0.0,0.01,0.01,0.01,0.00
6,Bovisasca,0.00,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.01,...,0.00,0.00,0.01,0.00,0.00,0.0,0.01,0.01,0.01,0.00
7,Brera (district of Milan),0.02,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.03,...,0.00,0.00,0.00,0.00,0.00,0.0,0.03,0.00,0.00,0.00
8,Bruzzano,0.00,0.00,0.00,0.01,0.0,0.00,0.00,0.00,0.01,...,0.00,0.00,0.01,0.00,0.00,0.0,0.00,0.00,0.00,0.00
9,Calvairate,0.01,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.03,...,0.00,0.00,0.00,0.00,0.00,0.0,0.03,0.00,0.00,0.01


In [21]:
len(milan_grouped[milan_grouped["Italian Restaurant"] > 0])

76

In [22]:
# Create a new DataFrame for restaurant data only
milan_restaurant = milan_grouped[["Districts","Italian Restaurant"]]
milan_restaurant.head()

Unnamed: 0,Districts,Italian Restaurant
0,Affori,0.14
1,Assiano,0.16
2,Baggio (district of Milan),0.19
3,Barona,0.11
4,Bicocca (district of Milan),0.14


### VII. Cluster districts

In [23]:
# Run k-means to cluster the districts of Milan into 5 clusters
kclusters = 5

milan_clustering = milan_restaurant.drop(columns=["Districts"])

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(milan_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 2, 3, 0, 3, 0, 4, 3, 1, 0, 2, 0, 2, 3, 3, 2, 3, 3, 3, 0, 0,
       1, 1, 2, 3, 0, 0, 0, 4, 0, 4, 0, 2, 1, 3, 0, 1, 4, 4, 3, 1, 1, 1,
       0, 4, 4, 1, 1, 1, 1, 1, 3, 2, 1, 4, 2, 3, 0, 0, 0, 0, 2, 3, 1, 4,
       3, 0, 1, 3, 3, 0, 4, 0, 1, 3])
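Because the clustering uses a single feature, k-means here simply partitions the districts into ranges of Italian-restaurant frequency. The sketch below illustrates the idea with a plain Lloyd's iteration on scalars. It is not scikit-learn's implementation: the quantile-based initialization and the function `kmeans_1d` are assumptions for illustration, so its label numbers will not match those of `KMeans`.

```python
def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on scalars; returns (labels, centers)."""
    ordered = sorted(values)
    # Deterministic start: k evenly spaced quantiles of the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

# A handful of frequencies taken from the tables in this notebook.
freqs = [0.14, 0.16, 0.19, 0.11, 0.14, 0.06, 0.04, 0.20]
labels, centers = kmeans_1d(freqs, k=3)
print(labels, centers)
```

With this initialization the centers come out in ascending order, so a higher label means a higher frequency. The labels from scikit-learn carry no such ordering.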

In [25]:
# Create a new dataframe that includes the cluster label and the Italian Restaurant frequency for each district
milan_merged = milan_restaurant.copy()

# Add clustering labels
milan_merged["Cluster Labels"] = kmeans.labels_
milan_merged.rename(columns={"Districts": "District"}, inplace=True)
milan_merged.head()

Unnamed: 0,District,Italian Restaurant,Cluster Labels
0,Affori,0.14,0
1,Assiano,0.16,0
2,Baggio (district of Milan),0.19,2
3,Barona,0.11,3
4,Bicocca (district of Milan),0.14,0


In [26]:
# Merge milan_merged with milan_df to add latitude/longitude for each neighborhood
milan_merged = milan_merged.join(milan_df.set_index("District"), on="District")

print(milan_merged.shape)
milan_merged.head()

(76, 5)


Unnamed: 0,District,Italian Restaurant,Cluster Labels,Latitude,Longitude
0,Affori,0.14,0,45.5141,9.17353
1,Assiano,0.16,0,45.45059,9.061638
2,Baggio (district of Milan),0.19,2,45.46324,9.0927
3,Barona,0.11,3,45.43371,9.1516
4,Bicocca (district of Milan),0.14,0,45.52149,9.21326
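The cluster numbers assigned by k-means are arbitrary, so it helps to rank the clusters by their mean frequency before interpreting them. The following sketch does this in plain Python on a few rows copied from the tables in this notebook. The helper `mean_by_cluster` is hypothetical, and the means here cover only this subset of rows, not all 76 districts.

```python
from collections import defaultdict

# (district, Italian Restaurant frequency, cluster label), from the notebook tables.
rows = [
    ("Affori", 0.14, 0),
    ("Assiano", 0.16, 0),
    ("Baggio (district of Milan)", 0.19, 2),
    ("Barona", 0.11, 3),
    ("Brera (district of Milan)", 0.06, 4),
    ("Calvairate", 0.09, 1),
]

def mean_by_cluster(rows):
    """Return {cluster label: mean frequency of its districts}."""
    groups = defaultdict(list)
    for _, freq, cluster in rows:
        groups[cluster].append(freq)
    return {cluster: sum(v) / len(v) for cluster, v in groups.items()}

means = mean_by_cluster(rows)
# Clusters ordered from highest to lowest mean frequency.
ranking = sorted(means, key=means.get, reverse=True)
print(ranking)
```

On the full dataframe, the equivalent would be `milan_merged.groupby("Cluster Labels")["Italian Restaurant"].mean()`.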


In [27]:
# Visualize the resulting clusters
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# Set color scheme: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers
for lat, lon, poi, cluster in zip(milan_merged['Latitude'], milan_merged['Longitude'], milan_merged['District'], milan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [28]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

### VIII. Examine Clusters

#### Cluster 0

In [29]:
milan_merged.loc[milan_merged['Cluster Labels'] == 0]

Unnamed: 0,District,Italian Restaurant,Cluster Labels,Latitude,Longitude
0,Affori,0.14,0,45.5141,9.17353
1,Assiano,0.16,0,45.45059,9.061638
4,Bicocca (district of Milan),0.14,0,45.52149,9.21326
6,Bovisasca,0.17,0,45.51555,9.15094
10,Centro Direzionale di Milano,0.16,0,45.501976,9.264641
12,"Chinatown, Milan",0.17,0,45.50086,9.26513
20,Gallaratese,0.14,0,45.49671,9.11484
21,Garegnano,0.14,0,45.50469,9.13697
26,Greco (district of Milan),0.14,0,45.49702,9.21212
27,Lambrate,0.17,0,45.48157,9.25172


#### Cluster 1

In [30]:
milan_merged.loc[milan_merged['Cluster Labels'] == 1]

Unnamed: 0,District,Italian Restaurant,Cluster Labels,Latitude,Longitude
9,Calvairate,0.09,1,45.45618,9.22488
22,Ghisolfa,0.09,1,45.49631,9.1694
23,Giambellino-Lorenteggio,0.08,1,45.44622,9.134919
34,Nosedo,0.07,1,45.43381,9.22137
37,Porta Garibaldi (Milan),0.07,1,45.48065,9.18731
41,Porta Monforte,0.08,1,45.467223,9.202516
42,Porta Nuova (Milan),0.07,1,45.47971,9.19247
43,Porta Romana (Milan),0.08,1,45.4569,9.20095
47,Porta Venezia,0.08,1,45.47098,9.19981
48,Porta Vigentina,0.07,1,45.453726,9.196123


#### Cluster 2

In [31]:
milan_merged.loc[milan_merged['Cluster Labels'] == 2]

Unnamed: 0,District,Italian Restaurant,Cluster Labels,Latitude,Longitude
2,Baggio (district of Milan),0.19,2,45.46324,9.0927
11,Chiaravalle (district of Milan),0.2,2,45.41719,9.23971
13,Cimiano,0.23,2,45.50346,9.2488
16,Crescenzago,0.22,2,45.51054,9.24386
24,Gorla,0.2,2,45.5059,9.22264
33,Niguarda,0.2,2,45.5184,9.19201
53,Precotto,0.19,2,45.51541,9.22553
56,Quartiere Feltre,0.2,2,45.491594,9.250424
62,Rogoredo,0.19,2,45.43016,9.244


#### Cluster 3

In [32]:
milan_merged.loc[milan_merged['Cluster Labels'] == 3]

Unnamed: 0,District,Italian Restaurant,Cluster Labels,Latitude,Longitude
3,Barona,0.11,3,45.43371,9.1516
5,Bovisa,0.11,3,45.50313,9.16122
8,Bruzzano,0.13,3,45.52825,9.18071
14,Città Studi,0.12,3,45.47708,9.2266
15,Comasina,0.12,3,45.52631,9.15887
17,Dergano,0.12,3,45.50411,9.17647
18,Figino (district of Milan),0.11,3,45.49234,9.07852
19,Forlanini (district of Milan),0.11,3,45.45975,9.2469
25,Gratosoglio,0.11,3,45.41459,9.17122
35,Ortica,0.13,3,45.47096,9.2416


#### Cluster 4

In [33]:
milan_merged.loc[milan_merged['Cluster Labels'] == 4]

Unnamed: 0,District,Italian Restaurant,Cluster Labels,Latitude,Longitude
7,Brera (district of Milan),0.06,4,45.4747,9.19001
29,Milano Santa Giulia,0.06,4,45.46796,9.18178
31,Morivione,0.05,4,45.44099,9.18781
38,Porta Genova,0.04,4,45.4579,9.17457
39,Porta Lodovica,0.05,4,45.45318,9.18929
45,Porta Tenaglia,0.06,4,45.47765,9.182238
46,Porta Ticinese,0.04,4,45.45738,9.18095
55,Quadrilatero della moda,0.06,4,45.46796,9.18178
65,San Cristoforo sul Naviglio (district of Milan),0.04,4,45.44763,9.15458
72,Vaiano Valle,0.06,4,45.42893,9.2162


#### Discussion
Clusters 2, 0 and 3 contain the highest share of Italian restaurants (roughly 0.11 to 0.23 in the rows shown above). Most of these districts cover a large area and lie outside the very centre of the city; they could almost be considered small towns in their own right.
Conversely, Clusters 1 and 4, despite being in the centre, show markedly lower values (roughly 0.04 to 0.09 in the rows shown above). The share of Italian restaurants is therefore low there, probably because the centre hosts many other types of restaurants that cater to the varied visitors the city attracts.
The same reasoning explains why the share of Italian restaurants is higher in the outer districts: these are more residential areas, home mostly to commuters and Italians, so Italian restaurants are easier to find.
Thus, the best option is to open the restaurant in the central area of Milan, where Italian restaurants make up a smaller share of the venues. Moreover, looking at the map, we notice that the eastern area has no points.
This area hosts a very important museum and a university, so it has more bars than restaurants. It is also a chic district, so opening a classy, upscale restaurant here would be a good choice.