# Diving equipment store in Haute-Savoie

## Table of contents
   1. [Introduction : Business Problem](#introduction)
   2. [Data](#data)
       - [Cities and their population in Haute-Savoie](#cities)
       - [Localization of the cities](#loc)
       - [Venues in the neighborhoods](#venues)
   3. [Methodology and data analysis](#methodology)
   4. [Results](#results)
   5. [Discussion](#discussion)
   6. [Conclusion](#conclusion)

## 1. Introduction : Business Problem <a name="introduction"></a>

In this project we look for an optimal location for a diving equipment store. The report is aimed at stakeholders interested in opening a **diving & watersport equipment store** in **Haute-Savoie**, France.

Since the store is dedicated to scuba diving and watersport activities, we first look for **locations with *water access* in the vicinity**. To limit the effect of competition, we also require that **no sports store is already established near the selected areas**.
Then, among the locations that meet these two conditions, we prefer **frequented places** with a fairly high population density.

We will use data science tools to shortlist the most promising neighborhoods based on these criteria. The advantages of each area are then stated clearly, so that the stakeholders can easily make a choice.

## 2. Data <a name="data"></a>

Based on our business problem, we need to gather the following data:
 * Name and location of all the cities in Haute-Savoie
 * Population of each city
 * Number and type of water access points in the neighborhood of each city, if any
 * Number of existing sports stores in the neighborhood of each city
 

The following data sources are needed:
 * Names and populations of the cities in Haute-Savoie are taken from a table available on **Wikipedia**
 * Locations of the cities are read from a spreadsheet file (`Communes74.xlsx`)
 * Venues such as water access points and sports stores are extracted using the **Foursquare API**

### a. Cities and their population in Haute-Savoie <a name="cities"></a>

Let's extract the Wikipedia table that lists all the cities ('communes' in French) of Haute-Savoie together with their population.

In [1]:
#import libraries
import numpy as np
import pandas as pd 
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows', None)

In [2]:
# Extract table from wikipedia
from pandas.io.html import read_html
page='https://fr.wikipedia.org/wiki/Liste_des_communes_de_la_Haute-Savoie'
table=read_html(page,attrs={"class":"wikitable"})
print('Table extracted')

Table extracted


In [3]:
# Check the first rows of the obtained dataframe
communes74=table[0]
print(communes74.shape)
communes74.head()

(280, 10)


Unnamed: 0,Nom,CodeInsee,Code postal,Arrondissement,Canton,Intercommunalité,Superficie(km2),Population(dernière pop. légale),Densité(hab./km2),Modifier
0,Annecy(préfecture),74010,7400074370746007494074960,Annecy,Annecy-1Annecy-2Annecy-le-VieuxSeynod,CA du Grand Annecy,6694,126 924 (2017),1 896,
1,Abondance,74001,74360,Thonon-les-Bains,Évian-les-Bains,CC Pays d'Évian Vallée d'Abondance,5884,1 439 (2017),24,
2,Alby-sur-Chéran,74002,74540,Annecy,Rumilly,CA du Grand Annecy,656,2 579 (2017),393,
3,Alex,74003,74290,Annecy,Faverges,CC des vallées de Thônes,1702,1 072 (2017),63,
4,Allèves,74004,74540,Annecy,Rumilly,CA du Grand Annecy,881,411 (2017),47,


We only need the columns with the name and the population, so let's clean the table.

In [4]:
# Remove rows with undefined postal code
communes74.dropna(subset=['Code postal'],axis=0,inplace=True)
# Remove all columns except 'Nom' and 'Population'
communes74.drop(['CodeInsee','Code postal','Arrondissement','Canton','Intercommunalité','Superficie(km2)','Densité(hab./km2)','Modifier'],axis=1,inplace=True)
# Rename the columns as 'Commune' and 'Population'
communes74.rename(columns={'Nom':'Commune','Population(dernière pop. légale)':'Population'},inplace=True)
# Remove all comments in ()
communes74['Commune']=communes74['Commune'].str.split('(').str[0]
communes74['Population']=communes74['Population'].str.split('(').str[0]
print(communes74.shape)
communes74.head()

(279, 2)


Unnamed: 0,Commune,Population
0,Annecy,126 924
1,Abondance,1 439
2,Alby-sur-Chéran,2 579
3,Alex,1 072
4,Allèves,411


Population has to be treated as a number, so let's check the data types of the collected columns.

In [5]:
communes74.dtypes

Commune       object
Population    object
dtype: object

In [6]:
# Remove the thousands separator (spaces, including non-breaking ones) from the strings.
# A vectorized replace avoids chained assignment and does not depend on the row index.
communes74['Population'] = communes74['Population'].str.replace(r'\s+', '', regex=True)

In [7]:
communes74[['Population']]=communes74[['Population']].astype('int')
print(communes74.dtypes)
communes74.head()

Commune       object
Population     int64
dtype: object


Unnamed: 0,Commune,Population
0,Annecy,126924
1,Abondance,1439
2,Alby-sur-Chéran,2579
3,Alex,1072
4,Allèves,411
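
The two cleaning steps above (dropping the parenthesised year, then removing the thousands separator) can also be done in one pass. Here is a minimal sketch, assuming the raw cells look like `'126 924 (2017)'` as in the Wikipedia table; the helper name `parse_population` is ours and is not part of the notebook.

```python
import re

import pandas as pd


def parse_population(raw):
    """Turn a raw cell such as '126 924 (2017)' into the integer 126924."""
    # Keep only the part before the first '(' (the census year follows it)
    number_part = raw.split('(')[0]
    # Remove every whitespace character, including non-breaking spaces
    digits = re.sub(r'\s+', '', number_part)
    return int(digits)


# Sample values copied from the first rows of the Wikipedia table
raw_population = pd.Series(['126 924 (2017)', '1 439 (2017)', '411 (2017)'])
population = raw_population.map(parse_population)
print(population.tolist())
```

Applied to the whole column, this yields integers directly, so no separate `astype('int')` step is needed.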


### b. Localization of the cities <a name="loc"></a>

Now let's also collect the latitude and longitude of all these cities. For that we read a spreadsheet (`Communes74.xlsx`) stored in IBM Cloud Object Storage.

In [8]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_8956600744454fd49a672b20480e9c77 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential redacted; supply your own key
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_8956600744454fd49a672b20480e9c77.get_object(Bucket='courseracapstonenotebook-donotdelete-pr-8bhzmhzejueb8e',Key='Communes74.xlsx')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

communes_latlon = pd.read_excel(body)
print(communes_latlon.shape)
communes_latlon.head()

(279, 4)


Unnamed: 0,Commune,Code Postal,Latitude,Longitude
0,Annecy,74000\n74370\n74600\n74940\n74960,45.900002,6.11667
1,Abondance,74360,46.283329,6.73333
2,Alby-sur-Chéran,74540,45.8167,6.0167
3,Alex,74290,45.883331,6.23333
4,Allèves,74540,45.75,6.08333


Let's first remove the postal code from this table and then merge the two tables on the city name.

In [9]:
communes_latlon.drop({'Code Postal'},axis=1,inplace=True)
comm74 = pd.merge(communes74,communes_latlon,on='Commune')
print(comm74.shape)
comm74.head()

(279, 4)


Unnamed: 0,Commune,Population,Latitude,Longitude
0,Annecy,126924,45.900002,6.11667
1,Abondance,1439,46.283329,6.73333
2,Alby-sur-Chéran,2579,45.8167,6.0167
3,Alex,1072,45.883331,6.23333
4,Allèves,411,45.75,6.08333
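
Both input tables have 279 rows and so does the merged one, which suggests that every name matched. A minimal sketch of how this could be checked explicitly with the `indicator` option of `pd.merge` follows. The two small frames are illustrative stand-ins, and the `'Alex '` entry with a trailing space is made up to show what a mismatch looks like.

```python
import pandas as pd

population = pd.DataFrame({'Commune': ['Annecy', 'Abondance', 'Alex '],
                           'Population': [126924, 1439, 1072]})
location = pd.DataFrame({'Commune': ['Annecy', 'Abondance', 'Alex'],
                         'Latitude': [45.900002, 46.283329, 45.883331],
                         'Longitude': [6.11667, 6.73333, 6.23333]})

# An outer merge keeps unmatched rows; '_merge' tells which table each row came from
check = pd.merge(population, location, on='Commune', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Commune', '_merge']])

# Stripping stray whitespace before merging fixes this particular mismatch
population['Commune'] = population['Commune'].str.strip()
merged = pd.merge(population, location, on='Commune')
print(merged.shape)
```

An empty `unmatched` frame would confirm that the default inner merge did not silently drop any city.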


To visualize the distribution of the cities on a map, we import the following libraries.

In [10]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors


In [11]:
# Define Latitude and Longitude of Haute-Savoie
T_lon = 6.3833
T_lat = 46.05

# Create map of Haute-Savoie
map_htesavoie = folium.Map(location = [T_lat, T_lon], zoom_start=10)

# Add markers to map
for lat, lon, comm in zip(comm74['Latitude'], comm74['Longitude'], comm74['Commune']):
    label = '{}'.format(comm)
    label = folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html=False).add_to(map_htesavoie)

map_htesavoie

### c. Venues in the neighborhoods <a name="venues"></a>

Now that we have a good overview of the cities in Haute-Savoie, let's find out where the water access points and sports stores around them are. We will use the Foursquare API to gather this information.

Foursquare credentials are defined in the hidden cell below.

In [12]:
# @hidden_cell
CLIENT_ID = '<YOUR_FOURSQUARE_CLIENT_ID>' # your Foursquare ID (redacted)
CLIENT_SECRET = '<YOUR_FOURSQUARE_CLIENT_SECRET>' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version

In [13]:
import requests

In [14]:
# Define a function to get nearby venues
# limit of number of venues is set to 100 and radius around each location is set to 2500 m

limit = 100

def getNearbyVenues(names, latitudes, longitudes, radius=2500):
    
    venues_list=[]
    for name, lat, lon in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lon,
            radius, 
            limit)
        
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lon, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'],
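            # note: this is the ID of the venue's category (stored in the 'Venue Id' column), not of the venue itself
            v['venue']['categories'][0]['id']) for v in results])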

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Commune', 
                  'Commune Latitude', 
                  'Commune Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category',
                  'Venue Id']
    
    return nearby_venues
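
The list comprehension in `getNearbyVenues` assumes that every venue has at least one category. Below is a minimal sketch of a more defensive parsing step. It runs on a hand-written payload that mimics only the fields accessed above; the payload and the helper name `venue_rows` are illustrative and are not real API output.

```python
def venue_rows(name, lat, lon, items):
    """Flatten the 'items' list of an explore response into tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        if not categories:
            # Skip venues without a category instead of raising IndexError
            continue
        rows.append((name, lat, lon,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name'],
                     categories[0]['id']))
    return rows


# Hand-written sample: one complete venue and one without categories
sample_items = [
    {'venue': {'name': 'Brumes',
               'location': {'lat': 45.899517, 'lng': 6.123535},
               'categories': [{'name': 'Coffee Shop',
                               'id': '4bf58dd8d48988d1e0931735'}]}},
    {'venue': {'name': 'Uncategorised place',
               'location': {'lat': 45.9, 'lng': 6.12},
               'categories': []}},
]

rows = venue_rows('Annecy', 45.900002, 6.11667, sample_items)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to test it without network access.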

Let's now go over our city locations and get the nearby venues.

In [16]:
htesavoie_venues = getNearbyVenues(names = comm74['Commune'], latitudes = comm74['Latitude'], longitudes = comm74['Longitude'])
print(htesavoie_venues.shape)
htesavoie_venues.head()

(2391, 8)


Unnamed: 0,Commune,Commune Latitude,Commune Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Id
0,Annecy,45.900002,6.11667,Brumes,45.899517,6.123535,Coffee Shop,4bf58dd8d48988d1e0931735
1,Annecy,45.900002,6.11667,Beer O'Clock,45.897427,6.123039,Bar,4bf58dd8d48988d116941735
2,Annecy,45.900002,6.11667,Une Autre Histoire,45.899761,6.121999,Tea Room,4bf58dd8d48988d1dc931735
3,Annecy,45.900002,6.11667,Chez Pen,45.904349,6.121446,Bar,4bf58dd8d48988d116941735
4,Annecy,45.900002,6.11667,Le Barista Café,45.900824,6.124986,Coffee Shop,4bf58dd8d48988d1e0931735


In [17]:
htesavoie_venues.to_pickle('./htesavoie_venues.pkl')    

In [18]:
import pickle
with open('htesavoie_venues.pkl', 'rb') as f:
        t = pickle.load(f)

In [19]:
t.shape

(2391, 8)

In [34]:
#htesavoie_venues = t

In [20]:
print('There are {} unique categories.'.format(len(htesavoie_venues['Venue Category'].unique())))

There are 223 unique categories.


We have to define the list of relevant venue categories. The category IDs are documented on the Foursquare website *(https://developer.foursquare.com/docs/resources/categories)*.

In [21]:
# sublist with only the categories related to water access
water_access_cat = ['4bf58dd8d48988d193941735','4bf58dd8d48988d105941735','52e81612bcbc57f1066b7a28',
            '56aa371be4b08b9a8d573544','4bf58dd8d48988d1e2941735','52e81612bcbc57f1066b7a12',
            '4bf58dd8d48988d1e0941735','4bf58dd8d48988d160941735','4bf58dd8d48988d161941735',
            '4bf58dd8d48988d15e941735','52e81612bcbc57f1066b7a29','56aa371be4b08b9a8d573541',
            '4eb1d4dd4b900d56c88a45fd','56aa371be4b08b9a8d573560','56aa371be4b08b9a8d5734c3',
            '52e81612bcbc57f1066b7a44','52e81612bcbc57f1066b7a27','58daa1558bbb0b01f18ec1ae',
            '4bf58dd8d48988d1ed941735','56aa371be4b08b9a8d57353e']
# sublist with only the categories related to shops
shops_cat = ['52f2ab2ebcbc57f1066b8b1a','52f2ab2ebcbc57f1066b8b22','4bf58dd8d48988d1f2941735']
# list of all the relevant categories
category = water_access_cat + shops_cat

Let's extract only the relevant venues for our research.

In [22]:
htesavoie_sel_venues = htesavoie_venues[htesavoie_venues['Venue Id'].isin(category)]
print(htesavoie_sel_venues.shape)

(107, 8)


In [23]:
# one hot encoding
htesavoie_onehot = pd.get_dummies(htesavoie_sel_venues[['Venue Category']], prefix="", prefix_sep="")

# add Commune column back to dataframe
htesavoie_onehot['Commune'] = htesavoie_sel_venues['Commune'] 

# move Commune column to the first column
fixed_columns = [htesavoie_onehot.columns[-1]] + list(htesavoie_onehot.columns[:-1])
htesavoie_onehot = htesavoie_onehot[fixed_columns]

print(htesavoie_onehot.shape)
htesavoie_onehot.head()

(107, 13)


Unnamed: 0,Commune,Bay,Beach,Harbor / Marina,Hot Spring,Lake,Pool,Rafting,River,Spa,Sporting Goods Shop,Water Park,Waterfall
49,Annecy,0,1,0,0,0,0,0,0,0,0,0,0
53,Annecy,0,1,0,0,0,0,0,0,0,0,0,0
82,Annecy,0,1,0,0,0,0,0,0,0,0,0,0
134,Ambilly,0,0,0,0,0,0,0,0,0,1,0,0
193,Annemasse,0,0,0,0,0,0,0,0,0,1,0,0


And here we are! Let's have a look at the relevant venues around each city.

In [24]:
communes_table = htesavoie_onehot.groupby('Commune').sum().reset_index()
print(communes_table.shape)
communes_table.head()

(64, 13)


Unnamed: 0,Commune,Bay,Beach,Harbor / Marina,Hot Spring,Lake,Pool,Rafting,River,Spa,Sporting Goods Shop,Water Park,Waterfall
0,Ambilly,0,0,0,0,0,0,0,0,0,1,0,0
1,Annecy,0,3,0,0,0,0,0,0,0,0,0,0
2,Annemasse,0,0,0,0,0,0,0,0,0,1,0,0
3,Anthy-sur-Léman,0,0,0,0,0,0,0,0,0,1,0,0
4,Armoy,0,0,0,0,0,0,1,0,0,0,0,0
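
The one-hot encoding followed by `groupby('Commune').sum()` counts the venues per city and category. Here is a minimal sketch of the same count with `pd.crosstab`, on a few rows taken from the one-hot table above (not the full dataset).

```python
import pandas as pd

# A few (Commune, Venue Category) pairs taken from the one-hot table above
sel_venues = pd.DataFrame({
    'Commune': ['Annecy', 'Annecy', 'Annecy', 'Ambilly', 'Annemasse'],
    'Venue Category': ['Beach', 'Beach', 'Beach',
                       'Sporting Goods Shop', 'Sporting Goods Shop'],
})

# One row per city, one column per category, cells hold the number of venues
counts = pd.crosstab(sel_venues['Commune'], sel_venues['Venue Category'])
counts = counts.reset_index()
counts.columns.name = None
print(counts)
```

Either approach gives one row per city; `crosstab` simply skips the intermediate one-hot frame.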


Let's merge all the data by city (relevant venues, population, latitude and longitude).

In [25]:
df=pd.merge(communes_table,comm74,on='Commune')
df.head()

Unnamed: 0,Commune,Bay,Beach,Harbor / Marina,Hot Spring,Lake,Pool,Rafting,River,Spa,Sporting Goods Shop,Water Park,Waterfall,Population,Latitude,Longitude
0,Ambilly,0,0,0,0,0,0,0,0,0,1,0,0,6385,46.1952,6.2243
1,Annecy,0,3,0,0,0,0,0,0,0,0,0,0,126924,45.900002,6.11667
2,Annemasse,0,0,0,0,0,0,0,0,0,1,0,0,35712,46.200001,6.25
3,Anthy-sur-Léman,0,0,0,0,0,0,0,0,0,1,0,0,2171,46.3553,6.4273
4,Armoy,0,0,0,0,0,0,1,0,0,0,0,0,1295,46.349998,6.51667


Let's now see all the collected data on a map.
Each city with at least one relevant venue is drawn as a blue-filled dot. The outline of the dot is red when a sporting goods shop is nearby and blue otherwise.

In [26]:
# Define Latitude and Longitude of Haute-Savoie
T_lon = 6.3833
T_lat = 46.05

# Create map of Haute-Savoie
map2_htesavoie = folium.Map(location = [T_lat, T_lon], zoom_start=10)

# Add markers to map
for lat, lon, comm, shop in zip(df['Latitude'], df['Longitude'], df['Commune'], df['Sporting Goods Shop']):
    label = '{}'.format(comm)
    label = folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = 'blue' if shop==0 else 'red',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html=False).add_to(map2_htesavoie)

map2_htesavoie

This concludes the data gathering phase. We are now ready to use this data to find an optimal location for a new diving equipment store!

## 3. Methodology and data analysis <a name="methodology"></a>

Now that we have the needed data, let's go back to our criteria to define the optimal locations :
- locations with water access in vicinity
- no sports store already established
- frequented places

### Locations with water access

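Let's first extract from the gathered data the cities with water access and no sporting goods shop nearby. Every city in `df` has at least one relevant venue, so a city without a shop necessarily has water access.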

In [27]:
df_noshop= df[df['Sporting Goods Shop']==0].reset_index(drop=True)

In [28]:
df_noshop.drop({'Sporting Goods Shop'},axis=1,inplace=True)

In [29]:
df_noshop

Unnamed: 0,Commune,Bay,Beach,Harbor / Marina,Hot Spring,Lake,Pool,Rafting,River,Spa,Water Park,Waterfall,Population,Latitude,Longitude
0,Annecy,0,3,0,0,0,0,0,0,0,0,0,126924,45.900002,6.11667
1,Armoy,0,0,0,0,0,0,1,0,0,0,0,1295,46.349998,6.51667
2,Arthaz-Pont-Notre-Dame,0,0,0,0,0,1,0,0,0,0,0,1577,46.150002,6.28333
3,Bluffy,0,1,1,0,0,0,0,0,0,0,0,391,45.866669,6.21667
4,Bonne,0,0,0,0,0,1,0,0,0,0,0,3218,46.1682,6.3215
5,Brenthonne,0,0,0,0,0,0,0,0,1,0,0,1037,46.283329,6.4
6,Brizon,0,0,0,0,0,1,0,0,0,0,0,485,46.049999,6.45
7,Champanges,0,0,0,0,0,0,1,0,0,0,0,1015,46.366669,6.55
8,Chens-sur-Léman,0,0,1,0,0,0,0,0,0,0,0,2776,46.333328,6.26667
9,Choisy,0,0,0,0,1,0,0,0,0,0,0,1604,45.98333,6.05


In [40]:
# HeatMap expects one [latitude, longitude] pair per point, not two separate Series
noshop_latlons = df_noshop[['Latitude','Longitude']].values.tolist()

In [42]:
from folium.plugins import HeatMap

map3_htesavoie = folium.Map(location = [T_lat, T_lon], zoom_start=10)
folium.TileLayer('cartodbpositron').add_to(map3_htesavoie) #cartodbpositron cartodbdark_matter
HeatMap(noshop_latlons).add_to(map3_htesavoie)
map3_htesavoie
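
`HeatMap` takes one row per point, either `[lat, lon]` or `[lat, lon, weight]`. Below is a minimal sketch of building such rows with pandas. It assumes we wanted to weight each city by its population, an illustrative choice that this notebook does not make; the three cities are taken from the table above.

```python
import pandas as pd

cities = pd.DataFrame({
    'Commune': ['Annecy', 'Armoy', 'Bluffy'],
    'Latitude': [45.900002, 46.349998, 45.866669],
    'Longitude': [6.11667, 6.51667, 6.21667],
    'Population': [126924, 1295, 391],
})

# Unweighted points: one [lat, lon] pair per city
points = cities[['Latitude', 'Longitude']].values.tolist()

# Weighted points: population divided by its maximum, so the largest city gets 1.0
weights = cities['Population'] / cities['Population'].max()
weighted_points = [[lat, lon, w] for lat, lon, w
                   in zip(cities['Latitude'], cities['Longitude'], weights)]

print(points[0])
print(weighted_points[0])
# With folium installed: HeatMap(weighted_points).add_to(some_map)
```

Passing two whole Series instead of such rows is what triggers the "cannot convert the series" error.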

In [36]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)

llon = 5.5
ulon = 7.5
llat = 45
ulat = 47

my_map = Basemap (projection='merc',
                  resolution = 'l', area_thresh = 1000.0,
                  llcrnrlon = llon, llcrnrlat = llat,
                  urcrnrlon = ulon, urcrnrlat = ulat)

my_map.drawcoastlines()
my_map.drawcountries()
my_map.fillcontinents(color='white', alpha=0.3)
my_map.shaderelief()

# To collect data based on relevant cities with water access
xs,ys = my_map(np.asarray(df_noshop.Longitude),np.asarray(df_noshop.Latitude))
df_noshop['xm']=xs.tolist()
df_noshop['ym']=ys.tolist()

# Visualization
for index,row in df_noshop.iterrows():
    my_map.plot(row.xm,row.ym,markerfacecolor=([1,0,0]),marker='o',markersize=5,alpha=0.75)
plt.show()

AttributeError: 'property' object has no attribute '__name__'

In [196]:
df_norm=df_noshop
df_norm =df_norm.drop(['Latitude','Longitude'],axis=1)
#df_norm['Population']=df_norm['Population']/df_norm['Population'].max()
#df_norm['Beach']=df_norm['Beach']/df_norm['Beach'].max()
#df_norm['Dive Spot']=df_norm['Dive Spot']/df_norm['Dive Spot'].max()
#df_norm['Harbor / Marina']=df_norm['Harbor / Marina']/df_norm['Harbor / Marina'].max()
#df_norm['Hot Spring']=df_norm['Hot Spring']/df_norm['Hot Spring'].max()
#df_norm['Lake']=df_norm['Lake']/df_norm['Lake'].max()
#df_norm['Pool']=df_norm['Pool']/df_norm['Pool'].max()
#df_norm['Rafting']=df_norm['Rafting']/df_norm['Rafting'].max()
#df_norm['River']=df_norm['River']/df_norm['River'].max()
#df_norm['Spa']=df_norm['Spa']/df_norm['Spa'].max()
#df_norm['Water Park']=df_norm['Water Park']/df_norm['Water Park'].max()
#df_norm['Waterfall']=df_norm['Waterfall']/df_norm['Waterfall'].max()
#df_norm['Waterfront']=df_norm['Waterfront']/df_norm['Waterfront'].max()

In [197]:
filt_norm=df_norm
filt_norm = filt_norm.drop(['Population'],axis=1)

In [198]:
# Group the venue categories into three types of water access.
# Categories missing from the current venue extraction are counted as 0;
# categories not listed here (e.g. 'Bay') are ignored.
groups = {'Open water': ['Beach','Lake','Waterfront','Dive Spot','Harbor / Marina'],
          'Confined water': ['Pool','Spa','Hot Spring','Water Park'],
          'Rapids': ['Rafting','Waterfall','River']}
grouped = filt_norm[['Commune']].copy()
for g, cols in groups.items():
    total = filt_norm.reindex(columns=cols, fill_value=0).sum(axis=1)
    # Scale each group by its maximum so that all values lie between 0 and 1
    grouped[g] = total/total.max()
filt_norm = grouped
filt_norm.head()

Unnamed: 0,Commune,Open water,Confined water,Rapids
0,Annecy,0.5,0.0,0.0
1,Armoy,0.0,0.0,1.0
2,Bluffy,0.333333,0.0,0.0
3,Brenthonne,0.0,1.0,0.0
4,Chainaz-les-Frasses,0.0,1.0,0.0


In [147]:
from sklearn.cluster import KMeans

In [199]:
# set number of clusters
kclusters = 3

htesavoie_clustering = filt_norm.drop(columns='Commune')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(htesavoie_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 0, 1, 1, 1, 0, 0, 0, 0], dtype=int32)

In [200]:
# add clustering labels
filt_norm.insert(0, 'Cluster Labels', kmeans.labels_)
df_noshop.insert(0, 'Cluster Labels', kmeans.labels_)
filt_norm.head()

Unnamed: 0,Cluster Labels,Commune,Open water,Confined water,Rapids
0,0,Annecy,0.5,0.0,0.0
1,2,Armoy,0.0,0.0,1.0
2,0,Bluffy,0.333333,0.0,0.0
3,1,Brenthonne,0.0,1.0,0.0
4,1,Chainaz-les-Frasses,0.0,1.0,0.0


In [201]:
# create map
map_clusters = folium.Map(location=[T_lat, T_lon], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, comm, cluster in zip(df_noshop['Latitude'], df_noshop['Longitude'], df_noshop['Commune'], df_noshop['Cluster Labels']):
    label = folium.Popup(str(comm) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [202]:
filt_norm.loc[filt_norm['Cluster Labels']==0,filt_norm.columns[[1]+list(range(2,filt_norm.shape[1]))]]

Unnamed: 0,Commune,Open water,Confined water,Rapids
0,Annecy,0.5,0.0,0.0
2,Bluffy,0.333333,0.0,0.0
6,Chens-sur-Léman,0.166667,0.0,0.0
7,Choisy,0.166667,0.0,0.0
8,Chêne-en-Semine,0.166667,0.0,0.0
9,Clarafond-Arcine,0.166667,0.0,0.0
10,Combloux,0.333333,0.0,0.0
13,Domancy,0.166667,0.0,0.0
16,Excenevex,0.166667,0.0,0.0
19,La Tour,0.166667,0.0,0.0


In [203]:
filt_norm.loc[filt_norm['Cluster Labels']==1,filt_norm.columns[[1]+list(range(2,filt_norm.shape[1]))]]

Unnamed: 0,Commune,Open water,Confined water,Rapids
3,Brenthonne,0.0,1.0,0.0
4,Chainaz-les-Frasses,0.0,1.0,0.0
5,Champanges,0.0,1.0,1.0
11,Cranves-Sales,0.0,1.0,0.0
12,Cusy,0.0,1.0,0.0
15,Etaux,0.0,1.0,0.0
27,Neuvecelle,0.166667,1.0,0.0
28,Neydens,0.0,1.0,0.0
43,Évian-les-Bains,0.166667,1.0,0.0


In [204]:
filt_norm.loc[filt_norm['Cluster Labels']==2,filt_norm.columns[[1]+list(range(2,filt_norm.shape[1]))]]

Unnamed: 0,Commune,Open water,Confined water,Rapids
1,Armoy,0.0,0.0,1.0
14,Duingt,0.833333,0.0,1.0
17,La Balme-de-Thuy,0.0,0.0,1.0
18,La Forclaz,0.0,0.0,1.0
20,La Vernaz,0.166667,0.0,1.0
22,Lyaud,0.0,0.0,1.0
23,Menthon-Saint-Bernard,1.0,0.0,1.0
29,Reyvroz,0.0,0.0,1.0
35,Talloires-Montmin,0.333333,0.0,1.0
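
The three listings suggest that each cluster is dominated by one type of water access. Here is a minimal sketch of how this could be summarised with a `groupby`, using only the five rows shown in the labelled table above (not the full dataset).

```python
import pandas as pd

# First five rows of the labelled table above
labelled = pd.DataFrame({
    'Cluster Labels': [0, 2, 0, 1, 1],
    'Commune': ['Annecy', 'Armoy', 'Bluffy', 'Brenthonne', 'Chainaz-les-Frasses'],
    'Open water': [0.5, 0.0, 0.333333, 0.0, 0.0],
    'Confined water': [0.0, 0.0, 0.0, 1.0, 1.0],
    'Rapids': [0.0, 1.0, 0.0, 0.0, 0.0],
})

features = ['Open water', 'Confined water', 'Rapids']
# Mean of each feature per cluster
profile = labelled.groupby('Cluster Labels')[features].mean()
# The feature with the highest mean characterises the cluster
dominant = profile.idxmax(axis=1)
print(profile)
print(dominant)
```

Run on the full `filt_norm` table, the same two lines would give a compact profile of each cluster.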


In [186]:
filt_norm.loc[filt_norm['Cluster Labels']==3,filt_norm.columns[[1]+list(range(2,filt_norm.shape[1]))]]

Unnamed: 0,Commune,Open water,Confined water,Rapids


In [133]:
df_norm.loc[df_norm['Cluster Labels']==4,df_norm.columns[[1]+list(range(2,df_norm.shape[1]))]]

Unnamed: 0,Commune,Beach,Dive Spot,Harbor / Marina,Hot Spring,Lake,Pool,Rafting,River,Spa,Water Park,Waterfall,Waterfront,Population
14,Duingt,1.0,0.0,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.007658
17,La Balme-de-Thuy,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.00364
23,Menthon-Saint-Bernard,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.014883
35,Talloires-Montmin,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.015726


## 4. Results <a name="results"></a>

## 5. Discussion <a name="discussion"></a>

## 6. Conclusion <a name="conclusion"></a>