# Capstone Project - Dry Bars
### Applied Data Science Capstone by IBM/Coursera

## Table of Contents
1. [Introduction](#introduction)
2. [Data](#data)
3. [Methodology](#methodology)
4. [Extract specific data](#analysis)
5. [Machine Learning and Results](#results)

---
## 1. Introduction: Business Problem<a name="introduction"></a>
---

With the objective of studying possible locations for the opening of a business that has not yet been explored in Brazil, this project brings the concept of the **Dry Bar**, with a proposal to identify suitable places that received this type of establishment, it requires a future extrapolation to other cities.

The **Dry Bar** concept emerged in California in 2010 and is based on offering fast beauty salon services, at prices that promise to be more affordable than in traditional spaces. Due to the facilities, agility and inviting prices, they have already become a common presence in cities such as New York and Los Angeles.

For the development of the project, the choice was to map a region of New York City, seeking to understand elements of geolocation and their relevance, for the installation of the Dry Bar. It is considered that this is a service dependent on the flow of people from the female, with little time available and profile that values good presentation in the work environment. These are determining factors for the success of this business model.


- Project's goal:

Find Dry Bars in the observation region, analyzing its surroundings from the study of clusters.
One way found to complement this study was to analyze the evaluations given by customers, considering that, in addition to the quality of service, the ease of access, an element related to location, is implicit. Thus, the objective is to try to understand the relationship of location of these groups and to seek similarities with the central region of the City of Rio de Janeiro.

The starting point was the mapping of the Midtown region, on Manhattan Island, which in addition to being one of the main tourist spots in New York City, is also home to numerous large companies, where, due to these characteristics, the profile of the city is concentrated. public for this type of business; women with lots of activities and little time for traditional beauty salons.

Based on these criteria, Foursquare will be used as a source of data for these areas. The ‘k-means clustering’ unsupervised learning algorithm will provide clarification of the advantages of each area, so that the best possible location is defined, for future referrals to those interested in implementing this business in downtown Rio de Janeiro.


<a name="data"></a>[Scroll Back to Table of Contents](#tableofContents)

---
## 2. Data <a name="data"></a>
---

Based on the problem definition, the data sources were:

- **List with the location data of the Manhattan district**

For this, the Json file of the Coursera IBM Certification was used, which provides location data for the 5 districts of New YORK with its 306 neighborhoods.

    As a project it is NOT intended to explore all these places, just one of interest, a filter was made selecting the district of Manhattan and saved in a csv file that is on GitHub.(manhattan.csv)
    
In addition, GitHub has a notebook with only this part of the process: Extract, filter and save these neighborhoods.

- **API do Foursquare:**

 Link: <https://developer.foursquare.com/docs>
    
Description: Foursquare API, a location data provider, will be used to make RESTful API calls to retrieve data about venues in different neighborhoods. This is the link to [Foursquare Venue Category Hierarchy](https://developer.foursquare.com/docs/resources/categories). Venues retrieved from all the neighborhoods are categorized broadly into "Arts & Entertainment", "College & University", "Event", "Food", "Nightlife Spot", "Outdoors & Recreation", etc. An extract of an API call is as follows:
```
	'categories': [{'id': '4bf58dd8d48988d110941735',
	   'name': 'Italian Restaurant',
	   'pluralName': 'Italian Restaurants',
	   'shortName': 'Italian',
	   'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/italian_',
	   'suffix': '.png'},
	   'primary': True}],
	'verified': False,
	'stats': {'tipCount': 17},
	'url': 'http://eccorestaurantny.com',
	'price': {'tier': 4, 'message': 'Very Expensive', 'currency'
```


<a name="methodology">[Scroll Back to Table of Contents](#tableofcontents)

---
## 3. Methodology<a name="methodology"></a>
---

> **3.1**  Download all the dependencies.

> **3.2**  Load location data for Manhattan neighborhoods.

> **3.3**  Use the geopy library to find the latitude and longitude of the Manhattan district.

> **3.4**  From this location use a radius of 9500 and search for all Drybars.
This query is very important because the project deals with a specific business and not any beauty salon, as it is a relatively new trend it is likely that they will have few returns.

> **3.5**  View these establishments.


- 3.1 Download all the dependencies.

In [69]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis

pd.set_option('display.max_columns', None)     
pd.set_option('display.max_rows', None)        

import json # library to handle JSON files

#from pprint import pprint # data pretty printer    retirar

import requests # library to handle requests

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import folium # map rendering library

import matplotlib.cm as cm # Matplotlib and associated plotting modules

import matplotlib.colors as colors # Matplotlib and associated plotting modules

from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

#from collections import Counter # count occurrences         retirar

from sklearn.cluster import KMeans # import k-means from clustering stage

import warnings
warnings.filterwarnings('ignore')

- 3.2 Loading the file  **Manhattan.csv** , which is hosted on GitHub to find out the latitude and longitude of each neighborhood

In [19]:
manhattan_neighborhoods=pd.read_csv("manhattan.csv")

manhattan_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


In [20]:
print('The Manhattan has {} neighborhoods.'.format(len(manhattan_neighborhoods['Neighborhood'].unique())))

The Manhattan has 40 neighborhoods.


- 3.3 Using a geopy library to obtain the latitude and longitude values for the city of Manhattan.

In [21]:
address = 'Manhattan, NY'
location = None

# define an instance of the geocoder -> ny_explorer

while location == None:
    try:
        geolocator = Nominatim(user_agent="ny_explorer")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
    except:
        pass
print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Manhattan are 40.7896239, -73.9598939.


- 3.4 Using a Foursquare API to search for Manhattan Drybars.

In [29]:
CLIENT_ID = 'P5MNEPLK4OXEDGPBP45SWSLTU1TY4R1DU4ZGMFBBOZCP23MO'# your Foursquare ID
CLIENT_SECRET = '3UTVUFOEVCRFZU5RGBMOSR4AFQG4BB4XXVN0P1NQMQNEFK2H'# your Foursquare Secret
ACCESS_TOKEN = "L5FWYDVVHZXJ4GKNOAD45X22X5MVYZJIFJHE1YJXFGDFB4YN"# your FourSquare Access Token
VERSION = '20180604'

In [46]:
search_query = 'Drybar'
radius = 9500
LIMIT = 100
print(search_query + ' .... OK!')

Drybar .... OK!


In [47]:
url_drybar = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    latitude, 
    longitude,
    ACCESS_TOKEN, 
    VERSION, 
    search_query, 
    radius,
    LIMIT)

url_drybar

'https://api.foursquare.com/v2/venues/search?client_id=P5MNEPLK4OXEDGPBP45SWSLTU1TY4R1DU4ZGMFBBOZCP23MO&client_secret=3UTVUFOEVCRFZU5RGBMOSR4AFQG4BB4XXVN0P1NQMQNEFK2H&ll=40.7896239,-73.9598939&oauth_token=L5FWYDVVHZXJ4GKNOAD45X22X5MVYZJIFJHE1YJXFGDFB4YN&v=20180604&query=Drybar&radius=9500&limit=100'

In [48]:
# make the GET request

results_drybars= requests.get(url_drybar).json()

venues = results_drybars['response']['venues']

In [49]:
#Transform the json result into a dataframe

df_drybar = json_normalize(venues)

df_drybar.head()

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.crossStreet,location.lat,location.lng,location.labeledLatLngs,...,location.country,location.formattedAddress,venuePage.id,location.neighborhood,delivery.id,delivery.url,delivery.provider.name,delivery.provider.icon.prefix,delivery.provider.icon.sizes,delivery.provider.icon.name
0,5345ddce498e374167fa7173,DryBar,"[{'id': '4bf58dd8d48988d110951735', 'name': 'S...",v-1619113154,False,141 E 56th St,56th Street And Lexington Avenue,40.760087,-73.969126,"[{'label': 'display', 'lat': 40.76008684544023...",...,United States,[141 E 56th St (56th Street And Lexington Aven...,,,,,,,,
1,4f2064f8e4b0a00cf1d3c3c1,Drybar,"[{'id': '4bf58dd8d48988d110951735', 'name': 'S...",v-1619113154,False,119 W 56th St,btwn 6th & 7th Ave,40.76411,-73.978769,"[{'label': 'display', 'lat': 40.76410996501346...",...,United States,"[119 W 56th St (btwn 6th & 7th Ave), New York,...",,,,,,,,
2,4e6f679dae604d1b459a5154,DryBar,"[{'id': '4bf58dd8d48988d110951735', 'name': 'S...",v-1619113154,False,4 W 16th St,at 5th Ave,40.737554,-73.993193,"[{'label': 'display', 'lat': 40.73755376258029...",...,United States,"[4 W 16th St (at 5th Ave), New York, NY 10011,...",77545546.0,,,,,,,
3,50df63efe4b0454c5e8f655d,Drybar,"[{'id': '4bf58dd8d48988d110951735', 'name': 'S...",v-1619113154,False,180 W Broadway,,40.7181,-74.007092,"[{'label': 'display', 'lat': 40.71809983636473...",...,United States,"[180 W Broadway, New York, NY 10013, United St...",,,,,,,,
4,5085a8b5e4b0ca321fdcea32,Dry Bar,"[{'id': '4bf58dd8d48988d110951735', 'name': 'S...",v-1619113154,False,209 E 76th St,3rd Ave.,40.772205,-73.958315,"[{'label': 'display', 'lat': 40.77220496421525...",...,United States,"[209 E 76th St (3rd Ave.), New York, NY 10021,...",,,,,,,,


Check how many **Drybar** returned

In [50]:
df_drybar.shape

(41, 25)

There is some important information within the category variable, we are going to open and build a dataframe with the columns that include the name of the place the category that includes the Drybar and the location of each one of them.

In [53]:
filtered_columns = ['name', 'categories'] + [col for col in df_drybar.columns if col.startswith('location.')] + ['id']

dataframe_filtered=df_drybar.loc[:, filtered_columns]

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

    
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# limpe os nomes das colunas mantendo apenas o último termo

dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered.head(2)

Unnamed: 0,name,categories,address,crossStreet,lat,lng,labeledLatLngs,distance,postalCode,cc,city,state,country,formattedAddress,neighborhood,id
0,DryBar,Salon / Barbershop,141 E 56th St,56th Street And Lexington Avenue,40.760087,-73.969126,"[{'label': 'display', 'lat': 40.76008684544023...",3378,10022,US,New York,NY,United States,[141 E 56th St (56th Street And Lexington Aven...,,5345ddce498e374167fa7173
1,Drybar,Salon / Barbershop,119 W 56th St,btwn 6th & 7th Ave,40.76411,-73.978769,"[{'label': 'display', 'lat': 40.76410996501346...",3255,10019,US,New York,NY,United States,"[119 W 56th St (btwn 6th & 7th Ave), New York,...",,4f2064f8e4b0a00cf1d3c3c1


In [55]:
#Select the columns that matter

df_drybar_manhattan=dataframe_filtered[['id','name', 'categories', 'lat', 'lng','distance']]
df_drybar_manhattan.head()

Unnamed: 0,id,name,categories,lat,lng,distance
0,5345ddce498e374167fa7173,DryBar,Salon / Barbershop,40.760087,-73.969126,3378
1,4f2064f8e4b0a00cf1d3c3c1,Drybar,Salon / Barbershop,40.76411,-73.978769,3255
2,4e6f679dae604d1b459a5154,DryBar,Salon / Barbershop,40.737554,-73.993193,6440
3,50df63efe4b0454c5e8f655d,Drybar,Salon / Barbershop,40.7181,-74.007092,8901
4,5085a8b5e4b0ca321fdcea32,Dry Bar,Salon / Barbershop,40.772205,-73.958315,1943


In [85]:
df_drybar_manhattan.to_csv("df_drybar.csv" ,index= False)

- 3.5 View these establishments

In this view we can see the initial coordinate in red and the establishments in blue.

In [82]:
# Generate map centered around Manhattan New York

Manhattan_drybar_Map= folium.Map(location=[latitude, longitude], zoom_start=12)  

# add a red circle marker to represent the Midtown Manhattan

folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='Manhattan',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(Manhattan_drybar_Map)


# add the Dry Bars as blue circle markers


for lat, lng, label in zip(df_drybar_manhattan.lat, df_drybar_manhattan.lng, df_drybar_manhattan.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(Manhattan_drybar_Map)

# display map


Manhattan_drybar_Map

### Conclusão Parcial

<a name="analysis"> [Scroll Back to Table of Contents](#tableofcontents)

---

## 4. Extract specific data<a name="#analysis"></a>
---

As part of the process, the next step is to seek evaluations given by customers, such as grades and likes, taking into account, implicitly, the quality of service and ease of access, with the intention of justifying the hypothesis that the most frequented and tanned establishments are at strategic points.

With Foursquare and later with the K-means algorithm, these strategic points can be known.

This step requires access to the more specific Fousquare, as information such as rating and likes is necessary to use the establishment's ID, so for each note or number of likes, a “request” is made, which decreases your daily quota of 50 accesses or requests.

In [86]:
df_rating=pd.read_csv("teste.csv",encoding='iso-8859-1',delimiter =';')
df_rating.head()

Unnamed: 0,id,name,rating,like,text
0,5345ddce498e374167fa7173,DryBar,8.4,62,My favorite Dry Bar! The stylists know how to ...
1,4f2064f8e4b0a00cf1d3c3c1,DryBar,8.0,62,The only place where mixing a Manhattan and a ...
2,4e6f679dae604d1b459a5154,DryBar,8.5,142,Uncharged? No worries! You can charge your iPh...
3,50df63efe4b0454c5e8f655d,Drybar,8.9,72,If you want big curls & big volume (The Southe...
4,5085a8b5e4b0ca321fdcea32,Dry Bar,9.2,89,Free champagne and chick flicks. Oh and a blow...


In [88]:
df_rating.shape

(41, 5)

In [87]:
df_drybar_manhattan.head()

Unnamed: 0,id,name,categories,lat,lng,distance
0,5345ddce498e374167fa7173,DryBar,Salon / Barbershop,40.760087,-73.969126,3378
1,4f2064f8e4b0a00cf1d3c3c1,Drybar,Salon / Barbershop,40.76411,-73.978769,3255
2,4e6f679dae604d1b459a5154,DryBar,Salon / Barbershop,40.737554,-73.993193,6440
3,50df63efe4b0454c5e8f655d,Drybar,Salon / Barbershop,40.7181,-74.007092,8901
4,5085a8b5e4b0ca321fdcea32,Dry Bar,Salon / Barbershop,40.772205,-73.958315,1943


In [89]:
df_drybar_manhattan.shape

(41, 6)

In [92]:
df3=pd.merge( df_drybar_manhattan,df_rating, on=["id"], how="inner")
df3.head()

Unnamed: 0,id,name,categories,lat,lng,distance,name.1,rating,like,text
0,5345ddce498e374167fa7173,DryBar,Salon / Barbershop,40.760087,-73.969126,3378,DryBar,8.4,62,My favorite Dry Bar! The stylists know how to ...
1,4f2064f8e4b0a00cf1d3c3c1,Drybar,Salon / Barbershop,40.76411,-73.978769,3255,DryBar,8.0,62,The only place where mixing a Manhattan and a ...
2,4e6f679dae604d1b459a5154,DryBar,Salon / Barbershop,40.737554,-73.993193,6440,DryBar,8.5,142,Uncharged? No worries! You can charge your iPh...
3,50df63efe4b0454c5e8f655d,Drybar,Salon / Barbershop,40.7181,-74.007092,8901,Drybar,8.9,72,If you want big curls & big volume (The Southe...
4,5085a8b5e4b0ca321fdcea32,Dry Bar,Salon / Barbershop,40.772205,-73.958315,1943,Dry Bar,9.2,89,Free champagne and chick flicks. Oh and a blow...


---
## 5. Machine Learning and Results<a name="#results"></a>
---