<a href="https://colab.research.google.com/github/Dehyzz/IBM_Capstone/blob/master/The_Battle_of_Neighborhoods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Capstone Project - The Battle of Neighborhoods

## Week 1: Report beginning

### 1. Description of the problem 
Choosing the right location for a new property, especially a hotel, is always a big challenge. As D. Trump, a luxury property magnate, once said: "When considering building a new property, there are three major factors that you always have to consider: Location, Location and, of course, Location!"

This data science project aims to find the best location for a new hotel in the $$$ price category in Dublin, Ireland. To do so, we will apply machine learning algorithms to Foursquare location data.

As a measure of location quality we will use the venue rating on Foursquare. Factors known to influence location quality include proximity to restaurants, museums, parks, monuments, shops and other points of interest (POI).

---

### 2. Description of the data and how it will be used to solve the problem

Foursquare will be the provider of location data. Below is the high-level process for preparing the data.

1. The Foursquare API call "search/ + keyword=hotel" will be used to find all existing hotels in Dublin. The results are then filtered to keep only hotels in the $$$ price category and saved to a pandas DataFrame.

2. For every hotel in the DataFrame, the rating will be retrieved from Foursquare with the "venue/ + id" API call.

3. For all the hotels, we add (via the "explore/" API call) nearby POI in selected categories:

    *   Restaurants
    *   Museums
    *   Parks
    *   Monuments
    *   Shops

4. Then we filter out some venues that have less impact on the rating of hotels in the high price category, keeping the following:
    *   Restaurants (only with rating>4 and price_category>$$)
    *   Shops (only with price_category>$$)

5. Then we will merge all the data into a DataFrame (one hotel = one row) and predict, for every Dublin postcode, the rating of a hypothetical hotel built at that location.
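
Steps 3 to 5 can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the helper names `haversine_m` and `hotel_feature_rows`, the 500 m radius and the sample coordinates are all assumptions made for the example. It counts the POI of each category within a given distance of each hotel and produces one row per hotel.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def hotel_feature_rows(hotels, pois, radius_m=500):
    """Return one dict per hotel with the count of nearby POI per category."""
    categories = sorted({p["category"] for p in pois})
    rows = []
    for h in hotels:
        row = {"name": h["name"], "rating": h["rating"]}
        row.update({c: 0 for c in categories})
        for p in pois:
            if haversine_m(h["lat"], h["lng"], p["lat"], p["lng"]) <= radius_m:
                row[p["category"]] += 1
        rows.append(row)
    return rows


# Made-up sample data, for illustration only
hotels = [{"name": "Hotel A", "lat": 53.3498, "lng": -6.2603, "rating": 8.5}]
pois = [
    {"category": "Restaurant", "lat": 53.3500, "lng": -6.2600},  # a few tens of meters away
    {"category": "Museum", "lat": 53.3400, "lng": -6.2500},      # more than 1 km away
]
rows = hotel_feature_rows(hotels, pois)
```

Each resulting row could then be loaded into a DataFrame with `pd.DataFrame(rows)`, giving the "one hotel = one row" layout described in step 5.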

---

## Week 2:

### 1. Report
* Introduction with the business problem and interested parties

* Data and its source

* Methodology section, which represents the main component of the report, where you discuss and describe exploratory data analysis, inferential statistical testing, if any, and what machine learning techniques were used and why.

* Results description

* Discussion section where you discuss any observations you noted and any recommendations you can make based on the results

* Conclusion section where you conclude the report

### 2. A link to your Notebook

### 3. Presentation or blogpost


### Example

* Report: https://cocl.us/coursera_capstone_report
* Notebook: https://cocl.us/coursera_capstone_notebook
* Presentation: https://cocl.us/coursera_capstone_presentation
* Blogpost: https://cocl.us/coursera_capstone_blogpost

# Importing Libs & G.Drive

In [None]:
### Importing Libraries
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from pandas import json_normalize # transforming JSON into a pandas DataFrame (pandas >= 1.0)
import folium # plotting library

# Mounting Gdrive in Notebook
from google.colab import drive
drive.mount('/content/drive/')

# libraries for displaying images
# from IPython.display import Image 
# from IPython.core.display import HTML 

In [None]:
### Foursquare Auth data
CLIENT_ID = '1Z1J1FGK4R5WD2RSEUC3P011PLPLUP02NJGM2W1OF3UYXJ2M' # your Foursquare ID
CLIENT_SECRET = 'LTVB0HMSJU4SPBTD4ZWGCFBUTSGOQ4F3L5VZ0YNAAZZ45FDT' # your Foursquare Secret
VERSION = '20200531'

### Center location - by default
address = 'London, UK'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# Preparing Data

1. Getting the list of hotels in London (NAME, LATITUDE, LONGITUDE, CATEGORY, RATING)

In [27]:
### Getting hotels list (by 'Hotel/Inn' in category)
# https://developer.foursquare.com/docs/build-with-foursquare/categories/

hotels_cat = {
              'inn': '5bae9231bedf3950379f89cb',
              'hotel': '4bf58dd8d48988d1fa931735'
             }

radius = 40*1000 # 40 km, expressed in meters
LIMIT = 1000
intent='browse'
search_query = ''
url_search = 'https://api.foursquare.com/v2/venues/search?'


# function that gets the hotel list of the area by category
def get_hotel_list(category):
  # request ('intent' is passed as a named query parameter)
  url = '{}client_id={}&client_secret={}&ll={},{}&intent={}&v={}&query={}&radius={}&limit={}&categoryId={}'\
      .format(url_search, CLIENT_ID, CLIENT_SECRET, latitude, longitude, intent, VERSION, search_query, radius, LIMIT, category)
  results = requests.get(url).json()

  # transform results into a dataframe
  df = json_normalize(results['response']['venues'])
  filtered_columns = ['id', 'name', 'location.city', 'location.lat', 'location.lng'] + ['categories']
  hotels = df.loc[:, filtered_columns].copy()  # copy to avoid SettingWithCopyWarning

  # Unpack category id from 'categories' (None when the list is empty)
  hotels['category_id'] = [d[0].get('id') if d else None for d in hotels.categories]
  return hotels

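# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
frames = []
for value in hotels_cat.values():
  hotels = get_hotel_list(value)
  frames.append(hotels)
hotels_df = pd.concat(frames, ignore_index=True)
# NOTE: the cells below keep working on `hotels`, i.e. the result for the last category only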

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


# get category name for each row
hotels['category'] = hotels.apply(get_category_type, axis=1)
hotels.drop('categories', axis=1, inplace=True)
hotels.info()
hotels.head(6)




Unnamed: 0,id,name,location.city,location.lat,location.lng,category_id,category
0,4d274945d86aa09024d41dc0,Corinthia Hotel,London,51.506607,-0.12446,4bf58dd8d48988d1fa931735,Hotel
1,4bb44578643cd13ab492395c,The Ritz London,City of London,51.507078,-0.141627,4bf58dd8d48988d1fa931735,Hotel
2,59dfd7c76eda02332950f453,Premier Inn London Woolwich (Royal Arsenal) hotel,Woolwich,51.492892,0.067341,4bf58dd8d48988d1fa931735,Hotel
3,59045110d1a4026a246ac71e,The House Of Toby London,London,51.528755,-0.118853,4bf58dd8d48988d1fa931735,Hotel
4,5d6d66792127810007da7e7a,The Hoxton Southwark,London,51.505676,-0.104648,4bf58dd8d48988d1fa931735,Hotel
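
The request URL in `get_hotel_list` is assembled by string formatting. A minimal sketch of an alternative, assuming the same v2 endpoint and the same parameter names used in the cell above, builds the query string from a dictionary so that every value is explicitly named and URL-encoded. The function `build_search_url`, its default values and the placeholder credentials are illustrative and not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(client_id, client_secret, lat, lng, category_id,
                     radius_m=40_000, limit=50, version="20200531"):
    """Build a venues/search URL with every parameter named and URL-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "intent": "browse",
        "radius": radius_m,      # meters
        "limit": limit,          # placeholder default, an assumption
        "categoryId": category_id,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Placeholder credentials and approximate central-London coordinates
url = build_search_url("MY_ID", "MY_SECRET", 51.5074, -0.1278,
                       "4bf58dd8d48988d1fa931735")
```

The same dictionary could also be handed to `requests.get(BASE_URL, params=params)`, which performs the encoding itself.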


In [28]:
hotels.head(400)

Unnamed: 0,id,name,location.city,location.lat,location.lng,category_id,category
0,4d274945d86aa09024d41dc0,Corinthia Hotel,London,51.506607,-0.12446,4bf58dd8d48988d1fa931735,Hotel
1,4bb44578643cd13ab492395c,The Ritz London,City of London,51.507078,-0.141627,4bf58dd8d48988d1fa931735,Hotel
2,59dfd7c76eda02332950f453,Premier Inn London Woolwich (Royal Arsenal) hotel,Woolwich,51.492892,0.067341,4bf58dd8d48988d1fa931735,Hotel
3,59045110d1a4026a246ac71e,The House Of Toby London,London,51.528755,-0.118853,4bf58dd8d48988d1fa931735,Hotel
4,5d6d66792127810007da7e7a,The Hoxton Southwark,London,51.505676,-0.104648,4bf58dd8d48988d1fa931735,Hotel
5,4c4b7418c668e21e170c11fa,Balham Lodge,,51.436407,-0.145153,4bf58dd8d48988d1fa931735,Hotel
6,56474be5498e2673f4dd7f3c,Travelodge,Hackney Central,51.547501,-0.056074,4bf58dd8d48988d1fa931735,Hotel
7,5dd2a07d0855370008306619,Treehouse Hotel London,London,51.517934,-0.142849,4bf58dd8d48988d1fa931735,Hotel
8,5acbd085485709062d17c9fd,Chilworth London Paddington,London,51.515265,-0.178308,4bf58dd8d48988d1fa931735,Hotel
9,4bc87634dc55eee17d78e8ac,Upton Park Hotel Slough,Slough,51.50343,-0.596619,4bf58dd8d48988d1f8931735,Bed & Breakfast


In [29]:
# url = '{}client_id={}&client_secret={}&ll={},{}&{}&v={}&query={}&radius={}&limit={}&categoryId={}'\
#       .format(url_search, CLIENT_ID, CLIENT_SECRET, latitude, longitude, intent, VERSION, search_query, radius, LIMIT, category)
# results = requests.get(url).json()

# # tranform results into dataframe
# df = json_normalize(results['response']['venues'][0])
# df.head()

In [None]:
# venue_id = '5b70c70064c8e1002ca8ef08'
# url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
# result = requests.get(url).json()
# result

In [30]:
# Get rating for every hotel
def get_rating(row):
    venue_id = row['id']
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(url).json()
    try:
        result = result['response']['venue']['rating']
    except (KeyError, TypeError):
        # venue has no rating, or the response has an unexpected shape
        result = 'No rating'
    
    return result

hotels['rating'] = hotels.apply(get_rating, axis=1)
hotels.head(100)

Unnamed: 0,id,name,location.city,location.lat,location.lng,category_id,category,rating
0,4d274945d86aa09024d41dc0,Corinthia Hotel,London,51.506607,-0.12446,4bf58dd8d48988d1fa931735,Hotel,9.4
1,4bb44578643cd13ab492395c,The Ritz London,City of London,51.507078,-0.141627,4bf58dd8d48988d1fa931735,Hotel,9
2,59dfd7c76eda02332950f453,Premier Inn London Woolwich (Royal Arsenal) hotel,Woolwich,51.492892,0.067341,4bf58dd8d48988d1fa931735,Hotel,6.4
3,59045110d1a4026a246ac71e,The House Of Toby London,London,51.528755,-0.118853,4bf58dd8d48988d1fa931735,Hotel,7.5
4,5d6d66792127810007da7e7a,The Hoxton Southwark,London,51.505676,-0.104648,4bf58dd8d48988d1fa931735,Hotel,8.4
5,4c4b7418c668e21e170c11fa,Balham Lodge,,51.436407,-0.145153,4bf58dd8d48988d1fa931735,Hotel,No rating
6,56474be5498e2673f4dd7f3c,Travelodge,Hackney Central,51.547501,-0.056074,4bf58dd8d48988d1fa931735,Hotel,6.1
7,5dd2a07d0855370008306619,Treehouse Hotel London,London,51.517934,-0.142849,4bf58dd8d48988d1fa931735,Hotel,No rating
8,5acbd085485709062d17c9fd,Chilworth London Paddington,London,51.515265,-0.178308,4bf58dd8d48988d1fa931735,Hotel,No rating
9,4bc87634dc55eee17d78e8ac,Upton Park Hotel Slough,Slough,51.50343,-0.596619,4bf58dd8d48988d1f8931735,Bed & Breakfast,No rating
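
As the output shows, venues without a rating receive the string `No rating`, so the `rating` column ends up mixing numbers and text. A hedged sketch of a variant that returns `None` instead is given below. The helper `extract_rating` and the hand-written payloads are assumptions for illustration; the payloads only mimic the `response -> venue -> rating` path that `get_rating` reads.

```python
def extract_rating(payload):
    """Return the venue rating as a float, or None when it is absent."""
    try:
        return float(payload["response"]["venue"]["rating"])
    except (KeyError, TypeError, ValueError):
        return None


# Hand-written payloads mimicking the shape accessed in get_rating
rated = {"response": {"venue": {"id": "abc", "rating": 9.4}}}
unrated = {"response": {"venue": {"id": "def"}}}
error = {"meta": {"code": 429}, "response": {}}

ratings = [extract_rating(p) for p in (rated, unrated, error)]
```

If this variant were used, missing ratings would appear as `NaN` in the DataFrame and could be removed with `dropna` rather than by matching the `No rating` string.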


In [31]:
# Download csv locally
# from google.colab import files
# hotels.to_csv('Hotels.csv') 
# files.download('Hotels.csv')

In [34]:
# Write file
hotels.to_csv('/content/drive/My Drive/Colab_disk/Hotels.csv', index=False)




# Part 2: Finding nearby POI for every hotel and adding ratings

Here we load the hotels and their ratings from the temporary CSV file

In [4]:
gd = 'drive/My Drive/Colab_disk/Hotels.csv'
df = pd.read_csv(gd)
df.head()

Unnamed: 0,id,name,location.city,location.lat,location.lng,category_id,category,rating
0,4d274945d86aa09024d41dc0,Corinthia Hotel,London,51.506607,-0.12446,4bf58dd8d48988d1fa931735,Hotel,9.3
1,5d6d66792127810007da7e7a,The Hoxton Southwark,London,51.505676,-0.104648,4bf58dd8d48988d1fa931735,Hotel,8.2
2,59dfd7c76eda02332950f453,Premier Inn London Woolwich (Royal Arsenal) hotel,Woolwich,51.492892,0.067341,4bf58dd8d48988d1fa931735,Hotel,6.4
3,59045110d1a4026a246ac71e,The House Of Toby London,London,51.528755,-0.118853,4bf58dd8d48988d1fa931735,Hotel,7.4
4,4bb44578643cd13ab492395c,The Ritz London,City of London,51.507078,-0.141627,4bf58dd8d48988d1fa931735,Hotel,9.1


In [48]:
### Getting POI list (by category)
# https://developer.foursquare.com/docs/build-with-foursquare/categories/

categories = {
              'Food': '4d4b7105d754a06374d81259', 
              'Arts & Entertainment':'4d4b7104d754a06370d81259',
              'Nightlife Spot': '4d4b7105d754a06376d81259',
              'Train Station': '4bf58dd8d48988d129951735'
             }
# category = categories['Food']
radius = 30000 # in meters
LIMIT = 40000
intent='browse'
search_query = ''
url_search = 'https://api.foursquare.com/v2/venues/search?'


# POI data request
def get_poi_data(category, latitude, longitude, radius):
  # 'intent' is passed as a named query parameter
  url = '{}client_id={}&client_secret={}&ll={},{}&intent={}&v={}&query={}&radius={}&limit={}&categoryId={}'\
      .format(url_search, CLIENT_ID, CLIENT_SECRET, latitude, longitude, intent, VERSION, search_query, radius, LIMIT, category)
  results = requests.get(url).json()

  # transform results into a dataframe
  df = json_normalize(results['response']['venues'])
  filtered_columns = ['id', 'location.lat', 'location.lng'] + ['categories']
  poi = df.loc[:, filtered_columns].copy()  # copy to avoid SettingWithCopyWarning

  # Unpack category name & id from 'categories' (None when the list is empty)
  poi['category'] = poi.apply(get_category_type, axis=1)
  poi['category_id'] = [d[0].get('id') if d else None for d in poi.categories]

  # get rating for each row (one venue-details API call per POI)
  poi['rating'] = poi.apply(get_rating, axis=1)
  poi.drop('categories', axis=1, inplace=True)

  return poi

# Iterate over the categories and collect the POI of each one
frames = []
for category in categories.values():
  poi = get_poi_data(category, latitude, longitude,  radius=10000)
  frames.append(poi)
# DataFrame.append was removed in pandas 2.0, so concatenate once
poi_df = pd.concat(frames, ignore_index=True)

# Save to G.Drive (the combined frame, not only the last category)
poi_df.to_csv('/content/drive/My Drive/Colab_disk/poi.csv', index=False)

poi_df.info()
poi_df.head(4)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            200 non-null    object 
 1   location.lat  200 non-null    float64
 2   location.lng  200 non-null    float64
 3   category      200 non-null    object 
 4   category_id   200 non-null    object 
 5   rating        200 non-null    object 
dtypes: float64(2), object(4)
memory usage: 9.5+ KB


Unnamed: 0,id,location.lat,location.lng,category,category_id,rating
0,5da05a3631ba060009a2e8b2,51.511728,-0.124049,Coffee Shop,4bf58dd8d48988d1e0931735,8.4
1,4ead1773f5b936710b583278,51.473337,-0.133701,Caribbean Restaurant,4bf58dd8d48988d144941735,No rating
2,5b770349112c6c0039a9f0c2,51.512232,-0.122371,Sushi Restaurant,4bf58dd8d48988d1d2941735,9.3
3,5c4c7c2335811b002c3bac56,51.54352,-0.11331,Coffee Shop,4bf58dd8d48988d1e0931735,8.8
4,4b6407e4f964a5200d9c2ae3,51.533885,-0.171737,Bakery,4bf58dd8d48988d16a941735,8.4
5,4dcaa14a183817362fe73490,51.512016,-0.122743,Dessert Shop,4bf58dd8d48988d1d0941735,9.1
6,5bc1d72838f21600252afeea,51.494005,-0.157339,Indian Restaurant,4bf58dd8d48988d10f941735,8.6
7,57ee2dd9498e28af93eff303,51.492952,-0.14933,Bakery,4bf58dd8d48988d16a941735,8.5
8,5cb044142be425002c5682b4,51.513456,-0.153472,Juice Bar,4bf58dd8d48988d112941735,8.7
9,521209e311d2ac3f5ea4f1c0,51.512462,-0.153264,French Restaurant,4bf58dd8d48988d10c941735,8.9


In [53]:
### Cleaning POI Data
# Drop No Rating records
poi_df = poi_df[~poi_df['rating'].isin(["No rating"])]

poi_df.info()
poi_df.head(55)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 0 to 199
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            183 non-null    object 
 1   location.lat  183 non-null    float64
 2   location.lng  183 non-null    float64
 3   category      183 non-null    object 
 4   category_id   183 non-null    object 
 5   rating        183 non-null    object 
dtypes: float64(2), object(4)
memory usage: 10.0+ KB


Unnamed: 0,id,location.lat,location.lng,category,category_id,rating
0,5da05a3631ba060009a2e8b2,51.511728,-0.124049,Coffee Shop,4bf58dd8d48988d1e0931735,8.4
2,5b770349112c6c0039a9f0c2,51.512232,-0.122371,Sushi Restaurant,4bf58dd8d48988d1d2941735,9.3
3,5c4c7c2335811b002c3bac56,51.54352,-0.11331,Coffee Shop,4bf58dd8d48988d1e0931735,8.8
4,4b6407e4f964a5200d9c2ae3,51.533885,-0.171737,Bakery,4bf58dd8d48988d16a941735,8.4
5,4dcaa14a183817362fe73490,51.512016,-0.122743,Dessert Shop,4bf58dd8d48988d1d0941735,9.1
6,5bc1d72838f21600252afeea,51.494005,-0.157339,Indian Restaurant,4bf58dd8d48988d10f941735,8.6
7,57ee2dd9498e28af93eff303,51.492952,-0.14933,Bakery,4bf58dd8d48988d16a941735,8.5
8,5cb044142be425002c5682b4,51.513456,-0.153472,Juice Bar,4bf58dd8d48988d112941735,8.7
9,521209e311d2ac3f5ea4f1c0,51.512462,-0.153264,French Restaurant,4bf58dd8d48988d10c941735,8.9
10,5c164f9facc5f5002c24c014,51.499647,-0.162241,Café,4bf58dd8d48988d16d941735,8.2
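
The `info()` output above still lists `rating` as `object` after the unrated rows are dropped, which would get in the way of numeric operations such as averaging. A minimal sketch of one way to obtain a numeric column, assuming pandas is available and using a small made-up sample rather than the real `poi_df`:

```python
import pandas as pd

# Made-up sample mirroring some columns of poi_df
sample = pd.DataFrame({
    "id": ["a", "b", "c"],
    "category": ["Coffee Shop", "Bakery", "Juice Bar"],
    "rating": [8.4, "No rating", 8.7],
})

# Non-numeric entries such as "No rating" become NaN
sample["rating"] = pd.to_numeric(sample["rating"], errors="coerce")

# Drop unrated rows; the column is now float64
clean = sample.dropna(subset=["rating"]).reset_index(drop=True)
mean_by_category = clean.groupby("category")["rating"].mean()
```

With a float column, aggregations such as the per-category mean work directly.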


### Methodology:

1. List all London hotels and their Foursquare ratings
2. Do the same for POI (of selected categories known to impact hotel ratings)
3. Cluster (ML) POI around the locations with the highest hotel ratings (based on POI ratings)
4. Visualize on the Folium map the clusters where the highest hotel ratings are expected
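
The clustering in step 3 could look roughly like the sketch below. The notebook does not name a clustering algorithm, so k-means is an assumption here, as are the function name `kmeans`, the fixed starting centroids and the sample coordinates. Treating latitude and longitude as plain Euclidean coordinates is also a simplification that is only reasonable over a small area.

```python
from math import hypot


def kmeans(points, centroids, n_iter=20):
    """Plain k-means on 2-D points, starting from the given centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its closest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: hypot(p[0] - centroids[k][0], p[1] - centroids[k][1]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Made-up (lat, lng) pairs forming two well-separated groups
points = [
    (51.511, -0.124), (51.512, -0.122), (51.513, -0.123),
    (51.493, -0.157), (51.494, -0.158), (51.492, -0.149),
]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

In practice a library implementation (for example scikit-learn's `KMeans`) would be the usual choice; the pure-Python version is shown only to make the two steps of the algorithm explicit.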


### ToDo:
1. Prepare data for ML: POI ratings as predictors, nearest hotel as the target

In [2]:
# 2.1 Read all Greater London postcodes, reduce them to outcodes and compute the average location (coordinates) of each outcode
import pandas as pd
postcodes = pd.read_csv('https://raw.githubusercontent.com/Dehyzz/IBM_Capstone/master/London%20postcodes.csv')
postcodes["Postcode"] = postcodes["Postcode"].str.split().str[0]  # keep the outcode only
postcodes = postcodes.groupby(['Postcode']).mean(numeric_only=True)
postcodes

Postcode,Latitude,Longitude
BR1,51.412023,0.020878
BR2,51.387284,0.022406
BR3,51.404592,-0.030544
BR4,51.375068,-0.008391
BR5,51.391707,0.103335
...,...,...
WC2N,51.509066,-0.124911
WC2R,51.511689,-0.118319
WD23,51.631946,-0.334541
WD3,51.624974,-0.490279
