<a href="https://colab.research.google.com/github/Dehyzz/IBM_Capstone/blob/master/The_Battle_of_Neighborhoods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Capstone Project - The Battle of Neighborhoods

## Week 1: Report beginning

### 1. Description of the problem 
Choosing the right location for a new property, especially a hotel, is always a big challenge. As Donald Trump, a luxury property magnate, once said: "When considering building a new property, there are three major factors that you always have to consider: location, location and, of course, location!"

This data science project aims to find the best location for a new hotel in the $$$ price category in Dublin, Ireland. To do so, we will apply machine learning algorithms to Foursquare location data.

As a measure of location quality we will use the venue rating on Foursquare. Factors known to influence location quality include proximity to restaurants, museums, parks, monuments, shops and other points of interest (POI).

---

### 2. Description of the data and how it will be used to solve the problem

Foursquare will be the provider of location data. Below is the high-level process for preparing the data.

1. The Foursquare API call "search/ + keyword=hotel" will be used to find all existing hotels in Dublin. The results are then filtered to keep only hotels in the $$$ price category and saved to a Pandas DataFrame.

2. For every hotel in the DataFrame, the rating is retrieved with the "venue/ + id" API call.

3. For every hotel we then add (via the "explore/" API call) nearby POI in the selected categories:

    *   Restaurants
    *   Museums
    *   Parks
    *   Monuments
    *   Shops

4. Then we filter out venues that have less impact on the rating of hotels in the high price category, keeping the following:
    *   Restaurants (only with rating>4 and price_category>$$)
    *   Shops (only with price_category>$$)

5. Finally, we merge all the data into one DataFrame (one hotel = one row) and, for every Dublin postcode, predict the rating of a hypothetical hotel built at that location.
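
The filtering rules in step 4 can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the venue dictionaries, the `price_tier` field (1 to 4, standing in for $ to $$$$) and the `keep_venue` helper are assumptions made for the example.

```python
# Hypothetical venue records; the field names are assumptions for illustration.
venues = [
    {"name": "A", "kind": "restaurant", "rating": 8.9, "price_tier": 3},
    {"name": "B", "kind": "restaurant", "rating": 3.5, "price_tier": 4},
    {"name": "C", "kind": "shop", "rating": None, "price_tier": 3},
    {"name": "D", "kind": "shop", "rating": 7.0, "price_tier": 1},
    {"name": "E", "kind": "museum", "rating": 9.1, "price_tier": None},
]

def keep_venue(venue):
    """Step 4 rules: restaurants need rating > 4 and price > $$, shops need price > $$."""
    tier = venue.get("price_tier") or 0
    if venue["kind"] == "restaurant":
        return (venue.get("rating") or 0) > 4 and tier > 2
    if venue["kind"] == "shop":
        return tier > 2
    return True  # other POI categories are kept unchanged

filtered = [v for v in venues if keep_venue(v)]
kept_names = [v["name"] for v in filtered]
```

Missing ratings or price tiers are treated as zero here, so such restaurants and shops are dropped; whether that is the right default depends on how sparse the real data turns out to be.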

---

## Week 2:

### 1. Report
* Introduction with the business problem and interested parties

* Data and its source

* Methodology section, which is the main component of the report, where you discuss and describe exploratory data analysis, inferential statistical testing (if any), and which machine learning methods were used and why.

* Results description

* Discussion section where you discuss any observations you noted and any recommendations you can make based on the results

* Conclusion section where you conclude the report

### 2. A link to your Notebook

### 3. Presentation or blogpost


### Example

* Report: https://cocl.us/coursera_capstone_report
* Notebook: https://cocl.us/coursera_capstone_notebook
* Presentation: https://cocl.us/coursera_capstone_presentation
* Blogpost: https://cocl.us/coursera_capstone_blogpost

# Importing Libs & G.Drive

In [2]:
### Importing Libraries
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
from pandas import json_normalize # transforms nested JSON into a flat pandas dataframe (pandas >= 1.0)
import folium # plotting library

# Mounting Gdrive in Notebook
from google.colab import drive
drive.mount('/content/drive/')

# libraries for displaying images
# from IPython.display import Image 
# from IPython.core.display import HTML 

Mounted at /content/drive/


In [3]:
### Foursquare Auth data
CLIENT_ID = '1Z1J1FGK4R5WD2RSEUC3P011PLPLUP02NJGM2W1OF3UYXJ2M' # your Foursquare ID
CLIENT_SECRET = 'LTVB0HMSJU4SPBTD4ZWGCFBUTSGOQ4F3L5VZ0YNAAZZ45FDT' # your Foursquare Secret
VERSION = '20200531'

### Center location - by default
address = 'London, UK'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
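
Credentials written directly into a shared notebook are visible to anyone who can read it. A minimal sketch of one alternative is to read them from environment variables; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the `load_credentials` helper are assumptions for this example, not part of the Foursquare API.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) or raise a clear error if either is missing."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Usage with a stand-in mapping, so the example runs without real secrets
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
CLIENT_ID_DEMO, CLIENT_SECRET_DEMO = load_credentials(demo_env)
```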

# Preparing Data

1. Getting List of hotels in Central London (NAME, LATITUDE, LONGITUDE, CATEGORY, RATING)

In [62]:
### Getting hotels list (by 'Hotel/Inn' in category)
# https://developer.foursquare.com/docs/build-with-foursquare/categories/

hotels_cat = {
              'inn': '5bae9231bedf3950379f89cb',
              'hotel': '4bf58dd8d48988d1fa931735'
             }

radius = 5*1000 # 5 km, expressed in meters
LIMIT = 200
intent='browse'
search_query = ''
url_search = 'https://api.foursquare.com/v2/venues/search?'


# function that gets the hotel list of the area by category
def get_hotel_list(category):
  # build the request URL and call the API
  url = '{}client_id={}&client_secret={}&ll={},{}&intent={}&v={}&query={}&radius={}&limit={}&categoryId={}'\
      .format(url_search, CLIENT_ID, CLIENT_SECRET, latitude, longitude, intent, VERSION, search_query, radius, LIMIT, category)
  results = requests.get(url).json()

  # transform results into a dataframe
  df = json_normalize(results['response']['venues'])
  filtered_columns = ['id', 'name', 'location.city', 'location.lat', 'location.lng'] + ['categories']
  # reindex (rather than .loc) so a column absent from the response becomes NaN instead of raising KeyError
  hotels = df.reindex(columns=filtered_columns)

  # Unpack the id of the primary category in 'categories'
  hotels['category_id'] = [d[0].get('id') if isinstance(d, list) and d else None for d in hotels['categories']]
  return hotels

# Collect the hotels of every category into one dataframe
hotels_df = pd.concat([get_hotel_list(value) for value in hotels_cat.values()], ignore_index=True)
hotels = hotels_df # the cells below operate on `hotels`

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


# get category name for each row
hotels['category'] = hotels.apply(get_category_type, axis=1)
hotels.drop('categories', axis=1, inplace=True)
hotels.info()
hotels.head(6)




KeyError: ignored
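
Assembling the search request by string formatting makes it easy to lose a parameter name or leave a value unescaped. As a sketch, the same query string can be built from a dictionary with the standard library. `build_search_url` is a hypothetical helper; the parameter names are the ones used in the request above, with `intent` passed as an explicit key.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lng, category_id,
                     version="20200531", radius=5000, limit=200, intent="browse"):
    """Build a venues/search URL; every value is URL-encoded and explicitly named."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "intent": intent,
        "v": version,
        "radius": radius,  # meters
        "limit": limit,
        "categoryId": category_id,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials; the category id is the 'hotel' id from hotels_cat
url = build_search_url("ID", "SECRET", 51.5074, -0.1278, "4bf58dd8d48988d1fa931735")
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the string by hand at all.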

In [54]:
# Get the rating for a venue row; returns 'No rating' when the response contains none
def get_rating(row):
    venue_id = row['id']
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(url).json()
    try:
        result = result['response']['venue']['rating']
    except KeyError:
        result = 'No rating'
    return result
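
A sketch of an alternative way to read the rating, returning `None` rather than a string when no rating is present so that the column can stay numeric. `extract_rating` is a hypothetical helper, and the sample payloads are made up to mirror the `response` / `venue` / `rating` path used above.

```python
def extract_rating(payload):
    """Walk payload['response']['venue']['rating'], returning None if any level is missing."""
    node = payload
    for key in ("response", "venue", "rating"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# Made-up payloads: one with a rating, one without, one error-style response
with_rating = {"response": {"venue": {"id": "abc", "rating": 9.4}}}
without_rating = {"response": {"venue": {"id": "abc"}}}
error_response = {"meta": {"code": 429}, "response": {}}

ratings = [extract_rating(p) for p in (with_rating, without_rating, error_response)]
```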


In [55]:
hotels['rating'] = hotels.apply(get_rating, axis=1)
hotels.head(3)

| | id | name | location.city | location.lat | location.lng | category_id | category | rating |
|---|---|---|---|---|---|---|---|---|
| 0 | 59045110d1a4026a246ac71e | The House Of Toby London | London | 51.528755 | -0.118853 | 4bf58dd8d48988d1fa931735 | Hotel | No rating |
| 1 | 5d6d66792127810007da7e7a | The Hoxton Southwark | London | 51.505676 | -0.104648 | 4bf58dd8d48988d1fa931735 | Hotel | No rating |
| 2 | 5dd2a07d0855370008306619 | Treehouse Hotel London | London | 51.517934 | -0.142849 | 4bf58dd8d48988d1fa931735 | Hotel | No rating |


In [31]:
### Download csv locally
# from google.colab import files
# hotels.to_csv('Hotels.csv') 
# files.download('Hotels.csv')

In [None]:
# Write file to G.Drive
hotels.to_csv('/content/drive/My Drive/Colab_disk/Hotels.csv', index=False)

# Part 2: Finding POI nearby for every hotel and then add ratings

Here we load the saved hotels and their ratings (run this cell to skip Stage 1).

In [5]:
# Download saved Hotels df
hotels = pd.read_csv('drive/My Drive/Colab_disk/Hotels.csv')
hotels = hotels[~hotels['rating'].isin(["No rating"])]
print(hotels.count())
hotels.head(3)

id               46
name             46
location.city    46
location.lat     46
location.lng     46
category_id      46
category         46
rating           46
dtype: int64


| | id | name | location.city | location.lat | location.lng | category_id | category | rating |
|---|---|---|---|---|---|---|---|---|
| 0 | 4d274945d86aa09024d41dc0 | Corinthia Hotel | London | 51.506607 | -0.12446 | 4bf58dd8d48988d1fa931735 | Hotel | 9.4 |
| 1 | 4bb44578643cd13ab492395c | The Ritz London | City of London | 51.507078 | -0.141627 | 4bf58dd8d48988d1fa931735 | Hotel | 9.0 |
| 2 | 59dfd7c76eda02332950f453 | Premier Inn London Woolwich (Royal Arsenal) hotel | Woolwich | 51.492892 | 0.067341 | 4bf58dd8d48988d1fa931735 | Hotel | 6.4 |


In [50]:
### Getting POI list (by category)
# https://developer.foursquare.com/docs/build-with-foursquare/categories/

LIMIT = 4
intent='browse'
search_query = ''
url_search = 'https://api.foursquare.com/v2/venues/search?'

# POI data request - returns df of POI
def get_poi_data(category, latitude, longitude, radius=2000):
  url = '{}client_id={}&client_secret={}&ll={},{}&intent={}&v={}&query={}&radius={}&limit={}&categoryId={}'\
      .format(url_search, CLIENT_ID, CLIENT_SECRET, latitude, longitude, intent, VERSION, search_query, radius, LIMIT, category)
  results = requests.get(url).json()
  # transform results into a dataframe
  df = json_normalize(results['response']['venues'])
  filtered_columns = ['id', 'location.lat', 'location.lng'] + ['categories']
  # reindex keeps the expected columns even when no venue is returned
  poi = df.reindex(columns=filtered_columns)
  # Unpack id & name of the primary category in 'categories'
  primary = [d[0] if isinstance(d, list) and d else {} for d in poi['categories']]
  poi['category_id'] = [c.get('id') for c in primary]
  poi['category'] = [c.get('name') for c in primary]
  # get rating for each row (one API call per venue)
  poi['rating'] = [get_rating(row) for _, row in poi.iterrows()]
  poi.drop('categories', axis=1, inplace=True)
  return poi


# POI categories (Foursquare category ids); defined at module level because later cells use them as column names
categories = {
              'Food': '4d4b7105d754a06374d81259',
              'Entertainment':'4d4b7104d754a06370d81259',
              'Nightlife': '4d4b7105d754a06376d81259'
             }

def get_poi_analysis(latitude, longitude):
  # Returns the mean POI rating per category, in the order of `categories`
  poi_avg = []
  for category in categories.values(): # iterate POI categories
    poi_df = get_poi_data(category, latitude, longitude)
    poi_df = poi_df[~poi_df['rating'].isin(["No rating"])] # Drop rows without a rating
    poi_avg.append(poi_df['rating'].astype(float).mean()) # Column mean (NaN if no rated POI)
  return poi_avg
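
To make the aggregation concrete, here is a small stand-alone sketch of the same idea without API calls: average the rated POI per category and report NaN when a category has no rated POI. The sample ratings are invented and `mean_rating` is a hypothetical helper.

```python
import math
from statistics import mean

def mean_rating(ratings):
    """Mean of numeric ratings, skipping 'No rating' placeholders; NaN if none remain."""
    numeric = [float(r) for r in ratings if r != "No rating"]
    return mean(numeric) if numeric else math.nan

# Invented ratings for the POI found around one location
poi_ratings = {
    "Food": [8.0, 9.0, "No rating", 7.0],
    "Entertainment": [9.0],
    "Nightlife": ["No rating"],
}
poi_avg = [mean_rating(poi_ratings[name]) for name in poi_ratings]
```

A NaN for a category signals that no rated venue was found within the search radius, which is worth distinguishing from a genuinely low average before any model is trained on these columns.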

In [51]:
### POI analysis for location
col_names = list(categories) 
avg_poi_df = pd.DataFrame(columns=col_names)
# Iterate hotels to add columns with POI rating analysis 
for index, row in hotels.iterrows():
  latitude, longitude = row['location.lat'], row['location.lng']
  avg_poi_df.loc[index] = get_poi_analysis(latitude, longitude)
hotels = pd.concat([hotels, avg_poi_df], axis=1)
print(hotels.head(6))


  


                         id  ... Nightlife
0  4d274945d86aa09024d41dc0  ...       NaN
1  4bb44578643cd13ab492395c  ...       NaN
2  59dfd7c76eda02332950f453  ...       NaN
3  59045110d1a4026a246ac71e  ...       NaN

[4 rows x 14 columns]


In [47]:
hotels.head(60)

| | id | name | location.city | location.lat | location.lng | category_id | category | rating | Food | Entertainment | Nightlife |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4d274945d86aa09024d41dc0 | Corinthia Hotel | London | 51.506607 | -0.12446 | 4bf58dd8d48988d1fa931735 | Hotel | 9.4 | 8.72 | 9.07 | 8.43 |
| 1 | 4bb44578643cd13ab492395c | The Ritz London | City of London | 51.507078 | -0.141627 | 4bf58dd8d48988d1fa931735 | Hotel | 9.0 | 8.72 | 9.02 | 8.49 |
| 2 | 59dfd7c76eda02332950f453 | Premier Inn London Woolwich (Royal Arsenal) hotel | Woolwich | 51.492892 | 0.067341 | 4bf58dd8d48988d1fa931735 | Hotel | 6.4 | 6.733333 | 6.6 | 6.785714 |
| 3 | 59045110d1a4026a246ac71e | The House Of Toby London | London | 51.528755 | -0.118853 | 4bf58dd8d48988d1fa931735 | Hotel | 7.5 | 8.875 | | |


In [49]:
# Save to G.Drive
hotels.to_csv('/content/drive/My Drive/Colab_disk/poi.csv', index=False)

### ToDo:
1. Prepare data for ML: download the df in segments (5-10 hotel rows per request -> save -> repeat)

2. Same for postcodes

3. Use continuous nearest neighbors analysis to find the location rating for each postcode

4. Visualize for postcodes

5. Prepare report
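
Item 1 of the list above (process a few rows per run and save in between) could look roughly like this. It is a sketch under assumptions: `process_in_batches`, the checkpoint file layout and the stand-in `fetch` function are all hypothetical, and a real run would call the Foursquare API where `fetch` is used.

```python
import csv
import os
import tempfile

def process_in_batches(rows, fetch, checkpoint_path, batch_size=5):
    """Process rows not yet in the checkpoint CSV, appending each finished batch to it."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, newline="") as f:
            done = {r["id"] for r in csv.DictReader(f)}
    pending = [r for r in rows if r["id"] not in done]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        write_header = not os.path.exists(checkpoint_path)
        with open(checkpoint_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["id", "value"])
            if write_header:
                writer.writeheader()
            for row in batch:
                writer.writerow({"id": row["id"], "value": fetch(row)})
    return len(pending)

# Demo with a fake fetch function instead of API calls
rows = [{"id": str(i)} for i in range(12)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")
first_run = process_in_batches(rows[:7], lambda r: 1.0, path)  # processes 7 rows
second_run = process_in_batches(rows, lambda r: 1.0, path)     # only the 5 new rows
```

Because each batch is written before the next one starts, a run interrupted by a rate limit loses at most one batch of work, and rerunning the cell resumes where it stopped.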

In [60]:
# Read all Greater London postcodes, reduce them to outcodes and average the coordinates per outcode
postcodes = pd.read_csv('https://raw.githubusercontent.com/Dehyzz/IBM_Capstone/master/London%20postcodes.csv')
postcodes["Postcode"] = postcodes["Postcode"].str.split().str[0]
postcodes = postcodes.groupby('Postcode')[['Latitude', 'Longitude']].mean()

# Save to G.Drive (keep the index, which holds the outcode)
postcodes.to_csv('/content/drive/My Drive/Colab_disk/postcodes.csv')
postcodes.head(6)

| Postcode | Latitude | Longitude |
|---|---|---|
| BR1 | 51.412023 | 0.020878 |
| BR2 | 51.387284 | 0.022406 |
| BR3 | 51.404592 | -0.030544 |
| BR4 | 51.375068 | -0.008391 |
| BR5 | 51.391707 | 0.103335 |
| BR6 | 51.365326 | 0.090189 |
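
The outcode is the part of a UK postcode before the space (for example `BR1` in `BR1 1AA`). Below is a dependency-free sketch of what the cell above computes, using invented coordinates; `average_by_outcode` is a hypothetical helper.

```python
from collections import defaultdict

def average_by_outcode(records):
    """Group (postcode, lat, lng) records by outcode and average their coordinates."""
    grouped = defaultdict(list)
    for postcode, lat, lng in records:
        outcode = postcode.split()[0]  # text before the first space
        grouped[outcode].append((lat, lng))
    return {
        outcode: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for outcode, pts in grouped.items()
    }

# Invented sample postcodes and coordinates
sample = [("BR1 1AA", 51.0, 0.0), ("BR1 2BB", 52.0, 1.0), ("BR2 0AA", 51.5, 0.5)]
centers = average_by_outcode(sample)
```

Averaging coordinates gives a rough center for each outcode; it is only an approximation of the area, which is acceptable when the goal is a coarse location rating per district.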


In [61]:
### POI analysis every postcode
col_names = list(categories)
avg_poi_df = pd.DataFrame(columns=col_names)
# Iterate postcodes to add columns with POI rating analysis 
for index, row in postcodes.iterrows():
  latitude, longitude = row['Latitude'], row['Longitude']
  avg_poi_df.loc[index] = get_poi_analysis(latitude, longitude)
postcodes = pd.concat([postcodes, avg_poi_df], axis=1)
print(postcodes.head(6))

  


KeyError: ignored

In [None]:
# Save Postcodes df to G.Drive
postcodes.to_csv('/content/drive/My Drive/Colab_disk/postcodes.csv') # keep the index, which holds the outcode

### Part 3: Teach ML model and get predictions
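
One way to realize the nearest-neighbors idea from the ToDo list is k-nearest-neighbors regression: predict the rating at a location as the mean rating of the k closest hotels. The sketch below is an assumption about how this part might be approached, not the notebook's model. It uses only the standard library, great-circle (haversine) distance on coordinates as the sole feature, and three hotel ratings listed earlier in the notebook. `haversine_km` and `knn_predict` are hypothetical names.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def knn_predict(train, lat, lng, k=2):
    """Mean rating of the k training points nearest to (lat, lng)."""
    nearest = sorted(train, key=lambda t: haversine_km(lat, lng, t[0], t[1]))[:k]
    return sum(t[2] for t in nearest) / len(nearest)

# (latitude, longitude, rating) for three hotels listed earlier in the notebook
train = [
    (51.506607, -0.124460, 9.4),  # Corinthia Hotel
    (51.507078, -0.141627, 9.0),  # The Ritz London
    (51.492892, 0.067341, 6.4),   # Premier Inn London Woolwich
]
central = knn_predict(train, 51.5074, -0.1278, k=2)  # a point in central London
woolwich = knn_predict(train, 51.49, 0.07, k=1)      # a point near Woolwich
```

The POI averages computed in Part 2 could be added as further features, in which case the distance function would need to combine geographic distance with differences in those averages.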

### Part 4: Visualising the results of ML predictions on the Map