# Week 4 - Problem Definition
___

## Business Understanding

How should one choose a restaurant in the New York metropolitan area?

There are several ways to choose where to eat in New York. For example, one could pick:
- The best cost per benefit;
- The most popular in the region;
- The best fit for one's requirements.

## Approach and Requirement

Descriptive Analytics:
- **Popularity**, number of `likes`;
- **Cost per benefit**, ratio of `price` over `rating`;
- **Concentration**, distribution of `location`.

Predictive Approach:
- Prescribe **neighbourhood** given requirements.
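
The descriptive metrics listed above can be sketched on a handful of made-up venues. This is only an illustration: the records, field values and variable names below are invented, and the notebook itself computes these quantities later with pandas.

```python
from collections import Counter

# Hypothetical venues; the values are invented for illustration only
venues = [
    {"location": "Manhattan", "price": 2, "rating": 8.0, "likes": 120},
    {"location": "Manhattan", "price": 4, "rating": 9.0, "likes": 300},
    {"location": "Brooklyn", "price": 1, "rating": 7.0, "likes": 45},
]

# Popularity: total number of likes per location
popularity = Counter()
for venue in venues:
    popularity[venue["location"]] += venue["likes"]

# Cost per benefit: price divided by rating (lower means better value)
cost_per_benefit = [venue["price"] / venue["rating"] for venue in venues]

# Concentration: share of venues found in each location
counts = Counter(venue["location"] for venue in venues)
concentration = {name: n / len(venues) for name, n in counts.items()}
```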

### Methodology

This project needs a holistic approach that employs both descriptive and predictive analytics.
A descriptive analysis will generate the cost per benefit and popularity of each area, while a predictive model will return the best place given the requirements.

The predictive model chosen is a decision tree with rating, likes and price (the requirements) as features and neighbourhood as target.
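
As a rough sketch of that idea, the snippet below fits a decision tree on four invented venues and asks it for a neighbourhood. The data, the neighbourhood names and the query are assumptions made for illustration; the real features come from the Foursquare data collected later.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up venues: [price tier, rating, likes]; values are illustrative only
X = [[1, 7.0, 40], [1, 7.5, 55], [3, 9.0, 300], [4, 8.8, 280]]
y = ["Harlem", "Harlem", "SoHo", "SoHo"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X, y)

# Requirements of a hypothetical diner: cheap, decent rating, modest popularity
suggestion = tree.predict([[1, 7.2, 50]])[0]
```

Here the classifier is trained on plain string labels, which scikit-learn classifiers accept directly.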

### Data Sources

The data will be obtained from an Airbnb data source and from the Foursquare and Geolocator APIs.\
Airbnb data source: http://insideairbnb.com/get-the-data.html.

# Week 5 - New York Restaurants 
___

## Data Collection

### Import Libraries

Built-in Python

In [None]:
import json, pathlib

Data Analysis

In [None]:
import numpy as np
import pandas as pd

Data Visualization

In [None]:
import matplotlib.pyplot as plt

Machine Learning

In [None]:
from sklearn import preprocessing, metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from joblib import dump, load

Other Libraries

In [None]:
import requests, folium

from geopy.geocoders import Nominatim

### Airbnb Data - Neighbourhood and Area Names

In [None]:
geojson = r'./neighbourhoods.geojson'

In [None]:
df_ny = pd.read_csv('./neighbourhoods.csv', index_col='neighbourhood')

In [None]:
df_ny.index.name = 'neighbourhood'

In [None]:
df_ny.head()

In [None]:
df_ny[['latitude', 'longitude']] = np.nan

In [None]:
df_areas = pd.DataFrame(columns=['latitude','longitude'],index=df_ny['neighbourhood_group'].unique())

In [None]:
df_areas.index.name = 'neighbourhood_group'

In [None]:
df_areas.head()

### Geolocator API - Neighbourhood and Area Location

In [None]:
address = ', New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")

def locate(df):

    for idx in df.index:
        loc = geolocator.geocode(idx + address)
        
        if loc is None:
            continue

        df.loc[[idx],['latitude']] = loc.latitude
        df.loc[[idx],['longitude']] = loc.longitude

In [None]:
locate(df_ny)
locate(df_areas)
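
The public Nominatim service asks clients to stay at or below one request per second, and the loop above issues one request per row. A minimal, standard-library sketch of a throttling wrapper follows; `throttle` and `fake_geocode` are hypothetical names introduced here, and the fake geocoder only exists so the example runs offline.

```python
import time

def throttle(func, min_delay_seconds=1.0):
    """Wrap func so consecutive calls are at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        now = time.monotonic()
        if last_call[0] is not None:
            wait = min_delay_seconds - (now - last_call[0])
            if wait > 0:
                time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper

# Hypothetical stand-in for geolocator.geocode, so the sketch runs offline
def fake_geocode(query):
    return {"query": query, "latitude": 40.7, "longitude": -74.0}

# A short delay keeps the demo fast; a real client would use about one second
slow_geocode = throttle(fake_geocode, min_delay_seconds=0.05)
start = time.monotonic()
results = [slow_geocode(name + ", New York City, NY")
           for name in ["Harlem", "SoHo", "Astoria"]]
elapsed = time.monotonic() - start
```

In the notebook the same wrapper could be applied to `geolocator.geocode`. geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which serves a similar purpose.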

In [None]:
df_ny.dropna(inplace=True)

In [None]:
df_areas.dropna(inplace=True)

### Foursquare API - Search Venue

In [None]:
import os

# Read credentials from the environment; these variable names are an assumed convention
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'
LIMIT = 100

Request search venues for each neighbourhood:

In [None]:
url = 'https://api.foursquare.com/v2/venues/search'
search_data = list()
for neighbourhood in df_ny.index:
    params = dict(
        client_id = CLIENT_ID,
        client_secret = CLIENT_SECRET,
        v = VERSION,
        near = neighbourhood,
        query = 'Restaurant',
        limit = LIMIT,
        radius = 500
    )
    response = requests.get(url = url, params = params)
    # Parse the JSON body; one entry per neighbourhood, in the order of df_ny.index
    search_data.append(response.json())

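Function to check and return a field from the JSON data (`NaN` if the field is missing):

In [None]:
def check(name, table):
    # Return table[name], or NaN when table is not a dict or lacks the key
    return table.get(name, np.nan) if isinstance(table, dict) else np.nan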
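
A minimal sketch of how such a lookup behaves on nested fields, assuming a helper that returns NaN both for a missing key and for a non-dictionary input. It uses `math.nan` rather than `np.nan` to stay dependency-free, and the venue record is made up.

```python
import math

def check(name, table):
    # Return table[name] when table is a dict holding the key, otherwise NaN
    return table.get(name, math.nan) if isinstance(table, dict) else math.nan

# Made-up fragment shaped like a venue record; the values are illustrative
venue = {"id": "abc123", "price": {"tier": 2}, "rating": 8.4}

tier = check("tier", check("price", venue))      # nested field is present
likes = check("count", check("likes", venue))    # 'likes' is missing entirely
```

Because the inner call yields NaN when `likes` is absent, the outer call has to tolerate a non-dictionary argument for the chain to work.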

Extract general information about each venue from response data:

In [None]:
rows = []

# search_data holds one response per neighbourhood, in the order of df_ny.index
for neighbourhood, group, data in zip(df_ny.index, df_ny['neighbourhood_group'], search_data):
    venues = check('venues', data.get('response', {}))

    if venues is np.nan:
        continue

    for venue in venues:
        # Collect plain dicts; DataFrame.append was removed in pandas 2.0
        rows.append({
            'neighbourhood': neighbourhood,
            'neighbourhood_group': group,
            'id': venue['id'],
            'names': venue['name'],
            'category': venue['categories'][0]['name'] if len(venue['categories']) else np.nan,
            'latitude': venue['location']['lat'],
            'longitude': venue['location']['lng'],
        })

df_venues = pd.DataFrame(rows).set_index('id')

## Data Sampling

Keep only venues whose category is a restaurant type:

In [None]:
df_venues = df_venues[df_venues['category'].str.contains('Restaurant', na=False)]

Drop irrelevant venues:

In [None]:
irrelevant = ['Restaurant','Food', 'Sushi Restaurant','Seafood Restaurant', 'Vegetarian / Vegan Restaurant', 'New American Restaurant']

In [None]:
df_venues = df_venues[~df_venues['category'].isin(irrelevant)]

Keep only the 15 most frequent restaurant categories:

In [None]:
relevant = list(df_venues['category'].value_counts().head(15).index)

In [None]:
df_venues = df_venues[df_venues['category'].isin(relevant)]
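
The same top-N filter can be seen on a toy table. The categories below are made up and `head(2)` stands in for the notebook's `head(15)`.

```python
import pandas as pd

# Made-up categories for illustration
toy = pd.DataFrame({"category": ["Italian Restaurant", "Italian Restaurant",
                                 "Thai Restaurant", "Thai Restaurant",
                                 "Greek Restaurant"]})

# value_counts sorts by frequency, so head(2) keeps the two most common categories
top = list(toy["category"].value_counts().head(2).index)

# Rows whose category is not among the most common ones are dropped
filtered = toy[toy["category"].isin(top)]
```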

Sample the data (the number of API calls is limited):

In [None]:
df_venues = df_venues.sample(n = 250, random_state=1)

## Additional Data Collection

### Foursquare API - Venue's Information

Request information for each restaurant:

In [None]:
restaurant_data = list()

for venue_id in df_venues.index:
    url = f'https://api.foursquare.com/v2/venues/{venue_id}'
    params = dict(client_id = CLIENT_ID,
                  client_secret = CLIENT_SECRET,
                  v = VERSION)
    response = requests.get(url = url, params = params)
    restaurant_data.append(json.loads(response.text))

Extract specific information about each venue from response data:

In [None]:
for idx in range(len(restaurant_data)):
    venue = check('venue', restaurant_data[idx]['response'])

    if venue is np.nan:
        continue

    # price and likes are nested objects; guard against missing ones
    price = check('price', venue)
    tier = check('tier', price) if isinstance(price, dict) else np.nan
    df_venues.loc[[venue['id']], ['price']] = tier
    rating = check('rating', venue)
    df_venues.loc[[venue['id']], ['rating']] = rating
    likes = check('likes', venue)
    count = check('count', likes) if isinstance(likes, dict) else np.nan
    df_venues.loc[[venue['id']], ['likes']] = count

## Data Preprocessing - Descriptive Analysis

Compute the same fields (`rating`, `price`, `likes` and `ratio`) for each dataframe, for consistency:

In [None]:
df_venues['ratio'] = df_venues['price'] / df_venues['rating'] * 2.5

In [None]:
df_ny = df_ny.merge(df_venues.loc[~df_venues['rating'].isnull()].groupby('neighbourhood')['rating'].mean(), on='neighbourhood')
df_ny = df_ny.merge(df_venues.loc[~df_venues['price'].isnull()].groupby('neighbourhood')['price'].mean(), on='neighbourhood')
df_ny = df_ny.merge(df_venues.groupby('neighbourhood')['likes'].sum(), on='neighbourhood')

In [None]:
df_ny['ratio'] = df_ny['price'] / df_ny['rating'] * 2.5

In [None]:
df_areas = df_areas.merge(df_venues.loc[~df_venues['rating'].isnull()].groupby('neighbourhood_group')['rating'].mean(), on='neighbourhood_group')
df_areas = df_areas.merge(df_venues.loc[~df_venues['price'].isnull()].groupby('neighbourhood_group')['price'].mean(), on='neighbourhood_group')
df_areas = df_areas.merge(df_venues.groupby('neighbourhood_group')['likes'].sum(), on='neighbourhood_group')

In [None]:
df_areas['ratio'] = df_areas['price'] / df_areas['rating'] * 2.5

## Data Exploration

### Categorical Analysis

**Distribution of Restaurants**

In [None]:
series = df_venues['neighbourhood_group'].value_counts(normalize = True).sort_values(ascending = False)
series.plot(kind='bar', figsize = (10, 6), color = 'slateblue')

# Annotate each bar with its share, rounded to two decimals
for index, value in enumerate(series.values):
    plt.annotate(f'{value:.2f}', xy=(index - 0.12, value - 0.015), color = 'white')

plt.title('Concentration of Restaurants by New York Area')
plt.xlabel('New York Area')
plt.ylabel('Concentration')

plt.show()

**Distribution of Cuisines**

In [None]:
series = df_venues['category'].value_counts(normalize = True).sort_values(ascending = True)
series.plot(kind='barh', figsize = (10, 6), color = 'slateblue')

# Annotate each bar with its share, rounded to two decimals
for index, value in enumerate(series.values):
    plt.annotate(f'{value:.2f}', xy=(value - 0.02, index - 0.15), color = 'white')
    
plt.title('Concentration of Restaurants by Cuisines')
plt.xlabel('Concentration')
plt.ylabel('Cuisine')

plt.show()

### Quantitative Analysis

**Descriptive Stats by Area**

In [None]:
df_areas.sort_values('likes', ascending = False)

**Note: `ratio` is the cost (`price`) per benefit (`rating`), and `likes` measures popularity.**
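
The factor 2.5 used in `ratio` is not explained in the notebook. One plausible reading, stated here as an assumption, is that it stretches a price tier ranging from 1 to 4 onto a rating scale ranging from 0 to 10 (4 × 2.5 = 10). A small worked example under that assumption, with a hypothetical helper name:

```python
def cost_per_benefit(price_tier, rating, scale=2.5):
    # scale stretches the price tier (assumed 1 to 4) onto the rating scale (assumed 0 to 10)
    return price_tier / rating * scale

# A cheap, highly rated venue gets a low ratio (good value)
cheap_and_good = cost_per_benefit(price_tier=1, rating=9.0)

# An expensive, average venue gets a high ratio (poor value)
pricey_and_average = cost_per_benefit(price_tier=4, rating=5.0)
```

Under this reading, a ratio near 1 means the price level is roughly in line with the rating, and lower values indicate better value.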

### Data Preprocessing - Predictive Model

In [None]:
df_venues = df_venues.reset_index()

Fill NaN cells with the mean of each column:

In [None]:
def nan_to_mean(df, column):
    # Assign the result back; an in-place replace on a selected column may not reach df
    values = df[column].astype('float')
    df[column] = values.fillna(values.mean())

for column in ['price', 'rating', 'likes', 'ratio']:
    nan_to_mean(df_venues, column)
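
Mean imputation on a toy column, as a quick check of what the step does. The values are made up, and the result is assigned back to the dataframe instead of relying on an in-place operation on a selected column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"rating": [8.0, np.nan, 6.0]})

# mean() skips NaN by default, so the fill value is (8.0 + 6.0) / 2
toy["rating"] = toy["rating"].fillna(toy["rating"].mean())
```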

Define the `df_feature` and `df_target` dataframes:

In [None]:
df_feature = df_venues[['price', 'rating', 'likes']]
df_target = df_venues[['neighbourhood']]

Perform one-hot encoding on `df_target`:

In [None]:
df_target = pd.get_dummies(df_target['neighbourhood'])
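
What one-hot encoding produces can be seen on three made-up labels: one column per distinct neighbourhood, with exactly one active entry per row.

```python
import pandas as pd

# Made-up target labels for illustration
toy_target = pd.DataFrame({"neighbourhood": ["Harlem", "SoHo", "Harlem"]})

# One indicator column per distinct label, in sorted label order
encoded = pd.get_dummies(toy_target["neighbourhood"])
```

Since scikit-learn classifiers also accept string labels directly, encoding the target this way is a modelling choice of the notebook, not a requirement.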

Standardize the features:

In [None]:
df_feature = preprocessing.StandardScaler().fit_transform(df_feature)

Split between train and test datasets:

In [None]:
X_train, X_test, y_train, y_test = map(np.array,train_test_split(df_feature, df_target, train_size=0.9))

## Data Modelling

Build the `tree` model with the entropy criterion and a maximum depth of 4:

In [None]:
tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)

Train the `tree` model with the train dataset (`X_train` and `y_train`):

In [None]:
model = tree.fit(X_train, y_train)

## Model Evaluation

Check the R² score with the test dataset (`X_test` and `y_test`):

In [None]:
yhat = tree.predict(X_test)
print('Decision Tree Model R² Score:',metrics.r2_score(y_test, yhat))
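
R² is primarily a regression metric, so for a classifier it can be useful to also look at the share of test rows predicted exactly right. The sketch below computes that on made-up one-hot arrays; it is an illustration alongside the score above, not a replacement for it.

```python
import numpy as np

# Made-up one-hot targets and predictions for three test rows and three classes
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 1]])

# A row counts as correct only when every column matches
row_correct = (y_true == y_pred).all(axis=1)
accuracy = row_correct.mean()
```

For one-hot (multilabel indicator) targets, `sklearn.metrics.accuracy_score` reports this same row-wise "subset accuracy".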

## Data Visualization

In [None]:
lat, lng = 40.730610, -73.935242

Function to automate the process of creating a map:

In [None]:
def create_map(var):
    ny_map = folium.Map(
        location = [lat,lng],
        zoom_start = 10
    )

    scale_ratio = np.linspace(df_venues[var].min(), df_venues[var].max(), 10, dtype=float)
    scale_ratio = scale_ratio.tolist()

    folium.Choropleth(
        name = 'Restaurant ' + var.capitalize(),
        legend_name = var.capitalize(),
        geo_data = geojson,
        data = df_venues,
        columns = ['neighbourhood', var],
        key_on = 'properties.neighbourhood',
        fill_color='YlGn',
        fill_opacity=0.5,
        line_opacity=0.2,
        # bins replaces the deprecated threshold_scale argument
        bins = scale_ratio
    ).add_to(ny_map)

    return ny_map
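
The colour scale passed to the choropleth is a list of evenly spaced edges between the smallest and largest value. A quick check on made-up values shows its shape: ten edges, which delimit nine colour classes.

```python
import numpy as np

# Made-up values standing in for a column such as 'rating'
values = [1.0, 2.5, 4.0]

# Ten evenly spaced edges from the minimum to the maximum, as a plain list
edges = np.linspace(min(values), max(values), 10, dtype=float).tolist()
```

If every value in the column is identical, all edges collapse to one number, so the map only makes sense when the column actually varies.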

**Benefit by Neighbourhood**

In [None]:
create_map('rating')

**Cost by Neighbourhood**

In [None]:
create_map('price')

**Popularity by Neighbourhood**

In [None]:
create_map('likes')

## Export Data and Models

In [None]:
df_areas.to_csv('areas.csv')

In [None]:
df_venues.to_csv('venues.csv')

In [None]:
dump(tree, 'model.joblib') 