# Modeling

## Imports

In [1]:
import numpy as np
import pandas as pd
import sklearn as sk

## Load the data

Read the pickled `DataFrame` that combines consumer profiles, consumer cuisines, restaurant profiles, restaurant cuisines, and user ratings of restaurants.

In [2]:
user_restaurant_ratings = pd.read_pickle('user_restaurant_ratings_exploration')

## Prepare data for machine learning

Add additional columns to your feature `DataFrame` for any categorical columns you think will be relevant. For each such column:

* Use `pandas.get_dummies` to one-hot encode the categorical values.
* Use `pandas.concat` with `axis=1` to add those columns to the feature `DataFrame`

In [3]:
categorical_cols = ['drink_level', 
                    'dress_preference', 
                    'ambience', 
                    'interest', 
                    'personality',
                    'activity', 
                    'budget', 
                    'alcohol', 
                    'smoking_area', 
                    'dress_code',
                    'price', 
                    'Rambience',
                    'area'] 
                    #'religion', 
                    #'color', 
                    #'marital_status', 
                    #'hijos', 
                    #'accessibility']
                    #'other_services']
dfs = [user_restaurant_ratings]
for col in categorical_cols:
    dfs.append(pd.get_dummies(user_restaurant_ratings[col]))

user_restaurant_ratings = pd.concat(dfs, axis=1)
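The encode-and-concatenate pattern above can be sketched on a toy frame. The data below is invented for illustration and is not taken from the ratings table. One difference from the cell above is the `prefix` argument: without it, two categorical columns that happen to share a value would produce dummy columns with identical names. Whether that happens in the real dataset is not confirmed here, and adopting prefixes would change the column names that later cells see.

```python
import pandas as pd

# Toy data (hypothetical): two categorical columns that share the value 'casual'.
toy = pd.DataFrame({
    'dress_preference': ['casual', 'formal', 'casual'],
    'dress_code': ['casual', 'casual', 'formal'],
})

# prefix=col yields names such as 'dress_code_casual', so the two source
# columns cannot produce identically named dummy columns.
dummies = [pd.get_dummies(toy[col], prefix=col) for col in toy.columns]

# axis=1 appends the dummy columns side by side with the original columns.
encoded = pd.concat([toy] + dummies, axis=1)

print(list(encoded.columns))
```

The original categorical columns are still present in `encoded`, so they would be dropped in a later step, as the notebook does.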

Now that the dummy columns are concatenated to the table, we can drop the original categorical columns. This also removes the categorical columns that we chose not to encode.

In [4]:
user_restaurant_ratings.drop(['drink_level', 'dress_preference', 'ambience', 'marital_status', 'hijos', 'interest',
                              'personality', 'religion', 'activity', 'color', 'budget', 'alcohol',
                              'smoking_area', 'dress_code', 'accessibility', 'price', 'Rambience', 'area','other_services'], 
                              axis=1, inplace=True)

Drop columns that are not categorical and will not be needed for the project. We do not need the location columns (latitude and longitude), since we assume a customer can be satisfied or unsatisfied at any restaurant in the world. The identifier columns (`userID`, `placeID`, `name`) are dropped as well.

In [5]:
user_restaurant_ratings.drop(['userID', 'placeID', 'latitude_x', 
                              'longitude_x', 'longitude_y', 'latitude_y', 
                              'name'], axis=1, inplace=True)
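As a minimal sketch of the drop steps above, the toy frame below (invented column values) shows an alternative spelling: `drop(columns=...)` returns a new frame instead of modifying the original in place, which makes a cell safe to re-run. The `errors='ignore'` option is an addition for illustration; the notebook itself does not use it.

```python
import pandas as pd

# Toy frame (hypothetical values) with an identifier, a categorical column,
# and a numeric column that we want to keep.
toy = pd.DataFrame({
    'userID': ['U1', 'U2'],
    'budget': ['low', 'high'],
    'weight': [70, 82],
})

# drop returns a new DataFrame; `toy` itself is left unchanged.
trimmed = toy.drop(columns=['userID', 'budget'])

# errors='ignore' skips labels that are already absent instead of
# raising a KeyError, so repeating the drop is harmless.
trimmed_again = trimmed.drop(columns=['userID'], errors='ignore')

print(list(trimmed_again.columns))
```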

## Finding the most significant data

At the moment, our merged table has too many columns. Dropping columns that are unlikely to carry useful signal may improve the accuracy of our recommendation system.

In [6]:
list(user_restaurant_ratings.columns)

['smoker',
 'birth_year',
 'weight',
 'height',
 'u_Afghan',
 'u_African',
 'u_American',
 'u_Armenian',
 'u_Asian',
 'u_Australian',
 'u_Austrian',
 'u_Bagels',
 'u_Bakery',
 'u_Bar',
 'u_Bar_Pub_Brewery',
 'u_Barbecue',
 'u_Basque',
 'u_Brazilian',
 'u_Breakfast-Brunch',
 'u_British',
 'u_Burgers',
 'u_Burmese',
 'u_Cafe-Coffee_Shop',
 'u_Cafeteria',
 'u_Cajun-Creole',
 'u_California',
 'u_Cambodian',
 'u_Canadian',
 'u_Caribbean',
 'u_Chilean',
 'u_Chinese',
 'u_Contemporary',
 'u_Continental-European',
 'u_Cuban',
 'u_Deli-Sandwiches',
 'u_Dessert-Ice_Cream',
 'u_Dim_Sum',
 'u_Diner',
 'u_Doughnuts',
 'u_Dutch-Belgian',
 'u_Eastern_European',
 'u_Eclectic',
 'u_Ethiopian',
 'u_Family',
 'u_Fast_Food',
 'u_Filipino',
 'u_Fine_Dining',
 'u_French',
 'u_Fusion',
 'u_Game',
 'u_German',
 'u_Greek',
 'u_Hawaiian',
 'u_Hot_Dogs',
 'u_Hungarian',
 'u_Indian-Pakistani',
 'u_Indigenous',
 'u_Indonesian',
 'u_International',
 'u_Irish',
 'u_Israeli',
 'u_Italian',
 'u_Jamaican',
 'u_Japanese

### Cuisines

Cuisines that are served by restaurants are marked with the prefix '`r_`', and cuisines that are preferred by consumers are marked with the prefix '`u_`'. As the list of columns shows, fewer cuisines are offered by restaurants than are preferred by consumers.
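Because the two groups are distinguished only by prefix, the column lists can also be collected programmatically rather than typed out by hand. The sketch below uses an invented four-column frame and assumes that no non-cuisine column starts with `u_` or `r_`, which should be checked against the real table before relying on it.

```python
import pandas as pd

# Toy frame (hypothetical) with one profile column and three cuisine columns.
toy = pd.DataFrame(columns=['smoker', 'u_Italian', 'u_Thai', 'r_Italian'])

# Select columns by prefix; this assumes the prefixes are used for cuisines only.
user_cols = [c for c in toy.columns if c.startswith('u_')]
restaurant_cols = [c for c in toy.columns if c.startswith('r_')]

print(user_cols)        # ['u_Italian', 'u_Thai']
print(restaurant_cols)  # ['r_Italian']
```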

In [7]:
user_cuisine_cols = ['u_Afghan', 'u_African', 'u_American', 'u_Armenian', 'u_Asian', 'u_Australian', 'u_Austrian', 'u_Bagels', 'u_Bakery', 'u_Bar', 'u_Bar_Pub_Brewery', 'u_Barbecue', 'u_Basque', 'u_Brazilian', 'u_Breakfast-Brunch', 'u_British', 'u_Burgers', 'u_Burmese', 'u_Cafe-Coffee_Shop', 'u_Cafeteria', 'u_Cajun-Creole', 'u_California', 'u_Cambodian', 'u_Canadian', 'u_Caribbean', 'u_Chilean', 'u_Chinese', 'u_Contemporary', 'u_Continental-European', 'u_Cuban', 'u_Deli-Sandwiches', 'u_Dessert-Ice_Cream', 'u_Dim_Sum', 'u_Diner', 'u_Doughnuts', 'u_Dutch-Belgian', 'u_Eastern_European', 'u_Eclectic', 'u_Ethiopian', 'u_Family', 'u_Fast_Food', 'u_Filipino', 'u_Fine_Dining', 'u_French', 'u_Fusion', 'u_Game', 'u_German', 'u_Greek', 'u_Hawaiian', 'u_Hot_Dogs', 'u_Hungarian', 'u_Indian-Pakistani', 'u_Indigenous', 'u_Indonesian', 'u_International', 'u_Irish', 'u_Israeli', 'u_Italian', 'u_Jamaican', 'u_Japanese', 'u_Juice', 'u_Korean', 'u_Kosher', 'u_Latin_American', 'u_Lebanese', 'u_Malaysian', 'u_Mediterranean', 'u_Mexican', 'u_Middle_Eastern', 'u_Mongolian', 'u_Moroccan', 'u_North_African', 'u_Organic-Healthy', 'u_Pacific_Northwest', 'u_Pacific_Rim', 'u_Persian', 'u_Peruvian', 'u_Pizzeria', 'u_Polish', 'u_Polynesian', 'u_Portuguese', 'u_Regional', 'u_Romanian', 'u_Russian-Ukrainian', 'u_Scandinavian', 'u_Seafood', 'u_Soup', 'u_Southeast_Asian', 'u_Southern', 'u_Southwestern', 'u_Spanish', 'u_Steaks', 'u_Sushi', 'u_Swiss', 'u_Tapas', 'u_Tea_House', 'u_Tex-Mex', 'u_Thai', 'u_Tibetan', 'u_Tunisian', 'u_Turkish', 'u_Vegetarian', 'u_Vietnamese']

In [8]:
len(user_cuisine_cols)

103

In [9]:
restaurant_cuisine_cols = ['r_American', 'r_Armenian', 'r_Bakery', 'r_Bar', 'r_Bar_Pub_Brewery', 'r_Breakfast-Brunch', 'r_Burgers', 'r_Cafe-Coffee_Shop', 'r_Cafeteria', 'r_Chinese', 'r_Contemporary', 'r_Family', 'r_Fast_Food', 'r_Game', 'r_International', 'r_Italian', 'r_Japanese', 'r_Mediterranean', 'r_Mexican', 'r_Pizzeria', 'r_Regional', 'r_Seafood', 'r_Vietnamese']

In [10]:
len(restaurant_cuisine_cols)

23

For the purpose of our restaurant recommendation system, we can drop the columns of the cuisines preferred by consumers that are not offered by any restaurants in our dataset.

In [11]:
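# Collect user cuisine columns that have no matching restaurant cuisine column
user_cuisine_drops = []
for cuisine in user_cuisine_cols:
    # cuisine[2:] strips the 'u_' prefix so the name can be compared with 'r_' columns
    if 'r_' + cuisine[2:] not in restaurant_cuisine_cols:
        user_cuisine_drops.append(cuisine)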

len(user_cuisine_drops)

80
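The filtering step above can also be written with a set of offered cuisine names, which makes each membership test a constant-time lookup. This is a sketch on invented three-item lists, not a replacement for the notebook's cell; the variable names `offered` and `drops` are hypothetical.

```python
# Toy column lists (hypothetical): Thai is preferred by users but not offered.
user_cols = ['u_Italian', 'u_Thai', 'u_Mexican']
restaurant_cols = ['r_Italian', 'r_Mexican']

# Strip the two-character 'r_' prefix to get the bare cuisine names.
offered = {col[2:] for col in restaurant_cols}

# Keep the user columns whose bare cuisine name is not offered anywhere.
drops = [col for col in user_cols if col[2:] not in offered]

print(drops)  # ['u_Thai']
```

The same arithmetic explains the counts in the notebook: 103 user cuisine columns minus 23 that match a restaurant cuisine leaves 80 columns to drop.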

We will now drop these 80 unnecessary columns from our table.

In [12]:
user_restaurant_ratings.drop(user_cuisine_drops, axis=1, inplace=True)

In [13]:
cuisine_cols = ['u_American', 'u_Armenian', 'u_Bakery', 'u_Bar', 'u_Bar_Pub_Brewery', 'u_Breakfast-Brunch', 'u_Burgers', 'u_Cafe-Coffee_Shop', 'u_Cafeteria', 'u_Chinese', 'u_Contemporary', 'u_Family', 'u_Fast_Food', 'u_Game', 'u_International', 'u_Italian', 'u_Japanese', 'u_Mediterranean', 'u_Mexican', 'u_Pizzeria', 'u_Regional', 'u_Seafood', 'u_Vietnamese', 'r_American', 'r_Armenian', 'r_Bakery', 'r_Bar', 'r_Bar_Pub_Brewery', 'r_Breakfast-Brunch', 'r_Burgers',
 'r_Cafe-Coffee_Shop', 'r_Cafeteria', 'r_Chinese', 'r_Contemporary', 'r_Family', 'r_Fast_Food', 'r_Game', 'r_International', 'r_Italian', 'r_Japanese', 'r_Mediterranean', 'r_Mexican', 'r_Pizzeria', 'r_Regional', 'r_Seafood', 'r_Vietnamese']

However, cuisine preferences do not necessarily need to be taken into account to determine customer satisfaction. A customer can be satisfied by cuisines that are not their favorite and unsatisfied by cuisines that are. Because of this, we will drop the cuisine data entirely from the table.

In [14]:
user_restaurant_ratings.drop(cuisine_cols, axis=1, inplace=True)

## Write the data

Pickle the `user_restaurant_ratings` `DataFrame` for use in the next notebook using `pandas.DataFrame.to_pickle`.

In [15]:
user_restaurant_ratings.to_pickle('user_restaurant_ratings_modeling')
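A quick way to confirm that a pickled `DataFrame` survives the round trip is to read it back and compare. The sketch below does this with an invented two-column frame in a temporary directory; the column names and the file name `toy_ratings.pkl` are hypothetical and unrelated to the notebook's files.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical columns) standing in for the prepared table.
toy = pd.DataFrame({'rating': [0, 1, 2], 'smoker': [False, True, False]})

# Write to a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_ratings.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals compares values and dtypes of the two frames.
print(restored.equals(toy))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.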