[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/Jonas-Metz-verovis/verovis_Coding_Challenge/blob/main/01_Data_Loading_and_Preprocessing.ipynb)

# Introduction

In this Challenge you need to load Restaurant Rating Data and create a tree-based Classification Model to predict the Rating of a Restaurant. The Challenge will be scored based on:

1.  The Prediction Model's Test Accuracy Score
1.  The Readability and Transferability of the submitted Code
1.  The Documentation of the submitted Code

# Documentation and Support

The following Resources might be useful to complete this Challenge:

1.  [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html#api)
1.  [Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/classes.html)
1.  [Category Encoders Documentation](https://contrib.scikit-learn.org/category_encoders/)
1.  [TowardsDataScience: Feature Engineering](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)
1.  [Machine Learning Mastery: Feature Engineering](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
1.  [TowardsDataScience: Working with Numerical Variables](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b)
1.  [TowardsDataScience: Working with Categorical Variables](https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63)
1.  [TowardsDataScience: Categorical Variable Encoding](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)


The Challenge was created by [Jonas Metz](mailto:jmetz@verovis.com). Please contact me anytime if you have any Questions! :-)

# Global Flags

# Imports

In [None]:
import os
import sys
import psutil
import pandas as pd

# Data Loading

### Task: Load the Data from these CSV-Files into separate Pandas DataFrames.

1.  Restaurant_Accepts.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Accepts.csv
1.  Restaurant_Cuisine.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Cuisine.csv
1.  Restaurant_Location.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Location.csv
1.  Restaurant_Opening_Hours.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Opening_Hours.csv
1.  Restaurant_Parking.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Parking.csv
1.  Restaurant_Rating.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Rating.csv
1.  Restaurant_User_Cuisine.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_User_Cuisine.csv
1.  Restaurant_User_Payment.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_User_Payment.csv
1.  Restaurant_User_Profile.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_User_Profile.csv
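
One possible way to approach the loading step is sketched below. It builds the nine URLs from the list above and reads each file into its own DataFrame, stored in a dictionary keyed by table name. The helper names `build_urls` and `load_frames` are hypothetical, and the sketch assumes that the default `pd.read_csv` options fit the files, which should be checked after loading.

```python
import pandas as pd

# Base URL and file names are taken from the list above.
BASE_URL = "https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/"
TABLE_NAMES = [
    "Restaurant_Accepts",
    "Restaurant_Cuisine",
    "Restaurant_Location",
    "Restaurant_Opening_Hours",
    "Restaurant_Parking",
    "Restaurant_Rating",
    "Restaurant_User_Cuisine",
    "Restaurant_User_Payment",
    "Restaurant_User_Profile",
]


def build_urls(base_url=BASE_URL, table_names=TABLE_NAMES):
    """Map each table name to the URL of its CSV file (hypothetical helper)."""
    return {name: f"{base_url}{name}.csv" for name in table_names}


def load_frames(urls, reader=pd.read_csv):
    """Read every URL into a separate DataFrame, keyed by table name (hypothetical helper)."""
    return {name: reader(url) for name, url in urls.items()}


urls = build_urls()
# Needs network access; default read_csv options are an assumption.
# dataframes = load_frames(urls)
# df_rating = dataframes["Restaurant_Rating"]
```

Keeping the DataFrames in a dictionary makes it easy to loop over all tables later, while each table is still a separate DataFrame as the Task requires.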

# Exploratory Data Analysis

### Task: Inspect the loaded DataFrames and create a List of possible Input Features as well as possible Target Variables.
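
A minimal sketch of a first inspection is shown below. It uses a small toy DataFrame with made-up column names in place of the real tables, and the helper `summarize` is hypothetical. The idea is to look at data types, missing values, and the number of distinct values per column before deciding which columns could serve as Input Features or Target Variables.

```python
import pandas as pd


def summarize(df):
    """Return one row per column: dtype, missing values and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Toy stand-in for one of the loaded tables; the column names are made up.
df_demo = pd.DataFrame({
    "Restaurant_ID": [1, 2, 3, 4],
    "Price": ["low", "high", None, "low"],
    "Rating": [2, 1, 0, 2],
})

summary = summarize(df_demo)
print(df_demo.shape)                      # number of rows and columns
print(summary)                            # per-column overview
print(df_demo["Rating"].value_counts())   # class balance of a candidate target
```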

# Data Preprocessing

### Task: Preprocess the raw DataFrames (Cleaning, Formatting, ...) so that they can be merged into one clean DataFrame (Features + Targets) afterwards.
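
The sketch below illustrates a few generic cleaning steps on toy data: trimming whitespace in column names and string values, turning placeholder strings into missing values, and dropping exact duplicate rows. The helper `clean_frame` is hypothetical, and the placeholder markers `"?"` and `""` are assumptions that need to be checked against the real files.

```python
import pandas as pd


def clean_frame(df, missing_markers=("?", "")):
    """Generic cleaning sketch (hypothetical helper, markers are assumptions)."""
    cleaned = df.copy()
    # Trim whitespace around column names.
    cleaned.columns = [str(col).strip() for col in cleaned.columns]
    # Trim whitespace around string values; other values stay untouched.
    for col in cleaned.columns:
        cleaned[col] = cleaned[col].map(
            lambda value: value.strip() if isinstance(value, str) else value
        )
    # Replace the placeholder markers with missing values.
    cleaned = cleaned.mask(cleaned.isin(list(missing_markers)))
    # Remove exact duplicate rows and renumber the index.
    return cleaned.drop_duplicates().reset_index(drop=True)


# Toy data with untidy names, a placeholder and a duplicate row.
df_raw = pd.DataFrame({" Name ": [" a ", "?", " a "], "Value": [1, 2, 1]})
df_clean = clean_frame(df_raw)
print(df_clean)
```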

### Task: Merge the preprocessed DataFrames into one DataFrame which contains a row-wise Combination of Input Features and the corresponding Target Variables.
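
The following sketch shows one way such a merge could look, using toy data. The column names (`Restaurant_ID`, `User_ID`, `Cuisine`, `Rating`) are hypothetical and the real key columns must be taken from the loaded files. It assumes that the rating table has one row per observation and that a table with several rows per restaurant is first aggregated to one row per key, so that the merge does not multiply the rating rows.

```python
import pandas as pd

# Toy rating table: one row per (restaurant, user) observation.
df_rating = pd.DataFrame({
    "Restaurant_ID": [1, 1, 2],
    "User_ID": [10, 11, 10],
    "Rating": [2, 1, 0],
})

# Toy feature table with several rows per restaurant.
df_cuisine = pd.DataFrame({
    "Restaurant_ID": [1, 1, 2],
    "Cuisine": ["Italian", "Pizza", "Mexican"],
})

# Collapse to one row per restaurant before merging.
df_cuisine_agg = (
    df_cuisine.groupby("Restaurant_ID")["Cuisine"]
    .agg(lambda values: "|".join(sorted(values)))
    .reset_index()
)

# Left merge keeps every rating row; validate raises if the right keys are not unique.
df_merged = df_rating.merge(
    df_cuisine_agg, on="Restaurant_ID", how="left", validate="many_to_one"
)
print(df_merged)
```

Comparing `len(df_merged)` with `len(df_rating)` after each merge is a cheap check that no rows were duplicated or lost.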

# Feature Engineering

### Task: Use some of the Input Features to create new Features through Combination (drop the original Features afterwards).
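
As an illustration only, the sketch below combines four coordinate columns into a single distance feature and then drops the originals. Whether the data contains coordinates, and under which column names, is an assumption; the names used here are made up. The distance is computed with the haversine formula and a mean Earth radius of 6371 km.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(values, dtype=float))
        for values in (lat1, lon1, lat2, lon2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Toy data; the column names are hypothetical.
coordinate_columns = [
    "User_Latitude", "User_Longitude", "Restaurant_Latitude", "Restaurant_Longitude",
]
df_demo = pd.DataFrame({
    "User_Latitude": [0.0, 10.0],
    "User_Longitude": [0.0, 20.0],
    "Restaurant_Latitude": [1.0, 10.0],
    "Restaurant_Longitude": [0.0, 20.0],
})

# Create the combined feature, then drop the original features.
df_demo["Distance_km"] = haversine_km(*(df_demo[col] for col in coordinate_columns))
df_demo = df_demo.drop(columns=coordinate_columns)
print(df_demo)
```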

### Task: Inspect the Input Features. Which Levels of Measurement do they have? Which Encoding Methods are appropriate for each Level/Feature? Encode all Input Features appropriately. How about the Target Variables, do they need to be encoded as well?
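
The sketch below contrasts two common encodings on toy data, using only pandas. An ordinal feature keeps its order through integer codes, while a nominal feature is one-hot encoded. The feature names, their levels, and the chosen order are assumptions for illustration. Scikit-Learn's tree-based classifiers accept string or integer class labels as the target, so the target usually needs no extra encoding for this model family.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Price": ["low", "high", "medium"],     # ordinal: low < medium < high
    "Parking": ["none", "public", "none"],  # nominal: no natural order
})

# Ordinal encoding: integer codes follow the stated order.
# Values outside the listed categories would receive the code -1.
price_order = ["low", "medium", "high"]
df_demo["Price_Encoded"] = pd.Categorical(
    df_demo["Price"], categories=price_order, ordered=True
).codes

# One-hot encoding: one indicator column per level.
df_encoded = pd.get_dummies(df_demo.drop(columns=["Price"]), columns=["Parking"])
print(df_encoded)
```

The Category Encoders package listed under Documentation and Support offers further encoders for nominal features with many levels.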

### Task: Split your preprocessed and encoded DataFrame into Training and Test Input/Output Data (~25 % Test Data, random Split).
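
A minimal sketch of a random 75/25 split with Scikit-Learn's `train_test_split` follows. The toy DataFrame and its column names are made up; `random_state` is fixed only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed and encoded DataFrame.
df_demo = pd.DataFrame({
    "Feature_A": range(8),
    "Feature_B": [0, 1, 0, 1, 0, 1, 0, 1],
    "Rating": [0, 1, 2, 0, 1, 2, 0, 1],
})

# Separate Input Features and Target Variable.
X = df_demo.drop(columns=["Rating"])
y = df_demo["Rating"]

# Random split with 25 % Test Data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```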

# Model Calculation

### Task: Create a tree-based Model (Decision Tree/Random Forest) to predict the Rating of a Restaurant. Calculate the Accuracy Score of the Model (Training and Test Data). The Test Score will be used to grade this Challenge.
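
The sketch below fits a Random Forest and reports the Accuracy on Training and Test Data. It runs on synthetic data from `make_classification` rather than on the restaurant data, and the hyperparameters are arbitrary choices, so the printed scores say nothing about the Challenge itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data as a stand-in for the encoded restaurant data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the tree-based model on the Training Data only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For classifiers, score() returns the mean Accuracy.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Training Score (Accuracy):", train_score)
print("Test Score (Accuracy):", test_score)
```

A large gap between Training and Test Score is a typical sign of overfitting and a reason to limit the tree depth or the minimum samples per leaf.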

## Solution: The Test Score

In [None]:
test_score = model.score (X = X_test, y = y_test)
print ("Test Score (Accuracy):", test_score)

### Task (Optional): Optimize your Feature Selection Workflow as well as the Hyperparameters of your tree-based Model. To do this properly, you need to split your Data into Training, Validation and Test Datasets (e.g. 60/20/20 %). Train your Models using the Training Dataset. Select the best Model based on its Score on the Validation Dataset. Only score the final, best Model on the Test Dataset to obtain a true out-of-sample Test Score. You can, of course, use Scikit-Learn's Pipelines/Grid Search/Cross Validation/etc. Functionalities if you like.
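
One way to realize the 60/20/20 workflow is sketched below on synthetic data. Two consecutive calls to `train_test_split` produce the three Datasets, a small set of `max_depth` candidates (an arbitrary choice) is compared on the Validation Dataset, and only the selected Model is scored on the Test Dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a stand-in for the encoded restaurant data.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# First split off 20 % Test Data, then 25 % of the rest (= 20 % overall) for Validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Train one candidate per hyperparameter value and keep the best on Validation.
best_model, best_val_score = None, -1.0
for max_depth in (2, 4, 8, None):
    candidate = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    candidate.fit(X_train, y_train)
    val_score = candidate.score(X_val, y_val)
    if val_score > best_val_score:
        best_model, best_val_score = candidate, val_score

# The Test Dataset is touched exactly once, by the selected Model.
test_score = best_model.score(X_test, y_test)
print("Validation Score:", best_val_score)
print("Test Score:", test_score)
```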

In [None]:
print ("Feel free to be creative ;-)")

# Result Postprocessing

In [None]:
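# Copy the actual Test Ratings into a new DataFrame so that y_test itself is not modified
df_predictions = pd.DataFrame (y_test).copy ()
# Add the Model's Predictions for the same Test Rows
df_predictions ["Predicted_Rating"] = model.predict (X_test)
df_predictions.head ()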

# Data Saving

### Info (Google Colab)

If you are working in Google Colab, you can save the Results to your Google Drive by running

```
from google.colab import drive
drive.mount("/content/drive")
```

You will be requested to authenticate with your Google Account.

The Path to your Google Colab Notebooks Folder will be "/content/drive/My Drive/Colab Notebooks".

The Commands can then use this Path:

```
os.makedirs ("/content/drive/My Drive/Colab Notebooks/Results", exist_ok=True)
df_predictions.to_csv ("/content/drive/My Drive/Colab Notebooks/Results/Restaurant_Rating_Predictions.csv", index=False)
```

### Task: Save a DataFrame which contains the actual Test Ratings as well as the corresponding Test Predictions to a CSV-File.

In [None]:
os.makedirs ("Results", exist_ok=True)

# INFO: This writes the CSV-File in a way which can be read by a German Microsoft Excel without any necessary Modifications
df_predictions.to_csv (os.path.join ("Results", "Restaurant_Rating_Predictions.csv"), sep=";", decimal=",", header=True, index=False, encoding="utf-8", float_format="%.4f")
print ("The Predictions have been successfully saved to a CSV-File.")
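
To check the saved file, it can be read back with the same separator and decimal settings. The sketch below does this as a round trip on toy data in a temporary directory; the column values are made up.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_predictions.
df_demo = pd.DataFrame({
    "Rating": [2, 0, 1],
    "Predicted_Rating": [2.0, 0.5, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Restaurant_Rating_Predictions.csv")
    # Write with the same settings as above (semicolon separator, decimal comma).
    df_demo.to_csv(
        csv_path, sep=";", decimal=",", index=False,
        encoding="utf-8", float_format="%.4f",
    )
    # Read back with matching settings so the decimal commas are parsed as floats.
    df_check = pd.read_csv(csv_path, sep=";", decimal=",", encoding="utf-8")

print(df_check)
```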