[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/Jonas-Metz-verovis/verovis_Coding_Challenge/blob/main/01_Data_Loading_and_Preprocessing.ipynb)

# Introduction

#### In this Challenge you need to load Restaurant Rating Data and create a tree-based Classification Model to predict the Rating of a Restaurant. The Challenge will be scored based on:

1.  The Prediction Model's Test Accuracy Score
1.  The verbal Explanations for specific Processing/Modeling Choices
1.  The Readability and Transferability of the submitted Code
1.  The Documentation of the submitted Code
1.  Optional (not scored): Explanation of the Model's learned Relationships (e.g. through the Feature Importances)

General Machine Learning Project Checklist (**Focus of this Challenge**) by [Aurélien Géron](https://github.com/ageron/handson-ml)

1. **Frame the Problem and look at the Big Picture**
1. **Get the Data**
1. **Explore the Data to gain Insights**
1. **Prepare the Data to better expose the underlying Data Patterns to the used Machine Learning Algorithms**
1. Explore many different Models and short-list the best ones
1. Fine-tune your Models and combine them into a great Solution
1. Present your Solution
1. Launch, monitor, and maintain your Model/Service

INFO: [Google Colab](https://colab.research.google.com/) is recommended because you can get started right away. Alternatively, you can work in your own Development Environment (e.g. [Visual Studio Code](https://code.visualstudio.com/)) by using [Git](https://git-scm.com/) to clone the [verovis Coding Challenge GitHub Repository](https://github.com/Jonas-Metz-verovis/verovis_Coding_Challenge).

# Documentation and Support

#### The following Resources might be useful to complete this Challenge:

1.  [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html#api)
1.  [Numpy Documentation](https://numpy.org/doc/stable/)
1.  [Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/classes.html)
1.  [Category Encoders Documentation](https://contrib.scikit-learn.org/category_encoders/)
1.  [Imbalanced-Learn Documentation](https://imbalanced-learn.readthedocs.io/en/stable/api.html)
1.  [Seaborn Documentation](https://seaborn.pydata.org/api.html)
1.  [SHAP Documentation](https://shap.readthedocs.io/en/latest/api.html)
1.  [Pandas Data Wrangling Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
1.  [TowardsDataScience: Data Cleansing](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d)
1.  [TowardsDataScience: Data Preprocessing](https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825)
1.  [TowardsDataScience: Feature Engineering](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)
1.  [Machine Learning Mastery: Feature Engineering](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
1.  [TowardsDataScience: Working with Numerical Variables](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b)
1.  [TowardsDataScience: Working with Categorical Variables](https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63)
1.  [TowardsDataScience: Categorical Variable Encoding](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)
1.  [TowardsDataScience: One-Hot-Encoding for tree-based Models](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769)
1.  [Stat Trek: One-Hot-Encoding (Dummy Variables)](https://stattrek.com/multiple-regression/dummy-variables.aspx)

#### If you don't know how to solve a given Problem, searching the Web for it often works well. Great Sources are:

1.  [TowardsDataScience](https://towardsdatascience.com/)
1.  [StackOverflow](https://stackoverflow.com/)
1.  [Machine Learning Mastery](https://machinelearningmastery.com/start-here/)
1.  [Python-Kurs.eu](https://www.python-kurs.eu/python3_kurs.php)
1.  [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
1.  [The Hitchhiker's Guide to Python](https://docs.python-guide.org/)
1.  [Overview of Data Science YouTube Channels](https://towardsdatascience.com/top-20-youtube-channels-for-data-science-in-2020-2ef4fb0d3d5)
1.  [Introduction to Machine Learning with Python](https://github.com/amueller/introduction_to_ml_with_python) / [Buy the Book](https://www.amazon.de/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)
1.  [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf)
1.  [Bayesian Reasoning and Machine Learning](http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/200620.pdf)
1.  [Deep Learning](https://www.deeplearningbook.org/)

#### The Challenge was created by [Jonas Metz](mailto:jmetz@verovis.com). Please contact me anytime if you have any Questions! :-)

# Global Flags

INFO: Please select a creative Team Name :-)

In [None]:
TEAM_NAME = "HelloWorldTeam"

# Imports

### Info (Google Colab)

If you are working in Google Colab, you can install necessary (and not already installed) Packages by running e.g.

```
!pip install shap
```

In [None]:
import os
import sys
import psutil
import datetime
import pandas as pd
from joblib import dump, load

# Data Loading

### Task: Load the Data from these CSV-Files into separate Pandas DataFrames.

1.  Restaurant_Accepts.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Accepts.csv
1.  Restaurant_Cuisine.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Cuisine.csv
1.  Restaurant_Location.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Location.csv
1.  Restaurant_Opening_Hours.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Opening_Hours.csv
1.  Restaurant_Parking.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Parking.csv
1.  Restaurant_Rating.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Rating.csv
1.  Restaurant_User_Cuisine.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_User_Cuisine.csv
1.  Restaurant_User_Payment.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_User_Payment.csv
1.  Restaurant_User_Profile.csv: https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_User_Profile.csv

In [None]:
# Example
df_restaurant_accepts = pd.read_csv("https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/Restaurant_Accepts.csv")
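
The remaining Files can be loaded in the same Way. The following is a minimal Sketch, assuming that all nine Files share the Folder shown in the List above. `BASE_URL`, `TABLE_NAMES`, `build_url` and `load_all_tables` are hypothetical Helper Names and not part of the Challenge.

```python
import pandas as pd

# Assumption: all CSV-Files live in the same Folder of the Challenge Repository.
BASE_URL = "https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/"

TABLE_NAMES = [
    "Restaurant_Accepts",
    "Restaurant_Cuisine",
    "Restaurant_Location",
    "Restaurant_Opening_Hours",
    "Restaurant_Parking",
    "Restaurant_Rating",
    "Restaurant_User_Cuisine",
    "Restaurant_User_Payment",
    "Restaurant_User_Profile",
]


def build_url(table_name):
    """Return the raw GitHub URL of one CSV-File."""
    return BASE_URL + table_name + ".csv"


def load_all_tables(reader=pd.read_csv):
    """Load every CSV-File into its own DataFrame, keyed by the Table Name."""
    return {name: reader(build_url(name)) for name in TABLE_NAMES}


# Usage (requires Internet Access):
# dataframes = load_all_tables()
# df_restaurant_rating = dataframes["Restaurant_Rating"]
```

Keeping the DataFrames in a Dictionary makes it easy to loop over all of them later, e.g. during the Exploratory Data Analysis.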

# Exploratory Data Analysis

### Task: Inspect the loaded DataFrames and create a List of possible Input Features as well as possible Target Variables.

#### Discuss/answer the following Questions:
-   What are the Names of the Columns?
-   What are the Data Types of the Columns?
-   Do the Rows/Columns contain any Missing Values?
-   Is this a supervised or unsupervised Machine Learning Problem?
-   More specific: Is this a Classification, Regression or Clustering Problem?
-   Which Machine Learning Algorithms could be used to solve this Challenge?

INFO: Plots can often be very informative and provide valuable Insights! You can create Plots with e.g. [Matplotlib](https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py) or [Seaborn](https://seaborn.pydata.org/tutorial/categorical.html)
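
The first three Questions (Column Names, Data Types, Missing Values) can be answered with one small Summary Table per DataFrame. This is only a Sketch on Toy Data: `summarize` is a hypothetical Helper, and the Columns `placeID` and `price` are made up for Illustration.

```python
import pandas as pd


def summarize(df):
    """Return one Row per Column: Data Type, Missing Values, distinct Values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })


# Toy Data (hypothetical Columns, only for Illustration)
df_toy = pd.DataFrame({
    "placeID": [1, 2, 3, 4],
    "price": ["low", "high", None, "low"],
})

summary = summarize(df_toy)
print(summary)
```

The Index of the Summary holds the Column Names, so the same Helper can be applied to each loaded DataFrame in turn.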

In [None]:
# Example
df_restaurant_accepts.head()

In [None]:
# Example
df_restaurant_accepts.Rpayment.value_counts()

# Data Preprocessing

### Task: Preprocess the raw DataFrames (Cleaning, Formatting, ...) to be able to merge them into one clean DataFrame (Features + Targets) afterwards.

In [None]:
# Example
df_restaurant_accepts = df_restaurant_accepts [df_restaurant_accepts.Rpayment != "Carte_Blanche"]
df_restaurant_accepts.Rpayment.value_counts()

### Task: Merge the preprocessed DataFrames into one DataFrame which contains a row-wise Combination of Input Features and the corresponding Target Variables.

#### Discuss/answer the following Questions:
-   What are the Names of the Key Columns which will be used to merge the DataFrames?
-   What Type of Merge do you perform? What are the Requirements to successfully merge two DataFrames?

INFO: If you don't know how to merge two DataFrames, you can check the [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
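
To illustrate the two Questions above, here is a minimal Sketch of a Left Merge on Toy Data. The Column Names `userID`, `placeID`, `rating` and `price` are Assumptions for this Example; check the real Key Columns in your own DataFrames.

```python
import pandas as pd

# Toy Data: all Column Names and Values are Assumptions for Illustration
df_rating = pd.DataFrame({
    "userID": ["U1", "U1", "U2"],
    "placeID": [10, 11, 10],
    "rating": [2, 1, 0],
})
df_location = pd.DataFrame({
    "placeID": [10, 12],
    "price": ["low", "high"],
})

# Left Merge: keep every Rating and add Restaurant Attributes where available.
# validate="many_to_one" raises an Error if "placeID" is not unique on the right Side.
# indicator=True adds a "_merge" Column which shows where each Row came from.
df_merged = df_rating.merge(
    df_location, on="placeID", how="left", validate="many_to_one", indicator=True
)
print(df_merged)
```

Rows marked `left_only` in the `_merge` Column found no Match in the right DataFrame and therefore contain Missing Values in the added Columns.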

# Feature Engineering

### Task (Optional): Use some of the Input Features to create new Features through Combination (drop the original Features afterwards).

### Task: Inspect the Input Features. Which Levels of Measurement do they have? Which Encoding Methods are appropriate for each Level/Feature? Encode all Input Features appropriately. How about the Target Variables, do they need to be encoded as well?
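
As a hedged Illustration of how the Level of Measurement can drive the Encoding, the Sketch below encodes one ordinal and one nominal Feature with Pandas only. The Features `price` and `cuisine` and their Values are hypothetical, and other Encoders (e.g. from Category Encoders) may suit your Data better.

```python
import pandas as pd

# Toy Data: "price" (ordinal) and "cuisine" (nominal) are hypothetical Features
df_toy = pd.DataFrame({
    "price": ["low", "high", "medium", "low"],
    "cuisine": ["Mexican", "Italian", "Mexican", "Cafeteria"],
})

# Ordinal Feature: the Categories have a natural Order,
# so they are mapped to increasing Integers (low=0, medium=1, high=2).
price_order = ["low", "medium", "high"]
df_toy["price_encoded"] = pd.Categorical(
    df_toy["price"], categories=price_order, ordered=True
).codes

# Nominal Feature: there is no natural Order,
# so one Indicator Column per Category is created (One-Hot-Encoding).
df_encoded = pd.get_dummies(df_toy.drop(columns="price"), columns=["cuisine"])
print(df_encoded)
```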

### Task: Summarize your Data Preprocessing Process and explain your Feature Selection/Engineering Choices.

#### Discuss/answer the following Questions:
-   Which Features did you select?
-   How did you handle Missing Values?
-   Which Difficulties did you face while merging the DataFrames and how did you solve them?
-   How did you encode the selected Categorical Variables and why are your Encoding Methods an appropriate Choice for these Features?
-   Which Assumptions did you make?
-   Which Hypotheses did you formulate about the Relationship between the selected Features and the Target Variables?
-   What possible Limitations do you expect and what are the Reasons for your Expectation?

Add your Summary here ...

### Task: Split your preprocessed and encoded DataFrame into Training and Test Input/Output Data (~25 % Test Data, random Split).

#### Discuss: Based on your Preprocessing, could there be a Problem with some Form of [Target Leakage](https://www.kaggle.com/alexisbcook/data-leakage)? If yes, you either need to adjust your Preprocessing or perform the Train-Test-Split before the Data Preprocessing (and apply the same Preprocessing on the Test Dataset which you've applied to the Training Dataset)!
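
One Way to keep Information from the Test Data out of the Preprocessing is to split first, fit every Preprocessing Step on the Training Data only, and then apply it unchanged to the Test Data. The Sketch below assumes a single hypothetical Feature `cuisine` and uses Scikit-Learn's `OrdinalEncoder`; it is an Illustration, not the required Solution.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Toy Data with hypothetical Columns
df_toy = pd.DataFrame({
    "cuisine": ["Mexican", "Italian", "Mexican", "Cafeteria",
                "Italian", "Mexican", "Bar", "Italian"],
    "rating": [2, 1, 2, 0, 1, 2, 0, 1],
})

# Split BEFORE fitting any Preprocessing Step
X_train, X_test, y_train, y_test = train_test_split(
    df_toy[["cuisine"]], df_toy["rating"],
    test_size=0.25, shuffle=True, random_state=42,
)

# Fit the Encoder on the Training Data only.
# Categories which only occur in the Test Data are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)
```

The same Pattern applies to Imputation or Target-based Encodings: anything that learns from the Data should only see the Training Rows.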

In [None]:
# Hint
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (preprocessed_data [input_features], preprocessed_data [target_variables], test_size = 0.25, shuffle = True)

# Model Calculation

### Task: Create a tree-based Model (Decision Tree/Random Forest) to predict the Rating of a Restaurant. Calculate the Accuracy Score of the Model (Training and Test Data). The Test Score will be used to grade this Challenge.
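
The general Workflow can be sketched on synthetic Data as follows. The Data comes from Scikit-Learn's `make_classification` and has nothing to do with the Restaurant Data; all `toy_` Names and the chosen `max_depth` are Assumptions for Illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic Data with three Classes (only a Stand-in for the real Features/Targets)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_classes=3, random_state=0
)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.25, shuffle=True, random_state=0
)

# A shallow Tree is less likely to memorize the Training Data
toy_model = DecisionTreeClassifier(max_depth=3, random_state=0)
toy_model.fit(X_toy_train, y_toy_train)

# For Classifiers, score() returns the Accuracy
toy_train_score = toy_model.score(X_toy_train, y_toy_train)
toy_test_score = toy_model.score(X_toy_test, y_toy_test)
print("Training Score (Accuracy):", toy_train_score)
print("Test Score (Accuracy):", toy_test_score)

# Optional: one Importance Value per Input Feature
print("Feature Importances:", toy_model.feature_importances_)
```

A large Gap between Training and Test Score usually indicates Overfitting.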

In [None]:
test_score = model.score (X = X_test, y = y_test)

## Solution: The Test Score

In [None]:
print ("Test Score (Accuracy):", test_score)

### Task (Optional): Optimize your Feature Selection Workflow as well as the Hyperparameters of your tree-based Model. To do this properly, you need to split your Data into a Training/Validation/Test Dataset (e.g. 60/20/20 %). Train your Models using the Training Dataset. Select the best model based on its Score on the Validation Dataset. Only score the (final) best Model on the Test Dataset to create a true out-of-sample Test Score. You can, of course, use Scikit-Learn's Pipelines/Grid Search/Cross Validation/etc. Functionalities if you like. 
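
A 60/20/20 % Split can be obtained by calling `train_test_split` twice. The Sketch below uses synthetic Data and a small, arbitrary List of `max_depth` Candidates; both are Assumptions for Illustration, and Scikit-Learn's Grid Search or Cross Validation are equally valid Choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic Data (only a Stand-in for the real Features/Targets)
X_all, y_all = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_classes=3, random_state=0
)

# Step 1: split off 20 % Test Data.
# Step 2: split 25 % of the remaining 80 % as Validation Data (= 20 % overall).
X_rest, X_te, y_rest, y_te = train_test_split(X_all, y_all, test_size=0.20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Select the Hyperparameter based on the Validation Score only
best_depth, best_val_score = None, -1.0
for depth in [2, 4, 6, 8]:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_score = candidate.score(X_val, y_val)
    if val_score > best_val_score:
        best_depth, best_val_score = depth, val_score

# Score the selected Model on the Test Data exactly once
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_tr, y_tr)
final_test_score = final_model.score(X_te, y_te)
print("Best max_depth:", best_depth)
print("Validation Score:", best_val_score)
print("Test Score:", final_test_score)
```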

In [None]:
print ("Feel free to be creative ;-)")

# Result Postprocessing

In [None]:
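# Copy the actual Test Ratings into a new DataFrame, so that y_test itself is not modified
df_predictions = pd.DataFrame (y_test).copy ()
df_predictions ["Predicted_Rating"] = model.predict (X_test)
df_predictions.head ()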

# Data Saving

### Info (Google Colab)

If you are working in Google Colab, you can save the Results to your Google Drive by running

```
from google.colab import drive
drive.mount("/content/drive")
```

You will be requested to authenticate with your Google Account.

The Path to your Google Colab Notebooks Folder will be "/content/drive/My Drive/Colab Notebooks".

The Commands can then use this Path:

```
os.makedirs ("/content/drive/My Drive/Colab Notebooks/Results", exist_ok=True)
df_predictions.to_csv ("/content/drive/My Drive/Colab Notebooks/Results/Restaurant_Rating_Predictions.csv", index=False)
```

### Task: Save a DataFrame which contains the actual Test Ratings as well as the corresponding Test Predictions to a CSV-File.

In [None]:
os.makedirs ("Results", exist_ok=True)

# INFO: This writes the CSV-File in a way which can be read by a German Microsoft Excel without any necessary Modifications
df_predictions.to_csv (os.path.join ("Results", "Restaurant_Rating_Predictions.csv"), sep=";", decimal=",", header=True, index=False, encoding="utf-8", float_format="%.4f")
print ("The Predictions have been successfully saved to a CSV-File.")

In [None]:
# INFO: This writes the Model to a joblib Pickle Dump File, which can be loaded during Inference. Please submit this File together with your Solution Notebook!
# NOTE: "import datetime" imports the Module, so the Class must be addressed as datetime.datetime
dump(model, os.path.join("Results", datetime.datetime.now().strftime("%Y%m%d_%H%M%S") +'_Coding_Challenge_' + TEAM_NAME + '_CLF.joblib'))
print ("The fitted Model has been successfully saved.")
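
During Inference, the saved File can be read back with joblib's `load`. The following Round Trip is a minimal Sketch with a tiny Toy Model and a temporary Folder; the File Name `toy_CLF.joblib` and the Training Data are made up for Illustration.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Tiny Toy Model: Inputs below 2 belong to Class 0, all others to Class 1
toy_model = DecisionTreeClassifier(random_state=0).fit([[0], [1], [2], [3]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_CLF.joblib")
    dump(toy_model, model_path)      # write the fitted Model to Disk
    restored_model = load(model_path)  # read it back, e.g. in another Notebook

restored_predictions = restored_model.predict([[0], [3]])
print(restored_predictions)
```

Only load joblib Files from Sources you trust, since unpickling can execute arbitrary Code.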