# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example for the notebook you are expected to submit for Task 2 of the Final Project. The main purpose is that we can try different examples to get a better sense of your approach. Compared to Task 1 (Kaggle Competition), we don't have any objective means to evaluate the recommendations. 

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the Used Cars dataset; there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in the form of an external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* Please consider this notebook as an example and not as a set of specific requirements. As long as there is a section where we can easily test your solution, it should be fine.

## Setting up the Notebook

In [110]:
import pandas as pd
import numpy as np
from similarity import TS_SS
from similarity import CD
from similarity import ED
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
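
The `similarity` module imported above is not part of this notebook, so the exact behaviour of `CD`, `ED`, and `TS_SS` is not shown. As a rough sketch, assuming they stand for cosine distance, Euclidean distance, and the Triangle Similarity / Sector Similarity (TS-SS) measure, the three quantities could be computed as follows. The function names are hypothetical, and the TS-SS formula follows the commonly cited definition (including its 10-degree angle offset), which may differ from what the imported module does.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cos(angle between a and b); 0 means same direction."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos)


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between the two points."""
    return float(np.linalg.norm(a - b))


def ts_ss(a: np.ndarray, b: np.ndarray) -> float:
    """Triangle area times sector area; smaller means more similar."""
    norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)
    cos = np.clip(np.dot(a, b) / (norm_a * norm_b), -1.0, 1.0)
    # Angle in degrees plus a 10 degree offset (assumed from the usual
    # TS-SS definition) so parallel vectors still span a triangle.
    theta_deg = np.degrees(np.arccos(cos)) + 10.0
    triangle = norm_a * norm_b * np.sin(np.radians(theta_deg)) / 2.0
    magnitude_diff = abs(norm_a - norm_b)
    sector = np.pi * (euclidean_distance(a, b) + magnitude_diff) ** 2 * theta_deg / 360.0
    return float(triangle * sector)


a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])  # same direction as a, twice the length
print(cosine_distance(a, b))     # close to 0: the direction is identical
print(euclidean_distance(a, b))  # 5.0: the magnitudes differ
print(ts_ss(a, b))               # positive: TS-SS also reacts to magnitude
```

The toy vectors illustrate why the measures can rank listings differently: cosine distance ignores magnitude, while Euclidean distance and TS-SS do not.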

## Load the Data

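For this example, we load the cleaned training data and drop the columns that are not used for computing similarities (`accessories_vectors`, `make`, `model`, `listing_id`, and the leftover index column `Unnamed: 0`). In addition to `df_simple`, we create `df_normalized`, a copy in which each column is centered on its mean.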

In [98]:
df_train = pd.read_csv('train_v1.csv')
df_simple = df_train.drop(columns=['Unnamed: 0', 'accessories_vectors', 'make', 'model', 'listing_id'])

df_simple.head()

# Center each column on its mean (the columns are not rescaled)
df_normalized = df_simple.apply(lambda x: x - x.mean(), axis=0)
df_normalized.head()

Unnamed: 0,years of warranty,better loan offer,well maintained,low fuel consumption,reg_date,power,engine_cap,mileage,no_of_owners,depreciation,...,type_of_vehicle_stationwagon,type_of_vehicle_suv,type_of_vehicle_truck,type_of_vehicle_van,fuel_type_diesel,fuel_type_electric,fuel_type_petrol,fuel_type_petrol-electric,transmission,price
0,-0.197405,0.391642,0.647396,-0.275841,1.698033,0.9997083,-88.356363,1841.29001,-1.024691,2941.124695,...,-0.021044,-0.184432,-0.044599,-0.064267,-0.12441,-0.003169,0.189215,-0.061637,-0.099002,-41614.898069
1,-0.197405,0.391642,-0.352604,0.724159,-0.301967,-3.808509e-12,896.643637,38953.29001,0.975309,-3128.875305,...,-0.021044,-0.184432,-0.044599,0.935733,0.87559,-0.003169,-0.810785,-0.061637,0.900998,-69114.898069
2,-0.197405,-0.608358,-0.352604,0.724159,-1.301967,-44.00029,-490.356363,8841.29001,-1.024691,311.124695,...,-0.021044,-0.184432,-0.044599,-0.064267,-0.12441,-0.003169,0.189215,-0.061637,-0.099002,-17414.898069
3,2.802595,0.391642,0.647396,-0.275841,-5.301967,-19.00029,-588.356363,-61358.70999,-1.024691,1641.124695,...,-0.021044,-0.184432,-0.044599,-0.064267,-0.12441,-0.003169,0.189215,-0.061637,-0.099002,84985.101931
4,-0.197405,0.391642,0.647396,-0.275841,-4.301967,-42.00029,-488.356363,-31158.70999,-1.024691,-4308.875305,...,-0.021044,-0.184432,-0.044599,-0.064267,-0.12441,-0.003169,0.189215,-0.061637,-0.099002,-9714.898069
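
As the output above shows, mean-centering leaves the original scale of each column untouched: `mileage` and `price` still vary by tens of thousands, while the one-hot columns vary by less than one. If comparable scales are wanted, one option is z-score standardization. The following is a minimal sketch on made-up data; the helper name `standardize` and the toy values are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_simple; the values are made up for illustration.
df_toy = pd.DataFrame({
    "no_of_owners": [1.0, 2.0, 3.0],
    "mileage": [20000.0, 60000.0, 100000.0],
})


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Center each column and divide by its standard deviation."""
    std = df.std(ddof=0)
    # Constant columns have std 0; replace it with 1 to avoid division by zero.
    std = std.replace(0.0, 1.0)
    return (df - df.mean()) / std


df_scaled = standardize(df_toy)
print(df_scaled)
```

After scaling, both toy columns have mean 0 and standard deviation 1, so neither dominates a distance computation purely because of its units.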


## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (data sample = row in the dataframe of the dataset). The input is the id of a row in the dataset, the name of the similarity measure (`'CD'`, `'ED'`, or `'TS-SS'`), and a list of optional input parameters which will depend on your approach; `k`, the number of returned recommendations, seems useful, though.

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from an external Python (.py) script as well.

In [99]:
def get_top_recommendations(row_id, method, **kwargs) -> pd.DataFrame:
    
    #####################################################
    ## Initialize the required parameters
    
    # The number of recommendations k is the only parameter used here;
    # additional input parameters are up to you
    #
    # Extract the used parameters from **kwargs (here: k)
    k = kwargs.get('k')
            
       
    #####################################################
    ## Compute your recommendations
    #
    # This is where your magic happens. Of course, you can call methods
    # defined in this notebook or in external Python (.py) scripts
    #
        
    # Pick the similarity measure by name; anything other than
    # 'TS-SS' or 'ED' falls back to cosine distance (CD)
    similarity = None
    if method == 'TS-SS':
        similarity = TS_SS()
    elif method == 'ED':
        similarity = ED()
    else:
        similarity = CD()

    df_result = None
    df_modified = None
    if method == 'CD':
        # Cosine distance is computed on the raw features;
        # the result keeps the 'Similarity' column
        df_modified = df_simple.copy()
        row = df_modified.iloc[row_id]
        df_modified['Similarity'] = df_modified.apply(lambda other: similarity(row.to_numpy().reshape(1, -1), other.to_numpy().reshape(1, -1)), axis=1)
        # Smaller values mean more similar, so sort in ascending order
        df_result = df_modified.sort_values('Similarity').head(k)
    else:
        # ED and TS-SS are computed on the mean-centered features
        df_modified = df_normalized.copy()
        row = df_modified.iloc[row_id]
        df_modified['Similarity'] = df_modified.apply(lambda other: similarity(row.to_numpy().reshape(1, -1), other.to_numpy().reshape(1, -1)), axis=1)
        df_result = df_modified.sort_values('Similarity').head(k)
        # Map the selected rows back to their original (uncentered) values
        df_result = df_simple.copy().reindex(df_result.index)
    
    # Return the dataset with the k recommendations
    return df_result
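
The row-wise `apply` in `get_top_recommendations()` calls the similarity function once per row of the dataframe. For Euclidean distance specifically, the same ranking can be obtained with a single NumPy operation. The following is a sketch on made-up data; the function name `top_k_euclidean` and the `Distance` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def top_k_euclidean(df: pd.DataFrame, row_id: int, k: int) -> pd.DataFrame:
    """Return the k rows of df closest to the row at position row_id."""
    values = df.to_numpy(dtype=float)
    # Broadcasting subtracts the query row from every row at once.
    distances = np.linalg.norm(values - values[row_id], axis=1)
    result = df.copy()
    result["Distance"] = distances
    # A stable sort keeps the original row order among ties.
    return result.sort_values("Distance", kind="stable").head(k)


# Toy data, made up for illustration.
df_toy = pd.DataFrame({
    "power": [95.0, 96.0, 150.0, 94.0],
    "engine_cap": [1591.0, 1591.0, 2998.0, 1598.0],
})
top = top_k_euclidean(df_toy, row_id=0, k=3)
print(top)
```

As in the notebook, the query row itself comes first with distance 0; whether to keep or drop it from the recommendations is a design choice.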


## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. Most basically, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Pick a Sample Listing as Input

In [100]:
# Pick a row id of choice
row_id = 10
#row_id = 20
#row_id = 30
#row_id = 40
#row_id = 50

# Get the row from the dataframe (invalid row ids will throw an error)
row = df_simple.iloc[row_id]

# Just for printing it nicely, we create a new dataframe from this single row
pd.DataFrame([row])

Unnamed: 0,years of warranty,better loan offer,well maintained,low fuel consumption,reg_date,power,engine_cap,mileage,no_of_owners,depreciation,...,type_of_vehicle_stationwagon,type_of_vehicle_suv,type_of_vehicle_truck,type_of_vehicle_van,fuel_type_diesel,fuel_type_electric,fuel_type_petrol,fuel_type_petrol-electric,transmission,price
10,0.0,1.0,0.0,0.0,4.0,95.3,1591.0,66000.0,1.0,8270.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,61400.0


## Compute and Display the Recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

In [112]:
k = 10

# Recommendation based on Cosine Distance
df_recommendations_cd = get_top_recommendations(row_id, 'CD', k=k)
df_recommendations_cd.head(k)

Unnamed: 0,years of warranty,better loan offer,well maintained,low fuel consumption,reg_date,power,engine_cap,mileage,no_of_owners,depreciation,...,type_of_vehicle_suv,type_of_vehicle_truck,type_of_vehicle_van,fuel_type_diesel,fuel_type_electric,fuel_type_petrol,fuel_type_petrol-electric,transmission,price,Similarity
10,0,1,0,0,4,95.3,1591.0,66000.0,1.0,8270.0,...,0,0,0,0,0,1,0,0,61400.0,[[2.220446049250313e-16]]
4092,5,1,0,0,4,93.8,1591.0,66000.0,1.0,8380.0,...,0,0,0,0,0,1,0,0,62600.0,[[4.485451108349192e-05]]
12574,0,0,0,0,4,95.3,1591.0,79000.0,1.0,10020.0,...,0,0,0,0,0,1,0,0,71300.0,[[0.0001583591208276225]]
6378,0,0,0,0,4,95.3,1591.0,65803.0,1.0,8360.0,...,0,0,0,0,0,1,0,0,61400.0,[[0.00019364331899129894]]
9630,0,0,0,0,4,95.3,1591.0,66000.0,1.0,8120.0,...,0,0,0,0,0,1,0,0,59300.0,[[0.00019524230543721544]]
3430,0,0,1,0,4,110.0,1998.0,82000.0,1.0,9720.0,...,1,0,0,0,0,1,0,0,76800.0,[[0.00020172661895268007]]
16480,0,1,0,0,3,95.3,1591.0,73000.0,1.0,8630.0,...,0,0,0,0,0,1,0,0,66900.0,[[0.00021063537360943574]]
3708,0,0,0,0,4,95.3,1591.0,66591.0,2.0,7890.0,...,0,0,0,0,0,1,0,0,59400.0,[[0.0002734659207408452]]
533,0,0,0,0,4,111.0,1998.0,79638.0,1.0,9750.0,...,0,0,0,0,0,1,0,0,75100.0,[[0.00028983713724661797]]
1320,0,1,0,0,4,95.3,1591.0,75000.0,1.0,9030.0,...,0,0,0,0,0,1,0,0,71300.0,[[0.0002905468443661352]]


In [113]:
# Recommendation based on Euclidean Distance
df_recommendations_ed = get_top_recommendations(row_id, 'ED', k=k)
df_recommendations_ed.head(k)

Unnamed: 0,years of warranty,better loan offer,well maintained,low fuel consumption,reg_date,power,engine_cap,mileage,no_of_owners,depreciation,...,type_of_vehicle_stationwagon,type_of_vehicle_suv,type_of_vehicle_truck,type_of_vehicle_van,fuel_type_diesel,fuel_type_electric,fuel_type_petrol,fuel_type_petrol-electric,transmission,price
10,0,1,0,0,4,95.3,1591.0,66000.0,1.0,8270.0,...,0,0,0,0,0,0,1,0,0,61400.0
4092,5,1,0,0,4,93.8,1591.0,66000.0,1.0,8380.0,...,0,0,0,0,0,0,1,0,0,62600.0
6378,0,0,0,0,4,95.3,1591.0,65803.0,1.0,8360.0,...,0,0,0,0,0,0,1,0,0,61400.0
3708,0,0,0,0,4,95.3,1591.0,66591.0,2.0,7890.0,...,0,0,0,0,0,0,1,0,0,59400.0
9630,0,0,0,0,4,95.3,1591.0,66000.0,1.0,8120.0,...,0,0,0,0,0,0,1,0,0,59300.0
6369,0,1,0,1,4,93.8,1591.0,66000.0,1.0,8600.0,...,0,0,0,0,0,0,1,0,0,64700.0
258,0,1,0,0,4,93.8,1591.0,65000.0,1.0,8580.0,...,0,0,0,0,0,0,1,0,0,64900.0
2376,0,1,1,1,4,95.3,1591.0,68000.0,1.0,8600.0,...,0,0,0,0,0,0,1,0,0,64700.0
280,0,1,1,0,4,86.0,1590.0,62000.0,2.0,8680.0,...,0,0,0,0,0,0,1,0,0,63600.0
10863,0,1,0,0,4,88.0,1496.0,67000.0,1.0,8940.0,...,0,0,0,0,0,0,1,0,0,63300.0


In [114]:
# Recommendation based on TS-SS
df_recommendations_ts = get_top_recommendations(row_id, 'TS-SS', k=k)
df_recommendations_ts.head(k)

Unnamed: 0,years of warranty,better loan offer,well maintained,low fuel consumption,reg_date,power,engine_cap,mileage,no_of_owners,depreciation,...,type_of_vehicle_stationwagon,type_of_vehicle_suv,type_of_vehicle_truck,type_of_vehicle_van,fuel_type_diesel,fuel_type_electric,fuel_type_petrol,fuel_type_petrol-electric,transmission,price
10,0,1,0,0,4,95.3,1591.0,66000.0,1.0,8270.0,...,0,0,0,0,0,0,1,0,0,61400.0
4092,5,1,0,0,4,93.8,1591.0,66000.0,1.0,8380.0,...,0,0,0,0,0,0,1,0,0,62600.0
6378,0,0,0,0,4,95.3,1591.0,65803.0,1.0,8360.0,...,0,0,0,0,0,0,1,0,0,61400.0
3708,0,0,0,0,4,95.3,1591.0,66591.0,2.0,7890.0,...,0,0,0,0,0,0,1,0,0,59400.0
9630,0,0,0,0,4,95.3,1591.0,66000.0,1.0,8120.0,...,0,0,0,0,0,0,1,0,0,59300.0
6369,0,1,0,1,4,93.8,1591.0,66000.0,1.0,8600.0,...,0,0,0,0,0,0,1,0,0,64700.0
405,0,1,0,1,4,95.3,1591.0,66000.0,1.0,8150.0,...,0,0,0,0,0,0,1,0,0,61400.0
258,0,1,0,0,4,93.8,1591.0,65000.0,1.0,8580.0,...,0,0,0,0,0,0,1,0,0,64900.0
280,0,1,1,0,4,86.0,1590.0,62000.0,2.0,8680.0,...,0,0,0,0,0,0,1,0,0,63600.0
2376,0,1,1,1,4,95.3,1591.0,68000.0,1.0,8600.0,...,0,0,0,0,0,0,1,0,0,64700.0


## Evaluation

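Evaluate the results with RMSE and MAE. Both are computed between the input row and each recommendation across all feature columns and then averaged over the recommendations; lower values mean the recommendations are closer to the input row.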

In [115]:
# RMSE
def rmse_evl(row, df_recommendations):
    if 'Similarity' in df_recommendations.columns:
        _df = df_recommendations.drop(columns=['Similarity'])
    else:
        _df = df_recommendations
    df_rmse = _df.apply(lambda df: sqrt(mean_squared_error(row.to_numpy().reshape(1, -1), df.array.to_numpy().reshape(1, -1))), axis=1)
    return df_rmse.mean(axis=0)

# MAE
def mae_evl(row, df_recommendations):
    if 'Similarity' in df_recommendations.columns:
        _df = df_recommendations.drop(columns=['Similarity'])
    else:
        _df = df_recommendations
    df_mae = _df.apply(lambda df: mean_absolute_error(row.to_numpy().reshape(1, -1), df.array.to_numpy().reshape(1, -1)), axis=1)
    return df_mae.mean(axis=0)

print("==== RMSE ====")
print(f"CD: {rmse_evl(row, df_recommendations_cd)}")
print(f"ED: {rmse_evl(row, df_recommendations_ed)}")
print(f"TS-SS: {rmse_evl(row, df_recommendations_ts)}")

print("==== MAE ====")
print(f"CD: {mae_evl(row, df_recommendations_cd)}")
print(f"ED: {mae_evl(row, df_recommendations_ed)}")
print(f"TS-SS: {mae_evl(row, df_recommendations_ts)}")



==== RMSE ====
CD: 1890.5560019009984
ED: 538.2956257433977
TS-SS: 539.656130765649
==== MAE ====
CD: 748.7383870967742
ED: 186.9358064516129
TS-SS: 175.0541935483871
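
To make the two scores concrete, here is a small worked example on made-up feature vectors. It computes the error between one query row and one recommendation across the feature columns, as the evaluation functions above do per row; the vectors are assumptions for illustration only.

```python
import numpy as np

# Made-up feature vectors: a query listing and one recommendation.
query = np.array([1.0, 2.0, 3.0])
recommendation = np.array([1.0, 4.0, 7.0])

diff = recommendation - query   # [0, 2, 4]
mse = np.mean(diff ** 2)        # (0 + 4 + 16) / 3
rmse = np.sqrt(mse)
mae = np.mean(np.abs(diff))     # (0 + 2 + 4) / 3

print(f"MSE:  {mse:.3f}")   # 6.667
print(f"RMSE: {rmse:.3f}")  # 2.582
print(f"MAE:  {mae:.3f}")   # 2.000
```

Because the errors are taken on the raw feature values, a single large-scale column (such as mileage or price) can account for most of either score.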


In [120]:
k = 500
df_recommendations_cd = get_top_recommendations(row_id, 'CD', k=k)
df_recommendations_ed = get_top_recommendations(row_id, 'ED', k=k)
df_recommendations_ts = get_top_recommendations(row_id, 'TS-SS', k=k)

print("==== RMSE ====")
print(f"CD: {rmse_evl(row, df_recommendations_cd)}")
print(f"ED: {rmse_evl(row, df_recommendations_ed)}")
print(f"TS-SS: {rmse_evl(row, df_recommendations_ts)}")

print("==== MAE ====")
print(f"CD: {mae_evl(row, df_recommendations_cd)}")
print(f"ED: {mae_evl(row, df_recommendations_ed)}")
print(f"TS-SS: {mae_evl(row, df_recommendations_ts)}")

==== RMSE ====
CD: 3076.9557408069936
ED: 2067.2915888719067
TS-SS: 2085.058907282408
==== MAE ====
CD: 1214.835073532354
ED: 771.400468027288
TS-SS: 773.9125127435362
