# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example of the notebook you are expected to submit for Task 2 of the Final Project. Its main purpose is to let us try different examples to get a better sense of your approach because, unlike Task 1 (Kaggle Competition), there is no objective means of evaluating the recommendations.

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the Used Cars dataset; there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in the form of an external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* Please treat this notebook as an example, not as a set of specific requirements. As long as there is a section where we can easily test your solution, it should be fine.

## Setting up the Notebook

In [1]:
import pandas as pd
import numpy as np
import heapq
import utils

## Load the Data

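For this example, we load the training data (`train.csv`) twice: `df_sample` is preprocessed below and used for computing the recommendations, while `df_original` stays untouched so that the recommended listings can later be displayed with all their original columns.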

In [83]:
df_sample = pd.read_csv('../data/train.csv')
df_original = pd.read_csv('../data/train.csv')

df_sample.head()

Unnamed: 0,listing_id,title,make,model,description,manufactured,original_reg_date,reg_date,type_of_vehicle,category,...,mileage,omv,arf,opc_scheme,lifespan,eco_category,features,accessories,indicative_price,price
0,1030324,BMW 3 Series 320i Gran Turismo M-Sport,bmw,320i,1 owner! 320i gt m-sports model! big brake kit...,2013.0,,09-dec-2013,luxury sedan,"parf car, premium ad car, low mileage car",...,73000.0,45330.0,50462.0,,,uncategorized,"5 doors gt, powerful and fuel efficient 2.0l t...","bmw i-drive, navigation, bluetooth/aux/usb inp...",,71300.0
1,1021510,Toyota Hiace 3.0M,,hiace,high loan available! low mileage unit. wear an...,2014.0,,26-jan-2015,van,premium ad car,...,110112.0,27502.0,1376.0,,25-jan-2035,uncategorized,low mileage unit. well maintained vehicle. vie...,factory radio setting. front recording camera....,,43800.0
2,1026909,Mercedes-Benz CLA-Class CLA180,mercedes-benz,cla180,1 owner c&c unit. full agent service with 1 mo...,2016.0,,25-jul-2016,luxury sedan,"parf car, premium ad car",...,80000.0,27886.0,26041.0,,,uncategorized,responsive and fuel efficient 1.6l inline 4 cy...,dual electric/memory seats. factory fitted aud...,,95500.0
3,1019371,Mercedes-Benz E-Class E180 Avantgarde,mercedes-benz,e180,"fully agent maintained, 3 years warranty 10 ye...",2019.0,,17-nov-2020,luxury sedan,"parf car, almost new car, consignment car",...,9800.0,46412.0,56977.0,,,uncategorized,"1.5l inline-4 twin scroll turbocharged engine,...",64 colour ambient lighting. active parking ass...,,197900.0
4,1031014,Honda Civic 1.6A VTi,,civic,"kah motor unit! 1 owner, lowest 1.98% for full...",2019.0,,20-sep-2019,mid-sized sedan,parf car,...,40000.0,20072.0,20101.0,,,uncategorized,"1.6l i-vtec engine, 123 bhp, earth dreams cvt ...","s/rims, premium leather seats, factory touch s...",,103200.0


## Data Preprocessing

In [84]:
def data_preprocess(df:pd.DataFrame) -> pd.DataFrame:
    df['make'] = df.apply(lambda row: row['title'].split()[0].lower() if pd.isna(row['make']) else row['make'],axis=1)
    df['make'] = utils.ordinal_encoder(df['make'])

    df['type_of_vehicle'] = utils.ordinal_encoder(df['type_of_vehicle'])
    df['transmission'] = utils.ordinal_encoder(df['transmission'])

    utils.fill_with_mean(df['power'])
    df['power'] = (df['power'] - np.min(df['power'])) / (np.max(df['power']) - np.min(df['power']))
    
    utils.fill_with_mean(df['engine_cap'])
    #df["engine_cap"] = utils.data_discretization(df["engine_cap"], num=10)
    df['engine_cap'] = (df['engine_cap'] - np.min(df['engine_cap'])) / (np.max(df['engine_cap']) - np.min(df['engine_cap']))
    

    df["depreciation"] = utils.del_outlier(df["depreciation"], lower_val=0.0, upper_val=0.99)
    utils.fill_with_mean(df['depreciation'])
    #df['depreciation'] = (df['depreciation'] - np.min(df['depreciation'])) / (np.max(df['depreciation']) - np.min(df['depreciation']))

    utils.fill_with_mean(df["road_tax"])
    #df["road_tax"] = utils.data_discretization(df["road_tax"], num=15)
    df['road_tax'] = (df['road_tax'] - np.min(df['road_tax'])) / (np.max(df['road_tax']) - np.min(df['road_tax']))

    utils.fill_with_mean(df["mileage"])
    #df["mileage"] = utils.data_discretization(df["mileage"], num=30)
    
    #df["depreciation"] = utils.data_discretization(df["depreciation"], num=15)
    #df["power"] = utils.data_discretization(df["power"], num=15)
    #df["price"] = utils.data_discretization(df["price"], num=200)

        
    # Drop all columns that are not used by the recommender
    df.drop(columns=[
        'title', 'model', 'description', 'manufactured', 'original_reg_date',
        'reg_date', 'fuel_type', 'opc_scheme', 'lifespan', 'eco_category',
        'features', 'accessories', 'indicative_price', 'curb_weight',
        'no_of_owners', 'coe', 'omv', 'category', 'arf', 'dereg_value',
    ], inplace=True)

    return df
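
The helpers in `utils` are not shown in this notebook. As a rough sketch of what the mean-fill and min-max steps in `data_preprocess()` amount to, assuming `utils.fill_with_mean` replaces missing values with the column mean, the snippet below applies both to a toy column. The two function names are hypothetical stand-ins, not the actual `utils` code.

```python
import numpy as np
import pandas as pd


def fill_with_mean_sketch(col: pd.Series) -> pd.Series:
    """Hypothetical stand-in: replace NaN with the column mean."""
    return col.fillna(col.mean())


def min_max_scale(col: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1], as done above for power, engine_cap and road_tax."""
    span = col.max() - col.min()
    if span == 0:
        # A constant column cannot be rescaled; map it to 0 instead of dividing by zero
        return pd.Series(0.0, index=col.index)
    return (col - col.min()) / span


# Toy column with one missing value
toy = pd.DataFrame({"power": [100.0, np.nan, 200.0, 300.0]})
toy["power"] = min_max_scale(fill_with_mean_sketch(toy["power"]))
print(toy)
```

Filling before scaling matters here: the imputed value ends up at the scaled mean rather than being left as `NaN` in the feature matrix.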

In [85]:
df_process = data_preprocess(df_sample)

In [86]:
df_process

Unnamed: 0,listing_id,make,type_of_vehicle,transmission,power,engine_cap,depreciation,road_tax,mileage,price
0,1030324,1,1,1,0.184751,0.127352,17700.0,0.094118,73000.0,71300.0
1,1021510,2,2,2,0.182796,0.190166,11630.0,0.106207,110112.0,43800.0
2,1026909,3,1,1,0.096774,0.101715,15070.0,0.055984,80000.0,95500.0
3,1019371,3,1,1,0.145650,0.095466,16400.0,0.051440,9800.0,197900.0
4,1031014,4,3,1,0.100684,0.101843,10450.0,0.056146,40000.0,103200.0
...,...,...,...,...,...,...,...,...,...,...
16779,1030181,5,6,1,0.286413,0.125566,21720.0,0.091359,64000.0,144400.0
16780,1027041,7,5,1,0.123363,0.100759,10770.0,0.055335,100808.0,70200.0
16781,1021099,14,4,1,0.092864,0.101907,7190.0,0.062150,72539.0,71300.0
16782,1019473,4,5,1,0.063539,0.083987,7940.0,0.042840,13000.0,81200.0


## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (data sample = row in the dataframe of the dataset). The input is a row from the dataset plus optional keyword parameters that depend on your approach. Here these are `k` (the number of recommendations to return), `source_data` (the dataframe to search) and `mean` (the similarity measure: `'dis'` for a weighted Euclidean distance, anything else for cosine similarity).

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from an external Python (.py) script as well.
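
To make the expected input and output concrete, here is a minimal sketch of the contract described above: a row goes in, and a `pd.DataFrame` with the same columns plus one extra `score` column comes out. The function `recommend_by_price` and its ranking rule (absolute price difference) are hypothetical placeholders chosen only to keep the example short; they are not the approach used in this notebook.

```python
import pandas as pd


def recommend_by_price(row: pd.Series, source_data: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Hypothetical placeholder: rank listings by absolute price difference."""
    # Never recommend the query listing itself
    candidates = source_data.drop(index=row.name)
    score = (candidates["price"] - row["price"]).abs()
    top = score.nsmallest(k)
    result = candidates.loc[top.index].copy()
    # Extra column on top of the input columns (smaller = more similar)
    result["score"] = top
    return result


toy = pd.DataFrame({"listing_id": [1, 2, 3, 4], "price": [100.0, 110.0, 300.0, 95.0]})
recs = recommend_by_price(toy.iloc[0], source_data=toy, k=2)
print(recs)
```

Returning the score alongside the listings makes it easier to judge how close each recommendation is when trying different inputs.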

In [89]:
def get_top_recommendations(row, **kwargs) -> pd.DataFrame:
    
    #####################################################
    ## Initialize the required parameters
    
    # k: number of recommendations to return
    # source_data: dataframe to search for similar listings
    # mean: similarity measure, 'dis' = weighted Euclidean distance,
    #       anything else = cosine similarity (default: 'cos')
    k = kwargs.get('k')
    source_data = kwargs.get('source_data')
    mean = kwargs.get('mean', 'cos')
            
       
    #####################################################
    ## Compute your recommendations
    #
    # This is where your magic happens. Of course, you can call methods
    # defined in this notebook or in external Python (.py) scripts

    # Select the feature columns used for the similarity computation
    column = ["price","depreciation","road_tax","make","power","engine_cap","mileage","type_of_vehicle"]
    selected_source_data = source_data[column].copy()  # copy so source_data is not modified
    selected_row = row[column]

    # convert to 2-d numpy.array
    np_source_data = np.array(selected_source_data.values, dtype=np.float32)
    np_row = np.array(selected_row.values, dtype=np.float32).reshape(1,-1)
    
    ## Different similarity metric functions
    top_k_index = None
    row_index = row.name # exclude row to avoid recommending the item itself
    
    if mean == 'dis':
        # weighted
        weights = np.array([0.002,0.002,1,2,1,1,0.001,1], dtype=np.float32)
        # calculate weighted distance and reshape to 1-d
        num = (np_source_data - np_row)*weights
        result = np.linalg.norm(num,axis=1,keepdims=True).reshape(-1)
        # exclude the row itself
        result[row_index] = np.inf
        # indices of top k most similar items
        top_k_index = heapq.nsmallest(k, range(len(result)), result.take)
    else:
        # calculate cosine similarity and reshape to 1-d
        num = np.dot(np_row,np_source_data.T).reshape(-1)
        denom = np.linalg.norm(np_row,axis=1,keepdims=True).reshape(-1)*np.linalg.norm(np_source_data,axis=1,keepdims=True).reshape(-1)
        result = np.round(num/denom, 2)
        # exclude the row itself
        result[row_index] = -1
        # indices of top k most similar items
        top_k_index = heapq.nlargest(k, range(len(result)), result.take)
    
    print(top_k_index)
    
    
    ##################################################### 
    ## Return
    # Look up the k selected listings in the source data
    # (top_k_index holds positions, so this assumes a default RangeIndex)
    df_result = pd.DataFrame(source_data.loc[top_k_index], index=None)
        
    # Return the dataset with the k recommendations
    return df_result
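
The weights in the `'dis'` branch exist because the features live on very different scales: `price` is in the tens of thousands, while `power` is scaled to [0, 1]. The sketch below uses a made-up two-column feature matrix (the values and the helper `top_k_weighted` are illustrative assumptions, not taken from the dataset) to show how a small weight on price changes the nearest neighbour. It also shows that, on this toy data, unweighted cosine similarity rounded to two decimals cannot separate the items.

```python
import numpy as np

# Toy feature matrix with columns (price, power); power is already scaled to [0, 1]
items = np.array([
    [60000.0, 0.10],  # query listing
    [60100.0, 0.90],  # closest in price, very different power
    [60200.0, 0.10],  # slightly further in price, identical power
])
query_pos = 0
query = items[query_pos]


def top_k_weighted(weights: np.ndarray, k: int = 1) -> list:
    """Positions of the k nearest items under a weighted Euclidean distance."""
    dist = np.linalg.norm((items - query) * weights, axis=1)
    dist[query_pos] = np.inf  # exclude the query itself
    return np.argsort(dist)[:k].tolist()


print(top_k_weighted(np.array([1.0, 1.0])))    # price dominates the distance
print(top_k_weighted(np.array([0.002, 1.0])))  # price scaled down, power now counts

# Cosine similarity on the raw features, rounded to 2 decimals
cos = items @ query / (np.linalg.norm(items, axis=1) * np.linalg.norm(query))
cos_rounded = np.round(cos, 2)
print(cos_rounded)
```

With equal weights the nearest item is the one closest in price; with the price weight at 0.002 the item with identical power wins instead. The rounded cosine scores are all tied, so the order returned by a top-k selection would then depend only on tie-breaking.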

## Testing the Recommendation Engine

This will be the main part of your notebook, as it allows us to test your solution. At a minimum, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that allows us to run your solution for different inputs.

### Pick a Sample Listing as Input

In [106]:
# Pick a row id of choice
row_id = 10
#row_id = 20
#row_id = 30
#row_id = 40
#row_id = 50

# Get the row from the dataframe (invalid row ids will throw an error)
row = df_process.iloc[row_id]

# Just for printing it nicely, we create a new dataframe from this single row
pd.DataFrame([row])

Unnamed: 0,listing_id,make,type_of_vehicle,transmission,power,engine_cap,depreciation,road_tax,mileage,price
10,1004029.0,6.0,3.0,1.0,0.107136,0.10146,8270.0,0.055822,72539.0,61400.0


### Compute and Display the Recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

In [107]:
k = 10

df_recommendations = get_top_recommendations(row, k=k, source_data=df_sample, mean='dis')

df_recommendations.head(k)

[405, 3800, 3100, 1019, 8656, 2059, 4098, 5321, 12890, 16131]


Unnamed: 0,listing_id,make,type_of_vehicle,transmission,power,engine_cap,depreciation,road_tax,mileage,price
405,1019476,6,3,1,0.107136,0.10146,8150.0,0.055822,72539.0,61400.0
3800,1011395,6,3,1,0.107136,0.10146,8380.0,0.055822,73000.0,62400.0
3100,1019958,6,3,1,0.102053,0.10146,7130.0,0.06783,72539.0,61600.0
1019,1011666,7,3,1,0.104203,0.10146,9140.0,0.055822,72539.0,61400.0
8656,1003349,7,3,1,0.104203,0.10146,9170.0,0.055822,72539.0,61400.0
2059,1011373,6,3,1,0.107136,0.10146,8770.0,0.055822,71466.0,62700.0
4098,1020292,7,3,1,0.104203,0.10146,8380.0,0.055822,72539.0,62600.0
5321,984583,6,3,1,0.107136,0.10146,9520.0,0.055822,72539.0,62500.0
12890,1014749,5,1,1,0.331378,0.188317,7220.0,0.222799,72539.0,61100.0
16131,1014160,8,2,2,0.182796,0.158663,8160.0,0.106207,72539.0,61400.0


In [108]:
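recommendation = df_recommendations.head(k)
# Look up the unprocessed listings by listing_id, keeping the recommendation order
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_recommendation = pd.concat(
    [df_original[df_original['listing_id'] == i] for i in recommendation['listing_id']]
)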

In [109]:
df_recommendation

Unnamed: 0,listing_id,title,make,model,description,manufactured,original_reg_date,reg_date,type_of_vehicle,category,...,mileage,omv,arf,opc_scheme,lifespan,eco_category,features,accessories,indicative_price,price
405,1019476,Kia Cerato K3 1.6A,,cerato,"viewing by appt only, whatsapp for doorstep vi...",2017.0,,18-sep-2017,mid-sized sedan,parf car,...,,12763.0,12763.0,,,uncategorized,powerful and fuel efficient 1.6l dohc dual cvv...,sports rims. audio system aux and usb port. fo...,,61400.0
3800,1011395,Kia Cerato K3 1.6A EX,kia,cerato,"1 owner unit, ex model, sporty sedan, maintain...",2017.0,,22-aug-2017,mid-sized sedan,parf car,...,73000.0,12918.0,12918.0,,,uncategorized,"1.6l, 4 cylinders inline 16 valve dohc dual cv...","leather seats, sports rims, factory audio syst...",,62400.0
3100,1019958,Kia Cerato Forte 1.6A EX (COE till 06/2029),,cerato,0,2010.0,,12-may-2010,mid-sized sedan,"coe car, consignment car",...,,11398.0,11398.0,,,uncategorized,"1.6 in line 4 cylinder dohc engine, 124 bhp, a...","17 inches sport rims with new tyre, eagle eye ...",,61600.0
1019,1011666,Hyundai Elantra 1.6A GLS Elite,hyundai,elantra,accredited company! low mileage done! accident...,2016.0,,19-oct-2016,mid-sized sedan,parf car,...,,17487.0,17487.0,,,uncategorized,"1.6l dohc 16v dual vvt engine, smooth 6 speed ...","ventilated memory electric seats,leather seats...",,61400.0
8656,1003349,Hyundai Elantra 1.6A GLS Elite,hyundai,elantra,viewing by appointment please and passing thro...,2016.0,,28-dec-2016,mid-sized sedan,parf car,...,,13572.0,13572.0,,,uncategorized,reliable and fuel efficient 1.6l 4 cylinders i...,"factory fitted audio system, reverse sensors, ...",,61400.0
2059,1011373,Kia Cerato K3 1.6A EX,kia,cerato,1 owner only! 5 years warranty c&c unlimited m...,2017.0,,29-may-2017,mid-sized sedan,parf car,...,71466.0,12945.0,12945.0,,,uncategorized,"1.6l 4 cylinders, dual cvvt engine, responsive...","touchscreen infotainment system, leather seats...",,62700.0
4098,1020292,Hyundai Elantra 1.6A GLS S,,elantra,"hyundai specialist! 1 owner only. high spec ""s...",2017.0,,07-sep-2017,mid-sized sedan,parf car,...,,12674.0,12674.0,,,uncategorized,"1.6l cvvt engine, 125 bhp, 6 speed cvt automat...","leather seats, sports rims, factory fitted aud...",,62600.0
5321,984583,Kia Cerato K3 1.6A SX,kia,cerato,your safety is our priority. as part of govern...,2016.0,,21-sep-2016,mid-sized sedan,"parf car, premium ad car",...,,17000.0,17000.0,,,uncategorized,"powered by 1.6l 4 cylinder dohc engine, 6 spee...","kia infotainment with apple/android cp, front ...",,62500.0
12890,1014749,Volvo S80 T6 (COE till 04/2029),volvo,s80,brand new road tax! ultra rare top of the line...,2009.0,,12-oct-2009,luxury sedan,coe car,...,,55340.0,55340.0,,,uncategorized,3.0l straight 6 cylinder turbo charged. 281bhp...,r rims. original factory condition. electric s...,,61100.0
16131,1014160,Nissan NV350 2.5M,nissan,nv350,"6/18 nissan nv350 panel van 2.5 manual 5 drs, ...",2017.0,,27-jun-2018,van,premium ad car,...,,25062.0,1254.0,,26-jun-2038,uncategorized,view specs of the nissan nv350,,,61400.0
