# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example of the notebook you are expected to submit for Task 2 of the Final Project. Its main purpose is to let us try different examples and get a better sense of your approach. Unlike Task 1 (Kaggle Competition), there is no objective metric for evaluating the recommendations.

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the Used Cars dataset; there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in the form of an external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* Please consider this notebook as an example, not as a set of specific requirements. As long as there is a section where we can easily test your solution, it should be fine.

## Setting up the Notebook

## This notebook requires Python 3.8 to use the faiss library

In [86]:
import sys

sys.path.append("..")

In [87]:
import pandas as pd
import numpy as np
from src.transformers import *
from sklearn.pipeline import Pipeline
from sklearn.metrics.pairwise import cosine_similarity

## Load the Data

We load the processed training and test sets of the Used Cars dataset. The training set contains 16,728 listings.

In [88]:
train = pd.read_csv("../data/processed/train.csv", sep=",")
test = pd.read_csv("../data/processed/test.csv", sep=",")
# train.loc[:, ['listing_id', 'engine_cap_range']]
train.head()

Unnamed: 0,listing_id,title,make,model,description,manufactured,original_reg_date,reg_date,type_of_vehicle,category,...,vehicle_age,is_parf_car,parf,coe_rebate,dereg_value_computed,vehicle_age_bins,lifespan_restriction,features_count,accessories_count,brand_rank
0,1030324,bmw 3 series 320i gran turismo m-sport,bmw,320i,1 owner! 320i gt m-sports model! big brake kit...,2013.0,,2013-12-09 00:00:00,luxury sedan,"parf car, premium ad car, low mileage car",...,8.0,1,27754.1,16705.0,44459.1,0-10,1,6,7,3
1,1021510,toyota hiace 3.0m,toyota,hiace,high loan available! low mileage unit. wear an...,2014.0,,2015-01-26 00:00:00,van,premium ad car,...,7.0,0,0.0,3464.5,3464.5,0-10,-1,1,1,2
2,1026909,mercedes-benz cla-class cla180,mercedes-benz,cla180,1 owner c&c unit. full agent service with 1 mo...,2016.0,,2016-07-25 00:00:00,luxury sedan,"parf car, premium ad car",...,5.0,1,18228.7,25504.65,43733.35,0-10,1,1,4,4
3,1019371,mercedes-benz e-class e180 avantgarde,mercedes-benz,e180,"fully agent maintained, 3 years warranty 10 ye...",2019.0,,2020-11-17 00:00:00,luxury sedan,"parf car, almost new car, consignment car",...,2.0,1,42732.75,36960.083333,79692.833333,0-10,1,5,4,4
4,1031014,honda civic 1.6a vti,honda,civic,"kah motor unit! 1 owner, lowest 1.98% for full...",2019.0,,2019-09-20 00:00:00,mid-sized sedan,parf car,...,2.0,1,15075.75,21111.375,36187.125,0-10,1,7,6,2


## Choosing columns 

In [89]:
train.columns

Index(['listing_id', 'title', 'make', 'model', 'description', 'manufactured',
       'original_reg_date', 'reg_date', 'type_of_vehicle', 'category',
       'transmission', 'curb_weight', 'power', 'fuel_type', 'engine_cap',
       'no_of_owners', 'depreciation', 'coe', 'road_tax', 'dereg_value',
       'mileage', 'omv', 'arf', 'opc_scheme', 'lifespan', 'eco_category',
       'features', 'accessories', 'indicative_price', 'price', 'reg_date_year',
       'make_model', 'engine_cap_range', 'fuel_type_diesel',
       'fuel_type_petrol-electric', 'fuel_type_petrol', 'fuel_type_electric',
       'transmission_auto', 'transmission_manual', 'coe_text',
       'coe_expiry_days', 'coe_expiry_months', 'coe_expiry_date',
       'coe_start_date', 'coe_start_year', 'vehicle_age', 'is_parf_car',
       'parf', 'coe_rebate', 'dereg_value_computed', 'vehicle_age_bins',
       'lifespan_restriction', 'features_count', 'accessories_count',
       'brand_rank'],
      dtype='object')

In [90]:
# Features used for the similarity search (vehicle_age stands in for `manufactured`).
# The explicit copy ensures that later in-place assignments never touch `train`.
df_recommend = train.loc[:, ['listing_id','make', 'vehicle_age', 'type_of_vehicle', 'depreciation',
                   'dereg_value', 'mileage', 'price', 'engine_cap',  'fuel_type_diesel',
                   'fuel_type_petrol-electric', 'fuel_type_petrol', 'fuel_type_electric','transmission_auto',
                   'transmission_manual', 'brand_rank']].copy()

## Normalizing the numerical columns

In [91]:
df_to_be_normalized = df_recommend.loc[:, ['vehicle_age', 'depreciation', 'dereg_value', 'mileage', 'engine_cap', 'price']]
max_ = df_to_be_normalized.max()
min_ = df_to_be_normalized.min()
print(max_)
print(min_)
df_to_be_normalized = (df_to_be_normalized - min_) / (max_ - min_)
df_to_be_normalized.head(2)

vehicle_age          88.0
depreciation     865610.0
dereg_value      653862.0
mileage          740459.0
engine_cap        15681.0
price           2920500.0
dtype: float64
vehicle_age        0.0
depreciation    2680.0
dereg_value       97.0
mileage            1.0
engine_cap       647.0
price           2100.0
dtype: float64


Unnamed: 0,vehicle_age,depreciation,dereg_value,mileage,engine_cap,price
0,0.090909,0.017406,0.072529,0.098586,0.089796,0.023712
1,0.079545,0.010372,0.005432,0.148707,0.155315,0.014289
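The min-max scaling above maps every column to the range [0, 1] via `(x - min) / (max - min)`. The following is a minimal sketch of the same idea on made-up toy values; the helper name `min_max_scale` and the numbers are illustrative assumptions, not part of the dataset. It also guards against constant columns, where `max - min` would be zero.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Scale every column of df to [0, 1] (hypothetical helper)."""
    min_ = df.min()
    range_ = df.max() - min_
    # A constant column has range 0; use 1 instead so the column
    # becomes all zeros rather than NaN.
    range_ = range_.replace(0, 1)
    return (df - min_) / range_

# Toy values, not taken from the Used Cars dataset.
toy = pd.DataFrame({
    "vehicle_age": [2.0, 5.0, 8.0],
    "mileage": [10000.0, 55000.0, 100000.0],
    "engine_cap": [1600.0, 1600.0, 1600.0],  # constant column
})
scaled = min_max_scale(toy)
print(scaled)
```

Because the scale is set by the column extremes, a few very large values (such as the maxima printed above) push most rows close to 0.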


In [92]:
df_recommend.loc[:, ['vehicle_age', 'depreciation', 'dereg_value', 'mileage', 'engine_cap', 'price']] = df_to_be_normalized
df_recommend.head()

Unnamed: 0,listing_id,make,vehicle_age,type_of_vehicle,depreciation,dereg_value,mileage,price,engine_cap,fuel_type_diesel,fuel_type_petrol-electric,fuel_type_petrol,fuel_type_electric,transmission_auto,transmission_manual,brand_rank
0,1030324,bmw,0.090909,luxury sedan,0.017406,0.072529,0.098586,0.023712,0.089796,1,0,0,0,1,0,3
1,1021510,toyota,0.079545,van,0.010372,0.005432,0.148707,0.014289,0.155315,1,0,0,0,0,1,2
2,1026909,mercedes-benz,0.056818,luxury sedan,0.014358,0.067945,0.10804,0.032004,0.063057,0,1,0,0,1,0,4
3,1019371,mercedes-benz,0.022727,luxury sedan,0.015899,0.12268,0.013234,0.067092,0.056539,0,1,0,0,1,0,4
4,1031014,honda,0.022727,mid-sized sedan,0.009004,0.05561,0.054019,0.034642,0.06319,0,1,0,0,1,0,2


## Convert Categorical Columns to One Hot Encoding

In [93]:
df_transformed = pd.get_dummies(df_recommend, columns = ['make', 'type_of_vehicle', 'brand_rank'])
df_transformed.head()

Unnamed: 0,listing_id,vehicle_age,depreciation,dereg_value,mileage,price,engine_cap,fuel_type_diesel,fuel_type_petrol-electric,fuel_type_petrol,...,type_of_vehicle_stationwagon,type_of_vehicle_suv,type_of_vehicle_truck,type_of_vehicle_van,brand_rank_1,brand_rank_2,brand_rank_3,brand_rank_4,brand_rank_5,brand_rank_6
0,1030324,0.090909,0.017406,0.072529,0.098586,0.023712,0.089796,1,0,0,...,0,0,0,0,0,0,1,0,0,0
1,1021510,0.079545,0.010372,0.005432,0.148707,0.014289,0.155315,1,0,0,...,0,0,0,1,0,1,0,0,0,0
2,1026909,0.056818,0.014358,0.067945,0.10804,0.032004,0.063057,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,1019371,0.022727,0.015899,0.12268,0.013234,0.067092,0.056539,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,1031014,0.022727,0.009004,0.05561,0.054019,0.034642,0.06319,0,1,0,...,0,0,0,0,0,1,0,0,0,0
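As a small illustration of what `pd.get_dummies` does in the cell above, here is a sketch on a toy dataframe (the values are made up). One assumption to note: `dtype=int` is passed explicitly so that the indicator columns are 0/1 as in the output above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data, not taken from the Used Cars dataset.
toy = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "make": ["bmw", "toyota", "bmw"],
    "brand_rank": [3, 2, 3],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["make", "brand_rank"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Columns that are not listed (here `listing_id`) are passed through unchanged and come first in the result.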


In [94]:
df_transformed['listing_id'].iloc[:5]

0    1030324
1    1021510
2    1026909
3    1019371
4    1031014
Name: listing_id, dtype: int64

## Recommenders (Content-based item-item)

In [105]:
# Reference: https://ai.plainenglish.io/speeding-up-similarity-search-in-recommender-systems-using-faiss-basics-part-i-ec1b5e92c92d
import faiss
import time

start = time.time()
x = df_transformed['listing_id'].to_numpy()  # listing ids, in the same row order as y
y = df_transformed.iloc[:, 1:].to_numpy()    # feature matrix (all columns except listing_id)

dimension = y.shape[1]
# faiss expects C-contiguous float32 arrays
y = np.ascontiguousarray(y, dtype=np.float32)
print(y.shape, y.dtype)

index = faiss.IndexFlatIP(dimension)  # exact inner-product index
faiss.normalize_L2(y)                 # in-place L2 normalization, so inner product == cosine similarity
print(index.is_trained)
index.add(y)                          # add vectors to the index
print(index.ntotal)

faiss.write_index(index, 'yo')        # persist the index to disk

k = 11  # the nearest neighbour of each item is the item itself, so 11 yields 10 other listings

D, I = index.search(y, k)  # D: similarities, I: row indices of the top-k neighbours
print('I: \n', I)
print()
print('D: \n', D)
print(I.shape)
print('Time taken: ', time.time() - start)

(16728, 106) float32
True
16728
I: 
 [[    0 10039  8927 ... 11763  4077   849]
 [    1  5530  6524 ...  7832   575  8973]
 [    2 14505 13892 ...  6164  7841  7571]
 ...
 [16725  8866  7867 ... 15742  9638  2077]
 [16726  6498 15564 ...  4188  7169  6766]
 [16727  6539  8115 ...  3208  3700  6446]]

D: 
 [[1.0000001  0.99996465 0.99996257 ... 0.99992824 0.99992687 0.9999193 ]
 [0.99999994 0.99996966 0.9999583  ... 0.99991995 0.99991685 0.9999156 ]
 [1.0000001  0.9999967  0.9999955  ... 0.9999861  0.99998564 0.9999832 ]
 ...
 [0.99999994 0.9999985  0.9999969  ... 0.99995655 0.9999549  0.99995327]
 [1.0000001  0.9999998  0.9999997  ... 0.99999917 0.99999905 0.9999982 ]
 [1.0000001  0.99938613 0.9993421  ... 0.9968879  0.99666464 0.99643654]]
(16728, 11)
Time taken:  0.28324198722839355
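`IndexFlatIP` ranks by inner product, which coincides with cosine similarity once the vectors are L2-normalized. The following NumPy-only sketch performs the same kind of exact search on random toy vectors and can serve as a cross-check on small inputs. The function name `top_k_cosine` is hypothetical, and this is not how faiss works internally; it builds the full n x n similarity matrix, so it is only suitable for small n.

```python
import numpy as np

def top_k_cosine(vectors: np.ndarray, k: int):
    """Exact top-k cosine neighbours for every row (hypothetical helper)."""
    # L2-normalize the rows so that the inner product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Sort each row by descending similarity and keep the first k columns.
    idx = np.argsort(-sims, axis=1)[:, :k]
    return np.take_along_axis(sims, idx, axis=1), idx

# Random toy vectors, not the notebook's feature matrix.
rng = np.random.default_rng(0)
toy = rng.random((6, 4))
D_toy, I_toy = top_k_cosine(toy, k=3)
print(I_toy)
print(D_toy)
```

As in the faiss output above, the first column of the index array is each row's own position, with a similarity of 1 up to floating-point error.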


In [106]:
sim_0_ind = I[0]
sim_0_listing_id = x[sim_0_ind]
sim_0_listing_id


array([1030324,  990396, 1016908, 1029711, 1010802, 1011255, 1001225,
       1021268,  979741, 1011352, 1003523], dtype=int64)

In [107]:
df_recommend[df_recommend['listing_id'].isin(sim_0_listing_id)]

Unnamed: 0,listing_id,make,vehicle_age,type_of_vehicle,depreciation,dereg_value,mileage,price,engine_cap,fuel_type_diesel,fuel_type_petrol-electric,fuel_type_petrol,fuel_type_electric,transmission_auto,transmission_manual,brand_rank
0,1030324,bmw,0.090909,luxury sedan,0.017406,0.072529,0.098586,0.023712,0.089796,1,0,0,0,1,0,3
849,1003523,bmw,0.090909,luxury sedan,0.014416,0.057065,0.121934,0.019326,0.089796,1,0,0,0,1,0,3
4011,1021268,bmw,0.090909,luxury sedan,0.014474,0.067731,0.072486,0.020593,0.089796,1,0,0,0,1,0,3
4077,1011352,bmw,0.079545,luxury sedan,0.019863,0.094303,0.098586,0.035019,0.089796,1,0,0,0,1,0,3
8927,1016908,bmw,0.090909,luxury sedan,0.020071,0.068893,0.079679,0.02354,0.089796,1,0,0,0,1,0,3
9716,1011255,bmw,0.090909,luxury sedan,0.023003,0.086606,0.112586,0.030154,0.089796,1,0,0,0,1,0,3
10039,990396,bmw,0.079545,luxury sedan,0.017591,0.08455,0.097141,0.032655,0.089796,1,0,0,0,1,0,3
10217,1010802,bmw,0.079545,luxury sedan,0.018078,0.081315,0.111434,0.031524,0.089796,1,0,0,0,1,0,3
11763,979741,bmw,0.079545,luxury sedan,0.020442,0.094311,0.101287,0.033888,0.089796,1,0,0,0,1,0,3
12953,1029711,bmw,0.079545,luxury sedan,0.017823,0.076425,0.114792,0.02868,0.089796,1,0,0,0,1,0,3


In [108]:
print(I[0,:11])
print(D[0,:11])

[    0 10039  8927 12953 10217  9716 14497  4011 11763  4077   849]
[1.0000001  0.99996465 0.99996257 0.9999572  0.9999571  0.9999538
 0.9999286  0.9999284  0.99992824 0.99992687 0.9999193 ]


In [115]:
for i in range(11):
    sim = cosine_similarity(y[0].reshape(1,-1), y[I[0,i]].reshape(1,-1))
    print(sim)

[[1.0000001]]
[[0.9999646]]
[[0.99996257]]
[[0.9999572]]
[[0.9999571]]
[[0.9999538]]
[[0.99992853]]
[[0.9999284]]
[[0.99992824]]
[[0.99992687]]
[[0.9999194]]


In [102]:
# Brute-force pairwise cosine similarity, kept for reference only
# (replaced by the faiss search above).
# import time

# x = df_transformed['listing_id'].to_numpy()
# print(x.shape)
# y = df_transformed.iloc[:, 1:].to_numpy()
# print(y.shape)
# similarity_matrix = np.zeros((x.shape[0], x.shape[0]))
# similarity_matrix[:] = np.nan
# # print(similarity_matrix)
# # print('y[0]: ',y[0])
# start = time.time()

# for i in range(similarity_matrix.shape[0]):
#     for j in range(similarity_matrix.shape[1]):
#         if i==j:
#             similarity_matrix[i][j] = 1
#         elif not np.isnan(similarity_matrix[j][i]):
#             similarity_matrix[i][j] = similarity_matrix[j][i]
#         else:
#             similarity_matrix[i][j] = cosine_similarity(y[i].reshape(1,-1), y[j].reshape(1,-1))

# print('\nsimilarity_matrix:\n', similarity_matrix)
# print('\n\n Time taken: ', time.time() - start, ' s')


## Top Recommendations

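The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (data sample = row in the dataframe of the dataset). In the skeleton, the input is a row from the dataset and a list of optional input parameters which will depend on your approach; a parameter `k` for the number of returned recommendations seems useful, though. The implementation below instead identifies the sample by its `listing_id` and takes the precomputed faiss results `I` and `D` together with the array of listing ids `x`.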

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from an external Python (.py) script as well.

In [123]:
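def get_top_recommendations(listing_id, I, D, x):
    """Return the rows of `train` for the listings most similar to `listing_id`.

    I and D are the index and similarity arrays returned by index.search();
    x holds the listing ids in the same row order. D is currently unused.
    """
    ind = np.where(x == listing_id)[0][0]  # row position of the query listing
    print(ind)
    top_ind = I[ind, :11]  # the query itself plus its 10 nearest neighbours
    print(top_ind)
    sim_listing_id = x[top_ind]
    print()
    print(sim_listing_id)
    # Note: isin() returns the rows in dataframe order, not in similarity order
    df_result = train[train['listing_id'].isin(sim_listing_id)]
    return df_result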
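The description above mentions a parameter `k` and an optional score column, neither of which the function above provides. The following is a hedged sketch of a possible variant, not the submitted implementation: the name `get_top_recommendations_ranked`, the explicit `df` argument, and the toy arrays are all assumptions made for illustration. It assumes that the rows of `df` are in the same positional order as `x` and that `I` and `D` have at least `k + 1` columns.

```python
import numpy as np
import pandas as pd

def get_top_recommendations_ranked(listing_id, I, D, x, df, k=10):
    """Return the k nearest listings in rank order with their scores.

    Hypothetical variant: `df` is passed explicitly and its rows must be
    in the same positional order as the listing ids in `x`.
    """
    ind = np.where(x == listing_id)[0][0]
    # Drop the query itself from its own neighbour list, then keep k entries.
    mask = I[ind] != ind
    neighbours = I[ind][mask][:k]
    scores = D[ind][mask][:k]
    # iloc keeps the ranking; a boolean isin() mask would return index order.
    df_result = df.iloc[neighbours].copy()
    df_result["similarity"] = scores
    return df_result

# Toy stand-ins for x, I, D and the dataframe (made-up values).
x_toy = np.array([101, 102, 103, 104])
df_toy = pd.DataFrame({"listing_id": x_toy, "make": ["bmw", "bmw", "audi", "kia"]})
I_toy = np.array([[0, 1, 2, 3], [1, 0, 2, 3], [2, 0, 1, 3], [3, 2, 1, 0]])
D_toy = np.array([[1.0, 0.9, 0.5, 0.1], [1.0, 0.9, 0.4, 0.2],
                  [1.0, 0.5, 0.4, 0.3], [1.0, 0.3, 0.2, 0.1]])

result = get_top_recommendations_ranked(102, I_toy, D_toy, x_toy, df_toy, k=2)
print(result)
```

Excluding the query by position rather than by slicing off the first column is a deliberate choice here: with exact duplicates in the data, the query is not guaranteed to be ranked first.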
    

## Testing 

In [124]:
listing_id = 1026880 #1030324
get_top_recommendations(listing_id, I, D, x).loc[:, ['listing_id','make', 'vehicle_age', 'type_of_vehicle', 'depreciation',
                   'dereg_value', 'mileage', 'price', 'engine_cap',  'fuel_type','transmission', 'brand_rank']]

8144
[ 8144 14937 13361  2074 13832 12647   578 16313  7978  7453  9474]

[1026880  997601 1024018  994240 1030080 1014620 1002323  983273 1024824
 1021122 1002212]


Unnamed: 0,listing_id,make,vehicle_age,type_of_vehicle,depreciation,dereg_value,mileage,price,engine_cap,fuel_type,transmission,brand_rank
578,1002323,bmw,9.0,sports car,18870.0,39301.357272,82000.0,207700.0,4395.0,diesel,auto,3
2074,994240,bmw,9.0,sports car,17070.0,4304.426716,124000.0,187900.0,4395.0,diesel,auto,3
7453,1021122,bmw,9.0,sports car,18490.0,39301.357272,110000.0,203500.0,4395.0,diesel,auto,3
7978,1024824,bmw,9.0,sports car,19190.0,39301.357272,105000.0,211200.0,4395.0,diesel,auto,3
8144,1026880,bmw,10.0,sports car,13870.0,3248.511277,90000.0,152700.0,4395.0,diesel,auto,3
9474,1002212,bmw,9.0,sports car,21370.0,39301.357272,76000.0,235200.0,4395.0,diesel,auto,3
12647,1014620,bmw,10.0,sports car,14870.0,39301.357272,108000.0,163700.0,4395.0,diesel,auto,3
13361,1024018,bmw,9.0,sports car,20470.0,5283.847073,86000.0,225300.0,4395.0,diesel,auto,3
13832,1030080,bmw,9.0,sports car,19660.0,31623.0,112000.0,195800.0,4395.0,diesel,auto,3
14937,997601,bmw,10.0,sports car,14570.0,3248.511277,80000.0,160400.0,4395.0,diesel,auto,3


In [None]:
###################Rishabh's code and testing ends here######################

## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. At a minimum, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. However you set up your notebook, it should have at least a comparable section that lets us run your solution for different inputs.

### Pick a Sample Listing as Input

In [8]:
# Pick a row id of choice
row_id = 10
#row_id = 20
#row_id = 30
#row_id = 40
#row_id = 50

# Get the row from the dataframe (invalid row ids will raise an error)
row = df_sample.iloc[row_id]

# Just for printing it nicely, we create a new dataframe from this single row
pd.DataFrame([row])

Unnamed: 0,listing_id,make,power,engine_cap,mileage,price
10,1020216,honda,73.0,1317.0,22703.0,78000


## Compute and Display the recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

In [9]:
k = 3

df_recommendations = get_top_recommendations(row, k=k)

df_recommendations.head(k)

Loss: 3324427.80743 	 0%
Loss: 476668.07471 	 10%
Loss: 449168.88083 	 20%
Loss: 268857.02562 	 30%
Loss: 59132.12156 	 40%
Loss: 18365.99409 	 50%
Loss: 4516.21252 	 60%
Loss: 1790.65040 	 70%
Loss: 1372.55576 	 80%
Loss: 1290.99146 	 90%
Loss: 1268.25165 	 100%


Unnamed: 0,listing_id,make,power,engine_cap,mileage,price,similarity
57,1009803,toyota,80.0,1496.0,19000.0,88900,0.998459
30,1019460,toyota,125.0,2362.0,28126.0,137300,0.998123
59,1010728,honda,73.0,1318.0,29000.0,76400,0.998069
