# Car recommender
This simple "research" was done as part of a hiring task.
The task was to design a simple car recommender system. As I was applying for a software engineering position, my focus was primarily on implementing a robust back-end solution, but it was still necessary to come up with at least a simple data science solution.

After some quick research, I essentially ruled out any complex ML solutions (I have neither experience nor intuition in this domain, so it would be too time-consuming) and narrowed it down to two possible solutions (ideally a hybrid of both):

1. Content-based filtering
2. Collaborative filtering (NOT IMPLEMENTED)

Both solutions seem to boil down to computing a "preference" vector for the user (based either on a real feature set, in the case of content-based filtering, or on a latent feature set, in the case of collaborative filtering) and representing each car listing by a feature vector (again based on either real or latent features). The recommendations for the user are then obtained by finding the car listings nearest to the user's "preference" vector.
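
As a minimal sketch of that last step, assuming plain NumPy arrays and Euclidean distance (the function name `nearest_listings` and the toy vectors are made up for illustration):

```python
import numpy as np

def nearest_listings(user_vector, listing_vectors, k=3):
    """Return the indices of the k listings closest to user_vector (Euclidean distance)."""
    distances = np.linalg.norm(listing_vectors - user_vector, axis=1)
    return np.argsort(distances)[:k]

# Toy data: 4 listings described by 3 hypothetical features
listing_vectors = np.array([
    [0.1, 0.9, 1.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 1.0],
    [0.9, 0.1, 0.0],
])
user_vector = np.array([0.1, 0.85, 1.0])

top = nearest_listings(user_vector, listing_vectors, k=2)
print(top)  # indices of the two closest listings, nearest first
```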


## Initialization

In [1]:
from IPython.display import display
from typing import List
from random import randint
import pandas as pd
import numpy as np

## Data

First, let's take a simple car listings dataset from Kaggle.

In [11]:
# load the dataset
listings_path = 'data/car_listings.csv'
df_listings = pd.read_csv(listings_path)
df_listings

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner
...,...,...,...,...,...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner


In [3]:
df_listings.describe()

Unnamed: 0,year,selling_price,km_driven
count,4340.0,4340.0,4340.0
mean,2013.090783,504127.3,66215.777419
std,4.215344,578548.7,46644.102194
min,1992.0,20000.0,1.0
25%,2011.0,208749.8,35000.0
50%,2014.0,350000.0,60000.0
75%,2016.0,600000.0,90000.0
max,2020.0,8900000.0,806599.0


In [4]:
df_listings['transmission'].unique()

array(['Manual', 'Automatic'], dtype=object)

In [5]:
df_listings['fuel'].unique()

array(['Petrol', 'Diesel', 'CNG', 'LPG', 'Electric'], dtype=object)

## Features 

For simplicity (as this is just a demonstration), we will vectorize only some of the listing attributes. Let's pick a couple of numerical ones (`selling_price` and `year`) and a couple of categorical ones (`fuel` and `transmission`). The car brand, taken from the first word of `name`, is added as a further categorical attribute.

- To vectorize the selected numerical attributes, we simply normalize their range to 0-1.
- To vectorize the categorical attributes, we create a new 0/1 dimension for each category.

In [24]:
# Numerical attributes
year_min, year_max = 1992, 2020
norm_year = (df_listings['year'] - year_min) / (year_max - year_min)

# TODO remove outliers first
price_min, price_max = df_listings['selling_price'].min(), df_listings['selling_price'].max()
norm_price = (df_listings['selling_price'] - price_min) / (price_max - price_min)

features = {
    'price': norm_price,
    'year': norm_year,
}

# Simple Categorical attributes
for category in df_listings['transmission'].unique():
    category_vector = df_listings['transmission'] == category
    features[f'transmission_{category}'] = category_vector.astype(np.float64)
    
for category in df_listings['fuel'].unique():
    category_vector = df_listings['fuel'] == category
    features[f'fuel_{category}'] = category_vector.astype(np.float64)


# Car brands: take the first word of the listing name as the brand
listing_brand = df_listings['name'].str.split().str[0]
brands = set(listing_brand)

for category in brands:
    # Compare against the extracted brand; a substring test on the full name
    # could also match a brand name that appears elsewhere in the name
    category_vector = listing_brand == category
    features[f'brand_{category}'] = category_vector.astype(np.float64)

df_features = pd.DataFrame(features)
df_features

Unnamed: 0,price,year,transmission_Manual,transmission_Automatic,fuel_Petrol,fuel_Diesel,fuel_CNG,fuel_LPG,fuel_Electric,brand_Datsun,...,brand_Kia,brand_Tata,brand_Jeep,brand_Toyota,brand_Hyundai,brand_Honda,brand_Volkswagen,brand_Mercedes-Benz,brand_Volvo,brand_Isuzu
0,0.004505,0.535714,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.012950,0.535714,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.065315,0.714286,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.025901,0.892857,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.048423,0.785714,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4335,0.043919,0.785714,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4336,0.043919,0.785714,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4337,0.010135,0.607143,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4338,0.095158,0.857143,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
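
The TODO about outliers in the cell above matters because min-max scaling is sensitive to extremes: the maximum price (8,900,000) is far above the 75th percentile (600,000), so most normalized prices end up close to 0. One possible way to handle it is sketched here, under the assumption that clipping at a high percentile is acceptable. The percentile and the function name `clipped_min_max` are arbitrary illustrative choices, not part of the actual solution.

```python
import numpy as np

def clipped_min_max(values, upper_percentile=99.0):
    """Clip values at an upper percentile, then scale them to the 0-1 range."""
    values = np.asarray(values, dtype=np.float64)
    upper = np.percentile(values, upper_percentile)
    clipped = np.minimum(values, upper)
    low, high = clipped.min(), clipped.max()
    return (clipped - low) / (high - low)

# Toy prices with one extreme value (made-up sample, not the full dataset)
prices = np.array([60_000, 135_000, 250_000, 450_000, 600_000, 8_900_000])

plain = (prices - prices.min()) / (prices.max() - prices.min())
clipped = clipped_min_max(prices, upper_percentile=80)

print(np.round(plain, 3))    # plain min-max: most values squeezed near 0
print(np.round(clipped, 3))  # clipped first: values spread over the 0-1 range
```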


In [20]:
df_listings["feature"] = df_listings.apply(lambda l: df_features.iloc[l.name].to_numpy(), axis=1)
df_listings

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,feature
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,"[0.0045045045045045045, 0.5357142857142857, 1...."
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner,"[0.01295045045045045, 0.5357142857142857, 1.0,..."
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner,"[0.06531531531531531, 0.7142857142857143, 1.0,..."
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner,"[0.0259009009009009, 0.8928571428571429, 1.0, ..."
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner,"[0.04842342342342342, 0.7857142857142857, 1.0,..."
...,...,...,...,...,...,...,...,...,...
4335,Hyundai i20 Magna 1.4 CRDi (Diesel),2014,409999,80000,Diesel,Individual,Manual,Second Owner,"[0.043918806306306304, 0.7857142857142857, 1.0..."
4336,Hyundai i20 Magna 1.4 CRDi,2014,409999,80000,Diesel,Individual,Manual,Second Owner,"[0.043918806306306304, 0.7857142857142857, 1.0..."
4337,Maruti 800 AC BSIII,2009,110000,83000,Petrol,Individual,Manual,Second Owner,"[0.010135135135135136, 0.6071428571428571, 1.0..."
4338,Hyundai Creta 1.6 CRDi SX Option,2016,865000,90000,Diesel,Individual,Manual,First Owner,"[0.09515765765765766, 0.8571428571428571, 1.0,..."


## Content-based filtering

For content-based filtering we need to figure out the user's preferred features. There are two problems to solve:

1. Figure out the preferred features for a new user
2. Update the user's preferred features based on the user's activity (i.e., which features the user searches for or which listings the user views)

To solve the first problem, we can either ask the user to fill in an "onboarding" questionnaire or, based on some research, assign predefined preferred features based on the user's gender, age or other parameters...

In [9]:
# Let's say our new user fills in a questionnaire: 
# - has a big budget
# - prefers a new(er) car
# - prefers automatic transmission and
# - prefers either electric or petrol car 
# - And doesn't really prefer any brand.
# That would yield the following preference vector:

# Column order: price, year, transmission_Manual, transmission_Automatic,
# fuel_Petrol, fuel_Diesel, fuel_CNG, fuel_LPG, fuel_Electric
user_preference_vector = [0.8, 0.9, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.1]
# Extend with 0.0 for each car brand
user_preference_vector.extend([0.0] * len(brands))

# Let's say that the user clicks on a random listing
clicked_listing_vector = df_features.iloc[randint(0, len(df_features) - 1)]

# The ultra-naive update is to average the user's preferred features with the
# features of the listing the user clicked on. And this is exactly what we will
# use for our stupid recommender system :-).
# Toy demonstration: a is the preference vector, b is the clicked listing.
a = np.array([1.0, 0.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0, 1.0])
a = (a + b) / 2  # after the first click
display(a)
a = (a + b) / 2  # after a second click on the same listing
display(a)

array([1. , 0.5, 0. , 1. ])

array([1.  , 0.75, 0.  , 1.  ])
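
Repeating the averaging step after every click amounts to an exponential moving average in which the most recent click always carries half of the total weight. A slightly more general sketch follows, assuming a tunable `weight` parameter. Both `weight` and the function name `update_preference` are hypothetical, and `weight=0.5` reproduces the averaging above.

```python
import numpy as np

def update_preference(preference, clicked, weight=0.5):
    """Move the preference vector towards the clicked listing's feature vector."""
    preference = np.asarray(preference, dtype=np.float64)
    clicked = np.asarray(clicked, dtype=np.float64)
    return (1.0 - weight) * preference + weight * clicked

a = np.array([1.0, 0.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0, 1.0])

a = update_preference(a, b)  # same as (a + b) / 2
a = update_preference(a, b)
print(a)
```

A smaller `weight` would make the preference vector react more slowly to individual clicks, which may be preferable when a single click is a weak signal.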

## Vector db + search
Since this solution also needs an API, we will ideally use an ORM framework with a relational DB. And as PostgreSQL can be extended with pgvector to also work as a vector DB, it is a no-brainer that we will use it.

To find the user's recommendations (i.e. the listings whose features are closest to the user's preferred features) we will use Euclidean distance, which is what pgvector's `<->` operator computes.

```sql
SELECT *
FROM listings
ORDER BY features <-> '{user_preference}'
LIMIT 20;
```
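
The `'{user_preference}'` placeholder has to be filled with pgvector's text representation of a vector, which is a bracketed, comma-separated list such as `[0.8,0.9,0.0]`. Below is a small sketch of producing that literal and passing it as a query parameter rather than formatting it into the SQL string. The helper name `to_pgvector_literal` is made up, the `%s` placeholder style assumes a psycopg-like driver, and no database connection is shown.

```python
def to_pgvector_literal(vector):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.8,0.9,0.0]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"

# Parameterized version of the query above; the driver fills in %s.
RECOMMENDATION_QUERY = """
SELECT *
FROM listings
ORDER BY features <-> %s::vector
LIMIT 20;
"""

user_preference = [0.8, 0.9, 0.0, 0.1]
params = (to_pgvector_literal(user_preference),)
print(params[0])
# With a psycopg-style cursor this would be run as:
# cursor.execute(RECOMMENDATION_QUERY, params)
```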

QED