# Adevinta: Property characteristics similarity: Recommender

Item-based collaborative filtering is a model-based algorithm for making recommendations. In the algorithm, the similarities between different items in the dataset are calculated by using one of a number of similarity measures, and then these similarity values are used to predict ratings for user-item pairs not present in the dataset.

To build the recommender we followed the collaborative filtering paradigm, but without matrix factorization or a fitted model: we compute the cosine similarity matrix between properties and return the 5 properties most similar to the selected one.
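As a rough illustration of the idea (not the notebook's actual code), the sketch below ranks items by cosine similarity to a query row. The function name `top_k_similar` and the toy matrix are assumptions made for this example; the rows stand in for already-scaled property features.

```python
import numpy as np

def top_k_similar(features: np.ndarray, query_idx: int, k: int = 5) -> np.ndarray:
    """Return the indices of the k rows most similar (cosine) to row `query_idx`."""
    # Normalise each row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # never recommend the query item itself
    return np.argsort(-sims)[:k]

# Hypothetical, already-scaled feature rows (one row per property).
toy = np.array([
    [1.0, 0.5, 0.0],
    [0.9, 0.6, 0.1],
    [-1.0, 0.2, 0.8],
    [0.0, -1.0, 0.5],
])
nearest = top_k_similar(toy, query_idx=0, k=2)
print(nearest)
```

Excluding the query row up front avoids having to ask for `k + 1` results and drop the first one afterwards.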

A model-based approach has also been attempted but it's not yet finished.

In [78]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

In [79]:
%matplotlib inline

In [80]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [81]:
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

## Adevinta: Dataset loading and cleaning

In [82]:
df_fotocasa = pd.read_csv("./data/problem_data_reduced.csv",sep="|")

In [83]:
df_fotocasa.head()

Unnamed: 0,idproperty,province,municipality,surface,rooms,baths,property_type,property_subtype,transacion_type,price,description
0,qkgdhixsul,Girona,Castell-Platja d'Aro,60,2,1,Vivienda,Apartamento,Sell,178000.0,"apartamento de 60 m2, dsistribuido en cocina i..."
1,swigwvclxz,Barcelona,Vilanova i la Geltrú,197,4,2,Vivienda,Casa-Chalet,Sell,345000.0,VILANOVA I LA GELTRULes presentamos esta casa ...
2,bfvgsrcdoj,Lleida,Fondarella,375,5,3,Vivienda,Casa-Chalet,Sell,180000.0,
3,tsracvmevc,Girona,Girona Capital,89,4,2,Vivienda,Piso,Sell,187000.0,"Pis de 89m2, menjador de 23m2, cuina office de..."
4,biayppbmen,Barcelona,Manresa,180,6,1,Vivienda,Piso,Sell,350000.0,"MANRESA, piso de 6 habitaciones muy amplias to..."


In [84]:
print("\nObservations: {}, Features: {}\n".format(df_fotocasa.shape[0], df_fotocasa.shape[1]-1))


Observations: 1208, Features: 10



In [85]:
df_fotocasa_vars = df_fotocasa[["idproperty","surface","rooms","baths"]].copy()

df_fotocasa_province = pd.get_dummies(df_fotocasa.province)
df_fotocasa_municipality = pd.get_dummies(df_fotocasa.municipality)
df_fotocasa_property_subtype = pd.get_dummies(df_fotocasa.property_subtype)
df_fotocasa_transacion_type = pd.get_dummies(df_fotocasa.transacion_type)

In [86]:
df_fotocasa_new = pd.concat([
    df_fotocasa_vars, 
    df_fotocasa_province,
    #df_fotocasa_municipality,
    df_fotocasa_property_subtype,
    df_fotocasa_transacion_type
], axis=1,sort=False)
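The encoding step above can be tried on a small frame. This is a minimal sketch with made-up rows; `dtype=int` is an assumption added here so that the indicator columns print as 0/1 (as in the outputs of this notebook) on pandas versions that default to booleans.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in the notebook.
toy = pd.DataFrame({
    "idproperty": ["a", "b", "c"],
    "surface": [60, 197, 375],
    "province": ["Girona", "Barcelona", "Lleida"],
})

# One indicator column per distinct province, ordered alphabetically.
dummies = pd.get_dummies(toy["province"], dtype=int)

# Numeric columns and indicator columns side by side, aligned on the row index.
encoded = pd.concat([toy[["idproperty", "surface"]], dummies], axis=1)
print(encoded)
```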

In [87]:
df_fotocasa_new.head()

Unnamed: 0,idproperty,surface,rooms,baths,Barcelona,Girona,Lleida,Tarragona,Apartamento,Casa adosada,...,Estudio,Finca rústica,Loft,Piso,Planta baja,Ático,Added in Olap,Rent,Sell,Share
0,qkgdhixsul,60,2,1,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
1,swigwvclxz,197,4,2,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,bfvgsrcdoj,375,5,3,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,tsracvmevc,89,4,2,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
4,biayppbmen,180,6,1,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


In [88]:
df_fotocasa_new.shape

(1208, 22)

In [89]:
df_fotocasa_first = df_fotocasa_new.groupby(['idproperty']).first()

In [90]:
df_fotocasa_first.shape

(904, 21)
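The drop from 1208 to 904 rows comes from collapsing repeated `idproperty` values. One detail worth keeping in mind is that `groupby(...).first()` takes the first non-null value per column, which is not always the first row. The toy frame below is invented to show the difference from `drop_duplicates`.

```python
import numpy as np
import pandas as pd

# Hypothetical duplicated listings; the first "b" row has a missing surface.
toy = pd.DataFrame({
    "idproperty": ["b", "a", "b", "a"],
    "surface": [np.nan, 60.0, 80.0, 65.0],
    "rooms": [3, 2, 3, 2],
})

# First non-null value per column within each group; the index is sorted by key.
by_group = toy.groupby("idproperty").first()

# First physical row per key, missing values included.
by_row = toy.drop_duplicates("idproperty").set_index("idproperty").sort_index()

print(by_group)
print(by_row)
```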

## Adevinta: Recommender: Collaborative Filtering

A recommender system produces a list of recommendations for a given item. One way to build such a system is collaborative filtering, where the information is filtered by looking at the activity of other users.

## Adevinta: Item-to-item cosine similarity

**Selection of similarity metric**

- Pearson, if the data is subject to user bias
- Cosine, if the data is sparse
- Euclidean, if the data is not sparse
- Adjusted cosine, for an item-based approach that corrects for user bias
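To make the differences between these measures concrete, here is a small NumPy sketch. The helper names and the three vectors are assumptions for illustration only. Adjusted cosine is left out because it needs a user-item matrix to subtract per-user means.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b (ignores magnitude)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    return cosine_sim(a - a.mean(), b - b.mean())

def euclidean_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance (sensitive to magnitude)."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
c = np.array([3.0, 2.0, 1.0])  # reversed trend

for name, other in [("b", b), ("c", c)]:
    print(name, cosine_sim(a, other), pearson_sim(a, other), euclidean_dist(a, other))
```

Cosine and Pearson both treat `a` and `b` as identical, while the Euclidean distance between them is clearly non-zero. For `a` and `c`, cosine stays positive but Pearson is at its minimum, since centring removes the shared offset.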

In [94]:
# Feature matrix: every column except idproperty, which is the index after the groupby
X = df_fotocasa_first.to_numpy(dtype=float)

**Data scaling**

In [95]:
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

**Similarity matrix**

In [108]:
# Cosine similarity = 1 - cosine distance
cosdist_out = 1 - pairwise_distances(X_scaled, metric="cosine")
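As a sanity check on this step, the sketch below uses random toy data (an assumption, not the Fotocasa features) to confirm that one minus the cosine distance matches scikit-learn's `cosine_similarity`, and that the result has the properties expected of a similarity matrix.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 6 items, 4 features.
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 4))

# Standardise columns so no single feature dominates the angle between rows.
toy_scaled = StandardScaler().fit_transform(toy)

sim_from_dist = 1 - pairwise_distances(toy_scaled, metric="cosine")
sim_direct = cosine_similarity(toy_scaled)

print(np.allclose(sim_from_dist, sim_direct))
```

Both routes give a symmetric matrix with ones on the diagonal and values in [-1, 1]. Calling `cosine_similarity` directly avoids the extra subtraction.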

In [109]:
df_sim_matrix = pd.DataFrame(cosdist_out)
df_sim_matrix.columns = list(df_fotocasa_first.reset_index().idproperty)

df_sim_matrix['idproperty'] = df_fotocasa_first.reset_index().idproperty
df_sim_matrix.head()

Unnamed: 0,abkvpehvdk,abrrpeggwd,adqphxnrhg,adsvpzczjm,aemkznotwk,aeunvoqpfk,aexiyslvzb,afpxjapnaa,aglafrntto,agvdpedzlx,...,zrgbsuyzxw,zrsflarpkv,zswfozpnpq,zufqkgdlos,zuqgmuxggs,zvepjajedz,zxkniyripf,zxmtqvtewr,zzqybyrjsg,idproperty
0,1.0,0.714619,-0.327909,-0.157857,0.339355,0.485678,0.485678,0.009127,-0.134279,0.748309,...,-0.038692,-0.153109,0.147416,-0.369042,-0.125557,-0.17779,-0.161843,-0.134283,-0.316778,abkvpehvdk
1,0.714619,1.0,-0.031288,-0.093798,0.462845,0.533254,0.533255,0.056913,-0.078601,0.391823,...,0.027126,0.1766,0.259783,-0.339278,-0.07331,-0.154387,-0.172143,-0.078603,0.027561,abrrpeggwd
2,-0.327909,-0.031288,1.0,-0.259417,0.003985,-0.319756,-0.319759,-0.162643,-0.223652,-0.020338,...,-0.243193,0.624968,-0.373722,-0.180016,0.273105,0.33749,0.007943,-0.22366,0.707881,adqphxnrhg
3,-0.157857,-0.093798,-0.259417,1.0,0.207353,-0.315145,-0.315151,0.217288,0.2687,-0.159713,...,0.265056,-0.241218,0.922565,0.945599,0.225748,-0.384292,0.090251,0.268692,0.168054,adsvpzczjm
4,0.339355,0.462845,0.003985,0.207353,1.0,0.107921,0.107919,-0.156705,-0.251866,0.539242,...,-0.116848,-0.229332,0.426203,0.045187,0.812323,0.282468,-0.184841,-0.251874,-0.363411,aemkznotwk


**Recommendation (top 5)**

In [99]:
df_orig = df_fotocasa_first.reset_index().copy()

In [132]:
# prop = 'abkvpehvdk'  # alternative example property
prop = 'afpxjapnaa'

In [133]:
df_fotocasa[
    df_fotocasa.idproperty.isin([prop])
][['idproperty','province','municipality','surface',
   'rooms','baths','price','property_subtype','description']] \
    .groupby('idproperty').first().reset_index()

Unnamed: 0,idproperty,province,municipality,surface,rooms,baths,price,property_subtype,description
0,afpxjapnaa,Tarragona,Cambrils,120,3,3,370000.0,Casa adosada,"Magnífica casa adosada cantonera en Cambrils, ..."


**Getting the top 5 properties by cosine similarity**

In [134]:
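# head(6): the selected property ranks first (similarity 1 with itself),
# followed by its 5 nearest neighbours.
# Note: the final groupby re-sorts the rows by idproperty, not by similarity.
df_fotocasa[
    df_fotocasa.idproperty.isin(
        list(df_sim_matrix[['idproperty',prop]] \
                 .sort_values(by=prop, ascending=False).head(6).idproperty)
    )
][['idproperty','province','municipality','surface',
   'rooms','baths','price','property_subtype','description']] \
    .groupby('idproperty').first().reset_index()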

Unnamed: 0,idproperty,province,municipality,surface,rooms,baths,price,property_subtype,description
0,afpxjapnaa,Tarragona,Cambrils,120,3,3,370000.0,Casa adosada,"Magnífica casa adosada cantonera en Cambrils, ..."
1,btzjxbatdb,Tarragona,Mont-roig del Camp,120,3,3,265000.0,Casa adosada,
2,gsdfinkefl,Tarragona,Torredembarra,185,4,3,280000.0,Casa adosada,Próxima Promoción en Urbanización Sant Jordi. ...
3,hohptqqnsl,Tarragona,Reus,159,4,3,240000.0,Casa adosada,NUEVA FASE EN VENTA. Nueva promoción de casas ...
4,ieuftaczfb,Tarragona,El Montmell,140,4,3,149000.0,Casa adosada,GRAN OPORTUNIDAD POR CAMBIO DE RESIDENCIA!!!!V...
5,uisfjlyigd,Tarragona,Banyeres del Penedès,182,3,3,225000.0,Casa adosada,casa de dos planta más garaje con jardin y dos...
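The lookup above could be wrapped in a small helper that drops the query property and keeps the similarity ranking. This is a sketch under one assumption that differs from the notebook: the similarity frame is indexed by property id on both axes, rather than carrying an `idproperty` column. The function name and the toy values are invented.

```python
import pandas as pd

def top_k_from_matrix(sim: pd.DataFrame, prop: str, k: int = 5) -> pd.Series:
    """Return the k highest similarities to `prop`, excluding `prop` itself."""
    return sim[prop].drop(index=prop).nlargest(k)

# Hypothetical symmetric similarity matrix for four properties.
ids = ["p1", "p2", "p3", "p4"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, -0.2, 0.4],
     [0.9, 1.0, 0.1, 0.3],
     [-0.2, 0.1, 1.0, 0.7],
     [0.4, 0.3, 0.7, 1.0]],
    index=ids, columns=ids,
)

top = top_k_from_matrix(toy_sim, "p1", k=2)
print(top)
```

The returned index holds the recommended ids, most similar first, so it can be used with `.loc` on an id-indexed listing table without losing the order.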


## Adevinta: Model based collaborative filtering

This part is not finished yet, so it is not covered in the report.

In [114]:
import pandas as pd
from surprise import Dataset
from surprise import Reader

df = df_sim_matrix.melt(id_vars=["idproperty"], 
        var_name="property", 
        value_name="rating")
reader = Reader(rating_scale=(-1, 1))

data = Dataset.load_from_df(df[
    ["idproperty", "property", "rating"]], reader)

from surprise import KNNWithMeans

sim_options = {
    "name": "cosine",
    "user_based": False,  
}
algo = KNNWithMeans(sim_options=sim_options)
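The `melt` call in this cell reshapes the wide similarity matrix into the three-column (user, item, rating) layout that `Dataset.load_from_df` reads. A toy version with two invented properties shows the resulting shape:

```python
import pandas as pd

# Hypothetical 2x2 similarity matrix with an id column, as in df_sim_matrix.
toy_sim = pd.DataFrame({
    "p1": [1.0, 0.9],
    "p2": [0.9, 1.0],
    "idproperty": ["p1", "p2"],
})

# One row per (idproperty, property) pair, with the similarity as the "rating".
long = toy_sim.melt(id_vars=["idproperty"], var_name="property", value_name="rating")
print(long)
```

An n x n matrix becomes n squared rows, so the 904 deduplicated properties here would yield 817,216 pairs, including each property paired with itself.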

In [117]:
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x109a44e48>

In [128]:
algo.predict('abkvpehvdk','abkvpehvdk')

Prediction(uid='abkvpehvdk', iid='abkvpehvdk', r_ui=None, est=0.6925364820994977, details={'actual_k': 40, 'was_impossible': False})

In [124]:
data.df.iloc[0].T

idproperty    abkvpehvdk
property      abkvpehvdk
rating                 1
Name: 0, dtype: object