# Recommendation using Collaborative filtering: User-User Similarity

In the previous notebooks, we covered:
* EDA
* Data Cleaning
* Recommending popular products

Here, we are using the <b> Santander Product Recommendation</b> dataset.

# Data Dictionary


We are provided with 1.5 years of customer behavior data from Santander bank to predict which new products customers will purchase. The data starts at 2015-01-28 and has monthly records (from January 2015 to May 2016) of the products a customer holds, such as "credit card", "savings account", etc.

Each row represents a customer in a given month; the columns contain attributes related to customer demographics and the products owned by the customer at the end of that month.

We will predict which additional products a customer will acquire in the last month, 2016-06-28, beyond what they already hold at 2016-05-28.

These products are the columns named ind_(xyz)_ult1, which are columns 25 - 48 in the training data.
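As a small illustration of the naming rule above, the product columns can be picked out by their prefix and suffix rather than by position. This is only a sketch on a toy frame: the column names come from the data dictionary, but the values and the `toy` frame itself are made up.

```python
import pandas as pd

# Toy frame with a few real column names from the data dictionary; the values are made up.
toy = pd.DataFrame({
    "ncodpers": [1, 2],
    "ind_nuevo": [0, 1],
    "ind_actividad_cliente": [1, 0],
    "ind_cco_fin_ult1": [1, 0],
    "ind_tjcr_fin_ult1": [0, 1],
    "ind_recibo_ult1": [0, 0],
})

# Product columns start with "ind_" and end with "_ult1". Demographic flags such as
# ind_nuevo share the prefix but not the suffix, so the suffix is what separates them.
product_cols = [c for c in toy.columns if c.startswith("ind_") and c.endswith("_ult1")]
print(product_cols)
```

Selecting by name is less fragile than relying on column positions, which shift as soon as a column is added or dropped.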


### Data Fields

<b>fecha_dato</b>- The table is partitioned by this column

###  Demographic information about customers

<b>ncodpers</b>- Customer code

<b>pais_residencia</b>- Customer's Country residence

<b>sexo</b>- Customer's sex

<b>age</b>- Age

<b>indresi</b>- Residence index (S (Yes) or N (No) if the residence country is the same as the bank country)

<b>indext</b>- Foreigner index (S (Yes) or N (No) if the customer's birth country is different from the bank country)

<b>conyuemp</b>- Spouse index. 1 if the customer is spouse of an employee

<b>canal_entrada</b>- channel used by the customer to join

<b>indfall</b>- Deceased index. N/S

<b>tipodom</b>- Address type. 1, primary address

<b>cod_prov</b>- Province code (customer's address)

<b>nomprov</b>- Province name

<b>ind_actividad_cliente</b>-  Activity index (1, active customer; 0, inactive customer)

<b>renta</b>- Gross income of the household

<b>segmento</b>- Segmentation: 01 - VIP, 02 - Individuals, 03 - college graduated

### Customer Bank Relationship

<b>ind_empleado</b>- Employee index: A active, B ex employed, F filial, N not employee, P passive

<b>fecha_alta</b>- The date on which the customer became the first holder of a contract in the bank

<b>ind_nuevo</b>- New customer Index. 1 if the customer registered in the last 6 months.

<b>antiguedad</b>- Customer seniority (in months)

<b>indrel</b>- 1 (First/Primary) , 99 (Primary customer during the month but not at the end of the month)

<b>ult_fec_cli_1t</b>- Last date as primary customer (if the customer is no longer primary at the end of the month)

<b>indrel_1mes</b>- Customer type at the beginning of the month: 1 (First/Primary customer), 2 (co-owner), P (Potential), 3 (former primary), 4 (former co-owner)

<b>tiprel_1mes</b>- Customer relation type at the beginning of the month: A (active), I (inactive), P (former customer), R (Potential)

### Products of the Bank


<b>ind_ahor_fin_ult1</b>- Saving Account

<b>ind_aval_fin_ult1</b>- Guarantees

<b>ind_cco_fin_ult1</b>- Current Accounts

<b>ind_cder_fin_ult1</b>- Derivada Account

<b>ind_cno_fin_ult1</b>- Payroll Account

<b>ind_ctju_fin_ult1</b>- Junior Account

<b>ind_ctma_fin_ult1</b>-  Más particular Account

<b>ind_ctop_fin_ult1</b>- particular Account

<b>ind_ctpp_fin_ult1</b>-  particular Plus Account

<b>ind_deco_fin_ult1</b>-  Short-term deposits

<b>ind_deme_fin_ult1</b>-  Medium-term deposits

<b>ind_dela_fin_ult1</b>-  Long-term deposits

<b>ind_ecue_fin_ult1</b>-  e-account

<b>ind_fond_fin_ult1</b>-  Funds

<b>ind_hip_fin_ult1</b>-  Mortgage

<b>ind_plan_fin_ult1</b>-  Pensions

<b>ind_pres_fin_ult1</b>-  Loans

<b>ind_reca_fin_ult1</b>-  Taxes

<b>ind_tjcr_fin_ult1</b>-  Credit Card

<b>ind_valo_fin_ult1</b>-  Securities

<b>ind_viv_fin_ult1</b>- Home Account

<b>ind_nomina_ult1</b>-  Payroll

<b>ind_nom_pens_ult1</b>- Pensions

<b>ind_recibo_ult1</b>- Direct Debit



We found a few features in the data dictionary which we included in our hypothesis.


### Test Data

Each row in the test dataset represents a customer with columns related to customer demographics and their relationship with the bank.


As you can see in the Data Dictionary, we have not been given any information about the attributes of the products.

For example, if a product is a debit account, we do not know the minimum balance required to keep it active.

Therefore, we cannot use Content Based Filtering for this dataset because we don't have any information about the products.

We can only use <b>Collaborative Filtering</b>. 


Let us first recall the concept of collaborative filtering.

# Collaborative Filtering 

The collaborative filtering algorithm uses “User Behavior” to recommend items. It is one of the most commonly used algorithms in the industry because it does not depend on any additional information. There are different types of collaborative filtering techniques:

* <b>User-User Collaborative Filtering</b>

This algorithm first finds the similarity score between users. Based on this similarity score, it then picks out the most similar users and recommends products which these similar users have liked or bought previously.

* <b>Item-Item Collaborative Filtering</b>

In this algorithm, we compute the similarity between each pair of items.

In this notebook, we are going to use the User-User Collaborative Filtering algorithm. We will compute the similarities between users and, based on the similarity matrix, recommend products to them.
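To make the user-user idea concrete before we apply it to the bank data, here is a minimal sketch on a made-up ownership matrix. It assumes cosine similarity computed on ownership vectors, which is the textbook setup; the rest of this notebook instead measures similarity on customer demographics. The helper `cosine_similarity_matrix` and all values are illustrative.

```python
import numpy as np

# Rows are users, columns are products (1 = owns the product). Values are illustrative.
ownership = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1
    [0, 0, 0, 1],   # user 2
], dtype=float)

def cosine_similarity_matrix(m):
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

sim = cosine_similarity_matrix(ownership)

target = 0
weights = sim[target].copy()
weights[target] = 0.0                      # a user should not vote for themselves

# Similarity-weighted average of what the other users own.
scores = weights @ ownership / weights.sum()
scores[ownership[target] == 1] = 0.0       # do not recommend products already owned

recommended = int(np.argmax(scores))
print(recommended)
```

User 0 overlaps only with user 1, so the one product that user 1 holds and user 0 does not ends up with the highest score.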

## Loading Packages

In [1]:
#importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing, ensemble, metrics
from tqdm.notebook import tqdm                # tqdm_notebook is deprecated in recent tqdm releases
from IPython.display import display
pd.options.display.max_columns = None
import datetime
import warnings                               # To ignore any warnings 
warnings.filterwarnings("ignore")

## Loading Data

In [2]:
april=pd.read_csv('aprilf')
may=pd.read_csv('may')
test=pd.read_csv('test_f')

Here, we use the April and May 2016 data as the training data.

In [4]:
april.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1,ind_cno_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_ctop_fin_ult1,ind_ctpp_fin_ult1,ind_deme_fin_ult1,ind_dela_fin_ult1,ind_ecue_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
0,11787586,11787586,2016-04-28,896849,N,ES,V,working,0.0,medium,1.0,1.0,A,S,N,KFC,N,28.0,MADRID,1.0,medium,03 - UNIVERSITARIO,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0.0,0.0,0
1,11787589,11787589,2016-04-28,896848,N,ES,H,working,0.0,medium,1.0,1.0,A,S,N,KAT,N,46.0,VALENCIA,1.0,medium,02 - PARTICULARES,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,1.0,1
2,11787593,11787593,2016-04-28,896832,N,ES,H,working,0.0,medium,1.0,1.0,I,S,S,KFC,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
3,11787597,11787597,2016-04-28,896823,N,ES,V,working,0.0,medium,1.0,1.0,A,S,N,KFC,N,28.0,MADRID,1.0,rich,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
4,11787600,11787600,2016-04-28,896894,N,ES,H,working,0.0,medium,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0


In [5]:
may.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1,ind_cno_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_ctop_fin_ult1,ind_ctpp_fin_ult1,ind_deme_fin_ult1,ind_dela_fin_ult1,ind_ecue_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
0,6,12715862,12715862,2016-05-28,657789,N,ES,V,working,0.0,senior,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,01 - TOP,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0.0,0.0,0
1,8,12715864,12715864,2016-05-28,657777,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
2,9,12715865,12715865,2016-05-28,657782,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,rich,02 - PARTICULARES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
3,11,12715867,12715867,2016-05-28,657780,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KAT,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
4,21,12715877,12715877,2016-05-28,657786,N,ES,V,working,0.0,medium,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0.0,0.0,1


In [9]:
test.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento
0,0,0,6/28/2016,15889,F,ES,V,working,0,senior,1,1.0,A,S,N,KAT,N,28.0,MADRID,1,rich,01 - TOP
1,8,8,6/28/2016,1170555,N,ES,V,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO
2,11,11,6/28/2016,1170563,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO
3,15,15,6/28/2016,1170570,N,ES,H,working,0,medium,1,1.0,A,S,N,KHE,N,28.0,MADRID,1,medium,03 - UNIVERSITARIO
4,19,19,6/28/2016,1170581,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,medium,03 - UNIVERSITARIO


In [4]:
may=may.drop(['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.1.1'],axis=1)           # removing the undesired columns
april=april.drop(['Unnamed: 0','Unnamed: 0.1'],axis=1) 
test=test.drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
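A common cause of `Unnamed: 0` columns is a DataFrame that was written to CSV together with its index and then read back. We do not know how the files used here were produced, so the following is only a sketch of that mechanism on a toy frame, using an in-memory buffer instead of real files.

```python
import io
import pandas as pd

df = pd.DataFrame({"ncodpers": [15889, 1170555], "age": ["working", "working"]})

# Writing with the default index=True stores the row index as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)              # the index comes back as "Unnamed: 0"

# Writing with index=False avoids the extra column altogether.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

Saving intermediate files with `index=False` would make the clean-up cell above unnecessary.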

Now, for user-user similarity we will create two lists of column names: one containing user demographics such as age and gender, and the other containing the products.

In [25]:
demographics = ['fecha_dato',
 'ncodpers','ind_empleado','pais_residencia','sexo','age','ind_nuevo','antiguedad','indrel',
 'indrel_1mes','tiprel_1mes','indresi','indext','canal_entrada','indfall',
 'cod_prov','ind_actividad_cliente','renta','segmento']

products = [ 'ind_cco_fin_ult1','ind_recibo_ult1','ind_ecue_fin_ult1','ind_nomina_ult1',
 'ind_nom_pens_ult1','ind_reca_fin_ult1','ind_tjcr_fin_ult1','ind_ctju_fin_ult1',
 'ind_ctma_fin_ult1','ind_dela_fin_ult1','ind_fond_fin_ult1','ind_hip_fin_ult1','ind_plan_fin_ult1',
 'ind_valo_fin_ult1','ind_viv_fin_ult1']

drop_pro=['ind_ctop_fin_ult1','ind_ctpp_fin_ult1','ind_cno_fin_ult1','ind_pres_fin_ult1']

In [11]:
april=april.drop(drop_pro,axis=1)             # axis must be passed by keyword in recent pandas versions
may=may.drop(drop_pro,axis=1)

For the product columns, we have not included cno, ctop, ctpp and pres, based on our EDA.

Now, we will merge the April dataset into the May dataset to add each user's product holdings from the previous month.

In [26]:
merge_drop_cols = demographics[:]
merge_drop_cols.remove('ncodpers')
may_new = pd.merge(may,april.drop(merge_drop_cols,axis=1), on=['ncodpers'],how='left',suffixes=['','_previous'])

may_new.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_deme_fin_ult1,ind_dela_fin_ult1,ind_ecue_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1,nomprov_previous,ind_cco_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_deme_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_ecue_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_reca_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_recibo_ult1_previous
0,2016-05-28,657789,N,ES,V,working,0.0,senior,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,01 - TOP,1,0,0,0,1,0,0,0,0,0,1,0,0,0.0,0.0,0,MADRID,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,2016-05-28,657777,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016-05-28,657782,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,rich,02 - PARTICULARES,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,MADRID,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016-05-28,657780,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KAT,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016-05-28,657786,N,ES,V,working,0.0,medium,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,1,0,0,0,0.0,0.0,1,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
may_new=may_new.drop('nomprov_previous',axis=1)

In [27]:
for i in products:
    # products purchased in May 2016 = May holdings minus April holdings;
    # negative values (dropped products) are clipped to 0
    may_new[i] = (may_new[i]-may_new[i+"_previous"]).clip(lower=0)

In [28]:
may_new[products] = may_new[products].fillna(0)             # filling the missing values created by merge
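The merge, subtract and fill steps are easier to follow on a tiny example. The sketch below uses one real product column name with made-up customers and values; the `added` column is a hypothetical name introduced only for the illustration.

```python
import pandas as pd

current = pd.DataFrame({"ncodpers": [1, 2, 3], "ind_tjcr_fin_ult1": [1, 0, 1]})
previous = pd.DataFrame({"ncodpers": [1, 2], "ind_tjcr_fin_ult1": [0, 1]})

merged = current.merge(previous, on="ncodpers", how="left", suffixes=("", "_previous"))

col = "ind_tjcr_fin_ult1"
# +1 = newly acquired, 0 = unchanged, -1 = dropped; clip turns drops into 0.
merged["added"] = (merged[col] - merged[col + "_previous"]).clip(lower=0)
# Customer 3 has no previous-month row, so the difference is NaN and is filled with 0,
# which mirrors the fillna(0) used in the notebook.
merged["added"] = merged["added"].fillna(0).astype(int)

print(merged["added"].tolist())
```

Note the consequence for customer 3: a customer who is absent from the previous month is treated as having made no new purchase, even though they hold the product in the current month.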

In [29]:
may_new.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_deme_fin_ult1,ind_dela_fin_ult1,ind_ecue_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1,nomprov_previous,ind_cco_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_deme_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_ecue_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_reca_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_recibo_ult1_previous
0,2016-05-28,657789,N,ES,V,working,0.0,senior,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,01 - TOP,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,MADRID,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,2016-05-28,657777,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016-05-28,657782,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,rich,02 - PARTICULARES,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,MADRID,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016-05-28,657780,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KAT,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016-05-28,657786,N,ES,V,working,0.0,medium,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,02 - PARTICULARES,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
prev_mon_products = [i + "_previous" for i in products]      # names of the April 2016 product columns in the merged dataframe
may_new[prev_mon_products] = may_new[prev_mon_products].fillna(0)   # filling the missing values

Now, we will also add purchase history for the test data

In [31]:
test_col = products + ['ncodpers']                        
test_new = pd.merge(test,may[test_col],on='ncodpers',how='left',suffixes=['','_previous'])   # merge the test data with the products customers hold in May 2016
test_new.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1,ind_recibo_ult1,ind_ecue_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_dela_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1
0,6/28/2016,15889,F,ES,V,working,0,senior,1,1.0,A,S,N,KAT,N,28.0,MADRID,1,rich,01 - TOP,1,0,0,0.0,0.0,0,1,0,0,0,0,0,0,1,0
1,6/28/2016,1170555,N,ES,V,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,6/28/2016,1170563,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,6/28/2016,1170570,N,ES,H,working,0,medium,1,1.0,A,S,N,KHE,N,28.0,MADRID,1,medium,03 - UNIVERSITARIO,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,6/28/2016,1170581,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,medium,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [32]:
test_new.rename(columns={i:j for i,j in zip(products,prev_mon_products)}, inplace=True)       # rename columns in the test data to match the "_previous" names used in may_new
test_new[prev_mon_products] = test_new[prev_mon_products].fillna(0)                                # filling the missing values
test_new.head()    

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,6/28/2016,15889,F,ES,V,working,0,senior,1,1.0,A,S,N,KAT,N,28.0,MADRID,1,rich,01 - TOP,1,0,0,0.0,0.0,0,1,0,0,0,0,0,0,1,0
1,6/28/2016,1170555,N,ES,V,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,6/28/2016,1170563,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,6/28/2016,1170570,N,ES,H,working,0,medium,1,1.0,A,S,N,KHE,N,28.0,MADRID,1,medium,03 - UNIVERSITARIO,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,6/28/2016,1170581,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,medium,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [33]:
may_purchases = may_new[products].copy()            # products customers bought in May 2016
may_prev_mon = may_new[prev_mon_products].copy()    # products customers held in April 2016

Now, we will create a variable listing the customer demographics we want to include in the user-user similarity matrix.

In [108]:
demogs = ['ind_empleado','pais_residencia','sexo','age','antiguedad','ind_nuevo','indrel','indrel_1mes', 'tiprel_1mes','nomprov','canal_entrada','indresi','indext','indfall','ind_actividad_cliente','segmento','renta']
may_demog_final = may_new[demogs].copy()       # creating a dataframe containing customer demographics

#### Feature Engineering for demographics

Now, we will transform demographic factor variables into numbers.

In [109]:
ind_empleado_map={'N':0,'A':1, 'S':1, 'F':1, 'B':1}
pais_residencia_map={'ES':0,'others':1}
sexo_map = {'V': 1,'H': 0}
age_map = {'old': 2,'working':1,'teen':0}
indresi_map = {'S': 1,'N': 0}
indfall_map = {'S': 1,'N': 0}
indrel_1mes_map={0.:3,1.:0,2.:1,3.:2,4.:3}
tiprel_1mes_map={'A':1,'I':0,'P':2,'R':3}
ind_nuevo_map={0.:0,1.:1}
antiguedad_map= {'senior': 2,'medium': 1,'new':0}
nomprov_map={'MADRID':1, 'SEVILLA':2, 'VALENCIA':3, 'others':0}
canal_entrada_map={'KAT':2, 'KFC':3, 'KHE':1, 'others':0}
indrel_map={1.:1,99.:0}
indext_map={'N':0,'S':1}
ind_actividad_cliente_map={1.:1, 0.:0}
segmento_map= {'02 - PARTICULARES':1,'01 - TOP':2,'03 - UNIVERSITARIO':0}
renta_map= {'rich':2,'medium':1,'poor':0}

Replace them in the dataframe

In [110]:
may_demog_final.ind_empleado = [ind_empleado_map[item] for item in may_demog_final.ind_empleado]
may_demog_final.pais_residencia = [pais_residencia_map[item] for item in may_demog_final.pais_residencia]
may_demog_final.sexo = [sexo_map[item] for item in may_demog_final.sexo]
may_demog_final.age = [age_map[item] for item in may_demog_final.age]
may_demog_final.indresi = [indresi_map[item] for item in may_demog_final.indresi]
may_demog_final.indfall = [indfall_map[item] for item in may_demog_final.indfall]
may_demog_final.ind_nuevo = [ind_nuevo_map[item] for item in may_demog_final.ind_nuevo]
may_demog_final.nomprov = [nomprov_map[item] for item in may_demog_final.nomprov]
may_demog_final.canal_entrada = [canal_entrada_map[item] for item in may_demog_final.canal_entrada]
may_demog_final.indext = [indext_map[item] for item in may_demog_final.indext]
may_demog_final.indrel = [indrel_map[item] for item in may_demog_final.indrel]
may_demog_final.ind_actividad_cliente = [ind_actividad_cliente_map[item] for item in may_demog_final.ind_actividad_cliente]
may_demog_final.indrel_1mes = [indrel_1mes_map[item] for item in may_demog_final.indrel_1mes]
may_demog_final.tiprel_1mes = [tiprel_1mes_map[item] for item in may_demog_final.tiprel_1mes]
may_demog_final.antiguedad = [antiguedad_map[item] for item in may_demog_final.antiguedad]
may_demog_final.segmento = [segmento_map[item] for item in may_demog_final.segmento]
may_demog_final.renta = [renta_map[item] for item in may_demog_final.renta]
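The list comprehensions above raise a `KeyError` as soon as a column contains a value that is missing from its map. As an alternative sketch, the same encoding can be written as a loop over a dictionary of maps with `Series.map`, which turns unmapped values into `NaN` so they can be inspected. The `maps` dictionary, the toy frame and the `"unknown"` value are assumptions made for this illustration.

```python
import pandas as pd

# Two of the maps defined above, repeated here so the sketch runs on its own.
sexo_map = {"V": 1, "H": 0}
renta_map = {"rich": 2, "medium": 1, "poor": 0}
maps = {"sexo": sexo_map, "renta": renta_map}      # hypothetical helper dict

toy = pd.DataFrame({"sexo": ["V", "H", "V"], "renta": ["medium", "rich", "unknown"]})

encoded = toy.copy()
unmapped = {}
for col, mapping in maps.items():
    encoded[col] = toy[col].map(mapping)           # values missing from the map become NaN
    missing = toy.loc[encoded[col].isna(), col].unique().tolist()
    if missing:
        unmapped[col] = missing

print(unmapped)
```

Whether a hard failure or a `NaN` is preferable depends on the pipeline; the loop mainly removes the need to repeat one line per column for both the training and the test frame.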

In [69]:
may_demog_final.head()                                    # customers in may 

Unnamed: 0,ind_empleado,pais_residencia,sexo,age,antiguedad,ind_nuevo,indrel,indrel_1mes,tiprel_1mes,nomprov,canal_entrada,indresi,indext,indfall,ind_actividad_cliente,segmento,renta
0,0,0,1,1,2,0,1,0,1,1,2,1,0,0,1,2,1
1,0,0,1,1,2,0,1,0,0,1,3,1,0,0,0,1,1
2,0,0,1,1,2,0,1,0,0,1,3,1,0,0,0,1,2
3,0,0,1,1,2,0,1,0,0,1,2,1,0,0,0,1,1
4,0,0,1,1,1,0,1,0,1,1,2,1,0,0,1,1,1


In [34]:
prev_mon_products_col = prev_mon_products + ['ncodpers']     # previous month product columns
test_final = test_new[prev_mon_products_col].copy()          # dataset containing customer id and products customers has in May 2016.
test_final_unique = test_final.drop('ncodpers',axis=1).drop_duplicates().copy().reset_index(drop=True)    # drop the duplicate rows to get unique purchase matrix

In [35]:
test_final_unique.shape                                    # unique combinations of products that customer has in May 2016. 

(2020, 15)

In [113]:
demogs_col = demogs + ['ncodpers']
test_demog_final = test_new[demogs_col].copy()            # test data containing the customer's demographics (test_new is the merged test dataframe built above)

Replace the features in the test dataframe to create the matrix

In [114]:
test_demog_final.ind_empleado = [ind_empleado_map[item] for item in test_demog_final.ind_empleado]
test_demog_final.pais_residencia = [pais_residencia_map[item] for item in test_demog_final.pais_residencia]
test_demog_final.sexo = [sexo_map[item] for item in test_demog_final.sexo]
test_demog_final.age = [age_map[item] for item in test_demog_final.age]
test_demog_final.indresi = [indresi_map[item] for item in test_demog_final.indresi]
test_demog_final.indfall = [indfall_map[item] for item in test_demog_final.indfall]
test_demog_final.ind_nuevo = [ind_nuevo_map[item] for item in test_demog_final.ind_nuevo]
test_demog_final.nomprov = [nomprov_map[item] for item in test_demog_final.nomprov]
test_demog_final.canal_entrada = [canal_entrada_map[item] for item in test_demog_final.canal_entrada]
test_demog_final.indext = [indext_map[item] for item in test_demog_final.indext]
test_demog_final.indrel = [indrel_map[item] for item in test_demog_final.indrel]
test_demog_final.ind_actividad_cliente = [ind_actividad_cliente_map[item] for item in test_demog_final.ind_actividad_cliente]
test_demog_final.indrel_1mes = [indrel_1mes_map[item] for item in test_demog_final.indrel_1mes]
test_demog_final.tiprel_1mes = [tiprel_1mes_map[item] for item in test_demog_final.tiprel_1mes]
test_demog_final.antiguedad = [antiguedad_map[item] for item in test_demog_final.antiguedad]
test_demog_final.segmento = [segmento_map[item] for item in test_demog_final.segmento]
test_demog_final.renta = [renta_map[item] for item in test_demog_final.renta]

In [78]:
test_demog_final_unique = test_demog_final.drop('ncodpers',axis=1).drop_duplicates().copy().reset_index(drop=True)
test_demog_final_unique.shape               # unique customer demographics in June

(5547, 17)
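Reducing the test customers to a few thousand unique demographic profiles matters because the expensive step only has to run once per profile, and the result can then be joined back to every customer with that profile. A minimal sketch of this pattern, with made-up customers and a trivial `score` column standing in for the expensive computation:

```python
import pandas as pd

toy = pd.DataFrame({
    "ncodpers": [10, 11, 12, 13],
    "sexo": [1, 1, 0, 1],
    "renta": [2, 2, 1, 2],
})
features = ["sexo", "renta"]

# Score each distinct profile once...
unique_profiles = toy[features].drop_duplicates().reset_index(drop=True)
unique_profiles["score"] = unique_profiles["sexo"] + unique_profiles["renta"]   # stand-in for the expensive step

# ...then broadcast the result back to every customer sharing that profile.
scored = toy.merge(unique_profiles, on=features, how="left")

print(len(unique_profiles), scored["score"].tolist())
```

The flip side is that customers with identical demographics necessarily receive identical scores.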

In [37]:
test_final=test_final.drop('ncodpers',axis=1,errors='ignore')      # errors='ignore' keeps the cell safe to re-run

In [38]:
test_final.shape

(929615, 15)

We now have everything we need to perform User-User Similarity based Collaborative Filtering. It's time to write a function that takes all the different matrices as input and returns the probability of purchase of every product by each customer.

### Function to Calculate Probability of Purchase of a Product by the Customers in June 2016.

In [39]:
predict_product_col = [i + "_predict" for i in prev_mon_products]

def probability_calculation(test_demog_unique,last_month_demog,last_month_purchases,demogs,metric,test_demog,print_option=False):
    
    # 'test_demog_unique' is the de-duplicated test data with customer demographics.
    # 'last_month_demog' is the training data (customer demographics) that we calculate distances to.
    # 'last_month_purchases' are the products purchased by the customers in May 2016.
    # 'demogs' are the customer demographics that we have included.
    # 'metric' is the distance metric used to compare customers.
    # 'test_demog' is the full test data with customer demographics.
   
    n = test_demog_unique.shape[0]
    sims = []                                   # one row of probabilities per unique test profile
    for index, row in test_demog_unique.iterrows():
        if print_option:
            print(str(index) + '/' + str(n))
        row_use = row.to_frame().T
        
        #calculate distances between the test point and each training point; the small constant avoids division by zero
        distances = metrics.pairwise_distances(row_use,last_month_demog,metric=metric) + 1e-6
        
        #normalise distances: The asymptotic behaviour of 1/distances gives the most accurate predictions.
        norm_distances = 1/distances
        
        #take dot product between the weights and last month's purchases to obtain the purchase likelihood matrix
        sim = pd.DataFrame(norm_distances.dot(last_month_purchases)/np.sum(norm_distances),columns =  prev_mon_products)
        sims.append(sim)
   
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list and concatenated once
    probabilities = pd.concat(sims)
    
    # reindex users for join
    reindexed_output = probabilities.reset_index(drop=True)
    indexed_unique_test = test_demog_unique.reset_index(drop=True)
    output_unique = indexed_unique_test.join(reindexed_output,rsuffix='_predict')
    output_final = pd.merge(test_demog,output_unique,on=demogs,how='left')
    
    # only select relevant products
    output_final = output_final.drop(demogs,axis=1)
    output_final.columns = output_final.columns.str.replace("_predict", "")
    output_final.columns = output_final.columns.str.replace("_previous", "_predict")
    
    return output_final
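The core of the function is an inverse-distance weighted average: each training customer votes for the products they bought, with a weight of 1/distance to the test customer. The sketch below reproduces just that step for a single test customer, with made-up distances and purchases.

```python
import numpy as np

# Distances from one test customer to three training customers (illustrative values).
distances = np.array([0.0, 0.5, 2.0]) + 1e-6      # same small offset as in the function
# Rows: training customers, columns: two products bought last month (1 = bought).
purchases = np.array([
    [1, 0],
    [0, 1],
    [0, 1],
], dtype=float)

weights = 1.0 / distances                          # closer customers get larger weights
probabilities = weights @ purchases / weights.sum()

print(probabilities)
```

Because of the `1e-6` offset, a training customer at distance 0 receives a weight of about a million and almost entirely determines the result. This may be one reason why many of the predicted probabilities further down are so small.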

In [40]:
probabilities_user = probability_calculation(test_demog_final_unique,may_demog_final ,may_purchases,demogs,'correlation',test_demog_final)


It takes a lot of time to run this function because we have to make predictions for a large number of customers.

This algorithm is practical when the number of users is small. It is not effective when there are a large number of users, as computing the similarity between all user pairs takes a lot of time.

Here, we used correlation as the distance metric. 
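One possible way to reduce the running time is to process several test profiles at once instead of looping row by row. The sketch below is not the function used in this notebook: it assumes the correlation distance is defined as 1 minus the Pearson correlation between rows, implements it in plain NumPy, and runs on random toy data. The names `correlation_distance`, `batched_probabilities` and `batch_size` are hypothetical.

```python
import numpy as np

def correlation_distance(a, b):
    """Correlation distance (1 - Pearson r) between every row of a and every row of b.

    Assumes no row is constant; a constant row has zero variance and would give NaN.
    """
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    a_n = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    b_n = b_c / np.linalg.norm(b_c, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T

def batched_probabilities(test_x, train_x, train_y, batch_size=2, eps=1e-6):
    """Inverse-distance weighted purchase rates, computed one batch of test rows at a time."""
    out = []
    for start in range(0, len(test_x), batch_size):
        d = correlation_distance(test_x[start:start + batch_size], train_x) + eps
        w = 1.0 / d
        out.append(w @ train_y / w.sum(axis=1, keepdims=True))
    return np.vstack(out)

rng = np.random.default_rng(0)
train_x = rng.normal(size=(50, 6))                          # toy "demographics"
train_y = rng.integers(0, 2, size=(50, 3)).astype(float)    # toy purchases (0/1)
test_x = rng.normal(size=(5, 6))

probs = batched_probabilities(test_x, train_x, train_y)
print(probs.shape)
```

The batch size trades memory for speed: each batch holds a `batch_size x n_train` distance matrix, so with roughly 930,000 training customers it has to stay fairly small.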

In [119]:
test_final_unique.shape, may_prev_mon.shape, may_purchases.shape, test_final.shape

((2168, 16), (931453, 16), (931453, 16), (929615, 16))

In [120]:
test_demog_final_unique.shape,may_demog_final.shape ,may_purchases.shape,test_demog_final.shape

((5547, 17), (931453, 17), (931453, 16), (929615, 18))

### Unique Customer Demographics

Test data: 5547 unique demographic profiles

In [121]:
probabilities_user.head()

Unnamed: 0,ncodpers,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_pres_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict
0,15889,0.00032,0.001063,0.02803787,0.028376,0.028373,2.541784e-05,0.084008,3.348381e-06,2.240733e-05,6.161317e-06,6.412532e-06,4.262374e-07,2.31688e-06,1.32466e-06,2.667121e-05,1.061192e-06
1,1170555,0.000379,4e-06,9.952912e-07,2e-06,2e-06,1.166181e-07,2e-06,1.483419e-08,1.57381e-07,1.988603e-08,2.463826e-08,1.979055e-09,9.04195e-09,3.569296e-09,8.180711e-08,3.065335e-09
2,1170563,1e-06,4e-06,8.318544e-07,2e-06,2e-06,9.458205e-08,2e-06,1.255537e-08,1.292946e-07,1.737051e-08,2.07022e-08,1.466969e-09,7.755375e-09,2.9559e-09,6.698169e-08,2.725772e-09
3,1170570,0.005732,0.031506,0.002151155,0.027918,0.027918,0.0007160498,0.002157,5.615746e-08,6.507527e-07,1.000085e-07,1.145236e-07,7.986163e-09,4.457411e-08,1.848213e-08,3.991639e-07,1.726734e-08
4,1170581,0.000129,0.000131,6.833017e-07,2e-06,2e-06,8.021031e-08,1e-06,1.006551e-08,9.21081e-08,1.488706e-08,1.717566e-08,1.188639e-09,6.495914e-09,2.697161e-09,5.876159e-08,2.502621e-09


Now we have the probability of each customer purchasing each product in June 2016. Next, we remove the products that the customer already holds in May 2016 and arrange the remaining products in decreasing order of probability.
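The two steps just described, masking owned products and ranking the rest, can be sketched on a toy example. The customer ids and product names are taken from this notebook, but the probabilities, the ownership flags and the cut-off `top_k` are made up for the illustration.

```python
import numpy as np
import pandas as pd

products = ["ind_cco_fin_ult1", "ind_recibo_ult1", "ind_tjcr_fin_ult1"]
probs = pd.DataFrame([[0.30, 0.20, 0.10], [0.05, 0.40, 0.25]],
                     columns=products, index=[15889, 1170555])
owned = pd.DataFrame([[1, 0, 0], [0, 0, 1]], columns=products, index=probs.index)

# Zero out products the customer already holds, then sort the rest by probability.
masked = probs.where(owned == 0, 0.0)
order = np.argsort(-masked.to_numpy(), axis=1)          # column positions, highest probability first
top_k = 2
recommendations = {
    customer: [products[j] for j in order[row, :top_k]]
    for row, customer in enumerate(masked.index)
}
print(recommendations)
```

Using aligned frames for probabilities and ownership avoids having to rely on column positions when masking.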

In [122]:
test_final['ncodpers']=probabilities_user['ncodpers']           

Merge the predicted probabilities with the products customers hold in May 2016.

In [124]:
pred = pd.merge(probabilities_user,test_final,on='ncodpers',how='left')
pred.head()

Unnamed: 0,ncodpers,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_pres_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_pres_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,15889,0.00032,0.001063,0.02803787,0.028376,0.028373,2.541784e-05,0.084008,3.348381e-06,2.240733e-05,6.161317e-06,6.412532e-06,4.262374e-07,2.31688e-06,1.32466e-06,2.667121e-05,1.061192e-06,1,0,0,0.0,0.0,0,1,0,0,0,0,0,0,0,1,0
1,1170555,0.000379,4e-06,9.952912e-07,2e-06,2e-06,1.166181e-07,2e-06,1.483419e-08,1.57381e-07,1.988603e-08,2.463826e-08,1.979055e-09,9.04195e-09,3.569296e-09,8.180711e-08,3.065335e-09,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
2,1170563,1e-06,4e-06,8.318544e-07,2e-06,2e-06,9.458205e-08,2e-06,1.255537e-08,1.292946e-07,1.737051e-08,2.07022e-08,1.466969e-09,7.755375e-09,2.9559e-09,6.698169e-08,2.725772e-09,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
3,1170570,0.005732,0.031506,0.002151155,0.027918,0.027918,0.0007160498,0.002157,5.615746e-08,6.507527e-07,1.000085e-07,1.145236e-07,7.986163e-09,4.457411e-08,1.848213e-08,3.991639e-07,1.726734e-08,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
4,1170581,0.000129,0.000131,6.833017e-07,2e-06,2e-06,8.021031e-08,1e-06,1.006551e-08,9.21081e-08,1.488706e-08,1.717566e-08,1.188639e-09,6.495914e-09,2.697161e-09,5.876159e-08,2.502621e-09,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
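A left merge like the one above silently keeps customers with no match and duplicates rows if a key repeats. The following is a minimal sketch, on made-up toy frames (`probs` and `owned` are hypothetical names, not objects from this notebook), of how `pd.merge` can check both conditions through its `validate` and `indicator` parameters.

```python
import pandas as pd

# Toy stand-ins for the probability table and the May 2016 holdings (hypothetical data).
probs = pd.DataFrame({
    "ncodpers": [1, 2, 3],
    "ind_cco_fin_ult1_predict": [0.3, 0.1, 0.2],
})
owned = pd.DataFrame({
    "ncodpers": [1, 2],
    "ind_cco_fin_ult1_previous": [1, 0],
})

# validate="one_to_one" raises MergeError if ncodpers repeats on either side;
# indicator=True adds a "_merge" column that records where each row came from.
merged = pd.merge(probs, owned, on="ncodpers", how="left",
                  validate="one_to_one", indicator=True)

# Customers that have predictions but no holdings record.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["ncodpers"].tolist())
```

Unmatched customers get `NaN` in the `*_previous` columns, so a later `== 1` comparison treats them as not owning the product.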


In [130]:
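for i in range(1, 17):
    # Set the purchase probability to 0 for products the customer already holds in May 2016.
    # Column i is a *_predict column and column i + 16 is the matching *_previous column.
    # .iloc is used because .ix was removed in pandas 1.0.
    pred.iloc[:, i] = np.where(pred.iloc[:, i + 16] == 1, 0, pred.iloc[:, i])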

pred.head()                           

Unnamed: 0,ncodpers,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_pres_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_pres_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,15889,0.0,0.001063,0.02803787,0.028376,0.028373,2.541784e-05,0.0,3.348381e-06,2.240733e-05,6.161317e-06,6.412532e-06,4.262374e-07,2.31688e-06,1.32466e-06,0.0,1.061192e-06,1,0,0,0.0,0.0,0,1,0,0,0,0,0,0,0,1,0
1,1170555,0.0,4e-06,9.952912e-07,2e-06,2e-06,1.166181e-07,2e-06,1.483419e-08,1.57381e-07,1.988603e-08,2.463826e-08,1.979055e-09,9.04195e-09,3.569296e-09,8.180711e-08,3.065335e-09,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
2,1170563,0.0,4e-06,8.318544e-07,2e-06,2e-06,9.458205e-08,2e-06,1.255537e-08,1.292946e-07,1.737051e-08,2.07022e-08,1.466969e-09,7.755375e-09,2.9559e-09,6.698169e-08,2.725772e-09,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
3,1170570,0.005732,0.031506,0.002151155,0.027918,0.027918,0.0007160498,0.002157,5.615746e-08,6.507527e-07,1.000085e-07,1.145236e-07,7.986163e-09,4.457411e-08,1.848213e-08,3.991639e-07,1.726734e-08,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
4,1170581,0.0,0.000131,6.833017e-07,2e-06,2e-06,8.021031e-08,1e-06,1.006551e-08,9.21081e-08,1.488706e-08,1.717566e-08,1.188639e-09,6.495914e-09,2.697161e-09,5.876159e-08,2.502621e-09,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0
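The masking step above relies on column positions (`i` and `i + 16`), which breaks quietly if the column order changes. As a sketch only, the same idea can be expressed by pairing columns by name. The toy frame below is hypothetical and uses just two products, and it assumes every `*_predict` column has a `*_previous` counterpart.

```python
import numpy as np
import pandas as pd

# Toy frame with two products; names follow the *_predict / *_previous pattern.
toy = pd.DataFrame({
    "ncodpers": [1, 2],
    "ind_cco_fin_ult1_predict": [0.30, 0.10],
    "ind_recibo_ult1_predict": [0.20, 0.40],
    "ind_cco_fin_ult1_previous": [1, 0],
    "ind_recibo_ult1_previous": [0, 0],
})

# Pair each predicted column with its holdings column by name.
predict_cols = [c for c in toy.columns if c.endswith("_predict")]
previous_cols = [c.replace("_predict", "_previous") for c in predict_cols]

# True where the customer already owns the product.
already_owned = toy[previous_cols].to_numpy() == 1

# Zero out the probability wherever the product is already owned.
masked = toy.copy()
masked[predict_cols] = np.where(already_owned, 0.0, toy[predict_cols].to_numpy())
print(masked[predict_cols])
```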


In [139]:
user_sim = pred.iloc[:, 0:20]   # keep ncodpers and the 16 *_predict columns; the first three *_previous columns also remain, but the melt below ignores them
user_sim.head()

Unnamed: 0,ncodpers,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_pres_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous
0,15889,0.0,0.001063,0.02803787,0.028376,0.028373,2.541784e-05,0.0,3.348381e-06,2.240733e-05,6.161317e-06,6.412532e-06,4.262374e-07,2.31688e-06,1.32466e-06,0.0,1.061192e-06,1,0,0
1,1170555,0.0,4e-06,9.952912e-07,2e-06,2e-06,1.166181e-07,2e-06,1.483419e-08,1.57381e-07,1.988603e-08,2.463826e-08,1.979055e-09,9.04195e-09,3.569296e-09,8.180711e-08,3.065335e-09,1,0,0
2,1170563,0.0,4e-06,8.318544e-07,2e-06,2e-06,9.458205e-08,2e-06,1.255537e-08,1.292946e-07,1.737051e-08,2.07022e-08,1.466969e-09,7.755375e-09,2.9559e-09,6.698169e-08,2.725772e-09,1,0,0
3,1170570,0.005732,0.031506,0.002151155,0.027918,0.027918,0.0007160498,0.002157,5.615746e-08,6.507527e-07,1.000085e-07,1.145236e-07,7.986163e-09,4.457411e-08,1.848213e-08,3.991639e-07,1.726734e-08,0,0,0
4,1170581,0.0,0.000131,6.833017e-07,2e-06,2e-06,8.021031e-08,1e-06,1.006551e-08,9.21081e-08,1.488706e-08,1.717566e-08,1.188639e-09,6.495914e-09,2.697161e-09,5.876159e-08,2.502621e-09,1,0,0


In [140]:
user_sim_melt=pd.melt(user_sim,id_vars =['ncodpers'],value_vars =user_sim.columns[1:17])   # Melt the dataframe

In [141]:
to_rec=user_sim_melt[user_sim_melt['value']!=0.0]           # removing products with purchase probability 0

In [142]:
to_rec.shape

(13842058, 3)
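To make the reshaping concrete, here is a small sketch of the melt-and-filter step on a hypothetical two-customer, two-product frame (`wide`, `long_df` and `candidates` are illustrative names). Each wide row becomes one row per product, and rows with probability 0 (products already owned) are dropped.

```python
import pandas as pd

# Hypothetical wide table: one row per customer, one column per product.
wide = pd.DataFrame({
    "ncodpers": [1, 2],
    "ind_cco_fin_ult1_predict": [0.0, 0.10],
    "ind_recibo_ult1_predict": [0.20, 0.40],
})

# Wide -> long: one row per (customer, product) pair,
# with columns "ncodpers", "variable" (product name) and "value" (probability).
long_df = pd.melt(wide, id_vars=["ncodpers"], value_vars=list(wide.columns[1:]))

# Drop products whose probability was zeroed out.
candidates = long_df[long_df["value"] != 0.0]
print(candidates)
```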

In [145]:
# Sort once by probability, then keep the first 7 rows of each customer.
# This yields the same top 7 products per customer as groupby(...).apply(sort + head),
# but avoids calling a Python function once per customer, which is very slow on ~14M rows.
to_rec_sort = to_rec.sort_values(by='value', ascending=False).groupby('ncodpers').head(7)
to_rec_sort = to_rec_sort.reset_index(drop=True)                      # reset index


In [147]:
list_prod=to_rec_sort.groupby('ncodpers')['variable'].apply(list)     # Create a list of products
list_prod=list_prod.reset_index()                                     # reset index
list_prod.variable=list_prod.variable.apply(lambda l: " ".join(l))    # join the list

In [148]:
list_prod['variable'] = list_prod['variable'].str.replace('_predict', '')    # remove the _predict from the products name
list_prod.head()

Unnamed: 0,ncodpers,variable
0,15889,ind_nomina_ult1 ind_nom_pens_ult1 ind_ecue_fin...
1,15890,ind_cco_fin_ult1 ind_reca_fin_ult1 ind_valo_fi...
2,15892,ind_nomina_ult1 ind_nom_pens_ult1 ind_ctma_fin...
3,15893,ind_recibo_ult1 ind_tjcr_fin_ult1 ind_nomina_u...
4,15894,ind_ctma_fin_ult1 ind_fond_fin_ult1 ind_dela_f...
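The ranking and joining steps above can be sketched end to end on toy data. The helper `top_n_products` and the product names below are hypothetical. It is an illustration under the assumption that the long table has `ncodpers`, `variable` and `value` columns, not the notebook's exact code.

```python
import pandas as pd

# Hypothetical long table: one row per (customer, product) with a probability.
long_df = pd.DataFrame({
    "ncodpers": [1, 1, 1, 2, 2],
    "variable": ["a_predict", "b_predict", "c_predict", "a_predict", "b_predict"],
    "value": [0.2, 0.5, 0.1, 0.7, 0.3],
})

def top_n_products(df, n=7):
    """Return a Series mapping each customer to their top-n products as one string."""
    # Sort all rows by probability, highest first (stable sort keeps tie order).
    ranked = df.sort_values("value", ascending=False, kind="stable")
    # Keep the first n rows of each customer, preserving the sorted order.
    top = ranked.groupby("ncodpers").head(n)
    # Strip the "_predict" suffix as a literal string, not a regex.
    names = top["variable"].str.replace("_predict", "", regex=False)
    # Join each customer's product names with spaces.
    return names.groupby(top["ncodpers"]).agg(" ".join)

result = top_n_products(long_df, n=2)
print(result)
```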


In [149]:
list_prod = list_prod.set_index('ncodpers')         # reorder customers to match their order in the test data
list_prod=list_prod.reindex(index=test['ncodpers'])
list_prod.head()

Unnamed: 0_level_0,variable
ncodpers,Unnamed: 1_level_1
15889,ind_nomina_ult1 ind_nom_pens_ult1 ind_ecue_fin...
1170555,ind_recibo_ult1 ind_nom_pens_ult1 ind_nomina_u...
1170563,ind_recibo_ult1 ind_nom_pens_ult1 ind_nomina_u...
1170570,ind_recibo_ult1 ind_nomina_ult1 ind_nom_pens_u...
1170581,ind_recibo_ult1 ind_nomina_ult1 ind_nom_pens_u...
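Reindexing against the test IDs also inserts a row of `NaN` for any test customer who has no recommendation. A minimal sketch of detecting and filling those rows follows. The frames are toy data, and filling with an empty string is an assumption about what a submission file should contain, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical recommendations and test-set customer order.
recs = pd.DataFrame({
    "ncodpers": [10, 20, 30],
    "variable": ["x y", "y z", "z"],
})
test_ids = pd.Index([30, 10, 40], name="ncodpers")  # customer 40 has no recommendation

# Align the recommendations to the test order; missing customers become NaN.
aligned = recs.set_index("ncodpers").reindex(test_ids)
missing = aligned["variable"].isna()

# Assumption: an empty string is an acceptable "no recommendation" value.
aligned["variable"] = aligned["variable"].fillna("")
aligned = aligned.reset_index()
print(aligned)
```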


In [153]:
list_prod=list_prod.reset_index()
list_prod.shape

(929615, 3)

In [155]:
list_prod.to_csv('user-user-sim.csv')

Using User-User Collaborative Filtering, we were able to get a score of 0.02001 on the Kaggle Leaderboard.


This score is even lower than that of the model we built earlier, which recommends the most popular products to customers. This is because we have a large number of customers but very few products, and User-User CF does not work well in this setting.


#### What else can be done-

I tried to improve the model's performance with other distance metrics, such as Euclidean and Manhattan, but the correlation metric gave the best results. Other distance metrics could also be tried.
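For reference, the three metrics mentioned above can be compared on a small user-product matrix. This is a NumPy-only sketch: `pairwise_distance` is a hypothetical helper, the matrix is toy data, and the code is not the similarity computation used earlier in this notebook. It builds an n × n × d intermediate array, so it is only suitable for small inputs.

```python
import numpy as np

def pairwise_distance(X, metric="correlation"):
    """Return the n x n matrix of distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    if metric == "euclidean":
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "manhattan":
        diff = X[:, None, :] - X[None, :, :]
        return np.abs(diff).sum(axis=-1)
    if metric == "correlation":
        # Correlation distance = 1 - Pearson correlation between rows.
        # Rows with zero variance produce NaN here.
        return 1.0 - np.corrcoef(X)
    raise ValueError(f"unknown metric: {metric}")

# Toy ownership matrix: rows are users, columns are products.
users = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
])

for name in ("euclidean", "manhattan", "correlation"):
    print(name)
    print(pairwise_distance(users, name).round(3))
```

Users 0 and 2 own exactly opposite products, so their correlation distance is the maximum possible value of 2, while users 0 and 1 differ in a single product.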

In the next notebook, we will show you how to perform Item-Item CF.