# Recommendation using Collaborative filtering: Item-Item Similarity

In the previous notebooks, we have completed:
* EDA
* Data Cleaning
* Recommending popular products
* Recommending products based on User-User Similarity

Here, we are using the <b> Santander Product Recommendation</b> dataset.

# Data Dictionary


We are provided with 1.5 years of customer behavior data from Santander Bank to predict which new products customers will purchase. The data starts at 2015-01-28 and contains monthly records (from January 2015 to May 2016) of the products a customer holds, such as "credit card", "savings account", etc.

Each row represents a customer, each column contains attributes related to customer demographics and products owned by the customer at the end of the month.

We will predict which additional products a customer will get in the last month, 2016-06-28, in addition to what they already had at 2016-05-28.

These products are the columns named ind_(xyz)_ult1, which are columns 25 - 48 in the training data.


### Data Fields- 

<b>fecha_dato</b>- The table is partitioned by this column

###  Demographic information about customers

<b>ncodpers</b>- Customer code

<b>pais_residencia</b>- Customer's Country residence

<b>sexo</b>- Customer's sex

<b>age</b>- Age

<b>indresi</b>- Residence index (S (Yes) or N (No) if the residence country is the same as the bank country)

<b>indext</b>- Foreigner index (S (Yes) or N (No) if the customer's birth country is different from the bank country)

<b>conyuemp</b>- Spouse index. 1 if the customer is the spouse of an employee

<b>canal_entrada</b>- channel used by the customer to join

<b>indfall</b>- Deceased index. N/S

<b>tipodom</b>- Address type. 1, primary address

<b>cod_prov</b>- Province code (customer's address)

<b>nomprov</b>- Province name

<b>ind_actividad_cliente</b>-  Activity index (1, active customer; 0, inactive customer)

<b>renta</b>- Gross income of the household

<b>segmento</b>- Segmentation: 01 - VIP, 02 - Individuals, 03 - College graduates

### Customer Bank Relationship

<b>ind_empleado</b>- Employee index: A active, B ex-employee, F filial, N not employee, P passive

<b>fecha_alta</b>- The date on which the customer became the first holder of a contract in the bank

<b>ind_nuevo</b>- New customer index. 1 if the customer registered in the last 6 months.

<b>antiguedad</b>- Customer seniority (in months)

<b>indrel</b>- 1 (First/Primary), 99 (Primary customer during the month but not at the end of the month)

<b>ult_fec_cli_1t</b>- Last date as primary customer (if the customer is not primary at the end of the month)

<b>indrel_1mes</b>- Customer type at the beginning of the month: 1 (First/Primary customer), 2 (co-owner), P (Potential), 3 (former primary), 4 (former co-owner)

<b>tiprel_1mes</b>- Customer relation type at the beginning of the month: A (active), I (inactive), P (former customer), R (Potential)

### Products of the Bank


<b>ind_ahor_fin_ult1</b>- Saving Account

<b>ind_aval_fin_ult1</b>- Guarantees

<b>ind_cco_fin_ult1</b>- Current Accounts

<b>ind_cder_fin_ult1</b>- Derivada Account

<b>ind_cno_fin_ult1</b>- Payroll Account

<b>ind_ctju_fin_ult1</b>- Junior Account

<b>ind_ctma_fin_ult1</b>-  Más particular Account

<b>ind_ctop_fin_ult1</b>- particular Account

<b>ind_ctpp_fin_ult1</b>-  particular Plus Account

<b>ind_deco_fin_ult1</b>-  Short-term deposits

<b>ind_deme_fin_ult1</b>-  Medium-term deposits

<b>ind_dela_fin_ult1</b>-  Long-term deposits

<b>ind_ecue_fin_ult1</b>-  e-account

<b>ind_fond_fin_ult1</b>-  Funds

<b>ind_hip_fin_ult1</b>-  Mortgage

<b>ind_plan_fin_ult1</b>-  Pensions

<b>ind_pres_fin_ult1</b>-  Loans

<b>ind_reca_fin_ult1</b>-  Taxes

<b>ind_tjcr_fin_ult1</b>-  Credit Card

<b>ind_valo_fin_ult1</b>-  Securities

<b>ind_viv_fin_ult1</b>- Home Account

<b>ind_nomina_ult1</b>-  Payroll

<b>ind_nom_pens_ult1</b>- Pensions

<b>ind_recibo_ult1</b>- Direct Debit



We found a few features in the data dictionary which we included in our hypothesis.


### Test Data

Each row in the test dataset represents a customer with columns related to customer demographics and their relationship with the bank.


As you can see in the Data Dictionary, we have not been given any information about the product attributes.

For example, if a product is a debit account, we do not know the minimum balance required to keep it active.

Therefore, we cannot use Content-Based Filtering for this dataset, because it needs product attributes to compare products with each other.

We can only use <b>Collaborative Filtering</b>. 


# Collaborative Filtering 

The collaborative filtering algorithm uses "User Behavior" for recommending items. This is one of the most commonly used algorithms in the industry, as it does not depend on any additional information about the users or the items. There are different types of collaborative filtering techniques:

* <b>User-User Collaborative Filtering</b>

This algorithm first finds the similarity score between users. Based on this similarity score, it then picks out the most similar users and recommends products which these similar users have liked or bought previously.

* <b>Item-Item Collaborative Filtering</b>

In this algorithm, we compute the similarity between each pair of items and recommend the items most similar to those a user already has.

In the preceding notebook, we used User-User CF for recommendation.

In this notebook, we are going to use the Item-Item Collaborative Filtering algorithm. We will find the similarities between the items and, based on the similarity matrix, recommend products to the customers.
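
As a generic illustration of what "similarity between each pair of items" can mean, the sketch below computes cosine similarity between the columns of a small ownership matrix. The matrix and the helper `item_cosine_similarity` are hypothetical and use NumPy only. This is one common formulation, not necessarily the exact computation performed by the function defined later in this notebook.

```python
import numpy as np

# Hypothetical ownership matrix: rows are customers, columns are products.
ownership = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
], dtype=float)

def item_cosine_similarity(matrix):
    """Cosine similarity between every pair of columns (items)."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for products nobody owns
    normalized = matrix / norms
    return normalized.T @ normalized

item_sim_toy = item_cosine_similarity(ownership)
print(np.round(item_sim_toy, 3))
```

Products 0 and 1 are owned together by two customers and therefore get a high similarity, while products 1 and 2 are never owned together and get a similarity of zero.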

## Loading Packages

In [1]:
#importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing, ensemble, metrics
from tqdm.notebook import tqdm                # tqdm_notebook is deprecated in recent tqdm releases
from IPython.display import display
pd.options.display.max_columns = None
import datetime
import warnings                               # To ignore any warnings 
warnings.filterwarnings("ignore")

## Loading Data

In [2]:
april=pd.read_csv('aprilf')
may=pd.read_csv('may')
test=pd.read_csv('test_f')

Here, we use the April 2016 and May 2016 data as training data.

In [3]:
may=may.drop(['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.1.1'],axis=1)           # removing the undesired columns
april=april.drop(['Unnamed: 0','Unnamed: 0.1'],axis=1) 
test=test.drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)

Now, we will split the column names into two lists: one with customer demographics such as age and gender, and one with the products. A third list holds the product columns we drop.

In [4]:
demographics = ['fecha_dato',
 'ncodpers','ind_empleado','pais_residencia','sexo','age','ind_nuevo','antiguedad','indrel',
 'indrel_1mes','tiprel_1mes','indresi','indext','canal_entrada','indfall',
 'cod_prov','ind_actividad_cliente','renta','segmento']

products = [ 'ind_cco_fin_ult1','ind_recibo_ult1','ind_ecue_fin_ult1','ind_nomina_ult1',
 'ind_nom_pens_ult1','ind_reca_fin_ult1','ind_ctju_fin_ult1','ind_tjcr_fin_ult1',
 'ind_ctma_fin_ult1','ind_dela_fin_ult1','ind_fond_fin_ult1','ind_hip_fin_ult1','ind_plan_fin_ult1',
 'ind_valo_fin_ult1','ind_viv_fin_ult1']

drop_cols=['ind_cno_fin_ult1','ind_deme_fin_ult1','ind_ctop_fin_ult1','ind_ctpp_fin_ult1','ind_pres_fin_ult1']

In [5]:
may=may.drop(drop_cols,axis=1)       # pandas 2.0 no longer accepts axis as a positional argument
april=april.drop(drop_cols,axis=1)

For the product columns, we have excluded cno, deme, ctop, ctpp and pres based on our EDA.

Now, we will merge the April dataset into the May dataset to add the products each customer held in the previous month.

In [6]:
merge_drop_cols = demographics[:]
merge_drop_cols.remove('ncodpers')
may_new = pd.merge(may,april.drop(merge_drop_cols,axis=1), on=['ncodpers'],how='left',suffixes=['','_previous'])   # April columns get the '_previous' suffix
may_new.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_dela_fin_ult1,ind_ecue_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1,nomprov_previous,ind_cco_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_ecue_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_reca_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_recibo_ult1_previous
0,2016-05-28,657789,N,ES,V,working,0.0,senior,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,01 - TOP,1,0,0,1,0,0,0,0,0,1,0,0,0.0,0.0,0,MADRID,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,2016-05-28,657777,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016-05-28,657782,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KFC,N,28.0,MADRID,0.0,rich,02 - PARTICULARES,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,MADRID,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016-05-28,657780,N,ES,V,working,0.0,senior,1.0,1.0,I,S,N,KAT,N,28.0,MADRID,0.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016-05-28,657786,N,ES,V,working,0.0,medium,1.0,1.0,A,S,N,KAT,N,28.0,MADRID,1.0,medium,02 - PARTICULARES,1,0,0,0,0,0,0,0,1,0,0,0,0.0,0.0,1,MADRID,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
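
To make the effect of the left merge with `suffixes` easier to see, here is a minimal sketch on hypothetical two-month data (the frames `may_toy` and `april_toy` are made up for illustration). Columns that exist in both frames keep their name on the left side and receive `_previous` on the right side. Customers missing from the right frame get `NaN`, which is why the notebook fills missing values afterwards.

```python
import pandas as pd

# Hypothetical snapshots: customer 3 only appears in May.
may_toy = pd.DataFrame({"ncodpers": [1, 2, 3], "ind_cco_fin_ult1": [1, 0, 1]})
april_toy = pd.DataFrame({"ncodpers": [1, 2], "ind_cco_fin_ult1": [0, 0]})

merged_toy = pd.merge(may_toy, april_toy, on="ncodpers", how="left",
                      suffixes=("", "_previous"))
print(merged_toy)
```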


In [7]:
may_new=may_new.drop('nomprov_previous',axis=1)

In [8]:
may_new.shape

(931453, 50)

In [9]:
for i in products:
    # products added in May 2016: 1 only where the customer holds it in May but did not in April;
    # clip(lower=0) sets dropped products (-1) to 0 without chained assignment
    may_new[i] = (may_new[i]-may_new[i+"_previous"]).clip(lower=0)
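
The logic of the loop above can be checked on a tiny hypothetical example. Subtracting last month's holdings from this month's gives +1 for a newly added product, -1 for a dropped one, and 0 for no change. Only the additions are kept as purchases. The frame `holdings` and its column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical holdings: 1 = owns the product at month end.
holdings = pd.DataFrame({
    "prod": [1, 0, 1, 0],
    "prod_previous": [0, 1, 1, 0],
})

# +1 = newly added, -1 = dropped, 0 = unchanged; clipping turns drops into 0.
holdings["new_purchase"] = (holdings["prod"] - holdings["prod_previous"]).clip(lower=0)
print(holdings)
```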

In [10]:
may_new.shape

(931453, 50)

In [11]:
may_new[products] = may_new[products].fillna(0)             # filling the missing values created by merge

In [12]:
prev_mon_products = [i + "_previous" for i in products]             # names of the product columns held in April 2016 in the merged dataframe
may_new[prev_mon_products] = may_new[prev_mon_products].fillna(0)   # filling the missing values

Now, we will also add purchase history for the test data

In [13]:
test_col = products + ['ncodpers']                        
test_new = pd.merge(test,may[test_col],on='ncodpers',how='left',suffixes=['','_previous'])   # merge the test data with the products customers had in May 2016
test_new.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1,ind_recibo_ult1,ind_ecue_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_reca_fin_ult1,ind_ctju_fin_ult1,ind_tjcr_fin_ult1,ind_ctma_fin_ult1,ind_dela_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1
0,6/28/2016,15889,F,ES,V,working,0,senior,1,1.0,A,S,N,KAT,N,28.0,MADRID,1,rich,01 - TOP,1,0,0,0.0,0.0,0,0,1,0,0,0,0,0,1,0
1,6/28/2016,1170555,N,ES,V,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,6/28/2016,1170563,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,6/28/2016,1170570,N,ES,H,working,0,medium,1,1.0,A,S,N,KHE,N,28.0,MADRID,1,medium,03 - UNIVERSITARIO,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,6/28/2016,1170581,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,medium,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [31]:
test_new.columns

Index([u'fecha_dato', u'ncodpers', u'ind_empleado', u'pais_residencia',
       u'sexo', u'age', u'ind_nuevo', u'antiguedad', u'indrel', u'indrel_1mes',
       u'tiprel_1mes', u'indresi', u'indext', u'canal_entrada', u'indfall',
       u'cod_prov', u'nomprov', u'ind_actividad_cliente', u'renta',
       u'segmento', u'ind_cco_fin_ult1', u'ind_recibo_ult1',
       u'ind_ecue_fin_ult1', u'ind_nomina_ult1', u'ind_nom_pens_ult1',
       u'ind_reca_fin_ult1', u'ind_ctju_fin_ult1', u'ind_tjcr_fin_ult1',
       u'ind_ctma_fin_ult1', u'ind_dela_fin_ult1', u'ind_fond_fin_ult1',
       u'ind_hip_fin_ult1', u'ind_plan_fin_ult1', u'ind_pres_fin_ult1',
       u'ind_valo_fin_ult1', u'ind_viv_fin_ult1'],
      dtype='object')

In [14]:
test_new.shape

(929615, 35)

In [15]:
test_new.rename(columns={i:j for i,j in zip(products,prev_mon_products)}, inplace=True)       # rename columns in the test data like we did for may_new
test_new[prev_mon_products] = test_new[prev_mon_products].fillna(0)                                # filling the missing values
test_new.head() 

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,ind_nuevo,antiguedad,indrel,indrel_1mes,tiprel_1mes,indresi,indext,canal_entrada,indfall,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,6/28/2016,15889,F,ES,V,working,0,senior,1,1.0,A,S,N,KAT,N,28.0,MADRID,1,rich,01 - TOP,1,0,0,0.0,0.0,0,0,1,0,0,0,0,0,1,0
1,6/28/2016,1170555,N,ES,V,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,6/28/2016,1170563,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,rich,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,6/28/2016,1170570,N,ES,H,working,0,medium,1,1.0,A,S,N,KHE,N,28.0,MADRID,1,medium,03 - UNIVERSITARIO,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,6/28/2016,1170581,N,ES,H,working,0,medium,1,1.0,I,S,N,KHE,N,28.0,MADRID,0,medium,03 - UNIVERSITARIO,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [16]:
may_purchases = may_new[products].copy()            # products customers bought in May 2016
may_prev_mon = may_new[prev_mon_products].copy()    # products customers had in April 2016

Now, we will build the matrix of products each test customer held in May 2016 and reduce it to its unique rows, so that distances are computed only once per distinct combination of products.

In [17]:
prev_mon_products_col = prev_mon_products + ['ncodpers']     # previous month product columns
test_final = test_new[prev_mon_products_col].copy()          # dataset containing customer id and the products customers had in May 2016.
test_final_unique = test_final.drop('ncodpers',axis=1).drop_duplicates().copy().reset_index(drop=True)    # drop the duplicate rows to get unique purchase matrix

In [18]:
test_final_unique.shape                                    # unique combinations of products that customers had in May 2016.

(2020, 15)
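
Reducing roughly 930,000 test rows to 2,020 unique product combinations means the expensive distance computation only has to run once per combination. The sketch below shows the general pattern on hypothetical data: deduplicate, compute once per pattern, then merge the result back onto every customer. The `score` column is a stand-in for the real computation.

```python
import pandas as pd

# Hypothetical purchase histories; customers 10 and 12 share the same pattern.
history = pd.DataFrame({
    "ncodpers": [10, 11, 12],
    "a_previous": [1, 0, 1],
    "b_previous": [0, 1, 0],
})
feature_cols = ["a_previous", "b_previous"]

unique_patterns = history[feature_cols].drop_duplicates().reset_index(drop=True)
# Stand-in for an expensive per-pattern computation.
unique_patterns["score"] = unique_patterns["a_previous"] * 0.9 + unique_patterns["b_previous"] * 0.1

# Map each customer back to the score of their pattern.
scored = pd.merge(history, unique_patterns, on=feature_cols, how="left")
print(scored)
```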

In [19]:
test_final=test_final.drop('ncodpers',axis=1)
test_final.head()

Unnamed: 0,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,1,0,0,0.0,0.0,0,0,1,0,0,0,0,0,1,0
1,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [21]:
predict_product_col = [i + "_predict" for i in prev_mon_products]

def probability_calculation(dataset,training,training_purchases,used_columns,metric,test_remap,print_option=False):
    # 'dataset' holds the unique test rows (purchase history); 'training' is the training data that we calculate distances to
    # note: 'metric' is accepted but not passed to pairwise_distances, so the default Euclidean distance is used
    n = dataset.shape[0]
    sims = []
    for index, row in dataset.iterrows():
        if print_option:
            print(str(index) + '/' + str(n))
        row_use = row.to_frame().T
        #calculate distances between the test point and each training point based on the binary product features
        distances = metrics.pairwise_distances(row_use,training) + 1e-6
        #normalise distances: previously used 24-distances, and 1/(1+distances), but the asymptotic behaviour of 1/distances gives the most accurate predictions.
        norm_distances = 1/distances
        #weight each training customer's purchases by inverse distance and normalise to obtain a purchase likelihood per product
        sim = pd.DataFrame(norm_distances.dot(training_purchases)/np.sum(norm_distances),columns = prev_mon_products)
        sims.append(sim)
    # DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
    probabilities = pd.concat(sims)
    print("probabilities calculated")
    # reindex users for join
    reindexed_output = probabilities.reset_index(drop=True)
    indexed_unique_test = dataset.reset_index(drop=True)
    output_unique = indexed_unique_test.join(reindexed_output,rsuffix='_predict')
    output_final = pd.merge(test_remap,output_unique,on=used_columns,how='left')
    # only select relevant products
    output_final = output_final.drop(used_columns,axis=1)
    output_final.columns = output_final.columns.str.replace("_predict", "")
    output_final.columns = output_final.columns.str.replace("_previous", "_predict")
    # now we have all test probabilities - can average and compare with results
    return output_final
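
The core of the function above is an inverse-distance weighted average: each training customer's purchases are weighted by 1 / (distance + 1e-6) to the test row, and the weighted sum is divided by the sum of the weights. A minimal NumPy sketch on hypothetical arrays, with the Euclidean distance written out by hand instead of calling scikit-learn:

```python
import numpy as np

# Hypothetical data: 3 training customers, 2 history features, 2 target products.
train_history = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
train_purchases = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
test_row = np.array([[1.0, 0.0]])

# Euclidean distance from the test row to every training row, plus a small
# constant so identical rows do not cause a division by zero.
distances = np.sqrt(((train_history - test_row) ** 2).sum(axis=1)) + 1e-6
weights = 1.0 / distances

# Weighted average of the training customers' purchases.
scores = weights @ train_purchases / weights.sum()
print(scores)
```

Because `1/distance` grows very large for identical histories, training customers with exactly the same holdings dominate the average. In this toy case the first product scores close to 0.5 (one of the two identical customers bought it) and the second product scores close to 0.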

In [22]:
probabilities_item = probability_calculation(test_final_unique,may_prev_mon,may_purchases,prev_mon_products,'correlation',test_final)

# prob based on purchases similarity

probabilities calculated


In [23]:
probabilities_item.head()

Unnamed: 0,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict
0,9.372087e-06,0.045159,7e-06,0.004526,0.004527,7.413212e-07,1.042224e-07,1e-05,1e-06,1.243219e-07,1.649606e-07,5.70308e-09,5.342805e-08,4.670515e-07,1.442552e-08
1,7.323946e-09,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.0001336665,8.055281e-12
2,7.323946e-09,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.0001336665,8.055281e-12
3,0.007146801,0.002294,0.000729,0.002895,0.002976,2.565594e-05,0.0001466009,0.000652,0.000766,1.099519e-05,3.665187e-06,4.520534e-12,7.330095e-06,2.19906e-05,1.401779e-11
4,7.323946e-09,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.0001336665,8.055281e-12


In [24]:
probabilities_item['ncodpers']=test['ncodpers']
test_final['ncodpers']=probabilities_item['ncodpers']

In [33]:
probabilities_item=probabilities_item[['ncodpers', 'ind_cco_fin_ult1_predict','ind_recibo_ult1_predict','ind_ecue_fin_ult1_predict','ind_nomina_ult1_predict','ind_nom_pens_ult1_predict','ind_reca_fin_ult1_predict','ind_ctju_fin_ult1_predict','ind_tjcr_fin_ult1_predict','ind_ctma_fin_ult1_predict','ind_dela_fin_ult1_predict','ind_fond_fin_ult1_predict','ind_hip_fin_ult1_predict','ind_plan_fin_ult1_predict','ind_valo_fin_ult1_predict','ind_viv_fin_ult1_predict']]
test_final=test_final[['ncodpers','ind_cco_fin_ult1_previous','ind_recibo_ult1_previous','ind_ecue_fin_ult1_previous','ind_nomina_ult1_previous','ind_nom_pens_ult1_previous','ind_reca_fin_ult1_previous','ind_ctju_fin_ult1_previous','ind_tjcr_fin_ult1_previous','ind_ctma_fin_ult1_previous','ind_dela_fin_ult1_previous','ind_fond_fin_ult1_previous','ind_hip_fin_ult1_previous','ind_plan_fin_ult1_previous','ind_valo_fin_ult1_previous','ind_viv_fin_ult1_previous']]

In [34]:
test_final.head()

Unnamed: 0,ncodpers,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,15889,1,0,0,0.0,0.0,0,0,1,0,0,0,0,0,1,0
1,1170555,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,1170563,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,1170570,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,1170581,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [35]:
pred = pd.merge(probabilities_item,test_final,on='ncodpers',how='left')
pred.head()

Unnamed: 0,ncodpers,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,15889,9.372087e-06,0.045159,7e-06,0.004526,0.004527,7.413212e-07,1.042224e-07,1e-05,1e-06,1.243219e-07,1.649606e-07,5.70308e-09,5.342805e-08,4.670515e-07,1.442552e-08,1,0,0,0.0,0.0,0,0,1,0,0,0,0,0,1,0
1,1170555,7.323946e-09,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.0001336665,8.055281e-12,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,1170563,7.323946e-09,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.0001336665,8.055281e-12,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,1170570,0.007146801,0.002294,0.000729,0.002895,0.002976,2.565594e-05,0.0001466009,0.000652,0.000766,1.099519e-05,3.665187e-06,4.520534e-12,7.330095e-06,2.19906e-05,1.401779e-11,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,1170581,7.323946e-09,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.0001336665,8.055281e-12,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [44]:
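for i in range(1,16):
    # .ix was removed in pandas 1.0; .iloc gives the same positional access.
    # Column i is a prediction, column i+15 the matching '_previous' flag: zero out products already owned.
    pred.iloc[:,i] = np.where(pred.iloc[:,i+15]==1,0,pred.iloc[:,i])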
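
The masking step above can be written with column names instead of positions, which is less fragile if the column order changes. A small sketch on hypothetical data, where the `_predict` / `_previous` naming mirrors the notebook and the products `a` and `b` are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical scores and current holdings for two products.
toy = pd.DataFrame({
    "a_predict": [0.30, 0.10],
    "b_predict": [0.05, 0.20],
    "a_previous": [1, 0],
    "b_previous": [0, 0],
})

for product in ["a", "b"]:
    owned = toy[product + "_previous"] == 1
    # Zero the score where the customer already owns the product.
    toy[product + "_predict"] = np.where(owned, 0.0, toy[product + "_predict"])

print(toy)
```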

In [45]:
pred.head()

Unnamed: 0,ncodpers,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict,ind_cco_fin_ult1_previous,ind_recibo_ult1_previous,ind_ecue_fin_ult1_previous,ind_nomina_ult1_previous,ind_nom_pens_ult1_previous,ind_reca_fin_ult1_previous,ind_ctju_fin_ult1_previous,ind_tjcr_fin_ult1_previous,ind_ctma_fin_ult1_previous,ind_dela_fin_ult1_previous,ind_fond_fin_ult1_previous,ind_hip_fin_ult1_previous,ind_plan_fin_ult1_previous,ind_valo_fin_ult1_previous,ind_viv_fin_ult1_previous
0,15889,0.0,0.045159,7e-06,0.004526,0.004527,7.413212e-07,1.042224e-07,0.0,1e-06,1.243219e-07,1.649606e-07,5.70308e-09,5.342805e-08,0.0,1.442552e-08,1,0,0,0.0,0.0,0,0,1,0,0,0,0,0,1,0
1,1170555,0.0,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.000134,8.055281e-12,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
2,1170563,0.0,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.000134,8.055281e-12,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
3,1170570,0.007147,0.002294,0.000729,0.002895,0.002976,2.565594e-05,0.0001466009,0.000652,0.000766,1.099519e-05,3.665187e-06,4.520534e-12,7.330095e-06,2.2e-05,1.401779e-11,0,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0
4,1170581,0.0,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.000134,8.055281e-12,1,0,0,0.0,0.0,0,0,0,0,0,0,0,0,0,0


In [46]:
item_sim=pred.iloc[:,0:16]            # customer id plus the 15 prediction columns (.ix was removed in pandas 1.0)

In [47]:
item_sim.head()

Unnamed: 0,ncodpers,ind_cco_fin_ult1_predict,ind_recibo_ult1_predict,ind_ecue_fin_ult1_predict,ind_nomina_ult1_predict,ind_nom_pens_ult1_predict,ind_reca_fin_ult1_predict,ind_ctju_fin_ult1_predict,ind_tjcr_fin_ult1_predict,ind_ctma_fin_ult1_predict,ind_dela_fin_ult1_predict,ind_fond_fin_ult1_predict,ind_hip_fin_ult1_predict,ind_plan_fin_ult1_predict,ind_valo_fin_ult1_predict,ind_viv_fin_ult1_predict
0,15889,0.0,0.045159,7e-06,0.004526,0.004527,7.413212e-07,1.042224e-07,0.0,1e-06,1.243219e-07,1.649606e-07,5.70308e-09,5.342805e-08,0.0,1.442552e-08
1,1170555,0.0,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.000134,8.055281e-12
2,1170563,0.0,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.000134,8.055281e-12
3,1170570,0.007147,0.002294,0.000729,0.002895,0.002976,2.565594e-05,0.0001466009,0.000652,0.000766,1.099519e-05,3.665187e-06,4.520534e-12,7.330095e-06,2.2e-05,1.401779e-11
4,1170581,0.0,0.011177,0.000848,0.001694,0.001749,0.0002281548,9.218356e-11,0.001081,0.000622,3.91781e-05,5.070107e-05,2.912334e-12,9.218393e-06,0.000134,8.055281e-12


In [48]:
item_sim_melt=pd.melt(item_sim,id_vars =['ncodpers'],value_vars =item_sim.columns[1:16])

In [49]:
to_rec=item_sim_melt[item_sim_melt['value']!=0.0]
to_rec.shape

(12914434, 3)
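
`pd.melt` turns the wide table (one column per product) into a long table with one row per customer and product, which makes the later per-customer ranking straightforward. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical wide score table.
wide = pd.DataFrame({
    "ncodpers": [1, 2],
    "a_predict": [0.0, 0.4],
    "b_predict": [0.3, 0.1],
})

# One row per (customer, product); the product name goes into 'variable'.
long = pd.melt(wide, id_vars=["ncodpers"], value_vars=["a_predict", "b_predict"])

# Drop zero scores, i.e. products that were masked because they are already owned.
long_nonzero = long[long["value"] != 0.0]
print(long_nonzero)
```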

In [50]:
a=to_rec.groupby('ncodpers')            # grouping on customer id
to_rec_sort = a.apply(lambda x: x.sort_values(by=['value'],ascending=False).head(7)) # Extract Top 7 products for each group in decreasing order of probabilities.
to_rec_sort = to_rec_sort.reset_index(level=0, drop=True)             # reset index
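
`groupby(...).apply` with a Python lambda can be slow on a table with about 13 million rows. One alternative that should select the same top rows (apart from how ties are ordered) is to sort once and then take the first rows of each group. The sketch below uses hypothetical data and `top_n = 2` instead of 7 to keep the output short.

```python
import pandas as pd

# Hypothetical long-format scores.
long_scores = pd.DataFrame({
    "ncodpers": [1, 1, 1, 2, 2],
    "variable": ["a", "b", "c", "a", "b"],
    "value": [0.1, 0.5, 0.3, 0.2, 0.9],
})

top_n = 2  # the notebook keeps 7
top = (long_scores
       .sort_values(["ncodpers", "value"], ascending=[True, False])
       .groupby("ncodpers")
       .head(top_n))
print(top)
```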

In [52]:
to_rec_sort.head()

Unnamed: 0,ncodpers,variable,value
929615,15889,ind_recibo_ult1_predict,0.045159
3718460,15889,ind_nom_pens_ult1_predict,0.004527
2788845,15889,ind_nomina_ult1_predict,0.004526
1859230,15889,ind_ecue_fin_ult1_predict,7e-06
7436920,15889,ind_ctma_fin_ult1_predict,1e-06


In [53]:
list_prod=to_rec_sort.groupby('ncodpers')['variable'].apply(list)     # Create a list of products
list_prod=list_prod.reset_index()                                     # reset index
list_prod.variable=list_prod.variable.apply(lambda l: " ".join(l))    # join the list

In [75]:
list_prod.head()

Unnamed: 0_level_0,variable
ncodpers,Unnamed: 1_level_1
15889,ind_recibo_ult1 ind_nom_pens_ult1 ind_nomina_u...
1170555,ind_recibo_ult1 ind_nom_pens_ult1 ind_nomina_u...
1170563,ind_recibo_ult1 ind_nom_pens_ult1 ind_nomina_u...
1170570,ind_cco_fin_ult1 ind_nom_pens_ult1 ind_nomina_...
1170581,ind_recibo_ult1 ind_nom_pens_ult1 ind_nomina_u...


In [55]:
list_prod['variable'] = list_prod['variable'].str.replace('_predict', '')    # remove the _predict from the products name
list_prod.head()

Unnamed: 0,ncodpers,variable
0,15889,ind_recibo_ult1 ind_nom_pens_ult1 ind_nomina_u...
1,15890,ind_cco_fin_ult1 ind_ctma_fin_ult1 ind_reca_fi...
2,15892,ind_nom_pens_ult1 ind_nomina_ult1 ind_fond_fin...
3,15893,ind_cco_fin_ult1 ind_nom_pens_ult1 ind_nomina_...
4,15894,ind_ctma_fin_ult1 ind_fond_fin_ult1 ind_dela_f...
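
The two steps above (collecting each customer's ranked products into a space-separated string and stripping the `_predict` suffix) can also be expressed as one chain. This is a sketch on hypothetical data, assuming the rows are already sorted by decreasing score within each customer:

```python
import pandas as pd

# Hypothetical ranked recommendations, best first within each customer.
ranked = pd.DataFrame({
    "ncodpers": [1, 1, 2],
    "variable": ["ind_recibo_ult1_predict", "ind_nomina_ult1_predict", "ind_cco_fin_ult1_predict"],
})

submission = (ranked.groupby("ncodpers")["variable"]
              .agg(" ".join)                              # keeps the within-group order
              .str.replace("_predict", "", regex=False)   # plain substring replacement
              .reset_index())
print(submission)
```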


In [76]:
list_prod = list_prod.set_index('ncodpers')             # index by customer id (run once: the column is consumed by set_index)
list_prod=list_prod.reindex(index=test['ncodpers'])     # re-order customers to match the test data

In [57]:
list_prod.to_csv('item-item CF.csv')
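
Re-ordering with `reindex` puts the recommendations in the same customer order as the test file. Customers that have no recommendation row end up with `NaN`. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical recommendations and test-file order.
recs = pd.DataFrame({"ncodpers": [1, 2, 3], "variable": ["a b", "c", "d e"]})
test_order = pd.Series([3, 1, 2, 4], name="ncodpers")   # customer 4 has no recommendation

ordered = recs.set_index("ncodpers").reindex(index=test_order)
print(ordered)
```

If the submission must not contain missing values, such rows could be filled with a fallback, for example a list of the most popular products.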

With Item-Item CF, we got a score of 0.02388 on the Kaggle leaderboard, which is our best score so far. Item-Item CF tends to work well when there are many more customers than products, as is the case here.
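
The Kaggle competition for this dataset is scored with mean average precision at 7 (MAP@7), which is presumably why the top 7 products are kept per customer. The sketch below is a plain-Python approximation of that metric for local validation, not the official scoring code. The function names and the toy inputs are assumptions made for illustration.

```python
def average_precision_at_k(actual, predicted, k=7):
    """AP@k for one customer: 'actual' is the set of products really added,
    'predicted' is the ranked list of recommendations."""
    if not actual:
        return 0.0                      # customers who added nothing contribute 0
    hits = 0
    score = 0.0
    for rank, product in enumerate(predicted[:k], start=1):
        # count a product only the first time it appears in the ranking
        if product in actual and product not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank
    return score / min(len(actual), k)

def mean_average_precision_at_k(actual_lists, predicted_lists, k=7):
    """Mean of AP@k over all customers."""
    scores = [average_precision_at_k(set(a), p, k)
              for a, p in zip(actual_lists, predicted_lists)]
    return sum(scores) / len(scores)

# Hypothetical example: the first customer added one product, the second none.
example_score = mean_average_precision_at_k(
    [["ind_recibo_ult1"], []],
    [["ind_nomina_ult1", "ind_recibo_ult1"], ["ind_cco_fin_ult1"]],
)
print(example_score)
```

Because most customers add no product in a given month, they contribute 0 to the mean, so MAP@7 scores for this problem are small in absolute terms.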

#### What else can be done

The results of both our User-User CF and Item-Item CF can be combined to get a hybrid model based on customer demographics and purchase history.
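
One simple way to combine two models is a weighted average of their per-product scores. The sketch below is only an illustration under strong assumptions: the score matrices are hypothetical, both are assumed to be on a comparable scale, and the weight of 0.5 is arbitrary and would need tuning on a validation month.

```python
import numpy as np

# Hypothetical per-product scores for two customers from the two models.
user_user_scores = np.array([[0.2, 0.6], [0.5, 0.1]])
item_item_scores = np.array([[0.4, 0.2], [0.1, 0.3]])

def blend(scores_a, scores_b, weight_a=0.5):
    """Convex combination of two score matrices with the same shape."""
    return weight_a * scores_a + (1.0 - weight_a) * scores_b

hybrid = blend(user_user_scores, item_item_scores, weight_a=0.5)
best_product = hybrid.argmax(axis=1)   # index of the top-scoring product per customer
print(hybrid, best_product)
```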

We can also try boosting algorithms to improve our position on the leaderboard.