# **Cross-selling Recommendations for Banking Products** 
### (Analyzed by: Jesumbo Joseph Oludipe)

## **Problem Description:**
XYZ Credit Union, a prominent financial institution in Latin America, sells a range of banking products, including credit cards, deposit accounts, retirement accounts, and safe deposit boxes. Despite this success, existing customers show little inclination to purchase more than one product, which suggests that the bank is not making effective use of cross-selling. To address this, XYZ Credit Union has asked JB Analytics to analyze its customer data and provide actionable insights for increasing cross-selling.

## **Business Understanding:**
XYZ Credit Union has a large customer base, but its existing customers do not purchase more than one product, which points to a lack of effective cross-selling strategies. This project aims to identify the factors that affect cross-selling, understand customer behavior, and propose strategies that help the bank sell multiple products to its existing customers.

In [1]:
# Import required libraries and packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Load the dataset with pd.read_csv (low_memory=False avoids mixed-type inference within a column)
df = pd.read_csv(r"C:\Users\Oludipe j\Documents\Data Analysis\Data Glacier\Cross Selling\Train.csv", low_memory=False)

df

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
0,2015-01-28,1375586,N,ES,H,35,2015-01-12,0.0,6,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
1,2015-01-28,1050611,N,ES,V,23,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
2,2015-01-28,1050612,N,ES,V,23,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
3,2015-01-28,1050613,N,ES,H,22,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
4,2015-01-28,1050614,N,ES,V,23,2012-08-10,0.0,35,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13647304,2016-05-28,1166765,N,ES,V,22,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
13647305,2016-05-28,1166764,N,ES,V,23,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
13647306,2016-05-28,1166763,N,ES,H,47,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
13647307,2016-05-28,1166789,N,ES,H,22,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0


In [3]:
# Set to display all columns
pd.set_option('display.max_columns', None)

df.head(3)

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,ult_fec_cli_1t,indrel_1mes,tiprel_1mes,indresi,indext,conyuemp,canal_entrada,indfall,tipodom,cod_prov,nomprov,ind_actividad_cliente,renta,segmento,ind_ahor_fin_ult1,ind_aval_fin_ult1,ind_cco_fin_ult1,ind_cder_fin_ult1,ind_cno_fin_ult1,ind_ctju_fin_ult1,ind_ctma_fin_ult1,ind_ctop_fin_ult1,ind_ctpp_fin_ult1,ind_deco_fin_ult1,ind_deme_fin_ult1,ind_dela_fin_ult1,ind_ecue_fin_ult1,ind_fond_fin_ult1,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
0,2015-01-28,1375586,N,ES,H,35,2015-01-12,0.0,6,1.0,,1.0,A,S,N,,KHL,N,1.0,29.0,MALAGA,1.0,87218.1,02 - PARTICULARES,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1,2015-01-28,1050611,N,ES,V,23,2012-08-10,0.0,35,1.0,,1.0,I,S,S,,KHE,N,1.0,13.0,CIUDAD REAL,0.0,35548.74,03 - UNIVERSITARIO,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
2,2015-01-28,1050612,N,ES,V,23,2012-08-10,0.0,35,1.0,,1.0,I,S,N,,KHE,N,1.0,13.0,CIUDAD REAL,0.0,122179.11,03 - UNIVERSITARIO,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0


## **Data Understanding:**
The dataset contains customer information such as age, gender, and country of residence, together with indicators for the bank products each customer currently holds, such as credit cards, deposit accounts, retirement accounts, and safe deposit boxes. In total it comprises 48 features (columns) and 13,647,308 observations (rows), a large volume of data for analyzing the bank's customer base and product preferences.
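With a table of this size, it can help to confirm the row and column counts and the memory footprint before any cleaning. The snippet below is a minimal sketch on a tiny, made-up frame; `summarize_frame` and the `toy` data are hypothetical names introduced only for illustration.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Return row count, column count, and approximate memory use in MB."""
    rows, cols = frame.shape
    # deep=True also counts the memory held by Python strings in object columns
    mem_mb = frame.memory_usage(deep=True).sum() / 1024 ** 2
    return {"rows": rows, "columns": cols, "memory_mb": round(mem_mb, 4)}


# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({"ncodpers": [1, 2, 3], "age": ["35", "23", " NA"]})
summary = summarize_frame(toy)
print(summary)
```

Applied to the full dataset, the same call would be `summarize_frame(df)`.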

## **Data Cleaning:**
Data cleaning ensures that the data is accurate and reliable for downstream analysis. Because XYZ Credit Union is a Latin American bank, many of the column names and data values are written in Spanish; these words and abbreviations were translated to English for better understanding.


In [4]:
# Check all column names

print(df.columns)

Index(['fecha_dato', 'ncodpers', 'ind_empleado', 'pais_residencia', 'sexo',
       'age', 'fecha_alta', 'ind_nuevo', 'antiguedad', 'indrel',
       'ult_fec_cli_1t', 'indrel_1mes', 'tiprel_1mes', 'indresi', 'indext',
       'conyuemp', 'canal_entrada', 'indfall', 'tipodom', 'cod_prov',
       'nomprov', 'ind_actividad_cliente', 'renta', 'segmento',
       'ind_ahor_fin_ult1', 'ind_aval_fin_ult1', 'ind_cco_fin_ult1',
       'ind_cder_fin_ult1', 'ind_cno_fin_ult1', 'ind_ctju_fin_ult1',
       'ind_ctma_fin_ult1', 'ind_ctop_fin_ult1', 'ind_ctpp_fin_ult1',
       'ind_deco_fin_ult1', 'ind_deme_fin_ult1', 'ind_dela_fin_ult1',
       'ind_ecue_fin_ult1', 'ind_fond_fin_ult1', 'ind_hip_fin_ult1',
       'ind_plan_fin_ult1', 'ind_pres_fin_ult1', 'ind_reca_fin_ult1',
       'ind_tjcr_fin_ult1', 'ind_valo_fin_ult1', 'ind_viv_fin_ult1',
       'ind_nomina_ult1', 'ind_nom_pens_ult1', 'ind_recibo_ult1'],
      dtype='object')


In [5]:
# Rename all columns from Spanish to descriptive English names

df = df.rename(columns={'fecha_dato': 'data_date', 'ncodpers': 'customer_id', 'ind_empleado': 'employee_index', 'pais_residencia': 'country_of_residence', 'sexo': 'gender',
       'fecha_alta': 'date_joined', 'ind_nuevo': 'new_customer', 'antiguedad': 'customer_seniority', 'indrel': 'primary_index',
       'ult_fec_cli_1t': 'last_date_as_primary', 'indrel_1mes': 'customer_type', 'tiprel_1mes': 'customer_relation', 'indresi': 'resident', 'indext': 'foreigner',
       'conyuemp': 'employee_spouse', 'canal_entrada': 'channel', 'indfall': 'deceased', 'tipodom': 'address_type', 'cod_prov': 'province_code',
       'nomprov': 'province_name', 'ind_actividad_cliente': 'activity_index', 'renta': 'gross_household_income', 'segmento': 'customer_segment',
       'ind_ahor_fin_ult1': 'saving_acc', 'ind_aval_fin_ult1': 'guarantees', 'ind_cco_fin_ult1': 'current_acc',
       'ind_cder_fin_ult1': 'derivative_acc', 'ind_cno_fin_ult1': 'payroll_acc', 'ind_ctju_fin_ult1': 'junior_acc',
       'ind_ctma_fin_ult1': 'more_particular_acc', 'ind_ctop_fin_ult1': 'particular_acc', 'ind_ctpp_fin_ult1': 'particular_plus_acc',
       'ind_deco_fin_ult1': 'short_term_deposits', 'ind_deme_fin_ult1': 'medium_term_deposits', 'ind_dela_fin_ult1': 'long_term_deposits',
       'ind_ecue_fin_ult1': 'e_account', 'ind_fond_fin_ult1': 'funds', 'ind_hip_fin_ult1': 'mortgage',
       'ind_plan_fin_ult1': 'pensions_plan', 'ind_pres_fin_ult1': 'loans', 'ind_reca_fin_ult1': 'taxes',
       'ind_tjcr_fin_ult1': 'credit_card', 'ind_valo_fin_ult1': 'securities', 'ind_viv_fin_ult1': 'home_acc',
       'ind_nomina_ult1': 'payroll', 'ind_nom_pens_ult1': 'pensions', 'ind_recibo_ult1': 'direct_debit'})

print(df.columns)

Index(['data_date', 'customer_id', 'employee_index', 'country_of_residence',
       'gender', 'age', 'date_joined', 'new_customer', 'customer_seniority',
       'primary_index', 'last_date_as_primary', 'customer_type',
       'customer_relation', 'resident', 'foreigner', 'employee_spouse',
       'channel', 'deceased', 'address_type', 'province_code', 'province_name',
       'activity_index', 'gross_household_income', 'customer_segment',
       'saving_acc', 'guarantees', 'current_acc', 'derivative_acc',
       'payroll_acc', 'junior_acc', 'more_particular_acc', 'particular_acc',
       'particular_plus_acc', 'short_term_deposits', 'medium_term_deposits',
       'long_term_deposits', 'e_account', 'funds', 'mortgage', 'pensions_plan',
       'loans', 'taxes', 'credit_card', 'securities', 'home_acc', 'payroll',
       'pensions', 'direct_debit'],
      dtype='object')
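Because `DataFrame.rename` ignores dictionary keys that do not match any column by default, a typo in a long mapping can pass unnoticed. The following is a minimal sketch of two ways to catch that, using a small made-up frame; the `edad` key is a deliberately wrong, hypothetical entry.

```python
import pandas as pd

# Hypothetical three-column stand-in for the raw data
toy = pd.DataFrame(
    {"fecha_dato": ["2015-01-28"], "ncodpers": [1375586], "sexo": ["H"]}
)

rename_map = {
    "fecha_dato": "data_date",
    "ncodpers": "customer_id",
    "sexo": "gender",
    "edad": "age",  # deliberately wrong key: no such column exists
}

# Option 1: list the mapping keys that match no column
missing_keys = [old for old in rename_map if old not in toy.columns]
print(missing_keys)

# Default behaviour: the unmatched key is skipped silently
renamed = toy.rename(columns=rename_map)

# Option 2: ask pandas to raise on unmatched keys
try:
    toy.rename(columns=rename_map, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```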


In [6]:
# View Table
df.head(2)

Unnamed: 0,data_date,customer_id,employee_index,country_of_residence,gender,age,date_joined,new_customer,customer_seniority,primary_index,last_date_as_primary,customer_type,customer_relation,resident,foreigner,employee_spouse,channel,deceased,address_type,province_code,province_name,activity_index,gross_household_income,customer_segment,saving_acc,guarantees,current_acc,derivative_acc,payroll_acc,junior_acc,more_particular_acc,particular_acc,particular_plus_acc,short_term_deposits,medium_term_deposits,long_term_deposits,e_account,funds,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
0,2015-01-28,1375586,N,ES,H,35,2015-01-12,0.0,6,1.0,,1.0,A,S,N,,KHL,N,1.0,29.0,MALAGA,1.0,87218.1,02 - PARTICULARES,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1,2015-01-28,1050611,N,ES,V,23,2012-08-10,0.0,35,1.0,,1.0,I,S,S,,KHE,N,1.0,13.0,CIUDAD REAL,0.0,35548.74,03 - UNIVERSITARIO,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0


In [7]:
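# Recode the gender column from "H"/"V" to "M"/"F" to standardize the data.
# NOTE: verify this mapping against the data dictionary; in Spanish the H/V pair usually
# abbreviates Hembra (female) / Varón (male), which would call for H -> F and V -> M.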

df['gender'] = df['gender'].replace({'H': 'M', 'V': 'F'})

df.head(3)

Unnamed: 0,data_date,customer_id,employee_index,country_of_residence,gender,age,date_joined,new_customer,customer_seniority,primary_index,last_date_as_primary,customer_type,customer_relation,resident,foreigner,employee_spouse,channel,deceased,address_type,province_code,province_name,activity_index,gross_household_income,customer_segment,saving_acc,guarantees,current_acc,derivative_acc,payroll_acc,junior_acc,more_particular_acc,particular_acc,particular_plus_acc,short_term_deposits,medium_term_deposits,long_term_deposits,e_account,funds,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
0,2015-01-28,1375586,N,ES,M,35,2015-01-12,0.0,6,1.0,,1.0,A,S,N,,KHL,N,1.0,29.0,MALAGA,1.0,87218.1,02 - PARTICULARES,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1,2015-01-28,1050611,N,ES,F,23,2012-08-10,0.0,35,1.0,,1.0,I,S,S,,KHE,N,1.0,13.0,CIUDAD REAL,0.0,35548.74,03 - UNIVERSITARIO,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
2,2015-01-28,1050612,N,ES,F,23,2012-08-10,0.0,35,1.0,,1.0,I,S,N,,KHE,N,1.0,13.0,CIUDAD REAL,0.0,122179.11,03 - UNIVERSITARIO,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0


In [8]:
# Convert the "S" and "N" values to "Yes" and "No" to make them more explicit.
df['resident'] = df['resident'].replace({'S': 'Yes', 'N': 'No'})

df['foreigner'] = df['foreigner'].replace({'S': 'Yes', 'N': 'No'})

df['deceased'] = df['deceased'].replace({'S': 'Yes', 'N': 'No'})

df.head(3)

Unnamed: 0,data_date,customer_id,employee_index,country_of_residence,gender,age,date_joined,new_customer,customer_seniority,primary_index,last_date_as_primary,customer_type,customer_relation,resident,foreigner,employee_spouse,channel,deceased,address_type,province_code,province_name,activity_index,gross_household_income,customer_segment,saving_acc,guarantees,current_acc,derivative_acc,payroll_acc,junior_acc,more_particular_acc,particular_acc,particular_plus_acc,short_term_deposits,medium_term_deposits,long_term_deposits,e_account,funds,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
0,2015-01-28,1375586,N,ES,M,35,2015-01-12,0.0,6,1.0,,1.0,A,Yes,No,,KHL,No,1.0,29.0,MALAGA,1.0,87218.1,02 - PARTICULARES,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1,2015-01-28,1050611,N,ES,F,23,2012-08-10,0.0,35,1.0,,1.0,I,Yes,Yes,,KHE,No,1.0,13.0,CIUDAD REAL,0.0,35548.74,03 - UNIVERSITARIO,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
2,2015-01-28,1050612,N,ES,F,23,2012-08-10,0.0,35,1.0,,1.0,I,Yes,No,,KHE,No,1.0,13.0,CIUDAD REAL,0.0,122179.11,03 - UNIVERSITARIO,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
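The three flag columns share the same "S"/"N" coding, so the recoding can also be written once over a list of columns. This is a minimal sketch on made-up data; `yes_no_map` and `flag_columns` are illustrative names.

```python
import pandas as pd

# Hypothetical rows using the same S/N coding as the real columns
toy = pd.DataFrame(
    {
        "resident": ["S", "S", "N"],
        "foreigner": ["N", "S", "N"],
        "deceased": ["N", "N", "S"],
    }
)

yes_no_map = {"S": "Yes", "N": "No"}
flag_columns = ["resident", "foreigner", "deceased"]

# Apply the same mapping to every flag column in one step
toy[flag_columns] = toy[flag_columns].replace(yes_no_map)
print(toy)
```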


In [9]:
# Convert the values in the customer_segment column: '01 - TOP' to 'VIP',
# '02 - PARTICULARES' to 'Individuals', and '03 - UNIVERSITARIO' to 'College Graduate'.

# Define a dictionary to map the old values to the new values
mapping_dict = {
    '01 - TOP': 'VIP',
    '02 - PARTICULARES': 'Individuals',
    '03 - UNIVERSITARIO': 'College Graduate'
}

# Replace the old values in the 'customer_segment' column with the new values using the mapping dictionary
df['customer_segment'] = df['customer_segment'].replace(mapping_dict)

df.head(3)

Unnamed: 0,data_date,customer_id,employee_index,country_of_residence,gender,age,date_joined,new_customer,customer_seniority,primary_index,last_date_as_primary,customer_type,customer_relation,resident,foreigner,employee_spouse,channel,deceased,address_type,province_code,province_name,activity_index,gross_household_income,customer_segment,saving_acc,guarantees,current_acc,derivative_acc,payroll_acc,junior_acc,more_particular_acc,particular_acc,particular_plus_acc,short_term_deposits,medium_term_deposits,long_term_deposits,e_account,funds,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
0,2015-01-28,1375586,N,ES,M,35,2015-01-12,0.0,6,1.0,,1.0,A,Yes,No,,KHL,No,1.0,29.0,MALAGA,1.0,87218.1,Individuals,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1,2015-01-28,1050611,N,ES,F,23,2012-08-10,0.0,35,1.0,,1.0,I,Yes,Yes,,KHE,No,1.0,13.0,CIUDAD REAL,0.0,35548.74,College Graduate,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
2,2015-01-28,1050612,N,ES,F,23,2012-08-10,0.0,35,1.0,,1.0,I,Yes,No,,KHE,No,1.0,13.0,CIUDAD REAL,0.0,122179.11,College Graduate,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
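`Series.replace` leaves any label that is not in the dictionary untouched, whereas `Series.map` turns it into NaN. The sketch below, on a made-up series, shows the difference and one way to list labels the mapping does not cover; the `'04 - OTHER'` label is hypothetical and is not claimed to occur in the data.

```python
import pandas as pd

mapping_dict = {
    "01 - TOP": "VIP",
    "02 - PARTICULARES": "Individuals",
    "03 - UNIVERSITARIO": "College Graduate",
}

# Hypothetical values; the last label is invented to show the difference
segments = pd.Series(
    ["01 - TOP", "02 - PARTICULARES", "03 - UNIVERSITARIO", None, "04 - OTHER"]
)

replaced = segments.replace(mapping_dict)  # unmapped labels are kept as-is
mapped = segments.map(mapping_dict)        # unmapped labels become NaN

# Labels present in the data but absent from the mapping
unexpected = sorted(set(segments.dropna()) - set(mapping_dict))
print(unexpected)
```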


In [10]:
# Check data types for all columns
print(df.dtypes)

data_date                  object
customer_id                 int64
employee_index             object
country_of_residence       object
gender                     object
age                        object
date_joined                object
new_customer              float64
customer_seniority         object
primary_index             float64
last_date_as_primary       object
customer_type              object
customer_relation          object
resident                   object
foreigner                  object
employee_spouse            object
channel                    object
deceased                   object
address_type              float64
province_code             float64
province_name              object
activity_index            float64
gross_household_income    float64
customer_segment           object
saving_acc                  int64
guarantees                  int64
current_acc                 int64
derivative_acc              int64
payroll_acc                 int64
junior_acc    

In [11]:
# Convert columns to appropriate data types

# Convert the 'data_date' and 'date_joined' columns to datetime
df['data_date'] = pd.to_datetime(df['data_date'])

df['date_joined'] = pd.to_datetime(df['date_joined'])

# Replace the ' NA' placeholder string in 'age' with NaN
df['age'] = df['age'].replace(' NA', np.nan)

# Convert the 'age' column from object to float
df['age'] = df['age'].astype(float)

# Check data types for all columns
print(df.dtypes)

data_date                 datetime64[ns]
customer_id                        int64
employee_index                    object
country_of_residence              object
gender                            object
age                              float64
date_joined               datetime64[ns]
new_customer                     float64
customer_seniority                object
primary_index                    float64
last_date_as_primary              object
customer_type                     object
customer_relation                 object
resident                          object
foreigner                         object
employee_spouse                   object
channel                           object
deceased                          object
address_type                     float64
province_code                    float64
province_name                     object
activity_index                   float64
gross_household_income           float64
customer_segment                  object
saving_acc      
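The output above shows `customer_seniority` still stored as `object`. One possible way to handle it, and an alternative to the replace-then-cast approach used for `age`, is `pd.to_numeric` with `errors="coerce"`, which turns any unparseable entry into NaN. This is a minimal sketch on made-up values; the placeholder strings in `toy` are assumptions about what such a column might contain.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including padded numbers and 'NA' placeholders
toy = pd.DataFrame(
    {
        "age": ["35", " 23", " NA"],
        "customer_seniority": ["6", "     35", "     NA"],
    }
)

for col in ["age", "customer_seniority"]:
    # Strip padding first, then coerce anything non-numeric to NaN
    cleaned = toy[col].astype(str).str.strip()
    toy[col] = pd.to_numeric(cleaned, errors="coerce")

print(toy.dtypes)
```

The trade-off is that coercion hides unexpected values, so it is worth counting the resulting NaNs before and after the conversion.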

In [12]:
# Check for missing data

print(df.isnull().sum())

data_date                        0
customer_id                      0
employee_index               27734
country_of_residence         27734
gender                       27804
age                          27734
date_joined                  27734
new_customer                 27734
customer_seniority               0
primary_index                27734
last_date_as_primary      13622516
customer_type               149781
customer_relation           149781
resident                     27734
foreigner                    27734
employee_spouse           13645501
channel                     186126
deceased                     27734
address_type                 27735
province_code                93591
province_name                93591
activity_index               27734
gross_household_income     2794375
customer_segment            189368
saving_acc                       0
guarantees                       0
current_acc                      0
derivative_acc                   0
payroll_acc         

In [13]:
# Drop rows where 'gender' is missing (most of these rows also lack the other demographic fields)
df = df.dropna(subset=['gender'])
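It is worth recording how many rows a `dropna` call removes; the missing-value counts reported before and after this step are consistent with 27,804 rows being dropped. The snippet below is a minimal sketch of that bookkeeping on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two customers have no gender recorded
toy = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "gender": ["M", None, "F", None],
        "age": [35.0, np.nan, 23.0, 40.0],
    }
)

rows_before = len(toy)
cleaned = toy.dropna(subset=["gender"])
rows_dropped = rows_before - len(cleaned)

# Missing values that remain after the drop, per column
remaining_missing = cleaned.isnull().sum()

print(f"Dropped {rows_dropped} of {rows_before} rows")
print(remaining_missing)
```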


In [14]:
# Re-check missing data after dropping rows with missing gender

print(df.isnull().sum())

data_date                        0
customer_id                      0
employee_index                   0
country_of_residence             0
gender                           0
age                              0
date_joined                      0
new_customer                     0
customer_seniority               0
primary_index                    0
last_date_as_primary      13594712
customer_type               122047
customer_relation           122047
resident                         0
foreigner                        0
employee_spouse           13617697
channel                     158391
deceased                         0
address_type                     1
province_code                65857
province_name                65857
activity_index                   0
gross_household_income     2766607
customer_segment            161633
saving_acc                       0
guarantees                       0
current_acc                      0
derivative_acc                   0
payroll_acc         

In [15]:
# Keep one row per customer ID: the row with the most recent data_date
df_sorted = df.loc[df.groupby('customer_id')['data_date'].idxmax()]

df_sorted

Unnamed: 0,data_date,customer_id,employee_index,country_of_residence,gender,age,date_joined,new_customer,customer_seniority,primary_index,last_date_as_primary,customer_type,customer_relation,resident,foreigner,employee_spouse,channel,deceased,address_type,province_code,province_name,activity_index,gross_household_income,customer_segment,saving_acc,guarantees,current_acc,derivative_acc,payroll_acc,junior_acc,more_particular_acc,particular_acc,particular_plus_acc,short_term_deposits,medium_term_deposits,long_term_deposits,e_account,funds,mortgage,pensions_plan,loans,taxes,credit_card,securities,home_acc,payroll,pensions,direct_debit
13026343,2016-05-28,15889,F,ES,F,56.0,1995-01-16,0.0,255,1.0,,1,A,Yes,No,N,KAT,No,1.0,28.0,MADRID,1.0,326124.90,VIP,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0.0,0.0,0
13026342,2016-05-28,15890,A,ES,F,63.0,1995-01-16,0.0,256,1.0,,1,A,Yes,No,N,KAT,No,1.0,28.0,MADRID,1.0,71461.20,VIP,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1.0,1.0,1
5319232,2015-08-28,15891,N,ES,M,59.0,2015-07-28,0.0,246,99.0,2015-08-05,1,A,Yes,No,N,KAT,No,1.0,28.0,MADRID,0.0,,Individuals,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
13026341,2016-05-28,15892,F,ES,M,62.0,1995-01-16,0.0,256,1.0,,1,A,Yes,No,N,KAT,No,1.0,28.0,MADRID,1.0,430477.41,VIP,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,0,0.0,0.0,1
13026340,2016-05-28,15893,N,ES,F,63.0,1997-10-03,0.0,256,1.0,,1,A,Yes,No,N,KAT,No,1.0,28.0,MADRID,1.0,430477.41,Individuals,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13336818,2016-05-28,1553685,N,ES,F,52.0,2016-05-31,1.0,0,1.0,,,,Yes,No,,,No,1.0,13.0,CIUDAD REAL,0.0,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
13336817,2016-05-28,1553686,N,ES,M,30.0,2016-05-31,1.0,0,1.0,,,,Yes,Yes,,,No,1.0,41.0,SEVILLA,0.0,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
13336816,2016-05-28,1553687,N,ES,F,21.0,2016-05-31,1.0,0,1.0,,,,Yes,No,,,No,1.0,28.0,MADRID,0.0,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
13336815,2016-05-28,1553688,N,ES,M,43.0,2016-05-31,1.0,0,1.0,,,,Yes,No,,,No,1.0,39.0,CANTABRIA,0.0,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0


This reduced the dataset to 949,609 rows, one per unique customer ID.
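The same "latest row per customer" selection can also be expressed with a sort followed by `drop_duplicates`. The sketch below compares both approaches on a small made-up frame; it is an alternative formulation, not a claim about which one performs better on the full data.

```python
import pandas as pd

# Hypothetical monthly snapshots for three customers
toy = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 3],
        "data_date": pd.to_datetime(
            ["2015-01-28", "2016-05-28", "2015-08-28", "2015-01-28", "2016-05-28"]
        ),
        "current_acc": [1, 0, 1, 1, 0],
    }
)

# Approach used above: index of the latest date within each customer group
via_idxmax = toy.loc[toy.groupby("customer_id")["data_date"].idxmax()]

# Alternative: sort by customer and date, then keep the last row per customer
via_dedup = toy.sort_values(["customer_id", "data_date"]).drop_duplicates(
    "customer_id", keep="last"
)

print(via_idxmax.equals(via_dedup))
print(via_idxmax["customer_id"].is_unique)
```

If a customer has two rows with the same latest date, the two approaches may keep different rows, so it is worth checking for such ties first.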

In [16]:
# Check for missing data in the reduced dataset

print(df_sorted.isnull().sum())

data_date                      0
customer_id                    0
employee_index                 0
country_of_residence           0
gender                         0
age                            0
date_joined                    0
new_customer                   0
customer_seniority             0
primary_index                  0
last_date_as_primary      930280
customer_type               7655
customer_relation           7655
resident                       0
foreigner                      0
employee_spouse           949488
channel                    11433
deceased                       0
address_type                   0
province_code               4018
province_name               4018
activity_index                 0
gross_household_income    240201
customer_segment           11685
saving_acc                     0
guarantees                     0
current_acc                    0
derivative_acc                 0
payroll_acc                    0
junior_acc                     0
more_parti

In [17]:
# Check the percentage of missing data per column in the reduced dataset

for col in df_sorted.columns:
    # isnull().mean() is a fraction between 0 and 1; multiply by 100 for a percentage
    pct_missing = df_sorted[col].isnull().mean() * 100
    print(f'{col} - {pct_missing:.2f}%')

data_date - 0.00%
customer_id - 0.00%
employee_index - 0.00%
country_of_residence - 0.00%
gender - 0.00%
age - 0.00%
date_joined - 0.00%
new_customer - 0.00%
customer_seniority - 0.00%
primary_index - 0.00%
last_date_as_primary - 97.96%
customer_type - 0.81%
customer_relation - 0.81%
resident - 0.00%
foreigner - 0.00%
employee_spouse - 99.99%
channel - 1.20%
deceased - 0.00%
address_type - 0.00%
province_code - 0.42%
province_name - 0.42%
activity_index - 0.00%
gross_household_income - 25.29%
customer_segment - 1.23%
saving_acc - 0.00%
guarantees - 0.00%
current_acc - 0.00%
derivative_acc - 0.00%
payroll_acc - 0.00%
junior_acc - 0.00%
more_particular_acc - 0.00%
particular_acc - 0.00%
particular_plus_acc - 0.00%
short_term_deposits - 0.00%
medium_term_deposits - 0.00%
long_term_deposits - 0.00%
e_account - 0.00%
funds - 0.00%
mortgage - 0.00%
pensions_plan - 0.00%
loans - 0.00%
taxes -