# Comprehensive exploration and visualization

This notebook explores and visualizes the data to understand the drivers of customer behavior. I think it is meaningless to analyze the products added or removed by each individual customer, because the customer attributes (income, channel used, segment, employee index, customer index, etc.) are identical over the whole period. Hence, we cannot say that a customer dropped or added a product because their income increased, as the income is the same in every month. However, we can analyze additions, substitutions, or retention times across all customers and compare these metrics with customer features such as income group or channel. In my attempt here, I analyzed product additions and tried my best to understand what drives customers to use different types of products.

## Import the necessary libraries

In [None]:
# Load packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 6) 

In [None]:
# to customize the displayed area of the dataframe 
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

Now that our packages are loaded, let’s read in and take a look at the data.

In [None]:
df  = pd.read_csv("../input/train_ver2.csv",
                           dtype={"sexo":str, "ind_nuevo":str, 
                                  "ult_fec_cli_1t":str, 
                                  "indext":str}, nrows=7e6) 

unique_ids   = pd.Series(df["ncodpers"].unique())
unique_id    = unique_ids.sample(n=130000)
df           = df[df.ncodpers.isin(unique_id)]
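The sampling above draws a different set of customers on every run. Here is a minimal sketch of the same idea on a made-up toy frame. The fixed `random_state` is my own assumption (it is not in the cell above) and is only there to make the draw repeatable.

```python
import pandas as pd

# Toy stand-in for the training data: 5 customers, 3 monthly records each (made-up values)
toy = pd.DataFrame({
    "ncodpers": [1, 2, 3, 4, 5] * 3,
    "fecha_dato": ["2015-01-28"] * 5 + ["2015-02-28"] * 5 + ["2015-03-28"] * 5,
})

# Sample customer IDs rather than rows, so each sampled customer keeps all of their months
ids = pd.Series(toy["ncodpers"].unique())
sampled_ids = ids.sample(n=2, random_state=0)  # random_state: assumption, for reproducibility
toy_sample = toy[toy["ncodpers"].isin(sampled_ids)]

print(toy_sample["ncodpers"].nunique(), len(toy_sample))
```

Sampling IDs first keeps the monthly history of each chosen customer intact, which matters for anything computed per customer over time.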

In [None]:
df.describe()

## Data cleaning

In [None]:
# Records count
df["ncodpers"].count()

In [None]:
# Change datatype
df["age"]   = pd.to_numeric(df["age"], errors="coerce") 
df["antiguedad"]   = pd.to_numeric(df["antiguedad"], errors="coerce") 
df["indrel_1mes"]   = pd.to_numeric(df["indrel_1mes"], errors="coerce") 

In [None]:
# Check how many missing values in every column
df.isnull().sum()

There are two columns ("ult_fec_cli_1t", "conyuemp") in which almost all values are missing. We are going to delete them from the dataframe.

In [None]:
# Drop the columns with majority of missing values
df = df.drop(["ult_fec_cli_1t", "conyuemp"], axis=1) 

There are many columns with missing values. Let's see how we can deal with them. The column "renta" (income) contains a lot of missing values. I am going to replace missing values in the income column with the median income of the customers in the same province. 
Since the purpose of this notebook is visualization more than data cleaning, I am going to drop the rows with missing values in the other variables. 

In [None]:
# Impute missing values in the income column with the median income of the same province
province_median = df.groupby("nomprov")["renta"].median()
df["renta"]     = df["renta"].fillna(df["nomprov"].map(province_median))

# Rows whose province is missing or has no known income fall back to the overall median
df["renta"]     = df["renta"].fillna(df["renta"].median())
df.sort_values(by="fecha_dato",inplace=True)
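To check what province-level imputation is supposed to do, here is a small sketch on made-up data. It is an illustration only, assuming one median per province with the overall median as a fallback; the provinces and incomes are invented.

```python
import numpy as np
import pandas as pd

# Made-up incomes; CEUTA has no known income at all
toy = pd.DataFrame({
    "nomprov": ["MADRID", "MADRID", "MADRID", "SEVILLA", "SEVILLA", "CEUTA"],
    "renta":   [100.0, 300.0, np.nan, 50.0, np.nan, np.nan],
})

# Median income per province (NaN when every income in the province is missing)
province_median = toy.groupby("nomprov")["renta"].median()

# Look up each row's province median and use it only where renta is missing
toy["renta_filled"] = toy["renta"].fillna(toy["nomprov"].map(province_median))

# Whatever is still missing gets the overall median of the known incomes
toy["renta_filled"] = toy["renta_filled"].fillna(toy["renta"].median())

print(toy)
```

Known incomes are left untouched; only the gaps are filled.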

Drop all the other missing values

In [None]:
df = df.dropna(axis=0)

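There are some customers with seniority less than zero. I will replace those seniority values with 0.

In [None]:
# Overwrite only the seniority column; df[mask] = 0 would set every column of those rows to 0
df.loc[df["antiguedad"] < 0, "antiguedad"] = 0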
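Assigning through a boolean row mask without naming a column overwrites every column of the matching rows. Here is a small check on made-up data that only the seniority value is replaced when the column is named in `.loc`.

```python
import pandas as pd

# Made-up records; the negative seniority is an invented placeholder value
toy = pd.DataFrame({
    "ncodpers":   [10, 11, 12],
    "antiguedad": [35, -999999, 6],
})

# Select the rows with the mask and the column by name, then assign
toy.loc[toy["antiguedad"] < 0, "antiguedad"] = 0

print(toy)
```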

In [None]:
df.loc[:,"ind_ahor_fin_ult1":"ind_recibo_ult1"].sum(axis=1)

In [None]:
# Add a new column of the total number of products per customer per month
df["tot_products"] = df.loc[:,"ind_ahor_fin_ult1":"ind_recibo_ult1"].sum(axis=1)
df["tot_products"]   = pd.to_numeric(df["tot_products"], errors="coerce") 
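The total relies on the product columns being contiguous, because a label slice in `.loc` takes everything between the two names, both endpoints included. A toy sketch with three made-up product columns:

```python
import pandas as pd

# Made-up product flags for three customers
toy = pd.DataFrame({
    "ncodpers":          [1, 2, 3],
    "ind_ahor_fin_ult1": [0, 0, 1],
    "ind_cco_fin_ult1":  [1, 0, 1],
    "ind_recibo_ult1":   [1, 0, 1],
})

# Label-based slices include the start and the end column
products = toy.loc[:, "ind_ahor_fin_ult1":"ind_recibo_ult1"]
toy["tot_products"] = products.sum(axis=1)  # row-wise sum of the 0/1 flags

print(toy)
```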

Now I will check customer distribution by country

In [None]:
df['pais_residencia'].describe() 


Almost all observations come from country "ES". Since the other observations are so few, let's exclude all observations other than those coming from this country.

In [None]:
df = df.loc[lambda df: df.pais_residencia == "ES", :]

###### How about employee index?

In [None]:
df['ind_empleado'].value_counts()

Almost all observations come from non-employees ("N"). Again, to focus on the most important features, I am going to exclude the other observations. 

In [None]:
df = df.loc[lambda df: df.ind_empleado == "N", :]

## Data Visualization ##


#### Age distribution of the customers

In [None]:
plt.figure(figsize=(18,9))
df['age'].hist(bins=50)
plt.title("Customers' Age Distribution")
plt.xlabel("Age(years)")
plt.ylabel("Number of customers") 

### Customers attraction by channel
The channels through which the customers were attracted to join.

In [None]:
# Customers count by channel 
df['canal_entrada'].value_counts().head(15)

In [None]:
plt.figure(figsize=(18,9))
df["canal_entrada"].value_counts().plot(x=None, y=None, kind='pie') 

The majority of customers have joined through three major channels.

#### Number of products by activity index and sex

In [None]:
df_a = df.loc[:, ['sexo', 'ind_actividad_cliente']].join(df.loc[:, "ind_ahor_fin_ult1": "ind_recibo_ult1"])
df_a = df_a.groupby(['sexo', 'ind_actividad_cliente']).sum()
df_a = df_a.T
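The reshaping above (group by two keys, sum the product flags, then transpose) can be hard to picture. Here is a toy version with made-up rows and only two product columns.

```python
import pandas as pd

# Made-up records: sex, activity index and two product flags
toy = pd.DataFrame({
    "sexo":                  ["H", "H", "V", "V", "V"],
    "ind_actividad_cliente": [0, 1, 0, 1, 1],
    "ind_cco_fin_ult1":      [1, 1, 0, 1, 1],
    "ind_recibo_ult1":       [0, 1, 0, 0, 1],
})

# One row per (sex, activity) pair, one column per product, then flip:
# products become rows and each (sex, activity) pair becomes a column
by_group = toy.groupby(["sexo", "ind_actividad_cliente"]).sum().T

print(by_group)
```

After the transpose each product is one bar in a `barh` plot and each (sex, activity) pair is one stacked segment.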

In [None]:
df_a.head()

In [None]:
# DataFrame.plot creates its own figure, so no separate plt.figure call is needed
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_ncar')
plt.title('Popularity of products by sex and activity index', fontsize=20) 
plt.xlabel('Number of customers', fontsize=17) 
plt.ylabel('Products_names', fontsize=17) 
plt.legend(["Sex:H; Activity_Ind:0", "Sex:H; Activity_Ind:1", "Sex:V; Activity_Ind:0", 
            "Sex:V; Activity_Ind:1"], prop={'size':15}) 

Most of the customers used only one product which is the current account. In order to investigate the other products, let's exclude the dominant product (current account).

In [None]:
# excluding the dominant product 
exclude = ['ind_cco_fin_ult1']
df_a = df_a.T
df_a = df_a.drop(exclude, axis=1).T

In [None]:
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_ncar')
plt.title('Popularity of products by sex and activity index', fontsize=20, color='black') 
plt.xlabel('Number of customers', fontsize=17, color='black') 
plt.ylabel('Products_names', fontsize=17, color='black') 
plt.legend(["Sex:H; Activity_Ind:0", "Sex:H; Activity_Ind:1", "Sex:V; Activity_Ind:0", 
            "Sex:V; Activity_Ind:1"], prop={'size':15}) 

###### Total number of products per customer

In [None]:
df["tot_products"].value_counts()

Most of the customers used one or two products, and customers rarely used more than five products.

###### Total number of products by age

In [None]:
df_a = df.groupby(['age'])['tot_products'].agg('sum')

In [None]:
df.groupby(['age'])['tot_products'].agg('sum')

Let's sort the values in a descending order to check the age groups which contribute the most to the total number of products.

In [None]:
df_a.sort_values(ascending=False).head(20)

In [None]:
# Number of products by age 
plt.figure(figsize=(18,11))
df_a.plot(kind='bar', colormap='autumn', legend=None) 
plt.xticks(np.arange(0, 120, 10), [str(x) for x in np.arange(0, 120, 10)])
plt.title('Number of products by age') 
plt.xlabel('Age(years)') 
plt.ylabel('Number of products') 

We see a bimodal distribution, with most of the products used by middle-aged customers between 35 and 55 years old, followed by young customers in their twenties.

###### Total number of products by segmentation

In [None]:
df_a = df.groupby(['segmento'])['tot_products'].agg('sum') 
df_a

PARTICULARES are the most important group

###### Number of products by customer index

In [None]:
df_a = df.groupby(['ind_nuevo'])['tot_products'].agg('count') 
df_a

Most customers are recurrent customers, older than six months

###### Number of products by customer regularity

In [None]:
df_a = df.groupby(['indrel'])['tot_products'].agg('count') 
df_a

Almost all customers are regular customers throughout the month

###### Number of products by customer type at the beginning of the month

In [None]:
df_a = df.groupby(['indrel_1mes'])['tot_products'].agg('count') 
df_a

Almost all customers are primary customers

###### Number of products by "Customer relation type at the beginning of the month"

In [None]:
df_a = df.groupby(['tiprel_1mes'])['tot_products'].agg('count') 
df_a

Almost all customers are either active or inactive.

###### Number of products by customer's birth country in relation to the bank country

In [None]:
df_a = df.groupby(['indext'])['tot_products'].agg('count') 
df_a

Most of the customers have the same birth country as the bank country ("indext" is "N").

###### Total number of products by income

Let's create income groups

In [None]:
df_a = (df.groupby(pd.cut(df['renta'], [0,60000,120000,180000,240000, np.inf], right=False))
                     .sum(numeric_only=True))
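With `right=False` each income bin includes its left edge and excludes its right edge, so an income of exactly 60000 falls into the second group. A small sketch with made-up incomes, using the same bin edges:

```python
import numpy as np
import pandas as pd

# Made-up incomes chosen to sit on or near the bin edges
renta = pd.Series([0, 59999, 60000, 119999.5, 250000])
bins = [0, 60000, 120000, 180000, 240000, np.inf]

# right=False gives half-open intervals [left, right)
groups = pd.cut(renta, bins, right=False)
codes = groups.cat.codes.tolist()  # position of the bin each income falls into

print(groups)
print(codes)
```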

Total number of products by income groups

In [None]:
df_a["tot_products"]

In [None]:
# Let's drop the columns which are unnecessary for this step
df_a = df_a.loc[:, "ind_ahor_fin_ult1": "ind_recibo_ult1"]
df_a = df_a.T

In [None]:
df_a.head(10)

In [None]:
# Plot of product share for each income group
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Distribution of products among customers by income group', fontsize=20, color='black') 
plt.xlabel('Total number of products', fontsize=17, color='black') 
plt.ylabel('Products_names', fontsize=17, color='black') 
plt.legend(prop={'size':15}, loc=1) 

"ind_cco_fin_ult1" is the dominant product. Now let's exclude it to investigate further the other products.

In [None]:
# exclude the dominant product "ind_cco_fin_ult1"
exclude = ['ind_cco_fin_ult1']
df_a = df_a.T
df_a = df_a.drop(exclude, axis=1).T

In [None]:
df_a.head()

In [None]:
# Plot of product share for each income group; excluding the dominant product 
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Popularity of products by income group', fontsize=20, color='black') 
plt.xlabel('Total number of products', fontsize=17, color='black') 
plt.ylabel('Products_names', fontsize=17, color='black') 
plt.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.1, prop={'size':15}) 

###### Total number of products by age group

In [None]:
# Let's create age groups
df_a = (df.groupby(pd.cut(df['age'], [0,20,40,60,80,100, np.inf], right=False))
                     .sum(numeric_only=True))

In [None]:
df_a

In [None]:
# Keep the products columns and discard the others
df_a = df_a.loc[:, "ind_ahor_fin_ult1": "ind_recibo_ult1"]
df_a = df_a.T

In [None]:
df_a.head(10)

In [None]:
# Plot of customers' age distribution of each product 
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='Reds')
plt.title('Customers age distribution of different products', fontsize=20, color='black') 
plt.xlabel('Total number of products', fontsize=17, color='black') 
plt.ylabel('Products_names', fontsize=17, color='black') 
plt.legend(prop={'size':15}, loc=1) 

Again let's exclude the dominant product

In [None]:
# exclude the dominant product "ind_cco_fin_ult1"
exclude = ['ind_cco_fin_ult1']
df_a = df_a.T
df_a = df_a.drop(exclude, axis=1).T

In [None]:
# Plot of customers' age distribution of each product (excluding the dominant product) 
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='Blues')
plt.title('Customers age distribution of different products', fontsize=20, color='black') 
plt.xlabel('Total number of products', fontsize=17, color='black') 
plt.ylabel('Products_names', fontsize=17, color='black') 
plt.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.1, prop={'size':15}) 

###### Let's check number of products by channels

In [None]:
df["canal_entrada"].value_counts().head(10)

Most of the customers joined through three major channels

Since 6 out of 160 channels account for about 87.11% of all customer acquisitions, and to keep the plots readable, we are going to exclude any channel that contributes less than 1% of the customers.

In [None]:
# Let's extract the necessary columns for this step
df_a = df.loc[:, ['canal_entrada']].join(df.loc[:, 'ind_ahor_fin_ult1':'ind_recibo_ult1'])

In [None]:
# Let's subset the data to keep only the records from the five major channels
subset = ["KHE", "KAT", "KFC", "KFA", "KHK"]
df_a = df_a.loc[df_a['canal_entrada'].isin(subset)]
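The channel list above is typed in by hand. As a sketch of how such a list could be derived from the 1% rule instead, here is a toy example. The counts are made up and "RARE" is a hypothetical channel code.

```python
import pandas as pd

# Made-up channel column; "RARE" is a hypothetical low-volume channel code
canal = pd.Series(["KHE"] * 120 + ["KAT"] * 50 + ["KFC"] * 29 + ["RARE"] * 1)

share = canal.value_counts(normalize=True)   # fraction of records per channel
major = share[share >= 0.01].index.tolist()  # channels contributing at least 1%
coverage = share[share >= 0.01].sum()        # fraction of records that survive the filter

kept = canal[canal.isin(major)]

print(major, round(coverage, 3), len(kept))
```

Reporting `coverage` next to the plot makes it clear how much of the data the filtered view still represents.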

In [None]:
df_a = df_a.groupby("canal_entrada").agg("sum")
df_a = df_a.T

In [None]:
# Channels used by the customer to join and the purchased products
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Channels used by the customers to join and associated product uses', fontsize=20, color='black') 
plt.xlabel('Total number of customers', fontsize=17, color='black') 
plt.ylabel('Products names', fontsize=17, color='black') 
plt.legend(prop={'size':15}) 

Again we will exclude the dominant product

In [None]:
# exclude the dominant product "ind_cco_fin_ult1"
exclude = ['ind_cco_fin_ult1']
df_a = df_a.T
df_a = df_a.drop(exclude, axis=1).T

In [None]:
# Channels share distribution of each product, excluding the dominant product
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Channels used to join for each product', fontsize=20, color='black') 
plt.xlabel('Total number of customers', fontsize=17, color='black') 
plt.ylabel('Products_names', fontsize=17, color='black') 
plt.legend(["KAT", "KFA", "KFC", "KHE", "KHK"], bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.1, prop={'size':15}) 

###### Number of products by seniority group

In [None]:
# Let's extract the necessary columns for this step
df_a = df.loc[:, ['antiguedad']].join(df.loc[:, 'ind_ahor_fin_ult1':'ind_recibo_ult1'])

In [None]:
# Let's create seniority groups
df_a = (df_a.groupby(pd.cut(df_a['antiguedad'], [0,50,100,150,200, np.inf], right=False))
                     .sum())

In [None]:
df_a.head()

In [None]:
exclude = ["antiguedad"]
df_a = df_a.drop(exclude, axis=1).T

In [None]:
# Customers' seniority distribution of each product
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Customers seniority distribution of each product', fontsize=20, color='black') 
plt.xlabel('Customer seniority', fontsize=17, color='black') 
plt.ylabel('Product names', fontsize=17, color='black') 
plt.legend(["[0, 50)", "[50, 100)", "[100, 150)", "[150, 200)", "[200, inf)"], prop={'size':15}) 

It is noticeable that the dominant product is purchased most by those who joined less than 50 months ago.

Again we will exclude the dominant product

In [None]:
# exclude the dominant product "ind_cco_fin_ult1"
exclude = ['ind_cco_fin_ult1']
df_a = df_a.T
df_a = df_a.drop(exclude, axis=1).T

In [None]:
# Customers' seniority distribution of each product
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Customers seniority distribution of each product', fontsize=20, color='black') 
plt.xlabel('Customer seniority', fontsize=17, color='black') 
plt.ylabel('Product names', fontsize=17, color='black') 
plt.legend(["[0, 50)", "[50, 100)", "[100, 150)", "[150, 200)", "[200, inf)"], bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.1, prop={'size':15}) 

Let's see how many products an individual customer usually has in any month.

In [None]:
# Extract total number of products 
df_a = df["tot_products"].value_counts()
df_a = pd.DataFrame(df_a)

In [None]:
df["ncodpers"].count()

In [None]:
# calculate the percentage of customers using different number of products
count = df["ncodpers"].count()
# The first column holds the counts, whatever name this pandas version gives it
df_a["percentage"] = (df_a.iloc[:, 0]/count)*100
df_a

56.26% of the customers have used only one product while 15.57% used two products and there is also 12.22% of the customers who have not used any products at all.

#### **In case of total products = 1**

In [None]:
# extract the records where the customer held only one product and that product is the current account ("ind_cco_fin_ult1") 
df_a = df[df["tot_products"]==1]  
df_a = df_a[df_a["ind_cco_fin_ult1"]==1]  

In [None]:
a = df_a["ncodpers"].count() # Observations where customers had only one product being the current account 
b = len(df) # Total number of observations
c = len(df[df["tot_products"]==1]) # Observations where customers had only one account

print("%.2f" % ((c/b)*100), "% of the customers had purchased only one product") 
print("%.2f" % ((a/b)*100), "% of the customers had the current account as the only one product") 
print("%.2f" % ((a/c)*100), "% of the customers who have only one product hold the current account as that product")

In [None]:
print("%.2f" % (((c - a)/b)*100), "% of the customers have only one product that is not the current account") 
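The three percentages above are tied together by simple arithmetic: the share of one-product customers whose product is the current account equals the second percentage divided by the first. Here is a sketch with made-up counts (these are not the notebook's numbers).

```python
# Made-up counts, for illustration only
b = 1000  # all monthly records
c = 560   # records with exactly one product
a = 470   # records whose single product is the current account

share_one = c / b           # share of records with one product
share_cco_only = a / b      # share of records with the current account as the only product
cond = a / c                # among one-product records, share holding the current account
share_other = (c - a) / b   # one product, but not the current account

print("%.2f %.2f %.2f %.2f" % (share_one * 100, share_cco_only * 100,
                               cond * 100, share_other * 100))
```

Computing these from the counts, rather than typing the percentages in by hand, keeps them correct when the random sample changes.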

In [None]:
# extract the necessary columns
df_a = df[df["tot_products"]==1]  # cases where the total products is one
df_a = df_a.loc[:, ["tot_products"]].join(df_a.loc[:, "ind_ahor_fin_ult1":"ind_recibo_ult1"]) 

In [None]:
df_a = df_a.groupby("tot_products").agg("sum")
df_a = df_a.T

Now let's plot which products were chosen as the only product when the total number of products is one in any single month.

In [None]:
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='winter')
plt.title('Number of times each product was chosen as the only product when the total number of products is one', fontsize=20) 
plt.xlabel('Number of occurrences', fontsize=17, color='black') 
plt.ylabel('products names', fontsize=17, color='black') 

#### **In case of total products = 2**

In [None]:
# extract those customers who purchased two products with current account being one of them
df_a = df[df["tot_products"]==2]  
df_a = df_a[df_a["ind_cco_fin_ult1"]==1]  

In [None]:
df_a["ncodpers"].count()

In [None]:
# extract the necessary columns
df_a = df_a.loc[:, ["tot_products"]].join(df_a.loc[:, "ind_ahor_fin_ult1":"ind_recibo_ult1"]) 

In [None]:
df_a = df_a.groupby("tot_products").agg("sum")
df_a = df_a.T

Now let's plot which products were chosen along with the dominant product when the customer bought only two products in any single month.

In [None]:
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='winter')
plt.title('Number of times each product was chosen along with the dominant product when the total number of products is two', fontsize=20) 
plt.xlabel('Number of occurrences', fontsize=17, color='black') 
plt.ylabel('products names', fontsize=17, color='black') 

As seen on the above plot, in case of two products, along with "ind_cco_fin_ult1", "ind_ctop_fin_ult1" is chosen most followed by "ind_recibo_ult1". 

#### **In case of total products = 0**

In [None]:
# extract those customers who did not purchase any products in any month
df_a = df[df["tot_products"]==0]

Let's see the activity index of those customers

In [None]:
df_a["ind_actividad_cliente"].value_counts() 

As expected, those customers are mainly inactive customers and that's why they have not purchased any products.

#### **In case of total products = 3**

Now let's check the customers who purchased three products, what kind of products they have chosen.

In [None]:
# extract those customers who purchased three products in any single month
df_a = df[df["tot_products"]==3]  

In [None]:
df_a = df_a.loc[:, ["tot_products"]].join(df_a.loc[:, "ind_ahor_fin_ult1":"ind_recibo_ult1"]) 

In [None]:
df_a = df_a.groupby("tot_products").agg("sum")
df_a = df_a.T

In [None]:
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='winter')
plt.title('Number of times each product was chosen when the total number of products is three in any month', fontsize=20, color='black') 
plt.xlabel('Number of occurrences', fontsize=17, color='black') 
plt.ylabel('products names', fontsize=17, color='black') 

Now let's check the products share of different cases (total products: only one product, two products, three products) 

In [None]:
# Categorize by total products
df_a = df.loc[:, ["tot_products"]].join(df.loc[:, "ind_ahor_fin_ult1":"ind_recibo_ult1"]) 
df_a = df_a.groupby("tot_products").agg("sum")
df_a = df_a.T

In [None]:
df_a.head()

In [None]:
# percentage of each product contribution of those customers who only purchased one product in any month
a = df_a[1]
b = df_a[1].sum()
c = (a/b)*100
c = c.sort_values(ascending=False)
c

In [None]:
c.iloc[0]

In [None]:
print("Wow, about", "%.2f" % (c.iloc[0]), "% of the customers who purchase only one product purchase the current account.") 

In [None]:
# percentage of each product contribution of those customers who purchased two products in any month
a = df_a[2]
b = df_a[2].sum()
c = (a/b)*100
c = c.sort_values(ascending=False)
c

In [None]:
print("In case of two products held by the customer, the current account makes up about", "%.2f" % (c.iloc[0]), "% of all product holdings, followed by the Particular Account with", "%.2f" % (c.iloc[1]), "% and Direct Debit with", "%.2f" % (c.iloc[2]), "%") 

In [None]:
# percentage of each product contribution of those customers who purchased three products in any month
a = df_a[3]
b = df_a[3].sum()
c = (a/b)*100
c.sort_values(ascending=False)
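One caveat about these percentages: dividing by the sum of all product counts gives each product's share of product holdings, not the share of customers. When every record holds k products, the sum of the counts is k times the number of records. A toy sketch with made-up two-product records:

```python
import pandas as pd

# Made-up records; every row holds exactly two products
toy = pd.DataFrame({
    "ind_cco_fin_ult1":  [1, 1, 1, 0],
    "ind_ctop_fin_ult1": [1, 0, 0, 1],
    "ind_recibo_ult1":   [0, 1, 1, 1],
})

counts = toy.sum()                                 # how many records hold each product
share_of_holdings = counts / counts.sum() * 100    # adds up to 100
share_of_records = counts / len(toy) * 100         # adds up to 200 when every row has 2 products

print(share_of_holdings)
print(share_of_records)
```

For the one-product case the two views coincide; for two or three products, the share of records is the number to quote when talking about "% of customers".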

### Drivers of customers choices

Now we want to see what drives customer choices, especially for the customers who have not chosen the current account, which accounts for more than half of the total products. Let's see what these products are and analyze the features that may contribute to customer buying behavior.

Distribution of products by age group

In [None]:
df_a = df.loc[:, ['age']].join(df.loc[:, 'ind_ahor_fin_ult1':'ind_recibo_ult1'])

In [None]:
df_a = (df_a.groupby(pd.cut(df_a['age'], [0,18,25,35,45,55, np.inf], right=False))
                     .sum())

In [None]:
df_a.head()

In [None]:
del df_a["age"]

In [None]:
df_a = df_a.T

In [None]:
# Customers age distribution of each product
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Customers age distribution of each product', fontsize=20, color='black') 
plt.xlabel('Total number of customers', fontsize=17, color='black') 
plt.ylabel('Products names', fontsize=17, color='black') 
plt.legend(["[0, 18)", "[18, 25)", "[25, 35)", "[35, 45)", "[45, 55)", "[55, inf)"], prop={'size':15}) 

Product usage occurrences by age

In [None]:
df_a = df.loc[:,["age"]].join(df.loc[:, 'ind_ahor_fin_ult1':'ind_recibo_ult1'])

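A dictionary that maps every product column to 'sum'. The age column is the grouping key, so it needs no aggregation of its own.

In [None]:
# One 'sum' entry per product column; 'age' is excluded because we group by it
fnc = {c:'sum' for c in df_a.columns.drop(['age']).tolist()}

In [None]:
# reindex(columns=...) keeps the original product order (reindex_axis was removed from pandas)
df_a = df_a.groupby('age').agg(fnc).reindex(columns=df_a.columns.drop('age'))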

In [None]:
df_a.head()

In [None]:
# Products distribution by age
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,42], colormap='hsv')
plt.title('Products distribution by age', fontsize=20, color='black') 
plt.xlabel('Total number of customers', fontsize=17, color='black') 
plt.ylabel('Age (years)', fontsize=17, color='black') 
plt.legend(prop={'size':15}) 

It is noticeable from the above plot that customers aged 18 or younger use only the Junior Account, and those aged between 20 and 28 mostly use the current account.

Distribution of products by segment

In [None]:
df_a = df.loc[:, ['segmento']].join(df.loc[:, 'ind_ahor_fin_ult1':'ind_recibo_ult1'])

In [None]:
df_a = df_a.groupby("segmento").agg("sum")
df_a = df_a.T

In [None]:
df_a.head()

In [None]:
# Customers segment of each product
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='gist_rainbow')
plt.title('Customers segmentation of products', fontsize=20, color='black') 
plt.xlabel('Total number of customers', fontsize=17, color='black') 
plt.ylabel('Products names', fontsize=17, color='black') 
plt.legend(prop={'size':15}) 

Distribution of products by activity index

In [None]:
# Let's extract the necessary columns for this step
df_a = df.loc[:, ['ind_actividad_cliente']].join(df.loc[:, 'ind_ahor_fin_ult1':'ind_recibo_ult1'])

In [None]:
df_a = df_a.groupby("ind_actividad_cliente").agg("sum")
df_a = df_a.T

In [None]:
# Purchased products types by customer activity index
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='prism')
plt.title('Purchased products types by customer activity index ', fontsize=20, color='black') 
plt.xlabel('Total number of customers', fontsize=17, color='black') 
plt.ylabel('Products names', fontsize=17, color='black') 
plt.legend(["Inactive", "Active"], prop={'size':15}) 

Most of the products, except the current account and the particular account, were used by active customers. It seems that the activity index does not have an impact on the current and particular accounts.

Distribution of products by sex

In [None]:
# Let's extract the necessary columns for this step
df_a= df.loc[:, ['sexo']].join(df.loc[:, 'ind_ahor_fin_ult1':'ind_recibo_ult1'])

In [None]:
df_a = df_a.groupby("sexo").agg("sum")
df_a = df_a.T

In [None]:
# Percentage of purchased products by sex
df_a.plot(kind='barh', stacked=True, fontsize=14, figsize=[16,12], colormap='prism')
plt.title('Purchased products by sex ', fontsize=20, color='black') 
plt.xlabel('Total number of customers', fontsize=17, color='black') 
plt.ylabel('Products names', fontsize=17, color='black') 
plt.legend(prop={'size':15}) 

V purchased more than H, but this is because there are more V customers than H customers. Let's confirm this.

In [None]:
df["sexo"].value_counts()

In [None]:
a = df["ncodpers"][df["sexo"]=="H"].count()
b = df["ncodpers"][df["sexo"]=="V"].count()
(a/b)*100

The above result reveals that there are fewer customers whose sex is H, and hence their share of purchased products is lower than that of the other sex.
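A way to take group size out of the comparison is to look at products per record instead of total products. This is a rough sketch on made-up data, not a result from the notebook.

```python
import pandas as pd

# Made-up records: more V rows than H rows
toy = pd.DataFrame({
    "sexo":         ["V", "V", "V", "V", "H", "H"],
    "tot_products": [1, 2, 1, 2, 2, 2],
})

totals = toy.groupby("sexo")["tot_products"].sum()       # driven by group size
per_record = toy.groupby("sexo")["tot_products"].mean()  # average products per record

print(totals)
print(per_record)
```

In this toy data V has the larger total only because it has more rows, while the per-record average is higher for H.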

Now let's look at the total number of products by age, income and seniority.

In [None]:
df_a = df.loc[:, ['age', 'renta', 'antiguedad']].join(df.loc[:, 'ind_ahor_fin_ult1':'tot_products'])

In [None]:
df_a.head()

In [None]:
df_a = df_a.dropna(axis=0)

In [None]:
df_a = df_a.groupby("tot_products").agg("mean")

In [None]:
df_b = df_a.loc[:, ['age', 'renta', 'antiguedad']]

In [None]:
df_b.head()

#### Total number of products by seniority

In [None]:
df_a = df_b["antiguedad"][0:10]

In [None]:
# Total number of products by seniority
df_a.plot(kind='bar', fontsize=14, figsize=[16,12], colormap='prism')
plt.title('Total number of products by seniority', fontsize=20, color='black') 
plt.xlabel('Total number of products', fontsize=17, color='black') 
plt.ylabel('Average seniority', fontsize=17, color='black') 
#plt.legend(prop={'size':15}) 

Average seniority rises with the number of products held, so the number of products appears to be positively related to seniority. The exception is the customers who have not used any products.
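The bar chart shows group averages only. A quick way to put a number on the relationship is a correlation coefficient computed on the records that hold at least one product. The sketch below uses made-up values and Pearson correlation, which is my choice for the illustration.

```python
import pandas as pd

# Made-up records; the first row is a customer without products
toy = pd.DataFrame({
    "tot_products": [0, 1, 1, 2, 3, 4],
    "antiguedad":   [120, 20, 40, 80, 150, 200],
})

# Leave out the zero-product records, which do not follow the trend
held = toy[toy["tot_products"] > 0]
r = held["tot_products"].corr(held["antiguedad"])  # Pearson correlation by default

print(round(r, 3))
```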

#### Total number of products by age

In [None]:
df_a = df_b["age"]

In [None]:
# Total number of products by age
df_a.plot(kind='bar', fontsize=14, figsize=[16,12], colormap='prism')
plt.title('Total number of products by age', fontsize=20, color='black') 
plt.xlabel('Total number of products', fontsize=17, color='black') 
plt.ylabel('Average age', fontsize=17, color='black') 
plt.legend(prop={'size':15}) 

#### Total number of products by income

In [None]:
df_a = df_b["renta"]

In [None]:
# Total number of products by income
df_a.plot(kind='bar', fontsize=14, figsize=[16,12], colormap='prism')
plt.title('Total number of products by income', fontsize=20, color='black') 
plt.xlabel('Total number of products', fontsize=17, color='black') 
plt.ylabel('Average income', fontsize=17, color='black') 
plt.legend(prop={'size':15}) 