# Analyzing the Customer Data

For now, we will simply explore the customer data to get a better feel for it. \
This will be helpful later on, when we start building our models and move on to more complex analysis.


We will do this by looking at the data in a number of different ways:

* First we are going to look at some simple distributions.
    * We will look at the makeup of our customers:
        * where they are from
        * who they are
        * what type of customers they are
    * How many customers the bank most likely has in total.
    * What the general opinion of the customers is.

* Then we will look at the relationships between the different variables.
    * We will look at the correlation of the different scores with satisfaction, to find out which is the most important factor.
    * We will see which demographics are the most and least satisfied,
        * taking into account gender, age, location, type, and whether they have a mortgage or credit card.

* Then we will look at how the scores trend over time.
    * We will look at when the scores are at their highest and lowest.


In [69]:
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.subplots as sp
import plotly.figure_factory as ff
import pandas as pd
import pickle as pkl
import numpy as np

In [70]:
# Load the cleaned data sets (pd.read_pickle opens and closes the files for us)
customer_data = pd.read_pickle('./bank-data/cleaned_customers.pkl')
comment_data = pd.read_pickle('./bank-data/cleaned_comments.pkl')

## Age Distribution of Customers by Customer Type

We start by looking at the age distribution of the customers, split by customer type.

Points to note:

As the graph shows, no customer of the bank is younger than 18. \
Most of the bank's customers are between 30 and 60 years old.

Around the age of 30 the number of Business customers rises, \
which corresponds to a drop in the number of Personal customers. \
This pattern reverses around the age of 56, \
where the numbers of Business and Personal customers are on an equal footing, \
and it shifts again at around 67, \
from where on Personal customers outnumber Business customers.

The number of Business-pro customers is consistently low across the whole age range.
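The crossover ages described above are read off the graph by eye. As a rough sketch, they could also be located programmatically by checking at which ages the leading customer type changes. The toy data below is hypothetical; only the column names `customer_age` and `customer_type` are taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror those used in this notebook
toy = pd.DataFrame({
    "customer_age": [25, 25, 25, 35, 35, 35, 60, 60, 60, 70, 70, 70],
    "customer_type": ["Personal", "Personal", "Business",
                      "Business", "Business", "Personal",
                      "Business", "Business", "Personal",
                      "Personal", "Personal", "Business"],
})

# Number of customers per age (rows) and customer type (columns)
counts = (
    toy.groupby(["customer_age", "customer_type"])
       .size()
       .unstack(fill_value=0)
)

# Customer type with the most customers at each age
leader = counts.idxmax(axis=1)

# Ages at which the leading type differs from the previous age
changed = leader.ne(leader.shift())
changed.iloc[0] = False  # the first age has no predecessor to compare with
switch_ages = leader.index[changed.to_numpy()].tolist()

print(switch_ages)
```

On real data the counts per single year of age are noisy, so it would probably make sense to smooth them or bin the ages before looking for crossovers.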

In [71]:
# Calculate the age distribution by age and customer type
age_dist = customer_data.groupby(['customer_age', 'customer_type']).size().reset_index(name='count')

# Create a subplot with two rows: line chart on top, violin plot below
fig = make_subplots(rows=2, cols=1, specs=[[{"type": "scatter", "rowspan": 1}],[{"type": "violin", "rowspan": 1}]])

# Add a violin plot of the customer ages to the second row, without a legend entry
fig.add_trace(go.Violin(x=customer_data['customer_age'], showlegend=False, box_visible=True), row=2, col=1)

# Add one line per customer type to the first row
customer_types = age_dist['customer_type'].unique()
for ctype in customer_types:
    subset = age_dist[age_dist['customer_type'] == ctype]
    fig.add_trace(go.Scatter(x=subset['customer_age'], y=subset['count'], mode='lines', name=ctype), row=1, col=1)

# Set the X-axis to start at age 0 and end at 100
fig.update_xaxes(range=[0, 100])

# Set the titles and axis labels
fig.update_layout(
    title="Age Distribution of Customers by Customer Type",
    xaxis_title="Age",
    yaxis_title="Number of Customers",
    yaxis2_title="Customer Age",
    height=700,

)

# Show the plot
fig.show()


## Number of Customers

We only have a small subset of 3000 customers who have filled out a survey. \
If we are interested in the total number of customers, we have to do some further statistical analysis.

Given that the highest customer id is 389602 and the lowest is 26, we can assume that the customer ids are assigned sequentially. \
This is also supported by the distribution graph below, which shows that the customer ids are evenly distributed.

Based on this information we can use a minimum-variance unbiased estimator (MVUE) to make a point estimate of the highest customer id.


Utilizing the following formula:

$\large \hat{N} = m + \frac{m}{k} - 1$

Where:
* $\large \hat{N}$ is the estimated largest customer id
* $\large m$ is the largest customer id in the sample
* $\large k$ is the number of customers in the sample


This strategy has been used to great effect since WWII (see the [German tank problem](https://en.wikipedia.org/wiki/German_tank_problem)).

Based on the calculation below, we estimate that the bank has roughly 389731 customers.
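To get a feel for how well this estimator behaves, here is a minimal simulation sketch. It assumes ids run from 1 to a known `true_n` and are sampled uniformly without replacement; the function name `estimate_population_max` and all numbers in it are made up for illustration and are not part of the bank data.

```python
import numpy as np

def estimate_population_max(sample_ids):
    """Point estimate N = m + m/k - 1 for the largest id in the population."""
    m = np.max(sample_ids)  # largest id observed in the sample
    k = len(sample_ids)     # number of ids in the sample
    return float(m + m / k - 1)

# Hypothetical population: ids 1..true_n, from which we repeatedly draw 500 ids
rng = np.random.default_rng(seed=0)
true_n = 100_000
estimates = [
    estimate_population_max(rng.choice(true_n, size=500, replace=False) + 1)
    for _ in range(200)
]

mean_estimate = float(np.mean(estimates))
print(f"true N: {true_n}, mean estimate over 200 samples: {mean_estimate:.0f}")
```

Averaged over many samples the estimate should land close to `true_n`, while a single estimate can still be off by a few multiples of `true_n / k`. The estimate is only meaningful if the ids really are assigned sequentially and the survey sample is not biased toward older or newer customers.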


In [72]:
# Create a distplot of customer ids
fig = ff.create_distplot([customer_data['customer_id']], ['Customer ID'], bin_size=1000, show_rug=False)
fig.update_layout(title='Customer ID Distribution', xaxis_title='Customer ID', yaxis_title='Density')

# Note: create_distplot normalizes the histogram, so the y-axis shows a density, not a customer count


fig.show()

In [73]:
# Calculate the total amount of customers

m = customer_data['customer_id'].max()
k = len(customer_data['customer_id'])

# Making a point estimate of the population max

N = m + (m / k) - 1 

print('The total amount of customers is: ', round(N, 1))

The total amount of customers is:  389730.9


In [74]:
# Creating a new dataframe for this plot
customer_data_plot = customer_data.copy()

# Replace gender labels
customer_data_plot['customer_gender'] = customer_data_plot['customer_gender'].replace({'m': 'Male', 'f': 'Female', 'n': 'N/A'})

# Remove NaN values from customer_location
customer_data_plot = customer_data_plot.dropna(subset=['customer_location'])

# Create subplots
fig = make_subplots(rows=1, cols=3, specs=[[{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}]],
                    subplot_titles=['Customer Gender Distribution', 'Customer Location Distribution', 'Customer Type Distribution'])

# Count each category once, so that labels and values come from the same
# Series and stay aligned (unique() and value_counts() use different orders)
gender_counts = customer_data_plot['customer_gender'].value_counts()
location_counts = customer_data_plot['customer_location'].value_counts()
type_counts = customer_data_plot['customer_type'].value_counts()

# Add pie chart for customer_gender
gender_pie = go.Pie(labels=gender_counts.index,
                    values=gender_counts.values,
                    name='Gender',
                    domain={'x': [0, 0.32], 'y': [0, 1]},
                    textinfo='label+percent',
                    textposition='inside',
                    showlegend=False
                    )

fig.add_trace(gender_pie, 1, 1)

# Add pie chart for customer_location
location_pie = go.Pie(labels=location_counts.index,
                      values=location_counts.values,
                      name='Location',
                      domain={'x': [0.34, 0.66], 'y': [0, 1]},
                      textinfo='label+percent',
                      textposition='inside',
                      showlegend=False
                      )

fig.add_trace(location_pie, 1, 2)

# Add pie chart for customer_type
type_pie = go.Pie(labels=type_counts.index,
                  values=type_counts.values,
                  name='Type',
                  domain={'x': [0.68, 1], 'y': [0, 1]},
                  textinfo='label+percent',
                  textposition='inside',
                  showlegend=False
                  )

fig.add_trace(type_pie, 1, 3)

# Update layout
fig.update_layout(
    margin=dict(l=0)  # Reset the left margin
)

# Show the figure
fig.show()

## Makeup of the Customers

Now we will take a closer look not just at the kind of customers we have, but at who they are and where they are from.

### Gender:

Starting off with gender, we can see a pretty even split, with men 3.4 percentage points ahead of women, \
while roughly 10 have decided not to disclose their gender or do not feel represented by either of those options.


### Location:

Looking at the location of the customers, the customer counts per province (used in the market penetration calculation below) show that the largest group is from Leinster, making up almost half of the customers. \
Munster accounts for a little less than a third of the customers. \
The rest of the customers are split between Connacht and Ulster, with Connacht having an edge over Ulster of a little over 5 percentage points.

Taking the population of each province into account, however, we can see that the bank actually has its highest market penetration in Connacht (0.064%). \
It is followed closely by Munster (0.057%), with Leinster (0.044%) further behind and Ulster (0.010%) lagging far behind.


### Type:

As already discussed in the previous section, the majority of the customers are Personal customers, with Business customers close behind. Business plus customers are almost non-existent, at 5.6%.

In [75]:
# Calculating market penetration for each region

provinces = {
    "Leinster": {"customers": 1248, "population": 2858501},
    "Munster": {"customers": 777, "population": 1364098},
    "Connacht": {"customers": 379, "population": 588583},
    "Ulster": {"customers": 231, "population": 2215454}
}

market_penetration = {}

for province, data in provinces.items():
    penetration = (data["customers"] / data["population"]) * 100
    market_penetration[province] = penetration
    print(f"{province}: {penetration:.3f}%")


Leinster: 0.044%
Munster: 0.057%
Connacht: 0.064%
Ulster: 0.010%
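The customer counts in the cell above are typed in by hand. As a sketch of an alternative, they could be derived directly from the `customer_location` column with `value_counts()`, which keeps the numbers in sync with the data. The small `customers` frame below is a hypothetical stand-in for the real customer table; the population figures are the ones used above.

```python
import pandas as pd

# Hypothetical stand-in for the customer table (one row per customer)
customers = pd.DataFrame({
    "customer_location": ["Leinster"] * 3 + ["Munster"] * 2 + ["Connacht"] + [None],
})

# Population per province, copied from the cell above
population = pd.Series({
    "Leinster": 2858501,
    "Munster": 1364098,
    "Connacht": 588583,
    "Ulster": 2215454,
})

# value_counts() ignores missing locations by default
counts = customers["customer_location"].value_counts()

# Align the counts with the provinces; provinces without customers get 0
counts = counts.reindex(population.index, fill_value=0)

# Market penetration in percent, highest first
penetration = (counts / population * 100).sort_values(ascending=False)

for province, value in penetration.items():
    print(f"{province}: {value:.5f}%")
```

Because labels and counts come from the same Series here, they cannot drift out of order.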


In [76]:
# Customer opinion

satisfaction = customer_data['satisfied'].dropna()

# Rename True to satisfied and False to unsatisfied
satisfaction = satisfaction.map({True: 'satisfied', False: 'unsatisfied'})

# How many people are satisfied with the bank?
satisfaction_counts = satisfaction.value_counts()

# Color each slice by its label: "satisfied" = green, "unsatisfied" = red
slice_colors = ['green' if label == 'satisfied' else 'red' for label in satisfaction_counts.index]

fig = go.Figure(data=[go.Pie(labels=satisfaction_counts.index, values=satisfaction_counts.values,
                             hole=.3, marker=dict(colors=slice_colors))])
fig.update_layout(title_text="Customer satisfaction", font=dict(size=18))

fig.show()


## Customer Satisfaction

As we can see from the graph, the majority of the customers are unfortunately not satisfied with the bank: \
only 42 percent of the customers are satisfied.

On the bright side, this means that there is a lot of room for improvement. \
But let's look at the scores in more detail.

Looking at the scores below, we can see that most customers are not very thrilled with the bank, \
with most of the dimensions hovering around three stars, which nowadays is a pretty bad score.

The dimensions doing best, while still not doing well, are 'Products & Services' and 'Security'. \
'Convenience' and 'Customer service', on the other hand, are doing pretty badly, with scores of 2.71 and 2.78 respectively.

All in all, the bank has a lot of room for improvement.


In [77]:


stars = ["★", "★★", "★★★", "★★★★", "★★★★★"]

fig = sp.make_subplots(rows=len(customer_data.columns[8:-2]), cols=2, specs=[[{"type": "indicator"}, {}]] * len(customer_data.columns[8:-2]))

for index, columname in enumerate(customer_data.columns[8:-2]):
    ratings = customer_data[columname].dropna()
    mean_rating = ratings.mean()

    # Share of each star rating (1 to 5) in percent
    percentages = [len(ratings[ratings == i])/len(ratings)*100 for i in range(1, 6)]

    # Build a five-character star string for the mean rating, e.g. "★★⯪☆☆"
    full_stars = "★" * int(mean_rating)
    half_star = "⯪" if mean_rating % 1 >= 0.5 else ""
    empty_stars = "☆" * (5 - len(full_stars + half_star))
    mean_star_string = f"{full_stars}{half_star}{empty_stars}"

    # Horizontal bar chart with the distribution of the ratings
    bar_chart = go.Bar(
        x=percentages,
        y=stars,
        orientation="h",
        marker=dict(color="gold"),
        text=[f"{round(r)}%" for r in percentages],
        textposition="inside",
        insidetextanchor="start",
        textfont=dict(color="black")
    )

    # Indicator showing the mean rating as a number followed by the star string
    mean_box = go.Indicator(
        mode="number",
        value=round(mean_rating, 2),
        number=dict(suffix=f" {mean_star_string}", font=dict(size=50, color="gold")),
        domain=dict(x=[0.7, 1], y=[0.4, 0.6]),
        title=dict(text=f"Average score ({columname})", font=dict(size=30)),
    )

    fig.add_trace(mean_box, row=index+1, col=1)
    fig.add_trace(bar_chart, row=index+1, col=2)


fig.update_layout(
    showlegend=False,
    plot_bgcolor="white",
    margin=dict(l=150),
    height=3000,
)

fig.show()


# Relationships between variables

Now that we have played around with a couple of isolated variables, let's look at how they relate to each other.

In particular, we will:

* look at the correlation of the different scores with satisfaction, to find out which is the most important factor.
* see which demographics are the most and least satisfied,
    * taking into account gender, age, location, type, and whether they have a mortgage or credit card.


In [78]:
# Plotting which tabular score has the highest impact in the satisfaction of the customer

# Create a new dataframe for this plot (a copy, so customer_data stays untouched)
score_columns = ['convenience', 'customer_service', 'online_banking', 'interest_rates', 'fees_charges', 'community_involvement', 'products_services', 'privacy_security', 'reputation']
tabular_correlation = customer_data[score_columns + ['satisfied']].copy()

# Remove rows without a satisfaction label
tabular_correlation = tabular_correlation.dropna(subset=['satisfied'])

# Convert True/False to 1/0 so the column can be correlated with the scores
tabular_correlation['satisfied'] = tabular_correlation['satisfied'].astype(int)

# Calculate the correlation between the tabular scores and the satisfaction of the customer
tabular_correlation = tabular_correlation.corr()['satisfied'].sort_values(ascending=False)

# Remove the satisfaction column
tabular_correlation = tabular_correlation.drop('satisfied')

tabular_correlation_df = tabular_correlation.reset_index()

# Rename the columns for better readability
tabular_correlation_df.columns = ['factor', 'correlation']

# Create a bar chart using Plotly Express
fig = px.bar(tabular_correlation_df, x='factor', y='correlation', title='Correlation between Tabular Scores and Customer Satisfaction')

# Show the plot
fig.show()

## Score satisfaction correlation

As we can see here, the "Online banking" dimension has the highest correlation with overall satisfaction, at 0.46, \
while the "Products & Services" dimension has the lowest, at 0.20.

This suggests that "Online banking" is the factor most closely tied to the overall satisfaction of the customers, \
and "Products & Services" the one least closely tied to it. \
Note that a correlation measures how strongly two variables move together, not how large the effect of a change is. \
We therefore cannot conclude that an improvement in "Online banking" has double the effect of an improvement in "Products & Services", but it does mark "Online banking" as the most promising place to look first.
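To illustrate how a score's correlation with a yes/no satisfaction flag relates to the gap in mean scores between the two groups, here is a small sketch on synthetic data. The columns `score_a` and `score_b` and all numbers are hypothetical and only stand in for the real survey dimensions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n = 2000

# Hypothetical satisfaction flag (0 = unsatisfied, 1 = satisfied)
satisfied = rng.integers(0, 2, size=n)

# Hypothetical scores: score_a differs strongly between the groups, score_b only weakly
df = pd.DataFrame({
    "score_a": 3 + 1.0 * satisfied + rng.normal(0, 1, size=n),
    "score_b": 3 + 0.3 * satisfied + rng.normal(0, 1, size=n),
    "satisfied": satisfied,
})

# Pearson correlation of each score with the satisfaction flag
corr = df.corr()["satisfied"].drop("satisfied")

# Difference in mean score between satisfied and unsatisfied customers
mean_gap = df.groupby("satisfied").mean().diff().iloc[-1]

print(corr.round(2))
print(mean_gap.round(2))
```

Looking at the group means next to the correlations makes the size of the difference visible in stars, which is often easier to communicate than a correlation coefficient.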

## Demographic satisfaction correlation 

In the following section we take a closer look at the people who are happy or unhappy with the bank. \
By isolating the different demographic features and correlating them separately with satisfaction, \
we can pin down more precisely who is happy or unhappy with the bank, and therefore where to make adjustments.

As can be seen clearly in the graph below, the most important factors when it comes to satisfaction are the type of customer and whether they have a mortgage.

Being a Business customer has a correlation of 0.4 with satisfaction, while being a Personal customer has a correlation of -0.35. \
This is a stark gap of 0.75 between the two types of customers and shows that the bank is doing a much better job with its Business customers than with its Personal customers. \
The bank also appears to be doing worse with its Business plus customers, who have a correlation of -0.1, than with its regular Business customers.

An even stronger relationship can be seen when looking at mortgages. \
Having a mortgage has a correlation of -0.44 with satisfaction, while not having a mortgage has a correlation of 0.44. \
Because the two groups are exact complements, these two values mirror each other and describe one and the same relationship, rather than a gap of 0.88. It shows that the bank is doing a much better job with its customers who do not have a mortgage. \
This is a very important insight, and the bank should definitely look into why this is the case, as making its mortgage customers this unhappy is not a good thing.
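A note on reading the two mortgage bars: when a yes/no feature is split into two dummy columns, the columns are exact complements, so their correlations with any other variable are mirror images by construction. The sketch below shows this on random, hypothetical data; the column name `has_mortgage` matches the notebook, the values do not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)

# Hypothetical data: a random mortgage flag and a random satisfaction flag
toy = pd.DataFrame({
    "has_mortgage": rng.integers(0, 2, size=500).astype(bool),
    "satisfied": rng.integers(0, 2, size=500),
})

# One dummy column per value: has_mortgage_False and has_mortgage_True
dummies = pd.get_dummies(toy["has_mortgage"], prefix="has_mortgage", dtype=float)
dummies["satisfied"] = toy["satisfied"]

# Correlation of each dummy with satisfaction
corr = dummies.corr()["satisfied"].drop("satisfied")
print(corr)
```

The two printed values are equal in size and opposite in sign, whatever the data looks like. For a feature with more than two categories, such as the customer type, the dummy columns are not simple mirror images, so each bar carries its own information.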

In comparison, the other features are pretty much negligible. \
Only the credit card is of any interest, by virtue of having no impact on satisfaction whatsoever: \
its correlation is in the range where random noise would be expected.

### Tabular correlation

The graph below the demographic satisfaction correlation graph shows the correlation of the different scores with each other, \
i.e. whether customers who are upset about one thing are also upset about another.

This is mostly important for when we want to fill in holes in the data, \
but it also gives us a little insight into the customers.

For example, we can see that customers' ratings of customer service and convenience tend to move together, \
as do their ratings of fees and the reputation of the bank.
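As a rough illustration of how a strong correlation between two scores could help to fill holes in the data, the sketch below fits a straight line on the rows where both scores are known and uses it to fill in the missing ones. The data is made up, and this simple line fit is only one of several possible imputation approaches, not necessarily the one used later on.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the last two convenience ratings are missing
scores = pd.DataFrame({
    "customer_service": [1, 2, 3, 4, 5, 2, 4],
    "convenience": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan],
})

# Fit convenience = slope * customer_service + intercept on the complete rows
known = scores.dropna()
slope, intercept = np.polyfit(known["customer_service"], known["convenience"], 1)

# Fill the missing convenience ratings with the fitted values
missing = scores["convenience"].isna()
scores.loc[missing, "convenience"] = (
    slope * scores.loc[missing, "customer_service"] + intercept
)

print(scores)
```

The weaker the correlation between two scores, the less reliable such filled-in values become, which is why the heatmap is worth checking first.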



In [79]:


demographic_columns = ['customer_gender', 'customer_location', 'customer_type', 'has_cc', 'has_mortgage']

# Work on a copy and drop rows without a satisfaction label
# (otherwise .cat.codes below would encode the missing values as -1)
demographic_correlation = customer_data[demographic_columns + ['customer_age_norm', 'satisfied']].dropna(subset=['satisfied']).copy()

for col in demographic_columns + ['satisfied']:
    demographic_correlation[col] = demographic_correlation[col].astype('category')

# Fill missing ages with the mean age
demographic_correlation['customer_age_norm'] = demographic_correlation['customer_age_norm'].fillna(demographic_correlation['customer_age_norm'].mean())

# Create dummy columns for each categorical variable
dummy_columns = pd.get_dummies(demographic_correlation[demographic_columns], prefix_sep='_', dtype=float)
# Include the normalized age, so that age is part of the correlation as well
dummy_columns['customer_age_norm'] = demographic_correlation['customer_age_norm']
dummy_columns['satisfied'] = demographic_correlation['satisfied'].cat.codes

# Compute the correlation matrix
correlation_matrix = dummy_columns.corr()

# Extract correlations related to the 'satisfied' variable
satisfaction_correlations = correlation_matrix['satisfied'].drop('satisfied')

# Create a bar plot for the correlations
fig = go.Figure()

fig.add_trace(go.Bar(
    x=satisfaction_correlations.index,
    y=satisfaction_correlations.values,
))

fig.update_layout(
    title='Correlations with Customer Satisfaction',
    xaxis_title='Demographic Variables',
    yaxis_title='Correlation',
)

fig.show()


In [80]:
# Correlation matrix for the tabular scores

tabular_correlation = customer_data[['convenience', 'customer_service', 'online_banking', 'interest_rates', 'fees_charges', 'community_involvement', 'products_services', 'privacy_security', 'reputation']]

tabular_correlation = tabular_correlation.corr()

# Hide the diagonal (the correlation of each score with itself, which is always 1)
tabular_correlation = tabular_correlation.mask(np.eye(len(tabular_correlation), dtype=bool))

# Create a heatmap using Plotly Express
fig = px.imshow(tabular_correlation, color_continuous_scale='RdBu', title='Correlation between Tabular Scores')

# Show the plot
fig.show()

# Changes over time

Last but not least, let's look at how the opinions of the customers have changed over time. \
For this we will look at how the scores trend over time and when they are at their highest and lowest.

In [101]:

score_columns = ['convenience', 'customer_service', 'online_banking', 'interest_rates', 'fees_charges', 'community_involvement', 'products_services', 'privacy_security', 'reputation']

# Work on a copy, so that customer_data itself is not overwritten
opinion_data = customer_data[score_columns + ['date']].copy()

opinion_data['date'] = pd.to_datetime(opinion_data['date'])

# Calculate the average opinion across all score columns
opinion_data['average_opinion'] = opinion_data[score_columns].mean(axis=1)

# Group by date and calculate the mean of the average_opinion
customer_data_grouped = opinion_data.groupby('date').mean().reset_index()

# Smooth the daily averages with a 5-day rolling mean
customer_data_grouped['smoothed_opinion'] = customer_data_grouped['average_opinion'].rolling(window=5).mean()

# Perform linear regression to find the trendline
x = np.arange(len(customer_data_grouped))
y = customer_data_grouped['smoothed_opinion'].bfill()  # Back-fill the leading NaN values of the rolling mean
slope, intercept = np.polyfit(x, y, 1)
trendline = slope * x + intercept

# Create the plot with Plotly
fig = px.line(customer_data_grouped, x='date', y='smoothed_opinion', title='Smoothed Average Opinion Over Time')

fig.add_scatter(x=customer_data_grouped['date'], y=trendline, mode='lines', name=f'Trendline (Slope: {slope:.5f})')
# Fixing the y-axis range
fig.update_yaxes(range=[1, 5])

fig.show()

## Opinion changes over time

As we can see from the graph above, the scores have been very stable over time. \
The trendline has a slope of -0.00002, which for all intents and purposes is a flat line. \
This means that the scores, while varying over time, have not changed in general.

While the customers' opinion may not be good, at least it is not getting worse. \
So the bank is doing something right.

We see little peaks and valleys in the scores, but they are not very significant. \
The biggest change in the scores is an increase of 0.74 on 23 August 2022, with a valley right after. \
But as the graph stays level overall, this is most likely just a statistical anomaly.
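One way to back up the statement that the trend is flat is to compare the fitted slope with its standard error. The sketch below does this on synthetic data using the covariance matrix that `np.polyfit` returns with `cov=True`; the helper `slope_with_stderr` and the generated scores are hypothetical and not derived from the bank data.

```python
import numpy as np

def slope_with_stderr(x, y):
    """Fit a straight line and return its slope and the slope's standard error."""
    coeffs, cov = np.polyfit(x, y, 1, cov=True)
    return float(coeffs[0]), float(np.sqrt(cov[0, 0]))

rng = np.random.default_rng(seed=3)
days = np.arange(200)

# Hypothetical daily average scores: flat at 3.0 with random noise
flat_scores = 3.0 + rng.normal(0, 0.3, size=days.size)

# The same noise with an upward trend of 0.01 per day added, for comparison
trending_scores = flat_scores + 0.01 * days

flat_slope, flat_stderr = slope_with_stderr(days, flat_scores)
trend_slope, trend_stderr = slope_with_stderr(days, trending_scores)

print(f"flat series:     slope = {flat_slope:.5f} +/- {flat_stderr:.5f}")
print(f"trending series: slope = {trend_slope:.5f} +/- {trend_stderr:.5f}")
```

If the slope is no larger than a couple of standard errors, as for the flat series, there is little evidence of a real trend. Note that fitting on a rolling mean, as done in the cell above, makes neighbouring points dependent and the standard error too optimistic, so such a check is better run on the unsmoothed daily averages.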