# Credit Card Customer Segmentation

## Goal

We are data scientists working at a bank. The goal is to segment the customers so that we can apply a different business strategy to each segment. For example, we could offer higher limits to customers who use the card often but spend little money, or give incentives to high-income customers who rarely use it.<br>
We will apply the K-means algorithm to segment the data.<br>
The output should be a cluster assignment for each customer and a description of each cluster's characteristics.
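
As a rough illustration of the intended workflow, here is a minimal sketch on synthetic data. The `toy` frame, its values, and the choice of two clusters are assumptions made only for this example; they are not taken from the bank dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two loosely separated groups of customers.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "estimated_income": np.concatenate([rng.normal(40_000, 5_000, 50),
                                        rng.normal(120_000, 10_000, 50)]),
    "total_trans_count": np.concatenate([rng.normal(90, 10, 50),
                                         rng.normal(30, 5, 50)]),
})

# K-means is distance based, so the features are standardized first.
scaled = StandardScaler().fit_transform(toy)

# n_clusters=2 is an arbitrary choice for this toy example.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = model.fit_predict(scaled)

# Per-cluster means give a first description of each segment.
cluster_profile = toy.groupby("cluster").mean()
print(cluster_profile)
```

Each row of `cluster_profile` is the kind of per-group summary the final explanation can be built from.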

## Libraries

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans  
from sklearn.preprocessing import StandardScaler

## Data Dictionary

- **customer_id**: unique identifier for each customer.
- **age**: customer age in years.
- **gender**: customer gender (M or F).
- **dependent_count**: number of dependents of each customer.
- **education_level**: level of education ("High School", "Graduate", etc.).
- **marital_status**: marital status ("Single", "Married", etc.).
- **estimated_income**: the estimated income for the customer projected by the data science team.
- **months_on_book**: time as a customer in months.
- **total_relationship_count**: number of times the customer contacted the company.
- **months_inactive_12_mon**: number of months the customer did not use the credit card in the last 12 months.
- **credit_limit**: customer's credit limit.
- **total_trans_amount**: the overall amount of money spent on the card by the customer.
- **total_trans_count**: the overall number of times the customer used the card.
- **avg_utilization_ratio**: daily average utilization ratio.

## Load data

In [26]:
ccc_df = pd.read_csv('data/credit_card_customers.csv')

## EDA

In [7]:
ccc_df.shape

(10127, 14)

In [8]:
ccc_df.head(5)

| | customer_id | age | gender | dependent_count | education_level | marital_status | estimated_income | months_on_book | total_relationship_count | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | 45 | M | 3 | High School | Married | 69000 | 39 | 5 | 1 | 12691.0 | 1144 | 42 | 0.061 |
| 1 | 818770008 | 49 | F | 5 | Graduate | Single | 24000 | 44 | 6 | 1 | 8256.0 | 1291 | 33 | 0.105 |
| 2 | 713982108 | 51 | M | 3 | Graduate | Married | 93000 | 36 | 4 | 1 | 3418.0 | 1887 | 20 | 0.0 |
| 3 | 769911858 | 40 | F | 4 | High School | Unknown | 37000 | 34 | 3 | 4 | 3313.0 | 1171 | 20 | 0.76 |
| 4 | 709106358 | 40 | M | 3 | Uneducated | Married | 65000 | 21 | 5 | 1 | 4716.0 | 816 | 28 | 0.0 |


In [21]:
ccc_df['education_level'].value_counts(normalize=True)

education_level
Graduate         0.363879
High School      0.232152
Uneducated       0.173299
College          0.117705
Post-Graduate    0.060827
Doctorate        0.052138
Name: proportion, dtype: float64

### Observations

- The dataset has 10,127 observations (customers).
- There are 14 columns, including `customer_id`.
- Age is approximately normally distributed: min 26, max 73, mean 43.
- 53% of customers are female.
- The average number of dependents is 2.
- The most common education level is Graduate (36%).
- 46% of customers are married and 39% are single.
- Estimated income: mean 62k, min 20k, max 200k.
- The average time on book is around 36 months.
- The maximum inactivity in the last 12 months is 6 months; the most common value is 3 months.
- Credit limit ranges from 1,438.3 to 34,516 and has a long-tailed distribution.
- Total transaction amount ranges from 510 to 18,484 and also has a long-tailed distribution.
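
Summary figures like the ones above can be obtained with a few pandas calls. The sketch below runs on a hypothetical five-row `sample` frame built from the `head()` output, so its numbers will not match the full-dataset observations.

```python
import pandas as pd

# Hypothetical five-row sample; values copied from the head() output above.
sample = pd.DataFrame({
    "age": [45, 49, 51, 40, 40],
    "gender": ["M", "F", "M", "F", "M"],
    "marital_status": ["Married", "Single", "Married", "Unknown", "Married"],
    "credit_limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
})

# Count, mean, std, min, quartiles and max for every numeric column.
numeric_summary = sample.describe()

# Share of each category, as done for education_level.
gender_share = sample["gender"].value_counts(normalize=True)
marital_share = sample["marital_status"].value_counts(normalize=True)

# A positive skewness hints at a long right tail.
credit_limit_skew = sample["credit_limit"].skew()

print(numeric_summary)
print(gender_share)
print(marital_share)
print(credit_limit_skew)
```

On the real data the same calls would be applied to `ccc_df` instead of `sample`.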

## Feature Engineering I

In [34]:
ccc_modified = ccc_df.copy()

Let's encode the categorical variables as numbers, since K-means works on numeric features.

In [35]:
ccc_modified['gender'] = ccc_modified['gender'].apply(lambda row: 1 if row == 'F' else 0)

In [36]:
# Encode education level as an ordinal variable, from lowest to highest

mapper = {
    'Uneducated': 0,
    'High School': 1,
    'College': 2,
    'Graduate': 3,
    'Post-Graduate': 4,
    'Doctorate':  5
}

ccc_modified['education_level'] = ccc_modified['education_level'].map(mapper)
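
One thing worth checking after a dictionary-based `map` is whether every label had a key: labels missing from the mapper become `NaN`. The sketch below shows that check on a hypothetical `education` series; the `'Unknown'` label is an assumption added only to trigger the case and does not appear in the `value_counts` output above.

```python
import pandas as pd

# Same ordinal mapping as in the cell above.
mapper = {
    "Uneducated": 0,
    "High School": 1,
    "College": 2,
    "Graduate": 3,
    "Post-Graduate": 4,
    "Doctorate": 5,
}

# Hypothetical sample; 'Unknown' is deliberately not in the mapper.
education = pd.Series(["Graduate", "High School", "Unknown", "Doctorate"])

encoded = education.map(mapper)

# Labels without a key in the mapper come back as NaN.
unmapped = education[encoded.isna()].unique()
print(unmapped)
```

An empty `unmapped` array would mean every label was covered by the mapper.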


In [38]:
"""
Marital status cannot be encoded like education level:
its values have no natural order, so let's one-hot
encode them instead.
"""

pd.get_dummies(ccc_modified, columns=['marital_status'])

| | customer_id | age | gender | dependent_count | education_level | estimated_income | months_on_book | total_relationship_count | months_inactive_12_mon | credit_limit | total_trans_amount | total_trans_count | avg_utilization_ratio | marital_status_Divorced | marital_status_Married | marital_status_Single | marital_status_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | 45 | 0 | 3 | 1 | 69000 | 39 | 5 | 1 | 12691.0 | 1144 | 42 | 0.061 | False | True | False | False |
| 1 | 818770008 | 49 | 1 | 5 | 3 | 24000 | 44 | 6 | 1 | 8256.0 | 1291 | 33 | 0.105 | False | False | True | False |
| 2 | 713982108 | 51 | 0 | 3 | 3 | 93000 | 36 | 4 | 1 | 3418.0 | 1887 | 20 | 0.000 | False | True | False | False |
| 3 | 769911858 | 40 | 1 | 4 | 1 | 37000 | 34 | 3 | 4 | 3313.0 | 1171 | 20 | 0.760 | False | False | False | True |
| 4 | 709106358 | 40 | 0 | 3 | 0 | 65000 | 21 | 5 | 1 | 4716.0 | 816 | 28 | 0.000 | False | True | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10122 | 772366833 | 50 | 0 | 2 | 3 | 51000 | 40 | 3 | 2 | 4003.0 | 15476 | 117 | 0.462 | False | False | True | False |
| 10123 | 710638233 | 41 | 0 | 2 | 3 | 40000 | 25 | 4 | 2 | 4277.0 | 8764 | 69 | 0.511 | True | False | False | False |
| 10124 | 716506083 | 44 | 1 | 1 | 1 | 33000 | 36 | 5 | 3 | 5409.0 | 10291 | 60 | 0.000 | False | True | False | False |
| 10125 | 717406983 | 30 | 0 | 2 | 3 | 47000 | 36 | 4 | 3 | 5281.0 | 8395 | 62 | 0.000 | False | False | False | True |
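
Note that `pd.get_dummies` returns a new DataFrame and leaves its input untouched, so the result has to be assigned if the dummy columns are to be kept. A minimal sketch, using a hypothetical three-row `toy` frame in place of `ccc_modified` and the optional `dtype=int` argument to get 0/1 columns instead of booleans:

```python
import pandas as pd

# Hypothetical three-row frame standing in for ccc_modified.
toy = pd.DataFrame({
    "age": [45, 49, 40],
    "marital_status": ["Married", "Single", "Unknown"],
})

# get_dummies returns a new DataFrame; the input is left unchanged.
# dtype=int yields 0/1 columns instead of True/False.
encoded = pd.get_dummies(toy, columns=["marital_status"], dtype=int)

print(encoded.columns.tolist())
```

Whether to overwrite the original variable or keep the encoded frame under a new name is a matter of preference.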
