
<h1 align="center"><font size="5">Unsupervised Machine Learning - Final Assignment</font></h1>


# Introduction

The aim of this workbook is to use unsupervised learning to draw insights from a dataset.

# Library Imports

In [25]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [26]:
from sklearn.cluster import KMeans, DBSCAN, MeanShift

# Dataset Info

__name:__ Bank Customer Segmentation (1M+ Transactions)\
__source:__ Kaggle\
__url:__ https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation

__Bank Customer Segmentation__

Most banks have a large customer base, with different characteristics in terms of age, income, values, lifestyle, and more. Customer segmentation is the process of dividing a customer dataset into specific groups based on shared traits.

According to a report from Ernst & Young, “A more granular understanding of consumers is no longer a nice-to-have item, but a strategic and competitive imperative for banking providers. Customer understanding should be a living, breathing part of everyday business, with insights underpinning the full range of banking operations.”

__About this Dataset__

This dataset consists of 1 million+ transactions by over 800K customers for a bank in India. The data contains information such as customer age (DOB), location, gender, account balance at the time of the transaction, transaction details, and transaction amount.

## 1. EDA

### 1.1 Dataset Loading

In [27]:
df = pd.read_csv('bank_transactions.csv')
df.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,T1,C5841053,10/1/94,F,JAMSHEDPUR,17819.05,2/8/16,143207,25.0
1,T2,C2142763,4/4/57,M,JHAJJAR,2270.69,2/8/16,141858,27999.0
2,T3,C4417068,26/11/96,F,MUMBAI,17874.44,2/8/16,142712,459.0
3,T4,C5342380,14/9/73,F,MUMBAI,866503.21,2/8/16,142714,2060.0
4,T5,C9031234,24/3/88,F,NAVI MUMBAI,6714.43,2/8/16,181156,1762.5


The first step is to check for missing or null values, expressed as the percentage of rows affected in each column.

In [28]:
(df.isna().sum() / len(df)) * 100

TransactionID              0.000000
CustomerID                 0.000000
CustomerDOB                0.323966
CustGender                 0.104905
CustLocation               0.014401
CustAccountBalance         0.225927
TransactionDate            0.000000
TransactionTime            0.000000
TransactionAmount (INR)    0.000000
dtype: float64

Four columns have missing entries, but each is below 1% of rows, so I will drop the affected rows from the dataset.
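
The per-column percentages do not directly show how many rows `dropna()` removes, because one row can be missing several values. A minimal sketch of that check, on a made-up toy frame (the values below are illustrative, not taken from the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data; the missing entries are invented
toy = pd.DataFrame({
    "CustomerDOB": ["10/1/94", None, "26/11/96", "14/9/73"],
    "CustGender": ["F", "M", None, "F"],
    "CustAccountBalance": [17819.05, 2270.69, 17874.44, np.nan],
    "TransactionAmount (INR)": [25.0, 27999.0, 459.0, 2060.0],
})

# Per-column share of missing values, in percent
missing_pct = toy.isna().mean() * 100

# Share of rows with at least one missing value,
# i.e. the share of rows that dropna() would remove
rows_lost_pct = toy.isna().any(axis=1).mean() * 100

print(missing_pct)
print(f"rows removed by dropna(): {rows_lost_pct:.1f}%")
```

If the row-level figure is also small, dropping the rows is unlikely to distort the clusters much.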

In [29]:
df.dropna(inplace=True)

In [30]:
df.isna().sum()

TransactionID              0
CustomerID                 0
CustomerDOB                0
CustGender                 0
CustLocation               0
CustAccountBalance         0
TransactionDate            0
TransactionTime            0
TransactionAmount (INR)    0
dtype: int64

Now I will review the datatypes of the columns.

In [31]:
df.dtypes

TransactionID               object
CustomerID                  object
CustomerDOB                 object
CustGender                  object
CustLocation                object
CustAccountBalance         float64
TransactionDate             object
TransactionTime              int64
TransactionAmount (INR)    float64
dtype: object

Several columns are currently stored as objects but can be represented numerically, either by encoding (CustGender, CustLocation) or by converting to dates (CustomerDOB, TransactionDate). This will be done during the Feature Engineering step.

I will quickly review the distribution of the TransactionID and CustomerID columns.

In [32]:
transaction_id_vc = df['TransactionID'].value_counts()
transaction_id_vc

TransactionID
T1          1
T699342     1
T699328     1
T699329     1
T699330     1
           ..
T349709     1
T349710     1
T349711     1
T349712     1
T1048567    1
Name: count, Length: 1041614, dtype: int64

In [33]:
transaction_id_vc[transaction_id_vc > 1]

Series([], Name: count, dtype: int64)

Every TransactionID occurs exactly once, so the column only identifies the row and can probably be removed.
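
The same uniqueness check can be written more directly with `Series.is_unique` and `Series.duplicated`. A small sketch on toy IDs (not the real column):

```python
import pandas as pd

# Toy transaction IDs for illustration
ids = pd.Series(["T1", "T2", "T3", "T4"], name="TransactionID")

# True when no value repeats
all_unique = ids.is_unique

# Number of entries that repeat an earlier value
n_duplicates = int(ids.duplicated().sum())

print(all_unique, n_duplicates)
```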

In [34]:
customer_id_vc = df['CustomerID'].value_counts()
customer_id_vc

CustomerID
C5533885    6
C7537344    6
C1736254    6
C1113684    6
C4327447    6
           ..
C1610768    1
C4929259    1
C1026114    1
C6817889    1
C6420483    1
Name: count, Length: 879358, dtype: int64

In [35]:
customer_id_vc[customer_id_vc > 1]

CustomerID
C5533885    6
C7537344    6
C1736254    6
C1113684    6
C4327447    6
           ..
C2239666    2
C5811913    2
C1635174    2
C6229976    2
C6442630    2
Name: count, Length: 141961, dtype: int64

Of the 879,358 customers, 141,961 have more than one transaction (up to 6), so CustomerID can stay in the dataset.
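
Because customers can repeat, one option is to summarise transactions per customer before clustering. This is only a sketch of that idea on toy rows; the aggregate names (`n_transactions`, `total_amount`, `mean_amount`) are hypothetical and not part of the dataset:

```python
import pandas as pd

# Toy transactions: customer C1 appears three times
toy = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C1", "C3", "C1"],
    "TransactionAmount (INR)": [25.0, 27999.0, 459.0, 2060.0, 1762.5],
})

# One row per customer with simple transaction summaries
per_customer = toy.groupby("CustomerID")["TransactionAmount (INR)"].agg(
    n_transactions="count",
    total_amount="sum",
    mean_amount="mean",
)

print(per_customer)
```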

### 1.2 Data Preprocessing

Tasks in this step include:

1. Removing unnecessary columns
2. Encoding or converting object columns
3. Normalizing numerical columns
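
As an illustration of task 3, the sketch below standardises the two numeric columns using the five sample rows shown earlier. The `log1p` step is an assumption on my part (balances and amounts look heavily right-skewed in the sample); the z-score with `ddof=0` is the same calculation that scikit-learn's `StandardScaler` performs.

```python
import numpy as np
import pandas as pd

# The five sample rows from df.head()
toy = pd.DataFrame({
    "CustAccountBalance": [17819.05, 2270.69, 17874.44, 866503.21, 6714.43],
    "TransactionAmount (INR)": [25.0, 27999.0, 459.0, 2060.0, 1762.5],
})

# Assumption: compress the long right tail before scaling
logged = np.log1p(toy)

# Z-score each column: zero mean, unit (population) standard deviation
scaled = (logged - logged.mean()) / logged.std(ddof=0)

print(scaled.round(3))
```

Distance-based methods such as KMeans and DBSCAN are sensitive to feature scale, which is why the columns are brought to a common range first.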

In [36]:
df.drop('TransactionID', axis=1, inplace=True)

The next step is to convert the object columns which are dates to actual date values.

In [37]:
df.head()

Unnamed: 0,CustomerID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,C5841053,10/1/94,F,JAMSHEDPUR,17819.05,2/8/16,143207,25.0
1,C2142763,4/4/57,M,JHAJJAR,2270.69,2/8/16,141858,27999.0
2,C4417068,26/11/96,F,MUMBAI,17874.44,2/8/16,142712,459.0
3,C5342380,14/9/73,F,MUMBAI,866503.21,2/8/16,142714,2060.0
4,C9031234,24/3/88,F,NAVI MUMBAI,6714.43,2/8/16,181156,1762.5


In [38]:
DATE_COLUMNS = ['CustomerDOB', 'TransactionDate']

In [39]:
df[DATE_COLUMNS] = df[DATE_COLUMNS].astype('datetime64[ns]')
df[DATE_COLUMNS].head()

Unnamed: 0,CustomerDOB,TransactionDate
0,1994-10-01,2016-02-08
1,2057-04-04,2016-02-08
2,1996-11-26,2016-02-08
3,2073-09-14,2016-02-08
4,1988-03-24,2016-02-08
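
The parsed dates above need a second look. Birth years 57 and 73 come out as 2057 and 2073, and `10/1/94` is read month-first (1 October) even though `26/11/96` can only be day-first. A possible fix, sketched on the five sample values, is to pass an explicit day-first format and then move any date after a cutoff back by a century. The cutoff used here is an assumption (nobody in the data is born after the transaction year).

```python
import pandas as pd

# The five CustomerDOB strings from the sample rows
dob = pd.Series(["10/1/94", "4/4/57", "26/11/96", "14/9/73", "24/3/88"])

# Explicit day/month/two-digit-year format removes the day/month ambiguity
parsed = pd.to_datetime(dob, format="%d/%m/%y")

# %y maps 00-68 to 2000-2068, so some birth dates land in the future.
# Assumption: no customer is born after the end of 2016.
cutoff = pd.Timestamp("2016-12-31")
fixed = parsed.where(parsed <= cutoff, parsed - pd.DateOffset(years=100))

print(fixed)
```

The same explicit format would apply to TransactionDate, where no century correction should be needed.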


The next step is to encode CustGender as integer labels: M becomes 0 and every other value becomes 1.

In [40]:
df['CustGender'] = df['CustGender'].apply(lambda x: 0 if x == 'M' else 1)
df['CustGender'].unique()

array([1, 0])
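
Note that the lambda assigns 1 to anything that is not `'M'`, so an unexpected code would silently be counted with `'F'`. A more defensive variant, sketched on toy values (the `"X"` entry is invented to show the behaviour), uses an explicit mapping and surfaces whatever it does not recognise:

```python
import pandas as pd

# Toy gender column; "X" is a made-up unexpected code
gender = pd.Series(["F", "M", "F", "X"])

# Explicit mapping: values missing from the dict become NaN
codes = gender.map({"M": 0, "F": 1})

# Inspect the unmapped values before deciding how to handle them
unexpected = gender[codes.isna()].unique()

print(unexpected)
```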

In [51]:
unique_locations = df['CustLocation'].unique()
print(f"num unique locations: {len(unique_locations)}")
unique_locations

num unique locations: 9275


array(['JAMSHEDPUR', 'JHAJJAR', 'MUMBAI', ..., 'KARANJIA',
       'NR HERITAGE FRESH HYDERABAD', 'IMPERIA THANE WEST'], dtype=object)

There are too many locations to use one-hot categorical encoding, as this would add thousands of features to the dataset. I've noticed that some location names contain brackets, so maybe these can be excluded.

In [54]:
bracketed_locations = [loc for loc in unique_locations if '(' in loc]
print(f"num bracketed locations {len(bracketed_locations)}")

num bracketed locations 365
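
Removing the bracketed names would still leave thousands of distinct locations. One alternative, shown here only as a sketch on toy values, is to keep the most frequent locations and group the rest under a single label. Both `top_n` and the `"OTHER"` label are hypothetical choices, not something taken from the dataset.

```python
import pandas as pd

# Toy location column
loc = pd.Series([
    "MUMBAI", "MUMBAI", "NAVI MUMBAI", "JAMSHEDPUR",
    "MUMBAI", "JHAJJAR", "NAVI MUMBAI",
])

top_n = 2  # hypothetical: number of locations to keep as-is

# The top_n most frequent locations
top = loc.value_counts().nlargest(top_n).index

# Keep frequent locations, replace everything else with one label
grouped = loc.where(loc.isin(top), "OTHER")

print(grouped.value_counts())
```

This caps the number of encoded features at `top_n + 1`, at the cost of losing detail for rare locations.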
