# <font color='#088F8F' size='6' face='Times New Roman'><b>**Customer Segmentation: Market Basket Analysis**</b></font>
![Customer Segmentation](https://miro.medium.com/v2/resize:fit:1200/1*XboZUy1dlqOsYmwxRwREuw.png)

<font color='#5F9EA0' size='6' face='Times New Roman'><b>**About The DataSet:**</b></font>

* **This dataset consists of 1 million+ transactions by over 800K customers of a bank in India.**

* **The data contains information such as customer age (DOB), location, gender, account balance at the time of the transaction, transaction details, transaction amount, etc.**

<font color='#5F9EA0' size='6' face='Times New Roman'><b>**Goal Of The Project:**</b></font>

* **Perform Clustering / Segmentation on the dataset and identify popular customer groups along with their definitions/rules**
* **Perform Location-wise analysis to identify regional trends in India**
* **Perform transaction-related analysis to identify interesting trends that can be used by a bank to improve / optimize their user experiences**
* **Customer Recency, Frequency, Monetary analysis**
* **Network analysis or Graph analysis of customer data.**

<font color='#5F9EA0' size='6' face='Times New Roman'><b>**Table of contents of this notebook:**</b></font>

1. [**Importing Necessary Libraries**](http://)

2. [**Data Collection**](http://)

3. [**Data Cleaning**](http://)

4. [**Exploratory Data Analysis**](http://)

5. [**Feature Engineering**](http://)

6. [**Modelling**](http://)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
pip install tabulate

# Meet-01

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import probplot

import scipy.stats as stats # scientific computing tools
pd.set_option('mode.chained_assignment', None)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/bank-customer-segmentation/bank_transactions.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

**Checking the null values and duplicate records**

In [None]:
df.isnull().sum()

In [None]:
# percentage of missing values out of total 1048567 records
round(df.isnull().sum()*100/df.shape[0],2)

In [None]:
# drop the rows with missing values, then cross-check that no NaN values remain
df = df.dropna()
df.isnull().sum()

In [None]:
df.shape

* 10,41,614 records are left in the dataset after dropping the rows with missing values.

In [None]:
df.duplicated().sum()

**Checking the Numerical columns**

In [None]:
df.describe().T

In [None]:
df[df['TransactionTime']==0].shape

In [None]:
df[df['TransactionAmount (INR)']==0].shape

* A zero account balance is valid, so those records are kept.
* There are 2 records where the transaction time is 0.
* There are 820 records where the transaction amount is 0 INR.
* Records matching either of the last two conditions carry no useful information, so they are dropped.
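
Dropping those records can be sketched as follows. This is a minimal illustration on a small made-up frame, not the real data: the `sample` and `cleaned` names and the values are hypothetical, and only the column names follow the dataset.

```python
import pandas as pd

# Toy stand-in for the transactions table (values are illustrative only)
sample = pd.DataFrame({
    'TransactionTime': [0, 143207, 181156, 142712],
    'TransactionAmount (INR)': [25.0, 0.0, 459.0, 2060.0],
})

# Keep only rows where both the transaction time and the amount are non-zero
mask = (sample['TransactionTime'] != 0) & (sample['TransactionAmount (INR)'] != 0)
cleaned = sample[mask].reset_index(drop=True)
print(cleaned)
```

On the full dataset the same mask would be applied to `df` instead of `sample`.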

**Converting the CustomerDOB, TransactionDate from object type to Datetime type.**

In [None]:
df['CustomerDOB'].unique()

In [None]:
from datetime import datetime 
datetime.now()

In [None]:
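from datetime import datetime
# Parse the raw strings first ('%d/%m/%y' is assumed, as in the RFM section; unparseable DOBs become NaT), then shift future DOBs back 100 years
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'], format='%d/%m/%y')
df['DOB'] = pd.to_datetime(df['CustomerDOB'], format='%d/%m/%y', errors='coerce').apply(lambda x: x - pd.DateOffset(years=100) if x > datetime.now() else x)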

In [None]:
df['CustBYear'] = df['DOB'].dt.year
df['CustomerAge'] = df['TransactionDate'].dt.year - df['CustBYear']

In [None]:
df.head()

In [None]:
df = pd.read_csv(r'/kaggle/input/bank-customer-segmentation/bank_transactions.csv')
df = df.sample(n=100000,random_state = 42)
df.info()

In [None]:
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'])

In [None]:
pd.options.display.float_format = '{:.2f}'.format
df.describe()

In [None]:
cat_columns = ['CustGender', 'CustLocation']
for col in cat_columns:
    print(f"Column: {col}")
    print(df[col].value_counts())
    print("\n")

In [None]:
df = df.drop(df[df['CustomerDOB'] == '1/1/1800'].index, axis=0)
df.loc[df.CustomerDOB.dt.year >= 2022, 'CustomerDOB'] = df.loc[df.CustomerDOB.dt.year >= 2022, 'CustomerDOB'] - pd.DateOffset(years=100)
df['CustomerAge'] = ((pd.to_datetime('today') - df['CustomerDOB']).dt.days / 365.25).round(0)
df.head()

In [None]:
df1 = df.copy()
# 1. Distribution of customers by gender (CustGender)
plt.figure(figsize=(6, 4))
sns.countplot(x='CustGender',data = df1, palette='pastel')
plt.title('Distribution of Customers by Gender')
plt.show()

# 2. Age distribution of customers based on "CustomerDOB" column
plt.figure(figsize=(10, 6))
sns.histplot(df1['CustomerAge'], bins=30, kde=True, color='skyblue')
plt.title('Age Distribution of Customers')
plt.xlabel('Age')
plt.ylabel('Number of Customers')
plt.show()

**Let's analyse the top locations where the largest number of transactions are made**

In [None]:
temp_location = df['CustLocation'].value_counts().reset_index().head(10)
temp_location

In [None]:
location_counts = df1['CustLocation'].value_counts().nlargest(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=location_counts.index, y=location_counts.values, palette='viridis')
plt.title('Top 10 Locations with the Highest Number of Customers')
plt.xlabel('Location')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
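top_10_locations = df1['CustLocation'].value_counts().nlargest(10)

# Total transaction amount for each of the top 10 locations
location_transaction_volumes = (df1[df1['CustLocation'].isin(top_10_locations.index)]
                                .groupby('CustLocation')['TransactionAmount (INR)'].sum()
                                .sort_values(ascending=False))

# Visualize the transaction volumes for each location using a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=location_transaction_volumes.index, y=location_transaction_volumes.values, palette='coolwarm')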
plt.title('Transaction Volumes for Different Locations')
plt.xlabel('Location')
plt.ylabel('Total Transaction Amount (INR)')
plt.xticks(rotation=45)
plt.show()

**Grouping customers into age groups**

In [None]:
def age_group(age):
    if age < 20:
        return 'Below 20'
    elif age >= 20 and age <= 30:
        return '20 - 30'
    elif age > 30 and age <= 40:
        return '31 - 40'
    elif age > 40 and age <= 50:
        return '41 - 50'
    elif age > 50 and age <= 60:
        return '51 - 60'
    elif age >60 and age <= 70:
        return '61 - 70'
    else:
        return 'Above 70'
    
df['age_group'] = df['CustomerAge'].apply(age_group)

In [None]:
plt.title(' Distribution of Customer Age-group')
sns.countplot(x = df['age_group'].sort_values())

**Top 10 largest transactions**

In [None]:
top_10_big_transact = df.sort_values(by = 'TransactionAmount (INR)',ascending = False).head(10)
top_10_big_transact

**Top 10 smallest transactions**

In [None]:
top_10_small_transact = df.sort_values(by= 'TransactionAmount (INR)').head(10)
top_10_small_transact

**Top 10 transaction dates with the highest number of transactions**

In [None]:
max_transact_dates = df.groupby('TransactionDate').agg('count').reset_index().sort_values(by = 'TransactionAmount (INR)',ascending = False).head(10)
plt.figure(figsize = [15,5])
sns.barplot(x = max_transact_dates['TransactionDate'].dt.date,y = max_transact_dates['TransactionAmount (INR)'] )
plt.title('Top 10 Dates with maximum no. of Transactions')
plt.xlabel('Transaction Date ')
plt.ylabel('No of Transactions')

**Top 10 transaction dates with the lowest number of transactions**

In [None]:
least_transact_dates = df.groupby('TransactionDate').agg('count').reset_index().sort_values(by = 'TransactionAmount (INR)').head(10)
plt.figure(figsize = [15,5])
sns.barplot(x = least_transact_dates['TransactionDate'].dt.date,y = least_transact_dates['TransactionAmount (INR)'] )
plt.title('Top 10 Dates with Least no. of Transactions')
plt.xlabel('Transaction Date ')
plt.ylabel('No. of Transactions')

**Extracting some more features for further data analysis**

In [None]:
df['transaction_month'] = df['TransactionDate'].dt.month_name()
df['transaction_year']  = df['TransactionDate'].dt.year
df['transaction_weekday'] = df['TransactionDate'].dt.day_name()

from datetime import datetime
# Format TransactionTime as 'HH:MM' (this treats the raw value as a number of seconds, which is an assumption about the column)
df['Transaction_Time'] = df['TransactionTime'].apply(lambda x : datetime.utcfromtimestamp(int(x)).strftime('%H:%M'))

In [None]:
sns.countplot(x = df['transaction_weekday'])

In [None]:
df_month = df['transaction_month'].value_counts().reset_index()
plt.figure(figsize = [15,5])
sns.barplot(x = df_month['transaction_month'] , y = df_month['count'])
plt.title('No. of Transactions per Month')
plt.xlabel('Transaction Month ')
plt.ylabel('No. of Transactions')

In [None]:
def time_division(time):
    
    if time>= '06:00' and time< '12:00':
        return 'Morning'
    elif time >= '12:00' and time < '17:00':
        return "Afternoon"
    elif time >='17:00' and time < '22:00':
        return 'Evening'
    else :
        return 'Night'
    

df['transaction_time_division'] = df['Transaction_Time'].apply(time_division)

In [None]:
df_transaction_time_division = df['transaction_time_division'].value_counts().reset_index()
plt.figure(figsize = [15,5])
sns.barplot(x = df_transaction_time_division['transaction_time_division'] , y = df_transaction_time_division['count'])
plt.title('No. of Transactions by Time Division')
plt.xlabel('Time Division')
plt.ylabel('No. of Transactions')

# Exploratory Data Analysis(EDA)

The cell below loops through each float column in the DataFrame, prints its skewness and kurtosis, and draws a distribution plot, a box plot, and a quantile-quantile plot. Together these plots show the shape of each column's distribution and make outliers easier to spot.







In [None]:
from scipy.stats import probplot
for col in df.columns:
    if df[col].dtypes == np.float64:
        print("Skewness of {}:".format(col),df[col].skew())
        print("Kurtosis of {}:".format(col),df[col].kurt())
        plt.figure(figsize=(3,3))
        print("Distribution Plot of {}:".format(col))
        sns.histplot(df[col], kde=True)  # histplot replaces the deprecated distplot
        plt.show()
        print("Box Plot of {}:".format(col))
        plt.figure(figsize=(3,3))
        sns.boxplot(x=df[col])
        plt.show()
        print("Quantile-Quantile Plot of {}:".format(col))
        plt.figure(figsize=(3,3))
        probplot(df[col].dropna(),plot=plt,rvalue=True)  # NaN values would break the fit
        plt.show()

# **RFM**

In [None]:
df = pd.read_csv('/kaggle/input/bank-customer-segmentation/bank_transactions.csv')
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'], format = '%d/%m/%y')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(x = df['TransactionDate'].dt.month, bins = 3, binwidth = 1)
plt.title('Number of transactions in each month')

**Benefits and Result:**

**Visualization of Transaction Count by Month:** The histogram plot visually represents the distribution of transaction counts across different months. Each bar in the histogram corresponds to a month, and the height of the bar indicates the number of transactions that occurred in that month.

**Identifying Seasonal Patterns:** By observing the histogram, we can identify any seasonal patterns or trends in transaction volumes. For example, if certain months consistently have higher transaction counts, it may indicate seasonal factors influencing customer behavior or economic activities.

**Analyzing Transaction Frequency:** The histogram helps in analyzing the frequency of transactions over time, providing insights into periods of high or low activity. This information can be valuable for business planning, marketing campaigns, and resource allocation.

**Data Quality Check:** Additionally, this plot can also serve as a quick data quality check to ensure that the 'TransactionDate' column has been parsed correctly and contains meaningful date values.

**In short, this plot lets us visually explore the distribution of bank transactions over different months and gain insights into transaction patterns and behavior.**
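
The counts behind such a histogram can also be read off as numbers. The following is a small sketch on made-up dates, assuming the same day/month/two-digit-year layout used when parsing `TransactionDate`; the `dates` and `per_month` names are hypothetical.

```python
import pandas as pd

# Toy dates in day/month/two-digit-year layout (illustrative values only)
dates = pd.to_datetime(
    pd.Series(['02/08/16', '15/08/16', '01/09/16', '21/10/16']),
    format='%d/%m/%y',
)

# Count transactions per calendar month, ordered by month number
per_month = dates.dt.month.value_counts().sort_index()
print(per_month)
```

Comparing these counts with the bar heights is a quick way to confirm that the dates were parsed with the day and month in the intended order.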

In [None]:
import pandas as pd
import numpy as np

# Convert CustomerDOB and TransactionDate to datetime format
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'])
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])

# Calculate Age and Days Since Transaction
df['Age'] = (pd.to_datetime('today') - df['CustomerDOB']) / np.timedelta64(1, 'D') / 365.25  # 365.25 days in a year
df['DaysSinceTransaction'] = (pd.to_datetime('today') - df['TransactionDate']) / np.timedelta64(1, 'D')

# Adjust DaysSinceTransaction to start from 0
df['DaysSinceTransaction'] = df['DaysSinceTransaction'] - df['DaysSinceTransaction'].min()

# Group by CustomerID and count transactions
temp = df[['CustomerID', 'TransactionID']].groupby(by='CustomerID', as_index=False, sort=False).count().reset_index()

# Print the first few rows of the resulting DataFrame
print(temp.head())


In [None]:
temp = temp.drop(columns = 'index')
# Both frames contain a TransactionID column, so the merge suffixes them:
# TransactionID_x (the original ID) and TransactionID_y (the per-customer count)
df = df.merge(right = temp, on = 'CustomerID')
df.head()

In [None]:
df = df.rename(columns = {'TransactionID_y' : 'TransactionFrequency',
                         'DaysSinceTransaction' : 'Recency'})

In [None]:
rmf = df.drop(columns = ['CustGender', 'CustLocation',
                         'CustAccountBalance', 'TransactionTime', 'Age']
             ).groupby(by = 'CustomerID').agg({'Recency' : 'min',
                                               'TransactionFrequency': 'first',
                                               'TransactionAmount (INR)' : 'mean'})
df = df.rename(columns = {'TransactionAmount (INR)' : 'AverageTransactionAmount'})
rmf = rmf.rename(columns = {'TransactionAmount (INR)' : 'AverageTransactionAmount'})
rmf.head()


**The process of this code snippet involves data preprocessing and aggregation for customer relationship management (CRM) analysis, specifically for Recency, Frequency, and Monetary (RFM) analysis. Here's what each step accomplishes:**

**Renaming columns:** The code renames columns to have more meaningful names, such as 'TransactionID_y' to 'TransactionFrequency' and 'DaysSinceTransaction' to 'Recency', making the data easier to understand and analyze.

**Dropping unnecessary columns:** It removes columns that are not required for RFM analysis, such as customer demographics and other per-transaction fields ('CustGender', 'CustLocation', 'CustAccountBalance', 'TransactionTime', 'Age'). This simplifies the data and focuses on relevant metrics.

**Grouping and aggregating data:** The code groups the data by 'CustomerID' and calculates key metrics like 'Recency' (how recently a customer made a transaction), 'TransactionFrequency' (how often a customer transacts), and 'AverageTransactionAmount' (the average amount spent per transaction by each customer). These metrics are fundamental for RFM analysis and customer segmentation in CRM strategies.

**Renaming columns again:** After aggregating the data, the code renames the 'TransactionAmount (INR)' column to 'AverageTransactionAmount' for clarity and consistency in naming conventions.
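
The grouping and renaming steps described above can also be written in one pass with pandas named aggregation. This is only a sketch on a made-up frame, not the notebook's own pipeline: `tx` and `rfm_sketch` are hypothetical names, and the values are illustrative.

```python
import pandas as pd

# Toy transactions: customer C1 has two rows, C2 has one (illustrative values)
tx = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C2'],
    'TransactionID': ['T1', 'T2', 'T3'],
    'Recency': [10.0, 3.0, 40.0],
    'TransactionAmount (INR)': [100.0, 300.0, 50.0],
})

# One row per customer: most recent transaction, transaction count, mean amount
rfm_sketch = tx.groupby('CustomerID').agg(
    Recency=('Recency', 'min'),
    TransactionFrequency=('TransactionID', 'count'),
    AverageTransactionAmount=('TransactionAmount (INR)', 'mean'),
)
print(rfm_sketch)
```

Named aggregation produces the final column names directly, so no merge or follow-up rename is needed.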

In [None]:
fig, axes = plt.subplots(1, 3, figsize = (65, 15))
axes = axes.flatten()

sns.countplot(x = 'Recency', data = rmf, ax = axes[0])
sns.histplot(x = 'TransactionFrequency', data = rmf, ax = axes[1])
sns.scatterplot(x = 'AverageTransactionAmount', y = 'Recency', data = rmf, ax = axes[2])
plt.tight_layout()


**This code segment visualizes key aspects of customer behavior and transaction data using three different types of plots:**

* The first subplot shows the distribution of customer recency.
* The second subplot displays the frequency of customer transactions.
* The third subplot illustrates the relationship between the average transaction amount and recency.


In [None]:
def recency_score(value, quartiles):
    if value < quartiles[0.25]:
        return 4
    if value < quartiles[0.5]:
        return 3
    if value < quartiles[.75]:
        return 2
    else:
        return 1

def monetary_score(value, quartiles):
    if value < quartiles[0.25]:
        return 1
    if value < quartiles[0.5]:
        return 2
    if value < quartiles[0.75]:
        return 3
    else:
        return 4
    
quartiles = rmf.quantile([0.25, 0.5, 0.75]).to_dict()

rmf['recency_score'] = rmf['Recency'].apply(recency_score, quartiles = quartiles['Recency'],)
rmf['frequency_score'] = rmf['TransactionFrequency'].astype(int)
rmf.loc[rmf['frequency_score'] > 4, 'frequency_score'] = 4
rmf['monetary_score'] = rmf['AverageTransactionAmount'].apply(monetary_score, quartiles = quartiles['AverageTransactionAmount'],)
rmf['total_score'] = rmf['recency_score'] + rmf['frequency_score'] + rmf['monetary_score']

In [None]:
fig, axes = plt.subplots(1, 3, figsize = (20, 10))
axes = axes.flatten()

recency = rmf.groupby(by = 'TransactionFrequency').mean().reset_index()
avg_amount = rmf.groupby(by = 'TransactionFrequency').mean().reset_index()


sns.scatterplot(x = 'total_score', y = 'AverageTransactionAmount', hue = 'TransactionFrequency',
            data = avg_amount, ax = axes[0])
axes[0].set_title('Transaction Amount vs Total Score\nAveraged over Transaction Frequency')
sns.scatterplot(x = 'total_score', y = 'Recency', data = recency, hue = 'TransactionFrequency',
            ax = axes[1])
axes[1].set_title('Recency vs Total Score\nAveraged over Transaction Frequency')
sns.countplot(x = 'total_score', data = rmf, ax = axes[2])
axes[2].set_title('Number of Customers in each score range')
plt.tight_layout()

In [None]:
rmf[rmf.total_score == 12].count()

**The average transaction amount is mostly constant across the transaction frequency range, except for a sharp increase in the range of 4 - 5 transactions over the three months, and a sudden sharp decrease for the most frequent (and significantly rarer) 6 transactions over that interval. Since 4 - 6 transactions all give a frequency score of 4, one would expect them to share the same average total score. However, this is clearly not the case: the average total score increases from 4 to 6 transactions.**

**When we look at the recency, we see that the average recency decreases with the average frequency (which shouldn't be shocking), which will compensate for the reduced average transactional amount.**

In [None]:
rmf.groupby(by = 'total_score').describe().T

**Unsurprisingly, due to the low frequency of transactions, most customers have a total score of 5 - 7 out of 12. Scores of 3 and 4 are about as common as scores of 8 and 9, and very few customers score 10 or above. About 1% of customers have a score above 10. Moving on to clustering:**

**Step-by-step explanation:**

1. **Recency Score Calculation**:
   - The code calculates a recency score for each customer based on how recently they made a transaction. Customers who made transactions more recently receive higher scores.

2. **Frequency Score Calculation**:
   - The code assigns a frequency score based on how often each customer transacts. The score is the customer's transaction count, capped at 4, so customers with more frequent transactions receive higher scores.

3. **Monetary Score Calculation**:
   - The code calculates a monetary score based on the average amount spent per transaction by each customer. Customers who spend more receive higher scores.

4. **Quartile-Based Scoring**:
   - The quartiles of the data (25th, 50th, and 75th percentiles) are used as thresholds for assigning scores. Customers are divided into quartiles based on their recency and monetary value, and each quartile is assigned a score ranging from 1 to 4.

5. **Total Score Calculation**:
   - The total score for each customer is computed by adding their recency, frequency, and monetary scores together. This total score provides a comprehensive view of a customer's overall transaction behavior.

6. **Purpose**:
   - RFM analysis helps businesses identify and target different customer segments effectively. For example, high-scoring customers (e.g., high recency, frequency, and monetary value) may be considered VIP customers and targeted with special offers to retain their loyalty. On the other hand, low-scoring customers may receive targeted marketing campaigns to encourage more frequent and higher-value transactions.

In essence, the code implements a methodical approach to segmenting customers based on their transaction patterns, allowing businesses to tailor their marketing strategies and customer engagement efforts for maximum effectiveness.
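
The scoring steps above can be sketched more compactly with `pd.qcut`. This is an alternative illustration, not the notebook's own implementation: the `toy` frame and its values are hypothetical, and `pd.qcut` uses right-closed bins, so a value exactly on a quartile boundary may get a different score than it would from the strict `<` comparisons in the functions above.

```python
import pandas as pd

# Toy RFM table with distinct values so the quartile edges are unique
toy = pd.DataFrame({
    'Recency': [1, 2, 3, 4, 5, 6, 7, 8],
    'TransactionFrequency': [1, 1, 2, 3, 4, 5, 6, 1],
    'AverageTransactionAmount': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Lower recency (more recent) gets the higher score, so the labels are reversed
toy['recency_score'] = pd.qcut(toy['Recency'], q=4, labels=[4, 3, 2, 1]).astype(int)
# Frequency score is the raw transaction count, capped at 4
toy['frequency_score'] = toy['TransactionFrequency'].clip(upper=4)
# Higher average amount gets the higher score
toy['monetary_score'] = pd.qcut(toy['AverageTransactionAmount'], q=4, labels=[1, 2, 3, 4]).astype(int)

toy['total_score'] = toy['recency_score'] + toy['frequency_score'] + toy['monetary_score']
print(toy)
```

On real data with many tied values, `pd.qcut` can raise an error about duplicate bin edges, which is one reason to keep explicit threshold functions.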

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
# Check the data
df.sample()

In [None]:
# Set 'today' variable
today = max(df['TransactionDate'])

**Next step, we need to create a new table that will store the user details specifically for this analysis. The purpose of this new table is to provide a streamlined and structured format for conducting the RFM analysis.**
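
One possible shape for such a table is sketched below on made-up data. The `tx` and `user_table` names, the output column names, and the choice of summing the amounts are all assumptions for illustration; `today` is taken as the latest transaction date, as in the cell above.

```python
import pandas as pd

# Toy transactions (illustrative values only)
tx = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C2'],
    'TransactionDate': pd.to_datetime(['2016-08-01', '2016-08-20', '2016-09-05']),
    'TransactionAmount (INR)': [100.0, 300.0, 50.0],
})
today = tx['TransactionDate'].max()

# One row per customer: last transaction date, transaction count, total spend
user_table = tx.groupby('CustomerID').agg(
    last_transaction=('TransactionDate', 'max'),
    frequency=('TransactionDate', 'count'),
    monetary=('TransactionAmount (INR)', 'sum'),
)
# Recency in whole days between the reference date and the last transaction
user_table['recency_days'] = (today - user_table['last_transaction']).dt.days
print(user_table)
```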