Age brackets for segmenting customers in the credit card or banking sector:
- 18-24: Young adults, often starting their financial journey.
- 25-34: Early career professionals, potentially with more significant financial responsibilities.
- 35-44: Mid-career individuals, often with established careers and possibly higher income.
- 45-54: Pre-retirement age, may have accumulated more wealth or financial assets.
- 55-64: Nearing retirement, often focusing on saving and investment.
- 65 and above: Typically retired, often with different financial needs and behaviors.
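
The brackets listed above put an age of 25 in 25-34, not in 18-24. A minimal sketch of that mapping with `pd.cut`, assuming `right=False` so that each bin includes its lower edge and excludes its upper edge. The sample ages are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Bin edges: each bracket includes its lower edge and excludes its upper edge.
bins = [18, 25, 35, 45, 55, 65, float('inf')]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65 and above']

# Hypothetical ages chosen to sit on the bracket boundaries.
sample_ages = pd.Series([18, 24, 25, 34, 35, 64, 65, 80])

# right=False makes the intervals [18, 25), [25, 35), ... so the labels match.
sample_bins = pd.cut(sample_ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({'AGE': sample_ages, 'Age_Bin': sample_bins}))
```

With the default `right=True`, the same edges would instead place age 25 in the first bin.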

In [86]:
import pandas as pd
import shutil


In [87]:
# Load CSV data
csv_file_path = 'UCI_Credit_Card.csv'
df = pd.read_csv(csv_file_path)

In [88]:
# Show summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   300

In [89]:
# Investigate null values in each column
df.isnull().sum()

ID                            0
LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default.payment.next.month    0
dtype: int64

In [90]:
# Define age bins and labels
bins = [18,25,35,45,55,65, float('inf')]
labels = ['18-24','25-34','35-44','45-54','55-64','65 and above']

# Original payment delay months
payment_delay_months = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

# Corresponding month names
month_names = ['September', 'August', 'July', 'June', 'May', 'April']


# Create a dictionary for renaming the payment columns
rename_dict = dict(zip(payment_delay_months, month_names))

# Rename the columns in the DataFrame
df_renamed = df.rename(columns=rename_dict)

# Payment categories mapping
payment_categories = {
    -2: 'Paid Early',
    -1: 'Paid Duly',
    0: 'No Required Payment',
    1: '1 month',
    2: '2 months',
    3: '3 months',
    4: '4 months',
    5: '5 months',
    6: '6 months',
    7: '7 months',
    8: '8 months',
    9: '9 months or more'
}
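
As a small illustration of how a mapping like this can be applied, the sketch below decodes a few status codes with `Series.map`. The `codes` series is made-up data, and only a subset of the mapping is repeated so the example stays self-contained.

```python
import pandas as pd

# Subset of the payment-category mapping, repeated here for a standalone example.
payment_categories_subset = {
    -1: 'Paid Duly',
    0: 'No Required Payment',
    2: '2 months',
}

# Hypothetical status codes for four customers.
codes = pd.Series([0, -1, 2, 0])

# Series.map looks each code up in the dictionary and returns the label.
decoded = codes.map(payment_categories_subset)
print(decoded)
```

Codes that are missing from the dictionary would come back as NaN, which is one way to spot unexpected values.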



In [91]:
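# Add a new column 'Age_Bin' to the DataFrame
# Note: right=True yields intervals (18, 25], (25, 35], ..., so each group is
# shifted one year up relative to its label (e.g. age 25 is counted in '18-24')
df_renamed['Age_Bin'] = pd.cut(df_renamed['AGE'], bins=bins, labels=labels, right=True)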

In [92]:
# Initialize an empty DataFrame to hold the final data
final_df = pd.DataFrame()

Analyzing payment categories by month and age group

In [93]:
# Iterate over each month and payment category, count occurrences and add to the final DataFrame
for month in month_names:
    for value, category in payment_categories.items():
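        # Count occurrences for each category in the current month.
        # observed=False keeps every age bin in the result, even one with no matching rows.
        count_df = (
            df_renamed[df_renamed[month] == value]
            .groupby('Age_Bin', observed=False)
            .size()
            .reset_index(name=f'{month}_{category}')
        )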
        
        # Merge with the final DataFrame
        if final_df.empty:
            final_df = count_df
        else:
            final_df = final_df.merge(count_df, on='Age_Bin', how='outer')

final_df.head()

Unnamed: 0,Age_Bin,September_Paid Early,September_Paid Duly,September_No Required Payment,September_1 month,September_2 months,September_3 months,September_4 months,September_5 months,September_6 months,...,April_No Required Payment,April_1 month,April_2 months,April_3 months,April_4 months,April_5 months,April_6 months,April_7 months,April_8 months,April_9 months or more
0,18-24,147,510,2186,475,466,59,21,3,1,...,2415,0,459,45,14,3,5,7,0,0
1,25-34,1209,2449,6483,1616,1013,125,21,9,6,...,7051,0,1168,59,19,5,8,14,1,0
2,35-44,934,1853,3864,1020,726,77,22,11,0,...,4358,0,706,55,4,1,4,13,0,0
3,45-54,386,728,1822,460,369,45,8,2,3,...,2015,0,353,16,7,3,1,11,1,0
4,55-64,74,133,340,108,80,16,4,1,1,...,399,0,71,9,4,1,0,1,0,0
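
The loop above builds one column per (month, category) pair through repeated merges. As an alternative sketch, `pd.crosstab` can produce the same kind of count table for a single month in one call. The small `toy` DataFrame below is made-up data, not the UCI file, and its name is hypothetical.

```python
import pandas as pd

# Made-up rows: one age group and one September status code per customer.
toy = pd.DataFrame({
    'Age_Bin': ['18-24', '18-24', '25-34', '25-34', '25-34', '35-44'],
    'September': [-1, 0, 0, 2, -1, 0],
})

# Rows are age groups, columns are the raw payment-status codes;
# combinations that never occur are filled with 0.
september_counts = pd.crosstab(toy['Age_Bin'], toy['September'])
print(september_counts)
```

The columns here are the raw codes; they could be renamed with the category mapping afterwards if readable labels are needed.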


In [94]:
# Define the path to save the CSV file
csv_file_path = 'final_df.csv'

# Save the DataFrame to CSV
final_df.to_csv(csv_file_path, index=True)

In [95]:
# Define the path to save the JSON file
json_file_path = 'final_df.json'

# Convert the DataFrame to JSON
final_df.to_json(json_file_path, orient='records')

# Load and check the JSON file
with open(json_file_path, 'r') as f:
    json_data = f.read()

print(json_data[:500])

[{"Age_Bin":"18-24","September_Paid Early":147,"September_Paid Duly":510,"September_No Required Payment":2186,"September_1 month":475,"September_2 months":466,"September_3 months":59,"September_4 months":21,"September_5 months":3,"September_6 months":1,"September_7 months":0,"September_8 months":3,"September_9 months or more":0,"August_Paid Early":225,"August_Paid Duly":524,"August_No Required Payment":2382,"August_1 month":3,"August_2 months":636,"August_3 months":73,"August_4 months":19,"Augus
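
To illustrate what `orient='records'` produces, the sketch below round-trips a tiny made-up table through a JSON string and back. The `toy` frame and its values are hypothetical. `to_json` is called without a path here so that it returns a string instead of writing a file.

```python
import json

import pandas as pd

# Made-up two-row table standing in for the real count table.
toy = pd.DataFrame({
    'Age_Bin': ['18-24', '25-34'],
    'Paid_Duly': [5, 9],
})

# Without a path argument, to_json returns the JSON text.
json_text = toy.to_json(orient='records')

# orient='records' yields a list with one dictionary per row.
records = json.loads(json_text)
print(records)

# The list of dictionaries can be turned back into a DataFrame.
restored = pd.DataFrame(records)
print(restored)
```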


Creating a multi-index DataFrame

In [96]:
# Convert 'Age_Bin' to a string type if it is categorical
final_df['Age_Bin'] = final_df['Age_Bin'].astype(str)

In [97]:
# Fill NaN values with 0 (since we're counting occurrences)
final_df = final_df.fillna(0)

In [98]:
# Set 'Age_Bin' as the index
final_df = final_df.set_index('Age_Bin')

In [99]:
# Create MultiIndex for the columns
multi_index_columns = pd.MultiIndex.from_product([month_names, list(payment_categories.values())], names=['Month', 'Category'])


In [100]:
# Rename columns with MultiIndex
final_df.columns = multi_index_columns
final_df

Month,September,September,September,September,September,September,September,September,September,September,...,April,April,April,April,April,April,April,April,April,April
Category,Paid Early,Paid Duly,No Required Payment,1 month,2 months,3 months,4 months,5 months,6 months,7 months,...,No Required Payment,1 month,2 months,3 months,4 months,5 months,6 months,7 months,8 months,9 months or more
Age_Bin
18-24,147,510,2186,475,466,59,21,3,1,0,...,2415,0,459,45,14,3,5,7,0,0
25-34,1209,2449,6483,1616,1013,125,21,9,6,2,...,7051,0,1168,59,19,5,8,14,1,0
35-44,934,1853,3864,1020,726,77,22,11,0,6,...,4358,0,706,55,4,1,4,13,0,0
45-54,386,728,1822,460,369,45,8,2,3,1,...,2015,0,353,16,7,3,1,11,1,0
55-64,74,133,340,108,80,16,4,1,1,0,...,399,0,71,9,4,1,0,1,0,0
65 and above,9,13,42,9,13,0,0,0,0,0,...,48,0,9,0,1,0,1,0,0,0
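
One reason to use MultiIndex columns is that a whole month or a whole category can be selected in a single step. A minimal sketch on a made-up two-month, two-category table follows; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

months = ['September', 'August']
categories = ['Paid Duly', '1 month']
columns = pd.MultiIndex.from_product([months, categories], names=['Month', 'Category'])

# Made-up counts for two age groups.
toy = pd.DataFrame(
    [[5, 1, 6, 0], [9, 2, 8, 3]],
    index=pd.Index(['18-24', '25-34'], name='Age_Bin'),
    columns=columns,
)

# All categories for one month (selects on the outer level).
september = toy['September']

# One category across all months (selects on the inner level).
paid_duly = toy.xs('Paid Duly', axis=1, level='Category')

# A single cell, addressed by row label and a (month, category) tuple.
single = toy.loc['25-34', ('August', '1 month')]

print(september)
print(paid_duly)
print(single)
```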


Code verification

In [101]:
 
# Load CSV data
csv_file_path = 'UCI_Credit_Card.csv'
df = pd.read_csv(csv_file_path)

In [102]:
# Define age bins and labels
bins = [18,25,35,45,55,64, float('inf')]
labels = ['18-24','25-34','35-44','45-54','55-64','65 and above']
df['Age_Bin'] = pd.cut(df['AGE'], bins=bins, labels=labels, right=True)

# Filter for the specific Age_Bin '35-44'
age_bin_filtered_df = df[df['Age_Bin'] == '35-44']

# Count occurrences of the value -2 in 'PAY_0'
count_value_neg2 = (age_bin_filtered_df['PAY_0'] == -2).sum()

print("Count of value -2 in 'PAY_0' for 'Age_Bin' 35-44:", count_value_neg2)

Count of value -2 in 'PAY_0' for 'Age_Bin' 35-44: 990


Credit Utilization Ratio (Outstanding Balance / Credit Limit × 100%)

In [121]:
# Load CSV data
csv_file_path = 'UCI_Credit_Card.csv'
df = pd.read_csv(csv_file_path)

In [122]:
# Define the column for September's outstanding balance and the credit limit column
sept_outstanding_bal = 'BILL_AMT1'
credit_limit_column = 'LIMIT_BAL'

In [124]:
# Calculate the credit utilization ratio for September and round to 2 decimal places
df['CUR_Sept(%)'] = (df[sept_outstanding_bal] / df[credit_limit_column] * 100).round(2)
print(df[['ID', sept_outstanding_bal, credit_limit_column, 'CUR_Sept(%)']].head())

   ID  BILL_AMT1  LIMIT_BAL  CUR_Sept(%)
0   1     3913.0    20000.0        19.56
1   2     2682.0   120000.0         2.24
2   3    29239.0    90000.0        32.49
3   4    46990.0    50000.0        93.98
4   5     8617.0    50000.0        17.23
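
The ratio computed above is balance divided by limit, times 100. As a sketch of the same arithmetic for spot-checking individual rows, the hypothetical helper `credit_utilization` below is not part of the notebook; it uses two rows from the printed output as inputs and assumes the credit limit is positive.

```python
def credit_utilization(balance: float, limit: float) -> float:
    """Return balance / limit * 100, rounded to 2 decimal places."""
    if limit <= 0:
        # Assumption: a non-positive limit is treated as invalid input.
        raise ValueError("credit limit must be positive")
    return round(balance / limit * 100, 2)


# Rows with ID 3 and ID 5 from the output above.
print(credit_utilization(29239.0, 90000.0))
print(credit_utilization(8617.0, 50000.0))
```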


In [132]:
low_CUR_df = df[df['CUR_Sept(%)'] < 30]
low_CUR_df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month,CUR_Sept(%),Age_Bin
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1,19.56,18-24
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1,2.24,25-34
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0,17.23,55-64
7,8,100000.0,2,2,2,23,0,-1,-1,0,...,567.0,380.0,601.0,0.0,581.0,1687.0,1542.0,0,11.88,18-24
8,9,140000.0,2,3,1,28,0,0,2,0,...,3719.0,3329.0,0.0,432.0,1000.0,1000.0,1000.0,0,8.06,25-34


In [133]:
# Define age bins and labels
bins = [18,25,35,45,55,64, float('inf')]
labels = ['18-24','25-34','35-44','45-54','55-64','65 and above']

In [134]:
# Assign age bins to a new column 'Age_Bin'
df['Age_Bin'] = pd.cut(df['AGE'], bins=bins, labels=labels, right=True)

In [142]:
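# Re-filter so the subset includes the 'Age_Bin' column added above
low_CUR_df = df[df['CUR_Sept(%)'] < 30]

# Calculate the mean CUR among customers below 30%, for each age group
age_group_low_CUR = low_CUR_df.groupby('Age_Bin', observed=False)['CUR_Sept(%)'].mean().round(2)
age_group_low_CUR.head()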


Age_Bin
18-24    8.72
25-34    6.31
35-44    5.48
45-54    5.45
55-64    4.86
Name: CUR_Sept(%), dtype: float64

In [131]:
# Calculate the mean of CUR for each age group
age_group_CUR = df.groupby('Age_Bin', observed=False)['CUR_Sept(%)'].mean().round(2)

# Display the results
print(age_group_CUR)


Age_Bin
18-24           54.96
25-34           39.48
35-44           39.36
45-54           45.21
55-64           46.38
65 and above    48.55
Name: CUR_Sept(%), dtype: float64



In [140]:
# Keep customers with CUR at or below 10% (treated here as the threshold for a good credit history)
good_CUR_df = df[df['CUR_Sept(%)'] <= 10]
good_CUR_df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month,CUR_Sept(%),Age_Bin
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1,2.24,25-34
8,9,140000.0,2,3,1,28,0,0,2,0,...,3719.0,3329.0,0.0,432.0,1000.0,1000.0,1000.0,0,8.06,25-34
9,10,20000.0,1,3,2,35,-2,-2,-2,-2,...,13912.0,0.0,0.0,0.0,13007.0,1122.0,0.0,0,0.0,25-34
10,11,200000.0,2,3,2,34,0,0,2,0,...,3731.0,2306.0,12.0,50.0,300.0,3738.0,66.0,0,5.54,25-34
11,12,260000.0,2,1,2,51,-1,-1,-1,-1,...,13668.0,21818.0,9966.0,8583.0,22301.0,0.0,3640.0,0,4.72,45-54


In [144]:
# Calculate the mean CUR among customers at or below 10%, for each age group
age_group_CUR = good_CUR_df.groupby('Age_Bin', observed=False)['CUR_Sept(%)'].mean().round(2)
age_group_CUR


Age_Bin
18-24           2.52
25-34           2.14
35-44           1.97
45-54           1.85
55-64           1.65
65 and above    0.77
Name: CUR_Sept(%), dtype: float64
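
The mean CUR of an already-filtered subset is below the threshold by construction, so a complementary view is the share of each age group that falls at or below it. A minimal sketch on made-up data follows; `toy` and its values are hypothetical, and the 10% cut-off mirrors the one used above.

```python
import pandas as pd

# Made-up utilization ratios for two age groups.
toy = pd.DataFrame({
    'Age_Bin': ['18-24', '18-24', '25-34', '25-34', '25-34', '25-34'],
    'CUR_Sept(%)': [5.0, 60.0, 8.0, 9.5, 3.0, 12.0],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of customers at or below 10% per group.
share_low = (
    toy['CUR_Sept(%)'].le(10)
    .groupby(toy['Age_Bin'])
    .mean()
    .mul(100)
    .round(2)
)
print(share_low)
```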
