# Credit score dataset data analysis

EDA project aimed at exploring creditworthiness data, finding potential relationships and recognizing the structure of the data. Includes numerical, univariate and multivariate analysis.

## Data cleaning

This section is purely technical. Every change necessary to make the dataset readable and ready to work with is implemented in the 'cleaner.py' file and in the cells below. The biggest part of data cleaning was resolving mixed data types within columns, which were mostly the result of mistakes and typos. There was also an issue with entries that varied even though they concerned the same person over a short period of time, and some entries were virtually impossible.
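The contents of 'cleaner.py' are not shown here, so the following is only a minimal sketch of how a mixed-type column could be coerced to numbers. The helper name `coerce_numeric` and the sample values are hypothetical, and the real cleaning code may work differently.

```python
import pandas as pd

def coerce_numeric(series: pd.Series) -> pd.Series:
    """Strip stray characters and convert to numbers; unparseable entries become missing."""
    cleaned = (
        series.astype(str)
        # keep only digits, the decimal point and the minus sign (drops typos such as a trailing "_")
        .str.replace(r"[^0-9.\-]", "", regex=True)
    )
    # errors="coerce" turns anything that still cannot be parsed into a missing value
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical raw column mixing clean numbers, a typo, text and a missing entry
raw = pd.Series(["23", "28_", "-500", "abc", None])
result = coerce_numeric(raw)
print(result)
```

Note that coercion only fixes the type: a value such as -500 parses correctly and still has to be handled by a separate plausibility filter.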


In [1]:
import cleaner
import pandas as pd
import os
import matplotlib
from utils.dataset_download import download_dataset
import plotly.express as px
import plotly.graph_objects as go 
from utils.cleaning_tools import *
from plotly.subplots import make_subplots

In [2]:
path_to_training = os.path.join('data', 'train.csv')
path_to_test = os.path.join('data', 'test.csv')

df_training = pd.read_csv(path_to_training)
df_test = pd.read_csv(path_to_test)

df = pd.concat([df_training, df_test], axis=0)

df = cleaner.cleaning(df)

age_trimmed = df.query('0 < Age < 90')
# .copy() so that adding 'Annual_Category' below does not raise SettingWithCopyWarning
annual_trimmed = df.query('Annual_Income < 200000').copy()
nba_trimmed = df.query('0 <= Num_Bank_Accounts <= 10')
ncc_trimmed = df.query('0 <= Num_Credit_Card <= 10')
ir_trimmed = df.query('Interest_Rate <= 100')
nol_trimmed = df.query('0 <= Num_of_Loan <= 20')
nodp_trimmed = df.query('0 <= Num_of_Delayed_Payment <= 30')
nci_trimmed = df.query('Num_Credit_Inquiries <= 20')
emi_trimmed = df.query('0 <= Total_EMI_per_month <= 1100')
mb_trimmed = df.query('-1000000 <= Monthly_Balance <= 1000000000')

trimmed_columns ={'Age':age_trimmed, 'Annual_Income':annual_trimmed, 'Num_Bank_Accounts':nba_trimmed, 'Num_Credit_Card':ncc_trimmed, 
                'Interest_Rate':ir_trimmed, 'Num_of_Loan':nol_trimmed, 'Num_of_Delayed_Payment':nodp_trimmed, 'Num_Credit_Inquiries':nci_trimmed, 
                'Total_EMI_per_month':emi_trimmed, 'Monthly_Balance':mb_trimmed}

outlier_null_count = {}
for key in trimmed_columns:
    count = len(df[key]) - len(trimmed_columns[key])
    outlier_null_count[key] = count

columns_to_normalize = ['Age' , 'Annual_Income' , 'Changed_Credit_Limit', 'Monthly_Balance', 
    'Num_of_Delayed_Payment', 'Amount_invested_monthly', 'Num_of_Loan',
    'Outstanding_Debt', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card',
    'Interest_Rate', 'Delay_from_due_date', 'Num_Credit_Inquiries', 'Credit_Utilization_Ratio',
    'Credit_History_Age', 'Total_EMI_per_month']

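bins = [0, 25000, 50000, 75000, 100000, 150000, float('inf')]
# with right=False each bin is [lower, upper), so the last bin holds incomes of 150.000 and above
labels = ['<25.000', '25.000-50.000', '50.000-75.000', '75.000-100.000', '100.000-150.000', '>=150.000']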


annual_trimmed['Annual_Category'] = pd.cut(annual_trimmed['Annual_Income'], bins=bins, labels=labels, right=False)


annual_counts = annual_trimmed['Annual_Category'].value_counts()

annual_percentages = annual_counts / annual_counts.sum() * 100


# Per-occupation aggregates; .tolist() returns the values in index order
earnings = df.groupby('Occupation')['Monthly_Inhand_Salary'].median().dropna()
earnings_values = earnings.tolist()

accounts = nba_trimmed.groupby('Occupation')['Num_Bank_Accounts'].median().dropna()
accounts_values = accounts.tolist()

cards = ncc_trimmed.groupby('Occupation')['Num_Credit_Card'].median().dropna()
cards_values = cards.tolist()

loans = nol_trimmed.groupby('Occupation')['Num_of_Loan'].median().dropna()
loans_values = loans.tolist()

age = age_trimmed.groupby('Occupation')['Age'].mean().dropna()
age_values = age.tolist()

# Per-age aggregates
earnings_age = age_trimmed.groupby('Age')['Monthly_Inhand_Salary'].mean().dropna()
earnings_age_values = earnings_age.tolist()

accounts_age = age_trimmed.groupby('Age')['Num_Bank_Accounts'].median().dropna()
accounts_age_values = accounts_age.tolist()

cards_age = age_trimmed.groupby('Age')['Num_Credit_Card'].median().dropna()
cards_age_values = cards_age.tolist()

loans_age = age_trimmed.groupby('Age')['Num_of_Loan'].median().dropna()
loans_age_values = loans_age.tolist()



## Numerical analysis

In this section we will focus on the numerical side of data: the number of entries, the structure of the data, and their characteristics.

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 150000 entries, 0 to 49999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        150000 non-null  object 
 1   Customer_ID               150000 non-null  object 
 2   Month                     150000 non-null  object 
 3   Name                      135000 non-null  object 
 4   Age                       150000 non-null  int64  
 5   SSN                       150000 non-null  object 
 6   Occupation                150000 non-null  object 
 7   Annual_Income             150000 non-null  float64
 8   Monthly_Inhand_Salary     127500 non-null  float64
 9   Num_Bank_Accounts         150000 non-null  int64  
 10  Num_Credit_Card           150000 non-null  int64  
 11  Interest_Rate             150000 non-null  int64  
 12  Num_of_Loan               150000 non-null  int64  
 13  Type_of_Loan              150000 non-null  object 

Our dataset has 28 columns and a total of 150,000 entries, some of which contain null values (about 1.8% of all cells).
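A null share like the one quoted above can be computed directly from the frame. The sketch below uses a small hypothetical frame `toy` instead of the real `df`, and shows both the per-column share and the share over all cells.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: 4 rows, 3 columns, 2 missing cells
toy = pd.DataFrame({
    "Name": ["Ann", None, "Cid", "Dee"],
    "Monthly_Inhand_Salary": [1800.0, np.nan, 2500.0, 3100.0],
    "Age": [23, 31, 45, 52],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values
null_share_per_column = toy.isna().mean() * 100
overall_null_share = toy.isna().to_numpy().mean() * 100

print(null_share_per_column)
print(round(overall_null_share, 2))
```

The per-column view is usually the more useful of the two, because nulls tend to concentrate in a few columns.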

There are both string and numerical columns:

- Numerical value columns:
  - Age
  - Annual income
  - Monthly inhand salary
  - Number of bank accounts
  - Number of credit cards
  - Interest rate
  - Number of taken loans
  - Delay from due date
  - Number of delayed payments
  - Changed credit limit
  - Number of credit inquiries
  - Outstanding debt
  - Credit utilization ratio
  - Credit history age
  - Total EMI per month
  - Amount invested monthly
  - Monthly balance

- String value columns:
  - ID
  - Customer_ID
  - Month
  - Name
  - Social Security Number
  - Occupation
  - Type of loan
  - Credit mix
  - Payment of minimum amount
  - Payment behaviour
  - Credit score


In [4]:
df.describe()

Unnamed: 0,Age,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance
count,150000.0,150000.0,127500.0,150000.0,150000.0,150000.0,150000.0,150000.0,139500.0,146850.0,147000.0,150000.0,150000.0,136500.0,150000.0,143250.0,148238.0
mean,110.33794,173055.2,4190.115139,17.00694,22.623447,71.234907,3.141093,21.0634,30.911878,10.384299,28.529014,1426.220376,32.283309,18.601277,1432.513579,638.826309,-3.372954e+22
std,684.066779,1404215.0,3180.489657,117.069476,129.143006,461.537193,63.910655,14.860154,224.534007,6.786522,194.456058,1155.127101,5.113315,8.309962,8403.759977,2046.843019,3.352927e+24
min,-500.0,7005.93,303.645417,-1.0,0.0,1.0,-100.0,-5.0,-3.0,-6.49,0.0,0.23,20.0,0.083333,0.0,0.0,-3.333333e+26
25%,25.0,19455.49,1625.265833,3.0,4.0,8.0,1.0,10.0,9.0,5.33,3.0,566.0725,28.054731,12.166667,30.947775,74.533842,270.2297
50%,33.0,37578.61,3091.0,6.0,5.0,13.0,3.0,18.0,14.0,9.41,6.0,1166.155,32.297058,18.5,71.280006,135.791445,336.7995
75%,42.0,72796.9,5948.454596,7.0,7.0,20.0,5.0,28.0,18.0,14.84,9.0,1945.9625,36.487954,25.333333,166.279555,266.110841,470.4553
max,8698.0,24198060.0,15204.633333,1798.0,1499.0,5799.0,1496.0,67.0,4399.0,36.97,2597.0,4998.07,50.0,34.0,82398.0,10000.0,1606.518


As the table above shows, the numerical columns are very unclean: they contain many outliers, some of them impossible (for example, negative ages, or a minimum of about -3.3e26 in the 'Monthly_Balance' column). In the further analysis we only consider values that are at least plausible.

Here is the list of all filters used to remove implausible entries:
- Age - greater than 0 and less than 90
- Annual income - below 200000
- Number of bank accounts - between 0 and 10
- Number of credit cards - between 0 and 10
- Interest rate - at most 100
- Number of loans - between 0 and 20
- Number of delayed payments - between 0 and 30
- Number of credit inquiries - at most 20
- Total EMI per month - between 0 and 1100
- Monthly balance - between -1000000 and 1000000000
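The filters listed above can also be expressed as a table of bounds applied by one helper. This is a sketch only: `BOUNDS`, `trim` and the frame `toy` are hypothetical names, and only three of the filters are reproduced. Because comparisons with missing values are false, rows with nulls are dropped together with the outliers, which is why the resulting counts cover "outliers + nulls".

```python
import pandas as pd

# Hypothetical bounds table: column -> (lower, upper); None means "no bound on that side"
BOUNDS = {
    "Num_Bank_Accounts": (0, 10),
    "Interest_Rate": (None, 100),
    "Num_of_Loan": (0, 20),
}

def trim(frame, column, lower=None, upper=None):
    """Return the rows whose value in `column` is present and within the inclusive bounds."""
    values = frame[column]
    keep = values.notna()
    if lower is not None:
        keep &= values >= lower
    if upper is not None:
        keep &= values <= upper
    return frame[keep].copy()

# Hypothetical data containing outliers and one missing value
toy = pd.DataFrame({
    "Num_Bank_Accounts": [3, -1, 1798, 6],
    "Interest_Rate": [8, 13, 5799, 20],
    "Num_of_Loan": [1, 3, None, -100],
})

trimmed = {col: trim(toy, col, lo, hi) for col, (lo, hi) in BOUNDS.items()}
outlier_null_count = {col: len(toy) - len(trimmed[col]) for col in BOUNDS}
print(outlier_null_count)
```

Keeping the bounds in one place makes it harder for the prose list and the code to drift apart.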


## One-dimensional analysis

This section provides summary charts of individual columns, which give key context for the further analysis.

In [5]:
barchart_overview = make_subplots(
    rows=3, cols=6,
   specs=[[{"colspan": 3}, None, None, {"colspan": 3}, None, None],
            [None, {"type": "pie", "colspan": 1}, None, None, {"type": "pie", "colspan": 1}, None],
            [{"colspan": 2}, None, {"colspan": 2}, None, {"colspan": 2}, None]],
    subplot_titles=("Percentage of representatives of a given age", 
                      "Percentage of representatives of occupation", 
                      "Percentage distribution of people with annual income in the given range (USD)",
                      "Percentage distribution of people with given credit score",
                      "Percentage of people with given number of bank accounts",
                      "Percentage of people with given number of credit cards",
                      "Number of entries in a given month",
                      )
)

barchart_overview.add_trace(go.Histogram(
    x = age_trimmed['Age'],
    histnorm='percent',
    name='Age',

), row=1,col=1) 

barchart_overview.add_annotation(
    x=56.5, y=4,
    text=f"Outliers + nulls: {outlier_null_count['Age']} ({round(outlier_null_count['Age']/len(df['Age'])*100, 2)}%)",
    showarrow=False,
    bgcolor="rgba(255, 255, 255, 0.7)",
    borderpad=5,
    bordercolor='black',
    borderwidth=1,
    align='center',
    xanchor='right',
    yanchor='top',
    font=dict(size=12, color='black'),
    row=1, col=1
)

barchart_overview.add_trace(go.Histogram(
    x=sorted(df['Occupation']),
    histnorm='percent',
    name='Occupation',
), row=1, col=4)

barchart_overview.add_annotation(
    x=15.5, y=10,
    text=f"Nulls: {df['Occupation'].isna().sum()} ({round(df['Occupation'].isna().sum()/len(df['Occupation'])*100, 2)}%)",
    showarrow=False,
    bgcolor="rgba(255, 255, 255, 0.7)",
    borderpad=5,
    bordercolor='black',
    borderwidth=1,
    align='center',
    xanchor='right',
    yanchor='top',
    font=dict(size=12, color='black'),
    row=1, col=4
)

barchart_overview.add_trace(
    go.Pie(
            labels=annual_percentages.index, 
            values=annual_percentages, 
            textinfo='label+percent',
            textposition='outside',
            insidetextorientation='radial'
            ), row=2, col=2
)

barchart_overview.add_annotation(
    x=0.95, y=0.58,
    text=f"Nulls: {df['Annual_Income'].isna().sum()} ({round(df['Annual_Income'].isna().sum()/len(df['Annual_Income'])*100, 2)}%)",
    showarrow=False,
    bgcolor="rgba(255, 255, 255, 0.7)",
    borderpad=5,
    bordercolor='black',
    borderwidth=1,
    align='center',
    xref='paper',
    yref='paper',
    font=dict(size=12, color='black'),
)

barchart_overview.add_annotation(
    x=0.43, y=0.58,
    text=f"Nulls + outliers: {outlier_null_count['Annual_Income']} ({round(outlier_null_count['Annual_Income']/len(df['Annual_Income'])*100, 2)}%)",
    showarrow=False,
    bgcolor="rgba(255, 255, 255, 0.7)",
    borderpad=5,
    bordercolor='black',
    borderwidth=1,
    align='center',
    xref='paper',
    yref='paper',
    font=dict(size=12, color='black'),
)

barchart_overview.add_trace(go.Pie(
            labels=(df['Credit_Score'].value_counts(normalize=True) * 100).index,
            values=(df['Credit_Score'].value_counts(normalize=True) * 100).values,
            textinfo='percent+label',
            textposition="outside"
        ), row=2,col=5

)

barchart_overview.add_trace(go.Histogram(
    x = nba_trimmed['Num_Bank_Accounts'],
    histnorm='percent',
    name='Num_Bank_Accounts',

), row=3,col=1) 

barchart_overview.add_annotation(
    x=10.5, y=18,
    text=f"Outliers + nulls: {outlier_null_count['Num_Bank_Accounts']} ({round(outlier_null_count['Num_Bank_Accounts']/len(df['Num_Bank_Accounts'])*100, 2)}%)",
    showarrow=False,
    bgcolor="rgba(255, 255, 255, 0.7)",
    borderpad=5,
    bordercolor='black',
    borderwidth=1,
    align='center',
    xanchor='right',
    yanchor='top',
    font=dict(size=12, color='black'),
    row=3, col=1
)

barchart_overview.add_trace(go.Histogram(
    x=ncc_trimmed['Num_Credit_Card'],
    histnorm='percent',
    name='Num_Credit_Card',
), row=3, col=3
)

barchart_overview.add_annotation(
    x=10.5, y=24,
    text=f"Outliers + nulls: {outlier_null_count['Num_Credit_Card']} ({round(outlier_null_count['Num_Credit_Card']/len(df['Num_Credit_Card'])*100, 2)}%)",
    showarrow=False,
    bgcolor="rgba(255, 255, 255, 0.7)",
    borderpad=5,
    bordercolor='black',
    borderwidth=1,
    align='center',
    xanchor='right',
    yanchor='top',
    font=dict(size=12, color='black'),
    row=3, col=3
)

barchart_overview.add_trace(go.Histogram(
    x=df['Month'],
    name='Month',
)
    , row=3, col=5)

barchart_overview.add_annotation(
    x=11.5, y=18000,
    text=f"Nulls: {df['Month'].isna().sum()} ({round(df['Month'].isna().sum()/len(df['Month'])*100, 2)}%)",
    showarrow=False,
    bgcolor="rgba(255, 255, 255, 0.7)",
    borderpad=5,
    bordercolor='black',
    borderwidth=1,
    align='center',
    xanchor='right',
    yanchor='top',
    font=dict(size=12, color='black'),
    row=3, col=5)

barchart_overview.update_xaxes(title_text="Age", row=1, col=1)
barchart_overview.update_xaxes(title_text="Occupation", row=1, col=4)

barchart_overview.update_xaxes(title_text="Number of bank accounts", row=3, col=1)
barchart_overview.update_xaxes(title_text="Number of credit cards", row=3, col=3)
barchart_overview.update_xaxes(title_text="Month", row=3, col=5)



barchart_overview.update_layout(
    title_text='One dimensional overview',
    title_font=dict(size=30),
    xaxis1 = dict(tickmode = 'linear', tick0 = 0, dtick = 5,),
    xaxis3 = dict(tickmode = 'linear', tick0 = 0, dtick = 1,),
    xaxis4 = dict(tickmode = 'linear', tick0 = 0, dtick = 1,),
    showlegend=False,
    height=1100,
    margin=dict(l=40, r=40, t=70, b=20), 
    yaxis1_ticksuffix = '%',
    yaxis2_ticksuffix = '%',
    yaxis3_ticksuffix = '%',
    yaxis4_ticksuffix = '%',
    bargap=0.2, bargroupgap=0.1,
)

The population is evenly distributed across most ages, with a slight drop in representation as the age increases past 45.

A majority of individuals (60.8%) earn less than $50,000, with very few earning above $150,000. This indicates a population skewed towards lower to middle-income groups.
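The income shares come from binning a continuous column into labelled ranges. The sketch below illustrates the mechanics with `pd.cut` on hypothetical incomes; the label strings are shortened for the example. With `right=False` every bin includes its lower edge and excludes its upper edge, so an income of exactly 25,000 falls into the second bin.

```python
import pandas as pd

# Hypothetical incomes, including values that sit exactly on bin edges
income = pd.Series([12_000, 25_000, 49_999.99, 80_000, 150_000, 199_000])

bins = [0, 25_000, 50_000, 75_000, 100_000, 150_000, float("inf")]
labels = ["<25k", "25k-50k", "50k-75k", "75k-100k", "100k-150k", ">=150k"]

# right=False -> intervals are [lower, upper)
category = pd.cut(income, bins=bins, labels=labels, right=False)

# sort=False keeps the bins in their natural order; empty bins show up with a share of 0
share = category.value_counts(normalize=True, sort=False) * 100
print(share.round(1))
```

Checking a few edge values this way is a cheap guard against labels that do not match the interval they describe.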

The majority of individuals fall into the "Standard" credit score category (53.2%), with "Poor" scores representing 29% and "Good" scores at 17.8%. This indicates that a large portion of the population may encounter difficulties in obtaining credit due to lower credit scores.

People typically have 2–6 bank accounts, with very few having 0 or 10 accounts. The data might indicate standard banking behavior, as multiple accounts are common for savings, checking, or business purposes.

Most people own 2–6 credit cards, with a similar pattern to bank accounts. The presence of outliers (e.g., people with 0 or 10 cards) might represent financially inexperienced individuals or those with high financial exposure.

The number of entries is consistent across months, with no noticeable seasonal trend. This may suggest that the dataset was generated.
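How flat the monthly counts are can be quantified rather than read off the chart. The following sketch, on hypothetical month labels, measures the largest relative deviation of any month from a perfectly uniform split. What counts as "suspiciously flat" is a judgment call and is not taken from the analysis above.

```python
import pandas as pd

# Hypothetical month column: four months with almost identical counts
months = pd.Series(
    ["January"] * 100 + ["February"] * 100 + ["March"] * 98 + ["April"] * 102
)

counts = months.value_counts()
expected = len(months) / counts.size  # count per month under a perfectly uniform split

# Largest relative deviation from the uniform expectation
max_rel_dev = ((counts - expected).abs() / expected).max()
print(f"{max_rel_dev:.2%}")
```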

## Multi-dimensional analysis

Bivariate analysis examines the relationship between two variables, allowing the identification of patterns, trends, and potential correlations.

In [6]:
barchart_overview = make_subplots(
    rows=5, cols=6,
   specs=[[{"colspan": 6}, None, None, None, None, None],
            [{"colspan": 3}, None, None, {"colspan": 3}, None, None],
            [{"colspan": 2}, None, {"colspan": 4}, None, None, None],
            [{"colspan": 6}, None, None, None, None, None],
            [{"colspan": 2}, None, {"colspan": 2}, None, {"colspan": 2}, None]],
    subplot_titles=("Median monthly earnings for every occupation (USD)",
                      "Median of bank accounts for every occupation",
                      "Median of credit cards for every occupation",
                      "Median of loans taken out for every occupation",
                      "Mean age for every occupation",
                      "Average earnings for every age (USD)",
                      "Median of bank accounts for every age",
                      "Median of credit cards for every age",
                      "Median of loans taken out for every age",
                      )
)

barchart_overview.add_trace(go.Bar(
    x=earnings.keys(),
    y=earnings_values
), row=1, col=1)

barchart_overview.add_trace(go.Bar(
    x=accounts.keys(),
    y=accounts_values
), row=2, col=1)

barchart_overview.add_trace(go.Bar(
    x=cards.keys(),
    y=cards_values
), row=2, col=4)

barchart_overview.add_trace(go.Bar(
    x=loans.keys(),
    y=loans_values
), row=3, col=1)

barchart_overview.add_trace(go.Bar(
    x=age.keys(),
    y=age_values
), row=3, col=3)

barchart_overview.add_trace(go.Bar(
    x=earnings_age.keys(),
    y=earnings_age_values
), row=4, col=1)

barchart_overview.add_trace(go.Bar(
    x=accounts_age.keys(),
    y=accounts_age_values
), row=5, col=1)

barchart_overview.add_trace(go.Bar(
    x=cards_age.keys(),
    y=cards_age_values
), row=5, col=3)

barchart_overview.add_trace(go.Bar(
    x=loans_age.keys(),
    y=loans_age_values
), row=5, col=5)


#barchart_overview.update_xaxes(title_text="Occupation", row=2, col=1)
barchart_overview.update_yaxes(title_text="Bank accounts", row=2, col=1)

barchart_overview.update_yaxes(title_text="Credit cards", row=2, col=4)

barchart_overview.update_yaxes(title_text="Loans taken", row=3, col=1)

barchart_overview.update_yaxes(title_text="Age", row=3, col=3)

barchart_overview.update_yaxes(title_text="Monthly salary", row=4, col=1)
barchart_overview.update_xaxes(title_text="Age", row=4, col=1)

barchart_overview.update_layout(
    title_text='Two dimensional overview',
    title_font=dict(size=30),
    xaxis6 = dict(tickmode = 'linear', tick0 = 0, dtick = 1,),
    xaxis7 = dict(tickmode = 'linear', tick0 = 0, dtick = 5,),
    xaxis8 = dict(tickmode = 'linear', tick0 = 0, dtick = 5,),
    xaxis9 = dict(tickmode = 'linear', tick0 = 0, dtick = 5,),
    showlegend=False,
    height=1000,
    margin=dict(l=80, r=80, t=70, b=60), 
    bargap=0.2, bargroupgap=0.1,
    yaxis=dict(title="Monthly salary", title_standoff=5),
    yaxis2=dict(title="Bank accounts", title_standoff=5),
    yaxis3=dict(title="Credit cards", title_standoff=5),
    yaxis4=dict(title="Loans taken", title_standoff=5),
    yaxis5=dict(title="Age", title_standoff=0),
    yaxis6=dict(title="Monthly salary", title_standoff=5),
    yaxis7=dict(title="Bank accounts", title_standoff=5),
    yaxis8=dict(title="Credit cards", title_standoff=5),
    yaxis9=dict(title="Loans taken", title_standoff=5),
)

The median earnings across all occupations appear to be fairly uniform, with no occupation standing out significantly. This suggests that the dataset represents a balanced sample of earnings among professions.

Most occupations show a median of 4–6 bank accounts, with some slight variations. This indicates that individuals across professions generally follow similar banking habits.

The median number of credit cards is consistent across occupations, typically ranging from 4 to 5. This suggests similar credit card usage regardless of profession.

The median number of loans is stable across all occupations. This reflects similar borrowing behavior across professions.

The average age is consistent across occupations, falling within the range of 30–35 years. This suggests a similar age demographic representation in the dataset.

Average earnings increase steadily with age, peaking and stabilizing during midlife (30s to 50s), reflecting career growth and progression, with a slight late-career rise likely tied to senior roles or specialized positions.

The median number of bank accounts remains relatively stable across ages, with younger individuals (under 20) having slightly more accounts, possibly due to student or savings accounts. A slight decline is observed in older age groups.

The median number of credit cards remains fairly consistent across different ages, indicating stable credit behavior once individuals establish their financial standing.

Younger individuals (under 20) take out more loans, likely due to student loans or early financial commitments. Loan-taking stabilizes in middle age and declines in older groups, possibly due to reduced borrowing needs.
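The panels above mix medians and means per group. A compact way to compute several such statistics side by side is named aggregation, sketched here on a hypothetical frame `toy`; the output column names are illustrative only.

```python
import pandas as pd

# Hypothetical data: two occupations, one of them with a skewed salary distribution
toy = pd.DataFrame({
    "Occupation": ["Lawyer", "Lawyer", "Teacher", "Teacher", "Teacher"],
    "Monthly_Inhand_Salary": [3000.0, 5000.0, 2000.0, 4000.0, 9000.0],
    "Num_Bank_Accounts": [4, 6, 3, 5, 7],
})

# Named aggregation: new_column=(source_column, aggregation)
by_occupation = toy.groupby("Occupation").agg(
    median_salary=("Monthly_Inhand_Salary", "median"),
    mean_salary=("Monthly_Inhand_Salary", "mean"),
    median_accounts=("Num_Bank_Accounts", "median"),
)
print(by_occupation)
```

In this toy data both occupations share the same median salary while their means differ, which shows why it matters to state which statistic a chart displays.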

In [None]:
# clean data: apply all filters to a single frame
dfc = df.query('0 < Age < 90')
dfc = dfc.query('Annual_Income < 200000')
dfc = dfc.query('0 <= Num_Bank_Accounts <= 10')
dfc = dfc.query('0 <= Num_Credit_Card <= 10')
dfc = dfc.query('Interest_Rate <= 100')
dfc = dfc.query('0 <= Num_of_Loan <= 20')
dfc = dfc.query('0 <= Num_of_Delayed_Payment <= 30')
dfc = dfc.query('Num_Credit_Inquiries <= 20')
dfc = dfc.query('0 <= Total_EMI_per_month <= 1100')
dfc = dfc.query('-1000000 <= Monthly_Balance <= 1000000000')
# stricter bound than the Interest_Rate <= 100 filter above
dfc = dfc.query('0 <= Interest_Rate <= 50')

Unnamed: 0,Age,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Total_EMI_per_month,Amount_invested_monthly,Monthly_Balance
count,109713.0,109713.0,93200.0,109713.0,109713.0,109713.0,109713.0,109713.0,109713.0,107406.0,109713.0,109713.0,109713.0,99778.0,109713.0,104767.0,109713.0
mean,33.439036,49638.543935,4123.266986,5.390163,5.545952,14.602846,3.533383,21.163773,13.426604,10.416403,6.313017,1431.0732,32.305329,18.559966,106.511889,628.045618,400.598159
std,10.764442,37780.591541,3144.752586,2.579176,2.060864,8.747249,2.44213,14.865637,6.19247,6.80479,3.953016,1154.456196,5.106923,8.310834,121.858519,2042.002266,211.757597
min,14.0,7005.93,303.645417,0.0,0.0,1.0,0.0,-5.0,0.0,-6.48,0.0,0.23,20.800587,0.083333,0.0,0.0,0.088628
25%,25.0,19200.08,1613.4225,3.0,4.0,8.0,2.0,10.0,9.0,5.34,3.0,569.26,28.084922,12.166667,29.568216,73.695647,269.744585
50%,33.0,36490.98,3045.354167,6.0,5.0,13.0,3.0,18.0,14.0,9.44,6.0,1171.51,32.31211,18.416667,66.780695,133.245085,335.795107
75%,42.0,70446.63,5862.33,7.0,7.0,20.0,5.0,28.0,18.0,14.95,9.0,1960.47,36.51616,25.25,146.707645,256.134373,467.756687
max,56.0,179987.28,15204.633333,10.0,10.0,34.0,19.0,67.0,28.0,36.65,17.0,4998.07,49.564519,34.0,1095.265876,10000.0,1606.518192


In [10]:
air_max_dict = (
        dfc
        .groupby('Age')['Interest_Rate']
        .max()
        .dropna()  
        .to_dict()
    )

air_min_dict = (
        dfc
        .groupby('Age')['Interest_Rate']
        .min()
        .dropna()  
        .to_dict()
    )

ages = list(air_max_dict.keys())
max_rate = list(air_max_dict.values())
min_rate = list(air_min_dict.values())

air = go.Figure()

for group in ages:
    air.add_trace(go.Box(
        y=dfc[dfc['Age']== group]['Interest_Rate'],
        name=str(group),
        boxmean=True,
        marker=dict(color='blue'),
        showlegend=False,
        width=0.4
    ))


air.update_layout(
    #template = 'ggplot2',
    title = 'Age vs Interest Rate',
    margin = dict(l=50, r=50, t=80, b=50),
    xaxis_title = 'Age',
    yaxis_title = 'Interest Rate',
    boxmode = 'group',
)

air.add_vline(x=3.5, line_width=2, line_dash="dash", line_color="black")
air.add_vline(x=32.5, line_width=2, line_dash="dash", line_color="black")

The dataset appears to be artificially generated, as evidenced by the two spikes marked by dashed lines. These sudden shifts in data distribution suggest a structured or programmed change rather than natural variation, reinforcing the likelihood of a synthetic dataset.
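Abrupt shifts like these can also be located programmatically instead of by eye. The sketch below, on hypothetical data, takes the maximum interest rate per age and flags ages where it changes by at least a chosen step. The threshold of 10 is an arbitrary assumption for the example.

```python
import pandas as pd

# Hypothetical data: the maximum rate drops sharply between ages 15 and 16
toy = pd.DataFrame({
    "Age": [14, 14, 15, 15, 16, 16, 17, 17],
    "Interest_Rate": [30, 34, 28, 34, 12, 20, 15, 20],
})

max_by_age = toy.groupby("Age")["Interest_Rate"].max()

# Difference between each age and the previous one; the first entry has no predecessor
step = max_by_age.diff()

THRESHOLD = 10  # hypothetical minimum jump size worth flagging
jumps = step[step.abs() >= THRESHOLD]
print(jumps)
```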

In [11]:
scores = list(df["Credit_Score"].unique())
occupation_counts = df.groupby(['Occupation', 'Credit_Score']).size().unstack(fill_value=0)
occupations = occupation_counts.index.tolist()

svso = go.Figure()

for score in scores:
    svso.add_trace(go.Bar(
        x=occupations,
        y=occupation_counts[score],
        name=score,
        #text=occupation_counts[score],
        #textposition='inside'
    ))

svso.update_layout(
    barmode='stack',
    title='Credit score by occupation',
    xaxis_title='Occupation',
    yaxis_title='Count of People',
    legend_title='Credit Score',
    legend=dict(
        orientation="h",  
        yanchor="bottom", 
        y=1.02,  
        xanchor="right", 
        x=1
    )
    #template='plotly_white'

)

There appears to be no association between occupation and credit score. This may sound counterintuitive at first, but given the roughly equal earnings across occupations, it is an expected outcome.
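Stacked counts make it hard to compare occupations of different sizes, so a row-normalized cross-tabulation can back up this reading. The sketch below uses a hypothetical frame `toy`; `spread` is an illustrative name for the gap between the highest and lowest share of each score across occupations.

```python
import pandas as pd

# Hypothetical data: both occupations have the same score mix
toy = pd.DataFrame({
    "Occupation": ["Lawyer"] * 4 + ["Teacher"] * 4,
    "Credit_Score": ["Good", "Poor", "Standard", "Standard",
                     "Good", "Poor", "Standard", "Standard"],
})

# normalize="index" makes every row (occupation) sum to 1
shares = pd.crosstab(toy["Occupation"], toy["Credit_Score"], normalize="index") * 100

# For each credit score: difference between the occupations with the highest and lowest share
spread = shares.max() - shares.min()
print(shares)
print(spread)
```

A spread close to zero for every score is what "no association" looks like in this table.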

In [12]:
scores = list(dfc['Credit_Score'].unique())

score_colors = {'Unknown':'gray','Poor':'red', 'Standard':'yellow', 'Good':'blue'}

dfc_nonnacs = dfc[dfc['Credit_Score'] != "Unknown"]

dfc_good = dfc_nonnacs.query("Credit_History_Age < Age")
dfc_bad = dfc_nonnacs.query("Age <= Credit_History_Age")

test = go.Figure()

for score in dfc_good['Credit_Score'].unique():
    df_filtered = dfc_good[dfc_good['Credit_Score'] == score]
    
    test.add_trace(go.Scattergl(
        x=df_filtered['Age'],
        y=df_filtered['Credit_History_Age'],
        mode='markers',
        marker=dict(
            size=20,
            opacity = 0.015,
            color=score_colors[score],
            line=dict(width=0, color='black'),
        ),
        name=score
    ))

test.add_trace(go.Scattergl(
    x=dfc_bad['Age'],
    y=dfc_bad['Credit_History_Age'],
    mode='markers',
    marker=dict(
        size=20,
        opacity = 0.015,
        color= 'black',
        line=dict(width=0, color='black'),
    ),
        name='Impossible entries'
))

test.update_layout(
    title='Bubble Chart of Credit Score by Age and Credit History',
    xaxis_title='Age',
    yaxis_title='Credit History Length',
)


A noticeable issue is that some individuals have a credit history at least as long as their age, which would mean they started building credit at birth or earlier, an unrealistic scenario. This strongly indicates that the dataset is artificially generated, as such patterns would not occur in real-world financial data.
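The same check can be written in terms of the age at which the credit history would have started. This is a sketch on hypothetical rows. `MIN_AGE_AT_FIRST_CREDIT` is an assumed parameter: a value of 0 reproduces the `Age <= Credit_History_Age` condition used in the cell above, and a higher value would make the check stricter.

```python
import pandas as pd

# Hypothetical rows; Credit_History_Age is in years and may be missing
toy = pd.DataFrame({
    "Age": [21, 30, 18, 45],
    "Credit_History_Age": [3.5, 30.0, 20.25, None],
})

MIN_AGE_AT_FIRST_CREDIT = 0  # hypothetical threshold

# Age at which the person would have opened their first credit line
implied_start_age = toy["Age"] - toy["Credit_History_Age"]

# Rows with a missing history compare as False and are therefore not flagged
impossible = implied_start_age <= MIN_AGE_AT_FIRST_CREDIT
print(toy[impossible])
```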

## Summary

After the analysis, we can state with a fairly high degree of certainty that the dataset is artificially generated. The creators appear to have used mainly uniform distributions, sometimes several combined. The degree and structure of the data contamination suggest that it was introduced on purpose, which signals that the dataset was created for training purposes.