<a href="https://colab.research.google.com/github/Rishi-Kora/Credit-Card-Data-Analysis/blob/main/credit_card_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Importing the necessary libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler

Loading the dataset

In [4]:
file_path = '/content/drive/MyDrive/Colab Notebooks/data/credit_customers.csv'
df = pd.read_csv(file_path)
print(df)

    checking_status  duration                  credit_history  \
0                <0       6.0  critical/other existing credit   
1          0<=X<200      48.0                   existing paid   
2       no checking      12.0  critical/other existing credit   
3                <0      42.0                   existing paid   
4                <0      24.0              delayed previously   
..              ...       ...                             ...   
995     no checking      12.0                   existing paid   
996              <0      30.0                   existing paid   
997     no checking      12.0                   existing paid   
998              <0      45.0                   existing paid   
999        0<=X<200      45.0  critical/other existing credit   

                 purpose  credit_amount    savings_status  employment  \
0               radio/tv         1169.0  no known savings         >=7   
1               radio/tv         5951.0              <100      1<=X<4   


Displaying the dataset information

In [5]:
print("Initial DataFrame Info:")
df.info()

Initial DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   checking_status         1000 non-null   object 
 1   duration                1000 non-null   float64
 2   credit_history          1000 non-null   object 
 3   purpose                 1000 non-null   object 
 4   credit_amount           1000 non-null   float64
 5   savings_status          1000 non-null   object 
 6   employment              1000 non-null   object 
 7   installment_commitment  1000 non-null   float64
 8   personal_status         1000 non-null   object 
 9   other_parties           1000 non-null   object 
 10  residence_since         1000 non-null   float64
 11  property_magnitude      1000 non-null   object 
 12  age                     1000 non-null   float64
 13  other_payment_plans     1000 non-null   object 
 14  housing          

Describing the dataset

In [6]:
print("Initial Descriptive Statistics:")
df.describe()

Initial Descriptive Statistics:


         duration  credit_amount  installment_commitment  residence_since        age  existing_credits  num_dependents
count      1000.0         1000.0                  1000.0           1000.0     1000.0            1000.0          1000.0
mean       20.903       3271.258                   2.973            2.845     35.546             1.407           1.155
std     12.058814    2822.736876                1.118715         1.103718  11.375469          0.577654        0.362086
min           4.0          250.0                     1.0              1.0       19.0               1.0             1.0
25%          12.0         1365.5                     2.0              2.0       27.0               1.0             1.0
50%          18.0         2319.5                     3.0              3.0       33.0               1.0             1.0
75%          24.0        3972.25                     4.0              4.0       42.0               2.0             1.0
max          72.0        18424.0                     4.0              4.0       75.0               4.0             2.0


Checking for null values in the dataset

In [7]:
print("Missing Values in Each Column:")
df.isnull().sum()

Missing Values in Each Column:


checking_status           0
duration                  0
credit_history            0
purpose                   0
credit_amount             0
savings_status            0
employment                0
installment_commitment    0
personal_status           0
other_parties             0


Dropping the duplicates in the dataset

In [8]:
df.drop_duplicates(inplace=True)
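
Before dropping duplicates in place, it can help to count how many fully duplicated rows exist, and to reset the index afterwards so it has no gaps. The following is a minimal sketch on a hypothetical toy frame; `toy`, `toy_clean`, and the values are made up for illustration and are not taken from credit_customers.csv.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are exact duplicates.
toy = pd.DataFrame({
    "duration": [6.0, 48.0, 6.0],
    "purpose": ["radio/tv", "education", "radio/tv"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows found: {n_duplicates}")

# Drop the repeats and renumber the index from 0 so it has no gaps.
toy_clean = toy.drop_duplicates().reset_index(drop=True)
print(toy_clean)
```

Resetting the index is optional, but it avoids surprises when later steps build new frames from plain NumPy arrays.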

Standardizing and normalizing the dataset

In [9]:
numeric_features = df.select_dtypes(include=[np.number])

In [10]:
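scaler_standard = StandardScaler()
standardized_features = scaler_standard.fit_transform(numeric_features)
# Keep the original index so rows stay aligned with df after drop_duplicates.
df_standardized = pd.DataFrame(standardized_features, columns=numeric_features.columns, index=numeric_features.index)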

In [11]:
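scaler_minmax = MinMaxScaler()
normalized_features = scaler_minmax.fit_transform(numeric_features)
# Keep the original index so rows stay aligned with df after drop_duplicates.
df_normalized = pd.DataFrame(normalized_features, columns=numeric_features.columns, index=numeric_features.index)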
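
To make the two scalings concrete, the sketch below applies the underlying formulas by hand with NumPy. The array `x` is a hypothetical toy column, not data from the notebook. Standardization computes (x - mean) / std with the population standard deviation (ddof=0), which is what StandardScaler uses; min-max normalization computes (x - min) / (max - min).

```python
import numpy as np

# Hypothetical toy column, not taken from the dataset.
x = np.array([4.0, 12.0, 18.0, 24.0, 72.0])

# Standardization: subtract the mean, divide by the population
# standard deviation (ddof=0), matching StandardScaler's convention.
standardized = (x - x.mean()) / x.std(ddof=0)

# Min-max normalization: rescale linearly so min maps to 0 and max to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.round(3))
print(normalized.round(3))
```

After standardization the values have mean 0 and unit variance; after min-max normalization they lie in the interval [0, 1].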

Detecting outliers in the dataset (IQR method)

In [12]:
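def detect_outliers_iqr(data):
    """Return the values of a Series that fall outside the 1.5 * IQR fences."""
    Q1 = data.quantile(0.25)  # first quartile
    Q3 = data.quantile(0.75)  # third quartile
    IQR = Q3 - Q1             # interquartile range
    return data[(data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))]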
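
The IQR rule is easiest to see on a small example. The sketch below re-implements the same rule under a hypothetical name, `iqr_outliers`, and runs it on a made-up Series in which only the value 100 lies outside the fences.

```python
import pandas as pd

def iqr_outliers(data: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5 * IQR fences."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return data[(data < lower) | (data > upper)]

# Hypothetical toy values; 100 sits far above the rest.
toy = pd.Series([1, 2, 3, 4, 5, 100])
flagged = iqr_outliers(toy)
print(flagged)
```

Here Q1 = 2.25 and Q3 = 4.75, so the fences are -1.5 and 8.5 and only 100 is returned. When the function is applied column by column to a DataFrame, cells that are not outliers show up as NaN.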

In [13]:
outliers_iqr = numeric_features.apply(detect_outliers_iqr)
print("Outliers detected using IQR:")
print(outliers_iqr)

Outliers detected using IQR:
     duration  credit_amount  installment_commitment  residence_since   age  \
0         NaN            NaN                     NaN              NaN  67.0   
1        48.0            NaN                     NaN              NaN   NaN   
2         NaN            NaN                     NaN              NaN   NaN   
3         NaN            NaN                     NaN              NaN   NaN   
4         NaN            NaN                     NaN              NaN   NaN   
..        ...            ...                     ...              ...   ...   
983       NaN         8229.0                     NaN              NaN   NaN   
990       NaN            NaN                     NaN              NaN   NaN   
991       NaN            NaN                     NaN              NaN   NaN   
998      45.0            NaN                     NaN              NaN   NaN   
999      45.0            NaN                     NaN              NaN   NaN   

     existing_credits 

Detecting outliers with the z-score

In [14]:
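# Absolute z-score of every numeric value; pandas .std() uses the sample standard deviation (ddof=1).
z_scores = np.abs((numeric_features - numeric_features.mean()) / numeric_features.std())
# Flag a row when any of its numeric columns lies more than 3 standard deviations from the mean.
outliers_z = (z_scores > 3).any(axis=1)
print("Rows identified as outliers using Z-score:")
print(df[outliers_z])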

Rows identified as outliers using Z-score:
    checking_status  duration                  credit_history  \
18         0<=X<200      24.0                   existing paid   
29               <0      60.0              delayed previously   
63         0<=X<200      48.0             no credits/all paid   
65      no checking      27.0                   existing paid   
87         0<=X<200      36.0                   existing paid   
95         0<=X<200      54.0             no credits/all paid   
105        0<=X<200      24.0  critical/other existing credit   
134     no checking      60.0                   existing paid   
163        0<=X<200      10.0                   existing paid   
186        0<=X<200       9.0                        all paid   
197        0<=X<200      12.0                   existing paid   
236        0<=X<200       6.0                   existing paid   
255        0<=X<200      60.0              delayed previously   
272        0<=X<200      48.0                  
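
The z-score rule can be checked on a small frame. This is a minimal sketch with hypothetical data: `toy` and its columns `a` and `b` are made up, and only the last row of `a` is extreme enough to exceed the threshold of 3.

```python
import pandas as pd

# Hypothetical toy frame: "a" has one extreme value in the last row,
# "b" is evenly spread and has no extreme values.
toy = pd.DataFrame({
    "a": [10.0] * 19 + [100.0],
    "b": [float(i) for i in range(20)],
})

# Column-wise absolute z-scores; pandas .std() uses the sample
# standard deviation (ddof=1) by default.
z = ((toy - toy.mean()) / toy.std()).abs()

# A row is flagged if any of its columns exceeds the threshold of 3.
flagged_rows = toy[(z > 3).any(axis=1)]
print(flagged_rows)
```

Because the mask uses `any(axis=1)`, a single extreme column is enough to flag the whole row, which is why the output above lists complete records rather than individual values.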

Plotting boxplots and histograms of the numeric features

In [15]:
fig = px.box(numeric_features, title="Boxplot of Numeric Features")
fig.show()

In [16]:
for col in numeric_features.columns:
    fig = px.histogram(
        df,
        x=col,
        title=f'Distribution of {col}',
        nbins=30,
        marginal='rug',
        histnorm='percent'
    )
    fig.show()