
# Credit Score Classification — Detailed Report

This project uses **Logistic Regression** and **K-Nearest Neighbors (KNN)** to predict a customer’s **credit score category** based on their demographic and financial data.  
Each section includes a brief explanation followed by executable code to maintain both clarity and reproducibility.



## Problem Statement

The objective is to build a machine learning model that classifies customers into credit score categories such as *Good*, *Standard*, or *Poor*.  
Automating this step can help financial institutions score customers more consistently and efficiently, although a model trained on historical data can still inherit bias from that data.



## Step 1: Data Loading

In this step, we load the provided datasets (`train.csv` and `test.csv`) into pandas DataFrames.  
We then preview the data to ensure it’s read correctly and check the column names and basic structure.


In [None]:

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.head()



## Step 2: Data Overview

Here, we inspect the dataset’s structure, including column data types and number of non-null entries.  
This helps us identify which features are numerical or categorical and where missing data may exist.


In [None]:

train.info()



We also look at descriptive statistics for numerical and categorical columns to understand distributions, ranges, and outliers.


In [None]:

train.describe(include='all').T



## Step 3: Target Variable

The target variable in this dataset is **Credit_Score**.  
Before modeling, it’s important to understand how balanced the classes are, since imbalance can affect model performance.


In [None]:

train['Credit_Score'].value_counts()
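Raw counts can be hard to compare at a glance, so class shares are often easier to read. The sketch below uses a small made-up label series (not the real data) to show `value_counts(normalize=True)` and a rough imbalance ratio; the names `labels`, `class_share`, and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Toy labels standing in for train['Credit_Score']; the values are illustrative only.
labels = pd.Series(["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2, name="Credit_Score")

# normalize=True returns class shares instead of raw counts.
class_share = labels.value_counts(normalize=True)
print(class_share)

# Ratio of the largest to the smallest class as a rough imbalance indicator.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```

On the real data, the same two lines can be applied directly to `train['Credit_Score']`.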



## Step 4: Missing Values

We now check for missing values in each column.  
Identifying and handling missing data early prevents training errors and improves model reliability.


In [None]:

train.isna().sum().sort_values(ascending=False).head(15)
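Absolute counts of missing values are easier to judge next to the share of rows they represent. The following is a minimal sketch on a toy DataFrame with hypothetical column names; the real dataset's columns may differ.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; not the real credit dataset.
toy_missing = pd.DataFrame({
    "income": [50_000, np.nan, 72_000, np.nan],
    "occupation": ["Engineer", None, "Teacher", "Doctor"],
    "age": [34, 45, 29, 51],
})

# isna().mean() gives the fraction of missing entries per column.
missing_pct = (toy_missing.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Columns with a very high missing share may be better dropped than imputed, which is a judgment call that depends on the feature.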



## Step 5: Exploratory Data Analysis (EDA)

To better understand the dataset, we visualize how many customers fall into each credit score category.  
This helps us check for imbalance between the classes.


In [None]:

import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
train['Credit_Score'].value_counts().plot(kind='bar', color='skyblue')
plt.title("Credit Score Distribution")
plt.xlabel("Credit Score Category")
plt.ylabel("Count")
plt.show()



## Step 6: Data Preprocessing

We prepare the data for modeling by:
- Filling missing numeric values with the median.
- Filling missing categorical values with the mode.
- Converting categorical variables into numerical ones using one-hot encoding.
- Splitting the labeled data from `train.csv` into a training set (75%) and a held-out test set (25%) for model evaluation.


In [None]:

from sklearn.model_selection import train_test_split

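# Separate the features from the target column.
X = train.drop(columns=['Credit_Score'])
y = train['Credit_Score']

# Split the columns by dtype: numeric versus everything else (treated as categorical).
num_cols = X.select_dtypes(include='number').columns
cat_cols = [c for c in X.columns if c not in num_cols]

# Impute numeric gaps with the column median and categorical gaps with the column mode.
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
X[cat_cols] = X[cat_cols].fillna(X[cat_cols].mode().iloc[0])

# One-hot encode the categorical columns; drop_first removes one redundant dummy per feature.
X = pd.get_dummies(X, drop_first=True)

# Hold out 25% of the rows; stratify=y keeps class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)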

X_train.shape, X_test.shape
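The cell above computes medians and modes on the full dataset before splitting, so the held-out rows influence the imputation values. One possible way to avoid this is to split first and fit the preprocessing on the training rows only. The sketch below shows that idea with scikit-learn's `ColumnTransformer` on toy data; the column names `income` and `occupation` are assumptions, not columns confirmed to exist in the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with hypothetical column names; not the real credit dataset.
toy = pd.DataFrame({
    "income": [30.0, np.nan, 55.0, 80.0, 42.0, np.nan, 61.0, 75.0],
    "occupation": ["a", "b", np.nan, "a", "b", "a", np.nan, "b"],
    "Credit_Score": ["Poor", "Good", "Poor", "Good", "Poor", "Good", "Poor", "Good"],
})
X_toy = toy.drop(columns=["Credit_Score"])
y_toy = toy["Credit_Score"]

# Split first, so that no statistic is computed from held-out rows.
X_toy_tr, X_toy_te, y_toy_tr, y_toy_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Median imputation for numeric columns; mode imputation plus one-hot for categoricals.
preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), ["income"]),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), ["occupation"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then apply the same transformation to the test rows.
X_toy_tr_prepared = preprocess.fit_transform(X_toy_tr)
X_toy_te_prepared = preprocess.transform(X_toy_te)
print(X_toy_tr_prepared.shape, X_toy_te_prepared.shape)
```

`handle_unknown="ignore"` encodes categories that appear only in the test rows as all zeros instead of raising an error.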



## Step 7: Logistic Regression (Numeric Features Only)

We start with a simple Logistic Regression model using **only numeric features**.  
This helps us evaluate how well numeric information alone can predict credit scores.


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

X_num_train = X_train[num_cols]
X_num_test = X_test[num_cols]

model_lr1 = LogisticRegression(max_iter=1000)
model_lr1.fit(X_num_train, y_train)
y_pred_lr1 = model_lr1.predict(X_num_test)

acc1 = accuracy_score(y_test, y_pred_lr1)
prec1 = precision_score(y_test, y_pred_lr1, average='macro')
rec1 = recall_score(y_test, y_pred_lr1, average='macro')
f11 = f1_score(y_test, y_pred_lr1, average='macro')

print(f"Accuracy: {acc1:.4f}, Precision: {prec1:.4f}, Recall: {rec1:.4f}, F1: {f11:.4f}")
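The cell above imports `confusion_matrix` but reports only aggregate scores. A per-class view can show which categories get confused with each other. The sketch below uses made-up labels and predictions purely for illustration; on the real data, `y_test` and `y_pred_lr1` would take their place.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels, for illustration only.
class_names = ["Good", "Poor", "Standard"]
y_true = ["Good", "Good", "Poor", "Poor", "Standard", "Standard", "Standard", "Poor"]
y_hat = ["Good", "Standard", "Poor", "Poor", "Standard", "Standard", "Good", "Standard"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=class_names)
cm_table = pd.DataFrame(cm, index=class_names, columns=class_names)
print(cm_table)

# Per-class precision, recall, and F1 in a single report.
print(classification_report(y_true, y_hat, labels=class_names, zero_division=0))
```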



## Step 8: Logistic Regression (All Features)

Next, we use **all available features** (numeric + encoded categorical).  
This allows the model to leverage more information and potentially improve accuracy.


In [None]:

model_lr2 = LogisticRegression(max_iter=1000)
model_lr2.fit(X_train, y_train)
y_pred_lr2 = model_lr2.predict(X_test)

acc2 = accuracy_score(y_test, y_pred_lr2)
prec2 = precision_score(y_test, y_pred_lr2, average='macro')
rec2 = recall_score(y_test, y_pred_lr2, average='macro')
f12 = f1_score(y_test, y_pred_lr2, average='macro')

print(f"Accuracy: {acc2:.4f}, Precision: {prec2:.4f}, Recall: {rec2:.4f}, F1: {f12:.4f}")
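Steps 7 and 8 fit Logistic Regression on unscaled features. Because the default model is regularized, features on very different scales can slow convergence and distort the penalty. A possible variant, sketched here on synthetic data rather than the credit dataset, standardizes the features inside a pipeline and then reads the fitted coefficients; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the credit dataset (assumption).
X_lr, y_lr = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_lr_tr, X_lr_te, y_lr_tr, y_lr_te = train_test_split(
    X_lr, y_lr, test_size=0.25, random_state=0, stratify=y_lr
)

# The scaler is fitted on the training split only, as part of the pipeline.
lr_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipeline.fit(X_lr_tr, y_lr_tr)
lr_accuracy = lr_pipeline.score(X_lr_te, y_lr_te)
print(f"Held-out accuracy on synthetic data: {lr_accuracy:.3f}")

# One row of coefficients per class, one column per (standardized) feature.
lr_coef = lr_pipeline.named_steps["logisticregression"].coef_
print(lr_coef.shape)
```

With standardized inputs, the magnitudes of the coefficients within a class are roughly comparable across features, which helps when interpreting the model.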



## Step 9: K-Nearest Neighbors (k=3)

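We now test a KNN model with **k=3**.  
This algorithm classifies each sample by the majority label among its 3 nearest training samples in feature space. Because KNN relies on distances, the features are standardized first so that no single feature dominates because of its scale.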


In [None]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled_train = scaler.fit_transform(X_train)
X_scaled_test = scaler.transform(X_test)

knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_scaled_train, y_train)
y_pred_k3 = knn3.predict(X_scaled_test)

acc3 = accuracy_score(y_test, y_pred_k3)
prec3 = precision_score(y_test, y_pred_k3, average='macro')
rec3 = recall_score(y_test, y_pred_k3, average='macro')
f13 = f1_score(y_test, y_pred_k3, average='macro')

print(f"Accuracy: {acc3:.4f}, Precision: {prec3:.4f}, Recall: {rec3:.4f}, F1: {f13:.4f}")
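To make the majority-vote idea concrete, here is a minimal from-scratch sketch in NumPy. It is not how scikit-learn implements `KNeighborsClassifier` internally; the helper `knn_predict` and the tiny 2-D dataset are hypothetical and exist only for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_ref, y_ref, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest reference points."""
    # Euclidean distance from the query to every reference point.
    distances = np.linalg.norm(X_ref - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_ref[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up 2-D dataset, for illustration only.
X_pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
y_pts = np.array(["Poor", "Poor", "Poor", "Good", "Good"])

pred_a = knn_predict(X_pts, y_pts, np.array([0.15, 0.15]), k=3)
pred_b = knn_predict(X_pts, y_pts, np.array([1.9, 2.0]), k=3)
print(pred_a, pred_b)
```

For the second query, two of the three nearest points are labeled "Good" and one is labeled "Poor", so the vote goes to "Good".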



## Step 10: K-Nearest Neighbors (k=5)

Here, we increase **k to 5** to check how the model behaves with a larger neighborhood size.  
Larger `k` values generally smooth out noise but can reduce sensitivity to small patterns.


In [None]:

knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(X_scaled_train, y_train)
y_pred_k5 = knn5.predict(X_scaled_test)

acc5 = accuracy_score(y_test, y_pred_k5)
prec5 = precision_score(y_test, y_pred_k5, average='macro')
rec5 = recall_score(y_test, y_pred_k5, average='macro')
f15 = f1_score(y_test, y_pred_k5, average='macro')

print(f"Accuracy: {acc5:.4f}, Precision: {prec5:.4f}, Recall: {rec5:.4f}, F1: {f15:.4f}")
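The smoothing effect of `k` can be seen by comparing training and held-out accuracy for several values. The sketch below does this on synthetic data (an assumption, not the credit dataset). With `k=1` each training point is its own nearest neighbor, so training accuracy is perfect while held-out accuracy is usually lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the credit dataset (assumption).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

# Standardize using statistics from the training split only.
syn_scaler = StandardScaler().fit(X_syn_tr)
X_syn_tr_s = syn_scaler.transform(X_syn_tr)
X_syn_te_s = syn_scaler.transform(X_syn_te)

# Record (training accuracy, held-out accuracy) for each k.
k_scores = {}
for k in (1, 3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_syn_tr_s, y_syn_tr)
    k_scores[k] = (knn.score(X_syn_tr_s, y_syn_tr), knn.score(X_syn_te_s, y_syn_te))
    print(f"k={k:>2}  train acc={k_scores[k][0]:.3f}  test acc={k_scores[k][1]:.3f}")
```

Choosing `k` from held-out scores like these would leak information from the test set, so in practice the comparison is better done on a validation split or with cross-validation.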



## Step 11: Model Comparison

Finally, we compare the performance of all models using key metrics (Accuracy, Precision, Recall, F1).  
This helps identify which approach performs best for this dataset.


In [None]:

results = pd.DataFrame({
    'Model': ['LR (Numeric)', 'LR (All)', 'KNN (k=3)', 'KNN (k=5)'],
    'Accuracy': [acc1, acc2, acc3, acc5],
    'Precision': [prec1, prec2, prec3, prec5],
    'Recall': [rec1, rec2, rec3, rec5],
    'F1': [f11, f12, f13, f15]
})
results
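The four metric blocks in Steps 7 to 10 repeat the same calls. One way to reduce the repetition is a small helper that returns one row of metrics per model. The function `evaluate` below is a hypothetical helper, and the labels and predictions are made up for illustration; they are not results from this project.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate(name, y_true, y_pred):
    """Return one row of macro-averaged metrics for a model (hypothetical helper)."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "Recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


# Made-up labels and predictions, for illustration only.
y_true = ["Good", "Poor", "Standard", "Standard", "Poor", "Good"]
preds = {
    "Model A": ["Good", "Poor", "Standard", "Standard", "Poor", "Good"],
    "Model B": ["Good", "Standard", "Standard", "Standard", "Poor", "Poor"],
}

# Build the comparison table and rank the models by macro F1.
summary = pd.DataFrame([evaluate(name, y_true, p) for name, p in preds.items()])
summary = summary.sort_values("F1", ascending=False).reset_index(drop=True)
print(summary)
```

Sorting by macro F1 rather than accuracy gives each class equal weight, which matters when the classes are imbalanced.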



## Step 12: Conclusion

- **Logistic Regression (All Features)** provides the most balanced and interpretable results.  
- **KNN** performs well but depends on feature scaling, and its prediction cost grows with the size of the training set.  
- The best model can be used for automated credit risk classification to improve consistency in decision-making.

**Next Steps:**
1. Add cross-validation for more stable evaluation.  
2. Try adjusting KNN’s `k` value and distance metrics.  
3. Consider using Decision Trees or Random Forests for more complex patterns.
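The first two next steps can be combined in one search. The following sketch, run on synthetic data as a stand-in for the credit dataset, uses stratified 5-fold cross-validation with `GridSearchCV` to compare several `k` values and two distance metrics. The grid values are arbitrary examples, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the credit dataset (assumption).
X_cv, y_cv = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Scaling sits inside the pipeline, so it is refitted on each training fold.
knn_pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Example grid; the specific values are illustrative.
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__metric": ["euclidean", "manhattan"],
}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(knn_pipe, param_grid, cv=folds, scoring="f1_macro")
search.fit(X_cv, y_cv)

print("Best parameters:", search.best_params_)
print(f"Best mean macro F1 across folds: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means each fold is standardized with statistics from its own training portion, which avoids leaking information from the validation fold.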
