## Step 1: Load and Inspect the Data

In [2]:
# Import the pandas library for data manipulation
import pandas as pd

# Define the full path to your training CSV file
# Note: this is the dataset's original download filename. You can rename it to 'train.csv' to make it simpler.
file_path = r'C:\Users\user 1\Documents\Apollos\credit-scoring-model\train_u6lujux_CVtuZ9i.csv'

# Load the CSV file into a pandas DataFrame
df_loan = pd.read_csv(file_path)

# --- Initial Inspection ---

# 1. Display the first 5 rows to understand the features
print("First 5 rows of the loan dataset:")
print(df_loan.head())
print("\n" + "="*50 + "\n")

# 2. Get a concise summary of the DataFrame to check for missing values and data types
print("Loan DataFrame Information:")
df_loan.info()

First 5 rows of the loan dataset:
    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural 
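The `info()` summary shows non-null counts, but a direct per-column count of missing values is often easier to scan before choosing an imputation strategy. The following is a minimal sketch on a hypothetical three-row sample; the column names follow the loan data, while the values and the names `sample` and `columns_to_impute` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring a few columns of the loan data (values are illustrative)
sample = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "LoanAmount": [np.nan, 128.0, 66.0],
    "Loan_Amount_Term": [360.0, 360.0, 360.0],
    "Credit_History": [1.0, 1.0, np.nan],
})

# Count missing values per column, largest first
missing_counts = sample.isna().sum().sort_values(ascending=False)

# Keep only the columns that actually need imputation
columns_to_impute = missing_counts[missing_counts > 0]
print(columns_to_impute)
```

On the real data, the same two lines applied to `df_loan` would list the columns that Step 2 needs to fill.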

## Step 2: Data Cleaning and Preprocessing

In [3]:
# --- Data Preprocessing ---

# Make a copy to keep the original data safe
df_processed = df_loan.copy()

# 1. Drop the Loan_ID column as it is not needed for prediction
df_processed = df_processed.drop('Loan_ID', axis=1)

# 2. Fill missing values (Imputation)
# For categorical columns, we'll fill missing values with the most frequent value (the 'mode')
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed']:
    # Assign the result back instead of using inplace=True on a column selection
    df_processed[col] = df_processed[col].fillna(df_processed[col].mode()[0])

# For numerical columns, we'll fill missing values with the average value (the 'mean')
for col in ['LoanAmount', 'Loan_Amount_Term', 'Credit_History']:
    df_processed[col] = df_processed[col].fillna(df_processed[col].mean())

# 3. Convert categorical columns to numbers
# Machine learning models only understand numbers. We will convert text categories into numerical representations.
# We will use one-hot encoding for most columns, and a simple map for our target variable.
df_processed['Loan_Status'] = df_processed['Loan_Status'].map({'Y': 1, 'N': 0})

# Use get_dummies for one-hot encoding of other categorical features
df_processed = pd.get_dummies(df_processed, drop_first=True)


# --- Verification ---
print("Data types after processing:")
df_processed.info()

print("\n" + "="*50 + "\n")

print("First 5 rows of the processed dataset:")
print(df_processed.head())

Data types after processing:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ApplicantIncome          614 non-null    int64  
 1   CoapplicantIncome        614 non-null    float64
 2   LoanAmount               614 non-null    float64
 3   Loan_Amount_Term         614 non-null    float64
 4   Credit_History           614 non-null    float64
 5   Loan_Status              614 non-null    int64  
 6   Gender_Male              614 non-null    bool   
 7   Married_Yes              614 non-null    bool   
 8   Dependents_1             614 non-null    bool   
 9   Dependents_2             614 non-null    bool   
 10  Dependents_3+            614 non-null    bool   
 11  Education_Not Graduate   614 non-null    bool   
 12  Self_Employed_Yes        614 non-null    bool   
 13  Property_Area_Semiurban  614 non-null    bool   
 1
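Imputation can also be done in a single DataFrame-level `fillna` call with a `{column: value}` mapping, which avoids calling `inplace=True` on a column selection. The sketch below uses a hypothetical four-row frame; `categorical_cols`, `numerical_cols`, and `fill_values` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in one categorical and one numerical column
df = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
})

categorical_cols = ["Gender"]
numerical_cols = ["LoanAmount"]

# Build one {column: fill value} mapping: mode for categories, mean for numbers
fill_values = {col: df[col].mode()[0] for col in categorical_cols}
fill_values.update({col: df[col].mean() for col in numerical_cols})

# A single DataFrame-level fillna returns a new frame; no inplace, no chaining
df_filled = df.fillna(fill_values)
print(df_filled)
```

Keeping the fill values in a dictionary also makes it easy to reuse exactly the same values on any data the model sees later.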


## Step 3: Building and Training the Model
This is the core of the project. We will now split our data, train a machine learning model, and then test its performance.

**1. Define Features (X) and Target (y):** We separate our dataset into two parts: the features (all the columns we use to make a prediction) and the target (the actual outcome we want to predict, Loan_Status).

**2. Split the Data:** We can't test our model on the same data it learned from; that would be like giving a student the answers before an exam. We split our data into a "training set" (for the model to learn from) and a "testing set" (to evaluate its performance on unseen data).

**3. Train the Model:** We will "fit" a LogisticRegression model to the training data. This is the process where the model learns the relationships between the features and the loan status.

**4. Make Predictions:** We use our trained model to make predictions on the unseen test set.

In [4]:
# Import necessary libraries from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# --- Model Building ---

# 1. Define our features (X) and target (y)
# X contains all columns except 'Loan_Status'
X = df_processed.drop('Loan_Status', axis=1)
# y contains only the 'Loan_Status' column
y = df_processed['Loan_Status']

# 2. Split the data into training and testing sets
# We'll use 80% of the data for training and 20% for testing
# random_state ensures we get the same split every time we run the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000) # raise the iteration cap above scikit-learn's default of 100
model.fit(X_train, y_train)

# 4. Make predictions on the test data
y_pred = model.predict(X_test)


# --- Model Evaluation ---

# 5. Check the model's accuracy and other metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

print("\n" + "="*50 + "\n")

print("Confusion Matrix:")
# A confusion matrix shows us True Positives, True Negatives, False Positives, and False Negatives
print(confusion_matrix(y_test, y_pred))

print("\n" + "="*50 + "\n")

print("Classification Report:")
# This report gives us precision, recall, and f1-score, which are key performance indicators
print(classification_report(y_test, y_pred))

Model Accuracy: 0.79


Confusion Matrix:
[[18 25]
 [ 1 79]]


Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.42      0.58        43
           1       0.76      0.99      0.86        80

    accuracy                           0.79       123
   macro avg       0.85      0.70      0.72       123
weighted avg       0.83      0.79      0.76       123



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
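The warning above suggests scaling the data. One common way to do that is to put a `StandardScaler` in front of the classifier inside a scikit-learn pipeline. This is a sketch on synthetic data, not a rerun of the notebook: the dataset from `make_classification`, the inflated first column, and the names `X_demo`, `scaled_model`, and `iterations_used` are all assumptions for illustration, and the notebook's reported metrics were produced without scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
# Inflate one column to mimic an income-sized feature sitting next to small-valued ones
X_demo[:, 0] = X_demo[:, 0] * 5000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training fold only, then applied before the model
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

# n_iter_ reports how many solver iterations were actually needed
iterations_used = scaled_model.named_steps["logisticregression"].n_iter_[0]
print(f"Iterations used: {iterations_used}")
print(f"Test accuracy: {scaled_model.score(X_te, y_te):.2f}")
```

Because the scaler lives inside the pipeline, calling `predict` on new data applies the same scaling automatically.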


## Interpreting The Model's Performance
**1. Overall Accuracy: 79%**
The model correctly predicted the loan status for 79% of the applicants in the test set. This is a good starting point, but accuracy alone can be misleading, especially when one class is more common than the other.
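To see why accuracy needs context, compare it with a baseline that always predicts the more common class. The short calculation below uses the class counts from the `support` column of the classification report; `support` and `baseline_accuracy` are illustrative names.

```python
# Class counts taken from the support column of the classification report
support = {0: 43, 1: 80}
total = sum(support.values())

# A model that always predicts the majority class gets this accuracy for free
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Always predicting class `1` would already score about 65% on this test set, so the model's 79% is an improvement of roughly 14 percentage points over doing nothing.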

**2. The Confusion Matrix**
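This little table is the most honest report card for our model. Rows are the actual classes and columns are the predicted classes, with class `0` (default) listed first.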
```
[[18 25]
 [ 1 79]]
```

- **18 (True Negatives):** The model correctly predicted 18 people would default, and they did.

- **79 (True Positives):** The model correctly predicted 79 people would repay, and they did.

- **1 (False Negative):** The model incorrectly predicted 1 person would default, but they actually repaid. This is a missed business opportunity.

- **25 (False Positives):** The model incorrectly predicted 25 people would repay, but they actually defaulted. This is the most costly mistake for the business.
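The four cells can be unpacked programmatically. In scikit-learn's `confusion_matrix`, rows are actual classes and columns are predicted classes, so with labels ordered `[0, 1]` the flattened matrix reads TN, FP, FN, TP, where class `1` is the "positive" class. The sketch below rebuilds the matrix by hand from the printed output; `cm` is an illustrative name.

```python
import numpy as np

# The matrix printed above: rows are actual classes, columns are predicted classes
cm = np.array([[18, 25],
               [1, 79]])

# With labels ordered [0, 1], ravel() yields TN, FP, FN, TP (class 1 is "positive")
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of all predictions that fall on the diagonal
accuracy = (tn + tp) / cm.sum()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy: {accuracy:.2f}")
```

The 25 false positives are applicants whose actual class was `0` but who were predicted as `1`.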

**3. The Classification Report (The Deep Dive)**
This is where we see the real story. Let's focus on predicting defaults (class `0`).

- **Precision (0.95 for class 0):** When the model predicts someone will default, it is correct 95% of the time. This is very good. It means the model is very reliable when it raises a red flag.

- **Recall (0.42 for class 0):** This is our model's weakness. It only successfully identified 42% of all the people who actually defaulted. It missed the other 58%.
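Both numbers can be reproduced by hand from the confusion matrix, which makes it clearer what each one measures. The variable names below are illustrative.

```python
# Counts for class 0 (default), read off the confusion matrix
correct_flags = 18    # actual 0, predicted 0
false_alarms = 1      # actual 1, predicted 0
missed_defaults = 25  # actual 0, predicted 1

# Precision: of everyone flagged as a default, how many really defaulted?
precision_0 = correct_flags / (correct_flags + false_alarms)

# Recall: of everyone who really defaulted, how many were flagged?
recall_0 = correct_flags / (correct_flags + missed_defaults)

# F1 is the harmonic mean of precision and recall
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"precision={precision_0:.2f}, recall={recall_0:.2f}, f1={f1_0:.2f}")
```

The results, 18/19 and 18/43, round to the 0.95 and 0.42 shown in the classification report, and the F1 value rounds to the reported 0.58.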

**The Business Story:** This model is rarely wrong when it predicts a default, but it flags too few of them: 25 of the 43 actual defaulters in the test set were predicted to repay. From a business perspective, I may want to improve the recall, even if it means lowering precision slightly, to catch more potential losses.
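One possible way to trade precision for recall is to raise the probability threshold required to predict class `1`, so that borderline applicants are flagged as class `0` instead. The sketch below uses synthetic data and is not the notebook's model; the dataset, the 0.7 threshold, and the names `proba_repay` and `recalls` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.35, 0.65], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of class 1 for each test row
proba_repay = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.7):
    # Predict class 1 only when the model is at least this confident
    y_hat = (proba_repay >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat, pos_label=0)
    print(f"threshold={threshold}: recall for class 0 = {recalls[threshold]:.2f}")
```

Raising the threshold can only move predictions from class `1` to class `0`, so recall for class `0` never decreases; the cost is usually more false alarms, which lowers precision for class `0`. The right threshold depends on how the business weighs a missed default against a rejected good applicant.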

## Final Step: Predicting on New Data

In [7]:
# --- Predicting on the Test File ---

# 1. Load the test dataset
test_file_path = r'C:\Users\user 1\Documents\Apollos\credit-scoring-model\test_Y3wMUE5_7gLdATN.csv'
df_test = pd.read_csv(test_file_path)

# Keep the Loan_ID for our final submission file
test_loan_ids = df_test['Loan_ID']

# 2. Preprocess the test data
# Drop Loan_ID first
df_test_cleaned = df_test.drop('Loan_ID', axis=1)

# Fill missing categorical values with the mode
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed']:
    # Check if column exists before filling
    if col in df_test_cleaned.columns:
        # Assign the result back instead of using inplace=True on a column selection
        df_test_cleaned[col] = df_test_cleaned[col].fillna(df_test_cleaned[col].mode()[0])

# Fill missing numerical values with the mean
for col in ['LoanAmount', 'Loan_Amount_Term', 'Credit_History']:
    if col in df_test_cleaned.columns:
        df_test_cleaned[col] = df_test_cleaned[col].fillna(df_test_cleaned[col].mean())

# Convert categorical columns to numbers
# Encode every category here (no drop_first); the reindex in step 3 keeps only the columns the model was trained on
df_test_processed = pd.get_dummies(df_test_cleaned)

# 3. Align columns with the training data
df_test_processed = df_test_processed.reindex(columns=X.columns, fill_value=0)

# 4. Make predictions
final_predictions = model.predict(df_test_processed)

# 5. Create and display the final submission DataFrame
submission_df = pd.DataFrame({
    'Loan_ID': test_loan_ids,
    'Predicted_Loan_Status': final_predictions
})
submission_df['Predicted_Loan_Status'] = submission_df['Predicted_Loan_Status'].map({1: 'Y', 0: 'N'})

print("Predictions for the test file:")
print(submission_df.head(10))

Predictions for the test file:
    Loan_ID Predicted_Loan_Status
0  LP001015                     Y
1  LP001022                     Y
2  LP001031                     Y
3  LP001035                     Y
4  LP001051                     Y
5  LP001054                     Y
6  LP001055                     Y
7  LP001056                     N
8  LP001059                     Y
9  LP001067                     Y
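A variation worth considering is to learn the imputation values from the training data once and reuse them on new data, rather than recomputing them from the test file as the cell above does. The sketch below also encodes new data without `drop_first`, because a category missing from the new file would otherwise change which level gets dropped. The helper names `learn_fill_values`, `encode_train`, and `encode_new`, the reduced column lists, and the sample frames are all hypothetical.

```python
import numpy as np
import pandas as pd

CATEGORICAL = ["Gender", "Married"]  # reduced, illustrative column lists
NUMERICAL = ["LoanAmount"]

def learn_fill_values(train_df):
    """Compute imputation values from the training data only."""
    values = {col: train_df[col].mode()[0] for col in CATEGORICAL}
    values.update({col: train_df[col].mean() for col in NUMERICAL})
    return values

def encode_train(train_df, fill_values):
    """Impute and one-hot encode the training data, dropping one level per category."""
    return pd.get_dummies(train_df.fillna(fill_values), drop_first=True)

def encode_new(new_df, fill_values, train_columns):
    """Impute and encode new data, then align it to the training columns."""
    encoded = pd.get_dummies(new_df.fillna(fill_values))
    # Columns unseen in training are discarded; columns missing here are filled with 0
    return encoded.reindex(columns=train_columns, fill_value=0)

train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "LoanAmount": [128.0, 66.0, np.nan, 120.0],
})
new = pd.DataFrame({
    "Gender": ["Male", None],
    "Married": ["Yes", "Yes"],
    "LoanAmount": [np.nan, 141.0],
})

fill_values = learn_fill_values(train)
X_train_demo = encode_train(train, fill_values)
X_new_demo = encode_new(new, fill_values, X_train_demo.columns)
print(X_new_demo)
```

In this sample the new data contains only `Male` and `Yes`; encoding it with `drop_first=True` would drop those columns entirely, and the reindex would then fill them with 0, silently turning every row into `Female` and `No`.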




## Project Summary
This project addresses a critical business problem in the fintech sector: credit risk assessment. The goal was to build a machine learning model to predict loan default probability for a microfinance institution. Using a loan prediction dataset from Kaggle, the project covered a complete data science workflow: data cleaning, imputation of missing values, and one-hot encoding of categorical features to prepare the data for modeling. A **Logistic Regression** model was trained and evaluated, achieving an overall **accuracy of 79%** on a held-out test set of 123 applicants. More importantly, the model demonstrated a **precision of 95%** in predicting defaults, indicating high reliability when it flags a high-risk applicant. Its recall of 42% means it caught only 18 of the 43 actual defaulters, which leaves clear room for future improvement in identifying potential defaulters.