<a href="https://colab.research.google.com/github/2021-uam-4646/codealpha_task/blob/main/CREDIT_SCORING_MODEL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Data Loading

This cell loads the data. The `load_data_safely` function keeps only the required columns, as listed in `keep_cols`, and returns a random sample of the file's rows so that later steps use less memory.

In [None]:
import pandas as pd

## 1. Smart Data Loading ##
def load_data_safely(file_path, sample_frac=0.3):
    # Read only the header row to get the column names without loading the file
    cols = pd.read_csv(file_path, nrows=0).columns.tolist()

    # Columns used downstream (modify as needed); keep only those present in the file
    wanted_cols = ['disbursed_amount', 'asset_cost', 'ltv', 'manufacturer_id', 'Employment.Type', 'loan_default', 'Date.of.Birth']
    keep_cols = [c for c in wanted_cols if c in cols]

    # Read the selected columns, then draw a reproducible random sample of the rows
    try:
        data = pd.read_csv(file_path, usecols=keep_cols).sample(frac=sample_frac, random_state=42)
        print(f"Loaded {len(data)} samples safely")
        return data
    except Exception as e:
        print(f"Error: {e}")
        # Fallback: read only the first 50,000 rows (not a random sample)
        return pd.read_csv(file_path, usecols=keep_cols, nrows=50000)

# Step 1: Load data safely
data = load_data_safely('train.csv', sample_frac=0.4)  # Adjust fraction as needed

Loaded 93262 samples safely
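
Note that `load_data_safely` reads every row of the selected columns before it samples, so peak memory is set by the full file rather than by the sample. A minimal sketch of one alternative, assuming chunked reading is acceptable, samples each chunk as it is read. The helper name `load_sample_in_chunks`, the `chunksize` value, and the tiny in-memory CSV are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd


def load_sample_in_chunks(source, usecols, sample_frac=0.3, chunksize=50_000, seed=42):
    """Sample each chunk as it is read so the full file is never held in memory."""
    sampled = []
    reader = pd.read_csv(source, usecols=usecols, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Vary the seed per chunk so every chunk is not sampled at the same positions
        sampled.append(chunk.sample(frac=sample_frac, random_state=seed + i))
    return pd.concat(sampled, ignore_index=True)


# Tiny in-memory CSV standing in for train.csv (made-up values)
demo_csv = io.StringIO(
    "disbursed_amount,asset_cost,loan_default\n"
    + "\n".join(f"{50000 + i},{60000 + i},{i % 2}" for i in range(100))
)

demo = load_sample_in_chunks(
    demo_csv, ["disbursed_amount", "loan_default"], sample_frac=0.5, chunksize=20
)
print(demo.shape)  # 5 chunks of 20 rows, half of each kept: (50, 2)
```

Sampling per chunk gives approximately, not exactly, the same rows as sampling the whole file at once, so row counts and results may differ slightly from the cell above.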


## 2. Data Preprocessing

This cell prepares the data for model training. It drops rows with missing values and converts categorical features to numeric codes. The `Date.of.Birth` column is also dropped because it was causing problems in numerical operations.

In [None]:
import numpy as np

## 2. Memory-Friendly Preprocessing ##
def simple_preprocess(df):
    # Handle missing values
    df = df.dropna().copy() # Work on a copy to avoid SettingWithCopyWarning

    # Encode the categorical column as integer codes (simple approach).
    # Direct assignment replaces the column, so it takes the integer dtype of the codes.
    if 'Employment.Type' in df.columns:
        df['Employment.Type'] = df['Employment.Type'].astype('category').cat.codes

    # Basic feature engineering; the small constant guards against division by zero
    df['loan_to_asset_ratio'] = df['disbursed_amount'] / (df['asset_cost'] + 1e-6)

    # Drop 'Date.of.Birth' as it was causing errors in numerical calculations
    if 'Date.of.Birth' in df.columns:
        df = df.drop(columns='Date.of.Birth')

    return df

# Step 2: Preprocess
data = simple_preprocess(data)
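
To make the categorical conversion concrete, here is a small sketch of what `astype('category').cat.codes` returns. The employment labels are made-up sample values, not taken from `train.csv`.

```python
import pandas as pd

# Made-up labels, including one missing value
employment = pd.Series(["Salaried", "Self employed", None, "Salaried"])

as_category = employment.astype("category")
print(list(as_category.cat.categories))  # categories are sorted: ['Salaried', 'Self employed']

codes = as_category.cat.codes
print(codes.tolist())  # [0, 1, -1, 0]; a missing value is encoded as -1
```

Because `simple_preprocess` calls `dropna()` first, the `-1` code should not appear in the notebook's data. The codes also impose an arbitrary numeric order on the categories, which tree-based models usually tolerate better than linear ones.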

## 3. Data Preparation and Split

This cell separates the features (inputs) from the target variable (output), `loan_default`, and then splits the data into training and testing sets. The split holds out 20% of the rows for testing and is stratified on the target.

In [None]:
from sklearn.model_selection import train_test_split

# Step 3: Prepare data
X = data.drop('loan_default', axis=1)
y = data['loan_default']

# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
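
The effect of `stratify=y` can be checked on a toy example. The sketch below uses made-up labels (80 negatives and 20 positives) rather than the notebook's data, and shows that both sides of the split keep the same positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positives (illustrative values only)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits preserve the 20% positive rate
print(y_tr.mean(), y_te.mean())
```

Without stratification, the share of defaults in the test set would vary from one random split to another, which makes metrics on the minority class noisier.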

## 4. Model Training

This cell trains the machine learning model. It fits a `GradientBoostingClassifier` to the training data so that the model learns patterns it can use for future predictions.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

## 4. Model Training ##
def train_model(X, y):
    # Lightweight model: 100 shallow trees
    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42,
        subsample=0.8  # Fit each tree on a random 80% of the rows
    )

    model.fit(X, y)
    return model

# Step 5: Train model
model = train_model(X_train, y_train)
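
The target in this notebook is imbalanced (the evaluation report lists 14131 non-defaults against 3910 defaults in the test set). One option to experiment with, which the notebook itself does not use, is passing balanced per-row weights to `fit`. The sketch below is a hypothetical variant named `train_model_weighted`, run on random toy data, and is not a claim that it improves the results here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight


def train_model_weighted(X, y):
    # Hypothetical variant of train_model: same hyperparameters, plus per-row weights
    weights = compute_sample_weight(class_weight="balanced", y=y)
    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        random_state=42,
        subsample=0.8,
    )
    model.fit(X, y, sample_weight=weights)
    return model, weights


# Random toy data with an 80/20 class split (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 160 + [1] * 40)

demo_model, demo_weights = train_model_weighted(X_demo, y_demo)

# Majority rows get weight 0.625 and minority rows 2.5, so each class totals 100
print(demo_weights[0], demo_weights[-1])
```

Weighting typically trades precision on the majority class for recall on the minority class, so it should be judged on the held-out metrics rather than assumed to help.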

## 5. Model Evaluation

This cell evaluates the trained model to check its performance. It generates predictions on the test set and computes metrics such as `classification_report` and `roc_auc_score`.

## Cleanup

The same cell then removes the working variables from memory to free up resources.

In [None]:
from sklearn.metrics import roc_auc_score, classification_report

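## 5. Evaluate Model ##
def evaluate_model(model, X_test, y_test):
    # Hard class labels (default 0.5 cutoff) and probabilities for the positive class
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    print("\nModel Performance:")
    # zero_division=0 reports 0.00 without a warning when a class is never predicted
    print(classification_report(y_test, y_pred, zero_division=0))
    print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")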

# Step 6: Evaluate
evaluate_model(model, X_test, y_test)

# Clean up
del data, X, y, X_train, X_test, y_train, y_test, model


Model Performance:
              precision    recall  f1-score   support

           0       0.78      1.00      0.88     14131
           1       0.00      0.00      0.00      3910

    accuracy                           0.78     18041
   macro avg       0.39      0.50      0.44     18041
weighted avg       0.61      0.78      0.69     18041

ROC AUC: 0.6071
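
In the report above, class 1 has precision and recall of 0.00, so at the default 0.5 cutoff used by `predict` the model flags essentially no defaults, and the 0.78 accuracy matches the share of class 0 in the test set. The ROC AUC of 0.6071 is computed from the predicted probabilities instead, so it can still reflect some ranking signal. One way to explore this is to apply lower cutoffs to the output of `predict_proba`. The sketch below does this on synthetic data from `make_classification`. The dataset, the smaller model, and the thresholds 0.5, 0.3, and 0.2 are illustrative assumptions, and its numbers say nothing about the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (roughly 80% class 0)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, weights=[0.8, 0.2], flip_y=0.1, random_state=42
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
clf.fit(X_a, y_a)
proba = clf.predict_proba(X_b)[:, 1]  # probability of the positive class

recalls = {}
for threshold in (0.5, 0.3, 0.2):
    # Flag a row as positive when its probability reaches the cutoff
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_b, pred)
    precision = precision_score(y_b, pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recalls[threshold]:.2f}")
```

Lowering the cutoff can only keep or raise recall on class 1, usually at the cost of precision, so the cutoff is best chosen on a validation set according to the relative cost of missed defaults and false alarms.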
