<a href="https://colab.research.google.com/github/JacobMuli/ai-credit-scoring/blob/main/credit_scoring_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌾 AI Credit Scoring System - Google Colab Notebook

This notebook allows you to:
1. Train the credit scoring model in Google Colab
2. Test the model interactively
3. Save the trained model to your GitHub repository
4. Prepare the repository for deployment on Streamlit Cloud

---

## 📋 Setup Instructions

### Before Running:
1. **Fork the repository** on GitHub
2. **Mount Google Drive** (optional, for saving models)
3. Run cells in order

### Deployment Flow:
```
Google Colab → Train Model → Save to GitHub → Deploy on Streamlit Cloud
```

## 1️⃣ Install Dependencies

In [1]:
!pip install -q scikit-learn pandas numpy streamlit

## 2️⃣ Import Libraries

In [2]:
import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


## 3️⃣ Generate Synthetic Training Data

In [3]:
def generate_farmers(n=8000, seed=42):
    """
    Generate synthetic farmer data for training
    """
    rng = np.random.RandomState(seed)

    # Generate features
    gender = rng.choice(['Male', 'Female'], size=n, p=[0.6, 0.4])
    age = rng.randint(18, 70, size=n)
    farm_size = np.round(np.exp(rng.normal(np.log(2.0), 0.8, size=n)), 2)
    crop = rng.choice(['Maize', 'Beans', 'Tea', 'Coffee', 'Horticulture'], size=n, p=[0.4,0.25,0.15,0.1,0.1])
    cooperative = rng.binomial(1, p=0.4, size=n)
    yield_hist = np.maximum(0.1, rng.normal(2.0, 0.8, size=n))
    mobile_txns = rng.poisson(25, size=n)
    mobile_balance = np.maximum(0, rng.normal(1200, 600, size=n))
    ndvi = np.clip(rng.normal(0.45 + 0.1*(yield_hist/3.0), 0.08), 0.05, 0.9)
    drought_exposure = rng.binomial(1, p=0.3, size=n)

    # Generate target variable
    logits = (
        -1.0 * np.log(farm_size + 0.1)
        - 0.8 * cooperative
        - 1.2 * (yield_hist / np.maximum(farm_size, 0.1))
        + 1.5 * drought_exposure
        - 0.001 * mobile_balance
        + 0.01 * (60 - age)
    )
    prob = 1 / (1 + np.exp(-logits))
    default = (rng.rand(n) < prob).astype(int)

    df = pd.DataFrame({
        'gender': gender,
        'age': age,
        'farm_size': farm_size,
        'crop': crop,
        'cooperative': cooperative,
        'yield_hist': yield_hist,
        'mobile_txns': mobile_txns,
        'mobile_balance': mobile_balance,
        'ndvi': ndvi,
        'drought_exposure': drought_exposure,
        'default': default
    })

    return df

# Generate data
print("Generating synthetic farmer data...")
data = generate_farmers()
print(f"✅ Generated {len(data)} farmer records")
print(f"   Default rate: {data['default'].mean():.2%}")

# Display sample
data.head()

Generating synthetic farmer data...
✅ Generated 8000 farmer records
   Default rate: 7.10%


| | gender | age | farm_size | crop | cooperative | yield_hist | mobile_txns | mobile_balance | ndvi | drought_exposure | default |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 47 | 4.10 | Maize | 1 | 1.583878 | 25 | 952.188913 | 0.403236 | 0 | 0 |
| 1 | Female | 35 | 0.22 | Maize | 0 | 2.551827 | 25 | 1249.376902 | 0.470254 | 0 | 0 |
| 2 | Female | 59 | 5.17 | Horticulture | 0 | 1.995849 | 29 | 1178.651197 | 0.574373 | 0 | 0 |
| 3 | Male | 28 | 3.03 | Maize | 1 | 1.465841 | 35 | 1626.145523 | 0.561973 | 1 | 0 |
| 4 | Male | 46 | 5.73 | Coffee | 1 | 2.123196 | 23 | 1706.721459 | 0.562336 | 0 | 0 |
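
To make the target-generation step concrete, the sketch below recomputes the default probability for a single farmer using the same coefficients as `generate_farmers`. The function name `default_probability` and the example inputs are illustrative assumptions, not part of the notebook.

```python
import math

def default_probability(farm_size, cooperative, yield_hist,
                        drought_exposure, mobile_balance, age):
    """Logistic default probability with the coefficients from generate_farmers."""
    logit = (
        -1.0 * math.log(farm_size + 0.1)
        - 0.8 * cooperative
        - 1.2 * (yield_hist / max(farm_size, 0.1))
        + 1.5 * drought_exposure
        - 0.001 * mobile_balance
        + 0.01 * (60 - age)
    )
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical farmer: farm_size 2.0, no cooperative, yield 2.0, balance 1200, age 40
p_no_drought = default_probability(2.0, 0, 2.0, 0, 1200, 40)
p_drought = default_probability(2.0, 0, 2.0, 1, 1200, 40)
print(f"no drought: {p_no_drought:.3f}, drought: {p_drought:.3f}")
```

Because `drought_exposure` carries the only large positive coefficient, flipping it from 0 to 1 raises the simulated default probability several times over, while a higher `mobile_balance` or cooperative membership pulls it down.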


## 4️⃣ Data Exploration

In [4]:
# Summary statistics
print("📊 Dataset Summary:")
print(data.describe())

print("\n📈 Column Types and Non-Null Counts:")
data.info()  # info() prints its report itself and returns None, so it is not wrapped in print()

print("\n🎯 Target Distribution:")
print(data['default'].value_counts())

📊 Dataset Summary:
               age    farm_size  cooperative   yield_hist  mobile_txns  \
count  8000.000000  8000.000000  8000.000000  8000.000000  8000.000000   
mean     43.409750     2.790636     0.397750     1.986701    25.071375   
std      15.036335     2.688171     0.489464     0.795924     5.017188   
min      18.000000     0.110000     0.000000     0.100000     8.000000   
25%      31.000000     1.180000     0.000000     1.427945    22.000000   
50%      43.000000     2.030000     0.000000     1.984601    25.000000   
75%      56.000000     3.490000     1.000000     2.533872    28.000000   
max      69.000000    71.980000     1.000000     4.982267    48.000000   

       mobile_balance         ndvi  drought_exposure      default  
count     8000.000000  8000.000000       8000.000000  8000.000000  
mean      1210.852725     0.516636          0.297500     0.071000  
std        594.317622     0.083921          0.457187     0.256841  
min          0.000000     0.233416        
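
With a default rate near 7%, the overall summary hides where defaults concentrate. A minimal sketch of a per-segment breakdown, assuming a DataFrame shaped like `data`; the small `toy` frame (made-up values) and the helper `default_rate_by` are hypothetical stand-ins so the example runs on its own:

```python
import pandas as pd

def default_rate_by(df, column, target="default"):
    """Return the row count and mean default rate for each value of `column`."""
    return (
        df.groupby(column)[target]
        .agg(n="count", default_rate="mean")
        .sort_values("default_rate", ascending=False)
    )

# Toy stand-in for `data` (made-up values, for illustration only)
toy = pd.DataFrame({
    "drought_exposure": [0, 0, 0, 0, 1, 1, 1, 1],
    "cooperative":      [1, 1, 0, 0, 1, 0, 0, 0],
    "default":          [0, 0, 0, 1, 0, 1, 1, 0],
})

rates = default_rate_by(toy, "drought_exposure")
print(rates)
```

In the notebook, the same helper could be called as `default_rate_by(data, 'drought_exposure')` or with `'cooperative'` or `'crop'`.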

## 5️⃣ Train Model

In [5]:
# Prepare features and target
X = data.drop(columns=['default'])
y = data['default']

# Define feature types
num_features = ['age','farm_size','yield_hist','mobile_txns','mobile_balance','ndvi']
cat_features = ['gender','crop','cooperative','drought_exposure']

# Build preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), num_features),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_features)
])

# Build full pipeline
model = Pipeline([
    ('pre', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

# Train model
print("\n🚀 Training model...")
model.fit(X_train, y_train)
print("✅ Model training complete!")

Training set: 6400 samples
Test set: 1600 samples

🚀 Training model...
✅ Model training complete!


## 6️⃣ Evaluate Model

In [6]:
# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Classification report
print("📊 Classification Report:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))

# ROC AUC Score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"\n🎯 ROC AUC Score: {roc_auc:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(f"\n📈 Confusion Matrix:")
print(f"True Negatives: {cm[0,0]:,}, False Positives: {cm[0,1]:,}")
print(f"False Negatives: {cm[1,0]:,}, True Positives: {cm[1,1]:,}")

# Feature importance
feature_names = (
    num_features +
    list(model.named_steps['pre'].named_transformers_['cat'].get_feature_names_out())
)
importances = model.named_steps['clf'].feature_importances_

feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

print("\n🔍 Top 10 Important Features:")
print(feature_importance_df.head(10))

📊 Classification Report:
              precision    recall  f1-score   support

  No Default       0.93      1.00      0.96      1486
     Default       0.14      0.01      0.02       114

    accuracy                           0.93      1600
   macro avg       0.54      0.50      0.49      1600
weighted avg       0.87      0.93      0.89      1600


🎯 ROC AUC Score: 0.7777

📈 Confusion Matrix:
True Negatives: 1,480, False Positives: 6
False Negatives: 113, True Positives: 1

🔍 Top 10 Important Features:
               feature  importance
4       mobile_balance    0.176756
2           yield_hist    0.168057
5                 ndvi    0.148973
1            farm_size    0.134442
0                  age    0.114078
3          mobile_txns    0.097529
16  drought_exposure_1    0.023672
15  drought_exposure_0    0.021622
11          crop_Maize    0.016263
6        gender_Female    0.014675
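
The report above shows the classifier catching only 1 of 114 defaults at the standard 0.5 cut-off, even though a ROC AUC of about 0.78 suggests the predicted probabilities rank borrowers reasonably well. Two common responses are to pass `class_weight='balanced'` to the forest and to lower the decision threshold applied to `predict_proba`. The sketch below illustrates both on a synthetic imbalanced dataset from `make_classification`; the dataset, the thresholds, and the helper `recall_at` are assumptions for illustration, not tuned values for this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 7% positives, similar to the default rate here
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.93, 0.07],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

# class_weight="balanced" reweights samples inversely to class frequency
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

def recall_at(threshold):
    """Recall for the positive class when flagging proba >= threshold."""
    return recall_score(y_te, (proba >= threshold).astype(int))

for t in (0.5, 0.3, 0.2):
    flagged = int((proba >= t).sum())
    print(f"threshold={t:.1f}  recall={recall_at(t):.2f}  flagged={flagged}")
```

Lowering the threshold can only flag more borrowers, so recall never decreases, but precision usually falls; the right trade-off depends on the relative cost of a missed default versus a rejected good borrower.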


## 7️⃣ Test Model with Sample Data

In [7]:
# Create sample farmer for testing
sample_farmer = pd.DataFrame({
    'gender': ['Male'],
    'age': [35],
    'farm_size': [3.5],
    'crop': ['Maize'],
    'cooperative': [1],
    'yield_hist': [2.5],
    'mobile_txns': [30],
    'mobile_balance': [1500],
    'ndvi': [0.55],
    'drought_exposure': [0]
})

# Make prediction
prob_default = model.predict_proba(sample_farmer)[0, 1]
credit_score = (1 - prob_default) * 1000
eligible = credit_score >= 400

print("🧪 Sample Farmer Prediction:")
print(f"   Credit Score: {credit_score:.0f}")
print(f"   Default Probability: {prob_default:.2%}")
print(f"   Eligible: {'✅ Yes' if eligible else '❌ No'}")

if eligible:
    loan_amount = min(sample_farmer['farm_size'].values[0] * 300, 50000)
    interest_rate = 0.12 + prob_default * 0.5
    print(f"   Loan Amount: KES {loan_amount:,.0f}")
    print(f"   Interest Rate: {interest_rate*100:.2f}%")

🧪 Sample Farmer Prediction:
   Credit Score: 1000
   Default Probability: 0.00%
   Eligible: ✅ Yes
   Loan Amount: KES 1,050
   Interest Rate: 12.00%
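
The scoring rules in this cell can be collected into one function, which makes them easier to test and to reuse from an app. A minimal sketch that mirrors the cell's arithmetic; the function name `loan_terms`, its parameter names, and its return shape are assumptions:

```python
def loan_terms(prob_default, farm_size, min_score=400,
               amount_per_unit=300, max_amount=50_000,
               base_rate=0.12, risk_premium=0.5):
    """Map a default probability to a credit score, eligibility, and loan terms."""
    credit_score = (1 - prob_default) * 1000
    eligible = credit_score >= min_score
    terms = {"credit_score": round(credit_score), "eligible": eligible}
    if eligible:
        # Loan size scales with farm size, capped at max_amount
        terms["loan_amount"] = min(farm_size * amount_per_unit, max_amount)
        # Interest rate rises linearly with default probability
        terms["interest_rate"] = base_rate + prob_default * risk_premium
    return terms

print(loan_terms(prob_default=0.0, farm_size=3.5))
print(loan_terms(prob_default=0.7, farm_size=3.5))
```

The first call reproduces the sample farmer above (score 1000, KES 1,050 at 12%); the second falls below the 400-point cut-off, so no loan terms are returned.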


## 8️⃣ Save Model Locally

In [9]:
import os

# Save model to file
model_filename = 'credit_model.pkl'

with open(model_filename, 'wb') as f:
    pickle.dump(model, f)

print(f"✅ Model saved to: {model_filename}")
print(f"   File size: {os.path.getsize(model_filename) / 1024:.2f} KB")

✅ Model saved to: credit_model.pkl
   File size: 15978.31 KB
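
Before the file is copied anywhere, it helps to confirm that the pickle reloads and reproduces the same predictions. Pickled scikit-learn models are generally only reliable when loaded with the same scikit-learn version that produced them, so the serving environment should pin that version. A minimal sketch using a small stand-in pipeline and a temporary directory (both assumptions, so the example runs on its own):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in model; in the notebook this would be `model`
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", RandomForestClassifier(n_estimators=20, random_state=0))])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "credit_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    size_kb = os.path.getsize(path) / 1024
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should give identical probabilities
same = np.allclose(pipe.predict_proba(X), restored.predict_proba(X))
print(f"size: {size_kb:.1f} KB, predictions match: {same}")
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.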


## 9️⃣ Mount Google Drive (Optional)

In [11]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Save model to Google Drive
drive_path = '/content/drive/MyDrive/credit_model.pkl'
!cp credit_model.pkl "$drive_path"

print(f"✅ Model saved to Google Drive: {drive_path}")

Mounted at /content/drive
✅ Model saved to Google Drive: /content/drive/MyDrive/credit_model.pkl


## 🔟 Clone GitHub Repository and Push Model

In [17]:
# Configure Git (replace the email and name with your own details)
!git config --global user.email "jacobmwalughs@gmail.com"
!git config --global user.name "JacobMuli"

# Clone your repository
# If you forked it, replace 'JacobMuli' in the URL with your GitHub username
!git clone https://github.com/JacobMuli/ai-credit-scoring.git

# Copy model to repository
!cp credit_model.pkl ai-credit-scoring/

# Navigate to repository
%cd ai-credit-scoring

# Add and commit
!git add credit_model.pkl
!git commit -m "Update model from Google Colab"

print("\n📝 Model ready to push to GitHub!")
print("\n⚠️  To push, you need to authenticate:")
print("   1. Generate a GitHub Personal Access Token")
print("   2. Run: !git push origin main")
print("   3. Enter your username and token when prompted")

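# Read the token at run time; never hard-code a Personal Access Token in a notebook
from getpass import getpass
token = getpass("GitHub Personal Access Token: ")
!git remote set-url origin https://JacobMuli:{token}@github.com/JacobMuli/ai-credit-scoring.git
!git push origin main
# Remove the token from the stored remote URL once the push is done
!git remote set-url origin https://github.com/JacobMuli/ai-credit-scoring.git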

Cloning into 'ai-credit-scoring'...
remote: Enumerating objects: 12, done.[K
remote: Counting objects: 100% (12/12), done.[K
remote: Compressing objects: 100% (9/9), done.[K
remote: Total 12 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)[K
Receiving objects: 100% (12/12), 5.41 KiB | 5.41 MiB/s, done.
/content/ai-credit-scoring/ai-credit-scoring/ai-credit-scoring/ai-credit-scoring
[main 1bb877e] Update model from Google Colab
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 credit_model.pkl

📝 Model ready to push to GitHub!

⚠️  To push, you need to authenticate:
   1. Generate a GitHub Personal Access Token
   2. Run: !git push origin main
   3. Enter your username and token when prompted
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 2.97 MiB | 2.18 MiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
To https://gith

## 1️⃣1️⃣ Download Model (Alternative Method)

In [16]:
from google.colab import files

# Download model to your local machine
files.download('credit_model.pkl')

print("✅ Model downloaded to your computer!")
print("\n📤 Next steps:")
print("   1. Upload credit_model.pkl to your GitHub repository")
print("   2. Push changes to GitHub")
print("   3. Deploy on Streamlit Cloud")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

✅ Model downloaded to your computer!

📤 Next steps:
   1. Upload credit_model.pkl to your GitHub repository
   2. Push changes to GitHub
   3. Deploy on Streamlit Cloud


## 📚 Deployment Instructions

### Option 1: Direct GitHub Push (Recommended)
1. Run cells 1-10
2. Authenticate with a GitHub Personal Access Token
3. Push model: `!git push origin main`
4. Deploy on [Streamlit Cloud](https://share.streamlit.io)

### Option 2: Manual Upload
1. Run cells 1-8 and 11
2. Download `credit_model.pkl`
3. Upload to your GitHub repository manually
4. Deploy on [Streamlit Cloud](https://share.streamlit.io)

### Streamlit Cloud Deployment:
1. Go to [share.streamlit.io](https://share.streamlit.io)
2. Click "New app"
3. Select your repository: `yourusername/ai-credit-scoring`
4. Select branch: `main`
5. Select main file: `app.py`
6. Click "Deploy"

### 🎉 Your app will be live at:
`https://yourusername-ai-credit-scoring.streamlit.app`

---

## 🔒 GitHub Personal Access Token

To push to GitHub from Colab:
1. Go to GitHub → Settings → Developer Settings → Personal Access Tokens → Tokens (classic)
2. Click "Generate new token (classic)"
3. Select scopes: `repo` (all)
4. Generate and copy token
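5. Use the token as the password when pushing. Enter it at run time (for example with `getpass`) rather than pasting it into a cell, and revoke it on GitHub if it ever ends up in a saved notebook or commit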

---

## ✅ Checklist

- [ ] Model trained successfully
- [ ] Model evaluated (ROC AUC > 0.70)
- [ ] Model saved locally
- [ ] Model pushed to GitHub OR downloaded
- [ ] Repository deployed on Streamlit Cloud
- [ ] App URL tested and working

---

**Need help?** Open an issue on [GitHub](https://github.com/yourusername/ai-credit-scoring/issues)
