# Microloan Transaction Data Analysis

> **Purpose:** Complete Question 2: simulate / load a large microloan transaction dataset (500 features, millions of rows), apply feature selection (top-10 by correlation with `default`), apply PCA to compress to a few components, compare model performance and speed, and produce a reflection report. This notebook is organized into *runnable cells* with explanatory text. There are placeholder cells where you should paste or describe your observed results after you run each step in Colab.


#  **Group Members**

| Registration Number | Name             |
|---------------------|------------------|
| ST62/80168/2024     | GABRIEL NDUNDA   |
| ST62/80313/2024     | DONSY OMBURA     |
| ST62/80195/2024     | LEONARD KITI     |
| ST62/80774/2024     | JOSEPHAT MOTONU  |
| ST62/80472/2024     | TABITHA KIARIE   |



## 1. Notebook overview

This Colab notebook contains the following sections (each as a separate cell):

1. Environment setup (pip installs)
2. Imports and utility functions
3. Data generation (scalable, chunked) — or load your own CSV from Drive
4. Quick sanity checks (head, distributions, default rate)
5. Feature selection (top-10 by absolute Pearson correlation)
6. Dimensionality reduction (PCA / IncrementalPCA)
7. Modeling comparison (Logistic Regression on: full numeric features, top-10, PCA) with timing and ROC AUC
8. Export reduced dataset to CSV (for Google Drive)
9. Reflection/report template (placeholders for you to fill with experiment results)


## 2. Environment setup

In [None]:
# Run this cell first in Colab to install required packages
!pip install --quiet pandas numpy scikit-learn scipy matplotlib seaborn tqdm

## 3. Imports and helper utilities

In [None]:
import os
import gc
import time
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
from scipy.special import expit
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# utility: print memory info (optional)
try:
    import psutil
    def show_mem():
        p = psutil.Process()
        print(f"RSS: {p.memory_info().rss / 1024**2:.1f} MB")
except Exception:
    def show_mem():
        pass

print("imports ready")

imports ready


## 4. Data generation (scalable, chunked)

> This cell creates a gzipped CSV in chunks so you can generate millions of rows without exhausting RAM. Adjust `TOTAL_ROWS` and `CHUNK_SIZE` depending on your Colab runtime memory.

In [None]:
# PARAMETERS — change these as needed
TOTAL_ROWS = 200_000   # change to 1_000_000 or more if you want (Colab free RAM limits apply)
CHUNK_SIZE = 50_000
N_FEATURES = 500
OUTFILE = "/content/microloans_generated.csv.gz"
SEED = 42

# generator function (same as in the proper generator script)
from numpy.random import default_rng

def generate_chunk_df(start_id, n_rows, n_features=N_FEATURES, seed=SEED):
    rng = default_rng(seed + start_id)
    X = rng.normal(0, 1, size=(n_rows, n_features)).astype("float32")
    causal_idx = np.arange(12)
    # NOTE: rng is seeded with seed + start_id, so these weights are redrawn for every chunk
    weights = rng.normal(0.8, 0.5, size=len(causal_idx))
    linear = X[:, causal_idx] @ weights
    month = rng.integers(1, 13, size=n_rows)
    seasonal = np.sin(month / 12 * 2 * np.pi)
    score = linear + seasonal + rng.normal(0, 1, size=n_rows)
    prob_default = expit((score - score.mean()) / (score.std() + 1e-9)) * 0.5
    default = (rng.random(n_rows) < prob_default).astype(int)
    cols = [f"feat_{i:03d}" for i in range(n_features)]
    df = pd.DataFrame(X, columns=cols)
    df["month"] = month
    df["client_id"] = np.arange(start_id, start_id + n_rows)
    df["default"] = default
    return df

# Write in chunks
if os.path.exists(OUTFILE):
    os.remove(OUTFILE)

wrote_header = False
current_start = 1
rows_written = 0
for start in range(0, TOTAL_ROWS, CHUNK_SIZE):
    nrows = min(CHUNK_SIZE, TOTAL_ROWS - start)
    df_chunk = generate_chunk_df(current_start, nrows)
    df_chunk.to_csv(OUTFILE, index=False, header=not wrote_header, compression="gzip", mode="a")
    wrote_header = True
    current_start += nrows
    rows_written += nrows
    print(f"Wrote {rows_written}/{TOTAL_ROWS} rows")
    del df_chunk
    gc.collect()

print("Data generation finished. File:", OUTFILE)


Wrote 50000/200000 rows
Wrote 100000/200000 rows
Wrote 150000/200000 rows
Wrote 200000/200000 rows
Data generation finished. File: /content/microloans_generated.csv.gz


## 5. Load dataset

> Load the generated gzipped CSV. For very large files you can use `pd.read_csv(..., chunksize=...)` to iterate.

In [None]:
# adjust path if using your own file from Drive
DATAFILE = "/content/microloans_generated.csv.gz"

# quick load (if dataset fits in memory)
df = pd.read_csv(DATAFILE, compression="gzip")
print(df.shape)
df.head()

# If memory is tight, consider reading only numeric columns or using dtype=float32 for numeric.

(200000, 503)


|   | feat_000 | feat_001 | feat_002 | feat_003 | feat_004 | feat_005 | feat_006 | feat_007 | feat_008 | feat_009 | ... | feat_493 | feat_494 | feat_495 | feat_496 | feat_497 | feat_498 | feat_499 | month | client_id | default |
|---|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----|----------|----------|----------|----------|----------|----------|----------|-------|-----------|---------|
| 0 | 0.24423 | 0.678178 | -0.585529 | -0.908673 | -1.991838 | 0.971623 | 0.016657 | 0.205731 | -0.783595 | 1.226498 | ... | 0.733248 | -0.937926 | -0.24934 | -0.453826 | -0.505037 | -0.618948 | 0.593156 | 10 | 1 | 1 |
| 1 | 0.553531 | 0.846116 | 1.803438 | 0.175773 | 0.46927 | 0.181753 | -1.612282 | -1.058011 | 0.470706 | -0.437141 | ... | -0.640014 | -0.325898 | 0.569612 | 0.143844 | -0.104949 | 0.83931 | -0.741194 | 12 | 2 | 1 |
| 2 | 0.604609 | 1.223441 | -0.007221 | -0.090196 | -1.057149 | -0.366855 | -0.661511 | 0.999835 | -0.260388 | -0.284669 | ... | 1.407685 | 0.656622 | -0.698291 | -1.473563 | 0.255281 | 0.461254 | -0.231639 | 3 | 3 | 0 |
| 3 | -0.358611 | -0.538976 | -0.232811 | -0.977738 | 0.233612 | 0.412751 | 0.255036 | 0.121842 | 1.297138 | -0.276421 | ... | 0.3832 | -0.752495 | 1.884973 | 1.577014 | -0.781976 | 1.25049 | 0.595441 | 9 | 4 | 0 |
| 4 | 0.259756 | -0.910076 | -0.280531 | 0.359888 | 1.213674 | -0.904145 | 0.212028 | -1.409418 | 0.816652 | 0.495194 | ... | -0.310847 | 0.416164 | 1.364584 | -0.737344 | -0.247227 | -0.548786 | -1.391601 | 7 | 5 | 0 |
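
When the file does not fit in memory, the `chunksize` option mentioned above lets you stream it and keep only running totals. The following is a minimal sketch, assuming a small stand-in file written to a temporary directory (the file name, column set, and chunk size are illustrative, not the notebook's real data):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small stand-in file so the sketch runs anywhere (hypothetical data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feat_000": rng.normal(size=1_000),
    "default": rng.integers(0, 2, size=1_000),
})
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
demo.to_csv(demo_path, index=False, compression="gzip")

# Stream the file in chunks and accumulate only the totals we need.
n_rows = 0
n_defaults = 0
for chunk in pd.read_csv(demo_path, compression="gzip", chunksize=250):
    n_rows += len(chunk)
    n_defaults += int(chunk["default"].sum())

default_rate = n_defaults / n_rows
print(f"rows={n_rows}, default rate={default_rate:.4f}")
```

Only one chunk is held in memory at a time, so peak usage depends on `chunksize` rather than on the total number of rows.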



## 6. Sanity checks

Before modeling, check the default rate, the distribution of a few features, and missing values. The cell below covers the first two, along with the uniqueness of `client_id`.

In [None]:
# Sanity checks
print("Default rate:", df["default"].mean())
print("Client id unique:", df["client_id"].nunique())
print("Feature sample stats:")
print(df[["feat_000","feat_001","feat_002"]].describe().T)

# Placeholder for you to paste results
# === YOUR SANITY CHECK RESULTS ===
# Paste observed default rate and any notes here:

Default rate: 0.24974
Client id unique: 200000
Feature sample stats:
             count      mean       std       min       25%       50%  \
feat_000  200000.0  0.002335  0.998270 -4.153395 -0.673174  0.001195   
feat_001  200000.0  0.002915  1.001018 -4.351909 -0.672159  0.003735   
feat_002  200000.0  0.000395  0.999618 -4.046616 -0.674700  0.002535   

               75%       max  
feat_000  0.676466  4.668834  
feat_001  0.681573  5.269891  
feat_002  0.674862  4.758881  
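
The cell above does not test for missing values. A minimal sketch of such a check follows, using a tiny hypothetical frame with one injected gap in place of the notebook's `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `df`, with one injected gap.
demo = pd.DataFrame({
    "feat_000": [0.1, np.nan, -0.3],
    "feat_001": [1.2, 0.4, 0.0],
    "default": [0, 1, 0],
})

# Count missing cells per column, then keep only the columns that have any.
missing_per_col = demo.isna().sum()
cols_with_missing = missing_per_col[missing_per_col > 0]
total_missing = int(missing_per_col.sum())

print("Total missing cells:", total_missing)
print(cols_with_missing)
```

On the generated dataset the total should be zero, since the generator never emits missing values; the check matters more when loading your own CSV.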


## 7. Feature selection — Top-10 by absolute Pearson correlation

> We'll compute the point-biserial correlation (Pearson between continuous features and binary `default`) and pick the top 10 features by absolute value. For very large datasets, compute correlation in chunks or use streaming statistics.


In [None]:
# Prepare numeric feature list
numeric_cols = [c for c in df.columns if c.startswith("feat_")]
print(f"Number of numeric features: {len(numeric_cols)}")

# Compute correlation with target
corrs = df[numeric_cols].corrwith(df['default']).abs().sort_values(ascending=False)
top_k = 10
top_features = corrs.head(top_k).index.tolist()
print("Top-10 features by abs(Pearson) with default:")
print(corrs.head(top_k))

# Save selection
with open('/content/top10_features.txt','w') as f:
    f.write('\n'.join(top_features))

# Placeholder to record your observed top-10 list
# === YOUR TOP-10 FEATURES ===
# Paste the top-10 features printed above and any notes here:


Number of numeric features: 500
Top-10 features by abs(Pearson) with default:
feat_008    0.094142
feat_003    0.074154
feat_000    0.072358
feat_001    0.048889
feat_011    0.046333
feat_002    0.046184
feat_006    0.044890
feat_007    0.044120
feat_005    0.039195
feat_009    0.036185
dtype: float64
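
For datasets that do not fit in memory, the same correlations can be computed from running sums accumulated chunk by chunk. This is a sketch under stated assumptions: `streaming_corr` is a hypothetical helper (not part of the notebook), and the data is synthetic.

```python
import numpy as np

def streaming_corr(chunks):
    """Pearson correlation of each feature with y, from running sums over (X, y) chunks."""
    n = 0
    sx = sxx = sxy = None
    sy = syy = 0.0
    for X, y in chunks:
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y, dtype=np.float64)
        if sx is None:
            sx = np.zeros(X.shape[1])
            sxx = np.zeros(X.shape[1])
            sxy = np.zeros(X.shape[1])
        n += len(y)
        sx += X.sum(axis=0)
        sxx += (X ** 2).sum(axis=0)
        sxy += X.T @ y
        sy += y.sum()
        syy += (y ** 2).sum()
    # Population covariance and variances; the normalisation cancels in the ratio.
    cov = sxy / n - (sx / n) * (sy / n)
    var_x = sxx / n - (sx / n) ** 2
    var_y = syy / n - (sy / n) ** 2
    return cov / np.sqrt(var_x * var_y)

# Synthetic check: only the first feature drives the binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

chunks = ((X[i:i + 2_500], y[i:i + 2_500]) for i in range(0, len(y), 2_500))
corr_stream = streaming_corr(chunks)
corr_full = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
print(np.round(corr_stream, 4))
```

The sum-of-squares form is simple but can lose precision when feature means are large relative to their spread; for roughly standardized features like the ones here it is adequate.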


## 8. Dimensionality reduction: PCA

> We'll standardize selected features and run PCA. If the dataset is very large, `IncrementalPCA` is used to fit in batches.

In [None]:
# PCA on top-10 and on a larger set if desired
n_pca_components = 5

# Standardize top-10
X_top10 = df[top_features].values
scaler_top10 = StandardScaler()
X_top10_scaled = scaler_top10.fit_transform(X_top10)

pca_top10 = PCA(n_components=n_pca_components, random_state=0)
X_top10_pca = pca_top10.fit_transform(X_top10_scaled)
print("Top-10 PCA explained variance ratio (cumulative):", pca_top10.explained_variance_ratio_.cumsum())

# PCA on all numeric features using IncrementalPCA (memory-friendly)
USE_INCR = True
incr_components = 10
if USE_INCR:
    chunksize = 50_000
    starts = range(0, df.shape[0], chunksize)

    def get_batch(start):
        # .iloc slicing clips at the end of the frame, so the last batch may be shorter
        return df.iloc[start:start + chunksize][numeric_cols].values.astype('float32')

    # Pass 1: accumulate global mean/std so every batch is scaled with the same statistics
    scaler_all = StandardScaler()
    for start in starts:
        scaler_all.partial_fit(get_batch(start))

    # Pass 2: fit IncrementalPCA on the globally standardized batches
    ipca = IncrementalPCA(n_components=incr_components)
    for start in starts:
        ipca.partial_fit(scaler_all.transform(get_batch(start)))

    # Pass 3: project every batch onto the fitted components
    X_ipca = np.vstack([ipca.transform(scaler_all.transform(get_batch(start))) for start in starts])
    print("IncrementalPCA result shape:", X_ipca.shape)

# Placeholder to record PCA explained variance or observations
# === YOUR PCA OBSERVATIONS ===
# Paste explained variance or comments here:

Top-10 PCA explained variance ratio (cumulative): [0.10100549 0.20166813 0.30230078 0.40261951 0.50280859]
IncrementalPCA result shape: (200000, 10)


## 9. Modeling comparison: Full vs Top-10 vs PCA

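> We'll train a logistic regression on three feature sets and record elapsed time and ROC AUC on a 20% holdout set. The recorded time covers the train/test split, scaling, fitting, and scoring, not fitting alone. For big datasets you may want to sample for training.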


In [None]:
# Prepare data for comparison
TARGET = 'default'
SAMPLE_FOR_TRAINING = False  # set True if dataset is huge and you prefer to sample
SAMPLE_FRAC = 0.2

if SAMPLE_FOR_TRAINING:
    df_model = df.sample(frac=SAMPLE_FRAC, random_state=42)
else:
    df_model = df.copy()

X_full = df_model[numeric_cols].values
X_top10 = df_model[top_features].values
X_pca_top10 = X_top10_pca if X_top10_pca.shape[0] == df_model.shape[0] else None

y = df_model[TARGET].values

def train_and_eval(X, y, label):
    t0 = time.time()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    model = LogisticRegression(max_iter=200, solver='saga')
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:,1]
    auc = roc_auc_score(y_test, y_prob)
    t1 = time.time()
    return dict(label=label, time_s=t1-t0, auc=auc, n_features=X.shape[1])

results = []

# Full (may be slow)
print("Training Full model — this may take time")
res_full = train_and_eval(X_full, y, 'Full')
print(res_full)
results.append(res_full)

# Top-10
print("Training Top-10 model")
res_top10 = train_and_eval(X_top10, y, 'Top-10')
print(res_top10)
results.append(res_top10)

# PCA on top-10 (if available)
if X_pca_top10 is not None:
    print("Training PCA (top-10) model")
    res_pca = train_and_eval(X_pca_top10, y, f'PCA_top10_{X_pca_top10.shape[1]}')
    print(res_pca)
    results.append(res_pca)

results_df = pd.DataFrame(results)
print(results_df)

# Save results for your reflection
results_df.to_csv('/content/model_comparison_results.csv', index=False)

# === YOUR MODELING RESULTS ===
# Paste the printed results (training times, AUCs) here and any observations

Training Full model — this may take time
{'label': 'Full', 'time_s': 41.60721015930176, 'auc': np.float64(0.6248187074111766), 'n_features': 500}
Training Top-10 model
{'label': 'Top-10', 'time_s': 2.5789544582366943, 'auc': np.float64(0.6245846913224454), 'n_features': 10}
Training PCA (top-10) model
{'label': 'PCA_top10_5', 'time_s': 1.642754316329956, 'auc': np.float64(0.5826454378403728), 'n_features': 5}
         label     time_s       auc  n_features
0         Full  41.607210  0.624819         500
1       Top-10   2.578954  0.624585          10
2  PCA_top10_5   1.642754  0.582645           5
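
The overview lists an export step for the reduced dataset, but no cell above performs it. A minimal sketch of such an export follows, assuming the reduced dataset means `client_id`, the selected features, and `default`. The stand-in frame, the selection `demo_top`, and the output file name are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `df` and `top_features`.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 4)),
                    columns=[f"feat_{i:03d}" for i in range(4)])
demo["client_id"] = np.arange(1, 101)
demo["default"] = rng.integers(0, 2, size=100)
demo_top = ["feat_002", "feat_000"]

# Keep the identifier, the selected features, and the target.
reduced = demo[["client_id", *demo_top, "default"]]
out_path = os.path.join(tempfile.mkdtemp(), "microloans_reduced.csv.gz")
reduced.to_csv(out_path, index=False, compression="gzip")

# Read the file back to confirm the export round-trips.
reloaded = pd.read_csv(out_path, compression="gzip")
print(reloaded.shape, list(reloaded.columns))
```

In Colab, pointing `out_path` at a mounted Drive folder would place the file in Google Drive; the temporary directory here only keeps the sketch self-contained.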



---

# **Step 10: Final Reflection Report**

**Microloan Transaction Data Analysis — Question 2 (20 Marks)**

---

## **1. Summary of Dataset**

* **Total rows analyzed:** 200,000  
* **Total numeric features before reduction:** 500  
* **Target variable:** `default`

The dataset is synthetic and high-dimensional: 500 numeric features plus `month`, `client_id`, and `default`. The generator can scale to millions of rows and is meant to represent monthly microloan transaction logs for Kenyan borrowers.

---

## **2. Feature Selection (Top-10 Correlated Features)**

Using absolute Pearson correlations between each numeric feature and the target `default`, the **top-10 selected features** were:

| Rank | Feature    | Correlation |
| ---- | ---------- | ----------- |
| 1    | `feat_008` | 0.094142    |
| 2    | `feat_003` | 0.074154    |
| 3    | `feat_000` | 0.072358    |
| 4    | `feat_001` | 0.048889    |
| 5    | `feat_011` | 0.046333    |
| 6    | `feat_002` | 0.046184    |
| 7    | `feat_006` | 0.044890    |
| 8    | `feat_007` | 0.044120    |
| 9    | `feat_005` | 0.039195    |
| 10   | `feat_009` | 0.036185    |

### **Reflection**
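* The strongest correlation (0.094) is modest. Each causal feature contributes only a small share of a score that also includes a seasonal term and random noise, so weak individual correlations are expected.  
* All ten selected features are among the first 12 (`feat_000` to `feat_011`), the "causal features" used by the generator; only `feat_004` and `feat_010` were not selected. This is consistent with the generator working as intended.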
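
The overlap between the selected features and the generator's causal features can be checked directly. A small sketch, using the top-10 list printed in Section 7 and the `causal_idx = np.arange(12)` definition from the generator:

```python
# Top-10 features as printed by the feature selection cell.
top_features = [
    "feat_008", "feat_003", "feat_000", "feat_001", "feat_011",
    "feat_002", "feat_006", "feat_007", "feat_005", "feat_009",
]

# The generator uses the first 12 features as causal features.
causal = {f"feat_{i:03d}" for i in range(12)}

n_causal_selected = len(causal & set(top_features))
missed = sorted(causal - set(top_features))
print(n_causal_selected, missed)  # 10 ['feat_004', 'feat_010']
```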

---

## **3. PCA Dimensionality Reduction Results**

### **PCA on the Top-10 Features**

Cumulative explained variance over 5 components:

[0.1010, 0.2017, 0.3023, 0.4026, 0.5028]


### **Interpretation**
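* The first 5 principal components capture **~50.3%** of the variance in the top-10 features.  
* Each component explains about 10% of the variance because the generated features are independent standard normal draws. With no correlation between features, PCA has no dominant direction to find. PCA is unsupervised, so the weakness of the default signal plays no role here.  
* PCA reduces the feature space from 10 → 5, at the cost of **interpretability** and roughly half of the variance.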
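
A near-equal split of variance is what PCA produces on uncorrelated, standardized features. The sketch below illustrates this on freshly generated independent normal data (not the notebook's dataset; the sample size is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Ten independent standard normal features, mimicking the generator's design.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50_000, 10))
X_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=5, random_state=0).fit(X_scaled)
ratios = pca_demo.explained_variance_ratio_
cumulative = ratios.cumsum()

print("Per-component ratio:", np.round(ratios, 4))
print("Cumulative ratio:   ", np.round(cumulative, 4))
```

Each ratio lands close to 1/10, so five components cover roughly half the variance, matching the pattern in the notebook's output.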

### **Incremental PCA Output**

Shape: (200000, 10)


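All 500 numeric features were compressed into 10 components using batch-wise fitting, which keeps memory use bounded. These components are not used in the model comparison in Section 4, which applies PCA to the top-10 features only.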

---

## **4. Model Performance Comparison**

| Model                            | Time (s) | AUC     | # Features |
| -------------------------------- | -------- | ------- | ---------- |
| **Full (500 features)**          | 41.61    | 0.62482 | 500        |
| **Top-10 Selected Features**     | 2.58     | 0.62458 | 10         |
| **PCA on Top-10 (5 components)** | 1.64     | 0.58265 | 5          |

### **Key Observations**
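* **Full model** has the highest AUC (0.62482), but only by 0.0002, and takes 41.6 s.  
* **Top-10 model** reaches nearly the same AUC (0.62458) in 2.58 s, about **16× faster**.  
* **PCA model** is fastest (1.64 s), but its AUC drops by about 0.042 to 0.58265. Keeping 5 of 10 components discards roughly half the variance of the top-10 features, and PCA chooses components without reference to `default`, so predictive signal is lost along with it.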

---

## **5. Final Reflection (Required for Marks)**

### **A. Impact on Dataset Size**
* **Original:** 500 features  
* **After feature selection:** 10 features (98% reduction)  
* **After PCA:** 5 principal components  

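For the same number of rows, the feature matrix shrinks 50-fold with feature selection (500 → 10 columns) and 100-fold with PCA (500 → 5 columns), with a matching reduction in storage and memory footprint.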
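
The in-memory size of each feature matrix follows from rows × columns × bytes per value. A quick sketch, assuming 200,000 rows and 8-byte `float64` values (the default dtype pandas uses for floats read from CSV):

```python
rows = 200_000
bytes_per_value = 8  # assumption: float64, as loaded by pd.read_csv

def matrix_mib(n_features):
    """Size of a dense rows x n_features matrix in MiB."""
    return rows * n_features * bytes_per_value / 1024**2

sizes = {n: matrix_mib(n) for n in (500, 10, 5)}
for n, mib in sizes.items():
    print(f"{n:>3} features: {mib:8.2f} MiB")
```

These figures cover only the dense feature matrix; the DataFrame's other columns and any copies made during scaling add to the real footprint.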

---

### **B. Impact on Speed**
* Elapsed time (split, scaling, fitting, and scoring) fell from **41.6 s to 2.6 s** with feature selection (**~16× faster**).  
* PCA on the top-10 features reduced it further to **1.6 s**, **~25× faster** than the full model.
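
The speedup factors follow directly from the recorded timings. A quick check, using the `time_s` values printed in Section 9:

```python
# Elapsed times in seconds, copied from the model comparison output.
timings_s = {
    "Full": 41.60721015930176,
    "Top-10": 2.5789544582366943,
    "PCA_top10_5": 1.642754316329956,
}

# Speedup of each model relative to the full 500-feature model.
speedups = {label: timings_s["Full"] / t for label, t in timings_s.items()}
for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}x")
```

These are single-run wall-clock timings, so the ratios will vary somewhat between runs and runtimes.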

---

### **C. Impact on Model Accuracy**
* **Full model AUC:** 0.62482  
* **Top-10 AUC:** 0.62458 (a difference of 0.0002)  
* **PCA model AUC:** 0.58265 (a drop of about 0.042, because half of the top-10 variance is discarded)

Feature selection preserves AUC while greatly improving speed.  
PCA trades AUC for speed and compression.

---

### **D. Overall Conclusions**

1. **Feature selection is the most effective technique** for this microloan dataset because it:  
   * Improves model speed  
   * Maintains predictive accuracy  
   * Preserves interpretability  

2. **PCA is useful for compression**, but not ideal when prediction accuracy is the priority.

3. For real microloan providers managing millions of records, feature selection helps:  
   * Reduce compute cost  
   * Speed up model retraining  
   * Improve deployment efficiency  

4. PCA may still be preferred when:  
   * Visualizing patterns  
   * Reducing storage for downstream tasks  
   * Working in environments with strict memory limits  

---

## **6. Final Answer Summary (as required for the question)**

> **We applied feature selection to identify the 10 features most correlated with loan default, then applied PCA to reduce dimensionality further. Feature selection cut the feature count from 500 to 10 and made the modeling run about 16× faster with virtually no loss in AUC (0.6248 vs 0.6246). PCA reduced the dimension more aggressively, to 5 components, but lowered AUC to 0.5827. Overall, feature selection was the most effective method for balancing accuracy, speed, and interpretability.**





