# Extra Trees Model Training (from scraped GitHub code)

This notebook trains an **Extra Trees** model on the metrics dataset produced by your pipeline.

**Expected input:** `data/processed/dataset.csv` (built from analyzing code scraped from GitHub repos).

**Output artifacts:** saved model + feature columns under `models/` (so you can reuse it for inference).

## 1) Install dependencies (if needed)
If you already have these installed, you can skip this cell.

In [1]:
# Run this in a fresh environment; skip it if the packages are already installed.
%pip install -U pandas numpy scikit-learn joblib matplotlib


## 2) Load dataset

In [None]:
from pathlib import Path
import json

import numpy as np
import pandas as pd

PROJECT_ROOT = Path.cwd().resolve().parent  # notebooks/ -> repo root
DATASET_PATH = PROJECT_ROOT / 'data' / 'processed' / 'dataset.csv'
MODELS_DIR = PROJECT_ROOT / 'models'
MODELS_DIR.mkdir(parents=True, exist_ok=True)

if not DATASET_PATH.exists():
    raise FileNotFoundError(f'Missing dataset at {DATASET_PATH}. Build it first (scrape -> analyze -> dataset_builder).')

df = pd.read_csv(DATASET_PATH)
print('Loaded:', DATASET_PATH)
print('Shape:', df.shape)
display(df.head())

print('')
print('Columns:')
print(list(df.columns))

## 3) Choose target and task type
Your current `dataset.csv` includes only metric columns, so there is no dedicated label column. Pick one metric column as the label/target to learn; every other numeric column becomes a feature.

Common choices:
- **Classification (binary):** `security_high` (or `security_medium`, `security_low`)
- **Regression:** `maintainability_index`
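
If you are unsure which task type fits a candidate target, counting its distinct values is a quick sanity check. The sketch below uses a hypothetical helper, `suggest_task`, and a small made-up frame (the column names mirror the choices above, the values are invented). The `max_classes` cutoff is an arbitrary assumption, not a rule.

```python
import pandas as pd


def suggest_task(target: pd.Series, max_classes: int = 10) -> str:
    """Hypothetical helper: guess the task type from the number of distinct values."""
    values = target.dropna()
    # Few distinct values: treat them as class labels
    if values.nunique() <= max_classes:
        return 'classification'
    # Many distinct values: treat the column as a continuous quantity
    return 'regression'


# Toy data for illustration only (not taken from dataset.csv)
toy = pd.DataFrame({
    'security_high': [0, 1, 0, 0, 1, 0],
    'maintainability_index': [71.2, 55.0, 88.9, 63.4, 40.1, 92.7],
})

suggestions = {col: suggest_task(toy[col], max_classes=3) for col in toy.columns}
print(suggestions)
```

Treat the result as a hint only: a count-like metric with few distinct values may still be better modelled as regression.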

In [None]:
# ---- Configuration ----
TARGET_COLUMN = 'security_high'  # <- change me
TASK = 'classification'  # 'classification' or 'regression'
TEST_SIZE = 0.2
RANDOM_STATE = 42

# Basic validation
if TARGET_COLUMN not in df.columns:
    raise ValueError(f'TARGET_COLUMN={TARGET_COLUMN!r} not found. Available: {list(df.columns)}')
if TASK not in {'classification', 'regression'}:
    raise ValueError("TASK must be 'classification' or 'regression'")

## 4) Build `X` and `y` (cleaning + split)
This keeps only numeric columns, drops rows whose target is missing, fills missing feature values with 0, and then splits the data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Keep numeric columns only (ExtraTrees in sklearn expects numeric input)
numeric_df = df.select_dtypes(include=[np.number]).copy()
if TARGET_COLUMN not in numeric_df.columns:
    raise ValueError(
        f'Target {TARGET_COLUMN!r} is not numeric in the loaded dataset. '
        'Encode it to numbers (e.g., 0/1) or adjust preprocessing.'
    )

# Drop rows with missing target
numeric_df = numeric_df.dropna(subset=[TARGET_COLUMN])

y = numeric_df[TARGET_COLUMN]
X = numeric_df.drop(columns=[TARGET_COLUMN])

# Fill missing features with 0 (minimal, consistent default)
X = X.fillna(0)

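# Stratify only when every class has at least 2 rows; otherwise train_test_split raises a ValueError
can_stratify = TASK == 'classification' and y.nunique() > 1 and y.value_counts().min() >= 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE,
    stratify=y if can_stratify else None,
)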

print('X_train:', X_train.shape, 'X_test:', X_test.shape)
print('y distribution (train):')
display(y_train.value_counts(dropna=False) if TASK == 'classification' else y_train.describe())
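
Filling with 0 is the simplest default, but 0 can be a meaningful value for some metrics. One alternative is to impute each feature with its median, fitted on the training split only so that no test-set statistics leak into training. This is a minimal sketch on invented data; the column names `loc`, `complexity` and `comment_ratio` are placeholders, not the real columns of `dataset.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature frame with gaps (values invented for illustration)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=['loc', 'complexity', 'comment_ratio'],  # placeholder names
)
toy_X.iloc[::7, 1] = np.nan  # punch holes into the 'complexity' column
toy_y = pd.Series(rng.integers(0, 2, size=50))

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Learn the medians from the training split only, then reuse them on the test split
imputer = SimpleImputer(strategy='median')
Xtr_imp = pd.DataFrame(imputer.fit_transform(Xtr), columns=Xtr.columns, index=Xtr.index)
Xte_imp = pd.DataFrame(imputer.transform(Xte), columns=Xte.columns, index=Xte.index)

print('Remaining NaNs (train/test):', int(Xtr_imp.isna().sum().sum()), int(Xte_imp.isna().sum().sum()))
```

If you adopt this, the fitted imputer has to be saved alongside the model so that inference applies the same fill values.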

## 5) Train Extra Trees

In [None]:
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

if TASK == 'classification':
    model = ExtraTreesClassifier(
        n_estimators=400,
        random_state=RANDOM_STATE,
        n_jobs=-1,
        class_weight='balanced_subsample',
    )
else:
    model = ExtraTreesRegressor(
        n_estimators=400,
        random_state=RANDOM_STATE,
        n_jobs=-1,
    )

model.fit(X_train, y_train)
print('Trained:', model.__class__.__name__)
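
A single train/test split can give a noisy estimate, especially on a small or imbalanced dataset. Cross-validation is one way to see how much the score varies between splits. The sketch below runs on synthetic data from `make_classification`, which only stands in for your metrics dataset; the F1 scoring and the 5 folds are assumptions you may want to change.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the real dataset
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)

clf = ExtraTreesClassifier(
    n_estimators=100,
    random_state=42,
    class_weight='balanced_subsample',
)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring='f1')

print('F1 per fold:', scores.round(3))
print('Mean / std :', scores.mean().round(3), scores.std().round(3))
```

For regression, the same pattern should work with `ExtraTreesRegressor`, a plain `KFold`, and a regression scorer such as `'r2'`.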

## 6) Evaluate

In [None]:
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_absolute_error, mean_squared_error, r2_score,
)

y_pred = model.predict(X_test)

if TASK == 'classification':
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('\nClassification report:')
    print(classification_report(y_test, y_pred, digits=4))
    print('Confusion matrix:')
    print(confusion_matrix(y_test, y_pred))
else:
    # The `squared` argument was removed from mean_squared_error in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print('MAE:', mean_absolute_error(y_test, y_pred))
    print('RMSE:', rmse)
    print('R2:', r2_score(y_test, y_pred))
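
Accuracy alone can look good on an imbalanced target simply because the model predicts the majority class. Comparing against a trivial baseline makes that visible. This is a sketch on synthetic data, assuming a binary target with roughly 15% positives; the class ratio and the choice of balanced accuracy are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real dataset
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy,
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy='most_frequent').fit(Xtr, ytr)
trees = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# Balanced accuracy averages per-class recall, so the baseline cannot hide behind the majority class
baseline_score = balanced_accuracy_score(yte, baseline.predict(Xte))
trees_score = balanced_accuracy_score(yte, trees.predict(Xte))

print('Baseline balanced accuracy   :', round(baseline_score, 3))
print('Extra Trees balanced accuracy:', round(trees_score, 3))
```

If the trained model barely beats the baseline on your data, the features probably carry little signal for that target.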

## 7) Feature importance (quick look)

In [None]:
import pandas as pd

fi = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
display(fi.head(20))
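
`feature_importances_` is impurity-based and computed from the training data, so it can overstate features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data and placeholder feature names `f0` to `f5`; `n_repeats=10` is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset, with placeholder feature names
names = [f'f{i}' for i in range(6)]
X_arr, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42,
)
X_toy = pd.DataFrame(X_arr, columns=names)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

clf = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(Xtr, ytr)

# Shuffle one column at a time on the test split and measure the drop in score
result = permutation_importance(clf, Xte, yte, n_repeats=10, random_state=42)
perm = pd.Series(result.importances_mean, index=Xte.columns).sort_values(ascending=False)

print(perm)
```

Features that rank high in both views are the safer ones to interpret.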

## 8) Save model + metadata
This writes:
- `models/extratrees_<task>_<target>.joblib`
- `models/extratrees_<task>_<target>_features.json`

In [None]:
import joblib

safe_target = ''.join(c if c.isalnum() or c in ('_', '-') else '_' for c in TARGET_COLUMN)
model_path = MODELS_DIR / f'extratrees_{TASK}_{safe_target}.joblib'
features_path = MODELS_DIR / f'extratrees_{TASK}_{safe_target}_features.json'

joblib.dump(model, model_path)
features_path.write_text(json.dumps({
    'target': TARGET_COLUMN,
    'task': TASK,
    'feature_columns': list(X.columns),
    'random_state': RANDOM_STATE,
}, indent=2))

print('Saved model to:', model_path)
print('Saved feature metadata to:', features_path)

## 9) (Optional) Inference helper
Given a single metrics record (dict), this predicts the target.

Note: your inference input must have the **same feature columns** as training.

In [None]:
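def predict_one(metrics_record: dict):
    """Predict the target for one metrics record, aligned to the training columns."""
    row = pd.DataFrame([metrics_record])
    # Add missing training columns as 0, drop unknown keys, and keep the training column order
    row = row.reindex(columns=X.columns, fill_value=0)
    # Coerce to numeric instead of dropping non-numeric columns, so the feature count never changes
    row = row.apply(pd.to_numeric, errors='coerce').fillna(0)
    return model.predict(row)[0]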

# Example: take the first row of the dataset and predict
example = X.iloc[0].to_dict()
print('Prediction:', predict_one(example))
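
Real records rarely match the training columns exactly. The sketch below shows what column alignment does to a record that has a missing feature, an unexpected key, and a number stored as a string. The feature names and the record are invented for illustration.

```python
import pandas as pd

# Placeholder feature list; in the notebook this would be X.columns
feature_columns = ['loc', 'complexity', 'comment_ratio']

# Record with a string-typed number, an unknown key, and no 'comment_ratio'
record = {'loc': 120, 'complexity': '7', 'unexpected': 1}

row = pd.DataFrame([record])
# reindex drops 'unexpected', adds 'comment_ratio' filled with 0, and fixes the column order
row = row.reindex(columns=feature_columns, fill_value=0)
# Coerce every column to numeric; values that cannot be parsed become NaN and are then set to 0
row = row.apply(pd.to_numeric, errors='coerce').fillna(0)

print(row)
```

Silently filling missing features with 0 keeps inference running, but it can hide upstream problems, so consider logging which columns were filled.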