# Notebook 03 · Feature Engineering and Risk Classification

In this notebook we build the modelling layer on top of the GenAI extraction and cleaning pipeline from Notebooks 01 and 02.

We will:

- Load the cleaned document level dataset from `data/processed/docs_clean.csv`.
- Define a clear target label for High Risk vs Low Risk documents.
- Split the data into training and test sets with a split first approach to avoid leakage.
- Design feature sets that combine:
  - Categorical fields such as category, region, sector, risk_type, time_horizon.
  - Numeric features such as num_risk_factors and text length statistics.
  - Simple keyword based features derived from risk_summary and key_risk_factors_list.
- Build sklearn pipelines with ColumnTransformer for:
  - Logistic regression as a baseline linear model.
  - Random forest as a non linear tree based model.
- Handle class imbalance using class weights and suitable evaluation metrics.
- Evaluate performance using:
  - Confusion matrix, precision, recall, F1.
  - ROC AUC and, if useful, precision recall AUC.
- Inspect feature importance and relate results back to an underwriting style risk view.
- Save final model inputs and predictions for possible reuse.

This notebook focuses on correct modelling practice and clear reasoning, not on squeezing out the last one percent of accuracy. The goal is to show a robust, explainable pipeline that an underwriter or analyst could trust.


## Step 1 · Load cleaned dataset and define target label

In this step we load the cleaned document level dataset produced in Notebook 02 and create a binary target for high risk classification.

In a real insurance setting, a high risk label would normally be driven by historical loss outcomes or expert underwriter judgement. Because this project uses a small synthetic portfolio without loss data, we define a transparent proxy label based on document type and line of business:

- Treat all **incident** documents as high risk, since they describe realised or near loss events.
- Treat documents with **cyber** or **liability** risk types as high risk, reflecting that these lines often involve higher severity and uncertainty.
- All other documents are treated as low risk.

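This rule is not meant to be a perfect reflection of real world risk, but it provides a clear and explainable target that allows us to focus on building a robust and interpretable modelling pipeline. Because the label is computed from `category` and `risk_type`, and both columns are used as input features later in the notebook, the models are partly relearning this rule, so the evaluation measures how well the pipeline recovers it from a small sample.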
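
The rule above can be sanity checked with a cross tabulation before any modelling. The sketch below uses a small toy frame (the rows are illustrative, not taken from `docs_clean.csv`) and assumes the same column names as the notebook. Each combination of `category` and `risk_type` should map to exactly one label value.

```python
import pandas as pd

# Toy rows standing in for docs_df; the values are illustrative only.
toy_df = pd.DataFrame(
    {
        "category": ["incident", "policy", "policy", "esg", "policy"],
        "risk_type": ["property", "cyber", "liability", "esg", "travel"],
    }
)

high_risk_risk_types = ["cyber", "liability"]

# Same rule as in the notebook: incident documents OR cyber/liability risk types.
toy_df["target_high_risk"] = (
    (toy_df["category"] == "incident")
    | toy_df["risk_type"].isin(high_risk_risk_types)
).astype(int)

# One row per (category, risk_type) pair, one column per label value.
rule_table = pd.crosstab(
    [toy_df["category"], toy_df["risk_type"]],
    toy_df["target_high_risk"],
)
print(rule_table)
```

Running the same `pd.crosstab` call on `docs_df` would show which combinations actually occur in the portfolio and how many documents each contributes.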


In [1]:
from pathlib import Path

import pandas as pd

# Set up paths
PROJECT_ROOT = Path.cwd().resolve().parents[0]
DATA_PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
DOCS_CLEAN_PATH = DATA_PROCESSED_DIR / "docs_clean.csv"

# Load cleaned dataset
docs_df = pd.read_csv(DOCS_CLEAN_PATH)

print("Dataset shape:", docs_df.shape)
print("Columns:", docs_df.columns.tolist())

# Define high risk rule
high_risk_risk_types = ["cyber", "liability"]

docs_df["target_high_risk"] = (
    (docs_df["category"] == "incident")
    | (docs_df["risk_type"].isin(high_risk_risk_types))
).astype(int)

# Check class balance
print("\nHigh risk label counts:")
print(docs_df["target_high_risk"].value_counts())

print("\nHigh risk label proportions:")
print(docs_df["target_high_risk"].value_counts(normalize=True).round(3))

docs_df.head()


Dataset shape: (14, 12)
Columns: ['doc_index', 'category', 'entity_name', 'region', 'sector', 'risk_type', 'time_horizon', 'key_risk_factors_list', 'num_risk_factors', 'summary_length_chars', 'summary_length_words', 'risk_summary']

High risk label counts:
target_high_risk
0    7
1    7
Name: count, dtype: int64

High risk label proportions:
target_high_risk
0    0.5
1    0.5
Name: proportion, dtype: float64


| | doc_index | category | entity_name | region | sector | risk_type | time_horizon | key_risk_factors_list | num_risk_factors | summary_length_chars | summary_length_words | risk_summary | target_high_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | esg | global manufacturing, logistics, and energy co... | global | industrial | esg | not_specified | ['inconsistent ESG strategy', 'different inter... | 28 | 804 | 115 | The company's ESG strategy is still developing... | 0 |
| 1 | 1 | esg | The organisation | global | energy, logistics, heavy industry, transport | esg | medium_term | ['inconsistency in language used in internal d... | 20 | 804 | 108 | The organisation's energy transition strategy ... | 0 |
| 2 | 2 | esg | Synthetic ESG Report – Supply Chain and Climat... | global | energy | esg | not_specified | ['inconsistent expectations in environmental c... | 19 | 804 | 106 | The company's supply chain faces risks related... | 0 |
| 3 | 3 | incident | unspecified_entity | global | marine | property | not_specified | ['engine failure', 'grounding', 'irregular vib... | 21 | 804 | 117 | The incident involved a synthetic marine vesse... | 1 |
| 4 | 4 | incident | Motor Fleet Collision Event | global | transportation | motor | not_specified | ['driver error', 'vehicle maintenance', 'senso... | 14 | 804 | 128 | A motor fleet collision event occurred on a du... | 1 |


## Step 2 · Select feature columns and prepare modelling dataset

In this step we define which columns will be used as input features and which column will be the target.

We use `target_high_risk` as the binary target created in Step 1. The remaining columns fall into three groups:

1. Categorical features:
   - `category`
   - `region`
   - `sector`
   - `risk_type`
   - `time_horizon`

2. Numeric features:
   - `num_risk_factors`
   - `summary_length_chars`
   - `summary_length_words`

3. Text derived features:
   - simple keyword flags derived from `risk_summary` or `key_risk_factors_list` (listed as a possible extension; the models in this notebook use only the categorical and numeric columns above)

We keep the feature set intentionally simple and explainable. The goal is to build a clean and reproducible risk classifier that demonstrates correct modelling practice rather than maximising accuracy.
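
As one way the keyword flags in group 3 could be derived, the sketch below adds one 0/1 column per keyword group. The `KEYWORD_GROUPS` mapping and the `add_keyword_flags` helper are hypothetical names introduced for illustration, and the keyword lists are assumptions, not part of the original pipeline.

```python
import re

import pandas as pd

# Hypothetical keyword groups; the lists are assumptions for illustration.
KEYWORD_GROUPS = {
    "kw_failure": ["failure", "collision", "grounding"],
    "kw_cyber": ["breach", "ransomware"],
}


def add_keyword_flags(df, text_col="risk_summary"):
    """Return a copy of df with one 0/1 flag column per keyword group."""
    out = df.copy()
    # Missing summaries are treated as empty text, so they produce 0 flags.
    text = out[text_col].fillna("").str.lower()
    for flag_name, keywords in KEYWORD_GROUPS.items():
        pattern = "|".join(re.escape(word) for word in keywords)
        out[flag_name] = text.str.contains(pattern, regex=True).astype(int)
    return out


toy_df = pd.DataFrame(
    {
        "risk_summary": [
            "A motor fleet collision event occurred.",
            "The company's ESG strategy is still developing.",
            None,
        ]
    }
)
flagged = add_keyword_flags(toy_df)
print(flagged[["kw_failure", "kw_cyber"]])
```

Because each flag depends only on the text of its own row, computing the flags before the train test split does not leak information between rows. The new columns could then be appended to `numeric_cols` or passed through unchanged.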

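After selecting the feature columns, we split the dataset into training and test sets. Splitting before any imputer, scaler, or encoder is fitted keeps test set statistics out of the preprocessing, which prevents leakage and mirrors best practice in insurance modelling.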


In [2]:
from sklearn.model_selection import train_test_split

# Select feature columns
categorical_cols = [
    "category",
    "region",
    "sector",
    "risk_type",
    "time_horizon",
]

numeric_cols = [
    "num_risk_factors",
    "summary_length_chars",
    "summary_length_words",
]

feature_cols = categorical_cols + numeric_cols

# Features and target
X = docs_df[feature_cols].copy()
y = docs_df["target_high_risk"].copy()

print("Feature columns:", feature_cols)
print("X shape:", X.shape)
print("y shape:", y.shape)

# Train test split - split early to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y,
)

print("\nTrain shape:", X_train.shape)
print("Test shape:", X_test.shape)

print("\nClass balance in y_train:")
print(y_train.value_counts(normalize=True).round(3))

print("\nClass balance in y_test:")
print(y_test.value_counts(normalize=True).round(3))

X_train.head()


Feature columns: ['category', 'region', 'sector', 'risk_type', 'time_horizon', 'num_risk_factors', 'summary_length_chars', 'summary_length_words']
X shape: (14, 8)
y shape: (14,)

Train shape: (9, 8)
Test shape: (5, 8)

Class balance in y_train:
target_high_risk
1    0.556
0    0.444
Name: proportion, dtype: float64

Class balance in y_test:
target_high_risk
0    0.6
1    0.4
Name: proportion, dtype: float64


| | category | region | sector | risk_type | time_horizon | num_risk_factors | summary_length_chars | summary_length_words |
|---|---|---|---|---|---|---|---|---|
| 1 | esg | global | energy, logistics, heavy industry, transport | esg | medium_term | 20 | 804 | 108 |
| 3 | incident | global | marine | property | not_specified | 21 | 804 | 117 |
| 6 | policy | north_america | insurance | liability | short_term | 25 | 804 | 108 |
| 12 | policy | global | travel | travel | short_term | 25 | 804 | 111 |
| 5 | incident | north_america | manufacturing | property | not_specified | 11 | 804 | 119 |


## Step 3 · Build preprocessing pipelines with ColumnTransformer

In this step we define the preprocessing that will be applied to the features before modelling.

We separate the features into two main groups:

1. Numeric features  
   - `num_risk_factors`  
   - `summary_length_chars`  
   - `summary_length_words`  

   For these columns we will:
   - Impute missing values using the median.
   - Scale them using `StandardScaler` so that each feature has mean 0 and standard deviation 1.  
   This helps models such as logistic regression train in a stable way.

2. Categorical features  
   - `category`  
   - `region`  
   - `sector`  
   - `risk_type`  
   - `time_horizon`  

   For these columns we will:
   - Impute missing values using the most frequent category.
   - Apply `OneHotEncoder` with `handle_unknown="ignore"` so new categories at prediction time do not cause errors.
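
One detail of the numeric branch is worth checking on this dataset: in every row displayed earlier in the notebook, `summary_length_chars` equals 804. The sketch below, on a toy array with one varying and one constant column, shows how `StandardScaler` behaves in that case. The numbers are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0 varies; column 1 is constant, like a length field that is always 804.
X_num = np.array(
    [
        [20.0, 804.0],
        [21.0, 804.0],
        [25.0, 804.0],
        [11.0, 804.0],
    ]
)

scaler = StandardScaler().fit(X_num)
X_scaled = scaler.transform(X_num)

print("means: ", scaler.mean_)
print("scales:", scaler.scale_)
print(X_scaled.round(3))
```

For a zero variance column the scaler uses a scale of 1 instead of dividing by zero, so the column becomes all zeros. It then carries no information for the model, and such a feature could be dropped without changing predictions.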

We use `ColumnTransformer` to apply the numeric pipeline to numeric columns and the categorical pipeline to categorical columns inside one object. This keeps the code clean and ensures that:

- All preprocessing steps are fitted only on the training data.
- The same transformations are applied to the test data using `transform` only.
- The workflow is safe from leakage and suitable for production style use.

In the next step we will wrap this preprocessor inside a sklearn `Pipeline` together with a classifier to create a complete end to end model.


In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numeric preprocessing pipeline: impute then scale
numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Categorical preprocessing pipeline: impute then one hot encode
categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# ColumnTransformer to apply the right pipeline to each column group
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
    ]
)

# Fit on training data only, then check the transformed shape
preprocessor.fit(X_train)

X_train_preprocessed = preprocessor.transform(X_train)

print("Original X_train shape:", X_train.shape)
print("Preprocessed X_train shape:", X_train_preprocessed.shape)


Original X_train shape: (9, 8)
Preprocessed X_train shape: (9, 22)


## Step 4 · Build the full modelling pipeline

In this step we create the complete end to end modelling pipeline by combining the preprocessing defined in Step 3 with a classifier.

We use the `Pipeline` class from sklearn to chain the following components into a single model object:

1. **Preprocessor**  
   The `ColumnTransformer` built in Step 3, which applies:
   - median imputation and scaling to numeric features  
   - most frequent imputation and one hot encoding to categorical features  

   This ensures that all preprocessing steps are fitted only on the training set and then applied consistently to the test set using transform only.

2. **Classifier**  
   We start with **LogisticRegression** as a baseline model because:
   - it is simple and interpretable  
   - it works well with linear relationships  
   - it is commonly used in insurance risk modelling  
   - it provides probability outputs  
   - it benefits from the scaling applied in the numeric pipeline  

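   We set `class_weight="balanced"` as a safeguard. The target is in fact balanced overall (7 high risk vs 7 low risk), and the stratified training split is 5 vs 4, so the class weights have only a small effect here.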

The final model will be trained using `.fit(X_train, y_train)` and evaluated on the test set using `.predict` and `.predict_proba`.  
This pipeline structure prevents leakage, keeps the workflow clean, and mirrors how production models are typically deployed.
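
As a quick check of what `class_weight="balanced"` does, the sketch below computes the weights for a label vector with the same counts as the training split in Step 2 (four low risk, five high risk). It compares scikit-learn's `compute_class_weight` with the formula `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts matching the training split: four low risk (0), five high risk (1).
y_toy = np.array([0] * 4 + [1] * 5)

weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y_toy,
)

# Same quantity computed by hand.
manual = len(y_toy) / (2 * np.bincount(y_toy))

print("sklearn weights:", weights)  # [1.125 0.9]
print("manual weights: ", manual)
```

With counts this close, the two weights differ only slightly, which is why the setting has little practical effect on this dataset.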


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Build full modelling pipeline: preprocessing + logistic regression
log_reg_clf = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        (
            "model",
            LogisticRegression(
                class_weight="balanced",
                max_iter=1000,
                random_state=42,
            ),
        ),
    ]
)

# Fit on training data only
log_reg_clf.fit(X_train, y_train)

print("Fitted pipeline:")
print(log_reg_clf)

# Quick sanity check: training and test accuracy
y_train_pred = log_reg_clf.predict(X_train)
y_test_pred = log_reg_clf.predict(X_test)

train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)

print(f"\nTraining accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")


Fitted pipeline:
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['num_risk_factors',
                                                   'summary_length_chars',
                                                   'summary_length_words']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encoder',
     

## Step 5 · Evaluate the logistic regression model

Accuracy alone is not a reliable measure when dealing with imbalanced datasets or small sample sizes. 
In this step we evaluate the logistic regression model using more informative metrics that align with 
risk modelling and classification tasks.

We will compute the following metrics:

- **Confusion matrix**  
  Shows true positives, false positives, true negatives, and false negatives.

- **Precision**  
  Of all documents predicted high risk, how many were actually high risk.

- **Recall**  
  Of all true high risk documents, how many the model successfully identified.

- **F1 score**  
  Harmonic mean of precision and recall, useful when classes are imbalanced.

- **ROC AUC**  
  Measures how well the model separates high risk from low risk across different thresholds.

These metrics provide a fuller picture of model performance and are more appropriate for risk classification 
than accuracy alone. This approach reflects best practice in insurance data science, where missing a true 
high risk case (false negative) can be far more costly than producing an extra false positive.
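
The definitions above can be verified by hand from a confusion matrix. The sketch below uses a hypothetical matrix (not the notebook's results) and follows scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
cm_example = np.array(
    [
        [3, 1],
        [2, 4],
    ]
)

# For a binary problem, ravel() returns the counts in this order.
tn, fp, fn, tp = cm_example.ravel()

precision_example = tp / (tp + fp)  # 4 / 5
recall_example = tp / (tp + fn)     # 4 / 6
f1_example = (
    2 * precision_example * recall_example / (precision_example + recall_example)
)

print(f"Precision: {precision_example:.3f}")  # 0.800
print(f"Recall:    {recall_example:.3f}")     # 0.667
print(f"F1 score:  {f1_example:.3f}")         # 0.727
```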


In [5]:
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)

# Predict labels and probabilities on the test set
y_test_pred = log_reg_clf.predict(X_test)
y_test_proba = log_reg_clf.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_test_pred)

print("Confusion matrix (rows = true, columns = predicted):")
print(cm)

# Core metrics
precision = precision_score(y_test, y_test_pred, zero_division=0)
recall = recall_score(y_test, y_test_pred, zero_division=0)
f1 = f1_score(y_test, y_test_pred, zero_division=0)

# For ROC AUC we need both classes present, otherwise sklearn will raise an error
try:
    roc_auc = roc_auc_score(y_test, y_test_proba)
except ValueError:
    roc_auc = None

print("\nTest metrics:")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")

if roc_auc is not None:
    print(f"ROC AUC:   {roc_auc:.3f}")
else:
    print("ROC AUC:   not defined (only one class present in y_test)")

print("\nDetailed classification report:")
print(classification_report(y_test, y_test_pred, digits=3, zero_division=0))


Confusion matrix (rows = true, columns = predicted):
[[1 2]
 [0 2]]

Test metrics:
Precision: 0.500
Recall:    1.000
F1 score:  0.667
ROC AUC:   0.500

Detailed classification report:
              precision    recall  f1-score   support

           0      1.000     0.333     0.500         3
           1      0.500     1.000     0.667         2

    accuracy                          0.600         5
   macro avg      0.750     0.667     0.583         5
weighted avg      0.800     0.600     0.567         5
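
With only five test documents, each metric above moves in large steps when a single prediction changes. One common way to get a steadier estimate on small data is repeated stratified cross validation. The sketch below is a minimal illustration on synthetic data from `make_classification`; the data, the simplified pipeline, and the fold settings are all assumptions, and in the notebook the full `X`, `y`, and `log_reg_clf` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, used only to keep the example self-contained.
X_toy, y_toy = make_classification(
    n_samples=40,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=42,
)

toy_model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# 4 folds repeated 5 times gives 20 train/test splits.
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=5, random_state=42)
scores = cross_val_score(toy_model, X_toy, y_toy, cv=cv, scoring="roc_auc")

print(f"ROC AUC over {len(scores)} folds: "
      f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Because the preprocessing sits inside the pipeline, it is refitted on the training part of every fold, so the split first principle still holds.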



## Step 6 · Interpret feature importance (Logistic Regression)

Logistic Regression is a linear model, which means it learns a weight (coefficient) for each input feature.  
After preprocessing, each categorical feature becomes several one hot encoded binary features, and each numeric
feature becomes a scaled numeric value.

To interpret the model:

- A **positive coefficient** increases the probability of predicting **high risk**.
- A **negative coefficient** decreases the probability of predicting **high risk**.
- Larger absolute values indicate stronger influence on the prediction.

Because the model is wrapped inside a `Pipeline`, we need to extract the learned coefficients from the 
LogisticRegression step and map them back to the corresponding feature names created by the ColumnTransformer.

This step produces a ranked list of the most influential features and provides interpretability similar to 
what underwriters expect when reviewing automated risk scoring models.


In [6]:
import pandas as pd
import numpy as np

# Extract the fitted logistic regression model from the pipeline
log_reg_model = log_reg_clf.named_steps["model"]

# Get the feature names produced by the ColumnTransformer
feature_names_num = numeric_cols  # numeric names are unchanged

# For categorical, get names from the one hot encoder
ohe = log_reg_clf.named_steps["preprocess"].named_transformers_["cat"].named_steps["encoder"]
feature_names_cat = ohe.get_feature_names_out(categorical_cols)

# Combine all feature names
all_feature_names = np.concatenate([feature_names_num, feature_names_cat])

# Get coefficients from logistic regression
coefficients = log_reg_model.coef_.flatten()

# Create a DataFrame for easier viewing
coef_df = pd.DataFrame({
    "feature": all_feature_names,
    "coefficient": coefficients
})

# Sort by absolute value (strongest influence first)
coef_df_sorted = coef_df.reindex(coef_df["coefficient"].abs().sort_values(ascending=False).index)

print("Top features influencing high risk prediction:")
coef_df_sorted.head(15)


Top features influencing high risk prediction:


| | feature | coefficient |
|---|---|---|
| 10 | sector_insurance | 0.550773 |
| 2 | summary_length_words | 0.45856 |
| 15 | risk_type_esg | -0.449608 |
| 3 | category_esg | -0.449608 |
| 13 | sector_travel | -0.44015 |
| 18 | risk_type_travel | -0.44015 |
| 6 | region_global | -0.397526 |
| 7 | region_north_america | 0.397472 |
| 17 | risk_type_property | 0.338931 |
| 4 | category_incident | 0.338931 |


## Step 7 · Random forest comparison

Logistic regression gives a simple, linear baseline that is easy to interpret, but it can only model a linear 
relationship between the features and the log odds of the high risk target.

In this step we train a **RandomForestClassifier** using the same preprocessing pipeline and compare its 
performance with the logistic regression model.

Random forests:

- are ensembles of decision trees trained on bootstrapped samples of the data  
- can capture non linear interactions between features  
- are less sensitive to scaling than logistic regression  
- provide a built in measure of feature importance

We keep the same `preprocessor` from earlier and only swap the final classifier inside a new `Pipeline`.  
We then:

- fit the random forest on the training data  
- evaluate it on the test data using the same metrics as before  
- inspect the feature importances to see which inputs drive the model

The goal is not to overfit this very small dataset, but to show that the pipeline design makes it easy to 
compare linear and non linear models in a clean, leak free way.


In [7]:
from sklearn.ensemble import RandomForestClassifier

# Build full modelling pipeline: preprocessing + random forest
rf_clf = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        (
            "model",
            RandomForestClassifier(
                n_estimators=200,
                random_state=42,
                class_weight="balanced",
            ),
        ),
    ]
)

# Fit on training data
rf_clf.fit(X_train, y_train)

# Predict on test data
y_test_pred_rf = rf_clf.predict(X_test)
y_test_proba_rf = rf_clf.predict_proba(X_test)[:, 1]

# Evaluate using the same metrics as logistic regression
cm_rf = confusion_matrix(y_test, y_test_pred_rf)

print("Random forest · confusion matrix (rows = true, columns = predicted):")
print(cm_rf)

precision_rf = precision_score(y_test, y_test_pred_rf, zero_division=0)
recall_rf = recall_score(y_test, y_test_pred_rf, zero_division=0)
f1_rf = f1_score(y_test, y_test_pred_rf, zero_division=0)

try:
    roc_auc_rf = roc_auc_score(y_test, y_test_proba_rf)
except ValueError:
    roc_auc_rf = None

print("\nRandom forest · test metrics:")
print(f"Precision: {precision_rf:.3f}")
print(f"Recall:    {recall_rf:.3f}")
print(f"F1 score:  {f1_rf:.3f}")

if roc_auc_rf is not None:
    print(f"ROC AUC:   {roc_auc_rf:.3f}")
else:
    print("ROC AUC:   not defined (only one class present in y_test)")

# Feature importances from the random forest
rf_model = rf_clf.named_steps["model"]
rf_importances = rf_model.feature_importances_

rf_importance_df = pd.DataFrame({
    "feature": all_feature_names,
    "importance": rf_importances,
}).sort_values("importance", ascending=False)

print("\nTop features by random forest importance:")
rf_importance_df.head(15)


Random forest · confusion matrix (rows = true, columns = predicted):
[[1 2]
 [0 2]]

Random forest · test metrics:
Precision: 0.500
Recall:    1.000
F1 score:  0.667
ROC AUC:   0.500

Top features by random forest importance:


| | feature | importance |
|---|---|---|
| 10 | sector_insurance | 0.138171 |
| 2 | summary_length_words | 0.116402 |
| 7 | region_north_america | 0.088218 |
| 18 | risk_type_travel | 0.068599 |
| 13 | sector_travel | 0.063816 |
| 6 | region_global | 0.062995 |
| 0 | num_risk_factors | 0.060623 |
| 3 | category_esg | 0.053604 |
| 16 | risk_type_liability | 0.047146 |
| 20 | time_horizon_not_specified | 0.039542 |
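
The importances above are impurity based and are reported per one hot encoded column. A complementary view is permutation importance, which shuffles one original input column at a time and measures how much the score drops. The sketch below is a minimal illustration on toy data with scikit-learn's `permutation_importance`; the data, the simplified pipeline, and the choice to score on the fitted rows are assumptions made for brevity. In the notebook it would be applied to `rf_clf` on held out rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Toy data: the label depends only on category; num_risk_factors is pure noise.
toy_X = pd.DataFrame(
    {
        "category": ["incident", "esg", "policy"] * 8,
        "num_risk_factors": rng.integers(10, 30, size=24),
    }
)
toy_y = (toy_X["category"] == "incident").astype(int)

toy_rf = Pipeline(
    steps=[
        ("preprocess", ColumnTransformer(
            transformers=[
                ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
            ],
            remainder="passthrough",  # keep the numeric column unchanged
        )),
        ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
    ]
).fit(toy_X, toy_y)

# Shuffle each raw input column 10 times and record the drop in accuracy.
result = permutation_importance(
    toy_rf, toy_X, toy_y, n_repeats=10, random_state=42, scoring="accuracy"
)

perm_df = pd.DataFrame(
    {
        "column": toy_X.columns,
        "mean_importance": result.importances_mean,
        "std": result.importances_std,
    }
).sort_values("mean_importance", ascending=False)

print(perm_df)
```

Because the whole pipeline is passed in, the importance is attributed to the raw columns (for example `category`) and not split across their one hot encoded levels, which is often easier to discuss with an underwriter.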


## Step 8 · Underwriter style model explanation

Machine learning models in insurance must be explainable to underwriters.  
The goal is not only to make predictions, but also to provide clear reasons that support the risk decision.

Below is a high level, underwriter friendly interpretation of the model’s behaviour, based on the feature
importance results from both logistic regression and the random forest model.

### Key drivers of high risk
The logistic regression coefficients point to the following indicators of higher risk. The random forest 
importances are unsigned, so they can confirm that a feature matters but not the direction of its effect:

- **Incident related categories**  
  `category_incident` has a positive coefficient (0.339). This is expected by construction, since the label rule 
  in Step 1 marks every incident document as high risk. It also aligns with real underwriting practice, where 
  incident narratives often refer to loss events or operational failures.

- **Property and liability risk types**  
  `risk_type_property` has a positive coefficient (0.339), identical to that of `category_incident`, which suggests 
  the two flags coincide in the training rows. `risk_type_liability` appears among the top random forest importances. 
  Cyber does not appear in either top feature table, so the outputs shown here do not support a statement about it.

- **Longer narrative summaries**  
  `summary_length_words` has a positive coefficient (0.459) and ranks second in both models. In practice, severe 
  incidents or complex risk contexts often require more detailed descriptions, which the model captures through text length.

- **Insurance sector**  
  `sector_insurance` is the top ranked feature in both models, with a positive coefficient (0.551). Marine does 
  not appear in either top feature table.

### Key drivers of low risk
The logistic regression coefficients also show clear low risk signals:

- **ESG documents and ESG risk types**  
  These typically contain forward looking sustainability narratives rather than immediate loss events, 
  so the model assigns lower risk scores.

- **Travel related sectors and risk types**  
  Travel insurance documents in this dataset reflected more standard, low severity risks, resulting in 
  lower predicted risk.

- **Global region**  
  Documents tagged as global were more general and often described broad risk principles instead of 
  specific high severity incidents.

### Overall interpretation
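The linear and non linear models agree on the top ranked features: `sector_insurance` and `summary_length_words` lead both lists.  
The signs of the logistic regression coefficients are consistent with underwriting intuition:

- Incident events and the property and insurance related documents in this sample push risk upwards.  
- ESG narratives, general summaries, and travel related content push risk downwards.  

These patterns should be read as illustrative. The models were trained on nine documents and tested on five, 
both reached a test ROC AUC of 0.500 (no better than chance ranking), and the target is derived from `category` 
and `risk_type`, which are also input features.

Within those limits, the results suggest that the dataset extracted using the GenAI MapReduce pipeline contains 
consistent, usable fields, and that this kind of explanation can be produced directly from the fitted pipeline. 
Such explanations help build trust with underwriters, and the same workflow could support early triage or document 
prioritisation tasks once it is validated on a larger portfolio with outcome based labels.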
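
For a single document, a linear model also allows a per feature breakdown of the score, which is the kind of reason list an underwriter can review. The sketch below uses four coefficients rounded from the table in Step 6. The intercept, the preprocessed row values, and the variable names are assumptions made for illustration, so the resulting probability is not a notebook result.

```python
import numpy as np
import pandas as pd

# Coefficients rounded from the Step 6 table; only four features are used here.
feature_names = [
    "summary_length_words",
    "category_incident",
    "category_esg",
    "region_global",
]
coefficients = np.array([0.46, 0.34, -0.45, -0.40])
intercept = 0.0  # assumed; the fitted intercept is not shown in the notebook

# Hypothetical preprocessed row: a long incident summary tagged as global.
row = np.array([1.2, 1.0, 0.0, 1.0])

# Each feature contributes coefficient * value to the log odds.
contributions = coefficients * row
log_odds = intercept + contributions.sum()
probability = 1.0 / (1.0 + np.exp(-log_odds))

reasons = pd.DataFrame(
    {"feature": feature_names, "value": row, "contribution": contributions}
).sort_values("contribution", key=np.abs, ascending=False)

print(reasons)
print(f"log odds = {log_odds:.3f}, P(high risk) = {probability:.3f}")
```

Sorting by absolute contribution puts the strongest reasons first, whether they push the score up or down.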


## Step 9 · Project summary 

This project demonstrates a complete GenAI plus machine learning pipeline designed for insurance style document 
understanding and risk classification. It has three linked stages that reflect a realistic insurtech workflow.

### 1. GenAI extraction with MapReduce (Notebook 01)
- Loaded a diverse synthetic corpus of insurance policies, ESG narratives, and incident reports.  
- Applied chunk based extraction using a Groq hosted LLaMA model.  
- Used a strict JSON schema to extract entity level risk fields reliably.  
- Implemented MapReduce logic to merge chunk outputs into document level records.  
- Validated all fields, enforced controlled vocabularies, and saved a consistent dataset.

### 2. EDA and normalisation (Notebook 02)
- Inspected distributions of categories, regions, sectors, and risk types.  
- Identified missingness patterns and cleaned inconsistent values.  
- Converted LLM generated lists using a robust regex method.  
- Created interpretable numeric features such as number of risk factors and summary length.  
- Produced a fully cleaned dataset ready for modelling.

### 3. Feature engineering and classification (Notebook 03)
- Defined a high risk target based on incident documents and high severity risk types.  
- Split the data into train and test sets before fitting any preprocessing to avoid leakage, using stratification to preserve class proportions.  
- Built preprocessing pipelines with imputers, scalers, and one hot encoding inside a ColumnTransformer.  
- Combined preprocessing with classifiers using sklearn Pipelines for a safe and production style workflow.  
- Trained and evaluated logistic regression and random forest models.  
- Computed confusion matrix, precision, recall, F1, and ROC AUC to assess performance.  
- Interpreted feature importance using both linear coefficients and tree based importance scores.  
- Provided an underwriter friendly explanation of the risk drivers.

### Key findings
- Incident documents, the property risk type, and the insurance sector increased predicted risk in the logistic regression; cyber did not appear among the top features.  
- ESG documents, travel related content, and general global narratives tended to reduce predicted risk.  
- Longer summaries (`summary_length_words`) also signalled higher risk.  
- Both models ranked the same two features highest and produced identical test metrics (precision 0.500, recall 1.000, ROC AUC 0.500) on five test documents, so the patterns are suggestive, not conclusive.

### Why this matters in an insurtech context
The final workflow mirrors how production systems ingest documents, extract structured fields, perform EDA, 
engineer features, and produce transparent risk scores that can support underwriters.  
The project highlights correct practices such as split first, no leakage, modular preprocessing, class imbalance 
handling, and clear explainability.  
It also shows how GenAI can ac
