# Feature Engineering

In [189]:
import pandas as pd 
import numpy as np 
import os
import sys

# Get the absolute path to the 'util' directory (assuming 'eda.ipynb' is one level down from 'your_project')
data_dir = os.path.abspath(os.path.join('data'))

# Add the 'util' directory to sys.path if it's not already there
if data_dir not in sys.path:
    sys.path.append(data_dir)
data_dir

# Get the absolute path to the 'util' directory (assuming 'eda.ipynb' is one level down from 'your_project')
util_dir = os.path.abspath(os.path.join('utils'))

# Add the 'util' directory to sys.path if it's not already there
if util_dir not in sys.path:
    sys.path.append(util_dir)

In [190]:
df = pd.read_csv(data_dir+'/supervised_leaned_eda.csv')
df.head()

Unnamed: 0,ZIP Code,Cov C Amount Weighted Avg,Avg Fire Risk Score,Number of Negligible Fire Risk Exposure,Number of Low Fire Risk Exposure,Number of Moderate Fire Risk Exposure,Number of High Fire Risk Exposure,Number of Very High Fire Risk Exposure,Earned Premium 2020,Earned Exposure 2020,Non-CAT Cov A Smoke - Incurred Losses,Non-CAT Cov A Smoke - Number of Claims,Non-CAT Cov C Fire - Incurred Losses,Non-CAT Cov C Fire - Number of Claims,Non-CAT Cov C Smoke - Incurred Losses,Non-CAT Cov C Smoke - Number of Claims,CAT Cov A Smoke - Incurred Losses,CAT Cov A Smoke - Number of Claims,Earned Premium 2021,Earned Exposure 2021
0,90001,174339.07,0.32,884,407,0,0,0,982193,1291,9320,1,40267,1,5070,1,86803,13,1076066,1345
1,90002,167880.4,0.33,1270,614,0,0,0,1400005,1884,1686,1,20720,1,542,1,27666,5,1523488,1939
2,90003,177789.87,0.31,1195,535,1,0,0,1424103,1731,0,0,128964,2,0,0,49203,6,1537173,1769
3,90004,635509.87,0.45,962,643,56,0,0,3992219,1661,0,0,0,0,0,0,5186,1,4428387,1675
4,90005,852256.91,0.44,224,127,16,0,0,1263229,368,0,0,0,0,0,0,0,0,1377640,379


In [191]:
df.columns

Index(['ZIP Code', 'Cov C Amount Weighted Avg', 'Avg Fire Risk Score',
       'Number of Negligible Fire Risk Exposure',
       'Number of Low Fire Risk Exposure',
       'Number of Moderate Fire Risk Exposure',
       'Number of High Fire Risk Exposure',
       'Number of Very High Fire Risk Exposure', 'Earned Premium 2020',
       'Earned Exposure 2020', 'Non-CAT Cov A Smoke - Incurred Losses',
       'Non-CAT Cov A Smoke - Number of Claims',
       'Non-CAT Cov C Fire - Incurred Losses',
       'Non-CAT Cov C Fire - Number of Claims',
       'Non-CAT Cov C Smoke - Incurred Losses',
       'Non-CAT Cov C Smoke - Number of Claims',
       'CAT Cov A Smoke - Incurred Losses',
       'CAT Cov A Smoke - Number of Claims', 'Earned Premium 2021',
       'Earned Exposure 2021'],
      dtype='object')

In [192]:
df["Total CAT Losses"] =  df["CAT Cov A Smoke - Incurred Losses"]
df["Total Non-CAT Losses"] = df["Non-CAT Cov A Smoke - Incurred Losses"] + df["Non-CAT Cov C Fire - Incurred Losses"]+df["Non-CAT Cov C Smoke - Incurred Losses"]
df.rename(columns = { "CAT Cov A Smoke - Number of Claims":"Total CAT Claims"}, inplace= True)
df["Total Non-CAT Claims"] = df["Non-CAT Cov A Smoke - Number of Claims"]+df["Non-CAT Cov C Fire - Number of Claims"]+df["Non-CAT Cov C Smoke - Number of Claims"]
df['Avg Premium'] = df['Earned Premium 2020'] / df['Earned Exposure 2020'].replace(0, np.nan)
df['Avg CAT Loss'] = df['Total CAT Losses'] / df['Earned Exposure 2020'].replace(0, np.nan)
df['Avg Non-CAT Loss'] = df['Total Non-CAT Losses'] / df['Earned Exposure 2020'].replace(0, np.nan)
df['Avg CAT Claims'] = df['Total Non-CAT Claims'] / df['Earned Exposure 2020'].replace(0, np.nan)
df['Avg Non-CAT Claims'] = df['Total CAT Claims'] / df['Earned Exposure 2020'].replace(0, np.nan)
df['Avg Premium 2021'] = df['Earned Premium 2021'] / df['Earned Exposure 2021'].replace(0, np.nan)
df['Loss Ratio'] = (df['Total CAT Losses'] + df['Total Non-CAT Losses'])/df['Earned Premium 2020'].replace(0, np.nan)
df['Claim Frequency'] = (df['Total Non-CAT Claims']+df['Total CAT Claims'])/df['Earned Exposure 2020'].replace(0, np.nan)
df['Average Claim Severity'] = (df['Total CAT Losses'] + df['Total Non-CAT Losses'])/(df['Total Non-CAT Claims']+df['Total CAT Claims']).replace(0, np.nan)
# Replace inf with NaN, then drop all NaN values
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)

In [193]:
df.shape


(1106, 32)

In [194]:
import cleaning
import importlib
importlib.reload(cleaning)
from cleaning import highly_correlated_features

features_to_drop = highly_correlated_features(df, "Avg Premium 2021")
features_to_drop

Earned Premium 2021 Earned Premium 2020 0.4179693431497019 0.4150874277778271
feature_to_drop: Earned Premium 2020, predictive score Earned Premium 2020: 0.0, predictive score Earned Premium 2021: 0.0
Earned Exposure 2021 Earned Exposure 2020 -0.09794124758561937 -0.0997385613890001
feature_to_drop: Earned Exposure 2020, predictive score Earned Exposure 2020: 0.0, predictive score Earned Exposure 2021: 0.0
Total CAT Losses CAT Cov A Smoke - Incurred Losses -0.0023631761243800287 -0.0023631761243800287
feature_to_drop: Total CAT Losses, predictive score CAT Cov A Smoke - Incurred Losses: 0.0, predictive score Total CAT Losses: 0.0
Total Non-CAT Losses Non-CAT Cov C Fire - Incurred Losses -0.00783632955344162 0.004170494959775798
feature_to_drop: Total Non-CAT Losses, predictive score Non-CAT Cov C Fire - Incurred Losses: 0.0, predictive score Total Non-CAT Losses: 0.0
Total Non-CAT Claims Non-CAT Cov A Smoke - Number of Claims -0.0866460693888748 -0.0768119932098768
feature_to_drop: Tot

['Earned Premium 2020',
 'Earned Exposure 2020',
 'Total CAT Losses',
 'Total Non-CAT Losses',
 'Total Non-CAT Claims',
 'Total Non-CAT Claims',
 'Loss Ratio']

In [195]:
df.drop(columns=features_to_drop, inplace=True)
df.shape

(1106, 26)

Observation:

We have a group of newly engineered features that has hight correlation with each other.

Impacts:

Highly correlated features will be dropped. The features that are dropped are Earned Premium 2020',
 'Earned Exposure 2020',
 'Total CAT Losses',
 'Total Non-CAT Losses',
 'Total Non-CAT Claims',
 'Total Non-CAT Claims',
 'Loss Ratio'. Now, the total features reduced from __32 to 26__ (6)

## Exploratory Data Analysis for the newly engineered features

In [183]:
df.columns

Index(['ZIP Code', 'Cov C Amount Weighted Avg', 'Avg Fire Risk Score',
       'Number of Negligible Fire Risk Exposure',
       'Number of Low Fire Risk Exposure',
       'Number of Moderate Fire Risk Exposure',
       'Number of High Fire Risk Exposure',
       'Number of Very High Fire Risk Exposure',
       'Non-CAT Cov A Smoke - Incurred Losses',
       'Non-CAT Cov A Smoke - Number of Claims',
       'Non-CAT Cov C Fire - Incurred Losses',
       'Non-CAT Cov C Fire - Number of Claims',
       'Non-CAT Cov C Smoke - Incurred Losses',
       'Non-CAT Cov C Smoke - Number of Claims',
       'CAT Cov A Smoke - Incurred Losses', 'Total CAT Claims',
       'Earned Premium 2021', 'Earned Exposure 2021', 'Avg Premium',
       'Avg CAT Loss', 'Avg Non-CAT Loss', 'Avg CAT Claims',
       'Avg Non-CAT Claims', 'Avg Premium 2021', 'Claim Frequency',
       'Average Claim Severity'],
      dtype='object')

In [184]:
# Features that need transformation
# Make you're not running this twice, otherwise you endup with the same operation being performed twice
skewed_features=['Avg Fire Risk Score',
 'Number of Moderate Fire Risk Exposure',
 'Number of High Fire Risk Exposure',
 'Number of Very High Fire Risk Exposure',
 'Avg Non-CAT Loss',
 'Average Claim Severity']


for feature in skewed_features:
    df[feature] = np.sqrt(df[feature])


df.head()

Unnamed: 0,ZIP Code,Cov C Amount Weighted Avg,Avg Fire Risk Score,Number of Negligible Fire Risk Exposure,Number of Low Fire Risk Exposure,Number of Moderate Fire Risk Exposure,Number of High Fire Risk Exposure,Number of Very High Fire Risk Exposure,Non-CAT Cov A Smoke - Incurred Losses,Non-CAT Cov A Smoke - Number of Claims,...,Earned Premium 2021,Earned Exposure 2021,Avg Premium,Avg CAT Loss,Avg Non-CAT Loss,Avg CAT Claims,Avg Non-CAT Claims,Avg Premium 2021,Claim Frequency,Average Claim Severity
0,90001,174339.07,0.565685,884,407,0.0,0.0,0.0,9320,1,...,1076066,1345,760.800155,67.237026,6.506685,0.002324,0.01007,800.049071,0.012393,94.027921
1,90002,167880.4,0.574456,1270,614,0.0,0.0,0.0,1686,1,...,1523488,1939,743.102442,14.684713,3.490053,0.001592,0.002654,785.708097,0.004246,79.54087
2,90003,177789.87,0.556776,1195,535,1.0,0.0,0.0,0,0,...,1537173,1769,822.705373,28.42461,8.631489,0.001155,0.003466,868.950254,0.004622,149.234296
3,90004,635509.87,0.67082,962,643,7.483315,0.0,0.0,0,0,...,4428387,1675,2403.503311,3.122216,0.0,0.0,0.000602,2643.813134,0.000602,72.013888
5,90006,316286.33,0.6,220,121,1.0,0.0,0.0,0,0,...,482004,358,1257.560117,0.0,19.973883,0.008798,0.0,1346.379888,0.008798,212.950699


In [None]:
import ppscore as pps
import plotly.express as px
import numpy as np
import importlib
import feature_engineering
importlib.reload(feature_engineering)
from feature_engineering import feature_stats_histogram


target = "Avg Premium 2021"
column =target
for column in df.columns:
    stats_dict = feature_stats_histogram(df,column, target)



column : ZIP Code
Predictive Power Score: 0.0000
Correlation with Target: -0.08165867901721742
Skewness of the feature: -0.20333050915253742


column : Cov C Amount Weighted Avg
Predictive Power Score: 0.2031
Correlation with Target: 0.8563992674065493
Skewness of the feature: 2.9631433640416596


column : Avg Fire Risk Score
Predictive Power Score: 0.0666
Correlation with Target: 0.4587724251823534
Skewness of the feature: 1.1260779198950737


column : Number of Negligible Fire Risk Exposure
Predictive Power Score: 0.0000
Correlation with Target: -0.2629868945556229
Skewness of the feature: 1.202544062095347


column : Number of Low Fire Risk Exposure
Predictive Power Score: 0.0000
Correlation with Target: -0.05142704416795091
Skewness of the feature: 1.423168676274592


column : Number of Moderate Fire Risk Exposure
Predictive Power Score: 0.0000
Correlation with Target: 0.3894136149648377
Skewness of the feature: 1.0571881796926592


column : Number of High Fire Risk Exposure
Predictive Power Score: 0.0000
Correlation with Target: 0.4137711869721171
Skewness of the feature: 1.8791545663927702


column : Number of Very High Fire Risk Exposure
Predictive Power Score: 0.0126
Correlation with Target: 0.396118572759838
Skewness of the feature: 2.453007618026422


column : Non-CAT Cov A Smoke - Incurred Losses
Predictive Power Score: 0.0000
Correlation with Target: -0.050968384886013886
Skewness of the feature: 6.275332283948447


column : Non-CAT Cov A Smoke - Number of Claims
Predictive Power Score: 0.0000
Correlation with Target: -0.0768119932098768
Skewness of the feature: 7.958896393321983


column : Non-CAT Cov C Fire - Incurred Losses
Predictive Power Score: 0.0000
Correlation with Target: 0.004170494959775798
Skewness of the feature: 9.482165425870583


column : Non-CAT Cov C Fire - Number of Claims
Predictive Power Score: 0.0000
Correlation with Target: -0.09374000495547573
Skewness of the feature: 3.733171261427767


column : Non-CAT Cov C Smoke - Incurred Losses
Predictive Power Score: 0.0000
Correlation with Target: -0.013224105845388622
Skewness of the feature: 8.922316494096506


column : Non-CAT Cov C Smoke - Number of Claims
Predictive Power Score: 0.0000
Correlation with Target: -0.045695120714572796
Skewness of the feature: 8.124396243089945


column : CAT Cov A Smoke - Incurred Losses
Predictive Power Score: 0.0000
Correlation with Target: -0.0023631761243800287
Skewness of the feature: 13.89354345761267


column : Total CAT Claims
Predictive Power Score: 0.0000
Correlation with Target: -0.05088715799387567
Skewness of the feature: 8.674048089562708


column : Earned Premium 2021
Predictive Power Score: 0.0000
Correlation with Target: 0.4179693431497019
Skewness of the feature: 1.694308178883162


column : Earned Exposure 2021
Predictive Power Score: 0.0000
Correlation with Target: -0.09794124758561937
Skewness of the feature: 0.8704759132151885


column : Avg Premium
Predictive Power Score: 0.8691
Correlation with Target: 0.9969536711626394
Skewness of the feature: 5.9496562349178985


column : Avg CAT Loss
Predictive Power Score: 0.0000
Correlation with Target: 0.09254176513326465
Skewness of the feature: 23.278956628875875


column : Avg Non-CAT Loss
Predictive Power Score: 0.0000
Correlation with Target: 0.0359302674837235
Skewness of the feature: 8.137562870361005


column : Avg CAT Claims
Predictive Power Score: 0.0000
Correlation with Target: 0.01114017351629082
Skewness of the feature: 11.188650567928603


column : Avg Non-CAT Claims
Predictive Power Score: 0.0000
Correlation with Target: 0.04347695746849961
Skewness of the feature: 7.947006098540332


column : Avg Premium 2021
Predictive Power Score: 1.0000
Correlation with Target: 1.0
Skewness of the feature: 5.554956752513249


column : Claim Frequency
Predictive Power Score: 0.0000
Correlation with Target: 0.03175874229734121
Skewness of the feature: 7.534408516786382


column : Average Claim Severity
Predictive Power Score: 0.0000
Correlation with Target: 0.1450797281357173
Skewness of the feature: 1.3982856122889324


Observation:
To address the highly skewed variables, we need to do the transformation to understand if there correlation gets better, if yes, those varaibles will be changed and kept as transformed variables.

Impact:
Log transformation: 'Avg Fire Risk Score','Number of Moderate Fire Risk Exposure','Number of High Fire Risk Exposure','Number of Very High Fire Risk Exposure','Avg Non-CAT Loss','Average Claim Severity'

<!-- Work on the transformation code, we nned to remove the code that adds transformation column in the dataframe -->

In [196]:
output_path = "/Users/shireen/Documents/CaliforniaWildfireAnalysis/data/supervised_feature_engineered_eda.csv"
df.to_csv(output_path, index=False)


# Feature Engineering Summary: Insurance Risk Metrics

### Objective
Derived the **exposure-adjusted risk metrics** that standardize claims, losses, and premiums across policies.

### New Features Added

| Feature                | Formula                                  | Purpose                                                                 |
|------------------------|------------------------------------------|-------------------------------------------------------------------------|
| **Avg CAT Loss**       | `Total CAT Loss / Number of Exposures`   | Measures catastrophic loss burden per unit of exposure                  |
| **Avg Non-CAT Loss**   | `Total Non-CAT Loss / Exposures`         | Quantifies non-catastrophic loss density (e.g., theft, accidents)       |
| **Avg CAT Claims**     | `Total CAT Claims / Exposures`           | Normalizes catastrophic claim frequency by exposure count               |
| **Avg Non-CAT Claims** | `Total Non-CAT Claims / Exposures`       | Standardizes non-catastrophic claim frequency                          |
| **Avg Premium 2021**   | `Total Premium 2021 / Exposures`         | Evaluates premium pricing consistency relative to coverage size         |
| **Claim Frequency**    | `Total Claims / Exposures`               | General claim rate (combines CAT and Non-CAT)                          |
| **Avg Claim Severity** | `Total Loss / Total Claims`              | Measures average cost per claim (higher = more severe claims)           |
| **Avg Fire Risk Score** | `Fire Risk Score / Exposures`              | Normalizes fire risk by exposure count to compare risk density across policies.           |

### Feature Selection Process
1. **Correlation Analysis**:
   - Removed newly derived features showing correlation > 0.85 with existing features
   - Example: Dropped `Avg Total Loss` as it correlated strongly (0.92) with `Avg CAT Loss`

2. **Skewness Treatment**:
   - Applied quadratic scaling to features with skewness > 2.0
   - Example: `log(Claim Frequency + 1)` for right-skewed claim data
   - Preserved original scaling for features where transformation reduced predictive power



