# Feature Engineering 

Feature engineering is the process of transforming raw data into meaningful features that can be used as input for advanced visualisations or machine learning algorithms.

It involves selecting, creating, and transforming features so that the dataset better captures the patterns you want to analyse.

Poorly designed features can distort the dataset and lead to misleading visualisations or weaker model performance.


## Types of Feature Engineering

* **Handling Missing Values**

    Filling missing values with appropriate strategies, e.g., mean, median, or constant values.

* **Encoding Categorical Variables**

    Converting categorical data into numeric form, such as one-hot encoding or label encoding. This is usually only needed if you are building a model.

* **Binning Numeric Variables**

    Grouping continuous data into bins or categories to simplify the representation.

* **Feature Scaling**

    Scaling features to bring them to a similar range, e.g., Min-Max scaling or Standard scaling.

* **Creating New Features**

    Generating new features by combining or transforming existing ones.

* **Handling Outliers**

    Managing extreme values that can affect model performance.

* **Feature Joining**

    Combining two or more existing columns into a single derived feature, e.g., total premiums paid calculated from the annual premium and the customer's tenure.
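Encoding, scaling, and outlier handling from the list above can be sketched on a small made-up table. The column names and values here are hypothetical and are not taken from the claims dataset, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "premium": [800.0, 1200.0, 1000.0, 5000.0],
    "severity": ["Minor", "Major", "Minor", "Total Loss"],
})

# Encoding: one-hot encode the categorical column
encoded = pd.get_dummies(toy["severity"], prefix="severity")

p = toy["premium"]

# Min-Max scaling: rescale values to the range [0, 1]
toy["premium_minmax"] = (p - p.min()) / (p.max() - p.min())

# Standard scaling: zero mean and unit variance (population std, ddof=0)
toy["premium_standard"] = (p - p.mean()) / p.std(ddof=0)

# Outliers: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = p.quantile(0.25), p.quantile(0.75)
iqr = q3 - q1
toy["premium_is_outlier"] = (p < q1 - 1.5 * iqr) | (p > q3 + 1.5 * iqr)

print(encoded)
print(toy)
```

In this toy table only the 5000.0 premium is flagged. Whether a flagged value should be removed, capped, or kept depends on the analysis.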

## Imports and Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df = pd.read_excel("/kaggle/input/insurance-claims/Cleaned_Insurance_Claims.xlsx")

In [None]:
null_counts = df.isnull().sum()
null_counts

In [None]:
print(df.head())

### Missing Values

In [None]:
print(df.columns.tolist())

In [None]:
df_new = df.drop('policy_number', axis=1)

In [None]:
df_new.head()
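If the null counts had shown gaps, one possible approach is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below uses a hypothetical toy frame, because which columns of the claims data (if any) contain missing values is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical gap
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 55.0],
    "state": ["OH", "IN", None, "OH"],
})

filled = toy.copy()

# Numeric column: fill with the median, which is robust to extreme values
filled["age"] = filled["age"].fillna(filled["age"].median())

# Categorical column: fill with the most frequent value (the mode)
filled["state"] = filled["state"].fillna(filled["state"].mode().iloc[0])

print(filled)
print(filled.isnull().sum())
```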

### Binning Numeric Data

In [None]:
df_new.describe()

In [None]:
# Choose the column for the histogram
column_name = 'age'

# Plot the histogram
plt.hist(df[column_name], bins=3, edgecolor='black')

# Add labels and title
plt.xlabel(column_name)
plt.ylabel('Frequency')
plt.title(f'Histogram of {column_name}')

# Display the histogram
plt.show()

In [None]:
bin_edges = [0, 30, 55, 100]  # Define the bin edges
bin_labels = ['Young Adult', 'Middle Aged', 'Elderly']  # Corresponding labels for each bin

# Create a new column based on the bin labels
df_new['ages_category'] = pd.cut(df_new['age'], bins=bin_edges, labels=bin_labels)

In [None]:
df_new.head()

In [None]:
bin_edges_customer = [0, 25, 150, 500]  # Define the bin edges
bin_labels_customer = ['New Client', 'Established Client', 'Long-Term Client']  # Corresponding labels for each bin

# Create a new column based on the bin labels
# include_lowest=True keeps customers with 0 months in the first bin instead of leaving them as NaN
df_new['customer_category'] = pd.cut(df_new['months_as_customer'], bins=bin_edges_customer, labels=bin_labels_customer, include_lowest=True)

In [None]:
df_new.head()
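By default `pd.cut` builds intervals that are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin. A small sketch with made-up tenure values (not taken from the dataset) shows the difference `include_lowest=True` makes:

```python
import pandas as pd

# Hypothetical tenure values in months, including the edge cases 0 and 25
months = pd.Series([0, 25, 26, 150, 151])
edges = [0, 25, 150, 500]
labels = ["New Client", "Established Client", "Long-Term Client"]

# Default behaviour: bins are (0, 25], (25, 150], (150, 500]
default_cut = pd.cut(months, bins=edges, labels=labels)

# include_lowest=True also places the value 0 in the first bin
inclusive_cut = pd.cut(months, bins=edges, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.tolist())
```

Values above the highest edge become `NaN` in both cases, so it is worth checking `df_new.describe()` before choosing the edges.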

## Creating New Features

In [None]:
df_new["Contract Years"] = df_new["months_as_customer"]/12

In [None]:
df_new.head()

## Feature Joining

In [None]:
df_new['total_premiums_paid'] = (df_new['policy_annual_premium']/12) * df_new['months_as_customer']

In [None]:
df_new.head()

In [None]:
df_new['net_value_of_customer'] = df_new['total_premiums_paid'] - df_new['total_claim_amount']

In [None]:
df_new.head()
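To sanity-check the two formulas above, they can be run on a couple of made-up rows where the arithmetic is easy to follow by hand. The values are hypothetical. Note that the calculation assumes the annual premium stayed constant over the whole tenure.

```python
import pandas as pd

# Hypothetical customers; column names match the ones used above
toy = pd.DataFrame({
    "policy_annual_premium": [1200.0, 1500.0],
    "months_as_customer": [30, 120],
    "total_claim_amount": [5000.0, 4000.0],
})

# Monthly premium multiplied by tenure in months
toy["total_premiums_paid"] = (toy["policy_annual_premium"] / 12) * toy["months_as_customer"]

# Premiums received minus claims paid out
toy["net_value_of_customer"] = toy["total_premiums_paid"] - toy["total_claim_amount"]

print(toy)
```

The first customer has paid 100 × 30 = 3000 in premiums against a 5000 claim, giving a net value of -2000. The second has paid 125 × 120 = 15000 against a 4000 claim, giving 11000.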

## Saving the CSV for later

In [None]:
# index=False avoids writing the row index as an extra unnamed column
df_new.to_csv('Advanced Features Claims Data.csv', index=False)
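The `index` argument of `to_csv` controls whether the row index is written as an extra column, which matters when the file is read back later. The sketch below uses an in-memory buffer and a hypothetical two-row frame, so no file is created:

```python
import io
import pandas as pd

# Hypothetical frame standing in for the real data
toy = pd.DataFrame({"age": [25, 40]})

# Default: the row index is written and comes back as an unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```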

## Go wild

Go out and see what other features you can create that will be useful for our visualisations.

In [None]:
import pandas as pd

# Feature 1: Age Groups
# The open upper edge lets the '85+' label cover every age above 84
bins = [17, 24, 34, 44, 54, 64, 74, 84, float('inf')]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-74', '75-84', '85+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

# Feature 2: Policy Bind Year and Month
df['policy_bind_date'] = pd.to_datetime(df['policy_bind_date'])
df['policy_bind_year'] = df['policy_bind_date'].dt.year
df['policy_bind_month'] = df['policy_bind_date'].dt.month_name()

# Feature 3: Incident Timelines
df['incident_date'] = pd.to_datetime(df['incident_date'])
df['days_to_incident'] = (df['incident_date'] - df['policy_bind_date']).dt.days

# Feature 4: Premium Ranges
df['premium_range'] = pd.qcut(df['policy_annual_premium'], 4, labels=['Low', 'Medium', 'High', 'Very High'])

# Feature 5: Claims to Premium Ratio
df['claim_premium_ratio'] = df['total_claim_amount'] / df['policy_annual_premium']

# Feature 6: Incident Severity Flags (one-hot encoding of incident_severity)
severity_dummies = pd.get_dummies(df['incident_severity'], prefix='severity')
df = pd.concat([df, severity_dummies], axis=1)

print(df.head())
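The premium ranges above use `pd.qcut`, which splits on quantiles so that each group holds roughly the same number of rows, whereas `pd.cut` with an integer splits the value range into equal-width bins. A sketch with made-up, skewed premium values (not from the dataset) shows how differently the two behave:

```python
import pandas as pd

# Hypothetical skewed premiums: seven similar values and one very large one
premiums = pd.Series([100, 110, 120, 130, 140, 150, 160, 1000])
labels = ["Low", "Medium", "High", "Very High"]

# Quantile-based bins: each label receives the same number of rows
by_quantile = pd.qcut(premiums, 4, labels=labels)

# Equal-width bins: the single large value stretches the range
by_width = pd.cut(premiums, 4, labels=labels)

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

With these values the quantile bins each hold two rows, while the equal-width bins put seven rows in "Low" and one in "Very High". Which is preferable depends on whether the chart should compare equal-sized groups or fixed value ranges.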