**Kaggle Competition**

<a>https://www.kaggle.com/competitions/regression-with-an-insurance-dataset/overview</a>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
# Load the data
insurance = pd.read_csv("./dataset/train.csv")

### Data Exploration

In [None]:
# Explore data
insurance.info()

In [None]:
# look at some records
insurance.head()

**Observation**: id can be marked as index here

In [None]:
# make id as the index key
insurance.set_index("id", inplace=True)
insurance.head()

In [None]:
# let's explore some summary statistics
insurance.describe()

In [None]:
# Let's look at the categorical values
insurance["Gender"].value_counts().plot(kind="bar")

**Observation**: Gender data is balanced

In [None]:
insurance["Marital Status"].value_counts().plot(kind="bar")

**Observation**: Marital Status data is balanced

In [None]:
insurance["Number of Dependents"].value_counts().plot(kind="bar")

**Observation**: Number of Dependents data is balanced

In [None]:
insurance["Education Level"].value_counts().plot(kind="bar")

**Observation**: Education Level data is balanced

In [None]:
insurance["Occupation"].value_counts().plot(kind="bar")

**Observation**: Occupation data is balanced, but a lot of values are missing

In [None]:
insurance["Location"].value_counts().plot(kind="bar")

**Observation**: Location data is balanced

In [None]:
insurance["Policy Type"].value_counts().plot(kind="bar")

**Observation**: Policy Type data is balanced

In [None]:
insurance["Customer Feedback"].value_counts().plot(kind="bar")

**Observation**: Customer Feedback data is balanced

In [None]:
insurance["Smoking Status"].value_counts().plot(kind="bar")

**Observation**: Smoking Status data is balanced

In [None]:
insurance["Exercise Frequency"].value_counts().plot(kind="bar")

**Observation**: Exercise Frequency data is balanced

In [None]:
insurance["Property Type"].value_counts().plot(kind="bar")

**Observation**: Property Type data is balanced

**The dataset is well balanced, with a few missing values. Occupation seems to be the one with the most missing values.**
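The balance and missing-value observations above were read off the bar charts. A minimal sketch of how they could be quantified instead is shown here. The `toy` frame is made-up illustration data, not the competition data; on the real notebook the same two calls would be applied to `insurance`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the `insurance` DataFrame
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Occupation": ["Employed", np.nan, np.nan, "Self-Employed"],
    "Age": [31.0, 45.0, np.nan, 52.0],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Share of each category; roughly equal shares mean the feature is balanced
gender_share = toy["Gender"].value_counts(normalize=True)

print(missing_share)
print(gender_share)
```

Note that `value_counts` skips missing values by default, which is why a bar chart can look balanced even when many rows are blank.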

In [None]:
# let's look at the distribution of the numeric data
insurance.hist(figsize=(12,8))

In [None]:
# Let's look at the correlation of the numeric features with the target
insurance.corr(numeric_only=True)["Premium Amount"]

In [None]:
sns.heatmap(insurance.corr(numeric_only=True), cmap="YlGnBu")

**Observation**

- The data is balanced across all the categorical features.
- Most features have only a few missing values; 'Occupation' is the exception, with many missing values.
- There seems to be no real correlation between the premium and the other numeric fields.
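The correlation calls above use `numeric_only=True`, so the categorical features are not covered by them. One possible complement, sketched here on made-up data, is to compare the mean target per category. The helper name `target_mean_by_category` and the `toy` values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for the example
toy = pd.DataFrame({
    "Smoking Status": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Policy Type": ["Basic", "Premium", "Basic", "Basic", "Premium", "Premium"],
    "Premium Amount": [1200.0, 800.0, 1400.0, 900.0, 1000.0, 700.0],
})

def target_mean_by_category(df, target, columns):
    """Return, for each categorical column, the mean of `target` per category."""
    return {col: df.groupby(col)[target].mean() for col in columns}

means = target_mean_by_category(toy, "Premium Amount", ["Smoking Status", "Policy Type"])
print(means["Smoking Status"])
```

If the per-category means are all close to the overall mean, the feature carries little signal on its own, which would be consistent with the weak correlations seen for the numeric fields.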

### Pre processing

##### Train test split

In [None]:
# separate the features and target value

In [7]:
X = insurance.drop(columns="Premium Amount")
y_org = insurance[["Premium Amount"]]
y = np.ravel(y_org)

In [None]:
X.head()

In [6]:
y_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 1 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   Premium Amount  1200000 non-null  float64
dtypes: float64(1)
memory usage: 9.2 MB


In [None]:
# split the dataset to train and test data

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

##### Analysing data for pre-processing

In [None]:
# Let's look at missing ages

In [None]:
round(len(X_train[X_train["Age"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Age"].isnull()]

In [None]:
X_train[X_train["Age"].isnull()]["Gender"].value_counts()

In [None]:
# Let's look at Annual Income

In [None]:
round(len(X_train[X_train["Annual Income"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Annual Income"].isnull()]

In [None]:
# Let's look at marital status
X_train[X_train["Marital Status"].isnull()]

In [None]:
round(len(X_train[X_train["Marital Status"].isnull()])/len(X_train),2)

In [None]:
# TODO: HANDLE NULL MARITAL STATUS
# For now we will drop the column. A classifier could later be used to fill the blank values.
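The TODO above suggests filling blank Marital Status values through classification. A minimal sketch of that idea follows, under several assumptions: the helper `impute_with_classifier` is hypothetical, a random forest is an arbitrary choice of classifier, the predictor columns are numeric with no missing values, and the `toy` data is invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_with_classifier(df, target, features):
    """Fill missing `target` values with predictions from a classifier
    trained on the rows where `target` is known."""
    df = df.copy()  # leave the caller's DataFrame untouched
    known = df[target].notna()
    if known.all() or not known.any():
        return df  # nothing to fill, or nothing to learn from
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(df.loc[known, features], df.loc[known, target])
    df.loc[~known, target] = clf.predict(df.loc[~known, features])
    return df

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [25, 40, 33, 58, 22, 47],
    "Number of Dependents": [0, 2, 1, 3, 0, 2],
    "Marital Status": ["Single", "Married", np.nan, "Married", "Single", np.nan],
})

filled = impute_with_classifier(toy, "Marital Status", ["Age", "Number of Dependents"])
print(filled)
```

In a real pipeline the classifier would be fitted on the training split only and reused on the test split, for the same reason any other imputer is.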

In [None]:
# Let's look at Number of Dependents

In [None]:
X_train[X_train["Number of Dependents"].isnull()]

In [None]:
round(len(X_train[X_train["Number of Dependents"].isnull()])/len(X_train),2)

In [None]:
# Let's look at Occupation

In [None]:
X_train[X_train["Occupation"].isnull()]

In [None]:
round(len(X_train[X_train["Occupation"].isnull()])/len(X_train),2)

In [None]:
# There is a considerable percentage of null values for Occupation. Let's understand the relationship a bit more. We will drop this column for now.

In [None]:
X_train[X_train["Occupation"].isnull()]

In [None]:
# Let's look at health score

In [None]:
X_train[X_train["Health Score"].isnull()]

In [None]:
round(len(X_train[X_train["Health Score"].isnull()])/len(X_train),2)

In [None]:
# Let's look at Previous Claims

In [None]:
X_train[X_train["Previous Claims"].isnull()]

In [None]:
round(len(X_train[X_train["Previous Claims"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Credit Score"].isnull()]

In [None]:
round(len(X_train[X_train["Credit Score"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Insurance Duration"].isnull()]

In [None]:
round(len(X_train[X_train["Insurance Duration"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Customer Feedback"].isnull()]

In [None]:
round(len(X_train[X_train["Customer Feedback"].isnull()])/len(X_train),2)

In [None]:
# This is a considerable amount. Assuming the typical (mean) feedback is 'Average', we will fill the missing values with 'Average'

#### Transforming

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

##### **Pre-processing with custom transformers (if you run this section, do not run the Column Transformer section)**

In [None]:
# Let's create a mean imputer for Age, Annual Income, Number of Dependents, Health Score, Previous Claims, Credit Score

In [None]:
class MeanImputer(BaseEstimator, TransformerMixin):
    """Fill missing numeric values with column means learned during fit."""
    columns = ["Age", "Annual Income", "Number of Dependents", "Health Score", "Previous Claims", "Credit Score"]

    def fit(self, X, y=None):
        # learn the means once, so the same values are reused on test data
        self.imputer_ = SimpleImputer(strategy="mean").fit(X[self.columns])
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's DataFrame
        X[self.columns] = self.imputer_.transform(X[self.columns])
        return X
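A small sketch of why an imputer should learn its statistics in `fit` rather than inside `transform`: the mean used to fill the test data should be the one computed on the training data. The `train` and `test` frames are made-up examples.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 80.0]})

# The mean of the known training ages (20, 30, 40) is learned once...
imputer = SimpleImputer(strategy="mean").fit(train)

# ...and reused for every later transform, including the test split
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

print(imputer.statistics_)   # [30.]
print(test_filled.ravel())   # [30. 80.]
```

Had the imputer been refitted on `test`, the blank would have been filled with 80, a value the model never saw during training.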

In [None]:
# column dropper. Used for Occupation, Vehicle Age, Marital Status, Policy Start Date

In [None]:
class ColumnDropperImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # return a new DataFrame instead of dropping in place on the caller's data
        return X.drop(columns=["Occupation", "Vehicle Age", "Marital Status", "Policy Start Date"])

In [None]:
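class CategoryImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # assign the result back; fillna(inplace=True) on a column selection is unreliable
        X = X.copy()
        X["Customer Feedback"] = X["Customer Feedback"].fillna("Average")
        return X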

In [None]:
# encoding the categorical features

In [None]:
class FeatureEncoder(BaseEstimator, TransformerMixin):
    features_names = ["Gender", "Customer Feedback", "Smoking Status", "Property Type", "Education Level", "Location", "Policy Type", "Exercise Frequency"]

    def fit(self, X, y=None):
        # learn the categories once so train and test data get identical columns
        self.oh_encoder_ = OneHotEncoder(sparse_output=False, handle_unknown="ignore").set_output(transform="pandas")
        self.oh_encoder_.fit(X[self.features_names])
        return self

    def transform(self, X):
        encoded = self.oh_encoder_.transform(X[self.features_names])
        # replace the categorical features with their one-hot encoded columns
        return pd.concat([X.drop(columns=self.features_names), encoded], axis=1)

In [None]:
# Let's create the preprocessing pipeline

In [None]:
X_train.info()

In [None]:
# let's drop the one blank insurance duration, keeping y_train aligned with X_train
mask = X_train["Insurance Duration"].notna().to_numpy()
X_train, y_train = X_train[mask], y_train[mask]
X_train.info()

In [None]:
preprocessing_pipeline = Pipeline([
    ("meanimputer", MeanImputer()),
    ("columndropper", ColumnDropperImputer()),
    ("categoryimputer", CategoryImputer()),
    ("featureencoder", FeatureEncoder()),
    ("scaler", StandardScaler())
], verbose=True)

In [None]:
X_train_transformed = preprocessing_pipeline.fit_transform(X_train, y_train)

In [None]:
X_train_transformed

##### **Pre-processing with Column Transformer (if you run this section, do not run the custom transformers section)**

In [11]:
from sklearn.compose import ColumnTransformer

In [12]:
preprocessing_pipeline = ColumnTransformer([
    # handle numeric features: impute and scale
    ("num_handler", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()) 
    ]), ["Age", "Annual Income", "Number of Dependents", "Health Score", "Previous Claims", "Credit Score"]),
    # handle customer feedback: Impute to 'Average' and then one hot encode
    ("cust_feedback_handler", Pipeline([
        ("const_impute", SimpleImputer(strategy="constant", fill_value="Average")),
        ("encode", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
    ]), ["Customer Feedback"]),
    # one-hot encode the remaining categorical features
    ("one_hot_encode", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), 
    ["Gender", "Smoking Status", "Property Type", "Education Level", "Location", "Policy Type", "Exercise Frequency"]),
    # Occupation, Vehicle Age, Marital Status, Policy Start Date and every other
    # column not listed above are dropped by remainder="drop"
], remainder="drop", verbose=True)

In [None]:
X_train_transformed = preprocessing_pipeline.fit_transform(X_train, y_train)

In [None]:
X_train_transformed

### Training

#### Linear Regression

In [13]:
# Let's start with a simple Linear Regression
# we will use root mean squared error as our error 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [16]:
pipeline = Pipeline([
    ("pre-processing", preprocessing_pipeline),
    ("lin_model", LinearRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
error = mean_squared_error(y_true=y_test, y_pred=y_pred)
print(f"Error : {error}")
rmse = np.sqrt(error)
print(f"RMSE : {rmse}")

[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   0.1s
[ColumnTransformer]  (2 of 3) Processing cust_feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.4s
Error : 745404.2594285938
RMSE : 863.3679745210577


In [15]:
y_org.describe()

| | Premium Amount |
|---|---|
| count | 1200000.0 |
| mean | 1102.545 |
| std | 864.9989 |
| min | 20.0 |
| 25% | 514.0 |
| 50% | 872.0 |
| 75% | 1509.0 |
| max | 4999.0 |
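Comparing the RMSE with the standard deviation of the target works because a model that always predicts the mean has an RMSE equal to that standard deviation. A minimal sketch of this baseline follows, using scikit-learn's `DummyRegressor` on synthetic data; the numbers are randomly generated and only loosely mimic the scale of Premium Amount.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic target, invented for the example
rng = np.random.default_rng(42)
y_toy = rng.normal(loc=1100.0, scale=865.0, size=10_000)
X_toy = np.zeros((len(y_toy), 1))  # DummyRegressor ignores the features

# Baseline that always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
baseline_rmse = np.sqrt(mean_squared_error(y_toy, baseline.predict(X_toy)))

# The RMSE of the mean predictor equals the population standard deviation
print(np.isclose(baseline_rmse, y_toy.std()))  # True
```

In the notebook, the same baseline could be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test` to give a reference RMSE for every later model. `describe()` reports the sample standard deviation (`ddof=1`), which is practically identical at this row count.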


**The RMSE (about 863.4) is almost equal to the standard deviation of the target (about 865.0), so the model is barely better than always predicting the mean. This is not a good model.**
- Correlation is weak. That could explain poor results.
- Feature engineering could be an option to explore.
- Should try polynomial features
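As a rough sketch of the polynomial-features idea from the list above, the example below adds degree-2 terms in front of a linear regression. The data is synthetic and deliberately built around an interaction term, so the improvement shown here says nothing about what would happen on the insurance data; the step names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: the target depends only on the product of the two features
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(500, 2))
y_toy = X_toy[:, 0] * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

# A plain linear model cannot represent the interaction
linear = LinearRegression().fit(X_toy, y_toy)

# Degree-2 features add x0^2, x0*x1 and x1^2 before the linear model
poly = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("lin_model", LinearRegression()),
]).fit(X_toy, y_toy)

linear_r2 = linear.score(X_toy, y_toy)
poly_r2 = poly.score(X_toy, y_toy)
print(f"linear R^2: {linear_r2:.3f}, polynomial R^2: {poly_r2:.3f}")
```

In the notebook, the `PolynomialFeatures` step would most naturally go inside the numeric branch of the column transformer, after imputation, since it does not accept missing values. With many one-hot columns the number of generated features grows quickly, so restricting it to the numeric columns keeps the model manageable.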