**Kaggle Competition**

<a>https://www.kaggle.com/competitions/regression-with-an-insurance-dataset/overview</a>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
# Load the data
insurance = pd.read_csv("./dataset/train.csv")

### Data Exploration

In [None]:
# Explore data
insurance.info()

In [None]:
# look at some records
insurance.head()

**Observation**: The `id` column can be used as the index.

In [None]:
# make id as the index key
insurance.set_index("id", inplace=True)
insurance.head()
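
The index can also be set while loading, which avoids the separate `set_index` call. A minimal sketch, assuming a small in-memory CSV with made-up rows in place of `./dataset/train.csv`:

```python
import io
import pandas as pd

# Made-up rows standing in for ./dataset/train.csv (illustration only)
csv_text = "id,Age,Premium Amount\n0,19,2869\n1,39,1483\n2,23,567\n"

# index_col tells read_csv to use the id column as the index directly
sample = pd.read_csv(io.StringIO(csv_text), index_col="id")
print(sample.index.name)
print(list(sample.columns))
```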

In [None]:
# Let's explore some summary statistics
insurance.describe()

In [None]:
# Let's look at the categorical features
insurance["Gender"].value_counts().plot(kind="bar")

**Observation**: Gender data is balanced

In [None]:
insurance["Marital Status"].value_counts().plot(kind="bar")

**Observation**: Marital Status data is balanced

In [None]:
insurance["Number of Dependents"].value_counts().plot(kind="bar")

**Observation**: Number of Dependents data is balanced

In [None]:
insurance["Education Level"].value_counts().plot(kind="bar")

**Observation**: Education Level data is balanced

In [None]:
insurance["Occupation"].value_counts().plot(kind="bar")

**Observation**: Occupation data is balanced. But there are lots of missing data as well

In [None]:
insurance["Location"].value_counts().plot(kind="bar")

**Observation**: Location data is balanced

In [None]:
insurance["Policy Type"].value_counts().plot(kind="bar")

**Observation**: Policy Type data is balanced

In [None]:
insurance["Customer Feedback"].value_counts().plot(kind="bar")

**Observation**: Customer Feedback data is balanced

In [None]:
insurance["Smoking Status"].value_counts().plot(kind="bar")

**Observation**: Smoking Status data is balanced

In [None]:
insurance["Exercise Frequency"].value_counts().plot(kind="bar")

**Observation**: Exercise Frequency data is balanced

In [None]:
insurance["Property Type"].value_counts().plot(kind="bar")

**Observation**: Property Type data is balanced

**The dataset is well balanced, with a few missing values. Occupation seems to be the one with the most missing values.**

In [None]:
# Let's look at the distribution of the numeric features
insurance.hist(figsize=(12,8))

In [None]:
# Let's look at the correlation of each numeric feature with the target
insurance.corr(numeric_only=True)["Premium Amount"]

In [None]:
sns.heatmap(insurance.corr(numeric_only=True), cmap="YlGnBu")

**Observation**

- The data is balanced across all the categorical features.
- Few values are missing overall; 'Occupation' is the exception, with many missing values.
- There seems to be no real correlation between Premium Amount and the other numeric fields.
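
The balance and missing-value observations above are read off bar charts; they can also be checked numerically. A rough sketch, assuming a toy frame with made-up values (only the column names come from the dataset) and a hypothetical helper `category_summary`:

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Occupation": ["Employed", None, None, "Self-Employed"],
})

def category_summary(df):
    """Return the largest class share and the missing share per column."""
    rows = {}
    for col in df.columns:
        # value_counts excludes missing values by default
        shares = df[col].value_counts(normalize=True)
        rows[col] = {
            "largest_class_share": shares.max(),
            "missing_share": df[col].isnull().mean(),
        }
    return pd.DataFrame(rows).T

summary = category_summary(toy)
print(summary)
```

A largest class share close to `1 / number_of_classes` suggests a balanced feature, while a high missing share flags columns such as Occupation.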

### Pre processing

In [5]:
# let's drop the one blank insurance duration
insurance.dropna(subset=["Insurance Duration"], inplace=True)

##### Train test split

In [6]:
# separate the features and target value

In [7]:
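# features: everything except the target column
X = insurance.drop(columns="Premium Amount")
# keep a DataFrame copy of the target for summary statistics
y_org = insurance[["Premium Amount"]]
# 1-D array of the target, as expected by scikit-learn estimators
y = np.ravel(y_org)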

In [None]:
X.head()

In [None]:
y_org.info()

In [None]:
# split the dataset to train and test data

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
X_train.info()

##### Analysing data for pre-processing

In [None]:
# Let's look at missing ages

In [None]:
round(len(X_train[X_train["Age"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Age"].isnull()]

In [None]:
X_train[X_train["Age"].isnull()]["Gender"].value_counts()

In [None]:
# Let's look at Annual Income

In [None]:
round(len(X_train[X_train["Annual Income"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Annual Income"].isnull()]

In [None]:
# Let's look at marital status
X_train[X_train["Marital Status"].isnull()]

In [None]:
round(len(X_train[X_train["Marital Status"].isnull()])/len(X_train),2)

In [None]:
# TODO: HANDLE NULL MARITAL STATUS
# For now we will drop the column. A classifier could later be used to fill in the blank values.

In [None]:
# Let's look at Number of Dependents

In [None]:
X_train[X_train["Number of Dependents"].isnull()]

In [None]:
round(len(X_train[X_train["Number of Dependents"].isnull()])/len(X_train),2)

In [None]:
# Let's look at Occupation

In [None]:
X_train[X_train["Occupation"].isnull()]

In [None]:
round(len(X_train[X_train["Occupation"].isnull()])/len(X_train),2)

In [None]:
# A considerable percentage of Occupation values is null. The relationship needs more investigation; we will drop the column for now.

In [None]:
# Let's look at health score

In [None]:
X_train[X_train["Health Score"].isnull()]

In [None]:
round(len(X_train[X_train["Health Score"].isnull()])/len(X_train),2)

In [None]:
# Let's look at Previous Claims

In [None]:
X_train[X_train["Previous Claims"].isnull()]

In [None]:
round(len(X_train[X_train["Previous Claims"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Credit Score"].isnull()]

In [None]:
round(len(X_train[X_train["Credit Score"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Insurance Duration"].isnull()]

In [None]:
round(len(X_train[X_train["Insurance Duration"].isnull()])/len(X_train),2)

In [None]:
X_train[X_train["Customer Feedback"].isnull()]

In [None]:
round(len(X_train[X_train["Customer Feedback"].isnull()])/len(X_train),2)

In [None]:
# This is a considerable share. Treating 'Average' as the middle rating, we will fill the missing values with 'Average'
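
The per-column checks above can be condensed into a single call. A minimal sketch on a toy frame with made-up values (only the column names are taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for X_train
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Customer Feedback": ["Good", None, None, "Poor"],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# isnull() gives a boolean frame; the column mean is the missing fraction
missing_share = toy.isnull().mean().round(2).sort_values(ascending=False)
print(missing_share)
```

Running the same expression on `X_train` would list every column's missing fraction at once, sorted from most to least affected.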

#### Transforming

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

##### **Pre-processing with custom transformers (If you run this section, then do not run the Column Transformer section)**

In [None]:
# Let's create a mean imputer for Age, Annual Income, Number of Dependents, Health Score, Previous Claims, Credit Score

In [None]:
class MeanImputer(BaseEstimator, TransformerMixin):
    columns = ["Age", "Annual Income", "Number of Dependents", "Health Score", "Previous Claims", "Credit Score"]

    def fit(self, X, y=None):
        # learn the column means from the data passed to fit (the training set) only
        self.imputer_ = SimpleImputer(strategy="mean").fit(X[self.columns])
        return self

    def transform(self, X):
        # work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        X[self.columns] = self.imputer_.transform(X[self.columns])
        return X

In [None]:
# Column dropper. Used for Occupation, Vehicle Age, Marital Status and Policy Start Date

In [None]:
class ColumnDropperImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # return a new frame instead of dropping in place, so X_train stays intact
        return X.drop(columns=["Occupation", "Vehicle Age", "Marital Status", "Policy Start Date"])

In [None]:
class CategoryImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # assign the result back; chained in-place fillna may not modify X
        X = X.copy()
        X["Customer Feedback"] = X["Customer Feedback"].fillna("Average")
        return X

In [None]:
# encoding the categorical features

In [None]:
class FeatureEncoder(BaseEstimator, TransformerMixin):
    features_names = ["Gender", "Customer Feedback", "Smoking Status", "Property Type", "Education Level", "Location", "Policy Type", "Exercise Frequency"]

    def fit(self, X, y=None):
        # learn the categories from the data passed to fit (the training set) only
        self.encoder_ = OneHotEncoder(sparse_output=False, handle_unknown="ignore").set_output(transform="pandas")
        self.encoder_.fit(X[self.features_names])
        return self

    def transform(self, X):
        encoded = self.encoder_.transform(X[self.features_names])
        # replace the categorical features with their one-hot encoded columns
        return pd.concat([X.drop(columns=self.features_names), encoded], axis=1)

In [None]:
# Let's create the preprocessing pipeline

In [None]:
preprocessing_pipeline = Pipeline([
    ("meanimputer", MeanImputer()),
    ("columndropper", ColumnDropperImputer()),
    ("categoryimputer", CategoryImputer()),
    ("featureencoder", FeatureEncoder()),
    ("scaler", StandardScaler())
], verbose=True)

In [None]:
X_train_transformed = preprocessing_pipeline.fit_transform(X_train, y_train)

In [None]:
X_train_transformed

##### **Pre-processing with Column Transformer (If you run this section, then do not run the Custom Transformer section)**

In [11]:
from sklearn.compose import ColumnTransformer

In [12]:
preprocessing_pipeline = ColumnTransformer([
    # handle numeric features: impute and scale
    ("num_handler", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()) 
    ]), ["Age", "Annual Income", "Number of Dependents", "Health Score", "Previous Claims", "Credit Score"]),
    # handle customer feedback: Impute to 'Average' and then one hot encode
    ("cust_feedback_handler", Pipeline([
        ("const_impute", SimpleImputer(strategy="constant", fill_value="Average")),
        ("encode", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
    ]), ["Customer Feedback"]),
    # one-hot encode the remaining categorical features
    ("one_hot_encode", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), 
    ["Gender", "Smoking Status", "Property Type", "Education Level", "Location", "Policy Type", "Exercise Frequency"]),
    # drop columns
    ("drop_columns", "drop", ["Occupation", "Vehicle Age", "Marital Status", "Policy Start Date", "Gender", "Customer Feedback", "Smoking Status", "Property Type", "Education Level", "Location", "Policy Type", "Exercise Frequency"])
], 
                                          # drop remaining columns
                                          remainder="drop",
                                          verbose=True)

In [None]:
X_train_transformed = preprocessing_pipeline.fit_transform(X_train, y_train)

In [None]:
X_train_transformed

### Training

In [40]:
from sklearn.model_selection import GridSearchCV
# we will use root mean squared error (RMSE) as our error metric
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PolynomialFeatures
# for ensemble model
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import RandomForestRegressor

In [14]:
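def display_best(y_pred, y_true, y_org):
    # report MSE and RMSE of the predictions against the true values
    error = mean_squared_error(y_pred=y_pred, y_true=y_true)
    print(f"Error: {error}")
    print(f"RMSE: {np.sqrt(error)}")
    # target summary statistics, to compare the RMSE against the std
    print(y_org.describe())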

In [30]:
import joblib
def save_model(model, model_name):
    joblib.dump(model, "./"+model_name)

#### Linear Regression

In [None]:
# Let's start with a simple Linear Regression

In [34]:
lin_pipeline = Pipeline([
    ("pre-processing", preprocessing_pipeline),
    ("lin_model", LinearRegression())
])
lin_pipeline.fit(X_train, y_train)
y_pred = lin_pipeline.predict(X_test)
display_best(y_pred, y_test, y_org)

[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   0.1s
[ColumnTransformer]  (2 of 3) Processing cust_feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.4s
Error: 744344.3652573212
RMSE: 862.7539424756754
       Premium Amount
count    1.199999e+06
mean     1.102545e+03
std      8.649992e+02
min      2.000000e+01
25%      5.140000e+02
50%      8.720000e+02
75%      1.509000e+03
max      4.999000e+03


**The RMSE (about 862.8) is almost equal to the standard deviation of the target (about 865.0), so the model is barely better than always predicting the mean. This is not a good model.**
- The features correlate only weakly with the target, which could explain the poor result.
- Feature engineering could be an option to explore.
- Polynomial features are worth trying.
- PCA is worth trying.
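
The comparison between RMSE and standard deviation rests on the fact that a model which always predicts the mean has an RMSE equal to the population standard deviation of the target. A small sketch with made-up target values; the second part applies the same idea to the figures printed above and is only approximate, since `describe()` reports the sample standard deviation over the full dataset while the RMSE is measured on the test split:

```python
import math
import statistics

# Made-up target values, for illustration only
y_true = [120.0, 450.0, 900.0, 1600.0, 3100.0]

mean_prediction = statistics.fmean(y_true)
# RMSE of a model that always predicts the mean
baseline_rmse = math.sqrt(
    sum((y - mean_prediction) ** 2 for y in y_true) / len(y_true)
)
# Population standard deviation (divides by n, like the RMSE)
population_std = statistics.pstdev(y_true)
print(baseline_rmse, population_std)

# Figures reported above: test RMSE of the linear model and std of Premium Amount
reported_rmse, reported_std = 862.7539424756754, 864.9992
# Rough share of variance explained, under the approximation noted in the lead-in
approx_r2 = 1 - (reported_rmse / reported_std) ** 2
print(round(approx_r2, 4))
```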

#### Polynomial Features

In [None]:
# This gets interesting with the Pipeline. The scaling should happen after creating the polynomial features.
# We will redo the pre-processing pipeline here.
# We will also use GridSearchCV to find the optimal degree, bias, and interaction-only flag.

In [16]:
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("poly_features", PolynomialFeatures()),
    ("scaler", StandardScaler())
])

feedback_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="Average")),
    ("encoder", OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
])


preprocessing_pipeline_poly = ColumnTransformer([
    ("num_handler", numeric_pipeline, ["Age", "Annual Income", "Number of Dependents", "Health Score", "Previous Claims", "Credit Score"]),
    ("feedback_handler", feedback_pipeline, ["Customer Feedback"]),
    ("one_hot_encode", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), 
    ["Gender", "Smoking Status", "Property Type", "Education Level", "Location", "Policy Type", "Exercise Frequency"]),
    ("drop_columns", "drop", ["Occupation", "Vehicle Age", "Marital Status", "Policy Start Date", "Gender", "Customer Feedback", "Smoking Status", "Property Type", "Education Level", "Location", "Policy Type", "Exercise Frequency"]),
], verbose=True, remainder="drop")

pipeline = Pipeline([
    ("preprocessing", preprocessing_pipeline_poly),
    ("lin_model", LinearRegression())
])



In [17]:
param_grid = {
    "preprocessing__num_handler__poly_features__degree": [3, 4, 5, 6],
    "preprocessing__num_handler__poly_features__interaction_only": [True, False],
    "preprocessing__num_handler__poly_features__include_bias": [True, False]
}
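
The grid above gets expensive quickly because the number of generated columns grows combinatorially with the degree. A sketch of the column count for the six numeric inputs, using a hypothetical helper `n_poly_terms` (not part of scikit-learn) that applies the standard counting formulas for a polynomial expansion:

```python
from math import comb

n_inputs = 6  # numeric columns fed to PolynomialFeatures in the grid above

def n_poly_terms(n, degree, interaction_only=False, include_bias=True):
    """Count the output columns of a polynomial expansion of n inputs."""
    if interaction_only:
        # products of up to `degree` distinct inputs (plus the constant term)
        total = sum(comb(n, k) for k in range(0, min(degree, n) + 1))
    else:
        # all monomials of total degree <= degree (plus the constant term)
        total = comb(n + degree, degree)
    return total if include_bias else total - 1

for degree in (3, 4, 5, 6):
    full = n_poly_terms(n_inputs, degree)
    inter = n_poly_terms(n_inputs, degree, interaction_only=True)
    print(f"degree={degree}: full={full}, interaction_only={inter}")
```

Note that `include_bias` only adds a constant column, which is redundant for `LinearRegression` because it fits its own intercept, so that flag is unlikely to change the scores.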


In [18]:
grid_search = GridSearchCV(estimator=pipeline, cv=5, param_grid=param_grid, verbose=True)

In [19]:
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   0.3s
[ColumnTransformer]  (2 of 3) Processing feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.3s

In [20]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_preprocessing__num_handler__poly_features__degree,param_preprocessing__num_handler__poly_features__include_bias,param_preprocessing__num_handler__poly_features__interaction_only,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.626708,0.107038,0.197324,0.016957,3,True,True,{'preprocessing__num_handler__poly_features__d...,0.006819,0.007403,0.006436,0.008141,0.007468,0.007253,0.000585,10
1,2.694546,0.10779,0.242621,0.025063,3,True,False,{'preprocessing__num_handler__poly_features__d...,0.00901,0.009586,0.008758,0.010863,0.00982,0.009608,0.000735,8
2,1.586551,0.072404,0.211628,0.03201,3,False,True,{'preprocessing__num_handler__poly_features__d...,0.006819,0.007403,0.006436,0.008141,0.007468,0.007253,0.000585,9
3,2.690654,0.076267,0.280071,0.03465,3,False,False,{'preprocessing__num_handler__poly_features__d...,0.00901,0.009586,0.008758,0.010863,0.00982,0.009608,0.000735,7
4,1.975743,0.085468,0.207595,0.013153,4,True,True,{'preprocessing__num_handler__poly_features__d...,0.006791,0.007401,0.006475,0.008173,0.007395,0.007247,0.000584,12
5,5.12268,0.082527,0.345086,0.004604,4,True,False,{'preprocessing__num_handler__poly_features__d...,0.010362,0.010771,0.010304,0.012476,0.011181,0.011019,0.000794,6
6,2.006342,0.066659,0.225835,0.034764,4,False,True,{'preprocessing__num_handler__poly_features__d...,0.006791,0.007401,0.006475,0.008173,0.007395,0.007247,0.000584,11
7,5.130873,0.151473,0.351098,0.016719,4,False,False,{'preprocessing__num_handler__poly_features__d...,0.010362,0.010771,0.010304,0.012476,0.011181,0.011019,0.000794,5
8,2.084323,0.077684,0.22951,0.041228,5,True,True,{'preprocessing__num_handler__poly_features__d...,0.006764,0.007396,0.006471,0.008157,0.007377,0.007233,0.000583,14
9,11.280122,0.265107,0.581342,0.023814,5,True,False,{'preprocessing__num_handler__poly_features__d...,0.011622,0.011991,0.01107,0.013949,0.012819,0.01229,0.001006,3


In [21]:
grid_search.best_score_

np.float64(0.013918863206127629)
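
When no `scoring` argument is given, `GridSearchCV` uses the estimator's own `score` method, which for regressors is R², so the `best_score_` above is an R² value rather than an RMSE. A sketch of how the search could be scored on RMSE instead, assuming synthetic data and a small decision tree grid chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=5,
    # scikit-learn maximises scores, so RMSE is exposed with a negative sign
    scoring="neg_root_mean_squared_error",
)
search.fit(X_toy, y_toy)

# Flip the sign to get the cross-validated RMSE of the best candidate
cv_rmse = -search.best_score_
print(search.best_params_, cv_rmse)
```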

In [22]:
best_estimator_poly = grid_search.best_estimator_

In [23]:
y_pred = best_estimator_poly.predict(X_test)
display_best(y_pred, y_test, y_org)

Error: 735422.3678643422
RMSE: 857.5677045366984
       Premium Amount
count    1.199999e+06
mean     1.102545e+03
std      8.649992e+02
min      2.000000e+01
25%      5.140000e+02
50%      8.720000e+02
75%      1.509000e+03
max      4.999000e+03


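**Observation: Marginal improvement. The test RMSE drops from about 862.8 (linear model) to about 857.6, and the best cross-validated score is only about 0.014.**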

#### Decision Trees

In [24]:
pipeline = Pipeline([
    ("preprocessing", preprocessing_pipeline),
    ("decision_model", DecisionTreeRegressor())
])

In [25]:
param_grid = {
    "decision_model__max_depth": [4, 5, 6, 7],
    "decision_model__max_leaf_nodes": [6, 7, 8, 9]
}

In [26]:
grid_search = GridSearchCV(estimator=pipeline, cv=5, param_grid=param_grid, verbose=True)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   0.0s
[ColumnTransformer]  (2 of 3) Processing cust_feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.3s

In [27]:
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_decision_model__max_depth,param_decision_model__max_leaf_nodes,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,1.481602,0.0139,0.105861,0.003311,4,6,"{'decision_model__max_depth': 4, 'decision_mod...",0.014794,0.014899,0.014292,0.016029,0.015501,0.015103,0.000602,16
1,1.475658,0.00743,0.105102,0.000937,4,7,"{'decision_model__max_depth': 4, 'decision_mod...",0.015739,0.015955,0.015634,0.017129,0.01644,0.016179,0.00055,12
2,1.547195,0.027147,0.106612,0.003225,4,8,"{'decision_model__max_depth': 4, 'decision_mod...",0.016081,0.016891,0.016205,0.018016,0.017018,0.016842,0.000692,11
3,1.548997,0.01095,0.105526,0.001143,4,9,"{'decision_model__max_depth': 4, 'decision_mod...",0.016316,0.017482,0.016869,0.018689,0.017357,0.017343,0.000789,9
4,1.571202,0.011324,0.104563,0.001265,5,6,"{'decision_model__max_depth': 5, 'decision_mod...",0.014896,0.015603,0.014422,0.016748,0.015347,0.015403,0.000784,13
5,1.602104,0.021757,0.104356,0.001348,5,7,"{'decision_model__max_depth': 5, 'decision_mod...",0.016398,0.017173,0.016245,0.018368,0.017241,0.017085,0.000756,10
6,1.617214,0.021115,0.105751,0.000729,5,8,"{'decision_model__max_depth': 5, 'decision_mod...",0.017343,0.018229,0.017587,0.019468,0.01818,0.018162,0.000737,8
7,1.660381,0.019443,0.105645,0.001881,5,9,"{'decision_model__max_depth': 5, 'decision_mod...",0.017685,0.019165,0.018158,0.020356,0.018758,0.018824,0.000917,7
8,1.603908,0.021548,0.105611,0.002105,6,6,"{'decision_model__max_depth': 6, 'decision_mod...",0.014896,0.015603,0.014422,0.016748,0.015347,0.015403,0.000784,13
9,1.596876,0.014527,0.106228,0.000597,6,7,"{'decision_model__max_depth': 6, 'decision_mod...",0.020329,0.020782,0.020136,0.022098,0.019366,0.020542,0.000903,5


In [28]:
best_estimator_dt = grid_search.best_estimator_

In [29]:
y_pred = best_estimator_dt.predict(X_test)
display_best(y_pred, y_test, y_org)

Error: 727800.0372792884
RMSE: 853.1119722986475
       Premium Amount
count    1.199999e+06
mean     1.102545e+03
std      8.649992e+02
min      2.000000e+01
25%      5.140000e+02
50%      8.720000e+02
75%      1.509000e+03
max      4.999000e+03


In [32]:
# Let's save the models
save_model(best_estimator_poly, "poly_model.mdl")
save_model(best_estimator_dt, "decision_tree.mdl")

#### Ensemble Models

In [35]:
voting_regressor = VotingRegressor(estimators=[("lin_model", lin_pipeline), ("poly_model", best_estimator_poly), ("tree_model", best_estimator_dt)])

In [36]:
voting_regressor.fit(X_train, y_train)

[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   0.1s
[ColumnTransformer]  (2 of 3) Processing cust_feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.4s
[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   4.8s
[ColumnTransformer]  (2 of 3) Processing feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.4s
[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   0.0s
[ColumnTransformer]  (2 of 3) Processing cust_feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.4s


In [37]:
y_pred = voting_regressor.predict(X_test)

In [39]:
display_best(y_pred, y_test, y_org)

Error: 731620.7602511881
RMSE: 855.3483268535621
       Premium Amount
count    1.199999e+06
mean     1.102545e+03
std      8.649992e+02
min      2.000000e+01
25%      5.140000e+02
50%      8.720000e+02
75%      1.509000e+03
max      4.999000e+03


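**Observation: The ensemble (RMSE about 855.3) beats the linear and polynomial models, but not the tuned decision tree (RMSE about 853.1), which remains the best model so far. The result is still not satisfactory.**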
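
To compare the models tried so far side by side, the test RMSE values printed above can be ranked in a few lines. The numbers below are copied from the outputs above; the dictionary keys are labels chosen here for readability:

```python
# Test RMSE values copied from the outputs above
test_rmse = {
    "linear": 862.7539424756754,
    "polynomial": 857.5677045366984,
    "decision_tree": 853.1119722986475,
    "voting_ensemble": 855.3483268535621,
}
target_std = 864.9992  # std of Premium Amount from describe()

# Sort from lowest (best) to highest RMSE
ranking = sorted(test_rmse.items(), key=lambda item: item[1])
for name, rmse in ranking:
    gain = 100 * (1 - rmse / target_std)
    print(f"{name:16s} RMSE={rmse:8.2f}  below std by {gain:.2f}%")
```

Every model stays within roughly 1.5% of the target's standard deviation, which is why none of them is satisfactory yet.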

#### Random Forest

In [52]:
pipeline_rf = Pipeline([
    ("preprocessor", preprocessing_pipeline),
    ("randomforest_model", RandomForestRegressor(n_estimators=500, n_jobs=-1, max_depth=4, max_leaf_nodes=5))
])

In [53]:
pipeline_rf.fit(X_train, y_train)

[ColumnTransformer] ... (1 of 3) Processing num_handler, total=   0.1s
[ColumnTransformer]  (2 of 3) Processing cust_feedback_handler, total=   0.1s
[ColumnTransformer]  (3 of 3) Processing one_hot_encode, total=   0.4s


In [54]:
y_pred = pipeline_rf.predict(X_test)

In [55]:
display_best(y_pred, y_test, y_org)

Error: 736141.6789293293
RMSE: 857.9869922844572
       Premium Amount
count    1.199999e+06
mean     1.102545e+03
std      8.649992e+02
min      2.000000e+01
25%      5.140000e+02
50%      8.720000e+02
75%      1.509000e+03
max      4.999000e+03


In [None]:
**Observation: Does not beat ens