# Work Package 3 (WP3): Open Kaggle Competition

Supervised Machine Learning Model Development and Deployment

---

**Universitat de Lleida**  
**Enginyeria Informàtica**  
**Sistemes Intel·ligents**  

---

**Professor:** Mariano Garralda Barrio  

**Authors:**  
- Jordi García Ventura  
- Christian López García  

**Date:** 12/01/2025  


## 0. Setup

In [1]:
!python --version

Python 3.9.13


In [2]:
%pip install -r ../requirements.txt

Collecting scikit-learn==1.5.2
  Using cached scikit_learn-1.5.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
Collecting scipy==1.13.1
  Using cached scipy-1.13.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Collecting joblib==1.4.2
  Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Collecting pycaret[full]==3.3.2
  Using cached pycaret-3.3.2-py3-none-any.whl (486 kB)
Collecting imbalanced-learn>=0.12.0
  Using cached imbalanced_learn-0.12.4-py3-none-any.whl (258 kB)
Collecting category-encoders>=2.4.0
  Using cached category_encoders-2.6.4-py2.py3-none-any.whl (82 kB)
INFO: pip is looking at multiple versions of stopit to determine which version is compatible with other requirements. This could take a while.
Collecting stopit==1.1.2
  Using cached stopit-1.1.2-py3-none-any.whl
INFO: pip is looking at multiple versions of tqdm to determine which version is compatible with other requirements. This could take a while.
Collecting tqdm==4.

In [3]:
%matplotlib inline

In [1]:
import sys
sys.path.append("../")

GPU_ENABLED = True

%load_ext autoreload
%autoreload 2
from lib.cache import cache, DataFrameCache

In [2]:
import webbrowser
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_log_error

In [3]:
TRAIN_FILENAME = "train.csv"
TEST_FILENAME = "test.csv"
# Project directories, resolved relative to the notebook's parent folder
DATA_PATH = Path.cwd().parent / "data"
CACHE_PATH = Path.cwd().parent / "cache"
CACHE_IMAGES_PATH = CACHE_PATH / "images"
CACHE_MODELS_PATH = CACHE_PATH / "models"
CACHE_DATAFRAMES_PATH = CACHE_PATH / "dataframes"
CACHE_NUMPY_PATH = CACHE_PATH / "numpy"

In [9]:
df = pd.read_csv(DATA_PATH / TRAIN_FILENAME)

In [5]:
dataframe_cache = DataFrameCache(CACHE_DATAFRAMES_PATH)

In [10]:
dataframe_cache.save("0-original", df)
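The `lib.cache` module is not shown in this notebook. As a rough idea of what a path-based decorator such as `cache` might do, here is a minimal sketch that assumes results are pickled to disk; the function body, the `expensive` example and the temporary directory are all hypothetical and are not the project's actual implementation.

```python
import pickle
import tempfile
from functools import wraps
from pathlib import Path


def cache(path):
    """Hypothetical stand-in for lib.cache.cache: store the result at `path`."""
    path = Path(path)

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Reuse the stored result if the file already exists
            if path.exists():
                with path.open("rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            path.parent.mkdir(parents=True, exist_ok=True)
            with path.open("wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


calls = []
tmp_dir = Path(tempfile.mkdtemp())


@cache(tmp_dir / "models" / "answer.pkl")
def expensive():
    calls.append(1)  # records how often the body really runs
    return {"answer": 42}


first = expensive()   # computed and written to disk
second = expensive()  # read back from disk
print(first == second, len(calls))
```

A cache keyed only by file path does not notice changes to the arguments or to the function body, so a stale file has to be deleted by hand after changing the training code.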

## 1. EDA

In [9]:
PROFILE_PATH = CACHE_PATH / "profiling_report.html"

if not PROFILE_PATH.exists():
    print("Generating profiling report...")
    profile = ProfileReport(df, title="Regression with an Insurance Dataset Report", explorative=True)
    profile.to_file(PROFILE_PATH)
    print("Profiling report generated.")

In [10]:
webbrowser.open(PROFILE_PATH.as_uri())

True

## 2. Feature engineering

In [11]:
df = dataframe_cache.load("0-original")

### Features

1. `id`: Unique identifier for the record (Numerical)
1. `Age`: Age of the insured individual (Numerical)
1. `Gender`: Gender of the insured individual (Categorical: Male, Female)
1. `Annual Income`: Annual income of the insured individual (Numerical, skewed)
1. `Marital Status`: Marital status of the insured individual (Categorical: Single, Married, Divorced)
1. `Number of Dependents`: Number of dependents (Numerical, with missing values)
1. `Education Level`: Highest education level attained (Categorical: High School, Bachelor's, Master's, PhD)
1. `Occupation`: Occupation of the insured individual (Categorical: Employed, Self-Employed, Unemployed)
1. `Health Score`: A score representing the health status (Numerical, skewed)
1. `Location`: Type of location (Categorical: Urban, Suburban, Rural)
1. `Policy Type`: Type of insurance policy (Categorical: Basic, Comprehensive, Premium)
1. `Previous Claims`: Number of previous claims made (Numerical, with outliers)
1. `Vehicle Age`: Age of the vehicle insured (Numerical)
1. `Credit Score`: Credit score of the insured individual (Numerical, with missing values)
1. `Insurance Duration`: Duration of the insurance policy (Numerical, in years)
1. `Policy Start Date`: Start date of the insurance policy (Text, improperly formatted)
1. `Customer Feedback`: Short feedback comments from customers (Text)
1. `Smoking Status`: Smoking status of the insured individual (Categorical: Yes, No)
1. `Exercise Frequency`: Frequency of exercise (Categorical: Daily, Weekly, Monthly, Rarely)
1. `Property Type`: Type of property owned (Categorical: House, Apartment, Condo)
1. `Premium Amount`: Target variable representing the insurance premium amount (Numerical, skewed)

We expect the 10 features most strongly related to `Premium Amount` (the target variable) to be:
1. `Policy Type`
1. `Vehicle Age`
1. `Previous Claims`
1. `Annual Income`
1. `Credit Score`
1. `Age`
1. `Insurance Duration`
1. `Marital Status`
1. `Occupation`
1. `Location`
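One way to check a ranking like this against the data is a rank correlation between each numerical feature and the target. The sketch below is only illustrative: it uses the five rows displayed by `df_numerical.head()` in this section so that it runs on its own, and in practice it would be applied to the full `df`. Rank correlation captures monotonic relationships only, so it is a first look rather than a feature-importance measure.

```python
import pandas as pd

# Five rows copied from the df_numerical.head() output, for illustration only
sample = pd.DataFrame({
    "Age": [19.0, 39.0, 23.0, 21.0, 21.0],
    "Annual Income": [10049.0, 31678.0, 25602.0, 141855.0, 39651.0],
    "Vehicle Age": [17.0, 12.0, 14.0, 0.0, 8.0],
    "Premium Amount": [2869.0, 1483.0, 567.0, 765.0, 2022.0],
})

target = "Premium Amount"

# Spearman-style association: Pearson correlation computed on ranks
ranks = sample.rank()
assoc = (
    ranks.drop(columns=[target])
    .corrwith(ranks[target])
    .abs()
    .sort_values(ascending=False)
)
print(assoc)
```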

In [12]:
df_numerical = df.select_dtypes(include=[np.number])
df_numerical.head()

Unnamed: 0,id,Age,Annual Income,Number of Dependents,Health Score,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Premium Amount
0,0,19.0,10049.0,1.0,22.598761,2.0,17.0,372.0,5.0,2869.0
1,1,39.0,31678.0,3.0,15.569731,1.0,12.0,694.0,2.0,1483.0
2,2,23.0,25602.0,3.0,47.177549,1.0,14.0,,3.0,567.0
3,3,21.0,141855.0,2.0,10.938144,1.0,0.0,367.0,1.0,765.0
4,4,21.0,39651.0,1.0,20.376094,0.0,8.0,598.0,4.0,2022.0


In [13]:
df_categorical = df.select_dtypes(include=["object"])
df_categorical.head()

Unnamed: 0,Gender,Marital Status,Education Level,Occupation,Location,Policy Type,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type
0,Female,Married,Bachelor's,Self-Employed,Urban,Premium,2023-12-23 15:21:39.134960,Poor,No,Weekly,House
1,Female,Divorced,Master's,,Rural,Comprehensive,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House
2,Male,Divorced,High School,Self-Employed,Suburban,Premium,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House
3,Male,Married,Bachelor's,,Rural,Basic,2024-06-12 15:21:39.226954,Poor,Yes,Daily,Apartment
4,Male,Single,Bachelor's,Self-Employed,Rural,Premium,2021-12-01 15:21:39.252145,Poor,Yes,Weekly,House


### Null values

- `Age`: median
- `Annual Income`: median
- `Marital Status`:  mode
- `Number of Dependents`: mode
- `Occupation`: "Unknown" class
- `Health Score`: mean
- `Previous Claims`: median
- `Vehicle Age`: median
- `Credit Score`: median (alternatively, bin the scores and add an "Unknown" class)
- `Insurance Duration`: median
- `Customer Feedback`: "No feedback" class
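The binning alternative mentioned for `Credit Score` is not implemented in this notebook. A minimal sketch of how it could look with `pd.cut` follows; the bin edges and labels are assumptions chosen for illustration, and the five values are the `Credit Score` entries from the first rows of the dataset.

```python
import numpy as np
import pandas as pd

# Credit Score values from the first five rows (the third one is missing)
scores = pd.Series([372.0, 694.0, np.nan, 367.0, 598.0], name="Credit Score")

# Hypothetical bin edges and labels, not taken from the dataset description
bins = [300, 580, 670, 740, 850]
labels = ["Poor", "Fair", "Good", "Excellent"]

binned = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

# Missing scores become their own category instead of being imputed
binned = binned.cat.add_categories("Unknown").fillna("Unknown")
print(binned.tolist())
```

Keeping missing values as a separate class lets a model learn whether the absence of a score is itself informative, which median imputation hides.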

In [14]:
nulls = df.isnull().sum()
nulls[nulls > 0]

Age                      18705
Annual Income            44949
Marital Status           18529
Number of Dependents    109672
Occupation              358075
Health Score             74076
Previous Claims         364029
Vehicle Age                  6
Credit Score            137882
Insurance Duration           1
Customer Feedback        77824
dtype: int64

In [25]:
df_imputed = dataframe_cache.load("0-original")
df_imputed["Age"] = df_imputed["Age"].fillna(df_imputed["Age"].median())
df_imputed["Annual Income"] = df_imputed["Annual Income"].fillna(df_imputed["Annual Income"].median())
df_imputed["Marital Status"] = df_imputed["Marital Status"].fillna(df_imputed["Marital Status"].mode().values[0])
df_imputed["Number of Dependents"] = df_imputed["Number of Dependents"].fillna(df_imputed["Number of Dependents"].mode().values[0])
df_imputed["Occupation"] = df_imputed["Occupation"].fillna("Unknown")
df_imputed["Health Score"] = df_imputed["Health Score"].fillna(df_imputed["Health Score"].mean())
df_imputed["Previous Claims"] = df_imputed["Previous Claims"].fillna(df_imputed["Previous Claims"].median())
df_imputed["Vehicle Age"] = df_imputed["Vehicle Age"].fillna(df_imputed["Vehicle Age"].median())
df_imputed["Credit Score"] = df_imputed["Credit Score"].fillna(df_imputed["Credit Score"].median())
df_imputed["Insurance Duration"] = df_imputed["Insurance Duration"].fillna(df_imputed["Insurance Duration"].median())
df_imputed["Customer Feedback"] = df_imputed["Customer Feedback"].fillna("No feedback")

In [26]:
df_imputed.isnull().sum().sum()

0

In [36]:
dataframe_cache.save("1-imputed", df_imputed)

### Encoding

- `Gender`: one-hot encoding
- `Marital Status`: one-hot encoding
- `Education Level`: ordinal encoding
- `Occupation`: one-hot encoding
- `Location`: one-hot encoding
- `Policy Type`: ordinal encoding
- `Policy Start Date`: year extraction
- `Customer Feedback`: ordinal encoding
- `Smoking Status`: binary encoding
- `Exercise Frequency`: ordinal encoding
- `Property Type`: one-hot encoding
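The two main strategies in this list can be illustrated on a toy frame. The sketch below uses three made-up rows; it also checks for labels that are missing from a mapping, because `Series.map` turns them into `NaN` without raising an error.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Policy Type": ["Premium", "Comprehensive", "Basic"],
    "Smoking Status": ["No", "Yes", "Yes"],
    "Location": ["Urban", "Rural", "Suburban"],
})

encoded = toy.copy()

# Ordinal encoding: the categories have a natural order
encoded["Policy Type"] = encoded["Policy Type"].map(
    {"Basic": 0, "Comprehensive": 1, "Premium": 2}
)

# Binary encoding
encoded["Smoking Status"] = encoded["Smoking Status"].map({"No": 0, "Yes": 1})

# Count values that were not covered by the mappings above
unmapped = int(encoded[["Policy Type", "Smoking Status"]].isna().sum().sum())

# One-hot encoding: no natural order; drop_first removes the redundant column
encoded = pd.get_dummies(encoded, columns=["Location"], drop_first=True)
print(unmapped, encoded.columns.tolist())
```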

In [37]:
df_encoded = dataframe_cache.load("1-imputed")
df_encoded["Education Level"] = df_encoded["Education Level"].map({"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3})
df_encoded["Policy Type"] = df_encoded["Policy Type"].map({"Basic": 0, "Comprehensive": 1, "Premium": 2})
df_encoded["Policy Start Year"] = pd.to_datetime(df_encoded["Policy Start Date"]).dt.year
df_encoded.drop(columns=["Policy Start Date"], inplace=True)
df_encoded["Customer Feedback"] = df_encoded["Customer Feedback"].map({"Poor": 0, "Average": 1, "Good": 2, "No feedback": 3})
df_encoded["Smoking Status"] = df_encoded["Smoking Status"].map({"No": 0, "Yes": 1})
df_encoded["Exercise Frequency"] = df_encoded["Exercise Frequency"].map({"Rarely": 0, "Monthly": 1, "Weekly": 2, "Daily": 3})
df_encoded = pd.get_dummies(df_encoded, columns=["Gender", "Marital Status", "Occupation", "Location", "Property Type"], drop_first=True)
df_encoded.head()

Unnamed: 0,id,Age,Annual Income,Number of Dependents,Education Level,Health Score,Policy Type,Previous Claims,Vehicle Age,Credit Score,...,Gender_Male,Marital Status_Married,Marital Status_Single,Occupation_Self-Employed,Occupation_Unemployed,Occupation_Unknown,Location_Suburban,Location_Urban,Property Type_Condo,Property Type_House
0,0,19.0,10049.0,1.0,1,22.598761,2,2.0,17.0,372.0,...,False,True,False,True,False,False,False,True,False,True
1,1,39.0,31678.0,3.0,2,15.569731,1,1.0,12.0,694.0,...,False,False,False,False,False,True,False,False,False,True
2,2,23.0,25602.0,3.0,0,47.177549,2,1.0,14.0,595.0,...,True,False,False,True,False,False,True,False,False,True
3,3,21.0,141855.0,2.0,1,10.938144,0,1.0,0.0,367.0,...,True,True,False,False,False,True,False,False,False,False
4,4,21.0,39651.0,1.0,1,20.376094,2,0.0,8.0,598.0,...,True,False,True,True,False,False,False,False,False,True


In [38]:
dataframe_cache.save("2-encoded", df_encoded)

### Scaling

In [39]:
df_scaled = dataframe_cache.load("2-encoded")

scaler = StandardScaler()
columns_to_scale = df_scaled.columns.difference(["Premium Amount"])
df_scaled[columns_to_scale] = scaler.fit_transform(df_scaled[columns_to_scale])
df_scaled.head()

Unnamed: 0,id,Age,Annual Income,Number of Dependents,Education Level,Health Score,Policy Type,Previous Claims,Vehicle Age,Credit Score,...,Gender_Male,Marital Status_Married,Marital Status_Single,Occupation_Self-Employed,Occupation_Unemployed,Occupation_Unknown,Location_Suburban,Location_Urban,Property Type_Condo,Property Type_House
0,-1.732049,-1.648301,-0.707414,-0.796935,-0.46541,-0.255071,1.221087,1.216739,1.286338,-1.567375,...,-1.004294,1.429421,-0.725646,1.801557,-0.547217,-0.652154,-0.709152,1.420839,-0.706673,1.413289
1,-1.732046,-0.159542,-0.023289,0.651486,0.433367,-0.849704,-0.003359,-0.002284,0.420713,0.71463,...,-1.004294,-0.699584,-0.725646,-0.555075,-0.547217,1.53338,-0.709152,-0.703809,-0.706673,1.413289
2,-1.732044,-1.350549,-0.215473,0.651486,-1.364188,1.824212,1.221087,-0.002284,0.766963,0.01302,...,0.995724,-0.699584,-0.725646,1.801557,-0.547217,-0.652154,1.410135,-0.703809,-0.706673,1.413289
3,-1.732041,-1.499425,3.461605,-0.072725,-0.46541,-1.241521,-1.227805,-0.002284,-1.656788,-1.60281,...,0.995724,1.429421,-0.725646,-0.555075,-0.547217,1.53338,-0.709152,-0.703809,-0.706673,-0.70757
4,-1.732038,-1.499425,0.228896,-0.796935,-0.46541,-0.443102,1.221087,-1.221307,-0.271787,0.034281,...,0.995724,-0.699584,1.378082,1.801557,-0.547217,-0.652154,-0.709152,-0.703809,-0.706673,1.413289


In [40]:
dataframe_cache.save("3-scaled", df_scaled)
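The scaler in this section is fitted on the whole frame, before the train/test split of section 3, so the held-out rows contribute to the mean and variance. A common alternative is to fit on the training split only and reuse those statistics for the test split. The sketch below shows that order on synthetic data (the arrays are made up); it mostly matters for scale-sensitive models such as linear ones, since tree ensembles are largely unaffected by feature scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
y = rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from train only
X_test_scaled = scaler.transform(X_test)        # reuse them on the test rows

print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.shape)
```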

## 3. Model training

In [6]:
df_train = dataframe_cache.load("3-scaled")

In [7]:
target = "Premium Amount"
X = df_train.drop(columns=[target])
y = df_train[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### XGBoost

In [23]:
@cache(CACHE_MODELS_PATH / "xgboost.joblib")
def train_xgboost(X_train, y_train):
    model = XGBRegressor(objective="reg:squarederror")
    model.fit(X_train, y_train)
    return model

model = train_xgboost(X_train, y_train)

y_true = y_test
y_pred = model.predict(X_test)
root_mean_squared_log_error(y_true, y_pred)

np.float64(1.1387584826591357)
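For reference, RMSLE is the root mean squared error computed on `log1p`-transformed values, which is why negative predictions have to be clipped before scoring. The sketch below is a plain NumPy version with a constant, made-up prediction; the `rmsle` helper is written for illustration and is not the scikit-learn function used above.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error with negative predictions clipped to 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.maximum(np.asarray(y_pred, dtype=float), 0.0)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Premium amounts from the first five rows and a constant, made-up prediction
y_true = np.array([2869.0, 1483.0, 567.0, 765.0, 2022.0])
y_pred = np.full_like(y_true, 1100.0)

score = rmsle(y_true, y_pred)

# The same number, written as an ordinary RMSE in log1p space
log_rmse = float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))
print(score, log_rmse)
```

Because of this equivalence, one option is to train a squared-error model on `np.log1p(y_train)` and map predictions back with `np.expm1`, so that the training loss matches the competition metric.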

### LightGBM

In [45]:
from lightgbm import LGBMRegressor

@cache(CACHE_MODELS_PATH / "lightgbm.joblib")
def train_lightgbm(X_train, y_train):
    model = LGBMRegressor()
    model.fit(X_train, y_train)
    return model

model = train_lightgbm(X_train, y_train)

y_true = y_test
y_pred = model.predict(X_test)
root_mean_squared_log_error(y_true, y_pred)

1.1365426349715293

### CatBoost

In [None]:
from catboost import CatBoostRegressor

@cache(CACHE_MODELS_PATH / "catboost.joblib")
def train_catboost(X_train, y_train):
    model = CatBoostRegressor(
        n_estimators=1000,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
    )
    model.fit(X_train, y_train, verbose=False)
    return model

model = train_catboost(X_train, y_train)

y_true = y_test
y_pred = model.predict(X_test)
root_mean_squared_log_error(y_true, y_pred)

1.1377063769261404

### Optuna

In [None]:
import optuna
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

# Optuna Optimization
def objective(trial):
    params = {
        "objective": "reg:squarederror",
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "gamma": trial.suggest_float("gamma", 0.0, 1.0),
        "lambda": trial.suggest_float("lambda", 0.0, 10.0),
        "alpha": trial.suggest_float("alpha", 0.0, 10.0)
    }
    
    if GPU_ENABLED:
        params["tree_method"] = "hist"
        params["device"] = "cuda"

    model = XGBRegressor(**params, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred = np.maximum(y_pred, 0)
    return root_mean_squared_log_error(y_test, y_pred)


# Data Preparation
target = "Premium Amount"
X = df_train.drop(columns=[target])
y = df_train[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Run Optimization
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)

print("Best trial:")
best_params = study.best_params
print(best_params)


# Train Best Model
def train_xgboost(X_train, y_train):
    model = XGBRegressor(**best_params, random_state=42)
    model.fit(X_train, y_train)
    return model


model = train_xgboost(X_train, y_train)

# Evaluate
y_true = y_test
y_pred = model.predict(X_test)
y_pred = np.maximum(y_pred, 0)
error = root_mean_squared_log_error(y_true, y_pred)
print(f"Optimized RMSLE: {error}")

[I 2025-01-03 20:49:38,878] A new study created in memory with name: no-name-ff4ffb3e-0d88-4b11-9006-eee536e18b3b
[I 2025-01-03 20:53:12,814] Trial 0 finished with value: 1.145774767255253 and parameters: {'max_depth': 8, 'n_estimators': 2800, 'eta': 0.012396942359939923, 'subsample': 0.2, 'colsample_bytree': 0.6000000000000001, 'colsample_bylevel': 0.5, 'min_child_weight': 446.67026305834474, 'reg_lambda': 2.5850495475123165, 'reg_alpha': 0.0061005320340240185, 'gamma': 0.026323242955892116}. Best is trial 0 with value: 1.145774767255253.
[I 2025-01-03 20:55:58,281] Trial 1 finished with value: 1.1462849899148484 and parameters: {'max_depth': 7, 'n_estimators': 2400, 'eta': 0.007783474420246597, 'subsample': 0.6000000000000001, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.4, 'min_child_weight': 8.050441416509145, 'reg_lambda': 6.642514161223856, 'reg_alpha': 194.64608863398016, 'gamma': 2067.203032164484}. Best is trial 0 with value: 1.145774767255253.
[I 2025-01-03 21:01:23,534] T

Best trial:
{'max_depth': 8, 'n_estimators': 3600, 'eta': 0.011167554734562367, 'subsample': 0.9, 'colsample_bytree': 0.9, 'colsample_bylevel': 0.9, 'min_child_weight': 0.0004809356932127188, 'reg_lambda': 209.84703570779848, 'reg_alpha': 0.0015847230597761228, 'gamma': 1524.2983549944818}
Optimized RMSLE: 1.1373181783864443


Best trial:
{'n_estimators': 517,
 'max_depth': 10,
 'learning_rate': 0.032804707462710754,
 'subsample': 0.7968921460087949,
 'colsample_bytree': 0.890143405869324,
 'min_child_weight': 1,
 'gamma': 0.44190596808067456,
 'lambda': 2.562383451300799,
 'alpha': 5.862926508738752}

Optimized RMSLE: 1.1366257595558225
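The objective above scores every trial on `X_test`, the same split used for the final evaluation, so the reported "Optimized RMSLE" is likely to be somewhat optimistic. A common remedy is to carve a validation split out of the training data and keep the test split untouched until the end. The sketch below shows the split sizes on placeholder arrays; the variable names and the 60/20/20 proportions are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows identified by their index
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# 20% held out for the final evaluation only
X_fit, X_test, y_fit, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 25% of the remainder (20% of the total) for scoring trials
X_train, X_val, y_train, y_val = train_test_split(
    X_fit, y_fit, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Inside the objective, the model would then be fitted on `X_train` and scored on `X_val`, with `X_test` used once for the best configuration.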

### Bayesian optimization

In [43]:
from hyperopt import hp, fmin, tpe, Trials
import numpy as np
from xgboost import XGBRegressor

# Define the objective function
def objective(params):
    # Ensure integer values for integer parameters
    params["max_depth"] = int(params["max_depth"])
    params["n_estimators"] = int(params["n_estimators"])
    
    model = XGBRegressor(**params, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred = np.maximum(y_pred, 0)  # Ensure no negative predictions
    return root_mean_squared_log_error(y_test, y_pred)

# Define the search space
space = {
    "max_depth": hp.quniform("max_depth", 6, 10, 1),  # Will be cast to int
    "n_estimators": hp.quniform("n_estimators", 400, 4000, 400),  # Will be cast to int
    "eta": hp.uniform("eta", 0.007, 0.013),
    "subsample": hp.uniform("subsample", 0.2, 0.9),
    "colsample_bytree": hp.uniform("colsample_bytree", 0.2, 0.9),
    "colsample_bylevel": hp.uniform("colsample_bylevel", 0.2, 0.9),
    "min_child_weight": hp.loguniform("min_child_weight", np.log(1e-4), np.log(1e4)),
    "reg_lambda": hp.loguniform("reg_lambda", np.log(1e-4), np.log(1e4)),
    "reg_alpha": hp.loguniform("reg_alpha", np.log(1e-4), np.log(1e4)),
    "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e4)),
}

# Run the optimization
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

print("Best trial:")
print(best)

# Train Best Model
best["max_depth"] = int(best["max_depth"])
best["n_estimators"] = int(best["n_estimators"])

model = XGBRegressor(**best, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_true = y_test
y_pred = model.predict(X_test)
y_pred = np.maximum(y_pred, 0)
error = root_mean_squared_log_error(y_true, y_pred)
print(f"Optimized RMSLE: {error}")

100%|██████████| 50/50 [3:27:05<00:00, 248.51s/trial, best loss: 1.1376710979595392]  
Best trial:
{'colsample_bylevel': np.float64(0.7492798516324369), 'colsample_bytree': np.float64(0.7616768959990403), 'eta': np.float64(0.011087662983626245), 'gamma': np.float64(514.4856366662277), 'max_depth': np.float64(9.0), 'min_child_weight': np.float64(5.143885798867552), 'n_estimators': np.float64(3600.0), 'reg_alpha': np.float64(13.343546144254283), 'reg_lambda': np.float64(27.350709874883446), 'subsample': np.float64(0.7975644489962314)}
Optimized RMSLE: 1.1376710979595392



In [None]:
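# Needs a persisted study, e.g. optuna.create_study(direction="minimize", storage="sqlite:///db.sqlite3")
!optuna-dashboard sqlite:///db.sqlite3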

### Genetic algorithm

In [19]:
from tpot import TPOTRegressor

tpot = TPOTRegressor(generations=5, population_size=20, random_state=42, cv=5, n_jobs=-1, offspring_size=10, verbosity=2, scoring="neg_mean_squared_log_error")
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

Imputing missing values in feature set


Optimization Progress:   0%|          | 0/70 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -1.2111142494042972

Generation 2 - Current best internal CV score: -1.2111142494042972

Generation 3 - Current best internal CV score: -1.2110564120287164

Generation 4 - Current best internal CV score: -1.2110564120287164

Generation 5 - Current best internal CV score: -1.2110564120287164

Best pipeline: LinearSVR(input_matrix, C=25.0, dual=True, epsilon=0.01, loss=epsilon_insensitive, tol=0.0001)
Imputing missing values in feature set
-1.2124341431922405
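The scores printed by TPOT are not directly comparable with the RMSLE values of the other models: `neg_mean_squared_log_error` is the negated mean squared log error, without the square root. Assuming `tpot.score` reports that same scorer, the conversion is a one-liner:

```python
import math

# Value printed by tpot.score(X_test, y_test) above
neg_msle = -1.2124341431922405

# Undo the sign flip, then take the square root to get RMSLE
rmsle = math.sqrt(-neg_msle)
print(f"{rmsle:.4f}")
```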



In [20]:
tpot.export("tpot_pipeline.py")

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR
from sklearn.impute import SimpleImputer

tpot_data = df_train.copy()
features = tpot_data.drop('Premium Amount', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['Premium Amount'], random_state=42)

imputer = SimpleImputer(strategy="median")
imputer.fit(training_features)
training_features = imputer.transform(training_features)
testing_features = imputer.transform(testing_features)

exported_pipeline = LinearSVR(C=25.0, dual=True, epsilon=0.01, loss="epsilon_insensitive", tol=0.0001)
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)

In [44]:
y_true = y_test
y_pred = exported_pipeline.predict(X_test)
root_mean_squared_log_error(y_true, y_pred)



1.1012484121048363

### pycaret

https://www.datacamp.com/tutorial/guide-for-automating-ml-workflows-using-pycaret

In [9]:
from pycaret.regression import compare_models, setup

exp1 = setup(data = df_train, target = 'Premium Amount', session_id=42)

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Premium Amount
2,Target type,Regression
3,Original data shape,"(1200000, 26)"
4,Transformed data shape,"(1200000, 26)"
5,Transformed train set shape,"(840000, 26)"
6,Transformed test set shape,"(360000, 26)"
7,Numeric features,25
8,Preprocess,True
9,Imputation type,simple


In [None]:
best = compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lr,Linear Regression,667.38,746049.3625,863.739,0.0031,1.1674,3.0074,1.287
lasso,Lasso Regression,667.4063,746043.3562,863.7355,0.0031,1.1675,3.0085,0.525
ridge,Ridge Regression,667.38,746049.375,863.739,0.0031,1.1674,3.0074,0.393
lar,Least Angle Regression,667.38,746049.8188,863.7393,0.0031,1.1674,3.0074,0.498
llar,Lasso Least Angle Regression,667.4063,746043.3562,863.7355,0.0031,1.1675,3.0085,0.491
br,Bayesian Ridge,667.3851,746049.1688,863.7389,0.0031,1.1674,3.0076,0.941
en,Elastic Net,667.6889,746356.9,863.9171,0.0027,1.1685,3.0141,0.6
omp,Orthogonal Matching Pursuit,667.8141,746677.5938,864.1026,0.0022,1.1689,3.0167,0.474
huber,Huber Regressor,641.2666,776645.3117,881.2715,-0.0378,1.1157,2.4973,1.934
par,Passive Aggressive Regressor,644.3855,815967.256,903.3009,-0.0903,1.1068,2.3105,1.157


Processing:   0%|          | 0/85 [00:00<?, ?it/s]
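The partial table above appears to be ordered by R2, which is PyCaret's default sort metric for regression, while the competition metric is RMSLE. Re-ranking by RMSLE changes the picture, as the sketch below shows with three rows copied from the table. In PyCaret itself, `compare_models(sort="RMSLE")` should give this ordering directly; that call is not run here.

```python
import pandas as pd

# Three rows copied from the compare_models() output above
results = pd.DataFrame(
    {
        "Model": [
            "Linear Regression",
            "Huber Regressor",
            "Passive Aggressive Regressor",
        ],
        "R2": [0.0031, -0.0378, -0.0903],
        "RMSLE": [1.1674, 1.1157, 1.1068],
    },
    index=["lr", "huber", "par"],
)

by_r2 = results.sort_values("R2", ascending=False)
by_rmsle = results.sort_values("RMSLE")  # lower is better

print(by_r2.index.tolist())
print(by_rmsle.index.tolist())
```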

### auto-sklearn

In [None]:
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics

if __name__ == "__main__":
    X, y = sklearn.datasets.load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test =  \
        sklearn.model_selection.train_test_split(X, y, random_state=1)
    automl = autosklearn.classification.AutoSklearnClassifier()
    automl.fit(X_train, y_train)
    y_hat = automl.predict(X_test)
    print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))