
## Exercise 2: Use Gradient Boost for Regression

Instructions:

- Use the dataset file to train your model.
- Use the test file to generate your results.
- Use the sample submission file to produce output in the same format.

Submit your results to:
https://www.kaggle.com/competitions/playground-series-s4e12/overview



In [28]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer as SimpleInputer
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import root_mean_squared_log_error

## Dataset
The train, test, and sample submission files can be found at this link:
https://www.kaggle.com/competitions/playground-series-s4e12/data

## 1. Load the Data

In [2]:
df = pd.read_csv('train.csv')
dt = pd.read_csv('test.csv')
sf = pd.read_csv('sample_submission.csv')

In [3]:
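# Tag each row with its origin so the combined frame can be split again later
df['source'] = 'train'
dt['source'] = 'test'

# Stack train and test so both receive identical preprocessing
data = pd.concat([df, dt], ignore_index=True)
print(data.shape)
data.head()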

(2000000, 22)


| | id | Age | Gender | Annual Income | Marital Status | Number of Dependents | Education Level | Occupation | Health Score | Location | ... | Vehicle Age | Credit Score | Insurance Duration | Policy Start Date | Customer Feedback | Smoking Status | Exercise Frequency | Property Type | Premium Amount | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19.0 | Female | 10049.0 | Married | 1.0 | Bachelor's | Self-Employed | 22.598761 | Urban | ... | 17.0 | 372.0 | 5.0 | 2023-12-23 15:21:39.134960 | Poor | No | Weekly | House | 2869.0 | train |
| 1 | 1 | 39.0 | Female | 31678.0 | Divorced | 3.0 | Master's | NaN | 15.569731 | Rural | ... | 12.0 | 694.0 | 2.0 | 2023-06-12 15:21:39.111551 | Average | Yes | Monthly | House | 1483.0 | train |
| 2 | 2 | 23.0 | Male | 25602.0 | Divorced | 3.0 | High School | Self-Employed | 47.177549 | Suburban | ... | 14.0 | NaN | 3.0 | 2023-09-30 15:21:39.221386 | Good | Yes | Weekly | House | 567.0 | train |
| 3 | 3 | 21.0 | Male | 141855.0 | Married | 2.0 | Bachelor's | NaN | 10.938144 | Rural | ... | 0.0 | 367.0 | 1.0 | 2024-06-12 15:21:39.226954 | Poor | Yes | Daily | Apartment | 765.0 | train |
| 4 | 4 | 21.0 | Male | 39651.0 | Single | 1.0 | Bachelor's | Self-Employed | 20.376094 | Rural | ... | 8.0 | 598.0 | 4.0 | 2021-12-01 15:21:39.252145 | Poor | Yes | Weekly | House | 2022.0 | train |
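
The tag-and-concatenate step can be illustrated on a toy example. The sketch below uses two tiny hand-made frames (hypothetical values, not the Kaggle data) to show that the `source` column lets the combined frame be split back into its original parts, and that test rows carry `NaN` in the target column after concatenation.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
train = pd.DataFrame({"id": [0, 1], "Age": [19.0, None], "Premium Amount": [2869.0, 1483.0]})
test = pd.DataFrame({"id": [2, 3], "Age": [23.0, 21.0]})

train["source"] = "train"
test["source"] = "test"

# Stack both sets so preprocessing is applied identically to each
combined = pd.concat([train, test], ignore_index=True)

# Test rows have no target, so "Premium Amount" is NaN for them
n_missing_target = int(combined["Premium Amount"].isna().sum())

# The tag recovers the original split after preprocessing
train_back = combined.loc[combined["source"] == "train"].drop(columns="source")
test_back = combined.loc[combined["source"] == "test"].drop(columns=["source", "Premium Amount"])

print(combined.shape, n_missing_target)
```
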


## 2. Perform Data preprocessing

In [4]:
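def preprocess_data(data, is_train=True):
    """Legacy helper that is not called by the cells below.

    Several columns it references ('Policy End Date', 'Vehicle Class',
    'Annual Premium (in Rs)', ...) are not among the columns shown above,
    so each step is skipped when its column is missing.
    """
    data = data.copy()
    data['source'] = 'train' if is_train else 'test'

    # Policy length in days, computed only when both dates are present
    if 'Policy Start Date' in data.columns:
        data['Policy Start Date'] = pd.to_datetime(data['Policy Start Date'])
        if 'Policy End Date' in data.columns:
            data['Policy End Date'] = pd.to_datetime(data['Policy End Date'])
            data['Vintage'] = (data['Policy End Date'] - data['Policy Start Date']).dt.days

    mapping = {
        'Gender': {'Male': 0, 'Female': 1},
        'Vehicle Class': {'Four-Door Car': 0, 'Two-Door Car': 1},
        'Vehicle Size': {'Small': 0, 'Medsize': 1, 'Large': 2},
        'Vehicle Damage': {'Yes': 1, 'No': 0},
        'Previously Insured': {'Yes': 1, 'No': 0}
    }
    for col, map_dict in mapping.items():
        if col in data.columns:
            data[col] = data[col].map(map_dict)

    # Fill gaps with the most frequent value; assign back instead of chained inplace
    for col in ['Annual Premium (in Rs)', 'Vintage']:
        if col in data.columns:
            data[col] = data[col].fillna(data[col].mode()[0])

    return data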

In [8]:
data.drop(['Policy Start Date'], axis=1, inplace=True)

In [9]:
cat = data.select_dtypes(include='object').columns.tolist()
num = data.select_dtypes(include=['float', 'int64']).columns.tolist()

In [12]:
num_inputer = SimpleInputer(missing_values=np.nan, strategy='mean')
cat_inputer = SimpleInputer(missing_values=np.nan, strategy='most_frequent')

# Assign the imputed arrays back so the changes persist in `data`.
# Note: `num` includes 'Premium Amount', so test rows receive the mean target as a placeholder.
data[num] = num_inputer.fit_transform(data[num])
data[cat] = cat_inputer.fit_transform(data[cat])
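
To make the two imputation strategies concrete, here is a minimal sketch on a toy frame (hypothetical values; `toy`, `num_cols`, and `cat_cols` are names made up for this illustration). Numeric gaps are filled with the column mean and categorical gaps with the most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one gap in each column (hypothetical values)
toy = pd.DataFrame({
    "Age": [19.0, np.nan, 23.0],
    "Occupation": ["Self-Employed", np.nan, "Self-Employed"],
})

num_cols = ["Age"]
cat_cols = ["Occupation"]

# Mean of 19 and 23 fills the numeric gap
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])

# The most frequent category fills the categorical gap
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
```

Assigning the transformed array back to the selected columns is what makes the change persist. A chained call such as `data[cols].fillna(..., inplace=True)` acts on a temporary copy and leaves `data` unchanged.
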


In [13]:
# Split back into train and test. Chaining .drop() on the selection returns new
# frames and avoids the SettingWithCopyWarning raised by in-place edits on a slice.
df = data.loc[data['source'] == 'train'].drop(columns='source')
dt = data.loc[data['source'] == 'test'].drop(columns='source')


In [14]:
X = df.drop(columns='Premium Amount')
y = df['Premium Amount']

In [15]:
cat = X.select_dtypes(include='object').columns.tolist()
num = X.select_dtypes(include=['float', 'int64']).columns.tolist()

In [18]:
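preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric features to zero mean and unit variance
        ('num', StandardScaler(), num),
        # One-hot encode categoricals; ignore categories unseen during fit
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat)
    ])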

In [19]:
params = {
    'n_estimators': 100,
    'learning_rate': 0.1,
    'max_depth': 3,
    'random_state': 42
}

model = GradientBoostingRegressor(**params)

## 3. Create a Pipeline

In [22]:
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])

## 4. Train the Model

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

## 5. Evaluate the Model

In [29]:
rmsle = root_mean_squared_log_error(y_test, y_pred)
print(f'RMSLE: {rmsle}')

RMSLE: 1.159308896572489
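
For intuition about the metric, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The sketch below is a plain-Python version written for illustration (the helper name `rmsle` and the sample values are hypothetical); it shows that the score depends on the ratio between prediction and target rather than on the absolute gap.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

actual = [100.0, 1000.0]
doubled = [200.0, 2000.0]  # every prediction is twice the target

# Both rows contribute almost equally, even though the absolute errors differ tenfold
score = rmsle(actual, doubled)
print(score)
```

Doubling every prediction gives a score close to ln 2 (about 0.69). Read the same way, a score near 1.16 corresponds to predictions that are typically off by a factor of roughly 3 in either direction.
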


## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.
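
One way to compare candidates is to score each on the same held-out split and keep the one with the lowest RMSLE. The sketch below does this on synthetic data. The data, the names `X_demo`, `y_demo`, and `candidates`, and the `log1p` target transform are assumptions made for illustration and are not part of the notebook above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed, strictly positive target (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = np.exp(1.0 + 0.8 * X_demo[:, 0] + rng.normal(scale=0.3, size=400))

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

candidates = {
    # Fit directly on the raw target
    "raw_target": GradientBoostingRegressor(n_estimators=50, random_state=42),
    # Fit on log1p(target) and invert predictions with expm1
    "log1p_target": TransformedTargetRegressor(
        regressor=GradientBoostingRegressor(n_estimators=50, random_state=42),
        func=np.log1p,
        inverse_func=np.expm1,
    ),
}

scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    # Clip at zero because the log-based metric is undefined for negative values
    preds = np.clip(candidate.predict(X_val), 0, None)
    scores[name] = float(np.sqrt(mean_squared_log_error(y_val, preds)))

best_name = min(scores, key=scores.get)
print(scores, best_name)
```

The same loop could wrap the notebook's `pipeline` with different `params`, at the cost of one full fit per candidate.
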

In [32]:
ids = sf['id']  # named `ids` to avoid shadowing the built-in id()
y_pred = pipeline.predict(dt)

# Create a submission DataFrame
submission_df = pd.DataFrame({
    'id': ids,
    'Premium Amount': y_pred
})

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
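
Before uploading, it can help to confirm that the file matches the sample submission format. The helper below is a hypothetical sketch (`check_submission` is not part of pandas or Kaggle tooling) that compares column names, row count, `id` order, and missing values, shown here on toy frames.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of format problems; an empty list means the frames match."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ")
    if len(submission) != len(sample):
        problems.append("row counts differ")
    elif "id" in submission and "id" in sample and not (
        submission["id"].to_numpy() == sample["id"].to_numpy()
    ).all():
        problems.append("ids differ or are out of order")
    if submission.isna().any().any():
        problems.append("missing values present")
    return problems

# Toy frames standing in for sample_submission.csv and submission_file.csv
sample = pd.DataFrame({"id": [10, 11], "Premium Amount": [0.0, 0.0]})
good = pd.DataFrame({"id": [10, 11], "Premium Amount": [1200.5, 980.0]})
bad = pd.DataFrame({"id": [10, 11], "Premium": [1200.5, None]})

good_problems = check_submission(good, sample)
bad_problems = check_submission(bad, sample)
print(good_problems, bad_problems)
```

With the real files, the call would be `check_submission(pd.read_csv('submission_file.csv'), pd.read_csv('sample_submission.csv'))`.
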
