# 0. Introduction

This project is an example of supervised machine learning applied to a regression task.

## Overview of Supervised Learning

The model is trained on labeled data (features and their corresponding Premium Amount).
The goal is to learn a mapping between the input features and the target variable.

## Objective: Regression Task

The target variable (Premium Amount) is continuous, so the task is to predict a continuous value rather than a category.

- Why Regression?

    The objective is to predict a numeric value (the insurance premium), so the problem is regression rather than classification.

- Key Takeaways

    - The pipeline demonstrates a typical supervised regression workflow: preprocessing, feature engineering, model training, and evaluation.
    - A gradient-boosted tree ensemble (XGBoost, used below) can capture non-linear relationships without manual feature transformations.
    - RMSLE is used as the evaluation metric because it matches the competition requirements and suits a non-negative target.
    - The approach can be adapted to similar regression tasks on tabular datasets.
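
As a minimal sketch of the metric mentioned above, RMSLE can be computed directly with NumPy. The helper name `rmsle` and the sample values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so zero values are handled safely
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Illustrative values only: the same absolute error of 100 is penalized
# more heavily on a small premium than on a large one
small = rmsle([100.0], [200.0])
large = rmsle([5000.0], [5100.0])
print(small, large)
```

Because the error is measured on a log scale, the metric reflects relative rather than absolute differences.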

## Why It’s a Multiple Regression Problem

- Features:

    This project uses multiple input features such as Age, Annual Income, Health Score, and Number of Dependents, which makes it a multiple-variable setup.

- Regression Task:

    The goal is to predict a continuous target (Premium Amount), which is a regression task.

- Model:

    XGBoost is a tree-based ensemble, not a linear model, so this is not multiple linear regression in the strict sense. The problem setup still fits the multiple regression context because several input variables predict one continuous target. A linear baseline is examined separately in section 9.4.

# 1. Import Required Libraries

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_log_error
from xgboost import XGBRegressor
from datetime import datetime
import matplotlib.pyplot as plt


# 2. Load and Inspect the Data

In [2]:
# Load datasets
train = pd.read_csv('/kaggle/input/playground-series-s4e12/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s4e12/test.csv')

# Check for missing values and data types
print("Train Data Info:")
print(train.info())
print("\nTest Data Info:")
print(test.info())

# Extract target variable
y = train['Premium Amount']
train.drop(columns=['Premium Amount'], inplace=True)


Train Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 21 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1200000 non-null  int64  
 1   Age                   1181295 non-null  float64
 2   Gender                1200000 non-null  object 
 3   Annual Income         1155051 non-null  float64
 4   Marital Status        1181471 non-null  object 
 5   Number of Dependents  1090328 non-null  float64
 6   Education Level       1200000 non-null  object 
 7   Occupation            841925 non-null   object 
 8   Health Score          1125924 non-null  float64
 9   Location              1200000 non-null  object 
 10  Policy Type           1200000 non-null  object 
 11  Previous Claims       835971 non-null   float64
 12  Vehicle Age           1199994 non-null  float64
 13  Credit Score          1062118 non-null  float64
 14  Insurance Duratio

In [3]:
print(train.columns)  # Confirm that 'Premium Amount' is not in train



Index(['id', 'Age', 'Gender', 'Annual Income', 'Marital Status',
       'Number of Dependents', 'Education Level', 'Occupation', 'Health Score',
       'Location', 'Policy Type', 'Previous Claims', 'Vehicle Age',
       'Credit Score', 'Insurance Duration', 'Policy Start Date',
       'Customer Feedback', 'Smoking Status', 'Exercise Frequency',
       'Property Type'],
      dtype='object')


In [4]:
# If you want to use X (features) and y (target)
X = train  # 'train' now contains only the features
print(X.head())
print(y.head())  # 'y' contains the target variable


   id   Age  Gender  Annual Income Marital Status  Number of Dependents  \
0   0  19.0  Female        10049.0        Married                   1.0   
1   1  39.0  Female        31678.0       Divorced                   3.0   
2   2  23.0    Male        25602.0       Divorced                   3.0   
3   3  21.0    Male       141855.0        Married                   2.0   
4   4  21.0    Male        39651.0         Single                   1.0   

  Education Level     Occupation  Health Score  Location    Policy Type  \
0      Bachelor's  Self-Employed     22.598761     Urban        Premium   
1        Master's            NaN     15.569731     Rural  Comprehensive   
2     High School  Self-Employed     47.177549  Suburban        Premium   
3      Bachelor's            NaN     10.938144     Rural          Basic   
4      Bachelor's  Self-Employed     20.376094     Rural        Premium   

   Previous Claims  Vehicle Age  Credit Score  Insurance Duration  \
0              2.0         17

# 3. Preprocessing

In [5]:
# Handle missing values
# Select numerical columns
num_cols = train.select_dtypes(include=['float64', 'int64']).columns
if 'Premium Amount' in num_cols:
    num_cols = num_cols.drop(['id', 'Premium Amount'])
else:
    num_cols = num_cols.drop(['id'])

# Select categorical columns
cat_cols = train.select_dtypes(include=['object']).columns

# Fill missing values for numerical columns
train[num_cols] = train[num_cols].apply(lambda col: col.fillna(col.median()))
test[num_cols] = test[num_cols].apply(lambda col: col.fillna(col.median()))

# Fill missing values for categorical columns
train[cat_cols] = train[cat_cols].apply(lambda col: col.fillna('Unknown'))
test[cat_cols] = test[cat_cols].apply(lambda col: col.fillna('Unknown'))

# Convert 'Policy Start Date' (if present) to days elapsed since a fixed reference date
if 'Policy Start Date' in train.columns:
    ref_date = pd.Timestamp("2000-01-01")
    train['Policy Start Days'] = (pd.to_datetime(train['Policy Start Date']) - ref_date).dt.days
    test['Policy Start Days'] = (pd.to_datetime(test['Policy Start Date']) - ref_date).dt.days

    # Drop the original Policy Start Date column
    train.drop(columns=['Policy Start Date'], inplace=True)
    test.drop(columns=['Policy Start Date'], inplace=True)
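
The cell above fills each frame with its own medians. One alternative, shown here only as a sketch, is to compute the fill values on the training set and reuse them for the test set so that both are imputed consistently. The frames `toy_train` and `toy_test` are made-up stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real train/test frames
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 60.0, 70.0]})

# Learn the medians from the training data only
train_medians = toy_train.median(numeric_only=True)

# Apply the same fill values to both sets (matched by column name)
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)

print(toy_test_filled)
```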



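# 4. Train-Validation Split

In [6]:
# Split the data into training and validation sets.
# Note: this split is redone in In [9] (keeping 'id'), and that later
# split is the one the remaining cells use.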
X_train, X_valid, y_train, y_valid = train_test_split(
    train.drop(columns=['id']), y, test_size=0.2, random_state=42
)



# 5. Prepare Features for XGBoost

In [7]:
X_test = test.copy()  # Initialize X_test from the test dataset


In [8]:
# Retain the 'id' column for submission
id_column = X_test['id']  # Save the 'id' column for later use
X_test = X_test.drop(columns=['id'])  # Drop 'id' column for processing


In [9]:
from sklearn.model_selection import train_test_split

# Split training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train, y, test_size=0.2, random_state=42
)


In [10]:
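# Define categorical columns
cat_cols = X_train.select_dtypes(include=['object']).columns

# Work on explicit copies to avoid SettingWithCopyWarning
X_train, X_val = X_train.copy(), X_val.copy()

# Convert categorical columns to 'category' dtype, reusing the categories
# learned from X_train so the encoding is identical across all three sets
# (values never seen in X_train would become NaN)
for col in cat_cols:
    X_train[col] = X_train[col].astype('category')
    X_val[col] = X_val[col].astype(X_train[col].dtype)
    X_test[col] = X_test[col].astype(X_train[col].dtype)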

# Ensure numerical columns are processed (e.g., missing values filled)
num_cols = X_train.select_dtypes(include=['float64', 'int64']).columns

# Explicitly handle missing values for numerical columns
for col in num_cols:
    if col in X_test.columns:  # Check if the column exists in X_test
        X_train.loc[:, col] = X_train[col].fillna(X_train[col].median())
        X_val.loc[:, col] = X_val[col].fillna(X_train[col].median())
        X_test.loc[:, col] = X_test[col].fillna(X_train[col].median())
    else:
        # Handle cases where the column is missing in X_test
        print(f"Column '{col}' is missing in X_test and will be skipped.")



Column 'id' is missing in X_test and will be skipped.


In [11]:
# Check for any remaining missing values
print("Missing values in X_train:", X_train.isnull().sum().sum())
print("Missing values in X_val:", X_val.isnull().sum().sum())
print("Missing values in X_test:", X_test.isnull().sum().sum())

# Verify data types
print("X_train dtypes:\n", X_train.dtypes)
print("X_val dtypes:\n", X_val.dtypes)
print("X_test dtypes:\n", X_test.dtypes)


Missing values in X_train: 0
Missing values in X_val: 0
Missing values in X_test: 0
X_train dtypes:
 id                         int64
Age                      float64
Gender                  category
Annual Income            float64
Marital Status          category
Number of Dependents     float64
Education Level         category
Occupation              category
Health Score             float64
Location                category
Policy Type             category
Previous Claims          float64
Vehicle Age              float64
Credit Score             float64
Insurance Duration       float64
Customer Feedback       category
Smoking Status          category
Exercise Frequency      category
Property Type           category
Policy Start Days          int64
dtype: object
X_val dtypes:
 id                         int64
Age                      float64
Gender                  category
Annual Income            float64
Marital Status          category
Number of Dependents     float64
Education Le

In [12]:
print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")

print(f"Categorical columns: {cat_cols}")


X_train shape: (960000, 20)
X_val shape: (240000, 20)
X_test shape: (800000, 19)
Categorical columns: Index(['Gender', 'Marital Status', 'Education Level', 'Occupation', 'Location',
       'Policy Type', 'Customer Feedback', 'Smoking Status',
       'Exercise Frequency', 'Property Type'],
      dtype='object')


# 6. Model Training and Hold-Out Validation

In [13]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_log_error
import numpy as np

# Initialize the XGBoost Regressor
xgb_model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    enable_categorical=True,  # Use native handling of 'category' columns
    early_stopping_rounds=50  # Set here; passing it to fit() is deprecated and fails in recent XGBoost releases
)

# Train the model on X_train and validate on X_val
xgb_model.fit(
    X_train.drop(columns=['id']),  # Drop non-feature columns
    y_train,
    eval_set=[(X_val.drop(columns=['id']), y_val)],
    verbose=50
)

# Predict on validation set
val_preds = xgb_model.predict(X_val.drop(columns=['id']))

# Calculate RMSLE
rmsle = np.sqrt(mean_squared_log_error(y_val, np.maximum(0, val_preds)))
print(f"Validation RMSLE: {rmsle}")


[0]	validation_0-rmse:862.80931




[50]	validation_0-rmse:845.19575
[100]	validation_0-rmse:842.58514
[150]	validation_0-rmse:841.62786
[200]	validation_0-rmse:840.86132
[250]	validation_0-rmse:840.08339
[300]	validation_0-rmse:839.78725
[350]	validation_0-rmse:839.58713
[400]	validation_0-rmse:839.26234
[450]	validation_0-rmse:839.07970
[499]	validation_0-rmse:838.91257
Validation RMSLE: 1.139874106038963
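
The model above minimizes squared error on the raw premium, while the score is measured on a log scale. A common remedy, not used in this notebook, is to fit on `log1p(y)` and map predictions back with `expm1`. The NumPy sketch below illustrates the effect with the simplest possible model, a constant prediction, on made-up skewed data (`y_demo` is an assumption, not the competition target).

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical right-skewed, non-negative target
y_demo = rng.lognormal(mean=6.5, sigma=1.0, size=10_000)

def rmsle(y_true, y_pred):
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Best constant under squared error on the raw scale: the mean
pred_raw = np.full_like(y_demo, y_demo.mean())

# Best constant under squared error on the log1p scale, mapped back with expm1
pred_log = np.full_like(y_demo, np.expm1(np.log1p(y_demo).mean()))

score_raw = rmsle(y_demo, pred_raw)
score_log = rmsle(y_demo, pred_log)
print(f"RMSLE, fit on raw target: {score_raw:.4f}")
print(f"RMSLE, fit on log target: {score_log:.4f}")
```

With XGBoost, the same idea would amount to calling `fit` with `np.log1p(y_train)` and applying `np.expm1` to the predictions before scoring.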


# 7. Predict on Test Data and Prepare Submission

In [15]:
# Ensure the 'id' column is preserved from the original test dataset
id_column = test['id']  # Save the 'id' column separately

# Predict on the test dataset
# Drop 'id' only if it exists in X_test, to avoid KeyError
if 'id' in X_test.columns:
    X_test = X_test.drop(columns=['id'])

# Make predictions using the model
test_preds = xgb_model.predict(X_test)

# Create a submission DataFrame
submission = pd.DataFrame({
    'id': id_column,  # Use the preserved 'id' column
    'Premium Amount': np.maximum(0, test_preds)  # Ensure no negative predictions
})

# Save the submission file
submission.to_csv('submission.csv', index=False)
print("Submission file saved as 'submission.csv'")



Submission file saved as 'submission.csv'


# 8. Cross-Validation with K-Fold

In [16]:
from sklearn.model_selection import KFold

# Update the XGBoost model to include early stopping rounds in the constructor
xgb_model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    enable_categorical=True,  # For categorical data
    early_stopping_rounds=50  # Add this to the constructor
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = []

for train_index, val_index in kf.split(X_train):
    X_tr, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_tr, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

    # Train the model
    xgb_model.fit(
        X_tr.drop(columns=['id']),
        y_tr,
        eval_set=[(X_val_fold.drop(columns=['id']), y_val_fold)],
        verbose=False
    )

    # Predict and calculate RMSLE
    val_preds_fold = xgb_model.predict(X_val_fold.drop(columns=['id']))
    fold_rmsle = np.sqrt(mean_squared_log_error(y_val_fold, np.maximum(0, val_preds_fold)))
    cv_results.append(fold_rmsle)

print(f"Cross-Validation RMSLE scores: {cv_results}")
print(f"Mean RMSLE: {np.mean(cv_results)}")



Cross-Validation RMSLE scores: [1.1412591750529921, 1.1368383830179096, 1.1373502342492805, 1.139760262187529, 1.1364341915908904]
Mean RMSLE: 1.13832844921972
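
The fold scores printed above can be summarized with their mean and spread. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import statistics

# Fold scores copied from the cross-validation output above
cv_scores = [
    1.1412591750529921,
    1.1368383830179096,
    1.1373502342492805,
    1.139760262187529,
    1.1364341915908904,
]

mean_score = statistics.mean(cv_scores)
std_score = statistics.stdev(cv_scores)  # sample standard deviation
spread = max(cv_scores) - min(cv_scores)

print(f"Mean RMSLE: {mean_score:.4f}")
print(f"Std across folds: {std_score:.4f}")
print(f"Max - min: {spread:.4f}")
```

A small spread relative to the mean suggests that the score does not depend strongly on which rows land in the validation fold.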


The cross-validation scores are consistent across folds, ranging from about 1.136 to 1.141, with a mean RMSLE of 1.1383. This is close to the hold-out RMSLE of 1.1399 from section 6, so the validation estimate appears stable, but the error itself leaves considerable room for improvement. The next section lists some options.

# 9. Recommendations for improvement

## 9.1. Hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Check for negative target values and adjust if needed
if (y_train < 0).any():
    print("Warning: Negative target values detected. Adjusting to non-negative.")
    y_train = y_train.clip(lower=0)

# Define parameter grid
param_grid = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'n_estimators': [100, 300, 500],
}

# Initialize GridSearchCV. Squared error is used for scoring because
# 'neg_mean_squared_log_error' raises an error if any prediction is negative.
grid_search = GridSearchCV(
    estimator=XGBRegressor(random_state=42, enable_categorical=True),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
    verbose=1
)

# Fit the grid search model
grid_search.fit(X_train.drop(columns=['id']), y_train)

# Best parameters
print(f"Best Parameters: {grid_search.best_params_}")


Fitting 3 folds for each of 243 candidates, totalling 729 fits
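
The count of 729 fits follows directly from the grid size, and on 960,000 training rows each fit is expensive. The sketch below reproduces the count with the standard library and shows how a random subset of candidates shrinks the budget. The budget of 20 candidates is an arbitrary assumption; scikit-learn offers the same idea through `RandomizedSearchCV` and its `n_iter` parameter.

```python
import itertools
import random

# Same grid as in the cell above
param_grid = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'n_estimators': [100, 300, 500],
}

# Enumerate every combination, as an exhaustive grid search does
keys = list(param_grid)
all_candidates = [
    dict(zip(keys, values))
    for values in itertools.product(*param_grid.values())
]
n_folds = 3
total_fits = len(all_candidates) * n_folds
print(f"{len(all_candidates)} candidates x {n_folds} folds = {total_fits} fits")

# Evaluate only a random subset of candidates (arbitrary budget of 20)
random.seed(42)
sampled = random.sample(all_candidates, k=20)
print(f"Random search budget: {len(sampled) * n_folds} fits")
```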


## 9.2. Feature Engineering

In [None]:
columns_to_drop = ['Policy Start Days']  # Replace with your columns
for col in columns_to_drop:
    if col in X_train.columns:
        X_train.drop(columns=[col], inplace=True)
    if col in X_val.columns:
        X_val.drop(columns=[col], inplace=True)
    if col in X_test.columns:
        X_test.drop(columns=[col], inplace=True)


In [None]:
print("X_train columns:", X_train.columns)
print("X_val columns:", X_val.columns)
print("X_test columns:", X_test.columns)


In [None]:
# Example of column creation
if 'Policy Start Date' in X_train.columns:
    X_train['Policy Start Days'] = (pd.to_datetime(X_train['Policy Start Date']) - pd.Timestamp("2000-01-01")).dt.days
    X_val['Policy Start Days'] = (pd.to_datetime(X_val['Policy Start Date']) - pd.Timestamp("2000-01-01")).dt.days
    X_test['Policy Start Days'] = (pd.to_datetime(X_test['Policy Start Date']) - pd.Timestamp("2000-01-01")).dt.days


In [None]:
irrelevant_columns = [col for col in ['Policy Start Days'] if col in X_train.columns]
X_train.drop(columns=irrelevant_columns, inplace=True)
X_val.drop(columns=irrelevant_columns, inplace=True)
X_test.drop(columns=irrelevant_columns, inplace=True)


In [None]:
# Feature engineering
X_train['Age_Income_Interaction'] = X_train['Age'] * X_train['Annual Income']
X_val['Age_Income_Interaction'] = X_val['Age'] * X_val['Annual Income']
X_test['Age_Income_Interaction'] = X_test['Age'] * X_test['Annual Income']

# Drop irrelevant columns dynamically
columns_to_drop = ['Policy Start Days']
for col in columns_to_drop:
    if col in X_train.columns:
        X_train.drop(columns=[col], inplace=True)
    if col in X_val.columns:
        X_val.drop(columns=[col], inplace=True)
    if col in X_test.columns:
        X_test.drop(columns=[col], inplace=True)

print("Feature engineering and column dropping completed successfully.")
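
The cells above drop `Policy Start Days` altogether. As an alternative worth testing, the original timestamp could be split into calendar parts before the raw column is removed. The sketch below uses a few made-up timestamps, and the feature names are assumptions rather than columns from the dataset.

```python
import pandas as pd

# Hypothetical sample of policy start timestamps
dates = pd.to_datetime(pd.Series([
    "2023-12-23 15:21:39",
    "2022-06-12 08:05:00",
    "2020-01-01 00:00:00",
]))

date_features = pd.DataFrame({
    "start_year": dates.dt.year,
    "start_month": dates.dt.month,
    "start_dayofweek": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    "days_since_2000": (dates - pd.Timestamp("2000-01-01")).dt.days,
})
print(date_features)
```

Whether such features help is an empirical question that the validation RMSLE can answer.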



## 9.3. Model ensembling

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define the encoder for categorical columns
cat_cols = X_train.select_dtypes(include=['category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)
    ],
    remainder='passthrough'  # Keep numeric columns as they are
)

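# Transform the data ('id' is not a feature, and X_test no longer has it)
X_train_encoded = preprocessor.fit_transform(X_train.drop(columns=['id']))
X_val_encoded = preprocessor.transform(X_val.drop(columns=['id']))
X_test_encoded = preprocessor.transform(X_test.drop(columns=['id'], errors='ignore'))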



In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify categorical columns
cat_cols = X_train.select_dtypes(include=['category']).columns

# Create a ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ],
    remainder='passthrough'  # Leave numerical columns as is
)

# Apply preprocessing ('id' was already dropped from X_test in In [8])
X_train_enc = preprocessor.fit_transform(X_train.drop(columns=['id']))
X_val_enc = preprocessor.transform(X_val.drop(columns=['id']))
X_test_enc = preprocessor.transform(X_test.drop(columns=['id'], errors='ignore'))


In [None]:
from lightgbm import LGBMRegressor

num_leaves = 2**6  # A tree of depth 6 has at most 2**6 = 64 leaves
lgbm_model = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    num_leaves=num_leaves,
    random_state=42
)


In [None]:
lgbm_model = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    num_leaves=64,  # Based on max_depth
    min_child_samples=20,  # Minimum samples in a leaf
    feature_fraction=0.8,  # Randomly use 80% of features for each tree
    bagging_fraction=0.8,  # Randomly use 80% of rows when bagging is active
    bagging_freq=1,  # Re-sample rows every iteration; bagging is off when this is 0
    force_row_wise=True,  # Force row-wise histogram building
    random_state=42
)


In [None]:
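from sklearn.ensemble import VotingRegressor

# VotingRegressor calls fit() without an eval_set, so the XGBoost member
# must not use early stopping
xgb_plain = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6, random_state=42)

model_ensemble = VotingRegressor([
    ('xgb', xgb_plain),
    ('lgbm', lgbm_model)
])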

model_ensemble.fit(X_train_enc, y_train)
ensemble_preds = model_ensemble.predict(X_val_enc)
ensemble_rmsle = np.sqrt(mean_squared_log_error(y_val, np.maximum(0, ensemble_preds)))
print(f"Ensemble RMSLE: {ensemble_rmsle}")


In [None]:
lgbm_model = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    num_leaves=64,
    feature_fraction=0.8,  # Column subsampling (alias of colsample_bytree)
    subsample=0.8,         # Row subsampling (alias of bagging_fraction)
    subsample_freq=1,      # Row subsampling only takes effect when this is > 0
    random_state=42
)


In [None]:
from lightgbm import plot_importance
import matplotlib.pyplot as plt

lgbm_model.fit(X_train_enc, y_train)
plt.figure(figsize=(10, 8))
plot_importance(lgbm_model, max_num_features=20)
plt.show()


In [None]:
lgbm_model = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    num_leaves=64,
    min_child_samples=50,  # Default is 20; increase for regularization
    subsample=0.8,
    subsample_freq=1,  # Row subsampling only takes effect when this is > 0
    random_state=42
)


In [None]:
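from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Clip predictions at 0 before scoring; the built-in
# 'neg_mean_squared_log_error' scorer fails on negative predictions
def rmsle_clipped(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(0, y_pred)))

param_grid = {
    'max_depth': [4, 6, 8],
    'num_leaves': [31, 64, 128],
    'min_child_samples': [20, 50, 100],
    'feature_fraction': [0.6, 0.8, 1.0],
    'subsample': [0.6, 0.8, 1.0]
}

grid_search = GridSearchCV(
    estimator=LGBMRegressor(random_state=42, subsample_freq=1),  # subsample needs subsample_freq > 0
    param_grid=param_grid,
    scoring=make_scorer(rmsle_clipped, greater_is_better=False),
    cv=3,
    verbose=1
)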

grid_search.fit(X_train_enc, y_train)
print(f"Best Parameters: {grid_search.best_params_}")


In [None]:
# Drop constant features. The encoded data is an array, not a DataFrame,
# so VarianceThreshold is used instead of column-wise nunique()
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.0)
X_train_enc = selector.fit_transform(X_train_enc)
X_val_enc = selector.transform(X_val_enc)
X_test_enc = selector.transform(X_test_enc)
print(f"Constant columns removed: {(~selector.get_support()).sum()}")


In [None]:
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Define base models
xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6, random_state=42)
lgbm_model = LGBMRegressor(n_estimators=500, learning_rate=0.05, max_depth=6, random_state=42)

# Ensemble model
model_ensemble = VotingRegressor([
    ('xgb', xgb_model),
    ('lgbm', lgbm_model)
])

# Train the ensemble model
model_ensemble.fit(X_train_enc, y_train)

# Predict and calculate RMSLE on the validation set
ensemble_preds = model_ensemble.predict(X_val_enc)
ensemble_rmsle = np.sqrt(mean_squared_log_error(y_val, np.maximum(0, ensemble_preds)))
print(f"Ensemble RMSLE: {ensemble_rmsle}")


In [None]:
print(f"X_train_enc shape: {X_train_enc.shape}")
print(f"X_val_enc shape: {X_val_enc.shape}")
print(f"X_test_enc shape: {X_test_enc.shape}")


In [None]:
class MyCustomError(Exception):
    pass
raise MyCustomError("The Jedi force stops you here")

In [None]:
from sklearn.ensemble import VotingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Define the models. The raw frames still contain 'category' columns,
# so XGBoost needs enable_categorical=True
xgb_model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    random_state=42,
    enable_categorical=True
)

lgbm_model = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    num_leaves=63,  # 2^6 - 1 to align with max_depth
    random_state=42
)

# Ensemble model
model_ensemble = VotingRegressor([
    ('xgb', xgb_model),
    ('lgbm', lgbm_model)
])

# Fit the ensemble model
model_ensemble.fit(X_train.drop(columns=['id']), y_train)

# Predict and calculate RMSLE on validation set
ensemble_preds = model_ensemble.predict(X_val.drop(columns=['id']))
ensemble_rmsle = np.sqrt(mean_squared_log_error(y_val, np.maximum(0, ensemble_preds)))
print(f"Ensemble RMSLE: {ensemble_rmsle}")
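
`VotingRegressor` averages its members with equal weights by default. As a hedged sketch of the underlying idea, the snippet below blends two sets of validation predictions with NumPy and scans for the weight that gives the lowest RMSLE. The arrays `pred_a` and `pred_b` are synthetic stand-ins, not outputs of the models above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the validation target and two models' predictions
y_true = rng.lognormal(mean=6.5, sigma=0.8, size=5_000)
pred_a = y_true * np.exp(rng.normal(0.0, 0.30, size=y_true.size))
pred_b = y_true * np.exp(rng.normal(0.0, 0.35, size=y_true.size))

def rmsle(y, p):
    # Clip at 0 so the logarithm is always defined
    return float(np.sqrt(np.mean((np.log1p(np.maximum(0, p)) - np.log1p(y)) ** 2)))

# Scan blend weights w in [0, 1]: w * pred_a + (1 - w) * pred_b
weights = np.linspace(0.0, 1.0, 21)
scores = [rmsle(y_true, w * pred_a + (1 - w) * pred_b) for w in weights]
best_w = float(weights[int(np.argmin(scores))])

print(f"Model A alone: {scores[-1]:.4f}")
print(f"Model B alone: {scores[0]:.4f}")
print(f"Best blend (w={best_w:.2f}): {min(scores):.4f}")
```

A weight chosen this way is tuned on the validation data, so it should be checked on a separate fold before being trusted.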



## 9.4. Statistical aspects of multiple linear regression, the value of polynomial regression on synthetic data, and residual analysis

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.gofplots import qqplot
import matplotlib.pyplot as plt

# Build a numeric-only design matrix. 'Premium Amount' was already moved
# into y in section 2, and OLS cannot use raw string columns.
X_ols = train.drop(columns=['id']).select_dtypes(include='number')

# Separate names keep the X_train/X_val/X_test splits used by the tree
# models from being overwritten
X_tr_ols, X_te_ols, y_tr_ols, y_te_ols = train_test_split(
    X_ols, y, test_size=0.2, random_state=42
)

# Add constant for intercept
X_tr_const = sm.add_constant(X_tr_ols)
X_te_const = sm.add_constant(X_te_ols)

# Fit model
model = sm.OLS(y_tr_ols, X_tr_const).fit()
print(model.summary())

# Predictions and metrics
y_pred = model.predict(X_te_const)
rmse = np.sqrt(mean_squared_error(y_te_ols, y_pred))
r2 = r2_score(y_te_ols, y_pred)

print(f"RMSE: {rmse}")
print(f"R²: {r2}")

# Residual analysis
residuals = y_te_ols - y_pred
standardized_residuals = (residuals - np.mean(residuals)) / np.std(residuals)

# Q-Q plot for residual normality
qqplot(standardized_residuals, line='s')
plt.title('Q-Q Plot of Standardized Residuals')
plt.show()

# Heteroskedasticity test
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(residuals, X_te_const)
print(f"Breusch-Pagan test: {bp_test}")

# VIF for multicollinearity (column 0 is the constant, so start at 1)
vif_data = pd.DataFrame()
vif_data["feature"] = X_tr_ols.columns
vif_data["VIF"] = [variance_inflation_factor(X_tr_const.values, i + 1) for i in range(X_tr_ols.shape[1])]

print("VIF Data:")
print(vif_data)

# Cook's distance for influential values
influence = model.get_influence()
(c, p) = influence.cooks_distance
# use_line_collection is omitted: it was deprecated and later removed from Matplotlib
plt.stem(np.arange(len(c)), c, markerfmt=",")
plt.title("Cook's Distance")
plt.show()

# Residual plots
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted')
plt.show()


In [None]:
import matplotlib.pyplot as plt

# Predict on validation set
val_preds = xgb_model.predict(X_val.drop(columns=['id']))

# Plot actual vs predicted values
plt.scatter(y_val, val_preds, alpha=0.3)
plt.xlabel('Actual Premium Amount')
plt.ylabel('Predicted Premium Amount')
plt.title('Actual vs. Predicted')
plt.show()

# Residual plot
residuals = y_val - val_preds
plt.hist(residuals, bins=50, alpha=0.7)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
plt.show()


## 9.5. Test Polynomial Regression

In [None]:
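import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

# Polynomial expansion needs numeric input, so keep only numeric columns
X_num = X_train.select_dtypes(include='number').drop(columns=['id'], errors='ignore')

# include_bias=False because sm.add_constant supplies the intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_num)
poly_model = sm.OLS(y_train, sm.add_constant(X_poly)).fit()
print(poly_model.summary())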


In [None]:
class MyCustomError(Exception):
    pass
raise MyCustomError("The Jedi force stops you here")

# 10. Submit the File to Kaggle

In [None]:
# Confirm the saved 'id' column is available for the submission
print(f"Saved 'id' column for submission: {id_column.head()}")

# Predict with the trained model and pair the predictions with the saved ids
predictions = xgb_model.predict(X_test)
submission = pd.DataFrame({'id': id_column, 'Premium Amount': np.maximum(0, predictions)})
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")

In [None]:
# Predict on test set ('id' was dropped from X_test earlier, so ignore it if absent)
test_preds = xgb_model.predict(X_test.drop(columns=['id'], errors='ignore'))

# Create the submission DataFrame using the saved 'id' column
submission = pd.DataFrame({
    'id': id_column,
    'Premium Amount': np.maximum(0, test_preds)  # Ensure non-negative predictions
})

# Save submission file
submission.to_csv('submission.csv', index=False)
print("Submission file created: submission.csv")



In [None]:
!kaggle competitions submit -c playground-series-s4e12 -f submission.csv -m "XGBoost model with tuned parameters and cross-validation"
