# Preprocessed Submissions

Purpose: check whether preprocessing alone improves results, before any feature selection or engineering.

### Key findings so far:

Data-related:
- **imputation**: clipping and filling nulls with mean  
- **encoding**: one-hot encoding
- **validation scores**: more reliable after performing stratified split

Model-related:
- **baseline**: linear regression is the best baseline model
- **advanced model**: random forest performs better than the baseline model, but takes 45 minutes to train on the full dataset
- **heuristic baseline**: predicting the median value gives the **best score so far: 1.10815**
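
The heuristic baseline can be sketched as follows. This is a minimal illustration on synthetic data, assuming the competition metric is RMSLE (root mean squared logarithmic error), which this notebook does not state explicitly. The function names and the toy premium values are hypothetical.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumed competition metric)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def constant_baseline(y_train, n_rows, stat=np.median):
    """Predict the same training statistic (median by default) for every row."""
    return np.full(n_rows, stat(y_train))

# Toy, right-skewed stand-in for 'Premium Amount' (hypothetical values)
rng = np.random.default_rng(42)
y_toy = rng.lognormal(mean=6.5, sigma=1.0, size=10_000)

median_pred = constant_baseline(y_toy, len(y_toy))
mean_pred = constant_baseline(y_toy, len(y_toy), stat=np.mean)

median_score = rmsle(y_toy, median_pred)
mean_score = rmsle(y_toy, mean_pred)
print(f"median baseline RMSLE: {median_score:.4f}")
print(f"mean baseline RMSLE:   {mean_score:.4f}")
```

On a right-skewed target, a constant median prediction tends to beat a constant mean prediction under a log-based error, which may be part of why this simple heuristic is hard to beat.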


### Findings in this NB:
#### Simple models
- a linear regression model trained on transformed and standardized X features with a transformed y gave the best results among the simple models (closest to the *heuristic baseline*, with a score of **1.11532**)
- applying the preprocessing to the X features only, and NOT to the y column, scored worse than the baseline model trained on data that was only imputed
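
One way to keep a target transform and its inverse paired is scikit-learn's `TransformedTargetRegressor`. The sketch below uses synthetic data and a plain `log1p`/`expm1` pair, which is an assumption; the target transform used for this notebook is applied in the preprocessing step and may differ.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: the log of the target is linear in the features,
# so the raw target is right-skewed (hypothetical stand-in data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2_000, 3))
log_y = 5.0 + X_demo @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=2_000)
y_demo = np.expm1(log_y)

# Fit on log1p(y); predictions are mapped back with expm1 automatically
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
wrapped.fit(X_demo, y_demo)
pred_wrapped = wrapped.predict(X_demo)

# Same model fitted directly on the raw target, for comparison
plain = LinearRegression().fit(X_demo, y_demo)
pred_plain = np.clip(plain.predict(X_demo), 0, None)  # log1p needs values > -1

def log_rmse(y_true, y_pred):
    """RMSE measured on the log1p scale."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

err_wrapped = log_rmse(y_demo, pred_wrapped)
err_plain = log_rmse(y_demo, pred_plain)
print(f"log-scale RMSE with transformed y: {err_wrapped:.4f}")
print(f"log-scale RMSE with raw y:         {err_plain:.4f}")
```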

#### Advanced Models

In [3]:
import os

import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Explicitly define the target column
target_column = 'Premium Amount'

test_data = pd.read_csv("../data/raw_dataset/test.csv")

# 1. Training the simple baseline models with standardized data
(We standardized the data specifically for these models, since linear models tend to perform best on standardized, roughly symmetric features.)

Best steps to apply:

- do a stratified split when validating
- be sure to apply the inverse of the log transform when generating the submission
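
The inverse step can be checked with a quick round trip. The sketch below assumes the target was transformed with one `log1p` followed by two squarings; this is inferred from the `np.expm1(np.sqrt(np.sqrt(...)))` call in the submission cell rather than stated in the preprocessing notes, and the helper names are hypothetical.

```python
import numpy as np

def forward_target(y):
    """Assumed forward transform: log1p, then square twice."""
    return (np.log1p(y) ** 2) ** 2

def inverse_target(z):
    """Undo the assumed transform in reverse order: two square roots, then expm1."""
    z = np.clip(z, 0, None)  # guard against negative predictions before sqrt
    return np.expm1(np.sqrt(np.sqrt(z)))

# Hypothetical premium values on the original scale
y_raw = np.array([20.0, 500.0, 1100.0, 4999.0])
z = forward_target(y_raw)
y_back = inverse_target(z)

print(np.allclose(y_back, y_raw))
```

Clipping negative predictions to zero before the square roots is a design choice in this sketch; without it, `np.sqrt` would return `nan` for any negative model output.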

In [4]:
train_dataset = pd.read_csv("../data/04_standardized_preprocessed_dataset.csv")

## validating on a stratified split

Looking at MSE and MAE, these simple models seem to do much better on this dataset than in our training runs on the baseline dataset.

In [5]:
# Separate features (X) and target (y)
X = train_dataset.drop(columns=[target_column])
y = train_dataset[target_column]

# Bin the target variable
n_bins = 20
bin_col = "y_bin"
y_binned = pd.qcut(y, q=n_bins, duplicates='drop', labels=False)
X[bin_col] = y_binned

# Train-test split, stratified on the binned target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=X[bin_col]
)

#remove the stratification column
X_test = X_test.drop(columns=[bin_col])
X_train = X_train.drop(columns=[bin_col])

# Define baseline models
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    # "Decision Tree": DecisionTreeRegressor(),
}

# Train and evaluate each model
results = []

for name, model in models.items():
    print(f"begin {name} training")
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate performance
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Store results
    results.append({
        "Model": name,
        "MSE": mse,
        "MAE": mae,
        "R^2": r2
    })
    print(f"{name} training complete")


# Display results
results_df = pd.DataFrame(results).sort_values(by="R^2", ascending=False)
results_df

begin Linear Regression training
Linear Regression training complete
begin Ridge Regression training
Ridge Regression training complete


| | Model | MSE | MAE | R^2 |
|---|---|---|---|---|
| 0 | Linear Regression | 1135128.0 | 848.89518 | 0.007794 |
| 1 | Ridge Regression | 1135128.0 | 848.895181 | 0.007794 |


## retraining on full dataset

In [3]:
# Separate features (X) and target (y)
X = train_dataset.drop(columns=[target_column])
y = train_dataset[target_column]

# Define baseline full_models
full_models = {
    "Linear Regression": LinearRegression(),
}

# Train each model on the full training set (no hold-out evaluation here)
full_results = []

for name, model in full_models.items():
    print(f"begin {name} training")
    # Train the model
    model.fit(X, y)


begin Linear Regression training


## Submission

### Prepare test set

In [4]:
linear_test = test_data.copy()


#minimally process the test dataset to get model predictions

#convert the policy start time to duration in mins
linear_test['Policy Start Date'] = pd.to_datetime(linear_test['Policy Start Date'])
linear_test['Policy Duration Mins'] = ((pd.Timestamp.now() - linear_test['Policy Start Date']).dt.total_seconds())/60
linear_test = linear_test.drop(columns=['Policy Start Date'])

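# Cast object columns to 'category' before encoding
categorical_cols = [linear_test.columns[i] for i, x in enumerate(linear_test.dtypes) if x == 'object']
linear_test[categorical_cols] = linear_test[categorical_cols].astype('category')
# Convert categorical columns to one-hot encodings
linear_test = pd.get_dummies(linear_test, drop_first=True)
# Fill nulls with the training-set column means
# NOTE: X comes from the standardized training file, so these means are likely
# near 0 rather than on the raw scale of the test data
linear_test = linear_test.fillna(X.mean())

# Drop the first column (assumed to be 'id') to get the model features;
# .copy() avoids chained-assignment warnings when columns are transformed later
X_test = linear_test.iloc[:, 1:].copy()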
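
Because `pd.get_dummies` is run on the test set separately, its columns may not match the training features when a category is missing from one of the two files. A minimal sketch of aligning them with `DataFrame.reindex`, using toy frames whose values are hypothetical:

```python
import pandas as pd

# Toy stand-ins for the raw train and test features (hypothetical values)
train_toy = pd.DataFrame({
    "Age": [30, 45, 52],
    "Gender": ["F", "M", "F"],
    "Policy Type": ["Basic", "Premium", "Comprehensive"],
})
test_toy = pd.DataFrame({
    "Age": [28, 61],
    "Gender": ["M", "F"],
    "Policy Type": ["Basic", "Premium"],  # 'Comprehensive' never appears
})

train_X = pd.get_dummies(train_toy, drop_first=True)
test_X = pd.get_dummies(test_toy, drop_first=True)

# Force the test frame to have exactly the training columns, in the same order.
# Dummy columns missing from the test set are filled with 0; extra ones are dropped.
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)

print(list(test_aligned.columns))
```

With `drop_first=True`, the dropped reference level can also differ between the two frames if the first category is absent from one of them. Encoding train and test together, or using a fitted encoder, avoids that case.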




In [5]:

###############################
#apply the transformations and standardization
non_binary_cols = [col for col in X_test.columns if X_test[col].dtype != 'bool']

# Apply transformations iteratively until skew is within [-0.5, 0.5]
for col in non_binary_cols:
    max_iterations = 3  # Prevent excessive loops
    iteration = 0

    while iteration < max_iterations:
        skew_value = skew(X_test[col])
        if -0.5 <= skew_value <= 0.5:
            break  # Stop if skew is already in range
        
        if skew_value > 0.5:
            X_test[col] = np.log1p(X_test[col])  # Log transform for positive skew
        elif skew_value < -0.5:
            X_test[col] = X_test[col]**2  # Square for negative skew
            
        iteration += 1
    
    final_skew = skew(X_test[col])
    print(f"Transformed {col} {iteration} time(s). Final skew: {final_skew:.2f}")



# Perform Z-score standardization for non-binary columns
for col in non_binary_cols:
    mean = X_test[col].mean()
    std = X_test[col].std()
    X_test[col] = (X_test[col] - mean) / std


Transformed Age 0 time(s). Final skew: -0.24
Transformed Annual Income 3 time(s). Final skew: -0.07
Transformed Number of Dependents 0 time(s). Final skew: 0.13
Transformed Health Score 0 time(s). Final skew: 0.12
Transformed Previous Claims 2 time(s). Final skew: 0.47
Transformed Vehicle Age 0 time(s). Final skew: -0.02
Transformed Credit Score 1 time(s). Final skew: 0.06
Transformed Insurance Duration 0 time(s). Final skew: -0.01
Transformed Policy Duration Mins 0 time(s). Final skew: -0.00
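
The cell above computes the skew corrections and the standardization statistics from the test set itself. An alternative, sketched here under the assumption that the raw (pre-standardization) training features are available, is to fit the scaler on the training data and reuse it on the test data so that both share one scale. The toy frames and their values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for raw numeric train and test features (hypothetical values)
rng = np.random.default_rng(1)
train_raw = pd.DataFrame({
    "Age": rng.integers(18, 65, size=1_000),
    "Credit Score": rng.normal(600, 150, size=1_000),
})
test_raw = pd.DataFrame({
    "Age": rng.integers(18, 65, size=200),
    "Credit Score": rng.normal(600, 150, size=200),
})

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train_raw)

# Apply the same statistics to both sets
train_scaled = pd.DataFrame(scaler.transform(train_raw), columns=train_raw.columns)
test_scaled = pd.DataFrame(scaler.transform(test_raw), columns=test_raw.columns)

print(train_scaled.mean().round(6).to_dict())
print(test_scaled.mean().round(3).to_dict())
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `Series.std()` defaults to `ddof=1`, so the results differ very slightly from the manual z-score loop.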


### predict and submit

In [6]:
#generate results and submit to competition
results_directory = "../results"

for name, model in full_models.items():
    
    # Invert the target transform (two square roots, then expm1) to return to the original scale
    y_pred = np.expm1(np.sqrt(np.sqrt(model.predict(X_test))))

    results = pd.DataFrame({
        'id': linear_test['id'],  
        'Premium Amount': y_pred   
    })

    filename = f"{name}_standardized_preprocessing.csv"
    results_full_path = os.path.join(results_directory,filename)
    
    results.to_csv(results_full_path, index=False)

    submission_comment = f"{name} with clipping and mean imputed values"
    kg_utils.submit(filename,submission_comment) 


Submitting file: ../results\Linear Regression_standardized_preprocessing.csv to competition: playground-series-s4e12


100%|██████████| 20.6M/20.6M [00:20<00:00, 1.06MB/s]


Submission to 'playground-series-s4e12' successful!


In [6]:
kg_utils.get_latest_score()

Latest submission 'Linear Regression_standardized_preprocessing.csv' score: 1.11532


'1.11532'

# 2. Training Advanced model with normalized data