# Linear Regression Model Comparison with the Crafter Class

This notebook demonstrates how to use the custom `Crafter` class to compare linear regression models. We use the Prostate Cancer dataset to predict the logarithm of cancer volume (`lcavol`); the logarithm of prostate-specific antigen is a separate column, `lpsa`. The models we'll compare are:

- Linear Regression
- Lasso Regression
- Ridge Regression

We'll use our custom `split_data` utility function for data splitting and then fit each model to the training data. Finally, we'll make and compare predictions from each model.
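
To make the difference between the three models concrete, here is a minimal sketch on synthetic data (not the prostate dataset) that fits all three estimators and prints their coefficients. The data, the seed, and the variable names are assumptions chosen for illustration. Ridge shrinks coefficients toward zero, while Lasso with a large enough `alpha` typically sets some of them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Synthetic data for illustration only: just the first two of the five
# features influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(alpha=1.0),
    "Ridge": Ridge(alpha=1.0),
}

# Fit each model on the same data and keep its coefficient vector.
coefficients = {}
for name, model in models.items():
    model.fit(X, y)
    coefficients[name] = model.coef_
    print(f"{name:>16}: {np.round(model.coef_, 3)}")
```

Comparing the printed vectors side by side shows how much each penalty changes the fit relative to ordinary least squares.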


## Importing Necessary Libraries and Setup

Before we start, let's import the necessary libraries. We need scikit-learn for the regression models and our custom modules: `Crafter` for model crafting and `split_data` for data splitting. pandas, which we use for data handling, is imported in the next section when we load the data.


In [1]:
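import os

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from utils.data_split import split_data

# Switch to the project root so that the `crafter` package and the relative
# data paths used below resolve. Adjust this path to your own checkout.
os.chdir("/Users/daniel.lopes/Projects/personal/ModelCrafter")
from crafter.crafter import Crafter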

## Loading and Preparing the Data

We'll load the Prostate Cancer dataset using pandas and then split it into training and testing sets. The target variable is `lcavol`, and we split the rows at random, holding out 20% of them as the test set.


In [2]:
import pandas as pd

data = pd.read_csv("data/raw/prostate.csv")

In [3]:
data

Unnamed: 0,lcavol,lweight,age,lbph,svi,lcp,gleason,pgg45,lpsa,train
0,-0.579818,2.769459,50,-1.386294,0,-1.386294,6,0,-0.430783,True
1,-0.994252,3.319626,58,-1.386294,0,-1.386294,6,0,-0.162519,True
2,-0.510826,2.691243,74,-1.386294,0,-1.386294,7,20,-0.162519,True
3,-1.203973,3.282789,58,-1.386294,0,-1.386294,6,0,-0.162519,True
4,0.751416,3.432373,62,-1.386294,0,-1.386294,6,0,0.371564,True
...,...,...,...,...,...,...,...,...,...,...
92,2.830268,3.876396,68,-1.386294,1,1.321756,7,60,4.385147,True
93,3.821004,3.896909,44,-1.386294,1,2.169054,7,40,4.684443,True
94,2.907447,3.396185,52,-1.386294,1,2.463853,7,10,5.143124,False
95,2.882564,3.773910,68,1.558145,1,1.558145,7,80,5.477509,True


In [4]:
X_train, X_test, y_train, y_test = split_data(
    data=data, target_column="lcavol", split_method="random", test_size=0.2
)
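
The implementation of `split_data` is not shown in this notebook. As a rough sketch of what a random split with `test_size=0.2` might look like, the hypothetical helper `random_split` below wraps scikit-learn's `train_test_split`. The helper name, the `seed` parameter, and the synthetic demo frame are assumptions for illustration, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def random_split(data, target_column, test_size=0.2, seed=42):
    """Hypothetical stand-in for split_data(..., split_method="random")."""
    X = data.drop(columns=[target_column])  # every other column is a feature
    y = data[target_column]
    return train_test_split(X, y, test_size=test_size, random_state=seed)


# Small synthetic frame so the sketch runs without the prostate CSV.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["lcavol", "lweight", "age"])

X_train, X_test, y_train, y_test = random_split(demo, target_column="lcavol")
print(X_train.shape, X_test.shape)
```

Note that the frame displayed above also contains an `Unnamed: 0` index column and a `train` flag. Whether `split_data` removes them is not shown here; if it does not, they would be passed to the models as features, and dropping them first with `data.drop(columns=["Unnamed: 0", "train"])` would be one way to avoid that.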

## Defining the Modeling Steps

Next, we define the steps for our `Crafter` pipeline. Each step is a tuple whose first element is the estimator with its hyperparameters, followed by two boolean flags that are both set to `False` here. We are using three models: Linear Regression, Lasso, and Ridge Regression. Each model will be trained and evaluated independently.


In [5]:
steps = [
    (LinearRegression(), False, False),
    (Lasso(alpha=1.0), False, False),
    (Ridge(alpha=1.0), False, False),
]
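
The internals of `Crafter` are not shown in this notebook. To illustrate the idea of training each step independently, here is a simplified, hypothetical stand-in called `MiniCrafter`. It is not the real class: it ignores the two boolean flags and only mirrors the "Step N: ClassName" labels that appear in the output further down.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, Lasso, Ridge


class MiniCrafter:
    """Hypothetical, simplified stand-in; not the real Crafter class."""

    def __init__(self, steps):
        # Keep only the estimator from each (estimator, flag, flag) tuple;
        # the meaning of the two flags is not modeled here.
        self.estimators = [clone(step[0]) for step in steps]

    def fit(self, X, y):
        # Each estimator is fitted on the same data, independently of the others.
        for estimator in self.estimators:
            estimator.fit(X, y)
        return self

    def predict(self, X):
        # One entry per step, keyed by position and estimator class name.
        return {
            f"Step {i}: {type(est).__name__}": est.predict(X)
            for i, est in enumerate(self.estimators, start=1)
        }


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.5, 0.0, -1.0]) + rng.normal(scale=0.1, size=30)

steps = [
    (LinearRegression(), False, False),
    (Lasso(alpha=1.0), False, False),
    (Ridge(alpha=1.0), False, False),
]
predictions = MiniCrafter(steps).fit(X[:24], y[:24]).predict(X[24:])
for name, values in predictions.items():
    print(name, np.round(values, 2))
```

Cloning each estimator keeps the objects in `steps` unfitted, so the same list can be reused for another run.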

## Training and Predicting with Crafter

With our models defined, we initialize the `Crafter` class with these steps. We then fit the `Crafter` on our training data and make predictions on the test set. Finally, we print the first five predictions from each model.


In [6]:
# Initialize Crafter
crafter = Crafter(steps)

# Fit the Crafter
crafter.fit(X_train, y_train)

# Make predictions with all models
all_predictions = crafter.predict(X_test)

# Access predictions
for model_name, predictions in all_predictions.items():
    print(model_name)
    print(predictions[:5])  # Print the first 5 predictions of each model for brevity

Step 1: LinearRegression
[0.85459868 2.06575336 1.78426911 1.18840385 1.66009784]
Step 2: Lasso
[1.5899858  1.96400965 0.8419381  1.12245599 0.8419381 ]
Step 3: Ridge
[0.86139844 2.08301713 1.77756646 1.15035999 1.65225605]
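
Printing the first few predictions gives only a rough impression of how the models differ. One way to compare them quantitatively is to score each model's predictions against the held-out targets. The sketch below assumes that the predictions come as a dictionary mapping model names to arrays, as in the loop above. The helper name `score_predictions` and the demo numbers are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def score_predictions(all_predictions, y_true):
    """Return {model_name: {"mse": ..., "r2": ...}} for a dict of predictions."""
    return {
        name: {
            "mse": mean_squared_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred),
        }
        for name, y_pred in all_predictions.items()
    }


# Made-up numbers purely to exercise the helper.
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
demo_predictions = {
    "exact": np.array([1.0, 2.0, 3.0, 4.0]),
    "mean_only": np.full(4, 2.5),
}
scores = score_predictions(demo_predictions, y_demo)
for name, metrics in scores.items():
    print(f"{name}: MSE={metrics['mse']:.3f}, R2={metrics['r2']:.3f}")
```

In this notebook the call would be `score_predictions(all_predictions, y_test)`. A lower mean squared error and a higher R² indicate a better fit on the test set.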
