# Pipeline 1: Basic Preprocessing Pipeline

This notebook implements the first preprocessing pipeline, built from basic techniques:
1. **Simple Imputation**: Median for numerical features and most frequent value for categorical features
2. **Standard Scaling**: StandardScaler for numerical features
3. **One-Hot Encoding**: OneHotEncoder for categorical features
4. **Basic Pipeline**: The steps above combined in a single ColumnTransformer

## Motivation

This pipeline serves as a baseline approach that:
- Uses straightforward preprocessing techniques
- Provides a foundation for comparison with more advanced methods
- Demonstrates basic data preparation for machine learning


In [None]:
# Imports
import numpy as np
import pandas as pd
import sklearn

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt


In [None]:
# These are your training samples along with their labels
data = pd.read_csv('health_insurance_train.csv')
data.head()


In [None]:
# Here we separate target (y) and features (X)
X = data.drop('whrswk', axis=1)
y = data['whrswk']
# divide columns into numerical and non-numerical features
numerical_feats = ['experience', 'kidslt6', 'kids618', 'husby']
categorical_feats = ['hhi', 'whi', 'hhi2', 'education', 'race', 'hispanic', 'region']


In [None]:
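# Numerical features: fill missing values with the median, then standardize
numerical_transf = Pipeline(steps=[('imputer', SimpleImputer(strategy="median")), ('scaler', StandardScaler())])
# Categorical features: fill missing values with the most frequent category, then one-hot encode.
# handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising an error.
categorical_transf = Pipeline(steps=[('imputer', SimpleImputer(strategy="most_frequent")), ('encoder', OneHotEncoder(handle_unknown='ignore'))])
# Route each transformer to its own set of columns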
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transf, numerical_feats),
        ('cat', categorical_transf, categorical_feats)
    ]
)


In [None]:
preprocessor.fit(X)

X_transformed = preprocessor.transform(X)

# Inspect the first transformed row (numerical columns first, then the one-hot columns)
print(X_transformed[0])


In [None]:
# Autograder

In the autograder you will need to provide two things: 1) an estimate of the MAE of your model on unseen data, and 2) the predictions on the autograder data. The autograder data contains only the features, not the regression targets, so you cannot compute the MAE on it yourself; you need to estimate it from the training data provided above.
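
One common way to obtain such an estimate is k-fold cross-validation on the training data. The sketch below is illustrative only: it runs on a small synthetic frame (`X_demo` and `y_demo` are hypothetical stand-ins for the real data), uses a reduced version of the preprocessing above, and picks `RandomForestRegressor` and 5 folds arbitrarily. None of these choices come from the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "experience": rng.normal(20, 8, n),
    "husby": rng.normal(30, 10, n),
    "region": pd.Series(rng.choice(["south", "west", "other"], n), dtype=object),
})
y_demo = 30 + 0.3 * X_demo["experience"] + rng.normal(0, 5, n)

# Reduced version of the preprocessing: impute + scale numbers, impute + encode categories
demo_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                            ("scaler", StandardScaler())]), ["experience", "husby"]),
    ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                            ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["region"]),
])

# Putting the preprocessor inside the model pipeline means it is re-fitted on each
# training fold, so the held-out fold never influences medians, means, or categories
model = Pipeline(steps=[
    ("prep", demo_preprocessor),
    ("reg", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports the negated MAE so that "higher is better"; flip the sign back
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")
mae_estimate = -scores.mean()
print(f"Estimated MAE on unseen data: {mae_estimate:.2f}")
```

With the real data, the same pattern would apply to `X`, `y`, and the full `preprocessor`, and the resulting mean could then replace the placeholder estimate in the submission.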


In [None]:
data_autograder = pd.read_csv('health_insurance_autograde.csv')
data_autograder.head()


In [None]:
# TODO Replace this with your own estimate of the MAE of your best model
estimate_MAE_on_new_data = np.array([1.0])

# TODO Replace this with the predictions of your best model
# via e.g. prediction = model.predict(data_autograder)
predictions_autograder_data = np.array([-1] * 17272)

# Upload this file to the Vocareum autograder:
result = np.append(estimate_MAE_on_new_data, predictions_autograder_data)
pd.DataFrame(result).to_csv("autograder_submission.txt", index=False, header=False)


## Pipeline 1 Analysis and Discussion

### Basic Preprocessing Approach

This pipeline implements a straightforward preprocessing approach that serves as a baseline for comparison:

#### 1. **Simple Imputation**
- **Numerical features**: Uses median imputation (robust to outliers)
- **Categorical features**: Uses most frequent value imputation
- **Advantage**: Simple and fast
- **Limitation**: Doesn't consider relationships between features
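
The following minimal sketch illustrates both strategies on toy values (the numbers and category names are made up for illustration). It shows why the median is preferred here: a single extreme value shifts the mean a lot but leaves the median untouched.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numerical column with one missing value and one extreme outlier (200.0)
husby_toy = np.array([[10.0], [12.0], [np.nan], [200.0]])
# Toy categorical column with one missing value
region_toy = np.array([["south"], [np.nan], ["south"], ["west"]], dtype=object)

# Median of the observed values [10, 12, 200] is 12; the mean would be 74
num_imputed = SimpleImputer(strategy="median").fit_transform(husby_toy)

# The most frequent observed category is "south"
cat_imputed = SimpleImputer(strategy="most_frequent").fit_transform(region_toy)

print(num_imputed.ravel())
print(cat_imputed.ravel())
```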

#### 2. **Standard Scaling**
- **Method**: StandardScaler (mean=0, std=1)
- **Purpose**: Ensures all numerical features are on the same scale
- **Advantage**: Helps scale-sensitive models such as KNeighborsRegressor and SGDRegressor; tree-based models are largely unaffected by scaling
- **Limitation**: Sensitive to outliers
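
The outlier sensitivity can be seen in a small sketch with made-up values: because the mean and standard deviation are both pulled by the extreme point, the remaining values end up compressed into a narrow band after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four typical values and one extreme outlier (toy data)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# The scaled column has mean 0 and standard deviation 1 by construction
print("mean:", scaled.mean(), "std:", scaled.std())

# Spread of the four typical values after scaling: the outlier inflates the
# standard deviation, so these values are squeezed close together
typical_spread = scaled[3, 0] - scaled[0, 0]
print("spread of typical values:", typical_spread)
```

If this compression turns out to matter for a given model, an outlier-robust alternative such as scikit-learn's RobustScaler could be tried in the same pipeline slot.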

#### 3. **One-Hot Encoding**
- **Method**: OneHotEncoder for categorical variables
- **Purpose**: Converts categorical data to numerical format
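- **Advantage**: Preserves all categorical information without imposing an artificial ordering on the categories
- **Limitation**: Adds one column per category, so features with many categories can create a high-dimensional feature space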
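
A short sketch on toy category values (hypothetical, not taken from the dataset) shows the one-column-per-category expansion. It also sets `handle_unknown="ignore"`, which is an assumption about how unseen categories should be treated: they are encoded as an all-zero row instead of raising an error at transform time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training column with three distinct categories
train_region = np.array([["south"], ["west"], ["south"], ["other"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(train_region)  # sparse matrix by default

# One output column per category, ordered alphabetically
print(encoder.categories_[0])
print(encoded.toarray())

# A category that was never seen during fit becomes an all-zero row
unseen_row = encoder.transform(np.array([["northcentral"]], dtype=object)).toarray()
print(unseen_row)
```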

### Pipeline Characteristics

- **Simplicity**: Easy to understand and implement
- **Speed**: Each step computes only simple per-column statistics, so the preprocessor is cheap to fit
- **Reliability**: Well-established techniques with predictable behavior
- **Baseline**: Provides a solid foundation for comparison

### Expected Performance

This basic pipeline should provide reasonable performance but may be limited by:
- Simple imputation strategies
- Standard scaling sensitivity to outliers
- High-dimensional categorical encoding
- No feature engineering or selection

The pipeline serves as a baseline against which the more advanced preprocessing techniques in Pipeline 2 can be compared.
