# Quickstart Guide for Scikit-learn Training Using Pipeline and OneHotEncoder
## Introduction
This notebook is a quickstart for training a Scikit-learn model on numerical data. Preprocessing runs in a pipeline that scales most columns with `StandardScaler` and encodes one numerical column, treated as categorical, with `OneHotEncoder`.

### Steps Covered:
- Load data from a Snowflake table.
- Preprocess numerical data using `Pipeline` and `ColumnTransformer`.
- Apply `OneHotEncoder` to a numerical column treated as categorical.
- Train a linear regression model.
- Make predictions and evaluate the model.

### Step 1: Set Up Snowflake Session
Initialize a Snowflake session to load data from the table.

In [None]:
# Initialize Snowflake session
from snowflake.snowpark.context import get_active_session
session = get_active_session()

### Step 2: Load Data from Snowflake Table
We load data from the `CR_QUICKSTART.PUBLIC.VEHICLE` table and drop the timestamp column (`C2`) as it is not needed for this quickstart.

In [None]:
# Load data from the Snowflake table
table_name = 'CR_QUICKSTART.PUBLIC.VEHICLE'
snowpark_df = session.table(table_name)

# Convert Snowpark DataFrame to Pandas DataFrame
pandas_df = snowpark_df.to_pandas()

# Drop the timestamp column ('C2')
pandas_df = pandas_df.drop(columns=['C2'])

# Separate features (X) and target (y). Assume C6 is the target column.
X = pandas_df.drop('C6', axis=1)
y = pandas_df['C6']
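
The feature/target split above fails with a `KeyError` if a column name does not match the table. The sketch below wraps the same split in a small check that reports every missing column at once. The helper `split_features_target` and the toy frame are hypothetical and not part of the original notebook; the column names simply mirror the ones used above.

```python
import pandas as pd

def split_features_target(df, target_col, drop_cols=()):
    """Hypothetical helper: drop unused columns and separate the target."""
    required = [target_col, *drop_cols]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f'Columns not found in DataFrame: {missing}')
    features = df.drop(columns=required)
    target = df[target_col]
    return features, target

# Toy stand-in for the VEHICLE table (values are made up for illustration)
toy_df = pd.DataFrame({
    'C1': [1.0, 2.0],
    'C2': ['2024-01-01', '2024-01-02'],
    'C6': [10.0, 20.0],
})
X_small, y_small = split_features_target(toy_df, target_col='C6', drop_cols=['C2'])
print(list(X_small.columns))
```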


### Step 3: Preprocess Data Using Pipeline
We use `StandardScaler` to scale the numerical columns, after filling missing values with the column mean, and `OneHotEncoder` for one numerical column (here `C3`) that is treated as categorical.
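
To make the idea of one-hot encoding a numerical column concrete, here is a minimal sketch on made-up integer codes. It also shows the effect of `handle_unknown='ignore'`: a value that was not seen during fitting is encoded as a row of zeros rather than raising an error. The array values are illustrative assumptions, not data from the table.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer codes seen at fit time; each distinct value becomes one output column
codes_seen = np.array([[1], [2], [3]])

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(codes_seen)

# 2 was seen during fit; 4 was not, so it maps to an all-zero row
encoded = encoder.transform(np.array([[2], [4]])).toarray()
print(encoded)
```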

In [None]:
# Import necessary Scikit-learn modules
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Identify numerical columns
numerical_cols = ['C1', 'C4', 'C5', 'C7', 'C8', 'C9']
# Select a numerical column to treat as categorical for OneHotEncoding
onehot_col = ['C3']

# Create transformers for numerical and onehot-encoded columns
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
onehot_transformer = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('onehot', onehot_transformer, onehot_col)
    ]
)

# Test the preprocessor on the data
preprocessed_X = preprocessor.fit_transform(X)
print(f'Preprocessed shape: {preprocessed_X.shape}')

### Step 4: Train a Scikit-learn Model Using the Preprocessed Data
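We train a linear regression model on the preprocessed data. Note that the preprocessor in Step 3 was fitted on the full dataset before this split, so statistics from the test rows influence the scaling. That is acceptable for a quickstart, but for a realistic evaluation the preprocessor should be fitted on the training split only.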

In [None]:
# Import Scikit-learn model and training utilities
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(preprocessed_X, y, test_size=0.1, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)
print('Model training complete.')
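
An alternative to transforming the data first and splitting afterwards is to split the raw features and put the preprocessor and the model into a single `Pipeline`, so the imputer, scaler, and encoder are fitted on the training rows only. The following is a minimal sketch of that variant, not the notebook's own method. It runs on a synthetic stand-in for the table; the column subset, the generated values, and the names `demo_df`, `demo_preprocessor`, and `full_pipeline` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the table (values are made up for illustration)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'C1': rng.normal(size=200),
    'C3': rng.integers(0, 4, size=200),  # numeric codes treated as categories
    'C4': rng.normal(size=200),
})
demo_df['C6'] = 2.0 * demo_df['C1'] + demo_df['C3'] + rng.normal(scale=0.1, size=200)

X_raw = demo_df.drop(columns=['C6'])
y_raw = demo_df['C6']

# Split the raw data first so the test rows never influence preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.1, random_state=42)

demo_preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['C1', 'C4']),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['C3']),
    ]
)

# Preprocessing and regression are fitted together on the training split
full_pipeline = Pipeline([
    ('preprocessor', demo_preprocessor),
    ('regressor', LinearRegression()),
])
full_pipeline.fit(X_tr, y_tr)

demo_r2 = full_pipeline.score(X_te, y_te)
print(f'R^2 on held-out rows: {demo_r2:.3f}')
```

With this layout, `full_pipeline.predict` accepts raw rows directly, so the same preprocessing is applied consistently at training and prediction time.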

### Step 5: Make Predictions and Evaluate the Model
We use the trained model to make predictions on the test set and evaluate its performance.

In [None]:
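# Make predictions on the test set
from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X_test)

# Display a few predictions
print(f'Sample predictions: {y_pred[:10]}')

# Evaluate the predictions against the held-out targets
print(f'MSE: {mean_squared_error(y_test, y_pred):.4f}')
print(f'R^2: {r2_score(y_test, y_pred):.4f}')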
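
Two common ways to evaluate a regression model are the mean squared error (MSE) and the coefficient of determination (R^2). The sketch below computes both by hand on four made-up values and compares them with the corresponding functions in `sklearn.metrics`, to show what the scores measure. The toy arrays are assumptions for illustration and are unrelated to the table.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (made up for illustration)
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.0, 7.5, 9.0])

# MSE: average squared difference between target and prediction
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true_toy, y_pred_toy))
print(r2_manual, r2_score(y_true_toy, y_pred_toy))
```

A lower MSE and an R^2 closer to 1 indicate a better fit; an R^2 near 0 means the model does no better than always predicting the mean of the targets.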

## Conclusion
In this notebook, we demonstrated how to:
- Load data from a Snowflake table and preprocess it with Scikit-learn's `Pipeline`, `ColumnTransformer`, `StandardScaler`, and `OneHotEncoder`.
- Train a linear regression model using the preprocessed data.
- Make predictions and evaluate the model.

This approach can be adapted to other datasets and models, leveraging the flexibility of Scikit-learn's API.