# Build Your First Machine Learning Project - Part 3 | `Machine Learning Algorithms`

In this notebook, we'll prepare the bear dataset, then train and compare several machine learning models on it.

### What We'll Cover:

1. **Data Loading** - Load the bear dataset using Modin (`modin.pandas`) and Snowpark (`snowflake-snowpark-python`)
2. **Data Preparation** - Scale features and prepare data for model training using `scikit-learn`
3. **Model Training** - Train multiple machine learning models using `scikit-learn`:
   - Logistic Regression (`LogisticRegression`)
   - Random Forest (`RandomForestClassifier`)
   - Support Vector Machine (`SVC`)
4. **Performance Comparison** - Compare models using accuracy and the Matthews correlation coefficient (MCC) (`scikit-learn`)
5. **Model Interpretability** - Analyze feature importance and model coefficients to understand predictions (`Altair`)


# Notebook Setup

## Notebook Settings

1. Click the three dots in the top-right corner and select "Notebook settings".
2. In the "Notebook settings" modal, the General tab is active by default. Click "Run on container" and, under "Compute pool", choose a CPU compute node.
3. In the same modal, click the "External access" tab and select a policy that grants the notebook external access (for example, to data stored on GitHub).

## Install Prerequisite Libraries

Snowflake Notebooks includes common Python libraries by default. To add more, use the **Packages** dropdown in the top right. 

Let's add the following packages:
- `modin` - Perform data operations (read/write) and wrangling just like pandas with the [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- `scikit-learn` - Perform data splits and build machine learning models
- `snowflake-ml-python` - a collection of ML functionalities from Snowflake. Here, we'll use model metrics logging functionality.

Note: When using an AI/ML container, Snowpark and the relevant machine learning packages come pre-installed.

In [None]:
! pip install snowflake-ml-python

## 1. Establish Snowflake Connection

We'll start by getting an active session via the `get_active_session()` method.

In [None]:
# Get active Snowflake session
from snowflake.snowpark.context import get_active_session
session = get_active_session()

print(f"✅ Connected using active Snowflake session!")

## 2. Data Operations

In this section, we'll load the data, prepare the features and class, check for missing data, split the data, and scale the features.

### 2.1. Load Data

Data is read from the `BEAR` table stored in Snowflake via the `read_snowflake()` method.

In [None]:
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

bear_df = pd.read_snowflake("BEAR")
bear_df

### 2.2. Prepare features and class

The DataFrame is split into two parts. The features (all columns except `species` and `id`) are assigned to the `X` variable, while the class (`species`) is assigned to `y`.

In [None]:
X = bear_df.drop(columns=['species', 'id'])
y = bear_df['species']

### 2.3. Check for Missing data

In [None]:
# Data quality checks
missing_features = X.isnull().sum().sum()
missing_target = y.isnull().sum()

print(f"\n🔍 Data Quality:")
print(f"   Missing feature values: {missing_features}")
print(f"   Missing target values: {missing_target}")

### 2.4. Data Splitting
The data is split into training and testing sets at an 80/20 ratio using `scikit-learn`:
- 80% is used as the **Training set** - used to train an ML model
- 20% is used as the **Testing set** - used as a test for the ML model

In [None]:
# Import scikit-learn modules at first use
from sklearn.model_selection import train_test_split

# Split data using scikit-learn (recommended by Snowflake for ML operations)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y  # Maintain target distribution
)

print("✅ Data splitting completed!")
print('-' * 35) 

print("📊 Data Split Summary:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Testing set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"Number of features: {X_train.shape[1]}")
print('-' * 35) 

# Check class distribution in splits
print("\n🎯 Class Distribution:")
print("Training set:", y_train.value_counts().sort_index().to_dict())
print("Testing set:", y_test.value_counts().sort_index().to_dict())
print('-' * 35) 
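
To illustrate what `stratify=y` does in the split above, here is a minimal sketch on made-up labels. The arrays `X_toy` and `y_toy` are hypothetical and are not part of the bear dataset; they only show that a stratified split keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 samples of class "a" and 20 of class "b"
y_toy = np.array(["a"] * 80 + ["b"] * 20)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder single feature

# Same settings as the notebook's split, applied to the toy data
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class "b" in each split; stratification keeps both close to 20%
train_share_b = float(np.mean(y_tr == "b"))
test_share_b = float(np.mean(y_te == "b"))
print(f"Class 'b' share, training: {train_share_b:.2f}, testing: {test_share_b:.2f}")
```

Without `stratify`, a small test set could by chance contain very few samples of a minority class, which would make the test metrics less reliable.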

### 2.5. Feature Scaling

Feature scaling is a data preprocessing technique used to standardize the range of independent variables or features of data. This helps to ensure that features with larger value ranges (e.g. one variable can have a range of 10,000 to 1,000,000 while others could be 0.1 to 0.8) do not disproportionately influence the model's learning process.

Here, we're using `scikit-learn` to standardize all numerical variables: each is mean-centered (mean = 0) and scaled to unit variance (SD = 1).

In [None]:
import pandas as pd  # Native pandas; this rebinds `pd`, which referred to modin.pandas until now
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Identify numerical and categorical columns
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

print("Numerical features:", numerical_features.tolist())
print("Categorical features:", categorical_features.tolist())

# Scale numerical features
scaler = StandardScaler()
X_train_scaled_num = scaler.fit_transform(X_train[numerical_features])
X_test_scaled_num = scaler.transform(X_test[numerical_features])

# Convert categorical features using one-hot encoding
onehot = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_scaled_cat = onehot.fit_transform(X_train[categorical_features])
X_test_scaled_cat = onehot.transform(X_test[categorical_features])

# Get feature names after one-hot encoding
cat_feature_names = onehot.get_feature_names_out(categorical_features)

# Combine numerical and categorical features
X_train_scaled = np.hstack([X_train_scaled_num, X_train_scaled_cat])
X_test_scaled = np.hstack([X_test_scaled_num, X_test_scaled_cat])

# Convert to DataFrame with proper column names
all_feature_names = list(numerical_features) + list(cat_feature_names)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=all_feature_names, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=all_feature_names, index=X_test.index)

print("\n✅ Features scaling completed!")
print('-' * 35) 

print("\n📊 Scaled Data Dimension:")
print(f"Scaled training features shape: {X_train_scaled.shape}")
print(f"Scaled testing features shape: {X_test_scaled.shape}")
print('-' * 35) 

# Show scaling effect for numerical features
if len(numerical_features) > 0:
    first_num_feature = numerical_features[0]
    print("\n📊 Scaling Effect (first numerical feature):")
    print(f"Original {first_num_feature}: mean={X_train[first_num_feature].mean():.3f}, std={X_train[first_num_feature].std():.3f}")
    print(f"Scaled {first_num_feature}: mean={X_train_scaled[first_num_feature].mean():.3f}, std={X_train_scaled[first_num_feature].std():.3f}")
print('-' * 35)
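
As a rough check of what standardization does, the sketch below compares a manual z-score against `StandardScaler` on a few made-up values. The array `x` is hypothetical and unrelated to the bear dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Manual z-score: subtract the mean, divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation through scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

print("Manual and StandardScaler results match:", np.allclose(z_manual, z_sklearn))
```

Note that pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the scaled standard deviation printed by the cell above can come out slightly different from exactly 1.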


## 3. Machine Learning Model Training
Now that we have the scaled features, we'll build ML models using `scikit-learn`.

### 3.1. Logistic Regression


In [None]:
# Import logistic model and classification metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, matthews_corrcoef
import numpy as np

# Logistic Regression using scikit-learn
print("🔧 Training Logistic Regression Model...")

log_reg_model = LogisticRegression(random_state=42) # random_state for reproducibility
log_reg_model.fit(X_train_scaled, y_train)

# Make predictions (outputs class labels directly)
log_reg_train_pred = log_reg_model.predict(X_train_scaled)
log_reg_test_pred = log_reg_model.predict(X_test_scaled)

# Calculate classification metrics
logreg_train_acc = accuracy_score(y_train, log_reg_train_pred)
logreg_test_acc = accuracy_score(y_test, log_reg_test_pred)
logreg_train_mcc = matthews_corrcoef(y_train, log_reg_train_pred)
logreg_test_mcc = matthews_corrcoef(y_test, log_reg_test_pred)

test_class_report = classification_report(y_test, log_reg_test_pred)

print("✅ Logistic Regression model trained!")
print('-' * 35)

print(f"📊 Logistic Regression Results:")
print(f"   Training Accuracy: {logreg_train_acc:.4f}")
print(f"   Testing Accuracy:  {logreg_test_acc:.4f}")
print(f"   Training MCC:      {logreg_train_mcc:.4f}")
print(f"   Testing MCC:       {logreg_test_mcc:.4f}")
print('-' * 35)

print("\nClassification Report (Test Set):")
print(test_class_report)
print('-' * 35)

### 3.2. Random Forest Classifier


In [None]:
# Import ensemble methods and detailed metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef, classification_report
import pandas as pd

# Random Forest using scikit-learn
print("🌲 Training Random Forest Classifier...")
print('-' * 35)

rf_model = RandomForestClassifier(random_state=42, n_jobs=-1) # Use parallel processing
rf_model.fit(X_train_scaled, y_train)

# Make predictions
rf_train_pred = rf_model.predict(X_train_scaled)
rf_test_pred = rf_model.predict(X_test_scaled)

# Calculate comprehensive metrics
rf_train_acc = accuracy_score(y_train, rf_train_pred)
rf_test_acc = accuracy_score(y_test, rf_test_pred)
rf_train_mcc = matthews_corrcoef(y_train, rf_train_pred)
rf_test_mcc = matthews_corrcoef(y_test, rf_test_pred)
test_class_report = classification_report(y_test, rf_test_pred)

print("✅ Random Forest model trained!")
print('-' * 35)

print(f"📊 Random Forest Results:")
print(f"   Training Accuracy: {rf_train_acc:.4f}")
print(f"   Testing Accuracy:  {rf_test_acc:.4f}")
print(f"   Training MCC:      {rf_train_mcc:.4f}")
print(f"   Testing MCC:       {rf_test_mcc:.4f}")
print("\nClassification Report (Test Set):")
print(test_class_report)
print('-' * 35)

### 3.3. Support Vector Machine (SVM)


In [None]:
# Import SVM and detailed metrics
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, matthews_corrcoef, classification_report

# Support Vector Machine using scikit-learn
print("🤖 Training Support Vector Machine...")
print('-' * 35)

svm_model = SVC(random_state=42)
svm_model.fit(X_train_scaled, y_train)

# Make predictions
svm_train_pred = svm_model.predict(X_train_scaled)
svm_test_pred = svm_model.predict(X_test_scaled)

# Calculate comprehensive metrics
svm_train_acc = accuracy_score(y_train, svm_train_pred)
svm_test_acc = accuracy_score(y_test, svm_test_pred)
svm_train_mcc = matthews_corrcoef(y_train, svm_train_pred)
svm_test_mcc = matthews_corrcoef(y_test, svm_test_pred)
test_class_report = classification_report(y_test, svm_test_pred)

print("✅ SVM model trained!")
print('-' * 35)

print(f"📊 SVM Results:")
print(f"   Training Accuracy: {svm_train_acc:.4f}")
print(f"   Testing Accuracy:  {svm_test_acc:.4f}")
print(f"   Training MCC:      {svm_train_mcc:.4f}")
print(f"   Testing MCC:       {svm_test_mcc:.4f}")
print("\nClassification Report (Test Set):")
print(test_class_report)
print('-' * 35)


## 4. Benchmarking of Machine Learning Algorithms
Benchmarking means comparing several ML algorithms to see which performs best and which is most suitable for our use case.

When selecting an ML algorithm, we want one that generalizes well to new, unseen data and that provides actionable insights.
1. Model overfitting: how well a model generalizes to new, unseen data can be evaluated by the degree to which it overfits the training data
2. Model interpretability: actionable insights can be gained by analyzing the important features that contribute to the model's predictions


### 4.1. Assessing Overfitting

Overfitting occurs when a model performs much better on the data it was trained on than on new, unseen data, indicating that it has memorized noise instead of learning a general pattern. Here, we quantify it as the gap between training and testing performance:

> $$ \text{Overfitting} = \text{Training Performance} - \text{Testing Performance} $$

This formula calculates the performance drop when your model moves from familiar training data to new, unseen testing data.

- A big difference means the model is overfitted: It just memorized the training examples instead of learning the actual patterns, so it fails on new data. 👎
- A small difference is good: This means that the model generalizes well. 👍

In [None]:
# Import Altair at first use
import altair as alt

# Configure Altair for interactive visualizations
alt.data_transformers.enable('json')
alt.theme.enable('opaque')

# Compare all models
model_acc = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'SVM'],
    'Training_Accuracy': [logreg_train_acc, rf_train_acc, svm_train_acc],
    'Testing_Accuracy': [logreg_test_acc, rf_test_acc, svm_test_acc]
})

model_acc['Overfitting'] = model_acc['Training_Accuracy'] - model_acc['Testing_Accuracy']

model_mcc = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'SVM'],
    'Training_MCC': [logreg_train_mcc, rf_train_mcc, svm_train_mcc],
    'Testing_MCC': [logreg_test_mcc, rf_test_mcc, svm_test_mcc]
})

model_mcc['Overfitting'] = model_mcc['Training_MCC'] - model_mcc['Testing_MCC']


print("📊 Model Comparison:")
print('-' * 35)

print("Accuracy:")
print(model_acc.round(4))
print('-' * 35)

print("MCC:")
print(model_mcc.round(4))
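
The comparison above reports both accuracy and the Matthews correlation coefficient (MCC). As a small sketch of why the two can disagree, consider a made-up, imbalanced example in which a classifier always predicts the majority class. The labels `"A"` and `"B"` are hypothetical and not taken from the bear dataset.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical imbalanced labels: nine samples of "A", one of "B"
y_true = ["A"] * 9 + ["B"]

# A trivial classifier that always predicts the majority class
y_pred = ["A"] * 10

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Accuracy: {acc:.2f}")  # High, because most samples are "A"
print(f"MCC:      {mcc:.2f}")  # Zero, because the predictions carry no information
```

Accuracy looks good here simply because one class dominates, whereas MCC stays at zero. This is one reason to report MCC alongside accuracy when class sizes may differ.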


### 4.2. Model interpretability

Interpretable ML models are those that expose variable coefficients, which directly indicate how strongly each feature influences the target `y`.

For linear models, this can be summarized by the following equation:

> $$y = m_1x_1 + m_2x_2 + \dots + m_nx_n + b$$

where $y$ is the target or dependent variable, $m_1, \dots, m_n$ are the variable coefficients, $x_1, \dots, x_n$ are the features or independent variables, and $b$ is the baseline value (intercept).

In essence, each coefficient $m_i$ is a direct measure of the influence of feature $x_i$ on the prediction of $y$: the larger its absolute value, the stronger the feature's impact on the prediction.
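
To make the linear equation concrete, the sketch below fits a logistic regression on synthetic data and rebuilds its linear score by hand from `coef_` and `intercept_`. The data from `make_classification` and the names `X_demo`, `y_demo`, and `demo_model` are assumptions for illustration only. For logistic regression, this linear score is the log-odds of the positive class rather than the class label itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, not the bear dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)

demo_model = LogisticRegression(random_state=42).fit(X_demo, y_demo)

# Linear score computed by hand: m_1*x_1 + ... + m_n*x_n + b
manual_score = X_demo @ demo_model.coef_[0] + demo_model.intercept_[0]

# scikit-learn exposes the same score through decision_function()
print("Scores match:", np.allclose(manual_score, demo_model.decision_function(X_demo)))
```

Because the features are on comparable scales after standardization, the absolute values of the coefficients can be compared with one another as a rough ranking of influence.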



#### 4.2.1. Interpreting Logistic regression models

In [None]:
# Generated by Snowflake Copilot
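# Get coefficients from the model
# Note: coef_ has a single row for binary problems and one row per class for
# multiclass problems; [0] selects the first row only
coefficients = log_reg_model.coef_[0]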

# Create a DataFrame using the transformed feature names
logreg_feature_importance = pd.DataFrame({
    'feature': all_feature_names,  # Using all_feature_names from the py_feature_scaling cell
    'coefficient': coefficients
})

# Calculate the absolute value of the coefficients to use as 'importance'
logreg_feature_importance['abs_coefficient'] = np.abs(logreg_feature_importance['coefficient'])

# Sort the features by importance in descending order
logreg_feature_importance = logreg_feature_importance.sort_values('abs_coefficient', ascending=False)

# Print the results
print("✨ Top 5 Most Important Features (Logistic Regression):")
print(logreg_feature_importance[['feature', 'coefficient', 'abs_coefficient']].head())


In [None]:
# Select the top 5 features for visualization
import altair as alt

top_n = 5
chart_data = logreg_feature_importance.head(top_n)

# Create the bar chart
chart = alt.Chart(chart_data).mark_bar().encode(
    x=alt.X('coefficient:Q', title='Coefficient'),
    y=alt.Y('feature:N', title='Feature', sort='-x'), # Sort bars by coefficient value, descending
    color=alt.condition(
        alt.datum.coefficient > 0,
        alt.value('#00B8E7'),  # Positive coefficients
        alt.value('#FF61CC')    # Negative coefficients
    ),
    tooltip=[
        alt.Tooltip('feature:N', title='Feature'),
        alt.Tooltip('coefficient:Q', title='Coefficient', format='.4f'),
        alt.Tooltip('abs_coefficient:Q', title='Importance (|coefficient|)', format='.4f')
    ]
)

# Use a transparent background and set the text color for axes and titles to white
chart = chart.configure(
    background='transparent'
).configure_axis(
    labelColor='white',  # Color for the feature names and importance values
    titleColor='white'   # Color for the 'Feature' and 'Importance...' titles
).configure_title(
    color='white'        # Color for the main chart title
).properties(
    title=f'Top {top_n} Feature Importance (Logistic Regression)',
    width=600,
    height=400
)

chart

#### 4.2.2. Interpreting Random Forest models

In [None]:
# Feature importance analysis with Random forest
rf_feature_importance = pd.DataFrame({
    'feature': all_feature_names, # Using all_feature_names from the py_feature_scaling cell
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"✨ Top 5 Most Important Features:")
print(rf_feature_importance.head())

In [None]:
# Select the top 5 features for visualization
import altair as alt

top_n = 5
chart_data = rf_feature_importance.head(top_n)

# Create the bar chart
chart = alt.Chart(chart_data).mark_bar().encode(
    x=alt.X('importance:Q', title='Importance'),
    y=alt.Y('feature:N', title='Feature', sort='-x'), # Sort bars by importance
    color=alt.condition(
        alt.datum.importance > 0,
        alt.value('#00B8E7'),  # Features with non-zero importance
        alt.value('#FF61CC')    # Features with zero importance (importances are never negative)
    ),
    tooltip=[
        alt.Tooltip('feature:N', title='Feature'),
        alt.Tooltip('importance:Q', title='Importance', format='.4f')
    ]
)

# Use a transparent background and set the text color for axes and titles to white
chart = chart.configure(
    background='transparent'
).configure_axis(
    labelColor='white',  # Color for the feature names and importance values
    titleColor='white'   # Color for the 'Feature' and 'Importance...' titles
).configure_title(
    color='white'        # Color for the main chart title
).properties(
    title=f'Top {top_n} Feature Importance (Random Forest)',
    width=600,
    height=400
)

chart

#### 4.2.3. Interpreting SVM models

Only SVMs with a linear kernel are directly interpretable. SVMs with non-linear kernels, such as the polynomial or radial basis function (RBF) kernel, are not, and are regarded as black-box models.

The SVM model built earlier uses the RBF kernel (the `SVC` default), so it is non-linear and not interpretable.

If you'd like an interpretable SVM model, try a linear kernel instead.
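
As a minimal sketch of the linear-kernel option, the example below trains `SVC(kernel="linear")` on synthetic data and reads its coefficients. The data from `make_classification` and the names `X_demo`, `y_demo`, and `linear_svm` are hypothetical. In this notebook you would fit on `X_train_scaled` and `y_train` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (hypothetical, not the bear dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_demo = StandardScaler().fit_transform(X_demo)

# A linear kernel exposes coef_; non-linear kernels such as RBF do not
linear_svm = SVC(kernel="linear", random_state=42).fit(X_demo, y_demo)

# For a binary problem, coef_ has a single row with one weight per feature
svm_coef = linear_svm.coef_[0]

# Rank feature indices by absolute weight, largest first
ranking = np.argsort(-np.abs(svm_coef))
for idx in ranking:
    print(f"feature_{idx}: {svm_coef[idx]:+.4f}")
```

For problems with more than two classes, `SVC` trains one classifier per pair of classes, so `coef_` then contains one row per class pair rather than a single row.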

## Resources
If you'd like to take a deeper dive into the various libraries used in this tutorial, here they are:
- [pandas on Snowflake](https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake)
- [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- [scikit-learn API reference](https://scikit-learn.org/stable/api/index.html)
- [Altair API reference](https://altair-viz.github.io/user_guide/api.html)
- [YouTube Playlist on Snowflake Notebooks](https://www.youtube.com/watch?v=YB1B6vcMaGE&list=PLavJpcg8cl1Efw8x_fBKmfA2AMwjUaeBI)