# Building an Interactive Machine Learning Demo with Streamlit in Snowflake

In this notebook, we'll create and deploy an interactive Machine Learning application using Streamlit, running it entirely within a Snowflake Notebook environment. This hands-on exercise will demonstrate how to combine the power of Streamlit's user interface capabilities with scikit-learn's machine learning tools.

## Learning Objectives

By completing this exercise, you will:

- Master the usage of Streamlit widgets to create interactive data applications
- Deploy and run a Streamlit application within Snowflake Notebook
- Implement a practical classification model using scikit-learn
- Create interactive ML predictions using Streamlit's dynamic interface capabilities

The unique aspect of this tutorial is that everything runs directly within your Snowflake Notebook environment, providing a seamless development experience.

- Reference Implementation: [Streamlit Machine Learning Demo](https://github.com/kameshsampath/st-ml-app)
- Detailed Tutorial: [Zero to Streamlit](https://snowflake-labs.github.io/zero-to-streamlit/) - A comprehensive guide by Snowflake Developers on building Streamlit applications


## Prerequisites

Before we dive into building our Machine Learning application, this notebook will guide you through the essential setup steps required to prepare your Snowflake account. These preparations are crucial for deploying and running the Streamlit ML App successfully.

## Setup Steps

We will complete the following configuration tasks:

1. Database Structure Setup

   - Create necessary schemas
   - Set up required tables for our ML application


2. External Storage Configuration

   - Create and configure an external stage connected to Amazon S3 
   - Establish secure data access pathways

3. Data Preparation

   - Load the Penguins dataset into Snowflake
   - Prepare the data structure for ML operations

This foundational setup will ensure smooth execution of our Machine Learning application within the Snowflake environment. 

Let's proceed with these prerequisites step by step.


## Environment Setup: Schemas and Stages

In this section, we'll establish the foundational database structures needed for our Streamlit ML application. We'll create dedicated schemas to ensure proper organization and separation of concerns.

## Schema Organization

| Schema | Purpose |
|--------|----------|
| `apps` | Houses all application components, specifically our Streamlit application |
| `data` | Stores all data tables, including our Penguins dataset |
| `stages` | Contains all staging areas for data loading and file management |
| `file_formats` | Defines the file formats used for data ingestion |

Each schema serves a specific purpose in our application architecture:
- The `apps` schema keeps our application code isolated
- The `data` schema maintains our datasets in an organized manner
- The `stages` schema manages our external connections
- The `file_formats` schema ensures consistent data loading formats

Let's proceed with creating these schemas in our Snowflake environment.

In [None]:
-- data schema
CREATE SCHEMA IF NOT EXISTS DATA;
-- create schema to hold all stages
CREATE SCHEMA IF NOT EXISTS STAGES;
-- create schema to hold all file formats
CREATE SCHEMA IF NOT EXISTS FILE_FORMATS;
-- apps to hold all streamlit apps
CREATE SCHEMA IF NOT EXISTS APPS;

## Stage and File Format Configuration

In this section, we'll set up the necessary staging area and file format for our data loading process. Specifically, we will:

1. Create a stage named `stages.st_ml_app_penguins` that will:
   - Connect to the S3 bucket `s3://sfquickstarts/misc`
   - Serve as our data loading pipeline

2. Configure a file format `file_formats.csv` that will:
   - Define how we parse and load CSV files
   - Be associated with our stage for data processing

This setup will establish the foundation for loading our Penguins dataset into Snowflake.

Let's proceed with creating these configurations...


In [None]:
-- create an external stage that points to an S3 bucket
CREATE STAGE IF NOT EXISTS STAGES.ST_ML_APP_PENGUINS
  URL='s3://sfquickstarts/misc';

-- default CSV file format; allow values to be optionally enclosed in double quotes
CREATE FILE FORMAT IF NOT EXISTS FILE_FORMATS.CSV
  TYPE='CSV'
  SKIP_HEADER=1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';

## Loading the Penguins Dataset

As our next step, we'll load the penguins dataset that will serve as the foundation for our ML demo application. The dataset contains various measurements of different penguin species, making it perfect for our classification tasks.

## Data Loading Process

We will:
- Create a table `data.penguins` to store our penguin details
- Load data from the file `penguins_cleaned.csv` located in our external stage
- Use the previously configured stage path: `@stages.st_ml_app_penguins/penguins_cleaned.csv`

This dataset will be used throughout our demo to:
- Train our machine learning model
- Make predictions on penguin species
- Demonstrate interactive data visualization

Let's proceed with the data loading commands...

In [None]:
-- Create table to hold penguins data
CREATE OR ALTER TABLE DATA.PENGUINS(
   SPECIES STRING NOT NULL,
   ISLAND STRING NOT NULL,
   BILL_LENGTH_MM NUMBER NOT NULL,
   BILL_DEPTH_MM NUMBER NOT NULL,
   FLIPPER_LENGTH_MM NUMBER NOT NULL,
   BODY_MASS_G NUMBER NOT NULL,
   SEX STRING NOT NULL
);

-- Load the data from penguins_cleaned.csv
COPY INTO DATA.PENGUINS
FROM @STAGES.ST_ML_APP_PENGUINS/penguins_cleaned.csv
FILE_FORMAT=(FORMAT_NAME='FILE_FORMATS.CSV');

## Building Our Streamlit ML Application

Now that we have our environment set up and the penguins dataset loaded, let's start building our interactive Machine Learning application using Streamlit. We'll create a user-friendly interface that allows users to:

- Visualize the penguins dataset
- Input penguin measurements through interactive widgets
- Make real-time predictions using our trained ML model
- Display the results in an engaging way

### Getting Started
We'll begin by importing the necessary libraries and setting up our Streamlit application structure. Our app will leverage:
- Streamlit for the interactive web interface
- scikit-learn for our ML model
- Snowflake for data access
- Pandas for data manipulation

Let's dive into the code and build our application step by step...

In [None]:
import streamlit as st
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from snowflake.snowpark.session import Session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import StringType, DecimalType

### Application Session Configuration

The following code demonstrates a flexible session management approach that lets our Streamlit application run in both local development and Snowflake environments. This dual-environment capability has several benefits.

#### Key Benefits
- Development flexibility: Test and debug on your local machine
- Production readiness: Deploy directly to Snowflake
- Consistent behavior: Same application code works in both environments
- Efficient development cycle: Quick iterations during development


In [None]:
def get_active_session():
    """
    Returns the active Snowflake session based on the environment.
    For local development, it creates a new connection.
    For Snowflake, it uses the existing session.
    """
    conn = st.connection(
        os.getenv(
            "SNOWFLAKE_CONNECTION_NAME",
            "devrel-ent",
        ),
        type="snowflake",
    )
    return conn.session()

This session management function is a cornerstone of our application's architecture, allowing developers to:
- Develop and test locally on their laptops
- Deploy the same code to Snowflake without modifications
- Maintain a smooth development workflow
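
The environment lookup inside `get_active_session` can be sketched in isolation. The helper below is hypothetical (it is not part of Streamlit or Snowpark); it only illustrates how the `SNOWFLAKE_CONNECTION_NAME` environment variable overrides the `devrel-ent` default used above, which you would presumably replace with a connection name from your own configuration.

```python
import os

# Default taken from the cell above; replace it with your own connection name.
DEFAULT_CONNECTION = "devrel-ent"


def resolve_connection_name(env=None):
    """Return the connection name, preferring the environment variable."""
    # `env` is a hypothetical parameter that makes the lookup easy to test.
    env = os.environ if env is None else env
    return env.get("SNOWFLAKE_CONNECTION_NAME", DEFAULT_CONNECTION)


print(resolve_connection_name({}))
print(resolve_connection_name({"SNOWFLAKE_CONNECTION_NAME": "my-local"}))
```

Setting the variable in your shell before launching the app locally switches connections without touching the code.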

In [None]:
SELECT * FROM ST_ML_APP.DATA.PENGUINS;

### Data Preprocessing Steps
1. Import the SQL output into a pandas DataFrame. In Snowflake Notebooks you can reference a SQL cell's result by its cell name, in this case `penguins_data`
2. Standardize column names to lowercase for consistency and easier reference
3. Set appropriate data types for each column:
   - Numeric columns: Convert to float64
   - Text columns: Convert to string


In [None]:
df = penguins_data.to_pandas()

# Lowercase the column names for consistency and easier reference
df.columns=df.columns.str.lower()

# Set each column to the right data type
df['island'] = df['island'].astype('str')
df['species'] = df['species'].astype('str')
df['bill_length_mm'] = df['bill_length_mm'].astype('float64')
df['bill_depth_mm'] = df['bill_depth_mm'].astype('float64')
df['flipper_length_mm'] = df['flipper_length_mm'].astype('float64')
df['body_mass_g'] = df['body_mass_g'].astype('float64')
df['sex'] = df['sex'].astype('str')



### Streamlit Expander Widget 📂

An `st.expander` creates a collapsible section in your app that can be expanded/collapsed by clicking. It's useful for:
- Hiding optional details or settings
- Organizing long-form content
- Creating FAQ-style interfaces
- Showing additional visualizations on demand

#### Key Features
- Maintains a clean UI by hiding secondary content
- Can contain any Streamlit elements (text, charts, inputs, etc.)
- Default state can be set (expanded/collapsed)
- Customizable label text

📚 Documentation: https://docs.streamlit.io/library/api-reference/layout/st.expander

In [None]:
with st.expander("**Raw Data**"):
    df.columns = df.columns.str.lower()
    
    st.write("**X**")
    st.write("The input features that we will use to build the model.")
    X_raw = df.drop("species", axis=1)
    X_raw

    st.write("**y**")
    st.write("The target that our model will predict.")
    y_raw = df.species
    y_raw

### Scatter Plot Visualization using Altair in Streamlit 📊

Altair (powered by Vega-Lite) provides more customizable scatter plots than Streamlit's built-in charts. It suits the penguins dataset well, offering features like:
- Interactive tooltips with custom formatting
- Layered visualizations
- Color encoding by categorical variables
- Dynamic filtering and zooming
- Configurable axis and legend properties

#### Key Advantages
- Declarative grammar of graphics
- Seamless integration with pandas DataFrames
- Publication-quality aesthetics
- Compositional layering system

📚 Documentation:
- Altair: https://altair-viz.github.io/user_guide/marks/scatter.html
- Streamlit-Altair Integration: https://docs.streamlit.io/library/api-reference/charts/st.altair_chart

*Note: Altair works natively with Streamlit using `st.altair_chart()`. No additional configuration needed.*

In [None]:
import altair as alt

with st.expander("Data Visualization", expanded=True):
    # Scatter plot of bill length against body mass, colored by species
    sp = alt.Chart(df).mark_circle().encode(
        alt.X('bill_length_mm').scale(zero=False),
        alt.Y('body_mass_g').scale(zero=False, padding=1),
        color='species',
    )

    st.altair_chart(sp)



### Interactive Widgets for Data Filtering 🎛️

Streamlit provides several widgets to create dynamic, interactive filters for your data:

#### Select Box (`st.selectbox`)
- Dropdown menu for single selection
- Perfect for categorical filters (e.g., penguin species)
- Clean interface for limited options
📚 [Select Box Documentation](https://docs.streamlit.io/library/api-reference/widgets/st.selectbox)

#### Radio Button (`st.radio`)
- Visual selection for mutually exclusive options
- Great for 2-5 choices
- More visible than dropdown menus
📚 [Radio Button Documentation](https://docs.streamlit.io/library/api-reference/widgets/st.radio)

#### Slider (`st.slider`)
- Interactive range selection
- Works with numbers, dates, and times
- Supports single value or range selection
- Ideal for numerical filters (e.g., bill length range)
📚 [Slider Documentation](https://docs.streamlit.io/library/api-reference/widgets/st.slider)

#### Sidebar Organization (`st.sidebar`)
All these widgets can be neatly organized in a collapsible sidebar using `st.sidebar`:
- Keeps main content area clean
- Creates intuitive filter panel
- Automatically responsive
- Perfect for filter controls and app navigation
📚 [Sidebar Documentation](https://docs.streamlit.io/library/api-reference/layout/st.sidebar)

*💡 Pro Tip: Using the `with st.sidebar:` context manager keeps your sidebar code organized and readable. This is very useful for standalone apps.*
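
Each numeric slider needs a lower bound, an upper bound, and a default value. A minimal sketch of deriving these from a column of measurements is shown below; `slider_bounds` is a hypothetical helper and the sample values are made up for illustration.

```python
from statistics import mean


def slider_bounds(values, ndigits=2):
    """Return (minimum, maximum, rounded mean) as plain Python floats."""
    values = [float(v) for v in values]
    return min(values), max(values), round(mean(values), ndigits)


# Made-up bill lengths in millimetres
lo, hi, default = slider_bounds([39.1, 46.5, 50.0])
print(lo, hi, default)
```

In the app, the three values would be passed to `st.slider` as `min_value`, `max_value`, and `value`. Naming them `lo`, `hi`, and `default` also avoids shadowing Python's built-in `min` and `max`.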

In [None]:
st.header("Input Features")
# Islands
islands = df.island.unique().astype(str)
island = st.selectbox(
    "Island",
    islands,
)
# Bill Length
# (lo/hi/avg are used instead of min/max/mean to avoid shadowing Python built-ins)
lo, hi, avg = (
    float(df.bill_length_mm.min()),
    float(df.bill_length_mm.max()),
    float(df.bill_length_mm.mean().round(2)),
)
bill_length_mm = st.slider(
    "Bill Length(mm)",
    min_value=lo,
    max_value=hi,
    value=avg,
)
# Bill Depth
lo, hi, avg = (
    float(df.bill_depth_mm.min()),
    float(df.bill_depth_mm.max()),
    float(df.bill_depth_mm.mean().round(2)),
)
bill_depth_mm = st.slider(
    "Bill Depth(mm)",
    min_value=lo,
    max_value=hi,
    value=avg,
)
# Flipper Length
lo, hi, avg = (
    float(df.flipper_length_mm.min()),
    float(df.flipper_length_mm.max()),
    float(df.flipper_length_mm.mean().round(2)),
)
flipper_length_mm = st.slider(
    "Flipper Length(mm)",
    min_value=lo,
    max_value=hi,
    value=avg,
)
# Body Mass
lo, hi, avg = (
    float(df.body_mass_g.min()),
    float(df.body_mass_g.max()),
    float(df.body_mass_g.mean().round(2)),
)
body_mass_g = st.slider(
    "Body Mass(g)",
    min_value=lo,
    max_value=hi,
    value=avg,
)
# Gender
gender = st.radio(
    "Gender",
    ("male", "female"),
)

### Display Input Features
We will use Streamlit's [data display elements](https://docs.streamlit.io/library/api-reference/data/st.dataframe) to showcase our input features. The `st.dataframe()` function provides an interactive table with sorting and filtering capabilities.

In [None]:
data = {
    "island": island,
    "bill_length_mm": bill_length_mm,
    "bill_depth_mm": bill_depth_mm,
    "flipper_length_mm": flipper_length_mm,
    "body_mass_g": body_mass_g,
    "sex": gender,
}
input_df = pd.DataFrame(data, index=[0])
input_penguins = pd.concat([input_df, X_raw], axis=0)

with st.expander("Input Features"):
    st.write("**Input Penguins**")
    input_df
    st.write("**Combined Penguins Data**")
    input_penguins

### Data Preparation

For the data preparation step in this demo, we'll keep things straightforward and focus on:
1. Encoding string features - converting text values into numbers that our ML model can understand
2. Preparing the target variable - ensuring our prediction target is properly encoded

This will be a minimal demonstration without additional preprocessing steps like feature scaling, handling missing values, or feature engineering. 
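
One detail worth illustrating is why the user's input row is concatenated with the full dataset before encoding. The sketch below uses a made-up three-row dataset and passes `columns=` to `pd.get_dummies` (a small variation on the cell that follows); it is an illustration under those assumptions, not the app's exact code.

```python
import pandas as pd

# Hypothetical miniature of the penguins features
dataset = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen"],
    "sex": ["male", "female", "male"],
    "body_mass_g": [3750.0, 3800.0, 3250.0],
})
user_row = pd.DataFrame(
    {"island": ["Dream"], "sex": ["female"], "body_mass_g": [4000.0]}
)

# Encoded alone, the row only gets columns for the categories it contains.
alone = pd.get_dummies(user_row, columns=["island", "sex"])

# Encoded together with the dataset, it gets one column per category seen anywhere.
combined = pd.concat([user_row, dataset], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined, columns=["island", "sex"])
input_row, X = encoded[:1], encoded[1:]

print(list(alone.columns))
print(list(encoded.columns))
```

Because the input row and the training rows are encoded in one pass, they end up with identical columns, which is what the model expects at prediction time.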

In [None]:
X_encode = ["island", "sex"]
df_penguins = pd.get_dummies(input_penguins, prefix=X_encode)
X = df_penguins[1:]
input_row = df_penguins[:1]

## Encode Y
target_mapper = {
    "Adelie": 0,
    "Chinstrap": 1,
    "Gentoo": 2,
}

y = y_raw.apply(lambda v: target_mapper[v])

with st.expander("Data Preparation"):
    st.write("**Encoded X (input penguins)**")
    input_row
    st.write("**Encoded y**")
    y

### Model Training and Prediction

For this final step, we'll use RandomForestClassifier - an ensemble learning method that operates by constructing multiple decision trees during training and outputs the class that is the mode of the classes predicted by individual trees. We'll display the progress and results using Streamlit's container and progress components for a better user experience, followed by a success message showing the prediction results.

RandomForest is a good choice for our demonstration as it:
- Works well with a mix of numerical and one-hot encoded categorical features
- Provides feature importance rankings
- Is less prone to overfitting compared to single decision trees
- Requires minimal hyperparameter tuning to get reasonable results
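
The "mode of the classes" idea can be sketched in a few lines of plain Python. The votes below are made up, and `majority_vote` is a hypothetical helper rather than a scikit-learn function. Note that scikit-learn's implementation averages the trees' class probabilities instead of counting hard votes, so this is a conceptual illustration only.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up predictions from five decision trees for one penguin
votes = ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"]
print(majority_vote(votes))
```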

**References:**
* [Scikit-learn RandomForestClassifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [Scikit-learn Ensemble Methods Guide](https://scikit-learn.org/stable/modules/ensemble.html#forest)
* [User Guide: Forest of randomized trees](https://scikit-learn.org/stable/modules/forest.html)
* [Streamlit Container API](https://docs.streamlit.io/library/api-reference/layout/st.container)
* [Streamlit Progress and Status API](https://docs.streamlit.io/library/api-reference/status/st.progress)
* [Streamlit Success Message](https://docs.streamlit.io/library/api-reference/status/st.success)
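
Turning class probabilities back into a species name can be sketched as follows. `predict_proba` returns one column per class, in the order given by the model's `classes_` attribute; the probabilities below are made up, and `top_species` is a hypothetical helper used only for illustration.

```python
target_mapper = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}

# Reverse lookup: encoded class -> species name
label_of = {code: name for name, code in target_mapper.items()}


def top_species(probabilities, classes=(0, 1, 2)):
    """Pair each probability with its class code and return the likeliest species."""
    best_code, best_prob = max(zip(classes, probabilities), key=lambda pair: pair[1])
    return label_of[best_code], best_prob


# Made-up probabilities for a single penguin
print(top_species([0.08, 0.02, 0.90]))
```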

In [None]:
with st.container():
    st.subheader("**Prediction Probability**")
    ## Model Training
    rf_classifier = RandomForestClassifier()
    # Fit the model
    rf_classifier.fit(X, y)
    # predict using the model
    prediction = rf_classifier.predict(input_row)
    prediction_prob = rf_classifier.predict_proba(input_row)

    # reverse the target_mapper (encoded class -> species name)
    p_cols = dict((v, k) for k, v in target_mapper.items())
    # name each probability column after its species, following the model's class order
    df_prediction_prob = pd.DataFrame(
        prediction_prob,
        columns=[p_cols[c] for c in rf_classifier.classes_],
    )

    st.dataframe(
        df_prediction_prob,
        column_config={
            "Adelie": st.column_config.ProgressColumn(
                "Adelie",
                help="Adelie",
                format="%f",
                width="medium",
                min_value=0,
                max_value=1,
            ),
            "Chinstrap": st.column_config.ProgressColumn(
                "Chinstrap",
                help="Chinstrap",
                format="%f",
                width="medium",
                min_value=0,
                max_value=1,
            ),
            "Gentoo": st.column_config.ProgressColumn(
                "Gentoo",
                help="Gentoo",
                format="%f",
                width="medium",
                min_value=0,
                max_value=1,
            ),
        },
        hide_index=True,
    )

# display the prediction
st.subheader("Predicted Species")
st.success(p_cols[prediction[0]])



⚠️ **Important Note:**
* When changing input features, cells don't automatically re-run
* After modifying `st_input_features`, you need to manually run these cells in sequence:
  1. `st_input_features_df` - Updates the features DataFrame
  2. `py_model_data_prep` - Prepares data for model training
  3. `st_train_predict` - Trains model and shows prediction

The cells execute in this order:

`Change inputs[st_input_features]` → `Update DataFrame[st_input_features_df]` → `Prepare ML data[py_model_data_prep]` → `Train & predict[st_train_predict]`


## Summary

In this notebook, we've seen how Snowflake Notebooks and Streamlit work together to create powerful, interactive machine learning applications. This combination offers several advantages:

1. **Unified Development Environment**: Snowflake Notebooks provide a seamless environment for data preparation, model development, and testing, all within the Snowflake ecosystem.

2. **Interactive User Interfaces**: Streamlit enables us to transform our machine learning models into user-friendly applications, making complex analytics accessible to non-technical users.

3. **Scalable Processing**: By leveraging Snowflake's computational power, our applications can handle large-scale data processing without compromising performance.

4. **Real-time Analytics**: The integration allows for real-time data updates and model predictions, making our applications more dynamic and valuable for business decisions.

## Further Reading

- [Streamlit in Snowflake](https://docs.snowflake.com/en/developer-guide/streamlit/about-streamlit) - Learn more about building interactive data applications
- [Snowpark Python DataFrames](https://docs.snowflake.com/en/developer-guide/snowpark/python/working-with-dataframes) - Deep dive into data manipulation techniques
- [Snowflake ML](https://docs.snowflake.com/en/developer-guide/snowflake-ml/snowpark-ml) - Explore advanced machine learning capabilities
- [Snowflake Notebooks](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks) - Master the notebook environment for development
- [Snowflake Quickstarts](https://quickstarts.snowflake.com/) - Get hands-on experience with guided tutorials and examples

Happy building!