# Build Your First Machine Learning Project - Part 1 | `Data Wrangling`

In this course, we'll build our first machine learning project end-to-end in Python. 

### Course Content
We'll cover this in five separate notebooks/apps:
1. **Data Operations** - Ingest and wrangle data, then write it to Snowflake using Modin (`modin.pandas`) and Snowpark (`snowflake-snowpark-python`)
2. **Exploratory Data Analysis (EDA)** - Explore data, summary statistics, data visualization using `Altair` and `Streamlit`
3. **Machine learning (ML)** - Prepare data and features, then build models using different ML algorithms (Logistic Regression, Random Forest and Support Vector Machine) with `scikit-learn`
4. **Experiment Tracking** - Initiate experiment tracking when building and trying out different hyperparameters with `ExperimentTracking()` from `snowflake-ml-python`
5. **Data App** - Build a sharable data app with `Streamlit`

### What We'll Cover (in this Notebook):

1. **Data Loading and Preparation** - Load the bear dataset and prepare it for analysis using Modin (`modin.pandas`) and Snowpark (`snowflake-snowpark-python`)
2. **Basic Statistics** - Calculate and visualize summary statistics of the dataset
3. **Feature Distribution Analysis** - Explore the distribution of individual features across different bear species with `Altair` and `Streamlit`
4. **Correlation Analysis** - Investigate relationships between numeric features using correlation heatmaps with `Altair` and `Streamlit`
5. **Feature Relationships** - Visualize relationships between pairs of features using interactive scatter plots with `Altair` and `Streamlit`
6. **Categorical Analysis** - Examine the distribution of categorical features including species classification with `Altair` and `Streamlit`



# Notebook Setup

## Install Prerequisite Libraries

Snowflake Notebooks includes common Python libraries by default. To add more, use the **Packages** dropdown in the top right. 

Let's add the following packages:
- `modin` - Perform data operations (read/write) and wrangling just like pandas with the [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- `scikit-learn` - Perform data splits and build machine learning models

Note: When using an AI/ML container, Snowpark and relevant machine learning packages come pre-installed.

## Notebook Settings

1. Click on the three dots on the top-right hand corner and select "Notebook settings"
2. In the "Notebook settings" modal that appears, the General tab is active by default. Click on "Run on container" and, under "Compute pool", choose a CPU compute node.
3. From the "Notebook settings" modal, click on the "External access" tab, select a policy that allows the notebook external access (*i.e.* this will allow access to data stored on GitHub).

# Data Setup

In this step, we'll perform the following:
1. Load tabular data from a GitHub repo
2. Create a Snowflake stage for storing image data

## Bear dataset

The dataset poses a classic multi-class classification problem: the goal of the machine learning task is to classify each bear as belonging to one of four species based on its features.

The bear dataset comprises 200 bear samples. Each entry is described by 6 features pertaining to the bear's physical characteristics (also known as parameters, independent variables or X variables) and is assigned to one of four bear species (A, B, C and D).



## Load Data

Here, we'll load the first portion of the data set, which comprises the ID, bear species, and 6 feature columns:
- id
- species
- body_mass_kg
- shoulder_hump_height_cm
- claw_length_cm
- snout_length_cm
- forearm_circumference_cm
- ear_length_cm

As for the second portion, we'll prepare it from features extracted from a collection of bear images, one image for each row ID.

In [None]:
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

df = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/bear_raw_data.csv")
df
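
Before moving on, it can help to confirm that the loaded table has the columns listed above. The sketch below is one way to do that. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the check works on any list of column names, such as `list(df.columns)`.

```python
# Column names listed in the "Load Data" section above.
EXPECTED_COLUMNS = [
    "id", "species", "body_mass_kg", "shoulder_hump_height_cm",
    "claw_length_cm", "snout_length_cm", "forearm_circumference_cm",
    "ear_length_cm",
]

def missing_columns(actual_columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent (case-insensitive)."""
    present = {str(name).strip().lower() for name in actual_columns}
    return [name for name in expected if name.lower() not in present]

# Illustrative input; in the notebook you would pass list(df.columns) instead.
example_columns = ["id", "species", "body_mass_kg", "claw_length_cm"]
print(missing_columns(example_columns))
```

An empty list means every expected column is present.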

## Load Images

As previously mentioned, each row from the data set has a unique ID for each bear along with its own corresponding image named using the ID (e.g. `GRZ_01`, `GRZ_02`, `GRZ_03`, etc.)

### Create Snowflake Stage to Store Image Uploads

Before we can work with any image, we'll need to create a Snowflake stage for storing the images.

We can do this via the Snowsight UI or with a SQL statement.

Essentially, this is implemented in 3 steps:
1. Create the stage using `CREATE STAGE stage_name`
2. Enable `DIRECTORY` so that files are shown in the stage
3. Use server-side encryption so that we can use Cortex functions on images stored in stage

Before creating the stage, let's switch to our working database (here I'll use `chaninn_demo_data`; feel free to replace this with another database of your choice).

I'll also specify that we'll use the `stages` schema.

In [None]:
USE DATABASE chaninn_demo_data;
USE SCHEMA stages;

In [None]:
CREATE STAGE IF NOT EXISTS bear
    DIRECTORY = ( ENABLE = true )
    ENCRYPTION = ( TYPE = 'SNOWFLAKE_SSE' );

Next, head over to the Database explorer in Snowsight and upload the 200 bear images to the `BEAR` stage located under the `CHANINN_DEMO_DATA` database and `STAGES` schema.

Afterwards, head back to this notebook and run the `ls` command to query the `@bear` stage.

In [None]:
ls @bear
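
Once the listing comes back, one quick check is whether every bear ID has a matching image. The following is a minimal sketch, assuming the image files are named after the ID (for example `ABB_01.png`) and that each listed name may carry a path prefix such as `bear/ABB_01.png`. `find_missing_images` is a hypothetical helper, and the sample values are illustrative rather than real query output.

```python
from pathlib import PurePosixPath

def find_missing_images(ids, staged_names):
    """Return the IDs that have no staged file with a matching base name."""
    # Reduce each staged path to its base name without extension, uppercased.
    staged_ids = {PurePosixPath(name).stem.upper() for name in staged_names}
    return [bear_id for bear_id in ids if bear_id.upper() not in staged_ids]

# Illustrative values; in the notebook these would come from the "id" column
# of the dataset and from the "name" column of the stage listing.
ids = ["ABB_01", "ABB_02", "GRZ_01"]
staged = ["bear/ABB_01.png", "bear/GRZ_01.png"]
print(find_missing_images(ids, staged))  # ['ABB_02']
```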

## Display bear images

Now that we have all of the images uploaded, let's have a look at them.

In [None]:
import streamlit as st
from snowflake.snowpark.context import get_active_session

st.title("Bear species")

# Create a single row with 4 columns
cols = st.columns(4)

# Bear species and their captions
bears = [
    ("ABB", "American Black Bear"),
    ("EUR", "Eurasian Brown Bear"), 
    ("GRZ", "Grizzly Bear"),
    ("KDK", "Kodiak Bear")
]

# Display images in grid using loop
for col, (species, caption) in zip(cols, bears):
    with col:
        st.image(
            f'https://github.com/dataprofessor/bear-dataset/blob/master/images/{species}_01.png?raw=true',
            caption=f"{caption} ({species}_01.png)"
        )


# Image Analysis

Why are we analyzing the images? As mentioned earlier on, we're going to add additional features to the dataset by analyzing the bear images to figure out the following:
- Fur color
- Facial profile
- Paw pad texture

These 3 features are added to the data set loaded above in the `py_load_data` cell and stored in the `df` variable.

## LLM inference on Image

To run LLM inference, we'll do the following four things:
1. Use the `AI_COMPLETE()` SQL function to analyze the image
2. Use the `claude-3-5-sonnet` model to make the inference
3. Specify the prompt that provides the instructions on how to analyze the image
4. Use `TO_FILE()` to specify the image file to work on, providing the stage and file names as input parameters

In [None]:
SELECT AI_COMPLETE('claude-3-5-sonnet',
    'What is the fur color of the bear?',
    TO_FILE('@bear', 'ABB_01.png'));

## From static to dynamic queries

Now, we'll do essentially the same thing as shown above, but structure it in such a way that we can pass Python variables to the SQL query, making it more dynamic and reusable within a programmatic workflow.

Practically, this will allow us to process a large set of 200 images iteratively.

In [None]:
from snowflake.snowpark.context import get_active_session
session = get_active_session()

prompt = 'What is the fur color of the bear?'
image = 'ABB_01.png'

query = f"""
SELECT AI_COMPLETE('claude-3-5-sonnet',
    '{prompt}',
    TO_FILE('@bear', '{image}'));
"""

session.sql(query)
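
Because the prompt is interpolated directly into a single-quoted SQL string, a prompt containing an apostrophe (for example "bear's") would break the query. One way to guard against that is sketched below. `escape_sql_literal` and `build_query` are hypothetical helpers, not part of Snowpark, and the sketch assumes the usual rules for single-quoted string constants in Snowflake (a quote is doubled, a backslash is escaped with another backslash).

```python
def escape_sql_literal(text):
    """Escape text for use inside a single-quoted SQL string constant."""
    # Escape backslashes first, then double any single quotes.
    return text.replace("\\", "\\\\").replace("'", "''")

def build_query(prompt, image, stage="@bear", model="claude-3-5-sonnet"):
    """Build the AI_COMPLETE query used above with escaped string values."""
    return (
        f"SELECT AI_COMPLETE('{escape_sql_literal(model)}',\n"
        f"    '{escape_sql_literal(prompt)}',\n"
        f"    TO_FILE('{escape_sql_literal(stage)}', '{escape_sql_literal(image)}'));"
    )

query = build_query("What is the bear's fur color?", "ABB_01.png")
print(query)
```

The resulting string can be passed to `session.sql()` in the same way as the hand-written query.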

Expanding on the above query, we'll now change the prompt to allow us to infer the bear's fur color given the bear image.

In [None]:
prompt = """
Analyze the provided image of a bear. Describe only the fur color of the bear 
by choosing the most appropriate term from the following list. The response 
should be a single value.
- Light Brown
- Medium Brown
- Blond
- Dark Brown
- Grizzled (A mix of colors with silver-tipped hairs)
- Reddish Brown
- Blackish Brown
- Black
- Brown
- Cinnamon
"""

image = 'ABB_01.png'
query = f"""
SELECT AI_COMPLETE('claude-3-5-sonnet',
    '{prompt}',
    TO_FILE('@bear', '{image}'));
"""

session.sql(query)
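
Even with a constrained prompt, the model's reply may come back with surrounding quotes, a trailing period, or different capitalization. A minimal sketch of mapping a raw reply onto the allowed labels follows. `FUR_COLORS` mirrors the list in the prompt, `normalize_label` is a hypothetical helper, and returning `None` for anything unrecognized is a design choice made for this sketch.

```python
FUR_COLORS = [
    "Light Brown", "Medium Brown", "Blond", "Dark Brown", "Grizzled",
    "Reddish Brown", "Blackish Brown", "Black", "Brown", "Cinnamon",
]

def normalize_label(response, allowed=FUR_COLORS):
    """Map a raw model reply to one of the allowed labels, or None."""
    # Strip whitespace, wrapping quotes, list dashes, and trailing periods.
    cleaned = response.strip().strip("\"'-. ").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return None

print(normalize_label('"dark brown."'))   # Dark Brown
print(normalize_label("It looks black"))  # None
```

Exact matching is used rather than substring matching, since "Brown" would otherwise also match "Light Brown" or "Dark Brown".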

## Iterative Image Analysis

Here, we'll apply Cortex AISQL to analyze the image and determine the bear's features.



### Fur Color

We'll start with analyzing the fur color for all 200 images.

In [None]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()

prompt = """
Analyze the provided image of a bear. Describe only the fur color of the bear
by choosing the most appropriate term from the following list. The response
should be a single value.
- Light Brown
- Medium Brown
- Blond
- Dark Brown
- Grizzled (A mix of colors with silver-tipped hairs)
- Reddish Brown
- Blackish Brown
- Black
- Brown
- Cinnamon
"""

# Get a list of the image files in the stage, limited to the first N entries
# (N is the number of rows in the dataset)
nrows = len(df)
staged_files_df = session.sql("LIST @bear").collect()[:nrows]

# Create a list of image filenames to iterate over
image_files = [row['name'] for row in staged_files_df if row['name'].lower().endswith(('.png', '.jpg', '.jpeg', '.gif'))]

# Create an empty list to store the results
results_list = []

# Loop through each image file and execute the AI function
for image_path in image_files:
    # Extract just the filename from the full path
    image_name = image_path.split('/')[-1]

    # Dynamically build the query for each image
    query = f"""
    SELECT AI_COMPLETE('claude-3-5-sonnet',
        '{prompt}',
        TO_FILE('@bear', '{image_name}'));
    """

    # Execute the query and collect the result
    result = session.sql(query).collect()
    
    # Append a tuple of the filename and the result to the list
    results_list.append((image_name, result[0][0]))

    print(f"Analysis for {image_name}: {result[0][0]}")

# You can now work with the `results_list`
print("\n--- All Results Collected ---")

In [None]:
results_list

Let's remove `.png` from the image name to obtain the ID.

In [None]:
fur_with_id = [
    (image_name.replace('.png', ''), color)
    for image_name, color in results_list
]

fur_with_id
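
The `.replace('.png', '')` call above is case-sensitive, so a file named `ABB_01.PNG` or `ABB_01.jpg` would keep its extension. A small sketch of an extension-agnostic alternative using the standard library's `os.path.splitext` follows. `to_id` is a hypothetical helper and the sample tuples are illustrative.

```python
import os

def to_id(image_name):
    """Drop the file extension, whatever its case or type."""
    root, _extension = os.path.splitext(image_name)
    return root

# Illustrative (filename, color) pairs shaped like `results_list`.
sample_results = [("ABB_01.png", "Black"), ("GRZ_01.PNG", "Grizzled")]
sample_with_id = [(to_id(name), color) for name, color in sample_results]
print(sample_with_id)  # [('ABB_01', 'Black'), ('GRZ_01', 'Grizzled')]
```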

Here, we'll convert our collected fur color analysis results into a structured Snowpark DataFrame, which we'll add to the full data set later on.


In [None]:
from snowflake.snowpark import types as T

schema = T.StructType([T.StructField("id", T.StringType()), T.StructField("color", T.StringType())])

# Convert fur_with_id to a Snowpark DataFrame, then to a Modin DataFrame
df_results = session.create_dataframe(fur_with_id, schema=schema)
df_fur = pd.DataFrame(df_results.to_pandas())
df_fur

### Facial Profile

Next, we'll analyze the facial profile of the bears, which is a key distinguishing feature. The facial profile can be either:
- **Dished**: Concave profile, where the bridge of the nose dips
- **Straight**: Flat profile, with no dip from the forehead to the nose

In [None]:
# Define the prompt for facial profile analysis
prompt = """
Analyze the provided image of a bear. Describe only the facial profile of the bear. 
The response must be one of the following two values as a single word with no explanation:
- Dished (Concave profile, where the bridge of the nose dips)
- Straight (Flat profile, with no dip from the forehead to the nose)
"""

# Get a list of first N image files (for testing)
staged_files_df = session.sql("LIST @bear").collect()[:nrows]

# Create a list of image filenames
image_files = [row['name'] for row in staged_files_df if row['name'].lower().endswith(('.png', '.jpg', '.jpeg', '.gif'))]

# Create an empty list to store results
results_list = []

# Process each image
for image_path in image_files:
    image_name = image_path.split('/')[-1]
    
    query = f"""
    SELECT AI_COMPLETE('claude-3-5-sonnet',
        '{prompt}',
        TO_FILE('@bear', '{image_name}'));
    """
    
    result = session.sql(query).collect()
    # Extract ID by removing .png and store with result
    id_value = image_name.replace('.png', '')
    results_list.append((id_value, result[0][0]))
    print(f"Analysis for {image_name}: {result[0][0]}")

# Create Snowpark DataFrame with results
schema = T.StructType([
    T.StructField("ID", T.StringType()), 
    T.StructField("FACIAL_PROFILE", T.StringType())
])
df_results = session.create_dataframe(results_list, schema=schema)

df_facial_profile = pd.DataFrame(df_results.to_pandas())
df_facial_profile

### Paw Pad Texture

Next, we'll analyze the texture of the bears' paw pads, which is another distinguishing characteristic. The paw pad texture can be either:
- **Smooth**: Less textured and relatively flat, for walking
- **Rough**: More textured and grooved, for gripping and climbing

In [None]:
# Define the prompt for paw pad texture analysis
prompt = """
Analyze the provided image of a bear. Describe only the paw pad texture of the bear. 
The response must be one of the following two values as a single word with no explanation:
- Smooth (Less textured and relatively flat, for walking)
- Rough (More textured and grooved, for gripping and climbing)
"""

# Get a list of first N image files (for testing)
staged_files_df = session.sql("LIST @bear").collect()[:nrows]

# Create a list of image filenames
image_files = [row['name'] for row in staged_files_df if row['name'].lower().endswith(('.png', '.jpg', '.jpeg', '.gif'))]

# Create an empty list to store results
results_list = []

# Process each image
for image_path in image_files:
    image_name = image_path.split('/')[-1]
    
    query = f"""
    SELECT AI_COMPLETE('claude-3-5-sonnet',
        '{prompt}',
        TO_FILE('@bear', '{image_name}'));
    """
    
    result = session.sql(query).collect()
    # Extract ID by removing .png and store with result
    id_value = image_name.replace('.png', '')
    results_list.append((id_value, result[0][0]))
    print(f"Analysis for {image_name}: {result[0][0]}")

# Create Snowpark DataFrame with results
schema = T.StructType([
    T.StructField("ID", T.StringType()), 
    T.StructField("Paw_Pad_Texture", T.StringType())
])
df_results = session.create_dataframe(results_list, schema=schema)

df_paw_pad = pd.DataFrame(df_results.to_pandas())
df_paw_pad
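
The three loops above differ only in their prompt, so they could share one helper. The sketch below is one possible refactoring, not the notebook's own code. `analyze_images` is a hypothetical function that takes a `run_query` callable, which in the notebook could be `lambda q: session.sql(q).collect()[0][0]`. A stand-in callable is used here so the sketch runs without a Snowflake session, and the prompt is assumed to contain no single quotes.

```python
def analyze_images(image_names, prompt, run_query,
                   stage="@bear", model="claude-3-5-sonnet"):
    """Run one AI_COMPLETE query per image and return (id, answer) pairs."""
    results = []
    for image_name in image_names:
        query = (
            f"SELECT AI_COMPLETE('{model}', '{prompt}', "
            f"TO_FILE('{stage}', '{image_name}'));"
        )
        answer = run_query(query)
        # Use the filename without its extension as the row ID.
        results.append((image_name.rsplit(".", 1)[0], answer))
    return results

# Stand-in for the real query runner; it always returns a fixed answer.
def fake_run_query(query):
    return "Dished"

demo = analyze_images(["ABB_01.png", "GRZ_01.png"],
                      "Describe the facial profile.", fake_run_query)
print(demo)  # [('ABB_01', 'Dished'), ('GRZ_01', 'Dished')]
```

Passing the query runner in as an argument keeps the loop testable and lets each feature reuse it with a different prompt.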

# Data Operations

In this section, we'll perform essential data operations to:
- Combine the extracted features (fur color, facial profile, and paw pad texture) with the original dataset 
- Write the final dataset to a Snowflake table


In [None]:
# Read categorical columns
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Load the categorical feature data from CSV files
df_fur_color = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/fur_color.csv")
df_facial_profile = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/facial_profile.csv")
df_paw_pad = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/paw_pad_texture.csv")

## Combining Features

Now that we have extracted all three features (fur color, facial profile, and paw pad texture) from the bear images, let's combine them with our original dataset. 

The final combined dataset will include:

- Original physical measurements (body mass, shoulder hump height, etc.)
- Fur color analysis
- Facial profile classification
- Paw pad texture assessment

This comprehensive dataset will give us a more complete picture of each bear's characteristics for our analysis.


In [None]:
# Combine df_fur_color, df_facial_profile and df_paw_pad with df
# Standardize the ID values so that they match across tables
df['id'] = df['id'].str.upper()  # Ensure IDs are in uppercase
df_fur_color['id'] = df_fur_color['id'].str.upper()
df_facial_profile['id'] = df_facial_profile['id'].str.upper()
df_paw_pad['id'] = df_paw_pad['id'].str.upper()

# Perform sequential inner merges on the 'id' column to combine all features
df_combined = df.merge(df_fur_color, on='id', how='inner')
df_combined = df_combined.merge(df_facial_profile, on='id', how='inner')
df_combined = df_combined.merge(df_paw_pad, on='id', how='inner')

# Display the combined DataFrame
df_combined
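
Because each merge uses `how='inner'`, any bear whose ID is missing from one of the feature tables is silently dropped from `df_combined`. One way to see which IDs would be lost is sketched below with plain Python sets. `ids_lost_in_inner_join` is a hypothetical helper and the sample IDs are illustrative.

```python
def ids_lost_in_inner_join(base_ids, feature_ids_by_name):
    """Return, per feature table, the base IDs that it does not contain."""
    lost = {}
    for name, feature_ids in feature_ids_by_name.items():
        available = {str(i).upper() for i in feature_ids}
        missing = sorted(i for i in base_ids if str(i).upper() not in available)
        if missing:
            lost[name] = missing
    return lost

# Illustrative IDs; in the notebook these could be taken from the "id"
# column of each table.
base = ["ABB_01", "ABB_02", "GRZ_01"]
features = {
    "fur_color": ["ABB_01", "ABB_02", "GRZ_01"],
    "facial_profile": ["abb_01", "GRZ_01"],
}
print(ids_lost_in_inner_join(base, features))  # {'facial_profile': ['ABB_02']}
```

An empty dictionary means the inner merges keep every row of the original table.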

## Write data to a database table

### Determine current database and schema

Before we write to a Snowflake database table, let's check the current database and schema of this notebook, which is where our database table will reside.

In [None]:
SELECT CURRENT_DATABASE(), CURRENT_SCHEMA();

In [None]:
USE DATABASE chaninn_demo_data;
USE SCHEMA public;

In [None]:
df_combined.to_snowflake(
    "BEAR",
    if_exists="replace",
    index=False
)

## Query data from table

In [None]:
SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.BEAR;

# Resources
If you'd like to take a deeper dive into Snowpark pandas:
- [pandas on Snowflake](https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake)
- [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- [Snowflake Cortex AISQL](https://docs.snowflake.com/user-guide/snowflake-cortex/aisql)
- [YouTube Playlist on Snowflake Notebooks](https://www.youtube.com/watch?v=YB1B6vcMaGE&list=PLavJpcg8cl1Efw8x_fBKmfA2AMwjUaeBI)