# Build Your First Machine Learning Project - Part 1 | `Data Wrangling`

In this course, we'll build our first machine learning project end-to-end in Python. 

### Course Content
We'll cover this in five separate notebooks/apps:
1. **Data Operations** - Ingest and wrangle data, then write it to Snowflake using Snowpark (`snowflake-snowpark-python`)
2. **Exploratory Data Analysis (EDA)** - Explore the data with summary statistics and visualizations using `Altair` and `Streamlit`
3. **Machine learning (ML)** - Prepare data and features for building models using different ML algorithms (Logistic Regression, Random Forest and Support Vector Machine) with `scikit-learn`
4. **Experiment Tracking** - Track experiments while building models and trying out different hyperparameters with `ExperimentTracking()` from `snowflake-ml-python`
5. **Data App** - Build a sharable data app with `Streamlit`

### What We'll Cover (in this Notebook):

1. **Data Loading and Preparation** - Load the bear dataset and prepare it for analysis using Modin (`modin.pandas`) and Snowpark (`snowflake-snowpark-python`)
2. **Basic Statistics** - Calculate and visualize summary statistics of the dataset
3. **Feature Distribution Analysis** - Explore the distribution of individual features across different bear species with `Altair` and `Streamlit`
4. **Correlation Analysis** - Investigate relationships between numeric features using correlation heatmaps with `Altair` and `Streamlit`
5. **Feature Relationships** - Visualize relationships between pairs of features using interactive scatter plots with `Altair` and `Streamlit`
6. **Categorical Analysis** - Examine the distribution of categorical features including species classification with `Altair` and `Streamlit`



# Data Setup

In this step, we'll perform the following:
1. Load tabular data from a GitHub repo
2. Create a Snowflake stage for storing image data

## Bear dataset

The dataset poses a classic multi-class classification problem: the goal of the machine learning task is to assign each bear to one of four species based on its features.

The bear dataset comprises 200 bear samples. Each entry is described by 6 features pertaining to the bear's physical characteristics (also known as parameters, independent variables or X variables) and is assigned to one of four bear species (A, B, C and D).



## Load Data

Here, we'll load the first portion of the dataset, which consists of the ID, the bear species, and 6 feature columns:
- id
- species
- body_mass_kg
- shoulder_hump_height_cm
- claw_length_cm
- snout_length_cm
- forearm_circumference_cm
- ear_length_cm

As for the second portion, we'll build it from features extracted from a collection of bear images, with one image corresponding to each row ID.

### Download CSV data

For the first portion of the data, please download the [bear_raw_data.csv](https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/bear_raw_data.csv) file onto your computer and then we'll upload it to Snowflake.

### Upload CSV to Workspaces

From within Workspaces, in the same working directory as these notebooks, click **+ Add new** > **Upload Files** and select the `bear_raw_data.csv` file.

Or if the notebook is in a folder, hover on the folder and click on **+** > **Upload Files** and select the `bear_raw_data.csv` file.

In [None]:
import pandas as pd

df = pd.read_csv("bear_raw_data.csv")
df
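
Before moving on, it can help to confirm that the loaded table has the columns listed above. The following is a minimal sketch, not part of the original notebook: the helper `check_bear_frame` is hypothetical, and the two sample rows are made up so the block runs without the CSV file.

```python
import io
import pandas as pd

# Column names listed in the "Load Data" section above.
EXPECTED_COLUMNS = [
    "id", "species", "body_mass_kg", "shoulder_hump_height_cm",
    "claw_length_cm", "snout_length_cm", "forearm_circumference_cm",
    "ear_length_cm",
]

def check_bear_frame(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "id" in frame.columns and frame["id"].duplicated().any():
        problems.append("duplicate ids found")
    if frame.isna().any().any():
        problems.append("missing values found")
    return problems

# Two made-up rows standing in for bear_raw_data.csv (values are illustrative only).
sample_csv = io.StringIO(
    "id,species,body_mass_kg,shoulder_hump_height_cm,claw_length_cm,"
    "snout_length_cm,forearm_circumference_cm,ear_length_cm\n"
    "ABB_01,A,100.0,5.0,4.0,20.0,30.0,12.0\n"
    "GRZ_01,C,250.0,15.0,9.0,25.0,45.0,10.0\n"
)
sample_df = pd.read_csv(sample_csv)
problems = check_bear_frame(sample_df)
print(problems or "no problems found")
```

In the notebook itself, the same helper could be called as `check_bear_frame(df)` right after the CSV is read.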

## Load Images

As previously mentioned, each row in the dataset has a unique bear ID and a corresponding image named after that ID (e.g. `GRZ_01`, `GRZ_02`, `GRZ_03`, etc.)

### Upload images to Workspace

In a similar fashion to how we uploaded the CSV file, we'll now upload the 200 images as a folder.

From within Workspaces, in the same working directory as these notebooks, click **+ Add new** > **Upload Folder** and select the `images/` folder.

Or if the notebook is in a folder, hover on the folder and click on **+** > **Upload Folder** and select the `images/` folder.
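
Since every ID is expected to have an image named after it, a quick check after the upload can catch files that went missing. This is a rough sketch using only the standard library: the helper `find_missing_images` is hypothetical, the `.png` extension is taken from the upload code in this notebook, and a temporary folder stands in for the real `images/` directory.

```python
import tempfile
from pathlib import Path

def find_missing_images(ids, image_dir, extension=".png"):
    """Return the IDs that have no matching image file in image_dir."""
    folder = Path(image_dir)
    return [i for i in ids if not (folder / f"{i}{extension}").is_file()]

# Demonstration with a temporary folder instead of the real images/ directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("GRZ_01.png", "GRZ_02.png"):
        (Path(tmp) / name).touch()
    missing = find_missing_images(["GRZ_01", "GRZ_02", "GRZ_03"], tmp)

print(missing)  # ['GRZ_03']
```

With the real data, the call would look like `find_missing_images(df["id"], "images")`, and an empty list would mean every row has its image.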

### Upload images to Stage

Here, we'll upload these images to a stage, which we'll need when we run LLM inference on the images in a few moments.

In [None]:
import os
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Set database and schema
DATABASE = "CHANINN_DEMO_DATA"
SCHEMA = "PUBLIC"

session.sql(f"CREATE DATABASE IF NOT EXISTS {DATABASE}").collect()
session.sql(f"USE DATABASE {DATABASE}").collect()
session.sql(f"USE SCHEMA {SCHEMA}").collect()

# Create stage with server-side encryption
session.sql("""
    CREATE OR REPLACE STAGE img_stage
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
""").collect()

# Get all PNG files from images/ folder
image_folder = 'images'
image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.png')]

# Upload all images to stage
for img in image_files:
    session.file.put(img, "@img_stage", auto_compress=False, overwrite=True)

print(f"Stage ready with {len(image_files)} images uploaded: {image_files}")

## Display bear images

Now that we have all of the images uploaded, let's have a look at them.

In [13]:
import matplotlib.pyplot as plt
from PIL import Image

# Bear species and their captions
bears = [
    ("ABB", "American Black Bear"),
    ("EUR", "Eurasian Brown Bear"), 
    ("GRZ", "Grizzly Bear"),
    ("KDK", "Kodiak Bear")
]

# Create a figure with 4 subplots
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
fig.suptitle('Bear Species', fontsize=16, fontweight='bold')

for idx, (species, caption) in enumerate(bears):
    img_path = f'images/{species}_01.png'
    
    try:
        # Open local image directly
        img = Image.open(img_path)
        
        axes[idx].imshow(img)
        axes[idx].set_title(f"{caption}\n({species}_01.png)", fontsize=10)
        axes[idx].axis('off')
    except Exception:  # fall back to a placeholder if the image cannot be opened
        axes[idx].text(0.5, 0.5, 'Image\nNot\nAvailable', 
                      ha='center', va='center', fontsize=12)
        axes[idx].set_title(f"{caption}\n({species}_01.png)", fontsize=10)
        axes[idx].axis('off')

plt.tight_layout()
plt.show()

# Image Analysis

Why are we analyzing the images? As mentioned earlier on, we're going to add additional features to the dataset by analyzing the bear images to figure out the following:
- Fur color
- Facial profile
- Paw pad texture

These 3 features will be added to the dataset that was loaded earlier in the `py_load_data` cell and stored in the `df` variable.

## Fur color classification with `ai_classify`

To classify an image into predefined categories, we're performing the following steps:
1. Use the `ai_classify()` Snowpark function to classify the image
2. Use `prompt()` to specify the instruction text with `{0}` as a placeholder for the image
3. Use `to_file()` to reference the image file from the stage
4. Provide a list of categories (`fur_colors`) for the model to classify into

In [None]:
from snowflake.snowpark.functions import ai_classify, to_file, prompt

fur_colors = ["Light Brown", "Medium Brown", "Blond", "Dark Brown", "Grizzled Brown", 
              "Reddish Brown", "Blackish Brown", "Black", "Brown", "Cinnamon"]

response = session.range(1).select(
    ai_classify(
        prompt("Please help me classify the fur color of the bear {0}", to_file("@img_stage/ABB_01.png")),
        fur_colors
    ).alias("classes")
)
response.show()
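
The later cells in this notebook read the classification as a JSON string and take the first entry under `labels`. The sketch below shows that parsing step in isolation. The sample string is hand-written to match that assumed shape rather than copied from a real response, and the helper `first_label` is hypothetical.

```python
import json

fur_colors = ["Light Brown", "Medium Brown", "Blond", "Dark Brown", "Grizzled Brown",
              "Reddish Brown", "Blackish Brown", "Black", "Brown", "Cinnamon"]

# Hand-written stand-in for one value of the "classes" column (assumed shape).
sample_response = '{"labels": ["Black"]}'

def first_label(raw, allowed):
    """Parse a classification JSON string and return its first label.

    Returns None when no label comes back or the label is not in `allowed`.
    """
    labels = json.loads(raw).get("labels") or []
    if not labels or labels[0] not in allowed:
        return None
    return labels[0]

label = first_label(sample_response, fur_colors)
print(label)
```

Returning `None` instead of raising keeps a single unusable response from stopping a batch run; rows with `None` can be reviewed afterwards.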

## Iterative Image Analysis

Here, we'll apply Cortex AISQL to every image on the stage to determine each bear's features.



### Fur Color

We'll start with analyzing the fur color for all 200 images.

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import ai_classify, to_file, prompt, col, concat, lit

session = get_active_session()

fur_colors = [
    "Light Brown", "Medium Brown", "Blond", "Dark Brown", "Grizzled Brown",
    "Reddish Brown", "Blackish Brown", "Black", "Brown", "Cinnamon"
]

# Get files and classify in batch
results_df = (
    session.sql("LIST @img_stage")
    .select(
        col('"name"').alias("file_name"),
        concat(lit("@"), col('"name"')).alias("file_path")
    )
    .select(
        col("file_name"),
        ai_classify(
            prompt("Please classify the fur color of the bear {0}", to_file(col("file_path"))),
            fur_colors
        ).alias("classification")
    )
)

results_df.show()

In [None]:
import json

results_list = [
    json.loads(row["CLASSIFICATION"])["labels"][0] 
    for row in results_df.select("classification").collect()
]

In [25]:
results_list

Here, we'll convert the collected fur color results into a structured pandas DataFrame, which we'll add to the full dataset later on.


In [None]:
import json
import pandas as pd

# Collect both columns and parse
data = [
    {
        "id": row["FILE_NAME"],
        "color": json.loads(row["CLASSIFICATION"])["labels"][0]
    }
    for row in results_df.collect()
]

# Convert to pandas DataFrame
df_fur = pd.DataFrame(data)
df_fur["id"] = df_fur["id"].apply(lambda x: x.split("/")[-1].replace(".png", ""))
df_fur
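
The ID is recovered from the staged file name by dropping the folder prefix and the extension. As an alternative sketch, `pathlib.PurePosixPath` can do both in one step. The example names below assume that the stage listing returns paths of the form `img_stage/<ID>.png`, and the helper name is hypothetical.

```python
from pathlib import PurePosixPath

def stage_name_to_id(stage_name: str) -> str:
    """Turn a staged file path such as 'img_stage/ABB_01.png' into 'ABB_01'."""
    return PurePosixPath(stage_name).stem

# Assumed examples of the "name" values returned by the stage listing.
names = ["img_stage/ABB_01.png", "img_stage/GRZ_12.png"]
ids = [stage_name_to_id(n) for n in names]
print(ids)  # ['ABB_01', 'GRZ_12']
```

Unlike `replace(".png", "")`, `stem` strips only the final suffix, so a name that happens to contain `.png` elsewhere is left intact.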

### Facial Profile

Next, we'll analyze the facial profile of the bears, which is a key distinguishing feature. The facial profile can be either:
- **Dished**: Concave profile, where the bridge of the nose dips
- **Straight**: Flat profile, with no dip from the forehead to the nose

In [28]:
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import ai_classify, to_file, prompt, col, concat, lit

session = get_active_session()

facial_profiles = ["Dished", "Straight"]

# Get files and classify in batch
results_df = (
    session.sql("LIST @img_stage")
    .select(
        col('"name"').alias("file_name"),
        concat(lit("@"), col('"name"')).alias("file_path")
    )
    .select(
        col("file_name"),
        ai_classify(
            prompt("Analyze the facial profile of the bear in this image {0}. Is it Dished (concave, nose dips) or Straight (flat, no dip)?", to_file(col("file_path"))),
            facial_profiles
        ).alias("classification")
    )
)

results_df

In [None]:
import json
import pandas as pd

# Collect both columns and parse
data = [
    {
        "id": row["FILE_NAME"].split("/")[-1].replace(".png", ""),
        "facial_profile": json.loads(row["CLASSIFICATION"])["labels"][0]
    }
    for row in results_df.collect()
]

# Convert to pandas DataFrame
df_facial_profile = pd.DataFrame(data)
df_facial_profile

### Paw Pad Texture

Next, we'll analyze the texture of the bears' paw pads, which is another distinguishing characteristic. The paw pad texture can be either:
- **Smooth**: Less textured and relatively flat, for walking
- **Rough**: More textured and grooved, for gripping and climbing

In [30]:
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import ai_classify, to_file, prompt, col, concat, lit

session = get_active_session()

paw_textures = ["Smooth", "Rough"]

# Get files and classify in batch
results_df = (
    session.sql("LIST @img_stage")
    .select(
        col('"name"').alias("file_name"),
        concat(lit("@"), col('"name"')).alias("file_path")
    )
    .select(
        col("file_name"),
        ai_classify(
            prompt("Analyze the paw pad texture of the bear in this image {0}. Is it Smooth (less textured, flat, for walking) or Rough (more textured, grooved, for gripping)?", to_file(col("file_path"))),
            paw_textures
        ).alias("classification")
    )
)

results_df

In [None]:
import json
import pandas as pd

# Collect both columns and parse
data = [
    {
        "id": row["FILE_NAME"].split("/")[-1].replace(".png", ""),
        "paw_pad_texture": json.loads(row["CLASSIFICATION"])["labels"][0]
    }
    for row in results_df.collect()
]

# Convert to pandas DataFrame
df_paw_pad = pd.DataFrame(data)
df_paw_pad
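
The parsing cells for fur color, facial profile and paw pad texture follow the same pattern, so they could share one helper. This is a sketch under stated assumptions: `rows_to_records` is a hypothetical name, and plain dictionaries stand in for the rows returned by `results_df.collect()`.

```python
import json

def rows_to_records(rows, feature_name):
    """Build one {'id': ..., feature_name: ...} record per classified image."""
    records = []
    for row in rows:
        file_name = row["FILE_NAME"]
        records.append({
            # Drop the stage prefix and the .png extension to recover the bear ID.
            "id": file_name.split("/")[-1].replace(".png", ""),
            # Keep the first label from the classification JSON.
            feature_name: json.loads(row["CLASSIFICATION"])["labels"][0],
        })
    return records

# Plain dicts standing in for collected rows (made-up values).
sample_rows = [
    {"FILE_NAME": "img_stage/ABB_01.png", "CLASSIFICATION": '{"labels": ["Smooth"]}'},
    {"FILE_NAME": "img_stage/KDK_02.png", "CLASSIFICATION": '{"labels": ["Rough"]}'},
]
records = rows_to_records(sample_rows, "paw_pad_texture")
print(records)
```

The resulting list can be passed straight to `pd.DataFrame(records)`, as the cells above do with `data`.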

# Data Operations

In this section, we'll perform essential data operations to:
- Combine the extracted features (fur color, facial profile, and paw pad texture) with the original dataset 
- Write the final dataset to a Snowflake table


In [32]:
# Read categorical columns
import pandas as pd

# Load the categorical feature data from CSV files
# df_fur_color = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/fur_color.csv")
# df_facial_profile = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/facial_profile.csv")
# df_paw_pad = pd.read_csv("https://raw.githubusercontent.com/dataprofessor/bear-dataset/refs/heads/master/paw_pad_texture.csv")

## Combining Features

Now that we have extracted all three features (fur color, facial profile, and paw pad texture) from the bear images, let's combine them with our original dataset. 

The final combined dataset will include:

- Original physical measurements (body mass, shoulder hump height, etc.)
- Fur color analysis
- Facial profile classification
- Paw pad texture assessment

This comprehensive dataset will give us a more complete picture of each bear's characteristics for our analysis.


In [33]:
# Combine df_fur, df_facial_profile and df_paw_pad with df
# Standardize IDs to uppercase so the merge keys match
df['id'] = df['id'].str.upper()
df_fur['id'] = df_fur['id'].str.upper()
df_facial_profile['id'] = df_facial_profile['id'].str.upper()
df_paw_pad['id'] = df_paw_pad['id'].str.upper()

# Merge sequentially on 'id'; inner joins keep only bears present in every DataFrame
df_combined = df.merge(df_fur, on='id', how='inner')
df_combined = df_combined.merge(df_facial_profile, on='id', how='inner')
df_combined = df_combined.merge(df_paw_pad, on='id', how='inner')

# Display the combined DataFrame
df_combined
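
An inner merge silently drops any bear that is missing from one of the feature tables. One way to see which IDs would be lost is a left merge with pandas' `indicator` option. The frames below are small made-up examples, not the real data.

```python
import pandas as pd

# Made-up stand-ins for the measurement table and one feature table.
measurements = pd.DataFrame({
    "id": ["ABB_01", "GRZ_01", "KDK_01"],
    "body_mass_kg": [100.0, 250.0, 400.0],
})
fur = pd.DataFrame({
    "id": ["ABB_01", "GRZ_01"],
    "color": ["Black", "Brown"],
})

# A left merge keeps every measurement row; "_merge" records whether a match was found.
# validate="one_to_one" raises if an ID appears more than once on either side.
checked = measurements.merge(
    fur, on="id", how="left", indicator=True, validate="one_to_one"
)
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "id"].tolist()
print(unmatched_ids)  # ['KDK_01']
```

If the list is empty for all three feature tables, the inner merges keep every row of the original dataset.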

## Write data to a database table

### Determine current database and schema

Before we write to a Snowflake table, let's check the current database and schema, which is where our table will reside.

In [34]:
SELECT CURRENT_DATABASE(), CURRENT_SCHEMA();

In [35]:
USE DATABASE chaninn_demo_data;
USE SCHEMA public;

In [36]:
# Convert pandas DataFrame to Snowpark DataFrame and write to table
snowpark_df = session.create_dataframe(df_combined)
snowpark_df.write.mode("overwrite").save_as_table("BEAR")

print("Data written to BEAR table successfully!")
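
Snowflake stores unquoted identifiers in uppercase, so lowercase column names may need quoting in later SQL queries if they were written as case-sensitive identifiers. Whether that applies here depends on how the column names are quoted when the table is created, which this sketch treats as an assumption. The helper `normalize_columns` is hypothetical and only illustrates uppercasing the names before the write.

```python
import re

# Unquoted identifiers start with a letter or underscore, followed by
# letters, digits, underscores or dollar signs.
UNQUOTED_IDENTIFIER = re.compile(r"^[A-Z_][A-Z0-9_$]*$")

def normalize_columns(columns):
    """Uppercase column names and reject any that would still need quoting."""
    normalized = [c.strip().upper() for c in columns]
    bad = [c for c in normalized if not UNQUOTED_IDENTIFIER.match(c)]
    if bad:
        raise ValueError(f"columns need renaming before writing: {bad}")
    return normalized

columns = ["id", "species", "body_mass_kg", "facial_profile"]
normalized = normalize_columns(columns)
print(normalized)  # ['ID', 'SPECIES', 'BODY_MASS_KG', 'FACIAL_PROFILE']
```

In the notebook, this could be applied with `df_combined.columns = normalize_columns(df_combined.columns)` before calling `session.create_dataframe`.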

## Query data from table

In [39]:
SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.BEAR;

# Resources
If you'd like to take a deeper dive into Snowpark pandas:
- [pandas on Snowflake](https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake)
- [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- [Snowflake Cortex AISQL](https://docs.snowflake.com/user-guide/snowflake-cortex/aisql)
- [YouTube Playlist on Snowflake Notebooks](https://www.youtube.com/watch?v=YB1B6vcMaGE&list=PLavJpcg8cl1Efw8x_fBKmfA2AMwjUaeBI)