# Build Your First Machine Learning Project - Part 2 | `Exploratory Data Analysis`

In this notebook, we'll explore the Bear dataset by computing summary statistics and building data visualizations. We'll also use Streamlit to make the visualizations interactive.

### What We'll Cover:

1. **Data Loading and Preparation** - Load the bear dataset and prepare it for analysis using Modin (`modin.pandas`) and Snowpark (`snowflake-snowpark-python`)
2. **Basic Statistics** - Calculate and visualize summary statistics of the dataset
3. **Feature Distribution Analysis** - Explore the distribution of individual features across different bear species with `Altair` and `Streamlit`
4. **Correlation Analysis** - Investigate relationships between numeric features using correlation heatmaps with `Altair` and `Streamlit`
5. **Feature Relationships** - Visualize relationships between pairs of features using interactive scatter plots with `Altair` and `Streamlit`
6. **Categorical Analysis** - Examine the distribution of categorical features including species classification with `Altair` and `Streamlit`



# Notebook Setup

## Install Prerequisite Libraries

Snowflake Notebooks includes common Python libraries by default. To add more, use the **Packages** dropdown in the top right. 

Let's add the following packages:
- `modin` - Perform data operations (read/write) and wrangling just like pandas with the [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- `scikit-learn` - Perform data splits and build machine learning models

# Data Operations

We'll start by loading and preparing the bear dataset.

## Load data

In Part 1, we retrieved and prepared the Bear data and wrote it to Snowflake. Here, we'll load it by reading from a Snowflake table.

### Load data via SQL query

Previously, we saved it to `CHANINN_DEMO_DATA.PUBLIC.BEAR`, so we'll query the data with the following SQL statement:

In [None]:
SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.BEAR

### Load data via Python statement

We can also use the same SQL statement inside `pd.read_snowflake()`.

In [None]:
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

df = pd.read_snowflake("SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.BEAR")
df

Or simply pass the table name to `pd.read_snowflake()`.

In [None]:
pd.read_snowflake("BEAR")

## Data Preparation

The upcoming plots require numeric data, so we'll select only the numeric columns from the DataFrame using `select_dtypes()`.

In [None]:
# Select only numeric columns from the DataFrame
numeric_df = df.select_dtypes(include=['float64', 'int64'])

numeric_df
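To see what `select_dtypes()` keeps and drops without a Snowflake connection, here is a minimal sketch using plain pandas and a tiny made-up frame. The rows and values are hypothetical; only the column names are borrowed from this notebook. It also shows `include="number"`, an alternative that matches every numeric dtype and may be worth considering if the table's column types turn out not to be `float64` or `int64`.

```python
import pandas as pd  # plain pandas here, not modin, so this runs anywhere

# Hypothetical toy rows; the values are made up for illustration
toy_df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "species": ["Grizzly", "Polar", "Grizzly"],
    "body_mass_kg": [310.5, 450.0, 298.2],
    "claw_length_cm": [9, 7, 10],
})

# Same call as above: keep only 64-bit float and integer columns
toy_numeric = toy_df.select_dtypes(include=["float64", "int64"])
print(list(toy_numeric.columns))

# Broader alternative: keep every numeric dtype
toy_numeric_all = toy_df.select_dtypes(include="number")
print(list(toy_numeric_all.columns))
```

The text columns (`id` and `species`) are dropped in both cases, which is why later charts read the species label from `df` rather than from `numeric_df`.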

# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the essential first step in any data project: an open-ended exploration that lets you become familiar with the data.

It combines summary statistics and data visualization to uncover patterns, spot anomalies, and identify relationships between variables, which helps you understand the data's key characteristics and prepare it for more complex tasks.

## Feature Distribution
Let's start by exploring the distribution of individual features and see how they are distributed across different bear species.

We'll implement this in two ways:
1. Traditional script-based approach - The feature to visualize is hard-coded
2. Interactive data app - You select the feature to visualize


In [None]:
import altair as alt

# Create feature distribution plots
print("--------------------")
print("Feature Distributions")
print("--------------------")

numeric_cols = numeric_df.columns

# Manually specify features by changing the index number
feature = numeric_cols[0]

chart = alt.Chart(df).mark_bar().encode(
    alt.X(f"{feature}:Q", bin=True),
    y='count()',
    color=alt.Color('species:N', scale=alt.Scale(scheme='category10'))
).properties(
    height=380,
    title=f"Distribution of {feature} by Class"
)
chart
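Conceptually, a binned bar chart with `count()` groups values into intervals and counts the rows that fall into each interval, split by species. The sketch below imitates that in plain Python under stated assumptions: the rows are made up, and the fixed bin width of 50 is a hypothetical choice (Altair picks its own bin edges when `bin=True`).

```python
from collections import Counter

# Hypothetical (species, body_mass_kg) rows, made up for illustration
rows = [
    ("Grizzly", 310.0), ("Grizzly", 305.0), ("Grizzly", 362.0),
    ("Polar", 455.0), ("Polar", 410.0), ("Polar", 372.0),
]

BIN_WIDTH = 50  # assumed fixed width; Altair chooses its own bin edges


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the fixed-width bin containing value."""
    return int(value // width) * width


# Count rows per (bin, species) pair: each pair is one stacked bar segment
counts = Counter((bin_start(mass), species) for species, mass in rows)

for (start, species), n in sorted(counts.items()):
    print(f"[{start}, {start + BIN_WIDTH}) {species}: {n}")
```

Bins that contain more than one species correspond to stacked, multi-colored bars in the chart, which is where the species overlap on that feature.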

In [None]:
import streamlit as st
import altair as alt

# Create feature distribution plots
st.header("Feature Distributions")

# Create form for feature selection
with st.form("distribution_form"):
    feature = st.selectbox("Select a Feature", numeric_df.columns)
    submit_dist = st.form_submit_button("Submit", type="primary")

if submit_dist:
    chart = alt.Chart(df).mark_bar().encode(
        alt.X(f"{feature}:Q", bin=True),
        y='count()',
        color=alt.Color('species:N', scale=alt.Scale(scheme='category10'))
    ).properties(
        height=380,
        title=f"Distribution of {feature} by Species"
    )
    st.altair_chart(chart, use_container_width=True)


## Feature Correlations

Next, we're going to get a bird's-eye view of how all our numeric features relate to each other at once. Doing this helps us quickly find the most interesting relationships to explore in more detail.



### Correlation matrix

First, we'll calculate a correlation matrix. This is just a table that shows the correlation coefficient (a value from -1 to +1) between every possible pair of our numeric features. A value of 1 means a perfect positive relationship, -1 means a perfect negative relationship, and 0 means no linear relationship.
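The three reference values described above can be checked by hand. The following is a minimal pure-Python sketch of the Pearson coefficient, which is the default method of pandas `corr()`; it assumes the notebook's `numeric_df.corr()` computes the same quantity. The `pearson` helper and all numbers are hypothetical and exist only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical measurements, made up for illustration
mass = [100.0, 200.0, 300.0, 400.0]
doubled = [2 * m for m in mass]      # rises in lockstep with mass
negated = [-m for m in mass]         # falls in lockstep with mass
symmetric = [1.0, -1.0, -1.0, 1.0]   # no linear trend against mass

r_pos = pearson(mass, doubled)
r_neg = pearson(mass, negated)
r_zero = pearson(mass, symmetric)
print(round(r_pos, 6), round(r_neg, 6), round(r_zero, 6))
```

Note that a coefficient near 0 only rules out a linear relationship; the `symmetric` values still follow a clear U-shaped pattern against `mass`.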

In [None]:
# Correlation heatmap
print("--------------------")
print("Feature Correlations")
print("--------------------")

color_option = ['blueorange', 'spectral', 'viridis']
# More color schemes at https://vega.github.io/vega/docs/schemes/

corr_matrix = numeric_df.corr()

corr_data = (
    corr_matrix
    .stack()
    .reset_index(name='value')
    .rename(columns={'level_0': 'index', 'level_1': 'variable'})
)

corr_chart = alt.Chart(corr_data).mark_rect().encode(
    x=alt.X('index:N', sort=None),
    y=alt.Y('variable:N', sort=None),
    color=alt.Color('value:Q', scale=alt.Scale(scheme=color_option[0])),
    tooltip=[alt.Tooltip('index:N'), alt.Tooltip('variable:N'), alt.Tooltip('value:Q')]
).properties(
    width=400,
    title="Correlation Heatmap"
)
corr_chart
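The `.stack().reset_index(...)` chain in the cell above converts the square correlation matrix into long form, with one row per feature pair, which is the shape `mark_rect()` needs to draw one rectangle per pair. Here is a small sketch of that reshape using plain pandas; the values are made up, and only the column names are borrowed from this notebook.

```python
import pandas as pd  # plain pandas so the sketch runs without Snowflake

# Hypothetical toy values, made up for illustration
toy = pd.DataFrame({
    "body_mass_kg": [300.0, 420.0, 350.0, 500.0],
    "claw_length_cm": [8.0, 10.5, 9.0, 12.0],
})

wide = toy.corr()  # 2 x 2 matrix: features on both rows and columns

long_df = (
    wide
    .stack()                    # Series keyed by (row feature, column feature)
    .reset_index(name="value")  # unnamed index levels become level_0, level_1
    .rename(columns={"level_0": "index", "level_1": "variable"})
)
print(long_df)
```

A matrix with `n` features becomes `n * n` rows, so the two toy features above yield four rows, including the diagonal pairs where a feature is compared with itself.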

In [None]:
import streamlit as st
import altair as alt

# Create correlation heatmap
st.header("Feature Correlations")

# Add color scheme selection
color_option = st.selectbox("Select Color Scheme", ['blueorange', 'spectral', 'viridis'])

with st.spinner("Generating correlation heatmap..."):
    # Calculate correlation matrix
    corr_matrix = numeric_df.corr()

    # Reshape correlation matrix for visualization 
    corr_data = (
        corr_matrix
        .stack()
        .reset_index(name='value')
        .rename(columns={'level_0': 'index', 'level_1': 'variable'})
    )

    # Create correlation heatmap
    corr_chart = alt.Chart(corr_data).mark_rect().encode(
        x=alt.X('index:N', sort=None),
        y=alt.Y('variable:N', sort=None),
        color=alt.Color('value:Q', scale=alt.Scale(scheme=color_option)),
        tooltip=[alt.Tooltip('index:N'), alt.Tooltip('variable:N'), alt.Tooltip('value:Q')]
    ).properties(
        height=380,
        title="Correlation Heatmap"
    )
    st.altair_chart(corr_chart, use_container_width=True)


## Feature Relationships

After examining individual feature distributions and correlations between features, let's explore specific relationships between pairs of features in more detail. This visualization allows us to:

- See how different numeric features relate to each other
- Identify potential patterns or clusters by bear species
- Spot any outliers or unusual relationships in the data

The scatter plot below shows the relationship between two selected numeric features, with points colored by bear species. This helps us understand how different bear species may cluster or separate based on their physical characteristics.

In [None]:
# Scatter plot for feature relationships
print("--------------------")
print("Feature Relationships")
print("--------------------")

# Manually choose the features by changing the index values
x_axis = numeric_cols[0]
y_axis = numeric_cols[1]

scatter = alt.Chart(df).mark_circle().encode(
    x=f'{x_axis}:Q',
    y=f'{y_axis}:Q',
    color='species:N',
    tooltip=[f'{x_axis}:Q', f'{y_axis}:Q', 'species:N']
).properties(
    width=380,
    height=380,
    title=f"{x_axis} vs {y_axis} by Class"
)

scatter

In [None]:
# Generated by Snowflake Copilot
import streamlit as st
import altair as alt

# Create scatter plot for feature relationships
st.header("Feature Relationships")

# Create columns for feature selection
col1, col2 = st.columns(2)
x_axis = col1.selectbox("Select X-axis feature", numeric_df.columns, key="x_feature")
y_axis = col2.selectbox("Select Y-axis feature", numeric_df.columns, key="y_feature") 

with st.spinner("Generating scatter plot..."):
    scatter = alt.Chart(df).mark_circle().encode(
        x=f'{x_axis}:Q',
        y=f'{y_axis}:Q',
        color='species:N',
        tooltip=[f'{x_axis}:Q', f'{y_axis}:Q', 'species:N']
    ).properties(
        height=380,
        title=f"{x_axis} vs {y_axis} by Species"
    )
    st.altair_chart(scatter, use_container_width=True)


## Species Class Distribution

Let's examine the distribution of bear species in our dataset. This visualization will show us:

- The number of samples for each bear species
- Whether our dataset is balanced or imbalanced across different species
- The relative proportions of each species in our dataset

This information is crucial for understanding our data composition and potential biases that could affect our analysis.


In [None]:
# Class distribution
print("--------------------")
print("Species Class Distribution")
print("--------------------")

class_dist = alt.Chart(df).mark_bar().encode(
    x='species:N',
    y='count()',
    color='species:N'
).properties(
    width=400,
    height=400,
    title="Distribution of Bear Classes"
)
class_dist
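Alongside the bar chart, the same three questions (counts, balance, proportions) can be answered numerically. This is a minimal sketch in plain Python, assuming a made-up list of species labels; the `imbalance_ratio` measure is one simple, illustrative way to summarize balance and is not something the notebook itself computes.

```python
from collections import Counter

# Hypothetical species labels, made up for illustration
labels = ["Grizzly"] * 6 + ["Polar"] * 3 + ["Black"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of the dataset held by each class
proportions = {species: n / total for species, n in counts.items()}

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = max(counts.values()) / min(counts.values())

for species, share in proportions.items():
    print(f"{species}: {counts[species]} samples ({share:.0%})")
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

With the notebook's real data, a comparable summary should be available from `df['species'].value_counts()`.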

In [None]:
import streamlit as st
import altair as alt


# Create categorical distribution plot
st.header("Categorical Feature Distribution")

# Get categorical columns (excluding ID)
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
categorical_cols = [col for col in categorical_cols if col != 'id']

# Add visualization options
col1, col2 = st.columns(2)
selected_feature = col1.selectbox("Select Feature", categorical_cols)
chart_type = col2.selectbox("Chart Type", ["Bar", "Pie"])

with st.spinner("Generating distribution plot..."):
    if chart_type == "Bar":
        # Create bar chart
        chart = alt.Chart(df).mark_bar().encode(
            x=f'{selected_feature}:N',
            y='count()',
            color=f'{selected_feature}:N',
            tooltip=[f'{selected_feature}:N', alt.Tooltip('count()', title='Count')]
        ).properties(
            height=380,
            title=f"Distribution of {selected_feature}"
        )
    else:
        # Create pie chart
        chart = alt.Chart(df).mark_arc().encode(
            theta='count()',
            color=f'{selected_feature}:N',
            tooltip=[f'{selected_feature}:N', alt.Tooltip('count()', title='Count')]
        ).properties(
            height=380,
            title=f"Distribution of {selected_feature}"
        )
        
    st.altair_chart(chart, use_container_width=True)


## Basic script

Now, we'll combine everything into a single code cell, which could also be saved as a standalone Python file to create a script.

Typically, we display text in Python with the `print()` function, which does the job but lacks visual appeal.

A Jupyter notebook also lets us use Markdown to style text.

As you'll soon see, we can use Streamlit widgets inside this notebook to add interactivity.

In [None]:
print("============================================")
print("Dataset Exploratory Data Analysis")
print("============================================")

# Display basic dataset information
print("--- Dataset Overview ---")
print(f"Total Samples: {len(df)}")
print(f"Number of Features: {len(df.columns) - 1}")
print(f"Number of Species Classes: {len(df['species'].unique())}")

# Display summary statistics
print("--- Summary Statistics ---")
print(f"Average Body Mass: {numeric_df['body_mass_kg'].mean():.2f}")
print(f"Average Shoulder Hump Height: {numeric_df['shoulder_hump_height_cm'].mean():.2f}")
print(f"Average Claw Length: {numeric_df['claw_length_cm'].mean():.2f}")
print(f"Average Snout Length: {numeric_df['snout_length_cm'].mean():.2f}")
print(f"Average Forearm Circumference: {numeric_df['forearm_circumference_cm'].mean():.2f}")
print(f"Average Ear Length: {numeric_df['ear_length_cm'].mean():.2f}")

# Create feature distribution plots
print("--------------------")
print("Feature Distributions")
print("--------------------")

numeric_cols = numeric_df.columns

# Manually specify features by changing the index number
feature = numeric_cols[0]

chart = alt.Chart(df).mark_bar().encode(
    alt.X(f"{feature}:Q", bin=True),
    y='count()',
    color=alt.Color('species:N', scale=alt.Scale(scheme='category10'))
).properties(
    height=380,
    title=f"Distribution of {feature} by Class"
)
chart

# Correlation heatmap
print("--------------------")
print("Feature Correlations")
print("--------------------")

color_option = ['blueorange', 'spectral', 'viridis']
# More color schemes at https://vega.github.io/vega/docs/schemes/

corr_matrix = numeric_df.corr()

corr_data = (
    corr_matrix
    .stack()
    .reset_index(name='value')
    .rename(columns={'level_0': 'index', 'level_1': 'variable'})
)

corr_chart = alt.Chart(corr_data).mark_rect().encode(
    x=alt.X('index:N', sort=None),
    y=alt.Y('variable:N', sort=None),
    color=alt.Color('value:Q', scale=alt.Scale(scheme=color_option[0])),
    tooltip=[alt.Tooltip('index:N'), alt.Tooltip('variable:N'), alt.Tooltip('value:Q')]
).properties(
    width=400,
    title="Correlation Heatmap"
)
corr_chart

# Scatter plot for feature relationships
print("--------------------")
print("Feature Relationships")
print("--------------------")

# Manually choose the features by changing the index values
x_axis = numeric_cols[0]
y_axis = numeric_cols[1]

scatter = alt.Chart(df).mark_circle().encode(
    x=f'{x_axis}:Q',
    y=f'{y_axis}:Q',
    color='species:N',
    # Include a data type for each tooltip field
    tooltip=[f'{x_axis}:Q', f'{y_axis}:Q', 'species:N']
).properties(
    width=380,
    height=380,
    title=f"{x_axis} vs {y_axis} by Class"
)

scatter

# Class distribution
print("--------------------")
print("Class Distribution")
print("--------------------")

class_dist = alt.Chart(df).mark_bar().encode(
    x='species:N',
    y='count()',
    color='species:N'
).properties(
    width=380,
    height=380,
    title="Distribution of Species Classes"
)
class_dist



## Streamlit

Reimplementing the previous script with Streamlit widgets gives us an interactive data app.

In [None]:
import streamlit as st
import altair as alt
import modin.pandas as pd
import numpy as np

# Set up the page
st.title("🐻 Bear Dataset Exploratory Data Analysis")

# Display basic dataset information
st.header("Dataset Overview")


with st.container(border=True):
    data_col = st.columns(3)
    data_col[0].metric("Total Samples", len(df))
    data_col[1].metric("Number of Features", len(df.columns) - 1)
    data_col[2].metric("Number of Species", len(df['species'].unique()))

# Display summary statistics in a grid
st.header("Summary Statistics")

with st.container(border=True):
    stats_col = st.columns(3)
    stats_col[0].metric("Average Body Mass (kg)", f"{numeric_df['body_mass_kg'].mean():.2f}")
    stats_col[1].metric("Average Shoulder Height (cm)", f"{numeric_df['shoulder_hump_height_cm'].mean():.2f}")
    stats_col[2].metric("Average Claw Length (cm)", f"{numeric_df['claw_length_cm'].mean():.2f}")
    
    stats_col2 = st.columns(3)
    stats_col2[0].metric("Average Snout Length (cm)", f"{numeric_df['snout_length_cm'].mean():.2f}")
    stats_col2[1].metric("Average Forearm Circumference (cm)", f"{numeric_df['forearm_circumference_cm'].mean():.2f}")
    stats_col2[2].metric("Average Ear Length (cm)", f"{numeric_df['ear_length_cm'].mean():.2f}")

# Create feature distribution plots
feature_col_1 = st.columns(2)
feature_col_2 = st.columns(2)

with feature_col_1[0]:
    with st.container(border=True):
        st.header("Feature Distributions")
        numeric_cols = numeric_df.columns
        feature = st.selectbox("Select Feature", numeric_cols)
        
        with st.spinner("Generating feature distribution plot..."):
            chart = alt.Chart(df).mark_bar().encode(
                alt.X(f"{feature}:Q", bin=True),
                y='count()',
                color=alt.Color('species:N', scale=alt.Scale(scheme='category10'))
            ).properties(
                height=380,
                title=f"Distribution of {feature} by Species"
            )
            st.altair_chart(chart, use_container_width=True)

# Correlation heatmap
with feature_col_1[1]:
    with st.container(border=True):
        st.header("Feature Correlations")
        color_option = st.selectbox("Color Scheme:", ['blueorange', 'spectral', 'viridis'])
        
        with st.spinner("Generating correlation heatmap..."):
            corr_matrix = numeric_df.corr()
            corr_data = (
                corr_matrix
                .stack()
                .reset_index(name='value')
                .rename(columns={'level_0': 'index', 'level_1': 'variable'})
            )
    
            corr_chart = alt.Chart(corr_data).mark_rect().encode(
                x=alt.X('index:N', sort=None),
                y=alt.Y('variable:N', sort=None),
                color=alt.Color('value:Q', scale=alt.Scale(scheme=color_option)),
                tooltip=[alt.Tooltip('index:N'), alt.Tooltip('variable:N'), alt.Tooltip('value:Q')]
            ).properties(
                height=380,
                title="Correlation Heatmap"
            )
            st.altair_chart(corr_chart, use_container_width=True)

# Scatter plot for feature relationships
with feature_col_2[0]:
    with st.container(border=True):
        st.header("Feature Relationships")
        axis_col = st.columns(2)
        x_axis = axis_col[0].selectbox("Select X-axis feature", numeric_cols, key="x_axis")
        y_axis = axis_col[1].selectbox("Select Y-axis feature", numeric_cols, key="y_axis")
        
        with st.spinner("Generating scatter plot..."):
            scatter = alt.Chart(df).mark_circle().encode(
                x=f'{x_axis}:Q',
                y=f'{y_axis}:Q',
                color='species:N',
                tooltip=[f'{x_axis}:Q', f'{y_axis}:Q', 'species:N']
            ).properties(
                height=380,
                title=f"{x_axis} vs {y_axis} by Species"
            )
            st.altair_chart(scatter, use_container_width=True)

# Categorical feature distribution
with feature_col_2[1]:
    with st.container(border=True):
        st.header("Categorical Feature Distribution")
        categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
        categorical_cols = [col for col in categorical_cols if col != 'id']
        
        cat_col = st.columns(2)
        selected_feature = cat_col[0].selectbox("Select Feature", categorical_cols)
        chart_type = cat_col[1].selectbox("Chart Type", ["Bar", "Pie"])
        
        with st.spinner("Generating distribution plot..."):
            if chart_type == "Bar":
                chart = alt.Chart(df).mark_bar().encode(
                    x=f'{selected_feature}:N',
                    y='count()',
                    color=f'{selected_feature}:N',
                    tooltip=[f'{selected_feature}:N', alt.Tooltip('count()', title='Count')]
                ).properties(
                    height=380,
                    title=f"Distribution of {selected_feature}"
                )
            else:
                chart = alt.Chart(df).mark_arc().encode(
                    theta='count()',
                    color=f'{selected_feature}:N',
                    tooltip=[f'{selected_feature}:N', alt.Tooltip('count()', title='Count')]
                ).properties(
                    height=380,
                    title=f"Distribution of {selected_feature}"
                )
            st.altair_chart(chart, use_container_width=True)


# Resources
If you'd like to take a deeper dive into the tools used in this notebook:
- [Altair User Guide](https://altair-viz.github.io/user_guide/data.html)
- [pandas on Snowflake](https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake)
- [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- [YouTube Playlist on Snowflake Notebooks](https://www.youtube.com/watch?v=YB1B6vcMaGE&list=PLavJpcg8cl1Efw8x_fBKmfA2AMwjUaeBI)