# Build Your First Machine Learning Project - Part 2 | `Exploratory Data Analysis`

In this notebook, we'll explore the Bear data set by computing summary statistics and building data visualizations. We'll also use Streamlit to add interactivity to the visualizations.

### What We'll Cover:

1. **Data Loading and Preparation** - Load the bear dataset and prepare it for analysis using Snowpark (`snowflake-snowpark-python`)
2. **Basic Statistics** - Calculate and visualize summary statistics of the dataset
3. **Feature Distribution Analysis** - Explore the distribution of individual features across different bear species with `Altair` and `Streamlit`
4. **Correlation Analysis** - Investigate relationships between numeric features using correlation heatmaps with `Altair` and `Streamlit`
5. **Feature Relationships** - Visualize relationships between pairs of features using interactive scatter plots with `Altair` and `Streamlit`
6. **Categorical Analysis** - Examine the distribution of categorical features including species classification with `Altair` and `Streamlit`



# Data Operations

We'll start off by loading and preparing the bear data set.

## Load data

In Part 1, we retrieved and prepared the Bear data and wrote it to Snowflake. Here, we'll load the data by reading it from a Snowflake table.

### Load data via SQL query

Previously, we saved the data to `CHANINN_DEMO_DATA.PUBLIC.BEAR`, so we'll query it with the following SQL statement:

In [1]:
SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.BEAR

### Load data via Python statement

We can also load the same table from Python using Snowpark's `session.table()` and then convert the result to a pandas DataFrame with `to_pandas()`.

In [3]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Read data from Snowflake table using Snowpark
df_snowpark = session.table("CHANINN_DEMO_DATA.PUBLIC.BEAR")

# Convert to pandas for compatibility with visualization libraries
df = df_snowpark.to_pandas()
df

## Data Preparation

In preparation for forthcoming plots that require numeric data for visualizations, we'll select only numeric columns from the DataFrame using `select_dtypes()`.

In [6]:
# Select all numeric columns from the DataFrame
# ('number' also matches narrower dtypes such as int8 or float32, not only float64/int64)
numeric_df = df.select_dtypes(include='number')

numeric_df
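To illustrate how the dtype filter in `select_dtypes()` behaves, here is a minimal sketch on a small stand-in DataFrame. The column names and values are hypothetical and are not taken from the bear table. It compares an explicit `['float64', 'int64']` list with the broader `'number'` selector, which may matter if some columns come back from Snowflake with narrower numeric dtypes.

```python
import pandas as pd

# Hypothetical stand-in data: column names and values are invented for illustration
sample = pd.DataFrame({
    "body_mass": pd.Series([310.5, 95.2, 410.0], dtype="float32"),
    "claw_length": pd.Series([9, 5, 8], dtype="int8"),
    "body_length": [2.1, 1.4, 2.5],  # float64 by default
    "species": ["Grizzly", "Sun", "Polar"],
})

# An explicit dtype list keeps only columns whose dtype matches exactly
strict = sample.select_dtypes(include=["float64", "int64"])

# 'number' keeps every numeric dtype, whatever its width
broad = sample.select_dtypes(include="number")

print(list(strict.columns))
print(list(broad.columns))
```

Checking `df.dtypes` on the real table first is the simplest way to decide which selector fits.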

# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the essential first step in any data project: an open-ended exploration that lets you become familiar with the data.

It relies on summary statistics and data visualization to uncover patterns, spot anomalies, and identify relationships between variables. This helps you make sense of the data's key characteristics and prepare it for more complex tasks.
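As a rough illustration of the summary-statistics side of EDA, the sketch below computes overall and per-species statistics with pandas. The DataFrame is a hypothetical stand-in: apart from `species`, the column names and all values are invented, so swap in `df` and the real column names when running it in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the bear table: values and numeric column names are invented
sample = pd.DataFrame({
    "species": ["Grizzly", "Grizzly", "Sun", "Sun", "Polar", "Polar"],
    "body_mass": [300.0, 320.0, 90.0, 110.0, 400.0, 420.0],
    "body_length": [2.0, 2.2, 1.3, 1.5, 2.4, 2.6],
})

# Overall summary: count, mean, std, min, quartiles, and max for each numeric column
overall = sample.describe()

# Per-species summary, to compare the groups side by side
by_species = (
    sample.groupby("species")[["body_mass", "body_length"]]
    .agg(["mean", "min", "max"])
)

# Number of missing values in each column
missing = sample.isna().sum()

print(overall)
print(by_species)
print(missing)
```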

## Feature Distribution
Let's start by exploring how individual features are distributed across the different bear species.

We'll implement this in two ways:
1. Traditional script-based approach - The feature to visualize is hard-coded
2. Interactive data app - You can select the feature to visualize


In [7]:
import altair as alt
alt.renderers.enable("mimetype")

# Create feature distribution plots
print("--------------------")
print("Feature Distributions")
print("--------------------")

numeric_cols = numeric_df.columns

# Manually specify features by changing the index number
feature = numeric_cols[0]

chart = alt.Chart(df).mark_bar().encode(
    alt.X(f"{feature}:Q", bin=True),
    y='count()',
    color=alt.Color('species:N', scale=alt.Scale(scheme='category10'))
).properties(
    height=380,
    title=f"Distribution of {feature} by Class"
)
chart

## Feature Correlations

Next, we're going to get a bird's-eye view of how all our numeric features relate to each other at once. Doing this helps us quickly find the most interesting relationships to explore in more detail.



### Correlation matrix

First, we'll calculate a correlation matrix. This is just a table that shows the correlation coefficient (a value from -1 to +1) between every possible pair of our numeric features. A value of 1 means a perfect positive relationship, -1 means a perfect negative relationship, and 0 means no linear relationship.

In [None]:
# Correlation heatmap
print("--------------------")
print("Feature Correlations")
print("--------------------")

color_option = ['blueorange', 'spectral', 'viridis']
# More color schemes at https://vega.github.io/vega/docs/schemes/

corr_matrix = numeric_df.corr()

corr_data = (
    corr_matrix
    .stack()
    .reset_index(name='value')
    .rename(columns={'level_0': 'index', 'level_1': 'variable'})
)

corr_chart = alt.Chart(corr_data).mark_rect().encode(
    x=alt.X('index:N', sort=None),
    y=alt.Y('variable:N', sort=None),
    color=alt.Color('value:Q', scale=alt.Scale(scheme=color_option[0])),
    tooltip=[alt.Tooltip('index:N'), alt.Tooltip('variable:N'), alt.Tooltip('value:Q')]
).properties(
    width=400,
    title="Correlation Heatmap"
)
corr_chart
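To make the correlation coefficient less of a black box, the following sketch computes Pearson's r by hand and compares it with the result of `DataFrame.corr()`, which uses Pearson correlation by default. The data and the `pearson` helper are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b rises with a, c falls with a, d is only loosely related to a
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "d": [1.0, 3.0, 2.0, 4.0],
})


def pearson(x: pd.Series, y: pd.Series) -> float:
    """Pearson's r: covariance of x and y divided by the product of their spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


corr_by_pandas = sample.corr()

for other in ["b", "c", "d"]:
    by_hand = pearson(sample["a"], sample[other])
    print(f"a vs {other}: by hand = {by_hand:.3f}, pandas = {corr_by_pandas.loc['a', other]:.3f}")
```

The three pairs cover a perfect positive relationship, a perfect negative relationship, and a weaker positive one.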

## Feature Relationships

After examining individual feature distributions and correlations between features, let's explore specific relationships between pairs of features in more detail. This visualization allows us to:

- See how different numeric features relate to each other
- Identify potential patterns or clusters by bear species
- Spot any outliers or unusual relationships in the data

The scatter plot below shows the relationship between two selected numeric features, with points colored by bear species. This helps us understand how different bear species may cluster or separate based on their physical characteristics.

In [None]:
# Scatter plot for feature relationships
print("--------------------")
print("Feature Relationships")
print("--------------------")

# Manually choose the x and y features by changing the index values
x_axis = numeric_cols[0]
y_axis = numeric_cols[1]

scatter = alt.Chart(df).mark_circle().encode(
    x=f'{x_axis}:Q',
    y=f'{y_axis}:Q',
    color='species:N',
    tooltip=[f'{x_axis}:Q', f'{y_axis}:Q', 'species:N']
).properties(
    width=380,
    height=380,
    title=f"{x_axis} vs {y_axis} by Class"
)

scatter

## Species Class Distribution

Let's examine the distribution of bear species in our dataset. This visualization will show us:

- The number of samples for each bear species
- Whether our dataset is balanced or imbalanced across different species
- The relative proportions of each species in our dataset

This information is crucial for understanding our data composition and potential biases that could affect our analysis.


In [None]:
# Class distribution
print("--------------------")
print("Species Class Distribution")
print("--------------------")

class_dist = alt.Chart(df).mark_bar().encode(
    x='species:N',
    y='count()',
    color='species:N'
).properties(
    width=400,
    height=400,
    title="Distribution of Bear Classes"
)
class_dist
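The bar chart shows raw counts. To put numbers on the relative proportions and on how balanced the classes are, a small pandas sketch like the one below may help. The species labels and counts are hypothetical stand-ins; in the notebook you would use `df["species"]` instead.

```python
import pandas as pd

# Hypothetical labels standing in for df["species"]
species = pd.Series(
    ["Grizzly", "Grizzly", "Grizzly", "Polar", "Polar", "Sun"],
    name="species",
)

# Number of samples per species, largest first
counts = species.value_counts()

# Share of the dataset taken up by each species
proportions = species.value_counts(normalize=True)

# Simple imbalance indicator: largest class size divided by smallest class size
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions.round(3))
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 suggests a balanced dataset, while a large ratio points to classes that may need attention before modeling.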

# Resources
If you'd like to take a deeper dive into the tools used in this notebook:
- [Altair User Guide](https://altair-viz.github.io/user_guide/data.html)
- [pandas on Snowflake](https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake)
- [Snowpark pandas API](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/modin/index)
- [YouTube Playlist on Snowflake Notebooks](https://www.youtube.com/watch?v=YB1B6vcMaGE&list=PLavJpcg8cl1Efw8x_fBKmfA2AMwjUaeBI)