# Process DataFrame with Modin and Snowflake Cortex

In this notebook, we'll use Snowflake Cortex to process the Avalanche product catalog data directly from a Modin DataFrame.

Here's what we're covering in this end-to-end tutorial:
1. Load the Avalanche product catalog data from an S3 bucket into a Snowflake stage
2. Read CSV data into a Modin DataFrame
3. Perform data processing using Cortex LLM functionalities: classify, translate, sentiment, summarize and extract answers
4. Perform data post-processing to tidy up the DataFrame
5. Write data to a Snowflake database table
6. Query the newly created table
7. Create a simple interactive UI with Streamlit

## Install Prerequisite Libraries

Snowflake Notebooks includes common Python libraries by default. To add more, use the **Packages** dropdown in the top right. 

Let's add these packages:
- `modin` - Enables the use of Modin
- `snowflake-ml-python` - Enables the use of Cortex LLM functions
- `snowflake-snowpark-python` - Enables the use of Snowpark

In [None]:
# Import Python packages
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Connecting to Snowflake
from snowflake.snowpark.context import get_active_session
session = get_active_session()


## Load data into Snowflake

We can load data from an S3 bucket and bring it into Snowflake.

To do this, we'll create a stage on Snowflake to house the data:

In [None]:
CREATE OR REPLACE STAGE AVALANCHE
    URL = 's3://sfquickstarts/misc/avalanche/csv/';

### List contents of a stage

Next, we'll use `ls` to list the contents of the stage, referenced as `@avalanche`. The stage lives in the database and schema that this Notebook was assigned when it was first created.

In [None]:
ls @avalanche/

### Read CSV Data

Here, we'll read `@avalanche/product-catalog.csv` with Modin's `pd.read_csv()` method (`pd` here is `modin.pandas`, imported earlier).

We should see the following 3 columns:
- `name`
- `description`
- `price`

In [None]:
df = pd.read_csv("@avalanche/product-catalog.csv")

df

## Use Cortex for Data Pre-processing

Snowflake Cortex offers AI and ML capabilities directly within Snowflake, including LLM functions for text classification, translation, sentiment analysis, summarization and answer extraction.

## Classify

We'll classify each entry of a specified column in a Modin DataFrame via the `apply()` method together with the `ClassifyText` function. In addition, we're comparing the use of the product `name` vs `description` to generate the categorical labels.

We also give Cortex a list of candidate labels to choose from (`["Apparel","Accessories"]`).

In [None]:
from snowflake.cortex import ClassifyText

df["label"] = df["name"].apply(ClassifyText, categories=["Apparel","Accessories"])
df["label2"] = df["description"].apply(ClassifyText, categories=["Apparel","Accessories"])

df

You'll notice that the generated label for each entry is a dictionary with a single key-value pair: `{"label":"Accessories"}`. We'll extract only the value by applying the `get()` method and store the results in the `category` and `category2` columns.

Finally, we'll drop the intermediate `label` and `label2` columns.

In [None]:
df["category"] = df["label"].apply(lambda x: x.get('label'))
df["category2"] = df["label2"].apply(lambda x: x.get('label'))

df.drop(["label", "label2"], axis=1, inplace=True)

df
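The extraction step above assumes every value is a dictionary. As a minimal sketch, assuming a value might instead arrive as a JSON string or be missing, a more defensive version could look like the following. `extract_label` is a hypothetical helper for illustration, not part of Cortex or Modin, and the sample inputs are made up.

```python
import json

def extract_label(value, key="label"):
    """Return value[key] from a dict or a JSON-encoded dict, else None."""
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            return None
    if isinstance(value, dict):
        return value.get(key)
    return None

# Illustrative inputs only, not real Cortex output
samples = [{"label": "Accessories"}, '{"label": "Apparel"}', None, "not json"]
labels = [extract_label(s) for s in samples]
print(labels)
```

If this fits your data, it could be passed to `apply()` in place of the lambda, for example `df["label"].apply(extract_label)`.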

## Translate

As in the previous example, we can use `apply()` with `Translate`, passing the `from_language` and `to_language` parameters to tell Cortex which languages to work with. Here we translate from English (`en`) to German (`de`).

In [None]:
from snowflake.cortex import Translate

df["name_de"] = df["name"].apply(Translate, from_language="en", to_language="de")
df["description_de"] = df["description"].apply(Translate, from_language="en", to_language="de")
df["category_de"] = df["category"].apply(Translate, from_language="en", to_language="de")
df["category2_de"] = df["category2"].apply(Translate, from_language="en", to_language="de")

df

## Sentiment

Let's also compute the sentiment of the description (as a use case example) using `apply()` with the `Sentiment` function.

In [None]:
from snowflake.cortex import Sentiment

df["sentiment_score"] = df["description"].apply(Sentiment)

df
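A raw sentiment score can be hard to read at a glance. The sketch below, assuming the score is a float roughly in the -1 to 1 range, maps each score to a coarse label. `sentiment_bucket` and the `0.3` threshold are hypothetical choices for illustration, not Cortex defaults.

```python
def sentiment_bucket(score, threshold=0.3):
    """Map a numeric sentiment score to a coarse text label."""
    if score is None:
        return "unknown"
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Made-up scores for illustration
scores = [0.82, -0.45, 0.05, None]
buckets = [sentiment_bucket(s) for s in scores]
print(buckets)
```

The threshold is a judgment call: a wider neutral band yields fewer, more confident positive and negative labels.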

## Summarize

We'll also summarize the description text using `apply()` with the `Summarize` function. 

In [None]:
from snowflake.cortex import Summarize

df["description_summary"] = df["description"].apply(Summarize)

df

## Extract Answer

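We'll also extract an answer from the product name using `apply()` with the `ExtractAnswer` function and a `question` parameter. Each result is a list of dictionaries, so we keep only the `answer` value of the first item.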

In [None]:
from snowflake.cortex import ExtractAnswer

df["product"] = df["name"].apply(ExtractAnswer, question="What product is being mentioned?")
df["product"] = [x[0]["answer"] for x in df['product']]

df
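The list comprehension above indexes `x[0]["answer"]` directly, which fails if a result is empty. A minimal sketch of a safer accessor, assuming the result shape used above (a list of dictionaries with an `answer` key), is shown below. `first_answer` is a hypothetical helper and the sample data is made up.

```python
def first_answer(result, default=""):
    """Return the 'answer' of the first item in result, or default."""
    if isinstance(result, list) and result:
        first = result[0]
        if isinstance(first, dict):
            return first.get("answer", default)
    return default

# Illustrative inputs only, not real Cortex output
samples = [[{"answer": "ski goggles"}], [], None]
answers = [first_answer(s) for s in samples]
print(answers)
```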

## Data Post-processing

Here, we'll remove the `$` symbol from the `price` column and convert the column to a numeric type.

In [None]:
# For the price column, remove $ symbol and convert to numeric
df["price"] = df["price"].str.replace("$", "", regex=False)
df["price"] = pd.to_numeric(df["price"])

In [None]:
df
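The cleanup above handles a leading `$` only. As a sketch, assuming prices may also contain thousands separators, stray whitespace or non-numeric placeholders, a per-value parser could look like this. `parse_price` is a hypothetical helper and the sample strings are invented.

```python
def parse_price(text):
    """Strip '$', commas and whitespace; return a float, or None if unparseable."""
    if text is None:
        return None
    cleaned = str(text).replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

# Made-up price strings for illustration
samples = ["$1,299.00", " $45 ", "N/A", None]
prices = [parse_price(s) for s in samples]
print(prices)
```

Returning `None` for unparseable values keeps the bad rows visible instead of raising midway through the column.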

The remaining columns have the `object` data type, so we'll convert every column except `price` and `sentiment_score` to the `str` data type.

In [None]:
# Convert all other columns to the string type
for col_name in df.columns:
    if col_name not in ("price", "sentiment_score"):
        df[col_name] = df[col_name].astype(str)

In [None]:
df

## Write Data to Snowflake

Writing data to Snowflake can be done from a Modin DataFrame using the `to_snowflake()` method:

In [None]:
df.to_snowflake("avalanche_products", if_exists="replace", index=False)

## Read Data from a Snowflake Table

### Read Data using SQL
We'll now query the data using SQL. Replace `CHANINN_DEMO_DATA.PUBLIC` with the database and schema your Notebook is running in:

In [None]:
SELECT * FROM CHANINN_DEMO_DATA.PUBLIC.AVALANCHE_PRODUCTS

### Read Data using Python

We'll also read data using Python:

In [None]:
pd.read_snowflake("avalanche_products")

## Streamlit Example

In [None]:
import streamlit as st

df = pd.read_snowflake("avalanche_products")

#df = sql_read_data.to_pandas()

#df['sentiment_score'] = pd.to_numeric(df['sentiment_score'])

st.header("Product Category Distribution")

# Selectbox for choosing the category column
selected_category_column = st.selectbox(
    "Select Category Type:",
    ("category", "category2")
)

# Count the occurrences of each category based on the selected column
category_counts = df[selected_category_column].value_counts().reset_index()
category_counts.columns = ['Category', 'Count']

st.bar_chart(category_counts, x='Category', y='Count', color='Category')


st.header("Product Sentiment Analysis")

# Calculate metrics
st.write("Overall Sentiment Scores:")

cols = st.columns(4)

with cols[0]:
    st.metric("Mean Sentiment", df['sentiment_score'].mean() )
with cols[1]:
    st.metric("Min Sentiment", df['sentiment_score'].min() )
with cols[2]:
    st.metric("Max Sentiment", df['sentiment_score'].max() )
with cols[3]:
    st.metric("Standard Deviation", df['sentiment_score'].std() )

# Create a bar chart showing sentiment scores for all products
st.write("Individual Product Sentiment Scores:")
option = st.selectbox("Color bar by", ("name", "sentiment_score"))
st.bar_chart(df[['name', 'sentiment_score']], x='name', y='sentiment_score', color=option)
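The metrics above display unrounded floats, which can be long. A minimal sketch of computing and rounding the same four statistics with the standard library is shown below. `summarize_scores` is a hypothetical helper, the scores are made up, and `statistics.stdev` computes the sample standard deviation.

```python
import statistics

def summarize_scores(scores, digits=3):
    """Return rounded summary statistics keyed by display label."""
    return {
        "Mean Sentiment": round(statistics.mean(scores), digits),
        "Min Sentiment": round(min(scores), digits),
        "Max Sentiment": round(max(scores), digits),
        "Standard Deviation": round(statistics.stdev(scores), digits),
    }

# Made-up scores for illustration
scores = [0.5, -0.25, 0.75, 0.0]
summary = summarize_scores(scores)
for label, value in summary.items():
    print(f"{label}: {value}")
```

Each label and value pair could then be passed to `st.metric(label, value)` inside the matching column.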

## Resources
- [pandas on Snowflake](https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake)
- [Using Snowflake Cortex LLM functions with Snowpark pandas](https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake#using-snowflake-cortex-llm-functions-with-snowpark-pandas)
- [Snowflake Cortex AI](https://www.snowflake.com/en/product/features/cortex/)