Quickstart: Build an AI-Powered Data Pipeline with Modin and Snowflake Cortex
1. Import packages and connect to the active Snowflake session
2. Create the dataframe
3. Classify text using Snowflake Cortex
4. Translate text from English to German
5. Summarize the text
6. Extract specific information
7. Post-process -> Convert data types
8. Read the Snowflake table as a dataframe
9. Build a simple Streamlit app

Quickstart guide: https://quickstarts.snowflake.com/guide/process-modin-dataframe-with-cortex/#0

In [None]:
# Import Python packages
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Connecting to Snowflake
from snowflake.snowpark.context import get_active_session
session = get_active_session()

In [None]:
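# Load the source table as a Snowpark DataFrame
snow_df = session.table("SNOWFLAKE_LEARNING_DB.PUBLIC.JOB_DESCRIPTION")
# Note: to_pandas() returns a native pandas DataFrame held in local memory;
# snow_df.to_snowpark_pandas() would return a Snowpark pandas (Modin) DataFrame instead
df = snow_df.to_pandas()
df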

Classify Text using Snowflake

In [None]:
from snowflake.cortex import ClassifyText
df["company_label_2"] = df["COMPANY"].apply(ClassifyText, categories=["Product","Consulting","Other"])
df

In [None]:
# Keep only the "label" field of each classification result
df["company_label_2"] = df["company_label_2"].apply(lambda x: x.get('label'))
df
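
The label extraction above assumes that each cell holds a dict-like object with a `label` key. If a result instead arrives as a JSON string (an assumption about some invocation paths, not something this notebook states), `.get` would fail. A minimal sketch of a more tolerant helper follows; `extract_label` is a hypothetical name and is not part of `snowflake.cortex`.

```python
import json
from typing import Any, Optional


def extract_label(result: Any) -> Optional[str]:
    """Return the "label" value from a classification result, or None."""
    # Hypothetical helper: accepts a dict or a JSON string that encodes a dict.
    if isinstance(result, str):
        try:
            result = json.loads(result)
        except json.JSONDecodeError:
            return None
    if isinstance(result, dict):
        return result.get("label")
    return None


print(extract_label({"label": "Consulting"}))  # Consulting
print(extract_label('{"label": "Product"}'))   # Product
print(extract_label(None))                     # None
```

Under those assumptions, the helper could be passed to `apply` in place of the lambda; whether that is needed depends on the shape of the values actually returned in your session.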

Translate Text -> English to German

In [None]:
from snowflake.cortex import Translate

df["JOB_ROLE_de"] = df["JOB_ROLE"].apply(Translate, from_language="en", to_language="de")
df["COMPANY_de"] = df["COMPANY"].apply(Translate, from_language="en", to_language="de")
df["JOB_DESCRIPTION_de"] = df["JOB_DESCRIPTION"].apply(Translate, from_language="en", to_language="de")
df["SOURCE_de"] = df["SOURCE"].apply(Translate, from_language="en", to_language="de")

df
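
The four translation calls above differ only in the column name, so they can be expressed as a loop. The sketch below is one possible way to do that, not the quickstart's own method. `translate_columns` and `translated_name` are hypothetical helpers; the function only assumes that `frame[column].apply(...)` forwards keyword arguments to the callable, as the calls above do.

```python
COLUMNS_TO_TRANSLATE = ["JOB_ROLE", "COMPANY", "JOB_DESCRIPTION", "SOURCE"]


def translated_name(column, to_language="de"):
    """Build the target column name, e.g. JOB_ROLE -> JOB_ROLE_de."""
    return f"{column}_{to_language}"


def translate_columns(frame, columns, translate,
                      from_language="en", to_language="de"):
    """Add one translated column per source column and return the frame.

    `translate` is any callable that accepts the text plus the keyword
    arguments from_language and to_language (hypothetical parameter;
    in this notebook it would be snowflake.cortex.Translate).
    """
    for column in columns:
        frame[translated_name(column, to_language)] = frame[column].apply(
            translate, from_language=from_language, to_language=to_language
        )
    return frame


# Possible usage in this notebook (requires an active Snowflake session):
# df = translate_columns(df, COLUMNS_TO_TRANSLATE, Translate)
```

Keeping the column list in one place makes it easier to add or drop columns, and passing the translator in as an argument lets the loop be exercised with a stand-in function before it calls Cortex.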

Analyze Text -> Summarize

In [None]:
from snowflake.cortex import Summarize

df["description_summary"] = df["JOB_DESCRIPTION"].apply(Summarize)

df

Analyze Text -> Extract Specific Information

In [None]:
import json
from snowflake.cortex import ExtractAnswer

df["company_name"] = df["COMPANY"].apply(
    lambda text: json.loads(ExtractAnswer(text, question="What company is being mentioned?"))[0]["answer"]
)


In [None]:
df
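
The extraction cell above indexes `[0]["answer"]` directly, which raises an error if the parsed response is empty or malformed. Here is a minimal sketch of a guarded version, assuming the response is a JSON array of objects with an `answer` field (the shape the cell above relies on). `first_answer` is a hypothetical helper and the sample string is made up for illustration.

```python
import json
from typing import Optional


def first_answer(raw: str) -> Optional[str]:
    """Return the "answer" field of the first candidate, or None."""
    try:
        candidates = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        # Not a string, or not valid JSON
        return None
    if isinstance(candidates, list) and candidates and isinstance(candidates[0], dict):
        return candidates[0].get("answer")
    return None


sample = '[{"answer": "Acme Corp", "score": 0.97}]'  # made-up response
print(first_answer(sample))  # Acme Corp
print(first_answer("[]"))    # None
```

Returning `None` for unusable responses keeps a single bad row from aborting the whole `apply`; the missing values can be inspected or filled afterwards.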

Post Process -> Convert company_name to type str

In [None]:
# Convert company_name to type str
df["company_name"] = df["company_name"].astype(str)


Data Operations
1. Read the Snowflake table as a dataframe

In [None]:
# The unqualified name resolves against the session's current database and schema
pd.read_snowflake("JOB_DESCRIPTION")

2. Build a simple Streamlit app

In [None]:
import streamlit as st

df = pd.read_snowflake("JOB_DESCRIPTION")

st.header("Company Category Description")

# Selectbox for choosing the category column
selected_category_column = st.selectbox(
    "Select Category Type:",
    ("COMPANY","JOB_ROLE")
)

# Count the occurrences of each category based on the selected column
category_counts = df[selected_category_column].value_counts().reset_index()
category_counts.columns = ['Category', 'Count']

st.bar_chart(category_counts, x='Category', y='Count', color='Category')
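
To make the shape of the chart input concrete, the sketch below reproduces the counting step with the standard library only. The company names are made-up stand-ins for one column of the table, and `Counter.most_common()` is used here as a plain-Python analogue of the `value_counts()` call, not as the app's actual implementation.

```python
from collections import Counter

# Made-up values standing in for one column of the table
companies = ["Acme", "Globex", "Acme", "Initech", "Acme", "Globex"]

# most_common() yields (value, count) pairs sorted by descending count
category_counts = [
    {"Category": value, "Count": count}
    for value, count in Counter(companies).most_common()
]

print(category_counts)
# [{'Category': 'Acme', 'Count': 3}, {'Category': 'Globex', 'Count': 2}, {'Category': 'Initech', 'Count': 1}]
```

Each record has one `Category` and one `Count`, which is the two-column layout the bar chart call above expects for its `x` and `y` arguments.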