# Gaining insights from Unstructured Data with Snowflake Cortex AI

Tasty Bytes is a global food truck network operating in 15 countries with a fleet of 450 trucks. They collect customer reviews of their food trucks, which come in from multiple sources and span multiple languages. 
This feedback helps them identify the areas that need improvement and increase customer loyalty and satisfaction. 

In this notebook, we analyze these collated customer reviews using Snowflake Cortex to:
  * Understand what our international customers are saying with Cortex **Translate**
  * Get a summary of what customers are saying with Cortex **Summarize**
  * Classify reviews to determine whether customers would recommend a food truck with Cortex **ClassifyText**
  * Gain specific insights with Cortex **ExtractAnswer**
  * Understand how customers are feeling with Cortex **Sentiment**

Let's see how many reviews we have.

In [None]:
SELECT COUNT(*) FROM TRUCK_REVIEWS_V;

**Import python packages**

Snowflake Notebooks include Streamlit and the third-party packages listed in the Snowflake Anaconda channel.  
Installing a package is easy: select the required packages from the list of available packages under Packages in the top right corner.  
Once installed, we can import these packages as we would in any other notebook.

In [None]:
# Import python packages
import streamlit as st
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Snowpark
from snowflake.snowpark.context import get_active_session
import snowflake.snowpark.functions as F
from snowflake.snowpark.functions import when, date_part
from snowflake.snowpark.window import Window

# Cortex Functions
import snowflake.cortex as cortex

session = get_active_session()
# Add a query tag to the session.
session.query_tag = {"origin":"sf_sit-is", 
                     "name":"voc", 
                     "version":{"major":1, "minor":0},
                     "attributes":{"is_quickstart":1, "source":"notebook", "vignette":"customer_reviews"}}


Let's preview the reviews from 2024.

In [None]:
reviews_df = session.table('TRUCK_REVIEWS_V') \
             .filter(date_part("year", F.col('DATE')) == 2024)
reviews_df.show()

In the next cell, we use **Translate**, one of the **Snowflake Cortex specialised LLM functions** available in Snowpark ML, to translate the multilingual reviews into English. This makes analysis easier for people who don't speak the language of the original review.

In [None]:
# Conditionally translate reviews that are not in English using Cortex Translate
reviews_df = reviews_df.withColumn('TRANSLATED_REVIEW',when(F.col('LANGUAGE') != F.lit("en"), \
                                                            cortex.Translate(F.col('REVIEW'), \
                                                                             F.col('LANGUAGE'), \
                                                                             "en")) \
                                   .otherwise(F.col('REVIEW')))

reviews_df.filter(F.col('LANGUAGE') != F.lit("en")) \
.select(["REVIEW","LANGUAGE","TRANSLATED_REVIEW"]).show(3)
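
The `when(...).otherwise(...)` expression above sends only non-English rows through the translator and passes English reviews through untouched. The branching can be sketched in plain Python as follows. This is a minimal illustration, not the Cortex API: `translate_if_needed` and `fake_translate` are hypothetical names, and `fake_translate` is a stand-in so the example runs without a Snowflake session.

```python
from typing import Callable, List, Tuple


def translate_if_needed(
    review: str,
    language: str,
    translate: Callable[[str, str, str], str],
) -> str:
    """Return the review unchanged if it is already English, else translate it."""
    if language != "en":
        return translate(review, language, "en")
    return review


# Hypothetical stand-in for a translation service; it records every call it receives.
calls: List[Tuple[str, str]] = []


def fake_translate(text: str, source: str, target: str) -> str:
    calls.append((source, target))
    return f"[{source}->{target}] {text}"


rows = [("Great tacos!", "en"), ("Sehr lecker", "de")]
translated = [translate_if_needed(text, lang, fake_translate) for text, lang in rows]
```

Only the German row reaches the translator here, which mirrors why the conditional form avoids calling the translation function on reviews that are already in English.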

We can quickly learn what our customers are saying with Cortex **Summarize**

In [None]:
# Step 1: Add a row number for each review within each TRUCK_BRAND_NAME
window_spec = Window.partition_by("TRUCK_BRAND_NAME").order_by("REVIEW")
ranked_reviews_df = reviews_df.with_column(
    "ROW_NUM", F.row_number().over(window_spec)
)

# Step 2: Filter to include only the first 20 rows per TRUCK_BRAND_NAME to get a general idea
filtered_reviews_df = ranked_reviews_df.filter(F.col("ROW_NUM") <= 20)

# Step 3: Aggregate reviews by TRUCK_BRAND_NAME
aggregated_reviews_df = filtered_reviews_df.group_by("TRUCK_BRAND_NAME").agg(
    F.array_agg(F.col("REVIEW")).alias("ALL_REVIEWS")
)

# Step 4: Convert the array of reviews to a single string
concatenated_reviews_df = aggregated_reviews_df.with_column(
    "ALL_REVIEWS_TEXT", F.call_function("array_to_string", F.col("ALL_REVIEWS"), F.lit(' '))
)

# Step 5: Generate summaries for each truck brand
summarized_reviews_df = concatenated_reviews_df.with_column(
    "SUMMARY", cortex.Summarize(F.col("ALL_REVIEWS_TEXT"))
)

# Step 6: Display the results
summarized_reviews_df.select(["TRUCK_BRAND_NAME", "SUMMARY"]).show(3)

one_summary_row = summarized_reviews_df.limit(1).collect()
if one_summary_row:
    brand = one_summary_row[0]['TRUCK_BRAND_NAME']
    summary = one_summary_row[0]['SUMMARY']

    # Split the summary roughly in half
    half = len(summary) // 2
    split_index = summary[:half].rfind(' ')  # Find the last space before the halfway point
    if split_index == -1:
        split_index = half  # If no space found, split at halfway

    summary_part1 = summary[:split_index].strip()
    summary_part2 = summary[split_index:].strip()

    print(f"Truck Brand: {brand}")
    print(f"Summary (part 1): {summary_part1}")
    print(f"Summary (part 2): {summary_part2}")
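
The splitting logic at the end of the cell above can also be pulled into a small reusable helper. The sketch below assumes nothing beyond plain Python strings; `split_near_middle` is a hypothetical name introduced for illustration.

```python
from typing import Tuple


def split_near_middle(text: str) -> Tuple[str, str]:
    """Split text into two parts at the last space before the halfway point."""
    half = len(text) // 2
    split_index = text[:half].rfind(" ")
    if split_index == -1:
        # No space in the first half: fall back to a hard split at the midpoint.
        split_index = half
    return text[:split_index].strip(), text[split_index:].strip()


part1, part2 = split_near_middle("Customers praise the fresh tacos and friendly staff")
no_space_parts = split_near_middle("abcdef")
```

Splitting on a space keeps words intact, so the two printed parts stay readable even when the summary is long.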

We can similarly understand if a customer would recommend the food truck based on their review using Snowflake Cortex **ClassifyText**

In [None]:
# Prompt to understand whether a customer would recommend the food truck based on their review
prompt = """
Tell me based on the following food truck customer review, will they recommend the food truck to \
their friends and family? Answer should be only one of the following words - \
"Likely" or "Unlikely" or "Unsure".
"""

# Ask cortex ClassifyText and create a new column
reviews_df = reviews_df.withColumn(
    'RECOMMEND',
    cortex.ClassifyText(F.col('REVIEW'), ["Likely", "Unlikely", "Unsure"], prompt)
).withColumn(
    'CLEAN_RECOMMEND',
    when(F.contains(F.col('RECOMMEND'), F.lit('Likely')), F.lit('Likely'))
    .when(F.contains(F.col('RECOMMEND'), F.lit('Unlikely')), F.lit('Unlikely'))
    .when(F.contains(F.col('RECOMMEND'), F.lit('Unsure')), F.lit('Unsure'))
)

reviews_df.select(["REVIEW","CLEAN_RECOMMEND"]).show(3)
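
The chained `when` calls in the cell above normalise the raw classifier output to one of three labels by substring match. A plain-Python sketch of that normalisation follows. The raw strings used here are hypothetical examples of what the classifier might return, and `clean_label` is an illustrative name, not part of Cortex.

```python
from typing import Optional

LABELS = ("Likely", "Unlikely", "Unsure")


def clean_label(raw: str) -> Optional[str]:
    """Return the first known label found in the raw output, or None if none match."""
    for label in LABELS:
        # The match is case-sensitive, so "Likely" does not match inside "Unlikely".
        if label in raw:
            return label
    return None


likely = clean_label('{"label": "Likely"}')      # hypothetical raw output
unlikely = clean_label('{"label": "Unlikely"}')  # hypothetical raw output
unknown = clean_label("no label here")
```

Because the comparison is case-sensitive, checking `Likely` first does not misclassify `Unlikely`. Rows that match none of the labels come back as `None` here, just as a `when` chain without an `otherwise` yields NULL.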

Next, we gain specific insights with Cortex **ExtractAnswer**.

In [None]:
# Step 1: Add a row number for each review within each TRUCK_BRAND_NAME
window_spec = Window.partition_by("TRUCK_BRAND_NAME").order_by("REVIEW")
ranked_reviews_df = reviews_df.with_column(
    "ROW_NUM", F.row_number().over(window_spec)
)

# Step 2: Filter to include only the first 20 rows per TRUCK_BRAND_NAME to get a general idea
filtered_reviews_df = ranked_reviews_df.filter(F.col("ROW_NUM") <= 20)

# Step 3: Aggregate reviews by TRUCK_BRAND_NAME
aggregated_reviews_df = filtered_reviews_df.group_by("TRUCK_BRAND_NAME").agg(
    F.array_agg(F.col("REVIEW")).alias("ALL_REVIEWS")
)

# Step 4: Convert the array of reviews to a single string
concatenated_reviews_df = aggregated_reviews_df.with_column(
    "ALL_REVIEWS_TEXT", F.call_function("array_to_string", F.col("ALL_REVIEWS"), F.lit(' '))
)

# Step 5: Ask ExtractAnswer for the top positively mentioned dish for each truck brand
summarized_reviews_df = concatenated_reviews_df.with_column(
    "NUMBER_ONE_DISH", cortex.ExtractAnswer(F.col("ALL_REVIEWS_TEXT"), "What is the number one dish positively mentioned in the feedback?")
)

# Step 6: Extract the first element of the array
first_element_df = summarized_reviews_df.with_column(
    "FIRST_ELEMENT", F.expr("NUMBER_ONE_DISH[0]")
)

# Step 7: Parse the first element as JSON and extract the "answer" field
readable_df = first_element_df.with_column(
    "NUMBER_ONE_DISH", F.get(F.parse_json(F.col("FIRST_ELEMENT")), F.lit("answer"))
)

# Display the simplified results
readable_df.select(["TRUCK_BRAND_NAME", "NUMBER_ONE_DISH"]).show()
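
Steps 6 and 7 above take the first element of the returned array and read its `answer` field. The same unpacking can be sketched in plain Python with the standard `json` module. The sample payload below is a hypothetical illustration of that shape (an array of objects with an `answer` key), not captured Cortex output, and `first_answer` is an illustrative name.

```python
import json
from typing import Optional


def first_answer(raw: str) -> Optional[str]:
    """Parse a JSON array of candidates and return the first candidate's answer."""
    candidates = json.loads(raw)
    if not candidates:
        return None
    return candidates[0].get("answer")


# Hypothetical payload shaped like an array of {"answer": ..., "score": ...} objects.
sample = '[{"answer": "lobster mac and cheese", "score": 0.42}]'
top_dish = first_answer(sample)
empty_result = first_answer("[]")
```

Handling the empty-array case explicitly avoids an index error when no answer is found for a brand.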


So far we have used Snowflake Cortex Translate, Summarize, ClassifyText, and ExtractAnswer. Next we look at another **task-specific LLM function in Cortex: Sentiment**. We use the Sentiment function to understand the customer's tone from the review they provided. Sentiment returns a value between -1 and 1, where -1 is the most negative and 1 is the most positive.  

In [None]:
# Understand the sentiment of each customer review using Cortex Sentiment
reviews_df = reviews_df.withColumn('SENTIMENT', cortex.Sentiment(F.col('REVIEW')))

reviews_df.select(["REVIEW","SENTIMENT"]).show(3)
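
Since the score falls between -1 and 1, it is often easier to read once bucketed into coarse categories. The sketch below shows one way to do that in plain Python. The 0.3 threshold and the function name `sentiment_bucket` are assumptions chosen for illustration, not values defined by Cortex.

```python
def sentiment_bucket(score: float, threshold: float = 0.3) -> str:
    """Map a sentiment score in [-1, 1] to 'negative', 'neutral', or 'positive'."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"score must be between -1 and 1, got {score}")
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


buckets = [sentiment_bucket(s) for s in (-0.8, 0.1, 0.65)]
```

A symmetric threshold keeps mildly worded reviews in a neutral band; the cut-off should be tuned against a sample of real reviews before relying on it.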