### Solution 07: Process unstructured data with an LLM

In this scenario, the data scientist performs some data preparation tasks and then processes unstructured text with an LLM.

**NOTE:** <mark>You can use [AI Functions (Pandas or PySpark)](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas), [AI Services (OpenAI GPT-x | Cognitive Services as part of MS Fabric)](https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview), or a combination of both.</mark>

_**Notebook is prepared for AI Functions (Pandas)**_

In [8]:
# The pandas AI functions package requires OpenAI version 1.99.5 or later
%pip install -q --force-reinstall openai==1.99.5 2>/dev/null

# AI functions are preinstalled on the Fabric PySpark runtime


Note: you may need to restart the kernel to use updated packages.



In [10]:
# Required imports
import synapse.ml.aifunc as aifunc
import pandas as pd

# Optional import for progress bars. In future versions this import will be included by default
# and controlled by aifunc.default_conf.use_progress_bar and the conf parameter of the AI functions.
from tqdm.auto import tqdm
tqdm.pandas()


**Goal:** Get data with reviews

**Actions:**
- Read the `NY_places_customer_reviews.csv` file

**Success Criteria:**
- DataFrame containing the lines of the file (the header line plus one line per review)

In [3]:
# Read "Files/NY_places_customer_reviews.csv" as plain text: one row per line, in a single "value" column.
# The header line is kept as an ordinary row and is split off when the data is parsed later.
df = spark.read.format("text").load("Files/NY_places_customer_reviews.csv")

df_count = df.count()
print(f"Read {df_count} lines")

display(df)

Read 266 lines



**Goal:** Prepare data: remove the first and last double quotes from all string columns

**Actions:**
- Remove the first and last double quotes from all string columns in df

**Success Criteria:**
- Dataframe without the first and last double quotes
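
To make the effect of this step concrete, here is a minimal plain-Python sketch of the same two anchored patterns applied to a single line. The sample line is made up for illustration and is not taken from the file, and the helper name `strip_outer_quotes` is hypothetical.

```python
import re


def strip_outer_quotes(line: str) -> str:
    """Remove one leading and one trailing double quote, if present."""
    line = re.sub(r'^"', '', line)   # ^ anchors the match to the start of the line
    return re.sub(r'"$', '', line)   # $ anchors the match to the end of the line


# Hypothetical sample line: the whole row is wrapped in quotes,
# and the quotes around the review text are doubled.
sample = '"1,Some Place,""Great food, friendly staff"",alice"'
print(strip_outer_quotes(sample))
# prints: 1,Some Place,""Great food, friendly staff"",alice
```

Because both patterns are anchored, quotes inside the line are left untouched and a line without outer quotes passes through unchanged.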

In [4]:
from pyspark.sql.functions import expr

# Remove the first and last double quotes from all string columns in df.
# The loop variable is named c so that it cannot shadow pyspark.sql.functions.col.
string_cols = [f.name for f in df.schema.fields if f.dataType.typeName() == "string"]
for c in string_cols:
    df = df.withColumn(c, expr(f"""regexp_replace(regexp_replace(`{c}`, '^"', ''), '"$', '')"""))

display(df)


**Goal:** Prepare data: replace double double-quotes "" with a single double-quote " in all string columns

**Actions:**
- Replace double double-quotes "" with a single double-quote " in all string columns in df

**Success Criteria:**
- Dataframe without double double-quotes
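
A small plain-Python sketch of this step, applied to a line whose outer quotes are already gone. The sample text and the helper name `undouble_quotes` are hypothetical.

```python
def undouble_quotes(line: str) -> str:
    """Replace every pair of double quotes with a single double quote."""
    return line.replace('""', '"')


# Hypothetical sample line with doubled quotes around the review text.
sample = '1,Some Place,""Great food, friendly staff"",alice'
print(undouble_quotes(sample))
# prints: 1,Some Place,"Great food, friendly staff",alice
```

A blanket replacement treats every pair of quotes the same way, so an empty quoted field such as `""` would also collapse to a single quote. It is worth checking the data for such cases before relying on this step.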

In [5]:
from pyspark.sql.functions import regexp_replace

# Replace double double-quotes "" with a single double-quote " in all string columns in df.
# The loop variable is named c so that it cannot shadow pyspark.sql.functions.col.
string_cols = [f.name for f in df.schema.fields if f.dataType.typeName() == "string"]
for c in string_cols:
    df = df.withColumn(c, regexp_replace(c, '""', '"'))

display(df)


**Goal:** Prepare data: parse comma-separated data (where review text with commas is in double quotes) into tabular model

**Actions:**
- Parse comma-separated data into tabular model

**Success Criteria:**
- Dataframe with 265 reviews/records and columns:
    - LocationID (long)
    - Place (string)
    - Review (string)
    - User (string)
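
The parsing idea can be tried out without Spark. The sketch below feeds a few made-up lines (not taken from the file) to `csv.reader`, which keeps commas inside quoted fields, and then builds one dictionary per record from the header line.

```python
import csv

# Hypothetical sample lines in the shape expected after the two cleaning steps.
lines = [
    "LocationID,Place,Review,User",
    '1,Some Place,"Great food, friendly staff",alice',
    "2,Other Place,Nice view,bob",
]

# csv.reader accepts any iterable of strings, so a plain list works.
reader = csv.reader(lines, delimiter=",", quotechar='"')
header, *records = list(reader)

# Pair each record with the header names, then convert LocationID to an integer.
rows = [dict(zip(header, record)) for record in records]
for row in rows:
    row["LocationID"] = int(row["LocationID"])

print(rows[0])
# prints: {'LocationID': 1, 'Place': 'Some Place', 'Review': 'Great food, friendly staff', 'User': 'alice'}
```

The quoted review stays in one field even though it contains a comma, which is why a simple `split(",")` would not be enough here.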


In [6]:
import csv
from pyspark.sql import Row
from pyspark.sql.functions import col

# 'df' has a single column of comma-separated lines; review text that contains commas is wrapped in double quotes.
# collect() brings every line to the driver, which is fine for a file of this size.
data = df.collect()
if not data:
    display(df)
else:
    # Convert Spark rows to a list of strings
    rows = [row[0] for row in data]
    # Use csv.reader so that commas inside quoted fields are not treated as separators
    reader = csv.reader(rows, delimiter=',', quotechar='"')

    parsed = list(reader)
    header = parsed[0]      # first line holds the column names
    records = parsed[1:]    # remaining lines are the reviews

    tabular_df = spark.createDataFrame([Row(**dict(zip(header, record))) for record in records])

    # Every parsed field is a string; cast LocationID to long as required by the success criteria
    tabular_df = tabular_df.withColumn("LocationID", col("LocationID").cast("long"))

    df_count = tabular_df.count()
    print(f"Parsed {df_count} records")

    display(tabular_df)

Parsed 265 records



**Goal:** Process unstructured data with LLM

**Actions:**
- Convert the Spark DataFrame to a pandas DataFrame
- Extract sentiment from reviews
- Classify reviews into the categories "bars", "restaurants", "accommodation", and "historic building"
- Summarize reviews
- Translate reviews and summaries into Czech
- Custom prompt: Identify and list the main subjects or aspects discussed in the review (e.g., food quality, customer service, cleanliness, pricing, atmosphere)
- Custom prompt: Extract any mentioned locations, such as city names, neighborhoods, landmarks, or specific venues.
- Write output into goldcurated.reviewLocations


**Success Criteria:**
- New table goldcurated.reviewLocations exists with 265 records and columns:
    - LocationID
    - Place
    - Review
    - User
    - AIsentiment
    - AIclassify
    - AIsummarize
    - AItranslationsReview
    - AItranslationsSummarize
    - AIresponseKeyTopics
    - AIresponseLocations
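
The custom prompts in this notebook refer to a column through a `{Review}` placeholder and pass `is_prompt_template=True`. The sketch below only illustrates the idea of filling such a placeholder once per row. It uses plain `str.format` on hypothetical rows, calls no LLM, and is not how the AI functions library is implemented; the helper name `render_prompts` is made up.

```python
# Shortened version of a prompt template with a column placeholder.
PROMPT = "Extract key topics from {Review}. Return as comma separated text"

# Hypothetical rows standing in for the pandas DataFrame.
rows = [
    {"Place": "Some Place", "Review": "Great food, friendly staff"},
    {"Place": "Other Place", "Review": "Nice view but pricey"},
]


def render_prompts(template, rows):
    """Fill the template once per row; keys the template does not use are ignored."""
    return [template.format(**row) for row in rows]


for prompt in render_prompts(PROMPT, rows):
    print(prompt)
```

Each row produces its own prompt, so the model sees one review at a time rather than the whole column.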

In [15]:
# Spark Dataframe to Pandas Dataframe
dfp = tabular_df.toPandas()


In [10]:
# SENTIMENT analysis labels each review as positive, negative, mixed, or neutral

dfp["AIsentiment"] = dfp["Review"].ai.analyze_sentiment()
display(dfp)


100%|██████████| 265/265 [00:17<00:00, 15.12it/s]



In [15]:
# CATEGORIZE input text according to custom labels

dfp["AIclassify"] = dfp['Review'].ai.classify("bars", "restaurants", "accommodation", "historic building")
display(dfp)


100%|██████████| 265/265 [00:18<00:00, 14.66it/s]



In [16]:
# SUMMARIZE input text: either values from one column or values across all columns (here, the Review column)

dfp["AIsummarize"] = dfp["Review"].ai.summarize()
display(dfp)



100%|██████████| 265/265 [00:36<00:00,  7.21it/s]



In [17]:
# TRANSLATE the reviews and their summaries into Czech

dfp["AItranslationsReview"] = dfp["Review"].ai.translate("Czech")
dfp["AItranslationsSummarize"] = dfp["AIsummarize"].ai.translate("Czech")
display(dfp)



100%|██████████| 265/265 [00:47<00:00,  5.53it/s]
100%|██████████| 265/265 [00:46<00:00,  5.71it/s]



In [18]:
# GENERATE custom text responses based on your own instructions

dfp["AIresponseKeyTopics"] = dfp.ai.generate_response(prompt="Extract Key Topics from {Review}. Identify and list the main subjects or aspects discussed in the review (e.g., food quality, customer service, cleanliness, pricing, atmosphere). Return as comma separated text", is_prompt_template=True)
dfp["AIresponseLocations"] = dfp.ai.generate_response(prompt="Identify Locations from {Review}: Extract any mentioned locations, such as city names, neighborhoods, landmarks, or specific venues. Return as comma separated text", is_prompt_template=True)
display(dfp)


100%|██████████| 265/265 [00:33<00:00,  7.95it/s]
100%|██████████| 265/265 [00:20<00:00, 12.64it/s]



In [19]:
# Convert the Pandas DataFrame to PySpark DataFrame
df_spark = spark.createDataFrame(dfp)

# Display the PySpark DataFrame
display(df_spark)


  Did not pass numpy.dtype object
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)



In [20]:
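# Write the result as a Delta table. The relative path "Tables/..." resolves against the notebook's
# default lakehouse, assumed here to be goldcurated (the next cell queries goldcurated.reviewLocations).
delta_table_path = "Tables/reviewLocations"
df_spark.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(delta_table_path)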


In [21]:
df = spark.sql("SELECT * FROM goldcurated.reviewLocations LIMIT 1000")
display(df)

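
The success criteria for the final table can be checked with a small helper. This is a minimal sketch: `missing_columns` is a hypothetical helper, and the expected names are the columns listed for goldcurated.reviewLocations, including `AIclassify`, which the classification cell creates.

```python
EXPECTED_COLUMNS = [
    "LocationID", "Place", "Review", "User",
    "AIsentiment", "AIclassify", "AIsummarize",
    "AItranslationsReview", "AItranslationsSummarize",
    "AIresponseKeyTopics", "AIresponseLocations",
]


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `actual`, in expected order."""
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


# Hypothetical column list with two columns missing.
example = [c for c in EXPECTED_COLUMNS if c not in ("AIsummarize", "User")]
print(missing_columns(example))
# prints: ['User', 'AIsummarize']
```

In the notebook, the same check could be run as `missing_columns(df.columns)`, together with a comparison of `df.count()` against the expected 265 records.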