### Solution 07: Process unstructured data with LLM 

In this scenario, the data scientist performs some data preparation tasks and then processes unstructured text with an LLM.

**NOTE:** <mark>You can use [AI Functions (Pandas or PySpark)](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas), [AI Services (OpenAI GPT-x | Cognitive Services as part of MS Fabric)](https://learn.microsoft.com/en-us/fabric/data-science/ai-services/ai-services-overview), or a combination of both.</mark>

_**Notebook is prepared for AI Functions (Pandas)**_

In [1]:
# The pandas AI functions package requires OpenAI version 1.99.5 or later; this cell pins exactly 1.99.5
%pip install -q --force-reinstall openai==1.99.5 2>/dev/null

# AI functions are preinstalled on the Fabric PySpark runtime

Note: you may need to restart the kernel to use updated packages.



In [2]:
# Required imports
import synapse.ml.aifunc as aifunc
import pandas as pd

# Optional import for progress bars. In future versions this import will be included by default,
# controlled by aifunc.default_conf.use_progress_bar and the conf parameter of AI functions.
from tqdm.auto import tqdm
tqdm.pandas()

**Goal:** Get data with reviews

**Actions:**
- Read NY_places_customer_reviews.csv file

**Success Criteria:**
- Dataframe with data    

In [3]:
# Read "Files/NY_places_customer_reviews.csv" as raw text: one row per line in a single "value" column.
# The lines are parsed into columns in a later step, so the header line is still counted here.
df = spark.read.format("text").load("Files/NY_places_customer_reviews.csv")

df_count = df.count()
print(f"Read {df_count} records")

display(df)

Read 266 records



**Goal:** Prepare data: remove the first and last double quotes from all string columns

**Actions:**
- Remove the first and last double quotes from all string columns in df

**Success Criteria:**
- Dataframe without the first and last double quotes
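As a minimal sketch of what this step does to a single line, the same two patterns (`^"` and `"$`) can be tried in plain Python with the `re` module. The helper name `strip_outer_quotes` and the sample line are hypothetical and are not taken from the notebook or the dataset.

```python
import re

def strip_outer_quotes(line: str) -> str:
    """Remove one leading and one trailing double quote, if present."""
    line = re.sub(r'^"', '', line)   # leading quote only
    return re.sub(r'"$', '', line)   # trailing quote only

# Hypothetical sample line, not taken from the dataset
sample = '"1,Some Place,""Great food, friendly staff"",alice"'
stripped = strip_outer_quotes(sample)
print(stripped)
```

Quotes inside the line are left untouched; only the first and last characters are affected.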

In [4]:
from pyspark.sql.functions import expr

# Remove the first and last double quotes from all string columns in df
string_cols = [f.name for f in df.schema.fields if f.dataType.typeName() == "string"]
for col_name in string_cols:
    # The inner call strips a leading quote, the outer call strips a trailing quote
    df = df.withColumn(col_name, expr(f"""regexp_replace(regexp_replace({col_name}, '^"', ''), '"$', '')"""))

display(df)

**Goal:** Prepare data: replace double double-quotes "" with a single double-quote " in all string columns

**Actions:**
- Replace double double-quotes "" with a single double-quote " in all string columns in df

**Success Criteria:**
- Dataframe without double double-quotes
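Doubled double quotes are how CSV escapes a quote inside a quoted field. One way lines like these can arise (an assumption about the file, not something the notebook states) is that an already valid CSV line was written out again as a single field. The sketch below reproduces that with Python's standard `csv` module on a made-up line, then undoes it the way the two preparation steps do.

```python
import csv
from io import StringIO

# Hypothetical, already valid CSV line with a quoted Review field
inner_line = '1,Some Place,"Great food, friendly staff",alice'

# Writing the whole line as ONE field makes the writer wrap it in quotes
# and double every quote inside it
buffer = StringIO()
csv.writer(buffer).writerow([inner_line])
double_encoded = buffer.getvalue().rstrip("\r\n")
print(double_encoded)

# Undo it: strip the outer quotes, then collapse each doubled quote
restored = double_encoded[1:-1].replace('""', '"')
print(restored)
```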

In [5]:
from pyspark.sql.functions import regexp_replace

# Replace double double-quotes "" with a single double-quote " in all string columns in df
string_cols = [f.name for f in df.schema.fields if f.dataType.typeName() == "string"]
for col_name in string_cols:
    df = df.withColumn(col_name, regexp_replace(col_name, '""', '"'))

display(df)

**Goal:** Prepare data: parse comma-separated data (where review text containing commas is enclosed in double quotes) into a tabular model

**Actions:**
- Parse comma-separated data into a tabular model

**Success Criteria:**
- Dataframe with 265 reviews/records and columns:
    - LocationID (long)
    - Place (string)
    - Review (string)
    - User (string)


In [6]:
import csv
from pyspark.sql import Row
from pyspark.sql.functions import col

# 'df' has a single column of comma-separated lines; Review text that contains commas is wrapped in double quotes
data = df.collect()
if not data:
    display(df)
else:
    # Convert Spark rows to a list of strings
    rows = [row[0] for row in data]
    # Use csv.reader to split each line while respecting quoted fields
    reader = csv.reader(rows, delimiter=',', quotechar='"')

    parsed = list(reader)
    header = parsed[0]   # the first line holds the column names
    records = parsed[1:]

    tabular_df = spark.createDataFrame([Row(**dict(zip(header, record))) for record in records])

    # All parsed values are strings; cast LocationID to long
    tabular_df = tabular_df.withColumn("LocationID", col("LocationID").cast("long"))

    df_count = tabular_df.count()
    print(f"Parsed {df_count} records")

    display(tabular_df)

Parsed 265 records



**Goal:** Process unstructured data with LLM

**Actions:**
- Transform Spark Dataframe to Pandas Dataframe
- Extract sentiment from reviews
- Classify reviews by "bars", "restaurants", "accommodation", "historic building" categories 
- Summarize reviews
- Translate reviews and summaries into Czech
- Custom prompt: Identify and list the main subjects or aspects discussed in the review (e.g., food quality, customer service, cleanliness, pricing, atmosphere)
- Custom prompt: Extract any mentioned locations, such as city names, neighborhoods, landmarks, or specific venues.
- Write output into goldcurated.reviewLocations


**Success Criteria:**
- New table goldcurated.reviewLocations exists with 265 records and columns:
    - LocationID
    - Place
    - Review
    - User
    - AIsentiment
    - AIclassify
    - AIsummarize
    - AItranslationsReview
    - AItranslationsSummarize
    - AIresponseKeyTopics
    - AIresponseLocations

In [7]:
# Spark Dataframe to Pandas Dataframe
dfp = tabular_df.toPandas()

In [8]:
# SENTIMENT analysis detects a sentiment label for each input text: positive, negative, mixed, or neutral

dfp["AIsentiment"] = dfp["Review"].ai.analyze_sentiment()
display(dfp)

100%|██████████| 265/265 [00:05<00:00, 52.86it/s] 



In [9]:
# CATEGORIZE input text according to custom labels

dfp["AIclassify"] = dfp["Review"].ai.classify("bars", "restaurants", "accommodation", "historic building")
display(dfp)

100%|██████████| 265/265 [00:03<00:00, 77.78it/s] 
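An LLM-backed classifier may return an empty value or a label outside the requested set; this is a general caution, not an observation from this run. A minimal sketch of a check follows, with a hypothetical `unexpected_labels` helper and made-up predictions.

```python
ALLOWED_LABELS = {"bars", "restaurants", "accommodation", "historic building"}

def unexpected_labels(predictions):
    """Return (index, value) pairs whose value is not one of the allowed labels."""
    return [
        (i, value)
        for i, value in enumerate(predictions)
        if value not in ALLOWED_LABELS
    ]

# Made-up predictions; in the notebook this could be dfp["AIclassify"].tolist()
predictions = ["bars", "restaurants", None, "museum", "accommodation"]
problems = unexpected_labels(predictions)
print(problems)
```

An empty result means every row received one of the four requested categories.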



In [10]:
# SUMMARIZE input text: either values from one column or values across all columns

dfp["AIsummarize"] = dfp["Review"].ai.summarize()
display(dfp)


100%|██████████| 265/265 [00:04<00:00, 53.86it/s] 



In [11]:
# TRANSLATE input text into Czech

dfp["AItranslationsReview"] = dfp["Review"].ai.translate("Czech")
dfp["AItranslationsSummarize"] = dfp["AIsummarize"].ai.translate("Czech")
display(dfp)


100%|██████████| 265/265 [00:07<00:00, 34.95it/s]
100%|██████████| 265/265 [00:06<00:00, 42.94it/s]



In [12]:
# GENERATE custom text responses based on your own instructions

dfp["AIresponseKeyTopics"] = dfp.ai.generate_response(prompt="Extract Key Topics from {Review}. Identify and list the main subjects or aspects discussed in the review (e.g., food quality, customer service, cleanliness, pricing, atmosphere). Return as comma separated text", is_prompt_template=True)
dfp["AIresponseLocations"] = dfp.ai.generate_response(prompt="Identify Locations from {Review}: Extract any mentioned locations, such as city names, neighborhoods, landmarks, or specific venues. Return as comma separated text", is_prompt_template=True)
display(dfp)

100%|██████████| 265/265 [00:04<00:00, 58.91it/s] 
100%|██████████| 265/265 [00:03<00:00, 83.93it/s] 
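The two custom prompts are written as templates, with the column name `{Review}` in braces. As a rough mental model only (an assumption about the behaviour, not the library's implementation), each row's value is substituted into the template before the request is made, much like Python's `str.format`. The helper `render_prompt`, the shortened template, and the sample rows below are hypothetical.

```python
# Hypothetical rows standing in for the pandas DataFrame
rows = [
    {"Place": "Some Place", "Review": "Great food, friendly staff"},
    {"Place": "Other Place", "Review": "Too loud near Times Square"},
]

# Shortened, illustrative template
template = "Identify Locations from {Review}. Return as comma separated text"

def render_prompt(template: str, row: dict) -> str:
    """Fill {column} placeholders with the values of one row."""
    return template.format(**row)

prompts = [render_prompt(template, row) for row in rows]
for prompt in prompts:
    print(prompt)
```

Columns that the template does not mention (here `Place`) are simply ignored.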



In [17]:
# Check the column types before converting back to Spark
print(dfp.dtypes)

# Convert any value that is not a plain scalar (int, float, str, bool, pd.Timestamp, None)
# to its string form, so that spark.createDataFrame can infer a schema for every column
def to_spark_safe(x):
    return x if isinstance(x, (int, float, str, bool, pd.Timestamp, type(None))) else str(x)

# Series.map is applied column by column; DataFrame.applymap is deprecated in recent pandas versions
dfp = dfp.apply(lambda s: s.map(to_spark_safe))



# Convert the Pandas DataFrame to PySpark DataFrame
df_spark = spark.createDataFrame(dfp)

# Display the PySpark DataFrame
display(df_spark)
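To illustrate the effect of the conversion on individual values, here is a minimal pure-Python sketch. The helper `stringify_non_scalar` and the sample values are hypothetical, and `datetime` stands in for `pd.Timestamp` so that the example needs no third-party packages.

```python
from datetime import datetime

# Types that are passed through unchanged
SCALAR_TYPES = (int, float, str, bool, datetime, type(None))

def stringify_non_scalar(value):
    """Keep plain scalars as they are; turn anything else into its string form."""
    return value if isinstance(value, SCALAR_TYPES) else str(value)

# Made-up values of the kind a cell might hold
values = [42, 3.5, "text", None, ["food", "service"], {"city": "New York"}]
converted = [stringify_non_scalar(v) for v in values]
print(converted)
```

Lists and dictionaries become strings, while numbers, text, and missing values are left alone.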


In [31]:
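# Save as a Delta table under Tables/ in the notebook's default lakehouse (expected to be goldcurated)
delta_table_path = "Tables/reviewLocations"
df_spark.write.format("delta").mode("overwrite").option("mergeSchema", "true").save(delta_table_path)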

In [32]:
# Read the table back to confirm the write
df = spark.sql("SELECT * FROM goldcurated.reviewLocations LIMIT 1000")
display(df)
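The success criteria for this step (265 records and the listed columns) could be checked programmatically. The following is a minimal sketch in plain Python; `check_table` is a hypothetical helper, and in the notebook it might be called as `check_table(df.columns, df.count())`.

```python
EXPECTED_COLUMNS = [
    "LocationID", "Place", "Review", "User",
    "AIsentiment", "AIsummarize",
    "AItranslationsReview", "AItranslationsSummarize",
    "AIresponseKeyTopics", "AIresponseLocations",
]
EXPECTED_RECORDS = 265

def check_table(columns, record_count):
    """Return a list of problems; an empty list means the criteria are met."""
    problems = [
        f"missing column: {name}"
        for name in EXPECTED_COLUMNS
        if name not in columns
    ]
    if record_count != EXPECTED_RECORDS:
        problems.append(f"expected {EXPECTED_RECORDS} records, found {record_count}")
    return problems

# Made-up inputs: extra columns are allowed, missing ones are reported
sample_columns = EXPECTED_COLUMNS + ["AIclassify"]
print(check_table(sample_columns, 265))
print(check_table(["LocationID", "Place"], 10))
```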
