This notebook performs LLM-based aspect and sentiment extraction, plus recommendation mining, on customer reviews.
For this project I used OpenAI's GPT-4.

### Imports & Config

In [1]:
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.84.0
    Uninstalling openai-1.84.0:
      Successfully uninstalled openai-1.84.0
Successfully installed openai-0.28.0


In [1]:
import os
import re
import json
import logging
import pandas as pd
import openai
import math

# configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


In [2]:
# Read the key from the environment instead of hard-coding it in the notebook.
openai.api_key = os.getenv("OPENAI_API_KEY")

### Load & Clean Data

In [5]:
df = pd.read_csv('AWARE_Comprehensive.csv')

In [6]:
def clean_text(text):
    text = re.sub(r'^\d+\.\s*', '', text)         # remove a leading list counter such as "2." or "3."
    text = re.sub(r'<[^>]+>', '', text)           # strip HTML-like tags
    text = re.sub(r'\s+', ' ', text).strip()      # collapse whitespace runs and trim the ends
    return text.lower()

df['sentence'] = df['sentence'].astype(str).apply(clean_text)
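
Note that `astype(str)` turns a missing value into the literal string `"nan"`, which `clean_text` then treats as a real sentence. The following is a minimal sketch of a variant that maps missing values to an empty string instead. `clean_text_safe` is a hypothetical name, not part of the notebook, and the sketch avoids pandas so it can run on its own.

```python
import math
import re


def clean_text_safe(text):
    """Hypothetical NaN-tolerant variant of clean_text."""
    # Treat None and float NaN (how pandas represents missing text) as empty.
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    text = str(text)
    text = re.sub(r'^\d+\.\s*', '', text)     # drop a leading "2. " style counter
    text = re.sub(r'<[^>]+>', '', text)       # drop HTML-like tags
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return text.lower()


print(clean_text_safe("2. The <b>app</b>   crashes  OFTEN "))
print(clean_text_safe(float("nan")) == "")
```

With a helper like this, the column could be cleaned with `df['sentence'].apply(clean_text_safe)` and empty rows filtered out before any API call is made.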

### Prompt Definitions

In [7]:
ASPECT_PROMPT = """
You are an aspect extractor for customer reviews. Given a sentence, return **only** a JSON list of concrete feature/function nouns (noun compounds) in the order they appear.

=== Rules ===
1. **Concrete nouns only**
   • Extract tangible features or functions—no raw verbs or standalone generic nouns (e.g., don’t extract “cars” as a verb).
2. **Action → noun form**
   • Convert verb phrases into noun compounds (e.g. “send a picture” → “picture sending”).
3. **Error → noun form**
   • Convert failures/errors into nominal forms (e.g. “photo vanished” → “photo disappearance”).
4. **Preserve modifiers & preps**
   • Keep full noun phrases, including adjectives and prepositional phrases (“syncing between devices”).
5. **Generic filtering**
   • Skip overly generic terms (e.g., “server”, “options”, “update”) unless paired with a descriptor (e.g., “server notifications”, “haptic options”, “volume settings”).
6. **No externals or domains**
   • Skip mentions of external contexts (URLs, platforms).
7. **Distinct & descriptive**
   • Remove duplicates/synonyms; always keep the longest, most informative term.
8. **Standardize format**
   • Use hyphens for multi-word compounds (e.g. “end-to-end encryption”, “top-of-conversation”).
9. **Order by appearance**
   • List aspects exactly in the sequence they occur.

=== Examples ===
Review: “Clicking on multiple emails and deleting them as a group is super fast.”
Aspects: ["multiple emails", "bulk deletion"]

Review: “Syncing between devices fails half the time.”
Aspects: ["syncing between devices"]

Review: “I love how clear the new dashboard layout is.”
Aspects: ["dashboard layout"]

Review: “The app opens instantly, but sometimes crashes without warning.”
Aspects: ["app opening", "crash"]

Review: “This review mentions cars but I’m talking about driving—not a feature.”
Aspects: []

Review: “Good job this app is worth the money now.”
Aspects: ["value for money"]

Review: “The haptics options within the interface just seem bland.”
Aspects: ["haptic options"]

Review: “Please fix the volume on the new update.”
Aspects: ["volume settings"]

Review: “I would like it if clicking my avatar in the slide menu would refresh my folder list and put me “back to top” of my folder list.”
Aspects: ["avatar clicking", "slide menu", "folder list refresh", "back-to-top function"]

=== Now you: ===
Review: "{sentence}"
Aspects:
"""

In [8]:
# 2) Build a single‐shot prompt to classify *all* aspects for one sentence
SENTIMENT_PROMPT = """
You are a sentiment analyzer.
Given a customer-review sentence and a list of aspect terms, classify the sentiment toward each aspect as one of:
- positive
- neutral
- negative

Return a JSON object mapping each aspect to its sentiment.

Sentence: "{sentence}"
Aspects: {aspects}
Sentiment:
"""


In [9]:
# 3) Recommendation-extraction prompt
RECO_PROMPT = """
You are a recommendation extractor.
Given a customer‐review sentence, pull out any actionable suggestions or requests for improvement, and return them as a JSON list of short verb‐noun (or noun‐phrase) statements.

=== Examples ===
Review: "it would be better to list the newest email on the top of the conversation."
Recommendations: ["list newest email on top of conversation"]

Review: "please fix the volume on the new update."
Recommendations: ["fix volume settings"]

Review: "i would like clicking my avatar in the slide menu to refresh my folder list."
Recommendations: ["make avatar click refresh folder list"]

Review: "i like the speed but the syncing between devices is so unreliable."
Recommendations: []

=== Now you: ===
Review: "{sentence}"
Recommendations:
"""


In [10]:
PROMPTS = {
    "aspect": ASPECT_PROMPT,
    "sentiment": SENTIMENT_PROMPT,
    "recommendation": RECO_PROMPT
}
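
The sentiment and recommendation templates are filled with `str.format`, so any literal brace in a template (for instance a JSON object used as an example) would have to be doubled. A small sketch with a made-up template, `DEMO_PROMPT`, which is not one of the notebook's prompts:

```python
import json

# Hypothetical template: the example JSON object uses doubled braces so that
# str.format leaves them alone and only fills {sentence} and {aspects}.
DEMO_PROMPT = (
    'Example output: {{"dark mode": "positive"}}\n'
    'Sentence: "{sentence}"\n'
    'Aspects: {aspects}\n'
    'Sentiment:'
)

filled = DEMO_PROMPT.format(
    sentence="bring in dark mode and the app will be perfect.",
    aspects=json.dumps(["dark mode"]),
)
print(filled)
```

The prompts defined above contain JSON lists only (square brackets), which is why they format cleanly as written.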


### LLM Helpers & Parsers

In [11]:
def call_llm(prompt: str, model="gpt-4", temp=0.0, max_tokens=150) -> str:
    """Call OpenAI ChatCompletion and return the raw assistant content."""
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=max_tokens
    )
    return resp.choices[0].message.content.strip()
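
`call_llm` makes a single request and lets any exception propagate, so one transient failure (a rate limit or timeout, say) can abort a whole run. Below is a minimal sketch of a generic retry wrapper with exponential backoff. `with_retries` and its parameters are hypothetical names, it catches a bare `Exception` for brevity, and the flaky function only stands in for a real API call.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out (hypothetical helper)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return '["dark mode"]'


delays = []  # record the requested delays instead of actually sleeping
result = with_retries(flaky, sleep=delays.append)
print(result, delays)
```

In the notebook this could wrap a call as `with_retries(lambda: call_llm(prompt, model=model))`; narrowing the `except` clause to the errors worth retrying would be a sensible refinement.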


In [12]:
def safe_parse_list(raw: str) -> list:
    """Parse a JSON list from raw LLM output, with fallback."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("Failed to parse list, raw output: %r", raw)
        # very simple fallback
        return re.findall(r'"([^"]+)"', raw)


def safe_parse_dict(raw: str) -> dict:
    """Parse a JSON dict from raw LLM output, with fallback."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("Failed to parse dict, raw output: %r", raw)
        # no robust fallback here
        return {}

In [13]:
def safe_parse_list_of_lists(raw: str) -> list[list[str]]:
    """
    Parse a JSON array of arrays (e.g. [["aspect1"], ["a","b"], ...]).
    Falls back to regex if needed.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("Failed to parse batch list, raw: %r", raw)
        # super‐simple fallback: grab each bracketed group
        groups = re.findall(r'\[([^\[\]]+)\]', raw)
        return [re.findall(r'"([^"]+)"', g) for g in groups]
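
Chat models sometimes wrap JSON in a Markdown code fence or add a sentence before it, in which case `json.loads` fails and the regex fallbacks above take over. As an alternative, here is a sketch that first tries to parse the span from the first `[` to the last `]`. `extract_json_list` is a hypothetical helper, not part of the notebook.

```python
import json
import re


def extract_json_list(raw):
    """Hypothetical helper: parse the JSON array embedded in raw text."""
    # Remove Markdown fence markers (three backticks, optionally tagged json).
    text = re.sub(r"`{3}(?:json)?", "", raw).strip()
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


fence = "`" * 3
raw = f'Sure:\n{fence}json\n["dark mode", "font size"]\n{fence}'
print(extract_json_list(raw))
```

This returns an empty list when the text contains no bracketed span or when the span is not valid JSON, so callers can still fall back to the regex approach.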


### Extraction Functions

1) EXTRACT ASPECTS

In [23]:
# Randomly sample 200 rows that have a term (fixed seed; sentences are not deduplicated or stratified here)
seed_df = df.dropna(subset=['term']).sample(n=200, random_state=42).reset_index(drop=True)

# Keep only the columns we need for later
seed_df = seed_df[['sentence', 'term']]


In [14]:

def extract_aspects(sentences, model="gpt-4", n_few_shot=4):
    # 1) Sample few-shot examples with non-null terms from the seed set
    few_shot = (
        seed_df[['sentence','term']]
        .dropna(subset=['term'])
        .drop_duplicates()
        .sample(n=n_few_shot, random_state=42)
        .reset_index(drop=True)
    )

    # 2) Static header: drop the unfilled "Now you" template at the end of
    #    ASPECT_PROMPT, otherwise a literal {sentence} is sent to the model
    base = PROMPTS["aspect"].split("=== Now you: ===")[0].rstrip()

    # 3) Dynamic few-shot examples (identical for every sentence, so built once)
    shot_lines = []
    for _, ex in few_shot.iterrows():
        shot_lines.append(
            f'Review: "{ex["sentence"]}"\n'
            f'Aspects: ["{ex["term"]}"]'
        )
    dynamic_section = "\n\n".join(shot_lines)

    records = []
    for s in sentences:
        # 4) Build full prompt
        prompt = (
            f"{base}\n\n"
            f"{dynamic_section}\n\n"
            "=== Now you: ===\n"
            f'Review: "{s}"\n'
            "Aspects:"
        )

        # 5) Call the LLM
        raw = call_llm(prompt, model=model)

        # 6) Parse & clean into `aspects`
        aspects = safe_parse_list(raw)
        clean, seen = [], set()
        for term in aspects:
            t = term.lower().strip()
            t = re.sub(r'^[^\w]+|[^\w]+$', '', t)
            if t and t not in seen:
                clean.append(t)
                seen.add(t)

        # 7) Record result
        records.append({
            "sentence": s,
            "aspects": clean
        })

    return pd.DataFrame(records)


In [15]:
# def extract_aspects_batch(sentences, model="gpt-4", batch_size=20):
#     """Batch N sentences into one prompt for aspect extraction."""
#     all_records = []
#     total = len(sentences)
#     steps = math.ceil(total / batch_size)

#     for i in range(steps):
#         chunk = sentences[i*batch_size:(i+1)*batch_size]
#         # 1) Enumerate reviews
#         block = "\n".join(f"{j+1}) Review: \"{s}\"" for j,s in enumerate(chunk))
#         prompt = (
#             PROMPTS["aspect"] + "\n\n"
#             "Here are multiple reviews. Extract aspects for each and return a JSON list of lists:\n\n"
#             + block +
#             "\n\nOutput:"
#         )
#         raw = call_llm(prompt, model=model, max_tokens=batch_size*60)
#         batch_aspects = safe_parse_list_of_lists(raw)
#         for s, aspects in zip(chunk, batch_aspects):
#             # reuse your cleaning logic
#             clean, seen = [], set()
#             for term in aspects:
#                 t = term.lower().strip()
#                 t = re.sub(r'^[^\w]+|[^\w]+$', '', t)
#                 if t and t not in seen:
#                     clean.append(t); seen.add(t)
#             all_records.append({"sentence": s, "aspects": clean})

#     return pd.DataFrame(all_records)
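
The commented-out batch variant pairs sentences with results via `zip`, which stops at the shorter input, so sentences are silently dropped whenever the model returns fewer lists than it was given. A minimal sketch of a guard that keeps the output the same length as the input; `align_batch` is a hypothetical helper. It cannot tell which sentence the model skipped, only pad the tail.

```python
def align_batch(chunk, parsed):
    """Hypothetical helper: return one aspect list per input sentence."""
    aligned = []
    for i, _ in enumerate(chunk):
        item = parsed[i] if i < len(parsed) else []
        # Anything that is not a list (e.g. a bare string) becomes empty.
        aligned.append(item if isinstance(item, list) else [])
    return aligned


chunk = ["s1", "s2", "s3"]
parsed = [["dark mode"], ["font size"]]  # model returned one list too few
print(align_batch(chunk, parsed))
```

Logging a warning whenever `len(parsed) != len(chunk)` would make such mismatches visible instead of quietly padding them.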


2) CLASSIFY SENTIMENT

In [16]:
def classify_sentiments(df, model="gpt-4"):
    out = []
    for _, row in df.iterrows():
        aspects = row["aspects"]
        if not aspects:
            out.append({})
            continue
        prompt = PROMPTS["sentiment"].format(
            sentence=row["sentence"],
            aspects=json.dumps(aspects)
        )
        raw = call_llm(prompt, model=model)
        out.append(safe_parse_dict(raw))
    df["sentiments"] = out
    return df
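
The parsed sentiment dict is stored as returned, so a reply with different casing, a missing aspect, or an unexpected label passes through unchanged. A sketch of a post-processing step that guarantees one valid label per requested aspect; `normalize_sentiments` is a hypothetical helper, and defaulting to `"neutral"` is an assumption rather than something the notebook specifies.

```python
ALLOWED = {"positive", "neutral", "negative"}


def normalize_sentiments(aspects, mapping, default="neutral"):
    """Hypothetical helper: one valid label per requested aspect."""
    # Lower-case keys so "Dark Mode" in the reply matches "dark mode".
    lowered = {str(k).lower().strip(): str(v).lower().strip()
               for k, v in mapping.items()}
    result = {}
    for aspect in aspects:
        label = lowered.get(aspect.lower().strip(), default)
        result[aspect] = label if label in ALLOWED else default
    return result


print(normalize_sentiments(
    ["dark mode", "font size"],
    {"Dark Mode": "Positive", "layout": "negative"},
))
```

Aspects the model invented (here `"layout"`) are dropped, and aspects it omitted receive the default label.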

In [17]:
# def classify_sentiments_batch(df, model="gpt-4", batch_size=20):
#     """Batch sentiment classification on df with columns ['sentence','aspects']."""
#     records = []
#     total = len(df)
#     steps = math.ceil(total / batch_size)

#     for i in range(steps):
#         sub = df.iloc[i*batch_size:(i+1)*batch_size]
#         # 1) enumerate sentence+aspects pairs
#         lines = []
#         for idx,row in sub.iterrows():
#             lines.append(f"{idx}) Sentence: \"{row['sentence']}\"\nAspects: {json.dumps(row['aspects'])}")
#         prompt = (
#             PROMPTS["sentiment"] + "\n\n"
#             "Here are multiple entries. Return a JSON list of dicts (one per entry):\n\n"
#             + "\n\n".join(lines) +
#             "\n\nOutput:"
#         )
#         raw = call_llm(prompt, model=model, max_tokens=batch_size*80)
#         batch_dicts = safe_parse_list(raw)
#         for idx, mapping in zip(sub.index, batch_dicts):
#             records.append((idx, mapping))

#     # reassemble into df.sentiments
#     out = {}
#     for idx,map_ in records:
#         out[idx] = map_
#     df["sentiments"] = df.index.map(lambda i: out.get(i, {}))
#     return df


3) EXTRACT RECOMMENDATION

In [18]:
def extract_recommendations(sentences, model="gpt-4"):
    recos = []
    for s in sentences:
        prompt = PROMPTS["recommendation"].format(sentence=s)
        raw = call_llm(prompt, model=model, max_tokens=80)
        items = safe_parse_list(raw)
        # tidy
        recos.append([i.strip().rstrip(".") for i in items])
    return recos

In [19]:
# def extract_recommendations_batch(sentences, model="gpt-4", batch_size=20):
#     """Batch N sentences into one prompt for recommendation extraction."""
#     all_recos = []
#     total = len(sentences)
#     steps = math.ceil(total / batch_size)

#     for i in range(steps):
#         chunk = sentences[i*batch_size:(i+1)*batch_size]
#         # Enumerate reviews
#         block = "\n".join(f"{j+1}) Review: \"{s}\"" for j, s in enumerate(chunk))
#         prompt = (
#             PROMPTS["recommendation"] + "\n\n"
#             "Here are multiple reviews. Return a JSON list of recommendation‐lists in order:\n\n"
#             + block +
#             "\n\nOutput:"
#         )

#         raw = call_llm(prompt, model=model, max_tokens=batch_size * 40)
#         # parse as list of lists
#         try:
#             batch = json.loads(raw)
#         except json.JSONDecodeError:
#             logger.warning("Failed to parse recommendations batch JSON: %r", raw)
#             batch = [[] for _ in chunk]

#         # clean each recommendation list
#         for recos in batch:
#             clean = []
#             for r in recos:
#                 t = r.strip().rstrip(".")
#                 if t:
#                     clean.append(t)
#             all_recos.append(clean)

#     return all_recos  # list-of-lists, aligned with sentences


### Pipeline

In [20]:
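def full_analysis(sentences, model="gpt-4", batch_size=20):
    """Run aspect extraction, sentiment classification, and recommendation mining.

    `batch_size` is currently unused: it only matters for the commented-out
    batch variants, which this function does not call.
    """
    # 1) aspects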
    df = extract_aspects(sentences, model=model)

    # 2) sentiments
    df = classify_sentiments(df, model=model)

    # 3) recommendations
    recos = extract_recommendations(df["sentence"].tolist(), model=model)
    df["recommendations"] = recos

    return df


### Demo

In [32]:
# Example usage:
sentences = df["sentence"].tolist()[40:50]
result_df = full_analysis(sentences, model="gpt-4")
result_df.head()

index,sentence,aspects,sentiments,recommendations
0,"first, the handful of buttons on the left side of the ipad version wastes valuable horizontal screen space.","[buttons, horizontal screen space]","{'buttons': 'negative', 'horizontal screen space': 'negative'}","[reduce button size on left side, optimize horizontal screen space]"
1,bring in dark mode and the app will be perfect.,[dark mode],{'dark mode': 'positive'},[implement dark mode]
2,"then, in addition, when i click “mobile view” the alignment of the text box shows up differently.","[mobile view, text box alignment]","{'mobile view': 'neutral', 'text box alignment': 'negative'}",[fix mobile view alignment]
3,also there is no way to tell from this screen which notes are in what notebook.,"[screen, notes, notebook]","{'screen': 'negative', 'notes': 'neutral', 'notebook': 'neutral'}",[indicate notebook for each note]
4,i think this app is definitely worth the money!,[value for money],{'value for money': 'positive'},[]


In [33]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

result_df

index,sentence,aspects,sentiments,recommendations
0,"first, the handful of buttons on the left side of the ipad version wastes valuable horizontal screen space.","[buttons, horizontal screen space]","{'buttons': 'negative', 'horizontal screen space': 'negative'}","[reduce button size on left side, optimize horizontal screen space]"
1,bring in dark mode and the app will be perfect.,[dark mode],{'dark mode': 'positive'},[implement dark mode]
2,"then, in addition, when i click “mobile view” the alignment of the text box shows up differently.","[mobile view, text box alignment]","{'mobile view': 'neutral', 'text box alignment': 'negative'}",[fix mobile view alignment]
3,also there is no way to tell from this screen which notes are in what notebook.,"[screen, notes, notebook]","{'screen': 'negative', 'notes': 'neutral', 'notebook': 'neutral'}",[indicate notebook for each note]
4,i think this app is definitely worth the money!,[value for money],{'value for money': 'positive'},[]
5,"i basically gave up on them, going back to my paper “systems” — and i use that word loosely.",[paper systems],{'paper systems': 'neutral'},[]
6,"second, it messes with the font size.",[font size],{'font size': 'negative'},[fix font size issues]
7,i don’t use it all the time and it’s a waste of money for me to pay for it a month and me not use it that month.,[waste of money],{'waste of money': 'negative'},[offer pay-per-use option]
8,i used it for everything from shopping lists to subscription information to the chores my kids are supposed to do.,"[shopping lists, subscription information, chores]","{'shopping lists': 'positive', 'subscription information': 'positive', 'chores': 'positive'}",[]
9,i do not need social media alerts clouding my more important email list.,"[social media alerts, email list]","{'social media alerts': 'negative', 'email list': 'neutral'}",[remove social media alerts from email list]
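
A quick way to summarise results like these is to count labels across all aspects. The sketch below hard-codes three of the `sentiments` dicts from the output above so that it runs without the API; in the notebook the same expression could iterate over `result_df["sentiments"]`.

```python
from collections import Counter

# Three of the `sentiments` dicts from the demo output above.
sentiments = [
    {"buttons": "negative", "horizontal screen space": "negative"},
    {"dark mode": "positive"},
    {"mobile view": "neutral", "text box alignment": "negative"},
]

# Flatten every aspect's label into one stream and count occurrences.
label_counts = Counter(
    label for mapping in sentiments for label in mapping.values()
)
print(label_counts)
```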
