
# **Week 2 Assignment: Automating Customer Feedback Analysis**

---

### **Objective**

The goal of this assignment is to use advanced prompting techniques and your understanding of LLM fundamentals to analyze and extract structured information from raw customer feedback. This project will test your skills in tokenization and prompt engineering (few-shot, structured output, and chain-of-thought).

### **Background & Problem Statement**

You are an AI Engineer at a growing e-commerce company. The customer support team is manually reading through hundreds of product reviews every day to identify key issues and sentiment. This process is slow and inconsistent.

Your manager has asked you to build a prototype that uses a Large Language Model to automate this analysis. Specifically, you need to prove that you can:
1.  Classify the sentiment of a review.
2.  Extract specific, structured information (like product names and issues).

### **Dataset**

For this assignment, you will work with a small, curated list of customer reviews. This allows you to focus on the quality of your prompts rather than on data cleaning. Use the following Python list as your dataset:

```python
reviews = [
    # Review 1: Positive
    "I absolutely love the new QuantumX Pro camera! The picture quality is stellar and the battery life is amazing. Shipped super fast too. A++!",

    # Review 2: Negative with specific issue
    "The SonicWave earbuds have a serious design flaw. The left earbud stopped charging after just one week. I expected better for the price. Very disappointed.",

    # Review 3: Mixed with a question
    "The Titan smartwatch is decent. The screen is bright and the features are good, but the step counter seems inaccurate. It's off by at least 20%. Is there a way to calibrate it?",

    # Review 4: Negative with multiple issues
    "My order for the AeroDrone was a disaster. It arrived with a broken propeller and the battery was completely dead on arrival. Customer service has been unresponsive for 3 days.",

    # Review 5: Positive but mentions a minor issue
    "Overall, I'm happy with the PureGlow Air Purifier. It's quiet and effective. My only complaint is that the replacement filters are a bit expensive."
]
```

---

### **Tasks & Instructions**

Structure your code in a Jupyter Notebook or Python script. Use markdown cells or comments to explain your process and show the outputs for each task.

**Part 1: Understanding Tokenization**
*   **Objective:** To see firsthand how different models "read" the same text.
*   **Tasks:**
    1.  Import `AutoTokenizer` from the `transformers` library.
    2.  Load the tokenizer for `"gpt2"` and the tokenizer for `"bert-base-uncased"`.
    3.  Take the third review (`reviews[2]`) about the Titan smartwatch.
    4.  Tokenize this review using **both** tokenizers and print the resulting list of tokens for each.
    5.  In a markdown cell, answer the following:
        *   Are the token lists identical?
        *   Point out one or two specific differences you notice.
        *   In one sentence, explain *why* different models might have different tokenizers.

**Part 2: Advanced Prompt Engineering**
*   **Objective:** To use different prompting techniques to perform three distinct analysis tasks.
*   **Setup:** Load a basic instruction-following or text-generation LLM (e.g., `google/flan-t5-large` or `gpt2-large`) using the Hugging Face `pipeline`.
*   **Tasks (Tasks A and B use every review in the `reviews` list; Task C uses a subset):**
    1.  **Task A: Sentiment Classification (Few-Shot Prompting)**
        *   Design a **few-shot prompt** that includes two labeled example reviews and asks the model to classify a new review as "Positive", "Negative", or "Mixed".
        *   Use this prompt to classify each of the five reviews in the dataset. Print the classification for each review.
    2.  **Task B: Structured Data Extraction (Instruction & Format Prompting)**
        *   Design a prompt that instructs the model to extract the following information from each review and format the output as a JSON object: `{"product_name": "...", "issue_summary": "...", "sentiment": "..."}`.
        *   If a piece of information isn't present, the model should output "N/A".
        *   Run this prompt on all five reviews and print the resulting JSON for each.
    3.  **Task C: Root Cause Analysis (Chain-of-Thought Prompting)**
        *   For the **negative and mixed reviews only** (reviews 2, 3, 4, 5), design a **Chain-of-Thought prompt**.
        *   The prompt should ask the model to first identify the customer's core problem and then explain its reasoning step-by-step.
        *   Example Prompt Structure: `Analyze the following customer review to identify the root cause of their issue. First, state the main problem. Second, explain your reasoning in a single sentence. Let's think step by step.`
        *   Print the model's full step-by-step analysis for these reviews.

---

### **Submission Instructions**

1.  **Deadline:** You have **one week** from the assignment release date to submit your work.
2.  **Platform:** All submissions must be made to your allocated private GitLab repository. You **must** submit your work in a branch named `week_2`.
3.  **Format:** You can submit your work as either a Jupyter Notebook (`.ipynb`) or a Python script (`.py`).
4.  After pushing, verify that your branch and files are visible in the GitLab web interface; no further action is needed. The trainers will review all submissions on the `week_2` branch after the deadline. Assignments submitted after the deadline will not be reviewed, and this will be reflected in your course score.
5.  The use of LLMs is encouraged, but do not copy solutions blindly. Always review, test, and understand any generated code, and adapt it to the specific requirements of your assignment. Your submission should demonstrate your own comprehension, problem-solving process, and coding style, not the unedited output of an AI tool.

# Part 1: Understanding Tokenization

### Load the reviews

In [None]:
reviews = [
    "I absolutely love the new QuantumX Pro camera! The picture quality is stellar and the battery life is amazing. Shipped super fast too. A++!",
    "The SonicWave earbuds have a serious design flaw. The left earbud stopped charging after just one week. I expected better for the price. Very disappointed.",
    "The Titan smartwatch is decent. The screen is bright and the features are good, but the step counter seems inaccurate. It's off by at least 20%. Is there a way to calibrate it?",
    "My order for the AeroDrone was a disaster. It arrived with a broken propeller and the battery was completely dead on arrival. Customer service has been unresponsive for 3 days.",
    "Overall, I'm happy with the PureGlow Air Purifier. It's quiet and effective. My only complaint is that the replacement filters are a bit expensive."
]

### Import `AutoTokenizer` and tokenize the third review

In [None]:
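from transformers import AutoTokenizer

# Load the pretrained tokenizers (downloaded from the Hugging Face Hub on first use).
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Third review (index 2): the Titan smartwatch.
review_text = reviews[2]

# tokenize() returns subword strings rather than integer token IDs.
gpt2_tokens = gpt2_tokenizer.tokenize(review_text)
bert_tokens = bert_tokenizer.tokenize(review_text)

print("Review Text:", review_text)
print("GPT-2 Tokens:", gpt2_tokens)
print("BERT Tokens:", bert_tokens)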
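Reading two long token lists side by side is error-prone, so a small helper can summarize how they differ. The sketch below is one possible approach, not part of the assignment: `normalize` and `compare_token_lists` are hypothetical names, and the sample lists are hand-written stand-ins rather than real tokenizer output. It relies only on two documented conventions: GPT-2 prefixes a token that follows a space with `Ġ`, and BERT's WordPiece prefixes word-continuation pieces with `##`. It needs Python 3.9+ for `str.removeprefix`.

```python
from collections import Counter

GPT2_SPACE = "\u0120"  # "Ġ": GPT-2 marks a token that follows a space
BERT_CONT = "##"       # WordPiece marks a piece that continues a word


def normalize(token: str) -> str:
    """Strip tokenizer-specific markers and case so pieces can be compared."""
    return token.replace(GPT2_SPACE, "").removeprefix(BERT_CONT).lower()


def compare_token_lists(gpt2_tokens, bert_tokens):
    """Return list lengths plus the normalized pieces found in only one list."""
    gpt2_norm = Counter(normalize(t) for t in gpt2_tokens)
    bert_norm = Counter(normalize(t) for t in bert_tokens)
    return {
        "identical": gpt2_tokens == bert_tokens,
        "gpt2_count": len(gpt2_tokens),
        "bert_count": len(bert_tokens),
        "only_in_gpt2": sorted((gpt2_norm - bert_norm).keys()),
        "only_in_bert": sorted((bert_norm - gpt2_norm).keys()),
    }


# Hand-written stand-ins for illustration, NOT real tokenizer output.
sample_gpt2 = ["The", GPT2_SPACE + "Titan", GPT2_SPACE + "smart", "watch"]
sample_bert = ["the", "titan", "smart", "##watch"]

summary = compare_token_lists(sample_gpt2, sample_bert)
print(summary)
```

In the notebook, the same call would be `compare_token_lists(gpt2_tokens, bert_tokens)` using the lists produced in the cell above.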

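### Observations

*   **Are the token lists identical?** No. The two tokenizers split the review into different pieces.
*   **Differences:** GPT-2 preserves capitalization and prefixes tokens that follow a space with `Ġ`, while `bert-base-uncased` lowercases the text and marks word-continuation pieces with a `##` prefix.
*   **Why tokenizers differ:** Each model is trained with its own vocabulary and subword algorithm (byte-level BPE for GPT-2, WordPiece for BERT), so its tokenizer must reproduce the splits the model saw during training.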

# Part 2: Advanced Prompt Engineering

### Setup and load model

In [None]:
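from transformers import pipeline

# FLAN-T5 is an encoder-decoder model, so it uses the "text2text-generation" task.
# flan-t5-base is a smaller alternative to the suggested flan-t5-large.
generator = pipeline("text2text-generation", model="google/flan-t5-base")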


### Task A: Sentiment Classification (Few-Shot Prompting)

In [None]:
few_shot = """Example 1:
Review: "I absolutely love the new QuantumX Pro camera! The picture quality is stellar and the battery life is amazing. Shipped super fast too. A++!"
Label: Positive

Example 2:
Review: "The SonicWave earbuds have a serious design flaw. The left earbud stopped charging after just one week. I expected better for the price. Very disappointed."
Label: Negative

Now label the sentiment of the following review with one of: Positive, Negative, Mixed.
Review: "{review}"
Label:"""

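# Note: the two few-shot examples are reviews 1 and 2 from the dataset,
# so their predictions below are not an independent check of the prompt.
print("=== Task A: Sentiment ===")
for r in reviews:
    # A label is only a few tokens long, so cap the generated length.
    output = generator(few_shot.format(review=r), max_new_tokens=10)
    print(f"Review: {r}\nPrediction: {output[0]['generated_text'].strip()}\n")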
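Small models do not always answer with exactly one of the three labels; they may change the casing, add punctuation, or echo part of the prompt. A post-processing step can map the raw text onto the allowed set. The following is a minimal sketch under that assumption; `ALLOWED_LABELS` and `normalize_label` are hypothetical names, and the `"Unknown"` fallback is a design choice rather than something the assignment requires.

```python
ALLOWED_LABELS = ("Positive", "Negative", "Mixed")


def normalize_label(raw_output: str) -> str:
    """Map free-form model text onto one allowed label, or "Unknown"."""
    text = raw_output.strip().lower()
    # Preferred case: the output starts with a label.
    for label in ALLOWED_LABELS:
        if text.startswith(label.lower()):
            return label
    # Otherwise accept a label mentioned anywhere, but only if it is unambiguous.
    mentioned = [label for label in ALLOWED_LABELS if label.lower() in text]
    return mentioned[0] if len(mentioned) == 1 else "Unknown"


print(normalize_label(" positive\n"))       # Positive
print(normalize_label("Label: Negative."))  # Negative
print(normalize_label("It is great"))       # Unknown
```

In the loop above, this would wrap the prediction as `normalize_label(output[0]['generated_text'])`, so that unexpected outputs are flagged instead of silently printed.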


### Task B: Structured Data Extraction (Instruction & Format Prompting)

In [None]:
structured_prompt = """
Extract the following from the Review and return ONLY valid JSON (no extra text):

{{
  "product_name": "string",
  "issue_summary": "string",
  "sentiment": "Positive or Negative or Mixed"
}}

If a field is missing, write "N/A".

Review: "{review}"
JSON:
"""

# Run Task B
print("\n=== Task B: Structured Extraction ===")
for r in reviews:
    output = generator(structured_prompt.format(review=r), max_new_tokens=150)
    print(f"Review: {r}\nJSON: {output[0]['generated_text'].strip()}\n")
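Asking for "ONLY valid JSON" does not guarantee that the model returns it, so it is worth validating the text before treating it as structured data. The sketch below shows one way to do that with the standard library; `REQUIRED_KEYS` and `parse_extraction` are hypothetical names, and replacing missing or empty fields with `"N/A"` follows the rule stated in the task. The sample string is invented for illustration and is not model output.

```python
import json
import re

REQUIRED_KEYS = ("product_name", "issue_summary", "sentiment")


def parse_extraction(raw_output: str) -> dict:
    """Parse model text into a dict with exactly REQUIRED_KEYS, using "N/A" for gaps."""
    # Grab the span from the first "{" to the last "}" in case of extra text.
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    data = {}
    if match:
        try:
            loaded = json.loads(match.group(0))
            if isinstance(loaded, dict):
                data = loaded
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to an all-"N/A" record
    result = {}
    for key in REQUIRED_KEYS:
        value = data.get(key)
        result[key] = value.strip() if isinstance(value, str) and value.strip() else "N/A"
    return result


# Invented sample text for illustration, NOT real model output.
sample = 'Sure: {"product_name": " Titan smartwatch ", "sentiment": "Mixed"}'
print(parse_extraction(sample))
print(parse_extraction("no JSON here"))
```

Applied in the loop, `parse_extraction(output[0]['generated_text'])` yields a dictionary with the same three keys for every review, which keeps downstream code simple even when the model's output is malformed.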


### Task C: Root Cause Analysis (Chain-of-Thought Prompting)

In [None]:
cot_prompt = """
Analyze the following customer review to identify the root cause of their issue.
First, state the main problem.
Second, explain your reasoning in one sentence.
Let's think step by step.

Review: {text}
"""

print("=== Task C: Root Cause Analysis ===")
for r in reviews[1:]:  # reviews 2-5: the negative and mixed ones
    prompt = cot_prompt.format(text=r)
    output = generator(prompt, max_new_tokens=200)
    print(f"Review: {r}\nAnalysis: {output[0]['generated_text'].strip()}\n")
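The slice `reviews[1:]` works here because the non-positive reviews happen to sit at the end of the list. If the dataset changed, selecting reviews by their sentiment label would be more robust. The sketch below illustrates that idea under stated assumptions: `ROOT_CAUSE_LABELS` and `select_for_root_cause` are hypothetical names, and the demo reviews and labels are shortened, hand-assigned stand-ins rather than model predictions.

```python
ROOT_CAUSE_LABELS = {"Negative", "Mixed"}


def select_for_root_cause(reviews, labels):
    """Return (review_number, review) pairs whose label warrants root-cause analysis."""
    if len(reviews) != len(labels):
        raise ValueError("reviews and labels must have the same length")
    return [
        (number, review)
        for number, (review, label) in enumerate(zip(reviews, labels), start=1)
        if label in ROOT_CAUSE_LABELS
    ]


# Shortened stand-in reviews with hand-assigned labels (assumptions for illustration).
demo_reviews = ["love the camera", "earbud stopped charging", "step counter inaccurate"]
demo_labels = ["Positive", "Negative", "Mixed"]

selected = select_for_root_cause(demo_reviews, demo_labels)
print(selected)
```

The returned review numbers are 1-based so they match the numbering used in the assignment text.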
