# Goal of Phase 3 - FLAN-T5 Implementation (Simplified)

## Input
A CSV file (`final.csv`) containing product information and reviews, with columns such as:

- `ProductID`, `Product Name`, `Category`, `Brand`, `Ratings`  
- `sentiment` (positive/negative)  
- `reviews.text`  

## Processing
For each product cluster:

1. Compute **product-level statistics**: average rating, positive/negative review fractions, and number of reviews.  
2. Select the **Top 3 products** per cluster based on a score combining high ratings and low negative review fractions.  
3. Identify the **product to avoid** (lowest scoring product).  
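
As a rough illustration of steps 2 and 3, the scoring rule used later in `select_products_for_cluster` multiplies the average rating by the share of non-negative reviews. The sketch below applies it to made-up numbers; the product names and values are hypothetical and only show how a high rating can be outweighed by many negative reviews.

```python
# Hypothetical per-product statistics (not taken from final.csv)
products = [
    {"name": "Product A", "avg_rating": 4.8, "neg_frac": 0.25},
    {"name": "Product B", "avg_rating": 4.5, "neg_frac": 0.05},
    {"name": "Product C", "avg_rating": 3.9, "neg_frac": 0.10},
]


def score(p: dict) -> float:
    """Score = average rating scaled by the share of non-negative reviews."""
    return p["avg_rating"] * (1 - p["neg_frac"])


# Highest score first; the last entry is the candidate "product to avoid"
ranked = sorted(products, key=score, reverse=True)
for p in ranked:
    print(f"{p['name']}: score {score(p):.3f}")

best, worst = ranked[0], ranked[-1]
```

Here Product A has the highest rating but ranks below Product B because a quarter of its reviews are negative.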

## AI Prompt Construction
Create a structured, multi-part prompt instructing the AI to:

- Recommend the Top 3 products with key differences.  
- Explain why the worst product should be avoided.  
- Base all insights **solely on the review data**.  
- Format output with category, product details, positive review summaries, and neutral statements for products to avoid.  

## Output
- Use **FLAN-T5 (`google/flan-t5-small`)** to generate **human-readable, factual buying guides** per cluster.  
- Optionally fine-tune the model with **LoRA** on generated articles to improve output consistency.  
- Save all articles in a CSV (`generated_articles_flan.csv`) with columns for cluster, cluster summary, and generated article.


We chose this model because it is instruction-tuned, lightweight, and supports text-to-text tasks such as summarization and recommendation. This makes it easy to adapt to our product review data.

M3 - `google/flan-t5-small` (very light and easy to run)

### 1. Install packages

In [21]:
!pip install pandas transformers torch
!pip install "transformers>=4.44" datasets peft accelerate sentencepiece
!pip uninstall -y pyarrow
!pip install --no-cache-dir pyarrow

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




Found existing installation: pyarrow 22.0.0
Uninstalling pyarrow-22.0.0:
  Successfully uninstalled pyarrow-22.0.0


Collecting pyarrow
  Downloading pyarrow-22.0.0-cp313-cp313-macosx_12_0_arm64.whl.metadata (3.2 kB)
Downloading pyarrow-22.0.0-cp313-cp313-macosx_12_0_arm64.whl (34.2 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-22.0.0


### 2. Loads libraries and environment

In [22]:
import os, torch, pandas as pd
from sklearn.model_selection import train_test_split
from dotenv import load_dotenv
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
from datasets import Dataset
from peft import LoraConfig, get_peft_model, PeftModel


### 3. Load data

In [23]:
data_path = "../outputs/final.csv"
df = pd.read_csv(data_path)

print(df.head())
print(df.columns)

              ProductID                                       Product Name  \
0  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
1  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
2  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
3  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
4  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   

                                            Category   Brand  Ratings  \
0  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      5.0   
1  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      5.0   
2  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      5.0   
3  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      4.0   
4  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      5.0   

           Cluster sentiment  \
0  Fire HD Tablets  positive   
1  Fire HD Tablets  positive

### 3. Sets up FLAN-T5 model

In [24]:
model_name= "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

Generation helper

In [25]:
def generate_article_flan(prompt: str,
                          max_new_tokens: int = 320,
                          temperature: float = 0.7) -> str:
    """Generate one article from a prompt using sampling (temperature + top-p)."""
    # truncation=True cuts the prompt at the tokenizer's maximum input length
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Prompt

In [26]:
base_prompt = """
You are a professional analyst who many people rely on. You will receive structured data.

Your task:
1. Use the information provided in the structured data.
2. Recommend the 3 top products for each category.
3. Follow this order: category, product, average rating, positive reviews, summary.
4. Each summary must be exactly 2 short factual sentences based on the numbers.
5. For the products to avoid, write 1 neutral sentence based on data.
6. If no summary text is available for a product, write: "No review text provided."

Here is an example of the structured data:
Category: Cameras
Product: Canon PowerShot G7X| Average rating: 4.7| Positive reviews: 92.0%| 120 reviews
Product: Sony A6000 Kit| Average rating: 4.6| Positive reviews: 89.0%| 310 reviews
Product: Nikon D3500| Average rating: 4.5| Positive reviews: 87.0%| 280 reviews
Avoid: Kodak PixPro FZ43| Average rating: 3.1| Negative reviews: 45.0%| 566 reviews

Example output:
Category: Cameras
1. Canon PowerShot G7X- 4.7 average rating, 92% positive reviews, 120 reviews.
Summary: Reviewers highlight image quality. Customers also mention good ease of use.
2. Sony A6000 Kit- 4.6 average rating, 89% positive reviews, 310 reviews. 
Summary: Many users like the fast performance. They also appreciate the sharp image results.
3. Nikon D3500- 4.5 average rating, 87% positive reviews, 280 reviews. 
Summary: Reviewers value its simple controls. It is often praised for reliable photo quality.
Product to avoid: Kodak PixPro FZ43 has a higher share of negative reviews compared to others.
    
Here is the structured data: 
 {cluster_summary}

Now kindly write the article:
"""

In [27]:
print(df.columns)


Index(['ProductID', 'Product Name', 'Category', 'Brand', 'Ratings', 'Cluster',
       'sentiment', 'reviews.text'],
      dtype='object')


### 4. Aggregates products per cluster

In [28]:
def select_products_for_cluster(cluster_df: pd.DataFrame, top_n: int = 3):
    """
    For a given cluster (many review rows), compute per-product statistics,
    then return the top `top_n` products and one "product to avoid"
    as a tuple (top_products_df, worst_product_df).

    Expects these columns in cluster_df:
    ['ProductID', 'Product Name', 'Category', 'Brand', 'Ratings', 'Cluster',
     'sentiment', 'reviews.text']
    """
    stats = (
        cluster_df
        .groupby(["ProductID", "Product Name", "Category", "Brand"], as_index=False)
        .agg(
            num_reviews=("sentiment", "size"),
            avg_rating=("Ratings", "mean"),
            # Count rows whose sentiment label is exactly "positive" / "negative"
            pos_reviews=("sentiment", lambda s: (s == "positive").sum()),
            neg_reviews=("sentiment", lambda s: (s == "negative").sum()),
        )
    )
    stats["pos_frac"]=stats["pos_reviews"]/stats["num_reviews"]
    stats["neg_frac"]=stats["neg_reviews"]/stats["num_reviews"]

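    # Only rank products with enough reviews; fall back to all products otherwise
    min_reviews = 20
    candidates = stats[stats["num_reviews"] >= min_reviews].copy()
    if candidates.empty:
        candidates = stats.copy()

    # Score: high average rating, penalized by the fraction of negative reviews
    candidates["score"] = candidates["avg_rating"] * (1 - candidates["neg_frac"])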

    top_df=(
        candidates
        .sort_values(["score", "pos_frac"], ascending=[False, False])
        .head(top_n)
    )
    worst_df=(
        candidates
        .sort_values(["score", "pos_frac"], ascending=[True, True])
        .head(1)
    )
    
    return top_df, worst_df
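
When a cluster has `top_n` or fewer candidate products, the lowest-scoring product returned as `worst_df` is also one of the top products (the Batteries cluster further down shows this). A minimal sketch of one way to avoid that overlap, using plain dictionaries instead of DataFrames; `pick_top_and_worst` is a hypothetical helper and the scores are invented:

```python
def pick_top_and_worst(candidates: list[dict], top_n: int = 3):
    """Return (top products, product to avoid or None) without overlap."""
    ranked = sorted(candidates, key=lambda p: p["score"], reverse=True)
    top = ranked[:top_n]
    rest = ranked[top_n:]
    # Only name a product to avoid if it is not already recommended
    worst = rest[-1] if rest else None
    return top, worst


# Hypothetical scores for a small and a larger cluster
small = [{"id": "a", "score": 4.4}, {"id": "b", "score": 4.1}]
large = small + [{"id": "c", "score": 3.9}, {"id": "d", "score": 2.7}]

print(pick_top_and_worst(small))
print(pick_top_and_worst(large))
```

For the two-product cluster no product to avoid is returned, so the prompt would need to handle that case, for example by omitting the "Product to avoid" block.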

     


### 5. Builds cluster summary for prompt

In [29]:
def product_block(row) -> str:
    """Turn one product row into a text snippet for the prompt."""
    # Single quotes inside the f-strings keep this valid on Python < 3.12
    return (
        f"-Product:{row['Product Name']} (Brand:{row['Brand']})\n"
        f"  -Average rating: {row['avg_rating']:.2f} from{int(row['num_reviews'])} review.\n"
        f"  -Positive reviews: {row['pos_frac']*100:.1f}%.\n"
    )

def build_cluster_summary(cluster_name:str,
                          top_df:pd.DataFrame,
                          worst_df:pd.DataFrame)-> str:
    """ Create the {cluster_summary} text that goes into the prompt."""
    parts=[]
    parts.append(f"Category:{cluster_name}\n")
    parts.append("Top products:\n")
    
    for _, row in top_df.iterrows():
        parts.append(product_block(row))
        parts.append("\n")
        
    parts.append("\nProduct to avoid:\n")
    for _, row in worst_df.iterrows():
        parts.append(product_block(row))

    return "\n".join(parts)
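
The example inside `base_prompt` uses one pipe-separated line per product, while `product_block` emits a multi-line bullet format. A small model may follow the example more closely if the live data uses the same layout. The sketch below shows one possible formatter; `product_line` is a hypothetical helper and the sample row is invented.

```python
def product_line(row: dict, avoid: bool = False) -> str:
    """Render one product in the pipe-separated layout of the prompt example."""
    label = "Avoid" if avoid else "Product"
    if avoid:
        share = f"Negative reviews: {row['neg_frac'] * 100:.1f}%"
    else:
        share = f"Positive reviews: {row['pos_frac'] * 100:.1f}%"
    return (
        f"{label}: {row['Product Name']}| "
        f"Average rating: {row['avg_rating']:.1f}| "
        f"{share}| {int(row['num_reviews'])} reviews"
    )


# Invented sample row with the same keys as the aggregated statistics
sample = {
    "Product Name": "Example Tablet",
    "avg_rating": 4.62,
    "pos_frac": 0.91,
    "neg_frac": 0.04,
    "num_reviews": 250,
}
print(product_line(sample))
print(product_line(sample, avoid=True))
```

A pandas row from `top_df.iterrows()` supports the same `row["..."]` lookups, so the helper could be dropped into `build_cluster_summary` in place of `product_block`.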


### 6. Full loop: Generate article for every cluster

- For each cluster, selects top/worst products, builds prompt, generates text with FLAN-T5, and stores it.

- Saves all articles to CSV.

In [30]:
articles = []

for cluster_value,cluster_df in df.groupby("Cluster"):

    # Human-readable label for the cluster
    cluster_name = f"Cluster {cluster_value}"
    
    # Select top and worst products in a cluster
    top_df, worst_df = select_products_for_cluster(cluster_df, top_n=3)

    # Build text summary for the cluster
    cluster_summary = build_cluster_summary(cluster_name, top_df, worst_df)

    # Build prompt
    prompt = base_prompt.format(cluster_summary=cluster_summary)
    
    print(f"\n=== Generating article for Cluster {cluster_name} ===")
    article_text = generate_article_flan(prompt)

    # Store result
    articles.append({
        "cluster_value": cluster_value,
        "cluster_name": cluster_name,
        "cluster_summary": cluster_summary,
        "article": article_text,
    })

    # Show a preview of the article
    print(article_text[:400], "\n---\n")

articles_df = pd.DataFrame(articles)
articles_df.to_csv("../deliverables/generated_articles_flan.csv", index=False)

print("Saved", len(articles_df), "articles to ../deliverables/generated_articles_flan.csv")
print("articles_df columns:",articles_df.columns)


=== Generating article for Cluster Cluster Batteries ===
Summary: AmazonBasics AAA Performance Alkaline Batteries (48 Count) is rated 4.45. 
---


=== Generating article for Cluster Cluster Classic Fire Tablets ===
Amazon Kindle Fire 16gb 7 Ips Display Tablet is the only one that has a 4.72 rating. 
---


=== Generating article for Cluster Cluster Fire HD Tablets ===
Product: Fire HD 8 Tablet, 8" HD Display, Wi-Fi, 32 GB - Includes Special Offers, Magenta 
---


=== Generating article for Cluster Cluster Kindle E-readers ===
3 of the 5 Amazon Kindle Fire E-readers are the best and most expensive. 
---


=== Generating article for Cluster Cluster Smart Home / Audio ===
Amazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum - Amazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum - Amazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum - Amazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum - Amazon Fire Hd 10 Tabl
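
The last preview above repeats one product name until the 400-character cut-off. Repetition like this is common with small seq2seq models, and `generate()` accepts arguments that usually reduce it. The settings below are an untested assumption for this dataset, not values used in the notebook:

```python
# Hypothetical decoding settings; they would be passed as
# model.generate(**inputs, **generation_kwargs)
generation_kwargs = {
    "max_new_tokens": 320,
    "num_beams": 4,             # beam search instead of sampling
    "do_sample": False,
    "no_repeat_ngram_size": 3,  # never emit the same 3-gram twice
    "repetition_penalty": 1.2,  # values > 1.0 penalize already generated tokens
    "early_stopping": True,     # stop once all beams have finished
}
print(generation_kwargs)
```

Blocking repeated 3-grams can also suppress legitimate repetition, such as a product name that has to appear twice, so the value may need tuning.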

### 7. LoRA Fine-tuning

- Fine-tunes FLAN-T5 using LoRA on the generated articles dataset.

In [31]:
model_name= "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#LoRA configuration
peft_config=LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q","v"],
    task_type="SEQ_2_SEQ_LM",
)

model= get_peft_model(base_model,peft_config)
model.print_trainable_parameters()


trainable params: 688,128 || all params: 77,649,280 || trainable%: 0.8862


In [32]:
print("article_df columns:",articles_df.columns)

base_df=articles_df.copy().rename(columns={
    "cluster_summary":"input_text",
    "article":"target_text",
    })
    
print("base_df columns:",base_df.columns)

train_df,val_df=train_test_split(base_df,test_size=0.1,random_state=42)

train_ds=Dataset.from_pandas(train_df.reset_index(drop=True))
val_ds=Dataset.from_pandas(val_df.reset_index(drop=True))


article_df columns: Index(['cluster_value', 'cluster_name', 'cluster_summary', 'article'], dtype='object')
base_df columns: Index(['cluster_value', 'cluster_name', 'input_text', 'target_text'], dtype='object')


Tokenizer and Preprocess function

In [33]:
max_input_len = 512
max_target_len = 256

def preprocess(batch):
    # Prefix every cluster summary with the task instruction
    inputs = [
        "Write a short, neutral, helpful buying guide from this structured data :\n"
        + t
        for t in batch["input_text"]
    ]
    model_inputs = tokenizer(
        inputs,
        max_length=max_input_len,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    # text_target replaces the deprecated tokenizer.as_target_tokenizer() context
    labels = tokenizer(
        text_target=batch["target_text"],
        max_length=max_target_len,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_tokenized=train_ds.map(preprocess,batched=True,remove_columns=train_ds.column_names)
val_tokenized=val_ds.map(preprocess,batched=True,remove_columns=val_ds.column_names)
data_collator=DataCollatorForSeq2Seq(tokenizer,model=model)

Map: 100%|██████████| 4/4 [00:00<00:00, 210.93 examples/s]
Map: 100%|██████████| 1/1 [00:00<00:00, 262.06 examples/s]
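
Because the labels are padded to `max_target_len`, the padding positions also contribute to the training loss unless they are masked. A common convention with Hugging Face seq2seq models is to replace padding ids in the labels with `-100`, which the loss function ignores. The sketch below shows the idea on plain lists; `mask_padding` is a hypothetical helper, and the default pad id of 0 is the T5 tokenizer's `pad_token_id`.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_padding(label_ids: list[list[int]], pad_token_id: int = 0) -> list[list[int]]:
    """Replace padding ids with IGNORE_INDEX so they do not affect the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Two made-up label sequences padded to length 6 (1 is the T5 end-of-sequence id)
labels = [[571, 19, 8, 1, 0, 0], [27, 1, 0, 0, 0, 0]]
masked = mask_padding(labels)
print(masked)
```

In `preprocess`, the same replacement would be applied to `labels["input_ids"]` before it is assigned to `model_inputs["labels"]`.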


In [34]:
training_args=TrainingArguments(
    output_dir="../outputs/flan_t5_small_lora_ SANDRA",

    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,

    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=3e-4,

    logging_steps=10,
    save_total_limit=2,

    fp16=False,
)

trainer=Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    data_collator=data_collator,
)
trainer.train()





TrainOutput(global_step=3, training_loss=44.897186279296875, metrics={'train_runtime': 2.1934, 'train_samples_per_second': 5.471, 'train_steps_per_second': 1.368, 'total_flos': 2256053207040.0, 'train_loss': 44.897186279296875, 'epoch': 3.0})
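
The `global_step=3` in this output follows from the very small dataset. A rough calculation, assuming the Trainer performs at least one optimizer update per epoch and taking the 4 training examples from the `Map` output of the preprocessing cell:

```python
import math

num_train_examples = 4           # from the "Map: ... 4/4" line
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_epochs = 3
logging_steps = 10

batches_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
# Approximation: at least one optimizer update per epoch, even when
# there are fewer batches than gradient_accumulation_steps
updates_per_epoch = max(1, batches_per_epoch // gradient_accumulation_steps)
total_updates = updates_per_epoch * num_epochs

print("batches per epoch:", batches_per_epoch)
print("optimizer updates in total:", total_updates)
print("any loss logged during training:", total_updates >= logging_steps)
```

With only three updates and `logging_steps=10`, no intermediate training loss is reported, and three updates on four examples are unlikely to change the model's behavior much.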

In [35]:
trainer.save_model("../outputs/flan_t5_small_lora_ SANDRA")
tokenizer.save_pretrained("../outputs/flan_t5_small_lora_ SANDRA")


('../outputs/flan_t5_small_lora_ SANDRA/tokenizer_config.json',
 '../outputs/flan_t5_small_lora_ SANDRA/special_tokens_map.json',
 '../outputs/flan_t5_small_lora_ SANDRA/spiece.model',
 '../outputs/flan_t5_small_lora_ SANDRA/added_tokens.json',
 '../outputs/flan_t5_small_lora_ SANDRA/tokenizer.json')

### Predictions with fine-tuned model

In [36]:
model_path="../outputs/flan_t5_small_lora_ SANDRA"

tokenizer=AutoTokenizer.from_pretrained(model_path)
model=AutoModelForSeq2SeqLM.from_pretrained(model_path)

Run a Test-Prompt

In [37]:
row=articles_df.iloc[0]
cluster_summary=row["cluster_summary"]

prompt=f"""
You are an excellent product guide. Write a short , clear, helpful and fact- based buying guide.

Write an article based on these rules below:
1. Use only the information provided in the structured data.
2. Recommend the top 3 products.
3. Show your results in the following order: category, product, average rating, positive reviews, and a short summary about the product. Mentioning why a certain product is especially good or either bad, based on customer reviews.
4. For each recommended product, write a summary of all positive reviews about the product in 2 short sentences.
5. When mentioning products to avoid use neutral language and data- based statements. 
6. Never claim general brand quality.

Here is the structured data for the next cluster: 
 {cluster_summary}

Now kindly write the article:
"""


inputs=tokenizer(prompt,return_tensors="pt")

with torch.no_grad():
    output=model.generate(
        **inputs,
        max_new_tokens=120,
        num_beams=3
    )

print(prompt[:400],"\n---Model Output---\n")
print(tokenizer.decode(output[0], skip_special_tokens=True))


You are an excellent product guide. Write a short , clear, helpful and fact- based buying guide.

Write an article based on these rules below:
1. Use only the information provided in the structured data.
2. Recommend the top 3 products.
3. Show your results in the following order: category, product, average rating, positive reviews, and a short summary about the product. Mentioning why a certain  
---Model Output---

Product:AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary (Brand:AmazonBasics) - Average rating: 4.45 from 8343 review. - Product:AmazonBasics AAA Performance Alkaline Batteries (36 Count) - Packaging May Vary (Brand:AmazonBasics) - Average rating: 4.45 from 8343 review. - Product:AmazonBasics AAA Performance Alkaline Batteries


In [38]:
print(cluster_summary)

Category:Cluster Batteries

Top products:

-Product:AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary (Brand:Amazonbasics)
  -Average rating: 4.45 from3728 review.
  -Positive reviews: 100.0%.



-Product:AmazonBasics AAA Performance Alkaline Batteries (36 Count) (Brand:Amazonbasics)
  -Average rating: 4.45 from8343 review.
  -Positive reviews: 0.0%.




Product to avoid:

-Product:AmazonBasics AAA Performance Alkaline Batteries (36 Count) (Brand:Amazonbasics)
  -Average rating: 4.45 from8343 review.
  -Positive reviews: 0.0%.



In [39]:
prod=top_df.iloc[1]
pid=prod["ProductID"]

print(prod[["Product Name","avg_rating","pos_frac","neg_frac","num_reviews"]])

rows=df[df["ProductID"]==pid]
print(rows["sentiment"].value_counts(normalize=True))
print(rows["sentiment"].value_counts())

Product Name    Amazon - Echo Plus w/ Built-In Hub - Silver
avg_rating                                         4.749153
pos_frac                                                1.0
neg_frac                                                0.0
num_reviews                                             590
Name: 31, dtype: object
sentiment
positive    1.0
Name: proportion, dtype: float64
sentiment
positive    590
Name: count, dtype: int64



Key Learning of M3

Prompting FLAN-T5: the model is extremely sensitive to constraints such as:

- a maximum of two sentences
- use only factual information
- use only the structured data

If the rules are too strict, the model hallucinates strongly, so avoid overly strict wording.

# Review of FLAN-T5 Results

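- The pipeline generates a buying guide per cluster from the review data.
- `cluster_summary` lists the top products and the product to avoid, although in the Batteries cluster the same product appears in both lists.
- Generated `article` outputs draw on the input data but are often too short or repetitive, and some misattribute figures (the Classic Fire Tablets article gives the 4.72 rating to the Kindle Fire 16GB, which the table below lists for the Kindle Voyage 4GB).
- Summaries often miss the full 2-sentence detail per product.
- Recommendations: refine prompts, clean inputs, and optionally fine-tune with LoRA for better formatting and completeness.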


| Cluster Value | Cluster Name | Top Products (Summary) | Product to Avoid / Article Summary |
|---|---|---|---|
| Batteries | Cluster Batteries | AmazonBasics AA 48 Count, rating 4.45, 100% positive<br>AmazonBasics AAA 36 Count, rating 4.45, 0% positive | AmazonBasics AAA 36 Count, rating 4.45<br>Summary: Rated 4.45 |
| Classic Fire Tablets | Cluster Classic Fire Tablets | Kindle Voyage 4GB, rating 4.72, 100% positive<br>Kindle Fire 16GB, rating 4.60, 100% positive<br>Fire Kids Edition 16GB, rating 4.53, 100% positive | Fire Tablet 8GB, rating 4.45<br>Summary: Kindle Fire 16GB has 4.72 rating |
| Fire HD Tablets | Cluster Fire HD Tablets | PowerFast USB Charger, rating 4.86, 100% positive<br>Fire HD 8 Tablet 32GB, rating 4.67, 100% positive<br>PowerFast USB Charger, rating 4.67, 100% positive | Fire HD 8 Tablet 16GB, rating 4.58<br>Summary: Fire HD 8 Tablet 32GB recommended |
| Kindle E-readers | Cluster Kindle E-readers | Kindle Voyage E-reader, rating 4.89, 100% positive<br>Kindle Fire 16GB, rating 4.86, 100% positive<br>Fire Tablet 8GB, rating 4.82, 100% positive | Kindle E-reader Black, rating 4.43<br>Summary: 3 of 5 Kindle Fire E-readers are best |
| Smart Home / Audio | Cluster Smart Home / Audio | Fire HD 10 Tablet 16GB, rating 4.77, 100% positive<br>Echo Plus, rating 4.75, 100% positive<br>Amazon Tap, rating 4.73, 100% positive | Amazon Tap Portable Speaker, rating 4.51<br>Summary: Fire HD 10 Tablet recommended |
