
Part I : Setup

In [34]:
!pip install rouge_score==0.1.2
!pip install evaluate
!pip install -U accelerate --quiet
!pip install datasets
!pip install nltk



In [35]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Part II : Dataset Loading and Exploration

In [36]:
import pandas as pd

# Load datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train_sample = train_df.sample(n=100, random_state=42)
test_sample = test_df.sample(n=50, random_state=42)

# Display first example (show full row of features)
print("🔷 First Example from Train Sample:")
print(train_sample.iloc[0])

# Inspect DataFrames
print("\n Sampled Train DataFrame:\n", train_sample.head())
print("\n Sampled Test DataFrame:\n", test_sample.head())

🔷 First Example from Train Sample:
battery_power    1646.0
blue                0.0
clock_speed         2.5
dual_sim            0.0
fc                  3.0
four_g              1.0
int_memory         25.0
m_dep               0.6
mobile_wt         200.0
n_cores             2.0
pc                  5.0
px_height         211.0
px_width         1608.0
ram               686.0
sc_h                8.0
sc_w                6.0
talk_time          11.0
three_g             1.0
touch_screen        1.0
wifi                0.0
price_range         0.0
Name: 1860, dtype: float64

 Sampled Train DataFrame:
       battery_power  blue  clock_speed  dual_sim  fc  four_g  int_memory  \
1860           1646     0          2.5         0   3       1          25   
353            1182     0          0.5         0   7       1           8   
1333           1972     0          2.9         0   9       0          14   
905             989     1          2.0         0   4       0          17   
1289            615     1 

Part III : Summarization with T5

In [37]:
import torch
import gc
import pandas as pd
from transformers import T5ForConditionalGeneration, AutoTokenizer

In [38]:
def batch_generator(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
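
As a quick check of the helper above, the sketch below runs it on a toy list (the numbers are illustrative and not taken from the dataset). The last batch is simply shorter when the list length is not a multiple of batch_size.

```python
def batch_generator(items, batch_size):
    # Same helper as above, repeated so this snippet runs on its own
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Ten toy items split into batches of four
batches = list(batch_generator(list(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```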

In [39]:
def summarize_with_t5(articles, model, tokenizer, batch_size=8, device=None):
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'

    model = model.to(device)
    summaries = []

    for batch in batch_generator(articles, batch_size):
        inputs = ["summarize: " + text for text in batch]
        encoding = tokenizer(
            inputs,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)

        with torch.no_grad():
            outputs = model.generate(
                **encoding,
                max_length=150,
                num_beams=4,
                early_stopping=True
            )

        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        summaries.extend(decoded)

        # Free memory after each batch
        del encoding, outputs
        torch.cuda.empty_cache()
        gc.collect()

    # Final cleanup
    torch.cuda.empty_cache()
    gc.collect()

    return summaries


In [40]:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [41]:
# Example: Combine features into text to simulate summarization input
articles = train_sample.astype(str).apply(lambda row: " ".join(row.values), axis=1).tolist()

# Generate summaries
generated_summaries = summarize_with_t5(articles, model, tokenizer, batch_size=8)

In [42]:
results_df = pd.DataFrame({
    'Input_Features': articles,
    'Generated_Summary': generated_summaries,
    'Reference_Price_Class': train_sample['price_range'].values
})

# Display
print(results_df.head(10))

                                      Input_Features  \
0  1646 0 2.5 0 3 1 25 0.6 200 2 5 211 1608 686 8...   
1  1182 0 0.5 0 7 1 8 0.5 138 8 16 275 986 2563 1...   
2  1972 0 2.9 0 9 0 14 0.4 196 7 18 293 952 1316 ...   
3  989 1 2.0 0 4 0 17 0.2 166 3 19 256 1394 3892 ...   
4  615 1 0.5 1 7 0 58 0.5 130 5 8 1021 1958 1906 ...   
5  627 1 1.6 1 3 1 12 0.2 131 7 17 447 819 2476 1...   
6  894 0 0.9 0 5 1 54 0.2 130 3 15 104 541 2829 1...   
7  1066 0 3.0 1 6 1 5 0.5 167 5 7 53 1504 1044 8 ...   
8  616 0 1.9 1 13 1 44 0.8 81 3 17 651 1618 3366 ...   
9  712 0 0.5 0 6 0 27 0.5 86 2 11 1245 1309 2001 ...   

                                   Generated_Summary  Reference_Price_Class  
0  1646 0 2.5 0 3 1 25 0.6 200 2 5 211 1608 686 8...                      0  
1  1182 0 0.5 0 7 1 8 0.5 138 8 16 275 986 2563 1...                      2  
2     1972 0 2.9 0 9 0 14 0.4 196 7 18 293 952 1316.                      1  
3  989 1 2.0 0 4 0 17 0.2 166 3 19 256 1394 3892 ...                   

Part IV : Accuracy Evaluation

In [43]:
from sklearn.metrics import accuracy_score

# Simplified comparison: if generated summary exactly matches the price class (as string)
generated_predictions = [summary.strip() for summary in generated_summaries]
reference_labels = [str(label) for label in train_sample['price_range'].values]

# Calculate exact match accuracy
accuracy = accuracy_score(reference_labels, generated_predictions)

print(f"Accuracy of T5-small summaries: {accuracy:.4f}")

Accuracy of T5-small summaries: 0.0000


An accuracy of 0.0000 confirms what we anticipated:

➡️ T5-small never outputs a price_range label. It is pretrained for text generation and summarization, not classification, so it mostly echoes the input numbers, and no generated string exactly matches "0", "1", "2" or "3".
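
To make the zero concrete, the sketch below recomputes exact-match accuracy in plain Python. The two echoed strings are shortened, illustrative stand-ins modeled on the generated summaries shown earlier, and `exact_match_accuracy` is a helper name introduced only for this example.

```python
def exact_match_accuracy(predictions, labels):
    """Fraction of predictions equal to their label after stripping whitespace."""
    if not labels:
        return 0.0
    hits = sum(p.strip() == l.strip() for p, l in zip(predictions, labels))
    return hits / len(labels)

# Illustrative strings: the model echoes feature values instead of emitting a class
echoed = ["1646 0 2.5 0 3 1 25", "1182 0 0.5 0 7 1 8"]
labels = ["0", "2"]

print(exact_match_accuracy(echoed, labels))        # 0.0
print(exact_match_accuracy(["0", " 2 "], labels))  # 1.0
```

Exact match only rewards a generated string that is the label itself, so any output that merely contains the right digit among other numbers still counts as wrong.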



Part V : ROUGE Metric Implementation

In [44]:
import evaluate
import nltk

# Load ROUGE metric
rouge = evaluate.load("rouge")

# Download NLTK punkt tokenizer (for sentence splitting)
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [45]:
from nltk.tokenize import sent_tokenize

def compute_rouge_score(predictions, references):
    # Preprocess by adding newline between sentences
    predictions = ["\n".join(sent_tokenize(pred)) for pred in predictions]
    references = ["\n".join(sent_tokenize(ref)) for ref in references]

    # Compute ROUGE scores
    results = rouge.compute(predictions=predictions, references=references)

    return results

In [46]:
# Example reference and generated summaries
refs = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is great for text summarization."
]

preds = [
    "The brown fox jumps over the dog.",
    "ML is useful for summarizing text."
]

scores = compute_rouge_score(preds, refs)
print(scores)


{'rouge1': np.float64(0.6682692307692308), 'rouge2': np.float64(0.28571428571428575), 'rougeL': np.float64(0.6682692307692308), 'rougeLsum': np.float64(0.6682692307692308)}
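
The rouge1 and rouge2 values above can be approximated by hand. The following is a minimal sketch, assuming a simple lowercase alphanumeric tokenizer and a plain mean over pairs; `tokenize`, `ngrams` and `rouge_n_f1` are names introduced for this example, and the evaluate library additionally applies its own aggregation, so exact agreement is not guaranteed in general.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs only (an approximation of the library tokenizer)
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    # Multiset of contiguous n-grams
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(prediction, reference, n=1):
    pred = ngrams(tokenize(prediction), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

refs = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is great for text summarization.",
]
preds = [
    "The brown fox jumps over the dog.",
    "ML is useful for summarizing text.",
]

for n in (1, 2):
    per_pair = [rouge_n_f1(p, r, n) for p, r in zip(preds, refs)]
    print(f"ROUGE-{n} mean F1: {sum(per_pair) / len(per_pair):.4f}")
# ROUGE-1 mean F1: 0.6683
# ROUGE-2 mean F1: 0.2857
```

The second pair shares no bigram with its reference, which is why rouge2 drops to roughly half of the first pair's score.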


Part VI : Understanding ROUGE Scores

In [47]:
refs = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is great for text summarization."
]

preds = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is great for text summarization."
]

scores = compute_rouge_score(preds, refs)
print("Exact Match ROUGE Scores:", scores)


Exact Match ROUGE Scores: {'rouge1': np.float64(1.0), 'rouge2': np.float64(1.0), 'rougeL': np.float64(1.0), 'rougeLsum': np.float64(1.0)}


In [48]:
#Null prediction
preds = ["", ""]

scores = compute_rouge_score(preds, refs)
print("Null Prediction ROUGE Scores:", scores)


Null Prediction ROUGE Scores: {'rouge1': np.float64(0.0), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.0), 'rougeLsum': np.float64(0.0)}


In [49]:
#Stemming Effect
refs = ["The cat is running in the park."]
preds_no_stem = ["The cat is run in the park."]  # Different word form, same stem

# evaluate's ROUGE defaults to use_stemmer=False, so "running" and "run" do not match
scores = compute_rouge_score(preds_no_stem, refs)
print("ROUGE without stemming (default):", scores)

ROUGE without stemming (default): {'rouge1': np.float64(0.8571428571428571), 'rouge2': np.float64(0.6666666666666666), 'rougeL': np.float64(0.8571428571428571), 'rougeLsum': np.float64(0.8571428571428571)}


In [50]:
# N-grams
refs = ["The quick brown fox jumps over the lazy dog."]

# Different predictions
preds_list = [
    "The quick brown fox jumps over the lazy dog.",  # full overlap
    "The brown fox jumps over the dog.",             # partial overlap
    "Fox jumps over lazy.",                           # less overlap
    "Cat runs fast."                                  # no overlap
]

for pred in preds_list:
    score = compute_rouge_score([pred], refs)
    print(f"Prediction: {pred}\nROUGE scores: {score}\n")

Prediction: The quick brown fox jumps over the lazy dog.
ROUGE scores: {'rouge1': np.float64(1.0), 'rouge2': np.float64(1.0), 'rougeL': np.float64(1.0), 'rougeLsum': np.float64(1.0)}

Prediction: The brown fox jumps over the dog.
ROUGE scores: {'rouge1': np.float64(0.8750000000000001), 'rouge2': np.float64(0.5714285714285715), 'rougeL': np.float64(0.8750000000000001), 'rougeLsum': np.float64(0.8750000000000001)}

Prediction: Fox jumps over lazy.
ROUGE scores: {'rouge1': np.float64(0.6153846153846153), 'rouge2': np.float64(0.36363636363636365), 'rougeL': np.float64(0.6153846153846153), 'rougeLsum': np.float64(0.6153846153846153)}

Prediction: Cat runs fast.
ROUGE scores: {'rouge1': np.float64(0.0), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.0), 'rougeLsum': np.float64(0.0)}
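
In every prediction above rougeL equals rouge1, because the predictions keep the reference's word order. The sketch below, a simplified sentence-level ROUGE-L based on the longest common subsequence (LCS), shows the two diverging once the order changes. `lcs_length` and `rouge_l_f1` are names introduced for this example, the tokenizer is an approximation, and rougeLsum is not covered.

```python
import re

def tokenize(text):
    # Lowercase and keep alphanumeric runs only (an approximation of the library tokenizer)
    return re.findall(r"[a-z0-9]+", text.lower())

def lcs_length(a, b):
    # Dynamic programming over two sequences, keeping only the previous row
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The quick brown fox jumps over the lazy dog."
print(round(rouge_l_f1("The brown fox jumps over the dog.", reference), 4))   # 0.875
print(round(rouge_l_f1("Dog lazy the over jumps fox brown.", reference), 4))  # 0.25
```

The reversed sentence contains the same seven words as the first prediction, so its unigram overlap is unchanged, but only two words can be matched in order.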



In [51]:
refs = ["The quick brown fox jumps over the lazy dog."]
preds = ["The brown fox jumps."]

score1 = compute_rouge_score(preds, refs)
score2 = compute_rouge_score(refs, preds)

print("ROUGE(preds, refs):", score1)
print("ROUGE(refs, preds):", score2)

ROUGE(preds, refs): {'rouge1': np.float64(0.6153846153846153), 'rouge2': np.float64(0.36363636363636365), 'rougeL': np.float64(0.6153846153846153), 'rougeLsum': np.float64(0.6153846153846153)}
ROUGE(refs, preds): {'rouge1': np.float64(0.6153846153846153), 'rouge2': np.float64(0.36363636363636365), 'rougeL': np.float64(0.6153846153846153), 'rougeLsum': np.float64(0.6153846153846153)}
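
Swapping the arguments leaves the scores unchanged because the reported value is an F-measure, which is symmetric in precision and recall. The sketch below computes unigram precision, recall and F1 separately to show that the first two do swap. `rouge1_prf` is a helper name introduced for this example, with an approximate tokenizer.

```python
import re
from collections import Counter

def rouge1_prf(prediction, reference):
    """Return (precision, recall, f1) for clipped unigram overlap."""
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p = overlap / sum(pred.values())
    r = overlap / sum(ref.values())
    return p, r, 2 * p * r / (p + r)

long_text = "The quick brown fox jumps over the lazy dog."
short_text = "The brown fox jumps."

# Short prediction vs long reference: precision 1.0, recall 4/9, F1 8/13
print(rouge1_prf(short_text, long_text))
# Roles swapped: precision 4/9, recall 1.0, F1 still 8/13
print(rouge1_prf(long_text, short_text))
```

A short summary that only uses reference words scores perfect precision but low recall, which is why looking at the F-measure alone can hide whether a model is too terse or too verbose.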


Part VII : Comparing Small and Large Models

In [52]:
import torch
import gc
import pandas as pd
from transformers import T5ForConditionalGeneration, AutoTokenizer, GPT2Tokenizer, GPT2LMHeadModel
import evaluate
from nltk.tokenize import sent_tokenize
import nltk

nltk.download('punkt')
rouge = evaluate.load("rouge")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [53]:
#Rouge functions
def preprocess_summaries(summaries):
    return ["\n".join(sent_tokenize(s)) for s in summaries]

def compute_rouge_score(predictions, references):
    predictions = preprocess_summaries(predictions)
    references = preprocess_summaries(references)
    return rouge.compute(predictions=predictions, references=references)

def compute_rouge_per_row(preds, refs):
    """Compute ROUGE scores for each prediction-reference pair."""
    results = []
    for p, r in zip(preds, refs):
        score = rouge.compute(predictions=[p], references=[r])
        results.append(score)
    return pd.DataFrame(results)


In [54]:
#Summarize with T5
def summarize_with_t5(articles, model_name="t5-small", batch_size=8, max_length=150):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    summaries = []
    for i in range(0, len(articles), batch_size):
        batch = articles[i:i+batch_size]
        inputs = ["summarize: " + text for text in batch]
        encoding = tokenizer(
            inputs,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to(device)

        with torch.no_grad():
            outputs = model.generate(
                **encoding,
                max_length=max_length,
                num_beams=4,
                early_stopping=True
            )
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        summaries.extend(decoded)

        del encoding, outputs
        torch.cuda.empty_cache()
        gc.collect()
    return summaries


In [55]:
def summarize_with_gpt2(articles, batch_size=8, max_length=100):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model_name = "gpt2"
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).to(device)

    # GPT-2's context window (prompt + generated tokens) is 1024 tokens
    context_size = model.config.n_positions

    summaries = []
    for i in range(0, len(articles), batch_size):
        batch = articles[i:i+batch_size]
        # Prompts are generated one at a time; batch_size only sets how often memory is freed
        for text in batch:
            prompt = text.strip() + "\nTL;DR:"
            inputs = tokenizer.encode(prompt, return_tensors='pt').to(device)

            # Leave room for generation; a prompt that fills the window gets an empty summary
            max_gen_len = min(max_length, context_size - inputs.size(1))
            if max_gen_len <= 0:
                summaries.append("")
                continue

            with torch.no_grad():
                outputs = model.generate(
                    inputs,
                    max_length=inputs.size(1) + max_gen_len,
                    do_sample=False,
                    num_beams=4,
                    no_repeat_ngram_size=3,
                    early_stopping=True,
                    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token
                )
            generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Extract summary after "TL;DR:"
            summary = generated.split("TL;DR:")[-1].strip()
            summaries.append(summary)

        torch.cuda.empty_cache()
        gc.collect()
    return summaries


In [57]:
def compare_models(rouge_scores_dict):
    """
    Aggregate average ROUGE scores from different models into a single DataFrame.

    Args:
        rouge_scores_dict (dict): {model_name: {rouge_type: score, ...}, ...}

    Returns:
        pd.DataFrame: Aggregated ROUGE scores.
    """
    rows = []
    for model_name, scores in rouge_scores_dict.items():
        row = {'Model': model_name}
        row.update(scores)
        rows.append(row)
    return pd.DataFrame(rows)



In [58]:
def compare_models_summaries(summaries_dict, references, sample_count=5):
    df = pd.DataFrame({'Reference Summary': references})
    for model_name, summaries in summaries_dict.items():
        df[f"{model_name} Summary"] = summaries
    return df.head(sample_count)


In [59]:
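# Collect per-model results. rouge_small, rouge_base, rouge_gpt2, summaries_small,
# summaries_base, summaries_gpt2 and references are assumed to have been computed
# in an earlier cell with compute_rouge_score and the summarize_with_* functions above.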
rouge_scores_dict = {
    't5-small': rouge_small,
    't5-base': rouge_base,
    'gpt2': rouge_gpt2
}

summaries_dict = {
    't5-small': summaries_small,
    't5-base': summaries_base,
    'gpt2': summaries_gpt2
}

# Generate aggregated ROUGE table
rouge_df = compare_models(rouge_scores_dict)
print("\n📊 Aggregated ROUGE Scores:")
print(rouge_df)

# Generate side-by-side summary comparison
summaries_df = compare_models_summaries(summaries_dict, references, sample_count=5)
print("\n📑 Side-by-Side Summary Comparison:")
print(summaries_df)



📊 Aggregated ROUGE Scores:
      Model    rouge1  rouge2    rougeL  rougeLsum
0  t5-small  0.056458     0.0  0.056655   0.056639
1   t5-base  0.085246     0.0  0.085376   0.085406
2      gpt2  0.000440     0.0  0.000523   0.000523

📑 Side-by-Side Summary Comparison:
  Reference Summary                                   t5-small Summary  \
0                 0  1646 0 2.5 0 3 1 25 0.6 200 2 5 211 1608 686 8...   
1                 2  1182 0 0.5 0 7 1 8 0.5 138 8 16 275 986 2563 1...   
2                 1     1972 0 2.9 0 9 0 14 0.4 196 7 18 293 952 1316.   
3                 3  989 1 2.0 0 4 0 17 0.2 166 3 19 256 1394 3892 ...   
4                 1  615 1 0.5 1 7 0 58 0.5 130 5 8 1021 1958 1906 ...   

                                     t5-base Summary  \
0  1646 0 2.5 0 3 1 25 0.6 200 2 5 211 1608 686 8...   
1  0 0 0 7 1 8 0.5 138 8 16 275 986 2563 19 17 19...   
2  0 2.9 0 9 0 14 0.4 196 7 18 293 952 1316 8 1 8...   
3  989 1 2.0 0 4 0 17 0.2 166 3 19 256 1394 3892 ...   
4  58 0