<a href="https://colab.research.google.com/github/Naomie25/Hackaton-Fashion-Description-Generator/blob/Output/Fashion_Description_Generator_Hackathon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Define the Task & Pipeline Overview

Input (keyword or image) → Generation Model → Quality-Check Module → (Optional) Image Generator → Ethical Filter → Final Output
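
The flow above can be read as a chain of plain functions. The sketch below is illustrative only: `generate`, `quality_check`, `passes_filter`, and `run` are hypothetical stand-ins that load no model (the notebook's real stages use distilgpt2 and BART), and the optional image step is left out.

```python
from typing import List, Tuple

# Hypothetical stand-ins for the real stages; no model is loaded here.
def generate(keyword: str) -> List[str]:
    """Pretend generation model: returns candidate descriptions."""
    return [
        f"A timeless {keyword} with a tailored cut and soft fabric.",
        f"This {keyword} promotes violence.",  # deliberately unsafe candidate
    ]

def quality_check(text: str) -> str:
    """Pretend quality-check module: keeps the first sentence as a summary."""
    return text.split(".")[0] + "."

def passes_filter(text: str, blacklist=("violence", "racism")) -> bool:
    """Return True when no blacklisted term appears in the text."""
    lowered = text.lower()
    return not any(term in lowered for term in blacklist)

def run(keyword: str) -> List[Tuple[str, str]]:
    """Input -> generation -> quality check -> ethical filter -> final output."""
    final = []
    for candidate in generate(keyword):
        summary = quality_check(candidate)
        if passes_filter(candidate):
            final.append((candidate, summary))
    return final

for description, summary in run("wool coat"):
    print(description, "|", summary)
```

Keeping each stage a separate function means any one of them can be swapped (for example, a stronger filter) without touching the others.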

In [1]:
!pip install transformers torch sentencepiece
!pip install schedule
!pip install --upgrade datasets

Collecting datasets
  Downloading datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-4.0.0-py3-none-any.whl (494 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.7.0
    Uninstalling fsspec-2025.7.0:
      Successfully uninstalled fsspec-2025.7.0
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency r

In [2]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, BartForConditionalGeneration, BartTokenizer
from datasets import load_dataset
import difflib
import re
import random
import schedule
import time

# ============================
# 1. Define the Task & Pipeline Overview
# ============================
# Objective: Generate fashion product descriptions, assess their quality via summarization, apply ethical filtering, and output the final results.
# Workflow: prompt → text generation → summary (quality check) → ethical filter → final output.
# Input (keyword or image) → Generation Model → Quality-Check Module → (Optional) Image Generator → Ethical Filter → Final Output

In [5]:
device = torch.device("cpu")  # Force the use of the CPU
print("Device set to use", device)

# ============================
# 2. Select Your Generation Method
# ============================
# We pick a lightweight Transformer model for generation (distilgpt2)
# and a BART-base model for summarization

# ============================
# 3. Pick Specific Pre-trained Models
# ============================
# Load the tokenizer and the distilled GPT-2 model (small and fast)
gpt2_model_name = "distilgpt2"
gpt2_tokenizer = GPT2Tokenizer.from_pretrained(gpt2_model_name)  # load the tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained(gpt2_model_name).to(device)  # load the pretrained model

# Load the BART tokenizer and model used for summarization (quality check)
bart_model_name = "facebook/bart-base"
bart_tokenizer = BartTokenizer.from_pretrained(bart_model_name)
bart_model = BartForConditionalGeneration.from_pretrained(bart_model_name).to(device)

# Fashion-related keywords used to score description quality
fashion_keywords = [
    "elegant", "stylish", "refined", "modern", "vintage", "casual",
    "minimalist", "chic", "versatile", "comfort", "premium", "crafted",
    "tailored", "cut", "fit", "fabric", "soft", "bold", "timeless"
]

# ============================
# 4. Prepare & Subsample Your Dataset
# ============================

def load_and_subsample_dataset(subsample_ratio=0.05, seed=42):
    """Load the IMDB train split and return a seeded random subsample of it."""
    dataset = load_dataset("imdb", split="train")
    rng = random.Random(seed)  # local RNG: reproducible without reseeding the global one
    sample_size = int(len(dataset) * subsample_ratio)
    indices = rng.sample(range(len(dataset)), sample_size)
    subsampled_dataset = dataset.select(indices)
    print(f"Original size: {len(dataset)}, subsampled size: {len(subsampled_dataset)}")
    return subsampled_dataset

# ============================
# 5. Text Generation Module
# ============================
def generate_descriptions(keyword, num_variants=5):

    """Generate several candidate descriptions from a keyword prompt."""

    prompt = f"Write a stylish, concise, and elegant product description focusing on fabric, cut, and style for: {keyword}.\n\n"
    # Tokenize with an explicit attention mask so generate() does not have to infer it
    inputs = gpt2_tokenizer(prompt, return_tensors="pt").to(device)
    prompt_length = inputs["input_ids"].shape[1]

    outputs = gpt2_model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=50,
        do_sample=True,
        top_k=40,
        top_p=0.9,
        temperature=0.7,
        num_return_sequences=num_variants,
        repetition_penalty=1.2,
        pad_token_id=gpt2_tokenizer.eos_token_id
    )

    results = []
    for output in outputs:
        # Decode only the newly generated tokens, i.e. everything after the prompt
        gen_text = gpt2_tokenizer.decode(output[prompt_length:], skip_special_tokens=True).strip()
        score = score_description(gen_text, prompt)
        results.append((gen_text, score))

    results = sorted(results, key=lambda x: x[1], reverse=True)  # Sort by descending score
    results = clean_descriptions(results)
    return results

# ============================
# Utility function: repetition detection
# ============================
def has_repetitions(text, max_repeat=3):
    # Detect whether a word occurs more than max_repeat times in a row in the text.
    # Both fragments must be raw strings: in a plain string, '\b' is a backspace, not a word boundary.
    pattern = r'\b(\w+)( \1){' + str(max_repeat) + r',}\b'
    return re.search(pattern, text.lower()) is not None

# ============================
# Keep readable descriptions only
# ============================
def clean_descriptions(descriptions):

    # Keep only descriptions that have enough words and no excessive repetition.
    filtered = []
    for desc, score in descriptions:
        if len(desc.split()) < 8:
            continue  # Too short: filtered out
        if has_repetitions(desc):
            continue  # Too many repetitions: filtered out
        filtered.append((desc, score))
    return filtered

# ============================
# Scoring description
# ============================
def score_description(desc, prompt):
    """
    Score a generated description based on:
    - length (capped at 50 words)
    - presence of fashion keywords
    - a bonus that shrinks as the text gets closer to the prompt (copied text)
    """
    words = desc.lower().split()
    keyword_bonus = sum(word in words for word in fashion_keywords)
    length_score = min(len(words), 50) / 50

    similarity = difflib.SequenceMatcher(None, desc.lower(), prompt.lower()).ratio()
    penalty = max(0, 1 - similarity)  # despite the name, a bonus: the more the text differs from the prompt, the higher it is

    return length_score + 0.5 * keyword_bonus + penalty


Device set to use cpu
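
To make the scoring rule in `score_description` concrete, here is a standalone sketch that applies the same formula (capped length ratio, plus 0.5 per fashion keyword present, plus dissimilarity to the prompt) without loading any model. The function name `score`, the shortened keyword list, and the sample strings are made up for illustration.

```python
import difflib

# Shortened, illustrative keyword list (the notebook's list is longer).
FASHION_KEYWORDS = ["elegant", "tailored", "soft", "fabric", "timeless"]

def score(desc: str, prompt: str) -> float:
    words = desc.lower().split()
    length_score = min(len(words), 50) / 50            # 0.0 to 1.0
    keyword_bonus = sum(kw in words for kw in FASHION_KEYWORDS)
    similarity = difflib.SequenceMatcher(None, desc.lower(), prompt.lower()).ratio()
    novelty = 1 - similarity                            # 0.0 when desc equals the prompt
    return length_score + 0.5 * keyword_bonus + novelty

prompt = "Write a product description for: wool coat."
on_topic = "A tailored wool coat in soft fabric with a timeless silhouette"
echo = prompt  # a degenerate output that just repeats the prompt

print(f"on-topic: {score(on_topic, prompt):.2f}")
print(f"echo:     {score(echo, prompt):.2f}")
```

A verbatim echo of this 7-word prompt gets only the length term (7/50 = 0.14), while the on-topic text earns 0.5 for each of its four matched keywords. Because keywords are matched against whitespace-split tokens, a keyword followed directly by punctuation (such as "fabric,") does not count.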


In [6]:
# ============================
# 6. Add Summarization for Quality Control
# ============================
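def summarize_text(text):
    """
    Summarize a given text with BART as a quality check.
    Note: facebook/bart-base is not fine-tuned for summarization; in this
    notebook's sample outputs the summary mostly echoes the start of the input.
    """
    inputs = bart_tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    summary_ids = bart_model.generate(
        inputs["input_ids"], attention_mask=inputs["attention_mask"],
        num_beams=4, max_length=30, early_stopping=True,
    )
    summary = bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary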

# ============================
# 10. Integrate an Ethical Filter
# ============================
def ethical_filter(text):
    """
    Basic filter: rejects text containing sensitive words.
    """
    blacklist = ["hate", "violence", "racism", "sexism", "terrorism"]
    text_lower = text.lower()
    for bad_word in blacklist:
        if bad_word in text_lower:
            return False  # text rejected
    return True  # text accepted

# ============================
# 7. (Optional) Image Generation (placeholder)
# ============================
def generate_image_placeholder():
    """
    Simple placeholder for image generation (to be developed with a VAE if needed).
    """
    print("Image generation step (placeholder).")

# ============================
# 8. Automate the Workflow
# ============================
def run_pipeline(keyword, num_variants=5):
    """
    Chain generation, summarization, filtering, and output.
    """
    print(f"\n--- Generation for: {keyword} ---")
    descriptions = generate_descriptions(keyword, num_variants)

    final_results = []
    for desc, score in descriptions:
        summary = summarize_text(desc)
        if not ethical_filter(desc):
            print("Rejected by ethical filter:", desc)
            continue
        final_results.append((desc, summary, score))

    # Print the final results
    for i, (desc, summary, score) in enumerate(final_results, 1):
        print(f"\nDescription {i} [Score: {score:.2f}]:\n{desc}")
        print(f"Quality summary:\n{summary}")

    # Placeholder image generation
    generate_image_placeholder()

    return final_results

# ============================
# 9. Evaluate Your System
# ============================
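def evaluate_generation(keyword, num_samples=50):
    """
    Generate several descriptions and compute their mean score.
    """
    results = generate_descriptions(keyword, num_samples)
    if not results:
        # clean_descriptions may filter out every candidate
        print("No descriptions survived filtering; nothing to evaluate.")
        return
    avg_score = sum(score for _, score in results) / len(results)
    print(f"Mean score over {len(results)} kept of {num_samples} generated samples: {avg_score:.2f}")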
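
One limitation of the substring check in `ethical_filter` is that a blacklisted term also matches inside unrelated words (for example, "hate" occurs inside "whatever"). A possible refinement, sketched below under the assumption that whole-word matching is acceptable, compiles the blacklist into a single word-boundary regex. `ethical_filter_words` and `ethical_filter_substring` are hypothetical names used only for this comparison.

```python
import re

BLACKLIST = ["hate", "violence", "racism", "sexism", "terrorism"]

# One alternation wrapped in \b...\b so terms only match as whole words.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLACKLIST) + r")\b",
    re.IGNORECASE,
)

def ethical_filter_words(text: str) -> bool:
    """Return True if the text is accepted (no blacklisted whole word)."""
    return _PATTERN.search(text) is None

def ethical_filter_substring(text: str) -> bool:
    """Substring check with the same logic as the notebook's filter."""
    lowered = text.lower()
    return not any(term in lowered for term in BLACKLIST)

sample = "Wear it with whatever you like."
print(ethical_filter_substring(sample))  # False: "hate" is inside "whatever"
print(ethical_filter_words(sample))      # True: no blacklisted whole word
```

The trade-off is that whole-word matching no longer catches derived forms such as "hateful" unless they are added to the list explicitly.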

In [7]:
# ============================
# 11. Document & Reflect
# ============================
def document_pipeline():
    print("""
Fashion AI pipeline summary (CPU-friendly):
- Generation model: distilgpt2 (small, fast, CPU)
- Summarization model: bart-base (quality check)
- Dataset: IMDB text dataset, subsampled to 5%
- Basic ethical filter based on a blacklist
- Combined score: length, fashion keywords, difference from the prompt
- Image generation placeholder (to be extended with a VAE)
- Automation possible with schedule
- CPU limits: slow generation, no heavy GAN
- Future improvements: advanced ethical filter, BLEU/ROUGE metrics, VAE images

This pipeline shows a complete generation, quality-check, and filtering workflow for fashion product descriptions.
""")

# ============================
# Full usage example with several garments
# ============================
if __name__ == "__main__":
    # Step 4: load and subsample the dataset
    dataset_sample = load_and_subsample_dataset(subsample_ratio=0.05)

    # Keywords to generate descriptions for
    keywords = [
        "oversized denim jacket",
        "minimalist leather handbag",
        "tailored wool coat"
    ]

    # Run the pipeline for each garment
    for kw in keywords:
        run_pipeline(kw, num_variants=5)

    # Step 9: quick evaluation on the first keyword
    evaluate_generation(keywords[0], num_samples=20)

    # Step 11: documentation
    document_pipeline()

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Original size: 25000, subsampled size: 1250

--- Generation for: oversized denim jacket ---

Description 1 [Score: 1.68]:
This is the most important thing you can do in your life because of how comfortable it may be to wear an old-fashioned cotton shirt while wearing jeans that are just as long or loose like this one! This new design features some great materials including
Quality summary:
This is the most important thing you can do in your life because of how comfortable it may be to wear an old-fashioned cotton shirt

Description 2 [Score: 1.68]:
A unique design that is made of cotton fibers with the same pattern as its traditional fabrics (think T-shirts). It allows you to look at your sewing patterns in detail while also adding an extra layer of color added by using less effort than it
Quality summary:
A unique design that is made of cotton fibers with the same pattern as its traditional fabrics (think T-shirts). It allows you to

Description 3 [Score: 1.61]:
The design of the garme