# *Teknoloogiaa*: A GPT-2-Based Chatbot

This notebook walks through building an English-language chatbot on top of the pre-trained GPT-2 model. It covers the whole process, from data processing to the user interface.

### Objectives

- Load and prepare the text data for training.
- Fine-tune GPT-2 on custom data.
- Build an automatic answer function based on the model.
- Integrate the chatbot into an interactive *Gradio* interface.

### Data and source
- The corpus used for the chatbot consists of comments from Reddit posts about technology (artificial intelligence (AI), machine learning, and so on).

### Main steps

1. *Installing the required libraries* (Transformers, Gradio, etc.)
2. *Loading and preprocessing the text files*
3. *Fine-tuning* GPT-2 with GPU resource management
4. *Writing a response-generation function*
5. *Evaluating the chatbot*
6. *Launching the Gradio interface to test the chatbot*

### Dependencies

- transformers: loads, configures, and trains the GPT-2 model.
- torch: the PyTorch library used to train and run deep learning models.
- pandas: loads and manipulates text data in tabular form.
- gradio: builds an interactive web interface for testing the chatbot live.
- sklearn: computes the cosine similarity between the question embedding and the chunk embeddings.
- numpy: numerical operations and efficient array handling.

### Installing and importing packages

In [9]:
! pip install gradio hf_xet
import os
import time
import math
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, get_linear_schedule_with_warmup
from torch.optim import AdamW
from tqdm.notebook import tqdm, trange
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import glob
from IPython.display import Markdown, display, HTML
import pandas as pd



### Downloading and fine-tuning the model

In [10]:
# ===============================================================
# SECTION 1: HARDWARE CHECK AND BASIC UTILITIES
# ===============================================================

# Check whether a GPU is available
def check_gpu():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9  # in GB
        print(f"GPU disponible: {gpu_name} ({gpu_memory:.2f} GB)")
        return device
    else:
        print("Aucun GPU détecté, utilisation du CPU")
        return torch.device("cpu")

# ===============================================================
# SECTION 2: LOADING AND PREPARING THE DATA
# ===============================================================

# Class that loads the .txt documents from our directory
class SimpleDirectoryReader:
    def __init__(self, directory_path):
        self.directory_path = directory_path
        
    def load_data(self):
        documents = []
        for file_path in glob.glob(os.path.join(self.directory_path, "*.txt")):
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                text = file.read()
                documents.append(Document(text, extra_info={"source": file_path}))
        return documents

class Document:
    def __init__(self, text, extra_info=None):
        self.text = text
        self.extra_info = extra_info or {}

# ===============================================================
# SECTION 3: DOWNLOADING AND PREPARING THE GPT-2 MODEL
# ===============================================================

# Download the GPT-2 Medium model and its tokenizer
def download_and_save_model():
    model_dir = "/kaggle/working/gpt2-medium"

    start_time = time.time()
    
    # Download the tokenizer and the model
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
    # GPT-2 has no padding token, so reuse the end-of-sequence token
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
    # Update the model configuration so it recognizes the pad token
    model.config.pad_token_id = model.config.eos_token_id
    
    # Save the model and the tokenizer
    os.makedirs(model_dir, exist_ok=True)
    tokenizer.save_pretrained(model_dir)
    model.save_pretrained(model_dir)
    
    elapsed_time = time.time() - start_time
    
    # Report how long the download and save took
    print(f"Temps écoulé: {elapsed_time:.2f} secondes")
    
    return tokenizer, model

# ===============================================================
# SECTION 4: PREPARING THE DATA FOR TRAINING
# ===============================================================

# Custom dataset class
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        # Each text is truncated to max_length tokens, so only the start of a long document is used
        self.encodings = tokenizer(texts, truncation=True, padding="max_length", 
                                  max_length=max_length, return_tensors="pt")
        
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        labels = item["input_ids"].clone()
        # Exclude padding positions from the loss (-100 is the index the loss ignores)
        labels[item["attention_mask"] == 0] = -100
        item["labels"] = labels
        return item
    
    def __len__(self):
        return len(self.encodings["input_ids"])

# Split a text into overlapping chunks of at most chunk_size_limit words
def chunk_text(text, chunk_size_limit=600, overlap=20):
    words = text.split()
    chunks = []
    
    # Each window starts (chunk_size_limit - overlap) words after the previous one
    for i in range(0, len(words), chunk_size_limit - overlap):
        chunk = " ".join(words[i:i + chunk_size_limit])
        chunks.append(chunk)
    return chunks

# ===============================================================
# SECTION 5: BUILDING THE INDEX FOR SEMANTIC SEARCH
# ===============================================================

# Build the index
def construct_index(directory_path):
    # Check for a GPU
    device = check_gpu()
    
    # Parameters
    max_input_size = 512  # Kept well under GPT-2's 1024-token context window
    num_outputs = 256
    max_chunk_overlap = 20
    chunk_size_limit = 600
    
    # Download and save the model
    tokenizer, model = download_and_save_model()
    model.to(device)
    
    # Check that the directory contains documents
    document_files = glob.glob(os.path.join(directory_path, "*.txt"))
    if not document_files:
        print(f"Aucun fichier .txt trouvé dans {directory_path}.")

    documents = SimpleDirectoryReader(directory_path).load_data()
    
    if not documents:
        raise ValueError(f"Aucun document n'a été chargé depuis {directory_path}")
    
    print(f"{len(documents)} document(s) chargé(s)")
    
    # Preprocess the documents
    all_chunks = []
    document_embeddings = []
    print("Création des chunks et des embeddings...")

    
    # Progress bars to follow the processing of documents and chunks
    for doc_idx, doc in enumerate(tqdm(documents, desc="Documents", position=0)):
        chunks = chunk_text(doc.text, chunk_size_limit, max_chunk_overlap)
        all_chunks.extend(chunks)
        
        # Build simple embeddings with the GPT-2 model
        for chunk_idx, chunk in enumerate(tqdm(chunks, desc=f"Embeddings pour doc {doc_idx+1}/{len(documents)}", position=1, leave=False)):
            # Chunks longer than max_input_size tokens are truncated before embedding
            inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True, max_length=max_input_size).to(device)
            with torch.no_grad():
                outputs = model(**inputs, output_hidden_states=True)
                # Use the mean of the last hidden layer as the embedding
                last_hidden_state = outputs.hidden_states[-1].mean(dim=1)
                document_embeddings.append(last_hidden_state.squeeze().cpu().numpy())
    
    # Convert to a NumPy array to make retrieval easier
    document_embeddings = np.array(document_embeddings)
    
    # Save the data needed at query time
    index_data = {
        "chunks": all_chunks,
        "embeddings": document_embeddings,
    }
    
    torch.save(index_data, "/kaggle/working/index.pt")
    
    print("Index créé et sauvegardé dans /kaggle/working/index.pt")
    
    return index_data

# ===============================================================
# SECTION 6: FINE-TUNING THE MODEL ON THE DOCUMENTS
# ===============================================================

# Fine-tune the model on the documents
def fine_tune_model(directory_path, epochs=3, batch_size=4, learning_rate=5e-5):
    # Check for a GPU
    device = check_gpu()
    
    # Download the model
    tokenizer, model = download_and_save_model()
    
    # Load the documents
    documents = SimpleDirectoryReader(directory_path).load_data()
    texts = [doc.text for doc in documents]
    
    if not texts:
        raise ValueError(f"Aucun document trouvé pour le fine-tuning dans {directory_path}")
    
    print(f"{len(texts)} texte(s) chargé(s) pour l'entraînement")
    
    # Build the dataset
    dataset = TextDataset(texts, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    # Set up the optimizer and the learning-rate schedule
    optimizer = AdamW(model.parameters(), lr=learning_rate)
    total_steps = len(dataloader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )
    
    # Move the model to the GPU (or keep it on the CPU if none is available)
    model.to(device)
    
    # Training loop
    
    model.train()
    
    # Track the best loss so the best model can be saved
    best_loss = float('inf')
    
    for epoch in range(epochs):
        print(f"\n🔄 Epoch {epoch+1}/{epochs}")
        epoch_start_time = time.time()
        total_loss = 0
        
        progress_bar = tqdm(dataloader, desc=f"Training", position=0, 
                           bar_format='{l_bar}{bar:30}{r_bar}{bar:-30b}')
        
        for step, batch in enumerate(progress_bar):
            # Move the batch tensors to the training device
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Reset the gradients
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.item()
            
            # Backward pass
            loss.backward()
            optimizer.step()
            scheduler.step()
            
            # Compute running metrics for the progress bar
            avg_loss = total_loss / (step + 1)
            elapsed = time.time() - epoch_start_time
            time_per_step = elapsed / (step + 1)
            remaining_steps = len(dataloader) - (step + 1)
            eta = time_per_step * remaining_steps
            
            # Show detailed information on the progress bar
            progress_bar.set_postfix({
                'loss': f'{avg_loss:.4f}',
                'elapsed': f'{int(elapsed//60)}m {int(elapsed%60)}s',
                'ETA': f'{int(eta//60)}m {int(eta%60)}s',
                'lr': f'{scheduler.get_last_lr()[0]:.2e}'
            })
            
            # Periodically release cached GPU memory
            if step % 10 == 0 and torch.cuda.is_available():
                torch.cuda.empty_cache()
        
        # Compute the average loss for this epoch
        epoch_avg_loss = total_loss / len(dataloader)
        epoch_time = time.time() - epoch_start_time
        
        print(f"Epoch {epoch+1}/{epochs} terminée: Loss = {epoch_avg_loss:.4f}, Temps = {int(epoch_time//60)}m {int(epoch_time%60)}s")
        
        # Save the best model (judged on training loss; there is no validation set)
        if epoch_avg_loss < best_loss:
            best_loss = epoch_avg_loss
            best_model_dir = "/kaggle/working/gpt2-medium-fine-tuned-best"
            os.makedirs(best_model_dir, exist_ok=True)
            model.save_pretrained(best_model_dir)
            tokenizer.save_pretrained(best_model_dir)
            print(f"Meilleur modèle sauvegardé (loss: {best_loss:.4f})")
    
    # Save the final fine-tuned model
    fine_tuned_dir = "/kaggle/working/gpt2-medium-fine-tuned"
    os.makedirs(fine_tuned_dir, exist_ok=True)
    model.save_pretrained(fine_tuned_dir)
    tokenizer.save_pretrained(fine_tuned_dir)
    
    display(HTML(f"""
    <div style="padding: 10px; border-radius: 5px; background-color: #eafaf1; border-left: 5px solid #2ecc71;">
      <h3 style="margin: 0;">✅ Fine-tuning terminé avec succès!</h3>
      <p>Modèle final sauvegardé dans <code>{fine_tuned_dir}</code></p>
      <p>Meilleur modèle sauvegardé dans <code>/kaggle/working/gpt2-medium-fine-tuned-best</code> (loss: {best_loss:.4f})</p>
    </div>
    """))
    
    return model, tokenizer
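
To make the overlap in `chunk_text` concrete, here is a minimal sketch that applies the same sliding-window logic to a toy sentence. The small values `chunk_size_limit=5` and `overlap=2` are assumptions chosen for readability; the notebook itself uses 600 and 20.

```python
def chunk_text(text, chunk_size_limit=600, overlap=20):
    """Same sliding-window logic as the notebook's chunk_text."""
    words = text.split()
    # Each window starts `step` words after the previous one
    step = chunk_size_limit - overlap
    return [" ".join(words[i:i + chunk_size_limit]) for i in range(0, len(words), step)]

# Toy input: ten words, windows of five words, two words shared between neighbours
toy_text = "one two three four five six seven eight nine ten"
toy_chunks = chunk_text(toy_text, chunk_size_limit=5, overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

Note that the last window can be a short tail that is already fully contained in the previous chunk, so the index may hold a few near-duplicate entries.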

### Creating the question-answering function

In [14]:
# Function for asking questions
def ask_me_anything(question, use_fine_tuned=True):
    # Load the appropriate model
    if use_fine_tuned and os.path.exists("/kaggle/input/base-de-reddit/modele-pre-entraine/modele-pre-entraine"):
        model_path = "/kaggle/input/base-de-reddit/modele-pre-entraine/modele-pre-entraine"
    else:
        model_path = "/kaggle/working/gpt2-medium"
    
    tokenizer = GPT2Tokenizer.from_pretrained(model_path)
    # Set the padding token
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2LMHeadModel.from_pretrained(model_path)
    model.config.pad_token_id = model.config.eos_token_id
    
    # Load the index (it stores a Python list and a NumPy array, so weights_only must be False)
    index_data = torch.load("/kaggle/input/base-de-reddit/index.pt", weights_only=False)
    chunks = index_data["chunks"]
    embeddings = index_data["embeddings"]
    
    # Build an embedding for the question
    inputs = tokenizer(question, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        question_embedding = outputs.hidden_states[-1].mean(dim=1).squeeze().numpy()
    
    # Compute the similarities with the chunks
    similarities = cosine_similarity([question_embedding], embeddings)[0]
    
    # Find the most relevant chunks
    top_idx = np.argsort(similarities)[-3:][::-1]  # The 3 most similar chunks, best first
    
    # Keep the context short so the prompt stays within the length limit
    context_chunks = [chunks[i] for i in top_idx]
    
    # Build the prompt from the single best chunk
    prompt = f"Question: {question}\nContexte: {context_chunks[0]}\nRéponse:"
    inputs = tokenizer(prompt, return_tensors="pt")
    
    # If the prompt is still too long, fall back to the question alone.
    # A full 600-word chunk always exceeds 400 tokens, so context is only kept for shorter chunks.
    if inputs["input_ids"].shape[1] > 400:  # Leaves room for generation
        prompt = f"Question: {question}\nRéponse:"
        inputs = tokenizer(prompt, return_tensors="pt")
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    

    with torch.no_grad():
        output_sequences = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=200,
            temperature=0.7,
            top_k=50,
            top_p=0.95,
            do_sample=True,
            num_return_sequences=1,
        )
    
    response = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    response = response.split("Réponse:")[-1].strip()
    
    display(Markdown(f"You asked: <b>{question}</b>"))
    display(Markdown(f"Bot says: <b>{response}</b>"))
    
    return response
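
The retrieval step in `ask_me_anything` ranks chunks by cosine similarity and keeps the best ones. The sketch below reproduces that ranking on toy 3-dimensional vectors with plain NumPy instead of `sklearn`. The helper name `top_k_chunks` and the toy vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def top_k_chunks(question_embedding, chunk_embeddings, k=3):
    """Return the indices of the k most similar chunks (best first) and all similarities."""
    # Normalise to unit length so the dot product equals the cosine similarity
    q = question_embedding / np.linalg.norm(question_embedding)
    c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    similarities = c @ q
    # argsort is ascending: take the last k entries and reverse them
    return np.argsort(similarities)[-k:][::-1], similarities

# Toy embeddings (assumed values, one row per chunk)
toy_chunk_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
toy_question = np.array([1.0, 0.1, 0.0])

top_idx, sims = top_k_chunks(toy_question, toy_chunk_embeddings, k=2)
print(top_idx)
```

The same slicing idiom, `np.argsort(...)[-k:][::-1]`, is what the notebook uses with `k=3`.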

### Training the model on our data

In [4]:
# 1. Build the index
#index_data = construct_index("/kaggle/input/base-de-reddit/")

# 2. Fine-tune the model
#model, tokenizer = fine_tune_model("/kaggle/input/base-de-reddit", epochs=3)

GPU disponible: Tesla T4 (15.83 GB)


Temps écoulé: 10.85 secondes
1 texte(s) chargé(s) pour l'entraînement

🔄 Epoch 1/3

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.

Epoch 1/3 terminée: Loss = 3.7245, Temps = 0m 1s
Meilleur modèle sauvegardé (loss: 3.7245)

🔄 Epoch 2/3

Epoch 2/3 terminée: Loss = 3.2840, Temps = 0m 0s
Meilleur modèle sauvegardé (loss: 3.2840)

🔄 Epoch 3/3

Epoch 3/3 terminée: Loss = 2.9781, Temps = 0m 0s
Meilleur modèle sauvegardé (loss: 2.9781)
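
One common way to read the losses logged above is as perplexity, the exponential of the loss. The sketch below assumes the logged values are mean per-token cross-entropy in nats, which is what `GPT2LMHeadModel` returns as `outputs.loss`. These are training losses on a single truncated text, so they indicate fit to the training data and not held-out quality.

```python
import math

# Average training loss per epoch, copied from the log above
epoch_losses = [3.7245, 3.2840, 2.9781]

# Perplexity is exp(mean cross-entropy)
perplexities = [math.exp(loss) for loss in epoch_losses]

for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"Epoch {epoch}: loss={loss:.4f}, perplexity={ppl:.1f}")
```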


### Testing the chatbot

In [15]:
ask_me_anything("What do people think about AI ?", use_fine_tuned=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


You asked: <b>What do people think about AI ?</b>

Bot says: <b>We are getting to the point where AI is not only going to be pervasive, it's going to be dominant. I would argue that by 2030, AI will have become a dominant force in our economy.
AI will be the norm and you'll see it everywhere. You'll see people doing things in their home, your car, your home office, your job, your home office. You'll see it everywhere.
AI is going to be ubiquitous and its impact will be felt across all sectors of our economy.
And by 2030, you'll see AI becoming more and more of a driver of economic growth.
The point is that AI is going to be a dominant force in the future.
AI is going to be ubiquitous and its impact will be felt across all sectors of our economy.
If AI is going to be pervasive, it's going to be dominant in all sectors.
If AI is going to be dominant in all sectors, it's going to be dominant in every</b>

"We are getting to the point where AI is not only going to be pervasive, it's going to be dominant. I would argue that by 2030, AI will have become a dominant force in our economy.\nAI will be the norm and you'll see it everywhere. You'll see people doing things in their home, your car, your home office, your job, your home office. You'll see it everywhere.\nAI is going to be ubiquitous and its impact will be felt across all sectors of our economy.\nAnd by 2030, you'll see AI becoming more and more of a driver of economic growth.\nThe point is that AI is going to be a dominant force in the future.\nAI is going to be ubiquitous and its impact will be felt across all sectors of our economy.\nIf AI is going to be pervasive, it's going to be dominant in all sectors.\nIf AI is going to be dominant in all sectors, it's going to be dominant in every"

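The sample answer above repeats several sentences almost word for word. As a rough way to quantify this, the sketch below computes the share of word trigrams that have already appeared earlier in a text. The helper `repeated_ngram_ratio` is hypothetical and not part of the notebook.

```python
from collections import Counter

def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 means no repetition)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first one counts as a repeat
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)

# A sentence taken from the answer above, written twice to mimic the loop
sample = ("AI is going to be ubiquitous and its impact will be felt. "
          "AI is going to be ubiquitous and its impact will be felt.")
print(repeated_ngram_ratio(sample))
print(repeated_ngram_ratio("each word here appears only once"))
```

If repetition turns out to be a problem, `model.generate` accepts `no_repeat_ngram_size` and `repetition_penalty`. Suitable values are not given in the notebook and would need to be tuned.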
In [6]:
#!zip -r /kaggle/working/gpt2-medium-fine-tuned-best.zip /kaggle/working/gpt2-medium-fine-tuned-best

  adding: kaggle/working/gpt2-medium-fine-tuned-best/ (stored 0%)
  adding: kaggle/working/gpt2-medium-fine-tuned-best/vocab.json (deflated 68%)
  adding: kaggle/working/gpt2-medium-fine-tuned-best/special_tokens_map.json (deflated 74%)
  adding: kaggle/working/gpt2-medium-fine-tuned-best/config.json (deflated 53%)
  adding: kaggle/working/gpt2-medium-fine-tuned-best/generation_config.json (deflated 24%)
  adding: kaggle/working/gpt2-medium-fine-tuned-best/merges.txt (deflated 53%)
  adding: kaggle/working/gpt2-medium-fine-tuned-best/model.safetensors (deflated 7%)
  adding: kaggle/working/gpt2-medium-fine-tuned-best/tokenizer_config.json (deflated 56%)


### Creating the Gradio interface

In [13]:
import gradio as gr
import torch
import os

# Define the device if it has not been defined yet
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Function that connects the chatbot to Gradio
def gradio_ask(question, history):
    # ask_me_anything reads its index from this path, so check the same file here
    if not os.path.exists("/kaggle/input/base-de-reddit/index.pt"):
        return "Erreur: Veuillez d'abord exécuter construct_index() pour créer l'index.", history
    
    # Use the fine-tuned model only if it exists at the path ask_me_anything loads it from
    use_fine_tuned = os.path.exists("/kaggle/input/base-de-reddit/modele-pre-entraine/modele-pre-entraine")
    
    # Reuse the existing function
    response = ask_me_anything(question, use_fine_tuned=use_fine_tuned)
    
    # Gradio format: tuple-style history, not the 'messages' type
    history.append((question, response))
    return "", history

# Build the Gradio interface
def create_gradio_interface():
    with gr.Blocks(title="Teknoloogiaa") as demo:
        gr.Markdown("# Teknoloogiaa")
        gr.Markdown("Posez vos questions à Teknoloogiaa sur tout ce qui est en rapport avec l'IA, le machine learning, la datascience, ...")
        
        # Tuple-style history, so the type='messages' parameter is not passed
        chatbot = gr.Chatbot(height=300)
        
        # A row with a text field and a send button
        with gr.Row():
            msg = gr.Textbox(placeholder="Posez votre question ici...", lines=1, scale=4)
            submit_btn = gr.Button("Envoyer", scale=1)
        
        clear = gr.Button("Effacer la conversation")
        
        # Wire the send button to gradio_ask
        submit_btn.click(gradio_ask, [msg, chatbot], [msg, chatbot])
        
        # Pressing Enter in the text field also sends the question
        msg.submit(gradio_ask, [msg, chatbot], [msg, chatbot])
        
        # Reset the conversation
        clear.click(lambda: None, None, chatbot, queue=False)
        
        
    return demo

# Launch the interface:
demo = create_gradio_interface()
demo.launch()



* Running on local URL:  http://127.0.0.1:7860
It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

* Running on public URL: https://e238daadaf84a1a7e8.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
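
`gradio_ask` calls `ask_me_anything`, which reloads the tokenizer, the model, and the index for every message. One possible way to avoid this is to cache the loading step, sketched below with `functools.lru_cache`. The helper `load_resources` is hypothetical, and a plain dict stands in for the real `from_pretrained` calls so that the example runs without downloading a model.

```python
from functools import lru_cache

load_calls = []  # Records each time the expensive loader actually runs

@lru_cache(maxsize=2)
def load_resources(model_path):
    """Hypothetical helper: load once per model_path, then reuse the cached result."""
    load_calls.append(model_path)
    # In the notebook, GPT2Tokenizer.from_pretrained(model_path) and
    # GPT2LMHeadModel.from_pretrained(model_path) would go here.
    return {"model_path": model_path}

first = load_resources("/kaggle/working/gpt2-medium")
second = load_resources("/kaggle/working/gpt2-medium")

# The second call is served from the cache, so the loader ran only once
print(first is second, len(load_calls))
```

With this pattern, only the first question pays the loading cost; later questions reuse the objects already in memory.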


