<h1 style="text-align: center; font-weight: bold; color:rgb(255, 255, 255);">Final Project: Building a Daily Life Assistant</h1>

<p style="font-size: 25px; line-height: 1.6; text-align: justify; max-width: 1200px; margin: 0 auto; margin-bottom: 20px;">
    This project aims to create a practical AI assistant using an instruction fine-tuned GPT-2 model. 
    The assistant will perform daily tasks such as scheduling, answering questions, and providing personalized recommendations.
</p>

<ul style="font-size: 20px; line-height: 1.8; max-width: 1000px; margin: 0 auto;">
    <li><strong>Model Architecture & Pretraining</strong>: Understanding GPT-2’s architecture and pretraining process.</li>
    <li><strong>Instruction Fine-Tuning</strong>: Training the model with instruction-response pairs for enhanced task performance.</li>
    <li><strong>Evaluation & Refinement</strong>: Assessing the model's output and iterating for better results.</li>
    <li><strong>Practical Application</strong>: Implementing the model in real-world scenarios such as daily task management.</li>
</ul>

# **Environment Setup**

We start by installing the required packages and importing the libraries.

In [1]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'



In [2]:
import os
from dotenv import load_dotenv

import torch
import tiktoken
from transformers import GPT2Tokenizer, GPT2Model

from gpt_download import download_gpt2



In [3]:
device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu"
)

## **Loading the Pretrained GPT-2 Model**

In [4]:
load_dotenv()

MODEL_DIR = os.getenv("MODEL_DIR")
MODEL_SIZE = os.getenv("MODEL_SIZE")

In [5]:
model, tokenizer = download_gpt2(MODEL_DIR,MODEL_SIZE)

Downloading the gpt2-large model into model...
Model gpt2-large downloaded and saved in model.


## **Testing the Model**

In [6]:
prompt = "What is the capital of Spain ?"

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids
attention_mask = inputs.attention_mask

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)

In [7]:
output = model.generate(
    input_ids = input_ids,
    attention_mask = attention_mask,
    pad_token_id = tokenizer.pad_token_id,
    max_length = input_ids.shape[1] + 100,
    num_beams = 5,
    temperature = 1,
    top_k = 50,
    do_sample = True
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

generated_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'What is the capital of Spain ?\n\nThe capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid,'
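
The repeated clause in the output above is a common failure mode when decoding runs without any repetition constraint. A minimal sketch of generation settings that may reduce it, assuming the `model`, `tokenizer`, `input_ids`, and `attention_mask` objects from the cells above. The values chosen for `no_repeat_ngram_size` and `repetition_penalty` are illustrative assumptions, not tuned results:

```python
# Illustrative generation settings; the numeric values are assumptions, not tuned.
generation_kwargs = {
    "max_new_tokens": 100,       # counts only generated tokens, not the prompt
    "num_beams": 5,
    "do_sample": True,
    "temperature": 1.0,
    "top_k": 50,
    "no_repeat_ngram_size": 3,   # forbid any 3-gram from appearing twice
    "repetition_penalty": 1.2,   # penalize tokens that were already generated
}

# With the objects defined in the cells above, the call would look like:
# output = model.generate(
#     input_ids=input_ids,
#     attention_mask=attention_mask,
#     pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
#     **generation_kwargs,
# )
print(generation_kwargs)
```

Passing `pad_token_id=tokenizer.eos_token_id` explicitly also avoids the `Setting pad_token_id to eos_token_id` notice.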

### **Handling Incomplete Responses from GPT-2**

GPT-2 sometimes stops mid-sentence, typically because generation reaches the `max_length` limit before the model emits an end-of-sequence token. A truncated answer degrades the user experience. There are several ways to address this issue:

1. **Fine-tuning with examples of complete responses**: Train the model on a dataset that includes well-structured and complete answers to improve its behavior.

2. **Post-processing generated responses**: Implement logic to analyze the output and request the model to continue if a response is detected as incomplete.

3. **Adjusting generation parameters**: Modify parameters such as `max_length`, `temperature`, `top_k`, or `top_p` to increase the likelihood of producing complete and coherent answers.

4. **Adding a contextual prefix**: Use a prompt like _"Please provide a detailed and complete answer:"_ before the main query to guide the model towards better responses.

5. **Automatic verification with a script**: Create a script to detect incomplete responses and prompt the model to continue if necessary.


We take the script-based approach: the function in the next cell drops any trailing fragment that does not end with sentence-final punctuation. During fine-tuning for the daily assistant, we will also make sure the training data contains only complete responses.
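
The detection step behind options 2 and 5 can be sketched as a small check on the end of the generated text. This is a minimal sketch, and `is_incomplete` is a hypothetical helper name that does not appear elsewhere in this notebook:

```python
import re

def is_incomplete(text: str) -> bool:
    """Return True if the text does not end with sentence-final punctuation."""
    stripped = text.rstrip()
    # Allow closing quotes or brackets after the final punctuation mark.
    return re.search(r'[.!?]["\')\]]*$', stripped) is None

print(is_incomplete("The capital of Spain is Madrid,"))  # True
print(is_incomplete("The capital of Spain is Madrid."))  # False
```

A caller could use this flag either to request a continuation from the model or to trim the fragment.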


In [8]:
import re

def clean_incomplete_sentences(text):
    """
    Split the input text into sentences, preserving the whitespace between
    them (newlines, spaces), and drop a trailing fragment that does not
    end with a sentence-final punctuation mark (., ! or ?).
    """
    # The capturing group keeps the whitespace separators, so the result
    # alternates: [sentence, whitespace, sentence, whitespace, ..., last part]
    parts = re.split(r'(?<=[.!?])(\s+)', text)

    cleaned_text = ""
    for i in range(0, len(parts) - 1, 2):  # Walk (sentence, whitespace) pairs
        sentence = parts[i]
        trailing_space = parts[i + 1]
        if re.search(r'[.!?]$', sentence):  # Keep sentences that end with valid punctuation
            cleaned_text += sentence + trailing_space

    # Keep the last part only if it is a complete sentence
    if re.search(r'[.!?]$', parts[-1]):
        cleaned_text += parts[-1]

    return cleaned_text

In [9]:
text_complete_sentences = clean_incomplete_sentences(generated_text)

print("Original Text:\n")
print(generated_text)
print("\nCleaned Text:\n")
print(text_complete_sentences)


Original Text:

What is the capital of Spain ?

The capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid, the capital of Spain is Madrid,

Cleaned Text:

What is the capital of Spain ?




### **Handling Redundant Sentences in GPT-Generated Text**

GPT-2 often generates sentences that are identical or nearly identical, which pads the answer without adding information. To address this issue, we implement a script that detects and removes duplicate or near-duplicate sentences in four steps:

1. **Sentence Splitting**: 
   - The text will be divided into individual sentences using a delimiter (e.g., `.`, `!`, `?`).

2. **Similarity Detection**:
   - Each sentence is compared against the sentences already kept using a similarity metric. The implementation here uses cosine similarity on TF-IDF vectors; Levenshtein distance or cosine similarity on learned embeddings are alternatives.

3. **Duplicate Removal**:
   - Sentences identified as duplicates or with high similarity scores will be removed, leaving only unique information.

4. **Reconstruction**:
   - The remaining unique sentences are joined, one per line, into the cleaned text.
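
For comparison, the same four steps can be sketched with an edit-based metric using only the standard library. Note that `difflib.SequenceMatcher.ratio()` is not Levenshtein distance; it is a related character-matching similarity, swapped in here because it needs no extra dependency. The function name `remove_near_duplicates` and the sample text are hypothetical:

```python
import re
from difflib import SequenceMatcher

def remove_near_duplicates(text: str, threshold: float = 0.8) -> str:
    """Keep the first occurrence of each sentence; drop later near-copies."""
    # Step 1: split after sentence-final punctuation, skipping empty pieces
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s]
    kept = []
    for sentence in sentences:
        # Steps 2 and 3: ratio() is 1.0 for identical strings and lower
        # for dissimilar ones; keep the sentence only if it is below the
        # threshold against every sentence kept so far
        if all(SequenceMatcher(None, sentence.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(sentence)
    # Step 4: rebuild the text from the unique sentences
    return " ".join(kept)

sample = ("Madrid is the capital of Spain. Madrid is the capital of Spain! "
          "It lies on the Manzanares.")
print(remove_near_duplicates(sample))
# Madrid is the capital of Spain. It lies on the Manzanares.
```

Character-level matching catches copies that differ only in punctuation or a word or two, while TF-IDF cosine similarity compares word usage and ignores word order.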

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def remove_redundant_sentences(text, similarity_threshold=0.8):
    """
    Remove redundant or highly similar sentences from the given text.
    The first occurrence of each sentence is kept, and the kept sentences
    are rejoined with newlines.

    Parameters:
        text (str): The input text containing potentially redundant sentences.
        similarity_threshold (float): The cosine similarity threshold at or above
                                       which sentences are considered redundant.

    Returns:
        str: Text with redundant sentences removed.
    """
    # TfidfVectorizer raises ValueError on an empty vocabulary,
    # so return empty or whitespace-only input unchanged
    if not text.strip():
        return text

    # Split the text into sentences after sentence-final punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text)

    # Vectorize the sentences using TF-IDF (one row per sentence)
    tfidf_matrix = TfidfVectorizer().fit_transform(sentences)

    # Compute cosine similarity between all sentence pairs
    similarity_matrix = cosine_similarity(tfidf_matrix)

    # Collect the indices of the sentences to keep
    sentences_to_keep = []
    for i, sentence in enumerate(sentences):
        # Keep the sentence only if it is not similar to any previously kept sentence
        if all(similarity_matrix[i, j] < similarity_threshold for j in sentences_to_keep):
            sentences_to_keep.append(i)

    # Reconstruct the text with only unique sentences
    unique_sentences = [sentences[i] for i in sentences_to_keep]
    return '\n'.join(unique_sentences)


In [11]:
cleaned_text = remove_redundant_sentences(text_complete_sentences)

print("Original Text:\n")
print(text_complete_sentences)
print("\nCleaned Text:\n")
print(cleaned_text)

Original Text:

What is the capital of Spain ?



Cleaned Text:

What is the capital of Spain ?



## **Fine-Tuning the Model**

In [12]:
import json
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import Dataset




In [13]:
# Step 1: Load the dataset
with open('prompt.json', 'r', encoding='utf8') as f:
    data = json.load(f)

In [14]:
# Step 2: Preprocess the dataset
def preprocess(example):
    # Combine instruction, input, and output into a single text prompt
    prompt = f"Instruction: {example['instruction']}\n"
    if example['input']:
        prompt += f"Input: {example['input']}\n"
    prompt += f"Output: {example['output']}"
    return {"text": prompt}
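
To make the training format concrete, the sketch below applies the same template to a hypothetical record (the record contents are invented for illustration; the keys `instruction`, `input`, and `output` are the ones `preprocess` reads). It also shows a hypothetical helper, `build_inference_prompt`, that builds the matching prompt at inference time so the model is asked to continue after `Output:`:

```python
def preprocess(example):
    # Same template as the cell above
    prompt = f"Instruction: {example['instruction']}\n"
    if example['input']:
        prompt += f"Input: {example['input']}\n"
    prompt += f"Output: {example['output']}"
    return {"text": prompt}

def build_inference_prompt(instruction, user_input=""):
    # Hypothetical helper: the training template without the answer
    prompt = f"Instruction: {instruction}\n"
    if user_input:
        prompt += f"Input: {user_input}\n"
    return prompt + "Output:"

# Hypothetical record, not taken from prompt.json
record = {
    "instruction": "Summarize the note.",
    "input": "Buy milk, call Sam, book dentist.",
    "output": "Three tasks: shopping, a call, and an appointment.",
}

print(preprocess(record)["text"])
print("---")
print(build_inference_prompt(record["instruction"], record["input"]))
```

Keeping the inference prompt identical to the training prefix matters because the fine-tuned model only learns to answer in the context it saw during training.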

In [15]:
from sklearn.model_selection import train_test_split

# Step 3: Split the dataset into training and evaluation sets
train_data, eval_data = train_test_split(data, test_size=0.1, random_state=42)

# Convert train and eval data into Hugging Face Datasets
train_dataset = Dataset.from_list(train_data).map(preprocess)
eval_dataset = Dataset.from_list(eval_data).map(preprocess)

Map: 100%|██████████| 137/137 [00:00<00:00, 6160.69 examples/s]
Map: 100%|██████████| 16/16 [00:00<00:00, 4005.30 examples/s]


In [16]:
# Step 4: Tokenize the dataset
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 does not have a pad token

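def tokenize(example):
    tokens = tokenizer(example["text"], padding="max_length", truncation=True, max_length=512)
    # The Trainer needs labels to compute the language-modeling loss.
    # Copy the input IDs and set padded positions to -100 so the loss ignores them.
    tokens["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens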

train_dataset = train_dataset.map(tokenize, batched=True)
eval_dataset = eval_dataset.map(tokenize, batched=True)

Map: 100%|██████████| 137/137 [00:00<00:00, 1321.69 examples/s]
Map: 100%|██████████| 16/16 [00:00<00:00, 993.51 examples/s]


In [17]:
# Step 5: Set up training arguments
training_args = TrainingArguments(
    output_dir="./model_finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    fp16=torch.cuda.is_available(),  # Mixed precision requires a CUDA GPU
)
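
It is worth checking the step intervals against the size of this dataset. The arithmetic below is a rough sketch that assumes a single device, no gradient accumulation, and the 137 training examples reported by the `Map` progress output:

```python
import math

train_examples = 137   # from the Map progress output above
batch_size = 8         # per_device_train_batch_size
epochs = 3             # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 18 54

intervals = {"logging_steps": 100, "eval_steps": 500, "save_steps": 500}
for name, interval in intervals.items():
    status = "fires" if interval <= total_steps else "never fires"
    print(f"{name}={interval}: {status}")
```

Under these assumptions the run lasts 54 optimizer steps, so none of the three intervals is reached during training. Smaller values, or epoch-based strategies, would be needed to see intermediate logs, evaluations, or checkpoints.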



In [18]:
# Step 6: Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)



In [19]:
# Step 7: Train the model

# trainer.train()

In [20]:
# Step 8: Save the fine-tuned model

# model.save_pretrained("./model_finetuned")
# tokenizer.save_pretrained("./model_finetuned")