# Fine-Tuning Process

We'll break down the process into the following sections:

1. Collect and Prepare Domain-Specific Training Data
2. Fine-Tune a Base Language Model (LLM) for Your Specific Use Case
3. Implement Evaluation Metrics for Model Performance
4. Document the Fine-Tuning Process and Results


## 1. Collect and Prepare Domain-Specific Training Data
Before fine-tuning, it's essential to prepare high-quality, domain-specific training data. This involves organizing your prompt-completion pairs and splitting them into training and validation sets to evaluate model performance effectively.

### a. Organize Prompt-Completion Data
- Assuming you've already generated output.jsonl containing your prompt-completion pairs, ensure the data is correctly formatted for fine-tuning.
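
Before loading the file, it can help to confirm that every line really is a chat-format record. The following is a minimal sketch, assuming each line is a JSON object with a `messages` list whose entries have a `role` of `system`, `user`, or `assistant` and a string `content`. The helper names `validate_record` and `validate_jsonl_lines` are illustrative and not part of any library.

```python
import json

# Roles assumed to appear in simple chat-format training data
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one chat-format record."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} has non-string content")

    # A training example needs at least one assistant turn to learn from
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems


def validate_jsonl_lines(lines):
    """Map 1-based line numbers to their problems; valid lines are omitted."""
    report = {}
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[line_number] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_record(record)
        if problems:
            report[line_number] = problems
    return report
```

To check the real file, pass an open file handle, for example `validate_jsonl_lines(open("output.jsonl", encoding="utf-8"))`. An empty result means no problems were detected by these particular checks.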

In [47]:
import pandas as pd
import json
import random

# Load the JSONL file
jsonl_file = "output.jsonl"
with open(jsonl_file, "r") as file:
    data = [json.loads(line) for line in file]

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(data)

# Display the first few entries
df.head()


Unnamed: 0,messages
0,"[{'role': 'system', 'content': 'You are a help..."
1,"[{'role': 'system', 'content': 'You are a help..."
2,"[{'role': 'system', 'content': 'You are a help..."
3,"[{'role': 'system', 'content': 'You are a help..."
4,"[{'role': 'system', 'content': 'You are a help..."


### b. Split Data into Training and Validation Sets
- It's crucial to have separate datasets for training and validation to monitor the model's performance and prevent overfitting.

In [50]:
import pandas as pd
import json
from sklearn.model_selection import train_test_split

# Assuming `df` is already defined
# Define the split ratio
train_ratio = 0.8  # 80% for training, 20% for validation

# Perform the split
train_df, val_df = train_test_split(df, train_size=train_ratio, random_state=42)

# Save the split data into separate JSONL files
train_file = "train.jsonl"
val_file = "validation.jsonl"

def save_to_jsonl(data, filename):
    with open(filename, "w") as f:
        for _, row in data.iterrows():
            json.dump(row.to_dict(), f)  # Convert the Series to a dictionary
            f.write("\n")

save_to_jsonl(train_df, train_file)
save_to_jsonl(val_df, val_file)

print(f"Training data saved to {train_file}")
print(f"Validation data saved to {val_file}")


Training data saved to train.jsonl
Validation data saved to validation.jsonl


### c. Verify the Data Split 
- Ensure that both training and validation files are correctly created.

In [52]:
# Check the number of entries
print(f"Number of training samples: {len(train_df)}")
print(f"Number of validation samples: {len(val_df)}")


Number of training samples: 260
Number of validation samples: 65
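
Counting rows confirms the split sizes, but not that the two sets are disjoint. The sketch below checks whether any user prompt appears in both sets, which would make the validation results look better than they are. It assumes the chat format shown earlier. The helpers `read_jsonl`, `user_prompt_key`, and `find_prompt_overlap` are hypothetical names introduced for illustration.

```python
import json


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def user_prompt_key(record):
    """Join the user turns of one record into a case- and whitespace-normalized key."""
    parts = [
        m["content"]
        for m in record.get("messages", [])
        if m.get("role") == "user"
    ]
    return " ".join(" ".join(parts).lower().split())


def find_prompt_overlap(train_records, val_records):
    """Return the sorted prompt keys that occur in both record lists."""
    train_keys = {user_prompt_key(r) for r in train_records}
    val_keys = {user_prompt_key(r) for r in val_records}
    return sorted(train_keys & val_keys)


# Example usage against the files written above:
# overlap = find_prompt_overlap(read_jsonl("train.jsonl"), read_jsonl("validation.jsonl"))
# print(f"{len(overlap)} prompts appear in both sets")
```

Matching on normalized prompts is a design choice for this sketch. Near-duplicates with different wording would not be caught.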


## 2. Fine-Tune a Base Language Model (LLM) for Your Specific Use Case
Fine-tuning a base LLM involves training it on your domain-specific data to adapt its responses to your specific needs.

### a. Set Up OpenAI API for Fine-Tuning
- Ensure your OpenAI API key is set correctly. 

In [57]:
import os
import openai

# Read the OpenAI API key from the environment rather than hard-coding it
openai_api_key = os.getenv("OPENAI_API_KEY")
openai.api_key = openai_api_key


In [58]:
import os
from dotenv import load_dotenv

# Load variables from .env file into environment
load_dotenv()

# Retrieve the variables
openai_api_key = os.getenv("OPENAI_API_KEY")

### b. Upload Training and Validation Data to OpenAI
- OpenAI requires the training data to be uploaded before initiating the fine-tuning process.

In [62]:
from openai import OpenAI

def upload_file(file_path, purpose):
    """
    Uploads a file to OpenAI for fine-tuning.
    
    Args:
        file_path (str): Path to the JSONL file.
        purpose (str): Purpose of the file (e.g., "fine-tune").
    
    Returns:
        str: File ID assigned by OpenAI.
    """
    client = OpenAI()
    try:
        with open(file_path, "rb") as file:
            response = client.files.create(
                file=file,
                purpose=purpose
            )
        print(f"Uploaded {file_path}: ID {response.id}")
        return response.id
    except Exception as e:
        print(f"Error uploading {file_path}: {e}")
        return None

# Upload training and validation files
train_file_id = upload_file(train_file, "fine-tune")
val_file_id = upload_file(val_file, "fine-tune")

Uploaded train.jsonl: ID file-S42Er76HPZSRf4ij9JLT2L
Uploaded validation.jsonl: ID file-S1JtGdAe5X9tyFRQSdN9tC


### c. Initiate the Fine-Tuning Process
- Start fine-tuning using the uploaded training data. You can specify the base model and hyperparameters such as the number of epochs. `fine_tune_model` sets only those two and leaves every other hyperparameter at the API default.
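
If you want to set more than the epoch count, one option is to build the `hyperparameters` dictionary separately and pass only the values you actually chose. This sketch assumes the key names `n_epochs`, `batch_size`, and `learning_rate_multiplier`, each accepting a positive number or the string `"auto"`, as documented for supervised fine-tuning jobs at the time of writing. Check the current API reference before relying on them. `build_hyperparameters` is a hypothetical helper, not an OpenAI function.

```python
def build_hyperparameters(n_epochs=None, batch_size=None, learning_rate_multiplier=None):
    """Build a hyperparameters dict, leaving unset values to the API defaults."""
    candidates = {
        "n_epochs": n_epochs,
        "batch_size": batch_size,
        "learning_rate_multiplier": learning_rate_multiplier,
    }

    hyperparameters = {}
    for name, value in candidates.items():
        if value is None:
            continue  # omit the key so the service picks its default
        if isinstance(value, str):
            if value != "auto":
                raise ValueError(f"{name} must be a positive number or 'auto', got {value!r}")
        elif value <= 0:
            raise ValueError(f"{name} must be a positive number or 'auto', got {value!r}")
        hyperparameters[name] = value
    return hyperparameters


# Example: three epochs with a smaller learning-rate multiplier
print(build_hyperparameters(n_epochs=3, learning_rate_multiplier=0.5))
```

The resulting dictionary could be passed as the `hyperparameters` argument of the job-creation call in place of the hard-coded `{"n_epochs": epochs}`.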

In [64]:
def fine_tune_model(training_file_id, validation_file_id, model="gpt-4o-mini-2024-07-18", epochs=4):
    """
    Initiates the fine-tuning process.
    
    Args:
        training_file_id (str): File ID for the training data.
        validation_file_id (str): File ID for the validation data.
        model (str): Base model to fine-tune (e.g., "gpt-4o-mini-2024-07-18").
        epochs (int): Number of training epochs.
    
    Returns:
        FineTuningJob | None: The created job object, or None if the request failed.
    """
    client = OpenAI()
    try:
        response = client.fine_tuning.jobs.create(
            training_file=training_file_id,
            validation_file=validation_file_id,
            model=model,
            hyperparameters={
                "n_epochs": epochs
            }
        )
        print(f"Fine-tuning job created: {response.id}")
        return response
    except Exception as e:
        print(f"Error initiating fine-tuning: {e}")
        return None

# Start fine-tuning
fine_tune_response = fine_tune_model(train_file_id, val_file_id)
fine_tune_job_id = fine_tune_response.id if fine_tune_response else None

Fine-tuning job created: ftjob-Q0DuBG1ly5K2d2B8hCZnQXC3


### d. Monitor the Fine-Tuning Job
- Fine-tuning can take some time. Monitor the job's status to know when it's complete.

In [68]:
from openai import OpenAI
import time

def get_fine_tune_status(job_id):
    """
    Retrieves the status and details of a fine-tuning job.
    
    Args:
        job_id (str): The fine-tuning job ID.
    
    Returns:
        FineTuningJob | None: The job object with its current status, or None on error.
    """
    client = OpenAI()
    try:
        response = client.fine_tuning.jobs.retrieve(job_id)
        return response
    except Exception as e:
        print(f"Error retrieving job status: {e}")
        return None

# Polling the job status
if fine_tune_job_id:
    while True:
        status_response = get_fine_tune_status(fine_tune_job_id)
        if status_response:
            status = status_response.status
            print(f"Fine-tuning job status: {status}")
            
            # Print out more detailed error information if the job failed
            if status == "failed":
                print("\nDetailed Error Information:")
                print(f"Error Message: {status_response.error}")
                
                # Additional error details if available
                if hasattr(status_response, 'validation_file'):
                    print(f"Validation File: {status_response.validation_file}")
                
                break
            
            # Stop on any terminal state; a cancelled job would otherwise be polled forever
            if status in ["succeeded", "failed", "cancelled"]:
                break
        else:
            print("Failed to retrieve job status.")
            break
        time.sleep(60)  # Wait for 1 minute before checking again

Fine-tuning job status: validating_files
Fine-tuning job status: validating_files
Fine-tuning job status: queued
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: running
Fine-tuning job status: succeeded
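
The loop above checks once a minute with no upper bound on the total wait. As an alternative, here is a minimal sketch of polling with exponential backoff and a timeout. It assumes that `succeeded`, `failed`, and `cancelled` are the terminal statuses. `wait_for_job` and its parameters are illustrative names. The status source is passed in as a function, so the sketch runs without network access.

```python
import time

# Statuses after which the job will not change again (assumed set)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, initial_delay=30.0, max_delay=300.0, timeout=6 * 3600,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status or the timeout passes.

    fetch_status: zero-argument callable returning the current status string.
    sleep/clock:  injectable for testing; default to the real time functions.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() + delay > deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off, but never beyond max_delay


# Example usage with the helper defined earlier in this notebook:
# final = wait_for_job(lambda: get_fine_tune_status(fine_tune_job_id).status)
```

Backing off reduces the number of API calls for long jobs, and the timeout guarantees the cell eventually returns.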


### e. Retrieve the Fine-Tuned Model
- Once the fine-tuning job is complete, retrieve the fine-tuned model's name for usage.

In [76]:
if status_response and status_response.status == "succeeded":
    if hasattr(status_response, 'fine_tuned_model'):
        fine_tuned_model = status_response.fine_tuned_model
        print(f"Fine-tuned model available: {fine_tuned_model}")
    else:
        print("Fine-tuned model information not available.")
else:
    print("Fine-tuning was not successful.")

Fine-tuned model available: ft:gpt-4o-mini-2024-07-18:personal::Ae9mdwfc


## 3. Implement Evaluation Metrics for Model Performance
After fine-tuning, it's vital to evaluate the model's performance using various metrics to ensure it meets your requirements.

### a. Define Evaluation Metrics
Common evaluation metrics for language models include:

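- **Perplexity**: The exponential of the average negative log-likelihood per token; lower values mean the model predicts the text better.
- **Accuracy**: Percentage of responses that exactly match the reference.
- **BLEU Score**: Precision-oriented n-gram overlap between generated text and reference text, with a penalty for outputs that are too short.
- **ROUGE Score**: Recall-oriented overlap of n-grams between generated and reference texts.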
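
To make these definitions concrete, the sketch below implements three of them with the standard library only. It assumes that token log-probabilities are natural logarithms, that accuracy means a case-insensitive exact match, and that ROUGE is reduced to its simplest variant, ROUGE-1 recall over whitespace tokens. These helpers are illustrative and are not the functions used later in this notebook.

```python
import math
from collections import Counter


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood, given natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference, ignoring case and outer whitespace."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in pairs)
    return hits / len(pairs)


def rouge1_recall(reference, candidate):
    """Share of reference unigrams that also appear in the candidate (counts clipped)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    return overlap / total


# A model that assigns probability 0.25 to every token has perplexity of about 4
print(perplexity([math.log(0.25)] * 4))
print(rouge1_recall("the cat sat", "the cat ran"))
```

BLEU is precision-oriented (how much of the output is supported by the reference), while ROUGE recall asks how much of the reference the output covers, so the two can disagree on the same pair of texts.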

In [79]:
from nltk.translate.bleu_score import sentence_bleu
import nltk

# More robust NLTK resource download
def download_nltk_resources():
    try:
        # Attempt to download multiple resources
        resources = ['punkt', 'punkt_tab']
        for resource in resources:
            try:
                nltk.download(resource, quiet=True)
            except Exception as e:
                print(f"Could not download {resource}: {e}")
    except Exception as e:
        print(f"Error in NLTK resource download: {e}")

# Call this at the start of your script
download_nltk_resources()

### b. Prepare Validation Data for Evaluation
Use the validation set to generate model responses and compare them with the reference completions.

## Model Evaluation with OpenAI GPT and Custom Metrics

### Overview
This script evaluates a fine-tuned OpenAI model on a validation dataset. It reports two metrics: exact-match **accuracy** and a simplified, BLEU-like score (unigram precision scaled by a length ratio) that compares each generated completion with the expected one. The code reads a `.jsonl` validation file, generates a completion for each sample, and averages both metrics over the evaluated samples.

---

### Code Explanation

1. **Data Loading**  
   - Loads validation data from a `.jsonl` file into a Pandas DataFrame, ensuring error handling for file operations.

2. **OpenAI Model Integration**  
   - Calls the OpenAI Chat Completions API to generate a completion from the system and user messages of each sample.

3. **Custom BLEU Score Calculation**  
   - Implements a simplified BLEU-like score (unigram precision multiplied by a length-ratio penalty) to measure the overlap between generated completions and expected outputs. Falls back to whitespace tokenization if NLTK tokenization fails.

4. **Evaluation Process**  
   - Processes each validation sample to compute:
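     - **Accuracy**: Exact match between generated and expected completions, after lowercasing and stripping whitespace.
     - **Custom BLEU Score**: Unigram precision of the generated tokens against the expected tokens, multiplied by a length-ratio penalty for short outputs.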

5. **Error Handling**  
   - Includes comprehensive error handling for missing data, incorrect formats, and API issues to ensure smooth evaluation.

6. **Metrics Calculation**  
   - Computes and reports average accuracy and BLEU scores across all samples.

7. **Main Function**  
   - Orchestrates the loading of data, validation checks, and evaluation process, ensuring that the required environment variables and fine-tuned model are correctly configured.
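
The custom score described in step 3 differs from standard BLEU in two ways: matches are not clipped to the number of times a token occurs in the reference, and the length penalty is linear rather than exponential. For comparison, here is a minimal sketch of a unigram-only BLEU with both standard ingredients. It assumes whitespace tokenization and a single reference. The function names are illustrative, and this is not the implementation used in the cell below.

```python
import math
from collections import Counter


def clipped_unigram_precision(ref_tokens, cand_tokens):
    """Modified precision: each candidate token counts at most as often as it occurs in the reference."""
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)
    clipped = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return clipped / len(cand_tokens)


def brevity_penalty(ref_len, cand_len):
    """Standard BLEU brevity penalty: 1 for long-enough candidates, exp(1 - r/c) otherwise."""
    if cand_len == 0:
        return 0.0
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)


def bleu1(reference, candidate):
    """Unigram BLEU for one reference/candidate pair of strings."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    precision = clipped_unigram_precision(ref_tokens, cand_tokens)
    return brevity_penalty(len(ref_tokens), len(cand_tokens)) * precision


# Clipping matters for repetitive output: without it, this candidate would score 1.0
print(clipped_unigram_precision("the cat".split(), "the the the".split()))
```

Full BLEU also combines higher-order n-gram precisions through a geometric mean. This sketch stops at unigrams to stay comparable with the custom score.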

---

### Key Features
- Robust data handling for `.jsonl` format.
- Custom BLEU score implementation with fallback tokenization.
- Dynamic and structured prompt handling for OpenAI API.
- Comprehensive evaluation metrics for model performance.

---

### Purpose
This script is designed for evaluating fine-tuned language models, providing insights into their ability to generate accurate and contextually relevant completions. It is highly customizable and supports integration with OpenAI's GPT models for various NLP tasks.


In [81]:
import openai
import nltk
import pandas as pd
import json
import os
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize

# Ensure NLTK resources are downloaded
try:
    nltk.download('punkt', quiet=True)
except Exception as e:
    print(f"Error downloading NLTK resources: {e}")

# Function to load validation data from a .jsonl file
def load_validation_data(file_path):
    try:
        data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                data.append(json.loads(line))
        return pd.DataFrame(data)
    except Exception as e:
        print(f"Error loading validation data: {e}")
        return pd.DataFrame()

# Function to generate model completion using ChatCompletion API
def generate_completion(messages, model, max_tokens=100):
    try:
        response = openai.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=0.7,
            n=1,
            stop=None
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating completion: {e}")
        return ""

# Custom BLEU score calculation function
def calculate_custom_bleu(reference, candidate):
    """
    Calculate a simplified BLEU-like score with fallback tokenization
    """
    # Fallback tokenization method if NLTK fails
    def simple_tokenize(text):
        return text.lower().split()
    
    try:
        # First try NLTK tokenization
        try:
            ref_tokens = word_tokenize(reference.lower())
            cand_tokens = word_tokenize(candidate.lower())
        except Exception:
            # Fallback to simple tokenization if NLTK fails
            ref_tokens = simple_tokenize(reference)
            cand_tokens = simple_tokenize(candidate)
        
        # If either is empty, return 0
        if not ref_tokens or not cand_tokens:
            return 0.0
        
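        # Unigram precision: share of candidate tokens that occur in the reference.
        # Counts are not clipped, so a repeated token counts as a match every time
        # (standard BLEU clips matches to the token's count in the reference).
        matches = sum(1 for token in cand_tokens if token in ref_tokens)
        precision = matches / len(cand_tokens)
        
        # Length-ratio penalty for short candidates (a linear stand-in for BLEU's
        # exponential brevity penalty); ref_tokens is known to be non-empty here
        brevity_penalty = min(1, len(cand_tokens) / len(ref_tokens))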
        
        # Combine precision and brevity penalty
        bleu_score = precision * brevity_penalty
        
        return bleu_score
    except Exception as e:
        print(f"BLEU score calculation error: {e}")
        return 0.0

# Function to compute evaluation metrics with enhanced error handling
def evaluate_model(model, validation_data):
    print("\nStarting model evaluation...\n")
    
    # Validate input data
    if validation_data.empty:
        print("Error: Validation data is empty.")
        return {"average_accuracy": 0, "average_bleu_score": 0}
    
    # Ensure the 'messages' column exists
    if 'messages' not in validation_data.columns:
        raise ValueError(f"Expected 'messages' column in validation data. Found columns: {validation_data.columns}")
    
    accuracies = []
    bleu_scores = []
    
    total_samples = len(validation_data)
    print(f"Total samples to evaluate: {total_samples}\n")
    
    for index, row in validation_data.iterrows():
        messages = row['messages']
        
        if not isinstance(messages, list):
            print(f"Row {index} has invalid 'messages' format. Skipping.")
            continue
        
        # Extract the user prompt and the expected assistant completion
        user_prompt = ""
        expected_completion = ""
        system_prompt = ""
        
        for message in messages:
            if message['role'] == 'system':
                system_prompt = message['content']
            elif message['role'] == 'user':
                user_prompt += message['content'] + "\n"
            elif message['role'] == 'assistant':
                expected_completion += message['content'] + "\n"
        
        user_prompt = user_prompt.strip()
        expected_completion = expected_completion.strip().lower()
        
        if not user_prompt or not expected_completion:
            print(f"Row {index} is missing user prompt or expected completion. Skipping.")
            continue
        
        # Prepare messages for the model
        input_messages = []
        if system_prompt:
            input_messages.append({"role": "system", "content": system_prompt})
        input_messages.append({"role": "user", "content": user_prompt})
        
        # Generate completion
        generated = generate_completion(input_messages, model).lower().strip()
        
        # Debug: Print the generated and expected completions
        print(f"\nSample {index + 1}/{total_samples}:")
        print(f"User Prompt: {user_prompt}")
        print(f"Expected Completion: {expected_completion}")
        print(f"Generated Completion: {generated}")
        
        # Compute Accuracy (exact match)
        accuracy = int(generated == expected_completion)
        accuracies.append(accuracy)
        
        # Compute Custom BLEU Score
        try:
            bleu_score = calculate_custom_bleu(expected_completion, generated)
            bleu_scores.append(bleu_score)
        except Exception as bleu_error:
            print(f"BLEU score calculation error for sample {index}: {bleu_error}")
            bleu_scores.append(0)
        
        # Optional: Print progress every 10 samples
        if (index + 1) % 10 == 0:
            print(f"\nProcessed {index + 1} samples out of {total_samples}.\n")
    
    # Calculate average metrics with error handling
    avg_accuracy = sum(accuracies) / len(accuracies) if accuracies else 0
    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
    
    return {
        "average_accuracy": avg_accuracy,
        "average_bleu_score": avg_bleu
    }

def main():
    # Load validation data
    val_df = load_validation_data('validation.jsonl')  # Replace with your actual file path
    
    if val_df.empty:
        print("Failed to load validation data. Exiting.")
        return
    
    print("Validation Data Loaded Successfully.")
    print(f"Number of samples: {len(val_df)}")
    
    # Set OpenAI API key
    openai.api_key = os.getenv('OPENAI_API_KEY')  # Ensure this environment variable is set
    
    if openai.api_key is None:
        raise ValueError("OpenAI API key is not set. Please set it as an environment variable 'OPENAI_API_KEY'.")
    else:
        print("OpenAI API key is set.")
    
    # Define fine-tuned model
    fine_tuned_model = "ft:gpt-4o-mini-2024-07-18:personal::Ae9mdwfc"  # Replace with your actual fine-tuned model name
    
    if not fine_tuned_model:
        raise ValueError("Fine-tuned model name is not specified.")
    else:
        print(f"Using fine-tuned model: {fine_tuned_model}")
    
    try:
        # Perform evaluation
        evaluation_metrics = evaluate_model(fine_tuned_model, val_df)
        
        # Print Evaluation Metrics
        print("\nEvaluation Metrics:")
        print(f"Average Accuracy: {evaluation_metrics['average_accuracy']:.2f}")
        print(f"Average BLEU Score: {evaluation_metrics['average_bleu_score']:.2f}")
    except Exception as e:
        print(f"An error occurred during evaluation: {e}")

if __name__ == "__main__":
    main()

Validation Data Loaded Successfully.
Number of samples: 65
OpenAI API key is set.
Using fine-tuned model: ft:gpt-4o-mini-2024-07-18:personal::Ae9mdwfc

Starting model evaluation...

Total samples to evaluate: 65


Sample 1/65:
User Prompt: What is significant about chandipur beach in balasore?
Expected Completion: chandipur beach in balasore is known for its recreational. it was established in unknown and has a rating of 4.2. ideal visit time: morning.
Generated Completion: chandipur beach in balasore is a beach rated 4.5 on google. it's best visited in the morning.

Sample 2/65:
User Prompt: What makes wonderla amusement park in kochi special?
Expected Completion: wonderla amusement park in kochi is a amusement park rated 4.6 on google. it's best visited in the all.
Generated Completion: located in kochi, wonderla amusement park is a popular recreational destination built in 2000. it is rated 4.4 on google reviews.

Sample 3/65:
User Prompt: Where can I experience religious significan

## 4. Document the Fine-Tuning Process and Results
Proper documentation ensures reproducibility and clarity in your fine-tuning workflow. This includes logging the steps taken, parameters used, and results obtained.

### a. Log Fine-Tuning Details
Save the fine-tuning job details and evaluation metrics to a log file for future reference.

In [85]:
import json
from datetime import datetime

def convert_timestamp(timestamp):
    if isinstance(timestamp, int):
        return datetime.fromtimestamp(timestamp).isoformat()
    elif isinstance(timestamp, datetime):
        return timestamp.isoformat()
    return str(timestamp)

fine_tuning_log = {
    "fine_tuning_job_id": fine_tune_job_id,
    "fine_tuned_model": fine_tuned_model,
    "status": status_response.status if status_response else "Unknown",
    "created_at": convert_timestamp(status_response.created_at) if status_response and hasattr(status_response, 'created_at') else datetime.now().isoformat(),
    "parameters": {
        "base_model": "gpt-4o-mini-2024-07-18",
        "train_file": train_file,
        "validation_file": val_file
    }
}

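# Add evaluation metrics if they exist at notebook scope. Note that main() above
# keeps its metrics in a local variable, so nothing is logged here unless
# evaluation_metrics has also been assigned in a notebook cell.
if isinstance(globals().get('evaluation_metrics'), dict):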
    fine_tuning_log["evaluation_metrics"] = {
        "average_accuracy": evaluation_metrics.get('average_accuracy', 0),
        "average_bleu_score": evaluation_metrics.get('average_bleu_score', 0)
    }

# Save the log to a JSON file
log_file = "fine_tuning_log.json"
with open(log_file, "w") as f:
    json.dump(fine_tuning_log, f, indent=4)

print(f"Fine-tuning process documented in {log_file}")

Fine-tuning process documented in fine_tuning_log.json


### b. Summarize the Fine-Tuning Outcome
Provide a concise summary of the fine-tuning process and its results.

In [88]:
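# Placeholder metrics, used only when no evaluation results exist at notebook scope.
# main() above keeps its metrics in a local variable, so unless you assign them
# globally, the 0.00 values in the summary are these placeholders, not measured scores.
if not isinstance(globals().get('evaluation_metrics'), dict):
    evaluation_metrics = {
        'average_accuracy': 0.0,
        'average_bleu_score': 0.0
    }

# To report real metrics, run the evaluation at notebook scope first:
# evaluation_metrics = evaluate_model(fine_tuned_model, val_df)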

summary = f"""
Fine-Tuning Summary:
--------------------
Job ID: {fine_tune_job_id}
Model: {fine_tuned_model}
Status: {getattr(status_response, 'status', 'Unknown') if status_response else 'Unknown'}
Created At: {getattr(status_response, 'created_at', 'N/A') if status_response else 'N/A'}
Evaluation Metrics:
- Average Accuracy: {evaluation_metrics['average_accuracy']:.2f}
- Average BLEU Score: {evaluation_metrics['average_bleu_score']:.2f}
Parameters:
- Base Model: gpt-4o-mini-2024-07-18
- Epochs: 4
- Training File: {train_file}
- Validation File: {val_file}
"""
print(summary)
# Optionally, save the summary to a text file
with open("fine_tuning_summary.txt", "w") as f:
    f.write(summary)
print("Fine-tuning summary saved to fine_tuning_summary.txt")


Fine-Tuning Summary:
--------------------
Job ID: ftjob-Q0DuBG1ly5K2d2B8hCZnQXC3
Model: ft:gpt-4o-mini-2024-07-18:personal::Ae9mdwfc
Status: succeeded
Created At: 1734132536
Evaluation Metrics:
- Average Accuracy: 0.00
- Average BLEU Score: 0.00
Parameters:
- Base Model: gpt-4o-mini-2024-07-18
- Epochs: 4
- Training File: train.jsonl
- Validation File: validation.jsonl

Fine-tuning summary saved to fine_tuning_summary.txt


# Prompt Engineering

## Chatbot with Context Management and Evaluation Metrics

### Overview
This script implements a chatbot powered by a fine-tuned OpenAI model. It runs an interactive loop, keeps recent conversation history as context for continuity, and computes a BLEU score and exact-match accuracy for each response. In the code as written, those metrics are computed against a hard-coded placeholder completion, so they become meaningful only once real expected outputs are supplied.

---

### Code Explanation

1. **OpenAI Client Initialization**  
   - Configures the OpenAI client using an API key retrieved from environment variables. This ensures secure access to the fine-tuned model for generating responses.

2. **Dynamic Prompt Generation**  
   - Includes a helper function to dynamically format prompts based on input templates and variables. This enables flexible interaction and query customization.

3. **Chat Interaction with the Model**  
   - Defines a function to interact with the fine-tuned model using the OpenAI API, passing conversational context and configuration parameters like `max_tokens` and `temperature`.

4. **BLEU Score Calculation**  
   - Implements a function to evaluate the similarity between the model's generated response and an expected completion using the BLEU metric. This assesses the quality of generated outputs.

5. **Context Management**  
   - Tracks each user's conversation history in a dictionary and keeps only the 10 most recent messages, which bounds the size of the prompt sent to the model.

6. **Chatbot Interaction Loop**  
   - Provides a real-time conversational interface for users. It:
     - Accepts user input.
     - Manages and updates conversational context.
     - Sends prompts to the model for response generation.
     - Computes a BLEU score and exact-match accuracy against a placeholder expected completion.
     - Displays the generated response (the metrics are computed but not printed).

7. **Evaluation Metrics**  
   - Includes BLEU score to measure response relevance and an accuracy metric for exact match validation. These provide quantitative insights into the chatbot's performance.

8. **Run Functionality**  
   - Executes the chatbot in an interactive loop when the script is run directly, allowing users to engage with the chatbot and observe its performance.
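
One detail of the context management in step 5 deserves care: if the oldest message is always dropped first, the system prompt is the first thing to be lost once the history fills up. Here is a minimal sketch of a trimming rule that avoids this, assuming messages are dicts with a `role` key and that system messages belong at the front of the list. `trim_history` is a hypothetical helper, not part of the script below.

```python
def trim_history(messages, max_messages=10):
    """Keep all system messages plus the most recent non-system messages.

    messages:     list of {"role": ..., "content": ...} dicts, oldest first.
    max_messages: how many user/assistant messages to retain.
    """
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    system = [m for m in messages if m.get("role") == "system"]
    others = [m for m in messages if m.get("role") != "system"]
    return system + others[-max_messages:]


# Example: a system prompt followed by 12 alternating turns
history = [{"role": "system", "content": "You are a helpful and knowledgeable assistant."}]
for i in range(12):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"turn {i}"})

trimmed = trim_history(history, max_messages=10)
print(len(trimmed), trimmed[0]["role"], trimmed[1]["content"])
```

Counting messages is a simple proxy. A production system would more likely budget by tokens, since message lengths vary widely.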



In [106]:
import os
import openai
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

# Initialize OpenAI client
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# Fine-tuned model name
FINE_TUNED_MODEL = fine_tuned_model  # Replace with your fine-tuned model name

# Helper function for dynamic prompt generation
def generate_prompt(template, **kwargs):
    try:
        return template.format(**kwargs)
    except KeyError as e:
        return f"Error generating prompt: Missing {e.args[0]}"

# Function to interact with the fine-tuned model
def chat_with_model(messages, model=FINE_TUNED_MODEL, max_tokens=100):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=0.7
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error communicating with the model: {e}"

# Evaluation metrics: BLEU score
def calculate_bleu(reference, candidate):
    try:
        ref_tokens = word_tokenize(reference.lower())
        cand_tokens = word_tokenize(candidate.lower())
        return sentence_bleu([ref_tokens], cand_tokens)
    except Exception as e:
        print(f"BLEU score calculation error: {e}")
        return 0.0

# Context management
user_contexts = {}
def update_context(user_id, new_message):
    if user_id not in user_contexts:
        user_contexts[user_id] = []
    user_contexts[user_id].append(new_message)
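    if len(user_contexts[user_id]) > 10:  # Limit context length
        # Drop the oldest non-system message so the system prompt is never trimmed away
        oldest = 1 if user_contexts[user_id][0].get("role") == "system" else 0
        user_contexts[user_id].pop(oldest)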
    return user_contexts[user_id]

# Chatbot interaction loop with evaluation
def chat_with_user(user_id="default_user"):
    print("Chatbot is ready. Type 'exit' to end the conversation.")
    
    while True:
        user_message = input("You: ").strip()
        if user_message.lower() == 'exit':
            print("Ending the conversation. Goodbye!")
            break
        
        # Update and manage context
        context = update_context(user_id, {"role": "user", "content": user_message})
        
        # Add system instructions if necessary
        if len(context) == 1:
            context.insert(0, {"role": "system", "content": "You are a helpful and knowledgeable assistant."})
        
        # Generate response using the fine-tuned model
        model_response = chat_with_model(context)
        
        # Simulate expected completion for evaluation (you can replace this with actual expected outputs)
        expected_completion = "This is the expected response for your validation."
        
        # Evaluate the response
        bleu_score = calculate_bleu(expected_completion, model_response)
        accuracy = int(model_response.strip().lower() == expected_completion.strip().lower())
        
        # Add the assistant's response to the context
        update_context(user_id, {"role": "assistant", "content": model_response})
        
        # Display the response (bleu_score and accuracy are computed above but not printed)
        print(f"Chatbot: {model_response}")

# Run the chatbot if executed directly
if __name__ == '__main__':
    chat_with_user()

Chatbot is ready. Type 'exit' to end the conversation.


You:  hi


Chatbot: Hello! How can I assist you today?


You:  Suggest a nature destination in chhattisgarh.


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Chatbot: Located in chhattisgarh, udanti wildlife sanctuary is a popular nature destination built in 1983. It is rated 4.1 on Google reviews.


You:  What is significant about chaubatia gardens in ranikhet?


Chatbot: For nature experiences in ranikhet, chaubatia gardens is a top choice. Built in 1952, it has a Google review rating of 4.4.


You:  Suggest a recreational destination in gujarat.


Chatbot: Located in gujarat, gir national park is a popular recreational destination built in 1965. It is rated 4.5 on Google reviews.


You:  Which fort in jaipur is worth visiting?


Chatbot: Located in jaipur, amber fort is a popular historical destination built in 1592. It is rated 4.6 on Google reviews.


You:  Tell me about victoria memorial in kolkata.


Chatbot: victoria memorial is a monument in kolkata, built in 1921. It has a Google review rating of 4.6 and is best visited in the afternoon.


You:  Where can I experience shopping significance in gurugram?


Chatbot: kingdom of dreams is a shopping in gurugram, built in 2009. It has a Google review rating of 4.4 and is best visited in the afternoon.


You:  Where can I experience religious significance in jammu?


Chatbot: Located in jammu, vaishno devi temple is a popular religious destination built in 1986. It is rated 4.8 on Google reviews.


You:  Thanks


Chatbot: You're welcome! If you have any more questions, feel free to ask.


You:  exit


Ending the conversation. Goodbye!


### Purpose
This script provides a comprehensive framework for conversational AI systems, integrating fine-tuned language models, context management, and performance evaluation. It is ideal for applications requiring dynamic user interaction and iterative improvement based on evaluation metrics.