# Fine-Tuning OpenAI API GPT Models for Sentiment Analysis With Weights & Biases

### https://wandb.ai/mostafaibrahim17/ml-articles/reports/Fine-Tuning-ChatGPT-for-Sentiment-Analysis-With-W-B--Vmlldzo1NjMzMjQx 

This notebook explores fine-tuning OpenAI GPT models for sentiment analysis using Weights & Biases. Our experiment leads to an overall accuracy boost, and we'll touch on practical applications. Sentiment analysis plays an important role in gauging public opinion on a wide range of topics. Large language models such as OpenAI's GPT models offer considerable potential for interpreting emotion in text. However, like many tools, their out-of-the-box behavior may not capture the nuances of sentiment, especially in diverse datasets like those from Reddit. This article walks through the process of fine-tuning a GPT model for sentiment analysis with the Weights & Biases platform and examines the improvements and challenges along the way.

# Table of Contents

- How Can a GPT Model Be Used for Sentiment Analysis?
- Fine-Tuning GPT Models for Sentiment Analysis
- Data Preparation and Labeling
  - The Current Data Set at Hand
  - Data Augmentation Sentiment Analysis Dataset for Fine-Tuning
  - The Importance of High-Quality Training Data for Sentiment Analysis
- Step-by-Step Tutorial
  - Evaluating the Old Model’s Performance
  - Fine-Tuning the GPT Model
  - Evaluating the New Model’s Performance
- Fine-Tuning Results and Analysis
- Practical Applications and Use Cases
  - Jargon and Slang Understanding
  - E-Commerce Product Reviews
- Further Improvements
- Conclusion


## How Can a GPT Model Be Used for Sentiment Analysis?

A GPT model's ability to understand natural language makes it a good fit for sentiment analysis. Unlike traditional chatbots that rely on predefined responses, GPT models generate answers in real time based on patterns learned from a vast amount of training data. This lets them produce responses that are contextually relevant and informed by a broad spectrum of information.


## Fine-Tuning GPT Models for Sentiment Analysis

Fine-tuning is a pivotal step in adapting a general-purpose model, such as a GPT model, to a specific task like sentiment analysis. A GPT model, with its broad language understanding, can handle a vast array of topics and concepts. However, sentiment analysis is more than just comprehending text; it requires a nuanced understanding of subjective tones, moods, and emotions.
<br/><br/>
Think about sarcasm. Understanding sarcasm is tricky, sometimes even for humans. Sarcasm is when we say something but mean the opposite, often in a joking or mocking way. For example, if it starts raining just as you're about to go outside and you say, "Oh, perfect timing!", you're probably being sarcastic because it's actually bad timing. Now, imagine a machine trying to understand this. Without special training, it might think you're genuinely happy about the rain because you said "perfect." This is where fine-tuning a GPT model becomes crucial.
<br/><br/>
A GPT model, out of the box, is pretty good at understanding a lot of text. It has read more than most humans ever will. But sarcasm is subtle and often needs context. So, to make a GPT model really get sarcasm, we'd expose it to many examples of sarcastic sentences until it starts catching on to the patterns. But here's the catch: sarcasm doesn't look the same everywhere. What's sarcastic in one culture or situation might be meant seriously in another. That's why general knowledge alone isn't enough. The model needs specific examples to truly grasp the playful twists and turns of sarcasm. In short, to make a GPT model understand sarcasm like a human, it needs extra training on it, just like someone might need to watch several comedy shows to start understanding a comedian's sense of humor.

## Data Preparation and Labeling

### The Current Data Set at Hand

In this notebook, we'll use a Reddit dataset sourced from Kaggle, available here: https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset. The dataset has two key columns: `clean_comment` (the comment text) and `category` (its sentiment label: 1 for positive, 0 for neutral, -1 for negative).
<br/><br/>
The file contains about 37k comments, each with a sentiment label. All comments in the dataset have been cleaned and labeled, so it can be used to build a sentiment analysis machine learning model.

### Data Augmentation Sentiment Analysis Dataset for Fine-Tuning

Note that the fine-tuning process for GPT chat models requires the training data in a specific structure, stored in a JSONL file.

#### What is a JSONL File?

A `.jsonl` file (short for JSON Lines) is a file format used to store structured data, typically for machine learning and data processing applications. Each line in a `.jsonl` file is a separate, self-contained JSON object. This makes it particularly useful for handling large datasets that can be processed line-by-line, avoiding loading everything into memory at once.

##### Key Features of JSONL Format:
- **One JSON Object Per Line:** Each line in the file is an independent JSON object.
- **Line-Delimited:** The objects are separated by newlines (`\n`), not by commas or brackets as in standard JSON.
- **Efficient Parsing:** Line-by-line processing is easy and efficient, which is helpful when working with large datasets.
- **No Root Structure:** Unlike regular JSON, there is no outer array or object enclosing the entire dataset.

##### Example of a JSONL File:
```json
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```
##### Common Use Cases:
- **Training Data for Machine Learning Models:** Frequently used in NLP tasks where each line contains an individual record (e.g., a sentence with a label).
- **Log Data Storage:** Each log entry is a separate JSON object.
- **Streaming Data Processing:** Ideal for scenarios where you process data incrementally.

##### How to Work with JSONL:
- **Reading and Writing:** In Python, you can use the `json` or `jsonlines` library to read and write JSONL files.
- **Tools:** Many tools like `jq`, Pandas, and other data processing libraries support the JSONL format.
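
As a minimal sketch of the reading and writing described above, the standard-library `json` module is enough. The helper names `write_jsonl` and `read_jsonl`, the toy records, and the temporary file name are illustrative assumptions, not part of any library.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Yield one parsed object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate stray blank lines
                yield json.loads(line)


# Round-trip two chat-style records through a temporary file
records = [
    {"messages": [{"role": "user", "content": "great movie"},
                  {"role": "assistant", "content": "positive"}]},
    {"messages": [{"role": "user", "content": "terrible service"},
                  {"role": "assistant", "content": "negative"}]},
]
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
write_jsonl(path, records)
loaded = list(read_jsonl(path))
print(f"Loaded {len(loaded)} records")
```

Because `read_jsonl` is a generator, a large file can be processed one record at a time instead of being loaded into memory all at once.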


### The Importance of High-Quality Training Data for Sentiment Analysis

High-quality training data is pivotal for sentiment analysis because it helps the model learn to distinguish nuances in emotion accurately. Poor data can lead to misinterpretations, reducing the effectiveness of the analysis. Moreover, comprehensive and well-curated data can significantly boost the model's ability to generalize across diverse real-world scenarios. The dataset we're using underscores this point: some of its entries are so nuanced that even humans might struggle to discern their sentiment.
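
A quick audit can surface some basic quality problems before any training happens. The sketch below counts labels and flags empty texts, unexpected labels, and duplicate comments. The function name `audit_rows` and the toy rows are assumptions made for illustration; the label set -1/0/1 follows the dataset's `category` column.

```python
from collections import Counter

VALID_LABELS = {-1, 0, 1}  # negative, neutral, positive


def audit_rows(rows):
    """Summarize label balance and basic quality problems.

    rows: iterable of (text, label) pairs.
    """
    label_counts = Counter()
    empty_texts = 0
    invalid_labels = 0
    duplicates = 0
    seen = set()
    for text, label in rows:
        if label in VALID_LABELS:
            label_counts[label] += 1
        else:
            invalid_labels += 1
        cleaned = (text or "").strip()
        if not cleaned:
            empty_texts += 1
        elif cleaned in seen:
            duplicates += 1
        else:
            seen.add(cleaned)
    return {
        "label_counts": dict(label_counts),
        "empty_texts": empty_texts,
        "invalid_labels": invalid_labels,
        "duplicates": duplicates,
    }


# Toy rows (hypothetical) showing each kind of problem
toy_rows = [("good", 1), ("bad", -1), ("", 0), ("good", 1), ("meh", 5)]
report = audit_rows(toy_rows)
print(report)
```

With a DataFrame, the same function could be fed with `zip(df['clean_comment'], df['category'])`.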

## Step-by-Step Tutorial

### Evaluating the Old Model's Performance

#### Step 1: Installing and Importing Necessary Packages

In [1]:
# Install necessary libraries
# The following commands are used to install the required libraries if they are not already installed.
# Note: The exclamation mark (!) is used to run shell commands within a Jupyter Notebook or certain IDEs.

# Uncomment and run the following lines if you need to install the libraries:
# !pip install openai  # Install the OpenAI library for interacting with the OpenAI API
# !pip install wandb   # Install the Weights and Biases (wandb) library for experiment tracking


In [23]:
# Import standard libraries
import os  # Provides a way to interact with the operating system
import json  # For parsing and working with JSON data
import random  # For generating random numbers and making random choices
from datetime import datetime  # For converting Unix timestamps to datetime objects
import time  # For handling time-related functions
from pathlib import Path  # For handling filesystem paths in an object-oriented way

# Import third-party libraries
import openai  # Top-level package (used for openai.Client and openai.OpenAIError)
from openai import OpenAI  # Client class (used for type hints)
import wandb  # For experiment tracking and model logging with Weights and Biases
import pandas as pd  # For handling data structures and data analysis

#### Step 2: Creating our client

In [3]:
# Create an instance of the OpenAI client
# This client will be used to interact with OpenAI's API
client = openai.Client()
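
By default the client reads the API key from the `OPENAI_API_KEY` environment variable, and `wandb` can authenticate from `WANDB_API_KEY`. A small pre-flight check, sketched below, can give a clearer message than a failed API call. The helper name `missing_env_vars` is an assumption for illustration.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(["OPENAI_API_KEY", "WANDB_API_KEY"])
if missing:
    print(f"Set these environment variables before continuing: {missing}")
else:
    print("All required environment variables are set.")
```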

#### Step 3: Loading and Processing the Sentiment Analysis Dataset

In [4]:
# Define the file path for the CSV data
filename = "./practical_data/reddit_data.csv"

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(filename)

# Get the total number of rows in the DataFrame before cleaning
num_rows = len(df)
print(f"The DataFrame has {num_rows} rows before cleaning.")

# Remove rows with missing values in 'clean_comment' and 'category' columns
# The 'inplace=True' argument modifies the DataFrame directly
df.dropna(subset=['clean_comment', 'category'], inplace=True)

# Get the total number of rows in the DataFrame after cleaning
num_rows = len(df)
print(f"The DataFrame has {num_rows} rows after cleaning.\n")

# Display the first 5 rows of the DataFrame
print(df.head(5))


The DataFrame has 37249 rows before cleaning.
The DataFrame has 37149 rows after cleaning.
                                       clean_comment  category
0   family mormon have never tried explain them t...         1
1  buddhism has very much lot compatible with chr...         1
2  seriously don say thing first all they won get...        -1
3  what you have learned yours and only yours wha...         0
4  for your own benefit you may want read living ...         1


#### Step 4: Initializing a New Weights & Biases Project

Now for something new. We will create a new WandB project with code.



In [5]:
# Set the environment variable for the notebook name
# This helps Weights & Biases (wandb) identify the source of the run
os.environ["WANDB_NOTEBOOK_NAME"] = "fine_tune_openai_sentiment_wandb.ipynb"

# Initialize a new Weights & Biases run
# This creates a new experiment in the "Reddit_Sentiment_Analysis_Fine_Tuning" project
run = wandb.init(project="Reddit_Sentiment_Analysis_Fine_Tuning")

# Note: Remember to close the wandb run when you're finished
# Uncomment the following line at the end of your script:
# run.finish()

wandb: Currently logged in as: suspicious-cow (suspicious-cow-self). Use `wandb login --relogin` to force relogin


#### Step 5: Take a Sample to Test the Model

In [6]:
# Randomly select a sample of 100 rows from the DataFrame
df_sample = df.sample(100)

#### Step 6: Define Helper Functions to Convert Model Response to Sentiment Value and Vice Versa

In [7]:
def convert_response_to_sentiment(response):
    """
    Convert a text response to a numeric sentiment value.

    Args:
    response (str): The sentiment response as text.

    Returns:
    int or None: The sentiment as a numeric value:
        1 for positive, -1 for negative, 0 for neutral,
        or None when no label is found in the response.
    """
    response = response.lower()
    if 'positive' in response:
        return 1
    elif 'negative' in response:
        return -1
    elif 'neutral' in response:
        return 0
    else:
        # Unknown sentiment: return None rather than -1 so that an
        # unparseable reply is never counted as a correct 'negative'
        return None


def convert_numeric_to_string_sentiment(value):
    """
    Convert a numeric sentiment value to a string representation.

    Args:
    value (int): The sentiment as a numeric value.

    Returns:
    str: The sentiment as a string:
        "positive", "negative", "neutral", or "unknown".
    """
    if value == 1:
        return "positive"
    elif value == -1:
        return "negative"
    elif value == 0:
        return "neutral"
    else:
        return "unknown"
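
Substring matching is forgiving, but it can misfire on replies such as "not negative" or on replies that mention more than one label. As a hedged alternative, the sketch below accepts a reply only when it is exactly one label after trimming whitespace, punctuation, and case. The name `parse_label` is hypothetical and is not used elsewhere in this notebook.

```python
import string

LABEL_TO_VALUE = {"positive": 1, "neutral": 0, "negative": -1}


def parse_label(response):
    """Return 1, 0, or -1 when the reply is exactly one label, else None."""
    cleaned = response.strip(string.punctuation + string.whitespace).lower()
    return LABEL_TO_VALUE.get(cleaned)


for reply in ["Positive.", " 'neutral' ", "not negative", ""]:
    print(repr(reply), "->", parse_label(reply))
```

The trade-off is that a stricter parser rejects verbose replies such as "The sentiment is positive", so it suits models that reliably answer with a single word.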


#### Step 7: Evaluating the Current Model Performance

In [8]:
# Initialize counters and results list
correct_predictions = 0
loop_count = 0
results = []
total_rows = len(df_sample)  # Defined up front so it exists even if a call fails

# Iterate over each row in the sample DataFrame
for index, row in df_sample.iterrows():
    loop_count += 1
    text = row['clean_comment']
    response = "error"  # Placeholder recorded if the API call fails
    
    try:
        # Make an API call to OpenAI for sentiment analysis
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ]
        )
        
        # Extract the model's response
        response = completion.choices[0].message.content
        
        # Convert the text response to a numeric sentiment
        predicted_sentiment = convert_response_to_sentiment(response)
        
        # Check if the prediction matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        
        # Print progress
        print(f"Processed {loop_count}/{total_rows} rows.")
        
    except Exception as e:
        print(f"Error on index {index}: {e}")
    
    # Store one result per row, even on failure, so that `results`
    # stays aligned with the rows of df_sample
    results.append({
        "sentiment": text,  # The comment text being classified
        "labeled_prediction": convert_numeric_to_string_sentiment(row['category']),
        "old_model_prediction": response
    })


Processed 1/100 rows.
Processed 2/100 rows.
Processed 3/100 rows.
...
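
A loop that makes one API call per row can hit transient errors such as rate limits or timeouts. One common mitigation is to retry with exponential backoff. The sketch below is a generic helper under that assumption; the name `call_with_retries` and its parameters are hypothetical, and the `sleep` argument is injectable only so the demo does not actually wait.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Demo with a function that fails twice before succeeding
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky_call, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

In the evaluation loop, the API call could be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`, keeping the rest of the loop unchanged.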

#### Step 8: Calculating the Model Accuracy

In [9]:
# Calculate the accuracy percentage
accuracy = (correct_predictions / total_rows) * 100

# Print the accuracy with two decimal places
print(f"Accuracy: {accuracy:.2f}%")

Accuracy: 42.00%
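
With only 100 samples, an accuracy figure carries a fairly wide margin of uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval for a proportion. This interval is not part of the original workflow; the function name `wilson_interval` is an assumption, and `z=1.96` is the usual normal quantile for 95% coverage.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    return center - half_width, center + half_width


# 42 correct predictions out of 100, as reported above
low, high = wilson_interval(42, 100)
print(f"95% interval: {low * 100:.1f}% to {high * 100:.1f}%")
```

For 42 out of 100 the interval spans roughly 33% to 52%, which is a reminder that small evaluation samples give only a coarse estimate.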


#### Step 9: Logging the Accuracy to WandB

In [10]:
# Log the accuracy to Weights & Biases
wandb.log({"Old Accuracy": accuracy})

# Print the accuracy to the console
print(f'Model Accuracy before: {accuracy:.2f}%')

Model Accuracy before: 42.00%


### Fine-Tuning the GPT Model

#### Step 10: Converting the DataFrame to JSONL Format

In [11]:
output_filename = "reddit_sentiment_data.jsonl"

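# Convert DataFrame to the desired JSONL format
# Note: df still includes the 100 rows of df_sample, so the evaluation sample
# overlaps the fine-tuning data. To keep it held out, iterate over
# df.drop(df_sample.index) instead.
with open(output_filename, "w") as file:
    for _, row in df.iterrows():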
        # Map the numeric sentiment to its corresponding string label
        target_label = {
            0: 'neutral',
            1: 'positive',
            -1: 'negative'
        }.get(row['category'], 'unknown')
        
        # Create a dictionary representing the conversation format
        data = {
            "messages": [
                {
                    "role": "system",
                    "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."
                },
                {
                    "role": "user",
                    "content": row['clean_comment']
                },
                {
                    "role": "assistant",
                    "content": target_label
                }
            ]
        }
        
        # Write each data point as a separate line in the JSONL file
        file.write(json.dumps(data) + "\n")

print(f"Data has been written to {output_filename}")

Data has been written to reddit_sentiment_data.jsonl
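
Before uploading, it can help to check that every line of the file matches the chat structure shown earlier. The sketch below is a local sanity check based only on that structure; it is not the API's own validation, and the name `validate_chat_line` and the specific rules are assumptions.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for position, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {position} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {position} has an unexpected role")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {position} has non-string content")

    last = messages[-1]
    if not (isinstance(last, dict) and last.get("role") == "assistant"):
        problems.append("last message is not from the assistant")
    return problems


good_line = json.dumps({"messages": [
    {"role": "system", "content": "Classify the sentiment."},
    {"role": "user", "content": "love it"},
    {"role": "assistant", "content": "positive"},
]})
bad_line = json.dumps({"messages": [{"role": "user", "content": "love it"}]})
print(validate_chat_line(good_line))
print(validate_chat_line(bad_line))
```

Running this over each line of `reddit_sentiment_data.jsonl` would also flag rows whose comment text is not a string.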


#### Step 10b: Create the Train Test Split

In [12]:
# Helper function for doing train and test splits on JSONL files
def split_jsonl_file(file_path, train_ratio=0.8):
    """
    Split a JSONL file into training and test sets.

    Args:
    file_path (str): Path to the input JSONL file.
    train_ratio (float): Ratio of data to use for training (default: 0.8).

    Returns:
    tuple: Paths to the created training and test files.
    """
    # Convert file path to Path object
    file_path = Path(file_path)

    # Read and parse the input JSONL file
    with file_path.open('r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f]
    
    # Shuffle the data randomly
    random.shuffle(data)
    
    # Calculate the split index
    split_index = int(len(data) * train_ratio)
    
    # Split the data into train and test sets
    train_data = data[:split_index]
    test_data = data[split_index:]
    
    # Prepare output file paths
    train_file = file_path.with_name(f"{file_path.stem}_train{file_path.suffix}")
    test_file = file_path.with_name(f"{file_path.stem}_test{file_path.suffix}")
    
    # Write train data to file
    with train_file.open('w', encoding='utf-8') as f:
        for item in train_data:
            json.dump(item, f)
            f.write('\n')
    
    # Write test data to file
    with test_file.open('w', encoding='utf-8') as f:
        for item in test_data:
            json.dump(item, f)
            f.write('\n')
    
    # Print summary information
    print(f"Train data saved to: {train_file}")
    print(f"Test data saved to: {test_file}")
    print(f"Train set size: {len(train_data)}")
    print(f"Test set size: {len(test_data)}")
    
    return train_file, test_file

In [13]:
# File paths and data processing
file_path = output_filename

# Split the JSONL file into train and test sets
train_test_files = split_jsonl_file(file_path)
print("\n")  # Print a blank line for better output readability

# Convert the returned file paths to strings
train_path, test_path = [str(file) for file in train_test_files]

# Print the paths of the resulting train and test files
print(f"Train file path: {train_path}")
print(f"Test file path: {test_path}")

Train data saved to: reddit_sentiment_data_train.jsonl
Test data saved to: reddit_sentiment_data_test.jsonl
Train set size: 29719
Test set size: 7430


Train file path: reddit_sentiment_data_train.jsonl
Test file path: reddit_sentiment_data_test.jsonl
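
The split above shuffles with the global random state, so each run produces a different split, and class proportions can drift between the two files. A seeded, stratified variant is sketched below as one possible alternative. The function name `stratified_split`, the `label_of` callback, and the default seed are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, label_of, train_ratio=0.8, seed=42):
    """Split records so each label keeps roughly the same train/test ratio."""
    rng = random.Random(seed)  # local RNG, leaves the global state untouched
    by_label = defaultdict(list)
    for record in records:
        by_label[label_of(record)].append(record)

    train, test = [], []
    for label in sorted(by_label, key=str):
        group = by_label[label]
        rng.shuffle(group)
        cut = int(len(group) * train_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])

    # Shuffle again so labels are not grouped together in the output
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy chat-format records: 10 positive and 10 negative examples
toy_records = [
    {"messages": [{"role": "user", "content": f"text {i}"},
                  {"role": "assistant", "content": label}]}
    for label in ("positive", "negative")
    for i in range(10)
]


def assistant_label(record):
    return record["messages"][-1]["content"]


train_split, test_split = stratified_split(toy_records, assistant_label)
print(Counter(assistant_label(r) for r in train_split))
print(Counter(assistant_label(r) for r in test_split))
```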


#### Step 11: Upload the files to OpenAI

In [14]:
# Upload the training data to the OpenAI API
# The 'with' block closes the file handle once the upload finishes
with open(train_path, "rb") as train_file:
    train_set_file = client.files.create(
        file=train_file,
        purpose="fine-tune"
    )

# Upload the test data to the OpenAI API
with open(test_path, "rb") as test_file:
    test_set_file = client.files.create(
        file=test_file,
        purpose="fine-tune"
    )

# Print confirmation messages
print(f"Training file uploaded with ID: {train_set_file.id}")
print(f"Test file uploaded with ID: {test_set_file.id}")

Training file uploaded with ID: file-y5CZr6MnLF8IAWsAH5p88muT
Test file uploaded with ID: file-6OUJ7W7T83e1JGkAJkO3DVwe


#### Step 12: Create a Fine-Tuning Job

In [15]:
# Create a fine-tuning job using the uploaded training data
wandb_params_ft_job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",  # Base model to be fine-tuned
    training_file=train_set_file.id,  # ID of the uploaded training data file
    validation_file=test_set_file.id,  # ID of the uploaded validation (test) data file
    hyperparameters={
        "batch_size": "auto",  # Let API automatically determine batch size
        "learning_rate_multiplier": "auto",  # Auto-set learning rate multiplier
        "n_epochs": "auto",  # Automatically decide number of training epochs
    },
    suffix="reddit_sentiment",  # Append this to the fine-tuned model's name
    integrations=[
        {
            "type": "wandb",
            "wandb": {
                "project": "Reddit_Sentiment_Analysis",  # Replace with your actual project name
                "name": "Reddit_Sentiment_Analysis_Run_001",  # Optional: Replace with your desired run name
                "entity": "suspicious-cow-self",  # Optional: Replace with your entity
                "tags": ["reddit", "sentiment"]  # Optional: Replace with your desired tags
            }
        }
    ],
    seed=None,  # No fixed seed; pass an integer here to make runs more reproducible
)

# Print confirmation and job details
print(f"Fine-tuning job created with ID: {wandb_params_ft_job.id}")
print(f"Model: {wandb_params_ft_job.model}")
print(f"Status: {wandb_params_ft_job.status}")

Fine-tuning job created with ID: ftjob-NDiUitRsbGkDI7D6XPFLI0cD
Model: gpt-3.5-turbo-0125
Status: validating_files


#### Step 12a: Make Sure the Job is Done

In [16]:
# Create helper function to check the status of a fine-tuning job
# Along with a special exception for failed jobs
class FineTuningFailedException(Exception):
    """Custom exception for failed fine-tuning jobs."""
    pass

def check_fine_tuning_status(client: OpenAI, job_id: str, seconds_to_wait: int = 60):
    """
    Continuously check the status of a fine-tuning job until it succeeds or fails.

    Args:
        client (OpenAI): The OpenAI API client object.
        job_id (str): The ID of the fine-tuning job to check.
        seconds_to_wait (int): The number of seconds to wait between status checks.

    Returns:
        FineTuningJob: The final job object if the job succeeds.

    Raises:
        FineTuningFailedException: If the fine-tuning job fails or is cancelled.
        Exception: For any other errors during the process.
    """
    while True:
        try:
            # Retrieve updated information for the fine-tuning job
            retrieved_job = client.fine_tuning.jobs.retrieve(job_id)
            
            print(f"Current status: {retrieved_job.status}")
            
            # "failed" and "cancelled" are both terminal, so stop polling on either
            if retrieved_job.status in ("failed", "cancelled"):
                print(f"Job {retrieved_job.status}. Final job details:")
                print(retrieved_job)
                raise FineTuningFailedException(
                    f"The fine-tuning job ended with status '{retrieved_job.status}'."
                )
            
            if retrieved_job.status == "succeeded":
                print("Job succeeded. Final job details:")
                print(retrieved_job)
                return retrieved_job
            
            # Wait for the specified number of seconds before checking again
            time.sleep(seconds_to_wait)
        
        except Exception as e:
            print(f"An error occurred: {e}")
            raise  # Re-raise the exception to stop the function
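
A polling loop like this runs until the job reaches a terminal state, which could take a long time if a job stalls. One option is to add an overall timeout. The sketch below shows the idea with a status callback so it can be tried without calling the API. The name `wait_for_terminal_status`, its parameters, and the default timeout are hypothetical; `sleep` and `clock` are injectable only to keep the demo instant.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_terminal_status(get_status, poll_seconds=60, timeout_seconds=6 * 3600,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal status appears or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Demo with a scripted sequence of statuses instead of a real job
scripted = iter(["validating_files", "running", "running", "succeeded"])
final_status = wait_for_terminal_status(lambda: next(scripted), sleep=lambda s: None)
print(final_status)
```

With the real client, the callback could be `lambda: client.fine_tuning.jobs.retrieve(job_id).status`.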

In [19]:
# Use our helper function to make sure the job is done
# Then print the final job details
try:
    # Extract the job ID from the previously created fine-tuning job
    job_id = wandb_params_ft_job.id
    
    print(f"Starting to monitor fine-tuning job with ID: {job_id}")

    # Check the fine-tuning status until completion or failure
    # Set the checking interval to 10 minutes (600 seconds)
    final_job = check_fine_tuning_status(client, job_id, seconds_to_wait=600)
    
    
except AttributeError:
    print("Error: 'job' object does not have 'id' attribute. "
        "Make sure the job was created successfully.")

except openai.OpenAIError as e:
    print(f"An OpenAI API error occurred: {str(e)}")

except Exception as e:
    print(f"An unexpected error occurred: {str(e)}")

finally:
    print("Fine-tuning status check process completed.")

Starting to monitor fine-tuning job with ID: ftjob-NDiUitRsbGkDI7D6XPFLI0cD
Current status: validating_files
Current status: validating_files
Current status: validating_files
Current status: validating_files
Current status: validating_files
Current status: validating_files
Current status: validating_files
Current status: validating_files
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: running
Current status: succeeded
Job succeeded. Final job details:
FineTuningJob(id='ftjob-NDiUitRsbGkDI7D6XPFLI0cD', created_at=1723997733, error=Error(code=None, message=None, param=None), fine_tuned_model='ft:gpt-3.5-turbo-0125:personal:reddit-sentiment:9xewUqje', finished_at=1724005760, hyperparameters=Hyperparameters(n_epochs=1, batch_size=19, learning_rate_multiplier=2), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-SQH2HT1IvRszon9pdYwV1yvQ', result_files=['file-caEVDcydgUhYKNuwWJyLF

In [22]:
# Print the final job details
print("\nFinal job details:")
print(f"Status: {final_job.status}")
print(f"Created at: {final_job.created_at}")
print(f"Finished at: {final_job.finished_at}")
print(f"Fine-tuned model: {final_job.fine_tuned_model}")
print(f"Training file: {final_job.training_file}")
print(f"Validation file: {final_job.validation_file}")
print(f"Trained tokens: {final_job.trained_tokens}")


Final job details:
Status: succeeded
Created at: 1723997733
Finished at: 1724005760
Fine-tuned model: ft:gpt-3.5-turbo-0125:personal:reddit-sentiment:9xewUqje
Training file: file-y5CZr6MnLF8IAWsAH5p88muT
Validation file: file-6OUJ7W7T83e1JGkAJkO3DVwe
Trained tokens: 2133448
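
The raw numbers above are easier to interpret after a little arithmetic. The sketch below derives the wall-clock duration of the job and the average number of tokens per training example. It uses the values printed in this notebook (timestamps, trained tokens, the train set size of 29719, and `n_epochs=1` from the job's hyperparameters) and assumes that `trained_tokens` counts each example once per epoch. The duration includes validation and queue time, not just training.

```python
created_at = 1723997733      # Unix timestamp from the job details
finished_at = 1724005760     # Unix timestamp from the job details
trained_tokens = 2133448     # from the job details
train_examples = 29719       # train set size reported by the split
n_epochs = 1                 # from the job's reported hyperparameters

# Wall-clock duration between job creation and completion
duration_seconds = finished_at - created_at
hours, remainder = divmod(duration_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

# Average tokens per example, assuming one pass per epoch over every example
tokens_per_example = trained_tokens / (train_examples * n_epochs)

print(f"Duration: {hours}h {minutes}m {seconds}s")
print(f"Average tokens per training example: {tokens_per_example:.1f}")
```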


In [24]:
# Let's show friendly timestamps for easier reading
import pytz

# Set the timezone to US Central Time
central_tz = pytz.timezone('US/Central')

def format_timestamp(timestamp):
    # Convert Unix timestamp to a timezone-aware datetime in UTC.
    # Passing tz avoids interpreting the timestamp in the machine's local time.
    dt_utc = datetime.fromtimestamp(timestamp, tz=pytz.UTC)
    
    # Convert to Central Time
    dt_central = dt_utc.astimezone(central_tz)
    
    # Format the datetime object as a string
    return dt_central.strftime('%Y-%m-%d %H:%M:%S %Z')

# Assuming final_job.created_at and final_job.finished_at are your Unix timestamps
print(f"Created at: {format_timestamp(final_job.created_at)}")
print(f"Finished at: {format_timestamp(final_job.finished_at)}")

Created at: 2024-08-18 11:15:33 CDT
Finished at: 2024-08-18 13:29:20 CDT


In [25]:
# Get the final status of the fine-tuning job
wandb_params_ft_job_final = client.fine_tuning.jobs.retrieve(wandb_params_ft_job.id)

# Print the name of the fine-tuned model
print(f"Fine-tuned model name: {wandb_params_ft_job_final.fine_tuned_model}")

# Optional: Print additional job details
print(f"Job status: {wandb_params_ft_job_final.status}")
print(f"Training file: {wandb_params_ft_job_final.training_file}")
print(f"Validation file: {wandb_params_ft_job_final.validation_file}")
print(f"Created at: {wandb_params_ft_job_final.created_at}")
print(f"Finished at: {wandb_params_ft_job_final.finished_at}")
print(f"Trained tokens: {wandb_params_ft_job_final.trained_tokens}")

Fine-tuned model name: ft:gpt-3.5-turbo-0125:personal:reddit-sentiment:9xewUqje
Job status: succeeded
Training file: file-y5CZr6MnLF8IAWsAH5p88muT
Validation file: file-6OUJ7W7T83e1JGkAJkO3DVwe
Created at: 1723997733
Finished at: 1724005760
Trained tokens: 2133448


### Evaluating the New Model's Performance

#### Step 13: Looking at the metrics from WandB

We will do manual calculations later for fun, but for now let's look at the data from WandB. There are two ways to do this:
1. Through the WandB website
2. Through code
<br/><br/>
We will do both.

First, for the website, go to https://wandb.ai/suspicious-cow-self/projects (replace `suspicious-cow-self` with your own entity) and click on the Reddit_Sentiment_Analysis project. This should automatically show the results from the latest run in graphical form.

Second, let's compute rough statistics manually in code.

In [26]:
# Use our fine-tuned model to make predictions on the sample data
model_id = wandb_params_ft_job_final.fine_tuned_model
correct_predictions = 0
loop_count = 0  # Counter for loop iterations

# Iterate over each row in the DataFrame for the new model
for index, row in df_sample.iterrows():
    loop_count += 1  # Increment the loop count
    text = row['clean_comment']
    
    try:
        completion = client.chat.completions.create(
            model=model_id,
            messages=[
                {"role": "system", "content": "What is the sentiment of the following text? Please respond with 'positive', 'negative', or 'neutral'."},
                {"role": "user", "content": text},
            ]
        )
        response = completion.choices[0].message.content
        predicted_sentiment = convert_response_to_sentiment(response)
        
        # results is expected to hold one entry per row of df_sample, in the
        # same order, so this row's entry sits at position loop_count - 1
        # even if an earlier call in this loop failed
        results[loop_count - 1].update({"new_model_prediction": response})
        
        # Check if the predicted sentiment matches the actual sentiment
        if predicted_sentiment == row['category']:
            correct_predictions += 1
        
        # Print the current progress
        print(f"Processed {loop_count}/{len(df_sample)} rows.")
        
    except openai.OpenAIError as e:
        print(f"OpenAI API error on index {index}: {e}")
    except Exception as e:
        print(f"Unexpected error on index {index}: {e}")
    
    # Optional: Add a delay to avoid hitting rate limits
    # time.sleep(1)

Processed 1/100 rows.
Processed 2/100 rows.
Processed 3/100 rows.
...

#### Step 14: Calculating the Fine-Tuned Model Accuracy Manually

In [27]:
# Calculate and print accuracy
accuracy = (correct_predictions / len(df_sample)) * 100
print(f"\nNew Model Accuracy: {accuracy:.2f}%")


New Model Accuracy: 87.00%


#### Step 15: Logging the Accuracy

In [28]:
# Log the new accuracy to Weights & Biases
wandb.log({"New Model Accuracy": accuracy})

# Print the new accuracy to the console
print(f'Model Accuracy after fine-tuning: {accuracy:.2f}%')

# Optional: Log additional metrics or comparisons
old_accuracy = wandb.run.summary.get("Old Accuracy", 0)  # Get the old accuracy, default to 0 if not found
accuracy_improvement = accuracy - old_accuracy

wandb.log({
    "Accuracy Improvement": accuracy_improvement,
    "Relative Improvement": (accuracy_improvement / old_accuracy) * 100 if old_accuracy > 0 else 0
})

print(f'Accuracy Improvement: {accuracy_improvement:.2f} percentage points')
print(f'Relative Improvement: {(accuracy_improvement / old_accuracy) * 100:.2f}%' if old_accuracy > 0 else 'N/A (no previous accuracy recorded)')


Model Accuracy after fine-tuning: 87.00%
Accuracy Improvement: 45.00 percentage points
Relative Improvement: 107.14%


#### Step 16: Create a Fine-Tuned vs Non-Tuned Result Comparison Table

In [29]:
# Convert results list to DataFrame
df_results = pd.DataFrame(results)

# Print the first few rows of the DataFrame
print("First few rows of the results DataFrame:")
print(df_results.head())

# Log the entire DataFrame as a table to Weights & Biases
wandb.log({"results_table": wandb.Table(dataframe=df_results)})

# Optional: Log some summary statistics
summary_stats = {
    "total_samples": len(df_results),
    "unique_sentiments": df_results['labeled_prediction'].nunique(),
    "old_model_accuracy": (df_results['old_model_prediction'] == df_results['labeled_prediction']).mean(),
    "new_model_accuracy": (df_results['new_model_prediction'] == df_results['labeled_prediction']).mean()
}

wandb.log(summary_stats)

print("\nSummary Statistics:")
for key, value in summary_stats.items():
    print(f"{key}: {value}")

# Optional: Create and log a confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

def create_confusion_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred, labels=['positive', 'neutral', 'negative'])
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['positive', 'neutral', 'negative'], yticklabels=['positive', 'neutral', 'negative'])
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    return plt

new_model_cm = create_confusion_matrix(df_results['labeled_prediction'], df_results['new_model_prediction'], 'New Model Confusion Matrix')
wandb.log({"new_model_confusion_matrix": wandb.Image(new_model_cm)})

old_model_cm = create_confusion_matrix(df_results['labeled_prediction'], df_results['old_model_prediction'], 'Old Model Confusion Matrix')
wandb.log({"old_model_confusion_matrix": wandb.Image(old_model_cm)})

plt.close('all')  # Close all plt figures to free up memory


First few rows of the results DataFrame:
                                           sentiment labeled_prediction  \
0                      have words just how the does             neutral   
1  what with stig abel shaking like leaf and hasn...            neutral   
2                         may 2018 update don think             neutral   
3  based indian leader talking about legitimate e...           negative   
4   quote arun shourie when all said and done mor...           positive   

  old_model_prediction new_model_prediction  
0              neutral              neutral  
1             negative              neutral  
2              neutral              neutral  
3             negative             positive  
4             positive             positive  

Summary Statistics:
total_samples: 100
unique_sentiments: 3
old_model_accuracy: 0.42
new_model_accuracy: 0.87
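
Beyond the headline accuracy, it is often useful to look at the rows where the two models disagree. The sketch below separates rows the new model fixed from rows it broke, using the same keys as the `results` list. The function names `normalize` and `find_fixed_and_broken` are assumptions, and the demo rows are copied from the five rows printed above rather than from the full results.

```python
def normalize(prediction):
    """Lowercase a model reply and trim whitespace and simple punctuation."""
    return prediction.strip().strip(".'\"").lower()


def find_fixed_and_broken(results):
    """Return (fixed, broken): rows only the new model, or only the old model, got right."""
    fixed, broken = [], []
    for row in results:
        truth = row["labeled_prediction"]
        old_ok = normalize(row.get("old_model_prediction", "")) == truth
        new_ok = normalize(row.get("new_model_prediction", "")) == truth
        if new_ok and not old_ok:
            fixed.append(row)
        elif old_ok and not new_ok:
            broken.append(row)
    return fixed, broken


# The five rows shown in the printed DataFrame above
demo_results = [
    {"sentiment": "have words just how the does", "labeled_prediction": "neutral",
     "old_model_prediction": "neutral", "new_model_prediction": "neutral"},
    {"sentiment": "what with stig abel shaking like leaf and hasn...", "labeled_prediction": "neutral",
     "old_model_prediction": "negative", "new_model_prediction": "neutral"},
    {"sentiment": "may 2018 update don think", "labeled_prediction": "neutral",
     "old_model_prediction": "neutral", "new_model_prediction": "neutral"},
    {"sentiment": "based indian leader talking about legitimate e...", "labeled_prediction": "negative",
     "old_model_prediction": "negative", "new_model_prediction": "positive"},
    {"sentiment": "quote arun shourie when all said and done mor...", "labeled_prediction": "positive",
     "old_model_prediction": "positive", "new_model_prediction": "positive"},
]

fixed_rows, broken_rows = find_fixed_and_broken(demo_results)
print(f"Fixed by fine-tuning: {len(fixed_rows)}, broken by fine-tuning: {len(broken_rows)}")
```

Reading through the "broken" rows in particular can reveal label noise or systematic shifts introduced by fine-tuning.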


#### Step 17: Finish the WandB Run


In [30]:
# Close the wandb run
wandb.finish()

print("Weights & Biases run has been completed and synced.")

Run summary:
Accuracy Improvement    45.0
New Model Accuracy      87.0
Old Accuracy            42.0
Relative Improvement    107.14286
new_model_accuracy      0.87
old_model_accuracy      0.42
total_samples           100.0
unique_sentiments       3.0


Weights & Biases run has been completed and synced.
