In [3]:
#| default_exp inLoop


# AI-Assisted Data Analysis Notebook
This notebook is designed for AI-assisted data labeling and querying using OpenAI's APIs. It demonstrates modular programming practices, AI-assisted labeling, and dataset querying functionalities. 

In [4]:
#| export
import pandas as pd
import requests 
import openai


# inLoop Class Explanation

## Overview of `inLoop` Class
- **Class Name**: `inLoop`
- **Purpose**: To process a dataset and use an AI model to generate labels for the data.

## `__init__(self, data_file)`
- **Functionality**: Initializes the class instance with a dataset.
- **Parameters**: 
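  - `data_file`: The dataset as a pandas DataFrame. Despite the name, it is not a file path: `update_labels` writes to it with `.at`, so it must already be loaded (for example with `pd.read_csv`).
- **Internal Variables**:
  - `self.data_csv`: Holds the DataFrame for use in other methods.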

## `get_prediction_for_text(self, data)`
- **Functionality**: Sends a piece of text to the OpenAI chat completions API and returns the predicted labels as a string.
- **Parameters**: 
  - `data`: A string containing the text to be labeled.
- **OpenAI API Interaction**:
  - Posts a request to the chat completions endpoint with a system instruction to return only comma-separated labels (`OTR`, `PRS`, `REP`, `NEU`), with no explanations.
  - Parses the API response and returns the stripped content of the first choice.
  - Failures are not raised: an exception or a response without `choices` is returned as a descriptive string.
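
The response-parsing step can be sketched without a network call. The helper name `extract_labels_text` and the sample response dictionary are illustrative assumptions, not part of the notebook; the dictionary only mimics the `choices[0]['message']['content']` shape that the class reads.

```python
def extract_labels_text(response_json):
    """Return the stripped message content, or None if the shape is unexpected."""
    choices = response_json.get("choices") or []
    if not choices:
        return None
    content = choices[0].get("message", {}).get("content")
    return content.strip() if isinstance(content, str) else None


# Hypothetical response bodies, hand-written for illustration only.
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": " OTR, NEU, PRS, NEU, REP \n"}}
    ]
}
labels_text = extract_labels_text(sample_response)
error_text = extract_labels_text({"error": {"message": "invalid key"}})
```

Returning `None` on failure, rather than an error string as the class does, is a design choice of this sketch: it keeps error text from being mistaken for labels later on.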

## `turn_dataframe(self, prediction, data_chunk)`
- **Functionality**: Converts predictions and text data into a pandas DataFrame.
- **Parameters**: 
  - `prediction`: A string of comma-separated labels returned by the AI model.
  - `data_chunk`: A pandas DataFrame chunk of the dataset.
- **Process**: 
  - Splits the prediction string into individual labels.
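  - Pads the label list and the text list with `None` up to 5 entries each. Longer lists are not truncated, so a prediction with more than 5 labels makes the two lists differ in length and `pd.DataFrame` raises an error.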
  - Creates a new DataFrame with text and corresponding labels.
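
A minimal sketch of the padding step, assuming made-up sample data. The helper `pad_to_length` is hypothetical, and unlike `turn_dataframe` this version pads both lists to a shared target length so that they always match.

```python
import pandas as pd


def pad_to_length(items, length):
    """Right-pad a list with None; leave it unchanged if already long enough."""
    return items + [None] * max(0, length - len(items))


prediction = "OTR, NEU,PRS"  # made-up model output with only 3 labels
texts = ["Line one", "Line two", "Line three", "Line four"]  # made-up chunk text

labels = [item.strip() for item in prediction.split(",")]
# Pad both lists to the same length (at least 5) so the columns line up.
target = max(len(labels), len(texts), 5)
frame = pd.DataFrame({
    "Text": pad_to_length(texts, target),
    "Label": pad_to_length(labels, target),
})
```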

## `update_labels(self, start_index, revised_labels)`
- **Functionality**: Updates the labels in the original dataset.
- **Parameters**: 
  - `start_index`: The index label of the first row to update. Because the method uses `.at`, this matches the row position only when the DataFrame has the default integer index.
  - `revised_labels`: A list of new labels to replace the old ones.
- **Process**: Iterates over the revised labels and updates them in the dataset at the corresponding index.
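
The update loop can be tried on a small, made-up DataFrame. The column names follow the notebook; the rows and labels are invented for illustration.

```python
import pandas as pd

# Made-up three-row dataset with the default integer index (0, 1, 2).
data = pd.DataFrame({
    "Text": ["Who can answer?", "Great work!", "Please sit down."],
    "Label": [None, None, None],
})

start_index = 1
revised_labels = ["PRS", "REP"]
for offset, label in enumerate(revised_labels):
    # .at addresses a single cell by index label and column name.
    data.at[start_index + offset, "Label"] = label
```

Only rows 1 and 2 receive labels here; row 0 keeps its missing value.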

## `get_updated_data(self)`
- **Functionality**: Retrieves the updated dataset.
- **Returns**: The updated dataset with new labels.


## In-the-Loop Process
The `inLoop` class supports an in-the-loop labeling workflow: an AI model produces the initial labels, and `update_labels` lets a user overwrite them after review or further analysis. Repeating this cycle allows label quality to improve over successive passes.
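
One way the review step might look is sketched below. The function `apply_review` and the sample labels are hypothetical; the notebook does not prescribe how reviewer feedback is collected.

```python
ALLOWED_LABELS = {"OTR", "PRS", "REP", "NEU"}


def apply_review(ai_labels, corrections):
    """Return a new label list with reviewer corrections applied by position."""
    reviewed = list(ai_labels)
    for position, label in corrections.items():
        if label not in ALLOWED_LABELS:
            raise ValueError(f"Unknown label: {label!r}")
        reviewed[position] = label
    return reviewed


ai_labels = ["OTR", "NEU", "NEU", "NEU", "NEU"]  # made-up model output
corrections = {1: "OTR", 2: "PRS"}               # made-up reviewer decisions
final_labels = apply_review(ai_labels, corrections)
```

The resulting list could then be passed to `update_labels(start_index, final_labels)` to write the reviewed labels back to the dataset.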

In [5]:
#| export

class inLoop:

    def __init__(self, data_file):
        self.data_csv = data_file

    
    def get_predictions(self, data_chunk):
        # Label each row of the chunk with a separate API call.
        # data_chunk is passed in explicitly instead of being read from a global.
        predictions = []
        for _, row in data_chunk.iterrows():
            text = row['Text']
            prediction = self.get_prediction_for_text(text)
            predictions.append(prediction)
        return predictions
    
    def get_prediction_for_text(self, data):
        try:
            # Set your OpenAI API key (the name must match the one used in the headers below)
            openai_api_key = ""

            # API endpoint for ChatGPT
            url = "https://api.openai.com/v1/chat/completions"

            # Headers including the Authorization with your API key
            headers = {
                "Content-Type": "application/json",
                "Authorization": f"Bearer {openai_api_key}"
            }
             
            # Data payload for the request
            data_payload = {
                "model": "gpt-3.5-turbo",
                "messages": [
                    {"role": "system", "content": "You are a labeler and have to generate 5 labels. Each line has to have a label which is OTR (Opportunity to Respond), PRS (Praise), REP (Reprimand), NEU (None of the Above). And the result should only return the 5 labels and it should be separated by commas. DO NOT INCLUDE ANY EXPLANATIONS. JUST LABELS SEPARATED BY COMMAS. ONLY GENERATE LABELS (OTR, PRS, REP, NEU) FOR EACH ROW. NO EXPLANATIONS. "},
                    {"role": "user", "content": f"Based on {data} labels: "}
                ]
            }

            # Make the POST request
            response = requests.post(url, headers=headers, json=data_payload)
            response_json = response.json()

            # Check if 'choices' is in the response and if it's not empty
            if 'choices' in response_json and response_json['choices']:
                # Accessing the first choice's message
                message = response_json['choices'][0]['message']['content']
                # Ensure the message is a string before calling strip()
                if isinstance(message, str):
                    return message.strip()
                else:
                    # Handle the case where message is not a string
                    return "Received non-string response: " + str(message)
            else:
                return "No choices in response or empty response: " + str(response_json)

            
        except Exception as e:
            return f"Error: {e}"

    
    def turn_dataframe(self, prediction, data_chunk):
        predicted_list = [item.strip() for item in prediction.split(',')]
        print(predicted_list)
        text_variable = data_chunk['Text'].to_list()
        print(text_variable)

        if len(predicted_list) < 5:
            predicted_list = predicted_list + ([None] * (5 - len(predicted_list)))
        if len(data_chunk) < 5 :
            text_variable = text_variable + ([None] * (5 - len(data_chunk)))
        df = pd.DataFrame({'Text': text_variable, 'Label' : predicted_list})
        print(df)
        return df
    
    
    def update_labels(self, start_index, revised_labels):
        for i, label in enumerate(revised_labels):
            self.data_csv.at[start_index + i, 'Label'] = label

    def get_updated_data(self):
        return self.data_csv

    # def get_predictions(self, prompt):
    #     try:
    #         # Set your OpenAI API key
    #         key = ""

    #         # API endpoint for ChatGPT
    #         url = "https://api.openai.com/v1/chat/completions"

    #         # Headers including the Authorization with your API key
    #         headers = {
    #             "Content-Type": "application/json",
    #             "Authorization": f"Bearer {openai_api_key}"
    #         }

    #         # Data payload for the request
    #         data_payload = {
    #             "model": "gpt-3.5-turbo",
    #             "messages": [
    #                 {"role": "system", "content": "prompt"},
    #                 {"role": "user", "content": f"Based on {self.dataset_string} solve the Question: {prompt}"}
    #             ]
    #         }

    #         # Make the POST request
    #         response = requests.post(url, headers=headers, json=data_payload)
    #         response_json = response.json()

    #         # Check if 'choices' is in the response and if it's not empty
    #         if 'choices' in response_json and response_json['choices']:
    #             # Accessing the first choice's message
    #             message = response_json['choices'][0]['message']
    #             # Ensure the message is a string before calling strip()
    #             if isinstance(message, str):
    #                 return message.strip()
    #             else:
    #                 # Handle the case where message is not a string
    #                 return "Received non-string response: " + str(message)
    #         else:
    #             return "No choices in response or empty response: " + str(response_json)

    #     except Exception as e:
    #         return f"Error: {e}"


# Example Usage

In [14]:
# Example usage


# Load the dataset
data_chunk = pd.read_csv('/workspaces/ai-assisted-coding_panther/test_ai.csv')

in_loop_instance = inLoop(data_chunk)

# Define the chunk size
chunk_size = 5

# Process the dataset in chunks
for start_index in range(0, len(data_chunk), chunk_size):
    # Selecting a chunk of rows
    chunk = data_chunk.iloc[start_index:start_index + chunk_size]

    # Concatenating the text entries in the chunk
    concatenated_text = ' '.join(chunk['Text'].fillna(''))

    # Getting predictions for the concatenated text
    prediction = in_loop_instance.get_prediction_for_text(concatenated_text)

    # Split the prediction into individual labels, stripping the spaces that follow each comma
    predicted_labels = [label.strip() for label in prediction.split(',')] if prediction else ['None'] * chunk_size

    # Update the labels in the original dataset
    in_loop_instance.update_labels(start_index, predicted_labels[:len(chunk)])

# Fetching the updated dataset
updated_dataset = in_loop_instance.get_updated_data()



In [16]:
print(updated_dataset)

  Timestamp                                               Text Label
0      0:00  Good morning class, today we are going to lear...   OTR
1   00:04.4         Can anyone tell me what a civilization is?   NEU
2       NaN  Yes, that's right. A civilization is a complex...   NEU
3       NaN  Now, let's talk about the first civilization w...   NEU
4       NaN  They lived in a region called Mesopotamia. Can...   NEU
5   00:22.8  Good job, Sarah! Mesopotamia is between the Ti...   PRS
6   00:27.0  The Sumerians invented many things we still us...   NEU
7       NaN              Excellent, they did invent the wheel!   PRS
8   00:35.9  They also invented writing. They wrote on clay...   PRS
9       NaN     Jack, please stop talking while I am teaching.   REP


**Explanation:**

1. Loading the Dataset:

The dataset is loaded from a CSV file into a pandas DataFrame. This step reads the file located at `/workspaces/ai-assisted-coding_panther/test_ai.csv`. The DataFrame, `data_chunk`, contains the data that will be processed.

2. Initializing the inLoop Instance:

An instance of the `inLoop` class is created with the loaded DataFrame. This instance, `in_loop_instance`, will be used to interact with the AI model for label predictions and to update the dataset.

3. Defining Chunk Size:

The `chunk_size` variable is set to 5. This determines the number of rows from the dataset to be processed in each iteration. The idea is to process and label the dataset in smaller, manageable batches.
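
The chunk boundaries produced by a given chunk size can be checked with a short sketch. The generator `chunk_bounds` is a hypothetical helper that mirrors the `range(0, len(...), chunk_size)` pattern used in the example.

```python
def chunk_bounds(total_rows, chunk_size):
    """Yield (start, stop) row positions for consecutive chunks."""
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


bounds_even = list(chunk_bounds(10, 5))    # row count divisible by chunk size
bounds_ragged = list(chunk_bounds(12, 5))  # last chunk has only 2 rows
```

With 12 rows the final chunk covers positions 10 to 12, so any code that consumes a chunk has to tolerate fewer than 5 rows.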

4. Processing the Dataset in Chunks:

The dataset is processed in chunks of up to 5 rows (as defined by `chunk_size`); the last chunk is shorter when the row count is not a multiple of 5. For each chunk:
a. Chunk Selection: A subset of the dataset, `chunk`, is selected with `iloc`, starting at the current `start_index`.
b. Text Concatenation: The text entries in the chunk are joined with single spaces into one string, `concatenated_text`, so that the whole chunk is labeled in one request. Missing values (NaN) are replaced with an empty string first so that the join does not fail.
c. Getting Predictions: The concatenated text is sent to the `get_prediction_for_text` method of the `inLoop` instance. This method calls the AI model and returns a prediction string, which contains labels separated by commas.
d. Label Processing: The prediction string is split on commas into individual labels. If the prediction string is empty, the default label 'None' is used for each entry in the chunk.
e. Updating the Dataset: The original dataset is updated with these labels using the `update_labels` method, which writes them to consecutive rows starting at `start_index`.
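
Because `get_prediction_for_text` returns error messages as ordinary strings, the label-processing step may benefit from validation before the dataset is updated. The sketch below is one possible approach; `parse_labels` and its `fallback` parameter are hypothetical and not part of the notebook.

```python
ALLOWED_LABELS = {"OTR", "PRS", "REP", "NEU"}


def parse_labels(prediction, expected_count, fallback=None):
    """Split a comma-separated prediction into exactly expected_count labels."""
    raw = [part.strip().upper() for part in (prediction or "").split(",")]
    # Anything outside the allowed set (including error text) becomes the fallback.
    cleaned = [label if label in ALLOWED_LABELS else fallback for label in raw]
    # Truncate extras, then pad any shortfall with the fallback value.
    cleaned = cleaned[:expected_count]
    return cleaned + [fallback] * (expected_count - len(cleaned))


labels_ok = parse_labels("OTR, neu ,PRS", 3)
labels_short = parse_labels("OTR, PRS", 4)
labels_error = parse_labels("Error: connection refused", 2)
```

Using `None` as the fallback leaves unlabeled rows visibly empty instead of assigning them a label the model never produced.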

5. Fetching the Updated Dataset:

After processing all chunks, the updated dataset with new labels is retrieved using the `get_updated_data` method. This dataset, `updated_dataset`, now contains the original text data along with the AI-generated labels.