# Machine Learning Pipeline for GDPR Clause Prediction

This notebook implements a pipeline that predicts GDPR clause types and their degrees of unfairness using three OpenAI models: a baseline GPT-3.5, a fine-tuned GPT-3.5, and a zero-shot GPT-4-turbo.

### Overview

The notebook is organized into three parts:

- **Utility Functions**:
  - `get_system_prompt`: Returns the system prompt that describes the GDPR analyst task, the allowed clause types, and the expected response format.
  - `get_response`: Fetches model predictions from the OpenAI API.
  - `parse_clause_info`: Parses the predicted clause type and degree of unfairness from the model's response.
  - `get_clause_type_and_degree_of_unfairness`: Processes a dataframe to predict clause types and degrees of unfairness.

- **Model Predictions**:
  - **Baseline GPT-3.5 Model Results**: Utilizes the baseline GPT-3.5 model to predict and save clause information.
  - **Fine-Tuned GPT-3.5 Model Results**: Employs a fine-tuned GPT-3.5 model specific to GDPR clauses.
  - **Zero-Shot GPT-4 Model Results**: Uses `gpt-4-turbo-2024-04-09` in a zero-shot manner to predict GDPR clauses.

- **Results Analysis**:
  - **Load Predicted Results**: Load the CSV files containing model predictions.
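  - **Calculate Accuracy Stats**: Compute and display accuracy for clause type, degree of unfairness, and both combined (training and validation data for the baseline, validation data only for the other two models).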


In [16]:
from dotenv import load_dotenv
from openai import OpenAI
from tqdm import tqdm
import pandas as pd
from utils import get_training_data, get_validation_data, get_accuracy_stats, read_labelled_data

# Load environment variables from .env file
load_dotenv()

True

### GPT Model Prediction Utils

In [17]:
def get_system_prompt():
    return """
        You are a smart GDPR analyst tasked with predicting the 'Type of Clause' and the 'Degree of Unfairness'. 

        The clause types include 'Unknown', 'Choice of Law', 'Content Removal', 'Unilateral Termination',
        'Unilateral Change', 'Contract by Using', 'Limitation of Liability', 'Jurisdiction', and 'Arbitration'.

        The 'Degree of Unfairness' is rated from zero to five. For the 'Unknown' type,
        the 'Degree of Unfairness' is always zero

        Response format: 'Type of Clause: <clause_type>, Degree of Unfairness: <degree_of_unfairness>'
    """

def get_response(client, model_name, clause_text):
    # Classify a single clause; fall back to 'Unknown' / 0 if the API call fails.
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": get_system_prompt()},
                {"role": "user", "content": clause_text},
            ],
            stream=False,
        )
        response_content = response.choices[0].message.content
    except Exception as e:
        # Report the underlying error so that failed requests can be diagnosed.
        print(f"Error sending request to OpenAI API ({e}), so returning default response")
        response_content = "Type of Clause: Unknown, Degree of Unfairness: 0"
    return response_content

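def parse_clause_info(input_str):
    # Expects 'Type of Clause: <clause_type>, Degree of Unfairness: <degree>'.
    # Parts that do not split into exactly one key and one value are skipped.
    parts = input_str.split(',')
    result = {}
    for part in parts:
        key_value = part.split(':')
        if len(key_value) == 2:
            # Strip any leading or trailing spaces from key and value
            key = key_value[0].strip()
            value = key_value[1].strip()
            if key == "Degree of Unfairness":
                try:
                    result[key] = int(value)
                except ValueError:
                    # Non-integer degree (e.g. '3.' or 'three') is stored as None
                    result[key] = None
            else:
                result[key] = value
    return result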


def get_clause_type_and_degree_of_unfairness(df):
    # Note: relies on the module-level `client` and `model_name` set in the cells below,
    # and adds the prediction columns to `df` in place (pass a copy to keep the original).
    results = []
    for clause_text in tqdm(df['content'].values):
        response = get_response(client, model_name, clause_text)
        parsed_response = parse_clause_info(response)
        results.append(parsed_response)

    df['predicted_clause_type'] = [result.get('Type of Clause', None) for result in results]
    df['predicted_degree_of_unfairness'] = [result.get('Degree of Unfairness', None) for result in results]
    return df
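
The comma-and-colon splitting in `parse_clause_info` assumes the model replies with exactly the requested format. A minimal sketch of a more tolerant alternative, assuming replies may carry extra words, quotes, or a trailing period, uses a regular expression instead. `parse_clause_info_regex` is a hypothetical helper, not part of the notebook:

```python
import re

# Hypothetical helper (not in the original notebook): a regex-based
# variant that returns the same dict shape as parse_clause_info.
_CLAUSE_PATTERN = re.compile(
    r"Type of Clause:\s*(?P<clause_type>.+?)\s*,\s*"
    r"Degree of Unfairness:\s*(?P<degree>\d+)",
    re.IGNORECASE,
)

def parse_clause_info_regex(response_text):
    """Return {'Type of Clause': str, 'Degree of Unfairness': int}, or {} if no match."""
    match = _CLAUSE_PATTERN.search(response_text)
    if match is None:
        return {}
    return {
        # Drop quotes the model may have copied from the prompt
        "Type of Clause": match.group("clause_type").strip("'\" "),
        "Degree of Unfairness": int(match.group("degree")),
    }

clean = parse_clause_info_regex("Type of Clause: Jurisdiction, Degree of Unfairness: 3")
noisy = parse_clause_info_regex("Sure! Type of Clause: 'Arbitration', Degree of Unfairness: 2.")
unparseable = parse_clause_info_regex("I cannot classify this clause.")
print(clean, noisy, unparseable)
```

Returning an empty dict on failure keeps the `result.get(..., None)` calls in `get_clause_type_and_degree_of_unfairness` working unchanged.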

### Get Baseline GPT-3.5 Model Results

In [18]:

# client = OpenAI()
# model_name = "gpt-3.5-turbo"
# system_prompt = get_system_prompt()

# training_df = get_training_data()
# validation_df = get_validation_data()
# labeled_training_df = get_clause_type_and_degree_of_unfairness(training_df.copy())
# labeled_validation_df = get_clause_type_and_degree_of_unfairness(validation_df.copy())
# labeled_training_df.to_csv(f'labeled_training_data_{model_name}.csv', index=False)
# labeled_validation_df.to_csv(f'labeled_validation_data_{model_name}.csv', index=False)
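
Because `get_response` substitutes 'Unknown' / 0 whenever the API call raises, a transient failure during a long run like the one above is recorded as if it were a real prediction. One possible mitigation, sketched here under the assumption that most failures are temporary, is to retry with exponential backoff before falling back. `call_with_retries` and its parameters are hypothetical and not part of the notebook:

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt seconds and retry.

    The last failure is re-raised so the caller can decide on a fallback.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a stand-in for the API call that fails twice, then succeeds.
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "Type of Clause: Unknown, Degree of Unfairness: 0"

# sleep is replaced with a no-op so the demo does not actually wait
result = call_with_retries(flaky_request, sleep=lambda seconds: None)
print(result, calls["count"])
```

In the notebook, `fn` could wrap the `client.chat.completions.create(...)` call inside `get_response`, keeping the existing `except` branch as the final fallback.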

### Get Fine-Tuned GPT-3.5 Model Results

In [19]:
# client = OpenAI()
# model_name = "ft:gpt-3.5-turbo-1106:personal:gdpr-trial-2:9Gl2wNWI"
# system_prompt = get_system_prompt()

# validation_df = get_validation_data()
# labeled_validation_df = get_clause_type_and_degree_of_unfairness(validation_df)
# labeled_validation_df.to_csv(f'labeled_validation_data_{model_name}.csv', index=False)

### Get Zero Shot GPT-4-turbo Model Results

In [20]:
# client = OpenAI()
# model_name = "gpt-4-turbo-2024-04-09"
# system_prompt = get_system_prompt()

# validation_df = get_validation_data()
# labeled_validation_df = get_clause_type_and_degree_of_unfairness(validation_df)
# labeled_validation_df.to_csv(f'labeled_validation_data_{model_name}.csv', index=False)

#### Load Predicted GPT-3.5-turbo Baseline Results

In [21]:
model_name = "gpt-3.5-turbo"
labeled_training_df = read_labelled_data(f'labeled_training_data_{model_name}.csv')
labeled_validation_df = read_labelled_data(f'labeled_validation_data_{model_name}.csv')

training_stats = get_accuracy_stats(labeled_training_df)
validation_stats = get_accuracy_stats(labeled_validation_df)

print(f"Training data accuracy: {training_stats}")
print(f"Validation data accuracy: {validation_stats}")

Training data accuracy: {'Type of Clause': 0.13164049448713666, 'Degree of Unfairness': 0.17607751419979953, 'Combined': 0.11359839625793518}
Validation data accuracy: {'Type of Clause': 0.14454045561665357, 'Degree of Unfairness': 0.19795758051846032, 'Combined': 0.12647289866457187}
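
`get_accuracy_stats` lives in `utils` and is not shown in this notebook. The printed values are consistent with exact-match accuracy, where 'Combined' counts a row only when both predictions are correct, but that is an assumption. A minimal sketch under that assumption follows. The ground-truth column names `clause_type` and `degree_of_unfairness` are hypothetical, while the prediction columns are the ones created by `get_clause_type_and_degree_of_unfairness`:

```python
import pandas as pd

def accuracy_stats_sketch(df,
                          true_type_col="clause_type",             # assumed column name
                          true_degree_col="degree_of_unfairness"):  # assumed column name
    """Exact-match accuracy per field, plus the share of rows where both match."""
    type_match = df[true_type_col] == df["predicted_clause_type"]
    degree_match = df[true_degree_col] == df["predicted_degree_of_unfairness"]
    return {
        "Type of Clause": float(type_match.mean()),
        "Degree of Unfairness": float(degree_match.mean()),
        "Combined": float((type_match & degree_match).mean()),
    }

# Toy data for illustration only; these are not rows from the real dataset.
toy_df = pd.DataFrame({
    "clause_type": ["Jurisdiction", "Arbitration", "Unknown", "Choice of Law"],
    "degree_of_unfairness": [3, 2, 0, 1],
    "predicted_clause_type": ["Jurisdiction", "Unknown", "Unknown", "Choice of Law"],
    "predicted_degree_of_unfairness": [3, 0, 0, 2],
})
toy_stats = accuracy_stats_sketch(toy_df)
print(toy_stats)
```

Under this definition 'Combined' can never exceed the lower of the two per-field accuracies, which matches the pattern in the outputs above.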


#### Load Predicted Fine-Tuned GPT-3.5-turbo Results

In [22]:
model_name = "ft:gpt-3.5-turbo-1106:personal:gdpr-trial-2:9Gl2wNWI"

labeled_validation_df = read_labelled_data(f'labeled_validation_data_{model_name}.csv')
validation_stats = get_accuracy_stats(labeled_validation_df)
print(f"Validation data accuracy: {validation_stats}")

Validation data accuracy: {'Type of Clause': 0.11468970934799685, 'Degree of Unfairness': 0.9167321288295365, 'Combined': 0.11468970934799685}


#### Load Predicted GPT-4-turbo Zero-Shot Results

In [23]:
model_name = "gpt-4-turbo-2024-04-09"

labeled_validation_df = read_labelled_data(f'labeled_validation_data_{model_name}.csv')
validation_stats = get_accuracy_stats(labeled_validation_df)
print(f"Validation data accuracy: {validation_stats}")

Validation data accuracy: {'Type of Clause': 0.1720345640219953, 'Degree of Unfairness': 0.19088766692851533, 'Combined': 0.1633935585231736}
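
The three validation results are easier to compare side by side. The sketch below copies the printed validation accuracies into a small table. The row labels are shorthand chosen here, not names used elsewhere in the notebook:

```python
import pandas as pd

# Validation accuracies copied from the cell outputs above.
validation_results = {
    "gpt-3.5-turbo (baseline)": {
        "Type of Clause": 0.14454045561665357,
        "Degree of Unfairness": 0.19795758051846032,
        "Combined": 0.12647289866457187,
    },
    "gpt-3.5-turbo-1106 (fine-tuned)": {
        "Type of Clause": 0.11468970934799685,
        "Degree of Unfairness": 0.9167321288295365,
        "Combined": 0.11468970934799685,
    },
    "gpt-4-turbo-2024-04-09 (zero-shot)": {
        "Type of Clause": 0.1720345640219953,
        "Degree of Unfairness": 0.19088766692851533,
        "Combined": 0.1633935585231736,
    },
}

# One row per model, one column per metric
comparison = pd.DataFrame(validation_results).T.round(3)
print(comparison)

# Model with the highest accuracy for each metric
best_model = comparison.idxmax()
print(best_model)
```

On these numbers the fine-tuned model leads on degree of unfairness by a wide margin, while zero-shot GPT-4-turbo leads on clause type and on the combined metric.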
