### GPT Labeler

---

In this notebook, we test how well `GPT 3.5` can classify websites given the provided context. Specifically, we classify websites using three increasingly rich contexts:

1. tld + domain + metatags
2. context 1 + title + description + keywords
3. context 2 + links + text

In [1]:
%load_ext autoreload
%autoreload 2

import pickle
import os
from openai import OpenAI
from ml_project_2_mlp import gpt

# Load env variables
import dotenv
dotenv.load_dotenv()

import pandas as pd

Set up the OpenAI client:

In [2]:
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # Change this to the name of your API key environment variable
)
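
`os.environ.get` returns `None` when the variable is missing, so a misconfigured `.env` file may only surface later as a less obvious error. A minimal sketch of failing early with a clear message follows; `require_env` is a hypothetical helper, not part of `openai` or `ml_project_2_mlp`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Possible usage in the cell above (assumes the key is stored as OPENAI_API_KEY):
# client = OpenAI(api_key=require_env("OPENAI_API_KEY"))
```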

Load the web features and labelling info of the **crowdsourced** dataset. For more information about these, see the [EDA notebook](eda.ipynb).

In [3]:
# Features as a dict of dicts: the outer key is the website id, the inner dict holds that website's features
with open('../data/crowdsourced/processed/web_features.pkl', 'rb') as f:
    web_features = pickle.load(f)[:5]

# Websites with corresponding label - at least 2 votes for each label
websites = pd.read_csv('../data/crowdsourced/processed/websites.csv')

### Context 1: tld + domain + metatags

---

In [4]:
# Define the context
context1 = [('tld', None), ('domain', None), ('metatags', 10)]

# Define the labeler
c1_lab = gpt.GPTLabeler(client, context1)

# Get the labeled data
c1_out = c1_lab.predict(web_features)

# Save the labeled data
folder_path = os.path.join('..', 'data', 'tld_domain_meta')
os.makedirs(folder_path, exist_ok=True)

with open(os.path.join(folder_path, 'labeled_data.pkl'), 'wb') as f:
    pickle.dump(c1_out, f)

100%|██████████| 5/5 [00:18<00:00,  3.70s/it]
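
Each context entry above is a `(feature, limit)` pair. How `gpt.GPTLabeler` consumes these pairs is not shown in this notebook. The sketch below is one plausible reading, assuming the second element caps the number of items taken from a list-valued feature and `None` means no cap. `render_context` and the example feature values are hypothetical and not part of `ml_project_2_mlp`.

```python
from typing import Any, Optional

# Hypothetical context spec: (feature name, max number of items or None for no limit)
ContextSpec = list[tuple[str, Optional[int]]]


def render_context(features: dict[str, Any], spec: ContextSpec) -> str:
    """Render the selected features of one website as plain text, one feature per line."""
    lines = []
    for name, limit in spec:
        value = features.get(name)
        if value is None:
            continue  # feature missing for this website
        if isinstance(value, (list, tuple)):
            # Assumption: the limit truncates list-valued features
            items = value if limit is None else value[:limit]
            value = ", ".join(str(item) for item in items)
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Made-up feature values for illustration only
example_site = {
    "tld": "com",
    "domain": "example",
    "metatags": ["viewport", "description", "og:title"],
}
example_spec = [("tld", None), ("domain", None), ("metatags", 2)]

print(render_context(example_site, example_spec))
# tld: com
# domain: example
# metatags: viewport, description
```

Under this reading, a larger limit sends more text per website to the model, which trades cost and latency against available context.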


### Context 2: context 1 + title + description + keywords

---

In [5]:
# Set the context
context2 = context1 + [('title', None), ('description', None), ('keywords', None)]

# Define the labeler
c2_lab = gpt.GPTLabeler(client, context2)

# Get the labeled data
c2_out = c2_lab.predict(web_features)

# Save the labeled data
folder_path = os.path.join('..', 'data', 'c1_title_desc_kws')
os.makedirs(folder_path, exist_ok=True)

with open(os.path.join(folder_path, 'labeled_data.pkl'), 'wb') as f:
    pickle.dump(c2_out, f)

100%|██████████| 5/5 [00:16<00:00,  3.22s/it]


### Context 3: context 2 + links + text

---

In [9]:
# Set the context 
context3 = context2 + [('links', 10), ('sentences', 20)]

# Define the labeler
c3_lab = gpt.GPTLabeler(client, context3)

# Get the labeled data
c3_out = c3_lab.predict(web_features)

# Save the labeled data
folder_path = os.path.join('..', 'data', 'c2_links_text')
os.makedirs(folder_path, exist_ok=True)

with open(os.path.join(folder_path, 'labeled_data.pkl'), 'wb') as f:
    pickle.dump(c3_out, f)

100%|██████████| 5/5 [00:14<00:00,  2.92s/it]
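
The three cells above repeat the same save pattern with different folder names. As a sketch, it could be factored into a pair of helpers. `save_labeled_data` and `load_labeled_data` are hypothetical names, and the demo writes to a temporary directory rather than `../data`.

```python
import pickle
import tempfile
from pathlib import Path

DEFAULT_BASE_DIR = Path("..") / "data"


def save_labeled_data(labeled, folder_name, base_dir=DEFAULT_BASE_DIR):
    """Pickle `labeled` to <base_dir>/<folder_name>/labeled_data.pkl and return the path."""
    folder = Path(base_dir) / folder_name
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "labeled_data.pkl"
    with path.open("wb") as f:
        pickle.dump(labeled, f)
    return path


def load_labeled_data(folder_name, base_dir=DEFAULT_BASE_DIR):
    """Load the object previously stored by `save_labeled_data`."""
    path = Path(base_dir) / folder_name / "labeled_data.pkl"
    with path.open("rb") as f:
        return pickle.load(f)


# Demo in a temporary directory so nothing under ../data is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_labeled_data(
        {"site_1": "example label"}, "tld_domain_meta", base_dir=tmp_dir
    )
    restored = load_labeled_data("tld_domain_meta", base_dir=tmp_dir)
```

In the notebook, a call such as `save_labeled_data(c1_out, "tld_domain_meta")` would then replace each save block.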


---