### GPT Labeler

---

In this notebook, we test how well `GPT 3.5` can classify websites from the context it is given. Specifically, we classify each website using three increasingly rich contexts:

1. tld + domain + metatags
2. context 1 + title + description + keywords
3. context 2 + links + text (the `sentences` feature)

In [None]:
%load_ext autoreload
%autoreload 2

import pickle
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the web features and labelling info of the **crowdsourced** dataset. For more information about both, see the [eda notebook](eda.ipynb).

In [None]:
# Features as a dict of dicts: the outer key is the website id, the inner dict holds that website's features
with open('../data/crowdsourced/processed/web_features.pkl', 'rb') as f:
    web_features = pickle.load(f)

# Websites with corresponding label - at least 2 votes for each label
websites = pd.read_csv('../data/crowdsourced/processed/websites.csv')
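
Because each context below draws on a different subset of features, it can help to check how often each feature is actually present before labeling. The following is a minimal sketch under the assumption that `web_features` has the dict-of-dicts shape described above; `feature_coverage` and the toy records are hypothetical and not part of the original notebook.

```python
from collections import Counter


def feature_coverage(web_features):
    """Return, per feature name, the fraction of websites with a non-empty value."""
    counts = Counter()
    names = set()
    for features in web_features.values():
        for name, value in features.items():
            names.add(name)
            if value:  # treats None, "", [] and {} as missing
                counts[name] += 1
    n_sites = len(web_features)
    if n_sites == 0:
        return {}
    return {name: counts[name] / n_sites for name in sorted(names)}


# Toy data: the real feature names and value types are assumptions here.
toy_features = {
    "site_a": {"tld": "com", "domain": "example", "title": "Example", "keywords": []},
    "site_b": {"tld": "org", "domain": "wiki", "title": None, "keywords": ["reference"]},
}

coverage = feature_coverage(toy_features)
print(coverage)
```

On the real data, the same call would be `feature_coverage(web_features)`. A feature with low coverage adds little to a context, which may matter when comparing the three contexts.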

### Context 1: tld + domain + metatags

---

In [None]:
# Define the context
context1 = ['tld', 'domain', 'metatags']

# Define the labeler
c1_lab = ...

# Get the labeled data
c1_out = c1_lab.predict(web_features, context1)

# Save the labeled data
folder_path = os.path.join('..', 'data', 'tld_domain_meta')
os.makedirs(folder_path, exist_ok=True)

with open(os.path.join(folder_path, 'labeled_data.pkl'), 'wb') as f:
    pickle.dump(c1_out, f)
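
To make the role of the context list more concrete, here is a minimal sketch of how the selected features of one website could be turned into a text prompt. This is an illustration only: `build_prompt`, the instruction wording, and the toy record are hypothetical, and the labeler used in this notebook may format its input differently.

```python
def build_prompt(site_features, context):
    """Format the features named in `context` as a plain-text prompt (hypothetical helper)."""
    lines = ["Classify the website described below."]
    for name in context:
        value = site_features.get(name)
        if not value:
            continue  # skip features that are missing or empty for this site
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Toy record: real feature values and types are assumptions.
toy_site = {
    "tld": "com",
    "domain": "example",
    "metatags": ["viewport", "robots"],
    "title": "Example Domain",
}

prompt = build_prompt(toy_site, ["tld", "domain", "metatags"])
print(prompt)
```

Only the features listed in the context reach the prompt, so the `title` of the toy record is left out here and would appear only under a richer context.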

### Context 2: context 1 + title + description + keywords

---

In [None]:
# Set the context
context2 = context1 + ['title', 'description', 'keywords']

# Define the labeler
c2_lab = ...

# Get the labeled data
c2_out = c2_lab.predict(web_features, context2)

# Save the labeled data
folder_path = os.path.join('..', 'data', 'c1_title_desc_kws')
os.makedirs(folder_path, exist_ok=True)

with open(os.path.join(folder_path, 'labeled_data.pkl'), 'wb') as f:
    pickle.dump(c2_out, f)

### Context 3: context 2 + links + text

---

In [None]:
# Set the context
context3 = context2 + ['links', 'sentences']

# Define the labeler
c3_lab = ...

# Get the labeled data
c3_out = c3_lab.predict(web_features, context3)

# Save the labeled data
folder_path = os.path.join('..', 'data', 'c2_links_text')
os.makedirs(folder_path, exist_ok=True)

with open(os.path.join(folder_path, 'labeled_data.pkl'), 'wb') as f:
    pickle.dump(c3_out, f)
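
The three save steps above follow the same pattern, so they could be folded into a pair of small helpers. The sketch below assumes the labeled output is any picklable object; `save_labeled`, `load_labeled`, and the toy output are hypothetical names introduced for illustration, and the demo writes to a temporary directory rather than `../data`.

```python
import os
import pickle
import tempfile


def save_labeled(labeled, folder_path, filename="labeled_data.pkl"):
    """Pickle `labeled` into `folder_path`, creating the folder if needed."""
    os.makedirs(folder_path, exist_ok=True)
    out_path = os.path.join(folder_path, filename)
    with open(out_path, "wb") as f:
        pickle.dump(labeled, f)
    return out_path


def load_labeled(folder_path, filename="labeled_data.pkl"):
    """Load a previously saved labeled output. Only unpickle files you trust."""
    with open(os.path.join(folder_path, filename), "rb") as f:
        return pickle.load(f)


# Round trip on toy data inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_out = {"site_a": ["News"], "site_b": []}
    folder = os.path.join(tmp, "tld_domain_meta")
    saved_path = save_labeled(toy_out, folder)
    restored = load_labeled(folder)
```

With these helpers, each context's save step would reduce to a single call such as `save_labeled(c1_out, os.path.join('..', 'data', 'tld_domain_meta'))`.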