<a href="https://colab.research.google.com/github/PabloAMC/conformal-prediction/blob/main/Conformal_Pred_vSL2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook we aim to implement the techniques neccessary to detect a distribution shift on a Learning from human preference setting. We start loading the packages.

In [1]:
# Activate GPU for faster training by clicking on 'Runtime' > 'Change runtime type' and then selecting GPU as the Hardware accelerator
# Then check if GPU is available
import torch
torch.cuda.is_available()

True

In [2]:
!nvidia-smi

Sun May 28 01:22:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
!pip install wandb
!pip install tqdm
!pip install datasets
!pip install transformers
!pip install huggingface_hub
!pip install torch
!pip install statsmodels
!pip install --upgrade accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wandb
  Downloading wandb-0.15.3-py3-none-any.whl (2.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m18.7 MB/s[0m eta [36m0:00:00[0m
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.31-py3-none-any.whl (184 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m184.3/184.3 kB[0m [31m25.4 MB/s[0m eta [36m0:00:00[0m
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.24.0-py2.py3-none-any.whl (206 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m206.5/206.5 kB[0m [31m28.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting docker-pycreds>=0.4.0 (from wandb)
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting pathtools (from wandb)
  Downloading pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting

In [5]:
from datasets import load_dataset, concatenate_datasets, Dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import pipeline
import numpy as np
from datasets import load_metric
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel
from statsmodels.stats.multitest import multipletests
#from statsmodels.stats.weightstats import ttest_ind
from scipy.stats import binomtest, ttest_ind, ttest_1samp

In [6]:
# Log in to your Hugging Face account 
# Get your API token here https://huggingface.co/settings/token
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Data loading and model

We will use a set of amazon reviews. Our aim is to be able to learn which object is preferred from the reviews.

In [7]:
tldr = load_dataset("openai/summarize_from_feedback", "comparisons").shuffle()   # https://huggingface.co/datasets/openai/summarize_from_feedback

tldr_train = pd.DataFrame(tldr['train'])
tldr_train = tldr_train.replace(to_replace='None', value=np.nan).dropna()
tldr_train = tldr_train[:1000]
tldr_cal = pd.DataFrame(tldr['validation'])
tldr_cal = tldr_cal.replace(to_replace='None', value=np.nan).dropna()
tldr_cal = tldr_cal[:500]

slf5k = load_dataset("JeremyAlain/SLF5K") # the test dataset can be taken from https://huggingface.co/datasets/JeremyAlain/SLF5K
slf5k_test = pd.DataFrame(slf5k['validation'])

Downloading builder script:   0%|          | 0.00/9.38k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

Downloading and preparing dataset summarize_from_feedback/comparisons to /root/.cache/huggingface/datasets/openai___summarize_from_feedback/comparisons/0.0.0/483f970ceb55b926b0a087ef4f678ab1b089bc8174a107a452c6152e88af7ff0...


Downloading data files:   0%|          | 0/23 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/29.7M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/9.11M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/55.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/46.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/25.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/18.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.32M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/8.72M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/15.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/10.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/5.56M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.69M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/28.6M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/15.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/18.1M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/9.64M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.19M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.18M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.00M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/10.1k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/23 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset summarize_from_feedback downloaded and prepared to /root/.cache/huggingface/datasets/openai___summarize_from_feedback/comparisons/0.0.0/483f970ceb55b926b0a087ef4f678ab1b089bc8174a107a452c6152e88af7ff0. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading builder script:   0%|          | 0.00/11.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

Downloading and preparing dataset slf5_k/SLF5K to /root/.cache/huggingface/datasets/JeremyAlain___slf5_k/SLF5K/1.0.0/6b37f332eea04ffa072f2c66e87393132bd68a310796894ee18fb105544d3294...


Downloading data files:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/20.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/836k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.99M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.27M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/4 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating development split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset slf5_k downloaded and prepared to /root/.cache/huggingface/datasets/JeremyAlain___slf5_k/SLF5K/1.0.0/6b37f332eea04ffa072f2c66e87393132bd68a310796894ee18fb105544d3294. Subsequent calls will reuse this data.


  0%|          | 0/4 [00:00<?, ?it/s]

In [8]:
texts = []
for d in tldr_train['summaries']:
    texts.append([d[0]['text'], d[1]['text']])

summaries_df = pd.DataFrame(texts, columns=['summary_0', 'summary_1'])


X_train = pd.DataFrame({
    'subreddit': pd.DataFrame(list(tldr_train['info'].values))['subreddit'],
    'title': pd.DataFrame(list(tldr_train['info'].values))['title'],
    'post': pd.DataFrame(list(tldr_train['info'].values))['post'],
    'summary_0': summaries_df['summary_0'],
    'summary_1': summaries_df['summary_1'],
    'label': tldr_train['choice']
})

X_train.head()

Unnamed: 0,subreddit,title,post,summary_0,summary_1,label
0,relationships,Me [22 F] with my crush [22 M] duration - Flir...,I have two days left with him before we leave ...,I want to show my crush I love him but he's a...,I want to get my crush to notice me as a pers...,1
1,relationship_advice,My cousin [20/f] is in with an odd crowd. I ne...,I [21/m] started noticing my cousin [20/f] act...,My cousin [20/f] has been hanging out with he...,Need help reconnecting with my cousin who's b...,0
2,relationship_advice,My boyfriend [22/m] expects more sex out of me...,Both myself and my SO are in university. We've...,Boyfriend who has no job gets upset I'm too t...,boyfriend can't understand why I don't want t...,1
3,relationships,I [27m] am not sure if I should tell ex [27f] ...,"We were together for 5 years, She decided to b...",What do I say to an ex who is leaving me for ...,still talking to ex in hopes of winning her b...,1
4,relationships,How to overcome crippling insecurity and fear ...,Over the course of our relationship I have fel...,How do I overcome crippling insecurity and fe...,I suffer with anxiety and borderline personal...,1


In [9]:
texts = []
for d in tldr_cal['summaries']:
    texts.append([d[0]['text'], d[1]['text']])

summaries_df = pd.DataFrame(texts, columns=['summary_0', 'summary_1'])

X_cal = pd.DataFrame({
    'subreddit': pd.DataFrame(list(tldr_cal['info'].values))['subreddit'],
    'title': pd.DataFrame(list(tldr_cal['info'].values))['title'],
    'post': pd.DataFrame(list(tldr_cal['info'].values))['post'],
    'summary_0': summaries_df['summary_0'],
    'summary_1': summaries_df['summary_0'],
    'label': tldr_cal['choice']
})

X_cal.head()

Unnamed: 0,subreddit,title,post,summary_0,summary_1,label
0,loseit,A colossal NSV,Some background information - I am a 25 year o...,25yo woman goes from severely unhealthy to ea...,25yo woman goes from severely unhealthy to ea...,0
1,relationships,Accidental Incest- Wtf do I do now?,"I need help reddit like never before, being a ...",I'm in love with a woman who has an older bro...,I'm in love with a woman who has an older bro...,1
2,relationships,My girlfriend [21] are seemingly at wits end w...,"Please don't mind the username, it was a throw...",Girlfriend has seemingly given me short and u...,Girlfriend has seemingly given me short and u...,1
3,offmychest,I just discovered my British coworkers interne...,"Holy shit.\nAnyway, we have been working toget...",I'm not sure if I should ignore my coworker a...,I'm not sure if I should ignore my coworker a...,1
4,relationships,Me[18/M] GF[17/F] She wants to break it off be...,Me and my gf have been dating for about 6 mont...,My girlfriend would rather be alone than be i...,My girlfriend would rather be alone than be i...,0


Here we define the test/shift dataset. To make better simulate our target situation, we select just a small subset of size n.

In [10]:
X_shift = pd.DataFrame({
    'subreddit': slf5k_test['subreddit'],
    'title': slf5k_test['title'],
    'post': slf5k_test['post'],
    'summary_0': slf5k_test['generated_summary_for_comparison_A'],
    'summary_1': slf5k_test['generated_summary_for_comparison_B'],
    'label': ("Summary B" == slf5k_test['comparison_preference']).astype(int)
})

X_shift.head()

Unnamed: 0,subreddit,title,post,summary_0,summary_1,label
0,relationships,Me [23/F] with my ex-boyfriend[22/M] have been...,I've known my now ex boyfriend for over 10 yea...,My ex and I have been broken up for a few week...,Ex-boyfriend and I have been dating for a year...,1
1,relationships,Girlfriend [20F] wasting time and money in col...,GF: 20F\nMe: 22M\nLength of relationship: 8 mo...,"GF is 20F, studying premed and failing grades;...","Girlfriend 20F is studying premed, but is not ...",0
2,books,I just wrote a 200 page science fiction/fantas...,"I just sent the completed, unedited novel to t...","I wrote a novel, paid to register it with the ...",I'm a new author that has completed a 200 page...,0
3,legaladvice,Landlord sent us a cease and desist letter. Wh...,"Dear Reddit,\n\nUsing a throwaway account. Her...",Landlord sent us a cease and desist letter to ...,Landlord is angry we're moving out early and i...,1
4,relationships,My college SO and I [20M & 20F] are getting mo...,"My girlfriend and I are distant, but *we live ...","I'm feeling really distant from my SO, and it'...","GF is distant, I am hurt and angered. Please h...",0


## Model and tokenization

The utilities function is discretized into $k$ levels, bounded within the $[-1,1]$ range. The levels are then predicted using a BERT model. Here's the pseudocode for each data row

```
u_map = [-1,1; k steps]
for i = 0, 1:
  logit_i = BERT(text || x_i)    # in R^k
  rho_i = softmax(logit_i)       # in R^k
  utility_i = dot_product(rho_i, u_map)  # in R
select = sigmoid(utility_1 - utility_0)
loss = cross_entropy(select, choice_label)
```


Note that the model learns a weak ordering of the labels via the scalar utilities mapping and the preference selections.

---

❗For Pablo❗

(1) Reformat to target data schema
```
tokenized_train = {
  "x1": title || post || summary_1,
  "x2": title || post || summary_2,
  "labels" : 0 or 1
}
```
where x1 and x2 are bound by distil-bert's 512 token limit and tokenized appropriately.


In [11]:
# Set DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Note that the model learns a weak ordering of the labels via the scalar utilities mapping and the preference selections.

---

❗For Sen❗

The next two lines are the old version. Next I attach the code that achieves format
```
tokenized_train = {
  "x1": title || post || summary_1,
  "x2": title || post || summary_2,
  "labels" : 0 or 1
}
```
where x1 and x2 are bound by distil-bert's 512 token limit and tokenized appropriately.

In [None]:
'''
X_cal = X_cal.replace(to_replace='None', value=np.nan).dropna()
X_train = X_train.replace(to_replace='None', value=np.nan).dropna()
X_shift = X_shift.replace(to_replace='None', value=np.nan).dropna()
len(X_shift)

indices = []
for i, x in enumerate(X_train['post']):
  if len(tokenizer.encode(x[1:])) + len(tokenizer.encode(X_train['summary_0'][i][1:])) + len(tokenizer.encode(X_train['summary_1'][i][1:])) + len(tokenizer.encode(X_train['title'][i])) < 512:
    indices.append(i)

X_t= X_train.iloc[indices]
print(len(X_t))


indices = []
for i, x in enumerate(X_cal['post']):
  if len(tokenizer.encode(x[1:])) + len(tokenizer.encode(X_cal.iloc[i]['summary_0'][1:])) + len(tokenizer.encode(X_cal.iloc[i]['summary_1'][1:])) + len(tokenizer.encode(X_cal.iloc[i]['title'])) < 512:
    indices.append(i)

X_c= X_cal.iloc[indices]
print(len(X_c))

indices = []
for i, x in enumerate(X_shift['post']):
  if len(tokenizer.encode(x[1:])) + len(tokenizer.encode(X_shift.iloc[i]['summary_0'][1:])) + len(tokenizer.encode(X_shift.iloc[i]['summary_1'][1:])) + len(tokenizer.encode(X_shift.iloc[i]['title'])) < 512:
    indices.append(i)

X_s= X_shift.iloc[indices]
'''


"\nX_cal = X_cal.replace(to_replace='None', value=np.nan).dropna()\nX_train = X_train.replace(to_replace='None', value=np.nan).dropna()\nX_shift = X_shift.replace(to_replace='None', value=np.nan).dropna()\nlen(X_shift)\n\nindices = []\nfor i, x in enumerate(X_train['post']):\n  if len(tokenizer.encode(x[1:])) + len(tokenizer.encode(X_train['summary_0'][i][1:])) + len(tokenizer.encode(X_train['summary_1'][i][1:])) + len(tokenizer.encode(X_train['title'][i])) < 512:\n    indices.append(i)\n\nX_t= X_train.iloc[indices]\nprint(len(X_t))\n\n\nindices = []\nfor i, x in enumerate(X_cal['post']):\n  if len(tokenizer.encode(x[1:])) + len(tokenizer.encode(X_cal.iloc[i]['summary_0'][1:])) + len(tokenizer.encode(X_cal.iloc[i]['summary_1'][1:])) + len(tokenizer.encode(X_cal.iloc[i]['title'])) < 512:\n    indices.append(i)\n\nX_c= X_cal.iloc[indices]\nprint(len(X_c))\n\nindices = []\nfor i, x in enumerate(X_shift['post']):\n  if len(tokenizer.encode(x[1:])) + len(tokenizer.encode(X_shift.iloc[i]['su

In [None]:
'''
X_t['text'] = X_t['title'].apply(lambda x: tokenizer.encode(x, max_length=15, truncation = False)) \
                + X_t['post'].apply(lambda x: tokenizer.encode(x[1:], max_length=239, truncation = False)) \
                + X_t['summary_0'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False)) \
                + X_t['summary_1'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False))
X_t['text'] = X_t['text'].apply(lambda x: tokenizer.decode(x))

X_c['text'] = X_c['title'].apply(lambda x: tokenizer.encode(x, max_length=15, truncation = False)) \
                + X_c['post'].apply(lambda x: tokenizer.encode(x[1:], max_length=239, truncation = False)) \
                + X_c['summary_0'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False)) \
                + X_c['summary_1'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False))
X_c['text'] = X_c['text'].apply(lambda x: tokenizer.decode(x))

X_s['text'] = X_s['title'].apply(lambda x: tokenizer.encode(x, max_length=15, truncation = False)) \
                + X_s['post'].apply(lambda x: tokenizer.encode(x[1:], max_length=239, truncation = False)) \
                + X_s['summary_0'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False)) \
                + X_s['summary_1'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False))
X_s['text'] = X_s['text'].apply(lambda x: tokenizer.decode(x))
'''

"\nX_t['text'] = X_t['title'].apply(lambda x: tokenizer.encode(x, max_length=15, truncation = False))                 + X_t['post'].apply(lambda x: tokenizer.encode(x[1:], max_length=239, truncation = False))                 + X_t['summary_0'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False))                 + X_t['summary_1'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False))\nX_t['text'] = X_t['text'].apply(lambda x: tokenizer.decode(x))\n\nX_c['text'] = X_c['title'].apply(lambda x: tokenizer.encode(x, max_length=15, truncation = False))                 + X_c['post'].apply(lambda x: tokenizer.encode(x[1:], max_length=239, truncation = False))                 + X_c['summary_0'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False))                 + X_c['summary_1'].apply(lambda x: tokenizer.encode(x[1:], max_length=127, truncation = False))\nX_c['text'] = X_c['text'].apply(lambda x: tokenizer.decode(x))\n\nX_s

In [None]:
X_cal = X_cal.replace(to_replace='None', value=np.nan).dropna()
X_train = X_train.replace(to_replace='None', value=np.nan).dropna()
X_shift = X_shift.replace(to_replace='None', value=np.nan).dropna()
len(X_shift)

indices = []
for i, x in enumerate(X_train['post']):
  try:
    if len(tokenizer.encode(X_train['title'][i])) + len(tokenizer.encode(X_train['post'][i])) + len(tokenizer.encode(X_train['summary_0'][i][1:])) < 512 and \
    len(tokenizer.encode(X_train['title'][i])) + len(tokenizer.encode(X_train['post'][i])) + len(tokenizer.encode(X_train['summary_1'][i][1:])) < 512:
      indices.append(i)
  except: pass

X_t= X_train.iloc[indices]
print(len(X_t))


indices = []
for i, x in enumerate(X_cal['post']):
  try:
    if len(tokenizer.encode(X_cal['title'][i])) + len(tokenizer.encode(X_cal['post'][i])) + len(tokenizer.encode(X_cal['summary_0'][i][1:])) < 512 and \
    len(tokenizer.encode(X_cal['title'][i])) + len(tokenizer.encode(X_cal['post'][i])) + len(tokenizer.encode(X_cal['summary_1'][i][1:])) < 512:
      indices.append(i)
  except: pass

X_c= X_cal.iloc[indices]
print(len(X_c))

indices = []
for i, x in enumerate(X_shift['post']):
  try:
    if len(tokenizer.encode(X_cal['title'][i])) + len(tokenizer.encode(X_cal['post'][i])) + len(tokenizer.encode(X_cal['summary_0'][i][1:])) < 512 and \
    len(tokenizer.encode(X_cal['title'][i])) + len(tokenizer.encode(X_cal['post'][i])) + len(tokenizer.encode(X_cal['summary_1'][i][1:])) < 512:
      indices.append(i)
  except: pass

X_s= X_shift.iloc[indices]

934
443


In [None]:
X_t['x0'] = X_t['title'].apply(lambda x: tokenizer.encode(x, truncation = False)) \
                + X_t['post'].apply(lambda x: tokenizer.encode(x[1:], truncation = False)) \
                + X_t['summary_0'].apply(lambda x: tokenizer.encode(x[1:], truncation = False))
X_t['x0'] = X_t['x0'].apply(lambda x: tokenizer.decode(x))
X_t['x1'] = X_t['title'].apply(lambda x: tokenizer.encode(x, truncation = False)) \
                + X_t['post'].apply(lambda x: tokenizer.encode(x[1:], truncation = False)) \
                + X_t['summary_1'].apply(lambda x: tokenizer.encode(x[1:], truncation = False))
X_t['x1'] = X_t['x1'].apply(lambda x: tokenizer.decode(x))

X_c['x0'] = X_c['title'].apply(lambda x: tokenizer.encode(x, truncation = False)) \
                + X_c['post'].apply(lambda x: tokenizer.encode(x[1:], truncation = False)) \
                + X_c['summary_0'].apply(lambda x: tokenizer.encode(x[1:], truncation = False))
X_c['x0'] = X_c['x0'].apply(lambda x: tokenizer.decode(x))
X_c['x1'] = X_c['title'].apply(lambda x: tokenizer.encode(x, truncation = False)) \
                + X_c['post'].apply(lambda x: tokenizer.encode(x[1:], truncation = False)) \
                + X_c['summary_1'].apply(lambda x: tokenizer.encode(x[1:], truncation = False))
X_c['x1'] = X_c['x1'].apply(lambda x: tokenizer.decode(x))

X_s['x0'] = X_s['title'].apply(lambda x: tokenizer.encode(x, truncation = False)) \
                + X_s['post'].apply(lambda x: tokenizer.encode(x[1:], truncation = False)) \
                + X_s['summary_0'].apply(lambda x: tokenizer.encode(x[1:], truncation = False))
X_s['x0'] = X_s['x0'].apply(lambda x: tokenizer.decode(x))
X_s['x1'] = X_s['title'].apply(lambda x: tokenizer.encode(x, truncation = False)) \
                + X_s['post'].apply(lambda x: tokenizer.encode(x[1:], truncation = False)) \
                + X_s['summary_1'].apply(lambda x: tokenizer.encode(x[1:], truncation = False))
X_s['x1'] = X_s['x1'].apply(lambda x: tokenizer.decode(x))

In [None]:
X_s.head()

Unnamed: 0,subreddit,title,post,summary_0,summary_1,label,x0,x1
0,relationships,Me [23/F] with my ex-boyfriend[22/M] have been...,I've known my now ex boyfriend for over 10 yea...,My ex and I have been broken up for a few week...,Ex-boyfriend and I have been dating for a year...,1,[CLS] me [ 23 / f ] with my ex - boyfriend [ 2...,[CLS] me [ 23 / f ] with my ex - boyfriend [ 2...
2,books,I just wrote a 200 page science fiction/fantas...,"I just sent the completed, unedited novel to t...","I wrote a novel, paid to register it with the ...",I'm a new author that has completed a 200 page...,0,[CLS] i just wrote a 200 page science fiction ...,[CLS] i just wrote a 200 page science fiction ...
3,legaladvice,Landlord sent us a cease and desist letter. Wh...,"Dear Reddit,\n\nUsing a throwaway account. Her...",Landlord sent us a cease and desist letter to ...,Landlord is angry we're moving out early and i...,1,[CLS] landlord sent us a cease and desist lett...,[CLS] landlord sent us a cease and desist lett...
4,relationships,My college SO and I [20M & 20F] are getting mo...,"My girlfriend and I are distant, but *we live ...","I'm feeling really distant from my SO, and it'...","GF is distant, I am hurt and angered. Please h...",0,[CLS] my college so and i [ 20m & 20f ] are ge...,[CLS] my college so and i [ 20m & 20f ] are ge...
5,dating_advice,Dating and disclosure,"Hi dating_advice long time lurker, first time ...",Dating multiple people and want to disclose re...,How do I disclose to F2 that I am seeing other...,0,[CLS] dating and disclosure [SEP] [CLS] i dati...,[CLS] dating and disclosure [SEP] [CLS] i dati...


Use something like
```
X_shift.drop('title', inplace=True, axis=1)
```
to drop `title`, `post`, `summary_0`, `summary_0`.

 



In [None]:
X_t.drop('summary_0', inplace=True, axis=1)
X_t.drop('summary_1', inplace=True, axis=1)
X_t.drop('title', inplace=True, axis=1)
X_t.drop('post', inplace=True, axis=1)
X_t.drop('subreddit', inplace=True, axis=1)

X_c.drop('summary_0', inplace=True, axis=1)
X_c.drop('summary_1', inplace=True, axis=1)
X_c.drop('title', inplace=True, axis=1)
X_c.drop('post', inplace=True, axis=1)
X_c.drop('subreddit', inplace=True, axis=1)

X_s.drop('summary_0', inplace=True, axis=1)
X_s.drop('summary_1', inplace=True, axis=1)
X_s.drop('title', inplace=True, axis=1)
X_s.drop('post', inplace=True, axis=1)
X_s.drop('subreddit', inplace=True, axis=1)

In [None]:
X_s.head()

In [None]:
X_train_dataset = Dataset.from_pandas(X_t)
X_cal_dataset = Dataset.from_pandas(X_c)
X_shift_dataset = Dataset.from_pandas(X_s)

In [None]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    x0_tokenized = tokenizer(examples["x0"], truncation=True, max_length=512)
    x1_tokenized = tokenizer(examples["x1"], truncation=True, max_length=512)
    return {
        "x0": x0_tokenized["input_ids"],
        "x1": x1_tokenized["input_ids"],
        "label": examples["label"]
    }

tokenized_train = X_train_dataset.map(preprocess_function, batched=True)
tokenized_cal = X_cal_dataset.map(preprocess_function, batched=True)
tokenized_shift = X_shift_dataset.map(preprocess_function, batched=True)

tokenized_train.remove_columns('__index_level_0__')
tokenized_cal.remove_columns('__index_level_0__')
tokenized_shift.remove_columns('__index_level_0__')

In [None]:
tokenized_cal.remove_columns('__index_level_0__')
tokenized_cal

Dataset({
    features: ['label', 'x0', 'x1', '__index_level_0__'],
    num_rows: 443
})

In [None]:
test = tokenized_train.to_pandas().head(1)
test

Unnamed: 0,label,x0,x1,__index_level_0__
0,1,"[101, 101, 2033, 1031, 2654, 1049, 1033, 2007,...","[101, 101, 2033, 1031, 2654, 1049, 1033, 2007,...",0


In [None]:
'''
X_t2 = X_t
X_t2["text_a"] = [x[:50] for x in X_t2["text"]]
X_t2["text_b"] = [x[50:100] for x in X_t2["text"]]
output_a = X_t2["text_a"].apply(lambda x : tokenizer(x, truncation=True, max_length=512))
output_b = X_t2["text_b"].apply(lambda x : tokenizer(x, truncation=True, max_length=512))
X_t2 = pd.concat([
    X_t2["label"].rename("labels"),
    pd.DataFrame.from_dict(dict(output_a), orient="index").rename(columns={"attention_mask": "attention_mask_a", "input_ids": "input_ids_a"}),
    pd.DataFrame.from_dict(dict(output_b), orient="index").rename(columns={"attention_mask": "attention_mask_b", "input_ids": "input_ids_b"})
], axis=1)
'''

In [None]:
X_t2 = X_t
output_a = X_t2["x0"].apply(lambda x : tokenizer(x, truncation=True, max_length=512))
output_b = X_t2["x1"].apply(lambda x : tokenizer(x, truncation=True, max_length=512))
X_t2 = pd.concat([
    X_t2["label"].rename("labels"),
    pd.DataFrame.from_dict(dict(output_a), orient="index").rename(columns={"attention_mask": "attention_mask_a", "input_ids": "input_ids_a"}),
    pd.DataFrame.from_dict(dict(output_b), orient="index").rename(columns={"attention_mask": "attention_mask_b", "input_ids": "input_ids_b"})
], axis=1)

In [None]:
tokenized_train_2 = Dataset.from_pandas(X_t2)

In [None]:
tokenized_train_2.to_pandas()

## Model training

In [None]:
# Prepare the training data
tokenized_train = X_train_dataset.map(preprocess_function, batched=True)
train_dataset = tokenized_train.remove_columns(["x0", "x1"]).with_format("torch")

len(train_dataset)

In [None]:
repo_name = "summaries-comparisons-distilbert-TLDR"

### Code generated by ChatGPT ###

import torch
from transformers import DistilBertModel, DistilBertTokenizer, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

k = 5
batch_size = 16
 

# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
base_model = DistilBertModel.from_pretrained("distilbert-base-uncased")

def vectorized_sum_diagonals(m: torch.Tensor): 
    ''' Takes a torch tensor of dimensions (b, k_, k_) and computes a vector of size 
    (b, 2k-1) computed summing along the diagonals of the second and third dimension
    '''
    b, k_, _ = m.size()
    result = torch.zeros(b, 2*k_ - 1)
    
    # Main diagonal
    result[:, k_-1] = m[:, torch.arange(k_), torch.arange(k_)].sum(dim=-1)
    
    # Off diagonals
    for i in range(1, k_):
        result[:, k_-1-i] = m[:, torch.arange(k_-i), torch.arange(i, k_)].sum(dim=-1)
        result[:, k_+i-1]  = m[:, torch.arange(i, k_), torch.arange(k_-i)].sum(dim=-1)
        
    return result 

# Define the custom model class
class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.base_model = base_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(base_model.config.hidden_size, k)
        self.device = device

        # Define the utility mapping vector
        u_map = torch.linspace(0, 1, k, dtype=torch.float, device = device)
        #u_map = u_map.reshape(batch_size, -1)  # Shape (9, 1)
        self.u_map = u_map.to(device) 
        u_map_2km1 = torch.linspace(-1, 1, 2*k-1, dtype=torch.float, device = device)
        #u_map_2km1 = u_map_2km1.reshape(batch_size, -1)
        self.u_map_2km1 = u_map_2km1.to(device) 

    def forward(self, attention_mask, x0_ids, x1_ids, labels=None):
        # Move tensors to GPU (if available)
        x0_ids = x0_ids.to(self.device)
        x1_ids = x1_ids.to(self.device)
        attention_mask = attention_mask.to(self.device)
        labels = labels.to(self.device)
        labels = 2*labels - 1

        # Encode the inputs using the base model
        model_outputs_0 = self.base_model(input_ids=x0_ids, attention_mask=attention_mask)
        model_outputs_1 = self.base_model(input_ids=x1_ids, attention_mask=attention_mask)
        #print(f'Model_outputs_shape: {model_outputs_1.last_hidden_state.size()}') 

        # Compute logits
        logits_0 = self.classifier(model_outputs_0.last_hidden_state[:, 0].squeeze())
        logits_1 = self.classifier(model_outputs_1.last_hidden_state[:, 0].squeeze())
        #print(f'Logits_shape: {logits_1.size()}')

        # Compute probabilities
        probs_0 = torch.softmax(logits_0, dim=1)
        probs_1 = torch.softmax(logits_1, dim=1)

        # Calculate the utilities
        utility_0 = torch.einsum('bi,i->b',probs_0, self.u_map)  # probs shape (N, 5), u_map shape (9, 5) 
        utility_1 = torch.einsum('bi,i->b',probs_1, self.u_map)

        # Calculate the probabilities
        probs_delta_u = torch.einsum('bn,bm->bnm', probs_0, probs_1)
        #print(f'probs_delta_u matrix: {probs_delta_u.size()}')
        probs_delta_u = vectorized_sum_diagonals(probs_delta_u)
        #print(f'probs_delta_u size: {probs_delta_u.size()}')
        probs_delta_u = probs_delta_u.to(self.device)  # Add this line
        #print(f'probs_delta_u: {probs_delta_u}')

        # Calculate the expected utility
        expected_utility = torch.einsum('bi,i->b', probs_delta_u, self.u_map_2km1).to(self.device)
        #print(f'expected_utility: {expected_utility}')
        #print(f'expected_utility size: {expected_utility.size()}')

        #print(f'labels: {labels}')

        # Define the loss function
        #loss = nn.MSELoss()(expected_utility, labels.float())
        loss = torch.einsum('b,b->b', expected_utility.view(-1) - labels.float().view(-1), expected_utility.view(-1) - labels.float().view(-1))

        # Calculate the predicted probabilities
        predicted_probs = torch.softmax(logits_0, dim=1)

        return loss



# Prepare the text inputs for the model
def preprocess_function(examples):
    x0_tokenized = tokenizer(examples["x0"], truncation=True, padding="max_length", max_length=512)
    x1_tokenized = tokenizer(examples["x1"], truncation=True, padding="max_length", max_length=512)

    return {
        "input_ids": x0_tokenized["input_ids"],
        "attention_mask": x0_tokenized["attention_mask"],
        "x0_ids": x0_tokenized["input_ids"],
        "x1_ids": x1_tokenized["input_ids"],
        "labels": examples["label"],
    }

# Prepare the training data
tokenized_train = X_train_dataset.map(preprocess_function, batched=True)
train_dataset = tokenized_train.remove_columns(["x0", "x1"]).with_format("torch")

# Define the training arguments
training_args = TrainingArguments(
    output_dir=repo_name,
    evaluation_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=500,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none"  # Disable reporting to avoid unnecessary output
)

# Define the compute metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    logits = torch.from_numpy(pred.predictions)  # Convert numpy array to tensor
    preds = torch.round(torch.sigmoid(logits))

    return {"accuracy": accuracy_score(labels, preds)}


# Define the Trainer
trainer = Trainer(
    model=CustomModel(),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # Use the same dataset for evaluation for simplicity (change as needed)
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Map:   0%|          | 0/918 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss


In [None]:
# Define DistilBERT as our base model:
k = 9
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=k) #, id2label=id2label, label2id=label2id)

In [None]:
## Modifying DataCollatorWithPadding
## https://github.com/huggingface/transformers/blob/d95a32cc60e5d92b4bf08cd805c6b0db7b4100cc/src/transformers/data/data_collator.py#L212:L260

from transformers.tokenization_utils_base import PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy
from typing import Union, Optional, List, Dict, Any
from dataclasses import dataclass

@dataclass
class SymmetricDataCollatorWithPadding:

  tokenizer: PreTrainedTokenizerBase # Tokenizer type
  padding: Union[bool, str, PaddingStrategy] = True
  max_length: Optional[int] = None
  pad_to_multiple_of: Optional[int] = None
  return_tensors: str = "pt"

  def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:

        feature_a = []
        feature_b = []
        for feature in features:
          feature_a.append({
              "labels": feature["labels"],
              "input_ids": feature["input_ids_a"],
              "attention_mask": feature["attention_mask_a"]
          })
          feature_b.append({
              "labels": feature["labels"],
              "input_ids": feature["input_ids_b"],
              "attention_mask": feature["attention_mask_b"]
          })

        batch_a = self.tokenizer.pad(
            feature_a,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )

        batch_b = self.tokenizer.pad(
            feature_b,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )

        batch = {
            "labels" : batch_a["labels"],
            "input_ids_a" : batch_a["input_ids"],
            "input_ids_b" : batch_b["input_ids"],
            "attention_mask_a" : batch_a["input_ids"],
            "attention_mask_b" : batch_b["input_ids"]
        }

        return batch

data_collator = SymmetricDataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
import torch.nn.functional as F

# alternate trainer object to get reward model from object ranking
# https://huggingface.co/docs/transformers/main_classes/trainer
# https://github.com/lvwerra/trl/blob/3cfe194e34ee5fb4e2bcf0935e9a6dcd1eebea8e/trl/trainer/utils.py#L50
# https://github.com/lvwerra/trl/blob/3cfe194e34ee5fb4e2bcf0935e9a6dcd1eebea8e/trl/trainer/reward_trainer.py#L35

DELIMITER = [102, 101] # '[SEP][CLS]'
ENDER = [102, 102]

class CustomTrainer(Trainer):

  def __init__(self, *args, **kwargs):

    super().__init__(*args, **kwargs)
    self.k = k      # k = 9 by default. see cell above
    self.u_map = torch.linspace(-1.0, 1.0, self.k, device=model.device).unsqueeze(0)


  ## NOTE: This is insanely inefficient. Consider writing a custom DataCollator...
  def input_split(self, inputs):

    assert inputs.dim() == 2
    batch_size = inputs.shape[0]
    pad_to_length = inputs.shape[1]

    # helper fns
    def batch_find(x: torch.Tensor, tokens: list, batch_size: int = batch_size, default_end: bool = False):
      output = []
      for j in range(batch_size):
        idx = [i for i,c in enumerate(x[j]) if tuple(x[j][i:i+2]) == tuple(tokens)]
        if default_end:
          idx = [x[j].shape[0]] if len(idx) == 0 else idx
        output.append(idx)

      return output

    def concat_and_pad(xs: tuple, dim: int = 0, target_length:int = pad_to_length):
      buffer = torch.cat(xs, dim)
      pad_length = target_length - buffer.shape[0]
      return torch.cat((buffer, torch.zeros(pad_length, device=model.device)), dim).to(dtype=torch.int32)


    # main 
    split_idx = batch_find(inputs, DELIMITER)
    end_idx = batch_find(inputs, ENDER, default_end=True)

    # Something is wrong with the length control
    for i, a in enumerate(end_idx):
      if len(a) == 0:
        print(a)
        print(inputs[i])

    output_1 = None
    output_2 = None

    for i in range(batch_size):

      idx1 = split_idx[i][-2]
      idx2 = split_idx[i][-1]
      end = end_idx[i][0]

      prefix = inputs[i][:idx1+1]
      y1 = inputs[i][idx1+1: idx2 + 1]
      y2 = inputs[i][idx2+1: end + 1]
      cap  = torch.tensor([102], device=model.device)

      buffer_1 = concat_and_pad((prefix, y1, cap), dim=0).unsqueeze(0)
      buffer_2 = concat_and_pad((prefix, y1, cap), dim=0).unsqueeze(0)

      torch.cuda.empty_cache()

      if output_1 is None:
        output_1 = buffer_1
      else:
        output_1 = torch.cat((output_1, buffer_1), dim = 0)

      if output_2 is None:
        output_2 = buffer_2
      else:
        output_2 = torch.cat((output_2, buffer_2), dim = 0)

    return output_1, output_2

  def compute_util(self, model, y, mask):

    rho = F.softmax(model(input_ids = y, attention_mask = mask).get("logits"))
    return torch.sum( rho * self.u_map , dim = 1)


  def compute_loss(self, model, inputs, return_outputs=False):

    labels = inputs.get("labels")
    input_a = inputs.get("input_ids_a")
    input_b = inputs.get("input_ids_b")
    mask_a = inputs.get("attention_mask_a")
    mask_b = inputs.get("attention_mask_b")

    r1, r2 = [self.compute_util(model, y, mask) for (y, mask) in [(input_a, mask_a), (input_b, mask_b)]]
    p = F.sigmoid(r1 - r2)
    loss = F.binary_cross_entropy(p, labels.float())
    
    return loss


# Define accuracy and f1 as the metrics:
def compute_metrics(eval_pred):
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")
    
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}


In [None]:
# Define a new Trainer with all the objects we constructed so far
repo_name = "summaries-comparisons-distilbert-TLDR"

import os
os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy="epoch", 
    push_to_hub=False,
    remove_unused_columns=False
)

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_2,
    eval_dataset=tokenized_cal,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train and push to hub
torch.cuda.empty_cache()
trainer.train()

In [None]:
# Save the trained model
model.save_pretrained("./my_model")

# Push the trained model to the Huggingface Hub
model.push_to_hub("my-username/my-model", use_auth_token="my-token")

# Learn then test procedure & split conformal prediction

In this section we want to implement a conformal risk control method, that says that

**Conformal risk control theorem:**
Let $(x_{m+1}, y_{m+1}),\ldots, (x_{m+n}, y_{m+n})$ from the calibration set and $(x_{t}, y_{t})$ be i.i.d. samples from some distribution. Then, we can use the Learn the Test procedure to find a threshold $\alpha$ that ensures that
$$
    \mathbb{P}( \mathbb{E}(\mathcal{L}(C_\alpha(x_i),y_i)) \leq \lambda) \geq 1-\delta,
$$
for some desired values of $\lambda, \delta$.

In [None]:
lambda_ = 2
delta = 0.05
epsilon = 0.05

In [None]:
from transformers import pipeline
from tqdm import tqdm

classifier = pipeline("sentiment-analysis", model="summaries-comparisons-distilbert-TLDR", tokenizer=tokenizer, return_all_scores=True)

#todo: convert into a function that can be deployed easily for the test dataset.
def get_scores(X_cal):
  scores = dict(zip(['LABEL_'+str(i) for i in range(9)],[[] for _ in range(9)]))
  for e in tqdm(range(len(X_cal.index))):
    s = classifier(X_cal['text'][e])[0]
    dic = {}
    for d in s:
      dic[d['label']] = d['score']
    for k, v in dic.items():
      scores[k].append(v)
  return scores

scores = get_scores(X_cal)


In [None]:
len(X_cal),len(X_cal.index), len(scores['LABEL_0'])
for i in range(9):
  X_cal['LABEL_'+str(i)] = scores['LABEL_' +str(i)]

In [None]:
X_cal.head()

We can now compute the loss, at a given value of $\alpha$.

In [None]:
# Define conformal_loss
def compute_loss(X_cal, alpha):
  numerator = 0
  denominator = 0
  for i in range(9):
    numerator += (i - X_cal['label'])**2 * X_cal['LABEL_'+str(i)] * (X_cal['LABEL_'+str(i)] > alpha)
    denominator += X_cal['LABEL_'+str(i)] * (X_cal['LABEL_'+str(i)] > alpha)
  X_cal['loss_alpha_'+str(alpha)] = numerator / denominator
  return X_cal

#X_cal = compute_loss(X_cal, alpha = 0.1)

In [None]:
X_cal.head()

Finally we train the conformal model that fixes the value of 
$\alpha$ such that the prediction achieves error smaller than $\lambda$. To do so it implements the `Learn the Test' procedure which 
1. Generates a set of hypothesis.
2. Compute p-values according to $e^{-2n(\lambda- l)^2}$.
3. Correct the p-values using a family wise error correction.

For demonstration purpuses we will use $\lambda = 2$, with $\delta = 0.05$ too.

In [None]:
def learn_the_test(compute_loss, X_cal, lambda_, delta):

    """ Computes the p-value of the hypothesis that the risk is lower than some \lambda."""

    pvalues = []

    # For each y
    alphas = np.linspace(0, 0.5, 51, endpoint = True)

    for alpha in alphas:

        # Compute the mean loss
        X_cal = compute_loss(X_cal, alpha = alpha)
        losses = [x for x in X_cal['loss_alpha_'+str(alpha)] if not np.isnan(x)
        loss = np.mean(losses)


        # Step 1: Compute p-values associated to such loss
        p_value = np.exp(-2*len(X_cal)*(max(lambda_ - loss, 0))**2)

        pvalues.append(p_value)

    # Step 2: Family-wise error correction
    reject, pvals_corrected, _, bonferroni_delta = multipletests(pvalues, delta, method = 'bonferroni')

    return alphas, pvals_corrected, reject

alphas, pvals_corrected, reject = learn_the_test(compute_loss, X_cal, lambda_ = lambda_, delta = delta)
for a, p, r in zip(alphas, pvals_corrected, reject):
  print("{:.2f}".format(a), "{:.4e}".format(p), r)

In [None]:
alphas_ =  [alphas[i] for i in range(len(alphas)) if reject[i] == True and reject[i-1] == False] 
alpha = alphas_[0]
print(alpha)

Thus, we can use any of the values of $\alpha$ where the null hypothesis is rejected to satisfy the guarantees of the conformal risk control theorem.

In [None]:
X_cal.head()

Once we have trained the language model, we compute the losses
$$\lambda_i = \mathcal{A}(z_1,\ldots,z_{train})(x_i,y_i),$$
where $x_i,y_i$ are elements of the calibration set. Similarly, we compute 
$$\lambda_t = \mathcal{A}(z_1,\ldots,z_{train})(x_t,y_t),$$
Here, we use the notation
$$ \mathcal{A}(z_1,\ldots,z_{train})(x_i,y_i) = \mathcal{L}(f(x_i),y_i).$$
We will use the conformal loss for some value of $\alpha$ where the risk is controlled such as $\alpha=0.6$.

The first way is to detect distribution shfit is compare the loss $\lambda_i$ in between the calibration and test sets, check whether 
$$
\mathbb{P}(\mathcal{L}(C_\alpha(x_i),y_i) \leq \lambda) \geq 1-\delta
$$
holds.

In [None]:
# We first compute the loss in the shifted set at an alpha value of our choosing
scores = get_scores(X_shift)

len(X_shift),len(X_shift.index), len(scores['LABEL_0'])
for i in range(9):
  X_shift['LABEL_'+str(i)] = scores['LABEL_' +str(i)]

X_shift = compute_loss(X_shift, alpha = alpha)

cal_losses = list(X_cal['loss_alpha_'+str(alpha)])
shift_losses = list(X_shift['loss_alpha_'+str(alpha)])

In [None]:
# Implement t-test
result = ttest_1samp(shift_losses, lambda_, alternative = 'greater')
if result.pvalue < epsilon:
  print('The losses in the test set violate the conformal risk control theorem, thus indicating a distribution shift with pvalue {}'.format(result.pvalue))
if result.pvalue < epsilon:
  print('The losses in the test set violate the conformal risk control theorem, thus indicating a distribution shift with pvalue {}'.format(result.pvalue))

Alternatively, we can check for statistically meaningful differences in the loss mean between the calibration and test sets

In [None]:
shift_losses

In [None]:
result = ttest_ind(cal_losses, shift_losses, equal_var = False, alternative = 'less')

pvalue = result.pvalue

if pvalue < epsilon:
  print('There is a difference in the mean loss between the calibration and test distributions, with pvalue {}'.format(pvalue))
else:
  print('pvalue is {}'.format(pvalue))

# Inductive conformal prediction

The second way is to use the properties of inductive conformal predictors. 

**Inductive conformal predictor**: An inductive conformal predictor $\Gamma^{\epsilon}(z_1,\ldots,z_m;z_{m+1},\ldots,z_{m+n})$ is defined as the set of possible values of $y$ for the test data point $x_t$ such that the corresponding $\lambda_t$ conforms to the rest of $\lambda_j$ of the calibration set:
$$
    \Gamma^{\epsilon}(z_1,\ldots,z_m;z_{m+1},\ldots,z_{m+n})(x_t) = \{y| p_t\geq \epsilon\},
$$
with the p-value
$$
    p_t = \frac{|\{j = m+1,\ldots,m+n|  \lambda_j \geq \lambda_t\}|+1}{l+1}.
$$

If we want $p_t\geq\epsilon$, this is approximately equivalent to finding an upper bound on $\lambda_t$ of the kind $\lambda_t = q_{1-\epsilon}(\{\lambda_j\})$, eg, the $1-\epsilon$ quantile of the $\{\lambda_j\}$ set. 
Inductive conformal predictors have the property that, if the calibration and test data are sampled i.i.d., then~\cite[Proposition 4.1]{vovk2005algorithmic}
$$
    \mathbb{P}(y_t\notin \Gamma^{\epsilon}(z_1,\ldots,z_m;z_{m+1},\ldots,z_{m+n})(x_t)\leq \epsilon).
$$

In [None]:
# We first compute the loss in the shifted set at an alpha value of our choosing
scores = get_scores(X_shift)

len(X_shift),len(X_shift.index), len(scores['LABEL_0'])
for i in range(9):
  X_shift['LABEL_'+str(i)] = scores['LABEL_' +str(i)]

X_shift = compute_loss(X_shift, alpha = alpha)

cal_losses = list(X_cal['loss_alpha_'+str(alpha)])
shift_losses = list(X_shift['loss_alpha_'+str(alpha)])



In [None]:
sorted_cal_losses = np.sort(cal_losses)

# Using again the same pvalue as above

lambda_upper_bound = sorted_cal_losses[int((1-epsilon)*(len(cal_losses)+1)-1)]
lambda_upper_bound = np.quantile(cal_losses, ((1-epsilon)*(len(cal_losses)+1)-1)/len(cal_losses))

lambda_upper_bound_simplified = np.quantile(cal_losses, 1-epsilon)

print(lambda_upper_bound, lambda_upper_bound_simplified, lambda_upper_bound_simplified)

Now we want to check whether
$$
    \mathbb{P}(y_t\notin \Gamma^{\epsilon}(z_1,\ldots,z_m;z_{m+1},\ldots,z_{m+n})(x_t)\leq \epsilon).
$$
holds according to a statistical test.

In [None]:
# Implement binomial test
shift_losses_compare = np.array(shift_losses) < lambda_upper_bound
result = binomtest(sum(shift_losses_compare), len(shift_losses), 1-delta, alternative = 'less')
if result.pvalue < epsilon:
  print('The losses in the test set violate the inductive theorem, thus indicating a distribution shift with pvalue {}'.format(result.pvalue))
else:
  print('The pvalue is {}'.format(result.pvalue))

In [None]:
print(np.mean(cal_losses), np.mean(shift_losses))

# Full conformal prediction (not necessary)


To perform full conformal prediction we divide the data in three sets: training (used to train the language model), calibration (used to train the conformal model outputing $\alpha$) and finally the test set where shifts may have happened.

Once we have trained the language model, we compute the losses
$$\lambda_i = \mathcal{A}(z_1,\ldots,z_{test},\ldots,z_{n_{cal}})(x_i,y_i),$$
where $x_i,y_i$ are elements of the calibration set, and 
$$ \mathcal{A}(z_1,\ldots,z_{test},\ldots,z_{n_{cal}})(x_i,y_i) = L(f(x_i),y_i),$$ 
where the calibration model $f$ was trained over $z_1,\ldots,z_{test},\ldots,z_{n_{cal}}$. Similarly is done to compute $\lambda_y$, which uses $L(f(x_{test}), y_{test})$, with the calibration model trained over the calibration set.

Once we have this, we compute the values of $y$ for with the p-value
$$p^y:= \frac{|\{i= m+1,\ldots,l|\lambda_i\leq \lambda_y\}|+1}{l-m+1}$$
is greater than $1-\epsilon$.

Finally, we would like to perform a series of binomial tests, with different calibration tests, to try to reject the null hypothesis that
$$\mathbb{P}_{X,Y}^l\left(\mathbb{P}_{X,Y} (y\in \Gamma(z_1,\ldots,z_l)(x))\geq 1-\epsilon\right)\geq 1-\delta$$
where $z_i = (a_i, b_i), y_i \in X_{train+calib}, Y_{train+calib}$