# The problem with base rates
Base rates are the rates at which the model in question predicts different classes for **neutral** sentences. A neutral sentence is one that does not imply any bias (what counts as bias depends on context). For instance, "People from [MASK] eat, drink, and sleep" is neutral in the sense that the missing keyword does not affect the meaning of the sentence, since all humans need to eat, drink, and sleep. Hence, if the model fills the mask with "China" with probability 0.1 and "Lithuania" with probability 0.01, the difference in rates is not because the model is biased but rather because the model has simply seen "China" more often than "Lithuania". One can therefore derive base rates either from the data or by using masking tasks.
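As a rough illustration of deriving base rates "from the data", the sketch below counts how often each term is mentioned in a corpus and converts the counts to shares. The corpus, the term list, and the helper name `base_rates_from_corpus` are all made up for this example; a real estimate would need the text the model was actually trained on.

```python
import re
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration only).
corpus = [
    "China exports electronics to many countries.",
    "The delegation from China arrived on Monday.",
    "Lithuania joined the eurozone in 2015.",
    "Trade between China and Germany grew last year.",
]
# Hypothetical list of terms whose base rates we want to estimate.
terms = ["china", "lithuania", "germany"]


def base_rates_from_corpus(sentences, terms):
    """Return each term's share of all term mentions in the corpus."""
    wanted = set(terms)
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split into alphabetic tokens.
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token in wanted:
                counts[token] += 1
    total = sum(counts.values())
    if total == 0:
        return {term: 0.0 for term in terms}
    return {term: counts[term] / total for term in terms}


rates = base_rates_from_corpus(corpus, terms)
print(rates)
```

In this toy corpus "china" gets the largest share purely because it is mentioned most often, which is the frequency effect described above.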


In [3]:
import json

import pandas as pd
import requests
import textstat
from transformers import BertModel, BertTokenizer, pipeline

# Fill-mask pipeline used to predict the token hidden behind [MASK]
unmasker = pipeline('fill-mask', model='bert-large-cased-whole-word-masking')

Some weights of the model checkpoint at bert-large-cased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [1]:
from collections import defaultdict

# Neutral prompts: any country name fits the [MASK] slot equally well
neutral_sentences = [
    "[MASK] is name of a country.",
    "I am a country named [MASK].",
    "[MASK] is a country",
    "Name of a country is [MASK].",
    "Name a country. [MASK] is a country.",
    "[MASK] is a country on earth.",
    "Name a country. [MASK] is a country on earth.",
    "You are a country named [MASK].",
    "They are from a country named [MASK].",
]

# Sum the predicted score of each token over all neutral sentences
base_rate = defaultdict(float)
for sentence in neutral_sentences:
    predictions = unmasker(sentence)  # top predictions for the [MASK] token
    for prediction in predictions:
        base_rate[prediction['token_str'].lower().strip()] += prediction['score']


In [8]:
print(base_rate)

defaultdict(<class 'float'>, {'afghanistan': 0.26134091801941395, 'this': 0.20562006533145905, 'it': 0.23169870488345623, 'country': 0.028564760461449623, 'flag': 0.026396427303552628, 'america': 0.04062959924340248, 'russia': 0.05847242288291454, 'romania': 0.025461118668317795, 'germany': 0.04398747347295284, 'hungary': 0.021298104897141457, 'india': 0.11685409396886826, 'australia': 0.14475287683308125, 'azerbaijan': 0.05705759860575199, 'latin': 0.046762265264987946, 'variable': 0.0466257780790329, 'english': 0.040855757892131805, 'optional': 0.0372978113591671, 'international': 0.03460304066538811, 'ireland': 0.06681019067764282, 'turkey': 0.05857176519930363, 'italy': 0.025413205847144127, 'there': 0.4495214521884918, 'name': 0.10852814465761185, 'that': 0.057182133197784424, 'mongolia': 0.024394305422902107, 'israel': 0.07051292806863785, 'albania': 0.046807706356048584, 'armenia': 0.03200598061084747})
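The summed scores above include tokens that are not countries ("there", "it", "this") and do not add up to 1. One possible cleanup, sketched below, keeps only tokens on an allow-list and rescales the rest into a distribution. The allow-list `known_countries` and the helper `normalize_country_scores` are assumptions made for this example, and only a subset of the printed scores is copied in.

```python
# Subset of the summed scores printed above, copied verbatim.
summed_scores = {
    "afghanistan": 0.26134091801941395,
    "there": 0.4495214521884918,
    "it": 0.23169870488345623,
    "india": 0.11685409396886826,
    "australia": 0.14475287683308125,
}

# Hypothetical allow-list; in practice this could come from a country dataset.
known_countries = {"afghanistan", "india", "australia"}


def normalize_country_scores(scores, allowed):
    """Drop tokens outside the allow-list and rescale the rest to sum to 1."""
    kept = {token: score for token, score in scores.items() if token in allowed}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {token: score / total for token, score in kept.items()}


country_base_rate = normalize_country_scores(summed_scores, known_countries)
for token, rate in sorted(country_base_rate.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {rate:.3f}")
```

Normalizing after filtering makes the base rates comparable across prompts, at the cost of ignoring any probability mass the model put on non-country tokens.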


In [10]:
# Using next sentence prediction to calculate the base rate
from collections import defaultdict

import pandas as pd
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

countries = list(pd.read_csv('demo_data.csv')['country'].unique())
base_rate_from_NSP = defaultdict(float)
for country in countries:
    text = "Name a country."
    text2 = f"{country} is a country."
    inputs = tokenizer(text, text2, return_tensors='pt')
    # In BertForNextSentencePrediction, label 0 means the second sentence
    # follows the first; label 1 means the second sentence is random
    labels = torch.LongTensor([0])
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs, labels=labels)
    # Store the loss for the "is next sentence" label (lower loss = more likely)
    base_rate_from_NSP[country] = outputs.loss.item()



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
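The values stored per country by the NSP cell are losses, not rates. For a single example, the cross-entropy loss is the negative log of the probability assigned to whichever label was supplied, so `exp(-loss)` recovers that probability. A minimal sketch of the conversion follows; the loss values and the names `country_a` and `country_b` are placeholders, not model output.

```python
import math

# Placeholder losses for illustration; real values come from the NSP model.
example_losses = {"country_a": 0.05, "country_b": 1.2}


def losses_to_probabilities(losses):
    """Map each single-example cross-entropy loss to exp(-loss)."""
    return {name: math.exp(-loss) for name, loss in losses.items()}


probabilities = losses_to_probabilities(example_losses)
for name, probability in probabilities.items():
    print(f"{name}: {probability:.3f}")
```

A lower loss maps to a higher probability, so ranking countries by `exp(-loss)` reverses the ranking by loss.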
