[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Yj5dWp0YoIaphOCXNr58fXkdGAnjkha8?usp=sharing)

# Bias in BERT  

In [1]:
%%capture
!pip install transformers[sentencepiece]

In [2]:
from transformers import pipeline

# To check for bias, we mask the pronoun in a sentence and look at the predicted probabilities.
# Here the profession is nurse --> the prediction is biased towards female ("she").
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
results = fill_mask("The nurse needed a drink because [MASK] was tired after a long day's work at the hospital.")
results

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.9641987085342407,
  'token': 2016,
  'token_str': 'she',
  'sequence': "the nurse needed a drink because she was tired after a long day's work at the hospital."},
 {'score': 0.0224923025816679,
  'token': 2002,
  'token_str': 'he',
  'sequence': "the nurse needed a drink because he was tired after a long day's work at the hospital."},
 {'score': 0.0014032472390681505,
  'token': 1045,
  'token_str': 'i',
  'sequence': "the nurse needed a drink because i was tired after a long day's work at the hospital."},
 {'score': 0.0012861390132457018,
  'token': 2009,
  'token_str': 'it',
  'sequence': "the nurse needed a drink because it was tired after a long day's work at the hospital."},
 {'score': 0.0006937917787581682,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': "the nurse needed a drink because everyone was tired after a long day's work at the hospital."}]

In [3]:
# Here the profession is doctor --> the prediction is biased towards male ("he").
results = fill_mask("The doctor needed a drink because [MASK] was tired after a long day's work at the hospital.")
results

[{'score': 0.931253969669342,
  'token': 2002,
  'token_str': 'he',
  'sequence': "the doctor needed a drink because he was tired after a long day's work at the hospital."},
 {'score': 0.044910334050655365,
  'token': 2016,
  'token_str': 'she',
  'sequence': "the doctor needed a drink because she was tired after a long day's work at the hospital."},
 {'score': 0.002265272429212928,
  'token': 1045,
  'token_str': 'i',
  'sequence': "the doctor needed a drink because i was tired after a long day's work at the hospital."},
 {'score': 0.002123521873727441,
  'token': 2009,
  'token_str': 'it',
  'sequence': "the doctor needed a drink because it was tired after a long day's work at the hospital."},
 {'score': 0.0010061532957479358,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': "the doctor needed a drink because everyone was tired after a long day's work at the hospital."}]

In [4]:
# Here we use an office scenario.
# The profession is receptionist --> the model treats it as female ("she").
results = fill_mask("We had a meeting with our company receptionist and [MASK] was not happy.")
results

[{'score': 0.8818802237510681,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'we had a meeting with our company receptionist and she was not happy.'},
 {'score': 0.029698358848690987,
  'token': 1045,
  'token_str': 'i',
  'sequence': 'we had a meeting with our company receptionist and i was not happy.'},
 {'score': 0.01622089371085167,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'we had a meeting with our company receptionist and he was not happy.'},
 {'score': 0.00825280137360096,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': 'we had a meeting with our company receptionist and everyone was not happy.'},
 {'score': 0.0028577768243849277,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'we had a meeting with our company receptionist and it was not happy.'}]

In [5]:
# Here the profession is president --> the model predicts male ("he").
results = fill_mask("We had a meeting with our company president and [MASK] was not happy.")
results

[{'score': 0.9263393878936768,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'we had a meeting with our company president and he was not happy.'},
 {'score': 0.05635707825422287,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'we had a meeting with our company president and she was not happy.'},
 {'score': 0.0031763967126607895,
  'token': 1045,
  'token_str': 'i',
  'sequence': 'we had a meeting with our company president and i was not happy.'},
 {'score': 0.0009640404605306685,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'we had a meeting with our company president and it was not happy.'},
 {'score': 0.0006586547242477536,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': 'we had a meeting with our company president and everyone was not happy.'}]

In [6]:
# Here the profession is programmer --> the model predicts male ("he").
results = fill_mask("The programmer stepped away from the computer because [MASK] wanted a break.")
results

[{'score': 0.9594998955726624,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'the programmer stepped away from the computer because he wanted a break.'},
 {'score': 0.025105174630880356,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'the programmer stepped away from the computer because she wanted a break.'},
 {'score': 0.006808234378695488,
  'token': 2027,
  'token_str': 'they',
  'sequence': 'the programmer stepped away from the computer because they wanted a break.'},
 {'score': 0.004370279144495726,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'the programmer stepped away from the computer because it wanted a break.'},
 {'score': 0.0007986065465956926,
  'token': 1045,
  'token_str': 'i',
  'sequence': 'the programmer stepped away from the computer because i wanted a break.'}]

In [None]:
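# From these examples, we can see that the model links professions to gender:
#   - nurse and receptionist (roles stereotyped as lower-skill, lower-pay) are linked to women
#   - doctor, president and programmer (roles stereotyped as higher-skill, higher-pay) are linked to men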
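
The cells above repeat the same probe by hand for each profession. A minimal sketch of how that could be looped is shown below, assuming the `fill_mask` pipeline object created earlier in the notebook is passed in. The function name `compare_professions` and the `{profession}` placeholder are hypothetical choices for illustration. The `targets` argument of the fill-mask pipeline restricts the returned predictions to the listed tokens, so `he` and `she` are always scored even when one of them would fall outside the default top 5.

```python
def compare_professions(fill_mask, template, professions, pronouns=("he", "she")):
    """Score each pronoun for every profession substituted into the template.

    fill_mask   -- a callable such as the transformers fill-mask pipeline
    template    -- a sentence containing "{profession}" and the "[MASK]" token
    professions -- iterable of profession names to substitute
    """
    table = {}
    for profession in professions:
        sentence = template.format(profession=profession)
        # Ask only for the pronouns of interest instead of the top-5 tokens.
        results = fill_mask(sentence, targets=list(pronouns))
        table[profession] = {r["token_str"]: r["score"] for r in results}
    return table
```

With the notebook's pipeline, a call such as `compare_professions(fill_mask, "The {profession} needed a drink because [MASK] was tired after a long day's work at the hospital.", ["nurse", "doctor"])` would return one `{pronoun: score}` dictionary per profession. Passing the pipeline as an argument keeps the helper independent of any particular model.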