<a href="https://colab.research.google.com/github/CvetanV/BERT_NLP/blob/main/Transformers_Text_classification_with_BERT_bias.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Yj5dWp0YoIaphOCXNr58fXkdGAnjkha8?usp=sharing)

# Bias in BERT  

In [7]:
%%capture
!pip install "transformers[sentencepiece]"

In [8]:
from transformers import pipeline

# Fill in the masked token using the probabilities predicted by the BERT model.
# The predictions below show a gender bias that has to be addressed:
# not every nurse or receptionist is a woman, and not every doctor or president is a man.

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
results = fill_mask("The nurse needed a drink because [MASK] was tired after a long day's work at the hospital.")
results

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.9641987085342407,
  'token': 2016,
  'token_str': 'she',
  'sequence': "the nurse needed a drink because she was tired after a long day's work at the hospital."},
 {'score': 0.022492364048957825,
  'token': 2002,
  'token_str': 'he',
  'sequence': "the nurse needed a drink because he was tired after a long day's work at the hospital."},
 {'score': 0.0014032499166205525,
  'token': 1045,
  'token_str': 'i',
  'sequence': "the nurse needed a drink because i was tired after a long day's work at the hospital."},
 {'score': 0.0012861432041972876,
  'token': 2009,
  'token_str': 'it',
  'sequence': "the nurse needed a drink because it was tired after a long day's work at the hospital."},
 {'score': 0.0006937937578186393,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': "the nurse needed a drink because everyone was tired after a long day's work at the hospital."}]

In [9]:
results = fill_mask("The doctor needed a drink because [MASK] was tired after a long day's work at the hospital.")
results

[{'score': 0.9312541484832764,
  'token': 2002,
  'token_str': 'he',
  'sequence': "the doctor needed a drink because he was tired after a long day's work at the hospital."},
 {'score': 0.044910211116075516,
  'token': 2016,
  'token_str': 'she',
  'sequence': "the doctor needed a drink because she was tired after a long day's work at the hospital."},
 {'score': 0.002265266375616193,
  'token': 1045,
  'token_str': 'i',
  'sequence': "the doctor needed a drink because i was tired after a long day's work at the hospital."},
 {'score': 0.002123510232195258,
  'token': 2009,
  'token_str': 'it',
  'sequence': "the doctor needed a drink because it was tired after a long day's work at the hospital."},
 {'score': 0.0010061506181955338,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': "the doctor needed a drink because everyone was tired after a long day's work at the hospital."}]
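
The two outputs above can be summarised with a single number per sentence. The following is a minimal sketch, not part of the original notebook: the helper names `pronoun_scores` and `he_she_ratio` are hypothetical, and the scores are copied by hand from the nurse and doctor results so the snippet runs without loading the model.

```python
def pronoun_scores(results, pronouns=("he", "she")):
    """Map each pronoun to its fill-mask score, or 0.0 if it is not among the candidates."""
    by_token = {r["token_str"]: r["score"] for r in results}
    return {p: by_token.get(p, 0.0) for p in pronouns}


def he_she_ratio(results):
    """Return score('he') / score('she'); values above 1 mean 'he' is preferred."""
    scores = pronoun_scores(results)
    if scores["she"] == 0.0:
        return float("inf")  # 'she' did not appear among the returned candidates
    return scores["he"] / scores["she"]


# Scores copied from the nurse and doctor outputs above (other fields omitted).
nurse = [{"token_str": "she", "score": 0.9641987085342407},
         {"token_str": "he", "score": 0.022492364048957825}]
doctor = [{"token_str": "he", "score": 0.9312541484832764},
          {"token_str": "she", "score": 0.044910211116075516}]

print(f"nurse:  he/she = {he_she_ratio(nurse):.3f}")
print(f"doctor: he/she = {he_she_ratio(doctor):.3f}")
```

With these scores the ratio is about 0.023 for the nurse sentence and about 20.7 for the doctor sentence, even though the two sentences differ only in the occupation word. The same helpers should work on the full list returned by `fill_mask`, since they only read the `token_str` and `score` fields.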

In [10]:
results = fill_mask("We had a meeting with our company receptionist and [MASK] was not happy.")
results

[{'score': 0.8818802237510681,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'we had a meeting with our company receptionist and she was not happy.'},
 {'score': 0.02969830296933651,
  'token': 1045,
  'token_str': 'i',
  'sequence': 'we had a meeting with our company receptionist and i was not happy.'},
 {'score': 0.016220862045884132,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'we had a meeting with our company receptionist and he was not happy.'},
 {'score': 0.008252793923020363,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': 'we had a meeting with our company receptionist and everyone was not happy.'},
 {'score': 0.002857769839465618,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'we had a meeting with our company receptionist and it was not happy.'}]

In [11]:
results = fill_mask("We had a meeting with our company president and [MASK] was not happy.")
results

[{'score': 0.9263390898704529,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'we had a meeting with our company president and he was not happy.'},
 {'score': 0.05635732412338257,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'we had a meeting with our company president and she was not happy.'},
 {'score': 0.0031764169689267874,
  'token': 1045,
  'token_str': 'i',
  'sequence': 'we had a meeting with our company president and i was not happy.'},
 {'score': 0.0009640411008149385,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'we had a meeting with our company president and it was not happy.'},
 {'score': 0.0006586576928384602,
  'token': 3071,
  'token_str': 'everyone',
  'sequence': 'we had a meeting with our company president and everyone was not happy.'}]

In [12]:
results = fill_mask("The programmer stepped away from the computer because [MASK] wanted a break.")
results

[{'score': 0.9594999551773071,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'the programmer stepped away from the computer because he wanted a break.'},
 {'score': 0.025105176493525505,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'the programmer stepped away from the computer because she wanted a break.'},
 {'score': 0.006808215286582708,
  'token': 2027,
  'token_str': 'they',
  'sequence': 'the programmer stepped away from the computer because they wanted a break.'},
 {'score': 0.00437026284635067,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'the programmer stepped away from the computer because it wanted a break.'},
 {'score': 0.000798603578004986,
  'token': 1045,
  'token_str': 'i',
  'sequence': 'the programmer stepped away from the computer because i wanted a break.'}]
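
The cells above repeat the same experiment by hand for each occupation. One possible way to run it as a loop is sketched below, assuming the `targets` argument of the fill-mask pipeline, which restricts scoring to the given tokens. The function `compare_occupations` and the `{occupation}` placeholder in the template are hypothetical names introduced for illustration; they are not part of the original notebook or of the library.

```python
def compare_occupations(fill_mask, template, occupations, pronouns=("he", "she")):
    """Score only the given pronouns at the [MASK] position for each occupation.

    `fill_mask` is expected to behave like a transformers fill-mask pipeline:
    called with a sentence and `targets`, it returns a list of dicts that
    contain at least 'token_str' and 'score'.
    """
    rows = {}
    for occupation in occupations:
        # str.format only replaces {occupation}; the literal [MASK] is left untouched.
        sentence = template.format(occupation=occupation)
        results = fill_mask(sentence, targets=list(pronouns))
        rows[occupation] = {r["token_str"]: r["score"] for r in results}
    return rows


# Example call (assumption: `fill_mask` is the pipeline created earlier in the notebook):
# rows = compare_occupations(
#     fill_mask,
#     "The {occupation} stepped away from the computer because [MASK] wanted a break.",
#     ["programmer", "nurse", "doctor", "receptionist", "president"],
# )
# for occupation, scores in rows.items():
#     print(occupation, scores)
```

Passing the pipeline in as an argument keeps the helper independent of any particular checkpoint, so the same comparison can be repeated with a different model to see whether the gap changes.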