<a href="https://colab.research.google.com/github/RudyVenguswamy/notebooks/blob/master/Race_Removal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Removing Race from Text Documents
The following is a new version of a tool built by Rudy Venguswamy to remove race and gender from a document.

Disclaimer:
Simply removing race from documents does not go far enough to undo the systemic problems that lead to bias in decision making, or the ways race is used to evaluate and judge humans in the United States. This tool is just one part of a larger change institutions should make to provide equity, justice, and rights to those disenfranchised by the normative and oppressive systems of our country.

I hope this tool is used not to turn a blind eye to racism and bias, but to highlight their subtle presence and to consider what we must do to stop them.

In [None]:
! git clone https://github.com/huggingface/transformers
%cd transformers
! pip install .
! pip install -r ./examples/requirements.txt

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import torch
import torch.nn as nn

from transformers import BertTokenizer, BertForQuestionAnswering #BertQA
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
modelqa = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad', output_hidden_states= True)



In [None]:
#choose a topic by commenting one out.
#note: only 'race' has a reference corpus defined later in this notebook.
topic = 'race'
#topic = 'gender'

#what are some words you do not want removed even if they are related to the topic?
exception_list = ['racism','racist','race']

#Threshold for closeness of embeddings
THRESHOLD = 0.85

In [None]:
def generate_vector(hidden):
  '''
  Generates the sentence embedding from BERT: a 1024-dimensional vector representing a sentence.
  '''
  #Found empirically that starting 23 layers from the end performs best for a sentence embedding.
  #Note: the slice below stops before the final layer, so 22 layers are summed.
  layer_depth = -23
  vector = []
  #collect the per-token hidden states of each selected layer (batch dimension dropped)
  for i in hidden[layer_depth:-1]:
    vector.append(i.detach().numpy()[0])
  #sum over layers and tokens to get a single 1024-dimensional vector
  vector = np.array(vector).sum(axis =(0,1)).reshape(1024)
  return vector

def compute_corpus(corpus):
  #Precompute the embedding representation of the topic of interest to save time on comparisons later.

  corpus_embeddings = []
  for sentence in corpus:
    input_ids = tokenizer.encode(sentence)
    #segment ids: 0 up to and including the first [SEP] token (id 102), 1 afterwards
    token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]
    #does prediction in order to generate hidden embeddings of each statement
    _, _, hiddens_race = modelqa(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
    corpus_embeddings.append(generate_vector(hiddens_race))

  #stacks the per-sentence vectors into a single (num_sentences, 1024) array
  corpus_embeddings = np.array(corpus_embeddings)
  return corpus_embeddings

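if topic=='race':
    corpus = ["The person's race is White.","The person's race is Black.","The person's race is Hispanic.","The person's race is Asian.","The person's race is American Indian."]
else:
    #no reference sentences are defined for other topics yet, so fail with a clear message
    raise ValueError("No reference corpus is defined for topic: " + topic)
corpus_embeddings = compute_corpus(corpus)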
print(corpus_embeddings.shape)


general_exception_list = np.array(['masked']) #necessary for code to function
exception_list = np.append(np.array(exception_list), general_exception_list) #combines exception list specified by user along with code necessary list
  

(5, 1024)
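
To make the pooling in `generate_vector` more concrete, here is a minimal sketch that uses random NumPy arrays as stand-ins for BERT hidden states. The number of hidden states (25), the 7-token input, and the helper name `pool_hidden_states` are assumptions for illustration only and are not part of the notebook. It shows how slicing `[-23:-1]` and summing over the layer and token axes produces one 1024-dimensional vector.

```python
import numpy as np

def pool_hidden_states(hidden, layer_depth=-23):
    """Sum hidden states over the selected layers and over all tokens.

    `hidden` is a sequence of arrays shaped (batch, tokens, width).
    The slice [layer_depth:-1] skips the final layer, mirroring the notebook.
    """
    selected = [layer[0] for layer in hidden[layer_depth:-1]]  # drop batch axis
    stacked = np.array(selected)       # shape: (layers, tokens, width)
    return stacked.sum(axis=(0, 1))    # shape: (width,)

rng = np.random.default_rng(0)
# Hypothetical stand-in: 25 hidden states for a 7-token input of width 1024.
fake_hidden = [rng.normal(size=(1, 7, 1024)) for _ in range(25)]

vector = pool_hidden_states(fake_hidden)
num_layers_used = len(fake_hidden[-23:-1])
print(vector.shape, num_layers_used)
```

Because the slice ends at `-1`, the final hidden state is not included in the sum. Whether that is intended is not stated in the notebook.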


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def sim_criteria(hiddens, topic):
  candidate = generate_vector(hiddens)
  similarity = cosine_similarity(corpus_embeddings, candidate.reshape(1,-1)).mean()
  print('sim: ', similarity)
  #the similarity threshold only applies to the race topic; other topics always pass
  if topic != 'race':
    return True
  return similarity > THRESHOLD #HYPER PARAM
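
The check above can be illustrated without BERT. The sketch below is a rough, self-contained approximation of the same idea: average the cosine similarity between a candidate vector and every reference vector, then compare the mean against a threshold. The 3-dimensional toy vectors and the names `mean_cosine_similarity` and `TOY_THRESHOLD` are made up for illustration; the notebook itself uses scikit-learn's `cosine_similarity` on 1024-dimensional embeddings.

```python
import numpy as np

def mean_cosine_similarity(references, candidate):
    """Average cosine similarity between `candidate` and each row of `references`."""
    references = np.asarray(references, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    ref_norms = np.linalg.norm(references, axis=1)
    cand_norm = np.linalg.norm(candidate)
    sims = references @ candidate / (ref_norms * cand_norm)
    return float(sims.mean())

# Toy 3-dimensional "embeddings"; the real ones have 1024 dimensions.
toy_references = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
close_candidate = [1.0, 0.5, 0.0]
far_candidate = [0.0, 0.0, 1.0]

TOY_THRESHOLD = 0.85  # assumed to match THRESHOLD in the configuration cell
close_score = mean_cosine_similarity(toy_references, close_candidate)
far_score = mean_cosine_similarity(toy_references, far_candidate)
print(close_score > TOY_THRESHOLD, far_score > TOY_THRESHOLD)
```

In this toy setup the first candidate clears the threshold and the orthogonal one does not. With real embeddings the scores tend to sit much closer together, as the `sim:` values printed by the notebook suggest, so the threshold likely needs tuning per model.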

In [None]:
def remove_term(text, topic):

  #selects the question to ask for the chosen topic
  if topic == 'race':
    question = "What is the subjects ethnicity?"
  elif topic == 'name':
    question = "Name?"
  elif topic == 'location':
    question = "Where are they from?"
  else:
    return 'Topic Unsupported at this time.'

  question = question.lower()

  #encodes the question and text as a pair, as BERT QA expects, and converts them into IDs
  input_ids = tokenizer.encode(question,text)
  #segment ids: 0 up to and including the first [SEP] token (id 102), 1 for the text that follows
  token_type_ids = [0 if i <= input_ids.index(102) else 1 for i in range(len(input_ids))]

  #does prediction
  start_scores, end_scores, hiddens = modelqa(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))

  #converts ids to tokens that are word like
  all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
  start = torch.argmax(start_scores)
  end = torch.argmax(end_scores)+1

  #answer of question
  answer = ' '.join(all_tokens[start : end])
  print('Answer: ', answer)

  exception_criteria = False
  for word in exception_list:
    if word in answer:
      exception_criteria = True

#is this worth including?
  # if word not in general_exception_list[0]:
  #   print("Stopping because algorithm was instructed not to remove word: ", word)


  #only continue removing words if the embedding sim is close to a race related word and is not in exception list
  if sim_criteria(hiddens, topic) and not exception_criteria and len(answer) > 0: 
    
    for i in range(start,end):
      all_tokens[i] = general_exception_list[0]
    new_id = tokenizer.convert_tokens_to_ids(all_tokens)
    re_q_text = tokenizer.decode(new_id, skip_special_tokens= True)
  
    re_text = re_q_text.split(question)[1]
    #iterate again and remove more words
    
    return remove_term(re_text, topic)

  else:
    new_id = tokenizer.convert_tokens_to_ids(all_tokens)
    return tokenizer.decode(new_id, skip_special_tokens= True).split(question)[1]
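
The control flow of `remove_term` (find a span, mask it, repeat until a stop condition holds) can be sketched without a model. In the sketch below, `find_answer_span` is a hypothetical stand-in for the BERT question-answering step: it simply looks tokens up in a set. The stop condition is "no span found, or `max_rounds` reached" rather than the notebook's exception-list and similarity checks. Both function names and the `max_rounds` guard are assumptions for illustration.

```python
def find_answer_span(tokens, target_words):
    """Hypothetical stand-in for the QA model: return (start, end) of the
    first token found in `target_words`, or None when nothing matches."""
    for i, tok in enumerate(tokens):
        if tok in target_words:
            return i, i + 1
    return None

def mask_until_done(tokens, target_words, mask_token="masked", max_rounds=10):
    """Repeatedly mask the located span until no further span is found."""
    tokens = list(tokens)  # work on a copy so the input is left untouched
    for _ in range(max_rounds):
        span = find_answer_span(tokens, target_words)
        if span is None:
            break
        start, end = span
        tokens[start:end] = [mask_token] * (end - start)
    return tokens

sample = "roy comes from an upper middle class white family".split()
result = mask_until_done(sample, target_words={"white"})
print(" ".join(result))
```

An explicit loop with a round limit is used here instead of recursion so that a span finder which never stops cannot recurse without bound.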

In [None]:
#paragraph from sample text
text = "Roy is lowkey racist. He constantly goes around virtue signaling, but when push comes to shove, he isn't a true ally to the movement. \
I think it's pretty telling- that Roy comes from an upper middle class, white family. He's from the uptown area of Manhattan. His parents own an entire floor there."
res = remove_term(text, 'race')
res

Answer:  white
sim:  0.8501217
Answer:  upper middle class , masked family . he ' s from the uptown area of manhattan
sim:  0.849675


" roy is lowkey racist. he constantly goes around virtue signaling, but when push comes to shove, he isn't a true ally to the movement. i think it's pretty telling - that roy comes from an upper middle class, masked family. he's from the uptown area of manhattan. his parents own an entire floor there."

In [None]:
res2 = remove_term(res, 'name')
res2

Answer:  roy
sim:  0.8418867
Answer:  roy
sim:  0.8421305
Answer:  masked
sim:  0.8439621


" masked is lowkey racist. he constantly goes around virtue signaling, but when push comes to shove, he isn't a true ally to the movement. i think it's pretty telling - that masked comes from an upper middle class, masked family. he's from the uptown area of manhattan. his parents own an entire floor there."

In [None]:
res3 = remove_term(res2, 'location')
res3

Answer:  uptown area of manhattan
sim:  0.8506276
Answer:  masked masked masked masked
sim:  0.8481865


" masked is lowkey racist. he constantly goes around virtue signaling, but when push comes to shove, he isn't a true ally to the movement. i think it's pretty telling - that masked comes from an upper middle class, masked family. he's from the masked masked masked masked. his parents own an entire floor there."