# Data Supplementation (suppl)

Method
1. Identify the identity terms with the most disproportionate distributions across labels
    1. Stem/lemmatize the dataset
    2. For each lemma in the synthetic test set:
        1. Check its distribution across labels in the dataset, i.e. the difference between its frequency in toxic comments and its frequency overall
        2. Also check length differences
    3. What "disproportionate" means here:
        1. “Identity terms affected by the false positive bias are disproportionately used in toxic comments in our training data. For example, the word ‘gay’ appears in 3% of toxic comments but only 0.5% of comments overall.”
        2. So the quantity to compare is the frequency of each identity term in toxic comments versus its frequency overall
2. Add non-toxic examples containing the identity terms that appear disproportionately across labels in the original dataset
    1. Use Wikipedia data, which is assumed to be non-toxic
    2. Add enough that the balance is in line with the prior distribution for the overall dataset
        1. E.g. until “gay” appears about as often in toxic comments as in comments overall (3% vs. 0.5% in the quoted example)
3. Consider matching text lengths as well, since CNNs could be sensitive to length
    1. “toxic comments tend to be shorter” (Dixon et al. 2018)
4. This is supposed to reduce false positives. The opposite (adding toxic examples) might also work, but toxic comments are harder to find unless they are taken from places that are supposedly toxic (e.g. “roast me”)


## Imports

In [299]:
# set cwd
import os
os.chdir("g:\\My Drive\\ITC, 5th semester (Thesis)\\Code\\Github_code\\toxicity_detection")

# imports
import pandas as pd
# from random import choice, choices
# from collections import Counter
import matplotlib.pyplot as plt
# from string import punctuation
# # import spacy
from spacy import displacy
from tqdm import tqdm
from utils import load_dkhate
# from typing import Dict
import pickle
import dacy
# import utils
# import nltk
# import re
# import string
from wiki_scraper import scrape_wiki_text
tqdm.pandas()

## Functions

In [2]:
def lemmatize_text(text:str) -> str:
    """Returns a lemmatized version of the text or itself if the string is empty."""
    if len(text) > 0:
        doc = nlp(text)
        lemmas = [token.lemma_ for token in doc]
        lemmatized_text = " ".join(lemmas)
        return lemmatized_text
    else:
        return text

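def occurs_in(target: str, text: str) -> int:
    """Returns 1 if the target occurs as a whole whitespace-separated token in the text, else 0.

    An int (not a bool) is returned so that results can be summed into counts."""
    return int(target in text.split())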

## Load DaCy model

In [3]:
# load daCy model (medium works fine)
nlp = dacy.load("da_dacy_medium_trf-0.2.0") # takes around 4 minutes the first time

In [4]:
# test that it works as expected 
doc = nlp("Mit navn er Maja. Jeg bor på Bispebjerg, men er fra Næstved.") 
print("Token     \tLemma\t\tPOS-tag\t\tEntity type")
for tok in doc: 
    print(f"{str(tok).ljust(10)}:\t{str(tok.lemma_).ljust(10)}\t{tok.pos_}\t\t{tok.ent_type_}")
displacy.render(doc, style="ent")

Token     	Lemma		POS-tag		Entity type
Mit       :	Mit       	DET		
navn      :	navn      	NOUN		
er        :	være      	AUX		
Maja      :	Maja      	PROPN		PER
.         :	.         	PUNCT		
Jeg       :	jeg       	PRON		
bor       :	bo        	VERB		
på        :	på        	ADP		
Bispebjerg:	Bispebjerg	PROPN		LOC
,         :	,         	PUNCT		
men       :	men       	CCONJ		
er        :	være      	VERB		
fra       :	fra       	ADP		
Næstved   :	Næstved   	PROPN		LOC
.         :	.         	PUNCT		


## Load preprocessed training data

In [5]:
# load data splits 
_, _, y_train_orig, _ = load_dkhate(test_size=0.2)
with open(os.getcwd()+"/data/X_orig_preproc.pkl", "rb") as f:
    content = pickle.load(f)

X_train_orig = content["X_train"]
train_orig = pd.DataFrame([X_train_orig, y_train_orig]).T
train_orig.tail()

| id | tweet | label |
|:---:|:---:|:---:|
| 2378 | hørt | 0 |
| 1879 | reaktion svensker | 0 |
| 42 | hey champ smide link ser hearthstone henne | 0 |
| 457 | melder vold voldtægt viser sandt beviser diver... | 1 |
| 3108 | betaler omkring mb kb får nok tættere kb kb be... | 0 |


In [6]:
# lemmatize the texts
train_orig["lemmas"] = train_orig["tweet"].progress_apply(lemmatize_text)

100%|██████████| 2631/2631 [05:56<00:00,  7.39it/s]


In [7]:
# split into toxic, non-toxic and all
toxic_text = train_orig[train_orig["label"] == 1]["lemmas"]
nontoxic_text = train_orig[train_orig["label"] == 0]["lemmas"]
all_text =  train_orig["lemmas"]

NUM_TOXIC = len(toxic_text)
NUM_NONTOXIC = len(nontoxic_text)
NUM_TOTAL = len(all_text)

toxic_text.head()

id
1174    scanne lortet pc markere tage underskrift ny d...
3301    kunne klarer fyr stort se venn vej samme spor ...
1390    fuck meget sol varme lille regn please dansk å...
799     hvorfor fucking stor helvede fejre kristn hell...
900     ingen udlænding ved grænse heller kriminell ku...
Name: lemmas, dtype: object

## Load identity terms

In [8]:
# load identity terms
identities = pd.read_excel(os.getcwd()+"/data/identity_terms.xlsx")
print(len(set(identities["identity_lemma"])), "unique identity lemmas")
identities.tail()

45 unique identity lemmas


| | identity_term | identity_lemma |
|:---:|:---:|:---:|
| 155 | transpersonerne | transperson |
| 156 | transvestitterne | transvestit |
| 157 | transerne | trans |
| 158 | androgynerne | androgyn |
| 159 | hermafroditterne | hermafrodit |


In [9]:
# lemmatize the identity terms
identities["lemmatized"] = identities["identity_term"].progress_apply(lemmatize_text)
print(len(set(identities["lemmatized"])), "unique lemmatized identity terms")
identities.tail()

100%|██████████| 160/160 [00:08<00:00, 18.21it/s]

133 unique lemmatized identity terms





| | identity_term | identity_lemma | lemmatized |
|:---:|:---:|:---:|:---:|
| 155 | transpersonerne | transperson | transperson |
| 156 | transvestitterne | transvestit | transvestitterne |
| 157 | transerne | trans | transe |
| 158 | androgynerne | androgyn | androgynerne |
| 159 | hermafroditterne | hermafrodit | hermafroditterne |


In [10]:
# create map from lemmatized word to the actual lemma
lemmatized_2_lemma = dict(zip(identities["lemmatized"], identities["identity_lemma"]))
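The table above shows that DaCy leaves some definite plural forms unlemmatized (e.g. `transvestitterne`), which is why counts are later mapped back to the curated `identity_lemma` and summed. Below is a minimal sketch of that map-then-aggregate step. Three rows come from the table; the row mapping `transvestit` to itself, all `_toy` names, and the counts are made up for illustration.

```python
import pandas as pd

# Rows from the identity-term table, plus one hypothetical row ("transvestit" -> "transvestit")
identities_toy = pd.DataFrame({
    "lemmatized": ["transperson", "transvestitterne", "transvestit", "transe"],
    "identity_lemma": ["transperson", "transvestit", "transvestit", "trans"],
})
lemmatized_2_lemma_toy = dict(zip(identities_toy["lemmatized"], identities_toy["identity_lemma"]))

# Made-up occurrence counts per lemmatized form (not real data)
counts = pd.DataFrame({
    "lemmatized_identity": ["transperson", "transvestitterne", "transvestit", "transe"],
    "total_count": [3, 1, 2, 4],
})

# Map each lemmatized form to its curated lemma, then sum forms that share a lemma
counts["lemma"] = counts["lemmatized_identity"].map(lemmatized_2_lemma_toy)
per_lemma = counts.groupby("lemma")["total_count"].sum()
print(per_lemma)
```

Note that `dict(zip(...))` keeps only the last value for a repeated key, so this mapping is only safe as long as no lemmatized form belongs to two different curated lemmas.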

## Test scraper

In [300]:
content = scrape_wiki_text("https://da.wikipedia.org/wiki/Transk%C3%B8nnethed")
print("_"*100)
for text in content:
    print(text)

Successfully scraped the webpage with the title: "Transkønnethed"
____________________________________________________________________________________________________
Transkønnethed er en betegnelse for personer, der har en kønsidentitet eller et kønsudtryk, der adskiller sig fra deres fødselskøn.[1][2][3] Nogle transkønnede, som ønsker medicinsk hjælp til at overgå fra et køn til et andet, identificerer sig som transseksuelle.[4][5] Transkønnet, ofte forkortet til blot trans, er også et paraplybegreb: Udover at omfatte personer, hvis kønsidentitet er det modsatte af deres fødselskøn (dvs. transmænd og transkvinder), kan det også anvendes om personer, hvis kønsudtryk ikke er eksklusivt maskulint eller feminint (personer, som er ikke-binære eller genderqueer, heriblandt bikønnede, pankønnede, genderfluid og akønnede).[2][6][7] Blandt andre definitioner af transkønnet er også at inkludere personer, der tilhører et tredje køn, eller konceptualisere transkønnede som et tredje køn.[8][9] Be

## Perform Data Supplementation

### Frequency of identity terms

In [11]:
# test function
print("This should return False. Result:", occurs_in("mor", "elsker din humor"))
print("This should return True.  Result:", occurs_in("mor", "hans mor er pænt sød"))

This should return False. Result: 0
This should return True.  Result: 1


In [210]:
# count how many texts these terms occur in
lemmatized_identities = list(set(identities["lemmatized"]))
occur_in_n_texts = {"lemmatized_identity": lemmatized_identities, "toxic_count": [], "nontoxic_count":[], "total_count":[]}

for lemma in lemmatized_identities:
    occur_in_n_texts["toxic_count"].append(toxic_text.apply(lambda x: occurs_in(target=lemma, text=x)).sum())
    occur_in_n_texts["nontoxic_count"].append(nontoxic_text.apply(lambda x: occurs_in(target=lemma, text=x)).sum())
    occur_in_n_texts["total_count"].append(all_text.apply(lambda x: occurs_in(target=lemma, text=x)).sum())

In [214]:
# create df with these occurrence numbers
occurrence_df = pd.DataFrame(occur_in_n_texts)

# map back to actual lemma and aggregate duplicates
occurrence_df["lemma"] = occurrence_df["lemmatized_identity"].map(lemmatized_2_lemma)
occurrence_df = occurrence_df.groupby("lemma").agg({"toxic_count": "sum", "nontoxic_count": "sum", "total_count": "sum"}).reset_index()

# calculate percentages
occurrence_df["toxic_pct"] = (occurrence_df["toxic_count"]/NUM_TOXIC)*100 
occurrence_df["nontoxic_pct"] = (occurrence_df["nontoxic_count"]/NUM_NONTOXIC)*100 
occurrence_df["total_pct"] = (occurrence_df["total_count"]/NUM_TOTAL)*100 

# calculate differences
occurrence_df["tox_total_diff"] = occurrence_df["toxic_pct"] - occurrence_df["total_pct"]
occurrence_df["tox_total_abs_diff"] = abs(occurrence_df["toxic_pct"] - occurrence_df["total_pct"])

# sort by difference
sorted_occurrence_df = occurrence_df.sort_values("tox_total_diff", ascending=False).reset_index(drop=True)

# display rows where toxic pct != total pct
sorted_occurrence_df[sorted_occurrence_df["tox_total_diff"] != 0].round(2)

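| | lemma | toxic_count | nontoxic_count | total_count | toxic_pct | nontoxic_pct | total_pct | tox_total_diff | tox_total_abs_diff |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | mand | 16 | 57 | 73 | 4.6 | 2.5 | 2.77 | 1.82 | 1.82 |
| 1 | kvinde | 7 | 26 | 33 | 2.01 | 1.14 | 1.25 | 0.76 | 0.76 |
| 2 | fyr | 1 | 0 | 1 | 0.29 | 0.0 | 0.04 | 0.25 | 0.25 |
| 3 | mandfolk | 1 | 0 | 1 | 0.29 | 0.0 | 0.04 | 0.25 | 0.25 |
| 4 | queer | 1 | 0 | 1 | 0.29 | 0.0 | 0.04 | 0.25 | 0.25 |
| 5 | kvindfolk | 1 | 0 | 1 | 0.29 | 0.0 | 0.04 | 0.25 | 0.25 |
| 6 | tøs | 1 | 0 | 1 | 0.29 | 0.0 | 0.04 | 0.25 | 0.25 |
| 7 | søn | 1 | 1 | 2 | 0.29 | 0.04 | 0.08 | 0.21 | 0.21 |
| 8 | fætter | 1 | 2 | 3 | 0.29 | 0.09 | 0.11 | 0.17 | 0.17 |
| 9 | kone | 2 | 9 | 11 | 0.57 | 0.39 | 0.42 | 0.16 | 0.16 |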


In [14]:
# save this df
sorted_occurrence_df.to_excel(os.getcwd()+"/mitigation/frequency_of_identity_lemmas.xlsx")

The lemmas with a difference > 0 are the ones I need to look at: they are over-represented in toxic comments.

I can reduce this difference by adding non-toxic comments that contain these lemmas. That raises total_pct while leaving toxic_pct unchanged, so the difference can be brought as close to zero as possible.
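As a rough sketch of how many non-toxic comments that would take: if a lemma occurs in `t` of `T` toxic comments and in `c` of `N` comments overall, adding `a` non-toxic comments containing it makes the overall share `(c + a) / (N + a)`, and setting that equal to `t / T` gives `a = (t*N - T*c) / (T - t)`. The helper name `comments_needed` is hypothetical. The total of 2631 comes from the progress bar above, while the toxic count of 348 is an assumption inferred from the reported percentages (16 / 4.6%).

```python
import math

def comments_needed(term_toxic: int, term_total: int, num_toxic: int, num_total: int) -> int:
    """Smallest number of added non-toxic comments containing the term
    such that its overall share reaches its share in toxic comments."""
    if term_toxic * num_total <= num_toxic * term_total:
        return 0  # term is not over-represented in toxic comments
    if term_toxic == num_toxic:
        raise ValueError("term occurs in every toxic comment; the shares cannot be equalized")
    a = (term_toxic * num_total - num_toxic * term_total) / (num_toxic - term_toxic)
    return math.ceil(a)

NUM_TOTAL_ASSUMED = 2631  # rows in the training split (from the progress bar)
NUM_TOXIC_ASSUMED = 348   # assumption: inferred from the reported percentages

# Counts for "mand" from the frequency table
needed = comments_needed(term_toxic=16, term_total=73,
                         num_toxic=NUM_TOXIC_ASSUMED, num_total=NUM_TOTAL_ASSUMED)
toxic_pct = 16 / NUM_TOXIC_ASSUMED * 100
total_pct_after = (73 + needed) / (NUM_TOTAL_ASSUMED + needed) * 100
print(needed, round(toxic_pct, 2), round(total_pct_after, 2))
```

This treats the whole dataset as one bucket; matching the balance per length bucket would mean repeating the calculation with per-bucket counts.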

### Length differences

Percentage of comments containing a given term that are labeled toxic, per length bucket, e.g.:

| Term | 20-59 | 60-179 |
|:---:|:---:|:---:|
| ALL | 17% | 12% |
| gay | 88% | 77% |
| queer | 75% | 83% |
| ... | ... | ... |

Other lengths:
* 180-539
* 540-1619
* 1620-4859


Method:

* For each lemma:
  * Find the texts that it occurs in
  * Separate these texts into 6 length buckets (by character count of the preprocessed text)
  * For each length_bucket:
    * Find the percentage that are toxic
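The steps above can also be sketched compactly with `pd.cut` and `groupby`. This is an alternative formulation, not the code used in this notebook: the toy texts and labels are made up, and only the bucket edges are taken from the bins used here.

```python
import numpy as np
import pandas as pd

# Toy data (made up); in the notebook, length comes from "tweet" and matching uses "lemmas"
texts = ["mand gå", "dum mand " + "x " * 10, "kvinde " * 10, "sød mand " + "y " * 10]
toy = pd.DataFrame({"tweet": texts, "lemmas": texts, "label": [0, 1, 0, 0]})

# Bucket edges as used in this notebook; right=False gives [0, 20), [20, 60), ...
edges = [0, 20, 60, 140, 300, 620, np.inf]
labels = ["0-19", "20-59", "60-139", "140-299", "300-619", "620+"]
toy["length"] = toy["tweet"].str.len()
toy["bucket"] = pd.cut(toy["length"], bins=edges, labels=labels, right=False)

# Keep only the texts where the term occurs as a whole token
term = "mand"
has_term = toy["lemmas"].apply(lambda text: term in text.split())

# The mean of a 0/1 label is the toxic share; observed=True skips empty buckets
pct_toxic = toy[has_term].groupby("bucket", observed=True)["label"].mean() * 100
print(pct_toxic.round(2))
```

Buckets in which the term never occurs are simply absent from the result, which corresponds to the empty cells in a per-term results table.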

In [271]:
# add lengths to df
train_orig["length"] = train_orig["tweet"].progress_apply(lambda x: len(x))

100%|██████████| 2631/2631 [00:00<00:00, 381155.49it/s]


In [273]:
# divide into 6 buckets
print("Min length:", train_orig["length"].min())
print("Max length:", train_orig["length"].max())

bin1 = train_orig.query("0 <= length <= 19") # 20
bin2 = train_orig.query("20 <= length <= 59") # 40
bin3 = train_orig.query("60 <= length <= 139") # 80
bin4 = train_orig.query("140 <= length <= 299") # 160
bin5 = train_orig.query("300 <= length <= 619") # 320
bin6 = train_orig.query("620 <= length") # the rest
bins = [bin1, bin2, bin3, bin4, bin5, bin6]
bin_labels = ["0-19", "20-59", "60-139", "140-299", "300-619", "620-3519"]

Min length: 0
Max length: 3518


In [283]:
# find proportion of toxic comments for each bin (no specific terms)
results = {"bin_range":bin_labels, "toxic":[], "nontoxic":[]}
for length_bin in bins: # length bins (renamed to avoid shadowing the built-in bin())
    results["toxic"].append(len(length_bin[length_bin["label"] == 1])) # count toxic in that bin
    results["nontoxic"].append(len(length_bin[length_bin["label"] == 0])) # and non-toxic

# prepare preliminary results df
prel_results_df = pd.DataFrame(results)
prel_results_df["pct_toxic"] = ( prel_results_df["toxic"] / (prel_results_df["toxic"]+prel_results_df["nontoxic"]) ) * 100 # add percentage
prel_results_df.set_index("bin_range", inplace=True)

# add to final results df
results_df_1 = prel_results_df[["pct_toxic"]].T
results_df_1.index = ["ALL"]
results_df_1.round(2)

| bin_range | 0-19 | 20-59 | 60-139 | 140-299 | 300-619 | 620-3519 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ALL | 10.74 | 12.79 | 13.75 | 17.5 | 29.63 | 33.33 |


In [284]:
# do the same for each lemma

# prepare dicts
toxic_count_dict = {"lemmatized_identity": lemmatized_identities}
total_count_dict = {"lemmatized_identity": lemmatized_identities}
for label in bin_labels:
    toxic_count_dict[label] = []
    total_count_dict[label] = []
    
for lemma in lemmatized_identities: # for each lemma
    for bin_label, length_bin in zip(bin_labels, bins): # for each bin (avoids shadowing the built-in bin())
        
        # count no. of toxic/all texts this lemma occurs in in this bin
        toxic_count = length_bin[length_bin["label"]==1]["lemmas"].apply(lambda x: occurs_in(target=lemma, text=x)).sum()
        total_count = length_bin["lemmas"].apply(lambda x: occurs_in(target=lemma, text=x)).sum()
        
        # add to count_dicts
        toxic_count_dict[bin_label].append(toxic_count)
        total_count_dict[bin_label].append(total_count)

In [285]:
# create df with these occurrence numbers
toxic_count_df = pd.DataFrame(toxic_count_dict)
total_count_df = pd.DataFrame(total_count_dict)

# map back to actual lemma and aggregate duplicates (bin_labels holds the six bucket column names)
toxic_count_df["lemma"] = toxic_count_df["lemmatized_identity"].map(lemmatized_2_lemma)
toxic_count_df = toxic_count_df.groupby("lemma")[bin_labels].sum().reset_index()
toxic_count_df["sum"] = toxic_count_df[bin_labels].sum(axis=1) # total across all buckets
toxic_count_df = toxic_count_df.sort_values("lemma")
total_count_df["lemma"] = total_count_df["lemmatized_identity"].map(lemmatized_2_lemma)
total_count_df = total_count_df.groupby("lemma")[bin_labels].sum().reset_index()
total_count_df["sum"] = total_count_df[bin_labels].sum(axis=1) # total across all buckets
total_count_df = total_count_df.sort_values("lemma")

In [243]:
toxic_count_df.columns[1:-1]

Index(['0-19', '20-59', '60-139', '140-299', '300-619', '620-3519'], dtype='object')

In [288]:
# add to results df (.copy() avoids pandas' SettingWithCopyWarning when adding columns)
results_df_2 = toxic_count_df[["lemma"]].copy()
for col in toxic_count_df.columns[1:-1]:
    results_df_2[col] = (toxic_count_df[col] / total_count_df[col]) * 100 # calculate percentages; 0/0 gives NaN
results_df_2.set_index("lemma", inplace=True)

In [295]:
# final df
results_df = pd.concat([results_df_1, results_df_2])
results_df.dropna(axis = 0, how = 'all', inplace = True) # drop rows with all NA values
display(results_df.round(2).fillna("")) # show results

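| | 0-19 | 20-59 | 60-139 | 140-299 | 300-619 | 620-3519 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ALL | 10.74 | 12.79 | 13.75 | 17.5 | 29.63 | 33.33 |
| bror | | 0.0 | 0.0 | 0.0 | 0.0 | 50.0 |
| dame | | | 0.0 | 0.0 | | |
| datter | | | 0.0 | 0.0 | 0.0 | 0.0 |
| dreng | | 100.0 | 0.0 | | | 0.0 |
| far | | 0.0 | 0.0 | 50.0 | | 0.0 |
| fyr | | | 100.0 | | | |
| fætter | | 0.0 | | 100.0 | 0.0 | |
| herre | | 0.0 | 0.0 | | | |
| kone | | 0.0 | 0.0 | 0.0 | 100.0 | 50.0 |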


In [297]:
# save results
results_df = results_df.fillna("") # fill NAs
results_df.to_excel(os.getcwd()+"/mitigation/toxicity_at_diff_lengths.xlsx") # save as xlsx file

### Calculate how much new data is needed

In [27]:
# from dixon's code
# def calculate_non_toxic_deficit(f, n, t):
#     """
#     Calculate the deficit of non-toxic examples needed to achieve the desired non-toxic fraction.

#     Parameters:
#     - f: Desired non-toxic fraction
#     - n: Current number of non-toxic examples
#     - t: Current number of toxic examples

#     Returns:
#     - a: Number of non-toxic examples needed to be added
#     """
#     a = (f * (t + n) - n) / (1 - f)
#     return a

# # Example usage:
# desired_non_toxic_fraction = 2.0
# current_non_toxic_examples = 2
# current_toxic_examples = 2


# deficit_of_non_toxic_examples = calculate_non_toxic_deficit(desired_non_toxic_fraction, current_non_toxic_examples, current_toxic_examples)
# deficit_of_non_toxic_examples

-6.0
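The commented-out example above passes 2.0 as the desired non-toxic fraction, which lies outside [0, 1) and explains the negative result. Below is a runnable sketch of the same formula with a range check, applied to the counts for "mand" from the frequency table. The prior of 2283 non-toxic out of 2631 comments is an assumption inferred from the reported percentages, and using the dataset-wide prior as the target is one possible reading of the method.

```python
import math

def calculate_non_toxic_deficit(f: float, n: int, t: int) -> float:
    """Number of non-toxic examples to add so that the non-toxic fraction becomes f.

    Solves (n + a) / (n + t + a) = f for a, where n and t are the current
    numbers of non-toxic and toxic examples."""
    if not 0 <= f < 1:
        raise ValueError("f must be a fraction in [0, 1)")
    return (f * (t + n) - n) / (1 - f)

# Assumption: 2283 non-toxic comments out of 2631 in the training split
prior_non_toxic = 2283 / 2631

# "mand" occurs in 57 non-toxic and 16 toxic comments (frequency table)
deficit = calculate_non_toxic_deficit(prior_non_toxic, n=57, t=16)
print(math.ceil(deficit))
```

A negative result for a valid fraction would mean the term is already at least as non-toxic as the target, so nothing needs to be added.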

### Scrape from wikipedia

1) Search for pages to add (manually selected)
2) Scrape these pages using requests and beautifulsoup4
3) Concatenate to one big text bank
4) Search for word forms in this text bank. Extract the needed number of texts of the correct length.
5) Train model on the new dataset and do bias analysis
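Step 4 could look roughly like the sketch below. The helper `find_candidates` and the toy text bank are hypothetical. It assumes the scraped text still contains Wikipedia citation markers such as `[1]`, as in the scraper output shown earlier, and it measures length on the raw sentence, whereas the training data lengths are measured on preprocessed text.

```python
import re
from typing import List

def find_candidates(text_bank: str, word_forms: List[str], min_len: int, max_len: int) -> List[str]:
    """Return sentences that contain any word form as a whole word
    and whose character length lies in [min_len, max_len]."""
    # Drop citation markers such as [1][2] so sentences end in plain punctuation
    cleaned = re.sub(r"\[\d+\]", "", text_bank)
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in word_forms) + r")\b",
        flags=re.IGNORECASE,
    )
    return [s for s in sentences if min_len <= len(s) <= max_len and pattern.search(s)]

# Toy text bank (shortened and simplified, not actual scraper output)
text_bank = (
    "Transkønnethed er en betegnelse for personer.[1][2] "
    "Nogle transkønnede ønsker medicinsk hjælp.[4] "
    "Dette er en sætning uden termen."
)
hits = find_candidates(text_bank, ["transkønnede", "transkønnet"], min_len=20, max_len=59)
print(hits)
```

The naive split also breaks on abbreviations such as "dvs.", so a proper sentence segmenter (for example the one in the DaCy pipeline) may be the better choice for real data.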

Open question (ask Manex): afterwards, either apply both types of mitigation to the oversampled dataset, or rerun the original model on the non-oversampled dataset.