# Data Supplementation (suppl)

Method
1.	Identify identity terms with the most disproportionate data distributions 
    1. Stem/lemmatize dataset
    2. For each lemma in the synthetic test set:
        1. Check distribution across labels in dataset, i.e. difference between frequency in toxic comments and overall
        2.	Also check length differences!
    3. What does this mean exactly? 
        1.	“Identity terms affected by the false positive bias are disproportionately used in toxic comments in our training data. For example, the word ‘gay’ appears in 3% of toxic comments but only 0.5% of comments overall.”
        2.	Frequency of identity terms in toxic comments and overall: 
2.	Add additional non-toxic examples that contain the identity terms that appear disproportionately across labels in the original dataset
    1.	Use wiki data – assumed to be non-toxic
    2.	Add enough so that the balance is in line with the prior distribution for the overall dataset
        1.	E.g. until % “gay” in toxic comment is close to 0.50% as in overall data.
3.	Maybe consider different lengths as CNNs could be sensitive to this
    1.	“toxic comments tend to be shorter” (Dixon et al. 2018)
4.	Supposed to reduce false positives. Could also do the opposite? But more difficult to find toxic comments unless we take them from places that are supposedly toxic (e.g. “roast me”)


## Imports

In [682]:
# set cwd
import os
os.chdir("g:\\My Drive\\ITC, 5th semester (Thesis)\\Code\\Github_code\\toxicity_detection")

# imports
import pandas as pd
# from random import choice, choices
# from collections import 
import numpy as np
import matplotlib.pyplot as plt
# from string import punctuation
# # import spacy
from spacy import displacy
from tqdm import tqdm
from utils import load_dkhate
from typing import List
import pickle
import dacy
# import utils
# import nltk
# import re
# import string
from wiki_scraper import scrape_wiki_text
tqdm.pandas()

## Functions

In [753]:
def lemmatize_text(text:str) -> str:
    """Returns a lemmatized version of the text or itself if the string is empty."""
    if len(text) > 0:
        doc = nlp(text)
        lemmas = [token.lemma_ for token in doc]
        lemmatized_text = " ".join(lemmas)
        return lemmatized_text
    else:
        return text

def occurs_in_string(target:str, text:str) -> bool:
    """Checks whether a word occurs in a text."""
    for word in text.split():
        if word == target:
            return True
    return False

## Load DaCy model

In [3]:
# load daCy model (medium works fine)
nlp = dacy.load("da_dacy_medium_trf-0.2.0") # takes around 4 minutes the first time

In [4]:
# test that it works as expected 
doc = nlp("Mit navn er Maja. Jeg bor på Bispebjerg, men er fra Næstved.") 
print("Token     \tLemma\t\tPOS-tag\t\tEntity type")
for tok in doc: 
    print(f"{str(tok).ljust(10)}:\t{str(tok.lemma_).ljust(10)}\t{tok.pos_}\t\t{tok.ent_type_}")
displacy.render(doc, style="ent")

Token     	Lemma		POS-tag		Entity type
Mit       :	Mit       	DET		
navn      :	navn      	NOUN		
er        :	være      	AUX		
Maja      :	Maja      	PROPN		PER
.         :	.         	PUNCT		
Jeg       :	jeg       	PRON		
bor       :	bo        	VERB		
på        :	på        	ADP		
Bispebjerg:	Bispebjerg	PROPN		LOC
,         :	,         	PUNCT		
men       :	men       	CCONJ		
er        :	være      	VERB		
fra       :	fra       	ADP		
Næstved   :	Næstved   	PROPN		LOC
.         :	.         	PUNCT		


## Load preprocessed training data

In [5]:
# load data splits 
_, _, y_train_orig, _ = load_dkhate(test_size=0.2)
with open(os.getcwd()+"/data/X_orig_preproc.pkl", "rb") as f:
    content = pickle.load(f)

X_train_orig = content["X_train"]
train_orig = pd.DataFrame([X_train_orig, y_train_orig]).T
train_orig.tail()

Unnamed: 0_level_0,tweet,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1
2378,hørt,0
1879,reaktion svensker,0
42,hey champ smide link ser hearthstone henne,0
457,melder vold voldtægt viser sandt beviser diver...,1
3108,betaler omkring mb kb får nok tættere kb kb be...,0


In [6]:
# lemmatize the texts
train_orig["lemmas"] = train_orig["tweet"].progress_apply(lemmatize_text)

100%|██████████| 2631/2631 [05:56<00:00,  7.39it/s]


In [7]:
# split into toxic, non-toxic and all
toxic_text = train_orig[train_orig["label"] == 1]["lemmas"]
nontoxic_text = train_orig[train_orig["label"] == 0]["lemmas"]
all_text =  train_orig["lemmas"]

NUM_TOXIC = len(toxic_text)
NUM_NONTOXIC = len(nontoxic_text)
NUM_TOTAL = len(all_text)

toxic_text.head()

id
1174    scanne lortet pc markere tage underskrift ny d...
3301    kunne klarer fyr stort se venn vej samme spor ...
1390    fuck meget sol varme lille regn please dansk å...
799     hvorfor fucking stor helvede fejre kristn hell...
900     ingen udlænding ved grænse heller kriminell ku...
Name: lemmas, dtype: object

#### Oversampled

In [317]:
with open(os.getcwd()+"/data/orig_dataset_splits.pkl", "rb") as f:
    orig_oversampled = pickle.load(f)
X_oversampl = orig_oversampled["X training preprocessed and oversampled"]
y_oversampl = orig_oversampled["y training preprocessed and oversampled"]

In [321]:
train_oversampl = pd.DataFrame([X_oversampl, y_oversampl]).T
train_oversampl.rename(columns={"Unnamed 0": "tweet"}, inplace=True)
train_oversampl

Unnamed: 0,tweet,label
0,hahaha,0
1,user føler svært så prøv flytte afrika får str...,0
2,endnu barriere bønder uden eu,0
3,eneste møde ved snuskede stambar aalborg altid...,0
4,godt forøvrigt taget dokumentarprogram svensk ...,0
...,...,...
3419,danske lort fucking god hører kender lyder mus...,1
3420,så må lort kr sindsyge,1
3421,fucking smukt,1
3422,lortet begynder går indtil amok hvæse,1


In [322]:
# lemmatize the texts
train_oversampl["lemmas"] = train_oversampl["tweet"].progress_apply(lemmatize_text)

100%|██████████| 3424/3424 [08:27<00:00,  6.75it/s]


In [323]:
# split into toxic, non-toxic and all
toxic_text_oversampl = train_oversampl[train_oversampl["label"] == 1]["lemmas"]
nontoxic_text_oversampl = train_oversampl[train_oversampl["label"] == 0]["lemmas"]
all_text_oversampl = train_oversampl["lemmas"]

NUM_TOXIC_OVERSAMPL = len(toxic_text_oversampl)
NUM_NONTOXIC_OVERSAMPL = len(nontoxic_text_oversampl)
NUM_TOTAL_OVERSAMPL = len(all_text_oversampl)

toxic_text_oversampl.head()

10    få tage scanne lortet pc markere underskrif ny...
12    føle snakke kunne klarer fyr stort se venn vej...
13    så lidt fuck meget sol varme lille regn please...
24    afrika forstå mene hvorfor fucking stor helved...
31    så uden ved lige kunne heller df ingen udlændi...
Name: lemmas, dtype: object

## Load identity terms

In [8]:
# load identity terms
identities = pd.read_excel(os.getcwd()+"/data/identity_terms.xlsx")
print(len(set(identities["identity_lemma"])), "unique identity lemmas")
identities.tail()

45 unique identity lemmas


Unnamed: 0,identity_term,identity_lemma
155,transpersonerne,transperson
156,transvestitterne,transvestit
157,transerne,trans
158,androgynerne,androgyn
159,hermafroditterne,hermafrodit


In [9]:
# lemmatize the identity terms
identities["lemmatized"] = identities["identity_term"].progress_apply(lemmatize_text)
print(len(set(identities["lemmatized"])), "unique lemmatized identity terms")
identities.tail()

100%|██████████| 160/160 [00:08<00:00, 18.21it/s]

133 unique lemmatized identity terms





Unnamed: 0,identity_term,identity_lemma,lemmatized
155,transpersonerne,transperson,transperson
156,transvestitterne,transvestit,transvestitterne
157,transerne,trans,transe
158,androgynerne,androgyn,androgynerne
159,hermafroditterne,hermafrodit,hermafroditterne


In [10]:
# create map from lemmatized word to the actual lemma
lemmatized_2_lemma = dict(zip(identities["lemmatized"], identities["identity_lemma"]))

## Test scraper

In [300]:
content = scrape_wiki_text("https://da.wikipedia.org/wiki/Transk%C3%B8nnethed")
print("_"*100)
for text in content:
    print(text)

Successfully scraped the webpage with the title: "Transkønnethed"
____________________________________________________________________________________________________
Transkønnethed er en betegnelse for personer, der har en kønsidentitet eller et kønsudtryk, der adskiller sig fra deres fødselskøn.[1][2][3] Nogle transkønnede, som ønsker medicinsk hjælp til at overgå fra et køn til et andet, identificerer sig som transseksuelle.[4][5] Transkønnet, ofte forkortet til blot trans, er også et paraplybegreb: Udover at omfatte personer, hvis kønsidentitet er det modsatte af deres fødselskøn (dvs. transmænd og transkvinder), kan det også anvendes om personer, hvis kønsudtryk ikke er eksklusivt maskulint eller feminint (personer, som er ikke-binære eller genderqueer, heriblandt bikønnede, pankønnede, genderfluid og akønnede).[2][6][7] Blandt andre definitioner af transkønnet er også at inkludere personer, der tilhører et tredje køn, eller konceptualisere transkønnede som et tredje køn.[8][9] Be

## Perform Data Supplemenation

### Frequency of identity terms

In [754]:
# test function
print("This should return False. Result:", occurs_in_string("mor", "elsker din humor"))
print("This should return True.  Result:", occurs_in_string("mor", "hans mor er pænt sød"))

This should return False. Result: False
This should return True.  Result: True


In [760]:
# count how many texts these terms occur in
lemmatized_identities = list(set(identities["lemmatized"]))
occur_in_n_texts = {"lemmatized_identity": lemmatized_identities, "toxic_count": [], "nontoxic_count":[], "total_count":[]}

for lemma in lemmatized_identities:
    occur_in_n_texts["toxic_count"].append(toxic_text.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
    occur_in_n_texts["nontoxic_count"].append(nontoxic_text.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
    occur_in_n_texts["total_count"].append(all_text.apply(lambda x: (occurs_in_string(target=lemma, text=x))).sum())

In [761]:
# create df with these occurrence numbers
occurrence_df = pd.DataFrame(occur_in_n_texts)

# map back to actual lemma and aggregate duplicates
occurrence_df["lemma"] = occurrence_df["lemmatized_identity"].map(lemmatized_2_lemma)
occurrence_df = occurrence_df.groupby("lemma").agg({"toxic_count": "sum", "nontoxic_count": "sum", "total_count": "sum"}).reset_index()

# calculate percentages
occurrence_df["toxic_pct"] = (occurrence_df["toxic_count"]/NUM_TOXIC)*100 
occurrence_df["nontoxic_pct"] = (occurrence_df["nontoxic_count"]/NUM_NONTOXIC)*100 
occurrence_df["total_pct"] = (occurrence_df["total_count"]/NUM_TOTAL)*100 

# calculate differences
occurrence_df["tox_total_diff"] = occurrence_df["toxic_pct"] - occurrence_df["total_pct"]
occurrence_df["tox_total_abs_diff"] = abs(occurrence_df["toxic_pct"] - occurrence_df["total_pct"])

# sort by difference
sorted_occurrence_df = occurrence_df.sort_values("tox_total_diff", ascending=False).reset_index(drop=True)

# display rows where toxic pct != total pct
sorted_occurrence_df[sorted_occurrence_df["tox_total_diff"] != 0].round(2)

Unnamed: 0,lemma,toxic_count,nontoxic_count,total_count,toxic_pct,nontoxic_pct,total_pct,tox_total_diff,tox_total_abs_diff
0,mand,16,57,73,4.6,2.5,2.77,1.82,1.82
1,kvinde,7,26,33,2.01,1.14,1.25,0.76,0.76
2,fyr,1,0,1,0.29,0.0,0.04,0.25,0.25
3,mandfolk,1,0,1,0.29,0.0,0.04,0.25,0.25
4,queer,1,0,1,0.29,0.0,0.04,0.25,0.25
5,kvindfolk,1,0,1,0.29,0.0,0.04,0.25,0.25
6,tøs,1,0,1,0.29,0.0,0.04,0.25,0.25
7,søn,1,1,2,0.29,0.04,0.08,0.21,0.21
8,fætter,1,2,3,0.29,0.09,0.11,0.17,0.17
9,kone,2,9,11,0.57,0.39,0.42,0.16,0.16


In [14]:
# save this df
sorted_occurrence_df.to_excel(os.getcwd()+"/mitigation/frequency_of_identity_lemmas.xlsx")

The ones with a difference > 0 are the ones that I need to look at. 

I can actually make a difference here by adding non-toxic data and getting the toxic_pct number closer to the total_pct number, thereby reducing the difference so it's as close to zero as possible. 

#### Oversampled data

In [762]:
# count how many texts these terms occur in
occur_in_n_texts_oversampl = {"lemmatized_identity": lemmatized_identities, "toxic_count": [], "nontoxic_count":[], "total_count":[]}

for lemma in lemmatized_identities:
    occur_in_n_texts_oversampl["toxic_count"].append(toxic_text_oversampl.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
    occur_in_n_texts_oversampl["nontoxic_count"].append(nontoxic_text_oversampl.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
    occur_in_n_texts_oversampl["total_count"].append(all_text_oversampl.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())

In [763]:
# create df with these occurrence numbers
occurrence_df_oversampl = pd.DataFrame(occur_in_n_texts_oversampl)

# map back to actual lemma and aggregate duplicates
occurrence_df_oversampl["lemma"] = occurrence_df_oversampl["lemmatized_identity"].map(lemmatized_2_lemma)
occurrence_df_oversampl = occurrence_df_oversampl.groupby("lemma").agg({"toxic_count": "sum", "nontoxic_count": "sum", "total_count": "sum"}).reset_index()

# calculate percentages
occurrence_df_oversampl["toxic_pct"] = (occurrence_df_oversampl["toxic_count"]/NUM_TOXIC_OVERSAMPL)*100 
occurrence_df_oversampl["nontoxic_pct"] = (occurrence_df_oversampl["nontoxic_count"]/NUM_NONTOXIC_OVERSAMPL)*100 
occurrence_df_oversampl["total_pct"] = (occurrence_df_oversampl["total_count"]/NUM_TOTAL_OVERSAMPL)*100 

# calculate differences
occurrence_df_oversampl["tox_total_diff"] = occurrence_df_oversampl["toxic_pct"] - occurrence_df_oversampl["total_pct"]
occurrence_df_oversampl["tox_total_abs_diff"] = abs(occurrence_df_oversampl["toxic_pct"] - occurrence_df_oversampl["total_pct"])

# sort by difference
sorted_occurrence_df_oversampl = occurrence_df_oversampl.sort_values("tox_total_diff", ascending=False).reset_index(drop=True)

# display rows where toxic pct != total pct
sorted_occurrence_df_oversampl[sorted_occurrence_df_oversampl["tox_total_diff"] != 0].round(2)

Unnamed: 0,lemma,toxic_count,nontoxic_count,total_count,toxic_pct,nontoxic_pct,total_pct,tox_total_diff,tox_total_abs_diff
0,mand,61,55,116,5.35,2.41,3.39,1.96,1.96
1,kvinde,32,25,57,2.8,1.1,1.66,1.14,1.14
2,fætter,6,2,8,0.53,0.09,0.23,0.29,0.29
3,kone,9,9,18,0.79,0.39,0.53,0.26,0.26
4,fyr,4,0,4,0.35,0.0,0.12,0.23,0.23
5,far,6,5,11,0.53,0.22,0.32,0.2,0.2
6,søn,4,2,6,0.35,0.09,0.18,0.18,0.18
7,dreng,6,8,14,0.53,0.35,0.41,0.12,0.12
8,mandfolk,2,0,2,0.18,0.0,0.06,0.12,0.12
9,queer,2,0,2,0.18,0.0,0.06,0.12,0.12


In [326]:
# save this df
sorted_occurrence_df_oversampl.to_excel(os.getcwd()+"/mitigation/frequency_of_identity_lemmas_oversampl.xlsx")

The difference is that *dreng* is now in the top part (positive). Some differences are smaller, some are larger.

### Length differences

Percent of comments labeled as toxic at each length containing the given terms, e.g.:

| Term | 20-59 | 60-179 |
|:---:|:---:|:---:|
| ALL | 17% | 12% |
| gay | 88% | 77% |
| queer | 75% | 83% |
| ... | ... | ... |

Other lengths:
* 180-539
* 540-1619
* 1620-4859


Method:

* For each lemma:
  * Find the texts that it occur in
  * Separate these texts into 5 length buckets
  * For each length_bucket:
    * Find the percentage that are toxic

In [271]:
# add lengths to df
train_orig["length"] = train_orig["tweet"].progress_apply(lambda x: len(x))

100%|██████████| 2631/2631 [00:00<00:00, 381155.49it/s]


In [273]:
# divide into 6 buckets
print("Min length:", train_orig["length"].min())
print("Max length:", train_orig["length"].max())

bin1 = train_orig.query("0 <= length <= 19") # 20
bin2 = train_orig.query("20 <= length <= 59") # 40
bin3 = train_orig.query("60 <= length <= 139") # 80
bin4 = train_orig.query("140 <= length <= 299") # 160
bin5 = train_orig.query("300 <= length <= 619") # 320
bin6 = train_orig.query("620 <= length") # the rest
bins = [bin1, bin2, bin3, bin4, bin5, bin6]
bin_labels = ["0-19", "20-59", "60-139", "140-299", "300-619", "620-3519"]

Min length: 0
Max length: 3518


In [283]:
# find proportion of toxic comments for each bin (no specific terms)
results = {"bin_range":bin_labels, "toxic":[], "nontoxic":[]}
for bin in bins: # length bins
    results["toxic"].append(len(bin[bin["label"] == 1])) # count toxic in that bin
    results["nontoxic"].append(len(bin[bin["label"] == 0])) # and non-toxic

# prepare preliminary results df
prel_results_df = pd.DataFrame(results)
prel_results_df["pct_toxic"] = ( prel_results_df["toxic"] / (prel_results_df["toxic"]+prel_results_df["nontoxic"]) ) * 100 # add percentage
prel_results_df.set_index("bin_range", inplace=True)

# add to final results df
results_df_1 = prel_results_df[["pct_toxic"]].T
results_df_1.index = ["ALL"]
results_df_1.round(2)

bin_range,0-19,20-59,60-139,140-299,300-619,620-3519
ALL,10.74,12.79,13.75,17.5,29.63,33.33


In [764]:
# do the same for each lemma

# prepare dicts
toxic_count_dict = {"lemmatized_identity": lemmatized_identities}
total_count_dict = {"lemmatized_identity": lemmatized_identities}
for label in bin_labels:
    toxic_count_dict[label] = []
    total_count_dict[label] = []
    
for lemma in lemmatized_identities: # for each lemma
    for (bin_label, bin) in zip(bin_labels, bins): # for each bin
        
        # count no. of toxic/all texts this lemma occurs in in this bin
        toxic_count = bin[bin["label"]==1]["lemmas"].apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum() 
        total_count = bin["lemmas"].apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum() 
        
        # add to count_dicts
        toxic_count_dict[bin_label].append(toxic_count)
        total_count_dict[bin_label].append(total_count)

In [765]:
# create df with these occurrence numbers
toxic_count_df = pd.DataFrame(toxic_count_dict)
total_count_df = pd.DataFrame(total_count_dict)

# map back to actual lemma and aggregate duplicates
toxic_count_df["lemma"] = toxic_count_df["lemmatized_identity"].map(lemmatized_2_lemma)
toxic_count_df = toxic_count_df.groupby("lemma").agg({"0-19": "sum", "20-59": "sum", "60-139": "sum", "140-299": "sum", "300-619": "sum", "620-3519": "sum"}).reset_index()
toxic_count_df["sum"] = toxic_count_df["0-19"] + toxic_count_df["20-59"] + toxic_count_df["60-139"] + toxic_count_df["140-299"] + toxic_count_df["300-619"] + toxic_count_df["620-3519"]
toxic_count_df = toxic_count_df.sort_values("lemma")
total_count_df["lemma"] = total_count_df["lemmatized_identity"].map(lemmatized_2_lemma)
total_count_df = total_count_df.groupby("lemma").agg({"0-19": "sum", "20-59": "sum", "60-139": "sum", "140-299": "sum", "300-619": "sum", "620-3519": "sum"}).reset_index()
total_count_df["sum"] = total_count_df["0-19"] + total_count_df["20-59"] + total_count_df["60-139"] + total_count_df["140-299"] + total_count_df["300-619"] + total_count_df["620-3519"]
total_count_df = total_count_df.sort_values("lemma")

In [766]:
toxic_count_df.columns[1:-1]

Index(['0-19', '20-59', '60-139', '140-299', '300-619', '620-3519'], dtype='object')

In [767]:
# add to results df
results_df_2 = toxic_count_df[["lemma"]]
for col in toxic_count_df.columns[1:-1]:
    results_df_2[col] = (toxic_count_df[col] / total_count_df[col]) * 100 # calculate percentages
results_df_2.set_index("lemma", inplace=True)

In [768]:
results_df

Unnamed: 0,0-19,20-59,60-139,140-299,300-619,620-3519
ALL,10.735294,12.785775,13.754647,17.5,29.62963,33.333333
bror,,0.0,0.0,0.0,0.0,50.0
dame,,,0.0,0.0,,
datter,,,0.0,0.0,0.0,0.0
dreng,,100.0,0.0,,,0.0
far,,0.0,0.0,50.0,,0.0
fyr,,,100.0,,,
fætter,,0.0,,100.0,0.0,
herre,,0.0,0.0,,,
kone,,0.0,0.0,0.0,100.0,50.0


In [295]:
# final df
results_df = pd.concat([results_df_1, results_df_2])
results_df.dropna(axis = 0, how = 'all', inplace = True) # drop rows with all NA values
display(results_df.round(2).fillna("")) # show results

Unnamed: 0,0-19,20-59,60-139,140-299,300-619,620-3519
ALL,10.74,12.79,13.75,17.5,29.63,33.33
bror,,0.0,0.0,0.0,0.0,50.0
dame,,,0.0,0.0,,
datter,,,0.0,0.0,0.0,0.0
dreng,,100.0,0.0,,,0.0
far,,0.0,0.0,50.0,,0.0
fyr,,,100.0,,,
fætter,,0.0,,100.0,0.0,
herre,,0.0,0.0,,,
kone,,0.0,0.0,0.0,100.0,50.0


In [297]:
# save results
results_df = results_df.fillna("") # fill NAs
results_df.to_excel(os.getcwd()+"/mitigation/toxicity_at_diff_lengths.xlsx") # save as xlsx file

### Calculate how much new data is needed
Based on:
https://github.com/conversationai/unintended-ml-bias-analysis/blob/main/archive/unintended_ml_bias/Dataset_bias_analysis.ipynb

In [None]:
## Pseudocode
# num_nontoxic_to_add = {}

# for word in list_of_words_to_fix:
#   for length:
#       t = get t from toxic_count_df
#       n = get n from total_count_df - t
#       f = get from results_df.loc[ALL, bin_label]
#       a = calculate_nontoxic_to_add(f=f, n=n, t=t, method="round")
#       num_nontoxic_to_add[word] = a

In [642]:
def calculate_nontoxic_to_add(f:float, n:int, t:int, method:str) -> int:
    """Calculate how many non-toxic examples you need to add to get the desired non-toxic fraction.

    Args:
        f (float): desired non-toxic fraction.
        n (int): current number of non-toxic examples.
        t (int): current number of toxic examples.
        method (str): method to convert result to int: "round", "ceiling", or "floor".

    Returns:
        int: number of non-toxic examples to add.    
    """
    a = (f*(t+n)-n) / (1-f)
    
    method = method.lower()
    if method == "round":
        return round(a)
    elif method == "ceiling":
        return int(np.ceil(a))
    elif method == "floor":
        return int(np.floor(a))
    else:
        raise Exception("Unknown method. Must be either 'round', 'ceiling', or 'floor'.")

def calculate_nontoxic_fraction(n:float, t:float, a:int) -> float:
    """Returns the fraction of non-toxic examples.

    Args:
        n (int): current number of non-toxic examples.
        t (int): current number of toxic examples.
        a (int): number of non-toxic examples to add.

    Returns:
        float: non-toxic fraction.
    """
    f = (n+a) / (t+n+a)
    return f

In [643]:
# # example (mand 140-?)
# t = 6 # current number of toxic examples
# n = 18 # current number of non-toxic examples
# a = calculate_nontoxic_to_add(f=0.825, n=n, t=t, method="round")
# f = calculate_nontoxic_fraction(n=n, t=t, a=a) # new toxic fraction

# print("Old non-toxic fraction  :", round(calculate_nontoxic_fraction(n=n, t=t, a=0), 4))
# print("Add n non-toxic examples:", a)
# print("New non-toxic fraction  :", round(calculate_nontoxic_fraction(n=n, t=t, a=a), 4))

In [644]:
# # example (pige 60-?)
# t = 1 # current number of toxic examples
# n = 2 # current number of non-toxic examples
# a = calculate_nontoxic_to_add(f=0.8625, n=n, t=t, method="round")
# f = calculate_nontoxic_fraction(n=n, t=t, a=a) # new toxic fraction

# print("Old toxic fraction      :", round(calculate_nontoxic_fraction(n=n, t=t, a=0), 4))
# print("Add n non-toxic examples:", a)
# print("New toxic fraction      :", round(calculate_nontoxic_fraction(n=n, t=t, a=a), 4))

In [645]:
# find words to fix
overall_prior_distributions = results_df.iloc[0, :] 
lengths = overall_prior_distributions.keys()
unbalanced_lemmas_at_lengths = {}

print("LEMMA\t\tLENGTH\t\tTOXIC%")
for row in results_df.iloc[1:,:].iterrows(): # for each unbalanced row
    lemma = row[0]
    content = row[1]
    
    unbalanced_lengths = []
    for i, x in enumerate(content): # for each column (= length bucket)
        if type(x) == float and x >= overall_prior_distributions.iloc[i]: # if the percentage of toxic is larger than the prior distribution 
            print(f"{lemma.ljust(9)}\t{lengths[i].ljust(8)}\t{x:6.2f} %") 
            unbalanced_lengths.append(lengths[i])        
    if unbalanced_lengths: # if not empty
        unbalanced_lemmas_at_lengths[lemma] = unbalanced_lengths

LEMMA		LENGTH		TOXIC%
bror     	620-3519	 50.00 %
dreng    	20-59   	100.00 %
far      	140-299 	 50.00 %
fyr      	60-139  	100.00 %
fætter   	140-299 	100.00 %
kone     	300-619 	100.00 %
kone     	620-3519	 50.00 %
kvinde   	60-139  	 16.67 %
kvinde   	140-299 	 40.00 %
kvindfolk	300-619 	100.00 %
mand     	60-139  	 50.00 %
mand     	140-299 	 25.00 %
mandfolk 	20-59   	100.00 %
mor      	60-139  	 33.33 %
pige     	60-139  	 33.33 %
queer    	620-3519	100.00 %
søn      	140-299 	100.00 %
tøs      	0-19    	100.00 %


In [572]:
# display words to fix
unbalanced_lemmas_at_lengths

{'bror': ['620-3519'],
 'dreng': ['20-59'],
 'far': ['140-299'],
 'fyr': ['60-139'],
 'fætter': ['140-299'],
 'kone': ['300-619', '620-3519'],
 'kvinde': ['60-139', '140-299'],
 'kvindfolk': ['300-619'],
 'mand': ['60-139', '140-299'],
 'mandfolk': ['20-59'],
 'mor': ['60-139'],
 'pige': ['60-139'],
 'queer': ['620-3519'],
 'søn': ['140-299'],
 'tøs': ['0-19']}

In [649]:
# find a for each unbalanced lemma at length

num_nontoxic_to_add = {}
old_new_nontoxic_frac = {}
total_to_add = 0

for lemma in unbalanced_lemmas_at_lengths: # for word in list_of_words_to_fix:
    
    for length in unbalanced_lemmas_at_lengths[lemma]: # for length
    
        current_toxic = toxic_count_df[toxic_count_df["lemma"]==lemma][length].iloc[0] #  t = get t from toxic_count_df
        current_total = total_count_df[total_count_df["lemma"]==lemma][length].iloc[0]
        current_nontoxic = current_total - current_toxic # n = get n from total_count_df - t
        desired_f = 1 - (overall_prior_distributions[length]/100) # f = 1 - toxic frac (get this from overall_prior_distributions/100 (results_df.loc[ALL, bin_label]))
        add_n_nontoxic = calculate_nontoxic_to_add(f=desired_f, n=current_nontoxic, t=current_toxic, method="round")
        
        num_nontoxic_to_add[(lemma, length)] = add_n_nontoxic
        new_f = calculate_nontoxic_fraction(n=current_nontoxic, t=current_toxic, a=add_n_nontoxic)
        old_new_nontoxic_frac[(lemma, length)] = (desired_f, new_f)
        total_to_add += add_n_nontoxic
print("Done")
print("Total to add:", total_to_add)

Done
Total to add: 114


In [647]:
# display results
print("(lemma, length): number to add")
num_nontoxic_to_add

(lemma, length): number to add


{('bror', '620-3519'): 1,
 ('dreng', '20-59'): 7,
 ('far', '140-299'): 4,
 ('fyr', '60-139'): 6,
 ('fætter', '140-299'): 5,
 ('kone', '300-619'): 2,
 ('kone', '620-3519'): 1,
 ('kvinde', '60-139'): 1,
 ('kvinde', '140-299'): 13,
 ('kvindfolk', '300-619'): 2,
 ('mand', '60-139'): 32,
 ('mand', '140-299'): 10,
 ('mandfolk', '20-59'): 7,
 ('mor', '60-139'): 4,
 ('pige', '60-139'): 4,
 ('queer', '620-3519'): 2,
 ('søn', '140-299'): 5,
 ('tøs', '0-19'): 8}

In [648]:
# display old and new nontoxic fraction
old_new_nontoxic_frac_df = pd.DataFrame(old_new_nontoxic_frac).T
old_new_nontoxic_frac_df.rename(columns={0:"old_f", 1:"new_f"}, inplace=True)
old_new_nontoxic_frac_df.round(4)

Unnamed: 0,Unnamed: 1,old_f,new_f
bror,620-3519,0.6667,0.6667
dreng,20-59,0.8721,0.875
far,140-299,0.825,0.8333
fyr,60-139,0.8625,0.8571
fætter,140-299,0.825,0.8333
kone,300-619,0.7037,0.6667
kone,620-3519,0.6667,0.6667
kvinde,60-139,0.8625,0.8571
kvinde,140-299,0.825,0.8261
kvindfolk,300-619,0.7037,0.6667


In [None]:
# now:
# for each word to add:
    # find page that mentions this word
    # scrape this page
    # add text to big text bank

# for each word to add:
    # search in text bank for passages that mentions this lemma
    # extract these passages and divide them into sentences
    # preprocess said passages
    # if one matches the given length bucket, add it
    # otherwise, go into sentences. if one of these match, then add it. otherwise, add this sentence + surrounding sentences until we get the desired length.

# add to training data

search on wiki:

- advanced search
- one of these words: the four variants, e.g. "bror, broren, brødre, brødrene"
- these categories: "biografier", "filmskolefilm fra Danmark", "sange fra Danmark"
- sorted by relevance
- top 1 result from each category
- only difference is queer that had no results in these categories, so had to just search for "queer" and use three random pages (undgik hoved/definitionssiden)

bror:
- https://da.wikipedia.org/wiki/Hemming_Hartmann-Petersen
- https://da.wikipedia.org/wiki/Zafir_(film_fra_2011)
- https://da.wikipedia.org/wiki/Brdr._Gebis

dreng
- https://da.wikipedia.org/wiki/Mogens_Wenzel_Andreasen
- https://da.wikipedia.org/wiki/Dreng_(dokumentarfilm)
- https://da.wikipedia.org/wiki/We_Wanna_Be_Free 

far
- https://da.wikipedia.org/wiki/Christian_Molbech # not top result, because it was a different word "fædrene tro" that was the hit
- https://da.wikipedia.org/wiki/Vore_F%C3%A6dres_S%C3%B8nner
- https://da.wikipedia.org/wiki/Ebbe_Skammels%C3%B8n # is this toxic? "kvæste sin far"

fyr
- XX MANGLER, SE KOMMENTAR NEDENFOR
    - https://da.wikipedia.org/wiki/John_Green_(forfatter) (søgte på "en ung fyr")
- https://da.wikipedia.org/wiki/LUCK.exe
- https://da.wikipedia.org/wiki/Du_G%C3%B8r_Mig # not the first as the others were about "FYR OG FLAMME

fætter
- https://da.wikipedia.org/wiki/Eleonore_Tscherning 
- INGEN MED FILM ELLER SANGE, DERFOR BARE TO FRA GENEREL SØGNING
    - https://da.wikipedia.org/wiki/F%C3%A6tter_H%C3%B8jben
    - https://da.wikipedia.org/wiki/Min_f%C3%A6tter_er_pirat

kone
- https://da.wikipedia.org/wiki/Ralf_Pittelkow
- https://da.wikipedia.org/wiki/Deadline_(film_fra_2005) (ikke første, her var det en titel)
- https://da.wikipedia.org/wiki/Krig_og_fred_(Shu-bi-dua)

kvinde
- https://da.wikipedia.org/wiki/Thora_Esche
- https://da.wikipedia.org/wiki/Kvinden_(film)
- https://da.wikipedia.org/wiki/Danske_sild_(Shu-bi-dua-sang)

kvindfolk
- ingen hits i de tre kategorier, derfor bare fra generel søgning
    - https://da.wikipedia.org/wiki/G%C3%A5rd_fra_Pebringe,_Sj%C3%A6lland_(Frilandsmuseet)
    - https://da.wikipedia.org/wiki/Sophie_Caroline_af_Ostfriesland
    - https://da.wikipedia.org/wiki/Hospital

mand
- https://da.wikipedia.org/wiki/J.J._Dampe (ikke den første, fordi ordet kun optrådte i titler/værker der)
- https://da.wikipedia.org/wiki/Manden_der_dr%C3%B8mte_at_han_v%C3%A5gnede
- https://da.wikipedia.org/wiki/St%C3%A5r_p%C3%A5_en_alpetop

mandfolk
- ingen hits i de tre kategorier, derfor bare fra generel søgning (mange af disse var bare filmtitler, dvs. ikke sætninger)
    - https://da.wikipedia.org/wiki/Louis_Marcussen
    - https://da.wikipedia.org/wiki/Asterix_og_vikingerne_(tegnefilm)
    - https://da.wikipedia.org/wiki/Lysets_rige

mor
- https://da.wikipedia.org/wiki/S%C3%B8sser_Krag
- https://da.wikipedia.org/wiki/Kokon_(film_fra_2019)
- https://da.wikipedia.org/wiki/Germand_Gladensvend (skippede dem vi havde allerede)

pige
- https://da.wikipedia.org/wiki/Jean-Paul_Sartre (samme som med sangen)
- https://da.wikipedia.org/wiki/Forl%C3%B8sning
- https://da.wikipedia.org/wiki/Den_danske_sang_er_en_ung,_blond_pige (første var kun titel)

queer:
- https://da.wikipedia.org/wiki/Warehouse9 (culture)
- https://da.wikipedia.org/wiki/Babylebbe (movie)
- https://da.wikipedia.org/wiki/Judith_Butler (person)

søn
- https://da.wikipedia.org/wiki/Christian_8.
- https://da.wikipedia.org/wiki/F%C3%A6dreland_(film) (skippede dem vi havde allerede)
- https://da.wikipedia.org/wiki/Titte_til_hinanden (skippede dem vi havde allerede)

tøs
- https://da.wikipedia.org/wiki/Stephanie_Le%C3%B3n (samme som ved sangen)
- https://da.wikipedia.org/wiki/13_snart_30 (film tilladt for alle, da ingen hits ellers)
- https://da.wikipedia.org/wiki/T%C3%A6t_p%C3%A5_-_live (generel søgning, for få hits ved specifik søgning)


cannot find a biography that uses the word "fyr". mostly slang. can only find ones that use "fyret" (e.g. "fyret fra sit arbejde") or "fyrre"

In [659]:
urls = [
    "https://da.wikipedia.org/wiki/Hemming_Hartmann-Petersen",
    "https://da.wikipedia.org/wiki/Zafir_(film_fra_2011)",
    "https://da.wikipedia.org/wiki/Brdr._Gebis",
    "https://da.wikipedia.org/wiki/Mogens_Wenzel_Andreasen",
    "https://da.wikipedia.org/wiki/Dreng_(dokumentarfilm)",
    "https://da.wikipedia.org/wiki/We_Wanna_Be_Free",
    "https://da.wikipedia.org/wiki/Christian_Molbech",
    "https://da.wikipedia.org/wiki/Vore_F%C3%A6dres_S%C3%B8nner",
    "https://da.wikipedia.org/wiki/Ebbe_Skammels%C3%B8n",
    "https://da.wikipedia.org/wiki/John_Green_(forfatter)",
    "https://da.wikipedia.org/wiki/LUCK.exe",
    "https://da.wikipedia.org/wiki/Du_G%C3%B8r_Mig",
    "https://da.wikipedia.org/wiki/Eleonore_Tscherning",
    "https://da.wikipedia.org/wiki/F%C3%A6tter_H%C3%B8jben",
    "https://da.wikipedia.org/wiki/Min_f%C3%A6tter_er_pirat",
    "https://da.wikipedia.org/wiki/Ralf_Pittelkow",
    "https://da.wikipedia.org/wiki/Deadline_(film_fra_2005)",
    "https://da.wikipedia.org/wiki/Krig_og_fred_(Shu-bi-dua)",
    "https://da.wikipedia.org/wiki/Thora_Esche",
    "https://da.wikipedia.org/wiki/Kvinden_(film)",
    "https://da.wikipedia.org/wiki/Danske_sild_(Shu-bi-dua-sang)",
    "https://da.wikipedia.org/wiki/G%C3%A5rd_fra_Pebringe,_Sj%C3%A6lland_(Frilandsmuseet)",
    "https://da.wikipedia.org/wiki/Sophie_Caroline_af_Ostfriesland",
    "https://da.wikipedia.org/wiki/Hospital",
    "https://da.wikipedia.org/wiki/J.J._Dampe",
    "https://da.wikipedia.org/wiki/Manden_der_dr%C3%B8mte_at_han_v%C3%A5gnede",
    "https://da.wikipedia.org/wiki/St%C3%A5r_p%C3%A5_en_alpetop",
    "https://da.wikipedia.org/wiki/Louis_Marcussen",
    "https://da.wikipedia.org/wiki/Asterix_og_vikingerne_(tegnefilm)",
    "https://da.wikipedia.org/wiki/Lysets_rige",
    "https://da.wikipedia.org/wiki/S%C3%B8sser_Krag",
    "https://da.wikipedia.org/wiki/Kokon_(film_fra_2019)",
    "https://da.wikipedia.org/wiki/Germand_Gladensvend",
    "https://da.wikipedia.org/wiki/Jean-Paul_Sartre",
    "https://da.wikipedia.org/wiki/Forl%C3%B8sning",
    "https://da.wikipedia.org/wiki/Den_danske_sang_er_en_ung,_blond_pige",
    "https://da.wikipedia.org/wiki/Warehouse9",
    "https://da.wikipedia.org/wiki/Babylebbe",
    "https://da.wikipedia.org/wiki/Judith_Butler",
    "https://da.wikipedia.org/wiki/Christian_8.",
    "https://da.wikipedia.org/wiki/F%C3%A6dreland_(film)",
    "https://da.wikipedia.org/wiki/Titte_til_hinanden",
    "https://da.wikipedia.org/wiki/Stephanie_Le%C3%B3n",
    "https://da.wikipedia.org/wiki/13_snart_30",
    "https://da.wikipedia.org/wiki/T%C3%A6t_p%C3%A5_-_live"
]

In [None]:
# if there's not enough data, find more webpages

### Scrape from wikipedia

1) Search for pages to add (manually selected)
2) Scrape these pages using requests and beautifulsoup4
3) Concatenate to one big text bank
4) Search for word forms in this text bank. Extract the needed number of texts in the correct length.
5) Train model on the new dataset and do bias analysis

Afterwards, try to do both types of mitigation on the oversampled dataset

OR 

Try to rerun the original model on non-oversampled dataset
ASK MANEX!

In [672]:
content = scrape_wiki_text("https://da.wikipedia.org/wiki/Transk%C3%B8nnethed")
for passage in content[:3]:
    print(passage)

Successfully scraped the webpage with the title: "Transkønnethed"
Transkønnethed er en betegnelse for personer, der har en kønsidentitet eller et kønsudtryk, der adskiller sig fra deres fødselskøn.[1][2][3] Nogle transkønnede, som ønsker medicinsk hjælp til at overgå fra et køn til et andet, identificerer sig som transseksuelle.[4][5] Transkønnet, ofte forkortet til blot trans, er også et paraplybegreb: Udover at omfatte personer, hvis kønsidentitet er det modsatte af deres fødselskøn (dvs. transmænd og transkvinder), kan det også anvendes om personer, hvis kønsudtryk ikke er eksklusivt maskulint eller feminint (personer, som er ikke-binære eller genderqueer, heriblandt bikønnede, pankønnede, genderfluid og akønnede).[2][6][7] Blandt andre definitioner af transkønnet er også at inkludere personer, der tilhører et tredje køn, eller konceptualisere transkønnede som et tredje køn.[8][9] Begrebet transkønnet kan defineres meget bredt til også at inkludere transvestisme eller ligefrem cross

In [673]:
# scrape webpages
passages = []

for url in tqdm(urls):
    content = scrape_wiki_text(url)
    for passage in content:
        passages.append(passage)

  4%|▍         | 2/45 [00:00<00:08,  5.23it/s]

Successfully scraped the webpage with the title: "Hemming Hartmann-Petersen"
Successfully scraped the webpage with the title: "None"


  7%|▋         | 3/45 [00:00<00:08,  4.80it/s]

Successfully scraped the webpage with the title: "Brdr. Gebis"
Successfully scraped the webpage with the title: "Mogens Wenzel Andreasen"


 11%|█         | 5/45 [00:01<00:08,  4.61it/s]

Successfully scraped the webpage with the title: "None"


 13%|█▎        | 6/45 [00:01<00:08,  4.34it/s]

Successfully scraped the webpage with the title: "We Wanna Be Free"


 18%|█▊        | 8/45 [00:01<00:08,  4.37it/s]

Successfully scraped the webpage with the title: "Christian Molbech"
Successfully scraped the webpage with the title: "Vore Fædres Sønner"


 20%|██        | 9/45 [00:01<00:08,  4.47it/s]

Successfully scraped the webpage with the title: "Ebbe Skammelsøn"
Successfully scraped the webpage with the title: "John Green (forfatter)"


 27%|██▋       | 12/45 [00:02<00:06,  4.84it/s]

Successfully scraped the webpage with the title: "LUCK.exe"
Successfully scraped the webpage with the title: "Du Gør Mig"


 31%|███       | 14/45 [00:03<00:10,  3.05it/s]

Successfully scraped the webpage with the title: "Eleonore Tscherning"
Successfully scraped the webpage with the title: "Fætter Højben"


 36%|███▌      | 16/45 [00:03<00:07,  3.83it/s]

Successfully scraped the webpage with the title: "Min fætter er pirat"
Successfully scraped the webpage with the title: "Ralf Pittelkow"


 38%|███▊      | 17/45 [00:04<00:06,  4.17it/s]

Successfully scraped the webpage with the title: "None"


 40%|████      | 18/45 [00:04<00:06,  4.31it/s]

Successfully scraped the webpage with the title: "Krig og fred (Shu-bi-dua)"
Successfully scraped the webpage with the title: "Thora Esche"


 44%|████▍     | 20/45 [00:04<00:05,  4.39it/s]

Successfully scraped the webpage with the title: "None"


 47%|████▋     | 21/45 [00:05<00:05,  4.34it/s]

Successfully scraped the webpage with the title: "Danske sild (Shu-bi-dua-sang)"


 49%|████▉     | 22/45 [00:05<00:05,  4.20it/s]

Successfully scraped the webpage with the title: "Gård fra Pebringe, Sjælland (Frilandsmuseet)"


 51%|█████     | 23/45 [00:05<00:05,  4.27it/s]

Successfully scraped the webpage with the title: "Sophie Caroline af Ostfriesland"


 53%|█████▎    | 24/45 [00:05<00:04,  4.25it/s]

Successfully scraped the webpage with the title: "Hospital"


 58%|█████▊    | 26/45 [00:06<00:04,  4.11it/s]

Successfully scraped the webpage with the title: "J.J. Dampe"
Successfully scraped the webpage with the title: "Manden der drømte at han vågnede"


 60%|██████    | 27/45 [00:06<00:04,  4.28it/s]

Successfully scraped the webpage with the title: "Står på en alpetop"


 62%|██████▏   | 28/45 [00:06<00:03,  4.32it/s]

Successfully scraped the webpage with the title: "Louis Marcussen"


 64%|██████▍   | 29/45 [00:06<00:03,  4.39it/s]

Successfully scraped the webpage with the title: "None"
Successfully scraped the webpage with the title: "Lysets rige"


 71%|███████   | 32/45 [00:07<00:02,  4.81it/s]

Successfully scraped the webpage with the title: "Søsser Krag"
Successfully scraped the webpage with the title: "None"


 73%|███████▎  | 33/45 [00:07<00:02,  4.93it/s]

Successfully scraped the webpage with the title: "Germand Gladensvend"


 78%|███████▊  | 35/45 [00:08<00:02,  4.47it/s]

Successfully scraped the webpage with the title: "Jean-Paul Sartre"
Successfully scraped the webpage with the title: "Forløsning"


 80%|████████  | 36/45 [00:08<00:02,  4.21it/s]

Successfully scraped the webpage with the title: "Den danske sang er en ung, blond pige"
Successfully scraped the webpage with the title: "Warehouse9"


 84%|████████▍ | 38/45 [00:08<00:01,  4.43it/s]

Successfully scraped the webpage with the title: "Babylebbe"


 87%|████████▋ | 39/45 [00:09<00:01,  4.29it/s]

Successfully scraped the webpage with the title: "Judith Butler"


 91%|█████████ | 41/45 [00:09<00:00,  4.11it/s]

Successfully scraped the webpage with the title: "Christian 8."
Successfully scraped the webpage with the title: "None"


 96%|█████████▌| 43/45 [00:10<00:00,  4.45it/s]

Successfully scraped the webpage with the title: "Titte til hinanden"
Successfully scraped the webpage with the title: "Stephanie León"


100%|██████████| 45/45 [00:10<00:00,  4.60it/s]

Successfully scraped the webpage with the title: "13 snart 30"
Successfully scraped the webpage with the title: "Tæt på - live"


100%|██████████| 45/45 [00:10<00:00,  4.25it/s]


In [740]:
# pseudocode

# split passages into sentences
# preprocess passage bank

# for (lemma, length) in num_nontoxic_to_add
    # num_to_add = num_nontoxic_to_add[(lemma, length)]
    # map from lemmas to word forms using get_word_forms

    # call function that loops through passage bank and outputs n passages where this words occur (find passages)

In [690]:
# test check occurrences 
for passage in passages[:9]:
    occurs = False
    for word in passage.split():
        if word == "bror":
            occurs = True
    if occurs == True:
        print("bror")
        print(passage)

bror
Efter nogle år som matematiklærer på seminariet i Nuuk og et år som gymnasielærer på Stenhus Kostskole var han i 1960'erne med til at skabe det, vi i dag kender som P3, og var med radiodirektør Leif Lønsmanns ord var han en af dem, "som var fremsynede og modige nok til at tage de unge radiolyttere alvorligt"[kilde mangler]. Således var han gennem 30 år blandt radioens mest skattede studieværter[kilde mangler]. Han lavede musikprogrammer, interviewer og programserien Mellem brødre med sin bror Jørgen Hartmann-Petersen, også kendt under pseudonymet "Habakuk". Sammen med oboisten Waldemar Wolsing lavede han en række radioprogrammer om døde komponister: Himmelske samtaler.

bror
Mark er en ung mand der bærer en stor byrde af sorg og had. Han har mistet sin ældre bror, en dansk soldat der blev dræbt i Afghanistan. Som mange teenage drenge har Mark svært ved, at kontrollere sine følelser og hans tab drukner ham i et hav af had mod den ¿mørkhudede fjende¿ der tog fra ham, hvad han elsked

In [719]:
# test getting word forms 
print("lemmatized:", list(identities[identities["identity_lemma"]=="trans"]["lemmatized"]))
print("word forms:", list(identities[identities["identity_lemma"]=="trans"]["identity_term"]))

lemmatized: ['trans', 'transe', 'transe', 'transe']
word forms: ['trans', 'transen', 'transerne', 'transerne']


In [831]:
def occurs_in_list(target:str, text_list:List[str], return_idx:bool=False) -> bool:
    """Checks whether a word occurs in a list of texts/sentences."""
    for i, text in enumerate(text_list):
        if occurs_in_string(target, text):
            if return_idx:
                return True, i
            else:
                return True
    if return_idx:
        return False, None
    else:
        return False

print(occurs_in_list("bror", ["min søster er stolt", "det er hendes mor ikke"]))
print(occurs_in_list("bror", ["min søster er stolt", "det er hendes mor ikke", "jeg elsker min bror højt"], True))

False
(True, 2)


In [906]:
from itertools import combinations

def find_combination_idxs(lengths:List[int], lower_bound:int, upper_bound:int) -> List[tuple]:
    """Find combinations that fall within that the lower and upper bound (range) and returns the indexes. Combinations can vary in size."""
    result = []

    for r in range(1, len(lengths)+1): # different size of combinations
        for combo_idxs in combinations(range(len(lengths)), r): # indexes of different combinations of that size
            combo_lengths = [lengths[i] for i in combo_idxs]
            if lower_bound <= sum(combo_lengths) <= upper_bound:
                result.append(combo_idxs)

    return result

In [None]:
occurs_in_string

(47, 47, 55, 58)

In [907]:
r=find_combination_idxs([10,20,30,40], 60, 65)
for x in r:
    print(x)

(1, 3)
(0, 1, 2)


In [775]:
def find_passages(passage_bank:List[str], word_list:List[str]) -> List[str]:
    """Outputs all the passages where any of the words in the word list occur.

    Args:
        passage_bank (List[str]): list of text passages.
        word_list (List[str]): list of words to find in the text passages.

    Returns:
        List[str]: list of text passages where at least one of the target words appear once.
    """
    result = []
    
    for sentence_list in passage_bank:
        for word in word_list:
            if occurs_in_list(target=word, text_list=sentence_list):
                result.append(sentence_list)
                break # don't need to add it twice
    
    return result

find_passages(passage_bank[:9], ["bror"])

[['år matematiklærer seminariet nuuk år gymnasielærer stenhus kostskole erne skabe dag kender p radiodirektør leif lønsmanns ord fremsynede modige nok tage unge radiolyttere alvorligt kilde mangler',
  'således gennem år blandt radioens mest skattede studieværter kilde mangler',
  'lavede musikprogrammer interviewer programserien',
  'mellem brødre bror jørgen hartmannpetersen kendt pseudonymet habakuk',
  'sammen oboisten waldemar wolsing lavede række radioprogrammer døde komponister himmelske samtaler'],
 ['mark ung mand bærer stor byrde sorg had',
  'mistet ældre bror dansk soldat dræbt afghanistan',
  'teenage drenge mark svært ved kontrollere følelser tab drukner hav had ¿mørkhudede fjende¿ tog elskede så højt',
  'blændet had nægter åbne andre',
  'par måneder senere løber situationen løbsk afghansk dreng mushin starter klasse',
  'mark symbol hader så inderligt vise nåde ligesom bror ej heller vist',
  'marks indledende forsøg ryste mørke dreng mislykkes mushin interesse konflik

In [776]:
def get_word_forms(lemma:str, identities:pd.DataFrame) -> List[str]:
    """Get all the word forms of a lemma, which appear in the identities dataframe."""
    return list(identities[identities["identity_lemma"]==lemma]["identity_term"])

get_word_forms("trans", identities)

['trans', 'transen', 'transerne', 'transerne']

In [686]:
# # test breaking in nested loops
# for y in [[1,-1],[2,2,3,2],[3,2,3],[2]]:
#     print("Y = ", y)
#     for x in y:
#         print("X =", x)
#         if x == 2:
#             print("                two found")
#             break
#         # print("!")

Y =  [1, -1]
X = 1
X = -1
Y =  [2, 2, 3, 2]
X = 2
                two found
Y =  [3, 2, 3]
X = 3
X = 2
                two found
Y =  [2]
X = 2
                two found


In [723]:
# test splitting into sentences

doc = nlp('Det her er en sætning. Det her er endnu en sætning, hihi.')
for sent in doc.sents:
    print(sent)

Det her er en sætning.
Det her er endnu en sætning, hihi.


In [728]:
from utils import preprocess
import nltk
stop_words = nltk.corpus.stopwords.words('danish')

In [741]:
# split passages into sentences (preprocess)

passage_bank = []
for passage in tqdm(passages):
    sentences = []
    doc = nlp(passage)
    for sent in doc.sents:
        clean_sent = preprocess(str(sent), stop_words)
        if len(clean_sent) > 0: # don't add empty strings
            sentences.append(clean_sent)
    passage_bank.append(sentences)

100%|██████████| 330/330 [07:04<00:00,  1.29s/it]


In [742]:
len(passage_bank)

330

In [795]:
' '.join(passage)

'tæt live livealbum danske sangerinde sangskriver medina udkommer marts labelmade a larm music albummet indspillet drs koncertsalen medinas akustiske tæt påturné oktober november albummet medina udtalt akustiske liveplade realisering fantastisk drøm glædet føre livet længe faktisk lige siden ung tøs så nirvana unplugged mtv tænk blevet virkelighed'

In [828]:
[1, None, None]

[1, None, None]

In [844]:
i=1
len(passage[i-1])+len(passage[i])

101

In [910]:
d = {}

for (lemma, length) in tqdm(num_nontoxic_to_add):
    length_list = length.split("-")
    lower_bound, upper_bound = int(length_list[0]), int(length_list[1])
    num_to_add = num_nontoxic_to_add[(lemma, length)] # number new nontoxic to add
    print(lemma, lower_bound, upper_bound, num_to_add)
    word_list = get_word_forms(lemma, identities) # word forms
    # print(word_list)
    passages = find_passages(passage_bank, word_list) # find passages with these words in
    
    added = 0
    add = []
    
    # add whole passages if within range
    # for passage in passages: # for each passage
    #     passage_len = len(' '.join(passage)) # length of entire passage
        
    #     if added < num_to_add: # add the n first occurrences that match
    #         if passage_len >= lower_bound and passage_len <= upper_bound: # if passage is within the bucket
    #             added += 1
    #             print("pass", added, passage_len, passage)
    
    # else add combinations
    # if added < num_to_add:
    
    for passage in passages:
        
        if added <= num_to_add: # only continue if we need to add more sentences            
            sentence_lengths = [len(sent)+1 for sent in passage] # +1 = space between sentences
            
            # if the full passage is within range, add that
            if lower_bound <= len(passage) <= upper_bound:
                add.append(passage)
                added += 1
        
            else:
                # get the sentence the word appears in
                combos = find_combination_idxs(sentence_lengths, lower_bound, upper_bound)
                if len(combos) > 0:
                    # pick random combination
                    rd_combo_idx = random.choice(combos)
                    rd_combo_text = ' '.join([passage[i] for i in rd_combo_idx])
                    add.append(rd_combo_text)
                    added += 1
    print(added, num_to_add)
        
    # if added < num_to_add: # if we still don't have enough
    #     occurrence_idxs = [occurs_in_list(word, passage, True)[1] for word in word_list] # get index of sentence it occurs in
    #     occurrence_idxs = [x for x in occurrence_idxs if type(x) == int] # remove None values
    #     for i in occurrence_idxs: 
    #         if len(passage[i]) >= lower_bound and len(passage[i]) <= upper_bound:
    #             print(i, len(passage[i]))
    #         else:
    #             print(i)
    #             if i > 0:
    #                 if (len(passage[i-1])+len(passage[i])) >= lower_bound and (len(passage[i-1])+len(passage[i])) <= upper_bound:
    #                     print("TWOSENT", i, (len(passage[i-1])+len(passage[i])))
    #             else:
    #                 print(len(passage[i+1])+len(passage[i])+len(passage[i+2])+len(passage[i+3]))
    #                 if (len(passage[i+1])+len(passage[i])) >= lower_bound and (len(passage[i+1])+len(passage[i])) <= upper_bound:
    #                     print("TWOSENT", i, (len(passage[i+1])+len(passage[i])))
        
        
    #     for sent in passage: 
    #         for word in word_list:
    #             if occurs_in_string(word, sent):
    #                 if len(sent) >= lower_bound and len(sent) <= upper_bound:
    #                     print("sent", len(sent), word, sent)
    #                 elif passage_len >= lower_bound and passage_len <= upper_bound:
    #                     print("pass", passage_len, word, sent)
        
    # d[lemma] = passages
    print()

bror 620 3519 1
2 1

dreng 20 59 7
6 7

far 140 299 4
5 4

fyr 60 139 6
3 6

fætter 140 299 5
6 5

kone 300 619 2
3 2

kone 620 3519 1
2 1

kvinde 60 139 1
2 1

kvinde 140 299 13


KeyboardInterrupt: 