# Data Supplementation (suppl)

Method
1.	Identify identity terms with the most disproportionate data distributions 
    1. Stem/lemmatize dataset
    2. For each lemma in the synthetic test set:
        1. Check distribution across labels in dataset, i.e. difference between frequency in toxic comments and overall
        2.	Also check length differences!
    3. What does this mean exactly? 
        1.	“Identity terms affected by the false positive bias are disproportionately used in toxic comments in our training data. For example, the word ‘gay’ appears in 3% of toxic comments but only 0.5% of comments overall.”
        2.	Frequency of identity terms in toxic comments and overall: 
2.	Add additional non-toxic examples that contain the identity terms that appear disproportionately across labels in the original dataset
    1.	Use wiki data – assumed to be non-toxic
    2.	Add enough so that the balance is in line with the prior distribution for the overall dataset
        1.	E.g. until % “gay” in toxic comment is close to 0.50% as in overall data.
3.	Maybe consider different lengths as CNNs could be sensitive to this
    1.	“toxic comments tend to be shorter” (Dixon et al. 2018)
4.	Supposed to reduce false positives. Could also do the opposite? But more difficult to find toxic comments unless we take them from places that are supposedly toxic (e.g. “roast me”)


## Imports

In [1]:
# set cwd
import os
os.chdir("g:\\My Drive\\ITC, 5th semester (Thesis)\\Code\\Github_code\\toxicity_detection")

# imports
import pandas as pd
# from random import choice, choices
# from collections import 
import numpy as np
import matplotlib.pyplot as plt
# from string import punctuation
# # import spacy
from spacy import displacy
from tqdm import tqdm
from utils import load_dkhate
from typing import List
import pickle
import dacy
import utils
import nltk
# import re
# import string
from wiki_scraper import scrape_wiki_text
tqdm.pandas()

  from .autonotebook import tqdm as notebook_tqdm


## Functions

In [2]:
def lemmatize_text(text:str) -> str:
    """Returns a lemmatized version of the text or itself if the string is empty."""
    if len(text) > 0:
        doc = nlp(text)
        lemmas = [token.lemma_ for token in doc]
        lemmatized_text = " ".join(lemmas)
        return lemmatized_text
    else:
        return text

def occurs_in_string(target:str, text:str) -> bool:
    """Checks whether a word occurs in a text."""
    for word in text.split():
        if word == target:
            return True
    return False

## Load DaCy model

In [3]:
# load daCy model (medium works fine)
nlp = dacy.load("da_dacy_medium_trf-0.2.0") # takes around 4 minutes the first time

In [4]:
# test that it works as expected 
doc = nlp("Mit navn er Maja. Jeg bor på Bispebjerg, men er fra Næstved.") 
print("Token     \tLemma\t\tPOS-tag\t\tEntity type")
for tok in doc: 
    print(f"{str(tok).ljust(10)}:\t{str(tok.lemma_).ljust(10)}\t{tok.pos_}\t\t{tok.ent_type_}")
displacy.render(doc, style="ent")

Token     	Lemma		POS-tag		Entity type
Mit       :	Mit       	DET		
navn      :	navn      	NOUN		
er        :	være      	AUX		
Maja      :	Maja      	PROPN		PER
.         :	.         	PUNCT		
Jeg       :	jeg       	PRON		
bor       :	bo        	VERB		
på        :	på        	ADP		
Bispebjerg:	Bispebjerg	PROPN		LOC
,         :	,         	PUNCT		
men       :	men       	CCONJ		
er        :	være      	VERB		
fra       :	fra       	ADP		
Næstved   :	Næstved   	PROPN		LOC
.         :	.         	PUNCT		


## Load preprocessed training data

In [5]:
# load data splits 
_, _, y_train_orig, _ = load_dkhate(test_size=0.2)
with open(os.getcwd()+"/data/X_orig_preproc.pkl", "rb") as f:
    content = pickle.load(f)

X_train_orig = content["X_train"]
train_orig = pd.DataFrame([X_train_orig, y_train_orig]).T
train_orig.tail()

Unnamed: 0_level_0,tweet,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1
2378,hørt,0
1879,reaktion svensker,0
42,hey champ smide link ser hearthstone henne,0
457,melder vold voldtægt viser sandt beviser diver...,1
3108,betaler omkring mb kb får nok tættere kb kb be...,0


In [6]:
# lemmatize the texts
train_orig["lemmas"] = train_orig["tweet"].progress_apply(lemmatize_text)

100%|██████████| 2631/2631 [06:07<00:00,  7.15it/s]


In [7]:
# split into toxic, non-toxic and all
toxic_text = train_orig[train_orig["label"] == 1]["lemmas"]
nontoxic_text = train_orig[train_orig["label"] == 0]["lemmas"]
all_text =  train_orig["lemmas"]

NUM_TOXIC = len(toxic_text)
NUM_NONTOXIC = len(nontoxic_text)
NUM_TOTAL = len(all_text)

toxic_text.head()

id
1174    scanne lortet pc markere tage underskrift ny d...
3301    kunne klarer fyr stort se venn vej samme spor ...
1390    fuck meget sol varme lille regn please dansk å...
799     hvorfor fucking stor helvede fejre kristn hell...
900     ingen udlænding ved grænse heller kriminell ku...
Name: lemmas, dtype: object

#### Oversampled

In [8]:
# with open(os.getcwd()+"/data/orig_dataset_splits.pkl", "rb") as f:
#     orig_oversampled = pickle.load(f)
# X_oversampl = orig_oversampled["X training preprocessed and oversampled"]
# y_oversampl = orig_oversampled["y training preprocessed and oversampled"]

In [9]:
# train_oversampl = pd.DataFrame([X_oversampl, y_oversampl]).T
# train_oversampl.rename(columns={"Unnamed 0": "tweet"}, inplace=True)
# train_oversampl

In [10]:
# # lemmatize the texts
# train_oversampl["lemmas"] = train_oversampl["tweet"].progress_apply(lemmatize_text)

In [11]:
# # split into toxic, non-toxic and all
# toxic_text_oversampl = train_oversampl[train_oversampl["label"] == 1]["lemmas"]
# nontoxic_text_oversampl = train_oversampl[train_oversampl["label"] == 0]["lemmas"]
# all_text_oversampl = train_oversampl["lemmas"]

# NUM_TOXIC_OVERSAMPL = len(toxic_text_oversampl)
# NUM_NONTOXIC_OVERSAMPL = len(nontoxic_text_oversampl)
# NUM_TOTAL_OVERSAMPL = len(all_text_oversampl)

# toxic_text_oversampl.head()

## Load identity terms

In [12]:
# load identity terms
identities = pd.read_excel(os.getcwd()+"/data/identity_terms.xlsx")
print(len(set(identities["identity_lemma"])), "unique identity lemmas")
identities.tail()

45 unique identity lemmas


Unnamed: 0,identity_term,identity_lemma
155,transpersonerne,transperson
156,transvestitterne,transvestit
157,transerne,trans
158,androgynerne,androgyn
159,hermafroditterne,hermafrodit


In [13]:
# lemmatize the identity terms
identities["lemmatized"] = identities["identity_term"].progress_apply(lemmatize_text)
print(len(set(identities["lemmatized"])), "unique lemmatized identity terms")
identities.tail()

100%|██████████| 160/160 [00:09<00:00, 16.49it/s]


133 unique lemmatized identity terms


Unnamed: 0,identity_term,identity_lemma,lemmatized
155,transpersonerne,transperson,transperson
156,transvestitterne,transvestit,transvestitterne
157,transerne,trans,transe
158,androgynerne,androgyn,androgynerne
159,hermafroditterne,hermafrodit,hermafroditterne


In [14]:
# create map from lemmatized word to the actual lemma
lemmatized_2_lemma = dict(zip(identities["lemmatized"], identities["identity_lemma"]))

## Test scraper

In [15]:
content = scrape_wiki_text("https://da.wikipedia.org/wiki/Sankt_Mortens_Kirke_(N%C3%A6stved)")
print("_"*100)
for text in content:
    print(text)

Successfully scraped the webpage with the title: "Sankt Mortens Kirke (Næstved)"
____________________________________________________________________________________________________
55°13′47″N 11°45′39″Ø﻿ / ﻿55.2297°N 11.7608°Ø﻿ / 55.2297; 11.7608Koordinater: 55°13′47″N 11°45′39″Ø﻿ / ﻿55.2297°N 11.7608°Ø﻿ / 55.2297; 11.7608
Sankt Mortens Kirke er beliggende i Næstved centrum og er en af byens gamle middelalderkirker. Den er kendt fra en tidlig optegnelse omkring 1280, men menes at være bygget og taget i brug omkring 1200.

Kirken, der fra middelalderen blev bygget til at være byens sognekirke, mens de andre kirke var tættere knyttet til ordensvæsenet, er opkaldt efter den legendariske Sankt Martin af Tours, på dansk kaldt Sankt Morten.
Sankt Morten fejrer man i Danmark på Skt. Mortens aften den 10. november (Skt. Mortens Dag er den 11. november). Ifølge legenden ville Martin af Tours ikke udnævnes til biskop og gemte sig i en gåsesti. Gæssene afslørede ham ved deres høje skræppen, hvor

## Perform Data Supplemenation

### Frequency of identity terms

In [16]:
# test function
print("This should return False. Result:", occurs_in_string("mor", "elsker din humor"))
print("This should return True.  Result:", occurs_in_string("mor", "hans mor er pænt sød"))

This should return False. Result: False
This should return True.  Result: True


In [17]:
# count how many texts these terms occur in
lemmatized_identities = list(set(identities["lemmatized"]))
occur_in_n_texts = {"lemmatized_identity": lemmatized_identities, "toxic_count": [], "nontoxic_count":[], "total_count":[]}

for lemma in lemmatized_identities:
    occur_in_n_texts["toxic_count"].append(toxic_text.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
    occur_in_n_texts["nontoxic_count"].append(nontoxic_text.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
    occur_in_n_texts["total_count"].append(all_text.apply(lambda x: (occurs_in_string(target=lemma, text=x))).sum())

In [18]:
# create df with these occurrence numbers
occurrence_df = pd.DataFrame(occur_in_n_texts)

# map back to actual lemma and aggregate duplicates
occurrence_df["lemma"] = occurrence_df["lemmatized_identity"].map(lemmatized_2_lemma)
occurrence_df = occurrence_df.groupby("lemma").agg({"toxic_count": "sum", "nontoxic_count": "sum", "total_count": "sum"}).reset_index()

# calculate percentages
occurrence_df["toxic_pct"] = (occurrence_df["toxic_count"]/NUM_TOXIC)*100 
occurrence_df["nontoxic_pct"] = (occurrence_df["nontoxic_count"]/NUM_NONTOXIC)*100 
occurrence_df["total_pct"] = (occurrence_df["total_count"]/NUM_TOTAL)*100 

# calculate differences
occurrence_df["tox_total_diff"] = occurrence_df["toxic_pct"] - occurrence_df["total_pct"]
occurrence_df["tox_total_abs_diff"] = abs(occurrence_df["toxic_pct"] - occurrence_df["total_pct"])

# sort by difference
sorted_occurrence_df = occurrence_df.sort_values("tox_total_diff", ascending=False).reset_index(drop=True)

# display rows where toxic pct != total pct
sorted_occurrence_df[sorted_occurrence_df["tox_total_diff"] != 0].round(2)

Unnamed: 0,lemma,toxic_count,nontoxic_count,total_count,toxic_pct,nontoxic_pct,total_pct,tox_total_diff,tox_total_abs_diff
0,mand,16,57,73,4.6,2.5,2.77,1.82,1.82
1,kvinde,7,26,33,2.01,1.14,1.25,0.76,0.76
2,fyr,1,0,1,0.29,0.0,0.04,0.25,0.25
3,mandfolk,1,0,1,0.29,0.0,0.04,0.25,0.25
4,queer,1,0,1,0.29,0.0,0.04,0.25,0.25
5,kvindfolk,1,0,1,0.29,0.0,0.04,0.25,0.25
6,tøs,1,0,1,0.29,0.0,0.04,0.25,0.25
7,søn,1,1,2,0.29,0.04,0.08,0.21,0.21
8,fætter,1,2,3,0.29,0.09,0.11,0.17,0.17
9,kone,2,9,11,0.57,0.39,0.42,0.16,0.16


In [19]:
# save this df
sorted_occurrence_df.to_excel(os.getcwd()+"/mitigation/frequency_of_identity_lemmas.xlsx")

The ones with a difference > 0 are the ones that I need to look at. 

I can actually make a difference here by adding non-toxic data and getting the toxic_pct number closer to the total_pct number, thereby reducing the difference so it's as close to zero as possible. 

#### Oversampled data

In [20]:
# # count how many texts these terms occur in
# occur_in_n_texts_oversampl = {"lemmatized_identity": lemmatized_identities, "toxic_count": [], "nontoxic_count":[], "total_count":[]}

# for lemma in lemmatized_identities:
#     occur_in_n_texts_oversampl["toxic_count"].append(toxic_text_oversampl.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
#     occur_in_n_texts_oversampl["nontoxic_count"].append(nontoxic_text_oversampl.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())
#     occur_in_n_texts_oversampl["total_count"].append(all_text_oversampl.apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum())

In [21]:
# # create df with these occurrence numbers
# occurrence_df_oversampl = pd.DataFrame(occur_in_n_texts_oversampl)

# # map back to actual lemma and aggregate duplicates
# occurrence_df_oversampl["lemma"] = occurrence_df_oversampl["lemmatized_identity"].map(lemmatized_2_lemma)
# occurrence_df_oversampl = occurrence_df_oversampl.groupby("lemma").agg({"toxic_count": "sum", "nontoxic_count": "sum", "total_count": "sum"}).reset_index()

# # calculate percentages
# occurrence_df_oversampl["toxic_pct"] = (occurrence_df_oversampl["toxic_count"]/NUM_TOXIC_OVERSAMPL)*100 
# occurrence_df_oversampl["nontoxic_pct"] = (occurrence_df_oversampl["nontoxic_count"]/NUM_NONTOXIC_OVERSAMPL)*100 
# occurrence_df_oversampl["total_pct"] = (occurrence_df_oversampl["total_count"]/NUM_TOTAL_OVERSAMPL)*100 

# # calculate differences
# occurrence_df_oversampl["tox_total_diff"] = occurrence_df_oversampl["toxic_pct"] - occurrence_df_oversampl["total_pct"]
# occurrence_df_oversampl["tox_total_abs_diff"] = abs(occurrence_df_oversampl["toxic_pct"] - occurrence_df_oversampl["total_pct"])

# # sort by difference
# sorted_occurrence_df_oversampl = occurrence_df_oversampl.sort_values("tox_total_diff", ascending=False).reset_index(drop=True)

# # display rows where toxic pct != total pct
# sorted_occurrence_df_oversampl[sorted_occurrence_df_oversampl["tox_total_diff"] != 0].round(2)

In [22]:
# # save this df
# sorted_occurrence_df_oversampl.to_excel(os.getcwd()+"/mitigation/frequency_of_identity_lemmas_oversampl.xlsx")

The difference is that *dreng* is now in the top part (positive). Some differences are smaller, some are larger.

### Length differences

Percent of comments labeled as toxic at each length containing the given terms, e.g.:

| Term | 20-59 | 60-179 |
|:---:|:---:|:---:|
| ALL | 17% | 12% |
| gay | 88% | 77% |
| queer | 75% | 83% |
| ... | ... | ... |

Other lengths:
* 180-539
* 540-1619
* 1620-4859


Method:

* For each lemma:
  * Find the texts that it occur in
  * Separate these texts into 5 length buckets
  * For each length_bucket:
    * Find the percentage that are toxic

In [23]:
# add lengths to df
train_orig["length"] = train_orig["tweet"].progress_apply(lambda x: len(x))

100%|██████████| 2631/2631 [00:00<00:00, 393552.56it/s]


In [24]:
# divide into 6 buckets
print("Min length:", train_orig["length"].min())
print("Max length:", train_orig["length"].max())

bin1 = train_orig.query("0 <= length <= 19") # 20
bin2 = train_orig.query("20 <= length <= 59") # 40
bin3 = train_orig.query("60 <= length <= 139") # 80
bin4 = train_orig.query("140 <= length <= 299") # 160
bin5 = train_orig.query("300 <= length <= 619") # 320
bin6 = train_orig.query("620 <= length") # the rest
bins = [bin1, bin2, bin3, bin4, bin5, bin6]
bin_labels = ["0-19", "20-59", "60-139", "140-299", "300-619", "620-3519"]

Min length: 0
Max length: 3518


In [25]:
# find proportion of toxic comments for each bin (no specific terms)
results = {"bin_range":bin_labels, "toxic":[], "nontoxic":[]}
for bin in bins: # length bins
    results["toxic"].append(len(bin[bin["label"] == 1])) # count toxic in that bin
    results["nontoxic"].append(len(bin[bin["label"] == 0])) # and non-toxic

# prepare preliminary results df
prel_results_df = pd.DataFrame(results)
prel_results_df["pct_toxic"] = ( prel_results_df["toxic"] / (prel_results_df["toxic"]+prel_results_df["nontoxic"]) ) * 100 # add percentage
prel_results_df.set_index("bin_range", inplace=True)

# add to final results df
results_df_1 = prel_results_df[["pct_toxic"]].T
results_df_1.index = ["ALL"]
results_df_1.round(2)

bin_range,0-19,20-59,60-139,140-299,300-619,620-3519
ALL,10.74,12.79,13.75,17.5,29.63,33.33


In [26]:
# do the same for each lemma

# prepare dicts
toxic_count_dict = {"lemmatized_identity": lemmatized_identities}
total_count_dict = {"lemmatized_identity": lemmatized_identities}
for label in bin_labels:
    toxic_count_dict[label] = []
    total_count_dict[label] = []
    
for lemma in lemmatized_identities: # for each lemma
    for (bin_label, bin) in zip(bin_labels, bins): # for each bin
        
        # count no. of toxic/all texts this lemma occurs in in this bin
        toxic_count = bin[bin["label"]==1]["lemmas"].apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum() 
        total_count = bin["lemmas"].apply(lambda x: int(occurs_in_string(target=lemma, text=x))).sum() 
        
        # add to count_dicts
        toxic_count_dict[bin_label].append(toxic_count)
        total_count_dict[bin_label].append(total_count)

In [27]:
# create df with these occurrence numbers
toxic_count_df = pd.DataFrame(toxic_count_dict)
total_count_df = pd.DataFrame(total_count_dict)

# map back to actual lemma and aggregate duplicates
toxic_count_df["lemma"] = toxic_count_df["lemmatized_identity"].map(lemmatized_2_lemma)
toxic_count_df = toxic_count_df.groupby("lemma").agg({"0-19": "sum", "20-59": "sum", "60-139": "sum", "140-299": "sum", "300-619": "sum", "620-3519": "sum"}).reset_index()
toxic_count_df["sum"] = toxic_count_df["0-19"] + toxic_count_df["20-59"] + toxic_count_df["60-139"] + toxic_count_df["140-299"] + toxic_count_df["300-619"] + toxic_count_df["620-3519"]
toxic_count_df = toxic_count_df.sort_values("lemma")
total_count_df["lemma"] = total_count_df["lemmatized_identity"].map(lemmatized_2_lemma)
total_count_df = total_count_df.groupby("lemma").agg({"0-19": "sum", "20-59": "sum", "60-139": "sum", "140-299": "sum", "300-619": "sum", "620-3519": "sum"}).reset_index()
total_count_df["sum"] = total_count_df["0-19"] + total_count_df["20-59"] + total_count_df["60-139"] + total_count_df["140-299"] + total_count_df["300-619"] + total_count_df["620-3519"]
total_count_df = total_count_df.sort_values("lemma")

In [28]:
toxic_count_df.columns[1:-1]

Index(['0-19', '20-59', '60-139', '140-299', '300-619', '620-3519'], dtype='object')

In [29]:
# add to results df
results_df_2 = toxic_count_df[["lemma"]]
for col in toxic_count_df.columns[1:-1]:
    results_df_2[col] = (toxic_count_df[col] / total_count_df[col]) * 100 # calculate percentages
results_df_2.set_index("lemma", inplace=True)

In [31]:
# final df
results_df = pd.concat([results_df_1, results_df_2])
results_df.dropna(axis = 0, how = 'all', inplace = True) # drop rows with all NA values
display(results_df.round(2).fillna("")) # show results

Unnamed: 0,0-19,20-59,60-139,140-299,300-619,620-3519
ALL,10.74,12.79,13.75,17.5,29.63,33.33
bror,,0.0,0.0,0.0,0.0,50.0
dame,,,0.0,0.0,,
datter,,,0.0,0.0,0.0,0.0
dreng,,100.0,0.0,,,0.0
far,,0.0,0.0,50.0,,0.0
fyr,,,100.0,,,
fætter,,0.0,,100.0,0.0,
herre,,0.0,0.0,,,
kone,,0.0,0.0,0.0,100.0,50.0


In [32]:
# save results
results_df = results_df.fillna("") # fill NAs
results_df.to_excel(os.getcwd()+"/mitigation/toxicity_at_diff_lengths.xlsx") # save as xlsx file

### Calculate how much new data is needed
Based on:
https://github.com/conversationai/unintended-ml-bias-analysis/blob/main/archive/unintended_ml_bias/Dataset_bias_analysis.ipynb

In [None]:
## Pseudocode
# num_nontoxic_to_add = {}

# for word in list_of_words_to_fix:
#   for length:
#       t = get t from toxic_count_df
#       n = get n from total_count_df - t
#       f = get from results_df.loc[ALL, bin_label]
#       a = calculate_nontoxic_to_add(f=f, n=n, t=t, method="round")
#       num_nontoxic_to_add[word] = a

In [33]:
def calculate_nontoxic_to_add(f:float, n:int, t:int, method:str) -> int:
    """Calculate how many non-toxic examples you need to add to get the desired non-toxic fraction.

    Args:
        f (float): desired non-toxic fraction.
        n (int): current number of non-toxic examples.
        t (int): current number of toxic examples.
        method (str): method to convert result to int: "round", "ceiling", or "floor".

    Returns:
        int: number of non-toxic examples to add.    
    """
    a = (f*(t+n)-n) / (1-f)
    
    method = method.lower()
    if method == "round":
        return round(a)
    elif method == "ceiling":
        return int(np.ceil(a))
    elif method == "floor":
        return int(np.floor(a))
    else:
        raise Exception("Unknown method. Must be either 'round', 'ceiling', or 'floor'.")

def calculate_nontoxic_fraction(n:float, t:float, a:int) -> float:
    """Returns the fraction of non-toxic examples.

    Args:
        n (int): current number of non-toxic examples.
        t (int): current number of toxic examples.
        a (int): number of non-toxic examples to add.

    Returns:
        float: non-toxic fraction.
    """
    f = (n+a) / (t+n+a)
    return f

In [643]:
# # example (mand 140-?)
# t = 6 # current number of toxic examples
# n = 18 # current number of non-toxic examples
# a = calculate_nontoxic_to_add(f=0.825, n=n, t=t, method="round")
# f = calculate_nontoxic_fraction(n=n, t=t, a=a) # new toxic fraction

# print("Old non-toxic fraction  :", round(calculate_nontoxic_fraction(n=n, t=t, a=0), 4))
# print("Add n non-toxic examples:", a)
# print("New non-toxic fraction  :", round(calculate_nontoxic_fraction(n=n, t=t, a=a), 4))

In [644]:
# # example (pige 60-?)
# t = 1 # current number of toxic examples
# n = 2 # current number of non-toxic examples
# a = calculate_nontoxic_to_add(f=0.8625, n=n, t=t, method="round")
# f = calculate_nontoxic_fraction(n=n, t=t, a=a) # new toxic fraction

# print("Old toxic fraction      :", round(calculate_nontoxic_fraction(n=n, t=t, a=0), 4))
# print("Add n non-toxic examples:", a)
# print("New toxic fraction      :", round(calculate_nontoxic_fraction(n=n, t=t, a=a), 4))

In [34]:
# find words to fix
overall_prior_distributions = results_df.iloc[0, :] 
lengths = overall_prior_distributions.keys()
unbalanced_lemmas_at_lengths = {}

print("LEMMA\t\tLENGTH\t\tTOXIC%")
for row in results_df.iloc[1:,:].iterrows(): # for each unbalanced row
    lemma = row[0]
    content = row[1]
    
    unbalanced_lengths = []
    for i, x in enumerate(content): # for each column (= length bucket)
        if type(x) == float and x >= overall_prior_distributions.iloc[i]: # if the percentage of toxic is larger than the prior distribution 
            print(f"{lemma.ljust(9)}\t{lengths[i].ljust(8)}\t{x:6.2f} %") 
            unbalanced_lengths.append(lengths[i])        
    if unbalanced_lengths: # if not empty
        unbalanced_lemmas_at_lengths[lemma] = unbalanced_lengths

LEMMA		LENGTH		TOXIC%
bror     	620-3519	 50.00 %
dreng    	20-59   	100.00 %
far      	140-299 	 50.00 %
fyr      	60-139  	100.00 %
fætter   	140-299 	100.00 %
kone     	300-619 	100.00 %
kone     	620-3519	 50.00 %
kvinde   	60-139  	 16.67 %
kvinde   	140-299 	 40.00 %
kvindfolk	300-619 	100.00 %
mand     	60-139  	 50.00 %
mand     	140-299 	 25.00 %
mandfolk 	20-59   	100.00 %
mor      	60-139  	 33.33 %
pige     	60-139  	 33.33 %
queer    	620-3519	100.00 %
søn      	140-299 	100.00 %
tøs      	0-19    	100.00 %


In [35]:
# display words to fix
unbalanced_lemmas_at_lengths

{'bror': ['620-3519'],
 'dreng': ['20-59'],
 'far': ['140-299'],
 'fyr': ['60-139'],
 'fætter': ['140-299'],
 'kone': ['300-619', '620-3519'],
 'kvinde': ['60-139', '140-299'],
 'kvindfolk': ['300-619'],
 'mand': ['60-139', '140-299'],
 'mandfolk': ['20-59'],
 'mor': ['60-139'],
 'pige': ['60-139'],
 'queer': ['620-3519'],
 'søn': ['140-299'],
 'tøs': ['0-19']}

In [36]:
# find a for each unbalanced lemma at length

num_nontoxic_to_add = {}
old_new_nontoxic_frac = {}
total_to_add = 0

for lemma in unbalanced_lemmas_at_lengths: # for word in list_of_words_to_fix:
    
    for length in unbalanced_lemmas_at_lengths[lemma]: # for length
    
        current_toxic = toxic_count_df[toxic_count_df["lemma"]==lemma][length].iloc[0] #  t = get t from toxic_count_df
        current_total = total_count_df[total_count_df["lemma"]==lemma][length].iloc[0]
        current_nontoxic = current_total - current_toxic # n = get n from total_count_df - t
        desired_f = 1 - (overall_prior_distributions[length]/100) # f = 1 - toxic frac (get this from overall_prior_distributions/100 (results_df.loc[ALL, bin_label]))
        add_n_nontoxic = calculate_nontoxic_to_add(f=desired_f, n=current_nontoxic, t=current_toxic, method="round")
        
        num_nontoxic_to_add[(lemma, length)] = add_n_nontoxic
        new_f = calculate_nontoxic_fraction(n=current_nontoxic, t=current_toxic, a=add_n_nontoxic)
        old_new_nontoxic_frac[(lemma, length)] = (desired_f, new_f)
        total_to_add += add_n_nontoxic
print("Done")
print("Total to add:", total_to_add)

Done
Total to add: 114


In [37]:
# display results
print("(lemma, length): number to add")
num_nontoxic_to_add

(lemma, length): number to add


{('bror', '620-3519'): 1,
 ('dreng', '20-59'): 7,
 ('far', '140-299'): 4,
 ('fyr', '60-139'): 6,
 ('fætter', '140-299'): 5,
 ('kone', '300-619'): 2,
 ('kone', '620-3519'): 1,
 ('kvinde', '60-139'): 1,
 ('kvinde', '140-299'): 13,
 ('kvindfolk', '300-619'): 2,
 ('mand', '60-139'): 32,
 ('mand', '140-299'): 10,
 ('mandfolk', '20-59'): 7,
 ('mor', '60-139'): 4,
 ('pige', '60-139'): 4,
 ('queer', '620-3519'): 2,
 ('søn', '140-299'): 5,
 ('tøs', '0-19'): 8}

In [38]:
# display old and new nontoxic fraction
old_new_nontoxic_frac_df = pd.DataFrame(old_new_nontoxic_frac).T
old_new_nontoxic_frac_df.rename(columns={0:"old_f", 1:"new_f"}, inplace=True)
old_new_nontoxic_frac_df.round(4)

Unnamed: 0,Unnamed: 1,old_f,new_f
bror,620-3519,0.6667,0.6667
dreng,20-59,0.8721,0.875
far,140-299,0.825,0.8333
fyr,60-139,0.8625,0.8571
fætter,140-299,0.825,0.8333
kone,300-619,0.7037,0.6667
kone,620-3519,0.6667,0.6667
kvinde,60-139,0.8625,0.8571
kvinde,140-299,0.825,0.8261
kvindfolk,300-619,0.7037,0.6667


In [None]:
# now:
# for each word to add:
    # find page that mentions this word
    # scrape this page
    # add text to big text bank

# for each word to add:
    # search in text bank for passages that mentions this lemma
    # extract these passages and divide them into sentences
    # preprocess said passages
    # if one matches the given length bucket, add it
    # otherwise, go into sentences. if one of these match, then add it. otherwise, add this sentence + surrounding sentences until we get the desired length.

# add to training data

search on wiki:

- advanced search
- one of these words: the four variants, e.g. "bror, broren, brødre, brødrene"
- these categories: "biografier", "filmskolefilm fra Danmark", "sange fra Danmark"
- sorted by relevance
- top 1 result from each category
- only difference is queer that had no results in these categories, so had to just search for "queer" and use three random pages (undgik hoved/definitionssiden)

bror:
- https://da.wikipedia.org/wiki/Hemming_Hartmann-Petersen
- https://da.wikipedia.org/wiki/Zafir_(film_fra_2011)
- https://da.wikipedia.org/wiki/Brdr._Gebis

dreng
- https://da.wikipedia.org/wiki/Mogens_Wenzel_Andreasen
- https://da.wikipedia.org/wiki/Dreng_(dokumentarfilm)
- https://da.wikipedia.org/wiki/We_Wanna_Be_Free 

far
- https://da.wikipedia.org/wiki/Christian_Molbech # not top result, because it was a different word "fædrene tro" that was the hit
- https://da.wikipedia.org/wiki/Vore_F%C3%A6dres_S%C3%B8nner
- https://da.wikipedia.org/wiki/Ebbe_Skammels%C3%B8n # is this toxic? "kvæste sin far"

fyr
- XX MANGLER, SE KOMMENTAR NEDENFOR
    - https://da.wikipedia.org/wiki/John_Green_(forfatter) (søgte på "en ung fyr")
- https://da.wikipedia.org/wiki/LUCK.exe
- https://da.wikipedia.org/wiki/Du_G%C3%B8r_Mig # not the first as the others were about "FYR OG FLAMME

fætter
- https://da.wikipedia.org/wiki/Eleonore_Tscherning 
- INGEN MED FILM ELLER SANGE, DERFOR BARE TO FRA GENEREL SØGNING
    - https://da.wikipedia.org/wiki/F%C3%A6tter_H%C3%B8jben
    - https://da.wikipedia.org/wiki/Min_f%C3%A6tter_er_pirat

kone
- https://da.wikipedia.org/wiki/Ralf_Pittelkow
- https://da.wikipedia.org/wiki/Deadline_(film_fra_2005) (ikke første, her var det en titel)
- https://da.wikipedia.org/wiki/Krig_og_fred_(Shu-bi-dua)

kvinde
- https://da.wikipedia.org/wiki/Thora_Esche
- https://da.wikipedia.org/wiki/Kvinden_(film)
- https://da.wikipedia.org/wiki/Danske_sild_(Shu-bi-dua-sang)

kvindfolk
- ingen hits i de tre kategorier, derfor bare fra generel søgning
    - https://da.wikipedia.org/wiki/G%C3%A5rd_fra_Pebringe,_Sj%C3%A6lland_(Frilandsmuseet)
    - https://da.wikipedia.org/wiki/Sophie_Caroline_af_Ostfriesland
    - https://da.wikipedia.org/wiki/Hospital

mand
- https://da.wikipedia.org/wiki/J.J._Dampe (ikke den første, fordi ordet kun optrådte i titler/værker der)
- https://da.wikipedia.org/wiki/Manden_der_dr%C3%B8mte_at_han_v%C3%A5gnede
- https://da.wikipedia.org/wiki/St%C3%A5r_p%C3%A5_en_alpetop

mandfolk
- ingen hits i de tre kategorier, derfor bare fra generel søgning (mange af disse var bare filmtitler, dvs. ikke sætninger)
    - https://da.wikipedia.org/wiki/Louis_Marcussen
    - https://da.wikipedia.org/wiki/Asterix_og_vikingerne_(tegnefilm)
    - https://da.wikipedia.org/wiki/Lysets_rige

mor
- https://da.wikipedia.org/wiki/S%C3%B8sser_Krag
- https://da.wikipedia.org/wiki/Kokon_(film_fra_2019)
- https://da.wikipedia.org/wiki/Germand_Gladensvend (skippede dem vi havde allerede)

pige
- https://da.wikipedia.org/wiki/Jean-Paul_Sartre (samme som med sangen)
- https://da.wikipedia.org/wiki/Forl%C3%B8sning
- https://da.wikipedia.org/wiki/Den_danske_sang_er_en_ung,_blond_pige (første var kun titel)

queer:
- https://da.wikipedia.org/wiki/Warehouse9 (culture)
- https://da.wikipedia.org/wiki/Babylebbe (movie)
- https://da.wikipedia.org/wiki/Judith_Butler (person)

søn
- https://da.wikipedia.org/wiki/Christian_8.
- https://da.wikipedia.org/wiki/F%C3%A6dreland_(film) (skippede dem vi havde allerede)
- https://da.wikipedia.org/wiki/Titte_til_hinanden (skippede dem vi havde allerede)

tøs
- https://da.wikipedia.org/wiki/Stephanie_Le%C3%B3n (samme som ved sangen)
- https://da.wikipedia.org/wiki/13_snart_30 (film tilladt for alle, da ingen hits ellers)
- https://da.wikipedia.org/wiki/T%C3%A6t_p%C3%A5_-_live (generel søgning, for få hits ved specifik søgning)


cannot find a biography that uses the word "fyr". mostly slang. can only find ones that use "fyret" (e.g. "fyret fra sit arbejde") or "fyrre"


**Decided to add more for the words where len(passage) or len(sentence) was not enough (just from general search)**

In [106]:
urls = [
    "https://da.wikipedia.org/wiki/Hemming_Hartmann-Petersen",
    "https://da.wikipedia.org/wiki/Zafir_(film_fra_2011)",
    "https://da.wikipedia.org/wiki/Brdr._Gebis",
    "https://da.wikipedia.org/wiki/Mogens_Wenzel_Andreasen",
    "https://da.wikipedia.org/wiki/Dreng_(dokumentarfilm)",
    "https://da.wikipedia.org/wiki/We_Wanna_Be_Free",
    "https://da.wikipedia.org/wiki/Christian_Molbech",
    "https://da.wikipedia.org/wiki/Vore_F%C3%A6dres_S%C3%B8nner",
    "https://da.wikipedia.org/wiki/Ebbe_Skammels%C3%B8n",
    "https://da.wikipedia.org/wiki/John_Green_(forfatter)",
    "https://da.wikipedia.org/wiki/LUCK.exe",
    "https://da.wikipedia.org/wiki/Du_G%C3%B8r_Mig",
    "https://da.wikipedia.org/wiki/Eleonore_Tscherning",
    "https://da.wikipedia.org/wiki/F%C3%A6tter_H%C3%B8jben",
    "https://da.wikipedia.org/wiki/Min_f%C3%A6tter_er_pirat",
    "https://da.wikipedia.org/wiki/Ralf_Pittelkow",
    "https://da.wikipedia.org/wiki/Deadline_(film_fra_2005)",
    "https://da.wikipedia.org/wiki/Krig_og_fred_(Shu-bi-dua)",
    "https://da.wikipedia.org/wiki/Thora_Esche",
    "https://da.wikipedia.org/wiki/Kvinden_(film)",
    "https://da.wikipedia.org/wiki/Danske_sild_(Shu-bi-dua-sang)",
    "https://da.wikipedia.org/wiki/G%C3%A5rd_fra_Pebringe,_Sj%C3%A6lland_(Frilandsmuseet)",
    "https://da.wikipedia.org/wiki/Sophie_Caroline_af_Ostfriesland",
    "https://da.wikipedia.org/wiki/Hospital",
    "https://da.wikipedia.org/wiki/J.J._Dampe",
    "https://da.wikipedia.org/wiki/Manden_der_dr%C3%B8mte_at_han_v%C3%A5gnede",
    "https://da.wikipedia.org/wiki/St%C3%A5r_p%C3%A5_en_alpetop",
    "https://da.wikipedia.org/wiki/Louis_Marcussen", # no hit
    "https://da.wikipedia.org/wiki/Asterix_og_vikingerne_(tegnefilm)", # hit
    "https://da.wikipedia.org/wiki/Lysets_rige", # no hit
    "https://da.wikipedia.org/wiki/S%C3%B8sser_Krag",
    "https://da.wikipedia.org/wiki/Kokon_(film_fra_2019)",
    "https://da.wikipedia.org/wiki/Germand_Gladensvend",
    "https://da.wikipedia.org/wiki/Jean-Paul_Sartre",
    "https://da.wikipedia.org/wiki/Forl%C3%B8sning",
    "https://da.wikipedia.org/wiki/Den_danske_sang_er_en_ung,_blond_pige",
    "https://da.wikipedia.org/wiki/Warehouse9",
    "https://da.wikipedia.org/wiki/Babylebbe",
    "https://da.wikipedia.org/wiki/Judith_Butler",
    "https://da.wikipedia.org/wiki/Christian_8.",
    "https://da.wikipedia.org/wiki/F%C3%A6dreland_(film)",
    "https://da.wikipedia.org/wiki/Titte_til_hinanden",
    "https://da.wikipedia.org/wiki/Stephanie_Le%C3%B3n",
    "https://da.wikipedia.org/wiki/13_snart_30",
    "https://da.wikipedia.org/wiki/T%C3%A6t_p%C3%A5_-_live",
    
    # newly added (5 random from general search for lemmas that still need extra data)
    "https://da.wikipedia.org/wiki/Der_var_engang_en_dreng",
    "https://da.wikipedia.org/wiki/Niels_Pind_og_hans_dreng",
    "https://da.wikipedia.org/wiki/Smukke_dreng",
    "https://da.wikipedia.org/wiki/Portr%C3%A6t_af_en_dreng",
    "https://da.wikipedia.org/wiki/Drengen",
    "https://da.wikipedia.org/wiki/Clint_Eastwood",
    "https://da.wikipedia.org/wiki/Winfield_Scott",
    "https://da.wikipedia.org/wiki/Sara_Bl%C3%A6del",
    "https://da.wikipedia.org/wiki/David_Firth",
    "https://da.wikipedia.org/wiki/Stephen_Dorff",
    "https://da.wikipedia.org/wiki/F%C3%A6tter_Vims",
    "https://da.wikipedia.org/wiki/F%C3%A6tter_Guf",
    "https://da.wikipedia.org/wiki/F%C3%A6tter_BR",
    "https://da.wikipedia.org/wiki/Agamemnon",
    "https://da.wikipedia.org/wiki/Brylluppet_mellem_kronprinsesse_Victoria_og_Daniel_Westling",
    "https://da.wikipedia.org/wiki/En_n%C3%B8gen_kvinde_s%C3%A6tter_sit_h%C3%A5r_foran_et_spejl",
    "https://da.wikipedia.org/wiki/Kvinders_valgret",
    "https://da.wikipedia.org/wiki/EM_i_fodbold_2022_(kvinder)",
    "https://da.wikipedia.org/wiki/En_duft_af_kvinde",
    "https://da.wikipedia.org/wiki/Kvindernes_internationale_kampdag",
    "https://da.wikipedia.org/wiki/Olivia_Levison",
    "https://da.wikipedia.org/wiki/Lofotenfiskeriets_historie",
    "https://da.wikipedia.org/wiki/Broder_Rus",
    "https://da.wikipedia.org/wiki/S%C3%B8ren_Nielsen_May",
    "https://da.wikipedia.org/wiki/Nerthus",
    "https://da.wikipedia.org/wiki/Friederich_M%C3%BCnter",
    "https://da.wikipedia.org/wiki/Den_tavse_mand",
    "https://da.wikipedia.org/wiki/En_mand_kommer_hjem",
    "https://da.wikipedia.org/wiki/Orvar-Odd",
    "https://da.wikipedia.org/wiki/Apollo-programmet",
    
    # changed some of them for "mandfolk", "queer" and "tøs" to get correctt # of hits
    "https://da.wikipedia.org/wiki/Et_rigtigt_Mandfolk", # mandfolk hit
    "https://da.wikipedia.org/wiki/De_dumme_Mandfolk", # mandfolk hit
    "https://da.wikipedia.org/wiki/Nina_Bang", # mandfolk hit
    "https://da.wikipedia.org/wiki/Et_Pr%C3%A6riens_Mandfolk", # mandfolk hit
    "https://da.wikipedia.org/wiki/%C3%85h,_de_mandfolk!", # mandfolk hit
    "https://da.wikipedia.org/wiki/Olsenbandens_aller_siste_kupp" # mandfolk hit
    
    "https://da.wikipedia.org/wiki/Dan_Levy_(skuespiller)", # no hit
    "https://da.wikipedia.org/wiki/Heidi_Mortenson", # no hit
    "https://da.wikipedia.org/wiki/Joe_Lycett", # no hit
    "https://da.wikipedia.org/wiki/Aidan_Gillen", # queer hit
    "https://da.wikipedia.org/wiki/P%C3%A6dagogisk_filosofi", # queer hit
    
    "https://da.wikipedia.org/wiki/En_pokkers_T%C3%B8s",
    "https://da.wikipedia.org/wiki/Last_Friday_Night_(T.G.I.F.)",
    "https://da.wikipedia.org/wiki/George_J._Folsey",
    "https://da.wikipedia.org/wiki/To_T%C3%B8ser_Ta%27r_Aff%C3%A6re",
    "https://da.wikipedia.org/wiki/Anne_Marie_Andersdatter"
]

In [None]:
# if there's not enough data, find more webpages



### Scrape from wikipedia

1) Search for pages to add (manually selected)
2) Scrape these pages using requests and beautifulsoup4
3) Concatenate to one big text bank
4) Search for word forms in this text bank. Extract the needed number of texts in the correct length.
5) Train model on the new dataset and do bias analysis

Afterwards, try to do both types of mitigation on the oversampled dataset

OR 

Try to rerun the original model on non-oversampled dataset
ASK MANEX!

In [107]:
# scrape webpages
passages = []

for url in urls:
    content = scrape_wiki_text(url)
    for passage in content:
        passages.append(passage)

Successfully scraped the webpage with the title: "Hemming Hartmann-Petersen"
Successfully scraped the webpage with the title: "None"
Successfully scraped the webpage with the title: "Brdr. Gebis"
Successfully scraped the webpage with the title: "Mogens Wenzel Andreasen"
Successfully scraped the webpage with the title: "None"
Successfully scraped the webpage with the title: "We Wanna Be Free"
Successfully scraped the webpage with the title: "Christian Molbech"
Successfully scraped the webpage with the title: "Vore Fædres Sønner"
Successfully scraped the webpage with the title: "Ebbe Skammelsøn"
Successfully scraped the webpage with the title: "John Green (forfatter)"
Successfully scraped the webpage with the title: "LUCK.exe"
Successfully scraped the webpage with the title: "Du Gør Mig"
Successfully scraped the webpage with the title: "Eleonore Tscherning"
Successfully scraped the webpage with the title: "Fætter Højben"
Successfully scraped the webpage with the title: "Min fætter er pir

In [740]:
# pseudocode

# split passages into sentences
# preprocess passage bank

# for (lemma, length) in num_nontoxic_to_add
    # num_to_add = num_nontoxic_to_add[(lemma, length)]
    # map from lemmas to word forms using get_word_forms

    # call function that loops through passage bank and outputs n passages where this words occur (find passages)

In [43]:
# # test check occurrences 
# for passage in passages[:9]:
#     occurs = False
#     for word in passage.split():
#         if word == "bror":
#             occurs = True
#     if occurs == True:
#         print("bror")
#         print(passage)

In [44]:
# # test getting word forms 
# print("lemmatized:", list(identities[identities["identity_lemma"]=="trans"]["lemmatized"]))
# print("word forms:", list(identities[identities["identity_lemma"]=="trans"]["identity_term"]))

In [70]:
def occurs_in_list(target:str, text_list:List[str], return_idx:bool=False) -> bool:
    """Checks whether a word occurs in a list of texts/sentences."""
    for i, text in enumerate(text_list):
        if occurs_in_string(target, text):
            if return_idx:
                return True, i
            else:
                return True
    if return_idx:
        return False, None
    else:
        return False

print(occurs_in_list("bror", ["min søster er stolt", "det er hendes mor ikke"]))
print(occurs_in_list("bror", ["min søster er stolt", "det er hendes mor ikke", "jeg elsker min bror højt"], True))

False
(True, 2)


In [46]:
# from itertools import combinations

# def find_combination_idxs(lengths:List[int], lower_bound:int, upper_bound:int) -> List[tuple]:
#     """Find combinations that fall within that the lower and upper bound (range) and returns the indexes. Combinations can vary in size."""
#     result = []

#     for r in range(1, len(lengths)+1): # different size of combinations
#         for combo_idxs in combinations(range(len(lengths)), r): # indexes of different combinations of that size
#             combo_lengths = [lengths[i] for i in combo_idxs]
#             if lower_bound <= sum(combo_lengths) <= upper_bound:
#                 result.append(combo_idxs)

#     return result

In [71]:
def get_word_forms(lemma:str, identities:pd.DataFrame) -> List[str]:
    """Get all the word forms of a lemma, which appear in the identities dataframe."""
    return list(identities[identities["identity_lemma"]==lemma]["identity_term"])

get_word_forms("trans", identities)

['trans', 'transen', 'transerne', 'transerne']

In [48]:
# # test breaking in nested loops
# for y in [[1,-1],[2,2,3,2],[3,2,3],[2]]:
#     print("Y = ", y)
#     for x in y:
#         print("X =", x)
#         if x == 2:
#             print("                two found")
#             break
#         # print("!")

In [72]:
# test splitting into sentences

doc = nlp('Det her er en sætning. Det her er endnu en sætning, hihi.')
for sent in doc.sents:
    print(sent)

Det her er en sætning.
Det her er endnu en sætning, hihi.


In [108]:
# split passages into sentences and preprocess

stop_words = nltk.corpus.stopwords.words('danish')
passage_bank = []
for passage in tqdm(passages):
    sentences = []
    doc = nlp(passage)
    for sent in doc.sents:
        clean_sent = utils.preprocess(str(sent), stop_words)
        if len(clean_sent) > 0: # don't add empty strings
            sentences.append(clean_sent)
    passage_bank.append(sentences)

100%|██████████| 1515/1515 [17:17<00:00,  1.46it/s] 


In [109]:
print(len(passage_bank), "text passages")

1515 text passages


In [78]:
def find_passages(passage_bank:List[str], word_list:List[str]) -> List[str]:
    """Outputs all the passages where any of the words in the word list occur.

    Args:
        passage_bank (List[str]): list of text passages.
        word_list (List[str]): list of words to find in the text passages.

    Returns:
        List[str]: list of text passages where at least one of the target words appear once.
    """
    result = []
    
    for sentence_list in passage_bank:
        for word in word_list:
            if occurs_in_list(target=word, text_list=sentence_list):
                result.append(sentence_list)
                break # don't need to add it twice
    
    return result

find_passages(passage_bank[:9], ["bror"])

[['år matematiklærer seminariet nuuk år gymnasielærer stenhus kostskole erne skabe dag kender p radiodirektør leif lønsmanns ord fremsynede modige nok tage unge radiolyttere alvorligt kilde mangler',
  'således gennem år blandt radioens mest skattede studieværter kilde mangler',
  'lavede musikprogrammer interviewer programserien',
  'mellem brødre bror jørgen hartmannpetersen kendt pseudonymet habakuk',
  'sammen oboisten waldemar wolsing lavede række radioprogrammer døde komponister himmelske samtaler'],
 ['mark ung mand bærer stor byrde sorg had',
  'mistet ældre bror dansk soldat dræbt afghanistan',
  'teenage drenge mark svært ved kontrollere følelser tab drukner hav had ¿mørkhudede fjende¿ tog elskede så højt',
  'blændet had nægter åbne andre',
  'par måneder senere løber situationen løbsk afghansk dreng mushin starter klasse',
  'mark symbol hader så inderligt vise nåde ligesom bror ej heller vist',
  'marks indledende forsøg ryste mørke dreng mislykkes mushin interesse konflik

In [None]:
"https://da.wikipedia.org/wiki/Patrick_Spiegelberg",
"https://da.wikipedia.org/wiki/Oscar_for_bedste_fotografering",
"https://da.wikipedia.org/wiki/Tekken_(spilserie)",
"https://da.wikipedia.org/wiki/Oscaruddelingen_1937",
"https://da.wikipedia.org/wiki/Pretty_Little_Liars",
"https://da.wikipedia.org/wiki/Lotte_Merete_Andersen",



In [322]:
urls = ["https://da.wikipedia.org/wiki/T%C3%A6t_p%C3%A5_-_live",
"https://da.wikipedia.org/wiki/En_pokkers_T%C3%B8s",
"https://da.wikipedia.org/wiki/Kate_Walsh",
"https://da.wikipedia.org/wiki/Jack_Sparrow",
"https://da.wikipedia.org/wiki/Des_Knaben_Wunderhorn_(Mahler)",
"https://da.wikipedia.org/wiki/Steen_%26_Stoffer",
"https://da.wikipedia.org/wiki/Fiktive_personer_i_Lost",
"https://da.wikipedia.org/wiki/Nis_Petersen"
]

In [323]:
passages = []

for url in urls:
    content = scrape_wiki_text(url)
    for passage in content:
        passages.append(passage)

# split passages into sentences and preprocess

stop_words = nltk.corpus.stopwords.words('danish')
passage_bank = []
for passage in tqdm(passages):
    sentences = []
    try:
        doc = nlp(passage)
        for sent in doc.sents:
            clean_sent = utils.preprocess(str(sent), stop_words)
            if len(clean_sent) > 0: # don't add empty strings
                sentences.append(clean_sent)
        passage_bank.append(sentences)
    except:
        print("EXCEPT", passage)

Successfully scraped the webpage with the title: "Tæt på - live"
Successfully scraped the webpage with the title: "En pokkers Tøs"
Successfully scraped the webpage with the title: "Kate Walsh"
Successfully scraped the webpage with the title: "Jack Sparrow"
Successfully scraped the webpage with the title: "Des Knaben Wunderhorn (Mahler)"
Successfully scraped the webpage with the title: "Steen & Stoffer"
Successfully scraped the webpage with the title: "Fiktive personer i Lost"
Successfully scraped the webpage with the title: "Nis Petersen"


100%|██████████| 230/230 [01:39<00:00,  2.32it/s]


In [325]:
d = {}
new_examples = []

for (lemma, length) in {("tøs", "0-19"):7}: #tqdm(num_nontoxic_to_add): # for each lemma and length we need to deal with

    # range (length bucket)
    length_range = length.split("-")
    length_range = [int(l) for l in length_range]
    length_range[1] = 59

    # number of nontoxic examples to add
    num_to_add = num_nontoxic_to_add[(lemma, length)] # number new nontoxic to add

    # word forms
    word_list = get_word_forms(lemma, identities) # word forms

    # find passages that the words appear in
    passages = find_passages(passage_bank, word_list)
    
    # initialize variables
    num_added = 0
    
    print("potential:", len(passages))

    for passage in passages: # for each passage where the lemma appears
        
        if num_added < num_to_add: # only continue if we still need to add more sentences          
            sentence_lengths = [len(sent)+1 for sent in passage] # +1 = space between sentences
            
            # if the full passage is within range, add that
            if length_range[0] <= len(' '.join(passage)) <= length_range[1]:
                #print(type(passage), passage)
                new_examples.append(' '.join(passage))
                num_added += 1
                print(len(' '.join(passage)), ' '.join(passage))

            else:
                for sentence in passage:
                    if any(occurs_in_string(word, sentence) for word in word_list):
                        if length_range[0] <= len(sentence) <= length_range[1]:
                            new_examples.append(sentence)
                            num_added += 1
                        print(len(sentence), sentence)
    
    if num_added < num_to_add:
        print(lemma, length_range, num_to_add)
        print(num_added, num_to_add)
        print()

potential: 8
51 faktisk lige siden ung tøs så nirvana unplugged mtv
45 pokkers tøs amerikansk stumfilm edwin stevens
31 tror virkelig bare tøser elsker
54 jack sparrow kaptajn onde tøs senere kendt sorte perle
71 beckett gav ordre onde tøs brændes sænkes brændte p pirat jacks håndled
59 pigen forsøger indsmigre unge mand svar naragtige tøs gider
58 dansk avis oversat kast – knus slimede tøser kilde mangler
40 spøg kaldet the french chick franske tøs
14 saadan tøs kit


In [None]:
# 7 hits
"https://da.wikipedia.org/wiki/T%C3%A6t_p%C3%A5_-_live",
"https://da.wikipedia.org/wiki/En_pokkers_T%C3%B8s",
"https://da.wikipedia.org/wiki/Kate_Walsh",
"https://da.wikipedia.org/wiki/Jack_Sparrow",
"https://da.wikipedia.org/wiki/Des_Knaben_Wunderhorn_(Mahler)",
"https://da.wikipedia.org/wiki/Steen_%26_Stoffer",
"https://da.wikipedia.org/wiki/Fiktive_personer_i_Lost",
"https://da.wikipedia.org/wiki/Nis_Petersen"

# maybe
# "https://da.wikipedia.org/wiki/En_gang"
# "https://da.wikipedia.org/wiki/Dominans_(sexologi)"

In [None]:
d = {}
new_examples = []

for (lemma, length) in tqdm(num_nontoxic_to_add): # for each lemma and length we need to deal with

    # range (length bucket)
    length_range = length.split("-")
    length_range = [int(l) for l in length_range]

    # number of nontoxic examples to add
    num_to_add = num_nontoxic_to_add[(lemma, length)] # number new nontoxic to add

    # word forms
    word_list = get_word_forms(lemma, identities) # word forms

    # find passages that the words appear in
    passages = find_passages(passage_bank, word_list)
    
    # initialize variables
    num_added = 0

    for passage in passages: # for each passage where the lemma appears
        
        if num_added < num_to_add: # only continue if we still need to add more sentences          
            sentence_lengths = [len(sent)+1 for sent in passage] # +1 = space between sentences
            
            # if the full passage is within range, add that
            if length_range[0] <= len(' '.join(passage)) <= length_range[1]:
                #print(type(passage), passage)
                new_examples.append(' '.join(passage))
                num_added += 1

            else:
                for sentence in passage:
                    if any(occurs_in_string(word, sentence) for word in word_list):
                        if length_range[0] <= len(sentence) <= length_range[1]:
                            new_examples.append(sentence)
                            num_added += 1
    
    if num_added < num_to_add:
        print(lemma, length_range, num_to_add)
        print(num_added, num_to_add)
        print()

  0%|          | 0/18 [00:00<?, ?it/s]

100%|██████████| 18/18 [00:00<00:00, 44.40it/s]

mandfolk [20, 59] 7
1 7

queer [620, 3519] 2
0 2

tøs [0, 19] 8
0 8






In [116]:
new_examples

['molbech født opvokset molbechs hus fars tjenestebolig ved akademiet sorø christian søn norske hovmester matematiklærer sorø akademi johan christian molbech tyske louise philippine friderica tübel datter musiker komponist christian gottlieb tübel blankenburg barndomshjemmet trods fine hus farens stilling præget fattigdom psykisk lidelse christians mor led psykisk sygdom arrige anfald tog opium dæmpe anfaldene udgifterne morens opium ruinerede familiens økonomi farens beskedne løn rakte knap nok børn faren tog tilflugt familieproblemerne violinspil havedyrkning morens psykiske sygdom arvet christians yngre bror teologen carl fredrik m led blandt andet depression andre efterkommere christian led svag psykisk helbred voksenlivet berygtet stridighed hidsighed nævnt flere biografier bakkehuset endda øgenavnet ulven pga dårlige temperament',
 'dreng dansk dokumentarfilm instrueret julie bezzera madsen',
 'oliver dreng år fanget egen krop',
 'indeni føler dreng udenpå pige',
 'niels pind dre

In [117]:
len(new_examples)

98

In [None]:
# afterwards, run code with balances again and see if its closer to prior distribution!

In [118]:
# if any(occurs_in_string(word, "hej det her er mig") for word in ["være", "er"]):
#     print("OK")