<h1>db-index: a new way to measure dataset bias</h1>

In [1]:
from datasets import load_dataset
import spacy as sp
import random
import en_core_web_lg
import pandas as pd

In [2]:
nlp = en_core_web_lg.load()

In [3]:
dataset = load_dataset("ccdv/govreport-summarization")
dataset

DatasetDict({
    train: Dataset({
        features: ['report', 'summary'],
        num_rows: 17517
    })
    validation: Dataset({
        features: ['report', 'summary'],
        num_rows: 973
    })
    test: Dataset({
        features: ['report', 'summary'],
        num_rows: 973
    })
})

In [4]:
dataset.set_format(type='pandas')
targetdset = dataset['train'][1:]
# drop() returns a new DataFrame, so assign the result back
targetdset = targetdset.drop(['summary'], axis=1)
targetdset

Unnamed: 0,report
0,Most income derived from private sector busine...
1,There are some similarities in how Medicare pa...
2,IRS’s mission is to provide America’s taxpayer...
3,"In 1991, we reported that, historically, INS l..."
4,While TCE and perchlorate are both DOD-classif...
...,...
17511,Some Members of Congress have expressed intere...
17512,"Since November 1986, the Commemorative Works A..."
17513,"On May 27, 2015, the Army Corps of Engineers (..."
17514,Section 1512 applies to the obstruction of fed...


The target dataset now contains only the report text. Next, load a sample dataset of offensive language to compare it against.

In [5]:
offensivedataset = load_dataset("henryscheible/implicit_bias")

In [6]:
offensivedataset.set_format(type='pandas')
compdataset = offensivedataset['train'][1:]
# drop() returns a new DataFrame, so assign the result back
compdataset = compdataset.drop(['category', 'label'], axis=1)
compdataset

Unnamed: 0,sentence
0,They are Asians. I don't want to invite too ma...
1,They are Queer. Why do all gay men have to be ...
2,They are OldPeople. All Middle Easterners are ...
3,They are Females. She's always so put together...
4,They are Christans. I don't understand how you...
...,...
1003,They are MiddleEastern. I saw a group of Middl...
1004,"They are Asians. Oh, my grandpa doesn't even k..."
1005,They are Queer. I can't believe she's still tr...
1006,"They are Whites. Oh, you're white, so you prob..."


In [10]:
def dbindex(target, comparison):
    # Pick a random entry from the comparison dataset.
    index = random.randrange(len(comparison))

    # Look up that entry by position.
    comparisonEntry = comparison['sentence'].iloc[index]

    # Vectorize the entry.
    vectorizedComparisonEntry = nlp(comparisonEntry)

    # Initialize the total cosine similarity.
    tcs = 0

    # Go through each target entry.
    for entry in target.report:

        # Add the similarity between this target entry and the comparison entry.
        vecEntry = nlp(entry)
        tcs += vecEntry.similarity(vectorizedComparisonEntry)

    # Raise the total to the inverse of the number of target entries.
    dbi = tcs**(1/len(target))
    return dbi
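
To make the arithmetic in `dbindex` easier to inspect without loading a language model, here is a minimal sketch of the same calculation on precomputed vectors. It assumes that the similarity score is a plain cosine similarity between document vectors (spaCy's default behaviour for `Doc.similarity`). The helper names `cosine_similarity` and `dbindex_from_vectors` and the toy 2-D vectors are hypothetical and only serve as illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dbindex_from_vectors(target_vectors, comparison_vector):
    """Hypothetical helper: same arithmetic as dbindex, on precomputed vectors."""
    # Sum the similarity of every target vector to the comparison vector.
    total = sum(cosine_similarity(vec, comparison_vector) for vec in target_vectors)
    # Raise the total to the inverse of the number of target entries.
    return total ** (1 / len(target_vectors))


# Toy 2-D vectors, made up for illustration (not real embeddings).
target_vectors = [[1.0, 0.0], [1.0, 1.0]]
comparison_vector = [1.0, 0.0]

score = dbindex_from_vectors(target_vectors, comparison_vector)
print(score)
```

In this toy case the two similarities are 1 and 1/√2, so the score is the square root of their sum. Because the total is raised to the power 1/n rather than divided by n, the result is not bounded by 1.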

In [11]:
target = pd.DataFrame(["grey", "grey"])
target.columns = ['report']
comparison = pd.DataFrame(['white', "red", "orange"])
comparison.columns = ['sentence']

In [12]:
dbindex(targetdset, compdataset)

ValueError: [E088] Text of length 1231622 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
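
The error shows that at least one report (1,231,622 characters) is longer than spaCy's limit of 1,000,000 characters. The message itself suggests raising `nlp.max_length` when the parser and NER are not needed. Another possible workaround, sketched below, is to split long reports into pieces that fit under the limit before passing them to `nlp`. The helper `split_text` is hypothetical, and how the per-piece similarities should be combined into one score is left open here.

```python
def split_text(text, max_chars=1_000_000):
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# Stand-in for a report of the length named in the error message.
long_text = "a" * 1_231_622

pieces = split_text(long_text)
print([len(piece) for piece in pieces])  # [1000000, 231622]
```

This naive split cuts at a fixed character offset, so it can break a word or sentence in two. Splitting on paragraph or sentence boundaries would avoid that at the cost of a little more code.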