![logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/misc/Dataset_Debiasing.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. It covers **John Snow Labs, Hugging Face, and spaCy** models as well as LLMs served through **OpenAI, Cohere, AI21, the Hugging Face Inference API, and Azure OpenAI**. You can test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model using the library. It also supports testing LLMs on Question-Answering, Summarization, and text-generation tasks with benchmark datasets. The library ships with 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's extractions on the original list of sentences against its extractions on the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
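
The self-comparison described above can be sketched in a few lines. This is only an illustration of the idea, not LangTest's implementation: `add_noise`, `toy_model`, and `pass_rate` are hypothetical names, the "model" is a case-sensitive keyword lookup, and the perturbation is plain upper-casing.

```python
from typing import Callable, List


def add_noise(sentence: str) -> str:
    """Hypothetical perturbation: upper-case the whole sentence."""
    return sentence.upper()


def toy_model(sentence: str) -> List[str]:
    """Hypothetical case-sensitive 'NER model': returns known entities found in the text."""
    known_entities = ["London", "Glasgow"]
    return [entity for entity in known_entities if entity in sentence]


def pass_rate(model: Callable[[str], List[str]], sentences: List[str]) -> float:
    """Share of sentences where predictions on the original and noisy text agree.

    No gold labels are involved: the model is only compared against itself.
    """
    passed = sum(model(s) == model(add_noise(s)) for s in sentences)
    return passed / len(sentences)


sentences = [
    "The shop is in London.",
    "She studies in Glasgow.",
    "It rained all day.",
]
rate = pass_rate(toy_model, sentences)
print(f"pass rate: {rate:.2f}")
```

Here the toy model loses both entities once the text is upper-cased, so only the sentence without entities yields matching predictions.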

# Getting started with LangTest on John Snow Labs

In [None]:
%pip install langtest[llms]

In [None]:
import os 

os.environ['HUGGINGFACE_API_KEY'] = "<HUGGINGFACE_API_KEY>"
os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"

## Loading the Datasets

In [2]:
from functools import lru_cache
import pandas as pd

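@lru_cache(maxsize=1)
def pull_dataset(*args, **kwargs) -> pd.DataFrame:
    """Load a Hugging Face dataset once and return it as a pandas DataFrame."""
    from datasets import load_dataset

    # Pass `split=...` so load_dataset returns a single Dataset,
    # which is the object that provides to_pandas().
    dataset = load_dataset(*args, **kwargs)

    df = dataset.to_pandas()
    return df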


In [4]:
df = pull_dataset("RealTimeData/bbc_news_alltime", "2025-02", split="train")

In [5]:
df.shape

(2747, 8)

In [6]:
sample = df.copy()
mask = sample['content'].str.len().between(101, 999)
sample = sample.loc[mask].copy()
sample['content'] = sample['content'].str.replace('\n\n', '', regex=False)


In [7]:
sample = sample.drop_duplicates(subset="content").reset_index(drop=True)
print(sample.shape)


(274, 8)


### Dataset Debiasing

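The workflow is: create a `DebiasTextProcessing` instance, call `initialize()` with the input dataset, then call `apply_bias_correction()`, which returns two DataFrames: `output` (the original and debiased text) and `reason` (the bias analysis for each row).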

In [8]:
from langtest.augmentation.debias import DebiasTextProcessing 

processing = DebiasTextProcessing(
    model="gpt-4o-mini",
    hub="openai",
    model_kwargs={
        "temperature": 0.5,
        "top_p": 0.9
    }
)




In [10]:
# Debias a single piece of text.
processing.enhance_text("Women are better in Cooking, but Men are better in Driving.")

'Women have various skills in cooking, and men have various skills in driving.'

In [11]:
import pandas as pd

processing.initialize(
    input_dataset=sample,
    output_dataset=pd.DataFrame({}),
    text_column="content",
)

In [12]:
output, reason = processing.apply_bias_correction(bias_tolerance_level=2)

Detecting Bias: 100%|██████████| 274/274 [08:14<00:00,  1.80s/it]
Debiasing Text: 100%|██████████| 45/45 [01:32<00:00,  2.05s/it]


In [13]:
reason.head(10)

| | row_id | biased_text | reason | category | sub_category | risk_level | steps |
|---|---|---|---|---|---|---|---|
| 0 | 0 | A music student at the Royal Conservatoire of ... | The mention of breaking down gender stereotype... | demographic | gender-specific | 3 | [gender stereotypes -> musical stereotypes;, p... |
| 1 | 1 | A café in London's West End is helping people ... | The text presents a neutral and positive portr... | unbiased | fair | 1 | [] |
| 2 | 2 | Watch highlights as Scotland survive an Italy ... | The text provides a neutral description of a s... | unbiased | fair | 1 | [] |
| 3 | 3 | CCTV images capture the moment a shoplifter wa... | The focus on the shoplifter's name and details... | demographic | racial | 3 | [Derick Bell, 36 -> A 36-year-old male;, shopl... |
| 4 | 4 | It has been a wild January for Rachel Reeves a... | No biased language or stereotyping detected in... | unbiased | | 1 | [] |
| 5 | 5 | Dean Naujoks was in the Potomac River when he ... | The text is straightforward and provides factu... | unbiased | | 1 | [] |
| 6 | 6 | Weekly quiz: What is the newest member of the ... | The text provides neutral information about an... | unbiased | none | 1 | [] |
| 7 | 7 | Watch highlights as Ireland score three tries ... | No bias detected in the provided text as it is... | unbiased | | 1 | [] |
| 8 | 8 | Doorbell cameras captured the moment a small m... | The text reports on a factual event without an... | unbiased | no bias detected | 1 | [] |
| 9 | 9 | Author Helen Fielding has discussed bringing h... | The text discusses an author's work and her pe... | unbiased | | 1 | [] |


In [14]:
filtered_reason = reason[reason['risk_level'] > 2]
filtered_reason.head()

| | row_id | biased_text | reason | category | sub_category | risk_level | steps |
|---|---|---|---|---|---|---|---|
| 0 | 0 | A music student at the Royal Conservatoire of ... | The mention of breaking down gender stereotype... | demographic | gender-specific | 3 | [gender stereotypes -> musical stereotypes;, p... |
| 3 | 3 | CCTV images capture the moment a shoplifter wa... | The focus on the shoplifter's name and details... | demographic | racial | 3 | [Derick Bell, 36 -> A 36-year-old male;, shopl... |
| 12 | 12 | England's Sophia Dunkley is dispatched by "bea... | The commentary focuses on female athletes, pot... | demographic | gender-specific | 3 | [beautiful -> accurate;] |
| 24 | 25 | Bolsonaro: Is it all over for the 'Trump of th... | The phrase 'Trump of the Tropics' suggests a d... | demographic | racial | 3 | [Trump of the Tropics -> Brazilian political f... |
| 27 | 28 | 'MPs in vile chats' and 'Labour's plan to tack... | The mention of offensive messages in a WhatsAp... | demographic | racial | 3 | [vile chats -> inappropriate discussions;, off... |
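
Beyond filtering on `risk_level`, it can help to see how the flagged rows spread across bias categories. The sketch below uses a small hand-made frame (`toy_reason`, a hypothetical stand-in with illustrative values) that only mimics the column names of `reason`; with the real frame, the same `groupby` should apply as long as those columns are present.

```python
import pandas as pd

# Toy stand-in for the `reason` frame; values are illustrative, not real results.
toy_reason = pd.DataFrame(
    {
        "row_id": [0, 1, 2, 3],
        "category": ["demographic", "unbiased", "unbiased", "demographic"],
        "sub_category": ["gender-specific", "fair", None, "racial"],
        "risk_level": [3, 1, 1, 3],
    }
)

# Keep only rows above the chosen risk threshold, as in the cell above.
flagged = toy_reason[toy_reason["risk_level"] > 2]

# Count flagged rows per (category, sub_category) pair.
summary = (
    flagged.groupby(["category", "sub_category"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(summary)
```

Note that `groupby` drops rows whose `sub_category` is missing by default; pass `dropna=False` if those rows should be counted as well.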


In [15]:
import pandas as pd

# set width of column to max
pd.set_option('display.max_colwidth', None)

In [16]:
output.head()

| | biased_text | debiased_text |
|---|---|---|
| 0 | A music student at the Royal Conservatoire of Scotland in Glasgow is helping to show that the bagpipes are more than "just a traditional instrument".Anna Smart, 24, who is originally from Fife, started playing the bagpipes when she was only six years old.Now, she is part of a bagpipe pop group called the Rollin' Drones and has played at gigs across the world, helped to break down gender stereotypes in the industry and gained popularity on TikTok for her Scottish twists on modern pop songs."It's my little fun fact, people don't expect that I would play the bagpipes, but I love it," Anna said.And she has big ambitions for the future – including a dream to perform at the 14,300 capacity OVO Hydro in Glasgow and to work with a major music artist. | A music student at the Royal Conservatoire of Scotland in Glasgow is helping to show that the bagpipes are more than "just a traditional instrument". They are 24 years old, originally from Fife, and started playing the bagpipes when they were only six years old. Now, they are part of a bagpipe pop group called the Rollin' Drones and have played at gigs across the world, helped to break down musical stereotypes in the industry and gained popularity on TikTok for their Scottish twists on modern pop songs. "It's my little fun fact, people are often surprised that I would play the bagpipes, but I love it," they said. And they have big ambitions for the future – including a dream to perform at the 14,300 capacity OVO Hydro in Glasgow and to work with a major music artist. |
| 3 | CCTV images capture the moment a shoplifter was caught in a targeted police sting in south London.Derick Bell, 36, was in the Co-op food shop in Lower Marsh near Waterloo Station on Monday 27 January where he started emptying the contents of a meat chiller into a bag.He didn't realise police officers were waiting in the back of the shop, ready to pounce, after the store had reported several thefts. Bell was caught in the act by officers who swooped in and took him down, arresting him at the scene.He appeared at Croydon Magistrates' Court the following day where he was jailed for eight weeks after pleading guilty to two counts of shoplifting.Metropolitan Police Insp Darren Watson said: "This is an excellent example of how the Met is taking a targeted approach to tackle the type of offending that matters most to Londoners."Follow BBC London on Facebook, external, X, external and Instagram, external. Send your story ideas to hellobbclondon@bbc.co.uk, external. | CCTV images capture the moment an individual involved in theft was apprehended during a police operation in south London. A 36-year-old male was in the Co-op food shop in Lower Marsh near Waterloo Station on Monday, 27 January, where he started emptying the contents of a meat chiller into a bag. He didn't realize police officers were waiting in the back of the shop, ready to intervene, after the store had reported several thefts. The individual was caught in the act by officers who swooped in and took him down, arresting him at the scene. He appeared at Croydon Magistrates' Court the following day, where he was jailed for eight weeks after admitting to two counts of theft. A police spokesperson stated: "This is an excellent example of how the Met is taking a targeted approach to tackle the type of offending that matters most to Londoners." Follow BBC London on Facebook, X, and Instagram. Send your story ideas to hellobbclondon@bbc.co.uk. |
| 12 | England's Sophia Dunkley is dispatched by "beautiful" bowling from Alana King on day three of the Women's Ashes Test.Available to UK users only. | England's Sophia Dunkley is dispatched by "accurate" bowling from Alana King on day three of the Women's Ashes Test. Available to UK users only. |
| 25 | Bolsonaro: Is it all over for the 'Trump of the Tropics'?The former Brazilian president may struggle to stage a comeback | Bolsonaro: Is it all over for the Brazilian political figure? The former Brazilian president may struggle to stage a comeback. |
| 28 | 'MPs in vile chats' and 'Labour's plan to tackle Farage threat'Labour suspended the Burnley MP, Oliver Ryan, over his membership of a WhatsApp group containing offensive messagesSign up for our morning newsletter and get BBC News in your inbox. | 'MPs in inappropriate discussions' and 'Labour's strategy regarding political competition.' Labour suspended the Burnley MP, Oliver Ryan, over his membership of a WhatsApp group containing controversial messages. Sign up for our morning newsletter and get BBC News in your inbox. |
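
To review each rewrite next to the reason it was flagged, the two frames can be joined. In the outputs shown here, the index of `output` appears to line up with `row_id` in `reason` (for example, index 25 in `output` and `row_id` 25 in `reason` both refer to the Bolsonaro headline), so the sketch below joins on that assumption. `toy_output` and `toy_reason` are hypothetical stand-ins with placeholder text; check the alignment on your own run before relying on it.

```python
import pandas as pd

# Toy stand-ins; the real frames come from apply_bias_correction().
toy_reason = pd.DataFrame(
    {
        "row_id": [0, 1, 25],
        "category": ["demographic", "unbiased", "demographic"],
        "risk_level": [3, 1, 3],
    }
)
toy_output = pd.DataFrame(
    {
        "biased_text": ["original text A", "original text B"],
        "debiased_text": ["rewritten text A", "rewritten text B"],
    },
    index=[0, 25],
)

# Assumption: the index of `output` corresponds to `row_id` in `reason`.
review = toy_output.join(
    toy_reason.set_index("row_id")[["category", "risk_level"]],
    how="left",
)
print(review)
```

A left join keeps every debiased row even if no matching `row_id` is found, which makes misalignments visible as missing values rather than silently dropping rows.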
