# Filtering corpora using Quality

<a target="_blank" href="https://colab.research.google.com/github/HLasse/TextDescriptives/blob/main/docs/tutorials/filter_corpus_using_quality.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If you want to analyse tweets, train a model on text scraped from the web, or work with similar data, it is often important to filter out low-quality texts first.

TextDescriptives implements a series of heuristic filters for removing low-quality text. This tutorial shows how to use them to filter your text corpora.

## Setup

For this tutorial we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword), a large collection of Danish texts from a variety of domains, available on [Huggingface Datasets](https://huggingface.co/datasets/DDSC/partial-danish-gigaword-no-twitter). To download it we need the `datasets` package, which can be installed by running:

```python
!pip install datasets
```

We can now download the dataset using the following command:

In [1]:
from datasets import load_dataset

# note this can take quite a while
dataset = load_dataset("DDSC/partial-danish-gigaword-no-twitter")

# The whole dataset is in the train split, so we simply select it:
dataset = dataset["train"]

Using custom data configuration DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48
Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)



In [2]:
# We can take a look at the first ten examples:
ten_samples = dataset.select(range(10))
ten_samples.to_pandas()

| | text | source | doc_id | LICENSE | uri | date_built |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | JØRGINE JØRGINE KØBENHAVN HAGE & CLAUSENS FORL... | jvj | jvj_Jørgine | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:11 2020 CEST +0200 |
| 1 | MYTER MYTER NY SAMLING GYLDENDALSKE BOGHANDEL... | jvj | jvj_Myter-ny-samling | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:01 2020 CEST +0200 |
| 2 | DEN NY VERDEN DEN NY VERDEN TIL INTERNATIONAL ... | jvj | jvj_Den-ny-Verden | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:45 2020 CEST +0200 |
| 3 | CIMBRERNES TOG TIL EMMERIK JENSEN F . 15 . MAJ... | jvj | jvj_Cimbrernes-Tog | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:05:56 2020 CEST +0200 |
| 4 | OM SPROGET OG UNDERVISNINGEN OM SPROGET OG UND... | jvj | jvj_Om-Sproget-og-Undervisningen | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:05:49 2020 CEST +0200 |
| 5 | GÆST KOMMER TIL VERDEN HAN var født paa Sjælan... | jvj | jvj_Gæst-kommer-til-Verden | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:21 2020 CEST +0200 |
| 6 | MYTER OG JAGTER MYTER OG JAGTER GYLDENDALSKE B... | jvj | jvj_Myter-og-Jagter | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:14 2020 CEST +0200 |
| 7 | DET TABTE LAND DET TABTE LAND, MENNESKET FØR I... | jvj | jvj_Det-tabte-Land | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:50 2020 CEST +0200 |
| 8 | SANGERINDEN SANGERINDEN (MADAME D'ORA) DRAMA I... | jvj | jvj_Sangerinden | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:52 2020 CEST +0200 |
| 9 | DYRENES FORVANDLING DYRENES FORVANDLING TIL UD... | jvj | jvj_Dyrenes-Forvandling | Attribution-ShareAlike 4.0 International | | Fri Jun 26 13:06:48 2020 CEST +0200 |


As previously mentioned, the Danish Gigaword consists of multiple domains. For this tutorial, we will look at three of them: `retsinformationdk`, which consists of legal documents, `wiki`, which contains Wikipedia articles, and `spont`, which contains texts transcribed from spontaneous speech.

In [3]:
# we can select these three domains based on the "source" column
legal = dataset.filter(lambda x: x["source"] == "retsinformationdk")
wiki = dataset.filter(lambda x: x["source"] == "wiki")
speech = dataset.filter(lambda x: x["source"] == "spont")

Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-6e6efda35614635a.arrow
Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-3ce9447c21439e3f.arrow
Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-6528b379c635e45c.arrow


We can now examine these datasets a bit more:

In [4]:
print(f"Legal contains {len(legal)} examples")
print(f"Wiki contains {len(wiki)} examples")
print(f"Speech contains {len(speech)} examples")

Legal contains 64043 examples
Wiki contains 425938 examples
Speech contains 411 examples


We can see, for example, that the speech dataset contains notably fewer samples than the other two. So let us downsample the others to 1000 samples each before we start the analysis.

In [5]:
legal = legal.select(range(1000))
wiki = wiki.select(range(1000))

print(f"Legal now contains {len(legal)} examples")
print(f"Wiki now contains {len(wiki)} examples")

Legal now contains 1000 examples
Wiki now contains 1000 examples
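
Note that `select(range(1000))` keeps the first 1000 rows rather than a random sample. If the rows happen to be ordered in some way, a random subset may be more representative. Below is a minimal sketch using only the standard library; the seed and the variable names are illustrative assumptions, and the resulting indices could then be passed to `legal.select(...)`:

```python
import random

n_legal = 64043  # number of legal documents reported above
n_keep = 1000    # target sample size

# Fixed seed (an arbitrary choice) so the subset is reproducible.
rng = random.Random(42)

# Draw n_keep distinct row indices and sort them to keep the original order.
indices = sorted(rng.sample(range(n_legal), n_keep))

# With the dataset loaded, the subset would be: legal = legal.select(indices)
print(len(indices), indices[:5])
```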


## Quality Filtering
Now that we have prepared our datasets, we can start on the quality filtering. Using TextDescriptives this is simple. We need to do three things:

1) Create a pipeline
2) Add the quality component from TextDescriptives to it
3) Apply the pipeline to the dataset


In [20]:
import spacy
import textdescriptives as td

# 1. Create a blank spaCy model with a sentencizer
nlp = spacy.blank("da")
nlp.add_pipe("sentencizer")
nlp.max_length = 2000000  # some of the documents are quite long, so we increase the max length
# however, it might be worth filtering out very long documents beforehand

# 2. Add the TextDescriptives quality component
quality_pipe = nlp.add_pipe("textdescriptives/quality")

# 3. Apply the pipeline to the legal documents
legal_docs = nlp.pipe(legal["text"], batch_size=100, n_process=4)
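
spaCy's `max_length` is measured in characters, so instead of raising it, overly long texts can be dropped before they reach the pipeline. Here is a minimal sketch, assuming a hypothetical `drop_long_texts` helper and toy texts in place of `legal["text"]`:

```python
def drop_long_texts(texts, max_chars):
    """Keep only texts with at most max_chars characters."""
    return [text for text in texts if len(text) <= max_chars]


# Toy stand-ins for legal["text"]; the limit is deliberately tiny for illustration.
texts = ["kort tekst", "en meget længere tekst " * 10, ""]
kept = drop_long_texts(texts, max_chars=50)
print(f"Kept {len(kept)} of {len(texts)} texts")
```

In the tutorial setting, the helper would be called with `max_chars=nlp.max_length` on the texts before they are passed to `nlp.pipe`.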

If we check, we can see that `legal_docs` is a generator. This can be quite an efficient format, but for now we just want to process all the texts, so we simply convert it to a list:

In [7]:
legal_docs

<generator object Language.pipe at 0x415ab9c10>

In [8]:
legal_docs = list(legal_docs)
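
Converting to a list also matters because a generator can only be consumed once, whereas we index into `legal_docs` and iterate over it again further down. A small standard-library illustration, with a toy generator (`fake_pipe`, a made-up name) standing in for `nlp.pipe`:

```python
def fake_pipe(texts):
    """Toy stand-in for nlp.pipe: lazily yields one processed item at a time."""
    for text in texts:
        yield text.upper()


docs = fake_pipe(["en", "to", "tre"])
first_pass = list(docs)   # consumes the generator
second_pass = list(docs)  # the generator is now exhausted

print(first_pass)   # ['EN', 'TO', 'TRE']
print(second_pass)  # []
```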

We can now inspect the output here:

In [17]:
legal_doc = legal_docs[0]

print(legal_doc[:100]) # print the first 100 tokens
print("----")
print("Passed the quality filter:")
legal_doc._.passed_quality_check

Den fulde tekst Pressenævnets kendelse i sag nr. 15-70-00822
Resumé
Foreningen for Skånsomt Kystfiskeri har ikke retlig interesse
DR bragte et radioindslag om Natur- og Erhvervsstyrelsens fiskeriinspektorats fangst af ulovlige ålefælder. Foreningen for Skånsomt Kystfiskeri klagede blandt andet med den begrundelse, at betegnelsen ” ålefælder ” er forkert, idet ålene selv kan svømme ind og ud. Pressenævnet afviser at behandle klagen, da foreningen ikke er omtalt i udsendelsen og derfor ikke har retlig interesse.
Pressenævnets formand udtaler:
Det er en betingelse for at klage til Pressenævnet, at
----
Passed the quality filter:


False

Here we see that the text did not pass the quality filter. We can examine why using the following code:

In [18]:
legal_doc._.quality

{'n_stop_words': 192,
 'alpha_ratio': 0.804,
 'mean_word_length': 4.546,
 'doc_length': 500,
 'proportion_ellipsis': 0.0,
 'proportion_bullet_points': 0.0,
 'duplicate_line_chr_fraction': 0.25737766156144937,
 'duplicate_paragraph_chr_fraction': 0.0,
 'duplicate_5-gram_chr_fraction': 0.5401568920433321,
 'duplicate_6-gram_chr_fraction': 0.519237952932387,
 'duplicate_7-gram_chr_fraction': 0.519237952932387,
 'duplicate_8-gram_chr_fraction': 0.519237952932387,
 'duplicate_9-gram_chr_fraction': 0.519237952932387,
 'duplicate_10-gram_chr_fraction': 0.519237952932387,
 'top_2-gram_chr_fraction': 0.017930519237952934,
 'top_3-gram_chr_fraction': 0.042958535674262235,
 'top_4-gram_chr_fraction': 0.0653716847217034,
 'symbol_#_to_word_ratio': 0.0,
 'contains_lorem ipsum': False}

Here we see that the fraction of characters that are part of a duplicate 10-gram is above 50%. This is likely the reason why the sample was filtered out. This is not uncommon for legal documents, which contain a lot of standard phrases. However, you might wish to change the threshold for this filter. You can see an example of how to do this in the quality section of the TextDescriptives documentation.
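
To build some intuition for what a duplicate n-gram character fraction measures, here is a simplified sketch. It is not the TextDescriptives implementation: it assumes whitespace tokenisation and counts only token characters, and the function name is reused purely for illustration.

```python
from collections import Counter


def duplicate_ngram_chr_fraction(text, n):
    """Fraction of token characters covered by n-grams that occur more than once."""
    tokens = text.split()
    total_chars = sum(len(tok) for tok in tokens)
    if total_chars == 0:
        return 0.0

    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)

    # Mark every token position that belongs to a repeated n-gram,
    # so characters in overlapping n-grams are not counted twice.
    covered = set()
    for start, ngram in enumerate(ngrams):
        if counts[ngram] > 1:
            covered.update(range(start, start + n))

    return sum(len(tokens[i]) for i in covered) / total_chars


boilerplate = "det er en betingelse for at klage " * 3
varied = "foreningen klagede over et radioindslag om fiskeri i danske farvande"
print(duplicate_ngram_chr_fraction(boilerplate, n=5))
print(duplicate_ngram_chr_fraction(varied, n=5))
```

A text made of one repeated phrase scores high, while a text without repeated 5-grams scores zero, which is why formulaic legal texts tend to be caught by this filter.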

You can also inspect the existing thresholds:

In [25]:
quality_pipe.quality_thresholds

QualityThresholds(n_stop_words=(2, None), alpha_ratio=(0.8, None), mean_word_length=(3, 10), doc_length=(10, 100000), symbol_to_word_ratio={'#': (None, 0.1)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), contains={'lorem ipsum': False}, duplicate_line_chr_fraction=(None, 0.2), duplicate_paragraph_chr_fraction=(None, 0.2), duplicate_ngram_chr_fraction={'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)}, top_ngram_chr_fraction={'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)})

Here we see that the `duplicate_ngram_chr_fraction` threshold for 10-grams is `(None, 0.1)`, that is, no lower bound and an upper bound of 0.1. This means that a text in which more than 10% of the characters are part of a duplicate 10-gram will be filtered out.
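
The thresholds can be read as `(lower, upper)` intervals where `None` means no bound on that side. The sketch below checks a few of the values printed above against their thresholds. The `within` helper is hypothetical, the dictionaries are flattened and rounded by hand, and treating the bounds as inclusive is an assumption rather than something confirmed by the library:

```python
def within(value, bounds):
    """Check a value against (lower, upper) bounds, where None means no bound."""
    lower, upper = bounds
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


# Values and thresholds copied from the outputs above (rounded).
observed = {
    "alpha_ratio": 0.804,
    "duplicate_line_chr_fraction": 0.257,
    "duplicate_5-gram_chr_fraction": 0.540,
    "duplicate_10-gram_chr_fraction": 0.519,
    "top_4-gram_chr_fraction": 0.065,
}
thresholds = {
    "alpha_ratio": (0.8, None),
    "duplicate_line_chr_fraction": (None, 0.2),
    "duplicate_5-gram_chr_fraction": (None, 0.15),
    "duplicate_10-gram_chr_fraction": (None, 0.1),
    "top_4-gram_chr_fraction": (None, 0.16),
}

failed = [name for name, value in observed.items() if not within(value, thresholds[name])]
print(failed)
```

Under these assumptions the document exceeds the duplicate-line threshold as well as the duplicate n-gram thresholds, so relaxing the 10-gram threshold alone would not be enough for it to pass.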

### Filtering out the text
Assuming we don't want to change the thresholds, we can now use the filter to keep only the texts that pass the quality check:

In [26]:
# 4. Filter out the documents that do not pass the quality check
legal_docs_filtered = [doc for doc in legal_docs if doc._.passed_quality_check]


In [29]:
print(f"We had a total of {len(legal['text'])} which we filtered down to {len(legal_docs_filtered)}.")

We had a total of 1000 which we filtered down to 68.
