# Edit the corpus post-extraction and post-deduplication
There are manifold reasons for this editing. Once the text has been extracted, new problems with some files can become apparent. The text might show that a file is not, in fact, an article, or that it is written in a language you do not want in your corpus.

This notebook gives you tools to find and delete such files.

The final group of cells also adds some preprocessing with spaCy.

>WARNING: DO NOT RUN 'ALL CELLS' IN THIS NOTEBOOK. ONLY RUN THE CELLS YOU NEED AND HAVE TESTED, OR YOU MIGHT DELETE FILES YOU DID NOT WANT TO DELETE.

In [None]:
from functions import postedit_corpus as pc
from functions import utils

%load_ext autoreload
%autoreload 2

In [None]:
GOAL_DIR = '....'

You might want to copy the files to a new directory before running this notebook, so you can always go back to the original files.

In [None]:
from functions.utils import copy_jsoncorpus

copy_jsoncorpus(GOAL_DIR, f'{GOAL_DIR}-copy')

In [None]:
raise ValueError('Are you ready to proceed?')

## Suspicious Text Lengths
Some texts in your corpus might be very short. This can indicate that the text is not an article or, if it is, that it was cut off or is otherwise unusable.
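
If you want to check the numbers independently of the `pc` helpers, the idea can be sketched in a few lines. This is only a rough sketch under assumptions: it supposes that the corpus is a flat directory of `.json` files, that each file holds one record with the text under `text_key`, and that length is measured in characters. The names `text_lengths` and `files_below` are hypothetical and are not part of `postedit_corpus`.

```python
import json
from pathlib import Path


def text_lengths(corpus_dir, text_key='text_deduped'):
    """Map each JSON file name to the character length of its text field."""
    lengths = {}
    for path in sorted(Path(corpus_dir).glob('*.json')):
        with path.open(encoding='utf-8') as fh:
            record = json.load(fh)
        # Assumption: a missing or empty text field counts as length 0.
        lengths[path.name] = len(record.get(text_key) or '')
    return lengths


def files_below(lengths, threshold):
    """Return the file names whose text is shorter than `threshold` characters."""
    return sorted(name for name, n in lengths.items() if n < threshold)
```

Calling `files_below(text_lengths(GOAL_DIR), 1200)` would then list candidates for inspection without deleting anything.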

In [None]:
lens = pc.get_textlens(GOAL_DIR, 'text_deduped')
pc.plot_textlens(GOAL_DIR)
pc.print_textlen_stats(lens)

In [None]:
pc.print_textlen_threshold(GOAL_DIR, 1200, below_threshold=True, text_key='text_deduped')

In [None]:
pc.remove_textlen_threshold(GOAL_DIR, 1200, below_threshold=True, force=True, text_key='text_deduped')

## Unwanted Languages
If you are only interested in German texts, you might want to delete all texts that are not in German.
We use langdetect to detect the language of the text. This is not perfect, but it is a great start.
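
To get a feeling for what the detection step does, here is a minimal sketch built directly on `langdetect`. It is not the implementation behind `pc.plot_languages`; the helper names `detect_language` and `language_counts` are hypothetical, and the optional `detector` argument exists only so the logic can be tried with a stand-in function.

```python
from collections import Counter


def detect_language(text, detector=None):
    """Return a language code for `text`, or 'unknown' if detection fails."""
    if detector is None:
        # Imported lazily so the helper can also be used with a stub detector.
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
        detector = detect
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException for texts without usable
        # features, for example empty strings or strings of digits.
        return 'unknown'


def language_counts(texts, detector=None):
    """Count how often each detected language occurs in `texts`."""
    return Counter(detect_language(t, detector) for t in texts)
```

Short texts are the most likely to be misclassified, so it is worth reading a few of the flagged files before removing them.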

In [None]:
languages = pc.plot_languages(GOAL_DIR, text_key='text_deduped')

In [None]:
pc.print_languages(GOAL_DIR, ['en'], languages, text_key='text_deduped')

In [None]:
pc.remove_languages(
    GOAL_DIR, ['en', 'tr'],
    languages,
    force=True
)

## Average Paragraph Lengths

The average paragraph length can be another indicator of non-article texts. If the average paragraph is very short, the text might be a list or a table. If it is very long, something may have gone wrong during extraction, for example paragraph breaks may have been lost.
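
As a rough illustration of the measure, the following sketch computes an average paragraph length for a single text. It assumes that paragraphs are separated by line breaks and that length is counted in characters; the `pc` functions may define both differently. The name `average_paragraph_length` is hypothetical.

```python
import re


def average_paragraph_length(text):
    """Mean character length of the non-empty paragraphs in `text` (0.0 if none)."""
    # Assumption: paragraphs are separated by one or more line breaks.
    paragraphs = [p.strip() for p in re.split(r'\n+', text) if p.strip()]
    if not paragraphs:
        return 0.0
    return sum(len(p) for p in paragraphs) / len(paragraphs)
```

A text made up of many one-line entries yields a small value, while a text without any line breaks yields its full length.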

In [None]:
pc.plot_avg_parlens(GOAL_DIR, text_key='text_deduped')

In [None]:
pc.print_parlen_threshold(GOAL_DIR, 200, below_threshold=True, text_key='text_deduped')

In [None]:
pc.remove_avg_parlen_threshold(GOAL_DIR, 200, below_threshold=True, force=True, text_key='text_deduped')

## Other Suspicions

You might have other ideas about what unwanted data looks like in your corpus. Test out some functions and see if you find anything!
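
Before running a new check over the whole corpus, it can help to try it on a few hand-written records first. The sketch below does this for a URL pattern that marks single pages of multi-page articles. The sample URLs are invented for illustration, and `is_single_page` is a hypothetical helper, not part of the project code.

```python
import re


def is_single_page(record):
    """True if the record's URL points at one page of a multi-page article."""
    return re.search(r'/seite-[0-9]+', record['url']) is not None


# Invented sample records: only the second URL contains '/seite-<number>'.
samples = [
    {'url': 'https://example.org/artikel/komplettansicht'},
    {'url': 'https://example.org/artikel/seite-2'},
    {'url': 'https://example.org/seitenblick'},
]
flagged = [r['url'] for r in samples if is_single_page(r)]
```

Including a near miss such as the third URL is a quick way to see that the pattern does not match more than intended.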

In [None]:
import re
import functions._postedit_checkers as postcheck


# For multi-page articles, keep only 'Komplettansicht' URLs and drop individual pages.
def custom_checker(data):
    url = data['url']
    if re.search(r'/seite-[0-9]+', url):
        return False
    return True

In [None]:
pc.print_custom_removal(
    GOAL_DIR, custom_checker, 'url'
)

In [None]:
pc.apply_custom_removal(
    GOAL_DIR, postcheck.zeit_dpa, force=True, title_key='h1'
)

## Add Linguistic Information

You can run a spaCy model over your data and store the resulting tokens and lemmas, so that you only have to do this once.

In [None]:
pc.add_lemma_token(
    GOAL_DIR, text_key='text_deduped', spacy_model='de_core_news_lg'
)

In [None]:
from functions.utils import rename_files_with_padded_index_prefixed

rename_files_with_padded_index_prefixed(GOAL_DIR, 'infoakt')