# Data augmentation

Augment the original dataset with new samples generated through several methods. Most augmentation libraries, such as textattack and nlpaug, are not designed for German sentences.

Source: [Text data augmentations: Permutation, antonyms and negation](https://www.sciencedirect.com/science/article/abs/pii/S0957417421002104)

## 🎓 Library

In [None]:
from googletrans import Translator
import pandas as pd

DATA = "classification/data/"


## Initial setup

We start with 71570 poems.

In [None]:
poems_df = pd.read_parquet(DATA + "de_poems.parquet")

In [None]:
poems_df.head(3)["text"]

In [None]:
new_poems_df = poems_df.copy()

## Line permutation

See the function `data_augment` in `classification/utils.py`. It shuffles the lines of poems from the same century.
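
The idea of line permutation can be illustrated with a minimal sketch. This is not the actual `data_augment` from `classification/utils.py`, whose implementation is not shown here. It assumes each poem is a dict with a `text` field and that lines are shuffled within a single poem, so the metadata (such as `creation`) can be copied unchanged. The names `permute_lines` and `augment_poems` are hypothetical.

```python
import random


def permute_lines(text: str, rng: random.Random) -> str:
    """Return `text` with its non-empty lines in a random order."""
    lines = [line for line in text.splitlines() if line.strip()]
    rng.shuffle(lines)
    return "\n".join(lines)


def augment_poems(poems: list[dict], copies: int = 1, seed: int = 0) -> list[dict]:
    """Keep the original poems and append `copies` shuffled variants of each."""
    rng = random.Random(seed)  # fixed seed so the augmentation is reproducible
    augmented = list(poems)
    for poem in poems:
        for _ in range(copies):
            # Copy the metadata, replace only the text (assumption: labels stay valid).
            augmented.append({**poem, "text": permute_lines(poem["text"], rng)})
    return augmented


poems = [{"text": "Zeile eins\nZeile zwei\nZeile drei", "creation": 1800}]
augmented = augment_poems(poems, copies=2)
```

Each variant contains the same lines as its source poem, so word-level features such as TF-IDF are unchanged while order-sensitive features differ.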


The model `classification/logistic_regression/tfidf.ipynb` can be trained with the augmented data.

## Translation to English, textattack augmentation, and back to German

Augmenting with textattack changes proper names and quantities.

In [None]:
translator = Translator()

In [None]:
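# Translate every poem from German to English, keeping its metadata.
# Top-level `await` is valid in a notebook cell but not in a plain script.
transformed_poems = []
for _, row in new_poems_df.iterrows():
    translated = await translator.translate(row["text"], src="de", dest="en")
    translated_row = {
        "title": row["title"],
        "text": translated.text,
        "author": row["author"],
        "creation": row["creation"],
    }
    transformed_poems.append(translated_row)

# Collect the translated rows into a new DataFrame.
translated_df = pd.DataFrame(transformed_poems)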
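
The cell above covers only the first step of the round trip named in the heading. A minimal sketch of the full pipeline (German to English, augment, English to German) could look like the following. The function `back_translate` and the toy helpers are hypothetical stand-ins: in practice `translate` would wrap a real translation service and `augment` a real augmenter, neither of which is called here.

```python
from typing import Callable

# A translation step takes (text, source language, target language).
TranslateFn = Callable[[str, str, str], str]
AugmentFn = Callable[[str], str]


def back_translate(text: str, translate: TranslateFn, augment: AugmentFn) -> str:
    """German -> English, augment the English text, then English -> German."""
    english = translate(text, "de", "en")
    augmented = augment(english)
    return translate(augmented, "en", "de")


# Toy word tables standing in for a translation service (assumption).
TOY_TABLES = {
    ("de", "en"): {"Mond": "moon", "hell": "bright"},
    ("en", "de"): {"moon": "Mond", "bright": "hell", "shining": "leuchtend"},
}


def toy_translate(text: str, src: str, dest: str) -> str:
    """Translate word by word; unknown words pass through unchanged."""
    table = TOY_TABLES[(src, dest)]
    return " ".join(table.get(word, word) for word in text.split())


def toy_augment(text: str) -> str:
    """Stand-in for an English augmenter: swap one word for a synonym."""
    return text.replace("bright", "shining")


result = back_translate("Mond hell", toy_translate, toy_augment)
```

Passing the translation and augmentation steps as callables keeps the pipeline testable without network access and makes it easy to swap in a different service.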

## GermaNet synonyms

GermaNet is a semantic network for German, similar to WordNet. It describes the meanings of words and the relationships between them, and can be used to look up semantic relations such as synonyms and antonyms.

Unfortunately, the dataset is not open source, and using it requires authorization.
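
Synonym replacement itself is simple once a lexicon is available. The sketch below uses a small hand-written dictionary, `TOY_SYNONYMS`, as a hypothetical stand-in for GermaNet lookups; it does not use the GermaNet API. The function name `replace_synonyms` and its `probability` parameter are also assumptions made for illustration.

```python
import random

# Hypothetical stand-in for a GermaNet synonym lookup.
TOY_SYNONYMS = {
    "schön": ["hübsch", "herrlich"],
    "Haus": ["Gebäude"],
}


def replace_synonyms(
    text: str,
    synonyms: dict[str, list[str]],
    rng: random.Random,
    probability: float = 1.0,
) -> str:
    """Replace words that have known synonyms, keeping the line structure."""
    new_lines = []
    for line in text.splitlines():
        words = []
        for word in line.split():
            candidates = synonyms.get(word)
            # Swap only known words, each with the given probability.
            if candidates and rng.random() < probability:
                words.append(rng.choice(candidates))
            else:
                words.append(word)
        new_lines.append(" ".join(words))
    return "\n".join(new_lines)


rng = random.Random(0)
result = replace_synonyms("Das Haus\nist schön", TOY_SYNONYMS, rng)
```

This version matches whole whitespace-separated tokens only, so inflected forms and words with attached punctuation are left unchanged; a real pipeline would need lemmatization before the lookup.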