<a href="https://colab.research.google.com/github/duchaba/Data-Augmentation-with-Python/blob/main/data_augmentation_with_python_chapter_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌻 Welcome to Chapter 5, Text Augmentation


---

I am glad to see you using this Python Notebook. 🐶

The Python Notebook is an integral part of the book. You can add new “code cells” to extend the functions, add your data, and explore new possibilities, such as downloading additional real-world datasets from the Kaggle website and coding the **Fun challenges**. Furthermore, the book has **Fun facts**, in-depth discussions about augmentation techniques, and Pluto, an imaginary Siberian Husky coding companion. Together they will guide you every step of the way.

Pluto encourages you to save a copy of this Python Notebook to your local space and add “text cells” to keep your notes. In other words, read the book and copy the relevant concepts into this Python Notebook’s text cells. That way, you have the explanations, notes, original code, your code, and any crazy future ideas in one place.


💗 I hope you enjoy reading the book and hacking code as much as I enjoy writing it.


## 🌟 Amazon Book

---

- The book is available on the Amazon Book website:
  - https://www.amazon.com/dp/1803246456

  - Author: Duc Haba
  - Published: 2023
  - Page count: 390+


- The original Python Notebook is on:
  - https://github.com/PacktPublishing/Data-Augmentation-with-Python/blob/main/Chapter_5/data_augmentation_with_python_chapter_5.ipynb

- 🚀 Click on the blue "Open in Colab" button at the top of this page to begin hacking.



# 😀 Excerpt from Chapter 5, Text Augmentation

---

> In case you haven’t bought the book, here is a teaser from the first page of Chapter 5.

---

Text augmentation is a technique that is used in Natural Language Processing (NLP) to generate additional data by modifying or creating new text from existing text data. Text augmentation involves techniques such as character swapping, noise injection, synonym replacement, word deletion, word insertion, and word swapping. Image and text augmentation have the same goal. They strive to increase the size of the training dataset and improve AI prediction accuracy.

Text augmentation is relatively more challenging to evaluate because it is not as intuitive as image augmentation. The intent of an image augmentation technique is clear, such as flipping a photo, but a character-swapping technique will be disorienting to the reader. Therefore, readers might perceive the benefits as subjective.

The effectiveness of text augmentation depends on the quality of the generated data and the specific NLP task being performed. It can be challenging to determine the appropriate safe level of text augmentation that is required for a given dataset, and it often requires experimentation and testing to achieve the desired results.

Customer feedback or social media chatter is fair game for text augmentation because the writing is messy and, predominantly, contains grammatical errors. Conversely, legal documents or written medical communications, such as doctor’s prescriptions or reports, are off-limits because the message is precise. In other words, error injections, synonyms, or even AI-generated text might change the legal or medical meaning beyond the safe level.

The biases in text augmentation are equally difficult to discern. For example, adding noise by purposely misspelling words using the Keyboard augmentation method might introduce bias against real-world tweets, which typically contain misspelled words. There are no generalized rules to follow, and the answer only becomes evident after thoroughly studying the data and reviewing the AI forecasting objective.

---

Fun fact

---

As generative AI becomes more widely available, you can use OpenAI’s GPT-3, Google Bard or Facebook’s Roberta system to generate original articles for text augmentation. For example, you can ask generative AI to create positive or negative reviews about a company product, then use the AI-written articles to train predictive AI on sentiment analysis.

---

In Chapter 5, you will learn about text augmentation and how to code the methods in Python notebooks. In particular, we will cover the following topics:

- Character augmenting

- Word augmenting

- Sentence and flow augmenting

- Text augmentation libraries

- Real-world text datasets

- Reinforce learning through Python Notebook

Let’s get started with the simplest topic, character augmentation.

---

🌴 *end of excerpt from the book*

# GitHub Clone

In [1]:
# git version should be 2.17.1 or higher
!git --version

git version 2.34.1


In [2]:
url = 'https://github.com/PacktPublishing/Data-Augmentation-with-Python'
!git clone {url}

Cloning into 'Data-Augmentation-with-Python'...
remote: Enumerating objects: 462, done.
remote: Counting objects: 100% (440/440), done.
remote: Compressing objects: 100% (261/261), done.
remote: Total 462 (delta 240), reused 359 (delta 178), pack-reused 22
Receiving objects: 100% (462/462), 139.70 MiB | 15.58 MiB/s, done.
Resolving deltas: 100% (240/240), done.


## Fetch file from URL (Optional)

- Uncomment the two code cells below if you want to fetch the file from a URL instead of using Git clone.

In [None]:
# import requests
# #
# def fetch_file(url, dst):
#   downloaded_obj = requests.get(url)
#   with open(dst, "wb") as file:
#     file.write(downloaded_obj.content)
#   return

In [None]:
# url = ''
# dst = 'pluto_chapter_1.py'
# fetch_file(url,dst)

# Run Pluto

- Instantiate Pluto, aka "Pluto, wake up!"

In [3]:
# %% CARRY-OVER code install

!pip install opendatasets --upgrade
!pip install pyspellchecker

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22
Collecting pyspellchecker
  Downloading pyspellchecker-0.7.2-py3-none-any.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 34.8 MB/s eta 0:00:00
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.7.2


In [4]:
# load and run the Pluto chapter 2 Python code.
pluto_file = 'Data-Augmentation-with-Python/pluto/pluto_chapter_2.py'
%run {pluto_file}

---------------------------- : ----------------------------
            Hello from class : <class '__main__.PacktDataAug'> Class: PacktDataAug
                   Code name : Pluto
                   Author is : Duc Haba
---------------------------- : ----------------------------


In [5]:
# check out the AI auto generated doc.
help(pluto)

Help on PacktDataAug in module __main__ object:

class PacktDataAug(builtins.object)
 |  PacktDataAug(name='Pluto', is_verbose=True, *args, **kwargs)
 |  
 |  The PacktDataAug class is the based class for the
 |  "Data Augmentation with Python" book.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, name='Pluto', is_verbose=True, *args, **kwargs)
 |      This is the constructor function.
 |      
 |      Args:
 |      
 |       name (str): It requires a name for the object. The default is 'Pluto'
 |       verbose (bool):  The default value of `verbose` is True. This function prints out the
 |          name of the object if `is_verbose == True`. This is used to debug
 |          code. When you are ready to deploy the model, then you should set
 |          `is_verbose == False` in order to avoid printing out diagnostic
 |          messages.
 |      
 |        Additionally, this function takes any number of other
 |        parameters. These parameters are stored in `**kwargs` and are

# Verify Pluto

In [6]:
pluto.say_sys_info()

---------------------------- : ----------------------------
                 System time : 2023/09/15 23:51
                    Platform : linux
     Pluto Version (Chapter) : 2.0
             Python (3.7.10) : actual: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
            PyTorch (1.11.0) : actual: 2.0.1+cu118
              Pandas (1.3.5) : actual: 1.5.3
                 PIL (9.0.0) : actual: 9.4.0
          Matplotlib (3.2.2) : actual: 3.7.1
                   CPU count : 12
                   CPU speed : 2.20 GHz
               CPU max speed : 0.00 GHz
---------------------------- : ----------------------------


## (Optional) Export to .py

In [7]:
pluto_chapter_5 = 'Data-Augmentation-with-Python/pluto/pluto_chapter_5.py'
!cp {pluto_file} {pluto_chapter_5}

# ✋ Set up Kaggle username and app Key

- Install the following libraries, and import them into the Notebook.
- Then initialize the Kaggle username, key, and fetch methods.
- STOP: Update your Kaggle access username or key first.

In [8]:
# %%CARRY-OVER code

# -------------------- : --------------------
# READ ME
# Chapter 2 begin:
# Install the following libraries, and import them into the Notebook.
# Then initialize the Kaggle username, key, and fetch methods.
# STOP: Update your Kaggle access username or key first.
# -------------------- : --------------------

!pip install opendatasets --upgrade
import opendatasets
print("\nrequired version 0.1.22 or higher: ", opendatasets.__version__)

!pip install pyspellchecker
import spellchecker
print("\nRequired version 0.7+", spellchecker.__version__)

# STOP: Update your Kaggle access username or key first.
pluto.remember_kaggle_access_keys("YOUR_KAGGLE_USERNAME", "YOUR_KAGGLE_API_KEY")
pluto._write_kaggle_credit()
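import kaggle
import pathlib  # used by fetch_kaggle_comp_data() below
import zipfile  # used by fetch_kaggle_comp_data() below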

@add_method(PacktDataAug)
def fetch_kaggle_comp_data(self,cname):
  #self._write_kaggle_credit()  # need to run only once.
  path = pathlib.Path(cname)
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)
  return

@add_method(PacktDataAug)
def fetch_kaggle_dataset(self,url,dest="kaggle"):
  #self._write_kaggle_credit()    # need to run only once.
  opendatasets.download(url,data_dir=dest)
  return
# -------------------- : --------------------



required version 0.1.22 or higher:  0.1.22

Required version 0.7+ 0.7.2


# Fetch Kaggle Data

## Netflix

In [9]:
%%time
url = 'https://www.kaggle.com/datasets/infamouscoder/dataset-netflix-shows'
pluto.fetch_kaggle_dataset(url)

Downloading dataset-netflix-shows.zip to kaggle/dataset-netflix-shows


100%|██████████| 1.34M/1.34M [00:01<00:00, 1.38MB/s]


CPU times: user 203 ms, sys: 12 ms, total: 215 ms
Wall time: 2.1 s





In [None]:
f = 'kaggle/dataset-netflix-shows/netflix_titles.csv'
pluto.df_netflix_data = pluto.fetch_df(f)
pluto.df_netflix_data.head(3)

In [None]:
pluto.print_batch_text(pluto.df_netflix_data)

In [None]:
pluto.count_word(pluto.df_netflix_data)
pluto.df_netflix_data.head()

In [None]:
pluto.draw_word_count(pluto.df_netflix_data)

In [10]:
# %%CARRY-OVER code install

!pip install missingno



In [14]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: draw_text_null_data
pluto.version = 5.0
import missingno
@add_method(PacktDataAug)
def draw_text_null_data(self, df, color=(0.3,0.36,0.44)):

  """
  Draws a matrix plot of missing (null) values for a given Pandas DataFrame.

  Args:
    df (Pandas DataFrame): The data to check for missing values.
    color (tuple, optional): The RGB color for the plot. Defaults to (0.3,0.36,0.44).

  Returns:
    None.
  """
  canvas, pic = matplotlib.pyplot.subplots(1, 1, figsize=(10, 6))
  missingno.matrix(df,color=color,ax=pic)
  pic.set_title('Missing Data (Null Value)')
  pic.set_xlabel('Solid means data is present. White lines mark missing/null data.')
  canvas.tight_layout()
  self._drop_image(canvas)
  canvas.show()
  return

In [None]:
pluto.draw_text_null_data(pluto.df_netflix_data)

In [12]:
# %%CARRY-OVER code install

!pip install nltk
!pip install wordcloud
#
# tested with the following version
# !pip install nltk==3.8.1
# !pip install wordcloud==1.8.2.2



In [18]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: _draw_text_wordcloud
import nltk
import wordcloud
import re
@add_method(PacktDataAug)
def _draw_image_wordcloud(self, words_str, xignore_words='cat', title='Word Cloud:'):
  """
  Draws a word cloud from a given string of text.

  Args:
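    words_str (str): The words as a single string, separated by spaces.
    xignore_words (str or set of str, optional): Any word(s) that should be ignored.
      It is passed to WordCloud as `stopwords`, which expects a set of strings
      such as wordcloud.STOPWORDS. Defaults to 'cat'.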
    title (str, optional): The title of the plot. Defaults to 'Word Cloud:'.

  Returns:
    None.
  """
  canvas, pic = matplotlib.pyplot.subplots(1, 1, figsize=(16, 8))
  img = wordcloud.WordCloud(width = 1400,
    height = 800,
    background_color ='white',
    stopwords = xignore_words,
    min_font_size = 10).generate(words_str)
  pic.imshow(img)
  pic.set_title(title)
  pic.set_xlabel(f'Approximate Words: {int(len(words_str) / 5)}')
  pic.tick_params(left = False, right = False, labelleft = False,
    labelbottom = False, bottom = False)
  canvas.tight_layout()
  self._drop_image(canvas)
  canvas.show()
  return
  #
# prompt: write detail Python documentation with default value for the following function: draw_text_wordcloud
@add_method(PacktDataAug)
def draw_text_wordcloud(self, df_1column, xignore_words='cat', title='Word Cloud:'):
  """
  Draws a word cloud from a given column of text. It uses the helper function:
  _draw_image_wordcloud().

  Args:
    df_1column (Pandas Series): The text column to draw the word cloud from.
    xignore_words (str, optional): Any word(s) that should be ignored. Defaults to 'cat'.
    title (str, optional): The title of the plot. Defaults to 'Word Cloud:'.

  Returns:
    None.
  """
  orig = df_1column.str.cat()
  clean = re.sub('[^A-Za-z0-9 ]+', '', orig)
  self._draw_image_wordcloud(clean, xignore_words=xignore_words,title=title)
  return

In [None]:
print('Nltk version 3.7: actual: ', nltk.__version__)
print('WordCloud version 1.8.2.2: actual: ', wordcloud.__version__)

In [None]:
pluto.draw_text_wordcloud(pluto.df_netflix_data.description,
  xignore_words=wordcloud.STOPWORDS,
  title='Word Cloud: Netflix Movie Review')

## Twitter

In [None]:
# @add_method(PacktDataAug)
# def fetch_df(self, csv):
#   df = pandas.read_csv(csv, encoding='latin-1')
#   return df

In [None]:
%%time
url = 'https://www.kaggle.com/datasets/mayurdalvi/twitter-sentiments-analysis-nlp'
pluto.fetch_kaggle_dataset(url)

In [None]:
# remove white space in directory and filename
# run this until no error/output
f = 'kaggle/twitter-sentiments-analysis-nlp'
#!find {f} -name "* *" -type d | rename 's/ /_/g'
!find {f} -name "* *" -type f | rename 's/ /_/g'

In [None]:
f = 'kaggle/twitter-sentiments-analysis-nlp/Twitter_Sentiments.csv'
pluto.df_twitter_data = pluto.fetch_df(f)
pluto.df_twitter_data.head(3)

In [None]:
pluto.print_batch_text(pluto.df_twitter_data,cols=['label', 'tweet'])

In [17]:
# %%CARRY-OVER code install

!pip install filter-profanity
#
# tested on the following version:
#!pip install filter-profanity==1.0.9

Collecting filter-profanity
  Downloading filter-profanity-1.0.9.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: filter-profanity
  Building wheel for filter-profanity (setup.py) ... done
  Created wheel for filter-profanity: filename=filter_profanity-1.0.9-py3-none-any.whl size=5199 sha256=d0104de21ef1c6d53ab49250f80047a017749d97c207478a3e22c4ae1f2608e9
  Stored in directory: /root/.cache/pip/wheels/62/c4/67/6c6cabc8e2202296e219178d7b6304d893a8b884f5b70406b6
Successfully built filter-profanity
Installing collected packages: filter-profanity
Successfully installed filter-profanity-1.0.9


In [20]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: _clean_text
import profanity
import re
#
@add_method(PacktDataAug)
def _clean_text(self,x):
  """
  Cleans a text by removing any special characters.

  Args:
    x (str): The text to be cleaned.

  Returns:
    str: The cleaned text.
  """
  return (re.sub('[^A-Za-z0-9 .,!?#@]+', '', str(x)))
#
# prompt: write detail Python documentation with default value for the following function: _clean_bad_word
@add_method(PacktDataAug)
def _clean_bad_word(self,x):

  """
  Cleans a text by censoring any bad word. It uses the
  censor_profanity() function from the profanity library, called here with
  an empty string as its second argument.

  Args:
    x (str): The text to be cleaned.

  Returns:
    str: The cleaned text.
  """

  return (profanity.censor_profanity(x, ''))
#
# prompt: write detail Python documentation with default value for the following function: clean_text
@add_method(PacktDataAug)
def clean_text(self, df):

  """
  Cleans the 'tweet' column by removing special characters and censoring
  bad words.
  It uses the _clean_text() and _clean_bad_word() helper functions.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column 'tweet'
    which is to be cleaned.

  Returns:
    Pandas DataFrame: The modified Pandas DataFrame with a new column 'clean_tweet'.
  """
  df['clean_tweet'] = df.tweet.apply(self._clean_text)
  df['clean_tweet'] = df['clean_tweet'].apply(self._clean_bad_word)
  return df

In [None]:
%%time
pluto.clean_text(pluto.df_twitter_data)
pluto.df_twitter_data.head()
#
pluto.df_netflix_data['description'] = pluto.df_netflix_data['description'].apply(pluto._clean_text)

In [None]:
pluto.print_batch_text(pluto.df_twitter_data, cols=['label', 'clean_tweet'])

In [None]:
# double check on clean tweets
print('clean: ', pluto.df_twitter_data.clean_tweet[13538], ' : original: ',
  pluto.df_twitter_data.tweet[13538], ': label: ', pluto.df_twitter_data.label[13538])

In [None]:
# double check on clean tweets
with pandas.option_context("display.max_colwidth", None):
  display(pluto.df_twitter_data[pluto.df_twitter_data.label == 1].sample(10))

In [None]:
pluto.count_word(pluto.df_twitter_data,col_dest='clean_tweet')
pluto.draw_word_count(pluto.df_twitter_data)

In [None]:
pluto.draw_text_null_data(pluto.df_twitter_data)

In [None]:
pluto.draw_text_wordcloud(pluto.df_twitter_data.clean_tweet,
  xignore_words=wordcloud.STOPWORDS,
  title='Clean Tweets Word Cloud')

# Export (aka drop or save) data file

In [22]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: _drop_df_file
@add_method(PacktDataAug)
def _drop_df_file(self, df,fname,type='csv',sep='~'):
  """
  Writes a Pandas DataFrame to a file. It uses the Pandas DataFrame .to_csv()
  function. The `type` argument is accepted but not used yet, so the output
  is always delimited text.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame to be dropped to a file.
    fname (str): The name of the file to which the Pandas DataFrame
      is to be written.
    type (str, optional): The file type. The default file type is 'csv'.
      Currently unused.
    sep (str, optional): The separator character for the CSV file. The default
      separator is '~'.

  Returns:
    None.
  """
  df.to_csv(fname,sep=sep)
  return

In [None]:
f = 'Data-Augmentation-with-Python/pluto_data/netflix_data.csv'
pluto._drop_df_file(pluto.df_netflix_data, f)

In [None]:
f = 'Data-Augmentation-with-Python/pluto_data/twitter_data.csv'
pluto._drop_df_file(pluto.df_twitter_data, f)

# Character Augmenter<a class="anchor" id="chara_aug"></a>

This section augments data at the character level. Two typical scenarios are image-to-text and chatbots. Recognizing text in an image requires an optical character recognition (OCR) model, but OCR introduces errors, such as confusing "o" with "0". `OcrAug` simulates these errors to augment the data. Chatbot input still contains typos even though most applications come with word correction, so `KeyboardAug` simulates that kind of error.
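
To make the OCR idea concrete, here is a minimal pure-Python sketch of look-alike character confusion. It is not how nlpaug's `OcrAug` is implemented: the `OCR_CONFUSIONS` table and the `ocr_noise` function are hypothetical names invented for this illustration, and the table holds only a handful of pairs.

```python
import random

# Hypothetical look-alike table; a real OCR error model is far larger.
OCR_CONFUSIONS = {
    "o": "0", "0": "o",
    "l": "1", "1": "l",
    "s": "5", "5": "s",
    "b": "6", "6": "b",
}

def ocr_noise(text, rate=0.3, seed=None):
    """Replace some confusable characters with their look-alikes."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        swap = OCR_CONFUSIONS.get(ch.lower())
        # Only confusable characters are candidates, each with probability `rate`.
        if swap is not None and rng.random() < rate:
            out.append(swap)
        else:
            out.append(ch)
    return "".join(out)

print(ocr_noise("It was the best of times.", rate=0.5, seed=42))
```

The output keeps the same length as the input, which is the property that makes OCR-style noise easy to compare side by side with the original text.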

In [23]:
# %%CARRY-OVER code install

!pip install nlpaug
#
# tested on the following version:
#!pip install nlpaug==1.1.11

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.5/410.5 kB 6.8 MB/s eta 0:00:00
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [24]:
# %%writefile -a {pluto_chapter_5}

import nlpaug
import nlpaug.augmenter
import nlpaug.augmenter.char
import nlpaug.augmenter.word

In [None]:
print('version 1.1.11, actual: ',nlpaug.__version__)

In [25]:
# %%writefile -a {pluto_chapter_5}

pluto.orig_text = 'It was the best of times. It was the worst of times. It was the age of wisdom. It was the age of foolishness. It was the epoch of belief. It was the epoch of incredulity.'

In [29]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: _print_aug_batch
@add_method(PacktDataAug)
def _print_aug_batch(self, df, aug_func, col_dest="description",
  bsize=3, aug_name='Augmented',is_larger_font=True):

  """
  Prints a batch of data augmentation results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    aug_func (nlpaug.augmenter.Augmenter): The nlpaug augmenter to be used for
      data augmentation.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augmented'.
    is_larger_font (bool, optional): Whether to use a larger font for printing
      the data augmentation results. The default is True.

  Returns:
    None.
  """

  col_name = [aug_name, 'Original']
  # Start with the control text, then add bsize rows sampled from the column.
  aug = aug_func.augment(self.orig_text, n=1)
  data = [[aug[0], self.orig_text]]
  orig = df[col_dest].sample(bsize)
  for tx in orig:
    aug = aug_func.augment(tx, n=1)
    data.append([aug[0], tx])
  # Build the DataFrame once; DataFrame.append() was removed in pandas 2.0.
  df_aug = pandas.DataFrame(data, columns=col_name)
  #
  with pandas.option_context("display.max_colwidth", None):
    if (is_larger_font):
      display(df_aug.head(bsize+1).style.set_table_styles(self._fetch_larger_font()))
    else:
      display(df_aug.head(bsize+1))
  return

## OCR augmenting

In [30]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_ocr
@add_method(PacktDataAug)
def print_aug_ocr(self, df, col_dest="description",bsize=3, aug_name='Augmented'):

  """
  Performs OCR augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augmented'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.char.OcrAug()
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_ocr(pluto.df_netflix_data, col_dest='description',aug_name='OCR Augment')

In [None]:
pluto.print_aug_ocr(pluto.df_twitter_data, col_dest='clean_tweet',aug_name='OCR Augment')

## Keyboard Augmenter<a class="anchor" id="keyboard_aug"></a>

- Substitute characters based on keyboard distance
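
A rough sketch of the idea, assuming a US QWERTY layout: each letter may be replaced by a key that sits next to it. The `QWERTY_ROWS` layout, the `neighbors` helper, and the `keyboard_typo` function are hypothetical names for this illustration, and they only consider the left and right neighbors on the same row. nlpaug's `KeyboardAug` uses its own keyboard model.

```python
import random

# Assumed layout: the three letter rows of a US QWERTY keyboard.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def neighbors(ch):
    """Return the keys directly left and right of `ch` on its row."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
    return []

def keyboard_typo(text, rate=0.1, seed=None):
    """Replace some letters with an adjacent key (the result is lowercase)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        near = neighbors(ch.lower())
        if near and rng.random() < rate:
            out.append(rng.choice(near))
        else:
            out.append(ch)
    return "".join(out)

print(keyboard_typo("It was the age of wisdom.", rate=0.2, seed=7))
```

Spaces, digits, and punctuation have no neighbors in this simplified layout, so they pass through unchanged.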

In [33]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_keyboard
@add_method(PacktDataAug)
def print_aug_keyboard(self, df, col_dest="description",bsize=3, aug_name='Keyboard Augment'):

  """
  Performs keyboard augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Keyboard Augment'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.char.KeyboardAug()
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_keyboard(pluto.df_netflix_data, col_dest='description',
  aug_name='Keyboard Augment')

In [None]:
pluto.print_aug_keyboard(pluto.df_twitter_data, col_dest='clean_tweet',
  aug_name='Keyboard Augment')

## Random Augmenter<a class="anchor" id="random_aug"></a>
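
The four actions used in this section (insert, delete, substitute, swap) can be sketched in plain Python. This is a simplified illustration, not nlpaug's `RandomCharAug` implementation: `random_char` is a hypothetical name, it edits exactly one position per call, and it draws new characters from lowercase ASCII letters only.

```python
import random
import string

def random_char(text, action="insert", seed=None):
    """Apply one random character-level edit to `text`."""
    rng = random.Random(seed)
    if len(text) < 2:
        return text  # too short to swap
    i = rng.randrange(len(text) - 1)  # leaves room for text[i + 1]
    ch = rng.choice(string.ascii_lowercase)
    if action == "insert":
        return text[:i] + ch + text[i:]
    if action == "delete":
        return text[:i] + text[i + 1:]
    if action == "substitute":
        return text[:i] + ch + text[i + 1:]
    if action == "swap":
        # Exchange the character at i with its right-hand neighbor.
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    raise ValueError(f"unknown action: {action}")

for act in ("insert", "delete", "substitute", "swap"):
    print(act, ":", random_char("foolishness", action=act, seed=3))
```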

In [36]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_char_random
@add_method(PacktDataAug)
def print_aug_char_random(self, df, action='insert', col_dest="description",bsize=3, aug_name='Augment'):

  """
  Performs random augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augment'.
    action (str, optional): The action of augmentation. Possible values include:
      - 'insert': Insert random characters in the target text.
      - 'delete': Delete random characters in the target text.
      - 'substitute': Substitute random characters in the target text.
      - 'swap': Swap adjacent characters in the target text.
      The default value is 'insert'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.char.RandomCharAug(action=action)
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_char_random(pluto.df_netflix_data, action='insert',
  col_dest='description', aug_name='Random Insert Augment')

In [None]:
pluto.print_aug_char_random(pluto.df_netflix_data, action='delete',
  col_dest='description', aug_name='Random Delete Augment')

In [None]:
pluto.print_aug_char_random(pluto.df_netflix_data, action='substitute',
  col_dest='description', aug_name='Random Substitute Augment')

In [None]:
pluto.print_aug_char_random(pluto.df_netflix_data, action='swap',
  col_dest='description', aug_name='Random Swap Augment')

In [None]:
pluto.print_aug_char_random(pluto.df_twitter_data, action='insert',
  col_dest='clean_tweet', aug_name='Random Insert Augment')

In [None]:
pluto.print_aug_char_random(pluto.df_twitter_data, action='delete',
  col_dest='clean_tweet', aug_name='Random Delete Augment')

In [None]:
pluto.print_aug_char_random(pluto.df_twitter_data, action='substitute',
  col_dest='clean_tweet', aug_name='Random Substitute Augment')

In [None]:
pluto.print_aug_char_random(pluto.df_twitter_data, action='swap',
  col_dest='clean_tweet', aug_name='Random Swap Augment')

# Word augmentation

## Misspell Augmenter<a class="anchor" id="spelling_aug"></a>

- Substitute words using a dictionary of common spelling mistakes
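
As a minimal sketch of dictionary-based misspelling, assume a small lookup table that maps a correct word to a few known misspellings. The `MISSPELLINGS` table and the `misspell` function are hypothetical and exist only for this illustration; nlpaug's `SpellingAug` ships with its own, much larger dictionary.

```python
import random

# Hypothetical mini dictionary: correct word -> common misspellings.
MISSPELLINGS = {
    "the": ["teh", "hte"],
    "was": ["wsa", "waz"],
    "wisdom": ["wisdum", "wizdom"],
    "belief": ["beleif", "belif"],
}

def misspell(text, rate=0.3, seed=None):
    """Replace some known words with one of their listed misspellings."""
    rng = random.Random(seed)
    words = text.split()
    for i, word in enumerate(words):
        options = MISSPELLINGS.get(word.lower())
        if options and rng.random() < rate:
            words[i] = rng.choice(options)
    return " ".join(words)

print(misspell("It was the age of wisdom", rate=0.5, seed=11))
```

Words with attached punctuation, such as "wisdom.", do not match the table in this sketch and are left unchanged.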

In [38]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_word_misspell
@add_method(PacktDataAug)
def print_aug_word_misspell(self, df, col_dest="description",bsize=3, aug_name='Augment'):

  """
  Performs misspell augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augment'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.word.SpellingAug()
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_word_misspell(pluto.df_netflix_data,
  col_dest='description', aug_name='Word Spelling Augment')

In [None]:
pluto.print_aug_word_misspell(pluto.df_twitter_data,
  col_dest='clean_tweet', aug_name='Word Spelling Augment')

## Split Augmenter<a class="anchor" id="split_aug"></a>

- Split a word into two tokens at a random position
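
The split operation is simple enough to sketch directly. The `split_words` function below is a hypothetical illustration, not nlpaug's `SplitAug`; the `rate` and `min_len` parameters are assumptions chosen for readability.

```python
import random

def split_words(text, rate=0.3, min_len=4, seed=None):
    """Split some words of at least `min_len` characters into two tokens."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) >= min_len and rng.random() < rate:
            cut = rng.randrange(1, len(word))  # never produces an empty piece
            out.extend([word[:cut], word[cut:]])
        else:
            out.append(word)
    return " ".join(out)

print(split_words("It was the epoch of incredulity.", rate=0.5, seed=5))
```

No characters are added or removed, so deleting the spaces from the input and the output gives the same string.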

In [40]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_word_split
@add_method(PacktDataAug)
def print_aug_word_split(self, df, col_dest="description",bsize=3, aug_name='Augment'):

  """
  Performs split augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augment'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.word.SplitAug()
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_word_split(pluto.df_netflix_data, col_dest='description', aug_name='Word Split Augment')

In [None]:
pluto.print_aug_word_split(pluto.df_twitter_data, col_dest='clean_tweet', aug_name='Word Split Augment')

## Random Word Augmenter
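
Three of the word-level actions used in this section (swap, delete, crop) can be sketched in a few lines. The `random_word` function is a hypothetical illustration that applies one edit per call; it is not nlpaug's `RandomWordAug` and omits the substitute action.

```python
import random

def random_word(text, action="swap", seed=None):
    """Apply one random word-level edit to `text`."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 3:
        return text  # too short to edit meaningfully
    if action == "swap":
        # Exchange one word with its right-hand neighbor.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    elif action == "delete":
        del words[rng.randrange(len(words))]
    elif action == "crop":
        # Remove a contiguous run of words, keeping at least one word.
        start = rng.randrange(len(words) - 1)
        end = rng.randrange(start + 1, len(words))
        del words[start:end]
    else:
        raise ValueError(f"unknown action: {action}")
    return " ".join(words)

for act in ("swap", "delete", "crop"):
    print(act, ":", random_word("It was the epoch of belief", action=act, seed=9))
```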

In [42]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_word_random
@add_method(PacktDataAug)
def print_aug_word_random(self, df, action='swap', col_dest="description",bsize=3, aug_name='Augment'):

  """
  Performs random word augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augment'.
    action (str, optional): The action of augmentation. Possible values include:
      - 'swap': Swap adjacent words at random in the target text.
      - 'substitute': Substitute random words with a target word.
      - 'crop': Delete a contiguous run of words from the target text.
      - 'delete': Delete random words from the target text.
      The default value is 'swap'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.word.RandomWordAug(action=action)
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_word_random(pluto.df_netflix_data, action='swap',
  col_dest='description', aug_name='Word Random Swap Augment')

In [None]:
pluto.print_aug_word_random(pluto.df_twitter_data, action='swap',
  col_dest='clean_tweet', aug_name='Word Random Swap Augment')

In [None]:
pluto.print_aug_word_random(pluto.df_netflix_data, action='substitute',
  col_dest='description', aug_name='Word Random Substitute Augment')

In [None]:
pluto.print_aug_word_random(pluto.df_twitter_data, action='substitute',
  col_dest='clean_tweet', aug_name='Word Random Substitute Augment')

In [None]:
pluto.print_aug_word_random(pluto.df_netflix_data, action='crop',
  col_dest='description', aug_name='Word Random Crop Augment')

In [None]:
pluto.print_aug_word_random(pluto.df_twitter_data, action='crop',
  col_dest='clean_tweet', aug_name='Word Random Crop Augment')

In [None]:
pluto.print_aug_word_random(pluto.df_netflix_data, action='delete',
  col_dest='description', aug_name='Word Random Delete Augment')

In [None]:
pluto.print_aug_word_random(pluto.df_twitter_data, action='delete',
  col_dest='clean_tweet', aug_name='Word Random Delete Augment')
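
The four actions exercised above (swap, substitute, crop, and delete) can be sketched in plain Python. This is a simplified stand-in that Pluto wrote for illustration, not nlpaug's implementation: the function name `random_word_action`, the `placeholder` argument, and the `seed` argument are all assumptions, and it edits one position per call, whereas nlpaug may touch several words.

```python
import random

def random_word_action(text, action="swap", placeholder="_", seed=None):
    """Apply one toy word-level edit (hypothetical helper, not nlpaug code)."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text  # nothing sensible to do with zero or one word
    # Pick a position that always has a right-hand neighbor.
    i = rng.randrange(len(words) - 1)
    if action == "swap":
        # Exchange the chosen word with its right-hand neighbor.
        words[i], words[i + 1] = words[i + 1], words[i]
    elif action == "substitute":
        # Overwrite the chosen word with the placeholder token.
        words[i] = placeholder
    elif action == "delete":
        # Drop the chosen word.
        del words[i]
    elif action == "crop":
        # Drop a contiguous run starting at i; the last word always survives.
        j = rng.randrange(i + 1, len(words))
        del words[i:j]
    else:
        raise ValueError(f"Unknown action: {action}")
    return " ".join(words)

sample = "the quick brown fox jumps"
for act in ("swap", "substitute", "crop", "delete"):
    print(act, "->", random_word_action(sample, action=act, seed=42))
```

Passing a fixed `seed` makes a run repeatable, which is handy when comparing the actions side by side.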

## Synonym Augmenter

In [43]:
# %%writefile -a {pluto_chapter_5}

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

- Substitute words with their WordNet synonyms

In [45]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_word_synonym
@add_method(PacktDataAug)
def print_aug_word_synonym(self, df, col_dest="description",bsize=3, aug_name='Augment'):

  """
  Performs synonym augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augment'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.word.SynonymAug(aug_src='wordnet')
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_word_synonym(pluto.df_netflix_data,
  col_dest='description', aug_name='Synonym WordNet Augment')

In [None]:
pluto.print_aug_word_synonym(pluto.df_twitter_data,
  col_dest='clean_tweet', aug_name='Synonym WordNet Augment')
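
To see the idea behind synonym substitution without downloading WordNet, here is a minimal sketch that uses a tiny hand-made lexicon. Everything in it is an assumption for illustration: the `TOY_SYNONYMS` dictionary, the `synonym_replace` function, and the probability argument `p` are hypothetical and are not part of nlpaug or NLTK.

```python
import random

# Hand-made toy lexicon (an assumption; WordNet is far larger and richer).
TOY_SYNONYMS = {
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
    "movie": ["film"],
}

def synonym_replace(text, lexicon=TOY_SYNONYMS, p=0.3, seed=None):
    """Replace each known word with a synonym with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        candidates = lexicon.get(word.lower())
        if candidates and rng.random() < p:
            # The word is in the lexicon and the coin flip succeeded.
            out.append(rng.choice(candidates))
        else:
            # Unknown word, or the coin flip failed: keep the original.
            out.append(word)
    return " ".join(out)

print(synonym_replace("a happy family watches a big movie", p=1.0, seed=7))
```

With `p=1.0` every word found in the lexicon is replaced, and with `p=0.0` the text comes back unchanged. Values in between leave some words alone, so repeated calls yield different variants of the same sentence.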

## Antonym Augmenter<a class="anchor" id="antonym_aug"></a>

In [47]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_word_antonym
@add_method(PacktDataAug)
def print_aug_word_antonym(self, df, col_dest="description",bsize=3, aug_name='Augment'):

  """
  Performs antonym augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augment'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.word.AntonymAug()
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
pluto.print_aug_word_antonym(pluto.df_netflix_data,
  col_dest='description',aug_name='Antonym Augment')

In [None]:
pluto.print_aug_word_antonym(pluto.df_twitter_data,
  col_dest='clean_tweet',aug_name='Antonym Augment')

## Reserved Word Augmenter<a class="anchor" id="reserved_aug"></a>

In [49]:
# %%writefile -a {pluto_chapter_5}

# prompt: write detail Python documentation with default value for the following function: print_aug_word_reserved
@add_method(PacktDataAug)
def print_aug_word_reserved(self, df, col_dest="description",reserved_tokens=None,bsize=3, aug_name='Augment'):

  """
  Performs reserved word augmentation on the given text data and prints the results.

  Args:
    df (Pandas DataFrame): The Pandas DataFrame containing the column to be
      augmented.
    col_dest (str, optional): The column name of the column to be augmented.
      The default column name is 'description'.
    reserved_tokens (list, optional): A list of lists of interchangeable
      words. A word found in the text may be swapped for another word from
      the same inner list. The default value is None.
    bsize (int, optional): The batch size of the data augmentation results to be
      printed. The default batch size is 3.
    aug_name (str, optional): The name of the data augmentation function. The
      default name is 'Augment'.

  Returns:
    None.
  """
  aug_func = nlpaug.augmenter.word.ReservedAug(reserved_tokens=reserved_tokens)
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return

In [None]:
# %%writefile -a {pluto_chapter_5}

pluto.reserved_control = [['wisdom', 'sagacity', 'intelligence', 'prudence'],
  ['foolishness', 'folly', 'idiocy', 'stupidity']]

In [None]:
# %%writefile -a {pluto_chapter_5}

pluto.reserved_netflix = [['family','household', 'brood', 'unit', 'families'],
  ['life','existence', 'entity', 'creation'],
  ['love', 'warmth', 'endearment','tenderness']]
pluto.reserved_netflix = pluto.reserved_control + pluto.reserved_netflix

In [None]:
pluto.print_aug_word_reserved(pluto.df_netflix_data, col_dest='description',
  reserved_tokens=pluto.reserved_netflix, aug_name='Netflix Reserved word augment')

In [None]:
# %%writefile -a {pluto_chapter_5}

pluto.reserved_twitter = [['user', 'users', 'customer', 'client','people','member','shopper'],
  ['happy', 'cheerful', 'joyful', 'carefree'],
  ['time','clock','hour']]
pluto.reserved_twitter = pluto.reserved_control + pluto.reserved_twitter

In [None]:
# pluto.reserved_twitter

In [None]:
pluto.print_aug_word_reserved(pluto.df_twitter_data, col_dest='clean_tweet',
  reserved_tokens=pluto.reserved_twitter,aug_name='Twitter Reserved word augment')
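
The reserved lists above are groups of interchangeable words: when a word from a group shows up in the text, it can be traded for another member of the same group. The sketch below shows that lookup in plain Python. It is a simplification under stated assumptions: `swap_reserved` is a hypothetical helper, it replaces every match (nlpaug applies its own probability), and it ignores punctuation attached to words.

```python
import random

def swap_reserved(text, reserved_groups, seed=None):
    """Swap each reserved word for another word from its own group."""
    rng = random.Random(seed)
    # Map every reserved word to the other members of its group.
    alternatives = {}
    for group in reserved_groups:
        for word in group:
            alternatives[word] = [w for w in group if w != word]
    out = []
    for token in text.split():
        options = alternatives.get(token.lower())
        # Keep the token as-is when it is not in any reserved group.
        out.append(rng.choice(options) if options else token)
    return " ".join(out)

groups = [
    ["wisdom", "sagacity", "intelligence", "prudence"],
    ["happy", "cheerful", "joyful", "carefree"],
]
print(swap_reserved("wisdom makes people happy", groups, seed=0))
```

Because the replacement always comes from the same group, you decide exactly which substitutions are acceptable for your dataset.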

In [50]:
# end of chapter 5
print('End of chapter 5.')

End of chapter 5.


In [51]:
# check on AI auto documentation
help(pluto)

Help on PacktDataAug in module __main__ object:

class PacktDataAug(builtins.object)
 |  PacktDataAug(name='Pluto', is_verbose=True, *args, **kwargs)
 |  
 |  The PacktDataAug class is the based class for the
 |  "Data Augmentation with Python" book.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, name='Pluto', is_verbose=True, *args, **kwargs)
 |      This is the constructor function.
 |      
 |      Args:
 |      
 |       name (str): It requires a name for the object. The default is 'Pluto'
 |       verbose (bool):  The default value of `verbose` is True. This function prints out the
 |          name of the object if `is_verbose == True`. This is used to debug
 |          code. When you are ready to deploy the model, then you should set
 |          `is_verbose == False` in order to avoid printing out diagnostic
 |          messages.
 |      
 |        Additionally, this function takes any number of other
 |        parameters. These parameters are stored in `**kwargs` and are

# Push up all changes (Optional)

- username: duchaba

- password: [use the token]

In [None]:
# import os
# f = 'Data-Augmentation-with-Python'
# os.chdir(f)
# !git add -A
# !git config --global user.email "duc.haba@gmail.com"
# !git config --global user.name "duchaba"
# !git commit -m "end of session"
# # do the git push in the xterm console
# #!git push

In [None]:
# %%script false --no-raise-error  #temporary stop execute for export file

# Summary

Every chapter begins with the same base class, "PacktDataAug".

✋ FAIR WARNING:

- The code uses long, complete function path names.

- Pluto wrote the code to be easy to understand, not for compactness, fast execution, or cleverness.

- Use Xterm to debug the cloud server.



In [None]:
# !pip install colab-xterm
# %load_ext colabxterm
# %xterm