<a href="https://colab.research.google.com/github/duchaba/Data-Augmentation-with-Python/blob/main/data_augmentation_with_python_chapter_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation with Python, Chapter 6

## 🌻 Welcome to Chapter 6, "Text Augmentation with Machine Learning"

In this chapter, you will learn:

- Machine Learning models 

- Word augmenting 

- Sentence augmenting 

- Real-world NLP datasets 

- Reinforce learning through Python code 

# Load Notebook

- The original link for this Notebook is: 
  - https://github.com/PacktPublishing/Data-Augmentation-with-Python/blob/main/data_augmentation_with_python_chapter_6.ipynb

# GitHub Clone

In [None]:
!pip install gensim==4.2.0
import gensim
print(gensim.__version__)
# version 4.2.0

In [None]:
# git version should be 2.17.1 or higher
!git --version

In [None]:
url = 'https://github.com/PacktPublishing/Data-Augmentation-with-Python'
!git clone {url}

## Fetch file from URL (Optional)

- Uncomment the two code cells below if you want to fetch the file from a URL instead of using Git Clone

In [None]:
# import requests
# #
# def fetch_file(url, dst):
#   downloaded_obj = requests.get(url)
#   with open(dst, "wb") as file:
#     file.write(downloaded_obj.content)
#   return

In [None]:
# url = ''
# dst = 'pluto_chapter_1.py'
# fetch_file(url,dst)

# Run Pluto

- Instantiate Pluto, a.k.a. "Pluto, wake up!"

In [None]:
# %% CARRY-OVER code install

!pip install opendatasets --upgrade
!pip install pyspellchecker 
!pip install missingno
!pip install nltk
!pip install wordcloud
!pip install filter-profanity
!pip install nlpaug

In [None]:
# Load and run the Pluto chapter 5 Python code.
pluto_file = 'Data-Augmentation-with-Python/pluto/pluto_chapter_5.py'
%run {pluto_file}

## Verify Pluto

In [None]:
pluto.say_sys_info()

## (Optional) Export to .py

In [None]:
pluto_chapter_6 = 'Data-Augmentation-with-Python/pluto/pluto_chapter_6.py'
!cp {pluto_file} {pluto_chapter_6}

# ✋ Get Kaggle username and api key (Optional for this chapter)

- Install the following libraries and import them into the Notebook.
- Then initialize the Kaggle username, key, and fetch methods.
- STOP: Update your Kaggle access username or key first.

In [None]:
# -------------------- : --------------------
# READ ME
# Chapter 2 begin:
# Install the following libraries and import them into the Notebook.
# Then initialize the Kaggle username, key, and fetch methods.
# STOP: Update your Kaggle access username or key first.
# -------------------- : --------------------

!pip install opendatasets --upgrade
import opendatasets
print("\nrequired version 0.1.22 or higher: ", opendatasets.__version__)

!pip install pyspellchecker 
import spellchecker
print("\nRequired version 0.7+", spellchecker.__version__)

# STOP: Update your Kaggle access username or key first.
pluto.remember_kaggle_access_keys("YOUR-USERNAME", "YOUR-KEY")
pluto._write_kaggle_credit()
import kaggle

@add_method(PacktDataAug)
def fetch_kaggle_comp_data(self,cname):
  #self._write_kaggle_credit()  # need to run only once.
  path = pathlib.Path(cname)
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)
  return

@add_method(PacktDataAug)
def fetch_kaggle_dataset(self,url,dest="kaggle"):
  #self._write_kaggle_credit()    # need to run only once.
  opendatasets.download(url,data_dir=dest)
  return
# -------------------- : --------------------


# Fetch data from chapter 5

In [None]:
f = 'Data-Augmentation-with-Python/pluto_data'
!ls -la {f}

In [None]:
f = 'Data-Augmentation-with-Python/pluto_data/netflix_data.csv'
pluto.df_netflix_data = pluto.fetch_df(f,sep='~')

In [None]:
f = 'Data-Augmentation-with-Python/pluto_data/twitter_data.csv'
pluto.df_twitter_data = pluto.fetch_df(f,sep='~')

In [None]:
pluto.print_batch_text(pluto.df_netflix_data, cols=['title','description'])

In [None]:
pluto.draw_text_wordcloud(pluto.df_netflix_data.description, 
  xignore_words=wordcloud.STOPWORDS, 
  title='Word Cloud: Netflix Movie Review')

In [None]:
pluto.draw_text_wordcloud(pluto.df_twitter_data.clean_tweet, 
  xignore_words=wordcloud.STOPWORDS, 
  title='Word Cloud: Twitter Tweets')

## Extended control text

In [None]:
# %%writefile -a {pluto_chapter_6}

pluto.version = 7.0
#
pluto.orig_text = '''It was the best of times. It was the worst of times. It was the age of wisdom. It was the age of foolishness. It was the epoch of belief. It was the epoch of incredulity.'''
pluto.orig_dickens_page = '''There were a king with a large jaw and a queen with a plain face, 
on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. 
In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, 
that things in general were settled for ever.
It was the year of Our Lord one thousand seven hundred and seventy-five. 
Spiritual revelations were conceded to England at that favoured period, 
as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, 
of whom a prophetic private in the Life Guards had heralded the sublime appearance 
by announcing that arrangements were made for the swallowing up of London and Westminster. 
Even the Cock-lane ghost had been laid only a round dozen of years, 
after rapping out its messages, 
as the spirits of this very year last past (supernaturally deficient in originality) rapped out theirs. 
Mere messages in the earthly order of events had lately come to the English Crown and People, 
from a congress of British subjects in America: which, strange to relate, 
have proved more important to the human race than any communications yet received through any of the chickens of the Cock-lane brood.'''

pluto.orig_melville_page2 = '''Call me Ishmael. 
Some years ago—never mind how long precisely—having little or no money in my purse, 
and nothing particular to interest me on shore, 
I thought I would sail about a little and see the watery part of the world. 
It is a way I have of driving off the spleen and regulating the circulation.'''

pluto.orig_melville_page = '''Call me Ishmael. 
Some years ago—never mind how long precisely—having little or no money in my purse, 
and nothing particular to interest me on shore, 
I thought I would sail about a little and see the watery part of the world. 
It is a way I have of driving off the spleen and regulating the circulation. 
Whenever I find myself growing grim about the mouth; whenever it is a damp, 
drizzly November in my soul; 
whenever I find myself involuntarily pausing before coffin warehouses, 
and bringing up the rear of every funeral I meet; 
and especially whenever my hypos get such an upper hand of me, 
that it requires a strong moral principle to prevent me from deliberately stepping into the street, 
and methodically knocking people’s hats off—then, 
I account it high time to get to sea as soon as I can. 
This is my substitute for pistol and ball. 
With a philosophical flourish Cato throws himself upon his sword; 
I quietly take to the ship. There is nothing surprising in this. 
If they but knew it, almost all men in their degree, some time or other, 
cherish very nearly the same feelings towards the ocean with me.'''

#
pluto.orig_carroll_page = '''Alice was beginning to get very tired of sitting by her sister on the bank, 
and of having nothing to do. 
Once or twice she had peeped into the book her sister was reading, 
but it had no pictures or conversations in it. 
“and what is the use of a book,” thought Alice “without pictures or conversations?”
So she was considering in her own mind. as well as she could, 
for the hot day made her feel very sleepy and stupid. 
whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, 
when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; 
nor did Alice think it so very much out of the way to hear the Rabbit say to itself. 
“Oh dear! Oh dear! I shall be late!” 
when she thought it over afterwards, 
it occurred to her that she ought to have wondered at this, 
but at the time it all seemed quite natural; 
but when the Rabbit actually took a watch out of its waistcoat-pocket, 
and looked at it, and then hurried on, Alice started to her feet, 
for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, 
or a watch to take out of it, 
and burning with curiosity, 
she ran across the field after it, 
and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
'''
#
pluto.orig_carroll_page2 = '''Alice was beginning to get very tired of sitting by her sister on the bank, 
and of having nothing to do. 
Once or twice she had peeped into the book her sister was reading, 
but it had no pictures or conversations in it.
“and what is the use of a book,” thought Alice “without pictures or conversations?”
So she was considering in her own mind. as well as she could, 
for the hot day made her feel very sleepy and stupid. 
whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, 
when suddenly a White Rabbit with pink eyes ran close by her.'''
#
pluto.orig_self = '''Text augmentation with Machine Learning (ML) is an advanced technique. 
Ironically, Text augmentation aims to improve ML model accuracy, 
but we used a pre-trained ML model to create additional training NLP data. 
It’s a circular process. ML coding is not in this book’s scope, 
but understanding the difference between text augmentation using libraries and ML is beneficial.  
Augmentation libraries, whether for image, text, or audio, 
follow the traditional programming methodologies with structure data, loops, 
and conditional statements in the algorithm. 
For example, from Chapter 5, the pseudo-code for implementing the method _print_aug_reserved() could be as follows:'''

In [None]:
# %%writefile -a {pluto_chapter_6}

@add_method(PacktDataAug)
def _clean_text(self,x):
  return (re.sub('[^A-Za-z0-9 .,!?#@]+', '', str(x)))

In [None]:
pluto.df_netflix_data['description'] = pluto.df_netflix_data['description'].apply(pluto._clean_text)
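
To see what `_clean_text` keeps and drops, here is a standalone sketch of the same regular expression. The function name `clean_text` and the sample string are made up for illustration; only the pattern comes from the cell above.

```python
import re

# Same character whitelist as _clean_text:
# letters, digits, space, and the punctuation . , ! ? # @
NOT_ALLOWED = re.compile(r'[^A-Za-z0-9 .,!?#@]+')

def clean_text(x):
    """Delete every run of characters that is not in the whitelist."""
    return NOT_ALLOWED.sub('', str(x))

# Hypothetical sample with an accented letter and HTML-like tags.
sample = "Hello, wörld! <b>#tag</b> @me"
cleaned = clean_text(sample)
print(cleaned)
```

Characters outside the whitelist are deleted, not replaced with a space, so apostrophes and hyphens disappear and the surrounding letters are joined together.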

# Word2Vec

In [None]:
# !ls -la ../model/

In [None]:
# %%writefile -a {pluto_chapter_6}

from nlpaug.util.file.download import DownloadUtil

#DownloadUtil.download_word2vec(dest_dir='.') # word2vec
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # GloVe
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # fasttext model


In [None]:
# %%writefile -a {pluto_chapter_6}

try:
  DownloadUtil.download_word2vec(dest_dir='.') # word2vec
except Exception:
  print('\nThis download frequently fails because too many people access the file.')
  print('If it fails, you cannot use the Word2Vec transformation/augmentation.')
  print('You could download the file manually: GoogleNews-vectors-negative300.bin.gz')
  print('From URL: https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM \n')

In [None]:
# %%writefile -a {pluto_chapter_6}

import nlpaug.augmenter.word  # import the submodule explicitly so WordEmbsAug is available
@add_method(PacktDataAug)
def print_aug_ai_word2vec(self, df, 
  col_dest="description",
  action='insert',
  model_type='word2vec',
  model_path='GoogleNews-vectors-negative300.bin',
  bsize=3, 
  aug_name='Augment',
  is_orig_control=True):
  aug_func = nlpaug.augmenter.word.WordEmbsAug(model_type=model_type,action=action,model_path=model_path)
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return
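
Conceptually, a word-embedding augmenter swaps a word for one whose vector lies close to it. The sketch below illustrates that idea with cosine similarity over a tiny, made-up vocabulary. The vectors and the helper names (`TOY_VECTORS`, `nearest_word`, `substitute_known_words`) are hypothetical and are not part of nlpaug, which may select candidates differently.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real models use hundreds of dimensions.
TOY_VECTORS = {
    "king":   (0.90, 0.80, 0.10),
    "queen":  (0.85, 0.90, 0.15),
    "apple":  (0.10, 0.20, 0.90),
    "banana": (0.15, 0.10, 0.95),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_word(word, vectors=TOY_VECTORS):
    """Return the other vocabulary word with the highest cosine similarity."""
    target = vectors[word]
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

def substitute_known_words(sentence, vectors=TOY_VECTORS):
    """Replace every in-vocabulary word with its nearest neighbor."""
    return " ".join(
        nearest_word(w, vectors) if w in vectors else w
        for w in sentence.split()
    )

print(substitute_known_words("the king ate an apple"))
```

The toy version replaces every known word, whereas the notebook's augmenter changes only a fraction of the words (see `aug_p` in the Sentence flow section).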

In [None]:
pluto.print_aug_ai_word2vec(pluto.df_netflix_data,
  col_dest='description',
  action='insert',
  aug_name='Word2Vec-GoogleNews Word Embedding Augment')

In [None]:
pluto.print_aug_ai_word2vec(pluto.df_netflix_data,
  col_dest='description',
  action='insert',
  aug_name='Word2Vec-GoogleNews Word Embedding Augment')

In [None]:
pluto.print_aug_ai_word2vec(pluto.df_twitter_data,col_dest='clean_tweet',action='insert',aug_name='Word2Vec-GoogleNews Word Embedding Augment')

## Substitute

In [None]:
pluto.print_aug_ai_word2vec(pluto.df_netflix_data,
  col_dest='description',
  action='substitute',
  aug_name='Word2Vec-GoogleNews Word Embedding Augment')

In [None]:
pluto.print_aug_ai_word2vec(pluto.df_twitter_data,
  col_dest='clean_tweet',
  action='substitute',
  aug_name='Word2Vec-GoogleNews Word Embedding Augment')

In [None]:
df_page = pandas.DataFrame([pluto.orig_carroll_page], columns=['page'])
pluto.print_aug_ai_word2vec(df_page,
  col_dest='page',
  action='substitute',
  aug_name='Word2Vec-GoogleNews Word Embedding Augment',
  bsize=1,
  is_orig_control=False)

In [None]:
df_page = pandas.DataFrame([pluto.orig_dickens_page], columns=['page'])
pluto.print_aug_ai_word2vec(df_page,
  col_dest='page',
  action='substitute',
  aug_name='Word2Vec-GoogleNews Word Embedding Augment',
  bsize=1,
  is_orig_control=False)

# BERT

In [None]:
!pip install transformers

In [None]:
# %%writefile -a {pluto_chapter_6}

import transformers
print(transformers.__version__)

In [None]:
!pip install "simpletransformers>=0.61.10"
import simpletransformers
#print(simpletransformers.__version__)

In [None]:
!pip install "nltk>=3.4.5"
!pip install "gensim>=4.1.2"

In [None]:
# %%writefile -a {pluto_chapter_6}

import nlpaug.augmenter.word  # import the submodule explicitly so ContextualWordEmbsAug is available
@add_method(PacktDataAug)
def print_aug_ai_bert(self, df, 
  col_dest="description",
  action='insert',
  model_path='bert-base-uncased',
  bsize=3, 
  aug_name='Augment',
  is_orig_control=True):
  aug_func = nlpaug.augmenter.word.ContextualWordEmbsAug(action=action,model_path=model_path)
  #self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name,is_orig_control=is_orig_control)
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return
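
The `insert` and `substitute` actions differ in where the masked-language model is asked to predict a word. As a rough sketch, assuming the augmenter places a mask token and lets the model fill it in, the two actions prepare the input as follows. The helper names are hypothetical and no model is called here.

```python
MASK = "[MASK]"  # BERT's mask token; RoBERTa uses "<mask>" instead

def mask_for_substitute(tokens, index):
    """Replace the token at `index` with the mask, so the model predicts a replacement."""
    masked = list(tokens)
    masked[index] = MASK
    return masked

def mask_for_insert(tokens, index):
    """Insert the mask before `index`, so the model predicts an extra word."""
    masked = list(tokens)
    masked.insert(index, MASK)
    return masked

tokens = ["the", "cat", "sat"]
print(mask_for_substitute(tokens, 1))
print(mask_for_insert(tokens, 1))
```

Because the model sees the whole sentence around the mask, the predicted word depends on context, which is the main difference from the Word2Vec lookup above.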

In [None]:
pluto.print_aug_ai_bert(pluto.df_netflix_data,col_dest='description',
  action='insert',aug_name='BERT Embedding Insert Augment')

In [None]:
pluto.print_aug_ai_bert(pluto.df_twitter_data,col_dest='clean_tweet',
  action='insert',aug_name='BERT Embedding Insert Augment')

## Substitute

In [None]:
pluto.print_aug_ai_bert(pluto.df_netflix_data,col_dest='description',
  action='substitute',aug_name='BERT Embedding Substitute Augment')

In [None]:
pluto.print_aug_ai_bert(pluto.df_twitter_data,col_dest='clean_tweet',
  action='substitute',aug_name='BERT Embedding Substitute Augment')

In [None]:
df_page = pandas.DataFrame([pluto.orig_dickens_page], columns=['page'])
pluto.print_aug_ai_bert(df_page,
  col_dest='page',
  action='substitute',
  aug_name='BERT Embedding Substitute Augment',
  bsize=1,
  is_orig_control=False)

In [None]:
df_page = pandas.DataFrame([pluto.orig_melville_page], columns=['page'])
pluto.print_aug_ai_bert(df_page,
  col_dest='page',
  action='substitute',
  aug_name='BERT Embedding Substitute Augment',
  bsize=1,
  is_orig_control=False)

# RoBERTa

In [None]:
pluto.print_aug_ai_bert(pluto.df_netflix_data,col_dest='description',
  model_path='roberta-base',
  action='insert',aug_name='Roberta Embedding Insert Augment')

In [None]:
pluto.print_aug_ai_bert(pluto.df_twitter_data,col_dest='clean_tweet',
  model_path='roberta-base',
  action='insert',aug_name='Roberta Embedding Insert Augment')

## Substitute

In [None]:
pluto.print_aug_ai_bert(pluto.df_netflix_data,col_dest='description',
  model_path='roberta-base',
  action='substitute',aug_name='Roberta Embedding Substitute Augment')

In [None]:
pluto.print_aug_ai_bert(pluto.df_twitter_data,col_dest='clean_tweet',
  model_path='roberta-base',
  action='substitute',aug_name='Roberta Embedding Substitute Augment')

In [None]:
df_page = pandas.DataFrame([pluto.orig_dickens_page], columns=['page'])
pluto.print_aug_ai_bert(df_page,
  col_dest='page',
  model_path='roberta-base',
  action='substitute',
  aug_name='Roberta Embedding Substitute Augment',
  bsize=1,
  is_orig_control=False)

In [None]:
df_page = pandas.DataFrame([pluto.orig_melville_page], columns=['page'])
pluto.print_aug_ai_bert(df_page,
  col_dest='page',
  model_path='roberta-base',
  action='substitute',
  aug_name='Roberta Embedding Substitute Augment',
  bsize=1,
  is_orig_control=False)

# Back Translation 

In [None]:
!pip install sacremoses

In [None]:
# %%writefile -a {pluto_chapter_6}

@add_method(PacktDataAug)
def print_aug_ai_back_translation(self, df, col_dest="description",
  from_model_name='facebook/wmt19-en-de', 
  to_model_name='facebook/wmt19-de-en',
  bsize=3, aug_name='Augment',
  is_orig_control=True):
  aug_func = nlpaug.augmenter.word.BackTranslationAug(from_model_name=from_model_name,
    to_model_name=to_model_name)
  self._print_aug_batch(df, aug_func,col_dest=col_dest,bsize=bsize, aug_name=aug_name)
  return
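
Back translation is a round trip: translate the text into a pivot language, then translate it back, and keep whatever wording changes survive. The sketch below shows only that control flow, using made-up word-for-word dictionaries in place of the neural translation models; every name and table in it is hypothetical.

```python
# Hypothetical word-for-word "translators"; the notebook uses neural models
# such as facebook/wmt19-en-de and facebook/wmt19-de-en instead.
EN_TO_DE = {"the": "der", "movie": "film", "is": "ist", "great": "toll"}
DE_TO_EN = {"der": "the", "film": "film", "ist": "is", "toll": "terrific"}

def translate(text, table):
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(table.get(word, word) for word in text.split())

def back_translate(text, forward=EN_TO_DE, backward=DE_TO_EN):
    """Round trip: source language -> pivot language -> source language."""
    pivot = translate(text, forward)
    return translate(pivot, backward)

original = "the movie is great"
augmented = back_translate(original)
print(original, "->", augmented)
```

The augmented text differs from the original wherever the return translation picks a different word, which is why changing the pivot language (German versus Russian below) produces different augmentations.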

In [None]:
pluto.print_aug_ai_back_translation(pluto.df_netflix_data,col_dest='description',
  from_model_name='facebook/wmt19-en-de', 
  to_model_name='facebook/wmt19-de-en',
  aug_name='FaceBook Back Translation: English <-> German Augment')

In [None]:
# https://huggingface.co/models

In [None]:
pluto.print_aug_ai_back_translation(pluto.df_twitter_data,col_dest='clean_tweet',
  from_model_name='facebook/wmt19-en-de', 
  to_model_name='facebook/wmt19-de-en',
  aug_name='FaceBook Back Translation: English <-> German Augment')

In [None]:
pluto.print_aug_ai_back_translation(pluto.df_netflix_data,col_dest='description',
  from_model_name='facebook/wmt19-en-de', 
  to_model_name='facebook/wmt19-de-en',
  aug_name='FaceBook Back Translation: English <-> German Augment')

## Russian

In [None]:
pluto.print_aug_ai_back_translation(pluto.df_netflix_data,col_dest='description',
  from_model_name='facebook/wmt19-en-ru', 
  to_model_name='facebook/wmt19-ru-en',
  aug_name='FaceBook Back Translation: English <-> Russian Augment')

In [None]:
pluto.print_aug_ai_back_translation(pluto.df_twitter_data,col_dest='clean_tweet',
  from_model_name='facebook/wmt19-en-ru', 
  to_model_name='facebook/wmt19-ru-en',
  aug_name='FaceBook Back Translation: English <-> Russian Augment')

# T5-Base

In [None]:
# %%writefile -a {pluto_chapter_6}

@add_method(PacktDataAug)
def _fetch_larger_font(self):
  heading_properties = [('font-size', '20px')]
  cell_properties = [('font-size', '18px'), ('text-align', 'left')]
  dfstyle = [dict(selector="th", props=heading_properties),
    dict(selector="td", props=cell_properties)]
  return dfstyle

In [None]:
# %%writefile -a {pluto_chapter_6}

@add_method(PacktDataAug)
def _print_aug_ai(self, orig, aug_func, 
  bsize=2, aug_name='Augmented',is_larger_font=True):
  # augment() returns a list, so keep the first element of each run.
  # Always produce at least one row, even when bsize < 1.
  data = [[aug_func.augment(orig)[0]] for _ in range(max(bsize, 1))]
  # DataFrame.append was removed in pandas 2.0; build the frame in one step.
  df_aug = pandas.DataFrame(data, columns=[aug_name])
  #
  with pandas.option_context("display.max_colwidth", None):
    if (is_larger_font):
      # df_aug.style.set_properties(**{'text-align': 'left'})
      # display(df_aug.style.set_properties(**{'text-align': 'left'}))
      display(df_aug.style.set_table_styles(self._fetch_larger_font()))
    else:
      display(df_aug)
  return df_aug
#
@add_method(PacktDataAug)
def print_aug_ai_t5(self, orig, 
  bsize=2, 
  aug_name='T5_summary',
  is_orig_control=True):
  aug_func = nlpaug.augmenter.sentence.AbstSummAug(model_path='t5-base')
  df = self._print_aug_ai(orig, aug_func,bsize=bsize, aug_name=aug_name)
  return df

In [None]:
pluto.df_t5_carroll = pluto.print_aug_ai_t5(pluto.orig_carroll_page, bsize=1)

In [None]:
pluto.df_t5_carroll = pluto.print_aug_ai_t5(pluto.orig_carroll_page2, bsize=1)

In [None]:
pluto.df_t5_melville = pluto.print_aug_ai_t5(pluto.orig_melville_page, bsize=1)

In [None]:
pluto.df_t5_dickens = pluto.print_aug_ai_t5(pluto.orig_dickens_page, bsize=1)

In [None]:
pluto.df_t5_self = pluto.print_aug_ai_t5(pluto.orig_self, bsize=1)

# Sentence flow

In [None]:
# %%writefile -a {pluto_chapter_6}

import nlpaug.augmenter.word
import nlpaug.augmenter.sentence
import nlpaug.flow
pluto.ai_aug_glove = nlpaug.augmenter.word.WordEmbsAug(
    model_type='glove', model_path='glove.6B.300d.txt',
    action="substitute")
#
pluto.ai_aug_glove.aug_p=0.5
#
pluto.ai_aug_bert = nlpaug.augmenter.word.ContextualWordEmbsAug(
    model_path='bert-base-uncased', 
    action='substitute', top_k=20)


In [None]:
# %%writefile -a {pluto_chapter_6}

import nlpaug
@add_method(PacktDataAug)
def print_aug_ai_sequential(self, df, 
  aug_name="T5_summary",
  bsize=4):
  aug_func = nlpaug.flow.Sequential([self.ai_aug_bert, self.ai_aug_glove])
  orig = df[aug_name][0]
  self._print_aug_ai(orig, aug_func,bsize=bsize, aug_name=aug_name)
  return
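
`nlpaug.flow.Sequential` chains augmenters so that the output of one becomes the input of the next (here, BERT first, then GloVe). The sketch below mimics that chaining with plain functions. `ToySequential` and the two string-replacement steps are hypothetical stand-ins, not the nlpaug classes.

```python
class ToySequential:
    """Apply each step in order, feeding the output of one into the next."""

    def __init__(self, steps):
        self.steps = list(steps)

    def augment(self, text):
        for step in self.steps:
            text = step(text)
        return text

# Hypothetical stand-ins for the BERT and GloVe augmenters.
def swap_big(text):
    return text.replace("big", "large")

def swap_large(text):
    return text.replace("large", "huge")

flow = ToySequential([swap_big, swap_large])
reversed_flow = ToySequential([swap_large, swap_big])

print(flow.augment("a big rabbit-hole"))
print(reversed_flow.augment("a big rabbit-hole"))
```

The two flows give different results for the same input, which shows that the order of the augmenters in the list matters.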

In [None]:
pluto.df_t5_carroll.T5_summary[0]

In [None]:
pluto.print_aug_ai_sequential(pluto.df_t5_carroll)

In [None]:
pluto.print_aug_ai_sequential(pluto.df_t5_dickens)

In [None]:
pluto.print_aug_ai_sequential(pluto.df_t5_melville)

In [None]:
pluto.print_aug_ai_sequential(pluto.df_t5_self)

In [None]:
# end of chapter 6
print('end of chapter 6')

# Push up all changes (Optional)

In [None]:
# import os
# f = 'Data-Augmentation-with-Python'
# os.chdir(f)
# !git add -A
# !git config --global user.email "duc.haba@gmail.com"
# !git config --global user.name "duchaba"
# !git commit -m "end of session"
# # do the git push in the xterm console
# #!git push

# Summary 

Every chapter begins with the same base class, "PacktDataAug".

✋ FAIR WARNING:

- The code uses long, complete function path names.

- Pluto wrote the code to be easy to understand, not for compactness, fast execution, or cleverness.

- Use Xterm to debug the cloud server.

In [None]:
# !pip install colab-xterm
# %load_ext colabxterm
# %xterm