<a href="https://colab.research.google.com/github/duchaba/Data-Augmentation-with-Python/blob/main/data_augmentation_with_python_chapter_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation with Python, Chapter 2

## 🌻 Welcome to Chapter 2, "Biases and Data Augmentation"


In this chapter, we’ll cover the following primary topics. 

- Computational Biases 

- Human Biases 

- Systemic Biases 

- Deep Dive to Image Augmentation Biases 

- Deep Dive to Text Augmentation Biases 

# Load Notebook

- The original link for this notebook is:
  - https://github.com/PacktPublishing/Data-Augmentation-with-Python/blob/main/data_augmentation_with_python_chapter_2.ipynb



# GitHub Clone

In [None]:
# git version should be 2.17.1 or higher
!git --version

In [None]:
# url = 'https://github.com/duchaba/Data-Augmentation-with-Python'  # for testing: remove after the book is finished.

url = 'https://github.com/PacktPublishing/Data-Augmentation-with-Python'
!git clone {url}

## (Optional) Fetch file from URL

- Uncomment the two code cells below if you want to fetch the file from a URL instead of using Git clone.

In [None]:
# import requests
# #
# def fetch_file(url, dst):
#   downloaded_obj = requests.get(url)
#   with open(dst, "wb") as file:
#     file.write(downloaded_obj.content)
#   return

In [None]:
# url = ''
# dst = 'pluto_chapter_1.py'
# fetch_file(url,dst)

# Run Pluto

In [None]:
#load and run the pluto chapter 1 Python code.
pluto_file = 'Data-Augmentation-with-Python/pluto/pluto_chapter_1.py'
%run {pluto_file}

# Verify Pluto

In [None]:
pluto.say_sys_info()

- (Optional) Copy the Pluto chapter 1 file to begin chapter 2

In [None]:
pluto_chapter_2 = 'Data-Augmentation-with-Python/pluto/pluto_chapter_2.py'
!cp {pluto_file} {pluto_chapter_2}

# Get Kaggle ID, key, and setup

✋ STOP

- First, sign up on kaggle.com. Get your username and API key (refer to the book, Chapter 2).

In [None]:
# %%CARRY-OVER install

# easy method to download kaggle data files
!pip install opendatasets --upgrade

In [None]:
# %%writefile -a {pluto_chapter_2}

pluto.version = 2.0
import opendatasets
#
@add_method(PacktDataAug)
def remember_kaggle_access_keys(self,username,key):
  self.kaggle_username = username
  self.kaggle_key = key
  return
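The `@add_method(PacktDataAug)` decorator used above comes from the chapter 1 Pluto file and is not shown in this notebook. The following is only a sketch of how such a decorator could attach a function to an existing class. The class `Demo` and this `add_method` body are illustrative assumptions, not the book's definition.

```python
# Hypothetical stand-in for the decorator defined in pluto_chapter_1.py.
def add_method(cls):
    def decorator(func):
        # Attach the function to the class under its own name.
        setattr(cls, func.__name__, func)
        return func
    return decorator

class Demo:  # hypothetical class, standing in for PacktDataAug
    pass

demo = Demo()

@add_method(Demo)
def remember_keys(self, username, key):
    self.username = username
    self.key = key

# Instances created before the decorator ran also see the new method.
demo.remember_keys("some_user", "some_key")
print(demo.username)
```

This pattern is what lets each notebook cell add one method at a time to a single long-lived object.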

In [None]:
print("\nrequired version 0.1.22 or higher: ", opendatasets.__version__)

In [None]:
# %%writefile -a {pluto_chapter_2}

@add_method(PacktDataAug)
def _write_kaggle_credit(self):
  creds = '{"username":"'+self.kaggle_username+'","key":"'+self.kaggle_key+'"}'
  kdirs = ["~/.kaggle/kaggle.json", "./kaggle.json"]
  #
  for k in kdirs:
    cred_path = pathlib.Path(k).expanduser()
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
  import kaggle
  #
  return
#
@add_method(PacktDataAug)
def fetch_kaggle_comp_data(self,cname):
  #self._write_kaggle_credit()  # need to run only once.
  path = pathlib.Path(cname)
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)
  return
#
#
@add_method(PacktDataAug)
def fetch_kaggle_dataset(self,url,dest="kaggle"):
  #self._write_kaggle_credit()    # need to run only once.
  opendatasets.download(url,data_dir=dest)
  return
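The credential file is plain JSON, so one alternative to building the string by hand is `json.dumps`, which escapes quotes and backslashes. A minimal sketch, assuming a hypothetical helper `write_kaggle_json` and a temporary directory in place of `~/.kaggle`:

```python
import json
import pathlib
import tempfile

def write_kaggle_json(directory, username, key):
    # Hypothetical helper: serialize the credentials and restrict permissions.
    cred_path = pathlib.Path(directory) / "kaggle.json"
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # owner read/write only
    return cred_path

# Write to a throwaway directory so real credentials are untouched.
tmp_dir = tempfile.mkdtemp()
path = write_kaggle_json(tmp_dir, "your_kaggle_username", "your_kaggle_api_key")
loaded = json.loads(path.read_text())
print(loaded["username"])
```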

✋ STOP

- Use your Kaggle username (your_kaggle_username) and API key (your_kaggle_api_key).

In [None]:
# %%CARRY-OVER code

pluto.remember_kaggle_access_keys("YOUR_KAGGLE_USERNAME", "YOUR_KAGGLE_API_KEY")

pluto._write_kaggle_credit()
import kaggle

# Fetch State Farm real-world dataset

In [None]:
# %%writefile -a {pluto_chapter_2}

import zipfile
import os

In [None]:
kaggle_competition_name = "state-farm-distracted-driver-detection"
pluto.fetch_kaggle_comp_data(kaggle_competition_name)

## Quick view

In [None]:
# quick view one image
f = 'state-farm-distracted-driver-detection/imgs/train/c0/img_100026.jpg'
img = PIL.Image.open(f)
display(img)

# Import to Pandas

In [None]:
# %%writefile -a {pluto_chapter_2}

@add_method(PacktDataAug)
def fetch_df(self, csv,sep=','):
  df = pandas.read_csv(csv, encoding='latin-1', sep=sep)
  return df
#
@add_method(PacktDataAug)
def _fetch_larger_font(self):
  heading_properties = [('font-size', '20px')]
  cell_properties = [('font-size', '18px')]
  dfstyle = [dict(selector="th", props=heading_properties),
    dict(selector="td", props=cell_properties)]
  return dfstyle

In [None]:
f = 'state-farm-distracted-driver-detection/driver_imgs_list.csv'
pluto.df_sf_data = pluto.fetch_df(f)

In [None]:
# pluto.df_sf_data.tail(3)
# larger fonts
pluto.df_sf_data.tail(3).style.set_table_styles(pluto._fetch_larger_font())

In [None]:
pluto.df_sf_data.describe()

In [None]:
# %%writefile -a {pluto_chapter_2}

@add_method(PacktDataAug)
def build_sf_fname(self, df):
  root = 'state-farm-distracted-driver-detection/imgs/train/'
  df["fname"] = root + df.classname+'/'+df.img
  return
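To see what `build_sf_fname` produces without downloading the dataset, here is a sketch in plain Python. The column names `classname` and `img` come from the method above; the two CSV rows are made-up sample values.

```python
import csv
import io

ROOT = 'state-farm-distracted-driver-detection/imgs/train/'

# Made-up rows in the shape the method expects (classname and img columns).
sample_csv = "classname,img\nc0,img_1.jpg\nc1,img_2.jpg\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # Same rule as the method: root + class folder + '/' + image file name.
    row["fname"] = ROOT + row["classname"] + "/" + row["img"]

for row in rows:
    print(row["fname"])
```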

In [None]:
pluto.build_sf_fname(pluto.df_sf_data)

In [None]:
# pluto.df_sf_data.head(3)
# use larger font
pluto.df_sf_data.head(3).style.set_table_styles(pluto._fetch_larger_font())

- Verify the fname is correct

In [None]:
img = PIL.Image.open(pluto.df_sf_data.fname[0])
display(img)

# Draw the Images

In [None]:
# %%writefile -a {pluto_chapter_2}

# set internal counter for image to be zero, e.g. pluto0.jpg, pluto1.jpg, etc.
pluto.fname_id = 0
#
@add_method(PacktDataAug)
def _drop_image(self,canvas, fname=None,format=".jpg",dname="Data-Augmentation-with-Python/pluto_img"):
  if (fname is None):
    self.fname_id += 1
    if not os.path.exists(dname):
      os.makedirs(dname)
    fn = f'{dname}/pluto{self.fname_id}{format}'
  else:
    fn = fname
  # savefig has no cmap option; recent Matplotlib versions reject unknown keywords.
  canvas.savefig(fn, bbox_inches="tight", pad_inches=0.25)
  return
#
@add_method(PacktDataAug)
def draw_batch(self,df_filenames, disp_max=10,is_shuffle=False, figsize=(16,8)):
  disp_col = 5
  disp_row = int(numpy.round((disp_max/disp_col)+0.4, 0))
  _fns = list(df_filenames)
  if (is_shuffle):
    numpy.random.shuffle(_fns)
  k = 0
  clean_fns = []
  if (len(_fns) >= disp_max):
    canvas, pic = matplotlib.pyplot.subplots(disp_row,disp_col, figsize=figsize)
    for i in range(disp_row):
      for j in range(disp_col):
        try:
          im = PIL.Image.open(_fns[k])
          pic[i][j].imshow(im)
          pic[i][j].set_title(pathlib.Path(_fns[k]).name)
          clean_fns.append(_fns[k])
        except Exception:
          pic[i][j].set_title(pathlib.Path(_fns[k]).name)
        k += 1
    canvas.tight_layout()
    self._drop_image(canvas)
    canvas.show()
  else:
    print("**Warning: the length should be at least ", disp_max, ". The given length: ", len(_fns))
  return clean_fns
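The row count in `draw_batch` rounds `disp_max/disp_col + 0.4` to get enough rows for a five-column grid. A sketch of the same idea with `math.ceil`, which states the intent directly; `grid_rows` is a hypothetical helper name.

```python
import math

def grid_rows(disp_max, disp_col=5):
    # Smallest number of rows that holds disp_max images in disp_col columns.
    return math.ceil(disp_max / disp_col)

for n in (10, 11, 20):
    print(n, grid_rows(n))
```

When `disp_max` is not a multiple of the column count, the grid has more cells than requested images, so the last row is only partly needed.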

## State Farm

In [None]:
x = pluto.draw_batch(pluto.df_sf_data["fname"], is_shuffle=True)

In [None]:
x = pluto.draw_batch(pluto.df_sf_data["fname"], is_shuffle=True,disp_max=20,figsize=(18,14))

## Nike Shoe

- For Nike, Adidas and Converse Shoes Images

In [None]:
url = 'https://www.kaggle.com/datasets/die9origephit/nike-adidas-and-converse-imaged'
pluto.fetch_kaggle_dataset(url)

In [None]:
# %%writefile -a {pluto_chapter_2}

@add_method(PacktDataAug)
def build_shoe_fname(self, start_path):
  # DataFrame.append was removed in pandas 2.0, so collect the rows first.
  rows = []
  for root, dirs, files in os.walk(start_path, topdown=False):
    for name in files:
      f = os.path.join(root, name)
      # the parent folder name is the label
      p = pathlib.Path(f).parent.name
      rows.append({'fname': f, 'label': p})
  #
  # build the DataFrame in one step
  df = pandas.DataFrame(rows, columns=['fname', 'label'])
  return df
#
# create the same with a generic function name
@add_method(PacktDataAug)
def make_dir_dataframe(self, start_path):
  return self.build_shoe_fname(start_path)
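The folder-name-as-label convention can be tried on a throwaway directory tree. This sketch uses only the standard library and returns plain tuples instead of a DataFrame; the helper `list_labeled_files` and the folder and file names are invented for the example.

```python
import os
import pathlib
import tempfile

def list_labeled_files(start_path):
    # Pair every file with the name of the folder that contains it.
    pairs = []
    for root, dirs, files in os.walk(start_path):
        for name in files:
            full = os.path.join(root, name)
            pairs.append((full, pathlib.Path(full).parent.name))
    return sorted(pairs)

# Build a tiny tree: <tmp>/nike/a.jpg and <tmp>/adidas/b.jpg
base = pathlib.Path(tempfile.mkdtemp())
for label, fname in [("nike", "a.jpg"), ("adidas", "b.jpg")]:
    (base / label).mkdir()
    (base / label / fname).write_bytes(b"")

labeled = list_labeled_files(base)
print([label for _, label in labeled])
```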

In [None]:
f = 'kaggle/nike-adidas-and-converse-imaged/train'
pluto.df_shoe_data = pluto.build_shoe_fname(f)

In [None]:
# pluto.df_shoe_data.head(3)
# use larger font
pluto.df_shoe_data.head(3).style.set_table_styles(pluto._fetch_larger_font())

In [None]:
pluto.df_shoe_data.tail(3).style.set_table_styles(pluto._fetch_larger_font())

In [None]:
x = pluto.draw_batch(pluto.df_shoe_data["fname"], is_shuffle=True)

In [None]:
x = pluto.draw_batch(pluto.df_shoe_data["fname"], is_shuffle=True,disp_max=20,figsize=(18,14))

## Grapevine Images

In [None]:
%%time
url = "https://www.kaggle.com/datasets/muratkokludataset/grapevine-leaves-image-dataset"
pluto.fetch_kaggle_dataset(url)

In [None]:
!ls -la kaggle/grapevine-leaves-image-dataset/Grapevine_Leaves_Image_Dataset

In [None]:
f = 'kaggle/grapevine-leaves-image-dataset/Grapevine_Leaves_Image_Dataset/Ak'
!ls -la {f} | head

- Remove all spaces from the file names

In [None]:
# run this until no error/output
f2 = 'kaggle/grapevine-leaves-image-dataset/Grapevine_Leaves_Image_Dataset'
!find {f2} -name "* *" -type f | rename 's/ /_/g'
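The `rename` utility is not installed on every system. As a fallback, the same cleanup can be sketched in Python; `replace_spaces` is a hypothetical helper, and it renames files only, like the `-type f` filter above.

```python
import pathlib
import tempfile

def replace_spaces(start_path):
    # Rename every file whose name contains a space, using underscores instead.
    renamed = []
    for path in sorted(pathlib.Path(start_path).rglob("* *")):
        if path.is_file():
            target = path.with_name(path.name.replace(" ", "_"))
            path.rename(target)
            renamed.append(target.name)
    return renamed

# Demonstrate on a temporary folder with one offending file name.
base = pathlib.Path(tempfile.mkdtemp())
(base / "Ak (1).png").write_bytes(b"")
print(replace_spaces(base))
```

Because only the final path component changes, one pass is enough for files; folders with spaces in their names would need a separate pass.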

In [None]:
!ls -la {f} | head

In [None]:
f = 'kaggle/grapevine-leaves-image-dataset/Grapevine_Leaves_Image_Dataset/Grapevine_Leaves_Image_Dataset_Citation_Request.txt'
!cat {f}

In [None]:
!mv {f} .

- The grapevine image structure is the same as the shoe image structure.
  - The folder name is the label.
  - The images are in their respective folders.
  - There is no CSV file.

In [None]:
f = 'kaggle/grapevine-leaves-image-dataset/Grapevine_Leaves_Image_Dataset'
pluto.df_grapevine_data = pluto.build_shoe_fname(f)
pluto.df_grapevine_data.head(3)

In [None]:
x = pluto.draw_batch(pluto.df_grapevine_data["fname"], is_shuffle=True)

In [None]:
x = pluto.draw_batch(pluto.df_grapevine_data["fname"], is_shuffle=True,disp_max=20,figsize=(18,14))

## Monkeypox (optional for Notebook only)

- Uncomment before running.

In [None]:
# # quick view one image
# f = 'kaggle/monkeypox-skin-lesion-dataset/Original Images/Original Images/Monkey Pox/M01_03.jpg'
# img = PIL.Image.open(f)
# display(img)

In [None]:
# f = 'kaggle/monkeypox-skin-lesion-dataset/Augmented Images/Augmented Images/Monkeypox_augmented/M01_01_02.jpg'
# img = PIL.Image.open(f)
# display(img)

In [None]:
# f = 'kaggle/monkeypox-skin-lesion-dataset/Monkeypox_Dataset_metadata.csv'
# pluto.df_monkey_data = pluto.fetch_df(f)
# pluto.df_monkey_data.tail(3)

- Run this until there is no error.
- It takes three runs for monkeypox.

In [None]:
# !find . -name "* *" -type d | rename 's/ /_/g'

In [None]:
# # %%writefile -a {pluto_chapter_2}

# @add_method(PacktDataAug)
# def build_monkey_fname(self, df):
#   url_monkey = 'kaggle/monkeypox-skin-lesion-dataset/Original_Images/Original_Images/Monkey_Pox/'
#   url_other = 'kaggle/monkeypox-skin-lesion-dataset/Original_Images/Original_Images/Others/'
#   df["fname"] = url_monkey + df.ImageID + ".jpg"
#   # quick replace other
#   df.loc[df['Label'] == 'Non Monkeypox', 'fname'] = url_other + df.ImageID + ".jpg"
#   return


In [None]:
# pluto.build_monkey_fname(pluto.df_monkey_data)
# pluto.df_monkey_data.head(3)

In [None]:
# pluto.df_monkey_data.tail(3)

- Draw it

In [None]:
# x = pluto.draw_batch(pluto.df_monkey_data["fname"], is_shuffle=True)

In [None]:
# x = pluto.draw_batch(pluto.df_monkey_data["fname"], is_shuffle=True,disp_max=20,figsize=(18,14))

# NLP (Text) data

## Netflix

In [None]:
%%time
url = 'https://www.kaggle.com/datasets/infamouscoder/dataset-netflix-shows'
pluto.fetch_kaggle_dataset(url)

In [None]:
f = 'kaggle/dataset-netflix-shows/netflix_titles.csv'
pluto.df_netflix_data = pluto.fetch_df(f)
#pluto.df_netflix_data.head(3)

In [None]:
# %%writefile -a {pluto_chapter_2}

@add_method(PacktDataAug)
def print_batch_text(self,df_orig, disp_max=6, cols=["title", "description"],is_larger_font=True): 
  df = df_orig[cols] 
  with pandas.option_context("display.max_colwidth", None):
    if (is_larger_font):
      display(df.sample(disp_max).style.set_table_styles(self._fetch_larger_font()))
    else:
      display(df.sample(disp_max))
  return


- Show the table in three parts for the book

In [None]:
pluto.print_batch_text(pluto.df_netflix_data.head(3),
  disp_max=3,
  cols=['show_id', 'type','title','director','cast'])

In [None]:
pluto.print_batch_text(pluto.df_netflix_data.head(3),
  disp_max=3,
  cols=['country', 'date_added','release_year','rating'])

In [None]:
#pluto.df_netflix_data.head(3)[['duration', 'listed_in','description']].style.set_table_styles(pluto._fetch_larger_font())
pluto.print_batch_text(pluto.df_netflix_data.head(3),
  disp_max=3,
  cols=['duration', 'listed_in','description'])

In [None]:
#print(pluto.df_netflix_data.description[0])

In [None]:
pluto.print_batch_text(pluto.df_netflix_data)

In [None]:
# %%writefile -a {pluto_chapter_2}

@add_method(PacktDataAug)
def count_word(self, df, col_dest="description"):
  df['wordc'] = df[col_dest].apply(lambda x: len(x.split()))
  return
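`count_word` splits on whitespace, so punctuation stays attached to its word, and a missing value would raise an error because a float has no `split` method. A small sketch of the counting rule with a guard for non-string values; `word_count` is a hypothetical helper.

```python
def word_count(text):
    # Count whitespace-separated tokens; treat non-strings (e.g. NaN) as empty.
    if not isinstance(text, str):
        return 0
    return len(text.split())

print(word_count("Two friends open a small bakery."))
print(word_count(float("nan")))
```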

In [None]:
pluto.count_word(pluto.df_netflix_data)
# pluto.df_netflix_data.head()
pluto.print_batch_text(pluto.df_netflix_data,cols=['description','wordc'])

In [None]:
# %%writefile -a {pluto_chapter_2}

@add_method(PacktDataAug)
def draw_word_count(self,df, wc='wordc',is_stack_verticle=True):
  if (is_stack_verticle):
    canvas, pic = matplotlib.pyplot.subplots(2,1, figsize=(8,10))
  else:
    canvas, pic = matplotlib.pyplot.subplots(1,2, figsize=(16,5))
  df.boxplot(ax=pic[0],column=[wc],vert=False,color="black")
  df[wc].hist(ax=pic[1], color="cornflowerblue", alpha=0.9)
  #
  title=["Description BoxPlot", "Description Histogram"]
  yaxis=["Description", "Stack"]
  x1 = f'Word Count: Mean: {df[wc].mean():0.2f}, Min: {df[wc].min()}, Max: {df[wc].max()}'
  xaxis=[x1, "Word Count"]
  #
  pic[0].set_title(title[0], fontweight ="bold")
  pic[1].set_title(title[1], fontweight ="bold")
  pic[0].set_ylabel(yaxis[0])
  pic[1].set_ylabel(yaxis[1])
  pic[0].set_xlabel(xaxis[0])
  pic[1].set_xlabel(xaxis[1])
  #
  canvas.tight_layout()
  self._drop_image(canvas)
  # 
  canvas.show()
  return

In [None]:
pluto.draw_word_count(pluto.df_netflix_data, is_stack_verticle=False)

In [None]:
pluto.draw_word_count(pluto.df_netflix_data)

## Spell checker

In [None]:
# %%CARRY-OVER code install

!pip install pyspellchecker 

In [None]:
# %%writefile -a {pluto_chapter_2}

import re
import spellchecker
@add_method(PacktDataAug)
def _strip_punc(self,s):
  p = re.sub(r'[^\w\s]','',s)
  return(p)
#
@add_method(PacktDataAug)
def check_spelling(self,df, col_dest='description'):
  spell = spellchecker.SpellChecker()
  df["misspelled"] = df[col_dest].apply(lambda x: spell.unknown(self._strip_punc(x).split()))
  df["misspelled_count"] = df["misspelled"].apply(lambda x: len(x))
  return
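Because `_strip_punc` removes apostrophes as well, contractions such as "doesn't" reach the spell checker as "doesnt" and may be reported as misspelled. The sketch below shows the effect with a tiny hand-written vocabulary in place of the pyspellchecker dictionary; `unknown_words` and the vocabulary are assumptions made for the example.

```python
import re

def strip_punc(text):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', text)

def unknown_words(text, vocabulary):
    # Lowercase, strip punctuation, and keep words missing from the vocabulary.
    words = strip_punc(text).lower().split()
    return {w for w in words if w not in vocabulary}

vocabulary = {"the", "dog", "bark"}  # toy word list, not a real dictionary
print(sorted(unknown_words("The dgo doesn't bark!", vocabulary)))
```

The misspelled counts plotted later therefore include some words that are correct in the original text.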

In [None]:
pluto._pp("Required version 0.7+", spellchecker.__version__)

In [None]:
pluto.check_spelling(pluto.df_netflix_data)

In [None]:
pluto.print_batch_text(pluto.df_netflix_data,cols=['description', 'misspelled'])

In [None]:
pluto.draw_word_count(pluto.df_netflix_data,wc='misspelled_count')

## Amazon review

In [None]:
%%time
url = 'https://www.kaggle.com/datasets/tarkkaanko/amazon'
pluto.fetch_kaggle_dataset(url)

In [None]:
f = 'kaggle/amazon/amazon_reviews.csv'
pluto.df_amazon_data = pluto.fetch_df(f)
pluto.df_amazon_data.head(3)

In [None]:
# there is a "nan" in the amazon data, so drop/delete it.
pluto.df_amazon_data = pluto.df_amazon_data.dropna()

In [None]:
pluto.check_spelling(pluto.df_amazon_data,col_dest='reviewText')

In [None]:
pluto.print_batch_text(pluto.df_amazon_data, cols=['reviewText','misspelled'])

In [None]:
pluto.count_word(pluto.df_amazon_data,col_dest='reviewText')

In [None]:
pluto.draw_word_count(pluto.df_amazon_data)

In [None]:
pluto.draw_word_count(pluto.df_amazon_data, wc='misspelled_count')

In [None]:
# end of chapter 2
print('End of chapter 2')

# Push up all changes (optional)

- username: [use your name or email]

- password: [use the token]

In [None]:
# import os
# f = 'Data-Augmentation-with-Python'
# os.chdir(f)
# !git add -A
# !git config --global user.email "duc.haba@gmail.com"
# !git config --global user.name "duchaba"
# !git commit -m "end of session"
# #do the git push in the xterm console
# #!git push

# Summary

Every chapter begins with the same base class, "PacktDataAug".

✋ FAIR WARNING:

- The code uses long, fully qualified function names.

- I wrote the code to be easy to understand, not for compactness, fast execution, or cleverness.



In [None]:
# !pip install colab-xterm
# %load_ext colabxterm
# %xterm