<a href="https://colab.research.google.com/github/duchaba/Data-Augmentation-with-Python/blob/main/data_augmentation_with_python_chapter_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation with Python, Chapter 9

## 🌻 Welcome to Chapter 9, "Tabular Augmentation"


This chapter departs slightly from the image, text, and audio augmentation format: we spend more time in Python code studying real-world tabular datasets. In particular, we cover the following topics:

- Tabular augmentation libraries 

- Real-world tabular datasets 

- Explore and visualize tabular data 

- Transforming augmentation 

- Interacting augmentation 

- Extracting augmentation 

- ✋ STOP: You must set up your Kaggle username and app key in the step below.

# Load Notebook


- The original link for this Notebook is:
  - https://github.com/PacktPublishing/Data-Augmentation-with-Python/blob/main/data_augmentation_with_python_chapter_9.ipynb

# GitHub Clone

In [None]:
# git version should be 2.17.1 or higher
!git --version

In [None]:
url = 'https://github.com/PacktPublishing/Data-Augmentation-with-Python'
!git clone {url}

## Fetch file from URL (Optional)

- Uncomment the two code cells below if you want to fetch the file from a URL instead of using Git Clone.

In [None]:
# import requests
# #
# def fetch_file(url, dst):
#   downloaded_obj = requests.get(url)
#   with open(dst, "wb") as file:
#     file.write(downloaded_obj.content)
#   return

In [None]:
# url = ''
# dst = 'pluto_chapter_1.py'
# fetch_file(url,dst)

# Run Pluto

- Instantiate Pluto, a.k.a. "Pluto, wake up!"

In [None]:
# %% CARRY-OVER code install

!pip install opendatasets --upgrade
!pip install pyspellchecker 

In [None]:
# Load and run the Pluto chapter 2 Python code.
pluto_file = 'Data-Augmentation-with-Python/pluto/pluto_chapter_2.py'
%run {pluto_file}

## Verify Pluto

In [None]:
pluto.say_sys_info()

## (Optional) Export to .py

In [None]:
pluto_chapter_9 = 'Data-Augmentation-with-Python/pluto/pluto_chapter_9.py'
!cp {pluto_file} {pluto_chapter_9}

# ✋ Set up Kaggle username and app Key

- Install the following libraries and import them into the Notebook.
- Then initialize the Kaggle username, key, and fetch methods.

- STOP: Update your Kaggle access username and key first.

In [None]:
# %%CARRY-OVER code 

# -------------------- : --------------------
# READ ME
# Chapter 2 begin:
# Install the following libraries and import them into the Notebook.
# Then initialize the Kaggle username, key, and fetch methods.
# STOP: Update your Kaggle access username or key first.
# -------------------- : --------------------

!pip install opendatasets --upgrade
import opendatasets
print("\nrequired version 0.1.22 or higher: ", opendatasets.__version__)

!pip install pyspellchecker 
import spellchecker
print("\nRequired version 0.7+", spellchecker.__version__)

# STOP: Update your Kaggle access username or key first.
pluto.remember_kaggle_access_keys("YOUR_KAGGLE_USERNAME", "YOUR_KAGGLE_KEY")
pluto._write_kaggle_credit()
import kaggle

@add_method(PacktDataAug)
def fetch_kaggle_comp_data(self,cname):
  #self._write_kaggle_credit()  # need to run only once.
  path = pathlib.Path(cname)
  kaggle.api.competition_download_cli(str(path))
  zipfile.ZipFile(f'{path}.zip').extractall(path)
  return

@add_method(PacktDataAug)
def fetch_kaggle_dataset(self,url,dest="kaggle"):
  #self._write_kaggle_credit()    # need to run only once.
  opendatasets.download(url,data_dir=dest)
  return
# -------------------- : --------------------
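
Hard-coding the username and key in a cell makes it easy to leak them when the Notebook is shared. As a minimal sketch, assuming the credentials are stored in the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, the hypothetical helper `read_kaggle_credentials` below (not part of Pluto) reads them and fails early if either is missing:

```python
import os

def read_kaggle_credentials(env=None):
    """Return (username, key) from KAGGLE_USERNAME / KAGGLE_KEY, or raise."""
    # Hypothetical helper for illustration; it is not part of the Pluto class.
    env = os.environ if env is None else env
    username = env.get("KAGGLE_USERNAME", "").strip()
    key = env.get("KAGGLE_KEY", "").strip()
    if not username or not key:
        raise ValueError("Set KAGGLE_USERNAME and KAGGLE_KEY before fetching data.")
    return username, key

# Demo with a stand-in mapping so the sketch runs without real credentials.
demo_env = {"KAGGLE_USERNAME": "demo_user", "KAGGLE_KEY": "demo_key"}
username, key = read_kaggle_credentials(demo_env)
# In the Notebook: pluto.remember_kaggle_access_keys(username, key)
```

Calling `read_kaggle_credentials()` with no argument reads the real environment instead of the demo mapping.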


# Fetch Kaggle bank fraud data

In [None]:
%%time
url = 'https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022'
pluto.fetch_kaggle_dataset(url)

In [None]:
f = '/content/kaggle/bank-account-fraud-dataset-neurips-2022/Base.csv'
pluto.df_bank_data = pluto.fetch_df(f)
pluto.df_bank_data.head(3)

# Data structure

In [None]:
pluto.df_bank_data.info()

In [None]:
pluto.df_bank_data[['fraud_bool', 
  'proposed_credit_limit',
  'customer_age', 
  'payment_type']].sample(5)

In [None]:
# Transpose for easier reading
df = pluto.df_bank_data.describe()
df = df.transpose()
df

In [None]:
df[['count','mean','std','min','max']]

In [None]:
pluto.df_bank_data.nunique()

# First graph view

In [None]:
# %%writefile -a {pluto_chapter_9}

import matplotlib
@add_method(PacktDataAug)
def draw_tabular_histogram(self, df, title='Histogram',maxcolors=32):
  canvas, pic = matplotlib.pyplot.subplots(1, 1, figsize=(12, 6))
  comap = matplotlib.cm.get_cmap('rainbow', 256)
  newcolors = comap(numpy.linspace(0, 1, maxcolors))
  #newcolors = matplotlib.cm.cool(range(256))
  df.plot.hist(ax=pic,color=newcolors)
  #
  pic.set_title(title,fontsize=20.0)
  pic.legend(ncol=2, loc="upper right")
  canvas.tight_layout()
  self._drop_image(canvas)
  canvas.show()
  return

In [None]:
pluto.draw_tabular_histogram(pluto.df_bank_data,
  title='Bank Fraud data, 32 million points')

# Categorical type 

- Categorical columns hold discrete labels rather than continuous int or float numbers.

In [None]:
pluto.df_bank_data.payment_type.unique()

In [None]:
pluto.df_bank_data.employment_status.unique()

In [None]:
pluto.df_bank_data.housing_status.unique()

In [None]:
pluto.df_bank_data.source.unique()

In [None]:
pluto.df_bank_data.device_os.unique()
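
Rather than calling `unique()` one column at a time, the categorical columns can be found by data type. The sketch below uses a toy DataFrame with made-up values (only the column names echo the bank data) and assumes that every non-numeric column is categorical, which may not hold for every dataset:

```python
import pandas

# Toy frame with hypothetical values; only the column names echo the bank data.
toy = pandas.DataFrame({
    "customer_age": [30, 40, 20],
    "payment_type": ["AA", "AB", "AA"],
    "device_os": ["linux", "other", "linux"],
})

# Assumption: every column that is not numeric is treated as categorical.
categorical_columns = list(toy.select_dtypes(exclude="number").columns)

# Collect the distinct labels of each categorical column.
unique_values = {name: list(toy[name].unique()) for name in categorical_columns}
print(unique_values)
```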

## Checksum, Tokenize

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def _fetch_token_index(self, val, xarr):
  for i, x in enumerate(xarr):
    if (val == x):
      return i
#
@add_method(PacktDataAug)
def add_token_index(self,df, df_colname):
  for cname in df_colname:
    tname = cname + "_tokenize"
    arrname = numpy.array(df[cname].unique())
    df[tname] = df[cname].apply(self._fetch_token_index, args=(arrname,))
  return
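
The token index above is the position of each label in the column's list of unique values. The sketch below uses `pandas.factorize` on a toy column with made-up values, which should give the same numbering without a Python-level loop per row. It is an alternative for comparison, not the method Pluto uses:

```python
import pandas

# Toy column with hypothetical labels.
toy = pandas.DataFrame({"payment_type": ["AA", "AB", "AA", "AC", "AB"]})

# factorize() numbers each label by its order of first appearance.
codes, labels = pandas.factorize(toy["payment_type"])
toy["payment_type_tokenize"] = codes
print(toy)
```

One difference to keep in mind: `factorize` assigns `-1` to missing values, whereas the loop in `_fetch_token_index` finds no match for them.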

In [None]:
pluto.df_bank_tokenize_data = pluto.df_bank_data.copy()
pluto.add_token_index(pluto.df_bank_tokenize_data, 
  ['payment_type', 'employment_status', 'housing_status', 'source', 'device_os'])

In [None]:
pluto.df_bank_tokenize_data[['payment_type', 'payment_type_tokenize']].head(10)

In [None]:
pluto.df_bank_tokenize_data[['device_os', 'device_os_tokenize']].head(10)

In [None]:
pluto.df_bank_tokenize_data = pluto.df_bank_tokenize_data.drop(
  ['payment_type', 'employment_status', 'housing_status', 'source', 'device_os'], axis=1)
pluto.df_bank_tokenize_data.info()

### Checksum

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def _fetch_checksum(self, df):
  df['checksum'] = df.apply(
  lambda x: numpy.mean(tuple(x)), axis=1)
  return
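
The checksum here is simply the mean of every value in a row. As a sketch on a toy DataFrame with made-up numbers, the vectorized `DataFrame.mean(axis=1)` gives the same result as the `apply` version and is usually much faster on large tables:

```python
import numpy
import pandas

# Toy frame with hypothetical values.
toy = pandas.DataFrame({"a": [1.0, 2.0, 3.0], "b": [3.0, 6.0, 9.0]})

# Row-wise mean via apply, mirroring _fetch_checksum above.
slow = toy.apply(lambda row: numpy.mean(tuple(row)), axis=1)

# Vectorized row-wise mean.
fast = toy.mean(axis=1)
print(fast.tolist())
```

The two differ when a row has missing values: `numpy.mean` returns NaN, while `DataFrame.mean` skips NaN unless `skipna=False` is passed.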

In [None]:
%%time
pluto._fetch_checksum(pluto.df_bank_tokenize_data)

## Subset of data

In [None]:
pluto.df_bank_half_one_data = pluto.df_bank_tokenize_data.head(5000)
pluto.df_bank_half_data = pluto.df_bank_tokenize_data

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def _drop_bank_columns(self, df):
  df_out = df.drop(
    ['name_email_similarity',
    'prev_address_months_count',
    'current_address_months_count',
    'days_since_request',
    'intended_balcon_amount',
    'zip_count_4w',
    'velocity_6h',
    'velocity_24h',
    'velocity_4w',
    'bank_branch_count_8w',
    'date_of_birth_distinct_emails_4w',
    'phone_home_valid',
    'bank_months_count',
    'has_other_cards',
    'foreign_request',
    'session_length_in_minutes',
    'keep_alive_session',
    'device_distinct_emails_8w',
    'housing_status_tokenize',
    'source_tokenize',
    'month',
    'device_fraud_count',
    'device_os_tokenize'],
    axis=1)
  return df_out

In [None]:
pluto.df_bank_half_data = pluto._drop_bank_columns(pluto.df_bank_half_data)

In [None]:
list(pluto.df_bank_half_data.columns)

# Specialized graphs

In [None]:
# %%writefile -a {pluto_chapter_9}

import seaborn
@add_method(PacktDataAug)
def draw_tabular_correlogram(self, df,title='', figsize=(12,10)):
  canvas = matplotlib.pyplot.figure(figsize=figsize)
  seaborn.heatmap(df.corr(), 
    xticklabels=df.corr().columns, 
    yticklabels=df.corr().columns, 
    cmap='viridis_r', 
    center=0, 
    annot=True)
  #
  matplotlib.pyplot.title(title, fontsize=20.0)
  canvas.tight_layout()
  self._drop_image(canvas)
  canvas.show()
  return

In [None]:
pluto.draw_tabular_correlogram(pluto.df_bank_half_data,
  title='Bank Fraud half Correlogram')

In [None]:
pluto.draw_tabular_correlogram(pluto.df_bank_tokenize_data,
  title='Bank Fraud full Correlogram',
  figsize=(22,24))

## Heatmap

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def draw_tabular_heatmap(self, df, x='checksum', y='month'):
  canvas, pic = matplotlib.pyplot.subplots(figsize=(12,6))
  df.plot.hexbin(x=x, y=y, gridsize=20, ax=pic,cmap='Reds')
  pic.set_title(f'Heatmap of {x} and {y}', fontsize=22.0)
  canvas.tight_layout()
  self._drop_image(canvas)
  canvas.show()
  return

In [None]:
pluto.draw_tabular_heatmap(pluto.df_bank_tokenize_data, x='checksum', y='month')

In [None]:
pluto.df_bank_fraud_data = pluto.df_bank_tokenize_data[pluto.df_bank_tokenize_data.fraud_bool == 1]
pluto.df_bank_fraud_data.reset_index(drop=True,inplace=True)

In [None]:
pluto.df_bank_fraud_data.info()

In [None]:
pluto.df_bank_fraud_data.describe()

## (Optional) Seaborn heatmap, Swarmplot, and Tripcolor

In [None]:
%%time
# Compute the correlation matrix
corr = pluto.df_bank_fraud_data.corr()

# Generate a mask for the upper triangle
mask = numpy.triu(numpy.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
canvas, pic = matplotlib.pyplot.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = seaborn.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
seaborn.heatmap(corr, mask=mask, cmap='Set2', vmax=.3, center=0,
  square=True, linewidths=.5, cbar_kws={"shrink": .5},
  ax=pic)
#
pic.set_xticklabels([])
pic.set_title("Seaborn Heatmap with Mask for Bank Fraud Data",fontsize=20.0)
canvas.tight_layout()
pluto._drop_image(canvas)
canvas.show()

In [None]:
# sns.jointplot(x=x, y=y, kind="hex", color="#4CB391")
canvas, pic = matplotlib.pyplot.subplots(figsize=(12,6))
seaborn.swarmplot(data=pluto.df_bank_tokenize_data.sample(2000),
  x='fraud_bool', 
  y='checksum', 
  palette="Set2",
  ax=pic)
pic.set_title("Swarmplot Bank data. sample 2,000 points", fontsize=20.0)
canvas.tight_layout()
pluto._drop_image(canvas)
canvas.show()

In [None]:

# plt.style.use('_mpl-gallery-nogrid')

# make data:
# numpy.random.seed(1)
x = numpy.random.uniform(-3, 3, 256)
y = numpy.random.uniform(-3, 3, 256)
# x = pluto.df_world_tokenize_data.winning_team_tokenize
# y = pluto.df_world_tokenize_data.losing_team_tokenize
z = (1 - x/2 + x**5 + y**3) * numpy.exp(-x**2 - y**2)

# plot:
canvas, pic = matplotlib.pyplot.subplots(figsize=(12,5))

# ax.plot(x, y, 'o', markersize=2, color='grey')
pic.tripcolor(x, y, z, cmap='BrBG')

# ax.set(xlim=(-3, 3), ylim=(-3, 3))
pic.set_title("Tripcolor plot over X, Y, and Z", fontsize=20.0)
canvas.tight_layout()
pluto._drop_image(canvas)
canvas.show()

## (Optional) Load Variant I

In [None]:
# remove white space in directory and filename
# run this until no error/output
f = 'kaggle/bank-account-fraud-dataset-neurips-2022'
#!find {f} -name "* *" -type d | rename 's/ /_/g'
!find {f} -name "* *" -type f | rename 's/ /_/g'

In [None]:
f = '/content/kaggle/bank-account-fraud-dataset-neurips-2022/Variant_I.csv'
pluto.df_bank_v1_data = pluto.fetch_df(f)
pluto.df_bank_v1_data.head(3)

# World Series Baseball Television Ratings

In [None]:
%%time
f = 'https://www.kaggle.com/datasets/mattop/world-series-baseball-television-ratings'
pluto.fetch_kaggle_dataset(f)

In [None]:
f = '/content/kaggle/world-series-baseball-television-ratings/world-series-ratings.csv'
pluto.df_world_data = pluto.fetch_df(f)
pluto.df_world_data.head(3)

In [None]:
pluto.df_world_data.info()

In [None]:
pluto.df_world_data.nunique()

In [None]:
%%time
pluto.df_world_tokenize_data = pluto.df_world_data.copy()
pluto.df_world_tokenize_data = pluto.df_world_tokenize_data.fillna(0)
pluto.add_token_index(pluto.df_world_tokenize_data, 
  ['network', 'winning_team', 'losing_team'])
pluto.df_world_tokenize_data = pluto.df_world_tokenize_data.drop(
  ['network', 'winning_team', 'losing_team'], 
  axis=1)
pluto._fetch_checksum(pluto.df_world_tokenize_data)


In [None]:
pluto.draw_tabular_histogram(pluto.df_world_data,
  title='World Series Baseball',
  maxcolors=14)

In [None]:
pluto.df_world_tokenize_data.info()

In [None]:
!pip install joypy

In [None]:
pluto.draw_tabular_correlogram(pluto.df_world_tokenize_data,
  title='World Series Baseball Correlogram')

In [None]:
# %%writefile -a {pluto_chapter_9}

import joypy
#
@add_method(PacktDataAug)
def draw_tabular_joyplot(self, df, x=[], y='network', t='',legloc='upper left'):
  canvas, pics = joypy.joyplot(df, 
    column=x, 
    by=y, 
    ylim='own', figsize=(12,6),
    overlap=1)

  # Decoration
  matplotlib.pyplot.title(t, fontsize=22)
  pics[0].legend(ncol=2, loc=legloc)
  canvas.tight_layout()
  self._drop_image(canvas)
  canvas.show()
  return

In [None]:
pluto.draw_tabular_joyplot(pluto.df_world_data, 
  x=['game_1_audience', 'game_2_audience', 'game_3_audience',
     'game_4_audience', 'game_5_audience', 'game_6_audience', 
     'game_7_audience'],
  y='network',
  t='World series baseball audience')

In [None]:
pluto.draw_tabular_joyplot(pluto.df_world_tokenize_data, 
  x=['checksum', 'average_audience'],
  y='network_tokenize',
  t='World series baseball, checksum and average audience',
  legloc='upper right')

In [None]:
! pip install pywaffle

In [None]:
# %%writefile -a {pluto_chapter_9}

import pywaffle
#
@add_method(PacktDataAug)
def draw_tabular_waffle(self, df_orig, col='winning_team', 
  title='',legloc='lower center', anchor=(0.5, -0.5)):
  df = df_orig.groupby(col).size().reset_index(name='counts')
  cat = df.shape[0]
  colors = [matplotlib.pyplot.cm.nipy_spectral(i/float(cat)) for i in range(cat)]

  # Draw Plot and Decorate
  canvas = matplotlib.pyplot.figure(
    FigureClass=pywaffle.Waffle,
    # plots={
    #   '111': {
    #     # 'values': df['counts'],
    #     'labels': ["{0} ({1})".format(n[0], n[1]) for n in df[[col, 'counts']].itertuples()],
    #     'legend': {'loc': legloc, 'fontsize': 11, 'ncol': 4, 'bbox_to_anchor':anchor},
    #     'title': {'label': title, 'loc': 'center', 'fontsize':20.0}},},
      rows=4,
      values=df['counts'],
      colors=colors,
      figsize=(10, 8))
  #
  canvas.tight_layout()
  self._drop_image(canvas)
  canvas.show()
  return

In [None]:
pluto.draw_tabular_waffle(pluto.df_world_data, 
  col='winning_team',
  title='World Series Baseball Winning Team')

In [None]:
pluto.draw_tabular_waffle(pluto.df_world_data, 
  col='losing_team',
  title='World Series Baseball Losing Team')

In [None]:

pluto.draw_tabular_waffle(pluto.df_world_data, 
  col='network',
  title='World Series Baseball Network',
  anchor=(0.5, -0.2))

# Transformation Augmentation

## ✋ STOP

- Installing pystan and fbprophet below takes up to 11 minutes.
- **Warning:** The library is a beta release and may be unstable.

In [None]:
%%time
!pip install pystan==2.18.0.0
!pip install fbprophet
!pip install deltapy

In [None]:
%%time
!pip install pykalman
!pip install tsaug
!pip install ta
!pip install pandasvault
!pip install gplearn
!pip install seasonal

In [None]:
import deltapy

## Robust Scaler

In [None]:
# %%writefile -a {pluto_chapter_9}

import matplotlib
@add_method(PacktDataAug)
def augment_tabular_robust_scaler(self, df):
  return deltapy.transform.robust_scaler(df.copy(), drop=["checksum"])
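
To see what robust scaling does to the numbers, here is a minimal sketch of the general technique: center each column on its median and divide by its interquartile range. The 25th to 75th percentile range and the helper name `robust_scale_sketch` are assumptions for illustration, and the DeltaPy implementation may differ in detail:

```python
import pandas

def robust_scale_sketch(df):
    """Center each column on its median and divide by its interquartile range."""
    # Hypothetical helper; assumes a 25%-75% quantile range.
    center = df.median()
    spread = df.quantile(0.75) - df.quantile(0.25)
    return (df - center) / spread

# Toy column with one extreme value (made-up numbers).
toy = pandas.DataFrame({"audience": [10.0, 12.0, 14.0, 16.0, 100.0]})
scaled = robust_scale_sketch(toy)
print(scaled)
```

Because the median and the interquartile range ignore the extreme value, the four ordinary rows land between -1 and 0.5 while the outlier stays clearly separated.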

In [None]:
df_out = pluto.augment_tabular_robust_scaler(
  pluto.df_world_tokenize_data) 

In [None]:
pluto.draw_tabular_joyplot(df_out, 
  x=['game_1_audience', 'game_2_audience', 'game_3_audience',
     'game_4_audience', 'game_5_audience', 'game_6_audience', 
     'game_7_audience'],
  y='network_tokenize',
  t='World series baseball audience')

In [None]:
pluto.draw_tabular_waffle(df_out, 
  col='network_tokenize',
  title='World Series Baseball Network',
  anchor=(0.5, -0.2))

## Standard scaler

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def augment_tabular_standard_scaler(self, df):
  return deltapy.transform.standard_scaler(df.copy(), drop=["checksum"])
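
Standard scaling subtracts the column mean and divides by the column standard deviation, so each column ends up with mean 0 and standard deviation 1. The sketch below shows the general technique on made-up numbers. The helper name `standard_scale_sketch` and the choice of the population standard deviation (`ddof=0`) are assumptions, and the DeltaPy implementation may differ:

```python
import pandas

def standard_scale_sketch(df):
    """Shift each column to mean 0 and rescale it to standard deviation 1."""
    # Hypothetical helper; uses the population standard deviation (ddof=0).
    return (df - df.mean()) / df.std(ddof=0)

# Toy column with made-up numbers.
toy = pandas.DataFrame({"audience": [10.0, 12.0, 14.0, 16.0, 18.0]})
scaled = standard_scale_sketch(toy)
print(scaled)
```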

In [None]:
df_out = pluto.augment_tabular_standard_scaler( 
  pluto.df_world_tokenize_data) 

In [None]:
pluto.draw_tabular_joyplot(df_out, 
  x=['game_1_audience', 'game_2_audience', 'game_3_audience',
     'game_4_audience', 'game_5_audience', 'game_6_audience', 
     'game_7_audience'],
  y='network_tokenize',
  t='World series baseball audience',
  legloc='upper right')

In [None]:
pluto.draw_tabular_waffle(df_out, 
  col='network_tokenize',
  title='World Series Baseball Network',
  anchor=(0.5, -0.2))

## Capping

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def augment_tabular_capping(self, df):
  x, y = deltapy.transform.outlier_detect(df, "checksum")
  return deltapy.transform.windsorization(df.copy(),"checksum",y,strategy='both')
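
Capping (winsorization) replaces values beyond a lower and an upper bound with the bound itself. The sketch below assumes the bounds come from the common interquartile-range rule with a multiplier of 1.5. The helper name `cap_outliers_sketch` and that rule are assumptions for illustration, not a statement of how DeltaPy computes its bounds:

```python
import pandas

def cap_outliers_sketch(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    # Hypothetical helper; the IQR rule and k=1.5 are assumptions.
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy series with one extreme value (made-up numbers).
toy = pandas.Series([10.0, 12.0, 14.0, 16.0, 100.0])
capped = cap_outliers_sketch(toy)
print(capped.tolist())
```

Only the extreme value changes: it is pulled down to the upper bound, and the other rows are untouched.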

In [None]:
df_out = pluto.augment_tabular_capping( 
  pluto.df_bank_tokenize_data) 

In [None]:
df_out = pluto._drop_bank_columns(df_out)

In [None]:
pluto.draw_tabular_correlogram(df_out, 
  title='Bank Fraud Capping Transformation')

# Interaction augmentation

## Regression

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def augment_tabular_regression(self, df):
  return deltapy.interact.lowess(
    df.copy(), 
    ["winning_team_tokenize","losing_team_tokenize"], 
    df["checksum"], 
    f=0.25, iter=3)
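
LOWESS (locally weighted scatterplot smoothing) fits a small weighted regression around each point and keeps the fitted value at that point. The sketch below shows only the basic idea with a tricube-weighted straight line. It omits the robustness iterations that the `iter` argument above suggests, and the helper name `lowess_sketch` and its `frac` parameter are assumptions for illustration:

```python
import numpy

def lowess_sketch(x, y, frac=0.5):
    """Fit a tricube-weighted straight line around each point; return fitted y."""
    # Hypothetical helper; a simplified, non-robust variant of LOWESS.
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(3, int(numpy.ceil(frac * n))))  # neighbours per local fit
    design = numpy.column_stack([numpy.ones(n), x])
    fitted = numpy.empty(n)
    for i in range(n):
        dist = numpy.abs(x - x[i])
        bandwidth = numpy.sort(dist)[k - 1]  # distance to the k-th nearest point
        if bandwidth == 0:
            bandwidth = 1.0  # all neighbours coincide with x[i]
        # Tricube weights: near points count most, points past the bandwidth count 0.
        weights = numpy.clip(1.0 - (dist / bandwidth) ** 3, 0.0, 1.0) ** 3
        root = numpy.sqrt(weights)
        coef, *_ = numpy.linalg.lstsq(design * root[:, None], y * root, rcond=None)
        fitted[i] = coef[0] + coef[1] * x[i]
    return fitted

# On perfectly linear toy data the local fits reproduce the line.
x = numpy.arange(10.0)
y = 2.0 * x + 1.0
smooth = lowess_sketch(x, y, frac=0.5)
print(numpy.round(smooth, 3))
```

A smaller `frac` uses fewer neighbours per fit and follows the data more closely, while a larger one gives a smoother curve.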

In [None]:
df_out = pluto.augment_tabular_regression( 
  pluto.df_world_tokenize_data) 

In [None]:
pluto.draw_tabular_joyplot(df_out, 
  x=['game_1_audience', 'game_2_audience', 'game_3_audience',
     'game_4_audience', 'game_5_audience', 'game_6_audience', 
     'game_7_audience'],
  y='network_tokenize',
  t='World series baseball audience: Regression',
  legloc='upper right')

In [None]:
pluto.draw_tabular_waffle(df_out, 
  col='network_tokenize',
  title='World Series Baseball Network: Regression',
  anchor=(0.5, -0.2))

In [None]:
pluto.draw_tabular_correlogram(df_out, 
  title='World Series Baseball: Regression')

## Operator

In [None]:
# %%writefile -a {pluto_chapter_9}

@add_method(PacktDataAug)
def augment_tabular_operator(self, df):
  return deltapy.interact.muldiv(
    df.copy(), 
    ["credit_risk_score","proposed_credit_limit"])
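
An operator interaction derives new columns by combining two existing ones, here with multiplication and division. The sketch below shows the general idea on made-up numbers. The helper name `mul_div_sketch`, the new column names, and the handling of zero denominators are assumptions and may not match what DeltaPy produces:

```python
import numpy
import pandas

def mul_div_sketch(df, first, second):
    """Add the product and the ratio of two columns as new columns."""
    # Hypothetical helper; output column names are chosen for illustration.
    out = df.copy()
    out[f"{first}_x_{second}"] = out[first] * out[second]
    # Replace zero denominators with NaN so the ratio is missing, not infinite.
    out[f"{first}_div_{second}"] = out[first] / out[second].replace(0, numpy.nan)
    return out

# Toy frame with made-up values; the column names echo the bank data.
toy = pandas.DataFrame({
    "credit_risk_score": [100.0, 200.0, 300.0],
    "proposed_credit_limit": [200.0, 0.0, 1500.0],
})
result = mul_div_sketch(toy, "credit_risk_score", "proposed_credit_limit")
print(result)
```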

In [None]:
df_out = pluto.augment_tabular_operator( 
  pluto.df_bank_tokenize_data) 

In [None]:
df_out = pluto._drop_bank_columns(df_out)
pluto.draw_tabular_correlogram(df_out, 
  title='Bank Fraud Operator Interaction')

# Push up all changes (Optional)

- username: duchaba

- password: [use the token]

In [None]:
# import os
# f = 'Data-Augmentation-with-Python'
# os.chdir(f)
# !git add -A
# !git config --global user.email "duc.haba@gmail.com"
# !git config --global user.name "duchaba"
# !git commit -m "end of session"
# # do the git push in the xterm console
# #!git push

In [None]:
# %%script false --no-raise-error  #temporary stop execute for export file

# Summary 

Every chapter begins with the same base class, "PacktDataAug".

✋ FAIR WARNING:

- The code uses long, complete function path names.

- Pluto wrote the code to be easy to understand, not for compactness, fast execution, or cleverness.

- Use Xterm to debug the cloud server.



In [None]:
# !pip install colab-xterm
# %load_ext colabxterm
# %xterm