# LSAP Data Inspection & Retrieval
---
This notebook retrieves, inspects, and saves the PolyBanking (PolyAI Banking77) dataset used in our NLP project.


## Import Libraries & Configure Splits

In [1]:
import numpy as np
import pandas as pd

#Set to True to save separate "train.csv", "val.csv", and "test.csv" files, or False to save all data to "data.csv"
SPLIT = False

#Set train, val, and test split fractions (the test set receives the rows left over after train and val)
TRAIN_SPLIT = 0.6
VAL_SPLIT   = 0.2
TEST_SPLIT  = 0.2
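
The three fractions above are meant to partition the data, so they should be non-negative and sum to 1. A minimal sketch of such a sanity check is shown here; `check_splits` is a hypothetical helper introduced for illustration and is not part of the original notebook.

```python
import math

# Values copied from the configuration cell above
TRAIN_SPLIT = 0.6
VAL_SPLIT = 0.2
TEST_SPLIT = 0.2

def check_splits(train, val, test):
    """Hypothetical helper: True when the fractions are non-negative and sum to 1."""
    fractions = (train, val, test)
    # math.isclose tolerates floating-point rounding in the sum
    return all(f >= 0 for f in fractions) and math.isclose(sum(fractions), 1.0)

splits_ok = check_splits(TRAIN_SPLIT, VAL_SPLIT, TEST_SPLIT)
```

Running a check like this before splitting would catch a mistyped fraction early, instead of silently producing an undersized test set.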

## Preprocess pre-training dataset

Our pre-training dataset is a combination of the [Banking77 Dataset](https://huggingface.co/datasets/PolyAI/banking77) and [WikiHow Intents Dataset](https://github.com/zharry29/wikihow-intent) with a few tweaks. Below, we load them both in, preprocess them, and get them ready for training our model.

---

### Load PolyBanking Dataset

In [2]:
#Load PolyBanking Dataset from HuggingFace
from datasets import load_dataset, Dataset, concatenate_datasets, ClassLabel
#Use a raw string so the backslashes in the Windows path are not read as escape sequences
banking_dataset = load_dataset( "PolyAI/banking77", cache_dir=r"D:\digit\Documents\Development\HuggingFace\Datasets\PolyAIBanking" )

#Return pandas DataFrames when indexing the dataset so we can operate on them
banking_dataset.set_format( type="pandas" )
banking_dataset

Found cached dataset banking77 (D:/digit/Documents/Development/HuggingFace/Datasets/PolyAIBanking/PolyAI___banking77/default/1.1.0/aec0289529599d4572d76ab00c8944cb84f88410ad0c9e7da26189d31f62a55b)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 10003
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 3080
    })
})

### Preprocessing PolyBanking Dataset
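We map each integer label to its name, then convert the name from snake_case (e.g. `order_physical_card`) to a capitalized phrase ending in a period (e.g. `Order Physical Card.`).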

In [3]:
#Concatenate the train and test splits into a single Dataset
banking_dict = concatenate_datasets( [banking_dataset["train"], banking_dataset["test"]] )

#Convert the combined Dataset to a pandas DataFrame
banking_df = banking_dict.to_pandas()

#Map an integer label id to its string name
def add_labelname( label ):
  return banking_dataset["train"].features["label"].int2str( label )

#Convert a snake_case label name to capitalized words ending in a period
def correct_label( name ):
  return " ".join( [word.capitalize() for word in name.split("_")] ) + "."

#Add the formatted label name for each row
banking_df["label_name"] = banking_df["label"].apply( add_labelname )
banking_df["label_name"] = banking_df["label_name"].apply( correct_label )
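
To make the label formatting concrete, the sketch below applies the same transformation to two raw label names without loading the dataset. `correct_label` is copied from the cell above. `clean_label` is a hypothetical variant, not used anywhere in this notebook, that strips trailing punctuation first so a raw name ending in `?` does not come out as `?.`.

```python
def correct_label(name):
    """Same logic as the notebook: snake_case -> capitalized words plus a period."""
    return " ".join([word.capitalize() for word in name.split("_")]) + "."

def clean_label(name):
    """Hypothetical variant: drop trailing punctuation before formatting."""
    return correct_label(name.rstrip("?!."))

plain = correct_label("order_physical_card")          # "Order Physical Card."
quirk = correct_label("reverted_card_payment?")       # "Reverted Card Payment?."
cleaned = clean_label("reverted_card_payment?")       # "Reverted Card Payment."
```

Whether the doubled punctuation matters depends on how the label text is consumed downstream, so the variant is only a suggestion.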

In [4]:
banking_df.sample( frac = 1 ).head( 10 )

| | text | label | label_name |
|---|---|---|---|
| 12491 | Can I request a physical card? | 43 | Order Physical Card. |
| 6228 | This is terrible, i want to delete my account | 55 | Terminate Account. |
| 11339 | Can I give my friends access to my account so ... | 62 | Topping Up By Card. |
| 10808 | I used my card for a purchase and was charged ... | 15 | Card Payment Fee Charged. |
| 3926 | Why isn't this accepting my identity? | 68 | Unable To Verify Identity. |
| 9707 | Does my top up and apple pay work together? | 2 | Apple Pay Or Google Pay. |
| 9339 | Why have I been charged a fee for cash withdra... | 19 | Cash Withdrawal Charge. |
| 7218 | why was i chargged | 64 | Transfer Fee Charged. |
| 7845 | I want to give a second card for this account ... | 39 | Getting Spare Card. |
| 6756 | why did my payment revert | 53 | Reverted Card Payment?. |


### Inspection

Below, we inspect the dataset for label imbalance, both overall and within the original train and test splits, to decide whether we need to sample the data differently.

In [5]:
import plotly.express as px

# Gets the frequency of each label
label_freq = banking_df[ "label_name" ].value_counts()

#Plot it as an area chart
fig = px.area( 
    x = label_freq.index, 
    y = label_freq.values, 
    title = "Label Frequency", 
    template="plotly_dark",
)

fig.update_layout( height = 700 )
fig.update_xaxes( title_text = "Label" )
fig.update_yaxes( title_text = "Frequency" )

#Copy the individual train/test splits as DataFrames so we can add label names to each
updated_train = banking_dataset["train"][ : ]
updated_test  = banking_dataset["test"][ : ]

#Add label name associated with each label
updated_train["label_name"] = updated_train["label"].apply( add_labelname )
updated_test["label_name"]  = updated_test["label"].apply( add_labelname )

#Clean labels
updated_train["label_name"] = updated_train["label_name"].apply( correct_label )
updated_test["label_name"]  = updated_test["label_name"].apply( correct_label )

#Overlay the train and test label counts as line traces on the area chart
fig.add_scatter(
    x = updated_train["label_name"].value_counts().index,
    y = updated_train["label_name"].value_counts().values,
    name = "Train",
    mode = "lines",
    line = dict( color = "purple" )
)

fig.add_scatter(
    x = updated_test["label_name"].value_counts().index,
    y = updated_test["label_name"].value_counts().values,
    name = "Test",
    mode = "lines",
    line = dict( color = "green" )
)

#Show chart
fig.show()
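
A single number can complement the chart when judging imbalance. The sketch below computes the ratio between the most and least frequent labels. `imbalance_ratio` is a hypothetical helper and the toy labels are made up for illustration; on the real data the same quantity could be read from the `label_freq` series as `label_freq.max() / label_freq.min()`.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Hypothetical helper: count of the most common label divided by the least common."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy data (not from the dataset): label "a" appears 4x, "b" 2x, "c" 1x
toy_labels = ["a"] * 4 + ["b"] * 2 + ["c"]
ratio = imbalance_ratio(toy_labels)  # 4 / 1 = 4.0
```

A ratio near 1 suggests roughly balanced labels, while a large ratio may indicate that stratified or weighted sampling is worth considering.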

### Save Training, Testing, and Validation Data

In [6]:
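#Shuffle the data with a fixed seed so the split is reproducible
shuffled_df = banking_df.sample( frac = 1, random_state=42 )

#Compute the split boundaries; the test set receives the remaining rows
n = len( shuffled_df )
train_size = int( TRAIN_SPLIT * n )
val_size   = int( VAL_SPLIT   * n ) + train_size

#Slice by position (np.split on a DataFrame can emit deprecation warnings in recent pandas versions)
train    = shuffled_df.iloc[ : train_size ]
validate = shuffled_df.iloc[ train_size : val_size ]
test     = shuffled_df.iloc[ val_size : ]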

#If SPLIT is False, save all data to a single csv file
if not SPLIT:
    banking_df.to_csv( "data.csv" )
else:
    #Otherwise, save each split to its own csv file
    train.to_csv( "train.csv" )
    validate.to_csv( "val.csv" )
    test.to_csv( "test.csv" )
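
Because the boundaries are truncated with `int`, the resulting split sizes are not exactly 60/20/20. The sketch below reproduces the boundary arithmetic for the 13,083 combined rows (10,003 train + 3,080 test in the loaded dataset). `split_bounds` is a hypothetical helper written for illustration only.

```python
def split_bounds(n, train_frac, val_frac):
    """Hypothetical helper: end indices of the train and validation slices."""
    train_end = int(train_frac * n)            # truncates toward zero
    val_end = train_end + int(val_frac * n)    # validation ends here; test takes the rest
    return train_end, val_end

n = 10003 + 3080  # combined row count from the dataset summary above
train_end, val_end = split_bounds(n, 0.6, 0.2)
sizes = (train_end, val_end - train_end, n - val_end)  # (7849, 2616, 2618)
```

The test set ends up two rows larger than the validation set because it absorbs the rows lost to truncation, which is also why `TEST_SPLIT` never needs to appear in the boundary calculation.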