# LSAP Data Inspection & Retrieval
---
This notebook retrieves, preprocesses, and inspects the WikiHow Intents data used in our NLP project.


## Import Libraries & Set Configuration

In [1]:
import numpy as np
import pandas as pd

#Set to True to save all data to "data.csv", or False to save separate "train.csv", "val.csv", "test.csv" files
COMBINE = True

#Set train, val, and test split fractions (these should sum to 1)
TRAIN_SPLIT = 0.6
VAL_SPLIT   = 0.2
TEST_SPLIT  = 0.2
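
The three split fractions are only meaningful if they sum to 1. As a minimal sketch, a sanity check could look like the following; `check_splits` is a hypothetical helper, not part of the original notebook.

```python
import math

# Same values as the configuration cell above
TRAIN_SPLIT = 0.6
VAL_SPLIT = 0.2
TEST_SPLIT = 0.2

def check_splits(train, val, test):
    """Hypothetical helper: True when each fraction is in (0, 1) and they sum to 1."""
    fractions = (train, val, test)
    in_range = all(0.0 < f < 1.0 for f in fractions)
    # math.isclose tolerates floating-point rounding in the sum
    return in_range and math.isclose(sum(fractions), 1.0)

assert check_splits(TRAIN_SPLIT, VAL_SPLIT, TEST_SPLIT)
```

Comparing with `math.isclose` rather than `==` avoids false failures when the fractions do not add up exactly in floating point.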

## Preprocess pre-training dataset

Our pre-training dataset is a combination of the [Banking77 Dataset](https://huggingface.co/datasets/PolyAI/banking77) and [WikiHow Intents Dataset](https://github.com/zharry29/wikihow-intent) with a few tweaks. Below, we load them both in, preprocess them, and get them ready for training our model.

---

### Load WikiHow Intent Dataset

In [2]:
!gdown 1KaFOVZFxZoR6weLuDNmVFwRRsGTFq5tJ

Downloading...
From: https://drive.google.com/uc?id=1KaFOVZFxZoR6weLuDNmVFwRRsGTFq5tJ
To: d:\digit\Documents\UMass\CS685\FinalProject\nlp-project\data\pretraining\wikihow\en_wikihow_train.csv

100%|██████████| 28.2M/28.2M [00:00<00:00, 29.3MB/s]


### Preprocess 

We want to keep both datasets in the same format to combine them later.
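
To illustrate why a shared format matters, here is a rough sketch of how two frames with the same `text`, `label`, `label_name` columns could be concatenated. The rows are toy stand-ins, and shifting the second dataset's label IDs by an offset is an assumption for illustration, not necessarily how the project combines the data.

```python
import pandas as pd

# Toy stand-ins for the two preprocessed datasets (hypothetical rows)
banking_toy = pd.DataFrame({
    "text": ["I lost my card", "My card was stolen"],
    "label": [0, 0],
    "label_name": ["lost_or_stolen_card", "lost_or_stolen_card"],
})
wikihow_toy = pd.DataFrame({
    "text": ["Preheat the oven."],
    "label": [0],
    "label_name": ["Roast Pine Nuts."],
})

# Assumption: shift the second dataset's label IDs so they do not
# collide with the first dataset's IDs
offset = banking_toy["label"].max() + 1
wikihow_shifted = wikihow_toy.assign(label=wikihow_toy["label"] + offset)

# Identical column names let pd.concat stack the rows directly
combined = pd.concat([banking_toy, wikihow_shifted], ignore_index=True)
print(combined)
```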

In [3]:
import pandas as pd
import re

def create_label_name(df):
    """
    Create a column with each sentence's intent label.
    """
    df[ 'label_name' ] = df.apply( lambda row: row[ f'ending{row["label"]}' ] + ".", axis=1 )
    return df

def drop_ending_columns(df):
    """
    Drop the 'ending' columns from the dataframe.
    """
    endings_to_drop = [ f'ending{i}' for i in range( 4 ) ]
    df = df.drop( endings_to_drop, axis=1 )
    return df

def remove_how_to(df):
    """
    Remove "How to" (case-insensitive) from the label names.
    """
    pattern = re.compile( r'(?i)How to ' )
    df[ 'label_name' ] = df[ 'label_name' ].apply( lambda x: re.sub( pattern, '', x ).strip() )
    return df

# Load data
wikihow_df = pd.read_csv( 'en_wikihow_train.csv', index_col=0 )

# Drop unnecessary columns
wikihow_df = wikihow_df.drop(['startphrase', 'video-id', 'gold-source', 'fold-ind', 'sent1'], axis = 1)

# Add a label name to each row
wikihow_df = create_label_name( wikihow_df )

# Drop the 'ending' columns
wikihow_df = drop_ending_columns( wikihow_df )

# Remove "How to" from the 'label_name' column (case-insensitive)
wikihow_df = remove_how_to( wikihow_df ).rename( columns={'sent2': 'text'} )[ ['text', 'label', 'label_name'] ]

#Change label value to the row's index, i.e. 0, 1, 2, ... n - 1 (n = number of rows)
wikihow_df[ 'label' ] = wikihow_df.index.values
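
The preprocessing steps above can be traced on a tiny frame. This is a sketch on two hypothetical rows that mimic the column layout used above (`sent2`, `label`, `ending0` to `ending3`); the row contents are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical rows: 'label' picks which 'ending' column holds the true intent
toy = pd.DataFrame({
    "sent2": ["Preheat the oven.", "Use a tripod when light is low."],
    "label": [1, 0],
    "ending0": ["How to Use a Tripod", "How to Use a Tripod"],
    "ending1": ["How to Roast Pine Nuts", "How to Make a Cap for Wigs"],
    "ending2": ["How to Kill Ants Using Borax", "How to Roast Pine Nuts"],
    "ending3": ["How to Make a Cap for Wigs", "How to Kill Ants Using Borax"],
})

# Step 1: select the ending indexed by 'label' and append a period
toy["label_name"] = toy.apply(lambda row: row[f"ending{row['label']}"] + ".", axis=1)

# Step 2: drop the four candidate endings
toy = toy.drop([f"ending{i}" for i in range(4)], axis=1)

# Step 3: strip "How to " (case-insensitive) from the label names
toy["label_name"] = toy["label_name"].apply(lambda x: re.sub(r"(?i)how to ", "", x).strip())

# Step 4: rename, reorder, and relabel each row with its index
toy = toy.rename(columns={"sent2": "text"})[["text", "label", "label_name"]].copy()
toy["label"] = toy.index.values

print(toy)
```

Note that the pattern is not anchored, so it removes "how to " wherever it occurs in a label name, not only at the start.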

In [4]:
wikihow_df.sample( frac = 1 ).head( 10 )

|  | text | label | label_name |
|---|---|---|---|
| 42937 | Trim a length of 2 by 4 in (5.1 by 10.2 cm) wo... | 42937 | Build an Adjustable Dog Agility Seesaw. |
| 80676 | Add specific details to the partnership deed a... | 80676 | Register a Partnership Firm in India. |
| 66333 | Pull the netting taut across the forehead and ... | 66333 | Make a Cap for Wigs. |
| 20378 | Apply a highly pigmented treatment concealer b... | 20378 | Cover up Blackheads With Makeup. |
| 79099 | Use a tripod when the shutter speed is too slo... | 79099 | Use a Tripod. |
| 14667 | While pulling the circle up, remove the top ha... | 14667 | Open an Otterbox Case. |
| 22852 | Preheat the oven to 350 °F (177 °C) and rinse ... | 22852 | Roast Pine Nuts. |
| 26880 | Alternatively, you can avoid the flyer process... | 26880 | Sell Your Used Textbook to Another Student on ... |
| 35171 | Sprinkle the powder across potential entry ways. | 35171 | Kill Ants Using Borax. |
| 39473 | Stir in 4 packets of unflavored gelatin, then ... | 39473 | Make Gelatin Ice Cubes. |


### Inspection

Below, we inspect the dataset for label imbalance and to see whether we need to sample the data differently. In the Banking77 dataset, each label is associated with multiple entries. In the WikiHow Intents dataset, however, each label ID is used only once, so we count the unique intent names to see whether any of them repeat.

In [5]:
#Count unique intent names; a count below the number of rows means some articles share an intent
num_unique = wikihow_df['label_name'].nunique()
num_unique

110573
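
If the unique count is lower than the number of rows, some intent names are shared. A minimal sketch of how the shared names could be listed, shown on hypothetical toy rows rather than the real data:

```python
import pandas as pd

# Toy rows (hypothetical) in which one intent name appears twice
toy = pd.DataFrame({
    "text": ["step a", "step b", "step c"],
    "label_name": ["Use a Tripod.", "Roast Pine Nuts.", "Use a Tripod."],
})

num_unique = toy["label_name"].nunique()
has_duplicates = num_unique < len(toy)

# keep=False marks every row of a duplicated group, not just the later ones
shared = toy[toy.duplicated("label_name", keep=False)]

print(num_unique, has_duplicates)
print(shared)
```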

Below, we plot the distribution of label-name lengths (in words) for the WikiHow dataset. This should not affect the analysis itself, but a heavily skewed distribution is worth knowing about, since it can affect the performance and accuracy of machine learning models trained on the dataset.

In [6]:
import plotly.express as px

# Count the number of words in each label name, then tally how often each length occurs
label_counts = wikihow_df[ 'label_name' ].str.split().apply( len ).value_counts()

#Plot the distribution of label name lengths
fig = px.bar( x=label_counts.index, y=label_counts.values, template='plotly_dark' )

#Use a logarithmic y-axis so that rare lengths remain visible
fig.update_yaxes( type='log' )

#Update the plot's title and axes labels
fig.update_layout( title_text='Distribution of Label Name Lengths' )
fig.update_xaxes( title_text='Label Name Length' )
fig.update_yaxes( title_text='Count' )

#Show the plot
fig.show()
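
When a rendered figure is not available, the same length distribution can be summarized as text. A small sketch on hypothetical label names, using only pandas:

```python
import pandas as pd

# Hypothetical label names standing in for wikihow_df['label_name']
label_names = pd.Series([
    "Use a Tripod.",
    "Roast Pine Nuts.",
    "Make a Cap for Wigs.",
    "Kill Ants Using Borax.",
])

# Words per label name
word_counts = label_names.str.split().str.len()

# How many label names have each length, ordered by length
length_distribution = word_counts.value_counts().sort_index()

print(length_distribution)
print(word_counts.describe())
```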

In [7]:
#Shuffle the data with a fixed seed for reproducibility
shuffled_df = wikihow_df.sample( frac = 1, random_state=42 )

#Split data into train, validation and test sets; the test set takes the remaining rows
n = len( shuffled_df )
train_size = int( TRAIN_SPLIT * n )
val_size   = int( VAL_SPLIT   * n ) + train_size
train    = shuffled_df.iloc[ :train_size ]
validate = shuffled_df.iloc[ train_size:val_size ]
test     = shuffled_df.iloc[ val_size: ]

# If COMBINE is set, save the full (unshuffled) dataset to a single file
if COMBINE:
    wikihow_df.to_csv( "data.csv" )
else:
    #Otherwise, save each split to its own csv file
    train.to_csv( "train.csv" )
    validate.to_csv( "val.csv" )
    test.to_csv( "test.csv" )
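
As a quick check on the split logic, the sketch below shuffles a hypothetical 10-row frame and slices it by position with `iloc`, then confirms that the three parts are disjoint and cover every row. Positional slicing is used here in place of `np.split`, which recent pandas versions warn about when applied to a DataFrame.

```python
import pandas as pd

TRAIN_SPLIT, VAL_SPLIT = 0.6, 0.2

# Hypothetical 10-row frame standing in for wikihow_df
toy = pd.DataFrame({"text": [f"step {i}" for i in range(10)], "label": range(10)})
shuffled = toy.sample(frac=1, random_state=42)

n = len(shuffled)
train_end = int(TRAIN_SPLIT * n)
val_end = train_end + int(VAL_SPLIT * n)

# Positional slices; the test set takes whatever rows remain
train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

# The splits should not overlap and should cover every row
all_indices = list(train.index) + list(validate.index) + list(test.index)
assert len(all_indices) == len(set(all_indices)) == n
print(len(train), len(validate), len(test))
```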