# NLP Assignment

Submission:
* There is no late submission for this assignment.
* It is individual work.
* Each student should submit an .ipynb file to the Teams Assignment.
* Grading: 0-15% of the course.

## Task
Here is a <a href="https://huggingface.co/datasets/banking77">dataset</a> composed of online banking queries annotated with their corresponding intents. You can download it by running the cell below. The data will be saved in the working directory.

DO NOT USE the function ```datasets.load_dataset('banking77')```! Run the cell below and work with the raw data.

What I expect you to do:
* Explore data: shape, number of classes, balance of classes. __2 points__.
* Solve a classification problem for a dataset using any <a href="https://huggingface.co/docs/transformers/tasks/sequence_classification">transformer</a> from the huggingface library. __3 points__.  
The tutorial at the link might be helpful.
* Justify choice of a metric. __3 points__.
* Split the training dataset into train and valid datasets. Train the model on train dataset and evaluate it on the valid dataset during training. Evaluate model on the test dataset after training. DO NOT USE a test dataset for validation!  __2 points__.
* Come up with 3 or more queries on banking topics and predict their intents using your model. __2 points__.
* Comment code and describe your actions in the notebook. __1 point__.
* You must achieve a metric value of at least 90%. __2 points__.
* Attach this file to the Teams Assignment. If you do not attach the file, you will get __0 points__.

The final score is calculated as a sum of all points.

If any of the tasks above is completed with errors, the number of points for it may be reduced. For example, if you write only one query for the model instead of at least 3, you will get __1 point__ instead of __2 points__.

## Notes
* Feel free to ask questions.
* Google Colab and Kaggle provide some free CPU and GPU time. Feel free to use it or use university resources.
* Good luck!


In [20]:
!pip install transformers evaluate accelerate
!wget -q -nc "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv"
!wget -q -nc "https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/test.csv"



In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from transformers import pipeline

import warnings
warnings.filterwarnings('ignore')

pd.options.display.float_format = '{:,.2f}'.format
RS = 42

In [22]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [23]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

## Training data Exploration

In [24]:
# Let's explore the train data
df_train.head()

Unnamed: 0,text,category
0,I am still waiting on my card?,card_arrival
1,What can I do if my card still hasn't arrived ...,card_arrival
2,I have been waiting over a week. Is the card s...,card_arrival
3,Can I track my card while it is in the process...,card_arrival
4,"How do I know if I will get my card, or if it ...",card_arrival


In [25]:
# Training data structure and shape
df_train.info()
print(f'Shape of the train dataset: {df_train.shape}')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10003 entries, 0 to 10002
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      10003 non-null  object
 1   category  10003 non-null  object
dtypes: object(2)
memory usage: 156.4+ KB
Shape of the train dataset: (10003, 2)


In [26]:
# Percentage of missing values per column
df_train.isna().mean()*100

Unnamed: 0,0
text,0.0
category,0.0


In [27]:
# Find duplicates
df_train.duplicated().sum()

0

In [28]:
# Helper for inspecting categorical columns: prints the number of
# unique values and the values themselves for each listed column
def unique(columns, data):
    for column in columns:
        print(f'Number of unique values in a column {column}: {data[column].nunique()}')
        print(data[column].unique())
        print('-----------------------------')

In [29]:
# Let's check the values of the category column
unique(['category'], df_train)

Number of unique values in a column category: 77
['card_arrival' 'card_linking' 'exchange_rate'
 'card_payment_wrong_exchange_rate' 'extra_charge_on_statement'
 'pending_cash_withdrawal' 'fiat_currency_support'
 'card_delivery_estimate' 'automatic_top_up' 'card_not_working'
 'exchange_via_app' 'lost_or_stolen_card' 'age_limit' 'pin_blocked'
 'contactless_not_working' 'top_up_by_bank_transfer_charge'
 'pending_top_up' 'cancel_transfer' 'top_up_limits'
 'wrong_amount_of_cash_received' 'card_payment_fee_charged'
 'transfer_not_received_by_recipient' 'supported_cards_and_currencies'
 'getting_virtual_card' 'card_acceptance' 'top_up_reverted'
 'balance_not_updated_after_cheque_or_cash_deposit'
 'card_payment_not_recognised' 'edit_personal_details'
 'why_verify_identity' 'unable_to_verify_identity' 'get_physical_card'
 'visa_or_mastercard' 'topping_up_by_card' 'disposable_card_limits'
 'compromised_card' 'atm_support' 'direct_debit_payment_not_recognised'
 'passcode_forgotten' 'declined_ca

In [30]:
# Let's analyze how many records we have for each category
category_counts = df_train['category'].value_counts()

# Create a table with the number of records and the percentage of the total
category_table = pd.DataFrame({
    'Count': category_counts,
    'Percentage': (category_counts / category_counts.sum() * 100).round(2)
})

In [31]:
category_table

category,Count,Percentage
card_payment_fee_charged,187,1.87
direct_debit_payment_not_recognised,182,1.82
balance_not_updated_after_cheque_or_cash_deposit,181,1.81
wrong_amount_of_cash_received,180,1.80
cash_withdrawal_charge,177,1.77
...,...,...
lost_or_stolen_card,82,0.82
card_swallowed,61,0.61
card_acceptance,59,0.59
virtual_card_not_working,41,0.41


## Test data exploration

In [32]:
# Testing data structure and shape
df_test.info()
print(f'Shape of the test dataset: {df_test.shape}')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      3080 non-null   object
 1   category  3080 non-null   object
dtypes: object(2)
memory usage: 48.2+ KB
Shape of the test dataset: (3080, 2)


In [33]:
# Let's explore the test data
df_test.head()

Unnamed: 0,text,category
0,How do I locate my card?,card_arrival
1,"I still have not received my new card, I order...",card_arrival
2,I ordered a card but it has not arrived. Help ...,card_arrival
3,Is there a way to know when my card will arrive?,card_arrival
4,My card has not arrived yet.,card_arrival


In [34]:
# Percentage of missing values per column
df_test.isna().mean()*100

Unnamed: 0,0
text,0.0
category,0.0


In [35]:
# Find duplicates
df_test.duplicated().sum()

0

#### **Conclusion**:
The training dataset contains 10,003 rows and 2 columns, and the test dataset contains 3,080 rows and 2 columns. The category column holds 77 unique classes, such as card_arrival, exchange_rate and lost_or_stolen_card. The classes are imbalanced: the largest ones, card_payment_fee_charged and direct_debit_payment_not_recognised, have 187 (1.87%) and 182 (1.82%) samples, whereas the smallest ones, contactless_not_working and virtual_card_not_working, have only 35 (0.35%) and 41 (0.41%), roughly a fivefold gap. Neither dataset contains missing values or duplicate rows.
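
The imbalance figures quoted above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: it hard-codes the training set size and the largest and smallest class counts mentioned in the conclusion, and `share_percent` and `imbalance_ratio` are hypothetical helper names that do not appear in the notebook.

```python
# Figures taken from the conclusion above.
TRAIN_ROWS = 10003
class_counts = {
    "card_payment_fee_charged": 187,   # largest class
    "contactless_not_working": 35,     # smallest class
}


def share_percent(count, total):
    """Return the share of `count` in `total` as a percentage (2 decimals)."""
    return round(count / total * 100, 2)


def imbalance_ratio(counts):
    """Return the largest class size divided by the smallest class size."""
    return max(counts.values()) / min(counts.values())


for name, count in class_counts.items():
    print(f"{name}: {count} rows ({share_percent(count, TRAIN_ROWS)}%)")
print(f"Largest / smallest class: {imbalance_ratio(class_counts):.2f}")
```

In the notebook itself the same ratio could be read from `category_table['Count']` instead of hard-coded values.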

## Training model LogisticRegression

To evaluate the model on this multi-class task, I will use **F1-macro**. With imbalanced classes it is more informative than accuracy, which can stay high even when rare classes are predicted poorly. F1-macro computes the F1 score (the harmonic mean of precision and recall) for each class separately and then averages those scores with equal weight, so a class with 35 examples counts as much as one with 187.
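
To illustrate why a per-class average matters under class imbalance, the sketch below compares accuracy with F1-macro on a made-up toy example. The data and the helper functions are assumptions written for this illustration; the notebook itself relies on `f1_score(..., average='macro')` from scikit-learn.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping every class seen in either list to its F1 score."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: nine examples of a frequent class and one of a rare class.
y_true = ["frequent"] * 9 + ["rare"]
# A degenerate classifier that always predicts the frequent class.
y_pred = ["frequent"] * 10

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")   # accuracy: 0.90
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")   # macro F1: 0.47
```

Accuracy rewards the classifier for ignoring the rare class, while F1-macro drops sharply because that class contributes an F1 of zero to the average.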

In [36]:
# Feature and target extraction
features = df_train['text']
target = df_train['category']

# Split data into train and validation sets
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.2, random_state=RS)

# Test set features and target
features_test = df_test['text']
target_test = df_test['category']
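
The split above is purely random. With classes as small as 35 rows, a random 20% validation set can over- or under-represent rare intents; scikit-learn's `train_test_split` accepts a `stratify` argument to keep class proportions equal in both parts. The plain-Python sketch below shows the idea on made-up data. `stratified_split` is a hypothetical helper, not code from this notebook, and switching to a stratified split would change the scores reported below.

```python
import random
from collections import defaultdict


def stratified_split(texts, labels, valid_fraction=0.2, seed=42):
    """Split (text, label) pairs so every class keeps roughly the same
    share in the train and validation parts."""
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    rng = random.Random(seed)
    train, valid = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)
        # At least one validation example per class
        n_valid = max(1, round(len(items) * valid_fraction))
        valid += [(t, label) for t in items[:n_valid]]
        train += [(t, label) for t in items[n_valid:]]
    return train, valid


# Made-up data: one frequent and one rare intent.
texts = [f"query {i}" for i in range(50)]
labels = ["card_arrival"] * 40 + ["change_pin"] * 10

train, valid = stratified_split(texts, labels)
print(len(train), len(valid))  # 40 10
```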

In [37]:
# Vectorizer initialization
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_features=5000)
# Transformation of training data, validation data, test data
features_train_tfidf = vectorizer.fit_transform(features_train)
features_valid_tfidf = vectorizer.transform(features_valid)
features_test_tfidf = vectorizer.transform(features_test)

In [38]:
# Create model LogisticRegression
clf = LogisticRegression(random_state=RS, multi_class='ovr', solver = 'sag', C = 100, max_iter=3000)
# Training model
clf.fit(features_train_tfidf, target_train)

In [39]:
# Predict on validation set
target_valid_pred = clf.predict(features_valid_tfidf)
f1_macro_valid = f1_score(target_valid, target_valid_pred, average='macro')
print(f"Validation f1 macro: {f1_macro_valid:.2f}")

# Predict on test set
target_test_pred = clf.predict(features_test_tfidf)
f1_macro_test = f1_score(target_test, target_test_pred, average='macro')
print(f"Test f1 macro: {f1_macro_test:.2f}")

Validation f1 macro: 0.88
Test f1 macro: 0.88


#### **Conclusion**:
The logistic regression baseline reaches an F1-macro of 0.88 on both the validation and the test set. That is a reasonable result, but it falls short of the required 90%. To close the gap, I will fine-tune a DistilBERT model, which is expected to perform better because it captures the semantics of the text rather than only word and bigram frequencies.

## Banking query intent classification

In [43]:
queries = [
    "How do I link my bank account to the card?",
    "What is the exchange rate for EUR to USD?",
    "I received a wrong exchange rate on my card payment.",
    "Can I change the PIN on my card?"
]

# Check whether any of these queries already appear in the 'text' column of the test set
matches = df_test['text'].isin(queries)

# Output the rows where the queries match
matching_rows = df_test[matches]

print(f"Found {matching_rows.shape[0]} matching queries in the dataset.")
print(matching_rows)

Found 0 matching queries in the dataset.
Empty DataFrame
Columns: [text, category]
Index: []


For these queries, I manually assigned each one its expected category from the 77 predefined classes so that the model's predictions can be checked against them.

In [41]:
category = ["card_linking",
            "exchange_rate",
            "card_payment_wrong_exchange_rate",
            "change_pin"]


In [42]:
# Transform the queries using the trained vectorizer
queries_tfidf = vectorizer.transform(queries)

# Make predictions for each query using the trained model
predictions = clf.predict(queries_tfidf)

# Print the predicted categories for each query using a formatted string
for query, prediction in zip(queries, predictions):
    print(f"Query: {query}\nPredicted category: {prediction}\n")

Query: How do I link my bank account to the card?
Predicted category: card_linking

Query: What is the exchange rate for EUR to USD?
Predicted category: exchange_rate

Query: I received a wrong exchange rate on my card payment.
Predicted category: card_payment_wrong_exchange_rate

Query: Can I change the PIN on my card?
Predicted category: change_pin



The logistic regression model assigned all four queries to the expected categories. Four hand-written examples are only a sanity check, though; the F1-macro scores above remain the actual measure of quality.
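
The comparison between the manually assigned categories and the predictions can also be done in code rather than by eye. The sketch below copies the queries, the expected categories and the predicted categories from the cells above; `find_mismatches` is a hypothetical helper name. In the notebook, the `category` list and the `predictions` array could be passed in directly.

```python
queries = [
    "How do I link my bank account to the card?",
    "What is the exchange rate for EUR to USD?",
    "I received a wrong exchange rate on my card payment.",
    "Can I change the PIN on my card?",
]
# Manually assigned categories (the `category` list above)
expected = [
    "card_linking",
    "exchange_rate",
    "card_payment_wrong_exchange_rate",
    "change_pin",
]
# Predicted categories copied from the output above
predicted = [
    "card_linking",
    "exchange_rate",
    "card_payment_wrong_exchange_rate",
    "change_pin",
]


def find_mismatches(queries, expected, predicted):
    """Return (query, expected, predicted) triples where the labels differ."""
    return [
        (q, e, p)
        for q, e, p in zip(queries, expected, predicted)
        if e != p
    ]


mismatches = find_mismatches(queries, expected, predicted)
print(f"{len(queries) - len(mismatches)}/{len(queries)} queries matched")  # 4/4 queries matched
```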

## Training model DistilBERT

In [44]:
# Initialize the label encoder
label_encoder = LabelEncoder()

# Encode labels for training and test data
df_train['label'] = label_encoder.fit_transform(df_train['category'])
df_test['label'] = label_encoder.transform(df_test['category'])

# Get the number of unique classes
num_classes = len(label_encoder.classes_)
num_classes

77

In [45]:
# Split the labelled training data into train (85%) and validation (15%) parts
train_data, valid_data = train_test_split(df_train, test_size=0.15, random_state=RS)

In [46]:
# Load the pre-trained DistilBERT tokenizer and model for sequence classification
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = num_classes)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [47]:
# Define a preprocessing function to tokenize the text
def preprocess_function(requests):
    tokenized = tokenizer(
        text=requests['text'],
        truncation=True,
        padding='max_length',
        max_length=512
    )
    return tokenized

# Apply the preprocessing function to the training, validation and test dataset
train_dataset = Dataset.from_pandas(train_data).map(preprocess_function, batched=True)
valid_dataset = Dataset.from_pandas(valid_data).map(preprocess_function, batched=True)
test_dataset = Dataset.from_pandas(df_test).map(preprocess_function, batched=True)

Map:   0%|          | 0/8502 [00:00<?, ? examples/s]

Map:   0%|          | 0/1501 [00:00<?, ? examples/s]

Map:   0%|          | 0/3080 [00:00<?, ? examples/s]
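
The preprocessing function above pads every query to 512 tokens, although the queries shown in this notebook are single short sentences. A common alternative is to pad each batch only to its longest sequence; in the `transformers` library this is what `DataCollatorWithPadding` does when passed to the `Trainer` through `data_collator`. The sketch below illustrates the effect in plain Python with made-up token ids. `pad_batch` is a hypothetical helper written for this illustration, not a library function.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to a common length and build attention masks.

    If max_length is None, pad to the longest sequence in the batch
    (dynamic padding); otherwise truncate/pad to max_length (fixed padding).
    """
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids standing in for three short tokenized queries.
batch = [[101, 2129, 102], [101, 2026, 4003, 2145, 102], [101, 102]]

dynamic_ids, dynamic_mask = pad_batch(batch)
fixed_ids, fixed_mask = pad_batch(batch, max_length=512)

print(len(dynamic_ids[0]), len(fixed_ids[0]))  # 5 512
```

With fixed padding, the model processes 512 positions per query regardless of its real length, which mostly costs training time; the attention mask keeps the padded positions from influencing the prediction.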

In [48]:
# Define a function to evaluate the metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="macro")
    return {"f1_macro": f1}

In [49]:
# Model training parameter settings
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs = 8,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    learning_rate = 2e-5,
    evaluation_strategy = "epoch",
    load_best_model_at_end = True,
    metric_for_best_model = "f1_macro",
    save_strategy = "epoch",
    logging_dir = './logs',
    logging_steps = 10,
    report_to = "none"
)

In [50]:
# Model training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Epoch,Training Loss,Validation Loss,F1 Macro
1,2.2175,2.002822,0.596054
2,1.0001,0.865744,0.811762
3,0.6158,0.512224,0.869511
4,0.3269,0.381975,0.89657
5,0.2297,0.33577,0.905647
6,0.173,0.312254,0.911171
7,0.1475,0.299613,0.911489
8,0.0883,0.300855,0.912803
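
With `load_best_model_at_end = True` and `metric_for_best_model = "f1_macro"`, the checkpoint kept at the end of training should be the one with the highest validation F1-macro, assuming the default behaviour of treating a larger value of a non-loss metric as better. The sketch below applies that rule to the log above and shows that it differs slightly from picking the lowest validation loss. The tuples are copied from the table; the variable names are illustrative.

```python
# (epoch, validation loss, validation F1-macro) copied from the training log above
history = [
    (1, 2.002822, 0.596054),
    (2, 0.865744, 0.811762),
    (3, 0.512224, 0.869511),
    (4, 0.381975, 0.896570),
    (5, 0.335770, 0.905647),
    (6, 0.312254, 0.911171),
    (7, 0.299613, 0.911489),
    (8, 0.300855, 0.912803),
]

# Checkpoint selection by the configured metric (higher is better)
best_by_f1 = max(history, key=lambda row: row[2])
# For comparison: selection by validation loss (lower is better)
best_by_loss = min(history, key=lambda row: row[1])

print(f"Best epoch by F1-macro: {best_by_f1[0]}")          # Best epoch by F1-macro: 8
print(f"Best epoch by validation loss: {best_by_loss[0]}")  # Best epoch by validation loss: 7
```

The gains after epoch 6 are small (about 0.002 in F1-macro), which suggests the model is close to converged at this learning rate.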


TrainOutput(global_step=4256, training_loss=0.7839503657576957, metrics={'train_runtime': 3441.879, 'train_samples_per_second': 19.761, 'train_steps_per_second': 1.237, 'total_flos': 9021953498628096.0, 'train_loss': 0.7839503657576957, 'epoch': 8.0})
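
The `global_step` and throughput values in the output above can be checked by hand. The sketch assumes a single device and no gradient accumulation, neither of which is configured in the training arguments shown; the row count comes from the `Map` output of the tokenization cell.

```python
import math

train_rows = 8502       # size of the training split
batch_size = 16         # per_device_train_batch_size
epochs = 8              # num_train_epochs
train_runtime = 3441.879  # seconds, from TrainOutput

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime

print(steps_per_epoch)   # 532
print(total_steps)       # 4256
print(f"{samples_per_second:.2f}")
```

Both figures agree with the reported `global_step=4256` and `train_samples_per_second` of about 19.76.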

In [51]:
# Evaluate the fine-tuned model on the held-out test set
pred = trainer.predict(test_dataset)
test_labels = df_test['label'].values
# The predicted class is the index of the largest logit
test_predictions = pred.predictions.argmax(-1)
bert_test_f1 = f1_score(test_labels, test_predictions, average='macro')
print(f"Test F1-macro: {bert_test_f1:.4f}")

Test F1-macro: 0.9186


In [52]:
# Initialize the text classification pipeline
classifier = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer)

queries = [
    "How do I link my bank account to the card?",
    "What is the exchange rate for EUR to USD?",
    "I received a wrong exchange rate on my card payment.",
    "Can I change the PIN on my card?"
]

# Loop through the queries and classify the intent for each
for query in queries:
    predicted_result = classifier(query)
    print(f"Query: {query}\nPredicted Intent: {predicted_result[0]['label']} with score {predicted_result[0]['score']:.4f}\n")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Query: How do I link my bank account to the card?
Predicted Intent: LABEL_14 with score 0.3451

Query: What is the exchange rate for EUR to USD?
Predicted Intent: LABEL_33 with score 0.9620

Query: I received a wrong exchange rate on my card payment.
Predicted Intent: LABEL_18 with score 0.9851

Query: Can I change the PIN on my card?
Predicted Intent: LABEL_22 with score 0.9886



In [53]:
train_data[train_data['label'] == 14].sample(1)

Unnamed: 0,text,category,label
287,What do I do if I already had a card with you ...,card_linking,14


In [54]:
train_data[train_data['label'] == 33].sample(1)

Unnamed: 0,text,category,label
355,explain the interbank exchange rate,exchange_rate,33


In [55]:
train_data[train_data['label'] == 18].sample(1)

Unnamed: 0,text,category,label
425,Why was the exchange rate so wrong when I boug...,card_payment_wrong_exchange_rate,18


In [56]:
train_data[train_data['label'] == 22].sample(1)

Unnamed: 0,text,category,label
6985,How can I change my PIN? Help.,change_pin,22
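
The pipeline returns generic names such as `LABEL_14` because the model was created without a mapping from label ids to category names, which is why the four lookups above were needed. A minimal sketch of decoding those names is shown below. It contains only the four id-to-name pairs confirmed by the lookups above; in the notebook the full mapping could be built with `dict(enumerate(label_encoder.classes_))`. `decode_pipeline_label` is a hypothetical helper name.

```python
# Only the pairs confirmed by the lookups above; the full mapping has 77 entries.
id2label = {
    14: "card_linking",
    18: "card_payment_wrong_exchange_rate",
    22: "change_pin",
    33: "exchange_rate",
}


def decode_pipeline_label(raw_label, id2label):
    """Turn a default pipeline label such as 'LABEL_14' into a category name.

    Falls back to the raw label when the id is not in the mapping.
    """
    index = int(raw_label.rsplit("_", 1)[1])
    return id2label.get(index, raw_label)


for raw in ["LABEL_14", "LABEL_33", "LABEL_18", "LABEL_22"]:
    print(f"{raw} -> {decode_pipeline_label(raw, id2label)}")
```

Alternatively, passing `id2label` and `label2id` dictionaries to `AutoModelForSequenceClassification.from_pretrained` when the model is created stores the mapping in the model config, so the pipeline reports category names directly.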


## Conclusion
In this project, I successfully achieved the goal of building a model for multi-class text classification to predict categories of banking queries. I started by analyzing the data, examining the dataset's structure, the number of classes, and class balance, which helped me better understand the data before training the model.

To solve the classification task, I used a transformer from the Hugging Face library based on the pre-trained DistilBERT model. The model was fine-tuned on the training dataset, which I split into training and validation sets. This approach allowed me to monitor the model's performance after every epoch and keep the best checkpoint.

I chose F1-macro as the primary metric since it considers both precision and recall, averaging them across all classes. This is especially important when working with imbalanced data.

On the test set, the model achieved an F1-macro score of 0.9186, exceeding both the required threshold of 90% and the 0.88 of the logistic regression baseline. I also tried the model on four hand-written banking queries: each was assigned the expected intent, although the card-linking query received a low confidence score (0.3451).

In conclusion, the project met its goal: the fine-tuned DistilBERT model exceeds the required metric threshold and classifies the new example queries correctly.
