# About the Data

* The data was taken from 
https://www.eba.europa.eu/sites/default/documents/files/documents/10180/2321183/b67323ac-27fa-482d-926e-ae7ba3e90cb8/Annex%20III%20%28Annex%205%20%28FINREP%29%29.pdf?msclkid=fc3c7b21b0af11ec8242e40ef60465b7

* The data consists of four columns: Rule_number, Text, Topic and Character_count.<br>
  - Rule_number: FINREP rule number.
  - Text: Text of the rule.
  - Topic: Category to which the rule belongs.
  - Character_count: The number of characters in 'Text'.

* The rules under the following headings (pages 16 to 46 of the source document) are taken as text:<br>

  1. Equity
  2. Statement of profit or loss
  3. Statement of comprehensive income
  4. Breakdown of financial assets by instrument and by counterparty sector
  5. Breakdown of non-trading loans and advances by product
  6. Breakdown of non-trading loans and advances to non-financial corporations by NACE codes
  7. Financial assets subject to impairment that are past due
  8. Breakdown of financial liabilities
  9. Loan commitments, financial guarantees and other commitments
  10. Derivatives and hedge accounting
  11. Movements in allowances and provisions for credit losses
  12. Collateral and guarantees received

* Rules are separated by their rule number. For instance, 'Rule no. 175' and 'Rule no. 175i' are treated as separate texts.





# Text Preprocessing

* The line break ‘\n’ is removed from the text.

* Round and square parentheses are removed.

* Text contained within round parentheses is removed.

* Text contained within square parentheses is removed.

* Stop words are removed manually unless they are the first or last word of a sentence. The stop words are
'a', 'are', 'shall', 'those', 'the', 'which', 'has', 'been', 'of', 'by', 'to', 'at', 'is', 'an', 'in', 'for', 'be', 'it' and 'such'.

* No summarization was done.
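
As a rough illustration of the steps listed above, the sketch below applies them to a made-up sentence. The sample text and the helper name `preprocess` are assumptions for illustration only; the pattern and the stop-word list are the ones used later in this notebook.

```python
import re

# Hypothetical sample text, not taken from the FINREP data.
sample = "The amount (see IAS 32) shall be reported in the template [Annex V] of the entity.\n"

# Manual stop-word list from the section above.
STOP_WORDS = ['a', 'are', 'shall', 'those', 'the', 'which', 'has', 'been', 'of',
              'by', 'to', 'at', 'is', 'an', 'in', 'for', 'be', 'it', 'such']

# Line breaks, bracketed spans, and whitespace directly next to a bracketed span.
PATTERN = r'\(.*?\)|\[.*?\]|\n|\s\(.*?\)|\s\[.*?\]|\(.*?\)\s|\[.*?\]\s|\s\(.*?\)\s|\s\[.*?\]\s'

def preprocess(text):
    # Drop line breaks and bracketed text.
    text = re.sub(PATTERN, '', text)
    # Drop stop words that have a space on both sides, so the first
    # and last word of the text are never touched.
    for word in STOP_WORDS:
        text = re.sub(' ' + word + ' ', ' ', text)
    # Collapse any repeated spaces left behind.
    return re.sub(' +', ' ', text)

print(preprocess(sample))
# The amount reported template entity.
```

The leading "The" survives because the match is case-sensitive and requires a space on both sides of the word.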


## Next Steps

* Fine-tuning DistilBERT model for text classification





## Google Drive access

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Write the appropriate paths to retrieve the data and store results 
data_path  = '/content/drive/MyDrive/Full_Code/FINREP/Dataset/FINREP_Regulations.csv'
saved_path = '/content/drive/MyDrive/Full_Code/FINREP/Manual_Stopwords_Removed/'

## Set-up environment

Install folium version 0.2.1 and the HuggingFace Datasets library.

In [None]:
#Install the package folium version 0.2.1 and HuggingFace datasets library
!pip install -q folium==0.2.1 datasets 



In [None]:
import torch
#Check if GPU is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device, "is available")

cpu is available


## Memory Allocated

In [None]:
!free -h --si | awk  '/Mem:/{print $2}'

13G


# Loading the dataset

In [None]:
#Load the data
import pandas as pd

df = pd.read_csv(data_path)
#Display the first five rows in df
df.head()


Unnamed: 0,Rule_number,Text,Topic,Character_count
0,16.0,Under IFRS equity instruments that are financi...,Equity,117
1,17.0,"Under the relevant national GAAP based on BAD,...",Equity,683
2,18.0,‘Equity component of compound financial instru...,Equity,416
3,19.0,‘Other equity instruments issued’ shall includ...,Equity,176
4,20.0,‘Other equity’ shall comprise all equity instr...,Equity,173


In [None]:
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Rule_number      165 non-null    object
 1   Text             165 non-null    object
 2   Topic            165 non-null    object
 3   Character_count  165 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 5.3+ KB


In [None]:
#Convert the data type of 'Text' to string
df['Text'] = df['Text'].astype(str)
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Rule_number      165 non-null    object
 1   Text             165 non-null    object
 2   Topic            165 non-null    object
 3   Character_count  165 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 5.3+ KB


In [None]:
#Define a function to calculate the number of words in a text
def count_words(example):
  number_of_words = len(example.split()) 
  return number_of_words 


#Calculate the number of words for each 'Text' row in the dataframe df
df['number_of_words_Text'] = df['Text'].map(lambda row: count_words(row))
#Display Descriptive statistics about the 'number_of_words_Text' column in the dataframe df
df['number_of_words_Text'].describe(include='all') 

count    165.000000
mean      77.800000
std       61.292004
min        9.000000
25%       41.000000
50%       62.000000
75%       97.000000
max      501.000000
Name: number_of_words_Text, dtype: float64
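
To make the word-count definition concrete, here is a minimal sketch of how `str.split()` counts words. The sample string is made up for illustration: any run of whitespace (including line breaks) separates words, and punctuation or brackets stay attached to the neighbouring word.

```python
def count_words(text):
    # str.split() with no argument splits on any run of whitespace.
    return len(text.split())

# Hypothetical sample text, not taken from the FINREP data.
sample = "Loans and advances (gross)\nshall be   reported."

print(sample.split())
print(count_words(sample))
# 7
```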

In [None]:
#Define a function remove_pattern to remove newline characters, parentheses and the text enclosed in parentheses
import re
def remove_pattern(example):
  #Raw string so that \( \[ \s reach the regex engine without invalid-escape warnings
  #Alternatives starting with \s also drop the whitespace character in front of a bracket
  patterns = r'\(.*?\)|\[.*?\]|\n|\s\(.*?\)|\s\[.*?\]|\(.*?\)\s|\[.*?\]\s|\s\(.*?\)\s|\s\[.*?\]\s'
  pattern_removed_text = re.sub(patterns,'',example)
  return pattern_removed_text

#Remove newline characters, parentheses and the text enclosed in parentheses from 'Text' in the dataframe df
df['Pattern_Removed_Text'] = df['Text'].map(lambda row : remove_pattern(row))

In [None]:
# Remove the stopwords from the text by manually specifying the list of stop words:
def remove_stop_words(example,words):
  for word in words:
    example = re.sub(' '+ word + ' ',' ',example)
  extra_space_removed = re.sub(' +',' ',example)
  return  extra_space_removed 

#'that','if','no', 'not', 'as','but','on' are retained
stop_words = ['a','are','shall','those','the','which','has','been','of','by','to','at','is','an','in','for','be','it','such']

#Remove the words in stop_words from 'Pattern_Removed_Text' in the dataframe df
df['Stopwords_Removed_Text'] = df['Pattern_Removed_Text'].map(lambda row: remove_stop_words(row,stop_words))

#Calculate the number of words in each 'Stopwords_Removed_Text' row in the df
df['number_of_words_stopwords_removed_Text'] = df['Stopwords_Removed_Text'].map(lambda x: len(x.split()))


In [None]:
#Display Descriptive statistics about the 'number_of_words_stopwords_removed_Text' column in the dataframe df
df['number_of_words_stopwords_removed_Text'].describe(include='all') 

count    165.000000
mean      54.927273
std       41.401450
min        7.000000
25%       29.000000
50%       45.000000
75%       71.000000
max      341.000000
Name: number_of_words_stopwords_removed_Text, dtype: float64
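
To put the two `describe()` outputs side by side, the sketch below computes the relative reduction in word count. The numbers are copied from the outputs above; the helper name `relative_reduction` is an assumption for illustration. Note that the reduction reflects both the bracket removal and the stop-word removal.

```python
# Summary statistics copied from the two describe() outputs above.
before = {"mean": 77.8, "median": 62, "max": 501}
after = {"mean": 54.927273, "median": 45, "max": 341}

def relative_reduction(old, new):
    """Share of words removed, as a fraction of the original count."""
    return (old - new) / old

for key in before:
    print(f"{key}: {relative_reduction(before[key], after[key]):.1%}")
# mean: 29.4%
# median: 27.4%
# max: 31.9%
```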

In [None]:
#Save the dataframe df to Drive for future reference
df.to_csv(saved_path + "FINREP_data_stopwords_removed.csv", encoding='utf-8', index=False)
print("\n Saved: FINREP_data_stopwords_removed.csv")


 Saved: FINREP_data_stopwords_removed.csv


In [None]:
#Keep the entries in the dataframe df whose 'number_of_words_stopwords_removed_Text' is at least 7
df1 = df.query('number_of_words_stopwords_removed_Text >= 7')
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 0 to 164
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Rule_number                             165 non-null    object
 1   Text                                    165 non-null    object
 2   Topic                                   165 non-null    object
 3   Character_count                         165 non-null    int64 
 4   number_of_words_Text                    165 non-null    int64 
 5   Pattern_Removed_Text                    165 non-null    object
 6   Stopwords_Removed_Text                  165 non-null    object
 7   number_of_words_stopwords_removed_Text  165 non-null    int64 
dtypes: int64(3), object(5)
memory usage: 11.6+ KB


In [None]:
#Drop Duplicate entries in 'Stopwords_Removed_Text' if there exist any in the dataframe df1
unique_df  = df1.drop_duplicates(subset=['Stopwords_Removed_Text'])
unique_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 0 to 164
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Rule_number                             165 non-null    object
 1   Text                                    165 non-null    object
 2   Topic                                   165 non-null    object
 3   Character_count                         165 non-null    int64 
 4   number_of_words_Text                    165 non-null    int64 
 5   Pattern_Removed_Text                    165 non-null    object
 6   Stopwords_Removed_Text                  165 non-null    object
 7   number_of_words_stopwords_removed_Text  165 non-null    int64 
dtypes: int64(3), object(5)
memory usage: 11.6+ KB


In [None]:
#Create dataframe data with the columns 'Stopwords_Removed_Text' and 'Topic' from unique_df
data = unique_df[['Stopwords_Removed_Text','Topic']].copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165 entries, 0 to 164
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Stopwords_Removed_Text  165 non-null    object
 1   Topic                   165 non-null    object
dtypes: object(2)
memory usage: 3.9+ KB


In [None]:
#Reset the index in data
data = data.reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Stopwords_Removed_Text  165 non-null    object
 1   Topic                   165 non-null    object
dtypes: object(2)
memory usage: 2.7+ KB


In [None]:
#Convert the Data into HuggingFace Dataset
from datasets import Dataset
dataset = Dataset.from_pandas(data)
dataset


Dataset({
    features: ['Stopwords_Removed_Text', 'Topic'],
    num_rows: 165
})

Let's look at the features of the dataset

In [None]:
dataset.features

{'Stopwords_Removed_Text': Value(dtype='string', id=None),
 'Topic': Value(dtype='string', id=None)}

The dataset has to be split into training, validation and test sets. Let's check the first example of the dataset:



In [None]:
example = dataset[0]
example

{'Stopwords_Removed_Text': 'Under IFRS equity instruments that financial instruments include contracts under scope IAS 32.',
 'Topic': 'Equity'}

Let's sort the dataset by Topic name

In [None]:
dataset= dataset.sort('Topic')

Let's rename the column Topic to label

In [None]:
dataset = dataset.rename_column("Topic", "label")
dataset

Dataset({
    features: ['Stopwords_Removed_Text', 'label'],
    num_rows: 165
})

In [None]:
import collections
#Define the function to print the frequency count of the elements in a list
def frequency_count(mylist):
  frequency = collections.Counter(mylist)
  for key, value in frequency.items():
    print(key,':',value)

In [None]:
print("The Frequency of label in the Dataset : \n")
frequency_count(dataset['label'])

The Frequency of label in the Dataset : 

BREAKDOWN OF FINANCIAL ASSETS BY INSTRUMENT AND BY COUNTERPARTY SECTOR : 16
BREAKDOWN OF FINANCIAL LIABILITIES : 5
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES BY PRODUCT : 7
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES TO NON-FINANCIAL CORPORATIONS BY NACE CODES : 4
COLLATERAL AND GUARANTEES RECEIVED : 8
DERIVATIVES AND HEDGE ACCOUNTING : 33
Equity : 15
FINANCIAL ASSETS SUBJECT TO IMPAIRMENT THAT ARE PAST DUE : 3
LOAN COMMITMENTS, FINANCIAL GUARANTEES AND OTHER COMMITMENTS : 18
MOVEMENTS IN ALLOWANCES AND PROVISIONS FOR CREDIT LOSSES : 19
STATEMENT OF COMPREHENSIVE INCOME : 10
STATEMENT OF PROFIT OR LOSS : 27


The dataset consists of 12 labels.
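
The label counts above are quite uneven. As a small worked example, the sketch below recomputes the total and the share of the largest and smallest class from the printed frequencies (the counts are copied from the output above; the variable names are illustrative).

```python
# Label counts copied from the frequency output above.
label_counts = {
    "BREAKDOWN OF FINANCIAL ASSETS BY INSTRUMENT AND BY COUNTERPARTY SECTOR": 16,
    "BREAKDOWN OF FINANCIAL LIABILITIES": 5,
    "BREAKDOWN OF NON-TRADING LOANS AND ADVANCES BY PRODUCT": 7,
    "BREAKDOWN OF NON-TRADING LOANS AND ADVANCES TO NON-FINANCIAL CORPORATIONS BY NACE CODES": 4,
    "COLLATERAL AND GUARANTEES RECEIVED": 8,
    "DERIVATIVES AND HEDGE ACCOUNTING": 33,
    "Equity": 15,
    "FINANCIAL ASSETS SUBJECT TO IMPAIRMENT THAT ARE PAST DUE": 3,
    "LOAN COMMITMENTS, FINANCIAL GUARANTEES AND OTHER COMMITMENTS": 18,
    "MOVEMENTS IN ALLOWANCES AND PROVISIONS FOR CREDIT LOSSES": 19,
    "STATEMENT OF COMPREHENSIVE INCOME": 10,
    "STATEMENT OF PROFIT OR LOSS": 27,
}

total = sum(label_counts.values())
largest = max(label_counts, key=label_counts.get)
smallest = min(label_counts, key=label_counts.get)

print(total)                                                # 165
print(f"{largest}: {label_counts[largest] / total:.1%}")    # 20.0%
print(f"{smallest}: {label_counts[smallest] / total:.1%}")  # 1.8%
```

With only 3 examples in the smallest class, a three-way stratified split can give each subset at most one example of it.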

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [None]:
labels = sorted(list(set(dataset['label'])))
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['BREAKDOWN OF FINANCIAL ASSETS BY INSTRUMENT AND BY COUNTERPARTY SECTOR',
 'BREAKDOWN OF FINANCIAL LIABILITIES',
 'BREAKDOWN OF NON-TRADING LOANS AND ADVANCES BY PRODUCT',
 'BREAKDOWN OF NON-TRADING LOANS AND ADVANCES TO NON-FINANCIAL CORPORATIONS BY NACE CODES',
 'COLLATERAL AND GUARANTEES RECEIVED',
 'DERIVATIVES AND HEDGE ACCOUNTING',
 'Equity',
 'FINANCIAL ASSETS SUBJECT TO IMPAIRMENT THAT ARE PAST DUE',
 'LOAN COMMITMENTS, FINANCIAL GUARANTEES AND OTHER COMMITMENTS',
 'MOVEMENTS IN ALLOWANCES AND PROVISIONS FOR CREDIT LOSSES',
 'STATEMENT OF COMPREHENSIVE INCOME',
 'STATEMENT OF PROFIT OR LOSS']
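
A minimal sketch of how the two mappings could be used, assuming a later step needs integer class ids instead of label strings. Only three of the twelve labels are used here to keep the example short, and the `sample_labels` list is hypothetical.

```python
# Three of the twelve labels, for brevity.
labels = sorted({"Equity",
                 "STATEMENT OF PROFIT OR LOSS",
                 "DERIVATIVES AND HEDGE ACCOUNTING"})
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

# Hypothetical label column.
sample_labels = ["Equity", "STATEMENT OF PROFIT OR LOSS",
                 "DERIVATIVES AND HEDGE ACCOUNTING", "Equity"]

# Encode strings to integer ids, then decode them back.
encoded = [label2id[label] for label in sample_labels]
decoded = [id2label[idx] for idx in encoded]

print(encoded)                   # [1, 2, 0, 1]
print(decoded == sample_labels)  # True
```

Sorting the label set first keeps the id assignment stable between runs, since plain set iteration order is not guaranteed.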

## Splitting the Data into Train, Validation and Test set






In [None]:
fix_seed = 42
from sklearn.model_selection import train_test_split

# Split data into train and val_test set 
X= dataset['Stopwords_Removed_Text']
y= dataset['label']
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.48,stratify=y ,random_state=fix_seed)

# Split data into val and test set 
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, stratify = y_val_test ,random_state=fix_seed)
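
The resulting subset sizes can be checked by hand. The sketch below is plain arithmetic, assuming the held-out share is rounded up to a whole number of rows, which is consistent with the 85/40/40 sizes shown further down; the variable names are illustrative.

```python
import math

n_samples = 165

# First split: 48% of the rows go to the combined validation/test set.
n_val_test = math.ceil(0.48 * n_samples)  # 79.2 rounded up
n_train = n_samples - n_val_test

# Second split: half of the held-out rows become the test set.
n_test = math.ceil(0.5 * n_val_test)
n_val = n_val_test - n_test

print(n_train, n_val, n_test)
# 85 40 40
```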

In [None]:
# Create the HuggingFace Dataset train_data
dict_train = {"sentence": X_train,"label": y_train}
train_data = Dataset.from_dict(dict_train)
train_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 85
})

In [None]:
#Create the HuggingFace Dataset val_data
dict_val = {"sentence": X_val,"label":y_val}
val_data = Dataset.from_dict(dict_val)
val_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 40
})

In [None]:
#Create the HuggingFace Dataset test_data
dict_test = {"sentence": X_test,"label":y_test}
test_data = Dataset.from_dict(dict_test)
test_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 40
})

Let's check the frequency count of the labels in the train, validation and test sets

In [None]:
#Print the frequency count of label in train_data
print("The Frequency of label in train_data : \n")
frequency_count(train_data['label'])

The Frequency of label in train_data : 

COLLATERAL AND GUARANTEES RECEIVED : 4
MOVEMENTS IN ALLOWANCES AND PROVISIONS FOR CREDIT LOSSES : 10
BREAKDOWN OF FINANCIAL LIABILITIES : 3
BREAKDOWN OF FINANCIAL ASSETS BY INSTRUMENT AND BY COUNTERPARTY SECTOR : 8
LOAN COMMITMENTS, FINANCIAL GUARANTEES AND OTHER COMMITMENTS : 9
STATEMENT OF PROFIT OR LOSS : 14
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES BY PRODUCT : 4
DERIVATIVES AND HEDGE ACCOUNTING : 17
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES TO NON-FINANCIAL CORPORATIONS BY NACE CODES : 2
STATEMENT OF COMPREHENSIVE INCOME : 5
Equity : 8
FINANCIAL ASSETS SUBJECT TO IMPAIRMENT THAT ARE PAST DUE : 1


In [None]:
#Print the frequency count of label in val_data
print("The Frequency of label in val_data : \n")
frequency_count(val_data['label'])

The Frequency of label in val_data : 

BREAKDOWN OF FINANCIAL ASSETS BY INSTRUMENT AND BY COUNTERPARTY SECTOR : 4
STATEMENT OF PROFIT OR LOSS : 7
BREAKDOWN OF FINANCIAL LIABILITIES : 1
LOAN COMMITMENTS, FINANCIAL GUARANTEES AND OTHER COMMITMENTS : 4
DERIVATIVES AND HEDGE ACCOUNTING : 8
MOVEMENTS IN ALLOWANCES AND PROVISIONS FOR CREDIT LOSSES : 4
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES BY PRODUCT : 2
Equity : 4
COLLATERAL AND GUARANTEES RECEIVED : 2
STATEMENT OF COMPREHENSIVE INCOME : 2
FINANCIAL ASSETS SUBJECT TO IMPAIRMENT THAT ARE PAST DUE : 1
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES TO NON-FINANCIAL CORPORATIONS BY NACE CODES : 1


In [None]:
#Print the frequency count of label in test_data
print("The Frequency of label in test_data : \n")
frequency_count(test_data['label'])

The Frequency of label in test_data : 

Equity : 3
BREAKDOWN OF FINANCIAL ASSETS BY INSTRUMENT AND BY COUNTERPARTY SECTOR : 4
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES BY PRODUCT : 1
STATEMENT OF COMPREHENSIVE INCOME : 3
LOAN COMMITMENTS, FINANCIAL GUARANTEES AND OTHER COMMITMENTS : 5
DERIVATIVES AND HEDGE ACCOUNTING : 8
STATEMENT OF PROFIT OR LOSS : 6
MOVEMENTS IN ALLOWANCES AND PROVISIONS FOR CREDIT LOSSES : 5
FINANCIAL ASSETS SUBJECT TO IMPAIRMENT THAT ARE PAST DUE : 1
COLLATERAL AND GUARANTEES RECEIVED : 2
BREAKDOWN OF NON-TRADING LOANS AND ADVANCES TO NON-FINANCIAL CORPORATIONS BY NACE CODES : 1
BREAKDOWN OF FINANCIAL LIABILITIES : 1


In [None]:
#Create dataset_clean to store the train_data, val_data and test_data
from datasets.dataset_dict import DatasetDict
dataset_clean = DatasetDict({
    'train': train_data,
    'validation': val_data,
    'test': test_data
})
dataset_clean

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 85
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 40
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 40
    })
})

## Save the Data

In [None]:
#Save the HuggingFace Dataset dataset_clean in drive
dataset_clean.save_to_disk(saved_path  + "dataset_clean")
print("\nSaved dataset_clean")


Saved dataset_clean
