# About the Data

* The data was taken from 
https://m.rbi.org.in//scripts/PublicationReportDetails.aspx?ID=242

* Each rule is treated as a record under one of the following headings:<br>

  1. Enhancing Bank Transparency
  2. Best Practices for Credit Risk Disclosure
  3. Supervision of Financial Conglomerates 
  4. Risk Concentrations Principles
  5. Intra-Group Transactions and Exposures Principles
  6. Principles for the Supervision of Banks’ Foreign Establishments (The Basel Concordat)
  7. Information Flows Between Banking Supervisory Authorities
  8. Minimum Standards for the Supervision of International Banking Groups and their Cross-Border Establishments
  9. The Supervision of Cross-Border Banking 
  
* Rules are separated by their rule number.


# Text Preprocessing

* The line break ‘\n’ is removed from the text, if present.

* Round parentheses and square brackets are removed, if present.

* Text contained within round parentheses is removed, if present.

* Text contained within square brackets is removed, if present.

* The stop words are removed manually unless they are the first or last word of a sentence; only lower-case occurrences with a space on both sides are matched. The stop words are 
'a','are','shall','those','the','which','has','been','of','by','to','at','is','an','in','for','be','it' and 'such'.

* No summarization was done.

## Next Steps

* Fine-tuning DistilBERT model for text classification





## Google Drive access

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Write the appropriate paths to retrieve the data and store results 
data_path  = '/content/drive/MyDrive/Full_Code/RBI/Dataset/RBI_Regulations.csv'
saved_path = '/content/drive/MyDrive/Full_Code/RBI/Stopwords_Removed/'

## Set-up environment

Install the libraries folium version 0.2.1 and HuggingFace Datasets

In [None]:
#Install the package folium version 0.2.1 and HuggingFace datasets library
!pip install -q folium==0.2.1 datasets 



In [None]:
import torch
#Check if GPU is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device, "is available")

cpu is available


## Memory Allocated

In [None]:
!free -h --si | awk  '/Mem:/{print $2}'

13G


# Loading the dataset

In [None]:
#Load the data
import pandas as pd

df = pd.read_csv(data_path)
#Display the first five rows in df
df.head()


Unnamed: 0,Principle number,Principle,Indian Position,Remarks,Sub-heading,Topic
0,1.1,"The Basel Committee recommends that banks, in ...",Banks’ financial reporting broadly encompasses...,All these six broad categories of information ...,1.0 General Level,Enhancing Bank Transparency
1,1.2,The scope and content of information provided ...,Irrespective of the size and nature of a bank’...,,1.0 General Level,Enhancing Bank Transparency
2,1.3,In countries with less developed financial mar...,This principle is acceptable. The level of com...,,1.0 General Level,Enhancing Bank Transparency
3,2.1.1,"Information about the performance of a bank, i...",RBI is committed to enhance and improve the le...,"However, we would have to go beyond these disc...",2.0 Details in disclosure 2.1 Financial Perfor...,Enhancing Bank Transparency
4,2.1.2,"To assess the financial performance of a bank,...",The income statement usually includes items fo...,,2.0 Details in disclosure 2.1 Financial Perfor...,Enhancing Bank Transparency


In [None]:
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Principle number  196 non-null    object
 1   Principle         206 non-null    object
 2   Indian Position   199 non-null    object
 3   Remarks           85 non-null     object
 4   Sub-heading       148 non-null    object
 5   Topic             206 non-null    object
dtypes: object(6)
memory usage: 9.8+ KB


In [None]:
#Convert the data type of 'Principle' and 'Indian Position' to string
df['Principle'] = df['Principle'].astype(str)
df['Indian Position'] = df['Indian Position'].astype(str)
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Principle number  196 non-null    object
 1   Principle         206 non-null    object
 2   Indian Position   206 non-null    object
 3   Remarks           85 non-null     object
 4   Sub-heading       148 non-null    object
 5   Topic             206 non-null    object
dtypes: object(6)
memory usage: 9.8+ KB
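
The non-null count of 'Indian Position' rises from 199 to 206 after the conversion, which suggests that the seven missing entries became the literal string 'nan' rather than being dropped. The sketch below illustrates this on a made-up two-row Series. It uses `map(str)` to apply Python's `str()` element-wise, which is an assumption about how `astype(str)` behaved in the pandas version that produced the output above; newer pandas releases may keep missing values as missing.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing entry
toy = pd.Series(["Banks disclose risk exposures", np.nan], dtype=object)

# Element-wise str() turns the missing value into the text 'nan'
as_text = toy.map(str)
print(as_text.tolist())

# A row holding 'nan' then counts as a single word
word_counts = as_text.map(lambda text: len(text.split()))
print(word_counts.tolist())
```

If this is what happened here, it would be consistent with the one-word minimum reported for 'Indian Position', and such rows would not survive the 7-word filter applied later.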


In [None]:
#Define a function to calculate the number of words in a text
def count_words(example):
  number_of_words = len(example.split()) 
  return number_of_words 


#Calculate the number of words for each 'Principle' row in the dataframe df
df['number_of_words_Principle'] = df['Principle'].map(lambda row: count_words(row))
#Display Descriptive statistics about the 'number_of_words_Principle' column in the dataframe df
df['number_of_words_Principle'].describe(include='all') 

count    206.000000
mean      49.150485
std       31.353090
min        7.000000
25%       26.000000
50%       39.500000
75%       66.750000
max      174.000000
Name: number_of_words_Principle, dtype: float64

In [None]:
#Calculate the number of words for each 'Indian Position' row in the dataframe df
df['number_of_words_Indian_Position'] = df['Indian Position'].apply(lambda x: len(x.split()))
#Display Descriptive statistics about the 'number_of_words_Indian_Position' column in the dataframe df
df['number_of_words_Indian_Position'].describe(include='all') 

count    206.000000
mean      29.368932
std       26.923556
min        1.000000
25%        8.000000
50%       24.000000
75%       41.000000
max      175.000000
Name: number_of_words_Indian_Position, dtype: float64

In [None]:
#Define a function remove_pattern to remove newline characters and any text enclosed in round or square brackets (together with the brackets) from the text
import re
def remove_pattern(example):
  #Raw string so that the regex escapes (\(, \[, \s) are not interpreted as Python string escapes
  patterns = r'\(.*?\)|\[.*?\]|\n|\s\(.*?\)|\s\[.*?\]|\(.*?\)\s|\[.*?\]\s|\s\(.*?\)\s|\s\[.*?\]\s'
  pattern_removed_text = re.sub(patterns,'',example)
  return pattern_removed_text

#Remove the newline character, parenthesis and the text present within parenthesis in 'Principle' in the dataframe df
df['Pattern_Removed_Principle'] = df['Principle'].map(lambda row : remove_pattern(row))

#Remove the newline character, parenthesis and the text present within parenthesis in 'Indian Position' in the dataframe df
df['Pattern_Removed_Indian_Position'] = df['Indian Position'].map(lambda row : remove_pattern(row))
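
To see what the pattern does in practice, the following self-contained sketch applies the same alternatives to a made-up sentence (the input is hypothetical and not taken from the dataset). It is only an illustration of the regex behaviour.

```python
import re

def remove_pattern(example):
    # Same alternatives as the notebook's pattern, written as a raw string
    patterns = r'\(.*?\)|\[.*?\]|\n|\s\(.*?\)|\s\[.*?\]|\(.*?\)\s|\[.*?\]\s|\s\(.*?\)\s|\s\[.*?\]\s'
    return re.sub(patterns, '', example)

# Hypothetical input containing both bracket types and a line break
sample = "Banks (including branches) must disclose\nrisk [see note 1] exposures."
cleaned = remove_pattern(sample)
print(cleaned)  # Banks must discloserisk exposures.
```

Note that the line break is deleted rather than replaced by a space, so the words on either side of it are fused ("discloserisk"). The same effect is visible in the first dataset example further down, for instance in "andbasic".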



In [None]:
# Remove the stopwords from the text by manually specifying the list of stop words:

def remove_stop_words(example,words):
  for word in words:
    example = re.sub(' '+ word + ' ',' ',example)
  extra_space_removed = re.sub(' +',' ',example)
  return  extra_space_removed 

#'that','if','no', 'not', 'as','but','on' are retained
stop_words = ['a','are','shall','those','the','which','has','been','of','by','to','at','is','an','in','for','be','it','such']

#Remove the words in stop_words from 'Pattern_Removed_Principle' in the dataframe df
df['Stopwords_Removed_Principle'] = df['Pattern_Removed_Principle'].map(lambda row: remove_stop_words(row,stop_words))
#Remove the words in stop_words from 'Pattern_Removed_Indian_Position' in the dataframe df
df['Stopwords_Removed_Indian_Position'] = df['Pattern_Removed_Indian_Position'].map(lambda row: remove_stop_words(row,stop_words))


#Calculate the number of words in each 'Stopwords_Removed_Principle' row in the df
df['number_of_words_stopwords_removed_Principle'] = df['Stopwords_Removed_Principle'].map(lambda x: len(x.split()))

#Calculate the number of words in each 'Stopwords_Removed_Indian_Position' row in the df
df['number_of_words_stopwords_removed_Indian_Position'] = df['Stopwords_Removed_Indian_Position'].map(lambda x: len(x.split()))
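
The sketch below runs the same stop-word logic on two made-up sentences to show which occurrences are removed. The inputs are hypothetical; the function body mirrors the one defined in this cell.

```python
import re

def remove_stop_words(example, words):
    # Replace each stop word that has a space on both sides with a single space
    for word in words:
        example = re.sub(' ' + word + ' ', ' ', example)
    # Collapse any run of spaces left behind
    return re.sub(' +', ' ', example)

stop_words = ['a','are','shall','those','the','which','has','been','of','by',
              'to','at','is','an','in','for','be','it','such']

# Stop words in the middle of the text are removed; capitalised 'The' is kept
middle = remove_stop_words("The bank shall be in compliance with the rules of the RBI", stop_words)
print(middle)  # The bank compliance with rules RBI

# Stop words at the very start or end have no space on one side, so they stay
edges = remove_stop_words("the report is submitted to the", stop_words)
print(edges)  # the report submitted the
```

Because matching is case-sensitive and requires surrounding spaces, a stop word followed directly by punctuation (for example "it.") is also retained.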



In [None]:
#Display Descriptive statistics about the 'number_of_words_stopwords_removed_Principle' column in the dataframe df
df['number_of_words_stopwords_removed_Principle'].describe(include='all') 

count    206.000000
mean      36.509709
std       22.674309
min        6.000000
25%       20.000000
50%       29.000000
75%       47.750000
max      118.000000
Name: number_of_words_stopwords_removed_Principle, dtype: float64

In [None]:
#Display Descriptive statistics about the 'number_of_words_stopwords_removed_Indian_Position' column in the dataframe df
df['number_of_words_stopwords_removed_Indian_Position'].describe(include='all') 

count    206.000000
mean      21.495146
std       19.673121
min        1.000000
25%        6.000000
50%       17.500000
75%       29.750000
max      116.000000
Name: number_of_words_stopwords_removed_Indian_Position, dtype: float64

In [None]:
#Save the dataframe df for future reference
df.to_csv(saved_path + "RBI_data_stopwords_removed.csv", encoding='utf-8', index=False)
print("\n Saved: RBI_data_stopwords_removed.csv")


 Saved: RBI_data_stopwords_removed.csv


In [None]:
#Filter entries for which the word count in both 'number_of_words_stopwords_removed_Principle' and 'number_of_words_stopwords_removed_Indian_Position' is at least 7 in the dataframe df
df1 = df.query('(number_of_words_stopwords_removed_Principle >= 7) & (number_of_words_stopwords_removed_Indian_Position >= 7)')
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 151 entries, 0 to 205
Data columns (total 14 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   Principle number                                   143 non-null    object
 1   Principle                                          151 non-null    object
 2   Indian Position                                    151 non-null    object
 3   Remarks                                            69 non-null     object
 4   Sub-heading                                        113 non-null    object
 5   Topic                                              151 non-null    object
 6   number_of_words_Principle                          151 non-null    int64 
 7   number_of_words_Indian_Position                    151 non-null    int64 
 8   Pattern_Removed_Principle                          151 non-null    object
 9   Pattern_Removed_India

In [None]:
#Drop Duplicate entries in 'Stopwords_Removed_Principle' if there exist any in the dataframe df1
df2 = df1.drop_duplicates(subset=['Stopwords_Removed_Principle'])
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 151 entries, 0 to 205
Data columns (total 14 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   Principle number                                   143 non-null    object
 1   Principle                                          151 non-null    object
 2   Indian Position                                    151 non-null    object
 3   Remarks                                            69 non-null     object
 4   Sub-heading                                        113 non-null    object
 5   Topic                                              151 non-null    object
 6   number_of_words_Principle                          151 non-null    int64 
 7   number_of_words_Indian_Position                    151 non-null    int64 
 8   Pattern_Removed_Principle                          151 non-null    object
 9   Pattern_Removed_India

In [None]:
#Drop Duplicate entries in 'Stopwords_Removed_Indian_Position' if there exist any in the dataframe df2
unique_df = df2.drop_duplicates(subset=['Stopwords_Removed_Indian_Position'])
unique_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 149 entries, 0 to 205
Data columns (total 14 columns):
 #   Column                                             Non-Null Count  Dtype 
---  ------                                             --------------  ----- 
 0   Principle number                                   141 non-null    object
 1   Principle                                          149 non-null    object
 2   Indian Position                                    149 non-null    object
 3   Remarks                                            69 non-null     object
 4   Sub-heading                                        112 non-null    object
 5   Topic                                              149 non-null    object
 6   number_of_words_Principle                          149 non-null    int64 
 7   number_of_words_Indian_Position                    149 non-null    int64 
 8   Pattern_Removed_Principle                          149 non-null    object
 9   Pattern_Removed_India
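
The filtering and de-duplication steps can be illustrated on a small made-up frame. The column names and values below are hypothetical and chosen only to keep the example short.

```python
import pandas as pd

# Toy frame: row 2 repeats the text of row 1, row 3 is too short
toy = pd.DataFrame({
    "principle": ["alpha text", "beta text", "beta text", "gamma text"],
    "n_principle": [9, 7, 7, 3],
    "n_position": [8, 10, 12, 20],
})

# Keep rows where both word counts reach the threshold of 7
filtered = toy.query("(n_principle >= 7) & (n_position >= 7)")

# Drop later rows that repeat an earlier 'principle' text (first occurrence is kept)
unique = filtered.drop_duplicates(subset=["principle"])
print(unique.index.tolist())  # [0, 1]
```

Both operations keep the original index labels, which is why the notebook's frames still report "0 to 205" after rows have been removed and why the index is reset in a later cell.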

In [None]:
#Create dataframe data with the columns 'Stopwords_Removed_Principle','Stopwords_Removed_Indian_Position'and 'Topic' from unique_df
data = unique_df[['Stopwords_Removed_Principle','Stopwords_Removed_Indian_Position','Topic']].copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 149 entries, 0 to 205
Data columns (total 3 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Stopwords_Removed_Principle        149 non-null    object
 1   Stopwords_Removed_Indian_Position  149 non-null    object
 2   Topic                              149 non-null    object
dtypes: object(3)
memory usage: 4.7+ KB


In [None]:
#Reset the index in data
data = data.reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 3 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Stopwords_Removed_Principle        149 non-null    object
 1   Stopwords_Removed_Indian_Position  149 non-null    object
 2   Topic                              149 non-null    object
dtypes: object(3)
memory usage: 3.6+ KB


In [None]:
#Convert the Data into HuggingFace Dataset
from datasets import Dataset
dataset = Dataset.from_pandas(data)
dataset


Dataset({
    features: ['Stopwords_Removed_Principle', 'Stopwords_Removed_Indian_Position', 'Topic'],
    num_rows: 149
})

Let's look at the features of the dataset

In [None]:
dataset.features

{'Stopwords_Removed_Indian_Position': Value(dtype='string', id=None),
 'Stopwords_Removed_Principle': Value(dtype='string', id=None),
 'Topic': Value(dtype='string', id=None)}

The dataset has to be split into training, validation and test sets. Let's check the first example of the dataset:



In [None]:
example = dataset[0]
example

{'Stopwords_Removed_Indian_Position': 'Banks’ financial reporting broadly encompasses financial performance and financial position and accounting policies. As regards information on basic business management and corporate governance, wide range practices prevalent from elaborate disclosures very little information.',
 'Stopwords_Removed_Principle': 'The Basel Committee recommends that banks, regular financial reporting and other public disclosures, provide timely information, facilitates market participants’ assessment banks. It identified following six broad categories information, each should addressed clear terms and appropriate detail help achieve satisfactory level bank transparency:financial performance;financial position;risk management strategies and practices;risk exposures;accounting policies; andbasic business, management and corporate governance information.',
 'Topic': 'Enhancing Bank Transparency'}

Let's sort the dataset by Topic name

In [None]:
dataset = dataset.sort('Topic')

Let's rename the column Topic to label

In [None]:
dataset = dataset.rename_column("Topic", "label")
dataset

Dataset({
    features: ['Stopwords_Removed_Principle', 'Stopwords_Removed_Indian_Position', 'label'],
    num_rows: 149
})

Let's create a list that contains the labels.

In [None]:
labels = list(sorted(set(dataset['label'])))
len(labels)

9

The dataset consists of 9 labels.
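
Classification models generally expect integer class ids rather than topic names. A minimal sketch of one way to derive such a mapping from a sorted label list follows. The names `label2id` and `id2label` are assumptions and not part of this notebook, and only three of the nine topics are used as stand-ins.

```python
# Stand-in for the notebook's `labels` list (three of the nine topic names)
labels = sorted({
    "Risk Concentrations Principles",
    "Enhancing Bank Transparency",
    "The Supervision of Cross-Border Banking",
})

# Hypothetical helper mappings between topic names and integer ids
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(label2id["Enhancing Bank Transparency"])  # 0
print(id2label[2])  # The Supervision of Cross-Border Banking
```

Sorting before enumerating keeps the ids stable across runs, since iteration order over a plain `set` is not guaranteed.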

## Splitting the Data into Train, Validation and Test set






In [None]:
fix_seed = 42
from sklearn.model_selection import train_test_split

In [None]:
#Split the 'Indian Position' texts evenly into validation and test sets, stratified by label
X_val_test = dataset['Stopwords_Removed_Indian_Position']
y_val_test = dataset['label']
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size = 0.5, stratify = y_val_test, random_state = fix_seed)
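
The effect of `stratify` can be checked on a small made-up label list. The sketch below uses hypothetical texts and two toy classes; it only illustrates that a stratified 50/50 split keeps the class proportions in both halves.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 8 examples of class 'a' and 4 of class 'b'
texts = [f"text {i}" for i in range(12)]
toy_labels = ["a"] * 8 + ["b"] * 4

half_1, half_2, y_half_1, y_half_2 = train_test_split(
    texts, toy_labels, test_size=0.5, stratify=toy_labels, random_state=42
)

# Each half should contain four 'a' and two 'b' examples
print(Counter(y_half_1), Counter(y_half_2))
```

In this notebook only the 'Indian Position' texts are split this way; the training set built in the next cell uses all 'Principle' texts.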

In [None]:
# Create the HuggingFace Dataset train_data from all 'Principle' texts
dict_train = {"sentence": dataset['Stopwords_Removed_Principle'],"label": dataset['label']}
train_data = Dataset.from_dict(dict_train)
train_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 149
})

In [None]:
#Create the HuggingFace Dataset val_data
dict_val = {"sentence": X_val,"label":y_val}
val_data = Dataset.from_dict(dict_val)
val_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 74
})

In [None]:
#Create the HuggingFace Dataset test_data
dict_test = {"sentence": X_test,"label":y_test}
test_data = Dataset.from_dict(dict_test)
test_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 75
})

Let's check the frequency of each label in the train, validation and test sets

In [None]:

import collections 
#Define the function to check the frequency count of elements in the list
def frequency_count(mylist):
  frequency = collections.Counter(mylist)
  #Print one 'label : count' line per distinct element
  for key, value in frequency.items():
    print(key,':',value)

In [None]:
#Print the frequency count of label in train_data
print("The Frequency of label in train_data : \n")
frequency_count(train_data['label'])

The Frequency of label in train_data : 

Best Practices for Credit Risk Disclosure : 10
Enhancing Bank Transparency : 45
Information Flows Between Banking Supervisory Authorities : 16
Intra-Group Transactions and Exposures Principles : 13
Minimum Standards for the Supervision of International Banking Groups and their Cross-Border Establishments : 3
Principles for the Supervision of Banks’ Foreign Establishments (The Basel Concordat) : 10
Risk Concentrations Principles : 11
Supervision of Financial Conglomerates : 20
The Supervision of Cross-Border Banking : 21


In [None]:
#Print the frequency count of label in val_data
print("The Frequency of label in val_data : \n")
frequency_count(val_data['label'])

The Frequency of label in val_data : 

Enhancing Bank Transparency : 22
The Supervision of Cross-Border Banking : 10
Principles for the Supervision of Banks’ Foreign Establishments (The Basel Concordat) : 5
Intra-Group Transactions and Exposures Principles : 6
Risk Concentrations Principles : 6
Best Practices for Credit Risk Disclosure : 5
Supervision of Financial Conglomerates : 10
Information Flows Between Banking Supervisory Authorities : 8
Minimum Standards for the Supervision of International Banking Groups and their Cross-Border Establishments : 2


In [None]:
#Print the frequency count of label in test_data
print("The Frequency of label in test_data : \n")
frequency_count(test_data['label'])

The Frequency of label in test_data : 

Information Flows Between Banking Supervisory Authorities : 8
Enhancing Bank Transparency : 23
Principles for the Supervision of Banks’ Foreign Establishments (The Basel Concordat) : 5
Supervision of Financial Conglomerates : 10
Best Practices for Credit Risk Disclosure : 5
The Supervision of Cross-Border Banking : 11
Intra-Group Transactions and Exposures Principles : 7
Risk Concentrations Principles : 5
Minimum Standards for the Supervision of International Banking Groups and their Cross-Border Establishments : 1
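
Raw counts are hard to compare across splits of different sizes. A small sketch that converts counts to shares is given below. The helper `label_shares` is hypothetical, and the toy lists reuse the 'Enhancing Bank Transparency' counts printed above (22 of 74 and 23 of 75), lumping all other topics together as "Other".

```python
from collections import Counter

def label_shares(label_list):
    """Return each label's share of the list, rounded to two decimals."""
    counts = Counter(label_list)
    total = len(label_list)
    return {label: round(count / total, 2) for label, count in sorted(counts.items())}

# Counts taken from the validation and test outputs; "Other" groups the remaining topics
val_labels = ["Enhancing Bank Transparency"] * 22 + ["Other"] * 52
test_labels = ["Enhancing Bank Transparency"] * 23 + ["Other"] * 52

print(label_shares(val_labels))
print(label_shares(test_labels))
```

The shares for the largest topic come out at roughly 0.30 and 0.31, which is what a stratified split is expected to produce.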


In [None]:

#Create dataset_clean to store the train_data, val_data and test_data
from datasets.dataset_dict import DatasetDict
dataset_clean = DatasetDict({
    'train': train_data,
    'validation': val_data,
    'test': test_data
})
dataset_clean

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 149
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 74
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 75
    })
})

## Save the Data

In [None]:
#Save the HuggingFace Dataset dataset_clean in drive
dataset_clean.save_to_disk(saved_path  + "dataset_clean")
print("\nSaved dataset_clean")


Saved dataset_clean
