# About the Data

* The data was taken from https://www.finra.org/rules-guidance/key-topics 

* The data consists of three columns: Rule, Text, and Topic.
  - Rule: FINRA Rule number. 
  - Text: List of rules.
  - Topic: Category to which the rule belongs.

* The rules under the following Key Topics are taken as text:

  1. Anti-Money Laundering
  2. Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273
  3. Business Continuity Planning
  4. Subordination Agreements

* Each new line of the Rule Book was converted into a separate text entry.
* Blank text is not considered.
* Prefixes at the beginning of the rules are retained.
* The word ‘and’ that marks the last rule in a list is removed.
* No summarization was done.
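
The conventions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `clean_rule_lines` and the sample rule text are hypothetical, and the original extraction may have been done differently (for example by hand).

```python
import re

def clean_rule_lines(raw_rule: str) -> list[str]:
    """Split a rule into one entry per line (hypothetical helper)."""
    cleaned = []
    for line in raw_rule.split("\n"):
        line = line.strip()
        if not line:
            continue  # blank text is not considered
        # Remove a trailing 'and' that only links to the last rule;
        # prefixes such as '(a)' are left untouched
        line = re.sub(r"\s+and$", "", line)
        cleaned.append(line)
    return cleaned

# Hypothetical sample text, not taken from the Rule Book
sample = "(a) keep records; and\n\n(b) report annually."
print(clean_rule_lines(sample))
```

Nothing is summarized or reworded here: each entry is the original line minus surrounding whitespace and the linking 'and'.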


# Text Preprocessing

* Only sentences containing more than 30 words are kept.
* Duplicate ‘Text’ entries are removed.




## Next Steps

* Fine-tuning DistilBERT/BERT model for text classification





## Google Drive access

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Write the appropriate paths to retrieve the data and store results 
data_path  = '/content/drive/MyDrive/Full_Code/FINRA/Strategy1/Dataset/Text_file_Strategy1.csv'
saved_path = '/content/drive/MyDrive/Full_Code/FINRA/Strategy1/'

## Set-up environment

Install the libraries folium version 0.2.1 and HuggingFace Datasets

In [None]:
#Install the package folium version 0.2.1 and HuggingFace datasets library
!pip install -q folium==0.2.1 datasets 



In [None]:
import torch
#Check if GPU is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device, "is available")

cpu is available


## Memory Allocated

In [None]:
!free -h --si | awk  '/Mem:/{print $2}'

13G


# Loading the dataset

In [None]:
#Load the data
import pandas as pd

df = pd.read_csv(data_path)
#Display the first five rows in df
df.head()


Unnamed: 0,Rule,Text,Topic
0,3310,(a) Establish and implement policies and proce...,Anti-Money Laundering
1,3310,"(b) Establish and implement policies, procedur...",Anti-Money Laundering
2,3310,(c) Provide for annual (on a calendar-year bas...,Anti-Money Laundering
3,3310,"(d) Designate and identify to FINRA (by name, ...",Anti-Money Laundering
4,3310,(e) Provide ongoing training for appropriate p...,Anti-Money Laundering


In [None]:
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rule    26 non-null     int64 
 1   Text    26 non-null     object
 2   Topic   26 non-null     object
dtypes: int64(1), object(2)
memory usage: 752.0+ bytes


In [None]:
#Convert the data type of 'Text' to string
df['Text'] = df['Text'].astype(str)
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rule    26 non-null     int64 
 1   Text    26 non-null     object
 2   Topic   26 non-null     object
dtypes: int64(1), object(2)
memory usage: 752.0+ bytes


Split the text whenever a new line occurs 

In [None]:
split_text= df['Text'].str.split('\n').explode()
split_text[:10]

0    (a) Establish and implement policies and proce...
1    (b) Establish and implement policies, procedur...
2    (c) Provide for annual (on a calendar-year bas...
3    (d) Designate and identify to FINRA (by name, ...
4    (e) Provide ongoing training for appropriate p...
5    (f) Include appropriate risk-based procedures ...
5    (i) Understanding the nature and purpose of cu...
5    (ii) Conducting ongoing monitoring to identify...
6    (a) Establish and implement policies and proce...
7    (b) Establish and implement policies, procedur...
Name: Text, dtype: object

In [None]:
split_text.name='Text'
del df['Text']
df1 = df.join(split_text).reset_index(drop=True)
df1.head(10)

Unnamed: 0,Rule,Topic,Text
0,3310,Anti-Money Laundering,(a) Establish and implement policies and proce...
1,3310,Anti-Money Laundering,"(b) Establish and implement policies, procedur..."
2,3310,Anti-Money Laundering,(c) Provide for annual (on a calendar-year bas...
3,3310,Anti-Money Laundering,"(d) Designate and identify to FINRA (by name, ..."
4,3310,Anti-Money Laundering,(e) Provide ongoing training for appropriate p...
5,3310,Anti-Money Laundering,(f) Include appropriate risk-based procedures ...
6,3310,Anti-Money Laundering,(i) Understanding the nature and purpose of cu...
7,3310,Anti-Money Laundering,(ii) Conducting ongoing monitoring to identify...
8,331,Anti-Money Laundering,(a) Establish and implement policies and proce...
9,331,Anti-Money Laundering,"(b) Establish and implement policies, procedur..."
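
To see why the other columns are repeated for every new line, here is a minimal sketch of the same split, explode, and join pattern on a toy dataframe. The toy values are made up for illustration and are not part of the FINRA data.

```python
import pandas as pd

# Toy data: the first row holds two lines of text, the second row one
toy = pd.DataFrame({
    "Rule": [1, 2],
    "Text": ["(a) first\n(b) second", "(a) only"],
})

# explode() emits one row per list element and repeats the original index
lines = toy["Text"].str.split("\n").explode()

# Joining on the index copies 'Rule' onto every exploded line
expanded = toy.drop(columns="Text").join(lines).reset_index(drop=True)
print(expanded)
```

Because the join is done on the index, resetting the index afterwards is what gives each line its own row number.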


In [None]:
#Define a function to calculate the number of words in a text
def count_words(example):
  # str.split() with no arguments splits on any run of whitespace
  return len(example.split())


#Calculate the number of words for each 'Text' row in the dataframe df1
df1['number_of_words_Text'] = df1['Text'].map(count_words)
#Display descriptive statistics about the 'number_of_words_Text' column in the dataframe df1
df1['number_of_words_Text'].describe()

count     63.000000
mean      36.460317
std       28.459833
min        3.000000
25%        9.000000
50%       30.000000
75%       54.000000
max      134.000000
Name: number_of_words_Text, dtype: float64
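
Note that the word count is based on whitespace only. A small sketch on made-up strings shows what that implies: the retained prefixes such as `(a)` count as words, and repeated spaces, tabs, or newlines do not inflate the count.

```python
def count_words(text: str) -> int:
    # Whitespace-separated tokens; punctuation stays attached to its word
    return len(text.split())

print(count_words("(a) Establish and implement policies"))  # 5, the prefix counts
print(count_words("  spaced   out\ttext\n"))                # 3
print(count_words(""))                                      # 0
```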

In [None]:
#Filter entries for which the word count in 'number_of_words_Text' is greater than 30 in the dataframe df1
df2 = df1.query('number_of_words_Text > 30')
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31 entries, 2 to 58
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Rule                  31 non-null     int64 
 1   Topic                 31 non-null     object
 2   Text                  31 non-null     object
 3   number_of_words_Text  31 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.2+ KB


In [None]:
#Drop Duplicate entries in 'Text' if there exist any in the dataframe df2
unique_df  = df2.drop_duplicates(subset=['Text'])
unique_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 2 to 58
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Rule                  30 non-null     int64 
 1   Topic                 30 non-null     object
 2   Text                  30 non-null     object
 3   number_of_words_Text  30 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.2+ KB


In [None]:
#Create dataframe data with the columns 'Text' and 'Topic' from unique_df
data = unique_df[['Text','Topic']].copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 2 to 58
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    30 non-null     object
 1   Topic   30 non-null     object
dtypes: object(2)
memory usage: 720.0+ bytes


In [None]:
#Reset the index in data
data = data.reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    30 non-null     object
 1   Topic   30 non-null     object
dtypes: object(2)
memory usage: 608.0+ bytes


In [None]:
#Convert the Data into HuggingFace Dataset
from datasets import Dataset
dataset = Dataset.from_pandas(data)
dataset


Dataset({
    features: ['Text', 'Topic'],
    num_rows: 30
})

Let's look at the features of the dataset

In [None]:
dataset.features

{'Text': Value(dtype='string', id=None),
 'Topic': Value(dtype='string', id=None)}

The dataset has to be split into training, validation, and test sets. Let's check the first example of the dataset:



In [None]:
example = dataset[0]
example

{'Text': '(c) Provide for annual (on a calendar-year basis) independent testing for compliance to be conducted by member personnel or by a qualified outside party, unless the member does not execute transactions for customers or otherwise hold customer accounts or act as an introducing broker with respect to customer accounts (e.g., engages solely in proprietary trading or conducts business only with other broker-dealers), in which case such "independent testing" is required every two years (on a calendar-year basis).',
 'Topic': 'Anti-Money Laundering'}

Let's sort the dataset by Topic name

In [None]:
dataset= dataset.sort('Topic')

Let's rename the column Topic to label

In [None]:
dataset = dataset.rename_column("Topic", "label")
dataset

Dataset({
    features: ['Text', 'label'],
    num_rows: 30
})

In [None]:
import collections
#Define the function to print the frequency count of elements in the list
def frequency_count(mylist):
  frequency = collections.Counter(mylist)
  for key, value in frequency.items():
    print(key, ':', value)

In [None]:
print("The Frequency of label in the Dataset : \n")
frequency_count(dataset['label'])

The Frequency of label in the Dataset : 

Anti-Money Laundering : 4
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 6
Business Continuity Planning : 9
Subordination Agreements : 11


The dataset consists of 4 labels.

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [None]:
labels = sorted(set(dataset['label']))
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['Anti-Money Laundering',
 'Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273',
 'Business Continuity Planning',
 'Subordination Agreements']
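
As a quick check of the two mappings, the sketch below encodes a couple of label names into integers and decodes them back. The `batch` list is a made-up example; the label names are the four listed above.

```python
labels = [
    "Anti-Money Laundering",
    "Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273",
    "Business Continuity Planning",
    "Subordination Agreements",
]
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

# Hypothetical batch of label names
batch = ["Subordination Agreements", "Anti-Money Laundering"]
encoded = [label2id[name] for name in batch]  # names -> integer ids
decoded = [id2label[idx] for idx in encoded]  # integer ids -> names

print(encoded)  # [3, 0]
print(decoded == batch)  # True
```

Sorting the labels before enumerating them keeps the integer ids stable from one run to the next.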

## Splitting the Data into Train, Validation and Test Sets






In [None]:
from sklearn.model_selection import train_test_split

fix_seed = 42

# Split the data 60/40 into a train set and a combined val_test set, stratified by label
X = dataset['Text']
y = dataset['label']
X_train, X_val_test, y_train, y_val_test = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=fix_seed)

# Split the val_test set in half into a validation set and a test set
X_val, X_test, y_val, y_test = train_test_split(
    X_val_test, y_val_test, test_size=0.5, stratify=y_val_test, random_state=fix_seed)
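
The two-stage split amounts to a 60/20/20 split. The arithmetic can be sketched as follows, assuming that scikit-learn rounds the held-out fraction up to a whole number of samples and gives the remainder to the other part:

```python
import math

n_samples = 30  # rows in the dataset before splitting

# Stage 1: 40% is held out for validation + test
n_val_test = math.ceil(0.40 * n_samples)
n_train = n_samples - n_val_test

# Stage 2: the held-out part is split in half
n_test = math.ceil(0.5 * n_val_test)
n_val = n_val_test - n_test

print(n_train, n_val, n_test)  # 18 6 6
```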

In [None]:
# Create the HuggingFace Dataset train_data
dict_train = {"sentence": X_train,"label": y_train}
train_data = Dataset.from_dict(dict_train)
train_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 18
})

In [None]:
#Create the HuggingFace Dataset val_data
dict_val = {"sentence": X_val,"label":y_val}
val_data = Dataset.from_dict(dict_val)
val_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 6
})

In [None]:
#Create the HuggingFace Dataset test_data
dict_test = {"sentence": X_test,"label":y_test}
test_data = Dataset.from_dict(dict_test)
test_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 6
})

Let's check the label frequency counts in the train, validation, and test sets

In [None]:
#Print the frequency count of label in train_data
print("The Frequency of label in train_data : \n")
frequency_count(train_data['label'])

The Frequency of label in train_data : 

Anti-Money Laundering : 2
Subordination Agreements : 7
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 4
Business Continuity Planning : 5


In [None]:
#Print the frequency count of label in val_data
print("The Frequency of label in val_data : \n")
frequency_count(val_data['label'])

The Frequency of label in val_data : 

Subordination Agreements : 2
Anti-Money Laundering : 1
Business Continuity Planning : 2
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 1


In [None]:
#Print the frequency count of label in test_data
print("The Frequency of label in test_data : \n")
frequency_count(test_data['label'])

The Frequency of label in test_data : 

Anti-Money Laundering : 1
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 1
Business Continuity Planning : 2
Subordination Agreements : 2
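
One way to check that stratification worked is to compare each label's share in a split with its share in the full dataset. The sketch below simply re-enters the counts printed above by hand; the helper `shares` and the variable names are hypothetical.

```python
# Label order: Anti-Money Laundering, Broker-Dealer Recruitment Disclosures,
# Business Continuity Planning, Subordination Agreements
full_counts = [4, 6, 9, 11]
split_counts = {
    "train": [2, 4, 5, 7],
    "validation": [1, 1, 2, 2],
    "test": [1, 1, 2, 2],
}

def shares(counts):
    """Turn raw counts into fractions of the total."""
    total = sum(counts)
    return [count / total for count in counts]

full_shares = shares(full_counts)
max_gap = {}
for split_name, counts in split_counts.items():
    gaps = [abs(s - f) for s, f in zip(shares(counts), full_shares)]
    max_gap[split_name] = max(gaps)  # largest deviation from the full dataset
    print(split_name, round(max_gap[split_name], 3))
```

With so few examples per label the shares cannot match exactly, but every gap stays below 0.05.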


In [None]:
#Create dataset_clean to store the train_data, val_data and test_data
from datasets.dataset_dict import DatasetDict
dataset_clean = DatasetDict({
    'train': train_data,
    'validation': val_data,
    'test': test_data
})
dataset_clean

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 18
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 6
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 6
    })
})

## Save the Data

In [None]:
#Save the HuggingFace Dataset dataset_clean in drive
dataset_clean.save_to_disk(saved_path  + "dataset_clean")
print("\nSaved dataset_clean")


Saved dataset_clean
