# About the Data

* The data was taken from the FINRA Key Topics page: https://www.finra.org/rules-guidance/key-topics

* The data consists of three columns: Rule, Text and Topic.
  - Rule: FINRA rule number.
  - Text: Text of the rule.
  - Topic: Category to which the rule belongs.

* The rules under the following four Key Topics are used as text:

  1. Anti-Money Laundering
  2. Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273
  3. Business Continuity Planning
  4. Subordination Agreements

* The rules are summarized so that no text is duplicated and no text exceeds 512 characters.
* Blank texts are excluded.
* Prefixes at the beginning of the rules are removed.
* The word ‘and’ used to indicate the last rule is removed.
* All sentences are considered.
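
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code that was used to build the dataset: the helper names `clean_rule_text` and `clean_rules`, the prefix pattern (markers such as `(a)` or `(1)`), and the decision to skip over-length texts are all assumptions.

```python
import re

MAX_CHARS = 512  # character limit stated in the list above

# Hypothetical pattern for list prefixes such as "(a) " or "(1) ";
# the real prefixes on the source pages may differ.
PREFIX_RE = re.compile(r"^\s*\(\w+\)\s*")
# Trailing "; and" / ", and" / " and" left over at the end of a list item.
TRAILING_AND_RE = re.compile(r"[;,]?\s*\band\s*$", flags=re.IGNORECASE)


def clean_rule_text(text):
    """Return the cleaned text, or None if it is blank or too long."""
    text = PREFIX_RE.sub("", text)
    text = TRAILING_AND_RE.sub("", text).strip()
    # Assumption: over-length texts are skipped here; the original
    # preparation may have shortened them instead.
    if not text or len(text) > MAX_CHARS:
        return None
    return text


def clean_rules(texts):
    """Clean every text, skipping blanks and exact duplicates (order kept)."""
    seen, cleaned = set(), []
    for raw in texts:
        text = clean_rule_text(raw)
        if text is not None and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Made-up example strings, not taken from the dataset
print(clean_rules(["(a) Establish policies; and", "   ", "(2) Train staff."]))
```

Keeping the per-text cleaning separate from the duplicate check makes it easy to inspect which rows are dropped and why.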



# Text Preprocessing

* Duplicate ‘Text’ entries, if any, are removed.




## Next Steps

* Fine-tune a DistilBERT/BERT model for text classification.





## Google Drive access

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Write the appropriate paths to retrieve the data and store results 
data_path  = '/content/drive/MyDrive/Full_Code/FINRA/Strategy2/Four_Labels/Dataset/Text_file_Strategy2_four_classes.csv'
saved_path = '/content/drive/MyDrive/Full_Code/FINRA/Strategy2/Four_Labels/'

## Set-up environment

Install the libraries folium version 0.2.1 and HuggingFace Datasets

In [None]:
#Install the package folium version 0.2.1 and HuggingFace datasets library
!pip install -q folium==0.2.1 datasets 


In [None]:
import torch
#Check if GPU is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device, "is available")

cpu is available


## Memory Allocated

In [None]:
!free -h --si | awk  '/Mem:/{print $2}'

13G


# Loading the dataset

In [None]:
#Load the data
import pandas as pd

df = pd.read_csv(data_path)
#Display the first five rows in df
df.head()


Unnamed: 0,Rule,Text,Topic
0,3310,Each member shall develop and implement a writ...,Anti-Money Laundering
1,3310,The anti-money laundering programs required by...,Anti-Money Laundering
2,3310,Provide for annual (on a calendar-year basis) ...,Anti-Money Laundering
3,3310,The anti-money laundering programs required by...,Anti-Money Laundering
4,3310,The anti-money laundering programs required by...,Anti-Money Laundering


In [None]:
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rule    37 non-null     int64 
 1   Text    37 non-null     object
 2   Topic   37 non-null     object
dtypes: int64(1), object(2)
memory usage: 1016.0+ bytes


In [None]:
#Convert the data type of 'Text' to string
df['Text'] = df['Text'].astype(str)
#Display information about the dataframe df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rule    37 non-null     int64 
 1   Text    37 non-null     object
 2   Topic   37 non-null     object
dtypes: int64(1), object(2)
memory usage: 1016.0+ bytes


In [None]:
#Define a function to calculate the number of words in a text
def count_words(example):
  number_of_words = len(example.split()) 
  return number_of_words 


#Calculate the number of words for each 'Text' row in the dataframe df
df['number_of_words_Text'] = df['Text'].map(count_words)
#Display descriptive statistics about the 'number_of_words_Text' column in the dataframe df
df['number_of_words_Text'].describe()

count    37.000000
mean     58.000000
std      15.445244
min      15.000000
25%      50.000000
50%      62.000000
75%      68.000000
max      82.000000
Name: number_of_words_Text, dtype: float64

In [None]:
#Drop Duplicate entries in 'Text' if there exist any in the dataframe df
unique_df  = df.drop_duplicates(subset=['Text'])
unique_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 36
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Rule                  37 non-null     int64 
 1   Text                  37 non-null     object
 2   Topic                 37 non-null     object
 3   number_of_words_Text  37 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 1.4+ KB


In [None]:
#Create dataframe data with the columns 'Text' and 'Topic' from unique_df
data = unique_df[['Text','Topic']].copy()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 36
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    37 non-null     object
 1   Topic   37 non-null     object
dtypes: object(2)
memory usage: 888.0+ bytes


In [None]:
#Reset the index in data
data = data.reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    37 non-null     object
 1   Topic   37 non-null     object
dtypes: object(2)
memory usage: 720.0+ bytes


In [None]:
#Convert the Data into HuggingFace Dataset
from datasets import Dataset
dataset = Dataset.from_pandas(data)
dataset


Dataset({
    features: ['Text', 'Topic'],
    num_rows: 37
})

Let's look at the features of the dataset

In [None]:
dataset.features

{'Text': Value(dtype='string', id=None),
 'Topic': Value(dtype='string', id=None)}

The dataset has to be split into training, validation and test sets. Before that, let's check the first example of the dataset:



In [None]:
example = dataset[0]
example

{'Text': "Each member shall develop and implement a written anti-money laundering program reasonably designed to achieve and monitor the member's compliance with the requirements of the Bank Secrecy Act (31 U.S.C. 5311, et seq.), and the implementing regulations promulgated thereunder by the Department of the Treasury. Each member's anti-money laundering program must be approved, in writing, by a member of senior management.",
 'Topic': 'Anti-Money Laundering'}

Let's sort the dataset by Topic name

In [None]:
dataset= dataset.sort('Topic')

Let's rename the column Topic to label

In [None]:
dataset = dataset.rename_column("Topic", "label")
dataset

Dataset({
    features: ['Text', 'label'],
    num_rows: 37
})

In [None]:
import collections 
#Define the function to print the frequency count of elements in the list
def frequency_count(mylist):
  frequency = collections.Counter(mylist)
  for key, value in frequency.items():
    print(key, ':', value)

In [None]:
print("The Frequency of label in the Dataset : \n")
frequency_count(dataset['label'])

The Frequency of label in the Dataset : 

Anti-Money Laundering : 10
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 6
Business Continuity Planning : 11
Subordination Agreements : 10


The dataset consists of 4 labels.

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [None]:
labels = sorted(list(set(dataset['label'])))
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['Anti-Money Laundering',
 'Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273',
 'Business Continuity Planning',
 'Subordination Agreements']
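
At this point the `label` column still holds the topic names as strings, while a classifier typically works with integer class ids. A minimal sketch of how the two dictionaries could be used for encoding and decoding is shown below; the helper names `encode_labels` and `decode_ids` are hypothetical and not part of the notebook.

```python
# Label list as printed above
labels = [
    "Anti-Money Laundering",
    "Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273",
    "Business Continuity Planning",
    "Subordination Agreements",
]
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}


def encode_labels(names):
    """Map topic names to integer ids (hypothetical helper)."""
    return [label2id[name] for name in names]


def decode_ids(ids):
    """Map integer ids back to topic names (hypothetical helper)."""
    return [id2label[i] for i in ids]


encoded = encode_labels(["Subordination Agreements", "Anti-Money Laundering"])
print(encoded)              # [3, 0]
print(decode_ids(encoded))  # back to the original topic names
```

Because `labels` is sorted, the ids are stable across runs as long as the set of topics does not change.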

## Splitting the Data into Train, Validation and Test set






In [None]:
fix_seed = 42
from sklearn.model_selection import train_test_split

# Split data into train and val_test sets
X = dataset['Text']
y = dataset['label']
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.40, stratify=y, random_state=fix_seed)

# Split val_test into val and test sets
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, stratify=y_val_test, random_state=fix_seed)

In [None]:
# Create the HuggingFace Dataset train_data
dict_train = {"sentence": X_train,"label": y_train}
train_data = Dataset.from_dict(dict_train)
train_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 22
})

In [None]:
#Create the HuggingFace Dataset val_data
dict_val = {"sentence": X_val,"label":y_val}
val_data = Dataset.from_dict(dict_val)
val_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 7
})

In [None]:
#Create the HuggingFace Dataset test_data
dict_test = {"sentence": X_test,"label":y_test}
test_data = Dataset.from_dict(dict_test)
test_data

Dataset({
    features: ['sentence', 'label'],
    num_rows: 8
})
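
The split sizes above (22, 7 and 8 rows) follow from how a fractional `test_size` is turned into a row count. scikit-learn appears to round the test portion up and give the remaining rows to the other side; the sketch below reproduces that arithmetic under this assumption. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part is rounded up and the first part
    receives the remaining rows.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


# First split: 37 rows into train and val_test with test_size=0.40
n_train, n_val_test = split_sizes(37, 0.40)
# Second split: val_test into validation and test with test_size=0.5
n_val, n_test = split_sizes(n_val_test, 0.5)

print(n_train, n_val, n_test)  # 22 7 8
```

The result is roughly a 60/20/20 split, with the odd row going to the test set.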

Let's check the frequency count of the labels in the train, validation and test sets

In [None]:
#Print the frequency count of label in train_data
print("The Frequency of label in train_data : \n")
frequency_count(train_data['label'])

The Frequency of label in train_data : 

Subordination Agreements : 6
Business Continuity Planning : 6
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 4
Anti-Money Laundering : 6


In [None]:
#Print the frequency count of label in val_data
print("The Frequency of label in val_data : \n")
frequency_count(val_data['label'])

The Frequency of label in val_data : 

Anti-Money Laundering : 2
Subordination Agreements : 2
Business Continuity Planning : 2
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 1


In [None]:
#Print the frequency count of label in test_data
print("The Frequency of label in test_data : \n")
frequency_count(test_data['label'])

The Frequency of label in test_data : 

Subordination Agreements : 2
Broker-Dealer Recruitment Disclosures: Complying with FINRA Rule 2273 : 1
Anti-Money Laundering : 2
Business Continuity Planning : 3


In [None]:
#Create dataset_clean to store the train_data, val_data and test_data
from datasets.dataset_dict import DatasetDict
dataset_clean = DatasetDict({
    'train': train_data,
    'validation': val_data,
    'test': test_data
})
dataset_clean

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 22
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 7
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 8
    })
})

## Save the Data

In [None]:
#Save the HuggingFace Dataset dataset_clean in drive
dataset_clean.save_to_disk(saved_path  + "dataset_clean")
print("\nSaved dataset_clean")


Saved dataset_clean
