# About the Data

* The data was taken from the CSV file complaints.csv.

* The complaints are for the products:<br>

  1. 'Bank account or service'
  2. 'Checking or savings account'
  3. 'Consumer Loan'
  4. 'Credit card or prepaid card'
  5. 'Credit reporting, credit repair services, or other personal consumer reports' 
  6. 'Debt collection'
  7. 'Money transfer/s, virtual currency, or money service'
  8. 'Mortgage'
  9. 'Payday loan, title loan, or personal loan'
  10. 'Student loan'
  11. 'Vehicle loan or lease'

* The data cleaning was done using the spaCy library.
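
As a rough illustration of the selection described above, the sketch below keeps only rows whose product appears in the list and whose narrative is present. The column names `Product` and `Consumer complaint narrative`, the in-memory CSV, and the shortened product list are assumptions made for the example; they are not taken from this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for complaints.csv; the column names are assumed.
raw_csv = io.StringIO(
    "Product,Consumer complaint narrative\n"
    "Mortgage,My escrow payment was misapplied\n"
    "Student loan,\n"
    "Virtual currency,Transfer never arrived\n"
    "Debt collection,I do not owe this debt\n"
)

# Excerpt of the product list above, shortened for the example.
products = ["Debt collection", "Mortgage", "Student loan"]

complaints = pd.read_csv(raw_csv)

# Keep rows for the listed products that also have a narrative.
selected = complaints[
    complaints["Product"].isin(products)
    & complaints["Consumer complaint narrative"].notna()
]
print(selected["Product"].tolist())
```

The empty narrative and the unlisted product are both dropped, so only two of the four toy rows remain.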


## Next Steps

* Use the spaCy library to preprocess the sample data.

* Fine-tune DistilBERT on the sample data 

## Google Drive access

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Path to the sample data on Google Drive
sample_data_path = '/content/drive/MyDrive/Complaints_csv/Experiment3/SAMPLE_25_APRIL_2022_Experiment3.csv'

## Loading the sample dataset

In [None]:
#Load the data
import pandas as pd
sample_df = pd.read_csv(sample_data_path)
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79993 entries, 0 to 79992
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   79993 non-null  object 
 1   product                        79993 non-null  object 
 2   split_words_whitespaces        79993 non-null  object 
 3   number_of_words                79993 non-null  int64  
 4   number_of_charachters          79993 non-null  int64  
 5   charachters_by_words           79993 non-null  int64  
 6   number_of_unique_words         79993 non-null  int64  
 7   potenial_mask_words            79993 non-null  object 
 8   number_of_potenial_mask_words  79993 non-null  int64  
 9   potenial_mask_words_BY_words   79993 non-null  float64
dtypes: float64(1), int64(5), object(4)
memory usage: 6.1+ MB


In [None]:
# For duplicate narratives, keep only the row with the highest word count, then sort by product
sample_df = sample_df.loc[sample_df['number_of_words'].groupby(sample_df['consumer_complaint_narrative']).idxmax()].sort_values('product')
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77895 entries, 1039 to 79231
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   77895 non-null  object 
 1   product                        77895 non-null  object 
 2   split_words_whitespaces        77895 non-null  object 
 3   number_of_words                77895 non-null  int64  
 4   number_of_charachters          77895 non-null  int64  
 5   charachters_by_words           77895 non-null  int64  
 6   number_of_unique_words         77895 non-null  int64  
 7   potenial_mask_words            77895 non-null  object 
 8   number_of_potenial_mask_words  77895 non-null  int64  
 9   potenial_mask_words_BY_words   77895 non-null  float64
dtypes: float64(1), int64(5), object(4)
memory usage: 6.5+ MB
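
The de-duplication step above can be tried on a toy frame. This is a minimal sketch with made-up rows, not the notebook's data. It also compares the result with `drop_duplicates(keep='first')`, on the assumption that identical narratives always carry identical word counts, in which case `idxmax()` simply picks the first occurrence.

```python
import pandas as pd

# Made-up rows: the narrative "late fee" appears twice with different products.
toy_df = pd.DataFrame({
    "consumer_complaint_narrative": ["late fee", "late fee", "wrong balance"],
    "product": ["Mortgage", "Debt collection", "Student loan"],
    "number_of_words": [2, 2, 2],
})

# Same pattern as the cell above: one row label per distinct narrative.
keep_labels = (
    toy_df["number_of_words"]
    .groupby(toy_df["consumer_complaint_narrative"])
    .idxmax()
)
deduped = toy_df.loc[keep_labels].sort_values("product")

# Alternative under the stated assumption (equal word counts for equal text).
first_only = toy_df.drop_duplicates(
    subset="consumer_complaint_narrative", keep="first"
).sort_values("product")

print(deduped)
```

Note that when the same narrative is filed under two products, only one label survives, so the choice of which row to keep can affect the class distribution.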


In [None]:
import spacy
# Load the small English pipeline, which includes the dependency parser
spacy_nlp = spacy.load("en_core_web_sm")

In [None]:
from tqdm import tqdm
# Register progress_apply() on pandas objects
tqdm.pandas()
# Parse each narrative into a spaCy Doc
sample_df['spacy_doc'] = sample_df['consumer_complaint_narrative'].progress_apply(spacy_nlp)
print("\n\nSpacy Doc Completed")

100%|██████████| 77895/77895 [46:00<00:00, 28.22it/s]



Spacy Doc Completed





In [None]:
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77895 entries, 1039 to 79231
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   77895 non-null  object 
 1   product                        77895 non-null  object 
 2   split_words_whitespaces        77895 non-null  object 
 3   number_of_words                77895 non-null  int64  
 4   number_of_charachters          77895 non-null  int64  
 5   charachters_by_words           77895 non-null  int64  
 6   number_of_unique_words         77895 non-null  int64  
 7   potenial_mask_words            77895 non-null  object 
 8   number_of_potenial_mask_words  77895 non-null  int64  
 9   potenial_mask_words_BY_words   77895 non-null  float64
 10  spacy_doc                      77895 non-null  object 
dtypes: float64(1), int64(5), object(5)
memory usage: 7.1+ MB


In [None]:
# Check the type of the first 'spacy_doc' record
type(sample_df['spacy_doc'].iloc[0])

spacy.tokens.doc.Doc

In [None]:
# Define the strings to mask
mask_words_list = ['XX/XX/XXXX', 'XX-XX-XXXX',  # DATE mm/dd/yyyy, mm-dd-yyyy
                   'XXXX XXXX XXXX XXXX XXXX', 'XXXX-XXXX-XXXX-XXXX',  # CREDIT OR PREPAID CARD NUMBER
                   'XXXX XXXX XXXX XXXX', 'XXXX XXXX XXXX', 'XXXX-XXXX-XXXX', 'XXXX-XXXX', 'XXXX XXXX',
                   'XXX-XX-XXXX', 'XXX-XXX', 'XX-XXXX',
                   'XXXXXXXXXXXXXXXXXX', 'XXXXXXXXXXXXXXXXX', 'XXXXXXXXXXXXXXXX', 'XXXXXXXXXXXXXXX', 'XXXXXXXXXXXXXX',  # BANK ACCOUNT NUMBER
                   'XXXXXXXXXXXXX', 'XXXXXXXXXXXX', 'XXXXXXXXXXX',  # RANGES FROM 12 TO 18 DIGITS
                   'XXXXXXXXXX', 'XXXXXXXXX',  # ROUTING NUMBER IS 9 DIGITS
                   'XXXX', 'XXX', 'XX']
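
Because the masking function later in this notebook tests each entry with a substring check (`mask_word in text`), and every entry in the list contains `'XX'` while `'XX'` is itself an entry, the whole list behaves like a single test for `'XX'`. The sketch below checks this on a shortened list and a few hypothetical token strings.

```python
# Excerpt of the mask list; every entry contains the substring 'XX'.
mask_words_excerpt = ['XX-XX-XXXX', 'XXXX XXXX', 'XXX-XX-XXXX',
                      'XXXXXXXXX', 'XXXX', 'XXX', 'XX']


def matches_list(text):
    """Substring test against every entry, as in the masking function."""
    return any(mask_word in text for mask_word in mask_words_excerpt)


def contains_double_x(text):
    """Single substring test that gives the same answer."""
    return 'XX' in text


# Hypothetical token strings, including ones that must not be masked.
samples = ['XXXX ', 'XXXX/2016 ', 'Bofa ', 'X', 'account ', 'XXL ']
results = [(s, matches_list(s), contains_double_x(s)) for s in samples]
for sample, by_list, by_double_x in results:
    print(repr(sample), by_list, by_double_x)
```

One consequence is that any ordinary token containing two consecutive capital X characters (the made-up `'XXL '` above) is masked as well.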

In [None]:
sample_df2 =sample_df.copy()
sample_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77895 entries, 1039 to 79231
Data columns (total 11 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   77895 non-null  object 
 1   product                        77895 non-null  object 
 2   split_words_whitespaces        77895 non-null  object 
 3   number_of_words                77895 non-null  int64  
 4   number_of_charachters          77895 non-null  int64  
 5   charachters_by_words           77895 non-null  int64  
 6   number_of_unique_words         77895 non-null  int64  
 7   potenial_mask_words            77895 non-null  object 
 8   number_of_potenial_mask_words  77895 non-null  int64  
 9   potenial_mask_words_BY_words   77895 non-null  float64
 10  spacy_doc                      77895 non-null  object 
dtypes: float64(1), int64(5), object(5)
memory usage: 9.1+ MB


In [None]:
import re

# Return '<MASK>' for tokens that must be masked, '' for stop words and
# punctuation (periods are kept), and the original token text otherwise.
def change_details(word):
    # Token.text_with_ws replaces Token.string, which newer spaCy releases no longer provide
    text = word.text_with_ws
    if word.like_email or word.like_url:
        return '<MASK>'
    elif any(mask_word in text for mask_word in mask_words_list):
        return '<MASK>'
    elif word.is_stop:
        return ''
    # Drop punctuation unless the token contains a period
    elif '.' not in text and word.is_punct:
        return ''
    return text

# Pass each token of a spaCy Doc through change_details() and rebuild the text
def change_text(doc):
    new_tokens = map(change_details, doc)
    new_text = ' '.join(new_tokens)
    # Collapse runs of spaces into a single space
    new_text = re.sub(' +', ' ', new_text)
    # Reattach periods to the preceding word and remove line breaks
    new_text = new_text.replace(' .', '.')
    new_text = new_text.replace('\n', '')
    return new_text
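
The string clean-up inside `change_text()` can be followed step by step without spaCy. The token strings below are hypothetical stand-ins for what `change_details()` might return; only the join and replace steps are taken from the function above.

```python
import re

# Hypothetical per-token results: '' for dropped tokens, '<MASK>' for masked
# ones, and the token text (with its trailing space) for everything else.
new_tokens = ['', 'account ', '<MASK>', '', 'charged ', 'fee', '. ']

joined = ' '.join(new_tokens)
collapsed = re.sub(' +', ' ', joined)        # runs of spaces become one space
new_text = collapsed.replace(' .', '.')      # reattach the period
new_text = new_text.replace('\n', '')        # remove line breaks

print(repr(joined))
print(repr(new_text))
```

In this toy case a leading and a trailing space survive, because dropped tokens at the edges still contribute a separator. Calling `.strip()` on the result would remove them, if that matters for later steps.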

In [None]:
sample_df2['Change_text']= sample_df2['spacy_doc'].progress_apply(lambda x: change_text(x))
print("\n\nText Transformation Completed")

100%|██████████| 77895/77895 [03:16<00:00, 396.34it/s]



Text Transformation Completed





In [None]:
#Display an example of how the Text is changed
pd.set_option('display.max_colwidth', None)
display(sample_df2.loc[:,['spacy_doc','Change_text']][:1])

Unnamed: 0,spacy_doc,Change_text
1039,"(My, account, #, XXXX, at, Bofa, was, charged, a, NSF, fee, on, XXXX, /, XXXX/2016, for, {, $, 35.00, }, when, I, noticed, on, the, account, activity, a, merchant, deposit, being, made, the, same, day, so, I, am, not, sure, why, the, account, was, charged, this, fee, ., I, believe, there, were, enough, funds, in, transit, that, should, have, been, sufficient, enough, to, take, care, of, this, fee, ., The, Bank, charged, this, fee, unjustly, in, spite, of, the, fact, that, the, merchant, services, account, is, also, with, Bofa, with, daily, deposits, ..)",account <MASK> Bofa charged NSF fee <MASK> <MASK> $ 35.00 noticed account activity merchant deposit day sure account charged fee. believe funds transit sufficient care fee. Bank charged fee unjustly spite fact merchant services account Bofa daily deposits..


In [None]:
# Split 'Change_text' into substrings wherever whitespace occurs
sample_df2['split_words_whitespaces'] = sample_df2['Change_text'].apply(lambda x: x.split())
# Count the number of substrings in 'split_words_whitespaces'
sample_df2['number_of_words'] = sample_df2['split_words_whitespaces'].apply(lambda x: len(x))
# Count the number of characters in 'Change_text'
sample_df2['number_of_charachters'] = sample_df2['Change_text'].apply(lambda x: len(x))
# Calculate the integer (floor) ratio of the number of characters to the number of words
sample_df2['charachters_by_words'] = sample_df2['number_of_charachters'] // sample_df2['number_of_words']
# Count the number of unique strings in 'split_words_whitespaces'
sample_df2['number_of_unique_words'] = sample_df2['split_words_whitespaces'].apply(lambda x : len(set(x)))
# Count the number of '<MASK>' strings in 'Change_text'
sample_df2['number_of_<MASK>'] = sample_df2['Change_text'].apply(lambda x : x.count('<MASK>'))
# Calculate the ratio of '<MASK>' strings to the number of words
sample_df2['<MASK>_BY_WORDS'] = sample_df2['number_of_<MASK>']/sample_df2['number_of_words']

sample_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77895 entries, 1039 to 79231
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   77895 non-null  object 
 1   product                        77895 non-null  object 
 2   split_words_whitespaces        77895 non-null  object 
 3   number_of_words                77895 non-null  int64  
 4   number_of_charachters          77895 non-null  int64  
 5   charachters_by_words           77895 non-null  int64  
 6   number_of_unique_words         77895 non-null  int64  
 7   potenial_mask_words            77895 non-null  object 
 8   number_of_potenial_mask_words  77895 non-null  int64  
 9   potenial_mask_words_BY_words   77895 non-null  float64
 10  spacy_doc                      77895 non-null  object 
 11  Change_text                    77895 non-null  object 
 12  number_of_<MASK>               77895 non-nu
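
The `<MASK>_BY_WORDS` ratio can be worked through by hand on a single string. The sketch below uses a shortened version of the example text displayed earlier; the variable names are chosen for the illustration only.

```python
# Shortened version of the transformed example shown earlier.
change_text_example = 'account <MASK> Bofa charged NSF fee <MASK> <MASK> $ 35.00'

# Same steps as the column calculations above, applied to one string.
words = change_text_example.split()
number_of_words = len(words)
number_of_masks = change_text_example.count('<MASK>')
mask_by_words = number_of_masks / number_of_words

print(number_of_words, number_of_masks, mask_by_words)
```

Here 3 of the 10 whitespace-separated strings are `<MASK>`, so the ratio is 0.3. A complaint like this one would pass a 0.3 cut-off applied with `.le()` but not a 0.2 or 0.1 cut-off.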

In [None]:
# Keep complaints in which at most 30% of the words are '<MASK>'
sample_df3 = sample_df2[sample_df2['<MASK>_BY_WORDS'].le(0.3)]
sample_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71584 entries, 1039 to 79231
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   71584 non-null  object 
 1   product                        71584 non-null  object 
 2   split_words_whitespaces        71584 non-null  object 
 3   number_of_words                71584 non-null  int64  
 4   number_of_charachters          71584 non-null  int64  
 5   charachters_by_words           71584 non-null  int64  
 6   number_of_unique_words         71584 non-null  int64  
 7   potenial_mask_words            71584 non-null  object 
 8   number_of_potenial_mask_words  71584 non-null  int64  
 9   potenial_mask_words_BY_words   71584 non-null  float64
 10  spacy_doc                      71584 non-null  object 
 11  Change_text                    71584 non-null  object 
 12  number_of_<MASK>               71584 non-nu

In [None]:
# Keep complaints in which at most 10% of the words are '<MASK>'
sample_df4 = sample_df2[sample_df2['<MASK>_BY_WORDS'].le(0.1)]
sample_df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37487 entries, 1039 to 79231
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   37487 non-null  object 
 1   product                        37487 non-null  object 
 2   split_words_whitespaces        37487 non-null  object 
 3   number_of_words                37487 non-null  int64  
 4   number_of_charachters          37487 non-null  int64  
 5   charachters_by_words           37487 non-null  int64  
 6   number_of_unique_words         37487 non-null  int64  
 7   potenial_mask_words            37487 non-null  object 
 8   number_of_potenial_mask_words  37487 non-null  int64  
 9   potenial_mask_words_BY_words   37487 non-null  float64
 10  spacy_doc                      37487 non-null  object 
 11  Change_text                    37487 non-null  object 
 12  number_of_<MASK>               37487 non-nu

In [None]:
# Keep complaints in which at most 20% of the words are '<MASK>'
sample_df5 = sample_df2[sample_df2['<MASK>_BY_WORDS'].le(0.2)]
sample_df5.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61096 entries, 1039 to 79231
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   consumer_complaint_narrative   61096 non-null  object 
 1   product                        61096 non-null  object 
 2   split_words_whitespaces        61096 non-null  object 
 3   number_of_words                61096 non-null  int64  
 4   number_of_charachters          61096 non-null  int64  
 5   charachters_by_words           61096 non-null  int64  
 6   number_of_unique_words         61096 non-null  int64  
 7   potenial_mask_words            61096 non-null  object 
 8   number_of_potenial_mask_words  61096 non-null  int64  
 9   potenial_mask_words_BY_words   61096 non-null  float64
 10  spacy_doc                      61096 non-null  object 
 11  Change_text                    61096 non-null  object 
 12  number_of_<MASK>               61096 non-nu
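
To compare the three cut-offs, the row counts reported by the `info()` outputs above can be turned into retained shares. This is plain arithmetic on the printed counts; the variable names are chosen for the illustration.

```python
# Row counts taken from the info() outputs above.
total_rows = 77895
rows_kept = {0.1: 37487, 0.2: 61096, 0.3: 71584}

# Share of the de-duplicated sample that survives each mask-ratio cut-off.
retained_share = {
    threshold: kept / total_rows for threshold, kept in rows_kept.items()
}

for threshold in sorted(retained_share):
    print(
        f"mask ratio <= {threshold}: {rows_kept[threshold]} rows "
        f"({retained_share[threshold]:.1%} of {total_rows})"
    )
```

The strictest cut-off keeps a little under half of the sample, which is worth weighing against the per-product counts before choosing a file for fine-tuning.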

## Download the Sample Data

In [None]:
# Save the sample with at most 30% masked words to a local CSV file
sample_df3.to_csv("SAMPLE_DOC_30_PER_MASK_EXP_3.csv", encoding='utf-8', index=False)
print("\n\nDownload Completed")



Download Completed


In [None]:
# Save the sample with at most 10% masked words to a local CSV file
sample_df4.to_csv("SAMPLE_DOC_10_PER_MASK_EXP_3.csv", encoding='utf-8', index=False)
print("\n\nDownload Completed")



Download Completed


In [None]:
# Save the sample with at most 20% masked words to a local CSV file
sample_df5.to_csv("SAMPLE_DOC_20_PER_MASK_EXP_3.csv", encoding='utf-8', index=False)
print("\n\nDownload Completed")



Download Completed


In [None]:
# Copy the CSV files to Google Drive
import shutil
destination_path_10_per = '/content/drive/MyDrive/Complaints_csv/Experiment3/SAMPLE_DOC_10_PER_MASK_EXP_3.csv'
shutil.copy("SAMPLE_DOC_10_PER_MASK_EXP_3.csv", destination_path_10_per )
destination_path_20_per = '/content/drive/MyDrive/Complaints_csv/Experiment3/SAMPLE_DOC_20_PER_MASK_EXP_3.csv'
shutil.copy("SAMPLE_DOC_20_PER_MASK_EXP_3.csv", destination_path_20_per )
destination_path_30_per = '/content/drive/MyDrive/Complaints_csv/Experiment3/SAMPLE_DOC_30_PER_MASK_EXP_3.csv'
shutil.copy("SAMPLE_DOC_30_PER_MASK_EXP_3.csv", destination_path_30_per )
print("\nTransfer Complete")


Transfer Complete
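
The three copy calls follow one naming pattern, so they could also be written as a loop. The sketch below is one possible arrangement, not the notebook's code: temporary directories stand in for the Colab working directory and the Drive folder, and a placeholder file stands in for each saved CSV.

```python
import shutil
import tempfile
from pathlib import Path

# Temporary directories stand in for the working directory and the Drive folder.
work_dir = Path(tempfile.mkdtemp())
drive_dir = Path(tempfile.mkdtemp())

copied = []
for percent in (10, 20, 30):
    file_name = f"SAMPLE_DOC_{percent}_PER_MASK_EXP_3.csv"
    source = work_dir / file_name
    # Placeholder content; in the notebook this file comes from DataFrame.to_csv().
    source.write_text("placeholder\n", encoding="utf-8")
    # shutil.copy returns the destination path of the new file.
    copied.append(Path(shutil.copy(source, drive_dir / file_name)))

print([path.name for path in copied])
```

Building the file name once per threshold keeps the local name and the Drive name in sync.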
