# About the Data

* The data was taken from the CSV file complaints.csv.

* The input text is the "Consumer complaint narrative" column, which is preprocessed and used to predict the "Product".

* The complaints are consolidated into the following six products:<br>

  1. Banking Services
  2. Card Services
  3. Credit Reporting
  4. Debt Collection
  5. Loans
  6. Mortgage

* The data cleaning was done using regular expressions and the gensim library.

* A sample of the data, 'sample_df', which has 10000 entries for each 'product', is created for fine-tuning DistilBERT.



## Memory Allocated

In [None]:
#Display the total memory of the Colab runtime
!free -h --si | awk  '/Mem:/{print $2}'

26G


## Google Drive access

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# write the appropriate paths to retrieve the data and store results 

saved_path = '/content/drive/MyDrive/EY_Internship/Consumer_Complaints/'
data_path = '/content/drive/MyDrive/EY_Internship/Consumer_Complaints/Raw_Data/complaints.csv'

# Loading the dataset

In [None]:
#Set Parameters
fix_seed = 42

In [None]:
#Load the data
import pandas as pd

raw_data = pd.read_csv(
    data_path,
    dtype={
        "Consumer complaint narrative": "string",
        "Consumer consent provided?": "string",
        "Timely response?": "string",
    },
)
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1922572 entries, 0 to 1922571
Data columns (total 18 columns):
 #   Column                        Dtype 
---  ------                        ----- 
 0   Date received                 object
 1   Product                       object
 2   Sub-product                   object
 3   Issue                         object
 4   Sub-issue                     object
 5   Consumer complaint narrative  string
 6   Company public response       object
 7   Company                       object
 8   State                         object
 9   ZIP code                      object
 10  Tags                          object
 11  Consumer consent provided?    string
 12  Submitted via                 object
 13  Date sent to company          object
 14  Company response to consumer  object
 15  Timely response?              string
 16  Consumer disputed?            object
 17  Complaint ID                  int64 
dtypes: int64(1), object(14), string(3)
memory 

In [None]:
#Print the Product counts in raw_data
raw_data['Product'].value_counts()

Credit reporting, credit repair services, or other personal consumer reports    612582
Debt collection                                                                 329695
Mortgage                                                                        317836
Credit reporting                                                                140432
Credit card or prepaid card                                                      99278
Credit card                                                                      89190
Bank account or service                                                          86206
Checking or savings account                                                      79836
Student loan                                                                     60258
Consumer Loan                                                                    31604
Money transfer, virtual currency, or money service                               22021
Vehicle loan or lease                      

In [None]:
#Create dataframe df1 with the columns 'Product' and 'Consumer complaint narrative' from raw_data
#(.copy() makes df1 independent of raw_data, so later column assignments are unambiguous)
df1 = raw_data[['Product', 'Consumer complaint narrative']].copy()
#Rename the columns 'Product' as 'product' and 'Consumer complaint narrative' as 'complaint'
df1.columns = ['product', 'complaint']
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1922572 entries, 0 to 1922571
Data columns (total 2 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   product    object
 1   complaint  string
dtypes: object(1), string(1)
memory usage: 29.3+ MB


In [None]:
#Check for null values in the dataframe df1
df1.isnull().sum()

product            0
complaint    1270120
dtype: int64

In [None]:
#Drop null values in the dataframe df1
df1 = df1[df1['complaint'].notnull() & df1['product'].notnull()]
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 652452 entries, 0 to 1922571
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   product    652452 non-null  object
 1   complaint  652452 non-null  string
dtypes: object(1), string(1)
memory usage: 14.9+ MB
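
The same filter can also be written with `DataFrame.dropna`. The sketch below uses a hypothetical toy frame (`toy` and its rows are invented for illustration) to show that both forms keep the same rows:

```python
import pandas as pd

# Hypothetical toy data with the same two columns as df1
toy = pd.DataFrame({
    'product': ['Mortgage', 'Debt collection', None, 'Credit card'],
    'complaint': pd.array(
        ['late fee charged twice', None, 'wrong balance', 'card was never requested'],
        dtype='string',
    ),
})

# Boolean-mask form, as used in the cell above
kept_mask = toy[toy['complaint'].notnull() & toy['product'].notnull()]

# Equivalent form: drop rows that are null in either of the two columns
kept_dropna = toy.dropna(subset=['complaint', 'product'])

print(kept_mask.index.tolist())
print(kept_mask.equals(kept_dropna))
```

Both forms keep the original index labels, which is why the filtered `df1` above still reports an index running up to 1922571.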


In [None]:
# Print the counts of unique values in 'product' 
df1['product'].value_counts()

Credit reporting, credit repair services, or other personal consumer reports    233854
Debt collection                                                                 130474
Mortgage                                                                         72924
Credit card or prepaid card                                                      47763
Credit reporting                                                                 31588
Checking or savings account                                                      27716
Student loan                                                                     27427
Credit card                                                                      18838
Bank account or service                                                          14885
Money transfer, virtual currency, or money service                               12212
Vehicle loan or lease                                                            11664
Consumer Loan                              

In [None]:
# Create a column to store the original_product
df1['original_product'] = df1['product'].copy()

df1 = df1.replace({'product':
             {'Credit reporting, credit repair services, or other personal consumer reports': 'Credit Reporting',
              'Debt collection': 'Debt Collection',
              'Credit reporting': 'Credit Reporting',
              'Credit card': 'Card Services',
              'Bank account or service': 'Banking Services',
              'Credit card or prepaid card': 'Card Services',
              'Student loan': 'Loans',
              'Checking or savings account': 'Banking Services',
              'Consumer Loan': 'Loans',
              'Vehicle loan or lease': 'Loans',
              'Money transfer, virtual currency, or money service': 'Banking Services',
              'Payday loan, title loan, or personal loan': 'Loans',
              'Payday loan': 'Loans',
              'Money transfers': 'Banking Services',
              'Prepaid card': 'Card Services',
              'Other financial service': 'Other',
              'Virtual currency': 'Banking Services'}
            })
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 652452 entries, 0 to 1922571
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   product           652452 non-null  object
 1   complaint         652452 non-null  string
 2   original_product  652452 non-null  object
dtypes: object(2), string(1)
memory usage: 19.9+ MB
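
A label that is missing from the mapping passes through `replace` unchanged (as 'Mortgage' does above). The following sketch, on hypothetical toy labels and a small excerpt of the mapping, shows this behaviour and one possible way to list the labels that were not consolidated:

```python
import pandas as pd

# Hypothetical toy labels; 'Mortgage' is deliberately absent from the mapping
toy = pd.DataFrame({'product': ['Credit card', 'Mortgage', 'Payday loan', 'Prepaid card']})

# Small excerpt of the mapping used in the cell above
mapping = {
    'Credit card': 'Card Services',
    'Payday loan': 'Loans',
    'Prepaid card': 'Card Services',
}

# Nested dict form: {column name: {old value: new value}}
consolidated = toy.replace({'product': mapping})

# Labels that the mapping does not cover and that are therefore kept as-is
unmapped = sorted(set(toy['product']) - set(mapping))

print(consolidated['product'].tolist())
print(unmapped)
```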


In [None]:
#Print the counts of unique values in 'product' 
df1['product'].value_counts()

Credit Reporting    265442
Debt Collection     130474
Mortgage             72924
Card Services        68051
Loans                58943
Banking Services     56326
Other                  292
Name: product, dtype: int64

In [None]:
#Drop entries having 'product' 'Other' as it is under-represented
df1 = df1[df1['product'] != 'Other']
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 652160 entries, 0 to 1922571
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   product           652160 non-null  object
 1   complaint         652160 non-null  string
 2   original_product  652160 non-null  object
dtypes: object(2), string(1)
memory usage: 19.9+ MB


In [None]:
#Print the products counts
df1['product'].value_counts()

Credit Reporting    265442
Debt Collection     130474
Mortgage             72924
Card Services        68051
Loans                58943
Banking Services     56326
Name: product, dtype: int64

In [None]:
import plotly.graph_objects as go

# Count the complaints per product once, sorted by product name
product_counts = df1['product'].value_counts().sort_index()
labels = product_counts.index.to_list()
values = product_counts.to_list()

# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.38, title='Product distribution', sort=True, textinfo='percent')])
fig.show()

In [None]:
#Display the 'complaint' of the first entry
df1.iloc[0]['complaint']

'transworld systems inc. \nis trying to collect a debt that is not mine, not owed and is inaccurate.'

In [None]:
#Display the 'product' of the first entry
df1.iloc[0]['product']

'Debt Collection'

Create a function for formatting elapsed times as h:mm:ss

In [None]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string h:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
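
As a quick check of the output format, the sketch below repeats the function so that it runs on its own and formats two illustrative durations (the input values are arbitrary examples):

```python
import datetime

def format_time(elapsed):
    """Same logic as the function above, repeated so the snippet is self-contained."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))

short_run = format_time(183.4)   # a little over three minutes
long_run = format_time(90061)    # one day, one hour, one minute and one second

print(short_run)  # 0:03:03
print(long_run)   # 1 day, 1:01:01
```

The hour field is not zero-padded, and durations of a day or more gain a leading day count.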

Define the function **get_text_analysis**, which adds the following text metrics to the dataframe. The percentage metrics are stored as fractions between 0 and 1.<br><br>

1)  Words in the text (split on whitespace)<br>
2)  Number of words<br>
3)  Number of characters<br>
4)  Number of unique words<br>
5)  Number of alphabetic characters<br>
6)  Number of numeric characters<br>
7)  Number of whitespace characters<br>
8)  Number of other characters<br>
9)  Number of occurrences of the character 'X'<br>
10) Number of occurrences of the character 'x'<br>
11) Percentage of alphabetic characters<br>
12) Percentage of numeric characters<br>
13) Percentage of whitespace characters<br>
14) Percentage of other characters<br>
15) Percentage of character 'X'<br>
16) Percentage of character 'x'<br>
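
To make the metrics concrete, here is a minimal sketch that computes a few of them for a single string. The helper `text_metrics_for` and the example text are hypothetical and only illustrate the definitions; the notebook's own function works on a whole dataframe column.

```python
def text_metrics_for(text):
    """Hypothetical single-string version of some of the metrics listed above."""
    words = text.split()
    n_chars = len(text)
    n_alpha = sum(c.isalpha() for c in text)
    n_numeric = sum(c.isnumeric() for c in text)
    n_space = sum(c.isspace() for c in text)
    return {
        'number_of_words': len(words),
        'number_of_characters': n_chars,
        'number_of_unique_words': len(set(words)),
        'number_of_alphabets': n_alpha,
        'number_of_numerical_values': n_numeric,
        'number_of_white_spaces': n_space,
        # Everything that is not a letter, a digit or whitespace (here the '/')
        'number_of_other_characters': n_chars - n_alpha - n_numeric - n_space,
        'number_of_character_X': text.count('X'),
        # A fraction between 0 and 1, not multiplied by 100
        'percentage_of_character_X': text.count('X') / n_chars,
    }

metrics = text_metrics_for('Paid XXXX on 12/31')
print(metrics)
```

The masked token 'XXXX' counts both as alphabetic characters and as occurrences of 'X', so the two metrics overlap.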


In [None]:
import time 

pd.options.mode.chained_assignment = None
def get_text_analysis(df,column_name):
  '''
  df: the input dataframe containing column_name; the metric columns are added to df in place
  column_name : the column in df for which text analysis is to be done
  (the percentage_* columns are stored as fractions between 0 and 1)
  '''
  t0 = time.time()
  #Split column_name into substrings wherever whitespace occurs
  df['split_words_whitespaces'] = df[column_name].map(lambda x: x.split())
  #Count the number of substrings in 'split_words_whitespaces'
  df['number_of_words'] = df['split_words_whitespaces'].map(len)
  #Count the number of characters in column_name
  df['number_of_characters'] = df[column_name].map(len)
  #Count the number of unique strings in 'split_words_whitespaces'
  df['number_of_unique_words'] = df['split_words_whitespaces'].map(lambda x : len(set(x)))
  #Count the number of alphabets in column_name
  df['number_of_alphabets'] = df[column_name].map(lambda x: len([c for c in x if c.isalpha()]))
  #Count the number of numerical values in column_name
  df['number_of_numerical_values'] = df[column_name].map(lambda x: len([c for c in x if c.isnumeric()]))
  #Count the number of white spaces in column_name
  df['number_of_white_spaces'] = df[column_name].map(lambda x: len([c for c in x if c.isspace()]) )
  #Count the number of other characters in column_name
  df['number_of_other_characters'] = df['number_of_characters'] - df['number_of_alphabets'] - df['number_of_numerical_values'] - df['number_of_white_spaces']
  #Count the number of character X in column_name
  df['number_of_character_X'] = df[column_name].map(lambda row: row.count('X'))
  #Count the number of character x in column_name
  df['number_of_character_x'] = df[column_name].map(lambda row: row.count('x'))
  #Calculate the percentage of alphabets in column_name
  df['percentage_of_alphabets'] = df['number_of_alphabets']/df['number_of_characters']
  #Calculate the percentage of numerical values in column_name
  df['percentage_of_numerical_values'] = df['number_of_numerical_values']/df['number_of_characters']
  #Calculate the percentage of white spaces in column_name
  df['percentage_of_white_spaces'] = df['number_of_white_spaces']/df['number_of_characters']
  #Calculate the percentage of other characters in column_name
  df['percentage_of_other_characters'] = df['number_of_other_characters']/df['number_of_characters']
  #Calculate the percentage of character X in column_name
  df['percentage_of_character_X'] = df['number_of_character_X']/df['number_of_characters']
  #Calculate the percentage of character x in column_name
  df['percentage_of_character_x'] = df['number_of_character_x']/df['number_of_characters']
  print("\nTime taken to get text metrics: {:} (h:mm:ss)\n\n".format(format_time(time.time() - t0)))
  df.info()


In [None]:
#Define the function display_descriptive_statistics to display statistics like minimum, median, maximum and standard deviation
def display_descriptive_statistics(df):
  '''
  df: the dataframe for which the descriptive statistics like minimum, median, maximum and standard deviation is to be displayed
  '''
  return df.agg(['min','median','max','std']).T
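
A small usage sketch on a hypothetical toy frame (the numbers are invented) shows the shape of the result: one row per input column and one column per statistic. The function is repeated so the snippet runs on its own.

```python
import pandas as pd

def display_descriptive_statistics(df):
    """Same aggregation as above, repeated so the snippet is self-contained."""
    return df.agg(['min', 'median', 'max', 'std']).T

# Hypothetical toy metrics
toy = pd.DataFrame({
    'number_of_words': [8, 10, 12],
    'number_of_characters': [30, 50, 100],
})

stats = display_descriptive_statistics(toy)
print(stats)
```

Note that `std` is the sample standard deviation (pandas uses `ddof=1` by default).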


In [None]:
#Get the text analysis for the 'complaint' column in df1
get_text_analysis(df1,'complaint')


Time taken to get text metrics: 0:03:03 (h:mm:ss)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 652160 entries, 0 to 1922571
Data columns (total 19 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   product                         652160 non-null  object 
 1   complaint                       652160 non-null  string 
 2   original_product                652160 non-null  object 
 3   split_words_whitespaces         652160 non-null  object 
 4   number_of_words                 652160 non-null  int64  
 5   number_of_characters            652160 non-null  int64  
 6   number_of_unique_words          652160 non-null  int64  
 7   number_of_alphabets             652160 non-null  int64  
 8   number_of_numerical_values      652160 non-null  int64  
 9   number_of_white_spaces          652160 non-null  int64  
 10  number_of_other_characters      652160 non-null  int64  
 11  number_of_character_X   

In [None]:
#Store the name of text metrics in a list 
text_metrics = [
                'number_of_words','number_of_characters', 'number_of_unique_words', 
                'number_of_alphabets', 'number_of_numerical_values',
                'number_of_white_spaces', 'number_of_other_characters',
                'number_of_character_X','number_of_character_x', 
                'percentage_of_alphabets', 'percentage_of_numerical_values',
                'percentage_of_white_spaces', 'percentage_of_other_characters',
                'percentage_of_character_X','percentage_of_character_x'
                ]

In [None]:
#Display the descriptive statistics of text metrics in df1
#take snapshot
display_descriptive_statistics(df1[text_metrics])

| | min | median | max | std |
|---|---|---|---|---|
| number_of_words | 1.0 | 128.0 | 6314.0 | 225.085042 |
| number_of_characters | 4.0 | 705.0 | 32317.0 | 1265.504492 |
| number_of_unique_words | 1.0 | 81.0 | 1839.0 | 81.663219 |
| number_of_alphabets | 4.0 | 550.0 | 25681.0 | 990.486837 |
| number_of_numerical_values | 0.0 | 4.0 | 3259.0 | 19.395898 |
| number_of_white_spaces | 0.0 | 129.0 | 7172.0 | 230.217405 |
| number_of_other_characters | 0.0 | 16.0 | 5229.0 | 42.432988 |
| number_of_character_X | 0.0 | 21.0 | 8136.0 | 95.691499 |
| number_of_character_x | 0.0 | 0.0 | 150.0 | 2.421816 |
| percentage_of_alphabets | 0.01922 | 0.786962 | 1.0 | 0.030456 |


In [None]:
# Retain only the records with a word count of more than 7 and a character count of more than 14
df2 = df1.query('(number_of_characters > 14)&(number_of_words > 7)')
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 647953 entries, 0 to 1922571
Data columns (total 19 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   product                         647953 non-null  object 
 1   complaint                       647953 non-null  string 
 2   original_product                647953 non-null  object 
 3   split_words_whitespaces         647953 non-null  object 
 4   number_of_words                 647953 non-null  int64  
 5   number_of_characters            647953 non-null  int64  
 6   number_of_unique_words          647953 non-null  int64  
 7   number_of_alphabets             647953 non-null  int64  
 8   number_of_numerical_values      647953 non-null  int64  
 9   number_of_white_spaces          647953 non-null  int64  
 10  number_of_other_characters      647953 non-null  int64  
 11  number_of_character_X           647953 non-null  int64  
 12  number_of_chara
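
The `query` expression used for this filter is equivalent to a boolean mask. A minimal sketch on a hypothetical toy frame (values invented for illustration) compares the two forms:

```python
import pandas as pd

# Hypothetical toy metrics; rows 0 and 3 each fail one of the two conditions
toy = pd.DataFrame({
    'number_of_words': [3, 8, 20, 9],
    'number_of_characters': [14, 40, 120, 12],
})

# String expression, as used in the cell above
by_query = toy.query('(number_of_characters > 14)&(number_of_words > 7)')

# Equivalent boolean-mask form
by_mask = toy[(toy['number_of_characters'] > 14) & (toy['number_of_words'] > 7)]

print(by_query.index.tolist())
```

Both comparisons are strict, so a complaint with exactly 7 words or exactly 14 characters is dropped.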

In [None]:
# Display the descriptive statistics of text metrics in df2
display_descriptive_statistics(df2[text_metrics])

| | min | median | max | std |
|---|---|---|---|---|
| number_of_words | 8.0 | 129.0 | 6314.0 | 225.304262 |
| number_of_characters | 30.0 | 710.0 | 32317.0 | 1266.886754 |
| number_of_unique_words | 1.0 | 82.0 | 1839.0 | 81.554683 |
| number_of_alphabets | 18.0 | 554.0 | 25681.0 | 991.573878 |
| number_of_numerical_values | 0.0 | 4.0 | 3259.0 | 19.442699 |
| number_of_white_spaces | 7.0 | 130.0 | 7172.0 | 230.44683 |
| number_of_other_characters | 0.0 | 17.0 | 5229.0 | 42.510997 |
| number_of_character_X | 0.0 | 24.0 | 8136.0 | 95.926315 |
| number_of_character_x | 0.0 | 0.0 | 150.0 | 2.427749 |
| percentage_of_alphabets | 0.01922 | 0.786828 | 0.973942 | 0.029788 |


In [None]:
#Create a sample data sample_big which has 20000 entries of each of the product counts in df2
df2_group = df2.groupby('product', group_keys=False)
sample_big = df2_group.sample(n=20000, random_state=fix_seed, replace=False)
sample_big.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 120000 entries, 1597758 to 824502
Data columns (total 19 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   product                         120000 non-null  object 
 1   complaint                       120000 non-null  string 
 2   original_product                120000 non-null  object 
 3   split_words_whitespaces         120000 non-null  object 
 4   number_of_words                 120000 non-null  int64  
 5   number_of_characters            120000 non-null  int64  
 6   number_of_unique_words          120000 non-null  int64  
 7   number_of_alphabets             120000 non-null  int64  
 8   number_of_numerical_values      120000 non-null  int64  
 9   number_of_white_spaces          120000 non-null  int64  
 10  number_of_other_characters      120000 non-null  int64  
 11  number_of_character_X           120000 non-null  int64  
 12  number_of_
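
The balanced sampling step can be tried on a small scale. The sketch below uses a hypothetical toy frame with two unevenly sized classes and draws the same number of rows from each with `groupby(...).sample`; the names `toy` and `balanced` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: 5 'Loans' rows and 3 'Mortgage' rows
toy = pd.DataFrame({
    'product': ['Loans'] * 5 + ['Mortgage'] * 3,
    'complaint': [f'complaint {i}' for i in range(8)],
})

# Draw n rows per product without replacement; a fixed seed makes the draw repeatable
balanced = toy.groupby('product', group_keys=False).sample(n=2, random_state=42, replace=False)

counts = balanced['product'].value_counts().to_dict()
print(counts)
```

With `replace=False`, `n` must not exceed the size of the smallest group; in the notebook the smallest product class has well over 20000 complaints, so this holds.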

In [None]:
# Display the descriptive statistics of text metrics in sample_big
display_descriptive_statistics(sample_big[text_metrics])

| | min | median | max | std |
|---|---|---|---|---|
| number_of_words | 8.0 | 151.0 | 5813.0 | 239.951312 |
| number_of_characters | 30.0 | 826.0 | 31705.0 | 1337.572385 |
| number_of_unique_words | 5.0 | 94.0 | 1642.0 | 85.332625 |
| number_of_alphabets | 20.0 | 642.0 | 25580.0 | 1044.595245 |
| number_of_numerical_values | 0.0 | 5.0 | 3259.0 | 22.285141 |
| number_of_white_spaces | 7.0 | 152.0 | 5997.0 | 244.889497 |
| number_of_other_characters | 0.0 | 20.0 | 5229.0 | 46.625493 |
| number_of_character_X | 0.0 | 24.0 | 7792.0 | 94.086056 |
| number_of_character_x | 0.0 | 0.0 | 110.0 | 2.567328 |
| percentage_of_alphabets | 0.305976 | 0.784148 | 0.973942 | 0.028512 |


## Text Pre-processing using Gensim

In [None]:
#Install the package folium version 0.2.1 and the package pattern (requirement for gensim library)
!pip install folium==0.2.1 pattern -q

Building wheel for folium (setup.py) ... done
Building wheel for pattern (setup.py) ... done
Building wheel for mysqlclient (setup.py) ... done
Building wheel for python-docx (setup.py) ... done
Building wheel for sgmllib3k (setup.py) ... done


In [None]:
# Import libraries for text processing
import string
import re
import pattern
from gensim.utils import tokenize

def text_pre_process(text):

  """
  text: the input text
  the function text_pre_process outputs preprocessed_text that is the pre-processed text
  """
  # Some basic helper functions to clean text.
  def remove_URL(text):
    '''
    text: input text for which the url has to be removed
    the function remove_URL removes the URL from the text
    '''
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

  def remove_emoji(text):
    '''
    text: input text for which the emoji has to be removed
    the function remove_emoji removes the emoji from the text
    '''
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
  
  def remove_html(text):
    '''
    text: input text for which the html has to be removed
    the function remove_html removes the html from the text
    '''
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return re.sub(html, '', text)
   
  def remove_non_ascii(text):
    '''
    text: input text for which Non ASCII Characters has to be removed 
    the function remove_non_ascii removes the Non ASCII Characters from the text
    '''
    non_ascii_removed = text.encode('ascii', 'ignore').decode('ascii')
    return non_ascii_removed   

  def remove_character_X_and_x(text):
    '''
    text: input text from which the masking characters X and x have to be removed
    the function remove_character_X_and_x removes runs of two or more X (or x) and pairs of X (or x) separated by a whitespace character from the text
    '''
    #remove runs of two or more 'X' (the mask used for redacted details)
    remove_X = re.sub(r'XX+','',text)
    #remove two 'X' separated by a single whitespace character (\s)
    remove_X_space = re.sub(r'X\sX','',remove_X )   
    #repeat both steps for the lower case 'x'
    remove_x = re.sub(r'xx+','',remove_X_space)
    remove_x_space = re.sub(r'x\sx','',remove_x)
    return remove_x_space

  def remove_punct(text):
    '''
    text: input text for which punctuations has to be removed 
    the function remove_punct removes the punctuations !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~  from the text
    '''
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)
  
  def remove_extra_white_spaces(text):
    r'''
    text: input text in which the whitespace characters [ \t\n\r\f\v] have to be replaced by a single space character
    the function remove_extra_white_spaces replaces each whitespace character by a space and collapses runs of spaces into one space
    '''
    #each whitespace character (\s, i.e. [ \t\n\r\f\v]) is replaced by a single space character
    remove_extra_space_characters = re.sub(r'\s',' ',text)
    #collapse runs of more than one space into a single space
    remove_extra_space = re.sub(' +',' ',remove_extra_space_characters)
    return remove_extra_space
  
  def remove_consecutive_repeated_substrings(text):
    '''
    text: input text for which the consecutive repeated substring is to be removed
    the function remove_consecutive_repeated_substrings removes the consecutive repeated substring in text
    '''
    while re.search(r'\b(.+)(\s+\1\b)+', text):
      text = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', text)
    return text
  
  # Applying helper functions

  #remove URL from the text
  text_clean = remove_URL(text)
  #remove emoji from text_clean
  text_clean = remove_emoji(text_clean)
  #remove html from text_clean
  text_clean =  remove_html(text_clean)
  #remove Non ASCII Characters from text_clean
  text_clean = remove_non_ascii(text_clean)  
  #remove more than one occurrence of character X and x from text_clean
  text_clean =  remove_character_X_and_x(text_clean)
  #remove punctuations from text_clean
  text_clean = remove_punct(text_clean)

  # Tokenizing 'text_clean' using tokenize function from gensim.utils 
  tokenized_data = list(tokenize(text_clean))

  # Lower casing 'tokenized_data'
  lower_text = str(' '.join([word.lower() for word in tokenized_data]))

  #Remove extra white spaces of lower_text
  extra_space_removed = remove_extra_white_spaces(lower_text)

  #Remove consecutive repeated substrings of extra_space_removed
  consecutive_repeated_substrings_removed = remove_consecutive_repeated_substrings(extra_space_removed)

  #Remove leading and trailing white spaces of consecutive_repeated_substrings_removed
  preprocessed_text =  consecutive_repeated_substrings_removed.strip()
  return preprocessed_text
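
Three of the cleaning steps can be tried in isolation with the standard `re` module alone. The sketch below is a simplified illustration: the helper names `collapse_whitespace` and `drop_repeated_phrases` are hypothetical, and the input strings are short invented examples.

```python
import re

def collapse_whitespace(text):
    # Hypothetical helper: any run of whitespace becomes one space, ends are trimmed
    return re.sub(r'\s+', ' ', text).strip()

def drop_repeated_phrases(text):
    # Same pattern as remove_consecutive_repeated_substrings:
    # a phrase followed by one or more copies of itself is reduced to one copy
    pattern = re.compile(r'\b(.+)(\s+\1\b)+')
    while pattern.search(text):
        text = pattern.sub(r'\1', text)
    return text

# Remove masked tokens (runs of two or more 'X'), then tidy the spacing left behind
masked_removed = collapse_whitespace(re.sub(r'XX+', '', 'account ending in XXXX on XXXX'))

tidy = collapse_whitespace('paid \n\t the  bill ')
deduplicated = drop_repeated_phrases('added it as overdraft for the the account')

print(masked_removed)
print(tidy)
print(deduplicated)
```

Removing the masked tokens leaves gaps in the sentence ("ending in on"), which matches what the preprocessed examples in this notebook look like.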

Example of text_pre_process function

In [None]:
sample_big.loc[sample_big.index[0],'complaint']

"This is the third complaint I have filed against Wells Fargo. First they opened a credit card in my husband 's name and mine. We did not request this card. I also found out they attached the credit card to an account that was in my name since XXXX. This account was set up for my job. All direct deposits for pay went to this account. I also had set up this account for online bill pay. They changed the primary owner of this account to my husband so that the credit card that Wells Fargo opened ILLEGALLY, without our approval or knowledge and added it as overdraft for the the account. In XXXX XXXX, I made transfers for XXXX transactions from another checking account in my name to what I thought was my primary account to pay bills over a 35 day period. Not only did Wells Fargo transfer those monies back into the checking but they also made cash advances on the credit card and charged us for overdraft protection resulting in almost {$500.00} in cash advance fees, interest, and this overdraf

In [None]:
text_pre_process(sample_big.loc[sample_big.index[0],'complaint'])

'this is the third complaint i have filed against wells fargo first they opened a credit card in my husband s name and mine we did not request this card i also found out they attached the credit card to an account that was in my name since this account was set up for my job all direct deposits for pay went to this account i also had set up this account for online bill pay they changed the primary owner of this account to my husband so that the credit card that wells fargo opened illegally without our approval or knowledge and added it as overdraft for the account in i made transfers for transactions from another checking account in my name to what i thought was my primary account to pay bills over a day period not only did wells fargo transfer those monies back into the checking but they also made cash advances on the credit card and charged us for overdraft protection resulting in almost in cash advance fees interest and this overdraft protection fee neither my husband or i knew that 

In [None]:
# Obtain the 'preprocessed_text' for the 'complaint' data in sample_big
from tqdm import tqdm
tqdm.pandas()
t0 = time.time()
sample_big.loc[:,'preprocessed_text']= sample_big.loc[:,'complaint'].progress_map(lambda row: text_pre_process(row))
print("\nText Transformation Completed in {:} (h:mm:ss)\n\n".format(format_time(time.time() - t0)))

100%|██████████| 120000/120000 [1:42:57<00:00, 19.43it/s]


Text Transformation Completed in 1:42:57 (h:mm:ss)







In [None]:
pd.set_option('display.max_colwidth', None)
#create a random sample check_preprocessing_sample of 50 entries from the dataframe sample_big 
check_preprocessing_sample = sample_big.sample(n = 50, random_state = fix_seed )
# Display the 'complaint' and 'preprocessed_text' for the first 3 rows of sample_big
display(sample_big.loc[:,['complaint','preprocessed_text']][:3])

Unnamed: 0,complaint,preprocessed_text
1597758,"This is the third complaint I have filed against Wells Fargo. First they opened a credit card in my husband 's name and mine. We did not request this card. I also found out they attached the credit card to an account that was in my name since XXXX. This account was set up for my job. All direct deposits for pay went to this account. I also had set up this account for online bill pay. They changed the primary owner of this account to my husband so that the credit card that Wells Fargo opened ILLEGALLY, without our approval or knowledge and added it as overdraft for the the account. In XXXX XXXX, I made transfers for XXXX transactions from another checking account in my name to what I thought was my primary account to pay bills over a 35 day period. Not only did Wells Fargo transfer those monies back into the checking but they also made cash advances on the credit card and charged us for overdraft protection resulting in almost {$500.00} in cash advance fees, interest, and this overdraft protection fee. Neither my husband or I knew that my primary account had been manipulated and my identity and my husband 's were stolen by using our social security information. Now today XXXX/XXXX/XXXX, I have found my other credit card information has also been changed for the online bill pay lists placing all of my credit cards, different from my husbands and not in his name, changed to his name along with changing the account numbers for these cards. XXXX card number was not correct at all. I now have to check all of our credit card statements to make sure that the bills have been paid!!! 
My advice to any Wells Fargo account holder is to check everything in your accounts because they did a lot of damage not only in CA but across the US and in many different areas of your banking accounts!",this is the third complaint i have filed against wells fargo first they opened a credit card in my husband s name and mine we did not request this card i also found out they attached the credit card to an account that was in my name since this account was set up for my job all direct deposits for pay went to this account i also had set up this account for online bill pay they changed the primary owner of this account to my husband so that the credit card that wells fargo opened illegally without our approval or knowledge and added it as overdraft for the account in i made transfers for transactions from another checking account in my name to what i thought was my primary account to pay bills over a day period not only did wells fargo transfer those monies back into the checking but they also made cash advances on the credit card and charged us for overdraft protection resulting in almost in cash advance fees interest and this overdraft protection fee neither my husband or i knew that my primary account had been manipulated and my identity and my husband s were stolen by using our social security information now today i have found my other credit card information has also been changed for the online bill pay lists placing all of my credit cards different from my husbands and not in his name changed to his name along with changing the account numbers for these cards card number was not correct at all i now have to check all of our credit card statements to make sure that the bills have been paid my advice to any wells fargo account holder is to check everything in your accounts because they did a lot of damage not only in ca but across the us and in many different areas of your banking accounts
1290191,"This is a follow up of my original complaint case # XXXX XXXX . PNC Bak has failed to provide me with and explanation why {$360.00} was taken out of my personal checking account ending in XXXX on XXXX , XXXX XXXX . They have failed to provide me with the complete investigation file which PNC Bank claims I have mishandled my personal checking account. This request is specifically directed to a letter dated XXXX XXXX , XXXX Ref. # XXXX . In that lette r PNC Ban k used as a reason to deny my request for a credit card. see reason # XXXX "" Unsatisfactory handling of your relationsh ip ( s ) wit h us '' I have requested a detailed explanation of both issues 7 times, in writing, verbally over the phone and walking into branches. There are serious security issues with the {$360.00} ATM withdrawal of my account and I have a legal right to know what exactly happened. Further, if PNC ban k is claiming I have mishandled my personal checking account, that claim has the potential to negatively impact my reputation. I want the information i have requested and hoping PNC ba nk will comply with federal and state laws that clearly state they MUST provide the information requested. 
Otherwise I will be forced to file civil claims and through that venue I will ascertain it.",this is a follow up of my original complaint case pnc bak has failed to provide me with and explanation why was taken out of my personal checking account ending in on they have failed to provide me with the complete investigation file which pnc bank claims i have mishandled my personal checking account this request is specifically directed to a letter dated ref in that lette r pnc ban k used as a reason to deny my request for a credit card see reason unsatisfactory handling of your relationsh ip s wit h us i have requested a detailed explanation of both issues times in writing verbally over the phone and walking into branches there are serious security issues with the atm withdrawal of my account and i have a legal right to know what exactly happened further if pnc ban k is claiming i have mishandled my personal checking account that claim has the potential to negatively impact my reputation i want the information i have requested and hoping pnc ba nk will comply with federal and state laws that clearly state they must provide the information requested otherwise i will be forced to file civil claims and through that venue i will ascertain it
13387,XX/XX/2019 - An account fraudulently opened by someone who used my identity with Bank of the West. XXXX - XXXX - Several communications between me and the bank of the west branch as well as customer support center to clarify that I did not open an account. - Bank of the West team telling me they are unable to close this account. - Now I am getting repeated notification from the back indicating they are deducting charges for maintenance and paper statement of the account from the balance that was fraudulently used to create the account.,an account fraudulently opened by someone who used my identity with bank of the west several communications between me and the bank of the west branch as well as customer support center to clarify that i did not open an account bank of the west team telling me they are unable to close this account now i am getting repeated notification from the back indicating they are deducting charges for maintenance and paper statement of the account from the balance that was fraudulently used to create the account


In [None]:
# Raise the column display limit to 50 while showing the sample
pd.set_option('display.max_columns', 50)
display(check_preprocessing_sample.loc[:, ['complaint', 'preprocessed_text']])
# Set the column display limit back to 10
pd.set_option('display.max_columns', 10)

Unnamed: 0,complaint,preprocessed_text
799075,"In XXXX a valid debt was charged off from XXXX XXXX XXXX XXXX XXXX. I paid this in a settlement in XXXX over the phone to a collection agency. I contested this still being on my credit report in XXXX XXXX. I received a settlement letter on XXXX/XXXX/XXXX from XXXX XXXX XXXX on behalf of XXXX XXXX Bank. When I called XXXX on XXXX/XXXX/XXXX to say I had already settled this back in XXXX, but in the interest of settlement, I was willing to pay it again if I could be ensured it would be reflected properly on my credit report, they directed me to call Galaxy International Purchasing LLC ( XXXX ) to ensure it would reflect properly upon payment. After repeated calls, I finally got through to Galaxy on XXXX/XXXX/XXXX and explained the situation. They told me to call XXXX XXXX Bank ( XXXX ) as they have nothing to do with my credit report. I called XXXX XXXX Bank, after verifying my information and that it was in concern to a debt patched me through to some other company of which I did n't get their name due to a language barrier. The lady I spoke with told me to disregard the letter I received on XXXX from XXXX, despite XXXX telling me they were representing XXXX XXXX and that I had 60 days to accept their settlement offer. I was told XXXX XXXX has no record of them or Galaxy handling this account and that I should work through XXXX XXXX ( XXXX XXXX, a company I 've never received anything from. I have called XXXX XXXX and have n't been able to get through to them. When searching XXXX 's number it comes up as XXXX XXXX XXXX XXXX All of these companies, XXXX XXXX XXXX XXXX, Galaxy, they all have slews of scam warnings and complaints online for shady practices. This same thing happened in XXXX and my settlement payment was lost in the mix. This has already been paid off once and I fear that if I pay anyone in this matter it will only get lost again, especially since they all seem to think I can pay them and no one else. 
I want this account deleted from my credit report as it was paid XXXX years ago and I 'm being scammed in an endless loop of calling other scammy businesses.",in a valid debt was charged off from i paid this in a settlement in over the phone to a collection agency i contested this still being on my credit report in i received a settlement letter on from on behalf of bank when i called on to say i had already settled this back in but in the interest of settlement i was willing to pay it again if i could be ensured it would be reflected properly on my credit report they directed me to call galaxy international purchasing llc to ensure it would reflect properly upon payment after repeated calls i finally got through to galaxy on and explained the situation they told me to call bank as they have nothing to do with my credit report i called bank after verifying my information and that it was in concern to a debt patched me through to some other company of which i did nt get their name due to a language barrier the lady i spoke with told me to disregard the letter i received on from despite telling me they were representing and that i had days to accept their settlement offer i was told has no record of them or galaxy handling this account and that i should work through a company i ve never received anything from i have called and have nt been able to get through to them when searching s number it comes up as all of these companies galaxy they all have slews of scam warnings and complaints online for shady practices this same thing happened in and my settlement payment was lost in the mix this has already been paid off once and i fear that if i pay anyone in this matter it will only get lost again especially since they all seem to think i can pay them and no one else i want this account deleted from my credit report as it was paid years ago and i m being scammed in an endless loop of calling other scammy businesses
1645882,"When Collection agency first contacted me, I was only told that if I did not pay the debt off, my name would be reported to all XXXX credit bureaus. I was not told I had the right to verify the debt and that by disputing it, I would not have to pay anything until the issue was resolved. I agreed to pay XX/XX/XXXX, but only found out by mail a week later that I could have disputed the debt until I received verification. I then wrote to the Collection agency as did an attorney on my behalf asking them to cease further collection until the dispute was resolved. They still automatically withdrew money from my checking account this month, putting me in a very awkward position financially.",when collection agency first contacted me i was only told that if i did not pay the debt off my name would be reported to all credit bureaus i was not told i had the right to verify the debt and that by disputing it i would not have to pay anything until the issue was resolved i agreed to pay but only found out by mail a week later that i could have disputed the debt until i received verification i then wrote to the collection agency as did an attorney on my behalf asking them to cease further collection until the dispute was resolved they still automatically withdrew money from my checking account this month putting me in a very awkward position financially
171942,Synchrony bank credit card/airline-XX/XX/2019 Synchrony bank credit cards/airlines-XX/XX/XXXXXX/XX/2019,synchrony bank credit cardairline synchrony bank credit cardsairlines
1552491,"Bank rep came to my workplace in XX/XX/XXXX and gave sales talk about how if I opened an account with them and did a certain number of transactions I would get a large incentive ( {$200.00} ). Bank rep failed to explain to me I had to transfer a minimum balance into the account by a certain time or get hit with service fees. I never received a statement the first few months as the fees were apparently waived somehow as a "" courtesy '' on new accounts, and I was traveling and did not use the account. I thought either PNC had automatically transferred money from one of my other bank accounts into the PNC account, or else the account was not fully activated till I went online and took some action. I received a debit card but did not get around to activating or using it. In XX/XX/XXXX I received out of the blue a statement from PNC showing that my balance was XXXX AND I owed them a XXXX dollar fee. I basically got charged a fee on an account I never used ; never put money into or out of, never activated or used the debit card, never logged into PNC Online Banking. It 's an inactive account and they charged me a fee without fully explaining that I would get hit with fees if I did n't put in a certain balance by a certain time. I have called them and asked that the account be closed and the fee waived. I did not open this account with the idea of owing the bank money, I thought I was getting a XXXX dollar incentive ( which I never got because I needed to use the debit card within some "" promotional period '' to get that money, apparently ). I was misled by the sales rep and I just want this to go away and I do NOT want to pay a fee on an account I NEVER USED. 
See attached letter for more details.",bank rep came to my workplace in and gave sales talk about how if i opened an account with them and did a certain number of transactions i would get a large incentive bank rep failed to explain to me i had to transfer a minimum balance into the account by a certain time or get hit with service fees i never received a statement the first few months as the fees were apparently waived somehow as a courtesy on new accounts and i was traveling and did not use the account i thought either pnc had automatically transferred money from one of my other bank accounts into the pnc account or else the account was not fully activated till i went online and took some action i received a debit card but did not get around to activating or using it in i received out of the blue a statement from pnc showing that my balance was and i owed them a dollar fee i basically got charged a fee on an account i never used never put money into or out of never activated or used the debit card never logged into pnc online banking it s an inactive account and they charged me a fee without fully explaining that i would get hit with fees if i did nt put in a certain balance by a certain time i have called them and asked that the account be closed and the fee waived i did not open this account with the idea of owing the bank money i thought i was getting a dollar incentive which i never got because i needed to use the debit card within some promotional period to get that money apparently i was misled by the sales rep and i just want this to go away and i do not want to pay a fee on an account i never used see attached letter for more details
158141,"I have been a member of XXXX XXXX XXXX for a number of years and have a VISA card with them. I recently received a letter that stated on XX/XX/19 their cards were being switched over to Elan Financial Services. The letter stated I could "" opt out '' of the transition if I chose to do so, and needed to contact them by XX/XX/19. The letter went on to recommend I check if I had any rewards points on balance before closing the account and to cash those points out before XX/XX/19. Tonight, XX/XX/19, I contacted the customer service number on my recent bill and was told by Elan services my current credit card with XXXX has a XXXX balance and a new card has been issued to me. I did not authorize Elan or XXXX to do this and in fact was calling because I want to opt out of the transition. I have a balance transfer check I want to send from my new credit union and place on their new credit card. Elan further stated on the phone that if I closed out my account tonight I would lose all rewards points and suggested I check my point balance before proceeding. I hung up, went to my rewards site, logged in as usual and a message appeared stating "" Cardholder does not participate. '' I then called the number on the back of my CC, found my point balance via the automated menu, and then proceeded until a XXXX person answered. After explaining this whole mess, the XXXX customer service person stated "" they had jumped the gun '' on the transition. This all violates what is stated in their letter about this transition and I consider it a breech of contract. What recourse do I have to a ) get back my reward points?, b ) stop a credit transfer to Elan, a company I do NOT want to do business with?, and c ) make a clean break with both XXXX and Elan without it impacting my credit? 
Any help you can offer will be greatly appreciated!!",i have been a member of for a number of years and have a visa card with them i recently received a letter that stated on their cards were being switched over to elan financial services the letter stated i could opt out of the transition if i chose to do so and needed to contact them by the letter went on to recommend i check if i had any rewards points on balance before closing the account and to cash those points out before tonight i contacted the customer service number on my recent bill and was told by elan services my current credit card with has a balance and a new card has been issued to me i did not authorize elan or to do this and in fact was calling because i want to opt out of the transition i have a balance transfer check i want to send from my new credit union and place on their new credit card elan further stated on the phone that if i closed out my account tonight i would lose all rewards points and suggested i check my point balance before proceeding i hung up went to my rewards site logged in as usual and a message appeared stating cardholder does not participate i then called the number on the back of my cc found my point balance via the automated menu and then proceeded until a person answered after explaining this whole mess the customer service person stated they had jumped the gun on the transition this all violates what is stated in their letter about this transition and i consider it a breech of contract what recourse do i have to a get back my reward points b stop a credit transfer to elan a company i do not want to do business with and c make a clean break with both and elan without it impacting my credit any help you can offer will be greatly appreciated
1607744,"I would like copies of Insurance premiums and cashed checks from my mortgage company Arvest Bank. As you can see for the year XXXX and XXXX stick out like sour thumb. You have closed case XXXX with out any merit or have given me copies of the Insurance Premium as well as cashed checks this has ROBO SIGN all over it, Also you have falsified and alter XXXX contract. Please see attached as you have removed the SCOPE OF THE WORK as well as the amount ROBO SIGN again. I have copies of originals Contracts from XXXX XXXX with initials in every page as well as a notary stamp with different dates. Also please see attached a copy of a new Authorization form from my new Attorney XXXX XXXX. This is not a duplicate complaint",i would like copies of insurance premiums and cashed checks from my mortgage company arvest bank as you can see for the year and stick out like sour thumb you have closed case with out any merit or have given me copies of the insurance premium as well as cashed checks this has robo sign all over it also you have falsified and alter contract please see attached as you have removed the scope of the work as well as the amount robo sign again i have copies of originals contracts from with initials in every page as well as a notary stamp with different dates also please see attached a copy of a new authorization form from my new attorney this is not a duplicate complaint
360361,"Hi, My name is XXXX XXXX. Ive recently been a victim of a debt consolidation company that causes a big impact on my credit score. One of the agreements with them is that I can not make any payment with chase cards until they negotiate a payment plan reduction for me, which I didnt know that will causes my credit score to dropped dramatically. Ive already cancelled my contract with them but I still need to chase to update this in their system so that they can adjust the credit bureaus to bring back my original credit score. I can be reach by email XXXX XXXX XXXX or my mobile at XXXX. Thank you XXXX",hi my name is ive recently been a victim of a debt consolidation company that causes a big impact on my credit score one of the agreements with them is that i can not make any payment with chase cards until they negotiate a payment plan reduction for me which i didnt know that will causes my credit score to dropped dramatically ive already cancelled my contract with them but i still need to chase to update this in their system so that they can adjust the credit bureaus to bring back my original credit score i can be reach by email or my mobile at thank you
1825727,"On XX/XX/XXXX I initiated a {$2300.00} transfer from my XXXX checking account to my HSBC checking account. On XX/XX/XXXX, unbeknownst to me, HSBC unilaterally closed all of my accounts. XXXX attempted a reversal of the {$2300.00} on XXXX, which HSBC rejected. The money was never sent to me by HSBC nor was it credited back to my XXXX account. It is now XXXX. They are telling me that they do not have my money - tough luck. XXXX tells me they have the evidence that the money was not sent back. HSBC is being very wishy-washy.",on i initiated a transfer from my checking account to my hsbc checking account on unbeknownst to me hsbc unilaterally closed all of my accounts attempted a reversal of the on which hsbc rejected the money was never sent to me by hsbc nor was it credited back to my account it is now they are telling me that they do not have my money tough luck tells me they have the evidence that the money was not sent back hsbc is being very wishywashy
1802811,"OnXX/XX/XXXX, I went to XXXX XXXX located at XXXX XXXX XXXX in XXXX, CA to withdraw money from my American Express Serve card. I went through the process of entering in my pin number and the amount along with which account to take the funds out of only to have the ATM machine not to dispense any money. I waited a good 3 minutes to see if the machine would dispense any money, before attempting to it again since I had some bills to pay that was due in a day I really did not think nothing of it, so I attempted to try again only this time the ATM machine said withdraw is not allowed. A male was waiting to use the ATM and stated oh that ATM machine is always breaking down and went inside the branch. I was waiting longer than normal for my ATM card to be returned. A female employee came outside and asked me if everything was okay, I stated the machine final gave me my card I did not think much of it, she looked at the ATM machine and went back in, I on the other had went to another XXXX XXXX near where I was at this time it was in a grocery store only for the machine to say that contacted lender. I then looked online only to discovered. The funds was showing as if I made a transaction. I attempted to make a transaction but no funds was dispense. I had to be at work at XXXX XXXX. so I had to head to work on my break on XX/XX/XXXX. I contacted my bank to dispute the charges and on XX/XX/XXXX my bank sent me an email informing me that my dispute was denied. I then contacted XXXX XXXX to inquiry why would my bank American Express serve would say that ATM vendor indicated funds was dispense when they were not. I spoke with a person by the name of XXXX she informed me that a detail investigation was not done and that American express Serve need to request the date stamps, camera, error log. She only mention that a basic one was done and I need to contact my lender to re investigate. I contacted my lender on XX/XX/XXXX. 
I am late on a few bills because of this situation that I had no control over.",on i went to located at in ca to withdraw money from my american express serve card i went through the process of entering in my pin number and the amount along with which account to take the funds out of only to have the atm machine not to dispense any money i waited a good minutes to see if the machine would dispense any money before attempting to it again since i had some bills to pay that was due in a day i really did not think nothing of it so i attempted to try again only this time the atm machine said withdraw is not allowed a male was waiting to use the atm and stated oh that atm machine is always breaking down and went inside the branch i was waiting longer than normal for my atm card to be returned a female employee came outside and asked me if everything was okay i stated the machine final gave me my card i did not think much of it she looked at the atm machine and went back in i on the other had went to another near where i was at this time it was in a grocery store only for the machine to say that contacted lender i then looked online only to discovered the funds was showing as if i made a transaction i attempted to make a transaction but no funds was dispense i had to be at work at so i had to head to work on my break on i contacted my bank to dispute the charges and on my bank sent me an email informing me that my dispute was denied i then contacted to inquiry why would my bank american express serve would say that atm vendor indicated funds was dispense when they were not i spoke with a person by the name of she informed me that a detail investigation was not done and that american express serve need to request the date stamps camera error log she only mention that a basic one was done and i need to contact my lender to re investigate i contacted my lender on i am late on a few bills because of this situation that i had no control over
1618840,"Reference prior Consumer Financial Protection Bureau complaint # XXXX, Ditech Financial LLC Account Number : XXXX. Complaint response from DiTech, dated XXXX XXXX 2016 ( attached ), indicated specific actions that would be immediately taken by DiTech to correct their error. To date, and based on a conversation with DiTech this morning, none of the actions have been completed. This has resulted in my not receiving a correct XXXX for mortgage interest payments, affecting my ability to file my tax return both accurately and timely. This has been a horrific customer service experience - the worst I 've ever encountered - customer service reps state that I can only speak with them and not be connected with someone in a higher position that just might be helpful. This is a horrible company under any name ( i.e., Green Tree ).",reference prior consumer financial protection bureau complaint ditech financial llc account number complaint response from ditech dated attached indicated specific actions that would be immediately taken by ditech to correct their error to date and based on a conversation with ditech this morning none of the actions have been completed this has resulted in my not receiving a correct for mortgage interest payments affecting my ability to file my tax return both accurately and timely this has been a horrific customer service experience the worst i ve ever encountered customer service reps state that i can only speak with them and not be connected with someone in a higher position that just might be helpful this is a horrible company under any name ie green tree
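
A display option can also be changed for a single cell only. The sketch below uses `pd.option_context`, which restores the previous value when the `with` block ends. It is an alternative to the two `set_option` calls above, not what the notebook itself runs, and the variable names are illustrative.

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
before = pd.get_option('display.max_columns')

# Inside the `with` block the limit is 50; on exit pandas puts the
# previous value back automatically, even if the cell raises an error.
with pd.option_context('display.max_columns', 50):
    inside = pd.get_option('display.max_columns')

after = pd.get_option('display.max_columns')
print(inside, after == before)
```

With this pattern there is no need to guess which value to set the option back to.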


In [None]:
# Get the text analysis for the 'preprocessed_text' column in check_preprocessing_sample
get_text_analysis(check_preprocessing_sample,'preprocessed_text')


Time taken to get text metrics: 0:00:00 (h:mm:ss)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 799075 to 527927
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   product                         50 non-null     object 
 1   complaint                       50 non-null     string 
 2   original_product                50 non-null     object 
 3   split_words_whitespaces         50 non-null     object 
 4   number_of_words                 50 non-null     int64  
 5   number_of_characters            50 non-null     int64  
 6   number_of_unique_words          50 non-null     int64  
 7   number_of_alphabets             50 non-null     int64  
 8   number_of_numerical_values      50 non-null     int64  
 9   number_of_white_spaces          50 non-null     int64  
 10  number_of_other_characters      50 non-null     int64  
 11  number_of_character_X           50 no
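
`get_text_analysis` is defined elsewhere in the notebook, so its exact logic is not visible here. As a rough sketch only, the function below shows one plausible way to derive several of the listed metrics with pandas string methods. The function name `sketch_text_metrics` and the interpretation of each column (for example, that `percentage_of_alphabets` is letters divided by characters) are assumptions.

```python
import pandas as pd

def sketch_text_metrics(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical stand-in for get_text_analysis; not the notebook's code."""
    text = df[column].astype(str)
    out = df.copy()
    # Words are taken to be whitespace-separated tokens (assumption).
    out['number_of_words'] = text.str.split().str.len()
    out['number_of_characters'] = text.str.len()
    out['number_of_unique_words'] = text.map(lambda s: len(set(s.split())))
    # Character-class counts via regular expressions.
    out['number_of_alphabets'] = text.str.count(r'[A-Za-z]')
    out['number_of_numerical_values'] = text.str.count(r'[0-9]')
    out['number_of_white_spaces'] = text.str.count(r'\s')
    # Share of letters among all characters (assumed definition).
    out['percentage_of_alphabets'] = (
        out['number_of_alphabets'] / out['number_of_characters']
    )
    return out

toy = pd.DataFrame({
    'preprocessed_text': ['i paid this debt', 'bank closed my account on 12 may']
})
metrics = sketch_text_metrics(toy, 'preprocessed_text')
print(metrics.drop(columns='preprocessed_text').T)
```

Under these definitions a text with single spaces between words has one fewer whitespace character than it has words.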

In [None]:
# Display the descriptive statistics of text metrics in check_preprocessing_sample
display_descriptive_statistics(check_preprocessing_sample[text_metrics])

Unnamed: 0,min,median,max,std
number_of_words,8.0,164.5,1102.0,170.231129
number_of_characters,69.0,865.0,6091.0,919.902973
number_of_unique_words,5.0,86.0,312.0,53.276713
number_of_alphabets,57.0,700.0,4990.0,750.144649
number_of_numerical_values,0.0,0.0,0.0,0.0
number_of_white_spaces,7.0,163.5,1101.0,170.231129
number_of_other_characters,0.0,0.0,0.0,0.0
number_of_character_X,0.0,0.0,0.0,0.0
number_of_character_x,0.0,1.0,7.0,1.768315
percentage_of_alphabets,0.798194,0.816366,0.898551,0.016321
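
The table above has one row per metric and the columns `min`, `median`, `max` and `std`. `display_descriptive_statistics` is defined elsewhere in the notebook. A minimal sketch that yields the same layout, assuming the helper does something similar, is a transposed `DataFrame.agg` call. The toy values are made up for illustration.

```python
import pandas as pd

# Toy metric columns; the numbers are illustrative, not from the dataset.
toy_metrics = pd.DataFrame({
    'number_of_words': [8, 10, 12],
    'number_of_characters': [40, 55, 70],
})

# agg() returns one row per statistic; transposing gives one row per metric.
summary = toy_metrics.agg(['min', 'median', 'max', 'std']).T
print(summary)
```

Note that pandas computes `std` as the sample standard deviation (`ddof=1`) by default.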


In [None]:
# Display information of sample_big
sample_big.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 120000 entries, 1597758 to 824502
Data columns (total 20 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   product                         120000 non-null  object 
 1   complaint                       120000 non-null  string 
 2   original_product                120000 non-null  object 
 3   split_words_whitespaces         120000 non-null  object 
 4   number_of_words                 120000 non-null  int64  
 5   number_of_characters            120000 non-null  int64  
 6   number_of_unique_words          120000 non-null  int64  
 7   number_of_alphabets             120000 non-null  int64  
 8   number_of_numerical_values      120000 non-null  int64  
 9   number_of_white_spaces          120000 non-null  int64  
 10  number_of_other_characters      120000 non-null  int64  
 11  number_of_character_X           120000 non-null  int64  
 12  number_of_

In [None]:
# Get the text analysis for the 'preprocessed_text' column in sample_big
get_text_analysis(sample_big,'preprocessed_text')


Time taken to get text metrics: 0:00:34 (h:mm:ss)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 120000 entries, 1597758 to 824502
Data columns (total 20 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   product                         120000 non-null  object 
 1   complaint                       120000 non-null  string 
 2   original_product                120000 non-null  object 
 3   split_words_whitespaces         120000 non-null  object 
 4   number_of_words                 120000 non-null  int64  
 5   number_of_characters            120000 non-null  int64  
 6   number_of_unique_words          120000 non-null  int64  
 7   number_of_alphabets             120000 non-null  int64  
 8   number_of_numerical_values      120000 non-null  int64  
 9   number_of_white_spaces          120000 non-null  int64  
 10  number_of_other_characters      120000 non-null  int64  
 11  number_of_character

In [None]:
# Display the descriptive statistics of text metrics in sample_big
display_descriptive_statistics(sample_big[text_metrics])

Unnamed: 0,min,median,max,std
number_of_words,1.0,139.0,5665.0,217.995735
number_of_characters,7.0,745.0,30397.0,1202.264097
number_of_unique_words,1.0,82.0,1171.0,69.348835
number_of_alphabets,7.0,607.0,24793.0,985.292748
number_of_numerical_values,0.0,0.0,0.0,0.0
number_of_white_spaces,0.0,138.0,5664.0,217.995735
number_of_other_characters,0.0,0.0,0.0,0.0
number_of_character_X,0.0,0.0,0.0,0.0
number_of_character_x,0.0,1.0,193.0,2.68671
percentage_of_alphabets,0.740741,0.815668,1.0,0.01395


In [None]:
# Retain only records with more than 7 words and more than 14 characters
sample_big_new = sample_big.query('(number_of_characters > 14) & (number_of_words > 7)')
sample_big_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119843 entries, 1597758 to 824502
Data columns (total 20 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   product                         119843 non-null  object 
 1   complaint                       119843 non-null  string 
 2   original_product                119843 non-null  object 
 3   split_words_whitespaces         119843 non-null  object 
 4   number_of_words                 119843 non-null  int64  
 5   number_of_characters            119843 non-null  int64  
 6   number_of_unique_words          119843 non-null  int64  
 7   number_of_alphabets             119843 non-null  int64  
 8   number_of_numerical_values      119843 non-null  int64  
 9   number_of_white_spaces          119843 non-null  int64  
 10  number_of_other_characters      119843 non-null  int64  
 11  number_of_character_X           119843 non-null  int64  
 12  number_of_

In [None]:
# Display the descriptive statistics of text metrics in sample_big_new
display_descriptive_statistics(sample_big_new[text_metrics])

Unnamed: 0,min,median,max,std
number_of_words,8.0,139.0,5665.0,218.024963
number_of_characters,30.0,746.0,30397.0,1202.452339
number_of_unique_words,4.0,82.0,1171.0,69.31362
number_of_alphabets,23.0,608.0,24793.0,985.452866
number_of_numerical_values,0.0,0.0,0.0,0.0
number_of_white_spaces,7.0,138.0,5664.0,218.024963
number_of_other_characters,0.0,0.0,0.0,0.0
number_of_character_X,0.0,0.0,0.0,0.0
number_of_character_x,0.0,1.0,193.0,2.688023
percentage_of_alphabets,0.740741,0.815657,0.99597,0.013822
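
The length filter can be checked on a toy frame. This sketch applies the same `query` expression as the notebook and compares it with an equivalent boolean mask. The toy rows and their counts are made up for illustration.

```python
import pandas as pd

# Toy rows with hand-set counts; only the filter logic matters here.
toy = pd.DataFrame({
    'preprocessed_text': ['too short', 'long enough text', 'many tiny words'],
    'number_of_words': [2, 9, 8],
    'number_of_characters': [7, 40, 12],
})

# Same expression as in the notebook: both conditions must hold.
kept = toy.query('(number_of_characters > 14) & (number_of_words > 7)')

# Equivalent boolean-mask form, useful when column names are not valid identifiers.
mask = (toy['number_of_characters'] > 14) & (toy['number_of_words'] > 7)
kept_mask = toy[mask]

print(kept.index.tolist())
```

Only the middle row survives: the first fails both conditions and the last has enough words but too few characters.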


In [None]:
# Reset the index of sample_big_new, keeping the original indices in a new column named 'original_index'
sample_big_new = sample_big_new.reset_index()
sample_big_new.rename(columns={"index": "original_index"}, inplace=True)
sample_big_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119843 entries, 0 to 119842
Data columns (total 21 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   original_index                  119843 non-null  int64  
 1   product                         119843 non-null  object 
 2   complaint                       119843 non-null  string 
 3   original_product                119843 non-null  object 
 4   split_words_whitespaces         119843 non-null  object 
 5   number_of_words                 119843 non-null  int64  
 6   number_of_characters            119843 non-null  int64  
 7   number_of_unique_words          119843 non-null  int64  
 8   number_of_alphabets             119843 non-null  int64  
 9   number_of_numerical_values      119843 non-null  int64  
 10  number_of_white_spaces          119843 non-null  int64  
 11  number_of_other_characters      119843 non-null  int64  
 12  number_of_charac
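
A toy example of the index reset: `reset_index()` moves an unnamed index into a column called `index`, which is then renamed. The chained form and the `rename_axis` alternative below are stylistic options, not the notebook's exact code.

```python
import pandas as pd

# Toy frame that keeps row labels from a larger source frame.
toy = pd.DataFrame({'product': ['Loans', 'Mortgage']}, index=[1597758, 824502])

# Move the old labels into a column, then give that column a clearer name.
toy_reset = toy.reset_index().rename(columns={'index': 'original_index'})

# Alternative: name the index first so reset_index() creates the column directly.
toy_alt = toy.rename_axis('original_index').reset_index()

print(toy_reset)
```

Keeping `original_index` makes it possible to trace any row of the filtered sample back to the raw data.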

In [None]:
# Delete DataFrames that are no longer needed and run garbage collection to free memory
del raw_data, df1, df2
import gc
gc.collect()

176

In [None]:
# Group sample_big_new by 'preprocessed_text' and collect each column's values into a list per group
sample_big_new_group_text = sample_big_new.groupby('preprocessed_text').agg(list).reset_index()
sample_big_new_group_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117131 entries, 0 to 117130
Data columns (total 21 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   preprocessed_text               117131 non-null  object
 1   original_index                  117131 non-null  object
 2   product                         117131 non-null  object
 3   complaint                       117131 non-null  object
 4   original_product                117131 non-null  object
 5   split_words_whitespaces         117131 non-null  object
 6   number_of_words                 117131 non-null  object
 7   number_of_characters            117131 non-null  object
 8   number_of_unique_words          117131 non-null  object
 9   number_of_alphabets             117131 non-null  object
 10  number_of_numerical_values      117131 non-null  object
 11  number_of_white_spaces          117131 non-null  object
 12  number_of_other_characters    

In [None]:
# Count how many rows share each text (the length of the 'product' list) in sample_big_new_group_text
sample_big_new_group_text['product_instance_length'] = sample_big_new_group_text['product'].map(len)
# Count how many distinct products are attached to each text in sample_big_new_group_text
sample_big_new_group_text['product_unique_instance_length'] = sample_big_new_group_text['product'].map(lambda products: len(set(products)))
sample_big_new_group_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117131 entries, 0 to 117130
Data columns (total 23 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   preprocessed_text               117131 non-null  object
 1   original_index                  117131 non-null  object
 2   product                         117131 non-null  object
 3   complaint                       117131 non-null  object
 4   original_product                117131 non-null  object
 5   split_words_whitespaces         117131 non-null  object
 6   number_of_words                 117131 non-null  object
 7   number_of_characters            117131 non-null  object
 8   number_of_unique_words          117131 non-null  object
 9   number_of_alphabets             117131 non-null  object
 10  number_of_numerical_values      117131 non-null  object
 11  number_of_white_spaces          117131 non-null  object
 12  number_of_other_characters    
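
The two length columns separate three cases: a text that occurs once, a text that is duplicated under a single product, and a text that is duplicated under conflicting products. The sketch below reproduces the same grouping on a small, made-up dataframe (`toy` and its contents are hypothetical, not taken from the complaints data) so that the three cases are visible side by side.

```python
import pandas as pd

# Hypothetical toy data that mirrors the two columns used in the cells above.
toy = pd.DataFrame({
    "preprocessed_text": ["late fee", "late fee", "wrong balance", "wrong balance", "lost card"],
    "product": ["Card Services", "Card Services", "Loans", "Mortgage", "Card Services"],
})

# One row per distinct text; the 'product' column becomes a list of the grouped labels.
toy_grouped = toy.groupby("preprocessed_text").agg(list).reset_index()

# Number of rows that share the text, and number of distinct labels among them.
toy_grouped["product_instance_length"] = toy_grouped["product"].map(len)
toy_grouped["product_unique_instance_length"] = toy_grouped["product"].map(
    lambda products: len(set(products))
)

print(toy_grouped)
```

Here "lost card" occurs once, "late fee" is a duplicate with one consistent label, and "wrong balance" is a duplicate whose labels disagree.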

In [None]:
# Keep the texts that occur only once (exactly one product instance)
sample_big_new_unique_instance = sample_big_new_group_text.query('product_instance_length == 1')
sample_big_new_unique_instance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 116138 entries, 0 to 117130
Data columns (total 23 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   preprocessed_text               116138 non-null  object
 1   original_index                  116138 non-null  object
 2   product                         116138 non-null  object
 3   complaint                       116138 non-null  object
 4   original_product                116138 non-null  object
 5   split_words_whitespaces         116138 non-null  object
 6   number_of_words                 116138 non-null  object
 7   number_of_characters            116138 non-null  object
 8   number_of_unique_words          116138 non-null  object
 9   number_of_alphabets             116138 non-null  object
 10  number_of_numerical_values      116138 non-null  object
 11  number_of_white_spaces          116138 non-null  object
 12  number_of_other_characters    

In [None]:
# Unlist the entries in sample_big_new_unique_instance (each list holds exactly one value)
sample_big_new_instance_unlist = sample_big_new_unique_instance.applymap(lambda cell: cell[0] if isinstance(cell, list) else cell)
sample_big_new_instance_unlist.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 116138 entries, 0 to 117130
Data columns (total 23 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   preprocessed_text               116138 non-null  object 
 1   original_index                  116138 non-null  int64  
 2   product                         116138 non-null  object 
 3   complaint                       116138 non-null  object 
 4   original_product                116138 non-null  object 
 5   split_words_whitespaces         116138 non-null  object 
 6   number_of_words                 116138 non-null  int64  
 7   number_of_characters            116138 non-null  int64  
 8   number_of_unique_words          116138 non-null  int64  
 9   number_of_alphabets             116138 non-null  int64  
 10  number_of_numerical_values      116138 non-null  int64  
 11  number_of_white_spaces          116138 non-null  int64  
 12  number_of_other_

In [None]:
#Calculate the product count in sample_big_new_instance_unlist
sample_big_new_instance_unlist['product'].value_counts()

Mortgage            19960
Banking Services    19922
Loans               19825
Card Services       19583
Debt Collection     19333
Credit Reporting    17515
Name: product, dtype: int64

In [None]:
# Keep the texts that occur more than once but always with the same product
sample_big_new_unique_product_not_instance = sample_big_new_group_text.query('(product_unique_instance_length == 1) & (product_instance_length > 1)')
sample_big_new_unique_product_not_instance.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 912 entries, 36 to 117084
Data columns (total 23 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   preprocessed_text               912 non-null    object
 1   original_index                  912 non-null    object
 2   product                         912 non-null    object
 3   complaint                       912 non-null    object
 4   original_product                912 non-null    object
 5   split_words_whitespaces         912 non-null    object
 6   number_of_words                 912 non-null    object
 7   number_of_characters            912 non-null    object
 8   number_of_unique_words          912 non-null    object
 9   number_of_alphabets             912 non-null    object
 10  number_of_numerical_values      912 non-null    object
 11  number_of_white_spaces          912 non-null    object
 12  number_of_other_characters      912 non-null  

In [None]:
# Unlist the entries in sample_big_new_unique_product_not_instance, keeping the first instance of each duplicated text
sample_big_new_product_not_instance_unlist = sample_big_new_unique_product_not_instance.applymap(lambda cell: cell[0] if isinstance(cell, list) else cell)
sample_big_new_product_not_instance_unlist.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 912 entries, 36 to 117084
Data columns (total 23 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   preprocessed_text               912 non-null    object 
 1   original_index                  912 non-null    int64  
 2   product                         912 non-null    object 
 3   complaint                       912 non-null    object 
 4   original_product                912 non-null    object 
 5   split_words_whitespaces         912 non-null    object 
 6   number_of_words                 912 non-null    int64  
 7   number_of_characters            912 non-null    int64  
 8   number_of_unique_words          912 non-null    int64  
 9   number_of_alphabets             912 non-null    int64  
 10  number_of_numerical_values      912 non-null    int64  
 11  number_of_white_spaces          912 non-null    int64  
 12  number_of_other_characters      

In [None]:
#Concatenate the dataframes sample_big_new_instance_unlist and sample_big_new_product_not_instance_unlist
sample_big_new1 = pd.concat([sample_big_new_instance_unlist, sample_big_new_product_not_instance_unlist], ignore_index=True)
sample_big_new1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117050 entries, 0 to 117049
Data columns (total 23 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   preprocessed_text               117050 non-null  object 
 1   original_index                  117050 non-null  int64  
 2   product                         117050 non-null  object 
 3   complaint                       117050 non-null  object 
 4   original_product                117050 non-null  object 
 5   split_words_whitespaces         117050 non-null  object 
 6   number_of_words                 117050 non-null  int64  
 7   number_of_characters            117050 non-null  int64  
 8   number_of_unique_words          117050 non-null  int64  
 9   number_of_alphabets             117050 non-null  int64  
 10  number_of_numerical_values      117050 non-null  int64  
 11  number_of_white_spaces          117050 non-null  int64  
 12  number_of_other_
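
The counts above imply that 117131 - 116138 - 912 = 81 distinct texts carry more than one product label; those texts are in neither of the two concatenated dataframes, so they are absent from `sample_big_new1`. As a rough alternative, the same filtering could be written without building lists. This is only a sketch: `dedupe_consistent_texts` is a hypothetical helper, not part of the notebook, and it keeps rows in their original order rather than the grouped order produced above, so a later seeded sample would pick different rows.

```python
import pandas as pd


def dedupe_consistent_texts(df, text_col="preprocessed_text", label_col="product"):
    """Keep one row per text whose copies all share a single label.

    Texts that appear with conflicting labels are dropped entirely.
    """
    # For every row, the number of distinct labels attached to its text.
    n_labels = df.groupby(text_col)[label_col].transform("nunique")
    consistent = df[n_labels == 1]
    # Keep the first occurrence of each remaining text.
    return consistent.drop_duplicates(subset=text_col, keep="first").reset_index(drop=True)


# Hypothetical toy data: one consistent duplicate, one conflicting duplicate, one unique text.
toy = pd.DataFrame({
    "preprocessed_text": ["late fee", "late fee", "wrong balance", "wrong balance", "lost card"],
    "product": ["Card Services", "Card Services", "Loans", "Mortgage", "Card Services"],
})
deduped = dedupe_consistent_texts(toy)
print(deduped)
```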

In [None]:
#Display the descriptive statistics of text metrics in sample_big_new1
display_descriptive_statistics(sample_big_new1[text_metrics])

Unnamed: 0,min,median,max,std
number_of_words,8.0,142.0,5665.0,217.960319
number_of_characters,30.0,759.0,30397.0,1199.121585
number_of_unique_words,4.0,83.0,1171.0,69.183741
number_of_alphabets,23.0,618.0,24793.0,982.154171
number_of_numerical_values,0.0,0.0,0.0,0.0
number_of_white_spaces,7.0,141.0,5664.0,217.960319
number_of_other_characters,0.0,0.0,0.0,0.0
number_of_character_X,0.0,0.0,0.0,0.0
number_of_character_x,0.0,1.0,193.0,2.710422
percentage_of_alphabets,0.740741,0.815498,0.99597,0.013585


In [None]:
#Calculate the product count in sample_big_new1
sample_big_new1['product'].value_counts()

Mortgage            19966
Banking Services    19942
Loans               19848
Card Services       19639
Debt Collection     19479
Credit Reporting    18176
Name: product, dtype: int64

In [None]:
# Create a balanced sample, sample_df, with 10000 entries for each product in sample_big_new1
gensim_data_df_group = sample_big_new1.groupby('product', group_keys=False)
sample_df = gensim_data_df_group.sample(n=10000, random_state=fix_seed, replace=False)
# Reset the index of sample_df
sample_df = sample_df.reset_index(drop=True)
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 23 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   preprocessed_text               60000 non-null  object 
 1   original_index                  60000 non-null  int64  
 2   product                         60000 non-null  object 
 3   complaint                       60000 non-null  object 
 4   original_product                60000 non-null  object 
 5   split_words_whitespaces         60000 non-null  object 
 6   number_of_words                 60000 non-null  int64  
 7   number_of_characters            60000 non-null  int64  
 8   number_of_unique_words          60000 non-null  int64  
 9   number_of_alphabets             60000 non-null  int64  
 10  number_of_numerical_values      60000 non-null  int64  
 11  number_of_white_spaces          60000 non-null  int64  
 12  number_of_other_characters      

In [None]:
#Calculate the product count in sample_df
sample_df['product'].value_counts()

Banking Services    10000
Card Services       10000
Credit Reporting    10000
Debt Collection     10000
Loans               10000
Mortgage            10000
Name: product, dtype: int64
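
Sampling without replacement works here because the smallest class in `sample_big_new1` (Credit Reporting, 18176 rows) is larger than the 10000 rows requested per product. The sketch below wraps the same `groupby(...).sample(...)` call in a hypothetical helper, `balanced_sample`, that checks this condition first; the helper name and the toy data are assumptions made for illustration.

```python
import pandas as pd


def balanced_sample(df, label_col, n_per_class, seed):
    """Draw n_per_class rows per label without replacement, after checking class sizes."""
    counts = df[label_col].value_counts()
    if counts.min() < n_per_class:
        raise ValueError(
            f"Smallest class has {counts.min()} rows, fewer than the {n_per_class} requested."
        )
    sampled = df.groupby(label_col, group_keys=False).sample(
        n=n_per_class, random_state=seed, replace=False
    )
    return sampled.reset_index(drop=True)


# Hypothetical toy data: three rows of one product, two of another.
toy = pd.DataFrame({
    "product": ["Loans", "Loans", "Loans", "Mortgage", "Mortgage"],
    "preprocessed_text": ["a", "b", "c", "d", "e"],
})
toy_sample = balanced_sample(toy, "product", n_per_class=2, seed=42)
print(toy_sample["product"].value_counts())
```

Requesting more rows than the smallest class holds raises a `ValueError` before any sampling takes place.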

In [None]:
#Display the descriptive statistics of text metrics in sample_df
display_descriptive_statistics(sample_df[text_metrics])

Unnamed: 0,min,median,max,std
number_of_words,8.0,141.0,5383.0,219.892735
number_of_characters,30.0,757.0,30148.0,1212.793377
number_of_unique_words,5.0,83.0,1171.0,69.511275
number_of_alphabets,23.0,617.0,24766.0,993.897519
number_of_numerical_values,0.0,0.0,0.0,0.0
number_of_white_spaces,7.0,140.0,5382.0,219.892735
number_of_other_characters,0.0,0.0,0.0,0.0
number_of_character_X,0.0,0.0,0.0,0.0
number_of_character_x,0.0,1.0,193.0,2.707624
percentage_of_alphabets,0.740741,0.815534,0.99597,0.013624


In [None]:
#Display the first five rows in sample_df
pd.reset_option('^display.', silent=True)
sample_df.head()

Unnamed: 0,preprocessed_text,original_index,product,complaint,original_product,split_words_whitespaces,number_of_words,number_of_characters,number_of_unique_words,number_of_alphabets,...,number_of_character_X,number_of_character_x,percentage_of_alphabets,percentage_of_numerical_values,percentage_of_white_spaces,percentage_of_other_characters,percentage_of_character_X,percentage_of_character_x,product_instance_length,product_unique_instance_length
0,on i notified usbank fraud dept line of an una...,755334,Banking Services,On XX/XX/XXXX I notified USbank fraud dept lin...,Checking or savings account,"[on, i, notified, usbank, fraud, dept, line, o...",644,3777,300,3134,...,0,0,0.829759,0.0,0.170241,0.0,0.0,0.0,1,1
1,previously i had repeatedly received spam mail...,693962,Banking Services,"Previously, I had repeatedly received spam mai...","Money transfer, virtual currency, or money ser...","[previously, i, had, repeatedly, received, spa...",62,340,44,279,...,0,0,0.820588,0.0,0.179412,0.0,0.0,0.0,1,1
2,on i received a notice from that my car paymen...,248080,Banking Services,On XX/XX/19 I received a notice from XXXX XXXX...,Checking or savings account,"[on, i, received, a, notice, from, that, my, c...",420,2242,187,1823,...,0,0,0.813113,0.0,0.186887,0.0,0.0,0.0,1,1
3,on i initiated a transfer of from my bank acco...,1562368,Banking Services,"On XXXX XXXX, 2016 I initiated a transfer of {...",Money transfers,"[on, i, initiated, a, transfer, of, from, my, ...",201,1100,84,900,...,0,0,0.818182,0.0,0.181818,0.0,0.0,0.0,1,1
4,hi i used paypal to purchase eyelashes online ...,1637582,Banking Services,Hi I used Paypal to purchase eyelashes online ...,Money transfers,"[hi, i, used, paypal, to, purchase, eyelashes,...",101,544,67,444,...,0,0,0.816176,0.0,0.183824,0.0,0.0,0.0,1,1


## Download data

In [None]:
# Save the balanced sample to the Google Drive folder set in saved_path
sample_df.to_csv(saved_path + 'data_sample_balanced.csv', index=False)
print("\nDownload Complete : data_sample_balanced.csv")


Download Complete : data_sample_balanced.csv


In [None]:
# Save the full de-duplicated data to the Google Drive folder set in saved_path
sample_big_new1.to_csv(saved_path + 'sample_big_new1.csv', index=False)
print("\nDownload Complete : sample_big_new1.csv")


Download Complete : sample_big_new1.csv
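
One thing to keep in mind when reading these files back: CSV has no list type, so a column of Python lists is written as text. Assuming `split_words_whitespaces` holds lists, as the `head()` output suggests, it will come back as strings unless it is parsed on load. The sketch below shows one way to restore it with `ast.literal_eval`, using an in-memory toy file rather than the real CSVs; the toy data is hypothetical.

```python
import ast
import io

import pandas as pd

# Hypothetical toy frame with a list-valued column, like split_words_whitespaces.
toy = pd.DataFrame({
    "preprocessed_text": ["hi there"],
    "split_words_whitespaces": [["hi", "there"]],
})

# Write to an in-memory CSV string instead of a file on Drive.
csv_text = toy.to_csv(index=False)

# A plain read returns the list column as a string such as "['hi', 'there']".
plain = pd.read_csv(io.StringIO(csv_text))

# Parsing the column on load turns the text back into Python lists.
reloaded = pd.read_csv(
    io.StringIO(csv_text),
    converters={"split_words_whitespaces": ast.literal_eval},
)

print(type(plain.loc[0, "split_words_whitespaces"]))
print(type(reloaded.loc[0, "split_words_whitespaces"]))
```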
