## Problem Statement 

The goal is to build a model that classifies customer complaints by the product or service they concern. Routing each ticket to its relevant category should help resolve issues faster.
The tickets need to be classified into the following five clusters based on their products/services:

* Credit card / Prepaid card

* Bank account services

* Theft/Dispute reporting

* Mortgages/loans

* Others 

In [1]:
# Importing relevant libraries
import json 
import numpy as np
import pandas as pd
import re, nltk, spacy, string
import en_core_web_sm
nlp = en_core_web_sm.load()
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from pprint import pprint

In [4]:
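# Loading the data
# Open the JSON file; the context manager closes it once loading is done
with open('C:\\Users\\moham\\Downloads\\inputs\\complaints-2021-05-14_08_16.json') as f:
    # json.load returns the parsed JSON content
    data = json.load(f)

# Flatten the nested records into a DataFrame with dotted column names
df = pd.json_normalize(data)
df.head()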

| | _index | _type | _id | _score | _source.tags | _source.zip_code | _source.complaint_id | _source.issue | _source.date_received | _source.state | ... | _source.company_response | _source.company | _source.submitted_via | _source.date_sent_to_company | _source.company_public_response | _source.sub_product | _source.timely | _source.complaint_what_happened | _source.sub_issue | _source.consumer_consent_provided |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | complaint-public-v2 | complaint | 3211475 | 0.0 | | 90301 | 3211475 | Attempts to collect debt not owed | 2019-04-13T12:00:00-05:00 | CA | ... | Closed with explanation | JPMORGAN CHASE & CO. | Web | 2019-04-13T12:00:00-05:00 | | Credit card debt | Yes | | Debt is not yours | Consent not provided |
| 1 | complaint-public-v2 | complaint | 3229299 | 0.0 | Servicemember | 319XX | 3229299 | Written notification about debt | 2019-05-01T12:00:00-05:00 | GA | ... | Closed with explanation | JPMORGAN CHASE & CO. | Web | 2019-05-01T12:00:00-05:00 | | Credit card debt | Yes | Good morning my name is XXXX XXXX and I apprec... | Didn't receive enough information to verify debt | Consent provided |
| 2 | complaint-public-v2 | complaint | 3199379 | 0.0 | | 77069 | 3199379 | Other features, terms, or problems | 2019-04-02T12:00:00-05:00 | TX | ... | Closed with explanation | JPMORGAN CHASE & CO. | Web | 2019-04-02T12:00:00-05:00 | | General-purpose credit card or charge card | Yes | I upgraded my XXXX XXXX card in XX/XX/2018 and... | Problem with rewards from credit card | Consent provided |
| 3 | complaint-public-v2 | complaint | 2673060 | 0.0 | | 48066 | 2673060 | Trouble during payment process | 2017-09-13T12:00:00-05:00 | MI | ... | Closed with explanation | JPMORGAN CHASE & CO. | Web | 2017-09-14T12:00:00-05:00 | | Conventional home mortgage | Yes | | | Consent not provided |
| 4 | complaint-public-v2 | complaint | 3203545 | 0.0 | | 10473 | 3203545 | Fees or interest | 2019-04-05T12:00:00-05:00 | NY | ... | Closed with explanation | JPMORGAN CHASE & CO. | Referral | 2019-04-05T12:00:00-05:00 | | General-purpose credit card or charge card | Yes | | Charged too much interest | |
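
The dotted column names come from how `pd.json_normalize` flattens nested dictionaries. A minimal sketch of that behaviour, assuming two made-up records shaped like the export (the values are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Two hypothetical records with a nested "_source" dictionary (values are made up)
records = [
    {"_id": "1", "_score": 0.0, "_source": {"product": "Mortgage", "state": "CA"}},
    {"_id": "2", "_score": 0.0, "_source": {"product": "Debt collection", "state": None}},
]

# Each nested key becomes a column named "<parent>.<child>"
flat = pd.json_normalize(records)
print(sorted(flat.columns))
print(flat.loc[0, "_source.product"])
```

Every key under `_source` ends up as a `_source.<key>` column, which is why the column names are cleaned up in a later step.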


### Data Preparation

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78313 entries, 0 to 78312
Data columns (total 22 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   _index                             78313 non-null  object 
 1   _type                              78313 non-null  object 
 2   _id                                78313 non-null  object 
 3   _score                             78313 non-null  float64
 4   _source.tags                       10900 non-null  object 
 5   _source.zip_code                   71556 non-null  object 
 6   _source.complaint_id               78313 non-null  object 
 7   _source.issue                      78313 non-null  object 
 8   _source.date_received              78313 non-null  object 
 9   _source.state                      76322 non-null  object 
 10  _source.consumer_disputed          78313 non-null  object 
 11  _source.product                    78313 non-null  obj

In [6]:
#Assign new column names
columns=df.columns
new_columns = [re.sub(r'^_(source\.)?', '', column) for column in columns]  # Strip the leading "_" and the optional "source." prefix
df.columns = new_columns
df.columns

Index(['index', 'type', 'id', 'score', 'tags', 'zip_code', 'complaint_id',
       'issue', 'date_received', 'state', 'consumer_disputed', 'product',
       'company_response', 'company', 'submitted_via', 'date_sent_to_company',
       'company_public_response', 'sub_product', 'timely',
       'complaint_what_happened', 'sub_issue', 'consumer_consent_provided'],
      dtype='object')
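
When renaming with a regular expression it helps to anchor the pattern and escape the dot, because an unescaped `.` matches any character. The sketch below shows one way to do this on a few of the column names; `clean_column_name` is a hypothetical helper introduced only for illustration.

```python
import re

def clean_column_name(name):
    """Strip a leading underscore and an optional 'source.' prefix (hypothetical helper)."""
    # ^ anchors the match at the start; \. matches a literal dot only
    return re.sub(r'^_(?:source\.)?', '', name)

names = ['_index', '_score', '_source.zip_code', '_source.complaint_what_happened']
print([clean_column_name(n) for n in names])
# ['index', 'score', 'zip_code', 'complaint_what_happened']
```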

In [14]:
df[df.complaint_what_happened==""].shape

(57241, 22)

In [15]:
#Assign nan in place of blanks in the complaints column
df['complaint_what_happened'] = df['complaint_what_happened'].replace('', np.nan)

In [16]:
#Remove all rows where complaints column is nan
df = df.dropna(subset=['complaint_what_happened'])
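
The two steps above (marking blank complaints as missing, then dropping them) can be checked on a toy frame. This is a sketch with made-up rows, not the actual data:

```python
import numpy as np
import pandas as pd

# Toy frame with two blank complaints (rows are made up for illustration)
toy = pd.DataFrame({
    "complaint_what_happened": ["", "Card was charged twice", ""],
    "product": ["Mortgage", "Credit card or prepaid card", "Debt collection"],
})

# Turn empty strings into NaN, then drop the rows where the complaint is missing
toy["complaint_what_happened"] = toy["complaint_what_happened"].replace("", np.nan)
toy = toy.dropna(subset=["complaint_what_happened"])

print(toy.shape)        # (1, 2)
print(list(toy.index))  # [1]
```

The surviving rows keep their original index labels; `reset_index(drop=True)` would renumber them if a contiguous index is needed.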

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21072 entries, 1 to 78312
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   index                      21072 non-null  object 
 1   type                       21072 non-null  object 
 2   id                         21072 non-null  object 
 3   score                      21072 non-null  float64
 4   tags                       3816 non-null   object 
 5   zip_code                   16427 non-null  object 
 6   complaint_id               21072 non-null  object 
 7   issue                      21072 non-null  object 
 8   date_received              21072 non-null  object 
 9   state                      20929 non-null  object 
 10  consumer_disputed          21072 non-null  object 
 11  product                    21072 non-null  object 
 12  company_response           21072 non-null  object 
 13  company                    21072 non-null  object 


In [18]:
# Creating new data frame with relevant information
df_new = df[['complaint_what_happened', 'product']]

In [19]:
df_new.shape

(21072, 2)

#### Text Preprocessing

In [20]:
# Function to clean the text and remove all the unnecessary elements.
def data_cleaner(text):
    text = text.lower()                                   # Lower case
    text = re.sub(r'\s\[\$\S*', '', text)                 # Remove tokens that start with "[$"
    text = re.sub(r'(\W\s)|(\W$)|(\W\d*)', ' ', text)     # Replace punctuation (and digits right after it) with a space
    text = re.sub(r'\w*\d\w*', '', text)                  # Remove words containing digits
    text = re.sub(r'\bx{2,}\b', '', text)                 # Remove masked tokens (xx, xxxx) but keep words that contain an x
    text = re.sub(r'\d+\s', '', text)                     # Remove other numerical values
    text = re.sub(r' +', ' ', text)                       # Collapse repeated spaces
    return text
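
To see what this kind of cleaning does to the masked fields in the complaints (`XXXX` placeholders, `XX/XX/YYYY` dates and `{$...}` amounts), here is a small standalone sketch. `strip_masks` is a hypothetical helper with its own simplified patterns, not the notebook's `data_cleaner`, and the sample sentence is made up.

```python
import re

def strip_masks(text):
    """Hypothetical helper: drop {$...} amounts, masked dates and XXXX placeholders."""
    text = text.lower()
    text = re.sub(r'\{\$[^}]*\}', ' ', text)                 # amounts such as {$1600.00}
    text = re.sub(r'\bxx/xx/(?:\d{4}|x{4})\b', ' ', text)    # masked dates such as xx/xx/2018
    text = re.sub(r'\bx{2,}\b', ' ', text)                   # standalone runs of x (masked names or numbers)
    return re.sub(r'\s+', ' ', text).strip()                 # collapse whitespace

sample = "I upgraded my XXXX XXXX card in XX/XX/2018 and paid {$1600.00} in tax"
print(strip_masks(sample))
# i upgraded my card in and paid in tax
```

The word boundaries (`\b`) matter here: they keep ordinary words that contain an "x", such as "tax", intact.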

In [21]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the column slice
df_new = df_new.assign(complaints=df_new['complaint_what_happened'].apply(data_cleaner))

In [22]:
df_new.head()

| | complaint_what_happened | product | complaints |
|---|---|---|---|
| 1 | Good morning my name is XXXX XXXX and I apprec... | Debt collection | good morning my name is and i appreciate it if... |
| 2 | I upgraded my XXXX XXXX card in XX/XX/2018 and... | Credit card or prepaid card | i upgraded my card in and was told by the agen... |
| 10 | Chase Card was reported on XX/XX/2019. However... | Credit reporting, credit repair services, or o... | chase card was reported on however fraudulent ... |
| 11 | On XX/XX/2018, while trying to book a XXXX  XX... | Credit reporting, credit repair services, or o... | on while trying to book a ticket i came across... |
| 14 | my grand son give me check for {$1600.00} i de... | Checking or savings account | my grand son give me check for i deposit it in... |


In [23]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [24]:
# Function to lemmatize the texts
wordnet_lemmatizer = WordNetLemmatizer()  # Create the lemmatizer once and reuse it across calls

def lemmatize(text):
    words = word_tokenize(text)    # Split the string into a list of tokens
    # lemmatize() defaults to pos='n', so verbs are treated as nouns (e.g. "was" becomes "wa")
    lemmatized_text = [wordnet_lemmatizer.lemmatize(word) for word in words]
    return " ".join(lemmatized_text)  # Join the lemmas back into one whitespace-separated string

In [26]:
# Creating a dataframe('df_clean') that will only have the complaints and the lemmatized complaints 
df_clean = pd.DataFrame({'Complaints': df_new['complaints'], 'Lemmatized': df_new['complaints'].apply(lemmatize)})

In [27]:
df_clean.head()

| | Complaints | Lemmatized |
|---|---|---|
| 1 | good morning my name is and i appreciate it if... | good morning my name is and i appreciate it if... |
| 2 | i upgraded my card in and was told by the agen... | i upgraded my card in and wa told by the agent... |
| 10 | chase card was reported on however fraudulent ... | chase card wa reported on however fraudulent a... |
| 11 | on while trying to book a ticket i came across... | on while trying to book a ticket i came across... |
| 14 | my grand son give me check for i deposit it in... | my grand son give me check for i deposit it in... |


In [29]:
import swifter

In [30]:
# Function to keep only the tokens whose POS tag is NN
def pos_tags(text):
    doc = nlp(text)
    # Collect the lowercased lemma of every token whose fine-grained tag is "NN"
    NN_tags = [tok.lemma_.lower() for tok in doc if tok.tag_ == 'NN']
    return " ".join(NN_tags)


df_clean["complaint_POS_removed"] = df_clean['Lemmatized'].swifter.apply(pos_tags)

# This column contains the lemmatized text with every word removed whose tag is not "NN".
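
Filtering on the exact tag `NN` keeps singular and mass common nouns only. In the Penn Treebank tag set, plural nouns (`NNS`) and proper nouns (`NNP`, `NNPS`) carry different tags and are dropped as well. The sketch below illustrates the filter on hand-written `(lemma, tag)` pairs; both the pairs and the helper `keep_singular_nouns` are hypothetical and stand in for real tagger output.

```python
# Hypothetical (lemma, tag) pairs, written by hand to mimic Penn Treebank tagger output
tagged = [
    ("card", "NN"),
    ("charges", "NNS"),
    ("chase", "NNP"),
    ("dispute", "VB"),
    ("account", "NN"),
]

def keep_singular_nouns(pairs):
    """Keep only the lemmas tagged exactly 'NN' and join them into one string."""
    return " ".join(lemma.lower() for lemma, tag in pairs if tag == "NN")

print(keep_singular_nouns(tagged))
# card account
```

If plural or proper nouns should be kept too, the condition could test `tag.startswith("NN")` instead; whether that suits the downstream model is a design choice this notebook does not settle.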

