# A data-driven way to merge similar classes together

Juan Lopez Martin

The quality of the class labels is decisive when training a classifier. In the famous MNIST example, the handwritten digit '6' was always labeled correctly as a '6'. It would clearly be very problematic if that digit were sometimes labeled '6' and at other times 'six', 'SIX', '6s', etc. In that case, all these labels would have to be merged before training. Otherwise, the classifier would probably pick up irrelevant features of the '6' in an attempt to distinguish between the classes '6', 'six', '6s', etc., and end up with much lower prediction accuracy.

We are in a similar situation if we want to train a classifier to infer the issue from the narrative. For instance, some narratives have the issue 'Incorrect information on credit report', while others have 'Incorrect information on your report'. It is relatively clear that these two issues are not different, and that these two labels should be merged. The same happens, for instance, with 'Attempts to collect debt not owed' and 'Cont'd attempts collect debt not owed'.

## Loading data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
import pandas as pd
import numpy as np
from datetime import datetime
import pickle
from operator import itemgetter
from itertools import compress
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import networkx 
from networkx.algorithms.components.connected import connected_components
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine


In [3]:
folder = 'drive/My Drive/IBM/'
df = pd.read_csv(folder+'Consumer_Complaints.csv')



In [0]:
df = df.dropna(subset=['Consumer complaint narrative'])

In [0]:
filename = folder+'texts.pickle'
with open(filename, 'rb') as infile:
    texts = pickle.load(infile)

In [0]:
filename = folder+'documents.pickle'
with open(filename, 'rb') as infile:
    documents = pickle.load(infile)

In [0]:
filename = folder+'model_100.pickle'
with open(filename, 'rb') as infile:
    model = pickle.load(infile)

In [8]:
#This is to get a subset of the dataset. p=1 means we get 100% of the rows

p = 1
rand = np.random.choice(a=[False, True], size=len(texts), p = [1-p, p])

df_s = df[rand]

texts_s = list(compress(texts, rand))
len(texts_s)

documents_s = list(compress(documents, rand))
len(documents_s)

444683

In [9]:
# We use L2 normalization on the vectors
model.docvecs.init_sims(replace=True)

# Then take the appropriate sample
vecs_s = []
for i in range(0, df.shape[0]):
  if rand[i]:
    vecs_s.append(model.docvecs[i])
    
len(vecs_s)

444683

## Data cleaning

We are using spaCy to:

* Remove the X, XX, XXX, etc.
* Remove tokens that are not alphabetic
* Remove spaces
* Remove stop words
* Remove punctuation
* Remove numbers
* Lemmatize

This transforms each narrative into a list of words.
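As a rough illustration of the filtering rules above, here is a minimal pure-Python sketch. It is a simplified stand-in, not the spaCy pipeline used in this notebook: it does no lemmatization, it lowercases tokens (an extra choice made here), and the `STOP_WORDS` set and the `clean_narrative` helper are hypothetical names introduced only for this example.

```python
import re

# Tiny illustrative stop-word list (assumption; spaCy ships a much larger one)
STOP_WORDS = {"i", "a", "the", "on", "my", "was", "to", "of", "and", "in", "with"}

# Placeholder tokens made only of two or more X's, such as XX or XXXX
MASK = re.compile(r"^X{2,}$")

def clean_narrative(text):
    """Return lowercase alphabetic tokens, dropping placeholders, numbers and stop words."""
    tokens = re.findall(r"[A-Za-z0-9/]+", text)
    kept = []
    for tok in tokens:
        if MASK.match(tok):        # placeholder such as XXXX
            continue
        if not tok.isalpha():      # numbers, dates like XX/XX/XXXX, mixed tokens
            continue
        if tok.lower() in STOP_WORDS:
            continue
        kept.append(tok.lower())
    return kept

example = "On XX/XX/XXXX I disputed 2 accounts on my credit report with XXXX."
print(clean_narrative(example))
```

The real pipeline additionally maps each kept token to its lemma, so that, for example, different inflections of the same word end up as a single vocabulary entry.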

In [0]:
## Commented out -- output saved at texts.pickle

#xs = ["X"*2, "X"*3, "X"*4, "X"*5, "X"*6, "X"*7, "X"*8, "X"*9, "X"*10, "X"*11, "XX/XX", "XX/XX/XXX", "XX/XX/XX", "XX/XXXX"]

#texts = []
#for sent in nlp.pipe(df['Consumer complaint narrative'], disable=["tagger", "parser", "ner", "textcat"]):
#    texts.append([word.lemma_ for word in sent 
#                  if word.is_alpha and
#                  not word.is_space and
#                  not word.is_stop and 
#                  not word.is_punct and 
#                  not word.like_num and 
#                     word.text not in xs])
    
#filename = folder+'texts.pickle'
#outfile = open(filename,'wb')
#pickle.dump(texts,outfile)
#outfile.close()

## Doc2vec

Word embeddings are vector representations of individual words that capture their meaning. If these word vectors are obtained from a large corpus, we would expect, for instance, the vectors for 'frog' and 'toad' to be very similar. Additionally, these vectors capture more complex relationships. For instance, the relationship between 'Washington' and 'USA' will be similar to the relationship between 'Beijing' and 'China'.

Doc2vec is an extension to the famous word embedding approach that trains vector representations of entire documents instead of just words. If trained on a large set of books we would expect, for instance, that Statistics books will have similar vectors.

In this case, we will train document embeddings based on the narratives. In other words, each narrative will be transformed into a 300-dimensional vector. The main advantage of this approach over simpler methods is that it takes into account the context in which words appear within sentences. In fact, under the hood the vectors are built by trying to predict a center word from the surrounding words together with the document vector (PV-DM). This allows the model to capture complex semantic relationships between words.

The hyperparameters used here are relatively standard in the field:

* 300-dimensional vectors.
* Window of 15 words (meaning we try to predict the center word from a 15-word window).
* Negative sampling with 5 negative samples, which improves results and reduces computational cost.
* 100 epochs. This gives good results, but training took 4 hours.


In [0]:
## Commented out -- output saved at model_100.pickle

# First, we create TaggedDocument objects, necessary for doc2vec. Then we
# train the vectors.

#documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(texts)]
#model = Doc2Vec(documents, dm = 1, vector_size=300, window=15, negative=5, min_count=1, workers=4, epochs = 100)

#filename = folder+'model_100.pickle'
#outfile = open(filename,'wb')
#pickle.dump(model, outfile)
#outfile.close()

## Selecting labels to merge

For each issue (class) we have a set of narratives (documents) that have been represented as 300-dimensional, L2-normalized vectors by the doc2vec algorithm. For example, for issue $A$ ('Incorrect information on your report') we have obtained the vectors $a_1, a_2, ..., a_{65508}$. One basic way to summarize the information we have about issue $A$ is to calculate the mean of all the vectors $a_i$. Doing this, we obtain an average 300-dimensional vector that can be seen as an approximate representation of issue $A$. (The mean of unit-length vectors is in general no longer unit-length, but this does not matter for the cosine similarity used below.) The idea is that the average of the vectors $a_i$ should capture what is, to a certain degree, common to the narratives of issue $A$.
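The averaging step can be sketched on toy data as follows. The vectors and labels are made up for illustration (2-D instead of 300-D), and `issue_centroids` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import numpy as np

# Toy unit-length "document vectors" (made-up values, 2-D instead of 300-D)
vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
])
labels = np.array(["A", "A", "B", "B"])

def issue_centroids(vecs, labels):
    """Map each label to the mean of the document vectors carrying that label."""
    return {lab: vecs[labels == lab].mean(axis=0) for lab in np.unique(labels)}

centroids = issue_centroids(vecs, labels)
for lab, centroid in centroids.items():
    print(lab, centroid, "norm =", round(float(np.linalg.norm(centroid)), 3))
```

Both centroids here have a norm below one: the more the documents of an issue disagree in direction, the shorter the mean vector becomes. Since cosine similarity only looks at direction, this shrinkage does not affect the comparison between issues.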

In [0]:
# Create list of issues ordered by value_counts
issues_list = list(df_s['Issue'].value_counts().index)

# Calculate the mean of the vectors corresponding to each issue
vecs_arr = np.array(vecs_s)
issue_labels = df_s['Issue'].values
vec_list = [vecs_arr[issue_labels == issue].mean(axis=0) for issue in issues_list]

Now we have a vector representation of issue $A$, $B$, $C$, etc. In word2vec the similarity of two vectors is measured using cosine similarity, which determines whether the two vectors are pointing in approximately the same direction.


We will create a similarity matrix that reports the cosine similarity between every pair of issues. If you have no experience with word2vec, you can think of it as conceptually similar to a correlation matrix: issues that are similar to each other will have a cosine similarity close to one, and unrelated issues a score close to zero.
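To make the definition concrete, here is a minimal NumPy sketch of a pairwise cosine similarity matrix, computed by normalizing each row and taking dot products. The input vectors are made-up toy values and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale every row to length 1
    return unit @ unit.T

# Toy issue vectors (made-up values)
issue_vecs = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as the first row, different length
    [0.0, 3.0],   # orthogonal to the first two rows
])
sim = cosine_similarity_matrix(issue_vecs)
print(np.round(sim, 3))
```

The first two rows have similarity one even though their lengths differ, which is why only the direction of the averaged issue vectors matters. In general the score ranges from -1 to 1, so negative values are possible as well.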

In [25]:
# Create similarity matrix: cosine similarity = 1 - cosine distance
# (despite its name, `dist` holds similarities, not distances)
dist = 1-pairwise_distances(np.array(vec_list), metric="cosine")

# Put the similarity matrix in a dataframe with index and column names
dist_df = pd.DataFrame(dist, index = issues_list, columns=issues_list)

dist_df.iloc[0:5,0:5]

| | Incorrect information on your report | Problem with a credit reporting company's investigation into an existing problem | Attempts to collect debt not owed | Incorrect information on credit report | Improper use of your report |
|---|---|---|---|---|---|
| Incorrect information on your report | 1.0 | 0.988587 | 0.960438 | 0.990585 | 0.962792 |
| Problem with a credit reporting company's investigation into an existing problem | 0.988587 | 1.0 | 0.947416 | 0.988502 | 0.950099 |
| Attempts to collect debt not owed | 0.960438 | 0.947416 | 1.0 | 0.947011 | 0.929568 |
| Incorrect information on credit report | 0.990585 | 0.988502 | 0.947011 | 1.0 | 0.955511 |
| Improper use of your report | 0.962792 | 0.950099 | 0.929568 | 0.955511 | 1.0 |


We have to set a threshold to decide which issues to merge and which to keep separate. In this case I have settled on 0.985; in my experience, a value between 0.98 and 0.99 is a reasonable threshold. However, the appropriate threshold is open to discussion.
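One way to inform that discussion is to check how the set of candidate merges changes with the threshold. The sketch below does this for the five issues in the similarity matrix printed above; `pairs_above` is a hypothetical helper, and the issues are referred to by their row position (0 to 4) in that matrix.

```python
import numpy as np

# 5 x 5 corner of the similarity matrix printed earlier in this notebook
sim = np.array([
    [1.0,      0.988587, 0.960438, 0.990585, 0.962792],
    [0.988587, 1.0,      0.947416, 0.988502, 0.950099],
    [0.960438, 0.947416, 1.0,      0.947011, 0.929568],
    [0.990585, 0.988502, 0.947011, 1.0,      0.955511],
    [0.962792, 0.950099, 0.929568, 0.955511, 1.0],
])

def pairs_above(sim, threshold):
    """Return index pairs (i, j) with i < j whose similarity exceeds the threshold."""
    rows, cols = np.triu_indices(sim.shape[0], k=1)  # upper triangle, no diagonal
    return [(int(i), int(j)) for i, j in zip(rows, cols) if sim[i, j] > threshold]

for t in (0.95, 0.98, 0.985, 0.99):
    print(t, pairs_above(sim, t))
```

For these five issues, thresholds of 0.98 and 0.985 select the same three pairs (all among issues 0, 1 and 3), whereas 0.95 would also start linking the debt-collection and report-use issues to the credit-report ones.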


Note that an issue can take part in several merges. Namely, if issue $A$ is going to be merged with $B$, and $B$ is also going to be merged with $C$, we obtain one new issue $K = A \cup B \cup C$. This is equivalent to saying that, given a network $G = (V, E)$ in which each vertex is an original issue and two vertices are connected by an edge if and only if the corresponding issues are going to be merged, we keep each connected component as a label.
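The transitive merging can be illustrated without any graph library. The following is a minimal union-find sketch, offered as a dependency-free alternative to the networkx approach used in this notebook; `merge_groups` is a hypothetical helper name and the labels are placeholders.

```python
def merge_groups(pairs):
    """Group labels into connected components, given the pairs to merge (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # union the two components

    groups = {}
    for label in parent:
        groups.setdefault(find(label), set()).add(label)
    return list(groups.values())

# A-B and B-C are merged, so A, B and C end up in the same component
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(merge_groups(pairs))
```

Note that $A$ and $C$ are grouped even though they were never directly compared; with a similarity threshold this means that two issues in the same component may themselves fall below the threshold.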

In [23]:
# The issues to merge are decided according to a user-defined threshold.
threshold = 0.985
l = [list(dist_df[dist_df.loc[issue]>threshold].index) for issue in issues_list]

# Finding components in the network.
# Code from https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements?lq=1

def to_graph(l):
    G = networkx.Graph()
    for part in l:
        # each sublist is a bunch of nodes
        G.add_nodes_from(part)
        # it also implies a number of edges:
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        treat `l` as a Graph and return its edges 
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current    

G = to_graph(l)
components = list(connected_components(G))

for aset in components:
  if len(aset)>1:
    for element in aset:
      print(element)
    print("\n")

Credit reporting company's investigation
Incorrect information on credit report
Incorrect information on your report
Problem with a credit reporting company's investigation into an existing problem


Attempts to collect debt not owed
Cont'd attempts collect debt not owed


Trouble during payment process
Loan servicing, payments, escrow account


Written notification about debt
Disclosure verification of debt


Struggling to pay mortgage
Loan modification,collection,foreclosure


Managing an account
Deposits and withdrawals


Took or threatened to take negative or legal action
Taking/threatening an illegal action


Application, originator, mortgage broker
Applying for a mortgage or refinancing an existing mortgage


Can't repay my loan
Struggling to repay your loan


Unable to get your credit report or credit score
Unable to get credit report/credit score


Problem caused by your funds being low
Problems caused by my funds being low
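The components printed above can then be turned into a relabeling map. The sketch below is one possible way to do it, not a step performed in this notebook: `build_label_map` is a hypothetical helper, and choosing the alphabetically first label as the representative of each component is an arbitrary assumption.

```python
def build_label_map(components):
    """Map every label in a component to one representative label of that component."""
    label_map = {}
    for comp in components:
        representative = sorted(comp)[0]  # arbitrary but deterministic choice (assumption)
        for label in comp:
            label_map[label] = representative
    return label_map

# Two of the components found above, plus a single-issue component
components = [
    {"Attempts to collect debt not owed", "Cont'd attempts collect debt not owed"},
    {"Can't repay my loan", "Struggling to repay your loan"},
    {"Privacy"},
]
label_map = build_label_map(components)

issues = ["Cont'd attempts collect debt not owed", "Privacy", "Struggling to repay your loan"]
merged = [label_map.get(issue, issue) for issue in issues]
print(merged)
```

With pandas, the same dictionary could be applied to the label column with something like `df_s['Issue'].replace(label_map)`, leaving issues that are not in the map unchanged.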




## Garbage

In [0]:
#documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(texts)]
#model = Doc2Vec(documents, dm = 1, vector_size=300, window=15, negative=5, min_count=1, workers=4, epochs = 25)

In [0]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=20).fit(vecs_s)

In [0]:

issues_considered = list(df_s['Issue'].value_counts().index)
newdf = df_s[df_s['Issue'].isin(issues_considered)]
ct = pd.crosstab(newdf['Issue'],kmeans.labels_)
ctp = ct.apply(lambda x: x/x.sum(), axis = 1)
ctp['n'] = ct.apply(lambda x: x.sum(), axis=1)
#ctp.loc[list(newdf['Issue'].value_counts().index)].plot.bar(stacked = True)

In [0]:
ctp_f = ctp[ctp['n']<10]

[ctp_f[ctp_f[i]>0].index for i in range(0,20)]

[Index(['Arbitration', 'Balance transfer fee', 'Privacy'], dtype='object', name='Issue'),
 Index(['Can't contact lender or servicer',
        'Loan payment wasn't credited to your account',
        'Vehicle was damaged or destroyed the vehicle',
        'Vehicle was repossessed or sold the vehicle'],
       dtype='object', name='Issue'),
 Index(['Arbitration', 'Can't contact lender or servicer',
        'Can't stop charges to bank account',
        'Can't stop withdrawals from your bank account',
        'Charged bank acct wrong day or amt',
        'Confusing or misleading advertising or marketing',
        'Customer service/Customer relations', 'Getting a line of credit',
        'Lender repossessed or sold the vehicle',
        'Loan payment wasn't credited to your account',
        'Lost or stolen money order', 'Other service issues',
        'Payment to acct not credited',
        'Problem with additional add-on products or services',
        'Sale of account'],
       dtype='obje

In [0]:
issuestoconsider = df['Issue'].value_counts()[df['Issue'].value_counts()>1000].index.values.tolist()

In [0]:
issuesclustered = ctp.loc[issuestoconsider].idxmax(axis=1)

In [0]:
for i in range(0, 8):
    print(str(i), ":\n")
    print(issuesclustered[issuesclustered == i].index.values.tolist())
    print("\n")

0 :

['Problem with a purchase shown on your statement', 'Other features, terms, or problems', 'Billing disputes', 'Other', 'Identity theft / Fraud / Embezzlement', 'Closing/Cancelling account']


1 :

['Loan servicing, payments, escrow account', 'Dealing with my lender or servicer', 'Managing the loan or lease', 'Fees or interest', 'Struggling to repay your loan', 'Struggling to pay your loan', 'Closing your account', 'Problems caused by my funds being low', 'Opening an account', 'Money was not available when promised', 'Problem caused by your funds being low']


2 :

['Incorrect information on your report', "Problem with a credit reporting company's investigation into an existing problem", 'Incorrect information on credit report', 'Improper use of your report', 'Problem when making payments', 'Improper use of my credit report']


3 :

[]


4 :

['Attempts to collect debt not owed', "Cont'd attempts collect debt not owed", 'Communication tactics', 'Written notification about debt', 'D

In [0]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN().fit(vecs_s)

In [0]:
unique, counts = np.unique(dbscan.labels_, return_counts=True)

np.asarray((unique, counts)).T

array([[   -1, 20595],
       [    0,  1854],
       [    1,     5]])