### Multi-Class Text Classification with Scikit-Learn – Towards Data Science

Classifying Consumer Finance Complaints into 12 pre-defined classes. 

https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

### 1. Loading data

In [1]:
# importing pandas package 
import pandas as pd

# making data frame from csv file 
df = pd.read_csv('Consumer_Complaints.csv')




### 2. Data exploration

In [2]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer Complaint,Company Public Response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date Sent to Company,Company Response to Consumer,Timely response?,Consumer disputed?,Complaint ID,Unnamed: 18
0,03-12-2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382,,,Referral,03/17/2014,Closed with explanation,Yes,No,759217,
1,10-01-2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit repor...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10-05-2016,Closed with explanation,Yes,No,2141773,
2,10/17/2016,Consumer Loan,Vehicle loan,Managing the loan or lease,,I purchased a new car on XXXX XXXX. The car de...,,"CITIZENS FINANCIAL GROUP, INC.",PA,177XX,Older American,Consent provided,Web,10/20/2016,Closed with explanation,Yes,No,2163100,
3,06-08-2014,Credit card,,Bankruptcy,,,,AMERICAN EXPRESS COMPANY,ID,83854,Older American,,Web,06-10-2014,Closed with explanation,Yes,Yes,885638,
4,09/13/2014,Debt collection,Credit card,Communication tactics,Frequent or repeated calls,,,"CITIBANK, N.A.",VA,23233,,,Web,09/13/2014,Closed with explanation,Yes,Yes,1027760,


In [3]:
# show the info of the data  
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025010 entries, 0 to 1025009
Data columns (total 19 columns):
Date received                   1025010 non-null object
Product                         1025010 non-null object
Sub-product                     789840 non-null object
Issue                           1025010 non-null object
Sub-issue                       528853 non-null object
Consumer Complaint              277814 non-null object
Company Public Response         318364 non-null object
Company                         1025010 non-null object
State                           1012650 non-null object
ZIP code                        1008292 non-null object
Tags                            141588 non-null object
Consumer consent provided?      491911 non-null object
Submitted via                   1025010 non-null object
Date Sent to Company            1025010 non-null object
Company Response to Consumer    1025007 non-null object
Timely response?                1025010 non-null obje

### 3. Data cleaning

In [4]:
# Syntax
# pd.notnull(obj)

# boolean Series: True where the complaint text is present (not NaN)
has_complaint = pd.notnull(df['Consumer Complaint'])

# keep only the rows that have a complaint narrative
df = df[has_complaint]


df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 277814 entries, 1 to 1025009
Data columns (total 19 columns):
Date received                   277814 non-null object
Product                         277814 non-null object
Sub-product                     225631 non-null object
Issue                           277814 non-null object
Sub-issue                       178874 non-null object
Consumer Complaint              277814 non-null object
Company Public Response         135323 non-null object
Company                         277814 non-null object
State                           276758 non-null object
ZIP code                        275414 non-null object
Tags                            47538 non-null object
Consumer consent provided?      277814 non-null object
Submitted via                   277814 non-null object
Date Sent to Company            277814 non-null object
Company Response to Consumer    277813 non-null object
Timely response?                277814 non-null object
Consumer 
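
The filter above can be checked on a small frame. The sketch below uses a made-up three-row DataFrame (the values are illustrative, not taken from the dataset) and shows that the boolean mask from `notnull` keeps the same rows as `dropna(subset=...)`, which is an equivalent, slightly shorter spelling.

```python
import pandas as pd

# Hypothetical toy data: only the middle row has a complaint narrative
toy = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Debt collection"],
    "Consumer Complaint": [None, "Card was charged twice", None],
})

# Boolean mask: True where the narrative is present
mask = toy["Consumer Complaint"].notnull()
filtered = toy[mask]

# Equivalent one-liner
filtered_dropna = toy.dropna(subset=["Consumer Complaint"])

print(filtered)
```

Both variants keep the original index labels, which is why the index of the filtered notebook frame starts at 1 rather than 0.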

In [5]:
df.head(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer Complaint,Company Public Response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date Sent to Company,Company Response to Consumer,Timely response?,Consumer disputed?,Complaint ID,Unnamed: 18
1,10-01-2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit repor...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10-05-2016,Closed with explanation,Yes,No,2141773,
2,10/17/2016,Consumer Loan,Vehicle loan,Managing the loan or lease,,I purchased a new car on XXXX XXXX. The car de...,,"CITIZENS FINANCIAL GROUP, INC.",PA,177XX,Older American,Consent provided,Web,10/20/2016,Closed with explanation,Yes,No,2163100,
7,06/15/2015,Credit reporting,,Credit reporting company's investigation,Inadequate help over the phone,An account on my credit report has a mistaken ...,Company chooses not to provide a public response,Experian Information Solutions Inc.,VA,224XX,,Consent provided,Web,06/15/2015,Closed with explanation,Yes,No,1420702,
12,02-03-2016,Debt collection,"Other (i.e. phone, health club, etc.)",Disclosure verification of debt,Not given enough info to verify debt,This company refuses to provide me verificatio...,,"The CBE Group, Inc.",TX,752XX,,Consent provided,Web,02-03-2016,Closed with explanation,Yes,Yes,1772196,
16,02/17/2016,Debt collection,Credit card,Improper contact or sharing of info,Talked to a third party about my debt,This complaint is in regards to Square Two Fin...,Company has responded to the consumer and the ...,SQUARETWO FINANCIAL CORPORATION,NE,693XX,,Consent provided,Web,03-04-2016,Closed with explanation,Yes,Yes,1790634,


### 4. Feature selection

In [6]:
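# For this project we need only two columns: 'Product' and 'Consumer Complaint'
# (the complaint narrative).

selected_columns = ['Product', 'Consumer Complaint']

# keep only those columns; .copy() makes df an independent frame so that
# adding a column later does not trigger SettingWithCopyWarning
df = df[selected_columns].copy()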

# check the new df's columns
df.columns

Index(['Product', 'Consumer Complaint'], dtype='object')

In [7]:
# rename the columns

df.columns = ['Product', 'Consumer_complaint_narrative']
df.columns

Index(['Product', 'Consumer_complaint_narrative'], dtype='object')

In [8]:
# check the new df 
df.head()

Unnamed: 0,Product,Consumer_complaint_narrative
1,Credit reporting,I have outdated information on my credit repor...
2,Consumer Loan,I purchased a new car on XXXX XXXX. The car de...
7,Credit reporting,An account on my credit report has a mistaken ...
12,Debt collection,This company refuses to provide me verificatio...
16,Debt collection,This complaint is in regards to Square Two Fin...


In [9]:
# add a new column
# add a column encoding the product as an integer 
df['category_id'] = pd.factorize(df['Product'])[0]

# another example
# df['category_id'] = df['Product'].factorize()[0]


df.head(5)

Unnamed: 0,Product,Consumer_complaint_narrative,category_id
1,Credit reporting,I have outdated information on my credit repor...,0
2,Consumer Loan,I purchased a new car on XXXX XXXX. The car de...,1
7,Credit reporting,An account on my credit report has a mistaken ...,0
12,Debt collection,This company refuses to provide me verificatio...,2
16,Debt collection,This complaint is in regards to Square Two Fin...,2


In [10]:


# one row per product, ordered by its integer id
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')

category_id_df

Unnamed: 0,Product,category_id
1,Credit reporting,0
2,Consumer Loan,1
12,Debt collection,2
25,Mortgage,3
36,Credit card,4
90,Other financial service,5
124,Bank account or service,6
152,Student loan,7
168,Money transfers,8
538,Payday loan,9


In [11]:
category_to_id = dict(category_id_df.values)
category_to_id

{'Credit reporting': 0,
 'Consumer Loan': 1,
 'Debt collection': 2,
 'Mortgage': 3,
 'Credit card': 4,
 'Other financial service': 5,
 'Bank account or service': 6,
 'Student loan': 7,
 'Money transfers': 8,
 'Payday loan': 9,
 'Prepaid card': 10,
 'Money transfer, virtual currency, or money service': 11,
 'Credit reporting, credit repair services, or other personal consumer reports': 12,
 'Checking or savings account': 13,
 'Vehicle loan or lease': 14,
 'Credit card or prepaid card': 15,
 'Virtual currency': 16,
 'Payday loan, title loan, or personal loan': 17}

In [12]:

id_to_category = dict(category_id_df[['category_id', 'Product']].values)
id_to_category

{0: 'Credit reporting',
 1: 'Consumer Loan',
 2: 'Debt collection',
 3: 'Mortgage',
 4: 'Credit card',
 5: 'Other financial service',
 6: 'Bank account or service',
 7: 'Student loan',
 8: 'Money transfers',
 9: 'Payday loan',
 10: 'Prepaid card',
 11: 'Money transfer, virtual currency, or money service',
 12: 'Credit reporting, credit repair services, or other personal consumer reports',
 13: 'Checking or savings account',
 14: 'Vehicle loan or lease',
 15: 'Credit card or prepaid card',
 16: 'Virtual currency',
 17: 'Payday loan, title loan, or personal loan'}
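
As a side note, `pd.factorize` already returns the unique labels alongside the integer codes, so both lookup dictionaries can be built without the intermediate de-duplicated frame. The sketch below uses a short made-up Series of product names; it is an alternative route to the same mappings, not the notebook's method.

```python
import pandas as pd

# Hypothetical sample of product labels
products = pd.Series(
    ["Credit reporting", "Consumer Loan", "Credit reporting", "Debt collection"]
)

# codes: integer id per row; uniques: labels in order of first appearance
codes, uniques = pd.factorize(products)

# Position in `uniques` is the integer id assigned by factorize
id_to_category = dict(enumerate(uniques))
category_to_id = {name: idx for idx, name in id_to_category.items()}

print(codes)
print(category_to_id)
```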

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 277814 entries, 1 to 1025009
Data columns (total 3 columns):
Product                         277814 non-null object
Consumer_complaint_narrative    277814 non-null object
category_id                     277814 non-null int64
dtypes: int64(1), object(2)
memory usage: 8.5+ MB


In [14]:
df.head()

Unnamed: 0,Product,Consumer_complaint_narrative,category_id
1,Credit reporting,I have outdated information on my credit repor...,0
2,Consumer Loan,I purchased a new car on XXXX XXXX. The car de...,1
7,Credit reporting,An account on my credit report has a mistaken ...,0
12,Debt collection,This company refuses to provide me verificatio...,2
16,Debt collection,This complaint is in regards to Square Two Fin...,2
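
Before training, it can help to look at how many narratives each product has, since the classes are unlikely to be evenly sized. A minimal sketch on made-up labels: `value_counts()` gives the absolute counts and `normalize=True` gives each class's share.

```python
import pandas as pd

# Hypothetical labels; the real column is df['Product']
toy = pd.DataFrame(
    {"Product": ["Mortgage", "Debt collection", "Mortgage", "Credit card", "Mortgage"]}
)

counts = toy["Product"].value_counts()                # rows per class, largest first
shares = toy["Product"].value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(2))
```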


In [15]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6)) # 800 * 600

df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)

plt.show() # run 2 times, if the figure doesn't show up

<Figure size 800x600 with 1 Axes>

In [16]:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


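# bag-of-words counts (defined here for comparison; not used in the cells below)
tf_count_vectorizer = CountVectorizer(strip_accents='unicode',
                                      max_features=1000,
                                      stop_words='english',
                                      max_df=0.5,
                                      min_df=10)

# TF-IDF over unigrams and bigrams:
#   sublinear_tf=True  -> use 1 + log(tf) instead of the raw term frequency
#   min_df=5           -> drop terms that appear in fewer than 5 documents
#   norm='l2'          -> scale every document vector to unit Euclidean length
#   encoding='latin-1' -> only applies when the input is bytes or files
tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                   min_df=5,
                                   norm='l2',
                                   encoding='latin-1',
                                   ngram_range=(1, 2),
                                   stop_words='english')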



In [17]:
labels = df.category_id

# labels

In [None]:
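# Keep the features as a SciPy sparse matrix. The dense variant below would
# allocate an (n_documents x n_terms) array, which is likely too large here.
# features = tfidf_vectorizer.fit_transform(df.Consumer_complaint_narrative).toarray()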


features = tfidf_vectorizer.fit_transform(df.Consumer_complaint_narrative)

features.shape
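
To see what the vectorizer produces, here is a small sketch on three made-up sentences. It uses the same `sublinear_tf`, `norm`, `ngram_range` and `stop_words` settings as above but drops `min_df`, because a three-document corpus would otherwise lose every term. The sentences are illustrative only.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus
docs = [
    "my credit report has an error",
    "the credit card was charged twice",
    "debt collector keeps calling me",
]

vec = TfidfVectorizer(sublinear_tf=True, norm="l2",
                      ngram_range=(1, 2), stop_words="english")
X = vec.fit_transform(docs)

# One row per document, one column per unigram or bigram
print(X.shape)
print(sparse.issparse(X))

# With norm='l2', every non-empty row has unit Euclidean length
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)
```

Stop words are removed before n-grams are formed, so "credit report" ends up as a bigram even though other words sit around it in the sentence.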


In [None]:
# Find the terms that are most correlated with each product.

from sklearn.feature_selection import chi2
import numpy as np

N = 2

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
all_feature_names = np.array(tfidf_vectorizer.get_feature_names_out())

for product, category_id in sorted(category_to_id.items()):
    # chi2 returns (statistics, p-values); test each term against "is this product?"
    features_chi2 = chi2(features, labels == category_id)

    # indices sorted by ascending chi2 score, so the strongest terms come last
    indices = np.argsort(features_chi2[0])
    feature_names = all_feature_names[indices]

    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]

    print("# '{}':".format(product))
    print("  . Most correlated unigrams:\n       . {}".format('\n       . '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n       . {}".format('\n       . '.join(bigrams[-N:])))

In [None]:
# Naive Bayes Classifier

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(df['Consumer_complaint_narrative'], df['Product'], random_state = 0)
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)
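
One way to keep the counting and TF-IDF steps in sync between training and prediction is scikit-learn's `Pipeline`, which applies every fitted step in order whenever `predict` is called. This is a sketch on a made-up four-sentence corpus, not the notebook's own setup; the step names `counts`, `tfidf` and `nb` are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data
train_texts = [
    "debt collector keeps calling about a debt",
    "collector demands payment of old debt",
    "mortgage escrow statement is wrong",
    "mortgage lender lost my payment",
]
train_labels = ["Debt collection", "Debt collection", "Mortgage", "Mortgage"]

text_clf = Pipeline([
    ("counts", CountVectorizer()),   # text -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights
    ("nb", MultinomialNB()),         # classifier on TF-IDF features
])
text_clf.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes and weights them before predicting
predictions = text_clf.predict([
    "the collector keeps calling about this debt",
    "mortgage escrow statement",
])
print(predictions)
```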

In [None]:
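# the classifier was trained on TF-IDF features, so apply the same weighting before predicting
print(clf.predict(tfidf_transformer.transform(count_vect.transform(["This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."]))))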

In [None]:
df[df['Consumer_complaint_narrative'] == "This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."]

In [None]:
# the classifier was trained on TF-IDF features, so apply the same weighting before predicting
print(clf.predict(tfidf_transformer.transform(count_vect.transform(["I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"]))))

In [None]:
df[df['Consumer_complaint_narrative'] == "I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"]

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5        # number of cross-validation folds
entries = []  # one (model_name, fold_idx, accuracy) tuple per model and fold



In [None]:

# run 5-fold cross-validation for every model and record each fold's accuracy
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))

In [None]:
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

In [None]:
import seaborn as sns

sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()
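
Alongside the box plot, the per-fold accuracies can be summarised numerically with a `groupby`. The sketch below repeats the same collect-then-aggregate pattern on synthetic data from `make_classification` with only two of the models, so it runs in seconds; the dataset, its size, and `max_iter=1000` are assumptions made for the example, and the resulting numbers say nothing about the complaint data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for the TF-IDF features and labels
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

models = [LogisticRegression(max_iter=1000), LinearSVC()]
CV = 5
entries = []

for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, X, y, scoring="accuracy", cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))

cv_df = pd.DataFrame(entries, columns=["model_name", "fold_idx", "accuracy"])

# Mean and spread of accuracy across folds, one row per model
summary = cv_df.groupby("model_name")["accuracy"].agg(["mean", "std"])
print(summary)
```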