### Multi-Class Text Classification with Scikit-Learn – Towards Data Science

Classifying consumer finance complaints into pre-defined product classes (12 in the original article; the copy of the dataset used in this notebook contains 18).

https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f

### 1. Loading data

In [1]:
# importing pandas package 
import pandas as pd

# making data frame from csv file 
df = pd.read_csv('Consumer_Complaints.csv')




### 2. Data exploration

In [2]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer Complaint,Company Public Response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date Sent to Company,Company Response to Consumer,Timely response?,Consumer disputed?,Complaint ID,Unnamed: 18
0,03-12-2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382,,,Referral,03/17/2014,Closed with explanation,Yes,No,759217,
1,10-01-2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit repor...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10-05-2016,Closed with explanation,Yes,No,2141773,
2,10/17/2016,Consumer Loan,Vehicle loan,Managing the loan or lease,,I purchased a new car on XXXX XXXX. The car de...,,"CITIZENS FINANCIAL GROUP, INC.",PA,177XX,Older American,Consent provided,Web,10/20/2016,Closed with explanation,Yes,No,2163100,
3,06-08-2014,Credit card,,Bankruptcy,,,,AMERICAN EXPRESS COMPANY,ID,83854,Older American,,Web,06-10-2014,Closed with explanation,Yes,Yes,885638,
4,09/13/2014,Debt collection,Credit card,Communication tactics,Frequent or repeated calls,,,"CITIBANK, N.A.",VA,23233,,,Web,09/13/2014,Closed with explanation,Yes,Yes,1027760,


In [3]:
# show the info of the data  
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025010 entries, 0 to 1025009
Data columns (total 19 columns):
Date received                   1025010 non-null object
Product                         1025010 non-null object
Sub-product                     789840 non-null object
Issue                           1025010 non-null object
Sub-issue                       528853 non-null object
Consumer Complaint              277814 non-null object
Company Public Response         318364 non-null object
Company                         1025010 non-null object
State                           1012650 non-null object
ZIP code                        1008292 non-null object
Tags                            141588 non-null object
Consumer consent provided?      491911 non-null object
Submitted via                   1025010 non-null object
Date Sent to Company            1025010 non-null object
Company Response to Consumer    1025007 non-null object
Timely response?                1025010 non-null obje

In [4]:
df.shape

(1025010, 19)

### 3. Data cleaning

In [5]:
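# Syntax
# pd.notnull(obj), where obj is a Series or DataFrame

# creating a bool Series that is True where the complaint text is NOT NaN
not_null_columns = pd.notnull(df['Consumer Complaint'])

# filtering data: keep only the rows that have a complaint text
df = df[not_null_columns]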


df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 277814 entries, 1 to 1025009
Data columns (total 19 columns):
Date received                   277814 non-null object
Product                         277814 non-null object
Sub-product                     225631 non-null object
Issue                           277814 non-null object
Sub-issue                       178874 non-null object
Consumer Complaint              277814 non-null object
Company Public Response         135323 non-null object
Company                         277814 non-null object
State                           276758 non-null object
ZIP code                        275414 non-null object
Tags                            47538 non-null object
Consumer consent provided?      277814 non-null object
Submitted via                   277814 non-null object
Date Sent to Company            277814 non-null object
Company Response to Consumer    277813 non-null object
Timely response?                277814 non-null object
Consumer 

In [6]:
df.head(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer Complaint,Company Public Response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date Sent to Company,Company Response to Consumer,Timely response?,Consumer disputed?,Complaint ID,Unnamed: 18
1,10-01-2016,Credit reporting,,Incorrect information on credit report,Account status,I have outdated information on my credit repor...,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,352XX,,Consent provided,Web,10-05-2016,Closed with explanation,Yes,No,2141773,
2,10/17/2016,Consumer Loan,Vehicle loan,Managing the loan or lease,,I purchased a new car on XXXX XXXX. The car de...,,"CITIZENS FINANCIAL GROUP, INC.",PA,177XX,Older American,Consent provided,Web,10/20/2016,Closed with explanation,Yes,No,2163100,
7,06/15/2015,Credit reporting,,Credit reporting company's investigation,Inadequate help over the phone,An account on my credit report has a mistaken ...,Company chooses not to provide a public response,Experian Information Solutions Inc.,VA,224XX,,Consent provided,Web,06/15/2015,Closed with explanation,Yes,No,1420702,
12,02-03-2016,Debt collection,"Other (i.e. phone, health club, etc.)",Disclosure verification of debt,Not given enough info to verify debt,This company refuses to provide me verificatio...,,"The CBE Group, Inc.",TX,752XX,,Consent provided,Web,02-03-2016,Closed with explanation,Yes,Yes,1772196,
16,02/17/2016,Debt collection,Credit card,Improper contact or sharing of info,Talked to a third party about my debt,This complaint is in regards to Square Two Fin...,Company has responded to the consumer and the ...,SQUARETWO FINANCIAL CORPORATION,NE,693XX,,Consent provided,Web,03-04-2016,Closed with explanation,Yes,Yes,1790634,


In [7]:
df.shape

(277814, 19)
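The filtering step above can also be written with `dropna`. The following is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset), checking that the boolean mask and `dropna(subset=...)` keep the same rows.

```python
import pandas as pd

# Toy frame standing in for the complaints data (values are invented).
toy = pd.DataFrame({
    "Product": ["Mortgage", "Credit card", "Debt collection"],
    "Consumer Complaint": [None, "Card was charged twice", None],
})

# Boolean mask: True where a complaint text is present.
has_text = pd.notnull(toy["Consumer Complaint"])
filtered_by_mask = toy[has_text]

# Same filter expressed with dropna, restricted to one column.
filtered_by_dropna = toy.dropna(subset=["Consumer Complaint"])

print(filtered_by_mask.equals(filtered_by_dropna))
```

Either form keeps the original row labels, which is why the index of the filtered frame above starts at 1 rather than 0.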

### 4. Feature selection

In [8]:
# For this project, we need only two columns: 'Product' and 'Consumer Complaint'.
# column manipulation

selected_columns = ['Product', 'Consumer Complaint']

# get the new df (copy so that later column assignments do not trigger SettingWithCopyWarning)
df = df[selected_columns].copy()

# check the new df's columns
df.columns

Index(['Product', 'Consumer Complaint'], dtype='object')

In [9]:
# rename the columns

df.columns = ['category', 'consumer_complaint']
df.columns

Index(['category', 'consumer_complaint'], dtype='object')

In [10]:
# check the new df 
df.head()

Unnamed: 0,category,consumer_complaint
1,Credit reporting,I have outdated information on my credit repor...
2,Consumer Loan,I purchased a new car on XXXX XXXX. The car de...
7,Credit reporting,An account on my credit report has a mistaken ...
12,Debt collection,This company refuses to provide me verificatio...
16,Debt collection,This complaint is in regards to Square Two Fin...


In [11]:
# add the category_id column


from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

df['category_id'] = label_encoder.fit_transform(df['category'])



df.head()

Unnamed: 0,category,consumer_complaint,category_id
1,Credit reporting,I have outdated information on my credit repor...,5
2,Consumer Loan,I purchased a new car on XXXX XXXX. The car de...,2
7,Credit reporting,An account on my credit report has a mistaken ...,5
12,Debt collection,This company refuses to provide me verificatio...,7
16,Debt collection,This complaint is in regards to Square Two Fin...,7


In [12]:
df.shape

(277814, 3)
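As a small illustration of what `LabelEncoder` does in the cell above, here is a sketch on a hand-made list of category names. The list is an assumption for demonstration only; the point is that ids follow the sorted order of the unique labels and can be mapped back with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-made labels for illustration; not the full set from the dataset.
categories = ["Mortgage", "Credit card", "Debt collection", "Credit card"]

encoder = LabelEncoder()
ids = encoder.fit_transform(categories)

# classes_ holds the unique labels in sorted order; the position is the id.
print(list(encoder.classes_))
print(ids.tolist())

# inverse_transform maps ids back to the original label strings.
recovered = encoder.inverse_transform([0, 2]).tolist()
print(recovered)
```

Because the ids come from alphabetical order, they carry no ordinal meaning and are only used as class labels.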

### get top words under each category

In [13]:
# one row per (category, category_id) pair, ordered by id

category_id_df = df[['category', 'category_id']].drop_duplicates().sort_values('category_id')

category_id_df

Unnamed: 0,category,category_id
124,Bank account or service,0
4099,Checking or savings account,1
2,Consumer Loan,2
36,Credit card,3
7342,Credit card or prepaid card,4
1,Credit reporting,5
1529,"Credit reporting, credit repair services, or o...",6
12,Debt collection,7
1431,"Money transfer, virtual currency, or money ser...",8
168,Money transfers,9


In [14]:
category_id_df1 = df.category_id.unique()
category_id_df1

array([ 5,  2,  7, 10,  3, 11,  0, 15,  9, 12, 14,  8,  6,  1, 16,  4, 17,
       13])

In [15]:
type(category_id_df)

pandas.core.frame.DataFrame

In [16]:
category_id_dict = dict(category_id_df.values)
category_id_dict

{'Bank account or service': 0,
 'Checking or savings account': 1,
 'Consumer Loan': 2,
 'Credit card': 3,
 'Credit card or prepaid card': 4,
 'Credit reporting': 5,
 'Credit reporting, credit repair services, or other personal consumer reports': 6,
 'Debt collection': 7,
 'Money transfer, virtual currency, or money service': 8,
 'Money transfers': 9,
 'Mortgage': 10,
 'Other financial service': 11,
 'Payday loan': 12,
 'Payday loan, title loan, or personal loan': 13,
 'Prepaid card': 14,
 'Student loan': 15,
 'Vehicle loan or lease': 16,
 'Virtual currency': 17}

In [17]:
type(category_id_dict)

dict

In [18]:
id_category_dict = dict(category_id_df[['category_id', 'category']].values)
id_category_dict

{0: 'Bank account or service',
 1: 'Checking or savings account',
 2: 'Consumer Loan',
 3: 'Credit card',
 4: 'Credit card or prepaid card',
 5: 'Credit reporting',
 6: 'Credit reporting, credit repair services, or other personal consumer reports',
 7: 'Debt collection',
 8: 'Money transfer, virtual currency, or money service',
 9: 'Money transfers',
 10: 'Mortgage',
 11: 'Other financial service',
 12: 'Payday loan',
 13: 'Payday loan, title loan, or personal loan',
 14: 'Prepaid card',
 15: 'Student loan',
 16: 'Vehicle loan or lease',
 17: 'Virtual currency'}

In [19]:
type(id_category_dict)

dict
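The two lookup dictionaries above rely on `dict(frame.values)` pairing up the two columns of each row. A slightly more explicit variant, sketched here on a toy frame with invented rows, uses `zip` over the named columns so the key and value roles are visible in the code.

```python
import pandas as pd

# Toy frame for illustration; ids are assigned by hand here.
toy = pd.DataFrame({
    "category": ["Mortgage", "Credit card", "Mortgage"],
    "category_id": [1, 0, 1],
})

# One row per distinct (category, id) pair, ordered by id.
pairs = toy[["category", "category_id"]].drop_duplicates().sort_values("category_id")

# Name -> id and id -> name lookups built from the named columns.
category_to_id = dict(zip(pairs["category"], pairs["category_id"]))
id_to_category = dict(zip(pairs["category_id"], pairs["category"]))

print(category_to_id)
print(id_to_category)
```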

### 5. Data overview

In [20]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8,6))
df.groupby('category').consumer_complaint.count().plot.bar(ylim=0)
plt.show()

<Figure size 800x600 with 1 Axes>
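The bar chart shows class sizes visually. If the numbers themselves are needed, `value_counts` gives the same information as a table. This is a minimal sketch on an invented series, not on the real `category` column.

```python
import pandas as pd

# Invented labels for illustration.
toy = pd.Series(["Mortgage", "Mortgage", "Credit card", "Mortgage"], name="category")

# Absolute count per class, largest first.
counts = toy.value_counts()

# Share of each class, useful for spotting imbalance.
shares = toy.value_counts(normalize=True)

print(counts)
print(shares)
```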

### text representation by vectors 

In [21]:
# configure the vectorizers (they are only set up here; fitting happens later)


from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


count_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                max_features=1000,
                                stop_words='english',
                               # max_df = 0.5,
                                min_df = 10)



tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, 
                        max_features=1000,           
                        min_df=10, 
                        norm='l2', 
                        encoding='latin-1', 
                        ngram_range=(1, 2), 
                        stop_words='english')

In [22]:
labels = df.category_id
labels.shape

(277814,)

In [23]:

features = tfidf_vectorizer.fit_transform(df.consumer_complaint)

features.shape

# Now, each of 277814 consumer complaint narratives is represented by 1000 features, 
# representing the tf-idf score for different unigrams and bigrams.

(277814, 1000)

In [24]:
type(features)

scipy.sparse.csr.csr_matrix
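To make the TF-IDF step more concrete, here is a sketch on three short invented sentences. It uses a subset of the parameters from the cell above and checks two properties: bigrams are formed after stop words are removed, and with `norm='l2'` every row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus for illustration.
docs = [
    "my credit report is wrong",
    "the credit card was charged twice",
    "wrong balance charged on my mortgage",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", norm="l2")
matrix = vectorizer.fit_transform(docs)

# Rows are documents, columns are unigrams and bigrams.
print(matrix.shape)
print(sorted(vectorizer.vocabulary_))

# Euclidean length of each row; expected to be 1 under L2 normalization.
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
print(row_norms)
```

The result is a sparse matrix because most terms do not occur in most documents, which is what keeps a 277814 x 1000 feature matrix manageable in memory.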

In [25]:
# We can use sklearn.feature_selection.chi2 to find the terms 
# that are the most correlated with each of the products:

from sklearn.feature_selection import chi2
import numpy as np

N = 2
for category, category_id in sorted(category_id_dict.items()):
    # chi2 scores each feature against a one-vs-rest indicator for this category
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0
    feature_names = np.array(tfidf_vectorizer.get_feature_names_out())[indices]

    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]

    print("# '{}':".format(category))
    print("  . Most correlated unigrams:\n       . {}".format('\n       . '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n       . {}".format('\n       . '.join(bigrams[-N:])))

# 'Bank account or service':
  . Most correlated unigrams:
       . bank
       . overdraft
  . Most correlated bigrams:
       . debit card
       . checking account
# 'Checking or savings account':
  . Most correlated unigrams:
       . overdraft
       . deposit
  . Most correlated bigrams:
       . debit card
       . checking account
# 'Consumer Loan':
  . Most correlated unigrams:
       . car
       . vehicle
  . Most correlated bigrams:
       . payment xxxx
       . xxxx payments
# 'Credit card':
  . Most correlated unigrams:
       . citi
       . card
  . Most correlated bigrams:
       . american express
       . credit card
# 'Credit card or prepaid card':
  . Most correlated unigrams:
       . cards
       . card
  . Most correlated bigrams:
       . american express
       . credit card
# 'Credit reporting':
  . Most correlated unigrams:
       . equifax
       . experian
  . Most correlated bigrams:
       . report xxxx
       . credit report
# 'Credit reporting, credit

### split the data 

In [26]:
# Split features and labels; train_test_split holds out 25% for testing by default
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 0)



In [27]:
X_train.shape

(208360, 1000)

In [28]:
y_train.shape

(208360,)

In [29]:
X_test.shape

(69454, 1000)

In [30]:
y_test.shape

(69454,)
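The shapes above reflect the default 75/25 split. Since the classes are of very different sizes, one optional variation is to pass `stratify` so both parts keep roughly the same class proportions. This is a sketch on synthetic arrays, and using `stratify` is an assumption here, not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 15 of class 0 and 5 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# Default test_size is 0.25; stratify=y preserves the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```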

### build count and tf-idf features, then train Naive Bayes

In [31]:
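from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# X_train already holds TF-IDF features, and CountVectorizer expects raw text.
# Splitting the raw complaints with the same random_state selects the same training rows.
text_train, text_test = train_test_split(df.consumer_complaint, random_state=0)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(text_train)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)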


In [32]:
from sklearn.naive_bayes import MultinomialNB


clf = MultinomialNB().fit(X_train, y_train)

In [33]:
# clf was trained on features produced by tfidf_vectorizer, so new text must go through
# that same fitted vectorizer. The original call used count_vectorizer, which was never
# fitted, and failed with "NotFittedError: CountVectorizer - Vocabulary wasn't fitted."
clf.predict(tfidf_vectorizer.transform([" Histograms of oriented gradients for human detection,We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds  "]))
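One way to keep the vectorizer and the classifier from getting out of sync at prediction time is to bundle them in a scikit-learn pipeline, so that `predict` accepts raw text directly. The sketch below uses four invented training sentences and default vectorizer settings, so it illustrates the pattern rather than reproducing the model trained in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training examples, two per class.
texts = [
    "my mortgage payment was applied late",
    "mortgage escrow account balance is wrong",
    "a debt collector keeps calling me",
    "debt collector called my employer",
]
labels = ["Mortgage", "Mortgage", "Debt collection", "Debt collection"]

# The pipeline fits the vectorizer and the classifier together.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Raw text goes in; the fitted vectorizer is applied automatically.
prediction = model.predict(["debt collector keeps calling"])
print(prediction[0])
```

With this layout there is only one fitted object to keep track of, which also makes it simpler to save and reload the model later.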