<a href="https://colab.research.google.com/github/Jayavathsan/MachineLearning/blob/main/12_Consumer_Complaint_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Classify consumer complaints by training a model on a dataset from the CFPB (Consumer Financial Protection Bureau).

The dataset is available here: https://statso.io/consumer-complaint-identification-case-study/

In [1]:
!wget https://statso.io/wp-content/uploads/2022/11/complaints.zip

--2024-02-13 15:34:04--  https://statso.io/wp-content/uploads/2022/11/complaints.zip
Resolving statso.io (statso.io)... 192.0.78.171, 192.0.78.225
Connecting to statso.io (statso.io)|192.0.78.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433524260 (413M) [application/zip]
Saving to: ‘complaints.zip’


2024-02-13 15:34:12 (50.2 MB/s) - ‘complaints.zip’ saved [433524260/433524260]



In [2]:
!unzip '/content/complaints.zip' -d '/content/complaints'

Archive:  /content/complaints.zip
   creating: /content/complaints/complaints/
  inflating: /content/complaints/complaints/consumercomplaints.csv  
  inflating: /content/complaints/__MACOSX/complaints/._consumercomplaints.csv  


Import necessary libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
import nltk
import re
from nltk.corpus import stopwords
import string

Load the data

In [5]:
data = pd.read_csv("/content/complaints/complaints/consumercomplaints.csv")
print(data.head())

   Unnamed: 0 Date received  \
0           0    2022-11-11   
1           1    2022-11-23   
2           2    2022-11-16   
3           3    2022-11-15   
4           4    2022-11-07   

                                             Product  \
0                                           Mortgage   
1  Credit reporting, credit repair services, or o...   
2                                           Mortgage   
3                        Checking or savings account   
4                                           Mortgage   

                  Sub-product                           Issue  \
0  Conventional home mortgage  Trouble during payment process   
1            Credit reporting     Improper use of your report   
2                 VA mortgage  Trouble during payment process   
3            Checking account             Managing an account   
4      Other type of mortgage  Trouble during payment process   

                                       Sub-issue  \
0                                

In [6]:
# check for null values
print(data.isnull().sum())

Unnamed: 0                            0
Date received                         0
Product                               0
Sub-product                      235294
Issue                                 0
Sub-issue                        683355
Consumer complaint narrative    1987977
dtype: int64


In [7]:
# drop every row that has a null value in any column
# (this includes rows missing only Sub-product or Sub-issue)
data = data.dropna()
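
Called with no arguments, `dropna()` removes a row if any column is null, including `Sub-product` and `Sub-issue`, which the model never uses. A minimal sketch of a narrower alternative, assuming only the narrative and the label are required, passes `subset` to `dropna`. The toy frame below is hypothetical and only stands in for the real data; applying this to the real data would change the label counts reported in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the complaints data.
toy = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Student loan"],
    "Sub-issue": [None, "Debt is not yours", None],
    "Consumer complaint narrative": ["Payment was misapplied", "Not my debt", None],
})

# Drop a row if ANY column is null (what the cell above does).
strict = toy.dropna()

# Drop a row only if the narrative or the label is null.
narrow = toy.dropna(subset=["Consumer complaint narrative", "Product"])

print(len(strict), len(narrow))
```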

The Product column contains the labels. Explore the labels and their frequencies.

In [8]:
print(data["Product"].value_counts())

Credit reporting, credit repair services, or other personal consumer reports    507582
Debt collection                                                                 192045
Credit card or prepaid card                                                      80410
Checking or savings account                                                      54192
Student loan                                                                     32697
Vehicle loan or lease                                                            19874
Payday loan, title loan, or personal loan                                         1008
Name: Product, dtype: int64


In [9]:
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    """Normalize a complaint narrative: strip noise, drop stop words, stem."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                 # bracketed spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>+', '', text)                  # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                      # newlines
    text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
    # remove stop words, then stem the remaining words
    words = [word for word in text.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

data["Consumer complaint narrative"] = data["Consumer complaint narrative"].apply(clean)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
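
To see what the regular-expression steps in `clean` do on their own, the sketch below applies them to a made-up sentence. `strip_noise` and the sample text are hypothetical names used only for illustration; stop-word removal and stemming are left out so the snippet runs without NLTK data.

```python
import re
import string

def strip_noise(text):
    """Regex-only part of the cleaning, without stop words or stemming."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # bracketed spans
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    return text

# Made-up sample; not taken from the dataset.
sample = "I paid $500 on 01/02/2022 [see attached] via https://example.com <b>TWICE</b>!"
tokens = strip_noise(sample).split()
print(tokens)
```

Removed tokens leave runs of spaces behind, so the sketch tokenizes with `split()`, which collapses whitespace, rather than `split(' ')`, which yields empty strings.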


Vectorize the text and split the data into training and test sets

In [10]:
data = data[["Consumer complaint narrative", "Product"]]
x = np.array(data["Consumer complaint narrative"])  # cleaned text
y = np.array(data["Product"])                       # labels

# convert the text to a sparse bag-of-words count matrix
cv = CountVectorizer()
X = cv.fit_transform(x)
# hold out 33% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.33,
                                                    random_state=42)
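
The cell above fits the `CountVectorizer` vocabulary on all rows before splitting, so words that occur only in test documents still enter the vocabulary, a mild form of leakage. One possible alternative, sketched here on a hypothetical toy corpus, splits the raw text first and lets a scikit-learn pipeline fit the vectorizer on the training portion only. The `texts`, `labels`, and `pipe` names are illustrative, and `stratify` and the fixed `random_state` on the classifier are additions not used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned narratives and labels.
texts = [
    "credit report shows wrong account",
    "report has incorrect late payment",
    "collector keeps calling about debt",
    "debt collector threatened me",
    "credit report error not fixed",
    "collection agency calls daily about debt",
]
labels = ["credit", "credit", "debt", "debt", "credit", "debt"]

# Split the raw text first so the test documents stay unseen.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only.
pipe = make_pipeline(CountVectorizer(), SGDClassifier(random_state=42))
pipe.fit(text_train, y_train)

print(pipe.predict(text_test))
```

A fitted pipeline also accepts raw strings at prediction time, so the vectorizer does not need to be applied by hand.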

Train the model using a Stochastic Gradient Descent (SGD) classifier

In [11]:
sgdmodel = SGDClassifier()
sgdmodel.fit(x_train, y_train)
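
The notebook holds out a test set but does not score the model on it. A minimal sketch of that step, using scikit-learn's `accuracy_score` and `classification_report`, is shown below on hypothetical toy data. With the notebook's own variables, the equivalent call would compare `sgdmodel.predict(x_test)` against `y_test`. Because the labels are heavily imbalanced (507582 rows in the largest class versus 1008 in the smallest), the per-class report is likely more informative than overall accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical toy data; the real notebook would use x_train, x_test, y_train, y_test.
train_texts = [
    "credit report shows wrong account",
    "report has incorrect late payment",
    "collector keeps calling about debt",
    "debt collector threatened me",
]
train_labels = ["credit", "credit", "debt", "debt"]
test_texts = ["credit report error not fixed", "collector calls daily about debt"]
test_labels = ["credit", "debt"]

# Fit the vocabulary on the training text, then reuse it for the test text.
cv = CountVectorizer()
x_train = cv.fit_transform(train_texts)
x_test = cv.transform(test_texts)

model = SGDClassifier(random_state=42)
model.fit(x_train, train_labels)

# Score predictions on the held-out documents.
pred = model.predict(x_test)
acc = accuracy_score(test_labels, pred)
print("accuracy:", acc)
print(classification_report(test_labels, pred, zero_division=0))
```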

Make predictions

In [12]:
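user = input("Enter a text: ")
# apply the same cleaning used for training, then vectorize;
# a new name avoids overwriting the `data` DataFrame
features = cv.transform([clean(user)])
output = sgdmodel.predict(features)
print(output)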

Enter a text: Investigation took more than 30 days and nothing was changed when clearly there are misleading, incorrect, inaccurate items on my credit report..i have those two accounts attached showing those inaccuracies... I need them to follow the law because this is a violation of my rights!! The EVIDENCE IS IN BLACK AND WHITE ....
['Credit reporting, credit repair services, or other personal consumer reports']
