
Consumer complaint classification means assigning each complaint to a category based on the nature of the problem the consumer reported.

I found a dataset well suited to this task. It contains:

The nature of the complaint reported by the consumer
The issue mentioned by the consumer
The complete description of the complaint written by the consumer

In [2]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
import nltk
import re
from nltk.corpus import stopwords
import string


In [6]:
#Read the CSV file
data = pd.read_csv("/content/consumercomplaints.csv", encoding='utf8')
print(data.head())

   Unnamed: 0 Date received  \
0           0    2022-11-11   
1           1    2022-11-23   
2           2    2022-11-16   
3           3    2022-11-15   
4           4    2022-11-07   

                                             Product  \
0                                           Mortgage   
1  Credit reporting, credit repair services, or o...   
2                                           Mortgage   
3                        Checking or savings account   
4                                           Mortgage   

                  Sub-product                           Issue  \
0  Conventional home mortgage  Trouble during payment process   
1            Credit reporting     Improper use of your report   
2                 VA mortgage  Trouble during payment process   
3            Checking account             Managing an account   
4      Other type of mortgage  Trouble during payment process   

                                       Sub-issue  \
0                                

In [7]:
# The dataset contains an "Unnamed: 0" column that only repeats the row index, so drop it:
data = data.drop("Unnamed: 0",axis=1)

In [8]:
# Count the NULL values in each column
print(data.isnull().sum())

Date received                       0
Product                             0
Sub-product                         0
Issue                               1
Sub-issue                         953
Consumer complaint narrative    19172
dtype: int64


In [9]:
# Drop every row that has a missing value in any column
data = data.dropna()

In [10]:
# Count the Product values
print(data["Product"].value_counts())

Credit reporting, credit repair services, or other personal consumer reports    844
Debt collection                                                                 113
Checking or savings account                                                     108
Credit card or prepaid card                                                      96
Vehicle loan or lease                                                            29
Student loan                                                                      5
Payday loan, title loan, or personal loan                                         2
Name: Product, dtype: int64
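
The counts above are heavily skewed toward one category, which matters when judging any accuracy figure later. As a rough illustration, the sketch below copies the printed counts into a plain dictionary (the name `product_counts` is introduced here for illustration only) and computes the share of the largest class, which is the accuracy a model would reach by always predicting that class.

```python
# Class counts copied by hand from the value_counts() output above.
product_counts = {
    "Credit reporting, credit repair services, or other personal consumer reports": 844,
    "Debt collection": 113,
    "Checking or savings account": 108,
    "Credit card or prepaid card": 96,
    "Vehicle loan or lease": 29,
    "Student loan": 5,
    "Payday loan, title loan, or personal loan": 2,
}

total = sum(product_counts.values())

# The majority class and the accuracy of always predicting it.
majority_label, majority_count = max(product_counts.items(), key=lambda kv: kv[1])
baseline_accuracy = majority_count / total

print(f"{total} complaints, majority share = {baseline_accuracy:.3f}")
# prints: 1197 complaints, majority share = 0.705
```

Any classifier trained on this data should be compared against that baseline rather than against zero.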


In [11]:
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword=set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [12]:
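def clean(text):
    """Lowercase, strip noise, drop stopwords, and stem a complaint narrative."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text inside square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', ' ', text)                    # newlines (a space keeps words apart)
    text = re.sub(r'\w*\d\w*', '', text)               # tokens that contain digits
    # split() without an argument also discards the empty strings left by the removals
    words = [word for word in text.split() if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

# Apply the cleaning to every narrative
data["Consumer complaint narrative"] = data["Consumer complaint narrative"].apply(clean)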

Now, let’s convert the cleaned text into word-count features and split the data into training and test sets:

In [13]:
data = data[["Consumer complaint narrative", "Product"]]
x = np.array(data["Consumer complaint narrative"])
y = np.array(data["Product"])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)
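
To make the vectorize-then-split step more concrete, here is a minimal sketch on a made-up corpus. The texts, labels, and every `toy_` name are assumptions for illustration and do not come from the complaints dataset. Only the `CountVectorizer` defaults and the `test_size=0.33`, `random_state=42` arguments mirror the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the cleaned complaint narratives.
toy_texts = [
    "credit report error",
    "mortgag payment late",
    "credit card fee",
    "debt collector call",
    "bank account close",
    "credit report disput",
]
toy_labels = ["credit", "mortgage", "card", "debt", "account", "credit"]

# Each row is one text, each column counts one vocabulary word.
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_texts)

# Same split arguments as the notebook cell above.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_labels, test_size=0.33, random_state=42
)

print(toy_X.shape)                 # (number of texts, vocabulary size)
print(sorted(toy_cv.vocabulary_))  # the learned vocabulary
print(toy_X_train.shape[0], toy_X_test.shape[0])
```

Note that the vectorizer here, as in the notebook, is fitted on all rows before splitting, so the vocabulary also includes words that appear only in the test rows.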

Now, let’s train the Machine Learning model using scikit-learn's SGDClassifier, a linear classifier fitted with stochastic gradient descent:

In [14]:
sgdmodel = SGDClassifier()
sgdmodel.fit(X_train,y_train)
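
The notebook holds out a test set but does not show a scoring step. The sketch below illustrates one way such a step could look, using `accuracy_score` and `classification_report` from `sklearn.metrics`. It runs on a small invented two-class corpus (all `toy_` names and texts are assumptions) so that it is self-contained. On the real data the same two calls would take `y_test` and `sgdmodel.predict(X_test)`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: two categories, repeated to get enough rows.
toy_credit = ["credit report error", "credit score wrong", "report inaccur credit",
              "credit bureau disput", "credit file mistak"] * 4
toy_debt = ["debt collector call", "collect agenc harass", "debt not owe",
            "collector threaten lawsuit", "debt letter collect"] * 4
toy_texts = toy_credit + toy_debt
toy_labels = ["credit"] * len(toy_credit) + ["debt"] * len(toy_debt)

toy_X = CountVectorizer().fit_transform(toy_texts)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_labels, test_size=0.33, random_state=42
)

# random_state is fixed here only to make the sketch repeatable.
toy_model = SGDClassifier(random_state=42)
toy_model.fit(toy_X_train, toy_y_train)

# Score the model on rows it did not see during training.
toy_pred = toy_model.predict(toy_X_test)
toy_accuracy = accuracy_score(toy_y_test, toy_pred)
print("accuracy:", toy_accuracy)
print(classification_report(toy_y_test, toy_pred, zero_division=0))
```

With classes as imbalanced as the Product column, the per-class precision and recall in the report are usually more informative than the single accuracy number.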

Now, let’s use our trained model to make predictions:

In [15]:
user = input("Enter a Text: ")
# Clean the input the same way as the training text, and keep `data` intact
features = cv.transform([clean(user)])
output = sgdmodel.predict(features)
print(output)

Enter a Text: in
['Credit reporting, credit repair services, or other personal consumer reports']


In [16]:
user = input("Enter a Text: ")
# Clean the input the same way as the training text, and keep `data` intact
features = cv.transform([clean(user)])
output = sgdmodel.predict(features)
print(output)

Enter a Text: the
['Credit reporting, credit repair services, or other personal consumer reports']
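
Both test inputs above, "in" and "the", are English stopwords, so they are most likely missing from the vocabulary learned from the cleaned narratives. In that case the vectorizer returns an all-zero row, and a linear model's scores reduce to its intercepts, which would explain why both inputs get the same label. The sketch below demonstrates the effect on an invented corpus. All `toy_` names and texts are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy training data with three categories.
toy_texts = ["credit report error", "debt collector call", "bank account fee",
             "credit score wrong", "debt collect letter", "account overdraft fee"]
toy_labels = ["credit", "debt", "account", "credit", "debt", "account"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_texts)
toy_model = SGDClassifier(random_state=0).fit(toy_X, toy_labels)

# "the" never occurs in the training texts, so no column is set.
unseen_row = toy_cv.transform(["the"])
print("non-zero features:", unseen_row.nnz)

# With an all-zero input, each class score is just that class's intercept.
scores = toy_model.decision_function(unseen_row)
print("scores equal intercepts:", np.allclose(scores, toy_model.intercept_))
print("predicted label:", toy_model.predict(unseen_row)[0])
```

A more telling test of the trained model is a full sentence describing an actual complaint, since only words that appear in the vocabulary contribute to the prediction.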
