## Problem Statement
Consumer care departments receive a growing volume of complaints every day, which creates a need for an efficient, automated system that classifies the nature of each complaint so it can be prioritized and resolved in a timely manner, preventing consumer loss. The aim is to identify the category of each complaint so that the most important issues are addressed first. The Product column contains the name of the product with which the consumer had an issue, so Product is the target variable in this classification problem.

## Data
Consumer complaint classification is a Natural Language Processing problem framed as Multiclass Classification. To solve it, we need a dataset containing complaints reported by consumers.

I found an ideal dataset for this task that contains data about:

1. The nature of the complaint reported by the consumer

2. The Issue mentioned by the consumer

3. The complete description of the complaint of the consumer

We can use this data to build a Machine Learning model that classifies the nature of complaints reported by consumers in real time. You can download the dataset here: https://statso.io/consumer-complaint-identification-case-study/

## Data Features
1. Date received: The date on which the complaint was received
2. Product: The type of product the consumer complained about
3. Sub-product: The type of sub-product the consumer complained about
4. Issue: The issue reported by the consumer
5. Sub-issue: The sub-issue reported by the consumer
6. Consumer complaint narrative: Complete description of the complaint reported by the consumer

## Importing the necessary Python libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
import nltk
import re
from nltk.corpus import stopwords
import string

## Loading the data

In [3]:
data = pd.read_csv("/content/drive/MyDrive/filename/consumercomplaints.csv")
print(data.head())

   Unnamed: 0 Date received  \
0           0    2022-11-11   
1           1    2022-11-23   
2           2    2022-11-16   
3           3    2022-11-15   
4           4    2022-11-07   

                                             Product  \
0                                           Mortgage   
1  Credit reporting, credit repair services, or o...   
2                                           Mortgage   
3                        Checking or savings account   
4                                           Mortgage   

                  Sub-product                           Issue  \
0  Conventional home mortgage  Trouble during payment process   
1            Credit reporting     Improper use of your report   
2                 VA mortgage  Trouble during payment process   
3            Checking account             Managing an account   
4      Other type of mortgage  Trouble during payment process   

                                       Sub-issue  \
0                                

The dataset contains an "Unnamed: 0" column, which appears to duplicate the row index. I’ll remove that column and move on:

In [4]:
data = data.drop("Unnamed: 0",axis=1)

Now let’s check whether the dataset contains null values:

In [5]:
print(data.isnull().sum())

Date received                         0
Product                               0
Sub-product                      235294
Issue                                 0
Sub-issue                        683355
Consumer complaint narrative    1987977
dtype: int64


The dataset contains many null values; the Consumer complaint narrative column alone has 1,987,977 of them. I’ll drop every row that contains a null value and move on:

In [6]:
data = data.dropna()
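
Called without arguments, `dropna()` removes a row if any column is null, including columns such as Sub-issue that the model never uses. The sketch below, on a made-up toy frame (the values are assumptions, not rows from the dataset), contrasts that with restricting the check to the two modelling columns through the `subset` parameter. It is an alternative worth considering, not what the notebook does.

```python
import pandas as pd

# Toy frame mimicking three of the dataset's columns (values are made up).
toy = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Student loan"],
    "Sub-issue": [None, "Debt is not yours", None],
    "Consumer complaint narrative": [
        "Payment was misapplied.",
        None,
        "Servicer lost my forms.",
    ],
})

# dropna() with no arguments drops a row if ANY column is null.
all_columns = toy.dropna()

# Checking only the columns used for modelling keeps more rows.
model_columns = toy.dropna(subset=["Consumer complaint narrative", "Product"])

print(len(all_columns), len(model_columns))
```

On the toy frame every row has a null somewhere, so the first call keeps nothing, while the second keeps the two rows that have both a narrative and a label.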

The Product column in the dataset contains the labels, which represent the nature of the complaints reported by the consumers. Let’s have a look at all the labels and their frequency:

In [7]:
print(data["Product"].value_counts())

Credit reporting, credit repair services, or other personal consumer reports    507582
Debt collection                                                                 192045
Credit card or prepaid card                                                      80410
Checking or savings account                                                      54192
Student loan                                                                     32697
Vehicle loan or lease                                                            19874
Payday loan, title loan, or personal loan                                         1008
Name: Product, dtype: int64
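
The counts above are heavily skewed toward one label. As a rough sanity check, the following sketch turns the printed counts into class shares and computes the accuracy of a trivial model that always predicts the most frequent label. The numbers are copied from the output above; the variable names are illustrative only.

```python
# Label counts copied from the value_counts() output above.
counts = {
    "Credit reporting, credit repair services, or other personal consumer reports": 507582,
    "Debt collection": 192045,
    "Credit card or prepaid card": 80410,
    "Checking or savings account": 54192,
    "Student loan": 32697,
    "Vehicle loan or lease": 19874,
    "Payday loan, title loan, or personal loan": 1008,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a model that always predicts the most frequent label.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{share:6.1%}  {label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

The majority class makes up roughly 57% of the rows, so any accuracy figure for a trained model is best read against that baseline rather than against zero.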


## Training Consumer Complaint Classification Model
The Consumer complaint narrative column contains the complete description of the complaints reported by the consumers. This column needs to be cleaned and prepared before it is used in a Machine Learning model. The cell below defines a clean function for that purpose:

In [8]:
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
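
To make the effect of these cleaning steps concrete, here is a minimal stand-alone sketch that applies the same regular expressions to one made-up sentence. It is a simplification under stated assumptions: `clean_sketch`, the tiny `STOPWORDS` set, and the sample text are hypothetical, stemming is left out, and the notebook itself uses NLTK's full English stopword list.

```python
import re
import string

# Tiny stand-in stopword set (an assumption; the notebook uses NLTK's list).
STOPWORDS = {"i", "the", "a", "on", "my", "was", "at", "to", "and"}

def clean_sketch(text):
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"<.*?>+", "", text)                 # HTML-like tags
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\n", "", text)
    text = re.sub(r"\w*\d\w*", "", text)               # tokens containing digits
    # split() with no argument also discards empty tokens left by the removals
    return " ".join(w for w in text.split() if w not in STOPWORDS)

sample = "On 02/05/2022, I called the bank at www.example.com about my card!"
print(clean_sketch(sample))  # prints: called bank about card
```

In the notebook, the equivalent step would be calling `clean` on each narrative (for example with pandas `apply`) before vectorizing, and passing any text to be classified through the same function.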


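Now, let’s convert the narratives into word-count vectors and split the data into training and test sets. Note that the cells shown here never call clean, so the vectorizer receives the raw narratives: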

In [9]:
data = data[["Consumer complaint narrative", "Product"]]
x = np.array(data["Consumer complaint narrative"])
y = np.array(data["Product"])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)
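
The cell above fits the vectorizer on all narratives before splitting, so the vocabulary is learned partly from test rows. A common alternative, sketched below on made-up toy data, is to split the raw text first, fit `CountVectorizer` on the training part only, and pass `stratify` so both parts keep similar label proportions. The toy texts, labels, and the 0.25 test size are assumptions for illustration, not the notebook's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy narratives and labels (made up for illustration).
texts = [
    "collector keeps calling about a debt",
    "debt collector threatened legal action",
    "collection agency reported a paid debt",
    "calls from a collector at work",
    "student loan servicer misapplied payment",
    "loan forgiveness forms lost by servicer",
    "student loan interest calculated wrong",
    "servicer refused income based repayment",
]
labels = ["Debt collection"] * 4 + ["Student loan"] * 4

# Split the raw text first; stratify keeps label proportions similar.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

cv = CountVectorizer()
X_train = cv.fit_transform(text_train)  # vocabulary learned from training text only
X_test = cv.transform(text_test)        # test text mapped onto that vocabulary

print(X_train.shape, X_test.shape)
```

Words that occur only in the test narratives are simply ignored by `transform`, which mirrors what happens when the model later sees new complaints.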

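Now, let’s train the Machine Learning model using scikit-learn’s SGDClassifier, a linear classifier trained with Stochastic Gradient Descent (with its default settings it fits a linear support vector machine):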

In [10]:
sgdmodel = SGDClassifier()
sgdmodel.fit(X_train,y_train)

SGDClassifier()
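
The notebook creates X_test and y_test but does not score the model on them in the cells shown. A minimal sketch of how that evaluation could look is given below, using scikit-learn's `accuracy_score` and `classification_report` on made-up toy data. The texts, labels, and `random_state` are assumptions, and the numbers it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report

# Toy training and test data (made up for illustration).
train_texts = [
    "collector keeps calling about a debt",
    "debt collector threatened legal action",
    "collection agency reported a paid debt",
    "student loan servicer misapplied payment",
    "loan forgiveness forms lost by servicer",
    "student loan interest calculated wrong",
]
train_labels = ["Debt collection"] * 3 + ["Student loan"] * 3
test_texts = ["a collector called about an old debt", "my student loan servicer"]
test_labels = ["Debt collection", "Student loan"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)
X_test = cv.transform(test_texts)

# random_state makes the run repeatable; the notebook does not set it.
model = SGDClassifier(random_state=42)
model.fit(X_train, train_labels)

predictions = model.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
print("Accuracy:", accuracy)
# Per-class precision and recall matter here because the labels are imbalanced.
print(classification_report(test_labels, predictions, zero_division=0))
```

In the notebook itself, the same two metric calls would take `y_test` and `sgdmodel.predict(X_test)` as arguments.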

Now, let’s use our trained model to make predictions:

In [11]:
user = input("Enter a Text: ")
features = cv.transform([user])  # separate name so the data DataFrame is not overwritten
output = sgdmodel.predict(features)
print(output)

Enter a Text: On 02/05/2022, I called Citi Bank Customer Service at 555-555-5555. I did not want to pay {$99.00} for the next year membership and wanted to cancel my card account. A customer service representative told me if I pay the {$99.00} membership fee and spending {$1000.00} in 3 months, I can get 10,000 mileage reward points of ABC Airlines. I believed what he said and paid {$99.00} membership fee on 02/07/2022. I spent more than {$1000.00} in 3 months since 02/07/2022. On 05/10/2022, I called the card Customer Service about my reward mileage points. I was told the reward mileage points are NOT 10,000. I can only get 7,500 mileage points instead. I believe that the Citi Bank Customer Service cheated me. This is business fraud!
['Credit card or prepaid card']


In [12]:
user = input("Enter a Text: ")
features = cv.transform([user])  # separate name so the data DataFrame is not overwritten
output = sgdmodel.predict(features)
print(output)

Enter a Text: Investigation took more than 30 days and nothing was changed when clearly there are misleading, incorrect, inaccurate items on my credit report..i have those two accounts attached showing those inaccuracies... I need them to follow the law because this is a violation of my rights!! The EVIDENCE IS IN BLACK AND WHITE ....
['Credit reporting, credit repair services, or other personal consumer reports']


So this is how we can use Machine Learning for the task of Classifying Consumer Complaints.

## Conclusion
In conclusion, the Consumer Complaint Classification Project aimed to categorize consumer complaints by product category, such as debt collection, credit cards, or student loans. A bag-of-words representation combined with a linear classifier trained by stochastic gradient descent assigned a plausible category to both sample complaints above, although accuracy on the held-out test set was not measured here. A classifier of this kind gives companies a more efficient way to route, prioritize, and resolve complaints, which benefits both consumers and businesses.