### Problem Statement

**Overview:**  

In this project, you will work with customer complaints submitted to the Consumer Financial Protection Bureau (CFPB) regarding various financial products. Your task is to build an NLP classification model that can automatically categorize complaints based on the provided textual narratives. The classification will help financial institutions resolve complaints more efficiently by routing them to the appropriate teams for quicker handling.

The dataset contains customer complaints categorized into five product classes: credit reporting, debt collection, mortgages and loans, credit cards, and retail banking. These narratives are often raw and noisy, requiring significant preprocessing to develop a model capable of accurate classification.


### DAY 1

Today's goals:
- Get a basic understanding of the dataset
- Do a bit of exploratory data analysis
- Clean the data
- Preprocess the data (maybe, if I get time)

### Let's get familiar with our dataset

### Importing the necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Importing the dataset

In [2]:
dataframe = pd.read_csv('./Datasets/complaints.csv')

In [3]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162421 entries, 0 to 162420
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  162421 non-null  int64 
 1   product     162421 non-null  object
 2   narrative   162411 non-null  object
dtypes: int64(1), object(2)
memory usage: 3.7+ MB


In [4]:
print(dataframe['product'].unique())

['credit_card' 'retail_banking' 'credit_reporting' 'mortgages_and_loans'
 'debt_collection']


The dataset has three columns:

- `Unnamed: 0`: an unnamed column holding a unique number for each complaint. We got our ID!

-  Product: This is our target variable, which is categorized into five classes:
    - Credit cards
    - Retail banking
    - Credit reporting
    - Mortgages and loans
    - Debt collection

- Narrative: This is the text of each customer's complaint. Obviously, it is the most important feature for our analysis.

In [5]:
dataframe.head()

   Unnamed: 0           product                                          narrative
0           0       credit_card  purchase order day shipping amount receive pro...
1           1       credit_card  forwarded message date tue subject please inve...
2           2    retail_banking  forwarded message cc sent friday pdt subject f...
3           3  credit_reporting  payment history missing credit report speciali...
4           4  credit_reporting  payment history missing credit report made mis...


The columns are not named clearly, so we will rename them.

In [6]:
""" 
# rename the Unnamed: 0 column to id
# rename the product column to class
# rename the narrative column to complaint
"""

dataframe.rename(columns={'Unnamed: 0': 'id', 'product': 'class', 'narrative': 'complaint'}, inplace=True)

In [7]:
dataframe.head()

   id             class                                          complaint
0   0       credit_card  purchase order day shipping amount receive pro...
1   1       credit_card  forwarded message date tue subject please inve...
2   2    retail_banking  forwarded message cc sent friday pdt subject f...
3   3  credit_reporting  payment history missing credit report speciali...
4   4  credit_reporting  payment history missing credit report made mis...


In [8]:
# set the id column as the index
dataframe.set_index('id', inplace=True)
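
The rename and `set_index` steps above can also be folded into the load itself. The following is a minimal sketch, assuming the first CSV column is the one pandas reported as `Unnamed: 0`. The two rows are made-up stand-ins for `complaints.csv`, not real data.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for complaints.csv (made-up rows, same column layout).
csv_text = """,product,narrative
0,credit_card,purchase order day shipping
1,retail_banking,forwarded message cc sent
"""

df = (
    pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index
      .rename(columns={"product": "class", "narrative": "complaint"})
      .rename_axis("id")                               # give the index the name 'id'
)
print(df)
```

This avoids the intermediate `Unnamed: 0` column entirely, at the cost of hiding the raw layout from a first-time reader.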

In [9]:
dataframe.head()

               class                                          complaint
id
0        credit_card  purchase order day shipping amount receive pro...
1        credit_card  forwarded message date tue subject please inve...
2     retail_banking  forwarded message cc sent friday pdt subject f...
3   credit_reporting  payment history missing credit report speciali...
4   credit_reporting  payment history missing credit report made mis...


In [10]:
print(dataframe['class'].unique())

['credit_card' 'retail_banking' 'credit_reporting' 'mortgages_and_loans'
 'debt_collection']
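
`unique()` shows which classes exist but not how many complaints fall into each. A quick way to check class balance is `value_counts`, sketched here on a made-up label column (the counts below are illustrative, not the real distribution).

```python
import pandas as pd

# Made-up labels standing in for dataframe['class'].
labels = pd.Series(
    ["credit_reporting"] * 3 + ["credit_card"] + ["debt_collection"],
    name="class",
)

counts = labels.value_counts()                          # absolute counts per class
shares = labels.value_counts(normalize=True).round(2)   # fraction of rows per class
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

On the real data this would be `dataframe['class'].value_counts(normalize=True)`. A heavily skewed result would be a reason to look beyond plain accuracy later on.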


### Let's clean the data if required (obviously it will be required)

Checking for null values

In [11]:
dataframe.isna().sum()

class         0
complaint    10
dtype: int64

Great! Just 10 null values, all in the `complaint` column. We can safely drop them.

But let's take a look at them first.

In [12]:
print(dataframe[dataframe.isnull().any(axis=1)])

                   class complaint
id                                
1089    credit_reporting       NaN
3954    credit_reporting       NaN
3955    credit_reporting       NaN
29690   credit_reporting       NaN
139436   debt_collection       NaN
151052   debt_collection       NaN
154494  credit_reporting       NaN
156902    retail_banking       NaN
158538  credit_reporting       NaN
159503  credit_reporting       NaN


In [13]:
dataframe = dataframe.dropna()

In [14]:
dataframe.isna().sum()

class        0
complaint    0
dtype: int64

### Handling the duplicates in the dataset

Apart from the null values, we also have to handle any duplicates in the dataset.

In [15]:
dataframe.tail()

                   class  complaint
id
162416   debt_collection       name
162417       credit_card       name
162418   debt_collection       name
162419       credit_card       name
162420  credit_reporting       name


Wait a minute! Okay, let's get rid of the rows that have just 'name' as the complaint.

In [16]:
# get rid of the rows where the complaint is just 'name'
dataframe = dataframe[dataframe['complaint'] != 'name']

In [17]:
# check for duplicates
print(dataframe.duplicated().sum())

37732


In [18]:
dataframe = dataframe.drop_duplicates(keep='first')

In [19]:
dataframe.duplicated().sum()

np.int64(0)
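
One thing worth knowing about the check above: `duplicated()` compares whole rows, so it only flags rows where both `class` and `complaint` repeat. The same text filed under two different classes would survive. The sketch below, on made-up rows, shows one way to count such cases. Whether the real dataset contains any is not something this notebook has checked.

```python
import pandas as pd

# Made-up rows: one exact duplicate, plus the same text under a second class.
toy = pd.DataFrame({
    "class": ["credit_card", "credit_card", "debt_collection", "retail_banking"],
    "complaint": ["late fee charged", "late fee charged",
                  "late fee charged", "account closed"],
})

# Exact duplicates: same class AND same complaint (what drop_duplicates() removes).
exact_dupes = int(toy.duplicated().sum())

# Repeated text regardless of label.
text_dupes = int(toy.duplicated(subset="complaint").sum())

# Texts that appear under more than one class.
labels_per_text = toy.groupby("complaint")["class"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(exact_dupes, text_dupes, conflicting)
```

If conflicting texts do show up in the real data, they put a ceiling on what any classifier can achieve, since identical inputs carry different labels.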

In [20]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 124673 entries, 0 to 162414
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   class      124673 non-null  object
 1   complaint  124673 non-null  object
dtypes: object(2)
memory usage: 2.9+ MB


In [21]:
dataframe.tail()

                      class                                          complaint
id
162410     credit_reporting  zales comenity bank closed sold account report...
162411       retail_banking  zelle suspended account without cause banking ...
162412      debt_collection  zero contact made debt supposedly resolved fou...
162413  mortgages_and_loans  zillow home loan nmls nmls actual quote provid...
162414      debt_collection  zuntafi sent notice willing settle defaulted s...


### Let's clean the text using regex, remove stopwords, and maybe try stemming

In [22]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [23]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stop_words.remove('not')  # keep 'not' so negations survive stop-word removal
ps = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mehmood/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
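def clean_text(text):
    """Lowercase, strip markup/URLs/punctuation/digits, drop stop words, and stem."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
    # drop stop words, then stem what remains
    text = ' '.join(ps.stem(word) for word in text.split() if word not in stop_words)
    return text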
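
To see what each regex in `clean_text` does, here is a small step-by-step sketch on a made-up sentence. It uses only the standard library so it runs offline. `TOY_STOP_WORDS` is a hypothetical stand-in for the NLTK list, and stemming is left out on purpose.

```python
import re
import string

# Hypothetical stand-in for the NLTK stop-word list, kept tiny on purpose.
# 'not' is absent, mirroring stop_words.remove('not') in the notebook.
TOY_STOP_WORDS = {"was", "the", "a", "is"}

def clean_text_steps(text):
    """Apply the same regex steps as clean_text and record the text after each one."""
    steps = {}
    text = text.lower()
    steps["lowercase"] = text
    text = re.sub(r'\[.*?\]', '', text)                 # drop [bracketed] text
    steps["brackets"] = text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # drop URLs
    steps["urls"] = text
    text = re.sub(r'<.*?>+', '', text)                  # drop HTML tags
    steps["tags"] = text
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    steps["punctuation"] = text
    text = re.sub(r'\w*\d\w*', '', text)                # drop tokens containing digits
    steps["digits"] = text
    # split() also collapses the leftover runs of spaces
    text = ' '.join(w for w in text.split() if w not in TOY_STOP_WORDS)
    steps["stop_words"] = text
    return steps

sample = "Visit https://example.com NOW! [ref 12] <b>Acct 4567x</b> was not closed."
steps = clean_text_steps(sample)
for name, value in steps.items():
    print(f"{name:12s} {value!r}")
```

The order matters: URLs have to go before punctuation is stripped, otherwise `https://example.com` would collapse into `httpsexamplecom` and slip through as an ordinary token.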

In [25]:
dataframe['cleaned_complaint'] = dataframe['complaint'].apply(clean_text)

In [26]:
dataframe.drop('complaint', axis=1, inplace=True)

In [27]:
dataframe.head()

               class                                  cleaned_complaint
id
0        credit_card  purchas order day ship amount receiv product w...
1        credit_card  forward messag date tue subject pleas investig...
2     retail_banking  forward messag cc sent friday pdt subject fina...
3   credit_reporting  payment histori miss credit report special loa...
4   credit_reporting  payment histori miss credit report made mistak...


In [28]:
# encode the class column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dataframe['class'] = le.fit_transform(dataframe['class'])
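
`LabelEncoder` assigns integers in sorted order of the class names, so it helps to print the mapping once and keep it around for reading predictions later. A minimal sketch using the five product names from this dataset:

```python
from sklearn.preprocessing import LabelEncoder

products = ["credit_card", "retail_banking", "credit_reporting",
            "mortgages_and_loans", "debt_collection"]

le = LabelEncoder()
le.fit(products)

# classes_ is sorted alphabetically; position in classes_ is the integer code.
mapping = {str(label): int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)

# Turn integer predictions back into readable names.
decoded = list(le.inverse_transform([4, 1]))
print(decoded)
```

The codes carry no ordering information; they are just identifiers for the classifier's target.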

In [29]:
dataframe.head()

    class                                  cleaned_complaint
id
0       0  purchas order day ship amount receiv product w...
1       0  forward messag date tue subject pleas investig...
2       4  forward messag cc sent friday pdt subject fina...
3       1  payment histori miss credit report special loa...
4       1  payment histori miss credit report made mistak...


### Now I am going to speedrun the prediction for DAY 1

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(dataframe['cleaned_complaint'])
y = dataframe['class']

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
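
A caveat on the two cells above: the vectorizer is fitted on the full dataset before the split, so the vocabulary and IDF weights have already seen the test complaints. One common way to avoid this is to split the raw text first and wrap the vectorizer and model in a `Pipeline`. This is a sketch on made-up sentences, not a rerun of the notebook. `stratify` is an extra assumption here to keep class proportions equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up complaints, two classes, four examples each.
texts = [
    "late fee charged on my credit card",
    "credit card interest rate increased",
    "unauthorized charge on credit card statement",
    "credit card annual fee dispute",
    "collector keeps calling about old debt",
    "debt collection agency threatened lawsuit",
    "collector reported debt that is not mine",
    "debt collector refused to validate debt",
]
labels = ["credit_card"] * 4 + ["debt_collection"] * 4

# Split the raw text first, so nothing from the test set reaches the vectorizer.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary from X_train only; predict() reuses it.
pipe = make_pipeline(TfidfVectorizer(max_features=5000),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(list(zip(y_test, y_pred)))
```

The effect on the accuracy figures in this notebook is unknown until it is actually rerun this way.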

In [32]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))

0.812271906958091


In [33]:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.848526168036896


In [35]:
# logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.849127732103469


In [36]:
# knn
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.5687988770804091


In [38]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.8542209745337879


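Great! XGBoost reached about 85% accuracy without any EDA, with logistic regression and random forest close behind (both about 0.849). KNN trailed far behind at about 0.57.

We will look into it tomorrow and perhaps improve it further.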
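
Something to carry into tomorrow: accuracy alone can hide weak classes when the labels are unevenly distributed. A per-class report and a confusion matrix give a fuller picture. The labels below are made up to illustrate the point and are not taken from the models above.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

# Made-up labels: class 0 dominates, classes 1 and 2 are small.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes
cm = confusion_matrix(y_true, y_pred)                 # rows: true, columns: predicted

print(f"accuracy: {acc:.3f}  macro F1: {macro_f1:.3f}")
print(cm)
print(classification_report(y_true, y_pred))
```

In this toy case accuracy is 0.8 while half of the examples in each small class are misclassified, which the macro F1 and the confusion matrix make visible.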


*Day 1 ends*