# US Consumer Finance Complaints
This notebook is a quick analysis of the US Consumer Finance Complaints dataset. The dataset is available on [Kaggle](https://www.kaggle.com/datasets/kaggle/us-consumer-finance-complaints) and contains complaints about consumer financial products and services that were sent to companies for response. It has 18 columns and 555,957 rows.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from os import stat
from IPython.display import display
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

## Importing the dataset
First, we need to import the dataset. We will use the `pandas` library to read the dataset into a `DataFrame` object.

In [None]:
# Size of the file

filename = 'data/Consumer_Complaints.csv'

file = stat(filename)
print(f'File size: {file.st_size / 1024 / 1024:.2f} MB.')

# Read the CSV file

df = pd.read_csv(filename, low_memory=False)

print(f'Data from the CSV file: {df.shape}')
df.info()  # info() prints its report and returns None, so display() is not needed
display(df.head(10))
display(df.describe())

### Preprocessing

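Some values are missing.
We are going to drop every row that has at least one missing value. Note that `how='any'` discards a row even if only one of its columns is empty, so this step can remove a large share of the data.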

In [None]:
print(f"Number of NA: {df.isna().sum().sum()}.")

print(f"Before removing NA: {df.shape}.")
df.dropna(axis = 0, how = 'any', inplace = True)
print(f"After removing NA: {df.shape}.")
df.info()  # info() prints its report and returns None, so display() is not needed
display(df.head(10))
display(df.describe())
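
Before dropping rows, it can help to see which columns carry the missing values. The sketch below is only an illustration on a made-up toy frame (the `toy` data and its values are assumptions, not taken from the dataset): it computes the share of missing values per column and compares `how='any'` with dropping on a chosen subset of columns.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the complaints data (values are made up)
toy = pd.DataFrame({
    "product": ["Mortgage", "Credit card", "Debt collection", "Mortgage"],
    "tags": [np.nan, np.nan, "Older American", np.nan],
    "state": ["CA", "NY", np.nan, "TX"],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rows kept when dropping on any missing value...
rows_any = len(toy.dropna(how="any"))
# ...versus dropping only when the chosen columns are missing
rows_subset = len(toy.dropna(subset=["product", "state"]))
print(f"how='any': {rows_any} rows, subset: {rows_subset} rows")
```

In this toy case a single sparse column is enough to empty the frame under `how='any'`, which is why checking the per-column shares first may be worthwhile.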

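Then, we will convert the `date_received` and `date_sent_to_company` columns to `datetime` objects, drop the `complaint_id` column, and encode each categorical column as integers. The mapping for the `company` column is kept in `company_dict` so that the codes can be traced back to company names.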

In [None]:
# Convert the date to a datetime object
df['date_received'] = pd.to_datetime(df['date_received'])
df['date_sent_to_company'] = pd.to_datetime(df['date_sent_to_company'])

# Mapping from company name to its integer code, kept for later lookups
company_dict = {}

# The complaint ID is only an identifier, so it is not useful as a feature
df.drop(['complaint_id'], axis=1, inplace=True)

# Encode each categorical column as integers, in order of first appearance
for p in ["consumer_complaint_narrative",'company_public_response','product', 'sub_product', 'issue', 'sub_issue', 'company', 'state', 'zipcode', 'tags', 'consumer_consent_provided', 'submitted_via', 'company_response_to_consumer', 'timely_response', 'consumer_disputed?']:
    dic = {k: i for i, k in enumerate(df[p].unique())}
    df[p] = df[p].map(dic)
    if p == 'company':
        company_dict = dic
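
The manual encoding in the loop above assigns codes in order of first appearance, which is also what `pd.factorize` does. The following sketch, on a made-up `toy` frame with hypothetical company names, shows that the two approaches agree and how to build the reverse lookup from code to label.

```python
import pandas as pd

# Hypothetical company names, for illustration only
toy = pd.DataFrame({"company": ["Acme Bank", "Zeta Credit", "Acme Bank", "Nova Loans"]})

# Manual encoding, as in the loop above: first-seen order gives the code
manual = {name: code for code, name in enumerate(toy["company"].unique())}
manual_codes = toy["company"].map(manual)

# pd.factorize returns the integer codes and the unique values
codes, uniques = pd.factorize(toy["company"])

# Reverse lookup: code -> original label
code_to_company = dict(enumerate(uniques))
print(list(codes), code_to_company)
```

Keeping the reverse lookup around makes it possible to label plots and results with names rather than codes.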

### Visualizing the dataset

In [None]:
display(df.head(10))

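First, let's plot the distribution of complaints over time. The histogram uses 10 equal-width bins over the date range, so the bars only approximate yearly counts.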

In [None]:
plt.figure(figsize=(15,9))
plt.hist(df['date_received'], bins=10, color='blue', edgecolor='black', alpha=0.7)
plt.show()
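
If exact yearly counts are wanted rather than equal-width bins, one option is to group on the year of the date. This is a minimal sketch on made-up dates (the `dates` series is an assumption for illustration), not the notebook's own plot.

```python
import pandas as pd

# Made-up complaint dates, for illustration only
dates = pd.to_datetime(pd.Series([
    "2013-05-01", "2014-01-15", "2014-07-30",
    "2015-03-02", "2015-03-09", "2015-11-20",
]))

# Count complaints per calendar year, sorted chronologically
per_year = dates.dt.year.value_counts().sort_index()
print(per_year)
```

Calling `per_year.plot(kind="bar")` on the result would then draw one bar per year.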

Then, let's plot the number of complaints by company.

In [None]:
print(f"Number of companies: {df['company'].nunique()}.")
print(f"Mapping from company names to codes: {company_dict}.")
plt.figure(figsize=(15,9))
plt.hist(df['company'], bins=df['company'].nunique(), color='blue', edgecolor='black', alpha=0.7)
plt.show()
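
Because the `company` column now holds integer codes, a histogram over the codes is hard to read. A possible alternative, sketched here on made-up data, is to map the codes back to names and keep only the most frequent companies. The `company_dict` contents and the `encoded` series below are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical name -> code mapping and encoded column, for illustration only
company_dict = {"Acme Bank": 0, "Zeta Credit": 1, "Nova Loans": 2}
encoded = pd.Series([0, 1, 0, 2, 0, 1], name="company")

# Invert the mapping so codes can be turned back into names
code_to_name = {code: name for name, code in company_dict.items()}

# Complaint counts for the two most frequent companies
top = encoded.map(code_to_name).value_counts().head(2)
print(top)
```

A bar chart of `top` (for example with `top.plot(kind="barh")`) would show the companies by name.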

We are now going to plot the number of complaints by company for each year.

In [None]:
for year in df["date_received"].dt.year.unique():
    plt.figure(figsize=(15,9))
    plt.hist(df[df["date_received"].dt.year == year]["company"], bins=df['company'].nunique(), color='blue', edgecolor='black', alpha=0.7)
    plt.title(f"Number of complaints per company in {year}")
    plt.show()

Let's do the same thing for each value of the `sub_product` column.

In [None]:
for sp in df['sub_product'].unique():
    plt.figure(figsize=(15,9))
    plt.hist(df[df['sub_product'] == sp]['company'], bins=df['company'].nunique(), color='blue', edgecolor='black', alpha=0.7)
    plt.title(f"Number of complaints per company for subproduct {sp}")
    plt.show()

## Training the model
### Linear model
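We are trying to predict the `consumer_disputed?` column, using the encoded categorical columns as features (the two date columns, the complaint narrative and the company public response are left out).
As a reminder, here are the names of the columns.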

In [None]:
print(df.columns)

The `consumer_disputed?` column is binary, so we will use a logistic regression model.

In [None]:
X = df[['product', 'sub_product', 'issue', 'sub_issue', 'company', 'state', 'zipcode', 'tags', 'consumer_consent_provided', 'submitted_via', 'company_response_to_consumer', 'timely_response']]
Y = df['consumer_disputed?']

In [None]:
reg = LogisticRegression()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

reg.fit(X_train, Y_train)
print(f"Score: {reg.score(X_test, Y_test)}.")
cvs = cross_val_score(reg, X, Y, cv=5)
print(f"Cross-validation score: {cvs}.")
print(f"Mean cross-validation score: {cvs.mean()}.")
print(f"Standard deviation of cross-validation score: {cvs.std()}.")

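This model has a good score; however, the solver does not seem to converge well. The features are arbitrary integer codes for categories, which a linear model treats as ordered numeric values, so both the fit and the score should be read with caution.
We are going to try other models.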
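
One way to look into the convergence issue is to one-hot encode the categorical codes, raise `max_iter`, and compare the accuracy with a majority-class baseline. The sketch below runs on synthetic data (the feature values, the roughly 20% positive rate and the `max_iter=1000` setting are all assumptions for illustration), so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic integer-coded categorical features (made up)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "product": rng.integers(0, 5, size=n),
    "submitted_via": rng.integers(0, 3, size=n),
})
# Synthetic imbalanced target: roughly 20% positives, unrelated to the features
y = (rng.random(n) < 0.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# One-hot encode the codes so they are not treated as ordered numbers
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

model_acc = model.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)
print(f"Model accuracy: {model_acc:.3f}, baseline accuracy: {baseline_acc:.3f}")
```

If the model's accuracy is close to the baseline's, a high score mostly reflects class imbalance rather than predictive power.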

### Random forest
We are going to use a random forest classifier.

In [None]:
clf = RandomForestClassifier(n_estimators = 100, max_depth=5, random_state=0)
clf.fit(X_train, Y_train)

print(f"Score: {clf.score(X_test, Y_test)}.")
cvs = cross_val_score(clf, X, Y, cv=5)
print(f"Cross-validation score: {cvs}.")
print(f"Mean cross-validation score: {cvs.mean()}.")
print(f"Standard deviation of cross-validation score: {cvs.std()}.")

The random forest classifier seems to have a good score. Let's try to improve it.
Perhaps we can use a grid search over `n_estimators` and `max_depth` to find the best parameters.

In [None]:
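# Grid search over the number of trees and the maximum tree depth
from sklearn.model_selection import GridSearchCV
rfc = RandomForestClassifier(random_state=0)

param_grid = {
    'n_estimators' : [50,100,150],
    'max_depth' : [2,3,4,5,6],
}

# Each parameter combination is evaluated with 5-fold cross-validation
clf = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
clf.fit(X_train, Y_train)
print(f"Best parameters: {clf.best_params_}.")
print(f"Score: {clf.score(X_test, Y_test)}.")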
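
After fitting, a `GridSearchCV` object exposes the selected parameters and the mean cross-validation score of the best combination. The sketch below shows how to read them on a small synthetic problem. The data from `make_classification` and the reduced grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic classification problem (not the complaints data)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Reduced grid so the search runs quickly
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_   # parameter combination with the best mean CV score
best_cv = search.best_score_        # mean CV score of that combination
test_acc = search.score(X_test, y_test)  # accuracy of the refitted best model
print(best_params, best_cv, test_acc)
```

Comparing `best_cv` with `test_acc` gives a rough check that the selected model is not overfitted to the cross-validation folds.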