<div class="alert alert-warning"> 
    
# Consumer Complaint Response
    
#### <span style="color:teal">GitHubs: [Tyler Kephart](https://github.com/tkephart96), [Chellyan Moreno](https://github.com/chellyan-moreno), [Rosendo Lugo](https://github.com/rosendo-lugo), [Alexia Lewis](https://github.com/lewisalexia)</span>
    
</div>

<div class="alert alert-success">    
    
## Goal: 
This classification NLP project aims to provide an accurate prediction of company response based on the language of a consumer's complaint.

## Description:

Our project involves analyzing 3.5 million consumer complaints submitted to the Consumer Financial Protection Bureau (CFPB) from 2011 to 2023. We use Natural Language Processing to analyze how the wording of a complaint relates to the company's response. Our goal is to provide insights on complaint language and its impact, helping companies improve their responses and improving outcomes for consumers and businesses.
  
</div>

<div class="alert alert-warning"> 
    
# Imports
    
### <span style="color:teal">Libraries Used: google.oauth2, Pandas_gbq, os, ...</span>

</div>

In [None]:
# local .py modules
import wrangle as wr
import explore as ex
import model as mo

#standard
import pandas as pd
import numpy as np
import re

#file
import os
import json

#vizz
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud

#preprocess
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#split and model
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

#set random state
random_state=123

#ignore warnings
import warnings
warnings.filterwarnings("ignore")

pd.set_option("display.max_colwidth", 250)


<div class="alert alert-warning"> 

# Wrangle
    
---
    
## Acquire

### <span style="color:teal">Libraries Used: google.oauth2, Pandas_gbq, os, ...</span>

</div>

* Data acquired from [Google BigQuery](https://console.cloud.google.com/marketplace/product/cfpb/complaint-database)
* 3,458,906 rows × 18 columns *before* cleaning
* 1,246,736 rows × 8 columns *after* cleaning


### Data Dictionary

| Feature                               | Definition                                                                                  |
| :------------------------------------ | :------------------------------------------------------------------------------------------ |
| date_received                         | Date the complaint was received by the CFPB                                                 |
| product                               | The type of product the consumer identified in the complaint                                |
| subproduct                            | The type of sub-product the consumer identified in the complaint                            |
| issue                                 | The issue the consumer identified in the complaint                                          |
| subissue                              | The sub-issue the consumer identified in the complaint                                      |
| consumer_complaint_narrative          | A description of the complaint provided by the consumer                                     |
| company_public_response               | The company's optional public-facing response to a consumer's complaint                     |
| company_name                          | Name of the company identified in the complaint by the consumer                             |
| state                                 | Two-letter postal abbreviation of the state of the mailing address provided by the consumer |
| zip_code                              | The mailing ZIP code provided by the consumer                                               |
| tags                                  | Older American is aged 62 and older, Servicemember is Active/Guard/Reserve member or spouse |
| consumer_consent_provided             | Identifies whether the consumer opted in to publish their complaint narrative               |
| submitted_via                         | How the complaint was submitted to the CFPB                                                 |
| date_sent_to_company                  | The date the CFPB sent the complaint to the company                                         |
| company_response_to_consumer (target) | The response from the company about this complaint                                          |
| timely_response                       | Indicates whether the company gave a timely response or not                                 |
| consumer_disputed                     | Whether the consumer disputed the company's response                                        |
| complaint_id                          | Unique ID for complaints registered with the CFPB                                           |

In [None]:
# Acquire and write/read CSV
df = wr.check_file_exists_gbq('cfpb.csv', 'service_key.json')

<div class="alert alert-success">    

### Insight: 
    
* **Quick run**
    * Verify `import wrangle as wr` is in the imports section 
    * Run final report
    * This will use a pre-built and cleaned parquet file
<br>
<br>
* **For the longer run: ⚠️WARNING⚠️:** These are almost the same steps we took to originally acquire the data. The steps take a lot of time (and space) and may not even be the best way of doing it. We highly recommend doing the quick run above unless you want to know how we got the data.
    * Verify `import big_wrangle as wr` is in the imports section
    * Install the pandas-gbq package
        * `pip install pandas-gbq`
    * Go to Google BigQuery and create a project
    * Copy the long SQL queries found in `big_wrangle.py`
        * Run in [Google BigQuery](https://cloud.google.com/bigquery/public-data)
    * Click on 'Go to Datasets in Cloud Marketplace' and search for 'CFPB'
        * View the dataset to open a quick SQL prompt to query in
    * Save each result as a BigQuery table in your project
    * You can look in `big_wrangle.py` for what we named our project, database, and tables
    * Edit the small SQL query variables found in `big_wrangle.py` so they point to the respective table names in your BigQuery project, using this format: 
        * ***FROM `database.table`***, and set the `project_ID` variable to your project's ID
    * Run final report
    * It may ask for authentication when it tries to query Google BigQuery
        * Try to run again if it stopped
    * This will run through the longer pathway of getting the datasets from the source and merging/cleaning/prep
    * It will probably take a while **(3+ million rows, 2+ GB)**

</div>
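
Both the quick run and the longer run above come down to a cache-or-query pattern: read a local file if one exists, otherwise pull from BigQuery once and save the result. The sketch below illustrates that pattern only; it is not the body of `wr.check_file_exists_gbq`. The helper names `load_or_fetch` and `query_bigquery` are hypothetical, and the cache uses the standard-library `csv` module rather than pandas to keep the example small.

```python
import csv
import os


def query_bigquery(sql, project_id, key_path):
    """Hypothetical helper: run a query through pandas-gbq with a service-account key."""
    # Imported lazily so the caching logic below runs without these packages installed.
    from google.oauth2 import service_account
    import pandas_gbq

    credentials = service_account.Credentials.from_service_account_file(key_path)
    df = pandas_gbq.read_gbq(sql, project_id=project_id, credentials=credentials)
    return df.to_dict("records")


def load_or_fetch(csv_path, fetch_rows):
    """Return cached rows from csv_path, or call fetch_rows() once and cache the result."""
    if os.path.exists(csv_path):
        # Cache hit: note that every value read back from a CSV is a string.
        with open(csv_path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    rows = fetch_rows()
    if rows:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return rows
```

A call such as `load_or_fetch('cfpb.csv', lambda: query_bigquery(sql, project_id, 'service_key.json'))` would only reach BigQuery the first time, which is why the quick run is so much faster.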

<div class="alert alert-warning"> 

# Prepare

### **<span style="color:teal">Libraries Used: Regex, NLTK, Unicode, Pandas...</span>**

---
</div>

* **date_received**
  * changed date to DateTime
  * no nulls
  * 2015 to 2023
    <br>
  <br>
  
* **product**
  * no nulls
  * credit related
  * *<span style="color:orange">ENGINEERED FEATURE</span>*
    * **<span style="color:orange">bin related products/services together</span>**
        * <span style="color:orange">bins = credit_report, credit_card, debt_collection, mortgage, bank, loans, and money_service</span>
    * **<span style="color:red">drop after engineering</span>**
  <br>
  <br>
* **subproduct**
  * 7% null
  * top value = credit reporting
  * fill nulls with the product
  * what does subproduct correlate with?
    * **<span style="color:red">drop column</span>**
      <br>
  <br>
* **issue**
  * no nulls
  * 165 unique values
  * concat into consumer_complaint_narrative column to address those nulls and then drop the issue
    * **<span style="color:red">drop column</span>**
      <br>
  <br>
* **subissue**
  * 20% null
  * 221 unique
    * **<span style="color:red">drop column</span>**    
      <br>
  <br>
* **consumer_complaint_narrative**
  * 64% null
  * renamed to narrative
    * **<span style="color:orange">drop all null values</span>**
    * **<span style="color:red">drop column after NLTK cleaning</span>**  
       <br>
  <br>
* **company_public_response**
  * 56% null
    * **<span style="color:red">drop column</span>**
    * **<span style="color:blue">nice-to-have: second iteration</span>**
      <br>
  <br>
* **company_name**
  * no nulls
  * 6,694 Companies
  <br>
  <br>
* **state**
  * 1% null
  * keep for purposes of exploration
  * do not broad bin (causes manipulation)
    * **<span style="color:orange">bin 1% null into UNKNOWN label</span>**
      <br>
  <br>
* **zip_code**
  * 1% null
  * located a string buried in the data
    * **<span style="color:red">drop column</span>**    
    * **<span style="color:blue">nice-to-have: second iteration</span>**
      <br>
  <br>
* **tags**
  * 89% null
  * domain knowledge: 62 and older accounted for senior - pulled straight from the source
      * **<span style="color:orange">impute nulls with "Average Person" label</span>**
      <br>
  <br>
* **consumer_consent_provided**
  * does not relate to the target
    * **<span style="color:red">drop column</span>**    
      <br>
  <br>
* **submitted_via**
  * no nulls
    * **<span style="color:red">drop column</span>**    
      <br>
  <br>
* **date_sent_to_company**
  * no nulls
    * **<span style="color:red">drop column</span>**    
      <br>
  <br>
* **company_response_to_consumer**
  * 4 nulls = 0%
    * **<span style="color:orange">drop these 4 rows because this is the target column</span>**
  * 8 unique values
    * **<span style="color:blue">nice to have: apply the model to in_progress complaints and see what it predicts based on the language</span>**
    * **<span style="color:orange">drop 'in progress' response because there is no conclusion</span>**
  <br>
  <br>
* **timely_response**
  * no nulls
  * boolean
    * **<span style="color:red">drop column</span>**    
      <br>
  <br>
* **consumer_disputed**
  * 77% null
    * **<span style="color:red">drop column</span>**    
      <br>
  <br>
* **complaint_id**
  * no nulls
    * **<span style="color:red">drop column</span>** 
    * **<span style="color:blue">nice-to-have: second iteration</span>**
      <br>
  <br>
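
The binning and null-handling decisions above can be sketched in plain Python. This is a minimal illustration, not the code in `wrangle.py`: the keyword rules in `BIN_RULES`, the function names, and the `"other"` fallback are assumptions; only the seven bin names and the `UNKNOWN` / `Average Person` labels come from the notes above.

```python
# Ordered (keyword, bin) rules; the first match wins. The keywords are
# illustrative guesses, not the mapping used in wrangle.py.
BIN_RULES = [
    ("credit card", "credit_card"),
    ("prepaid card", "credit_card"),
    ("credit report", "credit_report"),
    ("debt collection", "debt_collection"),
    ("mortgage", "mortgage"),
    ("loan", "loans"),
    ("money", "money_service"),
    ("virtual currency", "money_service"),
    ("account", "bank"),
]


def bin_product(product):
    """Map a raw product name onto one of the product bins."""
    name = product.lower()
    for keyword, bin_name in BIN_RULES:
        if keyword in name:
            return bin_name
    return "other"  # assumed fallback, not one of the notebook's bins


def fill_defaults(row):
    """Apply the null-handling rules for state and tags to one record (a dict)."""
    filled = dict(row)
    if not filled.get("state"):
        filled["state"] = "UNKNOWN"
    if not filled.get("tags"):
        filled["tags"] = "Average Person"
    return filled
```

Rule order matters here: "credit card" is checked before "credit report" so that card products are not swallowed by the credit reporting bin.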

In [None]:
# Clean
df_clean = wr.clean_data(df)
# Write to Parquet
df_clean.to_parquet('df_clean.parquet')
# Assign
df_clean = pd.read_parquet('df_clean.parquet')

# Prep
df_prep = wr.prep_narrative(df_clean)
# Write to Parquet
df_prep.to_parquet('df_prep.parquet')
# Assign
df_prep = pd.read_parquet('df_prep.parquet')

<div class="alert alert-success">    

### Insight:
    
Dropped columns:
* product (after engineering new feature)
* subproduct
* issue
* subissue
* consumer_complaint_narrative (after NLTK cleaning)
* company_public_response
* zip_code
* consumer_consent_provided
* submitted_via
* date_sent_to_company
* timely_response
* consumer_disputed
* complaint_id
    <br>
    <br>
    
Used NLTK to clean each document resulting in:
* 2 new columns: *clean* (redacted XXs and stopwords removed) and *lemon* (lemmatized)
    <br>
    <br>
    
Selected columns to proceed with after cleaning:
* date_received, product_bins, company_name, state, tags, company_response_to_consumer (target), clean, lemon
    
</div>
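
To make the *clean* column concrete, here is a rough sketch of that kind of text cleaning using only the standard library. It is an approximation of what `wr.prep_narrative` might do, not its actual code: the function name `clean_narrative` and the short `STOPWORDS` set are assumptions, and NLTK's English stopword list is much longer.

```python
import re

# A tiny stand-in stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "i", "my", "to", "of", "on",
             "was", "is", "in", "for", "it", "that"}


def clean_narrative(text):
    """Lowercase, drop redaction runs such as XXXX, then strip non-letters and stopwords."""
    text = text.lower()
    text = re.sub(r"\bx{2,}\b", " ", text)  # redacted tokens like XX or XXXX
    text = re.sub(r"[^a-z\s]", " ", text)   # digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

The *lemon* column would then come from lemmatizing each remaining token, for example with NLTK's `WordNetLemmatizer().lemmatize(token)`, which is left out here because it needs the WordNet corpus downloaded.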

<div class="alert alert-warning"> 

# Explore

## <span style="color:teal">Libraries Used: Numpy, Pandas, Seaborn, Matplotlib, Scipy ...</span>

---
</div>

### Split and Parquet

In [None]:
# Split
train, validate, test = wr.split_data(df_prep,"company_response_to_consumer")

# Write to Parquet
train.to_parquet('train.parquet')
validate.to_parquet('validate.parquet')
test.to_parquet('test.parquet')

# Assign 
train = pd.read_parquet('train.parquet')
validate = pd.read_parquet('validate.parquet')
test = pd.read_parquet('test.parquet')
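
Passing the target column to `wr.split_data` suggests the split is stratified on company response, so each subset keeps roughly the same class mix. The sketch below shows that idea in plain Python; the 60/20/20 proportions and the function name `stratified_split` are assumptions, since the notebook does not state the actual split sizes.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, fractions=(0.6, 0.2, 0.2), seed=123):
    """Split a list of dict records into train/validate/test, class by class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, validate, test = [], [], []
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_validate = int(len(group) * fractions[1])
        train += group[:n_train]
        validate += group[n_train:n_train + n_validate]
        test += group[n_train + n_validate:]  # remainder goes to test
    return train, validate, test
```

With pandas DataFrames, the same effect is usually achieved by calling scikit-learn's `train_test_split` twice with the `stratify` argument set to the target column.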

## Questions To Answer:

**1. Are there words that get particular responses and is there a relationship?**
* What are the payout words that got a company response of closed with monetary relief?
* Are there unique words associated with products? Is there a relationship between unique product words and responses?
<br>

**2. Do all responses have a negative sentiment?**
* Do narratives with a neutral or positive sentiment analysis relating to bank account products lead to a response of closed with monetary relief?
<br>

**3. Are there unique words associated with the most negative and most positive company responses?**
<br>

**4. Which product is more likely to have monetary relief?**

### **1. Are there words that get particular company responses and is there a relationship?**
* Are there unique words associated with products? Is there a relationship between unique product words and responses?

$H_0$: This notebook is not pretty


$H_a$: This notebook is pretty

In [None]:
# Get words per company response and per product, most frequent first
word_counts = ex.get_words(train).sort_values(by='all', ascending=False)
word_counts_products = ex.get_words_products(train).sort_values(by='all', ascending=False)

In [None]:
# Visualize words per company response
ex.unique_words(word_counts)

In [None]:
# Visualize words per product
ex.unique_words(word_counts_products)

<div class="alert alert-success">    

## Insight:

There is a relationship between the words used in complaints and company responses. The words used relate to the products that consumers can complain about. There are unique words associated with each product, and those words can be used to predict a company response.
    
### Company Responses and top 5 words:

* Explanation
    * Account, Credit, Report, Payment, Information
        * This type of response looks like it could relate to credit reporting products
* Non-Monetary
    * Credit, Account, Report, Information, Reporting
* **Monetary**
    * Account, **Bank**, **Card**, Credit, Payment
        * This type of response looks like it could relate to credit card or bank products
* **Untimely Response**
    * **Debt**, Credit, Account, **Company**, **Loan**
        * This type of response looks like it could relate to debt products
* Closed
    * Account, Debt, Credit, Payment, Loan
---

### Products and top 5 words:
* Credit Report
    * Credit, Account, Report, Information, Reporting
        * matches up with hypothesis where this type of product might get a response of explanation or non-monetary relief
* Debt
    * **Debt**, Credit, Account, **Collection**, Report
        * matches up with hypothesis where this type of product might get an untimely response
* Credit Card
    * **Card**, Credit, Account, Payment, **Charge**
        * matches up with hypothesis where this type of product might get a response of monetary relief
* Mortgage
    * Payment, Loan, Mortgage, Would, **Time**
        * matches up with hypothesis where this type of product might get a response of closed
* Loans
    * Loan, Payment, Account, Would, Credit
        * matches up with hypothesis where this type of product might get a response of closed
* Bank
    * Account, Bank, **Check**, Money, Would
        * matches up with hypothesis where this type of product might get a response of monetary relief
* Money Service
    * Account, Money, Bank, **Paypal**, **Transaction**
        * matches up with hypothesis where this type of product might get a response of monetary relief
    

</div>
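
The top-5 lists above come from counting words overall and within each group. A minimal sketch of that counting step follows, assuming records are dicts with a lemmatized text field; the function names are hypothetical and this is not the implementation of `ex.get_words`.

```python
from collections import Counter


def word_counts_by_group(rows, text_key, group_key):
    """Count words overall (under 'all') and separately for each group value."""
    counts = {"all": Counter()}
    for row in rows:
        words = row[text_key].split()
        counts["all"].update(words)
        counts.setdefault(row[group_key], Counter()).update(words)
    return counts


def top_words(counter, n=5):
    """Return the n most frequent words in a Counter."""
    return [word for word, _ in counter.most_common(n)]
```

Calling `word_counts_by_group(rows, "lemon", "company_response_to_consumer")` gives one counter per response type, and the same function grouped on `product_bins` gives the per-product view.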

### **2. Do all responses have a negative sentiment?**
* Do narratives with a neutral or positive sentiment analysis relating to bank account products lead to a response of closed with monetary relief? 

$H_0$: There is no significant effect of sentiment on company response to the consumer.


$H_a$: There is a significant effect of sentiment on company response to the consumer.

In [None]:
# visualize data and run statistical analysis
ex.analyze_sentiment(train)
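
The notebook does not state which statistical test `ex.analyze_sentiment` runs. One common choice for comparing a numeric score across several categories is a one-way ANOVA, sketched here by hand as an assumption; the sentiment scores in the usage note are made up for illustration.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

For example, `one_way_anova_f([[0.1, 0.2, 0.3], [-0.5, -0.4, -0.6]])` compares hypothetical sentiment scores for two response types; a large F relative to the F distribution with `k - 1` and `n_total - k` degrees of freedom would lead to rejecting $H_0$. In practice `scipy.stats.f_oneway` returns both the statistic and the p-value.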

<div class="alert alert-success">    

### Insight: 
#### - Overall, there is a strong correlation between the sentiment of consumer complaints/narratives and the corresponding responses from companies.

1. **Mortgage**:
  - Consumer complaints/narratives exhibit predominantly positive sentiment, and companies provide an equal distribution of responses across different categories.
  
2. **Credit Report**:
  - Consumer complaints/narratives with positive sentiment tend to receive the "closed with monetary relief" response most frequently.
  - Overall, the sentiment of complaints/narratives is generally neutral to positive.
  
3. **Debt Collection**:
  - All consumer complaints/narratives have negative sentiment scores, and the complaints with the most negative scores typically receive an "untimely response."
  
4. **Loans**:
  - Complaints/narratives regarding loans have sentiment scores ranging from neutral to positive. Companies provide different responses irrespective of the sentiment score.
   
5. **Bank**:
  - Sentiment scores for bank-related complaints/narratives are somewhat mixed, ranging from neutral to negative. The more negative complaints tend to receive a "closed" or "untimely response."
  
6. **Money Service**:
  - Sentiment scores for complaints/narratives about money services vary between negative and positive. The most negative complaints receive a "closed" response.
  
7. **Credit Card**:
  - The majority of sentiment scores for credit card complaints/narratives range from neutral to positive. The most common response received by consumers is "closed with non-monetary relief."
 
 
#### - These findings indicate that the sentiment of consumer complaints/narratives is associated with the type of response received from companies across different product categories.

</div>


### **3. Are there unique words associated with the most negative and most positive company responses?**


$H_0$: This notebook is not pretty


$H_a$: This notebook is pretty

In [None]:
#visualize


In [None]:
#analyze


<div class="alert alert-success">    

### Insight: 
    
''
</div>

### **4. Which product is more likely to have monetary relief?**

$H_0$: This notebook is not pretty


$H_a$: This notebook is pretty

In [None]:
#visualize


In [None]:
#analyze


<div class="alert alert-success">    

### Insight: 
    
''
</div>

<div class="alert alert-warning"> 

# Modeling

## <span style="color:teal">Libraries Used: ...</span>

---
</div>

## Selected Classification Models:
- Decision Tree
- KNN
- Multinomial
- (?)

---
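
The imports include `CountVectorizer` and `TfidfVectorizer`, so the models presumably see the narratives as word-weight vectors. As a rough illustration of what TF-IDF weighting does, the sketch below computes raw term counts scaled by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the default smoothing in scikit-learn. It skips the L2 normalization that `TfidfVectorizer` applies by default, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document (unnormalized TF-IDF)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Number of documents each word appears in
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()}
            for tokens in tokenized]
```

A word such as "credit" that appears in every complaint gets the minimum weight per occurrence, while rarer words are weighted up, which is what lets product-specific vocabulary stand out.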

### Evaluation Metric:
- Accuracy (?)

---

### Features Sent In:
-
-
-

---

### Hyperparameters Used:
-

---

### <span style="color:blue">Baseline: 78% (?)</span>

--- 
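
For an accuracy metric, the usual baseline is the share of the most frequent class: the score a model would get by always predicting that class. A minimal sketch, assuming the baseline above was computed this way (the notebook marks it as unconfirmed); the labels in the usage note are toy values.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)
```

For instance, with 78 rows of one response type and 22 of another, `baseline_accuracy` returns that majority label and `0.78`; a model is only useful if it beats this number on validate.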


<div class="alert alert-success">    

### Insight: 
    
''
</div>

<div class="alert alert-warning"> 

# Conclusion

## <span style="color:teal">Libraries Used: ...</span>

---
    
</div>

<div class="alert alert-success">    

    
## Explore

* 
* 
* 
* 

## Modeling
* 
* 
    

</div>

<div class="alert alert-warning"> 


# Project Summary:

* The analysis revealed a significant relationship between consumer sentiment in complaints/narratives and the corresponding company responses, indicating the importance of sentiment in consumer-company interactions.
* Sentiment patterns varied across product categories: mortgage complaints skewed positive, positive credit report complaints most often received "closed with monetary relief" responses, and debt collection complaints were consistently negative, with the most negative typically receiving an "untimely response" from companies. These findings highlight the need to consider sentiment for effective consumer grievance resolution.
---
    
</div>

<div class="alert alert-warning"> 


# Recommendations and Next Steps

---
    
</div>

<div class="alert alert-success">    

    
## Recommendations

* 
* 
* 
* 

## Next Steps
* 
* 
    

</div>
