# Natural Language Processing


**Pablo Martínez Olmos**

Department of Signal Theory and Communications

**Universidad Carlos III de Madrid**

<img src='http://www.tsc.uc3m.es/~emipar/BBVA/INTRO/img/logo_uc3m_foot.jpg' width=400 />

# Project I: clustering complaints in a customer database of financial products and services




In this project we will use the following [database](https://catalog.data.gov/dataset/consumer-complaint-database), published by the *Bureau of Consumer Financial Protection* (BCFP), a US federal agency:

> The Consumer Complaint Database is a collection of complaints about consumer financial products and services that we sent to  companies for response. Complaints are published after the company responds, confirming a commercial relationship with the  consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database. 

The full database can be loaded as follows:

```python
import pandas as pd

df = pd.read_csv('https://files.consumerfinance.gov/ccdb/complaints.csv.zip', compression='zip', sep=',')
```

However, it is fairly large (1.2 GB). We will instead load a small extract with about 20,000 records:

In [6]:
import pandas as pd

df = pd.read_csv('http://www.tsc.uc3m.es/~olmos/BBVA/complaints_extracto.csv', sep=',')


In [7]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,1050152,2018-02-11,"Credit reporting, credit repair services, or o...",Credit reporting,Problem with a credit reporting company's inve...,Their investigation did not fix an error on yo...,I disputed with this company back in XXXX the ...,,"EQUIFAX, INC.",MI,482XX,,Consent provided,Web,2018-02-11,Closed with explanation,Yes,,2811182
1,138991,2019-04-18,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Account status incorrect,,,"EQUIFAX, INC.",PR,00987,,Consent not provided,Web,2019-04-18,Closed with non-monetary relief,Yes,,3216620
2,1128029,2014-03-01,Mortgage,Conventional adjustable mortgage (ARM),"Loan modification,collection,foreclosure",,,,Ocwen Financial Corporation,CA,91601,,,Web,2014-02-28,Closed with explanation,Yes,No,738615
3,54959,2019-07-08,Debt collection,Credit card debt,Communication tactics,Frequent or repeated calls,,,Radius Global Solutions LLC,FL,33756,,Consent not provided,Web,2019-07-08,Closed with explanation,Yes,,3298990
4,2007823,2018-11-18,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Account information incorrect,XXXX Details Account # XXXX A payment was sent...,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,WI,,,Consent provided,Web,2018-11-18,Closed with non-monetary relief,Yes,,3077718
5,671104,2020-12-18,Debt collection,Credit card debt,Attempts to collect debt not owed,Debt was result of identity theft,,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",TX,75904,,Consent not provided,Web,2020-12-18,Closed with explanation,Yes,,4019919
6,671377,2021-02-08,Checking or savings account,Checking account,Managing an account,Problem using a debit or ATM card,This is more of a general complaint because it...,Company has responded to the consumer and the ...,"BANK OF AMERICA, NATIONAL ASSOCIATION",PA,194XX,,Consent provided,Web,2021-02-08,Closed with monetary relief,Yes,,4124936
7,502772,2020-08-06,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,MY WIFE AND I HAVE BEEN A VICTIM OF IDENTITY T...,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,AR,721XX,,Consent provided,Web,2020-08-06,Closed with explanation,Yes,,3782734
8,6661,2019-03-26,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Old information reappears or never goes away,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,ID,832XX,,Consent not provided,Web,2019-03-26,Closed with explanation,Yes,,3191684
9,1552571,2014-12-13,Credit reporting,,Unable to get credit report/credit score,Problem getting my free annual report,,,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",TX,77005,Older American,,Web,2014-12-16,Closed with non-monetary relief,Yes,No,1155672


We will focus on the `Consumer complaint narrative` field:

In [10]:
complaints = list(df['Consumer complaint narrative'])

# complaints is a list with the users' complaints (entries without a narrative are NaN)

complaints[0]

'I disputed with this company back in XXXX the company is XXXX. They didnt send me any method of verification stating a original signed contract with my signature because according to Section 609 of the FCRA it has to be proof of a original signed contract with my signature. This company just continues to report this on my credit report and being that XXXX has so many names and addresses on my file I didnt sign up for this.Please reach out to this company XXXX to investigate.'

First, let us get a sense of how long the texts are across records.

> **Exercise** Plot the histogram of the number of words per entry of the `Consumer complaint narrative` field.

In [13]:
# YOUR CODE HERE
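
One possible way to approach the word-count exercise is sketched below. It is a minimal sketch, not a reference solution: the helper names `count_words` and `plot_word_histogram` are hypothetical, words are simply whitespace-separated tokens, and missing narratives (which pandas loads as NaN) are skipped.

```python
def count_words(complaints):
    """Return the word count of every non-missing narrative."""
    # Missing narratives arrive as float NaN, so only strings are kept
    return [len(text.split()) for text in complaints if isinstance(text, str)]


def plot_word_histogram(counts, bins=50):
    """Draw the histogram of word counts (requires matplotlib)."""
    # Imported lazily so that count_words can be used without matplotlib
    import matplotlib.pyplot as plt
    plt.hist(counts, bins=bins)
    plt.xlabel('Words per complaint')
    plt.ylabel('Number of complaints')
    plt.show()


# Small illustrative input, including one missing narrative
sample = ['I disputed with this company back in XXXX', float('nan'), 'Please investigate.']
counts = count_words(sample)
print(counts)  # [8, 2]
```

With the notebook's own list, the call would be `plot_word_histogram(count_words(complaints))`.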

As you will have seen, the texts vary widely in length. This field can be very long, so a TF-IDF or averaged word2vec representation may lose discriminative power. We will implement a function that summarizes each document with its 3 most significant sentences.

> **Exercise**: Write a function that, given a text as a *string*, returns a summary with the $I=3$ most significant sentences using TextRank. The summary must be a single *string* that concatenates the $I$ sentences found (**Note: concatenate the ORIGINAL sentences, without normalization**).

In [1]:
# YOUR CODE HERE
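
The summarization exercise could be sketched along the following lines. This is an illustrative, standard-library-only sketch under several assumptions: sentences are split with a naive regular expression, similarity is word overlap normalized by the log of the number of distinct words in each sentence, and scores come from a fixed number of weighted PageRank iterations. The names `split_sentences`, `sentence_similarity` and `textrank_summary` are hypothetical, and a library-based tokenizer or graph package would be a reasonable alternative.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def sentence_similarity(words_a, words_b):
    """Word overlap divided by the sum of the log sizes of both word sets."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    overlap = len(words_a & words_b)
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_summary(text, n_sentences=3, damping=0.85, n_iter=50):
    """Return the n_sentences best-ranked ORIGINAL sentences, in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return ' '.join(sentences)
    # Lowercased word sets are used only for scoring, never for the output
    word_sets = [set(re.findall(r'[a-z]+', s.lower())) for s in sentences]
    weights = [[sentence_similarity(word_sets[i], word_sets[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return ' '.join(sentences[i] for i in sorted(best))


example = ('The bank charged a fee on my account. '
           'I called the bank about the fee on my account. '
           'The weather was nice. '
           'The bank refused to refund the fee. '
           'Nothing else happened today!')
summary = textrank_summary(example)
print(summary)
```

Texts with three sentences or fewer are returned unchanged, which seems a sensible choice for the many short complaints in the extract.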

> **Exercise:** Build a corpus of summarized documents, as a list of *strings*.

In [1]:
# YOUR CODE HERE
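
When building the summarized corpus, one detail worth handling is that many records have no narrative at all. The sketch below assumes a hypothetical helper `build_summary_corpus` that takes any summarizing function as an argument; `first_sentence` is only a stand-in to make the example runnable and is not the TextRank summarizer the exercise asks for.

```python
def build_summary_corpus(complaints, summarize):
    """Apply `summarize` to every non-empty narrative and return a list of strings."""
    # Missing narratives are float NaN; blank strings are skipped as well
    return [summarize(text) for text in complaints
            if isinstance(text, str) and text.strip()]


def first_sentence(text):
    """Stand-in summarizer (illustration only): keep the text before the first '. '."""
    return text.split('. ')[0]


sample = ['First point. Second point.', float('nan'), '   ', 'Only one sentence']
corpus = build_summary_corpus(sample, first_sentence)
print(corpus)  # ['First point', 'Only one sentence']
```

Because missing entries are dropped, positions in the resulting list no longer match row numbers in `df`; keeping the original indices alongside the summaries may be useful if the clusters are to be related back to other columns.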

> **Exercise**: Normalize the summarized corpus and compute the TF-IDF encoding of all documents.

In [None]:
# YOUR CODE HERE
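
To make the normalization and TF-IDF steps concrete, here is a small standard-library sketch. Everything in it is an assumption made for illustration: the tiny `STOPWORDS` set, the choice to drop the `XXXX` redaction mask, the smoothed IDF formula, and the helper names `normalize` and `tfidf`. On the real extract a library vectorizer (for example scikit-learn's `TfidfVectorizer`) would be the more practical choice.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real run would use a full list)
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'to', 'of', 'in', 'on',
             'my', 'i', 'is', 'was', 'this'}


def normalize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and x-only redaction masks."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if t not in STOPWORDS and set(t) != {'x'}]


def tfidf(corpus_tokens):
    """Return (vocabulary, matrix) with one L2-normalized TF-IDF row per document."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    vocab = sorted(doc_freq)
    index = {term: k for k, term in enumerate(vocab)}
    # Smoothed IDF: terms present in every document still get a positive weight
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    matrix = []
    for tokens in corpus_tokens:
        row = [0.0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[index[term]] = count * idf[term]
        norm = math.sqrt(sum(v * v for v in row))
        matrix.append([v / norm for v in row] if norm > 0 else row)
    return vocab, matrix


docs = ['The bank charged a fee.',
        'The bank closed my account.',
        'XXXX reported a wrong debt.']
vocab, X = tfidf([normalize(d) for d in docs])
print(vocab)
```

Rows are scaled to unit length so that the dot product of two rows is directly their cosine similarity, which is convenient for the clustering step.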

>**Exercise**: Finally, cluster the complaints using K-means with cosine distance. Use $K=5$ groups. Print a few documents from each group and discuss the results.

In [1]:
# YOUR CODE HERE
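
K-means with cosine distance can be sketched as "spherical" K-means: every vector is scaled to unit length, points are assigned to the centroid with the highest cosine similarity, and each centroid is the re-normalized sum of its members. The code below is a minimal pure-Python illustration on toy 2-D data; the names `unit` and `kmeans_cosine`, the random initialization, and the fixed iteration count are assumptions, and it is far too slow for the full TF-IDF matrix.

```python
import math
import random


def unit(v):
    """Scale a vector to unit L2 norm (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def kmeans_cosine(vectors, k, n_iter=20, seed=0):
    """Cluster vectors by cosine similarity; returns one label per vector."""
    data = [unit(v) for v in vectors]
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # k distinct data points as initial centroids
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment: for unit vectors the dot product is the cosine similarity
        labels = [max(range(k),
                      key=lambda c: sum(a * b for a, b in zip(v, centroids[c])))
                  for v in data]
        # Update: direction of the sum of the members, re-normalized
        for c in range(k):
            members = [v for v, label in zip(data, labels) if label == c]
            if members:
                centroids[c] = unit([sum(col) for col in zip(*members)])
    return labels


points = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
          [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]]
labels = kmeans_cosine(points, k=2)
print(labels)
```

For the exercise itself the call would use `k=5` on the TF-IDF rows. An alternative worth noting: on unit-norm rows, squared Euclidean distance equals $2 - 2\cos$, so a standard Euclidean K-means applied to L2-normalized TF-IDF vectors ranks neighbours in the same way.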