# Example fastText Modeling

Date: 2021-01-01  
Author: Jason Beach  
Categories: Introduction_Tutorial, Data_Science 
Tags: NLP, development 

<!--eofm-->

This notebook trains a fastText classifier on the Consumer Complaint Database, downloaded from [data.gov](https://catalog.data.gov/dataset/consumer-complaint-database).

## Configure

In [1]:
! ls resources/

Bundles.jsonl		dataCombine.csv			 dataProducts.csv
complaints.csv		dataFinance_10-K_A.html		 OUTPUT_Finance.jsonl
complaints_results.csv	dataFinance_dataframe.jsonl	 tbl_Vulnerability.png
complaints_test.txt	dataFinanceIncidents_calcs.xlsx  tbl_Vul_Prod-Cat.png
complaints_train.txt	dataFinance-label_config.json
dataCategory.csv	dataIncidents.csv


In [2]:
import csv

import pandas as pd
import fasttext as ft

In [None]:
consumercompliants = pd.read_csv('resources/complaints.csv')

## Prepare Data

In [4]:
col = ['Product', 'Consumer complaint narrative']
consumercompliants = consumercompliants[col]
consumercompliants = consumercompliants[pd.notnull(consumercompliants['Consumer complaint narrative'])]
consumercompliants.columns = ['Product', 'Consumer_complaint_narrative']
consumercompliants.head()

| | Product | Consumer_complaint_narrative |
|---|---|---|
| 1 | Vehicle loan or lease | I contacted Ally on Friday XX/XX/XXXX after fa... |
| 3 | Student loan | I was contacted about student loan consolidati... |
| 5 | Credit reporting, credit repair services, or o... | Hello This complaint is against the three cred... |
| 6 | Credit reporting, credit repair services, or o... | I am a victim of Identity Theft & currently ha... |
| 8 | Credit reporting, credit repair services, or o... | Two accounts are still on my credit history af... |


In [5]:
consumercompliants.shape

(841229, 2)

In [6]:
consumercompliants['Product'].value_counts()

Credit reporting, credit repair services, or other personal consumer reports    332914
Debt collection                                                                 163006
Mortgage                                                                         86571
Credit card or prepaid card                                                      63645
Checking or savings account                                                      39415
Credit reporting                                                                 31588
Student loan                                                                     29509
Money transfer, virtual currency, or money service                               19729
Credit card                                                                      18838
Vehicle loan or lease                                                            15809
Bank account or service                                                          14885
Payday loan, title loan, or personal loan  

In [7]:
payday = consumercompliants[consumercompliants['Product']=='Payday loan']
other = consumercompliants[consumercompliants['Product']!='Payday loan']
nonpayday = other.sample(2000)
nonpayday['Product'] = 'Other'

Run only one of the following two cells, depending on which kind of model you want to demonstrate: the first trains on 75% of the sampled records, the second on 50%.

In [8]:
# use 75% of the records for training to build a better model
payTrain = payday.sample(frac=0.75)
payTest = payday[~payday.index.isin(payTrain.index)]

nonTrain = nonpayday.sample(frac=0.75)
nonTest = nonpayday[~nonpayday.index.isin(nonTrain.index)]

Train = pd.concat([payTrain,nonTrain], ignore_index=True)
Test = pd.concat([payTest,nonTest], ignore_index=True)

In [44]:
# use only 50% of the records for training to demonstrate a poorer model
payTrain = payday.sample(frac=0.5)
payTest = payday[~payday.index.isin(payTrain.index)]

nonTrain = nonpayday.sample(frac=0.5)
nonTest = nonpayday[~nonpayday.index.isin(nonTrain.index)]

Train = pd.concat([payTrain,nonTrain], ignore_index=True)
Test = pd.concat([payTest,nonTest], ignore_index=True)
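
The two cells above differ only in the training fraction, so the split can be parameterized. The following is a minimal sketch, not the notebook's own code: `split_by_class` and its `seed` argument are hypothetical names introduced for illustration, and the toy frame stands in for the real complaint data. It assumes the frame has a unique index and pandas 1.1 or later (for `groupby(...).sample`).

```python
import pandas as pd


def split_by_class(df: pd.DataFrame, frac: float, seed: int = 0):
    """Sample `frac` of each Product class for training; the rest becomes the test set."""
    # Hypothetical helper; `seed` is an assumption added so the split is reproducible.
    train = df.groupby("Product").sample(frac=frac, random_state=seed)
    # Rows not drawn for training form the test set (requires a unique index).
    test = df.drop(train.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy data standing in for the payday / non-payday frames.
toy = pd.DataFrame({
    "Product": ["Payday loan"] * 4 + ["Other"] * 8,
    "Consumer_complaint_narrative": [f"complaint {i}" for i in range(12)],
})
train, test = split_by_class(toy, frac=0.75)
print(len(train), len(test))  # 9 3
```

Fixing the seed makes the "good" and "poor" runs comparable across sessions, which the unseeded `sample` calls above do not guarantee.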

In [45]:
def to_labels(product):
    # turn a product name into one or more fastText labels, e.g. 'Payday loan' -> '__label__Payday_loan'
    return '__label__'+product.replace(' or ', '$').replace(', or ','$').replace(',','$').replace(' ','_').replace(',','__label__').replace('$$','$').replace('$',' __label__').replace('___','__')

Train['label'] = [to_labels(s) for s in Train['Product']]
Test['label'] = [to_labels(s) for s in Test['Product']]

Train['narrative']= Train['Consumer_complaint_narrative'].replace('\n',' ', regex=True).replace('\t',' ', regex=True)
Test['narrative']= Test['Consumer_complaint_narrative'].replace('\n',' ', regex=True).replace('\t',' ', regex=True)
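
fastText's supervised mode reads one example per line, with each label prefixed by `__label__` and followed by the text. The sketch below shows that line format for the single-label case used here (`Payday loan` versus `Other`). The helper `to_fasttext_line` is a hypothetical name for illustration, and unlike the cell above it does not split multi-part product names into several labels.

```python
def to_fasttext_line(product: str, narrative: str) -> str:
    """Build one training line: '__label__<Product> <text on a single line>'."""
    # Hypothetical helper for illustration; not part of the notebook.
    label = "__label__" + product.strip().replace(" ", "_")
    # One example per line, so collapse newlines, tabs, and repeated spaces.
    text = " ".join(narrative.split())
    return f"{label} {text}"


print(to_fasttext_line("Payday loan", "I took out a loan\nand was charged\ttwice."))
# __label__Payday_loan I took out a loan and was charged twice.
```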

In [46]:
Train[['label','narrative']].to_csv(r'./resources/complaints_train.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
Test[['label','narrative']].to_csv(r'./resources/complaints_test.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")

In [47]:
Test[['label','narrative']].to_csv(r'./resources/complaints_complete_test.txt', index=False)

## Model

In [48]:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

In [49]:
model = ft.train_supervised('./resources/complaints_train.txt')

Read 0M words
Number of words:  16747
Number of labels: 2
Progress: 100.0% words/sec/thread: 2579481 lr:  0.000000 avg.loss:  0.598023 ETA:   0h 0m 0s


In [50]:
print_results(*model.test('./resources/complaints_test.txt'))

N	1873
P@1	0.766
R@1	0.766
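
`model.test` returns the number of examples, the precision at 1, and the recall at 1. Precision divides the correct top predictions by the number of predictions made; recall divides them by the number of true labels. When every example carries exactly one label, as in this two-class setup, the two denominators are equal, which is why both values read 0.766. A minimal sketch with a hypothetical helper and made-up labels:

```python
def precision_recall_at_1(true_labels, top_predictions):
    """Compute P@1 and R@1 from each example's set of true labels and its top prediction."""
    hits = sum(pred in truth for truth, pred in zip(true_labels, top_predictions))
    n_predictions = len(top_predictions)               # one prediction per example
    n_true = sum(len(truth) for truth in true_labels)  # total number of true labels
    return hits / n_predictions, hits / n_true


# Made-up labels for illustration only.
truth = [{"__label__Payday_loan"}, {"__label__Other"},
         {"__label__Other"}, {"__label__Payday_loan"}]
preds = ["__label__Payday_loan", "__label__Payday_loan",
         "__label__Other", "__label__Other"]
p, r = precision_recall_at_1(truth, preds)
print(f"P@1\t{p:.3f}")  # P@1	0.500
print(f"R@1\t{r:.3f}")  # R@1	0.500
```

The two metrics diverge only when examples have more than one true label, for instance `precision_recall_at_1([{"a", "b"}], ["a"])` gives a precision of 1.0 and a recall of 0.5.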


In [60]:
#use if you don't want to re-process
Test = pd.read_csv('./resources/complaints_complete_test.txt')
Test = Test.sample(300)

In [61]:
Test['predict'] = Test['narrative'].apply(model.predict)

## Prepare Predictions for Dashboard

In [62]:
#df_raw.columns.tolist()   #['actualy', 'yhat', 'id', 'text']    <<< the columns you need
Test['index'] = Test.index
# 1 for payday-loan complaints, 0 otherwise; a single vectorized assignment avoids chained indexing
Test['actualy'] = (Test['label'] == '__label__Payday_loan').astype(int)


In [63]:
Test['yhat'] = [item[1][0] if item[0][0] == '__label__Payday_loan' else 1 - item[1][0] for item in Test['predict'].tolist() ]
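
The list comprehension above turns each prediction into the probability of the positive class: `model.predict` returns a pair of (labels, probabilities) for the top label, so when the top label is `__label__Other` the payday probability is taken as the complement. This only holds for a two-class model. A sketch of the same logic as a function, where `positive_probability` is a hypothetical name and the inputs are hand-written tuples that mimic the shape of the prediction output:

```python
def positive_probability(prediction, positive_label="__label__Payday_loan"):
    """Convert a (labels, probabilities) pair for the top label into P(positive class)."""
    labels, probs = prediction
    top_label, top_prob = labels[0], float(probs[0])
    # Binary case only: the other class gets the remaining probability mass.
    return top_prob if top_label == positive_label else 1.0 - top_prob


# Hand-written tuples mimicking the shape of the prediction output.
print(positive_probability((("__label__Payday_loan",), [0.8])))  # 0.8
print(positive_probability((("__label__Other",), [0.75])))       # 0.25
```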

In [64]:
#example
Test[Test['index'] == 97]

Unnamed: 0,label,narrative,predict,index,actualy,yhat


In [65]:
# select and rename in one step so the result is a new DataFrame rather than a modified slice
df_raw = Test[['actualy', 'yhat', 'index', 'narrative']].rename(columns={'index':'id', 'narrative':'text'})


In [66]:
df_raw.head()

| | actualy | yhat | id | text |
|---|---|---|---|---|
| 1799 | 0 | 0.445166 | 1799 | I called USAA today XX/XX/XXXX because someone... |
| 320 | 1 | 0.452314 | 320 | I have been getting payday loans from this com... |
| 593 | 1 | 0.4475 | 593 | My identity was stolen and a loan was open usi... |
| 203 | 1 | 0.448908 | 203 | I inquire about this loan, the explanation was... |
| 1705 | 0 | 0.306474 | 1705 | On XXXX XX/XX/XXXX, I initiated an online mone... |


In [67]:
df_raw.to_csv('./resources/complaints_results_pct25.csv', index=False)