## 1. Load the dataset



To run this notebook, you will need to install pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (a dependency of transformers), torchvision, and scipy.

In [11]:
# install openai (do this only once at the start. Source: https://community.openai.com/t/no-module-named-openai/8303/3)
# !python --version
# !pip --version
# !pip install openai

Import the required libraries.

In [12]:
import pandas as pd
import time
import openai
import json
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Create the embeddings. Make sure you replace the placeholder with your own API key before running this cell. The run is currently limited to roughly the first 200 rows to keep token costs down.

In [19]:
# Setup openAI API key
openai.api_key = '<api_key>'

# OpenAI model for embedding complaints: text-embedding-ada-002, link: https://platform.openai.com/docs/models/embeddings
embedding_model = 'text-embedding-ada-002'

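# Load the complaints from the CSV file
raw_data = pd.read_csv('StaterData.csv')

# Limit the sample size to keep API cost and run time down.
# .loc slicing is label-based and inclusive, so this keeps rows 0 through 200.
# .copy() avoids a SettingWithCopyWarning when the Embedding column is added.
data = raw_data.loc[:200].copy()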

# Create and return the embedding for a single complaint
def get_embedding(text, model=embedding_model):
    # Sleep 1 second to stay under the rate limit (60 requests per minute).
    time.sleep(1)
    # Newlines are replaced by spaces before the text is sent to the API.
    text = text.replace("\n", " ")
    # Note: openai.Embedding.create is the pre-1.0 interface of the openai package.
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

data['Embedding'] = data['Clean consumer complaint'].apply(get_embedding)
data.to_csv('StaterDataEmbeddings.csv', index=False)
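
The fixed one-second pause keeps the loop under the request limit, but it does not recover when a request still fails. Below is a minimal sketch of a retry wrapper with exponential backoff, assuming failures surface as exceptions. `with_retries`, `max_attempts`, and `base_delay` are hypothetical names for illustration and are not part of the openai library.

```python
import time


def with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Wrap func so that failed calls are retried with exponential backoff.

    Hypothetical helper, not part of the openai library. It catches every
    Exception for brevity; in practice, narrow this to the rate-limit error
    raised by the client version in use.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Give up after the last attempt and re-raise the error.
                if attempt == max_attempts - 1:
                    raise
                # Wait 1x, 2x, 4x, ... the base delay before retrying.
                sleep(base_delay * 2 ** attempt)
    return wrapper
```

With this in place, `with_retries(get_embedding)` could be passed to `.apply(...)` instead of `get_embedding` itself, so that a single failed request does not abort the whole run.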

Print the first rows to check that the data was saved and loaded correctly.

In [20]:
data = pd.read_csv('StaterDataEmbeddings.csv')
data.head()

Unnamed: 0,Complaint ID,Issue,Consumer complaint narrative,Clean consumer complaint,Embedding
0,6697855,Trouble during payment process,"I closed on my house XX/XX/XXXX, XXXX payment...",close house payment due xxxxoriginal loan serv...,"[-0.022687040269374847, 0.012659582309424877, ..."
1,5832311,Applying for a mortgage or refinancing an exis...,XXXX and XXXX Mortgage payments were marked as...,mortgage payment mark late account covid forbe...,"[-0.02141152136027813, -0.0036303510423749685,..."
2,2787647,Trouble during payment process,"XXXX XXXX XXXX On XX/XX/XXXX, I called 21st mo...",call st mortgage home finance father name titl...,"[-0.03758683055639267, -0.006581039633601904, ..."
3,6589323,Trouble during payment process,Loancare through XXXX XXXX XXXX' is negligentl...,loancare negligently purposely delay release c...,"[-0.01969774439930916, -0.005841182079166174, ..."
4,6633594,Struggling to pay mortgage,"In XXXX of XXXX, I received a Loan Modificatio...",receive loan modification partial claim bank n...,"[-0.03471546992659569, 0.0037510935217142105, ..."
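
The `Embedding` column in this output looks like a list, but after the round trip through CSV each cell is a plain string. The sketch below illustrates this using only the standard library; the two-value vector is made up for illustration and stands in for a real embedding. pandas behaves the same way for a column of lists written with `to_csv` and read back with `read_csv`.

```python
import csv
import io
import json

# Made-up two-dimensional vector standing in for a real embedding.
embedding = [-0.0227, 0.0127]

# Writing a list to CSV stores its text representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Complaint ID", "Embedding"])
writer.writerow([6697855, embedding])

# Reading the file back yields strings, not lists.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["Embedding"]

# json.loads turns the text back into a list of floats.
parsed = json.loads(raw)
print(type(raw).__name__, type(parsed).__name__)
```

This is why the embeddings have to be parsed before they can be used as numeric features.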


Train a random forest classifier on the OpenAI embeddings and measure its accuracy on a held-out test set.

In [21]:
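# Prepare the feature matrix X and target vector y.
# The embeddings are stored as strings in the CSV, so parse them back into lists.
X = data['Embedding'].apply(json.loads).tolist()
y = data['Issue']

# Convert the list of embeddings into a 2D array (one row per complaint)
X = np.array(X)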

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# Create an instance of the random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=2)

# Train the random forest classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 0.5573770491803278


The model reaches an accuracy of 0.56, which is mediocre and lower than the score of the traditional TF-IDF approach. Only about 200 complaints were used for this model. We recommend repeating the experiment with a larger dataset; we could not do so here because we ran into errors caused by the token and API request limits.
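
One way to put an accuracy figure like this in context is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below is an illustration only: `majority_baseline_accuracy` is a hypothetical helper, and the labels are made up rather than taken from StaterData.csv.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


# Made-up labels for illustration only; not taken from StaterData.csv.
train = ["payment", "payment", "payment", "refinance", "struggling"]
test = ["payment", "refinance", "payment", "struggling"]
print(majority_baseline_accuracy(train, test))
```

In the notebook, the same helper could be called as `majority_baseline_accuracy(y_train, y_test)`. If the classifier does not clearly beat this number, the embeddings are adding little beyond the class distribution.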