Source: https://github.com/openai/openai-cookbook/blob/main/examples/Classification_using_embeddings.ipynb

## Classification using embeddings

There are many ways to classify text. This notebook shares an example of text classification using embeddings. For many text classification tasks, we've seen fine-tuned models do better than embeddings. See an example of fine-tuned models for classification in [Fine-tuned_classification.ipynb](Fine-tuned_classification.ipynb). We also recommend having more examples than embedding dimensions, which we don't quite achieve here.

In this text classification task, we predict the score of a food review (1 to 5) based on the embedding of the review's text. We split the dataset into a training and a testing set for all the following tasks, so we can realistically evaluate performance on unseen data. The dataset is created in the [Get_embeddings_from_dataset Notebook](Get_embeddings_from_dataset.ipynb).


In [1]:
# imports
import pandas as pd
import numpy as np
from ast import literal_eval

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# load data
datafile_path = "https://github.com/openai/openai-cookbook/blob/main/examples/data/fine_food_reviews_with_embeddings_1k.csv?raw=True"

df = pd.read_csv(datafile_path)
df.head()



Unnamed: 0.1,Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens,embedding
0,0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,52,"[0.03599238395690918, -0.02116263099014759, -0..."
1,297,B003VXHGPK,A21VWSCGW7UUAR,4,"Good, but not Wolfgang Puck good","Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...",178,"[-0.07042013108730316, -0.03175969794392586, -..."
2,296,B008JKTTUA,A34XBAIFT02B60,1,Should advertise coconut as an ingredient more...,"First, these should be called Mac - Coconut ba...",Title: Should advertise coconut as an ingredie...,78,"[0.05692615360021591, -0.005402443464845419, 0..."
3,295,B000LKTTTW,A14MQ40CCU8B13,5,Best tomato soup,I have a hard time finding packaged food of an...,Title: Best tomato soup; Content: I have a har...,111,"[-0.011223138310015202, -0.049720242619514465,..."
4,294,B001D09KAM,A34XBAIFT02B60,1,Should advertise coconut as an ingredient more...,"First, these should be called Mac - Coconut ba...",Title: Should advertise coconut as an ingredie...,78,"[0.05692615360021591, -0.005402443464845419, 0..."


In [2]:
df.dtypes

Unnamed: 0     int64
ProductId     object
UserId        object
Score          int64
Summary       object
Text          object
combined      object
n_tokens       int64
embedding     object
dtype: object

In [3]:
df["embedding"] = df["embedding"].apply(literal_eval).apply(np.array)  # convert string to array
df.loc[0, "embedding"]

array([ 0.03599238, -0.02116263, -0.02902304, ..., -0.01533097,
        0.003065  , -0.04977196])

In [4]:
len(df.loc[0, "embedding"])

1536
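
The introduction recommends having more examples than embedding dimensions. A rough check of how far this notebook falls short is sketched below. The figures are assumptions rather than values read from the data: 1,000 reviews (suggested by the `_1k` file name), the `test_size=0.2` split used in the next cell, and the embedding length of 1536 shown above.

```python
# Assumed figures, not read from the dataset:
# 1,000 reviews (from the "_1k" file name), test_size=0.2, 1536-dim embeddings.
n_reviews = 1000
test_size = 0.2
embedding_dim = 1536

# number of rows left for training after holding out the test set
n_train = round(n_reviews * (1 - test_size))
examples_per_dimension = n_train / embedding_dim

print(f"training examples: {n_train}")
print(f"embedding dimensions: {embedding_dim}")
print(f"examples per dimension: {examples_per_dimension:.2f}")
```

Under these assumptions there is roughly half a training example per embedding dimension, which is why the recommendation is not met here.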

In [5]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    list(df.embedding.values), df.Score, test_size=0.2, random_state=42
)

# train random forest classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
probas = clf.predict_proba(X_test)

report = classification_report(y_test, preds)
print(report)


              precision    recall  f1-score   support

           1       0.89      0.40      0.55        20
           2       1.00      0.38      0.55         8
           3       1.00      0.18      0.31        11
           4       1.00      0.26      0.41        27
           5       0.75      1.00      0.86       134

    accuracy                           0.77       200
   macro avg       0.93      0.44      0.53       200
weighted avg       0.82      0.77      0.72       200



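The model reaches 77% accuracy on the test set, but its performance is uneven across classes. 5-star reviews are predicted best (recall 1.00, f1-score 0.86), which is not too surprising, since they are the most common in the dataset. For scores 1 to 4, precision is high but recall ranges from only 0.18 to 0.40, so most of those reviews are misclassified.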
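
Because 5-star reviews dominate the test set, a useful reference point is the accuracy of a trivial classifier that always predicts the most common score. The sketch below computes that baseline from the `support` column of the report above; the variable names are illustrative and not part of the original notebook.

```python
# Support counts copied from the classification report (test set of 200 reviews).
support = {1: 20, 2: 8, 3: 11, 4: 27, 5: 134}
model_accuracy = 0.77  # "accuracy" row of the report

n_test = sum(support.values())
majority_class = max(support, key=support.get)

# accuracy obtained by always predicting the most frequent score
baseline_accuracy = support[majority_class] / n_test

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"model accuracy:    {model_accuracy:.2f}")
```

The baseline works out to 0.67, so the random forest's 0.77 is an improvement of about ten percentage points over always guessing 5 stars.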

In [None]:
! pip install openai

In [16]:
import requests

# download the helper module from the cookbook repo so it can be imported locally
request = requests.get("https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/utils/embeddings_utils.py")
with open("embeddings_utils.py", "wb") as f:
    f.write(request.content)

from embeddings_utils import plot_multiclass_precision_recall

# plot one precision-recall curve per review score
plot_multiclass_precision_recall(probas, y_test, [1, 2, 3, 4, 5], clf)


<img src="https://github.com/AashitaK/datasets/blob/main/images/Classification%20using%20embeddings.png?raw=True" width="750" />


Unsurprisingly, 5-star and 1-star reviews seem to be easier to predict. Perhaps with more data, the nuances between 2-star, 3-star, and 4-star reviews could be predicted better, but there is probably also more subjectivity in how people use the in-between scores.
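
One way to see where the in-between scores end up is to count how often each true score is predicted as each other score. The sketch below does this with the standard library on toy labels that are made up for illustration; in the notebook, `y_test` and `preds` would take the place of `y_true` and `y_pred`, and `sklearn.metrics.confusion_matrix` provides the same counts.

```python
from collections import Counter

# Toy labels, invented for illustration only (not results from the notebook).
y_true = [1, 2, 3, 4, 5, 5, 4, 2, 3, 5]
y_pred = [1, 5, 5, 5, 5, 5, 4, 1, 5, 5]
labels = [1, 2, 3, 4, 5]

# count each (true score, predicted score) pair
counts = Counter(zip(y_true, y_pred))

# rows are true scores, columns are predicted scores
matrix = [[counts[(t, p)] for p in labels] for t in labels]

print("rows = true score, columns = predicted score")
for t, row in zip(labels, matrix):
    print(t, row)
```

Large off-diagonal counts in the last column would indicate that mid-range reviews are being absorbed into the 5-star class.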