## Step 1 - Add your OpenAI API key
The key must come from a registered OpenAI account.
Keys are managed in the API keys sub-menu: https://platform.openai.com/account/api-keys.

In [16]:
%env OPENAI_API_KEY=YOUR_OPENAI_API_KEY

env: OPENAI_API_KEY=YOUR_OPENAI_API_KEY
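
The later cells depend on the `OPENAI_API_KEY` environment variable being set to a real key. A minimal sketch of an early sanity check follows; `require_api_key` is a hypothetical helper (not part of the `openai` package), and treating any value that ends in `OPENAI_API_KEY` as the unedited placeholder is an assumption made for illustration.

```python
import os


def require_api_key(env=None):
    """Return the OpenAI API key from the environment or raise a clear error.

    Hypothetical helper for this notebook, not part of the openai package.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    # Assumption: a value ending in "OPENAI_API_KEY" is the unedited placeholder.
    if not key or key.endswith("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set to a real key")
    return key
```

Calling `require_api_key()` right after the `%env` cell would surface a missing key before any API request is attempted.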


## Step 2 - Installing necessary libraries
The libraries below are needed for:
- Working with CSV files;
- Visualising file characteristics;
- Creating embeddings;
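
Before or after running the install cell, it can help to check which of the packages are importable. The sketch below uses only the standard library; `PACKAGES` and `missing_packages` are hypothetical names introduced here. Note that `scikit-learn` is the one package in this notebook whose import name (`sklearn`) differs from its pip name.

```python
import importlib.util

# Maps the pip package names used in this notebook to their import names.
# Only scikit-learn differs: it is imported as `sklearn`.
PACKAGES = {
    "pandas": "pandas",
    "tiktoken": "tiktoken",
    "openai": "openai",
    "plotly": "plotly",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "tenacity": "tenacity",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, import_name in packages.items()
        if importlib.util.find_spec(import_name) is None
    ]


print("Missing packages:", missing_packages() or "none")
```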

In [4]:
!pip install pandas
!pip install tiktoken
!pip install openai
!pip install plotly
!pip install scipy
!pip install scikit-learn
!pip install tenacity


## Step 3 - Import libraries
The imported libraries are used throughout the notebook.

In [17]:
import pandas as pd
import tiktoken
import openai
from tenacity import retry, wait_random_exponential, stop_after_attempt

## Step 4 - Configuration
The parameters below are used throughout the notebook.

In [8]:
# Model used for creating embeddings.
embedding_model = "text-embedding-ada-002"

# Token encoding (tokenizer) used by text-embedding-ada-002.
embedding_encoding = "cl100k_base"

# Maximum number of tokens a text chunk passed to the embedding model may have.
# The actual maximum for text-embedding-ada-002 is 8191, so 8000 stays below it.
max_tokens = 8000

## Step 5.1 - Create a pre-filtered dataset
The original dataset is taken from Kaggle and contains about 300MB of Amazon food reviews. This step preprocesses the original dataset with the following modifications:
- Keep only a subset of columns - "Time", "ProductId", "UserId", "Score", "Summary", "Text";
- Drop rows with missing values;
- Create a new column called "combined" that joins the original "Summary" and "Text" columns for the purpose of embeddings generation;

### Prerequisites
- The original dataset must be downloaded and unzipped, then placed in the `docs` folder and named `Reviews.csv`. The `docs` folder must be at the same level as this notebook.
```
- project/
    - data_preparation.ipynb
    - docs/
        - Reviews.csv
```

### IMPORTANT!
`docs/Reviews.csv` is ignored by Git. If you somehow commit this file, GitHub will not allow you to push it because it exceeds the maximum file size limit!
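
A quick way to see whether a file is too large to push is to compare its size with the host's limit. The sketch below is illustrative only: `exceeds_github_limit` is a hypothetical helper, and the 100 MiB threshold is stated here as an assumption about GitHub's per-file limit rather than something taken from this notebook.

```python
import os

# Assumption: GitHub rejects pushes containing files above 100 MiB.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def exceeds_github_limit(path, limit=GITHUB_FILE_LIMIT_BYTES):
    """Return True if the file at `path` is larger than `limit` bytes."""
    return os.path.getsize(path) > limit
```

For example, `exceeds_github_limit("docs/Reviews.csv")` is expected to return `True` for a file of about 300MB.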

### Resources:
- [Amazon Fine Food Reviews dataset](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews);

In [10]:
df = pd.read_csv("docs/Reviews.csv")
df = df[["Time", "ProductId", "UserId", "Score", "Summary", "Text"]]
df = df.dropna()

df["combined"] = (
    "Title: " + df.Summary.str.strip() + "; Content: " + df.Text.str.strip()
)

df.head(5)

Unnamed: 0,Time,ProductId,UserId,Score,Summary,Text,combined
0,1303862400,B001E4KFG0,A3SGXH7AUHU8GW,5,Good Quality Dog Food,I have bought several of the Vitality canned d...,Title: Good Quality Dog Food; Content: I have ...
1,1346976000,B00813GRG4,A1D87F6ZCVE5NK,1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,Title: Not as Advertised; Content: Product arr...
2,1219017600,B000LQOCH0,ABXLMWJIXXAIN,4,"""Delight"" says it all",This is a confection that has been around a fe...,"Title: ""Delight"" says it all; Content: This is..."
3,1307923200,B000UA0QIQ,A395BORC6FGVXV,2,Cough Medicine,If you are looking for the secret ingredient i...,Title: Cough Medicine; Content: If you are loo...
4,1350777600,B006K2ZZ7K,A1UQRSCLF8GW1T,5,Great taffy,Great taffy at a great price. There was a wid...,Title: Great taffy; Content: Great taffy at a ...
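
The format of the "combined" column can be reproduced for a single review in plain Python. This is a sketch of the same string concatenation the cell above applies column-wise; `combine` is a hypothetical helper name. In pandas, concatenating a string with a missing value yields a missing value, which is presumably why `dropna()` runs before the concatenation.

```python
def combine(summary, text):
    """Build the string stored in the "combined" column for one review."""
    return "Title: " + summary.strip() + "; Content: " + text.strip()


print(combine("  Great taffy ", "Great taffy at a great price.  "))
# prints: Title: Great taffy; Content: Great taffy at a great price.
```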


## Step 5.2 - Filter out invalid records
Due to the embedding model's limitations, the data must be "cleaned" before creating embeddings.
After this step the dataset is changed as follows:
- Sorted by the time the review was left (the "Time" column is dropped afterwards);
- A new "n_tokens" column is created that shows the number of tokens in the "combined" column (see the previous step);
- The dataset is filtered by the number of tokens to remove records that reach or exceed the limit;
- The last 50 records (the most recent reviews) are kept;

The number of records in the filtered dataset and the token limit can be changed as required.

### IMPORTANT!
The number of tokens in a text is counted with the "tiktoken" module, which is the one recommended by OpenAI.

In [11]:
df = df.sort_values("Time")
df.drop("Time", axis=1, inplace=True)

encoding = tiktoken.get_encoding(embedding_encoding)

df["n_tokens"] = df.combined.apply(lambda x: len(encoding.encode(x)))
df = df[df.n_tokens < max_tokens].tail(50)

print(f"Dataset length: {len(df)}")

df.head(5)

Dataset length: 50


Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens
346130,B004TJF3BE,A2TZKSY1ZWPOU9,5,Great Hot Cider!!!,It is hard to find much of anything sugarfree ...,Title: Great Hot Cider!!!; Content: It is hard...,46
135890,B001ACMCLM,A2PCNXBSKCABG5,4,GOOD GLUTEN FREE BREAD STCK MIX,Makes very good break sticks.. Also can be use...,Title: GOOD GLUTEN FREE BREAD STCK MIX; Conten...,52
182237,B004LM9KHW,A1AOOCCQ27K9IT,3,French Vanilla Wolfgang Puck,Product is easy to use.... Just cut or tear pa...,Title: French Vanilla Wolfgang Puck; Content: ...,123
354602,B000LKU3A6,A2YRK0YLBN5CC2,3,"Good flavor, but a wet mess","I got the teriyaki flavor and, while the flavo...","Title: Good flavor, but a wet mess; Content: I...",267
320387,B008JA73RG,AFJFXN42RZ3G2,4,Neither too sweet nor fizzy,V8 V-Fusion may appear to be the typical energ...,Title: Neither too sweet nor fizzy; Content: V...,230
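
The sort, count, filter, and tail sequence can be illustrated on a toy list of records without pandas or tiktoken. In this sketch `count_tokens` is a stand-in that counts whitespace-separated words (an assumption for illustration; the notebook counts tokens with tiktoken's `cl100k_base` encoding, which gives different numbers), and `filter_records` is a hypothetical helper name.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words.

    Assumption for illustration only; the notebook uses tiktoken instead.
    """
    return len(text.split())


def filter_records(records, max_tokens, keep_last):
    """Sort by time, drop over-long texts, and keep the most recent records."""
    ordered = sorted(records, key=lambda r: r["Time"])
    with_counts = [dict(r, n_tokens=count_tokens(r["combined"])) for r in ordered]
    short_enough = [r for r in with_counts if r["n_tokens"] < max_tokens]
    return short_enough[-keep_last:]


reviews = [
    {"Time": 3, "combined": "Title: C; Content: short"},
    {"Time": 1, "combined": "Title: A; Content: short"},
    {"Time": 2, "combined": "Title: B; Content: " + "word " * 20},
    {"Time": 4, "combined": "Title: D; Content: short"},
]
kept = filter_records(reviews, max_tokens=10, keep_last=2)
print([r["Time"] for r in kept])  # prints [3, 4]
```

The record with `Time` 2 is dropped because its stand-in token count (23) is not below the limit of 10; of the remaining three, the two most recent are kept.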


## Step 6 - Generate embeddings
Text embeddings are generated for the "combined" column and stored in the "embedding" column.
The OpenAI API is used to generate the embeddings.

Once the embeddings are generated, the resulting dataset is saved to a CSV file in the `docs` directory. The current implementation uses `docs/50_reviews_with_embeddings.csv`.

### Resources
- [The simplest example of embeddings creation](https://github.com/openai/openai-cookbook/blob/main/examples/Get_embeddings.ipynb);
- [Original document about dataset preparation and embeddings generation](https://github.com/openai/openai-cookbook/blob/28ab8b5c44851fe99cb90a962d41095cf9525940/examples/Obtain_dataset.ipynb);
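
The cell that follows calls `openai.Embedding.create`, which is the interface of `openai` releases before 1.0; from 1.0 onwards that call is no longer available and requests go through a client object. A minimal sketch of an equivalent function for the newer interface is shown below. `get_embedding_v1` is a hypothetical name, and the client is passed in as a parameter so the function can be exercised without network access.

```python
def get_embedding_v1(client, text, model="text-embedding-ada-002"):
    """Return the embedding vector for `text` using a 1.x-style client.

    `client` is expected to behave like `openai.OpenAI()`; the function name
    is hypothetical and not part of the openai package.
    """
    # Same preprocessing as the notebook: newlines become spaces.
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding


# Usage (requires openai>=1.0, network access, and OPENAI_API_KEY to be set):
#   from openai import OpenAI
#   vector = get_embedding_v1(OpenAI(), "Title: Great taffy; Content: Great taffy at a great price.")
```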

In [14]:
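# Retry failed API calls up to 6 times, waiting a random, exponentially growing
# interval (capped at 20 seconds) between attempts.
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text, model):
    # Replace newlines with spaces before sending the text to the API.
    text = text.replace("\n", " ")
    # Legacy (pre-1.0) interface of the openai package.
    response = openai.Embedding.create(input=[text], model=model)
    usage = response["usage"]
    print(f"""
    Input: {text}
    OpenAI API tokens usage:
    - Prompt tokens: {usage["prompt_tokens"]}
    - Total tokens: {usage["total_tokens"]}
    """)
    return response["data"][0]["embedding"]

# One API request is made per row of the dataset.
df["embedding"] = df.combined.apply(lambda x: get_embedding(x, model=embedding_model))
df.to_csv("./docs/50_reviews_with_embeddings.csv")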


    Input: Title: Great Hot Cider!!!; Content: It is hard to find much of anything sugarfree that really tastes good but this apple cider is the best.  I also love the regular hot apple cider in sugarfree form.
    OpenAI API tokens usage:
    - Prompt tokens: 46
    - Total tokens: 46
    

    Input: Title: GOOD GLUTEN FREE BREAD STCK MIX; Content: Makes very good break sticks.. Also can be used for a pizza crust.<br /><br />My wife is a celiac so we both enjoy this either as bread sticks or pizza crust.
    OpenAI API tokens usage:
    - Prompt tokens: 52
    - Total tokens: 52
    

    Input: Title: French Vanilla Wolfgang Puck; Content: Product is easy to use.... Just cut or tear pack open and place into the coffee pot.  But I definitely did not think that the flavor was particularly French Vanilla, even with French Vanilla creamer.  It felt "heavy" on my tongue and had a "dark" coffee flavor which I believe overcame the French Vanilla flavor.  I was very disappointed that I had
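
The `@retry` decorator in the cell above retries failed requests with randomised, exponentially growing waits. The idea can be sketched with the standard library alone. This is a simplified, hand-rolled approximation and does not reproduce tenacity's exact wait formula; `retry_with_backoff` and its parameters are hypothetical names chosen to mirror `min=1`, `max=20`, and six attempts.

```python
import random
import time


def retry_with_backoff(func, attempts=6, min_wait=1.0, max_wait=20.0, sleep=time.sleep):
    """Call `func()` until it succeeds or `attempts` calls have failed.

    After the n-th failure the wait is drawn at random between `min_wait` and
    an upper bound that doubles each time and is capped at `max_wait`.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            upper = min(max_wait, max(min_wait, 2.0 ** attempt))
            sleep(random.uniform(min_wait, upper))
```

The `sleep` parameter exists so the waiting can be replaced with a stub when trying the function out, for example `retry_with_backoff(some_call, sleep=lambda seconds: None)`.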

## Step 7 - Final dataset test

In [15]:
df = pd.read_csv("./docs/50_reviews_with_embeddings.csv")

df

Unnamed: 0.1,Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens,embedding
0,346130,B004TJF3BE,A2TZKSY1ZWPOU9,5,Great Hot Cider!!!,It is hard to find much of anything sugarfree ...,Title: Great Hot Cider!!!; Content: It is hard...,46,"[0.0007746443734504282, -0.0042549604550004005..."
1,135890,B001ACMCLM,A2PCNXBSKCABG5,4,GOOD GLUTEN FREE BREAD STCK MIX,Makes very good break sticks.. Also can be use...,Title: GOOD GLUTEN FREE BREAD STCK MIX; Conten...,52,"[-0.0011603275779634714, -0.009558608755469322..."
2,182237,B004LM9KHW,A1AOOCCQ27K9IT,3,French Vanilla Wolfgang Puck,Product is easy to use.... Just cut or tear pa...,Title: French Vanilla Wolfgang Puck; Content: ...,123,"[0.004102123901247978, -0.014700944535434246, ..."
3,354602,B000LKU3A6,A2YRK0YLBN5CC2,3,"Good flavor, but a wet mess","I got the teriyaki flavor and, while the flavo...","Title: Good flavor, but a wet mess; Content: I...",267,"[-0.01738598570227623, 0.001981155714020133, -..."
4,320387,B008JA73RG,AFJFXN42RZ3G2,4,Neither too sweet nor fizzy,V8 V-Fusion may appear to be the typical energ...,Title: Neither too sweet nor fizzy; Content: V...,230,"[-0.0018182089552283287, -0.031320970505476, 0..."
5,486552,B000MUT928,AMV75AVRSNM0L,3,Crunchy strong and ok taste,"I thought the pocket coffee was good, not sure...",Title: Crunchy strong and ok taste; Content: I...,71,"[0.004584897309541702, -0.027496036142110825, ..."
6,355351,B0007PNKRS,A1TED4G0PWZPQV,5,Came as expected,It was tasty and fresh. The other one I bought...,Title: Came as expected; Content: It was tasty...,32,"[0.016189226880669594, -0.02362893521785736, 0..."
7,402155,B0006349WQ,A21BT40VZCCYT4,5,Good Training Treat,My dog will come in from outside when I am tra...,Title: Good Training Treat; Content: My dog wi...,48,"[-0.02448691613972187, -0.017595387995243073, ..."
8,131483,B001ANXL84,A3NZ74QTATJ45W,5,Best electrolyte replacement drink,I'm a disabled Vet with 80% of my kidneys gone...,Title: Best electrolyte replacement drink; Con...,228,"[-0.0006335642538033426, 0.013451849110424519,..."
9,519037,B008TZJUOA,A1LHOKYENR7HP2,5,Cute!,Bought these to decorate cupcakes for a kid's ...,Title: Cute!; Content: Bought these to decorat...,32,"[-0.01844242587685585, 0.0013125381665304303, ..."
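
After a round trip through CSV, each embedding comes back as text (note the quoted `"[...]"` values above) rather than as a list of numbers. A minimal standard-library sketch of converting such a value back is shown below; the tiny in-memory CSV and its vector values are made up for illustration.

```python
import ast
import csv
import io

# A tiny in-memory CSV that mimics the layout of the saved file.
# The vector values are made up for illustration.
sample = io.StringIO(
    'ProductId,n_tokens,embedding\n'
    'B004TJF3BE,46,"[0.25, -0.5, 0.125]"\n'
)

rows = list(csv.DictReader(sample))
raw = rows[0]["embedding"]
print(type(raw).__name__)  # prints str

# ast.literal_eval safely turns the text "[...]" back into a list of floats.
vector = ast.literal_eval(raw)
print(vector)  # prints [0.25, -0.5, 0.125]
```

With pandas, the same conversion could be applied to the whole column, for example `df["embedding"] = df.embedding.apply(ast.literal_eval)`. The `Unnamed: 0` column holds the original row index written by `to_csv`; passing `index_col=0` to `pd.read_csv` (or `index=False` to `to_csv`) is one way to avoid the extra column.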
