MiguelElGallo/embeddindataengineering

Warning: Every time you run the Notebook, the Azure OpenAI API is called. Get familiar with those costs!

Using OpenAI Embeddings for Data Engineering in Fabric

A practical and simple example of how you can leverage the embedding capabilities of OpenAI models in Data Engineering.

Not ready for production

The blog post accompanying this repo is located here. As you can see, this is not production ready. It is just an example showing a pragmatic use of OpenAI embeddings in Data Engineering.

How to run the Notebook

I ran this Notebook in Fabric, since Fabric offers a 60-day trial. You can run it elsewhere; you mainly need to adjust the first cell, in particular the path passed to pd.read_csv().

# Reads the records from the last level of the hierarchy
import pandas as pd
# Load the CSV from the default lakehouse into a pandas DataFrame
df = pd.read_csv("/lakehouse/default/" + "Files/masterdata/product_hier/dairy_products.csv")
display(df)
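
Since the path is the main thing to adjust outside Fabric, one way to keep the first cell portable is to build the path from a configurable base directory. The following is a minimal sketch, not part of the notebook: the environment variable `DATA_ROOT` and the helper `resolve_csv_path` are hypothetical names chosen for illustration. When the variable is not set, the helper falls back to the Fabric lakehouse path used above.

```python
import os
from pathlib import PurePosixPath

# Default mount point of the attached lakehouse in Fabric, as used in the first cell.
FABRIC_ROOT = "/lakehouse/default/"
# Location of the CSV relative to that root, as used in the first cell.
RELATIVE_CSV = "Files/masterdata/product_hier/dairy_products.csv"


def resolve_csv_path(env=None):
    """Return the CSV path, rooted at DATA_ROOT if set, else at the Fabric lakehouse.

    DATA_ROOT is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    root = env.get("DATA_ROOT", FABRIC_ROOT)
    # Forward slashes are used on purpose; pandas accepts them on all platforms.
    return str(PurePosixPath(root) / RELATIVE_CSV)


print(resolve_csv_path({}))
```

With this helper, the first cell could call `pd.read_csv(resolve_csv_path())`, assuming the same folder structure exists under whatever base directory you choose.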

Steps

  1. Create a new Notebook in Fabric. More info here.

  2. Add any existing lakehouse to your Notebook

Screenshot of adding lakehouse to notebook

  3. Upload the file /resources/dairy_products to your lakehouse

More info here

If you do not want to alter the first cell of your Notebook, create the folder structure /masterdata/product_hier/ and upload the file into that folder.

Screenshot of the folder structure

  4. You need an Azure Subscription with an Azure OpenAI service deployed. In that service, deploy a model of type text-embedding-ada-002. Make a note of the name of that model deployment.

More info here

  5. Import the notebook EmbedforDE.ipynb into Fabric.

More info here

  6. Update these three lines in the notebook with the information from step 4:
openai.api_base = "https://yourservicename.openai.azure.com/"
openai.api_key = "yoursecretkey" #Never share this!!!
df['vector'] = df["text_to_embedd"].apply(lambda x : get_embedding(x, engine = 'dep-ada002'))

In the last line, only replace dep-ada002 with the name of your deployment from step 4.

  7. Press Run All in the notebook. The Notebook takes about 30 seconds to run.
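
Two points from the steps above lend themselves to a small sketch: the secret key should never be shared, and every run calls the Azure OpenAI API. The sketch below is one possible approach, not what the notebook does. It reads the endpoint, key, and deployment name from environment variables, and keeps an in-memory cache so repeated texts are embedded only once. The variable names (`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_KEY`, `AZURE_OPENAI_DEPLOYMENT`) and the helpers `load_settings` and `embed_with_cache` are assumptions made for illustration. `fake_embedding` is a stand-in for `get_embedding`, so the example runs without calling the API.

```python
import os


def load_settings(env=None):
    """Read endpoint, key, and deployment name from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "api_base": env["AZURE_OPENAI_ENDPOINT"],
        "api_key": env["AZURE_OPENAI_KEY"],
        # Falls back to the deployment name used in the notebook example.
        "deployment": env.get("AZURE_OPENAI_DEPLOYMENT", "dep-ada002"),
    }


def embed_with_cache(texts, embed_fn, cache):
    """Return one vector per text, calling embed_fn only for texts not yet cached."""
    vectors = []
    for text in texts:
        if text not in cache:
            # The only place where the (billed) API would be called.
            cache[text] = embed_fn(text)
        vectors.append(cache[text])
    return vectors


# Stand-in for get_embedding; records each call instead of contacting Azure OpenAI.
calls = []


def fake_embedding(text):
    calls.append(text)
    return [float(len(text))]


cache = {}
embed_with_cache(["milk", "cheese", "milk"], fake_embedding, cache)
embed_with_cache(["milk", "butter"], fake_embedding, cache)
print(len(calls))  # prints 3: "milk", "cheese", and "butter" are each embedded once
```

In the notebook, `embed_fn` could be something like `lambda x: get_embedding(x, engine=settings["deployment"])`, with `settings = load_settings()`. The cache here lives only for the duration of the session; persisting it between runs is left out of this sketch.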
