# Large Language Models and Data Engineering Pipelines

Large Language Models (LLMs) such as GPT-4 and Claude are transforming how data engineers approach common challenges within data pipelines. By leveraging the power of natural language processing and prompt-based interactions, these models can automate, accelerate, and improve the accuracy of many data-related tasks that previously required manual intervention or extensive scripting.

One of the most effective ways to use LLMs in data engineering is through prompting: providing the model with carefully crafted instructions and example inputs so it can perform the desired task. This prompt-based approach allows LLMs to handle a wide range of data engineering activities without needing complex custom code for each scenario.

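In this notebook, we will explore how to use LLMs for data cleaning, data transformation, data scraping, translation, tone transformation, format conversion, and proofreading.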



In [None]:
!pip install openai

In [None]:
import os
from openai import OpenAI

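# Read the key from the OPENAI_API_KEY environment variable if it is set;
# otherwise replace the placeholder with your own key.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY"),
)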

# Main helper that communicates with the LLM. Temperature controls how random the answer is:
# at higher temperatures, running the same prompt twice may produce different answers.
def get_completion(prompt, model="gpt-4o", temperature=0):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model,
        temperature=temperature,
    )
    # Return only the text of the first (and only) choice.
    return chat_completion.choices[0].message.content


## Data Cleaning

LLMs can be prompted to standardize values, fix typos, and normalize formats. These tasks would otherwise require lengthy scripts and manual checking. By providing a few examples of messy data, you can ask the model to output a cleaned, standardized version of the same data.
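
The "few examples" idea can be made reusable with a small prompt builder. The sketch below is only an illustration: `build_few_shot_prompt` is a hypothetical helper, not part of the OpenAI library, and the instruction text and example pairs are placeholders you would adapt to your own data.

```python
def build_few_shot_prompt(instruction, examples, value):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    parts = [instruction.strip(), ""]
    for i, (raw, cleaned) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"Input: {raw}", f"Output: {cleaned}", ""]
    # Leave the final "Output:" open so the model completes it.
    parts += [f"Input: {value}", "Output:"]
    return "\n".join(parts)


example_prompt = build_few_shot_prompt(
    "Standardize the city name. Only output the corrected city name.",
    [("NYC", "New York City"), ("San Fran", "San Francisco")],
    "H-Town",
)
print(example_prompt)
```

Keeping the examples in a list makes it easy to add or swap examples when the model mishandles a particular kind of input.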

In [None]:
cities = [
    "New York City",
    "NYC",
    "nyc",
    "San Francisco",
    "San Fran",
    "sanfrancisco",
    "Chicago",
    "CHI",
    "Los Angeles",
    "LA",
    "L.A.",
    "los angeles",
    "Boston",
    "BOS",
    "boston",
    "Houston",
    "H-Town",
    "houston",
    "Seattle",
    "SEA",
    "seattle",
    "Washington D.C.",
    "DC",
    "Washington DC",
    "Miami",
    "MIA",
    "miami",
    "Dallas",
    "DAL",
    "dallas"
]


prompt_template = """
Below is a city name. It may be abbreviated, inconsistently formatted, or misspelled.
Output the standardized city name.
Only output the corrected city name.

Example 1:
Input: NYC
Output: New York City

Example 2:
Input: San Fran
Output: San Francisco

Input: {}
Output:
"""

# Send one prompt per city and print the standardized name.
for c in cities:
    prompt = prompt_template.format(c)
    response = get_completion(prompt)
    print(response)
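
Real columns usually repeat the same messy value many times, so it may be worth calling the model only once per distinct value. The following is a minimal sketch under that assumption. `clean_values` and `fake_normalize` are hypothetical names. The stand-in function replaces the model call so the example runs offline.

```python
def clean_values(values, normalize):
    """Apply `normalize` once per distinct value and reuse the result for repeats."""
    cache = {}
    cleaned = []
    for value in values:
        key = value.strip()
        if key not in cache:
            cache[key] = normalize(key).strip()
        cleaned.append(cache[key])
    return cleaned


# Stand-in for the model call (an assumption for this sketch). In the notebook you
# could pass: lambda v: get_completion(prompt_template.format(v))
calls = []

def fake_normalize(value):
    calls.append(value)
    return {"nyc": "New York City", "la": "Los Angeles"}.get(value.lower(), value)


result = clean_values(["NYC", "nyc", "LA", "NYC"], fake_normalize)
print(result)
print(f"model calls: {len(calls)}")
```

The cache is keyed on the exact (stripped) string, so `"NYC"` and `"nyc"` still cost one call each. Only exact repeats are free.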


## Data Transformation

LLMs excel at understanding and reformatting semi-structured or unstructured data, such as extracting structured fields from free-form text or reformatting address blocks into CSV rows. Well-designed prompts can instruct the model to output exactly the data format you need.






In [None]:
addresses = [
    "Address is 1600 Amphitheatre Parkway Mountain View CA 94043",
    "1 Apple Park Way Cupertino California, 95014",
    "233 S Wacker Dr Chicago Ilinois Zip: 60606"
]

prompt_template = """Extract street address, city, state, and ZIP code from the following lines and output
as CSV rows in the order: street, city, state, zip. Use the two-letter state code for consistency. Only respond with comma-separated entries for each address.

text: {}
output: ""
"""

# Send one prompt per address and print the resulting CSV row.
for address in addresses:
    prompt = prompt_template.format(address)
    response = get_completion(prompt)
    print(response)
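
Model output is free text, so it is usually worth validating each row before loading it into a pipeline. The sketch below assumes the model returns exactly one CSV row in the order requested by the prompt. `parse_address_row` is a hypothetical helper, and the ZIP pattern covers only five-digit and ZIP+4 US codes.

```python
import csv
import io
import re

FIELDS = ["street", "city", "state", "zip"]


def parse_address_row(response):
    """Parse one CSV row into a dict, raising ValueError if it looks malformed."""
    rows = [row for row in csv.reader(io.StringIO(response.strip())) if row]
    if len(rows) != 1 or len(rows[0]) != len(FIELDS):
        raise ValueError(f"expected one row with {len(FIELDS)} fields: {response!r}")
    record = dict(zip(FIELDS, (field.strip() for field in rows[0])))
    if not re.fullmatch(r"[A-Z]{2}", record["state"]):
        raise ValueError(f"state is not a two-letter code: {record['state']!r}")
    if not re.fullmatch(r"\d{5}(-\d{4})?", record["zip"]):
        raise ValueError(f"unexpected ZIP format: {record['zip']!r}")
    return record


# Hand-written stand-in for a model response, not actual model output.
sample_response = "1600 Amphitheatre Parkway, Mountain View, CA, 94043"
print(parse_address_row(sample_response))
```

Rows that raise `ValueError` can be logged and retried or routed to manual review instead of silently entering downstream tables.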


## Data Scraping

LLMs can parse through messy text or code, identify relevant fields (like product names and prices), and produce clean, machine-readable outputs. Prompting the LLM with an example HTML snippet and clear extraction instructions enables quick and accurate information retrieval.



In [None]:
prompt = """
Given the HTML below, extract all product names and their prices. Output as a two-column CSV with headers: product, price.

<html>
  <div class="product">
    <span class="name">Wireless Mouse</span>
    <span class="price">$15.99</span>
  </div>
  <div class="product">
    <span class="name">USB-C Adapter</span>
    <span class="price">$8.49</span>
  </div>
  <div class="product">
    <span class="name">Bluetooth Keyboard</span>
    <span class="price">$22.00</span>
  </div>
</html>

Output:
"""
response = get_completion(prompt)
print(response)
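
When the markup is as regular as in this snippet, a deterministic parser can serve as a baseline for checking the model's output. The sketch below uses Python's standard `html.parser` module. `ProductParser` and `extract_products` are hypothetical names, and the class names `name` and `price` are assumed to match the HTML above.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # "name" or "price" while inside a matching <span>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "name":
            self.rows.append({"product": data.strip(), "price": None})
        elif self._field == "price" and self.rows:
            self.rows[-1]["price"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.rows


sample_html = """
<div class="product">
  <span class="name">Wireless Mouse</span>
  <span class="price">$15.99</span>
</div>
<div class="product">
  <span class="name">USB-C Adapter</span>
  <span class="price">$8.49</span>
</div>
"""
print(extract_products(sample_html))
```

Comparing these rows with the rows parsed from the model's CSV gives a quick sanity check. The LLM approach is most useful when the markup is too irregular for a parser like this one.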


## Translation

Large language models are trained on text in many languages, which gives them the ability to translate. The examples below show how to use this capability.

In [None]:
prompt = f"""
Translate the following English text to Spanish: \
```Hi, I would like to order a blender```
"""
response = get_completion(prompt)
print(response)

In [None]:
prompt = f"""
Tell me which language this is:
```Combien coûte le lampadaire?```
"""
response = get_completion(prompt)
print(response)

In [None]:
prompt = f"""
Translate the following text to French, Spanish,
and English pirate: \
```I want to order a basketball```
"""
response = get_completion(prompt)
print(response)

In [None]:
prompt = f"""
Translate the following text to Spanish in both the \
formal and informal forms:
'Would you like to order a pillow?'
"""
response = get_completion(prompt)
print(response)

### Universal Translator
Imagine you are in charge of IT at a large multinational e-commerce company. Users message you about IT issues in their native languages. Your staff comes from all over the world, and each person speaks only their own native language. You need a universal translator!

In [None]:
user_messages = [
  "La performance du système est plus lente que d'habitude.",  # System performance is slower than normal
  "Mi monitor tiene píxeles que no se iluminan.",              # My monitor has pixels that are not lighting
  "Il mio mouse non funziona",                                 # My mouse is not working
  "Mój klawisz Ctrl jest zepsuty",                             # My keyboard has a broken control key
  "我的屏幕在闪烁"                                               # My screen is flashing
]

In [None]:
for issue in user_messages:
    prompt = f"Tell me what language this is: ```{issue}```"
    lang = get_completion(prompt)
    print(f"Original message ({lang}): {issue}")

    prompt = f"""
    Translate the following text to English \
    and Korean: ```{issue}```
    """
    response = get_completion(prompt)
    print(response, "\n")
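
If the translations feed a ticketing system rather than a person, it can help to ask for JSON and parse it. The sketch below is one possible approach, not the notebook's method. `build_translation_prompt` and `parse_translation` are hypothetical helpers, and the sample response is hand-written for illustration. A real model may not return valid JSON, in which case `json.loads` raises an error that you would need to handle.

```python
import json

DELIM = "`" * 3  # triple-backtick delimiter, as used in the prompts above


def build_translation_prompt(message, targets):
    """Ask for the detected language plus one translation per target, as JSON."""
    keys = ", ".join(f'"{target}"' for target in targets)
    return (
        f"Identify the language of the text delimited by {DELIM} and translate it. "
        f'Respond with only a JSON object with the keys "language", {keys}.\n'
        f"{DELIM}{message}{DELIM}"
    )


def parse_translation(response, targets):
    """Parse the JSON response and check that every expected key is present."""
    data = json.loads(response)
    missing = [key for key in ["language", *targets] if key not in data]
    if missing:
        raise ValueError(f"response is missing keys: {missing}")
    return data


targets = ["english", "korean"]
print(build_translation_prompt("Il mio mouse non funziona", targets))

# Hand-written stand-in for a model response, not actual model output.
sample_response = (
    '{"language": "Italian", "english": "My mouse is not working", '
    '"korean": "제 마우스가 작동하지 않습니다"}'
)
print(parse_translation(sample_response, targets)["english"])
```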

## Try it yourself!
Try some translations on your own!

## Tone Transformation
Writing style varies with the intended audience. LLMs can rewrite the same message in different tones.


In [None]:
prompt = f"""
Translate the following from slang to a business letter:
'Dude, This is Joe, check out this spec on this standing lamp.'
"""
response = get_completion(prompt)
print(response)

## Format Conversion
LLMs can translate between formats. The prompt should describe the input and output formats.

In [None]:
data_json = { "restaurant employees" :[
    {"name":"Shyam", "email":"shyamjaiswal@gmail.com"},
    {"name":"Bob", "email":"bob32@gmail.com"},
    {"name":"Jai", "email":"jai87@gmail.com"}
]}

prompt = f"""
Translate the following Python dictionary from JSON to an HTML \
table with column headers and title: {data_json}
"""
response = get_completion(prompt)
print(response)

In [None]:
from IPython.display import display, Markdown, HTML
display(HTML(response))
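
Chat models may wrap generated HTML in a Markdown code fence or add a sentence before it, which would then show up in the rendered output. The sketch below strips such a fence before display. `strip_code_fence` is a hypothetical helper and only handles the first fenced block it finds.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker
FENCED_BLOCK = re.compile(FENCE + r"[\w-]*\n(.*?)" + FENCE, flags=re.DOTALL)


def strip_code_fence(text):
    """Return the body of the first fenced block, or the stripped text if none exists."""
    match = FENCED_BLOCK.search(text)
    return match.group(1).strip() if match else text.strip()


# Hand-written stand-in for a model response, not actual model output.
sample_response = f"Here is the table:\n{FENCE}html\n<table><tr><td>Bob</td></tr></table>\n{FENCE}"
print(strip_code_fence(sample_response))
```

In the notebook, this could be used as `display(HTML(strip_code_fence(response)))`.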

## Spellcheck/Grammar Check

Here are some examples of common grammar and spelling problems, along with the LLM's responses.

To have the LLM proofread your text, instruct it to 'proofread' or 'proofread and correct'.

In [None]:
text = [
  "The girl with the black and white puppies have a ball.",  # The girl has a ball.
  "Yolanda has her notebook.", # ok
  "Its going to be a long day. Does the car need it’s oil changed?",  # Homonyms
  "Their goes my freedom. There going to bring they’re suitcases.",  # Homonyms
  "Your going to need you’re notebook.",  # Homonyms
  "That medicine effects my ability to sleep. Have you heard of the butterfly affect?", # Homonyms
  "This phrase is to cherck chatGPT for speling abilitty"  # spelling
]
for t in text:
    prompt = f"""Proofread and correct the following text
    and rewrite the corrected version. If you don't find
    any errors, just say "No errors found". Don't use
    any punctuation around the text:
    ```{t}```"""
    response = get_completion(prompt)
    print(response)
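
To see exactly what the model changed, you can diff the original against the corrected text. The sketch below uses `difflib.SequenceMatcher` from the standard library at word level. `word_changes` is a hypothetical helper, and the corrected sentence is hand-written rather than actual model output.

```python
from difflib import SequenceMatcher


def word_changes(original, corrected):
    """Return (before, after) pairs for each run of words that differs."""
    before_words, after_words = original.split(), corrected.split()
    matcher = SequenceMatcher(None, before_words, after_words)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(before_words[i1:i2]), " ".join(after_words[j1:j2])))
    return changes


original = "Its going to be a long day."
corrected = "It's going to be a long day."  # stand-in for the model's correction
print(word_changes(original, corrected))
```

An empty list means the text came back unchanged, which is an easy way to detect the "No errors found" case without relying on the model's wording.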

In [None]:
text = f"""
Got this for my daughter for her birthday cuz she keeps taking \
mine from my room.  Yes, adults also like pandas too.  She takes \
it everywhere with her, and it's super soft and cute.  One of the \
ears is a bit lower than the other, and I don't think that was \
designed to be asymmetrical. It's a bit small for what I paid for it \
though. I think there might be other options that are bigger for \
the same price.  It arrived a day earlier than expected, so I got \
to play with it myself before I gave it to my daughter.
"""
prompt = f"proofread and correct this review: ```{text}```"
response = get_completion(prompt)
print(response)

In [None]:
prompt = f"""
proofread and correct this review. Make it more compelling.
Ensure it follows APA style guide and targets an advanced reader.
Output in markdown format.
Text: ```{text}```
"""
response = get_completion(prompt)
display(Markdown(response))