# LLM Data Cleansing

#### Loading Environment Variables

In [1]:
from dotenv import load_dotenv
import os

# loading .env file
load_dotenv()

# Read the OpenAI API key from the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
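
`os.getenv` returns `None` when the variable is missing, so a misnamed or absent `.env` entry only surfaces later as an authentication error. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value
```

Under that assumption, the assignment above could be written as `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.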

#### Initialize LLM and Prompt

In [2]:
from langchain_openai import OpenAI

# Initialize OpenAI LLM with LangChain
llm = OpenAI(api_key=OPENAI_API_KEY)

# Load prompt from prompt.txt file (UTF-8, since the prompt contains curly quotes)
with open('prompt.txt', 'r', encoding='utf-8') as file:
    prompt = file.read()

print('PROMPT:\n')
print(prompt)

PROMPT:

You are tasked with converting the following durations to days. These durations are a free-form text 
field on an 811 OneCall Ticket.  The Duration is how long the job should last. There may be misspellings 
or errors. If you see an estimate or range, use the highest. The minimum number of days should be 1, 
if something is 1 hour, it should be 1 day. Anything that cannot convert to a day, is error. For full days, 
enter the exact number of days. For any part of a day mentioned (like hours or half days), count each as a 
full day if the activity spans multiple days, reflecting the occupation of each calendar day. 

 
First column, “duration”, will be the original value (leave it exactly as it was entered). 

Second column, “days”, will be standardized amount in days (integer only).   

Third column, "is_estimate",  will be binary (1 for yes, 0 for no). This used for anything that could 
be an estimate based on ranges or if language that could suggest an estimate is present. 
E
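
To make the rules in the prompt concrete, here is a rough rule-based baseline that could be used to spot-check the LLM's answers on the easy cases. It is a sketch under stated assumptions, not the method used in this notebook: the function name `baseline_days`, the list of unit spellings, the estimate keywords, and the choice of `None` to represent "error" are all hypothetical, and rounding partial days up is only one reading of the prompt.

```python
import math
import re

# Hours per unit. The spellings handled here are an assumption, not an exhaustive list.
UNIT_HOURS = {"HOUR": 1, "HR": 1, "DAY": 24, "WEEK": 168, "WK": 168}

# A number, an optional range end ("3-5", "3 TO 5"), then a unit.
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s*(?:-|TO)\s*(\d+(?:\.\d+)?))?\s*(HOUR|HR|DAY|WEEK|WK)S?\b"
)

# Words treated as signalling an estimate (assumed list).
ESTIMATE_WORDS = re.compile(r"\b(?:APPROX|ABOUT|EST)|~")


def baseline_days(duration: str):
    """Return (days, is_estimate), or None when no duration can be parsed."""
    text = duration.upper()
    match = PATTERN.search(text)
    if match is None:
        return None  # stands in for the prompt's "error" case
    low, high, unit = match.groups()
    # For a range, use the highest value.
    amount = max(float(low), float(high)) if high is not None else float(low)
    # Round partial days up, with a minimum of one day.
    days = max(1, math.ceil(amount * UNIT_HOURS[unit] / 24))
    is_estimate = int(high is not None or ESTIMATE_WORDS.search(text) is not None)
    return (days, is_estimate)


for sample in ["2 WEEKS", "3-5 DAYS", "1 HOUR", "NOT APPLICABLE"]:
    print(sample, "->", baseline_days(sample))
```

A baseline like this will miss misspellings and unusual phrasing, which is where the LLM is expected to help.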

#### Initialize Data

Load `llm_unique_data.csv`, which has duplicate values removed so that each distinct duration is sent to the LLM only once.
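
The step that produced the de-duplicated file is not shown here. As an illustration only, the idea can be sketched in plain Python: keep each distinct value once, compared exactly as entered, then map the cleaned results back onto the full list. The helper names `unique_in_order` and `expand_results` are hypothetical; with pandas, `Series.drop_duplicates()` would serve the first purpose.

```python
def unique_in_order(values):
    """Return the distinct values in first-seen order, compared exactly as entered."""
    return list(dict.fromkeys(values))


def expand_results(values, cleaned_by_value):
    """Look up each original value in the results computed for the unique set."""
    return [cleaned_by_value[value] for value in values]


raw = ["1 DAY", "2 WEEKS", "1 DAY", "5 DAYS", "2 WEEKS"]
unique = unique_in_order(raw)
print(unique)  # ['1 DAY', '2 WEEKS', '5 DAYS']

# Made-up cleaned results for the unique values, mapped back to every row.
cleaned = {"1 DAY": 1, "2 WEEKS": 14, "5 DAYS": 5}
print(expand_results(raw, cleaned))  # [1, 14, 1, 5, 14]
```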

In [7]:
import pandas as pd

# Load the CSV data and assign to DataFrame
csv_file_path = 'data/llm_unique_data.csv'
df = pd.read_csv(csv_file_path)

# Assign the DURATION column to llm_data
llm_data = df['DURATION']

# Checking output 
print(f'Total Rows: {len(llm_data)}')
display(llm_data.head())


Total Rows: 1356


0    NOT APPLICABLE
1           14 DAYS
2             1 DAY
3            5 DAYS
4           2 WEEKS
Name: DURATION, dtype: object
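
With 1356 values to convert, one option is to send them in batches, each appended to the instructions loaded from `prompt.txt`. The sketch below covers only the batching and prompt assembly. It assumes a batch size of 50 and a one-value-per-line layout, neither of which comes from the notebook, and the helper names `chunk` and `build_prompt` are hypothetical.

```python
def chunk(values, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    values = list(values)
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def build_prompt(instructions, batch):
    """Append one duration per line below the instructions."""
    return instructions.rstrip() + "\n\n" + "\n".join(str(value) for value in batch)


sample = ["NOT APPLICABLE", "14 DAYS", "1 DAY", "5 DAYS", "2 WEEKS"]
for batch in chunk(sample, batch_size=2):
    print(build_prompt("Convert these durations to days.", batch))
    print("---")
```

In the notebook, `sample` would be `llm_data` and the instructions would be `prompt`; each resulting string could then be passed to the model, for example with `llm.invoke(...)`.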