# LLMOps: LLM Tracking with MLflow

Let's begin by connecting to the MLflow tracking server and selecting an experiment. `mlflow.set_experiment` creates the experiment if it does not exist yet.

In [None]:
import mlflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("llm_tracking")

We will use LangChain to talk to OpenAI's chat models (ChatGPT). There are four main components:
- SystemMessage: Instructions that set the model's behavior for the conversation
- HumanMessage: A message written by the user and sent to the model
- AIMessage: A response produced by the model
- ChatOpenAI: The chat model wrapper that we call with a list of messages
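
As a rough sketch of how these pieces relate, the three message classes correspond to the `system`, `user`, and `assistant` roles of the OpenAI chat format. The snippet below uses plain tuples and dictionaries instead of LangChain; `ROLE_BY_MESSAGE_TYPE` and `to_chat_payload` are hypothetical names used only for illustration, not LangChain functions.

```python
# Hypothetical, LangChain-free sketch of the role behind each message class.
ROLE_BY_MESSAGE_TYPE = {
    "SystemMessage": "system",    # instructions for the model
    "HumanMessage": "user",       # text written by the user
    "AIMessage": "assistant",     # reply produced by the model
}


def to_chat_payload(messages):
    """Turn (message_type, content) pairs into role/content dictionaries."""
    return [
        {"role": ROLE_BY_MESSAGE_TYPE[message_type], "content": content}
        for message_type, content in messages
    ]


conversation = [
    ("SystemMessage", "Classify the text as positive or negative."),
    ("HumanMessage", "I love this movie."),
    ("AIMessage", "[RESULT]: 1"),
]
payload = to_chat_payload(conversation)
print(payload)
```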

In [8]:
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage, AIMessage


In [9]:
import os
# Add your OpenAI API key here
os.environ['OPENAI_API_KEY'] = "sk-"

In [10]:
chat = ChatOpenAI()

In [18]:
system_prompt = """
You are a human text classifier. You are given a text and you have to classify it as either positive or negative. Given an input text, you will think about the answer first and classify it next. Your output format should be:
<Insert what you think>
[RESULT]: <1 for positive, 0 for negative, -1 for if you don't know>
"""
system_message = SystemMessage(content=system_prompt)
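
Because the prompt asks for a `[RESULT]:` line, the numeric label can be pulled out of each response with a small parser. The following is a minimal sketch, not part of the original notebook: `parse_result` is a hypothetical helper, and treating a missing or unexpected label as -1 is an assumption.

```python
import re

# Matches the label that follows "[RESULT]:" in the format requested by the prompt.
RESULT_PATTERN = re.compile(r"\[RESULT\]:\s*(-?\d+)")


def parse_result(response_text):
    """Return 1, 0, or -1; fall back to -1 when no valid label is found."""
    match = RESULT_PATTERN.search(response_text)
    if match is None:
        return -1
    label = int(match.group(1))
    return label if label in (1, 0, -1) else -1


example = "The reviewer clearly enjoyed the film.\n[RESULT]: 1"
print(parse_result(example))
```

Models do not always follow the requested format, so a fallback value keeps later evaluation code from failing on malformed responses.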

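With `mlflow.llm.log_predictions`, we can log the model's inputs and outputs together with the prompts used to generate them. Here the inputs are the user messages, the outputs are the AI responses, and the prompt is the system prompt, repeated once per input so that all three lists have the same length.
The logged predictions can then be used for offline evaluation.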

In [19]:
human_messages = []
ai_responses = []
with mlflow.start_run():
    # Collect five user messages and the model's reply to each one
    for _ in range(5):
        user_input = input("Enter your message: ")
        human_message = HumanMessage(content=user_input)
        human_messages.append(user_input)
        ai_message = chat([system_message, human_message])
        ai_responses.append(ai_message.content)

    # inputs, outputs and prompts must all have the same length
    mlflow.llm.log_predictions(
        inputs=human_messages,
        outputs=ai_responses,
        prompts=[system_prompt] * len(human_messages),
    )

2023/07/29 18:34:23 INFO mlflow.tracking.llm_utils: Creating a new llm_predictions.csv for run ddd5253a960e45e997e4851f553e81a1.
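
The `llm_predictions.csv` artifact named in the log message is a plain table with one row per prediction. The sketch below reproduces that layout with the standard library only, assuming the column names `inputs`, `outputs`, and `prompts`; `predictions_to_csv` is a hypothetical helper and does not call MLflow.

```python
import csv
import io


def predictions_to_csv(inputs, outputs, prompts):
    """Write three parallel lists as CSV rows under an inputs/outputs/prompts header."""
    if not (len(inputs) == len(outputs) == len(prompts)):
        raise ValueError("inputs, outputs and prompts must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["inputs", "outputs", "prompts"])
    writer.writerows(zip(inputs, outputs, prompts))
    return buffer.getvalue()


csv_text = predictions_to_csv(
    inputs=["I love this movie."],
    outputs=["Sounds positive.\n[RESULT]: 1"],
    prompts=["You are a text classifier."],
)
# Read the text back to confirm that multi-line outputs survive the round trip.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```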


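We can also use `mlflow.log_table` to log the inputs, outputs, and prompts as a table, which makes it easier to compare the results with those of other models. The test set itself is recorded on the run with `mlflow.log_input`.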

In [33]:
test_dataset = [
    "I love this movie. It's so good!",
    "I hate this movie. It's so bad!",
    "I don't know what to think about this movie. It's so-so.",
    "This product is great. I love it!",
    "This product is terrible. I hate it!",
]

from mlflow.data.pandas_dataset import from_pandas
import pandas as pd

human_messages = []
ai_responses = []
with mlflow.start_run(run_name="gpt-3.5-turbo-16k"):
    chat = ChatOpenAI(model_name='gpt-3.5-turbo-16k')
    # Record the test set as a dataset input of this run
    test_mlflow_dataset = from_pandas(pd.DataFrame(test_dataset, columns=["content"]))
    mlflow.log_input(test_mlflow_dataset, "test_dataset")
    for text in test_dataset:
        human_message = HumanMessage(content=text)
        human_messages.append(text)
        ai_message = chat([system_message, human_message])
        ai_responses.append(ai_message.content)

    # The table is stored as a JSON artifact, so give it a .json file name
    mlflow.log_table(
        {
            "inputs": human_messages,
            "outputs": ai_responses,
            "prompts": [system_prompt] * len(human_messages),
        },
        "predictions.json",
    )
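
Once the numeric label after `[RESULT]:` has been extracted from each output, runs can be compared with a simple accuracy score. This is only a sketch: the expected labels, the run names `model-a` and `model-b`, and the per-run labels below are all hypothetical placeholders, not results from the runs above.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Hypothetical hand-assigned labels for five test sentences
# (1 = positive, 0 = negative, -1 = unsure).
expected_labels = [1, 0, -1, 1, 0]

# Hypothetical parsed labels per run; real values would come from the logged outputs.
labels_by_run = {
    "model-a": [1, 0, 0, 1, 0],
    "model-b": [1, 0, -1, 1, 0],
}

scores = {run: accuracy(labels, expected_labels) for run, labels in labels_by_run.items()}
print(scores)
```

A score like this could also be recorded on each run with `mlflow.log_metric`, so that it appears next to the logged table.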