In [None]:
#!pip install openai==1.7.1

In [None]:
#!pip install langchain_openai

In [None]:
#!pip install langchain==0.1.0

# Strategy 1

# Check for offensive words in LLM output

Let us say we have a LLM for a specific use case. Let us call this Task LLM.

We can have another LLM to evalulate the output of our Task LLM. Let us call this Evaluation LLM.

We can prompt this Evaluation LLM to look for offensive words in the output produced by the Task LLM.

To make the Evaluation LLM robust, we will also include a custom offensive word list so that we can guide it to act as per our internal requirements.


In [None]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import os
from langchain.llms import OpenAI
from langchain_openai import OpenAI

### Step 1 Define custom offensive word list

In [3]:
offensive_word_custom_list = ['knife','weapon']

### Step 2 Define rule_to_find_offense

In [4]:
rule_to_find_offense ="""
Please review the following sentence and identify any words 
that are commonly recognized as abusive, offensive, or 
inappropriate. 

classify each word as abusive or offensive, or  inappropriate or normal.

'Sentence: {actual_sentence}'?

Also if any of the words in in the sentence are present in the  offensive_word_custom_list , then
classify them as inappropriate

'offensive_word_custom_list : {offensive_word_custom_list}

Output in the following structure for each of the word in the Sentence.

word : classification : reason for classification

"""

### Step 3 Get output from Task LLM
To make it simpler we will assume below is the output produced by the Task LLM.

In [8]:
actual_sentence = "hello rude  park threaten knife"

### Step 4 Prompt Evaluation LLM to check the output from task LLM


In [72]:
llm = OpenAI(openai_api_key=os.getenv("open_ai_secret_key"))

prompt = PromptTemplate(
    input_variables=["actual_sentence", "offensive_word_custom_list"],
    template=rule_to_find_offense

)

text = prompt.format(actual_sentence=actual_sentence, offensive_word_custom_list= offensive_word_custom_list)
print(text)


Please review the following sentence and identify any words 
that are commonly recognized as abusive, offensive, or 
inappropriate. 

classify each word as abusive or offensive, or  inappropriate or normal.

'Sentence: hello rude  park threaten knife'?

Also if any of the words in in the sentence are present in the  offensive_word_custom_list , then
classify them as inappropriate

'offensive_word_custom_list : ['knife', 'weapon']

Output in the following structure for each of the word in the Sentence.

word : classification : reason for classification




In [12]:
text_llm_chain = OpenAI(temperature=0, model_name = 'gpt-3.5-turbo-instruct',openai_api_key=os.getenv("open_ai_secret_key"))
print(text_llm_chain.invoke(text))

hello : normal : Not considered abusive, offensive, or inappropriate.
rude : abusive : Contains a derogatory term.
park : normal : Not considered abusive, offensive, or inappropriate.
threaten : normal : Not considered abusive, offensive, or inappropriate.
knife : inappropriate : Present in the offensive_word_custom_list.

word : classification : reason for classification
hello : normal : Not considered abusive, offensive, or inappropriate.
rude : abusive : Contains a derogatory term.
park : normal : Not considered abusive, offensive, or inappropriate.
threaten : normal : Not considered abusive, offensive, or inappropriate.
knife : inappropriate : Present in the offensive_word_custom_list.


#### As we can see from the above ouput, the Evaluation LLM marked words rude as abusive based on its own training.

#### Also, it flagged knife as inappropriate based on our custom logic of checking in the offensive_word_custom_list

# Strategy 2

# Chain of Thought Reasoning

Let us say we have a LLM for a specific use case. Let us call this Task LLM.

We can prompt this Task LLM to explain the chain of thought reasoning for it to have produced that particular output.

Along with the output text, we can also store the reason in to a database.

We can then perform some analysis on the reasoning given to evaluate if the LLM is performing well.
The analysis can be manual to start with, we can have an actual person look at the reasons to check if the reasoning looks good.

This can then be automated later to look for some patterns.

#### Step 1 Set up Chain of thought Prompt

In [20]:
chain_of_thought_prompt ="""
 Please explain the chain of thought reasoning for giving the answer.
 format the output in the following structure.
 
 QUESTION : {QUESTION}
 ANSWER: answer
 REASONING : chain of thought reasoning for giving the answer


"""

In [19]:

prompt = PromptTemplate(
    input_variables=["QUESTION"],
    template=chain_of_thought_prompt

)

text = prompt.format(QUESTION="What is the main reason for shopping at walmart ")
text

'\n Please explain the chain of thought reasoning for giving the answer.\n format the output in the following structure.\n \n QUESTION : What is the main reason for shopping at walmart \n ANSWER: answer\n REASONING : chain of thought reasoning for giving the answer\n\n\n'

#### Step 2 get Answer and Reasoning from LLM

In [18]:
text_llm_chain = OpenAI(temperature=0, model_name = 'gpt-3.5-turbo-instruct',openai_api_key=os.getenv("open_ai_secret_key"))
print(text_llm_chain.invoke(text))

QUESTION : What is the main reason for shopping at walmart
ANSWER: The main reason for shopping at Walmart is its low prices and wide variety of products.

REASONING: Walmart is known for its low prices and this is a major factor that attracts customers. The company has a reputation for offering products at a lower cost compared to other retailers. This makes it an attractive option for budget-conscious shoppers who are looking to save money. Additionally, Walmart offers a wide variety of products, from groceries to electronics to clothing, making it a one-stop shop for many people. This convenience factor is another reason why customers choose to shop at Walmart. With a large selection of products at affordable prices, Walmart is able to cater to the needs of a diverse customer base, making it a popular choice for shopping.


#### As we can see in this simple example, the LLM was able to provide the reasons for coming up with the output. 
#### It mentioned reasons like save money, a wide variety of products, from groceries to electronics to clothing , affordable prices and so on.

#### So, In real life use cases as well, we can use similar prompting technique and get the chain of thought. 
#### We can then evaluate if it meets our own internal requirements.

# Strategy 3

# LLM Graded Evaluation

Let us say we have a LLM for a specific use case. Let us call this Task LLM.

We can have another LLM to evalulate the output of our Task LLM. Let us call this Evaluation LLM.

We can prompt this Evaluation LLM to grade the output of the Task LLM.

First we prepare a QUESTION and ANSWER pair based on our knowledge about our systems, use case and data.

For each question, We ask the Evaluation LLM to compare(grade) the output of the Task LLM with the ANSWERS prepared in the above step.

The Evaluation LLM comes up with the grading. If the predictions/outputs from the task LLM are similar to the QUESTION and ANSWER pair we prepared it says the result or grade is CORRECT. If the prediction does not match the QUESTION and ANSWER pair we prepared , it says the result or grade is INCORRECT.


### Step 1 prepare a QUESTION and ANSWER pair

##### For this example, we will purposely set the answer for 3rd question to be incorrect.We will say walmart is very expensive

##### We are setting it incorrectly just to show how the LLM evaluation QA chain compares the predictions with QUESTION,ANSWER PAIR

In [76]:
examples = [
     {
         "QUESTION": "What is the main reason for shopping at Best Buy according to the document?",
         "ANSWER": "The main reason for shopping at Best Buy according to the document is to find electronics, computers, appliances, cell phones, video games, and more new tech."
     },
     {
         "QUESTION": "What is the main benefit offered by Prime?",
         "ANSWER": "The main benefit offered by Prime is free shipping on millions of items."
     },
         {
         "QUESTION": "What is the main reason for shopping at walmart?",
         "ANSWER": "TThe main reason for shopping at walmart is its very expensive."
     }
 ]

#### If we already stored(say in a database) the output of the Task LLM, we can use it directly.

#### But for our simple example, we will get the output on the fly.

#### so we will just prepare a list of questions as well

In [77]:
example_questions = [
     {
         "QUESTION": "What is the main reason for shopping at Best Buy according to the document?",
     },
     {
         "QUESTION": "What is the main benefit offered by Prime?",
     },
         {
         "QUESTION": "What is the main reason for shopping at walmart?",
     }
 ]

### Step 2 Set up langchain prompt template

In [78]:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
import os
from langchain.llms import OpenAI

llm = OpenAI(openai_api_key=os.getenv("open_ai_secret_key"))

prompt = PromptTemplate(
    input_variables=["QUESTION"],
    template="QUESTION: {QUESTION}\ANSWER:",
)
prompt


PromptTemplate(input_variables=['QUESTION'], template='QUESTION: {QUESTION}\\ANSWER:')

### Step 3 Get output from task LLM
### If we already stored(say in a database) the output of the Task LLM, we can use it directly.
### But for our example, we will get the output on the fly.

In [80]:

text_llm_chain = OpenAI(temperature=0, model_name = 'gpt-3.5-turbo-instruct',openai_api_key=os.getenv("open_ai_secret_key"))
ch = LLMChain(llm=text_llm_chain, prompt=prompt)
#predictions = ch.apply(examples)
predictions = ch.apply(example_questions)
print (example_questions)
print ('****************')
print (predictions)

[{'QUESTION': 'What is the main reason for shopping at Best Buy according to the document?'}, {'QUESTION': 'What is the main benefit offered by Prime?'}, {'QUESTION': 'What is the main reason for shopping at walmart?'}]
****************
[{'text': ' The main reason for shopping at Best Buy according to the document is the wide selection of products and brands available.'}, {'text': ' The main benefit offered by Prime is access to free and fast shipping on eligible items, as well as access to a wide range of other benefits such as streaming of movies, TV shows, and music, unlimited photo storage, and exclusive deals and discounts.'}, {'text': ' The main reason for shopping at Walmart is typically the low prices and wide selection of products.'}]


### Step 4 Set up langchain QA Evaluation chain 

In [56]:
from langchain.evaluation.qa import QAEvalChain
#from langchain.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI
llm_model = "gpt-3.5-turbo-instruct"
#llm = ChatOpenAI(temperature=0, model=llm_model,openai_api_key=os.getenv("open_ai_secret_key"))
llm = OpenAI(temperature=0, model_name = 'gpt-3.5-turbo-instruct',openai_api_key=os.getenv("open_ai_secret_key"))
eval_chain = QAEvalChain.from_llm(llm)

### Step 5 Evaluate / Grade the Task LLM Output

In [74]:
graded_outputs = eval_chain.evaluate(examples, predictions, question_key='QUESTION', prediction_key='text', answer_key='ANSWER')
graded_outputs

[{'results': ' CORRECT'}, {'results': ' CORRECT'}, {'results': ' INCORRECT'}]

#### As we can see from the above grading, the first 2 predictions/outputs from the task LLM are similar to the QUESTION and ANSWER pair we prepared.
    
#### The 3rd prediction does not match the QUESTION and ANSWER pair we prepared.

### Display Question, predictions and grade together

In [69]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + examples[i]['QUESTION'])
    print("Real Answer: " + examples[i]['ANSWER'])
    print("Predicted Answer: " + predictions[i]['text'])
    print("Predicted Grade:" + graded_outputs[i]['results'])
    print('********************************')

Example 0:
Question: What is the main reason for shopping at Best Buy according to the document?
Real Answer: The main reason for shopping at Best Buy according to the document is to find electronics, computers, appliances, cell phones, video games, and more new tech.
Predicted Answer:  The main reason for shopping at Best Buy according to the document is the wide selection of products and brands available.
Predicted Grade: CORRECT
********************************
Example 1:
Question: What is the main benefit offered by Prime?
Real Answer: The main benefit offered by Prime is free shipping on millions of items.
Predicted Answer:  The main benefit offered by Prime is access to free and fast shipping on eligible items, as well as access to a wide range of other benefits such as streaming of movies, TV shows, and music, unlimited photo storage, and exclusive deals and discounts.
Predicted Grade: CORRECT
********************************
Example 2:
Question: What is the main reason for shop