# <u>LLM App Development (Republic Polytechnic)</u> 
[07 Jan 2024]

# <u>Day 2-05</u>

# <font color=green>Setup and Installation</font>

You can run this Jupyter notebook either on your local machine or in Google Colab.

* On a local machine, it is recommended to install Anaconda and then pip install the libraries stated below.
* If you want to run or experiment with this Jupyter notebook in Google Colab, you have to ensure that the relevant libraries are installed.
* As a best practice, the OpenAI key is stored in a file named "env.txt". This file should be in the same location as this Jupyter notebook.
* If you are running or experimenting with it in Google Colab, remember to copy env.txt to your Google Colab drive.
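
A minimal sketch of how the key could be read from env.txt using only the standard library. It assumes the file holds one `KEY=VALUE` pair per line; the helper names `parse_env_text` and `load_env_file` are hypothetical and not part of any library.

```python
import os
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict (hypothetical helper)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an '=' separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding straight or curly quotes around the value.
        values[key.strip()] = value.strip().strip("\"'“”")
    return values


def load_env_file(path="env.txt"):
    """Read a file such as env.txt and return its KEY=VALUE pairs."""
    return parse_env_text(Path(path).read_text(encoding="utf-8"))


# Only read the file if it exists, so this cell is safe to run anywhere.
if Path("env.txt").exists():
    os.environ.update(load_env_file("env.txt"))
    openai_api_key = os.environ.get("OPENAI_API_KEY")
```

Keeping the key in a separate file means the notebook itself can be shared without exposing the secret.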

In [1]:
# once you have installed the libraries, you can comment out the line below
# you only need to run this once
!pip3 install openai langchain




## References
- [LangChain Python Docs](https://python.langchain.com/en/latest/)
- [LangChain Python API Reference](https://api.python.langchain.com/en/latest/)

## <font color=blue>Keep API Key Safe</font>

In [3]:
# import libraries

import openai
import os
import streamlit as st

In [4]:
# load the OpenAI key
# the key is stored as a line of the form:
# OPENAI_API_KEY="sk-1234567890………………….."
# note: st.secrets reads from Streamlit's secrets file (.streamlit/secrets.toml), not from env.txt

openai_api_key = st.secrets['OPENAI_API_KEY']

In [5]:
# use OpenAI as the LLM for evaluation

from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_key=openai_api_key,
    model='gpt-3.5-turbo-instruct'
)



In [6]:
print(llm)

OpenAI
Params: {'model_name': 'gpt-3.5-turbo-instruct', 'temperature': 0.7, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'logit_bias': {}, 'max_tokens': 256}


## <font color=blue>Evaluations</font>

In LangChain, the outputs of large language models (LLMs) can be evaluated in several ways, such as comparing chain outputs, pairwise string comparison, string distance, and embedding distance. These evaluations help identify the preferred model by analyzing the differences between the outputs.
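
To make "string distance" concrete, here is a plain-Python sketch of the Levenshtein edit distance and a normalized variant. This is only an illustration of the idea and is not LangChain's implementation; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the edit distance to the range 0.0 (identical) to 1.0."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))
print(normalized_distance("LangChain", "LangChain"))
```

A smaller distance means the prediction is closer to the reference text, which is the intuition behind distance-based evaluators.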

### load_evaluator( )
- Notice that the parameter `llm` is optional and defaults to **None**.
- Set the parameter `llm` to the LLM you want to use for evaluation.

In [7]:
from langchain.evaluation import load_evaluator

In [7]:
help(load_evaluator)

Help on function load_evaluator in module langchain.evaluation.loading:

load_evaluator(evaluator: langchain.evaluation.schema.EvaluatorType, *, llm: Optional[langchain_core.language_models.base.BaseLanguageModel] = None, **kwargs: Any) -> Union[langchain.chains.base.Chain, langchain.evaluation.schema.StringEvaluator]
    Load the requested evaluation chain specified by a string.
    
    Parameters
    ----------
    evaluator : EvaluatorType
        The type of evaluator to load.
    llm : BaseLanguageModel, optional
        The language model to use for evaluation, by default None
    **kwargs : Any
        Additional keyword arguments to pass to the evaluator.
    
    Returns
    -------
    Chain
        The loaded evaluation chain.
    
    Examples
    --------
    >>> from langchain.evaluation import load_evaluator, EvaluatorType
    >>> evaluator = load_evaluator(EvaluatorType.QA)



## Pairwise String Comparison

In [8]:
evaluator = load_evaluator("labeled_pairwise_string", llm=llm)

In [9]:
gpt4_turbo='''
LangChain is a Python library designed to make it easier to build applications with large language models, 
providing tools for chaining components and managing complex natural language processing workflows'''

gpt35_0613_turbo='''
LangChain is a Python framework that connects learners and tutors worldwide, 
offering personalized lessons, real-time practice, and transparent payment using blockchain technology 
to ensure security and fairness.'''

# some criteria require reference labels to work correctly
reference='''
LangChain is a Python framework designed to streamline AI application development, focusing on real-time
data processing and integration with Large Language Models.
'''

eval_result = evaluator.evaluate_string_pairs(
    prediction=gpt4_turbo,
    prediction_b=gpt35_0613_turbo,
    input="describe LangChain in thirty words",
    reference=reference,
)

In [10]:
print(eval_result['reasoning'])



[[A]]


## Predefined Criteria - Conciseness
`Cambridge Dictionary` The quality of being short and clear, and expressing what needs to be said without unnecessary words.

### Concise Example

In [11]:
evaluator = load_evaluator("criteria", criteria="conciseness", llm=llm)

In [12]:
concise='''
Generative AI is a type of artificial intelligence that can autonomously create new and original content, such as images, 
text, or other forms of data, using algorithms and models. It has the ability to generate outputs that mimic human-created 
content without relying solely on predefined patterns.
'''

eval_result = evaluator.evaluate_strings(
    prediction=concise,
    input="What is generative AI?",
)

In [13]:
print(eval_result['reasoning'])

Step 1: Determine if the submission is concise and to the point.
- The submission provides a clear and brief definition of generative AI.
- It includes relevant information about what generative AI is and how it works.
- It does not include unnecessary details or go off-topic.
- Therefore, the submission meets the criteria for conciseness.

Conclusion: The submission meets the criteria for conciseness.


### Non-Concise Example

In [14]:
inconcise='''
In the vast landscape of artificial intelligence, generative AI emerges as a subset intricately enmeshed in a labyrinthine 
array of algorithms, particularly those rooted in the complex neural networks exemplified by the captivating Generative 
Adversarial Networks (GANs). This convoluted field empowers machines to autonomously and creatively navigate the expansive 
spectrum of content creation, spanning from vividly evocative imagery to the nuanced articulation found in various textual expressions. 
This intricate process, laden with multifaceted intricacies, tangentially mirrors the profound subtleties inherent in the intricate 
tapestry of human cognition and expressive exploration.
'''

eval_result = evaluator.evaluate_strings(
    prediction=inconcise,
    input="What is generative AI?",
)

In [15]:
print(eval_result['reasoning'])

Step 1: Identify the criteria
The criteria for this assessment is "conciseness", which evaluates whether the submission is concise and to the point.

Step 2: Analyze the submission
The submission is a lengthy and complex description of generative AI, with multiple layers of detail and intricate vocabulary.

Step 3: Evaluate the submission based on the criteria
The submission does not meet the criteria of conciseness as it is not brief and to the point. Instead, it is verbose and elaborate.

Step 4: Determine the conclusion
Based on the analysis, the submission does not meet the criteria of conciseness.

Step 5: Print the result


## Correctness
`Cambridge Dictionary` The quality of being in agreement with the true facts or with what is generally accepted.

In [16]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain


evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=llm)

In [17]:
# submission to be evaluated: a brute-force range sum for each query

correctness_test_1='''
    def solution(arr,queries):
        for i in range(len(queries)):
            ans = 0
            for j in range(queries[i][0],queries[i][1]+1):
                ans += arr[j]
            print(ans)
        
'''



reference='''
    def solution(arr,queries):
        prefix = [0] * len(arr)
        for i in range(len(arr)):
            if i == 0:
                prefix[i] = arr[i]
            else:
                prefix[i] = prefix[i - 1]+arr[i]
        for q in queries:
            print(prefix[q[1]] - prefix[q[0]] + arr[q[0]])
            
    
'''
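
As a sanity check independent of the LLM evaluator, the two approaches can be run side by side. The sketch below adapts the brute-force submission and the prefix-sum approach of the reference so that they return lists instead of printing, which is a change made here only to allow comparison; the function names and the sample data are hypothetical.

```python
def range_sums_bruteforce(arr, queries):
    """Sum arr[start..end] (inclusive) for each query by looping."""
    results = []
    for start, end in queries:
        total = 0
        for j in range(start, end + 1):
            total += arr[j]
        results.append(total)
    return results


def range_sums_prefix(arr, queries):
    """Same sums, using a running prefix-sum array."""
    prefix = []
    running = 0
    for value in arr:
        running += value
        prefix.append(running)  # prefix[i] == sum(arr[0..i])
    # sum(arr[start..end]) == prefix[end] - prefix[start] + arr[start]
    return [prefix[end] - prefix[start] + arr[start] for start, end in queries]


arr = [3, 1, 4, 1, 5, 9]
queries = [[0, 2], [1, 4], [5, 5]]

print(range_sums_bruteforce(arr, queries))
print(range_sums_prefix(arr, queries))
```

If both functions return the same list for a range of inputs, that supports a "correct" verdict from the evaluator, although it does not replace proper testing.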

In [18]:
# test 1

eval_result = evaluator.evaluate_strings(
    input="""write a function solution(arr,queries) which takes in 2 parameters, arr(first element is at index 0) and queries where arr is an integer list and queries is a 2-D list which contains pairs of integers. For every pair in queries, print the sum of elements in arr in the range pair[0] to pair[1]
    """,
    prediction=correctness_test_1,
    reference=reference,
)

# feed the evaluator's reasoning into a second prompt that asks for a mark
test = eval_result["reasoning"]
template = "If the question is worth {marks} marks, based on your reasoning: {reason}, how many marks would you award the solution?"
prompt = PromptTemplate(template=template, input_variables=["marks", "reason"])
chain = LLMChain(llm=llm, prompt=prompt, verbose=False)

print(f'Reasoning: {eval_result["reasoning"]}')
print(chain.run(marks=3, reason=test))


Reasoning: Step 1: Check for correctness
- The submission appears to be correct as it takes in two parameters (arr and queries) and uses two for loops to iterate through the queries and calculate the sum of elements in the given range.
- It also correctly prints out the sum for each pair in queries.

Step 2: Check for accuracy
- The submission accurately calculates the sum of elements in the given range, as it uses the correct index values and includes the last element in the range (queries[i][1]+1).

Step 3: Check for factual
- The submission follows the given task and input correctly, as it uses the correct parameters and follows the correct instructions for calculating the sum of elements in the given range.

Conclusion: Based on the assessment, the submission meets all the criteria of correctness, accuracy, and factual. Therefore, the single character "Y" is printed on its own line, followed by a new line with just the letter "Y".

Y


Based on the assessment, I would award the sol

## Custom Criteria
- LangChain supports custom criteria and predefined principles for evaluation
- Custom criteria can be defined as key-value pairs {criterion_name : criterion_description}
- These criteria can be used to assess outputs based on requirements or rubrics

**Note**
From the [LangChain documentation](https://python.langchain.com/docs/guides/evaluation/string/criteria_eval_chain): it is recommended that you create a single evaluator per criterion. This way, separate feedback can be provided for each aspect. Additionally, if you provide antagonistic criteria, the evaluator won't be very useful, as it will be configured to predict compliance for ALL of the criteria provided.

In [19]:
custom_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
    "precision": "Is the writing precise, with no unnecessary words or details?",
    "truthfulness": "Does the writing feel honest and sincere?"
}
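
Following the note about using one evaluator per criterion, a dictionary like the one above could be split into single-criterion evaluators. This is a sketch only: `build_evaluators` and its `make_evaluator` argument are hypothetical names, and a stand-in function is used so the example runs without an API key.

```python
custom_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
    "precision": "Is the writing precise, with no unnecessary words or details?",
    "truthfulness": "Does the writing feel honest and sincere?",
}


def build_evaluators(criteria, make_evaluator):
    """Create one evaluator per criterion.

    make_evaluator is any callable that takes a single-entry
    {name: description} dict and returns an evaluator object.
    """
    return {
        name: make_evaluator({name: description})
        for name, description in criteria.items()
    }


# Stand-in factory for illustration: it just returns the criterion dict.
evaluators = build_evaluators(custom_criteria, lambda criterion: criterion)

for name, evaluator in evaluators.items():
    print(name, "->", evaluator)
```

With LangChain available, `make_evaluator` could plausibly be something like `lambda c: load_evaluator("criteria", criteria=c, llm=llm)`, so that each criterion gets its own reasoning and verdict.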

In [20]:
# use the default setting (None) for llm
# when another model is passed in, LangChain warns that:
# This chain was only tested with GPT-4. Performance may be significantly worse with other models.
# hence the llm parameter is left unset
evaluator = load_evaluator("pairwise_string", criteria=custom_criteria)

In [21]:
# notice the model_name is gpt-4 when llm parameter is not set
print(evaluator)

prompt=ChatPromptTemplate(input_variables=['input', 'prediction', 'prediction_b'], partial_variables={'reference': '', 'criteria': 'For this evaluation, you should primarily consider the following criteria:\nsimplicity: Is the language straightforward and unpretentious?\nclarity: Are the sentences clear and easy to understand?\nprecision: Is the writing precise, with no unnecessary words or details?\ntruthfulness: Does the writing feel honest and sincere?'}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user\'s instructions and answers \the user\'s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two resp

In [22]:
eval_result = evaluator.evaluate_string_pairs(
    prediction="Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.",
    prediction_b="Where one finds a symphony of joy, every domicile of happiness resounds in harmonious,"
                 "identical notes; yet, every abode of despair conducts a dissonant orchestra, each "
                 "playing an elegy of grief that is peculiar and profound to its own existence.",
    input="Write some prose about families.",
)

In [23]:
eval_result

{'reasoning': "Assistant A's response is simpler, clearer, and more precise than Assistant B's. Assistant A uses straightforward language and clear sentences, making it easy to understand. On the other hand, Assistant B's response is more complex and uses more pretentious language, which makes it less clear and less straightforward. Both responses seem honest and sincere, so they are equal in terms of truthfulness. Therefore, considering the criteria of simplicity, clarity, precision, and truthfulness, Assistant A's response is superior. \n\nFinal Verdict: [[A]]",
 'value': 'A',
 'score': 1}
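
A single pairwise verdict says little on its own, so results from several comparisons are usually aggregated. The sketch below tallies the `value` field of result dictionaries shaped like the one above. The helper name `tally_preferences` and the sample results are hypothetical, and treating any value other than 'A' or 'B' as a tie is an assumption.

```python
from collections import Counter


def tally_preferences(results):
    """Return the share of comparisons won by A, by B, or tied."""
    counts = Counter()
    for result in results:
        value = result.get("value")
        # Assumption: anything other than 'A' or 'B' counts as a tie.
        counts[value if value in ("A", "B") else "tie"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in ("A", "B", "tie")}


# Hypothetical results from four pairwise evaluations.
sample_results = [
    {"value": "A", "score": 1},
    {"value": "B", "score": 0},
    {"value": "A", "score": 1},
    {"value": None, "score": 0.5},
]

print(tally_preferences(sample_results))
```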