# Challenge 1

### Description: Evaluating the performance of LLM-based solutions is crucial for ensuring their reliability and quality. Manual human evaluation is time-consuming, expensive, and subject to variability.


#### Details: Develop a tool that evaluates LLM-based applications accurately and efficiently. The tool should simplify LLM evaluation and reduce manual effort by providing reliable metrics and insights to developers. Its results should align with manual human evaluation. By covering various evaluation aspects, including RAG (retrieval-augmented generation) based solutions, the tool will streamline the evaluation process.

Example features (these are ideas to inspire your development and do not necessarily need to be implemented):

* RAG-based solution focus
* Human-like evaluation
* Customizable evaluation criteria
* Benchmarking and comparison
* Integration with development pipeline

In [1]:
from transformers import pipeline

In [2]:
import ollama
response = ollama.chat(model='llama3', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  }
])
print(response['message']['content'])

The color of the sky is a fascinating topic that has puzzled humans for centuries. The short answer is: the sky appears blue because of the way that light interacts with the Earth's atmosphere.

Here's a more detailed explanation:

1. **Solar radiation**: When sunlight enters the Earth's atmosphere, it contains all the colors of the visible spectrum, including red, orange, yellow, green, blue, indigo, and violet.
2. **Scattering**: As this solar radiation travels through the atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2), as well as aerosols like dust, pollen, and water vapor. These particles scatter the light in all directions, a process known as Rayleigh scattering.
3. **Blue light dominance**: The shorter wavelengths of light, particularly blue and violet, are scattered more than the longer wavelengths, like red and orange. This is because smaller particles scatter shorter wavelengths more effectively.
4. **Atmospheric filtering**: As the sca

In [5]:
import ollama


response = ollama.chat(model='gemma:7b', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
])
print(response['message']['content'])

The sky is blue due to a phenomenon called **Rayleigh scattering**. 

* Sunlight is composed of all the colors of the rainbow, each with a specific wavelength.
* When sunlight interacts with molecules in the atmosphere, such as nitrogen and oxygen, the shorter wavelengths (blue light) are scattered more efficiently than the longer wavelengths (red light).
* The scattered blue light is dispersed in all directions, but our eyes are most sensitive to blue light from the direction opposite the sun. 
* Since the sun is overhead during the day, we primarily see the scattered blue light from the sky above us, which is why the sky appears blue.


In [3]:
import ollama
import textwrap

response = ollama.chat(model='mistral', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  }
])

# Wrap text to fit within a specified width (e.g., 80 characters)
wrapped_response = textwrap.fill(response['message']['content'], width=80)

# Print the wrapped text
print(wrapped_response)

 The sky appears blue during a clear day due to a process called Rayleigh
scattering. As sunlight reaches Earth, it is made up of different wavelengths or
colors. Short wavelength light (like violet and blue) is scattered in all
directions more than other colors because they have smaller wavelengths and
interact more with the nitrogen and oxygen molecules in the atmosphere. Our eyes
are more sensitive to blue light and less blue light reaches us from the direct
path, making the sky appear blue instead of violet. However, when we see the sun
at sunrise or sunset, it appears red or orange because most of the blue light
has been scattered away, leaving only longer wavelengths (like red and yellow)
to reach our eyes.


In [7]:
import ollama
response = ollama.chat(model='phi3:mini', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  }
])
print(response['message']['content'])

 The sky appears blue to our eyes because of a process called Rayleigh scattering. As sunlight travels through Earth's atmosphere, it encounters molecules and small particles that are much smaller than the wavelength of visible light. When white sunlight enters the atmosphere, the shorter-wavelength blue colors get scattered in many directions by these tiny atmospheric particles more so than the other colors like red or yellow due to their shorter wavelengths. Our eyes perceive this as a blue sky.


Here's how Rayleigh scattering works:

1. Sunlight enters Earth's atmosphere and is made up of various colors, each with its specific wavelength.

2. The atmosphere contains gases and particles that are smaller than the light's wavelength (ranging from 400 nanometers for violet to over 700 nanometers for red).

3. These small particles scatter shorter wavelengths of light (blue and violet) more effectively than longer wavelengths (red, yellow, etc.).

4. Although violet light is scattered e

In [67]:
# evaluation
import textwrap

import ollama
import pandas as pd

models = ['llama3', 'mistral', 'gemma:7b', 'phi3:mini']

# One row per prompt: a human rating for each model plus the prompt text.
models_df = pd.DataFrame(columns=models + ['Prompt'])

input_prompts = ['Why is the sky blue?', 'What is the meaning of life?', 'What is the capital of France?', 'What is the largest planet in our solar system?', 'What is the speed of light?',
                 'What is the boiling point of water?', 'What is the population of the United States?', 'What is the largest ocean on Earth?', 'What is the tallest mountain in the world?', 'What is the longest river in the world?',
                 'What is the largest desert in the world?', 'What is the largest rainforest in the world?', 'What is the largest continent on Earth?', 'What is the largest country in the world?', 'What is the smallest country in the world?',
                 'What is the largest city in the world?', 'What is the smallest city in the world?', 'What is the largest animal in the world?', 'What is the smallest animal in the world?', 'What is the largest bird in the world?',
                 'What is the smallest bird in the world?', 'What is the largest fish in the world?', 'What is the smallest fish in the world?', 'What is the largest mammal in the world?', 'What is the smallest mammal in the world?',
                 'What is the largest reptile in the world?', 'What is the smallest reptile in the world?', 'What is the largest amphibian in the world?', 'What is the smallest amphibian in the world?', 'What is the largest insect in the world?']

for input_prompt in input_prompts:
    # Show every model's answer to the current prompt.
    for model in models:
        response = ollama.chat(model=model, messages=[
            {
                'role': 'user',
                'content': input_prompt,
            }
        ])
        wrapped_response = textwrap.fill(response['message']['content'], width=170)
        print(f'{model}: {wrapped_response}')
    # Ask the human rater for a score from 1 to 4 for each model's answer.
    ratings = []
    for model in models:
        rating = int(input(f'Rate the response of {model} from 1 to 4: '))
        ratings.append(rating)
    # Append the ratings and the prompt as one new row.
    data = ratings + [input_prompt]
    models_df = pd.concat([models_df, pd.DataFrame([data], columns=models + ['Prompt'])])



llama3: The sky appears blue to us because of a phenomenon called Rayleigh scattering, which is the scattering of light by small particles or molecules in the atmosphere. The
shorter, blue wavelengths are scattered more than the longer, red wavelengths, resulting in the blue color we see.  Here's a simplified explanation:  1. When sunlight
enters Earth's atmosphere, it encounters tiny molecules of gases like nitrogen (N2) and oxygen (O2). 2. These molecules scatter the light in all directions. 3. The
smaller molecules are more effective at scattering shorter wavelengths of light, such as blue and violet. 4. The larger molecules are better at scattering longer
wavelengths, like red and orange. 5. As a result, the blue light is scattered more than other colors, giving the sky its blue appearance.  This effect is most pronounced
during the daytime when the sun is overhead, and the amount of scattered blue light reaching your eyes from all directions creates the bright blue color we associ
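
The rating prompt in the loop above accepts any integer, and a non-numeric entry would raise and stop the run. A minimal sketch of a guarded prompt follows; `parse_rating` and `ask_rating` are hypothetical helper names, not part of the notebook, and the 1 to 4 range is taken from the prompt text.

```python
def parse_rating(text, low=1, high=4):
    """Return the rating as an int, or None if it is not a whole number in [low, high]."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if low <= value <= high else None


def ask_rating(model, low=1, high=4, read=input):
    """Keep asking until the rater enters a valid rating for `model`.

    `read` defaults to the built-in input(); it is a parameter only so the
    function can be exercised without a keyboard.
    """
    while True:
        answer = read(f'Rate the response of {model} from {low} to {high}: ')
        rating = parse_rating(answer, low, high)
        if rating is not None:
            return rating
        print(f'Please enter a whole number between {low} and {high}.')
```

Inside the loop, `rating = ask_rating(model)` could then stand in for the bare `int(input(...))` call.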

In [68]:
models_df

| | llama3 | mistral | gemma:7b | phi3:mini | Prompt |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 4 | 3 | Why is the sky blue? |
| 0 | 2 | 1 | 3 | 4 | What is the meaning of life? |
| 0 | 2 | 3 | 1 | 4 | What is the capital of France? |
| 0 | 2 | 3 | 1 | 4 | What is the largest planet in our solar system? |
| 0 | 2 | 1 | 4 | 3 | What is the speed of light? |
| 0 | 2 | 1 | 4 | 3 | What is the boiling point of water? |
| 0 | 1 | 2 | 3 | 4 | What is the population of the United States? |
| 0 | 2 | 3 | 1 | 4 | What is the largest ocean on Earth? |
| 0 | 2 | 3 | 1 | 4 | What is the tallest mountain in the world? |
| 0 | 2 | 3 | 4 | 1 | What is the longest river in the world? |


In [69]:
df = models_df.drop(columns=['Prompt'])
sum_of_elements = df.values.sum()

normalized_df = df / sum_of_elements

# Print the normalized dataframe
print(normalized_df)

     llama3   mistral  gemma:7b phi3:mini
0  0.003333  0.006667  0.013333      0.01
0  0.006667  0.003333      0.01  0.013333
0  0.006667      0.01  0.003333  0.013333
0  0.006667      0.01  0.003333  0.013333
0  0.006667  0.003333  0.013333      0.01
0  0.006667  0.003333  0.013333      0.01
0  0.003333  0.006667      0.01  0.013333
0  0.006667      0.01  0.003333  0.013333
0  0.006667      0.01  0.003333  0.013333
0  0.006667      0.01  0.013333  0.003333
0  0.006667  0.003333  0.013333      0.01
0  0.006667      0.01  0.013333  0.003333
0  0.003333  0.006667      0.01  0.013333
0  0.003333  0.006667      0.01  0.013333
0  0.006667  0.003333      0.01  0.013333
0  0.006667  0.003333      0.01  0.013333
0  0.003333  0.006667      0.01  0.013333
0  0.003333  0.006667      0.01  0.013333
0  0.006667      0.01  0.013333  0.003333
0  0.003333  0.006667      0.01  0.013333
0  0.003333  0.006667      0.01  0.013333
0  0.006667  0.013333      0.01  0.003333
0  0.013333  0.006667      0.01  0

In [70]:
model_raitings = normalized_df.sum(axis=0)
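
The two cells above divide every rating by the grand total and then sum each column, so each model ends up with its share of all rating points. The printed value 0.003333 for a rating of 1 equals 1/300, which is consistent with 30 prompts whose four ratings sum to 10 each. A small sketch on toy data (the values below are made up for illustration) shows the same arithmetic:

```python
import pandas as pd

# Toy ratings (hypothetical values): one row per prompt, one column per model.
toy_ratings = pd.DataFrame(
    [[1, 2, 4, 3],
     [2, 1, 3, 4],
     [2, 3, 1, 4]],
    columns=['llama3', 'mistral', 'gemma:7b', 'phi3:mini'],
)

total = toy_ratings.values.sum()       # 3 rows x (1 + 2 + 3 + 4) = 30
normalized = toy_ratings / total       # every cell becomes rating / 30
model_share = normalized.sum(axis=0)   # per-model share of all rating points

print(model_share)
```

The shares always sum to 1. Whether a larger share is better depends on whether 1 or 4 is the best rating, which the notebook does not state.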

In [71]:
import pandas as pd

# Data in dictionary format
data = {
    "Benchmarking - Winogrande": [83.1, 75.3, 72.3, 72.5],
    "Benchmarking - ARC e": [93.0, 80.0, 81.5, 95.2],
    "Benchmarking - ARC c": [93.0, 55.5, 53.2, 84.0],
    "Benchmarking - TriviaQA": [89.7, 69.9, 63.4, 57.1]
}

# Create DataFrame
df = pd.DataFrame(data, index=["LLAMA3 8b", "MISTRAL 7b", "GEMMA 7b", "PHI 3 mini"])

In [72]:
for col in df.columns:
    df[col] = df[col] / df[col].sum()

In [73]:
df

| | Benchmarking - Winogrande | Benchmarking - ARC e | Benchmarking - ARC c | Benchmarking - TriviaQA |
|---|---|---|---|---|
| LLAMA3 8b | 0.274077 | 0.265942 | 0.325516 | 0.320243 |
| MISTRAL 7b | 0.248351 | 0.228768 | 0.19426 | 0.249554 |
| GEMMA 7b | 0.238456 | 0.233057 | 0.186209 | 0.226348 |
| PHI 3 mini | 0.239116 | 0.272233 | 0.294015 | 0.203856 |


In [74]:
import numpy as np
weights = {}
for benchmark_col in df.columns:
    weight = np.dot(df[benchmark_col], model_raitings)
    weights[benchmark_col] = weight
    

In [75]:
total_sum = sum(weights.values())
weights_normalized = {key: value / total_sum for key, value in weights.items()}

In [76]:
weights_normalized

{'Benchmarking - Winogrande': 0.25141870701837643,
 'Benchmarking - ARC e': 0.25461410120360656,
 'Benchmarking - ARC c': 0.2495843486736346,
 'Benchmarking - TriviaQA': 0.24438284310438246}
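
The weight cells above rely on `np.dot` pairing the benchmark rows with the rating columns by position, which works here because both list the models in the same order. The sketch below does the same computation but aligns the two by label; `benchmark_weights` and `name_map` are hypothetical names introduced for illustration, and the toy numbers are made up.

```python
import pandas as pd


def benchmark_weights(benchmarks, ratings, name_map):
    """Weight each benchmark by the dot product of its normalized scores with the ratings.

    benchmarks: DataFrame, rows = models, columns = benchmarks (raw scores).
    ratings:    Series of per-model rating shares.
    name_map:   dict mapping benchmark row labels to rating labels.
    """
    # Normalize each benchmark column so it sums to 1 across models.
    normalized = benchmarks / benchmarks.sum(axis=0)
    # Relabel rows so both objects share one index, then align explicitly.
    normalized = normalized.rename(index=name_map)
    aligned = ratings.reindex(normalized.index)
    if aligned.isna().any():
        raise ValueError('every benchmark row needs a matching rating')
    # Dot product per benchmark column, then rescale so the weights sum to 1.
    raw = normalized.mul(aligned, axis=0).sum(axis=0)
    return raw / raw.sum()


# Toy example: two models, two benchmarks, ratings given in a different order.
benchmarks = pd.DataFrame(
    {'A': [80.0, 20.0], 'B': [50.0, 50.0]},
    index=['MODEL X', 'MODEL Y'],
)
ratings = pd.Series({'y': 0.25, 'x': 0.75})
weights = benchmark_weights(benchmarks, ratings, {'MODEL X': 'x', 'MODEL Y': 'y'})
print(weights)
```

Aligning by label means a reordered model list changes nothing, whereas the positional version would silently pair the wrong models.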

### Conclusion

Standard benchmarks are already widely used to evaluate LLMs, so we concluded that a better way to evaluate LLMs universally is to build on those existing benchmarks.
Here is a statistical view of how we arrived at our model:
1. We selected 4 of the leading LLMs developed by the major vendors.
2. We collated their performance on certain benchmarks.
3. We normalised the benchmark scores so that each benchmark's scores sum to 1 across the 4 models.
4. We wrote a script that calls the 4 models on curated prompts and records a score from 1 to 4 for each model's answer.
5. This gave us 2 dataframes: one with scores from the standard benchmarks and one with scores on our prompts.
6. We evaluated the relevance of the existing benchmarks by factoring in the human scoring.
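
In symbols, the steps above can be summarised as follows. This is a sketch reconstructed from the notebook cells, and the notation is introduced here: $s_{m,b}$ is the raw score of model $m$ on benchmark $b$, $r_{p,m}$ is the human score of model $m$ on prompt $p$, and $w_b$ is the resulting weight of benchmark $b$.

```latex
\hat{s}_{m,b} = \frac{s_{m,b}}{\sum_{m'} s_{m',b}}, \qquad
\hat{r}_{m} = \frac{\sum_{p} r_{p,m}}{\sum_{p}\sum_{m'} r_{p,m'}}, \qquad
w_{b} = \frac{\sum_{m} \hat{s}_{m,b}\,\hat{r}_{m}}{\sum_{b'}\sum_{m} \hat{s}_{m,b'}\,\hat{r}_{m}}
```

Because $\hat{r}_{m}$ sums to 1 over the models, the numerator of $w_b$ is a weighted average of the normalised scores on benchmark $b$. With four models whose normalised scores are all fairly close to 0.25, the weights stay close to uniform, which matches the printed values: all four lie within about 0.006 of 0.25.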

Team Members

1. Illia 
2. Sai
3. Lilly