# **Introduction**

**IMPORTANT:** The installation guide can be found in readme.md.

This project aims to make existing personal notes on various topics more useful with the help of Large Language Models (LLMs). The notes are written in Markdown and are therefore easy to read in raw form for both machines and humans. In this project, however, the notes are mainly used through the note-taking app Obsidian (https://obsidian.md).

## **Obsidian and Open-Source**

Obsidian stands out for its transparency compared to other apps like Notion. Because notes are stored as Markdown files, they are independent of the Obsidian app and can be read just as well in a simple text editor. To support this approach, which fits the open-source idea very well, I aim to use (open-source) non-proprietary LLMs, like Llama, to "chat" with my notes. This gives me better access to the knowledge I have saved in my notes and, if necessary, extends it with the knowledge of the LLM.

## **RAG Workflow**

Obsidian saves all Markdown files in one folder, which can of course contain a more complex sub-structure. This folder is called the "vault". To give an LLM access to my vault, I create contextual embeddings of the note files. Contextual embeddings are created with an embedding model; they do not just convert single words into a numerical representation (vectors), but also capture the context the words appear in. The vectors are then stored in a persistent database that can be queried at answer time. Every time I send a prompt, it is compared to the embedded notes, and useful passages, if available, are retrieved and passed to the LLM as context for its answer. This setup is known as Retrieval-Augmented Generation (RAG).
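The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the code of this project: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model (it ignores context, unlike nomic-embed-text), the note names and contents are made up, and a plain dictionary takes the place of the persistent vector database.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, notes: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k notes most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(
        notes,
        key=lambda name: cosine_similarity(query_vec, toy_embed(notes[name])),
        reverse=True,
    )
    return ranked[:k]


# Made-up vault content for illustration only
notes = {
    "Decision Trees.md": "decision trees split data into branches and leaf nodes",
    "Pandas Dataframe.md": "a pandas dataframe stores tabular data in rows and columns",
}
top = retrieve("how do decision trees split data", notes)
print(top)
```

In a real setup the embeddings would be computed once and stored, so that only the prompt has to be embedded at query time.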
 
---

## **Challenges**

### **1. The perfect prompt**

A major challenge is prompt engineering, that is, writing the prompt that yields the best possible answer. To do that, I first created a list of requirements for the LLM's answer:
1. The LLM should only use my notes if they actually fit the question and are useful; otherwise it should fall back on its own general knowledge.
2. The LLM should tell me which source it used (my notes or its own knowledge).

To find a prompt that gets the LLM to satisfy both requirements, I used ChatGPT 4o for help and came up with the following prompt: "Answer the following question based on the provided notes, if relevant and available, otherwise use your own knowledge to provide an accurate response. After answering, indicate the source by specifying either 'from notes' or 'from general knowledge'. Here’s the question:"
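As a rough sketch of how such a prompt might be assembled at query time: the instruction text is the one quoted above, while the function name `build_prompt`, the "Notes:" layout, and the placeholder for empty retrieval results are assumptions made for illustration and may differ from the actual application.

```python
INSTRUCTION = (
    "Answer the following question based on the provided notes, if relevant "
    "and available, otherwise use your own knowledge to provide an accurate "
    "response. After answering, indicate the source by specifying either "
    "'from notes' or 'from general knowledge'. Here's the question:"
)


def build_prompt(question: str, retrieved_notes: list[str]) -> str:
    """Combine retrieved note passages, the fixed instruction, and the question."""
    if retrieved_notes:
        notes_block = "\n\n".join(retrieved_notes)
    else:
        # Assumed placeholder so the model can tell that nothing was retrieved
        notes_block = "(no matching notes found)"
    return f"Notes:\n{notes_block}\n\n{INSTRUCTION} {question}"


prompt = build_prompt(
    "What is a decision tree?",
    ["A decision tree splits data step by step."],
)
print(prompt)
```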

### **2. Evaluation**

Evaluating an LLM is difficult, as there are no simple metrics like accuracy. I therefore evaluate the entire RAG workflow (LLM plus embedding model) specifically for my use case, that is, how useful it really is. To do this, I came up with the following questions on which I evaluate the workflow:

1. How completely was the matching content available in the notes used in the response?
2. How well/effectively was the available content used in the response?
3. Were all points mentioned in the prompt addressed? (sub-questions, request to provide the source, including example names, etc.)
4. Extent of hallucination and/or use of wrong or unintended information (so that it can be included in a total score, this score ranges from -10 to 0, with -10 = much hallucination and 0 = no hallucination)

Questions 1 to 3 are each scored on a scale from 0 to 10, where 0 is the worst and 10 the best possible outcome; question 4 uses the -10 to 0 range described above. The total score per answer is the sum of the four scores and therefore ranges from -10 to 30.
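The scoring scheme can be written down as a small helper. This sketch assumes the total is the plain sum of the four scores, which matches the totals in the results table in the Results section (for example 9 + 10 + 10 + 0 = 29); the function and parameter names are made up for illustration.

```python
def total_score(completeness: int, effectiveness: int,
                prompt_coverage: int, hallucination: int) -> int:
    """Sum the four evaluation scores after checking their allowed ranges."""
    for value in (completeness, effectiveness, prompt_coverage):
        if not 0 <= value <= 10:
            raise ValueError("scores 1 to 3 must be between 0 and 10")
    if not -10 <= hallucination <= 0:
        raise ValueError("the hallucination score must be between -10 and 0")
    return completeness + effectiveness + prompt_coverage + hallucination


# Scores taken from the results table: llama 3, first decision trees question
print(total_score(9, 10, 10, 0))
# A pure hallucination answer: phi 3.5, second decision trees question
print(total_score(0, 0, 0, -10))
```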

## **LLMs and Embedding-Models Choice**

I tested the following LLMs:

1. **Llama 3 8B**
2. **Llama 3.2 1B**
3. **gemma2 2B**
4. **phi 3.5 3.8B**

As the embedding model, I use **nomic-embed-text**.

---

## **The Application**

The actual application, which provides a chat interface for communicating with the LLM, is a Streamlit GUI. A ready-to-build Dockerfile is included in this repo: build the image with docker build and then start a container from it.

---

# **Testing**

This repo contains 3 note files in Markdown format, which are copies from my original vault. I ask every model 3 questions about each note and evaluate the answers against the questions listed above. The results can be found in results.xlsx and are presented below. Each question is asked only once, and this first answer is the one I evaluate.

For each note, the first question simply asks the model what it can tell me about the topic named in the note's title, based on my notes. For example, if the note is called "RAG Workflows", I ask the model to tell me about RAG workflows, based on my notes. The second and third questions are topic-specific follow-ups. All questions and answers are included in the results file.

**REMARK**

The rest of my notes are not included in this repo. Sometimes the RAG workflow retrieved "wrong" notes, or notes other than the ones I intended the LLM to use, which led to mistakes that aren't reproducible with this repo.

---

# **Results**

First, I present the scores achieved by each model and name an overall winner and a winner per topic/note. After that, I present my overall observations about the answers and styles of the LLMs.

## **I. Scoring**

First, I simply load the results, including all questions, answers and the models' scores.

In [17]:
import pandas as pd

results = pd.read_excel('results.xlsx')
results.head(3)

| Unnamed: 0 | note | question | llama3 -answer | llama 3 - score 1 | llama 3 - score 2 | llama 3 - score 3 | llama 3 - score 4 | llama 3 - total | llama 3.2 - answer | llama 3.2 - score 1 | ... | gemma 2 - score 2 | gemma 2 - score 3 | gemma 2 - score 4 | gemma 2 - total | phi 3.5 - answer | phi 3.5 - score 1 | phi 3.5 - score 2 | phi 3.5 - score 3 | phi 3.5 - score 4 | phi 3.5 - total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | decision trees | what can you tell me about decision trees, bas... | Based on your notes, here's what I can summari... | 9 | 10 | 10 | 0 | 29 | I can provide an answer based on my general kn... | 8 | ... | 10 | 8 | -1 | 26 | Based on the provided notes from a data scienc... | 9 | 10 | 8 | -1 | 26 |
| 1 | decision trees | Explain to me what "breadth" means in the cont... | According to your notes, "Breadth" refers to t... | 10 | 7 | 10 | 0 | 27 | From notes:\n\nIn the context of decision tree... | 7 | ... | 10 | 10 | 0 | 30 | In the provided note excerpts regarding decisi... | 0 | 0 | 0 | -10 | -10 |
| 2 | decision trees | What methods to fight overfitting are there in... | What methods to fight overfitting are there in... | 10 | 9 | 10 | 0 | 29 | From notes:\n\nOverfitting is a common issue w... | 0 | ... | 10 | 10 | 0 | 30 | From the provided "Notes" on Decision Trees wi... | 10 | 7 | 9 | -1 | 25 |


However, we are only interested in the scores here.

In [18]:
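# Keep note, question, and each model's answer and total score (selected by column position)
scores = results.iloc[:, [0, 1, 2, 7, 8, 13, 14, 19, 20, 25]]
scores.head(3)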

| Unnamed: 0 | note | question | llama3 -answer | llama 3 - total | llama 3.2 - answer | llama 3.2 - total | gemma 2 - answer | gemma 2 - total | phi 3.5 - answer | phi 3.5 - total |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | decision trees | what can you tell me about decision trees, bas... | Based on your notes, here's what I can summari... | 29 | I can provide an answer based on my general kn... | 20 | Based on your notes, here's a breakdown of dec... | 26 | Based on the provided notes from a data scienc... | 26 |
| 1 | decision trees | Explain to me what "breadth" means in the cont... | According to your notes, "Breadth" refers to t... | 27 | From notes:\n\nIn the context of decision tree... | 13 | In the context of decision trees, "breadth" re... | 30 | In the provided note excerpts regarding decisi... | -10 |
| 2 | decision trees | What methods to fight overfitting are there in... | What methods to fight overfitting are there in... | 29 | From notes:\n\nOverfitting is a common issue w... | -3 | Here are a few methods for fighting overfittin... | 30 | From the provided "Notes" on Decision Trees wi... | 25 |
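Selecting columns by position stops working as intended as soon as a column is added to or removed from results.xlsx. A possible alternative, sketched here on a tiny stand-in DataFrame that only mimics the column naming pattern of the results file, is to select by column name. The four numeric values are copied from the first row of the table above; the question text and the "..." answers are placeholders.

```python
import pandas as pd

# Stand-in for results.xlsx: same naming pattern, only two models, one row
results = pd.DataFrame({
    "note": ["decision trees"],
    "question": ["example question"],
    "llama3 -answer": ["..."],
    "llama 3 - score 1": [9],
    "llama 3 - total": [29],
    "phi 3.5 - answer": ["..."],
    "phi 3.5 - score 1": [9],
    "phi 3.5 - total": [26],
})

# Keep the identifying columns plus every answer and total column
keep = [
    col for col in results.columns
    if col in ("note", "question") or col.endswith(("answer", "total"))
]
scores = results[keep]
print(list(scores.columns))
```

The positional indices used elsewhere in this notebook would then have to be replaced by name-based selections as well, for example with a list of all columns ending in "total".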


### **Topic 1 - general**

To get a general feeling for the LLM and its style of writing, I ask two simple general questions:

1. "Hi, how are you doing?"
2. "Who are you?"

The second question is especially interesting to look at.

In [19]:
general = scores.tail(2)
general

| Unnamed: 0 | note | question | llama3 -answer | llama 3 - total | llama 3.2 - answer | llama 3.2 - total | gemma 2 - answer | gemma 2 - total | phi 3.5 - answer | phi 3.5 - total |
|---|---|---|---|---|---|---|---|---|---|---|
| 9 | general | Hi, how are you doing? | I'm just an AI, I don't have a physical presen... | 30 | I'm doing well, thank you for asking.\n\nFrom ... | 27 | Hello! I'm doing well, thank you for asking. 😊... | 27 | I'm just a language model AI developed by Micr... | 29 |
| 10 | general | Who are you? | I'm just an AI, I don't have a personal identi... | 30 | I'm not sure from the notes.\n\nFrom general k... | -10 | I am Gemma, an AI assistant trained to be info... | 29 | I'm just a language model AI developed by Micr... | 28 |


In [20]:
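# Select the second general question ("Who are you?")
q2 = general.tail(1)

# Print each model's answer (the answer columns sit at positions 2, 4, 6, 8)
for i in range(2, 10, 2):
    print(general.columns[i])
    print(q2.iloc[0, i])
    print("\n --- END OF ANSWER ---\n")

# Sum each model's total score over both general questions
general.iloc[:, [3, 5, 7, 9]].sum()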

llama3 -answer
I'm just an AI, I don't have a personal identity or physical existence. I exist solely as a digital entity, designed to provide information and assist with tasks. My purpose is to help users like you by answering questions and providing relevant information.

Source: from general knowledge

 --- END OF ANSWER ---

llama 3.2 - answer
I'm not sure from the notes.

From general knowledge: I don't have any information about Schwarz Kaufland & Lidl.

 --- END OF ANSWER ---

gemma 2 - answer
I am Gemma, an AI assistant trained to be informative and comprehensive.

Source: from notes

 --- END OF ANSWER ---

phi 3.5 - answer
I'm just a language model AI developed by Microsoft, so I don'thy have feelings. However, I'm fully operational and ready to assist you! How can I help today? (Answer sourced from general knowledge)

 --- END OF ANSWER ---



llama 3 - total      60
llama 3.2 - total    17
gemma 2 - total      56
phi 3.5 - total      57
dtype: int64

We can observe that all LLMs except llama3.2 "know" that they are an AI or LLM. In addition, gemma2 "knows" its name and phi3.5 "knows" its developer (Microsoft).

Overall, llama3, the model with the most parameters, reaches the highest score (60), closely followed by phi3.5 (57) and gemma2 (56). llama3.2, the smallest LLM, is clearly in last place (17).

---

### **Topic 2 - Decision Trees**

The questions about the Decision Trees note follow the structure described before: first a general question, then two follow-up questions.

1. what can you tell me about decision trees, based on my notes
2. Explain to me what "breadth" means in the context of Decision Trees
3. What methods to fight overfitting are there in the context of decision trees?

In [21]:
dtrees = scores.iloc[[0, 1, 2], :]
dtrees

| Unnamed: 0 | note | question | llama3 -answer | llama 3 - total | llama 3.2 - answer | llama 3.2 - total | gemma 2 - answer | gemma 2 - total | phi 3.5 - answer | phi 3.5 - total |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | decision trees | what can you tell me about decision trees, bas... | Based on your notes, here's what I can summari... | 29 | I can provide an answer based on my general kn... | 20 | Based on your notes, here's a breakdown of dec... | 26 | Based on the provided notes from a data scienc... | 26 |
| 1 | decision trees | Explain to me what "breadth" means in the cont... | According to your notes, "Breadth" refers to t... | 27 | From notes:\n\nIn the context of decision tree... | 13 | In the context of decision trees, "breadth" re... | 30 | In the provided note excerpts regarding decisi... | -10 |
| 2 | decision trees | What methods to fight overfitting are there in... | What methods to fight overfitting are there in... | 29 | From notes:\n\nOverfitting is a common issue w... | -3 | Here are a few methods for fighting overfittin... | 30 | From the provided "Notes" on Decision Trees wi... | 25 |


Let's take a look at the total scores for the Decision Trees questions.

In [22]:
dtrees.iloc[:,[3,5,7,9]].sum()

llama 3 - total      85
llama 3.2 - total    30
gemma 2 - total      86
phi 3.5 - total      41
dtype: int64

As we can see, Llama3 and gemma2 reach the highest scores, with about twice as many points as third-placed phi3.5. Llama3.2 is again clearly in last place.

As an example, let's look at the second question:

In [23]:
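# Pick the row of the second Decision Trees question
q2 = dtrees.iloc[[1], :]
# Answer columns sit at positions 2, 4, 6, 8; each model's total score follows its answer
for i in range(2, 10, 2):
    print(dtrees.columns[i])
    print(f"Reached score for 2nd question: {dtrees.iloc[1, i+1]}")
    print(q2.iloc[0, i])
    print("\n --- END OF ANSWER ---\n")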

llama3 -answer
Reached score for 2nd question: 27
According to your notes, "Breadth" refers to the number of leaf nodes in a decision tree.

Source: From Notes

 --- END OF ANSWER ---

llama 3.2 - answer
Reached score for 2nd question: 13
From notes:

In the context of decision trees, "breadth" refers to the number of nodes on a given level (or layer) of the tree. Specifically, it measures the depth or height of the tree.

Breadth is often represented graphically by the number of leaves (or terminal nodes) on each branch, rather than the number of internal nodes (also called "splitting decisions" in decision trees). The more branching, the greater the breadth.

In other words, if a node has many children, it's like having many different paths or branches that can be taken from the current point. This increased complexity often leads to better performance, as the tree is able to capture more relevant relationships between features and the target variable.

Breadth is an important aspect

These answers are representative of the overall ratings. Phi3.5 unfortunately wasn't able to answer the question at all and only produced hallucinations, therefore receiving a score of -10.

---

### **Topic 3 - Pandas Dataframe**

The questions about the Pandas Dataframe note follow the structure described before: first a general question, then two follow-up questions.

1. what can you tell me about pandas dataframes, based on my notes
2. How can I group a Dataframe by a column named "name"?
3. How to read-in an excel file with the name "collected-data" into a pandas dataframe?

In [24]:
pandas_df = scores.iloc[[3, 4, 5], :]
pandas_df

| Unnamed: 0 | note | question | llama3 -answer | llama 3 - total | llama 3.2 - answer | llama 3.2 - total | gemma 2 - answer | gemma 2 - total | phi 3.5 - answer | phi 3.5 - total |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | pandas df | what can you tell me about pandas dataframes, ... | Based on your notes, here's what I can tell yo... | 8 | Based on your provided notes, here's what I ca... | 14 | Based on the provided notes, here's what I kno... | 18 | From "notes" provided regarding machine learni... | 6 |
| 4 | pandas df | How can I group a Dataframe by a column named ... | From Notes:\n\nTo group a DataFrame by a colum... | 20 | Based on your provided notes, here's what I ca... | 0 | To group a DataFrame by a column named "name" ... | 10 | From Problemset 3 in your provided notes ("all... | 13 |
| 5 | pandas df | How to read-in an excel file with the name "co... | Based on the provided notes, I can see that Pa... | 30 | Based on your provided notes, here's what I ca... | 3 | To read an Excel file named "collected-data" i... | 27 | From Problemset 2 in your provided notes ("all... | 11 |


In [25]:
pandas_df.iloc[:,[3,5,7,9]].sum()

llama 3 - total      58
llama 3.2 - total    17
gemma 2 - total      55
phi 3.5 - total      30
dtype: int64

For this topic as well, Llama3 and gemma2 are the clear winners with only a small difference between them; phi3.5 comes in third place and llama3.2 last. Here I want to take a closer look at the answers of the llama3.2 model to show the reasons for its bad score.

First I will present the first question and answer.

In [26]:
print(f"Reached score: {pandas_df.iloc[0,5]}")
print(pandas_df.iloc[0,4])

Reached score: 14
Based on your provided notes, here's what I can infer about Pandas DataFrames:

You have mentioned handling NumPy arrays.
You have also talked about handling Pandas DataFrames.
Pandas DataFrames are a fundamental data structure in Python for data manipulation and analysis. They are similar to Excel tables or SQL tables but offer more features and functionality.

Here's a summary of what you can tell me about Pandas DataFrames based on your notes:

You have discussed how to handle Pandas DataFrames using NumPy arrays.
You haven't explicitly mentioned any specific methods or functions for handling DataFrames, but it seems that you're familiar with the basic concepts and operations.
If I had to extract more information from your notes, I would say that you've covered some of the essential topics related to Pandas DataFrames, such as:

Reading Data using Pandas
Creating DataFrames using data sources (e.g., NumPy arrays)
Filtering, grouping, aggregating, and plotting DataF

In this case, the answer reached a score of 14. In general, the LLM and the RAG workflow correctly identified the topic. However, the RAG workflow did not retrieve the note about Pandas Dataframes (which is included in this repo), but another note that mentions Dataframes. As a result, a lot of information that can be found in the notes is left out, leading to bad scores for completeness and for the use of the provided information.

Now let's take a look at the last question.

In [27]:
print(f"Reached score: {pandas_df.iloc[2,5]}")
print(pandas_df.iloc[2,4])

Reached score: 3
Based on your provided notes, here's what I can infer about reading in an Excel file:

You have discussed handling NumPy arrays.
You have also talked about handling Pandas DataFrames.
To read in an Excel file using the name "collected-data", you can use the read_excel function from pandas. Here's how it works:

import pandas as pd

# Read in the Excel file 'collected-data.xlsx'
df = pd.read_excel('collected-data.xlsx')

# Print the result
print(df)

This will output:

name  age city
0    John   28  New York
1     Anna   24    Paris
2      Peter   35  Berlin
3       Linda   32    London

The read_excel function takes the file path as an argument and returns a pandas DataFrame.

When to use reading in Excel files:

When you want to work with data that is stored in an Excel file
To load large datasets into memory (more efficient than reading files sequentially)
To perform data manipulation, filtering, or grouping operations on the data
Note: You can also specify additiona

Again, the RAG workflow retrieved a different note than intended. While the information provided is correct, the answer mentions a few unnecessary things and doesn't use some of the information available in the actual Pandas Dataframe note.

To conclude, the correct retrieval of relevant information by the RAG workflow is very important for the scoring, and a correct response depends heavily on it.
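One way to make such retrieval failures visible during testing is to record which note files were retrieved for each question and compare them with the note that was intended. The sketch below assumes the retriever can report the file names of the retrieved chunks; the function name, the file names, and the log entries are hypothetical and not taken from this repo.

```python
def retrieval_hit_rate(log: list[tuple[str, list[str]]]) -> float:
    """Share of questions for which the intended note was among the retrieved ones."""
    if not log:
        return 0.0
    hits = sum(1 for expected, retrieved in log if expected in retrieved)
    return hits / len(log)


# Hypothetical log entries: (intended note, notes actually retrieved)
example_log = [
    ("Pandas Dataframe.md", ["Numpy Arrays.md"]),
    ("Pandas Dataframe.md", ["Pandas Dataframe.md", "Numpy Arrays.md"]),
]
rate = retrieval_hit_rate(example_log)
print(rate)
```

A low hit rate would point to the embedding or retrieval step rather than to the LLM as the cause of a bad score.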

---

### **Topic 4 - VADER Sentiment Analysis**

The questions about the VADER Sentiment Analysis note follow the structure described before: first a general question, then two follow-up questions.

1. what can you tell me about vader sentiment analysis, based on my notes
2. what does the term "lexicon" mean in connection to VADER sentiment analysis?
3. can you find a simple code implementation for vader?

In [28]:
vader = scores.iloc[[6,7,8],:]
vader

| Unnamed: 0 | note | question | llama3 -answer | llama 3 - total | llama 3.2 - answer | llama 3.2 - total | gemma 2 - answer | gemma 2 - total | phi 3.5 - answer | phi 3.5 - total |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | vader | what can you tell me about vader sentiment ana... | Based on the provided notes, there is no menti... | 0 | Based on your provided notes, here's what I ca... | 0 | VADER (Valence Aware Dictionary for sEntiment ... | 30 | From "general knowledge," VADER (Valence Aware... | -8 |
| 7 | vader | what does the term "lexicon" mean in connectio... | Based on the provided notes, I can see that wo... | 0 | what does the term "lexicon" mean in connectio... | 7 | In the context of VADER sentiment analysis, a ... | 30 | In relation to VADER (Valence Aware Dictionary... | 28 |
| 8 | vader | can you find a simple code implementation for... | Yes, I can find a simple code implementation f... | 30 | from nltk.sentiment.vader import SentimentInte... | 30 | Yes, the provided code snippet demonstrates a ... | 20 | Yes, I can outline a simplified Python code ex... | -3 |


In [29]:
vader.iloc[:,[3,5,7,9]].sum()

llama 3 - total      30
llama 3.2 - total    37
gemma 2 - total      80
phi 3.5 - total      17
dtype: int64

Surprisingly, Llama3 does not reach first place in this topic and only lands in third place. Gemma2, however, reaches a similarly high score as before. Llama3.2 moves up to second place, while phi3.5 drops to last place.

Finally, let's take a look at the two answers from Llama3 that resulted in such a low score.

In [30]:
print(f"Reached score: {vader.iloc[0,3]}")
print(vader.iloc[0,2])

Reached score: 0
Based on the provided notes, there is no mention of "Vader Sentiment Analysis". Therefore, I will provide an answer based on my own knowledge.

Vader Sentiment Analysis is a tool used to analyze the sentiment (positive or negative) of text data. It was developed by Stanford Natural Language Processing Group and is widely used in natural language processing and text analysis applications.

It's not specifically mentioned in your notes, but it seems that you are studying Data Science, which includes tasks such as data cleaning, annotation, training models, and visualization. Vader Sentiment Analysis can be a useful tool for analyzing the sentiment of text data, which is a common task in many data science projects.

Source: from general knowledge


In [31]:
print(f"Reached score: {vader.iloc[1,3]}")
print(vader.iloc[1,2])

Reached score: 0
Based on the provided notes, I can see that word embeddings are discussed, but there's no mention of VADER sentiment analysis specifically. However, since you're asking about a term related to sentiment analysis, I'll provide an answer.

In the context of natural language processing and sentiment analysis, "lexicon" refers to a collection of words with their corresponding meanings or nuances. In other words, it's a dictionary-like resource that provides information on word semantics, connotations, and relationships.

In VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis specifically, the lexicon is likely used to capture the emotional tone of words, phrases, and sentences. The lexicon would contain pre-defined rules for identifying the sentiment (positive, negative, or neutral) of each word or phrase, allowing VADER to analyze text data and determine its overall sentiment.

Source: from general knowledge


For both questions, the RAG workflow failed and did not retrieve the correct note. Therefore, the LLM fell back on its own knowledge instead of using the provided notes.

---

### **Overall winner**

To conclude the scoring, I add up each model's topic scores and present the overall leaderboard.

In [32]:
# Add up each model's per-topic totals to get the overall score
general.iloc[:,[3,5,7,9]].sum() + dtrees.iloc[:,[3,5,7,9]].sum() + pandas_df.iloc[:,[3,5,7,9]].sum() + vader.iloc[:,[3,5,7,9]].sum()

llama 3 - total      233
llama 3.2 - total    101
gemma 2 - total      277
phi 3.5 - total      145
dtype: int64
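The same leaderboard can also be computed in one step by grouping on the note column instead of adding four separate slices. The sketch below rebuilds a small DataFrame from the per-question totals shown in the tables above, so it runs without results.xlsx; the variable names are made up for illustration.

```python
import pandas as pd

# Per-question totals copied from the tables above (one row per question)
totals = pd.DataFrame(
    [
        ("decision trees", 29, 20, 26, 26),
        ("decision trees", 27, 13, 30, -10),
        ("decision trees", 29, -3, 30, 25),
        ("pandas df", 8, 14, 18, 6),
        ("pandas df", 20, 0, 10, 13),
        ("pandas df", 30, 3, 27, 11),
        ("vader", 0, 0, 30, -8),
        ("vader", 0, 7, 30, 28),
        ("vader", 30, 30, 20, -3),
        ("general", 30, 27, 27, 29),
        ("general", 30, -10, 29, 28),
    ],
    columns=[
        "note",
        "llama 3 - total",
        "llama 3.2 - total",
        "gemma 2 - total",
        "phi 3.5 - total",
    ],
)

# One row per topic, one column per model
per_topic = totals.groupby("note").sum()
# Overall leaderboard, best model first
overall = per_topic.sum().sort_values(ascending=False)

print(per_topic)
print(overall)
```

The per-topic table produced this way also makes it easy to read off the winner of each topic.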

The leaderboard shows **Gemma2 with 2B parameters as the winner**, followed by Llama3 8B in second place, Phi3.5 3.8B in a distant third place, and lastly Llama3.2 1B.

---

## **II. Observations**

**IMPORTANT:** The following observations are **heavily biased** and were written based on my (the tester's) experience and opinion when using the models.

**Llama3**
- sticks close to provided content from notes
- always provides source in the same "manner"

**Llama3.2 1B**
- "strays" from provided content --> it mixes provided and own knowledge without clearly marking when it does so. While this can be useful, it's not clear how much the provided knowledge is used at all
- very often noticeably uses its own knowledge and still states it as "Source: your notes"
- sometimes mixes up topics or uses completely irrelevant information

**Gemma2 2B**
- uses emojis --> positive experience
- general tone and "way of talking" sounds better than Llama3.2 1B and Llama3
	- Uses rhetorical methods: (What are they? Pandas DataFrames are essentially tables of data. Think of them like spreadsheets or databases with rows and columns that organize your information neatly. Why are they useful? They provide a powerful way to store and manipulate structured data in Python, making it much easier than handling raw arrays.)
- often uses provided notes and adds own knowledge to further explain
	- but also keeps answers short if question is clear and only requires short, precise answer
	- nevertheless always states that the source is the notes

**Phi 3.5**
- sometimes hallucinates non-existent words (e.g., "don'thy")
- often uses provided notes and adds own knowledge to further explain, but nevertheless states that the source is the notes
- in comparison to gemma2, writes lengthy answers
- has a "weird" style, uses brackets a lot, mentions source in an unclear way ("all indicators point back here")
- understands Obsidian "tags" (mentioned it and understood its context)
- sometimes mentions exact note used as source, which is good
- sometimes only produces hallucinations: invents words, writes Japanese or other non-Latin characters and complete "nonsense", meaning sentences with no meaning or structure.

---

## **III. Conclusion / Personal Opinion**


First of all, it is important to mention that the RAG workflow "failed" multiple times, as shown above, and provided non-matching content to the LLMs, although neither the embedding model nor the notes or the questions were changed during testing. The testing conditions were therefore not one hundred percent identical in each case, even though no changes were made to the code or the system. However, I intend this project to be useful for the end user (myself) and therefore want to evaluate the experience the user has when using the system. "Squeezing out" the last possible points a model can reach does not serve the purpose of this project, but evaluating the first-try experience does.

The prompt used is another crucial part of the RAG workflow, along with the correct retrieval of embeddings. As explained in the Introduction, I wanted the LLM to fulfill two major guidelines: first, to stick to my notes and only fall back on its own knowledge if necessary, and second, to always provide a source. While all models fulfilled the second task in every response, the stated source was often untrue. As listed in the observations, some models, like Gemma2, often mixed the knowledge from my notes with their own knowledge, but still stated the source to be my notes. This is rather unsatisfying. Furthermore, a clear line between own and provided knowledge can't be drawn, so the models can't be evaluated comprehensively on this point. The same is true for the first guideline: in cases where the model mixed in its own knowledge, it is difficult to determine whether that was necessary or not. For the prompt, I can simply conclude that all models always provide a source, whether it is true or not, and that the first guideline is very "flexible", making an evaluation very difficult. Further prompt engineering could make the models stick more closely to the rules, but could also affect the user experience.
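To at least record which source each answer claims, the source statement can be extracted automatically. The sketch below is a simple pattern match over the answer text; the function name and the exact patterns are assumptions for illustration, and the check only captures what the model states, not whether the statement is true.

```python
import re


def stated_source(answer: str) -> str:
    """Classify the source an answer claims: notes, general knowledge, both, or unknown."""
    text = answer.lower()
    says_notes = re.search(r"from (your |the |my )?notes", text) is not None
    says_general = "general knowledge" in text
    if says_notes and says_general:
        return "both"
    if says_notes:
        return "notes"
    if says_general:
        return "general knowledge"
    return "unknown"


# Answers quoted from the "Who are you?" outputs above
gemma_answer = (
    "I am Gemma, an AI assistant trained to be informative and comprehensive."
    "\n\nSource: from notes"
)
llama32_answer = (
    "I'm not sure from the notes.\n\nFrom general knowledge: "
    "I don't have any information about Schwarz Kaufland & Lidl."
)
print(stated_source(gemma_answer))
print(stated_source(llama32_answer))
```

Comparing the claimed source with the notes that were actually retrieved would still require manual review.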

To sum up, based on the scoring and the observations, **Gemma2 2B is my personal favorite and gave me the best overall experience.**

---

# **Final Remark**

It is important to keep in mind that the tested models have different numbers of parameters and use different amounts of memory, because they were made for different purposes. A Llama3 with 8B parameters can generally be expected to perform better overall than a Llama3.2 with only 1B. Therefore, the hardware one intends to run the LLM on has to be taken into consideration when using this testing and scoring to decide which model to use.

# **Sources**

Obsidian: https://obsidian.md

Streamlit: https://streamlit.io

Ollama: https://ollama.com