# Natural Language Processing Project 2023/2024
*Braidi Federico*

# **Problems of LLM**

Large Language Models (LLMs) have become increasingly capable and widely used across many fields in the past few years.
However, they also have flaws (mostly due to the sheer scale and complexity of the models and the way they are trained), namely:


*   Use of old (time of training) information to answer questions.
*   Always answering questions, even when the information needed is not present (making up data, Hallucination).
*   Use of non-reliable information present in the huge training set.



# **Introduction to RAG**

All of the previous problems are addressed in a more modern approach to Generative Language Models called Retrieval-Augmented Generation (RAG).
The main idea is to force the LM to base its answer on a collection of relevant documents, supplied by a separate retrieval step that runs before generation. To do so we first need a specific (or wide) purpose database of documents we deem relevant and useful to the topic(s) we want the LM to be knowledgeable about. The structure of a RAG model is shown in the following image:<br><br>

<img src='https://drive.google.com/uc?id=1cbvykuc0YCO2zkCogFmJgxl6t59f8cBh'>


Following the diagram we can see that the process is divided into 5 steps:
<br><br>
1.   The user asks the computer a question in the form of a prompt and a query.<br><br>
2.   The computer sends the query to an algorithm that has access to the knowledge sources and is able to extract the most useful sources to answer the question.<br><br>
3.   The documents are sent back to the computer to aid the generation.<br><br>
4.   A new prompt is formed, adding the documents to the initial prompt+query version, and is sent to an LLM.<br><br>
5.   The LLM sends back the generated text, which can also include references to the documents provided.
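The five steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the word-overlap retriever, the prompt layout and the `fake_llm` placeholder are all hypothetical stand-ins, not the components used later in this project.

```python
# Toy illustration of the five RAG steps; every name here is hypothetical.

def retrieve(query, documents, k=2):
    """Steps 2-3: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction, query, retrieved):
    """Step 4: add the retrieved documents to the initial prompt + query."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"{instruction}\n\nDocuments:\n{context}\n\nQuestion: {query}\nAnswer:"


def fake_llm(prompt):
    """Step 5 placeholder: a real system would call a language model here."""
    return f"(generated text for a prompt of {len(prompt)} characters)"


documents = [
    "Catan is a trading game with a lot of dice luck.",
    "Gloomhaven is a long cooperative campaign game.",
    "Catan is a good gateway game for new players.",
]
instruction = "Answer using only the documents below."
query = "Is Catan a good game?"  # step 1: the user's prompt and query

retrieved = retrieve(query, documents)
prompt = build_prompt(instruction, query, retrieved)
answer = fake_llm(prompt)
```

In a real system the retriever compares embedding vectors rather than raw words, which is what the rest of this report does.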





# **The Dataset**

The dataset we used is taken from Kaggle at this [link](https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews?select=bgg-19m-reviews.csv) and is a CSV version of the database of reviews from the website [BoardGameGeek](https://boardgamegeek.com/). It is a snapshot of the database taken in January 2022, with 19M reviews left by more than 400k users, and has the following structure:

Counter   |  User  | Comment  | ID | Name
----------|--------|----------|----|-----
0|mitnachtKAUBO-I|Hands down my favorite new game...|30549|Pandemic
...|...|...|...|...
144|jmcmilla|Works great as a 2-player game. ...|811|Carcassonne
...|...|...|...|...

Of course, for our work, 19M reviews are too many and the computation time would become huge. There are also reviews with some problems, namely:


*   Blank or too-short comments.
*   Strange symbols that need to be cleaned.
*   Ratings that are not floats.
*   Comments written in languages other than English.

To get the dataset we ended up using, we wrote a Python script that filters the original database. The script roughly follows these steps:


*   Defines a list of "games_to_keep", games on which we decided to focus. They are 10 of the top rated games according to the BoardGameGeek online leaderboard.


```
games_to_keep = [
    'Brass: Birmingham',
    'Pandemic Legacy: Season 1',
    'Gloomhaven',
    'Catan',
    'Twilight Imperium: Fourth Edition',
    'Carcassonne',
    'Terraforming Mars',
    '7 Wonders',
    'Star Wars: Rebellion',
    'Gloomhaven: Jaws of the Lion'
]
```


*   Divides the dataset in chunks, because our computer couldn't handle the whole file.

*   Iterates over the chunks and, for each line, checks:

    -   If the game is in the list.
    -   If the length of the review is >= 200 characters.
    -   If there are any strange characters we don't want.
    -   If the rating is a float type.
    -   If the language of the review is English (this is done using the langdetect package).
*   After having filtered the dataset, we take 500 reviews for each of the games in the list and we save them in CSV format.

Our database is made of 5000 reviews, 500 for each of the 10 games we decided on.
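The filtering script itself is not reproduced in this report, so the following is only a minimal sketch of the checks listed above. The column names (`name`, `comment`, `rating`), the regular expression used to define "strange characters" and the `is_english` callable are all assumptions. In the real script the language check was done with langdetect, which here would be passed in as `is_english` (for instance a small wrapper around `langdetect.detect`).

```python
import re

# Assumed definition of "no strange characters": printable ASCII plus whitespace.
ALLOWED_CHARS = re.compile(r"[\x20-\x7E\s]+")
MIN_LENGTH = 200
REVIEWS_PER_GAME = 500


def is_float(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def keep_review(row, games_to_keep, is_english):
    """Apply the checks from the list above to one row (a dict)."""
    comment = row.get("comment") or ""
    return (
        row.get("name") in games_to_keep
        and len(comment) >= MIN_LENGTH
        and ALLOWED_CHARS.fullmatch(comment) is not None
        and is_float(row.get("rating"))
        and is_english(comment)
    )


def select_reviews(rows, games_to_keep, is_english, per_game=REVIEWS_PER_GAME):
    """Keep at most `per_game` valid reviews for each game."""
    kept = {game: [] for game in games_to_keep}
    for row in rows:
        game = row.get("name")
        if game in kept and len(kept[game]) < per_game:
            if keep_review(row, games_to_keep, is_english):
                kept[game].append(row)
    return kept
```

Because `rows` can be any iterable (for example a `csv.DictReader`), the file is read lazily, which plays the same role as the chunked reading described above.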







# **Code**

We employed 4 Google Colab notebooks, one for each language model we used, because the resources available in a single notebook were not enough to load 2 models at the same time.

The 4 LMs we used, all taken from HuggingFace, are the following:


*   **Llama2-7B**
*   **Llama2-13B**
*   **Llama3-8B**
*   **Qwen2-7B**

In the explanation of the code we will use snippets from the **Llama2-7B** notebook but, except for the LM download part, the code is the same in all notebooks.

The code is structured in the following way:


*   All the needed libraries are downloaded via pip.
*   The dataset, hosted as a shared file on Google Drive, is downloaded using the gdown package.
*   The dataset is imported using Langchain's CSVLoader function.
*   A *split_text* function is defined that splits the documents according to the given parameters.




```
def split_text(chunk_size, overlap):
  # `reviews` is the list of documents loaded earlier with CSVLoader
  text_splitter = RecursiveCharacterTextSplitter(
      chunk_size=chunk_size,
      chunk_overlap=overlap,
      length_function=len,
  )
  reviews_documents = text_splitter.transform_documents(reviews)

  return reviews_documents
```

*   A *create_vector_store* function is defined that uses an embedding model to embed every document and save the embeddings in a local directory for faster computation time.

```
def create_vector_store(emb_model, store, reviews_documents, chunk_size, overlap):
  # Cache the embeddings on disk, namespaced by model and splitting parameters
  embedder = CacheBackedEmbeddings.from_bytes_store(
      emb_model, store, namespace=f"{emb_model.model_name}_{chunk_size}_{overlap}"
  )
  vector_store = FAISS.from_documents(reviews_documents, embedder)

  return vector_store
```

*   A *create_emb_model* function that calls the HuggingFace module to obtain the desired model.

```
def create_emb_model(emb_model_name):
  return HuggingFaceEmbeddings(model_name=emb_model_name)
```

*   The notebook logs into HuggingFace with a generated token and downloads the desired model. The BitsAndBytes integration of transformers is used to make sure the model fits in the memory of the Colab notebook: it stores the parameters in a more compact form and expands a parameter to the more precise representation only when it is needed.
*   Loads the pretrained tokenizer that matches the model in question.
*   Defines an empty model dictionary that will contain the vector_store objects and allow us to check if the vector store of a certain combination of parameters has already been computed and, if that's the case, to use it.
*   Defines a *save_response* function that can save the answers provided by the LLM to files.

```
def save_response(filename, response):
    # Mode 'a' creates the file if it does not exist, otherwise it appends
    with open(filename, 'a') as file:
        file.write(response + "\n")
        file.write("\n" + "="*50 + "\n\n")
```

*   Defines an *ask* function that allows the user to ask questions to the LM. It has these parameters:

```
def ask(query, emb_model, chunk_size, overlap, temperature, model_dict, max_new_tokens=500, min_new_tokens=0):
    ...  # the body is explained step by step below
```

We can explain this function in a few steps:
*   Calculates a *key* for the *model_dict* dictionary to check if the vector store has already been created, creates it (with the previously defined functions) if it doesn't yet exist.

```
key = f"{emb_model.model_name},{chunk_size},{overlap}"

# Build the vector store only if this parameter combination is new
if key not in model_dict:
  print("Building new model")
  reviews_documents = split_text(chunk_size=chunk_size, overlap=overlap)
  vector_store = create_vector_store(emb_model=emb_model, store=LocalFileStore("./cache/"),
                                     reviews_documents=reviews_documents,
                                     chunk_size=chunk_size, overlap=overlap)
  model_dict[key] = {"reviews_documents": reviews_documents, "vector_store": vector_store}
else:
  print("Re-using previously built model")
```

*   Defines a pipeline for text generation with all the needed parameters and makes it a HuggingFacePipeline.

```
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    min_new_tokens=min_new_tokens,
)

# Wrap the transformers pipeline so that LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)
```

*   Defines a retriever object with the right vector store and sets the number of retrieved documents to 10. Then creates a retrieval chain object to be used for querying the LM.

```
# Retrieve the 10 most similar chunks for each query
retriever = model_dict[key]["vector_store"].as_retriever(search_kwargs={"k": 10})

handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True,
)
```

*   Queries the chain with the query passed to the *ask* function.

```
response = qa_with_sources_chain.invoke({"query": query})
```

*   Finally, splits the response into the part containing the actual answer and the part containing the query and the retrieved documents, then formats the result into a readable response to return.
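The formatting code is not shown in this report, so the following is only a sketch of how this last step could look. It assumes that the generated text still contains the full prompt (because of `return_full_text=True`) and that the prompt ends with a `Helpful Answer:` marker, as the saved output further down suggests. The function name `format_response` and its parameters are hypothetical.

```python
def format_response(query, settings, full_text, sources, marker="Helpful Answer:"):
    """Separate the generated answer from the prompt and lay out the result."""
    # Everything before the first marker is the prompt (instructions,
    # retrieved documents and question); everything after is the answer.
    before, found, after = full_text.partition(marker)
    answer = after if found else before

    parts = [query, settings, marker, answer.strip(), "Source Documents:"]
    for number, source in enumerate(sources, start=1):
        parts.append(f"Document n. {number}")
        parts.append(source)
    return "\n\n".join(parts)


full_text = (
    "Use the following pieces of context.\n"
    "Question: How is Catan?\n"
    "Helpful Answer: It is well liked."
)
formatted = format_response("How is Catan?", "settings line", full_text, ["review text"])
```

In the notebook the inputs would come from the dictionary returned by the chain rather than from plain strings.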

After all of these definitions, we can use the ask function to ask questions to the LM in the following way:

```
lm_name = "llama2-7B"
temperature = 0.2

response0 = ask(query="How is Catan according to the reviews?",
                emb_model=create_emb_model(emb_model_name="sentence-transformers/all-MiniLM-L6-v2"),
                chunk_size=1000,
                overlap=100,
                temperature=temperature,  # reuse the variable so it matches the file name
                model_dict=model_dict,
                min_new_tokens=100)
save_response(f"all-MiniLM-L6-v2-{lm_name}-{temperature}", response0)
```

The answer saved in the file will look something like this (here the answer and the reviews are truncated for brevity):
<br><br><br>

**How is Catan according to the reviews?**

sentence-transformers/all-MiniLM-L6-v2,1000,100,0.2

**Helpful Answer:**
According to the reviews, Catan is...

**Source Documents:**

*Document n˚1*

Source: top_5000_reviews.csv

Line: 588

review: Summary at bottom. Note that this ...

*Document n˚2*

Source: top_5000_reviews.csv

Line: 803

game_name: Catan

review: I have found Catan to be ...

rating: 6.0

user: CtheTree

...

*Document n˚9*

Source: top_5000_reviews.csv

Line: 675

game_name: Catan

review: Catan has comfortably established ...

rating: 7.0

user: BobaFett912

*Document n˚10*

Source: top_5000_reviews.csv

Line: 958

game_name: Catan

review: I played Catan extensively ...

rating: 6.2

user: Elston

# **Analysis of the Results**




# **Introduction to the Analysis**

As seen in the ask function used, the answers of the model are a product of many different parameters. In this section we analyze them one by one.

The important parameters discussed are:

- Chunk_size
- Prompt
- Embedding Model
- LLM
- Temperature of sampling
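Of these parameters, the sampling temperature is perhaps the least intuitive. As a reminder of what it does, the sketch below applies a temperature-scaled softmax to three made-up logits. The logit values are purely illustrative and `softmax_with_temperature` is a hypothetical helper, not part of the notebooks.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, temperature=0.2)
high = softmax_with_temperature(logits, temperature=0.8)

# A low temperature concentrates probability on the top token,
# a higher one spreads it over the alternatives.
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With a low temperature the model almost always picks its top candidate, which makes answers more deterministic. A higher temperature gives less likely tokens a real chance of being sampled.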

## **Document Segmentation**

The models were initially tested with different values of chunk_size and overlap to see the effect of the dimension of the documents on the performance in question answering.

When a small chunk_size was selected, the computation time needed to build the vector store increased drastically and, depending on how many vector stores were built (one for each embedding model), the allocated GPU RAM also started filling up. The answers were poor as well, showing hallucination or repetition caused by the low amount of information contained in each retrieved document.

Among experiments with chunk sizes on the order of the average review length (or half of it), the difference in question-answering performance was minimal and, in our judgment, random.

We believe this behaviour shows the importance of having longer documents for our task: since we are using reviews, crucial information is also stored in "long-range" relations that span much of the text, and cutting a review into small parts that are treated independently loses meaning.

Taking notice of what is said above, we settled for a rather long chunk size (1000) relative to the length of the average review.
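To give a feeling for how chunk_size and overlap affect the number of documents that must be embedded, here is a simplified sketch. It uses a plain fixed-size sliding window, which is not the algorithm of RecursiveCharacterTextSplitter (that splitter prefers to cut on separators such as paragraphs and spaces). The function name and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Stand-in for a 2500-character review
review = "".join(str(i % 10) for i in range(2500))

small = sliding_chunks(review, chunk_size=100, overlap=10)
large = sliding_chunks(review, chunk_size=1000, overlap=100)
print(len(small), len(large))
```

Smaller chunks mean many more vectors to compute and store for the same review, which is consistent with the longer build times observed above.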

# **Prompt Analysis**

An important step in any matter related to Language Models and/or Document Retrieval is the choice of proper prompts.

It is important, for example, to choose prompts that produce good answers under some configurations of the RAG system but make it struggle under others. Such prompts can then be used to test the impact of the parameters on the quality of the response.

Furthermore, we want to use questions covering a wide range of "abstraction" levels, to test the capability of the language model to understand certain topics and the capability of embedding models to encode them in the vectors they produce.

Questions can also be of comparative type, asking to compare games with a certain name, theme or gameplay mechanic, thus prompting the system to find the same features in reviews about different games.

We write below a couple of the discarded prompts and the reason why they were excluded:

- **Prompt:** Compare the aspects of games that have reviews with a rating that is higher than 8.

  **Reason:** It appeared the embedding models weren't including the review rating in a significant way and thus the LMs couldn't base their response on well fetched data.

- **Prompt:** I am very passionate about board games and space, are there any games you think i should play?

  **Reason:** For some reason this prompt pushed the LMs to answer with out-of-dataset games about space, probably present in their original training data.

In the end, the questions we settled on were devised to assess various levels of difficulty in information retrieval and in the comprehension of abstract subjects. We list them below:

-   How is Catan according to the reviews?
-   How is Brass: Birmingham according to the reviews?
-   Is Brass: Birmingham a good family game?
-   Break down the gameplay of Catan.
-   Which game would you suggest bringing to a friend's home who is not accustomed to board games?
-   My friends find Brass: Birmingham boring. Suggest a board game that they might like, break down the features of this game, and compare it with Brass: Birmingham.
-   What are the main strengths and weaknesses of Catan according to reviewer feedback?
-   Compare Catan and Brass: Birmingham. Which game would you suggest as a present for an expert board game player?
-   Can you compare 2 versions of Gloomheaven?
-   Compare different games about space.

We should keep in mind that our prompt is automatically concatenated by the pipeline with another prompt for RAG, telling the model to answer the question with the information contained in the documents and specifying that it should not make up information if the answer is not contained in them.

# **Comparison of Embedding Models**

Different sentence embedding models are able to encode highly variable levels of information and meaning in the embeddings they produce. To qualitatively measure the impact of the quality of the embeddings used to encode the documents on the answers generated by a LM we chose 3 candidates from the HuggingFace-hosted options:

- **all-MiniLM-L6-v2**
- **multi-qa-MiniLM-L6-cos-v1**
- **nli-roberta-base-v2**

When evaluating them we used a chunk_size=1000.

The "metrics" we kept an eye on were:
- Whether or not the answers included **hallucinations**. This might also be related to the LM itself but it is a problem a RAG system should be able to fix.
- **Incomplete answers**, which might be a sign that the embedding hasn't encoded all of the important review information.
- Presence of references to **data not in database**, which could indicate that the LM can't extract the information needed to answer the question from the documents, because they are off-topic and were chosen on the basis of a poor embedding.
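To make the role of the embeddings concrete, retrieval boils down to comparing the query vector with every document vector and keeping the closest ones. The sketch below uses cosine similarity on tiny made-up 3-dimensional vectors. The real embeddings have hundreds of dimensions, and the FAISS index used in the notebooks may rely on a different distance, so this is an illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors: documents 0 and 1 point in nearly the same direction as the query
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query_vec = [1.0, 0.05, 0.0]

best = top_k(query_vec, doc_vecs, k=2)
```

If an embedding model places off-topic reviews close to the query, the LM receives irrelevant documents, which is how a weak embedding can lead to the problems listed above.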



## **nli-roberta-base-v2**
**Performance**:

**nli-roberta-base-v2** is an embedding model derived from RoBERTa, an optimized and robust version of BERT. It encodes the meaning of sentences into 768-dimensional dense vectors.

RoBERTa proved to be the worst of the embeddings we tried: it often produced short answers, or long answers composed of many rephrasings of short ones. Furthermore, the answers produced by Llama when using RoBERTa felt less human than those obtained with the other embeddings. The model sometimes stopped answering the question and started rephrasing the beginning of the prompt: "Please answer the question based on the information provided in the passage. If you don't know the answer, please say so". Some examples of these behaviours are shown below:

- **Short or Repetitive answers:**
  - **Question:** Which game would you suggest to bring to some friends' home who are not accustomed to board games?

    **Answer:** Carcassonne.
  - **Question:** How is Catan according to the reviews?
  
    **Answer:** The reviews say that Catan is a good game, but some people find it too dependent on dice rolls... It’s a good game, but it’s not the best game in the last 10 years. It’s a good game, but it’s not the best game in the last 10 years. It’s a good game, but it’s not the best game in the last 10 years. It’s a good game, but it’s not the best game in the last 10 years. It’s a good game, but it’s not the best game in the last 10 years. It’s a good game.

- **Rephrasing prompt**
  - **Question:** Break down the gameplay of Catan
  
    **Answer:** Sure! Catan is a game of strategy and resource management where players take on the roles of settlers and try to build and trade their way to victory. The game is played on a modular board made up of hexes, with each hex representing a different resource (such as wood, brick, sheep, etc.). Players start with a small settlement and must use their resources to build roads, settlements, and cities. They can also trade resources with other players to acquire the resources they need. The game ends when one player has built all of their settlements and cities, or when no more resources can be traded. The player with the most victory points, which are awarded for building settlements and cities, wins the game.
    I don't know the answer to your question.

However, especially as the temperature was increased and more complex LMs were used, some of the answers left us very satisfied, with good re-elaboration and reference to the Documents:
- **Satisfactory answers:**
  - **Question:** Is Brass:Birmingham a good family game?
    
    **Answer:**"Brass: Birmingham is considered a good family game by many users on Board Game Geek. It has a rating of 7.5 on average, with many users praising its accessibility, simplicity, and thinkiness. However, some users have noted that the game can be mean and may not be suitable for all families. It's worth noting that the game is rated 9.0 or higher by several users who have played it with their families, including OldGrayBeard and CatTastic23."

**Final thoughts**

**nli-roberta-base-v2** has some problems, especially when paired with lower-parameter LMs sampled at low temperature; however, it shows pretty good performance in the complementary cases (e.g. Llama2-13B with T=0.8).


## **all-MiniLM-L6-v2**

**Performance:**

**all-MiniLM-L6-v2** is an embedding model created from the pre-trained **MiniLM-L6-H384-uncased** and fine-tuned on a 1B sentence pairs dataset. It produces 384-dimensional vectors as embeddings.

This model performed pretty well overall. However, it still sometimes made the LM cite out-of-dataset games that had nothing to do with the retrieved documents, and produce citations that were inconsistent with the documents themselves:

- **Out-of-dataset reference:**
  - **Question:** My friends find Brass: Birmingham boring. Suggest a board game that they might like, break down the features of this game, and compare it with Brass: Birmingham
  
    **Answer:** It's difficult to say which game your friends might like without more information about their preferences. However, based on the reviews you provided, it seems that Brass: Birmingham is a complex and strategic game that may not be for everyone. If your friends are looking for a game with similar depth and complexity, but with a different theme or gameplay style, they might enjoy a game like Twilight Struggle.
    Twilight Struggle is a two-player only game ...

- **Erroneous semi-citation:**
  - **Question:** Is Brass:Birmingham a good family game?

    **Answer:** Based on the reviews, it seems that Brass: Birmingham is a game that is best suited for experienced players. Many of the reviewers mention that the game is fiddly and has a lot of rules, which may be overwhelming for new players. Additionally, some reviewers mention that the game can be quite long, which may not be ideal for a family game. However, if you have a family of experienced gamers who enjoy heavy euros, Brass: Birmingham may be a good fit.

    **Explanation:** In one of the documents a user says "Brass: Birmingham is a extremely compelling and evocative heavy economic game that perfectly fits ..." but in the response "enjoy heavy euros" has been erroneously reported.

However, as stated before, the responses were mostly well put together, even for more complex questions that required some kind of deeper understanding of the meaning of the reviews. This suggests that the embedding generally encapsulated the underlying nature of the texts well, even though the produced vectors are half the size of the RoBERTa ones.
- **Satisfactory answers:**
  - **Question:** Break down the gameplay of Catan.

    **Answer:** Catan is a game where players develop their own settlements in a new land by trading resources and building roads, settlements, and cities. The game has a mix of luck and strategy, with players trying to create the most efficient and profitable settlements while also competing with other players for resources and territory. The game has a lot of replayability with its various expansions and spinoffs, and it's a great game for both new and experienced players.
  
  - **Question:** Which game would you suggest to bring to some friends' home who are not accustomed to board games?

    **Answer:** Both 7 Wonders and Gloomhaven are great choices for introducing new players to board games! They are both easy to learn and offer a lot of variety and replayability. 7 Wonders is a more straightforward drafting game, while Gloomhaven has a deeper storyline and more complex gameplay, but both are great for beginners and experienced players alike.
    I hope this helps! Let me know if you have any other questions. 😊

**Final thoughts**

**all-MiniLM-L6-v2** has proved to be a good embedding model for our RAG task. If we overlook the very few cases where it doesn't fully embed the meaning or embeds a slightly different one, it can be used with very good results. However, we also have to note that, as well written as the responses are, they still don't feel extremely human.

## **multi-qa-MiniLM-L6-cos-v1**

**Performance:**

**multi-qa-MiniLM-L6-cos-v1**, like **all-MiniLM-L6-v2**, is an embedding model that uses **MiniLM-L6-H384-uncased** as a starting point, but it is later fine-tuned on a dataset of 200M+ question-answer pairs: it is optimized for question answering!

Indeed, as we expected, this is the embedding model that resulted in the best responses by the LM. The strongest evidence that it was better than the others is that it worked best no matter which LM was used, hinting that it captured the question answering task better than all-MiniLM-L6-v2 and nli-roberta-base-v2.

Below are some examples of the responses obtained:

- **Question:** Break down the gameplay of Catan.

  **Answer:** Sure! Here's a breakdown of the gameplay of Catan:

    1. Setup: Players set up the game board by placing the hexes tiles in a specific configuration. Each player chooses a color and is given two roads and two settlements of that color to place on the board.
    2. Resource Collection: Players roll two dice to collect resources (brick, wood, sheep, wheat, and ore). The number on the dice corresponds to the type and amount of resource collected.
    3. Trading: Players can trade resources with each other, either one-for-one or through a series of trades.
    ...

  **Explanation:** The capability of extracting different gameplay elements from different comments and discerning between them is a very human-like property.

- **Question:** Can you compare 2 versions of Gloomheaven?

  **Answer:** Sure! Gloomhaven: Jaws of the Lion is a smaller, more streamlined version of the original Gloomhaven. It has shorter scenario setups and tear-down times, and it introduces four new playable characters. However, it also has fewer depth and less powerful characters compared to the original Gloomhaven. Additionally...

- **Question:** What are some games about space.

  **Answer:** Based on the reviews, it seems that Twilight Imperium: Fourth Edition is the most highly rated game about space. The reviewers praise its replayability, complexity, and thematic elements. Terraforming Mars are also well-regarded, but not as highly rated as Twilight Imperium.

  **Explanation:** The answer is concise and doesn't make up information that is not present in the Documents.


**Final thoughts**

Because it is fine-tuned on a QA dataset, **multi-qa-MiniLM-L6-cos-v1** appears to be the best model for encapsulating QA-relevant information in sentence embeddings. Since it is better than the other 2 models in every case, we continue this analysis by restricting ourselves to using it for embeddings.

# **Comparison of LLMs**

Another essential part of RAG is the Language Model. Having decided that **multi-qa-MiniLM-L6-cos-v1** is the best embedding model, we now focus on how well the LMs perform at creating answers based on the relevant documents retrieved.

The tested language models are all taken from HuggingFace and are:

- **Llama2-7B**
- **Llama2-13B**
- **Llama3-8B**
- **Qwen2-7B**

Our evaluation of them is based on the following qualitative metrics:

- Presence of **hallucinations** in the responses.
- **On-topic** responses.
- **Human-like writing**.
- **Reference** to the documents.

## **Llama2-7B**

**Performance:**

**Llama2-7B** is a model of the Llama2 family released by Meta. We employ the "chat" version, which is fine-tuned for dialogue and question answering use cases.

When it comes to questions about the characteristics of a game, though the text is grammatically and syntactically correct, the answer appears to be just a list of user evaluations that are not rephrased and incorporated in a more human-like text.

When confronted with more complex and abstract questions that can't be answered by listing pieces of the documents, the model tends to either say it doesn't know the answer or fall into the "repetition of prompt" problem discussed in the previous section.

These problems don't seem to be very much affected by the temperature used to sample.

- **Listing of Review Pieces:**

  - **Question:** How is Catan according to the reviews?

    **Answer:** According to the reviews, Catan is a game that has a mix of opinions. Some people find it to be a great game with a good balance of luck and strategy, while others find it to be overrated or lacking in replayability. The initial placement of settlements seems to be a crucial factor in the game's outcome, and some players find that the game can become stale after a few plays. However, many players ...

- **Repetition of prompt:**

  - **Question:** Which game would you suggest to bring to some friends' home who are not accustomed to board games?

    **Answer:** Carcassonne! It's a great gateway game that is easy to learn and has a low barrier to entry. It's also visually appealing and can be played with 2-4 players.
    Please answer the question based on the provided information. If you don't know the answer...

- **Don't know:**

  - **Question:** My friends find Brass: Birmingham boring. Suggest a board game that they might like, break down the features of this game, and compare it with Brass: Birmingham

    **Answer:** It's difficult to say which game your friends might like without more information about their preferences...

## **Llama3-8B**

**Performance:**

**Llama3-8B** is a model of the Llama3 family released by Meta. It is fine-tuned for dialogue and question answering use cases on a dataset of 10M+ human-annotated examples.

Its performance on simple questions like "How is Catan according to the reviews?" is very similar to that of the **Llama2-7B** model, showing the same behaviour while being only slightly more articulate in language generation.

When it comes to more complex questions, **Llama3-8B** works much better than **Llama2-7B** even if, on one occasion, it started talking about a game that wasn't in the dataset or in the retrieved documents.

To compare this model with the previous one, we report below the answers **Llama3-8B** provided to the same questions we discussed in the previous section.

- **"List" question:**

  - **Question:** How is Catan according to the reviews?

    **Answer:** Based on the reviews, Catan is a game that has both positive and negative aspects. Some reviewers consider it a classic game, enjoyable, and replayable, while others find it overrated, too dependent on luck, and lacking in strategy. The game's initial placement of settlements is seen as crucial, and some reviewers feel that the game's outcome is often determined by chance rather than skill. However, many reviewers also mention that the game is a great introduction to board gaming and can be enjoyable for casual players. Overall, the reviews suggest that Catan is a game that can be enjoyable for some, but may not be as engaging for others. Rating: 6.5 (average of all ratings)

- **Inexperienced suggestion:**

  - **Question:** Which game would you suggest to bring to some friends' home who are not accustomed to board games?

    **Answer:** I'm happy to help!
    Based on your friends' feedback, I'm going to suggest a game that might appeal to them. The game is Azul, a popular and highly-rated game that shares some similarities with Brass: Birmingham.

    **Azul**

    **Gameplay:** Azul is a tile-laying game where players compete to create the most beautiful mosaic. Each player takes turns drafting and placing tiles to score points.

    ...

- **Taste suggestion:**

  - **Question:** My friends find Brass: Birmingham boring. Suggest a board game that they might like, break down the features of this game, and compare it with Brass: Birmingham

    **Answer:** Based on the reviews, I would suggest Carcassonne. It's a gateway game that is easy to learn and plays well with 2-4 players. Many reviewers mentioned that it's a great game to introduce to non-board gamers, and it's a simple and fun game that doesn't require a complicated rules set...

## **Qwen2-7B**

**Performance:**

**Qwen2-7B** is an LM that is competitive with many open-weight models (like Llama) and proprietary models on a wide range of common benchmarks. It also supports a context length of 130000+ tokens, making it useful for future projects where the needed context is much larger than in this case.

If the model is presented with simple questions about the characteristics of a game, it tends to produce a list stating, for every document retrieved, the opinion of the writer on the matter, followed by the rating provided. This is not very useful, since we could simply read the reviews ourselves if that were our objective. It does, at least, exhibit a summarizing capability.
It should be noted that for T=0.8 the answers are much more human-like and useful.

When more complex questions are posed, the model acts in one of two ways:
- If the question is at a low level of abstraction (i.e. the answer can be extracted from the reviews, e.g. "Which game would you suggest to new players?"), the model answers correctly, producing a satisfying response.
- If the question is more abstract (i.e. some higher level of understanding must be employed, e.g. "My friends are bored: what do you suggest?"), the model ignores the retrieved documents and talks about existing games that are not in the dataset but were probably in the training data (e.g. Codenames, Ticket to Ride). The answer would be satisfying if it were on topic with the documents.

This second group of problems doesn't disappear even with high temperature.

- **Simple list answer:**

  - **Question:** How is Catan according to the reviews?

    **Answer:** According to the reviews, Catan is perceived differently based on the user's experience and perspective:

      - EnchantedPlaneswalker finds Catan enjoyable but slightly overrated, noting that luck plays a significant role in the game's outcome. Rating: 6.0
      - CtheTree appreciates the game's replayability and strategic depth but feels it is overrated overall. Rating: 6.0

      ...

- **Not very abstract question:**

  - **Question:** Which game would you suggest to bring to some friends' home who are not accustomed to board games?

    **Answer:** Based on the reviews provided, Carcassonne seems like the best suggestion for introducing non-acquainted friends to board games. It is described as a "gateway game," making it easy to understand and enjoy for those unfamiliar with board gaming. The review mentions that it is particularly good for "non-boardgaming gamers" and that it has successfully converted people into enjoying board games...

- **Very abstract question:**

  - **Question:** My friends find Brass: Birmingham boring. Suggest a board game that they might like, break down the features of this game, and compare it with Brass: Birmingham

    **Answer:** Based on the reviews provided, it seems that Brass: Birmingham is a complex economic game with a focus on strategy, area control, and card mechanics. It has been praised for its engaging gameplay and satisfying experience, especially when played with multiple players. However, it has also been noted for its complexity, requiring a significant amount of setup and remembering various rules.
    If your friends find Brass: Birmingham boring, they might prefer a game that offers simpler gameplay and quicker sessions. One such game could be "Ticket to Ride." Here's a breakdown of its features

    ...

Except for these very abstract cases, the answers are well put together, and the tendency to answer with lists can be mitigated by sampling at a higher temperature.


# **Llama2-13B**

**Performance:**

**Llama2-13B** is a model of the Llama2 family released by Meta. We employ the "chat" version, which is fine-tuned for dialogue and question answering use cases. This is a bigger version of **Llama2-7B**, so we expect it to perform better.

In general the answers provided by this model have everything we look for:
- Demonstration of deeper understanding of the question.
- Internal coherence and coherence with the question's theme.
- Human-like writing.
- Demonstration of ability to paraphrase the reviews.

This model is not exempt from occasionally providing unsatisfying answers: in one case (only once, and at low temperature) it simply cited opinions from the reviews, and in another it cited an out-of-dataset game.

These problems almost never happened, and the model gave good answers to pretty much any question. It is also the only model that answered the most abstract question in a satisfying way:

- **Abstract Question Answering:**

  - **Question:** My friends find Brass: Birmingham boring. Suggest a board game that they might like, break down the features of this game, and compare it with Brass: Birmingham

    **Answer:** Brass: Birmingham is a complex and strategic board game that may not appeal to everyone. If your friends find it boring, they might enjoy a game with a different theme or set of mechanics and features. Here are a few suggestions:
      1. Terraforming Mars - This game is similar to Brass: Birmingham in terms of complexity and strategic depth, but it has a different theme and mechanics. Players take on the role of corporations working to terraform the planet Mars, and must balance resource management, card play, and strategic planning to succeed.
      2. Pandemic Legacy - This game is a cooperative game where players work together to save the world from a series of diseases and outbreaks. It has a different set of mechanics and features than Brass: Birmingham, but it is still a strategic and...

# **Comparison of T values**

Temperature is a key parameter when working with Language Models. We tried 3 different values for it, namely:

- 0.2
- 0.5
- 0.8

As we know, temperature affects the softmax function used at the end of each generative step to compute the probability distribution over the possible next tokens: the logits are divided by the temperature before the softmax is applied. Higher temperatures tend to flatten the probability distribution and make it possible for words that would otherwise have a low probability to be selected, increasing the model's "creativity".

This might lead the produced text to feel more human but could also induce it to be more prone to hallucination.
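The flattening effect described above can be checked numerically. The following is a minimal sketch, not the code used in the project: the function name `softmax_with_temperature` and the logit values are hypothetical and chosen only for illustration. It computes the softmax of the logits divided by T for the three temperatures we tried.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate next tokens (illustrative values only).
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.2, 0.5, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At T=0.2 almost all of the probability mass sits on the highest-scoring token, while at T=0.8 the less likely tokens receive a noticeably larger share, which is what makes sampling more varied.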

Using the **multi-qa-MiniLM-L6-cos-v1** embedding, we analyzed the impact of temperature on the answers given by the LMs.

The results, as we expected, were more or less similar across the different Language Models.

- For simple questions, which every model was able to answer, higher temperature correlated with higher readability and more "human-like" answers.
- For more complex or abstract questions, however, higher temperature generally correlated with both more "human-like" answers and a higher probability of hallucination. This second effect, although to a very small extent, was present when using **Llama2-13B** as well.

# **Conclusions**

In summary, our study on **Retrieval Augmented Generation** (**RAG**) consisted of testing:

- **Impact of Document Segmentation:**

  Several chunk sizes were tested. When the chunk size was small compared with the average document length, the responses were very poor. This is because, in a review, even parts of the text that are relatively far apart depend on each other for meaning, so small chunks cannot be treated independently. We settled on a value of chunk_size=1000.

- **Impact of prompt:**

  We studied the impact of the prompt and took a systematic approach to producing prompts that could help us better study the dependence of the system's responses on the other parameters. The list of the final prompts we used is in the dedicated section.

- **Impact of Document Embedding:**

  We tested 3 embedding models (**all-MiniLM-L6-v2**, **multi-qa-MiniLM-L6-cos-v1**, **nli-roberta-base-v2**) and measured the quality of the answers to identify the amount of information each model could store in the embedding. We decided that **multi-qa-MiniLM-L6-cos-v1** was the best at capturing relevant information, at least for the task of question answering, on which it was fine-tuned.

- **Impact of Language Model:**

  We ran 4 Language Models (**Llama2-7B**, **Llama2-13B**, **Llama3-8B**, **Qwen2-7B**) with every embedding model, then compared them using the best embedding (**multi-qa-MiniLM-L6-cos-v1**) to find the one that most consistently gave fitting answers to both simple and complex questions about the dataset. **Llama2-13B** was deemed to be the best.

- **Impact of Sampling Temperature:**

  Finally, we tested various values of temperature (0.2, 0.5, 0.8) for every LM.
  Our best setup (**Llama2-13B** with **multi-qa-MiniLM-L6-cos-v1** embeddings) showed good performance for every value, with low values being best for complex questions (to avoid the risk of hallucination) and higher values being best for simple questions (where there was no real risk).

All in all, we managed to determine a setup that is capable of answering even relatively complex questions about the dataset of board game reviews we prepared.

This setup could, for example, be repurposed for any other dataset of reviews (music, cinema, etc.) and, with some thought and tweaking of the parameters, even for pretty different contexts.