David Kuliev 329777460

# Experimental RAG – Football Knowledge Base

This project implements an **experimental Retrieval-Augmented Generation (RAG)** pipeline
on a small football-related text corpus.

The goal of the experiment is to **compare different RAG configurations**, focusing on:
- different **chunking strategies**
- local **vector indexing**
- retrieval quality and behavior

All processing is performed **locally**, including embedding generation, vector search,
and answer generation.

## Embedding Model
We use SentenceTransformer (all-MiniLM-L6-v2) to encode texts into vectors.

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded successfully")

Model loaded successfully


## Dataset

The dataset consists of **10 short football-related documents**, each focusing on a specific topic:

- player transfers in major leagues
- tactical concepts (high pressing, low block)
- injuries and squad depth
- coaching changes
- club finances and wages

Each document is stored as a separate `.txt` file and loaded from the local `data/` directory.

### Loading local documents
List .txt files from the data/ folder.

In [3]:
import os

data_dir = "data"
files = sorted([f for f in os.listdir(data_dir) if f.endswith(".txt")])
print("Files:", files)
print("Count:", len(files))

Files: ['doc01_transfers_premier_league.txt', 'doc02_transfers_la_liga.txt', 'doc03_transfers_serie_a.txt', 'doc04_transfers_bundesliga.txt', 'doc05_tactics_pressing.txt', 'doc06_tactics_low_block.txt', 'doc07_match_report.txt', 'doc08_injuries_squad_depth.txt', 'doc09_finance_wages_ffp.txt', 'doc10_coaches_changes.txt']
Count: 10


Load documents into memory and show a short sample.

In [4]:
documents = []

for fname in files:
    path = os.path.join(data_dir, fname)
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().strip()
        documents.append(text)

print("Loaded documents:", len(documents))
print("Sample document:\n")
print(documents[0][:300])

Loaded documents: 10
Sample document:

﻿Premier League clubs often focus on pace, physical duels, and transitions. In a typical summer window, top teams look for a striker who can finish chances and press from the front. Mid-table clubs prioritize fullbacks and defensive midfielders to protect the back line. A common pattern is buying yo


## Embedding Model

For text encoding, we use the **SentenceTransformer `all-MiniLM-L6-v2`** model.

This model converts both document chunks and user queries into fixed-size dense vectors.
Cosine similarity is later used to compare query embeddings with document embeddings.

## Baseline: Whole-document retrieval (no chunking, no FAISS)
We embed each full document and retrieve using cosine similarity.

In [34]:
import numpy as np

# the model has already been loaded earlier, let's use it
doc_embeddings = model.encode(documents, convert_to_numpy=True)

print("Embeddings shape:", doc_embeddings.shape)

Embeddings shape: (10, 384)


Baseline retrieval: cosine similarity between query embedding and document embeddings.

In [31]:
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_documents(query, model, doc_embeddings, documents, top_k=3):
    """
    Returns top_k most relevant documents for a given query
    """
    query_embedding = model.encode([query], convert_to_numpy=True)
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    top_indices = similarities.argsort()[::-1][:top_k]
    results = []

    for idx in top_indices:
        results.append({
            "score": similarities[idx],
            "text": documents[idx]
        })

    return results

In [33]:
query = "How do injuries affect squad depth during a long season?"

results = retrieve_documents(
    query=query,
    model=model,
    doc_embeddings=doc_embeddings,
    documents=documents,
    top_k=3
)

for i, r in enumerate(results, 1):
    print(f"\n#{i} | score={r['score']:.4f}")
    print(r['text'][:300])



#1 | score=0.7723
﻿Injuries affect results when key players miss weeks during a crowded schedule. Teams with strong squad depth can rotate without losing quality. A common problem is when both the main striker and backup striker are unavailable. Muscle injuries often increase with high pressing and frequent matches. 

#2 | score=0.4839
﻿A coaching change often shifts tactics and recruitment priorities. A new coach might prefer a 4-3-3 with wingers, while the old coach used a 3-5-2 with wing-backs. Training intensity can change, affecting fitness and injuries. Players who fit the new system gain minutes, while others may be sold. L

#3 | score=0.4618
﻿Premier League clubs often focus on pace, physical duels, and transitions. In a typical summer window, top teams look for a striker who can finish chances and press from the front. Mid-table clubs prioritize fullbacks and defensive midfielders to protect the back line. A common pattern is buying yo

#4 | score=0.3847
﻿Bundesliga clubs oft

## Local LLM (Ollama)
We call a local LLM via Ollama API for answer generation.

In [10]:
import requests

def ask_ollama(prompt, model="llama3"):
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    r = requests.post(url, json=payload, timeout=180)
    r.raise_for_status()
    return r.json()["response"]

# quick test
print(ask_ollama("Say hello in one short sentence.", model="llama3"))


Hello!


### RAG prompt template
LLM must answer only using retrieved sources and list which sources were used.

In [29]:
def rag_answer(query, retrieved_texts, model="llama3"):
    context = "\n\n".join([f"[Source {i+1}]\n{t}" for i, t in enumerate(retrieved_texts)])

    prompt = f"""
You are a helpful assistant.
Answer the question ONLY using the information in the sources.
If the sources do not contain the answer, say: "I don't know based on the provided sources."

Question: {query}

Sources:
{context}

Instructions:
- Give a short answer (3-6 sentences).
- Then write "Sources used:" and list which source numbers you used.
"""

    return ask_ollama(prompt, model=model)

# use your existing retrieval results (top_idx from earlier)
retrieved_texts = [documents[i] for i in top_idx]
print(rag_answer(query, retrieved_texts, model="llama3"))


Injuries can significantly impact squad depth during a long season by limiting the number of players available. A team's ability to rotate without losing quality is crucial, especially when both main and backup strikers are unavailable. Muscle injuries may increase with high pressing and frequent matches.

Sources used: 1


## Chunking Strategies

To explore the effect of chunking on retrieval performance, two configurations were tested:

- **Configuration A**
  - chunk size: 60 words
  - overlap: 20 words

- **Configuration B**
  - chunk size: 120 words
  - overlap: 40 words

Each document is split into overlapping chunks according to the configuration.
The resulting chunks are embedded and indexed separately for each setup.

## Chunking
Split each document into overlapping word chunks (chunk_size, overlap).
Below is a simple demo with (80/30), then we run A/B in the experiment.

In [14]:
import re

def chunk_text(text, chunk_size=80, overlap=30):
    # chunk_size/overlap here in "words"
    words = re.findall(r"\b\w+\b", text)
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]
        if not chunk_words:
            break
        chunks.append(" ".join(chunk_words))
        start += (chunk_size - overlap)
    return chunks

Create chunks and metadata (file, doc_id, chunk_id).

In [15]:
chunk_size = 80
overlap = 30

chunks = []
metas = []  # (file, doc_id, chunk_id)

for doc_id, (fname, text) in enumerate(zip(files, documents)):
    doc_chunks = chunk_text(text, chunk_size=chunk_size, overlap=overlap)
    for chunk_id, ch in enumerate(doc_chunks):
        chunks.append(ch)
        metas.append({"file": fname, "doc_id": doc_id, "chunk_id": chunk_id})

print("Total chunks:", len(chunks))
print("Example chunk:\n", chunks[0][:250], "...")

Total chunks: 20
Example chunk:
 Premier League clubs often focus on pace physical duels and transitions In a typical summer window top teams look for a striker who can finish chances and press from the front Mid table clubs prioritize fullbacks and defensive midfielders to protect  ...


## Vector Indexing

For vector search, we use **FAISS** as a local vector database.

For each configuration:
- all chunk embeddings are normalized
- a FAISS index is built using cosine similarity
- the index stores embeddings together with metadata linking each chunk to its source document

Build a FAISS index over chunk embeddings (cosine similarity via normalized vectors).

In [16]:
import faiss
import numpy as np

chunk_emb = model.encode(chunks, convert_to_numpy=True).astype("float32")

# normalize vectors to unit length -> inner product becomes cosine similarity
faiss.normalize_L2(chunk_emb)

dim = chunk_emb.shape[1]
index = faiss.IndexFlatIP(dim)  # simple exact search with inner product
index.add(chunk_emb)

print("FAISS index size:", index.ntotal)

FAISS index size: 20


Retrieve top-k most similar chunks from FAISS for a query.

In [17]:
def retrieve_chunks_faiss(query, top_k=5):
    q = model.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)
    scores, idx = index.search(q, top_k)
    results = []
    for score, i in zip(scores[0], idx[0]):
        results.append({
            "score": float(score),
            "chunk": chunks[i],
            "meta": metas[i]
        })
    return results

query2 = "How do injuries affect squad depth during a long season?"
retrieved = retrieve_chunks_faiss(query2, top_k=5)

for r in retrieved[:3]:
    print(r["score"], r["meta"]["file"], r["chunk"][:140], "...")

0.7772618532180786 doc08_injuries_squad_depth.txt Injuries affect results when key players miss weeks during a crowded schedule Teams with strong squad depth can rotate without losing qualit ...
0.5865771174430847 doc08_injuries_squad_depth.txt management help reduce risk Coaches may change formation to protect tired players Depth in midfield and at fullback is crucial over a long s ...
0.49424755573272705 doc01_transfers_premier_league.txt players with high potential and selling veterans to manage wages Deadlines create short term loans and late deals Teams also look for depth  ...


## Final RAG demo (FAISS retrieval + LLM)
We retrieve relevant chunks and generate the final answer using the local LLM.

In [18]:
retrieved_texts = [r["chunk"] for r in retrieved]
print(rag_answer(query2, retrieved_texts, model="llama3"))

Injuries affect squad depth during a long season by causing key players to miss weeks, potentially disrupting the team's performance. Teams with strong squad depth can rotate players without losing quality. Coaches may change formations to protect tired players and reduce the risk of injuries.

Sources used: 1


## Experiment setup
Build separate FAISS indexes for different chunking configurations and compare retrieval behavior.

In [19]:
import time
import pandas as pd

def build_faiss_for_config(chunk_size, overlap):
    cfg_chunks = []
    cfg_metas = []
    for doc_id, (fname, text) in enumerate(zip(files, documents)):
        doc_chunks = chunk_text(text, chunk_size=chunk_size, overlap=overlap)
        for chunk_id, ch in enumerate(doc_chunks):
            cfg_chunks.append(ch)
            cfg_metas.append({"file": fname, "doc_id": doc_id, "chunk_id": chunk_id})

    emb = model.encode(cfg_chunks, convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(emb)
    dim = emb.shape[1]
    idx = faiss.IndexFlatIP(dim)
    idx.add(emb)
    return cfg_chunks, cfg_metas, idx

def retrieve_cfg(query, cfg_chunks, cfg_metas, cfg_index, top_k=5):
    q = model.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)

    t0 = time.time()
    scores, ids = cfg_index.search(q, top_k)
    dt = time.time() - t0

    top_files = []
    for i in ids[0]:
        top_files.append(cfg_metas[i]["file"])

    return dt, top_files, [float(s) for s in scores[0]]

## Experimental Results

Multiple football-related queries were tested against both configurations.

For each configuration and query, the following were recorded:
- chunk size
- overlap
- retrieval time
- top retrieved source documents

The results are summarized in a comparison table, allowing analysis of how chunking
parameters affect retrieval behavior.

Test queries for comparing configurations.
Chunking configurations (A vs B).

In [38]:
queries = [
    "How do injuries affect squad depth during a long season?",
    "What is high pressing and why is it risky?",
    "What is a low block and what is the trade-off?",
    "How do club wages and finances affect transfers?",
    "How can a coaching change affect injuries and tactics?"
]

configs = [
    {"name": "A", "chunk_size": 60, "overlap": 20, "top_k": 5},
    {"name": "B", "chunk_size": 120, "overlap": 40, "top_k": 5},
]

rows = []

for cfg in configs:
    cfg_chunks, cfg_metas, cfg_index = build_faiss_for_config(cfg["chunk_size"], cfg["overlap"])
    for q in queries:
        dt, top_files, top_scores = retrieve_cfg(q, cfg_chunks, cfg_metas, cfg_index, top_k=cfg["top_k"])
        rows.append({
            "config": cfg["name"],
            "chunk_size": cfg["chunk_size"],
            "overlap": cfg["overlap"],
            "query": q,
            "retrieval_time_ms": round(dt * 1000, 3),
            "top_files": ", ".join(top_files[:3])
        })

df = pd.DataFrame(rows)
df

Unnamed: 0,config,chunk_size,overlap,query,retrieval_time_ms,top_files
0,A,60,20,How do injuries affect squad depth during a lo...,0.0,"doc08_injuries_squad_depth.txt, doc08_injuries..."
1,A,60,20,What is high pressing and why is it risky?,0.0,"doc05_tactics_pressing.txt, doc05_tactics_pres..."
2,A,60,20,What is a low block and what is the trade-off?,0.0,"doc06_tactics_low_block.txt, doc06_tactics_low..."
3,A,60,20,How do club wages and finances affect transfers?,0.0,"doc09_finance_wages_ffp.txt, doc09_finance_wag..."
4,A,60,20,How can a coaching change affect injuries and ...,0.0,"doc10_coaches_changes.txt, doc08_injuries_squa..."
5,B,120,40,How do injuries affect squad depth during a lo...,0.0,"doc08_injuries_squad_depth.txt, doc01_transfer..."
6,B,120,40,What is high pressing and why is it risky?,0.0,"doc05_tactics_pressing.txt, doc04_transfers_bu..."
7,B,120,40,What is a low block and what is the trade-off?,0.0,"doc06_tactics_low_block.txt, doc05_tactics_pre..."
8,B,120,40,How do club wages and finances affect transfers?,0.0,"doc09_finance_wages_ffp.txt, doc03_transfers_s..."
9,B,120,40,How can a coaching change affect injuries and ...,0.0,"doc10_coaches_changes.txt, doc08_injuries_squa..."


## Conclusion

This experiment demonstrates a complete **local experimental RAG pipeline**.

By comparing different chunking strategies and indexing configurations, we show how
RAG components influence retrieval quality and behavior.

The results highlight the importance of chunking design in retrieval-based systems,
even when using the same embedding model and vector index.