# Embedding Model Comparison for Retrieval (Large & Detailed Dataset)

This notebook evaluates several embedding models on a retrieval task using 50 detailed document paragraphs across five domains (Tech & Programming, Medicine & Science, History & Geography, Space, Art & Culture).

In [7]:
%pip install -q langchain-community sentence-transformers faiss-cpu pandas scikit-learn matplotlib jinja2



In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

## 1. Large & Detailed Mock Data Generation

We define 50 detailed paragraphs (10 per domain) and 15 queries. Each query is labeled with `expected_idx`, the zero-based index of the paragraph that should be retrieved.

In [None]:


documents = [
    # Tech & Programming
    "Python is a high-level, interpreted programming language known for its emphasis on code readability. Its syntax is clean and allows programmers to express concepts in fewer lines of code compared to languages like C++ or Java. Python supports multiple programming paradigms, including object-oriented, imperative, and functional programming, making it highly versatile for web development, data science, and automation.",
    "JavaScript, often abbreviated as JS, is a programming language that is one of the core technologies of the World Wide Web, alongside HTML and CSS. Over 97% of websites use JavaScript on the client side for web page behavior, often incorporating third-party libraries. It is a multi-paradigm language, supporting event-driven, functional, and imperative programming styles, and has APIs for working with text, dates, regular expressions, and the DOM.",
    "C++ is a high-performance, general-purpose programming language created by Bjarne Stroustrup as an extension of the C programming language. It is widely used for systems and software research, game development, drivers, and client-server applications. It provides low-level memory manipulation features combined with high-level object-oriented programming characteristics, allowing for efficient resource management.",
    "Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. Because all containers share the services of a single operating system kernel, they use fewer resources than virtual machines.",
    "Kubernetes is an open-source container-orchestration system for automating computer application deployment, scaling, and management. It was originally designed by Google and is now maintained by the Cloud Native Computing Foundation. It works with a range of container tools, including Docker, and provides a framework to run distributed systems resiliently, taking care of scaling and failover for your applications.",
    "Rust is a multi-paradigm, general-purpose programming language designed for performance and safety, especially safe concurrency. It is syntactically similar to C++, but it can guarantee memory safety by using a borrow checker to validate references. This avoids common bugs such as null pointer dereferences and buffer overflows, making it a favorite for systems programmers who value both speed and reliability.",
    "TypeScript is a programming language developed and maintained by Microsoft. It is a strict syntactical superset of JavaScript and adds optional static typing to the language. TypeScript is designed for the development of large applications and transpiles to JavaScript. As it is a superset of JavaScript, existing JavaScript programs are also valid TypeScript programs, allowing for gradual adoption.",
    "Git is a software for tracking changes in any set of files, usually used for coordinating work among programmers collaboratively developing source code during software development. Its goals include speed, data integrity, and support for distributed, non-linear workflows. Every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities, independent of network access.",
    "SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system. It is particularly useful in handling structured data, i.e., data incorporating relations among entities and variables, and is the standard for data manipulation in most modern databases.",
    "React is a free and open-source front-end JavaScript library for building user interfaces based on UI components. It is maintained by Meta and a community of individual developers and companies. React can be used as a base in the development of single-page or mobile applications. It allows developers to create large web applications that can change data, without reloading the page, by using a virtual DOM.",
    
    # Medicine & Science
    "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that, through cellular respiration, can later be released to fuel the organism's activities. This chemical energy is stored in carbohydrate molecules, such as sugars, which are synthesized from carbon dioxide and water. In most cases, oxygen is also released as a waste product, which is vital for aerobic life on Earth.",
    "DNA (Deoxyribonucleic acid) is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth, and reproduction of all known organisms and many viruses. DNA and ribonucleic acid (RNA) are nucleic acids. Alongside proteins, lipids, and complex carbohydrates, nucleic acids are one of the four major types of macromolecules that are essential for all known life.",
    "The heart is a muscular organ in most animals, which pumps blood through the blood vessels of the circulatory system. The pumped blood carries oxygen and nutrients to the body, while carrying metabolic waste such as carbon dioxide to the lungs. In humans, the heart is approximately the size of a closed fist and is located between the lungs, in the middle compartment of the chest. It has four chambers: two upper atria and two lower ventricles.",
    "Penicillin (PCN) is a group of antibiotics, derived from Penicillium fungi, which include penicillin G, penicillin V, procaine penicillin, and benzathine penicillin. Penicillin antibiotics were among the first medications to be effective against many bacterial infections caused by staphylococci and streptococci. They are still widely used today, though many types of bacteria have developed resistance following extensive use.",
    "Mitosis is a part of the cell cycle in which replicated chromosomes are separated into two new nuclei. Cell division gives rise to genetically identical cells in which the total number of chromosomes is maintained. In general, mitosis (division of the nucleus) is preceded by the S stage of interphase (during which the DNA is replicated) and is often followed by telophase and cytokinesis, which divides the cytoplasm.",
    "The human brain is the central organ of the human nervous system, and with the spinal cord makes up the central nervous system. The brain consists of the cerebrum, the brainstem and the cerebellum. It controls most of the activities of the body, processing, integrating, and coordinating the information it receives from the sense organs, and making decisions as to the instructions sent to the rest of the body.",
    "Gravity is a fundamental interaction which causes mutual attraction between all things with mass or energy. Gravity is by far the weakest of the four fundamental interactions, approximately 10^38 times weaker than the strong interaction. However, it is the most significant interaction on the scale of planets, stars, and galaxies, and is responsible for the formation of the Earth, the Sun, and most of the macroscopic objects in the universe.",
    "Thermodynamics is a branch of physics that deals with heat, work, and temperature, and their relation to energy, entropy, and the physical properties of matter and radiation. The behavior of these quantities is governed by the four laws of thermodynamics which convey a quantitative description using measurable macroscopic physical quantities, but may be explained in terms of microscopic constituents by statistical mechanics.",
    "Evolution is the change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes that are passed on from parent to offspring during reproduction. Different characteristics tend to exist within any given population as a result of mutation, genetic recombination, and other sources of genetic variation. Evolution occurs when evolutionary processes such as natural selection act on this variation.",
    "A virus is a submicroscopic infectious agent that replicates only inside the living cells of an organism. Viruses infect all life forms, from animals and plants to microorganisms, including bacteria and archaea. Since Dmitri Ivanovsky's 1892 article describing a non-bacterial pathogen infecting tobacco plants and the discovery of the tobacco mosaic virus by Martinus Beijerinck in 1898, more than 9,000 virus species have been described in detail.",
    
    # History & Geography
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower. Locally nicknamed 'La dame de fer' (French for 'Iron Lady'), it was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair and was initially criticized by some of France's leading artists and intellectuals for its design, but it has become a global cultural icon of France.",
    "The Great Barrier Reef is the world's largest coral reef system composed of over 2,900 individual reefs and 900 islands stretching for over 2,300 kilometres over an area of approximately 344,400 square kilometres. The reef is located in the Coral Sea, off the coast of Queensland, Australia. The Great Barrier Reef can be seen from outer space and is the world's biggest single structure made by living organisms. This reef structure is composed of and built by billions of tiny organisms, known as coral polyps.",
    "The Roman Empire was the post-Republican period of ancient Rome. As a polity, it included large territorial holdings around the Mediterranean Sea in Europe, North Africa, and Western Asia, ruled by emperors. From the accession of Augustus to the military anarchy of the 3rd century, it was a princeps state with Italy as the mother country of the empire and Rome as its sole capital. The empire was among the most powerful economic, cultural, political, and military forces in the world of its time.",
    "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The China-Nepal border runs across its summit point. Its elevation (snow height) of 8,848.86 m was most recently established in 2020 by the Chinese and Nepalese authorities. Mount Everest attracts many climbers, including highly experienced mountaineers. There are two main climbing routes, one approaching the summit from the southeast in Nepal and the other from the north in Tibet.",
    "The Industrial Revolution was the transition to new manufacturing processes in Great Britain, continental Europe, and the United States, in the period from about 1760 to sometime between 1820 and 1840. This transition included going from hand production methods to machines, new chemical manufacturing and iron production processes, the increasing use of steam power and water power, the development of machine tools and the rise of the mechanized factory system.",
    "Machu Picchu is a 15th-century Inca citadel located in the Eastern Cordillera of southern Peru on a 2,430-meter mountain ridge. It is located in the Machupicchu District within Urubamba Province above the Sacred Valley, which is 80 kilometers northwest of Cuzco. The Urubamba River flows past it, cutting through the Cordillera and creating a canyon with a tropical mountain climate. For many, it is the most familiar icon of the Inca civilization.",
    "The Nile is a major north-flowing river in northeastern Africa. It flows into the Mediterranean Sea. The Nile is the longest river in Africa and has historically been considered the longest river in the world, though this has been contested by research suggesting that the Amazon River is slightly longer. The Nile is among the smallest of the great world rivers by measure of annual flow in cubic meters of water. Its drainage basin covers eleven countries.",
    "Ancient Egypt was a civilization in Northeast Africa situated in the Nile Valley. Ancient Egyptian civilization followed prehistoric Egypt and coalesced around 3100 BC with the political unification of Upper and Lower Egypt under Menes. The history of ancient Egypt occurred as a series of stable kingdoms, separated by periods of relative instability known as Intermediate Periods: the Old Kingdom, the Middle Kingdom, and the New Kingdom.",
    "The Cold War was a period of geopolitical tension between the United States and the Soviet Union and their respective allies, the Western Bloc and the Eastern Bloc, which began following World War II. Historians do not fully agree on its starting and ending points, but the period is generally considered to span the 1947 Truman Doctrine to the 1991 dissolution of the Soviet Union. The term cold is used because there was no large-scale fighting directly between the two superpowers.",
    "London is the capital and largest city of England and the United Kingdom, with a population of just under 9 million. It stands on the River Thames in south-east England at the head of a 50-mile estuary down to the North Sea, and has been a major settlement for two millennia. The City of London, its ancient core and financial centre, was founded by the Romans as Londinium and retains its medieval boundaries.",
    
    # Space
    "The Milky Way is the galaxy that contains our Solar System, with the name describing the galaxy's appearance from Earth: a hazy band of light seen in the night sky formed from stars that cannot be individually distinguished by the naked eye. The term Milky Way is a translation of the Latin via lactea, from the Greek galaxías kúklos (milky circle). From Earth, the Milky Way appears as a band because its disk-shaped structure is viewed from within.",
    "Mars is the fourth planet from the Sun and the second-smallest planet in the Solar System, being larger than only Mercury. In English, Mars carries the name of the Roman god of war and is often referred to as the 'Red Planet'. The latter refers to the effect of the iron oxide prevalent on Mars's surface, which gives it a reddish appearance that is distinctive among the stellar bodies visible to the naked eye. Mars is a terrestrial planet with a thin atmosphere.",
    "A black hole is a region of spacetime where gravity is so strong that nothing—no particles or even electromagnetic radiation such as light—can escape from it. The theory of general relativity predicts that a sufficiently compact mass can deform spacetime to form a black hole. The boundary of the region from which no escape is possible is called the event horizon. Although it has a great effect on the fate and circumstances of an object crossing it, it has no locally detectable features.",
    "The Moon is Earth's only natural satellite. At about one-quarter the diameter of Earth, it is the largest natural satellite in the Solar System relative to the size of its planet, and the fifth largest satellite in the Solar System overall. The Moon is a planetary-mass object that formed a differentiated rocky body, making it a satellite planet under the geophysical definitions of the term. It lacks any significant atmosphere, hydrosphere, or magnetic field.",
    "Jupiter is the fifth planet from the Sun and the largest in the Solar System. It is a gas giant with a mass more than two and a half times that of all the other planets in the Solar System combined, but slightly less than one-thousandth the mass of the Sun. Jupiter is the third-brightest natural object in the Earth's night sky after the Moon and Venus. It has been observed since prehistoric times and is named after the Roman god Jupiter, the king of the gods.",
    "The Big Bang theory is the prevailing cosmological model explaining the existence of the observable universe from the earliest known periods through its subsequent large-scale evolution. The model describes how the universe expanded from an initial state of extremely high density and extremely high temperature, and offers a comprehensive explanation for a broad range of observed phenomena, including the abundance of light elements and the cosmic microwave background radiation.",
    "Saturn is the sixth planet from the Sun and the second-largest in the Solar System, after Jupiter. It is a gas giant with an average radius of about nine and a half times that of Earth. It only has one-eighth the average density of Earth; however, with its larger volume, Saturn is over 95 times more massive. Saturn is probably best known for the system of planetary rings that makes it visually unique. The rings are composed mostly of ice particles, with a smaller amount of rocky debris and dust.",
    "Venus is the second planet from the Sun. It is sometimes called Earth's 'sister' or 'twin' planet as it is almost as large and has a similar composition. As an interior planet to Earth, Venus (like Mercury) appears in Earth's sky never far from the Sun, either as morning star or evening star. Venus has the densest atmosphere of the four terrestrial planets, consisting of more than 96% carbon dioxide. The atmospheric pressure at the planet's surface is about 92 times the sea level pressure of Earth.",
    "Astronomy is a natural science that studies celestial objects and phenomena. It uses mathematics, physics, and chemistry in order to explain their origin and evolution. Objects of interest include planets, moons, stars, nebulae, galaxies, and comets. Relevant phenomena include supernova explosions, gamma ray bursts, quasars, blazars, pulsars, and cosmic microwave background radiation. More generally, astronomy studies everything that originates outside Earth's atmosphere.",
    "The International Space Station (ISS) is a modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project between five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada). The ownership and use of the space station is established by intergovernmental treaties and agreements. It serves as a microgravity and space environment research laboratory.",
    
    # Art & Culture
    "The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as 'the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world'. The painting's novel qualities include the subject's elusive expression, which is frequently described as enigmatic, the monumentality of the composition, and the subtle modeling of forms.",
    "William Shakespeare was an English playwright, poet, and actor. He is widely regarded as the greatest writer in the English language and the world's greatest dramatist. He is often called England's national poet and the 'Bard of Avon'. His extant works, including collaborations, consist of some 39 plays, 154 sonnets, three long narrative poems, and a few other verses, some of uncertain authorship. His plays have been translated into every major living language and are performed more oft than those of any other playwright.",
    "Jazz is a music genre that originated in the African-American communities of New Orleans, Louisiana, in the late 19th and early 20th centuries, with its roots in blues and ragtime. Since the 1920s Jazz Age, it has been recognized as a major form of musical expression in traditional and popular music, linked by the common bonds of African-American and European-American musical parentage. Jazz is characterized by swing and blue notes, complex chords, call and response vocals, polyrhythms and improvisation.",
    "The Renaissance was a period in European history marking the transition from the Middle Ages to modernity and covering the 15th and 16th centuries. It occurred after the Crisis of the Late Middle Ages and was associated with great social change. In addition to the standard periodization, proponents of a long Renaissance may put its beginning in the 14th century and its end in the 17th century. The traditional view focuses more on the early modern aspects of the Renaissance and argues that it was a break from the past.",
    "The Starry Night is an oil-on-canvas painting by the Dutch Post-Impressionist painter Vincent van Gogh. Painted in June 1889, it depicts the view from the east-facing window of his asylum room at Saint-Rémy-de-Provence, just before sunrise, with the addition of an imaginary village. It has been in the permanent collection of the Museum of Modern Art in New York City since 1941, acquired through the Lillie P. Bliss Bequest. Widely regarded as Van Gogh's magnum opus, The Starry Night is one of the most recognized paintings in Western art.",
    "Ludwig van Beethoven was a German composer and pianist. Beethoven remains one of the most admired composers in the history of Western music; his works rank amongst the most performed of the classical music canon and span the transition from the Classical period to the Romantic era in classical music. His career has conventionally been divided into early, middle, and late periods. His early period, during which he forged his craft, is typically considered to have lasted until 1802.",
    "The Louvre, or the Louvre Museum, is the world's most-visited museum and a historic monument in Paris, France. It is the home of some of the best-known works of art, including the Mona Lisa and the Venus de Milo. A central landmark of the city, it is located on the Right Bank of the Seine in the city's 1st arrondissement. Approximately 38,000 objects from prehistory to the 21st century are exhibited over an area of 72,735 square meters.",
    "Opera is a form of theatre in which music is a fundamental component and the dramatic roles are taken by singers, but is distinct from musical theatre. Such a 'work' (the literal translation of the Italian word opera) is typically a collaboration between a composer and a librettist and incorporates a number of the performing arts, such as acting, scenery, costume, and sometimes dance or ballet. The performance is typically given in an opera house, accompanied by an orchestra or smaller musical ensemble.",
    "Origami is the Japanese art of paper folding. In modern usage, the word 'origami' is often used as an inclusive term for all folding practices, regardless of their culture of origin. The goal is to transform a flat square sheet of paper into a finished sculpture through folding and sculpting techniques. Modern origami practitioners generally discourage the use of cuts, glue, or markings on the paper. Origami folders often use the Japanese word kirigami to refer to designs which use cuts.",
    "Salsa is a popular form of social dance originating from Cuban folk dances. The specific dances which combined to create salsa were son, danzón, guaguancó, mambo, and pachanga, as well as other Afro-Cuban forms. Salsa also incorporates elements of American jazz. Although salsa as a dance form has its roots in Cuba, it developed as a distinct style in New York City among the Puerto Rican and Cuban communities in the 1960s and 1970s."
]

queries = [
    {"query": "How do computers track code changes?", "expected_idx": 7}, # Git
    {"query": "What provides genetic info?", "expected_idx": 11}, # DNA
    {"query": "Tell me about the highest peak on Earth.", "expected_idx": 23}, # Mt Everest
    {"query": "Why is Mars red?", "expected_idx": 31}, # Mars
    {"query": "Who is the greatest English writer?", "expected_idx": 41}, # Shakespeare
    {"query": "What is a container platform?", "expected_idx": 3}, # Docker
    {"query": "How do plants make food?", "expected_idx": 10}, # Photosynthesis
    {"query": "Where is the Eiffel Tower?", "expected_idx": 20}, # Eiffel Tower
    {"query": "Tell me about the largest planet.", "expected_idx": 34}, # Jupiter
    {"query": "What is the Japanese art of folding paper?", "expected_idx": 48}, # Origami
    {"query": "A language for relational databases.", "expected_idx": 8}, # SQL
    {"query": "Antibiotic discovery by Fleming.", "expected_idx": 13}, # Penicillin
    {"query": "Empire spanning three continents.", "expected_idx": 22}, # Roman Empire
    {"query": "Region where gravity traps light.", "expected_idx": 32}, # Black hole
    {"query": "Painting with a famous smile.", "expected_idx": 40}  # Mona Lisa
]

print(f"Generated {len(documents)} document paragraphs and {len(queries)} evaluation queries.")
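Every metric in this notebook depends on `expected_idx` pointing at the right paragraph, so a quick label check can catch off-by-one errors early. The following is a minimal sketch, not part of the original notebook; `validate_queries` and the toy data are hypothetical names introduced for illustration.

```python
def validate_queries(documents, queries):
    """Return a list of problems found in the query labels (empty if none)."""
    problems = []
    for i, q in enumerate(queries):
        idx = q.get("expected_idx")
        # Each label must be an integer index into the documents list.
        if not isinstance(idx, int) or not 0 <= idx < len(documents):
            problems.append(f"query {i}: expected_idx {idx!r} is out of range")
    return problems


# Toy data for illustration only (not the notebook's dataset).
toy_documents = ["Git tracks changes in files.", "Mars is the fourth planet."]
toy_queries = [
    {"query": "version control", "expected_idx": 0},
    {"query": "red planet", "expected_idx": 2},  # deliberately out of range
]

toy_problems = validate_queries(toy_documents, toy_queries)
print(toy_problems)
```

In the notebook itself, `validate_queries(documents, queries)` should return an empty list. Note that this only checks index bounds; whether each label points at the semantically correct paragraph still needs a manual read.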

## 2. RAG Trial (LangChain)

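A small demonstration of retrieval with LangChain over the 50 paragraphs. `test_retrieval` embeds the documents, builds an in-memory FAISS index, and prints the top match for any query you pass in. The index is rebuilt on every call, which is acceptable at this scale.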

In [None]:
def test_retrieval(query_text, model_name="sentence-transformers/all-mpnet-base-v2"):
    print(f"--- Testing Retrieval with {model_name} ---")
    embeddings = HuggingFaceEmbeddings(model_name=model_name)
    vectorstore = FAISS.from_texts(documents, embeddings)
    results = vectorstore.similarity_search(query_text, k=1)
    
    print(f"Query: {query_text}")
    print(f"Top Result:\n{results[0].page_content[:500]}...")

test_retrieval("How do plants get their energy?")
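One detail to keep in mind when comparing this demo with the benchmark in the next section: LangChain's FAISS wrapper ranks by Euclidean (L2) distance by default (an assumption worth checking against the installed version), while the benchmark ranks by cosine similarity. For unit-length embeddings the two orderings coincide, because the squared L2 distance equals 2 - 2 * cosine similarity. The sketch below illustrates this with random vectors standing in for embeddings and uses only the standard library; whether a given model actually returns normalized vectors should be verified per model.

```python
import math
import random

random.seed(0)


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Random stand-ins for embeddings: 5 "documents" and 1 "query" in 8 dimensions.
docs = [unit([random.gauss(0, 1) for _ in range(8)]) for _ in range(5)]
query = unit([random.gauss(0, 1) for _ in range(8)])

cosine_scores = [cosine(d, query) for d in docs]
l2_distances = [l2(d, query) for d in docs]

# Rank by descending cosine similarity and by ascending L2 distance.
cosine_ranking = sorted(range(len(docs)), key=lambda i: -cosine_scores[i])
l2_ranking = sorted(range(len(docs)), key=lambda i: l2_distances[i])

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity.
identity_holds = all(
    math.isclose(d ** 2, 2 - 2 * c, abs_tol=1e-9)
    for d, c in zip(l2_distances, cosine_scores)
)
print(cosine_ranking == l2_ranking, identity_holds)
```

If a model's embeddings are not normalized, the L2 and cosine rankings can differ, so the top result here and the rank-1 document in the benchmark are not guaranteed to match.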

## 3. Benchmarking Embedding Models

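Each model embeds all 50 paragraphs and all 15 queries, and the paragraphs are ranked per query by cosine similarity. We report Hit Rate @ 1 and Hit Rate @ 3 (the fraction of queries whose expected paragraph ranks first, or within the top three) and Mean Reciprocal Rank (MRR, the mean of 1/rank of the expected paragraph).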

In [None]:
model_configs = [
    {"name": "LaBSE", "id": "sentence-transformers/LaBSE"},
    {"name": "Jina v2 base", "id": "jinaai/jina-embeddings-v2-base-en"},
    {"name": "all-MiniLM-L6", "id": "sentence-transformers/all-MiniLM-L6-v2"},
    {"name": "all-mpnet-base", "id": "sentence-transformers/all-mpnet-base-v2"},
    {"name": "BGE-small", "id": "BAAI/bge-small-en-v1.5"},
    {"name": "E5-small", "id": "intfloat/e5-small-v2"}
]

results_data = []

for cfg in model_configs:
    print(f"Processing model: {cfg['name']}...")
    
    try:
        model = SentenceTransformer(cfg['id'], trust_remote_code=True)
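        # E5 models are trained with "query: " / "passage: " input prefixes;
        # the other models here are encoded without a prefix.
        is_e5 = "e5" in cfg['id'].lower()
        doc_prefix, query_prefix = ("passage: ", "query: ") if is_e5 else ("", "")
        doc_embeddings = model.encode([doc_prefix + d for d in documents])
        query_texts = [query_prefix + q['query'] for q in queries]
        query_embeddings = model.encode(query_texts)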
        
        similarities = cosine_similarity(query_embeddings, doc_embeddings)
        
        total_mrr = 0
        hits_at_1 = 0
        hits_at_3 = 0
        
        for i, q_data in enumerate(queries):
            scores = similarities[i]
            top_indices = np.argsort(scores)[::-1]
            rank = np.where(top_indices == q_data['expected_idx'])[0][0] + 1
            
            total_mrr += (1.0 / rank)
            if rank == 1: hits_at_1 += 1
            if rank <= 3: hits_at_3 += 1
            
        results_data.append({
            "Model": cfg['name'],
            "Hit Rate @ 1": hits_at_1 / len(queries),
            "Hit Rate @ 3": hits_at_3 / len(queries),
            "MRR": total_mrr / len(queries)
        })
    except Exception as e:
        print(f"Error processing {cfg['name']}: {e}")

print("Benchmarking complete!")
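To make the metric definitions concrete, here is a minimal sketch of the same rank, Hit Rate, and MRR computation on a hand-written similarity matrix. It uses only the standard library so it runs without downloading any model; `rank_of_expected`, `retrieval_metrics`, and the toy values are hypothetical names and numbers introduced for illustration.

```python
def rank_of_expected(scores, expected_idx):
    """1-based rank of the expected document when sorting scores descending."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(expected_idx) + 1


def retrieval_metrics(similarities, expected, k=3):
    """Compute Hit Rate @ 1, Hit Rate @ k, and MRR over all queries."""
    ranks = [rank_of_expected(row, idx) for row, idx in zip(similarities, expected)]
    n = len(ranks)
    return {
        "hit@1": sum(r == 1 for r in ranks) / n,
        f"hit@{k}": sum(r <= k for r in ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }


# Toy similarity matrix: 3 queries (rows) x 4 documents (columns).
toy_similarities = [
    [0.9, 0.1, 0.3, 0.2],  # expected doc 0 -> rank 1
    [0.2, 0.5, 0.8, 0.1],  # expected doc 1 -> rank 2
    [0.7, 0.6, 0.5, 0.4],  # expected doc 3 -> rank 4
]
toy_expected = [0, 1, 3]

toy_metrics = retrieval_metrics(toy_similarities, toy_expected)
print(toy_metrics)
```

With ranks 1, 2, and 4, one of three queries is a hit at 1, two of three are hits at 3, and the MRR is (1 + 1/2 + 1/4) / 3, roughly 0.583. The reciprocal makes MRR reward near-misses more than Hit Rate @ 1 does while still penalizing deep ranks.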

## 4. Final Comparison Results

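Evaluation summary for the 50-paragraph benchmark, sorted by MRR with the best model first. With only 15 queries, a single query shifts either Hit Rate by about 6.7 percentage points, so small gaps between models should be read cautiously.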

In [None]:
df_results = pd.DataFrame(results_data)
df_results = df_results.sort_values(by="MRR", ascending=False)

print("### Retrieval Performance (Detailed Paragraphs, N=50) ###")
display(df_results.style.background_gradient(cmap='Greens'))