# k-Nearest-Neighbours Classification with HuggingFace Embeddings and Vectorstores

In this notebook we will build a kNN classifier that classifies texts (i.e. strings) based on their nearest neighbours in an embedding space.
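
The idea can be sketched without any embedding model at all. The block below is a toy illustration, not the pipeline built in this notebook: the 2-D vectors stand in for real embeddings, and `cosine_similarity`, `knn_classify` and `toy_points` are hypothetical names introduced only for this example. Each query is assigned the most common label among its k most similar points.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_classify(query_vec, labelled_vecs, k=3):
    """Return the majority label among the k vectors most similar to query_vec.

    labelled_vecs is a list of (vector, label) pairs.
    """
    ranked = sorted(
        labelled_vecs,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-D "embeddings" with newsgroup-style labels (illustrative only)
toy_points = [
    ([1.0, 0.1], "rec.autos"),
    ([0.9, 0.2], "rec.autos"),
    ([0.1, 1.0], "sci.space"),
    ([0.2, 0.9], "sci.space"),
    ([0.0, 0.8], "sci.space"),
]

toy_prediction = knn_classify([0.95, 0.15], toy_points, k=3)
print(toy_prediction)
```

In the rest of the notebook the vectors come from an embedding model and the neighbour search is delegated to a vector store, but the voting step stays the same.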

## Imports

In [None]:
# sentence-transformers is a popular open-source library for computing text embeddings.
# The embedding models used below are loaded through this package.
!pip install sentence-transformers

In [None]:
# langchain is a popular open-source library for working with LLMs, embedding models and vector stores
!pip install langchain

In [None]:
# faiss is a popular library for vector similarity search; langchain can use it as a vector store
!pip install faiss-gpu

# alternatively, if you have no GPU available:
# !pip install faiss-cpu

In [1]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from tqdm import tqdm
from sklearn.datasets import fetch_20newsgroups

## Load data
- The 20 Newsgroups dataset is a well-known benchmark for NLP tasks. You can also experiment with your own dataset
- If you don't have a dataset in mind, you can also ask ChatGPT to generate one

In [13]:
# fetch the Newsgroup dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

### Inspect one sample

In [3]:
print(f"Article: {newsgroups.data[0]}")
print(f"\n Category: {newsgroups.target_names[newsgroups.target[0]]}")

Article: I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

 Category: rec.autos


## Create Documents with Langchain

A Document is a LangChain utility class that bundles the original text (i.e. a string) with metadata and other properties.

Most other LangChain operations expect their input as Documents.

In [14]:
# If you use your own dataset, you will have to adapt this.
# The metadata is optional and can be left empty if your texts have no labels.
# For kNN classification, however, the label is needed: it is what the neighbours vote with.
documents = [
    Document(page_content=text, metadata={"category": newsgroups.target_names[target]})
    for text, target in tqdm(zip(newsgroups.data, newsgroups.target), total=len(newsgroups.data))
]

100%|██████████| 18846/18846 [00:00<00:00, 95945.73it/s]


## Build Knowledge base with embeddings
- Check the [MTEB leaderboard on HuggingFace](https://huggingface.co/spaces/mteb/leaderboard) for the latest and greatest embedding models
- Try models with different embedding sizes and compare the results
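
One thing the embedding size affects directly is the memory needed for the vectors. The arithmetic below is a rough estimate under stated assumptions: the vectors are stored as uncompressed float32 values in a flat index (which should be what the FAISS wrapper builds by default, but check your version), and the stored texts and metadata are not counted. The document count is the one shown in the progress bar above; `flat_index_size_mib` is a hypothetical helper name.

```python
NUM_DOCUMENTS = 18846      # number of newsgroup posts, as shown by the progress bar
BYTES_PER_FLOAT32 = 4      # assumption: vectors are stored as float32


def flat_index_size_mib(num_vectors, dim):
    """Approximate size in MiB of num_vectors uncompressed float32 vectors."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / (1024 ** 2)


# 384 and 1024 are the embedding dimensions of the two models used below
sizes = {dim: flat_index_size_mib(NUM_DOCUMENTS, dim) for dim in (384, 1024)}
for dim, size in sizes.items():
    print(f"dim={dim}: about {size:.1f} MiB")
```

Under these assumptions the 1024-dimensional vectors take roughly 2.7 times the space of the 384-dimensional ones, since the size grows linearly with the dimension.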

### Large embedding model
- The embedding dimension of this model is 1024

In [24]:
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {"device": "cuda"}  # use "cpu" if no GPU is available
large_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
)

Downloading (…)5e2c6/.gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 11.3MB/s]
Downloading (…)_Pooling/config.json: 100%|██████████| 191/191 [00:00<00:00, 1.53MB/s]
Downloading (…)ba76d5e2c6/README.md: 100%|██████████| 90.3k/90.3k [00:00<00:00, 919kB/s]
Downloading (…)76d5e2c6/config.json: 100%|██████████| 779/779 [00:00<00:00, 6.21MB/s]
Downloading (…)ce_transformers.json: 100%|██████████| 124/124 [00:00<00:00, 1.09MB/s]
Downloading pytorch_model.bin: 100%|██████████| 1.34G/1.34G [00:12<00:00, 103MB/s] 
Downloading (…)nce_bert_config.json: 100%|██████████| 52.0/52.0 [00:00<00:00, 435kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 698kB/s]
Downloading (…)5e2c6/tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 3.61MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 366/366 [00:00<00:00, 2.90MB/s]
Downloading (…)ba76d5e2c6/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 58.7MB/s]
Downloading (…)6d5e2c6/modules.json: 100%|███

### Small embedding model
- The embedding dimension of this model is 384

In [None]:
model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": "cuda"}  # use "cpu" if no GPU is available
small_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
)

### First you have to build a vectorstore from the documents

In [25]:
# for the small embeddings
knowledge_base_small_emb = FAISS.from_documents(documents, small_embeddings)
knowledge_base_small_emb.save_local("vector_stores/knowledge_base_20_newsgroup_all_bge-small-en-v1.5")

In [None]:
# for the large embeddings
knowledge_base_large_emb = FAISS.from_documents(documents, large_embeddings)
knowledge_base_large_emb.save_local("vector_stores/knowledge_base_20_newsgroup_all_bge-large-en-v1.5")

### If you have already built the vector store you can simply load it

In [None]:
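# Here we pick the vector store built with the large embedding model.
# Note: recent LangChain versions additionally require allow_dangerous_deserialization=True here.
knowledge_base = FAISS.load_local("vector_stores/knowledge_base_20_newsgroup_all_bge-large-en-v1.5", large_embeddings)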

## Experiment with some queries

In [26]:
query = "Is the earth flat?"
k = 10
# returns the k=10 nearest neighbours in the vector space as (document, score) tuples
results = knowledge_base.similarity_search_with_score(query, k=k)
for doc, score in results:
  print(doc)

page_content='It has been known for quite a while that the earth is actually more pear\nshaped than globular/spherical.  Does anyone make a "globe" that is accurate\nas to actual shape, landmass configuration/Long/Lat lines etc.?\nThanks in advance.\n\n--\n\nbill@xpresso.UUCP                   (Bill Vance),             Bothell, WA\nrwing!xpresso!bill' metadata={'category': 'sci.space'}
page_content='\n    What do you accept as a fact --  the roundness of the earth (after \nall, the ancient Greeks thought it was a sphere, and then Newton said \nit was a spheroid, and now people say it\'s a geoid [?])?  yourself \n(isn\'t your personal identity just a theoretical construct to make \nsense of memories, feelings, perceptions)?  I\'m trying to think of \nanything that would be a fact for you.  Give some examples, and let\'s\nsee how factual they are by your criteria (BTW, what are your\ncriteria?).\n\n    "Gravity is _not_ a fact": is that a fact?  How about Newton\'s \nand Einstein\'s thou
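
The retrieved neighbours can be turned into a prediction by letting their categories vote. The block below is a sketch of that step, not code from the original notebook: `classify_from_neighbours` is a hypothetical helper, and the results list is a hand-made stand-in (built with `SimpleNamespace`, with made-up scores) so that the example runs without a vector store.

```python
from collections import Counter
from types import SimpleNamespace


def classify_from_neighbours(neighbours):
    """Majority vote over the 'category' metadata of (document, score) pairs."""
    votes = Counter(doc.metadata["category"] for doc, _score in neighbours)
    # Ties are resolved in favour of the category that appears first in the list
    return votes.most_common(1)[0][0]


# Stand-in for the list returned by similarity_search_with_score(query, k=3);
# the texts and scores are placeholders, not real search output.
fake_results = [
    (SimpleNamespace(page_content="post 1", metadata={"category": "sci.space"}), 0.41),
    (SimpleNamespace(page_content="post 2", metadata={"category": "alt.atheism"}), 0.47),
    (SimpleNamespace(page_content="post 3", metadata={"category": "sci.space"}), 0.52),
]

predicted_category = classify_from_neighbours(fake_results)
print(predicted_category)
```

In the notebook itself, the same helper could be applied to the real search output, e.g. `classify_from_neighbours(knowledge_base.similarity_search_with_score(query, k=10))`.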

In [27]:
#Let's try another one
query = "Who won the superbowl?"
k = 10
result = knowledge_base.similarity_search_with_score(query, k=k)
for doc in result:
  print(doc[0])

page_content="Giants Win the Pennant!!  Giants Win the Pennant !! Gi... OOOPS\nI guess I'm a little early here...\nSee you in October...\n\n" metadata={'category': 'rec.sport.baseball'}
page_content="Tonight in Boston, the Buffalo Sabres blanked the Boston\nBruins 4-0 tonight in Boston. Looks like Boston can hang\nthis season up, because Buffalo's home record is awesome!!!!\nThis is great.. Buffalo fans might get to see revenge for\nlast year!!!!! :)\n-- \ndelarocq@eos.ncsu.edu\n\n\n      \n---------------------------------------------------------------------------   \n1988,1989,1990,1991 AFC East Division Champions\n1991,1992, AND 1993 AFC Conference Champions!!!!!!!!  :)\n\nSquished the Fish ............... Monday Night Football, November 16, 1992..\nSQUISHED THE TRASH TALKING FISH.. AFC CHAMPIONSHIP, JANUARY 17, 1992.." metadata={'category': 'rec.sport.hockey'}
page_content="\nHopefully, a miracle (o.k. not quite a miracle, but close!) will occur and\nPittsburgh will be elminated pr

In [28]:
query = "What is Artificial Intelligence"
k = 10
result = knowledge_base.similarity_search_with_score(query, k=k)
for doc in result:
  print(doc[0])

page_content='If you have any information on artificial intelligence in medicine, then I\nwould appreciate it if you could mail me with whatever it is. The informations\nis needed for a project.\n\nThank you, Ian.' metadata={'category': 'sci.med'}
page_content="From article <1993May1.092058.1@aurora.alaska.edu>, by pstlb@aurora.alaska.edu:\n\n\nSince this was posted on comp.ai, I assume there is an AI angle to this.  Hacking is\nwhat AI students do when they're really supposed to be doing something else, e.g.\nthesis research & write up, getting their supervisors' pet programs to run properly,\netc.  No-one gets much glory for hacking, and no-one gets any money out of it.\nProducing good free software requires an enormous investment of time & resources that\nnot many people can, or want to, afford - particularly during a recession.\n\nIn addition, over the last 10 years, I think there has been a de-emphasis on producing\nrunning programs in AI research, and a greater emphasis on more f
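
A plain majority vote ignores how close each neighbour is. One common variation is to weight each vote by the inverse of its distance. The sketch below assumes that the score returned next to each document is a distance, so that smaller means closer; this should hold for the default FAISS setup, but treat it as an assumption and check your version. `weighted_vote` is a hypothetical helper, and the neighbours and distances are made up for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def weighted_vote(neighbours, eps=1e-9):
    """Pick the category with the largest sum of 1 / distance.

    neighbours is a list of (document, distance) pairs; eps avoids a
    division by zero when a neighbour has distance 0.
    """
    weights = defaultdict(float)
    for doc, distance in neighbours:
        weights[doc.metadata["category"]] += 1.0 / (distance + eps)
    return max(weights, key=weights.get)


# Made-up neighbours: one very close post and two distant ones
fake_results = [
    (SimpleNamespace(metadata={"category": "sci.med"}), 0.20),
    (SimpleNamespace(metadata={"category": "comp.graphics"}), 0.90),
    (SimpleNamespace(metadata={"category": "comp.graphics"}), 1.00),
]

weighted_prediction = weighted_vote(fake_results)
print(weighted_prediction)
```

Here the single close neighbour outweighs the two distant ones, whereas an unweighted majority vote over the same list would pick the other category.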

## Experiment with the smaller embedding model

In [None]:
knowledge_base = FAISS.load_local("vector_stores/knowledge_base_20_newsgroup_all_bge-small-en-v1.5", small_embeddings)

In [21]:
query = "Is the earth flat?"
k = 10
result = knowledge_base.similarity_search_with_score(query, k=k)
for doc in result:
  print(doc[0])

page_content='It has been known for quite a while that the earth is actually more pear\nshaped than globular/spherical.  Does anyone make a "globe" that is accurate\nas to actual shape, landmass configuration/Long/Lat lines etc.?\nThanks in advance.\n\n--\n\nbill@xpresso.UUCP                   (Bill Vance),             Bothell, WA\nrwing!xpresso!bill' metadata={'category': 'sci.space'}
page_content='\n    What do you accept as a fact --  the roundness of the earth (after \nall, the ancient Greeks thought it was a sphere, and then Newton said \nit was a spheroid, and now people say it\'s a geoid [?])?  yourself \n(isn\'t your personal identity just a theoretical construct to make \nsense of memories, feelings, perceptions)?  I\'m trying to think of \nanything that would be a fact for you.  Give some examples, and let\'s\nsee how factual they are by your criteria (BTW, what are your\ncriteria?).\n\n    "Gravity is _not_ a fact": is that a fact?  How about Newton\'s \nand Einstein\'s thou

In [22]:
query = "Who won the superbowl?"
k = 10
result = knowledge_base.similarity_search_with_score(query, k=k)
for doc in result:
  print(doc[0])

page_content="\nLooks like Bob Errey's ring really sparkles in that locker room, and everyone\nelse wants one, too! :-)  Correct me if I'm wrong though, (just through\nthe net, not through e-mail, I don't need 100 rl's in my e-mail!) but wasn't\nBoston down 2-0 vs. Buffalo last year?Boston lost 1 and 2 at home and won\n3 and 4 in Buffalo.  Whoever wins game 3 will advance.  Simple as that!!! :-)" metadata={'category': 'rec.sport.hockey'}
page_content='Lake State/Maine in finals...WHO WON?   Please post.\n' metadata={'category': 'rec.sport.hockey'}
page_content='\nYou must be kidding, right? In losing Stevens the Blues got Shanahan and\nkept\nJoseph. Then they traded Oates for Janney. As a Hawks fan you have got to\nrespect those "hapless" names. 8^) Lets see, who scored the game winning\novertime goal in the 4th game???\n\n' metadata={'category': 'rec.sport.hockey'}
page_content='The Hawks win!!  Jermey Roenick scored his 50 th goal and the Hawks put the\nLeafs in their place, the lose

In [23]:
query = "What is Artificial Intelligence"
k = 10
result = knowledge_base.similarity_search_with_score(query, k=k)
for doc in result:
  print(doc[0])

page_content='If you have any information on artificial intelligence in medicine, then I\nwould appreciate it if you could mail me with whatever it is. The informations\nis needed for a project.\n\nThank you, Ian.' metadata={'category': 'sci.med'}
page_content='\n\nDiplomatic :-)\n\nI realize I\'m fighting Occam\'s razor in this argument, so I\'ll try to\nexplain why I feel a mind is necessary. \n\nFirstly, I\'m not impressed with the ability of algorithms. They\'re\ngreat at solving problems once the method has been worked out, but not\nat working out the method itself.\n\nAs a specific example, I like to solve numerical crosswords (not the\nsimple do-the-sums-and-insert-the-answers type, the hard ones.) To do\nthese with any efficiency, you need to figure out a variety of tricks.\nNow, I know that you can program a computer to do these puzzles, but\nin doing so you have to work out the tricks _yourself_, and program\nthem into the computer. You can, of course, \'obfuscate\' the trick
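
To compare the two embedding models as classifiers rather than by eye, one option is to measure accuracy on labelled texts. The block below is a sketch under assumptions: `accuracy` and `toy_predict` are hypothetical names, the labelled examples are made up, and the keyword-based predictor only stands in for a real "search, then vote" function so that the example runs on its own. Note that the vector stores above were built from `subset='all'`, so any newsgroup post used as a query would retrieve itself as its nearest neighbour; for a fair comparison, the store would have to be built from `subset='train'` and evaluated on `subset='test'`.

```python
def accuracy(predict, labelled_texts):
    """Fraction of (text, label) pairs for which predict(text) returns label."""
    hits = sum(1 for text, label in labelled_texts if predict(text) == label)
    return hits / len(labelled_texts)


def toy_predict(text):
    """Stand-in predictor; a real one would query the vector store and vote."""
    return "sci.space" if "earth" in text.lower() else "rec.sport.hockey"


# Made-up held-out examples with made-up labels
held_out = [
    ("Is the earth flat?", "sci.space"),
    ("Who won the game last night?", "rec.sport.hockey"),
    ("How do I treat a migraine?", "sci.med"),
]

toy_accuracy = accuracy(toy_predict, held_out)
print(f"accuracy: {toy_accuracy:.2f}")
```

Running the same held-out set through a predictor backed by each vector store would give one accuracy figure per embedding model.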