# Notebook with additional approach

We build a kNN classifier that classifies strings based on their neighborhood in an embedding space.
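
To make the idea concrete before touching any libraries, here is a minimal sketch of a kNN vote in plain Python. The two-dimensional vectors, the `toy_index` list, and the helper names `cosine_similarity` and `knn_predict` are illustrative assumptions and not part of this notebook's pipeline; real embeddings have hundreds of dimensions and the neighbor search is delegated to FAISS.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query_vec, labelled_vecs, k=3):
    """Return the majority label among the k vectors most similar to query_vec."""
    ranked = sorted(
        labelled_vecs,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy "embeddings" (assumed values, for illustration only).
toy_index = [
    ([1.0, 0.1], "rec.autos"),
    ([0.9, 0.2], "rec.autos"),
    ([0.1, 1.0], "sci.space"),
    ([0.2, 0.9], "sci.space"),
    ([0.0, 1.0], "sci.space"),
]

prediction = knn_predict([0.95, 0.15], toy_index, k=3)
print(prediction)
```

The query vector lies closest to the two `rec.autos` points, so they outvote the single `sci.space` neighbor that makes it into the top three.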

## Imports

In [None]:
!pip install sentence-transformers

In [None]:
!pip install langchain

In [None]:
!pip install faiss-gpu

In [None]:
import pandas as pd
import numpy as np
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from tqdm import tqdm
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.datasets import fetch_20newsgroups

## Load data
- The 20 Newsgroups dataset is a well-known benchmark for NLP tasks; however, we encourage you to experiment with your own datasets.
- If you don't have a dataset in mind, ask ChatGPT to generate one.

In [None]:
newsgroups = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

### Inspect one sample

In [None]:
print(f"Article: {newsgroups.data[0]}")
print(f"\n Category: {newsgroups.target_names[newsgroups.target[0]]}")

Article: I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

 Category: rec.autos


## Create Documents with LangChain

In [None]:
# If you use your own dataset, you will have to adapt this.
# Note that the metadata is optional and can be left empty if your texts don't come with labels.
documents = [
    Document(page_content=text, metadata={"category": newsgroups.target_names[label]})
    for text, label in tqdm(zip(newsgroups.data, newsgroups.target), total=len(newsgroups.data))
]

100%|██████████| 7532/7532 [00:00<00:00, 103194.32it/s]


## Build Knowledge base with embeddings
- Go to [HuggingFace](https://huggingface.co/spaces/mteb/leaderboard) to check out the latest and greatest embedding models.
- Try models with different embedding sizes and compare the results.

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en")

### First you have to build a vector store from the documents

In [None]:
knowledge_base = FAISS.from_documents(documents, embeddings)
knowledge_base.save_local("knowledge_base_20_newsgroup")

### If you have already built the vector store you can simply load it

In [None]:
# Assign to the same name used by the query cells that follow.
knowledge_base = FAISS.load_local("knowledge_base_20_newsgroup", embeddings)
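
The build and load steps can also be folded into a single "load if present, otherwise build and save" helper, so that re-running the notebook does not re-embed every document. The sketch below is one possible way to do this; `load_or_build` and the toy save/load functions are hypothetical names introduced here for illustration, and the runnable demo uses plain strings in place of a real vector store.

```python
import os
import tempfile


def load_or_build(path, build_fn, save_fn, load_fn):
    """Load a store from `path` if the directory exists, else build and save it."""
    if os.path.isdir(path):
        return load_fn(path)
    store = build_fn()
    save_fn(store, path)
    return store


# Toy stand-ins so the sketch runs without FAISS or an embedding model.
def _toy_save(store, path):
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "store.txt"), "w") as f:
        f.write(store)


def _toy_load(path):
    with open(os.path.join(path, "store.txt")) as f:
        return f.read()


demo_dir = os.path.join(tempfile.mkdtemp(), "toy_store")
first = load_or_build(demo_dir, lambda: "built", _toy_save, _toy_load)
# The directory now exists, so the second call loads instead of rebuilding.
second = load_or_build(demo_dir, lambda: "rebuilt", _toy_save, _toy_load)
print(first, second)

# With the objects from this notebook, the call might look like this (untested sketch):
# knowledge_base = load_or_build(
#     "knowledge_base_20_newsgroup",
#     build_fn=lambda: FAISS.from_documents(documents, embeddings),
#     save_fn=lambda store, p: store.save_local(p),
#     load_fn=lambda p: FAISS.load_local(p, embeddings),
# )
```

Passing the build, save, and load steps as callables keeps the helper independent of any particular vector store library. Remember to delete the saved directory whenever you switch to a different embedding model, since a stored index is only valid for the model that produced it.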

## Experiment with some queries

In [None]:
query = "Is the earth flat?"
k = 10
result = knowledge_base.similarity_search_with_score(query, k=k)
# Each result is a (Document, score) tuple; print only the document.
for doc, score in result:
    print(doc)

page_content='\nThe variance from perfect sphericity in a model of the earth small enough\nto fit into your home would probably be imperceptible.\n\nAny globe you can buy will be close enough.\n\n\n\n\n-- ' metadata={'category': 'sci.space'}
page_content='\n    What do you accept as a fact --  the roundness of the earth (after \nall, the ancient Greeks thought it was a sphere, and then Newton said \nit was a spheroid, and now people say it\'s a geoid [?])?  yourself \n(isn\'t your personal identity just a theoretical construct to make \nsense of memories, feelings, perceptions)?  I\'m trying to think of \nanything that would be a fact for you.  Give some examples, and let\'s\nsee how factual they are by your criteria (BTW, what are your\ncriteria?).\n\n    "Gravity is _not_ a fact": is that a fact?  How about Newton\'s \nand Einstein\'s thoughts about gravity -- is it a fact that they had \nthose thoughts?  I don\'t see how any of the things that you are \nasserting are any more factua