#### Importing Necessary Packages

In [2]:
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS 
import openai
import import_ipynb
from Config import *
import os
import warnings
warnings.filterwarnings("ignore")

##### Environment Variables (API key and endpoint settings for OpenAI)

In [3]:
os.environ["OPENAI_API_TYPE"] = key_value_dict['api_type']
os.environ["OPENAI_API_VERSION"] = key_value_dict['api_version']
os.environ["OPENAI_API_BASE"] = key_value_dict['api_base']
os.environ["OPENAI_API_KEY"] = key_value_dict['api_key']
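The `Config` notebook imported above is not shown here. As a rough sketch, assuming `key_value_dict` is a plain dictionary of strings, a check like the following could catch a missing setting before any request is sent. The `missing_keys` helper and all placeholder values are hypothetical; only the key names are taken from the cells in this notebook.

```python
# Keys that this notebook reads from key_value_dict.
REQUIRED_KEYS = [
    "api_type", "api_version", "api_base", "api_key",
    "embed_eng_dep_nm", "embedding_model", "comp_eng_dep_nm",
]

def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in config."""
    return [key for key in required if not config.get(key)]

# Hypothetical placeholder values, not real credentials.
example_config = {
    "api_type": "azure",
    "api_version": "<api-version>",
    "api_base": "https://<resource-name>.openai.azure.com/",
    "api_key": "<secret>",
    "embed_eng_dep_nm": "<embedding-deployment>",
    "embedding_model": "<embedding-model>",
}

print(missing_keys(example_config))  # ['comp_eng_dep_nm']
```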

In [4]:
# Path to the PDF file to read
doc_reader = PdfReader('knn.pdf')

In [5]:
# Read the text of every page and concatenate it into raw_text
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

In [6]:
len(raw_text)

6449

In [7]:
raw_text[:1000]

'K-Nearest Neighbour\nTushar B. Kute,\nhttp://tusharkute.com\nWhat sort of Machine Learning?\n•An idea that can be used for machine learning—\nas does another maxim involving poultry: "birds \nof a feather flock together." \n•In other words, things that are alike are likely to \nhave properties that are alike. \n•We can use this principle to classify data by \nplacing it in the category with the most similar, \nor "nearest" neighbors.\nNearest Neighbor Classification\n•In a single sentence, nearest neighbor classifiers are defined \nby their characteristic of classifying unlabeled examples by \nassigning them the class of the most similar labeled examples. \nDespite the simplicity of this idea, nearest neighbor methods \nare extremely powerful. They have been used successfully for:\n–Computer vision applications, including optical character \nrecognition and facial recognition in both still images and \nvideo\n–Predicting whether a person enjoys a movie which he/she \nhas been recommen

## 1. Text Splitter
### This step splits the text into chunks. The chunk size is measured in characters, not tokens.

In [8]:

# Split the text into smaller chunks for indexing
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,  # consecutive chunks share up to 200 characters
)
texts = text_splitter.split_text(raw_text)
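To make the roles of `separator`, `chunk_size`, and `chunk_overlap` concrete, here is a simplified sketch of line-based chunking with overlap. It is not LangChain's actual implementation; the `split_by_lines` function and the demo text are hypothetical, and the small sizes are chosen only so the result is easy to inspect.

```python
def split_by_lines(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying up to chunk_overlap characters of
    trailing pieces over into the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits the overlap
            # budget and still leaves room for the new piece.
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

# Ten short lines, packed into chunks of at most 30 characters.
demo_text = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_by_lines(demo_text, chunk_size=30, chunk_overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

A single piece longer than `chunk_size` is kept whole in this sketch, so a chunk can exceed the limit when the text has no separator nearby.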

In [9]:
texts[:1000]

['K-Nearest Neighbour\nTushar B. Kute,\nhttp://tusharkute.com\nWhat sort of Machine Learning?\n•An idea that can be used for machine learning—\nas does another maxim involving poultry: "birds \nof a feather flock together." \n•In other words, things that are alike are likely to \nhave properties that are alike. \n•We can use this principle to classify data by \nplacing it in the category with the most similar, \nor "nearest" neighbors.\nNearest Neighbor Classification\n•In a single sentence, nearest neighbor classifiers are defined \nby their characteristic of classifying unlabeled examples by \nassigning them the class of the most similar labeled examples. \nDespite the simplicity of this idea, nearest neighbor methods \nare extremely powerful. They have been used successfully for:\n–Computer vision applications, including optical character \nrecognition and facial recognition in both still images and \nvideo\n–Predicting whether a person enjoys a movie which he/she',
 '–Computer visi

## 2. Embedding Layer

In [10]:
from langchain.embeddings import AzureOpenAIEmbeddings

In [11]:
embeddings = AzureOpenAIEmbeddings(
    deployment=key_value_dict["embed_eng_dep_nm"],
    model=key_value_dict["embedding_model"],
    chunk_size=1,
)

In [12]:
docsearch = FAISS.from_texts(texts, embeddings)

In [13]:
docsearch

<langchain_community.vectorstores.faiss.FAISS at 0x1eefb6c2650>

In [14]:
#docsearch.embedding_function

In [15]:
query = "Where is kNN significantly used?"
docs = docsearch.similarity_search(query,k=2)
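Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with plain Python, assuming Euclidean distance as the metric. The three-dimensional vectors are made-up stand-ins for real embeddings, and `top_k` is a hypothetical helper, not part of FAISS or LangChain.

```python
import math

def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(distances)[:k]]

# Toy "embeddings" for three chunks and one query (hypothetical values).
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
toy_query = [1.0, 0.0, 0.0]

print(top_k(toy_query, toy_vectors, k=2))  # [0, 2]
```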

In [16]:
docs

[Document(page_content='–Computer vision applications, including optical character \nrecognition and facial recognition in both still images and \nvideo\n–Predicting whether a person enjoys a movie which he/she \nhas been recommended (as in the Netflix challenge)\n–Identifying patterns in genetic data, for use in detecting \nspecific protein or diseases\nThe kNN Algorithm\n•The kNN algorithm begins with a training dataset \nmade up of examples that are classified into several \ncategories, as labeled by a nominal variable. \n•Assume that we have a test dataset containing \nunlabeled examples that otherwise have the same \nfeatures as the training data. \n•For each record in the test dataset, kNN identifies k \nrecords in the training data that are the "nearest" in \nsimilarity, where k is an integer specified in advance. \n•The unlabeled test instance is assigned the class of \nthe majority of the k nearest neighbors\nExample:\nReference: Machine Learning with R, Brett Lantz, Packt Pub

## 3. Retrieval

In [17]:
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import AzureChatOpenAI

In [18]:
llm = AzureChatOpenAI(
    deployment_name=key_value_dict["comp_eng_dep_nm"],
    temperature=0,
    openai_api_version=key_value_dict["api_version"],
)

In [19]:
# "stuff" places all retrieved documents into a single prompt
chain = load_qa_chain(llm=llm, chain_type="stuff")
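The "stuff" strategy can be pictured as plain string concatenation: every retrieved chunk goes into one context block, followed by the question. The sketch below shows that idea only. The prompt wording and the `build_stuff_prompt` helper are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical chunks, shortened for illustration.
example_prompt = build_stuff_prompt(
    ["kNN assigns the class of the majority of the k nearest neighbors.",
     "k is an integer specified in advance."],
    "what is knn?",
)
print(example_prompt)
```

Because all chunks share one prompt, this approach only works while the combined text fits within the model's context window.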

In [20]:
query = "Who is the author of the book?"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

'The author of the book is not mentioned in the given context.'

In [21]:
# "map_rerank" answers from each document separately and scores every answer
chain2 = load_qa_chain(
    llm,
    chain_type="map_rerank",
    return_intermediate_steps=True,
)

query = "What are the Lazy Algorithms quoted in the book?"
docs = docsearch.similarity_search(query, k=2)
results = chain2({"input_documents": docs, "question": query}, return_only_outputs=True)

In [22]:
results

{'intermediate_steps': [{'answer': 'K Nearest Neighbors, Local Regression, Lazy Naive Bayes',
   'score': '100'},
  {'answer': 'K Nearest Neighbors, Local Regression, Lazy Naive Bayes',
   'score': '100'}],
 'output_text': 'K Nearest Neighbors, Local Regression, Lazy Naive Bayes'}

In [23]:
results['output_text']

'K Nearest Neighbors, Local Regression, Lazy Naive Bayes'
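The `intermediate_steps` shown above hold one answer and one score per document, and `output_text` is the answer that ranks highest. A minimal sketch of that final ranking step, assuming scores arrive as numeric strings as in the output above, might look like this. The `pick_best` helper is hypothetical, and the tie-breaking rule (first answer wins) is an assumption.

```python
def pick_best(intermediate_steps):
    """Return the answer with the highest numeric score; on ties, the first one wins."""
    best = max(intermediate_steps, key=lambda step: int(step["score"]))
    return best["answer"]

# Shape copied from the results printed above.
steps = [
    {"answer": "K Nearest Neighbors, Local Regression, Lazy Naive Bayes", "score": "100"},
    {"answer": "K Nearest Neighbors, Local Regression, Lazy Naive Bayes", "score": "100"},
]
print(pick_best(steps))
```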

## RetrievalQA Chain

In [24]:
from langchain.chains import RetrievalQA

# set up FAISS as a generic retriever 
retriever = docsearch.as_retriever(search_type="similarity", search_kwargs={"k":4})

# create the chain to answer questions 
rqa = RetrievalQA.from_chain_type(llm, 
                                  chain_type="stuff", 
                                  retriever=retriever, 
                                  return_source_documents=True)

In [25]:
rqa("what is knn?")

{'query': 'what is knn?',
 'result': 'KNN stands for K-Nearest Neighbors, which is a machine learning algorithm used for classification and regression. It is based on the principle that similar things are likely to have similar properties. In KNN, a test instance is classified by finding the K training instances that are closest to it in terms of distance and assigning the test instance the class that is most common among its K nearest neighbors.',
 'source_documents': [Document(page_content='–Computer vision applications, including optical character \nrecognition and facial recognition in both still images and \nvideo\n–Predicting whether a person enjoys a movie which he/she \nhas been recommended (as in the Netflix challenge)\n–Identifying patterns in genetic data, for use in detecting \nspecific protein or diseases\nThe kNN Algorithm\n•The kNN algorithm begins with a training dataset \nmade up of examples that are classified into several \ncategories, as labeled by a nominal variabl

In [26]:
query = "How does the kNN classifier work?"
print(rqa(query)['result'],sep = "\n")

The kNN (k-Nearest Neighbors) algorithm works by first training a dataset with labeled examples. Then, for each unlabeled example in the test dataset, the algorithm identifies the k nearest labeled examples in the training dataset based on similarity. The unlabeled example is then assigned the class of the majority of the k nearest neighbors. The value of k is specified in advance and typically set between 3 and 10, depending on the difficulty of the concept to be learned and the number of records in the training data. The algorithm can be used for various applications, including computer vision, genetic data analysis, and predicting user preferences.
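The procedure described in this answer (find the k nearest labeled examples, then take the majority class) can be sketched in a few lines of standard-library Python. The toy points and labels below are made up for illustration, and `knn_predict` is a hypothetical helper rather than code from the PDF.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two small clusters with hypothetical labels.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels = ["a", "a", "a", "b", "b"]

print(knn_predict(points, labels, (2, 2), k=3))  # a
print(knn_predict(points, labels, (8, 9), k=3))  # b
```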


In [27]:
query = "how to calculate distance in KNN?"
print(rqa(query)['result'],sep = "\n")

To calculate distance in KNN, a distance function or formula is used to measure the similarity between two instances. The most commonly used distance function in KNN is Euclidean distance, which is the distance one would measure if you could use a ruler to connect two points. The formula for Euclidean distance is:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

where p and q are the examples to be compared, each having n features. The term p1 refers to the value of the first feature of example p, while q1 refers to the value of the first feature of example q. Other distance functions that can be used in KNN include Manhattan distance and Minkowski distance.
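The Euclidean formula above translates directly into code, and the other two distances mentioned follow the same pattern. This is a small sketch using only the standard library; the function names and the sample points are chosen for illustration.

```python
import math

def euclidean(p, q):
    """sqrt((p1 - q1)^2 + ... + (pn - qn)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """|p1 - q1| + ... + |pn - qn|"""
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    """Generalization: r=1 gives Manhattan distance, r=2 gives Euclidean distance."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```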


In [28]:
query = "how to choose k in KNN?"
print(rqa(query)['result'],sep = "\n")

Choosing k in KNN depends on the difficulty of the concept to be learned and the number of records in the training data. Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples. Choosing a large k reduces the impact or variance caused by noisy data, but can bias the learner such that it runs the risk of ignoring small, but important patterns. Therefore, deciding how many neighbors to use for KNN determines how well the model will generalize to future data. The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff.
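The square-root rule of thumb mentioned in this answer is easy to express in code. The sketch below rounds to the nearest integer and never returns less than 1; both choices, and the `suggest_k` name, are assumptions made for illustration. The result is only a starting point, since the best k still depends on the data.

```python
import math

def suggest_k(n_training_examples):
    """Rule-of-thumb starting value for k: the rounded square root of the training set size."""
    return max(1, round(math.sqrt(n_training_examples)))

print(suggest_k(100))  # 10
print(suggest_k(50))   # 7
```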


In [29]:
query = "What packages are needed to implement KNN?"
print(rqa(query)['result'],sep = "\n")

The Python packages needed to implement KNN are pandas for data analytics, numpy for numerical computing, matplotlib.pyplot for plotting graphs, and sklearn for classification and regression classes.
