Cosine similarity is a widely used measure in machine learning and information retrieval for comparing two vectors. It is particularly useful in text mining and natural language processing because it depends only on the direction of the vectors, not their magnitude, so a short document and a long document with similar word distributions still score as similar.

The cosine similarity between two vectors A and B is defined as the cosine of the angle between them, which is given by the dot product of the two vectors divided by the product of their magnitudes:

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $A \cdot B$ is the dot product of A and B, and $\|A\|$ and $\|B\|$ are the magnitudes of A and B, respectively.
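
As a quick check of the formula, here is a minimal NumPy sketch. The helper name `cosine_sim` and the example vectors are illustrative assumptions, not part of any library.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # The angle is undefined when either vector has zero length
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / norm_product)

A = [1.0, 2.0, 3.0]
B = [2.0, 4.0, 6.0]    # same direction as A, twice the length
C = [-3.0, 0.0, 1.0]   # orthogonal to A: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_sim(A, B))  # close to 1.0, because A and B are parallel
print(cosine_sim(A, C))  # 0.0, because A and C are orthogonal
```

Scaling a vector leaves the result unchanged, which is why B scores as (almost exactly) identical to A despite being twice as long.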

K-nearest neighbors (KNN) is a simple machine learning algorithm used for classification and regression tasks. It works by finding the K nearest neighbors of a given test point in the training set, and predicting the label or value of the test point based on the labels or values of its K nearest neighbors.

KNN can be combined with cosine similarity to perform text classification or document retrieval tasks. The idea is to represent each document as a vector of word frequencies or embeddings, and compute the cosine similarity between the test document and each training document. Then, we can find the K nearest neighbors of the test document based on their cosine similarities, and predict the label or category of the test document based on the labels or categories of its K nearest neighbors.
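
The procedure described above can be sketched from scratch with NumPy. This is only an illustration under stated assumptions: the function name `knn_cosine_predict`, the four-word vocabulary, and the word counts are all made up for the example.

```python
from collections import Counter
import numpy as np

def knn_cosine_predict(X_train, y_train, x_query, k=3):
    """Majority vote among the k training rows most cosine-similar to x_query."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Cosine similarity of the query against every training row
    # (assumes no row and no query is an all-zero vector)
    sims = X_train @ x_query / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_query))
    # Indices of the k largest similarities
    top_k = np.argsort(sims)[::-1][:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in top_k)
    return votes.most_common(1)[0][0]

# Toy word-count vectors over a made-up vocabulary:
# ["goal", "match", "vote", "election"]
X_train = np.array([
    [3, 2, 0, 0],
    [1, 4, 0, 0],
    [2, 1, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 1, 4],
    [0, 0, 2, 2],
])
y_train = ["sports", "sports", "sports", "politics", "politics", "politics"]

print(knn_cosine_predict(X_train, y_train, [2, 2, 0, 0], k=3))  # sports
print(knn_cosine_predict(X_train, y_train, [0, 0, 1, 3], k=3))  # politics
```

Ranking by largest similarity is equivalent to ranking by smallest cosine distance (1 minus similarity), which is the form distance-based libraries usually expect.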

Here is an example of implementing KNN with cosine similarity in Python:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier

# Load the training data and labels
X_train = ...  # matrix of training document vectors
y_train = ...  # vector of training document labels

# Load the test data
X_test = ...   # matrix of test document vectors

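# Compute cosine similarities between test and training data
# (for inspection only; the classifier below computes its own distances)
cos_sim = cosine_similarity(X_test, X_train)

# Initialize KNN classifier with K=5 and cosine distance (1 - cosine similarity)
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')

# Fit the KNN model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data based on the nearest neighbors.
# predict expects feature vectors, not the similarity matrix.
y_pred = knn.predict(X_test)

In this code, we first load the training data and labels, and the test data. We then compute the cosine similarities between the test and training data using the cosine_similarity function from the sklearn.metrics.pairwise module; this matrix is useful for inspection, but it is not passed to the classifier. We initialize a KNN classifier with K=5 and metric='cosine' using the KNeighborsClassifier class from the sklearn.neighbors module, so that neighbors are ranked by cosine distance (1 minus cosine similarity), and fit the model to the training data using the fit method. Finally, we predict the labels of the test data by passing the test vectors themselves to the predict method, since a model fitted on feature vectors expects feature vectors rather than a similarity matrix.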
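
If a precomputed similarity matrix is to be used for prediction, one option in scikit-learn is metric='precomputed', which takes distances rather than similarities. The sketch below compares that route with metric='cosine'. The synthetic two-cluster data is an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the document's dataset):
# class 0 points lie near the x-axis, class 1 points near the y-axis
rng = np.random.default_rng(0)
X_train = np.vstack([
    rng.normal([5, 0], 0.5, size=(20, 2)),
    rng.normal([0, 5], 0.5, size=(20, 2)),
])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[4.0, 0.5], [0.5, 4.0]])

# Option 1: the classifier computes cosine distances itself
knn_cos = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn_cos.fit(X_train, y_train)
pred_direct = knn_cos.predict(X_test)

# Option 2: pass precomputed cosine distances (1 - cosine similarity).
# fit takes a square (n_train, n_train) matrix,
# predict takes an (n_test, n_train) matrix.
knn_pre = KNeighborsClassifier(n_neighbors=5, metric="precomputed")
knn_pre.fit(cosine_distances(X_train), y_train)
pred_pre = knn_pre.predict(cosine_distances(X_test, X_train))

print(pred_direct, pred_pre)
```

Both options should rank neighbors the same way. The precomputed route is mainly useful when the distance matrix is expensive to build and is reused elsewhere.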

## Running the code

In [20]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import datasets,preprocessing,model_selection,metrics
from sklearn.neighbors import KNeighborsClassifier

In [21]:
x , y = datasets.load_iris(return_X_y=True)
x_normalize = preprocessing.StandardScaler()
x_norm = x_normalize.fit_transform(x)
x.shape,y.shape,x_norm.shape

((150, 4), (150,), (150, 4))

In [22]:


# Split the standardized iris data into training (90%) and test (10%) sets
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x_norm, y, test_size=.1, random_state=42, stratify=y)

# Compute cosine similarities between test and training data
# (not used by the classifier below)
cos_sim = cosine_similarity(x_test, x_train)

# Initialize KNN classifier with K=5
# (the default metric is Euclidean; pass metric='cosine' to use cosine distance)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the KNN model to the training data
knn.fit(x_train, y_train)

# Predict the labels of the test data based on the nearest neighbors.
# predict takes the test feature vectors, not the similarity matrix.
y_pred = knn.predict(x_test)
metrics.accuracy_score(y_test, y_pred)

0.9333333333333333

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier

# Vectorize the documents with TF-IDF
docs = ["This is the first document.", "This is the second document.", "This is the third document."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Compute pairwise cosine similarity between the documents
# (for inspection only; not used by the classifier below)
cos_sim = cosine_similarity(X)

# Define K for KNN
K = 2

# Fit KNN model; metric='cosine' ranks neighbors by cosine distance (1 - cosine similarity)
knn = KNeighborsClassifier(n_neighbors=K, metric='cosine')
knn.fit(X, [0, 1, 2])  # Assumes a 3-class problem, one class per document

# Classify a new document
new_doc = "This is a new document."
new_doc_vec = vectorizer.transform([new_doc])
pred = knn.predict(new_doc_vec)

print("Predicted class:", pred)

Predicted class: [0]
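
To see where a prediction like this comes from, it can help to look at the neighbor distances directly. The sketch below rebuilds the same model and queries all three training documents with the kneighbors method. In this toy corpus the new document appears to be equally distant from every training document: "a" is dropped by the default tokenizer, "new" is not in the fitted vocabulary, and the remaining words occur in all three documents. The predicted class then reflects tie-breaking rather than a real preference.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["This is the first document.", "This is the second document.", "This is the third document."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

knn = KNeighborsClassifier(n_neighbors=2, metric="cosine")
knn.fit(X, [0, 1, 2])

new_doc_vec = vectorizer.transform(["This is a new document."])

# Ask for all three training documents instead of only K=2
distances, indices = knn.kneighbors(new_doc_vec, n_neighbors=3)

print("Vocabulary:", sorted(vectorizer.vocabulary_))
print("Cosine distances:", distances[0])
print("Neighbor indices:", indices[0])
```

With a larger, labeled corpus where several documents share each class, the distances would differ and the majority vote would carry real information.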
