## Assignment 2 - BDP
### Sreejesh S Nair - 20318001

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

### Loading Dataset

In [None]:
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

In [None]:
documents = newsgroups.data
categories = newsgroups.target

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
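
A quick sanity check on what `fit_transform` returns can be sketched on a toy corpus. The three sentences and the `toy_*` names below are hypothetical and not part of the assignment data. With the default `norm='l2'`, every non-empty row of the TF-IDF matrix has unit length, so cosine similarity reduces to a plain dot product (`linear_kernel`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus, used only to illustrate the shape of the output
toy_docs = [
    "the pitcher threw a good game",
    "good luck with the new graphics card",
    "the graphics card renders the game",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse, one row per document

# Euclidean length of each row; expected to be 1 under the default norm='l2'
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())

# On unit-length rows the dot product and the cosine similarity coincide
cos = cosine_similarity(toy_matrix)
lin = linear_kernel(toy_matrix)

print(toy_matrix.shape)
print(row_norms)
print(np.allclose(cos, lin))
```

On a corpus the size of 20 Newsgroups, `linear_kernel` may be a slightly cheaper drop-in for `cosine_similarity`, since it skips the renormalization step.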

In [None]:
# Function to calculate cosine similarity
def calculate_cosine_similarity(vector1, vector2):
    return cosine_similarity(vector1.reshape(1, -1), vector2.reshape(1, -1))[0, 0]

# Document similarity search function
def document_similarity_search(input_document, top_n=5):
    # Transform the input document to TF-IDF vector
    input_vector = tfidf_vectorizer.transform([input_document])

    # Calculate cosine similarity with all documents in one vectorized call
    # (avoids a slow Python loop over every row of the sparse matrix)
    similarities = cosine_similarity(input_vector, tfidf_matrix).ravel()

    # Get indices of documents sorted by similarity, highest first
    sorted_indices = np.argsort(similarities)[::-1]

    # Return top N documents based on similarity
    top_documents = [(documents[i], similarities[i]) for i in sorted_indices[:top_n]]

    return top_documents
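
The ranking step above sorts every similarity score even though only the top few are returned. As a minimal sketch, assuming the scores are available as a 1-D array, `np.argpartition` can pick out the `top_n` largest entries first and sort only those. The helper name `top_n_indices` is hypothetical and not part of the original notebook.

```python
import numpy as np

def top_n_indices(similarities, top_n=5):
    """Return indices of the top_n largest scores, highest first."""
    similarities = np.asarray(similarities)
    top_n = min(top_n, similarities.size)
    if top_n <= 0:
        return np.array([], dtype=int)

    # argpartition moves the top_n largest scores into the last top_n slots,
    # in no particular order, without fully sorting the array
    candidates = np.argpartition(similarities, -top_n)[-top_n:]

    # Sort only the selected candidates, highest similarity first
    order = np.argsort(similarities[candidates])[::-1]
    return candidates[order]

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
print(top_n_indices(scores, 3))
```

For a corpus of this size the full `np.argsort` is already fast, so this is mainly worth considering when the search runs many times or over a much larger collection.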



In [None]:
# Example usage
input_doc = "Is it good enough"

top_n = 5
similar_documents = document_similarity_search(input_doc, top_n=top_n)

# Print results
print(f"Input Document:\n{input_doc}\n")
print(f"Top {top_n} Similar Documents:")
for i, (doc, similarity) in enumerate(similar_documents, 1):
    print(f"{i}. Similarity: {similarity:.4f}\n{doc}\n{'='*50}\n")


Input Document:
Is it good enough

Top 5 Similar Documents:
1. Similarity: 1.0000
very good.



2. Similarity: 0.4869
Good luck.



3. Similarity: 0.4070
You can't.  But good luck trying.

4. Similarity: 0.3938


Let's see... April 15th... less than 30 at bats.... and you claim that he 
hasn't done too much so far!

Cut this guy some slack. Danny will produce this year. It's scary to think
just how much he'll produce if he were to stay healthy all year.

The Yanks have a lot going for them this year: good starting rotation, good
bullpen, good defense and a good lineup. Also, I like Buck Showalter. Frank
Howard on 1st is also a good move. Everything sounds good so far. 

If the Yanks stay healthy, they have a good chance at winning the pennant. This 
is the most fun I've had watching the Yanks since "78!

5. Similarity: 0.3079
You a good case for rights to abortion.
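
The perfect score of 1.0000 for "very good." is most likely a side effect of stop-word removal: with `stop_words='english'`, the words "is", "it", "enough" and "very" are all dropped, which leaves the single token "good" in both the query and that document. The sketch below checks this on a hypothetical three-document corpus (the `demo_*` names are not part of the notebook), so the scores for the other documents will differ from the ones printed above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus; the first two strings appear in the results above
demo_docs = ["very good.", "Good luck.", "The weather is bad."]

demo_vectorizer = TfidfVectorizer(stop_words='english')
demo_matrix = demo_vectorizer.fit_transform(demo_docs)

# The analyzer shows which tokens survive lowercasing and stop-word removal
analyzer = demo_vectorizer.build_analyzer()
query_tokens = analyzer("Is it good enough")
doc_tokens = analyzer("very good.")
print(query_tokens, doc_tokens)

# Both texts reduce to the same single term, so their vectors point the same way
query_vector = demo_vectorizer.transform(["Is it good enough"])
scores = cosine_similarity(query_vector, demo_matrix).ravel()
print(scores)
```

Very short queries are therefore dominated by whichever content word remains, which is why every top result here hinges on the word "good".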

