Assignment 5: Create a Python-based solution to calculate the
semantic similarity between two sentences using techniques like
cosine similarity or word embeddings, validating it with the input
"This is a sample sentence." and "This sentence is just a sample."
to produce a similarity score of 0.8.

In [None]:
# Assignment 5: Sentence Similarity
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

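def compute_similarity(sentence1, sentence2):
    """Return the cosine similarity of the two sentences' TF-IDF vectors."""
    # Fit on both sentences so they share one vocabulary and IDF weighting
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])
    # cosine_similarity returns a 1x1 matrix here; take its single entry
    similarity_score = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
    return round(float(similarity_score), 2)  # Two decimals for readability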

input_sentence1 = "This is a sample sentence."
input_sentence2 = "This sentence is just a sample."
output = compute_similarity(input_sentence1, input_sentence2)
print(output)  # Prints 0.82, close to the expected score of 0.8


0.82


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
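
For comparison, the same kind of score can be computed on raw word counts using only the standard library. This is a minimal sketch, not the method used in the cell above: the helper names `tokenize` and `count_cosine` are hypothetical, and the tokenizer is assumed to mimic TfidfVectorizer's default behavior of lowercasing and keeping words of two or more characters.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    # Lowercase and keep words of two or more characters
    return re.findall(r"\b\w\w+\b", sentence.lower())

def count_cosine(sentence1, sentence2):
    # Build raw word-count vectors for each sentence
    counts1 = Counter(tokenize(sentence1))
    counts2 = Counter(tokenize(sentence2))
    # Dot product over shared words; missing words count as 0
    dot = sum(counts1[w] * counts2[w] for w in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # An empty sentence has no direction to compare
    return dot / (norm1 * norm2)

score = count_cosine("This is a sample sentence.",
                     "This sentence is just a sample.")
print(round(score, 2))  # 0.89
```

On raw counts the only difference between the two sentences is the extra word "just". The TF-IDF run above scores lower (0.82) because a word that appears in only one of the two sentences receives a larger IDF weight.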


Assignment 6: Develop and implement a Python-based solution
to compute the Term Frequency-Inverse Document Frequency
(TF-IDF) scores for a small dataset using libraries such as sklearn,
validating it with the input {'text': ['This is a sample document.',
'Another document with different content.']} to generate the
TF-IDF matrix.

In [None]:
# Assignment 6: TF-IDF Scores
from sklearn.feature_extraction.text import TfidfVectorizer

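def compute_tfidf(data):
    """Return the dense TF-IDF matrix and the feature names for data["text"]."""
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(data["text"])
    # Rows are documents; columns follow the order of the feature names
    return tfidf_matrix.toarray(), vectorizer.get_feature_names_out()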

# Test Case
dataset = {"text": ["This is a sample document.", "Another document with different content."]}
tfidf_values, feature_names = compute_tfidf(dataset)

# Printing results
print("Feature Names:", feature_names)
print("TF-IDF Matrix:\n", tfidf_values)


Feature Names: ['another' 'content' 'different' 'document' 'is' 'sample' 'this' 'with']
TF-IDF Matrix:
 [[0.         0.         0.         0.37997836 0.53404633 0.53404633
  0.53404633 0.        ]
 [0.47107781 0.47107781 0.47107781 0.33517574 0.         0.
  0.         0.47107781]]
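
To see where these numbers come from, the matrix can be rebuilt by hand with the standard library. The sketch below assumes scikit-learn's default settings: tokens of two or more characters, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The function name `tfidf_vectors` is hypothetical.

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs):
    # Tokenize: lowercase, keep words of two or more characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF (assumed to match the scikit-learn default)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        # L2-normalize each row; leave an all-zero row unchanged
        vectors.append([v / norm for v in row] if norm else row)
    return vocab, vectors

docs = ["This is a sample document.",
        "Another document with different content."]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for row in vectors:
    print([round(v, 4) for v in row])
```

The word "document" occurs in both texts, so its IDF is 1 and it ends up with the smallest nonzero weight in each row; every other word occurs in one text only and gets an IDF of about 1.405.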


Assignment 7: Create a Python-based text preprocessing
program that standardizes input by converting it to lowercase,
removing punctuation, and eliminating stopwords using libraries
such as nltk or re. Validate the program using the input "This is a
sample text. It contains punctuation and stopwords." to produce
the output "sample text contains".

In [None]:
# Assignment 7: Text Preprocessing
import nltk
import re
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords')

# Input text
text = "This is a sample text. It contains punctuation and stopwords."

# Convert to lowercase
text = text.lower()

# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# Tokenize words
words = text.split()

# Remove stopwords (build the set once for fast lookups)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

# Join processed words
processed_text = ' '.join(filtered_words)

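# Prints "sample text contains punctuation stopwords": the words "punctuation"
# and "stopwords" are not stopwords, so they survive the filter, unlike in
# the prompt's expected "sample text contains"
print(processed_text)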

sample text contains punctuation stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
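
The same pipeline can be wrapped in a reusable function. The sketch below is one possible arrangement, not the assignment's required solution: `preprocess` and `BASE_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's much longer English list so the example runs without any downloads. It also shows that reaching the prompt's exact output requires adding "punctuation" and "stopwords" to the stopword set.

```python
import re

# Small illustrative stopword set (an assumption; NLTK's list is far longer)
BASE_STOPWORDS = {"this", "is", "a", "it", "and"}

def preprocess(text, stop_words):
    text = text.lower()                  # Convert to lowercase
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    # Split on whitespace and drop stopwords
    return " ".join(w for w in text.split() if w not in stop_words)

text = "This is a sample text. It contains punctuation and stopwords."

print(preprocess(text, BASE_STOPWORDS))
# sample text contains punctuation stopwords

# Extending the set with two domain words reproduces the prompt's output
custom_stopwords = BASE_STOPWORDS | {"punctuation", "stopwords"}
print(preprocess(text, custom_stopwords))
# sample text contains
```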


Assignment 8: Develop and execute a Python-driven approach
to identify named entities within a given sentence, leveraging a
named entity recognition module (e.g., nltk's ne_chunk), and
validate it using the input "John Smith works at Google." to
produce the output [("John Smith", "PERSON"), ("Google",
"ORGANIZATION")].

In [None]:
# Assignment 8: Named Entity Recognition
import nltk

# Download required resources
nltk.download("punkt")
nltk.download("maxent_ne_chunker")
nltk.download("words")
nltk.download("averaged_perceptron_tagger")
# Newer NLTK releases look up the tokenizer and tagger under these names
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

# Function to extract named entities
def get_named_entities(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    chunked = ne_chunk(pos_tags)

    named_entities = []

    for chunk in chunked:
        if isinstance(chunk, Tree):  # Check if it's a named entity
            entity_name = " ".join(token for token, pos in chunk.leaves())
            entity_type = chunk.label()
            named_entities.append((entity_name, entity_type))

    return named_entities

# Test sentence
sentence = "John Smith works at Google."

# Get named entities
entities = get_named_entities(sentence)

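# Print the output; in this run "John" and "Smith" come back as separate
# PERSON entities rather than the single "John Smith" the assignment expects
print(entities)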

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('John', 'PERSON'), ('Smith', 'PERSON'), ('Google', 'ORGANIZATION')]
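
The output above splits "John Smith" into two PERSON entities, while the assignment expects one. One possible fix is to merge neighbouring entities that share a label. The sketch below is a hedged illustration in plain Python: `merge_adjacent_entities` is a hypothetical helper, and it assumes the chunker output has first been flattened into (text, label) pairs in sentence order, with label None for tokens that are not entities.

```python
def merge_adjacent_entities(items):
    """Join consecutive entities that carry the same label.

    items: (text, label) pairs in sentence order; label is None for
    ordinary tokens (an assumed intermediate format, not an NLTK type).
    """
    merged = []
    prev_label = None
    for text, label in items:
        if label is not None and label == prev_label:
            # Directly follows an entity of the same type: extend it
            name, _ = merged[-1]
            merged[-1] = (f"{name} {text}", label)
        elif label is not None:
            merged.append((text, label))
        prev_label = label  # None breaks the run at any ordinary token
    return merged

# Flattened form of the run above
items = [("John", "PERSON"), ("Smith", "PERSON"), ("works", None),
         ("at", None), ("Google", "ORGANIZATION"), (".", None)]
print(merge_adjacent_entities(items))
# [('John Smith', 'PERSON'), ('Google', 'ORGANIZATION')]
```

This heuristic would also merge two distinct people whose names happen to be adjacent with no token between them, so it suits short sentences like the test input better than free-form text.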
