<a href="https://colab.research.google.com/github/MohanaSrinitha/Mohana_INF05731_Spring2024/blob/main/Shaga_Mohana_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; answer the question above in as much detail as possible):

'''
Bag of Words (BoW): BoW represents a document by how often each word occurs in it, ignoring word order and syntax. It is simple but effective, and is widely used in natural language processing applications such as text classification, sentiment analysis, and information retrieval.
TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF weighs a word's frequency within a document against the number of documents in the corpus that contain it, so a word that is frequent in one document but rare across the corpus is treated as significant for that document.
Part-of-Speech (POS) Tagging: This method extracts the POS tag (the grammatical category) of every word. Specific POS patterns can yield sentiment signals. The code in Question 2 uses NLTK for the tagging.
Sentiment Lexicon Score: This method uses a sentiment lexicon to give each word a sentiment score; the word scores are then combined (summed or averaged) into an overall sentiment score for the document. The code in Question 2 uses TextBlob to compute a polarity score for every sentence in text_data.
N-grams: An n-gram is a contiguous sequence of n items (words, characters, or symbols) taken from a text or speech sample, where n is the number of items in each sequence. Unlike single-word BoW features, n-grams retain some local word order.
'''
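The n-gram definition in the answer above can be made concrete with a few lines of plain Python. This is only an illustrative sketch: the helper name `ngrams` and the sample strings are assumptions made for this example, and it uses no library so that the sliding-window logic stays visible.

```python
def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as a list of tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level bigrams: the sequence elements are whitespace-separated tokens.
sentence = "the film was fantastic"
word_bigrams = ngrams(sentence.split(), 2)

# Character-level trigrams: the sequence elements are single characters.
char_trigrams = ["".join(gram) for gram in ngrams("film", 3)]

print(word_bigrams)
print(char_trigrams)
```

The same helper covers both word and character n-grams because it only assumes that its input can be sliced; a sequence shorter than n simply yields an empty list.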

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [4]:
# pip install nltk scikit-learn textblob

In [22]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob

# Download the NLTK resources: 'punkt' (tokenizer) and 'averaged_perceptron_tagger' (POS tagger).
# 'vader_lexicon' is not needed by the TextBlob-based sentiment scores below.
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')

# Sample data
text_data = [
    "The film was fantastic and really impressive.",
    "The food at this restaurant was very unhygienic.",
    "It was a lovely shopping experience for me.",
    "The customer service provided was rude and unhelpful."
]

# Bag of Words (BoW) feature extraction
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(text_data)
print("Bag of Words (BoW) Features:")
print(bow_features.toarray())

# TF-IDF feature extraction
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(text_data)
print("\nTF-IDF Features:")
print(tfidf_features.toarray())


# Sentiment Lexicon feature extraction
sentiment_scores = []
for sentence in text_data:
    sentiment = TextBlob(sentence).sentiment.polarity
    sentiment_scores.append(sentiment)
print("\nSentiment Lexicon Scores:")
print(sentiment_scores)

# N-grams feature extraction (bigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2))
ngram_features = ngram_vectorizer.fit_transform(text_data)
print("\nBigrams Features:")
print(ngram_features.toarray())

# Part-of-Speech (POS) tagging: tokenize each sentence, then tag every token
pos_tags = [nltk.pos_tag(word_tokenize(text)) for text in text_data]
print("\nPOS Tags:")
print(pos_tags)


Bag of Words (BoW) Features:
[[1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0]
 [0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1]
 [0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 0]]

TF-IDF Features:
[[0.43671931 0.34431452 0.         0.         0.         0.43671931
  0.         0.         0.         0.43671931 0.43671931 0.
  0.         0.         0.         0.         0.34431452 0.
  0.         0.         0.         0.        ]
 [0.         0.         0.38166888 0.         0.         0.
  0.38166888 0.         0.         0.         0.         0.38166888
  0.         0.         0.         0.30091213 0.30091213 0.
  0.38166888 0.38166888 0.30091213 0.        ]
 [0.         0.         0.         0.         0.4472136  0.
  0.         0.4472136  0.4472136  0.         0.         0.
  0.         0.         0.4472136  0.         0.         0.
  0.         0.         0.         0.4472136 ]
 [0.         0.32555709 0.         0.41292788 0.         0.
  0.   

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
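The POS step above yields lists of (word, tag) pairs, which a classifier cannot consume directly. One possible way to turn them into numeric features is to count tags per coarse category. The sketch below is an assumption-laden illustration: the function name `pos_count_features` is hypothetical, and the tagged sentence is hard-coded with plausible Penn Treebank tags rather than taken from an actual NLTK run.

```python
from collections import Counter

# Hypothetical tagger output for one sentence (Penn Treebank style tags).
# These tags are written by hand for illustration, not produced by NLTK here.
tagged = [
    ("The", "DT"), ("film", "NN"), ("was", "VBD"), ("fantastic", "JJ"),
    ("and", "CC"), ("really", "RB"), ("impressive", "JJ"), (".", "."),
]


def pos_count_features(tagged_tokens, prefixes=("NN", "VB", "JJ", "RB")):
    """Count tokens per coarse POS category (noun, verb, adjective, adverb)."""
    counts = Counter()
    for _, tag in tagged_tokens:
        for prefix in prefixes:
            if tag.startswith(prefix):  # e.g. VBD and VBZ both count as VB
                counts[prefix] += 1
                break
    return [counts[prefix] for prefix in prefixes]


features = pos_count_features(tagged)
print(features)  # [1, 1, 2, 1] -> nouns, verbs, adjectives, adverbs
```

Grouping by tag prefix keeps the vector short and fixed-length, so it can be concatenated with the BoW or TF-IDF columns for each document.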


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank the features by importance in descending order.

In [7]:
# Feature ranking with TF-IDF weights (a simple filter-style score)

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data (one string per document; the commas keep the strings separate)
corpus = [
    'This is the document.',
    'This is second document',
    'this is third document',
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Get feature names
feature_names = vectorizer.get_feature_names_out()

# Rank features by their TF-IDF weight in the first document (descending order)
feature_importance = dict(zip(feature_names, X.toarray()[0]))
sorted_features = sorted(feature_importance, key=feature_importance.get, reverse=True)

# Print ranked features
print("Ranked Features based on Importance:")
for idx, feature in enumerate(sorted_features, start=1):
    print(f"{idx}. {feature}: {feature_importance[feature]:.4f}")


Ranked Features based on Importance:
1. the: 0.6990
2. document: 0.4129
3. is: 0.4129
4. this: 0.4129
5. second: 0.0000
6. third: 0.0000
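The ranking above scores terms by TF-IDF weight alone and does not use class labels. As a hedged sketch of a label-aware filter method, the block below ranks terms with the chi-square statistic computed from a 2x2 table of document counts. Everything in it is an assumption for illustration: the four-document corpus, the `pos`/`neg` labels, and the function name `chi_square` are hypothetical and are not the notebook's data or the exact procedure of any particular paper.

```python
def chi_square(docs, labels, term, target):
    """Chi-square score of `term` for class `target` from a 2x2 document-count table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical labeled corpus (made up for this sketch).
raw_docs = [
    "great film great acting",
    "great food lovely place",
    "rude service bad food",
    "bad film rude staff",
]
labels = ["pos", "pos", "neg", "neg"]

# Use term presence per document, so each document is a set of tokens.
docs = [set(text.split()) for text in raw_docs]
vocabulary = sorted(set().union(*docs))

scores = {term: chi_square(docs, labels, term, "pos") for term in vocabulary}

# Sort by score (descending), breaking ties alphabetically.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for rank, (term, score) in enumerate(ranked, start=1):
    print(f"{rank}. {term}: {score:.3f}")
```

Terms that appear in only one class ("great", "bad", "rude") score highest, while terms spread evenly across both classes ("film", "food") score zero, which is the behavior a filter method is meant to expose.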


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [2]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.5.0-py3-none-any.whl (156 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.5.0


In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Sample data
text_data = [
    "The film was fantastic and really impressive.",
    "The food at this restaurant was very unhygienic.",
    "It was a lovely shopping experience for me.",
    "The customer service provided was rude and unhelpful."
]

# Sentence-embedding model (a compact BERT-style transformer loaded via sentence-transformers)
bert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Query
query = "A great movie with impressive acting."

# Compute embeddings for the text data and the query.
# encode() returns NumPy arrays by default, which scikit-learn's cosine_similarity accepts directly.
text_embeddings = bert_model.encode(text_data)
query_embedding = bert_model.encode(query)

# Reshape the query embedding into a 2D array of shape (1, embedding_dim)
query_embedding = query_embedding.reshape(1, -1)

# Calculate cosine similarity between the query and each text
cosine_similarities = cosine_similarity(query_embedding, text_embeddings).flatten()

# Rank the text based on cosine similarity in descending order
ranked_indices = cosine_similarities.argsort()[::-1]

# Print the ranked text
print("Ranked Text Based on Cosine Similarity:")
for i, index in enumerate(ranked_indices):
    print(f"{i + 1}. {text_data[index]} (Cosine Similarity: {cosine_similarities[index]:.4f})")








Ranked Text Based on Cosine Similarity:
1. The film was fantastic and really impressive. (Cosine Similarity: 0.7100)
2. It was a lovely shopping experience for me. (Cosine Similarity: 0.1689)
3. The food at this restaurant was very unhygienic. (Cosine Similarity: -0.0563)
4. The customer service provided was rude and unhelpful. (Cosine Similarity: -0.1406)
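To show what the similarity scores above measure, here is a minimal sketch of cosine similarity written out by hand: the dot product of two vectors divided by the product of their lengths. The three-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are toy assumptions for this example; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [2.0, 0.0, 2.0],    # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],    # orthogonal to the query
    "doc_c": [-1.0, 0.0, -1.0],  # opposite direction
}

# Rank document names by similarity to the query, highest first.
ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
for name in ranking:
    print(f"{name}: {cosine(query, docs[name]):.4f}")
```

Because the score depends only on direction, `doc_a` matches the query despite being twice as long, and a vector pointing the opposite way gets a negative score, which is why negative values can appear in a ranking like the one above.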


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; answer the questions above in as much detail as possible):

'''
Please write your answer here:





'''