<a href="https://colab.research.google.com/github/RST0310/INFO-5731/blob/main/RAYABARAPU_SAITEJA_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# One interesting text classification task related to cricket could be classifying match commentary or articles into different categories such as match summaries, player profiles, match predictions, and analysis pieces. Here are some features that could be useful for building a machine learning model for this task:

# 1. Word Frequency Features: The frequency of specific cricket-related terms or phrases in the text could be indicative of its category. For example, terms like "century," "wicket," "runs," "dismissal," and "boundary" might be more common in match summaries, while terms like "player statistics," "performance analysis," and "strategy" might be more common in analysis pieces.

# 2. N-grams: Analyzing sequences of words (n-grams) could provide valuable context. For instance, phrases like "man of the match," "caught behind," or "run rate" might be indicative of specific types of content.

# 3. Sentiment Analysis: The sentiment expressed in the text could help classify articles as positive (e.g., celebrating a team's victory), negative (e.g., criticizing a player's performance), or neutral. This could be achieved by analyzing the sentiment of individual sentences or using pre-trained sentiment analysis models.

# 4. Named Entity Recognition (NER): Identifying named entities such as player names, team names, tournament names, and locations mentioned in the text could provide clues about the content's category. For example, mentions of players and teams are likely to appear in player profiles or match summaries.

# 5. Part-of-Speech (POS) Tagging: Analyzing the grammatical structure of the text by tagging words with their parts of speech could reveal patterns specific to certain categories. For instance, articles with a high frequency of verbs related to analysis (e.g., "analyze," "evaluate," "assess") might belong to the analysis category.

# 6. Topic Modeling: Identifying underlying topics in the text using techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) could help categorize articles based on the dominant themes present. For instance, topics related to match tactics, player performances, or match outcomes might be indicative of specific categories.

# These features capture different aspects of the text content, allowing the machine learning model to learn patterns associated with each category and improve classification accuracy. By combining multiple types of features, the model can better understand the usage of cricket-related text and accurately classify it into relevant categories.

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [14]:
import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import pos_tag, ne_chunk
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample cricket-related text data
sample_text = """
Virat Kohli scored a century in yesterday's match, leading his team to victory. His exceptional performance was instrumental in the team's success.
In the analysis of today's game, experts evaluated the batting strategies of both teams and highlighted the importance of partnerships in building a competitive score.
Rohit Sharma, known for his aggressive style of play, smashed several boundaries in the match, thrilling the fans with his spectacular innings.
The prediction for tomorrow's game favors the home team, given their strong track record on their home ground and the recent form of their key players.
"""

# Tokenize the text into words
tokens = word_tokenize(sample_text)

# 1. Word Frequency Features
fdist = FreqDist(tokens)
print("Word Frequency Features:")
print(fdist.most_common(10))  # Print top 10 most common words

# 2. N-grams
print("\nN-grams:")
for n in range(2, 4):  # Bi-grams and tri-grams
    n_grams = list(ngrams(tokens, n))
    print(f"{n}-grams:", n_grams)

# 3. Sentiment Analysis
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(sample_text)
print("\nSentiment Score:", sentiment_score)

# 4. Named Entity Recognition (NER)
print("\nNamed Entities:")
entities = ne_chunk(pos_tag(tokens))
print(entities)

# 5. Part-of-Speech (POS) Tagging
print("\nPart-of-Speech Tags:")
pos_tags = pos_tag(tokens)
print(pos_tags)

# 6. Topic Modeling
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform([sample_text])
lda = LatentDirichletAllocation(n_components=1, random_state=42)
lda.fit(X)
print("\nTopics:")
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print("Topic #%d:" % idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-10 - 1:-1]]))


Word Frequency Features:
[('the', 8), (',', 6), ('.', 5), ('of', 5), ('in', 4), ("'s", 4), ('his', 3), ('team', 3), ('their', 3), ('a', 2)]

N-grams:
2-grams: [('Virat', 'Kohli'), ('Kohli', 'scored'), ('scored', 'a'), ('a', 'century'), ('century', 'in'), ('in', 'yesterday'), ('yesterday', "'s"), ("'s", 'match'), ('match', ','), (',', 'leading'), ('leading', 'his'), ('his', 'team'), ('team', 'to'), ('to', 'victory'), ('victory', '.'), ('.', 'His'), ('His', 'exceptional'), ('exceptional', 'performance'), ('performance', 'was'), ('was', 'instrumental'), ('instrumental', 'in'), ('in', 'the'), ('the', 'team'), ('team', "'s"), ("'s", 'success'), ('success', '.'), ('.', 'In'), ('In', 'the'), ('the', 'analysis'), ('analysis', 'of'), ('of', 'today'), ('today', "'s"), ("'s", 'game'), ('game', ','), (',', 'experts'), ('experts', 'evaluated'), ('evaluated', 'the'), ('the', 'batting'), ('batting', 'strategies'), ('strategies', 'of'), ('of', 'both'), ('both', 'teams'), ('teams', 'and'), ('and', 'hig

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [None]:
#Three-phases hybrid feature selection for facial expression recognition
# Machine learning applications are increasingly challenged by the growing volume of data. In this context, selecting relevant features from the vast...

#Ones Sidhom, Haythem Ghazouani, Walid Barhoumi in The Journal of Supercomputing
#Article 13 November 2023

In [21]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import pos_tag, ne_chunk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample cricket-related text data
sample_text = """
Virat Kohli scored a century in yesterday's match, leading his team to victory. His exceptional performance was instrumental in the team's success.
In the analysis of today's game, experts evaluated the batting strategies of both teams and highlighted the importance of partnerships in building a competitive score.
Rohit Sharma, known for his aggressive style of play, smashed several boundaries in the match, thrilling the fans with his spectacular innings.
The prediction for tomorrow's game favors the home team, given their strong track record on their home ground and the recent form of their key players.
"""

# Tokenize the text into words
tokens = word_tokenize(sample_text)

# 1. Word Frequency Features
fdist = FreqDist(tokens)
word_frequency_features = {word: len(word) * fdist[word] for word, _ in fdist.most_common(10)}

# 2. N-grams
n_grams_features = {}
for n in range(2, 4):  # Bi-grams and tri-grams
    n_grams = list(nltk.ngrams(tokens, n))
    for gram in n_grams:
        n_grams_features[' '.join(gram)] = len(gram) * sample_text.count(' '.join(gram))

# 3. Sentiment Analysis
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(sample_text)
sentiment_analysis_features = {key: len(key) * sentiment_score[key] for key in sentiment_score.keys()}

# 4. Named Entity Recognition (NER)
ner_features = {}
entities = ne_chunk(pos_tag(tokens))
for entity in entities:
    if isinstance(entity, nltk.tree.Tree):
        ner_features[' '.join([word for word, _ in entity])] = sum(len(word) for word, _ in entity) * sample_text.count(' '.join([word for word, _ in entity]))

# 5. Part-of-Speech (POS) Tagging
pos_tags_features = {pos: len(pos) * sample_text.count(pos) for _, pos in pos_tag(tokens)}

# 6. Topic Modeling
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform([sample_text])
lda = LatentDirichletAllocation(n_components=1, random_state=42)
lda.fit(X)
feature_names = vectorizer.get_feature_names_out()
topic_modeling_features = {feature_names[i]: len(feature_names[i]) * sample_text.count(feature_names[i]) for i in lda.components_[0].argsort()[:-10 - 1:-1]}

# Combine all features
all_features = {**word_frequency_features, **n_grams_features, **sentiment_analysis_features, **ner_features, **pos_tags_features, **topic_modeling_features}

# Rank features based on their importance (length * occurrence count) in descending order
ranked_features = sorted(all_features.items(), key=lambda x: x[1], reverse=True)

print("Ranked Features:")
for i, (feature, score) in enumerate(ranked_features, start=1):
    print(f"{i}. {feature}: {score}")


Ranked Features:
1. the: 24
2. team: 16
3. their: 15
4. instrumental: 12
5. Rohit Sharma: 11
6. of: 10
7. match: 10
8. importance: 10
9. his: 9
10. in: 8
11. 's: 8
12. home: 8
13. game: 8
14. form: 8
15. compound: 7.5536
16. innings: 7
17. ,: 6
18. .: 5
19. Virat: 5
20. Kohli: 5
21. known: 5
22. in the: 4
23. 's game: 4
24. Virat Kohli scored: 3
25. Kohli scored a: 3
26. scored a century: 3
27. a century in: 3
28. century in yesterday: 3
29. , leading his: 3
30. leading his team: 3
31. his team to: 3
32. team to victory: 3
33. . His exceptional: 3
34. His exceptional performance: 3
35. exceptional performance was: 3
36. performance was instrumental: 3
37. was instrumental in: 3
38. instrumental in the: 3
39. in the team: 3
40. In the analysis: 3
41. the analysis of: 3
42. analysis of today: 3
43. , experts evaluated: 3
44. experts evaluated the: 3
45. evaluated the batting: 3
46. the batting strategies: 3
47. batting strategies of: 3
48. strategies of both: 3
49. of both teams: 3
50. b

## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [9]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Define the query
query = "Cricket match analysis"

# Tokenize and encode the query
query_tokens = tokenizer.tokenize(query)
query_tokens = ['[CLS]'] + query_tokens + ['[SEP]']
query_ids = tokenizer.convert_tokens_to_ids(query_tokens)
query_tensor = torch.tensor([query_ids])

# Generate embeddings for the query
with torch.no_grad():
    outputs = model(query_tensor)
    query_embedding = outputs.pooler_output  # Extract the pooled output layer

# Sample cricket-related text data
sample_texts = [
    "Virat Kohli scored a century in yesterday's match, leading his team to victory.",
    "In the analysis of today's game, experts evaluated the batting strategies of both teams.",
    "Rohit Sharma smashed several boundaries in the match, thrilling the fans with his spectacular innings.",
    "The prediction for tomorrow's game favors the home team, given their strong track record on their home ground."
]

# Calculate embeddings for each text
text_embeddings = []
for text in sample_texts:
    # Tokenize and encode the text
    text_tokens = tokenizer.tokenize(text)
    text_tokens = ['[CLS]'] + text_tokens + ['[SEP]']
    text_ids = tokenizer.convert_tokens_to_ids(text_tokens)
    text_tensor = torch.tensor([text_ids])

    # Generate embeddings for the text
    with torch.no_grad():
        outputs = model(text_tensor)
        text_embedding = outputs.pooler_output  # Extract the pooled output layer

    text_embeddings.append(text_embedding)

# Calculate cosine similarity between query and each text document
similarities = []
for text_embedding in text_embeddings:
    similarity = cosine_similarity(query_embedding, text_embedding)
    similarities.append(similarity.item())

# Rank documents based on similarity scores
ranked_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
ranked_texts = [sample_texts[i] for i in ranked_indices]

# Print ranked documents
print("Ranked Documents:")
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Similarity = {similarities[idx]}, Text = {sample_texts[idx]}")


Ranked Documents:
Rank 1: Similarity = 0.9494487047195435, Text = The prediction for tomorrow's game favors the home team, given their strong track record on their home ground.
Rank 2: Similarity = 0.9393464922904968, Text = Rohit Sharma smashed several boundaries in the match, thrilling the fans with his spectacular innings.
Rank 3: Similarity = 0.9369758367538452, Text = In the analysis of today's game, experts evaluated the batting strategies of both teams.
Rank 4: Similarity = 0.9340389966964722, Text = Virat Kohli scored a century in yesterday's match, leading his team to victory.


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
#Extracting the features from the text was very informative. That was very challenging to get the exact feature from the paper given. Feature extraction in NLP involves transforming raw text data into numerical representations, enabling machine learning algorithms to operate effectively on text-based tasks like sentiment analysis, text classification, and language modeling.