
## The third In-class-exercise (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Sentiment analysis, which entails detecting whether a particular piece of text reflects a positive or negative sentiment, is an intriguing task in text classification.

Features that might be useful for sentiment analysis are:
 1. Emoticons: Since emoticons commonly convey emotional meaning, they can serve as direct signals of the emotion expressed in a text. Certain emoticons indicate either positive or negative emotion. For instance, a smiley conveys something positive, while a sad face suggests something negative.
 2. Part-of-speech tags: Each word's part-of-speech (POS) tag can be used as a feature for sentiment analysis. This captures some of the grammatical structure of the text, which can be useful information for sentiment analysis. For instance, a sentence with many adjectives is more likely to be expressing an opinion, because adjectives often carry the evaluative words.
 3. N-grams: N-grams can be used in addition to bag-of-words to help the model capture some of the context surrounding each word in the text. A bigram feature, for instance, can detect the co-occurrence of two words in the text, such as "excellent movie," which may imply a favorable attitude.
 4. Lexicons: Many lexicons assign sentiment scores to words or phrases, and these scores can be used as features to capture the overall sentiment of a text. For instance, a high frequency of words with high sentiment ratings may indicate that the content is generally positive.
 5. Bag-of-words: For text classification tasks, the bag-of-words feature representation is frequently used. It represents a text as a collection of its words without taking into account their grammatical structure or order. The presence or absence of each word in the text can be used as a feature. This method can help identify lexical and semantic cues that point to positive or negative sentiment.
 

'''
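
To make the bag-of-words and n-gram features from the answer above concrete, here is a minimal sketch using only the standard library. The helper names `tokenize` and `bow_and_bigrams` and the sample sentence are made up for illustration; a real pipeline would likely use a proper tokenizer.

```python
from collections import Counter

PUNCT = '.,!?;:"()'

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (word.strip(PUNCT) for word in text.lower().split())
    return [word for word in stripped if word]

def bow_and_bigrams(text):
    """Return (unigram counts, bigram counts) for a single text."""
    tokens = tokenize(text)
    unigrams = Counter(tokens)                   # bag-of-words: word order is discarded
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs keep local context
    return unigrams, bigrams

# Hypothetical sample sentence, not taken from the exercise data
unigrams, bigrams = bow_and_bigrams("An excellent movie, truly an excellent cast.")
print(unigrams["excellent"])             # how often the word occurs
print(bigrams[("excellent", "movie")])   # how often the pair occurs in this order
```

The bigram counter distinguishes ("excellent", "movie") from ("movie", "excellent"), which is the ordering information a plain bag-of-words drops.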

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction. 

In [None]:
# Your code here (Please add comments in the code):
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.util import mark_negation
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('stopwords')
nltk.download('opinion_lexicon')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

#Example text data
text_data = [
    "This Special Issue of the IEEE Transactions on Plasma Science (TPS) follows the first American Physical Society Division of Plasma Physics (APS-DPP) mini-conference on Machine Learning, Data Science, and Artificial Intelligence in Plasma Research held during the 60th APS-DPP Meeting in Portland, OR, USA (November 5–9, 2018).",
    "It contains selected highlights from not only the mini-conference but also the broader plasma physics community",
    "Although data science has a long and rich history in plasma physics, dating back at least three decades, it is experiencing a renaissance, thanks in large part to the advances outside of plasma physics.",
    "Emerging data-driven methods could have a transformative effect across the full spectrum of plasma research.",
    "The DPP mini-conference and the articles herein represent only a tiny cross section of contemporary research on data-driven plasma science.",
    "Furthermore, Plasma Science is not unique in its exploration of Scientific Machine Learning: the Second Workshop on Machine Learning and the Physical Sciences (NeurIPS 2019, Vancouver, BC, Canada, December 2019) and it illustrates a trend in cross disciplinary collaboration with contributions from plasma research."
]

# Defining stop words of English and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# bag-of-words feature definition
def bag_of_words(text):
    tokens = word_tokenize(text)
    words = [lemmatizer.lemmatize(token.lower()) for token in tokens if token.isalpha() and token.lower() not in stop_words]
    return words

#ngrams feature definition
def ngrams(text):
    tokens = word_tokenize(text)
    words = [lemmatizer.lemmatize(token.lower()) for token in tokens if token.isalpha() and token.lower() not in stop_words]
    bigrams = list(nltk.ngrams(words, 2))
    trigrams = list(nltk.ngrams(words, 3))
    return bigrams + trigrams

#parts of speech tagging feature definition
def pos_tags(text):
    tokens = word_tokenize(text)
    # Tag every token, then keep only the tags of alphabetic, non-stop-word tokens
    tagged = nltk.pos_tag(tokens)
    tags = [tag for word, tag in tagged if word.isalpha() and word.lower() not in stop_words]
    return tags

#Sentiment lexicons feature definition
def sentiment_lexicons(text):
    # Mark negation on the full token stream: removing stop words first would drop
    # "not"/"no", and mark_negation appends "_NEG" to each token in a negated scope
    tokens = mark_negation([token.lower() for token in word_tokenize(text)])
    sentiment_score = 0
    for token in tokens:
        negated = token.endswith('_NEG')
        base = token[:-4] if negated else token
        if base in stop_words:
            continue
        word = lemmatizer.lemmatize(base)
        # A negated positive word counts as negative, and vice versa
        if word in positive_lexicon:
            sentiment_score += -1 if negated else 1
        elif word in negative_lexicon:
            sentiment_score += 1 if negated else -1
    return sentiment_score

#Punctuation feature definition
def punctuation(text):
    exclamation_marks = text.count('!')
    question_marks = text.count('?')
    emoticons = len(re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text))
    hashtags = len(re.findall(r'#\w+', text))
    return [exclamation_marks, question_marks, emoticons, hashtags]

# Loading sentiment lexicons
positive_lexicon = set(nltk.corpus.opinion_lexicon.positive())
negative_lexicon = set(nltk.corpus.opinion_lexicon.negative())

# Extracting bag-of-words from text
bow_vectorizer = CountVectorizer(analyzer=bag_of_words)
bow_features = bow_vectorizer.fit_transform(text_data)

# Extracting ngrams features from text
ngrams_vectorizer = CountVectorizer(analyzer=ngrams)
ngrams_features = ngrams_vectorizer.fit_transform(text_data)

#Extracting parts of speech features from text
pos_vectorizer = CountVectorizer(analyzer=pos_tags)
pos_features = pos_vectorizer.fit_transform(text_data)

sentiment_lexicons_features = [sentiment_lexicons(text) for text in text_data]

#Extracting punctuation feature from text
punctuation_features = [punctuation(text) for text in text_data]
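
The `punctuation` function above counts all emoticons together, while the answer to Question 1 treats happy and sad emoticons as opposite signals. A small sketch of splitting the count by polarity follows. The two patterns are deliberately tiny, hypothetical examples rather than a complete emoticon inventory, and `emoticon_polarity` is an illustrative name.

```python
import re

# Hypothetical, non-exhaustive patterns: eyes, optional nose, then a mouth
POSITIVE_EMOTICON = re.compile(r'[:;=]-?[)D]')   # e.g. :)  ;-)  =D
NEGATIVE_EMOTICON = re.compile(r'[:;=]-?\(')     # e.g. :(  :-(

def emoticon_polarity(text):
    """Return [positive_count, negative_count] of emoticons found in text."""
    positive = len(POSITIVE_EMOTICON.findall(text))
    negative = len(NEGATIVE_EMOTICON.findall(text))
    return [positive, negative]

print(emoticon_polarity("Loved it :) :D but the ending :("))
```

The two counts could replace the single `emoticons` entry in the punctuation feature vector, at the cost of one extra column.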



Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above, rank the features based on their importance in the descending order. 

In [None]:
# Your code here (Please add comments in the code):

#importing required libraries
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import numpy as np

# Combine the bag-of-words, n-gram, part-of-speech, sentiment lexicon, and punctuation features into a single matrix (one row per text)
all_features = np.hstack([
    bow_features.toarray(),
    ngrams_features.toarray(),
    pos_features.toarray(),
    np.array(sentiment_lexicons_features).reshape(-1, 1),
    np.array(punctuation_features),
])

#Printing all text features
print(all_features)

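# Select the top 5 features by mutual information. The features are counts, so they
# are scored as discrete variables, which also makes the scores reproducible.
# The labels are hard-coded example classes, one per text in text_data.
labels = [1, 1, 0, 1, 0, 2]
score_func = lambda X, y: mutual_info_classif(X, y, discrete_features=True, random_state=0)
k_best = SelectKBest(score_func, k=5)
k_best.fit(all_features, labels)
top_features = [i for i, score in sorted(enumerate(k_best.scores_), key=lambda x: x[1], reverse=True)][:5]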


# Build the feature names once, in the same column order used to assemble all_features
feature_names = list(bow_vectorizer.get_feature_names_out()) + list(ngrams_vectorizer.get_feature_names_out()) + list(pos_vectorizer.get_feature_names_out()) + ['sentiment_score', 'exclamation_marks', 'question_marks', 'emoticons', 'hashtags']

# Top 5 features with their MI scores
print("Top 5 features:")
for feature_index in top_features:
    print(feature_names[feature_index], k_best.scores_[feature_index])


Top 5 features based on Mutual Information:
service 0.7833333333333331
sentiment_score 0.7833333333333331
('good', 'feature') 0.6833333333333331
exclamation_marks 0.6166666666666664
('acting', 'fantastic') 0.44999999999999973
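
Mutual information measures how much knowing a feature's value reduces uncertainty about the class label. A minimal sketch of the plug-in estimate for discrete variables follows. The helper `mutual_information` and the toy feature and label lists are made up for illustration, and scikit-learn's `mutual_info_classif` may use a different estimator (a nearest-neighbor one for features it treats as continuous), so its numbers need not match this calculation.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))   # counts of each (x, y) pair
    count_x = Counter(feature)              # marginal counts of feature values
    count_y = Counter(labels)               # marginal counts of labels
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)), written with raw counts
        ratio = count * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log(ratio)
    return mi

# Toy data: one feature that mirrors the label and one that is independent of it
labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

A feature that determines a balanced binary label reaches the maximum of ln 2 nats, while an independent feature scores zero, which is why ranking by this score pushes label-correlated features to the top.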


Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.
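
The ranking step itself does not depend on BERT: once the query and each text are vectors, the texts are ordered by cosine similarity to the query. The sketch below shows that step with made-up two-dimensional vectors standing in for embeddings; `cosine` and `rank_by_similarity` are illustrative helper names, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (score, document index) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)

# Toy vectors (assumed values, not real embeddings)
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

ranking = rank_by_similarity(query_vec, doc_vecs)
for score, idx in ranking:
    print(f"doc {idx}: {score:.4f}")
```

Because cosine similarity ignores vector length, the document `[2.0, 0.0]` matches the query direction exactly even though its magnitude differs.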

In [2]:
!pip install -U sentence-transformers
# Importing required libraries
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Example text data
text_data = [
    "This Special Issue of the IEEE Transactions on Plasma Science (TPS) follows the first American Physical Society Division of Plasma Physics (APS-DPP) mini-conference on Machine Learning, Data Science, and Artificial Intelligence in Plasma Research held during the 60th APS-DPP Meeting in Portland, OR, USA (November 5–9, 2018).",
    "It contains selected highlights from not only the mini-conference but also the broader plasma physics community",
    "Although data science has a long and rich history in plasma physics, dating back at least three decades, it is experiencing a renaissance, thanks in large part to the advances outside of plasma physics.",
    "Emerging data-driven methods could have a transformative effect across the full spectrum of plasma research.",
    "The DPP mini-conference and the articles herein represent only a tiny cross section of contemporary research on data-driven plasma science.",
    "Furthermore, Plasma Science is not unique in its exploration of Scientific Machine Learning: the Second Workshop on Machine Learning and the Physical Sciences (NeurIPS 2019, Vancouver, BC, Canada, December 2019) and it illustrates a trend in cross disciplinary collaboration with contributions from plasma research."
]
# Defining query
query = "Exploring plasms physics through Data driven approach."

# Loading BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Embedding query and text data. encode() returns NumPy arrays by default, which
# scikit-learn accepts directly (tensors on a GPU would first need to be moved to the CPU)
query_embedding = model.encode([query])
text_embeddings = model.encode(text_data)

# Here we are calculating cosine similarity between query and text data
similarity_scores = cosine_similarity(query_embedding, text_embeddings)

#Ranking input text data based on similarity scores
ranked_text_data = sorted(zip(similarity_scores.squeeze().tolist(), text_data), reverse=True)

# Finally Printing ranked text data
print("Ranked text data based on similarity to query:")
for score, text in ranked_text_data:
    print(f"Similarity score: {score:.4f}\tText: {text}")


Ranked text data based on similarity to query:
Similarity score: 0.6244	Text: Emerging data-driven methods could have a transformative effect across the full spectrum of plasma research.
Similarity score: 0.5856	Text: It contains selected highlights from not only the mini-conference but also the broader plasma physics community
Similarity score: 0.5196	Text: Furthermore, Plasma Science is not unique in its exploration of Scientific Machine Learning: the Second Workshop on Machine Learning and the Physical Sciences (NeurIPS 2019, Vancouver, BC, Canada, December 2019) and it illustrates a trend in cross disciplinary collaboration with contributions from plasma research.
Similarity score: 0.5188	Text: The DPP mini-conference and the articles herein represent only a tiny cross section of contemporary research on data-driven plasma science.
Similarity score: 0.4563	Text: This Special Issue of 