## The third In-class-exercise (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [2]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Text classification or text mining task: Classifying text as either positive or negative sentiment.

Features:
1) Word Count: The length of the text is at most a weak, dataset-dependent signal of sentiment. Length does not indicate polarity on its own, so it is useful mainly in combination with the other features.
2) Sentiment Lexicon: Counting words in the text that are associated with positive or negative sentiment, such as happy, sad, good, and bad.
3) Content Analysis: Analyzing the context of the text, such as the overall topic, writing style, and use of language, because the same word can carry different sentiment in different contexts.
4) Emotion Detection: Identifying emotions such as joy, anger, and surprise from textual cues such as emotion words, emoticons, exclamation marks, and capitalization.
5) Readability Analysis: Measuring the complexity of the language used (for example, average sentence length and average word length), which may differ between emotional and neutral writing.
6) Grammar Analysis: Identifying grammatical and syntactic patterns, such as negation ("not good") and intensifiers ("very bad"), that flip or strengthen the sentiment of nearby words.

These features could be helpful in building a machine learning model because each one captures a different cue about the sentiment of the text.
The sentiment lexicon captures the explicit sentiment of the text, while content analysis and emotion detection capture more implicit cues.




'''


Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [2]:
# Your code here (please add comments in the code):

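import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize and sent_tokenize
nltk.download('stopwords', quiet=True)  # stop-word list used in the next cell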



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Supraja\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
from collections import Counter
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer

# Define the sample text data
text_data = """
The quick brown fox jumps over the lazy dog. The dog barks at the fox but the fox just runs away.
"""

# Tokenize the text data into words and sentences
words = word_tokenize(text_data)
sentences = sent_tokenize(text_data)

# Remove stop words and punctuation from the words list
stop_words = set(stopwords.words('english'))
filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word not in punctuation]

# Count the frequency of each word in the filtered_words list
word_counts = Counter(filtered_words)

# Calculate the sentiment polarity and subjectivity of the text data using TextBlob
blob = TextBlob(text_data)
polarity = blob.sentiment.polarity
subjectivity = blob.sentiment.subjectivity

# Generate a bag-of-words representation of the text data using CountVectorizer
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(sentences)

# Print the extracted text features
print('Words:', words)
print('Sentences:', sentences)
print('Filtered words:', filtered_words)
print('Word counts:', word_counts)
print('Sentiment polarity:', polarity)
print('Sentiment subjectivity:', subjectivity)
print('Bag-of-words:', bag_of_words.toarray())


Words: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'The', 'dog', 'barks', 'at', 'the', 'fox', 'but', 'the', 'fox', 'just', 'runs', 'away', '.']
Sentences: ['\nThe quick brown fox jumps over the lazy dog.', 'The dog barks at the fox but the fox just runs away.']
Filtered words: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'dog', 'barks', 'fox', 'fox', 'runs', 'away']
Word counts: Counter({'fox': 3, 'dog': 2, 'quick': 1, 'brown': 1, 'jumps': 1, 'lazy': 1, 'barks': 1, 'runs': 1, 'away': 1})
Sentiment polarity: 0.04166666666666666
Sentiment subjectivity: 0.75
Bag-of-words: [[0 0 0 1 0 1 1 1 0 1 1 1 0 2]
 [1 1 1 0 1 1 2 0 1 0 0 0 1 3]]
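
Several of the features listed in Question 1 (word count, sentiment lexicon hits, emotion cues, readability) are not computed by the cell above. The following is a minimal sketch of how they might be extracted with the standard library only. The two word sets are tiny hypothetical lexicons made up for illustration, the helper name `extract_features` is an assumption, and the sample sentence is invented rather than taken from the exercise data.

```python
import re

# Tiny illustrative lexicons (assumptions for this sketch, not a published resource)
POSITIVE_WORDS = {"good", "great", "happy", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "sad", "terrible", "hate", "poor"}


def extract_features(text):
    """Return a dict of simple surface features for one text."""
    # Split on sentence-ending punctuation and drop empty fragments
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept for contractions)
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    return {
        "word_count": n_words,
        "positive_hits": sum(w in POSITIVE_WORDS for w in words),
        "negative_hits": sum(w in NEGATIVE_WORDS for w in words),
        "exclamation_marks": text.count("!"),
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


# Invented sample sentence, used only to exercise the function
features = extract_features("The food was great and I love the service! The wait was bad.")
print(features)
```

Each value in the returned dictionary is a plain number, so one dictionary per document can be turned directly into a row of a feature matrix.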


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper: "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above and rank them by importance in descending order.

In [1]:
# Your code here (please add comments in the code):

'''
Feature selection methods:
1. Information Gain and Conditional Mutual Information: Information gain measures how much knowing a feature reduces the uncertainty about the target class. Conditional mutual information additionally accounts for the dependence between features, so that redundant features are not favored.
2. Term Frequency-Inverse Document Frequency: This weight grows with the frequency of a term in a given document and shrinks with the number of documents that contain the term, so terms that are distinctive and rare across the collection receive more importance.
3. Chi-Square Test: This is an independence test that quantifies the strength of association between a feature and the target class.
4. Correlation-based Feature Selection: This method measures the correlation between features and the target class and selects those that are most strongly correlated with the class while being weakly correlated with each other.
5. ReliefF: This method weights each feature by how well its values separate an instance from its nearest neighbors of other classes compared with its nearest neighbors of the same class.
6. Genetic Algorithm: This method applies evolutionary search to the feature selection problem, using a fitness value to evaluate candidate feature subsets.
7. Wrapper Method: This method evaluates the performance of a subset of features on a learning algorithm and selects the subset that yields the best performance.

Ranking of Features Based on Their Importance:
1. Information Gain and Conditional Mutual Information
2. Term Frequency-Inverse Document Frequency
3. Chi-Square Test
4. Correlation-based Feature Selection
5. ReliefF
6. Genetic Algorithm
7. Wrapper Method
'''
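
To rank individual features rather than methods, one of the filters named above can be applied directly. The following is a minimal sketch of the chi-square test for text classification, written without external libraries. The four documents and their `pos`/`neg` labels are invented for illustration (the exercise data has no labels), and the helper name `chi_square` is an assumption.

```python
# Toy labeled corpus; texts and labels are made up for illustration
docs = [
    ("great movie with a happy ending", "pos"),
    ("good acting and a great story", "pos"),
    ("bad plot and a sad ending", "neg"),
    ("terrible acting and a bad story", "neg"),
]


def chi_square(term, target_class, docs):
    """Chi-square statistic of the 2x2 table (term present/absent x class/other)."""
    a = b = c = d = 0
    for text, label in docs:
        present = term in text.split()
        in_class = label == target_class
        if present and in_class:
            a += 1  # term present, document in the target class
        elif present:
            b += 1  # term present, document in another class
        elif in_class:
            c += 1  # term absent, document in the target class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero denominator means the term (or class) never varies, so it carries no signal
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Score every vocabulary term against the "pos" class and rank in descending order
vocabulary = sorted({word for text, _ in docs for word in text.split()})
scores = {term: chi_square(term, "pos", docs) for term in vocabulary}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for term, score in ranked:
    print(f"{term}: {score:.2f}")
```

With only two classes the statistic is symmetric: a high score means the term is strongly associated with one of the classes, not necessarily the positive one. Terms that occur in every document, such as "a" here, score zero.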





Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [22]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ---------------------------------------- 6.3/6.3 MB 15.0 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0-cp310-cp310-win_amd64.whl (151 kB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.0-py3-none-any.whl (199 kB)
     ------------------------------------- 199.1/199.1 kB 11.8 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp310-cp310-win_amd64.whl (3.3 MB)
     ---------------------------------------- 3.3/3.3 MB 23.5 MB/s eta 0:00:00
Installing collected packages: tokenizers, pyyaml, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.0 pyyaml-6.0 tokenizers-0.13.2 transformers-4.26.1


In [2]:
!pip install torch

Collecting torch
  Downloading torch-1.13.1-cp310-cp310-win_amd64.whl (162.6 MB)
     -------------------------------------- 162.6/162.6 MB 8.7 MB/s eta 0:00:00
Installing collected packages: torch
Successfully installed torch-1.13.1


In [6]:
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116


Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp310-cp310-win_amd64.whl (4.8 MB)
     ---------------------------------------- 4.8/4.8 MB 9.3 MB/s eta 0:00:00
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/cu116/torchaudio-0.13.1%2Bcu116-cp310-cp310-win_amd64.whl (2.3 MB)
     ---------------------------------------- 2.3/2.3 MB 20.7 MB/s eta 0:00:00
Installing collected packages: torchvision, torchaudio
Successfully installed torchaudio-0.13.1+cu116 torchvision-0.14.1+cu116


In [1]:
from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load the pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Define the text data to be ranked
text_data = [
    "The quick brown fox jumps over the lazy dog",
    "The lazy dog is quick to bark but slow to bite",
    "A brown dog and a white dog are playing in the park"
]

# Define the query for matching relevant documents
query = "Brown dog playing in the park"

# Encode the query and text data using the BERT tokenizer
encoded_query = tokenizer.encode(query, add_special_tokens=True, return_tensors='pt')
encoded_text_data = [tokenizer.encode(text, add_special_tokens=True, return_tensors='pt') for text in text_data]

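# Use the BERT model to obtain the embeddings for the query and text data.
# model(...)[0] is the last hidden state with shape (batch, tokens, hidden);
# [:, 0, :] keeps the vector of the first token ([CLS]) as the text representation.
with torch.no_grad():
    query_embedding = model(encoded_query)[0][:, 0, :]
    text_data_embeddings = [model(encoded_text)[0][:, 0, :] for encoded_text in encoded_text_data]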

# Calculate the cosine similarity between the query and text data embeddings
similarity_scores = [cosine_similarity(query_embedding, text_embedding)[0][0] for text_embedding in text_data_embeddings]

# Rank the text data based on the similarity scores in descending order
ranked_text_data = [text_data[i] for i in sorted(range(len(similarity_scores)), key=lambda k: similarity_scores[k], reverse=True)]

# Print the ranked text data
print(ranked_text_data)


Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exact

['A brown dog and a white dog are playing in the park', 'The quick brown fox jumps over the lazy dog', 'The lazy dog is quick to bark but slow to bite']
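
The printed ranking above shows the texts but not their scores. The following is a minimal sketch of the cosine similarity computation and of a ranking that keeps each score next to its document. The three-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are hypothetical stand-ins chosen for readability; they are not BERT outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), or 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical 3-dimensional stand-ins for sentence embeddings (not BERT outputs)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [1.0, 1.0, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

# Pair each document with its score, then sort by score in descending order
ranking = sorted(
    ((name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

Keeping the score in the output makes it possible to see how close the candidates are, which a bare list of ranked texts hides.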
