# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:


Text Classification Task: Multi-Class Sentiment Analysis of Book Reviews
The goal is to classify each book review into one of five sentiment levels (very positive, positive, neutral, negative, very negative) using the text of the review. This task is more complex than sorting reviews into two buckets, and it may be useful to both readers and publishers.
Five types of features would be useful for building a machine learning model for this task:

Lexical Features:

Bag-of-words representation
TF-IDF (Term Frequency-Inverse Document Frequency) weights of the terms
Unigrams, bigrams, and trigrams

Why helpful: These features capture the words and phrases that appear in the reviews. Certain words or phrases reliably signal a positive or negative opinion of a book, for example 'Great read', 'Couldn't put it down', 'Disappointing', or 'Boring'.

Syntactic Features:

Distribution of part-of-speech (POS) tags
Dependency parse structures
Measures of grammatical complexity, such as mean number of words per sentence and number of clauses per thousand words

Why helpful: Reviewers may express positive or negative opinions through different syntactic patterns. For example, positive reviews may contain more adjectives, while negative reviews may use more complex sentences because reviewers tend to explain what went wrong.

Semantic Features:

Named entities, such as book titles and author names
Topics from Latent Dirichlet Allocation (LDA)
Word embeddings (Word2Vec, GloVe, or BERT vectors)

Why helpful: These features carry higher-level semantics. They help identify the themes or concepts associated with each sentiment level. For instance, mentions of "plot twists" or "character development" might indicate a positive review.

Sentiment-specific Features:

Scores from sentiment lexicons and tools such as VADER or TextBlob
Emotion scores for specific emotions such as joy, anger, and sadness
Subjectivity scores

Why helpful: These features target the sentiment aspect of the task directly. They give a good starting signal for sentiment and can capture the more nuanced feelings that contribute to the overall sentiment of a review.

Stylometric Features:

Readability measures such as the Flesch-Kincaid grade level
Use of figures of speech (metaphor, simile)
Punctuation usage (for example, exclamation marks and question marks)

Why helpful: Writing style can reveal a lot about the reviewer's opinion. More positive reviews may contain more exclamation marks or emphatic expressions, while more negative reviews may contain more qualifying terms or question words.

Taken together, these feature types capture distinctions in the feature space that give a basis for separating the different sentiment levels in book reviews. Combining all of them would probably classify better than any single type, because they address both the content and the style of the reviews, and both convey sentiment.
'''

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
# Your code here (Please add comments in the code):
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
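# Install textstat beforehand in its own notebook cell (%pip install textstat);
# a bare `pip install` line inside Python code is a SyntaxError.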
import textstat
from textstat import flesch_reading_ease
import spacy
#downloading data from nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
#loading spaCy
nlp = spacy.load("en_core_web_sm")
#loading dataset
df = pd.read_csv('products.csv')
if 'review_body' not in df.columns:
    raise ValueError("The 'review_body' column is missing from the CSV file.")
#method to extract lexical features
def extract_lexical_features(text):
    bow_vectorizer = CountVectorizer(max_features=100)
    bow_features = bow_vectorizer.fit_transform(text)
    tfidf_vectorizer = TfidfVectorizer(max_features=100)
    tfidf_features = tfidf_vectorizer.fit_transform(text)
    ngram_vectorizer = CountVectorizer(ngram_range=(2,2), max_features=100)
    ngram_features = ngram_vectorizer.fit_transform(text)
    return bow_features, tfidf_features, ngram_features
#method for syntactic features
def extract_syntactic_features(text):
    features = []
    for review in text:
        tokens = word_tokenize(str(review))
        pos_tags = pos_tag(tokens)
        pos_dist = {tag: count/len(pos_tags) for tag, count in nltk.FreqDist(tag for (word, tag) in pos_tags).items()}
        sentences = sent_tokenize(str(review))
        avg_sentence_length = np.mean([len(word_tokenize(sent)) for sent in sentences])
        features.append({
            'pos_dist': pos_dist,
            'avg_sentence_length': avg_sentence_length
        })
    return features
#method for semantic features
def extract_semantic_features(text):
    features = []
    for review in text:
        doc = nlp(str(review))
        entities = [ent.label_ for ent in doc.ents]
        entity_counts = {ent: entities.count(ent) for ent in set(entities)}
        word_vectors = [token.vector for token in doc if not token.is_stop and not token.is_punct]
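        # fall back to zeros matching the model's vector width instead of assuming 300 dimensions
        avg_word_vector = np.mean(word_vectors, axis=0) if word_vectors else np.zeros_like(doc.vector)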
        features.append({
            'entity_counts': entity_counts,
            'avg_word_vector': avg_word_vector
        })
    return features
#method for sentiment features
def extract_sentiment_features(text):
    sia = SentimentIntensityAnalyzer()
    features = []
    for review in text:
        sentiment_scores = sia.polarity_scores(str(review))
        features.append(sentiment_scores)
    return features
#method for stylometric features
def extract_stylometric_features(text):
    features = []
    for review in text:
        readability = flesch_reading_ease(str(review))
        exclamation_count = str(review).count('!')
        question_count = str(review).count('?')
        features.append({
            'readability': readability,
            'exclamation_count': exclamation_count,
            'question_count': question_count
        })
    return features
bow_features, tfidf_features, ngram_features = extract_lexical_features(df['review_body'])
syntactic_features = extract_syntactic_features(df['review_body'])
semantic_features = extract_semantic_features(df['review_body'])
sentiment_features = extract_sentiment_features(df['review_body'])
stylometric_features = extract_stylometric_features(df['review_body'])
# printing a sample of extracted features
print("Dataset shape:", df.shape)
print("Bag of Words (shape):", bow_features.shape)
print("TF-IDF (shape):", tfidf_features.shape)
print("N-grams (shape):", ngram_features.shape)
print("\nSyntactic Features (first review):", syntactic_features[0])
print("\nSemantic Features (first review entity counts):", semantic_features[0]['entity_counts'])
print("\nSentiment Features (first review):", sentiment_features[0])
print("\nStylometric Features (first review):", stylometric_features[0])
#saving features
pd.DataFrame(bow_features.toarray()).to_csv('bow_features.csv', index=False)
pd.DataFrame(tfidf_features.toarray()).to_csv('tfidf_features.csv', index=False)
pd.DataFrame(ngram_features.toarray()).to_csv('ngram_features.csv', index=False)
pd.DataFrame(syntactic_features).to_csv('syntactic_features.csv', index=False)
pd.DataFrame(semantic_features).to_csv('semantic_features.csv', index=False)
pd.DataFrame(sentiment_features).to_csv('sentiment_features.csv', index=False)
pd.DataFrame(stylometric_features).to_csv('stylometric_features.csv', index=False)
print("\nFeatures saved to CSV files")


SyntaxError: invalid syntax (<ipython-input-16-229b958e0a02>, line 9)

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder
#loading the dataset
df = pd.read_csv('products.csv')
#creating the labels and logic for the labels
positive_words = ['great', 'interesting', 'love', 'amazing', 'fantastic', 'hysterical', 'suspenseful']
negative_words = ['boring', 'devastating', 'disappointing']
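def label_review(review):
    text = str(review).lower()  # str() guards against missing (NaN) reviews
    if any(word in text for word in positive_words):
        return 1  # positive: at least one positive keyword appears
    elif any(word in text for word in negative_words):
        return 0  # negative: at least one negative keyword appears
    return 0  # no keyword matched; falls back to the negative label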
#applying label function
df['label'] = df['review_body'].apply(label_review)
#using vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['review_body'])
#initializing target variable
y = df['label']
#chi square
k = 10
chi2_selector = SelectKBest(chi2, k=k)
chi2_selector.fit(X, y)
feature_names = vectorizer.get_feature_names_out()
scores = chi2_selector.scores_
feature_scores = pd.DataFrame({'Feature': feature_names, 'Score': scores})
feature_scores = feature_scores.sort_values(by='Score', ascending=False)
#output
print("\nTop Features based on the score of the chi square:")
print(feature_scores.head(k))


Top Features based on the score of the chi square:
          Feature     Score
591         great  3.661709
831          love  3.335471
1215       series  2.674140
832         loved  2.262435
714   interesting  1.606310
424       enjoyed  1.453337
230         clean  0.984401
373          does  0.889400
51        amazing  0.839853
614      happened  0.831453


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
# Your code here (Please add comments in the code):
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
# loading the dataset
df = pd.read_csv('products.csv')
if 'review_body' not in df.columns:
    raise ValueError("The 'review_body' column is missing from the CSV file.")
# loading Bert
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
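def get_bert_embedding(text):
    # tokenize one text, truncating to BERT's 512-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs)
    # mean-pool the last hidden states over the tokens to get one 768-dimensional vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()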
print("Generating BERT embeddings for reviews...")
# str() guards against missing (NaN) reviews, which the tokenizer cannot process
review_embeddings = np.array([get_bert_embedding(str(review)) for review in df['review_body']])
# Initializing a query
query = "High quality product with excellent customer service"
print(f"Query: '{query}'")
query_embedding = get_bert_embedding(query)
similarities = cosine_similarity([query_embedding], review_embeddings)[0]
df['similarity'] = similarities
# sorting in descending order
df_sorted = df.sort_values('similarity', ascending=False).reset_index(drop=True)
print("\nTop 10 most similar reviews:")
for i in range(min(10, len(df_sorted))):  # guard against datasets with fewer than 10 reviews
    print(f"{i+1}. Similarity: {df_sorted['similarity'][i]:.4f}")
    print(f"   Review: {df_sorted['review_body'][i][:100]}...")  # Print first 100 characters
    print()
plt.figure(figsize=(10, 6))
plt.hist(similarities, bins=50)
plt.title('Distribution of Similarity Scores')
plt.xlabel('Cosine Similarity')
plt.ylabel('Frequency')
plt.savefig('similarity_distribution.png')
plt.close()
# saving results into csv file
df_sorted.to_csv('ranked_reviews.csv', index=False)
print("results are saved to 'ranked_reviews.csv'")
print("similarity distribution plot saved to 'similarity_distribution.png'")

Generating BERT embeddings for reviews...
Query: 'High quality product with excellent customer service'

Top 10 most similar reviews:
1. Similarity: 0.6477
   Review: "This is a great series and I recommend it for everyone if you like fantasy. Great character develop...

2. Similarity: 0.6347
   Review: "This work wasn't as good as the first. Loss of not just one good character but two was excessive in...

3. Similarity: 0.6211
   Review: "Short read, lovely, romantic with lots of secrets and drama and a satisfying ending. Highly recomme...

4. Similarity: 0.6173
   Review: "This work should not be put in the category that it is put into. It is NOT CLEAN! And I wish there ...

5. Similarity: 0.6132
   Review: "This is one of the best book I have read for some time. The amazing story pulls you in deeper and r...

6. Similarity: 0.6091
   Review: "Puts peoples experiences in words that we can understand.<br />Excellent reading for those who are ...

7. Similarity: 0.6059
   Review: "This
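
The code above embeds one review at a time, so no padding tokens enter the mean. If the reviews were embedded in batches instead, padded positions would need to be excluded from the average. The sketch below illustrates that masked mean pooling and the cosine ranking step with plain NumPy on toy arrays. The function names `masked_mean_pool` and `rank_by_cosine` and all the numbers are hypothetical, and no BERT model is loaded.

```python
import numpy as np


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices and cosine similarities, most similar first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)  # negate to sort in descending order
    return order, sims[order]


# Toy "token embeddings" for two sequences of length 3 with hidden size 2
tokens = np.array([
    [[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]],  # third position is padding
    [[0.0, 1.0], [0.0, 3.0], [0.0, 2.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

doc_vecs = masked_mean_pool(tokens, mask)
order, sims = rank_by_cosine(np.array([1.0, 0.2]), doc_vecs)

print(doc_vecs)
print(order, sims)
```

With real model outputs, `token_embeddings` would correspond to `outputs.last_hidden_state` and `attention_mask` to the mask returned by the tokenizer, both converted to NumPy arrays.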

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Learning Experience: I found TF-IDF especially useful because it weights each word by how important it is to a document relative to the rest of the corpus.
Challenges: Getting familiar with the chi-square test took me some time.
Relevance to Field of Study: I gained experience and problem-solving skills related to NLP, especially in text analysis tasks such as sentiment analysis.
'''