
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer to the question above in as much detail as possible):

'''
Please write your answer here:
An interesting text classification task for feature extraction and feature selection is
sentiment analysis of movie reviews. Useful features include POS tags, word embeddings
(word2vec), dependency parses, named entities (NER), and sentiment lexicon scores.

Why these features?

POS tagging: POS tagging labels each word in a corpus with its part of speech.
This lets us decide which words the model should rely on. For example, adjectives
and adverbs usually carry more sentiment than nouns.

Word embeddings: Word embeddings such as word2vec convert words into dense vectors.
Words used in similar contexts get similar vectors, which captures inter-word semantics.

Dependency parsing: Dependency parsing identifies the grammatical relations between
words in a sentence. It shows which word modifies which, for example which noun an
adjective or a negation attaches to.

Named entity recognition (NER): NER detects and categorizes important information in the text.
It scans the review and helps differentiate whether the review is talking
about a character, an actor, or perhaps a related movie.

Sentiment lexicons: A sentiment lexicon is a dictionary that maps words to sentiment
labels or scores. Looking up each word of a sentence in it gives an estimate of the
sentiment of that sentence.

'''
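
To make the sentiment-lexicon idea above concrete, here is a minimal sketch of dictionary-based scoring. The lexicon entries, their scores, and the simple negation rule are all invented for illustration; a real lexicon such as VADER is far larger and uses more rules.

```python
# Toy sentiment lexicon: the words and scores are made up for illustration.
TOY_LEXICON = {"great": 2.0, "brilliant": 3.0, "fantastic": 3.0,
               "exhausting": -2.0, "overwrought": -1.5}
NEGATIONS = {"not", "never", "no"}  # hypothetical, deliberately tiny list


def lexicon_score(text):
    """Sum lexicon scores over tokens, flipping the sign after a negation word."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True
            continue
        value = TOY_LEXICON.get(word, 0.0)
        if value:
            score += -value if negate else value
            negate = False  # the negation applies only to the next scored word
    return score


positive = lexicon_score("A brilliant film with a fantastic cast.")
negative = lexicon_score("Not great, and frankly exhausting.")
```

Words missing from the lexicon contribute nothing, which is why lexicon scores are usually combined with the other feature types rather than used alone.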

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few samples of text data for the feature extraction.

In [7]:
# Your code here (Please add comments in the code):
# I am using the clean data from Assignment-2 that I collected on movie reviews
# (Oppenheimer) by scraping the IMDb website

import pandas as pd
import spacy
from spacy import displacy
from gensim.models import Word2Vec
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

# Load the small English pipeline: tokenizer, tagger, parser, and NER
# (en_core_web_sm does not ship static word vectors; Word2Vec is trained separately below)
nlp = spacy.load("en_core_web_sm")

# Initialize VADER for sentiment lexicon
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Read the CSV file
df = pd.read_csv('clean_csv_file_path.csv')
reviews = df['Cleaned User Review'].astype(str)

# Function to extract features
def extract_features(text):
    doc = nlp(text)
    # POS Tagging & Dependency Parsing
    pos_tags = [(token.text, token.pos_, token.dep_) for token in doc]
    # Named Entity Recognition
    NER = [(ent.text, ent.label_) for ent in doc.ents]
    # Sentiment Scores
    sentiment_scores = sia.polarity_scores(text)
    # Tokenize the text and remove stop words and punctuation
    tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]

    return pos_tags, NER, sentiment_scores, tokens

def dependency_parsing(text):
    doc = nlp(text)
    displacy.render(doc, style="dep", jupyter=True, options={'distance': 90})

# Extracting features from the first review
first_review_features = extract_features(reviews.iloc[0])
pos_tags, NER, sentiment_scores, tokens = first_review_features

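# Train a small Word2Vec model on the first review's tokens and look up the vector for 'brain'
# (assumes 'brain' occurs in the first review; otherwise model.wv raises a KeyError)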
model = Word2Vec(sentences=[tokens], vector_size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv['brain']

# Displaying the extracted features
print("POS Tags and Dependency Parsing:", pos_tags)
print("Named Entities:", NER)
print("Sentiment Scores:", sentiment_scores)
sample_text = df['Cleaned User Review'][0]
first_sentence = sample_text.split('.')[0]
print("Dependency parsing of the first sentence:")
dependency_parsing(first_sentence)
print("Tokens for Word2Vec:", tokens)
print(word_vectors)

# Saving the features in a csv file
features = [extract_features(review) for review in reviews]
features_df = pd.DataFrame({
    'POS_dep_tags': [f[0] for f in features],      # (token, POS, dependency) triples
    'Named_entities': [f[1] for f in features],    # (entity text, label) pairs
    'Sentiment_scores': [f[2] for f in features],  # VADER polarity scores
})
result_df = pd.concat([df, features_df], axis=1)
result_df.to_csv('Extracted_features.csv', index=False)



POS Tags and Dependency Parsing: [('you', 'PRON', 'nsubj'), ('ll', 'AUX', 'aux'), ('wit', 'NOUN', 'ccomp'), ('brain', 'NOUN', 'nsubj'), ('fully', 'ADV', 'advmod'), ('switched', 'VERB', 'amod'), ('watching', 'VERB', 'compound'), ('oppenheimer', 'NOUN', 'nsubj'), ('could', 'AUX', 'aux'), ('easily', 'ADV', 'advmod'), ('get', 'VERB', 'ccomp'), ('away', 'ADP', 'advmod'), ('nonattentive', 'ADJ', 'amod'), ('viewer', 'PROPN', 'advmod'), ('intelligent', 'ADJ', 'amod'), ('filmmaking', 'ADJ', 'amod'), ('show', 'NOUN', 'compound'), ('audience', 'NOUN', 'nmod'), ('great', 'ADJ', 'amod'), ('respect', 'NOUN', 'compound'), ('fire', 'NOUN', 'compound'), ('dialogue', 'NOUN', 'npadvmod'), ('packed', 'VERB', 'amod'), ('information', 'NOUN', 'nmod'), ('relentless', 'ADJ', 'amod'), ('pace', 'NOUN', 'nsubj'), ('jump', 'VERB', 'ccomp'), ('different', 'ADJ', 'amod'), ('time', 'NOUN', 'npadvmod'), ('oppenheimer', 'NOUN', 'compound'), ('life', 'NOUN', 'nsubj'), ('continuously', 'ADV', 'advmod'), ('hour', 'VERB',

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Tokens for Word2Vec: ['ll', 'wit', 'brain', 'fully', 'switched', 'watching', 'oppenheimer', 'easily', 'away', 'nonattentive', 'viewer', 'intelligent', 'filmmaking', 'audience', 'great', 'respect', 'fire', 'dialogue', 'packed', 'information', 'relentless', 'pace', 'jump', 'different', 'time', 'oppenheimer', 'life', 'continuously', 'hour', 'runtime', 'visual', 'clue', 'guide', 'viewer', 'time', 'll', 'grip', 'quickly', 'relentlessness', 'help', 'express', 'urgency', 'u', 'attacked', 'chase', 'atomic', 'bomb', 'germany', 'absolute', 'career', 'best', 'performance', 'consistenly', 'brilliant', 'cillian', 'murphy', 'anchor', 'film', 'nailed', 'oscar', 'performance', 'fact', 'cast', 'fantastic', 'apart', 'maybe', 'overwrought', 'emily', 'blunt', 'performance', 'rdj', 'particularly', 'brilliant', 'return', 'proper', 'acting', 'decade', 'calling', 'screenplay', 'dense', 'layered', 'd', 'thick', 'bible', 'cinematography', 'stark', 'spare', 'imbued', 'rich', 'lucious', 'colour', 'moment', 'espec
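
The extracted POS and dependency tuples are variable-length lists, so they usually need to be turned into fixed-length numeric vectors before a classifier can use them. A minimal sketch, assuming a short hypothetical tag order (`POS_VOCAB`) and a hypothetical helper `pos_count_vector`, neither of which is part of the original code:

```python
from collections import Counter

# Hypothetical fixed tag order; spaCy's coarse-grained tag set is larger than this.
POS_VOCAB = ["NOUN", "VERB", "ADJ", "ADV", "PROPN"]


def pos_count_vector(tagged_tokens, vocab=POS_VOCAB):
    """Return the relative frequency of each POS tag listed in `vocab`."""
    counts = Counter(pos for _, pos, _ in tagged_tokens)
    total = len(tagged_tokens) or 1  # avoid division by zero on empty input
    return [counts[tag] / total for tag in vocab]


# First few (token, POS, dependency) tuples copied from the output above
sample = [("you", "PRON", "nsubj"), ("ll", "AUX", "aux"),
          ("wit", "NOUN", "ccomp"), ("brain", "NOUN", "nsubj"),
          ("fully", "ADV", "advmod"), ("switched", "VERB", "amod")]
vector = pos_count_vector(sample)
```

Every review then maps to a vector of the same length, regardless of how many tokens it contains.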

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [4]:
# Your code here (Please add comments in the code):

import pandas as pd
import spacy
from spacy import displacy
from gensim.models import Word2Vec
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Load the small English pipeline: tokenizer, tagger, parser, and NER
# (en_core_web_sm does not ship static word vectors; Word2Vec is trained separately below)
nlp = spacy.load("en_core_web_sm")

# Initialize VADER for sentiment lexicon
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# Read the CSV file
df = pd.read_csv('clean_csv_file_path.csv')
reviews = df['Cleaned User Review'].astype(str)

# Function to extract features
def extract_features(text):
    doc = nlp(text)
    # POS Tagging & Dependency Parsing
    pos_tags = [(token.text, token.pos_, token.dep_) for token in doc]
    # Named Entity Recognition
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # Sentiment Scores
    sentiment_scores = sia.polarity_scores(text)
    # Tokenize the text and remove stop words and punctuation
    tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]

    return pos_tags, entities, sentiment_scores, tokens

def dependency_parsing(text):
    doc = nlp(text)
    displacy.render(doc, style="dep", jupyter=True, options={'distance': 90})

# Extracting features from the first review
first_review_features = extract_features(reviews.iloc[0])
pos_tags, entities, sentiment_scores, tokens = first_review_features

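# Train a small Word2Vec model on the first review's tokens and look up the vector for 'brain'
# (assumes 'brain' occurs in the first review; otherwise model.wv raises a KeyError)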
model = Word2Vec(sentences=[tokens], vector_size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv['brain']

# Rank the feature groups by the accuracy of a k-nearest-neighbours (KNN) classifier
# trained on each group separately, in descending order.
# NOTE: the feature matrices and labels below are random placeholders, not values
# computed from the reviews, so the accuracies are expected to sit near chance (0.5).

n_samples = 1000  # Number of placeholder samples

# Placeholder binary labels drawn at random
labels = np.random.randint(0, 2, n_samples)

features_pos = np.random.rand(n_samples, 100)  # Placeholder for POS tagging features
features_w2v = np.random.rand(n_samples, 100)  # Placeholder for Word2Vec embeddings
features_dep = np.random.rand(n_samples, 10)  # Placeholder for dependency parsing features
features_ner = np.random.rand(n_samples, 10)  # Placeholder for NER features
features_sentiment = np.random.rand(n_samples, 1)  # Placeholder for sentiment lexicon scores

feature_sets = {
    'POS Tagging': features_pos,
    'Word2Vec': features_w2v,
    'Dependency Parsing': features_dep,
    'Named Entity Recognition': features_ner,
    'Sentiment Scores': features_sentiment
}

def evaluate_model(features, labels):
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred)

performance_scores = {name: evaluate_model(features, labels) for name, features in feature_sets.items()}

# Rank the features based on their performance
ranked_features = sorted(performance_scores.items(), key=lambda x: x[1], reverse=True)

print("Feature Ranking based on Nearest Neighbors Performance:")
for rank, (feature, score) in enumerate(ranked_features, start=1):
    print(f"{rank}. {feature}: {score}")



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Feature Ranking based on Nearest Neighbors Performance:
1. Word2Vec: 0.54
2. POS Tagging: 0.535
3. Dependency Parsing: 0.51
4. Named Entity Recognition: 0.495
5. Sentiment Scores: 0.485
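
The ranking above scores whole feature groups with a classifier. A filter method such as the chi-square statistic, which is widely used for text feature selection, instead scores each individual term against a class label. A minimal sketch in plain Python, assuming toy token lists and labels invented for illustration (the helper names `chi_square` and `rank_terms` are hypothetical, not part of the original code):

```python
def chi_square(docs, labels, term, positive_label=1):
    """Chi-square statistic for one term against one class (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == positive_label
        if has_term and in_class:
            a += 1  # term present, document in the class
        elif has_term:
            b += 1  # term present, document in the other class
        elif in_class:
            c += 1  # term absent, document in the class
        else:
            d += 1  # term absent, document in the other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def rank_terms(docs, labels):
    """Return (term, score) pairs sorted by chi-square score, highest first."""
    vocab = sorted({t for tokens in docs for t in tokens})
    scores = {t: chi_square(docs, labels, t) for t in vocab}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: invented token lists and labels (1 = positive, 0 = negative)
toy_docs = [["brilliant", "film"], ["brilliant", "cast"],
            ["boring", "film"], ["boring", "pace"]]
toy_labels = [1, 1, 0, 0]
ranking = rank_terms(toy_docs, toy_labels)
```

In this toy data, "brilliant" and "boring" each appear in only one class and get the highest score, while "film" appears equally in both classes and scores zero. Applying the same idea to the real reviews would need actual labels, for example derived from the user ratings.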


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [9]:
# Your code here (Please add comments in the code):

from transformers import BertTokenizer, BertModel
import torch
from scipy.spatial.distance import cosine
import pandas as pd

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load pre-trained model
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text to get BERT embeddings
def get_bert_embedding(text):
    # Encode text
    input_ids = tokenizer.encode(text, add_special_tokens=True, max_length=512, truncation=True, return_tensors='pt')
    # Get embeddings
    with torch.no_grad():
        outputs = model(input_ids)
    # Use the embeddings of the [CLS] token and ensure output is 1D
    embeddings = outputs.last_hidden_state[:, 0, :].squeeze().numpy()
    return embeddings

# query
query = "This movie is interesting"

# Load the cleaned reviews (read_csv already returns a DataFrame)
df = pd.read_csv('clean_csv_file_path.csv')

# Encode the query to get its embedding
query_embedding = get_bert_embedding(query)

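# Calculate the cosine similarity between each review and the query
# (scipy's cosine() returns a distance, so similarity = 1 - distance;
# astype(str) guards against missing values being read as floats)
df['similarity'] = df['Cleaned User Review'].astype(str).apply(lambda x: 1 - cosine(get_bert_embedding(x), query_embedding))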

# Sort the DataFrame by similarity in descending order
df_sorted = df.sort_values(by='similarity', ascending=False)

# Save the ranked documents and display the top five
df_sorted.to_csv('Bert.csv')
df_sorted[:5]




| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | Name | Title of Review | Rating by user | User Review | Cleaned User Review | similarity |
|---|---|---|---|---|---|---|---|---|
| 9 | 9 | 9 | andy_c_les | Too loud and too long | 6.0 | The film looks great is brilliantly acted but ... | film look great brilliantly acted there virtua... | 0.875761 |
| 98 | 98 | 98 | ahamedfahimq | Unnecessarily lengthy | 6.0 | With great expectations, I went to watch this ... | great expectation went watch film returned hom... | 0.873784 |
| 99 | 99 | 99 | James_Farr | General takeaway | 8.0 | As a general overview the movie has great paci... | general overview movie great pacing somewhat l... | 0.870628 |
| 1 | 1 | 1 | Bonobo13579 | Quality but exhausting | 7.0 | I'm a big fan of Nolan's work so was really lo... | im big fan nolans work really looking forward ... | 0.865468 |
| 54 | 54 | 54 | lone_samurai678 | Mixed feelings | 7.0 | The story is presented really well and through... | story presented really well eye oppenheimer mo... | 0.863529 |
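
The similarity column in the ranking above is the cosine of the angle between the query embedding and each review embedding. A minimal sketch of that computation and of the ranking step, using small made-up vectors in place of the 768-dimensional BERT embeddings (the helper names and toy values are assumptions for illustration):

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" invented for illustration
toy_query = [1.0, 0.0, 1.0]
toy_docs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partly aligned with the query
    "doc_c": [0.0, 3.0, 0.0],  # orthogonal to the query
}
ranked = rank_by_similarity(toy_query, toy_docs)
```

Cosine similarity ignores vector length, so `doc_a` ranks first even though it is twice as long as the query vector.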


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:
I learned a lot from the lecture notes and online resources, although it
will take me more time and practice to work efficiently in this area.

The second and third questions were more manageable than the last question,
which required implementing the BERT embedding function. I had to rely on
online resources for that.

The key concept I learned from this exercise was how to actually extract features.
Choosing which features to extract is tricky, and there is still a lot for
me to learn. My exercise focused on sentiment analysis of reviews of a movie (Oppenheimer).
It is fascinating how many operations can be used to collect data
and extract information from it, and then to feed that data into NLP models and analyse it.
I am intrigued by what I am learning in this course and hope to learn much more
from it.


'''