<a href="https://colab.research.google.com/github/SwathiNagilla/Swathi_INFO5731_FALL2024/blob/main/Nagilla_Swathi_Exercise_3_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
'''
One interesting text mining task is the extraction of movie-related information such as title, year, cast, director, and
producer from unstructured text. This makes it possible to structure large movie datasets or build searchable movie databases.

Features:

1. Named Entity Recognition (NER) can be used to recognize the names of movie cast members, directors, and producers.
NER aids in accurately identifying and categorizing these individuals.
2. Regular Expression (Regex) is useful for getting the year of release because years typically have four digits (e.g., 1995 or 2010).
A regex pattern can quickly match and gather such data from the text.
3. Title Keyword Patterns can assist in identifying the movie title, which frequently appears at the start or is enclosed in quotes.
Titles can be distinguished from other textual information by focusing on their position and context.
4. Part of Speech (POS) Tagging can be used to distinguish between different roles, such as director or actor, by identifying nouns and
proper nouns in the sentence. This helps to differentiate between positions and people's names.
5. Word embeddings (such as those produced by BERT) are effective at capturing the semantic meaning of a text. They can be used to distinguish
between various movie roles or to categorize movie details based on their context in the text. Embeddings help the model recognize how words relate to one another, which can improve accuracy.
'''
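
The year feature in item 2 can be sketched with the standard `re` module. This is a minimal illustration rather than the extraction used in the exercise: the helper name `find_years` and the accepted range (1888 to 2099) are assumptions chosen here so that arbitrary four-digit numbers are not mistaken for release years.

```python
import re

# Assumed plausible range for release years: 1888-2099 (illustrative choice).
YEAR_RE = re.compile(r"\b(?:18[89]\d|19\d{2}|20\d{2})\b")

def find_years(text):
    """Return every standalone four-digit token that falls in the assumed range."""
    return YEAR_RE.findall(text)

print(find_years("Released in 1995 and remade in 2010."))
```

Restricting the range is a design choice: a bare `\d{4}` would also match box-office figures or other four-digit numbers in the text.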

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [1]:
import re
#import spacy which is used to recognize people’s names
import spacy
from collections import defaultdict

#load the model
nlp = spacy.load("en_core_web_sm")

data = [
    "Baahubali The Beginning (2015), is an Indian epic action film co-written and directed by S. S. Rajamouli, and produced by Shobu Yarlagadda and Prasad Devineni under Arka Media Works. It features Prabhas in a dual role alongside Rana Daggubati, Anushka Shetty, Tamannaah Bhatia, Ramya Krishnan, Sathyaraj, and Nassar.",
    "Inception (2010), directed by Christopher Nolan, stars Leonardo DiCaprio and was produced by Emma Thomas.",
    "The Dark Knight (2008), starring Christian Bale and directed by Christopher Nolan, was produced by Charles Roven."
]

# creating features dictionary
features = defaultdict(list)

#using regex for title and year
title_year = re.compile(r'([A-Za-z\s]+)\s\((\d{4})\)')

def extract_features(text):
    doc = nlp(text)

    # Extract title and year using regex
    match = title_year.search(text)
    if match:
        title, year = match.groups()
        features['title'].append(title.strip())
        features['year'].append(year)

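    # Collect every PERSON entity as a cast candidate. Note that this list also
    # picks up directors and producers, because NER does not distinguish roles.
    cast = []
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            cast.append(ent.text)

    # Extract director. The character class excludes periods, so a name with
    # initials such as "S. S. Rajamouli" is cut off after the first letter.
    director = re.search(r'directed by\s([A-Za-z\s]+)', text)
    if director:
        features['director'].append(director.group(1).strip())

    # Extract producer. The match continues until a character other than a letter,
    # space, or comma, so trailing words such as "under Arka Media Works" are kept.
    producer = re.search(r'produced by\s([A-Za-z\s]+(?:,?\s*[A-Za-z\s]+)*)', text)
    if producer:
        features['producer'].append(producer.group(1).strip())

    # Add the cast list to features
    features['cast'].append(cast)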

# Process each movie description
for movie_description in data:
    extract_features(movie_description)

# Display the extracted features
for feature, values in features.items():
    print(f"{feature.capitalize()}: {values}")


Title: ['Baahubali The Beginning', 'Inception', 'The Dark Knight']
Year: ['2015', '2010', '2008']
Director: ['S', 'Christopher Nolan', 'Christopher Nolan']
Producer: ['Shobu Yarlagadda and Prasad Devineni under Arka Media Works', 'Emma Thomas', 'Charles Roven']
Cast: [['S. S. Rajamouli', 'Shobu Yarlagadda', 'Prasad Devineni', 'Rana Daggubati', 'Anushka Shetty', 'Tamannaah Bhatia', 'Ramya Krishnan', 'Nassar'], ['Christopher Nolan', 'Leonardo DiCaprio', 'Emma Thomas'], ['Christopher Nolan', 'Charles Roven']]
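
The output above shows three weaknesses: the director of the first film is truncated to `S`, the producer string carries the trailing company name, and the cast lists include directors and producers. The following is a minimal sketch of one way to tighten the patterns, not a replacement for the cell above. The names `TOKEN`, `NAME`, `SEP`, `extract_roles`, and `filter_cast` are hypothetical, and the sketch assumes names are written in ASCII with capitalized words or single-letter initials.

```python
import re

# A name token is either an initial ("S.") or a capitalized word ("Rajamouli", "DiCaprio").
TOKEN = r"[A-Z](?:\.|[A-Za-z]+)"
NAME = rf"{TOKEN}(?:\s{TOKEN})*"
# Separators between listed names: ", ", ", and ", or " and ".
SEP = r"(?:,\s(?:and\s)?|\sand\s)"

DIRECTOR_RE = re.compile(rf"directed by\s({NAME})")
PRODUCER_RE = re.compile(rf"produced by\s({NAME}(?:{SEP}{NAME})*)")

def extract_roles(text):
    """Return the director and a list of producers found in one description."""
    director = DIRECTOR_RE.search(text)
    producers = PRODUCER_RE.search(text)
    return {
        "director": director.group(1) if director else None,
        "producers": re.split(SEP, producers.group(1)) if producers else [],
    }

def filter_cast(persons, roles):
    """Drop PERSON entities already assigned a director or producer role."""
    excluded = {roles["director"], *roles["producers"]}
    return [p for p in persons if p not in excluded]

sample = ("Baahubali The Beginning (2015), is an Indian epic action film co-written and "
          "directed by S. S. Rajamouli, and produced by Shobu Yarlagadda and Prasad "
          "Devineni under Arka Media Works.")
roles = extract_roles(sample)
print(roles)
print(filter_cast(["S. S. Rajamouli", "Shobu Yarlagadda", "Rana Daggubati"], roles))
```

Because each name token must start with a capital letter, the producer match stops before lowercase words such as "under", which keeps the company name out of the result.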


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them in descending order of importance.

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Sample data
movie_data = [
    {"title": "Baahubali The Beginning", "year": 2015, "cast": "Prabhas, Rana Daggubati", "director": "S. S. Rajamouli", "producer": "Shobu Yarlagadda, Prasad Devineni"},
    {"title": "Inception", "year": 2010, "cast": "Leonardo DiCaprio", "director": "Christopher Nolan", "producer": "Emma Thomas"},
    {"title": "The Dark Knight", "year": 2008, "cast": "Christian Bale", "director": "Christopher Nolan", "producer": "Charles Roven"},
]

data = pd.DataFrame(movie_data)

# Use TF-IDF to vectorize the cast, director, and producer
tfidf_vectorizer = TfidfVectorizer()
cast = tfidf_vectorizer.fit_transform(data['cast'])
cast_names = ['cast_' + name for name in tfidf_vectorizer.get_feature_names_out()]

director = tfidf_vectorizer.fit_transform(data['director'])
director_names = ['director_' + name for name in tfidf_vectorizer.get_feature_names_out()]

producer = tfidf_vectorizer.fit_transform(data['producer'])
producer_names = ['producer_' + name for name in tfidf_vectorizer.get_feature_names_out()]

# Combine the feature vectors and convert to array
features = np.hstack([cast.toarray(), director.toarray(), producer.toarray()])
feature_names = cast_names + director_names + producer_names

# Encode the Movie Title
label = LabelEncoder()
y = label.fit_transform(data['title'])

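# Apply Chi-Square (results are stored under new names so the imported chi2 function is not overwritten)
chi2_scores, p_values = chi2(features, y)
if len(feature_names) == len(chi2_scores):
    feature_rankings = pd.DataFrame({
        'Feature': feature_names,
        'Chi2_Score': chi2_scores,
        'p-value': p_values
    })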

    # Sort in descending order
    ranked_features = feature_rankings.sort_values(by='Chi2_Score', ascending=False)
    print(ranked_features)
else:
    print("Feature names and Chi-Square scores are not aligned.")


                 Feature  Chi2_Score   p-value
9     director_rajamouli    2.000000  0.367879
1         cast_christian    1.414214  0.493069
16       producer_thomas    1.414214  0.493069
14        producer_roven    1.414214  0.493069
12         producer_emma    1.414214  0.493069
10      producer_charles    1.414214  0.493069
0              cast_bale    1.414214  0.493069
4          cast_leonardo    1.414214  0.493069
3          cast_dicaprio    1.414214  0.493069
6              cast_rana    1.154701  0.561384
5           cast_prabhas    1.154701  0.561384
2         cast_daggubati    1.154701  0.561384
11     producer_devineni    1.000000  0.606531
13       producer_prasad    1.000000  0.606531
15        producer_shobu    1.000000  0.606531
17   producer_yarlagadda    1.000000  0.606531
7   director_christopher    0.707107  0.702189
8         director_nolan    0.707107  0.702189
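
To see where these scores come from, the chi-square statistic can be recomputed by hand for a single feature column. This is a minimal sketch under two assumptions: that the score compares the observed per-class sum of a feature with the sum expected from the class proportions, and that there are exactly three classes, so the p-value has two degrees of freedom and reduces to `exp(-score / 2)`. The function names `chi2_score` and `p_value_df2` are hypothetical.

```python
import math

def chi2_score(feature_values, labels):
    """Chi-square statistic for one non-negative feature column against class labels."""
    total = sum(feature_values)
    n = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        observed = sum(v for v, y in zip(feature_values, labels) if y == cls)
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score

def p_value_df2(score):
    """Survival function of the chi-square distribution, valid only for 2 degrees of freedom."""
    return math.exp(-score / 2)

labels = [0, 1, 2]  # one document per title, as in the cell above

# director_rajamouli: TF-IDF weight 1.0 in the first document only
rajamouli = chi2_score([1.0, 0.0, 0.0], labels)
# director_christopher: weight 1/sqrt(2) in the second and third documents
christopher = chi2_score([0.0, 1 / math.sqrt(2), 1 / math.sqrt(2)], labels)

print(round(rajamouli, 6), round(p_value_df2(rajamouli), 6))
print(round(christopher, 6), round(p_value_df2(christopher), 6))
```

With only one document per class, the score mostly reflects how much TF-IDF weight a token carries and whether it appears in more than one document, so the ranking should be read as an illustration of the method rather than as statistically meaningful evidence.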


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [3]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Sample text data
texts = [
    "Baahubali The Beginning (2015), is an Indian epic action film directed by S. S. Rajamouli and produced by Shobu Yarlagadda and Prasad Devineni. It features Prabhas, Rana Daggubati, Anushka Shetty, Tamannaah Bhatia, and others.",
    "Inception (2010), directed by Christopher Nolan, stars Leonardo DiCaprio and is produced by Emma Thomas.",
    "The Dark Knight (2008), starring Christian Bale, directed by Christopher Nolan, was produced by Charles Roven.",
]

# query
query = "A movie directed by Christopher Nolan with Christian Bale in the lead role."

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to generate a mean-pooled BERT embedding for a single text
def embed(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

query_embedding = embed(query)
text_embeddings = [embed(text) for text in texts]

# Calculate similarity between the query and text
similarities = [cosine_similarity([query_embedding], [text_embedding])[0][0] for text_embedding in text_embeddings]

# Rank in descending order
ranked_texts = sorted(zip(texts, similarities), key=lambda x: x[1], reverse=True)

# Print the output
for i, (text, similarity) in enumerate(ranked_texts, 1):
    print(f"Rank {i}: Similarity = {similarity:.4f}\n{text}\n")


Rank 1: Similarity = 0.8584
The Dark Knight (2008), starring Christian Bale, directed by Christopher Nolan, was produced by Charles Roven.

Rank 2: Similarity = 0.8185
Inception (2010), directed by Christopher Nolan, stars Leonardo DiCaprio and is produced by Emma Thomas.

Rank 3: Similarity = 0.6359
Baahubali The Beginning (2015), is an Indian epic action film directed by S. S. Rajamouli and produced by Shobu Yarlagadda and Prasad Devineni. It features Prabhas, Rana Daggubati, Anushka Shetty, Tamannaah Bhatia, and others.
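
The ranking step itself does not depend on BERT: it only needs one vector per text. The sketch below shows the cosine similarity and sorting logic with the standard library. The three-dimensional vectors are hypothetical stand-ins for the 768-dimensional embeddings of `bert-base-uncased`, and the helper names `cosine` and `rank_by_similarity` are assumptions introduced for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors, not real BERT output
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(query_vec, doc_vecs)
for rank_position, (index, score) in enumerate(ranking, 1):
    print(f"Rank {rank_position}: document {index}, similarity = {score:.4f}")
```

Because cosine similarity normalizes by vector length, a long description and a short one are compared by direction only, which is why it is a common choice for comparing embeddings of texts with different lengths.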



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the above questions in as much detail as possible):

'''
I found the task of obtaining features from text data to be difficult. I gained a lot of knowledge about methods such as
regular expressions and Named Entity Recognition, which helped me understand how to extract relevant data from unstructured text.
However, I had trouble correctly identifying roles and titles. The exercise is relevant to NLP because it highlights the
significance of feature extraction in practical applications.
'''