## Follow the same logic as the Movie Classifier demo done in class.

- Tokenize the text using spaCy.
- Download the Word2Vec model.
- Vectorize all words in each review.
- Calculate the mean vector of each review.
- Train a neural network for classification.
- Test the trained neural network with a few examples.

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import spacy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import tensorflow as tf
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import euclidean, cosine

2025-02-02 15:00:26.024185: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
#read the csv file
df = pd.read_csv('assignment_1.4.csv')

print(df.head())

genres = df['genre']
descriptions = df['description']
print(genres[:5], descriptions[:5])
print(len(genres), len(descriptions))   


       genre                                        description
0    horror    When six friends fly off on a weekend getaway...
1    horror    The story is about a young girl who was touch...
2   romance    A young woman named Anna has always longed fo...
3    horror    A London couple moves to a large country hous...
4    horror    In a small college in North Carolina, only a ...
0      horror 
1      horror 
2     romance 
3      horror 
4      horror 
Name: genre, dtype: object 0     When six friends fly off on a weekend getaway...
1     The story is about a young girl who was touch...
2     A young woman named Anna has always longed fo...
3     A London couple moves to a large country hous...
4     In a small college in North Carolina, only a ...
Name: description, dtype: object
1344 1344


In [3]:
# initiate vectorizer object
vectorizer = TfidfVectorizer()

# fit the vectorizer on the description column
vectorizer.fit(descriptions)
descriptions_tf_idf_vectors = vectorizer.transform(descriptions)
tf_idf_indexes = vectorizer.get_feature_names_out()

print(vectorizer.vocabulary_)
print(len(vectorizer.vocabulary_))

print(descriptions_tf_idf_vectors.toarray().shape)
df = df.assign(descriptions_tf_idf_vectors = list(descriptions_tf_idf_vectors.toarray()))
df.head()


14571
(1344, 14571)


|   | genre | description | descriptions_tf_idf_vectors |
|---|-------|-------------|-----------------------------|
| 0 | horror | When six friends fly off on a weekend getaway... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 1 | horror | The story is about a young girl who was touch... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 2 | romance | A young woman named Anna has always longed fo... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | horror | A London couple moves to a large country hous... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 4 | horror | In a small college in North Carolina, only a ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
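
To make the TF-IDF step above more concrete, here is a minimal sketch that rebuilds the weighting by hand with NumPy. It uses the smoothed IDF and L2 row normalisation that `TfidfVectorizer` applies by default, but it tokenizes on whitespace only, which is a simplification. The helper `tfidf_matrix` and the toy documents are hypothetical and not part of the assignment data.

```python
import numpy as np

def tfidf_matrix(docs):
    """Toy TF-IDF: raw counts * smoothed IDF, followed by L2-normalised rows."""
    # Assumption: plain whitespace tokenization (scikit-learn uses a regex tokenizer)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}

    # Term counts per document
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(tokenized):
        for tok in doc:
            counts[row, index[tok]] += 1

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    n_docs = len(docs)
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

    # Weight the counts and scale every row to unit length
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty documents
    return vocab, weighted / norms

toy_docs = ["ghost haunts house", "couple falls in love", "ghost love story"]
toy_vocab, toy_tfidf = tfidf_matrix(toy_docs)
print(toy_vocab)
print(toy_tfidf.shape)
```

Words that occur in fewer documents (such as "haunts" here) receive a larger weight than words shared across documents (such as "ghost"), which is the effect the vectorizer relies on.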


### Mean Vectors of Genres

In [4]:
# Calculate the mean TF-IDF vector for each genre
genre_mean_vectors = {}
for genre in genres.unique():
    genre_mean_vectors[genre] = np.mean(df[df['genre'] == genre]['descriptions_tf_idf_vectors'].to_list(), axis=0)
print(genre_mean_vectors)

{' horror ': array([0.00017285, 0.00084734, 0.00048681, ..., 0.00018588, 0.        ,
       0.        ]), ' romance ': array([0.        , 0.00062105, 0.00109619, ..., 0.        , 0.00020363,
       0.00011124])}
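
The per-genre averaging can also be written with boolean masks over a single feature matrix, which avoids building Python lists of row vectors. The sketch below uses a tiny made-up matrix; `class_mean_vectors` and the toy arrays are hypothetical names introduced only for illustration.

```python
import numpy as np

def class_mean_vectors(features, labels):
    """Return {label: mean row vector} for every distinct label."""
    labels = np.asarray(labels)
    return {
        str(label): features[labels == label].mean(axis=0)
        for label in np.unique(labels)
    }

# Toy data: three "documents" with two features each
toy_features = np.array([
    [1.0, 0.0],
    [3.0, 2.0],
    [0.0, 4.0],
])
toy_labels = ["horror", "horror", "romance"]

toy_means = class_mean_vectors(toy_features, toy_labels)
print(toy_means)
```

Each mean vector has the same length as a single document vector, so the two genre means can be compared directly with a distance or similarity measure.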


In [5]:
# Cosine similarity and Euclidean distance between the mean vectors
cosine_similarity = 1 - cosine(genre_mean_vectors[" horror "], genre_mean_vectors[" romance "])
euclidean_distance = euclidean(genre_mean_vectors[" horror "], genre_mean_vectors[" romance "])
print("Euclidean distance between mean vectors of genres: ", euclidean_distance)
print("Cosine similarity between mean vectors of genres: ", cosine_similarity)

Euclidean distance between mean vectors of genres:  0.12004757633003402
Cosine similarity between mean vectors of genres:  0.8788579301786654
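
Note that `scipy.spatial.distance.cosine` returns a cosine *distance*, which is why the cell subtracts it from 1 to obtain a similarity. As a rough illustration of how the two measures differ, the sketch below implements both with NumPy on toy vectors; the function names are hypothetical.

```python
import numpy as np

def cosine_similarity_np(a, b):
    """Cosine of the angle between a and b (ignores vector length)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance_np(a, b):
    """Straight-line distance between a and b (sensitive to vector length)."""
    return float(np.linalg.norm(a - b))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

print(cosine_similarity_np(u, v))       # about 0.707 (45 degree angle)
print(euclidean_distance_np(u, v))      # 1.0

# Scaling a vector leaves the cosine similarity unchanged but moves it far away
print(cosine_similarity_np(u, 10 * u))  # 1.0
print(euclidean_distance_np(u, 10 * u)) # 9.0
```

Because the mean TF-IDF vectors contain only non-negative, very small entries, a small Euclidean distance between them mostly reflects their small magnitudes; the cosine similarity is the more informative of the two numbers here.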


In [6]:
# Imports for preprocessing
# (tqdm, spacy, scikit-learn, TensorFlow, and gensim were already imported in the first cell)
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from huggingface_hub import hf_hub_download

# Load the large English spaCy model
nlp = spacy.load("en_core_web_lg")

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kush_210/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/kush_210/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/kush_210/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
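# Generate numeric genre labels (romance = 0, horror = 1).
# The keys keep the leading and trailing spaces present in the CSV values.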
df["genre_label"] = df["genre"].map({" romance ": 0, " horror ": 1})
print(df.head())

       genre                                        description  \
0    horror    When six friends fly off on a weekend getaway...   
1    horror    The story is about a young girl who was touch...   
2   romance    A young woman named Anna has always longed fo...   
3    horror    A London couple moves to a large country hous...   
4    horror    In a small college in North Carolina, only a ...   

                         descriptions_tf_idf_vectors  genre_label  
0  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...            1  
1  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...            1  
2  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...            0  
3  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...            1  
4  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...            1  
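
The genre strings in this dataset carry leading and trailing spaces (for example `' horror '`), so the mapping above only works with padded keys. One possible way to make the encoding less fragile is to normalise the string before the lookup. This is a plain-Python sketch; `LABEL_MAP` and `encode_genre` are hypothetical names, not part of the notebook.

```python
LABEL_MAP = {"romance": 0, "horror": 1}

def encode_genre(raw_genre):
    """Strip padding and case differences, then look up the numeric label."""
    return LABEL_MAP[raw_genre.strip().lower()]

raw_genres = [" horror ", " romance ", "Horror"]
encoded = [encode_genre(g) for g in raw_genres]
print(encoded)
```

An unknown genre raises a `KeyError` here, which surfaces bad rows immediately instead of silently producing missing labels.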


In [8]:
repo_id = "NathaNn1111/word2vec-google-news-negative-300-bin"
filename = "GoogleNews-vectors-negative300.bin"
model_path = hf_hub_download(repo_id=repo_id, filename=filename)

In [None]:
try:
    word2vec = KeyedVectors.load_word2vec_format(model_path, binary=True)
except Exception as e:
    print("Error loading Word2Vec model:", e)
    word2vec = None

In [9]:
def clean_data(desc):
    """Lowercase a description, drop stop words and punctuation, then lemmatize it."""
    # A set gives constant-time membership checks
    stop_words = set(stopwords.words('english'))
    lower = " ".join(w for w in desc.lower().split() if w not in stop_words)
    punct = ''.join(ch for ch in lower if ch not in punctuation)

    # Tokenize the cleaned text and reduce each word to its lemma
    lemmatizer = WordNetLemmatizer()
    word_tokens = nltk.word_tokenize(punct)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in word_tokens]

    return " ".join(lemmatized_words)

In [10]:


# Function to create the mean Word2Vec vector for a description
def description_to_vector(description):
    tokens = [token.text.lower() for token in nlp(description) if token.is_alpha]
    vectors = [word2vec[word] for word in tokens if word in word2vec]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(word2vec.vector_size)
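
The averaging logic in `description_to_vector` can be checked in isolation without loading spaCy or the full Word2Vec model. In the sketch below a plain dictionary of made-up 3-dimensional vectors stands in for the embedding model; `toy_embeddings` and `mean_vector` are hypothetical names used only for this illustration.

```python
import numpy as np

# Assumption: a tiny hand-written embedding table instead of Word2Vec
toy_embeddings = {
    "ghost": np.array([1.0, 0.0, 0.0]),
    "love":  np.array([0.0, 1.0, 0.0]),
    "house": np.array([0.0, 0.0, 1.0]),
}

def mean_vector(tokens, embeddings, size):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)

# "unknown" is out of vocabulary and is skipped, not counted in the average
print(mean_vector(["ghost", "house", "unknown"], toy_embeddings, size=3))
# No known tokens at all: the zero vector keeps the output shape consistent
print(mean_vector(["unknown"], toy_embeddings, size=3))
```

Skipping out-of-vocabulary tokens keeps them from diluting the average, and the zero-vector fallback guarantees that every description maps to a vector of the same length.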

In [11]:
# Generate mean vectors for all descriptions
tqdm.pandas()
df['vector'] = df['description'].progress_apply(description_to_vector)

100%|██████████| 1344/1344 [00:39<00:00, 33.65it/s]


In [12]:
df.columns

Index(['genre', 'description', 'descriptions_tf_idf_vectors', 'genre_label',
       'vector'],
      dtype='object')

In [13]:
if tf.config.list_physical_devices('GPU'):
    print("GPU is available")
else:
    print("GPU is not available")

GPU is available


In [18]:
# Prepare features and labels for genre classification

X = np.stack(df['vector'].values)
y = df['genre_label'].values

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Simple feed-forward network: two hidden ReLU layers with dropout and a sigmoid output

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')
])


In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=200, batch_size=32, validation_split=0.2)

# Evaluate the model
y_pred = (model.predict(X_test) > 0.5).astype(int).flatten()
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")



I0000 00:00:1738490848.416015  219284 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3586 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
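
The evaluation step thresholds the sigmoid outputs at 0.5 and compares the resulting labels with the ground truth. The following sketch reproduces that arithmetic with NumPy on made-up probabilities, so it can be followed without a trained model; `binarize`, `accuracy`, and the toy arrays are hypothetical.

```python
import numpy as np

def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs of shape (n, 1) into a flat array of 0/1 labels."""
    return (np.asarray(probabilities) > threshold).astype(int).flatten()

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Assumption: made-up model outputs shaped like model.predict(...) results
toy_probs = np.array([[0.9], [0.2], [0.6], [0.4]])
toy_true = np.array([1, 0, 0, 0])

toy_pred = binarize(toy_probs)
print(toy_pred)                      # three of the four labels match toy_true
print(accuracy(toy_true, toy_pred))  # 0.75
```

The `flatten()` call matters because Keras returns predictions with shape `(n, 1)`, while the label array has shape `(n,)`.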


In [None]:
# Function to preprocess and predict genre for a new description
def predict_genre(description, model, word2vec, nlp):
    # Tokenize and create a mean vector for the description
    tokens = [token.text.lower() for token in nlp(description) if token.is_alpha]
    vectors = [word2vec[word] for word in tokens if word in word2vec]
    if vectors:
        mean_vector = np.mean(vectors, axis=0)
    else:
        mean_vector = np.zeros(word2vec.vector_size)
    
    # Predict genre
    prediction = model.predict(mean_vector.reshape(1, -1))[0][0]
    genre = "horror" if prediction > 0.5 else "romance"
    confidence = prediction if genre == "horror" else 1 - prediction
    return genre, confidence

# Example descriptions for inference (inner quotation marks are escaped so they stay part of the text)
example_desc = [
    " Anthology of three horror tales spun around various characters which star Brinke Stevens who appears in all three episodes in three different roles. In \"The Wish\" a religious backwoods family can only pray for a quick death when the woods come to life with the zombies of their victims. In \"The Night Caller\" a DJ is harassed by a mysterious caller whom makes clear threats of his attempts to kill. In \"Hexed\" a woman gets mixed up with a coven of witches.",
    " Eve is an ordinary married woman. A happy, spiritual woman who lives an idyllic life. But events take a turn for the worse when she's bitten by a snake. For this is no ordinary snake. And nothing can prepare Eve for the events that are to follow.",
    " \"When event planner Chloe (Kebbel) is hired to plan the local Christmas Festival, she is beyond thrilled to embrace the challenge. Professionally, everything is going great, but much to the dismay of her mother (Post), Chloe confesses she has given up on ever finding Mr. Right. That all changes the night of the opening of the festival when she meets Evan. The two begin a whirlwind romance, but as Christmas Day nears, Chloe learns that Evan is being transferred overseas for work. What follows is three more Christmases where Chloe and Evan cross paths at the annual festival, but each year something - or someone - stands in the way of true love. Will a touch of Santa's magic on their fourth Christmas Eve finally bring them together?\"",
    " An excommunicated priest sets up a satanic cult that only looks Catholic on the outside. He convinces a man to sign over his daughter's soul so that she will become the devil's representative on earth on her eighteenth birthday, but as that day nears, the man seeks the help of an American occult novelist to save his daughter, both physically and spiritually.",
    " A beautiful stranger on the Coney Island train becomes both lead actress and real life object of desire in this choose your own adventure documentary about writing a fictional love story on the streets of New York. Director Florian Habicht casts himself as the leading man in this interactive and multi-layered ode to the Woody Allen-style Manhattan romantic comedy. Outspoken New Yorkers and Habicht's father, via Skype calls, influence the narrative of the film within the documentary. Blurs the boundaries of fact and fiction and pushes the relationship between director and subject to new extremes.",
]

# Run inference on the example descriptions
for description in example_desc:
    genre, confidence = predict_genre(description, model, word2vec, nlp)
    print(f"Description: {description}\nPredicted genre: {genre} (Confidence: {confidence:.2f})\n")


AttributeError: 'KerasTensor' object has no attribute 'device'
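
The label and confidence logic inside `predict_genre` does not depend on the model, so it can be verified separately from the failing prediction call. The sketch below isolates that step as a small pure function; `decode_prediction` is a hypothetical helper, and the input values are made up.

```python
def decode_prediction(probability, threshold=0.5):
    """Map a sigmoid output to (genre, confidence).

    The model is trained with horror = 1, so the raw output is the
    probability of horror; the romance confidence is its complement.
    """
    genre = "horror" if probability > threshold else "romance"
    confidence = probability if genre == "horror" else 1.0 - probability
    return genre, confidence

print(decode_prediction(0.75))  # ('horror', 0.75)
print(decode_prediction(0.25))  # ('romance', 0.75)
```

With this definition the reported confidence is always at least 0.5, since it is the probability assigned to whichever class was chosen.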