### Sentiment Classification Using Word2Vec  

**Problem Statement**  
We will classify movie reviews as positive or negative using Word2Vec embeddings and a machine learning classifier.  

**Steps**  
- Preprocess the data.
- Train a Word2Vec model using Gensim.
- Generate AvgWord2Vec embeddings for each document.
- Train a classifier using the embeddings.

In [53]:
import nltk
from nltk.corpus import stopwords
from gensim.models import Word2Vec

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [54]:
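# Download NLTK stopwords
# Note: nltk.word_tokenize(), used in preprocess() below, also needs the Punkt
# tokenizer data ('punkt', or 'punkt_tab' on recent NLTK releases)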
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [55]:
# Sample dataset
data = [
    ("I absolutely loved this movie. It was brilliant!", "positive"),
    ("The film was a complete disaster, I regret watching it.", "negative"),
    ("Amazing storyline and fantastic acting!", "positive"),
    ("The plot was predictable and the characters were dull.", "negative"),
    ("This movie is a masterpiece; I could watch it over and over.", "positive"),
    ("What a boring film, I couldn't sit through the whole thing.", "negative"),
    ("Beautiful cinematography and heartfelt performances.", "positive"),
    ("Terrible direction and the screenplay was weak.", "negative"),
    ("A true work of art. The movie touched my heart.", "positive"),
    ("Worst movie ever, waste of time and money.", "negative"),
    ("Fantastic! Everything about this film was perfect.", "positive"),
    ("The acting was horrible, and the dialogue was cringe-worthy.", "negative"),
    ("Incredible! I would recommend it to everyone.", "positive"),
    ("I didn't like it. The story made no sense.", "negative"),
    ("The movie kept me on the edge of my seat!", "positive"),
    ("This film is so overrated, I don't understand the hype.", "negative"),
    ("Excellent performances and a gripping narrative.", "positive"),
    ("Awful experience. I wouldn't recommend it to anyone.", "negative"),
    ("A delightful movie that everyone should see.", "positive"),
    ("The movie dragged on and felt like a chore to watch.", "negative"),
    ("What a wonderful story! The actors were phenomenal.", "positive"),
    ("A complete flop. The worst film I've seen this year.", "negative"),
    ("Highly entertaining and full of surprises!", "positive"),
    ("Not worth watching. I was disappointed.", "negative"),
    ("A heartwarming and uplifting experience.", "positive"),
    ("I fell asleep halfway through; it was so boring.", "negative"),
    ("Brilliant direction and exceptional acting.", "positive"),
    ("The movie lacked depth and felt rushed.", "negative"),
    ("One of the best films I've seen in a long time.", "positive"),
    ("I wouldn't watch this again. It was painful to sit through.", "negative"),
    ("A cinematic masterpiece with stunning visuals.", "positive"),
    ("Badly written and poorly executed.", "negative"),
    ("I was deeply moved by this beautiful movie.", "positive"),
    ("Terrible pacing and no character development.", "negative"),
    ("A must-watch! This film is outstanding.", "positive"),
    ("The movie was slow and uninteresting.", "negative"),
    ("I laughed, I cried, and I thoroughly enjoyed it!", "positive"),
    ("An absolute waste of time and effort.", "negative"),
    ("The movie inspired me and left me in awe.", "positive"),
    ("The dialogues were cheesy and unconvincing.", "negative"),
    ("Such a delightful film, I would watch it again!", "positive"),
    ("The special effects were terrible and distracting.", "negative"),
    ("This film is an absolute gem. Don't miss it!", "positive"),
    ("One of the worst movies I have ever seen.", "negative"),
    ("The story was captivating and well-told.", "positive"),
    ("The acting was subpar, and the plot had too many holes.", "negative"),
    ("A feel-good movie that left me smiling.", "positive"),
    ("The movie was too long and very boring.", "negative"),
    ("Simply amazing! The cast was perfect.", "positive"),
    ("Poorly directed and utterly forgettable.", "negative"),
    ("A powerful story told with grace and skill.", "positive"),
    ("Horrible. The film was a big disappointment.", "negative"),
    ("A truly enjoyable and mesmerizing experience.", "positive"),
    ("Not engaging at all. I stopped watching halfway.", "negative"),
    ("A fantastic journey through an emotional story.", "positive"),
    ("The visuals were bad, and the soundtrack was annoying.", "negative"),
    ("This movie deserves all the awards. Fantastic!", "positive"),
    ("The plot twists were ridiculous and unnecessary.", "negative"),
    ("An inspiring and beautifully crafted film.", "positive"),
    ("Terrible editing and lack of continuity.", "negative"),
    ("I felt so connected to the characters. Brilliant!", "positive"),
    ("A poorly made movie that failed to impress.", "negative"),
    ("Outstanding! This film exceeded my expectations.", "positive"),
    ("The concept was interesting, but the execution was bad.", "negative"),
    ("This is my favorite film of all time. Amazing!", "positive"),
    ("The movie was overly dramatic and felt fake.", "negative"),
    ("The performances were superb and unforgettable.", "positive"),
    ("The movie was pointless and lacked any substance.", "negative"),
]

len(data)

68

In [56]:
# Preprocessing: lowercase, tokenize, drop punctuation and stopwords
def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    # Use the precomputed stop_words set; calling stopwords.words('english')
    # inside the comprehension rebuilds the list for every token
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return tokens
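
One side effect of this filter is worth seeing in isolation: `isalnum()` drops any token that contains an apostrophe or hyphen, and the stopword list removes words such as "did" and "no", so negation can vanish from a review. The sketch below hardcodes the tokens and a few stopwords as assumptions (the split shown is what a Treebank-style tokenizer such as `nltk.word_tokenize` typically produces; `assumed_tokens` and `assumed_stopwords` are illustrative names, not part of the notebook), so it runs without NLTK.

```python
# Assumed tokenization of "I didn't like it. The story made no sense."
# (hardcoded for illustration; the notebook gets this from nltk.word_tokenize)
assumed_tokens = ["i", "did", "n't", "like", "it", ".",
                  "the", "story", "made", "no", "sense", "."]

# Small assumed subset of the NLTK English stopword list
assumed_stopwords = {"i", "did", "it", "the", "no"}

# Same filter as preprocess(): keep alphanumeric tokens that are not stopwords
kept = [t for t in assumed_tokens if t.isalnum() and t not in assumed_stopwords]
print(kept)

# Tokens with an apostrophe or hyphen fail isalnum() and are dropped entirely
print("n't".isalnum(), "cringe-worthy".isalnum())
```

The surviving tokens read as neutral or even positive although the review is labeled negative, which is one reason aggressive stopword removal can hurt sentiment tasks.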

In [57]:
# Preprocess dataset
texts = [preprocess(text) for text, label in data]
labels = [1 if label == 'positive' else 0 for _, label in data]

print(texts)
print(labels)

[['absolutely', 'loved', 'movie', 'brilliant'], ['film', 'complete', 'disaster', 'regret', 'watching'], ['amazing', 'storyline', 'fantastic', 'acting'], ['plot', 'predictable', 'characters', 'dull'], ['movie', 'masterpiece', 'could', 'watch'], ['boring', 'film', 'could', 'sit', 'whole', 'thing'], ['beautiful', 'cinematography', 'heartfelt', 'performances'], ['terrible', 'direction', 'screenplay', 'weak'], ['true', 'work', 'art', 'movie', 'touched', 'heart'], ['worst', 'movie', 'ever', 'waste', 'time', 'money'], ['fantastic', 'everything', 'film', 'perfect'], ['acting', 'horrible', 'dialogue'], ['incredible', 'would', 'recommend', 'everyone'], ['like', 'story', 'made', 'sense'], ['movie', 'kept', 'edge', 'seat'], ['film', 'overrated', 'understand', 'hype'], ['excellent', 'performances', 'gripping', 'narrative'], ['awful', 'experience', 'would', 'recommend', 'anyone'], ['delightful', 'movie', 'everyone', 'see'], ['movie', 'dragged', 'felt', 'like', 'chore', 'watch'], ['wonderful', 'story

In [58]:
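# Train a Word2Vec model using Gensim
# vector_size: embedding dimension; window: maximum distance between the
# current and predicted word; min_count=1 keeps every word in the vocabulary
word2vec_model = Word2Vec(sentences=texts, vector_size=100, window=2, workers=4, min_count=1)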
print(word2vec_model)
print(word2vec_model.vector_size)

Word2Vec<vocab=183, vector_size=100, alpha=0.025>
100


In [59]:
# Verify the vocabulary size manually
vocabulary = set(word for sentence in texts for word in sentence)
print(len(vocabulary))
len([word for word in vocabulary if word in word2vec_model.wv])

183


183
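
The two counts agree because the model was trained with `min_count=1`, so no word is discarded. As a rough illustration of what a higher threshold would do, the sketch below counts token frequencies with `collections.Counter` on a tiny made-up corpus. `toy_texts` and `surviving_vocab` are hypothetical names, and the code only mimics the frequency cut-off, not Gensim's internals.

```python
from collections import Counter

# Tiny made-up corpus standing in for the tokenized reviews
toy_texts = [
    ["movie", "brilliant"],
    ["movie", "boring"],
    ["film", "boring", "slow"],
]

def surviving_vocab(texts, min_count):
    """Return the words that occur at least min_count times across all documents."""
    counts = Counter(word for sentence in texts for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# With min_count=1 every distinct word is kept
print(len(surviving_vocab(toy_texts, min_count=1)))

# With min_count=2 only words seen at least twice remain
print(sorted(surviving_vocab(toy_texts, min_count=2)))
```

On a corpus as small as this notebook's, most words appear only once, so raising `min_count` would shrink the vocabulary sharply.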

In [60]:
# Function to compute AvgWord2Vec for a document

def compute_avgword2vec(words, model, vector_size):
    # Look up embeddings only for words present in the model's vocabulary
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    if len(word_vectors) == 0:
        return np.zeros(vector_size)  # Return zero vector if no valid words
    # Element-wise mean across the word vectors gives one document vector
    return np.mean(word_vectors, axis=0)
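
To make the averaging concrete, here is a dependency-free sketch with made-up 3-dimensional vectors. The `toy_embeddings` table and the `average_vectors` helper are hypothetical stand-ins for `model.wv` and `compute_avgword2vec`; the element-wise mean is the same operation that `np.mean(..., axis=0)` performs.

```python
# Made-up 3-dimensional embeddings standing in for model.wv
toy_embeddings = {
    "great": [1.0, 2.0, 3.0],
    "movie": [3.0, 4.0, 5.0],
}

def average_vectors(words, embeddings, vector_size):
    """Element-wise mean of the vectors of in-vocabulary words."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * vector_size  # no known words: fall back to a zero vector
    # zip(*vectors) groups the values of each dimension together
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# "unseen" is out of vocabulary, so it is skipped rather than counted
print(average_vectors(["great", "unseen", "movie"], toy_embeddings, 3))

# A document with no known words yields the zero vector
print(average_vectors(["unseen"], toy_embeddings, 3))
```

Because every document collapses to a single mean vector, word order is lost, and two reviews with the same words in a different order get identical features.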

In [61]:
# Compute AvgWord2Vec for all documents/sentences

vector_size = word2vec_model.vector_size
print(vector_size)
X = np.array([compute_avgword2vec(sentence, word2vec_model, vector_size) for sentence in texts])
Y = np.array(labels)

print(X.shape)   # (number of documents, vector_size): 100 features per document
print(Y.shape)   # one label per document


100
(68, 100)
(68,)


In [62]:
# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [63]:
# Train a classifier (Random Forest in this case)
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train, y_train)

In [64]:
# Predict on the test set
y_pred = classifier.predict(X_test)

In [65]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 42.86%
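
The reported figure is easier to judge once the test-set size is worked out. The sketch below assumes scikit-learn's documented rounding for a fractional `test_size` (the test count is rounded up) and uses a simple binomial standard error as a rough yardstick; the variable names are illustrative.

```python
import math

n_samples = 68
test_size = 0.3

# scikit-learn rounds the test share up: ceil(0.3 * 68) = 21, leaving 47 for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# The only whole number of correct predictions that gives 42.86% on 21 samples
n_correct = round(0.4286 * n_test)
accuracy = n_correct / n_test

# Rough binomial standard error of an accuracy estimated from n_test samples
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

print(n_train, n_test, n_correct)
print(f"{accuracy * 100:.2f}% +/- {std_error * 100:.1f} points")
```

With only 21 test reviews, 50% lies within one standard error of the result, so the score is not distinguishable from chance. A corpus of 68 short reviews is likely too small for Word2Vec to learn useful embeddings.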


# End!