# Contextual Word Sentiment Classification

This notebook implements a contextual word sentiment classification model using the IMDb dataset. 
The goal is to classify individual words as positive, negative, or neutral based on sentence-level sentiment labels, 
while incorporating the context of neighboring words.


**Important**: At the end you should write a report of adequate size, which will probably mean at least half a page. In the report you should describe how you approached the task. You should describe:
- Encountered difficulties (due to the method, e.g. "not enough training samples to converge", not technical like "I could not install a package over pip")
- Steps taken to alleviate difficulties
- General description of what you did, explain how you understood the task and what you did to solve it in general language, no code.
- Potential limitations of your approach, what could be issues, how could this be hard on different data or with slightly different conditions
- If you have an idea how this could be extended in an interesting way, describe it.


In [1]:
# Required Libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from nltk.tokenize import word_tokenize
from collections import defaultdict
from tqdm import tqdm
import nltk

nltk.download('punkt')  # For tokenization

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/waelbenslima/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
#TASK 1 TEST FINAL DONE DONT TOUCH 
from keras.datasets import imdb

(X_train, y_train), (X_test, y_test) = imdb.load_data()

print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

25000 train sequences
25000 test sequences


In [5]:
print(X_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [7]:
INDEX_FROM = 3
word_index = imdb.get_word_index()
word_index = {key:(value+INDEX_FROM) for key,value in word_index.items()}
word_index["<PAD>"] = 0    # the padding token
word_index["<START>"] = 1  # the starting token
word_index["<UNK>"] = 2    # the unknown token
reverse_word_index = {value:key for key, value in word_index.items()}

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

decode_review(X_train[0])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and sh

In [9]:
vocab_size = 5000 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words= vocab_size)

X_train, X_val = X_train[:-5000], X_train[-5000:]
y_train, y_val = y_train[:-5000], y_train[-5000:]

print(len(X_train), 'train sequences')
print(len(X_val), 'val sequences')
print(len(X_test), 'test sequences')



20000 train sequences
5000 val sequences
25000 test sequences


In [None]:
## Tip. You can get the dataset from torchtext but the package is old and needs pytorch version 2.2 to work
## If you want to use it choose your versions like this: 
## !pip install -U torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121 torchtext
# from torchtext.datasets import IMDB

In [None]:
#TASK 2 & Maybe 3 ? TEST FINAL DONT TOUCH 
from keras.preprocessing.sequence import pad_sequences

maximum_sequence_length = 500 # maximum length of all review sequences

X_train = pad_sequences(X_train, value= word_index["<PAD>"], padding= 'post', maxlen= maximum_sequence_length)
X_val = pad_sequences(X_val, value= word_index["<PAD>"], padding= 'post', maxlen= maximum_sequence_length)
X_test = pad_sequences(X_test, value= word_index["<PAD>"], padding= 'post', maxlen= maximum_sequence_length)

print('X_train shape:', X_train.shape) # (n_samples, n_timesteps)
print('X_val shape:', X_val.shape)
print('X_test shape:', X_test.shape)



In [None]:
pip install scikeras


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from sklearn.model_selection import ParameterGrid
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np

# Define embedding dimension and other constants
embedding_dim = 16
vocab_size = 10000  # Example value
maximum_sequence_length = 100  # Example value

# Define dummy data for testing
X_train = np.random.randint(1, vocab_size, size=(1000, maximum_sequence_length))
y_train = np.random.randint(0, 2, size=(1000,))
X_val = np.random.randint(1, vocab_size, size=(200, maximum_sequence_length))
y_val = np.random.randint(0, 2, size=(200,))

# Define the model-building function
def create_model(filters=64, kernel_size=3, strides=1, units=256, 
                 optimizer='adam', rate=0.25, kernel_initializer='glorot_uniform'):
    model = Sequential()
    # Embedding layer
    model.add(Embedding(vocab_size, embedding_dim))
    # Convolutional Layer(s)
    model.add(Dropout(rate))
    model.add(Conv1D(filters=filters, kernel_size=kernel_size, strides=strides, 
                     padding='same', activation='relu'))
    model.add(GlobalMaxPooling1D())
    # Dense layer(s)
    model.add(Dense(units=units, activation='relu', kernel_initializer=kernel_initializer))
    model.add(Dropout(rate))
    # Output layer
    model.add(Dense(1, activation='sigmoid'))  # Ensure binary classification output
    
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model


# Wrap the model using KerasClassifier
from scikeras.wrappers import KerasClassifier
model = KerasClassifier(model=create_model, verbose=1)


# Define hyperparameter grid with 'model__' prefix
param_grid = {
    'model__filters': [64, 128],
    'model__kernel_size': [3, 5],
    'model__strides': [1],
    'model__units': [128, 256],
    'model__optimizer': ['adam', 'rmsprop'],
    'model__rate': [0.25, 0.5],
    'model__kernel_initializer': ['glorot_uniform', 'he_normal'],
    'batch_size': [32, 64],
    'epochs': [5, 10]
}

# Exhaustive Grid Search
grid = ParameterGrid(param_grid)
param_scores = []

for params in grid:
    print(f"Testing parameters: {params}")
    
    # Set model parameters
    model.set_params(**params)
    
    # Early stopping
    earlystopper = EarlyStopping(monitor='val_accuracy', patience=2, verbose=1)
    
    # Fit the model
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        callbacks=[earlystopper]
    )
    
    # Store the validation accuracy
    val_accuracy = history.history['val_accuracy'][-1]
    param_scores.append(val_accuracy)
    print(f"Validation Accuracy: {val_accuracy}")
    print('-' * 80)

# Select the best parameters
best_index = np.argmax(param_scores)
best_params = list(grid)[best_index]

print(f"Best Parameters: {best_params}")
print(f"Best Validation Accuracy: {param_scores[best_index]}")


In [None]:
# Task 2: Implement tokenization and label propagation
# Implement a function to calculate sentiment scores for each word based on sentence-level labels.
# The function should propagate labels to individual words and calculate a soft score for each word.

In [None]:
# Hint: You can use word_tokenize for tokenization
# Hint: You can use a dictionary to store counts of positive and negative labels for each word.

# Task 3: Prepare data for contextual learning
# Implement a class to create a dataset with context windows. 
# Each data point should include the word embedding for the target word, 
# as well as an averaged embedding of the context words in a defined window size.

# Use a pre-trained embedding model like GloVe. Download the embeddings and load them into a dictionary.
# Example: {"word": embedding_vector}

# Class signature example:
# class WordContextDataset(Dataset):
#     def __init__(self, df, word_scores, embedding_model, window_size=2):
#         # Your code here
#         pass
    
#     def __len__(self):
#         # Your code here
#         pass
    
#     def __getitem__(self, idx):
#         # Your code here
#         pass

In [11]:
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Load the dataset
vocab_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

# Split training and validation sets
X_train, X_val = X_train[:-5000], X_train[-5000:]
y_train, y_val = y_train[:-5000], y_train[-5000:]

# Pad sequences
maximum_sequence_length = 500
X_train = pad_sequences(X_train, value=0, padding='post', maxlen=maximum_sequence_length)
X_val = pad_sequences(X_val, value=0, padding='post', maxlen=maximum_sequence_length)
X_test = pad_sequences(X_test, value=0, padding='post', maxlen=maximum_sequence_length)

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")

# Define the model creation function
def create_model(filters=64, kernel_size=3, strides=1, units=256, 
                 optimizer='adam', rate=0.25, kernel_initializer='glorot_uniform'):
    model = Sequential()
    # Embedding layer
    model.add(Embedding(input_dim=vocab_size, output_dim=16, input_length=maximum_sequence_length))

    # Convolutional Layer(s)
    model.add(Dropout(rate))
    model.add(Conv1D(filters=filters, kernel_size=kernel_size, strides=strides, 
                     padding='same', activation='relu'))
    model.add(GlobalMaxPooling1D())
    # Dense layer(s)
    model.add(Dense(units=units, activation='relu', kernel_initializer=kernel_initializer))
    model.add(Dropout(rate))
    # Output layer
    model.add(Dense(1, activation='sigmoid'))  # Single unit for binary classification
    
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Wrap the model using KerasClassifier
from scikeras.wrappers import KerasClassifier

# Example: pass the parameters directly during initialization
model = KerasClassifier(model=create_model,
                        filters=128,
                        kernel_size=5,
                        optimizer='adam',
                        units=128,
                        rate=0.25,
                        kernel_initializer='TruncatedNormal',
                        verbose=1)

from sklearn.model_selection import ParameterGrid
param_grid = {
    'filters': [64, 128],
    'kernel_size': [3, 5],
    'optimizer': ['adam', 'rmsprop'],
    'units': [128, 256],
    'rate': [0.25, 0.5],
    'kernel_initializer': ['glorot_uniform', 'he_normal'],
    'batch_size': [32, 64],
    'epochs': [5, 10]
}

# Iterate over grid
grid = ParameterGrid(param_grid)
param_scores = []

for params in grid:
    print(f"Testing parameters: {params}")
    
    # Initialize the model with the parameters from the grid
    model = KerasClassifier(model=create_model, **params, verbose=1)
    
    # Early stopping
    earlystopper = EarlyStopping(monitor='val_accuracy', patience=2, verbose=1)
    
    # Fit the model
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        callbacks=[earlystopper],
        epochs=params['epochs'],
        batch_size=params['batch_size']
    )
    
    # Store the validation accuracy
    val_accuracy = history.history['val_accuracy'][-1]
    param_scores.append(val_accuracy)
    print(f"Validation Accuracy: {val_accuracy}")
    print('-' * 80)

# Select the best parameters
best_index = np.argmax(param_scores)
best_params = list(grid)[best_index]

print(f"Best Parameters: {best_params}")
print(f"Best Validation Accuracy: {param_scores[best_index]}")

X_train shape: (20000, 500)
X_val shape: (5000, 500)
X_test shape: (25000, 500)
Testing parameters: {'batch_size': 32, 'epochs': 5, 'filters': 64, 'kernel_initializer': 'glorot_uniform', 'kernel_size': 3, 'optimizer': 'adam', 'rate': 0.25, 'units': 128}




ValueError: Sequential model 'sequential' has no defined outputs yet.

In [None]:
# Task 4: Define and train the model
# Define a neural network for sentiment classification using PyTorch.
# The network should take an input vector of concatenated word and context embeddings.

In [None]:
# Example:
# class SentimentClassifier(nn.Module):
#     def __init__(self, input_dim):
#         super(SentimentClassifier, self).__init__()
#         # Your code here
    
#     def forward(self, x):
#         # Your code here
#         pass

# Implement a training loop to train the model on the dataset created.

In [None]:
# Task 5: Evaluate the model
# Evaluate the trained model on a validation set.
# Use metrics such as precision, recall, and F1-score.

In [None]:
# Example code to evaluate the model:
# with torch.no_grad():
#     # Predict on validation data and calculate metrics
#     pass

# Optional: Experiment with hyperparameters or model architecture to improve performance.
# Examples: Try different window sizes, embedding dimensions, or additional layers in the model.

In [None]:
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.callbacks import EarlyStopping
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np

# Load the dataset
vocab_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

# Split training and validation sets
X_train, X_val = X_train[:-5000], X_train[-5000:]
y_train, y_val = y_train[:-5000], y_train[-5000:]

# Pad sequences
maximum_sequence_length = 500
X_train = pad_sequences(X_train, value=0, padding='post', maxlen=maximum_sequence_length)
X_val = pad_sequences(X_val, value=0, padding='post', maxlen=maximum_sequence_length)
X_test = pad_sequences(X_test, value=0, padding='post', maxlen=maximum_sequence_length)

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")



In [None]:
# Define the model creation function
def create_model(filters=64, kernel_size=3, strides=1, units=256, 
                 optimizer='adam', rate=0.25, kernel_initializer='glorot_uniform'):
    model = Sequential()
    # Embedding layer
    model.add(Embedding(input_dim=vocab_size, output_dim=16, input_length=maximum_sequence_length))

    # Convolutional Layer(s)
    model.add(Dropout(rate))
    model.add(Conv1D(filters=filters, kernel_size=kernel_size, strides=strides, 
                     padding='same', activation='relu'))
    model.add(GlobalMaxPooling1D())
    # Dense layer(s)
    model.add(Dense(units=units, activation='relu', kernel_initializer=kernel_initializer))
    model.add(Dropout(rate))
    # Output layer
    model.add(Dense(1, activation='sigmoid'))  # Single unit for binary classification
    
    # Compile the model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Wrap the model using KerasClassifier
from scikeras.wrappers import KerasClassifier

# Example: pass the parameters directly during initialization
model = KerasClassifier(model=create_model,
                        filters=128,
                        kernel_size=5,
                        optimizer='adam',
                        units=128,
                        rate=0.25,
                        kernel_initializer='TruncatedNormal',
                        verbose=1)




In [None]:
from sklearn.model_selection import ParameterGrid
param_grid = {
    'filters': [64, 128],
    'kernel_size': [3, 5],
    'optimizer': ['adam', 'rmsprop'],
    'units': [128, 256],
    'rate': [0.25, 0.5],
    'kernel_initializer': ['glorot_uniform', 'he_normal'],
    'batch_size': [32, 64],
    'epochs': [5, 10]
}

# Iterate over grid
grid = ParameterGrid(param_grid)
param_scores = []

for params in grid:
    print(f"Testing parameters: {params}")
    
    # Initialize the model with the parameters from the grid
    model = KerasClassifier(model=create_model, **params, verbose=1)
    
    # Early stopping
    earlystopper = EarlyStopping(monitor='val_accuracy', patience=2, verbose=1)
    
    # Fit the model
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        callbacks=[earlystopper]
    )
    
    # Store the validation accuracy
    val_accuracy = history.history['val_accuracy'][-1]
    param_scores.append(val_accuracy)
    print(f"Validation Accuracy: {val_accuracy}")
    print('-' * 80)

# Select the best parameters
best_index = np.argmax(param_scores)
best_params = list(grid)[best_index]

print(f"Best Parameters: {best_params}")
print(f"Best Validation Accuracy: {param_scores[best_index]}")


In [None]:
model.set_params(**best_params)
model.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))