# Contextual Word Sentiment Classification

This notebook implements a contextual word sentiment classification model using the IMDb dataset. 
The goal is to classify individual words as positive, negative, or neutral based on sentence-level sentiment labels, 
while incorporating the context of neighboring words.


**Important**: At the end, write a report of adequate length, probably at least half a page, describing how you approached the task. The report should cover:
- Difficulties you encountered that stem from the method (e.g. "not enough training samples to converge"), not technical ones (e.g. "I could not install a package over pip")
- Steps you took to alleviate those difficulties
- A general description of what you did: how you understood the task and how you solved it, in plain language with no code
- Potential limitations of your approach: what could go wrong, and how it might struggle on different data or under slightly different conditions
- Any ideas for extending the approach in an interesting way


In [9]:
pip install portalocker

Collecting portalocker
  Downloading portalocker-3.0.0-py3-none-any.whl (19 kB)
Installing collected packages: portalocker
Successfully installed portalocker-3.0.0
Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install torchdata

Collecting torchdata
  Downloading torchdata-0.7.1-cp311-cp311-macosx_10_13_x86_64.whl (1.8 MB)
Installing collected packages: torchdata
Successfully installed torchdata-0.7.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install torchtext

Collecting torchtext
  Downloading torchtext-0.17.2-cp311-cp311-macosx_10_13_x86_64.whl (2.3 MB)
Installing collected packages: torchtext
Successfully installed torchtext-0.17.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Required Libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from nltk.tokenize import word_tokenize
from collections import defaultdict
from tqdm import tqdm
import nltk

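nltk.download('punkt')  # For tokenization
nltk.download('punkt_tab')  # Recent NLTK releases (including 3.9.x) look up this resource in word_tokenize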

[nltk_data] Downloading package punkt to /Users/baptiste/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
# Task 1: Load and preprocess the IMDb dataset
# IMDb dataset can be loaded from torchtext or manually via pandas
from torchtext.datasets import IMDB

# Load IMDb dataset
train_data, test_data = IMDB(split=('train', 'test'))

# Convert the data into lists
train_data = [(label, text) for label, text in train_data]
test_data = [(label, text) for label, text in test_data]

# Combine the datasets for label propagation
data = train_data + test_data

# Convert to DataFrame for easier processing
df = pd.DataFrame(data, columns=['label', 'text'])

AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)
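
The comment in the cell above mentions loading the data manually via pandas. A minimal sketch of that route follows, assuming the original IMDb archive has already been downloaded and extracted into a local folder with the usual `train/pos`, `train/neg`, `test/pos`, `test/neg` layout. The function name `load_imdb_dir`, the folder path, and the 1/0 label encoding are illustrative assumptions, not part of the assignment. Separately, this `Lock` error is often reported when `portalocker` was installed after the kernel had started, so restarting the kernel may also be worth trying.

```python
from pathlib import Path

import pandas as pd


def load_imdb_dir(root, splits=("train", "test")):
    """Read reviews from an extracted IMDb folder into a DataFrame.

    Assumes the layout <root>/<split>/<pos|neg>/*.txt.
    Labels are encoded as 1 (positive) and 0 (negative); this encoding is an assumption.
    """
    rows = []
    for split in splits:
        for folder, label in (("pos", 1), ("neg", 0)):
            # sorted() keeps the row order reproducible across runs
            for path in sorted(Path(root, split, folder).glob("*.txt")):
                rows.append({"label": label, "text": path.read_text(encoding="utf-8")})
    return pd.DataFrame(rows, columns=["label", "text"])
```

For example, `df = load_imdb_dir("aclImdb")` would produce the same `label` and `text` columns as the torchtext cell, with the folder name being whatever the archive was extracted to.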

In [None]:
## Tip: You can get the dataset from torchtext, but the package is old and needs PyTorch 2.2 to work.
## If you want to use it, pin your versions like this:
## !pip install -U torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121 torchtext
# from torchtext.datasets import IMDB

In [None]:
# Task 2: Implement tokenization and label propagation
# Implement a function to calculate sentiment scores for each word based on sentence-level labels.
# The function should propagate labels to individual words and calculate a soft score for each word.

In [None]:
# Hint: You can use word_tokenize for tokenization
# Hint: You can use a dictionary to store counts of positive and negative labels for each word.
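
One possible reading of Task 2 and the two hints is sketched below. It counts, for each word, how many positive and negative reviews it appears in, then turns the counts into a soft score in [-1, 1]. The regex tokenizer is a dependency-free stand-in (`word_tokenize` can be passed in instead), and the function names, the 1/0 label encoding, and the 0.3 threshold are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def simple_tokenize(text):
    """Lowercase and split into word tokens (stand-in for nltk's word_tokenize)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compute_word_scores(texts, labels, tokenize=simple_tokenize, min_count=1):
    """Propagate sentence-level labels to words.

    labels: 1 for positive, 0 for negative (assumed encoding; remap if yours differs).
    Returns {word: score} where score = (pos - neg) / (pos + neg), in [-1, 1].
    """
    counts = defaultdict(lambda: [0, 0])  # word -> [positive count, negative count]
    for text, label in zip(texts, labels):
        # Count each word once per review so repetitions do not dominate
        for word in set(tokenize(text)):
            counts[word][0 if label == 1 else 1] += 1

    scores = {}
    for word, (pos, neg) in counts.items():
        total = pos + neg
        if total >= min_count:  # drop rare words whose score would be unreliable
            scores[word] = (pos - neg) / total
    return scores


def score_to_class(score, threshold=0.3):
    """Map a soft score to 0 = negative, 1 = neutral, 2 = positive."""
    if score > threshold:
        return 2
    if score < -threshold:
        return 0
    return 1
```

With the DataFrame from Task 1, this could be called as `compute_word_scores(df["text"], df["label"], min_count=5)` once the labels are mapped to 1/0. Raising `min_count` trades vocabulary coverage for more stable scores.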

# Task 3: Prepare data for contextual learning
# Implement a class to create a dataset with context windows. 
# Each data point should include the word embedding for the target word, 
# as well as an averaged embedding of the context words in a defined window size.

# Use a pre-trained embedding model like GloVe. Download the embeddings and load them into a dictionary.
# Example: {"word": embedding_vector}
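
A small loader for the `{"word": embedding_vector}` dictionary described above might look like this. It assumes the plain-text GloVe format, where each line holds a token followed by space-separated floats. The function name `load_glove`, the `max_words` parameter, and the file path are hypothetical.

```python
import numpy as np


def load_glove(path, max_words=None):
    """Load GloVe-style text embeddings into a {word: np.ndarray} dictionary.

    max_words optionally limits how many lines are read, which keeps
    memory use down while experimenting.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_words is not None and i >= max_words:
                break
            parts = line.rstrip().split(" ")
            # First field is the token, the rest are the vector components
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
```

Usage would be along the lines of `embedding_model = load_glove("glove.6B.100d.txt", max_words=100_000)`, where the path points to wherever the downloaded file was saved.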

# Class signature example:
# class WordContextDataset(Dataset):
#     def __init__(self, df, word_scores, embedding_model, window_size=2):
#         # Your code here
#         pass
    
#     def __len__(self):
#         # Your code here
#         pass
    
#     def __getitem__(self, idx):
#         # Your code here
#         pass
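
The class signature above could be filled in roughly as follows. This is a sketch under several assumptions: `word_scores` maps words to soft scores in [-1, 1], `embedding_model` is a `{word: vector}` dictionary, scores are bucketed into three classes with a hypothetical threshold of 0.3, and the extra `tokenize` and `threshold` parameters are additions that are not in the original signature. All samples are built eagerly, which may need subsampling on the full dataset.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class WordContextDataset(Dataset):
    def __init__(self, df, word_scores, embedding_model, window_size=2,
                 tokenize=str.split, threshold=0.3):
        self.samples = []
        dim = len(next(iter(embedding_model.values())))
        for text in df["text"]:
            tokens = [t.lower() for t in tokenize(text)]
            for i, word in enumerate(tokens):
                # Skip words without a score or an embedding
                if word not in word_scores or word not in embedding_model:
                    continue
                left = tokens[max(0, i - window_size):i]
                right = tokens[i + 1:i + 1 + window_size]
                ctx = [embedding_model[w] for w in left + right if w in embedding_model]
                # Average the context embeddings; fall back to zeros if none are known
                ctx_vec = np.mean(ctx, axis=0) if ctx else np.zeros(dim, dtype=np.float32)
                features = np.concatenate([embedding_model[word], ctx_vec]).astype(np.float32)
                score = word_scores[word]
                label = 2 if score > threshold else 0 if score < -threshold else 1
                self.samples.append((features, label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        features, label = self.samples[idx]
        return torch.from_numpy(features), torch.tensor(label, dtype=torch.long)
```

Each item is a vector of twice the embedding dimension (target word followed by averaged context) and a class index where 0 = negative, 1 = neutral, 2 = positive. Note that in this sketch the label depends only on the word, while the context enters through the features.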

In [None]:
# Task 4: Define and train the model
# Define a neural network for sentiment classification using PyTorch.
# The network should take an input vector of concatenated word and context embeddings.

In [None]:
# Example:
# class SentimentClassifier(nn.Module):
#     def __init__(self, input_dim):
#         super(SentimentClassifier, self).__init__()
#         # Your code here
    
#     def forward(self, x):
#         # Your code here
#         pass
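
As one way to complete the example above, here is a small feed-forward network. The hidden size, dropout rate, and three-class output are assumptions chosen for illustration; `input_dim` would be twice the embedding dimension when the word and context embeddings are concatenated.

```python
import torch
import torch.nn as nn


class SentimentClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim=128, num_classes=3, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # Returns raw logits; nn.CrossEntropyLoss applies log-softmax internally
        return self.net(x)
```

With 100-dimensional embeddings, `SentimentClassifier(input_dim=200)` maps a batch of shape `(batch, 200)` to logits of shape `(batch, 3)`.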

# Implement a training loop to train the model on the dataset created.
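
A basic training loop for this setup could look like the sketch below. It assumes the dataset yields `(features, label)` pairs with integer class labels, and uses Adam with cross-entropy loss. The function name `train_model` and all hyperparameter defaults are illustrative choices, not values prescribed by the task.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_model(model, dataset, epochs=5, batch_size=64, lr=1e-3, device="cpu"):
    """Train the model and return the mean training loss per epoch."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)

    history = []
    for epoch in range(epochs):
        model.train()
        total_loss, n_samples = 0.0, 0
        for features, labels in loader:
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()
            # Weight by batch size so the epoch average is per sample
            total_loss += loss.item() * labels.size(0)
            n_samples += labels.size(0)
        history.append(total_loss / n_samples)
    return history
```

Returning the per-epoch loss makes it easy to check whether training is converging, which is one of the method-related difficulties the report asks about.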

In [None]:
# Task 5: Evaluate the model
# Evaluate the trained model on a validation set.
# Use metrics such as precision, recall, and F1-score.

In [None]:
# Example code to evaluate the model:
# with torch.no_grad():
#     # Predict on validation data and calculate metrics
#     pass
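
The evaluation outline above might be fleshed out like this, using `classification_report` from scikit-learn for precision, recall, and F1-score. The function name `evaluate_model` and the class order (0 = negative, 1 = neutral, 2 = positive) are assumptions and need to match however the labels were encoded.

```python
import torch
from sklearn.metrics import classification_report


def evaluate_model(model, loader, device="cpu"):
    """Run the model over a validation loader and return a text report."""
    model.eval()
    preds, targets = [], []
    with torch.no_grad():
        for features, labels in loader:
            logits = model(features.to(device))
            preds.extend(logits.argmax(dim=1).cpu().tolist())
            targets.extend(labels.tolist())
    return classification_report(
        targets,
        preds,
        labels=[0, 1, 2],
        target_names=["negative", "neutral", "positive"],
        zero_division=0,  # avoid warnings when a class is never predicted
    )
```

Calling `print(evaluate_model(model, val_loader))` on a held-out `DataLoader` shows per-class metrics, which helps reveal whether the neutral class dominates the predictions.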

# Optional: Experiment with hyperparameters or model architecture to improve performance.
# Examples: Try different window sizes, embedding dimensions, or additional layers in the model.