# Contextual Word Sentiment Classification


In [1]:
import pandas as pd    # to load dataset
import numpy as np     # for mathematic equation
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
import re

In [3]:
data = pd.read_csv('IMDB Dataset.csv')

print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]



<b>Stop words</b> are words that occur very frequently in a language (e.g. "the", "a", "an", "of") and carry little meaning on their own; search engines are usually programmed to ignore them.

<i>Declaring the English stop words</i>

In [6]:
english_stops = set(stopwords.words('english'))

<hr>

### TASK 1 : Load and Clean Dataset

In the original dataset, the reviews are still dirty: they contain HTML tags, numbers, uppercase letters, and punctuation. This is not good for training, so in the <b>load_dataset()</b> function, besides loading the dataset with <b>pandas</b>, I also preprocess the reviews by removing HTML tags, non-alphabetic characters (punctuation and numbers), and stop words, and by lowercasing all of the reviews.

### Encode Sentiments
In the same function, I also encode the sentiments as integers: 0 for negative sentiments and 1 for positive sentiments.

In [9]:
def load_dataset():
    df = pd.read_csv('IMDB Dataset.csv')
    x_data = df['review']       # Reviews/Input
    y_data = df['sentiment']    # Sentiment/Output

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove html tag
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # remove non alphabet
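    # NOTE: stop words are filtered before lowercasing, so capitalised stop words
    # such as 'A', 'The' or 'I' do not match the lowercase NLTK list and survive
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case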
    
    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.replace('positive', 1)
    y_data = y_data.replace('negative', 0)

    return x_data, y_data

x_data, y_data = load_dataset()

print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64
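
In the output above, words such as `a`, `the` and `i` survive because stop words are filtered before the text is lowercased, so `A`, `The` and `I` never match the lowercase stop list. A minimal sketch of a variant that lowercases first is given below. The function name `clean_review` and the tiny `STOP_WORDS` set are illustrative assumptions standing in for the NLTK list; they are not part of the notebook. Tags are replaced by a space here (rather than an empty string) so that words on either side of a tag are not glued together.

```python
import re

# Small illustrative stop list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "an", "the", "i", "of", "is", "this", "was"}

def clean_review(text, stop_words=STOP_WORDS):
    """Strip HTML tags and non-letters, lowercase, then drop stop words."""
    text = re.sub(r"<.*?>", " ", text)        # remove html tags
    text = re.sub(r"[^A-Za-z]", " ", text)    # keep letters only
    tokens = text.lower().split()             # lowercase BEFORE filtering
    return [w for w in tokens if w not in stop_words]

print(clean_review("A wonderful little production. <br /><br />The filming is great!"))
# ['wonderful', 'little', 'production', 'filming', 'great']
```

If the training data were cleaned this way, the review typed in at prediction time would need the same treatment so that both go through identical preprocessing.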




### Split Dataset
In this work, I split the data into an 80% training set and a 20% test set using the <b>train_test_split</b> function from Scikit-Learn, which shuffles the dataset by default. Shuffling guards against any ordering in the source file (for example, a block of positive reviews followed by a block of negative ones), so that the training and test sets each contain a comparable mix of both classes.
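
To make the effect of shuffling concrete, here is a minimal standard-library sketch of a shuffled 80/20 split on toy data that is sorted by label. It is not how Scikit-Learn implements `train_test_split`; the helper name `shuffled_split` and the fixed `seed` are assumptions made for illustration.

```python
import random

def shuffled_split(items, labels, test_size=0.2, seed=0):
    """Shuffle paired items/labels together, then cut off a test_size share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data sorted by label: five positives followed by five negatives.
reviews = [f"review {i}" for i in range(10)]
labels = [1] * 5 + [0] * 5
x_tr, x_te, y_tr, y_te = shuffled_split(reviews, labels)
print(len(x_tr), len(x_te))   # 8 2
```

With the real function, passing `random_state` makes the split reproducible and `stratify=y_data` keeps the class ratio identical in both sets.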

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

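# The first two prints show the inputs (train, then test);
# the last two show the labels (train, then test)
print('Train Set')
print(x_train, '\n')
print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)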

Train Set
24487    [end, game, started, well, least, said, end, b...
34343    [i, recently, seen, an, zhan, not, hong, kong,...
29313    [i, mean, really, really, really, high, movie,...
42994    [the, lovely, danish, actress, sonja, richter,...
38206    [i, watched, love, life, holiday, filmed, film...
                               ...                        
26381    [look, poor, robert, webber, character, great,...
49198    [virginal, innocent, indri, finds, house, pros...
16617    [whatever, merits, film, poorly, researched, a...
18793    [for, movie, plot, like, i, would, normally, s...
44577    [what, dire, film, i, cannot, believe, i, actu...
Name: review, Length: 40000, dtype: object 

18401    [excellent, performance, mary, kay, place, ste...
26641    [i, loved, complete, savages, why, cancel, any...
42933    [a, cranky, police, detective, suspects, frenc...
4897     [this, film, really, quite, odd, clearly, cert...
11999    [minutes, movie, i, hyperventilating, shaking,...
 

<hr>
<i>Function that returns the sequence length used for padding and truncation. Despite its name, it computes the mean length of the training reviews (using <b>numpy.mean</b>), rounded up, rather than the maximum.</i>

In [15]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))

<hr>

### TASK 2 : Tokenize and Pad/Truncate Reviews
A neural network only accepts numeric data, so we need to encode the reviews. I use <b>tensorflow.keras.preprocessing.text.Tokenizer</b> to encode the reviews as integers, where each unique word is automatically indexed (using the <b>fit_on_texts</b> method) based on <b>x_train</b>. <br>
<b>x_train</b> and <b>x_test</b> are then converted into integer sequences using the <b>texts_to_sequences</b> method.
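
The idea behind the two methods can be sketched in a few lines of plain Python. This is only an approximation of the behaviour I understand the Keras tokenizer to have (indices ranked by word frequency starting at 1, index 0 left free for padding, unknown words skipped when no OOV token is set); `fit_word_index` and `to_sequences` are hypothetical helper names.

```python
from collections import Counter

def fit_word_index(token_lists):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    # most_common() orders by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(token_lists, word_index):
    """Replace each known word by its index; unseen words are dropped."""
    return [[word_index[w] for w in tokens if w in word_index] for tokens in token_lists]

train = [["good", "movie"], ["bad", "movie"]]
index = fit_word_index(train)
print(index)                                            # {'movie': 1, 'good': 2, 'bad': 3}
print(to_sequences([["good", "plot", "movie"]], index))  # [[2, 1]]
```

Because the index is built from the training reviews only, a word such as `plot` that appears only in the test data simply disappears from the encoded sequence.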

Each review has a different length, so we need to pad (by adding 0) or truncate the sequences to the same length (in this case, the mean length of the training reviews) using <b>tensorflow.keras.preprocessing.sequence.pad_sequences</b>.


<b>post</b>: pad or truncate at the end of a sequence<br>
<b>pre</b>: pad or truncate at the start of a sequence
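
The difference between the two modes is easiest to see on a short sequence. The helper below is a plain-Python sketch of the behaviour described above, not the Keras implementation; `pad_or_truncate` is a hypothetical name.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Force a sequence to exactly maxlen items by cutting or filling it."""
    if len(seq) > maxlen:
        # 'post' drops items from the end, 'pre' drops them from the start
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    missing = [value] * (maxlen - len(seq))
    # 'post' appends the fill values, 'pre' prepends them
    return seq + missing if padding == "post" else missing + seq

print(pad_or_truncate([5, 8, 2], 5))                             # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2], 5, padding="pre"))              # [0, 0, 5, 8, 2]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], 4))                    # [5, 8, 2, 9]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], 4, truncating="pre"))  # [2, 9, 4, 7]
```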

In [18]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need to lowercase here: load_dataset() already did
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[  54  372  571 ...    0    0    0]
 [   1  941   38 ...    0    0    0]
 [   1  287   15 ...    0    0    0]
 ...
 [ 742 5566    4 ...    0    0    0]
 [ 206    3   41 ...    0    0    0]
 [ 105 4010    4 ...    0    0    0]] 

Encoded X Test
 [[  226   146  1184 ...     0     0     0]
 [    1   339   500 ...    88   765     1]
 [   39 13820   485 ...     0     0     0]
 ...
 [    1    93   358 ...     0     0     0]
 [ 1649     3   184 ...     0     0     0]
 [    8     3   940 ...   910  1346     8]] 

Maximum review length:  131


### TASK 3 : Load GloVe Embeddings and Build the Word-Context Dataset

In [36]:
import requests
import io
import zipfile
import numpy as np
from torch.utils.data import Dataset
import pandas as pd
import torch

# Step 1: Load GloVe embeddings into a dictionary
def load_glove_embeddings(url, embedding_dim=50):
    response = requests.get(url)
    
    if response.status_code == 200:
        with zipfile.ZipFile(io.BytesIO(response.content)) as zip_ref:
            with zip_ref.open(f'glove.6B.{embedding_dim}d.txt') as f:
                embedding_model = {}
                for line in f:
                    values = line.decode('utf-8').split()
                    word = values[0]
                    embedding_vector = np.array(values[1:], dtype='float32')
                    embedding_model[word] = embedding_vector
                print("GloVe embeddings loaded successfully!")
                return embedding_model
    else:
        print(f"Failed to download the GloVe file. Status code: {response.status_code}")
        return None

# Step 2: Define the dataset class
class WordContextDataset(Dataset):
    def __init__(self, df, word_scores, embedding_model, window_size=2):
        """
        Args:
        - df: A pandas DataFrame containing the text data.
        - word_scores: A dictionary containing words and their corresponding scores.
        - embedding_model: A dictionary containing word embeddings (e.g., GloVe).
        - window_size: Size of the context window around the target word.
        """
        if embedding_model is None:
            raise ValueError("Embedding model is None. Please check if the GloVe embeddings were loaded correctly.")
        
        self.texts = df['text'].tolist()
        self.word_scores = word_scores
        self.embedding_model = embedding_model
        self.window_size = window_size
        self.embedding_dim = next(iter(self.embedding_model.values())).shape[0]
        
        # Create word index
        self.word_index = {word: i for i, word in enumerate(self.embedding_model.keys())}
        self.word_index['<PAD>'] = len(self.word_index)  # Add padding token
        
        # Prepare dataset
        self.data = self._prepare_dataset()

    def _prepare_dataset(self):
        data = []
        for text in self.texts:
            tokens = text.split()
            for i, target_word in enumerate(tokens):
                if target_word not in self.embedding_model:
                    continue
                
                # Get context window
                start = max(i - self.window_size, 0)
                end = min(i + self.window_size + 1, len(tokens))
                context_words = tokens[start:i] + tokens[i+1:end]

                # Compute embeddings
                target_embedding = self.embedding_model.get(target_word, np.zeros(self.embedding_dim))
                context_embeddings = [
                    self.embedding_model.get(word, np.zeros(self.embedding_dim)) 
                    for word in context_words
                ]
                if context_embeddings:
                    averaged_context_embedding = np.mean(context_embeddings, axis=0)
                else:
                    averaged_context_embedding = np.zeros(self.embedding_dim)
                
                # Add to dataset
                score = self.word_scores.get(target_word, 0)
                data.append({
                    'target_word': target_word,
                    'target_embedding': target_embedding,
                    'context_embedding': averaged_context_embedding,
                    'score': score
                })
        return data

    def __len__(self):
        return len(self.data)

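    def __getitem__(self, idx):
        sample = self.data[idx]
        # NOTE: sample['context_embedding'] is the averaged float vector, not a list of
        # words, so every lookup in the comprehension falls back to <PAD>. The input is
        # therefore the target index followed by embedding_dim copies of the <PAD> index.
        return {
            'input': torch.tensor([self.word_index[sample['target_word']]] + [self.word_index.get(word, self.word_index['<PAD>']) for word in sample['context_embedding']]).float(),
            'target': torch.tensor(sample['score']).float()
        }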

# Example usage
if __name__ == "__main__":
    # Corrected URL
    url = "https://nlp.stanford.edu/data/glove.6B.zip"

    # Load GloVe embeddings
    embedding_model = load_glove_embeddings(url, 50)
    
    # Check if embeddings were loaded
    if embedding_model is not None:
        # Example DataFrame
        data = {
            "text": [
                "the leader in natural language processing",
                "deep learning is a prime example of AI"
            ]
        }
        df = pd.DataFrame(data)

        # Example word scores
        word_scores = {
            "leader": 1.0,
            "natural": 0.9,
            "language": 0.8,
            "deep": 0.7,
            "learning": 0.6
        }

        # Create the dataset
        dataset = WordContextDataset(df, word_scores, embedding_model, window_size=2)

        # Example access
        print(f"Dataset size: {len(dataset)}")
        sample = dataset[0]
        print(f"Input: {sample['input']}")
        print(f"Target: {sample['target']}")
    else:
        print("Failed to create dataset due to missing embeddings.")

GloVe embeddings loaded successfully!
Dataset size: 13
Input: tensor([     0., 400000., 400000., 400000., 400000., 400000., 400000., 400000.,
        400000., 400000., 400000., 400000., 400000., 400000., 400000., 400000.,
        400000., 400000., 400000., 400000., 400000., 400000., 400000., 400000.,
        400000., 400000., 400000., 400000., 400000., 400000., 400000., 400000.,
        400000., 400000., 400000., 400000., 400000., 400000., 400000., 400000.,
        400000., 400000., 400000., 400000., 400000., 400000., 400000., 400000.,
        400000., 400000., 400000.])
Target: 0.0
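
The printed input is index 0 followed by fifty copies of 400000 because `__getitem__` iterates over the averaged context vector instead of over words, so every lookup falls back to the `<PAD>` index. One possible alternative, sketched below with plain lists and toy 2-dimensional vectors, is to concatenate the target embedding with the averaged context embedding. The helper names `average_vectors` and `make_input` and the toy embeddings are assumptions for illustration only.

```python
def average_vectors(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def make_input(target_word, context_words, embeddings, dim):
    """Concatenate the target embedding with the averaged context embedding."""
    zero = [0.0] * dim
    target = embeddings.get(target_word, zero)
    context = average_vectors([embeddings.get(w, zero) for w in context_words], dim)
    return list(target) + context   # length 2 * dim

# Toy 2-dimensional embeddings, made up for illustration.
toy = {"the": [1.0, 0.0], "leader": [0.0, 1.0], "in": [1.0, 1.0]}
print(make_input("leader", ["the", "in"], toy, dim=2))   # [0.0, 1.0, 1.0, 0.5]
```

A vector built this way holds float features, so it would feed a layer that accepts floats directly rather than an `nn.Embedding`, which expects integer indices.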


<hr>

### TASK 4 : Build Architecture/Model
<b>Embedding Layer</b>: in simple terms, it maps each word in the <i>word_index</i> to a dense vector. During training, these vectors are adjusted so that words used in similar contexts end up with similar vectors.

<b>LSTM Layer</b>: decides which information to keep or discard by considering the current input, the previous output, and the previous memory. The important components of an LSTM are:
<ul>
    <li><b>Forget Gate</b>: decides which information in the cell state is kept and which is thrown away</li>
    <li><b>Input Gate</b>: decides which new information is written to the cell state, by passing the previous output and the current input through a sigmoid activation function</li>
    <li><b>Cell State</b>: the previous cell state is multiplied by the forget vector (values multiplied by a number near 0 are dropped), and the output of the input gate is then added to give the new cell state</li>
    <li><b>Output Gate</b>: decides the next hidden state, which is also used for predictions</li>
</ul>
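
The interplay of the gates can be followed in a single LSTM step written out by hand. The sketch below uses scalars instead of vectors (hidden size 1) and made-up weights, so it only illustrates the standard update equations; `lstm_step` and the `weights` dictionary are hypothetical names, not part of the notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar input, hidden state and cell state.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # share of the old cell state to keep
    i = sigmoid(pre("input"))         # share of the candidate to write
    g = math.tanh(pre("candidate"))   # candidate value for the cell state
    o = sigmoid(pre("output"))        # share of the cell state to expose
    c = f * c_prev + i * g            # new cell state
    h = o * math.tanh(c)              # new hidden state
    return h, c

# Made-up weights, identical for every gate.
weights = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=weights)
print(round(h, 4), round(c, 4))
```

Setting the forget gate's bias to a large negative value drives `f` towards 0, which wipes the previous cell state regardless of its value.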

<b>Dense Layer</b>: multiplies its input by a weight matrix, adds an (optional) bias, and applies an activation function. I use the <b>Sigmoid</b> activation function for this work because the output is a single value between 0 and 1.

The optimizer is <b>Adam</b> and the loss function is <b>Binary Crossentropy</b> because, again, the label is binary (0 or 1).
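
As a small worked example of how the sigmoid output and the binary cross-entropy loss fit together, the sketch below computes the loss for one prediction against both possible labels. The raw score of 2.0 is an arbitrary value chosen for illustration, and the clipping constant `eps` is an assumption rather than the value any particular library uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample: -[y*log(p) + (1-y)*log(1-p)]."""
    p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                              # raw score 2.0 -> probability
print(round(p, 4))                            # 0.8808
print(round(binary_cross_entropy(1, p), 4))   # 0.1269 (label agrees with the prediction)
print(round(binary_cross_entropy(0, p), 4))   # 2.1269 (label contradicts the prediction)
```

A confident prediction that turns out wrong costs far more than a confident prediction that turns out right, which is what pushes the output towards the correct label during training.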

In [43]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd

# Define the dataset class
class WordContextDataset(Dataset):
    def __init__(self, df, word_scores, embedding_model, window_size=2):
        self.texts = df['text'].tolist()
        self.word_scores = word_scores
        self.embedding_model = embedding_model
        self.window_size = window_size
        self.embedding_dim = next(iter(self.embedding_model.values())).shape[0]
        self.word_index = {word: i for i, word in enumerate(self.embedding_model.keys())}
        self.data = self._prepare_dataset()

    def _prepare_dataset(self):
        data = []
        for text in self.texts:
            tokens = text.split()
            for i, target_word in enumerate(tokens):
                if target_word not in self.embedding_model:
                    continue
                # Get context window
                start = max(i - self.window_size, 0)
                end = min(i + self.window_size + 1, len(tokens))
                context_words = tokens[start:i] + tokens[i+1:end]
                # Compute embeddings
                target_embedding = self.embedding_model.get(target_word, np.zeros(self.embedding_dim))
                context_embeddings = np.array([self.embedding_model.get(word, np.zeros(self.embedding_dim)) for word in context_words])
                if context_embeddings.size == 0:
                    averaged_context_embedding = np.zeros(self.embedding_dim)
                else:
                    averaged_context_embedding = np.mean(context_embeddings, axis=0)
                # Add to dataset
                score = self.word_scores.get(target_word, 0)
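                # NOTE: target_word and context_words are strings, and torch.tensor cannot build
                # a long tensor from them; this line raises the ValueError shown below the cell.
                # The words would first have to be mapped to integers, e.g. through self.word_index.
                data.append((torch.tensor([target_word] + context_words, dtype=torch.long), torch.tensor(score, dtype=torch.float)))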
        return data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Define the neural network architecture
class SentimentClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, lstm_out):
        super(SentimentClassifier, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_out, batch_first=True)
        self.fc = nn.Linear(lstm_out, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.embedding(x)  # (batch_size, sequence_length, embedding_dim)
        lstm_out, _ = self.lstm(x)
        out = self.fc(lstm_out[:, -1, :])  # Take the last time step
        out = self.sigmoid(out)
        return out

# Prepare the dataset and data loader
batch_size = 32
dataset = WordContextDataset(df, word_scores, embedding_model, window_size=2)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Hyperparameters
EMBED_DIM = 100  # Size of the nn.Embedding vectors trained from scratch below; the 50-d GloVe vectors are not copied into this layer
LSTM_OUT = 64
learning_rate = 0.001
epochs = 5

# Get the number of unique words from the dataset's word_index attribute
input_dim = len(dataset.word_index)

# Initialize the model
model = SentimentClassifier(input_dim, EMBED_DIM, LSTM_OUT)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    for batch_idx, (inputs, targets) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets.unsqueeze(1).float())
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print(f'Epoch {epoch+1}, Batch {batch_idx}, Loss: {loss.item()}')

# Save the model
torch.save(model.state_dict(), 'models/SentimentClassifier.pth')

# Evaluation
model.eval()
with torch.no_grad():
    review = input('Movie Review: ')
    preprocessed_review = preprocess_text(review)  # You need to implement this function
    inputs = torch.tensor(preprocessed_review, dtype=torch.long)  # Ensure inputs are long tensors
    inputs = inputs.unsqueeze(0)  # Add batch dimension
    outputs = model(inputs)
    sentiment = 'positive' if outputs.item() >= 0.7 else 'negative'
    print(f'The review is {sentiment}')

ValueError: too many dimensions 'str'
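
The `ValueError` comes from handing a list of strings to `torch.tensor`. A minimal sketch of one way around it is to map every word to its vocabulary index first and to pad each window to a fixed length of `2 * window_size + 1`, so that samples taken near the start or end of a sentence can still be stacked into a batch. The helper `encode_window`, the `PAD` constant and the toy vocabulary are assumptions for illustration.

```python
PAD = "<PAD>"

def encode_window(tokens, i, word_index, window_size=2):
    """Indices for the target word followed by its context, padded to a fixed length."""
    start = max(i - window_size, 0)
    end = min(i + window_size + 1, len(tokens))
    words = [tokens[i]] + tokens[start:i] + tokens[i + 1:end]
    pad_id = word_index[PAD]
    ids = [word_index.get(w, pad_id) for w in words]   # unknown words map to <PAD>
    return ids + [pad_id] * (2 * window_size + 1 - len(ids))

# Toy vocabulary; in the notebook the index would come from the GloVe keys.
vocab = ["the", "leader", "in", "natural", "language"]
word_index = {w: k for k, w in enumerate(vocab)}
word_index[PAD] = len(word_index)

tokens = "the leader in natural language processing".split()
print(encode_window(tokens, 0, word_index))   # [0, 1, 2, 5, 5]
print(encode_window(tokens, 3, word_index))   # [3, 1, 2, 4, 5]
```

A list of integers like this can be passed to `torch.tensor(..., dtype=torch.long)`, and the embedding layer then needs one extra row for the `<PAD>` index.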

In [24]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the neural network architecture for sentiment classification
class SentimentClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, lstm_out):
        super(SentimentClassifier, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_out, batch_first=True)
        self.fc = nn.Linear(lstm_out, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch_size, sequence_length) tensor of integer word indices
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        out = self.fc(lstm_out[:, -1, :])  # Take the last time step's output
        out = self.sigmoid(out)
        return out

# Calculate the input dimension for the model
total_words = len(embedding_model)  # This should be the number of unique words in your embedding model
input_dim = total_words
embedding_dim = EMBED_DIM
lstm_out = LSTM_OUT

# Initialize the model
model = SentimentClassifier(input_dim, embedding_dim, lstm_out)



None


<hr>

### Training
Training is simple: we only need to fit the model on our <b>x_train</b> (input) and <b>y_train</b> (output/label) data. I use mini-batch learning with a <b>batch_size</b> of <i>128</i> and <i>5</i> <b>epochs</b>.

Also, I added a callback called **checkpoint** that saves the model locally after every epoch in which the training accuracy improved over the best value seen so far.

In [31]:
checkpoint = ModelCheckpoint(
    'models/LSTM.keras',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [33]:
model.fit(x_train, y_train, batch_size = 128, epochs = 5, callbacks=[checkpoint])

Epoch 1/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 180ms/step - accuracy: 0.5382 - loss: 0.6831
Epoch 1: accuracy improved from -inf to 0.54752, saving model to models/LSTM.keras
313/313 ━━━━━━━━━━━━━━━━━━━━ 65s 182ms/step - accuracy: 0.5382 - loss: 0.6830
Epoch 2/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 179ms/step - accuracy: 0.6073 - loss: 0.6527
Epoch 2: accuracy improved from 0.54752 to 0.66230, saving model to models/LSTM.keras
313/313 ━━━━━━━━━━━━━━━━━━━━ 57s 181ms/step - accuracy: 0.6075 - loss: 0.6525
Epoch 3/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 178ms/step - accuracy: 0.5854 - loss: 0.6685
Epoch 3: accuracy did not improve from 0.66230
313/313 ━━━━━━━━━━━━━━━━━━━━ 56s 178ms/step - accuracy: 0.5855 - loss: 0.6685
Epoch 4/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 177ms/ste

<keras.src.callbacks.history.History at 0x22402cdf5f0>
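
The `313/313` counter in the log is the number of mini-batches per epoch: 40000 training reviews divided by a batch size of 128, rounded up because the last batch is only partly filled. The same arithmetic explains the `79/79` shown when predicting on the 10000 test reviews. A short check, with `steps_per_epoch` as a hypothetical helper name:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(40000, 128))   # 313
print(steps_per_epoch(10000, 128))   # 79
```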

<hr>

### Testing
To evaluate the model, we predict the sentiment of the <b>x_test</b> data and compare the predictions with <b>y_test</b> (the expected output). The accuracy is the number of correct predictions divided by the total number of test samples. The reported accuracy is <b>86.63%</b>; note that the run recorded below shows 50.37% instead.

In [42]:
import numpy as np

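# Use model.predict() to make predictions
y_pred = model.predict(x_test, batch_size=128)

# Use np.argmax() to get the index of the predicted class
# NOTE: if the model has a single sigmoid output (as the prediction printed further
# down suggests), y_pred has shape (n, 1), so argmax over axis 1 is always 0 and
# every review is counted as negative
y_pred_classes = np.argmax(y_pred, axis=1)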

true = 0
for i, y in enumerate(y_test):
    if y == y_pred_classes[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred_classes) - true))
print('Accuracy: {}'.format(true/len(y_pred_classes)*100))

79/79 ━━━━━━━━━━━━━━━━━━━━ 6s 71ms/step
Correct Prediction: 5037
Wrong Prediction: 4963
Accuracy: 50.370000000000005
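
An accuracy this close to 50% on a balanced test set is what one would expect if every review received the same label. That is what taking the argmax over a single-column output does: the only available index is 0. The sketch below contrasts argmax with thresholding on made-up probabilities, using plain lists instead of NumPy arrays; `argmax_rows`, `threshold_rows` and the sample values are assumptions for illustration.

```python
def argmax_rows(rows):
    """Index of the largest value in each row, like np.argmax(rows, axis=1)."""
    return [max(range(len(r)), key=r.__getitem__) for r in rows]

def threshold_rows(rows, threshold=0.5):
    """Class 1 when the single sigmoid output reaches the threshold, else 0."""
    return [int(r[0] >= threshold) for r in rows]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

probs = [[0.91], [0.12], [0.67], [0.30]]   # made-up sigmoid outputs, shape (4, 1)
labels = [1, 0, 1, 0]

print(argmax_rows(probs))                        # [0, 0, 0, 0]
print(accuracy(labels, argmax_rows(probs)))      # 0.5
print(accuracy(labels, threshold_rows(probs)))   # 1.0
```

With NumPy, the thresholded version of the evaluation cell could be written as `(y_pred >= 0.5).astype(int).ravel()`; the threshold itself is a choice, and the prediction section of this notebook uses 0.7.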


---

### Load Saved Model

Load saved model and use it to predict a movie review statement's sentiment (positive or negative).

In [44]:
loaded_model = load_model('models/LSTM.keras')

Receives a review as an input to be predicted

In [49]:
review = str(input('Movie Review: '))

Movie Review:  I liked the movie. All the scenes were perfectly done. The story is so good. The actors were professional


The input must be preprocessed before it is passed to the model for prediction.

In [51]:
# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
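# Same order as in load_dataset(): stop words are removed before lowercasing,
# so capitalised ones such as 'All' and 'The' remain
filtered = [w for w in words if w not in english_stops]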
filtered = ' '.join(filtered)
filtered = [filtered.lower()]

print('Filtered: ', filtered)

Cleaned:  I liked the movie All the scenes were perfectly done The story is so good The actors were professional
Filtered:  ['i liked movie all scenes perfectly done the story good the actors professional']


Once again, we need to tokenize and encode the words. I reuse the tokenizer fitted earlier so that the words are encoded with the same indices the model was trained on.

In [53]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[   1  325    3  196   60  814  125    2   15    9    2   68 1528    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0]]


This is the result of the prediction which shows the **confidence score** of the review statement.

In [55]:
result = loaded_model.predict(tokenize_words)
print(result)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 545ms/step
[[0.50359666]]


If the confidence score is close to 0, the statement is **negative**; if it is close to 1, the statement is **positive**. I use a threshold of **0.7**: a score equal to or greater than 0.7 is classified as **positive**, and a score below 0.7 as **negative**.

In [57]:
if result >= 0.7:
    print('positive')
else:
    print('negative')

negative


#### Difficulties Encountered
- One of the main difficulties was a compatibility issue between TorchText and our Python version. This required a shift in strategy: we used Kaggle and Pandas for data loading, which was not the initial plan.
- Input data with inconsistent shapes.
- A Keras problem: when fitting the model, Keras reported that the Sequential model has no outputs yet. We looked for the issue on Stack Overflow and GitHub but did not get a response, so we continued to debug it ourselves.

#### Steps taken to alleviate difficulties
To address the TorchText and Python version compatibility issue, we downloaded the dataset directly from Kaggle and imported it into Pandas for preprocessing. This change in strategy removed the initial dependency on TorchText. For the challenge of data with varying shapes, I implemented data preprocessing techniques such as padding and truncation to normalize the input data, ensuring that all sequences were of equal length and suitable for model training. Stack Overflow also helped while we were debugging errors.

#### General description of what you did, explain how you understood the task and what you did to solve it in general language, no code.
For this task, we built a model that could classify the sentiment of individual words based on the sentiment of the sentences they appear in. To do so, we first prepared the data by tokenizing the text into words and sentences, and then padded the sentences to a uniform length to standardize the input for the model. Then, we developed a neural network model designed to process sequences of words and learn the contextual relationships that influence word sentiment. The model was trained on the IMDb dataset, using the sentence-level sentiment labels to teach it to classify words as positive, negative, or neutral. After training, the model was evaluated to assess its ability to accurately classify word sentiments in new sentences.

#### Potential limitations of our approach, what could be issues, how could this be hard on different data or with slightly different conditions
- **Data Bias**: The IMDb dataset may contain biases that could affect the model's performance on different types of data or under slightly different conditions.
- **Context Length**: The model might struggle with longer contexts or sentences where the relationship between words and sentiment is more complex.
- **Vocabulary Size**: Limitations in the model's vocabulary size could lead to misclassifications of words that are not well-represented in the training data.

#### Extensions and Future Work
An interesting extension of this work could involve exploring different neural network architectures, such as transformers, which are known for their ability to handle long-range dependencies in text. Additionally, incorporating multimodal data, such as movie scenes or audio, could provide richer context for sentiment analysis. 
- This approach could also be extended to detect hate speech or discriminatory comments on social media platforms more effectively.
- It could also be applied to hotels, restaurants and products, whose grades (usually given as stars) are based on positive reviews.
