# Base2 Model: Lightweight Text Sentiment Analysis Using PyTorch

## Introduction
The **Base2 model** improves upon the initial sentiment analysis approach by using a **neural network** with **LSTM** (Long Short-Term Memory) layers, which are better suited to sequential data such as text. Unlike the previous model based on `TextBlob`, this model is designed to learn more complex patterns from labeled text while staying **computationally efficient** through a **lightweight tokenizer** and a **custom vocabulary** built directly from the dataset.

### Why Use LSTM?
1. **Sequential Learning**: LSTM is ideal for processing text data as it can learn the relationships between words in a sentence and capture the context necessary to understand sentiment.
2. **Memory Capabilities**: LSTM can "remember" important information across longer sequences, which is particularly useful when analyzing longer text entries.
3. **Efficiency with Lightweight Components**: By using a basic tokenizer and a custom vocabulary built from the data, we reduce computational load and memory usage while still benefiting from deep learning performance.

### Model Architecture
The architecture of this model includes the following layers:
1. **Embedding Layer**: We use a randomly initialized embedding layer whose weights are learned during training. This avoids loading pre-trained embeddings (like GloVe), which keeps the model small, although the representations are then learned from the dataset alone.
2. **LSTM Layer**: This layer processes the sequential data (text) and learns the underlying patterns that contribute to sentiment.
3. **Fully Connected Layer**: This final layer maps the LSTM's output to sentiment categories (positive, negative, neutral).
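4. **Softmax Activation**: Converts the output logits to probabilities for each sentiment class. In the code below, the model itself returns raw logits, and `nn.CrossEntropyLoss` applies the softmax internally during training.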
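
As a minimal sketch of the softmax step, assuming raw logits for a single input (the values below are made up for illustration), the conversion to probabilities could look like this:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) for one input, ordered by label id:
# 0 = negative, 1 = positive, 2 = neutral.
logits = torch.tensor([[0.2, 1.5, -0.3]])

# Softmax over the class dimension turns logits into probabilities
# that are non-negative and sum to 1.
probs = F.softmax(logits, dim=1)

# The predicted class is the same whether taken from logits or probabilities.
predicted = torch.argmax(probs, dim=1).item()
print(probs, predicted)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same predicted class.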

### Key Benefits of the Lightweight Model
- **Reduced Complexity**: By eliminating external libraries for embeddings and using a custom tokenizer and vocabulary, the model is faster and uses less memory.
- **Performance**: The model can still capture word order and context through the LSTM, and it runs on a GPU when one is available.
- **Scalability**: It can be extended to larger datasets or integrated with additional features like personalized music recommendations.

## Dataset
The dataset consists of labeled text entries (e.g., movie reviews, product reviews, social media posts) with sentiment labels:
- **Positive**
- **Negative**
- **Neutral**

The text data is preprocessed by:
1. **Tokenization**: Using a basic tokenizer that lowercases the text and splits it into words on whitespace.
2. **Padding and Truncation**: Ensuring all input sequences have the same length.
3. **Custom Vocabulary**: A lightweight vocabulary built from the dataset itself, ensuring minimal computational overhead.
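
The three preprocessing steps can be illustrated on a toy example. This is only a sketch: `toy_vocab` and `encode` are hypothetical names introduced here, and the ids 0 and 1 are assumed to be reserved for padding and unknown words.

```python
# Toy vocabulary for illustration only; ids 0 and 1 are reserved
# for padding and unknown words.
toy_vocab = {"<pad>": 0, "<unk>": 1, "i": 2, "love": 3, "this": 4}

def encode(text, vocab, max_len):
    """Lowercase, split on whitespace, map to ids, then truncate or pad."""
    tokens = text.lower().split()
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    ids = ids[:max_len]                               # truncate long inputs
    ids += [vocab["<pad>"]] * (max_len - len(ids))    # pad short inputs
    return ids

print(encode("I love this", toy_vocab, max_len=6))   # [2, 3, 4, 0, 0, 0]
print(encode("I love this!", toy_vocab, max_len=6))  # [2, 3, 1, 0, 0, 0]
```

The second call shows a limitation of whitespace tokenization: punctuation stays attached to the word, so `this!` is treated as a different (here unknown) token than `this`.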

## Training Process
The model is trained on a split dataset (training and testing), where it learns to predict sentiment from the text input. After training, we evaluate the model's performance on the test set using the following metrics:
- **Accuracy**: The percentage of correct predictions.
- **Loss**: Measures how far the predicted sentiments are from the actual labels, minimized during training.
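
A minimal sketch of how both metrics could be computed in one pass over a data loader follows. The `evaluate` helper, the stand-in linear model, and the random data are assumptions made for illustration and are not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion, device):
    """Return (mean loss per batch, accuracy in [0, 1]) over a loader."""
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)
            total_loss += criterion(logits, labels).item()
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return total_loss / len(loader), correct / total

# Stand-in model and random data, used only to show the call shape.
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_loader = DataLoader(
    TensorDataset(torch.randn(8, 4), torch.randint(0, 3, (8,))), batch_size=4
)
loss, acc = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss(), torch.device("cpu"))
print(f"loss={loss:.4f} acc={acc:.2%}")
```

Calling such a helper at the end of every epoch, rather than once after training, would make it easier to spot when the model stops improving.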

## Music Recommendation System
Once the sentiment is predicted, the app can provide personalized music recommendations based on the sentiment:
- **Positive Sentiment**: Upbeat and energizing music.
- **Negative Sentiment**: Relaxing or comforting music.
- **Neutral Sentiment**: Balanced or neutral music.

The system can deliver these recommendations in real-time, enhancing user experience based on their detected emotional state.
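
One simple way to express this mapping is a lookup table keyed by the predicted label id. The sketch below assumes the label encoding used in the notebook's code (0 = negative, 1 = positive, 2 = neutral); `RECOMMENDATIONS` and `recommend_music` are hypothetical names, and the category strings are placeholders rather than a real music catalogue.

```python
# Label ids follow the notebook: 0 = negative, 1 = positive, 2 = neutral.
# The category strings are placeholders, not a real music catalogue.
RECOMMENDATIONS = {
    0: "Relaxing or comforting music",
    1: "Upbeat and energizing music",
    2: "Balanced or neutral music",
}

def recommend_music(sentiment_id):
    """Map a predicted sentiment id to a music category.

    Unknown ids fall back to the neutral category.
    """
    return RECOMMENDATIONS.get(sentiment_id, RECOMMENDATIONS[2])

print(recommend_music(1))  # Upbeat and energizing music
```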

## Next Steps for Improvement
1. **Transformer Models**: In future iterations, models like **BERT** or **GPT** could replace LSTM to achieve higher accuracy in sentiment prediction.
2. **Multimodal Analysis**: We can combine text-based sentiment analysis with other inputs, such as **facial expressions** or **voice analysis**, to detect emotions more comprehensively.
3. **Personalized Recommendations**: The app can learn individual preferences over time and adjust its music recommendations for each user, making the experience more personalized.

## Conclusion
The **Base2 model** enhances the previous sentiment analysis by utilizing **LSTM** in a lightweight, efficient manner. This model balances **performance** and **efficiency** by using a simpler tokenizer and custom vocabulary while retaining the power of deep learning. It can be further improved and scaled by integrating advanced models or additional input modalities.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
from collections import Counter

In [2]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [3]:
# Dummy data: (text, label) pairs with labels 0 = negative, 1 = positive, 2 = neutral
data = [
    ("I love this!", 1),
    ("This is terrible.", 0),
    ("I'm feeling great today.", 1),
    ("Not happy with the product.", 0),
    ("It was an average experience.", 2)
]

In [4]:
# Basic whitespace tokenizer
def basic_tokenizer(text):
    return text.lower().split()

# Build custom vocabulary based on dataset
def build_vocab(dataset, tokenizer, max_vocab_size=5000):
    word_freq = Counter()
    for sentence, _ in dataset:
        tokens = tokenizer(sentence)
        word_freq.update(tokens)
    
    # Create vocab dict with most common words and assign indices
    vocab = {word: idx+2 for idx, (word, _) in enumerate(word_freq.most_common(max_vocab_size))}
    vocab["<pad>"] = 0  # Padding token
    vocab["<unk>"] = 1  # Unknown token
    return vocab

# Build vocabulary from data
vocab = build_vocab(data, basic_tokenizer)

In [5]:
# Dataset class for PyTorch
class SentimentDataset(Dataset):
    def __init__(self, data, tokenizer, vocab, max_len=50):
        self.data = data
        self.tokenizer = tokenizer
        self.vocab = vocab
        self.max_len = max_len
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        text, label = self.data[idx]
        tokens = self.tokenizer(text)
        indices = [self.vocab.get(token, self.vocab["<unk>"]) for token in tokens]
        if len(indices) < self.max_len:
            indices += [self.vocab["<pad>"]] * (self.max_len - len(indices))
        else:
            indices = indices[:self.max_len]
        return torch.tensor(indices), torch.tensor(label)

In [6]:
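# Dataset and DataLoader
# Note: with only five samples, test_size=0.2 leaves a single test example,
# so the reported test accuracy can only be 0% or 100%.
train_data, test_data = train_test_split(data, test_size=0.2)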
train_dataset = SentimentDataset(train_data, basic_tokenizer, vocab, max_len=50)
test_dataset = SentimentDataset(test_data, basic_tokenizer, vocab, max_len=50)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)


In [7]:
# Model definition
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        output = self.fc(lstm_out[:, -1, :])
        return output
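
The model above classifies from the LSTM output at the final time step. Since inputs are right-padded to a fixed length, that step usually corresponds to a `<pad>` token rather than the last real word. One possible variation, sketched here under the assumption that padding uses id 0 and is always appended on the right, reads the output at the last non-padding position instead. The class name `SentimentLSTMLastToken` and the `pad_idx` parameter are hypothetical and not part of the notebook.

```python
import torch
import torch.nn as nn

class SentimentLSTMLastToken(nn.Module):
    """Hypothetical variant: classify from the last non-padding time step."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        # padding_idx keeps the <pad> embedding fixed at zeros
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        lstm_out, _ = self.lstm(self.embedding(x))    # (batch, seq, hidden)
        lengths = (x != self.pad_idx).sum(dim=1)      # real tokens per row
        last = (lengths - 1).clamp(min=0)             # index of last real token
        batch_idx = torch.arange(x.size(0), device=x.device)
        return self.fc(lstm_out[batch_idx, last])     # (batch, output_dim)

# Two right-padded sequences with 3 and 2 real tokens respectively.
variant = SentimentLSTMLastToken(vocab_size=20, embedding_dim=8, hidden_dim=16, output_dim=3)
batch = torch.tensor([[2, 3, 4, 0, 0], [5, 6, 0, 0, 0]])
print(variant(batch).shape)  # torch.Size([2, 3])
```

Because the LSTM is unidirectional, the output at the last real token does not depend on the padding that follows it, so the prediction for a sentence is the same regardless of how much padding is appended.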


In [15]:
# Parameters
vocab_size = len(vocab)
embedding_dim = 50  # Reduced dimensionality
hidden_dim = 128
output_dim = 3  # For positive, negative, neutral

# Model, loss, and optimizer
model = SentimentLSTM(vocab_size, embedding_dim, hidden_dim, output_dim).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [16]:
# Training loop
for epoch in range(100):
    model.train()
    epoch_loss = 0
    for text, label in train_loader:
        text, label = text.to(device), label.to(device)
        optimizer.zero_grad()
        predictions = model(text)
        loss = criterion(predictions, label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    
    print(f"Epoch {epoch+1}: Loss = {epoch_loss/len(train_loader)}")

Epoch 1: Loss = 1.0863237380981445
Epoch 2: Loss = 1.8107515573501587
Epoch 3: Loss = 1.1190261840820312
Epoch 4: Loss = 1.089711308479309
Epoch 5: Loss = 1.1152374744415283
Epoch 6: Loss = 1.099804162979126
Epoch 7: Loss = 1.083448886871338
Epoch 8: Loss = 1.0722392797470093
Epoch 9: Loss = 1.0647034645080566
Epoch 10: Loss = 1.059110403060913
Epoch 11: Loss = 1.054405689239502
Epoch 12: Loss = 1.0501315593719482
Epoch 13: Loss = 1.0462766885757446
Epoch 14: Loss = 1.0431077480316162
Epoch 15: Loss = 1.0409505367279053
Epoch 16: Loss = 1.0399620532989502
Epoch 17: Loss = 1.0400193929672241
Epoch 18: Loss = 1.0407772064208984
Epoch 19: Loss = 1.0417776107788086
Epoch 20: Loss = 1.04258394241333
Epoch 21: Loss = 1.042914628982544
Epoch 22: Loss = 1.042709469795227
Epoch 23: Loss = 1.0420995950698853
Epoch 24: Loss = 1.0413143634796143
Epoch 25: Loss = 1.0405813455581665
Epoch 26: Loss = 1.0400559902191162
Epoch 27: Loss = 1.039795160293579
Epoch 28: Loss = 1.039772629737854
Epoch 29: Lo

In [17]:
# Evaluation
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for text, label in test_loader:
        text, label = text.to(device), label.to(device)
        outputs = model(text)
        _, predicted = torch.max(outputs, 1)
        total += label.size(0)
        correct += (predicted == label).sum().item()

print(f"Test Accuracy: {100 * correct / total}%")

Test Accuracy: 0.0%


In [18]:
# Sample Input Sentiment Analysis
def analyze_sentiment(text, max_len=50):
    # max_len should match the value used when building the training dataset
    tokens = basic_tokenizer(text)
    indices = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    if len(indices) < max_len:
        indices += [vocab["<pad>"]] * (max_len - len(indices))
    else:
        indices = indices[:max_len]
    input_tensor = torch.tensor(indices).unsqueeze(0).to(device)
    
    model.eval()
    with torch.no_grad():
        output = model(input_tensor)
        _, predicted = torch.max(output, 1)
    return predicted.item()


In [19]:
# Testing with new input
sample_input = "I'm feeling amazing today!"
sentiment = analyze_sentiment(sample_input)
print(f"Predicted Sentiment: {sentiment}")  # 1 -> Positive, 0 -> Negative, 2 -> Neutral

Predicted Sentiment: 1
