
#D7047E Lab 1
##Group 25
Antonino Davolos, Christos Michail, Felix Hessinger, Sandra Sandström, Tom Andersson

##GPU and memory info on Colab

In [2]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Sat Apr  5 06:29:34 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   48C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

##Task 1.1. Simple model, without/with embeddings
###ANN: 5000 X 64 X 2
###Without embeddings (TF-IDF): single words and word pairs as features, weighted by how distinctive they are for a sentence
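
As a rough illustration of what these features look like, the sketch below computes TF-IDF weights for single words and word pairs on a three-sentence toy corpus in plain Python. The corpus and the helper names `ngrams` and `tfidf` are made up for illustration. The smoothed IDF formula and the L2 normalization are assumed to mirror the defaults of scikit-learn's `TfidfVectorizer`, which the notebook uses for the real features.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf(corpus):
    """Hypothetical helper: TF-IDF over unigrams and bigrams, L2-normalized per sentence."""
    docs = [Counter(ngrams(sentence.split())) for sentence in corpus]
    n_docs = len(docs)
    # Document frequency: in how many sentences does each term occur?
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (assumed formula): rarer terms get a larger weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: tf * idf[t] for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows, idf

toy_corpus = ["good phone", "bad phone", "good battery"]
toy_rows, toy_idf = tfidf(toy_corpus)
```

In this toy corpus "bad" occurs in one sentence and "phone" in two, so "bad" receives the larger weight in the second sentence. The word pair "bad phone" is a feature of its own.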

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
from matplotlib import pyplot
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import word_tokenize
import nltk
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report

nltk.download('punkt_tab')
nltk.download('stopwords')

def preprocess_pandas(data, columns):
    df_ = pd.DataFrame(columns=columns)
    stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per token
    data['Sentence'] = data['Sentence'].str.lower()
    # Raw strings avoid invalid-escape warnings; '-' goes last in the character class so it is a literal
    data['Sentence'] = data['Sentence'].replace(r'[a-zA-Z0-9_.-]+@[a-zA-Z0-9_.-]+', '', regex=True)                      # remove emails
    data['Sentence'] = data['Sentence'].replace(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', regex=True)    # remove IP address
    data['Sentence'] = data['Sentence'].str.replace(r'[^\w\s]', '', regex=True)                                          # remove special characters
    data['Sentence'] = data['Sentence'].replace(r'\d', '', regex=True)                                                   # remove numbers
    for index, row in data.iterrows():
        word_tokens = word_tokenize(row['Sentence'])
        filtered_sent = [w for w in word_tokens if w not in stop_words]
        df_.loc[len(df_)] = {
            "index": row['index'],
            "Class": row['Class'],
            "Sentence": " ".join(filtered_sent)
        }
    return df_

# === Load and Preprocess Data ===
data = pd.read_csv("amazon_cells_labelled.txt", delimiter='\t', header=None)
data.columns = ['Sentence', 'Class']
data['index'] = data.index
columns = ['index', 'Class', 'Sentence']
data = preprocess_pandas(data, columns)

# === Split Data ===
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    data['Sentence'].values.astype('U'),
    data['Class'].values.astype('int32'),
    test_size=0.1,
    random_state=42,
    shuffle=True
)

# === TF-IDF Vectorization ===
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=5000)
train_features = vectorizer.fit_transform(train_sentences).todense()
val_features = vectorizer.transform(val_sentences).todense()

# === Convert to Tensors ===
train_x = torch.tensor(np.array(train_features)).float()
train_y = torch.tensor(np.array(train_labels)).long()
val_x = torch.tensor(np.array(val_features)).float()
val_y = torch.tensor(np.array(val_labels)).long()

train_dataset = TensorDataset(train_x, train_y)
val_dataset = TensorDataset(val_x, val_y)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# === Define a Simple ANN ===
class SimpleANN(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super(SimpleANN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, 2)  # Two output classes

    def forward(self, x):
        out = self.relu(self.fc1(x))
        return self.fc2(out)

# === Initialize and Train ===
model = SimpleANN(input_dim=train_x.shape[1])
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# === Training Loop ===
for epoch in range(5):
    model.train()
    total_loss = 0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

# === Evaluate ===
model.eval()
with torch.no_grad():
    val_preds = []
    val_true = []
    for inputs, targets in val_loader:
        outputs = model(inputs)
        preds = torch.argmax(outputs, dim=1)
        val_preds.extend(preds.tolist())
        val_true.extend(targets.tolist())

print(f"\nValidation Accuracy: {accuracy_score(val_true, val_preds):.4f}")
print(classification_report(val_true, val_preds, zero_division=0))


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Epoch 1, Loss: 20.0793
Epoch 2, Loss: 18.9057
Epoch 3, Loss: 15.6980
Epoch 4, Loss: 10.9726
Epoch 5, Loss: 6.8162

Validation Accuracy: 0.7800
              precision    recall  f1-score   support

           0       0.76      0.82      0.79        50
           1       0.80      0.74      0.77        50

    accuracy                           0.78       100
   macro avg       0.78      0.78      0.78       100
weighted avg       0.78      0.78      0.78       100



###Model with word embeddings and sentence embeddings based on average pooling
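
Average pooling turns a (batch, seq_len, embed_dim) tensor of word embeddings into one vector per sentence. A plain mean over the sequence axis also counts the padding positions, so short sentences are scaled down. The sketch below shows a masked variant that divides by the number of real tokens instead. It uses made-up token ids and toy sizes, and it is an illustration rather than the pooling used in the notebook's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_IDX = 0  # same convention as the notebook: id 0 is <PAD>

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=PAD_IDX)

# Two toy sentences: the first has 2 real tokens and 2 pads, the second has 4 real tokens.
batch = torch.tensor([[3, 5, 0, 0],
                      [2, 7, 8, 9]])

embedded = embedding(batch)                           # (batch, seq_len, embed_dim)
mask = (batch != PAD_IDX).unsqueeze(-1).float()       # (batch, seq_len, 1), 1.0 for real tokens
lengths = mask.sum(dim=1).clamp(min=1.0)              # (batch, 1), avoids division by zero

masked_mean = (embedded * mask).sum(dim=1) / lengths  # mean over real tokens only
plain_mean = embedded.mean(dim=1)                     # mean over all positions, pads included
```

Because the `<PAD>` embedding is a zero vector, the two means differ only by a length-dependent scale: for the first toy sentence the plain mean is half the masked mean.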

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from collections import Counter
import re
from tqdm import tqdm

# === Download required resources ===
nltk.download('punkt')
nltk.download('stopwords')

# === Load Data ===
df = pd.read_csv("amazon_cells_labelled_LARGE_25K.txt", delimiter="\t", header=None, names=["Sentence", "Class"])

# === Clean and Tokenize ===
STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', '', text)  # remove emails
    text = re.sub(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', text)  # remove IP addresses
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return tokens

df['tokens'] = df['Sentence'].apply(clean_text)

# === Build Vocabulary ===
all_tokens = [token for sentence in df['tokens'] for token in sentence]
vocab_counts = Counter(all_tokens)
vocab = {word: idx + 2 for idx, (word, _) in enumerate(vocab_counts.items())}
vocab['<PAD>'] = 0
vocab['<UNK>'] = 1

# === Encode Sentences ===
def encode_sentence(tokens, vocab, max_len=50):
    encoded = [vocab.get(word, vocab['<UNK>']) for word in tokens]
    if len(encoded) < max_len:
        encoded += [vocab['<PAD>']] * (max_len - len(encoded))
    else:
        encoded = encoded[:max_len]
    return encoded

df['encoded'] = df['tokens'].apply(lambda x: encode_sentence(x, vocab))

# === Split Data ===
X_train, X_val, y_train, y_val = train_test_split(
    df['encoded'].tolist(), df['Class'].tolist(), test_size=0.1, random_state=42
)

# === Create Dataset ===
class ReviewDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = ReviewDataset(X_train, y_train)
val_dataset = ReviewDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)

# === Define Model with Trainable Embeddings ===
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=64):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        embedded = self.embedding(x)  # (batch_size, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)  # Average over sequence
        x = self.relu(self.fc1(pooled))
        return self.fc2(x)

# === Initialize Model ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TextClassifier(vocab_size=len(vocab)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# === Train Model ===
epochs = 5
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch_X, batch_y in tqdm(train_loader):
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

# === Evaluate ===
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch_X, batch_y in val_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        outputs = model(batch_X)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch_y.cpu().numpy())

print(f"\nValidation Accuracy: {accuracy_score(all_labels, all_preds):.4f}")
print(classification_report(all_labels, all_preds, zero_division=0))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Epoch 1/5, Loss: 187.0947
Epoch 2/5, Loss: 127.6191
Epoch 3/5, Loss: 103.4851
Epoch 4/5, Loss: 86.3975
Epoch 5/5, Loss: 72.7769

Validation Accuracy: 0.8592
              precision    recall  f1-score   support

           0       0.82      0.81      0.82       965
           1       0.88      0.89      0.89      1535

    accuracy                           0.86      2500
   macro avg       0.85      0.85      0.85      2500
weighted avg       0.86      0.86      0.86      2500
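
The vocabulary and padding steps used for the embedding model can be sketched on a toy example. This is a minimal sketch: the function names `build_vocab` and `encode` and the `min_count` parameter are illustrative assumptions. Unlike the cell above, which builds the vocabulary from the full dataset, the sketch builds it from training sentences only, so that unseen words map to `<UNK>`.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(tokenized_sentences, min_count=1):
    """Map each sufficiently frequent word to an integer id; 0 and 1 are reserved."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len=5):
    """Convert tokens to ids, truncate to max_len, then right-pad with the <PAD> id."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

train_tokens = [["good", "phone"], ["bad", "battery", "bad", "phone"]]
vocab = build_vocab(train_tokens)

encoded_seen = encode(["good", "phone"], vocab)       # every word is in the vocabulary
encoded_unseen = encode(["great", "battery"], vocab)  # "great" was never seen in training
```

Here `encoded_seen` is `[2, 3, 0, 0, 0]` and `encoded_unseen` is `[1, 5, 0, 0, 0]`: the unseen word falls back to the `<UNK>` id and short sentences are filled up with the `<PAD>` id.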






###Model with LSTM classifier
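The LSTM model takes word order into account, but it needed several adjustments (a bidirectional layer, dropout, shorter padding and a smaller batch size) to reach a reasonable validation accuracy (0.8176). Further testing showed that reading the input in reverse order was enough to do better: the model in the next cell classifies from the backward-direction state only and reaches 0.8360. Neither variant beats average pooling (0.8592), which suggests that order sensitivity adds little for this task. Average pooling is both simpler and more accurate.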
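
One factor that may contribute here, though the notebook does not test it, is that sentences are padded at the end, so a forward LSTM reads a run of `<PAD>` tokens before it emits its final hidden state. A common remedy is to pack the batch so that the LSTM stops at each sentence's true length. The sketch below uses toy sizes and made-up token ids to show `pack_padded_sequence` in this role; it is not part of the experiments reported in this notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
PAD_IDX = 0

embedding = nn.Embedding(10, 4, padding_idx=PAD_IDX)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# First sentence: 2 real tokens followed by 2 pads. Second sentence: 4 real tokens.
batch = torch.tensor([[3, 5, 0, 0],
                      [2, 7, 8, 9]])
lengths = (batch != PAD_IDX).sum(dim=1)  # true length of each sentence

embedded = embedding(batch)

# Packed: the LSTM stops after the last real token of each sentence.
packed = pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)
_, (hidden_packed, _) = lstm(packed)

# Unpacked: the LSTM also steps through the padding positions.
_, (hidden_padded, _) = lstm(embedded)

# Reference: run the LSTM on the first sentence without any padding.
_, (hidden_short, _) = lstm(embedding(batch[:1, :2]))
```

With packing, the final state of the short sentence equals the state obtained by running the LSTM on the unpadded sentence. For the full-length sentence, packing makes no difference.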

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from collections import Counter
import re
from tqdm import tqdm

# === Setup ===
nltk.download('punkt')
nltk.download('stopwords')

# === Load and Preprocess Data ===
df = pd.read_csv("amazon_cells_labelled_LARGE_25K.txt", delimiter="\t", header=None, names=["Sentence", "Class"])

STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', '', text)  # remove emails
    text = re.sub(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', text)  # remove IPs
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return tokens

df['tokens'] = df['Sentence'].apply(clean_text)

# === Build Vocabulary ===
all_tokens = [token for sentence in df['tokens'] for token in sentence]
vocab_counts = Counter(all_tokens)
vocab = {word: idx + 2 for idx, (word, _) in enumerate(vocab_counts.items())}
vocab['<PAD>'] = 0
vocab['<UNK>'] = 1

# === Encode Tokens ===
MAX_LEN = 40  # reduce padding length
def encode_sentence(tokens, vocab, max_len=MAX_LEN):
    encoded = [vocab.get(word, vocab['<UNK>']) for word in tokens]
    return encoded[:max_len] + [vocab['<PAD>']] * (max_len - len(encoded))

df['encoded'] = df['tokens'].apply(lambda x: encode_sentence(x, vocab))

# === Split Dataset ===
X_train, X_val, y_train, y_val = train_test_split(
    df['encoded'].tolist(), df['Class'].tolist(), test_size=0.1, random_state=42
)

# === Dataset Class ===
class ReviewDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = ReviewDataset(X_train, y_train)
val_dataset = ReviewDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # reduced batch size
val_loader = DataLoader(val_dataset, batch_size=32)

# === LSTM Model ===
class OptimizedLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=128, dropout_p=0.3):
        super(OptimizedLSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout_p)
        self.fc = nn.Linear(hidden_dim * 2, 2)  # *2 for bidirectional

    def forward(self, x):
        embedded = self.embedding(x)                     # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)             # hidden: (2, batch, hidden_dim)
        combined = torch.cat((hidden[0], hidden[1]), dim=1)  # concat both directions
        output = self.dropout(combined)
        return self.fc(output)

# === Training Setup ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = OptimizedLSTMClassifier(vocab_size=len(vocab)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# === Training Loop ===
EPOCHS = 10
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    for batch_X, batch_y in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} Loss: {total_loss:.4f}")

# === Evaluation ===
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch_X, batch_y in val_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        outputs = model(batch_X)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch_y.cpu().numpy())

print(f"\nValidation Accuracy: {accuracy_score(all_labels, all_preds):.4f}")
print(classification_report(all_labels, all_preds, zero_division=0))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Epoch 1 Loss: 328.5784
Epoch 2 Loss: 226.1252
Epoch 3 Loss: 164.5635
Epoch 4 Loss: 105.6348
Epoch 5 Loss: 59.8263
Epoch 6 Loss: 32.4524
Epoch 7 Loss: 33.0505
Epoch 8 Loss: 10.9100
Epoch 9 Loss: 10.7391
Epoch 10 Loss: 8.3295

Validation Accuracy: 0.8176
              precision    recall  f1-score   support

           0       0.79      0.73      0.75       965
           1       0.84      0.88      0.85      1535

    accuracy                           0.82      2500
   macro avg       0.81      0.80      0.80      2500
weighted avg       0.82      0.82      0.82      2500
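
The losses printed during training are sums over all batches in an epoch, so they depend on the batch size and are not directly comparable between runs. Dividing by the number of batches gives a mean per-batch loss. The sketch below does this for the first-epoch losses of the average-pooling run (batch size 64) and of this LSTM run (batch size 32), assuming 22,500 training sentences (90% of the 25K file). The helper name `mean_batch_loss` is hypothetical.

```python
import math

TRAIN_SIZE = 22_500  # assumption: 90% of the 25,000 labelled sentences

def mean_batch_loss(summed_loss, batch_size, train_size=TRAIN_SIZE):
    """Convert a loss summed over an epoch into a mean loss per batch."""
    num_batches = math.ceil(train_size / batch_size)  # the last batch may be smaller
    return summed_loss / num_batches

avg_pool_epoch1 = mean_batch_loss(187.0947, batch_size=64)  # 352 batches
lstm_epoch1 = mean_batch_loss(328.5784, batch_size=32)      # 704 batches

print(f"average pooling: {avg_pool_epoch1:.4f}, LSTM: {lstm_epoch1:.4f}")
```

On this scale both models start at a similar per-batch loss, even though the summed values differ by almost a factor of two.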



###LSTM with just reversed order of input.
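
In PyTorch, a bidirectional `nn.LSTM` returns one final hidden state per direction: index 0 for the forward pass and index 1 (also reachable as `hidden[-1]`) for the backward pass, which reads the sentence from the last token to the first. The short check below, with random inputs and toy sizes, illustrates which state is which. It is a sketch for orientation rather than part of the experiment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN_DIM = 3

lstm = nn.LSTM(input_size=4, hidden_size=HIDDEN_DIM, batch_first=True, bidirectional=True)
inputs = torch.randn(2, 5, 4)  # (batch, seq_len, input_size)

output, (hidden, _) = lstm(inputs)
# output: (batch, seq_len, 2 * HIDDEN_DIM), forward and backward features concatenated
# hidden: (2, batch, HIDDEN_DIM), one final state per direction

# The forward direction finishes at the last time step.
forward_final = output[:, -1, :HIDDEN_DIM]
# The backward direction finishes at the first time step.
backward_final = output[:, 0, HIDDEN_DIM:]
```

`hidden[0]` matches `forward_final` and `hidden[-1]` matches `backward_final`, so a classifier fed with `hidden[-1]` sees the sentence as read in reverse.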

In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from collections import Counter
import re
from tqdm import tqdm

# === Setup ===
nltk.download('punkt')
nltk.download('stopwords')

# === Load Data ===
df = pd.read_csv("amazon_cells_labelled_LARGE_25K.txt", delimiter="\t", header=None, names=["Sentence", "Class"])

# === Text Cleaning and Tokenization ===
STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return tokens

df['tokens'] = df['Sentence'].apply(clean_text)

# === Vocabulary Building ===
all_tokens = [token for sentence in df['tokens'] for token in sentence]
vocab_counts = Counter(all_tokens)
vocab = {word: idx + 2 for idx, (word, _) in enumerate(vocab_counts.items())}
vocab['<PAD>'] = 0
vocab['<UNK>'] = 1

# === Encode Tokens ===
def encode_sentence(tokens, vocab, max_len=50):
    encoded = [vocab.get(word, vocab['<UNK>']) for word in tokens]
    if len(encoded) < max_len:
        encoded += [vocab['<PAD>']] * (max_len - len(encoded))
    else:
        encoded = encoded[:max_len]
    return encoded

df['encoded'] = df['tokens'].apply(lambda x: encode_sentence(x, vocab))

# === Split Dataset ===
X_train, X_val, y_train, y_val = train_test_split(
    df['encoded'].tolist(), df['Class'].tolist(), test_size=0.1, random_state=42
)

# === Dataset Class ===
class ReviewDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = ReviewDataset(X_train, y_train)
val_dataset = ReviewDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

# === LSTM-based Classifier ===
class TextClassifierLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=64):
        super(TextClassifierLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim, 2)

    def forward(self, x):
        embedded = self.embedding(x)                # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)        # hidden: (2, batch, hidden_dim) because the LSTM is bidirectional
        pooled = hidden[-1]                         # final state of the backward direction (reads the sentence in reverse)
        return self.fc(pooled)

# === Training Setup ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TextClassifierLSTM(vocab_size=len(vocab)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# === Training Loop ===
epochs = 10
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch_X, batch_y in tqdm(train_loader):
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}")

# === Evaluation ===
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch_X, batch_y in val_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        outputs = model(batch_X)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch_y.cpu().numpy())

print(f"\nValidation Accuracy: {accuracy_score(all_labels, all_preds):.4f}")
print(classification_report(all_labels, all_preds, zero_division=0))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Epoch 1/10, Loss: 334.5021
Epoch 2/10, Loss: 221.6548
Epoch 3/10, Loss: 160.0234
Epoch 4/10, Loss: 107.6234
Epoch 5/10, Loss: 63.9831
Epoch 6/10, Loss: 35.9184
Epoch 7/10, Loss: 20.1777
Epoch 8/10, Loss: 11.9338
Epoch 9/10, Loss: 8.2903
Epoch 10/10, Loss: 6.0371

Validation Accuracy: 0.8360
              precision    recall  f1-score   support

           0       0.80      0.77      0.78       965
           1       0.86      0.88      0.87      1535

    accuracy                           0.84      2500
   macro avg       0.83      0.82      0.83      2500
weighted avg       0.84      0.84      0.84      2500



##Task 1.2. Transformer model, without/with BERT pretraining
###No pretraining
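
`nn.TransformerEncoderLayer` expects inputs shaped (seq_len, batch, embed_dim) unless `batch_first=True` is passed, and it only ignores padding when it is given a `src_key_padding_mask`. The sketch below, with toy sizes and made-up token ids, shows a batch-first encoder that masks `<PAD>` positions both in attention and in the final average pooling. It is an illustrative variant, not the configuration that produced the results in this section; the helper name `encode_and_pool` is hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_IDX = 0
EMBED_DIM = 8

embedding = nn.Embedding(10, EMBED_DIM, padding_idx=PAD_IDX)
layer = nn.TransformerEncoderLayer(
    d_model=EMBED_DIM, nhead=2, dim_feedforward=16,
    dropout=0.0,       # no dropout, so the toy example is deterministic
    batch_first=True,  # inputs are (batch, seq_len, embed_dim)
)
encoder = nn.TransformerEncoder(layer, num_layers=1, enable_nested_tensor=False)

input_ids = torch.tensor([[3, 5, 0, 0],
                          [2, 7, 8, 9]])
padding_mask = input_ids == PAD_IDX  # True where attention should ignore the position

def encode_and_pool(x):
    """Encode with padding-aware attention, then average over real tokens only."""
    encoded = encoder(x, src_key_padding_mask=padding_mask)
    keep = (~padding_mask).unsqueeze(-1).float()
    return (encoded * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)

x = embedding(input_ids).detach()
pooled = encode_and_pool(x)  # (batch, embed_dim)

# Sanity check: overwriting the padded positions with noise should not change the result.
x_noisy = x.clone()
x_noisy[padding_mask] = torch.randn(int(padding_mask.sum()), EMBED_DIM)
pooled_noisy = encode_and_pool(x_noisy)
```

With the mask in place, the pooled sentence vector depends only on the real tokens, which is what the comparison between `pooled` and `pooled_noisy` checks.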


In [18]:
# ================== 1. Imports ==================
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk import word_tokenize
import re
import nltk

nltk.download("punkt")
nltk.download("stopwords")

# ================== 2. Load and Preprocess Data ==================

STOP_WORDS = set(stopwords.words('english'))  # build once; set lookups are fast

def clean_and_tokenize(text):
    text = text.lower()
    text = re.sub(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', '', text)  # remove emails
    text = re.sub(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', text)  # remove IP addresses
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)  # remove digits

    tokens = word_tokenize(text)
    filtered = [w for w in tokens if w not in STOP_WORDS]
    return filtered

# Load and process
df = pd.read_csv("amazon_cells_labelled_LARGE_25K.txt", delimiter="\t", header=None)
df.columns = ["text", "label"]
df.dropna(inplace=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

df["tokens"] = df["text"].apply(clean_and_tokenize)

# ================== 3. Build Vocabulary and Encode ==================

all_tokens = [token for sent in df["tokens"] for token in sent]
vocab = {"<PAD>": 0, "<UNK>": 1}
vocab.update({word: idx + 2 for idx, (word, _) in enumerate(Counter(all_tokens).items())})

def encode(tokens, vocab, max_len=32):
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]
    return ids[:max_len] + [vocab["<PAD>"]] * max(0, max_len - len(ids))

df["input_ids"] = df["tokens"].apply(lambda x: encode(x, vocab))

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["input_ids"].tolist(), df["label"].tolist(), test_size=0.1, random_state=42
)

# ================== 4. Dataset Class ==================

class AmazonDataset(Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.inputs[idx]),
            "labels": torch.tensor(self.labels[idx])
        }

    def __len__(self):
        return len(self.labels)

train_dataset = AmazonDataset(train_texts, train_labels)
val_dataset = AmazonDataset(val_texts, val_labels)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64)

# ================== 5. Transformer Model ==================

class MiniTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, num_heads=2, hidden_dim=128, num_layers=2, max_len=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len, embed_dim))

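        # NOTE: batch_first is left at its default (False), so the encoder treats dim 0 as the
        # sequence axis, while forward() passes (batch, seq_len, embed_dim). Pass batch_first=True
        # here to make self-attention run over the tokens of each sentence.
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim)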
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, input_ids):
        x = self.embedding(input_ids) + self.pos_embedding[:, :input_ids.size(1), :]
        x = self.encoder(x)
        pooled = x.mean(dim=1)  # Average pooling
        return self.classifier(pooled)

# ================== 6. Train ==================

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MiniTransformer(vocab_size=len(vocab)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

# ================== 7. Evaluate ==================

model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        outputs = model(input_ids)
        preds = torch.argmax(outputs, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

print("\nClassification Report:")
print(classification_report(all_labels, all_preds, digits=4))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Epoch 1, Loss: 423.2495
Epoch 2, Loss: 293.5590
Epoch 3, Loss: 234.0356
Epoch 4, Loss: 198.4438
Epoch 5, Loss: 169.1481

Classification Report:
              precision    recall  f1-score   support

           0     0.8362    0.7256    0.7770      1006
           1     0.8304    0.9043    0.8657      1494

    accuracy                         0.8324      2500
   macro avg     0.8333    0.8150    0.8214      2500
weighted avg     0.8327    0.8324    0.8300      2500



###With BERT pretraining model
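
The fine-tuning cell reports accuracy together with precision, recall and F1 for the positive class (`average='binary'`). As a reminder of what those numbers mean, the sketch below computes them by hand from a handful of made-up labels and predictions. The helper name `binary_metrics` is hypothetical, and the toy values are not results from this notebook.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy plus precision, recall and F1 for the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)  # true positives
    fp = sum(1 for y, p in pairs if y != positive and p == positive)  # false positives
    fn = sum(1 for y, p in pairs if y == positive and p != positive)  # false negatives
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

toy_labels = [1, 1, 1, 0, 0, 0]
toy_preds = [1, 1, 0, 1, 0, 0]
toy_metrics = binary_metrics(toy_labels, toy_preds)
```

In the toy data two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall and F1 are all 2/3, and four of six predictions are correct overall.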

In [8]:
# ================== 1. Install + Imports ==================
!pip install -q transformers

import torch
from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import pandas as pd
import numpy as np

# ================== 2. Load and Prepare Data ==================
df = pd.read_csv("amazon_cells_labelled_LARGE_25K.txt", delimiter="\t", header=None)
df.columns = ["text", "label"]
df.dropna(inplace=True)

# Optional shuffle
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["text"].tolist(),
    df["label"].tolist(),
    test_size=0.1,
    random_state=42
)

# ================== 3. Tokenization ==================
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=64)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=64)

# ================== 4. Dataset Class ==================
class AmazonDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            key: torch.tensor(val[idx])
            for key, val in self.encodings.items()
        } | {"labels": torch.tensor(self.labels[idx])}

    def __len__(self):
        return len(self.labels)

train_dataset = AmazonDataset(train_encodings, train_labels)
val_dataset = AmazonDataset(val_encodings, val_labels)

# ================== 5. Load Model ==================
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# ================== 6. Training Arguments ==================
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases name this argument eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_dir="./logs",
    report_to="none"
)

# ================== 7. Metrics Function ==================
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# ================== 8. Trainer ==================
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# ================== 9. Train ==================
trainer.train()

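# ================== 10. Evaluate ==================
# load_best_model_at_end=True reloads the checkpoint with the best validation
# accuracy, so these metrics describe that checkpoint, not the last epoch.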
metrics = trainer.evaluate()
print("\nEvaluation Metrics:")
for k, v in metrics.items():
    if k.startswith("eval_"):
        print(f"{k[5:].capitalize()}: {v:.4f}")

# ================== 11. Optional: Classification Report ==================
preds_output = trainer.predict(val_dataset)
preds = np.argmax(preds_output.predictions, axis=1)
labels = preds_output.label_ids

print("\nClassification Report:\n")
print(classification_report(labels, preds, digits=4))



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2355 | 0.198097 | 0.9288 | 0.945196 | 0.935074 | 0.940108 |
| 2 | 0.1564 | 0.218507 | 0.9384 | 0.94137 | 0.956493 | 0.948871 |
| 3 | 0.0806 | 0.276753 | 0.9368 | 0.941799 | 0.953146 | 0.947438 |
| 4 | 0.0453 | 0.319031 | 0.9376 | 0.943046 | 0.953146 | 0.948069 |

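Training loss falls every epoch while validation loss rises after epoch 1, which suggests the model starts to overfit early. Since the cell sets `load_best_model_at_end=True` with `metric_for_best_model="accuracy"`, the checkpoint kept for the final evaluation is the one with the highest validation accuracy. The sketch below only illustrates that selection on the logged numbers; `epoch_log` and `best_epoch` are hypothetical names, not part of the `Trainer` API.

```python
# Per-epoch values copied from the training log above.
epoch_log = [
    {"epoch": 1, "train_loss": 0.2355, "val_loss": 0.198097, "accuracy": 0.9288},
    {"epoch": 2, "train_loss": 0.1564, "val_loss": 0.218507, "accuracy": 0.9384},
    {"epoch": 3, "train_loss": 0.0806, "val_loss": 0.276753, "accuracy": 0.9368},
    {"epoch": 4, "train_loss": 0.0453, "val_loss": 0.319031, "accuracy": 0.9376},
]


def best_epoch(log, metric, greater_is_better):
    """Hypothetical helper: return the epoch number whose `metric` is best."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda row: row[metric])["epoch"]


# Selecting by accuracy (as configured) versus by validation loss.
best_by_accuracy = best_epoch(epoch_log, "accuracy", greater_is_better=True)
best_by_val_loss = best_epoch(epoch_log, "val_loss", greater_is_better=False)
print("best by accuracy:", best_by_accuracy)
print("best by validation loss:", best_by_val_loss)
```

The two criteria disagree on this log, so the choice of `metric_for_best_model` determines which checkpoint the reported metrics describe.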


Evaluation Metrics:
Loss: 0.2185
Accuracy: 0.9384
Precision: 0.9414
Recall: 0.9565
F1: 0.9489
Runtime: 7.8437
Samples_per_second: 318.7280
Steps_per_second: 5.1000

Classification Report:

              precision    recall  f1-score   support

           0     0.9338    0.9115    0.9225      1006
           1     0.9414    0.9565    0.9489      1494

    accuracy                         0.9384      2500
   macro avg     0.9376    0.9340    0.9357      2500
weighted avg     0.9383    0.9384    0.9383      2500

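The `macro avg` and `weighted avg` rows in the report follow from the per-class rows: the macro average is the plain mean over classes, and the weighted average weights each class by its support. A small sketch that recomputes them from the rounded values printed above, so agreement is only expected to within rounding; `report`, `macro_avg`, and `weighted_avg` are illustrative names.

```python
# Per-class rows copied from the classification report above:
# class -> (precision, recall, f1, support)
report = {
    0: (0.9338, 0.9115, 0.9225, 1006),
    1: (0.9414, 0.9565, 0.9489, 1494),
}

total_support = sum(row[3] for row in report.values())


def macro_avg(col):
    """Unweighted mean of one metric column across classes."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col):
    """Mean of one metric column, weighted by class support."""
    return sum(row[col] * row[3] for row in report.values()) / total_support


for name, col in [("precision", 0), ("recall", 1), ("f1", 2)]:
    print(f"{name:9s} macro={macro_avg(col):.4f} weighted={weighted_avg(col):.4f}")

# Share of the positive class, i.e. the accuracy of always predicting class 1.
print(f"positive share: {report[1][3] / total_support:.4f}")
```

The support-weighted recall equals the overall accuracy, which is why those two entries coincide in the report.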


##Task 1.3 Comparison
Here, you should compare both models. Use the same test dataset for both the ANN and the transformer to answer the following:

- Compare the performance of the two models and explain in which scenarios you would prefer one over the other.

- How did the two models’ complexity, accuracy, and efficiency differ? Did one model outperform the other in specific scenarios or tasks? If so, why?

- What insights did you obtain concerning the amount of training data, the embedding used, and the architectural choices made?
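
One way to keep the comparison fair is to score both models' predictions against a single shared list of test labels. The following is a minimal pure-Python sketch under that assumption; `binary_metrics`, `compare_models`, the model names, and the toy labels are hypothetical placeholders, not results from this lab.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def compare_models(y_true, predictions_by_model):
    """Score every model against the same labels; reject mismatched lengths."""
    for name, y_pred in predictions_by_model.items():
        if len(y_pred) != len(y_true):
            raise ValueError(f"{name}: predictions do not cover the shared test set")
    return {name: binary_metrics(y_true, y_pred)
            for name, y_pred in predictions_by_model.items()}


# Toy placeholder data, invented purely to exercise the functions.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predictions = {
    "ann_tfidf": [1, 0, 0, 1, 0, 1, 1, 0],
    "bert": [1, 0, 1, 1, 0, 0, 1, 1],
}

results = compare_models(y_test, toy_predictions)
for name, scores in results.items():
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

In practice `y_test` would be the held-out labels and each prediction list would come from running the corresponding trained model on the same held-out texts.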
