In [1]:
import torch 
import numpy as np
import pandas as pd
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [2]:
torch.cuda.empty_cache()

In [3]:
df=pd.read_csv("IMDB Dataset.csv")

In [4]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
import re
df['review'] = df['review'].apply(lambda x: re.sub("<.*?>", " ", x))
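
To see what this substitution does to a single review, here is a small standalone sketch. The sample string and the helper name `strip_html` are made up for illustration; only the non-greedy pattern `<.*?>` comes from the cell above.

```python
import re

# Same non-greedy pattern as in the cell above, compiled once for reuse.
TAG_RE = re.compile(r"<.*?>")

def strip_html(text: str) -> str:
    """Replace every HTML-like tag with a single space (hypothetical helper)."""
    return TAG_RE.sub(" ", text)

# Made-up sample in the style of the IMDB reviews.
sample = "A wonderful little production. <br /><br />The filming technique..."
cleaned = strip_html(sample)
print(cleaned)
```

Each tag becomes one space, so consecutive tags leave a run of spaces behind. If that matters downstream, `" ".join(cleaned.split())` would collapse them; the BERT tokenizer ignores repeated whitespace, so the notebook probably does not need it.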

In [6]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming t...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [7]:
# Encode labels as integers: 'positive' -> 1, anything else ('negative') -> 0.
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0).astype(int)

In [8]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. The filming t...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [9]:
data=df.to_numpy()
print(data[0][0])

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.  The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.  It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.  I would say the main appeal of the show is due to the fact that it goes where other sho

In [10]:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',do_lower_case=True)
token_out = df['review'].apply(lambda x: tokenizer(x,padding='max_length',truncation=True,max_length=512,return_tensors='pt')).to_list()
print(token_out[0])



{'input_ids': tensor([[  101,  2028,  1997,  1996,  2060, 15814,  2038,  3855,  2008,  2044,
          3666,  2074,  1015, 11472,  2792,  2017,  1005,  2222,  2022, 13322,
          1012,  2027,  2024,  2157,  1010,  2004,  2023,  2003,  3599,  2054,
          3047,  2007,  2033,  1012,  1996,  2034,  2518,  2008,  4930,  2033,
          2055, 11472,  2001,  2049, 24083,  1998,  4895, 10258,  2378,  8450,
          5019,  1997,  4808,  1010,  2029,  2275,  1999,  2157,  2013,  1996,
          2773,  2175,  1012,  3404,  2033,  1010,  2023,  2003,  2025,  1037,
          2265,  2005,  1996,  8143, 18627,  2030,  5199,  3593,  1012,  2023,
          2265,  8005,  2053, 17957,  2007, 12362,  2000,  5850,  1010,  3348,
          2030,  4808,  1012,  2049,  2003, 13076,  1010,  1999,  1996,  4438,
          2224,  1997,  1996,  2773,  1012,  2009,  2003,  2170, 11472,  2004,
          2008,  2003,  1996,  8367,  2445,  2000,  1996, 17411,  4555,  3036,
          2110,  7279,  4221, 12380,  

In [11]:
from torch.utils.data import Dataset, DataLoader
class TextDataset(Dataset):
  def __init__(self, token_o,labels):
    self.text = token_o
    self.labels = labels
  def __len__(self):
    return len(self.labels)
  def __getitem__(self, idx):
    label = torch.tensor(self.labels[idx])
    text = self.text[idx]
    return text, label

In [12]:
total_len = len(token_out)
training_ratio = 0.75
validation_ratio = 0.10
test_ratio = 0.15
train_len, valid_len, test_len = training_ratio*total_len, validation_ratio*total_len, test_ratio*total_len
train_len, valid_len, test_len = int(train_len), int(valid_len), int(test_len)
print('training len:\t valid len:\t test len:')
print(train_len, '\t\t', valid_len, '\t\t',test_len)

training len:	 valid len:	 test len:
37500 		 5000 		 7500
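
The three ratios divide 50,000 evenly, so `int()` loses nothing here. For a dataset size that does not divide evenly, truncating each product separately can leave a few samples unassigned. A minimal sketch of one way around that, assuming the test split simply takes the remainder (the helper name `split_lengths` is hypothetical):

```python
def split_lengths(total, train_ratio=0.75, valid_ratio=0.10):
    """Return (train, valid, test) sizes that always sum to `total`.

    Hypothetical helper: the test split takes whatever is left over,
    so truncation in int() never drops samples.
    """
    train_len = int(train_ratio * total)
    valid_len = int(valid_ratio * total)
    test_len = total - train_len - valid_len
    return train_len, valid_len, test_len

print(split_lengths(50000))   # same sizes as the cell above
print(split_lengths(50003))   # a size the ratios do not divide evenly
```

The `random_split` calls in the following cells already pass remainders (`total_len-train_len`, `len(valid)-valid_len`), which has the same effect.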


In [13]:
from torch.utils.data import random_split
TotalData = TextDataset(token_out, df['sentiment'].to_numpy())
train, valid = random_split(TotalData, [train_len,total_len-train_len], 
                            generator=torch.Generator().manual_seed(42))

In [14]:
valid, test = random_split(valid, [valid_len,len(valid)-valid_len], 
                           generator=torch.Generator().manual_seed(42))

In [15]:
print('training len:\t valid len:\t test len:')
print(len(train), '\t\t', len(valid), '\t\t',len(test))

training len:	 valid len:	 test len:
37500 		 5000 		 7500


In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert = BertModel.from_pretrained('bert-base-uncased')

In [17]:
print(train[0][0])

{'input_ids': tensor([[  101,  2023,  2146,  2792, 15173,  3815,  1997,  2004, 24826, 15683,
          1997, 20096,  1010, 10874,  1010,  6547,  1998,  5936,  2055,  1037,
          2645,  1997, 25433,  1997, 20052,  2114,  2798, 11668, 23689, 16874,
          2239,  1010,  1037,  3040, 25044,  2121,  1012,  2023,  2003,  2019,
          6581,  2058, 10052,  2448,  7292,  1997,  7441, 12049,  1011,  9106,
          2186,  1012,  1999,  1996,  2143,  3711,  5156,  9106,  1005,  1055,
          8854,  2004,  7742, 26693, 13662,  1998,  3680,  6842,  1010,  2295,
          2053, 22993, 23871,  1010,  2174,  2003,  1037,  4602, 12700,  1010,
          2798, 11668,  1012,  2009,  1005,  1055,  1037, 10218, 17039, 27158,
          2007, 20014, 27611,  1010, 16959,  2015,  1010,  1998, 23873,  1010,
          2164,  2019, 10990,  2345,  9792,  1012,  2023,  2003,  1037,  3327,
         20052,  3185,  2021,  2057,  2424,  2000,  9106,  4634,  1999,  2293,
          2007,  1037,  7947,  1010,  

In [None]:
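b2 = bert.to(device)

# Sanity check: run one tokenized review through BERT without tracking gradients.
with torch.no_grad():
    embed = b2(input_ids=train[0][0]['input_ids'].to(device),
               attention_mask=train[0][0]['attention_mask'].to(device))

# embed[1] is the pooled [CLS] output, shape (batch, hidden_size).
print(embed[1].shape)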

torch.Size([1, 768])


In [23]:
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence
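class BERTLSTMnn(nn.Module):
  """Frozen BERT encoder followed by an LSTM and a single sigmoid output unit."""
  def __init__(self, bert, seq_len, n_layers):
    super().__init__()
    self.bert = bert
    for p in self.bert.parameters():
      p.requires_grad = False  # Freeze BERT parameters
    embedding_size = bert.config.hidden_size
    # Note: seq_len is used as the LSTM hidden size here, not as a sequence length.
    self.lstm = nn.LSTM(input_size=embedding_size,
                        hidden_size=seq_len,
                        num_layers=n_layers,
                        batch_first=True,
                        dropout=0.2)
    self.dense = nn.Linear(seq_len, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, sentence):
    # Batches arrive as (batch, 1, seq); drop the extra dimension from both
    # tensors so BERT receives (batch, seq).
    embedded = self.bert(input_ids=sentence['input_ids'].squeeze(1),
                         attention_mask=sentence['attention_mask'].squeeze(1))

    out, (hidden_states, cell_states) = self.lstm(embedded['last_hidden_state'])

    # Last time step of the LSTM output. With padding='max_length' this position
    # holds a [PAD] token for any review shorter than 512 tokens.
    output = self.dense(out[:, -1, :])
    output = self.sigmoid(output)  # probability in (0, 1), not a logit
    return output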
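
The cell imports `pack_padded_sequence` but never uses it, and `out[:, -1, :]` reads the LSTM state after the final padded position. A minimal sketch of how packing could pick the state after the last real token instead, using random tensors in place of BERT output. The toy sizes and variable names are assumptions for illustration, not the notebook's model.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Toy stand-ins (assumptions): 3 sequences, max length 6, feature size 8.
batch, max_len, feat, hidden = 3, 6, 8, 4
embeddings = torch.randn(batch, max_len, feat)   # plays the role of last_hidden_state
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0, 0],
                               [1, 1, 0, 0, 0, 0]])

lstm = nn.LSTM(input_size=feat, hidden_size=hidden, batch_first=True)

# Real-token count per sequence; pack_padded_sequence expects lengths on the CPU.
lengths = attention_mask.sum(dim=1).cpu()
packed = pack_padded_sequence(embeddings, lengths,
                              batch_first=True, enforce_sorted=False)

with torch.no_grad():
    _, (h_n, _) = lstm(packed)         # h_n: (num_layers, batch, hidden)
    last_real = h_n[-1]                # state after the last non-padded token
    full_out, _ = lstm(embeddings)     # unpacked run, for comparison
    last_padded = full_out[:, -1, :]   # state after the final position, padded or not

print(last_real.shape)
```

For the first sequence, which has no padding, the two results agree; for the padded sequences they differ because the unpacked LSTM keeps consuming padding positions. In the notebook's model this would mean feeding `h_n[-1]` to the dense layer, which changes what the network learns and would need retraining.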

In [24]:
import time
from torch import optim
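lr = 0.001
epochs = 20
# seq_len sets the LSTM hidden size (512 units) inside BERTLSTMnn.
model = BERTLSTMnn(bert=b2, seq_len=512, n_layers=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
# forward() already ends in a sigmoid, so the loss must take probabilities.
# BCEWithLogitsLoss would apply a second sigmoid on top of the model's output.
criterion = nn.BCELoss()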
for epoch in range(epochs):
    start_time = time.time()
    tl=0
    train_loader = DataLoader(train, batch_size=64, shuffle=True)
    model.train()
    for X, y in train_loader:
        X = X.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output.squeeze(1), y.float())
        loss.backward()
        optimizer.step()
        print(f"Batch Loss: {loss.item()}")
        tl += loss.item()
    elapsed = time.time() - start_time
    print(f"Epoch {epoch+1}/{epochs}, Loss: {tl/len(train_loader)}, Time: {elapsed:.2f}s")


Using device: cuda
Batch Loss: 0.6920819282531738
Batch Loss: 0.6845941543579102
Batch Loss: 0.6811965703964233
Batch Loss: 0.6859957575798035
Batch Loss: 0.7178672552108765
Batch Loss: 0.6754839420318604
Batch Loss: 0.6838299036026001
Batch Loss: 0.6731332540512085
Batch Loss: 0.6679762005805969
Batch Loss: 0.676271378993988
Batch Loss: 0.6276028156280518
Batch Loss: 0.6026493310928345
Batch Loss: 0.6561583280563354
Batch Loss: 0.7046667337417603
Batch Loss: 0.6442412734031677
Batch Loss: 0.6641668081283569
Batch Loss: 0.6328604817390442
Batch Loss: 0.5977702736854553
Batch Loss: 0.5740159749984741
Batch Loss: 0.6541840434074402
Batch Loss: 0.6627797484397888
Batch Loss: 0.6382341980934143
Batch Loss: 0.6344977617263794
Batch Loss: 0.6197805404663086
Batch Loss: 0.6526685953140259
Batch Loss: 0.6146706342697144
Batch Loss: 0.6312329769134521
Batch Loss: 0.6497822999954224
Batch Loss: 0.6179473400115967
Batch Loss: 0.6722896099090576
Batch Loss: 0.6754106283187866
Batch Loss: 0.6270014
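
The model's forward pass ends in `nn.Sigmoid`, so it returns probabilities. A loss that expects raw logits, such as `nn.BCEWithLogitsLoss`, applies its own sigmoid internally, and combining the two squashes the output twice. The sketch below uses plain Python rather than PyTorch to illustrate the effect on a single confident, correct prediction. The logit value and the helper names are made up for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Binary cross-entropy for one probability p and one label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical case: the network is very confident and correct,
# with raw score +10 for a positive review.
logit, label = 10.0, 1

p_once = sigmoid(logit)             # sigmoid applied once
p_twice = sigmoid(sigmoid(logit))   # sigmoid applied again to a probability

print(f"single sigmoid: p={p_once:.4f}, loss={bce(p_once, label):.4f}")
print(f"double sigmoid: p={p_twice:.4f}, loss={bce(p_twice, label):.4f}")
```

Because the first sigmoid maps every score into (0, 1), the second one can only produce values between 0.5 and about 0.731. The loss therefore cannot fall below roughly 0.313 for a positive label or ln 2 (about 0.693) for a negative one, however well the model separates the classes.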

In [25]:
test_loader = DataLoader(test, batch_size=64, shuffle=False)
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for X, y in test_loader:
        X = X.to(device)
        y = y.to(device)
        output = model(X)
        predicted = (output.squeeze(1) > 0.5).float()
        total += y.size(0)
        correct += (predicted == y).sum().item()

In [26]:
print(f"Test Accuracy: {100 * correct / total:.2f}%")

Test Accuracy: 61.88%


In [None]:
# Save the model
torch.save(model.state_dict(), 'bert_lstm_model_21_6.pth')