In [None]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

# Assignment 1

In this assignment, you will build an LSTM-based classifier that automatically determines whether a text was written by a human or by ChatGPT. Your dataset for this project is in the `csv` file `human_or_bot.csv`.

Run the three tutorial notebooks in this folder on Google Colab using a GPU runtime. Make sure that you understand all of the content in these notebooks and that they run successfully.

In the current notebook, go through each cell containing a question and answer it in an additional `Text` cell.


### Question 1

When you run the LSTM sentiment classifier in `Assignment_1_LSTM_tutorial` you'll note that this classifier has a test accuracy of `0.0`. Why might that be? Hint: take a look at the original dataset and note how the train, test, and validation datasets have been generated.  Examine the distribution of labels in each dataset.

### Imbalanced Dataset:
If the original dataset used for training has a significant imbalance between positive and negative labels, the model may learn to predict only the majority class. For instance, if 90% of the training samples are positive, the model might predict all test samples as positive, resulting in poor accuracy on the minority class.
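
To make the 90% example concrete, here is a minimal sketch that scores a predictor which always outputs the majority class. The label lists and the `majority_baseline` helper are hypothetical and are not taken from the tutorial notebook.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Score a predictor that always outputs the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    accuracy = correct / len(test_labels)
    # Per-class recall of a constant predictor: 1.0 for the majority class, 0.0 otherwise
    recall = {c: (1.0 if c == majority else 0.0) for c in set(test_labels)}
    return majority, accuracy, recall

# Hypothetical data: 90% positive (1) and 10% negative (0) in both splits
train_labels = [1] * 90 + [0] * 10
test_labels = [1] * 9 + [0] * 1

majority, accuracy, recall = majority_baseline(train_labels, test_labels)
print(majority, accuracy, recall)
```

In this sketch overall accuracy stays high while recall on the minority class is zero. Imbalance alone therefore would not be expected to produce a test accuracy of `0.0` unless the test set consists almost entirely of the class the model never predicts.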

### Incorrect Splitting of Data:
If the train, test, and validation datasets are generated in a way that does not preserve the overall label distribution, the model may not generalize well. For example, if all positive samples are in the training set and all negative samples are in the test set, the model will achieve high accuracy on the training set but will perform poorly on the test set.
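
The following sketch illustrates this failure mode on a hypothetical label list that is sorted by class. Splitting by position puts only one class in the test set, while a stratified split keeps the class ratio in both sets. The helpers `sequential_split` and `stratified_split` are illustrative names; in practice, scikit-learn's `train_test_split(..., stratify=y)` covers the stratified case and would be applied to (text, label) pairs rather than labels alone.

```python
import random
from collections import Counter

def sequential_split(labels, train_frac=0.8):
    """Split by position only; the order of the data decides what lands in each set."""
    cut = int(train_frac * len(labels))
    return labels[:cut], labels[cut:]

def stratified_split(labels, train_frac=0.8, seed=42):
    """Shuffle each class separately, then take the same fraction of every class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [y for y in labels if y == cls]
        rng.shuffle(members)
        cut = int(train_frac * len(members))
        train += members[:cut]
        test += members[cut:]
    return train, test

# Hypothetical dataset sorted by label: 80 positives followed by 20 negatives
labels = [1] * 80 + [0] * 20

seq_train, seq_test = sequential_split(labels)
strat_train, strat_test = stratified_split(labels)

print("sequential:", Counter(seq_train), Counter(seq_test))
print("stratified:", Counter(strat_train), Counter(strat_test))
```

With the sequential split, a model that only ever saw positive examples would predict positive for every negative test sample, which gives a test accuracy of exactly `0.0`.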

### Data Leakage:
If there is leakage between the datasets (e.g., the same or very similar samples are present in both the training and test datasets), the results can be misleading. The model may perform well on the training data but fail to generalize to unseen data.
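
A quick way to look for exact or near-exact duplicates across splits is a set lookup on normalized text. This is a minimal sketch; `find_leaked_texts` and the sample reviews are hypothetical.

```python
def find_leaked_texts(train_texts, test_texts):
    """Return test texts that also appear in the training set after light normalization."""
    def normalize(text):
        # Lowercase and collapse whitespace so trivially different copies still match
        return " ".join(text.lower().split())

    train_seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in train_seen]

# Hypothetical reviews for illustration
train_texts = ["Great movie, loved it", "Terrible plot"]
test_texts = ["great  movie, loved it", "A new unseen review"]

leaked = find_leaked_texts(train_texts, test_texts)
print(leaked)
```

Leakage typically inflates test accuracy rather than lowering it, so it is an unlikely explanation for a score of `0.0`, but the check is cheap to run.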

### Overfitting:
If the model is overfitting to the training data (especially if the training set is small), it may not perform well on the test set, particularly if the test set includes samples that differ from the training set.
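
One common symptom of overfitting is a validation loss that starts rising while the training loss keeps falling. The sketch below picks the epoch with the lowest validation loss and shows where a simple patience rule would stop. The `best_epoch` helper and the loss values are illustrative assumptions, not part of the tutorial code.

```python
def best_epoch(val_losses, patience=2):
    """Return (1-based epoch with the lowest validation loss, epoch at which training stops)."""
    best_loss = float("inf")
    best = 0
    stopped_at = len(val_losses)
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best = loss, epoch
        elif epoch - best >= patience:
            # No improvement for `patience` epochs in a row: stop here
            stopped_at = epoch
            break
    return best, stopped_at

# Illustrative validation losses: improvement for two epochs, then a steady rise
val_losses = [0.53, 0.33, 0.34, 0.37, 0.41, 0.53]
print(best_epoch(val_losses))
```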

### Incorrect Label Encoding:
If there are issues with how the labels are encoded (e.g., mislabeling or errors in the mapping), the model could misinterpret the classes, leading to incorrect predictions.
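
A mapping error is easy to miss because pandas `Series.map` silently turns any value missing from the dictionary into `NaN`. The sketch below shows a stricter encoder in plain Python that fails loudly instead. The `encode_labels` helper is hypothetical; the mapping mirrors the `positive`/`negative` encoding used for the IMDB data in this notebook.

```python
def encode_labels(labels, mapping):
    """Map string labels to integers, failing loudly on anything the mapping does not cover."""
    unknown = sorted({label for label in labels if label not in mapping})
    if unknown:
        raise ValueError(f"Labels missing from mapping: {unknown}")
    return [mapping[label] for label in labels]

mapping = {"positive": 1, "negative": 0}

print(encode_labels(["positive", "negative", "positive"], mapping))

try:
    encode_labels(["positive", "Negative"], mapping)  # wrong capitalization
except ValueError as err:
    print(err)
```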

### Steps to Diagnose the Issue

1. Check the distribution of the labels in the training, validation, and test datasets. Use pandas to visualize the counts of each class.

2. Check the data-splitting logic.

3. Evaluate model performance on the training set.

4. Tune the hyperparameters.

### Conclusion
A test accuracy of 0.0 indicates a significant issue in either the dataset or the model training process. By closely examining the data distribution and splitting, you can identify the root cause of the problem and take appropriate corrective actions.

### Question 2

After you've identified the problem with the original run of the LSTM model in the `Assignment_1_LSTM_tutorial` notebook, run the training and test cells again and report the new accuracy.

In [5]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset, random_split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

# Load the data
# Dataset structure (see data.columns below):
#   - review: the content of the review
#   - sentiment: the corresponding sentiment label (positive/negative)
data = pd.read_csv('imdb_small.csv')

data.columns

Index(['Unnamed: 0', 'review', 'sentiment'], dtype='object')

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load data
data = pd.read_csv('imdb_small.csv')

# Basic preprocessing
data['review'] = data['review'].str.lower()  # Convert to lowercase
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})  # Encode sentiments

# Split data into features and labels
texts = data['review'].values
labels = data['sentiment'].values

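# Bag-of-words vectorization: each review becomes a vector of word counts
# (note: these are counts per vocabulary word, not padded sequences of token indices)
vectorizer = CountVectorizer(max_features=5000, token_pattern=r'\b\w+\b')
X = vectorizer.fit_transform(texts).toarray()  # Dense count matrix, one row per review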
vocab_size = len(vectorizer.vocabulary_) + 1  # +1 for padding

# Hyperparameters
output_size = 1
embedding_dim = 300  # Size of each embedding vector
hidden_dim = 128     # Size of the LSTM hidden state
n_layers = 2
dropout_prob = 0.5
lr = 0.001
epochs = 2  # Number of passes over the training set
batch_size = 5  # Samples per batch
clip = 5  # Gradient clipping

# Create Dataset and DataLoader
class SentimentDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

# Create DataLoader
train_size = int(0.8 * len(X))
valid_size = int(0.1 * len(X))
test_size = len(X) - train_size - valid_size

X_train, X_temp, y_train, y_temp = train_test_split(X, labels, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

train_dataset = SentimentDataset(torch.tensor(X_train, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32))
valid_dataset = SentimentDataset(torch.tensor(X_valid, dtype=torch.float32), torch.tensor(y_valid, dtype=torch.float32))
test_dataset = SentimentDataset(torch.tensor(X_test, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32))

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

# Define the LSTM model
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        super().__init__()
        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc1 = nn.Linear(hidden_dim, 64)
        self.fc2 = nn.Linear(64, 16)
        self.fc3 = nn.Linear(16, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, hidden):
        batch_size = x.size(0)
        embedd = self.embedding(x.long())
        lstm_out, hidden = self.lstm(embedd.view(batch_size, -1, self.embedding.embedding_dim), hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(lstm_out)
        out = self.fc1(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.dropout(out)
        out = self.fc3(out)
        sig_out = self.sigmoid(out)
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]
        return sig_out, hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters()).data
        if torch.cuda.is_available():
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden

# Initialize the model
net = SentimentLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout_prob)

# Check if CUDA is available
train_on_gpu = torch.cuda.is_available()

# Move model to GPU, if available
if train_on_gpu:
    net.cuda()

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

# Training the model
for e in range(epochs):
    net.train()  # Back to training mode each epoch; validation below switches to eval mode
    for inputs, labels in train_loader:
        batch_size = inputs.size(0)  # Use the actual batch size
        h = net.init_hidden(batch_size)  # Initialize hidden state with the correct batch size

        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()
        
        h = tuple([each.data for each in h])  # Detach hidden states
        optimizer.zero_grad()  # Zero out accumulated gradients
        output, h = net(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

    # Validation loss
    val_losses = []
    net.eval()
    for inputs, labels in valid_loader:
        batch_size = inputs.size(0)  # Use the actual batch size for validation
        val_h = net.init_hidden(batch_size)

        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()
        
        val_h = tuple([each.data for each in val_h])
        output, val_h = net(inputs, val_h)
        val_loss = criterion(output.squeeze(), labels.float())
        val_losses.append(val_loss.item())
    
    print(f"Epoch: {e+1}/{epochs}, Val Loss: {np.mean(val_losses):.6f}")

# Testing the model
test_losses = []
num_correct = 0
h = net.init_hidden(batch_size)
net.eval()

for inputs, labels in test_loader:
    batch_size = inputs.size(0)  # Use the actual batch size for testing
    h = net.init_hidden(batch_size)  # Initialize hidden state with the correct batch size
    h = tuple([each.data for each in h])
    
    if train_on_gpu:
        inputs, labels = inputs.cuda(), labels.cuda()
    
    output, h = net(inputs, h)
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    pred = torch.round(output.squeeze())
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy()) if train_on_gpu else np.squeeze(correct_tensor.numpy())
    num_correct += np.sum(correct)

# Calculate and print average test loss and accuracy
print("Test loss: {:.3f}".format(np.mean(test_losses)))
test_acc = num_correct / len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))


Epoch: 1/2, Val Loss: 0.697614
Epoch: 2/2, Val Loss: 0.690841
Test loss: 0.694
Test accuracy: 0.460


### Question 3

Try to improve the accuracy of the LSTM model by tuning the following hyper-parameters

- Learning rate (this is the `lr` variable in the __Training__ cell)
- Dropout probability (`dropout_prob` in the __Instantiate the model with hyperparameters__ cell)
- Number of epochs (This is the `epochs` variable in the __Training__  cell)

Choose at least 4 values (2 lower and 2 higher) for each parameter and answer the following questions
- Did the accuracy increase or decrease?
- Based on your understanding of the hyperparameter, why do you think it increased or decreased?


In [6]:
# Define hyperparameter values to test
learning_rates = [0.0001, 0.001, 0.01, 0.1]
dropout_probs = [0.1, 0.3, 0.5, 0.7]
num_epochs = [1, 2, 1, 3]

results = []

# Running experiments
for lr in learning_rates:
    for dropout in dropout_probs:
        for epoch in num_epochs:
            # Initialize the model with the current hyperparameters
            net = SentimentLSTM(vocab_size, output_size=1, embedding_dim=300, hidden_dim=128, n_layers=2, drop_prob=dropout)

            # Check if CUDA is available
            train_on_gpu = torch.cuda.is_available()
            if train_on_gpu:
                net.cuda()

            # Loss and optimizer
            criterion = nn.BCELoss()
            optimizer = torch.optim.Adam(net.parameters(), lr=lr)

            # Training the model
            for e in range(epoch):
                net.train()  # Set model to training mode
                for inputs, labels in train_loader:
                    batch_size = inputs.size(0)  # Use the actual batch size
                    h = net.init_hidden(batch_size)  # Initialize hidden state

                    if train_on_gpu:
                        inputs, labels = inputs.cuda(), labels.cuda()
                    
                    h = tuple([each.data for each in h])  # Detach hidden states
                    optimizer.zero_grad()  # Zero out accumulated gradients
                    output, h = net(inputs, h)
                    loss = criterion(output.squeeze(), labels.float())
                    loss.backward()
                    nn.utils.clip_grad_norm_(net.parameters(), 5)  # Gradient clipping
                    optimizer.step()

            # Testing the model and calculate accuracy
            net.eval()  # Set model to evaluation mode
            test_losses = []
            num_correct = 0

            for inputs, labels in test_loader:
                batch_size = inputs.size(0)  # Use the actual batch size for testing
                h = net.init_hidden(batch_size)  # Initialize hidden state
                h = tuple([each.data for each in h])
                
                if train_on_gpu:
                    inputs, labels = inputs.cuda(), labels.cuda()
                
                output, h = net(inputs, h)
                test_loss = criterion(output.squeeze(), labels.float())
                test_losses.append(test_loss.item())
                pred = torch.round(output.squeeze())
                correct_tensor = pred.eq(labels.float().view_as(pred))
                correct = np.squeeze(correct_tensor.cpu().numpy()) if train_on_gpu else np.squeeze(correct_tensor.numpy())
                num_correct += np.sum(correct)

            # Calculate and store results
            test_acc = num_correct / len(test_loader.dataset)
            results.append((lr, dropout, epoch, test_acc))
            print(f"Learning Rate: {lr}, Dropout: {dropout}, Epochs: {epoch}, Test Accuracy: {test_acc:.4f}")

# Convert results to a DataFrame for better visualization
results_df = pd.DataFrame(results, columns=['Learning Rate', 'Dropout', 'Epochs', 'Test Accuracy'])
print("\nResults Summary:")
print(results_df)


Learning Rate: 0.0001, Dropout: 0.1, Epochs: 1, Test Accuracy: 0.4600
Learning Rate: 0.0001, Dropout: 0.1, Epochs: 2, Test Accuracy: 0.5600
Learning Rate: 0.0001, Dropout: 0.1, Epochs: 1, Test Accuracy: 0.5400
Learning Rate: 0.0001, Dropout: 0.1, Epochs: 3, Test Accuracy: 0.5400
Learning Rate: 0.0001, Dropout: 0.3, Epochs: 1, Test Accuracy: 0.4500
Learning Rate: 0.0001, Dropout: 0.3, Epochs: 2, Test Accuracy: 0.4600
Learning Rate: 0.0001, Dropout: 0.3, Epochs: 1, Test Accuracy: 0.5400
Learning Rate: 0.0001, Dropout: 0.3, Epochs: 3, Test Accuracy: 0.5400
Learning Rate: 0.0001, Dropout: 0.5, Epochs: 1, Test Accuracy: 0.5400
Learning Rate: 0.0001, Dropout: 0.5, Epochs: 2, Test Accuracy: 0.4600
Learning Rate: 0.0001, Dropout: 0.5, Epochs: 1, Test Accuracy: 0.5400
Learning Rate: 0.0001, Dropout: 0.5, Epochs: 3, Test Accuracy: 0.5400
Learning Rate: 0.0001, Dropout: 0.7, Epochs: 1, Test Accuracy: 0.5500
Learning Rate: 0.0001, Dropout: 0.7, Epochs: 2, Test Accuracy: 0.5400
Learning Rate: 0.000
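
To read a trend out of these rows, it can help to average the test accuracy per hyperparameter value. The sketch below does this for the rows visible in the truncated log only (all at learning rate 0.0001); `mean_accuracy_by` is a hypothetical helper, and the averages say nothing about the runs that were cut off.

```python
from collections import defaultdict

# (learning rate, dropout, epochs, test accuracy) copied from the visible rows of the log;
# the log is truncated, so later rows are not included
rows = [
    (0.0001, 0.1, 1, 0.46), (0.0001, 0.1, 2, 0.56), (0.0001, 0.1, 1, 0.54), (0.0001, 0.1, 3, 0.54),
    (0.0001, 0.3, 1, 0.45), (0.0001, 0.3, 2, 0.46), (0.0001, 0.3, 1, 0.54), (0.0001, 0.3, 3, 0.54),
    (0.0001, 0.5, 1, 0.54), (0.0001, 0.5, 2, 0.46), (0.0001, 0.5, 1, 0.54), (0.0001, 0.5, 3, 0.54),
    (0.0001, 0.7, 1, 0.55), (0.0001, 0.7, 2, 0.54),
]

def mean_accuracy_by(rows, column):
    """Average test accuracy grouped by one column index (1 = dropout, 2 = epochs)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[3])
    return {key: round(sum(vals) / len(vals), 4) for key, vals in sorted(groups.items())}

print("by dropout:", mean_accuracy_by(rows, 1))
print("by epochs: ", mean_accuracy_by(rows, 2))
```

Because `num_epochs` contains the value 1 twice, the log includes pairs of runs with identical settings, and those pairs differ by up to 0.09 in test accuracy (0.45 vs. 0.54 at dropout 0.3). That run-to-run spread is about as large as any difference between settings in the visible rows, so conclusions about individual hyperparameters from this log should be treated with caution.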

### Question 4

In the current notebook, cut and paste the LSTM code from `Assignment_1_LSTM_tutorial` and train the model on `human_or_bot.csv` (make sure that you create the train/test/validation sets!).

What is the test accuracy of your model?

Next, repeat the hyper-parameter tuning you performed as part of Question 3, this time for your human vs. bot classification model. Again, choose 4 values for each hyper-parameter, report the change in test set performance, and explain the reason for that change.

In [2]:
pip install tensorflow


Collecting tensorflow
Note: you may need to restart the kernel to use updated packages.

  Obtaining dependency information for tensorflow from https://files.pythonhosted.org/packages/ed/b6/62345568cd07de5d9254fcf64d7e44aacbb6abde11ea953b3cb320e58d19/tensorflow-2.17.0-cp311-cp311-win_amd64.whl.metadata
  Downloading tensorflow-2.17.0-cp311-cp311-win_amd64.whl.metadata (3.2 kB)
Collecting tensorflow-intel==2.17.0 (from tensorflow)
  Obtaining dependency information for tensorflow-intel==2.17.0 from https://files.pythonhosted.org/packages/66/03/5c447feceb72f5a38ac2aa79d306fa5b5772f982c2b480c1329c7e382900/tensorflow_intel-2.17.0-cp311-cp311-win_amd64.whl.metadata
  Downloading tensorflow_intel-2.17.0-cp311-cp311-win_amd64.whl.metadata (5.0 kB)
Collecting astunparse>=1.6.0 (from tensorflow-intel==2.17.0->tensorflow)
  Obtaining dependency information for astunparse>=1.6.0 from https://files.pythonhosted.org/packages/2b/03/13dde6512ad7b4557eb792fbcf0c653af6076b81e5941d36ec61f7ce6028/astunpar

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the dataset
data = pd.read_csv('human_or_bot.csv')

# Features and labels
X = data['text']  # Assuming 'text' is the column with input text
y = data['label']  # Assuming 'label' contains the labels (human or bot)

# Split data into train, validation, and test sets (with stratification)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

# Tokenize the text data
tokenizer = Tokenizer(num_words=10000)  # Use a vocabulary size of 10,000
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_val_seq = tokenizer.texts_to_sequences(X_val)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to ensure uniform input size
max_sequence_length = 100  # You can adjust this based on the text length
X_train_pad = pad_sequences(X_train_seq, maxlen=max_sequence_length)
X_val_pad = pad_sequences(X_val_seq, maxlen=max_sequence_length)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_sequence_length)


In [12]:
# Check label distribution in the train, validation, and test sets
print("Train label distribution:")
print(y_train.value_counts())
print("Validation label distribution:")
print(y_val.value_counts())
print("Test label distribution:")
print(y_test.value_counts())


Train label distribution:
1    700
0    700
Name: label, dtype: int64
Validation label distribution:
0    150
1    150
Name: label, dtype: int64
Test label distribution:
0    150
1    150
Name: label, dtype: int64


In [9]:
print(data['label'].value_counts())


human    1000
bot      1000
Name: label, dtype: int64


In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Build the LSTM model
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128, input_length=max_sequence_length))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.5))  # Adjust dropout value later during tuning
model.add(Dense(1, activation='sigmoid'))  # Sigmoid for binary classification

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])




In [6]:
# Assuming the label column contains 'human' and 'bot' as strings
label_mapping = {'human': 0, 'bot': 1}

# Apply the mapping to the labels
y_train = y_train.map(label_mapping)
y_val = y_val.map(label_mapping)
y_test = y_test.map(label_mapping)

# Now the labels will be 0 or 1


In [7]:
history = model.fit(X_train_pad, y_train, 
                    validation_data=(X_val_pad, y_val),
                    epochs=10,  # Adjust based on your tuning
                    batch_size=32, 
                    shuffle=True)


Epoch 1/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 8s 135ms/step - accuracy: 0.5777 - loss: 0.6632 - val_accuracy: 0.8400 - val_loss: 0.5274
Epoch 2/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 119ms/step - accuracy: 0.8894 - loss: 0.4128 - val_accuracy: 0.8633 - val_loss: 0.3347
Epoch 3/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 121ms/step - accuracy: 0.9544 - loss: 0.1399 - val_accuracy: 0.8867 - val_loss: 0.3406
Epoch 4/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 6s 136ms/step - accuracy: 0.9775 - loss: 0.1346 - val_accuracy: 0.8967 - val_loss: 0.3701
Epoch 5/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 124ms/step - accuracy: 0.9969 - loss: 0.0196 - val_accuracy: 0.8700 - val_loss: 0.4137
Epoch 6/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 10s 123ms/step - accuracy: 0.9959 - loss: 0.0169 - val_accuracy: 0.8500 - val_loss: 0.5340
Epoch 7/10
44/44

In [13]:
# Re-run the model after addressing the data split issue
history = model.fit(X_train_pad, y_train, 
                    validation_data=(X_val_pad, y_val),
                    epochs=10,
                    batch_size=32,
                    shuffle=True)

# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test_pad, y_test)
print(f"Test accuracy: {test_accuracy}")


Epoch 1/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 107ms/step - accuracy: 1.0000 - loss: 4.0783e-05 - val_accuracy: 0.8467 - val_loss: 1.1313
Epoch 2/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 4s 98ms/step - accuracy: 1.0000 - loss: 4.7847e-05 - val_accuracy: 0.8467 - val_loss: 1.1386
Epoch 3/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 107ms/step - accuracy: 1.0000 - loss: 2.8588e-05 - val_accuracy: 0.8467 - val_loss: 1.1485
Epoch 4/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 105ms/step - accuracy: 1.0000 - loss: 2.8859e-05 - val_accuracy: 0.8467 - val_loss: 1.1639
Epoch 5/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 111ms/step - accuracy: 1.0000 - loss: 2.3207e-05 - val_accuracy: 0.8467 - val_loss: 1.1750
Epoch 6/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 111ms/step - accuracy: 1.0000 - loss: 2.9224e-05 - val_accuracy: 0.8467 - val_loss: 1.1857
Epoch

In [8]:
test_loss, test_accuracy = model.evaluate(X_test_pad, y_test)
print(f"Test Accuracy: {test_accuracy}")


[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step - accuracy: 0.8847 - loss: 0.5471
Test Accuracy: 0.8833333253860474


In [9]:
from tensorflow.keras.optimizers import Adam

learning_rates = [0.001, 0.0005, 0.0001, 0.00005]

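# Note: the same model object is reused here. Re-compiling does not reset the weights,
# so each learning rate continues training from the weights left by the previous run.
for lr in learning_rates: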
    model.compile(optimizer=Adam(learning_rate=lr), loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train_pad, y_train, epochs=10, validation_data=(X_val_pad, y_val))
    test_loss, test_accuracy = model.evaluate(X_test_pad, y_test)
    print(f"Learning Rate: {lr}, Test Accuracy: {test_accuracy}")


Epoch 1/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 10s 142ms/step - accuracy: 0.9930 - loss: 0.0284 - val_accuracy: 0.8667 - val_loss: 0.6334
Epoch 2/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 6s 135ms/step - accuracy: 0.9984 - loss: 0.0120 - val_accuracy: 0.8833 - val_loss: 0.6219
Epoch 3/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 6s 129ms/step - accuracy: 1.0000 - loss: 5.1622e-04 - val_accuracy: 0.8733 - val_loss: 0.6942
Epoch 4/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 6s 130ms/step - accuracy: 1.0000 - loss: 2.2056e-04 - val_accuracy: 0.8833 - val_loss: 0.7689
Epoch 5/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 10s 133ms/step - accuracy: 1.0000 - loss: 1.0968e-04 - val_accuracy: 0.8833 - val_loss: 0.8705
Epoch 6/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 122ms/step - accuracy: 1.0000 - loss: 2.1156e-04 - val_accuracy: 0.8600 - val_loss: 0.5197
Epoch 7/10
In [10]:
dropout_values = [0.3, 0.5, 0.7]

for dp in dropout_values:
    model = Sequential()
    model.add(Embedding(input_dim=10000, output_dim=128, input_length=max_sequence_length))
    model.add(LSTM(128, return_sequences=False))
    model.add(Dropout(dp))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train_pad, y_train, epochs=10, validation_data=(X_val_pad, y_val))
    test_loss, test_accuracy = model.evaluate(X_test_pad, y_test)
    print(f"Dropout: {dp}, Test Accuracy: {test_accuracy}")


Epoch 1/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 9s 129ms/step - accuracy: 0.6073 - loss: 0.6619 - val_accuracy: 0.8600 - val_loss: 0.5118
Epoch 2/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 121ms/step - accuracy: 0.8886 - loss: 0.3527 - val_accuracy: 0.8700 - val_loss: 0.3131
Epoch 3/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 11s 126ms/step - accuracy: 0.9641 - loss: 0.1277 - val_accuracy: 0.9067 - val_loss: 0.3116
Epoch 4/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 6s 132ms/step - accuracy: 0.9864 - loss: 0.0516 - val_accuracy: 0.8933 - val_loss: 0.3470
Epoch 5/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 6s 134ms/step - accuracy: 0.9806 - loss: 0.0566 - val_accuracy: 0.8867 - val_loss: 0.3221
Epoch 6/10
44/44 ━━━━━━━━━━━━━━━━━━━━ 6s 137ms/step - accuracy: 0.9781 - loss: 0.0920 - val_accuracy: 0.8733 - val_loss: 0.3489
Epoch 7/10
44/44 ━━━━━━━━

In [11]:
epochs_values = [5, 10, 15, 20]

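# Note: `model` is the last model built in the dropout loop. Re-compiling does not reset
# its weights, so each run below adds more epochs on top of an already trained model.
for epochs in epochs_values: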
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train_pad, y_train, epochs=epochs, validation_data=(X_val_pad, y_val))
    test_loss, test_accuracy = model.evaluate(X_test_pad, y_test)
    print(f"Epochs: {epochs}, Test Accuracy: {test_accuracy}")


Epoch 1/5
44/44 ━━━━━━━━━━━━━━━━━━━━ 8s 122ms/step - accuracy: 0.9936 - loss: 0.0348 - val_accuracy: 0.8867 - val_loss: 0.5115
Epoch 2/5
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 114ms/step - accuracy: 0.9954 - loss: 0.0235 - val_accuracy: 0.8667 - val_loss: 0.6133
Epoch 3/5
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 115ms/step - accuracy: 0.9984 - loss: 0.0031 - val_accuracy: 0.8833 - val_loss: 0.7671
Epoch 4/5
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 115ms/step - accuracy: 1.0000 - loss: 2.6839e-04 - val_accuracy: 0.8800 - val_loss: 0.8736
Epoch 5/5
44/44 ━━━━━━━━━━━━━━━━━━━━ 5s 116ms/step - accuracy: 1.0000 - loss: 1.4576e-04 - val_accuracy: 0.8700 - val_loss: 0.9312
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step - accuracy: 0.8658 - loss: 0.8878
Epochs: 5, Test Accuracy: 0.8566666841506958
Epoch 1/10
44/44 ━━━━

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader, TensorDataset
from collections import Counter

# Load dataset
data = pd.read_csv('human_or_bot.csv')
texts = data['text'].values  # Assume 'text' is the column name
labels = data['label'].values  # Assume 'label' is the column name (strings: 'human' or 'bot')

# Encode labels (LabelEncoder sorts classes alphabetically: 'bot' -> 0, 'human' -> 1)
le = LabelEncoder()
y_encoded = le.fit_transform(labels)

# Tokenization and vocabulary creation
tokenized_texts = [text.lower().split() for text in texts]
counter = Counter(word for text in tokenized_texts for word in text)
vocab = {word: i + 1 for i, (word, _) in enumerate(counter.items())}  # +1 for padding

# Convert texts to sequences
max_length = 100  # Maximum length of the sequences
X = []
for text in tokenized_texts:
    seq = [vocab[word] for word in text if word in vocab]
    if len(seq) < max_length:
        seq += [0] * (max_length - len(seq))  # Padding
    else:
        seq = seq[:max_length]  # Truncate
    X.append(seq)

# Convert to numpy arrays
X = np.array(X)
y = np.array(y_encoded)

# Train/test/validation split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Convert to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.long)
y_train = torch.tensor(y_train, dtype=torch.float32)
X_valid = torch.tensor(X_valid, dtype=torch.long)
y_valid = torch.tensor(y_valid, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.float32)

# Create DataLoader
batch_size = 64
train_data = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)

# Define the LSTM model
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        super(SentimentLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.embedding(x)
        lstm_out, _ = self.lstm(x)
        out = self.dropout(lstm_out[:, -1])
        out = self.fc(out)
        return self.sigmoid(out)

# Instantiate model
vocab_size = len(vocab) + 1  # +1 for padding
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2
dropout_prob = 0.5

model = SentimentLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers, dropout_prob)
print(model)

# Define loss and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training
epochs = 4
for e in range(epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        output = model(inputs).squeeze(1)  # Drop only the feature dim so a final batch of size 1 keeps its shape
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
    print(f'Epoch: {e+1}/{epochs}... Loss: {loss.item():.4f}')

# Evaluation
model.eval()
with torch.no_grad():
    valid_output = model(X_valid).squeeze()
    valid_loss = criterion(valid_output, y_valid)
    valid_acc = ((valid_output > 0.5).float() == y_valid).float().mean()
    print(f'Validation Loss: {valid_loss.item():.4f}... Validation Accuracy: {valid_acc.item():.4f}')

# Test evaluation
with torch.no_grad():
    test_output = model(X_test).squeeze()
    test_loss = criterion(test_output, y_test)
    test_acc = ((test_output > 0.5).float() == y_test).float().mean()
    print(f'Test Loss: {test_loss.item():.4f}... Test Accuracy: {test_acc.item():.4f}')


SentimentLSTM(
  (embedding): Embedding(55951, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
Epoch: 1/4... Loss: 0.6050
Epoch: 2/4... Loss: 0.4973
Epoch: 3/4... Loss: 0.3049
Epoch: 4/4... Loss: 0.2206
Validation Loss: 0.5450... Validation Accuracy: 0.7633
Test Loss: 0.6707... Test Accuracy: 0.6967


In [2]:
print(f'Test Loss: {test_loss.item():.4f}... Test Accuracy: {test_acc.item():.4f}')


Test Loss: 0.6707... Test Accuracy: 0.6967
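
One difference between the two Question 4 pipelines that may be worth checking is the padding direction. The manual loop in the PyTorch cell pads and truncates at the end of each sequence, whereas Keras `pad_sequences` pads and truncates at the start by default. Because the PyTorch model classifies from `lstm_out[:, -1]`, post-padding means that short texts are classified from a time step whose most recent inputs were padding tokens. The sketch below contrasts the two modes in plain Python; `pad_or_truncate` is a hypothetical helper, not part of either library.

```python
def pad_or_truncate(seq, max_length, mode="post", pad_value=0):
    """Pad or truncate a list of token ids to exactly max_length (max_length > 0).

    mode="post": keep the first tokens and pad at the end (what the manual loop does).
    mode="pre":  keep the last tokens and pad at the start (Keras pad_sequences defaults).
    """
    if mode not in ("pre", "post"):
        raise ValueError("mode must be 'pre' or 'post'")
    if len(seq) >= max_length:
        return seq[:max_length] if mode == "post" else seq[-max_length:]
    padding = [pad_value] * (max_length - len(seq))
    return seq + padding if mode == "post" else padding + seq

tokens = [7, 8, 9]
print(pad_or_truncate(tokens, 5, mode="post"))  # padding occupies the last time steps
print(pad_or_truncate(tokens, 5, mode="pre"))   # real tokens occupy the last time steps
```

This may be one contributing factor to the gap between the two test accuracies, although the runs also differ in vocabulary size, embedding and hidden dimensions, number of epochs, and whether the split was stratified, so it should be tested rather than assumed.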
