# Practical 1: Chatbot – Simple ANN & Transformers

## Task 1.1

Create a simple chatbot that analyzes text responses typed by a user using an artificial neural network (ANN). The minimum requirement is that the bot prompts a response from the user (with various possible prompts). When the user has typed the answer, the bot should analyze the text (using an ANN) and formulate a response based on the analysis.
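The prompt/analyze/respond loop described above can be sketched as follows. This is a minimal sketch, not a required design: the prompt texts, the reply texts, and the names `PROMPTS`, `respond`, and `run_chatbot` are hypothetical, and `classify` stands in for whatever trained ANN you end up with (assumed to return 1 for positive and 0 for negative).

```python
import random

# Hypothetical opening prompts; any list of questions would do.
PROMPTS = [
    "How was your experience with the product?",
    "What did you think of your last purchase?",
    "Tell me about something you bought recently.",
]

def respond(text, classify):
    """Turn a classifier decision (1 = positive, 0 = negative) into a reply."""
    if classify(text) == 1:
        return "Glad to hear that!"
    return "Sorry to hear that."

def run_chatbot(classify, read=input, write=print, rng=random):
    """Prompt the user, classify each answer, and reply until the user types 'quit'."""
    while True:
        write("Bot: " + rng.choice(PROMPTS))
        text = read("You: ")
        if text.strip().lower() == "quit":
            break
        write("Bot: " + respond(text, classify))
```

Passing `read` and `write` as arguments keeps the loop testable without a keyboard; in a notebook you would simply call `run_chatbot(classify)` with your model wrapped in a function.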

A small dataset of product reviews that have been labelled as negative (0) or positive (1) is provided in the Files - Exercises - Lab 1 folder, along with some code needed to extract information.

A suggested approach is first to try to train a network on the given data. When that task has been concluded, the model can be improved by finding more data, using a dataset with a broader range of labels, using word embeddings to create unique sentence embeddings, making the bot capable of extended dialogue or any other extension you want to pursue.
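One simple way to turn word embeddings into a sentence embedding is to average the vectors of the words in the sentence. The sketch below assumes toy 3-dimensional vectors in a hypothetical `WORD_VECTORS` table; real vectors would come from a pretrained embedding model, and averaging is only one of several possible pooling choices.

```python
# Toy 3-dimensional word vectors (made up for illustration).
WORD_VECTORS = {
    "good": [0.9, 0.1, 0.0],
    "bad": [-0.8, 0.2, 0.1],
    "phone": [0.0, 0.5, 0.5],
}

def sentence_embedding(sentence, vectors=WORD_VECTORS, dim=3):
    """Average the vectors of all known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        # No known word: fall back to the zero vector.
        return [0.0] * dim
    # zip(*known) groups the i-th component of every word vector together.
    return [sum(component) / len(known) for component in zip(*known)]
```

Note that averaging ignores word order, so two sentences with the same words in a different order get the same embedding.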

## Task 1.2 Transformers Implementation

For this task, you will implement your own transformer in PyTorch by following this tutorial: [Transformers in PyTorch](https://pytorch.org/tutorials/beginner/transformer_tutorial.html).

## Task 1.2 (Alternative)

If you run into problems with the previous code, links, or set-up (due to Amazon or anything else), we offer an alternative route to developing your own transformer.

You can follow Andrej Karpathy's NanoGPT tutorial on [YouTube](https://www.youtube.com/watch?v=kCc8FmEb1nY) and use the code from his GitHub repository, [NanoGPT](https://github.com/karpathy/nanoGPT?tab=readme-ov-file), or you can develop your own code if you want.

The only requirement is to standardize your training data: use the same data across all your implementations.
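One way to keep the data identical across implementations is to create the train/test split once, with a fixed seed, and feed the same split to every model. The helper below is a sketch under that assumption; the name `shared_split` and its defaults (10% test data, seed 42) are illustrative choices, not part of the assignment.

```python
import random

def shared_split(samples, test_fraction=0.10, seed=42):
    """Shuffle indices once with a fixed seed and return (train, test) lists."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle on every run
    n_test = max(1, round(len(samples) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test
```

Calling `shared_split(rows)` in both the ANN notebook cell and the transformer cell then guarantees that both models are trained and evaluated on exactly the same sentences.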

## Task 1.3 Comparison

Here, compare the two models. Modify your chatbot to use the same data as the transformer, then answer the following:

- Compare the performance of the two models and explain in which scenarios you would prefer one over the other.
- How did the two models’ complexity, accuracy, and efficiency differ? Did one model outperform the other in specific scenarios or tasks? If so, why?
- What insights did you gain about the amount of training data, the embeddings used, and the architectural choices made?
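For the complexity question, a rough first measure is the number of trainable parameters. The sketch below counts them by hand for a one-hidden-layer dense classifier and for a standard transformer encoder classifier. It assumes the usual encoder-layer layout (biased query, key, value and output projections, a two-layer feed-forward block, and two layer norms); the function names and the vocabulary size of 2000 are hypothetical.

```python
def dense_params(vocab_size, hidden=16):
    """Weights and biases of a vocab_size -> hidden -> 1 network."""
    return (vocab_size * hidden + hidden) + (hidden + 1)

def encoder_layer_params(d_model, dim_feedforward):
    """Parameters of one standard transformer encoder layer."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    feed_forward = (d_model * dim_feedforward + dim_feedforward) \
        + (dim_feedforward * d_model + d_model)    # two linear layers with biases
    layer_norms = 2 * (2 * d_model)                # scale and shift, two norms
    return attention + feed_forward + layer_norms

def transformer_params(vocab_size, d_model=512, num_layers=3, dim_feedforward=2048):
    """Embedding table + stacked encoder layers + a single-logit output layer."""
    embedding = vocab_size * d_model
    head = d_model + 1
    return embedding + num_layers * encoder_layer_params(d_model, dim_feedforward) + head

vocab_size = 2000  # hypothetical vocabulary size
print("dense:", dense_params(vocab_size))
print("transformer:", transformer_params(vocab_size))
```

With these settings the transformer has a few hundred times more parameters than the dense network, which is worth keeping in mind when comparing training time and the amount of data each model needs.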


In [None]:
from google.colab import drive
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from nltk import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('stopwords')

def preprocess_pandas(data, columns):
    # Lower-case the text and strip punctuation (anything that is not a word character or whitespace).
    data['Sentence'] = data['Sentence'].str.lower()
    data['Sentence'] = data['Sentence'].replace(r'[^\w\s]', '', regex=True)
    # Build the stop-word set once instead of reloading the list for every token.
    stop_words = set(stopwords.words('english'))
    rows = []
    for index, row in data.iterrows():
        word_tokens = word_tokenize(row['Sentence'])
        filtered_sentence = [w for w in word_tokens if w not in stop_words]
        row_data = {
            "Class": row.get('Class', None),
            "Sentence": " ".join(filtered_sentence)
        }
        if 'index' in columns:
            row_data["index"] = row.get('index', index)
        rows.append(row_data)
    # Return the cleaned rows in the requested column order.
    return pd.DataFrame(rows, columns=columns)


class TextClassifier(nn.Module):
    def __init__(self, vocab_size):
        super(TextClassifier, self).__init__()
        self.fc1 = nn.Linear(vocab_size, 16)
        self.fc2 = nn.Linear(16, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        # Drop only the trailing feature dimension so a batch of one keeps its batch axis.
        return x.squeeze(-1)

def chatbot(model, vectorizer):
    print("Hello! I'm a KLAB sentiment chatbot. Type 'quit' to exit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break
        processed_input = preprocess_pandas(pd.DataFrame([[user_input, None]], columns=['Sentence', 'Class']), ['Sentence', 'Class'])
        input_vector = vectorizer.transform([processed_input['Sentence'].values[0]]).toarray()
        input_tensor = torch.tensor(input_vector).float()
        output = model(input_tensor)
        response = 'Positive' if torch.round(output).item() == 1 else 'Negative'
        print(f"Bot: That seems like a {response} statement.")

if __name__ == "__main__":
    # Mount Google Drive
    # Hi KLAB people, please put the data file in your Google Drive under "My Drive".
    drive.mount('/content/drive')
    # Specify the path to your file in Google Drive
    file_path = '/content/drive/My Drive/amazon_cells_labelled.txt'
    # Load and preprocess the data
    data = pd.read_csv(file_path, delimiter='\t', header=None)
    data.columns = ['Sentence', 'Class']
    data['index'] = data.index
    columns = ['index', 'Class', 'Sentence']
    processed_data = preprocess_pandas(data, columns)

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        processed_data['Sentence'].values,
        processed_data['Class'].values.astype('int'),
        test_size=0.10,
        random_state=42
    )

    # Vectorize the data
    vectorizer = TfidfVectorizer(max_features=5000)
    X_train_tfidf = vectorizer.fit_transform(X_train).toarray()
    X_test_tfidf = vectorizer.transform(X_test).toarray()

    # Convert data to PyTorch tensors
    X_train_tensor = torch.tensor(X_train_tfidf).float()
    y_train_tensor = torch.tensor(y_train).float()
    X_test_tensor = torch.tensor(X_test_tfidf).float()
    y_test_tensor = torch.tensor(y_test).float()

    # Create TensorDatasets and DataLoaders
    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
    train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(dataset=test_dataset, batch_size=64, shuffle=False)

    # Initialize the model, loss function, and optimizer
    model = TextClassifier(len(vectorizer.vocabulary_))
    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train the model
    num_epochs = 10
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}')

    # Evaluate the model
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for inputs, labels in test_loader:
            outputs = model(inputs)
            predicted = torch.round(outputs)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        print(f'Accuracy of the model on the test set: {100 * correct / total}%')

    # Start the chatbot
    chatbot(model, vectorizer)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Mounted at /content/drive
Epoch 1, Loss: 0.6945637424786886
Epoch 2, Loss: 0.6830076495806376
Epoch 3, Loss: 0.6744857509930928
Epoch 4, Loss: 0.6640441338221232
Epoch 5, Loss: 0.6462441007296245
Epoch 6, Loss: 0.6294456442197164
Epoch 7, Loss: 0.6073550383249918
Epoch 8, Loss: 0.5831521391868592
Epoch 9, Loss: 0.5555845499038696
Epoch 10, Loss: 0.5305933713912964
Accuracy of the model on the test set: 76.0%
Hello! I'm a KLAB sentiment chatbot. Type 'quit' to exit.
You: i am a good boy
Bot: That seems like a Positive statement.
You: i am not a good boy
Bot: That seems like a Positive statement.
You: i am a bad boy
Bot: That seems like a Negative statement.
You: i am not a bad boy
Bot: That seems like a Negative statement.
You: quit


In [None]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import re
from google.colab import drive  # needed for drive.mount() in main()
nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text(text):
    stop_words = set(stopwords.words('english'))  # English stop words (common words that carry little meaning on their own)
    text = text.lower()  # convert text to lower case
    text = re.sub(r'[^\w\s]', '', text)  # remove characters that are neither word characters nor whitespace
    word_tokens = word_tokenize(text)  # tokenize text into individual words
    filtered_sentence = [w for w in word_tokens if w not in stop_words]  # filter out stop words
    return " ".join(filtered_sentence)  # join the remaining words back into a string

def load_data(file_path):
    data = pd.read_csv(file_path, delimiter='\t', header=None, names=['Sentence', 'Class'])  # read the tab-separated file into a pandas DataFrame
    data['Sentence'] = data['Sentence'].apply(preprocess_text)  # apply preprocess_text to each sentence
    return data

def create_vocab(data):
    all_words = set(word_tokenize(" ".join(data['Sentence'])))  # set of all unique words in the dataset
    vocab = {word: i+1 for i, word in enumerate(all_words)}  # map each word to a unique integer index, starting at 1
    vocab['<PAD>'] = 0  # special token <PAD> gets index 0
    return vocab

def tokenize_data(data, vocab, max_len):
    # Convert each sentence to a list of word indices; unknown words map to 0, the <PAD> index.
    tokenized_data = [[vocab[word] if word in vocab else 0 for word in word_tokenize(sentence)] for sentence in data]
    # Pad with <PAD> so that all sentences have the same length.
    padded_data = [sentence + [vocab['<PAD>']] * (max_len - len(sentence)) for sentence in tokenized_data]
    return padded_data

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_encoder_layers=3, dropout=0.1):
        super(TransformerClassifier, self).__init__()
        # Map each token index in the input sequence to a d_model-dimensional vector.
        self.embedding = nn.Embedding(vocab_size, d_model)
        # A single encoder layer with nhead attention heads. batch_first=True because
        # the inputs are shaped (batch, sequence, d_model).
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, dropout=dropout, batch_first=True)
        # Stack num_encoder_layers copies of that layer into the full encoder.
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers)
        # Fully connected layer that maps the pooled representation to a single logit.
        self.fc = nn.Linear(d_model, 1)

    def forward(self, x):
        x = self.embedding(x)
        x = self.transformer_encoder(x)
        x = x.mean(dim=1)  # Mean pooling over the sequence dimension
        return self.fc(x).squeeze(-1)

def train_model(model, train_loader, criterion, optimizer, num_epochs=10):
    for epoch in range(num_epochs):
        model.train()  # set the model to training mode
        total_loss = 0  # accumulated loss for this epoch
        for inputs, labels in train_loader:
            optimizer.zero_grad()  # zero the gradients
            outputs = model(inputs)  # forward pass: compute predicted outputs for this batch
            loss = criterion(outputs, labels)  # calculate batch loss
            loss.backward()  # backward pass: compute gradients of the loss
            optimizer.step()  # update parameters (weights and biases)
            total_loss += loss.item()  # add the batch loss to the epoch total
        print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}')  # print average loss for each epoch

def evaluate_model(model, test_loader): #evaluate performance on test set
    model.eval() #set model to evaluation mode
    with torch.no_grad(): #disable gradient calculation
        correct = 0  # Initialize the number of correctly predicted samples
        total = 0    # Initialize the total number of samples

        for inputs, labels in test_loader:
            outputs = model(inputs)  #forward pass / compute predicted outputs by passing inputs to the model
            predicted = torch.round(torch.sigmoid(outputs)) #apply sigmoid activation function to get binary predictions
            total += labels.size(0) #total number of samples
            correct += (predicted == labels).sum().item() #total number of correctly predicted samples
        print(f'Accuracy of the model on the test set: {100 * correct / total}%') #print the accuracy for the test set

def chatbot(model, vocab, max_len):
    print("Hello! I'm a sentiment analysis chatbot. Type 'quit' to exit.")
    while True:
        user_input = input("You: ") #get user input
        if user_input.lower() == 'quit': #check if user wants to quit
            break
        processed_input = preprocess_text(user_input) #preprocess the user input
        tokenized_input = [vocab[word] if word in vocab else 0 for word in word_tokenize(processed_input)] #tokenize the input and convert to indices
        padded_input = tokenized_input + [vocab['<PAD>']] * (max_len - len(tokenized_input)) #pad to maximum sequence length
        input_tensor = torch.tensor([padded_input]).long() #convert to tensor
        output = torch.sigmoid(model(input_tensor)) #pass the input through the model and apply sigmoid
        response = 'Positive' if output.item() >= 0.5 else 'Negative' #determine sentiment based on the output
        print(f"Bot: That seems like a {response} statement.") #prints the prediction

def main():
    #mount drive to get the dataset and create the vocabulary from the data
    drive.mount('/content/drive')
    file_path = '/content/drive/My Drive/amazon_cells_labelled.txt'
    data = load_data(file_path)
    vocab = create_vocab(data)
    max_len = max(len(word_tokenize(text)) for text in data['Sentence']) #calculate max length along all sentences

    #train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        data['Sentence'], data['Class'], test_size=0.10, random_state=42)

    #tokenize training and test data
    X_train = tokenize_data(X_train, vocab, max_len)
    X_test = tokenize_data(X_test, vocab, max_len)

    #convert data into pytorch tensors
    X_train_tensor = torch.tensor(X_train).long()
    y_train_tensor = torch.tensor(y_train.values).float()
    X_test_tensor = torch.tensor(X_test).long()
    y_test_tensor = torch.tensor(y_test.values).float()

    #create pytorch dataloader
    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

    vocab_size = len(vocab)  # vocabulary size, including <PAD>
    model = TransformerClassifier(vocab_size)  # initialize the transformer classifier
    criterion = nn.BCEWithLogitsLoss()  # loss function (applies the sigmoid internally)
    optimizer = optim.Adam(model.parameters(), lr=0.0001)  # optimizer and learning rate

    train_model(model, train_loader, criterion, optimizer, 10)  # train the model for the chosen number of epochs
    evaluate_model(model, test_loader)  # evaluate the model's accuracy on the test set
    chatbot(model, vocab, max_len)  # start the chatbot with the padding length used in training


if __name__ == "__main__":
    main()



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




Epoch 1, Loss: 0.8276934663454691
Epoch 2, Loss: 0.6907510717709859
Epoch 3, Loss: 0.659014097849528
Epoch 4, Loss: 0.6437768777211507
Epoch 5, Loss: 0.5957281708717346
Epoch 6, Loss: 0.5234996656576792
Epoch 7, Loss: 0.44813999931017556
Epoch 8, Loss: 0.3815574884414673
Epoch 9, Loss: 0.3158738334973653
Epoch 10, Loss: 0.2908288856347402
Accuracy of the model on the test set: 72.0%
Hello! I'm a sentiment analysis chatbot. Type 'quit' to exit.
You: i am a good boy
Bot: That seems like a Positive statement.
You: i am a bad boy
Bot: That seems like a Negative statement.
You: i am not a good boy
Bot: That seems like a Positive statement.
You: i am not a bad boy
Bot: That seems like a Negative statement.
You: the man has died
Bot: That seems like a Positive statement.
You: the man has survived
Bot: That seems like a Negative statement.
You: death
Bot: That seems like a Negative statement.
You: life
Bot: That seems like a Negative statement.
You: good
Bot: That seems like a Positive statement.

# Task 1.3

We only did the sentiment analysis part. For Task 1.2 we reused the data from Task 1.1.

1. The first model (ANN) performed better on the test set (76% versus 72% accuracy). However, a transformer-based architecture is more suitable for complex, context-heavy data and would be the better choice in that case. ANNs might be more suitable if the task is as simple as sentiment analysis and the data depends less on word order.

2. It took a lot longer to run the second model than the first because of its more complex structure. In terms of implementation complexity, Task 1.1 was easier; note, however, that we preprocessed the dataset for Task 1.2 in a simpler way than for Task 1.1.

3. The amount of data is not enough for our inputs; the predictions feel almost random when we enter our own sentences.
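One likely contributor to the negation failures in the chat logs is the preprocessing itself: NLTK's English stop-word list contains "not", so the negation is removed before either model sees the sentence. The sketch below illustrates this with a small hand-copied excerpt of that list and a plain `split()` in place of `word_tokenize`; both are simplifications made so the example runs without downloading NLTK data.

```python
import re

# Small excerpt of NLTK's English stop-word list (subset chosen for illustration).
STOP_WORDS = {"i", "am", "a", "not", "the", "has"}

def preprocess(text):
    """Lower-case, strip punctuation, and drop stop words, mirroring the notebook's preprocessing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

for sentence in ["i am a good boy", "i am not a good boy"]:
    # Both sentences reduce to the same string, so any model gives them the same label.
    print(repr(sentence), "->", repr(preprocess(sentence)))
```

Keeping negation words such as "not" out of the stop-word set would be a cheap experiment to try before collecting more data.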