
This assignment consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to build a classification model that predicts which subject area a given abstract comes from. Next week we will discuss what you learned from the theory part, so you are relatively free in how you fill your Learning Portfolio on this new topic. In two weeks we will discuss your solutions for the classification model.

# Theory part (filling your Learning Portfolio, May 10)

In preparation for the practical part, I ask you to familiarize yourself with the following resources in the next week:

1) Please watch the following video:

https://course.fast.ai/Lessons/lesson4.html

You are also welcome to work through the accompanying Kaggle notebook if you like the video.

2) In addition to the video, I recommend reading the first chapters of this course:

https://huggingface.co/learn/nlp-course/chapter1/1


Try to understand the underlying principles and processes and log them in your learning portfolio! A few suggestions: What is a pre-trained NLP model? How do I load one? What is tokenization? What does fine-tuning mean? What types of NLP models are there? What can I do with the Transformers package? etc.

# Practical part (Assignment, May 17)

1) Preprocessing: The data, which I provide as a zip file in Olat, must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.

2) We need a training dataset and a test dataset. My suggestion would be that for each research field we use the first 5700 lines for the training dataset and the last 300 lines for the test dataset. Please stick to this because then we can compare our models better!

3) Please use a pre-trained model from Hugging Face to build a classification model that tries to predict the correct research field out of the 26. Please calculate the accuracy for each research field and the overall accuracy across all research fields. If you solve this task in a group, you can also try different pre-trained models. In addition to the abstracts, you can also check whether the model improves if you include keywords and titles.

Some links, which can help you:

https://huggingface.co/docs/transformers/training

https://huggingface.co/docs/transformers/tasks/sequence_classification

One last request: Please always use PyTorch and not TensorFlow!
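
For step 1 above, a minimal sketch of how the table could be assembled. It assumes that the research field is the part of each file name before the first underscore (the same convention the `get_fields` cell relies on) and that the CSV files already contain the keyword, title and abstract columns. The function name `load_field_table` is hypothetical.

```python
import os
import pandas as pd

def load_field_table(directory):
    """Read every CSV in `directory` and tag its rows with the research field.

    Assumption: the field is the part of the file name (without extension)
    before the first '_'.
    """
    frames = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.csv'):
            continue
        df = pd.read_csv(os.path.join(directory, filename))
        stem = os.path.splitext(filename)[0]
        df['Research Field'] = stem.split('_')[0]
        frames.append(df)
    # One combined table with a clean 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

If the files already carry a `Research Field` column, this step only overwrites it with the value derived from the file name.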

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Get all the fields unique occurrences (set)
import os

def get_fields(directory):
    fields = set()
    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            field = filename.split('_')[0]
            fields.add(field)
    return list(fields)


directory = '/content/drive/MyDrive/Colab_Notebooks/Lab6/data_cleaned'
fields = get_fields(directory)
print(fields)

In [None]:
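# Read each CSV; the first 5700 rows go into train_df and the last 300 into test_df
import pandas as pd

def get_data(directory):
    train_dfs = []
    test_dfs = []
    # sorted() makes the file order, and therefore the row order, reproducible
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.csv'):
            filepath = os.path.join(directory, filename)
            df = pd.read_csv(filepath)
            # Note: for files with fewer than 6000 rows these two slices overlap
            train_df = df.iloc[:5700]
            test_df = df.iloc[-300:]
            train_dfs.append(train_df)
            test_dfs.append(test_df)
    # ignore_index=True gives the combined frames a clean 0..n-1 index
    train_data = pd.concat(train_dfs, ignore_index=True)
    test_data = pd.concat(test_dfs, ignore_index=True)
    return train_data, test_data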

train_data, test_data = get_data(directory)

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
train_data.describe()

In [None]:
test_data.describe()

In [None]:
train_data.info()
test_data.info()

In [None]:
print(5700/6000)
print(154594/(154594+23400))

It looks like not all files contain 6000 rows. One option would be to always split each file 95% / 5% into train / test. For now I stick with taking the first 5700 rows for training, as the assignment requires.
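
The ratio-based alternative mentioned above could look roughly like this. It is only a sketch: `split_by_ratio` and its `train_frac` parameter are hypothetical names, and the notebook itself keeps the fixed 5700 / 300 split.

```python
import pandas as pd

def split_by_ratio(df, train_frac=0.95):
    """Split one file's rows into train/test by position, without overlap."""
    cut = int(round(len(df) * train_frac))
    # Everything before `cut` is train, everything from `cut` on is test
    return df.iloc[:cut], df.iloc[cut:]
```

For a file with exactly 6000 rows this reproduces the 5700 / 300 split. For shorter files the two parts stay disjoint, which is not guaranteed when slicing with `iloc[:5700]` and `iloc[-300:]`.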

In [None]:
# Many 'Index Keywords' fields are empty. Impute each empty one with the value from the row above,
# but only if that row has the same research field, since such rows cover similar topics.

def fill_missing_keywords(df):
    prev_keywords = None
    prev_field = None
    filled_keywords = []
    for index, row in df.iterrows():
        keywords = row['Index Keywords']
        field = row['Research Field']
        if pd.isnull(keywords) and field == prev_field:
            filled_keywords.append(prev_keywords)
        else:
            filled_keywords.append(keywords)
            prev_keywords = keywords
        prev_field = field
    return filled_keywords

train_data['Index Keywords'] = fill_missing_keywords(train_data)
test_data['Index Keywords'] = fill_missing_keywords(test_data)

In [None]:
train_data.info()
test_data.info()

Now there are fewer null rows for Index Keywords (more non-null rows).
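
The same imputation can probably be expressed without an explicit row loop. The sketch below assumes the column names used above. The helper name `fill_missing_keywords_vectorized` is hypothetical. It forward-fills keywords only within consecutive runs of the same research field, which matches the row-by-row rule.

```python
import pandas as pd

def fill_missing_keywords_vectorized(df):
    """Forward-fill 'Index Keywords' within consecutive rows of the same field."""
    # Work on a clean positional index so grouping aligns row by row
    field = df['Research Field'].reset_index(drop=True)
    keywords = df['Index Keywords'].reset_index(drop=True)
    # A new run starts whenever the field differs from the row above
    run_id = (field != field.shift()).cumsum()
    # Fill each gap from the row above, but never across run boundaries
    return keywords.groupby(run_id).ffill().tolist()
```

Like `fill_missing_keywords`, it returns a list that can be assigned back to the `Index Keywords` column.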

In [None]:
!pip install transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=26)

In [None]:
!pip install datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import torch
from datasets import Dataset, ClassLabel


# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build one label mapping from the training data and reuse it for the test data,
# so that the same research field always maps to the same integer
label_names = sorted(train_data['Research Field'].unique().tolist())
label_feature = ClassLabel(names=label_names)

# Define a function to preprocess the data
def preprocess_data(data):
    # Select the input text (missing titles become empty strings) and the labels
    input_text = data['Title'].fillna('').astype(str)
    labels = data['Research Field'].to_list()

    # Tokenize the input text
    encodings = tokenizer(input_text.to_list(), truncation=True, padding=True)

    # Encode the label strings as integers
    labels = [label_feature.str2int(label) for label in labels]

    # Convert the data to a format that can be used by the Trainer
    dataset = Dataset.from_dict({**encodings, 'labels': labels})

    return dataset

# Preprocess the training and test data (the DataFrames are kept under their original names)
train_dataset = preprocess_data(train_data)
test_dataset = preprocess_data(test_data)

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    logging_dir='./logs',
)

# Create a Trainer and fine-tune the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_names))
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()

In [None]:
from torch.utils.data import Dataset

# Map each research field name to an integer id (shared by train and test)
label_names = sorted(train_data['Research Field'].unique())
label2id = {name: i for i, name in enumerate(label_names)}

# Define a function to preprocess the data
def preprocess(data, batch_size=1000, max_length=64):
    input_text = data['Title'].fillna('').astype(str)
    labels = data['Research Field'].map(label2id)

    # Tokenize the input text in batches. Padding every batch to the same
    # max_length (64 is an assumed cap for titles) keeps all rows the same size.
    encodings = {}
    for i in range(0, len(input_text), batch_size):
        batch = input_text.iloc[i:i+batch_size]
        batch_encodings = tokenizer(batch.to_list(), truncation=True,
                                    padding='max_length', max_length=max_length)
        # Collect whatever keys the tokenizer returns (input_ids, attention_mask, ...)
        for key, values in batch_encodings.items():
            encodings.setdefault(key, []).extend(values)

    return encodings, labels

# Wrap the encodings and labels in a PyTorch Dataset that the Trainer can use
class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Preprocess the training and test data
train_encodings, train_labels = preprocess(train_data)
test_encodings, test_labels = preprocess(test_data)

# Convert the labels to a NumPy array
train_labels = train_labels.to_numpy()
test_labels = test_labels.to_numpy()

train_data = CustomDataset(train_encodings, train_labels)
test_data = CustomDataset(test_encodings, test_labels)

In [None]:
print(f"Type of train_labels: {type(train_labels)}")
print(f"Shape of train_labels: {train_labels.shape}")
print(f"Type of test_labels: {type(test_labels)}")
print(f"Shape of test_labels: {test_labels.shape}")

In [None]:
def analyze_dataset(dataset):
    # Check if the dataset has a 'labels' key
    if 'labels' not in dataset[0]:
        print("The dataset does not contain a 'labels' key.")
    else:
        print("The dataset contains a 'labels' key.")
        # Check the shape of the labels
        labels_shape = dataset[0]['labels'].shape
        print(f"The shape of the labels is: {labels_shape}")

# Analyze the training and test datasets
print("Analyzing the training dataset:")
analyze_dataset(train_data)
print("\nAnalyzing the test dataset:")
analyze_dataset(test_data)

In [None]:
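import numpy as np
from transformers import TrainingArguments, Trainer

# Report accuracy during evaluation; the Trainer prefixes the key with "eval_"
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': float((predictions == labels).mean())}

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # newer transformers releases call this eval_strategy
    logging_dir='./logs',
)

# Create a Trainer and fine-tune the model (26 research fields, as in the assignment)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=26)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    compute_metrics=compute_metrics,
)
trainer.train()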

In [None]:
# Evaluate the model on the test data
results = trainer.evaluate()
print(f'Overall accuracy: {results["eval_accuracy"]}')
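
The assignment also asks for the accuracy of each research field. A minimal sketch of how that could be computed from integer labels and predictions is shown here. The helper name `per_class_accuracy` is hypothetical, and it assumes the labels are encoded as integers `0 .. num_classes - 1`.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Return the overall accuracy and an array with one accuracy per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    overall = float((y_true == y_pred).mean())
    # Classes that never occur in y_true stay NaN
    per_class = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            per_class[c] = (y_pred[mask] == c).mean()
    return overall, per_class
```

With a fitted `trainer`, the inputs could come from `output = trainer.predict(test_data)`, using `np.argmax(output.predictions, axis=-1)` as `y_pred` and `output.label_ids` as `y_true`.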

Addition: Accuracy measures whether the research field with the highest probability value matches the target. With 26 research fields, it would also be interesting to know if the correct target is at least among the three highest probability values.

$\begin{pmatrix} A \\ B \\ C \\ D \\ E \end{pmatrix} = \begin{pmatrix} 0.1 \\ 0.95 \\ 0.5 \\ 0.2 \\ 0.3 \end{pmatrix} \rightarrow \text{Choice}_1 = B, \quad \text{Choice}_3 = \{B, C, E\}$
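
The top-3 idea can be sketched with NumPy as follows. The function name `top_k_accuracy` is hypothetical. `scores` is assumed to hold one row of class scores per example, and `labels` the integer index of the correct class.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=3):
    """Share of rows whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row, best first
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# The example from the formula above
classes = np.array(['A', 'B', 'C', 'D', 'E'])
scores = np.array([[0.1, 0.95, 0.5, 0.2, 0.3]])
top_3 = classes[np.argsort(-scores[0])[:3]]
print(top_3)  # ['B' 'C' 'E']
```

With `k=1` this reduces to ordinary accuracy, so both numbers can be reported side by side.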