This assignment consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to build a classification model that predicts which subject area an abstract comes from. Next week we will discuss what you learned from the theory part, so you are relatively free in how you fill your Learning Portfolio on this new topic. In two weeks we will discuss your solutions for the classification model.

# Theory part (filling your Learning Portfolio, May 10)

In preparation for the practical part, please familiarize yourself with the following resources over the next week:

1) Please watch the following video:

https://course.fast.ai/Lessons/lesson4.html

If you like the video, you are also welcome to work through the accompanying Kaggle notebook.

2) In addition to the video, I recommend reading the first chapters of the course:

https://huggingface.co/learn/nlp-course/chapter1/1


Try to understand the main concepts and record them in your learning portfolio! A few suggestions: What is a pre-trained NLP model? How do I load one? What is tokenization? What does fine-tuning mean? What types of NLP models are there? What can I do with the Transformers package?

# Practical part (Assignment, May 17)

1) Preprocessing: The data, which I provide as a zip file in Olat, must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.

2) We need a training dataset and a test dataset. I suggest that for each research field we use the first 5700 rows for the training dataset and the last 300 rows for the test dataset. Please stick to this split so that we can compare our models more easily!

3) Please use a pre-trained model from Hugging Face to build a classification model that predicts the correct research field out of the 26. Please calculate the accuracy for each research field and the overall accuracy across all research fields. If you solve this task in a group, you can also try different pre-trained models. In addition to the abstracts, you can also check whether the model improves if you include keywords and titles.
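The split described in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference solution: the helper name `split_per_field` and the toy data are made up, and the sketch assumes each field has at least `n_train + n_test` rows (otherwise the two slices overlap).

```python
import pandas as pd

def split_per_field(df, field_col="Research Field", n_train=5700, n_test=300):
    """Per research field: first n_train rows -> train, last n_test rows -> test."""
    train_parts, test_parts = [], []
    # sort=False keeps the fields in order of first appearance;
    # rows inside each group keep their original order.
    for _, group in df.groupby(field_col, sort=False):
        train_parts.append(group.iloc[:n_train])
        test_parts.append(group.iloc[-n_test:])
    train = pd.concat(train_parts, ignore_index=True)
    test = pd.concat(test_parts, ignore_index=True)
    return train, test

# Toy data (made up): 2 fields with 10 rows each, split 7 / 3 instead of 5700 / 300
toy = pd.DataFrame({
    "Abstract": [f"text {i}" for i in range(20)],
    "Research Field": ["MATH"] * 10 + ["PHYS"] * 10,
})
train_df, test_df = split_per_field(toy, n_train=7, n_test=3)
print(len(train_df), len(test_df))
```

Splitting by fixed row counts rather than by a percentage keeps the test set identical for everyone, even if a file ends up with a slightly different number of rows after cleaning.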

Some links, which can help you:

https://huggingface.co/docs/transformers/training

https://huggingface.co/docs/transformers/tasks/sequence_classification

One last request: Please always use PyTorch and not TensorFlow!

#### 1) Preprocessing

The data, which I provide as a zip file in Olat, must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.

In [None]:
'''import pandas as pd
import glob
import os
from google.colab import drive
from google.colab import data_table

drive.mount('/content/drive')
path = r"/content/drive/MyDrive/Colab Notebooks/data"
all_files = glob.glob(os.path.join(path, "*.csv"))

for file in all_files:
    try:
        df = pd.read_csv(file)
        # Extract the file name without extension
        file_name = os.path.basename(file).split(".")[0]
        # Add a new column with the file name
        df['File Name'] = file_name
        # Append the DataFrame to the list
    except pd.errors.ParserError:
        print(f"Error reading file: {file}")
        # Needed to delete Line 1061 from MATH_1991-2000 because it was shifted 
        continue
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)


'''

In [None]:
import pandas as pd
import glob
import os
from google.colab import drive
from google.colab import data_table
from sklearn.model_selection import train_test_split

drive.mount('/content/drive')
path = r"/content/drive/MyDrive/Colab Notebooks/data"


df_list = []
df_train= pd.DataFrame()
df_test= pd.DataFrame()

for file in os.listdir(path):
    if file.endswith('.csv'):
      file_path= os.path.join(path, file)
      try:
          df = pd.read_csv(file_path)
          # The research field is the file-name prefix before the first underscore
          research_field = file.split('_')[0]
          df['Research Field'] = research_field

          # Concatenate author keywords and index keywords
          df['keywords'] = df['Author Keywords'].fillna('') + ' ' + df['Index Keywords'].fillna('')

          # Replace the placeholder "[No abstract available]" with title and keywords
          df.loc[df['Abstract'] == '[No abstract available]', 'Abstract'] = df['Title'] + ' ' + df['keywords']

          # Drop unnecessary columns
          useful_cols = ['keywords', 'Title', 'Abstract', 'Research Field']
          df = df[useful_cols]

          # Split into training data (first 95% of rows) and test data (last 5%)
          split = len(df)
          split_index = int(0.95 * split)
          df_train = pd.concat([df_train, df[:split_index]])
          df_test = pd.concat([df_test, df[split_index:]])

          # Append the DataFrame to the list
          #df_list.append(df)

      except pd.errors.ParserError:
          print(f"Error reading file: {file}")
          # Needed to delete Line 1061 from MATH_1991-2000 because it was shifted 
          continue

# Hold out 15% of the training data for validation, stratified by research field.
# Assigning the first part back to df_train keeps the validation rows out of training.
df_train, df_valid = train_test_split(df_train, test_size=0.15, stratify=df_train['Research Field'], random_state=42)

print("Length: " + str(len(df_test)), str(len(df_train)), str(len(df_valid)))



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Length: 7630 144964 21745


#### Import Hugging Face pre-trained model

In [None]:

!pip uninstall transformers -y



In [None]:
!pip install transformers==4.28.0
!pip install --upgrade accelerate
!pip install datasets
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transform

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset,DatasetDict


#Select the text input ("Abstract") and the labels ("Research Field") as DataFrames
train_dict= df_train.loc[:, ['Abstract', 'Research Field']]
train_dict.columns = ["text", "labels"]

test_dict= df_test.loc[:, ['Abstract', 'Research Field']]
test_dict.columns = ["text", "labels"]

valid_dict= df_valid.loc[:, ['Abstract', 'Research Field']]
valid_dict.columns = ["text", "labels"]

print(valid_dict)



#Autotokenizer
#from transformers import AutoTokenizer
#tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

from transformers import DistilBertTokenizer, DistilBertModel
import torch
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


#Define Labels 
id2label = {1: 'DENT', 2: 'AGRI', 3: 'ENER', 4: 'PSYC', 5: 'DECI', 6: 'VETE', 7: 'PHAR', 8: 'MATH',
       9: 'NURS', 10: 'ECON', 11: 'COMP', 12: 'ARTS', 13: 'CENG', 14: 'ENVI', 15: 'SOCI', 16: 'BIOC',
       17: 'MATE', 18: 'CHEM', 19: 'HEAL', 20: 'ENGI', 21: 'BUSI', 22: 'NEUR', 23: 'MEDI', 24: 'IMMU',
       25: 'PHYS', 0: 'EART'}
label2id = {value: key for key, value in id2label.items()}
print(id2label)

def tokenize_function(batch):
  # Tokenize a batch of abstracts, padding or truncating to the model's maximum length
  return tokenizer(batch["text"], padding="max_length", truncation=True)

#Create Dataset objects (datasets library) from the DataFrames
train_dataset = Dataset.from_pandas(train_dict)
test_dataset = Dataset.from_pandas(test_dict)
valid_dataset = Dataset.from_pandas(valid_dict)

Dataset_dictionary = DatasetDict({'train': train_dataset, 'test': test_dataset, 'valid': valid_dataset})

tokenized_datasets = Dataset_dictionary.map(tokenize_function, batched=True)


                                                   text labels
158   Phase relations of natural aphyric high-alumin...   EART
1816  Budyko's framework has been widely used to stu...   ENVI
1257  In this article an overview is given of tradit...   PSYC
1306  Keys to Clinical Success with Pulp Capping: A ...   DENT
1197  OBJECTIVE: To describe the association between...   NURS
...                                                 ...    ...
917   This review covers the synthesis, characteriza...   MATE
1216  The synthetic potential of enzymes related to ...   CHEM
1430  Lipoproteins are of great interest in understa...   IMMU
28    Bone marrow stromal cells exhibit multiple tra...   NEUR
1852  Intranasal (i.n.) immunization is a very effec...   IMMU

[21745 rows x 2 columns]


{1: 'DENT', 2: 'AGRI', 3: 'ENER', 4: 'PSYC', 5: 'DECI', 6: 'VETE', 7: 'PHAR', 8: 'MATH', 9: 'NURS', 10: 'ECON', 11: 'COMP', 12: 'ARTS', 13: 'CENG', 14: 'ENVI', 15: 'SOCI', 16: 'BIOC', 17: 'MATE', 18: 'CHEM', 19: 'HEAL', 20: 'ENGI', 21: 'BUSI', 22: 'NEUR', 23: 'MEDI', 24: 'IMMU', 25: 'PHYS', 0: 'EART'}



#### Evaluation


In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)



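The `accuracy` metric above returns a single overall number, while the assignment also asks for the accuracy of each research field. A minimal sketch of how both could be computed from predicted and true label ids is shown below. The function name `accuracy_per_field` and the toy ids are illustrative assumptions, not part of the notebook. Per-field accuracy here means the share of a field's abstracts that were predicted correctly (the per-class recall).

```python
import numpy as np

def accuracy_per_field(predictions, references, id2label):
    """Return ({field name: accuracy}, overall accuracy) for integer label ids."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    per_field = {}
    for label_id, name in id2label.items():
        mask = references == label_id          # rows whose true field is `name`
        if mask.any():                         # skip fields absent from the data
            per_field[name] = float((predictions[mask] == label_id).mean())
    overall = float((predictions == references).mean())
    return per_field, overall

# Toy example (made-up ids, three fields only)
toy_id2label = {0: "EART", 1: "DENT", 2: "AGRI"}
refs = [0, 0, 1, 1, 1, 2]
preds = [0, 1, 1, 1, 0, 2]
per_field, overall = accuracy_per_field(preds, refs, toy_id2label)
print(per_field, overall)
```

With a trained `Trainer`, the inputs could come from `trainer.predict(...)`: take `np.argmax` over the returned logits for `predictions` and use the returned label ids for `references`.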

#### Trainer

In [None]:
print(Dataset_dictionary)

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 144964
    })
    test: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 7630
    })
    valid: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 21745
    })
})


In [None]:
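from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load DistilBERT with a sequence-classification head for the 26 research fields
# (the bare DistilBertModel has no classification head and returns no loss)
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=26, id2label=id2label, label2id=label2id)
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Colab Notebooks/TrainingModel_LiteratureClassification",
  )

# The Trainer needs integer class ids, so encode the string labels with label2id
encoded_datasets = tokenized_datasets.map(
    lambda batch: {"labels": [label2id[label] for label in batch["labels"]]}, batched=True
)

# Train on the tokenized datasets; the raw datasets contain no input_ids
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_datasets['train'],
    eval_dataset=encoded_datasets["valid"],
    compute_metrics=compute_metrics,
)

trainer.train()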

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### Testing

Addition: Accuracy measures whether the research field with the highest probability value matches the target. With 26 research fields, it would also be interesting to know if the correct target is at least among the three highest probability values.

$\begin{pmatrix} A \\ B \\ C \\ D \\ E \end{pmatrix} = \begin{pmatrix} 0.1 \\ 0.95 \\ 0.5 \\ 0.2 \\ 0.3 \end{pmatrix} \rightarrow \text{Choice}_1 = \{B\}, \quad \text{Choice}_3 = \{B, C, E\}$
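The top-3 idea can be sketched with NumPy as follows. This is an illustration under assumptions rather than a prescribed implementation: the function name `top_k_accuracy` is made up, and the scores are the five example values from the formula above. Because softmax preserves the ordering of the scores, the same function should work on raw logits as well as on probabilities.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=3):
    """Share of rows whose true label id is among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # argsort is ascending, so the last k columns hold the k best class ids
    top_k = np.argsort(scores, axis=-1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Example from the formula: classes A..E with one row of scores
fields = ["A", "B", "C", "D", "E"]
scores = np.array([[0.1, 0.95, 0.5, 0.2, 0.3]])
top3 = [fields[i] for i in np.argsort(scores[0])[::-1][:3]]
print(top3)  # ['B', 'C', 'E']

# If the true field is E (id 4), the top-1 choice misses but the top-3 choice hits
print(top_k_accuracy(scores, [4], k=1))  # 0.0
print(top_k_accuracy(scores, [4], k=3))  # 1.0
```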