This assignment consists of a theoretical part (learning portfolio) and a practical part (assignment). The goal is to build a classification model that predicts the subject area an abstract comes from. The plan is as follows: next week we will discuss what you learned from the theory part, so you are relatively free in how you fill your Learning Portfolio on this new topic. In two weeks we will discuss your solutions for the classification model.

# Theory part (filling your Learning Portfolio, May 10)

In preparation for the practical part, please familiarize yourself with the following resources over the next week:

1) Please watch the following video:

https://course.fast.ai/Lessons/lesson4.html

You are also welcome to work through the accompanying Kaggle notebook if you like the video.

2) In addition to the video, I recommend reading the first chapters of the course:

https://huggingface.co/learn/nlp-course/chapter1/1


Try to understand the main concepts and processes and log them in your learning portfolio! A few suggestions: What is a pre-trained NLP model? How do I load one? What is tokenization? What does fine-tuning mean? What types of NLP models are there? What can I do with the Transformers package?

# Practical part (Assignment, May 17)

1) Preprocessing: The data, which I provide as a zip file in Olat, must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.
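As a minimal sketch of this rule, assuming the files are named like `MATH_1991-2000.csv` (a field code, an underscore, then a year range), the research field can be read from the part of the file name before the first underscore. The helper name `research_field_from_filename` and the example path are hypothetical.

```python
import os

def research_field_from_filename(path):
    """Return the part of the file name before the first underscore."""
    file_name = os.path.basename(path)      # drop the directories
    stem = os.path.splitext(file_name)[0]   # drop the extension
    return stem.split("_")[0]               # keep the field code only

# Assumed naming scheme: <FIELD>_<years>.csv
print(research_field_from_filename("/data/MATH_1991-2000.csv"))  # MATH
```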

2) We need a training dataset and a test dataset. I suggest that for each research field we use the first 5700 lines for the training dataset and the last 300 lines for the test dataset. Please stick to this split so that we can compare our models more easily!
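The suggested split can be sketched as follows, assuming the rows of one research field are already in their original order. The function name `split_first_last` is hypothetical, and the toy list only stands in for real rows; with a pandas DataFrame the same idea would be `df.iloc[:5700]` and `df.iloc[-300:]`.

```python
def split_first_last(rows, n_train=5700, n_test=300):
    """Return (first n_train rows, last n_test rows) of one research field."""
    if len(rows) < n_train + n_test:
        # Guard against overlapping train and test rows
        raise ValueError("not enough rows for a non-overlapping split")
    return rows[:n_train], rows[-n_test:]

# Toy data standing in for the 6000 rows of one research field
rows = list(range(6000))
train_rows, test_rows = split_first_last(rows)
print(len(train_rows), len(test_rows))  # 5700 300
```

Applying the same fixed counts to every field keeps the test sets comparable between groups, which a percentage-based split does not guarantee when the files differ in length.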

3) Please use a pre-trained model from Hugging Face to build a classification model that tries to predict the correct research field out of the 26. Please calculate the accuracy for each research field and the overall accuracy across all research fields. If you solve this task in a group, you can also try different pre-trained models. In addition to the abstracts, you can also check whether the model improves if you include keywords and titles.
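One possible reading of the accuracy requirement is an accuracy per research field plus one overall value. The sketch below shows that calculation on made-up labels; the function name `accuracy_per_field` and the toy predictions are assumptions for illustration, not results.

```python
from collections import Counter

def accuracy_per_field(y_true, y_pred):
    """Return ({field: accuracy}, overall accuracy) for parallel label lists."""
    total = Counter(y_true)
    # Count only the examples whose prediction matches the target
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_field = {field: correct[field] / total[field] for field in total}
    overall = sum(correct.values()) / len(y_true)
    return per_field, overall

# Made-up targets and predictions
y_true = ["MATH", "MATH", "CHEM", "CHEM", "ARTS"]
y_pred = ["MATH", "CHEM", "CHEM", "CHEM", "MATH"]
per_field, overall = accuracy_per_field(y_true, y_pred)
print(per_field)  # {'MATH': 0.5, 'CHEM': 1.0, 'ARTS': 0.0}
print(overall)    # 0.6
```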

Some links that may help you:

https://huggingface.co/docs/transformers/training

https://huggingface.co/docs/transformers/tasks/sequence_classification

One last request: Please always use PyTorch and not TensorFlow!

#### 1) Preprocessing

The data provided as a zip file in Olat must be processed first. We need a table of the following form:

Keywords | Title | Abstract | Research Field

The research field is determined by the name of the file.

In [1]:
'''import pandas as pd
import glob
import os
from google.colab import drive
from google.colab import data_table

drive.mount('/content/drive')
path = r"/content/drive/MyDrive/Colab Notebooks/data"
all_files = glob.glob(os.path.join(path, "*.csv"))

for file in all_files:
    try:
        df = pd.read_csv(file)
        # Extract the file name without extension
        file_name = os.path.basename(file).split(".")[0]
        # Add a new column with the file name
        df['File Name'] = file_name
        # Append the DataFrame to the list
    except pd.errors.ParserError:
        print(f"Error reading file: {file}")
        # Needed to delete Line 1061 from MATH_1991-2000 because it was shifted 
        continue
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)


'''


In [2]:
import pandas as pd
import glob
import os
from google.colab import drive
from google.colab import data_table
from sklearn.model_selection import train_test_split

drive.mount('/content/drive')
path = r"/content/drive/MyDrive/Colab Notebooks/data"


df_list = []
df_train= pd.DataFrame()
df_test= pd.DataFrame()

for file in os.listdir(path):
    if file.endswith('.csv'):
      file_path= os.path.join(path, file)
      try:
          df = pd.read_csv(file_path)
          # Derive the research field from the part of the file name before the first underscore
          research_field = file.split('_')[0]
          df['Research Field'] = research_field

          # Concatenate author keywords and index keywords
          df['keywords'] = df['Author Keywords'].fillna('') + ' ' + df['Index Keywords'].fillna('')

          # Replace the placeholder "[No abstract available]" with title and keywords
          df.loc[df['Abstract'] == '[No abstract available]', 'Abstract'] = df['Title'] + ' ' + df['keywords']

          # Drop unnecessary columns
          useful_cols = ['keywords', 'Title', 'Abstract', 'Research Field']
          df = df[useful_cols]

          # Split each file: first 95 % of its rows for training, last 5 % for testing
          split = len(df)
          split_index = int(0.95 * split)
          df_train = pd.concat([df_train, df[:split_index]])
          df_test = pd.concat([df_test, df[split_index:]])

          # Append the DataFrame to the list
          #df_list.append(df)

      except pd.errors.ParserError:
          print(f"Error reading file: {file}")
          # Needed to delete Line 1061 from MATH_1991-2000 because it was shifted 
          continue

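# Stratified 85/15 split of the training data by research field.
# Note: the 85 % part is stored in `test_data` (despite its name), and `df_train`
# itself is not reduced, so the rows of `df_valid` are still contained in `df_train`.
test_data, df_valid = train_test_split(df_train, test_size=0.15, stratify=df_train['Research Field'], random_state=42)

# Order of the printed lengths: test, train, validation
print("Length: " + str(len(df_test)), str(len(df_train)), str(len(df_valid)))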



Mounted at /content/drive
Length: 7630 144964 21745


In [3]:
df_train.head()


Unnamed: 0,keywords,Title,Abstract,Research Field
0,amino acid sequence; article; Bayes theorem; ...,BEAST: Bayesian evolutionary analysis by sampl...,Background. The evolutionary analysis of molec...,AGRI
1,antioxidant; hydrogen peroxide; oxygen; react...,"Oxidative stress, antioxidants and stress tole...","Traditionally, reactive oxygen intermediates (...",AGRI
2,conservation; ecology; evolution; herbarium; ...,Novel methods improve prediction of species' d...,Prediction of species' distributions is centra...,AGRI
3,ecological modeling; estimation method; evolu...,Generalized linear mixed models: a practical g...,How should ecologists and evolutionary biologi...,AGRI
4,Biodiversity; Complementary resource use; Ecos...,Effects of biodiversity on ecosystem functioni...,Humans are altering the composition of biologi...,AGRI


In [4]:
# Reduce the DataFrame sizes to 2.5 % random samples to speed up the whole calculation

df_train=df_train.sample(frac=0.025)
df_valid=df_valid.sample(frac=0.025)
test_data=test_data.sample(frac=0.025)


In [5]:
print(len(df_train))

3624


#### Import Hugging Face pre-trained model

In [6]:

!pip uninstall transformers -y



In [7]:
!pip install transformers==4.28.0
!pip install --upgrade accelerate
!pip install datasets
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transform

In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset,DatasetDict


# Create DataFrames with the text input ("Abstract") and the labels ("Research Field")
train_dict= df_train.loc[:, ['Abstract', 'Research Field']]
train_dict.columns = ["text", "labels"]

test_dict= df_test.loc[:, ['Abstract', 'Research Field']]
test_dict.columns = ["text", "labels"]

valid_dict= df_valid.loc[:, ['Abstract', 'Research Field']]
valid_dict.columns = ["text", "labels"]

print(valid_dict)



#Autotokenizer
#from transformers import AutoTokenizer
#tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

from transformers import DistilBertTokenizer, DistilBertModel
import torch
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


#Define Labels 
id2label = {1: 'DENT', 2: 'AGRI', 3: 'ENER', 4: 'PSYC', 5: 'DECI', 6: 'VETE', 7: 'PHAR', 8: 'MATH',
       9: 'NURS', 10: 'ECON', 11: 'COMP', 12: 'ARTS', 13: 'CENG', 14: 'ENVI', 15: 'SOCI', 16: 'BIOC',
       17: 'MATE', 18: 'CHEM', 19: 'HEAL', 20: 'ENGI', 21: 'BUSI', 22: 'NEUR', 23: 'MEDI', 24: 'IMMU',
       25: 'PHYS', 0: 'EART'}
label2id = {value: key for key, value in id2label.items()}
print(id2label)

#def tokenize_function(input):
  #return tokenizer(input["text"], padding="max_length", truncation=True)

def tokenize_function(x):
    tokens = tokenizer(x['text'], truncation=True, padding="max_length")
    # Map the string labels in x["labels"] to their numerical ids
    tokens["labels"] = [label2id[label] for label in x["labels"]]
    return tokens

# Create Hugging Face `datasets` objects from the DataFrames
train_dataset = Dataset.from_pandas(train_dict)
test_dataset = Dataset.from_pandas(test_dict)
valid_dataset = Dataset.from_pandas(valid_dict)

Dataset_dictionary = DatasetDict({'train': train_dataset, 'test': test_dataset, 'valid': valid_dataset})

tokenized_datasets = Dataset_dictionary.map(tokenize_function, batched=True)


                                                   text labels
1408                     The identity theory of truth     ARTS
491   Unparalleled in scope compared to the literatu...   MATH
1411  This paper intends to review the basic theory ...   CHEM
1417  Placentas from scrapie-affected ewes are known...   IMMU
137   The statistics of extremes have played an impo...   ENVI
...                                                 ...    ...
1611  BACKGROUND: In the past, the keratocytes of th...   HEAL
987   Magnesium chloride, as compared to alum and po...   ENVI
778   Influence maximization is the problem of findi...   ENGI
942   Structurally Diverse π-Cyclopentadienyl Comple...   CHEM
215   Pfam is a widely used database of protein fami...   BIOC

[544 rows x 2 columns]


{1: 'DENT', 2: 'AGRI', 3: 'ENER', 4: 'PSYC', 5: 'DECI', 6: 'VETE', 7: 'PHAR', 8: 'MATH', 9: 'NURS', 10: 'ECON', 11: 'COMP', 12: 'ARTS', 13: 'CENG', 14: 'ENVI', 15: 'SOCI', 16: 'BIOC', 17: 'MATE', 18: 'CHEM', 19: 'HEAL', 20: 'ENGI', 21: 'BUSI', 22: 'NEUR', 23: 'MEDI', 24: 'IMMU', 25: 'PHYS', 0: 'EART'}


In [17]:
print(label2id)
print(Dataset_dictionary)

{'DENT': 1, 'AGRI': 2, 'ENER': 3, 'PSYC': 4, 'DECI': 5, 'VETE': 6, 'PHAR': 7, 'MATH': 8, 'NURS': 9, 'ECON': 10, 'COMP': 11, 'ARTS': 12, 'CENG': 13, 'ENVI': 14, 'SOCI': 15, 'BIOC': 16, 'MATE': 17, 'CHEM': 18, 'HEAL': 19, 'ENGI': 20, 'BUSI': 21, 'NEUR': 22, 'MEDI': 23, 'IMMU': 24, 'PHYS': 25, 'EART': 0}
DatasetDict({
    train: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 3624
    })
    test: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 7630
    })
    valid: Dataset({
        features: ['text', 'labels', '__index_level_0__'],
        num_rows: 544
    })
})


#### Evaluation


In [9]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)




#### Trainer


In [27]:
print(tokenized_datasets["train"])

temp = tokenized_datasets["train"]
for feature in temp.features:
    print(f"Attribute: {feature}")
    for i in range(2):  # Print the first 2 rows
        print(temp[feature][i])
    print()

Dataset({
    features: ['text', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 3624
})
Attribute: text
Designer magnets  
Hepatitis B virus (HBV) polymerase and human immunodeficiency virus (HIV) reverse transcriptase are structurally related. However, the HBV enzyme has a protein priming activity absent in the HIV enzyme. Approved nucleoside/ nucleotide inhibitors of the HBV polymerase include lamivudine, adefovir, telbivudine, entecavir and tenofovir. Although most of them target DNA elongation, guanosine and adenosine analogs (e.g. entecavir and tenofovir, respectively) also impair protein priming. Major mutational patterns conferring nucleoside/nucleotide analog resistance include the combinations rtL180M/rtM204(I/V) (for lamivudine, entecavir, telbivudine and clevudine) and rtA181V/rtN236T (for adefovir and tenofovir). However, development of drug resistance is very slow for entecavir and tenofovir. Novel nucleoside/nucleotide analogs in advanced cli

In [22]:
data = {
    'text': ['text1', 'text2', 'text3', 'text4', 'text5'],
    'labels': [1, 2, 3, 4, 5],
    '__index_level_0__': [0, 1, 2, 3, 4],
    'input_ids': [101, 102, 103, 104, 105],
    'attention_mask': [1, 1, 1, 1, 1]
}

for attribute, values in data.items():
    print(f"Attribute: {attribute}")
    for value in values[:5]:  # Print the first 5 values
        print(value)
    print()

Attribute: text
text1
text2
text3
text4
text5

Attribute: labels
1
2
3
4
5

Attribute: __index_level_0__
0
1
2
3
4

Attribute: input_ids
101
102
103
104
105

Attribute: attention_mask
1
1
1
1
1



In [12]:
# Print only the first example's mask; printing the whole column exceeds the notebook's output rate limit
print(tokenized_datasets["train"][0]["attention_mask"])



In [None]:
# Temporary experiment: fine-tune bert-base-uncased
# (the `tokenizer` object loaded above from distilbert-base-uncased is reused here)

from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; the predicted class is the argmax over the logits
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(
    "/content/drive/MyDrive/Colab Notebooks/TrainingModel_LiteratureClassification/",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=26, id2label=id2label, label2id=label2id)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


trainer.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Epoch,Training Loss,Validation Loss


In [None]:
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification


model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', return_dict=True, num_labels=26, id2label=id2label, label2id=label2id)
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Colab Notebooks/TrainingModel_LiteratureClassification",
)

# Train on the tokenized datasets: the raw Dataset_dictionary has no input_ids
# and still carries string labels, which the Trainer cannot use
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    compute_metrics=compute_metrics,
)

trainer.train()

#### Testing

Addition: Accuracy measures whether the research field with the highest probability value matches the target. With 26 research fields, it would also be interesting to know if the correct target is at least among the three highest probability values.

$\begin{pmatrix} A \\ B \\ C \\ D \\ E \end{pmatrix} = \begin{pmatrix} 0.1 \\ 0.95 \\ 0.5 \\ 0.2 \\ 0.3 \end{pmatrix} \rightarrow \text{Choice}_1 = \{B\},\ \text{Choice}_3 = \{B, C, E\}$
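The top-1 and top-3 choices from this example can be reproduced with a short sketch. The function names `top_k_labels` and `top_k_accuracy` are hypothetical, and the scores are the illustrative values from the example, not model output.

```python
def top_k_labels(scores, k):
    """Return the k labels with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

def top_k_accuracy(score_dicts, targets, k):
    """Share of examples whose target is among the k highest-scoring labels."""
    hits = sum(target in top_k_labels(scores, k)
               for scores, target in zip(score_dicts, targets))
    return hits / len(targets)

# Scores from the example above
scores = {"A": 0.1, "B": 0.95, "C": 0.5, "D": 0.2, "E": 0.3}
print(top_k_labels(scores, 1))  # ['B']
print(top_k_labels(scores, 3))  # ['B', 'C', 'E']

# If the true class were E, the example counts as a miss for k=1 but a hit for k=3
print(top_k_accuracy([scores], ["E"], k=1))  # 0.0
print(top_k_accuracy([scores], ["E"], k=3))  # 1.0
```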