## NER Tagging Demo

See details at: https://www.kaggle.com/c/feedback-prize-2021/discussion/296669

In [1]:
import os

import pandas as pd
import transformers as trm
from tqdm.auto import tqdm

# Uses this utility script: https://www.kaggle.com/xhlulu/ner-tagging
import ner_tagging as tag

In [2]:
!pip install --quiet datasets transformers[sentencepiece]




In [3]:
from datasets import load_dataset
from huggingface_hub import notebook_login

Let's first load the training dataframe, along with the essay texts and a dictionary that maps each essay id to the subset of the dataframe holding that essay's annotations:

In [4]:
%%time
train_df = pd.read_csv('../input/feedback-prize-2021/train.csv')

train_dir = '../input/feedback-prize-2021/train'
train_files = list(os.listdir(train_dir))
train_ids = [f.replace('.txt', '') for f in train_files]

train_essays = [
    open(os.path.join(train_dir, f)).read()
    for f in tqdm(train_files)
]

train_id_to_df = dict(list(train_df.groupby('id')))

  0%|          | 0/15594 [00:00<?, ?it/s]

CPU times: user 3.39 s, sys: 1.45 s, total: 4.84 s
Wall time: 45 s


In [5]:
train_id_to_df["0000D23A521A"]

Unnamed: 0,id,discourse_id,discourse_start,discourse_end,discourse_text,discourse_type,discourse_type_num,predictionstring
59951,0000D23A521A,1617735000000.0,0.0,170.0,"Some people belive that the so called ""face"" o...",Position,Position 1,0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18...
59952,0000D23A521A,1617735000000.0,170.0,357.0,"It was not created by aliens, and there is no ...",Evidence,Evidence 1,34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 4...
59953,0000D23A521A,1617735000000.0,358.0,438.0,"A mesa is a naturally occuring rock formation,...",Evidence,Evidence 2,69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
59954,0000D23A521A,1617735000000.0,438.0,626.0,"This ""face"" on mars only looks like a face bec...",Claim,Claim 1,84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 9...
59955,0000D23A521A,1617735000000.0,627.0,722.0,Many conspiracy theorists believe that NASA is...,Counterclaim,Counterclaim 1,117 118 119 120 121 122 123 124 125 126 127 12...
59956,0000D23A521A,1617735000000.0,722.0,836.0,These people would be very wrong. If NASA foun...,Rebuttal,Rebuttal 1,134 135 136 137 138 139 140 141 142 143 144 14...
59957,0000D23A521A,1617735000000.0,836.0,1014.0,"NASA's budget would increase drasticly, which ...",Evidence,Evidence 3,154 155 156 157 158 159 160 161 162 163 164 16...
59958,0000D23A521A,1617735000000.0,1015.0,1343.0,"So, NASA is not hiding life on Mars from us, a...",Concluding Statement,Concluding Statement 1,186 187 188 189 190 191 192 193 194 195 196 19...


Let's tokenize the training text. This notebook uses `distilbert-base-uncased`, but you can swap in any tokenizer you want; the important part is to keep `return_offsets_mapping=True` (which requires a fast tokenizer) so the offsets can be passed to the `tag.create_target` function later.

In [6]:
tokenizer = trm.AutoTokenizer.from_pretrained("distilbert-base-uncased", return_offsets_mapping=True, truncation=True, max_length=512, padding='max_length')
tokens = tokenizer(train_essays, return_offsets_mapping=True, truncation=True, max_length=512, padding='max_length')

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

To continue training from a previously fine-tuned checkpoint, run the following cells. To start from the original `distilbert-base-uncased` instead, skip them and set `model_checkpoint = "distilbert-base-uncased"`.

In [7]:
model_checkpoint = "NahedAbdelgaber/evaluating-student-writing-distibert-ner"


In [8]:
tokenizer = trm.AutoTokenizer.from_pretrained(model_checkpoint, return_offsets_mapping=True, truncation=True, max_length=512, padding='max_length')
tokens = tokenizer(train_essays, return_offsets_mapping=True, truncation=True, max_length=512, padding='max_length')

Downloading:   0%|          | 0.00/462 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [9]:
len(tokenizer(train_essays[0], return_offsets_mapping=True, truncation=True, max_length=512, padding='max_length')["input_ids"])

512

In [10]:
train_essays[0]

"I think we should be able to play in a sport if we have a grade C. I think i would be not fear for student that have a good grade like c to play in a sport. If we had a D or an F i would understand that but a C i nothing. Not a lot of kid get A or Bs and if we do. Some of those kids don't like to play a sport they like to do all there homework not that i am saying that a bad C grade people do there homework to. If there is only 1 out of 4 percent of student that get A and B, They all don't like the same sports and some don't like to do sports so it wouldn't be a hole team in that sport. that means you would have to cancel all the sport teams in the school. That why you should let C student play an a sport."

We will now generate the target training data. 

In [11]:
train_targets = []

# First, you need to generate the tags from labels in the training dataframe
tags, tag_to_num = tag.generate_tags(train_df.discourse_type, scheme="BILOU")

for i, essay in enumerate(tqdm(train_essays)):
    essay_df = train_id_to_df[train_ids[i]]
    
    # Using the offset_mapping obtained from the tokenizer, we can align
    # it with the tagged characters to create the target for our model
    target = tag.create_target(
        text=essay,
        labels=essay_df.discourse_type,
        start=essay_df.discourse_start,
        end=essay_df.discourse_end,
        offset_mapping=tokens.offset_mapping[i],
        tag_to_num=tag_to_num,
        scheme="BILOU"
    )
    train_targets.append(target)

  0%|          | 0/15594 [00:00<?, ?it/s]
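The alignment performed by `tag.create_target` above can be sketched in plain Python. This is a minimal illustration, not the utility script's actual implementation: the `bilou_targets` helper and the toy offsets are hypothetical, and it assumes that special and padding tokens (offset `(0, 0)`) are tagged `O`, which the notebook does not confirm.

```python
def bilou_targets(offset_mapping, spans, tag_to_num):
    """Assign one BILOU tag id per token from character-level spans.

    offset_mapping: list of (char_start, char_end) per token; (0, 0) marks
        special or padding tokens, as Hugging Face fast tokenizers emit.
    spans: list of (char_start, char_end, label) annotations.
    """
    # Assumption: everything not covered by a span (including padding) is "O"
    targets = [tag_to_num["O"]] * len(offset_mapping)
    for span_start, span_end, label in spans:
        # Indices of real tokens that lie entirely inside the annotated span
        inside = [
            i for i, (tok_start, tok_end) in enumerate(offset_mapping)
            if tok_end > tok_start and tok_start >= span_start and tok_end <= span_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            # A single-token span gets the Unit tag
            targets[inside[0]] = tag_to_num[f"U-{label}"]
            continue
        targets[inside[0]] = tag_to_num[f"B-{label}"]   # Beginning
        targets[inside[-1]] = tag_to_num[f"L-{label}"]  # Last
        for i in inside[1:-1]:
            targets[i] = tag_to_num[f"I-{label}"]       # Inside


    return targets


# Toy example: "NASA found life" wrapped in special tokens with (0, 0) offsets
toy_offsets = [(0, 0), (0, 4), (5, 10), (11, 15), (0, 0)]
toy_tags = {"O": 0, "B-Claim": 1, "I-Claim": 2, "L-Claim": 3, "U-Claim": 4}
toy_targets = bilou_targets(toy_offsets, [(0, 15, "Claim")], toy_tags)
print(toy_targets)
```

Each target list has the same length as the token sequence, which is why the targets built above line up with the 512 `input_ids` per essay.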

In [12]:
tag_to_num

{'O': 0,
 'B-Evidence': 1,
 'I-Evidence': 2,
 'L-Evidence': 3,
 'U-Evidence': 4,
 'B-Rebuttal': 5,
 'I-Rebuttal': 6,
 'L-Rebuttal': 7,
 'U-Rebuttal': 8,
 'B-Lead': 9,
 'I-Lead': 10,
 'L-Lead': 11,
 'U-Lead': 12,
 'B-Concluding Statement': 13,
 'I-Concluding Statement': 14,
 'L-Concluding Statement': 15,
 'U-Concluding Statement': 16,
 'B-Counterclaim': 17,
 'I-Counterclaim': 18,
 'L-Counterclaim': 19,
 'U-Counterclaim': 20,
 'B-Claim': 21,
 'I-Claim': 22,
 'L-Claim': 23,
 'U-Claim': 24,
 'B-Position': 25,
 'I-Position': 26,
 'L-Position': 27,
 'U-Position': 28}
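For reference, here is a minimal sketch of how a BILOU tag vocabulary like the one above could be built from a set of discourse labels. The `build_bilou_tags` helper is hypothetical and is not the `ner_tagging.generate_tags` implementation; in particular it sorts the labels alphabetically, so its label order differs from the output shown above.

```python
def build_bilou_tags(labels):
    """Return (tags, tag_to_num) for a BILOU scheme over the given labels.

    Hypothetical helper for illustration only. Sorting the labels is an
    assumption made here to get a deterministic order.
    """
    tags = ["O"]  # "Outside" tag always gets id 0
    for label in sorted(set(labels)):
        # Beginning, Inside, Last, Unit tags for each label
        for prefix in ("B", "I", "L", "U"):
            tags.append(f"{prefix}-{label}")
    tag_to_num = {tag: i for i, tag in enumerate(tags)}
    return tags, tag_to_num


demo_tags, demo_tag_to_num = build_bilou_tags(["Claim", "Evidence", "Claim"])
print(demo_tag_to_num)
```

With the seven discourse types in this dataset, the same construction gives 1 + 4 × 7 = 29 tags, which matches the 29 entries in `tag_to_num`.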

In [13]:
print("train_targets = ", len(train_targets))
print("train_id_to_df = ", len(train_id_to_df))

train_targets =  15594
train_id_to_df =  15594

In [14]:
# tokens[0]

In [15]:
len(train_targets[0])

512

In [16]:
print(type(train_targets)) 
print(type(tokens))

<class 'list'>
<class 'transformers.tokenization_utils_base.BatchEncoding'>


In [17]:
type(train_essays)

list

In [19]:
from torch.utils.data import Dataset

In [20]:
class EvaluationStudentWritingDataSet(Dataset):
    """Tokenizes one essay per item and pairs it with its per-token targets."""

    def __init__(
        self, data, tokenizer, labels,
        return_offsets_mapping=True,
        truncation=True,
        padding='max_length',
        max_token_len: int = 512,
    ):
        self.tokenizer = tokenizer
        self.data = data
        self.return_offsets_mapping = return_offsets_mapping
        self.max_token_len = max_token_len
        self.truncation = truncation
        self.padding = padding
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int):
        text = self.data[index]

        text_encoding = self.tokenizer(
            text,
            return_offsets_mapping=self.return_offsets_mapping,
            truncation=self.truncation,
            max_length=self.max_token_len,
            padding=self.padding,
            return_tensors="pt",
        )
        label = self.labels[index]

        return dict(
            # text=text,
            input_ids=text_encoding["input_ids"].flatten(),
            attention_mask=text_encoding["attention_mask"].flatten(),
            labels=label,
            # special_tokens_mask=text_encoding["special_tokens_mask"].flatten(),
            # offsets=text_encoding["offset_mapping"].flatten(),
            # type_ids=text_encoding["type_ids"].flatten(),
        )

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
# train_text, test_text = train_test_split(train_essays, test_size = 0.1)

# train_labels, test_labels = train_test_split(train_targets, test_size = 0.1)

train_text, test_text, train_labels, test_labels = train_test_split(train_essays, train_targets, test_size=0.1, random_state=42)

train_data_set = EvaluationStudentWritingDataSet(train_text, tokenizer, train_labels)
test_data_set = EvaluationStudentWritingDataSet(test_text, tokenizer, test_labels)

Start of the baseline

In [23]:
from transformers import AutoModelForMaskedLM

# model_checkpoint = "distilbert-base-uncased"

# model_checkpoint = "NahedAbdelgaber/distilbert-base-uncased-finetuned-evaluating-student-writing"

# fined_tunned_model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [24]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [25]:
label2id = tag_to_num
id2label = {v: k for k, v in label2id.items()}

In [26]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Downloading:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/253M [00:00<?, ?B/s]

In [27]:
model.config

DistilBertConfig {
  "_name_or_path": "NahedAbdelgaber/evaluating-student-writing-distibert-ner",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "O",
    "1": "B-Evidence",
    "2": "I-Evidence",
    "3": "L-Evidence",
    "4": "U-Evidence",
    "5": "B-Rebuttal",
    "6": "I-Rebuttal",
    "7": "L-Rebuttal",
    "8": "U-Rebuttal",
    "9": "B-Lead",
    "10": "I-Lead",
    "11": "L-Lead",
    "12": "U-Lead",
    "13": "B-Concluding Statement",
    "14": "I-Concluding Statement",
    "15": "L-Concluding Statement",
    "16": "U-Concluding Statement",
    "17": "B-Counterclaim",
    "18": "I-Counterclaim",
    "19": "L-Counterclaim",
    "20": "U-Counterclaim",
    "21": "B-Claim",
    "22": "I-Claim",
    "23": "L-Claim",
    "24": "U-Claim",
    "25": "B-Position",
    "26": "I-Position",
    "27": "L-Position",
    "28": "U-Position"
  }

In [28]:
model.config.num_labels

29

In [29]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [30]:
train_data_set[0].keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [31]:
for i in range(5):
    print("label", len(train_data_set[i]['labels']))
    print("input_ids", len(train_data_set[i]['input_ids']))

label 512
input_ids 512
label 512
input_ids 512
label 512
input_ids 512
label 512
input_ids 512
label 512
input_ids 512


In [32]:
notebook_login()

VBox(children=(HTML(value="<center>\n<img src=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [33]:
# sample = 0.1 * train_data_set.data

In [34]:
from transformers import TrainingArguments

args = TrainingArguments(
    "evaluating-student-writing-distibert-ner-with-metric",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [35]:
label_names = list(id2label.values())
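`label_names` is indexed by integer class id in `compute_metrics`, so its order must match the ids. Taking `list(id2label.values())` works here because the dictionary happens to be built in id order. A slightly more defensive sketch, which does not depend on insertion order, indexes by id explicitly. The small `demo_label2id` mapping is made up for illustration.

```python
# Made-up mapping whose insertion order does NOT follow the ids
demo_label2id = {"B-Claim": 1, "O": 0, "I-Claim": 2}
demo_id2label = {v: k for k, v in demo_label2id.items()}

# Position i of the list holds the label for class id i, regardless of dict order
demo_label_names = [demo_id2label[i] for i in range(len(demo_id2label))]
print(demo_label_names)
```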

In [36]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     |████████████████████████████████| 43 kB 265 kB/s            
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16181 sha256=eb663fc096bd3d46b1b474885791d54f19c590f8c14ad5eb2d74f088bcc27f5a
  Stored in directory: /root/.cache/pip/wheels/05/96/ee/7cac4e74f3b19e3158dce26a20a1c86b3533c43ec72a549fd7
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [37]:
import numpy as np
from datasets import load_metric

metric = load_metric("seqeval")



def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

Downloading:   0%|          | 0.00/2.48k [00:00<?, ?B/s]
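The filtering step inside `compute_metrics` can be tried on its own without a model. The sketch below re-implements just that step with plain lists; `strip_ignored` and the demo values are hypothetical names introduced for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss and the metric should skip


def strip_ignored(predictions, labels, label_names):
    """Drop positions labelled IGNORE_INDEX and map class ids to tag strings."""
    true_labels, true_predictions = [], []
    for pred_row, label_row in zip(predictions, labels):
        # Keep only (prediction, label) pairs whose label is not ignored
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != IGNORE_INDEX]
        true_predictions.append([label_names[p] for p, _ in kept])
        true_labels.append([label_names[l] for _, l in kept])
    return true_labels, true_predictions


demo_names = ["O", "B-Claim", "L-Claim"]
demo_refs, demo_preds = strip_ignored(
    predictions=[[0, 1, 2, 0]],
    labels=[[-100, 1, 2, -100]],
    label_names=demo_names,
)
print(demo_refs, demo_preds)
```

Note that this only removes positions whose label is exactly `-100`. If the targets built earlier use `O` rather than `-100` for special and padding tokens (the notebook does not show which), those positions would be included in the reported scores.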

In [38]:

!apt install git-lfs




The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 31 not upgraded.
Need to get 3316 kB of archives.
After this operation, 11.1 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 git-lfs amd64 2.9.2-1 [3316 kB]
Fetched 3316 kB in 1s (2261 kB/s)

Selecting previously unselected package git-lfs.
(Reading database ... 102169 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.9.2-1_amd64.deb ...
Unpacking git-lfs (2.9.2-1) ...
Setting up git-lfs (2.9.2-1) ...

In [39]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data_set,
    eval_dataset=test_data_set,
#     train_dataset=train_batch,
#     eval_dataset=test_batch,  
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


ValueError: You need to pass a valid `token` or login by using `huggingface-cli login`

In [None]:
trainer.train()

Push the model and tokenizer to the Hub:

In [None]:
trainer.push_to_hub(commit_message="Training complete")

In [None]:
tokenizer.push_to_hub("evaluating-student-writing-distibert-ner-with-metric")

In [None]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "NahedAbdelgaber/evaluating-student-writing-distibert-ner-with-metric"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
# token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

In [None]:
text = """Some people belive that the so called "face" on mars was created by life on mars. This is not the case. The face on Mars is a naturally occuring land form called a mesa. It was not created by aliens, and there is no consiracy to hide alien lifeforms on mars. There is no evidence that NASA has found that even suggests that this face was created by aliens.

A mesa is a naturally occuring rock formation, that is found on Mars and Earth. This "face" on mars only looks like a face because humans tend to see faces wherever we look, humans are obviously extremely social, which is why our brain is designed to recognize faces.

Many conspiracy theorists believe that NASA is hiding life on Mars from the rest of the world. These people would be very wrong. If NASA found life on Mars, then they would get millions of people's attention. NASA's budget would increase drasticly, which means that their workers would get paid more. There is no good reason that NASA would hide life on Mars from the rest of the world.

So, NASA is not hiding life on Mars from us, and they are not trying to trick us into thinking that the "face" on mars is just a mesa, because it actually is. NASA hiding life would be illogical, because if they found life on Mars, they would make a lot of money, and we all know that the people at NASA aren't illogical people."""

In [None]:
token_classifier(text)

In [None]:
char_start = 0
char_end = 170
word_start = len(text[:char_start].split())
word_end = word_start + len(text[char_start:char_end].split())
word_end = min( word_end, len(text.split()) )
predictionstring = " ".join( [str(x) for x in range(word_start,word_end)] )

In [None]:
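# `rows` was undefined in the original cell; assuming it is the annotation
# dataframe for the essay above (id 0000D23A521A)
rows = train_id_to_df["0000D23A521A"]
rows["predictionstring"].iloc[-2]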

In [None]:
result = token_classifier(text)

for group in result:
    char_start = group["start"]
    char_end = group["end"]
    word_start = len(text[:char_start].split())
    word_end = word_start + len(text[char_start:char_end].split())
    word_end = min( word_end, len(text.split()) )
    predictionstring = " ".join( [str(x) for x in range(word_start,word_end)] )
    
    print("pred ", predictionstring, "class" , group["entity_group"])
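The character-to-word conversion used in the loop above can be wrapped in a small function and checked on a toy sentence. `span_to_predictionstring` and `demo_text` are names introduced here for illustration; the logic is the same as in the cell above, with words defined by `str.split()`.

```python
def span_to_predictionstring(text, char_start, char_end):
    """Convert a character span into space-separated word indices.

    Words are whitespace-separated tokens, as produced by str.split().
    """
    # Number of whole words before the span gives the first word index
    word_start = len(text[:char_start].split())
    # Number of words inside the span gives the length in words
    word_end = word_start + len(text[char_start:char_end].split())
    # Never run past the last word of the essay
    word_end = min(word_end, len(text.split()))
    return " ".join(str(i) for i in range(word_start, word_end))


demo_text = "Hello world. This is a test."
print(span_to_predictionstring(demo_text, 13, 20))  # span covers "This is"
```

One caveat of this approach: if `char_start` falls in the middle of a word, the partial word is counted as part of the prefix, so the resulting indices are shifted by one. It may be worth snapping span boundaries to whitespace before converting.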

Prepare submission file

In [None]:
import pandas as pd

In [None]:
test_dir = '../input/feedback-prize-2021/test'
test_files = list(os.listdir(test_dir))
test_ids = [f.replace('.txt', '') for f in test_files]


In [None]:
test_df = pd.DataFrame(columns = ["id","text"])

test_df

In [None]:
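# DataFrame.append was removed in pandas 2.0, so collect rows in a list first
test_rows = []
for f in tqdm(test_files):
    with open(os.path.join(test_dir, f)) as fh:
        text = fh.read()
    test_rows.append({"id": f.replace('.txt', ''), "text": text})

test_df = pd.DataFrame(test_rows, columns=["id", "text"])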
    

In [None]:
len(test_df)

In [None]:
for i in range(len(test_df)):
    row = test_df.iloc[i]
    print(row["id"])
    break

In [None]:
submission_df = pd.DataFrame(columns = ["id","class", "predictionstring"])

In [None]:
# Collect predictions in a list; DataFrame.append was removed in pandas 2.0
submission_rows = []
for i in range(len(test_df)):
    row = test_df.iloc[i]
    text_id = row["id"]
    text = row["text"]
    result = token_classifier(text)

    for group in result:
        # Convert the predicted character span into word indices
        char_start = group["start"]
        char_end = group["end"]
        word_start = len(text[:char_start].split())
        word_end = word_start + len(text[char_start:char_end].split())
        word_end = min(word_end, len(text.split()))

        predictionstring = " ".join(str(x) for x in range(word_start, word_end))
        prediction_class = group["entity_group"]

        submission_rows.append({"id": text_id, "class": prediction_class, "predictionstring": predictionstring})

submission_df = pd.DataFrame(submission_rows, columns=["id", "class", "predictionstring"])
        

In [None]:
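# index=False keeps the dataframe index out of the file, so the CSV
# contains only the id, class and predictionstring columns
submission_df.to_csv("submission.csv", index=False)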

In [None]:
submission_df.head(10)


