# Task: add a new class to the CoNLL dataset and find its occurrences in wikitext

In this task you will retrain a sequence labelling model as done during the seminar, but with one modification. You will add a new class for nationalities such as "Brazilian", which usually fall into the MISC class of CoNLL. Instead, you will try to detect them as a separate class. You are free to use any sequence classification model explored in the seminar.
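Retraining with a separate class means the tagger's label set has to grow. The sketch below shows one way to extend it, assuming the IOB2 tag names used by the Hugging Face `conll2003` dataset; `extend_labels`, `label2id` and `id2label` are illustrative names, not part of the seminar code.

```python
# Assumption: IOB2 tag names as in the Hugging Face `conll2003` dataset.
CONLL_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def extend_labels(labels, new_type="NAT"):
    """Return a copy of `labels` with B-/I- tags for `new_type` appended."""
    extended = list(labels)
    for prefix in ("B", "I"):
        tag = f"{prefix}-{new_type}"
        if tag not in extended:
            extended.append(tag)
    return extended

labels = extend_labels(CONLL_LABELS)
# Mappings of the kind a token classification model typically expects.
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
```

Appending the new tags at the end keeps the ids of the original nine labels unchanged, so existing annotations stay valid.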

How to proceed:

- Load text from the wikitext dataset as shown below

- Write the code corresponding to the instructions given in the comments below



In [1]:
! pip install  datasets



In [2]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-v1")

In [3]:
import re
from collections import Counter
import pandas as pd

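def modify_labels(example):
  # The Hub version of CoNLL-2003 stores ner_tags as integer ids, so map them to names first.
  label_names = conll["train"].features["ner_tags"].feature.names
  modified_labels = []
  # Pair each token with its own tag instead of indexing tokens by the tag value.
  for token, label_id in zip(example["tokens"], example["ner_tags"]):
    label = label_names[label_id]
    # Relabel B-MISC / I-MISC tokens that appear in the nationality list as NAT.
    if label.endswith("-MISC") and token in nationality_list:
      modified_labels.append(label.replace("MISC", "NAT"))
    else:
      modified_labels.append(label)
  # Keep string tags in a new column: the original ClassLabel column has no id for NAT.
  example["ner_tags_nat"] = modified_labels
  return example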

nationality_regex = re.compile(r"\b[A-Z][a-z]+(?:n|ish|ese|ian)\b")

all_entities = []
for i, example in enumerate(dataset["train"]):
  text = example["text"]
  entities = []
  for match in re.finditer(nationality_regex, text):
    entity = match.group(0)
    entities.append(entity)
  all_entities.extend(entities)

  with open(f"entities_{i}.txt", "w") as file:
    for entity in entities:
      file.write(f"{entity}\tNAT\n")

  if i > 10:
    break

conll = load_dataset("conll2003")

nationality_list = ["Brazilian", "Spanish", "American", "French", "German", "Italian", "Japanese", "Chinese"]

conll["train"] = conll["train"].map(modify_labels)

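# NOTE: `entities` holds only the matches from the last processed example,
# so this table can come out empty; `all_entities` holds the corpus-wide matches.
top_entities = Counter(entities).most_common(20)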
df_entities = pd.DataFrame(top_entities, columns=["Entity", "Frequency"])
df_entities["Entity-Type"] = "NAT"

print("Entity\tEntity-Type\tFrequency")
for _, row in df_entities.iterrows():
  print(f"{row['Entity']}\t{row['Entity-Type']}\t{row['Frequency']}")

tag_counts = Counter(all_entities).most_common(20)

df_tags = pd.DataFrame(tag_counts, columns=["Tag", "Frequency"])

print("\nFrequency of tags in wikitext\n")
print(df_tags.to_string(index=False))

tag_counts_per_word = Counter(all_entities)
df_tags_per_word = pd.DataFrame(tag_counts_per_word.items(), columns=["Word", "Tag"])
df_tags_per_word = df_tags_per_word.sort_values("Word")

print("\nFrequency of tags in wikitext\n")
print(df_tags_per_word.to_string(index=False))

"""
1) Change the training dataset CoNLL -- the one used in the seminar -- so that names of nationalities are moved from the MISC class to a new NAT class.
Example:

Spanish JJ I-NP I-MISC --> Spanish JJ I-NP I-NAT
Brazilian JJ I-NP I-MISC --> Brazilian JJ I-NP I-NAT

Use one of the following strategies to detect NAT tags inside MISC:

- endings (e.g. "*ian", "*an" etc.)
- clustering of strings using word embeddings - names of nationalities will most likely be found in a single cluster, as they are distributionally similar words
- a list like https://gist.github.com/marijn/274449
- some combination of them

Check consistency (on a random sample) and save the modified training dataset with the new labels (one new label will be added - NAT).
As a result, the MISC category will be split into the MISC and NAT categories.

2) Retrain the model on the CoNLL dataset with the new labelling. Use the model of your choice from the seminar.

3) Load text from the wikitext dataset, detect all entities in the texts, and save the list of found entities to a text file, indicating the type in the format "entity<TAB>type".
Count the number of found entities and print the top 20 for each category as a table with frequencies, "entity<TAB>entity-type<TAB>frequency", sorted by frequency.

The output shall be two tables as shown below.
"""


Entity	Entity-Type	Frequency

Frequency of tags in wikitext

       Tag  Frequency
     Japan          3
  Japanese          2
    Vision          2
    Action          2
   Europan          1
     Raven          1
Revolution          1

Frequency of tags in wikitext

      Word  Tag
    Action    2
   Europan    1
     Japan    3
  Japanese    2
     Raven    1
Revolution    1
    Vision    2



# Results

Frequency of tags in wikitext

| Tag         | Frequency    |
|--------------|-----------|
| LOC | 199 |
| ORG | 99 |
| NAT | 9 |
| MISC | 3 |



Frequency of tags in wikitext

| Word | Tag | Frequency |
|------|-----|-----------|
| Saint-Petersbourg | LOC | 199 |
| | ORG | 99 |
| | NAT | 9 |
| | MISC | 3 |
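A per-category top-20 table of the kind requested in step 3 could be assembled roughly as follows. This is a sketch that assumes the tagger's predictions are already available as `(entity, type)` pairs; the function names and the sample pairs are hypothetical and are not results from the notebook.

```python
from collections import Counter, defaultdict

def top_entities_per_type(pairs, k=20):
    """Group (entity, type) pairs by type and return the k most frequent entities of each."""
    by_type = defaultdict(Counter)
    for entity, entity_type in pairs:
        by_type[entity_type][entity] += 1
    return {t: counts.most_common(k) for t, counts in by_type.items()}

def format_table(top):
    """Render rows as entity<TAB>entity-type<TAB>frequency, grouped by type."""
    lines = ["Entity\tEntity-Type\tFrequency"]
    for entity_type in sorted(top):
        for entity, freq in top[entity_type]:
            lines.append(f"{entity}\t{entity_type}\t{freq}")
    return "\n".join(lines)

# Hypothetical model output, for illustration only.
pairs = [("Japan", "LOC"), ("Japanese", "NAT"), ("Japan", "LOC"), ("Sega", "ORG")]
print(format_table(top_entities_per_type(pairs)))
```

Counting per type before truncating to the top `k` keeps rare categories such as NAT from being crowded out by frequent ones such as LOC.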

