

 # Abstract

 Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and categorizing named entities, like people, organizations, locations, time expressions, and monetary values, in unstructured text. NER serves as a crucial tool for extracting information and is vital for various applications such as information retrieval, text summarization, machine translation, question answering, and other NLP tasks. The interest in NER has been growing in recent years due to its widespread utility in different domains.

 # Why choose BERT over other transformers / models?  
  1. Performance. <br>
  According to [this](https://aclanthology.org/2021.triton-1.9.pdf) paper, the F1 score of fine-tuned BERT on NER (BioNER) was among the highest when compared with CRF, Bi-LSTM, ELMo, GloVe, etc.

  2. Fast to set up. <br>
  There are already plenty of ready-to-use solutions:
   - HuggingFace pre-trained models (tokenizer, base, large)
   - PyTorch pre-trained models
   - spaCy transformers library
   - And so on.
   <br>
    So it is the fastest and easiest way to deploy such a model for NER.  
  3. Easy to use. <br>
  You can fine-tune a BERT model for your needs in only 2-4 epochs, and the F1 score will already be good according to the [paper](https://aclanthology.org/2021.triton-1.9.pdf) mentioned earlier.
  <br> Needless to say, there are many modifications of the BERT transformer, and it is even combined with CRF, BiLSTM, etc.


# How the model works
- For a detailed explanation you can check the [Google paper](https://arxiv.org/pdf/1706.03762.pdf) on the Transformer, but in a nutshell, BERT uses attention layers to make use of context. Unlike directional models, which read the text input sequentially (left-to-right or right-to-left; GPT, for example, is left-to-right), the Transformer encoder reads the entire sequence of words at once.<br> It is therefore considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word from all of its surroundings (to the left and to the right of the word).

- Fine-tuning is additional training on another dataset for the task at hand. It is used to adapt BERT to a specific task such as NER, question answering on SQuAD, and others.  
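
To make the difference between a left-to-right model and BERT's encoder more concrete, here is a minimal sketch of the two attention masks in plain Python. The function names and the example sentence are made up for illustration and do not come from any library; a real model applies such masks inside its attention layers.

```python
def causal_mask(n):
    """Left-to-right mask: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every position."""
    return [[1] * n for _ in range(n)]


# Illustrative sentence, already split into words.
tokens = ["We", "climbed", "Everest", "yesterday"]
n = len(tokens)

for name, mask in [("left-to-right", causal_mask(n)),
                   ("bidirectional", bidirectional_mask(n))]:
    print(name)
    for token, row in zip(tokens, mask):
        # List the tokens this position is allowed to look at.
        visible = [t for t, allowed in zip(tokens, row) if allowed]
        print(f"  {token:<10} sees {visible}")
```

With the left-to-right mask, "Everest" never sees "yesterday"; with the bidirectional mask it sees the whole sentence, which is what helps when deciding whether a word is a mountain name.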

# Understanding dataset requirements
As said earlier, we don't need a dataset with billions of samples, because the pre-trained model already knows a lot of words and their contexts. <br>
- Breaking down the task: <br>
  - Find / create a dataset with labeled mountains.
  <br> So, we are going to create the dataset using ChatGPT. There is no single right choice, but for me creating a dataset is good experience.  
  - Select the relevant architecture of the model for solving NER.
  <br>
  I've chosen pre-trained BERT, which has a good F1 score compared with other models.
  - Train / fine-tune the model.
  <br> This will be demonstrated in another notebook.
  - Prepare demo code / notebook of the inference results.
  <br> The same applies here.

- The dataset will consist of two columns: sentence and label.
  Why is that? BERT uses attention to derive context from the whole sentence, so it learns the contexts of words, not just the words themselves. For other models we may need to mark where words start and end, but that is not the case for BERT. During pre-training, BERT masks 15% of the tokens in its input and learns to predict them from the surrounding context. In a nutshell, BERT can understand which word we want it to mark from many sentences that differ in context.  
- NLP tasks usually require a lot of training data. That is not the case when we fine-tune BERT, because it already knows a lot; our task here is to show BERT in what contexts a word can appear, not only what the word is.
So the dataset should contain not only the specific words we need BERT to learn, but also varied, creative ways of using them. That is crucial for BERT's accuracy on real data.  
- In addition, the dataset should contain not only sentences with the new tag, but sentences without it too. BERT needs to understand what the tag is and what it isn't. So a 50/50 split between the "Mountain" tag and "O" would be ideal.
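
The 50/50 requirement is easy to check once rows exist. Below is a minimal sketch, assuming each row is a `label;sentence` string (the layout requested in the prompt later in this notebook). The helper name `label_share` and the sample rows are hypothetical and written only for this illustration.

```python
from collections import Counter


def label_share(rows):
    """Return the fraction of rows carrying each label.

    `rows` is an iterable of "label;sentence" strings.
    """
    labels = [row.split(";", 1)[0].strip() for row in rows if row.strip()]
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical sample rows, not taken from the real dataset.
sample_rows = [
    "Mountain;We reached the summit of Elbrus before noon.",
    "O;We reached the office before noon.",
    "Mountain;Have you ever seen Kilimanjaro at sunrise?",
    "O;Have you ever seen the harbour at sunrise?",
]

print(label_share(sample_rows))
```

If the shares drift far from 0.5 each, that is a hint to generate more rows for the under-represented label.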

# Generating dataset  
There are two types of data: synthetic and real. In simple words, synthetic data is generated (by AI, algorithms, and so on), and real data is produced by humans.
Based on [this](https://research.aimultiple.com/synthetic-data-vs-real-data/) article about the pros and cons of both, I will use synthetic data because it is fast to generate and easy to manipulate.  
# Implementation
I will use ChatGPT (GPT-3.5 Turbo) for synthetic dataset generation. <br>
First of all, we need a specific prompt so that it generates the data properly.

We should define the structure, the data, and the way to generate it.
After many attempts I came up with this prompt:
<br>
```
Provide dataset in csv format with no additional comments.
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
For label, if in sentence is mentioned mountain name label should be "Mountain", otherwise "O".
Split of labels "Mountain" or "O" should be 50/50.
Generate samples of sentences from this mountain list, they need to be variative, not similar, looking like dialogue.  Don't use country names.
For "O" label sentences, use same words but not mountain names.
Use ; as a delimiter, ignore column names
you should use all mountain names from list
List of mountains to generate from:
```
# Warning
The way I chose to generate the dataset is dictated only by the short deadline for the task. A dataset generated by ChatGPT is limited to what the model was trained on, and the only way to enhance it is to add more details and references to the prompt. Even so, the quality of such a dataset is poor, while writing all the sentences by hand is very time-consuming.  <br>

***`So this is a tradeoff - quantity over quality.`***

There are other LLMs that could perform such a task (Llama, Bing, Bard, Perplexity), but ChatGPT is the most convenient for this.

# Doing the thing
One way to do it is through the ChatGPT API, and the workflow looks like this:
1. Connect to the API
2. Send the prompt
3. Append the prompt to the saved prompts
4. Repeat X times
5. Write to CSV <br>
- But for this approach I need a ChatGPT subscription, and I don't have one.
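
The API workflow above could be sketched roughly as follows. This is not code I actually ran against the API: the connection (step 1) is hidden behind `ask_model`, a hypothetical callable that takes the prompt text and returns the response text, and `fake_model` is a stand-in so the sketch runs offline. All names here are assumptions made for illustration.

```python
import csv
import io


def generate_dataset(ask_model, prompt, mountain_chunks, out_file):
    """Send one prompt per chunk and write the returned rows as CSV.

    `ask_model` is a hypothetical callable (prompt text -> response text);
    a real API client would be plugged in here.
    """
    sent_prompts = []
    writer = csv.writer(out_file, delimiter=";")
    for chunk in mountain_chunks:               # step 4: repeat per chunk
        full_prompt = prompt + "\n".join(chunk)
        response = ask_model(full_prompt)       # step 2: send the prompt
        sent_prompts.append(full_prompt)        # step 3: save the prompt
        for line in response.splitlines():
            label, sep, sentence = line.partition(";")
            if sep:                             # skip lines without a delimiter
                writer.writerow([label.strip(), sentence.strip()])  # step 5
    return sent_prompts


def fake_model(prompt_text):
    """Stand-in for the real model so the sketch runs offline."""
    return "Mountain;I saw Elbrus from the train.\nO;I saw a bridge from the train."


buffer = io.StringIO()
prompts = generate_dataset(fake_model, "List of mountains:\n", [["Elbrus"]], buffer)
print(buffer.getvalue())
```

Passing the model in as a function keeps the loop independent of any particular client, so the same code would work with a different LLM.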

The other way to do it automatically is to web-scrape the ChatGPT page. <br>
- But they added a CAPTCHA, so the scraper couldn't even get to the prompting page.

So only one approach is left: copy-pasting from ChatGPT into the dataset file manually.
<br>



# Actual

Since I have no money for a ChatGPT subscription, I will prompt ChatGPT myself using slightly modified versions of the prompt described above, and copy-paste the results into the dataset file.
To enhance the results, I will add mountain names copied from Wikipedia.  

```python
# raw_data.py provides the raw strings: mountains, countries, usa_states
import re

from raw_data import mountains, countries, usa_states
```

Defining all the functions to remove artifacts:

```python
def remove_numbers(input_string):
    """Delete every run of digits."""
    return re.sub(r'\d+', '', input_string)

def remove_substring(input_string, substring_to_remove):
    """Delete every literal occurrence of `substring_to_remove`."""
    pattern = re.escape(substring_to_remove)
    return re.sub(pattern, '', input_string)

def remove_brackets_and_content(input_string):
    """Delete round brackets together with the text inside them."""
    pattern = re.compile(r'\([^)]*\)')
    return re.sub(pattern, '', input_string)

def remove_square_brackets_and_content(input_string):
    """Delete square brackets together with the text inside them."""
    pattern = re.compile(r'\[.*?\]')
    return re.sub(pattern, '', input_string)
```
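
To see what these helpers do, here is a small self-contained run on one made-up raw line (the line is an illustration, not an entry from `raw_data.py`). The helpers are repeated in compact form so the snippet runs on its own.

```python
import re


def remove_numbers(input_string):
    return re.sub(r'\d+', '', input_string)


def remove_brackets_and_content(input_string):
    return re.sub(r'\([^)]*\)', '', input_string)


def remove_square_brackets_and_content(input_string):
    return re.sub(r'\[.*?\]', '', input_string)


# Made-up raw line with a rank, an alternative name and a footnote marker.
raw = "12 Accursed Mountains (Prokletije)[3]"
step1 = remove_numbers(raw)                       # digits go first
step2 = remove_brackets_and_content(step1)        # then "(...)"
step3 = remove_square_brackets_and_content(step2) # then the emptied "[]"
print(repr(step3))
print(repr(step3.strip()))
```

Note that the cleaned string still has spaces around it, so a final `strip()` is worth applying before the names are used in a prompt.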

1. Removing numbers, commas, tabs, brackets and their content from countries.
2. Turning countries into a list.

```python
# Countries to list
countries = remove_numbers(countries)
countries = remove_substring(countries, ",")
countries = remove_substring(countries, "\t")
countries = remove_brackets_and_content(countries)
# Strip surrounding whitespace and drop entries that end up empty.
countries = [name.strip() for name in countries.split("\n")]
countries = [name for name in countries if name]
```

3. Turning usa_states into a list.

```python
# USA states to list
usa_states = usa_states.split("\n")
```

4. Deleting all occurrences of countries and USA states in mountains
 (because if countries were mentioned too often, BERT would classify them as mountains, and we don't want that).

```python
for country in countries:
    mountains = remove_substring(mountains, country)

for state in usa_states:
    mountains = remove_substring(mountains, state)
```

5. Preprocessing mountains too.

```python
mountains = remove_substring(mountains, ",")
mountains = remove_brackets_and_content(mountains)
mountains = remove_square_brackets_and_content(mountains)
# Note: this also removes "and" inside longer words.
mountains = remove_substring(mountains, "and")
mountains = remove_substring(mountains, ":")
mountains = remove_substring(mountains, " Range ")
mountains = remove_substring(mountains, " Hill Country")
mountains = remove_substring(mountains, " New")
# `state` still holds the last value from the loop in the previous cell.
mountains = remove_substring(mountains, state)
mountains = sorted(mountains.split("\n"))
```
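
Plain substring removal also deletes matches inside longer words, so removing "and" damages names that merely contain it. One possible alternative, which is not what I used above, is to match whole words only with `\b`. The helper name `remove_whole_word` and the sample line are assumptions for this sketch.

```python
import re


def remove_whole_word(text, word):
    """Delete `word` only where it stands alone, not inside a longer word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, "", text)


# Made-up line to compare both behaviours.
line = "Scottish Highlands and Grampian Mountains"
print(line.replace("and", ""))         # substring removal hits "Highlands" too
print(remove_whole_word(line, "and"))  # whole-word removal leaves it intact
```

The same idea would keep a short country name from being cut out of a longer one, at the cost of not matching entries that carry stray leading or trailing spaces, so the names should be stripped first.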

```python
# Split the mountain list into chunks for prompting

prompt = """
Provide dataset in csv format with no additional comments. Use ; as a delimiter, ignore column names
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
For label, if in sentence is mentioned mountain name label should be "Mountain", otherwise "O".
Split of labels "Mountain" or "O" should be 50/50.
Generate samples of sentences from this mountain list, they need to be variative, not similar, looking like dialogue.  Don't use country names.
Don't repeat yourself, use different adjectives, cases, situations
For "O" label sentences, use same words but not mountain names.
you should use all mountain names from list
List of mountains to generate from: \n
"""

# Print the prompt followed by the next 35 names, then a separator line.
chunk_size = 35
for start in range(0, len(mountains), chunk_size):
    print(prompt)
    print("\n".join(mountains[start:start + chunk_size]), "\n", "_"*20, "\n")
```


```
Provide dataset in csv format with no additional comments. Use ; as a delimiter, ignore column names
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
For label, if in sentence is mentioned mountain name label should be "Mountain", otherwise "O".
Split of labels "Mountain" or "O" should be 50/50.
Generate samples of sentences from this mountain list, they need to be variative, not similar, looking like dialogue.  Don't use country names.
Don't repeat yourself, use different adjectives, cases, situations
For "O" label sentences, use same words but not mountain names.
you should use all mountain names from list
List of mountains to generate from: 








 
 Coast
 Desert
 Humboldt
 Humboldt
 Limestone Alps  
 Mountains 
 Mountains 
 Mountains 
 Pacific Rise
 Vardar/Pelagonia mountain range  
 Yolla-Bolly Mountains 
 mountains
Aberdare ranges 
Absaroka  
Accursed Mountains
Adam Range
Adamant Range
A
```
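
The chunking loop above can also be written as a small reusable helper. This is a sketch under my own naming (`chunked`, `build_prompts`); it additionally strips the names and drops blank entries, which the original loop does not do, and the short name list is only an example.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompts(base_prompt, names, size=35):
    """Append each chunk of cleaned names to the shared prompt text."""
    cleaned = [name.strip() for name in names if name.strip()]
    return [base_prompt + "\n".join(chunk) for chunk in chunked(cleaned, size)]


# Example list with a blank entry and stray spaces, as in the output above.
names = ["", " Humboldt", "Absaroka  ", "Accursed Mountains", "Adam Range", "Adamant Range"]
prompts = build_prompts("List of mountains to generate from:\n", names, size=2)
print(len(prompts))
print(prompts[-1])
```

Returning the prompts as a list instead of printing them makes it easier to paste them one at a time and keep track of which chunk has already been used.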

Now I will paste the ChatGPT results for each prompt + chunk, one by one, into the dataset CSV file.
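
Since the results are pasted by hand, it helps to filter out anything that is not a proper row before it reaches the file. Here is a minimal sketch of such a check, assuming the `label;sentence` layout from the prompt; `parse_response` and the pasted text are hypothetical and only illustrate the idea.

```python
ALLOWED_LABELS = {"Mountain", "O"}


def parse_response(text, seen=None):
    """Split pasted model output into new valid rows and rejected lines.

    `seen` is an optional set of rows already in the dataset; duplicates
    of those rows (and repeats within `text`) are skipped.
    """
    seen = set() if seen is None else seen
    valid, rejected = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        label, sep, sentence = line.partition(";")
        row = (label.strip(), sentence.strip())
        if not sep or row[0] not in ALLOWED_LABELS or not row[1]:
            rejected.append(line)        # extra comments, broken rows
        elif row not in seen:
            seen.add(row)
            valid.append(row)
    return valid, rejected


# Made-up pasted output with a comment line, a broken row and a duplicate.
pasted = """Here is your dataset:
Mountain;The Adamant Range looked unreal at dawn.
O;The skyline looked unreal at dawn.
mountain: broken row
O;The skyline looked unreal at dawn.
"""

valid, rejected = parse_response(pasted)
print(valid)
print(rejected)
```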

And, because the word "UNESCO" and some country names were overused, we need to extend the dataset with the results of this prompt:
```
Provide dataset in csv format with no additional comments. Use ; as a delimiter, ignore column names
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
for label use only "O", avoid using mountain names. Use words from list:
UNESCO, 'Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', "Côte d'Ivoire", 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo ', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus',
'Czechia ', 'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini ', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Holy See', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati',
'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar ', 'Namibia', 'Nauru', 'Nepal', 'México', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North Korea', 'North Macedonia', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine State', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Rwanda', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and
the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Sweden', 'Switzerland', 'Syria', 'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe', 'Yukon British Columbia', 'British Columbia', 'ChurchillAlberta', 'Quebec', 'Czech Republic', 'Kosovo'
```

# In-process
When I was creating the dataset, I repeatedly noticed that ChatGPT uses the words "mountain", "peak", and others in the same sentences as the names of mountains. This would lead to the model recognizing these words as mountains, which we did not want. The situation was the same with country names: ChatGPT often inserted them into sentences marked as Mountain. I fixed this by generating another 400 dataset rows with only countries, without mountain names, and marking them as "O". But that's not all. There was also a problem with the words "I", "is", and "expedition" being recognized as mountains. I dealt with this in the same way as before, by generating rows without mountains in which these words appear often.
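
I spotted these words by eye, but the same check could be automated. Below is a rough sketch, not something I ran on the real dataset: it counts words that occur in "Mountain" rows but never in "O" rows, after cutting the known mountain names out. The function names and sample rows are assumptions for illustration.

```python
import re
from collections import Counter


def words(sentence):
    """Lower-case the sentence and return its alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def leak_candidates(rows, mountain_names):
    """Count words seen in "Mountain" rows that never occur in "O" rows."""
    o_words = set()
    mountain_counts = Counter()
    for label, sentence in rows:
        if label == "O":
            o_words.update(words(sentence))
        else:
            # Cut out the entity itself so only its companions are counted.
            for name in mountain_names:
                sentence = sentence.replace(name, " ")
            mountain_counts.update(words(sentence))
    return Counter({w: c for w, c in mountain_counts.items() if w not in o_words})


# Made-up rows in (label, sentence) form.
rows = [
    ("Mountain", "The peak of Elbrus is a famous mountain."),
    ("Mountain", "Our expedition reached the peak of Kilimanjaro."),
    ("O", "The office is closed today."),
]

print(leak_candidates(rows, ["Elbrus", "Kilimanjaro"]).most_common(3))
```

Words that rank high here are good candidates for extra "O" sentences, which is the same fix I applied manually.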


# Summary
The dataset has a great influence on how the model will work. That's why errors like these are fixed by adding new examples to the dataset. As we already know, BERT identifies words using their context, so the training data has to cover enough different contexts.  