

 # 🔹 Abstract
 Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and categorizing named entities, like people, organizations, locations, time expressions, and monetary values, in unstructured text. NER serves as a crucial tool for extracting information and is vital for various applications such as information retrieval, text summarization, machine translation, question answering, and other NLP tasks. The interest in NER has been growing in recent years due to its widespread utility in different domains.

# ⚙️ How the model works
- [detailed explanation](https://spacy.io/usage/training)
- In a nutshell:
  <br>
 spaCy's components, such as the tagger, parser, and text categorizer, rely on statistical models that make predictions based on patterns learned from training data. During training, the model compares its predictions to reference annotations and adjusts its weights through backpropagation to minimize the difference. The goal is not memorization but a generalized theory that applies to unseen data. It's crucial to have representative training data, as a model trained on specific data may not generalize well to different contexts. Evaluation data is also necessary to assess how well the model generalizes beyond the training set. Training therefore requires both training and evaluation data, each with at least a few hundred examples.

# 📦 Understanding dataset requirements
- Breaking down the task: <br>
  - Find / create a dataset with labeled mountains.
  <br> I am going to create the dataset with ChatGPT. There is no single right choice, but for me creating a dataset is good experience.  
  - Select a suitable model architecture for NER.
  <br>
  I chose a pre-trained BERT, which has a good F1 score compared with other models.
  - Train / fine-tune the model.
  <br> This is demonstrated in another notebook.
  - Prepare demo code / a notebook with the inference results.
  <br> The same applies here.

- Transformer models learn from context, so we need whole sentences: they let the model see how words are used, how often they occur, and how they relate to each other. We also need a label for each word so that the model can tell the class we care about from everything else. So the dataset needs the columns "sentence#", "word", and "label". Each word sits on its own row; grouping the rows by "sentence#" reconstructs the sentence, and the same goes for the labels. This is an easy way to get both without complicated manipulations.

- NLP models usually require large datasets that cover many different words and many ways of using them. To get good results beyond the training data, the dataset must contain not only sentences with the words we need (mountain names) but also many variations of such sentences and similar ones. Unfortunately, this is a time-consuming process that requires a team or at least a lot of time. In my case, I don't have much time and I'm working on my own.

- The more different words there are in the dataset, the better the model can handle new ones.
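
To make the three-column layout concrete, here is a minimal sketch of how token rows group back into sentences and label sequences. The sample rows and the `group_rows` helper are made up for illustration and are not part of the real dataset or notebook.

```python
from collections import defaultdict

# Hypothetical sample rows in the "sentence#", "word", "label" layout
rows = [
    (0, "We", "O"),
    (0, "climbed", "O"),
    (0, "Hoverla", "Mountain"),
    (1, "Everest", "Mountain"),
    (1, "is", "O"),
    (1, "high", "O"),
]

def group_rows(rows):
    """Group token rows by sentence id into parallel word and label lists."""
    words, labels = defaultdict(list), defaultdict(list)
    for sentence_id, word, label in rows:
        words[sentence_id].append(word)
        labels[sentence_id].append(label)
    return dict(words), dict(labels)

words, labels = group_rows(rows)
print(" ".join(words[0]))  # We climbed Hoverla
print(labels[0])           # ['O', 'O', 'Mountain']
```

Because words and labels are collected in the same order, the i-th label always belongs to the i-th word of its sentence.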

# 🧰 Generating dataset  
There are two types of data: synthetic and real. Synthetic data is generated (by AI, algorithms, and so on), while real data is produced by humans.
Based on [this](https://research.aimultiple.com/synthetic-data-vs-real-data/) article about the pros and cons of each, I will use synthetic data because it is fast to generate and easy to manipulate.  

# ⚠️ Warning
The way I chose to generate the dataset is dictated only by the short deadline for the task. A dataset generated by ChatGPT is limited to what the model was trained on, and the only way to improve it is to add more details and references to the prompt. Even so, the quality of such a dataset is poor. Writing all the sentences by hand is very time-consuming and unrealistic on such a short deadline.

***`So this is a tradeoff: quantity over quality.`***

Other LLMs could perform this task (Llama, Bing, Bard, Perplexity), but ChatGPT is the most convenient for me.



In [None]:
import requests

# 🌎 Getting mountain names across the world

Here I use the GeoNames API. It contains the names of almost half a million mountains, which is enough to generate a dataset.
Note that I will not be able to generate a sentence for every name. Also, the names from this API have their own peculiarities, which I will explain later.

In [None]:
country_codes = [
  "AF","AX","AL","DZ","AS","AD","AO","AI","AQ","AG","AR","AM","AW","AU","AT","AZ","BH","BS","BD","BB","BY","BE","BZ","BJ","BM","BT",
  "BO","BQ","BA","BW","BV","BR","IO","BN","BG","BF","BI","KH","CM","CA","CV","KY","CF","TD","CL","CN","CX","CC","CO","KM","CG","CD",
  "CK","CR","CI","HR","CU","CW","CY","CZ","DK","DJ","DM","DO","EC","EG","SV","GQ","ER","EE","ET","FK","FO","FJ","FI","FR","GF","PF",
  "TF","GA","GM","GE","DE","GH","GI","GR","GL","GD","GP","GU","GT","GG","GN","GW","GY","HT","HM","VA","HN","HK","HU","IS","IN","ID",
  "IR","IQ","IE","IM","IL","IT","JM","JP","JE","JO","KZ","KE","KI","KP","KR","KW","KG","LA","LV","LB","LS","LR","LY","LI","LT","LU",
  "MO","MK","MG","MW","MY","MV","ML","MT","MH","MQ","MR","MU","YT","MX","FM","MD","MC","MN","ME","MS","MA","MZ","MM","NA","NR","NP",
  "NL","NC","NZ","NI","NE","NG","NU","NF","MP","NO","OM","PK","PW","PS","PA","PG","PY","PE","PH","PN","PL","PT","PR","QA","RE","RO",
  "RU","RW","BL","SH","KN","LC","MF","PM","VC","WS","SM","ST","SA","SN","RS","SC","SL","SG","SX","SK","SI","SB","SO","ZA","GS","SS",
  "ES","LK","SD","SR","SJ","SZ","SE","CH","SY","TW","TJ","TZ","TH","TL","TG","TK","TO","TT","TN","TR","TM","TC","TV","UG","UA","AE",
  "GB","US","UM","UY","UZ","VU","VE","VN","VG","VI","WF","EH","YE","ZM","ZW"
]

def form_request(country_code, amount, username='me'):
  return f'http://api.geonames.org/searchJSON?featureCode=MT&username={username}&maxRows={amount}&style=SHORT&isReduced=true&country={country_code}'

Making actual API requests and getting mountain names from them

In [None]:
mountain_names = []
AMOUNT_OF_MOUNTAINS_PER_COUNTRY = 500  # maxRows requested per country
for country_code in country_codes:
  print(country_code, end=" ")  # progress indicator
  response = requests.get(form_request(country_code, AMOUNT_OF_MOUNTAINS_PER_COUNTRY), timeout=30)
  # Collect the toponym name of every mountain returned for this country
  for response_mountain in response.json()['geonames']:
    mountain_names.append(response_mountain['toponymName'])
print(len(mountain_names))

AF AX AL DZ AS AD AO AI AQ AG AR AM AW AU AT AZ BH BS BD BB BY BE BZ BJ BM BT BO BQ BA BW BV BR IO BN BG BF BI KH CM CA CV KY CF TD CL CN CX CC CO KM CG CD CK CR CI HR CU CW CY CZ DK DJ DM DO EC EG SV GQ ER EE ET FK FO FJ FI FR GF PF TF GA GM GE DE GH GI GR GL GD GP GU GT GG GN GW GY HT HM VA HN HK HU IS IN ID IR IQ IE IM IL IT JM JP JE JO KZ KE KI KP KR KW KG LA LV LB LS LR LY LI LT LU MO MK MG MW MY MV ML MT MH MQ MR MU YT MX FM MD MC MN ME MS MA MZ MM NA NR NP NL NC NZ NI NE NG NU NF MP NO OM PK PW PS PA PG PY PE PH PN PL PT PR QA RE RO RU RW BL SH KN LC MF PM VC WS SM ST SA SN RS SC SL SG SX SK SI SB SO ZA GS SS ES LK SD SR SJ SZ SE CH SY TW TJ TZ TH TL TG TK TO TT TN TR TM TC TV UG UA AE GB US UM UY UZ VU VE VN VG VI WF EH YE ZM ZW 61961
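
The loop above indexes `response.json()['geonames']` directly, so a single failed request (for example, an exceeded rate limit) would stop the whole run with a `KeyError`. Below is a hedged sketch of a more defensive parser. The helper name `extract_names` is hypothetical, and the assumption that an error response carries a `status` object with a `message` field should be checked against the GeoNames documentation.

```python
def extract_names(payload):
    """Return toponym names from a decoded GeoNames search response.

    Assumption: on failure the API returns a JSON object with a "status"
    key (holding a "message") instead of the "geonames" list.
    """
    if "status" in payload:
        message = payload["status"].get("message", "unknown error")
        raise RuntimeError(f"GeoNames error: {message}")
    return [item["toponymName"] for item in payload.get("geonames", [])]

# Hypothetical payloads, for illustration only
ok_payload = {"geonames": [{"toponymName": "Hoverla"}, {"toponymName": "Petros"}]}
print(extract_names(ok_payload))
```

In the request loop this could replace the inner `for`, for instance `mountain_names.extend(extract_names(response.json()))`, wrapped in a `try`/`except` if a failed country should only be logged and skipped.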


In [None]:
import pandas as pd
df = pd.DataFrame(mountain_names, columns=['name'])

In [None]:
df.to_csv('/content/mountain_names.csv',index=False)
df.describe()

| | name |
|---|---|
| count | 61961 |
| unique | 56643 |
| top | Cerro Grande |
| freq | 36 |


A little bit about the data we work with.
The names are specific: they contain elements of different languages, which makes the dataset harder to use. For example, Ukrainian mountain names come with the word 'Hora' or 'Gora'. The same is true for the English 'Mountain' or 'peak'.

In [None]:
prompt = """
Provide sentences dataset with words from list. Be creative. Don't change words that you are using, they should look like in the list. Sentences should be 5-10 words long. You should use all words from list
generate sentences from all words in list. All words from list should be used;
there should be 50 sentences
List:"""

I need this list of mountains for two things.
The first is to generate prompts for ChatGPT that can be quickly copied and pasted. (Unfortunately, I don't have a subscription to their API, which would speed up the process significantly.)

Below you can see how the prompts are generated. Each one consists of the initial prompt followed by a list of mountains for which ChatGPT will generate sentences. This ensures that the dataset contains as many different mountain names as possible and avoids repetition. We don't want to devote 5,000 lines of the dataset to just two mountains, do we?

In [None]:
index = 0
chunk = 50
while index < len(mountain_names):
  chunk_of_mountains = mountain_names[index:index+chunk]
  index += chunk
  mountains_ = ", ".join(chunk_of_mountains)
  print(prompt)
  print(mountains_, "\n")

Output truncated to the last 5000 lines.
Kūh-e Malek, Kūh-e Gowd-e Anjīrī, Kūh-e Poshtkūh, Kūh-e Shāskūh, Kūh-e Tanūreh, Kūh-e Qārūn, Kūh-e Lākh-e Jangī, Kūh-e Hārmū, Kūh-e Seh Chāhī, Kūh-e Khūm, Kūh-e Khatābūn, Kūh-e ‘Omar Kūh, Kūh-e ‘Arabū, Kūh-e Shūrān Sīāh, Kūh-e Kūr Sefīd, Kūh-e Khaţīr Sūkhteh, Kūh-e Shojābīl, Kūh-e Sar Owz̧ā‘, Kūh-e Qahremān-e Fānī, Kūh-e Sekīl-e Seh Tā, Kūh-e Āj-e Bālā, Kūh-e Qal‘eh Kamar, Kūh-e Shānjān-e Nasvī, Kūh-e Qarchanqūl, Kūh-e Āqā Mīr Dāghī, Kūh-e Espelān, Kūh-e Sorkh Sangān, Kūh-e Tameshk Chāl, Kūh-e Neshā’, Kūh-e Sārī Qayeh, Kūh-e Rameẕān Dāghī, Kūh-e Kīālān, Kūh-e Do Delū, Kūh-e Sar Chāl-e Bāl, Kūh-e Gajeh, Kūh-e Sarī, Kūh-e Jangal-e Rey, Kūh-e Gīv Sīāh, Kūh-e Halach Halach, Kūh-e Sīrāngaleh, Kūh-e Sar Khowrandar, Kūh-e Jabar Ālān, Kūh-e Hameh Qolī, Kūh-e Kāleh, Kūh-e Gūreh Qalā‘, Kūh-e Kūchakeh Sūr, Kūh-e Do Ḩeşāreh, Kūh-e Havār Barzeh, Kūh-e Bard Zard, Kūh-e Zarang 


Provide sentences dataset with words from list. Be

In [None]:
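# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is its replacement
df_sentences = pd.read_csv('/content/dataset_sentences.csv', delimiter=';', on_bad_lines='skip')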





I copied the responses generated by ChatGPT directly into a CSV file named dataset_sentences.csv.
This is not yet the final form of the dataset.

In [None]:
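import re

def first_upper(s):
    """Return True if the string is non-empty and starts with an uppercase letter."""
    return bool(s and s[0].isupper())

def remove_duplicates(array):
    """Return the distinct strings that occur at most 100 times in `array`.

    Despite the name, this drops overly frequent words rather than
    deduplicating: very common tokens are usually generic terms.
    """
    string_counts = {}

    for string in array:
        string_counts[string] = string_counts.get(string, 0) + 1
    # Keep each string once, skipping those repeated more than 100 times
    filtered_array = [string for string, count in string_counts.items() if count <= 100]
    return filtered_array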


So, the second reason I need the list of mountains is to speed up building the dataset in its final form. ChatGPT does a poor job of generating it directly in the word-per-row format described earlier.

Here, we take every word from every sentence in the manually copied dataset and check whether it is in the mountain list.
Remember what I said earlier: these names often contain the word "mountain" or its synonyms in different languages. Such words need to be separated out, because we don't want to label them as mountains.

In [None]:
!pip install unidecode -q

In [None]:
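from unidecode import unidecode

new_dataset = []

# Split all names into single words and drop the overly frequent ones
joined_mountains = remove_duplicates((" ".join(mountain_names)).split(" "))
# Transliterate to ASCII so the words match the cleaned sentence tokens.
# The original cell re-assigned the raw list here, which discarded this step.
# A set keeps the membership checks in the labeling loop fast.
joined_mountains = set(map(unidecode, joined_mountains))

# Sanity check: clean a sample word the same way and look it up
word = "Velikiy"
word = re.sub(r"[^a-zA-Z0-9\-]", '', unidecode(word))
word in joined_mountains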

True


The kind of words I was talking about can be filtered out by two criteria:
1. They are often repeated if the data sample is large (as in our case)
2. They mostly start with a lowercase letter.

This is the logic behind the `remove_duplicates` function.
I found that such words are repeated at least a hundred times if you make 50 queries for each country.
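
As a rough illustration of the two criteria, here is a sketch that applies both in one helper. This is not the notebook's implementation, which splits the logic between `remove_duplicates` and `first_upper`. The function name `rare_capitalized_words` and the toy names are hypothetical, and the threshold is lowered to 2 so the effect is visible on a tiny sample.

```python
from collections import Counter

def rare_capitalized_words(names, max_count=100):
    """Keep words that appear at most max_count times and start uppercase."""
    counts = Counter(word for name in names for word in name.split())
    return {w for w, c in counts.items() if c <= max_count and w[0].isupper()}

# Toy data: "Hora" is frequent, "pik" starts with a lowercase letter
names = ["Hora Hoverla", "Hora Petros", "Hora Pip Ivan", "pik Lenina"]
kept = rare_capitalized_words(names, max_count=2)
print(sorted(kept))  # ['Hoverla', 'Ivan', 'Lenina', 'Petros', 'Pip']
```

"Hora" is dropped by the frequency criterion and "pik" by the capitalization criterion, while the actual name words survive.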

And here is the transformation of the dataset into the final form, which we will use for training and testing the transformer

In [None]:
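sentences_data = df_sentences['sentence'].to_numpy()
# Report progress about every 20% of the sentences; max() avoids a zero step on tiny inputs
progress_step = max(1, len(sentences_data) // 5)
for i, sentence in enumerate(sentences_data):
  if i % progress_step == 0:
    print(f" > {i}/{len(sentences_data)}")
  for word in sentence.split(" "):
    # Transliterate to ASCII and strip everything except letters, digits and hyphens
    word = re.sub(r"[^a-zA-Z0-9\-]", '', unidecode(word))
    # A token is labeled as a mountain only if it is a known name word and capitalized
    if word in joined_mountains and first_upper(word):
      label = "Mountain"
    else:
      label = "O"
    new_dataset.append([i, word, label])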

 > 0/5839
 > 1167/5839
 > 2334/5839
 > 3501/5839
 > 4668/5839
 > 5835/5839


Checking to see if everything is okay

In [None]:
new_dataset[0]

[0, 'Koh-e', 'O']

Write the result to a file

In [None]:
df = pd.DataFrame(new_dataset, columns=['sentence#', 'word', 'label'])

In [None]:
df.to_csv('/content/dataset_medium.csv', index=False)
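
Since training needs both training and evaluation data, the token rows eventually have to be split. The sketch below is one possible approach, not the method used in the training notebook. It splits by sentence id so that all tokens of a sentence land on the same side. The helper `split_by_sentence`, the 20% evaluation share, and the seed are assumptions for illustration.

```python
import random

def split_by_sentence(rows, eval_fraction=0.2, seed=42):
    """Split [sentence_id, word, label] rows so that no sentence is shared."""
    sentence_ids = sorted({row[0] for row in rows})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(sentence_ids)
    n_eval = max(1, int(len(sentence_ids) * eval_fraction))
    eval_ids = set(sentence_ids[:n_eval])
    train = [row for row in rows if row[0] not in eval_ids]
    evaluation = [row for row in rows if row[0] in eval_ids]
    return train, evaluation

# Hypothetical rows: 10 sentences with two tokens each
sample_rows = [[i, w, "O"] for i in range(10) for w in ("some", "word")]
train_rows, eval_rows = split_by_sentence(sample_rows)
print(len(train_rows), len(eval_rows))  # 16 4
```

Splitting on rows instead of sentence ids would scatter the words of one sentence across both sets, which would make the evaluation look better than it is.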