

 # Abstract

 Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that involves identifying and categorizing named entities, like people, organizations, locations, time expressions, and monetary values, in unstructured text. NER serves as a crucial tool for extracting information and is vital for various applications such as information retrieval, text summarization, machine translation, question answering, and other NLP tasks. The interest in NER has been growing in recent years due to its widespread utility in different domains.

 # Why choose BERT over other transformers / models?  
  1. Performance. <br>
  According to [this](https://aclanthology.org/2021.triton-1.9.pdf) paper, the F1 score of fine-tuned BERT on NER (BioNER) was among the highest compared with CRF, Bi-LSTM, ELMo, GloVe, etc.

  2. Fast to set up. <br>
  There are already plenty of ready-to-use solutions:
   - HuggingFace pre-trained models (tokenizer, base, large)
   - PyTorch pre-trained models
   - spaCy transformers library
   - And so on.
   <br>
    So it is the fastest and easiest way to deploy such a model for NER.  
  3. Easy to use. <br>
  You can fine-tune a BERT model for your needs in only 2-4 epochs, and the F1 score will already be good according to the [paper](https://aclanthology.org/2021.triton-1.9.pdf) mentioned earlier.
  <br> Needless to say, there are many modifications of the BERT transformer, and it is even used together with CRF, BiLSTMs, etc.


# How model works
- For a detailed explanation you can check the [Google paper](https://arxiv.org/pdf/1706.03762.pdf), but in a nutshell, BERT uses attention layers to make use of context. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left; GPT, for example, is left-to-right), the Transformer encoder reads the entire sequence of words at once.<br> Therefore it is considered bidirectional, though it would be more accurate to say that it's non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

- Fine-tuning is additional training of the pre-trained model on another dataset for the task at hand. It is used to adapt BERT to a specific task such as NER, question answering (SQuAD), and others.  
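
To make the "non-directional" point more concrete, here is a minimal sketch in plain Python. It only builds the attention visibility masks and is not BERT itself; the function name `attention_mask` and the example sentence are made up for illustration.

```python
def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len matrix: 1 if position i may attend to position j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

tokens = ["We", "climbed", "Everest", "yesterday"]
encoder_mask = attention_mask(len(tokens), bidirectional=True)   # encoder style (BERT)
causal_mask = attention_mask(len(tokens), bidirectional=False)   # left-to-right style

position = tokens.index("Everest")
for name, mask in (("encoder", encoder_mask), ("left-to-right", causal_mask)):
    visible = [t for t, m in zip(tokens, mask[position]) if m]
    print(f"{name}: 'Everest' can attend to {visible}")
```

In the encoder-style mask the word sees both its left and right neighbours, while in the left-to-right mask everything after it is hidden.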

# Understanding dataset requirements
As mentioned earlier, we don't need a dataset with billions of samples, because the pre-trained model already knows a lot of words and their contexts. <br>
- Breaking down the task: <br>
  - Find / create a dataset with labeled mountains.
  <br> We are going to create the dataset using ChatGPT. There is no single right choice, but for me creating a dataset is a good experience.  
  - Select the relevant model architecture for solving NER.
  <br>
  I've chosen pre-trained BERT, which has a good F1 score compared with other models.
  - Train / fine-tune the model.
  <br> This will be demonstrated in another notebook.
  - Prepare demo code / a notebook with the inference results.
  <br> The same applies here.

- The dataset will consist of two columns: sentence and label.
  Why is that? BERT uses attention to derive context from the whole sentence, so it learns the contexts in which words appear, not just the words themselves. For other models we may need to mark where words start and end, but that is not the case for BERT. During pre-training, BERT masks 15% of the tokens and learns to predict them from the surrounding context. In a nutshell, BERT can work out which word we want it to mark if it sees that word in many sentences that differ in context.  
- NLP tasks usually require a lot of training data. That is not the case when we fine-tune BERT, because it already knows a lot, but our task here is to show BERT in what contexts a word can appear, not only what the word is.
So the dataset should contain not only the specific words that we need BERT to learn, but also varied, creative ways of using them. That is crucial for BERT's accuracy on real data.  
- In addition, the dataset should contain not only sentences with the new tag, but also sentences without it. BERT needs to understand what the tag is and what it isn't. So a 50/50 split between the "Mountain" tag and "O" would be ideal.
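
A quick way to check the 50/50 requirement is to count the labels. The sketch below assumes the `label;sentence` layout used later in this notebook; the four sample rows are invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the "label;sentence" layout
sample = """Mountain;We finally reached the summit of Mount Elbrus at dawn.
O;We finally reached the end of the queue at dawn.
Mountain;Have you ever seen Kilimanjaro from the plane?
O;Have you ever seen the city from the plane?
"""

rows = list(csv.reader(io.StringIO(sample), delimiter=";"))
label_counts = Counter(label for label, _sentence in rows)
total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
print(shares)
```

If one of the shares drifts far from 0.5, more sentences of the other label can be generated.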

# Generating dataset  
There are two types of data: synthetic and real. In simple words, synthetic data is generated (by AI, algorithms, and so on), and real data is produced by humans.
Based on [this](https://research.aimultiple.com/synthetic-data-vs-real-data/) article about the pros and cons of both, I will use synthetic data because it is fast to generate and easy to manipulate.  
# Implementation
I will use ChatGPT (GPT-3.5 Turbo) for synthetic dataset generation. <br>
First of all, we need a specific prompt so that it generates the data properly.

We should define the structure, the data, and the way to generate it.
After many attempts I came up with this prompt:
<br>
```
Provide dataset in csv format with no additional comments.
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
For label, if in sentence is mentioned mountain name label should be "Mountain", otherwise "O".
Split of labels "Mountain" or "O" should be 50/50.
Generate samples of sentences from this mountain list, they need to be variative, not similar, looking like dialogue.  Don't use country names.
For "O" label sentences, use same words but not mountain names.
Use ; as a delimiter, ignore column names
you should use all mountain names from list
List of mountains to generate from:
```
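
Replies to a prompt like this do not always follow the format exactly, so it can help to filter them before saving. This is a minimal sketch; `parse_response` is a hypothetical helper and the sample reply is invented.

```python
def parse_response(text, allowed_labels=("Mountain", "O")):
    """Split a raw model reply into (label, sentence) pairs, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        label, sep, sentence = line.strip().partition(";")
        label = label.strip().strip('"')
        sentence = sentence.strip().strip('"')
        # Keep the line only if it has a delimiter, a known label and some text
        if sep and label in allowed_labels and sentence:
            rows.append((label, sentence))
    return rows

reply = """label;sentence
Mountain;"Did you hear that Anna climbed Mont Blanc last summer?"
Sure, here is more data:
O;I left my backpack at the hotel.
"""
print(parse_response(reply))
```

Header rows and stray comments are dropped because their first field is not one of the allowed labels.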
# Warning
The way I chose to generate the dataset is dictated only by the short deadline for the task. A dataset generated by ChatGPT is limited to what the model was trained on, and the only way to enhance it is to add more details and references to the prompt. Even so, the quality of such a dataset is poor, while writing all the sentences by hand is very time consuming.  <br>

***`So this is a tradeoff - quantity over quality.`***

There are other LLMs that could perform such a task (Llama, Bing, Bard, Perplexity), but ChatGPT is the most convenient for this.

# Doing the thing
There is a way to do it using the ChatGPT API, and the workflow looks like this:
1. Connect to the API
2. Send the prompt
3. Append the response to the saved responses
4. Repeat X times
5. Write to csv <br>
- But for this approach I need a paid ChatGPT API subscription, and I don't have one.

The other way to do it automatically is to web-scrape the ChatGPT page. <br>
- But they added a CAPTCHA, so the scraper couldn't even get to the prompting page.

So only one approach is left: copy-pasting from ChatGPT into the dataset file manually.
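
The five-step workflow above can be sketched without touching any real API by passing the "send prompt" step in as a function. Everything here is an assumption for illustration: `collect_dataset`, `fake_generate` and the output file name are hypothetical, and `fake_generate` simply returns two fixed lines instead of calling a model.

```python
import csv
import os
import tempfile

def collect_dataset(generate, prompt, cycles, path):
    """Call generate(prompt) `cycles` times and write all 'label;sentence' lines to path."""
    lines = []
    for _ in range(cycles):
        reply = generate(prompt)
        # Keep only lines that contain the expected delimiter
        lines.extend(line for line in reply.splitlines() if ";" in line)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=";")
        for line in lines:
            writer.writerow(line.split(";", 1))
    return len(lines)

def fake_generate(prompt):
    # Stand-in for the real API call
    return "Mountain;We camped below Mount Rainier.\nO;We camped below the bridge."

out_path = os.path.join(tempfile.gettempdir(), "dataset_demo.csv")
written = collect_dataset(fake_generate, "prompt text", cycles=3, path=out_path)
print(written)
```

With a real client, only `fake_generate` would need to be swapped for a function that sends the prompt and returns the reply text.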
<br>



## Demo of how process would look like using API

In [None]:

"""
!pip install openai -q

prompt = '''Provide dataset in csv format with no additional comments.
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
For label, if in sentence is mentioned mountain name label should be "Mountain", otherwise "O".
Split of labels "Mountain" or "O" should be 50/50.
Generate 20 samples. Skip table names at the beginning of csv file'''.replace("\n", " ")


from openai import OpenAI

OPENAI_API_KEY="some-key"
client = OpenAI(api_key=OPENAI_API_KEY)

answers = []

columns = ["label, sentence"]
cycles = 10
for i in range(cycles):
  completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {"role": "system", "content": ""},
      {"role": "user", "content": prompt}
    ])
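  # Store only the generated text; the completion object itself cannot be joined as a string
  answers.append(completion.choices[0].message.content)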

with open("dataset.csv", "w") as f:
  f.write("\n".join(answers))
"""
# And all necessary things...


# Actual

Since I have no money for a ChatGPT subscription, I will prompt ChatGPT by hand using slightly modified versions of the prompts described above, and copy-paste the results into the dataset file.
To improve the results, I will add mountain names copied from Wikipedia.  

In [None]:
"""
# raw_data.py
from raw_data import mountains, countries, usa_states
#from raw_data import additional_mountains
import re"""


Defining all the functions to remove artifacts:

In [None]:
"""
def remove_numbers(input_string):
    return re.sub(r'\d+', '', input_string)

def remove_substring(input_string, substring_to_remove):
    pattern = re.escape(substring_to_remove)
    return re.sub(pattern, '', input_string)

def remove_brackets_and_content(input_string):
    pattern = re.compile(r'\([^)]*\)')
    return re.sub(pattern, '', input_string)

def remove_square_brackets_and_content(input_string):
    pattern = re.compile(r'\[.*?\]')
    return re.sub(pattern, '', input_string)
"""

1. Removing numbers, commas, tabs, and brackets with their content from countries.
2. Turning countries into a list

In [None]:
""" Countries to list """
"""
countries = remove_numbers(countries)
countries = remove_substring(countries, ",")
countries = remove_substring(countries, "\t")
countries = remove_brackets_and_content(countries)
countries = countries.split("\n")
# Trim surrounding whitespace and drop empty entries
countries = [c.strip() for c in countries if c.strip()]
"""


3. turning usa_states to list

In [None]:
""" USA states to list """
#usa_states = usa_states.split("\n")


4. Delete all appearances of countries and USA states in mountains
 (because if there are too many mentions of countries, BERT will classify them as mountains, and we don't need that)

In [None]:
"""
for country in countries:
    mountains = remove_substring(mountains, country)

for state in usa_states:
    mountains = remove_substring(mountains, state)
"""

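
One thing to watch with plain substring removal is that it also cuts a country name out of longer words. The sketch below shows a word-boundary variant; `remove_whole_phrase` is a hypothetical helper and "Georgiana" is a made-up name used only to show the difference.

```python
import re

def remove_whole_phrase(text, phrase):
    """Remove `phrase` only where it is not part of a longer word."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.sub(pattern, "", text)

line = "Mount Georgiana Georgia"
plain = line.replace("Georgia", "")               # also damages "Georgiana"
bounded = remove_whole_phrase(line, "Georgia")    # leaves "Georgiana" intact
print(repr(plain))
print(repr(bounded))
```

Whether the stricter version is preferable depends on how the raw Wikipedia text is laid out, so treat it as an option rather than a replacement.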

5. Preprocess mountains too

In [None]:
"""
mountains = remove_substring(mountains, ",")
mountains = remove_brackets_and_content(mountains)
mountains = remove_square_brackets_and_content(mountains)
mountains = re.sub(r"\band\b", "", mountains)  # remove "and" only as a standalone word, not inside names
mountains = remove_substring(mountains, ":")
mountains = remove_substring(mountains, " Range ")
mountains = remove_substring(mountains, " Hill Country")
mountains = remove_substring(mountains, " New")
# (USA states were already removed from mountains in step 4)
mountains = sorted(mountains.split("\n"))
"""


In [None]:
""" Split in chunks for prompting """

prompt = """
Provide dataset in csv format with no additional comments. Use ; as a delimiter, ignore column names
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
For label, if in sentence is mentioned mountain name label should be "Mountain", otherwise "O".
Split of labels "Mountain" or "O" should be 50/50.
Generate samples of sentences from this mountain list, they need to be variative, not similar, looking like dialogue.  Don't use country names.
Don't repeat yourself, use different adjectives, cases, situations
For "O" label sentences, use same words but not mountain names.
you should use all mountain names from list
List of mountains to generate from: \n
"""
"""
i = 0
addition = 35
while i < len(mountains):
    #print(prompt)
    #print("\n".join(mountains[i:i+addition]), "\n", "_"*20, "\n")
    print("\n".join(mountains[i:i+addition]))
    i += addition
"""

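
The same chunking can be wrapped in a small generator so that each chunk is glued to the prompt automatically. This is only a sketch: `chunks` and `build_prompts` are hypothetical helpers, and the placeholder names stand in for the real mountain list.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_prompts(base_prompt, names, size=35):
    """Return one full prompt per chunk of names."""
    return [base_prompt + "\n".join(chunk) for chunk in chunks(names, size)]

demo_names = [f"Peak {n}" for n in range(80)]  # placeholder names
demo_prompts = build_prompts("List of mountains to generate from:\n", demo_names)
print(len(demo_prompts))
```

The chunk size of 35 matches the `addition` value used in the cell above.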

Now I will paste the ChatGPT results for each prompt + chunk, one by one, into the dataset csv file.

Because the word "UNESCO" and some country names are overused in the results, we also need to extend the dataset with the output of the following prompt:
```
Provide dataset in csv format with no additional comments. Use ; as a delimiter, ignore column names
It should have two columns: label, sentence.
For sentence you should create sentence, with mountain name in it, or no mountain name in it.
for label use only "O", avoid using mountain names. Use words from list:
UNESCO, 'Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', "Côte d'Ivoire", 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo ', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus',
'Czechia ', 'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini ', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Holy See', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati',
'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar ', 'Namibia', 'Nauru', 'Nepal', 'México', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North Korea', 'North Macedonia', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine State', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Rwanda', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and
the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Sweden', 'Switzerland', 'Syria', 'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe', 'Yukon British Columbia', 'British Columbia', 'ChurchillAlberta', 'Quebec', 'Czech Republic', 'Kosovo'
```

In [None]:
"""
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept
df = pd.read_csv('/content/bert_ner_moountain_dataset.csv', delimiter=";", on_bad_lines='skip')"""


In [None]:
"""
# Select Mountain labeled sentences and count how often each word appears in them
mountain_sentences = df[df['label'] == 'Mountain']['sentence']
all_mountain_sentences = ' '.join(mountain_sentences)
mountain_words = nltk.word_tokenize(all_mountain_sentences)
mountain_word_counts = Counter(mountain_words)

# Select O labeled sentences and count how often each word appears in them
o_sentences = df[df['label'] == 'O']['sentence']
all_o_sentences = ' '.join(o_sentences)
o_words = nltk.word_tokenize(all_o_sentences)
o_word_counts = Counter(o_words)
"""

# In-process
When I was creating the dataset, I repeatedly noticed that chatgpt uses the words "mountain", "peak", and others in the same sentences with the names of mountains. This would lead to the model recognizing these words as mountains, which we did not want. Additionally, the situation was the same with country names. Chatgpt often inserted them into sentences marked as Mountain. I fixed this by generating another 400 lines of dataset with only countries, without mountain names, and marking them as "O". But that's not all. There was also a problem with recognizing the words "I", "is", "expedition" as mountains. I dealt with this in the same way as before, by generating strings without mountains in which these words would often appear.
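
Words that leak into the "Mountain" class this way can be spotted by comparing word counts between the two label groups. The sketch below uses plain whitespace splitting instead of NLTK, and `overrepresented_words` plus the four demo sentences are assumptions made for the example.

```python
from collections import Counter

def overrepresented_words(mountain_sentences, o_sentences, min_count=2):
    """Words seen at least min_count times in Mountain sentences and never in O sentences."""
    mountain_counts = Counter(w.lower() for s in mountain_sentences for w in s.split())
    o_counts = Counter(w.lower() for s in o_sentences for w in s.split())
    return sorted(
        word for word, count in mountain_counts.items()
        if count >= min_count and o_counts[word] == 0
    )

mountain_demo = ["The peak of Elbrus is icy", "We saw the peak of Olympus"]
o_demo = ["The road is icy", "We saw the river"]
suspicious = overrepresented_words(mountain_demo, o_demo)
print(suspicious)
```

Words returned by such a check are candidates for extra "O" sentences, which is the fix described above.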


# Summary
The dataset has a great influence on how the model will work. That's why fixing such errors comes down to adding new examples to the dataset. As we already know, BERT identifies words using their context, so the training data has to cover enough different contexts.  

In [None]:
import requests

In [None]:
country_codes = [
  "AF","AX","AL","DZ","AS","AD","AO","AI","AQ","AG","AR","AM","AW","AU","AT","AZ","BH","BS","BD","BB","BY","BE","BZ","BJ","BM","BT",
  "BO","BQ","BA","BW","BV","BR","IO","BN","BG","BF","BI","KH","CM","CA","CV","KY","CF","TD","CL","CN","CX","CC","CO","KM","CG","CD",
  "CK","CR","CI","HR","CU","CW","CY","CZ","DK","DJ","DM","DO","EC","EG","SV","GQ","ER","EE","ET","FK","FO","FJ","FI","FR","GF","PF",
  "TF","GA","GM","GE","DE","GH","GI","GR","GL","GD","GP","GU","GT","GG","GN","GW","GY","HT","HM","VA","HN","HK","HU","IS","IN","ID",
  "IR","IQ","IE","IM","IL","IT","JM","JP","JE","JO","KZ","KE","KI","KP","KR","KW","KG","LA","LV","LB","LS","LR","LY","LI","LT","LU",
  "MO","MK","MG","MW","MY","MV","ML","MT","MH","MQ","MR","MU","YT","MX","FM","MD","MC","MN","ME","MS","MA","MZ","MM","NA","NR","NP",
  "NL","NC","NZ","NI","NE","NG","NU","NF","MP","NO","OM","PK","PW","PS","PA","PG","PY","PE","PH","PN","PL","PT","PR","QA","RE","RO",
  "RU","RW","BL","SH","KN","LC","MF","PM","VC","WS","SM","ST","SA","SN","RS","SC","SL","SG","SX","SK","SI","SB","SO","ZA","GS","SS",
  "ES","LK","SD","SR","SJ","SZ","SE","CH","SY","TW","TJ","TZ","TH","TL","TG","TK","TO","TT","TN","TR","TM","TC","TV","UG","UA","AE",
  "GB","US","UM","UY","UZ","VU","VE","VN","VG","VI","WF","EH","YE","ZM","ZW"
]

def form_request(country_code, amount, username='me'):
  return f'http://api.geonames.org/searchJSON?featureCode=MT&username={username}&maxRows={amount}&style=SHORT&isReduced=true&country={country_code}'
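
As an alternative sketch, the same URL can be assembled with `urllib.parse.urlencode`, which takes care of escaping if a username or code ever contains special characters. The function name `form_request_encoded` is hypothetical; the parameters are the ones already used in `form_request`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def form_request_encoded(country_code, amount, username="me"):
    """Build the GeoNames search URL from a parameter dictionary."""
    params = {
        "featureCode": "MT",
        "username": username,
        "maxRows": amount,
        "style": "SHORT",
        "isReduced": "true",
        "country": country_code,
    }
    return "http://api.geonames.org/searchJSON?" + urlencode(params)

url = form_request_encoded("UA", 500)
print(url)
print(parse_qs(urlsplit(url).query))
```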

In [None]:
mountain_names = []
AMOUNT_OF_MOUNTAINS_PER_COUNTRY = 500
for country_code in country_codes:
  print(country_code, end=" ")
  response = requests.get(form_request(country_code, AMOUNT_OF_MOUNTAINS_PER_COUNTRY))
  for response_mountain in response.json()['geonames']:
    mountain_names.append(response_mountain['toponymName'])
print(len(mountain_names))

AF AX AL DZ AS AD AO AI AQ AG AR AM AW AU AT AZ BH BS BD BB BY BE BZ BJ BM BT BO BQ BA BW BV BR IO BN BG BF BI KH CM CA CV KY CF TD CL CN CX CC CO KM CG CD CK CR CI HR CU CW CY CZ DK DJ DM DO EC EG SV GQ ER EE ET FK FO FJ FI FR GF PF TF GA GM GE DE GH GI GR GL GD GP GU GT GG GN GW GY HT HM VA HN HK HU IS IN ID IR IQ IE IM IL IT JM JP JE JO KZ KE KI KP KR KW KG LA LV LB LS LR LY LI LT LU MO MK MG MW MY MV ML MT MH MQ MR MU YT MX FM MD MC MN ME MS MA MZ MM NA NR NP NL NC NZ NI NE NG NU NF MP NO OM PK PW PS PA PG PY PE PH PN PL PT PR QA RE RO RU RW BL SH KN LC MF PM VC WS SM ST SA SN RS SC SL SG SX SK SI SB SO ZA GS SS ES LK SD SR SJ SZ SE CH SY TW TJ TZ TH TL TG TK TO TT TN TR TM TC TV UG UA AE GB US UM UY UZ VU VE VN VG VI WF EH YE ZM ZW 61961


In [None]:
import pandas as pd
df = pd.DataFrame(mountain_names, columns=['name'])

In [None]:
df.to_csv('/content/mountain_names.csv',index=False)
df.describe()

| | name |
|---|---|
| count | 61961 |
| unique | 56643 |
| top | Cerro Grande |
| freq | 36 |
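
The gap between `count` and `unique` means some names repeat. If repeated names are not wanted in the prompts, one possible sketch for dropping them while keeping the original order is shown below; `unique_in_order` and the demo list are assumptions for illustration.

```python
def unique_in_order(names):
    """Drop repeated names while keeping the first occurrence of each."""
    return list(dict.fromkeys(names))

demo = ["Cerro Grande", "Hoverla", "Cerro Grande", "Elbrus", "Hoverla"]
deduplicated = unique_in_order(demo)
print(deduplicated)
```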


In [None]:
prompt= """
Provide sentences dataset with words from list. Be creative. Don't change words that you are using, they should look like in the list. Sentences should be 5-10 words long. You should use all words from list
generate sentences from all words in list. All words from list should be used;
there shold be 50 sentences
List:"""

In [None]:
index = 0
chunk = 50
while index < len(mountain_names):
  chunk_of_mountains = mountain_names[index:index+chunk]
  index += chunk
  mountains_ = ", ".join(chunk_of_mountains)
  print(prompt)
  print(mountains_, "\n")

Output truncated to the last 5000 lines.
Kūh-e Malek, Kūh-e Gowd-e Anjīrī, Kūh-e Poshtkūh, Kūh-e Shāskūh, Kūh-e Tanūreh, Kūh-e Qārūn, Kūh-e Lākh-e Jangī, Kūh-e Hārmū, Kūh-e Seh Chāhī, Kūh-e Khūm, Kūh-e Khatābūn, Kūh-e ‘Omar Kūh, Kūh-e ‘Arabū, Kūh-e Shūrān Sīāh, Kūh-e Kūr Sefīd, Kūh-e Khaţīr Sūkhteh, Kūh-e Shojābīl, Kūh-e Sar Owz̧ā‘, Kūh-e Qahremān-e Fānī, Kūh-e Sekīl-e Seh Tā, Kūh-e Āj-e Bālā, Kūh-e Qal‘eh Kamar, Kūh-e Shānjān-e Nasvī, Kūh-e Qarchanqūl, Kūh-e Āqā Mīr Dāghī, Kūh-e Espelān, Kūh-e Sorkh Sangān, Kūh-e Tameshk Chāl, Kūh-e Neshā’, Kūh-e Sārī Qayeh, Kūh-e Rameẕān Dāghī, Kūh-e Kīālān, Kūh-e Do Delū, Kūh-e Sar Chāl-e Bāl, Kūh-e Gajeh, Kūh-e Sarī, Kūh-e Jangal-e Rey, Kūh-e Gīv Sīāh, Kūh-e Halach Halach, Kūh-e Sīrāngaleh, Kūh-e Sar Khowrandar, Kūh-e Jabar Ālān, Kūh-e Hameh Qolī, Kūh-e Kāleh, Kūh-e Gūreh Qalā‘, Kūh-e Kūchakeh Sūr, Kūh-e Do Ḩeşāreh, Kūh-e Havār Barzeh, Kūh-e Bard Zard, Kūh-e Zarang 


Provide sentences dataset with words from list. Be

In [None]:
# on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept
df_sentences = pd.read_csv('/content/dataset_sentences.csv', delimiter=';', on_bad_lines='skip')





In [None]:
import re

def first_upper(s):
    return bool(s and s[0].isupper())

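def remove_duplicates(array):
    # Despite the name, this does more than deduplicate: each distinct string is
    # returned once, and strings that occur more than 100 times are dropped entirely.
    string_counts = {}

    for string in array:
        string_counts[string] = string_counts.get(string, 0) + 1
    filtered_array = [string for string, count in string_counts.items() if count <= 100]
    return filtered_array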


In [None]:
!pip install unidecode -q

In [None]:
from unidecode import unidecode

new_dataset = []

joined_mountains = remove_duplicates((" ".join(mountain_names)).split(" "))
# Transliterate to ASCII so the entries match the unidecode()-normalised sentence words,
# and store them in a set for fast membership checks
joined_mountains = set(map(unidecode, joined_mountains))
word = "Velikiy"
word = re.sub(r"[^a-zA-Z0-9\-]", '', unidecode(word))
print(word in joined_mountains)




True


In [None]:
sentences_data = df_sentences['sentence'].to_numpy()
for i, sentence in enumerate(sentences_data):
  if (i % (len(sentences_data) // 5) == 0):
    print(f" > {i}/{len(sentences_data)}")
  for word in sentence.split(" "):
    word = re.sub(r"[^a-zA-Z0-9\-]", '', unidecode(word))
    if word in joined_mountains and first_upper(word):
      label = "Mountain"
    else:
      label = "O"
    new_dataset.append([i, word, label])

 > 0/5839
 > 1167/5839
 > 2334/5839
 > 3501/5839
 > 4668/5839
 > 5835/5839


In [None]:
new_dataset[0]

[0, 'Koh-e', 'O']

In [None]:
df = pd.DataFrame(new_dataset, columns=['sentence#', 'word', 'label'])

In [None]:
df.to_csv('/content/example.csv', index=False)
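
For training, the word-level rows usually have to be grouped back into sentences. Here is a minimal sketch that assumes the rows are already ordered by `sentence#`, as they are when built by the loop above; `rows_to_sentences` and the demo rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def rows_to_sentences(rows):
    """Group [sentence_id, word, label] rows into (words, labels) pairs per sentence."""
    sentences = []
    for _, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        sentences.append(([r[1] for r in group], [r[2] for r in group]))
    return sentences

demo_rows = [
    [0, "We", "O"], [0, "climbed", "O"], [0, "Hoverla", "Mountain"],
    [1, "Nice", "O"], [1, "weather", "O"],
]
grouped = rows_to_sentences(demo_rows)
print(grouped)
```

Note that `groupby` only merges neighbouring rows, so unsorted input would need to be sorted by sentence id first.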