# ⬇️ Download dependencies

In [165]:
# install spacy
# !pip install -U spacy
# !pip install --upgrade spacy
# !pip install spacy-transformers -q
# !python -m pip install --upgrade transformers -q
# !python -m spacy download en_core_web_sm -q

# ⬅️ Importing libraries

In [166]:
import pandas as pd
import numpy as np
from spacy.training.example import Example
import spacy
from spacy.tokens import Doc, Token
from sklearn.model_selection import train_test_split
import random
import time

# 📥 Loading model

Pickle was the simplest option I could find. `nlp.to_disk` writes the pipeline as a directory, which is awkward to move between environments. I spent a long time archiving and restoring it (using shutil, zip/unzip), so I do not consider it the best option for this task.

In [167]:
import pickle
with open('spacy_model.pkl', 'rb') as f:
    nlp = pickle.load(f)
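
For comparison, the directory route mentioned above can be kept fairly short. The sketch below is only an illustration, not what this notebook does: it assumes the model directory was written by `nlp.to_disk(...)`, and the helper names `pack_model_dir` and `unpack_model_dir` as well as all paths are hypothetical. Only standard-library `shutil` calls are used.

```python
import shutil
from pathlib import Path


def pack_model_dir(model_dir: str, archive_base: str) -> str:
    """Zip a model directory (e.g. one written by nlp.to_disk) into <archive_base>.zip."""
    # make_archive appends the ".zip" suffix itself and returns the archive path
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)


def unpack_model_dir(archive_path: str, target_dir: str) -> Path:
    """Extract the archive into target_dir and return it as a Path."""
    shutil.unpack_archive(archive_path, target_dir, "zip")
    return Path(target_dir)
```

After unpacking, `spacy.load(target_dir)` should restore the pipeline, provided the same spaCy and spacy-transformers versions are installed.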

# 🎯 Accuracy on dataset

The dataset is loaded as a pandas DataFrame; its columns are shown below.
As for its structure: each row holds one word, its label, and the number of the sentence it belongs to. Grouping the rows by sentence# yields the sentences and their labels.


In [168]:
data_path = '/content/dataset_medium.csv'
df = pd.read_csv(data_path, delimiter=",", on_bad_lines="skip")  # pandas >= 1.3; replaces the deprecated error_bad_lines=False
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51640 entries, 0 to 51639
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentence#  51640 non-null  int64 
 1   word       51640 non-null  object
 2   label      51640 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB





Since the model was trained on lowercase words, the dataset is converted to lowercase as well; otherwise the results would be inaccurate.

In [169]:
df['word'] = df['word'].map(lambda x: x.lower())
sentences = df.groupby("sentence#")["word"].apply(list).values
labels = df.groupby(by = 'sentence#')['label'].apply(list).values

To simplify the accuracy calculation, let's combine the words of the sentences with their labels.

In [170]:
dataset = []
for i in zip(sentences, labels):
  dataset.append(list(zip(i[0], i[1]) ))
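
To make the resulting structure concrete, here is a small standalone sketch of the same grouping on hand-written rows. The rows and the `O` label for non-mountain words are assumptions for illustration (the full label set of the dataset is not shown above), and plain Python is used instead of pandas so the snippet runs on its own.

```python
from collections import defaultdict

# Hypothetical rows in the same (sentence#, word, label) layout as the CSV.
# "O" as the non-entity label is an assumption, not taken from the dataset.
rows = [
    (0, "hora", "O"),
    (0, "hoverla", "Mountain"),
    (0, "is", "O"),
    (0, "high", "O"),
    (1, "breakfast", "O"),
    (1, "was", "O"),
    (1, "good", "O"),
]

# Group (word, label) pairs by sentence id, preserving word order.
grouped = defaultdict(list)
for sentence_id, word, label in rows:
    grouped[sentence_id].append((word.lower(), label))

# One list of (word, label) pairs per sentence, like `dataset` above.
toy_dataset = [grouped[key] for key in sorted(grouped)]
print(toy_dataset[0])
# [('hora', 'O'), ('hoverla', 'Mountain'), ('is', 'O'), ('high', 'O')]
```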

The accuracy below is computed on the dataset the model was trained on. Keep in mind that the model cannot recognize new words (mountain names) without context.
The training-set accuracy is shown only to illustrate how the model behaves on known versus unknown data.

In [171]:
correct = 0
total = 0
for i, sentence in enumerate(sentences):
    doc = nlp(" ".join(sentence))
    # entities predicted by the model
    predicted = [(ent.text, ent.label_) for ent in doc.ents]
    # labelled (word, label) pairs from the dataset
    expected = [element for element in dataset[i] if element[1] == "Mountain"]

    # a sentence counts as correct only if both sets match exactly
    correct += int(set(predicted) == set(expected))
    total += 1

print(f" accuracy: {round(correct/total, 2)*100}% ")

 accuracy: 95.0% 


We see that the model fits the training data well.
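
Note that the score above is an exact-match rate per sentence: a sentence counts as correct only when the set of predicted `(text, label)` pairs equals the set of labelled ones. A minimal sketch on made-up predictions (the helper name `exact_match_rate` and the toy data are illustrative, not part of the notebook) shows how strict that is:

```python
def exact_match_rate(predicted, expected):
    """Share of sentences whose predicted entity set equals the expected set."""
    if not expected:
        return 0.0
    hits = sum(set(p) == set(e) for p, e in zip(predicted, expected))
    return hits / len(expected)


# Toy data: three sentences, entities given as (text, label) pairs.
expected = [
    [("hoverla", "Mountain")],
    [],
    [("velikaya", "Mountain")],
]
predicted = [
    [("hoverla", "Mountain")],                      # exact match
    [],                                             # correctly empty
    [("i", "Mountain"), ("velikaya", "Mountain")],  # one spurious entity
]

print(exact_match_rate(predicted, expected))
```

Two of the three toy sentences match. The third one finds the right mountain but still counts as a miss because of the extra entity.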

# ⚙️ Predict entities

In [199]:
def predict_entities(text, nlp):
    doc = nlp(text.lower())
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

sentences_manual = [
    "I love climbing mountains.",
    "Did you seen gora Hoverla pictures? They're awesome!",
    "Hey, breakfast was good today",
    "I like hora velikaya in Ukraine",
    "I like velikaya mountain in Ukraine"
]
for sentence in sentences_manual:
    print(sentence)
    print(predict_entities(sentence, nlp), "\n")

I love climbing mountains.
[('i', 'Mountain')] 

Did you seen gora Hoverla pictures? They're awesome!
[('hoverla', 'Mountain')] 

Hey, breakfast was good today
[] 

I like hora velikaya in Ukraine
[('i', 'Mountain'), ('velikaya', 'Mountain')] 

I like velikaya mountain in Ukraine
[('i', 'Mountain')] 



For example, consider these two sentences:

<br>"I like hora velikaya in Ukraine"
<br>`[('i', 'Mountain'), ('velikaya', 'Mountain')]`

<br>"I like velikaya mountain in Ukraine"
<br>`[('i', 'Mountain')]`


#### What is the difference here, and why did the model not recognize Velikaya as a mountain in the second case?
<br>
Because of the specifics of the source, the Ukrainian mountain names in the dataset are preceded by the word "hora", which corresponds to the English "mountain". Because of this and the attention mechanism in spaCy transformers, the model "knows" that "hora" is used with Ukrainian mountain names. So although the model could not recognize Velikaya on its own, with this context it still labels the mountain correctly.
<br><br>
This is purely a problem of the dataset. It should consist of many varied examples, which is almost impossible to achieve in such a short time using ChatGPT (primarily because of ChatGPT's limitations).
<br><br>
High-quality datasets are created by teams of people and are checked and tested many times. The dataset used here is already in its 7th version.


In [176]:
for i in range(5):
  # randrange excludes the upper bound, so the index is always valid
  print(" ".join(sentences[random.randrange(len(sentences))]))

hora mulha and hora polovatynets celebrated the night away
loma de los guananitos a remote hamlet preserves traditional customs
phnum thma keh a region with historical charm
mount tyree a rugged mountain challenging the brave
cerro uran surrounded by lush vegetation was a haven


Continuing with the example of the Ukrainian 'hora': other languages have their own equivalents. Since the dataset is compiled from mountain names from countries around the world, it also contains 'cerro', 'gora', 'mountain', 'peak', and other similar words.

All of them serve as context that helps determine the labels for the words in a sentence.
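
One way to probe this dependence on context is to feed the same name to the model with and without such a marker word and compare the results. The sketch below assumes a `predict` callable that takes a text and returns `(text, label)` pairs, as `predict_entities` does once a model is bound to it. `probe_context` is an illustrative helper name, and `fake_predict` is a hypothetical stand-in that only mimics the behaviour seen above, so the snippet runs without the trained model.

```python
def probe_context(name, templates, predict):
    """Return {sentence: recognised?} for one name placed into several templates."""
    results = {}
    for template in templates:
        sentence = template.format(name=name)
        entities = predict(sentence)
        # The name counts as recognised if it appears as a Mountain entity.
        results[sentence] = (name.lower(), "Mountain") in entities
    return results


def fake_predict(text):
    """Hypothetical stand-in: tags the word right after 'hora' or 'gora' as a Mountain."""
    words = text.lower().split()
    return [
        (words[i + 1], "Mountain")
        for i, word in enumerate(words[:-1])
        if word in {"hora", "gora"}
    ]


templates = ["I like hora {name} in Ukraine", "I like {name} mountain in Ukraine"]
for sentence, found in probe_context("Velikaya", templates, fake_predict).items():
    print(f"{found!s:5} | {sentence}")
```

With the real model, pass `lambda text: predict_entities(text, nlp)` as `predict` instead of the stand-in.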

# ⚔️ Try it yourself!

In [213]:
your_sentence = "Did you seen hora Hoverla on photos? It's awesome!"
print(f" <> your sentence: \n - { your_sentence } \n <> predicted labels: \n{ predict_entities(your_sentence, nlp) }")

 <> your sentence: 
 - Did you seen hora Hoverla on photos? It's awesome! 
 <> predicted labels: 
[('hoverla', 'Mountain')]


# ⚠️ Warning and conclusion
The way the model behaves is dictated solely by the dataset. It should be bigger, more detailed, and more diverse.

But even with such a small dataset, the model shows good results, even though it relies heavily on context.

This can be corrected by extending the dataset with more varied contexts.
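
One possible direction, sketched here purely as an assumption about how the dataset could be diversified, is to generate training sentences from templates that place the same mountain name both next to a marker word and without one. The templates, the names, the helper `make_rows`, and the `O` label for non-entity words are all hypothetical; the output rows follow the `sentence#`, `word`, `label` layout of the CSV used above.

```python
import random

# Hypothetical templates: some contain a marker word, some do not.
TEMPLATES = [
    "we climbed hora {name} last summer",
    "we climbed {name} last summer",
    "the view from {name} mountain is stunning",
    "{name} is covered in snow",
]


def make_rows(names, sentences_per_name=2, seed=0):
    """Build (sentence#, word, label) rows with each name in varied contexts."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    rows = []
    sentence_id = 0
    for name in names:
        for template in rng.sample(TEMPLATES, sentences_per_name):
            for word in template.split():
                if word == "{name}":
                    # every word of a (possibly multi-word) name gets the entity label
                    for part in name.lower().split():
                        rows.append((sentence_id, part, "Mountain"))
                else:
                    # "O" as the non-entity label is an assumption
                    rows.append((sentence_id, word, "O"))
            sentence_id += 1
    return rows


for row in make_rows(["Hoverla", "Velikaya"])[:6]:
    print(row)
```

Because each name appears in several different contexts, the model has less reason to tie the label to a single marker word.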

# 👤 Personal thoughts
This assignment helped me better understand transformers, how they work, and the nuances of building a dataset. I learned about several ready-made transformer solutions such as spaCy, PyTorch pretrained BERT, and others.
I think it was a valuable experience.
I am open to criticism and would be glad to hear your thoughts on how to improve the model itself or the dataset generation.