# Test file for the project:
# Matching geographic names to unified GeoNames names

# Task

- Build a solution that selects the best-matching GeoNames names for a query, for example Ереван -> Yerevan.
- Scope: Russia and the countries most popular for relocation (Belarus, Armenia, Kazakhstan, Kyrgyzstan, Turkey, Serbia); cities with a population of 15,000 or more (with the option to scale up on the customer's server).
- Returned fields: *geonameid, name, region, country, cosine similarity*.
- Output format: a list of dictionaries, for example [{dict_1}, {dict_2}, …, {dict_n}], where each dictionary is one record with the fields listed above.
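
The output format described above can be sketched as follows. This is only an illustration: the key name `cosine_similarity`, the helper `make_record`, and the similarity value are assumptions; the Goris record values are taken from the corpus table shown later in this notebook.

```python
def make_record(geonameid, name, region, country, cosine_similarity):
    """Build one output record with the fields requested in the task."""
    return {
        "geonameid": geonameid,
        "name": name,
        "region": region,
        "country": country,
        # hypothetical key name for the "cosine similarity" field
        "cosine_similarity": round(float(cosine_similarity), 4),
    }

# The similarity value 0.98 is made up for illustration.
results = [make_record(174895, "Goris", "Syunik", "Armenia", 0.98)]
print(results)
```

A list of such dictionaries serializes directly to JSON, which may be convenient if the solution is later exposed as a service.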

# Goal

Match arbitrary geographic names to unified GeoNames names for internal use by the Career Center.

# Work plan

- Conduct exploratory data analysis.
- Prepare the data for training.
  - First approach: connect third-party sites through an API.
  - Second approach: train a model using DML.
    - Train a neural network and evaluate its quality.

# Data description

GeoNames tables used:
- admin1CodesASCII
- alternateNamesV2
- cities15000
- countryInfo
- any other open data, if needed
- the GeoNames tables can be downloaded from http://download.geonames.org/export/dump/
- Test dataset: https://disk.yandex.ru/d/wC296Rj3Yso2AQ
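
As a rough sketch of how a table such as cities15000 might be read, assuming the tab-separated, header-less layout described in the GeoNames dump readme: the helper `read_geonames` and the snake_case column names are this example's own, and the sample row uses illustrative coordinates, elevation, and date. This notebook itself starts from an already prepared *corpus.csv*.

```python
import csv
import io

import pandas as pd

# Column order of the main GeoNames tables; the snake_case names are this sketch's choice.
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]

def read_geonames(source):
    """Read a GeoNames dump table (path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=GEONAMES_COLUMNS,
        dtype={"admin1_code": str, "country_code": str},  # keep codes like "08" intact
        keep_default_na=False,   # do not turn the country code "NA" into a missing value
        na_values=[""],
        quoting=csv.QUOTE_NONE,  # names may contain quote characters
    )

# One illustrative row in the same layout (coordinates, dem and date are not real data).
sample = (
    "174895\tGoris\tGoris\tGeryusy,Goris,Горис\t39.51\t46.34\tP\tPPL\tAM\t\t08"
    "\t\t\t\t20379\t\t1385\tAsia/Yerevan\t2020-01-01\n"
)
cities = read_geonames(io.StringIO(sample))
print(cities[["geonameid", "asciiname", "country_code", "admin1_code", "population"]])
```

Reading `admin1_code` as a string matters when it is later joined with `country_code` into keys such as `AM.08` for the admin1CodesASCII lookup.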


**Import the required libraries (pandas, numpy, and others).**

In [33]:
!pip install -U sentence-transformers



In [34]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/mbart-aug/semantic_mod.py
/kaggle/input/mbart-aug/corpus.csv
/kaggle/input/mbart-aug/augmentation_add_typo.py
/kaggle/input/mbart-aug/geo_test.csv
/kaggle/input/mbart-aug/city_dataset.py
/kaggle/input/mbart-aug/generation_semantic_mod.py
/kaggle/input/mbart-aug/count_vec_mod.py


In [35]:
import sys
sys.path.append('/kaggle/input/mbart-aug/')

In [1]:
import pandas as pd
import numpy as np
import time

from sklearn.feature_extraction.text import CountVectorizer
from count_vec_mod import CountVec
from semantic_mod import SemSearch
from tqdm import notebook

# NLP
import torch
import transformers
from sentence_transformers import SentenceTransformer
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import nltk
import re
import os

**Because the quality of the different approaches is evaluated on a GPU, we load the table from the *corpus.csv* file.**

In [2]:
if os.path.exists('corpus.csv'):
    corpus_rus = pd.read_csv("corpus.csv", delimiter=';')
elif os.path.exists("/kaggle/input/mbart-aug/corpus.csv"):
    corpus_rus = pd.read_csv("/kaggle/input/mbart-aug/corpus.csv", delimiter=';')
elif os.path.exists('C:/Users/User/Desktop/DS Python/Geonames/data/corpus.csv'):
    corpus_rus = pd.read_csv("C:/Users/User/Desktop/DS Python/Geonames/data/corpus.csv",
                           delimiter=';')
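else:
    # fail early instead of hitting a NameError on corpus_rus below
    raise FileNotFoundError('corpus.csv was not found in any of the expected locations')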

corpus_rus.head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,population,country,country_code,admin1_code,concat_code,region,ascii_region
0,616435,Masis,Masis,"Hrazdan,Masis,Narimanlu,Razdan,Takhanshalu,Tok...",18911,Armenia,AM,2,AM.02,Ararat,Ararat
1,174991,Ararat,Ararat,"Ararat,Araratas,Ararato,Davalinskiy Tsemzavod,...",28832,Armenia,AM,2,AM.02,Ararat,Ararat
2,174979,Artashat,Artashat,"Artachat,Artasat,Artasatas,Artasato,Artaschat,...",20562,Armenia,AM,2,AM.02,Ararat,Ararat
3,174972,Hats’avan,Hats'avan,"Acavan,Atsavan,Hats'avan,Hats’avan,Sisian,Ацав...",15208,Armenia,AM,8,AM.08,Syunik,Syunik
4,174895,Goris,Goris,"Geryusy,Goris,Горис,Գորիս",20379,Armenia,AM,8,AM.08,Syunik,Syunik
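
The chain of `os.path.exists` checks used for loading can also be written as a small helper. This is a sketch only: `first_existing` is a hypothetical function, and the temporary file stands in for the real *corpus.csv*.

```python
import tempfile
from pathlib import Path

def first_existing(candidates):
    """Return the first path in `candidates` that exists, else raise FileNotFoundError."""
    candidates = [Path(c) for c in candidates]
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(f"None of the candidate paths exist: {candidates}")

# Demonstration with a temporary file instead of the real data locations.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "corpus.csv"
    real.write_text("geonameid;asciiname\n174895;Goris\n", encoding="utf-8")
    found = first_existing([Path(tmp) / "missing.csv", real])
    found_name = found.name

print(found_name)
```

In the notebook, the returned path could be passed straight to `pd.read_csv(path, delimiter=';')`, so the read call is written once rather than once per location.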


Let's inspect the loaded table.

In [38]:
corpus_rus.head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,population,country,country_code,admin1_code,concat_code,region,ascii_region
0,174875,Kapan,Kapan,"Ghap'an,Ghapan,Ghap’an,Kafan,Kafin,Kapan,Kapan...",33160,Armenia,AM,8,AM.08,Syunik,Syunik
1,174895,Goris,Goris,"Geryusy,Goris,Горис,Գորիս",20379,Armenia,AM,8,AM.08,Syunik,Syunik
2,174972,Hats’avan,Hats'avan,"Acavan,Atsavan,Hats'avan,Hats’avan,Sisian,Ацав...",15208,Armenia,AM,8,AM.08,Syunik,Syunik
3,174979,Artashat,Artashat,"Artachat,Artasat,Artasatas,Artasato,Artaschat,...",20562,Armenia,AM,2,AM.02,Ararat,Ararat
4,174991,Ararat,Ararat,"Ararat,Araratas,Ararato,Davalinskiy Tsemzavod,...",28832,Armenia,AM,2,AM.02,Ararat,Ararat


Process the values in the *corpus* table.

In [39]:
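# Note: pd.read_csv represents missing values as NaN, not None, so this line leaves them unchanged;
# rows with a missing 'alternatenames' value are removed by dropna() in the next cell.
corpus_rus['alternatenames'] = [name if name is not None else '' for name in corpus_rus['alternatenames']]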

In [40]:
# Drop rows with missing values; .copy() avoids SettingWithCopyWarning on the assignment below
corpus = corpus_rus.dropna().copy()
# Split the values in the 'alternatenames' column into lists
corpus['alternatenames'] = corpus['alternatenames'].str.split(',')
# Reshape so that each row pairs one asciiname with one alternate name
corpus = corpus.explode('alternatenames')
# Drop rows where the alternate name is identical to asciiname
corpus = corpus[corpus.asciiname != corpus.alternatenames]
# Drop duplicate (asciiname, alternatenames) pairs
corpus = corpus.drop_duplicates(subset=['asciiname', 'alternatenames'])


In [41]:
corpus.head()

Unnamed: 0,geonameid,name,asciiname,alternatenames,population,country,country_code,admin1_code,concat_code,region,ascii_region
0,174875,Kapan,Kapan,Ghap'an,33160,Armenia,AM,8,AM.08,Syunik,Syunik
0,174875,Kapan,Kapan,Ghapan,33160,Armenia,AM,8,AM.08,Syunik,Syunik
0,174875,Kapan,Kapan,Ghap’an,33160,Armenia,AM,8,AM.08,Syunik,Syunik
0,174875,Kapan,Kapan,Kafan,33160,Armenia,AM,8,AM.08,Syunik,Syunik
0,174875,Kapan,Kapan,Kafin,33160,Armenia,AM,8,AM.08,Syunik,Syunik
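
The effect of the split, explode, and de-duplication steps is easiest to see on a toy table. The two rows below are shortened from the corpus (a repeated "Ararat" is added on purpose to show duplicate removal); this is an illustration, not the notebook's data.

```python
import pandas as pd

toy = pd.DataFrame({
    "asciiname": ["Goris", "Ararat"],
    "alternatenames": ["Geryusy,Goris,Горис", "Ararat,Araratas,Ararat"],
})

pairs = toy.copy()
# "a,b,c" -> ["a", "b", "c"]
pairs["alternatenames"] = pairs["alternatenames"].str.split(",")
# one row per (asciiname, alternate name) pair
pairs = pairs.explode("alternatenames")
# drop pairs where the alternate name repeats asciiname
pairs = pairs[pairs["asciiname"] != pairs["alternatenames"]]
# drop repeated pairs
pairs = pairs.drop_duplicates(subset=["asciiname", "alternatenames"])

print(pairs.to_dict("records"))
```

Note that `explode` keeps the original row index, which is why the processed corpus shows the index 0 on several consecutive rows.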


Load the test dataset from the *geo_test.csv* file.

In [4]:
if os.path.exists('geo_test.csv'):
    corpus_test = pd.read_csv("geo_test.csv", delimiter=';')
elif os.path.exists('/kaggle/input/mbart-aug/geo_test.csv'):
    corpus_test = pd.read_csv("/kaggle/input/mbart-aug/geo_test.csv", delimiter=';')
elif os.path.exists('C:/Users/User/Desktop/DS Python/Geonames/data/geo_test.csv'):
    corpus_test = pd.read_csv("C:/Users/User/Desktop/DS Python/Geonames/data/geo_test.csv",
                           delimiter=';')
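else:
    # fail early instead of hitting a NameError on corpus_test below
    raise FileNotFoundError('geo_test.csv was not found in any of the expected locations')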

corpus_test.head()

Unnamed: 0,query,name,region,country
0,Смоленск,Smolensk,Smolensk Oblast,Russia
1,Кемерово,Kemerovo,Kuzbass,Russia
2,Бишкек,Bishkek,Bishkek,Kyrgyzstan
3,Москва,Moscow,Moscow,Russia
4,Алматы,Almaty,Almaty,Kazakhstan


All results are stored in a dictionary.

In [43]:
total = {
    'accuracy':[]
}

### Approach: a pre-trained model from SentenceTransformers

Select a pre-trained model from the SentenceTransformers library and load it.

In [44]:
# Load the pre-trained LaBSE-based SentenceTransformer model
model = SentenceTransformer('artefucktor/LaBSE_geonames_RU_RELOCATION')

Create embeddings for all city names.

In [45]:
names_list = corpus.asciiname.drop_duplicates().values
names_list

array(['Kapan', 'Goris', "Hats'avan", ..., 'Beylikduezue', 'Cankaya',
       'Muratpasa'], dtype=object)

In [46]:
embeddings = model.encode(names_list)

Batches:   0%|          | 0/52 [00:00<?, ?it/s]

In [47]:
sem_mod = SemSearch()

#### Function for selecting the best-matching GeoNames names using cosine similarity

**I propose evaluating the matching accuracy with the get_score_sim function.**

In [48]:
def get_score_sim(city, name):
    output_str = sem_mod.get_sim(city, names_list,
                                 embeddings, model,
                                 corpus_rus, top=1)
    # if the nearest city found matches the expected one,
    # score 1 point, otherwise 0.
    if output_str[0]['asciiname'] == name:
        score = 1
    else:
        score = 0
    return score, output_str[0]['asciiname']
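
The internals of `SemSearch.get_sim` are not shown in this notebook. As a minimal sketch of what a cosine-similarity lookup over precomputed embeddings typically involves, assuming nothing about the actual implementation (the function `top_k_cosine` and the 2-dimensional toy vectors are hypothetical):

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=1):
    """Return (indices, similarities) of the k rows of `matrix` closest to `query_vec`."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # After L2 normalization the dot product equals cosine similarity.
    # Zero vectors are not handled in this sketch.
    matrix_n = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_n = query_vec / np.linalg.norm(query_vec)
    sims = matrix_n @ query_n
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return order, sims[order]

names = ["Goris", "Kapan", "Ararat"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-ins for model embeddings
idx, sims = top_k_cosine([0.9, 0.1], vectors, k=2)
matches = [{"name": names[i], "cosine_similarity": float(s)} for i, s in zip(idx, sims)]
print(matches)
```

With real embeddings, `vectors` would be the matrix returned by `model.encode(names_list)` and the query vector the encoding of the user's input.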

In [49]:
scores= []
cities= []
for i in range(len(corpus_test)):
    query = corpus_test['query'][i]
    name = corpus_test['name'][i]
    score, city = get_score_sim(query, name)
    scores.append(score)
    cities.append(city)


In [50]:
accuracy_sim = sum(scores)/len(scores)
print('Model accuracy', accuracy_sim)

Model accuracy 0.8477011494252874
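
The evaluation loop encodes one query per call. A batched variant can compute all top-1 matches with a single matrix product; the sketch below uses toy vectors in place of real embeddings, and `batched_top1` and `accuracy` are hypothetical helpers rather than part of `semantic_mod`.

```python
import numpy as np

def batched_top1(query_vecs, corpus_vecs):
    """Index of the most cosine-similar corpus row for every query row."""
    q = np.asarray(query_vecs, dtype=float)
    c = np.asarray(corpus_vecs, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    # (n_queries, n_corpus) matrix of cosine similarities
    return np.argmax(q @ c.T, axis=1)

def accuracy(predicted, expected):
    """Share of positions where the prediction equals the expected value."""
    return float(np.mean([p == e for p, e in zip(predicted, expected)]))

corpus_names = ["Smolensk", "Kemerovo", "Bishkek"]
corpus_vecs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                # toy vectors
query_vecs = [[0.9, 0.1, 0], [0.2, 0.8, 0.1], [0.6, 0, 0.5]]  # third query is a deliberate miss
expected = ["Smolensk", "Kemerovo", "Bishkek"]

predicted = [corpus_names[i] for i in batched_top1(query_vecs, corpus_vecs)]
print(predicted, accuracy(predicted, expected))
```

In the notebook, the query matrix could presumably be obtained with one call such as `model.encode(corpus_test['query'].tolist())`, since `SentenceTransformer.encode` accepts a list of strings.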


In [51]:
corpus_test['scores'] = scores
corpus_test['cities'] = cities

In [52]:
corpus_test.head()

Unnamed: 0,query,name,region,country,scores,cities
0,Смоленск,Smolensk,Smolensk Oblast,Russia,1,Smolensk
1,Кемерово,Kemerovo,Kuzbass,Russia,1,Kemerovo
2,Бишкек,Bishkek,Bishkek,Kyrgyzstan,1,Bishkek
3,Москва,Moscow,Moscow,Russia,1,Moscow
4,Алматы,Almaty,Almaty,Kazakhstan,1,Almaty


In [53]:
corpus_test.to_csv(r"corpus_test_sim.csv", index=False, sep=";")

Add the result to the dictionary.

In [54]:
total['accuracy'].append(round(accuracy_sim, 3))
total

{'accuracy': [0.848]}

### Approach: a pre-trained model from transformers

Select our own pre-trained model from the Transformers library and load it.

In [55]:
# Load the pre-trained mBART model and tokenizer
output_model_path = "EldarKerimkhan/mbart-large-50-many-to-many-mmt.geonames_RU_RELOCATION"
mbart_model = MBartForConditionalGeneration.from_pretrained(output_model_path)
tokenizer = MBart50TokenizerFast.from_pretrained(output_model_path)


Use the GPU for computation if one is available.

In [56]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)

In [57]:
def get_score_mbart(city, name, model, tokenizer, device):
    model = model.to(device)
    city_tokens = tokenizer(city, return_tensors="pt", padding=True, truncation=True).to(device)
    # generate with the model passed in as an argument rather than the global mbart_model
    outputs = model.generate(city_tokens.input_ids)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True,  clean_up_tokenization_spaces=True)

    # score 1 if the generated name matches the expected one exactly, otherwise 0
    if output_str[0] == name:
        score = 1
    else:
        score = 0
    return score, output_str[0]

In [58]:
scores= []
cities= []
for i in range(len(corpus_test)):
    query = corpus_test['query'][i]
    name = corpus_test['name'][i]
    score, city = get_score_mbart(query, name, mbart_model, tokenizer, device)
    scores.append(score)
    cities.append(city)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [59]:
accuracy_mbart = sum(scores)/len(scores)
print('Model accuracy', accuracy_mbart)

Model accuracy 0.8448275862068966


In [60]:
total['accuracy'].append(round(accuracy_mbart, 3))
total

{'accuracy': [0.848, 0.845]}

In [61]:
corpus_test['scores'] = scores
corpus_test['cities'] = cities

In [62]:
corpus_test.head()

Unnamed: 0,query,name,region,country,scores,cities
0,Смоленск,Smolensk,Smolensk Oblast,Russia,1,Smolensk
1,Кемерово,Kemerovo,Kuzbass,Russia,1,Kemerovo
2,Бишкек,Bishkek,Bishkek,Kyrgyzstan,1,Bishkek
3,Москва,Moscow,Moscow,Russia,1,Moscow
4,Алматы,Almaty,Almaty,Kazakhstan,1,Almaty


In [63]:
corpus_test.to_csv(r"corpus_test_mbart.csv", index=False, sep=";")

### Approach: word vectorization with CountVectorizer()

Create an instance of the CountVectorizer() class.

In [5]:
# analyze character n-grams of length 2 to 4
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2,4))

Create an instance of the custom CountVec() class.

In [6]:
vec_mod = CountVec()

In [7]:
def get_score_vec(city, name):
    # the query city is passed to vec_mod.get_cos();
    # only the single closest city name is requested (num=1)
    # so that exact top-1 accuracy can be measured
    output_str = vec_mod.get_cos(city,
                       corpus_rus[['geonameid', 'asciiname',
                                   'alternatenames', 'country',
                                   'region']],
                       vectorizer,
                       num=1)
    # if the nearest city found matches the expected one,
    # score 1 point, otherwise 0.
    if output_str[0]['asciiname'] == name:
        score = 1
    else:
        score = 0
    return score, output_str[0]['asciiname']
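
The custom `CountVec` class is not shown here. The idea behind character n-gram matching can be sketched without any dependencies: count the 2- to 4-character substrings of each name and compare the count vectors by cosine similarity. The helpers below are hypothetical and only approximate what `CountVectorizer(analyzer='char', ngram_range=(2,4))` combined with cosine similarity computes.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of length n_min..n_max in the lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(query, candidates):
    """Return (similarity, candidate) for the candidate closest to the query."""
    q = char_ngrams(query)
    return max((cosine(q, char_ngrams(c)), c) for c in candidates)

candidates = ["Смоленск", "Кемерово", "Бишкек"]
score, match = best_match("Смаленск", candidates)  # query with a typo
print(match, round(score, 3))
```

Because shared substrings still overlap, a misspelled query keeps a positive similarity to the intended name, which is the main appeal of this approach for typo tolerance.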

Iterate over all queries and record the scores and predictions in lists.

In [8]:
scores= [] #  list of scores
cities= [] #  list of predicted cities
for i in range(len(corpus_test)):
    #  the query city
    query = corpus_test['query'][i]
    #  the expected city
    name = corpus_test['name'][i]
    score, city = get_score_vec(query, name)
    scores.append(score)
    cities.append(city)

Compute the share of correct answers, i.e. the prediction accuracy.

In [9]:
accuracy_cv = sum(scores)/len(scores)
print('Model accuracy', accuracy_cv)

Model accuracy 0.7988505747126436


Write these values to the table so that the cities that were hard to match can be analyzed.

In [10]:
corpus_test['scores'] = scores
corpus_test['cities'] = cities

In [11]:
corpus_test.head()

Unnamed: 0,query,name,region,country,scores,cities
0,Смоленск,Smolensk,Smolensk Oblast,Russia,1,Smolensk
1,Кемерово,Kemerovo,Kuzbass,Russia,1,Kemerovo
2,Бишкек,Bishkek,Bishkek,Kyrgyzstan,1,Bishkek
3,Москва,Moscow,Moscow,Russia,1,Moscow
4,Алматы,Almaty,Almaty,Kazakhstan,1,Almaty
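
Once `scores` and `cities` are stored in the table, the difficult cases can be listed by filtering on the score. The frame below is a toy stand-in for `corpus_test`; the wrong prediction in the last row is invented for illustration and is not a result from this notebook.

```python
import pandas as pd

results = pd.DataFrame({
    "query": ["Смоленск", "Москва", "Алматы"],
    "name": ["Smolensk", "Moscow", "Almaty"],
    "cities": ["Smolensk", "Moscow", "Almalyk"],  # last prediction is a made-up miss
})

# 1 where the prediction equals the expected name, 0 otherwise
results["scores"] = (results["name"] == results["cities"]).astype(int)

# rows the method got wrong
misses = results.loc[results["scores"] == 0, ["query", "name", "cities"]]
print(misses)
print("accuracy:", results["scores"].mean())
```

Applied to the real table, the same filter (`corpus_test[corpus_test['scores'] == 0]`) would show which queries each method fails on.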


Add the result to the dictionary.

In [None]:
total['accuracy'].append(round(accuracy_cv, 3))
total

Save the table as the *corpus_test_CountVec.csv* file.

In [None]:
corpus_test.to_csv(r"/kaggle/working/corpus_test_CountVec.csv", index=False, sep=";")

In [12]:
total = {'accuracy':[0.848, 0.845, accuracy_cv]}

In [13]:
pd.DataFrame(total, index=['LaBSE', 'MBart', 'CountVectorizer'])

Unnamed: 0,accuracy
LaBSE,0.848
MBart,0.845
CountVectorizer,0.798851


**Conclusions**

Of the methods compared for finding the best-matching GeoNames names, the pre-trained models performed best, with an accuracy of about `0.85` (LaBSE 0.848, MBart 0.845) versus about 0.80 for CountVectorizer.

The quality of these models could be improved by choosing the optimal number of training epochs based on the loss function, and by more careful and thorough preprocessing of the names in the alternate-names column.