## Описание проекта

### Цель:

- Сопоставление произвольных гео названий с унифицированными именами geonames для внутреннего использования Карьерным центром

### Задачи:

- Создать решение для подбора наиболее подходящих названий с geonames. Например Ереван -> Yerevan

- На примере РФ и стран наиболее популярных для релокации - Беларусь, Армения, Казахстан, Кыргызстан, Турция, Сербия. Города с населением от 15000 человек (с возможностью масштабирования на сервере заказчика)

- Возвращаемые поля geonameid, name, region, country, cosine similarity

- формат данных на выходе: список словарей, например [{dict_1}, {dict_2}, …. {dict_n}] где словарь - одна запись с указанными полями


### Описание данных

*Используемые таблицы с geonames:* 


- admin1CodesASCII
- alternateNamesV2
- cities15000
- countryInfo

*Дополнительно:*
- при необходимости любые другие открытые данные
- таблицы geonames можно скачать здесь http://download.geonames.org/export/dump/


Установка дополнительных инструментов и библиотек.

In [None]:
pip install SQLAlchemy

In [None]:
pip install psycopg2

In [27]:
pip install -U sentence-transformers




In [42]:
pip install translate

Collecting translate
  Using cached translate-3.6.1-py2.py3-none-any.whl (12 kB)
Collecting libretranslatepy==2.1.1
  Using cached libretranslatepy-2.1.1-py3-none-any.whl (3.2 kB)
Installing collected packages: libretranslatepy, translate
Successfully installed libretranslatepy-2.1.1 translate-3.6.1
Note: you may need to restart the kernel to use updated packages.


In [60]:
pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-14.0.1-cp39-cp39-win_amd64.whl (24.6 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-14.0.1
Note: you may need to restart the kernel to use updated packages.


Устанавливаем подключение к базе из тетрадки:

In [1]:
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

DATABASE = {
    'drivername': 'postgresql',
    'username': 'postgres', 
    'password': '15112005', 
    'host': 'localhost',
    'port': 5433,
    'database': 'postgres',
    'query': {}
}  

engine = create_engine(URL(**DATABASE))


  engine = create_engine(URL(**DATABASE))


С помощью библиотеки pandas начинаем работать с небходимыми данными.

In [2]:
import pandas as pd
import os
import logging
import torch

from translate import Translator
from sentence_transformers import SentenceTransformer, util, InputExample, losses, evaluation
from torch.utils.data import DataLoader

In [3]:
df1 = pd.read_csv('C:\\Users\\Family\\Downloads\\admin1CodesASCII.txt', delimiter= '\t', low_memory=False, header=None,
                 names = ['code', 'name_reg', 'name_ascii', 'geonameid'],
                 usecols = ['code','geonameid','name_reg'])

In [4]:
df1['country_code'] = df1.code.str.split('.').str[0]

In [5]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3880 entries, 0 to 3879
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   code          3880 non-null   object
 1   name_reg      3880 non-null   object
 2   geonameid     3880 non-null   int64 
 3   country_code  3880 non-null   object
dtypes: int64(1), object(3)
memory usage: 121.4+ KB


In [6]:
df1.head()

Unnamed: 0,code,name_reg,geonameid,country_code
0,AD.06,Sant Julià de Loria,3039162,AD
1,AD.05,Ordino,3039676,AD
2,AD.04,La Massana,3040131,AD
3,AD.03,Encamp,3040684,AD
4,AD.02,Canillo,3041203,AD


In [13]:
df1.to_sql('admin1codes', con=engine)

880

In [5]:
query = 'SELECT * FROM admin1codes LIMIT 10'
pd.read_sql_query(query, con=engine)

Unnamed: 0,index,code,name,name ascii,geonameid
0,0,AD.06,Sant Julià de Loria,Sant Julia de Loria,3039162
1,1,AD.05,Ordino,Ordino,3039676
2,2,AD.04,La Massana,La Massana,3040131
3,3,AD.03,Encamp,Encamp,3040684
4,4,AD.02,Canillo,Canillo,3041203
5,5,AD.07,Andorra la Vella,Andorra la Vella,3041566
6,6,AD.08,Escaldes-Engordany,Escaldes-Engordany,3338529
7,7,AE.07,Imārat Umm al Qaywayn,Imarat Umm al Qaywayn,290595
8,8,AE.05,Raʼs al Khaymah,Imarat Ra's al Khaymah,291075
9,9,AE.03,Dubai,Dubai,292224


In [7]:
df2 = pd.read_csv('C:\\Users\\Family\\Downloads\\countryInfo.txt', delimiter= '\t', low_memory=False, header=None, encoding='utf-8',
                 names = ['country_code','iso_3','iso_numeric','fips','country','capital','area','population','continent','tld','currency_code','currency_name','phone','postal_code_format','postal_code_regex','languages','geonameid','neighbours','equivalent_fips_code'],
                 usecols = ['country_code','geonameid','country']).dropna()

In [8]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 251 entries, 0 to 251
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   country_code  251 non-null    object
 1   country       251 non-null    object
 2   geonameid     251 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 7.8+ KB


In [10]:
df2.to_sql('country_info', con=engine)

In [8]:
query = 'SELECT * FROM country_info LIMIT 10'
pd.read_sql_query(query, con=engine)

Unnamed: 0,index,country_code,country,geonameid
0,0,AD,Andorra,3041565
1,1,AE,United Arab Emirates,290557
2,2,AF,Afghanistan,1149361
3,3,AG,Antigua and Barbuda,3576396
4,4,AI,Anguilla,3573511
5,5,AL,Albania,783754
6,6,AM,Armenia,174982
7,7,AO,Angola,3351879
8,8,AQ,Antarctica,6697173
9,9,AR,Argentina,3865483


In [9]:
df3 = pd.read_csv('C:\\Users\\Family\\Downloads\\cities15000.txt', delimiter= '\t', low_memory=False, header=None, encoding='utf8',
                 names = ['geonameid','name','name_ascii','alternate_names','latitude','longitude','feature_class','feature_code','country_code','cc2','admin1_code','admin2_code','admin3_code','admin4_code','population','elevation','dem','timezone','modification_date'],
                 usecols = ['geonameid','name','name_ascii','country_code','alternate_names','admin1_code']).dropna()

In [10]:
df3['code'] = df3.country_code + '.' + df3.admin1_code
df3 = df3.drop('admin1_code', axis=1)

In [11]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24618 entries, 0 to 26931
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   geonameid        24618 non-null  int64 
 1   name             24618 non-null  object
 2   name_ascii       24618 non-null  object
 3   alternate_names  24618 non-null  object
 4   country_code     24618 non-null  object
 5   code             24618 non-null  object
dtypes: int64(1), object(5)
memory usage: 1.3+ MB


In [None]:
df3.to_sql('cities_15000', con=engine)

In [11]:
query = 'SELECT * FROM cities_15000 LIMIT 10'
pd.read_sql_query(query, con=engine)

Unnamed: 0,index,geonameid,name,name_ascii,alternate_names,country_code
0,0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",AD
1,1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",AD
2,2,290594,Umm Al Quwain City,Umm Al Quwain City,"Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U...",AE
3,3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'...",AE
4,4,291580,Zayed City,Zayed City,"Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za...",AE
5,5,291696,Khawr Fakkān,Khawr Fakkan,"Fakkan,Fakkān,Khawr Fakkan,Khawr Fakkān,Khawr ...",AE
6,6,292223,Dubai,Dubai,"DXB,Dabei,Dibai,Dibay,Doubayi,Dubae,Dubai,Duba...",AE
7,7,292231,Dibba Al-Fujairah,Dibba Al-Fujairah,"Al-Fujairah,BYB,Dibba Al-Fujairah,dba alfjyrt,...",AE
8,8,292239,Dibba Al-Hisn,Dibba Al-Hisn,"BYB,Daba,Daba al-Hisn,Dabā,Dabā al-Ḥiṣn,Diba,D...",AE
9,9,292672,Sharjah,Sharjah,"Al Sharjah,Ash 'Mariqah,Ash Shariqa,Ash Shariq...",AE


In [12]:
df4 = pd.read_csv('C:\\Users\\Family\\Downloads\\alternateNamesV2.txt', delimiter= '\t', low_memory=False, header=None, encoding='utf8',
                 names = ['alternate_name_id','geonameid','iso_language','alternate_name','is_preferred_name','is_short_name','is_colloquial','is_historic','from','to'],
                 usecols = ['alternate_name_id','geonameid','alternate_name','iso_language'], dtype = {'alternate_name':'str'})

In [13]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16035088 entries, 0 to 16035087
Data columns (total 4 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   alternate_name_id  int64 
 1   geonameid          int64 
 2   iso_language       object
 3   alternate_name     object
dtypes: int64(2), object(2)
memory usage: 489.4+ MB


In [25]:
df4.to_sql('alternate_name', con=engine)

1000

In [14]:
df4.head()

Unnamed: 0,alternate_name_id,geonameid,iso_language,alternate_name
0,1284819,2994701,,Roc Mélé
1,1284820,2994701,,Roc Meler
2,4285256,3007683,,Pic des Langounelles
3,1291197,3017832,,Pic de les Abelletes
4,4290387,3017832,,Pic de la Font-Nègre


### Вывод:

- На первом этапе мы познакомились с данными и загрузили их на postgresql, а также проверили, что все сохраняется и открывается коректно;

- Определили, что в дальнейшем будем работать с name и alternate_name.

Объединяем таблицы

In [14]:
def asd(df):
    return (pd.concat([df, df.drop_duplicates(subset='geonameid').assign(alternate_name = df.name)])
           .drop_duplicates().reset_index(drop=True))

In [15]:
df_country_info =(df2.merge(df4, on='geonameid', how='left')
                 .rename(columns={'country': 'name'})
                 .assign(geo_type='country')
                 .assign(code=df2.country_code+'.00'))

df_country_info =asd(df_country_info)
df_country_info.head()

Unnamed: 0,country_code,name,geonameid,alternate_name_id,iso_language,alternate_name,geo_type,code
0,AD,Andorra,3041565,1298014,ca,Principat d’Andorra,country,AD.00
1,AD,Andorra,3041565,1298015,,Les Vallées d’Andorre,country,AE.00
2,AD,Andorra,3041565,1298016,,L’Andorre,country,AF.00
3,AD,Andorra,3041565,1298017,,Valls d’Andorra,country,AG.00
4,AD,Andorra,3041565,1298020,,Principauté d’Andorre,country,AI.00


In [16]:
df_region = (df1.merge(df2[['country_code','country']],on='country_code', how='left')
             .merge(df4, on='geonameid', how='left')
             .rename(columns={'name_reg': 'name'})
             .assign(geo_type='region')
             .drop_duplicates()
             .reset_index(drop=True)
            )
            
df_region =asd(df_region)
df_region.head()

Unnamed: 0,code,name,geonameid,country_code,country,alternate_name_id,iso_language,alternate_name,geo_type
0,AD.06,Sant Julià de Loria,3039162,AD,Andorra,1297839.0,ca,Sant Julià de Lòria,region
1,AD.06,Sant Julià de Loria,3039162,AD,Andorra,1297840.0,en,Sant Julià de Loria,region
2,AD.06,Sant Julià de Loria,3039162,AD,Andorra,1297841.0,ca,Parròquia de Sant Julià de Lòria,region
3,AD.06,Sant Julià de Loria,3039162,AD,Andorra,2170607.0,post,AD600,region
4,AD.06,Sant Julià de Loria,3039162,AD,Andorra,2185944.0,fr,Sant Julià de Lòria,region


In [17]:
df_cities = (df3.merge(df1[['code','name_reg']],on='code', how='left')
             .merge(df2[['country_code','country']], on='country_code', how='left')
             .merge(df4, on='geonameid', how='left')
             .assign(geo_type='city')
             .drop_duplicates()
             .reset_index(drop=True)
            )

df_cities =asd(df_cities)
df_cities.head()

Unnamed: 0,geonameid,name,name_ascii,alternate_names,country_code,code,name_reg,country,alternate_name_id,iso_language,alternate_name,geo_type
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",AD,AD.08,Escaldes-Engordany,Andorra,1297907,ca,Les Escaldes,city
1,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",AD,AD.08,Escaldes-Engordany,Andorra,1297908,ca,Escaldes,city
2,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",AD,AD.08,Escaldes-Engordany,Andorra,1904145,fr,Escaldes-Engordany,city
3,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",AD,AD.08,Escaldes-Engordany,Andorra,1904146,pl,Escaldes-Engordany,city
4,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",AD,AD.08,Escaldes-Engordany,Andorra,1904147,es,Escaldes-Engordany,city


In [18]:
df_full = pd.concat([df_country_info, df_region, df_cities], axis=0)
df_full['country'] = df_full['country'].fillna(value=df_full['name'])
df_full['name_reg'] = df_full['name_reg'].fillna(value=df_full['name'])
df_full['alternate_name'] = df_full['alternate_name'].fillna(value=df_full['name'])
df_full = df_full.drop_duplicates()

In [19]:
df_filtr = df_full[df_full.country_code.isin(['AM','RU','BY','KG','GE','RS','ME'])].reset_index(drop=True)
df_filtr

Unnamed: 0,country_code,name,geonameid,alternate_name_id,iso_language,alternate_name,geo_type,code,country,name_ascii,alternate_names,name_reg
0,AM,Armenia,174982,135836.0,hy,Hayastani Hanrapetut’yun,country,,Armenia,,,Armenia
1,AM,Armenia,174982,135839.0,,Armenian Soviet Socialist Republic,country,,Armenia,,,Armenia
2,AM,Armenia,174982,1560696.0,af,Armenië,country,,Armenia,,,Armenia
3,AM,Armenia,174982,1560697.0,am,አርሜኒያ,country,,Armenia,,,Armenia
4,AM,Armenia,174982,1560698.0,ar,ارمينيا,country,,Armenia,,,Armenia
...,...,...,...,...,...,...,...,...,...,...,...,...
29121,RU,Sampsonievskiy,8504965,8067493.0,ru,Sampsonievskiy,city,RU.42,Russia,Sampsonievskiy,"Sampsonievskij,Sampsonievskoe,Сампсониевский,С...",Leningradskaya Oblast'
29122,RU,Vostochnoe Degunino,8505053,8613237.0,ru,Vostochnoe Degunino,city,RU.48,Russia,Vostochnoe Degunino,"Vostochnoe Degunino,Восточное Дегунино",Moscow
29123,RU,Dzerzhinsky,8521440,7131300.0,link,Dzerzhinsky,city,RU.47,Russia,Dzerzhinsky,"Dzerzhinskij,Дзержинский",Moscow Oblast
29124,RU,Fedorovskiy,11886891,13721178.0,ru,Fedorovskiy,city,RU.32,Russia,Fedorovskiy,"Fedorovskij,Федоровский",Khanty-Mansia


In [20]:
df_filtr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29126 entries, 0 to 29125
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   country_code       29126 non-null  object 
 1   name               29126 non-null  object 
 2   geonameid          29126 non-null  int64  
 3   alternate_name_id  29126 non-null  float64
 4   iso_language       24278 non-null  object 
 5   alternate_name     29126 non-null  object 
 6   geo_type           29126 non-null  object 
 7   code               27620 non-null  object 
 8   country            29126 non-null  object 
 9   name_ascii         24571 non-null  object 
 10  alternate_names    24571 non-null  object 
 11  name_reg           29126 non-null  object 
dtypes: float64(1), int64(1), object(10)
memory usage: 2.7+ MB


### Вывод:

Объеденили необходимые данные в один датасет, который отфильтровали согласно задания закaзчика (7 стран).

In [24]:
if not os.path.exists('C:/Users/Family/Downloads'):
    os.makedirs('C:/Users/Family/Downloads')
    df_filtr.to_csv('C:/Users/Family/Downloads/df_filtr.csv', index=False)

Битекстовый анализ описывает процесс поиска переведенных пар слов на двух языках. Это наш вариант использования, выбираем модель обеспечивающую наилучшую производительность:

LaBSE - Модель LaBSE. 

Поддерживает 109 языков. Хорошо подходит для поиска пар переводов на нескольких языках. Как подробно описано здесь, LaBSE хуже подходит для оценки сходства пар предложений, которые не являются переводами друг друга.

In [23]:
model = SentenceTransformer('sentence-transformers/LaBSE')
translator = Translator(from_lang='russian', to_lang='English')

In [24]:
if os.path.isfile('C:/Users/Family/Downloads/embeddings.csv'):
    logging.info('Load embeddings')
    embeddings = pd.read_С('C:/Users/Family/Downloads/embeddings.csv')
else:
    logging.info('Encode embeddings...')
    embeddings = model.encode(df_filtr.alternate_name.str.lower().values, normalize_embeddings=True, show_progress_bar=True)
    pd.DataFrame(embeddings).to_csv('C:/Users/Family/Downloads/embeddings.csv', index=False)

Batches:   0%|          | 0/911 [00:00<?, ?it/s]

In [27]:
def similar(question, translate=False, num=5, search=100, names_only=False):
    question = question.lower()
    query = translator.translate(question) if translate else question
    query_embedding = model.encode(query, convert_to_tensor=True, show_progress_bar=False).reshape(1,-1)
    res = util.semantic_search(query_embedding, embeddings, top_k=search)
    idx = [i['corpus_id'] for i in res[0]]
    score = [i['score'] for i in res[0]]
    
    if names_only:
        return (df_filtr.loc[idx].drop_duplicates(subset=['name', 'code']).iloc[:num].name.tolist())
    else:
        return (df_filtr.loc[idx, ['name', 'code', 'name_reg', 'country']]
               .assign(similarity=score)
               .drop_duplicates(subset=['name', 'code'])
               .iloc[:num])

In [28]:
similar('русь')

Unnamed: 0,name,code,name_reg,country,similarity
1492,Russia,,Russia,Russia,0.89428
23969,Sayanogorsk,RU.31,Khakasiya Republic,Russia,0.877791
8746,Zheleznovodsk,RU.70,Stavropol Kray,Russia,0.866411
11938,Shcherbinka,RU.48,Moscow,Russia,0.864898
22194,Atkarsk,RU.67,Saratov Oblast,Russia,0.853817


### Вывод:

Модель хорошо справляется с поставленной задачей, но попробуем обучить модель на имеющихся данных и определять будем насиленый пункт по названию.

In [33]:
df3.head()

Unnamed: 0,geonameid,name,name_ascii,alternate_names,country_code,code
0,3040051,les Escaldes,les Escaldes,"Ehskal'des-Ehndzhordani,Escaldes,Escaldes-Engo...",AD,AD.08
1,3041563,Andorra la Vella,Andorra la Vella,"ALV,Ando-la-Vyey,Andora,Andora la Vela,Andora ...",AD,AD.07
2,290594,Umm Al Quwain City,Umm Al Quwain City,"Oumm al Qaiwain,Oumm al Qaïwaïn,Um al Kawain,U...",AE,AE.07
3,291074,Ras Al Khaimah City,Ras Al Khaimah City,"Julfa,Khaimah,RAK City,RKT,Ra's al Khaymah,Ra'...",AE,AE.05
4,291580,Zayed City,Zayed City,"Bid' Zayed,Bid’ Zayed,Madinat Za'id,Madinat Za...",AE,AE.01


In [34]:
df3 = df3[df3.country_code.isin(['AM','RU','BY','KG','GE','RS','ME'])]

In [35]:
df3.head()

Unnamed: 0,geonameid,name,name_ascii,alternate_names,country_code,code
94,174875,Kapan,Kapan,"Ghap'an,Ghapan,Ghap’an,Kafan,Kafin,Kapan,Kapan...",AM,AM.08
95,174895,Goris,Goris,"Geryusy,Goris,Горис,Գորիս",AM,AM.08
96,174972,Hats’avan,Hats'avan,"Acavan,Atsavan,Hats'avan,Hats’avan,Sisian,Ацав...",AM,AM.08
97,174979,Artashat,Artashat,"Artachat,Artasat,Artasatas,Artasato,Artaschat,...",AM,AM.02
98,174991,Ararat,Ararat,"Ararat,Araratas,Ararato,Davalinskiy Tsemzavod,...",AM,AM.02


In [36]:
df3.shape

(1267, 6)

In [37]:
df3.alternate_names = df3.alternate_names.str.split(',')
df3 = df3.explode('alternate_names')
df3 = df3.drop_duplicates(subset=['name','alternate_names'])

In [38]:
df3 = df3[df3.name!=df3.alternate_names]

In [39]:
df3.shape

(17854, 6)

In [40]:
train_examples = []
for row in df3[:10].itertuples():
    train_examples.append(InputExample(texts=[row.name, row.alternate_names]))

In [41]:
train_examples

[<sentence_transformers.readers.InputExample.InputExample at 0x256060b2340>,
 <sentence_transformers.readers.InputExample.InputExample at 0x256060b2490>,
 <sentence_transformers.readers.InputExample.InputExample at 0x256060b2af0>,
 <sentence_transformers.readers.InputExample.InputExample at 0x256060b2970>,
 <sentence_transformers.readers.InputExample.InputExample at 0x256060b21f0>,
 <sentence_transformers.readers.InputExample.InputExample at 0x256611f27f0>,
 <sentence_transformers.readers.InputExample.InputExample at 0x25669f4aa60>,
 <sentence_transformers.readers.InputExample.InputExample at 0x25669f4a730>,
 <sentence_transformers.readers.InputExample.InputExample at 0x25669f4a550>,
 <sentence_transformers.readers.InputExample.InputExample at 0x25669f4ab20>]

In [42]:
train_dataloder = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)

Обучение модели на 5 эпохах:

In [43]:
model.fit(train_objectives=[(train_dataloder, train_loss)], epochs=5)

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

In [44]:
names = df3.name.drop_duplicates().values
names[:10]

array(['Kapan', 'Goris', 'Hats’avan', 'Artashat', 'Ararat', 'Yerevan',
       'Vagharshapat', 'Stepanavan', 'Spitak', 'Sevan'], dtype=object)

In [45]:
embedding_model = model.encode(names)
embedding_model.shape

(1247, 768)

In [46]:
def get_sim(geoname, names=names, embedding=embedding_model, model=model, top_k=3):
    result =pd.DataFrame(util.semantic_search(model.encode(geoname), embedding_model, top_k=top_k)[0])
    return result.assign(name=names[result.corpus_id])

In [47]:
get_sim('иван')

Unnamed: 0,corpus_id,score,name
0,763,0.634947,Ivanovo
1,5,0.620656,Yerevan
2,1025,0.566276,Nyagan


## Вывод:

Полученная модель работает, но явного улычшения не видно. Модель LaBSE - отлично справляется и может быть предложана заказчику как окончательный продукт.