# Overview
This notebook extracts the required information about postamats (parcel lockers) that are already in operation. The number of existing postamats per H3 hexagon becomes the target variable for the machine learning step.

In [1]:
import os
import pandas as pd
import numpy as np
import h3
import json

In [2]:
RESOLUTION = 8

In [3]:
# Start with the pick-up points of the postal partners

pos_loc = os.path.join("postamats", "pick_points.xls")
pps = pd.read_excel(pos_loc)

In [4]:
pps.head()
ms_pps = pps[pps["city"].str.lower() == 'москва']
ms_pps = ms_pps.loc[:, ['lat', 'lng']]
ms_pps.head()


         lat        lng
4  55.609675  37.720106
5  55.729833  37.731021
6  55.815319  37.737845
7  55.658959  37.742099
8  55.806039  37.395572


In [5]:
def get_h3(row):
    row['h3'] = h3.geo_to_h3(lat=row['lat'], lng=row['lng'], resolution=RESOLUTION)
    return row
ms_pps = ms_pps.apply(get_h3, axis=1) # every postamat is now assigned to an H3 hexagon

print(ms_pps.shape)

(1663, 3)
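The row-wise `apply` above builds a new row object for every point. A minimal sketch of a column-wise alternative follows. The helper `add_h3_column` and the stand-in `fake_cell` are hypothetical names introduced here for illustration and are not part of the notebook; the indexing function is passed in as an argument so the sketch does not depend on a particular h3 version.

```python
import pandas as pd


def add_h3_column(df, to_cell, resolution):
    """Return a copy of df with an 'h3' column computed from 'lat' and 'lng'.

    `to_cell` is any callable taking (lat, lng, resolution) and returning a cell id.
    """
    out = df.copy()
    out["h3"] = [
        to_cell(lat, lng, resolution)
        for lat, lng in zip(out["lat"], out["lng"])
    ]
    return out


def fake_cell(lat, lng, resolution):
    # Hypothetical stand-in for the real H3 indexing call, used only for the demo
    return f"r{resolution}:{round(lat, 1)}:{round(lng, 1)}"


demo = pd.DataFrame({"lat": [55.609675, 55.729833], "lng": [37.720106, 37.731021]})
demo_h3 = add_h3_column(demo, fake_cell, 8)
print(demo_h3)
```

In the notebook this would be called as `add_h3_column(ms_pps, h3.geo_to_h3, RESOLUTION)`. With h3-py 4.x the same function is named `h3.latlng_to_cell`.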


In [6]:
pos_loc = os.path.join("postamats", "all_postamats.json")

with open(pos_loc) as f:
    data = json.load(f)

# 'coordinates' is stored as a "lng,lat" string, so latitude is the second field
lats = [float(t['coordinates'].split(",")[1]) for t in data]
lngs = [float(t['coordinates'].split(",")[0]) for t in data]

print(data[-5:])
print(lats[-5:])
print(lngs[-5:])

[{'name': None, 'href': 'https://yandex.ru/maps/org/sberlogistika/91452306293/', 'coordinates': '37.492430,55.654945'}, {'name': None, 'href': 'https://yandex.ru/maps/org/5post/88203260828/', 'coordinates': '37.647994,55.586232'}, {'name': None, 'href': 'https://yandex.ru/maps/org/sberlogistika/60639548802/', 'coordinates': '37.491046,55.667506'}, {'name': None, 'href': 'https://yandex.ru/maps/org/postamat_yandeks_marketa/108030982491/', 'coordinates': '37.471625,55.633981'}, {'name': None, 'href': 'https://yandex.ru/maps/org/5post/227726346799/', 'coordinates': '37.649526,55.578867'}]
[55.654945, 55.586232, 55.667506, 55.633981, 55.578867]
[37.49243, 37.647994, 37.491046, 37.471625, 37.649526]
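Because the JSON stores each point as a single `"lng,lat"` string, a malformed entry would only surface later as a bad hexagon. The sketch below parses each string once and rejects values outside the valid latitude and longitude ranges. The helper name `parse_lng_lat` is hypothetical, and the two sample records are copied from the output above.

```python
def parse_lng_lat(coordinates):
    """Split a 'lng,lat' string into a (lat, lng) tuple of floats."""
    lng_text, lat_text = coordinates.split(",")
    lat, lng = float(lat_text), float(lng_text)
    # Reject values that cannot be a latitude/longitude pair
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        raise ValueError(f"coordinates out of range: {coordinates!r}")
    return lat, lng


# Two records copied from the notebook output
sample = [
    {"coordinates": "37.492430,55.654945"},
    {"coordinates": "37.647994,55.586232"},
]
pairs = [parse_lng_lat(t["coordinates"]) for t in sample]
sample_lats = [lat for lat, _ in pairs]
sample_lngs = [lng for _, lng in pairs]
print(sample_lats)
print(sample_lngs)
```

Note that the range check cannot detect swapped fields for Moscow, since both values fall within the valid latitude range.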


In [7]:

data_dict = {"lat": lats, "lng": lngs}
postmats = pd.DataFrame(data=data_dict)
print(postmats.shape)

(2331, 2)


In [8]:
print(postmats.head())
postmats = postmats.apply(get_h3, axis=1)


         lat        lng
0  55.789923  37.681442
1  55.781475  37.564770
2  55.707464  37.582725
3  55.789652  37.665697
4  55.800559  37.720051


In [10]:
postmats = pd.concat([postmats, ms_pps], ignore_index=True)
postmats.shape
postmats.to_excel("postmats.xlsx")

In [44]:
post_count = pd.pivot_table(postmats, index='h3', aggfunc=['count'])

In [45]:
post_count = post_count.loc[:, [('count', 'lat')]]

post_count.columns = ['y']

In [46]:
print(post_count.head())

                 y
h3                
8811818497fffff  1
8811818c6dfffff  1
8811818e25fffff  1
8811818e99fffff  1
8811819407fffff  1
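The pivot table plus column selection above amounts to counting rows per hexagon. A minimal sketch of the same idea with `groupby` follows, on a toy frame with placeholder cell ids rather than the real data. One difference to be aware of: `size()` counts all rows in a group, while the pivot's `count` counts non-null `lat` values.

```python
import pandas as pd

# Toy stand-in for `postmats`: three points, two of them in the same cell
toy = pd.DataFrame({
    "lat": [55.78, 55.79, 55.70],
    "lng": [37.68, 37.66, 37.58],
    "h3": ["cell_a", "cell_a", "cell_b"],  # placeholder ids, not real H3 indexes
})

# One row per cell, with the number of points in a column named 'y'
toy_count = toy.groupby("h3").size().to_frame("y")
print(toy_count)
```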


In [47]:
d_train = pd.read_excel(os.path.join(f"data_{str(RESOLUTION)}", "training_data.xlsx"))

In [None]:
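def merge_zone_feature(df, df_feat):
    # Index both frames by 'h3' unless it is already the index
    if 'h3' in df.columns:
        df = df.set_index('h3')
    if 'h3' in df_feat.columns:
        df_feat = df_feat.set_index('h3')
    # Left join keeps every training row; zones without a match get NaN
    df = pd.merge(df, df_feat, how='left', right_index=True, left_index=True)
    return df, df_feat
d_train, _ = merge_zone_feature(d_train, post_count)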
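With a left join, hexagons that contain no postamat end up with a missing `y`. The notebook does not show how these are handled downstream, so the following is only a sketch under the assumption that a missing count should be read as zero. The frames `features` and `counts` are toy stand-ins with placeholder cell ids.

```python
import pandas as pd

# Toy feature table: three cells indexed by 'h3'
features = pd.DataFrame(
    {"population": [1200, 300, 50]},
    index=pd.Index(["cell_a", "cell_b", "cell_c"], name="h3"),
)
# Toy target table: only two of the cells contain postamats
counts = pd.DataFrame(
    {"y": [2, 1]},
    index=pd.Index(["cell_a", "cell_b"], name="h3"),
)

merged = features.join(counts, how="left")
n_empty = int(merged["y"].isna().sum())  # cells without any postamat

# Assumption: no postamat recorded means a count of zero
merged["y"] = merged["y"].fillna(0).astype(int)
print(n_empty)
print(merged)
```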

In [50]:
# The training data is now complete: features and target variable

d_train.to_excel(os.path.join("training_data_y.xlsx"))

In [51]:
print(d_train.columns)

Index(['lat', 'lng', 'bus_freq_count', 'education_count', 'education_area',
       'parking_count', 'parking_area', 'bus_count', 'financial_count',
       'financial_area', 'accomodation_count', 'accomodation_area',
       'commercial_count', 'commercial_area', 'health_care_count',
       'health_care_area', 'entertainment_count', 'entertainment_area',
       'sustenance_count', 'sustenance_area', 'government_count',
       'government_area', 'sports_count', 'sports_area',
       'count_highway_primary', 'length_highway_primary',
       'count_highway_secondary', 'length_highway_secondary',
       'count_highway_tertiary', 'length_highway_tertiary',
       'count_highway_residential', 'length_highway_residential',
       'count_highway_pedestrian', 'length_highway_pedestrian', 'population',
       'TotalPassengers', 'y'],
      dtype='object')
