# Task

Develop an automatic criterion that determines, for each user, whether the “work-home” movement pattern applies. That is, the criterion should find users who spend a sufficiently long time during night hours near one point (“home”) and a sufficiently long time during daytime hours at a noticeably different point (“work”).

The output of the criterion on the provided sample must also be supplied.

If you considered several algorithms for this task and chose one of them, it would be good to justify your choice.

### Solution

We can assume that a user is in their “home” or “work” area if the total distance they cover during the corresponding period (night or day) is small.

![](img/image-02.png)

The following time periods were chosen:

    1) 02:00 to 04:30: users should be asleep at this time, both those who go to bed late and those who get up early.
    2) 11:00 to 16:30, excluding the lunch break from 12:30 to 14:30: these are working hours.
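
A minimal sketch of how these windows could be encoded, assuming timestamps in the format used by the sample (`2017-03-20 00:00:18 +0300`). The names `NIGHT`, `WORK_SLOTS`, and `period_of` are illustrative and do not appear in the notebook; the boundaries are treated as exclusive, as in the filters used later.

```python
from datetime import datetime, time
from typing import Optional

# Windows taken from the list above; boundaries are exclusive (an assumption).
NIGHT = (time(2, 0), time(4, 30))
WORK_SLOTS = [(time(11, 0), time(12, 30)), (time(14, 30), time(16, 30))]


def period_of(stamp: str) -> Optional[str]:
    """Return 'night', 'day', or None for a timestamp string from the sample."""
    # .time() keeps the local wall-clock time of the parsed, offset-aware value.
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z").time()
    if NIGHT[0] < local < NIGHT[1]:
        return "night"
    if any(start < local < end for start, end in WORK_SLOTS):
        return "day"
    return None
```

Because the function looks only at the time of day, the same windows would apply unchanged to a sample that spans several dates.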


Let's prepare the dataset for the calculations.

In [1]:
import pandas as pd

import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
data = pd.read_csv('event_sample.csv', sep=';', header=None)
data.columns = ['id', 'time', 'longitude', 'latitude']
data

Unnamed: 0,id,time,longitude,latitude
0,2,2017-03-20 00:00:18 +0300,38.937878,47.210758
1,3,2017-03-20 00:00:18 +0300,45.655278,43.259605
2,4,2017-03-20 00:00:18 +0300,47.113440,42.804970
3,5,2017-03-20 00:00:18 +0300,158.601667,53.068436
4,6,2017-03-20 00:00:18 +0300,104.259995,52.250453
...,...,...,...,...
1727035,685,2017-03-21 00:00:00 +0300,30.486057,59.942234
1727036,527,2017-03-21 00:00:00 +0300,37.872987,55.912129
1727037,805,2017-03-21 00:00:00 +0300,45.869799,43.289353
1727038,824,2017-03-21 00:00:00 +0300,44.790299,43.215624


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1727040 entries, 0 to 1727039
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   id         int64  
 1   time       object 
 2   longitude  float64
 3   latitude   float64
dtypes: float64(2), int64(1), object(1)
memory usage: 52.7+ MB


In [4]:
data.isnull().sum()

id           0
time         0
longitude    0
latitude     0
dtype: int64

We sort the data by the 'id' and 'time' columns.

In [5]:
data_group = data.sort_values(by=['id', 'time'], ascending=[True, True])
data_group

Unnamed: 0,id,time,longitude,latitude
0,2,2017-03-20 00:00:18 +0300,38.937878,47.210758
844,2,2017-03-20 00:00:18 +0300,38.938669,47.209144
1666,2,2017-03-20 00:00:40 +0300,38.939279,47.209029
10308,2,2017-03-20 00:03:01 +0300,38.937462,47.209331
10839,2,2017-03-20 00:03:09 +0300,38.939508,47.210721
...,...,...,...,...
1556491,974,2017-03-20 21:43:38 +0300,37.463898,56.017527
1557195,974,2017-03-20 21:44:20 +0300,37.477952,56.011354
1561036,974,2017-03-20 21:47:59 +0300,37.464269,56.017311
1565172,974,2017-03-20 21:51:47 +0300,37.454963,56.016284


To each row we append the values of the next record, so that the distance between consecutive points can be computed later.

In [6]:
data_sorted_shift = pd.concat([data_group, data_group.shift(-1)], axis=1)
data_sorted_shift

Unnamed: 0,id,time,longitude,latitude,id.1,time.1,longitude.1,latitude.1
0,2,2017-03-20 00:00:18 +0300,38.937878,47.210758,2.0,2017-03-20 00:00:18 +0300,38.938669,47.209144
844,2,2017-03-20 00:00:18 +0300,38.938669,47.209144,2.0,2017-03-20 00:00:40 +0300,38.939279,47.209029
1666,2,2017-03-20 00:00:40 +0300,38.939279,47.209029,2.0,2017-03-20 00:03:01 +0300,38.937462,47.209331
10308,2,2017-03-20 00:03:01 +0300,38.937462,47.209331,2.0,2017-03-20 00:03:09 +0300,38.939508,47.210721
10839,2,2017-03-20 00:03:09 +0300,38.939508,47.210721,2.0,2017-03-20 00:05:14 +0300,38.938469,47.209709
...,...,...,...,...,...,...,...,...
1556491,974,2017-03-20 21:43:38 +0300,37.463898,56.017527,974.0,2017-03-20 21:44:20 +0300,37.477952,56.011354
1557195,974,2017-03-20 21:44:20 +0300,37.477952,56.011354,974.0,2017-03-20 21:47:59 +0300,37.464269,56.017311
1561036,974,2017-03-20 21:47:59 +0300,37.464269,56.017311,974.0,2017-03-20 21:51:47 +0300,37.454963,56.016284
1565172,974,2017-03-20 21:51:47 +0300,37.454963,56.016284,974.0,2017-03-20 23:55:17 +0300,37.467068,56.014844


In [7]:
data_sorted_shift.dropna(axis=0, inplace=True)
data_sorted_shift

Unnamed: 0,id,time,longitude,latitude,id.1,time.1,longitude.1,latitude.1
0,2,2017-03-20 00:00:18 +0300,38.937878,47.210758,2.0,2017-03-20 00:00:18 +0300,38.938669,47.209144
844,2,2017-03-20 00:00:18 +0300,38.938669,47.209144,2.0,2017-03-20 00:00:40 +0300,38.939279,47.209029
1666,2,2017-03-20 00:00:40 +0300,38.939279,47.209029,2.0,2017-03-20 00:03:01 +0300,38.937462,47.209331
10308,2,2017-03-20 00:03:01 +0300,38.937462,47.209331,2.0,2017-03-20 00:03:09 +0300,38.939508,47.210721
10839,2,2017-03-20 00:03:09 +0300,38.939508,47.210721,2.0,2017-03-20 00:05:14 +0300,38.938469,47.209709
...,...,...,...,...,...,...,...,...
1539604,974,2017-03-20 21:27:11 +0300,37.464890,56.015909,974.0,2017-03-20 21:43:38 +0300,37.463898,56.017527
1556491,974,2017-03-20 21:43:38 +0300,37.463898,56.017527,974.0,2017-03-20 21:44:20 +0300,37.477952,56.011354
1557195,974,2017-03-20 21:44:20 +0300,37.477952,56.011354,974.0,2017-03-20 21:47:59 +0300,37.464269,56.017311
1561036,974,2017-03-20 21:47:59 +0300,37.464269,56.017311,974.0,2017-03-20 21:51:47 +0300,37.454963,56.016284


In [8]:
data_sorted_shift.columns = ['id', 'time', 'longitude', 'latitude', 'id_shift', 'time_shift', 'longitude_shift', 'latitude_shift']
data_sorted_shift

Unnamed: 0,id,time,longitude,latitude,id_shift,time_shift,longitude_shift,latitude_shift
0,2,2017-03-20 00:00:18 +0300,38.937878,47.210758,2.0,2017-03-20 00:00:18 +0300,38.938669,47.209144
844,2,2017-03-20 00:00:18 +0300,38.938669,47.209144,2.0,2017-03-20 00:00:40 +0300,38.939279,47.209029
1666,2,2017-03-20 00:00:40 +0300,38.939279,47.209029,2.0,2017-03-20 00:03:01 +0300,38.937462,47.209331
10308,2,2017-03-20 00:03:01 +0300,38.937462,47.209331,2.0,2017-03-20 00:03:09 +0300,38.939508,47.210721
10839,2,2017-03-20 00:03:09 +0300,38.939508,47.210721,2.0,2017-03-20 00:05:14 +0300,38.938469,47.209709
...,...,...,...,...,...,...,...,...
1539604,974,2017-03-20 21:27:11 +0300,37.464890,56.015909,974.0,2017-03-20 21:43:38 +0300,37.463898,56.017527
1556491,974,2017-03-20 21:43:38 +0300,37.463898,56.017527,974.0,2017-03-20 21:44:20 +0300,37.477952,56.011354
1557195,974,2017-03-20 21:44:20 +0300,37.477952,56.011354,974.0,2017-03-20 21:47:59 +0300,37.464269,56.017311
1561036,974,2017-03-20 21:47:59 +0300,37.464269,56.017311,974.0,2017-03-20 21:51:47 +0300,37.454963,56.016284


To avoid mixing data from different users in one row, we drop the rows where the two user ids do not match.
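
The effect of this filter can be seen on a toy frame. The sketch below is illustrative only (the `toy` data and variable names are made up): it first repeats the global shift used in the notebook, then shows a per-user shift via `groupby`, a possible alternative in which pairs never cross a user boundary.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "time": ["00:00", "00:05", "00:01", "00:07"],
    "latitude": [55.0, 55.1, 60.0, 60.1],
})

# Global shift, as in the notebook: the last row of user 1 gets paired
# with the first row of user 2, so mismatching ids must be filtered out.
paired = pd.concat([toy, toy.shift(-1).add_suffix("_shift")], axis=1).dropna()
same_user = paired[paired["id"] == paired["id_shift"]]

# Per-user shift: the last row of each user gets NaN, so dropna() is enough.
shifted = toy.groupby("id")[["time", "latitude"]].shift(-1).add_suffix("_shift")
per_user = toy.join(shifted).dropna()
```

Both variants keep the same rows here (index 0 and 2); the per-user version simply avoids creating the cross-user pair in the first place.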

In [9]:
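# Keep only the rows where both records belong to the same user; copy() lets us add columns safely later
data_sorted_shift_drop = data_sorted_shift[data_sorted_shift.id == data_sorted_shift.id_shift].copy()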
data_sorted_shift_drop

Unnamed: 0,id,time,longitude,latitude,id_shift,time_shift,longitude_shift,latitude_shift
0,2,2017-03-20 00:00:18 +0300,38.937878,47.210758,2.0,2017-03-20 00:00:18 +0300,38.938669,47.209144
844,2,2017-03-20 00:00:18 +0300,38.938669,47.209144,2.0,2017-03-20 00:00:40 +0300,38.939279,47.209029
1666,2,2017-03-20 00:00:40 +0300,38.939279,47.209029,2.0,2017-03-20 00:03:01 +0300,38.937462,47.209331
10308,2,2017-03-20 00:03:01 +0300,38.937462,47.209331,2.0,2017-03-20 00:03:09 +0300,38.939508,47.210721
10839,2,2017-03-20 00:03:09 +0300,38.939508,47.210721,2.0,2017-03-20 00:05:14 +0300,38.938469,47.209709
...,...,...,...,...,...,...,...,...
1539604,974,2017-03-20 21:27:11 +0300,37.464890,56.015909,974.0,2017-03-20 21:43:38 +0300,37.463898,56.017527
1556491,974,2017-03-20 21:43:38 +0300,37.463898,56.017527,974.0,2017-03-20 21:44:20 +0300,37.477952,56.011354
1557195,974,2017-03-20 21:44:20 +0300,37.477952,56.011354,974.0,2017-03-20 21:47:59 +0300,37.464269,56.017311
1561036,974,2017-03-20 21:47:59 +0300,37.464269,56.017311,974.0,2017-03-20 21:51:47 +0300,37.454963,56.016284


Compute the distance the user travels from point to point.
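
The notebook computes geodesic distances row by row with `geopy`. As a hedged alternative for large samples, the sketch below uses the haversine formula vectorised with NumPy. It assumes a spherical Earth, so its results can differ slightly from the ellipsoidal `geodesic` values (typically by well under one percent); `haversine_m` and `EARTH_RADIUS_M` are illustrative names, not part of the original code.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# One degree of longitude on the equator, and a zero-length step.
d = haversine_m(np.array([0.0, 10.0]), np.array([0.0, 20.0]),
                np.array([0.0, 10.0]), np.array([1.0, 20.0]))
```

The function accepts whole columns at once, so it could be called with the `latitude`, `longitude`, `latitude_shift`, and `longitude_shift` columns instead of applying a Python function to every row.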

In [10]:
import geopy.distance
from geopy.distance import geodesic

In [11]:
def distanse(row):
    # Geodesic distance in metres between a point and the next recorded point
    origin = (row['latitude'], row['longitude'])
    destination = (row['latitude_shift'], row['longitude_shift'])
    return geodesic(origin, destination).meters

In [12]:
data_sorted_shift_drop['distanse'] = data_sorted_shift_drop.apply(distanse, axis=1)
data_sorted_shift_drop

Unnamed: 0,id,time,longitude,latitude,id_shift,time_shift,longitude_shift,latitude_shift,distanse
0,2,2017-03-20 00:00:18 +0300,38.937878,47.210758,2.0,2017-03-20 00:00:18 +0300,38.938669,47.209144,189.178385
844,2,2017-03-20 00:00:18 +0300,38.938669,47.209144,2.0,2017-03-20 00:00:40 +0300,38.939279,47.209029,47.924217
1666,2,2017-03-20 00:00:40 +0300,38.939279,47.209029,2.0,2017-03-20 00:03:01 +0300,38.937462,47.209331,141.713724
10308,2,2017-03-20 00:03:01 +0300,38.937462,47.209331,2.0,2017-03-20 00:03:09 +0300,38.939508,47.210721,218.909623
10839,2,2017-03-20 00:03:09 +0300,38.939508,47.210721,2.0,2017-03-20 00:05:14 +0300,38.938469,47.209709,137.303567
...,...,...,...,...,...,...,...,...,...
1539604,974,2017-03-20 21:27:11 +0300,37.464890,56.015909,974.0,2017-03-20 21:43:38 +0300,37.463898,56.017527,190.438987
1556491,974,2017-03-20 21:43:38 +0300,37.463898,56.017527,974.0,2017-03-20 21:44:20 +0300,37.477952,56.011354,1113.850711
1557195,974,2017-03-20 21:44:20 +0300,37.477952,56.011354,974.0,2017-03-20 21:47:59 +0300,37.464269,56.017311,1080.860261
1561036,974,2017-03-20 21:47:59 +0300,37.464269,56.017311,974.0,2017-03-20 21:51:47 +0300,37.454963,56.016284,591.478753


In [13]:
data_with_distance = data_sorted_shift_drop[['id', 'time', 'time_shift', 'longitude_shift', 'latitude_shift', 'distanse']]
data_with_distance

Unnamed: 0,id,time,time_shift,longitude_shift,latitude_shift,distanse
0,2,2017-03-20 00:00:18 +0300,2017-03-20 00:00:18 +0300,38.938669,47.209144,189.178385
844,2,2017-03-20 00:00:18 +0300,2017-03-20 00:00:40 +0300,38.939279,47.209029,47.924217
1666,2,2017-03-20 00:00:40 +0300,2017-03-20 00:03:01 +0300,38.937462,47.209331,141.713724
10308,2,2017-03-20 00:03:01 +0300,2017-03-20 00:03:09 +0300,38.939508,47.210721,218.909623
10839,2,2017-03-20 00:03:09 +0300,2017-03-20 00:05:14 +0300,38.938469,47.209709,137.303567
...,...,...,...,...,...,...
1539604,974,2017-03-20 21:27:11 +0300,2017-03-20 21:43:38 +0300,37.463898,56.017527,190.438987
1556491,974,2017-03-20 21:43:38 +0300,2017-03-20 21:44:20 +0300,37.477952,56.011354,1113.850711
1557195,974,2017-03-20 21:44:20 +0300,2017-03-20 21:47:59 +0300,37.464269,56.017311,1080.860261
1561036,974,2017-03-20 21:47:59 +0300,2017-03-20 21:51:47 +0300,37.454963,56.016284,591.478753


Take a slice of the data for the night period.
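
The cell below compares timestamp strings against a fixed date, which works for this one-day sample. A sketch of a date-independent variant, assuming the same timestamp format, parses the strings and filters on seconds since local midnight; the `toy` frame and variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"time_shift": [
    "2017-03-20 01:59:24 +0300",
    "2017-03-20 02:02:23 +0300",
    "2017-03-21 03:10:00 +0300",
    "2017-03-20 04:31:00 +0300",
]})

stamps = pd.to_datetime(toy["time_shift"], format="%Y-%m-%d %H:%M:%S %z")
# Seconds since local midnight, so the filter does not depend on the date.
seconds = stamps.dt.hour * 3600 + stamps.dt.minute * 60 + stamps.dt.second
night = toy[(seconds > 2 * 3600) & (seconds < 4.5 * 3600)]
```

Only the second and third rows fall inside the 02:00 to 04:30 window, even though they belong to different dates.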

In [14]:
data_with_distance_night = data_with_distance[(data_with_distance.time_shift > '2017-03-20 02:00:00') & (data_with_distance.time_shift < '2017-03-20 04:30:00')]

data_with_distance_night


Unnamed: 0,id,time,time_shift,longitude_shift,latitude_shift,distanse
274690,2,2017-03-20 01:59:24 +0300,2017-03-20 02:02:23 +0300,38.938487,47.210089,49.313035
281059,2,2017-03-20 02:02:23 +0300,2017-03-20 02:04:20 +0300,38.938835,47.209228,99.232670
285194,2,2017-03-20 02:04:20 +0300,2017-03-20 02:04:20 +0300,38.938016,47.210160,120.763877
285200,2,2017-03-20 02:04:20 +0300,2017-03-20 02:04:42 +0300,38.938352,47.210668,61.892618
285989,2,2017-03-20 02:04:42 +0300,2017-03-20 02:04:43 +0300,38.931705,47.208533,556.669469
...,...,...,...,...,...,...
536182,954,2017-03-20 04:22:17 +0300,2017-03-20 04:23:44 +0300,104.134651,52.383737,57.886636
258581,956,2017-03-20 01:51:44 +0300,2017-03-20 02:26:54 +0300,37.717582,55.822089,17930.691026
333310,956,2017-03-20 02:26:54 +0300,2017-03-20 02:27:08 +0300,37.717238,55.822211,25.467632
333112,956,2017-03-20 02:27:08 +0300,2017-03-20 02:27:27 +0300,37.716429,55.823002,101.563122


Sort the data by 'id' and 'time_shift'.

In [15]:
data_with_distance_night = data_with_distance_night.drop(['time'], axis=1)
data_with_distance_night = data_with_distance_night.sort_values(by=['id', 'time_shift'], ascending=[True, True])
data_with_distance_night

Unnamed: 0,id,time_shift,longitude_shift,latitude_shift,distanse
274690,2,2017-03-20 02:02:23 +0300,38.938487,47.210089,49.313035
281059,2,2017-03-20 02:04:20 +0300,38.938835,47.209228,99.232670
285194,2,2017-03-20 02:04:20 +0300,38.938016,47.210160,120.763877
285200,2,2017-03-20 02:04:42 +0300,38.938352,47.210668,61.892618
285989,2,2017-03-20 02:04:43 +0300,38.931705,47.208533,556.669469
...,...,...,...,...,...
536182,954,2017-03-20 04:23:44 +0300,104.134651,52.383737,57.886636
258581,956,2017-03-20 02:26:54 +0300,37.717582,55.822089,17930.691026
333310,956,2017-03-20 02:27:08 +0300,37.717238,55.822211,25.467632
333112,956,2017-03-20 02:27:27 +0300,37.716429,55.823002,101.563122


In [16]:
data_with_distance_night = data_with_distance_night.drop(['time_shift'], axis=1)

Group the data by user, computing the centroid coordinates of the area and the distance the user covered during the night.

In [17]:
data_with_distance_night_sum = data_with_distance_night.groupby('id').agg({'longitude_shift': 'mean', 'latitude_shift': 'mean', 'distanse': 'sum'})
data_with_distance_night_sum

Unnamed: 0_level_0,longitude_shift,latitude_shift,distanse
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,38.938396,47.209695,34547.721709
3,45.656193,43.259608,62262.341042
4,47.122996,42.813180,209072.454698
5,158.603659,53.069024,38763.475366
6,104.256066,52.252100,208499.532798
...,...,...,...
952,126.692783,62.102453,41568.912499
953,37.212274,55.940335,39849.678965
954,104.084433,52.417183,46468.086949
956,37.717083,55.822434,18057.721780


The daytime period is processed in the same way.

In [18]:
data_with_distance_day = data_with_distance[(data_with_distance.time_shift > '2017-03-20 11:00:00') & (data_with_distance.time_shift < '2017-03-20 16:30:00')]
data_with_distance_day = data_with_distance_day[(data_with_distance_day.time_shift < '2017-03-20 12:30:00') | (data_with_distance_day.time_shift > '2017-03-20 14:30:00')]
data_with_distance_day


Unnamed: 0,id,time,time_shift,longitude_shift,latitude_shift,distanse
1012794,2,2017-03-20 10:59:23 +0300,2017-03-20 11:01:44 +0300,38.939348,47.209453,33.591106
1014616,2,2017-03-20 11:01:44 +0300,2017-03-20 11:02:07 +0300,38.939655,47.210397,107.578609
1014887,2,2017-03-20 11:02:07 +0300,2017-03-20 11:06:27 +0300,38.937825,47.209897,149.388659
1018269,2,2017-03-20 11:06:27 +0300,2017-03-20 11:06:27 +0300,38.939115,47.209886,97.733882
1018270,2,2017-03-20 11:06:27 +0300,2017-03-20 11:06:34 +0300,38.938539,47.210216,56.974741
...,...,...,...,...,...,...
1184913,973,2017-03-20 14:50:02 +0300,2017-03-20 14:50:11 +0300,37.396935,55.705853,1074.674876
1196662,974,2017-03-20 15:05:55 +0300,2017-03-20 15:25:33 +0300,37.477887,56.018844,912.105368
1212440,974,2017-03-20 15:25:33 +0300,2017-03-20 16:16:12 +0300,37.460441,56.017453,1098.910687
1253536,974,2017-03-20 16:16:12 +0300,2017-03-20 16:23:55 +0300,37.471363,56.017971,683.566425


In [19]:
data_with_distance_day = data_with_distance_day.drop(['time'], axis=1)
data_with_distance_day = data_with_distance_day.sort_values(by=['id', 'time_shift'], ascending=[True, True])
data_with_distance_day

Unnamed: 0,id,time_shift,longitude_shift,latitude_shift,distanse
1012794,2,2017-03-20 11:01:44 +0300,38.939348,47.209453,33.591106
1014616,2,2017-03-20 11:02:07 +0300,38.939655,47.210397,107.578609
1014887,2,2017-03-20 11:06:27 +0300,38.937825,47.209897,149.388659
1018269,2,2017-03-20 11:06:27 +0300,38.939115,47.209886,97.733882
1018270,2,2017-03-20 11:06:34 +0300,38.938539,47.210216,56.974741
...,...,...,...,...,...
1184913,973,2017-03-20 14:50:11 +0300,37.396935,55.705853,1074.674876
1196662,974,2017-03-20 15:25:33 +0300,37.477887,56.018844,912.105368
1212440,974,2017-03-20 16:16:12 +0300,37.460441,56.017453,1098.910687
1253536,974,2017-03-20 16:23:55 +0300,37.471363,56.017971,683.566425


In [20]:
data_with_distance_day_sum = data_with_distance_day.drop(['time_shift'], axis=1).groupby('id').agg({'longitude_shift': 'mean', 'latitude_shift': 'mean', 'distanse': 'sum'})
data_with_distance_day_sum

Unnamed: 0_level_0,longitude_shift,latitude_shift,distanse
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,38.937699,47.209291,56222.426341
3,45.672075,43.321897,192683.771942
4,47.122253,42.812378,76556.712594
5,158.605267,53.056002,71.575169
6,104.255970,52.252395,258730.168704
...,...,...,...
968,37.607226,55.786912,2093.654708
969,37.547810,55.757917,5716.943530
971,37.684294,55.747523,123957.427400
973,37.389497,55.707880,2312.674129


In [21]:
data_with_distance_night_sum.columns = ['longitude_mean_night', 'latitude_mean_night', 'distanse_night']
data_with_distance_night_sum

Unnamed: 0_level_0,longitude_mean_night,latitude_mean_night,distanse_night
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,38.938396,47.209695,34547.721709
3,45.656193,43.259608,62262.341042
4,47.122996,42.813180,209072.454698
5,158.603659,53.069024,38763.475366
6,104.256066,52.252100,208499.532798
...,...,...,...
952,126.692783,62.102453,41568.912499
953,37.212274,55.940335,39849.678965
954,104.084433,52.417183,46468.086949
956,37.717083,55.822434,18057.721780


In [22]:
data_with_distance_day_sum.columns = ['longitude_mean_day', 'latitude_mean_day', 'distanse_day']
data_with_distance_day_sum

Unnamed: 0_level_0,longitude_mean_day,latitude_mean_day,distanse_day
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,38.937699,47.209291,56222.426341
3,45.672075,43.321897,192683.771942
4,47.122253,42.812378,76556.712594
5,158.605267,53.056002,71.575169
6,104.255970,52.252395,258730.168704
...,...,...,...
968,37.607226,55.786912,2093.654708
969,37.547810,55.757917,5716.943530
971,37.684294,55.747523,123957.427400
973,37.389497,55.707880,2312.674129


Merge the daytime and nighttime datasets by user.

In [23]:
merge_data = data_with_distance_night_sum.merge(data_with_distance_day_sum, on='id', how='inner')
merge_data

Unnamed: 0_level_0,longitude_mean_night,latitude_mean_night,distanse_night,longitude_mean_day,latitude_mean_day,distanse_day
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2,38.938396,47.209695,34547.721709,38.937699,47.209291,56222.426341
3,45.656193,43.259608,62262.341042,45.672075,43.321897,192683.771942
4,47.122996,42.813180,209072.454698,47.122253,42.812378,76556.712594
5,158.603659,53.069024,38763.475366,158.605267,53.056002,71.575169
6,104.256066,52.252100,208499.532798,104.255970,52.252395,258730.168704
...,...,...,...,...,...,...
952,126.692783,62.102453,41568.912499,126.687259,62.106717,48604.317093
953,37.212274,55.940335,39849.678965,37.314860,55.407016,6037.083410
954,104.084433,52.417183,46468.086949,103.873326,52.573463,57010.693152
956,37.717083,55.822434,18057.721780,37.528305,55.755539,61175.541780


Compute the "home-work" distance between the centroids of the areas where the user stayed during the day and at night.

In [24]:
def distanse_home_to_work(row):
    origin = (row['latitude_mean_night'], row['longitude_mean_night'])
    dist = (row['latitude_mean_day'], row['longitude_mean_day'])
    return geodesic(origin, dist).meters

In [25]:
merge_data['distanse_home_work'] = merge_data.apply(distanse_home_to_work, axis=1)

In [26]:
merge_data = merge_data[['distanse_night', 'distanse_day', 'distanse_home_work']]
merge_data

Unnamed: 0_level_0,distanse_night,distanse_day,distanse_home_work
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,34547.721709,56222.426341,69.325221
3,62262.341042,192683.771942,7039.182037
4,209072.454698,76556.712594,107.798809
5,38763.475366,71.575169,1453.269309
6,208499.532798,258730.168704,33.408659
...,...,...,...
952,41568.912499,48604.317093,555.917615
953,39849.678965,6037.083410,59727.381593
954,46468.086949,57010.693152,22538.779774
956,18057.721780,61175.541780,13989.866010


We cap the total distance over all of a user's coordinate changes at 2,500 m for "home" and at 10,000 m for "work".

The distance from home to work must be more than 2,000 m.
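
Stated as a single predicate, the rule could look like the sketch below. The function name `fits_home_work` and its parameter names are illustrative; the default thresholds are the values given above.

```python
def fits_home_work(night_path_m, day_path_m, home_work_m,
                   max_night_m=2500, max_day_m=10000, min_gap_m=2000):
    """True if the user barely moves at night and during working hours,
    and the two centroids are far enough apart."""
    return (night_path_m < max_night_m
            and day_path_m < max_day_m
            and home_work_m > min_gap_m)


# Hypothetical user: short night and day paths, centroids 5.2 km apart.
print(fits_home_work(1800.0, 6000.0, 5200.0))
# User 5 from the merged table: the night path alone is about 38.8 km.
print(fits_home_work(38763.475366, 71.575169, 1453.269309))
```

The first call prints `True` and the second `False`, since user 5 fails both the night-path and the home-to-work conditions.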


In [27]:
# Maximum trajectory length (m) of the user's coordinate changes for the "home" area
max_night_distance = 2500

# Maximum trajectory length (m) of the user's coordinate changes for the "work" area
max_day_distance = 10000

# Minimum distance (m) beyond which the "home" area is considered separate from the "work" area
min_distance_home_to_work = 2000


known_home_and_work = merge_data[(merge_data.distanse_night < max_night_distance)]
known_home_and_work = known_home_and_work[(known_home_and_work.distanse_home_work > min_distance_home_to_work)]
known_home_and_work = known_home_and_work[(known_home_and_work.distanse_day < max_day_distance)]

print('IDs of users who spend a sufficiently long time at night near one point (“home”) and a sufficiently long time during the day at a noticeably different point (“work”):', str(list(known_home_and_work.index))[1:-1])
print()
print('Number of users = ', known_home_and_work.shape[0])


IDs of users who spend a sufficiently long time at night near one point (“home”) and a sufficiently long time during the day at a noticeably different point (“work”): 28, 59, 101, 103, 117, 136, 153, 175, 183, 235, 246, 250, 277, 309, 442, 471, 487, 497, 549, 752, 863, 940, 957

Number of users =  23


Let's compare this result with the result of the first study.

In [28]:
ver_1 = [59, 89, 103, 136, 152, 153, 154, 175, 179, 183, 277, 282, 306, 309, 329, 347, 446, 468, 471, 487, 549, 657, 673, 727, 752, 792, 867, 886, 940, 957]
ver_2 = [28, 59, 101, 103, 117, 136, 153, 175, 183, 235, 246, 250, 277, 309, 442, 471, 487, 497, 549, 752, 863, 940, 957]


# Count the users from the second result that also appear in the first one
match = sum(user_id in ver_1 for user_id in ver_2)

print("Matched users: ", (match / len(ver_2)) * 100, " %")

Matched users:  60.86956521739131  %
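
The share above is measured relative to the second list only. As a supplementary sketch (not part of the original analysis), set operations give the overlap in both directions as well as the Jaccard index; the two sets repeat the lists from the cell above.

```python
ver_1 = {59, 89, 103, 136, 152, 153, 154, 175, 179, 183, 277, 282, 306, 309, 329,
         347, 446, 468, 471, 487, 549, 657, 673, 727, 752, 792, 867, 886, 940, 957}
ver_2 = {28, 59, 101, 103, 117, 136, 153, 175, 183, 235, 246, 250, 277, 309, 442,
         471, 487, 497, 549, 752, 863, 940, 957}

common = ver_1 & ver_2
share_of_ver_2 = len(common) / len(ver_2)   # what the cell above reports
share_of_ver_1 = len(common) / len(ver_1)   # the same overlap seen from the first list
jaccard = len(common) / len(ver_1 | ver_2)  # symmetric measure of agreement

print(len(common), round(share_of_ver_2 * 100, 2),
      round(share_of_ver_1 * 100, 2), round(jaccard, 3))
```

This prints `14 60.87 46.67 0.359`: the 14 shared users make up about 61 % of the second list but only about 47 % of the first.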


### Conclusion

The first criterion, which filters users by the area of their "home" and "work" districts, is preferable: the choice of a limiting area is easier to interpret than a limit on the length of the user's trajectory.

#### What else could be done

Other clustering approaches could be tried. For example, the criterion could be the distance from the centroid to the farthest point.
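
A minimal sketch of such a radius-based measure, assuming the centroid is the plain mean of latitudes and longitudes (reasonable for small areas) and using a spherical haversine distance instead of `geopy`. The names `haversine_m` and `cluster_radius_m` are illustrative.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in metres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_m * asin(sqrt(a))


def cluster_radius_m(points):
    """Distance from the centroid (mean lat/lon) to the farthest point, in metres."""
    lat_c = sum(lat for lat, _ in points) / len(points)
    lon_c = sum(lon for _, lon in points) / len(points)
    return max(haversine_m(lat_c, lon_c, lat, lon) for lat, lon in points)


# Three nearby points (lat, lon); the radius comes out at a few dozen metres.
radius = cluster_radius_m([(55.0, 37.0), (55.0, 37.002), (55.001, 37.001)])
```

Unlike the summed path length, this value does not grow with the number of recorded points, so a threshold on it could be read directly as the size of the area.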

Alternatively, the overall algorithm could be changed: first cluster all of the user's movements, then check at what time each cluster was formed and how long the user stayed there, and only then label it as "home" or "work".
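
One very rough way to prototype this idea is sketched below. It is an assumption-laden stand-in, not a real clustering method: points are snapped to a coarse grid (`cell_deg`), the number of observations per cell serves as a proxy for time spent there, and the hour ranges are simplified. The function name `label_home_and_work` is hypothetical.

```python
from collections import Counter, defaultdict


def label_home_and_work(records, cell_deg=0.01):
    """records: iterable of (hour, lat, lon). Returns (home_cell, work_cell) or None."""
    hits = defaultdict(Counter)  # grid cell -> counts of 'night' / 'day' observations
    for hour, lat, lon in records:
        cell = (round(lat / cell_deg), round(lon / cell_deg))
        if 2 <= hour < 5:
            hits[cell]["night"] += 1
        elif 11 <= hour < 17:
            hits[cell]["day"] += 1
    if not hits:
        return None
    home = max(hits, key=lambda c: hits[c]["night"])
    work = max(hits, key=lambda c: hits[c]["day"])
    # The pattern applies only if both places were observed and they differ.
    if hits[home]["night"] == 0 or hits[work]["day"] == 0 or home == work:
        return None
    return home, work


records = [(3, 55.751, 37.601), (3, 55.752, 37.602), (4, 55.749, 37.601),
           (12, 55.801, 37.501), (15, 55.802, 37.502), (20, 55.9, 37.9)]
result = label_home_and_work(records)
```

A density-based method could replace the grid step; the labelling logic after it would stay the same.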