# PREDICTION OF DESTINATION
By Ziqing Zhou, Qiu Chen

## Introduction
In this tutorial, we cover the techniques and algorithm for using a content image and a style image to generate a stylized image.

The main idea of our project is to extract features from certain layers of VGG-19 to generate the target image. To get results of higher perceptual quality, we choose relu4_2 and relu5_2 to extract content, and use relu1_1, relu2_1, relu3_1, relu4_1 and relu5_1 to extract texture from the style image.

The cost function is defined over the outputs of the content layers and style layers. Afterwards, we use the Adam algorithm to optimize the network and use back-propagation to generate the target image.

Since the algorithm defines the style of an image as its texture, to get a striking result you may want to choose a style image with distinctive textures, such as Starry Night.

## Data File
[PREDICTION OF DESTINATION](http://www.dcjingsai.com/common/cmpt/%E6%B1%BD%E8%BD%A6%E7%9B%AE%E7%9A%84%E5%9C%B0%E6%99%BA%E8%83%BD%E9%A2%84%E6%B5%8B%E5%A4%A7%E8%B5%9B_%E8%B5%9B%E4%BD%93%E4%B8%8E%E6%95%B0%E6%8D%AE.html)

### Import

In [None]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from math import radians, atan, tan, sin, acos, cos
import requests
import json
from shapely.geometry  import MultiPoint,Polygon
from  geopy.distance import great_circle
from geopy.geocoders  import Nominatim
from geopy.point  import Point
import geopandas as gpd
from sklearn.preprocessing  import StandardScaler,minmax_scale
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

### EDA

Since the competition provides only a few columns of data, we just do some basic EDA to get a sense of its content.

In [3]:
train_set = pd.read_csv('data/train_new.csv', encoding='utf8')
train_set.head()



Unnamed: 0,r_key,out_id,start_time,end_time,start_lat,start_lon,end_lat,end_lon
0,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,2016061820000b,2018-01-20 10:13:43,2018-01-20 10:19:04,33.783415,111.60366,33.779811,111.605885
1,SDK-XJ_4c2f29d94c9478623711756e4ae34cc5,2016061820000b,2018-02-12 17:40:51,2018-02-12 17:58:13,34.810763,115.549264,34.814875,115.549374
2,SDK-XJ_3570183177536a575b9da67a86efcd62,2016061820000b,2018-02-13 14:52:24,2018-02-13 15:24:33,34.640284,115.539024,34.813136,115.559243
3,SDK-XJ_78d749a376e190685716a51a6704010b,2016061820000b,2018-02-13 17:23:08,2018-02-13 17:39:02,34.81828,115.542039,34.813141,115.559217
4,SDK-XJ_3b249941c27834f5e43d43a9114e4909,2016061820000b,2018-02-13 18:06:02,2018-02-13 19:02:51,34.813278,115.55926,34.786126,115.874361


In [4]:
train_set.shape

(1495814, 8)

In [5]:
df_train_data = pd.read_csv('data/d2/train_new.csv')
print(df_train_data.shape)
print(df_train_data.start_time.max())
df_train_data.head(1)



(1495814, 8)
2018-08-01 00:49:30


Unnamed: 0,r_key,out_id,start_time,end_time,start_lat,start_lon,end_lat,end_lon
0,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,2016061820000b,2018-01-20 10:13:43,2018-01-20 10:19:04,33.783415,111.60366,33.779811,111.605885


In [6]:
df_test_data = pd.read_csv('data/d2/test_new.csv')
print(df_test_data.shape)
print(df_test_data.start_time.min())
df_test_data.head(1)

(58097, 5)
2018-09-01 00:32:42


Unnamed: 0,r_key,out_id,start_time,start_lat,start_lon
0,f6fa6b2a1fa250b3_SDK-XJ_eed80f24f496fc9a59f49e...,358962079107966,2018-09-01 15:54:12,43.943356,125.377718


### The data is pretty sparse, so we can't build a model for every specific car; we concentrate on vehicles that appear at least 10 times in the test set

In [8]:
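# count how many test records each vehicle (out_id) has; naming the axis and the
# count column explicitly keeps the result independent of the pandas version
test_most = df_test_data.out_id.value_counts().rename_axis("out_id").reset_index(name="count")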
test_most.head()

Unnamed: 0,out_id,count
0,861661609031151,24
1,861691704005451,24
2,861691703008261,24
3,891941604000461,24
4,868260020967122,24


In [11]:
# use bracket access: attribute access would shadow the DataFrame.count method
test_most["count"] = test_most["count"].astype("float")
test_most_list = list(test_most[test_most["count"] >= 10].out_id)
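
The selection rule can also be sketched without pandas. The snippet below is a minimal illustration on made-up vehicle ids; the ids and the helper name `frequent_ids` are hypothetical and not part of the dataset or the notebook. It counts records per id and keeps the ids that reach a threshold.

```python
from collections import Counter

def frequent_ids(ids, min_count=10):
    """Return the ids that occur at least min_count times, most frequent first."""
    counts = Counter(ids)
    return [key for key, n in counts.most_common() if n >= min_count]

# made-up sample: vehicle "A" has 12 records, "B" has 10, "C" has 3
sample = ["A"] * 12 + ["B"] * 10 + ["C"] * 3
print(frequent_ids(sample))
```

Only "A" and "B" pass the default threshold of 10 in this sample.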

### Concatenate the train data and the test data

In [13]:
df_data = pd.concat([df_train_data,df_test_data],sort=True)
print(df_data.shape)
df_data.tail()

(1553911, 8)


Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time
58092,,,,895851711000830,ecd2b169462a8170_SDK-XJ_5f9fad90d15c6f26c61bef...,39.808821,116.496067,2018-10-23 11:02:22
58093,,,,895851711001200,03901f27f8da8635_SDK-XJ_8bb742fb3c64cd11de4052...,31.017508,121.401338,2018-10-23 00:14:55
58094,,,,895851711003404,baab481186d3c5b7_SDK-XJ_175e32c31c8975998cb8d8...,23.115605,113.413985,2018-10-23 08:18:49
58095,,,,895851711004600,0d6bfa6c320ffcaf_SDK-XJ_ca4fac4dde952a9f9bd2a4...,22.807363,108.324269,2018-10-23 19:14:32
58096,,,,911691581099171,331d3083d4546ae6_SDK-XJ_d0fd8c2f78cfae07cc4c4f...,36.878666,118.072888,2018-10-23 15:46:47


In [15]:
df_data.start_time = pd.to_datetime(df_data.start_time)
df_data.end_time = pd.to_datetime(df_data.end_time)
df_data.out_id = df_data.out_id.astype(str)
df_data

Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time
0,33.779811,111.605885,2018-01-20 10:19:04,2016061820000b,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,33.783415,111.603660,2018-01-20 10:13:43
1,34.814875,115.549374,2018-02-12 17:58:13,2016061820000b,SDK-XJ_4c2f29d94c9478623711756e4ae34cc5,34.810763,115.549264,2018-02-12 17:40:51
2,34.813136,115.559243,2018-02-13 15:24:33,2016061820000b,SDK-XJ_3570183177536a575b9da67a86efcd62,34.640284,115.539024,2018-02-13 14:52:24
3,34.813141,115.559217,2018-02-13 17:39:02,2016061820000b,SDK-XJ_78d749a376e190685716a51a6704010b,34.818280,115.542039,2018-02-13 17:23:08
4,34.786126,115.874361,2018-02-13 19:02:51,2016061820000b,SDK-XJ_3b249941c27834f5e43d43a9114e4909,34.813278,115.559260,2018-02-13 18:06:02
5,34.813162,115.559195,2018-02-13 21:58:38,2016061820000b,SDK-XJ_9fd451509b05ecc54e641878a13baea8,34.785990,115.874259,2018-02-13 20:58:36
6,34.641016,115.536066,2018-02-14 09:22:47,2016061820000b,SDK-XJ_ca29f7fd306c47a407e4ab5aed21e7fb,34.812824,115.559272,2018-02-14 08:54:02
7,35.822282,116.367214,2018-02-16 16:19:27,2016061820000b,SDK-XJ_651f34259082f6b78f675357bf0200d1,34.803362,115.839409,2018-02-16 14:33:23
8,36.845625,117.209663,2018-02-16 18:34:56,2016061820000b,SDK-XJ_4566f3f45f4df155319c66de13d59936,35.822863,116.367453,2018-02-16 16:33:56
9,34.786001,115.874430,2018-02-17 17:12:11,2016061820000b,SDK-XJ_349990ffb6274fbf07dbe869762cf8ea,36.204509,116.459742,2018-02-17 14:41:30


In [None]:
#df_train.r_key.str.split('_')
for i in df_data.columns:
    print('column:',i,'have',df_data[i].unique().__len__())

In [17]:
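# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_data['week'] = df_data.start_time.dt.isocalendar().week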
df_data['week_day'] = df_data.start_time.dt.weekday
df_data['month'] = df_data.start_time.dt.month
df_data['hour'] =df_data.start_time.dt.hour
df_data['calendar_date'] =df_data.start_time.dt.date
df_data.calendar_date = pd.to_datetime(df_data.calendar_date)
df_data.head()
#df_data = df_data.drop('end_time',axis = 1)
# df_train = df_data[df_data.start_time.dt.month < 9]
# df_test = df_data[df_data.start_time.dt.month >= 9]

Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time,week,week_day,month,hour,calendar_date
0,33.779811,111.605885,2018-01-20 10:19:04,2016061820000b,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,33.783415,111.60366,2018-01-20 10:13:43,3,5,1,10,2018-01-20
1,34.814875,115.549374,2018-02-12 17:58:13,2016061820000b,SDK-XJ_4c2f29d94c9478623711756e4ae34cc5,34.810763,115.549264,2018-02-12 17:40:51,7,0,2,17,2018-02-12
2,34.813136,115.559243,2018-02-13 15:24:33,2016061820000b,SDK-XJ_3570183177536a575b9da67a86efcd62,34.640284,115.539024,2018-02-13 14:52:24,7,1,2,14,2018-02-13
3,34.813141,115.559217,2018-02-13 17:39:02,2016061820000b,SDK-XJ_78d749a376e190685716a51a6704010b,34.81828,115.542039,2018-02-13 17:23:08,7,1,2,17,2018-02-13
4,34.786126,115.874361,2018-02-13 19:02:51,2016061820000b,SDK-XJ_3b249941c27834f5e43d43a9114e4909,34.813278,115.55926,2018-02-13 18:06:02,7,1,2,18,2018-02-13
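
To make the derived columns concrete, here is a small standard-library sketch that computes the same kind of features for a single timestamp. The function name `time_features` is hypothetical; the notebook itself uses the pandas `.dt` accessors.

```python
from datetime import datetime

def time_features(timestamp):
    """Derive calendar features from a 'YYYY-MM-DD HH:MM:SS' string."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "week": t.isocalendar()[1],   # ISO week number
        "week_day": t.weekday(),      # Monday is 0, Sunday is 6
        "month": t.month,
        "hour": t.hour,
        "calendar_date": t.date().isoformat(),
    }

print(time_features("2018-01-20 10:13:43"))
```

For the first row of the table (2018-01-20 10:13:43) this gives week 3, week_day 5 (a Saturday), month 1 and hour 10, which agrees with the values shown above.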


### Time features, indexed by calendar_date, with one-hot flags for special holidays in China
### Read three columns: "calendar_date", "holiday" and "dayOff"

In [19]:
# time features, indexed by calendar_date, with one-hot flags for special holidays in China
# read three columns: "calendar_date", "holiday" and "dayOff"
time_features = pd.read_csv('data/time_features.csv')[["calendar_date","holiday","dayOff"]]
time_features.calendar_date = pd.to_datetime(time_features.calendar_date)

### Merge the two dataframes on calendar_date, adding the date features to each row of df_data

In [22]:
# merge the two dataframes on calendar_date, adding the date features to each row of df_data
df_data = df_data.merge(time_features,on="calendar_date")
df_data.head(5)

Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time,week,week_day,month,hour,calendar_date,holiday_x,dayOff_x,holiday_y,dayOff_y,holiday,dayOff
0,33.779811,111.605885,2018-01-20 10:19:04,2016061820000b,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,33.783415,111.60366,2018-01-20 10:13:43,3,5,1,10,2018-01-20,0,1,0,1,0,1
1,39.720161,98.488424,2018-01-20 02:21:39,851181601004171,SDK-XJ_a7155fe411bc3fe5bfacb3137c443dde,39.728298,98.496293,2018-01-20 01:54:42,3,5,1,1,2018-01-20,0,1,0,1,0,1
2,39.717471,98.489098,2018-01-20 22:13:50,851181601004171,SDK-XJ_1fd9ae6c8b9a695f9f9b9a9afd4d7028,39.717927,98.488435,2018-01-20 21:49:44,3,5,1,21,2018-01-20,0,1,0,1,0,1
3,39.717432,98.489051,2018-01-21 00:09:32,851181601004171,SDK-XJ_a05d788b2b172e383e7bab162c2c1e77,39.719525,98.488131,2018-01-20 23:45:29,3,5,1,23,2018-01-20,0,1,0,1,0,1
4,25.488169,105.354194,2018-01-20 12:02:40,851181601028851,SDK-XJ_2e5f00055146b0548726578fbaa62426,25.478436,105.366651,2018-01-20 11:56:29,3,5,1,11,2018-01-20,0,1,0,1,0,1


### Round the coordinates to 3 decimal places to build coarser location features

In [24]:
# round the start and end coordinates to m = 3 decimal places
m = 3
df_data['start_lat_1'] = df_data.start_lat.round(m)
df_data['start_lon_1'] = df_data.start_lon.round(m)
df_data["end_lat_2"] = df_data.end_lat.round(m)
df_data["end_lon_2"] = df_data.end_lon.round(m)
df_data.head()

Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time,week,week_day,...,holiday_x,dayOff_x,holiday_y,dayOff_y,holiday,dayOff,start_lat_1,start_lon_1,end_lat_2,end_lon_2
0,33.779811,111.605885,2018-01-20 10:19:04,2016061820000b,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,33.783415,111.60366,2018-01-20 10:13:43,3,5,...,0,1,0,1,0,1,33.783,111.604,33.78,111.606
1,39.720161,98.488424,2018-01-20 02:21:39,851181601004171,SDK-XJ_a7155fe411bc3fe5bfacb3137c443dde,39.728298,98.496293,2018-01-20 01:54:42,3,5,...,0,1,0,1,0,1,39.728,98.496,39.72,98.488
2,39.717471,98.489098,2018-01-20 22:13:50,851181601004171,SDK-XJ_1fd9ae6c8b9a695f9f9b9a9afd4d7028,39.717927,98.488435,2018-01-20 21:49:44,3,5,...,0,1,0,1,0,1,39.718,98.488,39.717,98.489
3,39.717432,98.489051,2018-01-21 00:09:32,851181601004171,SDK-XJ_a05d788b2b172e383e7bab162c2c1e77,39.719525,98.488131,2018-01-20 23:45:29,3,5,...,0,1,0,1,0,1,39.72,98.488,39.717,98.489
4,25.488169,105.354194,2018-01-20 12:02:40,851181601028851,SDK-XJ_2e5f00055146b0548726578fbaa62426,25.478436,105.366651,2018-01-20 11:56:29,3,5,...,0,1,0,1,0,1,25.478,105.367,25.488,105.354


In [26]:
# count unique for each column
for i in df_data.columns:
    print('column:',i,'have',df_data[i].unique().__len__())

column: end_lat have 1280754
column: end_lon have 1313349
column: end_time have 1405441
column: out_id have 5817
column: r_key have 1553911
column: start_lat have 1377855
column: start_lon have 1409858
column: start_time have 1459660
column: week have 40
column: week_day have 7
column: month have 10
column: hour have 24
column: calendar_date have 237
column: holiday_x have 2
column: dayOff_x have 2
column: holiday_y have 2
column: dayOff_y have 2
column: holiday have 2
column: dayOff have 2
column: start_lat_1 have 25036
column: start_lon_1 have 32604
column: end_lat_2 have 25210
column: end_lon_2 have 32895


### Concatenate lat and lon into one column, in coordinate format

In [30]:
# concatenate lat and lon into one column, as a "lat,lon" string
df_data['start_place'] = df_data['start_lat_1'].astype(str)+','+df_data['start_lon_1'].astype(str)
df_data['end_place'] = df_data['end_lat_2'].astype(str)+','+df_data['end_lon_2'].astype(str)

In [31]:
df_data.head()

Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time,week,week_day,...,holiday_y,dayOff_y,holiday,dayOff,start_lat_1,start_lon_1,end_lat_2,end_lon_2,start_place,end_place
0,33.779811,111.605885,2018-01-20 10:19:04,2016061820000b,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,33.783415,111.60366,2018-01-20 10:13:43,3,5,...,0,1,0,1,33.783,111.604,33.78,111.606,"33.783,111.604","33.78,111.606"
1,39.720161,98.488424,2018-01-20 02:21:39,851181601004171,SDK-XJ_a7155fe411bc3fe5bfacb3137c443dde,39.728298,98.496293,2018-01-20 01:54:42,3,5,...,0,1,0,1,39.728,98.496,39.72,98.488,"39.728,98.496","39.72,98.488"
2,39.717471,98.489098,2018-01-20 22:13:50,851181601004171,SDK-XJ_1fd9ae6c8b9a695f9f9b9a9afd4d7028,39.717927,98.488435,2018-01-20 21:49:44,3,5,...,0,1,0,1,39.718,98.488,39.717,98.489,"39.718,98.488","39.717,98.489"
3,39.717432,98.489051,2018-01-21 00:09:32,851181601004171,SDK-XJ_a05d788b2b172e383e7bab162c2c1e77,39.719525,98.488131,2018-01-20 23:45:29,3,5,...,0,1,0,1,39.72,98.488,39.717,98.489,"39.72,98.488","39.717,98.489"
4,25.488169,105.354194,2018-01-20 12:02:40,851181601028851,SDK-XJ_2e5f00055146b0548726578fbaa62426,25.478436,105.366651,2018-01-20 11:56:29,3,5,...,0,1,0,1,25.478,105.367,25.488,105.354,"25.478,105.367","25.488,105.354"
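
The rounding and concatenation steps can be sketched for a single coordinate pair as follows. The helper name `place_key` is hypothetical. As a rough guide, 0.001 degrees of latitude is about 111 m, so rounding to 3 decimals groups nearby points into cells of roughly that size (narrower in longitude at higher latitudes).

```python
def place_key(lat, lon, digits=3):
    """Round a coordinate pair and join it into a 'lat,lon' string key."""
    return "{},{}".format(round(lat, digits), round(lon, digits))

# start and end points of the first record in the table
print(place_key(33.783415, 111.60366))
print(place_key(33.779811, 111.605885))
```

Trailing zeros are dropped when the rounded float is converted to a string, which is why keys such as "33.78,111.606" appear in the `end_place` column.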


#### Define the distance function

Because our data consists of latitude and longitude (a geographic coordinate system), we cannot use Euclidean distance to measure how far apart two points are; we need to take the shape of the earth into account.

- [Geographic_coordinate_system](https://en.wikipedia.org/wiki/Geographic_coordinate_system)
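
For comparison, a simpler way to respect the curvature of the earth is the haversine formula, which treats the earth as a sphere. The sketch below is not the function used in this notebook (`getDistance` additionally corrects for the earth's flattening); the name `haversine_m` and the mean radius constant are assumptions made for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean earth radius in meters (spherical approximation)

def haversine_m(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in meters between two points given in degrees."""
    phi_a, phi_b = radians(lat_a), radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = radians(lon_b - lon_a)
    h = sin(d_phi / 2) ** 2 + cos(phi_a) * cos(phi_b) * sin(d_lambda / 2) ** 2
    # clamp to 1.0 to guard against rounding error before asin
    return 2 * EARTH_RADIUS_M * asin(min(1.0, sqrt(h)))

# one degree of latitude is roughly 111 km
print(round(haversine_m(0.0, 0.0, 1.0, 0.0)))
```

On short trips within a city the spherical and ellipsoidal results are usually close, so the haversine value can serve as a quick sanity check.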

In [22]:
# calculate the distance on the earth's surface, based on lat and lon (in degrees)
def getDistance(latA, lonA, latB, lonB):
    ra = 6378140  # equatorial radius, in meters
    rb = 6356755  # polar radius, in meters
    flatten = (ra - rb) / ra  # flattening of the earth
    # convert degrees to radians
    radLatA = radians(latA)
    radLonA = radians(lonA)
    radLatB = radians(latB)
    radLonB = radians(lonB)

    try:
        pA = atan(rb / ra * tan(radLatA))
        pB = atan(rb / ra * tan(radLatB))
        x = acos(sin(pA) * sin(pB) + cos(pA) * cos(pB) * cos(radLonA - radLonB))
        c1 = (sin(x) - x) * (sin(pA) + sin(pB))**2 / cos(x / 2)**2
        c2 = (sin(x) + x) * (sin(pA) - sin(pB))**2 / sin(x / 2)**2
        dr = flatten / 8 * (c1 - c2)
        distance = ra * (x + dr)
        return distance  # meters
    except (ValueError, ZeroDivisionError):
        # acos can fail on rounding error, and sin(x / 2) is zero for identical points;
        # both cases are treated as a negligible distance
        return 0.0000001

#### SAMPLE

Each vehicle's data lies within a city, so we select the vehicles that have at least 10 records.

We focus on these active vehicles.

In [37]:
# keep the out_id values that appear at least 10 times
df_data = df_data[df_data.out_id.isin(test_most_list)].copy()

In [38]:
df_data.head(2)

Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time,week,week_day,...,holiday_y,dayOff_y,holiday,dayOff,start_lat_1,start_lon_1,end_lat_2,end_lon_2,start_place,end_place
1,39.720161,98.488424,2018-01-20 02:21:39,851181601004171,SDK-XJ_a7155fe411bc3fe5bfacb3137c443dde,39.728298,98.496293,2018-01-20 01:54:42,3,5,...,0,1,0,1,39.728,98.496,39.72,98.488,"39.728,98.496","39.72,98.488"
2,39.717471,98.489098,2018-01-20 22:13:50,851181601004171,SDK-XJ_1fd9ae6c8b9a695f9f9b9a9afd4d7028,39.717927,98.488435,2018-01-20 21:49:44,3,5,...,0,1,0,1,39.718,98.488,39.717,98.489,"39.718,98.488","39.717,98.489"


After this selection, 2881 vehicles remain in total.

In [39]:
# total number of distinct vehicles (out_id)
len(df_data.out_id.unique())

2881

In [26]:
total_sample = df_data.shape[0]
cut_sample = int(df_data.shape[0]*0.75)
train_set = df_data.iloc[:cut_sample,:].copy()
test_set = df_data.iloc[cut_sample:,:].copy()

In [27]:
cut_sample

668422

In [28]:
train_set.shape

(668422, 21)

In [29]:
test_set.shape

(222808, 21)

In [30]:
df_data.shape

(891230, 21)

In [49]:
# calculate the center of a cluster
def get_centermost_point(cluster):
    # centroid of the cluster's points, returned as (x, y); here x is lon and y is lat
    centroid = MultiPoint(cluster).centroid
    #centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return (centroid.x, centroid.y)

In [40]:
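# density cluster
# note: scikit-learn's haversine metric treats the first column as latitude and the
# second as longitude; the columns are kept here in the original (lon, lat) order
coords=train_set[['end_lon','end_lat']].values
kms_per_radian = 6371.0088
epsilon = 0.02 / kms_per_radian  # 20 m neighborhood radius, expressed in radians
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))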

In [41]:
cluster_labels = db.labels_
cluster_labels

array([     0,      1,      2, ..., 483861,  51339,  51339], dtype=int64)
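
With `min_samples=1` every point counts as a core point, so DBSCAN has no noise label and simply groups points that are connected through chains of neighbors within `eps`. The plain-Python sketch below illustrates that behavior on three end points taken from the data. It is an O(n²) illustration only, the names `haversine_rad` and `link_clusters` are hypothetical, and it takes points in (lat, lon) order.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088

def haversine_rad(p, q):
    """Great-circle distance in radians between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(min(1.0, sqrt(h)))

def link_clusters(points, eps_km=0.02):
    """Label points so that any two points within eps_km share a cluster (transitively)."""
    eps = eps_km / EARTH_RADIUS_KM  # convert kilometers to radians
    parent = list(range(len(points)))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_rad(points[i], points[j]) <= eps:
                parent[find(j)] = find(i)

    # renumber the roots as 0, 1, 2, ... in order of first appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

ends = [(34.813136, 115.559243), (34.813141, 115.559217), (34.786126, 115.874361)]
print(link_clusters(ends))
```

The first two points lie a few meters apart and end up in the same cluster, while the third is tens of kilometers away and gets its own label.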

In [45]:
# set cluster_labels as end_location_id
train_set['end_location_id'] = cluster_labels
train_set.head()

Unnamed: 0,r_key,out_id,start_time,end_time,start_lat,start_lon,end_lat,end_lon,end_location_id
0,SDK-XJ_609994b4d50a8a07a64d41d1f70bbb05,2016061820000b,2018-01-20 10:13:43,2018-01-20 10:19:04,33.783415,111.60366,33.779811,111.605885,0
1,SDK-XJ_4c2f29d94c9478623711756e4ae34cc5,2016061820000b,2018-02-12 17:40:51,2018-02-12 17:58:13,34.810763,115.549264,34.814875,115.549374,1
2,SDK-XJ_3570183177536a575b9da67a86efcd62,2016061820000b,2018-02-13 14:52:24,2018-02-13 15:24:33,34.640284,115.539024,34.813136,115.559243,2
3,SDK-XJ_78d749a376e190685716a51a6704010b,2016061820000b,2018-02-13 17:23:08,2018-02-13 17:39:02,34.81828,115.542039,34.813141,115.559217,2
4,SDK-XJ_3b249941c27834f5e43d43a9114e4909,2016061820000b,2018-02-13 18:06:02,2018-02-13 19:02:51,34.813278,115.55926,34.786126,115.874361,3


In [46]:
num_clusters = len(set(cluster_labels))

In [47]:
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: {}'.format(num_clusters))

Number of clusters: 483862


In [50]:
centermost_points = clusters.map(get_centermost_point)
lons, lats = zip(*centermost_points)
rep_points = pd.DataFrame({'lon_center':lons, 'lat_center':lats}).reset_index()
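
The centroid of a set of points is the arithmetic mean of their coordinates. A minimal sketch of the representative-point step without shapely, assuming points are given as (lon, lat) pairs and using a hypothetical `centroid` helper and made-up cluster labels:

```python
def centroid(points):
    """Arithmetic mean of (lon, lat) pairs, returned as (lon_center, lat_center)."""
    lons, lats = zip(*points)
    return (sum(lons) / len(lons), sum(lats) / len(lats))

# two small clusters of (lon, lat) end points, keyed by cluster label
clusters = {
    0: [(115.559243, 34.813136), (115.559217, 34.813141)],
    1: [(115.874361, 34.786126)],
}
rep_points = {label: centroid(pts) for label, pts in clusters.items()}
print(rep_points)
```

A predicted cluster label can then be turned back into coordinates by looking it up in `rep_points`.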

In [35]:
X_train = train_set[['start_lat','start_lon','week_day', 'hour']]
y_train = train_set['end_location_id']
X_test = test_set[['start_lat','start_lon','week_day', 'hour']].copy()

In [None]:
train_data = xgb.DMatrix(data=X_train, label=y_train)
# the held-out rows have no cluster labels, so the test matrix is built without a label
test_data = xgb.DMatrix(data=X_test)
watchlist = [(train_data, 'train'), (test_data, 'test')]
param = {'objective':'multi:softmax', 'num_class':num_clusters,'max_depth': 6,'silent':1}
xgb_pars = param  # same settings under the name used in the commented-out call

In [None]:
model = xgb.train(param,train_data)
# model = xgb.train(xgb_pars, train_data, 10, watchlist, early_stopping_rounds=2, maximize=False, verbose_eval=1)
# best_score is only available after training with an evaluation set and early stopping
# print('Modeling RMSLE %.5f' % model.best_score)

In [None]:
# train_data = xgb.DMatrix(data=X_train,label=y_train)
# param = {'objective':'multi:softmax', 'num_class':num_clusters,'max_depth': 6,'silent':1}
# model = xgb.train(param,train_data)

In [None]:
X_test["pred_label"] = model.predict(test_data)

In [None]:
X_test.merge(rep_points,left_on="pred_label",right_on="index")

# Add more features

In [50]:
train_part = df_data[df_data.start_time.dt.month < 6].copy()

In [51]:
def count_train(unit_train_part):
    # number of trips from each rounded start location, as a (start_place, start_count) table
    return unit_train_part.start_place.value_counts().rename_axis("start_place").reset_index(name="start_count")

In [52]:
count_train_part = train_part.groupby("out_id").apply(count_train).reset_index().drop("level_1",axis=1)

In [53]:
df_data = df_data.merge(count_train_part,how="left").fillna(0.0)

In [54]:
df_data.head()

Unnamed: 0,end_lat,end_lon,end_time,out_id,r_key,start_lat,start_lon,start_time,week,week_day,...,calendar_date,holiday,dayOff,start_lat_1,start_lon_1,end_lat_2,end_lon_2,start_place,end_place,start_count
0,39.720161,98.488424,2018-01-20 02:21:39,851181601004171,SDK-XJ_a7155fe411bc3fe5bfacb3137c443dde,39.728298,98.496293,2018-01-20 01:54:42,3,5,...,2018-01-20,0,1,39.728,98.496,39.72,98.488,"39.728,98.496","39.72,98.488",1.0
1,39.717471,98.489098,2018-01-20 22:13:50,851181601004171,SDK-XJ_1fd9ae6c8b9a695f9f9b9a9afd4d7028,39.717927,98.488435,2018-01-20 21:49:44,3,5,...,2018-01-20,0,1,39.718,98.488,39.717,98.489,"39.718,98.488","39.717,98.489",11.0
2,39.717432,98.489051,2018-01-21 00:09:32,851181601004171,SDK-XJ_a05d788b2b172e383e7bab162c2c1e77,39.719525,98.488131,2018-01-20 23:45:29,3,5,...,2018-01-20,0,1,39.72,98.488,39.717,98.489,"39.72,98.488","39.717,98.489",6.0
3,23.369518,103.405644,2018-01-20 20:57:04,861021508004521,SDK-XJ_97838d2475985c51d083cf5744c14d34,23.346433,103.408263,2018-01-20 20:29:45,3,5,...,2018-01-20,0,1,23.346,103.408,23.37,103.406,"23.346,103.408","23.37,103.406",1.0
4,43.970747,87.574997,2018-01-20 10:12:52,861021509015321,SDK-XJ_1e789ab3ca2ce2c0942dbe4f7842cc5d,43.945658,87.590479,2018-01-20 10:06:51,3,5,...,2018-01-20,0,1,43.946,87.59,43.971,87.575,"43.946,87.59","43.971,87.575",18.0
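
The `start_count` feature is the number of earlier trips the same vehicle made from the same rounded start location, counted only on the historical part of the data (months before June) and set to 0 for unseen pairs. A plain-Python sketch of that idea on made-up records; the function name `add_start_count` and the sample vehicle ids are hypothetical:

```python
from collections import Counter

def add_start_count(records, is_history):
    """Attach to each record the number of historical trips by the same vehicle
    from the same start place; unseen (vehicle, place) pairs get 0.0."""
    counts = Counter(
        (r["out_id"], r["start_place"]) for r in records if is_history(r)
    )
    return [
        dict(r, start_count=float(counts.get((r["out_id"], r["start_place"]), 0)))
        for r in records
    ]

records = [
    {"out_id": "v1", "start_place": "39.718,98.488", "month": 1},
    {"out_id": "v1", "start_place": "39.718,98.488", "month": 3},
    {"out_id": "v1", "start_place": "39.72,98.488", "month": 7},
    {"out_id": "v2", "start_place": "39.718,98.488", "month": 2},
]
enriched = add_start_count(records, is_history=lambda r: r["month"] < 6)
print([r["start_count"] for r in enriched])
```

Counting only on the historical months keeps information from the later records out of the feature.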


# Train and Predict

#### At this step, we first assign each point that we want to predict to its 'frequent vehicle', and then use that vehicle's model to make the prediction

#### The features used here are ('start_lat','start_lon','week_day', 'hour','start_count',"holiday","dayOff")

In [55]:
def get_centermost_point(cluster):
    # centroid of the cluster's points, returned as (x, y); here x is lon and y is lat
    centroid = MultiPoint(cluster).centroid
    #centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return (centroid.x, centroid.y)

def train_model(df_data):
    # records before September are used for training; September and October records are predicted
    train_set = df_data[df_data.start_time.dt.month < 9 ].copy()
    test_set = df_data[(df_data.start_time.dt.month >= 9)&(df_data.start_time.dt.month <11)].copy()
    test_set = test_set.drop("out_id",axis=1).reset_index(drop=True)

    if test_set.shape[0] > 0 :
        # cluster this vehicle's destinations; each cluster becomes one class label
        coords=train_set[['end_lon','end_lat']].values
        kms_per_radian = 6371.0088
        epsilon = 0.02 / kms_per_radian  # 20 m neighborhood radius, expressed in radians
        db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
        cluster_labels = db.labels_
        train_set['end_location_id'] = cluster_labels
        num_clusters = len(set(cluster_labels))
        clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
        print('Number of clusters: {}'.format(num_clusters))
        num_digits = len(train_set.end_place.unique())
        print('3 digits: {}'.format(num_digits))

        # one representative point (centroid) per cluster
        centermost_points = clusters.map(get_centermost_point)
        lons, lats = zip(*centermost_points)
        rep_points = pd.DataFrame({'lon_center':lons, 'lat_center':lats}).reset_index()

        X_train = train_set[['start_lat','start_lon','week_day', 'hour','start_count',"holiday","dayOff"]]
        y_train = train_set['end_location_id']
        X_test = test_set[['start_lat','start_lon','week_day', 'hour','start_count',"holiday","dayOff"]].copy()

        train_data = xgb.DMatrix(data=X_train,label=y_train)
        param = {"eta":0.1,'objective':'multi:softmax', 'num_class':num_clusters,'silent':1,'max_depth':4,'min_child_weight': 0.8}
        # train this vehicle's model before predicting
        model = xgb.train(param, train_data)

        test_data = xgb.DMatrix(data=X_test)
        test_set["pred_label"] = model.predict(test_data)
        # map each predicted cluster label back to the cluster's center coordinates
        pred_result = test_set.merge(rep_points,left_on="pred_label",right_on="index")[["r_key","lat_center","lon_center"]]

        return pred_result

In [None]:
test_result = df_data.groupby('out_id').apply(train_model).reset_index()

Number of clusters: 116
3 digits: 102
Number of clusters: 116
3 digits: 102
Number of clusters: 111
3 digits: 107
Number of clusters: 122
3 digits: 109
Number of clusters: 81
3 digits: 74
Number of clusters: 61
3 digits: 71
Number of clusters: 121
3 digits: 128
Number of clusters: 94
3 digits: 87
Number of clusters: 139
3 digits: 133
Number of clusters: 214
3 digits: 216
Number of clusters: 118
3 digits: 118
Number of clusters: 154
3 digits: 162
Number of clusters: 77
3 digits: 72
Number of clusters: 169
3 digits: 149
Number of clusters: 12
3 digits: 14
Number of clusters: 62
3 digits: 66
Number of clusters: 11
3 digits: 11
Number of clusters: 71
3 digits: 75
Number of clusters: 130
3 digits: 120
Number of clusters: 98
3 digits: 89
Number of clusters: 67
3 digits: 61
Number of clusters: 103
3 digits: 98
Number of clusters: 52
3 digits: 53
Number of clusters: 75
3 digits: 66
Number of clusters: 42
3 digits: 44
Number of clusters: 55
3 digits: 57
Number of clusters: 165
3 digits: 144
...

Number of clusters: 101
3 digits: 91
Number of clusters: 99
3 digits: 99
Number of clusters: 123
3 digits: 128
Number of clusters: 164
3 digits: 147
Number of clusters: 121
3 digits: 117
Number of clusters: 58
3 digits: 66
Number of clusters: 155
3 digits: 134
Number of clusters: 228
3 digits: 207
Number of clusters: 134
3 digits: 117
Number of clusters: 193
3 digits: 159
Number of clusters: 98
3 digits: 104
Number of clusters: 38
3 digits: 43
Number of clusters: 58
3 digits: 59
Number of clusters: 40
3 digits: 49
Number of clusters: 108
3 digits: 103
Number of clusters: 103
3 digits: 80
Number of clusters: 104
3 digits: 111
Number of clusters: 87
3 digits: 93
Number of clusters: 72
3 digits: 78
Number of clusters: 100
3 digits: 96
Number of clusters: 169
3 digits: 158
Number of clusters: 72
3 digits: 67
Number of clusters: 35
3 digits: 35
Number of clusters: 103
3 digits: 95
Number of clusters: 44
3 digits: 47
Number of clusters: 137
3 digits: 138
Number of clusters: 74
3 digits: 72
...

Number of clusters: 123
3 digits: 117
Number of clusters: 77
3 digits: 70
Number of clusters: 96
3 digits: 96
Number of clusters: 124
3 digits: 117
Number of clusters: 177
3 digits: 170
Number of clusters: 61
3 digits: 60
Number of clusters: 33
3 digits: 33
Number of clusters: 54
3 digits: 57
Number of clusters: 173
3 digits: 173
Number of clusters: 53
3 digits: 54
Number of clusters: 37
3 digits: 37
Number of clusters: 101
3 digits: 91
Number of clusters: 88
3 digits: 81
Number of clusters: 55
3 digits: 51
Number of clusters: 67
3 digits: 69
Number of clusters: 232
3 digits: 222
Number of clusters: 94
3 digits: 87
Number of clusters: 48
3 digits: 43
Number of clusters: 91
3 digits: 75
Number of clusters: 120
3 digits: 118
Number of clusters: 47
3 digits: 46
Number of clusters: 229
3 digits: 204
Number of clusters: 72
3 digits: 79
Number of clusters: 135
3 digits: 131
Number of clusters: 86
3 digits: 80
Number of clusters: 94
3 digits: 89
Number of clusters: 127
3 digits: 133
...

Number of clusters: 77
3 digits: 79
Number of clusters: 81
3 digits: 75
Number of clusters: 78
3 digits: 77
Number of clusters: 67
3 digits: 65
Number of clusters: 120
3 digits: 118
Number of clusters: 80
3 digits: 73
Number of clusters: 59
3 digits: 54
Number of clusters: 85
3 digits: 90
Number of clusters: 316
3 digits: 283
Number of clusters: 190
3 digits: 171
Number of clusters: 47
3 digits: 45
Number of clusters: 116
3 digits: 112
Number of clusters: 75
3 digits: 69
Number of clusters: 130
3 digits: 123
Number of clusters: 75
3 digits: 77
Number of clusters: 196
3 digits: 180
Number of clusters: 73
3 digits: 72
Number of clusters: 108
3 digits: 107
Number of clusters: 46
3 digits: 50
Number of clusters: 89
3 digits: 90
Number of clusters: 138
3 digits: 120
Number of clusters: 186
3 digits: 175
Number of clusters: 81
3 digits: 85
Number of clusters: 25
3 digits: 27
Number of clusters: 73
3 digits: 76
Number of clusters: 61
3 digits: 60
Number of clusters: 53
3 digits: 56
...

#### Save the result and test the score

In [None]:
test_result_final = test_result[["r_key","lat_center","lon_center"]]

In [None]:
test_result['distance'] = test_result.apply(lambda x:getDistance(x[3],x[4],x[5],x[6]),axis = 1)
# test_result.head()

In [None]:
a=test_result.merge(df_data)

In [None]:
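# distance in meters between the predicted cluster center and the recorded end point
a['distance'] = a.apply(lambda x: getDistance(x['lat_center'], x['lon_center'], x['end_lat'], x['end_lon']), axis=1)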

In [12]:
# get_score is not defined in this notebook; it must already be available in the session
a['score'] = a['distance'].apply(get_score)
a.score.mean()

0.4180557


In [None]:
test_result['score'] = test_result.apply(lambda x:get_score(x[7]),axis = 1)

### Finally, after testing, the best result we got is 0.392666 (maxd=3), and we ranked 11th in this competition

In [4]:
test_result.score.mean()

0.392666  maxd=3


In [None]:
#0.392666   maxd=3  
#0.41398   maxd=5
#0.412666   maxd=3  
#0.41398   maxd=5

In [None]:
# 0.40447  maxd=3 allfea 
# 0.40464834 maxd=4 allfea 
# 0.403428174 eta=0.1 maxd=4 allfea  #'min_child_weight': 0.6 0.4031971673  #'min_child_weight': 0.8 0.4029715 #'scale_pos_weight': 3
# 0.4038698985  eta=0.1 maxd=3 allfea 
# basefea maxd=4 0.4062949

#### Citations
*  https://github.com/carlosbkm/car-destination-prediction 
*  http://www.cs.cmu.edu/~coral/old/publinks/brettb/06itsc-driver-intent.pdf
*  https://www.nbcnews.com/mach/science/elon-musk-says-tesla-s-ai-will-let-cars-predict- ncna813211 
*  https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
*  https://www.fastcompany.com/3035350/uber-can-now-predict-where-youre-going-before-you-get-in-the-car 
*  https://pdfs.semanticscholar.org/fd0f/98aa362d6b86f02a28c571e5c40d8d8ff65d.pdf  
*  http://people.cs.aau.dk/~csj/Papers/Files/2006_brilingaiteITSS.pdf 
*  https://www.hindawi.com/journals/mpe/2015/824532/ 