# Dataset Description


Students will have access to a data file, data_train.csv, that contains the anonymized geolocation
data of multiple mobile devices in the City of Atlanta (US) for 11 working days in October 2018.


The devices’ IDs reset every 24 hours, so you will not be able to trace the same device
across different days. Every device ID therefore represents a 1-day journey.


Each journey is formed by several trajectories. A trajectory is defined as the route of a moving
person in a straight line with an entry and an exit point. See an example below of one trajectory
from one of the devices:


![Image sample plot data](data_desc/data_graph.PNG)


As you can see, trajectories are a simplification of the real path of a person.
A trajectory ends when a person stops moving and stays in the same place for a while, or when
the device stops recording for some time.


For each device you will get multiple trajectories. The set of all trajectories of a device
represents a simplification of the journey of one person for 24 hours. The graphic below shows
a full journey of a device. 


![Image sample plot data direction](data_desc/data_graph_1.PNG)


**It is important to note that the number of trajectories varies from device to device.**

Trajectories are separated. In the graph, this separation is shown as a dotted line between the
exit point of a trajectory and the entry point of the next one. These dotted lines represent blind
parts of the journey where the device did not record the location.
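
As a rough sketch of how this per-device structure might be inspected, the snippet below counts the trajectories of each device and measures the blind gap between consecutive trajectories. It runs on a tiny hand-made frame that borrows the dataset's column names (`hash`, `trajectory_id`, `time_entry`, `time_exit`); the device names, the times, and the helper columns `t_entry`, `t_exit` and `gap` are made up for illustration.

```python
import pandas as pd

# Toy data only: two hypothetical devices with made-up times.
toy = pd.DataFrame({
    "hash": ["dev_a", "dev_a", "dev_a", "dev_b"],
    "trajectory_id": ["traj_dev_a_0", "traj_dev_a_1", "traj_dev_a_2", "traj_dev_b_0"],
    "time_entry": ["07:04:31", "07:20:34", "07:53:32", "09:00:00"],
    "time_exit": ["07:08:32", "07:25:42", "08:03:25", "09:10:00"],
})

# Parse the HH:MM:SS strings as offsets from midnight.
toy["t_entry"] = pd.to_timedelta(toy["time_entry"])
toy["t_exit"] = pd.to_timedelta(toy["time_exit"])

# Number of trajectories per device (it differs from device to device).
traj_counts = toy.groupby("hash")["trajectory_id"].count()

# Blind gap: time between the previous trajectory's exit and this one's entry.
# The first trajectory of each device has no predecessor, so its gap is NaT.
toy = toy.sort_values(["hash", "t_entry"])
toy["gap"] = toy["t_entry"] - toy.groupby("hash")["t_exit"].shift()

print(traj_counts)
print(toy[["hash", "trajectory_id", "gap"]])
```

The same two steps should carry over to the real training frame, provided its time columns are plain `HH:MM:SS` strings as in the preview.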

# Dataset Details

There are approximately 210,000 devices and 11 columns in the database. You will receive
these records separated into two datasets:
- A train dataset (data_train.csv)
- A test dataset (data_test.csv) 

The train dataset contains 80% of the records, while the test dataset contains 20%. The test
dataset will then be split into public and private datasets. 


The variables in the dataset are as follows:

![Image sample variable dataset](data_desc/table.PNG)

**All the data related to time is shown in Atlanta’s local time (Eastern Time).**

# The challenge

You must predict how many people are in the city center between 15:00 and 16:00.
The test dataset contains a number of devices whose trajectories after 15:00 have been
removed, all but one: after 15:00, you will find one last trajectory with (1) an entry location,
(2) an entry time, and (3) an exit time that is between 15:00 and 16:00. The exit point of this
last trajectory has been removed.


***Your task is to predict the location of this last exit point and whether this device is within
the city center or not. The target variable is the latter.***

*See the graphic example below*


![Image Challenge](data_desc/data_plot_sample.PNG)
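
One way to isolate the rows to predict is sketched below. It assumes that a removed exit point appears as missing `x_exit` and `y_exit` values, which is what the `test` preview further down suggests. The frame `toy_test` and its values are made up, and treating both 15:00 and 16:00 as inclusive is an assumption, since the task text only says "between".

```python
import pandas as pd

# Toy rows mimicking the test file layout; all values are made up.
toy_test = pd.DataFrame({
    "hash": ["dev_a", "dev_a", "dev_b"],
    "trajectory_id": ["traj_dev_a_0", "traj_dev_a_1", "traj_dev_b_0"],
    "time_entry": ["13:25:33", "15:03:32", "14:14:36"],
    "time_exit": ["13:43:13", "15:10:32", "14:14:36"],
    "x_entry": [3773385.0, 3773118.0, 3761776.0],
    "y_entry": [-19113440.0, -19144900.0, -19357720.0],
    "x_exit": [3773131.0, None, 3761776.0],
    "y_exit": [-19144650.0, None, -19357720.0],
})

# Exit time as an offset from midnight, then the 15:00-16:00 window check.
exit_time = pd.to_timedelta(toy_test["time_exit"])
in_window = (exit_time >= pd.Timedelta(hours=15)) & (exit_time <= pd.Timedelta(hours=16))

# Assumption: a target row is one whose exit coordinates are missing.
is_target = toy_test["x_exit"].isna() & toy_test["y_exit"].isna()

targets = toy_test[is_target]    # rows whose exit point must be predicted
history = toy_test[~is_target]   # earlier trajectories, usable as features

print(targets[["trajectory_id", "time_entry", "time_exit"]])
```

Checking `in_window[is_target].all()` on the real file is a cheap sanity test that every row with a missing exit point really falls inside the target hour.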


**After you estimate the position of each target, you will have to classify that point based on
whether it is located inside the city center or not. To do so, you will have to implement a rule
that outlines the limits of the city center of Atlanta (decimal point “.”):**  
<br>
<br>
<br>
<center><b>3750901.5068 ≤ 𝑥 ≤ 3770901.5068</b></center>
<center><b>−19268905.6133 ≤ 𝑦 ≤ −19208905.6133</b></center>


**You will need to classify each exit point as within (1) or outside (0) the limits of the
city center. For example, Device 1 in the graph is in the city center between 15:00 and 16:00,
so Device 1 will be classified with a “1”, while Device 2, which is not in the city center
during that time window, will be classified with a “0”.**
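
The rule above can be written as a small helper. This is a minimal sketch: the function name `in_city_center` and the constant names are hypothetical, and the bounds are copied from the inequalities above and treated as inclusive on both sides.

```python
# Bounds copied from the city-center rule; the constant names are illustrative.
X_MIN, X_MAX = 3750901.5068, 3770901.5068
Y_MIN, Y_MAX = -19268905.6133, -19208905.6133


def in_city_center(x: float, y: float) -> int:
    """Return 1 if the point (x, y) lies inside the city-center rectangle, else 0."""
    return int(X_MIN <= x <= X_MAX and Y_MIN <= y <= Y_MAX)


# Made-up points: the first falls inside the rectangle, the second lies west of it.
print(in_city_center(3760000.0, -19230000.0))  # 1
print(in_city_center(3744909.0, -19285580.0))  # 0
```

For a whole DataFrame column, the same test can be vectorized with `Series.between`, whose bounds are inclusive by default.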


**Some trajectories may “cross” the city center, but their exit point will be outside the city
center. See the example of Device 3 (D3) in the graph below. For the sake of simplicity, these
trajectories are considered outside the center, given that we only consider if the exit point is
within boundaries or not.**


![Image Challenge 1](data_desc/data_plot_sample_1.PNG)




# Data Sample Prediction Result

**After classifying each of the targets, you will have to submit your results in the following format:**  

*trajectory_id, city_center*  

The trajectory id identifies the last trajectory of a device and the city center identifies the
location of that point. Here’s an example of how a submission would look:  

![Image Result](data_desc/sample_tranjectory_id.PNG)

Trajectory “123df5” ends in the center, while trajectory “345rgf” does not. 
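
A submission file in this shape could be assembled as sketched below. The sample file loaded at the end of this notebook uses the headers `id` and `target`, so the sketch follows that; which header names the grader actually expects is an assumption. The `predictions` dictionary is hypothetical and reuses the two example IDs.

```python
import io

import pandas as pd

# Hypothetical predictions: trajectory id -> 1 (in the center) or 0 (outside).
predictions = {"123df5": 1, "345rgf": 0}

submission = pd.DataFrame({
    "id": list(predictions.keys()),
    "target": list(predictions.values()),
})

# Written to an in-memory buffer here; pass a file path instead for a real submission.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing `index=False` keeps pandas from writing its row index as an extra unnamed column.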

# Additional Task 

In addition to the main task above, there are several additional tasks:
- Calculate the F1 score
- Show the confusion matrix
- Show the tuning (find the best parameters for the model)
- Describe the insights you gained from this task
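
For the first two items, the sketch below computes the confusion-matrix counts and the F1 score by hand for binary 0/1 labels. The function names and the label lists are made up for illustration. scikit-learn's `sklearn.metrics.confusion_matrix` and `sklearn.metrics.f1_score` offer library equivalents, but the hand-rolled version is used here to keep the example dependency-free.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 means 'in the city center'."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def f1_score_binary(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives at all."""
    _, fp, fn, tp = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print([[tn, fp], [fn, tp]])             # rows: actual 0/1, columns: predicted 0/1
print(f1_score_binary(y_true, y_pred))
```

Because the true exit points of the test targets are hidden, these metrics would have to be computed on a validation set carved out of the training data.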

In [1]:
import pandas as pd

In [9]:
train = pd.read_csv('dataset/data_train.csv')
test = pd.read_csv('dataset/data_test.csv')
sample_prediction = pd.read_csv('dataset/sample.csv')

In [10]:
train

Unnamed: 0.1,Unnamed: 0,hash,trajectory_id,time_entry,time_exit,vmax,vmin,vmean,x_entry,y_entry,x_exit,y_exit
0,0,0000a8602cf2def930488dee7cdad104_1,traj_0000a8602cf2def930488dee7cdad104_1_0,07:04:31,07:08:32,,,,3.751014e+06,-1.909398e+07,3.750326e+06,-1.913634e+07
1,1,0000a8602cf2def930488dee7cdad104_1,traj_0000a8602cf2def930488dee7cdad104_1_1,07:20:34,07:25:42,,,,3.743937e+06,-1.932247e+07,3.744975e+06,-1.931966e+07
2,2,0000a8602cf2def930488dee7cdad104_1,traj_0000a8602cf2def930488dee7cdad104_1_2,07:53:32,08:03:25,,,,3.744868e+06,-1.929356e+07,3.744816e+06,-1.929284e+07
3,3,0000a8602cf2def930488dee7cdad104_1,traj_0000a8602cf2def930488dee7cdad104_1_3,08:17:50,08:37:23,,,,3.744880e+06,-1.929229e+07,3.744809e+06,-1.929049e+07
4,4,0000a8602cf2def930488dee7cdad104_1,traj_0000a8602cf2def930488dee7cdad104_1_4,14:38:09,14:38:09,,,,3.744909e+06,-1.928558e+07,3.744909e+06,-1.928558e+07
...,...,...,...,...,...,...,...,...,...,...,...,...
814257,814257,ffffc6359725f0e1feac9ef1872ab207_11,traj_ffffc6359725f0e1feac9ef1872ab207_11_4,02:21:11,02:21:11,,,,3.744666e+06,-1.925679e+07,3.744666e+06,-1.925679e+07
814258,814258,ffffc6359725f0e1feac9ef1872ab207_11,traj_ffffc6359725f0e1feac9ef1872ab207_11_5,06:02:17,06:02:17,,,,3.744732e+06,-1.925614e+07,3.744732e+06,-1.925614e+07
814259,814259,ffffc6359725f0e1feac9ef1872ab207_11,traj_ffffc6359725f0e1feac9ef1872ab207_11_7,09:52:13,09:52:13,,,,3.744666e+06,-1.925679e+07,3.744666e+06,-1.925679e+07
814260,814260,ffffc6359725f0e1feac9ef1872ab207_11,traj_ffffc6359725f0e1feac9ef1872ab207_11_8,14:20:26,14:27:15,,,,3.741043e+06,-1.929051e+07,3.741057e+06,-1.928936e+07


In [11]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814262 entries, 0 to 814261
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Unnamed: 0     814262 non-null  int64  
 1   hash           814262 non-null  object 
 2   trajectory_id  814262 non-null  object 
 3   time_entry     814262 non-null  object 
 4   time_exit      814262 non-null  object 
 5   vmax           256769 non-null  float64
 6   vmin           256769 non-null  float64
 7   vmean          270778 non-null  float64
 8   x_entry        814262 non-null  float64
 9   y_entry        814262 non-null  float64
 10  x_exit         814262 non-null  float64
 11  y_exit         814262 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 74.5+ MB


In [12]:
test

Unnamed: 0.1,Unnamed: 0,hash,trajectory_id,time_entry,time_exit,vmax,vmin,vmean,x_entry,y_entry,x_exit,y_exit
0,0,00032f51796fd5437b238e3a9823d13d_31,traj_00032f51796fd5437b238e3a9823d13d_31_0,11:43:17,11:50:17,,,,3.773413e+06,-1.909828e+07,3.773111e+06,-1.914508e+07
1,1,00032f51796fd5437b238e3a9823d13d_31,traj_00032f51796fd5437b238e3a9823d13d_31_2,12:21:37,12:21:37,0.0,0.0,0.0,3.773199e+06,-1.914354e+07,3.773199e+06,-1.914354e+07
2,2,00032f51796fd5437b238e3a9823d13d_31,traj_00032f51796fd5437b238e3a9823d13d_31_3,12:34:27,13:14:11,,,,3.763760e+06,-1.921342e+07,3.771757e+06,-1.911092e+07
3,3,00032f51796fd5437b238e3a9823d13d_31,traj_00032f51796fd5437b238e3a9823d13d_31_4,13:25:33,13:43:13,,,,3.773385e+06,-1.911344e+07,3.773131e+06,-1.914465e+07
4,4,00032f51796fd5437b238e3a9823d13d_31,traj_00032f51796fd5437b238e3a9823d13d_31_5,15:03:32,15:10:32,,,,3.773118e+06,-1.914490e+07,,
...,...,...,...,...,...,...,...,...,...,...,...,...
202932,202932,fff9552047b095e8242b4913f3289a26_25,traj_fff9552047b095e8242b4913f3289a26_25_3,11:23:33,11:23:33,,,,3.762713e+06,-1.935493e+07,3.762713e+06,-1.935493e+07
202933,202933,fff9552047b095e8242b4913f3289a26_25,traj_fff9552047b095e8242b4913f3289a26_25_4,12:12:10,12:12:10,,,,3.761040e+06,-1.935274e+07,3.761040e+06,-1.935274e+07
202934,202934,fff9552047b095e8242b4913f3289a26_25,traj_fff9552047b095e8242b4913f3289a26_25_5,13:08:14,13:12:01,,,,3.762680e+06,-1.935570e+07,3.762683e+06,-1.935529e+07
202935,202935,fff9552047b095e8242b4913f3289a26_25,traj_fff9552047b095e8242b4913f3289a26_25_6,14:14:36,14:14:36,,,,3.761776e+06,-1.935772e+07,3.761776e+06,-1.935772e+07


In [13]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 202937 entries, 0 to 202936
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Unnamed: 0     202937 non-null  int64  
 1   hash           202937 non-null  object 
 2   trajectory_id  202937 non-null  object 
 3   time_entry     202937 non-null  object 
 4   time_exit      202937 non-null  object 
 5   vmax           54705 non-null   float64
 6   vmin           54705 non-null   float64
 7   vmean          57359 non-null   float64
 8   x_entry        202937 non-null  float64
 9   y_entry        202937 non-null  float64
 10  x_exit         169422 non-null  float64
 11  y_exit         169422 non-null  float64
dtypes: float64(7), int64(1), object(4)
memory usage: 18.6+ MB


In [15]:
sample_prediction

Unnamed: 0,id,target
0,traj_000219c2a6380c307e8bffd85b5e404b_23_16,1.0
1,...,
