# MSR GeoLife Data
This notebook has three parts. Part I contains code for downloading the data from blob storage.
The dataset is large, so Part II provides a helper function for pulling out just the subset you want to explore.

Part III contains code examples of what you can do with the data.

# Part I. Getting Data from Blob Storage

In [52]:
from azure.storage.blob import BlockBlobService
import os

In [53]:
!mkdir -p geolife


In [54]:
local_path=os.getcwd() + "/geolife"
blob_account_name = "mldsdatahack2019diag" # fill in your blob account name
blob_account_key = "JsauBssnY92CeD3MgI2SWhkQ16JioJCRWVW8NzKtcWckI+DaNNbCmpmMAVq27GD91mhgH+oHPx+QbIKUCow5gA=="  # fill in your blob account key
mycontainer = "datahackdata2019"       # fill in the container name 
myblobname = "000/Trajectory/20090705025307.csv"        # fill in the blob name 
mydatafile = "Output"        # fill in the output file name

In [55]:
# BlockBlobService comes from the legacy azure-storage-blob SDK (v2.x and earlier).
blob_service = BlockBlobService(account_name=blob_account_name, account_key=blob_account_key)

# List every blob in the container and keep the names for downloading.
blobs = list(blob_service.list_blobs(mycontainer))
csv_names = [blob.name for blob in blobs]

In [56]:
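for name in csv_names:
    if name == "BCycleAustin.csv":  # not one of the GeoLife user files
        continue
    target = os.path.join(local_path, name)
    os.makedirs(os.path.dirname(target), exist_ok=True)  # blob names may contain '/'
    blob_service.get_blob_to_path(mycontainer, name, target)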

# Part II. Gathering Data for Exploration

In [35]:
import pandas as pd

## Extract Users from Dataset

In [63]:
def extract_users(PATH='./geolife',
                  user_list=None,
                  query=None,
                  date_range=(None, None),
                  transportation=None,
                  null_transport=True):
    '''
    Collect the rows for the requested users into a single DataFrame.
    See the next few cells for usage examples.

    Input:
       - PATH       : directory that holds one CSV per user (e.g. './geolife/000.csv')
       - user_list  : list of desired user ids, given as 3-digit strings
            - e.g. '000', '001', ..., '180'
       - query      : string query used to filter the dataframe
            - Should be in the form ' [COLUMN] [COMPARISON] [VALUE] '
            - See the pandas DataFrame.query documentation for examples
       - date_range : [start, end], each a datetime string such as '2008-10-23', or None
            - If either is None, no min/max date is applied on that side
            - Bounds are exclusive and are compared with the DateTime column as strings
       - transportation : list of transportation modes to keep
            - Format: [transportation1, transportation2, ...]
            - Default is None, which keeps all modes (including rows with no mode)
       - null_transport : boolean, whether to keep rows without a transportation mode
            - If True, keep all rows; otherwise drop the rows with a blank mode
    Output:
       - df : pandas DataFrame with the requested rows
    '''
    start, end = date_range
    # None, [] and [None] all mean "do not filter on transportation mode".
    check_transportation = bool(transportation) and transportation[0] is not None

    dfs = []
    for user in (user_list or []):
        df = pd.read_csv(PATH+'/'+user+'.csv')
        df = df.fillna('')
        if query is not None:
            df = df.query(query)
        # Only apply a date bound when one was given, so one user's
        # min/max never leaks into the next user's filter.
        if start is not None:
            df = df[df.DateTime > start]
        if end is not None:
            df = df[df.DateTime < end]
        if check_transportation:
            df = df[df['Transportation Mode'].isin(transportation)]
        if not null_transport:
            df = df[df['Transportation Mode'] != '']
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
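
The `query` argument is handed straight to pandas, so it can help to try a query string on a small frame first. The sketch below uses a made-up frame (the values are invented; only the column names match the GeoLife CSVs) and assumes pandas 0.25 or newer, where column names containing spaces can be wrapped in backticks.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the dataset.
toy = pd.DataFrame({
    "Altitude": [50.0, 150.0, 250.0, 350.0],
    "Trajectory": [0, 3, 12, 4],
    "Transportation Mode": ["", "walk", "bus", "walk"],
})

# Plain column names can be used directly; conditions combine with and/or.
mid_altitude = toy.query("Altitude > 100 and Altitude < 300 and Trajectory < 10")

# Column names that contain spaces need backticks inside the query string.
walking = toy.query("`Transportation Mode` == 'walk'")

print(mid_altitude)
print(walking)
```

The same strings can be passed as the `query` argument of `extract_users`.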

## Extraction Examples

In [58]:
# Extract user 010
df = extract_users(user_list=['010'])



In [60]:
# Show the list of user ids that gets passed in the next cell.
print(['{0:03d}'.format(x) for x in range(0,11)])

['000', '001', '002', '003', '004', '005', '006', '007', '008', '009', '010']


In [61]:
# Extract Users from 000 to 010
df = extract_users(user_list=['{0:03d}'.format(x) for x in range(0,11)])
df



Unnamed: 0,Latitude,Longitude,Dummy,Altitude,Days Passed,User,Trajectory,DateTime,Transportation Mode
0,39.984683,116.318450,0,492.0,39744.120255,0,0,2008-10-23 02:53:10,
1,39.984686,116.318417,0,492.0,39744.120313,0,0,2008-10-23 02:53:15,
2,39.984688,116.318385,0,492.0,39744.120370,0,0,2008-10-23 02:53:20,
3,39.984655,116.318263,0,492.0,39744.120428,0,0,2008-10-23 02:53:25,
4,39.984611,116.318026,0,493.0,39744.120486,0,0,2008-10-23 02:53:30,
5,39.984608,116.317761,0,493.0,39744.120544,0,0,2008-10-23 02:53:35,
6,39.984563,116.317517,0,496.0,39744.120602,0,0,2008-10-23 02:53:40,
7,39.984539,116.317294,0,500.0,39744.120660,0,0,2008-10-23 02:53:45,
8,39.984606,116.317065,0,505.0,39744.120718,0,0,2008-10-23 02:53:50,
9,39.984568,116.316911,0,510.0,39744.120775,0,0,2008-10-23 02:53:55,


In [62]:
# Only collect data from users 000 to 010 with Altitude between 100 and 300, from the first 10 trajectories
df = extract_users(user_list=['{0:03d}'.format(x) for x in range(0,11)], 
                   query = 'Altitude > 100 and Altitude < 300 and Trajectory < 10')
df



Unnamed: 0,Latitude,Longitude,Dummy,Altitude,Days Passed,User,Trajectory,DateTime,Transportation Mode
0,39.984618,116.314323,0,113.0,39744.121817,0,0,2008-10-23 02:55:25,
1,39.984649,116.314107,0,117.0,39744.121875,0,0,2008-10-23 02:55:30,
2,39.984621,116.313941,0,121.0,39744.121933,0,0,2008-10-23 02:55:35,
3,39.984655,116.313724,0,126.0,39744.121991,0,0,2008-10-23 02:55:40,
4,39.984681,116.313521,0,129.0,39744.122049,0,0,2008-10-23 02:55:45,
5,39.984708,116.313311,0,133.0,39744.122106,0,0,2008-10-23 02:55:50,
6,39.984708,116.313099,0,137.0,39744.122164,0,0,2008-10-23 02:55:55,
7,39.984696,116.312921,0,144.0,39744.122222,0,0,2008-10-23 02:56:00,
8,39.984677,116.312746,0,153.0,39744.122280,0,0,2008-10-23 02:56:05,
9,39.984682,116.312525,0,155.0,39744.122338,0,0,2008-10-23 02:56:10,


In [64]:
# Same as above, but restricted to dates between October 23, 2008 and January 1, 2009
df = extract_users(user_list=['{0:03d}'.format(x) for x in range(0,11)], 
                   query = 'Altitude > 100 and Altitude < 300 and Trajectory < 10',
                   date_range = ['2008-10-23','2009-01-01'])
df



Unnamed: 0,Latitude,Longitude,Dummy,Altitude,Days Passed,User,Trajectory,DateTime,Transportation Mode
0,39.984618,116.314323,0,113.0,39744.121817,0,0,2008-10-23 02:55:25,
1,39.984649,116.314107,0,117.0,39744.121875,0,0,2008-10-23 02:55:30,
2,39.984621,116.313941,0,121.0,39744.121933,0,0,2008-10-23 02:55:35,
3,39.984655,116.313724,0,126.0,39744.121991,0,0,2008-10-23 02:55:40,
4,39.984681,116.313521,0,129.0,39744.122049,0,0,2008-10-23 02:55:45,
5,39.984708,116.313311,0,133.0,39744.122106,0,0,2008-10-23 02:55:50,
6,39.984708,116.313099,0,137.0,39744.122164,0,0,2008-10-23 02:55:55,
7,39.984696,116.312921,0,144.0,39744.122222,0,0,2008-10-23 02:56:00,
8,39.984677,116.312746,0,153.0,39744.122280,0,0,2008-10-23 02:56:05,
9,39.984682,116.312525,0,155.0,39744.122338,0,0,2008-10-23 02:56:10,


# Part III. Exploring the Data

In [65]:
# Install a mapping library
!pip install folium


In [66]:
# Library for visualizing on maps, for resources look here: 
#  https://python-visualization.github.io/folium/quickstart.html#Getting-Started
import folium

## Example 1: Visualizing a single trajectory

In [67]:
# One trajectory for user 000
df = extract_users(user_list=['000'],
                   query = 'Trajectory == 0')
df

Unnamed: 0,Latitude,Longitude,Dummy,Altitude,Days Passed,User,Trajectory,DateTime,Transportation Mode
0,39.984683,116.318450,0,492,39744.120255,0,0,2008-10-23 02:53:10,
1,39.984686,116.318417,0,492,39744.120313,0,0,2008-10-23 02:53:15,
2,39.984688,116.318385,0,492,39744.120370,0,0,2008-10-23 02:53:20,
3,39.984655,116.318263,0,492,39744.120428,0,0,2008-10-23 02:53:25,
4,39.984611,116.318026,0,493,39744.120486,0,0,2008-10-23 02:53:30,
5,39.984608,116.317761,0,493,39744.120544,0,0,2008-10-23 02:53:35,
6,39.984563,116.317517,0,496,39744.120602,0,0,2008-10-23 02:53:40,
7,39.984539,116.317294,0,500,39744.120660,0,0,2008-10-23 02:53:45,
8,39.984606,116.317065,0,505,39744.120718,0,0,2008-10-23 02:53:50,
9,39.984568,116.316911,0,510,39744.120775,0,0,2008-10-23 02:53:55,


In [68]:
# Visualized!
m = folium.Map(
    location=[df['Latitude'].mean(),df['Longitude'].mean()],
    zoom_start=14,
    tiles='Stamen Terrain'
)


# Drop a marker on every 100th point.
for i in range(0,len(df),100):
    folium.Marker([df['Latitude'][i],df['Longitude'][i]]).add_to(m)

# Draw the route through every 10th point.
folium.PolyLine([(df['Latitude'][i], df['Longitude'][i]) for i in range(0,len(df),10)], 
                    color="red", weight=2.5, opacity=1).add_to(m)
m

### Follow-Up

After seeing this, we can ask a few questions:

- Are there key locations on the route that other people also visit often (such as common workplaces or restaurants)?
- How does this trajectory change over different days?
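
One rough way to start on the first question is to snap every point to a coarse grid and count how many distinct users show up in each cell. This is only a sketch: the helper name `busiest_cells`, the rounding to 3 decimal places, and the toy coordinates are all assumptions made for illustration, not part of the dataset or the original analysis.

```python
import pandas as pd

def busiest_cells(points, decimals=3, n=5):
    """Count distinct users per rounded (Latitude, Longitude) grid cell.

    `decimals` sets the grid size and is an illustrative choice.
    """
    cells = points.assign(
        lat_cell=points["Latitude"].round(decimals),
        lon_cell=points["Longitude"].round(decimals),
    )
    users_per_cell = cells.groupby(["lat_cell", "lon_cell"])["User"].nunique()
    return users_per_cell.sort_values(ascending=False).head(n)

# Made-up points: three users near the same spot, one point elsewhere.
toy = pd.DataFrame({
    "Latitude": [39.98461, 39.98464, 39.98459, 40.00012],
    "Longitude": [116.31801, 116.31803, 116.31798, 116.32040],
    "User": [0, 1, 2, 0],
})

busiest = busiest_cells(toy)
print(busiest)
```

On real data you would pass in the frame returned by `extract_users` for several users; a clustering method would likely give better results than fixed grid cells.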

## Example 2: Comparing two types of trajectories (based on Transportation) of the same user

In [69]:
df = extract_users(user_list=['010'],
                   date_range = ['2008-10-23','2009-01-01'],
                   null_transport = False)



In [70]:
df['Transportation Mode'].unique()

array(['taxi', 'train', 'subway', 'walk', 'bus'], dtype=object)

In [71]:
# Get the rows where the transportation mode is taxi or train
df_taxi  = df[df['Transportation Mode'] == 'taxi']
df_train = df[df['Transportation Mode'] == 'train']

In [72]:
print(df_taxi['Trajectory'].unique())
print(df_train['Trajectory'].unique())

[124 126 127 128 130 132 133 134]
[124 125 126 128 129 130 133 134]


From the above, we can see that in Trajectory 124 the user rode both a taxi and a train. Let's visualize this!
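
Instead of comparing the two printed arrays by eye, the shared trajectory ids can also be computed. The helper below is a hypothetical sketch (the name `shared_trajectories` and the toy rows are made up); it only assumes the `Trajectory` and `Transportation Mode` columns shown above.

```python
import pandas as pd

def shared_trajectories(points, mode_a, mode_b):
    """Return the sorted trajectory ids that contain both transportation modes."""
    ids_a = set(points.loc[points["Transportation Mode"] == mode_a, "Trajectory"])
    ids_b = set(points.loc[points["Transportation Mode"] == mode_b, "Trajectory"])
    return sorted(int(t) for t in ids_a & ids_b)

# Made-up rows: only trajectory 124 has both a taxi and a train segment.
toy = pd.DataFrame({
    "Trajectory": [124, 124, 124, 125, 127, 127],
    "Transportation Mode": ["taxi", "taxi", "train", "train", "taxi", "taxi"],
})

both = shared_trajectories(toy, "taxi", "train")
print(both)
```

Called on the real `df` from the cells above, this would list every trajectory that mixes the two modes, not just 124.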

In [73]:
df_taxi = df_taxi[df_taxi['Trajectory'] == 124].reset_index(drop=True)
df_train = df_train[df_train['Trajectory'] == 124].reset_index(drop=True)

In [74]:
# Series.append was removed in pandas 2.0, so combine the columns with pd.concat.
latitudes = pd.concat([df_taxi['Latitude'], df_train['Latitude']])
longitudes = pd.concat([df_taxi['Longitude'], df_train['Longitude']])

In [75]:
m = folium.Map(
    location=[latitudes.mean(),longitudes.mean()],
    zoom_start=8.5,
    tiles='Stamen Terrain'
)

folium.Marker([df_taxi['Latitude'][0],df_taxi['Longitude'][0]],
              tooltip='Taxi Start').add_to(m)
folium.Marker([df_taxi['Latitude'][len(df_taxi['Latitude'])-1],
               df_taxi['Longitude'][len(df_taxi['Latitude'])-1]],
               tooltip='Taxi End').add_to(m)

folium.Marker([df_train['Latitude'][0],df_train['Longitude'][0]],
              tooltip='Train Start').add_to(m)
folium.Marker([df_train['Latitude'][len(df_train['Latitude'])-1],
               df_train['Longitude'][len(df_train['Latitude'])-1]],
               tooltip='Train End').add_to(m)

    
folium.PolyLine([(df_taxi['Latitude'][i], df_taxi['Longitude'][i]) for i in range(0,len(df_taxi),10)], 
                    color="red", weight=2.5, opacity=1).add_to(m)
folium.PolyLine([(df_train['Latitude'][i], df_train['Longitude'][i]) for i in range(0,len(df_train),10)], 
                    color="blue", weight=2.5, opacity=1).add_to(m)
m

### Follow-Up

It would be interesting to see whether the train and taxi trajectories above were continuous (like one trip to work and back), but the markers show that this is not the case. In fact, on further inspection, the trajectories are actually split into **different trips**.

A suggestion for further analysis could be:

- Find a way to automatically split trajectories into different trips
    - **Note that these GPS locations are taken every 5 s**
- Once this has been done, create a nice visualization of how a user's trajectory changes over the course of a day
    - This could include a way to distinguish different transportation types (including points with no transportation label)
    - Also a way to visually see the progression of trips over time
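
As a possible starting point for the trip-splitting idea, here is a minimal sketch that starts a new trip whenever the time between consecutive points exceeds a threshold. The function name `label_trips`, the `Trip` column, and the 5-minute threshold are assumptions chosen for illustration; the right gap depends on the data.

```python
import pandas as pd

def label_trips(points, max_gap="5min"):
    """Add a 'Trip' column that increments whenever the gap between
    consecutive GPS points exceeds max_gap (an illustrative threshold)."""
    out = points.copy()
    out["DateTime"] = pd.to_datetime(out["DateTime"])
    out = out.sort_values("DateTime").reset_index(drop=True)
    gap = out["DateTime"].diff()  # first row is NaT, which compares as False
    out["Trip"] = (gap > pd.Timedelta(max_gap)).cumsum()
    return out

# Made-up timestamps: three points 5 s apart, then two more hours later.
toy = pd.DataFrame({"DateTime": [
    "2008-10-23 02:53:10", "2008-10-23 02:53:15", "2008-10-23 02:53:20",
    "2008-10-23 09:10:00", "2008-10-23 09:10:05",
]})

trips = label_trips(toy)
print(trips["Trip"].tolist())
```

A time gap alone will miss trips where the device keeps recording while the user stays put, so combining it with a distance or speed check may work better.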
    
Good luck!