## Pre-Processing

- Only data between 1st January and 25th December is used, which covers the year for my purposes
    - Timestamps are kept in UTC for now, as my travels spanned multiple time zones.
    - Some activities had to be reviewed manually to set their start and end dates.
    - These checks were done by manually looking up the activity coordinates on Google to confirm the best datetime to use.
    - An extra check compares the end time against each activity's start and end datetimes to avoid overlapping activities, as defined in the next section.
- Google Maps Timeline
    - The data is divided into four types: Visits, Activities, Timeline Paths, and Timeline Memory
    - **Visit**: Usually recorded when you have stayed in one place on the map for a while. I am unsure exactly how this works, as it tends to capture highway stops but misses some restaurant visits. Connectivity could be a factor.
    - **Activity**: While movement is detected, Google tries to predict whether you were driving, walking, cycling, etc. This provides the start and end coordinates along with the distance, so speed is a factor used to determine the activity type. Again, bad connectivity could introduce errors.
    - **Timeline Paths**: While activities only store the start and end coordinates along with their times, Timeline Paths store a dictionary of coordinates between those times to give a full picture of the points visited, usually captured at intervals of a few minutes.
    - **Timeline Memory**: My data contained too few of these for me to define them concretely, and they have not been used in the analysis.
- Each row in the JSON data can represent any one of the above types, so the NA values need to be filled in to make each row's type easier to identify.
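
To make the four segment types concrete, here is a minimal plain-Python sketch of how a raw entry could be labelled by the key it carries. The `visit`, `activity` and `timelinePath` keys match the column names used in the notebook; the `timelineMemory` key, the `segment_type` helper and all sample values are assumptions made up for illustration.

```python
# Hypothetical entries shaped like items of data['semanticSegments'].
# All values are invented; 'timelineMemory' is an assumed key name.
sample_segments = [
    {"startTime": "2024-03-01T09:00:00Z", "endTime": "2024-03-01T10:00:00Z",
     "visit": {"probability": 0.9}},
    {"startTime": "2024-03-01T10:00:00Z", "endTime": "2024-03-01T10:20:00Z",
     "activity": {"probability": 0.8}},
    {"startTime": "2024-03-01T10:00:00Z", "endTime": "2024-03-01T12:00:00Z",
     "timelinePath": [{"point": "40.7536°, -73.9832°",
                       "time": "2024-03-01T10:05:00Z"}]},
    {"startTime": "2024-03-01T12:00:00Z", "endTime": "2024-03-01T13:00:00Z",
     "timelineMemory": {}},
    {"startTime": "2024-03-01T13:00:00Z", "endTime": "2024-03-01T14:00:00Z"},
]

SEGMENT_KEYS = ("visit", "activity", "timelinePath", "timelineMemory")


def segment_type(segment):
    """Return the name of the first type-specific key found in the segment."""
    for key in SEGMENT_KEYS:
        if key in segment:
            return key
    return "unknown"


types = [segment_type(s) for s in sample_segments]
print(types)
```

After `pd.json_normalize`, a key that is absent from a row shows up as NaN in the corresponding columns, which is why the notebook fills those columns with 0 before filtering on them.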


In [1]:
from datetime import datetime
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

# read the JSON file
with open('data/Timeline_backup_20241224.json') as json_data:
    data = json.load(json_data)
    df_timelines = pd.json_normalize(data['semanticSegments'])

# PRE_PROCESSING

# replace any null values for columns being used
df_timelines['timelinePath'] = df_timelines['timelinePath'].fillna(0)
df_timelines['visit.probability'] = df_timelines['visit.probability'].fillna(0)
df_timelines['activity.probability'] = df_timelines['activity.probability'].fillna(0)

# convert to datetime format
df_timelines['startTime'] = pd.to_datetime(df_timelines['startTime'], utc=True)
df_timelines['endTime'] = pd.to_datetime(df_timelines['endTime'], utc=True)

# filter only for the needed travel dates
travel_start_date = '2024-01-01 16:00:00'
travel_end_date = '2024-12-24 05:30:00'

df_timelines = df_timelines[df_timelines['startTime'] >= pd.to_datetime(travel_start_date, utc=True)]
df_timelines = df_timelines[df_timelines['startTime'] <= pd.to_datetime(travel_end_date, utc=True)]
df_timelines = df_timelines[df_timelines['endTime'] <= pd.to_datetime(travel_end_date, utc=True)]

# reset the index before finalizing the data frame
df_timelines = df_timelines.reset_index(drop=True)
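
The overlap check mentioned in the pre-processing notes is not shown in the notebook. A minimal plain-Python sketch of one way to do it follows: sort the intervals by start time and flag any interval that starts before the latest end seen so far. The `find_overlaps` helper and the sample intervals are hypothetical.

```python
from datetime import datetime, timezone


def find_overlaps(intervals):
    """Return (earlier, later) pairs where `later` starts before `earlier` ends.

    `intervals` is a list of (start, end) tuples. Intervals that merely touch
    (end == next start) are not treated as overlapping.
    """
    ordered = sorted(intervals)
    overlaps = []
    latest = ordered[0] if ordered else None
    for curr in ordered[1:]:
        if curr[0] < latest[1]:
            overlaps.append((latest, curr))
        # keep track of the interval that ends last so far
        if curr[1] > latest[1]:
            latest = curr
    return overlaps


utc = timezone.utc
a = (datetime(2024, 3, 1, 9, 0, tzinfo=utc), datetime(2024, 3, 1, 10, 0, tzinfo=utc))
b = (datetime(2024, 3, 1, 10, 0, tzinfo=utc), datetime(2024, 3, 1, 10, 30, tzinfo=utc))
c = (datetime(2024, 3, 1, 10, 15, tzinfo=utc), datetime(2024, 3, 1, 11, 0, tzinfo=utc))

overlapping = find_overlaps([c, a, b])
print(len(overlapping))
```

With the data frame above, the intervals could be built from the `startTime` and `endTime` columns of rows of a single type, since path rows are expected to span the same period as the visits and activities they describe.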

# Analysis 1
- Get the count of most visits over the year and remove 
- Rank places by the number of visits on different dates or times, as well as by the time spent at the location

## Processing:
- Only select the visit related columns and rows for now
- Use a split function to get the visit latitude and longitude
- Aggregate to get the total visits by count as well as time spent there
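
The steps above can be sketched without pandas to show the idea: parse the `latLng` string into rounded coordinates, then aggregate the visit count and total time per coordinate pair. This is an illustrative sketch, not the notebook's code. The regex-based `parse_latlng` helper and the sample visits are assumptions, and the parser picks out the numbers rather than slicing off the trailing characters.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# matches signed decimal numbers such as 40.7536123 or -73.9832456
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def parse_latlng(text, ndigits=4):
    """Extract (lat, long) from a string like '40.7536123°, -73.9832456°'."""
    lat, lng = NUMBER_RE.findall(text)[:2]
    return round(float(lat), ndigits), round(float(lng), ndigits)


# hypothetical visits: (latLng string, start, end)
visits = [
    ("40.7536123°, -73.9832456°", datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 10, 0)),
    ("40.7536130°, -73.9832460°", datetime(2024, 3, 5, 12, 0), datetime(2024, 3, 5, 12, 30)),
    ("40.7308000°, -73.9973000°", datetime(2024, 3, 6, 18, 0), datetime(2024, 3, 6, 18, 45)),
]

visit_count = defaultdict(int)
time_spent = defaultdict(timedelta)
for latlng, start, end in visits:
    key = parse_latlng(latlng)
    visit_count[key] += 1
    time_spent[key] += end - start

for key in sorted(visit_count, key=lambda k: -visit_count[k]):
    print(key, visit_count[key], time_spent[key])
```

Rounding to 4 decimals is what lets nearby readings of the same place fall into one group. Points that straddle a rounding boundary will still be split across groups.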

In [24]:
needed_cols = ['startTime','endTime','visit.hierarchyLevel','visit.probability','visit.topCandidate.placeId','visit.topCandidate.semanticType','visit.topCandidate.probability','visit.topCandidate.placeLocation.latLng']
df_temp = df_timelines[df_timelines['visit.probability']!=0]
df_temp = df_temp[needed_cols]

# time spent at each visit
df_temp['time_spent'] = df_temp['endTime']-df_temp['startTime']
# latLng is a single "lat, long" string, so split it into two columns
df_temp[['visit.lat','visit.long']] = df_temp['visit.topCandidate.placeLocation.latLng'].str.split(',',expand=True)
# strip the trailing degree sign characters and round to 4 decimals
df_temp['visit.lat'] = df_temp['visit.lat'].apply(lambda x: round(float(x[:-2]), 4))
df_temp['visit.long'] = df_temp['visit.long'].apply(lambda x: round(float(x[:-2]), 4))

In [25]:

groupby_cols = ['visit.lat','visit.long']
selected_cols = ['time_spent']
df_temp = df_temp[df_temp['visit.topCandidate.semanticType']=="UNKNOWN"]
df_tempGroup_count = df_temp[selected_cols + groupby_cols].groupby(groupby_cols).count().sort_values(by='time_spent',ascending=False).reset_index()
df_tempGroup_timeSpent = df_temp[selected_cols + groupby_cols].groupby(groupby_cols).sum().sort_values(by='time_spent',ascending=False).reset_index()


In [17]:
# add some columns and export to csv so the data can be reviewed in Excel
# merge the two dfs
df_temp = df_tempGroup_count.merge(df_tempGroup_timeSpent,"outer",on=groupby_cols)

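# add columns
# after the merge, time_spent_x holds the visit count and time_spent_y the total time spent
# total_seconds() / 60 gives minutes; astype('timedelta64[m]') is rejected by pandas 2.x
df_temp["time_mins"] = df_temp['time_spent_y'].dt.total_seconds() / 60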
df_temp["googleMap_link"] = "https://maps.google.com/?q=" + df_temp["visit.lat"].astype(str) + "," + df_temp["visit.long"].astype(str)
df_temp["visit_desc"] = ""

# save as csv
df_temp.to_csv("data/analysis1_output.csv")

# Analysis 2
- Check how many times I visited a park
- Park visits may not be recorded as visits in the Google Timeline data, since walks are not considered visits
- The activities data will have to be used to count park visits

## Processing:
- Make a dictionary of parks to be considered
- Capture their coordinates as a list and then review all the activity data
- Check if the given point exists in any of the park polygons created
- Keep a count for each park and increase it as needed
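
The notebook relies on shapely's `Polygon.contains` for the point-in-polygon check. To show what such a check involves, here is a plain-Python ray-casting sketch. It is a stand-in for illustration, not shapely's implementation, and it makes no promise about points lying exactly on an edge. The `point_in_polygon` helper and the test points are hypothetical; the polygon is the "Bryant" entry from the parks dictionary, with latitude as x and longitude as y to match the notebook's convention.

```python
def point_in_polygon(point, vertices):
    """Ray casting: count how many polygon edges a ray from the point crosses."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # does this edge straddle the line of constant y through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# (lat, long) vertices of the "Bryant" polygon used in the notebook
bryant = [(40.7546, -73.9840), (40.7534, -73.9811),
          (40.7523, -73.9819), (40.7535, -73.9848)]

print(point_in_polygon((40.7535, -73.9830), bryant))  # a point near the middle
print(point_in_polygon((40.7600, -73.9830), bryant))  # a point to the north
```

An odd number of crossings means the point is inside and an even number means it is outside, which is why the flag is toggled on every crossing.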

In [173]:
from shapely.geometry import Point, Polygon

# coordinates have been extracted and rounded to 4 decimals manually
parks_dict = {
    "CP" : Polygon([(40.7681, -73.9816),(40.7644, -73.9731),(40.7968, -73.9494),(40.8004, -73.9580)]),
    "WSP" : Polygon([(40.7321, -73.9986),(40.7307, -73.9956),(40.7296, -73.9965),(40.7310, -73.9995)]),
    "Union" : Polygon([(40.7352, -73.9916),(40.7370, -73.9903),(40.7365, -73.9892),(40.7349, -73.9904)]),
    "MSP" : Polygon([(40.7433, -73.9881),(40.7427, -73.9867),(40.7410, -73.9879),(40.7414, -73.9889),(40.7422, -73.9888)]),
    "Bryant" : Polygon([(40.7546, -73.9840),(40.7534, -73.9811),(40.7523, -73.9819),(40.7535, -73.9848)]),
    # "Gantry" : Polygon([(40.7488, -73.9588),(40.7468, -73.9572),(40.7459, -73.9582),(40.7450, -73.9579),(40.7436, -73.9595),(40.7384, -73.9616),(40.7386, -73.9628)]),        # manual with some incorrect coordinates for specific activity points
    "Gantry" : Polygon([(40.7387, -73.9601),(40.7385, -73.9614),(40.7395, -73.9619),(40.7401, -73.9611),(40.7414, -73.9607),(40.7423, -73.96),(40.7437, -73.9594),(40.7454, -73.9575),(40.7456, -73.9582),(40.7461, -73.9582),(40.7467, -73.9573),(40.7477, -73.9574),(40.7483, -73.9576),(40.7427, -73.9612),(40.7384, -73.9629),(40.7381, -73.9604),(40.7387, -73.9601)]),
    "QueensBridge" : Polygon([(40.7547, -73.9490),(40.7554, -73.9506),(40.7581, -73.9484),(40.7572, -73.9466)]),
    "Astoria" : Polygon([(40.7767, -73.9276),(40.7754, -73.9250),(40.7812, -73.9181),(40.7824, -73.9197)]),
    "Prospect" : Polygon([(40.6729, -73.9698),(40.6631, -73.9628),(40.6549, -73.9621),(40.6513, -73.9719),(40.6583, -73.9742),(40.6610, -73.9796)]),
    "Greenwood" : Polygon([(40.6594, -73.9951),(40.6590, -73.9883),(40.6551, -73.9820),(40.6478, -73.9805),(40.6443, -73.9890),(40.6529, -74.0019)])
}

# only consider the activities calculated by the timelines
df_temp = df_timelines[df_timelines['timelinePath']!=0]

In [174]:
def get_latlong_point(lat_long):
    # split the string on the comma
    split_latlong = lat_long.split(",")
    # remove the degree signs to return latitude and longitude
    lat = round(float(split_latlong[0][:-2]),4)
    lon = round(float(split_latlong[-1][:-2]),4)

    return Point(lat, lon)

In [180]:
park_count = {}
park_lastDate = {}

# adding all keys with default values to the dictionaries
for x in parks_dict:
    park_count[x] = 0
    park_lastDate[x] = 0

for i,x in df_temp.iterrows():
    # go through each point
    for p in x['timelinePath']:
        temp_point = get_latlong_point(p['point'])
        for park in parks_dict:
            # check if the point is in any of the parks and on a different date from the previously counted visit
            if parks_dict[park].contains(temp_point) and x['startTime'].date() != park_lastDate[park]:
                park_count[park] += 1
                park_lastDate[park] = x['startTime'].date()
                # the point has been matched to a park, so stop checking the remaining parks
                break

for park_info in sorted(park_count.items(), key=lambda x: (-x[1], x[0])):
    print(park_info[0],park_info[1])

Bryant 42
Gantry 31
CP 14
WSP 13
Prospect 6
MSP 5
QueensBridge 5
Union 2
Astoria 1
Greenwood 1
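
The once-per-day rule in the loop above can also be expressed by collecting the distinct dates seen per park, which does not depend on the rows being in chronological order. The sketch below is an alternative formulation under that assumption, not the notebook's method. The `hits` list and the `days_per_park` name are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# hypothetical (park, date) pairs, one per path point that fell inside a park
hits = [
    ("Bryant", date(2024, 3, 1)),
    ("Bryant", date(2024, 3, 1)),   # same day, counted once
    ("Gantry", date(2024, 3, 1)),
    ("Bryant", date(2024, 3, 2)),
]

days_per_park = defaultdict(set)
for park, day in hits:
    days_per_park[park].add(day)

# one visit per distinct day
visits_per_park = {park: len(days) for park, days in days_per_park.items()}

for park, count in sorted(visits_per_park.items(), key=lambda kv: (-kv[1], kv[0])):
    print(park, count)
```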
