## Pre-Processing

- Only data between 1st January and 25th December is used, which should cover the year for me.
    - Timestamps are kept in UTC for now, as my travels spanned multiple time zones.
    - Some activities had to be reviewed manually to set their start and end dates.
    - These checks were done by manually looking up the activity coordinates on Google to confirm the best datetime to use.
    - As an extra check, each end time is compared with the activity start and end datetimes to avoid overlapping activities, as defined in the next section.
- Google Maps Timeline
    - Data is divided into 4 types: Visits, Activities, Timeline Paths, and Timeline Memory.
    - **Visit**: Usually recorded when you have been stationary on the map for a while. I am unsure exactly how this works, as it tends to capture highway stops but misses some restaurant visits. Connectivity could be a factor.
    - **Activity**: While movement is detected, Google tries to predict whether you were driving, walking, cycling, etc. This provides the start and end coordinates along with the distance, so speed is a factor used to determine the activity type. Again, bad connectivity could introduce errors.
    - **Timeline Paths**: While activities only store the start and end coordinates along with their times, Timeline Paths store a dictionary of coordinates between those times to give a full picture of the points visited, usually captured every few minutes.
    - **Timeline Memory**: My data contained too few of these for me to define them concretely, and they have not been used in the analysis.
- Each row in the JSON data can represent any one of the above types, so the NA values need to be filled in to make each row's type easier to identify.
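
As a rough illustration of the identification step, the sketch below labels each normalized row by the segment type it carries. The sample records and the `segment_type` helper are hypothetical; only the `visit.probability`, `activity.probability`, and `timelinePath` column names come from the export used here.

```python
import pandas as pd

# Hypothetical sample mimicking three kinds of rows in `semanticSegments`.
sample_segments = [
    {"startTime": "2024-03-01T08:00:00Z", "endTime": "2024-03-01T09:00:00Z",
     "visit": {"probability": 0.9}},
    {"startTime": "2024-03-01T09:00:00Z", "endTime": "2024-03-01T09:30:00Z",
     "activity": {"probability": 0.8}},
    {"startTime": "2024-03-01T09:00:00Z", "endTime": "2024-03-01T11:00:00Z",
     "timelinePath": [{"point": "12.0°, 77.0°", "time": "2024-03-01T09:10:00Z"}]},
]

# Nested keys become dotted column names; keys missing from a row become NaN.
df = pd.json_normalize(sample_segments)


def segment_type(row):
    """Return a label for the segment type carried by one normalized row."""
    if pd.notna(row["visit.probability"]):
        return "visit"
    if pd.notna(row["activity.probability"]):
        return "activity"
    # timelinePath cells hold a list of points when present, NaN otherwise
    if isinstance(row["timelinePath"], list):
        return "timelinePath"
    return "other"


df["segment_type"] = df.apply(segment_type, axis=1)
print(df[["startTime", "segment_type"]])
```

Filling the missing values with 0, as done in the next cell, achieves the same separation with a simple `!= 0` filter.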


In [5]:
from datetime import datetime
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

# read the JSON file
with open('data/Timeline_backup_20241224.json') as json_data:
    data = json.load(json_data)
    df_timelines = pd.json_normalize(data['semanticSegments'])

# PRE_PROCESSING

# replace any null values for columns being used
df_timelines['timelinePath'] = df_timelines['timelinePath'].fillna(0)
df_timelines['visit.probability'] = df_timelines['visit.probability'].fillna(0)
df_timelines['activity.probability'] = df_timelines['activity.probability'].fillna(0)

# convert to datetime format
df_timelines['startTime'] = pd.to_datetime(df_timelines['startTime'], utc=True)
df_timelines['endTime'] = pd.to_datetime(df_timelines['endTime'], utc=True)

# filter only for the needed travel dates
travel_start_date = '2024-01-01 16:00:00'
travel_end_date = '2024-12-24 05:30:00'

df_timelines = df_timelines[df_timelines['startTime'] >= pd.to_datetime(travel_start_date, utc=True)]
df_timelines = df_timelines[df_timelines['startTime'] <= pd.to_datetime(travel_end_date, utc=True)]
df_timelines = df_timelines[df_timelines['endTime'] <= pd.to_datetime(travel_end_date, utc=True)]

# reset the index before finalizing the data frame
df_timelines = df_timelines.reset_index(drop=True)
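
The overlap check mentioned in the pre-processing notes could look roughly like the sketch below: sort by start time and flag any row that starts before the previous row has ended. The sample frame and the `find_overlaps` helper are hypothetical and are not the exact check used for this analysis.

```python
import pandas as pd


def find_overlaps(df, start_col="startTime", end_col="endTime"):
    """Return rows that start before the preceding row (by start time) has ended."""
    ordered = df.sort_values(start_col).reset_index(drop=True)
    # End time of the previous row; NaT for the first row, which never matches.
    previous_end = ordered[end_col].shift(1)
    return ordered[ordered[start_col] < previous_end]


# Hypothetical activity rows: the second one starts before the first one ends.
activities = pd.DataFrame({
    "startTime": pd.to_datetime(
        ["2024-03-01 08:00", "2024-03-01 08:45", "2024-03-01 10:00"], utc=True),
    "endTime": pd.to_datetime(
        ["2024-03-01 09:00", "2024-03-01 09:30", "2024-03-01 10:30"], utc=True),
})

overlaps = find_overlaps(activities)
print(overlaps)
```

This only compares neighbouring rows, so it should be applied to one segment type at a time; visits, activities, and timeline paths can legitimately cover the same time span.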

# Analysis 1
- Get the count of most visits over the year and remove 
- Get it by the count of visits to a place on different dates or times, as well as by the time spent at the location

## Processing:
- Only select the visit-related columns and rows for now
- Split the visit coordinates into latitude and longitude
- Aggregate to get the total visits by count as well as time spent there

In [6]:
# columns needed for the visit analysis
needed_cols = [
    'startTime', 'endTime', 'visit.hierarchyLevel', 'visit.probability',
    'visit.topCandidate.placeId', 'visit.topCandidate.semanticType',
    'visit.topCandidate.probability', 'visit.topCandidate.placeLocation.latLng',
]
# keep only visit rows (non-visit rows had their probability filled with 0);
# .copy() avoids a SettingWithCopyWarning when adding columns below
df_temp = df_timelines.loc[df_timelines['visit.probability'] != 0, needed_cols].copy()

df_temp['time_spent'] = df_temp['endTime'] - df_temp['startTime']

# split "lat, long" into two numeric columns; the regex keeps only the number in each
# part, so a trailing degree symbol or other suffix is ignored
lat_long = df_temp['visit.topCandidate.placeLocation.latLng'].str.extract(r'(-?\d+(?:\.\d+)?)[^,]*,\s*(-?\d+(?:\.\d+)?)')
df_temp[['visit.lat', 'visit.long']] = lat_long.astype(float).round(4)

In [None]:

groupby_cols = ['visit.lat', 'visit.long']
selected_cols = ['time_spent']

# keep only visits whose top candidate has the "UNKNOWN" semantic type
df_temp = df_temp[df_temp['visit.topCandidate.semanticType'] == "UNKNOWN"]

# number of visits per location (the 'time_spent' column holds the row count here)
df_tempGroup_count = df_temp[selected_cols + groupby_cols].groupby(groupby_cols).count().sort_values(by='time_spent', ascending=False).reset_index()
# total time spent per location
df_tempGroup_timeSpent = df_temp[selected_cols + groupby_cols].groupby(groupby_cols).sum().sort_values(by='time_spent', ascending=False).reset_index()

# review the outcomes
print(df_tempGroup_count[:25])
print(df_tempGroup_timeSpent[:35])
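
A possible alternative to the two separate group-bys above is a single named aggregation, which also makes it easy to count distinct visit days rather than raw rows. The sample data and the `visit_date`, `visit_count`, `distinct_days`, and `total_time` names below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical visits: three to one location (two on the same day), one to another.
visits = pd.DataFrame({
    "visit.lat": [12.9716, 12.9716, 12.9716, 48.8566],
    "visit.long": [77.5946, 77.5946, 77.5946, 2.3522],
    "startTime": pd.to_datetime(
        ["2024-02-01 08:00", "2024-02-01 18:00", "2024-02-02 08:00", "2024-06-10 12:00"],
        utc=True),
    "endTime": pd.to_datetime(
        ["2024-02-01 09:00", "2024-02-01 20:00", "2024-02-02 09:30", "2024-06-10 13:00"],
        utc=True),
})
visits["time_spent"] = visits["endTime"] - visits["startTime"]
# Calendar date of the visit start, in UTC (local dates may differ).
visits["visit_date"] = visits["startTime"].dt.date

summary = (
    visits.groupby(["visit.lat", "visit.long"])
    .agg(
        visit_count=("time_spent", "count"),      # number of visit rows
        distinct_days=("visit_date", "nunique"),  # number of different days visited
        total_time=("time_spent", "sum"),         # total time spent at the location
    )
    .sort_values("total_time", ascending=False)
    .reset_index()
)
print(summary)
```

Keeping the count and the total time in one frame avoids having a `time_spent` column that actually holds a row count.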