# Tracking Depression Study: Time At Home Data Wrangling

**Author:** Dawson Haddox

**Date:** 8/19/24 (12/20/25: Added a few comments / documentation)

**Purpose:** This code identifies likely home locations in the Tracking Depression study by applying DBSCAN to nighttime GPS data. It loads each participant's GPS records, filters for nighttime hours, and clusters dense locations to infer home coordinates. These inferred home points are then mapped back to the full dataset to flag whether each GPS sample is "at home" based on proximity to any home location. Folium maps are generated to visualize the results for validation. Parameters for DBSCAN and proximity were derived using empirical guides (e.g., typical working hours, largest typical housing sizes) and visual inspection.

**Note:** The code takes a long time to run on a normal laptop. There are ways to make it more efficient (e.g., parallel processing if resources are available, replacing row-wise operations), and it may run faster on the cluster. I'd recommend you start by running it on a subset of the data if you want to make changes to the code.
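As one illustration of replacing row-wise operations, the at-home check can be computed with NumPy broadcasting instead of a per-row `apply`. This is a minimal sketch, not the method used in this notebook: it swaps in the haversine (spherical) formula for geopy's geodesic distance, and the names `haversine_m` and `flag_at_home` are hypothetical helpers introduced here. Over distances of tens of meters the two distance measures should agree closely, but that has not been checked against this dataset.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def flag_at_home(points, homes, radius_m=40):
    """Boolean array: True where a point lies within radius_m of any home.

    points: DataFrame with 'lat' and 'lon' columns (degrees)
    homes:  list of (lat, lon) tuples for one participant
    """
    lat = points['lat'].to_numpy()[:, None]      # shape (n_points, 1)
    lon = points['lon'].to_numpy()[:, None]
    home_arr = np.asarray(homes, dtype=float)    # shape (n_homes, 2)
    dist = haversine_m(lat, lon, home_arr[:, 0], home_arr[:, 1])
    return (dist <= radius_m).any(axis=1)


# Synthetic points: at the home, ~22 m north of it, and ~1.1 km north of it
demo = pd.DataFrame({'lat': [43.7000, 43.7002, 43.7100],
                     'lon': [-72.2900, -72.2900, -72.2900]})
demo_flags = flag_at_home(demo, homes=[(43.7000, -72.2900)])
print(demo_flags)
```

To use something like this per participant, you would call `flag_at_home` inside a `groupby('uid')` with that participant's `home_coords` list, skipping participants whose `home_coords` is NaN.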

### Import Libraries & Read Data

In [None]:
# Import libraries
import pandas as pd
import numpy as np

from datetime import timedelta
from tqdm import tqdm
import ast

from sklearn.cluster import DBSCAN
from geopy.distance import geodesic
from shapely.geometry import MultiPoint
import folium
from folium.plugins import MarkerCluster
from functools import lru_cache

# Read & display data
loc_df = pd.read_csv("/Users/dawsonhaddox/Documents/Jacobson Lab/Avoidance-Mediation/smoothed_loc_pull.csv", usecols=lambda column: column != "Unnamed: 0")
display(loc_df.head(), loc_df.shape)

# Subset to save time while testing
loc_df = loc_df[loc_df['uid'].between('0011@mlife', '0020@mlife')]

# Subset to Nighttime / Time Asleep

## Default Nighttime Subset

Subset to GPS observations between 10pm–7am.

Seems to work well based on visual inspection.
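Because the 10pm–7am window crosses midnight, the two time comparisons have to be combined with OR rather than AND. The sketch below wraps that logic in a hypothetical helper, `is_night`, that also handles windows that do not cross midnight; the function name and its default arguments are illustrative assumptions, not part of the original code.

```python
import pandas as pd


def is_night(times, start="22:00", end="07:00"):
    """Boolean mask for timestamps whose time of day falls in [start, end].

    times: datetime64 Series. If start > end, the window is treated as
    wrapping past midnight (e.g. 22:00 to 07:00).
    """
    t = times.dt.time
    start_t = pd.to_datetime(start, format="%H:%M").time()
    end_t = pd.to_datetime(end, format="%H:%M").time()
    if start_t <= end_t:
        # Same-day window: both conditions must hold
        return (t >= start_t) & (t <= end_t)
    # Overnight window: late evening OR early morning
    return (t >= start_t) | (t <= end_t)


demo_times = pd.Series(pd.to_datetime([
    "2024-01-01 23:30",  # late evening
    "2024-01-02 03:00",  # early morning
    "2024-01-02 12:00",  # midday
    "2024-01-02 07:00",  # exactly at the end boundary (inclusive)
]))
night_mask = is_night(demo_times)
print(night_mask.tolist())
```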

In [None]:
# Convert 'time' to datetime format
loc_df['time'] = pd.to_datetime(loc_df['time'])

# Define nighttime range (10pm–7am)
night_start = pd.to_datetime("22:00:00", format='%H:%M:%S').time()
night_end = pd.to_datetime("07:00:00", format='%H:%M:%S').time()

# Filter data to nighttime hours (10pm–7am)
loc_night = loc_df[(loc_df['time'].dt.time >= night_start) | (loc_df['time'].dt.time <= night_end)]

# Display filtered dataframe
display(loc_night.head(), loc_night.shape)

## Garmin Sleep Subset

Loads per-user sleep data from CSV files and checks whether each location timestamp falls within a recorded sleep session by computing sleep start and end times.

This code might need refining if you want to use it.

**Garmin Sleep Columns:**

1. _id: Unique identifier for each entry in the dataset.
2. awakeDurationInSeconds: The duration of time the individual was awake during the sleep period, measured in seconds.
3. **calendarDate:** The date (in the format YYYY-MM-DD) when the sleep session occurred.
4. deepSleepDurationInSeconds: The amount of time spent in deep sleep during the session, measured in seconds.
5. **durationInSeconds:** The total duration of the sleep session, measured in seconds.
6. lightSleepDurationInSeconds: The time spent in light sleep, measured in seconds.
7. **mlife_id:** An identifier that likely refers to the individual’s account or device ID within the system.
8. remSleepInSeconds: The duration of REM (Rapid Eye Movement) sleep during the session, measured in seconds.
9. sleepLevelsMap: A complex, nested JSON-like structure that likely contains detailed information about sleep stages at different times during the sleep session.
10. **startTimeInSeconds:** The timestamp (in seconds since epoch) when the sleep session began.
11. **startTimeOffsetInSeconds:** Time offset, possibly representing the time zone adjustment from UTC, measured in seconds.
12. summaryId: Another unique identifier, likely for the summary of the session.
13. timeOffsetSleepRespiration: A complex, nested structure indicating respiration rates at different time points during the sleep session.
14. timeOffsetSleepSpo2: A nested structure representing blood oxygen saturation (SpO2) levels at different time points during the sleep session.
15. unmeasurableSleepInSeconds: The duration of unmeasurable sleep periods, possibly due to errors or sensor issues, measured in seconds.
16. validation: This field indicates the validation status of the sleep data, with values like “AUTO_TENTATIVE” or “ENHANCED_TENTATIVE,” likely showing how the data was processed or its reliability.
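
The bolded columns above are enough to reconstruct each sleep session's start and end. The sketch below shows that arithmetic on a single made-up session. It assumes `startTimeOffsetInSeconds` is the local-time offset from UTC and that the GPS timestamps are in local time; neither assumption is confirmed here. `add_sleep_window` and `in_any_window` are hypothetical helper names.

```python
import pandas as pd


def add_sleep_window(sleep_df):
    """Add startTime/endTime columns derived from the Garmin epoch fields."""
    start = (pd.to_datetime(sleep_df["startTimeInSeconds"], unit="s")
             + pd.to_timedelta(sleep_df["startTimeOffsetInSeconds"], unit="s"))
    end = start + pd.to_timedelta(sleep_df["durationInSeconds"], unit="s")
    return sleep_df.assign(startTime=start, endTime=end)


def in_any_window(ts, windows):
    """True if timestamp ts falls inside at least one sleep session."""
    return bool(((windows["startTime"] <= ts) & (windows["endTime"] >= ts)).any())


# One made-up session: 8 hours long, recorded at UTC-5
sleep_demo = pd.DataFrame({
    "startTimeInSeconds": [1_700_000_000],
    "startTimeOffsetInSeconds": [-18_000],
    "durationInSeconds": [28_800],
})
windows = add_sleep_window(sleep_demo)
ts_inside = windows["startTime"].iloc[0] + pd.Timedelta(hours=1)
ts_outside = windows["startTime"].iloc[0] + pd.Timedelta(hours=9)
print(in_any_window(ts_inside, windows), in_any_window(ts_outside, windows))
```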

**Errors:**
- Error processing data for UID 0030@mlife: No columns to parse from file
- Error processing data for UID 0138@mlife: No columns to parse from file
- Error processing data for UID 0139@mlife: No columns to parse from file
- Error processing data for UID 0161@mlife: No columns to parse from file
- Error processing data for UID 0167@mlife: No columns to parse from file
- Error processing data for UID 0245@mlife: No columns to parse from file

In [None]:
# @lru_cache(maxsize=1)
# def get_sleep_df(uid):
#     try:
#         # Load the CSV file for the given UID and cache the DataFrame
#         sleep_df = pd.read_csv(f'sleep/{uid}.csv')
#         sleep_df['calendarDate'] = pd.to_datetime(sleep_df['calendarDate'], errors='coerce').dt.date
#         return sleep_df
#     except pd.errors.ParserError as e:
#         print(f"Error parsing CSV for UID {uid}: {e}")
#         return None
#     except FileNotFoundError as e:
#         print(f"File not found for UID {uid}: {e}")
#         return None
#     except Exception as e:
#         print(f"Error processing data for UID {uid}: {e}")
#         return None


# def get_sleep_data(row):
#     uid = row['uid']
#     sleep_df = get_sleep_df(uid)

#     if sleep_df is None:
#         return None

#     try:
#         # Ensure timestamp and date
#         time = pd.to_datetime(row['time'], errors='coerce')
#         if pd.isna(time):
#             return None
#         date = time.date()

#         # Sleep sessions on the same day or the day after
#         relevant_sleep_df = sleep_df[
#             (sleep_df['calendarDate'] == date) |
#             (sleep_df['calendarDate'] == (date + timedelta(days=1)))
#         ].copy()

#         if relevant_sleep_df.empty:
#             return False

#         # Start and end timestamps for each session
#         relevant_sleep_df = relevant_sleep_df.assign(
#             startTime=pd.to_datetime(relevant_sleep_df['startTimeInSeconds'], unit='s')
#                       + pd.to_timedelta(relevant_sleep_df['startTimeOffsetInSeconds'], unit='s'),
#             endTime=lambda d: d['startTime'] + pd.to_timedelta(d['durationInSeconds'], unit='s')
#         )

#         # Overlap check at the given time
#         overlap = (relevant_sleep_df['startTime'] <= time) & (relevant_sleep_df['endTime'] >= time)
#         return bool(overlap.any())

#     except Exception as e:
#         print(f"Error processing data for UID {uid}: {e}")
#         return None


# # Apply the function to loc_df using tqdm for progress tracking
# tqdm.pandas()
# loc_df['is_sleep'] = loc_df.progress_apply(get_sleep_data, axis=1)

# # Filter the rows where sleep session data matches
# loc_sleep = loc_df[loc_df['is_sleep'] == True]
# loc_night = loc_sleep.copy()
# display(loc_night.head(), loc_night.shape, loc_df.shape)


# # TODO: Update loc_night to use the default range for uids who don't have enough sleep data

# DBSCAN for Home Identification

## Allowing multiple homes

(Other version that assumed one home per participant seems to have been removed)

**Notes for Use:**

You'll need to decide how to define home locations. You can use my approach or modify it based on your needs.

Key Considerations:
- Most participants have one home throughout the study, but some move (e.g., to a new home, from family homes to college dorms).
- Participants may occasionally spend days/weeks on a trip or stay at someone else's home.
- My script only considers whether a certain number of data points exist in a location. It doesn't consider the temporal proximity of GPS points.

I remember feeling reasonably satisfied with my script when inspecting the results. One strength of my script is that it includes plotting functionality to visualize how your parameter choices affect results.

You may wish to use the Garmin sleep data or define home locations differently. For example, you could set `min_samples` dynamically based on the total number of nighttime points a participant has, or use cluster medians instead of means for home locations. Or you may find the current script sufficient.
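
Two of these alternatives can be sketched briefly. This is a rough illustration only: the `fraction=0.1` and `floor=50` defaults are made-up placeholders that would need tuning and visual inspection, and `dynamic_min_samples` and `cluster_medians` are hypothetical helper names, not functions from this notebook.

```python
import pandas as pd


def dynamic_min_samples(n_night_points, fraction=0.1, floor=50):
    """Scale min_samples with a participant's nighttime point count.

    fraction and floor are illustrative placeholders, not tuned values.
    """
    return max(floor, int(round(fraction * n_night_points)))


def cluster_medians(df, label_col="home_cluster"):
    """Per-cluster median (lat, lon), ignoring DBSCAN noise (label -1)."""
    clustered = df[df[label_col] != -1]
    med = clustered.groupby(label_col)[["lat", "lon"]].median()
    return list(med.itertuples(index=False, name=None))


demo_clusters = pd.DataFrame({
    "lat": [1.0, 2.0, 10.0, 50.0],
    "lon": [5.0, 6.0, 7.0, 60.0],
    "home_cluster": [0, 0, 0, -1],  # last row is noise
})
median_homes = cluster_medians(demo_clusters)
print(dynamic_min_samples(8000), dynamic_min_samples(100), median_homes)
```

The median is less sensitive than the mean to a few stray GPS fixes at the edge of a cluster, which is the main reason to consider it.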

One note: DBSCAN defaults to Euclidean distance, so it doesn't account for the curvature of the Earth. Switching to a metric like haversine might improve this, though the results seemed reasonable based on visual inspection. I'm unsure how switching the metric would change the results or whether it would require re-tuning the parameters.
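
For reference, scikit-learn's `DBSCAN` accepts `metric='haversine'` when the input is `[lat, lon]` in radians, typically paired with `algorithm='ball_tree'`. The sketch below shows the call on synthetic points; it is not what this notebook ran, and `cluster_haversine` is a hypothetical wrapper. One reason the switch could matter: with Euclidean distance on radians, east-west differences are not scaled by cos(latitude), so a 0.1 km `eps` covers roughly 0.1 × cos(latitude) km in the east-west direction (about 73 m at 43° N).

```python
import numpy as np
from sklearn.cluster import DBSCAN

KMS_PER_RADIAN = 6371.0088  # same constant as in the notebook


def cluster_haversine(lat_deg, lon_deg, eps_km=0.1, min_samples=800):
    """Run DBSCAN with great-circle distance; returns one label per point."""
    coords = np.radians(np.column_stack([lat_deg, lon_deg]))  # [lat, lon] order
    db = DBSCAN(
        eps=eps_km / KMS_PER_RADIAN,  # eps expressed in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return db.fit_predict(coords)


# Synthetic data: 30 points within a few meters of one spot, plus 3 distant points
rng = np.random.default_rng(0)
home_lat = 43.70 + rng.uniform(-1e-5, 1e-5, size=30)
home_lon = -72.29 + rng.uniform(-1e-5, 1e-5, size=30)
lat = np.concatenate([home_lat, [44.0, 42.0, 45.0]])
lon = np.concatenate([home_lon, [-72.0, -71.0, -73.0]])

labels_demo = cluster_haversine(lat, lon, eps_km=0.1, min_samples=10)
print(labels_demo)
```

If you make this change on the real data, it would be worth re-running the map plots to see whether `eps` and `min_samples` still give sensible clusters.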

**Technical Documentation:** 

The script groups nighttime location observations by user (`uid`). For each user, it runs DBSCAN (Density-Based Spatial Clustering of Applications with Noise) on latitude/longitude (converted to radians) with a ~0.1 km radius (`eps`) and a high minimum sample threshold (`min_samples=800`) to find dense “home” clusters. For each non-noise cluster, it computes a centroid as the home location and stores the resulting list in `home_coords`, along with the number of detected home locations in `num_of_homes`. Users without valid clusters receive `home_coords = NaN` and `num_of_homes = 0` (home locations are manually assigned for these users later). Results are returned in `loc_night`. I chose these parameters based on visual inspection of the results. With these parameters, most participants have one detected home location, which aligns with expectations.


In [None]:
def identify_home_location_and_is_home(df):
    # Parameters for DBSCAN
    kms_per_radian = 6371.0088
    epsilon = 0.1 / kms_per_radian  # 0.1 km radius
    
    # Convert lat/lon to radians for DBSCAN
    df['lat_rad'] = np.radians(df['lat'])
    df['lon_rad'] = np.radians(df['lon'])
    coords = df[['lat_rad', 'lon_rad']].values
    
    # Run DBSCAN
    # Results seem decent, but I arguably should've used something like haversine distance
    # (DBSCAN defaults to metric='euclidean'). Unsure how this would change results.
    db = DBSCAN(eps=epsilon, min_samples=800).fit(coords)
    cluster_labels = db.labels_
    
    # Add cluster labels to the dataframe
    df['home_cluster'] = cluster_labels
    
    # Filter out noise points
    clusters = df[df['home_cluster'] != -1]
    
    # Store coordinates for the centroid of each home cluster
    if not clusters.empty:
        home_coords = []
        unique_clusters = clusters['home_cluster'].unique()
        
        for cluster_label in unique_clusters:
            cluster_points = clusters[clusters['home_cluster'] == cluster_label]
            cluster_centroid = MultiPoint(cluster_points[['lon', 'lat']].values).centroid
            home_coords.append((cluster_centroid.y, cluster_centroid.x))
        
        df['home_coords'] = [home_coords] * len(df) # Home coordinates list of tuples
        df['num_of_homes'] = len(home_coords) # Number of home clusters identified
    else:
        # If no valid cluster found, fill home_coords with NaN and set num_of_homes to 0
        df['home_coords'] = np.nan
        df['num_of_homes'] = 0
    
    return df

loc_night = loc_night.groupby('uid').apply(identify_home_location_and_is_home).reset_index(drop=True)
display(loc_night.head())

In [None]:
# Merge home info to full dataframe
display(loc_df.shape)
home_info = loc_night[['uid', 'home_coords', 'num_of_homes']].drop_duplicates(subset='uid')
loc_df = pd.merge(loc_df, home_info, on='uid', how='left')

# Row-wise geodesic check. Could be vectorized if too slow. Arguably should use haversine,
# but geodesic and haversine are probably about equivalent over distances this small.
# Counts observed for each radius (m) when using uids 0011-0020:
# 20: 78,544; 40: 78,575; 50: 78,668; 70: 79,099; 100: 79,371; home_cluster: 90,860
def is_within_home_area(row, home_coords, radius=40):
    if isinstance(home_coords, float) and np.isnan(home_coords):
        return np.nan
    
    point = (row['lat'], row['lon'])
    for home in home_coords:
        if geodesic(point, home).meters <= radius:
            return True
    return False

loc_df['is_home'] = loc_df.apply(lambda row: is_within_home_area(row, home_coords=row['home_coords']), axis=1)
display(loc_df.head(), loc_df.shape, loc_df["is_home"].value_counts())

In [None]:
loc_night['is_home'] = loc_night.apply(lambda row: is_within_home_area(row, home_coords=row['home_coords']), axis=1)
loc_df[loc_df["is_home"].isna()]["uid"].value_counts()

(I think this might be when subsetting based on sleep data and whatever parameters I used; fewer participants missing clusters when using 10pm–7am subset)

Garmin Missing Home Clusters for 23 uids:
- 0102@mlife    15162
- 0070@mlife    14827
- 0196@mlife    13107
- 0161@mlife    13023
- 0167@mlife    12995
- 0013@mlife    11307
- 0238@mlife    11033
- 0264@mlife    10786
- 0030@mlife    10040
- 0019@mlife     7558
- 0138@mlife     6619
- 0139@mlife     6458
- 0169@mlife     4778
- 0158@mlife     4264
- 0188@mlife     3855
- 0028@mlife     3356
- 0245@mlife     1221
- 0258@mlife     1012
- 0191@mlife     1006
- 0135@mlife      834
- 0053@mlife      829
- 0137@mlife      540
- 0220@mlife      121

## Plot based on cluster

In [None]:
# Function to plot clusters for a specific participant on a map
def plot_clusters_on_map(df, uid):
    participant_data = df[df['uid'] == uid]
    
    # Return if there's no data for the uid
    if participant_data.empty:
        print(f"No data available for participant {uid}")
        return
    
    folium_map = None
    unique_clusters = participant_data['home_cluster'].unique()
    colors = [
        'blue', 'green', 'purple', 'orange', 'darkred', 'lightred', 'darkblue', 
        'darkgreen', 'cadetblue', 'darkpurple', 'pink', 'lightblue'
    ]
    marker_cluster = MarkerCluster()
    
    # Plot all coords and color based on cluster
    for idx, cluster_label in enumerate(unique_clusters):
        cluster_points = participant_data[participant_data['home_cluster'] == cluster_label]        
        color = colors[idx % len(colors)] if cluster_label != -1 else 'gray'
        
        for _, row in cluster_points.iterrows():
            folium.CircleMarker(
                location=[row['lat'], row['lon']],
                radius=5,
                color=color,
                fill=True,
                fill_color=color,
                fill_opacity=0.6,
            ).add_to(marker_cluster)
        
        # For each home cluster (excluding noise), add a marker
        if cluster_label != -1:
            cluster_centroid = MultiPoint(cluster_points[['lon', 'lat']].values).centroid
            if folium_map is None:
                folium_map = folium.Map(location=[cluster_centroid.y, cluster_centroid.x], zoom_start=12)
            
            folium.Marker(
                location=[cluster_centroid.y, cluster_centroid.x],
                popup=f"Home Location (Cluster {cluster_label})",
                icon=folium.Icon(color='red', icon='home'),
            ).add_to(folium_map)
    
    # Add marker cluster to folium map
    if folium_map is not None:
        marker_cluster.add_to(folium_map)
        return folium_map
    else:
        print(f"No valid clusters found for participant {uid}")
        return None

map_night = plot_clusters_on_map(loc_night, uid='0017@mlife')
display(map_night)

## Plot based on is_home

In [None]:
# Function to plot a specific participant's points on a map, colored by is_home
def plot_clusters_on_map(df, uid):
    participant_data = df[df['uid'] == uid]
    
    # Return if there's no data for the uid
    if participant_data.empty:
        print(f"No data available for participant {uid}")
        return
    
    folium_map = None
    marker_cluster = MarkerCluster()
    
    # Plot all coords and color based on whether the point is home or not
    for _, row in participant_data.iterrows():
        color = 'blue' if row['is_home'] else 'gray'
        
        folium.CircleMarker(
            location=[row['lat'], row['lon']],
            radius=5,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.6,
        ).add_to(marker_cluster)
    
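    # Place home markers based on the first row's 'home_coords'
    home_coords = participant_data.iloc[0]['home_coords']
    # Skip when home_coords is NaN (no cluster found); iterating over a float would raise a TypeError
    if isinstance(home_coords, (list, tuple)) and len(home_coords) > 0: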
        for coord in home_coords:
            lat, lon = coord 
            if folium_map is None:
                folium_map = folium.Map(location=[lat, lon], zoom_start=12)

            folium.Marker(
                location=[lat, lon],
                popup="Home Location",
                icon=folium.Icon(color='red', icon='home'),
            ).add_to(folium_map)
    
    # Add marker cluster to folium map
    if folium_map is not None:
        marker_cluster.add_to(folium_map)
        return folium_map
    else:
        print(f"No valid data found for participant {uid}")
        return None

map_all = plot_clusters_on_map(loc_df, uid='0011@mlife')
display(map_all)

# Manual Home Identification for Participants Missing Home Clusters

No valid home clusters were identified for:
- 0019@mlife    7558
- 0158@mlife    4264
- 0245@mlife    1221
- 0258@mlife    1012
- 0191@mlife    1006
- 0135@mlife     834
- 0053@mlife     829
- 0137@mlife     540
- 0220@mlife     121

In [None]:
loc_df = pd.read_csv("/Users/dawsonhaddox/Documents/Jacobson Lab/Avoidance-Mediation/time_at_home.csv", usecols=lambda column: column != "Unnamed: 0")
loc_df['time'] = pd.to_datetime(loc_df['time'])
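# Parse the stored list-of-tuples strings; ast.literal_eval avoids executing arbitrary code
loc_df['home_coords'] = loc_df['home_coords'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
loc_df['is_home'] = loc_df['is_home'].astype(bool)  # Caution: astype(bool) converts NaN to True; maybe switch to the nullable 'boolean' dtype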

night_start = pd.to_datetime("22:00:00", format='%H:%M:%S').time()
night_end = pd.to_datetime("07:00:00", format='%H:%M:%S').time()
loc_dark = loc_df[(loc_df['time'].dt.time >= night_start) | (loc_df['time'].dt.time <= night_end)]

display(loc_df.head(), loc_df.shape, loc_dark.shape, loc_df.dtypes)

In [None]:
# Function to plot a specific participant's points on a map, colored by is_home,
# with coordinate popups to help pick home locations manually
def plot_clusters_on_map(df, uid):
    participant_data = df[df['uid'] == uid]
    
    # Return if there's no data for the uid
    if participant_data.empty:
        print(f"No data available for participant {uid}")
        return
    
    folium_map = None
    marker_cluster = MarkerCluster()
    
    # Plot all coords and color based on whether the point is home or not
    display(participant_data.head())
    for _, row in participant_data.iterrows():
        color = 'blue' if row['is_home'] else 'gray'
        
        # Add a marker with a popup displaying the coordinates
        folium.CircleMarker(
            location=[row['lat'], row['lon']],
            radius=5,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.6,
            popup=f"Coordinates: {row['lat']}, {row['lon']}"
        ).add_to(marker_cluster)
    
    # Place home markers based on the first row's 'home_coords'
    home_coords = participant_data.iloc[0]['home_coords']
    
    # Check if home_coords is not NaN or None before proceeding
    if isinstance(home_coords, (list, tuple)) and len(home_coords) > 0:
        for coord in home_coords:
            lat, lon = coord
            if folium_map is None:
                folium_map = folium.Map(location=[lat, lon], zoom_start=12)

            folium.Marker(
                location=[lat, lon],
                popup=f"Home Location: {lat}, {lon}",
                icon=folium.Icon(color='red', icon='home'),
            ).add_to(folium_map)
    
    # If no home_coords, create a map with default zoom on the first location
    if folium_map is None:
        # Use the first data point to initialize the map
        first_location = [participant_data.iloc[0]['lat'], participant_data.iloc[0]['lon']]
        folium_map = folium.Map(location=first_location, zoom_start=12)
    
    # Add marker cluster to folium map
    marker_cluster.add_to(folium_map)
    
    # Add LatLngPopup to allow clicking anywhere on the map to get coordinates
    folium_map.add_child(folium.LatLngPopup())

    return folium_map

map_all = plot_clusters_on_map(loc_dark, uid='0019@mlife')
display(map_all)

Manual Home Clusters (lat,lon)
- 0019@mlife    7558: 34.80717923333333, -83.03800723333335 *
- 0158@mlife    4264: 43.076911474285716, -75.60749948571429 *
- 0245@mlife    1221: 34.61370630222222, -86.98471391333331
- 0258@mlife    1012: 36.48292789146346, -80.59133030975616
- 0191@mlife    1006: 43.16906334363637, -89.26362676545465; 42.14770811999999, -88.01730834
- 0135@mlife     834: 42.636156193650805, -84.60205366984124
- 0053@mlife     829: 35.04374041512605, -78.86898512436973
- 0137@mlife     540: 39.9749601888889, -105.08223831111113
- 0220@mlife     121: 32.759629193877544, -117.08836552040808

Garmin Manual Home Clusters (lat,lon)
- 0102@mlife    15162: 42.57890806870229, -71.78348419618322
- 0070@mlife    14827: 30.118677459999994, -97.39395718571429
- 0196@mlife    13107: 29.69164973130435, -96.58995418173907; 29.71451887815126, -96.53967332352948
- 0161@mlife    13023: 37.21932475507247, -80.44992237391305 (Had no Garmin sleep data)
- 0167@mlife    12995: 41.14155916964285, -104.77647582678574
- 0013@mlife    11307: 33.20472388333334, -97.13972475833329 
- 0238@mlife    11033: 40.64433656129033, -74.26836620322581
- 0264@mlife    10786: 41.84367935, -103.66702671666668
- 0030@mlife    10040: 43.62704169705883, -72.93721083529411
- 0019@mlife     7558: 34.80717923333333, -83.03800723333335
- 0138@mlife     6619: 41.31010776923077, -72.92940483846155
- 0139@mlife     6458: 42.18948781111111, -85.54333844444446
- 0169@mlife     4778: 37.00420913448274, -93.0815587747126
- 0158@mlife    4264: 43.076911474285716, -75.60749948571429
- 0188@mlife     3855: 37.62998183478261, -97.36649026666667
- 0028@mlife     3356: 39.0247310154412, -122.67191622647056
- 0245@mlife    1221: 34.61370630222222, -86.98471391333331
- 0258@mlife    1012: 36.48292789146346, -80.59133030975616
- 0191@mlife    1006: 43.16906334363637, -89.26362676545465; 42.14770811999999, -88.01730834
- 0135@mlife     834: 42.636156193650805, -84.60205366984124
- 0053@mlife     829: 35.04374041512605, -78.86898512436973
- 0137@mlife     540: 39.9749601888889, -105.08223831111113
- 0220@mlife     121: 32.759629193877544, -117.08836552040808

In [None]:
# Dictionary mapping each uid to a list of (lat, lon) tuples, one tuple per home
manual_home_clusters = {
    "0019@mlife": [(34.80717923333333, -83.03800723333335)],
    "0158@mlife": [(43.076911474285716, -75.60749948571429)],
    "0245@mlife": [(34.61370630222222, -86.98471391333331)],
    "0258@mlife": [(36.48292789146346, -80.59133030975616)],
    "0191@mlife": [(43.16906334363637, -89.26362676545465), (42.14770811999999, -88.01730834)],
    "0135@mlife": [(42.636156193650805, -84.60205366984124)],
    "0053@mlife": [(35.04374041512605, -78.86898512436973)],
    "0137@mlife": [(39.9749601888889, -105.08223831111113)],
    "0220@mlife": [(32.759629193877544, -117.08836552040808)],
}

# manual_home_clusters = {
#     "0019@mlife": [(34.80717923333333, -83.03800723333335)],
#     "0158@mlife": [(43.076911474285716, -75.60749948571429)],
#     "0245@mlife": [(34.61370630222222, -86.98471391333331)],
#     "0258@mlife": [(36.48292789146346, -80.59133030975616)],
#     "0191@mlife": [(43.16906334363637, -89.26362676545465), (42.14770811999999, -88.01730834)],
#     "0135@mlife": [(42.636156193650805, -84.60205366984124)],
#     "0053@mlife": [(35.04374041512605, -78.86898512436973)],
#     "0137@mlife": [(39.9749601888889, -105.08223831111113)],
#     "0220@mlife": [(32.759629193877544, -117.08836552040808)],
#     "0102@mlife": [(42.57890806870229, -71.78348419618322)],
#     "0070@mlife": [(30.118677459999994, -97.39395718571429)],
#     "0196@mlife": [(29.69164973130435, -96.58995418173907), (29.71451887815126, -96.53967332352948)],
#     "0161@mlife": [(37.21932475507247, -80.44992237391305)],
#     "0167@mlife": [(41.14155916964285, -104.77647582678574)],
#     "0013@mlife": [(33.20472388333334, -97.13972475833329)],
#     "0238@mlife": [(40.64433656129033, -74.26836620322581)],
#     "0264@mlife": [(41.84367935, -103.66702671666668)],
#     "0030@mlife": [(43.62704169705883, -72.93721083529411)],
#     "0138@mlife": [(41.31010776923077, -72.92940483846155)],
#     "0139@mlife": [(42.18948781111111, -85.54333844444446)],
#     "0169@mlife": [(37.00420913448274, -93.0815587747126)],
#     "0188@mlife": [(37.62998183478261, -97.36649026666667)],
#     "0028@mlife": [(39.0247310154412, -122.67191622647056)],
# }

# Function to set home_coords for each uid
def set_home_coords(df, home_clusters):
    df['home_coords'] = df.apply(
        lambda row: home_clusters.get(row['uid'], row['home_coords']), axis=1
    )
    return df

# Apply function to loc_df
loc_df = set_home_coords(loc_df, manual_home_clusters)

# Same proximity check as earlier (redefined here so this cell can run on its own)
def is_within_home_area(row, home_coords, radius=40): 
    if isinstance(home_coords, float) and np.isnan(home_coords):
        return np.nan
    
    point = (row['lat'], row['lon'])
    for home in home_coords:
        if geodesic(point, home).meters <= radius:
            return True
    return False

# Update is_home column only for rows where uid exists in manual_home_clusters
loc_df['is_home'] = loc_df.apply(
    lambda row: is_within_home_area(row, home_coords=row['home_coords']) if row['uid'] in manual_home_clusters else row['is_home'], axis=1
)

loc_df.head()

In [None]:
display(loc_df[loc_df["is_home"].isna()]["uid"].value_counts(), loc_df[loc_df["home_coords"].isna()]["uid"].value_counts())

In [None]:
# Save data
# loc_df.to_csv('time_at_home.csv', index=False)

# (Extra) Connect with phone unlock

In [None]:
# @lru_cache(maxsize=1)
# def get_unlock_data(uid, date):
#     try:
#         unlock_df = pd.read_csv(f'unlock_clean/{uid}_unlock.csv', index_col='date')
#     except FileNotFoundError:
#         print(f"File for UID {uid} not found.")
#         return {}
#     except pd.errors.ParserError as e:
#         print(f"Error parsing CSV for UID {uid}: {e}")
#         return {}
    
#     matching_row = unlock_df[unlock_df.index.str.startswith(date)]
    
#     if matching_row.empty:
#         print(f"Date {date} not found for UID {uid}")
#         return {}

#     unlock_list = ast.literal_eval(matching_row.iloc[0]['data'])
#     return {int(time): status for time, status in enumerate(unlock_list)}

# def determine_is_unlock(row):
#     """
#     Determines if the user has unlocked at the given timestamp (row['time']).
#     """
#     uid = row['uid']
#     date = row['time'].strftime('%Y-%m-%d')
#     time_in_seconds = row['time'].hour * 3600 + row['time'].minute * 60 + row['time'].second
#     unlock_dict = get_unlock_data(uid, date)
    
#     if not unlock_dict:
#         return None
    
#     return unlock_dict.get(time_in_seconds, False)

# tqdm.pandas()
# loc_df['is_unlock'] = loc_df.progress_apply(lambda row: determine_is_unlock(row), axis=1)
# # loc_df['is_unlock'] = loc_df.progress_apply(
# #     lambda row: determine_is_unlock(row) if row['uid'] in manual_home_clusters or pd.isna(row['is_unlock']) else row['is_unlock'], 
# #     axis=1
# # )

In [None]:
# loc_df[loc_df['is_unlock'].isna()]['uid'].unique()

In [None]:
# loc_df['unlock_away_from_home'] = (loc_df['is_home'] == False) & (loc_df['is_unlock'] == 1.0)
# loc_df.sample(n=30)

In [None]:
# loc_df[loc_df["is_home"].isna()]["uid"].value_counts()

In [None]:
# loc_df[loc_df["is_unlock"].isna()]["uid"].value_counts()

In [None]:
# loc_df.to_csv('time_at_home.csv', index=False)
# loc_df.to_csv('time_at_home_garmin.csv', index=False)

# Agreement between Default Nighttime Subset and Garmin Sleep Subset

(unsure if this was ever finished, but I remember preliminary results showed they worked equally well)

In [None]:
# import pandas as pd
# import numpy as np
# from collections import defaultdict
# from geopy.distance import great_circle, geodesic
# from shapely.geometry import MultiPoint
# from sklearn.cluster import DBSCAN
# import matplotlib.pyplot as plt
# import folium
# from folium.plugins import MarkerCluster
# from functools import lru_cache
# from tqdm import tqdm
# from datetime import datetime, timedelta
# import ast
# import time

In [None]:
# default_df = pd.read_csv("/Users/dawsonhaddox/Documents/Jacobson Lab/Avoidance-Mediation/time_at_home.csv", usecols=lambda column: column != "Unnamed: 0")
# garmin_df = pd.read_csv("/Users/dawsonhaddox/Documents/Jacobson Lab/Avoidance-Mediation/time_at_home_garmin.csv", usecols=lambda column: column != "Unnamed: 0")

# display("Default Data:", default_df.head(), default_df.shape, garmin_df.head(), garmin_df.shape)

In [None]:
# # Function to ensure that 'home_coords' is properly formatted as a list of tuples of floats
# def validate_and_cast(row):
#     row = ast.literal_eval(row)
#     return [(float(a), float(b)) for a, b in row if isinstance(a, (int, float)) and isinstance(b, (int, float))]

# # Apply the function to 'home_coords' to ensure it's properly formatted and cast
# garmin_df['home_coords'] = garmin_df['home_coords'].apply(validate_and_cast)
# default_df['home_coords'] = default_df['home_coords'].apply(validate_and_cast)

# # Recreate 'num_of_homes' to count the number of tuples
# garmin_df["num_of_homes"] = garmin_df['home_coords'].apply(len)
# default_df["num_of_homes"] = default_df['home_coords'].apply(len)

# display(garmin_df, default_df)

In [None]:
# # Drop duplicate uids
# default_subset = default_df.drop_duplicates(subset=['uid'], keep='first')
# garmin_subset = garmin_df.drop_duplicates(subset=['uid'], keep='first')

# # Merge the two subsets on 'uid' and compare the lengths of 'num_of_homes'
# merged_df = pd.merge(default_subset, garmin_subset, on='uid', suffixes=('_default', '_garmin'))

# # Filter rows where the lengths of 'num_of_homes' are different
# result_df = merged_df[merged_df['num_of_homes_default'] != merged_df['num_of_homes_garmin']]
# display(result_df, result_df.shape)

In [None]:
# import numpy as np
# from geopy.distance import great_circle

# def haversine_distance(coord1, coord2):
#     return great_circle(coord1, coord2).meters

# def average_distance(coords1, coords2):
#     # Ensure that both lists have the same length
#     if len(coords1) != len(coords2):
#         raise ValueError("Coordinate lists must have the same length.")
    
#     # Compute distances between corresponding coordinates
#     distances = [haversine_distance(coord1, coord2) for coord1, coord2 in zip(coords1, coords2)]
#     return np.mean(distances) if distances else None

# # Initialize dictionary to store average distances and coordinates
# average_distances = {
#     'uid': [],
#     'home_coords_default': [],
#     'home_coords_garmin': [],
#     'average_distance': []
# }

# # Merge filtered DataFrames on 'uid'
# merged_filtered_df = pd.merge(default_subset_filtered, garmin_subset_filtered, on='uid', suffixes=('_default', '_garmin'))

# # Compute average distances and store coordinates for each uid
# for index, row in merged_filtered_df.iterrows():
#     uid = row['uid']
#     coords_default = row['home_coords_default']
#     coords_garmin = row['home_coords_garmin']
    
#     # Ensure both coordinate lists have the same length
#     if len(coords_default) != len(coords_garmin):
#         continue
    
#     avg_distance = average_distance(coords_default, coords_garmin)
    
#     # Append results to the dictionary
#     average_distances['uid'].append(uid)
#     average_distances['home_coords_default'].append(coords_default)
#     average_distances['home_coords_garmin'].append(coords_garmin)
#     average_distances['average_distance'].append(avg_distance)

# # Convert to DataFrame for better visualization
# average_distances_df = pd.DataFrame(average_distances)

# # Filter to rows where the average distance is more than 5 meters
# filtered_average_distances_df = average_distances_df[average_distances_df['average_distance'] > 5.0]

# # Display the filtered DataFrame along with its shape and summary statistics
# display(filtered_average_distances_df, filtered_average_distances_df.shape, average_distances_df.describe())


In [None]:
# # Convert the 'time' column to datetime format
# default_df['time'] = pd.to_datetime(default_df['time'])

# # Define the nighttime range
# night_start = pd.to_datetime("22:00:00", format='%H:%M:%S').time()
# night_end = pd.to_datetime("07:00:00", format='%H:%M:%S').time()

# # Filter data to nighttime hours
# night_df = default_df[(default_df['time'].dt.time >= night_start) | (default_df['time'].dt.time <= night_end)]

# sleep_df = garmin_df[garmin_df['is_sleep']==True]

In [None]:
# # Function to plot clusters for a specific participant on a map
# def plot_clusters_on_map(df, uid):
#     participant_data = df[df['uid'] == uid]
    
#     # Return if there's no data for the uid
#     if participant_data.empty:
#         print(f"No data available for participant {uid}")
#         return
    
#     folium_map = None
#     marker_cluster = MarkerCluster()
    
#     # Plot all coords and color based on whether the point is home or not
#     display(participant_data.head())
#     for _, row in participant_data.iterrows():
#         color = 'blue' if row['is_home'] else 'gray'
        
#         # Add a marker with a popup displaying the coordinates
#         folium.CircleMarker(
#             location=[row['lat'], row['lon']],
#             radius=5,
#             color=color,
#             fill=True,
#             fill_color=color,
#             fill_opacity=0.6,
#             popup=f"Coordinates: {row['lat']}, {row['lon']}"
#         ).add_to(marker_cluster)
    
#     # Place home markers based on the first row's 'home_coords'
#     home_coords = participant_data.iloc[0]['home_coords']
    
#     # Check if home_coords is not NaN or None before proceeding
#     if isinstance(home_coords, (list, tuple)) and len(home_coords) > 0:
#         for coord in home_coords:
#             lat, lon = coord
#             if folium_map is None:
#                 folium_map = folium.Map(location=[lat, lon], zoom_start=12)

#             folium.Marker(
#                 location=[lat, lon],
#                 popup=f"Home Location: {lat}, {lon}",
#                 icon=folium.Icon(color='red', icon='home'),
#             ).add_to(folium_map)
    
#     # If no home_coords, create a map with default zoom on the first location
#     if folium_map is None:
#         # Use the first data point to initialize the map
#         first_location = [participant_data.iloc[0]['lat'], participant_data.iloc[0]['lon']]
#         folium_map = folium.Map(location=first_location, zoom_start=12)
    
#     # Add marker cluster to folium map
#     marker_cluster.add_to(folium_map)
    
#     # Add LatLngPopup to allow clicking anywhere on the map to get coordinates
#     folium_map.add_child(folium.LatLngPopup())

#     return folium_map

# map_all = plot_clusters_on_map(sleep_df, uid='0026@mlife')
# display(map_all)