### Summary of Analysis and Visualization

This notebook analyzes Google location data to uncover weekly movement patterns during a semester at Wheaton College.

- **Data Preparation**: Location data is deduplicated and filtered to points inside a rectangular boundary around Wheaton's campus (GeoJSON-style longitude/latitude coordinates).
- **Timestamp Adjustments**: Data is aligned to Chicago time, filtered by semester dates, and limited to weekdays from 6 AM onward.
- **Heatmap Insights**: `HeatMapWithTime` visualizations highlight movement patterns on different class days (MWF vs. TT), providing a clear picture of weekly activity.


### Running the Notebook

To analyze weekly class patterns and generate heatmaps, begin by running cells from **"Starting Analysis: Generating Heatmaps for Weekly Class Patterns"**. This segment loads the preprocessed `filtered_df` dataset from the `filtered_data.db` SQLite database and visualizes movement patterns across class days.

#### Steps for Running:
1. **Load Data**: Cells will load `filtered_df` from the database, ensuring timestamps are in `America/Chicago` timezone.
2. **Generate Heatmaps**:
   - **Monday, Wednesday, Friday**: Heatmaps use intervals matching typical college class times (6 AM - 10 PM), plus a final "After 10:00 PM" frame.
   - **Tuesday, Thursday**: Similar intervals display patterns for Tuesday/Thursday schedules.

> **Note**: Ensure the `filtered_data.db` file is present in the same directory for proper data access. Run each cell sequentially to load, analyze, and visualize the data effectively.
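
A quick pre-flight check along these lines can confirm the database is usable before running the analysis cells. This is a minimal sketch: `has_table` is a hypothetical helper that is not part of the notebook, and the file and table names are the ones used by the cell that writes the database.

```python
import os
import sqlite3

def has_table(db_path, table_name):
    """Return True if db_path exists and contains a table named table_name."""
    # Check the path first: sqlite3.connect would silently create a missing file.
    if not os.path.exists(db_path):
        return False
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table_name,),
        ).fetchone()
    finally:
        conn.close()
    return row is not None

print(has_table("filtered_data.db", "filtered_data"))
```

If this prints `False`, run the preprocessing cells first so that `filtered_data.db` is written.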

### Loading Packages and Functions

In [1]:
import json
import pandas as pd
import re
from datetime import datetime
from shapely.geometry import Point, Polygon

import sqlite3

In [2]:
def load_data(file_path):
    with open(file_path, 'r') as file:
        return json.load(file)

input_file_path = './timeline.json'
data = load_data(input_file_path)

In [3]:
def parse_latlng(latlng_str):
    match = re.match(r"([+-]?\d+\.\d+)°?,\s*([+-]?\d+\.\d+)°?", latlng_str)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None, None
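
To see what the parser accepts, here is a small self-contained check. The sample strings are assumptions about how the Timeline export writes coordinates (decimal degrees with an optional degree sign); the function body is copied from the cell above.

```python
import re

def parse_latlng(latlng_str):
    # Same pattern as the notebook cell: two signed decimals, optional degree signs.
    match = re.match(r"([+-]?\d+\.\d+)°?,\s*([+-]?\d+\.\d+)°?", latlng_str)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None, None

print(parse_latlng("41.869569°, -88.095996°"))  # (41.869569, -88.095996)
print(parse_latlng("41.869569, -88.095996"))    # (41.869569, -88.095996)
print(parse_latlng("41, -88"))                  # (None, None): no decimal point
```

Strings that do not match yield `(None, None)`, and those rows are later discarded by the `dropna()` call in the preprocessing step.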

In [4]:
def extract_raw_signals(data):
    mappings = []
    if 'rawSignals' in data:
        for signal in data['rawSignals']:
            position = signal.get('position', {})
            latlng_str = position.get('LatLng')
            timestamp = position.get('timestamp')
            if latlng_str and timestamp:
                lat, lon = parse_latlng(latlng_str)
                mappings.append({'latitude': lat, 'longitude': lon, 'timestamp': timestamp})
    return mappings

raw_signals = extract_raw_signals(data)

In [5]:
def extract_semantic_segments(data):
    mappings = []
    if 'semanticSegments' in data:
        for segment in data['semanticSegments']:
            mappings.extend(extract_activity(segment))
            mappings.extend(extract_visit(segment))
    return mappings

def extract_activity(segment):
    mappings = []
    activity = segment.get('activity')
    if activity:
        start = activity.get('start', {}).get('latLng')
        end = activity.get('end', {}).get('latLng')
        start_time = segment.get('startTime')
        end_time = segment.get('endTime')
        if start and start_time:
            lat, lon = parse_latlng(start)
            mappings.append({'latitude': lat, 'longitude': lon, 'timestamp': start_time})
        if end and end_time:
            lat, lon = parse_latlng(end)
            mappings.append({'latitude': lat, 'longitude': lon, 'timestamp': end_time})
    return mappings

def extract_visit(segment):
    mappings = []
    visit = segment.get('visit')
    if visit:
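        # Use .get so a visit without a 'topCandidate' entry is skipped instead of raising KeyError.
        place = visit.get('topCandidate', {}).get('placeLocation', {}).get('latLng')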
        start_time = segment.get('startTime')
        end_time = segment.get('endTime')
        if place and start_time:
            lat, lon = parse_latlng(place)
            mappings.append({'latitude': lat, 'longitude': lon, 'timestamp': start_time})
        if place and end_time:
            lat, lon = parse_latlng(place)
            mappings.append({'latitude': lat, 'longitude': lon, 'timestamp': end_time})
    return mappings

semantic_segments = extract_semantic_segments(data)

In [6]:
mappings = raw_signals + semantic_segments

### Preprocessing

In [7]:
mappings_df = pd.DataFrame(mappings).dropna().reset_index(drop=True)


print("Location Data:")
print(mappings_df.head())
print("Number of data points:", len(mappings_df))

Location Data:
    latitude  longitude                      timestamp
0  41.869569 -88.095996  2024-10-18T13:54:37.000-05:00
1  41.869569 -88.095996  2024-10-18T13:57:20.000-05:00
2  41.869725 -88.096451  2024-10-18T13:58:22.000-05:00
3  41.869725 -88.096427  2024-10-18T14:01:31.000-05:00
4  41.869745 -88.096365  2024-10-18T14:02:40.000-05:00
Number of data points: 7604


In [8]:
mappings_df = mappings_df.drop_duplicates(subset=['latitude', 'longitude', 'timestamp'])
print("Number of data points after deduplication:", len(mappings_df))

Number of data points after deduplication: 7544


In [9]:
wheaton_boundary = Polygon([
    (-88.10183003470493, 41.87380858252311),
    (-88.10183003470493, 41.865779902798465),
    (-88.09208874564698, 41.865779902798465),
    (-88.09208874564698, 41.87380858252311),
    (-88.10183003470493, 41.87380858252311)
])

mappings_df['within_boundary'] = mappings_df.apply(
    lambda row: Point(row['longitude'], row['latitude']).within(wheaton_boundary), axis=1
)
mappings_df = mappings_df[mappings_df['within_boundary']].drop(columns=['within_boundary'])

print("Number of data points within Wheaton College boundary:", len(mappings_df))

Number of data points within Wheaton College boundary: 2162
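
Because the boundary is an axis-aligned rectangle, the same test can be expressed as plain coordinate comparisons. The sketch below is an illustration rather than the notebook's method; `in_bbox` is a hypothetical helper, and the corner values are copied from `wheaton_boundary`.

```python
# Corner coordinates copied from the wheaton_boundary polygon.
MIN_LON, MAX_LON = -88.10183003470493, -88.09208874564698
MIN_LAT, MAX_LAT = 41.865779902798465, 41.87380858252311

def in_bbox(lon, lat):
    """Strict inequalities: points exactly on an edge are excluded,
    which matches how shapely's `within` treats boundary points."""
    return MIN_LON < lon < MAX_LON and MIN_LAT < lat < MAX_LAT

print(in_bbox(-88.095996, 41.869569))  # True: first row of the sample output
print(in_bbox(-87.6298, 41.8781))      # False: hypothetical off-campus point
```

On a DataFrame, the same comparisons could likely be vectorized with `Series.between(..., inclusive='neither')` on the `longitude` and `latitude` columns, avoiding a row-wise `apply`.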


#### Ensuring Chicago Time

In [10]:
mappings_df['timestamp'] = pd.to_datetime(mappings_df['timestamp'], errors='coerce')

mappings_df.dropna(subset=['timestamp'], inplace=True)

mappings_df['timestamp'] = mappings_df['timestamp'].apply(
    lambda x: x.tz_localize('America/Chicago', ambiguous='NaT', nonexistent='shift_forward') 
    if x.tzinfo is None else x.tz_convert('America/Chicago')
)



#### Selecting the Semester for Analysis
To determine the best semester for pattern analysis, this code iterates over defined semester time ranges (6 AM - 11 PM) and counts data points. Semester 3 provides the highest number of data points, making it ideal for focused analysis.

In [11]:
semesters = {
    "Semester 1": (pd.to_datetime("2023-08-28").tz_localize('America/Chicago'), 
                   pd.to_datetime("2023-12-19").tz_localize('America/Chicago')),
    "Semester 2": (pd.to_datetime("2024-01-13").tz_localize('America/Chicago'), 
                   pd.to_datetime("2024-05-08").tz_localize('America/Chicago')),
    "Semester 3": (pd.to_datetime("2024-08-27").tz_localize('America/Chicago'), 
                   pd.to_datetime("2024-12-18").tz_localize('America/Chicago'))
}

semester_counts = {}
for semester, (start_date, end_date) in semesters.items():
    semester_df = mappings_df[(mappings_df['timestamp'] >= start_date) & 
                              (mappings_df['timestamp'] <= end_date)]
    semester_df = semester_df[semester_df['timestamp'].dt.hour.between(6, 22)]
    semester_counts[semester] = semester_df.shape[0]

for semester, count in semester_counts.items():
    print(f"Number of data points in {semester} between 6 AM and 11 PM:", count)

Number of data points in Semester 1 between 6 AM and 11 PM: 461
Number of data points in Semester 2 between 6 AM and 11 PM: 400
Number of data points in Semester 3 between 6 AM and 11 PM: 1010
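
The choice of semester can also be made programmatically rather than by reading the printout. A minimal sketch, with the counts copied from the output above:

```python
# Counts copied from the printed output of the semester loop.
semester_counts = {"Semester 1": 461, "Semester 2": 400, "Semester 3": 1010}

# Pick the key whose count is largest.
best_semester = max(semester_counts, key=semester_counts.get)
print(best_semester)  # Semester 3
```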


#### Data Filtering for Semester Analysis

This cell filters location data for the selected semester, limiting entries to weekdays (Monday–Friday) and active hours (after 6 AM). Only data points within the Wheaton College boundary are retained, ensuring analysis is semester-specific and campus-focused.

In [12]:
start_time = pd.to_datetime("2024-08-27").tz_localize('America/Chicago')
end_time = pd.to_datetime("2024-12-18").tz_localize('America/Chicago')

filtered_df = mappings_df[
    (mappings_df['timestamp'] >= start_time) &
    (mappings_df['timestamp'] <= end_time) &
    (mappings_df['timestamp'].dt.day_name().isin(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])) &
    (mappings_df['timestamp'].dt.hour >= 6)
].copy()

filtered_df.loc[:, 'within_boundary'] = filtered_df.apply(
    lambda row: Point(row['longitude'], row['latitude']).within(wheaton_boundary), axis=1
)

filtered_df = filtered_df[filtered_df['within_boundary']].drop(columns=['within_boundary'])

print("Number of data points within boundary after 6 AM: ", len(filtered_df))

Number of data points within boundary after 6 AM:  917
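
The weekday and hour conditions can be checked on single timestamps with the standard library alone. This is a sketch under one assumption: the UTC offset already present in each string is Chicago's, so the wall-clock hour is read directly. `is_weekday_after_6am` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime

def is_weekday_after_6am(iso_timestamp):
    """True for Monday-Friday timestamps at or after 06:00 wall-clock time."""
    ts = datetime.fromisoformat(iso_timestamp)
    # weekday() returns 0 for Monday through 6 for Sunday.
    return ts.weekday() < 5 and ts.hour >= 6

print(is_weekday_after_6am("2024-10-18T13:54:37-05:00"))  # True: Friday afternoon
print(is_weekday_after_6am("2024-10-19T13:54:37-05:00"))  # False: Saturday
print(is_weekday_after_6am("2024-10-18T05:30:00-05:00"))  # False: before 6 AM
```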


The next cell saves the filtered DataFrame (`filtered_df`) to the `filtered_data` table of an SQLite database (`filtered_data.db`), replacing the table if it already exists.

In [13]:
conn = sqlite3.connect('filtered_data.db')  

filtered_df.to_sql('filtered_data', conn, index=False, if_exists='replace')

conn.close()

### Starting Analysis: Generating Heatmaps for Weekly Class Patterns

This segment of code loads the preprocessed `filtered_df` dataset from an SQLite database and generates two distinct heatmaps:

1. **Data Preparation**:  
   - Loads `filtered_df` from the database and ensures timestamps are in the `America/Chicago` timezone.
   
2. **Heatmap Generation**:
   - Uses the `create_heatmap` function to visualize movement patterns across different time intervals on specific weekdays.
   - **Monday, Wednesday, and Friday**: Heatmap intervals align with common college class times throughout the day, starting at 6 AM and ending at 10 PM, followed by a final "After 10:00 PM" interval.
   - **Tuesday and Thursday**: Similar heatmap intervals are set to highlight patterns for Tuesday/Thursday schedules.

Run each cell sequentially to load, analyze, and visualize the weekly patterns for class times.
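
One detail worth knowing when reading the heatmaps: `create_heatmap` compares timestamps with inclusive bounds on both ends, so a point recorded exactly at an interval boundary appears in two adjacent frames. The sketch below contrasts that with a half-open alternative. The helper names are hypothetical, and only three of the Monday/Wednesday/Friday intervals are reproduced.

```python
from datetime import time

# Three of the Monday/Wednesday/Friday intervals, as (label, start, end).
intervals = [
    ("Before 9:20", time(6, 0), time(9, 20)),
    ("9:20 - 10:30", time(9, 20), time(10, 30)),
    ("10:30 - 12:55", time(10, 30), time(12, 55)),
]

def labels_inclusive(t):
    """Labels whose closed range [start, end] contains t."""
    return [label for label, start, end in intervals if start <= t <= end]

def labels_half_open(t):
    """Labels whose half-open range [start, end) contains t."""
    return [label for label, start, end in intervals if start <= t < end]

print(labels_inclusive(time(9, 20)))  # ['Before 9:20', '9:20 - 10:30']
print(labels_half_open(time(9, 20)))  # ['9:20 - 10:30']
```

With half-open ranges each timestamp lands in exactly one frame, but the last interval would then need an end later than `23:59` so that points in the final minute of the day are not dropped.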

In [14]:
import pandas as pd
from shapely.geometry import Point, Polygon
from datetime import datetime

import sqlite3

import folium
from folium.plugins import HeatMapWithTime
from IPython.display import display, HTML, IFrame

In [15]:
conn = sqlite3.connect('filtered_data.db')
filtered_df = pd.read_sql_query("SELECT * FROM filtered_data", conn)
conn.close()

filtered_df['timestamp'] = pd.to_datetime(filtered_df['timestamp'], errors='coerce', utc=True)
filtered_df['timestamp'] = filtered_df['timestamp'].dt.tz_convert('America/Chicago')

In [16]:
def create_heatmap(df, intervals, days, filename):
    filtered_day_df = df[df['timestamp'].dt.day_name().isin(days)]
    heatmap_data = []
    for label, start_time_str, end_time_str in intervals:
        start_time = pd.to_datetime(start_time_str).time()
        end_time = pd.to_datetime(end_time_str).time()
        
        interval_df = filtered_day_df[(filtered_day_df['timestamp'].dt.time >= start_time) & 
                                      (filtered_day_df['timestamp'].dt.time <= end_time)]
        
        heatmap_data.append(interval_df[['latitude', 'longitude']].values.tolist())
    
    map_heatmap = folium.Map(location=[41.870516, -88.096959], zoom_start=16, tiles="OpenStreetMap")
    HeatMapWithTime(heatmap_data, radius=8, gradient={0.4: 'blue', 0.65: 'lime', 1: 'red'}, 
                    min_opacity=0.5, max_opacity=0.8, auto_play=True, display_index=True,
                    index=[label for label, _, _ in intervals]).add_to(map_heatmap)
    
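    # Save the map to disk and embed the saved file in the notebook via an IFrame.
    # (HTML(filename) would render the filename string itself, not the file's contents.)
    map_heatmap.save(filename)
    return IFrame(filename, width=700, height=500)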

In [17]:
mwf_intervals = [
    ("Before 9:20", "06:00", "09:20"),
    ("9:20 - 10:30", "09:20", "10:30"),
    ("10:30 - 12:55", "10:30", "12:55"),
    ("12:55 - 2:05", "12:55", "14:05"),
    ("2:05 - 2:15", "14:05", "14:15"),
    ("2:15 - 3:25", "14:15", "15:25"),
    ("3:25 - 5:00", "15:25", "17:00"),
    ("5:00 - 7:30", "17:00", "19:30"),
    ("7:30 - 10:00", "19:30", "22:00"),
    ("After 10:00 PM", "22:00", "23:59")
]

create_heatmap(filtered_df, mwf_intervals, ['Monday', 'Wednesday', 'Friday'], "mwf_class_times_heatmap.html")

In [18]:
tt_intervals = [
    ("Before 8:00 AM", "06:00", "08:00"),
    ("8:00 - 9:00 AM", "08:00", "09:00"),
    ("9:00 - 10:00 AM", "09:00", "10:00"),
    ("10:00 - 11:15 AM", "10:00", "11:15"),
    ("11:15 - 1:05 PM", "11:15", "13:05"),
    ("1:05 - 2:30 PM", "13:05", "14:30"),
    ("2:30 - 3:30 PM", "14:30", "15:30"),
    ("3:30 - 5:15 PM", "15:30", "17:15"),
    ("5:15 - 6:30 PM", "17:15", "18:30"),
    ("6:30 - 8:00 PM", "18:30", "20:00"),
    ("8:00 - 9:00 PM", "20:00", "21:00"),
    ("9:00 - 10:00 PM", "21:00", "22:00"),
    ("After 10:00 PM", "22:00", "23:59")
]

create_heatmap(filtered_df, tt_intervals, ['Tuesday', 'Thursday'], "tt_class_times_heatmap.html")

### Summary of Findings

Based on Google location data analysis, I observed the following patterns:

- **Increased Pings for Route Deviations**: Google appears to increase location ping frequency when I deviate from my usual routes. For instance, on 2024-10-25, when my 9:20 MWF Systems class was cancelled, I recorded 15 data points between 9 and 11 AM, compared with a median of 4 for that window on Mondays, Wednesdays, and Fridays.
- **Heatmap Indicators for Class Locations**: The heatmaps generally reflect where I spent my class times. For example, during the 2:15-3:25 time block, I consistently appear in Meyer Science Center due to my Database class.
- **Data Limitations for Specific Time Blocks**: Some time intervals, such as Tuesday/Thursday 11:15-1:05, still show my house as the most common location even though I was in Data Science class. This discrepancy likely stems from Google's backend polling logic, which may not sample locations at regular intervals, so the number of points at a location does not reliably reflect the time spent there.

In [19]:
filtered_df['timestamp'] = pd.to_datetime(filtered_df['timestamp'], utc=True).dt.tz_convert('America/Chicago')

specific_day = "2024-10-25"
start_time = f"{specific_day} 09:00:00"
end_time = f"{specific_day} 11:00:00"

start_timestamp = pd.to_datetime(start_time).tz_localize('America/Chicago')
end_timestamp = pd.to_datetime(end_time).tz_localize('America/Chicago')

specific_day_points = filtered_df[(filtered_df['timestamp'] >= start_timestamp) & (filtered_df['timestamp'] <= end_timestamp)]
specific_day_count = len(specific_day_points)

weekday_points = filtered_df[filtered_df['timestamp'].dt.day_name().isin(['Monday', 'Wednesday', 'Friday'])]
weekday_points_in_time_range = weekday_points[(weekday_points['timestamp'].dt.time >= pd.to_datetime("09:00:00").time()) & 
                                              (weekday_points['timestamp'].dt.time <= pd.to_datetime("11:00:00").time())]
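# Note: despite the variable name and the printed label, this is the median of the per-day counts, not the mean.
average_weekday_count = weekday_points_in_time_range.groupby(weekday_points_in_time_range['timestamp'].dt.date).size().median()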

print(f"Data points on {specific_day} from 9 AM to 11 AM: {specific_day_count}")
print(f"Average data points on Monday, Wednesday, Friday from 9 AM to 11 AM: {average_weekday_count:.2f}")

Data points on 2024-10-25 from 9 AM to 11 AM: 15
Average data points on Monday, Wednesday, Friday from 9 AM to 11 AM: 4.00
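
The figure labeled "Average" above is computed with `.median()`. The sketch below shows why the two statistics can differ when one day has an unusually high count. The per-day counts are hypothetical values chosen for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical per-day counts for illustration only; not taken from the dataset.
daily_counts = [3, 4, 4, 5, 15]

print(median(daily_counts))  # 4
print(mean(daily_counts))    # 6.2
```

The median is unaffected by the single outlier day, which makes it a reasonable baseline for judging whether one day's count is unusual.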
