<a href="https://colab.research.google.com/github/Efi-Pecani/mobility_analytics/blob/main/mobility_analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [None]:
#load & disp the dataset

file_path = '/content/loc_data.csv'
data = pd.read_csv(file_path)

data.head()


Unnamed: 0,device_id,os,timestamp,latitude,longitude,accuracy
0,7315284d5507fdf095099fa2d9def878,Android,1642116000.0,45.497424,-122.602216,64.0
1,b9bd5d1c655644b058df8b0dcd16309b,iOS,1642799000.0,37.80322,-81.179315,94.0
2,5d5cfde811fd12498468fcd1a51a7dd8,Android,1641134000.0,37.751847,-81.214984,51.0
3,15aeb6be0b96374fca21ef9c5ca246e3,iOS,1642377000.0,30.329037,-97.737918,77.0
4,e9b4cd96d7c1847ab5073c945417ca68,Android,1642232000.0,37.763829,-81.171029,79.0


In [None]:
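# note: data.info is printed without being called, so the output is the bound method's repr (which embeds the DataFrame); use data.info() for a column/dtype summary
print(data.info,'\n\n')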
print(data.shape)

<bound method DataFrame.info of                                device_id       os     timestamp   latitude  \
0       7315284d5507fdf095099fa2d9def878  Android  1.642116e+09  45.497424   
1       b9bd5d1c655644b058df8b0dcd16309b      iOS  1.642799e+09  37.803220   
2       5d5cfde811fd12498468fcd1a51a7dd8  Android  1.641134e+09  37.751847   
3       15aeb6be0b96374fca21ef9c5ca246e3      iOS  1.642377e+09  30.329037   
4       e9b4cd96d7c1847ab5073c945417ca68  Android  1.642232e+09  37.763829   
...                                  ...      ...           ...        ...   
845263  ad10aa259f58ef4ad5d15ad3ccc32aa4      iOS  1.642717e+09  30.322379   
845264  8fc45a133afa2949a85c8f0000dd6350  Android  1.641955e+09  37.733803   
845265  f16a6ae711ca1c9f1250eb6e51043067      iOS  1.643642e+09  30.290626   
845266  04210799e7bd7cb1d7d8d30a04417606  Android  1.643203e+09  30.285705   
845267  2e0ff95afaa29c360d490f6b46e3c302  Android  1.642259e+09  30.317636   

         longitude  accuracy  


We have ~845K instances.
Each instance is a device sampled on some day, at some hour, at some location.

In [None]:
#calculate the number of distinct devices in the dataset
distinct_devices_count = data['device_id'].nunique()

distinct_devices_count

172

In [None]:
#convert timestamp to datetime
data['timestamp'] = pd.to_datetime(data['timestamp'], unit='s')

#extract hour and day of the week from timestamp
data['hour'] = data['timestamp'].dt.hour
data['day_of_week'] = data['timestamp'].dt.day_name()

data.head(3) #now the time looks better

Unnamed: 0,device_id,os,timestamp,latitude,longitude,accuracy,hour,day_of_week
0,7315284d5507fdf095099fa2d9def878,Android,2022-01-13 23:15:57,45.497424,-122.602216,64.0,23,Thursday
1,b9bd5d1c655644b058df8b0dcd16309b,iOS,2022-01-21 21:11:39,37.80322,-81.179315,94.0,21,Friday
2,5d5cfde811fd12498468fcd1a51a7dd8,Android,2022-01-02 14:40:35,37.751847,-81.214984,51.0,14,Sunday


In [None]:
#check for missing values
missing_values = data.isnull().sum()

#display the number of missing values per column
missing_values

device_id      0
os             0
timestamp      0
latitude       0
longitude      0
accuracy       0
hour           0
day_of_week    0
dtype: int64

We can see that: **No Missing Values in the DF**

### Activity In Time Terms

In [None]:
#calculate the distribution of signals over different hours of the day
hourly_distribution = data['hour'].value_counts().sort_index()

hourly_distribution_df = hourly_distribution.reset_index()
hourly_distribution_df.columns = ['hour', 'number_of_signals']

hourly_distribution_df.head(5)

Unnamed: 0,hour,number_of_signals
0,0,28558
1,1,24555
2,2,25350
3,3,27331
4,4,24824


In [None]:
#aggregate signals by date
signals_per_day = data.groupby(data['timestamp'].dt.date).size().reset_index(name='number_of_signals')
signals_per_day.head(4)

Unnamed: 0,timestamp,number_of_signals
0,2022-01-01,27121
1,2022-01-02,24959
2,2022-01-03,26395
3,2022-01-04,27752


In [None]:
#plot a time series of activity levels
fig = px.line(signals_per_day, x='timestamp', y='number_of_signals',
              title='Daily Device Activity Over Time',
              labels={'timestamp': 'Date', 'number_of_signals': 'Number of Signals'})

fig.update_layout(xaxis_title='Date',
                  yaxis_title='Number of Signals',
                  xaxis=dict(rangeslider=dict(visible=True)))

fig.show()


In [None]:
#count unique devices active each hour
active_devices_per_hour = data.groupby('hour')['device_id'].nunique().reset_index(name='active_devices')

#plot a graph of activity of all devices over hours of the day
fig = px.line(active_devices_per_hour, x='hour', y='active_devices',
              title='Number of Active Devices by Hour of the Day',
              labels={'hour': 'Hour of the Day', 'active_devices': 'Number of Active Devices'},
              markers=True)  # Adding markers for each data point

fig.update_layout(xaxis_title='Hour of the Day',
                  yaxis_title='Number of Active Devices',
                  xaxis=dict(tickmode='linear', tick0=0, dtick=1))  # Ensure every hour is shown on the x-axis

fig.show()

In [None]:
#active devices each day where an active device is defined as having at least one signal in a day

#count unique devices active each day
active_devices_per_day = data.groupby(data['timestamp'].dt.date)['device_id'].nunique().reset_index(name='active_devices')

In [None]:
#plot a time series plot of the number of active devices per day
fig = px.line(active_devices_per_day, x='timestamp', y='active_devices',
              title='Number of Active Devices Per Day',
              labels={'timestamp': 'Date', 'active_devices': 'Number of Active Devices'})

fig.update_layout(xaxis_title='Date',
                  yaxis_title='Number of Active Devices',
                  xaxis=dict(rangeslider=dict(visible=True)))

#change the line color to pink
fig.update_traces(line=dict(color='#FF97FF'))  # This is a shade of pink, you can choose any hex color you like


fig.show()

In [None]:
total_devices = data['device_id'].nunique()
active_devices_per_day['percentage_active'] = (active_devices_per_day['active_devices'] / total_devices) * 100


#time series plot of the percentage of active devices per day
fig = px.line(active_devices_per_day, x='timestamp', y='percentage_active',
              title='Percentage of Active Devices Per Day',
              labels={'timestamp': 'Date', 'percentage_active': 'Percentage of Active Devices'})


fig.update_layout(xaxis_title='Date',
                  yaxis_title='Percentage % of Active Devices',
                  yaxis=dict(tickformat=".2f"),  # Show y-axis ticks with two decimal places (values are already in percent)
                  xaxis=dict(rangeslider=dict(visible=True)))

#change color to green
fig.update_traces(line=dict(color='#00CC96'))

fig.show()


In [None]:
#order the days of the week as an ordered categorical
days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
data['day_of_week'] = pd.Categorical(data['day_of_week'], categories=days_order, ordered=True)

# Recalculate the distribution of signals over days of the week, now respecting the day order
weekly_distribution = data['day_of_week'].value_counts().sort_index()

#reset the index to convert the Series to a DataFrame for plotting
weekly_distribution_df = weekly_distribution.reset_index()
weekly_distribution_df.columns = ['day_of_week', 'number_of_signals']

weekly_distribution_df.head(7)


Unnamed: 0,day_of_week,number_of_signals
0,Monday,138152
1,Tuesday,110715
2,Wednesday,114569
3,Thursday,115366
4,Friday,116055
5,Saturday,126646
6,Sunday,123765


In [None]:
#plot for the hourly distribution of signals
fig_hourly = px.bar(hourly_distribution_df, x='hour', y='number_of_signals',
                    title='Hourly Distribution of Signals',
                    labels={'hour': 'Hour of the Day', 'number_of_signals': 'Number of Signals'})

fig_hourly.update_layout(xaxis_title='Hour of the Day', yaxis_title='Number of Signals',
                         xaxis=dict(tickmode='linear', tick0=0, dtick=1))


fig_hourly.show()

In [None]:
#calculate median and average of the number of signals
median_signals = weekly_distribution_df['number_of_signals'].median()
average_signals = weekly_distribution_df['number_of_signals'].mean()

#plot for the weekly distribution of signals
fig_weekly = px.bar(weekly_distribution_df, x='day_of_week', y='number_of_signals',
                    title='Weekly Distribution of Signals',
                    labels={'day_of_week': 'Day of the Week', 'number_of_signals': 'Number of Signals'})

fig_weekly.update_layout(xaxis_title='Day of the Week', yaxis_title='Number of Signals')

#add median line
fig_weekly.add_trace(go.Scatter(x=weekly_distribution_df['day_of_week'], y=[median_signals]*len(weekly_distribution_df),
                                mode='lines', name='Median',
                                line=dict(color='red', dash='dash')))

#add average line
fig_weekly.add_trace(go.Scatter(x=weekly_distribution_df['day_of_week'], y=[average_signals]*len(weekly_distribution_df),
                                mode='lines', name='Average',
                                line=dict(color='green', dash='dot')))

fig_weekly.show()

**Interpretation - 'Active Hour':** The hourly distribution reveals that device activity is not uniform throughout the day, with significant increases during early morning hours, a spike in the afternoon, and the highest levels of activity in the late evening. This pattern could reflect typical daily routines, including waking up, the end of the workday or school, and leisure or preparation for the next day in the evening. Note that the hours are derived from epoch timestamps and are therefore in UTC, so they are offset from local time at the device locations; the mapping to daily routines should be read with that caveat.


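The weekly distribution indicates that the raw signal count is highest on Monday, with a notable level of activity persisting through the weekend. Note, however, that January 2022 contains five Saturdays, Sundays, and Mondays but only four of every other weekday, so the raw totals favor those three days. With that caveat, the pattern could be influenced by work or school schedules during the week and leisure or social activities during the weekend.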
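Because the weekdays do not occur equally often within the month, one hedged way to compare them is to divide each total by the number of times that weekday occurs. The sketch below copies the totals from the weekly table above and assumes the data covers 1 to 31 January 2022, as the daily series suggests; the names `occurrences` and `signals_per_occurrence` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Signal totals per weekday, copied from the weekly distribution table
weekly_counts = pd.Series({
    "Monday": 138152, "Tuesday": 110715, "Wednesday": 114569,
    "Thursday": 115366, "Friday": 116055, "Saturday": 126646, "Sunday": 123765,
})

# How many times each weekday occurs in the assumed date range
january = pd.date_range("2022-01-01", "2022-01-31", freq="D")
occurrences = january.day_name().value_counts().reindex(weekly_counts.index)

# Average number of signals per single occurrence of each weekday
signals_per_occurrence = (weekly_counts / occurrences).round(1)
print(signals_per_occurrence)
```

On these figures the per-occurrence averages for Monday through Friday all come out above those for Saturday and Sunday, so the apparent Monday and weekend peaks in the raw totals are at least partly a calendar effect.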


**Hourly Distribution of Signals:**

Peak Activity Hours: The highest numbers of signals are recorded at 6 AM, 3 PM, 10 PM, and 11 PM, with the absolute peaks at 10 PM and 11 PM. This suggests significant activity during early morning, late afternoon, and late evening hours.

Morning Increase: There is a noticeable jump in activity starting at 6 AM, which could correspond with people beginning their day.

Afternoon Activity: The afternoon sees a peak at 3 PM, which might align with the end of the school or workday for many individuals.

Evening Peaks: The evening hours show the highest activity, especially at 10 PM and 11 PM, indicating that devices are heavily used late into the night.

Low Activity Hours: The lowest activity is observed during the late night to early morning hours (1 AM to 5 AM), as expected due to typical sleep patterns.
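The hour-of-day reading above depends on which clock the hours are expressed in. `pd.to_datetime(..., unit='s')` yields UTC, so a hedged sketch of converting to a local hour is shown here. The helper name `local_hour` is hypothetical, and the choice of `America/Los_Angeles` for the first sample row is an assumption based on its coordinates.

```python
import pandas as pd

def local_hour(epoch_seconds: pd.Series, tz: str) -> pd.Series:
    """Convert epoch seconds to the hour of day in the given IANA time zone."""
    utc = pd.to_datetime(epoch_seconds, unit="s", utc=True)
    return utc.dt.tz_convert(tz).dt.hour

# 1642115757 corresponds to 2022-01-13 23:15:57 UTC, the first sample row
sample = pd.Series([1642115757.0])

utc_hour = local_hour(sample, "UTC").iloc[0]
pacific_hour = local_hour(sample, "America/Los_Angeles").iloc[0]
print(utc_hour, pacific_hour)
```

A signal logged at 23:00 UTC therefore falls in mid-afternoon local time on the US West Coast, which would shift the "late evening" peak if the analysis were redone per time zone.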

In [None]:
# Group data by device_id and date to calculate active days
data['date'] = data['timestamp'].dt.date
active_days_per_device = data.groupby('device_id')['date'].nunique()

# Summary statistics for active days per device
active_days_summary = active_days_per_device.describe()

# Determine the threshold for a 'good enough' level of activity based on the summary statistics
good_enough_threshold = active_days_summary['75%']  # Using the 75th percentile as a threshold

active_days_summary, good_enough_threshold

(count    172.000000
 mean      24.063953
 std        9.752581
 min        3.000000
 25%       14.000000
 50%       31.000000
 75%       31.000000
 max       31.000000
 Name: date, dtype: float64,
 31.0)

In [None]:
# Count the distinct hours of the day (0-23) in which each device was seen at least once during the month (maximum 24)
total_active_hours_per_device = data.groupby('device_id')['hour'].nunique().reset_index(name='total_active_hours')

# Analyze the distribution of total active hours per device
total_active_hours_distribution = total_active_hours_per_device['total_active_hours'].describe()

total_active_hours_distribution

count    172.000000
mean      22.005814
std        4.274906
min       10.000000
25%       24.000000
50%       24.000000
75%       24.000000
max       24.000000
Name: total_active_hours, dtype: float64
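Since `nunique()` on `hour` is capped at 24, it measures how many different hours of the day a device was ever seen in, not how many hours it was active over the month. A minimal sketch of the difference, on made-up toy data (the frame `toy` and both result names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "device_id": ["a", "a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2022-01-01 08:05", "2022-01-02 08:30",
        "2022-01-02 09:10", "2022-01-01 23:00",
    ]),
})
toy["date"] = toy["timestamp"].dt.date
toy["hour"] = toy["timestamp"].dt.hour

# Distinct hours of the day per device (what the cell above computes)
distinct_hours_of_day = toy.groupby("device_id")["hour"].nunique()

# Distinct (date, hour) slots per device: total active hours over the period
active_device_hours = (
    toy.drop_duplicates(["device_id", "date", "hour"])
       .groupby("device_id")
       .size()
)
print(distinct_hours_of_day, active_device_hours, sep="\n")
```

Device `a` is seen at 08:00 on two different days, so it has two distinct hours of the day but three active (date, hour) slots.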

In [None]:
#create histogram of the number of distinct active hours of the day per device across the month
fig = px.histogram(total_active_hours_per_device, x='total_active_hours',
                   title="Distribution of Total Active Hours per Device",
                   nbins=24,  # the count of distinct hours of the day can be at most 24
                   labels={"total_active_hours": "Total Active Hours"})

fig.update_layout(xaxis_title="Total Active Hours",
                  yaxis_title="Number of Devices",
                  bargap=0.2)

fig.show()


In [None]:
# Extract date and hour from timestamp for further analysis
data['date'] = data['timestamp'].dt.date
data['hour'] = data['timestamp'].dt.hour

# Calculate active hours per device
active_hours_per_device = data.groupby(['device_id', 'date', 'hour']).size().reset_index(name='signals')

# Calculate active days per device
active_days_per_device = active_hours_per_device.groupby(['device_id', 'date']).size().reset_index(name='active_hours')
active_days_count = active_days_per_device.groupby('device_id')['date'].nunique().reset_index(name='active_days')

# Analyze the distribution of active days across devices
active_days_distribution = active_days_count['active_days'].value_counts().sort_index().reset_index(name='device_count')
active_days_distribution.rename(columns={'index': 'active_days'}, inplace=True)

active_days_distribution.head()

Unnamed: 0,active_days,device_count
0,3,1
1,4,2
2,5,1
3,6,5
4,7,4


In [None]:
# Calculate summary statistics and a threshold for "good enough" activity
summary_statistics = active_days_count['active_days'].describe()
good_enough_threshold = summary_statistics['75%']  # Using the 75th percentile as a threshold

print(summary_statistics,'\n')

print(good_enough_threshold)


count    172.000000
mean      24.063953
std        9.752581
min        3.000000
25%       14.000000
50%       31.000000
75%       31.000000
max       31.000000
Name: active_days, dtype: float64 

31.0


In [None]:
#visualize the distribution of active days
fig = px.bar(active_days_distribution, x='active_days', y='device_count', title='Distribution of Active Days per Device')
fig.update_layout(xaxis_title='Number of Active Days in the Month', yaxis_title='Number of Devices', bargap=0.2)
fig.show()


In [None]:
# Assuming 'data' includes the 'os' column
# Merge the OS information with the active_days_count
active_days_with_os = pd.merge(active_days_count, data[['device_id', 'os']].drop_duplicates(), on='device_id', how='left')

# Group by OS and active_days to get the count of devices for each combination
active_days_os_distribution = active_days_with_os.groupby(['os', 'active_days']).size().reset_index(name='device_count')

# Visualize the distribution of active days per device, separated by OS
fig = px.bar(active_days_os_distribution, x='active_days', y='device_count', color='os',
             title='Distribution of Active Days per Device by OS')
fig.update_layout(xaxis_title='Number of Active Days in the Month', yaxis_title='Number of Devices', bargap=0.2)
fig.show()


**Interpretation 'active-day':**
The data shows that at least half of the devices have the highest possible level of activity: both the median and the 75th percentile are 31 days, meaning these devices were active on every day of the month. The mean is lower (about 24 days) because roughly a quarter of devices were active on 14 days or fewer. The fully active devices indicate a high level of engagement and suggest that they are carried by individuals with regular daily routines.
The "good enough" activity level, based on the 75th percentile, is therefore defined as being active on all 31 days of the month. Because this equals the maximum, it is a very strict threshold that can be met but not exceeded. It represents a very active device, likely reflecting typical human behavior including going to work, shopping, and other daily activities.
Given this analysis, devices meeting the "good enough" threshold demonstrate consistent daily activity, which is crucial for accurate mobility analytics.
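Quartiles only bound the share of devices at a given level, so it can help to compute the share directly. The following is a minimal sketch on toy values; `share_meeting` is a hypothetical helper, and in the notebook it would be applied to `active_days_count['active_days']`.

```python
import pandas as pd

def share_meeting(active_days: pd.Series, min_days: int) -> float:
    """Fraction of devices active on at least `min_days` days."""
    return float((active_days >= min_days).mean())

# Toy stand-in for the per-device active-day counts
toy_active_days = pd.Series([31, 31, 31, 14, 7, 25], name="active_days")

share_all_days = share_meeting(toy_active_days, 31)   # strict threshold
share_25_days = share_meeting(toy_active_days, 25)    # a looser threshold
print(f"{share_all_days:.1%} active every day, {share_25_days:.1%} active 25+ days")
```

Comparing a strict and a looser threshold this way shows how many devices each definition of "good enough" would keep.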

In [None]:
#attach each device's OS to its active-day count (a device that reports more than one OS value appears once per OS)
active_days_by_os = active_days_count.groupby('device_id').first().reset_index().merge(data[['device_id', 'os']].drop_duplicates(), on='device_id')

#calc the avg active days for each OS
average_active_days_by_os = active_days_by_os.groupby('os')['active_days'].mean().reset_index()

average_active_days_by_os


Unnamed: 0,os,active_days
0,Android,25.150538
1,iOS,24.444444
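Both per-OS averages (25.15 and 24.44) are above the overall mean of 24.06 reported earlier, which would not be expected if every device were counted exactly once. One possible explanation is that some devices report more than one OS value and are therefore duplicated by the merge. A hedged sketch of how to check this, using toy data (the frame `toy` is made up; in the notebook the same grouping would run on `data`):

```python
import pandas as pd

toy = pd.DataFrame({
    "device_id": ["a", "a", "b", "c"],
    "os": ["Android", "iOS", "iOS", "Android"],
})

# Number of distinct OS values reported by each device
os_per_device = toy.groupby("device_id")["os"].nunique()

# Devices that would be double-counted in a per-OS average
multi_os_devices = os_per_device[os_per_device > 1].index.tolist()
print(multi_os_devices)
```

If the list is non-empty on the real data, assigning each device a single OS (for example its most frequent value) before averaging would avoid the duplication.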


In [None]:
fig = px.pie(average_active_days_by_os, names='os', values='active_days',
             title='Average Active Days per OS',
             color='os',  # Optional: Use the OS column to color the pie chart sections
             hole=0.3)  # Optional: Creates a donut pie chart

fig.update_traces(textinfo='percent+label', pull=[0.1 if i == 0 else 0 for i in range(len(average_active_days_by_os['os']))])
fig.update_layout(legend_title="Operating System")

fig.show()


In [None]:
#create a histogram to visualize the distribution of accuracy values
fig = px.histogram(data, x='accuracy', nbins=50,  # Adjust the number of bins as needed for granularity
                   title="Histogram of Accuracy Values",
                   labels={"accuracy": "Accuracy (meters)"})

fig.update_layout(xaxis_title="Accuracy (meters)",
                  yaxis_title="Count",
                  bargap=0.2)

fig.show()

In [None]:
#create a histogram to show the distribution of median accuracy values per device
median_accuracy_per_device = data.groupby('device_id')['accuracy'].median().reset_index()

fig = px.histogram(median_accuracy_per_device, x='accuracy', nbins=30, title="Histogram of Median Accuracy Values per Device")

fig.update_layout(xaxis_title="Median Accuracy (meters)",
                  yaxis_title="Number of Devices",
                  bargap=0.2)

fig.show()

#the distribution looks approximately normal

### Activity In Movement Terms

##### A level 7 geohash provides a balance between precision and generalization
Each level 7 cell covers an area of approximately 153m x 153m, which is suitable for analyzing day-to-day movements within a city or region.
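To make the encoding concrete, here is a minimal pure-Python sketch of the standard geohash algorithm: longitude and latitude ranges are halved alternately, each step contributes one bit, and every five bits become one base-32 character. It is an illustration only, not the `geohash2` implementation; the function name `encode_geohash` is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def encode_geohash(lat: float, lon: float, precision: int = 7) -> str:
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value, bits = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = value * 2 + int(bit)
        bits += 1
        use_lon = not use_lon
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            value, bits = 0, 0
    return "".join(chars)

# First sample row of the dataset
print(encode_geohash(45.497424, -122.602216, precision=7))
```

Because each extra character adds five bits, truncating a geohash gives the enclosing coarser cell, which is why a single `precision` parameter controls the spatial resolution.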

In [None]:
pip install geohash2

Collecting geohash2
  Downloading geohash2-1.1.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: geohash2
  Building wheel for geohash2 (setup.py) ... done
  Created wheel for geohash2: filename=geohash2-1.1-py3-none-any.whl size=15543 sha256=e6745c3fa10a464c525fe758cf91f7dc8caac9966e62b54c118be8c2b1aee657
  Stored in directory: /root/.cache/pip/wheels/c0/21/8d/fe65503f4f439aef35193e5ec10a14adc945e20ff87eb35895
Successfully built geohash2
Installing collected packages: geohash2
Successfully installed geohash2-1.1


In [None]:
import geohash2

#convert coordinates to level 7 geohashes
data['geohash'] = data.apply(lambda x: geohash2.encode(x.latitude, x.longitude, precision=7), axis=1)
data.head(5)

Unnamed: 0,device_id,os,timestamp,latitude,longitude,accuracy,hour,day_of_week,date,geohash
0,7315284d5507fdf095099fa2d9def878,Android,2022-01-13 23:15:57,45.497424,-122.602216,64.0,23,Thursday,2022-01-13,c20ff4e
1,b9bd5d1c655644b058df8b0dcd16309b,iOS,2022-01-21 21:11:39,37.80322,-81.179315,94.0,21,Friday,2022-01-21,dnwr09z
2,5d5cfde811fd12498468fcd1a51a7dd8,Android,2022-01-02 14:40:35,37.751847,-81.214984,51.0,14,Sunday,2022-01-02,dnwnzbt
3,15aeb6be0b96374fca21ef9c5ca246e3,iOS,2022-01-16 23:45:16,30.329037,-97.737918,77.0,23,Sunday,2022-01-16,9v6kxcj
4,e9b4cd96d7c1847ab5073c945417ca68,Android,2022-01-15 07:28:46,37.763829,-81.171029,79.0,7,Saturday,2022-01-15,dnwqbft


In [None]:
#calculate movement metrics
#total unique geohashes visited by each device
total_unique_geohashes = data.groupby('device_id')['geohash'].nunique().reset_index(name='unique_geohashes')

total_unique_geohashes

Unnamed: 0,device_id,unique_geohashes
0,0155faec07276c34fe657074de0ee0be,698
1,031ac6c4707d5b09778195ee3f8d26af,740
2,04210799e7bd7cb1d7d8d30a04417606,1096
3,04ef4115ab3acedd8502f0b64e28e33d,688
4,05ae3058245427f7a1a194d9a9fb453d,32
...,...,...
167,f3bdc20b182155c55cbc48feb5525511,1249
168,f3fed65e73eba269937ff59173d0266c,1431
169,f5fbc77994ffcd80e4c6bea2f03ed2a3,1184
170,f764b1d59ee4a08b5df749a552a6fc41,1261


In [None]:
total_unique_geohashes['unique_geohashes'].describe()

count     172.000000
mean      549.255814
std       455.784003
min         1.000000
25%       136.750000
50%       494.000000
75%       819.000000
max      1612.000000
Name: unique_geohashes, dtype: float64

In [None]:
#visualization of movement
#histogram of total unique geohashes visited by devices during the month of january
fig = px.histogram(total_unique_geohashes, x='unique_geohashes',
                   title="Distribution of Total Unique Geohashes Visited by Devices",
                   labels={"unique_geohashes": "Total Unique Geohashes Visited"},
                   nbins=50)

# Compute the 25th and 75th percentiles from the data instead of hard-coding them
q25, q75 = total_unique_geohashes['unique_geohashes'].quantile([0.25, 0.75])

# Height of the marker lines; adjust to the tallest histogram bar
y_max_value = 50

# Add lines for the 25th and 75th percentiles as visual thresholds
fig.add_trace(go.Scatter(x=[q25, q25], y=[0, y_max_value], mode="lines", name=f"25% ({q25:.0f})", line=dict(color='#00CC96', dash='dash')))
fig.add_trace(go.Scatter(x=[q75, q75], y=[0, y_max_value], mode="lines", name=f"75% ({q75:.0f})", line=dict(color="Crimson", dash='dash')))


fig.show()

In [None]:
#calc the number of unique geohashes visited per device per day
daily_movement = data.groupby(['device_id', 'date'])['geohash'].nunique().reset_index(name='unique_geohashes_per_day')

#summary stats for daily movement
daily_movement_summary = daily_movement['unique_geohashes_per_day'].describe()

daily_movement_summary

count    4139.000000
mean       57.252718
std        45.325379
min         1.000000
25%         8.000000
50%        59.000000
75%        97.000000
max       173.000000
Name: unique_geohashes_per_day, dtype: float64

In [None]:
#import plotly.graph_objects as go

# Recreate the histogram for unique_geohashes_per_day
fig = px.histogram(daily_movement, x='unique_geohashes_per_day',
                   title='Distribution of Unique Geohashes Visited Per Device Per Day',
                   nbins=50,  # Adjust the number of bins for better visualization
                   labels={'unique_geohashes_per_day': 'Unique Geohashes per Day'})

fig.update_layout(xaxis_title='Unique Geohashes Visited Per Day',
                  yaxis_title='Count of Device-Days',
                  bargap=0.2)

# Add vertical lines for the 25th and 75th percentiles (computed from the data rather than hard-coded)
p25, p75 = daily_movement['unique_geohashes_per_day'].quantile([0.25, 0.75])
fig.add_trace(go.Scatter(x=[p25, p25], y=[0, 700], mode="lines", name=f"25th Percentile ({p25:.0f})", line=dict(color="green", dash='dash')))
fig.add_trace(go.Scatter(x=[p75, p75], y=[0, 700], mode="lines", name=f"75th Percentile ({p75:.0f})", line=dict(color="Crimson", dash='dash')))

fig.show()



There are 4139 device-day records, indicating the total number of unique (device, day) pairs in our dataset.
The median number of unique geohashes visited is 59, suggesting that half of the device-day records show movement across a moderate number of geohashes.

**Interpretation:**
Minimal/Normal Movement: Based on the 25th percentile, minimal movement can be defined as days when a device visits 8 or fewer unique geohashes. Normal movement might be considered as falling within the interquartile range (from 8 to 97 unique geohashes per day), where most device activity is concentrated.
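The bands described above can be sketched with `pd.cut`. This is one possible reading of the thresholds (8 or fewer is minimal, above 8 up to 97 is normal, above 97 is high); the label names and the helper `movement_band` are assumptions for illustration.

```python
import pandas as pd

def movement_band(unique_geohashes_per_day: pd.Series,
                  low: float = 8, high: float = 97) -> pd.Series:
    """Label each device-day as minimal, normal, or high movement."""
    bins = [0, low, high, float("inf")]  # intervals (0, low], (low, high], (high, inf)
    return pd.cut(unique_geohashes_per_day, bins=bins,
                  labels=["minimal", "normal", "high"])

toy_days = pd.Series([1, 8, 9, 97, 98])
bands = movement_band(toy_days)
print(bands.tolist())
```

In the notebook this could be applied to `daily_movement['unique_geohashes_per_day']`, with `low` and `high` taken from the computed quartiles.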

In [None]:
# Visualize the distribution of unique geohashes visited per device per day using a violin plot
fig = px.violin(daily_movement, y='unique_geohashes_per_day',
                title='Distribution of Unique Geohashes Visited Per Device Per Day',
                box=True,  # Display box plot inside the violin
                points="all",  # Show all points
                labels={'unique_geohashes_per_day': 'Unique Geohashes per Day'})

fig.update_layout(yaxis_title='Unique Geohashes Visited Per Day')

fig.show()


In [None]:
# Approximate bounds for Austin, Texas
lat_min_austin, lat_max_austin = 30.196172514159777, 30.416442656780895
lon_min_austin, lon_max_austin = -97.86811477065586, -97.62362926070162

# Filter the dataset for points within Austin, Texas
austin_data = data[(data['latitude'] >= lat_min_austin) & (data['latitude'] <= lat_max_austin) &
                   (data['longitude'] >= lon_min_austin) & (data['longitude'] <= lon_max_austin)]
austin_data.shape

(295982, 10)

In [None]:
# Approximate bounds for Portland, Oregon
lat_min, lat_max = 45.33159875506112, 45.64970236341552
lon_min, lon_max = -122.88566150628353, -122.44033485638296

# Filter the dataset for points within Portland, Oregon
portland_data = data[(data['latitude'] >= lat_min) & (data['latitude'] <= lat_max) &
                   (data['longitude'] >= lon_min) & (data['longitude'] <= lon_max)]
portland_data.shape


(173332, 10)

In [None]:
# Approximate bounds for Beckley, West Virginia
lat_min, lat_max = 37.64607159174946, 37.8957557358131  # Small range around Beckley's approximate latitude
lon_min, lon_max = -81.37115057619891, -81.00203905006904  # Small range around Beckley's approximate longitude

# Filter the dataset for points within Beckley, West Virginia
beckley_data = data[(data['latitude'] >= lat_min) & (data['latitude'] <= lat_max) &
                    (data['longitude'] >= lon_min) & (data['longitude'] <= lon_max)]

beckley_data.shape


(375954, 10)

In [None]:
if(len(data)-len(beckley_data)-len(portland_data)-len(austin_data))==0:
    print('3 cities in our data and polygons correct')
else:
    print('Wrong polygons')

3 cities in our data and polygons correct
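The length check above passes when the three filtered frames add up to the full dataset, but it would also pass if an overlap and a gap happened to cancel out. A stricter sketch is to verify that each point falls in exactly one box. The bounds are copied from the cells above; `in_bbox`, `BOXES`, and `assign_city` are hypothetical names.

```python
import pandas as pd

# (lat_min, lat_max, lon_min, lon_max), copied from the filter cells
BOXES = {
    "austin": (30.196172514159777, 30.416442656780895,
               -97.86811477065586, -97.62362926070162),
    "portland": (45.33159875506112, 45.64970236341552,
                 -122.88566150628353, -122.44033485638296),
    "beckley": (37.64607159174946, 37.8957557358131,
                -81.37115057619891, -81.00203905006904),
}

def in_bbox(df, lat_min, lat_max, lon_min, lon_max):
    """Boolean mask of rows inside the (inclusive) bounding box."""
    return (df["latitude"].between(lat_min, lat_max)
            & df["longitude"].between(lon_min, lon_max))

def assign_city(df):
    """City name for rows inside exactly one box, NaN otherwise."""
    masks = pd.DataFrame({name: in_bbox(df, *box) for name, box in BOXES.items()})
    hits = masks.sum(axis=1)
    return masks.astype(int).idxmax(axis=1).where(hits == 1)

# Three sample rows from the dataset plus one point outside every box
points = pd.DataFrame({
    "latitude": [45.497424, 37.80322, 30.329037, 0.0],
    "longitude": [-122.602216, -81.179315, -97.737918, 0.0],
})
city = assign_city(points)
print(city.tolist())
```

On the real data, `assign_city(data).isna().sum() == 0` would confirm that every signal belongs to exactly one of the three areas.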


In [None]:
#devices with at least one day of minimal movement (fewer than 8 unique geohashes, the 25th percentile)
minimal_movement_devices = daily_movement[daily_movement['unique_geohashes_per_day'] < 8]['device_id'].unique()

#devices with at least one day of excessive movement (more than 97 unique geohashes, the 75th percentile)
excessive_movement_devices = daily_movement[daily_movement['unique_geohashes_per_day'] > 97]['device_id'].unique()


In [None]:
len(minimal_movement_devices),len(excessive_movement_devices)

(57, 98)

In [None]:
#calculate active days in the month for each device
active_days_per_device = data.groupby('device_id')['date'].nunique()

#calculate peak hour activity
data['hour'] = data['timestamp'].dt.hour
peak_hours = list(range(6, 23))  # Hours 6 through 22 inclusive, chosen from the previous analysis (hour 23 is excluded)
data['peak_hour_activity'] = data['hour'].isin(peak_hours)

#aggregate peak hour activity per device
peak_hour_activity_per_device = data.groupby('device_id')['peak_hour_activity'].mean()

#filter devices based on geographic movement threshold
geohash_threshold_high = 97  # 75 % threshold
geohash_threshold_low = 8 # 25 %
geohash_activity_per_device = data.groupby(['device_id', 'date'])['geohash'].nunique().groupby('device_id').mean()

# Define Qualified User based on criteria:
# 1. Active on at least 80% of the days in the month (25 of 31 days)
# 2. At least half of the device's signals fall within the peak-hour window
# 3. Mean daily unique geohashes strictly between the low (8) and high (97) thresholds
qualified_user_criteria = (
    (active_days_per_device >= 25)
    & (geohash_activity_per_device > geohash_threshold_low)
    & (geohash_activity_per_device < geohash_threshold_high)
    & (peak_hour_activity_per_device >= 0.5)
)

# Add 'qualified_user' column to the dataset
data['qualified_user'] = data['device_id'].map(qualified_user_criteria)

# Check the first few rows to verify
data.head()



Unnamed: 0,device_id,os,timestamp,latitude,longitude,accuracy,hour,day_of_week,date,geohash,peak_hour_activity,qualified_user
0,7315284d5507fdf095099fa2d9def878,Android,2022-01-13 23:15:57,45.497424,-122.602216,64.0,23,Thursday,2022-01-13,c20ff4e,False,False
1,b9bd5d1c655644b058df8b0dcd16309b,iOS,2022-01-21 21:11:39,37.80322,-81.179315,94.0,21,Friday,2022-01-21,dnwr09z,True,True
2,5d5cfde811fd12498468fcd1a51a7dd8,Android,2022-01-02 14:40:35,37.751847,-81.214984,51.0,14,Sunday,2022-01-02,dnwnzbt,True,True
3,15aeb6be0b96374fca21ef9c5ca246e3,iOS,2022-01-16 23:45:16,30.329037,-97.737918,77.0,23,Sunday,2022-01-16,9v6kxcj,False,True
4,e9b4cd96d7c1847ab5073c945417ca68,Android,2022-01-15 07:28:46,37.763829,-81.171029,79.0,7,Saturday,2022-01-15,dnwqbft,True,False
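The three criteria combine per-device aggregates, so it can be easier to reason about them on a small device-level table. The sketch below restates the rule as a function with the thresholds as parameters; the column names and the helper `qualified_mask` are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd

def qualified_mask(per_device: pd.DataFrame, min_days: int = 25,
                   low: float = 8, high: float = 97,
                   min_peak_share: float = 0.5) -> pd.Series:
    """True for devices that satisfy all three qualification criteria."""
    return (
        (per_device["active_days"] >= min_days)
        & (per_device["mean_daily_geohashes"] > low)
        & (per_device["mean_daily_geohashes"] < high)
        & (per_device["peak_share"] >= min_peak_share)
    )

toy_devices = pd.DataFrame(
    {
        "active_days": [31, 31, 10, 28],
        "mean_daily_geohashes": [60.0, 120.0, 40.0, 5.0],
        "peak_share": [0.7, 0.8, 0.9, 0.6],
    },
    index=["a", "b", "c", "d"],
)
mask = qualified_mask(toy_devices)
print(mask)
```

Only device `a` qualifies here: `b` moves too much, `c` is active on too few days, and `d` moves too little.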


In [None]:
#calculate the total number of unique qualified devices (users)
total_qualified_devices = data[data['qualified_user']]['device_id'].nunique()

total_qualified_devices


45

In [None]:
data['qualified_user'].describe()

count     845268
unique         2
top        False
freq      486960
Name: qualified_user, dtype: object

In [None]:
# Calculate the percentage of signals (rows) that belong to qualified devices; this is not the percentage of devices
qualified_user_percentage = (data['qualified_user'].sum() / len(data)) * 100

print('qualified_user_percentage:  ',round(qualified_user_percentage,2),'%')


qualified_user_percentage:   42.39 %


In [None]:
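# Note: this sums row-level flags, so it counts signals from qualified devices, not devices;
# the device-level count is total_qualified_devices above
total_qmau_count = int(data['qualified_user'].sum())
total_qmau_count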

358308
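The two numbers above measure different things: 45 of the 172 devices qualify (about 26.2%), while 42.39% of the signals come from those devices. A minimal sketch of computing both shares side by side, on made-up toy rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "device_id": ["a", "a", "a", "b", "c"],
    "qualified_user": [True, True, True, False, False],
})

# Share of signals (rows) produced by qualified devices
signal_share = toy["qualified_user"].mean()

# One flag per device, then the device-level count and share
per_device = toy.groupby("device_id")["qualified_user"].first()
qualified_devices = int(per_device.sum())
device_share = per_device.mean()

print(f"{qualified_devices} qualified device(s): "
      f"{device_share:.1%} of devices, {signal_share:.1%} of signals")
```

Qualified devices tend to emit more signals than the rest, so the signal share overstates the share of users.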

**Qualified Users Summary**

Refining QMAU by Segmenting/Grouping
To refine the high-quality group and provide a clearer picture of QMAU, we can consider segmenting users based on attributes that could influence their mobility patterns or device usage. Potential attributes include:

Operating System (OS): Differentiating between Android and iOS users might reveal platform-specific usage patterns or device engagement levels.

Geographic Distribution: Grouping users by their most frequently visited geohash or a broader region could highlight geographic differences in mobility or activity patterns.

Time of Day and Weekday vs. Weekend Activity: Segmenting users based on their most active times or comparing weekday versus weekend activity might uncover different user behaviors or preferences.

Movement Patterns (e.g., Urban vs. Rural): Users frequently moving within dense urban geohashes might exhibit different behaviors from those predominantly in rural areas.
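As an illustration of the segmentation idea, here is a hedged sketch that breaks the qualified count down by one attribute, the operating system. It assumes a device-level table with one row per device; the frame `devices` and its values are made up.

```python
import pandas as pd

devices = pd.DataFrame({
    "device_id": ["a", "b", "c", "d", "e"],
    "os": ["Android", "Android", "iOS", "iOS", "iOS"],
    "qualified_user": [True, False, True, True, False],
})

# Per-segment device count, qualified count, and qualified share
segment = devices.groupby("os")["qualified_user"].agg(
    devices="size", qualified="sum", share="mean"
)
print(segment)
```

The same grouping works for any other segment key, such as a city label or a weekday-versus-weekend flag, once it is available per device.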

Ultimately, we want to keep track of as many users as possible for as long as possible, since that gives us the most insight into activity that can benefit our platform. That means users who frequent a variety of businesses, spend money, commute, and interact, giving us a broad picture of supply and demand for our potential clients.