https://www.kdnuggets.com/2023/03/3-hard-python-coding-interview-questions-data-science.html

Find the average session distance travelled by Google Fit users based on GPS location data. Calculate the distance for two scenarios:



1. Taking into consideration the curvature of the earth

2. Treating the earth as a flat surface


Assume the distance of one session is the distance between its largest and smallest step id. If a session has only one step id, discard it from the calculation. Assume that a session can't span multiple days.
Output the average session distances calculated in the two scenarios and the difference between them.


Formula to calculate the distance with the curvature of the earth:

https://platform.stratascratch.com/coding/10067-google-fit-user-tracking?utm_source=blog&utm_medium=click&utm_campaign=kdn+3+hard+python+questions&code_type=1
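Read off the computations in cells In [23] and In [24] below, the curved-earth distance is the spherical law of cosines and the flat-surface distance is a Euclidean distance in degrees scaled by about 111 km per degree. A sketch of both, assuming φ and λ are latitude and longitude converted to radians, lat and lon are the same coordinates in degrees, and 6371 is the earth's mean radius in km:

```latex
d_{\text{curved}} = 6371 \cdot \arccos\bigl(\sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_1 - \lambda_2)\bigr)

d_{\text{flat}} = 111 \cdot \sqrt{(\mathrm{lat}_2 - \mathrm{lat}_1)^2 + (\mathrm{lon}_2 - \mathrm{lon}_1)^2}
```

Both distances come out in kilometres.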


In [8]:
import numpy as np
import pandas as pd
from math import cos, sin, acos, radians

In [11]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
google_fit_location = pd.read_csv(r"D:\Codes\CodeBank\Assests\google_fit_location.csv")

In [12]:
google_fit_location

Unnamed: 0,user_id,session_id,step_id,day,latitude,longitude,altitude
0,55e60cfcc9dc49c17e,0,186,1,39.999,19.999,1502.429
1,75d295377a46f83236,2,161,10,40.000,20.001,1499.002
2,406539987dd9b679c0,0,133,7,40.000,20.000,1498.366
3,e6088004caf0c8cc51,0,12,7,40.001,20.001,1499.311
4,ef5fe98c6b9f313075,0,44,8,40.000,20.000,1497.595
...,...,...,...,...,...,...,...
95,850badf89ed8f06854,1,88,9,39.999,19.999,1499.809
96,d63386c884aeb9f71d,1,89,9,39.999,20.000,1497.868
97,e0e0defbb9ec47f6f7,1,90,3,40.000,20.001,1497.235
98,406539987dd9b679c0,2,0,10,40.001,20.000,1499.448


In [13]:
df = pd.merge(
    google_fit_location,
    google_fit_location,
    how="left",
    on=["user_id", "session_id", "day"],
    suffixes=["_1", "_2"],
)
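The self-merge above pairs every step of a session with every step of the same session, including itself. A minimal sketch of that behaviour on made-up data (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: user "a" has a two-step session, user "b" a one-step session.
toy = pd.DataFrame(
    {
        "user_id": ["a", "a", "b"],
        "session_id": [0, 0, 1],
        "day": [1, 1, 2],
        "step_id": [5, 9, 3],
    }
)

pairs = pd.merge(
    toy,
    toy,
    how="left",
    on=["user_id", "session_id", "day"],
    suffixes=("_1", "_2"),
)

# The two-step session yields 2 x 2 = 4 rows, the one-step session yields 1 row.
print(pairs)
```

Only one of these rows has `step_id_2 > step_id_1`, which is why the later filter on a positive step difference keeps a single direction of each pair and drops single-step sessions.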

In [16]:
df

Unnamed: 0,user_id,session_id,step_id_1,day,latitude_1,longitude_1,altitude_1,step_id_2,latitude_2,longitude_2,altitude_2
0,55e60cfcc9dc49c17e,0,186,1,39.999,19.999,1502.429,186,39.999,19.999,1502.429
1,75d295377a46f83236,2,161,10,40.000,20.001,1499.002,161,40.000,20.001,1499.002
2,406539987dd9b679c0,0,133,7,40.000,20.000,1498.366,133,40.000,20.000,1498.366
3,406539987dd9b679c0,0,133,7,40.000,20.000,1498.366,101,40.001,19.999,1501.637
4,e6088004caf0c8cc51,0,12,7,40.001,20.001,1499.311,12,40.001,20.001,1499.311
...,...,...,...,...,...,...,...,...,...,...,...
127,d63386c884aeb9f71d,1,89,9,39.999,20.000,1497.868,89,39.999,20.000,1497.868
128,e0e0defbb9ec47f6f7,1,90,3,40.000,20.001,1497.235,79,40.000,20.001,1501.259
129,e0e0defbb9ec47f6f7,1,90,3,40.000,20.001,1497.235,90,40.000,20.001,1497.235
130,406539987dd9b679c0,2,0,10,40.001,20.000,1499.448,0,40.001,20.000,1499.448


In [17]:
df['step_var'] = df['step_id_2'] - df['step_id_1']
 

In [18]:
df

Unnamed: 0,user_id,session_id,step_id_1,day,latitude_1,longitude_1,altitude_1,step_id_2,latitude_2,longitude_2,altitude_2,step_var
0,55e60cfcc9dc49c17e,0,186,1,39.999,19.999,1502.429,186,39.999,19.999,1502.429,0
1,75d295377a46f83236,2,161,10,40.000,20.001,1499.002,161,40.000,20.001,1499.002,0
2,406539987dd9b679c0,0,133,7,40.000,20.000,1498.366,133,40.000,20.000,1498.366,0
3,406539987dd9b679c0,0,133,7,40.000,20.000,1498.366,101,40.001,19.999,1501.637,-32
4,e6088004caf0c8cc51,0,12,7,40.001,20.001,1499.311,12,40.001,20.001,1499.311,0
...,...,...,...,...,...,...,...,...,...,...,...,...
127,d63386c884aeb9f71d,1,89,9,39.999,20.000,1497.868,89,39.999,20.000,1497.868,0
128,e0e0defbb9ec47f6f7,1,90,3,40.000,20.001,1497.235,79,40.000,20.001,1501.259,-11
129,e0e0defbb9ec47f6f7,1,90,3,40.000,20.001,1497.235,90,40.000,20.001,1497.235,0
130,406539987dd9b679c0,2,0,10,40.001,20.000,1499.448,0,40.001,20.000,1499.448,0


In [19]:
df = df.loc[
    df[df["step_var"] > 0]
    .groupby(["user_id", "session_id", "day"])["step_var"]
    .idxmax()
]
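Keeping `step_var > 0` and taking `idxmax` per session selects the pair made of the smallest and the largest step id, and sessions with a single step id disappear because their only pair has `step_var == 0`. The same endpoints can be picked without a self-merge. The sketch below is an alternative, not the approach used in this notebook, and runs on hypothetical toy data with no missing values:

```python
import pandas as pd

# Hypothetical toy data: user "a" has three steps in one session, user "b" has one.
toy = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b"],
        "session_id": [0, 0, 0, 1],
        "day": [1, 1, 1, 2],
        "step_id": [9, 5, 7, 3],
        "latitude": [40.001, 40.000, 40.000, 39.999],
        "longitude": [20.000, 20.000, 20.001, 19.999],
    }
)

keys = ["user_id", "session_id", "day"]
grouped = toy.sort_values(keys + ["step_id"]).groupby(keys)

first = grouped.first().add_suffix("_1")  # row with the smallest step_id
last = grouped.last().add_suffix("_2")    # row with the largest step_id
endpoints = first.join(last)

# Discard sessions that contain a single step id.
endpoints = endpoints[endpoints["step_id_2"] > endpoints["step_id_1"]].reset_index()
print(endpoints)
```

This avoids the quadratic growth of the self-merge on long sessions. Note that `first()` and `last()` skip missing values column by column, so the sketch assumes complete rows.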

In [20]:
df

Unnamed: 0,user_id,session_id,step_id_1,day,latitude_1,longitude_1,altitude_1,step_id_2,latitude_2,longitude_2,altitude_2,step_var
18,157e3e9278e32aba3e,1,5,2,40.0,20.0,1497.785,32,40.0,20.0,1500.462,27
50,2813e59cf6c1ff698e,1,27,6,40.001,20.0,1497.499,77,40.001,20.0,1500.226,50
12,406539987dd9b679c0,0,101,7,40.001,19.999,1501.637,133,40.0,20.0,1498.366,32
81,47be2887786891367e,0,155,8,39.999,20.001,1500.076,169,39.999,20.001,1499.659,14
27,55e60cfcc9dc49c17e,1,12,1,40.0,19.999,1499.136,165,40.001,19.999,1499.754,153
72,5eff3a5bfc0687351e,0,147,10,40.0,20.0,1499.677,163,40.0,20.001,1500.198,16
69,75d295377a46f83236,0,145,3,40.0,20.0,1502.456,164,40.0,19.999,1502.364,19
23,75d295377a46f83236,1,9,10,40.0,20.0,1497.014,135,40.0,20.0,1501.343,126
101,850badf89ed8f06854,0,162,5,40.0,20.001,1500.286,166,40.0,19.999,1499.443,4
30,850badf89ed8f06854,1,13,4,40.0,20.0,1502.782,20,40.0,20.0,1500.512,7


In [23]:
# Great-circle distance in km (spherical law of cosines, earth radius 6371 km).
df["distance_curvature"] = np.nan  # float column; avoids the empty-Series dtype warning
for i, r in df.iterrows():
    lat1, lat2 = radians(r["latitude_1"]), radians(r["latitude_2"])
    dlon = radians(r["longitude_1"] - r["longitude_2"])
    df.loc[i, "distance_curvature"] = (
        acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(dlon)) * 6371)
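The row-by-row loop can also be written as a single vectorized expression. The sketch below assumes NumPy functions in place of the `math` ones. The helper name `curved_distance_km` and the toy frame are hypothetical. The clip guards against rounding that pushes the cosine marginally above 1 for identical points, where `math.acos` would raise an error and `np.arccos` would return NaN:

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371


def curved_distance_km(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, vectorized over NumPy arrays or pandas Series (degrees in, km out)."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dlon = np.radians(lon1 - lon2)
    cos_angle = np.sin(phi1) * np.sin(phi2) + np.cos(phi1) * np.cos(phi2) * np.cos(dlon)
    # Keep the argument inside the domain of arccos.
    return np.arccos(np.clip(cos_angle, -1.0, 1.0)) * EARTH_RADIUS_KM


# Hypothetical toy rows: a 0.001 degree move in latitude, and two identical points.
toy = pd.DataFrame(
    {
        "latitude_1": [40.000, 40.000],
        "longitude_1": [20.000, 20.000],
        "latitude_2": [40.001, 40.000],
        "longitude_2": [20.000, 20.000],
    }
)
toy["distance_curvature"] = curved_distance_km(
    toy["latitude_1"], toy["longitude_1"], toy["latitude_2"], toy["longitude_2"]
)
print(toy)
```

For points this close together the arccos form loses some floating-point precision, so the haversine formula is a common alternative when accuracy at short range matters.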


In [24]:
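# Flat-surface distance in km: Euclidean distance in degrees times ~111 km per degree.
df["distance_flat"] = np.nan  # float column; avoids the empty-Series dtype warning
for i, r in df.iterrows():
    df.loc[i, "distance_flat"] = (np.sqrt((r["latitude_2"] - r["latitude_1"]) ** 2 + (r["longitude_2"] - r["longitude_1"]) ** 2) * 111)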
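The factor 111 is approximately the length of one degree of arc on a sphere of radius 6371 km. The sketch below checks that constant and gives a vectorized form of the flat distance. The helper name `flat_distance_km` and the toy values are hypothetical:

```python
import numpy as np
import pandas as pd

KM_PER_DEGREE = 111  # rounded value of 2 * pi * 6371 / 360


def flat_distance_km(lat1, lon1, lat2, lon2):
    """Euclidean distance in degrees scaled to km, vectorized over arrays or Series."""
    return np.sqrt((lat2 - lat1) ** 2 + (lon2 - lon1) ** 2) * KM_PER_DEGREE


# Length of one degree of latitude, and of one degree of longitude at latitude 40.
km_per_degree_lat = 2 * np.pi * 6371 / 360
km_per_degree_lon_at_40 = km_per_degree_lat * np.cos(np.radians(40))
print(km_per_degree_lat, km_per_degree_lon_at_40)

# Hypothetical toy rows: a 0.001 degree move in latitude, then in longitude.
toy = pd.DataFrame(
    {
        "latitude_1": [40.000, 40.000],
        "longitude_1": [20.000, 20.000],
        "latitude_2": [40.001, 40.000],
        "longitude_2": [20.000, 20.001],
    }
)
toy["distance_flat"] = flat_distance_km(
    toy["latitude_1"], toy["longitude_1"], toy["latitude_2"], toy["longitude_2"]
)
print(toy)
```

The flat formula charges the same 111 km for a degree of longitude as for a degree of latitude, although a degree of longitude is shorter away from the equator. That is one likely reason the flat average comes out larger than the curved one for data around latitude 40.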


In [25]:
result = pd.DataFrame()
result["avg_distance_curvature"] = pd.Series(df["distance_curvature"].mean())
result["avg_distance_flat"] = pd.Series(df["distance_flat"].mean())
result["distance_diff"] = result["avg_distance_curvature"] - result["avg_distance_flat"]
result

Unnamed: 0,avg_distance_curvature,avg_distance_flat,distance_diff
0,0.07727,0.087726,-0.010456
