This notebook explores the dataset, examines its missing values, and applies the data cleaning and preprocessing steps that follow from them. Whether missing values are dropped or imputed depends on each column's characteristics, mainly how many values are missing and how relevant the column is to the analysis.

# Rule Adherence:

- Rule 1: We have documented all data preprocessing steps, including handling missing values and data cleaning, in this notebook (see Section X).

- Rule 2: All data manipulation and cleaning steps are performed using Python code (see Section Y).

- Rule 6: We have recorded the random seed used in the imputation process (see Section Z).


In [1]:
import pandas as pd
import numpy as np

### Load and View the Dataset

In [2]:
df = pd.read_csv('Data/strava.csv')
df.head()

Unnamed: 0,Air Power,Cadence,Form Power,Ground Time,Leg Spring Stiffness,Power,Vertical Oscillation,altitude,cadence,datafile,...,enhanced_speed,fractional_cadence,heart_rate,position_lat,position_long,speed,timestamp,unknown_87,unknown_88,unknown_90
0,,,,,,,,,0.0,activities/2675855419.fit.gz,...,0.0,0.0,68.0,,,0.0,2019-07-08 21:04:03,0.0,300.0,
1,,,,,,,,,0.0,activities/2675855419.fit.gz,...,0.0,0.0,68.0,,,0.0,2019-07-08 21:04:04,0.0,300.0,
2,,,,,,,,,54.0,activities/2675855419.fit.gz,...,1.316,0.0,71.0,,,1316.0,2019-07-08 21:04:07,0.0,300.0,
3,,,,,,,,3747.0,77.0,activities/2675855419.fit.gz,...,1.866,0.0,77.0,504432050.0,-999063637.0,1866.0,2019-07-08 21:04:14,0.0,100.0,
4,,,,,,,,3798.0,77.0,activities/2675855419.fit.gz,...,1.894,0.0,80.0,504432492.0,-999064534.0,1894.0,2019-07-08 21:04:15,0.0,100.0,


### Data Summary

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40649 entries, 0 to 40648
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Air Power             17842 non-null  float64
 1   Cadence               17847 non-null  float64
 2   Form Power            17842 non-null  float64
 3   Ground Time           17847 non-null  float64
 4   Leg Spring Stiffness  17842 non-null  float64
 5   Power                 17847 non-null  float64
 6   Vertical Oscillation  17847 non-null  float64
 7   altitude              14905 non-null  float64
 8   cadence               40627 non-null  float64
 9   datafile              40649 non-null  object 
 10  distance              40649 non-null  float64
 11  enhanced_altitude     40598 non-null  float64
 12  enhanced_speed        40639 non-null  float64
 13  fractional_cadence    40627 non-null  float64
 14  heart_rate            38355 non-null  float64
 15  position_lat       

In [4]:
df.describe()

Unnamed: 0,Air Power,Cadence,Form Power,Ground Time,Leg Spring Stiffness,Power,Vertical Oscillation,altitude,cadence,distance,enhanced_altitude,enhanced_speed,fractional_cadence,heart_rate,position_lat,position_long,speed,unknown_87,unknown_88,unknown_90
count,17842.0,17847.0,17842.0,17847.0,17842.0,17847.0,17847.0,14905.0,40627.0,40649.0,40598.0,40639.0,40627.0,38355.0,40457.0,40457.0,14928.0,40627.0,38355.0,18618.0
mean,1.8721,77.726565,99.485932,325.934107,13.138571,301.459797,6.458074,3846.184368,72.781254,4097.140051,271.346027,3.037084,0.070138,134.680094,504540800.0,-999517500.0,2067.483856,0.0,298.513883,-1.067354
std,2.777476,9.202077,13.866222,71.773687,2.039567,48.540552,1.135497,134.262498,17.743728,5827.964663,25.035768,1.959805,0.173639,18.713782,169090.5,1376341.0,527.173476,0.0,17.176218,2.820492
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3555.0,0.0,0.0,209.0,0.0,0.0,56.0,503986800.0,-1005696000.0,0.0,0.0,100.0,-13.0
25%,1.0,78.0,97.0,308.0,13.0,283.0,6.125,3768.0,74.0,1117.97,252.8,2.109,0.0,121.0,504439700.0,-999398600.0,1782.0,0.0,300.0,-3.0
50%,1.0,79.0,101.0,326.0,13.375,303.0,6.5,3829.0,78.0,2430.5,269.2,2.445,0.0,136.0,504511600.0,-999260800.0,2071.0,0.0,300.0,0.0
75%,2.0,80.0,105.0,340.0,13.75,326.0,7.0,3912.0,80.0,4403.73,291.2,2.809,0.0,148.0,504615900.0,-999057900.0,2370.0,0.0,300.0,0.0
max,48.0,88.0,125.0,1732.0,16.875,462.0,12.5,5043.0,118.0,39007.12,508.6,15.349,0.5,183.0,508927200.0,-992193800.0,7744.0,0.0,300.0,6.0


### Identify Columns with Missing Values

In [5]:
df.isna().sum()

Air Power               22807
Cadence                 22802
Form Power              22807
Ground Time             22802
Leg Spring Stiffness    22807
Power                   22802
Vertical Oscillation    22802
altitude                25744
cadence                    22
datafile                    0
distance                    0
enhanced_altitude          51
enhanced_speed             10
fractional_cadence         22
heart_rate               2294
position_lat              192
position_long             192
speed                   25721
timestamp                   0
unknown_87                 22
unknown_88               2294
unknown_90              22031
dtype: int64
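
Raw counts are easier to compare when expressed as a share of all rows. The following is a minimal sketch of that idea, not part of the original analysis: it uses a small hypothetical frame named `toy` in place of the Strava data, and the name `missing_share` is likewise an illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Strava data (hypothetical values).
toy = pd.DataFrame({
    "heart_rate": [68.0, np.nan, 71.0, 77.0],
    "altitude": [np.nan, np.nan, np.nan, 3747.0],
    "distance": [0.0, 1.3, 3.2, 5.1],
})

# isna() marks missing cells; mean() turns the booleans into a per-column fraction.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `df`, the same expression would rank the columns from most to least incomplete, which helps separate the "small", "moderate", and "high" groups handled in the next section.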

### Data Cleaning - Handle Missing Values

In [6]:
# Drop rows with missing values for columns with a small number of missing values
df.dropna(subset=["cadence", "fractional_cadence", "unknown_87"], inplace=True)

# Impute missing values for columns with a moderate number of missing values.
# The result is assigned back to the column: calling fillna(..., inplace=True) on
# df["col"] is a chained in-place operation that recent pandas versions warn about.
df["heart_rate"] = df["heart_rate"].fillna(df["heart_rate"].mean())
df["unknown_88"] = df["unknown_88"].fillna(df["unknown_88"].median())

# Impute missing values for "enhanced_altitude" and "enhanced_speed"
df["enhanced_altitude"] = df["enhanced_altitude"].fillna(df["enhanced_altitude"].mean())
df["enhanced_speed"] = df["enhanced_speed"].fillna(df["enhanced_speed"].mean())

# For columns with a high number of missing values (e.g., "altitude," "speed," "unknown_90"), consider their relevance to your analysis and decide whether to impute or drop them.
# For example, if they are crucial for your analysis, explore imputation techniques. Otherwise, you may consider dropping those columns using df.drop(columns=[...]).

# Finally, you can check again for any remaining missing values
df.isna().sum()


Air Power               22785
Cadence                 22780
Form Power              22785
Ground Time             22780
Leg Spring Stiffness    22785
Power                   22780
Vertical Oscillation    22780
altitude                25722
cadence                     0
datafile                    0
distance                    0
enhanced_altitude           0
enhanced_speed              0
fractional_cadence          0
heart_rate                  0
position_lat              192
position_long             192
speed                   25699
timestamp                   0
unknown_87                  0
unknown_88                  0
unknown_90              22009
dtype: int64
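
The columns that are still incomplete fall into two groups: those with more than half of their values missing, and `position_lat`/`position_long` with 192 gaps each. One possible way to handle them is sketched below, under stated assumptions. The helper `drop_sparse_columns`, the 0.5 threshold, and the frame `toy` are all hypothetical, and linear interpolation is only one of several reasonable choices for position data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing=0.5):
    """Return a copy of frame without columns whose missing share exceeds max_missing."""
    share = frame.isna().mean()
    keep = share[share <= max_missing].index
    return frame.loc[:, keep].copy()

# Toy frame standing in for the cleaned Strava data (hypothetical values).
toy = pd.DataFrame({
    "altitude": [np.nan, np.nan, np.nan, 3747.0],
    "position_lat": [504432050.0, np.nan, 504432492.0, 504432600.0],
    "distance": [0.0, 1.3, 3.2, 5.1],
})

# "altitude" is 75% missing, so it is dropped; the other columns are kept.
reduced = drop_sparse_columns(toy, max_missing=0.5)

# Linear interpolation fills the single gap in position_lat from its neighbours.
reduced["position_lat"] = reduced["position_lat"].interpolate(method="linear")
print(reduced)
```

Whether a column such as `speed` should really be dropped depends on the analysis. Here `enhanced_speed` is nearly complete and may make the sparse `speed` column redundant, but that is a judgment to confirm against the data rather than a rule.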