Importing dependencies, loading the dataset, and setting up the DataFrame

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('sensor_log.csv')  # df serves as the default DataFrame from here on
df.head(10)

Unnamed: 0,timestamp,temperature_c,humidity_pct,voltage_v
0,2025-10-01 08:00:00,24.5,55.2,3.7
1,2025-10-01 08:00:10,24.7,55.0,3.69
2,2025-10-01 08:00:20,24.6,55.1,
3,2025-10-01 08:00:30,,54.9,3.68
4,2025-10-01 08:01:00,24.9,54.8,3.68
5,2025-10-01 08:02:15,25.1,,3.67
6,2025-10-01 08:03:00,25.3,54.7,3.67
7,2025-10-01 08:05:30,25.5,54.9,3.65
8,2025-10-01 08:08:00,,55.0,3.64
9,2025-10-01 08:10:00,26.0,55.1,3.63


### 1. Answers to the Exercises in missing_values_tutorial

#### Exercise 1

In [24]:
# Rows where 'temperature_c' is missing
print("Rows showing only where temperature_c is missing")
dfExT = df[df['temperature_c'].isnull()]
display(dfExT)

# Rows where 'humidity_pct' is missing
print("Rows showing only where humidity_pct is missing")
dfExH = df[df['humidity_pct'].isnull()]
display(dfExH)

# Percentage of missing values per column
print("Percentages of missing values in the dataset")
missing_percentage = df.isna().mean() * 100
missing_percentage.round(2)


Rows showing only where temperature_c is missing


Unnamed: 0,timestamp,temperature_c,humidity_pct,voltage_v
3,2025-10-01 08:00:30,,54.9,3.68
8,2025-10-01 08:08:00,,55.0,3.64


Rows showing only where humidity_pct is missing


Unnamed: 0,timestamp,temperature_c,humidity_pct,voltage_v
5,2025-10-01 08:02:15,25.1,,3.67


Percentages of missing values in the dataset


timestamp         0.0
temperature_c    20.0
humidity_pct     10.0
voltage_v        10.0
dtype: float64

Explanation:

The column with the highest percentage of missing values is 'temperature_c': 2 of its 10 readings (20%) are missing, compared with 10% each for 'humidity_pct' and 'voltage_v'.
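
The same conclusion can be reached programmatically instead of by reading the table. The sketch below is only an illustration: `df_demo` is a hypothetical stand-in that re-enters the ten rows shown above by hand, and `idxmax()` is used to pick the column with the largest share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: the ten rows shown above, typed in by hand
df_demo = pd.DataFrame({
    "temperature_c": [24.5, 24.7, 24.6, np.nan, 24.9, 25.1, 25.3, 25.5, np.nan, 26.0],
    "humidity_pct": [55.2, 55.0, 55.1, 54.9, 54.8, np.nan, 54.7, 54.9, 55.0, 55.1],
    "voltage_v": [3.70, 3.69, np.nan, 3.68, 3.68, 3.67, 3.67, 3.65, 3.64, 3.63],
})

# Share of missing values per column, in percent
missing_pct = df_demo.isna().mean() * 100

# idxmax() returns the label of the column with the largest share
worst_column = missing_pct.idxmax()
worst_share = missing_pct.max()

print(f"{worst_column}: {worst_share:.1f}% missing")
```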
 

#### Exercise 2

In [31]:
# Creating a copy of df to df_median
df_median = df.copy()

#Selecting only numeric columns to avoid errors
dfEx2 = df_median.select_dtypes(include=[np.number]).columns

for col in dfEx2:
    median_val = df_median[col].median()
    df_median[col] = df_median[col].fillna(median_val)

print("df_median filled values")
display(df_median)

#Creating a copy of df to df_mean for comparison
df_mean = df.copy()

for col in dfEx2:
    mean_val = df_mean[col].mean()
    df_mean[col] = df_mean[col].fillna(mean_val)

print("df_mean filled values")
display(df_mean)


df_median filled values


Unnamed: 0,timestamp,temperature_c,humidity_pct,voltage_v
0,2025-10-01 08:00:00,24.5,55.2,3.7
1,2025-10-01 08:00:10,24.7,55.0,3.69
2,2025-10-01 08:00:20,24.6,55.1,3.67
3,2025-10-01 08:00:30,25.0,54.9,3.68
4,2025-10-01 08:01:00,24.9,54.8,3.68
5,2025-10-01 08:02:15,25.1,55.0,3.67
6,2025-10-01 08:03:00,25.3,54.7,3.67
7,2025-10-01 08:05:30,25.5,54.9,3.65
8,2025-10-01 08:08:00,25.0,55.0,3.64
9,2025-10-01 08:10:00,26.0,55.1,3.63


df_mean filled values


Unnamed: 0,timestamp,temperature_c,humidity_pct,voltage_v
0,2025-10-01 08:00:00,24.5,55.2,3.7
1,2025-10-01 08:00:10,24.7,55.0,3.69
2,2025-10-01 08:00:20,24.6,55.1,3.667778
3,2025-10-01 08:00:30,25.075,54.9,3.68
4,2025-10-01 08:01:00,24.9,54.8,3.68
5,2025-10-01 08:02:15,25.1,54.966667,3.67
6,2025-10-01 08:03:00,25.3,54.7,3.67
7,2025-10-01 08:05:30,25.5,54.9,3.65
8,2025-10-01 08:08:00,25.075,55.0,3.64
9,2025-10-01 08:10:00,26.0,55.1,3.63


Explanation:

The first table (df_median) replaces each missing value with the median (middle value) of its column. Unlike the mean, the median is not pulled toward outliers or extreme values.

The second table (df_mean) replaces each missing value with the mean of its column. Here the two results barely differ (for example, 25.0 versus 25.075 for 'temperature_c') because the readings are close together. In a dataset with extreme values at either end, the mean would be dragged toward them and the filled values would be less representative.

Median-based imputation is therefore more robust to extreme values and gives more reliable fills when outliers are present.
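
To make the robustness argument concrete, here is a small sketch using the eight observed temperature readings. The spike value of 85.0 is hypothetical and does not appear in the dataset; it is added only to imitate a sensor glitch and show how each statistic reacts.

```python
import pandas as pd

# The eight non-missing temperature readings from the log
clean = pd.Series([24.5, 24.7, 24.6, 24.9, 25.1, 25.3, 25.5, 26.0])

# HYPOTHETICAL glitch reading (not in the dataset) appended for illustration
with_spike = pd.concat([clean, pd.Series([85.0])], ignore_index=True)

for label, s in [("clean", clean), ("with spike", with_spike)]:
    print(f"{label:>10}: mean={s.mean():.3f}  median={s.median():.3f}")
```

Under this assumption the mean moves by several degrees while the median shifts by only a tenth of a degree, so a median-based fill value stays close to the typical reading.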


#### Exercise 3

In [34]:
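# Parse timestamps and sort chronologically so fills follow time order
df['timestamp'] = pd.to_datetime(df['timestamp'])
df_ts = df.sort_values('timestamp').copy()

# Forward fill: carry the last valid reading forward
df_ffill = df_ts.ffill()

# Backward fill: copy the next valid reading backward
df_bfill = df_ts.bfill()

# Linear interpolation between the neighboring valid readings
df_interp = df_ts.interpolate(method='linear')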

print("=== COMPARISON: Rows 2 to 5 (Temperature Column) ===")
comparison = pd.DataFrame({
    'Original': df_ts['temperature_c'].iloc[2:6],
    'Forward Fill': df_ffill['temperature_c'].iloc[2:6],
    'Backward Fill': df_bfill['temperature_c'].iloc[2:6],
    'Interpolation': df_interp['temperature_c'].iloc[2:6]
})
print(comparison)

print("\n=== COMPARISON: Rows 2 to 5 (Voltage Column) ===")
comparison_volt = pd.DataFrame({
    'Original': df_ts['voltage_v'].iloc[2:6],
    'Forward Fill': df_ffill['voltage_v'].iloc[2:6],
    'Backward Fill': df_bfill['voltage_v'].iloc[2:6],
    'Interpolation': df_interp['voltage_v'].iloc[2:6]
})
print(comparison_volt)

print("\n=== COMPARISON: Rows 2 to 5 (Humidity Column) ===")
comparison_hum = pd.DataFrame({
    'Original': df_ts['humidity_pct'].iloc[2:6],
    'Forward Fill': df_ffill['humidity_pct'].iloc[2:6],
    'Backward Fill': df_bfill['humidity_pct'].iloc[2:6],
    'Interpolation': df_interp['humidity_pct'].iloc[2:6]
})
print(comparison_hum)

=== COMPARISON: Rows 2 to 5 (Temperature Column) ===
   Original  Forward Fill  Backward Fill  Interpolation
2      24.6          24.6           24.6          24.60
3       NaN          24.6           24.9          24.75
4      24.9          24.9           24.9          24.90
5      25.1          25.1           25.1          25.10

=== COMPARISON: Rows 2 to 5 (Voltage Column) ===
   Original  Forward Fill  Backward Fill  Interpolation
2       NaN          3.69           3.68          3.685
3      3.68          3.68           3.68          3.680
4      3.68          3.68           3.68          3.680
5      3.67          3.67           3.67          3.670

=== COMPARISON: Rows 2 to 5 (Humidity Column) ===
   Original  Forward Fill  Backward Fill  Interpolation
2      55.1          55.1           55.1          55.10
3      54.9          54.9           54.9          54.90
4      54.8          54.8           54.8          54.80
5       NaN          54.8           54.7          54.75


Explanation:

Interpolation is the most reasonable method for this sensor data because quantities like temperature and humidity change gradually over time and rarely jump from one value to another.
Forward fill is a reasonable choice only when the gap is small. Across a larger gap it holds the last reading constant, which implies there was no rising or falling trend and is likely to be inaccurate.
Backward fill is the least suitable here because it copies a value from the future: it cannot be applied while readings are still arriving, and it also produces a flat step.
Interpolation instead estimates the missing value from the trend between the readings before and after the gap. For the missing temperature in row 3 it gives 24.75, midway between 24.6 and 24.9.
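
One caveat worth illustrating: pandas' `method='linear'` ignores the index and treats rows as equally spaced, while the timestamps in this log are not evenly spaced. The sketch below is an optional variation, not part of the exercise solution. It rebuilds rows 2 to 4 by hand with a `DatetimeIndex` and compares `method='linear'` with `method='time'`, which weights the estimate by elapsed time.

```python
import numpy as np
import pandas as pd

# Rows 2-4 of the log, re-entered by hand: the missing reading at 08:00:30
# is 10 s after the previous reading but 30 s before the next one.
temps = pd.Series(
    [24.6, np.nan, 24.9],
    index=pd.to_datetime([
        "2025-10-01 08:00:20",
        "2025-10-01 08:00:30",
        "2025-10-01 08:01:00",
    ]),
    name="temperature_c",
)

by_position = temps.interpolate(method="linear")  # rows treated as equally spaced
by_time = temps.interpolate(method="time")        # weighted by the DatetimeIndex

print(pd.DataFrame({"linear": by_position, "time": by_time}))
```

With these three rows, the positional estimate is the midpoint (24.75), whereas the time-weighted estimate sits closer to the earlier reading (24.675) because the gap is only a quarter of the way through the 40-second interval. Which one is preferable depends on how much the irregular sampling matters for the analysis.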


### 2. Mini Project

In [41]:
#Loading dataset into a new DataFrame dfP
dfP = pd.read_csv('sensor_log.csv')
dfP['timestamp'] = pd.to_datetime(dfP['timestamp'])
dfP = dfP.sort_values('timestamp')
print("Dataset loaded successfully.")

#Summarizing missing values
print("\nMissing values summary:")
missing_count = dfP.isnull().sum()
missing_pct = dfP.isnull().mean() * 100
missing_summary = pd.DataFrame({'Missing Count': missing_count, 'Missing Percentage': missing_pct.round(2)})
print(missing_summary)

#Applying Linear Interpolation
dfP_interp = dfP.copy()
numColumns = ['temperature_c', 'humidity_pct', 'voltage_v']
dfP_interp[numColumns] = dfP_interp[numColumns].interpolate(method='linear')
print("\nMissing values after interpolation:")
print(dfP_interp.head(5))

#Comparing summary statistics before and after imputation
print("\nComparison of statistics before and after imputation:")
stats_original = dfP[numColumns].agg(['mean', 'min', 'max'])
stats_imputed = dfP_interp[numColumns].agg(['mean', 'min', 'max'])
comparison = pd.concat([stats_original, stats_imputed], axis=1, keys=['Original', 'Imputed'])
print(comparison)

Dataset loaded successfully.

Missing values summary:
               Missing Count  Missing Percentage
timestamp                  0                 0.0
temperature_c              2                20.0
humidity_pct               1                10.0
voltage_v                  1                10.0

Missing values after interpolation:
            timestamp  temperature_c  humidity_pct  voltage_v
0 2025-10-01 08:00:00          24.50          55.2      3.700
1 2025-10-01 08:00:10          24.70          55.0      3.690
2 2025-10-01 08:00:20          24.60          55.1      3.685
3 2025-10-01 08:00:30          24.75          54.9      3.680
4 2025-10-01 08:01:00          24.90          54.8      3.680

Comparison of statistics before and after imputation:
          Original                              Imputed                       
     temperature_c humidity_pct voltage_v temperature_c humidity_pct voltage_v
mean        25.075    54.966667  3.667778         25.11       54.945    3.6695


Explanation:

I chose not to drop any rows or columns because the dataset is small: removing data would lose information and leave gaps in the timestamp sequence.
Linear interpolation was chosen as the imputation strategy because it suits this type of sensor data. It preserves the gradual rise and fall of the readings instead of introducing abrupt jumps, and the summary above shows the column means shift only slightly (for example, 25.075 to 25.11 for 'temperature_c').
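
The cost of dropping rows can be checked directly. The following is a minimal sketch, assuming a hypothetical `df_demo` that re-enters the ten rows of the log by hand; it counts how many rows `dropna()` would discard and confirms that linear interpolation leaves no gaps in this particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfP's numeric columns, typed in from the log
df_demo = pd.DataFrame({
    "temperature_c": [24.5, 24.7, 24.6, np.nan, 24.9, 25.1, 25.3, 25.5, np.nan, 26.0],
    "humidity_pct": [55.2, 55.0, 55.1, 54.9, 54.8, np.nan, 54.7, 54.9, 55.0, 55.1],
    "voltage_v": [3.70, 3.69, np.nan, 3.68, 3.68, 3.67, 3.67, 3.65, 3.64, 3.63],
})

rows_total = len(df_demo)
rows_complete = len(df_demo.dropna())   # rows with no missing value in any column
rows_lost = rows_total - rows_complete

print(f"dropna() would keep {rows_complete} of {rows_total} rows "
      f"({rows_lost / rows_total:.0%} discarded)")

# Linear interpolation keeps every row; count any values still missing
filled = df_demo.interpolate(method="linear")
remaining_nans = int(filled.isna().sum().sum())
print("Remaining NaNs after interpolation:", remaining_nans)
```

Because the four gaps fall in four different rows, row-wise deletion would remove a much larger share of the log than the per-column missing percentages suggest.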