## Handling Missing Data

### Preprocessing

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
setwd("/home/asus/content/Notes/Semester 4/FDN Lab/Experiments/Experiment 3")

In [3]:
df <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Name = c("Alice", "Bob", NA, "David", "Emma", "Frank", NA, "Hannah", "Ian", "Jack"),
  Age = c(25, NA, 30, 29, NA, 35, 40, NA, 50, 27),
  Salary = c(50000, 60000, 55000, NA, 70000, 75000, 80000, 65000, NA, 72000),
  Score = c(80, 90, NA, 85, 88, 92, NA, 77, 95, Inf)
)



Identify missing data: is.na(df) flags every missing cell and sum(is.na(df)) counts them. The Inf in Score is not NA, so it is not counted.


In [4]:
# i. Identify missing data
print(is.na(df))  # Identify missing values
print(sum(is.na(df)))  # Count total missing values


         ID  Name   Age Salary Score
 [1,] FALSE FALSE FALSE  FALSE FALSE
 [2,] FALSE FALSE  TRUE  FALSE FALSE
 [3,] FALSE  TRUE FALSE  FALSE  TRUE
 [4,] FALSE FALSE FALSE   TRUE FALSE
 [5,] FALSE FALSE  TRUE  FALSE FALSE
 [6,] FALSE FALSE FALSE  FALSE FALSE
 [7,] FALSE  TRUE FALSE  FALSE  TRUE
 [8,] FALSE FALSE  TRUE  FALSE FALSE
 [9,] FALSE FALSE FALSE   TRUE FALSE
[10,] FALSE FALSE FALSE  FALSE FALSE
[1] 9
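
A per-column breakdown is often more useful than the single total. The sketch below uses a small hypothetical data frame (df_demo is not part of the exercise data) and shows colSums() for per-column NA counts and complete.cases() for rows without any NA.

```r
# Hypothetical example data, defined here so the block runs on its own
df_demo <- data.frame(
  Name = c("Alice", NA, "Cara"),
  Age = c(25, NA, 30),
  Score = c(80, 90, Inf)
)

# Number of NA values in each column (named numeric vector)
na_per_column <- colSums(is.na(df_demo))

# TRUE for rows that contain no NA at all; Inf does not count as NA
complete_rows <- complete.cases(df_demo)

print(na_per_column)
print(complete_rows)
```

The same two calls can be applied to df directly to see which columns contribute to the total of 9.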


Remove rows that contain any missing value (na.omit(df)). Row 10 survives because Inf is not treated as NA.

In [5]:
df_no_na <- na.omit(df)
print(df_no_na)

   ID  Name Age Salary Score
1   1 Alice  25  50000    80
6   6 Frank  35  75000    92
10 10  Jack  27  72000   Inf
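
na.omit() drops a row if any column is missing, which here leaves only 3 of 10 rows. When only some columns matter, a row filter on those columns keeps more data. This is a minimal base R sketch on hypothetical data (df_demo and df_age_known are illustrative names).

```r
# Hypothetical example data, defined here so the block runs on its own
df_demo <- data.frame(
  ID = 1:4,
  Age = c(25, NA, 30, 41),
  Salary = c(50000, 60000, NA, 70000)
)

# Keep rows where Age is present, even if Salary is missing
df_age_known <- df_demo[!is.na(df_demo$Age), ]

print(df_age_known)
```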


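Replace NA with zero (df[is.na(df)] <- 0). This applies to every column, so missing entries in the character column Name become the string "0".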

In [6]:

df_zero <- df
df_zero[is.na(df_zero)] <- 0
print(df_zero)



   ID   Name Age Salary Score
1   1  Alice  25  50000    80
2   2    Bob   0  60000    90
3   3      0  30  55000     0
4   4  David  29      0    85
5   5   Emma   0  70000    88
6   6  Frank  35  75000    92
7   7      0  40  80000     0
8   8 Hannah   0  65000    77
9   9    Ian  50      0    95
10 10   Jack  27  72000   Inf
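
If a zero only makes sense for numeric columns, the replacement can be restricted to them. The following is a sketch under that assumption, using hypothetical data (df_demo and num_cols are illustrative names).

```r
# Hypothetical example data, defined here so the block runs on its own
df_demo <- data.frame(
  Name = c("Alice", NA),
  Age = c(25, NA),
  stringsAsFactors = FALSE
)

# Logical vector marking the numeric columns
num_cols <- sapply(df_demo, is.numeric)

# Replace NA with 0 in numeric columns only; Name keeps its NA
df_demo[num_cols] <- lapply(df_demo[num_cols], function(col) {
  col[is.na(col)] <- 0
  col
})

print(df_demo)
```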


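Replace NA with the column mean (df$Age[is.na(df$Age)] <- mean(df$Age, na.rm=TRUE)). Score contains Inf, so its mean is Inf and the missing Score values are filled with Inf rather than a usable number.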

In [7]:

# Impute directly into df; later cells reuse these imputed columns

df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
df$Salary[is.na(df$Salary)] <- mean(df$Salary, na.rm = TRUE)
df$Score[is.na(df$Score)] <- mean(df$Score, na.rm = TRUE)

print(df)


   ID   Name      Age Salary Score
1   1  Alice 25.00000  50000    80
2   2    Bob 33.71429  60000    90
3   3   <NA> 30.00000  55000   Inf
4   4  David 29.00000  65875    85
5   5   Emma 33.71429  70000    88
6   6  Frank 35.00000  75000    92
7   7   <NA> 40.00000  80000   Inf
8   8 Hannah 33.71429  65000    77
9   9    Ian 50.00000  65875    95
10 10   Jack 27.00000  72000   Inf
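
The Inf values in the imputed Score column come from the mean itself: na.rm = TRUE drops NA but keeps Inf. One way around this, sketched below on a copy of the Score values (score, finite_mean and score_imputed are illustrative names), is to average only the finite entries.

```r
# Same values as the original Score column
score <- c(80, 90, NA, 85, 88, 92, NA, 77, 95, Inf)

# na.rm = TRUE removes NA but not Inf, so this mean is Inf
naive_mean <- mean(score, na.rm = TRUE)

# is.finite() is FALSE for NA, NaN, Inf and -Inf
finite_mean <- mean(score[is.finite(score)])

# Fill only the NA entries; the existing Inf is left for a later step
score_imputed <- score
score_imputed[is.na(score_imputed)] <- finite_mean

print(naive_mean)
print(finite_mean)
print(score_imputed)
```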


Replace Inf and NaN with NA (df$Score[is.infinite(df$Score) | is.nan(df$Score)] <- NA). Rows 3 and 7 become NA again because the mean imputation above filled them with Inf.

In [8]:
df_clean <- df
df_clean$Score[is.infinite(df_clean$Score) | is.nan(df_clean$Score)] <- NA
print(df_clean)

   ID   Name      Age Salary Score
1   1  Alice 25.00000  50000    80
2   2    Bob 33.71429  60000    90
3   3   <NA> 30.00000  55000    NA
4   4  David 29.00000  65875    85
5   5   Emma 33.71429  70000    88
6   6  Frank 35.00000  75000    92
7   7   <NA> 40.00000  80000    NA
8   8 Hannah 33.71429  65000    77
9   9    Ian 50.00000  65875    95
10 10   Jack 27.00000  72000    NA
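
The same cleanup can be applied to every numeric column at once. This sketch uses !is.finite(), which covers Inf, -Inf and NaN in one test; the data frame and column names are hypothetical.

```r
# Hypothetical example data, defined here so the block runs on its own
df_demo <- data.frame(
  ID = 1:3,
  Score = c(80, Inf, NaN),
  Ratio = c(-Inf, 0.5, NA)
)

# Logical vector marking the numeric columns
num_cols <- sapply(df_demo, is.numeric)

# Turn every non-finite value (Inf, -Inf, NaN) into NA
df_demo[num_cols] <- lapply(df_demo[num_cols], function(col) {
  col[!is.finite(col)] <- NA
  col
})

print(df_demo)
```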


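Use tidyverse's replace_na() for selective column handling. Age and Salary were already imputed in cell 7, so they contain no NA at this point and the call leaves them unchanged (Salary still shows the mean 65875, not the median).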

In [9]:
df_tidy <- df %>%
  mutate(
    Age = replace_na(Age, mean(Age, na.rm = TRUE)),
    Salary = replace_na(Salary, median(Salary, na.rm = TRUE))
  )
print(df_tidy)

   ID   Name      Age Salary Score
1   1  Alice 25.00000  50000    80
2   2    Bob 33.71429  60000    90
3   3   <NA> 30.00000  55000   Inf
4   4  David 29.00000  65875    85
5   5   Emma 33.71429  70000    88
6   6  Frank 35.00000  75000    92
7   7   <NA> 40.00000  80000   Inf
8   8 Hannah 33.71429  65000    77
9   9    Ian 50.00000  65875    95
10 10   Jack 27.00000  72000   Inf
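
replace_na() also accepts a whole data frame together with a named list of replacement values, one per column. The sketch below assumes tidyr is installed and uses hypothetical data and fill values ("Unknown" and 0 are illustrative choices, not part of the exercise).

```r
library(tidyr)

# Hypothetical example data, defined here so the block runs on its own
df_demo <- data.frame(
  Name = c("Alice", NA),
  Age = c(25, NA),
  Score = c(NA, 90)
)

# Only the columns named in the list are filled; Score keeps its NA
df_filled <- replace_na(df_demo, list(Name = "Unknown", Age = 0))

print(df_filled)
```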


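Drop columns in which at least half of the values are missing (df <- df[, colSums(is.na(df)) < nrow(df) * 0.5]). No column reaches that threshold here (Name has the most NAs, with 2 of 10), so all five columns are kept.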

In [10]:
df_filtered <- df[, colSums(is.na(df)) < (nrow(df) * 0.5)]
print(df_filtered)

   ID   Name      Age Salary Score
1   1  Alice 25.00000  50000    80
2   2    Bob 33.71429  60000    90
3   3   <NA> 30.00000  55000   Inf
4   4  David 29.00000  65875    85
5   5   Emma 33.71429  70000    88
6   6  Frank 35.00000  75000    92
7   7   <NA> 40.00000  80000   Inf
8   8 Hannah 33.71429  65000    77
9   9    Ian 50.00000  65875    95
10 10   Jack 27.00000  72000   Inf
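
Since nothing is dropped in the exercise data, the following sketch shows the filter on a hypothetical data frame where one column does exceed the threshold. It uses colMeans(is.na(...)) to get the missing fraction directly, and drop = FALSE so the result stays a data frame even if only one column survives.

```r
# Hypothetical example data, defined here so the block runs on its own
df_demo <- data.frame(
  ID = 1:4,
  Notes = c(NA, NA, NA, "ok"),
  Age = c(25, NA, 30, 41)
)

# Fraction of missing values per column
na_fraction <- colMeans(is.na(df_demo))

# Keep columns with less than 50% missing values
df_kept <- df_demo[, na_fraction < 0.5, drop = FALSE]

print(na_fraction)
print(df_kept)
```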


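Fill missing categorical values with the mode. Every name in this data set occurs exactly once, so there is no unique mode; the tie resolves to the first name in sorted order ("Alice"), as the output shows.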

In [11]:
# viii. Fill missing categorical values with mode
fill_mode <- function(x) {
  if (is.character(x)) {
    mode_value <- names(sort(table(x), decreasing = TRUE))[1]
    x[is.na(x)] <- mode_value
  }
  return(x)
}
df_mode <- df
df_mode$Name <- fill_mode(df_mode$Name)
print(df_mode)

   ID   Name      Age Salary Score
1   1  Alice 25.00000  50000    80
2   2    Bob 33.71429  60000    90
3   3  Alice 30.00000  55000   Inf
4   4  David 29.00000  65875    85
5   5   Emma 33.71429  70000    88
6   6  Frank 35.00000  75000    92
7   7  Alice 40.00000  80000   Inf
8   8 Hannah 33.71429  65000    77
9   9    Ian 50.00000  65875    95
10 10   Jack 27.00000  72000   Inf
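
Because ties make the mode ambiguous, it can help to inspect all most-frequent values before filling. The helper below, all_modes(), is a hypothetical name introduced for this sketch; it relies on table() ignoring NA by default.

```r
# Return every value that occurs with the highest frequency
all_modes <- function(x) {
  counts <- table(x)                      # NA is excluded by default
  names(counts)[counts == max(counts)]
}

# Hypothetical example vectors
dept <- c("HR", "IT", NA, "IT", "HR", "Ops")   # tie between HR and IT
city <- c("Pune", "Delhi", "Pune", NA)         # unique mode: Pune

dept_modes <- all_modes(dept)
city_modes <- all_modes(city)

# Fill only when the mode is unique
city_filled <- city
if (length(city_modes) == 1) {
  city_filled[is.na(city_filled)] <- city_modes
}

print(dept_modes)
print(city_filled)
```

With a tie, as in dept, the choice of fill value is a judgment call, and leaving the NA in place or using an explicit placeholder may be safer than picking one arbitrarily.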
