# 2 Hands On: Data Quality and Pre-Processing

## 1. Assessing Data Quality

Load the following packages: dplyr, na.tools, and tidyimpute (the GitHub version, decisionpatterns/tidyimpute).


In [1]:
# Install the required CRAN packages (this reinstalls them if they are already present)
install.packages(c("dplyr", "na.tools"))
# Load packages
library(dplyr)
library(na.tools)


Installing packages into 'C:/Users/Paola/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)



package 'dplyr' successfully unpacked and MD5 sums checked
package 'na.tools' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Paola\AppData\Local\Temp\RtmpCEEh6n\downloaded_packages



Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




In [2]:
install.packages("remotes")  # Install the remotes package if you don't have it

remotes::install_github("decisionpatterns/tidyimpute")


Installing package into 'C:/Users/Paola/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)



package 'remotes' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\Paola\AppData\Local\Temp\RtmpCEEh6n\downloaded_packages


Skipping install of 'tidyimpute' from a github remote, the SHA1 (9e07748f) has not changed since last install.
  Use `force = TRUE` to force installation



In [4]:
library(tidyimpute)



Load the carInsurance data set, which describes the insurance risk rating of cars based on several characteristics of each car.



In [6]:
load("C:/Users/Paola/Documents/GitHub/hands-on-2023A/data/02_dataquality/carInsurance.Rdata")

# The .Rdata file provides an object named `carIns`;
# convert it to a data frame (this changes nothing if it already is one)
data_df <- as.data.frame(carIns)

# Write the data frame to a CSV file
write.csv(data_df, "carInsurance.csv", row.names = FALSE)


In [7]:
# Inspect the data frame that was written to carInsurance.csv
data_df


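| symb | normLoss | make | fuelType | aspiration | nDoors | bodyStyle | driveWheels | engineLocation | wheelBase | ⋯ | engineSize | fuelSystem | bore | stroke | compressionRatio | horsePower | peakRpm | cityMpg | highwayMpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | ⋯ | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 3 | NA | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ⋯ | 130 | mpfi | 3.47 | 2.68 | 9.00 | 111 | 5000 | 21 | 27 | 13495 |
| 3 | NA | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ⋯ | 130 | mpfi | 3.47 | 2.68 | 9.00 | 111 | 5000 | 21 | 27 | 16500 |
| 1 | NA | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ⋯ | 152 | mpfi | 2.68 | 3.47 | 9.00 | 154 | 5000 | 19 | 26 | 16500 |
| 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ⋯ | 109 | mpfi | 3.19 | 3.40 | 10.00 | 102 | 5500 | 24 | 30 | 13950 |
| 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ⋯ | 136 | mpfi | 3.19 | 3.40 | 8.00 | 115 | 5500 | 18 | 22 | 17450 |
| 2 | NA | audi | gas | std | two | sedan | fwd | front | 99.8 | ⋯ | 136 | mpfi | 3.19 | 3.40 | 8.50 | 110 | 5500 | 19 | 25 | 15250 |
| 1 | 158 | audi | gas | std | four | sedan | fwd | front | 105.8 | ⋯ | 136 | mpfi | 3.19 | 3.40 | 8.50 | 110 | 5500 | 19 | 25 | 17710 |
| 1 | NA | audi | gas | std | four | wagon | fwd | front | 105.8 | ⋯ | 136 | mpfi | 3.19 | 3.40 | 8.50 | 110 | 5500 | 19 | 25 | 18920 |
| 1 | 158 | audi | gas | turbo | four | sedan | fwd | front | 105.8 | ⋯ | 131 | mpfi | 3.13 | 3.40 | 8.30 | 140 | 5500 | 17 | 20 | 23875 |
| 0 | NA | audi | gas | turbo | two | hatchback | 4wd | front | 99.5 | ⋯ | 131 | mpfi | 3.13 | 3.40 | 7.00 | 160 | 5500 | 16 | 22 | NA |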


### (a) Check if there are any missing values.

Tip: use the function any_na().


In [None]:
# Check for missing values
if (any_na(data_df)) {
  print("There are missing values in the dataset.")
} else {
  print("There are no missing values in the dataset.")
}
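Knowing that missing values exist is more useful when you also know where they are. The following is a minimal base R sketch on a small, made-up data frame (`toy` and its values are hypothetical, not the real carInsurance data) that counts the missing values per column.

```r
# Toy data frame (hypothetical values, not the real carInsurance data)
toy <- data.frame(
  normLoss = c(NA, 164L, 158L, NA),
  nDoors   = factor(c("two", "four", NA, "four")),
  price    = c(13495L, 13950L, 17710L, NA)
)

# TRUE if at least one value anywhere in the data frame is missing
has_missing <- any(is.na(toy))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

print(has_missing)
print(na_per_column)
```

The same two calls should work unchanged on `data_df`.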


### (b) Count the number of cases that have at least one missing value.

Tip: use the function filter_any_na() and then count().


In [None]:
# Filter cases with at least one missing value and count them
missing_cases <- data_df %>% filter_any_na() %>% count()

# Print the number of cases with missing values
print(missing_cases)
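As a cross-check, the same count can be obtained with base R only. This sketch uses `complete.cases()` instead of the tidyimpute helper and runs on a made-up data frame (`toy` is hypothetical).

```r
# Toy data frame (hypothetical values, not the real carInsurance data)
toy <- data.frame(
  normLoss = c(NA, 164L, 158L, NA),
  nDoors   = factor(c("two", "four", NA, "four")),
  price    = c(13495L, 13950L, 17710L, NA)
)

# complete.cases() is TRUE for rows without any NA, so negate it
incomplete <- !complete.cases(toy)

# Number of cases with at least one missing value
n_incomplete <- sum(incomplete)

# Number of cases that would remain after dropping the incomplete ones
n_complete <- nrow(toy) - n_incomplete

print(n_incomplete)
print(n_complete)
```

If both approaches are applied to `data_df`, the two counts are expected to agree.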


### (c) Create a new data set by removing all the cases that have missing values.

Tip: use the function drop_rows_any_na().


In [None]:
# Create a new dataset without missing values
new_dataset <- drop_rows_any_na(data_df)
new_dataset


### (d) Create a new data set by imputing all the missing values with 0.

Tip: explore the variants of the function impute().


In [None]:
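# Create a new dataset with missing values imputed as 0.
# impute() takes the replacement value directly and has no `method` or
# `fixed_value` arguments; impute_zero() is its constant-zero variant.
new_dataset <- impute_zero(data_df)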
new_dataset


### (e) Create a new data set by imputing the mean in all the columns of type double.
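One possible approach, sketched here with plain dplyr rather than a tidyimpute helper, is to select the double columns with `where(is.double)` and replace each `NA` with the column mean. The data frame `toy` and its values are made up for illustration.

```r
library(dplyr)

# Toy data frame (hypothetical values, not the real carInsurance data)
toy <- data.frame(
  bore    = c(3.47, NA, 3.19, 3.13),    # double
  stroke  = c(2.68, 3.40, NA, 3.40),    # double
  peakRpm = c(5000L, NA, 5500L, 5500L)  # integer, left untouched here
)

# Replace each NA in a double column with that column's mean
imputed_mean <- toy %>%
  mutate(across(where(is.double),
                ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))

print(imputed_mean)
```

Integer columns such as `peakRpm` keep their missing values, because `is.double()` is `FALSE` for them.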

### (f) Create a new data set by imputing the mode in all the columns of type integer.
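Base R has no function that returns the statistical mode, so this sketch defines a small helper, `stat_mode()` (a hypothetical name introduced here), and applies it to the integer columns with plain dplyr. The data frame `toy` is made up.

```r
library(dplyr)

# Hypothetical helper: most frequent non-missing value of a vector.
# Ties are broken in favour of the value that appears first.
stat_mode <- function(x) {
  observed <- x[!is.na(x)]
  values <- unique(observed)
  values[which.max(tabulate(match(observed, values)))]
}

# Toy data frame (hypothetical values, not the real carInsurance data)
toy <- data.frame(
  peakRpm    = c(5000L, NA, 5500L, 5500L),  # integer
  horsePower = c(111L, 111L, NA, 154L),     # integer
  bore       = c(3.47, NA, 3.19, 3.13)      # double, left untouched here
)

# Replace each NA in an integer column with that column's mode
imputed_mode <- toy %>%
  mutate(across(where(is.integer),
                ~ replace(.x, is.na(.x), stat_mode(.x))))

print(imputed_mode)
```

Note that `is.integer()` returns `FALSE` for factors, so factor columns such as nDoors are not affected by this step.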

### (g) Create a new data set by imputing the most frequent value to the column "nDoors".

Tip: use the function impute_replace().
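To illustrate the idea independently of the tidyimpute helper, this base R sketch finds the most frequent level of a factor with `table()` and assigns it to the missing entries. The data frame `toy` is made up.

```r
# Toy data frame (hypothetical values, not the real carInsurance data)
toy <- data.frame(
  make   = c("audi", "audi", "bmw", "bmw", "audi"),
  nDoors = factor(c("two", "four", NA, "four", "four"))
)

# table() ignores NA by default; which.max() picks the most frequent level
most_frequent <- names(which.max(table(toy$nDoors)))

# Work on a copy so the original data stay unchanged
imputed_doors <- toy
imputed_doors$nDoors[is.na(imputed_doors$nDoors)] <- most_frequent

print(most_frequent)
print(imputed_doors)
```

Because the replacement is an existing level of the factor, the assignment does not introduce new levels.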

### (h) Combine the last three imputations to obtain a final dataset. Are there any duplicated cases?

Tip: use the functions distinct() and count().
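The duplicate check can be sketched by comparing the number of rows before and after `distinct()`. The data frame `toy` is made up; on the real data, the input would be the result of chaining the three imputations.

```r
library(dplyr)

# Toy data frame (hypothetical values); rows 1 and 2 are identical
toy <- data.frame(
  make  = c("audi", "audi", "bmw", "audi"),
  price = c(13950L, 13950L, 16430L, 17450L)
)

# Total number of cases
n_total <- nrow(toy)

# Number of cases left after removing exact duplicates
n_distinct_rows <- toy %>% distinct() %>% count() %>% pull(n)

# A positive difference means that duplicated cases exist
n_duplicates <- n_total - n_distinct_rows

print(n_duplicates)
```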

## 2. Data Pre-Processing

### 2. Load the package dlookr. Use the same car insurance data set as above and apply the following transformations to the price attribute. Be critical regarding the obtained results.

(a) Apply range-based normalization and z-score normalization.
Tip: use the function transform().

(b) Discretize it into 4 equal-frequency ranges and into 4 equal-width ranges.
Tip: use the function binning().
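To make clear what these transformations compute, here is a base R sketch that does not use dlookr. The `price` vector is made up and deliberately skewed, which is where the two binning strategies differ most.

```r
# Hypothetical, right-skewed prices (not the real carInsurance data)
price <- c(5000, 7000, 9000, 12000, 16000, 21000, 30000, 45000)

# Range-based (min-max) normalization: rescales values to [0, 1]
minmax <- (price - min(price)) / (max(price) - min(price))

# z-score normalization: mean 0 and standard deviation 1
zscore <- (price - mean(price)) / sd(price)

# 4 equal-width bins: each bin spans the same range of prices
equal_width <- cut(price, breaks = 4, include.lowest = TRUE)

# 4 equal-frequency bins: the quartiles are used as cut points
quartiles <- quantile(price, probs = seq(0, 1, by = 0.25))
equal_freq <- cut(price, breaks = quartiles, include.lowest = TRUE)

print(table(equal_width))
print(table(equal_freq))
```

With skewed data the equal-width bins end up unevenly populated, while the equal-frequency bins hold the same number of cases but cover very different price ranges. This is worth keeping in mind when judging the results on the real price attribute, which also contains missing values.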
### 3. With the seed 111019, obtain the following samples of the car insurance data set.

Tip: use the function sample_frac().

(a) A random sample of 60% of the cases, with replacement.

(b) A stratified sample of 60% of the cases, according to the fuelType attribute.

(c) Use the table() function to inspect the distribution of values in each of the two samples above.
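The two sampling schemes can be sketched with dplyr as follows. The data frame `toy` is made up (10 gas cars and 5 diesel cars); stratification is obtained by grouping on fuelType before sampling.

```r
library(dplyr)

# Toy data frame (hypothetical): 10 gas cars and 5 diesel cars
toy <- data.frame(
  id       = 1:15,
  fuelType = rep(c("gas", "diesel"), times = c(10, 5))
)

# (a) Random sample of 60% of the cases, with replacement
set.seed(111019)
random_sample <- toy %>% sample_frac(0.6, replace = TRUE)

# (b) Stratified sample: 60% of the cases within each fuelType group
set.seed(111019)
stratified_sample <- toy %>%
  group_by(fuelType) %>%
  sample_frac(0.6) %>%
  ungroup()

# (c) Compare the fuelType distribution in the two samples
print(table(random_sample$fuelType))
print(table(stratified_sample$fuelType))
```

The stratified sample preserves the original 2:1 ratio of gas to diesel cars, whereas the simple random sample only does so on average. In recent dplyr versions `sample_frac()` is superseded by `slice_sample(prop = ...)`, but it still works.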
