# Data cleaning notebook

This notebook contains the examples from chapter 2 of the book. Let us start by reading the data into a data frame named `bike`.

In [None]:
# install.packages("tidyverse")

In [None]:
bike <- read.csv("https://raw.githubusercontent.com/jgendron/com.packtpub.intro.r.bi/master/Chapter2-DataCleaning/data/Ch2_raw_bikeshare_data.csv", stringsAsFactors = FALSE)


## 1 Summarizing your data for inspection

We will start by taking a first look at the data.

In [None]:
str(bike)


Two clear problems can be noted already. The `datetime` column is stored as character strings and not in a proper date-time format in R. Moreover, the `humidity` variable is also character, while it looks like it should be numeric or integer.

In [None]:
dim(bike)
head(bike)
tail(bike)


## 2 Finding and fixing flawed data

We will now try to find and fix errors in the data. Note that this is detective work, and there is not always one answer or one method that finds and solves all errors.

### Missing values

Missing values are a common problem occurring in most data sets. Sometimes a specific value is used to indicate that the value is missing. In R this is represented by the value `NA`. In the above `tail` call you can see such a case in the last column. First, let us see how to count the number of `NA` values in a data frame:

In [None]:
table(is.na(bike))


This shows us that our data frame `bike` contains 554 `NA` values, while 225373 values are not missing. The following code can help us see which variables contain `NA`:

In [None]:
library(stringr)
str_detect(bike, "NA")


This function returns `TRUE` or `FALSE` for each column in our data frame. The result shows that only the last column contains `NA` values. Another way to confirm this is:

In [None]:
table(is.na(bike$sources))


Here we can see that there are as many `NA` values in the `sources` column as there are in the entire data frame. We will fix these missing values later.
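
As a complementary check, the missing values can also be counted per column with base R. The following is only a sketch on a small made-up data frame (the name `toy` and its values are assumptions for illustration, not part of the bike data):

```r
# Toy data frame, made up for illustration; only `sources` has missing values
toy <- data.frame(
  humidity = c("81", "80", "x61"),
  sources  = c("Twitter", NA, NA),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Names of the columns that contain at least one NA
cols_with_na <- names(toy)[sapply(toy, anyNA)]
cols_with_na
```

Applied to `bike`, the same two calls should point to the `sources` column only, in line with the result above.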

Dealing with missing values is a huge topic on its own! We will also return to it next time.

### Erroneous values

We now return to the issue with the `humidity` attribute. We will start by searching for characters in the column:

In [None]:
# Keep only the values that contain a letter (no space inside the character class)
bad_data <- str_subset(bike$humidity, "[a-zA-Z]")
bad_data


From this we can see that the value `x61` appears somewhere in the column `humidity`. This is clearly an error. There is not always a clear answer or solution, but in this case it seems like someone just mistyped `61` as `x61`. Let us find the location of this error:

In [None]:
location <- str_detect(bike$humidity, bad_data)
bike[location, ]


Note that the `str_detect` function gives us a vector of `TRUE` and `FALSE`, which is only `TRUE` for the row that has the error. We can use this vector to subset the `bike` data frame and see the row with the error.

We can now replace this error in the following way and inspect that we fixed the error:

In [None]:
bike$humidity <- str_replace_all(bike$humidity, "x61", "61")
bike[location, ]


## 3 Converting inputs to data types suitable for analysis

We will now convert the columns that have the wrong data type to a proper one. There is not always one right data type to put your data in. It depends on what you want to do with it later. Sometimes it is preferred to have the values of a column as character strings, and other times it is preferred to have them as factors. However, in most cases it makes sense to turn dates into a proper date format instead of character strings.

We will start by turning the `humidity` column into a numeric column, as we have now fixed the issue with the non-numeric value:

In [None]:
bike$humidity <- as.numeric(bike$humidity)
#bike <- mutate(bike, humidity = as.numeric(humidity))
str(bike)
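
To see why the erroneous value had to be fixed before this conversion, here is a small sketch on made-up values (the vector `raw` is an assumption for illustration). `as.numeric` turns any string it cannot parse into `NA` and issues a warning, so the faulty observation would have been silently lost:

```r
# Made-up humidity values, one of them mistyped
raw <- c("81", "80", "x61")

# Converting directly turns the mistyped value into NA (with a warning)
converted <- suppressWarnings(as.numeric(raw))
converted

# Removing the stray "x" first keeps the observation
fixed <- as.numeric(sub("^x", "", raw))
fixed
```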


Next we will look at the `holiday` and `workingday` variables which are numeric, but it is more natural to have them as factors. Here is how to fix that:

In [None]:
bike$holiday <- factor(bike$holiday, levels = c(0, 1), labels = c("no", "yes"))
bike$workingday <- factor(bike$workingday, levels = c(0, 1), labels = c("no", "yes"))
str(bike)


Note that this code turned `0` into "no" and `1` into "yes". (Underneath, R represents "no" as a 1 and "yes" as a 2, which makes it look like we flipped yes and no, but we did not.) To make this decision we of course need to know that it is correct. That is, we need to know something about the data set, namely that `0` represents "no" and that `1` represents "yes". Often this kind of information can be found in the *data dictionary*, if the data set has one.
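
The difference between the labels and the underlying integer codes can be inspected directly. This is a minimal sketch with a made-up vector (`holiday_raw` is an assumption for illustration):

```r
# Made-up 0/1 indicator values
holiday_raw <- c(0, 1, 0, 0)
holiday <- factor(holiday_raw, levels = c(0, 1), labels = c("no", "yes"))

levels(holiday)        # the labels, in level order
as.integer(holiday)    # internal codes: position of each value's level
as.character(holiday)  # the labels that each value carries
```

The internal codes are just positions in the level vector, so the original `0` ends up with code 1 and label "no", and the original `1` ends up with code 2 and label "yes".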

In a similar manner we can turn the `season` and `weather` columns into factors, which also seems to be the right thing to do in this case:

In [None]:
bike$season <- factor(bike$season, levels = c(1, 2, 3, 4), labels = c("spring", "summer", "fall", "winter"), ordered = TRUE)
bike$weather <- factor(bike$weather, levels = c(1, 2, 3, 4), labels = c("clr_part_cloud", "mist_cloudy", "lt_rain_snow", "hvy_rain_snow"), ordered = TRUE)
str(bike)



**NOTE:** It is not always a good idea to turn character strings into factors. In fact, I would advise against doing it at this stage. You can always do it later if you realize it will make something easier for you, or if it is required by other functions you want to use, such as a particular function for training a machine learning model.

We will now finally fix the date format issue with the `datetime` column. To do this we first need to understand what format the `datetime` column is in. The `str` output above seems to indicate that it is in the format "mm/dd/yyyy hh:mm". The "lubridate" package is a very nice package for working with dates and times. It even has a function for this particular format, namely `mdy_hm`. The name hopefully gives away why it is useful in our case! So let us use it to transform the `datetime` column into a proper format:

In [None]:
library(lubridate)
bike$datetime <- mdy_hm(bike$datetime)
str(bike)


Note that the `mdy_hm` function turned the `datetime` column into something called "POSIXct", which is a common date-time format in R. (You can google it if you want. We will not go into any more detail for now.)
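
One practical benefit of a proper date-time type is that its components can be extracted and compared. The sketch below uses two made-up timestamps (the vector `stamps` is an assumption for illustration) written in the month/day/year hour:minute layout discussed above:

```r
library(lubridate)

# Made-up timestamps in month/day/year hour:minute layout
stamps <- c("1/1/2011 0:00", "12/31/2012 23:00")
parsed <- mdy_hm(stamps)

class(parsed)   # includes "POSIXct"
hour(parsed)    # hour of day for each timestamp
year(parsed)    # calendar year for each timestamp
diff(parsed)    # time elapsed between the two timestamps
```

None of these operations would work on the original character strings.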

## 4 Adapting string variables to a standard

Finally, we will look at adapting string variables to a standard. The issue here is that string values are sometimes hard to work with and may contain more information than needed. In such cases, with some manipulation we can turn the variable into a factor variable with fewer values that are still meaningful to us. This is the case for the `sources` column, which we will now take a closer look at. First we look at all the unique values this column takes:

In [None]:
unique(bike$sources)


There is some obvious cleaning we can do here! First of all, there are two values for Twitter which should be merged into one, and there are three values for ad campaign that should probably also be merged. Moreover, we might want to replace `NA` by "unknown" in this case. The "stringr" package can again help us, and we can solve these issues in the following way:

In [None]:
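# Lower-case everything and strip surrounding whitespace so that variants collapse
bike$sources <- tolower(bike$sources)
bike$sources <- str_trim(bike$sources)
unique(bike$sources)
# Replace missing values by the explicit label "unknown"
na_loc <- is.na(bike$sources)
bike$sources[na_loc] <- "unknown"
unique(bike$sources)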


This is much better, but we might also want to group all webpages into one category, that is, all sources that start with "www." We can do this using the DataCombine package in the following way:

In [None]:
library(DataCombine)
# Escape the dots so they match a literal "." rather than any character
web_sites <- "(www\\.[a-z]*\\.[a-z]*)"
current <- unique(str_subset(bike$sources, web_sites))
# Map every web address found to the single category "web"
replacement <- rep("web", length(current))
replacements <- data.frame(from = current, to = replacement, stringsAsFactors = FALSE)
bike <- FindReplace(data = bike, Var = "sources", replacements, from = "from", to = "to", exact = FALSE)
bike$sources <- as.factor(bike$sources)
unique(bike$sources)
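
The same grouping can also be sketched with stringr alone, without a lookup table. This is an alternative to the DataCombine approach above, not the book's method, and the vector below is made up for illustration:

```r
library(stringr)

# Made-up source values for illustration
sources <- c("www.google.com", "twitter", "www.bing.com", "unknown")

# Flag every value that starts with a literal "www." and overwrite it with "web"
is_web <- str_detect(sources, "^www\\.")
sources[is_web] <- "web"

sources <- as.factor(sources)
levels(sources)
```

This version replaces the whole value whenever it starts with "www.", which is a slightly different rule from matching the full address pattern, so it is worth checking the unique values afterwards.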


We now have a nice and clean dataset in the right format, which we will use for further analysis in the later lectures. Have a look at it:

In [None]:
str(bike)
bike


# Exercise

In this exercise, we will load in another dataset from Google Analytics, clean it and reshape it. 

First, load the sheet "Dataset2" from the dataset "Webanalytics_data_example2.xlsx" and save it in a variable called `gadata`. This is website data about users for the first 6 months of 2019, divided by age group.

*Hint: Think about which package and data loading function will be useful here.*

In [None]:
library(readxl)
gadata <- read_excel("./Data/Webanalytics_data_example2.xlsx", sheet = "Dataset2")


If you managed to load in the data it probably looks like this:
![gadata](gadata1.png)
It looks a bit strange, doesn't it? It certainly does not look like tidy data. We want to turn it into tidy data that looks like this:
![gadata2](gadata2.png)

Doing this cleaning will require several steps, so we will go through them one by one.

The first thing we will do is to remove the second column, titled "Total", as it is just a column with totals, which is not really an observation or an attribute according to the concept of tidy data. So remove this "Total" column.

*Hint: You can use the `select` function. Try to figure out how `-` can be used in connection with the `select` function.*

In [None]:
library(dplyr)
# Assign the result back; otherwise the column is only dropped in the printed output
gadata <- gadata %>%
    select(-starts_with("Total"))

str(gadata)


The second thing we will do is to rename the first column from "...1" to "Age". So go ahead and do this.

*Hint: Use the `rename` function from the dplyr package.*

The next thing we will do is to remove the first row as it just contains the word "Age" or "Users". (Note that what looks like the first row is not a row, but names of columns.) 

*Hint: This can be done in several ways. Try either to subset the data with the "[ , ]", or use the `filter` function.*

Next, we will remove the last row as it contains totals as well.

*Hint: You can use the same technique as in the previous question.*

Now your data frame should look like this:
![gadata3](gadata3.png)
We will turn all the columns except the `Age` column into one column with the `gather` function, so that our data will look like this:
![gadata4](gadata4.png)

Go ahead and do this!

*Hint: Use the `gather` function. Call the key column "Date" and the value column "Users". There are different options for selecting every column except the `Age` column in the last arguments, but try out `starts_with`.*
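
To illustrate what `gather` does without giving away the solution, here is a sketch on a tiny made-up data frame (the name `toy` and its values are assumptions, and the columns are excluded with `-Age` rather than `starts_with`):

```r
library(tidyr)

# Made-up wide data: one row per age group, one column per numbered day
toy <- data.frame(
  Age = c("18-24", "25-34"),
  `0` = c(10, 20),
  `1` = c(11, 21),
  check.names = FALSE
)

# Stack every column except Age into key/value pairs
long <- gather(toy, key = "Date", value = "Users", -Age)
long
```

The old column names end up in the key column `Date` and the cell values in `Users`, giving one row per combination of age group and day. Note that `gather` is superseded by `pivot_longer` in newer versions of tidyr, but it still works.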

We will now spread out the data again such that the age groups are columns. 

*Hint: Use the `spread` function with the column `Age` as key and the column `Users` as value.*

We will now turn the numbers in the "Date" column into actual dates. Before this, we will arrange the "Date" column such that the lowest date is first. (The dates are numbered and represent days since "2019-01-01".)

*Hint: Use the `arrange` function.*

We will now put in the dates instead of the numbers. To do this we will first generate the relevant dates. This can be done with the following command:

In [None]:
dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-06-30"), by = "day")
dates

We can now put them in instead of the numbers in the `Date` column. Do this.

Your data should now look like this:
![gadata5](gadata5.png)

Now we will gather the data again to get it into a tidy format.

*Hint: Use the `gather` function, "Age" as key and "Users" as value. To select the age group columns you can use `"18-24":"65+"`*

Finally, we will make sure that the `Age` column is a factor and that the `Users` column is numeric. So go ahead and make this transformation.

*Hint: Use the functions `as.factor` and `as.numeric`.*

If you managed to get the data in the right format, the code below should give you a nice plot:

In [None]:
library(ggplot2)
ggplot(data = gadata) + geom_line(mapping = aes(x = Date, y = Users, color = Age))