# Assignment

In this assignment you will apply various Data Science skills and techniques that you have learned as part of the previous courses.

You will assume the role of a Data Scientist who has recently joined an AI-powered weather data analytic company and be presented with a challenge that requires data collection, analysis, basic hypothesis testing, visualization, modelling, and dashboard to be performed on real-world datasets.

You will undertake the tasks of
- Collecting and understanding data from multiple sources
- Performing data wrangling
- Performing exploratory data analysis and visualization
- Performing modelling the data with linear regressions using Tidymodels
- Building an interactive dashboard using R Shiny (option)

The project will culminate with a presentation of your data analysis report, with an executive summary for the various stakeholders in the organization. You will be assessed on both your work for the various stages in the data analysis process, as well as the final deliverable.

This project is a great opportunity to showcase your Data Science skills, and demonstrate your proficiency to potential employers.


# Project Scenario

Imagine that you have just been hired by an AI-powered weather data analytics company as a data scientist.

Your first project is to analyze how weather would affect bike-sharing demand in urban areas. To complete this project, you need to first collect and process related weather and bike-sharing demand data from various sources, perform exploratory data analysis on the data, and build predictive models to predict bike-sharing demand. You will combine your results and connect them to a live dashboard displaying an interactive map and associated visualization of the current weather and the estimated bike demand.

The last assignment is creating an insightful and informative slideshow and presenting it to your peers.


# Understanding the source data

Rental bikes are available in many cities around the globe. It is important for each of these cities to provide a reliable supply of rental bikes to optimize availablity and accessibility to the public at all times. Also important is minimizing the cost of these programs, in part by minimizing the number of bikes supplied in order to meet the demand. Thus, to help optimize the supply it would be helpful to be able to predict the number of bikes required each hour of the day, based on currrent conditions such as the weather.The Seoul Bike Sharing Demand Data Set was designed for this purpose. It contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), and the number of bikes rented per hour and date.

You will use this dataset to build a linear regression model of the number of bikes rented each hour, based on the weather.
Attribute Information

- Date : year-month-day
- Rented Bike count - Count of bikes rented at each hour
- Hour - Hour of he day
- Temperature-Temperature in Celsius
- Humidity - unit is %
- Windspeed - unit is m/s
- Visibility - unit 10m
- Dew point temperature - Celsius
- Solar radiation - MJ/m2
- Rainfall - mm
- Snowfall - cm
- Seasons - Winter, Spring, Summer, Autumn
- Holiday - Holiday/No holiday
- Functional Day - NoFunc (Non Functional Hours), Fun(Functional hours)


### Relevant Paper and Citation Request:
- Sathishkumar V E, Jangwoo Park, and Yongyun Cho.
Using data mining techniques for bike sharing demand prediction in metropolitan city. Computer Communications, Vol.153, pp.353-366, March, 2020
- Sathishkumar V E and Yongyun Cho.
A rule-based model for Seoul Bike sharing demand prediction using weather data European Journal of Remote Sensing, pp. 1-18, Feb, 2020
- V E, Sathishkumar (2020), “Seoul Bike Sharing Demand Prediction”, Mendeley Data, V2, doi: 10.17632/zbdtzxcxvg.2



# Next Steps: Data Collection

You can collect data using different ways such as:
- OpenWeather APIs Calls
- Web scrape a Global Bike-Sharing Systems Wiki Page

In this study, we will straightforward  download some aggregated datasets from cloud storage

In [1]:
# Download several datasets

# Download raw_bike_sharing_systems.csv
# url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_bike_sharing_systems.csv"
# download.file(url, destfile = "raw_bike_sharing_systems.csv")

# Download raw_cities_weather_forecast.csv
# url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_cities_weather_forecast.csv"
# download.file(url, destfile = "raw_cities_weather_forecast.csv")

# Download raw_worldcities.csv
# url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_worldcities.csv"
# download.file(url, destfile = "raw_worldcities.csv")

# Download raw_seoul_bike_sharing.csv
url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_seoul_bike_sharing.csv"
download.file(url, destfile = "raw_seoul_bike_sharing.csv")


#### To improve dataset readbility by both human and computer systems, we first need to standardize the column names of the datasets above using the following naming convention:

- Column names need to be UPPERCASE
- The word separator needs to be an underscore, such as in COLUMN_NAME

You can use the following dataset list and the names() function to get and set each of their column names, and convert them according to our defined naming convention.

In [3]:
seoul <- read.csv("raw_seoul_bike_sharing.csv")
head(seoul)

Unnamed: 0_level_0,Date,RENTED_BIKE_COUNT,Hour,TEMPERATURE,HUMIDITY,WIND_SPEED,Visibility,DEW_POINT_TEMPERATURE,SOLAR_RADIATION,RAINFALL,Snowfall,SEASONS,HOLIDAY,FUNCTIONING_DAY
Unnamed: 0_level_1,<chr>,<int>,<int>,<dbl>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
1,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0,0,0,Winter,No Holiday,Yes
2,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0,0,0,Winter,No Holiday,Yes
3,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0,0,0,Winter,No Holiday,Yes
4,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0,0,0,Winter,No Holiday,Yes
5,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0,0,0,Winter,No Holiday,Yes
6,01/12/2017,100,5,-6.4,37,1.5,2000,-18.7,0,0,0,Winter,No Holiday,Yes


### Q1- Standardize the column names again for the new datasets and convert their column names

In [4]:
install.packages("stringr")
library(stringr)
# Read dataset
bike_sharing_df <- read.csv("raw_seoul_bike_sharing.csv")
# Standardized its columns:
bike_sharing_df[,2:11] <- scale(bike_sharing_df[,2:11])

# Convert all column names to uppercase

names(bike_sharing_df) <- toupper(names(bike_sharing_df))
# Replace any white space separators by underscores, using the str_replace_all function
names(bike_sharing_df) <- str_replace_all(names(bike_sharing_df),' ', '_')
# Save the dataset
write.csv(bike_sharing_df, "raw_seoul_bike_sharing.csv", row.names=FALSE)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [5]:
# bike_sharing_df <- read.csv("raw_seoul_bike_sharing.csv")
# names(bike_sharing_df) <- toupper(names(bike_sharing_df))
# names(bike_sharing_df) <- str_replace_all(names(bike_sharing_df),' ', '_')
head(bike_sharing_df)

Unnamed: 0_level_0,DATE,RENTED_BIKE_COUNT,HOUR,TEMPERATURE,HUMIDITY,WIND_SPEED,VISIBILITY,DEW_POINT_TEMPERATURE,SOLAR_RADIATION,RAINFALL,SNOWFALL,SEASONS,HOLIDAY,FUNCTIONING_DAY
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
1,01/12/2017,-0.7397153,-1.6612299,-1.512657,-1.0424234,0.4584496,0.9258185,-1.65951,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
2,01/12/2017,-0.8175544,-1.5167752,-1.537774,-0.9933133,-0.8925105,0.9258185,-1.65951,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
3,01/12/2017,-0.8658146,-1.3723204,-1.579637,-0.9442032,-0.6995162,0.9258185,-1.667167,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
4,01/12/2017,-0.9685621,-1.2278656,-1.596382,-0.8950931,-0.7960134,0.9258185,-1.65951,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
5,01/12/2017,-1.0137088,-1.0834108,-1.579637,-1.0915335,0.5549468,0.9258185,-1.736077,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
6,01/12/2017,-0.9794596,-0.9389561,-1.613127,-1.0424234,-0.2170305,0.9258185,-1.743734,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes


### Q2- Read the resulting dataset back and check whether their column names follow the naming convention

In [6]:
# Print a summary  data set to check whether the column names were correctly converted
print(summary(bike_sharing_df))

     DATE           RENTED_BIKE_COUNT      HOUR          TEMPERATURE      
 Length:8760        Min.   :-1.1320   Min.   :-1.6612   Min.   :-2.56760  
 Class :character   1st Qu.:-0.8020   1st Qu.:-0.8306   1st Qu.:-0.79262  
 Mode  :character   Median :-0.2914   Median : 0.0000   Median : 0.06975  
                    Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
                    3rd Qu.: 0.5524   3rd Qu.: 0.8306   3rd Qu.: 0.80654  
                    Max.   : 4.4008   Max.   : 1.6612   Max.   : 2.22149  
                    NA's   :295                         NA's   :11        
    HUMIDITY          WIND_SPEED        VISIBILITY      DEW_POINT_TEMPERATURE
 Min.   :-2.85950   Min.   :-1.6645   Min.   :-2.3177   Min.   :-2.65489     
 1st Qu.:-0.79687   1st Qu.:-0.7960   1st Qu.:-0.8167   1st Qu.:-0.67179     
 Median :-0.06022   Median :-0.2170   Median : 0.4294   Median : 0.07857     
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000     
 3rd Qu.: 

In [7]:
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




# Data Wrangling with dplyr

For this dataset, you will be asked to use tidyverse to perform the following data wrangling tasks:

- TASK 1: Detect and handle missing values
- TASK 2: Create indicator (dummy) variables for categorical variables
- TASK 3: Normalize data

In [8]:
library(tidyverse)
bike_sharing_df <- read.csv("raw_seoul_bike_sharing.csv")


summary(bike_sharing_df)
dim(bike_sharing_df)

── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mforcats  [39m 1.0.0     [32m✔[39m [34mreadr    [39m 2.1.5
[32m✔[39m [34mggplot2  [39m 3.4.4     [32m✔[39m [34mtibble   [39m 3.2.1
[32m✔[39m [34mlubridate[39m 1.9.3     [32m✔[39m [34mtidyr    [39m 1.3.1
[32m✔[39m [34mpurrr    [39m 1.0.2     
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mpurrr[39m::[32m%||%()[39m      masks [34mbase[39m::%||%()
[31m✖[39m [34mtidyr[39m::[32mextract()[39m   masks [34mmagrittr[39m::extract()
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m    masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m       masks [34mstats[39m::lag()
[31m✖[39m [34mpurrr[39m::[32mset_names()[39m masks [34mmagrittr[39m::set_names()
[36mℹ[39m Use the conflicted package ([3m[34m<http://conflicted.r-lib.org/>[39m[23m) to force all conflicts to b

     DATE           RENTED_BIKE_COUNT      HOUR          TEMPERATURE      
 Length:8760        Min.   :-1.1320   Min.   :-1.6612   Min.   :-2.56760  
 Class :character   1st Qu.:-0.8020   1st Qu.:-0.8306   1st Qu.:-0.79262  
 Mode  :character   Median :-0.2914   Median : 0.0000   Median : 0.06975  
                    Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
                    3rd Qu.: 0.5524   3rd Qu.: 0.8306   3rd Qu.: 0.80654  
                    Max.   : 4.4008   Max.   : 1.6612   Max.   : 2.22149  
                    NA's   :295                         NA's   :11        
    HUMIDITY          WIND_SPEED        VISIBILITY      DEW_POINT_TEMPERATURE
 Min.   :-2.85950   Min.   :-1.6645   Min.   :-2.3177   Min.   :-2.65489     
 1st Qu.:-0.79687   1st Qu.:-0.7960   1st Qu.:-0.8167   1st Qu.:-0.67179     
 Median :-0.06022   Median :-0.2170   Median : 0.4294   Median : 0.07857     
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000     
 3rd Qu.: 

#### From the summary, we can observe that:

Columns RENTED_BIKE_COUNT, TEMPERATURE, HUMIDITY, WIND_SPEED, VISIBILITY, DEW_POINT_TEMPERATURE, SOLAR_RADIATION, RAINFALL, SNOWFALL are numerical variables/columns and require normalization. Moreover, RENTED_BIKE_COUNT and TEMPERATURE have some missing values (NA's) that need to be handled properly.

SEASONS, HOLIDAY, FUNCTIONING_DAY are categorical variables which need to be converted into indicator columns or dummy variables. Also, HOUR is read as a numerical variable but it is in fact a categorical variable with levels ranging from 0 to 23.

Now that you have some basic ideas about how to process this bike-sharing demand dataset, let's start working on it!

# TASK: Detect and handle missing values

The RENTED_BIKE_COUNT column has about 295 missing values, and TEMPERATURE has about 11 missing values. Those missing values could be caused by not being recorded, or from malfunctioning bike-sharing systems or weather sensor networks. In any cases, the identified missing values have to be properly handled.

Let's first handle missing values in RENTED_BIKE_COUNT column:

Considering RENTED_BIKE_COUNT is the response variable/dependent variable, i.e., we want to predict the RENTED_BIKE_COUNT using other predictor/independent variables later, and we normally can not allow missing values for the response variable, so missing values for response variable must be either dropped or imputed properly.

We can see that RENTED_BIKE_COUNT only has about 3% missing values (295 / 8760). As such, you can safely drop any rows whose RENTED_BIKE_COUNT has missing values.

### Q3: Drop rows with missing values in the RENTED_BIKE_COUNT column

In [9]:
# Drop rows with `RENTED_BIKE_COUNT` column == NA

bike_sharing_df <- na.omit(bike_sharing_df, cols="RENTED_BIKE_COUNT")

In [10]:
# Print the dataset dimension again after those rows are dropped
dim(bike_sharing_df)

Now that you have handled missing values in the RENTED_BIKE_COUNT variable, let's continue processing missing values for the TEMPERATURE column.

Unlike the RENTED_BIKE_COUNT variable, TEMPERATURE is not a response variable. However, it is still an important predictor variable - as you could imagine, there may be a positve correlation between TEMPERATURE and RENTED_BIKE_COUNT. For example, in winter time with lower temperatures, people may not want to ride a bike, while in summer with nicer weather, they are more likely to rent a bike.

How do we handle missing values for TEMPERATURE? We could simply remove the rows but it's better to impute them because TEMPERATURE should be relatively easy and reliable to estimate statistically.

Let's first take a look at the missing values in the TEMPERATURE column.

In [11]:
bike_sharing_df %>%  filter(is.na(TEMPERATURE))

DATE,RENTED_BIKE_COUNT,HOUR,TEMPERATURE,HUMIDITY,WIND_SPEED,VISIBILITY,DEW_POINT_TEMPERATURE,SOLAR_RADIATION,RAINFALL,SNOWFALL,SEASONS,HOLIDAY,FUNCTIONING_DAY
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>


It seems that all of the missing values for TEMPERATURE are found in SEASONS == Summer, so it is reasonable to impute those missing values with the summer average temperature.

In [12]:
head(bike_sharing_df)

Unnamed: 0_level_0,DATE,RENTED_BIKE_COUNT,HOUR,TEMPERATURE,HUMIDITY,WIND_SPEED,VISIBILITY,DEW_POINT_TEMPERATURE,SOLAR_RADIATION,RAINFALL,SNOWFALL,SEASONS,HOLIDAY,FUNCTIONING_DAY
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
1,01/12/2017,-0.7397153,-1.6612299,-1.512657,-1.0424234,0.4584496,0.9258185,-1.65951,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
2,01/12/2017,-0.8175544,-1.5167752,-1.537774,-0.9933133,-0.8925105,0.9258185,-1.65951,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
3,01/12/2017,-0.8658146,-1.3723204,-1.579637,-0.9442032,-0.6995162,0.9258185,-1.667167,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
4,01/12/2017,-0.9685621,-1.2278656,-1.596382,-0.8950931,-0.7960134,0.9258185,-1.65951,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
5,01/12/2017,-1.0137088,-1.0834108,-1.579637,-1.0915335,0.5549468,0.9258185,-1.736077,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes
6,01/12/2017,-0.9794596,-0.9389561,-1.613127,-1.0424234,-0.2170305,0.9258185,-1.743734,-0.6550943,-0.1317924,-0.1718813,Winter,No Holiday,Yes


### Q4-Impute missing values for the TEMPERATURE column using its mean value.

In [13]:
# Calculate the summer average temperature
bike_sharing_df<-bike_sharing_df%>%mutate(TEMPERATURE=as.numeric(TEMPERATURE)) #because it is not taking dbl as numeric type
avg_temp <- bike_sharing_df %>%
  filter(SEASONS == "Summer") %>%
  summarise(mean_temperature=mean(TEMPERATURE, na.rm=TRUE))
avg_temp<-unlist(avg_temp)
avg_temp

In [14]:
library(tidyverse)

In [15]:
# Impute missing values for TEMPERATURE column with summer average temperature
bike_sharing_df<-bike_sharing_df%>%replace_na(list(TEMPERATURE=avg_temp))
bike_sharing_df

Unnamed: 0_level_0,DATE,RENTED_BIKE_COUNT,HOUR,TEMPERATURE,HUMIDITY,WIND_SPEED,VISIBILITY,DEW_POINT_TEMPERATURE,SOLAR_RADIATION,RAINFALL,SNOWFALL,SEASONS,HOLIDAY,FUNCTIONING_DAY
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
1,01/12/2017,-0.7397153,-1.66122995,-1.5126567,-1.04242338,0.45844961,0.92581850,-1.6595099,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
2,01/12/2017,-0.8175544,-1.51677517,-1.5377743,-0.99331329,-0.89251055,0.92581850,-1.6595099,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
3,01/12/2017,-0.8658146,-1.37232039,-1.5796370,-0.94420320,-0.69951624,0.92581850,-1.6671667,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
4,01/12/2017,-0.9685621,-1.22786561,-1.5963821,-0.89509310,-0.79601339,0.92581850,-1.6595099,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
5,01/12/2017,-1.0137088,-1.08341083,-1.5796370,-1.09153347,0.55494676,0.92581850,-1.7360775,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
6,01/12/2017,-0.9794596,-0.93895606,-1.6131272,-1.04242338,-0.21703047,0.92581850,-1.7437342,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
7,01/12/2017,-0.8533603,-0.79450128,-1.6298722,-1.14064357,-0.41002478,0.92581850,-1.8049882,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
8,01/12/2017,-0.4190185,-0.65004650,-1.6968525,-0.99331329,-0.79601339,0.92581850,-1.7896747,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
9,01/12/2017,0.3126685,-0.50559172,-1.7135976,-1.04242338,-0.60301909,0.92581850,-1.8279585,-0.64358348,-0.1317924,-0.1718813,Winter,No Holiday,Yes
10,01/12/2017,-0.3723150,-0.36113694,-1.6214997,-1.53352431,-1.18200201,0.80745560,-2.0270340,-0.39034498,-0.1317924,-0.1718813,Winter,No Holiday,Yes


In [16]:
# Print the summary of the dataset again to make sure no missing values in all columns
summary(bike_sharing_df)

     DATE           RENTED_BIKE_COUNT        HOUR           
 Length:8454        Min.   :-1.132024   Min.   :-1.6612299  
 Class :character   1st Qu.:-0.801987   1st Qu.:-0.7945013  
 Mode  :character   Median :-0.293698   Median : 0.0722274  
                    Mean   :-0.001658   Mean   : 0.0006151  
                    3rd Qu.: 0.550856   3rd Qu.: 0.9389561  
                    Max.   : 4.400775   Max.   : 1.6612299  
  TEMPERATURE           HUMIDITY           WIND_SPEED       
 Min.   :-2.567596   Min.   :-2.859497   Min.   :-1.664488  
 1st Qu.:-0.826109   1st Qu.:-0.796873   1st Qu.:-0.796013  
 Median : 0.044635   Median :-0.060221   Median :-0.217030  
 Mean   :-0.009425   Mean   :-0.004704   Mean   : 0.000824  
 3rd Qu.: 0.814908   3rd Qu.: 0.774650   3rd Qu.: 0.554947  
 Max.   : 2.221494   Max.   : 1.953292   Max.   : 5.476302  
   VISIBILITY        DEW_POINT_TEMPERATURE SOLAR_RADIATION   
 Min.   :-2.317654   Min.   :-2.65489      Min.   :-0.65509  
 1st Qu.:-0.823322   1

In [17]:
# Save the dataset as `seoul_bike_sharing.csv`
write.csv(bike_sharing_df,"seoul_bike_sharing.csv")

# TASK: Create indicator (dummy) variables for categorical variables¶

Regression models can not process categorical variables directly, thus we need to convert them into indicator variables.

In the bike-sharing demand dataset, SEASONS, HOLIDAY, FUNCTIONING_DAY are categorical variables. Also, HOUR is read as a numerical variable but it is in fact a categorical variable with levels ranged from 0 to 23.

### Q5- Convert HOUR column from numeric into character first:

In [18]:
# Using mutate() function to convert HOUR column into character type
bike_sharing_df<-bike_sharing_df%>%mutate(HOUR=as.character(HOUR))
bike_sharing_df

Unnamed: 0_level_0,DATE,RENTED_BIKE_COUNT,HOUR,TEMPERATURE,HUMIDITY,WIND_SPEED,VISIBILITY,DEW_POINT_TEMPERATURE,SOLAR_RADIATION,RAINFALL,SNOWFALL,SEASONS,HOLIDAY,FUNCTIONING_DAY
Unnamed: 0_level_1,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>
1,01/12/2017,-0.7397153,-1.66122994540396,-1.5126567,-1.04242338,0.45844961,0.92581850,-1.6595099,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
2,01/12/2017,-0.8175544,-1.51677516754275,-1.5377743,-0.99331329,-0.89251055,0.92581850,-1.6595099,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
3,01/12/2017,-0.8658146,-1.37232038968153,-1.5796370,-0.94420320,-0.69951624,0.92581850,-1.6671667,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
4,01/12/2017,-0.9685621,-1.22786561182032,-1.5963821,-0.89509310,-0.79601339,0.92581850,-1.6595099,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
5,01/12/2017,-1.0137088,-1.08341083395911,-1.5796370,-1.09153347,0.55494676,0.92581850,-1.7360775,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
6,01/12/2017,-0.9794596,-0.938956056097891,-1.6131272,-1.04242338,-0.21703047,0.92581850,-1.7437342,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
7,01/12/2017,-0.8533603,-0.794501278236677,-1.6298722,-1.14064357,-0.41002478,0.92581850,-1.8049882,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
8,01/12/2017,-0.4190185,-0.650046500375463,-1.6968525,-0.99331329,-0.79601339,0.92581850,-1.7896747,-0.65509432,-0.1317924,-0.1718813,Winter,No Holiday,Yes
9,01/12/2017,0.3126685,-0.505591722514249,-1.7135976,-1.04242338,-0.60301909,0.92581850,-1.8279585,-0.64358348,-0.1317924,-0.1718813,Winter,No Holiday,Yes
10,01/12/2017,-0.3723150,-0.361136944653035,-1.6214997,-1.53352431,-1.18200201,0.80745560,-2.0270340,-0.39034498,-0.1317924,-0.1718813,Winter,No Holiday,Yes


`SEASONS`, `HOLIDAY`, `FUNCTIONING_DAY`,  `HOUR` are all character columns now and are ready to be converted into indicator variables.

For example, `SEASONS` has four categorical values: `Spring`, `Summer`, `Autumn`, `Winter`. We thus need to create four indicator/dummy variables `Spring`, `Summer`, `Autumn`, and `Winter` which only have the value 0 or 1.

So, given a data entry with the value `Spring` in the `SEASONS` column, the values for the four new columns `Spring`, `Summer`, `Autumn`, and `Winter` will be set to 1 for `Spring` and 0 for the others:

|Spring|Summer|Autumn|Winter|
|----- |------|------|------|
|     1|     0|     0|     0|


### Q6-Convert SEASONS, HOLIDAY and HOUR columns into indicator columns.

Note that if FUNCTIONING_DAY only contains one categorical value after missing values removal, then you don't need to convert it to an indicator column.

In [19]:
# Convert SEASONS, HOLIDAY, FUNCTIONING_DAY, and HOUR columns into indicator columns.
bike_sharing_df<-bike_sharing_df%>%mutate(dummy=1)%>%spread(key=SEASONS, value=dummy, fill=0)
bike_sharing_df<-bike_sharing_df%>%mutate(dummy=1)%>%spread(key=HOLIDAY, value=dummy, fill=0)
bike_sharing_df<-bike_sharing_df%>%mutate(dummy=1)%>%spread(key=HOUR, value=dummy, fill=0)

In [20]:
# Print the dataset summary again to make sure the indicator columns are created properly
summary(bike_sharing_df)

     DATE           RENTED_BIKE_COUNT    TEMPERATURE           HUMIDITY        
 Length:8454        Min.   :-1.132024   Min.   :-2.567596   Min.   :-2.859497  
 Class :character   1st Qu.:-0.801987   1st Qu.:-0.826109   1st Qu.:-0.796873  
 Mode  :character   Median :-0.293698   Median : 0.044635   Median :-0.060221  
                    Mean   :-0.001658   Mean   :-0.009425   Mean   :-0.004704  
                    3rd Qu.: 0.550856   3rd Qu.: 0.814908   3rd Qu.: 0.774650  
                    Max.   : 4.400775   Max.   : 2.221494   Max.   : 1.953292  
   WIND_SPEED          VISIBILITY        DEW_POINT_TEMPERATURE
 Min.   :-1.664488   Min.   :-2.317654   Min.   :-2.65489     
 1st Qu.:-0.796013   1st Qu.:-0.823322   1st Qu.:-0.70242     
 Median :-0.217030   Median : 0.416200   Median : 0.04795     
 Mean   : 0.000824   Mean   :-0.004122   Mean   :-0.01142     
 3rd Qu.: 0.554947   3rd Qu.: 0.925818   3rd Qu.: 0.84425     
 Max.   : 5.476302   Max.   : 0.925818   Max.   : 1.77071     

In [21]:
# Save the dataset as `seoul_bike_sharing_converted.csv`
# write_csv(dataframe, "seoul_bike_sharing_converted.csv")

# TASK: Normalize data

Columns RENTED_BIKE_COUNT, TEMPERATURE, HUMIDITY, WIND_SPEED, VISIBILITY, DEW_POINT_TEMPERATURE, SOLAR_RADIATION, RAINFALL, SNOWFALL are numerical variables/columns with different value units and range. Columns with large values may adversely influence (bias) the predictive models and degrade model accuracy. Thus, we need to perform normalization on these numeric columns to transfer them into a similar range.


In this project, you are asked to use Min-max normalization:

**Min-max** rescales each value in a column by first subtracting the minimum value of the column from each value, and then divides the result by the difference between the maximum and minimum values of the column. So the column gets re-scaled such that the minimum becomes 0 and the maximum becomes 1.

$$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$$



#### Q7: Apply min-max normalization on RENTED_BIKE_COUNT, TEMPERATURE, HUMIDITY, WIND_SPEED, VISIBILITY, DEW_POINT_TEMPERATURE, SOLAR_RADIATION, RAINFALL, SNOWFALL

In [22]:
# Use the `mutate()` function to apply min-max normalization on columns
# `RENTED_BIKE_COUNT`, `TEMPERATURE`, `HUMIDITY`, `WIND_SPEED`, `VISIBILITY`, `DEW_POINT_TEMPERATURE`, `SOLAR_RADIATION`, `RAINFALL`, `SNOWFALL`
bike_sharing_df <- bike_sharing_df %>%
  mutate(RENTED_BIKE_COUNT= (RENTED_BIKE_COUNT-min(RENTED_BIKE_COUNT)) / (max(RENTED_BIKE_COUNT) - min(RENTED_BIKE_COUNT)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(RENTED_BIKE_COUNT= (TEMPERATURE-min(TEMPERATURE)) / (max(TEMPERATURE) - min(TEMPERATURE)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(HUMIDITY= (HUMIDITY-min(HUMIDITY)) / (max(HUMIDITY) - min(HUMIDITY)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(WIND_SPEED= (WIND_SPEED-min(WIND_SPEED)) / (max(WIND_SPEED) - min(WIND_SPEED)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(VISIBILITY= (VISIBILITY-min(VISIBILITY)) / (max(VISIBILITY) - min(VISIBILITY)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(DEW_POINT_TEMPERATURE= (DEW_POINT_TEMPERATURE-min(DEW_POINT_TEMPERATURE)) / (max(DEW_POINT_TEMPERATURE) - min(DEW_POINT_TEMPERATURE)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(SOLAR_RADIATION= (SOLAR_RADIATION-min(SOLAR_RADIATION)) / (max(SOLAR_RADIATION) - min(SOLAR_RADIATION)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(RAINFALL= (RAINFALL-min(RAINFALL)) / (max(RAINFALL) - min(RAINFALL)))
bike_sharing_df <- bike_sharing_df %>%
  mutate(SNOWFALL= (SNOWFALL-min(SNOWFALL)) / (max(SNOWFALL) - min(SNOWFALL)))
bike_sharing_df

DATE,RENTED_BIKE_COUNT,TEMPERATURE,HUMIDITY,WIND_SPEED,VISIBILITY,DEW_POINT_TEMPERATURE,SOLAR_RADIATION,RAINFALL,SNOWFALL,⋯,0.361136944653035,0.505591722514249,0.650046500375463,0.794501278236677,0.938956056097891,1.08341083395911,1.22786561182032,1.37232038968153,1.51677516754275,1.66122994540396
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
01/12/2017,0.2202797,-1.5126567,0.3775510,0.29729730,1.0000000,0.2249135,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.2150350,-1.5377743,0.3877551,0.10810811,1.0000000,0.2249135,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.2062937,-1.5796370,0.3979592,0.13513514,1.0000000,0.2231834,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.2027972,-1.5963821,0.4081633,0.12162162,1.0000000,0.2249135,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.2062937,-1.5796370,0.3673469,0.31081081,1.0000000,0.2076125,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.1993007,-1.6131272,0.3775510,0.20270270,1.0000000,0.2058824,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.1958042,-1.6298722,0.3571429,0.17567568,1.0000000,0.1920415,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.1818182,-1.6968525,0.3877551,0.12162162,1.0000000,0.1955017,0.000000000,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.1783217,-1.7135976,0.3775510,0.14864865,1.0000000,0.1868512,0.002840909,0,0,⋯,0,0,0,0,0,0,0,0,0,0
01/12/2017,0.1975524,-1.6214997,0.2755102,0.06756757,0.9635073,0.1418685,0.065340909,0,0,⋯,0,0,0,0,0,0,0,0,0,0


In [23]:
# Print the summary of the dataset again to make sure the numeric columns range between 0 and 1
summary(bike_sharing_df)

     DATE           RENTED_BIKE_COUNT  TEMPERATURE           HUMIDITY     
 Length:8454        Min.   :0.0000    Min.   :-2.567596   Min.   :0.0000  
 Class :character   1st Qu.:0.3636    1st Qu.:-0.826109   1st Qu.:0.4286  
 Mode  :character   Median :0.5455    Median : 0.044635   Median :0.5816  
                    Mean   :0.5342    Mean   :-0.009425   Mean   :0.5932  
                    3rd Qu.:0.7063    3rd Qu.: 0.814908   3rd Qu.:0.7551  
                    Max.   :1.0000    Max.   : 2.221494   Max.   :1.0000  
   WIND_SPEED       VISIBILITY     DEW_POINT_TEMPERATURE SOLAR_RADIATION   
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000        Min.   :0.000000  
 1st Qu.:0.1216   1st Qu.:0.4607   1st Qu.:0.4412        1st Qu.:0.000000  
 Median :0.2027   Median :0.8429   Median :0.6107        Median :0.002841  
 Mean   :0.2332   Mean   :0.7133   Mean   :0.5973        Mean   :0.161306  
 3rd Qu.:0.3108   3rd Qu.:1.0000   3rd Qu.:0.7907        3rd Qu.:0.264205  
 Max.   :1.0000   M

In [25]:
# Save the dataset as `seoul_bike_sharing_converted_normalized.csv`
write_csv(bike_sharing_df, "seoul_bike_sharing_converted_normalized.csv")

# Continue