# Assignment

In this assignment you will apply various Data Science skills and techniques that you have learned as part of the previous courses.

You will assume the role of a Data Scientist who has recently joined an AI-powered weather data analytics company. You will be presented with a challenge that requires data collection, analysis, basic hypothesis testing, visualization, modelling, and dashboard building on real-world datasets.

You will undertake the following tasks:
- Collecting and understanding data from multiple sources
- Performing data wrangling
- Performing exploratory data analysis and visualization
- Modelling the data with linear regression using Tidymodels
- Building an interactive dashboard using R Shiny (optional)

The project will culminate with a presentation of your data analysis report, with an executive summary for the various stakeholders in the organization. You will be assessed on both your work at the various stages of the data analysis process and the final deliverable.

This project is a great opportunity to showcase your Data Science skills, and demonstrate your proficiency to potential employers.


# Project Scenario

Imagine that you have just been hired by an AI-powered weather data analytics company as a data scientist.

Your first project is to analyze how weather would affect bike-sharing demand in urban areas. To complete this project, you need to first collect and process related weather and bike-sharing demand data from various sources, perform exploratory data analysis on the data, and build predictive models to predict bike-sharing demand. You will combine your results and connect them to a live dashboard displaying an interactive map and associated visualization of the current weather and the estimated bike demand.

The last assignment is creating an insightful and informative slideshow and presenting it to your peers.


# Understanding the source data

Rental bikes are available in many cities around the globe. It is important for each of these cities to provide a reliable supply of rental bikes to optimize availability and accessibility to the public at all times. Also important is minimizing the cost of these programs, in part by minimizing the number of bikes supplied to meet the demand. Thus, to help optimize the supply, it would be helpful to be able to predict the number of bikes required each hour of the day, based on current conditions such as the weather. The Seoul Bike Sharing Demand Data Set was designed for this purpose. It contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), and the number of bikes rented per hour and date.

You will use this dataset to build a linear regression model of the number of bikes rented each hour, based on the weather.

### Attribute Information

- Date - day/month/year (for example, 01/12/2017)
- Rented Bike Count - count of bikes rented at each hour
- Hour - hour of the day
- Temperature - temperature in Celsius
- Humidity - unit is %
- Windspeed - unit is m/s
- Visibility - unit is 10 m
- Dew point temperature - Celsius
- Solar radiation - MJ/m2
- Rainfall - mm
- Snowfall - cm
- Seasons - Winter, Spring, Summer, Autumn
- Holiday - Holiday/No holiday
- Functional Day - NoFunc (non-functional hours), Fun (functional hours)


### Relevant Paper and Citation Request:
- Sathishkumar V E, Jangwoo Park, and Yongyun Cho. Using data mining techniques for bike sharing demand prediction in metropolitan city. Computer Communications, Vol. 153, pp. 353-366, March 2020.
- Sathishkumar V E and Yongyun Cho. A rule-based model for Seoul Bike sharing demand prediction using weather data. European Journal of Remote Sensing, pp. 1-18, Feb 2020.
- V E, Sathishkumar (2020), “Seoul Bike Sharing Demand Prediction”, Mendeley Data, V2, doi: 10.17632/zbdtzxcxvg.2



# Next Steps: Data Collection

You can collect data in several ways, such as:
- Calling the OpenWeather APIs
- Web scraping a Global Bike-Sharing Systems Wiki page

In this study, we will simply download some aggregated datasets from cloud storage.

In [1]:
# Download several datasets

# Download raw_bike_sharing_systems.csv
# url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_bike_sharing_systems.csv"
# download.file(url, destfile = "raw_bike_sharing_systems.csv")

# Download raw_cities_weather_forecast.csv
# url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_cities_weather_forecast.csv"
# download.file(url, destfile = "raw_cities_weather_forecast.csv")

# Download raw_worldcities.csv
# url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_worldcities.csv"
# download.file(url, destfile = "raw_worldcities.csv")

# Download raw_seoul_bike_sharing.csv
url <- "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0321EN-SkillsNetwork/labs/datasets/raw_seoul_bike_sharing.csv"
download.file(url, destfile = "raw_seoul_bike_sharing.csv")


#### To improve dataset readability for both humans and computer systems, we first need to standardize the column names of the datasets above using the following naming convention:

- Column names need to be UPPERCASE
- The word separator needs to be an underscore, such as in COLUMN_NAME

You can use the following dataset list and the names() function to get and set each of their column names, and convert them according to our defined naming convention.
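
The naming convention can also be captured in a small helper. The following is a minimal sketch in base R and is not part of the original lab: the function name `standardize_names` and the toy data frame are hypothetical. It treats dots as separators too, on the assumption that the file was read with `read.csv()`, whose default `check.names = TRUE` turns spaces in headers into dots.

```r
# Hypothetical helper: convert column names to UPPERCASE with underscore separators
standardize_names <- function(df) {
  new_names <- toupper(names(df))
  # Collapse runs of spaces or dots into a single underscore
  new_names <- gsub("[ .]+", "_", new_names)
  names(df) <- new_names
  df
}

# Toy data frame with non-standard column names (illustrative values only)
demo <- data.frame(`Rented Bike Count` = c(254, 204),
                   `Wind.speed` = c(2.2, 0.8),
                   Hour = c(0, 1),
                   check.names = FALSE)

standardized <- standardize_names(demo)
names(standardized)

# Check the convention: only uppercase letters, digits, and underscores
follows_convention <- all(grepl("^[A-Z0-9_]+$", names(standardized)))
follows_convention
```

The final `grepl()` check can be reused on any dataset after renaming to confirm that every column name follows the convention.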

### Q1- Standardize the column names of the new dataset according to the naming convention

In [2]:
# install.packages("stringr")
library(stringr)

# Read dataset raw_seoul_bike_sharing.csv
dataset <- read.csv("raw_seoul_bike_sharing.csv")

# Convert all column names to uppercase
names(dataset) <- toupper(names(dataset))

# Replace any white space separators by underscores, using the str_replace_all function
names(dataset) <- str_replace_all(names(dataset), " ", "_")

# Save the dataset, overwriting the raw file with the standardized column names
write.csv(dataset, "raw_seoul_bike_sharing.csv", row.names = FALSE)



### Q2- Read the resulting dataset back and check whether its column names follow the naming convention

In [3]:
# Read the dataset back and print a summary to check whether the column names were correctly converted
dataset <- read.csv("raw_seoul_bike_sharing.csv")
summary(dataset)

     DATE           RENTED_BIKE_COUNT      HOUR        TEMPERATURE    
 Length:8760        Min.   :   2.0    Min.   : 0.00   Min.   :-17.80  
 Class :character   1st Qu.: 214.0    1st Qu.: 5.75   1st Qu.:  3.40  
 Mode  :character   Median : 542.0    Median :11.50   Median : 13.70  
                    Mean   : 729.2    Mean   :11.50   Mean   : 12.87  
                    3rd Qu.:1084.0    3rd Qu.:17.25   3rd Qu.: 22.50  
                    Max.   :3556.0    Max.   :23.00   Max.   : 39.40  
                    NA's   :295                       NA's   :11      
    HUMIDITY       WIND_SPEED      VISIBILITY   DEW_POINT_TEMPERATURE
 Min.   : 0.00   Min.   :0.000   Min.   :  27   Min.   :-30.600      
 1st Qu.:42.00   1st Qu.:0.900   1st Qu.: 940   1st Qu.: -4.700      
 Median :57.00   Median :1.500   Median :1698   Median :  5.100      
 Mean   :58.23   Mean   :1.725   Mean   :1437   Mean   :  4.074      
 3rd Qu.:74.00   3rd Qu.:2.300   3rd Qu.:2000   3rd Qu.: 14.800      
 Max.   :98.

In [4]:
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




# Data Wrangling with dplyr

For this dataset, you will be asked to use tidyverse to perform the following data wrangling tasks:

- TASK 1: Detect and handle missing values
- TASK 2: Create indicator (dummy) variables for categorical variables
- TASK 3: Normalize data

In [5]:
library(tidyverse)
bike_sharing_df <- read.csv("raw_seoul_bike_sharing.csv")


summary(bike_sharing_df)
dim(bike_sharing_df)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract()   masks magrittr::extract()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


     DATE           RENTED_BIKE_COUNT      HOUR        TEMPERATURE    
 Length:8760        Min.   :   2.0    Min.   : 0.00   Min.   :-17.80  
 Class :character   1st Qu.: 214.0    1st Qu.: 5.75   1st Qu.:  3.40  
 Mode  :character   Median : 542.0    Median :11.50   Median : 13.70  
                    Mean   : 729.2    Mean   :11.50   Mean   : 12.87  
                    3rd Qu.:1084.0    3rd Qu.:17.25   3rd Qu.: 22.50  
                    Max.   :3556.0    Max.   :23.00   Max.   : 39.40  
                    NA's   :295                       NA's   :11      
    HUMIDITY       WIND_SPEED      VISIBILITY   DEW_POINT_TEMPERATURE
 Min.   : 0.00   Min.   :0.000   Min.   :  27   Min.   :-30.600      
 1st Qu.:42.00   1st Qu.:0.900   1st Qu.: 940   1st Qu.: -4.700      
 Median :57.00   Median :1.500   Median :1698   Median :  5.100      
 Mean   :58.23   Mean   :1.725   Mean   :1437   Mean   :  4.074      
 3rd Qu.:74.00   3rd Qu.:2.300   3rd Qu.:2000   3rd Qu.: 14.800      
 Max.   :98.

#### From the summary, we can observe that:

Columns RENTED_BIKE_COUNT, TEMPERATURE, HUMIDITY, WIND_SPEED, VISIBILITY, DEW_POINT_TEMPERATURE, SOLAR_RADIATION, RAINFALL, SNOWFALL are numerical variables/columns and require normalization. Moreover, RENTED_BIKE_COUNT and TEMPERATURE have some missing values (NA's) that need to be handled properly.

SEASONS, HOLIDAY, FUNCTIONING_DAY are categorical variables which need to be converted into indicator columns or dummy variables. Also, HOUR is read as a numerical variable but it is in fact a categorical variable with levels ranging from 0 to 23.

Now that you have some basic ideas about how to process this bike-sharing demand dataset, let's start working on it!

# TASK: Detect and handle missing values

The RENTED_BIKE_COUNT column has 295 missing values, and TEMPERATURE has 11 missing values. Those missing values could be caused by values not being recorded, or by malfunctioning bike-sharing systems or weather sensor networks. In any case, the identified missing values have to be properly handled.

Let's first handle missing values in RENTED_BIKE_COUNT column:

RENTED_BIKE_COUNT is the response (dependent) variable, i.e., the variable we want to predict later from the other predictor (independent) variables. We normally cannot allow missing values in the response variable, so rows where it is missing must be either dropped or imputed properly.

We can see that RENTED_BIKE_COUNT only has about 3% missing values (295 / 8760). As such, you can safely drop any rows whose RENTED_BIKE_COUNT has missing values.
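
Missing-value counts and shares like these can be computed directly rather than read off the summary. The following is a minimal sketch on a toy data frame with made-up values (the names `toy`, `na_counts`, and `na_percent` are illustrative, not part of the lab); the same two calls would apply to `bike_sharing_df`.

```r
# Toy data frame with missing values (illustrative values only)
toy <- data.frame(
  RENTED_BIKE_COUNT = c(254, NA, 173, 107),
  TEMPERATURE = c(-5.2, -5.5, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Share of missing values in each column, as a percentage
na_percent <- 100 * colMeans(is.na(toy))

na_counts
na_percent

# The same arithmetic for the figures quoted above: 295 missing out of 8760 rows
share_missing <- round(100 * 295 / 8760, 1)
share_missing
```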

### Q3: Drop rows with missing values in the RENTED_BIKE_COUNT column

In [6]:
# Drop rows with `RENTED_BIKE_COUNT` column == NA
bike_sharing_df <- bike_sharing_df %>%
    filter(!is.na(RENTED_BIKE_COUNT))

In [7]:
# Print the dataset dimension again after those rows are dropped
dim(bike_sharing_df)

Now that you have handled missing values in the RENTED_BIKE_COUNT variable, let's continue processing missing values for the TEMPERATURE column.

Unlike the RENTED_BIKE_COUNT variable, TEMPERATURE is not a response variable. However, it is still an important predictor variable: as you could imagine, there may be a positive correlation between TEMPERATURE and RENTED_BIKE_COUNT. For example, in winter with lower temperatures, people may not want to ride a bike, while in summer with nicer weather, they are more likely to rent a bike.

How do we handle missing values for TEMPERATURE? We could simply remove the rows but it's better to impute them because TEMPERATURE should be relatively easy and reliable to estimate statistically.

Let's first take a look at the missing values in the TEMPERATURE column.

In [8]:
# Select the rows whose TEMPERATURE value is missing
missing_temp <- bike_sharing_df %>%
    filter(is.na(TEMPERATURE))

missing_temp

|DATE|RENTED_BIKE_COUNT|HOUR|TEMPERATURE|HUMIDITY|WIND_SPEED|VISIBILITY|DEW_POINT_TEMPERATURE|SOLAR_RADIATION|RAINFALL|SNOWFALL|SEASONS|HOLIDAY|FUNCTIONING_DAY|
|----|-----------------|----|-----------|--------|----------|----------|---------------------|---------------|--------|--------|-------|-------|---------------|
|`<chr>`|`<int>`|`<int>`|`<dbl>`|`<int>`|`<dbl>`|`<int>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|`<chr>`|`<chr>`|`<chr>`|
|07/06/2018|3221|18|NA|57|2.7|1217|16.4|0.96|0.0|0|Summer|No Holiday|Yes|
|12/06/2018|1246|14|NA|45|2.2|1961|12.7|1.39|0.0|0|Summer|No Holiday|Yes|
|13/06/2018|2664|17|NA|57|3.3|919|16.4|0.87|0.0|0|Summer|No Holiday|Yes|
|17/06/2018|2330|17|NA|58|3.3|865|16.7|0.66|0.0|0|Summer|No Holiday|Yes|
|20/06/2018|2741|19|NA|61|2.7|1236|17.5|0.6|0.0|0|Summer|No Holiday|Yes|
|30/06/2018|1144|13|NA|87|1.7|390|23.2|0.71|3.5|0|Summer|No Holiday|Yes|
|05/07/2018|827|10|NA|75|1.1|1028|20.8|1.22|0.0|0|Summer|No Holiday|Yes|
|11/07/2018|634|9|NA|96|0.6|450|24.9|0.41|0.0|0|Summer|No Holiday|Yes|
|12/07/2018|593|6|NA|93|1.1|852|24.3|0.01|0.0|0|Summer|No Holiday|Yes|
|21/07/2018|347|4|NA|77|1.2|1203|21.2|0.0|0.0|0|Summer|No Holiday|Yes|


It seems that all of the missing values for TEMPERATURE are found in SEASONS == Summer, so it is reasonable to impute those missing values with the summer average temperature.
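
One way to check this observation and apply the idea is sketched below on a toy data frame (illustrative values, not the real dataset). It uses base R's `ave()` to compute a per-season mean. Imputing each row by its own season's mean is an assumption of this sketch that generalizes the summer-only case described above.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  SEASONS = c("Summer", "Summer", "Summer", "Winter", "Winter"),
  TEMPERATURE = c(20, NA, 30, -5, -3)
)

# Which seasons contain missing temperatures?
seasons_with_na <- unique(toy$SEASONS[is.na(toy$TEMPERATURE)])
seasons_with_na

# Mean temperature of each row's season, ignoring missing values
season_mean <- ave(toy$TEMPERATURE, toy$SEASONS,
                   FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their season's mean
toy$TEMPERATURE <- ifelse(is.na(toy$TEMPERATURE), season_mean, toy$TEMPERATURE)
toy$TEMPERATURE
```

In the toy data the single missing summer value is replaced by the mean of the two observed summer values, and the winter rows are left unchanged.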

### Q4-Impute missing values for the TEMPERATURE column using its mean value.

In [9]:
# Calculate the summer average temperature from the rows kept after Q3
summer_mean <- mean(bike_sharing_df$TEMPERATURE[bike_sharing_df$SEASONS == "Summer"], na.rm = TRUE)
summer_mean

In [10]:
# Impute missing values for TEMPERATURE column with summer average temperature
bike_sharing_df <- bike_sharing_df %>%
    mutate(TEMPERATURE=ifelse(is.na(TEMPERATURE) & SEASONS == 'Summer', summer_mean, TEMPERATURE))

In [11]:
# Print the summary of the dataset again to confirm that no column has missing values
summary(bike_sharing_df)

     DATE           RENTED_BIKE_COUNT      HOUR        TEMPERATURE    
 Length:8465        Min.   :   2.0    Min.   : 0.00   Min.   :-17.80  
 Class :character   1st Qu.: 214.0    1st Qu.: 6.00   1st Qu.:  3.00  
 Mode  :character   Median : 542.0    Median :12.00   Median : 13.50  
                    Mean   : 729.2    Mean   :11.51   Mean   : 12.77  
                    3rd Qu.:1084.0    3rd Qu.:18.00   3rd Qu.: 22.70  
                    Max.   :3556.0    Max.   :23.00   Max.   : 39.40  
    HUMIDITY       WIND_SPEED      VISIBILITY   DEW_POINT_TEMPERATURE
 Min.   : 0.00   Min.   :0.000   Min.   :  27   Min.   :-30.600      
 1st Qu.:42.00   1st Qu.:0.900   1st Qu.: 935   1st Qu.: -5.100      
 Median :57.00   Median :1.500   Median :1690   Median :  4.700      
 Mean   :58.15   Mean   :1.726   Mean   :1434   Mean   :  3.945      
 3rd Qu.:74.00   3rd Qu.:2.300   3rd Qu.:2000   3rd Qu.: 15.200      
 Max.   :98.00   Max.   :7.400   Max.   :2000   Max.   : 27.200      
 SOLAR_RADIAT

In [12]:
# Save the dataset as `seoul_bike_sharing.csv`
write.csv(bike_sharing_df, "seoul_bike_sharing.csv", row.names=FALSE)

# TASK: Create indicator (dummy) variables for categorical variables

Regression models cannot process categorical variables directly, so we need to convert them into indicator variables.

In the bike-sharing demand dataset, SEASONS, HOLIDAY, FUNCTIONING_DAY are categorical variables. Also, HOUR is read as a numerical variable but it is in fact a categorical variable with levels ranging from 0 to 23.

### Q5- Convert HOUR column from numeric into character first:

In [13]:
# Using mutate() function to convert HOUR column into character type
bike_sharing_df <- bike_sharing_df %>%
    mutate(HOUR=as.character(HOUR))

`SEASONS`, `HOLIDAY`, `FUNCTIONING_DAY`,  `HOUR` are all character columns now and are ready to be converted into indicator variables.

For example, `SEASONS` has four categorical values: `Spring`, `Summer`, `Autumn`, `Winter`. We thus need to create four indicator/dummy variables `Spring`, `Summer`, `Autumn`, and `Winter` which only have the value 0 or 1.

So, given a data entry with the value `Spring` in the `SEASONS` column, the values for the four new columns `Spring`, `Summer`, `Autumn`, and `Winter` will be set to 1 for `Spring` and 0 for the others:

|Spring|Summer|Autumn|Winter|
|----- |------|------|------|
|     1|     0|     0|     0| 
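
The mapping in the table can be reproduced with base R's `model.matrix()`. This is a minimal sketch on a toy data frame (the names `toy` and `dummies` are illustrative); the `- 1` in the formula removes the intercept so that every season gets its own column.

```r
# Toy data frame with one categorical column (illustrative values only)
toy <- data.frame(SEASONS = c("Spring", "Summer", "Autumn", "Winter", "Spring"))

# One 0/1 column per season; character columns are treated as factors,
# so the columns come out in alphabetical order of the levels
dummies <- model.matrix(~ SEASONS - 1, data = toy)

# Strip the variable-name prefix that model.matrix() adds to each column
colnames(dummies) <- sub("^SEASONS", "", colnames(dummies))

# First row is a Spring entry: Spring is 1, all other seasons are 0
dummies[1, ]

# Every row has exactly one indicator set to 1
row_totals <- rowSums(dummies)
row_totals
```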


### Q6-Convert SEASONS, HOLIDAY and HOUR columns into indicator columns.

Note that if FUNCTIONING_DAY only contains one categorical value after missing values removal, then you don't need to convert it to an indicator column.
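
A quick way to test that condition is to count the distinct values in each categorical column. The sketch below uses a toy data frame with made-up values (the names `toy` and `n_levels` are illustrative); on the real data the same `sapply()` call could be applied to the categorical columns of `bike_sharing_df`.

```r
# Toy data frame (illustrative values only)
toy <- data.frame(
  FUNCTIONING_DAY = c("Yes", "Yes", "Yes"),
  HOLIDAY = c("Holiday", "No Holiday", "No Holiday")
)

# Number of distinct values in each column
n_levels <- sapply(toy, function(column) length(unique(column)))
n_levels

# Only columns with more than one distinct value need indicator columns
columns_to_convert <- names(n_levels)[n_levels > 1]
columns_to_convert
```

A column with a single distinct value carries no information for a regression model, which is why it can be skipped.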

In [14]:
# Convert SEASONS, HOLIDAY, and HOUR columns into indicator columns
# (`- 1` removes the intercept so that every level gets its own 0/1 column)
season_dummies <- model.matrix(~ SEASONS - 1, data = bike_sharing_df)
holiday_dummies <- model.matrix(~ HOLIDAY - 1, data = bike_sharing_df)
hour_dummies <- model.matrix(~ HOUR - 1, data = bike_sharing_df)

# Append the indicator columns, then drop the original categorical columns
bike_sharing_df <- cbind(bike_sharing_df, season_dummies, holiday_dummies, hour_dummies)
bike_sharing_df <- subset(bike_sharing_df, select = -c(HOLIDAY, SEASONS, HOUR))

In [15]:
# Print the first rows of the dataset to make sure the indicator columns are created properly
head(bike_sharing_df)

| |DATE|RENTED_BIKE_COUNT|TEMPERATURE|HUMIDITY|WIND_SPEED|VISIBILITY|DEW_POINT_TEMPERATURE|SOLAR_RADIATION|RAINFALL|SNOWFALL|⋯|HOUR21|HOUR22|HOUR23|HOUR3|HOUR4|HOUR5|HOUR6|HOUR7|HOUR8|HOUR9|
|-|----|-----------------|-----------|--------|----------|----------|---------------------|---------------|--------|--------|-|------|------|------|-----|-----|-----|-----|-----|-----|-----|
| |`<chr>`|`<int>`|`<dbl>`|`<int>`|`<dbl>`|`<int>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|⋯|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|
|1|01/12/2017|254|-5.2|37|2.2|2000|-17.6|0|0|0|⋯|0|0|0|0|0|0|0|0|0|0|
|2|01/12/2017|204|-5.5|38|0.8|2000|-17.6|0|0|0|⋯|0|0|0|0|0|0|0|0|0|0|
|3|01/12/2017|173|-6.0|39|1.0|2000|-17.7|0|0|0|⋯|0|0|0|0|0|0|0|0|0|0|
|4|01/12/2017|107|-6.2|40|0.9|2000|-17.6|0|0|0|⋯|0|0|0|1|0|0|0|0|0|0|
|5|01/12/2017|78|-6.0|36|2.3|2000|-18.6|0|0|0|⋯|0|0|0|0|1|0|0|0|0|0|
|6|01/12/2017|100|-6.4|37|1.5|2000|-18.7|0|0|0|⋯|0|0|0|0|0|1|0|0|0|0|


In [16]:
# Save the dataset as `seoul_bike_sharing_converted.csv`
write_csv(bike_sharing_df, "seoul_bike_sharing_converted.csv")

# TASK: Normalize data

Columns RENTED_BIKE_COUNT, TEMPERATURE, HUMIDITY, WIND_SPEED, VISIBILITY, DEW_POINT_TEMPERATURE, SOLAR_RADIATION, RAINFALL, SNOWFALL are numerical variables/columns with different units and value ranges. Columns with large values may adversely influence (bias) the predictive models and degrade model accuracy. Thus, we need to normalize these numeric columns to transform them into a similar range.


In this project, you are asked to use Min-max normalization: 

**Min-max** rescales each value in a column by first subtracting the minimum value of the column from each value, and then dividing the result by the difference between the maximum and minimum values of the column. The column is thus rescaled such that the minimum becomes 0 and the maximum becomes 1.

$$x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$$
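
The formula can be tried on a small vector. The sketch below is illustrative only: the helper name `min_max` and the sample values are assumptions, not part of the lab. It also contrasts the result with base R's `scale()`, which performs z-score standardization (mean 0, standard deviation 1) rather than min-max normalization.

```r
# Hypothetical helper implementing (x - min) / (max - min)
# Note: a constant column would give a zero denominator and NaN values
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(2, 4, 10)

# Min-max normalization: smallest value maps to 0, largest to 1
normalized <- min_max(x)
normalized

# z-score standardization for comparison: centred on 0, not bounded by [0, 1]
standardized <- as.numeric(scale(x))
range(standardized)
```

For the sample vector, 4 lies a quarter of the way from 2 to 10, so it maps to 0.25.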



#### Q7: Apply min-max normalization on RENTED_BIKE_COUNT, TEMPERATURE, HUMIDITY, WIND_SPEED, VISIBILITY, DEW_POINT_TEMPERATURE, SOLAR_RADIATION, RAINFALL, SNOWFALL

In [17]:
# Use the `mutate()` function to apply min-max normalization on columns 
# `RENTED_BIKE_COUNT`, `TEMPERATURE`, `HUMIDITY`, `WIND_SPEED`, `VISIBILITY`, `DEW_POINT_TEMPERATURE`, `SOLAR_RADIATION`, `RAINFALL`, `SNOWFALL`
dataframe <- read.csv("seoul_bike_sharing_converted.csv")

columns_to_normalize <- c("RENTED_BIKE_COUNT", "TEMPERATURE", "HUMIDITY", "WIND_SPEED",
                          "VISIBILITY", "DEW_POINT_TEMPERATURE", "SOLAR_RADIATION",
                          "RAINFALL", "SNOWFALL")

# Min-max rescaling: (x - min) / (max - min)
# Note: scale() would instead standardize to z-scores (mean 0, sd 1)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

dataframe <- dataframe %>%
  mutate(across(all_of(columns_to_normalize), min_max))

In [18]:
# Print the summary of the dataset again to make sure the numeric columns range between 0 and 1
summary(dataframe)

The summary should report a `Min.` of 0 and a `Max.` of 1 for each of the nine normalized columns.

In [21]:
# Save the dataset as `seoul_bike_sharing_converted_normalized.csv`
write.csv(dataframe, "seoul_bike_sharing_converted_normalized.csv", row.names = FALSE)

# Continue