Data Mining [H02C6a] - Spring 2022

# Session 1: Exploratory Data Analysis with `pandas`

## Exercise 2: Bike Rental Usage Analysis

In this exercise, we explore the bike rental data set using visualisation techniques similar to those of exercise 1.

<img src = '../img/bikes.jpg' width = 25% align = right>

## Background

Bike sharing systems let users rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles.

There is great interest in these systems today because of their important role in traffic, environmental and health issues.


## The data

The data set you are about to explore is the two-year historical log (2011 and 2012) of the Capital Bikeshare system in Washington, D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data. The data has been aggregated on an hourly basis, and the corresponding weather and seasonal information has been added.

Attributes are as follows:
- **Datetime**: date and time, hourly
- **Season**: 1 - winter, 2 - spring, 3 - summer, 4 - autumn
- **Holiday**: whether the day is considered a holiday
- **Workingday**: whether the day is neither a weekend nor holiday
- **Weather**:
    1. Clear, Few clouds, Partly cloudy
    2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- **Temp**: temperature, Celsius
- **Atemp**: "feels like" temperature, Celsius
- **Humidity**: relative humidity
- **Windspeed**: wind speed
- **Casual**: number of non-registered user rentals initiated
- **Registered**: number of registered user rentals initiated
- **Count**: number of total rentals, *target variable*

Let's get started!

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Loading the data

In [None]:
data = pd.read_csv('../datasets/bikes_train.csv', parse_dates=['datetime'])
print(data.shape)
data.head()

In [None]:
print('The data types are: ')
data.dtypes

Note that `pandas` does not automatically recognize some features as nominal, since they are encoded as numbers (e.g., `season`, `weather`, etc.). You can fix this manually using the `.astype` method if needed.
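As a minimal sketch of such a conversion, the snippet below turns a numeric `season` column into a labelled categorical. It runs on a small made-up frame (`toy` and its values are illustrative, not taken from the data set); the code-to-label mapping follows the attribute list above.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1], "count": [16, 40, 32, 13, 1]})

# Map the integer codes to readable labels, then store the column as a categorical.
season_names = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}
toy["season"] = toy["season"].map(season_names).astype("category")

print(toy.dtypes)
print(toy["season"].value_counts())
```

Keeping the integer codes and calling `.astype("category")` directly would also work; mapping to labels first mainly makes plot legends easier to read.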

Briefly discuss the following questions:

<b><font color = 'red'>Question 2.1</font> Why is this data interesting? What can it be used for?</b>

[Your ideas here]

<b><font color = 'red'>Question 2.2</font> What may affect the bike rental process? What patterns do you expect to discover in the data?</b>

[Your ideas here]

## Data aggregation

The original dataset contains hourly rental counts, which can be too detailed. 

**<font color='red'>Question 2.3</font> Define a new DataFrame `daily_counts` that contains the total number of rentals (`casual`, `registered` and total `count`) for every day in the database, along with mean temperature, windspeed and humidity values for that day. Indicate the day of week and whether the day is a public holiday.**

Hint: use the <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby'>`DataFrame.groupby()`</a> and <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.aggregate.html'>`DataFrame.aggregate()`</a> methods.
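To illustrate the hint, here is a small sketch of grouping hourly rows by calendar day and applying a different aggregation to each column. The frame `toy` and its values are made up for the example, and only two columns are shown; the real exercise needs more columns and possibly other aggregation functions.

```python
import pandas as pd

# Four made-up hourly records spread over two days.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00",
        "2011-01-02 00:00", "2011-01-02 01:00",
    ]),
    "count": [16, 40, 17, 17],
    "temp": [9.84, 9.02, 18.86, 18.04],
})

# Group by the calendar date and aggregate each column with its own function:
# rentals are summed, temperature is averaged.
toy_daily = (
    toy.groupby(toy["datetime"].dt.date)
       .agg({"count": "sum", "temp": "mean"})
)
print(toy_daily)
```

The dictionary passed to `.agg` maps each column name to the function applied to it, so sums and means can be computed in a single pass.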

In [None]:
# Your code here

**Extract day of week, month and year as well.**
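One possible way to do this is the `.dt` accessor of a datetime column. The sketch below uses two made-up dates; the names `dates` and `parts` are illustrative only.

```python
import pandas as pd

# Two made-up dates; in the exercise these would come from the data set.
dates = pd.Series(pd.to_datetime(["2011-01-01", "2012-07-04"]))

parts = pd.DataFrame({
    "dayofweek": dates.dt.dayofweek,  # Monday = 0, ..., Sunday = 6
    "month": dates.dt.month,
    "year": dates.dt.year,
})
print(parts)
```

If the dates are stored in a `DatetimeIndex` rather than in a column, the same attributes are available directly on the index (for example `index.dayofweek`).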

In [None]:
# Your code here

## Data exploration

Explore the dataset and answer the following questions about the bike rental usage applying visualisation techniques similar to those of exercise 1.

Let's start with a general overview.

**<font color = 'red'>Question 2.4</font> Plot the evolution of daily system usage over two years.**

Hint: use the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html>`DataFrame.plot`</a> method
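As a rough sketch of how `DataFrame.plot` behaves, the snippet below draws one line per column against a datetime index. The frame `toy_daily` and its numbers are made up for illustration and do not come from the data set.

```python
import pandas as pd

# Made-up daily totals indexed by date.
toy_daily = pd.DataFrame(
    {"casual": [331, 131, 120], "registered": [654, 670, 1229]},
    index=pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
)

# One line per column; the datetime index becomes the x-axis.
ax = toy_daily.plot(figsize=(8, 3), title="Toy daily rentals")
ax.set_xlabel("date")
ax.set_ylabel("rentals")
```

`DataFrame.plot` returns a matplotlib `Axes` object, so labels and titles can be adjusted afterwards as shown.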

In [None]:
# Your code here

<b>Looking at the plot, answer the following questions:
* what is the general trend?
* who uses the system more in general: registered or casual users? Is it always so? 
</b>

[Your answer here]

**<font color = 'red'>Question 2.5</font> Plot the average number of daily bike rentals per month in 2011 and 2012.**

In [None]:
# Your code here

**Which month was the busiest in 2011? And in 2012?**

[Your answer here]

**<font color='red'>Question 2.6</font> Do casual and registered users have different bike rental patterns? If so, explain the difference and possible reasons for it. Make plots to justify your answer.**

[Your answer here]

In [None]:
# Your code here

In [None]:
# Your code here

**<font color='red'>Question 2.7</font>** What are, on average, the busiest hours (in terms of the total number of rentals)? Are they the same on working and non-working days? Can the same pattern be observed both in 2011 and 2012?

In [None]:
# Your code here

<b><font color='red'>Question 2.8</font> 

Define a new variable `daytime` as follows:
* 5am - 12pm -> `morning`
* 12pm - 5pm -> `afternoon`
* 5pm - 9pm -> `evening`
* 9pm - 5am -> `night`
</b>
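The mapping above can be sketched as a small function applied to the hour of each timestamp. The function name `to_daytime` and the sample timestamps are illustrative, and the sketch assumes each interval includes its start hour and excludes its end hour (so 5am counts as `morning` and 12pm as `afternoon`), which the definition does not state explicitly.

```python
import pandas as pd

def to_daytime(hour):
    """Map an hour of the day (0-23) to a daytime label.

    Assumption: intervals are closed on the left and open on the right.
    """
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"  # 21:00 to 04:59, wrapping around midnight

# Made-up timestamps sitting on the interval boundaries.
stamps = pd.Series(pd.to_datetime([
    "2011-01-01 04:00", "2011-01-01 05:00", "2011-01-01 12:00",
    "2011-01-01 17:00", "2011-01-01 21:00",
]))
labels = stamps.dt.hour.map(to_daytime)
print(labels.tolist())
```

A plain function keeps the wrap-around `night` interval easy to read; `pd.cut` on the hour is an alternative, but it needs the night interval split into two bins.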

In [None]:
# Your code here

<b>Are the following statements true or false? Make corresponding plots to justify your answer.</b>
<i>
* Casual users rent the highest number of bikes in the afternoon, both on working and non-working days.
* The smallest number of bikes is rented during night hours.
* Registered users rent more bikes on holidays than on working days.
</i>

In [None]:
# Your code here

<b><font color='red'>Question 2.9</font> Is the following statement true or false? Confirm with a plot.</b>

<i>The highest average number of daily bike rentals by casual users was recorded on summer days both in 2011 and 2012.</i>


In [None]:
# Your code here

<b><font color='red'>Question 2.10</font> 
<br>
Do weather conditions affect bike rental patterns of registered users? Casual users? 
<br>
Make plots to justify your answer and try to explain the reason for what you see.
</b>

[Your answer here]

In [None]:
# Your code here

<b><font color='red'>Question 2.11</font> 
<br>
Speaking of the weather, let's see what the weather is like in Washington. Plot the distribution of the temperature over the seasons. </b>

In [None]:
# Your code here

**Are the windiest days also the coldest ones? Make a plot to justify your answer.**

In [None]:
# Your code here

<b>Do such factors as temperature, humidity and windspeed seriously affect the bike rental process? 
<br>
Make plots to justify your answer.
</b>

[Your answer here]

In [None]:
# Your code here