<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src='OCR_logo3.png'></a>

# Predicting Demand for Washington D.C.'s Bike Share System


_Version 1.0_  
_Author(s): Jon Reifschneider, Duke University School of Engineering_

<img align="left" style="padding-top:10px;" src="CaBi-return2.jpg">  

## _About this teaching case_
**Level:** Beginner  
**Language:** Python  
**Libraries:** pandas, matplotlib, scikit-learn  
**Industry:** Transportation & Tourism

**Learning Topic(s):**  
- Data Manipulation
- Exploratory Data Analysis
- Feature Engineering & Feature Selection  

**Learning Objectives**  
- Understand how to create features for time-series data
- Learn to perform univariate feature selection to evaluate and reduce features
- Learn how to use encoding methods for categorical features
- Gain practice in merging, manipulating and visualizing data in pandas and matplotlib

**Pre-requisites**  
- Basic proficiency in Python and pandas

**Case Structure**  
This teaching case is structured to follow the ***modified CRISP-DM data science methodology*** used in Duke University's AI for Product Innovation graduate programs. 

**Datasets Used**  
Data used in this case is modified from the following original sources:  
Bike sharing data: Capital Bikeshare, available at https://www.capitalbikeshare.com/system-data  
Weather data: FreeMeteo weather compiled by University of Porto, available at https://www.kaggle.com/marklvl/bike-sharing-dataset

# Contents
[1: Business Understanding](#1)  
[2: Data Understanding](#2)  
[3: Data Preparation](#3)  
[4: Analysis / Modeling](#4)  
[5: Evaluation / Interpretation](#5)

# Step 1: Business Understanding <a class="anchor" id="1"></a>

You have just been hired as the first data scientist working for Capital Bikeshare, the organization which runs the Washington D.C. bike sharing system. The first major project they have asked you to work on is to build a model to predict demand for the shared bikes in the system for each hour of each day.  

Having an accurate understanding of the expected demand is critical to the successful operation of Capital Bikeshare.  If it underestimates demand and has too few bikes available, potential users cannot find a bike, become frustrated, and are less likely to use the system in the future.

In your initial discussions with your new colleagues, you determine that there are two main drivers of demand for bikes:  
1) Time  - the demand varies by day and by hour  
2) Weather - weather conditions cause fluctuations in demand

Our objective in this project is to maximize the accuracy of our prediction model by creating an optimal feature set from the data we have available.  The model you will use has already been set up for you in a separate script (a linear regression model) which you should not change.  Your job is to prepare the data and define an optimal set of features which maximizes the model performance. To evaluate the quality of our model we will use Mean Squared Error (MSE) as our metric.  
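As a small illustration of the metric, MSE is the average of the squared differences between actual and predicted values, so lower is better. The sketch below computes it by hand on made-up numbers; the function name and the toy values are illustrative only and are not taken from the provided model script.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up hourly demand values, for illustration only
actual = [10, 20, 30]
predicted = [12, 18, 33]

# Errors are -2, 2 and -3, so the squared errors are 4, 4 and 9
toy_mse = mean_squared_error(actual, predicted)
print(f"MSE: {toy_mse:.2f}")
```

Because the errors are squared, a few large misses raise MSE far more than many small ones.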

# Step 2: Data Understanding <a class="anchor" id="2"></a>

You have received a csv file of historical demand data from the past two years of operation. The dataset contains the following columns:
- dteday : date 
- hr : hour (0 to 23) 
- cnt: count of total rental bikes 

In [None]:
# Run this before any other code cell
# This downloads the csv data files into the same directory where you have saved this notebook

import urllib.request
from pathlib import Path
import os
path = Path()

# Dictionary of file names and download links
files = {'2011-2012_bikes.csv':'https://storage.googleapis.com/aipi_datasets/2011-2012_bikes.csv',
        '2011-2012_weather.csv': 'https://storage.googleapis.com/aipi_datasets/2011-2012_weather.csv'}

# Download each file
for key,value in files.items():
    filename = path/key
    url = value
    # If the file does not already exist in the directory, download it
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url,filename)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
from sklearn.decomposition import PCA

from model import run_model

import warnings
warnings.filterwarnings("ignore")

pd.options.display.float_format = '{:,.2f}'.format

### Gather data

- Read in the file '2011-2012_bikes.csv' and use the name 'data' for your dataframe
- Since it is a time series, convert the index to datetime (including day and time)
- Convert dteday to a numerical feature which stores the day of year
- Then run the next cell to train the model
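One possible way to approach the datetime steps is sketched below on a tiny hand-made dataframe rather than the real file. It assumes the columns are named as described above (dteday, hr, cnt); the toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the bikes file (values are made up)
toy = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-12-31"],
    "hr": [0, 1, 23],
    "cnt": [16, 40, 12],
})

# Combine the date and the hour into a single datetime index
toy.index = pd.DatetimeIndex(
    pd.to_datetime(toy["dteday"]) + pd.to_timedelta(toy["hr"], unit="h")
)

# Replace the date string with a numerical day-of-year feature (1 to 365/366)
toy["dteday"] = toy.index.dayofyear

print(toy)
```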

In [None]:
### BEGIN SOLUTION


### END SOLUTION

Let's run our model using the raw data to get a baseline of modeling performance. We will train our model on the data from January 1, 2011 - June 30, 2012.  We will then use our trained model to predict the bike demand for each hour in the period July 1, 2012 - December 31, 2012.
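A chronological split like this can be expressed with date-string slicing on a datetime index. The sketch below is only an assumption about how such a split might look; it is not the code inside the provided model script, and the toy values are invented.

```python
import pandas as pd

# Four made-up hourly rows that straddle the train/test boundary
idx = pd.Timestamp("2012-06-30 22:00") + pd.to_timedelta([0, 1, 2, 3], unit="h")
toy = pd.DataFrame({"cnt": [120, 80, 60, 40]}, index=idx)

# Slicing by date string is inclusive: the end label covers all of June 30
train = toy.loc[:"2012-06-30"]
test = toy.loc["2012-07-01":]

print(len(train), "training rows;", len(test), "test rows")
```

Splitting by time rather than at random keeps the model from being trained on observations that come after the period it is asked to predict.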

The function *run_model* in the code cell below accepts a pandas dataframe containing both the data and the target (must be named 'cnt').  The function will split the data as above, train a model, and calculate the Mean Squared Error of the predictions for the time period.

In [None]:
mse_score = run_model(data)
print('Mean Squared Error: {:.2f}'.format(mse_score))

This gives us a baseline MSE for the model trained on the raw data (lower is better; an MSE of 0 would mean perfect predictions).  Let's see if we can improve it.

### Add in weather data  
Let's now add in weather data from freemeteo.com (compiled by University of Porto).  The new features we will add are:
- weathersit : 
    - 1: Clear, Few clouds, Partly cloudy 
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog 
- temp : Temperature in Celsius
- atemp: Feeling temperature in Celsius
- hum: Humidity
- windspeed: Wind speed

Read in the weather data from '2011-2012_weather.csv'  
- Merge the weather data into your existing dataframe *data* using an 'inner' merge (you might have to adjust a column or index in the weather dataframe in order to merge it into your existing dataframe)
- We have one categorical variable in the weather data: 'weathersit'. Determine if/how you would like to encode it and do so
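The merge step could look something like the sketch below, which aligns two toy dataframes on a shared datetime index. The layout of the weather frame here (separate dteday and hr columns) is an assumption made for illustration; check the actual file before reusing the idea.

```python
import pandas as pd

# Toy demand data, already indexed by datetime (values are made up)
bikes = pd.DataFrame(
    {"cnt": [16, 40, 32]},
    index=pd.to_datetime(
        ["2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 02:00"]
    ),
)

# Toy weather data with an assumed layout: date and hour in separate columns
weather = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01"],
    "hr": [0, 1],
    "weathersit": [1, 2],
    "temp": [3.3, 2.3],
})

# Give the weather frame the same kind of datetime index as the demand data
weather.index = pd.DatetimeIndex(
    pd.to_datetime(weather["dteday"]) + pd.to_timedelta(weather["hr"], unit="h")
)
weather = weather.drop(columns=["dteday", "hr"])

# An inner merge keeps only the timestamps present in both frames
merged = bikes.merge(weather, how="inner", left_index=True, right_index=True)
print(merged)
```

Comparing `len(merged)` with the length of the original frame is a quick way to see how many rows the inner merge dropped.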

In [None]:
### BEGIN SOLUTION


### END SOLUTION

Let's run our model again to get a new baseline on the raw data including the weather data.

In [None]:
mse_score = run_model(data)
print('Mean Squared Error: {:.2f}'.format(mse_score))

# Step 3: Prepare Data <a class="anchor" id="3"></a>

As we have seen above, adding the weather features did not improve our MSE much.  This may be because the model treated every feature as a continuous numerical variable, even those that are really categorical.  For binary (0/1) features this does not matter, but for features that have more than two possible values it makes a difference.  We will encode the categorical features later in this step, after we have engineered and selected our features.

### Feature engineering
Let's create some additional features in our data which help the model better understand changing usage patterns over time.  Create some new features which you think will help explain the variance in demand over time.  Store each feature as a number (do not yet create any string-valued features).  For now, the model will treat them as numerical continuous features, so do not encode them.

Example: you might want to create a 'workingday' feature which stores a 0 or 1 depending on whether the day is a work day (i.e., neither a weekend nor a holiday).  

Some possible features you might consider: year, month, holiday (whether holiday or not - hint see imports cell for package to use), day of week, working day, etc.  

You can create these or create others, it is up to your discretion as the lead data scientist. Reminder: create the features as numerical, but do not yet encode them using any type of encoding.
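To make the idea concrete, here is a minimal sketch that derives a few calendar features from a datetime index, using the federal holiday calendar from the imports cell. The feature names and the three toy rows are illustrative choices, not a prescribed solution.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Three made-up observations: a Friday, a Saturday and Monday July 4, 2011
idx = pd.to_datetime(["2011-07-01 08:00", "2011-07-02 08:00", "2011-07-04 08:00"])
toy = pd.DataFrame({"cnt": [200, 90, 120]}, index=idx)

# Simple calendar features taken straight from the index
toy["year"] = toy.index.year
toy["month"] = toy.index.month
toy["dayofweek"] = toy.index.dayofweek  # Monday=0 ... Sunday=6

# Flag federal holidays by comparing each row's date with the calendar
holidays = USFederalHolidayCalendar().holidays(start="2011-01-01", end="2012-12-31")
toy["holiday"] = toy.index.normalize().isin(holidays).astype(int)

# A working day is a weekday that is not a holiday
toy["workingday"] = ((toy["dayofweek"] < 5) & (toy["holiday"] == 0)).astype(int)

print(toy)
```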

In [None]:
### BEGIN SOLUTION


### END SOLUTION

Use visualizations and/or statistics / statistical tests to determine if the new features you have created are likely to have value in improving our model.  You may choose what analyses to display.  
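Two simple checks that might serve as a starting point are the mean demand per category and the correlation of each feature with the target. The sketch below uses an invented four-row dataframe; the column names mirror features discussed above but the numbers are made up.

```python
import pandas as pd

# Made-up data: demand is higher on working days in this toy example
toy = pd.DataFrame({
    "workingday": [1, 1, 0, 0],
    "hr": [8, 17, 8, 17],
    "cnt": [300, 400, 100, 200],
})

# Average demand for each value of a candidate feature
mean_by_workingday = toy.groupby("workingday")["cnt"].mean()

# Pearson correlation of every feature with the target
corr_with_target = toy.corr()["cnt"].drop("cnt")

print(mean_by_workingday)
print(corr_with_target)
```

Pearson correlation only captures linear relationships, so a feature such as hour of day can matter a great deal while still showing a weak correlation; a plot of mean demand by hour is often more revealing.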

In [1]:
### BEGIN SOLUTION


### END SOLUTION

## Feature Selection

We will now analyze our set of features to determine if we have unnecessary features we can remove - features that either add no value to our model or are duplicative with other features.

We evaluate features using only the data available to us in the training dataset that is later used to train the model (we do not want to "peek" at the test dataset and allow it to influence our choice of variables to use).  First, create a dataset to use for feature selection containing only the data used for model training (the time period Jan 1 2011 - June 30 2012). Remember, if you have one-hot encoded any features, you will need to go back to your data from before the one-hot encoding and use that to run your categorical feature selection - the categorical variables must be label-encoded but not one-hot encoded in order for it to run properly.

In [2]:
### BEGIN SOLUTION


### END SOLUTION

### Continuous feature selection
Use univariate feature selection methods to evaluate the continuous features in our dataset and determine if we have any unnecessary features to remove.
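One option, using the `SelectKBest` and `f_regression` imports from earlier, is sketched below on synthetic data. The data is generated so that one feature nearly duplicates another and a third is unrelated to the target; it is invented for illustration and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 'atemp' nearly duplicates 'temp', 'windspeed' is pure noise
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(20, 5, n)
X = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.5, n),
    "windspeed": rng.normal(10, 3, n),
})
y = 10 * temp + rng.normal(0, 5, n)

# Score every feature against the target with a univariate F-test
selector = SelectKBest(score_func=f_regression, k="all").fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)

# Pairwise correlations between features help to spot duplicates
feature_corr = X.corr()
print(feature_corr)
```

A univariate score rates each feature on its own, so two near-duplicate features can both score highly; the correlation matrix is what reveals that one of them is redundant.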

In [3]:
### BEGIN SOLUTION


### END SOLUTION

### Categorical feature selection

Use a univariate method to evaluate the categorical features in your dataset.
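Mutual information is one univariate method that can handle label-encoded categorical features, since it does not assume a linear relationship with the target. The sketch below uses `mutual_info_regression` on synthetic data; the feature names and the demand pattern are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: demand depends on the hour but not on 'random_label'
rng = np.random.default_rng(0)
n = 500
hr = rng.integers(0, 24, n)
random_label = rng.integers(0, 4, n)
y = np.where((hr >= 7) & (hr <= 19), 300, 50) + rng.normal(0, 10, n)

X = pd.DataFrame({"hr": hr, "random_label": random_label})

# discrete_features=True tells scikit-learn to treat every column as categorical
mi = mutual_info_regression(X, y, discrete_features=True, random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores)
```

A score near zero suggests the feature carries little information about the target on its own.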

In [None]:
### BEGIN SOLUTION


### END SOLUTION

### Remove irrelevant or duplicative features

Based on the feature selection work you did above, remove any duplicative or unnecessary features which do not add value to your model.

In [None]:
### BEGIN SOLUTION


### END SOLUTION

Explain below why you dropped each of the features you decided to drop (if any):

### Encode categorical variables  
Let's now encode our categorical features and see if that improves our performance. The choice of which features to encode and which method of encoding (label encoding, ordinal encoding, or one-hot encoding) is up to you.  See the example scripts from class for the code to do the encoding.  Encode any features you wish and store your updated data in a dataframe *data_encoded*.
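As a minimal sketch of one of these options, one-hot encoding can be done directly in pandas with `get_dummies`. The toy dataframe below is invented; which features to encode in the real data remains your choice.

```python
import pandas as pd

# Toy data with one label-encoded categorical feature (values are made up)
toy = pd.DataFrame({
    "weathersit": [1, 2, 3, 1],
    "cnt": [300, 200, 50, 280],
})

# One-hot encoding replaces the column with one 0/1 indicator per category
toy_encoded = pd.get_dummies(
    toy, columns=["weathersit"], prefix="weathersit", dtype=int
)
print(toy_encoded)
```

With a linear regression model, passing `drop_first=True` is a common way to avoid perfectly collinear indicator columns.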

In [None]:
### BEGIN SOLUTION


### END SOLUTION

### Standardize continuous features
Extract the features ['temp','atemp','hum','windspeed'] from your full dataframe *data_encoded* and standardize each one.  Store the result as *data_standardized*; because this dataframe is passed to *run_model* in the next step, it must also keep the target 'cnt' and the other features you want the model to use.
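Standardizing means rescaling a feature to zero mean and unit variance. The sketch below shows one way to do this with scikit-learn's `StandardScaler` on a made-up dataframe; it keeps the remaining column (here cnt) alongside the scaled ones, on the assumption that it is still needed downstream.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (values are made up); 'cnt' is the target and is left unscaled
toy = pd.DataFrame({
    "temp": [0.0, 10.0, 20.0],
    "hum": [30.0, 50.0, 70.0],
    "cnt": [50, 150, 300],
})
continuous_cols = ["temp", "hum"]

# Scale only the continuous columns, working on a copy of the data
toy_standardized = toy.copy()
toy_standardized[continuous_cols] = StandardScaler().fit_transform(
    toy[continuous_cols]
)
print(toy_standardized)
```

In a stricter workflow the scaler would be fitted on the training period only and then applied to the test period, so that test data does not influence the scaling.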

In [None]:
### BEGIN SOLUTION


### END SOLUTION

# Step 4: Modeling <a class="anchor" id="4"></a>

Now that we have created, selected and prepared our features, we are ready to run our model again to evaluate our new performance. Run the cell below to re-run the model using your new data (stored as *data_standardized*).  Then, comment on how the performance of your model has changed and why.

In [None]:
mse_score = run_model(data_standardized)
print('Mean Squared Error: {:.2f}'.format(mse_score))

# Step 5: Evaluation/Interpretation <a class="anchor" id="5"></a>

How has the performance of your model changed as you have created new features and used encoding to make them available to your model? Why has this improved the model performance?