<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Predicting Demand for Washington D.C.'s Bike Share System

## Background
You have just been hired as the first data scientist working for Capital Bikeshare, the organization which runs the Washington D.C. bike sharing system. The first major project they have asked you to work on is to build a model to predict demand for the shared bikes in the system for each hour of each day.  

Having an accurate understanding of the expected demand is critical to the successful operation of Capital Bikeshare.  If they underestimate demand and have too few bikes available, potential users cannot find a bike, get frustrated, and are less likely to use the system in the future.  If they overestimate demand by too much, they end up with too many bikes sitting idle.  In the real world, one of the things that makes this challenging is that they have to predict demand **for each pick-up hub location**.  To keep things simple for our final, we will focus on predicting aggregate demand.

Our task in this exercise is to build the pipeline that converts raw data into features for use in an ML model. The model itself has already been set up for you (a linear regression model, provided in the `run_model()` function below) and **you cannot change the model**, only the data pipeline.

## Data
You have been given two csv files of data to use in your analysis.  The first file ("2011-2012_bikes.csv") contains historical demand data from the past two years of operation. The dataset contains the following columns:
- dteday : date 
- hr : hour (0 to 23) 
- cnt: count of total rental bikes 

The second file ("2011-2012_weather_messy.csv") contains weather information for the same time period.  This dataset contains the following columns:  
- dteday : date 
- hr : hour (0 to 23) 
- weathersit : weather situation category
    - 1: Clear, Few clouds, Partly cloudy 
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog 
- temp : Temperature in Celsius
- atemp: Feels-like temperature in Celsius
- hum: Humidity
- windspeed: Wind speed

You may use some or all of the data provided; not all of it is necessarily useful.

## Approach
Your task in this exercise is to build the pipeline from raw data to features ready for modeling.  There are many possible approaches to doing this, and some work better than others.  


In [None]:
# Run this before any other code cell
# This downloads the csv data files into the same directory where you have saved this notebook

import urllib.request
from pathlib import Path
import os
path = Path()

# Dictionary of file names and download links
files = {'2011-2012_bikes.csv':'https://storage.googleapis.com/aipi_datasets/2011-2012_bikes.csv',
        '2011-2012_weather_messy.csv': 'https://storage.googleapis.com/aipi_datasets/2011-2012_weather_messy.csv'}

# Download each file
for key,value in files.items():
    filename = path/key
    url = value
    # If the file does not already exist in the directory, download it
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url,filename)

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn import metrics

def run_model(X_train,y_train,X_test,y_test):
    '''
    Fits the fixed linear regression model on the training set and returns the MSE on the test set
    '''
    lin_model = LinearRegression()
    lin_model.fit(X_train,y_train)
    y_pred = lin_model.predict(X_test)
    test_mse = metrics.mean_squared_error(y_test, y_pred)
    return test_mse

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Display floats with thousands separators and two decimal places
pd.options.display.float_format = '{:,.2f}'.format

### Load data
Create and run a function `load_data()` to do your data loading and any merging needed.  You can specify the arguments and returns as needed.
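
As a starting point, here is a minimal sketch of one way `load_data()` could look. It assumes both files share the `dteday` and `hr` columns listed above and that the dates parse cleanly; the argument names `bike_source` and `weather_source` are illustrative, and the two in-memory CSV strings are made-up rows used only to show the call.

```python
import io

import pandas as pd


def load_data(bike_source, weather_source):
    """Read both CSV sources and merge them on date and hour.

    Assumption: both sources contain the `dteday` and `hr` columns.
    """
    bikes = pd.read_csv(bike_source, parse_dates=["dteday"])
    weather = pd.read_csv(weather_source, parse_dates=["dteday"])
    # A left join keeps every demand row, even when its weather row is missing
    merged = bikes.merge(weather, on=["dteday", "hr"], how="left")
    return merged.sort_values(["dteday", "hr"]).reset_index(drop=True)


# Made-up rows standing in for the real files
bike_csv = io.StringIO("dteday,hr,cnt\n2011-01-01,0,16\n2011-01-01,1,40\n")
weather_csv = io.StringIO("dteday,hr,weathersit,temp\n2011-01-01,0,1,9.84\n")
demo = load_data(bike_csv, weather_csv)
print(demo)
```

A left join is only one option. It surfaces missing weather rows as `NaN` values, which the cleaning step can then deal with explicitly.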

### Clean data
Use the cell below to create and run a function `clean_data()` which cleans up the data as needed.  Things you may want to consider at this stage include:  
- Checking for and handling any missing values 
- Identifying and handling any erroneous data 
- Identifying outliers and determining whether to remove/adjust them or leave them as-is
- Etc.
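
The checks listed above might be sketched as follows. The specific problems in the weather file are not described here, so every rule below is an assumption: the duplicate handling, the plausibility bounds on `temp`, and the choice of interpolation are illustrative rather than a statement of what the data needs. The demo rows are made up.

```python
import numpy as np
import pandas as pd


def clean_data(df, numeric_cols=("temp", "atemp", "hum", "windspeed")):
    """Return a cleaned copy of the merged data (illustrative rules only)."""
    df = df.copy()
    # Drop repeated timestamps, keeping the first record for each hour
    df = df.drop_duplicates(subset=["dteday", "hr"], keep="first")
    df = df.sort_values(["dteday", "hr"]).reset_index(drop=True)
    present = [c for c in numeric_cols if c in df.columns]
    for col in present:
        # Non-numeric entries become NaN instead of raising an error
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Assumed plausibility bound: temperatures outside [-40, 60] Celsius
    # are treated as recording errors and set to missing
    if "temp" in df.columns:
        df.loc[~df["temp"].between(-40, 60), "temp"] = np.nan
    # Fill gaps from neighbouring hours
    if present:
        df[present] = df[present].interpolate(limit_direction="both")
    return df


# Made-up rows: one duplicate hour, one text entry, one impossible reading
demo = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-01"] * 4),
    "hr": [0, 1, 1, 2],
    "cnt": [16, 40, 40, 32],
    "temp": [10.0, "bad", 11.0, 500.0],
})
cleaned = clean_data(demo)
print(cleaned)
```

Whether to interpolate, drop, or cap a suspicious value is a judgment call; inspecting the distributions first (for example with `df.describe()`) helps justify the choice.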

### Split data for training and testing
Create and run the function `split_data()` in the cell below to split the data into training and test sets.  You should use all data up to and including July 31, 2012 as the training set, and the data for the period August 1, 2012 through December 31, 2012 as the test set.
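
A date-based split along these lines could be sketched as below. It assumes `dteday` holds calendar dates without a time component (the hour lives in `hr`), so comparing against the cutoff date keeps all 24 hours of July 31 in the training set. The parameter names `cutoff` and `target` are illustrative, and the demo rows are made up.

```python
import pandas as pd


def split_data(df, cutoff="2012-07-31", target="cnt"):
    """Split chronologically: rows on or before `cutoff` go to training."""
    dates = pd.to_datetime(df["dteday"])
    train_mask = dates <= pd.Timestamp(cutoff)
    train, test = df[train_mask], df[~train_mask]
    return (train.drop(columns=target), train[target],
            test.drop(columns=target), test[target])


# Made-up rows on either side of the cutoff
demo = pd.DataFrame({
    "dteday": pd.to_datetime(["2012-07-31", "2012-07-31", "2012-08-01"]),
    "hr": [0, 23, 0],
    "cnt": [30, 55, 28],
})
X_train, y_train, X_test, y_test = split_data(demo)
print(len(X_train), len(X_test))
```

Because this is time series data, the split is chronological rather than random, so the model is always evaluated on dates later than anything it was trained on.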

### Feature Engineering
Create and run the function `build_features()` below to create any additional derived features (e.g. time series features) that you wish to use in modeling.  You will need to apply this function to both your training and test sets.
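
As one illustration, calendar features can be derived from `dteday` and `hr` alone. The sketch below is an assumption about which features might help, not a prescribed set; the new column names (`month`, `dayofweek`, `is_weekend`, `hr_sin`, `hr_cos`) are made up for this example.

```python
import numpy as np
import pandas as pd


def build_features(df):
    """Add simple calendar features derived from the date and hour."""
    df = df.copy()
    dates = pd.to_datetime(df["dteday"])
    df["month"] = dates.dt.month
    df["dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
    # Cyclical encoding so that hour 23 sits next to hour 0
    df["hr_sin"] = np.sin(2 * np.pi * df["hr"] / 24)
    df["hr_cos"] = np.cos(2 * np.pi * df["hr"] / 24)
    return df


# Made-up rows: 2011-01-01 was a Saturday, 2011-01-03 a Monday
demo = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-01", "2011-01-03"]),
    "hr": [0, 6],
})
feats = build_features(demo)
print(feats)
```

Since the model is linear, a single sine/cosine pair may not capture a demand curve with separate morning and evening peaks. Treating the hour as a categorical variable and one-hot encoding it later is an alternative worth comparing.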

### Feature Selection
Use the cell below to create and run the function `feature_select()` which performs feature selection using univariate (filter) methods.  After you analyze the correlations, determine whether you would like to remove any features and do so.
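
A univariate filter of this kind can be as simple as ranking features by their absolute Pearson correlation with the target. The sketch below assumes that approach; the `min_abs_corr` threshold of 0.05 is an arbitrary illustrative value, and the demo data is made up.

```python
import pandas as pd


def feature_select(X_train, y_train, min_abs_corr=0.05):
    """Keep numeric features whose |correlation| with the target is large enough."""
    numeric = X_train.select_dtypes(include="number")
    # Correlation of each column with the target, strongest first
    corr = numeric.corrwith(y_train).abs().sort_values(ascending=False)
    # Constant columns give NaN correlation and are dropped by the comparison
    keep = corr[corr >= min_abs_corr].index.tolist()
    return keep, corr


# Made-up data: `temp` tracks the target, `flat` never changes
X_demo = pd.DataFrame({"temp": [1.0, 2.0, 3.0, 4.0],
                       "flat": [5.0, 5.0, 5.0, 5.0]})
y_demo = pd.Series([10, 20, 30, 40])
keep, corr = feature_select(X_demo, y_demo)
print(keep)
```

The correlations should be computed on the training set only, and the resulting column list then applied unchanged to the test set. Pearson correlation only measures linear association, so a low score does not by itself prove a feature is useless.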

### Prepare Features for Modeling
Our final step in the pipeline is to prepare our feature set for modeling.  In particular, in this step we need to ensure that any categorical variables we may be using are encoded as numeric values in order for the model to function properly.  You might also consider scaling some of your data.

In the cell below, create and run a function `prepare_train_feats()` which prepares the training features.

We also need to prepare the features in our test set in the same way to feed into the model.  Use the cell below for the function `prepare_test_feats()` which prepares your test set features.
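
One possible shape for this pair of functions is sketched below, assuming one-hot encoding via `pd.get_dummies` and standardization via scikit-learn's `StandardScaler`. The signatures, including the `cat_cols`, `num_cols`, `scaler`, and `train_columns` arguments, are illustrative choices; the key idea is that anything learned from the data is fitted on the training set and only applied to the test set.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_train_feats(X_train, cat_cols, num_cols):
    """One-hot encode categoricals and scale numerics, fitting on training data."""
    scaler = StandardScaler().fit(X_train[num_cols])
    feats = pd.get_dummies(X_train[cat_cols + num_cols], columns=cat_cols, dtype=float)
    feats[num_cols] = scaler.transform(X_train[num_cols])
    return feats, scaler, feats.columns.tolist()


def prepare_test_feats(X_test, cat_cols, num_cols, scaler, train_columns):
    """Apply the training-set encoding and scaling to the test data."""
    feats = pd.get_dummies(X_test[cat_cols + num_cols], columns=cat_cols, dtype=float)
    feats[num_cols] = scaler.transform(X_test[num_cols])
    # Align to the training layout: categories absent from the test set
    # become 0, categories never seen in training are dropped
    return feats.reindex(columns=train_columns, fill_value=0.0)


# Made-up rows: category 4 appears only in the test set
X_tr = pd.DataFrame({"weathersit": [1, 2, 3], "temp": [0.0, 10.0, 20.0]})
X_te = pd.DataFrame({"weathersit": [1, 4], "temp": [10.0, 30.0]})
train_feats, scaler, cols = prepare_train_feats(X_tr, ["weathersit"], ["temp"])
test_feats = prepare_test_feats(X_te, ["weathersit"], ["temp"], scaler, cols)
print(test_feats)
```

Aligning the test columns to the training columns matters because the linear model expects exactly the same inputs, in the same order, at prediction time.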

### Run pipeline
Finally, let's bring everything together in a function to run the entire pipeline for our training and test data.  Complete the function `run_pipeline()` in the cell below.  The function should call any/all of the functions you have defined above which are needed to load the data, transform it and prepare the features for both the training set and the test set.

In [None]:
def run_pipeline(bike_filename, weather_filename):
    '''
    Runs your pipeline (calling the above functions as needed) to transform the raw data into the training and test data sets for modeling

    Inputs:
        bike_filename(str): name of the file containing the bike data
        weather_filename(str): name of the file containing the weather data

    Returns:
        X_train(pd.DataFrame): dataframe containing the training set inputs
        y_train(pd.DataFrame): dataframe containing the training set labels
        X_test(pd.DataFrame): dataframe containing the test set inputs
        y_test(pd.DataFrame): dataframe containing the test set labels
    '''
    

Now that we've prepared our features, we are ready to run our model.  Run the cell below, which trains the model on the training set and calculates and reports the mean squared error (MSE) on the test set.  If everything went well, you should have an MSE below 18,500.
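
For reference, the MSE reported by `run_model()` is the average squared difference between the observed hourly counts $y_i$ and the predictions $\hat{y}_i$ over the $n$ rows of the test set:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

Taking the square root expresses the error in bikes per hour: an MSE of 18,500 corresponds to a root mean squared error of roughly $\sqrt{18500} \approx 136$.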

In [None]:
bike_datafile = "2011-2012_bikes.csv"
weather_datafile = "2011-2012_weather_messy.csv"
X_train, y_train, X_test, y_test = run_pipeline(bike_datafile, weather_datafile)
mse_score = run_model(X_train, y_train, X_test, y_test)
print('Mean Squared Error on the test set: {:.2f}'.format(mse_score))

assert mse_score < 18500