# Predicting Daily Bike Rentals

A bike-sharing system is a service in which bicycles are made available for shared use to individuals on a short-term basis, either for a fee or for free ([source](https://en.wikipedia.org/wiki/Bicycle-sharing_system)). The bike rental dataset was compiled by Hadi Fanaee-T at the University of Porto and can be found [here](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).
The goal of this project is to predict the total number of bikes people rented in a given hour.

To this end, several machine learning models will be trained, in particular Decision Tree, Random Forest and AdaBoost, and their performance will be evaluated in order to select the one that best describes the dataset and makes the most accurate predictions.


The project will be structured as follows:

1) Analysis and Data Preparation

2) Definition of useful functions (e.g. for cross-validation, test/train split, computation of metrics of interest etc.)

3) Models Evaluation

4) Analysis of features importance

5) Robustness Analysis
    
6) Hyperparameter Optimization

## Analysis and Data Preparation

**You will need to import some useful packages for the exercises. You can write them here.**

In [1]:
# Imports


**Now import the dataset and take a look at it to have a better understanding of the features and to identify the response variable. Write down some comments about them.**

In [2]:
# import the dataset

# display the first few rows in order to take a look at the data


**Consider the type of each feature and identify any that may not work in a machine learning framework. Would it be useful to convert them, or should we drop them?
Moreover, think about the 'yr' column: could we also drop this feature?**


**Now take a look at the *instant* column: do you think it would be helpful in predicting bike rentals? Why?**

In [None]:
# write your code here

**Now evaluate the distribution of the total rentals and comment on the results.**

In [3]:
# Build a histogram of the "cnt" column


**Another way to select the most useful features for your machine learning problem is to perform a correlation analysis: features that are highly correlated with each other carry redundant information, so it could be useful to drop one of them. It is also useful to check which features are most correlated with the response variable.**

**Perform a correlation analysis, identify the features that are most correlated with the response variable, and ask yourself whether it would be fair to keep them in your analysis. Why?**

**Now take a look at the features that are highly correlated with each other: which ones could we drop?**
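
As a rough illustration of the correlation analysis described above, the sketch below uses pandas on a tiny made-up table. The values in `demo`, the `0.9` threshold and the variable names are assumptions chosen for the example; only the column names `temp`, `atemp`, `hum` and `cnt` are borrowed from the dataset.

```python
import pandas as pd

# Made-up toy values for illustration only; replace `demo` with the real DataFrame.
demo = pd.DataFrame(
    {
        "temp": [0.20, 0.40, 0.60, 0.80, 0.50],
        "atemp": [0.25, 0.42, 0.58, 0.79, 0.51],
        "hum": [0.90, 0.70, 0.60, 0.40, 0.75],
        "cnt": [20, 60, 110, 150, 70],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = demo.corr()

# Features ranked by absolute correlation with the response variable.
target_corr = corr["cnt"].drop("cnt").abs().sort_values(ascending=False)

# Pairs of features (target excluded) whose absolute correlation
# exceeds an arbitrary cut-off chosen for this sketch.
threshold = 0.9
features = [c for c in corr.columns if c != "cnt"]
redundant_pairs = [
    (a, b)
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
```

`target_corr` addresses the first question (which features track the response most closely), while `redundant_pairs` lists candidate pairs from which one feature might be dropped.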

In [4]:
# write your code here

### Manipulate Columns to Extract Useful Features

**Now manipulate the hr column, which contains the rental hour, from 0 to 23. In particular, you can group the hours into intervals that identify a specific part of the day (e.g. morning, from 6am to 12pm), each associated with an identifier. One idea is the following:**

**- assign 1 if the hour is from 6 to 12**

**- assign 2 if the hour is from 12 to 18**

**- assign 3 if the hour is from 18 to 24**

**- assign 4 if the hour is from 0 to 6**
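
One possible way to express the mapping above is sketched below. It assumes each interval includes its lower bound and excludes its upper bound (so hour 12 falls in group 2), which the list leaves open; the function name `hour_to_period` is hypothetical.

```python
def hour_to_period(hour):
    """Map an hour of the day (0-23) to a part-of-day identifier (1-4).

    Assumption: intervals are half-open, i.e. [6, 12), [12, 18), [18, 24), [0, 6).
    """
    if 6 <= hour < 12:
        return 1  # morning
    if 12 <= hour < 18:
        return 2  # afternoon
    if 18 <= hour < 24:
        return 3  # evening
    if 0 <= hour < 6:
        return 4  # night
    raise ValueError(f"hour must be between 0 and 23, got {hour}")


# Identifier for every hour of the day, in order.
periods = [hour_to_period(h) for h in range(24)]
```

On a pandas DataFrame (here called `df`, a hypothetical name) the same function could be applied with `df["hr"].apply(hour_to_period)`.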

In [5]:
# write your code here

**Use the .describe() method to inspect the values of each column: are they on comparable scales, or should we normalize them?**

In [6]:
# write your code here

## Definition of useful functions

**In this section you are asked to write some useful functions:**

**Cross_Val(splits, features, target, model)**: 

This function performs k-fold cross-validation in order to estimate the error of the model on unseen data. The estimate is obtained by averaging the performance of the model over different partitions of the training set, each divided into a training and a validation subset. The function takes as input the number of splits, the feature matrix and the target of the training set, and the model. It should return the R-squared and the training and validation root mean squared errors, normalized by the range of the target values (max value - min value).

Hint: use KFold to split the training features matrix into train and validation.
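
A minimal sketch of how such a function might look is given below; it is not the reference solution. It assumes that the folds are shuffled with a fixed seed, that the R-squared is measured on the validation folds, and that each error is normalized by the range of the true targets in the corresponding fold. The helper `fold_nrmse` and the synthetic demo data are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor


def fold_nrmse(y_true, y_pred):
    """RMSE divided by the range of the true target values of the fold."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse / (y_true.max() - y_true.min())


def Cross_Val(splits, features, target, model):
    X = np.asarray(features)
    y = np.asarray(target, dtype=float)
    # Assumption: shuffle the rows with a fixed seed before splitting.
    kfold = KFold(n_splits=splits, shuffle=True, random_state=1)
    r2_scores, train_errors, val_errors = [], [], []
    for train_idx, val_idx in kfold.split(X):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        train_pred = fold_model.predict(X[train_idx])
        val_pred = fold_model.predict(X[val_idx])
        r2_scores.append(r2_score(y[val_idx], val_pred))
        train_errors.append(fold_nrmse(y[train_idx], train_pred))
        val_errors.append(fold_nrmse(y[val_idx], val_pred))
    return np.mean(r2_scores), np.mean(train_errors), np.mean(val_errors)


# Synthetic data, used only to show the call.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = 10 * X_demo[:, 0] + rng.rand(200)
r2, train_err, val_err = Cross_Val(5, X_demo, y_demo, DecisionTreeRegressor(random_state=0))
```

Cloning the model inside the loop keeps each fold independent of the previous fit.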

**Test(X_train, y_train, X_test, y_test, model)**:

This function trains the model on the whole training set and then tests the trained model on the test set. Use as metrics the normalized root mean squared error (RMSE / (max(target value) - min(target value))) and the R-squared. The function takes as input the training set (features and target), the test set (features and target) and the model.
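
The two metrics can be worked through on a few numbers. The sketch below is an illustration only: it assumes the range is taken over the true target values passed to the function, and the function names and toy values are made up for the example.

```python
import numpy as np


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the true target values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy values: the squared errors are 4, 4, 9, 9, so the MSE is 6.5.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
nrmse = normalized_rmse(y_true, y_pred)  # sqrt(6.5) / 30, roughly 0.085
r2 = r_squared(y_true, y_pred)           # 1 - 26 / 500 = 0.948
```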

**Test_Train_Random(data)**:

This function should randomly split the indices of the original dataset and return those that belong to the training set (80%) and those that belong to the test set (20%).

Hint: use random.Random(1).shuffle()
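
Following the hint, one way to sketch this function is shown below. The optional `train_fraction` and `seed` parameters are additions made for this example and are not part of the requested signature.

```python
import random


def Test_Train_Random(data, train_fraction=0.8, seed=1):
    """Return (train_indices, test_indices) as two shuffled, disjoint lists."""
    indices = list(range(len(data)))
    # A dedicated Random instance with a fixed seed makes the split repeatable.
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    return indices[:cut], indices[cut:]


# Any object with a length works; a plain list stands in for the dataset here.
train_idx, test_idx = Test_Train_Random(list(range(100)))
```

With a pandas DataFrame, the returned lists could then be used with `.iloc` to select the corresponding rows.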

In [7]:
# write your code here

## Models Evaluation

**First of all, split the original data into train_set and test_set using the indices provided by the Test_Train_Random function. Then divide the training and test sets into X_train (features) / y_train (label) and X_test (features) / y_test (label), respectively.**

Hint: reset the indices with the reset_index(drop=True) method.

In [8]:
# write your code here

**Initialize here 3 different models: DecisionTreeRegressor, RandomForestRegressor and AdaBoostRegressor. Remember to set the random_state parameter to a fixed value in order to have repeatable results (see the Scikit-Learn documentation). Using the Cross_Val function with 10 splits, estimate the prediction error on the training set and on the validation set. Then test each model using the Test function. While doing so, store the feature importances of each model with the following command:**

feature_importance = model.feature_importances_

**Answer the following questions:**

### *Questions*
- By using the Cross_Val function you have estimated the prediction error of the model in predicting the training labels and the validation labels. Which one is always bigger? Why?
- How do these two errors change depending on the model used? Why?
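
The model initialization and the storage of the feature importances described in this section might be sketched as follows. The synthetic data, the dictionary names and the `n_estimators=50` setting are assumptions made for the example; on the real data the models would be fitted on X_train and y_train.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training set, used only to make the sketch runnable.
rng = np.random.RandomState(0)
X_demo = rng.rand(150, 3)
y_demo = 4 * X_demo[:, 0] + X_demo[:, 1] + 0.1 * rng.rand(150)

# A fixed random_state makes the results repeatable.
models = {
    "DecisionTree": DecisionTreeRegressor(random_state=1),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=1),
    "AdaBoost": AdaBoostRegressor(random_state=1),
}

# Fit each model and keep its feature importances for later comparison.
feature_importances = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    feature_importances[name] = model.feature_importances_
```

Keeping the importances in a dictionary keyed by model name makes it straightforward to compare them side by side later.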

## Analysis of features importance
Given the trained models, evaluate the contribution of each feature and see how the importance of these features changes across the algorithms.

**In this part you are asked to evaluate the importance of each feature. In the previous section you have stored the feature_importances_ for each model. Plot them using a bar plot and answer the following questions:**


### *Questions*
- Is the contribution of the features consistent across the algorithms?
- Are the most important features consistent with common-sense expectations? Why?

In [9]:
# write your code here

## Robustness Analysis

**In this section you are asked to add noise to the features of the training set and then train the model on this noisy dataset. Then test the model on the test set and check how much the test error of each model is affected by the added noise. Set the noise as follows:**

noise = np.random.rand(X_train.shape[0])*val

X_train_noisy = X_train.copy()

for col in X_train.columns:

        X_train_noisy[col] = X_train_noisy[col]+noise

**where val should take each of the following values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10].**

**NB: since the noise is random, use np.random.seed(seed_number) with a fixed seed_number in order to have repeatable results.**
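
The noise recipe above can be wrapped in a small helper and applied for every val, as sketched below. The helper name `add_uniform_noise`, the default seed and the toy DataFrame are assumptions for illustration. Since `np.random.rand` draws from [0, 1), the noise only shifts values upward, and because the seed is reset on every call, the same base pattern is scaled by val.

```python
import numpy as np
import pandas as pd


def add_uniform_noise(X_train, val, seed=0):
    """Add the same per-row uniform noise in [0, val) to every column."""
    np.random.seed(seed)  # fixed seed so the noise is repeatable
    noise = np.random.rand(X_train.shape[0]) * val
    X_train_noisy = X_train.copy()
    for col in X_train.columns:
        X_train_noisy[col] = X_train_noisy[col] + noise
    return X_train_noisy


# Toy stand-in for X_train, with made-up values.
X_demo = pd.DataFrame({"temp": [0.2, 0.4, 0.6], "hum": [0.5, 0.7, 0.9]})

# One noisy copy of the training features per noise level.
noisy_by_val = {val: add_uniform_noise(X_demo, val) for val in range(11)}
```

Each entry of `noisy_by_val` can then be used to retrain a model, which is afterwards evaluated on the untouched test set.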

**Plot the results for each model at each val value, and answer the following questions:**

### *Questions*
- How does the performance of each model change as the noise increases?
- Which algorithm is most affected by the addition of noise? And which one is the most robust?
- Did you expect these results? Why?

In [10]:
# write your code here

## Hyperparameter Optimization

**In this section you are asked to optimize some hyperparameters of the models under analysis (Decision Tree, Random Forest and AdaBoost). For each hyperparameter you should define a set of values to test and then combine all the sets into a dictionary. Use the RandomizedSearchCV function to search the hyperparameter space for the best cross-validation score (see the [documentation](https://scikit-learn.org/stable/modules/grid_search.html#grid-search)). Compare the results obtained with the best hyperparameter combination against those of the default model.**
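
A minimal sketch of such a search for one of the models is shown below. The candidate values in `param_distributions`, the `n_iter`, `cv` and `scoring` settings, and the synthetic data are all assumptions chosen to keep the example small; they are not recommended settings for the bike rental data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training set.
rng = np.random.RandomState(0)
X_demo = rng.rand(120, 4)
y_demo = 5 * X_demo[:, 0] + X_demo[:, 1] + 0.1 * rng.rand(120)

# One set of candidate values per hyperparameter, combined in a dictionary.
param_distributions = {
    "n_estimators": [10, 20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,  # number of random combinations to try
    cv=3,      # folds used to score each combination
    scoring="neg_root_mean_squared_error",
    random_state=1,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_  # negated RMSE, so closer to zero is better
```

Only `n_iter` combinations are evaluated, which keeps the run time bounded even when the full grid of combinations is large.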

### *Questions*
- Can we improve the performance of the models by optimizing the hyperparameters?
- What is the bottleneck in hyperparameter optimization? How would you handle this problem?

In [11]:
# write your code here