# Summary
The goal of this assignment is to predict the total number of Washington D.C. bicycle users on an hourly basis, using a dataset with data from 2011 and 2012 (use the attached `hour.csv`, and have a look at `README.txt` for explanations). The notebook should be divided into the following sections:

### Part 1: Exploratory Data Analysis
1. Ensuring data quality
2. Plotting clear and meaningful figures
3. Checking possibly redundant variables via correlations
4. Giving insights on what seems relevant for prediction and what does not
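
As a rough illustration of the redundancy check in step 3, the sketch below lists pairs of numeric columns whose absolute Pearson correlation exceeds a threshold. The helper name `highly_correlated_pairs` and the 0.9 threshold are hypothetical choices for this example, not part of the assignment.

```python
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, |corr|) for numeric column pairs above the threshold."""
    # Absolute Pearson correlation between all numeric columns.
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Visit each unordered pair once (upper triangle, no diagonal).
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = float(corr.loc[a, b])
            if value > threshold:
                pairs.append((a, b, value))
    return pairs


# Possible usage, assuming hour.csv sits next to the notebook:
# df = pd.read_csv("hour.csv")
# features = df.drop(columns=["cnt", "casual", "registered"])
# print(highly_correlated_pairs(features))
```

A pair flagged here is only a candidate for removal; whether to drop one of the two columns is still a judgment call to justify in the notebook.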

### Part 2: Data Engineering
1. Discussion on missing values and outliers
2. Treatment of text and date features
3. Generation of extra features and studying the influence of combinations of features
4. Giving new insights on what seems relevant for prediction and what does not
5. Use of scikit-learn pipelines to perform transformations
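
One possible shape for step 5 is a `ColumnTransformer` wrapped in a `Pipeline`, so that the same transformations are applied when fitting and when predicting. This is a minimal sketch: the column names in `CATEGORICAL` and `NUMERIC` are assumptions and should be checked against `README.txt`, and `build_pipeline` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names; adjust them to what README.txt actually documents.
CATEGORICAL = ["season", "weathersit", "hr"]
NUMERIC = ["temp", "hum", "windspeed"]


def build_pipeline() -> Pipeline:
    """Preprocessing plus a baseline linear model in a single estimator."""
    preprocess = ColumnTransformer(
        transformers=[
            # One-hot encode categories; ignore categories unseen during fit.
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
            # Standardize continuous features.
            ("num", StandardScaler(), NUMERIC),
        ],
        remainder="drop",  # columns not listed above are discarded
    )
    return Pipeline(steps=[("preprocess", preprocess), ("model", LinearRegression())])


# Possible usage, with X_train / y_train built from the training period:
# pipe = build_pipeline()
# pipe.fit(X_train, y_train)
# print(pipe.score(X_val, y_val))  # R2 score
```

Swapping `LinearRegression()` for another regressor keeps the preprocessing identical, which makes model comparisons easier to trust.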

### Part 3: Machine Learning
1. Choosing sensible models (linear and non-linear)
    * Baseline Linear Regression with Initial Variables
    * Linear Regression with New Variables
    * Baseline Random Forest
2. Tuning model parameters with validation
3. Obtaining accurate predictions in test
4. Plotting predictions vs reality for additional insights
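
For step 2, one way to tune against a single fixed validation set rather than k-fold cross-validation is scikit-learn's `PredefinedSplit`. The sketch below assumes the features are already numeric arrays and that you have carved a validation block out of the training period yourself; the function name `tune_with_fixed_validation` and the example parameter grid are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, PredefinedSplit


def tune_with_fixed_validation(X_train, y_train, X_val, y_val, param_grid):
    """Grid-search a random forest, scoring every candidate on one fixed validation set."""
    X_all = np.vstack([X_train, X_val])
    y_all = np.concatenate([y_train, y_val])
    # -1 marks rows that are only ever used for training; 0 marks the single validation fold.
    test_fold = np.concatenate([
        np.full(len(X_train), -1),
        np.zeros(len(X_val), dtype=int),
    ])
    search = GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid,
        cv=PredefinedSplit(test_fold),
        scoring="r2",
    )
    # With the default refit=True, the best model is refit on train + validation.
    search.fit(X_all, y_all)
    return search


# Possible usage:
# search = tune_with_fixed_validation(
#     X_train, y_train, X_val, y_val,
#     {"n_estimators": [100, 300], "max_depth": [None, 10]},
# )
# print(search.best_params_, search.best_score_)
```

The test quarter never enters this function, so the tuning cannot leak information from it.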

## Submission format: One Jupyter notebook (Bike-Sharing_GroupN.ipynb)
Please do not submit:
* A zip file
* A link to Google CoLab
* A file with the wrong extension
* A Python script


### Task description
* Training data: whole 2011 and first 3 quarters of 2012.
* Test data: 4th quarter of 2012. Do not use it to fit your models!
* Target: total number of users (cnt)
* Error metric: R2 score (scikit-learn's default for regression).
* Features to use: at least the ones present in the data (except for cnt, casual, and registered).
* Groups: default groups for this term. You can split the work as you see fit, but make sure every member is able to explain the details of what was done throughout the project, even for parts they did not work on.
* Software: No extra libraries should need to be installed to run the notebook
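
The split described above can be sketched as follows. The fourth quarter of 2012 starts on 2012-10-01; the date column name `dteday` is an assumption to verify against `README.txt`, and both helper names are hypothetical.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, date_col: str = "dteday", test_start: str = "2012-10-01"):
    """Return (train, test): rows before test_start versus rows on or after it."""
    dates = pd.to_datetime(df[date_col])
    is_test = dates >= pd.Timestamp(test_start)
    return df[~is_test].copy(), df[is_test].copy()


def features_and_target(df: pd.DataFrame, target: str = "cnt", leaky=("casual", "registered")):
    """Separate the target and drop the columns that must not be used as features."""
    X = df.drop(columns=[target, *leaky])
    y = df[target]
    return X, y


# Possible usage, assuming hour.csv sits next to the notebook:
# df = pd.read_csv("hour.csv")
# train, test = split_by_date(df)
# X_train, y_train = features_and_target(train)
# X_test, y_test = features_and_target(test)
```

Splitting by date rather than at random keeps the test quarter strictly in the future relative to the training data, which matches how the model would be used.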

### Grading criteria
As explained in the syllabus, the project report (the submission in the form of a Jupyter notebook) accounts for 35% of the overall grade of the subject, while the presentation accounts for 10%. For this task there are no detailed, explicit steps to follow, only broad elements that are expected to be present.

#### Report (10 points):
* Exploratory Data Analysis (descriptive analytics) (3 points)
    * Ensuring data quality (correctness, consistency, missing values, outliers...).
    * Plotting clear and meaningful figures.
    * Giving insights on what seems relevant for prediction and what does not.
    * Bonus points for:
        * Checking possibly redundant variables via correlations.
* Data Engineering (3 points)
    * Discussion on missing values and outliers
    * Treatment of text and date features
    * Generation of extra features (e.g., season, yes/no holiday, hours of daylight, combinations of features, quantization/binarization, polynomial features)
    * Use of scikit-learn pipelines to perform transformations
* Machine Learning (predictive analytics) (3 points)
    * Choosing sensible models (linear and non-linear).
    * Tuning model parameters with validation (use the fixed validation set).
    * Obtaining accurate predictions in test (measured with R2 score).
    * Plotting predictions vs. reality for additional insights.
    * Bonus points for:
        * Plotting validation results to justify further choices (parameter ranges, other validations...).
        * Following an incremental approach (baseline models first, then more complex models, then combining models...)
* Extra 1 point:
    * Work description
    * Overall quality and clarity
    * Innovative approaches

Note: The R2 score will not be used directly for grading, although a better preprocessing and feature generation strategy will yield both a better grade and a better model.

#### Exposition in class (3 points):
* Explanation of technical details
* Clarity, conciseness, quality of content & delivery
* Answering final questions (posed by the professor and/or other groups)

The exposition has a limit of 5 minutes per group (excluding questions). You are free to choose the number of presenters; it is not mandatory that everyone talks. Teams that run over time will receive a grade penalty.

The presentation should be a technical one, so I expect to see code. My recommendation is to turn the Jupyter report into a slideshow, but any format is fine.

After that, expect technical questions about the code and the project, for example:
* "What does this function do in this cell?"
* "What would happen if the data was shaped in this other way?"
* "Can you think of an alternative way of doing that grouping?"
* "There is a mistake in this code cell, can you tell me what it is?"
* "How could you replace this `for` loop with a pandas function?"

All members of the group should be ready to answer questions, including those who are remote or did not present.