# Predict Bike Sharing Demand with AutoGluon Template

### Install packages

In [11]:
!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir
# Without --no-cache-dir, smaller AWS instances may have trouble installing

### Download dataset

### Setup Kaggle API Key


In [12]:
# create the .kaggle directory and an empty kaggle.json file
!mkdir -p /root/.kaggle
!touch /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json

In [13]:
# Fill in your username and key from the API token file created in your Kaggle account
import json
kaggle_username = "YOUR_KAGGLE_USERNAME"
kaggle_key = "YOUR_KAGGLE_KEY"  # never commit a real key to a shared notebook

# Save the API token to the kaggle.json file
with open("/root/.kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))
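
Hardcoding the key in a notebook exposes it to anyone who can read the file. A minimal sketch of an alternative, assuming the credentials have been exported as the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables (which the Kaggle CLI can also read directly). The helper `write_kaggle_token` is a hypothetical name, not part of any library:

```python
import json
import os
from pathlib import Path


def write_kaggle_token(path, username, key):
    """Write a kaggle.json-style token file readable only by its owner."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"username": username, "key": key}))
    path.chmod(0o600)  # same permissions as `chmod 600`
    return path


# Only write the file when both variables are actually set.
username = os.environ.get("KAGGLE_USERNAME")
key = os.environ.get("KAGGLE_KEY")
if username and key:
    write_kaggle_token(Path.home() / ".kaggle" / "kaggle.json", username, key)
```

With this approach the notebook itself never contains the secret, so it can be shared or committed safely.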

In [14]:
# Download the dataset. It arrives as a .zip file, so it needs to be unzipped as well.
!kaggle competitions download -c bike-sharing-demand
# The -o flag lets unzip overwrite files left over from an earlier download
!unzip -o bike-sharing-demand.zip

In [15]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from autogluon.tabular import TabularPredictor

### Explore dataset

In [16]:
# Create the train dataset in pandas by reading the csv
train = pd.read_csv("./train.csv")
print(train.shape)
train.head()

As we can see, the dataframe consists of 10886 rows and 12 columns.
The columns describe the date and time of each hourly record, the weather conditions, and the number of rentals by casual and registered users, along with the total `count`.

In [17]:
# Set the parsing of the datetime column so we can use some of the `dt` features in pandas later
train["datetime"] = pd.to_datetime(train["datetime"])

# Sanity check 
train.info()

The "datetime" column was successfully converted to the **datetime** type.

In [18]:
# Summary statistics of the train dataset to view the min/max/variation of its features.
train.describe()

In [19]:
# Create the test dataframe in pandas by reading the csv, remember to parse the datetime!
test = pd.read_csv("./test.csv")
print(test.shape)
test.head()

In [20]:
# Set the parsing of the datetime column so we can use some of the `dt` features in pandas later
test["datetime"] = pd.to_datetime(test["datetime"])

# Sanity check 
test.info()

In [21]:
# Load the sample submission file the same way as the train and test datasets
submission = pd.read_csv("./sampleSubmission.csv")
submission.head()

## Step 3: Train a model using AutoGluon’s Tabular Prediction

Requirements:
* We are predicting `count`, so it is the label we are setting.
* Ignore the `casual` and `registered` columns, as they are not present in the test dataset.
* Use `root_mean_squared_error` as the evaluation metric.
* Set a time limit of 10 minutes (600 seconds).
* Use the preset `best_quality` to focus on creating the best model.

In [22]:
# Ignoring the "casual" and "registered" columns as they don't exist in the test dataset
train = train.drop(["casual", "registered"], axis=1)

In [23]:
train.shape

In [24]:
predictor = TabularPredictor(label="count").fit(train_data=train, time_limit=600, presets="best_quality")

### Review AutoGluon's training run with ranking of models that did the best.

In [25]:
predictor.fit_summary()

### Create predictions from test dataset

In [26]:
predictions = pd.Series(predictor.predict(test))

In [27]:
predictions.head()

#### NOTE: Kaggle will reject the submission if any prediction is negative, so every value must be >= 0.

In [28]:
# Describe the `predictions` series to see if there are any negative values
predictions.describe()

In [29]:
# How many negative values do we have?
print("We have {} negative values".format(len(predictions[predictions < 0])))

In [23]:
submission["count"] = predictions
# Set negative predictions to zero
submission["count"] = submission["count"].clip(lower=0)
submission.head()

### Set predictions to submission dataframe, save, and submit

In [24]:
submission.to_csv("submission.csv", index=False)

In [27]:
!kaggle competitions submit -c bike-sharing-demand -f submission.csv -m "first raw submission"

#### View submission via the command line or in the web browser under the competition's page - `My Submissions`

In [28]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### Initial score of 1.80657      
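
For context, the competition's evaluation page describes the leaderboard score as the Root Mean Squared Logarithmic Error (RMSLE), so lower is better, and the logarithm is one reason negative predictions are a problem. A minimal sketch of the metric using only the standard library. The function name `rmsle` and the sample numbers are illustrative assumptions, not values from this notebook:

```python
import math


def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up counts, only to show the call.
print(rmsle([10, 100, 250], [12, 90, 300]))
```

Because the metric compares logarithms, it penalizes relative errors: being off by 10 rentals matters far more when the true count is 5 than when it is 500.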

## Step 4: Exploratory Data Analysis and Creating an additional feature


In [30]:
# Create a histogram of every feature to show its distribution. This is part of the exploratory data analysis.
fig = plt.figure(figsize=(10,10))
ax = fig.gca()
train.hist(ax=ax)
plt.show();

### Findings from the plot above:
- We can extract the hour of the day and the month from the datetime column.
- The four seasons are evenly balanced; none dominates the others.
- Some features should have a category type, such as "season" and "weather".
- Our target column "count" is right-skewed: it ranges from near 0 to almost 1000, but most values fall below 250.


In [31]:
# Create new features from datetime on the train dataset
train["hour"] = train["datetime"].dt.hour
train["day"] = train["datetime"].dt.day
train["month"] = train["datetime"].dt.month

# Create new features from datetime on the test dataset
test["hour"] = test["datetime"].dt.hour
test["day"] = test["datetime"].dt.day
test["month"] = test["datetime"].dt.month
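
The same three columns are derived twice above, once per dataframe. A small helper keeps the two in sync. This is only a sketch: `add_datetime_features` is a hypothetical name, and it is shown on a two-row toy frame rather than the real data:

```python
import pandas as pd


def add_datetime_features(df, column="datetime"):
    """Return a copy of df with hour, day and month columns derived from `column`."""
    out = df.copy()
    parsed = pd.to_datetime(out[column])
    out["hour"] = parsed.dt.hour
    out["day"] = parsed.dt.day
    out["month"] = parsed.dt.month
    return out


# Toy data standing in for the real train/test frames.
toy = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-07-19 17:00:00"]})
toy_features = add_datetime_features(toy)
print(toy_features[["hour", "day", "month"]])
```

With the notebook's dataframes this would be called as `train = add_datetime_features(train)` and `test = add_datetime_features(test)`.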

In [32]:
# Sanity check
train.head(2)

### Make category types

In [33]:
train["season"]  = train["season"].astype("category")
train["weather"] = train["weather"].astype("category")
test["season"]   = test["season"].astype("category")
test["weather"]  = test["weather"].astype("category")
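
Casting each dataframe separately lets pandas infer the categories from whatever values that frame happens to contain, so train and test could in principle end up with different category sets. A hedged sketch that fixes the categories explicitly, using toy data and a hypothetical helper `to_shared_category`:

```python
import pandas as pd


def to_shared_category(train_df, test_df, column):
    """Cast `column` in both frames to one categorical dtype built from their combined values."""
    values = pd.concat([train_df[column], test_df[column]]).dropna().unique()
    dtype = pd.CategoricalDtype(categories=sorted(values))
    return train_df[column].astype(dtype), test_df[column].astype(dtype)


# Toy frames: the value 4 appears only in the first one.
toy_train = pd.DataFrame({"weather": [1, 2, 3, 4]})
toy_test = pd.DataFrame({"weather": [1, 2, 3]})
toy_train["weather"], toy_test["weather"] = to_shared_category(toy_train, toy_test, "weather")
print(toy_test["weather"].cat.categories.tolist())
```

Whether this matters for a given model depends on how it encodes categories; the sketch only guarantees that both frames share one dtype.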

In [34]:
# View histogram of all features again, now with the hour, day, and month features
fig = plt.figure(figsize=(10,10))
ax = fig.gca()
train.hist(ax=ax)
plt.show();

## Step 5: Rerun the model with the same settings as before, just with more features

In [35]:
predictor_new_features = TabularPredictor(label="count").fit(train_data=train, time_limit=600, presets="best_quality")

In [36]:
predictor_new_features.fit_summary()

In [37]:
# predict on the test data set
predictions_new_features = pd.Series(predictor_new_features.predict(test))

In [38]:
# How many negative values do we have?
print("We have {} negative values".format(len(predictions_new_features[predictions_new_features < 0])))

In [39]:
submission_new_features = submission.copy()
submission_new_features["count"] = predictions_new_features
# Set negative predictions to zero
submission_new_features["count"] = submission_new_features["count"].clip(lower=0)
submission_new_features.head()

In [40]:
# Save the predictions for submission, same as before
submission_new_features.to_csv("submission_new_features.csv", index=False)

In [114]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_features.csv -m "new features"

In [55]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### New Score of 0.45996

## Step 6: Hyperparameter optimization
* There are many options for hyperparameter optimization.
* One option is to change AutoGluon's higher-level parameters; another is to tune the hyperparameters of the individual models.
* Tuning the models inside AutoGluon requires the `hyperparameters` and `hyperparameter_tune_kwargs` arguments of `fit`.
* The run below takes the first route: it sets `eval_metric` and `problem_type` explicitly and raises `time_limit` to 900 seconds.

In [42]:
predictor_new_hpo = TabularPredictor(
    label="count",
    eval_metric="root_mean_squared_error",
    problem_type="regression",
).fit(train_data=train, time_limit=900, presets="best_quality")

In [43]:
predictor_new_hpo.fit_summary()

In [45]:
# predict on the test data set
predictions_new_hpo = pd.Series(predictor_new_hpo.predict(test))

In [46]:
# How many negative values do we have?
print("We have {} negative values".format(len(predictions_new_hpo[predictions_new_hpo < 0])))

In [47]:
submission_new_hpo = submission.copy()
submission_new_hpo["count"] = predictions_new_hpo
# Set negative predictions to zero
submission_new_hpo["count"] = submission_new_hpo["count"].clip(lower=0)
submission_new_hpo.head()

In [48]:
# Save the predictions for submission, same as before
submission_new_hpo.to_csv("submission_new_hpo.csv", index=False)

In [49]:
!kaggle competitions submit -c bike-sharing-demand -f submission_new_hpo.csv -m "new features with hyperparameters"

In [50]:
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6

#### New Score of 0.45967

## Step 7: Write a Report
### Refer to the markdown file for the full report
### Creating plots and table for report

In [None]:
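# Taking the top model score from each training run and creating a line plot to show improvement
# You can create these in the notebook and save them to PNG or use some other tool (e.g. google sheets, excel)
# The best validation score of each run is read from its leaderboard. AutoGluon reports
# RMSE with a flipped sign (higher is better), so these values are negative.
top_scores = [
    p.leaderboard(silent=True)["score_val"].max()
    for p in (predictor, predictor_new_features, predictor_new_hpo)
]
fig = pd.DataFrame(
    {
        "model": ["initial", "add_features", "hpo"],
        "score": top_scores
    }
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_train_score.png')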

In [51]:
# Take the 3 Kaggle scores and create a line plot to show improvement
fig = pd.DataFrame(
    {
        "test_eval": ["initial", "add_features", "hpo"],
        "score": [1.80657, 0.45996, 0.45967]
    }
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_test_score.png')
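
The size of each step is easier to judge as a relative change. A small worked calculation from the three Kaggle scores above (lower is better); the helper name `relative_improvement` is an assumption for illustration:

```python
scores = {"initial": 1.80657, "add_features": 0.45996, "hpo": 0.45967}


def relative_improvement(before, after):
    """Fractional reduction in error when moving from `before` to `after`."""
    return (before - after) / before


features_gain = relative_improvement(scores["initial"], scores["add_features"])
hpo_gain = relative_improvement(scores["add_features"], scores["hpo"])
print(f"add_features vs initial: {features_gain:.1%}")
print(f"hpo vs add_features: {hpo_gain:.2%}")
```

The added features account for nearly all of the gain (roughly a 74.5% lower score), while the final run improves on that by well under 0.1%.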

### Hyperparameter table

In [54]:
# The 3 parameters set in each run, with the Kaggle score as the result
# Each row is one run; hpo1..hpo3 list the arguments that run passed
pd.DataFrame({
    "model": ["initial", "add_features", "hpo"],
    "hpo1": ['label', 'label', 'eval_metric'],
    "hpo2": ['time_limit', 'time_limit', 'problem_type'],
    "hpo3": ['presets', 'presets', 'presets'],
    "score": [1.80657, 0.45996, 0.45967]
})
