# Importing libraries
We rely heavily on the `numpy` and `pandas` libraries.

In [None]:
import os
import numpy as np
import pandas as pd
import pathlib

# Loading the data set
We load the data from the `data` directory. The `parse_dates` option converts the `datetime` column to the right `dtype` at load time.

In [None]:
data_dir = 'data'
df = pd.read_csv(os.path.join(data_dir, 'data.csv'), parse_dates=['datetime'])
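
As a small illustration of what `parse_dates` changes, the sketch below reads the same text twice, with and without the option. The two-row extract is hypothetical and only stands in for `data.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row extract, standing in for data.csv
csv_text = "datetime,count\n2011-01-01 00:00:00,16\n2011-01-01 01:00:00,40\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['datetime'])

print(raw['datetime'].dtype)     # a generic text dtype
print(parsed['datetime'].dtype)  # a datetime64 dtype

# The .dt accessor is only available on the parsed column
hours = parsed['datetime'].dt.hour.tolist()
```

Without the option, the column would have to be converted afterwards, for instance with `pd.to_datetime`.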

# Train/test split
The data span two years of activity. In a realistic setting, data from past years would most likely be used to predict current and future years. We thus simulate such a production run by training on the first year and testing on the second.

In [None]:
train_index = df['datetime'] < '2012-01-01'
test_index = df['datetime'] >= '2012-01-01'

We delay the split into two data frames until the very last section, to keep feature engineering simple. In a production run, though, any feature engineering step should be factored into the machine learning pipeline.
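
One possible way to factor feature engineering into reusable steps is to write each step as a function and chain them with `DataFrame.pipe`, so that the same chain can be applied to any split. This is only a sketch: the functions `add_hour`, `drop_columns` and `engineer`, as well as the toy data, are hypothetical and not part of this notebook.

```python
import pandas as pd

def add_hour(frame):
    # Hypothetical step: expose the hour of day as a feature
    return frame.assign(hour=frame['datetime'].dt.hour)

def drop_columns(frame, columns):
    # Hypothetical step: remove columns the model should not see
    return frame.drop(columns=columns)

def engineer(frame):
    # The same chain of steps is applied to any split
    return frame.pipe(add_hour).pipe(drop_columns, columns=['datetime'])

toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2011-06-01 08:00', '2012-06-01 17:00']),
    'temp': [20.5, 24.6],
})
toy_train = engineer(toy[toy['datetime'] < '2012-01-01'])
toy_test = engineer(toy[toy['datetime'] >= '2012-01-01'])
```

Steps that learn statistics from the data (normalization, for instance) would need to be fitted on the training split only, which is what a dedicated pipeline object usually takes care of.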

# Correcting `dtypes`
Some `dtypes` are wrong, e.g. the `humidity` column is represented as `int64` instead of `float64`. Since we plan to normalize continuous features, we had better correct this.

In [None]:
df['humidity'] = df['humidity'].astype(float)

Furthermore, categorical columns should use the `category` `dtype`. The `pd.Categorical` constructor does this, and can even be given a list of allowed categories, which prevents discrepancies between the categorical features of the training and testing sets.

In [None]:
allowed_categories = {'holiday': [0, 1], 'workingday': [0, 1], 'season': [1, 2, 3, 4], 'weather': [1, 2, 3, 4]}
for column, categories in allowed_categories.items():
    df[column] = pd.Categorical(df[column], categories=categories)
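
The benefit of declaring the allowed categories can be seen on a toy example. In the sketch below, the two small series are hypothetical stand-ins for the `weather` column of a training and a testing set, the latter lacking some categories.

```python
import pandas as pd

allowed = [1, 2, 3, 4]  # same levels as the `weather` column
toy_train = pd.Series(pd.Categorical([1, 2, 3, 4], categories=allowed))
toy_test = pd.Series(pd.Categorical([1, 2, 2], categories=allowed))

# Categories absent from the test series (3 and 4) still get a dummy column
train_cols = pd.get_dummies(toy_train, prefix='weather').columns.tolist()
test_cols = pd.get_dummies(toy_test, prefix='weather').columns.tolist()

# Without the categorical dtype, only observed values produce columns
plain_cols = pd.get_dummies(pd.Series([1, 2, 2]), prefix='weather').columns.tolist()

print(test_cols)
print(plain_cols)
```

Both splits thus end up with the same feature columns, whatever values happen to be observed in each of them.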

# Dummifying categorical variables
The `pd.get_dummies` function creates one indicator column per category, filled with 0s and 1s. For a given variable, these columns always sum to 1, so they are perfectly collinear with the intercept, which would impair linear regression (non-invertible matrix). We thus remove the column corresponding to the first category, which carries no additional information.

In [None]:
where_category = df.dtypes == 'category'
categorical_features = df.dtypes[where_category].index.tolist()
for column in categorical_features:
    categories = df[column].cat.categories
    # dtype=int keeps 0/1 integers (recent pandas versions default to booleans)
    df = pd.get_dummies(df, columns=[column], dtype=int).drop(columns=[column + '_' + str(min(categories))])
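
The non-invertibility issue mentioned above can be checked numerically. The sketch below builds a design matrix with an intercept from a hypothetical toy `season` column, and compares its rank with and without the first dummy column.

```python
import numpy as np
import pandas as pd

season = pd.Series(pd.Categorical([1, 2, 3, 4, 1, 2], categories=[1, 2, 3, 4]))
full = pd.get_dummies(season, prefix='season', dtype=float)
reduced = full.drop(columns=['season_1'])

intercept = np.ones((len(season), 1))
X_full = np.hstack([intercept, full.to_numpy()])        # 5 columns
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # 4 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than the rank
print(X_reduced.shape[1], rank_reduced)  # as many columns as the rank
```

With all four dummies, the columns are linearly dependent, so the normal equations of linear regression have no unique solution. Dropping one dummy restores a full-rank matrix.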

# Managing outliers
Inspecting the `weather` column, it turns out that the event `weather == 4` is almost non-existent: it occurs in a single observation. We thus have no valuable information on this kind of weather. Should it occur in the future, we had better assign it to a similar weather for which we have more information, such as `weather == 3`.

In [None]:
where_weather_4 = df['weather_4'] == 1
df.loc[where_weather_4, 'weather_3'] = 1
df.drop(columns=['weather_4'], inplace=True)

# Managing datetime
The `datetime` column is internally represented as a series of large, increasing integers. We can make it more meaningful to the model by extracting the hour, which strongly influences the `count` variable. Furthermore, there is a periodicity in the signal (see descriptive statistics). Appealing to Fourier series, we can guess that sinusoidal functions of the time variable may contribute significantly to explaining the target variable.

In [None]:
omega_t = 2*np.pi*df['datetime'].dt.hour/24
for i in range(1, 7):
    df['hour_cos' + str(i)] = np.cos(i*omega_t)
    df['hour_sin' + str(i)] = np.sin(i*omega_t)
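
To see why sinusoidal features suit a periodic variable, the sketch below compares distances between hours in the plane of the first harmonic. The helper `distance` is hypothetical and only serves the illustration.

```python
import numpy as np

hours = np.arange(24)
omega_t = 2 * np.pi * hours / 24
first_harmonic = np.column_stack([np.cos(omega_t), np.sin(omega_t)])

def distance(h1, h2):
    # Euclidean distance between two hours in the (cos, sin) plane
    return np.linalg.norm(first_harmonic[h1] - first_harmonic[h2])

d_midnight = distance(23, 0)   # neighbours across midnight
d_adjacent = distance(11, 12)  # neighbours at midday
d_opposite = distance(0, 12)   # half a day apart

print(d_midnight, d_adjacent, d_opposite)
```

As a raw integer, hour 23 is as far as possible from hour 0, whereas in the sinusoidal encoding the two are as close as any other pair of consecutive hours. The higher harmonics let a linear model fit daily patterns with several peaks.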

# Aggregated features
The idea is to use past targets to predict the current one. In a production scenario, it would be perfectly legitimate to use the `count`s of January to predict those of February. Past counts encode the recent number of rentals, and thus anticipate an increase or decrease of rentals over time. Since data is missing after the $20^{th}$ of each month, we cannot rely on sliding windows shorter than a month. We can still take advantage of the `resample` method: we create a `moving_avg` data frame that holds the average number of daily rentals over the previous month (the monthly total divided by 19). The `shift` method ensures that any observation in month $n$ is provided with the average of month $n-1$. The first month has no predecessor, so we fill its missing value with 0.

In [None]:
moving_avg = df[['datetime', 'count']].resample('1M', on='datetime').sum().shift(1).fillna(0)/19
moving_avg.columns = ['count_1M']
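
The combination of `resample` and `shift` can be followed on a toy example. The four observations below are hypothetical, and the sketch uses the `'MS'` alias (bins labelled by the first day of the month) rather than the month-end alias of the cell above, an assumption made only for the illustration, which does not change the monthly totals.

```python
import pandas as pd

toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2011-01-05', '2011-01-10',
                                '2011-02-03', '2011-03-07']),
    'count': [10, 20, 60, 90],
})

# One row per month, holding the total count of that month
monthly = toy.resample('MS', on='datetime').sum()

# Each month now carries the total of the previous month;
# the first month has no predecessor and is filled with 0
previous = monthly.shift(1).fillna(0)

print(monthly)
print(previous)
```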

The `moving_avg` data frame has one row per month, and we need to propagate these values to the whole data frame `df`. We first convert the index of `moving_avg` to strings of the form YYYYMM.

In [None]:
index = moving_avg.index.to_series()
moving_avg.set_index(index.dt.strftime('%Y%m'), inplace=True)

We then repeat the operation for `df`, creating a `df_yyyymm` key, on the basis of which the `moving_avg` values are duplicated as needed to match the number of rows of `df`.

In [None]:
df_yyyymm = df['datetime'].dt.strftime('%Y%m')
moving_avg = moving_avg.loc[df_yyyymm].reset_index(drop=True)

The new column `count_1M` is then simply obtained by `concat`enation.

In [None]:
df = pd.concat([df, moving_avg], axis=1)
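
The same broadcast of monthly values to individual rows can also be written with `Series.map`. The sketch below is an alternative shown for comparison, not the method used in this notebook, and its monthly values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2011-01-05', '2011-02-03', '2011-02-15']),
})

# Hypothetical monthly values, keyed by year and month
monthly_avg = pd.Series({'201101': 0.0, '201102': 12.5})

# Each row looks up the value attached to its own month
month_key = toy['datetime'].dt.strftime('%Y%m')
toy['count_1M'] = month_key.map(monthly_avg)

print(toy)
```

This variant writes the new column directly into the data frame, so it does not rely on the row order of an intermediate data frame matching that of `df`.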

# Feature selection
The `atemp` variable is almost 100% correlated with `temp`, so we can safely drop it from the features, all the more so since it would inflate the variance of the linear regression coefficients. We have no further use for the `datetime` and `count` columns either, the latter being trivially deduced from the two outputs `casual` and `registered`.

In [None]:
df.drop(columns=['datetime', 'atemp', 'count'], inplace=True)
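
A correlation claim of this kind can be checked with `DataFrame.corr`. The sketch below does so on synthetic data: the columns only mimic `temp`, `atemp` and `humidity`, and the 0.95 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=500)
toy = pd.DataFrame({
    'temp': temp,
    # Synthetic stand-in: a noisy linear function of `temp`
    'atemp': 1.1 * temp + 2 + rng.normal(0, 1, size=500),
    'humidity': rng.uniform(0, 100, size=500),
})

corr = toy.corr().abs()
# Keep the upper triangle only, so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(redundant)
```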

We have a reasonable number of features, without much correlation between them, and thus do not need, *a priori*, further dimension reduction techniques such as Principal Component Analysis.

# Log-transformation
The counts being widely spread, with most values near zero, we apply a log-transformation to the `casual` and `registered` variables to shorten their range of variation. This also favors the modeling of multiplicative effects.

In [None]:
target_columns = ['casual', 'registered']
for target in target_columns:
    df[target] = np.log(1 + df[target])
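
Since the targets are now on a logarithmic scale, predictions have to be mapped back to counts at some point. The sketch below shows the transformation and its inverse on a few hypothetical counts, using `np.log1p` and `np.expm1`.

```python
import numpy as np

counts = np.array([0, 1, 9, 99, 999])

transformed = np.log1p(counts)     # same as np.log(1 + counts)
recovered = np.expm1(transformed)  # inverse transformation

print(transformed)
print(recovered)
```

Adding 1 before taking the logarithm keeps zero counts well defined, since they are mapped to 0.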

# Archiving
Sometimes, feature engineering is so computationally demanding that it is performed offline and stored in backup files. We also perform the train/test split once and for all by creating multiple files.

In [None]:
df_train = df[train_index]
df_test = df[test_index]

We organize the project by creating a `pickles` directory in which we save binary files. The `pathlib` library lets us create the directory if it does not exist, thus automating the whole process. We also write the features and the outcomes into separate files.

In [None]:
save_dir = 'pickles'
pathlib.Path(save_dir).mkdir(parents=True, exist_ok=True)

The advantage of using binary files (pickles) is that they preserve all metadata, such as `dtypes`, whereas `csv` files can only store text.
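
The difference can be observed with a round trip through both formats. The sketch below writes a hypothetical two-row data frame to a temporary directory and reads it back.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2011-01-01', '2011-01-02']),
    'season': pd.Categorical([1, 2], categories=[1, 2, 3, 4]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy.pkl')
    csv_path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.dtypes)  # same dtypes as `toy`
print(from_csv.dtypes)     # dtypes re-inferred from text
```

After the `csv` round trip, the categorical and datetime information has to be restored by hand, whereas the pickle gives back the data frame as it was.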

In [None]:
df_train.drop(columns=target_columns).to_pickle(os.path.join(save_dir, 'train.pkl'))
df_test.drop(columns=target_columns).to_pickle(os.path.join(save_dir, 'test.pkl'))
df_train[target_columns].to_pickle(os.path.join(save_dir, 'y_train.pkl'))
df_test[target_columns].to_pickle(os.path.join(save_dir, 'y_test.pkl'))