# Playing with data: bike sharing

## 1. Bike sharing systems

Bike sharing has rapidly become a popular means of public transport in modern cities. Its benefits are numerous: improved health and fitness for users, less congestion on city road networks, and less pollution than the more conventional means of public transport that rely on fossil fuels. Cycling is also widely considered to be fun! :)

In an effort to meet these needs and promote these benefits, many companies have sprung up that offer a systematised bike-sharing service. All have a fleet of bicycles available to customers, but deployment comes in two forms: from fixed stations with locking docking points, or on a "pick-up-drop-off" principle, where the user parks the bike in any accessible location for the next user to find. Whilst the flexibility of the latter offers many benefits, we will focus on the former in the experiments that follow. We will use the [Capital Bikeshare data](https://s3.amazonaws.com/capitalbikeshare-data/index.html), covering the metropolitan area of the District of Columbia, USA, which are freely available from Capital's website.

The biggest problem for users of bike-sharing systems such as Capital's is finding a bike or a free docking point when they need one. For example, during the morning rush hour, commuters might find that earlier users have already snapped up all the bikes. After eventually locating a bike and cycling to work, they might find that all the stations near their workplace have already been filled by earlier commuters. This can make the system quite tricky and annoying to use! And of course the same problems apply to the evening homeward commute...

We will try to find patterns in usage by applying techniques from machine learning (ML). Moreover, we will attempt to predict demand (for bikes and for free docking spaces) at the stations, which a company could use to reposition bikes between stations so as to better meet the users' requirements.

## 2. The data

The data available from Capital comprise a fairly large set (there are now roughly 3 x 10^6 bike journeys every year using their system), so there is some re-engineering required to make the ML techniques more tractable.

### 2.1 Importing

We assume the full data for the years 2013 to 2017 (inclusive) have been downloaded from Capital's webpage and saved in a local folder. We import them into pandas DataFrames.

The data are split into yearly quarters. There are 5 years' data available, and 4 quarters per year, meaning we have 20 CSV files to read. This is best accomplished with a for-loop that saves each DataFrame in a dictionary, keyed by year and quarter.

In [12]:
# We start by setting up our Python workbook. We'll use pandas for manipulating our data; it relies on numpy, which is
# itself important for maths in Python. For plotting purposes we'll use the third-party library matplotlib.
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Now we import our data, one CSV file per quarter. This will likely take a few minutes.
data_dir = r"C:\Users\sneak\Documents\Python Scripts\Bike_sharing"
num_years = 5
start = 3  # last digit of the first year, 2013
data = {}
for n in range(num_years):
    for m in range(4):
        label = f"201{n + start}Q{m + 1}"  # e.g. "2013Q1"
        filename = label + "-capitalbikeshare-tripdata.csv"
        data["Y" + label] = pd.read_csv(os.path.join(data_dir, filename))
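
Since only four columns are kept later on, one way to reduce memory use and loading time might be to read just those columns and parse the dates during the import. The sketch below tries this on a small in-memory CSV rather than the real files: the rows, the column order and the station names are made up, and it assumes the header names match the features listed in Section 2.2. `usecols` and `parse_dates` are standard parameters of `pd.read_csv`.

```python
import io

import pandas as pd

# Toy stand-in for one quarterly file; rows and column order are made up.
toy_csv = io.StringIO(
    "Duration,Start date,End date,Start station number,Start station,"
    "End station number,End station,Bike number,Member type\n"
    "689,2013-01-01 00:03:55,2013-01-01 00:15:24,31101,Stn A,31106,Stn B,W00001,Member\n"
    "700,2013-01-01 00:04:39,2013-01-01 00:16:19,31236,Stn C,31304,Stn D,W00002,Casual\n"
)

# Read only the four key columns and convert the two date columns on the way in.
keep = ["Start date", "End date", "Start station number", "End station number"]
trips = pd.read_csv(toy_csv, usecols=keep, parse_dates=["Start date", "End date"])

print(trips.dtypes)
```

With the real files, the same two arguments could be passed to the `pd.read_csv` call inside the loop, in which case the later `drop` step would no longer be needed.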

In [27]:
# Having imported the data as separate dataframes, we will now combine them into a single dataframe. Later, for testing purposes,
# we'll import what data are available from 2018 (as of today, 2018/11/12, the year is not yet over), and see how our predictions
# compare.

all_data = pd.concat(data.values(),axis=0).reset_index(drop=True)

### 2.2 Engineering

Let's get down to some data re-engineering. The most important features for our purposes are:
Start date, End date, Start station number and End station number.

The remaining features are less useful to us. Bike number and Member type might be interesting for other analyses, but they hold little a-priori value for predicting usage patterns. Duration and the Start and End station names carry no extra information: Duration is the difference between End date and Start date, and each station name corresponds to a station number.
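
As a minimal sketch of recovering Duration from the key features, the snippet below uses the first two journeys of 2013. The result is expressed in seconds, which is an assumption: the unit of the original Duration column is not stated here.

```python
import pandas as pd

# First two journeys of 2013, with timestamps as plain strings.
trips = pd.DataFrame({
    "Start date": ["2013-01-01 00:03:55", "2013-01-01 00:04:39"],
    "End date": ["2013-01-01 00:15:24", "2013-01-01 00:16:19"],
})

# Convert to datetimes, then take the difference in seconds (assumed unit).
start_times = pd.to_datetime(trips["Start date"])
end_times = pd.to_datetime(trips["End date"])
trips["Duration"] = (end_times - start_times).dt.total_seconds()

print(trips)
```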

After selecting and engineering our features we will then produce two new dataframes, indexed by Start and End dates, and resampled over some small sub-interval of time. We'll use 1-hour sub-intervals.
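
One possible way to build the two hourly tables is to group the journeys by hour and by station and count them. This is only a sketch on four toy journeys (the last one is made up), and not necessarily the approach taken in the rest of this notebook; the names `departures` and `arrivals` are illustrative.

```python
import pandas as pd

# Four toy journeys; the last row is made up so that a second hour appears.
trips = pd.DataFrame({
    "Start date": pd.to_datetime([
        "2013-01-01 00:03:55", "2013-01-01 00:14:20",
        "2013-01-01 00:14:35", "2013-01-01 01:05:00",
    ]),
    "End date": pd.to_datetime([
        "2013-01-01 00:15:24", "2013-01-01 00:33:58",
        "2013-01-01 00:33:50", "2013-01-01 01:20:00",
    ]),
    "Start stn": [101, 41, 41, 101],
    "End stn": [106, 47, 47, 41],
})

# Bikes leaving each station per hour: rows are hours, columns are stations.
departures = (
    trips.groupby([pd.Grouper(key="Start date", freq="1h"), "Start stn"])
    .size()
    .unstack(fill_value=0)
)

# Bikes arriving at each station per hour, binned by End date instead.
arrivals = (
    trips.groupby([pd.Grouper(key="End date", freq="1h"), "End stn"])
    .size()
    .unstack(fill_value=0)
)

print(departures)
print(arrivals)
```

Changing `freq` would give coarser or finer sub-intervals without touching the rest of the code.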

In [28]:
all_data = all_data.drop(["Duration","Start station","End station","Bike number","Member type"],axis=1)

In [45]:
# Now we'll re-label our stations.
# First we simplify the column names:
all_data.columns = ["Start date", "End date", "Start stn", "End stn"]
# Next we make sure the station numbers are integers:
all_data["Start stn"] = all_data["Start stn"].astype(int)
all_data["End stn"] = all_data["End stn"].astype(int)
# Now we find the minimum value over both columns, so that no label can become negative:
stn_min = min(all_data["Start stn"].min(), all_data["End stn"].min())
# Then we subtract it from all the station labels:
all_data["Start stn"] -= stn_min
all_data["End stn"] -= stn_min

| | Start date | End date | Start stn | End stn |
|---|---|---|---|---|
| 0 | 2013-01-01 00:03:55 | 2013-01-01 00:15:24 | 101 | 106 |
| 1 | 2013-01-01 00:04:39 | 2013-01-01 00:16:19 | 236 | 304 |
| 2 | 2013-01-01 00:10:01 | 2013-01-01 00:16:06 | 257 | 239 |
| 3 | 2013-01-01 00:12:55 | 2013-01-01 00:17:47 | 614 | 612 |
| 4 | 2013-01-01 00:13:24 | 2013-01-01 00:17:07 | 239 | 214 |
| 5 | 2013-01-01 00:14:20 | 2013-01-01 00:33:58 | 41 | 47 |
| 6 | 2013-01-01 00:14:35 | 2013-01-01 00:33:50 | 41 | 47 |
| 7 | 2013-01-01 00:19:39 | 2013-01-01 00:26:15 | 613 | 619 |
| 8 | 2013-01-01 00:21:35 | 2013-01-01 00:26:47 | 400 | 602 |
| 9 | 2013-01-01 00:22:17 | 2013-01-01 00:26:19 | 111 | 245 |
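
To illustrate the offset re-labelling on a small scale, here is a sketch with made-up raw station numbers (the raw values are hypothetical and chosen only for illustration). It includes the case where the smallest number appears only as an end station, which is why the minimum is taken over both columns in this sketch.

```python
import pandas as pd

# Hypothetical raw station numbers, stored as strings as they might be after import.
raw = pd.DataFrame({
    "Start stn": ["31101", "31236", "31041"],
    "End stn": ["31106", "31304", "31000"],
})

start_stn = raw["Start stn"].astype(int)
end_stn = raw["End stn"].astype(int)

# Use one shared offset so that both columns stay comparable and non-negative.
stn_min = min(start_stn.min(), end_stn.min())
relabelled = pd.DataFrame({
    "Start stn": start_stn - stn_min,
    "End stn": end_stn - stn_min,
})

print(relabelled)
```

Note that this keeps any gaps between station numbers; the labels are shifted, not made contiguous.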
