# 🚲 Bike Sharing in Boston

<div style="text-align:center">
    <img src="https://storage.googleapis.com/kaggle-datasets-images/1064629/1791350/5d4bc152999636141f1d8d6857dc724b/dataset-cover.jpg?t=2020-12-29-02-40-10">
</div>

<br>

<img src="https://img.shields.io/badge/Made%20on-Kaggle-20beff?style=flat&logo=kaggle&logoColor=white">

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#20BEFF;
           font-size:110%;
           font-family:Brushstroke;
           letter-spacing:0.5px">
    
    <p style="padding: 10px;color:black;"><em>“The race is won by the rider who can suffer the most”</em> – Eddy Merckx</p>
</div>

## Introduction

This notebook addresses a task from the [Bike Sharing in Boston](https://www.kaggle.com/jackdaoud/bluebikes-in-boston) dataset:
> Exploratory Data Analysis (2019 vs 2020), *Has COVID-19 impacted BlueBikes in 2020?*

### Context
> BlueBikes is a bike sharing system born in July 2011 in Metro Boston. It has grown exponentially over the years: <br><br>From 3,203 annual members in 2011 to 21,261 in 2019 <br>From 610 bicycles in 2011 to 3,500+ in 2019 <br><br>The system is simple. A user can pick up a bike at any station dock, ride it for a specific amount of time, and then return it to any station for re-docking.

### Acknowledgments
Thanks to [@jackdaoud](https://www.kaggle.com/jackdaoud) for providing this dataset.

## Table of contents
- Setup
- EDA

# Setup

In this section, the main packages for data manipulation and plotting are loaded, along with the data. The data is then lightly cleaned to make it easier to manipulate!

In [None]:
# Load main packages
import time  # timer
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns  # plot handling
import matplotlib.pyplot as plt  # plot handling

In [None]:
%%time

# Load data
bikes2019 = pd.read_csv('../input/bluebikes-in-boston/bluebikes_tripdata_2019.csv', low_memory=False)  # Boston Bike sharing in 2019
bikes2020 = pd.read_csv('../input/bluebikes-in-boston/bluebikes_tripdata_2020.csv', low_memory=False)  # Boston Bike sharing in 2020

In [None]:
from math import cos, asin, sqrt, pi

def fdistance(series):
    """Haversine distance in km for a single trip (one row of a trips frame)."""
    # Read the coordinates as plain floats so the math functions receive scalars
    lat1 = float(series['start station latitude'])
    lat2 = float(series['end station latitude'])
    lon1 = float(series['start station longitude'])
    lon2 = float(series['end station longitude'])
    p = pi/180  # degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    return 12742 * asin(sqrt(a))  # 12742 km is the Earth's mean diameter

In [None]:
%%time

# Convert to datetime
bikes2019['starttime'] = pd.to_datetime(bikes2019['starttime'])
bikes2019['stoptime'] = pd.to_datetime(bikes2019['stoptime'])
bikes2020['starttime'] = pd.to_datetime(bikes2020['starttime'])
bikes2020['stoptime'] = pd.to_datetime(bikes2020['stoptime'])

# Convert to string
to_string = ['start station name', 'end station name']
bikes2019[to_string] = bikes2019[to_string].astype('string')
bikes2020[to_string] = bikes2020[to_string].astype('string')

# Convert to categorical
to_category = ['usertype', 'gender']
bikes2019[to_category] = bikes2019[to_category].astype('category')
bikes2020[to_category] = bikes2020[to_category].astype('category')

# Add distance variable (haversine formula, result in km)
C = 12742  # Earth's mean diameter in km
P = np.pi / 180  # degrees to radians
for frame in [bikes2019, bikes2020]:
    LAT1 = frame['start station latitude']
    LAT2 = frame['end station latitude']
    LON1 = frame['start station longitude']
    LON2 = frame['end station longitude']
    frame['distance'] = C * np.arcsin(np.sqrt(0.5 - np.cos((LAT2-LAT1)*P)/2 + np.cos(LAT1*P) * np.cos(LAT2*P) * (1-np.cos((LON2-LON1)*P))/2))
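
As a sanity check on the haversine formula used for the `distance` column, here is a minimal standalone sketch. It assumes coordinates in decimal degrees and a spherical Earth with a 12742 km diameter; the function name `haversine_km` and the sample coordinates are illustrative and not part of the original notebook.

```python
import numpy as np

EARTH_DIAMETER_KM = 12742  # 2 * 6371 km (mean Earth radius)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        0.5
        - np.cos(lat2 - lat1) / 2
        + np.cos(lat1) * np.cos(lat2) * (1 - np.cos(lon2 - lon1)) / 2
    )
    return EARTH_DIAMETER_KM * np.arcsin(np.sqrt(a))


# One degree of latitude along a meridian is about 111.19 km on this sphere,
# and the distance from a point to itself is zero.
one_degree = haversine_km(42.0, -71.0, 43.0, -71.0)
same_point = haversine_km(42.36, -71.06, 42.36, -71.06)
print(round(float(one_degree), 2), float(same_point))
```

Checking a known reference distance like this is a cheap way to catch a wrong degree-to-radian factor before the column is used in later plots.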

In [None]:
print(f"For {2019} there are {bikes2019.shape[0]:,} rows of data.")
print(f"For {2020} there are {bikes2020.shape[0]:,} rows of data.")
print(f"> {2020} has {bikes2019.shape[0]-bikes2020.shape[0]:,} fewer rows than {2019}")

In [None]:
print(f"For {2019} there are {bikes2019.isna().sum().sum():,} missing values.")
print(f"For {2020} there are {bikes2020.isna().sum().sum():,} missing values.")

In [None]:
(bikes2020.isna().sum()[bikes2020.isna().sum() > 0] / bikes2020.shape[0] * 100).apply(lambda x: (str(round(x, 2))+'%').rjust(5)).rename('Missing value percentage')

> It seems that in 2020 some users did not provide their personal information...
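
The per-column share of missing values can also be computed with `isna().mean()`, which may be easier to read than dividing sums by the row count. This is a minimal sketch on a made-up toy frame (the values are invented for illustration, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy stand-in for a trips table; the values are made up for illustration.
toy = pd.DataFrame({
    "tripduration": [300, 450, 600, 720],
    "birth year": [1990, np.nan, np.nan, 1985],
    "gender": [1, np.nan, np.nan, np.nan],
})

# isna().mean() gives the fraction of missing values per column.
missing_pct = toy.isna().mean().mul(100)
missing_pct = missing_pct[missing_pct > 0].round(2).rename("Missing value percentage")
print(missing_pct)
```

Keeping the result numeric (rather than formatting it as strings) lets it be sorted or compared between years afterwards.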

# EDA

## Univariate analysis

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 6))

sns.barplot(x=['2019', '2020'], y=[bikes2019['bikeid'].unique().shape[0], bikes2020['bikeid'].unique().shape[0]], palette='icefire', ax=ax[0])
ax[0].title.set_text('No. of bikes used by year')

stations_2019 = len(set(bikes2019['start station id'].unique()).intersection(bikes2019['end station id'].unique()))
stations_2020 = len(set(bikes2020['start station id'].unique()).intersection(bikes2020['end station id'].unique()))
sns.barplot(x=['2019', '2020'], y=[stations_2019, stations_2020], palette='coolwarm', ax=ax[1])
ax[1].title.set_text('No. of stations by year')

plt.show()

<div class="alert alert-block alert-danger">📉 The number of distinct bikes used <b>decreased</b> between 2019 and 2020</div>

<br>

<div class="alert alert-block alert-info">📈 The number of unique stations <b>increased</b> between 2019 and 2020</div>

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 6), sharey=True)

sns.countplot(x='usertype', data=bikes2019, ax=ax[0], palette='mako')
ax[0].title.set_text('No. of trips by user type in 2019')

sns.countplot(x='usertype', data=bikes2020, ax=ax[1], palette='mako')
ax[1].title.set_text('No. of trips by user type in 2020')

plt.show()

<div class="alert alert-block alert-warning">ℹ️ The number of Customer trips <b>stays roughly the same</b> between 2019 and 2020</div>

<br>

<div class="alert alert-block alert-danger">📉 The number of Subscriber trips <b>decreased</b> between 2019 and 2020. This may be a direct effect of subscription cancellations due to <b><em>COVID-19</em></b></div>

In [None]:
plt.figure(figsize=(20, 8))


for frame, color, year in zip([bikes2019, bikes2020], ['red', 'purple'], [2019, 2020]):
    duration = frame['tripduration']  # raw durations, no log transform yet
    sns.kdeplot(x=duration, label=str(year), color=color, fill=True, alpha=.3)
    max_frame = duration.max()
    median_frame = duration.median()
    mean_frame = duration.mean()
    
    # Max
    plt.plot(max_frame, 0.005, markersize=16, marker=7, color=color)
    plt.text(max_frame, 0.015, f'Max({year})', color=color)
    
    # Median
    plt.plot([median_frame, median_frame], [0, 0.7], linestyle='--', color=color, label=f'median {year}', alpha=0.4)
    
    # Mean
    plt.plot([mean_frame, mean_frame], [0, 0.7], linestyle='-', color=color, label=f'mean {year}')

plt.title('Trip duration distribution 2019 vs 2020')
plt.legend(loc="upper left")
plt.show()

<div class="alert alert-block alert-danger">⚠️ The <code>tripduration</code> distribution is heavily <b>skewed</b>, so very little is visible on a linear scale. Let's plot it on a log scale instead...</div>
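
To illustrate why a log transform helps with a right-skewed variable, here is a small sketch on synthetic data. The durations are drawn from a log-normal distribution chosen only for illustration; they are not the BlueBikes trips.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed durations in seconds (log-normal); not real trip data.
rng = np.random.default_rng(0)
durations = pd.Series(rng.lognormal(mean=6.5, sigma=1.0, size=10_000))

raw_skew = durations.skew()
log_skew = np.log(durations).skew()
print(f"skewness before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A skewness close to zero after the transform suggests the log scale spreads the bulk of the data evenly enough for a density plot to be readable.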

In [None]:
plt.figure(figsize=(20, 8))


for frame, color, year in zip([bikes2019, bikes2020], ['red', 'purple'], [2019, 2020]):
    log_frame = np.log(frame['tripduration'])  # vectorized log of the durations
    sns.kdeplot(x=log_frame, label=str(year), color=color, fill=True, alpha=.3)
    max_frame = log_frame.max()
    median_frame = log_frame.median()
    mean_frame = log_frame.mean()
    
    # Max
    plt.plot(max_frame, 0.005, markersize=16, marker=7, color=color)
    plt.text(max_frame, 0.015, f'LogMax({year})', color=color)
    
    # Median
    plt.plot([median_frame, median_frame], [0, 0.5], linestyle='--', color=color, label=f'median {year}', alpha=0.4)
    
    # Mean
    plt.plot([mean_frame, mean_frame], [0, 0.5], linestyle='-', color=color, label=f'mean {year}')

plt.title('Log scale trip duration distribution 2019 vs 2020')
plt.legend(loc="upper left")
plt.show()

<div class="alert alert-block alert-info">ℹ️ It seems that <b>2019 had longer trip durations than 2020</b></div>

<div class="alert alert-block alert-warning">ℹ️ Overall, though, the <b>trip duration distribution seems to stay about the same between 2019 and 2020</b></div>

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 6), sharey=True)

sns.countplot(x=bikes2019['gender'].astype('string').fillna('Unknown'), ax=ax[0], palette=['#7c80d9', '#7ddb8a', '#dbbe7b'])
ax[0].title.set_text('No. of trips by user gender in 2019')

sns.countplot(x=bikes2020['gender'].astype('string').fillna('Unknown'), ax=ax[1], palette=['#cfb6ca', '#7c80d9', '#7ddb8a', '#dbbe7b'])
ax[1].title.set_text('No. of trips by user gender in 2020')

plt.show()

<div class="alert alert-block alert-danger">📉 The number of <b>users reporting their gender has decreased</b>; for 2020 the majority of the gender data is unknown...</div>

<br>

<div class="alert alert-block alert-warning">ℹ️ It seems that <b>the distribution of known genders did not change much</b><br><span style="color:green; font-weight:bold">Count(Gender:1) > Count(Gender:2) > Count(Gender:0)</span></div>

In [None]:
plt.figure(figsize=(20, 8))


for frame, color, year in zip([bikes2019, bikes2020], ['red', 'purple'], [2019, 2020]):
    birth_year = frame['birth year']
    sns.kdeplot(x=birth_year, label=str(year), color=color, fill=True, alpha=.3)
    min_frame = birth_year.min()
    median_frame = birth_year.median()
    mean_frame = birth_year.mean()
    
    # Min
    plt.plot(min_frame, 0.0, markersize=16, marker='o', color=color, label=f'min({year})')
    
    # Median
    plt.plot([median_frame, median_frame], [0, 0.1], linestyle='--', color=color, label=f'median({year})', alpha=0.4)
    
    # Mean
    plt.plot([mean_frame, mean_frame], [0, 0.1], linestyle='-', color=color, label=f'mean({year})')

plt.title('Birth year distribution 2019 vs 2020')
plt.legend(loc="upper left")
plt.show()

<div class="alert alert-block alert-warning">ℹ️ It seems that:
    <ul>
        <li>In both 2019 and 2020, the users' ages are <b>similar</b></li>
        <li>Both years show peaks of users born around <b>1970</b> and <b>1990</b>, so many users were around <b>50yo</b> and <b>30yo</b></li>
        <li>Surprisingly, some users have a birth year around <b>1890</b>, which would make them about <b>130yo</b>. Is that an outlier, or mistyped information?</li>
        <li>There are not many <b>20yo</b> users</li>
    </ul>
</div>

In [None]:
for frame, year in zip([bikes2019, bikes2020], [2019, 2020]):
    print(f"{round(frame[frame['birth year'] > year-30].shape[0] / frame.shape[0] * 100)}% of users are aged less than 30yo in {year}.")
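
The share of young riders depends on how missing birth years are handled. The sketch below computes the under-30 share among rows with a known birth year only; the helper name `share_under` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def share_under(frame, year, age=30, col="birth year"):
    """Percentage of rows with a known birth year whose rider is younger than `age`."""
    known = frame[col].dropna()
    # Born strictly after (year - age) means the rider has not reached `age` yet.
    # Only the birth year is known, so this is an approximation.
    return (known > year - age).mean() * 100


# Made-up birth years; NaN marks riders who did not report one.
toy = pd.DataFrame({"birth year": [1995, 2000, 1970, 1985, np.nan]})
print(f"{share_under(toy, 2019):.0f}% of riders with a known birth year are under 30 in 2019.")
```

Dropping the missing values first avoids diluting the percentage when many rows have no birth year, which matters for a year with a lot of unknown personal data.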

In [None]:
plt.figure(figsize=(20, 6))

for frame, color, year in zip([bikes2019, bikes2020], ['red', 'purple'], [2019, 2020]):
    sns.kdeplot(x=frame[frame['distance'] < 10000]['distance'], color=color, label=str(year))
    
plt.title('Distance travelled in 2019 vs 2020')
plt.legend(loc='upper left')
plt.show()

<div class="alert alert-block alert-danger">⚠️ I plotted the distance once before this and identified outliers: some distances were around <b>10 000</b>, which is not possible for a bike trip...</div>

<div class="alert alert-block alert-warning">ℹ️ In general, the distances travelled by users in 2019 and 2020 are <b>nearly the same</b>! In fact, <b>most users travel less than 5km</b>.</div>

In [None]:
outlier = 10000

for frame, year in zip([bikes2019, bikes2020], [2019, 2020]):
    n = frame.query(f'distance > {outlier}').shape[0]
    print(f"There are {n} outliers in {year}.")

In [None]:
bikes2019.query(f'distance > {outlier}')

<div class="alert alert-block alert-danger">ℹ️ Expand the output above to see the rows containing the outliers. It seems that the distance outliers are caused by the <b>Mobile Temporary Station 2</b> entries, because these stations don't have a <b>fixed location</b>.</div>
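
One way to back up this reading is to count which station names appear in the flagged trips. The sketch below uses made-up rows, with column names mirroring the dataset; the distances and the `A St` / `B St` names are invented.

```python
import pandas as pd

# Made-up trips; the station names and distances are illustrative only.
toy = pd.DataFrame({
    "start station name": ["A St", "Mobile Temporary Station 2", "B St", "A St"],
    "end station name": ["B St", "A St", "Mobile Temporary Station 2", "B St"],
    "distance": [1.2, 10500.0, 10480.0, 2.4],
})

outlier = 10000
flagged = toy[toy["distance"] > outlier]

# Count how often each station appears at either end of a flagged trip.
involved = pd.concat([flagged["start station name"], flagged["end station name"]])
station_counts = involved.value_counts()
print(station_counts)
```

If a single station dominates the counts on the real data, that supports treating its coordinates as placeholders rather than real locations.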

In [None]:
print(f"In {2020}, there are {len(bikes2020['postal code'].unique())} unique postal codes.")

In [None]:
def get_unique_stations(frame):
    cols1 = ['start station name', 'start station latitude', 'start station longitude']
    cols2 = ['end station name', 'end station latitude', 'end station longitude']
    
    one = frame[cols1].drop_duplicates()
    one.columns = ['name', 'lat', 'long']
    two = frame[cols2].drop_duplicates()
    two.columns = ['name', 'lat', 'long']
    
    return pd.concat([one, two],axis=0).drop_duplicates()

In [None]:
import plotly.express as px

stations2019 = get_unique_stations(bikes2019)
stations2020 = get_unique_stations(bikes2020)

# Label each station by the year(s) in which it appears
intersected = set(stations2019['name']).intersection(stations2020['name'])
stations2019['year'] = stations2019['name'].apply(lambda x: 'Both' if x in intersected else 'Only 2019')
stations2020['year'] = stations2020['name'].apply(lambda x: 'Both' if x in intersected else 'Only 2020')
stations = pd.concat([stations2019, stations2020], axis=0)

fig = px.scatter_mapbox(
    stations,
    lat="lat",
    lon="long",
    hover_name="name",
    hover_data=["name", "long", "lat"],
    color="year",
    color_discrete_sequence=['red', 'green', 'blue'],  # 'year' is categorical, so a discrete palette applies
    zoom=9.5,
    height=400,
    center={'lat':42.341, 'lon':-71.089},
    title='Bike stations in 2019 and 2020'
)

fig.update_layout(
    mapbox_style="white-bg",
    mapbox_layers=[
        {
            "below": 'traces',
            "sourcetype": "raster",
            "sourceattribution": "United States Geological Survey",
            "source": [
                "https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/tile/{z}/{y}/{x}"
            ]
        }
])


fig.show()

<div class="alert alert-block alert-warning">ℹ️ It seems that:
    <ul>
        <li><b>Most of the stations are used in both 2019 and 2020</b></li>
        <li>The stations <b>only used in 2020</b> are spread around the outskirts of the city.</li>
        <li>The stations <b>only used in 2019</b> are closer to the city center.</li>
    </ul>
</div>
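
To put numbers on these observations, the stations can be counted per category. A minimal sketch, assuming one frame of station names per year (the names here are made up), uses an outer merge with `indicator=True`:

```python
import pandas as pd

# Made-up station names for each year.
names2019 = pd.DataFrame({"name": ["A St", "B St", "C St"]})
names2020 = pd.DataFrame({"name": ["B St", "C St", "D St"]})

# An outer merge with indicator=True records where each name was found.
presence = names2019.merge(names2020, on="name", how="outer", indicator=True)
presence["year"] = presence["_merge"].astype(str).map(
    {"left_only": "Only 2019", "right_only": "Only 2020", "both": "Both"}
)
counts = presence["year"].value_counts()
print(counts)
```

The merge yields one row per station name, so each station is counted once even when it appears in both years.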

## Multivariate analysis

In [None]:
all_year = pd.concat([bikes2019, bikes2020], axis=0)

plt.figure(figsize=(20, 6))
plt.plot(all_year['starttime'], all_year['tripduration'])
plt.title('Trip duration from 2019 to 2020')
plt.show()

<div class="alert alert-block alert-warning">ℹ️ It seems that users took <b>longer rides between May and September 2019</b>, possibly because this period corresponds to the <b>summer holidays</b>.</div>

<br>

Let's compare 2019 and 2020 side by side...

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(22, 8), sharey=True)

for ax, color, frame, year in zip(axs, ['orangered', 'steelblue'], [bikes2019, bikes2020], [2019, 2020]):
    ax.plot(frame['starttime'], frame['tripduration'], color=color)
    ax.title.set_text(f'Trip duration in {year}')
plt.show()

<div class="alert alert-block alert-danger">📉 Bike trips are clearly <b>shorter overall in 2020</b> than in 2019. This may be a direct effect of <b><em>COVID-19</em></b>, because people were under lockdown.</div>
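
Line plots of raw durations are dominated by a few very long trips, so a per-month summary can make the comparison easier to check. This is a sketch on made-up trips (the timestamps and durations are invented), grouping by calendar month and reporting the median next to the mean:

```python
import pandas as pd

# Made-up trips: start times and durations in seconds.
toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2019-07-03 08:00", "2019-07-15 18:30", "2019-07-20 12:00",
        "2020-04-02 09:10", "2020-04-18 17:45", "2020-04-25 11:00",
    ]),
    "tripduration": [900, 1200, 86400, 600, 700, 800],
})

# Group by calendar month; the median is robust to the single day-long trip.
month = toy["starttime"].dt.to_period("M")
monthly = toy.groupby(month)["tripduration"].agg(["median", "mean", "count"])
print(monthly)
```

On this toy data the one day-long trip inflates the July mean far above the median, which is the kind of distortion the raw line plot is also subject to.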

In [None]:
cdt = 'distance < 10000'
subset = all_year.query(cdt)  # filter once and reuse the result

g = sns.relplot(
    y=np.log(subset['tripduration']),
    x=subset['distance'],
    hue=subset['gender'],
    style=subset['usertype'],
    col=subset['year'],
    kind='scatter',
    height=10,
    palette='bright'
)

g

<b><ins>Observations</ins></b>:
- **gender** and **usertype** seem to keep the same distribution
- <span style="color:orange; font-weight:bold">Users(gender=1)</span>'s rides are shorter than <span style="color:blue; font-weight:bold">Users(gender=0)</span>'s
- <b>Users(usertype=Subscriber)</b> take longer rides than <b>Users(usertype=Customer)</b>!
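
Group differences like these are hard to read off a dense scatter plot, so a small summary table can help verify them. The sketch below uses arbitrary made-up durations (they say nothing about the real data) and assumes a `year` column alongside `usertype` and `tripduration`:

```python
import pandas as pd

# Arbitrary made-up trips; durations in seconds.
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "usertype": ["Subscriber", "Subscriber", "Customer",
                 "Subscriber", "Customer", "Customer"],
    "tripduration": [1800, 1500, 600, 2400, 800, 700],
})

# Median duration per user type (rows) and year (columns).
summary = toy.pivot_table(
    index="usertype", columns="year", values="tripduration", aggfunc="median"
)
print(summary)
```

The same pattern works with `gender` as the index to compare ride durations across gender codes.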

Work in progress...