## Analyzing Bike-Sharing Service Hubway Data using SQL

## Objective

This project uses Python's `sqlite3` library to connect to the Hubway bike-sharing database, extract data with SQL queries, and analyze the results.

## Data Set

The dataset comes from the bike-sharing service [Hubway](https://www.thehubway.com/) and includes data on over 1.5 million trips made with the service.

The database has two tables, trips and stations.

Here are the columns of the trips table:

| Column        | Description                                                                                                        |
|---------------|--------------------------------------------------------------------------------------------------------------------|
| id            | A unique integer that serves as a reference for each trip                                                          |
| duration      | The duration of the trip, measured in seconds                                                                      |
| start_date    | The date and time the trip began                                                                                   |
| start_station | An integer that corresponds to the id column in the stations table for the station the trip started at             |
| end_date      | The date and time the trip ended                                                                                   |
| end_station   | The ‘id’ of the station the trip ended at                                                                          |
| bike_number   | Hubway’s unique identifier for the bike used on the trip                                                           |
| sub_type      | The subscription type of the user. "Registered" for users with a membership, "Casual" for users without a membership |
| zip_code      | The zip code of the user (only available for registered members)                                                   |
| birth_date    | The birth year of the user (only available for registered members)                                                 |
| gender        | The gender of the user (only available for registered members)                                                     |

## Analysis

With this information from the trips table, here are some interesting questions to answer:

- What was the duration of the longest trip?
- How many trips were taken by ‘registered’ users?
- What was the average trip duration?
- Do registered or casual users take longer trips?
- Which bike was used for the most trips?
- What is the average duration of trips by users over the age of 30?

Combining the trips table with the stations table lets us answer a few more questions:

- Which station is the most frequent starting point?
- Which stations are most frequently used for round trips?
- How many trips start and end in different municipalities?

## Reading In the Data

In [1]:
import sqlite3
import pandas as pd

db = sqlite3.connect('hubway.db')

def run_query(query):
    return pd.read_sql_query(query,db)
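
Before querying, it can help to confirm which tables and columns the database actually contains. The sketch below is a minimal illustration that uses plain `sqlite3` against a toy in-memory database; `demo_db` and its rows are invented stand-ins for `hubway.db`, so only the two lookup queries are meant to carry over.

```python
import sqlite3

# Toy in-memory stand-in for hubway.db; the rows are invented for illustration.
demo_db = sqlite3.connect(":memory:")
demo_db.executescript("""
CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER, sub_type TEXT);
CREATE TABLE stations (id INTEGER PRIMARY KEY, station TEXT, municipality TEXT);
INSERT INTO trips (duration, sub_type) VALUES (9, 'Registered'), (220, 'Casual');
""")

# sqlite_master lists every object stored in the database.
table_names = [
    row[0]
    for row in demo_db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    )
]

# PRAGMA table_info returns one row per column; index 1 holds the column name.
trip_columns = [row[1] for row in demo_db.execute("PRAGMA table_info(trips);")]

print(table_names)
print(trip_columns)
```

The same two queries can be passed to `run_query` to get the result as a DataFrame instead.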

### What was the duration of the longest trip?

In [2]:
query = 'SELECT * FROM trips LIMIT 5;'
run_query(query)

Unnamed: 0,id,duration,start_date,start_station,end_date,end_station,bike_number,sub_type,zip_code,birth_date,gender
0,1,9,2011-07-28 10:12:00,23,2011-07-28 10:12:00,23,B00468,Registered,'97217,1976.0,Male
1,2,220,2011-07-28 10:21:00,23,2011-07-28 10:25:00,23,B00554,Registered,'02215,1966.0,Male
2,3,56,2011-07-28 10:33:00,23,2011-07-28 10:34:00,23,B00456,Registered,'02108,1943.0,Male
3,4,64,2011-07-28 10:35:00,23,2011-07-28 10:36:00,23,B00554,Registered,'02116,1981.0,Female
4,5,12,2011-07-28 10:37:00,23,2011-07-28 10:37:00,23,B00554,Registered,'97214,1983.0,Female


In [3]:
query = '''
SELECT duration 
FROM trips
ORDER BY duration DESC
LIMIT 1;
'''

run_query(query)

Unnamed: 0,duration
0,9999


### How many trips were taken by ‘registered’ users?

In [4]:
query = '''
SELECT COUNT(*)
FROM trips
WHERE sub_type = 'Registered';
'''

run_query(query)

Unnamed: 0,COUNT(*)
0,1105192


Renaming the `COUNT(*)` column with an alias makes the output more readable:

In [5]:
query = '''
SELECT COUNT(*) AS "Total Trips by Registered Users"
FROM trips
WHERE sub_type = 'Registered';
'''

run_query(query)

Unnamed: 0,Total Trips by Registered Users
0,1105192


### What was the average trip duration?

In [6]:
query = '''
SELECT AVG(duration) AS "Average Duration"
FROM trips;
'''

run_query(query)

Unnamed: 0,Average Duration
0,912.409682


The average trip duration is about 912 seconds, or roughly 15 minutes. This makes sense: Hubway charges extra fees for trips over 30 minutes because the service is designed for short, one-way trips.
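
The conversion to minutes, and the share of trips that cross the 30-minute mark, can both be computed inside the query. The following is a sketch on a toy in-memory table, not on the real data: `demo_db` and the four durations are invented, and the 1800-second threshold is simply 30 minutes expressed in seconds.

```python
import sqlite3

# Toy in-memory table; durations (in seconds) are invented for illustration.
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER);")
demo_db.executemany(
    "INSERT INTO trips (duration) VALUES (?);",
    [(600,), (900,), (1200,), (2400,)],
)

# duration > 1800 evaluates to 0 or 1 in SQLite, so AVG gives the share of long trips.
avg_minutes, share_over_30_min = demo_db.execute(
    """
    SELECT AVG(duration) / 60.0,
           AVG(duration > 1800)
    FROM trips;
    """
).fetchone()

print(avg_minutes)        # average duration in minutes
print(share_over_30_min)  # fraction of trips longer than 30 minutes
```

Running the same `SELECT` through `run_query` on the real trips table would show how many rides actually exceed the 30-minute limit.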

### Do registered or casual users take longer trips?

In [7]:
query = '''
SELECT sub_type, AVG(duration) AS "Average Duration"
FROM trips
GROUP BY sub_type;
'''

run_query(query)

Unnamed: 0,sub_type,Average Duration
0,Casual,1519.643897
1,Registered,657.026067


That’s quite a difference. On average, registered users take trips that last around 11 minutes, whereas casual users spend just over 25 minutes per ride. Registered users are likely taking shorter, more frequent trips, possibly as part of their commute to work. Casual users, on the other hand, spend more than twice as long per trip. It’s possible that casual users tend to come from demographics (tourists, for example) that are more inclined to take longer trips to make sure they get around and see all the sights.
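
As a quick check on the comparison, the two averages from the query output can be converted to minutes and divided. This is plain arithmetic on the reported figures; the variable names are illustrative only.

```python
# Average durations in seconds, copied from the query output above.
casual_seconds = 1519.643897
registered_seconds = 657.026067

casual_minutes = casual_seconds / 60          # about 25.3 minutes
registered_minutes = registered_seconds / 60  # about 11.0 minutes
ratio = casual_seconds / registered_seconds   # about 2.3 times as long

print(round(casual_minutes, 1), round(registered_minutes, 1), round(ratio, 1))
```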

### Which bike was used for the most trips?

In [8]:
query = '''
SELECT bike_number AS "Bike Number", COUNT(*) AS "Number of Trips"
FROM trips
GROUP BY bike_number
ORDER BY COUNT(*) DESC
LIMIT 1;
'''

run_query(query)

Unnamed: 0,Bike Number,Number of Trips
0,B00490,2120


### What is the average duration of trips by users over the age of 30?

In [9]:
query = '''
SELECT AVG(duration)
FROM trips
WHERE (2017 - birth_date) > 30;
'''

run_query(query)

Unnamed: 0,AVG(duration)
0,923.014685
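
The query above measures age relative to a fixed year (2017), while the sample rows show trips dated 2011. One possible alternative, sketched here on invented rows, is to compute each rider's age in the year of the trip with SQLite's `strftime`. The table `demo_db` and its three rows are assumptions for illustration, and this is not the query that produced the result above.

```python
import sqlite3

# Toy in-memory table; the rows are invented for illustration.
demo_db = sqlite3.connect(":memory:")
demo_db.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, duration INTEGER, "
    "start_date TEXT, birth_date REAL);"
)
demo_db.executemany(
    "INSERT INTO trips (duration, start_date, birth_date) VALUES (?, ?, ?);",
    [
        (600, "2011-07-28 10:12:00", 1976.0),   # 35 in the trip year
        (1200, "2012-05-01 08:00:00", 1985.0),  # 27 in the trip year, 32 relative to 2017
        (900, "2013-09-15 17:30:00", None),     # no birth year recorded
    ],
)

# Age relative to a fixed year, as in the query above.
fixed_year_avg = demo_db.execute(
    "SELECT AVG(duration) FROM trips WHERE (2017 - birth_date) > 30;"
).fetchone()[0]

# Age relative to the year in which each trip started.
trip_year_avg = demo_db.execute(
    """
    SELECT AVG(duration)
    FROM trips
    WHERE (CAST(strftime('%Y', start_date) AS INTEGER) - birth_date) > 30;
    """
).fetchone()[0]

print(fixed_year_avg, trip_year_avg)
```

In both versions, rows with a missing `birth_date` drop out because a comparison against NULL is never true.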


## Exploring the stations Table

In [10]:
query = '''
SELECT *
FROM stations
LIMIT 5;
'''
run_query(query)

Unnamed: 0,id,station,municipality,lat,lng
0,3,Colleges of the Fenway,Boston,42.340021,-71.100812
1,4,Tremont St. at Berkeley St.,Boston,42.345392,-71.069616
2,5,Northeastern U / North Parking Lot,Boston,42.341814,-71.090179
3,6,Cambridge St. at Joy St.,Boston,42.361285,-71.06514
4,7,Fan Pier,Boston,42.353412,-71.044624


### Which station is the most frequent starting point?

In [11]:
query = '''
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips 
INNER JOIN stations
ON trips.start_station = stations.id
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
'''

run_query(query)

Unnamed: 0,Station,Count
0,South Station - 700 Atlantic Ave.,56123
1,Boston Public Library - 700 Boylston St.,41994
2,Charles Circle - Charles St. at Cambridge St.,35984
3,Beacon St / Mass Ave,35275
4,MIT at Mass Ave / Amherst St,33644


### Which stations are most frequently used for round trips?

In [12]:
query = '''
SELECT stations.station AS "Station", COUNT(*) AS "Count"
FROM trips 
INNER JOIN stations
ON trips.start_station = stations.id
WHERE trips.start_station = trips.end_station
GROUP BY stations.station
ORDER BY COUNT(*) DESC
LIMIT 5;
'''

run_query(query)

Unnamed: 0,Station,Count
0,The Esplanade - Beacon St. at Arlington St.,3064
1,Charles Circle - Charles St. at Cambridge St.,2739
2,Boston Public Library - 700 Boylston St.,2548
3,Boylston St. at Arlington St.,2163
4,Beacon St / Mass Ave,2144


Several of these stations also appear in the previous result, but the counts are much lower. The busiest stations are still the busiest stations, but the lower numbers overall suggest that people typically use Hubway bikes to get from point A to point B rather than to cycle around for a while before returning to where they started.

### How many trips start and end in different municipalities?

In [13]:
query = '''
SELECT COUNT(trips.id) AS "Count"
FROM trips 
INNER JOIN stations AS start
ON trips.start_station = start.id
INNER JOIN stations AS end
ON trips.end_station = end.id
WHERE start.municipality <> end.municipality;
'''

run_query(query)

Unnamed: 0,Count
0,309748


This shows that about 310k of the 1.5 million trips (roughly 20%) ended in a different municipality than they started in. This is further evidence that people mostly use Hubway bicycles for relatively short journeys rather than longer trips between towns.
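
The percentage can also be computed in the same query as the count. The sketch below uses a toy in-memory database; `demo_db`, its rows, and the short aliases `s` and `e` are invented for illustration. Note that the inner joins only keep trips whose start and end stations both match a row in stations, so the share is relative to the joined trips.

```python
import sqlite3

# Toy in-memory database; stations and trips are invented for illustration.
demo_db = sqlite3.connect(":memory:")
demo_db.executescript("""
CREATE TABLE stations (id INTEGER PRIMARY KEY, station TEXT, municipality TEXT);
CREATE TABLE trips (id INTEGER PRIMARY KEY, start_station INTEGER, end_station INTEGER);
INSERT INTO stations VALUES (1, 'A', 'Boston'), (2, 'B', 'Boston'), (3, 'C', 'Cambridge');
INSERT INTO trips (start_station, end_station)
VALUES (1, 2), (1, 3), (3, 3), (2, 3), (1, 1);
""")

# The comparison yields 0 or 1 per trip: SUM counts the 1s, AVG gives their share.
cross_count, joined_count, cross_share = demo_db.execute(
    """
    SELECT SUM(s.municipality <> e.municipality),
           COUNT(*),
           AVG(s.municipality <> e.municipality)
    FROM trips
    INNER JOIN stations AS s ON trips.start_station = s.id
    INNER JOIN stations AS e ON trips.end_station = e.id;
    """
).fetchone()

print(cross_count, joined_count, cross_share)

# The reported figure, using the approximate total of 1.5 million trips.
reported_share = 309748 / 1_500_000  # about 0.206
```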