# Data Challenge - Divvy dataset (Alan Au)

### Data

Divvy is a bike-sharing program sponsored by the City of Chicago since 2013.  They have made all of their bike ride data freely available online at https://www.divvybikes.com/system-data.  We’d like to learn about this data — in particular, we are interested in short rides — we’ll define “short” as rides in which the direct distance between the departure and arrival stations is less than 2km.

Each trip is anonymized and includes:

* Trip start day and time
* Trip end day and time
* Trip start station
* Trip end station
* Rider type (Member or 24-Hour Pass User)
* If a Member trip, it will also include Member’s gender and year of birth

### Task

Build a model to predict if a given bike trip will be a short one or not.  You are free to use as much of the Divvy data as you deem appropriate.  But do validate, and don’t overfit.  Moreover, time permitting, please feel free to bring in any publicly available supplementary data that might be useful.  Be careful not to bring in the time machine, i.e., any information that would not be known to Divvy at the start of a trip! Please code your solution in Python.

 

### What Are We Looking For?

We are looking for a finished predictive model using whatever tools and libraries that you are comfortable with.  In your submission, please document your methodology (with visualizations where appropriate) and provide ALL the code (including any environment setup and data sources) necessary to reproduce your analysis from scratch. Moreover, please report on the general performance of your model.

In [69]:
#!/usr/bin/python3
__author__ = 'Alan Au'
__date__  = '2018-12-22'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
import time

from os import listdir
from string import digits # saves me some typing

#Instead of manually typing in file names, I'll just let Python find them for me.
data_loc = "./Divvy/" #path to my data file directory; end with '/'

trip_files = [f for f in listdir(data_loc) if f[6:11] == 'Trips' and f[-4:] == ".csv"]
station_files = [f for f in listdir(data_loc) if f[6:14] == 'Stations' and f[-4:] == ".csv"]

In [81]:
#This is a haversine distance calculator. I pulled it from https://pypi.python.org/pypi/haversine
    
from math import radians, cos, sin, asin, sqrt

AVG_EARTH_RADIUS = 6371  # in km
MILES_PER_KILOMETER = 0.621371

def haversine(point1, point2, miles=False):
    # unpack latitude/longitude
    lat1, lng1 = point1
    lat2, lng2 = point2

    # convert all latitudes/longitudes from decimal degrees to radians
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))

    # calculate haversine
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
    if miles:
        return h * MILES_PER_KILOMETER # in miles
    else:
        return h # in kilometers
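
Since the task defines a short ride as one with less than 2 km between stations, the same formula can also be applied to whole columns at once. The following is a minimal sketch, assuming NumPy arrays of coordinates in decimal degrees; `haversine_np`, `is_short`, and `SHORT_RIDE_KM` are hypothetical names that do not appear elsewhere in this notebook, and the coordinates are made up.

```python
import numpy as np

AVG_EARTH_RADIUS = 6371  # in km, same constant as the scalar version
SHORT_RIDE_KM = 2.0      # "short" threshold from the task description

def haversine_np(lat1, lng1, lat2, lng2):
    """Vectorized haversine distance in km; inputs are decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = (np.sin((lat2 - lat1) * 0.5) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) * 0.5) ** 2)
    return 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))

def is_short(distance_km):
    """Label rides whose station-to-station distance is under the threshold."""
    return np.asarray(distance_km) < SHORT_RIDE_KM

# Two made-up station pairs: one degree of latitude apart, and identical points.
lat1 = np.array([41.0, 41.9])
lng1 = np.array([-87.6, -87.6])
lat2 = np.array([42.0, 41.9])
lng2 = np.array([-87.6, -87.6])
dist = haversine_np(lat1, lng1, lat2, lng2)
labels = is_short(dist)
```

One degree of latitude comes out at roughly 111 km, which is a quick sanity check on the formula.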

# Data preparation

The first step is to download and extract the data files of interest. I took a quick look at the data to see which parts I need; there are both station files and trip files.

Some notes:
* 2013 has a different date format, but that's the only file, so I can just transform it and dump it back out.
* Some of the files wrap the data elements in quotation marks, so I will want to strip those off.
* Other than those differences, the files seem largely compatible (if large).
* Only "Subscriber"-type users have gender and birthyear data; I will have to think about that when modeling.
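
The first two notes could also be handled with the standard library instead of manual string surgery. The sketch below is one possible approach, not the code used in this notebook: `csv.reader` removes the wrapping quotation marks, and `datetime` converts the 2013-style timestamps. The helper names `parse_line` and `to_common_date` and the sample line are made up for illustration.

```python
import csv
from datetime import datetime

def parse_line(line):
    """Split one CSV line into fields, dropping any wrapping quotation marks."""
    return next(csv.reader([line]))

def to_common_date(old_date):
    """Convert 'yyyy-mm-dd hh:mm' (2013 files) to 'mm/dd/yyyy hh:mm'."""
    parsed = datetime.strptime(old_date, "%Y-%m-%d %H:%M")
    return parsed.strftime("%m/%d/%Y %H:%M")

# Made-up sample line in the quoted style.
fields = parse_line('"1","2013-06-27 12:11","2013-06-27 12:16","85"')
start_time = to_common_date(fields[1])
```

A side benefit of `csv.reader` is that a quoted field containing a comma would stay in one piece, whereas a plain `split(',')` would break it apart.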

### First let's process the stations, so I can get the lat/long and calculate distances.

In [73]:
#Fields are:
#id,name,city (2017 only), latitude,longitude,dpcapacity,landmark (2013 only),online date
#I'm going to discard the name, city, dpcapacity, landmark, and online date.

start = time.mktime(time.localtime())
print("Running...") #so I know Jupyter is doing something

#for stations
def combine_stations(data_loc, stations, out_file):
    #inputs: data_loc (directory), stations (list of filenames), out_file (filename)
    #outputs: writes to out_file, returns all_stations (dict)
    all_stations = {}
    with open(data_loc+out_file,'w') as combined_file: #'with' closes the file for us
        for station in stations:
            with open(data_loc+station,'r') as raw_file:
                for line in raw_file: #read one line at a time
                    line = line.strip().replace('\"','') #get rid of quotation marks

                    if not line or line[0] not in digits: #skip blank and header lines
                        continue

                    if station[15:19] == '2017': #2017 only (extra city column)
                        (s_id, s_name, s_city, s_lat, s_lon) = line.split(',')[:5]
                    else:
                        (s_id, s_name, s_lat, s_lon) = line.split(',')[:4]
                    if s_id not in all_stations: #de-duplicate; note it only logs a station the first time it sees it
                        all_stations[s_id] = (float(s_lat),float(s_lon))
                        combined_file.write(','.join([s_id,s_lat,s_lon])+'\n')
    return all_stations #{id:(lat, lon)}

stations = combine_stations(data_loc, station_files, "All_Stations.csv")

duration = time.mktime(time.localtime()) - start
print("All stations done! Finished in "+str(duration)+" seconds.") #so I know when Jupyter is done

Running...
All stations done! Finished in 0.0 seconds.
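
An alternative to the line-by-line loop is to let pandas do the parsing and de-duplication. This is only a sketch under assumptions: it presumes each station file has a header row with the column names `id`, `latitude`, and `longitude` (as in the field list above), and `load_stations` is a hypothetical helper. The in-memory files below are stand-ins for the real CSVs.

```python
import io
import pandas as pd

def load_stations(sources):
    """Read station files and keep the first (lat, lon) seen for each id."""
    frames = [pd.read_csv(src, usecols=["id", "latitude", "longitude"]) for src in sources]
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

# Two tiny made-up files; the second has the extra 'city' column and repeats id 5.
file_a = io.StringIO(
    "id,name,latitude,longitude,dpcapacity\n"
    "5,A St,41.87,-87.62,19\n"
)
file_b = io.StringIO(
    "id,name,city,latitude,longitude,dpcapacity\n"
    "5,A St,Chicago,41.88,-87.63,19\n"
    "13,B St,Chicago,41.93,-87.64,15\n"
)
station_table = load_stations([file_a, file_b])
```

Selecting columns by name sidesteps the 2017-only `city` column, so no per-year branch is needed in this variant.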


### Now let's handle trip data.

In [90]:
#Fields are:
#trip_id,start,stop,bike,duration,from_id,from_name,to_id,to_name,usertype,gender,birthyear

#helper function to process 2013 dates
def new_2013_date(old_date):
    #old format is yyyy-mm-dd hh:mm
    #new format is mm/dd/yyyy hh:mm
    parts = old_date.split()
    (y,m,d) = parts[0].split('-')
    new_date = '/'.join([m,d,y])+" "+parts[1] #join() takes a single iterable
    return new_date

#for trips
def combine_trips(data_loc, trips, stations, out_file):
    #inputs: data_loc (directory), trips (list of filenames), stations (dict of id:(lat, lon)), out_file (filename)
    #outputs: writes to out_file, returns None
    with open(data_loc+out_file,'w') as combined_file: #'with' closes the file for us
        for trip in trips:
            with open(data_loc+trip,'r') as raw_file:
                for line in raw_file: #read one line at a time instead of loading the whole file
                    line = line.strip().replace('\"','') #get rid of quotation marks

                    if not line or line[0] not in digits: #skip blank and header lines
                        continue

                    (t_id,t_start,t_stop,t_bike,t_dur,t_from_id,t_from_name,t_to_id,t_to_name,t_usertype,t_gender,t_birth) = line.split(',')

                    if trip == 'Divvy_Trips_2013.csv': #process 2013 dates separately
                        t_start = new_2013_date(t_start)
                        t_stop = new_2013_date(t_stop)

                    t_dist = str(haversine(stations[t_from_id],stations[t_to_id])) #calculate distance between stations in km

                    #I'm throwing out trip_id and bike_id (probably not useful), and station names (redundant w/station ID)
                    out_line = ','.join([t_start,t_stop,t_dur,t_from_id,t_to_id,t_dist,t_usertype,t_gender,t_birth])+"\n"
                    combined_file.write(out_line)

In [88]:
start = time.mktime(time.localtime())
print("Running...") #so I know Jupyter is doing something

combine_trips(data_loc, trip_files, stations, "All_Trips.csv")

duration = time.mktime(time.localtime()) - start
print("All trips done! Finished in "+str(duration)+" seconds.") #so I know when Jupyter is done

Running...
All trips done! Finished in 216.0 seconds.
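
The timings here come from `time.mktime(time.localtime())`, which only resolves whole seconds (hence the 0.0 seconds reported for the stations step). If finer timing is wanted, a small context manager around `time.perf_counter()` is one option. This is a sketch; `timer` is a hypothetical helper, not part of the notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print how long the wrapped block took, with sub-second resolution."""
    begin = time.perf_counter()
    results = {}
    try:
        yield results
    finally:
        results["seconds"] = time.perf_counter() - begin
        print(f"{label} done! Finished in {results['seconds']:.3f} seconds.")

with timer("Demo") as elapsed:
    total = sum(range(100000))  # stand-in for real work such as combine_trips(...)
```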


### There's a lot of data (about 932 MB), so I'm going to generate subsets by year.

In [89]:
start = time.mktime(time.localtime())
print("Running...") #so I know Jupyter is doing something

all_trips = {}
for trip_file in trip_files:
    year = trip_file[12:16] #get the year number
    if year in all_trips:
        all_trips[year].append(trip_file)
    else:
        all_trips[year] = [trip_file]

for trip_year in all_trips: #for each year, process its sublist of associated files
    out_name = "Trips_"+trip_year+".csv"
    combine_trips(data_loc, all_trips[trip_year], stations, out_name)

duration = time.mktime(time.localtime()) - start
print("Trips-by-year done! Finished in "+str(duration)+" seconds.") #so I know when Jupyter is done

Running...
Trips-by-year done! Finished in 198.0 seconds.
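
The slice `trip_file[12:16]` relies on every file name starting with the 12-character prefix `Divvy_Trips_`. A slightly more defensive grouping could use a regular expression and `collections.defaultdict`, as sketched below. The function name `group_by_year` and the example file names are illustrative assumptions.

```python
import re
from collections import defaultdict

def group_by_year(filenames):
    """Map each 4-digit year found after 'Trips_' to its list of file names."""
    by_year = defaultdict(list)
    for name in filenames:
        match = re.search(r"Trips_(\d{4})", name)
        if match:  # skip files that do not follow the naming pattern
            by_year[match.group(1)].append(name)
    return dict(by_year)

example_files = [
    "Divvy_Trips_2013.csv",
    "Divvy_Trips_2014_Q1Q2.csv",
    "Divvy_Trips_2014-Q3-07.csv",
]
grouped = group_by_year(example_files)
```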


### Some thoughts

So far, things look good, but there's a quirk where some rides start and stop at the same station, which gives a station-to-station distance of 0. This could be a round trip, but I can't readily tell. A quick Google search suggests that the average in-city bike speed is about 15 km/h, but of course, there's no guarantee that someone is using the bike continuously between check-ins.

# Data ingestion

Now I have data files that are all in a standardized format, with distance between stations and a set of features. I probably also want to pull out some date features, like day-of-week and hour-of-start.
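
Extracting those date features might look like the following sketch. It assumes the start time is stored as text in the `mm/dd/yyyy hh:mm` format described earlier; the column names (`start`, `start_dow`, `start_hour`) and the two sample rows are hypothetical.

```python
import pandas as pd

trips = pd.DataFrame({"start": ["06/27/2013 12:11", "12/22/2018 08:05"]})

# Parse the text timestamps, then derive features known at the start of a trip.
start_ts = pd.to_datetime(trips["start"], format="%m/%d/%Y %H:%M")
trips["start_dow"] = start_ts.dt.dayofweek   # Monday=0 ... Sunday=6
trips["start_hour"] = start_ts.dt.hour
```

Both features depend only on the start time, so they stay within the rule about not using information unavailable when the trip begins.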

I'll want to come up with a proxy distance when it's listed as 0: ```distance(km) = duration(sec) * 15(km/h) * 1/3600(h/sec)```
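
As a sketch of that rule, assuming a pandas DataFrame with hypothetical columns `distance_km` and `duration_sec`: rows whose station-to-station distance is 0 get the speed-based estimate, and all others keep their haversine distance. The 15 km/h figure is the rough average mentioned above, not a measured value.

```python
import pandas as pd

AVG_SPEED_KMH = 15.0  # rough in-city cycling speed assumed above

def proxy_distance(df):
    """Replace zero distances with duration (sec) * speed (km/h) / 3600 (sec/h)."""
    estimate = df["duration_sec"] * AVG_SPEED_KMH / 3600.0
    # mask() swaps in the estimate only where the recorded distance is exactly 0
    return df["distance_km"].mask(df["distance_km"] == 0, estimate)

# Made-up rides: a 20-minute same-station ride and an ordinary 1.2 km ride.
rides = pd.DataFrame({"distance_km": [0.0, 1.2], "duration_sec": [1200, 300]})
rides["distance_km"] = proxy_distance(rides)
```

Because duration is only known once a trip has ended, this proxy seems better suited to building the target label than to serving as an input feature.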