### Model Describer Meetup Tutorial

In this notebook, we do some brief exploratory data analysis (EDA) of the [bicycle trip dataset](https://www.kaggle.com/pronto/cycle-share-dataset/home). The data comes from the Pronto Cycle Share system, which consists of 500 bikes and 54 stations located in Seattle.

The key question for this tutorial is whether there are noticeable differences in trip duration by gender and by rider age. We will not control for location information (a known limitation of this tutorial).

In this tutorial, we will be covering the following:

* [Data prep](#prep)
* [Exploratory data analysis](#eda)
* [Build neural network model](#neural)
* [Model Describer Regression Evaluation](#mdesc_regression)
* [Model Describer Classification Evaluation](#mdesc_classification)
* [Model Describer Regression Sensitivity](#mdesc_sensitivity_regression)
* [Model Describer Classification Sensitivity](#mdesc_sensitivity_classification)
* [Additional thoughts](#thoughts)

In [1]:
import os
from datetime import datetime

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import keras
import pandas as pd

from mdesc.models import (Eval, ClassifierEval, Sensitivity, ClassifierSensitivity)


Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.

Using TensorFlow backend.


In [2]:
# initialize plotly notebook_mode
init_notebook_mode(connected=True)

#### Data Prep <a id='prep' />

Read in data and perform basic data manipulations

In [3]:
# read in the trip and weather data. We will not be using the station data, but there are a number of ways we could use
# this if we took advantage of the location information. 

base_path = r'C:\Users\jlewris\Desktop\BikeData'
# note: pandas >= 1.3 replaces error_bad_lines=False with on_bad_lines='skip'
trip = pd.read_csv(os.path.join(base_path, 'trip.csv'), error_bad_lines=False)
weather = pd.read_csv(os.path.join(base_path, 'weather.csv'))

b'Skipping line 50794: expected 12 fields, saw 20\n'


In [4]:
def convert_date(dte):
    """
    convert string date into datetime object
    
    Parameters
    ----------
    dte - string
          datetime string object
          
    Return 
    ----------
    datetime obj
        string input converted to datetime object
    """
    try:
        dte = datetime.strptime(dte, '%m/%d/%Y %H:%M')
    except ValueError:
        dte = datetime.strptime(dte, '%m/%d/%Y')
    return dte

def return_part_date(dte, part_of_date='month'):
    """
    Pull the part_of_date from input datetime object
    
    Parameters
    ----------
    dte - string
          date string; parsed internally with convert_date
    
    part_of_date - str - ['month', 'day', 'year', 'hour', 'minute']
          
    Return 
    ----------
    part of date
        part of date, i.e. year, hour, etc. 
    """
    dte = convert_date(dte)
    return getattr(dte, part_of_date)

def is_weekday(dte):
    """
    return whether a dte is a weekday or not
    
    Parameters
    ----------
    dte - string
          date string; parsed internally with convert_date
              
    Return 
    ----------
    binary flag
        1 if is weekday else 0 
    """
    dte = convert_date(dte)
    if dte.weekday() not in [5, 6]:
        return 1
    else:
        return 0
    

In [5]:
# pull relevant parts of date
trip['start_day'] = trip['starttime'].apply(lambda x: return_part_date(x, part_of_date='day'))
trip['start_year'] = trip['starttime'].apply(lambda x: return_part_date(x, part_of_date='year'))
trip['start_month'] = trip['starttime'].apply(lambda x: return_part_date(x, part_of_date='month'))
trip['start_hour'] = trip['starttime'].apply(lambda x: return_part_date(x, part_of_date='hour'))
weather['start_day'] = weather['Date'].apply(lambda x: return_part_date(x, part_of_date='day'))
weather['start_year'] = weather['Date'].apply(lambda x: return_part_date(x, part_of_date='year'))
weather['start_month'] = weather['Date'].apply(lambda x: return_part_date(x, part_of_date='month'))

# test if date is weekday or not
trip['weekday'] = trip['starttime'].apply(lambda x: is_weekday(x))
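
The cell above re-parses every `starttime` string once per extracted field. A minimal sketch of a vectorized alternative follows, assuming every value matches a single format; the toy frame and its values are made up for illustration. The weather `Date` column would need its own format, `'%m/%d/%Y'`.

```python
import pandas as pd

# Toy stand-in for trip['starttime']; the real column comes from trip.csv.
toy = pd.DataFrame({
    'starttime': ['10/13/2014 10:31', '10/18/2014 23:05', '01/04/2015 08:00']
})

# Parse the whole column once instead of once per extracted field.
start = pd.to_datetime(toy['starttime'], format='%m/%d/%Y %H:%M')

toy['start_day'] = start.dt.day
toy['start_year'] = start.dt.year
toy['start_month'] = start.dt.month
toy['start_hour'] = start.dt.hour

# dayofweek runs Monday=0 ... Sunday=6, so values below 5 are weekdays.
toy['weekday'] = (start.dt.dayofweek < 5).astype(int)
```

On a column of roughly 287,000 rows this avoids several Python-level `strptime` calls per row, though the `apply` version has the advantage of handling mixed formats through its `try`/`except` fallback.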

In [6]:
# pull out just the mean values for weather
weather_sub = weather[['Mean_Temperature_F', 'MeanDew_Point_F', 'Mean_Humidity', 
                      'Mean_Visibility_Miles', 'Mean_Wind_Speed_MPH', 'Precipitation_In', 
                      'start_day', 'start_year', 'start_month']]

In [7]:
trip = pd.merge(trip, weather_sub, on=['start_day', 'start_year', 'start_month'], how='left')
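
A left merge silently leaves `NaN` weather values for any trip whose date has no weather row, and it duplicates trips if the weather table has repeated dates. The sketch below shows one way to check both conditions using the `validate` and `indicator` arguments of `pd.merge`; the toy frames are hypothetical stand-ins for `trip` and `weather_sub`.

```python
import pandas as pd

# Hypothetical miniature versions of the trip and weather tables.
trips_toy = pd.DataFrame({
    'trip_id': [1, 2, 3],
    'start_day': [13, 13, 14],
    'start_month': [10, 10, 10],
    'start_year': [2014, 2014, 2014],
})
weather_toy = pd.DataFrame({
    'start_day': [13],
    'start_month': [10],
    'start_year': [2014],
    'Mean_Temperature_F': [62],
})

keys = ['start_day', 'start_year', 'start_month']

# validate raises if a date appears more than once in the weather table;
# indicator adds a '_merge' column recording where each row was found.
merged = pd.merge(trips_toy, weather_toy, on=keys, how='left',
                  validate='many_to_one', indicator=True)

# Trips with no matching weather row.
unmatched = int((merged['_merge'] == 'left_only').sum())
```

Here the trip on the 14th has no weather row, so `unmatched` counts it and its temperature is missing.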

In [8]:
# drop end location information
trip = trip.drop(['to_station_name', 'to_station_id', 'from_station_name'], axis=1)

In [9]:
# get age of rider
trip['rider_age'] = trip['start_year'] - trip['birthyear']

#### Basic Exploratory Data Analysis <a id='eda' />

Perform basic exploratory analysis of the bicycle data

In [10]:
numtrips_weekday = trip.groupby(['weekday', 'start_hour'])['trip_id'].nunique().reset_index(name='numTrip')

total_weekday_trips = trip.loc[trip['weekday'] == 1]['trip_id'].nunique()
total_weekend_trips = trip.loc[trip['weekday'] == 0]['trip_id'].nunique()

numtrips_weekday['percentTrips'] = numtrips_weekday.apply(lambda x: x['numTrip']/total_weekday_trips if x['weekday'] == 1 else x['numTrip']/total_weekend_trips, axis=1)
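
The row-wise `apply` above divides each hourly count by the total for its weekday/weekend group. A sketch of the same normalization with `groupby(...).transform('sum')` is shown below on made-up counts. It assumes each trip falls into exactly one (weekday, hour) cell, so the per-cell counts add up to the group total.

```python
import pandas as pd

# Made-up hourly trip counts for weekend (0) and weekday (1).
counts = pd.DataFrame({
    'weekday': [0, 0, 1, 1],
    'start_hour': [8, 17, 8, 17],
    'numTrip': [10, 30, 60, 40],
})

# transform('sum') broadcasts each group's total back onto its rows.
group_total = counts.groupby('weekday')['numTrip'].transform('sum')
counts['percentTrips'] = counts['numTrip'] / group_total
```

Within each `weekday` value the resulting shares sum to 1, which is also a quick sanity check worth running on the real table.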

In [11]:
numtrips_weekday.head()

Unnamed: 0,weekday,start_hour,numTrip,percentTrips
0,0,0,721,0.012195
1,0,1,494,0.008355
2,0,2,404,0.006833
3,0,3,164,0.002774
4,0,4,71,0.001201


In [12]:
# get number of trips by hour of day

weekday = numtrips_weekday['weekday'] == 1
weekend = numtrips_weekday['weekday'] == 0

trace1 = go.Bar(
    x = numtrips_weekday.loc[weekday, 'start_hour'].tolist(),
    y = numtrips_weekday.loc[weekday, 'percentTrips'].tolist(), 
    name='Weekday Trips'
)

trace2 = go.Bar(
    x = numtrips_weekday.loc[weekend, 'start_hour'].tolist(),
    y = numtrips_weekday.loc[weekend, 'percentTrips'].tolist(), 
    name='Weekend Trips'
)

data = [trace1, trace2]

layout = go.Layout(barmode='group', 
                  title='Weekend vs. Weekday Trips by Hour',
                  xaxis=dict(title='Hour'), 
                  yaxis=dict(title='Percent of Trips'))

fig = go.Figure(data=data, layout=layout)

iplot(fig)

In [13]:
trip_dist = trip.groupby(['weekday', 'start_hour'])['tripduration'].mean().reset_index(name='tripDist')

In [14]:
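# build the masks from trip_dist itself rather than reusing the ones made for numtrips_weekday
dist_weekday = trip_dist['weekday'] == 1
dist_weekend = trip_dist['weekday'] == 0

trace1 = go.Scatter(
    x = trip_dist.loc[dist_weekday, 'start_hour'].tolist(),
    y = trip_dist.loc[dist_weekday, 'tripDist'].tolist(), 
    name='Weekday Trips', 
    mode='lines'
)

trace2 = go.Scatter(
    x = trip_dist.loc[dist_weekend, 'start_hour'].tolist(),
    y = trip_dist.loc[dist_weekend, 'tripDist'].tolist(), 
    name='Weekend Trips', 
    mode='lines'
)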

data = [trace1, trace2]

layout = go.Layout(
                  title='Weekend vs. Weekday Trip Duration by Hour',
                  xaxis=dict(title='Hour'), 
                  yaxis=dict(title='Trip Duration (seconds)'))

fig = go.Figure(data=data, layout=layout)

iplot(fig)

In [15]:
tripgender = trip.groupby('gender')['tripduration'].mean().reset_index(name='tripDist')

data = [go.Bar(
    x=tripgender['gender'].tolist(),
    y=tripgender['tripDist'].tolist()
)]

layout = go.Layout(
                  title='Average Trip Duration (seconds) by Gender',
                  xaxis=dict(title='Gender'), 
                  yaxis=dict(title='Trip Duration (seconds)'))

fig = go.Figure(data=data, layout=layout)

iplot(fig)
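
The chart above covers gender; the other half of the key question is rider age. A minimal sketch of an age-group comparison follows, using `pd.cut` on a made-up frame. The bin edges and labels are illustrative assumptions, not values taken from the dataset. The median is reported alongside the mean because a few very long trips can pull the mean up.

```python
import pandas as pd

# Made-up rows with the same column names as the trip frame.
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Other', 'Male'],
    'rider_age': [24, 31, 45, 52, 38, 67],
    'tripduration': [600.0, 900.0, 480.0, 1200.0, 750.0, 300.0],
})

# Hypothetical age bands; right=False makes each bin [lower, upper).
bins = [0, 30, 40, 50, 60, 120]
labels = ['<30', '30-39', '40-49', '50-59', '60+']
toy['age_group'] = pd.cut(toy['rider_age'], bins=bins, labels=labels, right=False)

# Mean, median and row count of trip duration per age band.
summary = (toy.groupby('age_group', observed=True)['tripduration']
              .agg(['mean', 'median', 'count']))
```

Rows with a missing `rider_age` get a missing `age_group` and drop out of the summary, so the `count` column is worth checking on the real data.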

In [16]:
corr = trip[['rider_age', 'Mean_Temperature_F', 'Mean_Humidity', 
             'tripduration', 'weekday']].corr()

trace = go.Heatmap(z=corr.values.tolist(), 
                   x=corr.index.tolist(),
                   y=corr.columns.tolist())

layout = go.Layout(
                  title='Correlation Heatmap',
    )

data = [trace]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

#### Build Neural Network Model <a id='neural' />

Build an example neural network in Keras to predict trip duration.

In [17]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

In [91]:
trip_model = trip.drop(['stoptime', 'bikeid', 'trip_id', 
                       'starttime', 'birthyear'], axis=1)
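# copy so later edits to X do not act on a view of trip_model
X = trip_model.loc[:, trip_model.columns != 'tripduration'].copy()
# reshape(-1, 1) infers the row count instead of hard-coding it
y = trip_model.loc[:, 'tripduration'].values.reshape(-1, 1)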

In [92]:
# forward-fill missing values; reassigning avoids pandas' SettingWithCopyWarning
X = X.ffill()



In [93]:
X = X.drop('from_station_id', axis=1)

categories = [ 'usertype', 
              'gender', 'start_day', 'start_year', 
              'start_month', 'start_hour', 'weekday']

continuous = [col for col in X.columns.tolist() if col not in categories]

In [94]:

X_cat = pd.get_dummies(X[categories], columns=categories)

X_cat_cols = X_cat.columns.tolist()
#X_cat = X_cat.values

In [95]:
X_cat.head()

Unnamed: 0,usertype_Member,usertype_Short-Term Pass Holder,gender_Female,gender_Male,gender_Other,start_day_1,start_day_2,start_day_3,start_day_4,start_day_5,...,start_hour_16,start_hour_17,start_hour_18,start_hour_19,start_hour_20,start_hour_21,start_hour_22,start_hour_23,weekday_0,weekday_1
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
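
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies` does with the `columns` argument, run on a made-up two-column frame. Each listed column is replaced by one indicator column per distinct value, named `<column>_<value>`.

```python
import pandas as pd

# Made-up frame with one string and one integer categorical column.
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Other'],
    'weekday': [1, 0, 1],
})

# Integer columns are only encoded when listed explicitly in `columns`.
encoded = pd.get_dummies(toy, columns=['gender', 'weekday'])
```

This is why `start_day` and `start_hour` expand into many columns in the output above: each day of the month and each hour becomes its own indicator.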


In [96]:
sc = StandardScaler()

X_cont = sc.fit_transform(X[continuous])



In [97]:
X = np.hstack((X_cat, X_cont))

feature_names = X_cat_cols + continuous

In [98]:
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(X, y,trip, 
                                                   test_size=0.1, 
                                                   random_state=42)
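
Note that the notebook fits `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the array names are hypothetical and it is not the approach used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic continuous features and target, for illustration only.
rng = np.random.RandomState(42)
X_cont_toy = rng.normal(loc=60.0, scale=10.0, size=(200, 2))
y_toy = rng.normal(size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_cont_toy, y_toy, test_size=0.1, random_state=42)

# ... then learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
# and reuse those same statistics on the held-out rows.
X_te_scaled = scaler.transform(X_te)
```

With a dataset this large the practical difference is likely small, but the split-then-fit order keeps the test set fully unseen.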

In [99]:
model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1], activation='relu', 
               kernel_initializer='normal'))
model.add(Dropout(0.1))

model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))

#model.add(Dense(128, activation='relu'))
#model.add(Dropout(0.1))
model.add(Dense(1, activation='linear'))

#keras.optimizers.SGD(lr=0.001, momentum=0.0, decay=0.0, nesterov=False)
adam = Adam(lr=0.001)

# pass the configured optimizer object; the string 'adam' would ignore it and use defaults
model.compile(loss='mae', optimizer=adam)
model.fit(X_train, y_train, epochs=10, verbose=1, batch_size=256, 
         validation_split=0.2)

Train on 206536 samples, validate on 51635 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2270263ddd8>
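
The training log alone does not say whether the network is useful. One simple reference point, sketched below with made-up durations, is the mean absolute error of always predicting the training median, since the median is the constant that minimizes MAE. The helper `mae` is defined here for illustration and is not part of Keras or mdesc.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equally shaped arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Made-up trip durations in seconds.
y_train_toy = np.array([300.0, 420.0, 600.0, 900.0, 5400.0])
y_test_toy = np.array([360.0, 480.0, 1200.0])

# Constant baseline: predict the training median for every test row.
baseline = float(np.median(y_train_toy))
baseline_mae = mae(y_test_toy, np.full_like(y_test_toy, baseline))

# With the fitted network, the comparable number would be
# mae(y_test.flatten(), model.predict(X_test).flatten())
```

If the model's test MAE is not clearly below the baseline, the downstream Model Describer plots mostly describe noise.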

In [54]:
group_test.columns

Index(['trip_id', 'starttime', 'stoptime', 'bikeid', 'tripduration',
       'from_station_id', 'usertype', 'gender', 'birthyear', 'start_day',
       'start_year', 'start_month', 'start_hour', 'weekday',
       'Mean_Temperature_F', 'MeanDew_Point_F', 'Mean_Humidity',
       'Mean_Visibility_Miles', 'Mean_Wind_Speed_MPH', 'Precipitation_In',
       'rider_age'],
      dtype='object')

In [100]:
groupby_df = group_test[['gender',  'usertype']]

In [101]:
# fill missing gender values; reassigning avoids pandas' SettingWithCopyWarning
groupby_df = groupby_df.fillna({'gender': 'Other'})



In [109]:
RE = Eval(prediction_fn=model.predict, target_names='transit_time', 
                feature_names=feature_names)
res = RE.fit_transform(X=X_test[0:8000], y=y_test[0:8000].flatten(), 
                       groupby_df=groupby_df[0:8000])

0 gender
Female



Mean of empty slice


Mean of empty slice



Male
Other
1 usertype
Member
Short-Term Pass Holder


In [33]:
RE.data_set.groupby_names

['all_values']

In [110]:
RE.data_set.viz_now(groupby_name='gender')

In [36]:
X[:, 71]

array([0.34634954, 0.34634954, 0.34634954, ..., 0.64192684, 0.64192684,
       0.64192684])