## Introduction 

This dataset is a catalogue of rides taken on the LA-Metro Bike Share Network.

"The Metro Bike Share system makes bikes available 24/7, 365 days a year in Downtown LA, Central LA, Port of LA and the Westside. Metro Bike Share offers convenient round-the-clock access to a fleet of bicycles for short trips. Metro Bike Share is one of LA Metro's multiple public transportation options for Angelenos and visitors to get around."
<br>-https://bikeshare.metro.net/about/

Bike sharing networks have been established in cities of all sizes. They give residents and visitors a fun, cheap, fast and easy way to explore the city they are based in, and have even become part of regular commuter routines. Although a bicycle is in itself a low-tech transport option, technology has played, and will continue to play, a major role in the rapid expansion of bike sharing networks. The bike may be dumb (and increasingly they aren't), but the network itself is very smart.

"Now companies have GPS sensors to track their bikes, and smartphones, credit cards, or transit passes to know who has them—and whom to penalize if the wheels go missing. Riders, meanwhile, can use apps to track down available rides or bike-share stations when they need them."
<br>-Wired Magazine (https://www.wired.com/story/americans-falling-in-love-bike-share/)

Data collection and analysis, and with them machine learning, are key to building a bike network that really works. Data can provide many insights into, and much guidance about, a bike sharing network, and through this exploration of the LA-Metro Bike Network I hope to demonstrate just some of this potential.



## Load Data and Basic Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df=pd.read_csv('../input/metro-bike-share-trip-data.csv')
df.head(10)

In [None]:
df.shape

Feature descriptions from https://bikeshare.metro.net/about/data/

<b>trip_id:</b> Locally unique integer that identifies the trip

<b>duration:</b> Length of trip in <i> minutes*</i>

<b>start_time:</b> The date/time when the trip began, presented in ISO 8601 format in local time

<b>end_time: </b>The date/time when the trip ended, presented in ISO 8601 format in local time

<b>start_station:</b> The station ID where the trip originated 

<b>start_lat:</b> The latitude of the station where the trip originated

<b>start_lon: </b>The longitude of the station where the trip originated

<b>end_station: </b>The station ID where the trip terminated 

<b>end_lat: </b>The latitude of the station where the trip terminated

<b>end_lon: </b>The longitude of the station where the trip terminated

<b>bike_id: </b> Locally unique integer that identifies the bike
    
<b>plan_duration: </b>The number of days that the plan the passholder is using entitles them to ride; 0 is used for a single ride plan (Walk-up)

<b>trip_route_category:</b> "Round Trip" for trips starting and ending at the same station or "One Way" for all other trips

<b>passholder_type: </b>The name of the passholder's plan

*Units are actually in seconds, but every value is a multiple of 60, i.e. a whole number of minutes. We will convert this to minutes.
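
As a small illustration of the unit note above, the sketch below checks that a set of raw durations are all whole minutes before dividing by 60. The toy values and the `Duration (min)` column name are made up for this example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw durations in seconds (illustrative values only).
toy = pd.DataFrame({'Duration': [180, 1980, 300, 86400]})

# Every value should be a multiple of 60 before we convert.
all_whole_minutes = bool((toy['Duration'] % 60 == 0).all())

# Convert seconds to minutes in a new, hypothetical column.
toy['Duration (min)'] = toy['Duration'] / 60

print(all_whole_minutes)
print(toy)
```

Running a check like this first guards against silently producing fractional minutes if the raw data ever changes.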

Other features are not officially described by Metro Bike Share.

Starting Lat-Long and Ending Lat-Long appear to repeat the station coordinates already given and can be dropped safely.

The remaining features appear to be LA geographic information not relevant to this study, and many of them have large amounts of missing data. These will also be dropped.

## Data Cleaning

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.drop(columns=['Starting Lat-Long',
                 'Ending Lat-Long',
                 'Neighborhood Councils (Certified)',
                 'Council Districts',
                 'Zip Codes',
                 'LA Specific Plans',
                 'Precinct Boundaries',
                 'Census Tracts'],
       inplace=True)
df.Duration=df.Duration/60

In [None]:
df.head()

Trip ID is a unique identifier and can be used as the index for this data.

The remaining features have little missing data; the largest source is just 1051 missing cells out of over 132,000. Rows with missing data will be dropped from the dataset.

Start Time and End Time will be converted to datetime data types.


In [None]:
df.set_index('Trip ID', inplace=True)
df.dropna(inplace=True)
df['Start Time']= pd.to_datetime(df['Start Time'])
df['End Time']=pd.to_datetime(df['End Time'])

df.head()

In [None]:
df.describe()

Starting Station Latitude, Starting Station Longitude, Ending Station Latitude and Ending Station Longitude contain zeros that must be treated as missing data. However, there are no zeros or remaining missing values in Starting Station ID or Ending Station ID, so we can use these IDs to impute the missing coordinates.

In [None]:
df.loc[df['Starting Station Latitude']==0]['Starting Station ID'].value_counts()

In [None]:
df.loc[df['Starting Station Longitude']==0]['Starting Station ID'].value_counts()

In [None]:
df.loc[df['Ending Station Latitude']==0]['Ending Station ID'].value_counts()

In [None]:
df.loc[df['Ending Station Longitude']==0]['Ending Station ID'].value_counts()

All of the zeros were recorded at Bike Station 4108, so there may be some kind of logging error related to this particular station. Its latitude and longitude can be read from any of its complete rows and imputed over the zeros.

In [None]:
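# Station 4108's coordinates, taken from its correctly logged rows.
# All non-zero values for this station are identical, so max() (latitude, positive)
# and min() (longitude, negative) simply skip the zeros.
stat_4108_lat = df.loc[df['Starting Station ID']==4108, 'Starting Station Latitude'].max()
stat_4108_long = df.loc[df['Starting Station ID']==4108, 'Starting Station Longitude'].min()
# Assign the result back instead of calling replace(inplace=True) on a column selection
df['Starting Station Latitude'] = df['Starting Station Latitude'].replace(0, stat_4108_lat)
df['Ending Station Latitude'] = df['Ending Station Latitude'].replace(0, stat_4108_lat)
df['Starting Station Longitude'] = df['Starting Station Longitude'].replace(0, stat_4108_long)
df['Ending Station Longitude'] = df['Ending Station Longitude'].replace(0, stat_4108_long)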

In [None]:
df.describe()

Our dataset is now free of missing values
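
The imputation above relies on every zero belonging to station 4108. If zeros turned up at several stations, one possible generalisation is to fill each gap from that row's own station. The sketch below shows the idea on a toy frame; the coordinates are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the coordinates are made up for illustration.
toy = pd.DataFrame({
    'Starting Station ID': [4108, 4108, 3005, 4108],
    'Starting Station Latitude': [34.02, 0.0, 34.05, 34.02],
})

# Treat zeros as missing values.
lat = toy['Starting Station Latitude'].replace(0, np.nan)

# For each row, look up the logged latitude of that row's own station.
# max() ignores NaN, so it returns the known coordinate.
station_lat = lat.groupby(toy['Starting Station ID']).transform('max')

# Fill only the gaps; correctly logged rows are left untouched.
toy['Starting Station Latitude'] = lat.fillna(station_lat)
print(toy)
```

The same pattern would apply to the longitude and ending-station columns.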

## Geographic Investigation & Visualisation

In [None]:
plt.figure(figsize=(14,5))
plt.subplot(1,2,1)
plt.scatter(df['Starting Station Latitude'],df['Starting Station Longitude'],alpha=0.3)
plt.title("Starting Station Latitude and Longitude")
plt.subplot(1,2,2)
plt.scatter(df['Ending Station Latitude'],df['Ending Station Longitude'],alpha=0.3)
plt.title("Ending Station Latitude and Longitude")
plt.show()

The matching plots indicate that there are no start points without a drop-off and vice versa, which seems logical.

In [None]:
sorted(df['Starting Station ID'].unique())==sorted(df['Ending Station ID'].unique())

This confirms that the Starting and Ending Stations sets are identical as expected
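
If the two sets had not matched, set differences would show which stations appear on only one side. A minimal sketch on made-up station IDs:

```python
import pandas as pd

# Hypothetical toy trips (station IDs are illustrative).
toy = pd.DataFrame({
    'Starting Station ID': [3005, 3006, 3005, 3007],
    'Ending Station ID':   [3006, 3007, 3007, 3005],
})

start_ids = set(toy['Starting Station ID'])
end_ids = set(toy['Ending Station ID'])

# Stations that only ever appear as a start, or only as an end.
only_start = start_ids - end_ids
only_end = end_ids - start_ids
same_stations = start_ids == end_ids

print(same_stations, only_start, only_end)
```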

Now would be a great time for a map!

In [None]:
import folium
bike_map1=folium.Map([df['Starting Station Latitude'].values[0],df['Starting Station Longitude'].values[0]])
for station in df['Starting Station ID'].unique():
    lat=df.loc[df['Starting Station ID']==station]['Starting Station Latitude'].values[0]
    lon=df.loc[df['Starting Station ID']==station]['Starting Station Longitude'].values[0]
    marker=folium.Marker([lat,lon],popup=str(station))
    marker.add_to(bike_map1)

bike_map1
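
The loop above filters the full trip table twice per station. An alternative, sketched here on invented coordinates and without the folium calls, is to build a one-row-per-station lookup table first and iterate over that.

```python
import pandas as pd

# Hypothetical toy trips; coordinates are made up for illustration.
toy = pd.DataFrame({
    'Starting Station ID': [3005, 3006, 3005, 3039],
    'Starting Station Latitude': [34.0485, 34.0455, 34.0485, 34.0240],
    'Starting Station Longitude': [-118.2585, -118.2565, -118.2585, -118.3930],
})

# One row per station: take the first logged coordinate pair for each ID.
stations = toy.groupby('Starting Station ID')[
    ['Starting Station Latitude', 'Starting Station Longitude']
].first()

# Each tuple is (station id, latitude, longitude), ready to pass to a map marker.
markers = [(int(sid), float(lat), float(lon))
           for sid, lat, lon in stations.itertuples(name=None)]
print(markers)
```

Each tuple could then be handed to `folium.Marker([lat, lon], popup=str(sid))` exactly as in the cell above.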

Almost all the Bike Stations are located in Downtown LA, with two exceptions: one in Culver City (id = 3039) and one in Venice (id = 3009).

Let's look closer at the traffic to and from these two outlying bike stations. Are they outliers?

In [None]:
df.loc[df['Starting Station ID']==3039]['Ending Station ID'].value_counts()

In [None]:
df.loc[df['Ending Station ID']==3039]['Starting Station ID'].value_counts()

In [None]:
df.loc[df['Starting Station ID']==3009]['Ending Station ID'].value_counts()

In [None]:
df.loc[df['Ending Station ID']==3009]['Starting Station ID'].value_counts()

Almost all of the trips made to and from these bike stations are either round trips to the same place or from one of the two to the other. These two stations are almost completely disconnected from the Downtown LA network.

They also account for only a tiny fraction of total rides: of over 130k rides, just 163 are to or from these two stations. I wonder why that is?
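
A count like the one above can be obtained with a single boolean mask. The sketch below uses a handful of invented trips rather than the real data, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical toy trips (not taken from the dataset).
toy = pd.DataFrame({
    'Starting Station ID': [3039, 3005, 3009, 3006, 3039],
    'Ending Station ID':   [3039, 3006, 3039, 3009, 3009],
})
outlying = [3039, 3009]

starts_out = toy['Starting Station ID'].isin(outlying)
ends_out = toy['Ending Station ID'].isin(outlying)

# Rides that touch an outlying station at either end.
n_rides = int((starts_out | ends_out).sum())
share = n_rides / len(toy)

# Rides that both start and end at an outlying station.
n_within = int((starts_out & ends_out).sum())
print(n_rides, share, n_within)
```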

In [None]:
print('The first trip made FROM bike Station 3039 (Culver City) was on : ',df.loc[df['Starting Station ID']==3039]['Start Time'].min())
print('The last trip made FROM bike Station 3039 (Culver City) was on : ',df.loc[df['Starting Station ID']==3039]['Start Time'].max())
print('The first trip made FROM bike Station 3009 (Venice) was on : ',df.loc[df['Starting Station ID']==3009]['Start Time'].min())
print('The last trip made FROM bike Station 3009 (Venice) was on : ',df.loc[df['Starting Station ID']==3009]['Start Time'].max())

print('The first trip made TO bike Station 3039 (Culver City) was on : ',df.loc[df['Ending Station ID']==3039]['Start Time'].min())
print('The last trip made TO bike Station 3039 (Culver City) was on : ',df.loc[df['Ending Station ID']==3039]['Start Time'].max())
print('The first trip made TO bike Station 3009 (Venice) was on : ',df.loc[df['Ending Station ID']==3009]['Start Time'].min())
print('The last trip made TO bike Station 3009 (Venice) was on : ',df.loc[df['Ending Station ID']==3009]['Start Time'].max())

All of the trips to and from Culver City and Venice are on the same day, 2017-03-26, which is the last weekend day in the dataset. It may be that these stations only became operational that day (or that weekend), so they were used on the Sunday by people going to and from the beach but not during the week as part of anyone's regular commute. At least not yet.

Let's explore some of the distinctive characteristics of these other Bike Stations further:

In [None]:
df['Starting Station ID'].value_counts().tail(10)

In [None]:
df['Ending Station ID'].value_counts().tail(10)

Bike Station 4108, our station with the apparent logging errors, also has relatively few trips to and from it. On the map it is a little separated from the rest of the downtown set.

In [None]:
df.loc[df['Starting Station ID']==4108]['Trip Route Category'].value_counts(normalize=True)

In [None]:
df.loc[df['Ending Station ID']==4108]['Trip Route Category'].value_counts(normalize=True)

Bike Station 4108 is predominantly Round Trip traffic, which is very unusual for this dataset!

Bike Station 3053 has the fewest rides to and from it, but it appears to be right in the mix of things.

In [None]:
df.loc[df['Starting Station ID']==3053]['Start Time']

In [None]:
df.loc[df['Ending Station ID']==3053]['Start Time']

In [None]:
print('The first trip made FROM bike Station 3053 was on : ',df.loc[df['Starting Station ID']==3053]['Start Time'].min())
print('The last trip made FROM bike Station 3053 was on : ',df.loc[df['Starting Station ID']==3053]['Start Time'].max())
print('The first trip made TO bike Station 3053 was on : ',df.loc[df['Ending Station ID']==3053]['Start Time'].min())
print('The last trip made TO bike Station 3053 was on : ',df.loc[df['Ending Station ID']==3053]['Start Time'].max())

All trips were made over just 5 days, from the 7th of July 2016 (the first day of the dataset) to the 11th of July 2016. Is it possible this Bike Station was decommissioned?
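
Rather than printing first and last trips station by station, the active window of every station can be summarised in one table. This is a sketch on invented timestamps; `days_active` is a hypothetical column name introduced here.

```python
import pandas as pd

# Hypothetical toy trips; timestamps are illustrative.
toy = pd.DataFrame({
    'Starting Station ID': [3039, 3039, 3053, 3053, 3005, 3005],
    'Start Time': pd.to_datetime([
        '2017-03-26T10:15:00', '2017-03-26T16:40:00',
        '2016-07-07T09:00:00', '2016-07-11T18:30:00',
        '2016-07-07T08:00:00', '2017-03-31T19:00:00',
    ]),
})

# First trip, last trip and number of trips for each station.
span = toy.groupby('Starting Station ID')['Start Time'].agg(['min', 'max', 'count'])

# Number of calendar days between first and last trip, inclusive.
span['days_active'] = (
    span['max'].dt.normalize() - span['min'].dt.normalize()
).dt.days + 1

print(span.sort_values('days_active'))
```

Sorting by `days_active` puts short-lived stations, such as newly opened or possibly decommissioned ones, at the top.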

## Time Series Investigation & Visualisation

In [None]:
df['Start Time'].min()

In [None]:
df['Start Time'].max()

In [None]:
df['Start Time'].hist(figsize=(15,4))
plt.title('Ride Timeline Histogram')
plt.show()

Ride data is assumed to be complete over the time period from the 7th of July 2016 to the 31st of March 2017. In the above histogram we can see the growth in popularity over the summer months. Ridership declines as summer turns into autumn and then winter, dropping to its lowest point in what we would expect to be the coldest part of winter in early 2017, and then increasing as spring begins to arrive. Unfortunately the time period covers less than one year, so we are unable to observe a full seasonal cycle in ridership.

In [None]:
df['Start Month']=df['Start Time'].dt.month_name()

df['Start Month'].value_counts()

In [None]:
plt.figure(figsize=(14,5))
plt.bar(df['Start Month'].value_counts().index,df['Start Month'].value_counts().values)
plt.title('Number of Rides by Month of Year')
plt.show()
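
`value_counts()` orders the months by ride count, so the bars above are not in calendar order. One way to keep them chronological, sketched here on invented dates, is to count by year-month period and sort the index.

```python
import pandas as pd

# Hypothetical toy start times (illustrative only).
toy = pd.DataFrame({'Start Time': pd.to_datetime([
    '2016-07-07', '2016-08-15', '2016-08-20', '2016-12-01',
    '2017-01-05', '2017-01-09', '2017-01-30',
])})

# Count rides per year-month and sort chronologically.
monthly = toy['Start Time'].dt.to_period('M').value_counts().sort_index()
print(monthly)
```

Using periods also keeps months from different years apart, which matters once the data spans more than twelve months.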

In [None]:
df['Start Day']=df['Start Time'].dt.day_name()
df['Start Day'].value_counts()

In [None]:
plt.figure(figsize=(14,5))

plt.bar(df['Start Day'].value_counts().index,df['Start Day'].value_counts().values)
plt.xticks(rotation=45)
plt.title('Number of Rides by Day of Week')
plt.show()

Ridership does vary by day of the week but without a clear weekday vs weekend divide
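
To put a number on the weekday versus weekend comparison, the per-day totals can be reindexed into calendar order and averaged by group. The dates below are invented; this only shows the mechanics.

```python
import pandas as pd

# Hypothetical toy start times: a Monday, two Tuesday rides, a Saturday, two Sunday rides.
toy = pd.DataFrame({'Start Time': pd.to_datetime([
    '2017-03-20', '2017-03-21', '2017-03-21',
    '2017-03-25', '2017-03-26', '2017-03-26',
])})

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Rides per day name, in calendar order, with 0 for days that never occur.
counts = (toy['Start Time'].dt.day_name()
          .value_counts()
          .reindex(day_order, fill_value=0))

# Mean of the per-day-name totals for weekdays and for the weekend.
weekday_avg = counts.iloc[:5].mean()
weekend_avg = counts.iloc[5:].mean()
print(counts, weekday_avg, weekend_avg)
```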

Only the time of day is kept from the datetime in Start Time, rounded to the nearest hour to ease our analysis.

In [None]:
# Round each start time to the nearest hour and keep only the time-of-day part
df['Time Only'] = df['Start Time'].dt.round('60min').dt.time
df.head()

In [None]:
plt.figure(figsize=(9,4))
plt.scatter(df['Time Only'].value_counts().index,df['Time Only'].value_counts().values)
#on peak between the bars
plt.vlines(x='7:30:00',ymin=0,ymax=12000,color='red')
plt.vlines(x='20:30:00',ymin=0,ymax=12000,color='red')
plt.title('Bike Usage by Time of Day')
plt.show()

Across the day, the bike network usage clearly has a peak time seen between the red bars above

In [None]:
plt.figure(figsize=(14,8))
days=['Monday','Tuesday','Wednesday','Thursday','Friday']
for i in range(len(days)):
    plt.subplot(2,4,i+1)
    plt.scatter(df.loc[df['Start Day'] == days[i]]['Time Only'].value_counts().index,df.loc[df['Start Day'] == days[i]]['Time Only'].value_counts().values)
    plt.title(days[i])
    plt.vlines(x='7:30:00',ymin=0,ymax=1800,color='red')
    plt.vlines(x='20:30:00',ymin=0,ymax=1800,color='red')
    
#Offset peak start by 2 hours for the weekend
plt.subplot(2,4,6)
plt.scatter(df.loc[df['Start Day'] == 'Saturday']['Time Only'].value_counts().index,df.loc[df['Start Day'] == 'Saturday']['Time Only'].value_counts().values)
plt.title('Saturday')
plt.vlines(x='9:30:00',ymin=0,ymax=1800,color='red')
plt.vlines(x='20:30:00',ymin=0,ymax=1800,color='red')

plt.subplot(2,4,7)
plt.scatter(df.loc[df['Start Day'] == 'Sunday']['Time Only'].value_counts().index,df.loc[df['Start Day'] == 'Sunday']['Time Only'].value_counts().values)
plt.title('Sunday')
plt.vlines(x='9:30:00',ymin=0,ymax=1800,color='red')
plt.vlines(x='20:30:00',ymin=0,ymax=1800,color='red')

plt.tight_layout()

People like to sleep in on the weekends, so the start-of-peak bar is moved 2 hours later on Saturday and Sunday.

We will create a feature called 'Peak' to describe this On-Peak vs Off-Peak difference and a feature called 'Time Only Int' to simplify our Start Time for later usage

In [None]:
def to_hour_int(x):
    #convert hh:mm:ss to hh integer
    x=str(x)
    x=x[:2]
    x=int(x)
    return x

In [None]:
df['Time Only Int']=df['Time Only']
df['Time Only Int']=df['Time Only Int'].astype('str')
df['Time Only Int']=df['Time Only Int'].apply(lambda x: to_hour_int(x))
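
The string round trip above can also be avoided by reading the hour straight from the rounded timestamp. A minimal sketch on invented times:

```python
import pandas as pd

# Hypothetical toy start times (illustrative only).
toy = pd.DataFrame({'Start Time': pd.to_datetime([
    '2016-07-07 07:29:00',
    '2016-07-07 07:31:00',
    '2016-07-07 23:45:00',
])})

# Round to the nearest hour, then take the hour as an integer.
# 23:45 rounds up to midnight of the next day, so its hour is 0.
toy['Time Only Int'] = toy['Start Time'].dt.round('60min').dt.hour
print(toy)
```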

In [None]:
df['Peak']=1
df.loc[df['Time Only Int']>20.5,'Peak']=0
df.loc[df['Time Only Int']<7.5,'Peak']=0
df.loc[(df['Start Day']=='Saturday')&(df['Time Only Int']<9.5),'Peak']=0
df.loc[(df['Start Day']=='Sunday')&(df['Time Only Int']<9.5),'Peak']=0

df['Peak'].describe()
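
The same On-Peak rule can be written as one vectorised expression, with the peak start depending on whether the day is a weekend. This sketch uses invented rows and is intended to reproduce the thresholds used above (7.5 on weekdays, 9.5 on weekends, 20.5 for the end).

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (illustrative only).
toy = pd.DataFrame({
    'Start Day': ['Monday', 'Monday', 'Saturday', 'Saturday', 'Sunday', 'Friday'],
    'Time Only Int': [7, 8, 9, 10, 21, 20],
})

is_weekend = toy['Start Day'].isin(['Saturday', 'Sunday'])

# Peak starts two hours later at the weekend.
peak_start = np.where(is_weekend, 9.5, 7.5)
peak_end = 20.5

# 1 inside the peak window, 0 outside it.
toy['Peak'] = ((toy['Time Only Int'] >= peak_start)
               & (toy['Time Only Int'] <= peak_end)).astype(int)
print(toy)
```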

In [None]:
print(df.Peak.value_counts(normalize=True))
plt.bar(df.Peak.value_counts().index,df.Peak.value_counts().values)
plt.xticks(ticks=[0,1],labels=['Off-Peak','On-Peak'])
plt.title('On-Peak vs. Off-Peak Rides')
plt.show()

In [None]:
df.info()

## Ride Duration and Type Investigation

In [None]:
df['Trip Route Category'].value_counts()

In [None]:
df['Trip Route Category'].value_counts(normalize=True)

Over 90% of trips are One Way: the bike is picked up at one station and ridden to another. The rider is getting from one place to another, using the bike as a genuine mode of transport.

In [None]:
df['Passholder Type'].value_counts()

In [None]:
df['Passholder Type'].value_counts(normalize=True)

Monthly Pass holders make the majority of rides on the LA Metro Bike Share network by a noticeable margin. By taking up a recurring monthly subscription they are well committed to the bike share network.

In [None]:
df['Plan Duration'].value_counts()

Plan Duration and Passholder Type appear to display the same data in two different ways:
<br>30 == Monthly Pass
<br>0 == Walk-up
<br>365 == Flex Pass
<br>Having this feature exist twice is unnecessary, so let's drop Plan Duration.
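
Before dropping a column as redundant, a cross-tabulation can confirm that the mapping really is one-to-one. A sketch on invented rows:

```python
import pandas as pd

# Hypothetical toy rows (illustrative only).
toy = pd.DataFrame({
    'Plan Duration': [30, 0, 365, 30, 0],
    'Passholder Type': ['Monthly Pass', 'Walk-up', 'Flex Pass',
                        'Monthly Pass', 'Walk-up'],
})

# Rows are plan durations, columns are passholder types, cells are ride counts.
ct = pd.crosstab(toy['Plan Duration'], toy['Passholder Type'])

# One-to-one means exactly one non-zero cell in every row and every column.
nonzero = ct > 0
one_to_one = bool((nonzero.sum(axis=1) == 1).all()
                  and (nonzero.sum(axis=0) == 1).all())
print(ct)
print(one_to_one)
```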



In [None]:
df.drop(columns=['Plan Duration'],inplace=True)

In [None]:
len(df['Bike ID'].unique())

762 individual bikes in the system

In [None]:
df['Bike ID'].value_counts().hist()
plt.title('Bike Usage')
plt.show()

The distribution of bike usage appears roughly normal, suggesting that riders pick bikes more or less at random. This is to be expected, and it also means the particular bike ridden is likely to be of little use to our machine learning models.

In [None]:
df.groupby('Trip Route Category')['Duration'].describe()

Round Trips tend to be significantly longer rides

In [None]:
df.groupby('Passholder Type')['Duration'].describe()

Monthly Pass holders tend to take much shorter rides but take many more of them, an excellent indication of a low barrier to riding. Walk-up riders, who only use the bike share system occasionally, tend to take longer rides. This is the type of pass I would expect to be held by a tourist leisurely riding around downtown LA.

In [None]:
df.groupby('Passholder Type')['Trip Route Category'].value_counts(normalize=True)

In [None]:
df.groupby('Peak')['Duration'].describe()

In [None]:
df.groupby('Peak')['Trip Route Category'].value_counts(normalize=True)

In [None]:
df.groupby('Peak')['Passholder Type'].value_counts(normalize=True)

Interestingly On-Peak and Off-Peak riding times did not see a difference in ride Duration or Trip Route Category. 
The proportion of Monthly Pass holders riding is higher during On-Peak hours and Walk-Up riders are a higher proportion of Off-Peak riders. 

In [None]:
df.Duration.value_counts(normalize=True).head(10)


In [None]:
plt.figure(figsize=(15,4))
df.Duration.hist(bins=29)
plt.title('Ride Duration Histogram')
plt.show()

Ride Duration is very heavily right skewed: most rides are very short, but there is a very long tail out to a maximum of 1440 minutes, i.e. rides lasting a full 24 hours.
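
The direction of the skew can be checked numerically: a long tail of large values gives a positive skewness and pulls the mean above the median. The durations below are invented to mimic that shape.

```python
import pandas as pd

# Hypothetical durations in minutes: mostly short rides plus one day-long outlier.
durations = pd.Series([3, 5, 6, 8, 10, 12, 15, 20, 45, 1440])

skewness = durations.skew()       # positive for a long right tail
mean_minutes = durations.mean()
median_minutes = durations.median()

print(skewness, mean_minutes, median_minutes)
```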

In [None]:
plt.figure(figsize=(15,4))
df.loc[df['Duration']<30].Duration.hist(bins=29)
plt.title('Ride Duration for rides less than 30 minutes')
plt.show()

In [None]:
round(len(df.loc[df.Duration<31])/len(df),2)

89% of rides are 30 minutes or less. All rides longer than 30 minutes will be clipped to 30 to prevent the long tail from exerting undue influence on our machine learning models.

In [None]:
df['Duration']=df['Duration'].clip(upper=30)
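
A toy illustration of the clipping step: values at or below the cap are unchanged, and everything above it becomes exactly 30. The durations are invented.

```python
import pandas as pd

# Hypothetical durations in minutes (illustrative only).
durations = pd.Series([3, 12, 30, 45, 1440])

# Share of rides at or below the 30 minute cap.
share_le_30 = (durations <= 30).mean()

# Cap long rides at 30 minutes; shorter rides are left as they are.
clipped = durations.clip(upper=30)

print(share_le_30)
print(clipped.tolist())
```

Note that after clipping, a value of 30 means "30 minutes or longer", not a ride of exactly 30 minutes.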

In [None]:
plt.figure(figsize=(15,4))
df.Duration.hist(bins=29)
plt.title('Ride Duration clipped to 30 minutes')
plt.show()
# plt.vlines(x=15,ymin=0,ymax=10000,color='red')
# plt.vlines(x=10,ymin=0,ymax=10000,color='red')
# plt.vlines(x=5,ymin=0,ymax=10000,color='red')
# plt.ylim(0,30000)
# plt.xlim(0,5000)

## Classifier Model for Passholder Type

Can we predict what kind of membership the ride taker holds?

There are multiple types of network membership, and each is bound to attract riders with different wants or expectations. An understanding of how and why riders fit into these particular categories is invaluable for understanding rider behaviour, attracting new riders and ultimately growing the LA bike share network.

In [None]:
df.columns

In [None]:
y=df['Passholder Type']
X=df.drop(columns=['Bike ID','Time Only','Start Time','End Time','Passholder Type'])

In [None]:
X['Starting Station ID']=X['Starting Station ID'].astype('str')
X['Ending Station ID']=X['Ending Station ID'].astype('str')


Categorical features are one-hot encoded, all features are scaled and dimensionality is reduced

In [None]:
X=pd.get_dummies(X)

In [None]:
X.shape

In [None]:
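from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
# fit_transform returns a new scaled array and leaves X itself unchanged.
# The result is not assigned here, so the cells below work with the unscaled X.
scaler.fit_transform(X)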

In [None]:
from sklearn.decomposition import PCA
pca=PCA()
pca.fit(X)
tot = sum(pca.explained_variance_)
var_exp = [(i/tot)*100 for i in sorted(pca.explained_variance_, reverse=True)] 
print(var_exp[0:5])
print(sum(var_exp))
cum_var_exp = np.cumsum(var_exp) 
plt.style.use('ggplot')
plt.figure(figsize=(15, 8))
plt.plot(cum_var_exp)
plt.title('Cumulative Explained Variance as a Function of the Number of Components')
plt.vlines(x=20,ymin=var_exp[0],ymax=100)

Approximately 98% of the variance is retained using 20 principal components.
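
Instead of reading the component count off the plot, it can be computed from the cumulative explained variance ratio. The sketch below uses random synthetic data, so the number it finds has nothing to do with the bike share features; `target` and `n_needed` are names introduced for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 200 rows, 10 features with very different spreads.
rng = np.random.default_rng(0)
scales = np.array([10, 5, 2, 1, 1, 0.5, 0.5, 0.1, 0.1, 0.1])
data = rng.normal(size=(200, 10)) * scales

pca_demo = PCA().fit(data)
cum_ratio = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.98
n_needed = int(np.searchsorted(cum_ratio, target) + 1)
print(n_needed, cum_ratio[n_needed - 1])
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, which selects the count for a target variance fraction directly.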

In [None]:
pca=PCA(n_components=20)
pca.fit(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y)

In [None]:
X_train=pca.transform(X_train)
X_test=pca.transform(X_test)
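
In the cells above, the scaler output is not assigned and the PCA is fitted on the full feature matrix before the split. One alternative, shown only as a sketch on synthetic data and not the procedure behind the scores reported in this notebook, is to chain scaling, PCA and the classifier in a scikit-learn `Pipeline` so that every step is fitted on the training rows only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the encoded feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scaling and PCA are fitted inside the pipeline, on the training split only.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=4)),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipe.fit(X_tr, y_tr)

# The test rows are transformed with the training statistics, then scored.
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

A pipeline like this can be passed to `GridSearchCV` as the estimator, with parameters addressed as `clf__max_depth` and so on, so each cross-validation fold refits the preprocessing as well.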

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid_rfc={'n_estimators':[10,100],
                'criterion': ['entropy', 'gini'], 
                'max_depth': [2, 5, 10, None],  
                'min_samples_leaf':[0.05 ,0.1, 0.2], 
                'min_samples_split':[0.05 ,0.1, 0.2]}
grid_rfc=GridSearchCV(estimator=RandomForestClassifier(),
                     param_grid=param_grid_rfc,
                     cv=3)
grid_rfc.fit(X_train,y_train)

In [None]:
grid_rfc.best_params_

In [None]:
grid_rfc.best_score_

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
y_pred = grid_rfc.best_estimator_.predict(X_test)
print(classification_report(y_test,y_pred))

In [None]:
import itertools
def show_cf(y_true, y_pred, class_names=None, model_name=None):
    '''Stylized Visual Confusion Matrix provided by Flatiron School'''
    cf = confusion_matrix(y_true, y_pred)
    plt.imshow(cf, cmap=plt.cm.Blues)
    
    if model_name:
        plt.title("Confusion Matrix: {}".format(model_name))
    else:
        plt.title("Confusion Matrix")
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    
    if class_names:
        # Only build tick marks when labels are supplied; len(None) would raise
        tick_marks = np.arange(len(class_names))
        plt.xticks(tick_marks, class_names)
        plt.yticks(tick_marks, class_names)
    
    thresh = cf.max() / 2.
    
    for i, j in itertools.product(range(cf.shape[0]), range(cf.shape[1])):
        plt.text(j, i, cf[i, j], horizontalalignment='center', color='white' if cf[i, j] > thresh else 'black')

    plt.colorbar()
    


In [None]:
show_cf(y_test,y_pred,class_names=['Flex Pass','Monthly Pass','Walk Up'],model_name='Tuned Random Forest Classifier')
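
The tick labels passed to `show_cf` are typed by hand, so they need to be in the same order as the rows of the matrix. Passing the same list to `confusion_matrix` through its `labels` argument pins that order explicitly. The labels and predictions below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (illustrative only).
class_names = ['Flex Pass', 'Monthly Pass', 'Walk-up']
y_true_demo = ['Monthly Pass', 'Walk-up', 'Flex Pass', 'Monthly Pass', 'Walk-up']
y_pred_demo = ['Monthly Pass', 'Monthly Pass', 'Monthly Pass', 'Monthly Pass', 'Walk-up']

# Rows are true labels, columns are predicted labels, both in class_names order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
print(cm)
```

The label strings must match the values in the target column exactly, otherwise the corresponding rows and columns come out as all zeros.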

In [None]:
from sklearn.ensemble import AdaBoostClassifier
param_grid_ada={'n_estimators': [30, 50, 70],
                'learning_rate': [1.0, 0.5, 0.1]}
grid_ada=GridSearchCV(estimator=AdaBoostClassifier(),
                     param_grid=param_grid_ada,
                     cv=3)
grid_ada.fit(X_train,y_train)

In [None]:
grid_ada.best_params_

In [None]:
grid_ada.best_score_

In [None]:
y_pred = grid_ada.best_estimator_.predict(X_test)
# print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

In [None]:
show_cf(y_test,y_pred, class_names=['Flex Pass','Monthly Pass','Walk Up'],model_name='Tuned ADABoost Classifier')

In [None]:
import xgboost as xgb
param_grid_xgb= {
    "learning_rate": [0.3,0.5,0.7],
    'max_depth': [5,6,7],
    'min_child_weight': [0,1,3],
    'n_estimators': [10,100],
}
grid_xgb=GridSearchCV(estimator=xgb.XGBClassifier(),
                      param_grid=param_grid_xgb,
                     cv=3)
grid_xgb.fit(X_train,y_train)
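
As far as I am aware, recent xgboost releases expect class labels to be integers 0..n-1 rather than strings, so the fit above may raise an error depending on the installed version. If it does, encoding the target with scikit-learn's `LabelEncoder` is one common workaround. The sketch below shows only the encoding step, on invented labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values (illustrative only).
y_demo = pd.Series(['Monthly Pass', 'Walk-up', 'Flex Pass', 'Monthly Pass'])

# Map each class name to an integer code; classes are sorted alphabetically.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_demo)

# inverse_transform recovers the original names for reporting.
y_decoded = encoder.inverse_transform(y_encoded)

print(list(encoder.classes_))
print(y_encoded.tolist())
```

The encoded array would be used for fitting, and predictions passed through `inverse_transform` before building the classification report.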


In [None]:
grid_xgb.best_params_

In [None]:
grid_xgb.best_score_

In [None]:
y_pred = grid_xgb.best_estimator_.predict(X_test)
# print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

In [None]:
show_cf(y_test, y_pred,class_names=['Flex Pass','Monthly Pass','Walk Up'],model_name='Tuned XGBoost Classifier')

Our classification models do quite well: all three score better than 70%. XGBoost performs best, with an overall score of 74%, and perhaps most significantly it was able to attribute some predictions to our least common membership type, the Flex Pass holders. The underrepresentation of Flex Pass holders in the dataset makes prediction tricky. There are resampling techniques to overcome this, but they are beyond the scope of this project.
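
To give a flavour of what resampling involves, here is a sketch of plain random oversampling with pandas. This is a simpler stand-in for techniques such as SMOTE, it was not used for any result in this notebook, and the rows are invented.

```python
import pandas as pd

# Hypothetical imbalanced toy data (illustrative only).
toy = pd.DataFrame({
    'Passholder Type': ['Monthly Pass'] * 6 + ['Walk-up'] * 3 + ['Flex Pass'] * 1,
    'Duration': range(10),
})

# Size of the largest class; every class is resampled up to this size.
target_size = toy['Passholder Type'].value_counts().max()

# Sample each class with replacement so that all classes end up equally common.
balanced = pd.concat(
    [group.sample(n=target_size, replace=True, random_state=0)
     for _, group in toy.groupby('Passholder Type')],
    ignore_index=True,
)
print(balanced['Passholder Type'].value_counts())
```

Any resampling of this kind should be applied to the training split only, so that the test set keeps the real class proportions.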

## Conclusion

Machine Learning models are valuable tools for analysing the LA-Metro Bike Sharing Network.
<br> We have demonstrated accurate models of the riding patterns of different membership types with a 74% prediction accuracy.


### Future Work


1. The current dataset covers 9 months; updated data lengthening the timescale would allow year-on-year factors to be accounted for.
2. Time is binned by hour in this exercise; more detailed use of the time series data may improve accuracy.
3. Apply a resampling method such as SMOTE to the classification models to boost accuracy for Flex Pass membership prediction.
4. Extract the most influential features from the models and explore their influence on ridership.
5. Tune the classification models around a business case, maximising accuracy for a particular membership type with the goal of guiding membership in that direction.
6. Try alternative ride Duration modelling using classifier models on binned Duration intervals.
7. Import and cross-reference weather data for more accurate Duration prediction.