# City Bike Share Analysis



## Project Summary

### Problem Statement

### What is the likelihood that a bike share program will succeed in the City of Atlanta?
Using bike share data from other US cities, I analyze the popularity and usage trends of bike shares among women and men of different ages.



## Description of Data and Collection method

CSV files from open datasets for:

* Chicago Q1 & Q2 2015
* Boston 2011 to 2013
* NYC Jan 2015 to Jul 2015

Files contain:

* Gender
* Year of birth
* Trip date and time
* Duration of trip

### Transformation and data aggregation description

I loaded the data for Chicago and Boston into a single table (via MS SQL Server Express), for a total volume of ~1.2 M records. To conform the data to a standard gender code, I updated the gender of each record to either "Female" or "Male".
The data was then aggregated by summing trip duration, grouped by city, gender, and age. The aggregated data was exported to a CSV file for use in cross-validation and modeling.
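
The aggregation itself was done in SQL Server and is not shown in this notebook. As a rough pandas equivalent, the sketch below works on a handful of made-up trip records. The `trips` frame, its `Duration_sec` column, and the raw gender codes `F`/`M` are assumptions for illustration; the output column names and the 2015 reference year for `Age` follow the extract loaded further down (for example, year of birth 1932 corresponds to age 83).

```python
import pandas as pd

# Hypothetical trip-level records; column names and raw gender codes are assumptions.
trips = pd.DataFrame({
    "City": ["Boston", "Boston", "Boston", "Chicago", "Chicago"],
    "Gender": ["F", "Female", "M", "Male", "Male"],
    "YearOfBirth": [1980, 1980, 1975, 1990, 1990],
    "Duration_sec": [600, 900, 300, 1200, 800],
})

# Conform the gender codes to the two standard labels.
trips["Gender"] = trips["Gender"].replace({"F": "Female", "M": "Male"})

# Age relative to 2015, matching Age + YearOfBirth in the extract.
trips["Age"] = 2015 - trips["YearOfBirth"]

# One row per (city, gender, age) with trip count, total and mean duration.
summary = (
    trips.groupby(["City", "Gender", "Age"], as_index=False)
         .agg(NumRecords=("Duration_sec", "size"),
              TotalDur_sec=("Duration_sec", "sum"),
              AvgDur_sec=("Duration_sec", "mean"))
)
print(summary)
```

Named aggregation as used here needs a reasonably recent pandas (0.25 or later).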


In [1]:
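import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# cross_validation and grid_search were merged into model_selection in
# scikit-learn 0.18 and removed in 0.20; fall back so both names keep working.
try:
    from sklearn import cross_validation, grid_search
except ImportError:
    from sklearn import model_selection as cross_validation
    from sklearn import model_selection as grid_search
%matplotlib inline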



In [2]:
dsCity= pd.read_csv('https://raw.githubusercontent.com/Sw3m/Data_Science_GA_Project_15/master/Datasets/CityBikeShareExtract.csv')

In [3]:
dsCity.head()

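| Unnamed: 0 | City | NumRecords | Gender | Age | YearOfBirth | TotalDur_sec | AvgDur_sec |
|---|---|---|---|---|---|---|---|
| 0 | Boston | 5 | Female | 83 | 1932 | 4182 | 836 |
| 1 | Boston | 2 | Female | 77 | 1938 | 1457 | 728 |
| 2 | Boston | 11 | Male | 77 | 1938 | 7645 | 695 |
| 3 | Boston | 11 | Male | 76 | 1939 | 4956 | 450 |
| 4 | Boston | 16 | Male | 75 | 1940 | 10131 | 633 |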


In [4]:
len(dsCity.index)

378

In [5]:
dsBoston = dsCity[dsCity['City']=='Boston']
dsChicago = dsCity[dsCity['City']=='Chicago']
dsNYC= dsCity[dsCity['City']=='NYC']

In [6]:
# Encode the categorical columns as integers so they can be fed to the regressors
dsCity['Gender'] = dsCity.Gender.map({'Female': 0, 'Male': 1})
dsCity['City'] = dsCity.City.map({'Boston': 0, 'Chicago': 1, 'NYC': 2})
dsCity.head()

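| Unnamed: 0 | City | NumRecords | Gender | Age | YearOfBirth | TotalDur_sec | AvgDur_sec |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 5 | 0 | 83 | 1932 | 4182 | 836 |
| 1 | 0 | 2 | 0 | 77 | 1938 | 1457 | 728 |
| 2 | 0 | 11 | 1 | 77 | 1938 | 7645 | 695 |
| 3 | 0 | 11 | 1 | 76 | 1939 | 4956 | 450 |
| 4 | 0 | 16 | 1 | 75 | 1940 | 10131 | 633 |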
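
Mapping `City` to 0, 1, 2 makes a linear model treat the cities as ordered and evenly spaced (Boston < Chicago < NYC), which the data does not imply. One common alternative, shown here only as a sketch on a small made-up frame rather than as the encoding used in this notebook, is one-hot encoding with `pd.get_dummies`. The `sample` frame is hypothetical.

```python
import pandas as pd

# Small made-up frame with the same categorical columns as the extract.
sample = pd.DataFrame({
    "City": ["Boston", "Chicago", "NYC", "Boston"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [83, 30, 45, 77],
})

# A binary column is fine as a single 0/1 indicator.
sample["Gender"] = sample["Gender"].map({"Female": 0, "Male": 1})

# One indicator column per city; drop_first=True drops the first category
# (Boston) so it becomes the baseline and the columns are not collinear.
encoded = pd.get_dummies(sample, columns=["City"], drop_first=True)
print(encoded.columns.tolist())
```

With this encoding each city gets its own coefficient relative to the baseline city instead of sharing a single slope.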


In [7]:
X = dsCity[['City', 'NumRecords', 'Gender', 'Age']]

Y = dsCity[['TotalDur_sec']]

In [9]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,Y,test_size=0.2,random_state=1)

### Perform Linear Regression

In [10]:
linear_rgr = linear_model.LinearRegression()
linear_rgr.fit(X_train, y_train)

print("coefficient {}".format(linear_rgr.coef_))
print("intercept {}".format(linear_rgr.intercept_))
linear_MSE = mean_squared_error(y_test, linear_rgr.predict(X_test))
print('Mean squared error for linear regression: {}'.format(linear_MSE))


coefficient [[-159027.21882418     712.18490071 -234449.87436224     958.78162002]]
intercept [ 264146.00886506]
Mean squared error for linear regression: 67099289132.2
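
An MSE in squared seconds is hard to read. Taking the square root gives the root mean squared error (RMSE) in the same units as `TotalDur_sec`. The short sketch below simply reuses the MSE printed above; the variable names are illustrative.

```python
import math

# Test-set MSE reported above, in seconds squared.
linear_mse = 67099289132.2

# RMSE is in the units of the target (seconds of total trip duration per group).
linear_rmse = math.sqrt(linear_mse)
print("RMSE: {:.0f} seconds (about {:.1f} hours)".format(linear_rmse, linear_rmse / 3600))
```

The result is roughly 259,000 seconds, which should be read against the size of the per-group duration totals in the extract.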


In [13]:
from sklearn import preprocessing

# Standardize features and target to zero mean and unit variance.
# Note: each split is scaled with its own mean and standard deviation.
X_train_scaled = preprocessing.scale(X_train)
X_test_scaled = preprocessing.scale(X_test)
y_train_scaled = preprocessing.scale(y_train)
y_test_scaled = preprocessing.scale(y_test)
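
Calling `preprocessing.scale` separately on the training and test sets standardizes each with its own statistics, so the test set is not transformed the same way the model's training data was. A common alternative, sketched here on synthetic data rather than on the bike share extract, is to fit a `StandardScaler` on the training split only and reuse it for the test split. The `*_demo` arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the train/test feature matrices (assumption).
rng = np.random.RandomState(1)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(80, 4))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 4))

# Learn mean and standard deviation from the training split only.
scaler = StandardScaler().fit(X_train_demo)

# Apply the same transformation to both splits.
X_train_std = scaler.transform(X_train_demo)
X_test_std = scaler.transform(X_test_demo)

print(X_train_std.mean(axis=0).round(6))  # zero by construction
print(X_test_std.mean(axis=0).round(3))   # close to, but not exactly, zero
```

The test means are not exactly zero because the test split is scaled with the training statistics, which is the intended behavior.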


In [14]:
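# Candidate regularization strengths shared by both grid searches
param_grid = [{'alpha': np.linspace(1e-8, 1, 3000)}]

# Ridge: 5-fold cross-validated search over alpha
ridge_rgr = linear_model.Ridge(normalize=True)
ridge_cv = grid_search.GridSearchCV(ridge_rgr, param_grid, cv=5)
ridge_cv.fit(X_train_scaled, y_train_scaled)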


GridSearchCV(cv=5, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, solver='auto', tol=0.001),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid=[{'alpha': array([  1.00000e-08,   3.33454e-04, ...,   9.99667e-01,   1.00000e+00])}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [15]:
# Lasso: 5-fold cross-validated search over the same alpha grid
lasso_rgr = linear_model.Lasso(normalize=True)
lasso_cv = grid_search.GridSearchCV(lasso_rgr, param_grid, cv=5)
lasso_cv.fit(X_train_scaled, y_train_scaled)

GridSearchCV(cv=5, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=True, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid=[{'alpha': array([  1.00000e-08,   3.33454e-04, ...,   9.99667e-01,   1.00000e+00])}],
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [16]:
Ridge_MSE = mean_squared_error(y_test_scaled, ridge_cv.best_estimator_.predict(X_test_scaled))
Lasso_MSE = mean_squared_error(y_test_scaled, lasso_cv.best_estimator_.predict(X_test_scaled))

print('Mean squared error for Ridge regression: {}'.format(Ridge_MSE))
print('Mean squared error for Lasso regression: {}'.format(Lasso_MSE))

Mean squared error for Ridge regression: 0.0132338627564
Mean squared error for Lasso regression: 0.0132403884941
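
The linear regression MSE above is in squared seconds, whereas the Ridge and Lasso MSEs are computed on a standardized target, so the three numbers are not directly comparable. One way to put them on the same footing, sketched below on synthetic data and assuming a scikit-learn version that provides `sklearn.model_selection`, is to scale the target with a fitted `StandardScaler` and map predictions back to the original units with `inverse_transform` before scoring. All data and variable names in the sketch are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 features, target on a large scale.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 4))
y = 1000.0 * X[:, 0] - 500.0 * X[:, 1] + rng.normal(scale=100.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Fit the scalers on the training split only.
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

X_train_s = x_scaler.transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()

# Ridge is trained on standardized data...
ridge = Ridge(alpha=1.0).fit(X_train_s, y_train_s)

# ...and its predictions are mapped back to the original target units.
pred_raw = y_scaler.inverse_transform(
    ridge.predict(X_test_s).reshape(-1, 1)).ravel()

# Plain linear regression on the unscaled data, for comparison.
linear = LinearRegression().fit(X_train, y_train)

mse_linear = mean_squared_error(y_test, linear.predict(X_test))
mse_ridge = mean_squared_error(y_test, pred_raw)
print("Linear MSE (original units): {:.1f}".format(mse_linear))
print("Ridge MSE (original units):  {:.1f}".format(mse_ridge))
```

Both errors are now in the squared units of the target, so they can be compared directly.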
