# Ford GoBike System Data Exploration
## by Martin Tschendel

## Preliminary Wrangling

This data set contains information about individual rides made in 2018 in a bike-sharing system covering the greater San Francisco Bay Area.

In [33]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline


In [34]:
# load the twelve monthly 2018 data sets
data_1801 = pd.read_csv('data/201801-fordgobike-tripdata.csv')
data_1802 = pd.read_csv('data/201802-fordgobike-tripdata.csv')
data_1803 = pd.read_csv('data/201803-fordgobike-tripdata.csv')
data_1804 = pd.read_csv('data/201804-fordgobike-tripdata.csv')
data_1805 = pd.read_csv('data/201805-fordgobike-tripdata.csv')
data_1806 = pd.read_csv('data/201806-fordgobike-tripdata.csv')
data_1807 = pd.read_csv('data/201807-fordgobike-tripdata.csv')
data_1808 = pd.read_csv('data/201808-fordgobike-tripdata.csv')
data_1809 = pd.read_csv('data/201809-fordgobike-tripdata.csv')
data_1810 = pd.read_csv('data/201810-fordgobike-tripdata.csv')
data_1811 = pd.read_csv('data/201811-fordgobike-tripdata.csv')
data_1812 = pd.read_csv('data/201812-fordgobike-tripdata.csv')
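The twelve reads above follow a single file-name pattern, so the paths could also be generated in a loop. The following is a minimal sketch under that assumption; `monthly_paths` and `load_months` are hypothetical helper names that do not appear in the original notebook, and the directory layout is taken from the paths used above.

```python
import pandas as pd


def monthly_paths(year=2018, data_dir='data'):
    """Build the twelve monthly file names following the pattern used above."""
    return ['{}/{}{:02d}-fordgobike-tripdata.csv'.format(data_dir, year, month)
            for month in range(1, 13)]


def load_months(paths):
    """Read each CSV file and return the frames in a dict keyed by path."""
    return {path: pd.read_csv(path) for path in paths}


paths = monthly_paths()
print(paths[0])   # data/201801-fordgobike-tripdata.csv
print(paths[-1])  # data/201812-fordgobike-tripdata.csv
```

Calling `load_months(paths)` would then read all twelve files, provided they exist at those locations.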

In [35]:
data_1801.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94802 entries, 0 to 94801
Data columns (total 16 columns):
duration_sec               94802 non-null int64
start_time                 94802 non-null object
end_time                   94802 non-null object
start_station_id           94802 non-null int64
start_station_name         94802 non-null object
start_station_latitude     94802 non-null float64
start_station_longitude    94802 non-null float64
end_station_id             94802 non-null int64
end_station_name           94802 non-null object
end_station_latitude       94802 non-null float64
end_station_longitude      94802 non-null float64
bike_id                    94802 non-null int64
user_type                  94802 non-null object
member_birth_year          86963 non-null float64
member_gender              87001 non-null object
bike_share_for_all_trip    94802 non-null object
dtypes: float64(5), int64(4), object(7)
memory usage: 11.6+ MB


In [36]:
data_1803.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111382 entries, 0 to 111381
Data columns (total 16 columns):
duration_sec               111382 non-null int64
start_time                 111382 non-null object
end_time                   111382 non-null object
start_station_id           111382 non-null int64
start_station_name         111382 non-null object
start_station_latitude     111382 non-null float64
start_station_longitude    111382 non-null float64
end_station_id             111382 non-null int64
end_station_name           111382 non-null object
end_station_latitude       111382 non-null float64
end_station_longitude      111382 non-null float64
bike_id                    111382 non-null int64
user_type                  111382 non-null object
member_birth_year          102347 non-null float64
member_gender              102385 non-null object
bike_share_for_all_trip    111382 non-null object
dtypes: float64(5), int64(4), object(7)
memory usage: 13.6+ MB


In [37]:
data_1811.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134135 entries, 0 to 134134
Data columns (total 16 columns):
duration_sec               134135 non-null int64
start_time                 134135 non-null object
end_time                   134135 non-null object
start_station_id           133651 non-null float64
start_station_name         133651 non-null object
start_station_latitude     134135 non-null float64
start_station_longitude    134135 non-null float64
end_station_id             133651 non-null float64
end_station_name           133651 non-null object
end_station_latitude       134135 non-null float64
end_station_longitude      134135 non-null float64
bike_id                    134135 non-null int64
user_type                  134135 non-null object
member_birth_year          129037 non-null float64
member_gender              129037 non-null object
bike_share_for_all_trip    134135 non-null object
dtypes: float64(7), int64(2), object(7)
memory usage: 16.4+ MB
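
Comparing the three listings, `start_station_id` and `end_station_id` are `int64` in January and March but `float64` in November, where 484 station IDs are missing (134135 rows, 133651 non-null). A sketch of how such differences could be surfaced in one table, using small stand-in frames instead of the real files:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two monthly files; the values are illustrative only.
jan = pd.DataFrame({'start_station_id': [120, 15], 'duration_sec': [75284, 85422]})
nov = pd.DataFrame({'start_station_id': [120.0, np.nan], 'duration_sec': [600, 700]})

# One column per month, one row per field, showing the inferred dtype.
dtypes = pd.DataFrame({'jan': jan.dtypes.astype(str), 'nov': nov.dtypes.astype(str)})

# Keep only the fields whose dtype differs between the two months.
mismatch = dtypes[dtypes['jan'] != dtypes['nov']]
print(mismatch)
```

A missing value forces an integer column to float, which is why only the station ID field shows up in `mismatch`.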


In [38]:
#join dataframes along rows
df_18 = pd.concat([data_1801, data_1802, data_1803, data_1804,
                  data_1805, data_1806, data_1807, data_1808,
                  data_1809, data_1810, data_1811, data_1812,], sort=True)

In [39]:
df_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1863721 entries, 0 to 131362
Data columns (total 16 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             float64
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   object
member_birth_year          float64
member_gender              object
start_station_id           float64
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 object
user_type                  object
dtypes: float64(7), int64(2), object(7)
memory usage: 241.7+ MB
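
The index line in this listing (1863721 entries, labels 0 to 131362) indicates that `pd.concat` kept each monthly file's own row labels, so the same label occurs in several months. Whether that matters depends on later label-based lookups. A small sketch with toy frames shows the effect and the `ignore_index=True` option, which is an alternative and not what the cell above uses:

```python
import pandas as pd

a = pd.DataFrame({'duration_sec': [75284, 85422]})
b = pd.DataFrame({'duration_sec': [71576, 61076, 39966]})

kept = pd.concat([a, b])                      # labels 0, 1, 0, 1, 2
fresh = pd.concat([a, b], ignore_index=True)  # labels 0, 1, 2, 3, 4

print(kept.index.is_unique)   # False
print(fresh.index.is_unique)  # True
print(len(kept.loc[0]))       # 2 rows share the label 0
```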


### Data quality issues
* data type of columns start_time and end_time is object and not datetime
* data type of user_type and member_gender is object and not category
* data type of start_station_id and end_station_id is float and not category

In [40]:
# Change data type of columns start_time and end_time to datetime
df_18.start_time = pd.to_datetime(df_18.start_time)

In [41]:
df_18.end_time = pd.to_datetime(df_18.end_time)
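
As a sanity check on the parsed timestamps, the difference between `end_time` and `start_time` can be compared with `duration_sec`. The sketch below uses the first two rides displayed by `df_18.head()` in this notebook; the explicit `format` string is an assumption about the timestamp layout based on those two rows, and it is optional.

```python
import pandas as pd

rides = pd.DataFrame({
    'start_time': ['2018-01-31 22:52:35.239', '2018-01-31 16:13:34.351'],
    'end_time': ['2018-02-01 19:47:19.824', '2018-02-01 15:57:17.310'],
    'duration_sec': [75284, 85422],
})

fmt = '%Y-%m-%d %H:%M:%S.%f'  # assumed layout, taken from the two rows above
for col in ['start_time', 'end_time']:
    rides[col] = pd.to_datetime(rides[col], format=fmt)

# Elapsed whole seconds between start and end of each ride.
elapsed = (rides['end_time'] - rides['start_time']).dt.total_seconds().astype(int)
print((elapsed == rides['duration_sec']).all())  # True
```

For these two rides, `duration_sec` equals the elapsed time truncated to whole seconds.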

In [42]:
# Change data type from object to category
df_18.user_type = df_18.user_type.astype('category')
df_18.member_gender = df_18.member_gender.astype('category')
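
Storing a string column with few distinct values as `category` mainly saves memory, which is consistent with the lower memory usage reported by the second `df_18.info()` listing. A toy illustration (the series length is arbitrary):

```python
import pandas as pd

user_type = pd.Series(['Subscriber', 'Customer'] * 50000)
as_category = user_type.astype('category')

# deep=True also counts the string data itself, not only the references to it.
print(user_type.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
print(list(as_category.cat.categories))  # ['Customer', 'Subscriber']
```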

In [43]:
# Convert start_station_id and end_station_id from float to string, remove the
# trailing '.0' using string slicing, set missing IDs back to NaN (they would
# otherwise end up as the string 'n'), and convert the result to a category
for col in ['start_station_id', 'end_station_id']:
    ids = df_18[col]
    df_18[col] = ids.astype(str).str[:-2].where(ids.notna().values).astype('category')
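
The slicing step relies on every ID ending in `.0`. Missing IDs are a special case: in older pandas versions, casting NaN to string yields the text `nan`, which the slice then shortens to `n`. A sketch on a toy series, where the masking step is one possible way to keep such entries missing:

```python
import numpy as np
import pandas as pd

ids = pd.Series([285.0, np.nan, 15.0])

# Cast to string and cut off the last two characters ('.0').
sliced = ids.astype(str).str[:-2]
print(sliced.tolist())

# Put NaN back wherever the original ID was missing, then convert.
masked = sliced.where(ids.notna()).astype('category')
print(list(masked.cat.categories))  # ['15', '285']
print(int(masked.isna().sum()))     # 1
```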

In [44]:
df_18.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1863721 entries, 0 to 131362
Data columns (total 16 columns):
bike_id                    int64
bike_share_for_all_trip    object
duration_sec               int64
end_station_id             category
end_station_latitude       float64
end_station_longitude      float64
end_station_name           object
end_time                   datetime64[ns]
member_birth_year          float64
member_gender              category
start_station_id           category
start_station_latitude     float64
start_station_longitude    float64
start_station_name         object
start_time                 datetime64[ns]
user_type                  category
dtypes: category(4), datetime64[ns](2), float64(5), int64(2), object(3)
memory usage: 195.5+ MB


In [45]:
df_18.describe()

| | bike_id | duration_sec | end_station_latitude | end_station_longitude | member_birth_year | start_station_latitude | start_station_longitude |
|---|---|---|---|---|---|---|---|
| count | 1863721.0 | 1863721.0 | 1863721.0 | 1863721.0 | 1753003.0 | 1863721.0 | 1863721.0 |
| mean | 2296.851 | 857.3026 | 37.7669 | -122.3487 | 1983.088 | 37.76678 | -122.3492 |
| std | 1287.733 | 2370.379 | 0.1056483 | 0.1650597 | 10.44289 | 0.1057689 | 0.1654634 |
| min | 11.0 | 61.0 | 37.26331 | -122.4737 | 1881.0 | 37.26331 | -122.4737 |
| 25% | 1225.0 | 350.0 | 37.77106 | -122.4094 | 1978.0 | 37.77106 | -122.4114 |
| 50% | 2338.0 | 556.0 | 37.78127 | -122.3971 | 1985.0 | 37.78107 | -122.3974 |
| 75% | 3333.0 | 872.0 | 37.79728 | -122.2894 | 1991.0 | 37.79625 | -122.2865 |
| max | 6234.0 | 86366.0 | 45.51 | -73.57 | 2000.0 | 45.51 | -73.57 |
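
A few extremes in this summary stand out against the quartiles: a `member_birth_year` of 1881, coordinates of 45.51 / -73.57 while the quartiles stay near 37.8 / -122.3 to -122.4, and a maximum duration of 86366 seconds (close to 24 hours). A sketch of how such rows could be flagged on a toy frame; the cut-off values are assumptions chosen for illustration, not thresholds used in this notebook:

```python
import pandas as pd

trips = pd.DataFrame({
    'member_birth_year': [1986.0, 1881.0, None, 1991.0],
    'start_station_latitude': [37.76142, 37.795392, 45.51, 37.776435],
    'start_station_longitude': [-122.426435, -122.394203, -73.57, -122.426244],
})

# Hypothetical plausibility limits (assumptions, adjust as needed).
too_old = trips['member_birth_year'] < 1900
outside_area = ~(trips['start_station_latitude'].between(37.0, 38.5)
                 & trips['start_station_longitude'].between(-123.0, -121.5))

# Rows that fail at least one of the two checks.
suspicious = trips[too_old | outside_area]
print(suspicious.index.tolist())  # [1, 2]
```

Rows with a missing birth year are not flagged by the first check, because a comparison with NaN evaluates to `False`.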


In [46]:
df_18['user_type'].value_counts()

Subscriber    1583554
Customer       280167
Name: user_type, dtype: int64
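
Expressed as shares, these counts mean that roughly 85% of the rides were made by subscribers. A short sketch that rebuilds the counts shown above and normalizes them; on the real frame, `value_counts(normalize=True)` should give the same proportions directly:

```python
import pandas as pd

counts = pd.Series({'Subscriber': 1583554, 'Customer': 280167})

# Divide each count by the total number of rides.
shares = counts / counts.sum()
print(counts.sum())  # 1863721, the number of rows in df_18
print(shares.round(3))
```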

In [47]:
df_18.head()

| | bike_id | bike_share_for_all_trip | duration_sec | end_station_id | end_station_latitude | end_station_longitude | end_station_name | end_time | member_birth_year | member_gender | start_station_id | start_station_latitude | start_station_longitude | start_station_name | start_time | user_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2765 | No | 75284 | 285 | 37.783521 | -122.431158 | Webster St at O'Farrell St | 2018-02-01 19:47:19.824 | 1986.0 | Male | 120 | 37.76142 | -122.426435 | Mission Dolores Park | 2018-01-31 22:52:35.239 | Subscriber |
| 1 | 2815 | No | 85422 | 15 | 37.795392 | -122.394203 | San Francisco Ferry Building (Harry Bridges Pl... | 2018-02-01 15:57:17.310 | NaN | NaN | 15 | 37.795392 | -122.394203 | San Francisco Ferry Building (Harry Bridges Pl... | 2018-01-31 16:13:34.351 | Customer |
| 2 | 3039 | No | 71576 | 296 | 37.325998 | -121.87712 | 5th St at Virginia St | 2018-02-01 10:16:52.116 | 1996.0 | Male | 304 | 37.348759 | -121.894798 | Jackson St at 5th St | 2018-01-31 14:23:55.889 | Customer |
| 3 | 321 | No | 61076 | 47 | 37.780955 | -122.399749 | 4th St at Harrison St | 2018-02-01 07:51:20.500 | NaN | NaN | 75 | 37.773793 | -122.421239 | Market St at Franklin St | 2018-01-31 14:53:23.562 | Customer |
| 4 | 617 | No | 39966 | 19 | 37.788975 | -122.403452 | Post St at Kearny St | 2018-02-01 06:58:31.053 | 1991.0 | Male | 74 | 37.776435 | -122.426244 | Laguna St at Hayes St | 2018-01-31 19:52:24.667 | Subscriber |


### What is the structure of your dataset?

There are nearly 1.9 million individual ride entries in this dataset (1,863,721 rows), each described by 16 features. These include quantitative variables such as the duration of each ride (duration_sec) and categorical variables such as the type of user (user_type).

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration


### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
