## CitiBike System Data Exploration

### by Martin Tschendel

### Preliminary Wrangling

This data set contains information about individual rides made in the CitiBike bike-sharing system
covering New York. Source of data: [Link](https://www.bikeshare.com/data/)

In [48]:
# import all packages and set plots to be embedded inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

Compared to the data from the San Francisco Bay Area, the New York data for 2018 has nearly 10 times as many entries. To reduce calculation time, I start with the data set for May 2018 only.

Next, I load the data set.

In [49]:
#load in the dataset
data_1805_NY = pd.read_csv('data/201805-citibike-tripdata.csv')
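
Some of the type conversions carried out step by step later in this notebook could also be requested at load time. The following is a minimal sketch, not what this notebook does: it assumes the column names shown in the `info()` output and replaces the real CSV file with a two-row in-memory sample (values taken from the `head()` output) so that it runs on its own. The name `sample` is hypothetical.

```python
import io
import pandas as pd

# Two sample rows with a subset of the columns; stands in for the real CSV file.
sample_csv = io.StringIO(
    'tripduration,starttime,stoptime,usertype\n'
    '367,2018-05-01 05:06:16.5840,2018-05-01 05:12:23.9650,Subscriber\n'
    '1313,2018-05-01 06:25:49.4250,2018-05-01 06:47:42.7120,Subscriber\n'
)

# parse_dates converts the timestamp columns while reading,
# dtype stores usertype as a category right away.
sample = pd.read_csv(
    sample_csv,
    parse_dates=['starttime', 'stoptime'],
    dtype={'usertype': 'category'},
)
print(sample.dtypes)
```

With the real file, the same two keyword arguments would be passed to the `pd.read_csv` call above.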

First, I look at some basic characteristics of the new data set.

In [50]:
data_1805_NY.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1824710 entries, 0 to 1824709
Data columns (total 15 columns):
tripduration               int64
starttime                  object
stoptime                   object
start station id           int64
start station name         object
start station latitude     float64
start station longitude    float64
end station id             int64
end station name           object
end station latitude       float64
end station longitude      float64
bikeid                     int64
usertype                   object
birth year                 int64
gender                     int64
dtypes: float64(4), int64(6), object(5)
memory usage: 208.8+ MB


## Data Quality Issues
I identified the following data quality issues and plan to fix them before the upcoming investigation steps:
* The columns 'starttime' and 'stoptime' have data type object instead of datetime.
* 'usertype' has data type object instead of category.
* Several column names, such as 'start station id', contain white spaces.
* 'start station id' and 'end station id' have data type int64 instead of category.
* 'gender' is stored as an integer (0 = unknown, 1 = male, 2 = female) and should ideally be converted to a category with those labels.
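
The gender conversion described in the last point could look roughly like the sketch below. It assumes the code-to-label mapping stated in the list and uses a small hypothetical series (`gender_codes`) in place of the real `gender` column.

```python
import pandas as pd

# Hypothetical mini-sample; the real values live in data_1805_NY.gender.
gender_codes = pd.Series([1, 1, 2, 0, 1], name='gender')

# Mapping taken from the list above: 0 = unknown, 1 = male, 2 = female.
gender_labels = {0: 'unknown', 1: 'male', 2: 'female'}

# Map the integer codes to labels, then store them as a category
# with an explicit, fixed set of categories.
gender = gender_codes.map(gender_labels).astype(
    pd.CategoricalDtype(categories=['unknown', 'male', 'female'])
)
print(gender.value_counts())
```

Declaring the categories explicitly keeps all three labels available even if one of them does not occur in a given month of data.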

In [51]:
data_1805_NY.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
0,367,2018-05-01 05:06:16.5840,2018-05-01 05:12:23.9650,72,W 52 St & 11 Ave,40.767272,-73.993929,514,12 Ave & W 40 St,40.760875,-74.002777,30567,Subscriber,1965,1
1,1313,2018-05-01 06:25:49.4250,2018-05-01 06:47:42.7120,72,W 52 St & 11 Ave,40.767272,-73.993929,426,West St & Chambers St,40.717548,-74.013221,18965,Subscriber,1956,1
2,1798,2018-05-01 06:40:26.4450,2018-05-01 07:10:25.1790,72,W 52 St & 11 Ave,40.767272,-73.993929,3435,Grand St & Elizabeth St,40.718822,-73.99596,30241,Subscriber,1959,2
3,518,2018-05-01 07:06:02.9730,2018-05-01 07:14:41.0040,72,W 52 St & 11 Ave,40.767272,-73.993929,477,W 41 St & 8 Ave,40.756405,-73.990026,28985,Subscriber,1986,1
4,109,2018-05-01 07:26:32.3450,2018-05-01 07:28:21.5420,72,W 52 St & 11 Ave,40.767272,-73.993929,530,11 Ave & W 59 St,40.771522,-73.990541,14556,Subscriber,1991,1


In [52]:
# Change data type of columns starttime and stoptime to datetime
data_1805_NY.starttime = pd.to_datetime(data_1805_NY.starttime)

In [53]:
data_1805_NY.stoptime = pd.to_datetime(data_1805_NY.stoptime)

In [54]:
# Check if data type of columns starttime and stoptime is changed to datetime
data_1805_NY.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1824710 entries, 0 to 1824709
Data columns (total 15 columns):
tripduration               int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id           int64
start station name         object
start station latitude     float64
start station longitude    float64
end station id             int64
end station name           object
end station latitude       float64
end station longitude      float64
bikeid                     int64
usertype                   object
birth year                 int64
gender                     int64
dtypes: datetime64[ns](2), float64(4), int64(6), object(3)
memory usage: 208.8+ MB
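
Once both columns are datetimes, they can be checked against `tripduration`. The sketch below is one possible sanity check under two assumptions: every timestamp follows the format seen in the `head()` output, and `tripduration` holds whole seconds, which the sample rows suggest. The frame `trips` and the column `duration_gap` are hypothetical names.

```python
import pandas as pd

# First two rows of the head() output above, reduced to the relevant columns.
trips = pd.DataFrame({
    'tripduration': [367, 1313],
    'starttime': ['2018-05-01 05:06:16.5840', '2018-05-01 06:25:49.4250'],
    'stoptime': ['2018-05-01 05:12:23.9650', '2018-05-01 06:47:42.7120'],
})

# Assumed timestamp layout; an explicit format avoids per-value format guessing.
timestamp_format = '%Y-%m-%d %H:%M:%S.%f'
for column in ['starttime', 'stoptime']:
    trips[column] = pd.to_datetime(trips[column], format=timestamp_format)

# Difference between the elapsed time and the recorded duration, in seconds.
elapsed_seconds = (trips['stoptime'] - trips['starttime']).dt.total_seconds()
trips['duration_gap'] = (elapsed_seconds - trips['tripduration']).abs()
print(trips[['tripduration', 'duration_gap']])
```

For these two rows the gap stays below one second, which is consistent with `tripduration` being the elapsed time truncated to whole seconds.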


In [55]:
# Change data type of column usertype from object to category
data_1805_NY.usertype = data_1805_NY.usertype.astype('category')

In [56]:
# Check if data type of column usertype is changed to category
data_1805_NY.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1824710 entries, 0 to 1824709
Data columns (total 15 columns):
tripduration               int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start station id           int64
start station name         object
start station latitude     float64
start station longitude    float64
end station id             int64
end station name           object
end station latitude       float64
end station longitude      float64
bikeid                     int64
usertype                   category
birth year                 int64
gender                     int64
dtypes: category(1), datetime64[ns](2), float64(4), int64(6), object(2)
memory usage: 196.6+ MB


In [57]:
#rename columns
data_1805_NY.rename(columns={'start station id':'start_station_id', 'end station id':'end_station_id'}, inplace=True)

In [58]:
#rename other columns with blank spaces
data_1805_NY.rename(columns={'start station latitude':'start_station_latitude', 'end station latitude':'end_station_latitude'}, inplace=True)
data_1805_NY.rename(columns={'start station longitude':'start_station_longitude', 'end station longitude':'end_station_longitude'}, inplace=True)
data_1805_NY.rename(columns={'start station name':'start_station_name', 'end station name':'end_station_name'}, inplace=True)
data_1805_NY.rename(columns={'birth year':'birt_year'}, inplace=True)

In [59]:
#check if columns have been renamed
data_1805_NY.head(1)

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birt_year,gender
0,367,2018-05-01 05:06:16.584,2018-05-01 05:12:23.965,72,W 52 St & 11 Ave,40.767272,-73.993929,514,12 Ave & W 40 St,40.760875,-74.002777,30567,Subscriber,1965,1
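
As an alternative to listing every column by hand, all white spaces in the column names could be replaced in one step. This is a minimal sketch on a hypothetical empty frame (`columns_demo`) that carries a few of the original column names; it is not the approach used in the cells above.

```python
import pandas as pd

# Empty frame with some of the original column names; only the headers matter here.
columns_demo = pd.DataFrame(
    columns=['tripduration', 'start station id', 'end station name', 'birth year']
)

# Replace every blank in every column name with an underscore.
columns_demo.columns = columns_demo.columns.str.replace(' ', '_', regex=False)
print(list(columns_demo.columns))
# → ['tripduration', 'start_station_id', 'end_station_name', 'birth_year']
```

This sketch would name the last column `birth_year`, whereas the rename above writes `birt_year`.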


In [60]:
# Convert the data type of the start_station_id and end_station_id columns from int64 to category
data_1805_NY.start_station_id = data_1805_NY.start_station_id.astype('category')
data_1805_NY.end_station_id = data_1805_NY.end_station_id.astype('category')

In [61]:
#check if station id columns have been changed to categories
data_1805_NY.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1824710 entries, 0 to 1824709
Data columns (total 15 columns):
tripduration               int64
starttime                  datetime64[ns]
stoptime                   datetime64[ns]
start_station_id           category
start_station_name         object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             category
end_station_name           object
end_station_latitude       float64
end_station_longitude      float64
bikeid                     int64
usertype                   category
birt_year                  int64
gender                     int64
dtypes: category(3), datetime64[ns](2), float64(4), int64(4), object(2)
memory usage: 175.8+ MB
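
The `info()` outputs show the reported memory usage dropping from 208.8+ MB to 175.8+ MB after the category conversions. The trailing `+` marks a lower bound, because the content of object columns is not measured by default. The sketch below illustrates the effect with a deep measurement on a hypothetical column; the values and the names `usertype_object` and `usertype_category` are made up for illustration.

```python
import pandas as pd

# Hypothetical column: two labels repeated many times, stored as Python objects.
usertype_object = pd.Series(['Subscriber', 'Customer'] * 50_000, dtype=object)
usertype_category = usertype_object.astype('category')

# deep=True also counts the memory of the string objects themselves.
object_bytes = usertype_object.memory_usage(deep=True)
category_bytes = usertype_category.memory_usage(deep=True)
print(f'object: {object_bytes:,} bytes, category: {category_bytes:,} bytes')
```

Calling `data_1805_NY.info(memory_usage='deep')` would give the corresponding exact figure for the full data set, at the cost of a slower call.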
