# Part I - Ford GoBike System Data Exploration and Visualization

## Introduction
> This analysis explores a dataset of bike trips taken in the Ford GoBike bike-sharing system, which covers the San Francisco Bay Area.


## Preliminary Wrangling


In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [2]:
# load in the dataset into a pandas dataframe
df_0 = pd.read_csv("201902-fordgobike-tripdata.csv")

In [3]:
# Work on a copy so the raw data can be restored without re-reading the file
df = df_0.copy()
df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


In [4]:
# overview of data shape and composition
df.shape

(183412, 16)

In [5]:
# changing data type of start_time and end_time to datetime.
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)
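
As an alternative to converting the columns after loading, the timestamps can be parsed while the file is read. The sketch below is only an illustration: it uses an in-memory CSV holding two rows copied from the `df.head()` output above, and the names `csv_text`, `trips`, and `elapsed` are introduced here, not taken from the notebook. It also compares the elapsed time between `start_time` and `end_time` with `duration_sec` for those two rows.

```python
import io
import pandas as pd

# Two rows copied from the head() output above, reduced to three columns
csv_text = (
    "duration_sec,start_time,end_time\n"
    "52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750\n"
    "1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740\n"
)

# parse_dates converts the named columns to datetime64 during the read
trips = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_time", "end_time"])

# Elapsed time in seconds, for comparison with the recorded duration_sec
elapsed = (trips["end_time"] - trips["start_time"]).dt.total_seconds()
print(trips.dtypes)
print(elapsed)
```

For these two rows the elapsed time, truncated to whole seconds, agrees with `duration_sec`; whether that holds for every row would need to be checked on the full dataset.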

In [6]:
# Engineer new features from the start_time column
df['start_month'] = df.start_time.dt.month
df['start_day'] = df.start_time.dt.weekday 
df['start_hour'] = df.start_time.dt.hour

# compute member age as of 2019 from the birth year (missing birth years stay NaN)
df["member_age"] = 2019 - df["member_birth_year"]

# convert duration from seconds to minutes
df["duration_minutes"] = df["duration_sec"] / 60

In [7]:
# Replace the weekday numbers in start_day (0 = Monday) with abbreviated weekday names
df["start_day"]=df["start_day"].replace({0:"Mon", 1:"Tue", 2:"Wed", 3:"Thur", 4:"Fri", 5:"Sat", 6:"Sun"})

# convert start_day, start_hour into ordered categorical types
ordinal_var_dict = {"start_day":["Mon","Tue","Wed","Thur","Fri","Sat","Sun"],
                    "start_hour":list(range(24))}
for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    df[var] = df[var].astype(ordered_var)
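
To see what the ordered categorical type changes, here is a minimal sketch on a small made-up series (the `days` values are hypothetical and not drawn from the dataset). With an ordered dtype, sorting and comparisons follow the weekday order instead of alphabetical order, which is what keeps plot axes in Mon-to-Sun order later on.

```python
import pandas as pd

day_order = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]
day_type = pd.api.types.CategoricalDtype(categories=day_order, ordered=True)

# Hypothetical start days, deliberately out of order
days = pd.Series(["Sun", "Mon", "Thur", "Mon", "Fri"]).astype(day_type)

# Sorting follows the declared category order, not alphabetical order
sorted_days = days.sort_values().tolist()

# Ordered categories support comparisons against a category label
is_weekend = (days >= "Sat").tolist()

print(sorted_days)
print(is_weekend)
print(days.min(), days.max())
```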

In [8]:
df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,...,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip,start_month,start_day,start_hour,member_age,duration_minutes
0,52185,2019-02-28 17:32:10.145,2019-03-01 08:01:55.975,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,...,4902,Customer,1984.0,Male,No,2,Thur,17,35.0,869.75
1,42521,2019-02-28 18:53:21.789,2019-03-01 06:42:03.056,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,...,2535,Customer,,,No,2,Thur,18,,708.683333
2,61854,2019-02-28 12:13:13.218,2019-03-01 05:24:08.146,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,...,5905,Customer,1972.0,Male,No,2,Thur,12,47.0,1030.9
3,36490,2019-02-28 17:54:26.010,2019-03-01 04:02:36.842,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,...,6638,Subscriber,1989.0,Other,No,2,Thur,17,30.0,608.166667
4,1585,2019-02-28 23:54:18.549,2019-03-01 00:20:44.074,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,...,4898,Subscriber,1974.0,Male,Yes,2,Thur,23,45.0,26.416667


In [9]:
df.dtypes.value_counts()

float64           9
object            5
int64             3
datetime64[ns]    2
category          1
category          1
dtype: int64

### What is the structure of your dataset?

> There are 183,412 trips in the dataset with 16 features (duration_sec, start_time, end_time, start_station_id, start_station_name, etc.). The variable types include numeric, object (string), and datetime. In addition, five features were derived during wrangling: start_month, member_age, duration_minutes, and two ordered categorical variables (start_day and start_hour) extracted from start_time.

### What is/are the main feature(s) of interest in your dataset?

> My main feature of interest is trip duration. I want to identify the variables that best predict how long a typical trip takes to complete.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> I expect user type, member age and gender, and trip start time to have a strong effect on trip duration.
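
One simple way to start checking such an expectation is a grouped summary of duration per category. The sketch below uses a made-up five-row frame (`toy` and its values are hypothetical and say nothing about the real data); it compares median trip duration by `user_type`, using the median because very long trips like those in the `df.head()` output would pull a mean upward.

```python
import pandas as pd

# Hypothetical trips; the values are illustrative only
toy = pd.DataFrame({
    "user_type": ["Customer", "Subscriber", "Subscriber", "Customer", "Subscriber"],
    "duration_minutes": [30.0, 8.0, 12.0, 50.0, 10.0],
})

# Median duration per user type
median_by_type = toy.groupby("user_type")["duration_minutes"].median()
print(median_by_type)
```

The same pattern would apply to `member_gender`, `start_day`, or `start_hour` on the full dataframe.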