# Part I - Ford GoBike System Data
## by Eniola Sofiyah Elemide

## Introduction
> The dataset contains information about **183412** rides that occurred in 2019, collected from the Ford GoBike System. It has **16 columns** that give the details of each ride. Each column is described below:
    <ol>
    <li><strong>duration_sec</strong>: Trip duration in seconds</li>
    <li><strong>start_time</strong>: Time the trip started</li>
    <li><strong>end_time</strong>: Time the trip ended</li> 
    <li><strong>start_station_id</strong>: Unique number assigned to the start station</li>
    <li><strong>start_station_name</strong>: Name of the start station</li>
    <li><strong>start_station_latitude</strong>: Latitude of start station</li>
    <li><strong>start_station_longitude</strong>: Longitude of start station</li>
    <li><strong>end_station_id</strong>: Unique number assigned to the end station</li>
    <li><strong>end_station_name</strong>: Name of the end station</li>
    <li><strong>end_station_latitude</strong>: Latitude of end station</li>
    <li><strong>end_station_longitude</strong>: Longitude of end station</li>
    <li><strong>bike_id</strong>: Unique number assigned to the bike used</li>
    <li><strong>user_type</strong>: Whether the user is a one-off customer or a subscriber</li>
    <li><strong>member_birth_year</strong>: User year of birth</li>
    <li><strong>member_gender</strong>: User gender</li>
    <li><strong>bike_share_for_all_trip</strong>: Whether the trip was taken under the Bike Share for All program (Yes/No)</li>
    </ol>



## Preliminary Wrangling
I'll begin by exploring a dataset containing the duration and details of 183412 bike trips from 2019.

In [1]:
# import all packages and set plots to be embedded inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [2]:
# load the dataset into a dataframe
bike = pd.read_csv('201902-fordgobike-tripdata.csv')

# check if the data is properly loaded
bike.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


In [3]:
# print the structure of the dataset
print(bike.shape)
# print the datatypes of the variables in the dataset
print(bike.dtypes)

(183412, 16)
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object


In [4]:
# check for duplicated rows
bike.duplicated().sum()

0

In [5]:
# check the statistical summary of the data
bike.describe()

Unnamed: 0,duration_sec,start_station_id,start_station_latitude,start_station_longitude,end_station_id,end_station_latitude,end_station_longitude,bike_id,member_birth_year
count,183412.0,183215.0,183412.0,183412.0,183215.0,183412.0,183412.0,183412.0,175147.0
mean,726.078435,138.590427,37.771223,-122.352664,136.249123,37.771427,-122.35225,4472.906375,1984.806437
std,1794.38978,111.778864,0.099581,0.117097,111.515131,0.09949,0.116673,1664.383394,10.116689
min,61.0,3.0,37.317298,-122.453704,3.0,37.317298,-122.453704,11.0,1878.0
25%,325.0,47.0,37.770083,-122.412408,44.0,37.770407,-122.411726,3777.0,1980.0
50%,514.0,104.0,37.78076,-122.398285,100.0,37.78101,-122.398279,4958.0,1987.0
75%,796.0,239.0,37.79728,-122.286533,235.0,37.79732,-122.288045,5502.0,1992.0
max,85444.0,398.0,37.880222,-121.874119,398.0,37.880222,-121.874119,6645.0,2001.0


# Data Cleaning 
## Issues
1. Improper datatypes for start_time and end_time columns
2. Impossible values in the member_birth_year column (< 1910)
3. Multiple variables in the start_time and end_time columns
4. Missing values in gender column
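
The issues listed above can each be confirmed with a quick check. The snippet below is a minimal sketch on a made-up three-row frame (the `toy` name and its values are illustrative, not taken from the dataset); the 1910 cut-off is the one used in the cleaning code further down.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mimicking three of the real columns
toy = pd.DataFrame({
    'start_time': ['2019-02-28 17:32:10.1450',
                   '2019-02-28 18:53:21.7890',
                   '2019-02-28 12:13:13.2180'],
    'member_birth_year': [1984.0, np.nan, 1878.0],
    'member_gender': ['Male', np.nan, 'Other'],
})

# Issue 1: timestamps are stored as strings rather than datetimes
time_is_datetime = pd.api.types.is_datetime64_any_dtype(toy['start_time'])

# Issue 2: count birth years before the 1910 cut-off
n_implausible = int((toy['member_birth_year'] < 1910).sum())

# Issue 4: count missing values per column
n_missing = toy.isna().sum()

print(time_is_datetime, n_implausible)
print(n_missing)
```

Running the same three checks on `bike` would show the scale of each issue before any cleaning is done.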

### Define 
Convert the start_time and end_time columns to the proper datatype (datetime).

### Code 

In [6]:
bike['start_time'] = pd.to_datetime(bike['start_time'])
bike['end_time'] = pd.to_datetime(bike['end_time'])

### Test

In [7]:
print('The datatype of start_time column is {}'.format(bike['start_time'].dtype))
print('The datatype of end_time column is {}'.format(bike['end_time'].dtype))

The datatype of start_time column is datetime64[ns]
The datatype of end_time column is datetime64[ns]
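
One benefit of the conversion is that the two columns can now be subtracted. As a rough sanity check, the sketch below rebuilds two rows from the `bike.head()` output shown earlier (the `toy` frame is only for illustration) and compares the elapsed time with `duration_sec`. For these two rows the values agree once the fractional seconds are dropped; this is not claimed to hold for every row.

```python
import pandas as pd

# Two rows copied from the bike.head() output shown earlier
toy = pd.DataFrame({
    'duration_sec': [52185, 1585],
    'start_time': ['2019-02-28 17:32:10.1450', '2019-02-28 23:54:18.5490'],
    'end_time': ['2019-03-01 08:01:55.9750', '2019-03-01 00:20:44.0740'],
})

# Same conversion as in the cleaning step
for col in ['start_time', 'end_time']:
    toy[col] = pd.to_datetime(toy[col])

# Subtracting datetimes gives timedeltas; convert them to seconds
elapsed = (toy['end_time'] - toy['start_time']).dt.total_seconds()

# Drop the fractional part so the result is comparable with duration_sec
whole_seconds = elapsed.astype(int)
print(whole_seconds.tolist())
```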


### Define 
Extract the date, clock time, day name and month name from the start_time and end_time columns

### Code

In [8]:
# extract the variables (date, clock time as HH:MM, day name and month name) from the start_time column
# start_time and end_time are already datetime columns, so the .dt accessor can be used directly
bike['start_date'] = bike['start_time'].dt.date
bike['start_clock'] = bike['start_time'].dt.strftime('%H:%M')
bike['start_day'] = bike['start_time'].dt.day_name()
bike['start_month'] = bike['start_time'].dt.strftime('%B')

# extract the same variables from the end_time column
bike['end_date'] = bike['end_time'].dt.date
bike['end_clock'] = bike['end_time'].dt.strftime('%H:%M')
bike['end_day'] = bike['end_time'].dt.day_name()
bike['end_month'] = bike['end_time'].dt.strftime('%B')

### Test

In [9]:
print(bike.info())

# confirm that the columns are created with proper values
bike.sample()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 24 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             183412 non-null  int64         
 1   start_time               183412 non-null  datetime64[ns]
 2   end_time                 183412 non-null  datetime64[ns]
 3   start_station_id         183215 non-null  float64       
 4   start_station_name       183215 non-null  object        
 5   start_station_latitude   183412 non-null  float64       
 6   start_station_longitude  183412 non-null  float64       
 7   end_station_id           183215 non-null  float64       
 8   end_station_name         183215 non-null  object        
 9   end_station_latitude     183412 non-null  float64       
 10  end_station_longitude    183412 non-null  float64       
 11  bike_id                  183412 non-null  int64         
 12  user_type       

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,...,member_gender,bike_share_for_all_trip,start_date,start_clock,start_day,start_month,end_date,end_clock,end_day,end_month
36116,353,2019-02-23 16:58:08.753,2019-02-23 17:04:01.866,196.0,Grand Ave at Perkins St,37.808894,-122.25646,183.0,Telegraph Ave at 19th St,37.808702,...,Male,No,2019-02-23,16:58,Saturday,February,2019-02-23,17:04,Saturday,February
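
To make the extraction step easier to follow, here is a small sketch that applies the same `.dt` accessors to the single timestamp from the sampled row above. The `ts` and `parts` names are illustrative. It also shows `.dt.hour`, which returns the hour directly as an integer and could serve as an alternative to splitting the clock string later on.

```python
import pandas as pd

# The start_time of the sampled row shown above
ts = pd.Series(pd.to_datetime(['2019-02-23 16:58:08.753']))

parts = pd.DataFrame({
    'date': ts.dt.date,                 # calendar date without the time
    'clock': ts.dt.strftime('%H:%M'),   # clock time as an HH:MM string
    'day': ts.dt.day_name(),            # weekday name
    'month': ts.dt.month_name(),        # month name
    'hour': ts.dt.hour,                 # hour of the day as an integer
})
print(parts)
```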


### Define 
Replace impossible values in the member_birth_year column. Birth years before 1910 are treated as impossible, because such a rider would have been older than 109 in 2019, which is too old to ride a bike.

**Note: The oldest person to ride a bike was 105 years old, according to Google.**

### Code

In [10]:
# replace values less than 1910 with the mean of the column
# (the mean is computed before the replacement, so it still includes the implausible years)
bike.loc[bike['member_birth_year'] < 1910, 'member_birth_year'] = bike['member_birth_year'].mean()

### Test

In [11]:
# check the statistics for the minimum year
bike['member_birth_year'].describe()

count    175147.000000
mean       1984.840781
std           9.971632
min        1910.000000
25%        1980.000000
50%        1987.000000
75%        1992.000000
max        2001.000000
Name: member_birth_year, dtype: float64
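
A possible variation on this step, shown only as a sketch and not what the notebook does, is to compute the fill value from the plausible years alone and to use the median so that the result stays a whole year. The `years` series below is made up for illustration, and `CUTOFF` is a hypothetical name for the 1910 threshold.

```python
import numpy as np
import pandas as pd

# Made-up birth years, including two implausible ones and one missing value
years = pd.Series([1984.0, 1878.0, 1990.0, np.nan, 1900.0, 1975.0],
                  name='member_birth_year')

CUTOFF = 1910  # same threshold as in the cleaning step (hypothetical constant name)

# Flag implausible years; NaN compares as False, so missing values are not flagged
implausible = years < CUTOFF

# Median of the remaining years (NaN is skipped by default)
fill_value = years[~implausible].median()

# Replace only the flagged entries; missing values are left as they are
cleaned = years.mask(implausible, fill_value)
print(fill_value)
print(cleaned.tolist())
```

Whether the mean or the median is the better choice depends on how the column is used later; the sketch only illustrates the mechanics.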

### Define
Replace missing values in member_gender column with 'Other'

### Code

In [12]:
# assign the result back instead of using inplace=True on a selected column
bike['member_gender'] = bike['member_gender'].fillna('Other')

### Test

In [13]:
bike['member_gender'].isna().sum()

0
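
Filling missing genders with 'Other' merges two groups: riders who selected 'Other' and riders who gave no answer. The sketch below uses a made-up `toy` frame to show how the 'Other' count grows after the fill, which is worth keeping in mind when gender counts are compared later.

```python
import numpy as np
import pandas as pd

# Made-up gender values with two missing entries
toy = pd.DataFrame({'member_gender': ['Male', np.nan, 'Female', np.nan, 'Other']})

# Count before filling; value_counts() ignores missing values by default
before = toy['member_gender'].value_counts()

# Fill missing values and assign the result back to the column
toy['member_gender'] = toy['member_gender'].fillna('Other')
after = toy['member_gender'].value_counts()

print(before['Other'], after['Other'])
```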

**Create a time_of_day column that contains the categorical values Morning, Afternoon, Evening and Night**

In [14]:
# get the hour (the part before the colon) from the start clock time
bike['start_hour'] = bike['start_clock'].str.split(':').str[0]
# convert the extracted hour to an integer
bike['start_hour'] = bike['start_hour'].astype('int')

# get the hour (the part before the colon) from the end clock time
bike['end_hour'] = bike['end_clock'].str.split(':').str[0]
# convert the extracted hour to an integer
bike['end_hour'] = bike['end_hour'].astype('int')

In [15]:
# create a function to divide the hours into parts of the day
def time_of_day(dataframe):

    '''
    Add time-of-day labels for the start and end hour of each trip.
    
    Parameters:
    dataframe: DataFrame with integer start_hour and end_hour columns
    
    Variables:
    start_time_of_day: a list of time-of-day labels derived from start_hour
    end_time_of_day: a list of time-of-day labels derived from end_hour
    
    Returns:
    A copy of the dataframe with start_time_of_day and end_time_of_day columns added
    '''
    df = dataframe.copy()
    
    start_time_of_day = []
    for row in df['start_hour']:
        if row >= 5 and row <= 11:
            start_time_of_day.append('Morning')
        elif row >= 12 and row <= 16:
            start_time_of_day.append('Afternoon')
        elif row >= 17 and row <= 20:
            start_time_of_day.append('Evening')
        else:
            start_time_of_day.append('Night')
    
    df['start_time_of_day'] = start_time_of_day
    
    end_time_of_day = []
    for row in df['end_hour']:
        if row >= 5 and row <= 11:
            end_time_of_day.append('Morning')
        elif row >= 12 and row <= 16:
            end_time_of_day.append('Afternoon')
        elif row >= 17 and row <= 20:
            end_time_of_day.append('Evening')
        else:
            end_time_of_day.append('Night')
            
    df['end_time_of_day'] = end_time_of_day
    
    return df
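
The same hour ranges can also be applied without an explicit loop. The sketch below is one possible vectorized equivalent using `np.select`; the helper name `label_time_of_day` is hypothetical and not part of the notebook. It uses the same boundaries as the function above: 5 to 11 is Morning, 12 to 16 is Afternoon, 17 to 20 is Evening, and everything else is Night.

```python
import numpy as np
import pandas as pd

def label_time_of_day(hours):
    '''Map a Series of integer hours (0-23) to time-of-day labels.'''
    # Series.between is inclusive on both ends by default
    conditions = [
        hours.between(5, 11),
        hours.between(12, 16),
        hours.between(17, 20),
    ]
    labels = ['Morning', 'Afternoon', 'Evening']
    # hours matching none of the conditions fall back to 'Night'
    return pd.Series(np.select(conditions, labels, default='Night'),
                     index=hours.index)

# Boundary hours to check each range
hours = pd.Series([0, 4, 5, 11, 12, 16, 17, 20, 21, 23])
result = label_time_of_day(hours)
print(result.tolist())
```

With a helper like this, each of the two label columns could be created in a single assignment, for example from `bike['start_hour']`.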