# Part I - Ford GoBike Dataset Exploration
## by Brian Orandi

## Introduction
> This dataset contains information about individual rides made in a bike-sharing system covering the greater San Francisco Bay Area. Twelve monthly files, from January 2019 to December 2019, can be downloaded to cover a full year. The files can be downloaded programmatically using the Requests library and then joined into a single file. We then assess and clean the dataset and store the clean data for analysis. To get started, let's import our libraries.
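
The joining step described above can be sketched as follows. This is only an illustration: the download step is left out, and two small in-memory CSVs (hypothetical stand-ins built from values shown later in this notebook) take the place of the monthly files.

```python
import io
import pandas as pd

# Stand-ins for two monthly files; in practice these would be the downloaded CSVs.
first_month_csv = io.StringIO("duration_sec,bike_id\n52185,4902\n42521,2535\n")
second_month_csv = io.StringIO("duration_sec,bike_id\n61854,5905\n")

# Read each monthly file, then stack them into one frame with a fresh index.
monthly_frames = [pd.read_csv(f) for f in (first_month_csv, second_month_csv)]
combined = pd.concat(monthly_frames, ignore_index=True)
print(combined.shape)
```

Passing `ignore_index=True` avoids repeated index labels, since each monthly file starts counting from zero.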


## Preliminary Wrangling

In [34]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# setting the style
# (Matplotlib 3.6+ renamed this style to "seaborn-v0_8"; use that name if "seaborn" is not found)
plt.style.use("seaborn")



### Assessing data

In [35]:
# importing the dataset 

gobike = pd.read_csv("fordgobike.csv")
gobike.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


In [36]:
# looking at the columns
for i, v in enumerate(gobike.columns):
    print(i, v)

0 duration_sec
1 start_time
2 end_time
3 start_station_id
4 start_station_name
5 start_station_latitude
6 start_station_longitude
7 end_station_id
8 end_station_name
9 end_station_latitude
10 end_station_longitude
11 bike_id
12 user_type
13 member_birth_year
14 member_gender
15 bike_share_for_all_trip


In [37]:
# shape of data
gobike.shape

(183412, 16)

In [38]:
# Checking the data types
gobike.dtypes

duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object

From the above, we can see that:
- start_time and end_time are stored as object and should be converted to datetime.
- start_station_id and end_station_id are floats, but they are identifiers and should be strings.
- bike_id is an integer, but it is also an identifier and should be a string.
- member_birth_year is a float and should be a string.

In [39]:
# descriptive statistics for numerical variables
gobike.describe()

Unnamed: 0,duration_sec,start_station_id,start_station_latitude,start_station_longitude,end_station_id,end_station_latitude,end_station_longitude,bike_id,member_birth_year
count,183412.0,183215.0,183412.0,183412.0,183215.0,183412.0,183412.0,183412.0,175147.0
mean,726.078435,138.590427,37.771223,-122.352664,136.249123,37.771427,-122.35225,4472.906375,1984.806437
std,1794.38978,111.778864,0.099581,0.117097,111.515131,0.09949,0.116673,1664.383394,10.116689
min,61.0,3.0,37.317298,-122.453704,3.0,37.317298,-122.453704,11.0,1878.0
25%,325.0,47.0,37.770083,-122.412408,44.0,37.770407,-122.411726,3777.0,1980.0
50%,514.0,104.0,37.78076,-122.398285,100.0,37.78101,-122.398279,4958.0,1987.0
75%,796.0,239.0,37.79728,-122.286533,235.0,37.79732,-122.288045,5502.0,1992.0
max,85444.0,398.0,37.880222,-121.874119,398.0,37.880222,-121.874119,6645.0,2001.0


In [40]:
# checking on the categorical values
gobike.user_type.value_counts()

Subscriber    163544
Customer       19868
Name: user_type, dtype: int64

In [41]:
gobike.member_gender.value_counts()

Male      130651
Female     40844
Other       3652
Name: member_gender, dtype: int64

In [42]:
gobike.bike_share_for_all_trip.value_counts()

No     166053
Yes     17359
Name: bike_share_for_all_trip, dtype: int64

In [43]:
# are there null values?
gobike.isna().sum()

duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64

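From the above, we can see that there are missing values in start_station_id, start_station_name, end_station_id, end_station_name, member_gender and member_birth_year.
Station details are missing for 197 rides (about 0.1% of the data) and member details for 8,265 rides (about 4.5%). Rather than drop several thousand rides, we shall leave these missing values as they are.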
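
To judge how large the gaps are relative to the dataset, the missing counts can be expressed as percentages. The sketch below uses a small toy frame (a hypothetical stand-in for `gobike`, with gaps placed by hand), so the printed figures describe the toy data only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for gobike; the gaps are placed by hand for illustration.
sample = pd.DataFrame({
    "start_station_id": [21.0, 23.0, np.nan, 375.0],
    "member_birth_year": [1984.0, np.nan, np.nan, 1989.0],
    "user_type": ["Customer", "Customer", "Customer", "Subscriber"],
})

# isna().mean() gives the fraction of missing entries per column.
missing_pct = (sample.isna().mean() * 100).round(1)

# Show only the columns that actually have missing values.
print(missing_pct[missing_pct > 0])
```

The same two lines should work on the full frame by replacing `sample` with `gobike`.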

From the descriptive statistics above, we saw that duration_sec has a minimum of 61 seconds, as confirmed below.

In [44]:
# looking at highest and lowest duration in seconds spent in riding the bike
print(f'Maximum Seconds: {gobike.duration_sec.max()}')
print(f'Minimum Seconds: {gobike.duration_sec.min()}')

Maximum Seconds: 85444
Minimum Seconds: 61


This is unusually short: about a minute is hardly enough time for a real ride, so these records may be recording errors. We shall remove the rows with this duration.

In [45]:
# rides with the minimum duration of 61 seconds (about one minute)
gobike.query("duration_sec == 61")

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
18578,61,2019-02-26 18:23:44.2830,2019-02-26 18:24:45.5230,368.0,Myrtle St at Polk St,37.785434,-122.419622,368.0,Myrtle St at Polk St,37.785434,-122.419622,5333,Subscriber,1989.0,Female,No
19581,61,2019-02-26 16:40:53.1210,2019-02-26 16:41:54.4510,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,5306,Subscriber,1987.0,Female,No
27017,61,2019-02-25 10:31:18.4150,2019-02-25 10:32:19.7480,59.0,S Van Ness Ave at Market St,37.774814,-122.418954,59.0,S Van Ness Ave at Market St,37.774814,-122.418954,5921,Subscriber,1972.0,Male,Yes
44301,61,2019-02-22 15:09:57.0480,2019-02-22 15:10:58.7420,310.0,San Fernando St at 4th St,37.335885,-121.88566,280.0,San Fernando St at 7th St,37.337122,-121.883215,6347,Subscriber,1989.0,Male,Yes
44787,61,2019-02-22 13:56:21.9760,2019-02-22 13:57:23.4650,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,81.0,Berry St at 4th St,37.77588,-122.39317,6150,Subscriber,1931.0,Male,No
51120,61,2019-02-21 18:27:34.9930,2019-02-21 18:28:36.6300,113.0,Franklin Square,37.764555,-122.410345,100.0,Bryant St at 15th St,37.7671,-122.410662,6515,Subscriber,1984.0,Male,No
58992,61,2019-02-20 21:44:00.1540,2019-02-20 21:45:01.2350,85.0,Church St at Duboce Ave,37.770083,-122.429156,85.0,Church St at Duboce Ave,37.770083,-122.429156,4351,Subscriber,1994.0,Male,No
64088,61,2019-02-20 13:08:18.2850,2019-02-20 13:09:19.4330,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,80.0,Townsend St at 5th St,37.775235,-122.397437,2090,Subscriber,1931.0,Male,No
80047,61,2019-02-18 16:31:12.8960,2019-02-18 16:32:14.5880,89.0,Division St at Potrero Ave,37.769218,-122.407646,101.0,15th St at Potrero Ave,37.767079,-122.407359,6195,Subscriber,1931.0,Male,No
82564,61,2019-02-18 09:53:31.3990,2019-02-18 09:54:33.1620,249.0,Russell St at College Ave,37.858473,-122.253253,249.0,Russell St at College Ave,37.858473,-122.253253,3054,Subscriber,1990.0,Male,No


### Cleaning the data set

In [46]:
# making a copy of the gobike dataset
gobike_clean = gobike.copy()
print(gobike_clean.shape)
gobike_clean.head()

(183412, 16)


Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,52185,2019-02-28 17:32:10.1450,2019-03-01 08:01:55.9750,21.0,Montgomery St BART Station (Market St at 2nd St),37.789625,-122.400811,13.0,Commercial St at Montgomery St,37.794231,-122.402923,4902,Customer,1984.0,Male,No
1,42521,2019-02-28 18:53:21.7890,2019-03-01 06:42:03.0560,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,81.0,Berry St at 4th St,37.77588,-122.39317,2535,Customer,,,No
2,61854,2019-02-28 12:13:13.2180,2019-03-01 05:24:08.1460,86.0,Market St at Dolores St,37.769305,-122.426826,3.0,Powell St BART Station (Market St at 4th St),37.786375,-122.404904,5905,Customer,1972.0,Male,No
3,36490,2019-02-28 17:54:26.0100,2019-03-01 04:02:36.8420,375.0,Grove St at Masonic Ave,37.774836,-122.446546,70.0,Central Ave at Fell St,37.773311,-122.444293,6638,Subscriber,1989.0,Other,No
4,1585,2019-02-28 23:54:18.5490,2019-03-01 00:20:44.0740,7.0,Frank H Ogawa Plaza,37.804562,-122.271738,222.0,10th Ave at E 15th St,37.792714,-122.24878,4898,Subscriber,1974.0,Male,Yes


### Data type Issues
- start_time and end_time are stored as object and should be converted to datetime.
- start_station_id and end_station_id are floats, but they are identifiers and should be strings.
- bike_id is an integer, but it is also an identifier and should be a string.
- member_birth_year is a float and should be a string.

#### Define
- For the datetime issues, we shall use pd.to_datetime to convert the columns to datetime.
- For the string issues, we shall use astype(str) to convert the columns to strings.

#### Code

In [47]:
# using pd.to_datetime to convert to datetime
gobike_clean.start_time = pd.to_datetime(gobike_clean.start_time)
gobike_clean.end_time = pd.to_datetime(gobike_clean.end_time)

In [48]:
# string issues. using astype(str)
gobike_clean.start_station_id = gobike_clean.start_station_id.astype(str)
gobike_clean.end_station_id = gobike_clean.end_station_id.astype(str)
gobike_clean.bike_id = gobike_clean.bike_id.astype(str)
gobike_clean.member_birth_year = gobike_clean.member_birth_year.astype(str)

#### Test

In [49]:
gobike_clean.dtypes

duration_sec                        int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id                   object
start_station_name                 object
start_station_latitude            float64
start_station_longitude           float64
end_station_id                     object
end_station_name                   object
end_station_latitude              float64
end_station_longitude             float64
bike_id                            object
user_type                          object
member_birth_year                  object
member_gender                      object
bike_share_for_all_trip            object
dtype: object
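
One caveat with `astype(str)` on float columns is that the float formatting carries over, so station `21.0` becomes the text `"21.0"`, and in pandas versions that report the `object` dtype as above, missing values become the text `"nan"`. A possible alternative, sketched here on a toy column and not the method used in this notebook, is to go through the nullable `Int64` dtype first.

```python
import numpy as np
import pandas as pd

# Toy column standing in for start_station_id, including one missing value.
station_id = pd.Series([21.0, 23.0, np.nan, 375.0])

# Plain astype(str) keeps the float formatting, e.g. "21.0".
as_plain_str = station_id.astype(str)
print(as_plain_str.tolist())

# Going through the nullable Int64 dtype first drops the ".0"
# and keeps missing values missing instead of turning them into text.
as_clean_str = station_id.astype("Int64").astype("string")
print(as_clean_str.tolist())
```

With this route the column reports the `string` dtype, not `object`, and `isna()` still finds the missing entries.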

### Abnormal issues on the duration_sec variable
The minimum duration_sec is 61 seconds. This is unusually short: about a minute is hardly enough time for a real ride, so these records may be recording errors. We shall remove the rows with this duration.



#### Define

We are going to filter out all rows whose duration_sec is exactly 61 seconds.

#### Code

In [50]:
gobike_clean = gobike_clean[gobike_clean.duration_sec != 61]

#### Test

In [51]:
gobike_clean.duration_sec.describe()

count    183394.000000
mean        726.143712
std        1794.465739
min          62.000000
25%         325.000000
50%         514.000000
75%         796.000000
max       85444.000000
Name: duration_sec, dtype: float64
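
Filtering with `!= 61` removes only rides lasting exactly 61 seconds, which is why the minimum in the Test output is now 62. If the aim is to exclude very short rides more generally, a threshold filter is one option. The sketch below uses toy durations, and the cutoff `MIN_DURATION_SEC` is a hypothetical value chosen for illustration, not one defined in this analysis.

```python
import pandas as pd

# Toy durations in seconds; 61 and 62 mirror the shortest rides seen above.
rides = pd.DataFrame({"duration_sec": [61, 62, 325, 514, 85444]})

# Hypothetical cutoff chosen for illustration only; the notebook does not define one.
MIN_DURATION_SEC = 120

# Keep rides at or above the cutoff instead of dropping a single exact value.
kept = rides[rides.duration_sec >= MIN_DURATION_SEC]
print(kept.duration_sec.min(), len(kept))
```

Whatever cutoff is chosen should be justified from the distribution of `duration_sec`, for example by looking at its lower quantiles.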

## Storing the cleaned data

In [52]:
# storing to a csv file
gobike_clean.to_csv("gobike_cleaned.csv", index=False)

In [60]:
# loading cleaned data
df = pd.read_csv("gobike_cleaned.csv")
print(df.shape)
df.sample(10)

(183394, 16)


Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
37408,685,2019-02-23 14:08:23.056,2019-02-23 14:19:48.934,62.0,Victoria Manalo Draves Park,37.777791,-122.406432,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,3060,Customer,1992.0,Male,No
122351,514,2019-02-11 12:05:18.189,2019-02-11 12:13:53.123,76.0,McCoppin St at Valencia St,37.771662,-122.422423,349.0,Howard St at Mary St,37.78101,-122.405666,6638,Subscriber,1972.0,Male,No
179496,412,2019-02-01 13:36:56.369,2019-02-01 13:43:48.821,23.0,The Embarcadero at Steuart St,37.791464,-122.391034,6.0,The Embarcadero at Sansome St,37.80477,-122.403234,610,Subscriber,1986.0,Female,No
55268,252,2019-02-21 10:36:40.107,2019-02-21 10:40:52.808,43.0,San Francisco Public Library (Grove St at Hyde...,37.778768,-122.415929,60.0,8th St at Ringold St,37.77452,-122.409449,2481,Subscriber,1979.0,Male,No
6086,375,2019-02-28 11:13:04.406,2019-02-28 11:19:20.305,239.0,Bancroft Way at Telegraph Ave,37.868813,-122.258764,266.0,Parker St at Fulton St,37.862464,-122.264791,6363,Subscriber,1996.0,Male,Yes
132781,325,2019-02-08 22:06:16.543,2019-02-08 22:11:41.699,121.0,Mission Playground,37.75921,-122.421339,142.0,Guerrero Park,37.745739,-122.42214,5541,Subscriber,1984.0,Male,No
80515,407,2019-02-18 15:27:18.005,2019-02-18 15:34:05.047,144.0,Precita Park,37.7473,-122.411403,123.0,Folsom St at 19th St,37.760594,-122.414817,5531,Subscriber,1996.0,Male,No
100434,1233,2019-02-14 18:09:52.609,2019-02-14 18:30:25.949,30.0,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,368.0,Myrtle St at Polk St,37.785434,-122.419622,2684,Customer,1991.0,Male,No
40508,931,2019-02-22 20:24:51.765,2019-02-22 20:40:23.316,98.0,Valencia St at 16th St,37.765052,-122.421866,72.0,Page St at Scott St,37.772406,-122.43565,6232,Subscriber,1994.0,Male,No
9824,258,2019-02-28 00:12:22.749,2019-02-28 00:16:41.526,67.0,San Francisco Caltrain Station 2 (Townsend St...,37.776639,-122.395526,79.0,7th St at Brannan St,37.773492,-122.403672,5405,Customer,1985.0,Male,No
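
Note that a CSV file does not store data types, so the datetime and string conversions made during cleaning are not restored by a plain `pd.read_csv` (the station IDs in the sample above are floats again). A minimal sketch of reloading with explicit types, using an in-memory stand-in for `gobike_cleaned.csv` built from two of the rows shown above:

```python
import io
import pandas as pd

# In-memory stand-in for gobike_cleaned.csv, using two rows shown above.
csv_text = (
    "start_time,end_time,start_station_id,bike_id\n"
    "2019-02-23 14:08:23.056,2019-02-23 14:19:48.934,62.0,3060\n"
    "2019-02-11 12:05:18.189,2019-02-11 12:13:53.123,76.0,6638\n"
)

# parse_dates restores the datetime columns; dtype keeps the ID columns as text.
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["start_time", "end_time"],
    dtype={"start_station_id": str, "bike_id": str},
)
print(df_typed.dtypes)
```

For the real file, the `io.StringIO(csv_text)` argument would be replaced by the file name.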


### What is the structure of your dataset?

The cleaned dataset contains 183,394 records of bike rides in 2019, with 16 features:
- duration_sec
- start_time
- end_time
- start_station_id
- start_station_name
- start_station_latitude
- start_station_longitude
- end_station_id
- end_station_name
- end_station_latitude
- end_station_longitude
- bike_id
- user_type
- member_birth_year
- member_gender
- bike_share_for_all_trip

### What is/are the main feature(s) of interest in your dataset?

I am most interested in the following questions:
1. When are most trips taken, in terms of time of day, day of the week, or month of the year?
2. How long do these trips last (duration)?
3. There are two types of users, subscribers and customers. Does trip duration differ between these two user types?
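
For the third question, one possible starting point is a grouped summary of `duration_sec` by `user_type`. The sketch below runs on four toy rows copied from the first rows of the dataset, so its output illustrates the mechanics only and says nothing about the full data.

```python
import pandas as pd

# Toy rides; values copied from rows shown earlier in the notebook.
rides = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber", "Subscriber"],
    "duration_sec": [52185, 42521, 36490, 1585],
})

# The median is less sensitive than the mean to very long rides.
median_by_type = rides.groupby("user_type")["duration_sec"].median()
print(median_by_type)
```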

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

For the time of day, day of the week, month and year, start_time will be the most useful feature. This information can be extracted from the column for analysis. For trip length and its relationship with user type, duration_sec and user_type will be needed.
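
Extracting these time components could look like the sketch below, which uses the pandas `.dt` accessor on a datetime column. The new column names (`start_hour`, `start_day`, `start_month`) are hypothetical, and the two timestamps are copied from rows shown earlier.

```python
import pandas as pd

# Two start times copied from rows shown earlier in the notebook.
rides = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2019-02-28 17:32:10.145", "2019-02-23 14:08:23.056"]
    )
})

# The .dt accessor exposes calendar parts of a datetime column.
rides["start_hour"] = rides.start_time.dt.hour
rides["start_day"] = rides.start_time.dt.day_name()
rides["start_month"] = rides.start_time.dt.month_name()
print(rides[["start_hour", "start_day", "start_month"]])
```

This only works once start_time has a datetime dtype, so a frame reloaded from CSV needs its dates parsed first.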

## Univariate Exploration




### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions




