# Ford GoBike System Dataset
## by Ahmad ALMosallam


The Ford GoBike System Dataset contains public trip data from Lyft's bike-share service. Variables include trip duration, start and end time with date, start and end station names, start and end coordinates, bike ID, customer type, and rental access method.

In [51]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from requests import get
from zipfile import ZipFile
from io import StringIO, BytesIO
from timeit import default_timer as timer
import os
%matplotlib inline

Loading the dataset

In [73]:
# get the data from the site
link = "https://s3.amazonaws.com/baywheels-data/202001-baywheels-tripdata.csv.zip"
zipfile = get(link)
filename = link[40:-4]
with ZipFile(BytesIO(zipfile.content)) as file:
    file.extract(member = filename, path = './Data')
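
The slice `link[40:-4]` only works while the URL prefix stays exactly 40 characters long. A minimal sketch of a more tolerant alternative, assuming each archive holds a single CSV named like the URL's last path component without `.zip` (the helper name `csv_name_from_url` is hypothetical and not part of the original notebook):

```python
import posixpath
from urllib.parse import urlparse


def csv_name_from_url(url):
    """Return the archive member name implied by a '.zip' download URL.

    Hypothetical helper: assumes the zip holds one file whose name is the
    URL's last path component with the trailing '.zip' removed.
    """
    name = posixpath.basename(urlparse(url).path)
    if name.endswith(".zip"):
        name = name[: -len(".zip")]
    return name


link = "https://s3.amazonaws.com/baywheels-data/202001-baywheels-tripdata.csv.zip"
print(csv_name_from_url(link))  # 202001-baywheels-tripdata.csv
```

For the three URLs used in this notebook the helper returns the same string as the slice, so it could be dropped in without changing the extracted file names.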

In [103]:
df = pd.read_csv('./Data/' + filename)

In [83]:
# The remaining 2020 files are listed at https://s3.amazonaws.com/baywheels-data/index.html
links = ["https://s3.amazonaws.com/baywheels-data/202002-baywheels-tripdata.csv.zip",
        "https://s3.amazonaws.com/baywheels-data/202003-baywheels-tripdata.csv.zip"]
start = timer()
for link in links:
    zipfile = get(link)
    filename = link[40:-4]
    with ZipFile(BytesIO(zipfile.content)) as file:
        file.extract(member = filename, path = './Data')
        
end = timer()
print(end - start)

22.54542760000004


In [105]:
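# Put all the data in one DataFrame
# Note: if the January file is still in './Data', its rows are already in df
# and are read and stacked a second time here.
list_of_filenames = os.listdir('./Data')
frames = [df] + [pd.read_csv('./Data/' + filename) for filename in list_of_filenames]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat(frames)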
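
A minimal sketch of loading the monthly files exactly once with `pd.concat`, which also works on pandas 2.0 and later, where `DataFrame.append` is no longer available. The function name `load_trip_files` and the tiny stand-in CSV files are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_trip_files(data_dir):
    """Read every CSV in data_dir exactly once and stack them into one DataFrame."""
    paths = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(".csv")
    )
    # low_memory=False parses each file in one pass so dtypes are inferred consistently
    frames = [pd.read_csv(path, low_memory=False) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Tiny stand-in files (made-up values) so the sketch runs without the real data
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"duration_sec": [100, 200]}).to_csv(os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame({"duration_sec": [300]}).to_csv(os.path.join(tmp, "b.csv"), index=False)
    combined = load_trip_files(tmp)

print(combined.shape)  # (3, 1)
```

Starting from the directory listing rather than from an already populated `df` avoids counting any month twice, and the fresh index avoids repeated index labels across files.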



In [106]:
print(df.shape)
df.head()

(1081806, 14)


|  | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | rental_access_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35187 | 2020-03-31 20:42:10.0790 | 2020-04-01 06:28:37.8440 | 462.0 | Cruise Terminal at Pier 27 | 37.804648 | -122.402087 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 10982 | Customer |  |
| 1 | 14568 | 2020-03-31 22:45:25.5010 | 2020-04-01 02:48:13.7730 | 42.0 | San Francisco City Hall (Polk St at Grove St) | 37.77865 | -122.41823 | 370.0 | Jones St at Post St | 37.787327 | -122.413278 | 12617 | Customer |  |
| 2 | 35990 | 2020-03-31 15:08:22.3310 | 2020-04-01 01:08:12.9900 | 391.0 | 1st St at Younger Ave | 37.35503 | -121.904436 | 397.0 | Gish Rd at 1st St | 37.361867 | -121.909315 | 12812 | Customer |  |
| 3 | 1068 | 2020-03-31 23:55:00.4260 | 2020-04-01 00:12:49.0200 | 456.0 | Arguello Blvd at Geary Blvd | 37.781468 | -122.458806 | 107.0 | 17th St at Dolores St | 37.763015 | -122.426497 | 12955 | Customer |  |
| 4 | 3300 | 2020-03-31 23:00:55.6410 | 2020-03-31 23:55:56.6110 | 6.0 | The Embarcadero at Sansome St | 37.80477 | -122.403234 | 24.0 | Spear St at Folsom St | 37.789677 | -122.390428 | 13050 | Customer |  |


In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1081806 entries, 0 to 176798
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   duration_sec             1081806 non-null  int64  
 1   start_time               1081806 non-null  object 
 2   end_time                 1081806 non-null  object 
 3   start_station_id         529401 non-null   float64
 4   start_station_name       531287 non-null   object 
 5   start_station_latitude   1081806 non-null  float64
 6   start_station_longitude  1081806 non-null  float64
 7   end_station_id           530322 non-null   float64
 8   end_station_name         532293 non-null   object 
 9   end_station_latitude     1081806 non-null  float64
 10  end_station_longitude    1081806 non-null  float64
 11  bike_id                  1081806 non-null  int64  
 12  user_type                1081806 non-null  object 
 13  rental_access_method     732127 non-null   object 

In [121]:
# rental_access_method has missing values, so keep only rows where it is set ('app' or 'clipper')
df = df.query("rental_access_method == 'app' or rental_access_method == 'clipper'")
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 732127 entries, 62530 to 176798
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             732127 non-null  int64  
 1   start_time               732127 non-null  object 
 2   end_time                 732127 non-null  object 
 3   start_station_id         179722 non-null  float64
 4   start_station_name       181608 non-null  object 
 5   start_station_latitude   732127 non-null  float64
 6   start_station_longitude  732127 non-null  float64
 7   end_station_id           180643 non-null  float64
 8   end_station_name         182614 non-null  object 
 9   end_station_latitude     732127 non-null  float64
 10  end_station_longitude    732127 non-null  float64
 11  bike_id                  732127 non-null  int64  
 12  user_type                732127 non-null  object 
 13  rental_access_method     732127 non-null  object 
dtypes: float64(6), int64(2), object(6)
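
The filter above keeps only rows whose `rental_access_method` is `app` or `clipper`. Because the non-null count before the filter (732,127) equals the row count after it, dropping missing values directly should select the same rows. A small sketch on a made-up frame (only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names come from the dataset
trips = pd.DataFrame(
    {
        "duration_sec": [236, 1068, 3300, 512],
        "rental_access_method": ["app", np.nan, "clipper", np.nan],
    }
)

# Drop rows where rental_access_method is missing
by_dropna = trips.dropna(subset=["rental_access_method"])

# Equivalent here: keep only the two known access methods
by_isin = trips[trips["rental_access_method"].isin(["app", "clipper"])]

print(by_dropna.equals(by_isin))  # True
```

`dropna` states the intent (remove missing values) more directly, while the explicit value filter would also discard any unexpected third category if one appeared in later files.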

In [125]:
df.sample()

|  | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | rental_access_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192452 | 236 | 2020-02-29 19:49:18 | 2020-02-29 19:53:15 |  |  | 37.76413 | -122.449309 |  |  | 37.769432 | -122.453087 | 530703 | Subscriber | app |


### What is the structure of your dataset?

After filtering, the dataset holds 732,127 trips with 14 columns. Each trip is anonymized and includes:

- Trip Duration (seconds)
- Start Time and Date
- End Time and Date
- Start Station ID
- Start Station Name
- Start Station Latitude
- Start Station Longitude
- End Station ID
- End Station Name
- End Station Latitude
- End Station Longitude
- Bike ID
- User Type (“Subscriber” = Member, “Customer” = Casual)
- Rental Access Method (App or Clipper)
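
`df.info()` reports `start_time`, `end_time`, and `user_type` as `object` columns. A minimal sketch of converting them to more useful dtypes, using two rows copied from the `df.head()` output; the variable name `trips` is an assumption for illustration:

```python
import pandas as pd

# Two rows copied from the df.head() output
trips = pd.DataFrame(
    {
        "duration_sec": [35187, 14568],
        "start_time": ["2020-03-31 20:42:10.0790", "2020-03-31 22:45:25.5010"],
        "end_time": ["2020-04-01 06:28:37.8440", "2020-04-01 02:48:13.7730"],
        "user_type": ["Customer", "Customer"],
    }
)

# Parse the timestamp strings and store user_type as a categorical
for col in ["start_time", "end_time"]:
    trips[col] = pd.to_datetime(trips[col])
trips["user_type"] = trips["user_type"].astype("category")

# Sanity check: elapsed whole seconds should match duration_sec
elapsed = (trips["end_time"] - trips["start_time"]).dt.total_seconds()
print((elapsed.astype(int) == trips["duration_sec"]).all())  # True
```

The sampled row shown earlier has timestamps without fractional seconds, so on the full data the parser may need extra handling for the two timestamp layouts.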

### What is/are the main feature(s) of interest in your dataset?

- User Type
- Rental Access Method
- Trip Duration

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

- User Type
- Rental Access Method
- Trip Duration
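
One possible way to relate these three features is a grouped summary of trip duration. The sketch below uses made-up trips (only the column names come from the dataset) and the median as the summary statistic, which is an assumption rather than the notebook's chosen method:

```python
import pandas as pd

# Made-up trips for illustration; column names follow the dataset
trips = pd.DataFrame(
    {
        "user_type": ["Subscriber", "Subscriber", "Customer", "Customer", "Customer"],
        "rental_access_method": ["app", "clipper", "app", "app", "clipper"],
        "duration_sec": [300, 420, 900, 1500, 600],
    }
)

# Median is less sensitive than the mean to very long trips
median_by_group = (
    trips.groupby(["user_type", "rental_access_method"])["duration_sec"]
    .median()
    .unstack()
)
print(median_by_group)
```

The result has one row per user type and one column per access method, which maps directly onto a heat map or a clustered bar chart in the later exploration sections.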

## Univariate Exploration


### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration


### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
