# Analyzing Bay Wheels trip data from April 2020
**by Gabriel Medeiros das Neves**

## Introduction

Write a brief introduction to the study and the dataset here.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

## Data Wrangling
In this section of the report I will gather the necessary data, examine its general properties, and identify and clean possible issues such as missing or incorrect values.

### Gather
Here I load the provided `.csv` file into a pandas DataFrame.

In [2]:
trip_data = pd.read_csv('baywheels_tripdata.csv')
trip_data.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,5A1FF31692371859,electric_bike,2020-04-04 08:28:20,2020-04-04 08:33:34,,,,,37.7692,-122.4209,37.7703,-122.4069,casual
1,D8D5BA2D4F051133,electric_bike,2020-04-03 18:55:43,2020-04-03 19:21:05,,,,,37.8023,-122.4244,37.8023,-122.4244,casual
2,A3633A9140CA4FF8,electric_bike,2020-04-04 15:11:04,2020-04-04 15:12:21,,,,,37.7667,-122.3961,37.7667,-122.3962,casual
3,301F57EB0197A5E0,electric_bike,2020-04-03 20:21:03,2020-04-03 22:08:06,8th St at Ringold St,60.0,,,37.7744,-122.4095,37.7805,-122.4033,casual
4,9429C701AF5744B3,electric_bike,2020-04-03 18:39:39,2020-04-03 18:47:19,,,,,37.8027,-122.4433,37.8009,-122.4269,casual


### Assess
The main objective of the Assess section is to better understand each piece of data and identify possible issues that must be cleaned.

In [3]:
trip_data.sample(5)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
54800,B5047FB6343FFB74,docked_bike,2020-04-27 11:00:04,2020-04-27 11:19:32,Howard St at Beale St,22.0,Bryant St at 15th St,100.0,37.7898,-122.3946,37.7671,-122.4107,casual
30373,3CD37E7F0CE0F774,electric_bike,2020-04-25 11:47:19,2020-04-25 12:42:12,Esprit Park,126.0,Esprit Park,126.0,37.7616,-122.3908,37.7616,-122.3908,casual
5245,E15EB915A9BB9AB0,electric_bike,2020-04-11 16:49:24,2020-04-11 17:28:36,Central Ave at Fell St,70.0,48th Ave at Cabrillo St,521.0,37.7736,-122.4444,37.773,-122.5091,casual
25216,9F30E16AECDF5C44,electric_bike,2020-04-05 13:51:56,2020-04-05 14:14:12,Folsom St at 13th St,87.0,Market St at Steuart St,16.0,37.7703,-122.4155,37.7946,-122.3948,casual
32462,D3D0ED74ED7453B3,electric_bike,2020-04-15 15:10:32,2020-04-15 16:06:09,,,,,37.3356,-121.8746,37.3288,-121.8661,casual


In [4]:
trip_data.shape

(84259, 13)

In [5]:
trip_data.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id      float64
end_station_name       object
end_station_id        float64
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

In [6]:
trip_data.isnull().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    30825
start_station_id      30825
end_station_name      32401
end_station_id        32401
start_lat                 0
start_lng                 0
end_lat                 142
end_lng                 142
member_casual             0
dtype: int64

In [7]:
trip_data.rideable_type.unique()

array(['electric_bike', 'docked_bike'], dtype=object)

In [8]:
trip_data.member_casual.unique()

array(['casual', 'member'], dtype=object)

In [9]:
trip_data.start_station_name.unique()

array([nan, '8th St at Ringold St', 'Chestnut St at Van Ness Ave',
       'Buchanan St at North Point St',
       'Montgomery St BART Station (Market St at 2nd St)',
       'Post St at Webster St', '4th St at 16th St',
       'McKinnon Ave at 3rd St', 'Precita Park', 'Central Ave at Fell St',
       'Funston Ave at Fulton St', 'Steiner St at California St',
       'The Embarcadero at Bryant St', 'Market St at Steuart St',
       'Leavenworth St at Broadway',
       'Union Square (Powell St at Post St)', '10th St at Empire St',
       'Market St at Dolores St', 'Davis St at Jackson St',
       'Salesforce Transit Center (Natoma St at 2nd St)',
       '4th St at Mission Bay Blvd S', '23rd St at Taylor St',
       'Folsom St at 19th St', '15th St at Potrero Ave',
       'Market St at 10th St', 'Jones St at Post St',
       'Fell St at Stanyan St', 'Octavia Blvd at Page St',
       'Eureka Valley Recreation Center', 'S Park St at 3rd St',
       'Golden Gate Ave at Hyde St', 'Sonora Ave at

In [10]:
trip_data.end_station_name.unique()

array([nan, '4th St at 16th St',
       'Montgomery St BART Station (Market St at 2nd St)', 'Precita Park',
       'Grove St at Divisadero',
       'Garfield Square (25th St at Harrison St)',
       'Post St at Webster St', 'Broderick St at Oak St',
       'Central Ave at Fell St', 'Greenwich St at Franklin St',
       '10th St at Empire St', 'Leavenworth St at Broadway',
       'Davis St at Jackson St', '4th St at Mission Bay Blvd S',
       'San Francisco Public Library (Grove St at Hyde St)',
       '15th St at Potrero Ave', '23rd St at Tennessee St',
       'Mission Playground', 'Jones St at Post St',
       'Market St at Dolores St', 'Fell St at Stanyan St',
       'Golden Gate Ave at Polk St', 'S Park St at 3rd St',
       'Sonora Ave at 1st St', '20th St at Bryant St',
       'Market St at 10th St', 'Scott St at Golden Gate Ave',
       'Broadway at Kearny',
       'San Francisco Caltrain (Townsend St at 4th St)',
       '23rd St at Taylor St', 'Parker Ave at McAllister St',
   

In [11]:
trip_data.started_at.iloc[0]

'2020-04-04 08:28:20'

#### Observed Issues

1. Missing data in **station columns** and **end coordinates**.
2. **Date columns** are strings instead of datetime.
3. The column name **"member_casual"** does not describe its contents: the column holds the user's account type (`casual` or `member`).
4. Several columns (the ride ID, station IDs and coordinates) are not needed for the scope of this analysis.
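
To put the first issue in proportion, the null counts can be expressed as a share of all rows. The following is a minimal sketch that uses a small made-up frame in place of `trip_data`; the helper name `missing_share` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Toy stand-in for trip_data; the values are illustrative only.
toy = pd.DataFrame({
    'start_station_name': [np.nan, '8th St at Ringold St', np.nan, 'Precita Park'],
    'end_lat': [37.7703, 37.8023, np.nan, 37.7805],
    'member_casual': ['casual', 'casual', 'member', 'member'],
})

share = missing_share(toy)
print(share)
```

Applied to the counts reported above, this corresponds to roughly 36.6% of start stations (30,825 of 84,259), about 38.5% of end stations and under 0.2% of end coordinates being missing.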

### Clean
Here I define and execute a programmatic solution for each issue identified in the Assess section, then test whether it solved the problem.

#### Define
1. Fill the missing station names with the value **"Not informed"**. The coordinate columns will not be filled, because they are outside the scope of this analysis and are dropped in step 4.
2. Use the pandas [to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function to convert the `started_at` and `ended_at` columns to datetime.
3. Rename the `member_casual` column to **"account_type"**.
4. Drop `ride_id`, `start_station_id`, `end_station_id`, `start_lat`, `start_lng`, `end_lat` and `end_lng` columns.
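
The four steps above could also be bundled into a single reusable function. This is only a sketch under assumptions: the function name `clean_trips` and the constant `UNUSED_COLUMNS` are hypothetical, and the fill is restricted to the two station-name columns rather than the whole frame. The two sample rows are copied from the outputs shown in the Gather and Assess sections.

```python
import numpy as np
import pandas as pd

# Columns outside the scope of the analysis (step 4).
UNUSED_COLUMNS = ['ride_id', 'start_station_id', 'end_station_id',
                  'start_lat', 'start_lng', 'end_lat', 'end_lng']


def clean_trips(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the four planned cleaning steps and return a new DataFrame."""
    df = raw.drop(columns=UNUSED_COLUMNS)                        # step 4
    station_cols = ['start_station_name', 'end_station_name']
    df[station_cols] = df[station_cols].fillna('Not informed')   # step 1
    for col in ['started_at', 'ended_at']:                       # step 2
        df[col] = pd.to_datetime(df[col])
    return df.rename(columns={'member_casual': 'account_type'})  # step 3


# Two rows taken from the notebook outputs, used as a stand-in for trip_data.
toy = pd.DataFrame({
    'ride_id': ['301F57EB0197A5E0', 'B5047FB6343FFB74'],
    'rideable_type': ['electric_bike', 'docked_bike'],
    'started_at': ['2020-04-03 20:21:03', '2020-04-27 11:00:04'],
    'ended_at': ['2020-04-03 22:08:06', '2020-04-27 11:19:32'],
    'start_station_name': ['8th St at Ringold St', 'Howard St at Beale St'],
    'start_station_id': [60.0, 22.0],
    'end_station_name': [np.nan, 'Bryant St at 15th St'],
    'end_station_id': [np.nan, 100.0],
    'start_lat': [37.7744, 37.7898],
    'start_lng': [-122.4095, -122.3946],
    'end_lat': [37.7805, 37.7671],
    'end_lng': [-122.4033, -122.4107],
    'member_casual': ['casual', 'casual'],
})

cleaned = clean_trips(toy)
print(cleaned.dtypes)
```

Because `drop` returns a new DataFrame, the input frame is left untouched, which plays the same role as the explicit `copy()` used in the cells below.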

#### Code and Test
Here I clean the data as planned in the **Define** section and check the result of each step.

In [12]:
clean_trip_data = trip_data.copy()

In [13]:
clean_trip_data.drop(['ride_id', 'start_station_id', 
                      'end_station_id', 'start_lat', 
                      'start_lng', 'end_lat', 'end_lng'], axis=1, inplace=True)
clean_trip_data.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,member_casual
0,electric_bike,2020-04-04 08:28:20,2020-04-04 08:33:34,,,casual
1,electric_bike,2020-04-03 18:55:43,2020-04-03 19:21:05,,,casual
2,electric_bike,2020-04-04 15:11:04,2020-04-04 15:12:21,,,casual
3,electric_bike,2020-04-03 20:21:03,2020-04-03 22:08:06,8th St at Ringold St,,casual
4,electric_bike,2020-04-03 18:39:39,2020-04-03 18:47:19,,,casual


In [14]:
clean_trip_data.fillna('Not informed', inplace=True)
clean_trip_data.isnull().sum()

rideable_type         0
started_at            0
ended_at              0
start_station_name    0
end_station_name      0
member_casual         0
dtype: int64

In [15]:
clean_trip_data['started_at'] = pd.to_datetime(clean_trip_data.started_at)
clean_trip_data['ended_at'] = pd.to_datetime(clean_trip_data.ended_at)
clean_trip_data.started_at.iloc[0]

Timestamp('2020-04-04 08:28:20')

In [16]:
clean_trip_data.rename(columns={'member_casual': 'account_type'}, inplace=True)
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type
47122,electric_bike,2020-04-11 17:10:24,2020-04-11 17:29:24,Not informed,Folsom St at 9th St,member
36354,docked_bike,2020-04-13 18:34:40,2020-04-13 18:44:11,Buchanan St at North Point St,Buchanan St at North Point St,casual
18844,electric_bike,2020-04-07 14:28:13,2020-04-07 14:32:22,Not informed,Not informed,casual
69592,docked_bike,2020-04-18 12:56:05,2020-04-18 13:52:58,Emeryville Town Hall,Emeryville Town Hall,casual
23184,electric_bike,2020-04-22 16:10:06,2020-04-22 16:21:07,Glen Park BART Station,Not informed,casual


### Feature Engineering

Here I use what I learned while inspecting the DataFrame to create new features that make the exploration easier.

In [17]:
clean_trip_data['started_hour'] = clean_trip_data.started_at.dt.hour
clean_trip_data['ended_hour'] = clean_trip_data.ended_at.dt.hour
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour
32063,electric_bike,2020-04-19 15:05:13,2020-04-19 15:26:08,Carl St at Cole St,Howard St at 8th St,casual,15,15
75840,electric_bike,2020-04-20 19:06:23,2020-04-20 19:14:27,Not informed,Folsom St at 3rd St,member,19,19
70781,electric_bike,2020-04-17 17:28:35,2020-04-17 17:34:18,Not informed,Townsend St at 5th St,member,17,17
57198,docked_bike,2020-04-07 11:05:15,2020-04-07 11:18:58,Telegraph Ave at 19th St,MacArthur BART Station,member,11,11
77760,docked_bike,2020-04-26 12:32:20,2020-04-26 13:06:44,Santa Clara St at Almaden Blvd,5th St at San Salvador St,member,12,13


In [18]:
clean_trip_data['monthday'] = clean_trip_data.started_at.dt.day
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday
39910,electric_bike,2020-04-18 08:46:30,2020-04-18 08:56:18,Harrison St at 17th St,Not informed,member,8,8,18
37974,docked_bike,2020-04-29 15:17:32,2020-04-29 15:43:56,Derby St at College Ave,Russell St at College Ave,casual,15,15,29
80320,electric_bike,2020-04-15 21:04:41,2020-04-15 21:20:15,Not informed,Not informed,member,21,21,15
39474,docked_bike,2020-04-09 06:17:45,2020-04-09 06:22:59,Gennessee St at Monterey Blvd,Glen Park BART Station,member,6,6,9
16910,electric_bike,2020-04-01 14:31:20,2020-04-01 15:17:29,Not informed,Not informed,casual,14,15,1


In [19]:
clean_trip_data['weekday'] = clean_trip_data.started_at.dt.weekday
clean_trip_data['weekday'] = clean_trip_data.weekday.map({0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 
                                                          3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'})
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday,weekday
26703,electric_bike,2020-04-28 00:52:43,2020-04-28 00:55:14,Not informed,San Carlos St at Meridian Ave,casual,0,0,28,Tuesday
18833,electric_bike,2020-04-07 13:37:29,2020-04-07 13:50:26,Not informed,Not informed,casual,13,13,7,Tuesday
32541,electric_bike,2020-04-03 01:29:13,2020-04-03 01:32:28,Not informed,San Carlos St at Meridian Ave,casual,1,1,3,Friday
27587,electric_bike,2020-04-13 21:11:01,2020-04-13 21:15:04,Page St at Scott St,Koshland Park,casual,21,21,13,Monday
76960,docked_bike,2020-04-25 12:16:01,2020-04-25 12:22:15,El Embarcadero at Grand Ave,Telegraph Ave at 23rd St,member,12,12,25,Saturday
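
As an aside, pandas also provides `Series.dt.day_name()`, which returns the weekday names directly. The sketch below compares it with the explicit mapping used above on three timestamps taken from the sample output; the variable names are illustrative, and it assumes the default English locale.

```python
import pandas as pd

# Same integer-to-name mapping as in the cell above.
WEEKDAY_NAMES = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
                 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

# Three start times copied from the sample output above.
started_at = pd.Series(pd.to_datetime([
    '2020-04-28 00:52:43',
    '2020-04-03 01:29:13',
    '2020-04-25 12:16:01',
]))

mapped = started_at.dt.weekday.map(WEEKDAY_NAMES)  # approach used in the notebook
named = started_at.dt.day_name()                   # built-in alternative

print(list(named))
```

Both approaches give the same labels here; the explicit mapping has the advantage of not depending on the locale.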


In [20]:
clean_trip_data['trip_time'] = clean_trip_data.ended_at - clean_trip_data.started_at
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday,weekday,trip_time
76123,electric_bike,2020-04-12 13:46:40,2020-04-12 13:58:17,Not informed,McCoppin St at Valencia St,member,13,13,12,Sunday,00:11:37
5265,electric_bike,2020-04-28 12:37:40,2020-04-28 14:12:46,Not informed,Jackson St at Polk St,casual,12,14,28,Tuesday,01:35:06
34627,electric_bike,2020-04-20 23:25:35,2020-04-21 00:08:41,Not informed,Not informed,casual,23,0,20,Monday,00:43:06
82295,electric_bike,2020-04-07 20:55:26,2020-04-07 21:01:27,Powell St BART Station (Market St at 5th St),Grove St at Gough St,member,20,21,7,Tuesday,00:06:01
77240,electric_bike,2020-04-22 17:17:29,2020-04-22 17:25:22,Pierce Ave at Market St,The Alameda at Bush St,member,17,17,22,Wednesday,00:07:53


In [21]:
clean_trip_data.drop(['started_at', 'ended_at'], axis=1, inplace=True)
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday,weekday,trip_time
58502,docked_bike,Church St at Duboce Ave,Valencia St at 16th St,member,9,9,26,Sunday,00:06:26
76679,electric_bike,Buchanan St at North Point St,Not informed,member,16,16,17,Friday,00:44:34
76261,electric_bike,Not informed,Not informed,member,21,21,22,Wednesday,00:15:59
56317,docked_bike,Esprit Park,Valencia St at 16th St,member,13,14,20,Monday,00:17:27
21175,electric_bike,Fell St at Stanyan St,Funston Ave at Fulton St,casual,18,18,30,Thursday,00:20:24


## Exploratory Data Analysis
The Exploratory Data Analysis section is where I'll be focusing on computing statistics and creating visualizations to explore the dataset. 

### What is the structure of the dataset?
There are 84,259 bike rentals in the dataset, each described by 9 features (`rideable_type`, `start_station_name`, `end_station_name`, `account_type`, `started_hour`, `ended_hour`, `monthday`, `weekday`, and `trip_time`).

**Column data types:**
1. `rideable_type`: object(str)
2. `start_station_name`: object(str)
3. `end_station_name`: object(str)
4. `account_type`: object(str)
5. `started_hour`: integer
6. `ended_hour`: integer
7. `monthday`: integer
8. `weekday`: object(str)
9. `trip_time`: Timedelta object ([docs here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html))
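
Since `trip_time` is stored as a timedelta, it may be convenient to derive a plain numeric duration before plotting or computing statistics. A minimal sketch, assuming a hypothetical extra column named `trip_minutes` and using three durations from the sample output above:

```python
import pandas as pd

# Three trip durations copied from the Feature Engineering output.
trips = pd.DataFrame({
    'trip_time': pd.to_timedelta(['00:11:37', '01:35:06', '00:06:01']),
})

# Convert each timedelta to a float number of minutes.
trips['trip_minutes'] = trips['trip_time'].dt.total_seconds() / 60

print(trips)
```

A numeric column like this works directly with histogram and scatter plot functions, which generally expect numbers rather than timedelta values.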

### What is/are the main feature(s) of interest in the dataset?
I'm most interested in understanding which features are associated with the number of rentals: at what time of day most rentals occur, which type of bike is most popular, and which days of the week have the most rentals.  
I'm also interested in finding out which features best predict the trip time of a rental.
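
One possible way to start on these questions is to count rentals per hour and per weekday. The sketch below uses a small made-up frame in place of `clean_trip_data`; the names `WEEKDAY_ORDER`, `rentals_per_hour` and `rentals_per_weekday` are hypothetical. Turning `weekday` into an ordered categorical is a design choice, not something the notebook does, so that the counts come out in calendar order rather than alphabetical order.

```python
import pandas as pd

WEEKDAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Toy stand-in for clean_trip_data; the values are illustrative only.
toy = pd.DataFrame({
    'started_hour': [15, 19, 17, 17, 12],
    'weekday': ['Sunday', 'Monday', 'Friday', 'Friday', 'Sunday'],
})

# Number of rentals starting in each hour, ordered by hour.
rentals_per_hour = toy['started_hour'].value_counts().sort_index()

# An ordered categorical keeps the weekdays in calendar order.
toy['weekday'] = pd.Categorical(toy['weekday'],
                                categories=WEEKDAY_ORDER, ordered=True)
rentals_per_weekday = toy['weekday'].value_counts().sort_index()

print(rentals_per_hour)
print(rentals_per_weekday)
```

With a categorical column, weekdays that have no rentals still appear in the result with a count of zero, which keeps later bar charts aligned across all seven days.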

### What features in the dataset will probably help support the investigation into the feature(s) of interest?
I believe that the day of the week and the time of day will be determining factors when investigating my features of interest, although I assume that the remaining features will also have relevant information to the analysis.

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> Make sure that, after every plot or related series of plots, that you
include a Markdown cell with comments about what you observed, and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
