# Analyzing Bay Wheels trip data from April 2020
**by Gabriel Medeiros das Neves**

## Introduction

Write a brief introduction to the study and the dataset here.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

## Data Wrangling
In this section of the report I gather the necessary data, examine its general properties, and identify and clean quality and tidiness issues such as missing or incorrect values.

### Gather
Here I load the provided `.csv` file into a pandas DataFrame.

In [2]:
trip_data = pd.read_csv('baywheels_tripdata.csv')
trip_data.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,5A1FF31692371859,electric_bike,2020-04-04 08:28:20,2020-04-04 08:33:34,,,,,37.7692,-122.4209,37.7703,-122.4069,casual
1,D8D5BA2D4F051133,electric_bike,2020-04-03 18:55:43,2020-04-03 19:21:05,,,,,37.8023,-122.4244,37.8023,-122.4244,casual
2,A3633A9140CA4FF8,electric_bike,2020-04-04 15:11:04,2020-04-04 15:12:21,,,,,37.7667,-122.3961,37.7667,-122.3962,casual
3,301F57EB0197A5E0,electric_bike,2020-04-03 20:21:03,2020-04-03 22:08:06,8th St at Ringold St,60.0,,,37.7744,-122.4095,37.7805,-122.4033,casual
4,9429C701AF5744B3,electric_bike,2020-04-03 18:39:39,2020-04-03 18:47:19,,,,,37.8027,-122.4433,37.8009,-122.4269,casual
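
As an alternative to converting and dropping columns later, the file could be read with only the columns of interest and with the timestamps parsed at load time. The following is a minimal sketch, not the approach used in this report: it assumes the column names shown in the output above, and it reads two rows copied from that output through `io.StringIO` instead of the real `baywheels_tripdata.csv`.

```python
import io
import pandas as pd

# Two rows copied from the head() output above, standing in for the real file.
sample_csv = io.StringIO(
    "ride_id,rideable_type,started_at,ended_at,"
    "start_station_name,end_station_name,member_casual\n"
    "5A1FF31692371859,electric_bike,2020-04-04 08:28:20,2020-04-04 08:33:34,,,casual\n"
    "301F57EB0197A5E0,electric_bike,2020-04-03 20:21:03,2020-04-03 22:08:06,"
    "8th St at Ringold St,,casual\n"
)

# Columns kept for the analysis (ride_id and the coordinates are left out).
wanted = ["rideable_type", "started_at", "ended_at",
          "start_station_name", "end_station_name", "member_casual"]

sample = pd.read_csv(
    sample_csv,
    usecols=wanted,                          # skip columns that are never used
    parse_dates=["started_at", "ended_at"],  # parse timestamps while reading
)
print(sample.dtypes)
```

With the real file, the first argument would be the file path rather than the in-memory sample.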


### Assess
The main objective of the Assess section is to understand each column of the data and identify issues that need to be cleaned.

In [3]:
trip_data.sample(5)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
76232,F94991554FDB6C42,docked_bike,2020-04-30 12:23:43,2020-04-30 12:30:30,Berry St at 4th St,81.0,The Embarcadero at Steuart St,23.0,37.7759,-122.3932,37.7915,-122.391,member
64469,CB005B63FBC280C6,docked_bike,2020-04-24 13:27:41,2020-04-24 15:12:33,Potrero Ave and Mariposa St,336.0,Irwin St at 8th St,102.0,37.7633,-122.4074,37.7669,-122.3996,casual
30639,C4E1A0F878200461,electric_bike,2020-04-07 18:57:32,2020-04-07 19:01:45,Greenwich St at Franklin St,478.0,Bay St at Fillmore St,399.0,37.8003,-122.4258,37.8026,-122.4361,casual
78505,C2111D3273166B53,docked_bike,2020-04-18 12:53:20,2020-04-18 12:58:25,Grand Ave at Perkins St,196.0,Grand Ave at Santa Clara Ave,193.0,37.8089,-122.2565,37.8127,-122.2472,member
47930,CCB10B925A56DA94,docked_bike,2020-04-23 10:20:33,2020-04-23 10:55:01,Jack London Square,187.0,Lake Merritt BART Station,163.0,37.7962,-122.2794,37.7973,-122.2653,casual


In [4]:
trip_data.shape

(84259, 13)

In [5]:
trip_data.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id      float64
end_station_name       object
end_station_id        float64
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

In [6]:
trip_data.isnull().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    30825
start_station_id      30825
end_station_name      32401
end_station_id        32401
start_lat                 0
start_lng                 0
end_lat                 142
end_lng                 142
member_casual             0
dtype: int64
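
Raw counts are easier to judge next to the size of the dataset: with 84259 rows, the station columns are missing in roughly 37% to 38% of trips, while the end coordinates are missing in well under 1%. The sketch below shows one way to produce such a summary. `missing_report` is a hypothetical helper, not part of pandas, and the small `demo` frame is illustrative only.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Tiny illustrative frame; on the real data this would be missing_report(trip_data).
demo = pd.DataFrame({
    "start_station_name": ["8th St at Ringold St", np.nan, np.nan, np.nan],
    "end_lat": [37.7703, 37.8023, np.nan, 37.7805],
    "member_casual": ["casual", "casual", "member", "casual"],
})
print(missing_report(demo))
```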

In [7]:
trip_data.rideable_type.unique()

array(['electric_bike', 'docked_bike'], dtype=object)

In [8]:
trip_data.member_casual.unique()

array(['casual', 'member'], dtype=object)

In [9]:
trip_data.start_station_name.unique()

array([nan, '8th St at Ringold St', 'Chestnut St at Van Ness Ave',
       'Buchanan St at North Point St',
       'Montgomery St BART Station (Market St at 2nd St)',
       'Post St at Webster St', '4th St at 16th St',
       'McKinnon Ave at 3rd St', 'Precita Park', 'Central Ave at Fell St',
       'Funston Ave at Fulton St', 'Steiner St at California St',
       'The Embarcadero at Bryant St', 'Market St at Steuart St',
       'Leavenworth St at Broadway',
       'Union Square (Powell St at Post St)', '10th St at Empire St',
       'Market St at Dolores St', 'Davis St at Jackson St',
       'Salesforce Transit Center (Natoma St at 2nd St)',
       '4th St at Mission Bay Blvd S', '23rd St at Taylor St',
       'Folsom St at 19th St', '15th St at Potrero Ave',
       'Market St at 10th St', 'Jones St at Post St',
       'Fell St at Stanyan St', 'Octavia Blvd at Page St',
       'Eureka Valley Recreation Center', 'S Park St at 3rd St',
       'Golden Gate Ave at Hyde St', 'Sonora Ave at

In [10]:
trip_data.end_station_name.unique()

array([nan, '4th St at 16th St',
       'Montgomery St BART Station (Market St at 2nd St)', 'Precita Park',
       'Grove St at Divisadero',
       'Garfield Square (25th St at Harrison St)',
       'Post St at Webster St', 'Broderick St at Oak St',
       'Central Ave at Fell St', 'Greenwich St at Franklin St',
       '10th St at Empire St', 'Leavenworth St at Broadway',
       'Davis St at Jackson St', '4th St at Mission Bay Blvd S',
       'San Francisco Public Library (Grove St at Hyde St)',
       '15th St at Potrero Ave', '23rd St at Tennessee St',
       'Mission Playground', 'Jones St at Post St',
       'Market St at Dolores St', 'Fell St at Stanyan St',
       'Golden Gate Ave at Polk St', 'S Park St at 3rd St',
       'Sonora Ave at 1st St', '20th St at Bryant St',
       'Market St at 10th St', 'Scott St at Golden Gate Ave',
       'Broadway at Kearny',
       'San Francisco Caltrain (Townsend St at 4th St)',
       '23rd St at Taylor St', 'Parker Ave at McAllister St',
   

In [11]:
trip_data.started_at.iloc[0]

'2020-04-04 08:28:20'

#### Observed Issues

1. Missing data in **station columns** and **end coordinates**.
2. **Date columns** are strings instead of datetime.
3. The column name **"member_casual"** does not describe its contents: the column holds the user account type.
4. Some columns are not needed for the scope of this analysis.

### Clean
Here I define and execute a programmatic solution for each issue identified in the Assess section, and test whether each solution solved the problem.

#### Define
1. The missing station names will be filled with the value **"Not informed"**. The coordinate columns will not be filled, since they are outside the scope of this analysis.
2. Use the pandas [to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function to convert the date columns to datetime.
3. Rename the `member_casual` column to **"account_type"**.
4. Drop the `ride_id`, `start_station_id`, `end_station_id`, `start_lat`, `start_lng`, `end_lat` and `end_lng` columns.

#### Code and Test
Here I apply the cleaning steps planned in the **Define** section and check the result of each one.

In [67]:
clean_trip_data = trip_data.copy()

In [68]:
clean_trip_data.drop(['ride_id', 'start_station_id', 
                      'end_station_id', 'start_lat', 
                      'start_lng', 'end_lat', 'end_lng'], axis=1, inplace=True)
clean_trip_data.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,member_casual
0,electric_bike,2020-04-04 08:28:20,2020-04-04 08:33:34,,,casual
1,electric_bike,2020-04-03 18:55:43,2020-04-03 19:21:05,,,casual
2,electric_bike,2020-04-04 15:11:04,2020-04-04 15:12:21,,,casual
3,electric_bike,2020-04-03 20:21:03,2020-04-03 22:08:06,8th St at Ringold St,,casual
4,electric_bike,2020-04-03 18:39:39,2020-04-03 18:47:19,,,casual


In [69]:
clean_trip_data.fillna('Not informed', inplace=True)
clean_trip_data.isnull().sum()

rideable_type         0
started_at            0
ended_at              0
start_station_name    0
end_station_name      0
member_casual         0
dtype: int64
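
Calling `fillna` on the whole DataFrame works here because the only columns with missing values left at this point are the two station names. If numeric columns with gaps were still present, they would receive the string as well. A minimal sketch of a fill restricted to the station columns follows; the `demo` frame and its `end_lat` column exist only for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "start_station_name": ["8th St at Ringold St", np.nan],
    "end_station_name": [np.nan, "Precita Park"],
    "end_lat": [37.7805, np.nan],  # numeric column kept only for this illustration
})

# Fill only the station name columns and leave every other column untouched.
station_cols = ["start_station_name", "end_station_name"]
demo[station_cols] = demo[station_cols].fillna("Not informed")
print(demo)
```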

In [70]:
clean_trip_data['started_at'] = pd.to_datetime(clean_trip_data.started_at)
clean_trip_data['ended_at'] = pd.to_datetime(clean_trip_data.ended_at)
clean_trip_data.started_at.iloc[0]

Timestamp('2020-04-04 08:28:20')
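
Looking at a single value confirms the conversion for one row. A dtype check covers the whole column. The sketch below is one possible way to write that test, using two timestamps copied from the data shown earlier. The explicit `format` string is an assumption based on those values.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

demo = pd.DataFrame({
    "started_at": ["2020-04-04 08:28:20", "2020-04-03 18:55:43"],
    "ended_at": ["2020-04-04 08:33:34", "2020-04-03 19:21:05"],
})

for col in ["started_at", "ended_at"]:
    # With an explicit format, a malformed timestamp raises instead of being guessed.
    demo[col] = pd.to_datetime(demo[col], format="%Y-%m-%d %H:%M:%S")

# Test: both columns must now hold datetime values.
assert is_datetime64_any_dtype(demo["started_at"])
assert is_datetime64_any_dtype(demo["ended_at"])
```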

In [71]:
clean_trip_data.rename(columns={'member_casual': 'account_type'}, inplace=True)
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type
51603,electric_bike,2020-04-20 18:59:49,2020-04-20 19:02:19,Not informed,1st St at San Carlos St,member
35502,electric_bike,2020-04-08 06:55:13,2020-04-08 07:20:25,Not informed,Not informed,member
9657,electric_bike,2020-04-04 13:19:42,2020-04-04 13:41:40,Jackson St at Polk St,Not informed,casual
31399,electric_bike,2020-04-18 12:20:02,2020-04-18 12:48:04,Not informed,Not informed,casual
7375,electric_bike,2020-04-14 20:04:56,2020-04-14 20:34:26,48th Ave at Cabrillo St,Scott St at Golden Gate Ave,casual


### Feature Engineering

In [72]:
clean_trip_data['started_hour'] = clean_trip_data.started_at.dt.hour
clean_trip_data['ended_hour'] = clean_trip_data.ended_at.dt.hour
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour
16155,electric_bike,2020-04-24 16:08:10,2020-04-24 16:54:11,Not informed,Not informed,casual,16,16
26381,electric_bike,2020-04-26 18:46:02,2020-04-26 20:02:28,Not informed,Not informed,casual,18,20
70790,electric_bike,2020-04-28 07:36:50,2020-04-28 07:41:45,Hyde St at Post St,Mechanics Monument Plaza (Market St at Bush St),member,7,7
83841,docked_bike,2020-04-26 15:42:59,2020-04-26 15:53:35,Broderick St at Oak St,S Van Ness Ave at Market St,casual,15,15
33763,electric_bike,2020-04-19 13:15:10,2020-04-19 13:15:20,Not informed,Not informed,casual,13,13


In [73]:
clean_trip_data['monthday'] = clean_trip_data.started_at.dt.day
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday
80182,electric_bike,2020-04-11 17:14:17,2020-04-11 17:27:33,16th St Mission BART Station 2,1st St at Folsom St,member,17,17,11
54313,electric_bike,2020-04-20 16:33:35,2020-04-20 16:43:17,Not informed,Not informed,member,16,16,20
3740,electric_bike,2020-04-19 14:32:55,2020-04-19 14:49:11,Not informed,Not informed,casual,14,14,19
9763,electric_bike,2020-04-06 12:35:54,2020-04-06 12:48:50,Powell St BART Station (Market St at 5th St),Not informed,casual,12,12,6
34220,electric_bike,2020-04-17 08:34:15,2020-04-17 08:45:35,Not informed,Folsom St at 15th St,casual,8,8,17


In [74]:
# dt.day_name() returns the English weekday name ('Monday' ... 'Sunday') directly
clean_trip_data['weekday'] = clean_trip_data.started_at.dt.day_name()
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday,weekday
49975,docked_bike,2020-04-04 12:19:05,2020-04-04 12:30:51,17th St at Valencia St,Mississippi St at 17th St,member,12,12,4,Saturday
26584,electric_bike,2020-04-22 14:07:49,2020-04-22 15:00:41,Not informed,Not informed,casual,14,15,22,Wednesday
11104,electric_bike,2020-04-17 21:17:53,2020-04-17 21:26:02,Not informed,Not informed,casual,21,21,17,Friday
39771,electric_bike,2020-04-07 15:13:53,2020-04-07 15:23:53,Not informed,Not informed,member,15,15,7,Tuesday
67688,electric_bike,2020-04-16 05:35:05,2020-04-16 05:36:24,Jackson Playground,Not informed,member,5,5,16,Thursday
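
Weekday names stored as plain strings sort alphabetically, which puts Friday before Monday in tables and plots. One option, sketched below under the assumption that calendar order is wanted later in the analysis, is to store the column as an ordered categorical. The three timestamps are taken from the sample above, and `day_order` is a name introduced for this example.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

demo = pd.DataFrame({"started_at": pd.to_datetime(
    ["2020-04-04 12:19:05", "2020-04-22 14:07:49", "2020-04-07 15:13:53"])})

# Ordered categorical: sorting and grouping follow the calendar, not the alphabet.
demo["weekday"] = pd.Categorical(demo["started_at"].dt.day_name(),
                                 categories=day_order, ordered=True)

print(demo.sort_values("weekday")["weekday"].tolist())
```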


In [75]:
clean_trip_data['trip_time'] = clean_trip_data.ended_at - clean_trip_data.started_at
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday,weekday,trip_time
42268,electric_bike,2020-04-16 07:10:57,2020-04-16 07:18:54,4th St at 16th St,Powell St BART Station (Market St at 4th St),member,7,7,16,Thursday,00:07:57
21287,electric_bike,2020-04-15 15:19:24,2020-04-15 15:27:08,Not informed,Not informed,casual,15,15,15,Wednesday,00:07:44
37312,docked_bike,2020-04-25 10:23:13,2020-04-25 10:24:34,Duboce Park,Duboce Park,casual,10,10,25,Saturday,00:01:21
50305,electric_bike,2020-04-19 12:53:19,2020-04-19 13:03:43,Not informed,Not informed,member,12,13,19,Sunday,00:10:24
8349,electric_bike,2020-04-29 04:23:28,2020-04-29 05:01:15,Not informed,Not informed,casual,4,5,29,Wednesday,00:37:47
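
The `trip_time` column holds timedelta values, which are readable but awkward to plot or summarize. A numeric duration is often more convenient. The sketch below converts the duration to minutes and flags trips that end at or before their start time. The column name `trip_minutes` is hypothetical, and the two rows are copied from the sample above.

```python
import pandas as pd

demo = pd.DataFrame({
    "started_at": pd.to_datetime(["2020-04-16 07:10:57", "2020-04-25 10:23:13"]),
    "ended_at": pd.to_datetime(["2020-04-16 07:18:54", "2020-04-25 10:24:34"]),
})

demo["trip_time"] = demo["ended_at"] - demo["started_at"]

# Timedelta to float minutes (trip_minutes is a hypothetical column name).
demo["trip_minutes"] = demo["trip_time"].dt.total_seconds() / 60

# Trips with a zero or negative duration would point to a data quality problem.
suspicious = demo[demo["trip_minutes"] <= 0]

print(demo[["trip_time", "trip_minutes"]])
print("suspicious trips:", len(suspicious))
```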


In [76]:
clean_trip_data.drop(['started_at', 'ended_at'], axis=1, inplace=True)
clean_trip_data.sample(5)

Unnamed: 0,rideable_type,start_station_name,end_station_name,account_type,started_hour,ended_hour,monthday,weekday,trip_time
47848,docked_bike,Frederick St at Arguello Blvd,10th Ave at Irving St,casual,13,13,14,Tuesday,00:48:46
70890,electric_bike,Bryant St at 6th St,Turk St at Fillmore St,member,14,15,7,Tuesday,00:45:28
28446,electric_bike,Not informed,Not informed,casual,18,19,22,Wednesday,00:07:37
58033,docked_bike,Powell St at Columbus Ave,Terry Francois Blvd at Mission Bay Blvd N,casual,17,18,25,Saturday,00:26:04
53801,electric_bike,Funston Ave at Irving St,Not informed,member,11,11,27,Monday,00:04:29


## Exploratory Data Analysis

### What is the structure of your dataset?

> Your answer here!

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

> After every plot or related series of plots, make sure that you
include a Markdown cell with comments about what you observed and what
you plan on investigating next.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
