# A&A Project Boston Blue Bikes 2017 Analysis

Import of packages

In [30]:
import pandas as pd
import numpy as np

Import of data

In [31]:
df_Trips = pd.read_csv('boston_2017.csv')
df_Stations = pd.read_csv('current_bluebikes_stations.csv')
df_Weather = pd.read_csv('weather_hourly_boston.csv')

## 1 Data Cleaning

In the first step, the imported data is examined, cleaned and transformed. The result is a clean data set with additional features on which further analysis can build.

### 1.1 Getting a first overview of the data

In [32]:
df_Trips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313774 entries, 0 to 1313773
Data columns (total 8 columns):
 #   Column              Non-Null Count    Dtype 
---  ------              --------------    ----- 
 0   start_time          1313774 non-null  object
 1   end_time            1313774 non-null  object
 2   start_station_id    1313774 non-null  int64 
 3   end_station_id      1313774 non-null  int64 
 4   start_station_name  1313774 non-null  object
 5   end_station_name    1313774 non-null  object
 6   bike_id             1313774 non-null  int64 
 7   user_type           1313774 non-null  object
dtypes: int64(3), object(5)
memory usage: 80.2+ MB


In [33]:
len(df_Trips)

1313774

In [34]:
df_Trips.columns

Index(['start_time', 'end_time', 'start_station_id', 'end_station_id',
       'start_station_name', 'end_station_name', 'bike_id', 'user_type'],
      dtype='object')

In [35]:
df_Trips.head(5)

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type
0,2017-01-01 00:06:58,2017-01-01 00:12:49,67,139,MIT at Mass Ave / Amherst St,Dana Park,644,Subscriber
1,2017-01-01 00:13:16,2017-01-01 00:28:07,36,10,Boston Public Library - 700 Boylston St.,B.U. Central - 725 Comm. Ave.,230,Subscriber
2,2017-01-01 00:16:17,2017-01-01 00:44:10,36,9,Boston Public Library - 700 Boylston St.,Agganis Arena - 925 Comm Ave.,980,Customer
3,2017-01-01 00:21:22,2017-01-01 00:33:50,46,19,Christian Science Plaza,Buswell St. at Park Dr.,1834,Subscriber
4,2017-01-01 00:30:06,2017-01-01 00:40:28,10,8,B.U. Central - 725 Comm. Ave.,Union Square - Brighton Ave. at Cambridge St.,230,Subscriber


In [36]:
df_Trips.tail(5)

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type
1313769,2017-12-31 23:46:18,2017-12-31 23:50:27,117,141,Binney St / Sixth St,Kendall Street,1846,Subscriber
1313770,2017-12-29 16:11:56,2017-12-29 16:16:18,54,42,Tremont St at West St,Boylston St at Arlington St TEMPORARY WINTER L...,2,Subscriber
1313771,2017-12-30 08:09:44,2017-12-30 08:26:08,54,58,Tremont St at West St,Beacon St at Arlington St,1534,Subscriber
1313772,2017-12-30 12:20:01,2017-12-30 12:49:12,54,46,Tremont St at West St,Christian Science Plaza - Massachusetts Ave at...,1978,Subscriber
1313773,2017-12-30 18:27:39,2017-12-30 18:53:54,54,21,Tremont St at West St,Prudential Center - Belvedere St,15,Subscriber


### 1.2 Identifying missing or wrong values and duplicates

In [56]:
df_Trips.isnull().values.any()

False

In [38]:
# .sum() counts the TRUE values 
df_Trips.duplicated().sum()

0

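As the dataframe contains neither missing values nor duplicate rows, the data appears to be structurally complete. This suggests that its quality is already good, although the plausibility of individual values has not been verified at this point.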
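The two checks above can be bundled into a small helper so they can be repeated for each dataframe. This is a minimal sketch: the function name `quality_report` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Count rows, missing cells and fully duplicated rows of a dataframe."""
    return {
        "rows": len(df),
        # isnull() marks every missing cell; summing twice counts them all
        "missing_cells": int(df.isnull().sum().sum()),
        # duplicated() marks each repeat of an earlier, identical row
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data (made up for illustration): one missing value, one repeated row
toy = pd.DataFrame({
    "bike_id": [1, 2, 2, 3],
    "user_type": ["Subscriber", "Customer", "Customer", None],
})
report = quality_report(toy)
print(report)
```

In the notebook, the same helper could be called as `quality_report(df_Trips)` or `quality_report(df_Weather)`.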

### 1.3 Feature Engineering 

Next, additional features are derived from the existing data.
These include:
- temporal data, e.g. the duration of the ride, the weekday, or the start hour

In [39]:
df_Trips[['start_time','end_time']] = df_Trips[['start_time','end_time']].apply(pd.to_datetime)

In [40]:
df_Trips.tail()

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type
1313769,2017-12-31 23:46:18,2017-12-31 23:50:27,117,141,Binney St / Sixth St,Kendall Street,1846,Subscriber
1313770,2017-12-29 16:11:56,2017-12-29 16:16:18,54,42,Tremont St at West St,Boylston St at Arlington St TEMPORARY WINTER L...,2,Subscriber
1313771,2017-12-30 08:09:44,2017-12-30 08:26:08,54,58,Tremont St at West St,Beacon St at Arlington St,1534,Subscriber
1313772,2017-12-30 12:20:01,2017-12-30 12:49:12,54,46,Tremont St at West St,Christian Science Plaza - Massachusetts Ave at...,1978,Subscriber
1313773,2017-12-30 18:27:39,2017-12-30 18:53:54,54,21,Tremont St at West St,Prudential Center - Belvedere St,15,Subscriber


In [41]:
# Derive temporal features from the trip timestamps
df_Trips['start_hour'] = df_Trips['start_time'].dt.hour
df_Trips['weekday'] = df_Trips['start_time'].dt.weekday  # Monday = 0, Sunday = 6
df_Trips['duration'] = df_Trips['end_time'] - df_Trips['start_time']
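Further temporal features can be derived in the same way. The sketch below uses two made-up trips and the pandas `.dt` accessor; the column names `duration_min`, `weekday_name` and `is_weekend` are assumptions for illustration and do not appear in the original notebook.

```python
import pandas as pd

# Two toy trips (illustrative values only)
toy = pd.DataFrame({
    "start_time": pd.to_datetime(["2017-01-01 00:06:58", "2017-01-02 13:10:08"]),
    "end_time": pd.to_datetime(["2017-01-01 00:12:49", "2017-01-02 13:31:07"]),
})

# Duration as a Timedelta, then as a plain number of minutes
toy["duration"] = toy["end_time"] - toy["start_time"]
toy["duration_min"] = toy["duration"].dt.total_seconds() / 60

# Readable weekday name and a weekend flag (weekday: Monday = 0, Sunday = 6)
toy["weekday_name"] = toy["start_time"].dt.day_name()
toy["is_weekend"] = toy["start_time"].dt.weekday >= 5

print(toy[["duration_min", "weekday_name", "is_weekend"]])
```

A numeric duration is usually easier to aggregate and plot than a `Timedelta` column.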

In [53]:
df_Trips.head(5)

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type,start_hour,weekday,duration
0,2017-01-01 00:06:58,2017-01-01 00:12:49,67,139,MIT at Mass Ave / Amherst St,Dana Park,644,Subscriber,0,6,0 days 00:05:51
1,2017-01-01 00:13:16,2017-01-01 00:28:07,36,10,Boston Public Library - 700 Boylston St.,B.U. Central - 725 Comm. Ave.,230,Subscriber,0,6,0 days 00:14:51
2,2017-01-01 00:16:17,2017-01-01 00:44:10,36,9,Boston Public Library - 700 Boylston St.,Agganis Arena - 925 Comm Ave.,980,Customer,0,6,0 days 00:27:53
3,2017-01-01 00:21:22,2017-01-01 00:33:50,46,19,Christian Science Plaza,Buswell St. at Park Dr.,1834,Subscriber,0,6,0 days 00:12:28
4,2017-01-01 00:30:06,2017-01-01 00:40:28,10,8,B.U. Central - 725 Comm. Ave.,Union Square - Brighton Ave. at Cambridge St.,230,Subscriber,0,6,0 days 00:10:22


In [59]:
df_Trips[df_Trips['start_station_id'] == 175]

Unnamed: 0,start_time,end_time,start_station_id,end_station_id,start_station_name,end_station_name,bike_id,user_type,start_hour,weekday,duration
294,2017-01-01 15:38:39,2017-01-01 16:06:53,175,33,Brighton Center,Kenmore Sq / Comm Ave,1640,Subscriber,15,6,0 days 00:28:14
762,2017-01-02 13:10:08,2017-01-02 13:31:07,175,21,Brighton Center,Prudential Center / Belvidere,1417,Subscriber,13,0,0 days 00:20:59
1333,2017-01-03 07:13:39,2017-01-03 07:39:35,175,42,Brighton Center,Boylston St. at Arlington St.,1535,Subscriber,7,1,0 days 00:25:56
61976,2017-03-28 07:45:35,2017-03-28 08:12:18,175,91,Brighton Center - Washington St at Cambridge St,One Kendall Square at Hampshire St / Portland St,343,Subscriber,7,1,0 days 00:26:43
62575,2017-03-28 12:24:16,2017-03-28 12:28:14,175,8,Brighton Center - Washington St at Cambridge St,Union Square - Brighton Ave at Cambridge St,127,Subscriber,12,1,0 days 00:03:58
...,...,...,...,...,...,...,...,...,...,...,...
1312673,2017-12-28 10:23:06,2017-12-28 10:31:32,175,69,Brighton Center - Washington St at Cambridge St,Coolidge Corner - Beacon St @ Centre St,862,Subscriber,10,3,0 days 00:08:26
1312875,2017-12-28 17:51:25,2017-12-28 18:12:45,175,82,Brighton Center - Washington St at Cambridge St,Washington Square at Washington St. / Beacon S...,1842,Subscriber,17,3,0 days 00:21:20
1313082,2017-12-29 10:27:59,2017-12-29 10:37:45,175,69,Brighton Center - Washington St at Cambridge St,Coolidge Corner - Beacon St @ Centre St,1772,Subscriber,10,4,0 days 00:09:46
1313119,2017-12-29 12:12:40,2017-12-29 12:22:35,175,41,Brighton Center - Washington St at Cambridge St,Packard's Corner - Commonwealth Ave at Brighto...,1777,Subscriber,12,4,0 days 00:09:55


In [42]:
df_Trips['duration'].max()

Timedelta('48 days 08:40:21')

In [43]:
df_Trips['duration'].min()

Timedelta('-1 days +23:06:07')
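The maximum of more than 48 days and the negative minimum indicate that some durations are implausible. One way to flag such rows is sketched below on toy values taken from the outputs above. The 24-hour upper limit (`max_plausible`) is an assumed threshold chosen for illustration, not a rule from the data provider.

```python
import pandas as pd

# Toy durations in seconds: a normal trip, the observed maximum
# (48 days 08:40:21) and the observed minimum (-1 days +23:06:07)
toy = pd.DataFrame({
    "duration": pd.to_timedelta([351, 4178421, -3233], unit="s"),
})

# Assumed threshold: trips longer than 24 hours are treated as implausible
max_plausible = pd.Timedelta(hours=24)

toy["negative"] = toy["duration"] < pd.Timedelta(0)
toy["too_long"] = toy["duration"] > max_plausible

# Keep only rows that pass both checks
plausible = toy[~toy["negative"] & ~toy["too_long"]]
print(toy)
print(len(plausible), "plausible trip(s)")
```

Flagging first, rather than dropping directly, makes it possible to inspect the affected trips before deciding how to treat them.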

### 1.4 Weather data cleaning

In [44]:
df_Weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43848 entries, 0 to 43847
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date_time  43354 non-null  object 
 1   max_temp   43354 non-null  float64
 2   min_temp   43354 non-null  float64
 3   precip     43356 non-null  float64
dtypes: float64(3), object(1)
memory usage: 1.3+ MB


In [45]:
df_Weather.isnull().values.sum()

1974

In [46]:
df_Weather.head(5)

Unnamed: 0,date_time,max_temp,min_temp,precip
0,2015-01-02 01:00:00,-1.1,-1.1,0.0
1,2015-01-02 02:00:00,-1.1,-1.1,0.0
2,2015-01-02 03:00:00,-0.6,-0.6,0.0
3,2015-01-02 04:00:00,-0.6,-0.6,0.0
4,2015-01-02 05:00:00,-0.6,-0.6,0.0


In [47]:
df_Weather.tail(5)

Unnamed: 0,date_time,max_temp,min_temp,precip
43843,2020-01-01 20:00:00,5.0,5.0,0.0
43844,2020-01-01 21:00:00,4.4,4.4,0.0
43845,2020-01-01 22:00:00,4.4,4.4,0.0
43846,2020-01-01 23:00:00,3.9,3.9,0.0
43847,2020-01-02 00:00:00,3.3,3.3,0.0


The weather dataset covers the period from 2015 to 2020. As only data for 2017 is needed, the data can be limited to this year.

In [48]:
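# Parse the timestamps; missing values become NaT
df_Weather['date_time'] = pd.to_datetime(df_Weather['date_time'])

# Keep only rows from 2017. Note: the strict ">" excludes 2017-01-01 00:00:00.
# .copy() gives an independent dataframe instead of a view on df_Weather.
df_Weather2 = df_Weather[(df_Weather['date_time'] > "2017-01-01") & (df_Weather['date_time'] < "2018-01-01")].copy()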
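The effect of the comparison operators in the filter above can be checked on a small example. The sketch below uses made-up rows and compares the strict filter with a half-open interval (`>=` start, `<` end), an alternative that would also keep the midnight row. It also shows that rows with a missing timestamp (`NaT`) pass neither filter.

```python
import numpy as np
import pandas as pd

# Toy weather rows (illustrative values), including one missing timestamp
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-12-31 23:00:00",
        "2017-01-01 00:00:00",
        "2017-01-01 01:00:00",
        None,
        "2018-01-01 00:00:00",
    ]),
    "precip": [0.0, 0.0, 1.0, np.nan, 0.0],
})

# Strict lower bound, as in the notebook: drops 2017-01-01 00:00:00
strict = toy[(toy["date_time"] > "2017-01-01") & (toy["date_time"] < "2018-01-01")]

# Half-open interval: keeps the first hour of the year
half_open = toy[(toy["date_time"] >= "2017-01-01") & (toy["date_time"] < "2018-01-01")].copy()

print(len(strict), len(half_open))
```

Comparisons against `NaT` evaluate to `False`, so the row without a timestamp is removed by either variant.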


In [49]:
df_Weather2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8689 entries, 17520 to 26302
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date_time  8689 non-null   datetime64[ns]
 1   max_temp   8689 non-null   float64       
 2   min_temp   8689 non-null   float64       
 3   precip     8689 non-null   float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 339.4 KB


To cover every hour of 2017, the weather data should contain 365 * 24 = 8760 rows. As the filtered dataframe has only 8689 entries, at least 71 hours are not covered. One of them is 2017-01-01 00:00:00, which the strict > comparison in the filter excludes.
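To see which hours are missing, the observed timestamps can be compared with a complete hourly range. The sketch below demonstrates this on toy timestamps; the helper name `missing_hours` is an assumption for illustration.

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)


def missing_hours(timestamps, start, end):
    """Return the hourly timestamps between start and end that are not observed."""
    expected = pd.date_range(start, end, freq=ONE_HOUR)
    # Drop NaT and repeated timestamps before comparing
    observed = pd.DatetimeIndex(timestamps).dropna().unique()
    return expected.difference(observed)


# A full, non-leap year has 365 * 24 hourly timestamps
full_year = pd.date_range("2017-01-01 00:00", "2017-12-31 23:00", freq=ONE_HOUR)
print(len(full_year))

# Toy observations (made up): 02:00, 03:00 and 05:00 are absent, 01:00 is repeated
toy = pd.to_datetime([
    "2017-01-01 00:00", "2017-01-01 01:00",
    "2017-01-01 01:00", "2017-01-01 04:00",
])
gaps = missing_hours(toy, "2017-01-01 00:00", "2017-01-01 05:00")
print(gaps)
```

In the notebook, a call such as `missing_hours(df_Weather2['date_time'], "2017-01-01 00:00", "2017-12-31 23:00")` would list the affected hours. Because repeated timestamps are collapsed first, the result can contain more than 71 entries.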

In [50]:
df_Weather2.isnull().values.sum()

0

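Fortunately, the filtered dataset no longer contains any NA values. Rows with a missing date_time do not pass the date filter, because comparisons against NaT evaluate to False.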

In [51]:
df_Weather2.head(5)

Unnamed: 0,date_time,max_temp,min_temp,precip
17520,2017-01-01 01:00:00,4.4,4.4,0.0
17521,2017-01-01 02:00:00,5.0,5.0,1.0
17522,2017-01-01 03:00:00,5.0,5.0,1.0
17523,2017-01-01 04:00:00,5.0,4.4,1.0
17524,2017-01-01 05:00:00,4.4,4.4,1.0


In [52]:
df_Weather2.tail(5)

Unnamed: 0,date_time,max_temp,min_temp,precip
26298,2017-12-31 19:00:00,-11.1,-11.1,0.0
26299,2017-12-31 20:00:00,-10.6,-10.6,0.0
26300,2017-12-31 21:00:00,-11.1,-11.1,0.0
26301,2017-12-31 22:00:00,-11.7,-11.7,0.0
26302,2017-12-31 23:00:00,-11.1,-11.1,0.0


In [57]:
df_Stations.head()

Unnamed: 0,Number,Name,Latitude,Longitude,District,Public,Total docks
0,K32015,1200 Beacon St,42.344149,-71.114674,Brookline,Yes,15
1,W32006,160 Arsenal,42.364664,-71.175694,Watertown,Yes,11
2,A32019,175 N Harvard St,42.363796,-71.129164,Boston,Yes,18
3,S32035,191 Beacon St,42.380323,-71.108786,Somerville,Yes,19
4,C32094,2 Hummingbird Lane at Olmsted Green,42.28887,-71.095003,Boston,Yes,17
