# Capstone Project
### DATA 606, Spring 2022, Dr. Chaojie Wang
### David Fahnestock

#### Description: 
This notebook performs data cleansing and exploratory data analysis (EDA) on BikeShare trip data for three cities, using summaries and charts to build an understanding of the data.

#### Dataset Sources:
<ul><li>New York City: <a href='https://ride.citibikenyc.com/system-data'>https://ride.citibikenyc.com/system-data</a> </li>
    <li>Chicago: <a href='https://ride.divvybikes.com/system-data'>https://ride.divvybikes.com/system-data</a> </li>
    <li>San Francisco: <a href='https://www.lyft.com/bikes/bay-wheels/system-data'>https://www.lyft.com/bikes/bay-wheels/system-data</a> </li>
</ul>

## Function Definitions & Setup
This section covers the imports and helper module used throughout the notebook.

In [11]:
import pandas as pd
pd.__version__

'1.3.4'

In [12]:
from matplotlib import pyplot as plt

In [13]:
import numpy as np

In [14]:
import glob

In [15]:
import os
from os.path import join, isdir
from os import mkdir, path

In [16]:
import datetime

In [17]:
# Import my helper py file to help with importing the data
import dfimporthelpers as imp
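
The `dfimporthelpers` module is not included in this notebook. As a rough sketch only, a helper like `load_csvs_to_df` could read every CSV in a directory and stack the results. The body below is an assumption about its behavior, not the author's actual implementation, and the name `load_csvs_to_df_sketch` is hypothetical.

```python
import glob
import os

import pandas as pd


def load_csvs_to_df_sketch(src_dir, col_names=None):
    """Hypothetical stand-in for imp.load_csvs_to_df (the real helper is not shown)."""
    frames = []
    # Sort so files are read in a repeatable order
    for csv_path in sorted(glob.glob(os.path.join(src_dir, '*.csv'))):
        if col_names is None:
            # Use the header row found in each file
            frames.append(pd.read_csv(csv_path))
        else:
            # Replace each file's header row with the supplied names
            frames.append(pd.read_csv(csv_path, header=0, names=col_names))
    if not frames:
        raise FileNotFoundError(f'No CSV files found in {src_dir}')
    # Stack all files into one frame with a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Passing `header=0` together with `names` tells pandas to discard the header line in each file and apply the supplied names instead, which is one way to handle files whose headers differ while the columns are the same.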

### Input Parameters
#### Set input parameter values that will be used for the analysis


In [18]:
# Used for chart labels
p_city_name = 'Chicago'

In [19]:
# Source directory of csv data files to analyze
p_src_directory = 'ChicagoData'

In [20]:
# Column names to use when importing the data, since some files use different names
# even though the columns are the same.  Set to None to use the column names from the
# header in each file.
p_column_names = None
#p_column_names = ['ride_id','started_at','ended_at','bikeid','tripduration','start_station_id',
#                'start_station_name','end_station_id','end_station_name','usertype','gender','birthyear']

<br><br>

# Load and Clean Bike Share data for each city

In [21]:
dir_path = '/Users/DF/Library/CloudStorage/OneDrive-Personal/Documents/' + \
           'Grad School-David’s MacBook Pro/Spring 2022 - Capstone/JupyterNB/data/'

## Load Chicago data
#### First load the newer data, which uses a different field layout than the older data. This was determined through initial cleaning and then streamlined in this consolidated notebook.

In [22]:
# Load Chicago data that uses the latest field layout
full_path = dir_path + 'ChicagoData'
df_chicago = imp.load_csvs_to_df(src_dir=full_path, col_names=None)

In [23]:
df_chicago.head(5)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,BD0A6FF6FFF9B921,electric_bike,2020-11-01 13:36:00,2020-11-01 13:45:40,Dearborn St & Erie St,110.0,St. Clair St & Erie St,211.0,41.894177,-87.629127,41.894434,-87.623379,casual
1,96A7A7A4BDE4F82D,electric_bike,2020-11-01 10:03:26,2020-11-01 10:14:45,Franklin St & Illinois St,672.0,Noble St & Milwaukee Ave,29.0,41.890959,-87.635343,41.900675,-87.66248,casual
2,C61526D06582BDC5,electric_bike,2020-11-01 00:34:05,2020-11-01 01:03:06,Lake Shore Dr & Monroe St,76.0,Federal St & Polk St,41.0,41.880983,-87.616754,41.872054,-87.62955,casual
3,E533E89C32080B9E,electric_bike,2020-11-01 00:45:16,2020-11-01 00:54:31,Leavitt St & Chicago Ave,659.0,Stave St & Armitage Ave,185.0,41.895499,-87.682013,41.917744,-87.691392,casual
4,1C9F4EF18C168C60,electric_bike,2020-11-01 15:43:25,2020-11-01 16:16:52,Buckingham Fountain,2.0,Buckingham Fountain,2.0,41.876497,-87.620358,41.876448,-87.620338,casual


#### Now load the older data

In [24]:
# Load the older data that uses a different layout
full_path = dir_path + 'ChicagoData/Older'

column_names = ['ride_id','started_at','ended_at','bikeid','tripduration','start_station_id',
                'start_station_name','end_station_id','end_station_name','member_casual','gender','birthyear']

df_temp = imp.load_csvs_to_df(src_dir=full_path, col_names=column_names)

In [25]:
df_temp.head(5)

Unnamed: 0,ride_id,started_at,ended_at,bikeid,tripduration,start_station_id,start_station_name,end_station_id,end_station_name,member_casual,gender,birthyear
0,17536702,2018-01-01 00:12:00,2018-01-01 00:17:23,3304,323.0,69,Damen Ave & Pierce Ave,159,Claremont Ave & Hirsch St,Subscriber,Male,1988.0
1,17536703,2018-01-01 00:41:35,2018-01-01 00:47:52,5367,377.0,253,Winthrop Ave & Lawrence Ave,325,Clark St & Winnemac Ave (Temp),Subscriber,Male,1984.0
2,17536704,2018-01-01 00:44:46,2018-01-01 01:33:10,4599,2904.0,98,LaSalle St & Washington St,509,Troy St & North Ave,Subscriber,Male,1989.0
3,17536705,2018-01-01 00:53:10,2018-01-01 01:05:37,2302,747.0,125,Rush St & Hubbard St,364,Larrabee St & Oak St,Subscriber,Male,1983.0
4,17536706,2018-01-01 00:53:37,2018-01-01 00:56:40,3696,183.0,129,Blue Island Ave & 18th St,205,Paulina St & 18th St,Subscriber,Male,1989.0


In [26]:
# View data
df_chicago.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

In [27]:
df_temp.dtypes

ride_id                 int64
started_at             object
ended_at               object
bikeid                  int64
tripduration           object
start_station_id        int64
start_station_name     object
end_station_id          int64
end_station_name       object
member_casual          object
gender                 object
birthyear             float64
dtype: object

#### Combine the two Chicago dataframes then we'll update field types further

#### Get row counts of each before we combine them

In [28]:
df_temp.shape

(11250100, 12)

In [29]:
df_chicago.shape

(9136746, 13)

In [30]:
df_chicago = pd.concat([df_temp, df_chicago])

In [31]:
# How many rows and columns
print('rows:', df_chicago.shape[0])
print('columns:', df_chicago.shape[1])

rows: 20386846
columns: 17
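
The 17 columns follow from how `pd.concat` aligns frames: it keeps the union of column names and fills the cells a frame lacks with NaN. The older layout has 12 columns, the newer one has 13, and 8 names are shared, so 12 + 13 - 8 = 17. A toy illustration with made-up rows (the frame names `older` and `newer` are for illustration only):

```python
import pandas as pd

# Toy frames (made-up values) that share only the ride_id column
older = pd.DataFrame({'ride_id': [1, 2], 'bikeid': [3304, 5367]})
newer = pd.DataFrame({'ride_id': ['A1'], 'rideable_type': ['electric_bike']})

combined = pd.concat([older, newer])

print(combined.shape)           # rows are stacked, columns are the union of both frames
print(combined.dtypes)          # bikeid becomes float64 because of the NaN fill
print(combined.index.tolist())  # original row labels are kept, so they repeat
```

The same NaN fill explains why `bikeid` changes from `int64` to `float64` after the two Chicago frames are combined.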


#### Add a field to indicate this is from Chicago.  This will be useful when we combine all the cities together.

In [32]:
df_chicago['city'] = 'Chicago'

In [33]:
# View the column types after combining
df_chicago.dtypes

ride_id                object
started_at             object
ended_at               object
bikeid                float64
tripduration           object
start_station_id       object
start_station_name     object
end_station_id         object
end_station_name       object
member_casual          object
gender                 object
birthyear             float64
rideable_type          object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
city                   object
dtype: object

## Load San Francisco data
#### First load the newer data, which uses a different field layout than the older data. This was determined through initial cleaning and then streamlined in this consolidated notebook.

In [34]:
# Load San Francisco data that uses the latest field layout
full_path = dir_path + 'SanFranciscoData'
df_sanfrancisco = imp.load_csvs_to_df(src_dir=full_path, col_names=None)

In [35]:
df_sanfrancisco.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

#### Now load the older data

In [36]:
# Load the older data that uses a different layout
full_path = dir_path + 'SanFranciscoData/Older'

column_names = ['duration','started_at','ended_at','start_station_id',
                'start_station_name','start_lat','start_lng','end_station_id','end_station_name',
                'end_lat','end_lng','bike_id','member_casual','rental_access_method']

df_temp = imp.load_csvs_to_df(src_dir=full_path, col_names=column_names)

In [37]:
df_temp.dtypes

duration                 object
started_at               object
ended_at                 object
start_station_id         object
start_station_name       object
start_lat               float64
start_lng               float64
end_station_id           object
end_station_name         object
end_lat                 float64
end_lng                 float64
bike_id                  object
member_casual            object
rental_access_method     object
dtype: object

In [38]:
# View data
df_sanfrancisco.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

#### Combine the two dataframes then we'll update field types further

#### Get row counts of each before we combine them

In [39]:
df_temp.shape

(5795411, 14)

In [40]:
df_sanfrancisco.shape

(3275783, 13)

In [41]:
df_sanfrancisco = pd.concat([df_temp, df_sanfrancisco])

In [42]:
# How many rows and columns
print('rows:', df_sanfrancisco.shape[0])
print('columns:', df_sanfrancisco.shape[1])

rows: 9071194
columns: 16


#### Add a field to indicate the city.  This will be useful when we combine all the cities together.

In [43]:
df_sanfrancisco['city'] = 'San Francisco'

In [44]:
df_sanfrancisco.dtypes

duration                 object
started_at               object
ended_at                 object
start_station_id         object
start_station_name       object
start_lat               float64
start_lng               float64
end_station_id           object
end_station_name         object
end_lat                 float64
end_lng                 float64
bike_id                  object
member_casual            object
rental_access_method     object
ride_id                  object
rideable_type            object
city                     object
dtype: object

In [45]:
# View the combined df
df_sanfrancisco.head()

Unnamed: 0,duration,started_at,ended_at,start_station_id,start_station_name,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,bike_id,member_casual,rental_access_method,ride_id,rideable_type,city
0,598,2018-02-28 23:59:47.0970,2018-03-01 00:09:45.1870,284,Yerba Buena Center for the Arts (Howard St at ...,37.784872,-122.400876,114,Rhode Island St at 17th St,37.764478,-122.40257,1035,Subscriber,No,,,San Francisco
1,943,2018-02-28 23:21:16.4950,2018-02-28 23:36:59.9740,6,The Embarcadero at Sansome St,37.80477,-122.403234,324,Union Square (Powell St at Post St),37.7883,-122.408531,1673,Customer,No,,,San Francisco
2,18587,2018-02-28 18:20:55.1900,2018-02-28 23:30:42.9250,93,4th St at Mission Bay Blvd S,37.770407,-122.391198,15,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,3498,Customer,No,,,San Francisco
3,18558,2018-02-28 18:20:53.6210,2018-02-28 23:30:12.4500,93,4th St at Mission Bay Blvd S,37.770407,-122.391198,15,San Francisco Ferry Building (Harry Bridges Pl...,37.795392,-122.394203,3129,Customer,No,,,San Francisco
4,885,2018-02-28 23:15:12.8580,2018-02-28 23:29:58.6080,308,San Pedro Square,37.336802,-121.89409,297,Locust St at Grant St,37.32298,-121.887931,1839,Subscriber,Yes,,,San Francisco


## Load New York City data
#### First load the newer data, which uses a different field layout than the older data. This was determined through initial cleaning and then streamlined in this consolidated notebook.

In [None]:
# Load data that uses the latest field layout
full_path = dir_path + 'NYCData'
df_nyc = imp.load_csvs_to_df(src_dir=full_path, col_names=None)

In [None]:
df_nyc.head(5)

#### Now load the older data

In [None]:
# Load the older data that uses a different layout
full_path = dir_path + 'NYCData/Older'

column_names = ['duration','started_at','ended_at','start_station_id',
                'start_station_name','start_lat','start_lng','end_station_id','end_station_name',
                'end_lat','end_lng','bike_id','member_casual','birthyear','gender']

df_temp = imp.load_csvs_to_df(src_dir=full_path, col_names=column_names)

In [None]:
df_temp.head(5)

In [None]:
# View data
df_nyc.dtypes

In [None]:
df_temp.dtypes

#### Combine the two dataframes then we'll update field types further

#### Get row counts of each before we combine them

In [None]:
df_temp.shape

In [None]:
df_nyc.shape

In [None]:
df_nyc = pd.concat([df_temp, df_nyc])

In [None]:
# How many rows and columns
print('rows:', df_nyc.shape[0])
print('columns:', df_nyc.shape[1])

#### Add a field to indicate this is from New York City.  This will be useful when we combine all the cities together.

In [None]:
df_nyc['city'] = 'NYC'

In [None]:
# View the column types after combining
df_nyc.dtypes

### Finally, combine all City Dataframes into One

In [46]:
df_all = pd.concat([df_chicago,df_sanfrancisco,df_nyc])

In [47]:
# How many rows and columns do we have
df_all.shape

(29458040, 21)

<br>

## Clean and prepare the data

#### Look at the data field types

In [None]:
df_all.dtypes

#### Remove columns we don't currently need

In [None]:
df_all.drop(['rental_access_method','duration','bike_id','bikeid','tripduration'], axis=1, inplace=True)

In [None]:
df_all.drop(['gender','birthyear','start_lat','start_lng','end_lat','end_lng'], axis=1, inplace=True)

In [None]:
df_all.head()

In [None]:
# Convert date/times
df_all['started_at'] = pd.to_datetime(df_all['started_at'])
df_all['ended_at']   = pd.to_datetime(df_all['ended_at'])

#### Create calculated field with trip durations

In [None]:
df_all['duration_minutes'] = (df_all['ended_at'] - df_all['started_at']) / pd.Timedelta(minutes=1)
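
Dividing one timedelta by another yields a plain float, which is what turns the difference between the two timestamps into minutes. A small check using the first Chicago row shown earlier (the frame name `trips` is for illustration only):

```python
import pandas as pd

# First Chicago row displayed earlier in this notebook
trips = pd.DataFrame({
    'started_at': ['2020-11-01 13:36:00'],
    'ended_at':   ['2020-11-01 13:45:40'],
})
trips['started_at'] = pd.to_datetime(trips['started_at'])
trips['ended_at'] = pd.to_datetime(trips['ended_at'])

# Timedelta / Timedelta gives a float count of minutes
trips['duration_minutes'] = (trips['ended_at'] - trips['started_at']) / pd.Timedelta(minutes=1)
print(trips['duration_minutes'].iloc[0])  # 9 minutes 40 seconds, expressed in minutes
```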

In [None]:
df_all.head()

In [None]:
df_all.dtypes

#### Check for null values

In [None]:
df_all.isnull().sum()

#### Based on the above, we have nulls in various fields

#### Let's add separate columns for Month#, Year, Hour for further analysis

In [None]:
df_all['year']       = pd.DatetimeIndex(df_all['started_at']).year
df_all['month']      = pd.DatetimeIndex(df_all['started_at']).month
df_all['year-month'] = df_all['started_at'].dt.strftime('%Y-%m')
df_all['hour_of_day'] = pd.DatetimeIndex(df_all['started_at']).hour
df_all['start_date'] = df_all['started_at'].dt.strftime('%Y-%m-%d')
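
The same date parts can also be derived with the `.dt` accessor throughout, which avoids building a `DatetimeIndex` for each column. This is offered as an equivalent alternative, shown on two timestamps taken from the rows displayed earlier (the frame name `trips` is for illustration only):

```python
import pandas as pd

trips = pd.DataFrame({
    'started_at': pd.to_datetime(['2018-01-01 00:12:00', '2020-11-01 13:36:00'])
})

started = trips['started_at']
trips['year'] = started.dt.year
trips['month'] = started.dt.month
trips['year-month'] = started.dt.strftime('%Y-%m')
trips['hour_of_day'] = started.dt.hour
trips['start_date'] = started.dt.strftime('%Y-%m-%d')

print(trips)
```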

<br />

## Create cleaned version with only the attributes we will use for ML

In [None]:
# Several fields contain many nulls, so we discard those from our analysis
# and keep only the fields listed below
df_clean = df_all[['started_at','ended_at','member_casual','city','duration_minutes', \
                   'year','month','year-month','hour_of_day','start_date']].copy()

In [None]:
# Recheck to be sure we have addressed the nulls
df_clean.isnull().sum()

In [None]:
# The remaining nulls are minimal.  Therefore, we will remove those rows for analysis purposes
df_clean = df_clean.dropna()

In [None]:
df_clean.head()

In [None]:
df_clean.dtypes

### Group by city and date 

In [None]:
# Group the results by city and date
df_grouped = df_clean.groupby(['city','start_date','year','month','year-month']).agg( \
                              {'started_at':'count','duration_minutes':'sum'}).copy().reset_index()

In [None]:
# Rename count to be "#trips"
df_grouped.rename(columns={'started_at':'#trips'}, inplace=True)
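
The count and rename steps can also be folded into one call with pandas named aggregation. This is a sketch on made-up rows rather than the notebook's data. Because `#trips` is not a valid Python identifier, it is passed through a dictionary; the frame names `trips` and `daily` are for illustration only.

```python
import pandas as pd

# Made-up rows in the same shape as df_clean (only the columns needed here)
trips = pd.DataFrame({
    'city': ['Chicago', 'Chicago', 'San Francisco'],
    'start_date': ['2020-11-01', '2020-11-01', '2018-02-28'],
    'started_at': pd.to_datetime(['2020-11-01 13:36:00',
                                  '2020-11-01 10:03:26',
                                  '2018-02-28 23:59:47']),
    'duration_minutes': [10.0, 11.0, 9.0],
})

# Each output column is defined as (input column, aggregation)
daily = (trips.groupby(['city', 'start_date'])
              .agg(**{'#trips': ('started_at', 'count'),
                      'duration_minutes': ('duration_minutes', 'sum')})
              .reset_index())

print(daily)
```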

In [None]:
df_grouped.head(10)


### Save the final combined data to a CSV so we don't have to repeat the time-consuming steps above when developing and testing

In [None]:
dir_path = '/Users/DF/Library/CloudStorage/OneDrive-Personal/Documents/' + \
           'Grad School-David’s MacBook Pro/Spring 2022 - Capstone/JupyterNB/'

os.chdir(dir_path)

In [None]:
df_grouped.to_csv('nyc_chicago_sf_tripsbydate_jan19tojan22.csv', index=False)

<br /> 

### Start Here to use the saved copy of the above
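
The notebook does not show a cell for this step. As a hedged sketch, the daily summary saved above could be reloaded along these lines. `load_saved_trips` is a hypothetical helper name, and its default path simply reuses the file name passed to `to_csv` above.

```python
import pandas as pd


def load_saved_trips(csv_path='nyc_chicago_sf_tripsbydate_jan19tojan22.csv'):
    """Reload the per-city daily summary written by to_csv above (hypothetical helper)."""
    # Keep 'year-month' and 'start_date' as strings, matching how they were created
    return pd.read_csv(csv_path, dtype={'year-month': str, 'start_date': str})
```

This restores only the grouped daily summary, not `df_all`.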

<br /><br />

# Summarize Data

### Summarize by city

In [None]:
# Group the results by city
df_grouped = df_all.groupby(['city']).agg({'ride_id': 'count'}).copy().reset_index()

# Rename the count column to "#trips"
df_grouped.rename(columns={'ride_id':'#trips'}, inplace=True)

df_grouped.sort_values('#trips',inplace=True)

In [None]:
df_grouped.head()

In [None]:
heading = 'Total Trips by City (January 2019 - January 2022)'

# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
ax = df_grouped.plot.barh(stacked=False, figsize=(10,6), x='city', title=heading)
ax.set_xlabel('Number of Trips')
    
# Display to screen
plt.show()

### Summarize by year and city

In [None]:
# Group the results by year
df_grouped = df_all.groupby(['year','city']).agg({'ride_id': 'count'}).copy().reset_index()

# Rename the count column to "#trips"
df_grouped.rename(columns={'ride_id':'#trips'}, inplace=True)

In [None]:
df_grouped.head(12)

In [None]:
# Pivot to prepare to plot the years and cities
df_pivot = df_grouped.pivot(index='year', columns='city', values='#trips')
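
`pivot` reshapes the long, one-row-per-(year, city) table into a wide table with one row per year and one column per city, which is the layout `DataFrame.plot` expects for grouped or stacked bars. A toy illustration with made-up counts (the names `long_df` and `wide_df` are for illustration only):

```python
import pandas as pd

# Made-up counts in the same long layout as df_grouped
long_df = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'city': ['Chicago', 'NYC', 'Chicago', 'NYC'],
    '#trips': [10, 20, 30, 40],
})

# One row per year, one column per city
wide_df = long_df.pivot(index='year', columns='city', values='#trips')
print(wide_df)
```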

In [None]:
heading = 'Total Trips by Year (thru January 2022)'

# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
ax = df_pivot.plot.bar(stacked=True, figsize=(10,6), title=heading)
ax.set_ylabel('Number of Trips')
    
# Display to screen
plt.show()

<br />

### Summarize by month to see the trips per month over time

In [None]:
# Group the results by month
df_grouped = df_all.groupby(['year-month', 'city']).agg({'ride_id': 'count'}).copy().reset_index()

# Rename the count column to "#trips"
df_grouped.rename(columns={'ride_id':'#trips'}, inplace=True)

In [None]:
df_grouped[df_grouped['city'] == 'NYC'].head(100)

In [None]:
# Pivot to prepare to plot the months and cities
df_pivot = df_grouped.pivot(index='year-month', columns='city', values='#trips')

In [None]:
heading = 'Total Trips by Month'

# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
ax = df_pivot.plot.line(stacked=False, figsize=(10,6), title=heading)
ax.set_ylabel('Number of Trips')
    
# Display to screen
plt.show()

In [None]:
# Save to csv for reuse
df_grouped.to_csv('nyc_chicago_sf_tripsbymonth_jan19tojan22.csv', index=False)

<br />

### Let's now group by start station
#### We'll get the trip count per start station. Plotting on a map would also need one lat/long per station, but those columns were dropped during cleaning above.

In [None]:
grp = df_all.groupby(['start_station_name'])

df_bystation = grp.agg({'ride_id': 'count'}).copy().reset_index()

# Rename the count column to "#trips"
df_bystation.rename(columns={'ride_id':'#trips'}, inplace=True)
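
If the coordinate columns were kept rather than dropped during cleaning, the count and one lat/long per station could be gathered in a single aggregation. The sketch below uses illustrative rows and assumes `start_lat` and `start_lng` are still present. Taking the `first` value per station is an assumption, not the author's stated choice, and the frame names are for illustration only.

```python
import pandas as pd

# Illustrative rows; assumes the lat/lng columns have not been dropped
trips = pd.DataFrame({
    'ride_id': ['a', 'b', 'c'],
    'start_station_name': ['Buckingham Fountain', 'Buckingham Fountain', 'San Pedro Square'],
    'start_lat': [41.876497, 41.876448, 37.336802],
    'start_lng': [-87.620358, -87.620338, -121.89409],
})

# Count trips and keep the first coordinate pair seen for each station
by_station = (trips.groupby('start_station_name')
                   .agg(**{'#trips': ('ride_id', 'count'),
                           'lat': ('start_lat', 'first'),
                           'lng': ('start_lng', 'first')})
                   .reset_index())

print(by_station)
```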

In [None]:
df_bystation.head(10)

#### How many stations are there?

In [None]:
df_bystation.shape

#### What are the top stations?

In [None]:
df_bystation.nlargest(20,'#trips')

### Summarize by Rideable Type

In [None]:
df_grouped = df_all.groupby(['rideable_type','city']).agg({'ride_id': 'count'}).copy().reset_index()

df_grouped.rename(columns={'ride_id':'#trips'}, inplace=True)

In [None]:
df_grouped.head(100)

In [None]:
# Pivot to prepare to plot the bike types and cities
df_pivot = df_grouped.pivot(index='rideable_type', columns='city', values='#trips')

In [None]:
heading = 'Total Trips by Bike Type'

# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
ax = df_pivot.plot.bar(stacked=False, figsize=(10,6), title=heading)
ax.set_ylabel('Number of Trips')
    
# Display to screen
plt.show()

### Summarize by Member Type

#### First map older designations to the latest for consistency

In [None]:
df_all.loc[df_all['member_casual'] == 'Customer', 'member_casual'] = 'casual'
df_all.loc[df_all['member_casual'] == 'Subscriber', 'member_casual'] = 'member'
df_all.loc[df_all['member_casual'] == 'Yes', 'member_casual'] = 'member'
df_all.loc[df_all['member_casual'] == 'No', 'member_casual'] = 'casual'
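
The four assignments above amount to a lookup table, so the same normalization can be written as one `Series.replace` call with a dictionary. A sketch on a small made-up series (the names `labels`, `mapping` and `normalized` are for illustration only):

```python
import pandas as pd

# Made-up sample covering the old and new designations
labels = pd.Series(['Customer', 'Subscriber', 'Yes', 'No', 'member', 'casual'])

# Old designation -> latest designation
mapping = {
    'Customer': 'casual',
    'Subscriber': 'member',
    'Yes': 'member',
    'No': 'casual',
}

# Values not in the mapping (already 'member' or 'casual') are left unchanged
normalized = labels.replace(mapping)
print(normalized.tolist())
```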

In [None]:
df_grouped = df_all.groupby(['city','member_casual']).agg({'ride_id': 'count'}).copy().reset_index()

df_grouped.rename(columns={'ride_id':'#trips'}, inplace=True)

In [None]:
df_grouped.head()

In [None]:
# Pivot to prepare to plot the cities and member types
df_pivot = df_grouped.pivot(index='city', columns='member_casual', values='#trips')

In [None]:
df_pivot.head()

In [None]:
heading = 'Total Trips by Member Type'

# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
ax = df_pivot.plot.bar(stacked=False, figsize=(10,6), title=heading)
ax.set_ylabel('Number of Trips')

plt.show()