# Cyclistic Trip Data Analysis

Collecting the last 12 months' worth of Cyclistic trip data.<br>
Data was downloaded as zip files from [Divvy](https://divvy-tripdata.s3.amazonaws.com/index.html) under this [license](https://www.divvybikes.com/data-license-agreement).<br>
(_The data was made available by Motivate International Inc._)

In [1]:
# Importing required libraries
import os # for interacting with the operating system
import pandas as pd # for working with tabular data

In [2]:
# Build the path to the folder holding the csv files (the trailing "" keeps the final separator)
dir_path = os.path.join(os.getcwd(), "Original_data_20221127", "")

In [3]:
# Get a list of all the trip data csv files
tripdata_files = os.listdir(dir_path)
tripdata_files

['202111-divvy-tripdata.csv',
 '202112-divvy-tripdata.csv',
 '202201-divvy-tripdata.csv',
 '202202-divvy-tripdata.csv',
 '202203-divvy-tripdata.csv',
 '202204-divvy-tripdata.csv',
 '202205-divvy-tripdata.csv',
 '202206-divvy-tripdata.csv',
 '202207-divvy-tripdata.csv',
 '202208-divvy-tripdata.csv',
 '202209-divvy-publictripdata.csv',
 '202210-divvy-tripdata.csv']
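`os.listdir` returns every entry in the folder, in arbitrary order, so a stray non-csv file would end up in the list. A minimal sketch of a stricter listing, assuming the `YYYYMM-...csv` naming seen above; `list_trip_files` is a hypothetical helper that is not part of the original notebook, and the demo folder is a throwaway one:

```python
import os
import tempfile


def list_trip_files(folder: str) -> list:
    """Return the csv file names in `folder`, sorted so the months run in order."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".csv")
    )


# Demo on a temporary folder; the file names mirror the listing above.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["202112-divvy-tripdata.csv", "202111-divvy-tripdata.csv", "notes.txt"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_files = list_trip_files(demo_dir)

print(demo_files)
```

Sorting by name works here only because every file starts with its year and month.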

## Check for the correct number of files

In [4]:
# Check the length of the list; there should be 12 files, one per month of trip data
len(tripdata_files)

12
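A count of 12 does not by itself show that the files cover 12 consecutive months. The sketch below parses the `YYYYMM` prefix of each file name and checks that no month is skipped; `month_prefixes` and `is_consecutive` are hypothetical helpers, and the check assumes every file name starts with its year and month:

```python
def month_prefixes(files):
    """Extract the leading YYYYMM part of each file name as (year, month)."""
    return [(int(name[:4]), int(name[4:6])) for name in files]


def is_consecutive(months):
    """True when each (year, month) is exactly one month after the previous one."""
    as_index = [year * 12 + (month - 1) for year, month in months]
    return all(b - a == 1 for a, b in zip(as_index, as_index[1:]))


# File names copied from the listing above.
files = [
    "202111-divvy-tripdata.csv", "202112-divvy-tripdata.csv",
    "202201-divvy-tripdata.csv", "202202-divvy-tripdata.csv",
    "202203-divvy-tripdata.csv", "202204-divvy-tripdata.csv",
    "202205-divvy-tripdata.csv", "202206-divvy-tripdata.csv",
    "202207-divvy-tripdata.csv", "202208-divvy-tripdata.csv",
    "202209-divvy-publictripdata.csv", "202210-divvy-tripdata.csv",
]

months = month_prefixes(sorted(files))
print(len(months) == 12 and is_consecutive(months))
```

Note that the September 2022 file has a slightly different name (`publictripdata`), which is why the check relies on the prefix only.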

### Inspect each file

In [5]:
file_path = dir_path + tripdata_files[0]
trip_data = pd.read_csv(file_path)
trip_data.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,7C00A93E10556E47,electric_bike,2021-11-27 13:27:38,2021-11-27 13:46:38,,,,,41.93,-87.72,41.96,-87.73,casual
1,90854840DFD508BA,electric_bike,2021-11-27 13:38:25,2021-11-27 13:56:10,,,,,41.96,-87.7,41.92,-87.7,casual
2,0A7D10CDD144061C,electric_bike,2021-11-26 22:03:34,2021-11-26 22:05:56,,,,,41.96,-87.7,41.96,-87.7,casual
3,2F3BE33085BCFF02,electric_bike,2021-11-27 09:56:49,2021-11-27 10:01:50,,,,,41.94,-87.79,41.93,-87.79,casual
4,D67B4781A19928D4,electric_bike,2021-11-26 19:09:28,2021-11-26 19:30:41,,,,,41.9,-87.63,41.88,-87.62,casual


In [6]:
trip_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359978 entries, 0 to 359977
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             359978 non-null  object 
 1   rideable_type       359978 non-null  object 
 2   started_at          359978 non-null  object 
 3   ended_at            359978 non-null  object 
 4   start_station_name  284688 non-null  object 
 5   start_station_id    284688 non-null  object 
 6   end_station_name    280791 non-null  object 
 7   end_station_id      280791 non-null  object 
 8   start_lat           359978 non-null  float64
 9   start_lng           359978 non-null  float64
 10  end_lat             359787 non-null  float64
 11  end_lng             359787 non-null  float64
 12  member_casual       359978 non-null  object 
dtypes: float64(4), object(9)
memory usage: 35.7+ MB


### Dataset has null values for:
- 'start_station_name'
- 'start_station_id'
- 'end_station_name'
- 'end_station_id'
- 'end_lat'
- 'end_lng'
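To judge how much data is missing, the non-null counts can be turned into missing counts and percentages. The sketch below works from the numbers printed by `info()` above; on the real frame, `trip_data.isna().sum()` should give the same missing counts directly:

```python
import pandas as pd

total_rows = 359978  # number of entries reported by info()

# Non-null counts copied from the info() output above.
non_null = pd.Series({
    "start_station_name": 284688,
    "start_station_id": 284688,
    "end_station_name": 280791,
    "end_station_id": 280791,
    "end_lat": 359787,
    "end_lng": 359787,
})

missing = pd.DataFrame({
    "missing_rows": total_rows - non_null,
    "missing_pct": ((total_rows - non_null) / total_rows * 100).round(2),
})
print(missing)
```

Roughly a fifth of the rides have no start or end station, while only 191 rows lack end coordinates.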

Can we get the missing data?<br>
Can we use the 'start_lat' and 'start_lng' to find the 'start_station_name' and 'start_station_id'?<br>
Is the data even needed for my analysis?<br>
<br>
*Note: 'started_at' and 'ended_at' need to be split and converted to date and time data types respectively.*
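One possible way to approach the coordinate question is to build a lookup from rows where the station is known and apply it to rows where it is missing. This is only a sketch on made-up toy rows, not a result from the real data. It assumes the rows without a station carry coordinates rounded to two decimals (as in the `head()` output), so it matches on that coarse grid and skips any grid cell shared by more than one station:

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as the trip data.
trips = pd.DataFrame({
    "start_station_name": ["Station A", "Station A", "Station B",
                           "Station C", "Station D", None, None],
    "start_lat": [41.9312, 41.9308, 41.8801, 41.9005, 41.9002, 41.93, 41.90],
    "start_lng": [-87.7204, -87.7196, -87.6299, -87.6301, -87.6304, -87.72, -87.63],
})

# Integer grid keys (hundredths of a degree) avoid comparing floats directly.
trips["lat_key"] = (trips["start_lat"] * 100).round().astype(int)
trips["lng_key"] = (trips["start_lng"] * 100).round().astype(int)

# For each grid cell, count distinct station names among rows that have one.
known = trips.dropna(subset=["start_station_name"])
cells = known.groupby(["lat_key", "lng_key"])["start_station_name"].agg(["nunique", "first"])

# Keep only the cells that point to exactly one station.
lookup = (
    cells.loc[cells["nunique"] == 1, "first"]
    .rename("start_station_guess")
    .reset_index()
)

# Attach the guess, preferring the recorded name when it exists.
trips = trips.merge(lookup, on=["lat_key", "lng_key"], how="left")
trips["start_station_guess"] = trips["start_station_name"].fillna(trips["start_station_guess"])
print(trips[["start_station_name", "start_station_guess"]])
```

In the toy data, the cell shared by Station C and Station D stays unresolved, which is the likely outcome for dense areas in the real data as well.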

### Check summary statistics

In [7]:
trip_data.describe(include="all")

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
count,359978,359978,359978,359978,284688,284688,280791,280791,359978.0,359978.0,359787.0,359787.0,359978
unique,359978,3,320477,320071,815,815,805,805,,,,,2
top,7C00A93E10556E47,electric_bike,2021-11-07 15:52:47,2021-11-16 09:56:18,Ellis Ave & 60th St,KA1503000014,Ellis Ave & 60th St,KA1503000014,,,,,member
freq,1,198325,6,14,3095,3095,3002,3002,,,,,253049
mean,,,,,,,,,41.89377,-87.647342,41.8939,-87.647543,
std,,,,,,,,,0.052579,0.033397,0.052618,0.033608,
min,,,,,,,,,41.648501,-87.84,41.39,-88.97,
25%,,,,,,,,,41.877726,-87.663832,41.87785,-87.663923,
50%,,,,,,,,,41.894666,-87.643107,41.894683,-87.643118,
75%,,,,,,,,,41.925602,-87.627722,41.925602,-87.627754,


In [8]:
trip_data["rideable_type"].value_counts(normalize=True)

electric_bike    0.550936
classic_bike     0.427912
docked_bike      0.021151
Name: rideable_type, dtype: float64

In [9]:
trip_data["member_casual"].value_counts(normalize=True)

member    0.702957
casual    0.297043
Name: member_casual, dtype: float64

Quick observations:
- There are 3 types of bikes available: electric, classic, and docked
- Electric bikes account for about 55% of rides in this file
- Members account for about 70% of rides (this counts rides, not unique customers)

In [10]:
trip_data.groupby(["rideable_type", "member_casual"])["member_casual"].count().unstack()

rideable_type,casual,member
classic_bike,31866.0,122173.0
docked_bike,7614.0,NaN
electric_bike,67449.0,130876.0


No member rides were recorded on docked bikes in this file.<br>
Members use electric bikes slightly more often than classic bikes (130,876 vs. 122,173 rides), but not by a significant margin.<br>
Casual riders use electric bikes for more than 50% of their rides.<br>
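These shares can be checked by normalizing the counts within each customer type. The sketch below uses the ride counts from the table above, with the empty member/docked cell treated as 0 (an assumption). On the full frame, `pd.crosstab(trip_data["rideable_type"], trip_data["member_casual"], normalize="columns")` should give the same shares:

```python
import pandas as pd

# Ride counts copied from the table above (rows: bike type, columns: customer type).
# The missing member/docked_bike value is treated as 0 rides.
counts = pd.DataFrame(
    {"casual": [31866, 7614, 67449], "member": [122173, 0, 130876]},
    index=["classic_bike", "docked_bike", "electric_bike"],
)

# Share of each bike type within a customer type (each column sums to 1).
shares = counts / counts.sum(axis=0)
print(shares.round(3))
```

Electric bikes come out at roughly 63% of casual rides and roughly 52% of member rides.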

In [11]:
# Transform the 'started_at' and 'ended_at' to datetime objects
trip_data["started_at"] = pd.to_datetime(trip_data["started_at"])
trip_data["ended_at"] = pd.to_datetime(trip_data["ended_at"])

In [12]:
# Calculate the length of each ride/trip
trip_data["ride_length"] = trip_data["ended_at"] - trip_data["started_at"]
trip_data["ride_length"] = round(trip_data.ride_length/pd.Timedelta("60s"), 2)
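As a worked check of the ride length calculation, the sketch below recomputes the duration in minutes for the first three rides shown in the `head()` output, using `dt.total_seconds()` as an equivalent alternative to dividing by `pd.Timedelta("60s")`. The `suspect` filter is an added assumption about what might be worth reviewing, not a step from the original notebook:

```python
import pandas as pd

# First three rides from the head() output above.
rides = pd.DataFrame({
    "started_at": ["2021-11-27 13:27:38", "2021-11-27 13:38:25", "2021-11-26 22:03:34"],
    "ended_at": ["2021-11-27 13:46:38", "2021-11-27 13:56:10", "2021-11-26 22:05:56"],
})
rides["started_at"] = pd.to_datetime(rides["started_at"])
rides["ended_at"] = pd.to_datetime(rides["ended_at"])

# Timedelta -> seconds -> minutes, rounded to 2 decimals.
duration = rides["ended_at"] - rides["started_at"]
rides["ride_length"] = (duration.dt.total_seconds() / 60).round(2)

# Rides that end at or before their start time would deserve a closer look.
suspect = rides[rides["ride_length"] <= 0]
print(rides["ride_length"].tolist())
```

For example, the third ride lasts 2 minutes 22 seconds, which is 142 / 60 ≈ 2.37 minutes.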

In [13]:
trip_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359978 entries, 0 to 359977
Data columns (total 14 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   ride_id             359978 non-null  object        
 1   rideable_type       359978 non-null  object        
 2   started_at          359978 non-null  datetime64[ns]
 3   ended_at            359978 non-null  datetime64[ns]
 4   start_station_name  284688 non-null  object        
 5   start_station_id    284688 non-null  object        
 6   end_station_name    280791 non-null  object        
 7   end_station_id      280791 non-null  object        
 8   start_lat           359978 non-null  float64       
 9   start_lng           359978 non-null  float64       
 10  end_lat             359787 non-null  float64       
 11  end_lng             359787 non-null  float64       
 12  member_casual       359978 non-null  object        
 13  ride_length         359978 no

In [18]:
# Calculate the day of the week (Monday=0, ..., Sunday=6)
trip_data["day_of_week"] = trip_data.started_at.dt.weekday

In [19]:
trip_data.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,day_of_week
0,7C00A93E10556E47,electric_bike,2021-11-27 13:27:38,2021-11-27 13:46:38,,,,,41.93,-87.72,41.96,-87.73,casual,19.0,5
1,90854840DFD508BA,electric_bike,2021-11-27 13:38:25,2021-11-27 13:56:10,,,,,41.96,-87.7,41.92,-87.7,casual,17.75,5
2,0A7D10CDD144061C,electric_bike,2021-11-26 22:03:34,2021-11-26 22:05:56,,,,,41.96,-87.7,41.96,-87.7,casual,2.37,4
3,2F3BE33085BCFF02,electric_bike,2021-11-27 09:56:49,2021-11-27 10:01:50,,,,,41.94,-87.79,41.93,-87.79,casual,5.02,5
4,D67B4781A19928D4,electric_bike,2021-11-26 19:09:28,2021-11-26 19:30:41,,,,,41.9,-87.63,41.88,-87.62,casual,21.22,4
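The `day_of_week` column stores integers where Monday is 0 and Sunday is 6, which is easy to misread in a table. A small sketch showing the numeric value next to the day name for two of the start times above, using `dt.day_name()`:

```python
import pandas as pd

# Two start times taken from the head() output.
started = pd.Series(pd.to_datetime(["2021-11-27 13:27:38", "2021-11-26 22:03:34"]))

day_number = started.dt.weekday   # Monday=0, ..., Sunday=6
day_name = started.dt.day_name()  # English day names by default

print(pd.DataFrame({"day_number": day_number, "day_name": day_name}))
```

So a value of 5 is a Saturday and 4 is a Friday.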

In [29]:
def process_files(files: list, path: str) -> None:
    """
    Run a basic exploratory analysis on each trip data csv file.

    Each file is read into a pandas DataFrame. All files must share the same
    structure. The function checks and transforms data types, then prints a
    quick summary of bike usage by type, rides by customer type, and which
    bikes each customer type uses.

    Parameters
    ===============
    files: list
        names of the csv files to read
    path: str
        path of the directory where the files are stored
    """
    
    for file in files:
        file_path = path + file
        df = pd.read_csv(file_path)
        trip_date = file[:6]
        print("="*50)
        print("     Beginning Cyclistic Trip Data EDA.......")
        print("="*50)
        print(f"Month: {trip_date[-2:]}")
        print(f" Year: {trip_date[:4]}")
        print("="*50)
        
        print(df.info())
        df = df.convert_dtypes()  # convert_dtypes returns a new DataFrame, so assign it
        print("-"*50)
        
        print("        Percentage of Bike Usage by Type")
        print("-"*50)
        print(df["rideable_type"].value_counts(normalize=True))
        print("-"*50)
        
        print("          Percentage of Customer Type")
        print("-"*50)
        print(df["member_casual"].value_counts(normalize=True))
        print("-"*50)
        
        print("       Type of Bike Used by Customer Type")
        print("-"*50)
        print(df.groupby(["rideable_type", "member_casual"])["member_casual"].count().unstack())
        print("-"*50)
        
        # Transform the 'started_at' and 'ended_at' to datetime objects
        df["started_at"] = pd.to_datetime(df["started_at"])
        df["ended_at"] = pd.to_datetime(df["ended_at"])
        
        # Calculate the length of each ride/trip
        df["ride_length"] = df["ended_at"] - df["started_at"]
        df["ride_length"] = round(df.ride_length/pd.Timedelta("60s"), 2)
        
        # Calculate the day of the week (Monday=0,.....,Sunday=6)
        df["day_of_week"] = df.started_at.dt.weekday
        
        print("="*50)
        print("            Initial EDA Results........")
        print("="*50)
        print(df.info())
        print("-"*50)
        # datetime_is_numeric exists only in pandas 1.1 to 1.5; drop the argument on pandas 2.0+
        print(df.describe(include="all", datetime_is_numeric=True))
        print("="*50)
    return None
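The function above only prints a summary per file. If the months need to be analyzed together later, one option is to stack them into a single DataFrame. The sketch below is an assumption about a possible next step, not part of the original notebook; `load_all_months` and the `source_month` column are hypothetical names, and the demo uses two tiny made-up files instead of the real data:

```python
import os
import tempfile

import pandas as pd


def load_all_months(files, path):
    """Read every monthly csv file and stack them into one DataFrame."""
    frames = []
    for file in files:
        df = pd.read_csv(os.path.join(path, file))
        df["source_month"] = file[:6]  # YYYYMM prefix of the file name
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny made-up files; real files would come from dir_path.
with tempfile.TemporaryDirectory() as demo_dir:
    for month, ride in [("202111", "A1"), ("202112", "B2")]:
        pd.DataFrame({"ride_id": [ride], "member_casual": ["member"]}).to_csv(
            os.path.join(demo_dir, f"{month}-divvy-tripdata.csv"), index=False
        )
    all_trips = load_all_months(sorted(os.listdir(demo_dir)), demo_dir)

print(all_trips)
```

Keeping the month as a column makes it possible to trace each row back to its file after the concatenation.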

In [28]:
process_files(tripdata_files, dir_path)

     Beginning Cyclistic Trip Data EDA.......
Month: 11
 Year: 2021
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359978 entries, 0 to 359977
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             359978 non-null  object 
 1   rideable_type       359978 non-null  object 
 2   started_at          359978 non-null  object 
 3   ended_at            359978 non-null  object 
 4   start_station_name  284688 non-null  object 
 5   start_station_id    284688 non-null  object 
 6   end_station_name    280791 non-null  object 
 7   end_station_id      280791 non-null  object 
 8   start_lat           359978 non-null  float64
 9   start_lng           359978 non-null  float64
 10  end_lat             359787 non-null  float64
 11  end_lng             359787 non-null  float64
 12  member_casual       359978 non-null  object 
dtypes: float64(4), object(9)
memory usage: 35.7+ MB
None
-------------

In [23]:
for file in tripdata_files:
    file_path = dir_path + file
    df = pd.read_csv(file_path)
    trip_date = file[:6]
    print("="*50)
    print("                 Beginning EDA.......")
    print("="*50)
    print(f"Month: {trip_date[-2:]}")
    print(f" Year: {trip_date[:4]}")
    print("="*50)
    
    print(df.info())
    df = df.convert_dtypes()  # convert_dtypes returns a new DataFrame, so assign it
    print("-"*50)
    
    print("        Percentage of Bike Usage by Type")
    print("-"*50)
    print(df["rideable_type"].value_counts(normalize=True))
    print("-"*50)
    
    print("          Percentage of Customer Type")
    print("-"*50)
    print(df["member_casual"].value_counts(normalize=True))
    print("-"*50)
    
    print("         Type of Bike Used by Customer Type")
    print("-"*50)
    print(df.groupby(["rideable_type", "member_casual"])["member_casual"].count().unstack())
    print("-"*50)
    
    # Transform the 'started_at' and 'ended_at' to datetime objects
    df["started_at"] = pd.to_datetime(df["started_at"])
    df["ended_at"] = pd.to_datetime(df["ended_at"])
    
    # Calculate the length of each ride/trip
    df["ride_length"] = df["ended_at"] - df["started_at"]
    df["ride_length"] = round(df.ride_length/pd.Timedelta("60s"), 2)
    
    # Calculate the day of the week (Monday=0,.....,Sunday=6)
    df["day_of_week"] = df.started_at.dt.weekday
    
    print("-"*50)
    print("              Initial Results........")
    print(df.info())
    print("-"*50)
    # datetime_is_numeric exists only in pandas 1.1 to 1.5; drop the argument on pandas 2.0+
    print(df.describe(include="all", datetime_is_numeric=True))
    print("="*50)
    break  # stop after the first file while testing the loop

                 Beginning EDA.......
Month: 11
 Year: 2021
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359978 entries, 0 to 359977
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             359978 non-null  object 
 1   rideable_type       359978 non-null  object 
 2   started_at          359978 non-null  object 
 3   ended_at            359978 non-null  object 
 4   start_station_name  284688 non-null  object 
 5   start_station_id    284688 non-null  object 
 6   end_station_name    280791 non-null  object 
 7   end_station_id      280791 non-null  object 
 8   start_lat           359978 non-null  float64
 9   start_lng           359978 non-null  float64
 10  end_lat             359787 non-null  float64
 11  end_lng             359787 non-null  float64
 12  member_casual       359978 non-null  object 
dtypes: float64(4), object(9)
memory usage: 35.7+ MB
None
---------------------

In [35]:
trip_data.rename(columns={"member_casual": "customer_type"}, inplace=True)
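By default, `rename` silently ignores column names that do not exist, so a typo in the old name would leave the frame unchanged without any warning. A small sketch on a made-up one-row frame, using the `errors="raise"` option of `DataFrame.rename` to surface such typos:

```python
import pandas as pd

# Made-up one-row frame with the same column name as the trip data.
demo = pd.DataFrame({"ride_id": ["A1"], "member_casual": ["casual"]})

# errors="raise" makes pandas raise a KeyError if the old column name is absent.
renamed = demo.rename(columns={"member_casual": "customer_type"}, errors="raise")

try:
    # Deliberate typo in the old column name.
    demo.rename(columns={"member_casul": "customer_type"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), typo_caught)
```

Without `inplace=True`, `rename` returns a new frame and leaves `demo` untouched.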

In [36]:
trip_data.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,customer_type,ride_length,day_of_week
0,7C00A93E10556E47,electric_bike,2021-11-27 13:27:38,2021-11-27 13:46:38,,,,,41.93,-87.72,41.96,-87.73,casual,19.0,5
1,90854840DFD508BA,electric_bike,2021-11-27 13:38:25,2021-11-27 13:56:10,,,,,41.96,-87.7,41.92,-87.7,casual,17.75,5
2,0A7D10CDD144061C,electric_bike,2021-11-26 22:03:34,2021-11-26 22:05:56,,,,,41.96,-87.7,41.96,-87.7,casual,2.37,4
3,2F3BE33085BCFF02,electric_bike,2021-11-27 09:56:49,2021-11-27 10:01:50,,,,,41.94,-87.79,41.93,-87.79,casual,5.02,5
4,D67B4781A19928D4,electric_bike,2021-11-26 19:09:28,2021-11-26 19:30:41,,,,,41.9,-87.63,41.88,-87.62,casual,21.22,4
