## II. Prepare — What do We Need?

### A. Description of Data

>Available on hand<br>
Content : Details of every ride logged by Cyclistic customers<br>
Range of Data : 2013 - 2024 Mar<br>

> Used in project<br>
Content : Details of every ride logged by Cyclistic customers<br>
Range of Data : 2023 Apr - 2024 Mar `Past 12 months`<br> 
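
The twelve monthly files in this window can be listed programmatically. The following is a minimal sketch, assuming the files follow the `YYYYMM-divvy-tripdata.csv` naming that appears later in this notebook; `expected_files` is an illustrative name, not part of the original analysis.

```python
import pandas as pd

# Monthly periods from April 2023 through March 2024 (inclusive)
months = pd.period_range(start="2023-04", end="2024-03", freq="M")

# Build the file name expected for each month in the project window
expected_files = [f"{m.strftime('%Y%m')}-divvy-tripdata.csv" for m in months]

print(len(expected_files))
print(expected_files[0], expected_files[-1])
```

A list like this can be compared against the contents of the input folder to confirm that no month is missing before merging.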


### B. Credibility of Data

The credibility and integrity of our data can be determined using the ROCCC system.

Reliable — 
Original — 
Comprehensive — 
Current — it is relevant and up to date, thus indicating that the source refreshes its data regularly.
Cited — 

### C. Limitations of Data

Data privacy rules prohibit the use of riders' personally identifiable information, such as gender and age. As a result, we cannot relate customers' geographic and demographic characteristics to their riding behaviour.

In addition, no pricing data is available.

## III. Process — From Dirty to Clean

### Choice of tools

Tools: Python and Tableau

In [12]:
import sys
assert sys.version_info >= (3, 10)

import pandas as pd
import numpy as np

# To plot figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns

# Common imports
import os
print("Libraries imported successfully.")

Libraries imported successfully.


### Locate the file

In [13]:
# Define the base directory where the CSV files are stored
directory = r'C:\Users\Lucas\Code\Web_scraper\Case_Study(1)\input'

# Common suffix shared by all monthly files
file_pattern = '-divvy-tripdata.csv'

# Load the first month (April 2023) as a sample
df_sample = pd.read_csv(os.path.join(directory, '202304' + file_pattern))



### Description of the file

In [14]:
# Generate the basic info. of the sample
df_sample.info()

# The info above shows many null values in the station columns, so display only rows where both station names are present to get a clearer view of the data
df_sample[df_sample['start_station_name'].notna() & df_sample['end_station_name'].notna()].head(5) 



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426590 entries, 0 to 426589
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   ride_id             426590 non-null  object 
 1   rideable_type       426590 non-null  object 
 2   started_at          426590 non-null  object 
 3   ended_at            426590 non-null  object 
 4   start_station_name  362776 non-null  object 
 5   start_station_id    362776 non-null  object 
 6   end_station_name    357960 non-null  object 
 7   end_station_id      357960 non-null  object 
 8   start_lat           426590 non-null  float64
 9   start_lng           426590 non-null  float64
 10  end_lat             426155 non-null  float64
 11  end_lng             426155 non-null  float64
 12  member_casual       426590 non-null  object 
dtypes: float64(4), object(9)
memory usage: 42.3+ MB


| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 227 | 5B6500E1E58655C0 | classic_bike | 2023-04-10 17:34:35 | 2023-04-10 18:02:36 | Avenue O & 134th St | 20214 | Avenue O & 134th St | 20214 | 41.651868 | -87.539671 | 41.651868 | -87.539671 | member |
| 383 | AA65D25D69AF771F | classic_bike | 2023-04-12 12:29:46 | 2023-04-12 12:54:00 | Cottage Grove Ave & 51st St | TA1309000067 | Cottage Grove Ave & 51st St | TA1309000067 | 41.803038 | -87.606615 | 41.803038 | -87.606615 | member |
| 409 | 079FB2C196414482 | electric_bike | 2023-04-13 17:39:23 | 2023-04-13 17:40:57 | Morgan Ave & 14th Pl | TA1306000002 | Morgan Ave & 14th Pl | TA1306000002 | 41.86243 | -87.651152 | 41.862378 | -87.651062 | member |
| 561 | 599623864C871207 | classic_bike | 2023-04-29 20:57:10 | 2023-04-29 20:57:13 | Cottage Grove Ave & 51st St | TA1309000067 | Cottage Grove Ave & 51st St | TA1309000067 | 41.803038 | -87.606615 | 41.803038 | -87.606615 | member |
| 692 | 63ECC8A13D11A76A | classic_bike | 2023-04-20 17:03:11 | 2023-04-20 17:24:58 | California Ave & Division St | 13256 | California Ave & Milwaukee Ave | 13084 | 41.903029 | -87.697474 | 41.922695 | -87.697153 | casual |


### Assess the impact of missing values

In [15]:
def impact_of_missing_value(df):
    # Total number of rows
    total_rows = df.shape[0]

    # Number of missing values per column
    missing_counts = df.isnull().sum()

    # Percentage of missing values per column
    missing_percentage = round((missing_counts / total_rows) * 100, 2)

    # Percentage of non-missing values per column
    non_missing_percentage = 100 - missing_percentage

    # Collect the results in a DataFrame for display
    data_loss_df = pd.DataFrame({
        'Total Rows': total_rows,
        'Missing Values': missing_counts,
        'Percentage Missing': missing_percentage,
        'Percentage Non-Missing': non_missing_percentage
    })

    return data_loss_df


print(impact_of_missing_value(df_sample))

                    Total Rows  Missing Values  Percentage Missing  \
ride_id                 426590               0                0.00   
rideable_type           426590               0                0.00   
started_at              426590               0                0.00   
ended_at                426590               0                0.00   
start_station_name      426590           63814               14.96   
start_station_id        426590           63814               14.96   
end_station_name        426590           68630               16.09   
end_station_id          426590           68630               16.09   
start_lat               426590               0                0.00   
start_lng               426590               0                0.00   
end_lat                 426590             435                0.10   
end_lng                 426590             435                0.10   
member_casual           426590               0                0.00   
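
The same percentages can be obtained more compactly, because the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on a made-up toy frame (`toy` and `missing_pct` are illustrative names), not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Toy data: two of four start stations and one of four end latitudes are missing
toy = pd.DataFrame({
    "start_station_name": ["A", None, "C", None],
    "end_lat": [41.9, 41.8, np.nan, 41.7],
    "member_casual": ["member", "casual", "member", "casual"],
})

# isna() gives a boolean mask; its column-wise mean is the share of missing values
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```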

                   

In [16]:
def missing_status_checking(df):
    # Number of cases where 'end_station_name' is missing but 'start_station_name' is not missing
    end_missing_start_not = df[df['end_station_name'].isnull() & df['start_station_name'].notnull()].shape[0]

    # Number of cases where 'start_station_name' is missing but 'end_station_name' is not missing
    start_missing_end_not = df[df['start_station_name'].isnull() & df['end_station_name'].notnull()].shape[0]

    # Number of cases where both 'start_station_name' and 'end_station_name' are missing
    both_missing = df[df['start_station_name'].isnull() & df['end_station_name'].isnull()].shape[0]
    

    print(f"End station name missing, start station name not missing: {end_missing_start_not}")
    print(f"Start station name missing, end station name not missing: {start_missing_end_not}")
    print(f"Both start and end station names missing: {both_missing}")

missing_status_checking(df_sample)

End station name missing, start station name not missing: 38579
Start station name missing, end station name not missing: 33763
Both start and end station names missing: 30051
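
The three counts above, plus the count of rides with both names present, can also be read from a single cross-tabulation. This is a minimal sketch on made-up toy data; the `"missing"`/`"present"` labels and the variable names are illustrative choices.

```python
import pandas as pd

# Toy data covering all four combinations of missing/present station names
toy = pd.DataFrame({
    "start_station_name": ["A", None, None, "D", "E"],
    "end_station_name": [None, "B", None, "D", None],
})

# Turn each missing-value mask into readable labels
labels = {True: "missing", False: "present"}
start_state = toy["start_station_name"].isna().map(labels).rename("start")
end_state = toy["end_station_name"].isna().map(labels).rename("end")

# Rows: state of the start name; columns: state of the end name
overlap = pd.crosstab(start_state, end_state)
print(overlap)
```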


In [17]:
def check_duplicate(df):
    duplicate_status = df.duplicated().any()
    print(f'Duplicate found : {duplicate_status}')
    return

check_duplicate(df_sample)

Duplicate found : False
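
`df.duplicated()` only flags rows that are identical in every column. Since `ride_id` is meant to be unique, it may also be worth checking that column on its own. This is a minimal sketch on made-up toy data, not a check that was run in this notebook.

```python
import pandas as pd

# Toy data: the ride_id "A1" appears twice, but the rows differ in started_at
toy = pd.DataFrame({
    "ride_id": ["A1", "B2", "A1"],
    "started_at": ["2023-04-01 08:00:00", "2023-04-01 09:00:00", "2023-04-01 10:00:00"],
})

# Full-row check: no two rows are identical in every column
full_row_duplicates = toy.duplicated().any()

# Key-only check: counts rows whose ride_id has already been seen
repeated_ids = toy.duplicated(subset="ride_id").sum()

print(full_row_duplicates, repeated_ids)
```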


### Handling null data

For the `null` values in `end_lat`, we can group the affected rides by start station to look for a pattern that explains the missing coordinates.

In [18]:
count_by_station = df_sample[df_sample['end_lat'].isnull()].groupby('start_station_name').size()

# Sort the counts in descending order
sorted_count_by_station = count_by_station.sort_values(ascending=False)

print(sorted_count_by_station)

start_station_name
Streeter Dr & Grand Ave              22
Millennium Park                      13
Dusable Harbor                        9
DuSable Lake Shore Dr & Monroe St     9
Shedd Aquarium                        8
                                     ..
Halsted St & 35th St                  1
Halsted St & 96th St                  1
Halsted St & North Branch St          1
Halsted St & Polk St                  1
Woodlawn Ave & Lake Park Ave          1
Length: 232, dtype: int64
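
Another way to probe the missing end coordinates is to check whether the same rides also lack an end station name, and which bike types are involved. This is a minimal sketch on made-up toy data (the coordinates and station assignments are invented for illustration); it suggests a follow-up check and does not report a result from the real data.

```python
import numpy as np
import pandas as pd

# Toy data: two rides have no end coordinates
toy = pd.DataFrame({
    "rideable_type": ["classic_bike", "classic_bike", "electric_bike", "classic_bike"],
    "end_station_name": [None, None, "Millennium Park", "Shedd Aquarium"],
    "end_lat": [np.nan, np.nan, 41.881, 41.867],
})

# Keep only the rides without end coordinates
no_end_coords = toy[toy["end_lat"].isna()]

# Share of those rides that also have no end station name
share_without_end_station = no_end_coords["end_station_name"].isna().mean()

# Breakdown of those rides by bike type
by_type = no_end_coords["rideable_type"].value_counts()

print(share_without_end_station)
print(by_type)
```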


In [19]:
# Replace all NaN values with 'unknown'
# Note: this also fills the numeric end_lat/end_lng columns, which turns them into object dtype
def df_fillna_unknown(df):
    df_filled = df.fillna('unknown')
    print(f'Null values items:\n{df_filled.isnull().sum()}')
    return df_filled

# Fill once and reuse the result, so the null summary is printed only once
df_sample_filled = df_fillna_unknown(df_sample)
print(df_sample_filled.head())



Null values items:
ride_id               0
rideable_type         0
started_at            0
ended_at              0
start_station_name    0
start_station_id      0
end_station_name      0
end_station_id        0
start_lat             0
start_lng             0
end_lat               0
end_lng               0
member_casual         0
dtype: int64
            ride_id  rideable_type           started_at             ended_at  \
0  8FE8F7D9C10E88C7  electric_bike  2023-04-02 08:37:28  2023-04-02 08:41:37   
1  34E4ED3ADF1D821B  electric_bike  2023-04-19 11:29:02  2023-04-19 11:52:12   
2  5296BF07A2F77CB5  electric_bike  2023-04-19 08:41:22  2023-04-19 08:4
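
Filling every column with the string `'unknown'` also affects the numeric coordinate columns. An alternative, shown here only as a sketch on made-up toy data (the station ID is invented, and `station_cols` is an illustrative name), is to fill the text columns alone so that the coordinates stay numeric. The notebook itself keeps the fill-everything approach.

```python
import numpy as np
import pandas as pd

# Toy data: the second ride has no end station and no end latitude
toy = pd.DataFrame({
    "end_station_name": ["Millennium Park", None],
    "end_station_id": ["13008", None],
    "end_lat": [41.881, np.nan],
})

# Fill only the text columns; leave the numeric column untouched
station_cols = ["end_station_name", "end_station_id"]
filled = toy.copy()
filled[station_cols] = filled[station_cols].fillna("unknown")

print(filled.dtypes)
print(filled)
```

With this variant `end_lat` remains `float64` and keeps its `NaN`, so numeric summaries of the coordinates still work.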

### B. Data Transformation

#### Columns on hand

1. `ride_id`: Unique ID assigned to each ride
2. `rideable_type`: Type of bicycle used on each ride — classic, docked, or electric
3. `started_at`: Date and time at the start of each trip
4. `ended_at`: Date and time at the end of each trip
5. `start_station_name`: Name of the station where each trip started
6. `start_station_id`: ID of the station where each trip started
7. `end_station_name`: Name of the station where each trip ended
8. `end_station_id`: ID of the station where each trip ended
9. `start_lat`: Latitude of each starting station
10. `start_lng`: Longitude of each starting station
11. `end_lat`: Latitude of each ending station
12. `end_lng`: Longitude of each ending station
13. `member_casual`: Type of membership of each rider



#### Additional columns
- `ride_length`: Duration of each ride (`ended_at` minus `started_at`)
- `ride_length_minutes`: Duration of each ride in minutes
- `start_hour`: Hour of the day in which each ride started
- `weekday_name`: Day of the week on which each ride started
- `ride_length_minutes_category`: Ride duration grouped into fine-grained minute bins, to understand usage
- `ride_length_category`: Ride duration grouped into broader categories, to understand the purpose of usage
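
The two category columns are built with `pd.cut`. The sketch below shows how boundary values are assigned when `right=False`, using the bin edges 0, 5, 15, 30, 60 that this notebook applies to `ride_length_category`. The shortened labels and the sample durations are illustrative.

```python
import pandas as pd

# Sample durations in minutes, chosen to sit on or near the bin edges
minutes = pd.Series([1.0, 4.99, 5.0, 15.0, 59.9, 60.0])

bins = [0, 5, 15, 30, 60, float("inf")]
labels = ["Very Short", "Short", "Moderate", "Long", "Very Long"]

# right=False makes each bin include its lower edge and exclude its upper edge,
# e.g. [5, 15): a ride of exactly 5 minutes falls into "Short"
categories = pd.cut(minutes, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"minutes": minutes, "category": categories}))
```

Because the bins are half-open, a ride of exactly 5.0 minutes is counted in the second bin, not the first.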

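#### Not used
 - `trip_distance`: Not derived, because many rides are round trips rather than point-to-point journeys (for example, riding around a park), so the distance between the start and end points would be misleading.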
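
To illustrate why a start-to-end distance can mislead, the sketch below computes a great-circle (haversine) distance for two rides from the sample rows shown earlier. The haversine formula and the helper name `haversine_km` are assumptions made for this illustration; the notebook does not compute distances.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Round trip from the sample: starts and ends at Avenue O & 134th St (about 28 minutes long)
round_trip = haversine_km(41.651868, -87.539671, 41.651868, -87.539671)

# One-way trip from the sample: California Ave & Division St to California Ave & Milwaukee Ave
one_way = haversine_km(41.903029, -87.697474, 41.922695, -87.697153)

print(round_trip, one_way)
```

The round trip yields a distance of zero even though the bike was in use for roughly half an hour, which is why ride duration is used instead.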


In [20]:

def processing_data(df):
    # Convert 'started_at' and 'ended_at' to datetime
    df['started_at'] = pd.to_datetime(df['started_at'])
    df['ended_at'] = pd.to_datetime(df['ended_at'])

    # Calculate ride length
    df['ride_length'] = df['ended_at'] - df['started_at']

    # Ride length in minutes
    df['ride_length_minutes'] = df['ride_length'].dt.total_seconds() / 60

    # Hour of day in which each ride started
    df['start_hour'] = df['started_at'].dt.hour
    print(df['start_hour'].unique())

    # Weekday on which each ride started
    df['weekday_name'] = df['started_at'].dt.day_name()
    print(df['weekday_name'].unique())

    # Keep rides lasting between 1 and 720 minutes (12 hours);
    # .copy() makes df_filtered independent of df and avoids SettingWithCopyWarning
    df_filtered = df[(df['ride_length_minutes'] >= 1) & (df['ride_length_minutes'] <= 720)].copy()

    # Report dropped entries
    dropped_entries = len(df) - len(df_filtered)
    print(f'Entries dropped for being outside 1-720 minutes: {dropped_entries}')
    print(f'Percentage dropped: {100 * dropped_entries / len(df):.2f}%')

    # Fine-grained bins; with right=False each bin includes its lower edge and excludes its upper edge
    bins_ride_mins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 90, 120, float('inf')]
    labels_ride_mins = [
        '1-5 mins', '5-10 mins', '10-15 mins', '15-20 mins', 
        '20-25 mins', '25-30 mins', '30-35 mins', '35-40 mins', 
        '40-45 mins', '45-50 mins', '50-55 mins', '55-60 mins', 
        '60-90 mins', '90-120 mins', '120+ mins'
    ]

    # Broader bins and labels (same half-open convention)
    bins_ride_cate = [0, 5, 15, 30, 60, float('inf')]
    labels_ride_cate = ["Very Short (1-5 mins)", "Short (6-15 mins)", "Moderate (16-30 mins)", "Long (31-60 mins)", "Very Long (60+ mins)"]

    # Create the two category columns
    df_filtered['ride_length_minutes_category'] = pd.cut(df_filtered['ride_length_minutes'], bins=bins_ride_mins, labels=labels_ride_mins, right=False)
    df_filtered['ride_length_category'] = pd.cut(df_filtered['ride_length_minutes'], bins=bins_ride_cate, labels=labels_ride_cate, right=False)
    return df_filtered


df_sample_filled = processing_data(df_sample_filled)
print(df_sample_filled.head())


[ 8 11 13 12  9 16 17 14 18 15  7 20 19 21  3  4 10  2  0 22  1 23  6  5]
['Sunday' 'Wednesday' 'Tuesday' 'Thursday' 'Monday' 'Saturday' 'Friday']
Entries dropped for being outside 1-720 minutes: 15584
Percentage dropped: 3.65%
            ride_id  rideable_type          started_at            ended_at  \
0  8FE8F7D9C10E88C7  electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37   
1  34E4ED3ADF1D821B  electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12   
2  5296BF07A2F77CB5  electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22   
3  40759916B76D5D52  electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09   
4  77A96F460101AC63  electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26   

  start_station_name start_station_id end_station_name end_station_id  \
0            unknown          unknown          unknown        unknown   
1            unknown          unknown          unknown        unknown   
2            unknown          unknown          unknown        unknown   
3          


In [21]:
del df_sample
del df_sample_filled

### Merge data

In [22]:

# Define the base directory where the CSV files are stored
directory = r'C:\Users\Lucas\Code\Web_scraper\Case_Study(1)\input'

# Common suffix shared by all monthly files
file_pattern = '-divvy-tripdata.csv'

dataframes = []  # List to store each DataFrame

# Loop through the files in name order, which is also chronological order
for filename in sorted(os.listdir(directory)):
    if filename.endswith(file_pattern):
        print("Processing file:", filename)
        file_path = os.path.join(directory, filename)
        df = pd.read_csv(file_path)
        dataframes.append(df)
    else:
        print("Skipping file:", filename)

# Combine all DataFrames into one
combined_df = pd.concat(dataframes, ignore_index=True)
print("Combined DataFrame shape:", combined_df.shape)

#combined_df.to_csv('combined_data.csv', index=False)

Processing file: 202304-divvy-tripdata.csv
Processing file: 202305-divvy-tripdata.csv
Processing file: 202306-divvy-tripdata.csv
Processing file: 202307-divvy-tripdata.csv
Processing file: 202308-divvy-tripdata.csv
Processing file: 202309-divvy-tripdata.csv
Processing file: 202310-divvy-tripdata.csv
Processing file: 202311-divvy-tripdata.csv
Processing file: 202312-divvy-tripdata.csv
Processing file: 202401-divvy-tripdata.csv
Processing file: 202402-divvy-tripdata.csv
Processing file: 202403-divvy-tripdata.csv
Skipping file: __MACOSX
Combined DataFrame shape: (5750177, 13)
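
The merge step can also be written with `pathlib`, parsing the timestamp columns while reading. This is a minimal sketch: `load_monthly_files` is a hypothetical helper, and the demonstration uses two tiny made-up files in a temporary folder in place of the real monthly data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_monthly_files(directory, pattern="*-divvy-tripdata.csv"):
    """Read every CSV in `directory` matching `pattern` and stack them in name order."""
    paths = sorted(Path(directory).glob(pattern))
    frames = [pd.read_csv(p, parse_dates=["started_at", "ended_at"]) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demonstration with two tiny made-up files plus one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for month, ride in [("202304", "A1"), ("202305", "B2")]:
        pd.DataFrame({
            "ride_id": [ride],
            "started_at": [f"{month[:4]}-{month[4:]}-01 08:00:00"],
            "ended_at": [f"{month[:4]}-{month[4:]}-01 08:10:00"],
        }).to_csv(Path(tmp) / f"{month}-divvy-tripdata.csv", index=False)
    (Path(tmp) / "notes.txt").write_text("not a data file")

    combined = load_monthly_files(tmp)

print(combined.shape)
print(combined.dtypes)
```

The glob pattern skips anything that is not a monthly trip file, so no explicit `else` branch is needed for entries such as `__MACOSX`.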


Apply the same checks and processing steps to the combined data.

In [23]:
combined_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5750177 entries, 0 to 5750176
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 570.3+ MB
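
At this size, memory use becomes noticeable. One common way to reduce it is to store low-cardinality text columns such as `rideable_type` and `member_casual` as pandas categoricals. This is a minimal sketch on made-up toy data; it was not applied in this notebook, and the actual saving on the real data is not measured here.

```python
import pandas as pd

# Toy data: 30,000 rows with only two distinct values per column
toy = pd.DataFrame({
    "rideable_type": ["electric_bike", "classic_bike"] * 15_000,
    "member_casual": ["member", "casual"] * 15_000,
})

# Memory used when the values are stored as plain strings
before = toy.memory_usage(deep=True).sum()

# Store each distinct value once and keep small integer codes per row
compact = toy.astype({"rideable_type": "category", "member_casual": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"{before:,} bytes -> {after:,} bytes")
```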


In [24]:
print(impact_of_missing_value(combined_df))
missing_status_checking(combined_df)


                    Total Rows  Missing Values  Percentage Missing  \
ride_id                5750177               0                0.00   
rideable_type          5750177               0                0.00   
started_at             5750177               0                0.00   
ended_at               5750177               0                0.00   
start_station_name     5750177          874450               15.21   
start_station_id       5750177          874450               15.21   
end_station_name       5750177          929226               16.16   
end_station_id         5750177          929226               16.16   
start_lat              5750177               0                0.00   
start_lng              5750177               0                0.00   
end_lat                5750177            7566                0.13   
end_lng                5750177            7566                0.13   
member_casual          5750177               0                0.00   

                   

In [25]:
check_duplicate(combined_df)
# Fill once and reuse the result instead of running the fill twice
df_merged_filled = df_fillna_unknown(combined_df)
print(df_merged_filled.head())

Duplicate found : False
Null values items:
ride_id               0
rideable_type         0
started_at            0
ended_at              0
start_station_name    0
start_station_id      0
end_station_name      0
end_station_id        0
start_lat             0
start_lng             0
end_lat               0
end_lng               0
member_casual         0
dtype: int64
            ride_id  rideable_type           started_at             ended_at  \
0  8FE8F7D9C10E88C7  electric_bike  2023-04-02 08:37:28  2023-04-02 08:41:37   
1  34E4ED3ADF1D821B  electric_bike  2023-04-19 11:29:02  2023-04-19 11:52:12   
2  5296BF07A2F77CB5  electric_bike  2023-04-19 08:41:22  2023-04-19 08:43:22   
3  40759916B76D5D52  electric_bike  2023-04-19 13:31:30  2023-04-19 13:35:09   
4  77A96F460101AC63  electric_bike  2023-04-19 12:05:36  2023-04-19 12:10:26   

  start_station_name start_station_id end_station_name end_station_id  \
0            unknown          unknown          unknown        unknown   
1    

In [26]:
df_merged_filled = processing_data(df_merged_filled)
df_merged_filled.info()
print(df_merged_filled.head())


[ 8 11 13 12  9 16 17 14 18 15  7 20 19 21  3  4 10  2  0 22  1 23  6  5]
['Sunday' 'Wednesday' 'Tuesday' 'Thursday' 'Monday' 'Saturday' 'Friday']
Entries dropped for being outside 1-720 minutes: 153503
Percentage dropped: 2.67%


<class 'pandas.core.frame.DataFrame'>
Index: 5596674 entries, 0 to 5750176
Data columns (total 19 columns):
 #   Column                        Dtype          
---  ------                        -----          
 0   ride_id                       object         
 1   rideable_type                 object         
 2   started_at                    datetime64[ns] 
 3   ended_at                      datetime64[ns] 
 4   start_station_name            object         
 5   start_station_id              object         
 6   end_station_name              object         
 7   end_station_id                object         
 8   start_lat                     float64        
 9   start_lng                     float64        
 10  end_lat                       object         
 11  end_lng                       object         
 12  member_casual                 object         
 13  ride_length                   timedelta64[ns]
 14  ride_length_minutes           float64        
 15  start_hour          

In [27]:
df_merged_filled.describe(include='all')

| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | ride_length | ride_length_minutes | start_hour | weekday_name | ride_length_minutes_category | ride_length_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5596674 | 5596674 | 5596674 | 5596674 | 5596674 | 5596674 | 5596674 | 5596674 | 5596674.0 | 5596674.0 | 5596674.0 | 5596674.0 | 5596674 | 5596674 | 5596674.0 | 5596674.0 | 5596674 | 5596674 | 5596674 |
| unique | 5596674 | 3 | | | 1596 | 1528 | 1606 | 1535 | | | 12973.0 | 13078.0 | 2 | | | | 7 | 15 | 5 |
| top | 8FE8F7D9C10E88C7 | electric_bike | | | unknown | unknown | unknown | unknown | | | 41.89 | -87.65 | member | | | | Saturday | 5-10 mins | Short (6-15 mins) |
| freq | 1 | 2804580 | | | 831332 | 831332 | 865163 | 865163 | | | 94965.0 | 144894.0 | 3589103 | | | | 873168 | 1737407 | 2785382 |
| mean | | | 2023-08-27 23:18:16.643734272 | 2023-08-27 23:33:25.310296832 | | | | | 41.90305 | -87.64692 | | | | 0 days 00:15:08.666564284 | 15.14444 | 14.09156 | | | |
| min | | | 2023-04-01 00:00:02 | 2023-04-01 00:03:10 | | | | | 41.63 | -87.94 | | | | 0 days 00:01:00 | 1.0 | 0.0 | | | |
| 25% | | | 2023-06-18 13:26:48.500000 | 2023-06-18 13:49:01 | | | | | 41.88103 | -87.66 | | | | 0 days 00:05:45 | 5.75 | 11.0 | | | |
| 50% | | | 2023-08-15 19:01:02 | 2023-08-15 19:18:00 | | | | | 41.89918 | -87.644 | | | | 0 days 00:09:52 | 9.866667 | 15.0 | | | |
| 75% | | | 2023-10-20 13:07:06.500000 | 2023-10-20 13:24:04.500000 | | | | | 41.93 | -87.62991 | | | | 0 days 00:17:19 | 17.31667 | 18.0 | | | |
| max | | | 2024-03-31 23:59:11 | 2024-04-01 02:35:26 | | | | | 42.07 | -87.46 | | | | 0 days 11:59:26 | 719.4333 | 23.0 | | | |


In [28]:

print(df_merged_filled.shape)


(5596674, 19)



In [29]:
#df_merged_filled.to_csv('processed_data.csv', index=False)