# Data Processing

# Standardization of station Ids

## Import and Concatenation of the files

**We start by importing the modules we need for data extraction and manipulation**

In [25]:
import pandas as pd
import glob
import os

**We then define the files we will need to concatenate, and read them one by one using a for loop before grouping them together**

In [26]:
files_path = r"C:\Users\pocit\OneDrive\Documents\formations_reconversion\Google data analyst certificate\Coursera8_Case_Study\Bike_trip_files\All_trips\*.csv"
files = glob.glob(files_path)

In [28]:
files_list_cyclistic = []

for f in files:  # we loop over the monthly files and add each one to our list
    adding_files = pd.read_csv(f)
    files_list_cyclistic.append(adding_files)

annual_cyclistic_trips = pd.concat(files_list_cyclistic, ignore_index=True)  # we concatenate the list built by the loop into a single dataset

In [14]:
print(f" Total_bike_rides : {len(annual_cyclistic_trips)}") #we display the number of rows to make sure the 12 monthly files were correctly assembled

 Total_bike_rides : 5552092
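As a complementary check on the concatenation, the row count of each monthly file can be recorded while reading, so the total can be reconciled file by file. This is a minimal sketch: the helper name `load_monthly_files` is hypothetical, and sorting the paths is an added precaution because `glob.glob` does not guarantee any order.

```python
import glob
import os

import pandas as pd


def load_monthly_files(pattern):
    """Read every CSV matching `pattern`; return (combined table, rows per file)."""
    frames = []
    rows_per_file = {}
    # sorted() makes the read order deterministic
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(path)
        rows_per_file[os.path.basename(path)] = len(frame)
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    return combined, rows_per_file
```

Comparing `sum(rows_per_file.values())` with `len(combined)` confirms that no file was skipped or read twice.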


## Mapping the correct Id for each station

**We keep the most recent station_name / station_Id pair for each station recorded in city trips for our reference period**

In [29]:
annual_cyclistic_trips['started_at'] = pd.to_datetime(annual_cyclistic_trips['started_at'])# we convert the started_at column to a date time format to make sure the data is sorted correctly

starting_time_sorted = annual_cyclistic_trips.sort_values(by='started_at', ascending=False)# we sort the file by descending starting time

reference_stations = starting_time_sorted[['start_station_name', 'start_station_id']].dropna() # we extract the relevant columns for the dictionary

reference_stations = reference_stations.drop_duplicates(subset=['start_station_name']) # we remove the duplicates from the station_name subset

**We then transform the data into a dictionary**

In [30]:
station_mapping = reference_stations.set_index('start_station_name')['start_station_id'].to_dict()
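To illustrate why sorting by descending start time before `drop_duplicates` keeps the most recent Id, here is a small sketch on made-up data. The station names, the Ids and the helper `build_station_mapping` are hypothetical; the steps mirror the cells above.

```python
import pandas as pd

# Toy data: 'Station A' changed its Id between February and June
toy_trips = pd.DataFrame({
    'started_at': pd.to_datetime(['2025-02-01 08:00', '2025-06-01 08:00', '2025-03-01 09:00']),
    'start_station_name': ['Station A', 'Station A', 'Station B'],
    'start_station_id': ['OLD_1', 'NEW_1', 'ID_2'],
})


def build_station_mapping(trips):
    """Return {station_name: most recent station_id}."""
    latest_first = trips.sort_values('started_at', ascending=False)
    pairs = latest_first[['start_station_name', 'start_station_id']].dropna()
    # drop_duplicates keeps the first occurrence by default, i.e. the most recent trip
    pairs = pairs.drop_duplicates(subset='start_station_name')
    return pairs.set_index('start_station_name')['start_station_id'].to_dict()


toy_mapping = build_station_mapping(toy_trips)
```

Note that the mapping is built from start stations only, so a station that appears only as an end station is absent from it.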

## Standardizing the station Ids 

**Standardization of the starting stations**

In [31]:
annual_cyclistic_trips['new_starting_stations_id'] = annual_cyclistic_trips['start_station_name'].map(station_mapping) # we create a temporary column with the updated Ids for each station

annual_cyclistic_trips['start_station_id'] = annual_cyclistic_trips['new_starting_stations_id'].fillna(annual_cyclistic_trips['start_station_id']) # we replace the values in the station_id column with the updated ones; rows whose station name is not in the mapping keep their original Id

**Standardization of the ending stations - we repeat the two steps above**

In [32]:
annual_cyclistic_trips['new_ending_stations_id'] = annual_cyclistic_trips['end_station_name'].map(station_mapping)

annual_cyclistic_trips['end_station_id'] = annual_cyclistic_trips['new_ending_stations_id'].fillna(annual_cyclistic_trips['end_station_id'])


**Removing the two columns we added previously**

In [33]:
annual_cyclistic_trips.drop(columns=['new_starting_stations_id', 'new_ending_stations_id'], inplace=True) # the two columns we created are not relevant for our future analysis and they make the file heavier, so we drop them
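The map-then-fill pattern used for both station columns can also be written as one small function, which avoids creating and dropping temporary columns. This is only a sketch of an alternative: the helper `standardize_ids` and the toy values are hypothetical.

```python
import pandas as pd


def standardize_ids(trips, mapping, name_col, id_col):
    """Return the Id column with mapped Ids where the name is known, original Ids otherwise."""
    mapped = trips[name_col].map(mapping)   # NaN where the name is missing or not in the mapping
    return mapped.fillna(trips[id_col])     # fall back to the Id already on the row


toy = pd.DataFrame({
    'start_station_name': ['Station A', 'Station A', 'Unmapped St', None],
    'start_station_id': ['OLD_1', 'NEW_1', 'ID_9', None],
})
toy['start_station_id'] = standardize_ids(
    toy, {'Station A': 'NEW_1'}, 'start_station_name', 'start_station_id'
)
```

The same call with `'end_station_name'` and `'end_station_id'` would cover the ending stations.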

## Checking the result and exporting the file

**Checking the result**

In [34]:
print(annual_cyclistic_trips[annual_cyclistic_trips['start_station_name'] == 'Ashland Ave & Division St'].head()) # we check the result with a station that had an inconsistent Id in the oldest months of our dataset, and now has a standardized Id


              ride_id  rideable_type              started_at  \
8    3F0334B07B4D7F38  electric_bike 2025-02-04 07:57:26.522   
9    1641FB083A2CDC74  electric_bike 2025-02-21 08:18:13.936   
10   7F07E55F71DBB6E1  electric_bike 2025-02-10 17:58:19.301   
38   FB3FF10F0CEC11E4   classic_bike 2025-02-25 10:40:09.706   
108  AFA0BA782A200DF2  electric_bike 2025-02-10 16:15:36.555   

                    ended_at         start_station_name start_station_id  \
8    2025-02-04 08:06:22.498  Ashland Ave & Division St         CHI00244   
9    2025-02-21 08:27:47.866  Ashland Ave & Division St         CHI00244   
10   2025-02-10 18:19:53.406  Ashland Ave & Division St         CHI00244   
38   2025-02-25 10:53:33.887  Ashland Ave & Division St         CHI00244   
108  2025-02-10 16:27:17.030  Ashland Ave & Division St         CHI00244   

              end_station_name end_station_id  start_lat  start_lng  \
8            Clark St & Elm St       CHI00281  41.903394 -87.667867   
9            Cla

**Exporting the file**

In [35]:
annual_cyclistic_trips.to_csv('Cyclistic_Trips_Cleaned_202502_202601.csv', index=False) # we export the result as a new CSV file

In [87]:
print(annual_cyclistic_trips.shape) # we make sure that the cleaned file contains the exact same number of rows as the total of the 12 monthly files we concatenated in the first part of the process

(5552092, 13)


# Further Processing of Data

**Now that we have our data in one concatenated file with up-to-date identifiers for each station, we can run a few more checks to ensure the data is clean and consistent**

### First we make sure that every row in our dataset has a unique ride_id

In [88]:
unique_id = annual_cyclistic_trips['ride_id'].unique() # we check that the number of unique ride_id is the same as the total of rows in the file
print(len(unique_id))

5552092
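If the two counts ever differed, it would be useful to see which rows are duplicated rather than only how many. A minimal sketch with a hypothetical helper `duplicated_ride_ids` and made-up Ids:

```python
import pandas as pd


def duplicated_ride_ids(trips):
    """Return every row whose ride_id appears more than once."""
    # keep=False marks all occurrences of a duplicated value, not only the repeats
    return trips[trips['ride_id'].duplicated(keep=False)]


toy = pd.DataFrame({'ride_id': ['A1', 'B2', 'A1', 'C3']})
dupes = duplicated_ride_ids(toy)
all_unique = toy['ride_id'].is_unique  # quick yes/no answer
```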


### Then we check that every ride happened in the time period of our study (the past 12 months)

In [89]:
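annual_cyclistic_trips['ended_at'] = pd.to_datetime(annual_cyclistic_trips['ended_at']) # we convert ended_at to a datetime format, as we did for started_at, so that the sort is chronological
first_and_last_entry = annual_cyclistic_trips.sort_values(by=['ended_at'], ascending=True)
first_and_last_entry.iloc[[0, -1]] # we only need the first and last entries sorted by ending date and time, since the whole column shares one format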

                  ride_id  rideable_type               started_at  \
7323     9D2A64F3B1815359  electric_bike  2025-01-31 23:55:09.830   
5517569  555372A890596494  electric_bike  2026-01-31 23:56:54.069   

                        ended_at           start_station_name start_station_id  \
7323     2025-02-01 00:00:10.352  Lincoln Ave & Fullerton Ave         CHI00494   
5517569  2026-01-31 23:59:33.214                          NaN              NaN   

                  end_station_name end_station_id  start_lat  start_lng  \
7323     Clark St & Wellington Ave       CHI00377      41.92     -87.65   
5517569                        NaN            NaN      41.92     -87.65   

         end_lat  end_lng member_casual  
7323       41.94   -87.65        casual  
5517569    41.93   -87.65        member  
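Instead of reading the two extreme rows by eye, the period check can be expressed as an explicit filter. This is a sketch under assumptions: the helper `rides_outside_period` is hypothetical, and the bounds used in the toy example (rides ending from 2025-02-01 up to, but excluding, 2026-02-01) are inferred from the output above rather than stated in the source files.

```python
import pandas as pd


def rides_outside_period(trips, period_start, period_end):
    """Return rows whose ended_at falls outside [period_start, period_end)."""
    ended = pd.to_datetime(trips['ended_at'])
    outside = (ended < pd.Timestamp(period_start)) | (ended >= pd.Timestamp(period_end))
    return trips[outside]


toy = pd.DataFrame({'ended_at': [
    '2025-01-31 23:59:59',  # too early
    '2025-02-01 00:00:10',
    '2026-01-31 23:59:33',
    '2026-02-01 00:00:01',  # too late
]})
out_of_range = rides_outside_period(toy, '2025-02-01', '2026-02-01')
```

An empty result on the full table would confirm that every ride ended within the study period.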


### We can check the cross-field consistency of the starting and ending dates, to make sure no ride ends before it starts

In [37]:
date_error = annual_cyclistic_trips[annual_cyclistic_trips['ended_at'] < annual_cyclistic_trips['started_at']]
print(len(date_error))
print(date_error.head(5))

29
                  ride_id  rideable_type              started_at  \
4942264  5D010AFEA6850513  electric_bike 2025-11-02 01:52:37.475   
4943812  D1E5316AD88ECD45   classic_bike 2025-11-02 01:50:39.702   
4945806  3AF2F8908C9386F8   classic_bike 2025-11-02 01:57:57.512   
4960702  19386939ECD81B33  electric_bike 2025-11-02 01:55:34.399   
4969070  083534D28DA37F72   classic_bike 2025-11-02 01:17:57.001   

                        ended_at               start_station_name  \
4942264  2025-11-02 01:13:24.728  Pine Grove Ave & Wellington Ave   
4943812  2025-11-02 01:14:04.164        Clark St & Ida B Wells Dr   
4945806  2025-11-02 01:33:52.225           State St & Chicago Ave   
4960702  2025-11-02 01:21:31.873           Broadway & Sheridan Rd   
4969070  2025-11-02 01:05:27.752              Clark St & Grace St   

        start_station_id              end_station_name end_station_id  \
4942264         CHI02158       Wilton Ave & Addison St       CHI02055   
4943812         CHI00216   

We notice 29 outliers, and the rows displayed all start and end between 1:00 and 2:00 a.m. on the night of November 2nd, 2025. This coincides with the end of daylight saving time in Chicago: at 2:00 a.m. the clocks go back to 1:00 a.m., so a ride that spans the change can show an end time earlier than its start time. Since these rows represent a negligible share of the total number of trips, we decide to remove them from the analysis for the sake of accuracy.
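The clock-change explanation can be illustrated with one pair of timestamps, taken from the first row of the output above and truncated to the second. This is a sketch that assumes the files store naive local Chicago wall-clock times, which the data itself does not confirm. The `ambiguous` flag tells pandas which of the two passes through the 1:00 to 2:00 a.m. hour a timestamp belongs to.

```python
import pandas as pd

CHICAGO = 'America/Chicago'

naive_start = pd.Timestamp('2025-11-02 01:52:37')
naive_end = pd.Timestamp('2025-11-02 01:13:24')

# ambiguous=True: first pass through 01:xx, still on daylight saving time (UTC-5)
start = naive_start.tz_localize(CHICAGO, ambiguous=True)
# ambiguous=False: second pass through 01:xx, back on standard time (UTC-6)
end = naive_end.tz_localize(CHICAGO, ambiguous=False)

naive_duration = naive_end - naive_start   # negative: the end looks earlier than the start
true_duration = end - start                # positive once the clock change is accounted for
```

Under that assumption the ride is an ordinary one of about twenty minutes. The raw files carry no flag saying which pass each timestamp belongs to, so this correction cannot be applied reliably to the whole table, which supports dropping the rows.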

In [17]:
annual_cyclistic_trips.drop(date_error.index, inplace=True) # we drop the outliers from the table

### We can also check for malformed or null values in the ride_id column, to make sure this field stays consistent

In [92]:
ride_id_error = annual_cyclistic_trips[annual_cyclistic_trips['ride_id'].str.len() != 16] # all Ids are strings of 16 characters
print(len(ride_id_error))

0
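A stricter variant of this check validates the characters as well as the length. The sketch below assumes that a valid ride_id is made of exactly 16 uppercase hexadecimal characters, which matches the Ids displayed in this notebook but is not documented anywhere in the source. The helper name `malformed_ride_ids` is hypothetical.

```python
import pandas as pd


def malformed_ride_ids(ride_ids):
    """Return the Ids that are null or not 16 uppercase hexadecimal characters."""
    # na=False makes missing values count as malformed instead of propagating NaN
    well_formed = ride_ids.str.fullmatch(r'[0-9A-F]{16}', na=False)
    return ride_ids[~well_formed]


toy_ids = pd.Series(['3F0334B07B4D7F38', 'short', None, '3f0334b07b4d7f38'])
bad = malformed_ride_ids(toy_ids)
```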


### When sorting the monthly files in the prepare phase, we noticed during our presence check that some fields in the start_station_name, start_station_id, end_station_name and end_station_id columns were empty.

When a station_name field is empty, the corresponding station_id on the same row is also empty. The number of occurrences suggests that some users pick up a bike outside a designated station, or do not return it to one. Since these rows can still provide insights, we choose to flag them with a specific label rather than delete them, so that we can analyze them separately later. The results of the analyze phase must nevertheless be read with this limitation in mind.

In [93]:
empty_station = (annual_cyclistic_trips['start_station_name'].isna()) | (annual_cyclistic_trips['end_station_name'].isna()) # we create a Boolean mask that flags the rows with a missing start or end station
print(empty_station.sum())
annual_cyclistic_trips['Missing_trip_station'] = 'Stations Complete' # we create a new column to differentiate the complete rows, with both station names, from the incomplete rows with a missing station
annual_cyclistic_trips.loc[empty_station, 'Missing_trip_station'] = 'Incomplete stations' # we label the incomplete rows 'Incomplete stations'


1862950
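If a finer breakdown is needed later, the same masks can distinguish rides missing only the start station, only the end station, or both. This is a sketch of an optional refinement, not part of the cleaning above: the helper `describe_missing_stations` and its labels other than 'Stations Complete' are hypothetical.

```python
import numpy as np
import pandas as pd


def describe_missing_stations(trips):
    """Label each row by which station name is missing."""
    no_start = trips['start_station_name'].isna()
    no_end = trips['end_station_name'].isna()
    # np.select picks the first condition that is True, so 'both' must come first
    labels = np.select(
        [no_start & no_end, no_start, no_end],
        ['Both missing', 'Start missing', 'End missing'],
        default='Stations Complete',
    )
    return pd.Series(labels, index=trips.index, name='missing_station_detail')


toy = pd.DataFrame({
    'start_station_name': ['A', None, 'C', None],
    'end_station_name': ['B', 'B', None, None],
})
detail = describe_missing_stations(toy)
```

Calling `detail.value_counts()` on the full table would show how the incomplete rows split between the three cases.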


# Adding more columns to prepare for the analysis

**Lastly, we can add a few more columns to our table to help us deliver the expected insights and answers to our manager**

## Measuring ride lengths

### Creating a column to calculate the length of each ride

In [18]:
annual_cyclistic_trips['started_at'] = pd.to_datetime(annual_cyclistic_trips['started_at']) # first we make sure that both date-time columns are in datetime format
annual_cyclistic_trips['ended_at'] = pd.to_datetime(annual_cyclistic_trips['ended_at'])

annual_cyclistic_trips['ride_length'] = annual_cyclistic_trips['ended_at'] - annual_cyclistic_trips['started_at'] # we then calculate the difference between the two points
annual_cyclistic_trips['ride_length_minutes'] = (annual_cyclistic_trips['ride_length'].dt.total_seconds() / 60).round(2) # we convert the ride_length to minutes, which gives a clearer display of the statistics for analysis, and we round the result to two decimals for readability

print(annual_cyclistic_trips[['started_at', 'ended_at','ride_length_minutes']].head()) # we verify the calculation on the first five rows
print(annual_cyclistic_trips['ride_length_minutes'].describe()) # we show key statistics for a better understanding of the dataset

annual_cyclistic_trips.drop(columns=['ride_length'], inplace=True) # we delete the intermediate ride_length column, since the minutes column is easier to use

               started_at                ended_at  ride_length_minutes
0 2025-02-25 21:21:21.171 2025-02-25 21:30:09.941                 8.81
1 2025-02-08 14:55:13.493 2025-02-08 15:13:39.890                18.44
2 2025-02-24 00:32:56.553 2025-02-24 00:38:21.711                 5.42
3 2025-02-07 17:00:38.646 2025-02-07 17:34:29.012                33.84
4 2025-02-10 14:56:56.565 2025-02-10 15:01:18.745                 4.37
count    5.552063e+06
mean     1.609683e+01
std      5.579980e+01
min      0.000000e+00
25%      5.400000e+00
50%      9.440000e+00
75%      1.658000e+01
max      1.574900e+03
Name: ride_length_minutes, dtype: float64


The minimum and maximum lengths returned are surprising: the shortest trip is recorded as 0.00 minutes, i.e. a fraction of a second, whereas the longest one lasted more than 24 hours (1,574.9 minutes, about 26 hours). Since these durations are not consistent with the normal use of a rented city bike, we decide to implement an outlier detection method that will help us remove the inconsistent data from our final table.

### Choosing an outlier detection method

We can start by trying to isolate the outliers based on their distance from the mean, measured in standard deviations, following the two-sigma rule

In [19]:
mean_duration = annual_cyclistic_trips['ride_length_minutes'].mean() # we start by calculating the mean and the standard deviation separately
std_duration = annual_cyclistic_trips['ride_length_minutes'].std()

print(f"Mean duration : {mean_duration:.2f} min")
print(f"Standard Deviation : {std_duration:.2f} min")

standard_deviation_lower_limit = mean_duration - (2 * std_duration) # for this calculation we assume that the data follows a normal distribution, in which case the two-sigma rule says about 95% of the values fall between the two limits
standard_deviation_upper_limit = mean_duration + (2 * std_duration)

print( f"Lower limit : {standard_deviation_lower_limit}")
print( f"Upper limit : {standard_deviation_upper_limit}")

Mean duration : 16.10 min
Standard Deviation : 55.80 min
Lower limit : -95.50277405532776
Upper limit : 127.69644088151472
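A negative lower limit is what the two-sigma rule tends to produce on right-skewed data, where a few very long rides inflate the standard deviation. The sketch below shows this on a made-up sample; the helper `two_sigma_report` and the toy durations are hypothetical and only illustrate the mechanism.

```python
import pandas as pd


def two_sigma_report(durations):
    """Return the two-sigma limits, the skewness and the share of values outside the limits."""
    mean, std = durations.mean(), durations.std()
    lower, upper = mean - 2 * std, mean + 2 * std
    outside = (durations < lower) | (durations > upper)
    return {
        'lower': lower,
        'upper': upper,
        'skewness': durations.skew(),     # > 0 means a long right tail
        'share_outside': outside.mean(),  # fraction of values beyond the limits
    }


# Nine ordinary rides and one very long one, in minutes
toy = pd.Series([4.0, 5.0, 6.0, 8.0, 9.0, 10.0, 12.0, 15.0, 20.0, 600.0])
report = two_sigma_report(toy)
```

In this sample a single long ride is enough to push the lower limit below zero, the same pattern as in the output above.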


We deduce from these findings that the standard deviation is not the best tool to help us remove outliers here. The distribution of ride lengths is strongly right-skewed rather than normal (the mean is 16.10 minutes against a median of 9.44 minutes), a negative lower limit is meaningless for a duration, and an upper limit of only 127.7 minutes seems overly restrictive, since many users rent a bike for a whole day. We decide instead to apply some common-sense rules, based on the business operations of Cyclistic, and define outliers as follows:

- A bike trip shorter than one minute probably doesn't reflect normal use of the product, but rather a technical issue or a bug;
- A bike trip of more than 24 consecutive hours is unusually long for a city bike rental company, and can indicate a technical issue or a theft.

Based on business logic and estimations, we will measure the number of bike trips shorter than 1 minute and longer than 24 hours

### Filtering and dealing with outliers

In [96]:
too_short_trips = annual_cyclistic_trips[annual_cyclistic_trips['ride_length_minutes'] < 1] # we start by filtering trips that are too short to be relevant
too_long_trips = annual_cyclistic_trips[annual_cyclistic_trips['ride_length_minutes'] > 1440] # we do the same for the trips of more than 24 hours

print(f"Number of trips that are too short : {len(too_short_trips)}") # we print the results
print(f"Number of trips that are too long : {len(too_long_trips)}")


Number of trips that are too short : 148372
Number of trips that are too long : 5707


Let's see how these numbers translate into a percentage of the total rides observed

In [97]:
inconsistent_trips = len(too_short_trips) + len(too_long_trips) # number of inconsistent trips
share_of_inconsistent_trips = (inconsistent_trips/len(annual_cyclistic_trips))*100 # share of inconsistent trips in the total number of trips

print(f"Percentage of inconsistent trips : {share_of_inconsistent_trips} %")

Percentage of inconsistent trips : 2.7751666362575493 %
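The two counts and the share can also be derived from Boolean masks in one pass, since the mean of a Boolean Series is the fraction of `True` values. A minimal sketch with a hypothetical helper `outlier_summary`; the thresholds are the business rules defined above.

```python
import pandas as pd

MIN_MINUTES = 1          # shorter rides are treated as technical issues
MAX_MINUTES = 24 * 60    # longer rides are treated as technical issues or thefts


def outlier_summary(minutes):
    """Count too-short and too-long rides and return their combined share in percent."""
    too_short = minutes < MIN_MINUTES
    too_long = minutes > MAX_MINUTES
    return {
        'too_short': int(too_short.sum()),
        'too_long': int(too_long.sum()),
        'share_pct': round(float((too_short | too_long).mean()) * 100, 2),
    }


toy_minutes = pd.Series([0.2, 5.0, 30.0, 1500.0])
summary = outlier_summary(toy_minutes)
```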


Since the outliers defined in the previous step represent less than 5% of all rides, and since we don't consider these rides valid, we decide to delete them from the table before beginning our analysis.

In [98]:
annual_cyclistic_trips = annual_cyclistic_trips[
    (annual_cyclistic_trips['ride_length_minutes'] >= 1) &
    (annual_cyclistic_trips['ride_length_minutes'] <= 1440)
    ] # we choose to keep only the rides that lasted between 1 minute and 1440 minutes

annual_cyclistic_trips.shape # we check that the total count is correctly updated after the deletion
annual_cyclistic_trips['ride_length_minutes'].describe().round(2) # we double check with the new min and max length of rides

count   5397984.00
mean         14.96
std          29.47
min           1.00
25%           5.68
50%           9.69
75%          16.85
max        1439.98
Name: ride_length_minutes, dtype: float64

## Indicating the day of the week for the trips

**Knowing the day of the week when each trip started could give us further insights for comparing casual users with members in our analysis**

### Creation of a day_of_week column

In [99]:
annual_cyclistic_trips['day_of_week'] = annual_cyclistic_trips['started_at'].dt.dayofweek # we return the number of the week day corresponding to the starting date_time of the trip 

print(annual_cyclistic_trips[['started_at', 'day_of_week']].head())

               started_at  day_of_week
0 2025-02-25 21:21:21.171            1
1 2025-02-08 14:55:13.493            5
2 2025-02-24 00:32:56.553            0
3 2025-02-07 17:00:38.646            4
4 2025-02-10 14:56:56.565            0


We now know on which day of the week each trip started, with Monday = 0 and Sunday = 6. To make this data easier to read, we add one last column containing the full name of the day:

In [100]:
annual_cyclistic_trips['day_name'] = annual_cyclistic_trips['started_at'].dt.day_name()
print(annual_cyclistic_trips[['started_at', 'day_of_week', 'day_name']].head())

               started_at  day_of_week  day_name
0 2025-02-25 21:21:21.171            1   Tuesday
1 2025-02-08 14:55:13.493            5  Saturday
2 2025-02-24 00:32:56.553            0    Monday
3 2025-02-07 17:00:38.646            4    Friday
4 2025-02-10 14:56:56.565            0    Monday


This column makes our table easier to read, and will help us explore more insights about the use of Cyclistic's bikes
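One caveat with day names stored as plain text is that they sort alphabetically (Friday, Monday, Saturday...) in tables and charts. A possible remedy for the analysis phase is an ordered categorical, sketched below on toy rows whose dates come from the output above. The helper `add_ordered_day_name` is hypothetical, and the ordering would have to be reapplied after reloading from CSV, since a CSV file does not store it.

```python
import pandas as pd

DAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']


def add_ordered_day_name(trips):
    """Return a copy of trips with day_name stored as an ordered categorical."""
    out = trips.copy()
    out['day_name'] = pd.Categorical(
        out['started_at'].dt.day_name(), categories=DAY_ORDER, ordered=True
    )
    return out


toy = pd.DataFrame({
    'started_at': pd.to_datetime(
        ['2025-02-25 21:21:21', '2025-02-08 14:55:13', '2025-02-24 00:32:56']
    ),
    'member_casual': ['member', 'casual', 'member'],
})
toy = add_ordered_day_name(toy)

# Sorting and grouping now follow the calendar order instead of the alphabet
sorted_days = toy.sort_values('day_name')['day_name'].tolist()
rides_per_day = toy.groupby(['day_name', 'member_casual'], observed=True).size()
```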

Finally, we can export our cleaned and updated table to a CSV file, before beginning the Analysis phase.

In [120]:
annual_cyclistic_trips.to_csv('Cyclistic_Trips_Final_202502_202601.csv', index=False)
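A CSV file stores dates as plain text, so the datetime type of `started_at` and `ended_at` is lost on export and must be restored when the file is read back. The sketch below writes a table, reloads it with `parse_dates`, and compares the shapes. The helper `export_and_verify` is hypothetical and is meant to be tried on a temporary path first.

```python
import pandas as pd


def export_and_verify(trips, path):
    """Write trips to CSV, reload it, and return (same_shape, reloaded)."""
    trips.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that the CSV format cannot store
    reloaded = pd.read_csv(path, parse_dates=['started_at', 'ended_at'])
    return reloaded.shape == trips.shape, reloaded
```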

**We prepared the data and took appropriate cleaning measures for further analysis. This gives us a reliable and credible working base to answer our main question: how exactly does the usage of bikes differ between casual users and members?**