# **Cyclistic bike sharing case-study**  


<img src='/Users/locco/Desktop/Google_Analytics_Course/Case_Studies/Case_Study_Bike_Sharing/python_script/cyclistic.png' width="5" height="5">



### **Data Cleaning and Modelling Notebook**

 **Author**: Karthik Bhaktha  
 **Created**: April 24th, 2022  
 **Data Source**: [Divvy](https://divvy-tripdata.s3.amazonaws.com/index.html)  
 **Data by**: Motivate International Inc. [License](https://www.divvybikes.com/data-license-agreement)  

The purpose of this notebook is to create a master dataset for an analysis that answers the question:  
**How do annual members and casual riders use Cyclistic bikes differently?**  
The answer will inform a strategy for converting **casual** riders into **annual** members.  
According to the financial analyst, doing so could lead to higher profits.


The data source is a collection of 13 CSV files.  
Each file contains one month of data, covering the 13 months from March 2021 to March 2022.  

### Cleaning steps performed in this notebook:
- Merge all CSV files into a single dataframe.
- Create calculated columns: total_seconds and distance_miles.
- Check for data integrity.
- Deal with null values.

In [35]:
import pandas as pd
import missingno as msno #for visualizing null values
import os
import haversine as hs
from haversine import Unit #for calculating distance using coordinates
import datetime as dt
import numpy as np
import plotly.express as px

The function below walks a directory tree and returns the full path of every file in it.  
For this dataset it returns a list of 13 file paths.  
The list is then used to read the files and concatenate them into a master dataframe. 

In [2]:
def getListOfFiles(dirName):
    '''For the given path, return a list of all files in the directory tree.'''
    # create a list of the file and subdirectory
    # names in the given directory
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
                
    return allFiles
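
The same recursive listing can also be sketched with the standard-library `pathlib` module. This is an alternative, not the method used in this notebook; the helper name `list_csv_files`, the `.csv` filter and the demo file names are assumptions added for illustration.

```python
from pathlib import Path
import tempfile


def list_csv_files(dir_name):
    """Return the sorted paths of all .csv files under dir_name, searched recursively."""
    return sorted(str(p) for p in Path(dir_name).rglob("*.csv") if p.is_file())


# Small demonstration on a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.csv").write_text("ride_id\n1\n")
    (root / "nested" / "b.csv").write_text("ride_id\n2\n")
    (root / "notes.txt").write_text("not a csv\n")

    found = list_csv_files(root)
    found_names = [Path(p).name for p in found]
    print(found_names)
```

Filtering on the extension keeps stray non-CSV files in the folder from being passed to `pd.read_csv`, and sorting makes the read order reproducible.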

The code below:  
- Collects the file paths.
- Reads each CSV file and concatenates the results into a master dataframe.

In [3]:
#Creating the master dataframe
path =  getListOfFiles('/Users/locco/Desktop/Google_Analytics_Course/Case_Studies/Bike_share_Data/bike_share_datasets_03-2021_to_03-2022')
# run the code below to confirm the path list.
#print(path)
master_df = pd.concat(map(pd.read_csv,path), ignore_index= True)

In [4]:
#checking the shape of master_df
master_df.info()
master_df.head()
master_df.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5952028 entries, 0 to 5952027
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 590.3+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
5952023,EF56D7D1D612AC11,electric_bike,2021-05-20 16:32:14,2021-05-20 16:35:39,Blackstone Ave & Hyde Park Blvd,13398,,,41.802581,-87.59023,41.8,-87.6,member
5952024,745191CB9F21DE3C,classic_bike,2021-05-29 16:40:37,2021-05-29 17:22:37,Sheridan Rd & Montrose Ave,TA1307000107,Michigan Ave & Oak St,13042,41.96167,-87.65464,41.90096,-87.623777,casual
5952025,428575BAA5356BFF,electric_bike,2021-05-31 14:24:54,2021-05-31 14:31:38,Sheridan Rd & Montrose Ave,TA1307000107,,,41.961525,-87.654651,41.95,-87.65,member
5952026,FC8A4A7AB7249662,electric_bike,2021-05-25 16:01:33,2021-05-25 16:07:37,Sheridan Rd & Montrose Ave,TA1307000107,,,41.961654,-87.654721,41.98,-87.66,member
5952027,E873B8AA3EE84678,docked_bike,2021-05-12 12:22:14,2021-05-12 12:30:27,Sheridan Rd & Montrose Ave,TA1307000107,Clark St & Grace St,TA1307000127,41.96167,-87.65464,41.95078,-87.659172,casual


The code below:
- Loops through each CSV file
- Finds the size (rows × columns) of each dataframe
- Calculates the total size

In [5]:
#checking if the merging csv was successful by comparing the size
total = 0
for i in range(len(path)):
    df = pd.read_csv(path[i])
    # uncomment the line below to confirm the size of each dataframe
    #print("Size = ", df.size, "of path", path[i])
    total += df.size


print("The size of merged files and total size from loop above are equal",total, master_df.size)

The size of merged files and total size from loop above are equal 77376364 77376364
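
`DataFrame.size` is the number of rows multiplied by the number of columns, so the 77,376,364 above corresponds to 5,952,028 rows × 13 columns. A slightly stricter check compares row counts and column names separately. The sketch below uses two tiny in-memory CSVs with made-up data rather than the real monthly files.

```python
import io

import pandas as pd

# Two tiny in-memory "files" standing in for the monthly CSVs (made-up data).
csv_a = "ride_id,member_casual\nA1,member\nA2,casual\n"
csv_b = "ride_id,member_casual\nB1,casual\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv_a, csv_b)]
merged = pd.concat(frames, ignore_index=True)

# The merged frame should hold every row of every input frame...
rows_expected = sum(len(frame) for frame in frames)
# ...and every input frame should have the same columns as the merged one.
same_columns = all(list(frame.columns) == list(merged.columns) for frame in frames)

print(rows_expected, len(merged), same_columns)
```

Checking the two dimensions separately would catch a file with a missing or extra column, which a comparison of `size` alone could miss if the totals happened to match.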


### Creating a new dataframe called final_df to perform further operations without disturbing the master dataframe

In [6]:
#creating a copy of master_df so that later in-place changes do not affect it.
final_df = master_df.copy()
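
Plain assignment in Python binds a second name to the same object, whereas `DataFrame.copy()` creates an independent dataframe. The sketch below shows the difference on a made-up two-row frame; the variable names are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"ride_id": ["A1", "A2"], "start_station_name": ["x", None]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

# An in-place change made through the alias is visible through `original` too.
alias.drop(columns=["start_station_name"], inplace=True)

same_object = alias is original
original_columns = list(original.columns)
independent_columns = list(independent.columns)

print(same_object, original_columns, independent_columns)
```

Only the copy still has the dropped column, which matters here because the later cells modify `final_df` in place.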

Checking for null values in final_df.

In [7]:
final_df.isna().sum()

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    760224
start_station_id      760221
end_station_name      812974
end_station_id        812974
start_lat                  0
start_lng                  0
end_lat                 4883
end_lng                 4883
member_casual              0
dtype: int64
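
To judge how much data is missing, the counts above can be expressed as a share of the 5,952,028 rows. The sketch below recomputes the percentages from the counts printed by `isna().sum()`; the variable names are illustrative.

```python
import pandas as pd

# Null counts copied from the isna().sum() output above.
null_counts = pd.Series(
    {
        "start_station_name": 760224,
        "start_station_id": 760221,
        "end_station_name": 812974,
        "end_station_id": 812974,
        "end_lat": 4883,
        "end_lng": 4883,
    }
)
total_rows = 5952028  # number of entries reported by master_df.info()

null_share_pct = (null_counts / total_rows * 100).round(2)
print(null_share_pct)
```

On the dataframe itself, `final_df.isna().mean() * 100` gives the same percentages in one step. Roughly 13 to 14 percent of the station fields are missing, against well under 0.1 percent of the end coordinates.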

Dropping the station name and station ID columns, since they have many missing values and are not used in this analysis.  
Dropping the rows with null values in end_lat and end_lng. They are a small fraction of the rows, so removing them should not affect the analysis much.

In [8]:
# removing the columns that won't be used in this analysis
final_df.drop(columns=['start_station_name','start_station_id','end_station_name','end_station_id'], inplace = True)
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5952028 entries, 0 to 5952027
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   ride_id        object 
 1   rideable_type  object 
 2   started_at     object 
 3   ended_at       object 
 4   start_lat      float64
 5   start_lng      float64
 6   end_lat        float64
 7   end_lng        float64
 8   member_casual  object 
dtypes: float64(4), object(5)
memory usage: 408.7+ MB


In [9]:
# removing the rows with null values
final_df.dropna(subset=['end_lat', 'end_lng'], inplace=True)
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5947145 entries, 0 to 5952027
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   ride_id        object 
 1   rideable_type  object 
 2   started_at     object 
 3   ended_at       object 
 4   start_lat      float64
 5   start_lng      float64
 6   end_lat        float64
 7   end_lng        float64
 8   member_casual  object 
dtypes: float64(4), object(5)
memory usage: 453.7+ MB


In [10]:
# checking for null values again after removing the columns and rows
final_df.isna().sum()


ride_id          0
rideable_type    0
started_at       0
ended_at         0
start_lat        0
start_lng        0
end_lat          0
end_lng          0
member_casual    0
dtype: int64

Converting the columns 'started_at', 'ended_at' from data type object to datetime.  
This is to calculate the duration of the ride.

In [11]:
#typecasting 'started_at', 'ended_at' into datetime using the .astype('datetime64[ns]') method.
final_df.loc[:,['started_at', 'ended_at']] = final_df.loc[:,['started_at', 'ended_at']].astype('datetime64[ns]')

#Creating a calculated column total_time = |ended_at - started_at|
#abs() keeps the duration positive even when ended_at is earlier than started_at
final_df['total_time'] = abs(final_df.ended_at - final_df.started_at)

#confirming that the data type has changed
final_df.info()
#confirming that the type casting has not altered the column values
final_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5947145 entries, 0 to 5952027
Data columns (total 10 columns):
 #   Column         Dtype          
---  ------         -----          
 0   ride_id        object         
 1   rideable_type  object         
 2   started_at     datetime64[ns] 
 3   ended_at       datetime64[ns] 
 4   start_lat      float64        
 5   start_lng      float64        
 6   end_lat        float64        
 7   end_lng        float64        
 8   member_casual  object         
 9   total_time     timedelta64[ns]
dtypes: datetime64[ns](2), float64(4), object(3), timedelta64[ns](1)
memory usage: 499.1+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_time
0,9DC7B962304CBFD8,electric_bike,2021-09-28 16:07:10,2021-09-28 16:09:54,41.89,-87.68,41.89,-87.67,casual,0 days 00:02:44
1,F930E2C6872D6B32,electric_bike,2021-09-28 14:24:51,2021-09-28 14:40:05,41.94,-87.64,41.98,-87.67,casual,0 days 00:15:14
2,6EF72137900BB910,electric_bike,2021-09-28 00:20:16,2021-09-28 00:23:57,41.81,-87.72,41.8,-87.72,casual,0 days 00:03:41
3,78D1DE133B3DBF55,electric_bike,2021-09-28 14:51:17,2021-09-28 15:00:06,41.8,-87.72,41.81,-87.72,casual,0 days 00:08:49
4,E03D4ACDCAEF6E00,electric_bike,2021-09-28 09:53:12,2021-09-28 10:03:44,41.88,-87.74,41.88,-87.71,casual,0 days 00:10:32


As you can see, the calculated column is formatted as a number of days followed by hh:mm:ss.  
The next cells convert it to a total number of seconds and store the result in a new column called total_seconds. 

In [12]:
'''This loop used iterrows and took about 2m:30s to run.
Using dictionary iteration cuts the run time significantly, to only about 20 seconds.
'''
#calculating the duration of the ride in seconds and adding it to a new column.
# for index, rows in final_df.iterrows():
#     final_df.at[index, 'total_seconds'] = rows['total_time'].total_seconds()

'This loop used iterrows and took about 2m:30s to run.\nUsing dictionary iteration cuts the run time significantly, to only about 20 seconds.\n'

Extracting just the total_time column to make the loop run faster, 
because the loop then does not have to read all the columns in the dataframe.

In [13]:
# extracting total_time column from final_df
total_time_df = final_df[['total_time']]
total_time_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5947145 entries, 0 to 5952027
Data columns (total 1 columns):
 #   Column      Dtype          
---  ------      -----          
 0   total_time  timedelta64[ns]
dtypes: timedelta64[ns](1)
memory usage: 90.7 MB


In [14]:
'''Improved loop using dictionary iteration:'''
time_list = []
#calculating the duration of the ride in seconds and adding it to a new column.
for rows in total_time_df.to_dict('records'):
    time_list.append(rows['total_time'].total_seconds())

total_time_dict = {'total_seconds':time_list}

total_time_dict is now a dictionary.  
In the code cell below we convert the dictionary into a dataframe and concatenate it column-wise with final_df.

In [15]:
# converting dict to df
total_seconds_df = pd.DataFrame.from_dict(total_time_dict)
final_df = pd.concat([final_df,total_seconds_df], axis = 1)

# checking final_df
final_df.info()
final_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5951997 entries, 0 to 5946153
Data columns (total 11 columns):
 #   Column         Dtype          
---  ------         -----          
 0   ride_id        object         
 1   rideable_type  object         
 2   started_at     datetime64[ns] 
 3   ended_at       datetime64[ns] 
 4   start_lat      float64        
 5   start_lng      float64        
 6   end_lat        float64        
 7   end_lng        float64        
 8   member_casual  object         
 9   total_time     timedelta64[ns]
 10  total_seconds  float64        
dtypes: datetime64[ns](2), float64(5), object(3), timedelta64[ns](1)
memory usage: 544.9+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_time,total_seconds
0,9DC7B962304CBFD8,electric_bike,2021-09-28 16:07:10,2021-09-28 16:09:54,41.89,-87.68,41.89,-87.67,casual,0 days 00:02:44,164.0
1,F930E2C6872D6B32,electric_bike,2021-09-28 14:24:51,2021-09-28 14:40:05,41.94,-87.64,41.98,-87.67,casual,0 days 00:15:14,914.0
2,6EF72137900BB910,electric_bike,2021-09-28 00:20:16,2021-09-28 00:23:57,41.81,-87.72,41.8,-87.72,casual,0 days 00:03:41,221.0
3,78D1DE133B3DBF55,electric_bike,2021-09-28 14:51:17,2021-09-28 15:00:06,41.8,-87.72,41.81,-87.72,casual,0 days 00:08:49,529.0
4,E03D4ACDCAEF6E00,electric_bike,2021-09-28 09:53:12,2021-09-28 10:03:44,41.88,-87.74,41.88,-87.71,casual,0 days 00:10:32,632.0
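
The `info()` output above reports 5,951,997 entries, although final_df had 5,947,145 rows before the concatenation. That is consistent with `pd.concat(..., axis=1)` aligning on index labels: after `dropna`, final_df keeps its original labels (0 to 5952027, with gaps), while the new dataframe gets a fresh 0 to n-1 index. The sketch below reproduces the effect on a made-up three-row frame and shows a vectorized alternative with `.dt.total_seconds()`; it is an illustration under those assumptions, not a rerun of the notebook.

```python
import pandas as pd

# Toy frame whose index has a gap, as happens after dropna (label 1 is missing).
rides = pd.DataFrame(
    {"total_time": [pd.Timedelta(seconds=s) for s in (164, 914, 221)]},
    index=[0, 2, 3],
)

# Rebuilding the values in a new frame gives them a fresh 0..n-1 index.
seconds = pd.DataFrame(
    {"total_seconds": [t.total_seconds() for t in rides["total_time"]]}
)

# concat(axis=1) aligns on labels: the union of both indexes is used, so
# rows are padded with NaN and some values land next to the wrong ride.
misaligned = pd.concat([rides, seconds], axis=1)

# Assigning on the frame itself keeps every value on its own row.
aligned = rides.copy()
aligned["total_seconds"] = aligned["total_time"].dt.total_seconds()

print(misaligned)
print(aligned)
```

In the misaligned result, the ride with label 2 lasts 15 minutes 14 seconds but is paired with 221.0 seconds. Resetting the index after `dropna`, or assigning the new column directly as in `aligned`, avoids this.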


In [16]:
# dropping total_time column because it is not needed.
final_df.drop(columns='total_time', axis = 1, inplace = True)
final_df.info()
final_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5951997 entries, 0 to 5946153
Data columns (total 10 columns):
 #   Column         Dtype         
---  ------         -----         
 0   ride_id        object        
 1   rideable_type  object        
 2   started_at     datetime64[ns]
 3   ended_at       datetime64[ns]
 4   start_lat      float64       
 5   start_lng      float64       
 6   end_lat        float64       
 7   end_lng        float64       
 8   member_casual  object        
 9   total_seconds  float64       
dtypes: datetime64[ns](2), float64(5), object(3)
memory usage: 499.5+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_seconds
0,9DC7B962304CBFD8,electric_bike,2021-09-28 16:07:10,2021-09-28 16:09:54,41.89,-87.68,41.89,-87.67,casual,164.0
1,F930E2C6872D6B32,electric_bike,2021-09-28 14:24:51,2021-09-28 14:40:05,41.94,-87.64,41.98,-87.67,casual,914.0
2,6EF72137900BB910,electric_bike,2021-09-28 00:20:16,2021-09-28 00:23:57,41.81,-87.72,41.8,-87.72,casual,221.0
3,78D1DE133B3DBF55,electric_bike,2021-09-28 14:51:17,2021-09-28 15:00:06,41.8,-87.72,41.81,-87.72,casual,529.0
4,E03D4ACDCAEF6E00,electric_bike,2021-09-28 09:53:12,2021-09-28 10:03:44,41.88,-87.74,41.88,-87.71,casual,632.0


Creating a new column with the total distance travelled.  
The distance is computed from the start and end coordinates with the [haversine formula](https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b),  
using the haversine library.  
By default haversine returns the distance in kilometres; the code below requests miles instead.
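
For reference, the haversine formula can be written directly with NumPy so that it works on whole columns at once. This is a sketch of the formula, not the implementation inside the haversine library; the function name `haversine_miles` and the Earth-radius constant are assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)
KM_PER_MILE = 1.609344


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles; accepts scalars or NumPy arrays in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a)) / KM_PER_MILE


# Coordinates of the first three rides in the dataframe preview above.
start_lat = np.array([41.89, 41.94, 41.81])
start_lng = np.array([-87.68, -87.64, -87.72])
end_lat = np.array([41.89, 41.98, 41.80])
end_lng = np.array([-87.67, -87.67, -87.72])

distance_miles = haversine_miles(start_lat, start_lng, end_lat, end_lng)
print(distance_miles.round(3))
```

For these three rides the result is roughly 0.51, 3.16 and 0.69 miles. Because the function operates on arrays, it could be applied to the four coordinate columns in one call instead of looping over rows.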

In [17]:
'''Improved loop in the code cells below'''
#iterating over the dataframe to calculate the haversine distance

# for index, row in final_df.iterrows():
    # calculating distance in miles
    # loc1= (row['start_lat'], row['start_lng'] )
    # loc2= (row['end_lat'], row['end_lng'] )
    # final_df.at[index, 'total_distance_miles'] = hs.haversine(loc1, loc2, unit= Unit.MILES )
     


'Improved loop in the code cells below'

In [18]:
#extracting just the lat and lng columns
lat_lng_df = final_df[['start_lat','start_lng','end_lat','end_lng']]
lat_lng_df.head()

Unnamed: 0,start_lat,start_lng,end_lat,end_lng
0,41.89,-87.68,41.89,-87.67
1,41.94,-87.64,41.98,-87.67
2,41.81,-87.72,41.8,-87.72
3,41.8,-87.72,41.81,-87.72
4,41.88,-87.74,41.88,-87.71


In [19]:
'''Using dictionary iteration'''
distance_list = []

for row in lat_lng_df.to_dict('records'):
    distance_list.append(hs.haversine((row['start_lat'], row['start_lng'] ), (row['end_lat'], row['end_lng'] ), unit= Unit.MILES ))

distance_dictionary = {"distance_miles": distance_list}       

In [20]:
distance_df = pd.DataFrame.from_dict(distance_dictionary)
# final_df['distance'] = distance_dictionary
final_df = pd.concat([final_df, distance_df], axis=1)
final_df.info()
final_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5952028 entries, 0 to 5951948
Data columns (total 11 columns):
 #   Column          Dtype         
---  ------          -----         
 0   ride_id         object        
 1   rideable_type   object        
 2   started_at      datetime64[ns]
 3   ended_at        datetime64[ns]
 4   start_lat       float64       
 5   start_lng       float64       
 6   end_lat         float64       
 7   end_lng         float64       
 8   member_casual   object        
 9   total_seconds   float64       
 10  distance_miles  float64       
dtypes: datetime64[ns](2), float64(6), object(3)
memory usage: 544.9+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_seconds,distance_miles
0,9DC7B962304CBFD8,electric_bike,2021-09-28 16:07:10,2021-09-28 16:09:54,41.89,-87.68,41.89,-87.67,casual,164.0,0.514351
1,F930E2C6872D6B32,electric_bike,2021-09-28 14:24:51,2021-09-28 14:40:05,41.94,-87.64,41.98,-87.67,casual,914.0,3.164496
2,6EF72137900BB910,electric_bike,2021-09-28 00:20:16,2021-09-28 00:23:57,41.81,-87.72,41.8,-87.72,casual,221.0,0.690934
3,78D1DE133B3DBF55,electric_bike,2021-09-28 14:51:17,2021-09-28 15:00:06,41.8,-87.72,41.81,-87.72,casual,529.0,0.690934
4,E03D4ACDCAEF6E00,electric_bike,2021-09-28 09:53:12,2021-09-28 10:03:44,41.88,-87.74,41.88,-87.71,casual,632.0,1.543294


distance_miles is the straight-line (great-circle) distance between the start and end coordinates.  
The distance actually ridden may differ, because a route on the map is rarely a straight line between two coordinates.  

In [21]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5952028 entries, 0 to 5951948
Data columns (total 11 columns):
 #   Column          Dtype         
---  ------          -----         
 0   ride_id         object        
 1   rideable_type   object        
 2   started_at      datetime64[ns]
 3   ended_at        datetime64[ns]
 4   start_lat       float64       
 5   start_lng       float64       
 6   end_lat         float64       
 7   end_lng         float64       
 8   member_casual   object        
 9   total_seconds   float64       
 10  distance_miles  float64       
dtypes: datetime64[ns](2), float64(6), object(3)
memory usage: 544.9+ MB


Checking for null values after adding the calculated columns.

In [22]:
final_df.isna().sum()

ride_id           4883
rideable_type     4883
started_at        4883
ended_at          4883
start_lat         4883
start_lng         4883
end_lat           4883
end_lng           4883
member_casual     4883
total_seconds     4883
distance_miles    4883
dtype: int64

In [23]:
#dropping rows that contain any NaN
final_df.dropna(inplace = True)


In [24]:
final_df.isna().sum()

ride_id           0
rideable_type     0
started_at        0
ended_at          0
start_lat         0
start_lng         0
end_lat           0
end_lng           0
member_casual     0
total_seconds     0
distance_miles    0
dtype: int64

In [25]:
unique_rides = final_df['rideable_type'].unique()
print(unique_rides)
unique_member = final_df['member_casual'].unique()
print(unique_member)

['electric_bike' 'classic_bike' 'docked_bike']
['casual' 'member']


In [26]:
# checking the min and max dates in final_df to see if there are any errors in the dates.
print(final_df.started_at.min())
print(final_df.ended_at.min())
print(final_df.started_at.max())
print(final_df.ended_at.max())

2021-03-01 00:01:09
2021-03-01 00:06:28
2022-03-31 23:59:47
2022-04-01 22:10:12


In [27]:
# checking total_seconds column for any unusual values.
print(final_df.total_seconds.min())
print(final_df.total_seconds.max())
'''Fixed this in the code cell where time difference is calculated'''
# filter out rows with negative total_seconds
# less_than_zero_df = final_df[final_df['total_seconds'] < 0]
# less_than_zero_df.info()
# less_than_zero_df

0.0
3356649.0


'Fixed this in the code cell where time difference is calculated'

In [28]:
# checking if start time is larger than end time, because you cannot travel back in time.
result_df = final_df[final_df['started_at'] > final_df['ended_at']]
result_df.info()
'''There are 147 rows where started_at is greater than ended_at. It is not possible to travel back in time... yet.'''

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 8950 to 5818887
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   ride_id         147 non-null    object        
 1   rideable_type   147 non-null    object        
 2   started_at      147 non-null    datetime64[ns]
 3   ended_at        147 non-null    datetime64[ns]
 4   start_lat       147 non-null    float64       
 5   start_lng       147 non-null    float64       
 6   end_lat         147 non-null    float64       
 7   end_lng         147 non-null    float64       
 8   member_casual   147 non-null    object        
 9   total_seconds   147 non-null    float64       
 10  distance_miles  147 non-null    float64       
dtypes: datetime64[ns](2), float64(6), object(3)
memory usage: 13.8+ KB


'There are 147 rows where started_at is greater than ended_at. It is not possible to travel back in time... yet.'

In [29]:
# keeping only the rows where started_at is earlier than ended_at;
# this removes the reversed rows and also rides with identical start and end times
final_df = final_df[final_df['started_at'] < final_df['ended_at']]
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5941624 entries, 0 to 5947144
Data columns (total 11 columns):
 #   Column          Dtype         
---  ------          -----         
 0   ride_id         object        
 1   rideable_type   object        
 2   started_at      datetime64[ns]
 3   ended_at        datetime64[ns]
 4   start_lat       float64       
 5   start_lng       float64       
 6   end_lat         float64       
 7   end_lng         float64       
 8   member_casual   object        
 9   total_seconds   float64       
 10  distance_miles  float64       
dtypes: datetime64[ns](2), float64(6), object(3)
memory usage: 544.0+ MB
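
The strict `<` in the filter above removes two kinds of rows: rides whose start is later than their end, and rides whose start and end timestamps are identical. The sketch below separates the two cases on made-up timestamps so that each can be counted on its own.

```python
import pandas as pd

# Three made-up rides: one normal, one with zero duration, one reversed.
rides = pd.DataFrame(
    {
        "started_at": pd.to_datetime(
            ["2021-06-01 10:00:00", "2021-06-01 10:00:00", "2021-06-01 10:05:00"]
        ),
        "ended_at": pd.to_datetime(
            ["2021-06-01 10:10:00", "2021-06-01 10:00:00", "2021-06-01 10:00:00"]
        ),
    }
)

reversed_mask = rides["started_at"] > rides["ended_at"]   # end before start
zero_mask = rides["started_at"] == rides["ended_at"]      # no elapsed time
kept = rides[rides["started_at"] < rides["ended_at"]]     # same filter as above

n_reversed = int(reversed_mask.sum())
n_zero = int(zero_mask.sum())
n_kept = len(kept)
print(n_reversed, n_zero, n_kept)
```

Counting the two masks before filtering makes it clear how many rows are dropped for each reason.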


In [30]:
# checking the distance for any invalid values like negative distance
print(final_df.distance_miles.min())
print(final_df.distance_miles.max())
result_df = final_df[final_df['distance_miles'] == 739.1357259057336]
result_df
# checking for the largest distance travelled by a rider
desc_dist_df = final_df.sort_values(by = ['distance_miles'], ascending = False)
desc_dist_df

'''The ride lasts only 155 seconds but the distance travelled is 739 miles; that is not possible, so it is most likely an error'''


0.0
739.1357259057336


'The ride lasts only 155 seconds but the distance travelled is 739 miles; that is not possible, so it is most likely an error'
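
One way to surface such rows systematically is to derive an average speed and flag anything above a plausible limit. The sketch below uses the 155-second, 739-mile ride noted above plus two rides from the earlier preview; the 40 mph threshold and the column names `speed_mph` and `implausible` are assumptions for illustration.

```python
import pandas as pd

# Durations and distances: two rides from the preview and the 739-mile outlier.
rides = pd.DataFrame(
    {
        "total_seconds": [164, 914, 155],
        "distance_miles": [0.514351, 3.164496, 739.1357259057336],
    }
)

MAX_PLAUSIBLE_MPH = 40  # assumed upper bound for a bike ride

# Average speed in miles per hour = miles / hours.
rides["speed_mph"] = rides["distance_miles"] / (rides["total_seconds"] / 3600)
rides["implausible"] = rides["speed_mph"] > MAX_PLAUSIBLE_MPH

flags = rides["implausible"].tolist()
print(rides)
```

Rows with zero seconds would divide by zero and produce infinite or undefined speeds, so they need to be handled before a check like this is applied.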

In [31]:
# checking the number of rows in final_df with distance = 0 miles
'''Zero miles does not necessarily mean that it is an error; the rider may have ended
the ride at the same location where it began'''
distance_zero = final_df[final_df['distance_miles'] == 0]
distance_zero


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_seconds,distance_miles
8,1F9A9A6BA4C2F82E,electric_bike,2021-09-27 19:57:09,2021-09-27 20:09:08,41.95,-87.76,41.950000,-87.760000,casual,719.0,0.0
13,3575930BD49A07EB,electric_bike,2021-09-29 14:06:59,2021-09-29 14:07:25,41.94,-87.71,41.940000,-87.710000,casual,26.0,0.0
17,8E0C65BBC7771CBF,electric_bike,2021-09-08 00:47:12,2021-09-08 00:56:04,41.92,-87.64,41.920000,-87.640000,casual,532.0,0.0
18,BD04A1DD83063A44,electric_bike,2021-09-08 06:31:49,2021-09-08 07:21:46,41.80,-87.63,41.800000,-87.630000,casual,2997.0,0.0
26,62E81647CEEF07C4,electric_bike,2021-09-03 02:54:28,2021-09-03 03:22:21,41.89,-87.76,41.890000,-87.760000,casual,1673.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
5943695,F13B21C37D0F4E5E,electric_bike,2021-05-22 21:49:48,2021-05-22 22:13:15,41.88,-87.63,41.870000,-87.630000,casual,128.0,0.0
5943700,439508E452EEA80F,electric_bike,2021-05-23 11:37:24,2021-05-23 11:41:41,41.95,-87.71,41.950000,-87.700000,casual,34.0,0.0
5943702,3D5CE82A58A80C56,electric_bike,2021-05-23 13:05:45,2021-05-23 13:45:03,41.88,-87.68,41.930000,-87.700000,casual,826.0,0.0
5943707,4ED3C4906C8651E0,electric_bike,2021-05-16 16:35:22,2021-05-16 16:46:10,41.88,-87.66,41.888764,-87.644548,casual,294.0,0.0


In [32]:
# sorting the dataframe in desc order of total_seconds
desc_seconds = final_df.sort_values(by = ['total_seconds'], ascending = False)
desc_seconds

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_seconds,distance_miles
4478423,2539589A82655D57,electric_bike,2021-06-25 16:43:50,2021-06-25 16:48:57,41.936523,-87.647502,41.943730,-87.648877,member,3356649.0,2.625705
4428216,628C9B2ABE1D20C9,classic_bike,2021-06-10 08:57:52,2021-06-10 09:08:14,41.921822,-87.644140,41.904613,-87.640552,member,3341501.0,3.532445
5462237,8B032EF0738AA1F9,classic_bike,2021-05-30 14:39:50,2021-05-30 15:11:07,41.940775,-87.639192,41.984037,-87.652310,member,3235296.0,0.430222
3965127,F0FBD9CBE361D871,electric_bike,2021-06-17 17:09:01,2021-06-17 17:18:33,41.912926,-87.664209,41.896358,-87.661111,casual,3162083.0,2.574650
1732162,3A11DA2AC5BD4DD2,classic_bike,2021-07-22 06:09:09,2021-07-22 06:15:10,41.890762,-87.631697,41.894503,-87.617854,casual,2946429.0,3.073715
...,...,...,...,...,...,...,...,...,...,...,...
2648721,D062D528655DA57D,classic_bike,2022-03-15 06:11:30,2022-03-15 06:15:57,41.968987,-87.696027,41.966400,-87.688704,casual,0.0,0.010688
2599491,A7D51956DD32DE9F,classic_bike,2022-03-21 08:40:32,2022-03-21 08:43:52,41.943034,-87.687288,41.946655,-87.683359,member,0.0,0.000000
3271494,94032335629AB7E9,electric_bike,2021-10-09 15:07:21,2021-10-09 15:11:58,41.790000,-87.600000,41.800000,-87.590000,member,0.0,0.264850
2548288,850598DB1949BF0C,electric_bike,2022-03-25 21:18:09,2022-03-25 21:24:33,41.884276,-87.629836,41.886349,-87.617517,member,0.0,0.003316


Adding a column with the day of the week as a number, from 0 for Monday to 6 for Sunday.  
Also adding a day name column (Monday to Sunday) and a month name column.

In [33]:
# adding days of week names
final_df['days_of_week'] = final_df['started_at'].dt.day_name()
# final_df.head()
print('Unique days',final_df['days_of_week'].unique())

# adding days of week values 0-6
final_df['days_of_week_val'] = final_df['started_at'].dt.dayofweek
# final_df.head()
print('Unique days value' ,final_df['days_of_week_val'].unique())

# adding months column
final_df['month'] = final_df['started_at'].dt.month_name()
# final_df.head()
print('Unique month' ,final_df['month'].unique())

final_df.head()

Unique days ['Tuesday' 'Monday' 'Wednesday' 'Saturday' 'Friday' 'Thursday' 'Sunday']
Unique days value [1 0 2 5 4 3 6]
Unique month ['September' 'April' 'July' 'November' 'December' 'March' 'February'
 'January' 'October' 'June' 'August' 'May']


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_seconds,distance_miles,days_of_week,days_of_week_val,month
0,9DC7B962304CBFD8,electric_bike,2021-09-28 16:07:10,2021-09-28 16:09:54,41.89,-87.68,41.89,-87.67,casual,164.0,0.514351,Tuesday,1,September
1,F930E2C6872D6B32,electric_bike,2021-09-28 14:24:51,2021-09-28 14:40:05,41.94,-87.64,41.98,-87.67,casual,914.0,3.164496,Tuesday,1,September
2,6EF72137900BB910,electric_bike,2021-09-28 00:20:16,2021-09-28 00:23:57,41.81,-87.72,41.8,-87.72,casual,221.0,0.690934,Tuesday,1,September
3,78D1DE133B3DBF55,electric_bike,2021-09-28 14:51:17,2021-09-28 15:00:06,41.8,-87.72,41.81,-87.72,casual,529.0,0.690934,Tuesday,1,September
4,E03D4ACDCAEF6E00,electric_bike,2021-09-28 09:53:12,2021-09-28 10:03:44,41.88,-87.74,41.88,-87.71,casual,632.0,1.543294,Tuesday,1,September
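
The numbering convention can be confirmed on a few known dates. In the sketch below, the first two timestamps come from rides shown in this notebook and the third is made up to land on a Sunday.

```python
import pandas as pd

started_at = pd.Series(
    pd.to_datetime(
        [
            "2021-09-27 19:57:09",  # from the notebook output
            "2021-09-28 16:07:10",  # from the notebook output
            "2021-10-03 09:00:00",  # made-up timestamp on a Sunday
        ]
    )
)

day_names = started_at.dt.day_name().tolist()
day_values = started_at.dt.dayofweek.tolist()

for name, value in zip(day_names, day_values):
    print(value, name)
```

`dt.dayofweek` counts from Monday as 0, so Sunday is 6. Any weekday grouping done later in R needs to account for that, since other tools may start the week on a different day.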


Converting total_seconds from float to integer, since the ride durations are whole seconds.


In [46]:
final_df.loc[:,['total_seconds']] = final_df.loc[:,['total_seconds']].astype('int64')

final_df.info()
final_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5941624 entries, 0 to 5947144
Data columns (total 14 columns):
 #   Column            Dtype         
---  ------            -----         
 0   ride_id           object        
 1   rideable_type     object        
 2   started_at        datetime64[ns]
 3   ended_at          datetime64[ns]
 4   start_lat         float64       
 5   start_lng         float64       
 6   end_lat           float64       
 7   end_lng           float64       
 8   member_casual     object        
 9   total_seconds     int64         
 10  distance_miles    float64       
 11  days_of_week      object        
 12  days_of_week_val  int64         
 13  month             object        
dtypes: datetime64[ns](2), float64(5), int64(2), object(5)
memory usage: 680.0+ MB


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_lat,start_lng,end_lat,end_lng,member_casual,total_seconds,distance_miles,days_of_week,days_of_week_val,month
0,9DC7B962304CBFD8,electric_bike,2021-09-28 16:07:10,2021-09-28 16:09:54,41.89,-87.68,41.89,-87.67,casual,164,0.514351,Tuesday,1,September
1,F930E2C6872D6B32,electric_bike,2021-09-28 14:24:51,2021-09-28 14:40:05,41.94,-87.64,41.98,-87.67,casual,914,3.164496,Tuesday,1,September
2,6EF72137900BB910,electric_bike,2021-09-28 00:20:16,2021-09-28 00:23:57,41.81,-87.72,41.8,-87.72,casual,221,0.690934,Tuesday,1,September
3,78D1DE133B3DBF55,electric_bike,2021-09-28 14:51:17,2021-09-28 15:00:06,41.8,-87.72,41.81,-87.72,casual,529,0.690934,Tuesday,1,September
4,E03D4ACDCAEF6E00,electric_bike,2021-09-28 09:53:12,2021-09-28 10:03:44,41.88,-87.74,41.88,-87.71,casual,632,1.543294,Tuesday,1,September


Storing the cleaned and merged data in a CSV file to carry out further analysis in R.

In [None]:
final_df.to_csv('/Users/locco/Desktop/Google_Analytics_Course/Case_Studies/Bike_share_Data/merged_csv.csv' ,index=False)