In [1]:
import pandas as pd
import numpy as np
import datetime as dt

In [2]:
ride_sharing = pd.read_csv('../datasets/ride_sharing_new.csv')
ride_sharing.head()

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,tire_sizes,ride_date
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male,27,1/19/2020
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male,26,10/24/2018
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male,26,12/25/2017
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male,27,8/4/2025
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male,27,1/29/2019


- __Data Type Constraints__: Understanding that data must be in the correct format for analysis. For example, numerical data represented as strings can lead to incorrect analysis outcomes.
- __Converting Data Types__: You practiced converting string data to numerical types using the __.astype()__ method. This is crucial when you need to perform mathematical operations on data initially read as strings.
- __Stripping Unwanted Characters__: Using the __.str.strip()__ method to remove unwanted characters from data, such as currency symbols, which allows for the conversion of string data to numerical types.
- __Assert Statements__: You learned to use assert statements to verify that your data transformations have been successful. For instance, ensuring that a column's data type has been changed as expected.

// Strip duration of minutes

__ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')__

// Convert duration to integer

__ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')__

// Assert the conversion was successful

__assert ride_sharing['duration_time'].dtype == 'int'__
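One caveat about the first step: `.str.strip('minutes')` treats its argument as a set of characters to remove from both ends, not as a literal suffix. It works here because the digits and the space are not in that set, and the integer conversion tolerates the leftover space. The sketch below uses a small made-up Series (not the real dataset) to contrast it with `.str.replace()`, which removes the literal text.

```python
import pandas as pd

# Hypothetical sample values, not taken from the ride_sharing file
duration = pd.Series(['12 minutes', '24 minutes', '8 minutes'])

# strip() removes any of the characters m, i, n, u, t, e, s from both ends,
# so the space before "minutes" is left behind
trimmed = duration.str.strip('minutes')
print(trimmed.tolist())   # ['12 ', '24 ', '8 ']

# replace() with regex=False removes the literal substring " minutes"
replaced = duration.str.replace(' minutes', '', regex=False)
print(replaced.tolist())  # ['12', '24', '8']

# Both convert to the same integers, because the conversion ignores
# surrounding whitespace
assert (trimmed.astype('int') == replaced.astype('int')).all()
```

If a value could legitimately start or end with one of the stripped letters, the `.str.replace()` form is probably the safer choice.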

#### `Summing strings and concatenating numbers`
In the previous exercise, you were able to identify that __category__ is the correct data type for __user_type__ and convert it in order to extract relevant statistical summaries that shed light on the distribution of __user_type__.

Another common data type problem is importing what should be numerical values as strings, as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs.
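A minimal sketch of that concatenation problem, using made-up values held as Python strings in an object column (the `amounts` name is only for illustration):

```python
import pandas as pd

# Hypothetical values stored as text, as they might arrive from a CSV
amounts = pd.Series(['12', '24', '8'], dtype=object)

# Summing strings concatenates them
total_as_text = amounts.sum()
print(total_as_text)      # 12248

# After conversion the sum is arithmetic
total_as_number = amounts.astype('int').sum()
print(total_as_number)    # 44
```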

In this exercise, you'll be converting the string column __duration__ to the type ___int___. Before that, however, you will need to strip "__minutes__" from the column so that pandas can read it as numerical. The pandas package has been imported as pd.

- Use the __.strip()__ method to strip __duration__ of "__minutes__" and store it in the __duration_trim__ column.
- Convert __duration_trim__ to int and store it in the __duration_time__ column.
- Write an __assert__ statement that checks if __duration_time__'s data type is now an __int__.
- Print the average ride duration.

In [3]:
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes")

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype(int)

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration
print(ride_sharing[['duration', 'duration_trim', 'duration_time']])
print(ride_sharing['duration_time'].mean())

         duration duration_trim  duration_time
0      12 minutes           12              12
1      24 minutes           24              24
2       8 minutes            8               8
3       4 minutes            4               4
4      11 minutes           11              11
...           ...           ...            ...
25755  11 minutes           11              11
25756  10 minutes           10              10
25757  14 minutes           14              14
25758  14 minutes           14              14
25759  29 minutes           29              29

[25760 rows x 3 columns]
11.389052795031056


In [4]:
ride_sharing.head()

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,tire_sizes,ride_date,duration_trim,duration_time
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male,27,1/19/2020,12,12
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male,26,10/24/2018,24,24
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male,26,12/25/2017,8,8
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male,27,8/4/2025,4,4
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male,27,1/29/2019,11,11


In [5]:
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

count     25760
unique        2
top          27
freq      13274
Name: tire_sizes, dtype: int64


In [6]:
print(ride_sharing['tire_sizes'].unique())

[27, 26]
Categories (2, int32): [26, 27]


The __top__ row in the description reports the most frequent value, which here is 27, also the new maximum; the __unique__ count of 2 confirms that only two sizes remain. Notice how essential it was to convert __tire_sizes__ into integer before setting a new maximum.
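The reason the integer conversion matters can be sketched as follows, assuming the column arrives as an unordered category (as the exercise text implies). The values here are made up for illustration.

```python
import pandas as pd

# Hypothetical tire sizes stored as an unordered category
tire_sizes = pd.Series([26, 27, 29, 27], dtype='category')

# Ordering comparisons are not defined for unordered categories
try:
    tire_sizes > 27
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)  # True

# Convert to int, cap at 27, then convert back to category
capped = tire_sizes.astype('int').clip(upper=27).astype('category')
print(capped.cat.categories.tolist())  # [26, 27]
```

Here `.clip(upper=27)` is used as an alternative to the `.loc` assignment in the cell above; both cap the values at 27.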

#### `Back to the future`
The data pipeline feeding into the __ride_sharing__ DataFrame has been updated to register each ride's date. This information is stored in the __ride_date__ column of the type __object__, which represents strings in __pandas__.

A bug was discovered which was relaying rides taken today as taken next year. To fix this, you will find all instances of the __ride_date__ column that occur anytime in the future, and set the maximum possible value of this column to today's date. Before doing so, you need to convert __ride_date__ to a __datetime__ object.

The __datetime__ package has been imported as dt, alongside all the packages you've been using till now.

- Convert __ride_date__ to a datetime object using __to_datetime()__, then convert the __datetime__ object into a ___date___ and store it in __ride_dt__ column.
- Create the variable __today__, which stores today's date by using the __dt.date.today()__ function.
- For all instances of __ride_dt__ in the future, set them to today's date.
- Print the maximum date in the __ride_dt__ column.

In [7]:
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

2024-08-11
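The same cap can also be applied while keeping the column as `datetime64`, which preserves the `.dt` accessor for later steps. The sketch below is one possible variant, not the exercise's solution: the sample rows are made up, and `cutoff` is a fixed stand-in for today's date so that the result is reproducible.

```python
import pandas as pd

# Hypothetical ride dates in the month/day/year layout seen in ride_date
rides = pd.DataFrame({'ride_date': ['1/19/2020', '12/25/2017', '8/4/2025']})

# Parse with an explicit format so the layout is not guessed
rides['ride_dt'] = pd.to_datetime(rides['ride_date'], format='%m/%d/%Y')

# Fixed stand-in for "today" (an assumption, chosen for reproducibility)
cutoff = pd.Timestamp('2024-08-11')

# Cap every date after the cutoff
rides.loc[rides['ride_dt'] > cutoff, 'ride_dt'] = cutoff

print(rides['ride_dt'].max())  # 2024-08-11 00:00:00
```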


In [8]:
ride_sharing['ride_dt'].value_counts(ascending=False)

ride_dt
2024-08-11    5534
2017-09-29      34
2017-02-13      32
2018-11-24      32
2017-11-20      31
              ... 
2019-09-09       8
2019-08-09       8
2019-05-10       7
2019-07-18       7
2017-06-10       6
Name: count, Length: 1146, dtype: int64

Imagine counting the number of rides taken today without having cleaned your ranges correctly. You would have wildly underreported your findings!

In [9]:
print(ride_sharing['ride_date'].sort_values(ascending=True))

4684     1/1/2017
18079    1/1/2017
24000    1/1/2017
24512    1/1/2017
20351    1/1/2017
           ...   
14118    9/9/2019
1984     9/9/2019
23991    9/9/2019
6911     9/9/2019
9463     9/9/2019
Name: ride_date, Length: 25760, dtype: object
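The output above is sorted as text, not chronologically: strings are compared character by character, so `1/1/2017` comes first and `9/9/2019` last regardless of year. A small sketch with made-up dates shows the difference once the column is parsed:

```python
import pandas as pd

# Hypothetical dates as strings in month/day/year layout
ride_date = pd.Series(['9/9/2019', '10/24/2018', '1/19/2020'])

# Text sort: '/' sorts before '0', and '1' before '9'
text_sorted = ride_date.sort_values().tolist()
print(text_sorted)
# ['1/19/2020', '10/24/2018', '9/9/2019']

# Chronological sort after parsing
parsed = pd.to_datetime(ride_date, format='%m/%d/%Y')
date_sorted = parsed.sort_values().dt.strftime('%Y-%m-%d').tolist()
print(date_sorted)
# ['2018-10-24', '2019-09-09', '2020-01-19']
```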


- The importance of ensuring data falls within a specified range, using examples like movie ratings and subscription dates, to prevent analysis errors.
- Methods to address out-of-range data, including dropping the data, setting custom limits, or assigning a custom value to out-of-range data points.
- Practical steps to implement these methods in Python using pandas, such as filtering data with conditions, modifying values with the __.loc__ method, and ensuring changes with __assert__ statements.

For example, to correct movie ratings exceeding the maximum allowed value, you used:

__movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5__

This code snippet sets any __avg_rating__ values above ___5___ to ___5___, ensuring all ratings fall within the acceptable range.

Additionally, you practiced converting data types and applying constraints to a ride_sharing DataFrame, ensuring tire sizes and ride dates were within realistic limits. This included converting columns to appropriate data types, using conditional logic to identify and correct out-of-range values, and verifying the corrections.
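The three ways of handling out-of-range data listed above can be sketched side by side. The `movies` rows and the `movie_name` column are made up for illustration, and marking values as missing is only one possible choice of custom value.

```python
import pandas as pd

# Hypothetical ratings on a scale that tops out at 5; 6.0 and 7.0 are out of range
movies = pd.DataFrame({
    'movie_name': ['A', 'B', 'C', 'D'],
    'avg_rating': [4.5, 6.0, 3.0, 7.0],
})

# Option 1: drop the out-of-range rows
dropped = movies[movies['avg_rating'] <= 5]

# Option 2: cap values at the limit
capped = movies.copy()
capped.loc[capped['avg_rating'] > 5, 'avg_rating'] = 5

# Option 3: mark out-of-range values as missing
masked = movies.copy()
masked['avg_rating'] = masked['avg_rating'].where(masked['avg_rating'] <= 5)

# Verify the cap with an assert statement
assert capped['avg_rating'].max() <= 5

print(len(dropped))                      # 2
print(capped['avg_rating'].tolist())     # [4.5, 5.0, 3.0, 5.0]
print(masked['avg_rating'].isna().sum()) # 2
```

Dropping is usually reasonable only when few rows are affected; capping or masking keeps the remaining columns of those rows available.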

In [10]:
import pandas as pd

loans = pd.DataFrame({
    'first_name': ['John', 'Jane', 'John', 'Jane', 'Bob'],
    'last_name': ['Doe', 'Doe', 'Doe', 'Smith', 'Smith'],
    'loan_amount': [1000, 2000, 1500, 500, 800]
})
loans

Unnamed: 0,first_name,last_name,loan_amount
0,John,Doe,1000
1,Jane,Doe,2000
2,John,Doe,1500
3,Jane,Smith,500
4,Bob,Smith,800


In [11]:
print(loans.duplicated(subset=['first_name', 'last_name'], keep=False))

0     True
1    False
2     True
3    False
4    False
dtype: bool


In [12]:
print(loans.duplicated(subset=['first_name', 'last_name'], keep='first'))

0    False
1    False
2     True
3    False
4    False
dtype: bool


`Subsetting on metadata and keeping all duplicate records gives you a better bird's-eye view over your data and how to de-duplicate it! You can even subset the loans DataFrame using bracketing and sort the values so you can properly identify the duplicates.`
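A minimal sketch of that bracket subsetting and sorting, reusing the `loans` DataFrame defined earlier (the `is_duplicate` and `duplicated_loans` names are only for illustration):

```python
import pandas as pd

loans = pd.DataFrame({
    'first_name': ['John', 'Jane', 'John', 'Jane', 'Bob'],
    'last_name': ['Doe', 'Doe', 'Doe', 'Smith', 'Smith'],
    'loan_amount': [1000, 2000, 1500, 500, 800]
})

# Flag every row whose name pair appears more than once
is_duplicate = loans.duplicated(subset=['first_name', 'last_name'], keep=False)

# Subset with the boolean mask and sort so matching rows sit together
duplicated_loans = loans[is_duplicate].sort_values(['first_name', 'last_name'])
print(duplicated_loans)
```

Only the two John Doe rows are returned, which makes the differing loan amounts easy to compare.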

#### `Finding duplicates`
A new update to the data pipeline feeding into __ride_sharing__ has added the __ride_id__ column, which represents a unique identifier for each ride.

The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the __ride_sharing__ DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of __ride_sharing__ is in your environment, as well as all the packages you've been working with thus far.

- Find duplicated rows of __ride_id__ in the __ride_sharing__ DataFrame while setting ___keep___ to ___False___.
- Subset __ride_sharing__ on __duplicates__ and sort by __ride_id__ and assign the results to __duplicated_rides__.
- Print the __ride_id__, __duration__ and __user_birth_year__ columns of __duplicated_rides__ in that order.

In [13]:
ride_sharing_1 = pd.read_excel('../datasets/ride_sharing.xlsx')
ride_sharing_1.head(2)

Unnamed: 0,ride_id,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,tire_sizes,ride_date
0,0,11,16,Steuart St at Market St,93,4th St at Mission Bay Blvd S,5504,Subscriber,1988,Male,27,2018-03-04
1,1,8,3,Powell St BART Station (Market St at 4th St),93,4th St at Mission Bay Blvd S,2915,Subscriber,1988,Male,27,2017-03-27


In [14]:
# Find duplicates
duplicates = ride_sharing_1.duplicated(subset='ride_id', keep=False)

# Sort your duplicated rides
duplicated_rides = ride_sharing_1[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id', 'duration', 'user_birth_year']])

    ride_id  duration  user_birth_year
22       33        10             1979
39       33         2             1979
53       55         9             1985
65       55         9             1985
74       71        11             1997
75       71        11             1997
76       89         9             1986
77       89         9             2060


In [15]:
print(duplicated_rides)

    ride_id  duration  station_A_id  \
22       33        10            30   
39       33         2            30   
53       55         9             3   
65       55         9             3   
74       71        11             3   
75       71        11             3   
76       89         9            15   
77       89         9            15   

                                       station_A_name  station_B_id  \
22     San Francisco Caltrain (Townsend St at 4th St)            59   
39     San Francisco Caltrain (Townsend St at 4th St)            59   
53       Powell St BART Station (Market St at 4th St)           115   
65       Powell St BART Station (Market St at 4th St)           115   
74       Powell St BART Station (Market St at 4th St)            44   
75       Powell St BART Station (Market St at 4th St)            44   
76  San Francisco Ferry Building (Harry Bridges Pl...            81   
77  San Francisco Ferry Building (Harry Bridges Pl...            81   

        

Notice that rides 33 and 89 are incomplete duplicates, whereas the remaining ones are complete. You'll learn how to treat them in the next exercise!
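One way to separate the two kinds programmatically is to compare a duplicate check on __ride_id__ alone with a check across all columns. The sketch below assumes a small made-up sample that mirrors the pattern in the output above.

```python
import pandas as pd

# Hypothetical sample: ride 33 differs in duration, ride 55 is identical
rides = pd.DataFrame({
    'ride_id': [33, 33, 55, 55],
    'duration': [10, 2, 9, 9],
    'user_birth_year': [1979, 1979, 1985, 1985],
})

# Rows sharing a ride_id
same_id = rides.duplicated(subset='ride_id', keep=False)

# Rows identical in every column
same_row = rides.duplicated(keep=False)

# Incomplete duplicates share an id but differ in some other column
incomplete_ids = rides.loc[same_id & ~same_row, 'ride_id'].unique().tolist()
complete_ids = rides.loc[same_row, 'ride_id'].unique().tolist()
print(incomplete_ids, complete_ids)  # [33] [55]
```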

In [16]:
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing_1.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset='ride_id', keep=False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0

- __Identifying Duplicates__: You used the __.duplicated()__ method to find duplicate rows in a DataFrame. This method returns a Series of boolean values that, by default, marks duplicates as ___True___ except for their first occurrence.
- __Customizing Duplicate Identification__: By adjusting the __subset__ and __keep__ arguments of the __.duplicated()__ method, you learned to refine how duplicates are identified. The __subset__ argument allows focusing on specific columns, while __keep__ controls which duplicates to mark as ___True___.
- __Removing Duplicates__: The __.drop_duplicates()__ method was used to remove duplicate rows from a DataFrame. Similar to __.duplicated()__, it also supports __subset__ and __keep__ arguments for targeted removal of duplicates.
- __Treating Incomplete Duplicates__: For duplicates with discrepancies in certain columns, you explored combining them using statistical measures (e.g., mean or maximum) via the __groupby__ and __agg__ methods. This approach is useful for resolving inconsistencies within duplicated data.

For example, to find and sort duplicate rides in a ride_sharing DataFrame by ride_id, you used:

    duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)
    duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')
    print(duplicated_rides[['ride_id', 'duration', 'user_birth_year']])
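The removal and merging options from the summary can be sketched on a small made-up sample. Which statistic to aggregate with is a judgment call; `min` and `mean` below simply follow the earlier cell.

```python
import pandas as pd

# Hypothetical sample: ride 33 is an incomplete duplicate, ride 55 a complete one
rides = pd.DataFrame({
    'ride_id': [33, 33, 55, 55],
    'duration': [10, 2, 9, 9],
    'user_birth_year': [1979, 1979, 1985, 1985],
})

# Remove rows that are identical in every column
no_complete = rides.drop_duplicates()

# Keep only the first row per ride_id, discarding later conflicting rows
first_only = rides.drop_duplicates(subset='ride_id', keep='first')

# Or merge conflicting rows with summary statistics
merged = (no_complete.groupby('ride_id')
          .agg({'user_birth_year': 'min', 'duration': 'mean'})
          .reset_index())

print(len(no_complete), len(first_only))  # 3 2
print(merged['duration'].tolist())        # [6.0, 9.0]
```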

In [17]:
ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Unnamed: 0       25760 non-null  int64   
 1   duration         25760 non-null  object  
 2   station_A_id     25760 non-null  int64   
 3   station_A_name   25760 non-null  object  
 4   station_B_id     25760 non-null  int64   
 5   station_B_name   25760 non-null  object  
 6   bike_id          25760 non-null  int64   
 7   user_type        25760 non-null  int64   
 8   user_birth_year  25760 non-null  int64   
 9   user_gender      25760 non-null  object  
 10  tire_sizes       25760 non-null  category
 11  ride_date        25760 non-null  object  
 12  duration_trim    25760 non-null  object  
 13  duration_time    25760 non-null  int32   
 14  ride_dt          25760 non-null  object  
dtypes: category(1), int32(1), int64(6), object(7)
memory usage: 2.7+ MB


In [18]:
ride_sharing_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   ride_id          78 non-null     int64         
 1   duration         78 non-null     int64         
 2   station_A_id     78 non-null     int64         
 3   station_A_name   78 non-null     object        
 4   station_B_id     78 non-null     int64         
 5   station_B_name   78 non-null     object        
 6   bike_id          78 non-null     int64         
 7   user_type        78 non-null     object        
 8   user_birth_year  78 non-null     int64         
 9   user_gender      78 non-null     object        
 10  tire_sizes       78 non-null     int64         
 11  ride_date        78 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(7), object(4)
memory usage: 7.4+ KB
