# Finding duplicates

A new update to the data pipeline feeding into `ride_sharing` has added the `ride_id` column, which represents a unique identifier for each ride.

However, the update coincided with radically shorter average ride durations and irregular user birth dates set in the future. Most importantly, the number of rides taken increased by 20% overnight, which suggests there may be both complete and incomplete duplicates in the `ride_sharing` DataFrame.

In this exercise, you will confirm this suspicion by finding those duplicates. A sample of `ride_sharing` is in your environment, as well as all the packages you've been working with thus far.
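
To make the distinction between complete and incomplete duplicates concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` data and the `complete`, `same_id` and `incomplete` names are illustrative assumptions, not part of the exercise data. A complete duplicate matches another row in every column, while an incomplete duplicate shares its `ride_id` with another row but differs in at least one other column.

```python
import pandas as pd

# Hypothetical toy data, not taken from ride_sharing
toy = pd.DataFrame({
    "ride_id": [1, 1, 2, 2, 3],
    "duration": ["5 minutes", "5 minutes", "7 minutes", "9 minutes", "4 minutes"],
    "user_birth_year": [1990, 1990, 1985, 1985, 2000],
})

# Complete duplicates: every column matches another row
complete = toy.duplicated(keep=False)

# Rows that share a ride_id with at least one other row
same_id = toy.duplicated(subset="ride_id", keep=False)

# Incomplete duplicates: same ride_id, but some other column differs
incomplete = same_id & ~complete

print(toy[complete])
print(toy[incomplete])
```

In this sketch, the two rows with `ride_id` 1 are complete duplicates, and the two rows with `ride_id` 2 are incomplete duplicates because their durations disagree.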

In [1]:
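import pandas as pd
import datetime as dt
import random

# Location of the exercise data (local path from the original notebook)
path = r'Z:/'
file = 'ride_sharing_new.csv'

# Load the ride sharing data and preview the first rows
ride_sharing = pd.read_csv(path + file)
ride_sharing.head()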

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male


In [6]:
print(len(ride_sharing))

77


In [10]:
# Simulate a ride_id column: IDs are drawn at random with replacement, so some will repeat
ride_sharing['ride_id'] = [random.randint(1, len(ride_sharing)) for _ in range(len(ride_sharing))]
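
Because the IDs are drawn at random, each run of the cell produces a different set of duplicates. A minimal sketch of one way to make the draw repeatable is shown below, assuming a fixed seed. The seed value `42` and the names `n_rows`, `ids_a`, `ids_b` and `n_repeated_rows` are illustrative assumptions and do not come from the original notebook.

```python
import random
import pandas as pd

n_rows = 77  # matches the len(ride_sharing) printed above

# Hypothetical seed: any fixed integer makes the draw repeatable
random.seed(42)
ids_a = [random.randint(1, n_rows) for _ in range(n_rows)]

# Re-seeding with the same value reproduces the same sequence
random.seed(42)
ids_b = [random.randint(1, n_rows) for _ in range(n_rows)]

# Count rows whose ID appears more than once in the draw
ids = pd.Series(ids_a, name="ride_id")
n_repeated_rows = int(ids.duplicated(keep=False).sum())

print(ids_a == ids_b)
print(n_repeated_rows)
```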

In [11]:
ride_sharing.head()

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender,ride_id,ride_id_test
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male,19,13
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male,25,16
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male,28,48
3,3,4 minutes,16,Steuart St at Market St,28,The Embarcadero at Bryant St,1883,1,1979,Male,30,31
4,4,11 minutes,22,Howard St at Beale St,350,8th St at Brannan St,4626,2,1994,Male,10,20


### Instructions
* Find rows with a duplicated `ride_id` in the `ride_sharing` DataFrame, setting `keep` to `False` so that every occurrence is flagged, and assign the result to `duplicates`.
* Subset `ride_sharing` on `duplicates`, sort by `ride_id`, and assign the result to `duplicated_rides`.
* Print the `ride_id`, `duration` and `user_birth_year` columns of `duplicated_rides`, in that order.
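
The effect of the `keep` argument can be sketched on a small made-up column. The `toy_ids` data and the variable names below are illustrative assumptions. With `keep='first'` or `keep='last'`, one occurrence of each repeated value is left unflagged, whereas `keep=False` flags every occurrence, which is what makes it possible to inspect all conflicting rows side by side.

```python
import pandas as pd

# Hypothetical toy data, not taken from ride_sharing
toy_ids = pd.DataFrame({"ride_id": [7, 3, 7, 5, 3, 7]})

# Flag repeats, leaving the first occurrence of each ID unflagged
first = toy_ids.duplicated(subset="ride_id", keep="first")

# Flag repeats, leaving the last occurrence of each ID unflagged
last = toy_ids.duplicated(subset="ride_id", keep="last")

# Flag every row whose ID appears more than once
every = toy_ids.duplicated(subset="ride_id", keep=False)

# Subset on the mask and sort so rows with the same ID sit together
toy_duplicated = toy_ids[every].sort_values("ride_id")
print(toy_duplicated)
```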


In [8]:
# Find duplicates
duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])

    ride_id     duration  user_birth_year
55        6   11 minutes             1999
75        6   11 minutes             1999
4        10   11 minutes             1994
54       10   12 minutes             1999
49       10    9 minutes             1991
62       11    4 minutes             1991
23       11    3 minutes             1992
25       14    3 minutes             1990
22       14   10 minutes             1982
76       16   18 minutes             1980
17       16    5 minutes             1964
0        19   12 minutes             1959
10       19   19 minutes             1983
9        20    5 minutes             1994
32       20   12 minutes             1980
30       25    6 minutes             1988
1        25   24 minutes             1965
42       27    3 minutes             1977
13       27  187 minutes             1987
8        29   21 minutes             1982
7        29    9 minutes             1991
18       37   16 minutes             2000
67       37   19 minutes          