# DataFrames II: Filtering Data

In [2]:
import pandas as pd
import datetime as dt

## This Module's Dataset + Memory Optimization
- The `pd.to_datetime` function converts a **Series** of strings into datetime values.
- The `format` parameter informs pandas of the format that the times are stored in.
- We pass symbols designating the segments of the string. For example, `%m` means "month" and `%d` means "day".
- The `dt` attribute reveals an object with many datetime-related attributes and methods.
- The `dt.time` attribute extracts only the time from each value in a datetime **Series**.
- Use the `astype` method to convert the values in a **Series** to another type.
- The `parse_dates` parameter of `read_csv` is an alternate way to parse strings as datetimes.
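
The points above can be tried on a small frame before touching the real file. This is a minimal sketch that assumes a hypothetical three-row `demo` DataFrame with made-up values shaped like `employees.csv`. Note that `%I` (12-hour clock) is the directive that pairs with `%p` (AM/PM).

```python
import pandas as pd

# Hypothetical miniature of the employees data (illustrative values only).
demo = pd.DataFrame({
    "Start Date": ["8/6/1993", "3/31/1996", "4/23/1993"],
    "Last Login Time": ["12:42 PM", "6:53 AM", "1:00 PM"],
    "Gender": ["Male", "Male", "Female"],
})

# %m/%d/%Y = month / day / four-digit year.
demo["Start Date"] = pd.to_datetime(demo["Start Date"], format="%m/%d/%Y")

# %I is the 12-hour clock hour, which pairs with %p (AM/PM).
login = pd.to_datetime(demo["Last Login Time"], format="%I:%M %p")

# dt.time keeps only the time-of-day portion of each datetime.
demo["Last Login Time"] = login.dt.time

# astype converts a column to another dtype, here the category dtype.
demo["Gender"] = demo["Gender"].astype("category")

print(demo.dtypes)
```

After `dt.time`, the column holds plain `datetime.time` objects, so pandas reports its dtype as `object` rather than `datetime64[ns]`.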

In [4]:
employees = pd.read_csv("employees.csv")

In [5]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


In [6]:
employees

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.170,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1/31/1984,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,5/20/2013,12:39 PM,96914,1.421,False,Product
998,Larry,Male,4/20/2013,4:45 PM,60500,11.985,False,Business Development


In [7]:
pd.to_datetime(employees["Start Date"])


0     1993-08-06
1     1996-03-31
2     1993-04-23
3     2005-03-04
4     1998-01-24
         ...    
995   2014-11-23
996   1984-01-31
997   2013-05-20
998   2013-04-20
999   2012-05-15
Name: Start Date, Length: 1000, dtype: datetime64[ns]

In [8]:
pd.to_datetime(employees["Start Date"], format="%m/%d/%Y")

0     1993-08-06
1     1996-03-31
2     1993-04-23
3     2005-03-04
4     1998-01-24
         ...    
995   2014-11-23
996   1984-01-31
997   2013-05-20
998   2013-04-20
999   2012-05-15
Name: Start Date, Length: 1000, dtype: datetime64[ns]

In [9]:
employees["Start Date"] = pd.to_datetime(employees["Start Date"], format="%m/%d/%Y")


In [10]:
employees["Start Date"]

0     1993-08-06
1     1996-03-31
2     1993-04-23
3     2005-03-04
4     1998-01-24
         ...    
995   2014-11-23
996   1984-01-31
997   2013-05-20
998   2013-04-20
999   2012-05-15
Name: Start Date, Length: 1000, dtype: datetime64[ns]

In [11]:
# Use %I (12-hour clock) with %p; with %H the AM/PM marker would not shift the hour
employees["Last Login Time"] = pd.to_datetime(employees["Last Login Time"], format="%I:%M %p")

In [12]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    object        
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   datetime64[ns]
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  933 non-null    object        
 7   Team               957 non-null    object        
dtypes: datetime64[ns](2), float64(1), int64(1), object(4)
memory usage: 62.6+ KB


In [13]:
# Caution: astype(bool) turns the 67 missing values (NaN) into True
employees["Senior Management"] = employees["Senior Management"].astype(bool)
employees["Senior Management"]

0       True
1       True
2      False
3       True
4       True
       ...  
995    False
996    False
997    False
998    False
999     True
Name: Senior Management, Length: 1000, dtype: bool

In [14]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    object        
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   datetime64[ns]
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  1000 non-null   bool          
 7   Team               957 non-null    object        
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 55.8+ KB


In [15]:
sample = employees["Gender"].astype("category")

In [16]:
sample.info()

<class 'pandas.core.series.Series'>
RangeIndex: 1000 entries, 0 to 999
Series name: Gender
Non-Null Count  Dtype   
--------------  -----   
855 non-null    category
dtypes: category(1)
memory usage: 1.2 KB
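
To see where the savings come from, the same column can be measured as `object` and as `category`. This is a rough sketch on a hypothetical 1,000-value Series, not the real dataset. `memory_usage(deep=True)` asks pandas to count the Python string objects themselves, so the byte counts will vary by platform and pandas version.

```python
import pandas as pd

# Hypothetical column: 1,000 strings drawn from only two distinct values.
gender = pd.Series(["Male", "Female"] * 500, dtype="object")

# deep=True includes the memory held by the string objects themselves.
as_object = gender.memory_usage(deep=True)
as_category = gender.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

A category column stores each distinct value once plus a small integer code per row, which is why it pays off for columns with few unique values.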



## Filter A DataFrame Based On A Condition
- Pandas needs a **Series** of Booleans to perform a filter.
- Pass the Boolean Series inside square brackets after the **DataFrame**.
- We can generate a Boolean Series using a wide variety of operations (equality, inequality, less than, greater than, inclusion, etc.).
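
The two steps (build a Boolean Series, then pass it in square brackets) look like this on a small example. The `demo` frame below is a hypothetical stand-in with a few illustrative rows.

```python
import pandas as pd

# Hypothetical four-row frame (illustrative values only).
demo = pd.DataFrame({
    "First Name": ["Douglas", "Maria", "Jerry", "Dennis"],
    "Team": ["Marketing", "Finance", "Finance", "Legal"],
    "Salary": [97308, 130590, 138705, 115163],
})

# Step 1: a comparison produces a Boolean Series, one value per row.
in_finance = demo["Team"] == "Finance"

# Step 2: passing that Series in square brackets keeps only the True rows.
finance_rows = demo[in_finance]

# Both steps can also be written inline.
high_earners = demo[demo["Salary"] > 110000]

print(finance_rows)
print(high_earners)
```

The filtered result keeps the original index labels, which is why filtered outputs show non-consecutive row numbers.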

In [19]:
employees = pd.read_csv("employees.csv")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [20]:
employees["Gender"] == "Male"

0       True
1       True
2      False
3       True
4       True
       ...  
995    False
996     True
997     True
998     True
999     True
Name: Gender, Length: 1000, dtype: bool

In [21]:
employees[employees["Gender"] == "Male"]

employees[employees["Team"] == "Finance"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
7,,Female,7/20/2015,10:43 AM,45906,11.598,,Finance
14,Kimberly,Female,1/14/1999,7:13 AM,41426,14.543,True,Finance
46,Bruce,Male,11/28/2009,10:47 PM,114796,6.796,False,Finance
...,...,...,...,...,...,...,...,...
907,Elizabeth,Female,7/27/1998,11:12 AM,137144,10.081,False,Finance
954,Joe,Male,1/19/1980,4:06 PM,119667,1.148,True,Finance
987,Gloria,Female,12/8/2014,5:08 AM,136709,10.331,True,Finance
992,Anthony,Male,10/16/2011,8:35 AM,112769,11.625,True,Finance


In [117]:
employees[employees["Senior Management"] == True]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.170,True,
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services
6,Ruby,Female,1987-08-17,4:20 PM,65476,10.012,True,Product
...,...,...,...,...,...,...,...,...
991,Rose,Female,2002-08-25,5:12 AM,134505,11.051,True,Marketing
992,Anthony,Male,2011-10-16,8:35 AM,112769,11.625,True,Finance
993,Tina,Female,1997-05-15,3:53 PM,56450,19.040,True,Engineering
994,George,Male,2013-06-21,5:47 PM,98874,4.479,True,Marketing


In [23]:
employees[employees["Salary"] > 110000]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.340,True,Finance
5,Dennis,Male,4/18/1987,1:35 AM,115163,10.125,False,Legal
9,Frances,Female,8/8/2002,6:51 AM,139852,7.524,True,Business Development
12,Brandon,Male,12/1/1980,1:08 AM,112807,17.492,True,Human Resources
...,...,...,...,...,...,...,...,...
987,Gloria,Female,12/8/2014,5:08 AM,136709,10.331,True,Finance
991,Rose,Female,8/25/2002,5:12 AM,134505,11.051,True,Marketing
992,Anthony,Male,10/16/2011,8:35 AM,112769,11.625,True,Finance
995,Henry,,11/23/2014,6:09 AM,132483,16.655,False,Distribution


In [121]:
employees[employees["Bonus %"] < 1.5]

employees[employees["Start Date"] == "8/6/1993"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing


## Filter with More than One Condition (AND)
- Add the `&` operator in between two Boolean **Series** to filter by multiple conditions.
- We can assign the **Series** to variables to make the syntax more readable.
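
A short sketch of both styles, using a hypothetical `demo` frame with illustrative values. When conditions are written inline, each comparison needs its own parentheses because `&` binds more tightly than `==` or `>` in Python.

```python
import pandas as pd

# Hypothetical frame (illustrative values only).
demo = pd.DataFrame({
    "First Name": ["Tina", "Shirley", "Douglas", "Maria"],
    "Gender": ["Female", "Female", "Male", "Female"],
    "Team": ["Marketing", "Marketing", "Marketing", "Finance"],
    "Salary": [100705, 93000, 97308, 130590],
})

# Named conditions keep the final filter readable.
is_female = demo["Gender"] == "Female"
is_in_marketing = demo["Team"] == "Marketing"
earns_six_figures = demo["Salary"] > 100000

# A row survives only if every condition is True for it.
result = demo[is_female & is_in_marketing & earns_six_figures]

# Inline form: wrap each comparison in parentheses.
inline = demo[(demo["Gender"] == "Female") & (demo["Salary"] > 100000)]

print(result)
print(inline)
```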

In [26]:


# Load the dataset and parse "Start Date" as datetimes
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"])

# Parse "Last Login Time" with an explicit 12-hour format, then keep only the time
employees["Last Login Time"] = pd.to_datetime(employees["Last Login Time"], format="%I:%M %p").dt.time

# Convert "Senior Management" to Boolean (missing values become True)
employees["Senior Management"] = employees["Senior Management"].astype(bool)

# Convert "Gender" to the memory-efficient category dtype
employees["Gender"] = employees["Gender"].astype("category")

# Display the first few rows of the DataFrame
employees.head()


Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,16:47:00,101004,1.389,True,Client Services


In [27]:
is_female = employees["Gender"] == "Female"
is_in_marketing = employees["Team"] == "Marketing"
is_sal = employees["Salary"] > 100000

In [28]:
employees[is_female & is_in_marketing & is_sal]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
98,Tina,Female,2016-06-16,19:47:00,100705,16.961,True,Marketing
140,Shirley,Female,1981-02-28,13:23:00,113850,1.854,False,Marketing
158,Norma,Female,1999-02-28,20:45:00,114412,8.756,True,Marketing
305,Margaret,Female,1993-02-06,13:05:00,125220,3.733,False,Marketing
319,Jacqueline,Female,1981-11-25,15:01:00,145988,18.243,False,Marketing
379,,Female,2002-09-18,12:39:00,118906,4.537,True,Marketing
468,Janice,Female,1997-06-28,13:48:00,136032,10.696,True,Marketing
490,Judith,Female,2007-11-23,13:22:00,117055,7.461,False,Marketing
531,Virginia,Female,2010-05-02,21:10:00,123649,10.154,True,Marketing
585,Shirley,Female,1988-04-16,11:09:00,132156,2.754,False,Marketing


## Filter with More than One Condition (OR)
- Use the `|` operator in between two Boolean **Series** to filter by *either* condition.
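
The sketch below uses a hypothetical `demo` frame to show `|` on its own and mixed with `&`. When the two operators are combined, explicit parentheses decide which conditions are grouped, and moving them changes the result.

```python
import pandas as pd

# Hypothetical frame (illustrative values only).
demo = pd.DataFrame({
    "First Name": ["Robert", "Robert", "Lillian", "Maria"],
    "Team": ["Client Services", "Finance", "Product", "Finance"],
    "Start Date": pd.to_datetime(
        ["1994-10-29", "2001-05-02", "2016-06-05", "1993-04-23"]
    ),
})

is_robert = demo["First Name"] == "Robert"
in_client_services = demo["Team"] == "Client Services"
started_late = demo["Start Date"] > "2016-01-06"

# A row survives if at least one condition is True.
either = demo[is_robert | started_late]

# (Robert AND Client Services) OR started late.
combined = demo[(is_robert & in_client_services) | started_late]

# Robert AND (Client Services OR started late): a different question.
regrouped = demo[is_robert & (in_client_services | started_late)]

print(either)
print(combined)
print(regrouped)
```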

In [30]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


In [31]:
# Employees who are either in senior management OR started before January 1, 1990
is_senior = employees["Senior Management"]
start = employees["Start Date"] < "01/01/1990"

employees[is_senior | start]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.170,True,
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,1:35 AM,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
992,Anthony,Male,2011-10-16,8:35 AM,112769,11.625,True,Finance
993,Tina,Female,1997-05-15,3:53 PM,56450,19.040,True,Engineering
994,George,Male,2013-06-21,5:47 PM,98874,4.479,True,Marketing
996,Phillip,Male,1984-01-31,6:30 AM,42392,19.675,False,Finance


In [32]:
# Employees named Robert who work in Client Services, OR anyone who started after January 6, 2016
name = employees["First Name"] == "Robert"
team = employees["Team"] == "Client Services"

date = employees["Start Date"] > "01/06/2016"


In [33]:
employees[(name & team) | date]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
15,Lillian,Female,2016-06-05,6:09 AM,59414,1.256,False,Product
39,,Male,2016-01-29,2:33 AM,122173,7.797,,Client Services
89,Janice,Female,2016-03-12,12:40 AM,51082,11.955,False,Legal
98,Tina,Female,2016-06-16,7:47 PM,100705,16.961,True,Marketing
121,Kathleen,,2016-05-09,8:55 AM,119735,18.74,False,Product
143,Teresa,,2016-01-28,10:55 AM,140013,8.689,True,Engineering
239,Lillian,,2016-05-12,3:43 PM,64164,17.612,False,Human Resources
387,Robert,Male,1994-10-29,4:26 AM,123294,19.894,False,Client Services
426,Todd,Male,2016-03-16,2:45 PM,134408,3.56,True,Human Resources
444,,Male,2016-05-24,9:17 PM,76409,7.008,,Distribution


## The isin Method
- The `isin` **Series** method accepts a collection object like a list, tuple, or **Series**.
- The method returns True for a row if its value is found in the collection.
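
`isin` is a compact replacement for a chain of `==` comparisons joined with `|`. The sketch below, on a hypothetical `team` Series, checks that the two forms agree. A missing value is not found in the list, so it comes back `False`.

```python
import pandas as pd

# Hypothetical column with one missing value (illustrative values only).
team = pd.Series(["Legal", "Finance", "Sales", None, "Product"])

# One call instead of three chained comparisons.
wanted = team.isin(["Legal", "Sales", "Product"])

# The long-hand equivalent.
chained = (team == "Legal") | (team == "Sales") | (team == "Product")

print(wanted)
print(wanted.equals(chained))
```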

In [35]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


In [36]:
employees[employees["Team"].isin(["Legal","Sales","Product"])]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,1:35 AM,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,4:20 PM,65476,10.012,True,Product
11,Julie,Female,1997-10-26,3:19 PM,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,11:40 PM,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,6:09 AM,59414,1.256,False,Product
...,...,...,...,...,...,...,...,...
981,James,Male,1993-01-15,5:19 PM,148985,19.280,False,Legal
985,Stephen,,1983-07-10,8:10 PM,85668,1.909,False,Legal
989,Justin,,1991-02-10,4:58 PM,38344,3.794,False,Legal
997,Russell,Male,2013-05-20,12:39 PM,96914,1.421,False,Product


## The isnull and notnull Methods
- The `isnull` method returns True for `NaN` values in a **Series**.
- The `notnull` method returns True for present values in a **Series**.
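
A small sketch on a hypothetical `demo` frame. Because both methods return Boolean Series, they combine with `&` and `|` like any other condition, and `sum()` on the result counts the `True` values.

```python
import pandas as pd

# Hypothetical frame with gaps in both columns (illustrative values only).
demo = pd.DataFrame({
    "First Name": ["Thomas", None, "Maria", None],
    "Team": [None, "Finance", "Finance", None],
})

missing_team = demo["Team"].isnull()   # True where the value is absent
has_team = demo["Team"].notnull()      # the exact opposite

# Rows with no first name that still have a team.
nameless_with_team = demo[demo["First Name"].isnull() & has_team]

# True counts as 1, so sum() gives the number of missing teams.
n_missing = missing_team.sum()

print(nameless_with_team)
print(n_missing)
```

pandas also provides `isna` and `notna`, which behave identically to `isnull` and `notnull`.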

In [38]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


In [39]:
employees[employees["Team"].isnull()]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
10,Louise,Female,1980-08-12,9:01 AM,63241,15.132,True,
23,,Male,2012-06-14,4:19 PM,125792,5.042,,
32,,Male,1998-08-21,2:27 PM,122340,6.417,,
91,James,,2005-01-26,11:00 PM,128771,8.309,False,
109,Christopher,Male,2000-04-22,10:15 AM,37919,11.449,False,
139,,Female,1990-10-03,1:08 AM,132373,10.527,,
199,Jonathan,Male,2009-07-17,8:15 AM,130581,16.736,True,
258,Michael,Male,2002-01-24,3:04 AM,43586,12.659,False,
290,Jeremy,Male,1988-06-14,6:20 PM,129460,13.657,True,


In [40]:
employees[employees["Team"].notnull()]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,1:35 AM,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39 PM,96914,1.421,False,Product
998,Larry,Male,2013-04-20,4:45 PM,60500,11.985,False,Business Development


In [41]:
fn = employees["First Name"].isnull() 
t = employees["Team"].notnull() 

employees[fn & t]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
7,,Female,2015-07-20,10:43 AM,45906,11.598,,Finance
25,,Male,2012-10-08,1:12 AM,37076,18.576,,Client Services
39,,Male,2016-01-29,2:33 AM,122173,7.797,,Client Services
51,,,2011-12-17,8:29 AM,41126,14.009,,Sales
62,,Female,2007-06-12,5:25 PM,58112,19.414,,Marketing
116,,Male,1991-06-22,8:58 PM,76189,18.988,,Legal
149,,Female,2014-08-17,2:00 PM,86230,8.578,,Distribution
157,,Female,2005-07-27,8:32 AM,79536,14.443,,Product
165,,Female,2014-03-23,1:28 PM,59148,9.061,,Legal
166,,Female,1991-07-09,6:52 PM,42341,7.014,,Sales


## The between Method
- The `between` method returns True if a **Series** value falls within the given range. Both endpoints are inclusive by default.
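
The boundary behavior is easiest to see with values that sit exactly on the endpoints. This sketch uses a hypothetical `salary` Series and assumes a pandas version (1.3 or later) in which `inclusive` accepts the strings `"both"`, `"neither"`, `"left"`, and `"right"`.

```python
import pandas as pd

# Hypothetical salaries placed on and around the boundaries.
salary = pd.Series([59999, 60000, 65000, 70000, 70001])

# Default: both endpoints count as "between".
inclusive = salary.between(60000, 70000)

# Exclude both endpoints.
exclusive = salary.between(60000, 70000, inclusive="neither")

# The default is equivalent to two comparisons joined with &.
manual = (salary >= 60000) & (salary <= 70000)

print(inclusive.tolist())
print(exclusive.tolist())
```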

In [43]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


In [44]:
employees[employees["Salary"].between(60000,70000)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.170,True,
6,Ruby,Female,1987-08-17,4:20 PM,65476,10.012,True,Product
10,Louise,Female,1980-08-12,9:01 AM,63241,15.132,True,
20,Lois,,1995-04-22,7:18 PM,64714,4.934,True,Legal
41,Christine,,2015-06-28,1:08 AM,66582,11.308,True,Business Development
...,...,...,...,...,...,...,...,...
965,Catherine,Female,1989-09-25,1:31 AM,68164,18.393,False,Client Services
970,Alice,Female,1988-09-03,8:54 PM,63571,15.397,True,Product
974,Harry,Male,2011-08-30,6:31 PM,67656,16.455,True,Client Services
978,Sean,Male,1983-01-17,2:23 PM,66146,11.178,False,Human Resources


In [45]:
employees[employees["Bonus %"].between(2.0,5.0)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.170,True,
20,Lois,,1995-04-22,7:18 PM,64714,4.934,True,Legal
40,Michael,Male,2008-10-10,11:25 AM,99283,2.665,True,Distribution
49,Chris,,1980-01-24,12:13 PM,113590,3.055,False,Sales
60,Paula,,2005-11-23,2:01 PM,48866,4.271,False,Distribution
...,...,...,...,...,...,...,...,...
943,Wayne,Male,2006-09-08,11:09 AM,67471,2.728,False,Engineering
961,Antonio,,1989-06-18,9:37 PM,103050,3.050,False,Legal
976,Denise,Female,1992-10-19,5:42 AM,137954,4.195,True,Legal
989,Justin,,1991-02-10,4:58 PM,38344,3.794,False,Legal


In [46]:
employees[employees["Start Date"].between("01/01/1991","01/04/1992")]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
27,Scott,,1991-07-11,6:58 PM,122367,5.218,False,Legal
75,Bonnie,Female,1991-07-02,1:27 AM,104897,5.118,True,Human Resources
88,Donna,Female,1991-11-27,1:59 PM,64088,6.155,True,Legal
116,,Male,1991-06-22,8:58 PM,76189,18.988,,Legal
148,Patrick,,1991-07-14,2:24 AM,124488,14.837,True,Sales
166,,Female,1991-07-09,6:52 PM,42341,7.014,,Sales
172,Sara,Female,1991-09-23,6:17 PM,97058,9.402,False,Finance
220,,Female,1991-06-17,12:49 PM,71945,5.56,,Marketing
245,Victor,Male,1991-04-11,7:44 AM,70817,17.138,False,Engineering
277,Brenda,,1991-05-29,6:32 AM,82439,19.062,False,Sales


## The duplicated Method
- The `duplicated` method returns True if a **Series** value is a duplicate.
- By default, pandas marks the first occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass `False` to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert a Boolean **Series**. Each `True` becomes `False`, and each `False` becomes `True`.
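
The effect of each `keep` option is clearest on a short Series where the repeats are easy to spot. The `names` Series below is a hypothetical example.

```python
import pandas as pd

# Hypothetical names: "Larry" appears three times, "Maria" twice, "Jerry" once.
names = pd.Series(["Larry", "Maria", "Larry", "Jerry", "Maria", "Larry"])

first = names.duplicated()              # keep="first" is the default
last = names.duplicated(keep="last")    # the last occurrence is the non-duplicate
every = names.duplicated(keep=False)    # all repeated values are flagged

# ~ flips the mask, leaving only values that occur exactly once.
only_once = names[~every]

print(first.tolist())
print(last.tolist())
print(every.tolist())
print(only_once.tolist())
```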

In [123]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


In [145]:
employees[employees["First Name"].duplicated()]


employees[employees["First Name"].duplicated(keep = "first")]

employees[employees["First Name"].duplicated(keep = "last")]

employees[employees["First Name"].duplicated(keep = False)]


employees[~employees["First Name"].duplicated(keep = False)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,1:35 AM,115163,10.125,False,Legal
8,Angela,Female,2005-11-22,6:29 AM,95570,18.523,True,Engineering
33,Jean,Female,1993-12-18,9:07 AM,119082,16.18,False,Business Development
190,Carol,Female,1996-03-19,3:39 AM,57783,9.129,False,Finance
291,Tammy,Female,1984-11-11,10:30 AM,132839,17.463,True,Client Services
495,Eugene,Male,1984-05-24,10:54 AM,81077,2.117,False,Sales
688,Brian,Male,2007-04-07,10:47 PM,93901,17.821,True,Legal
832,Keith,Male,2003-02-12,3:02 PM,120672,19.467,False,Legal
887,David,Male,2009-12-05,8:48 AM,92242,15.407,False,Legal


## The drop_duplicates Method
- The `drop_duplicates` method returns a new **DataFrame** with duplicate rows removed.
- By default, it will remove a row only if *all* of its values are shared with an earlier row.
- The `subset` parameter configures the columns to look for duplicate values within.
- Pass a list to the `subset` parameter to look for duplicates across multiple columns.
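
The following sketch walks through these options on a hypothetical four-row `demo` frame, comparing the index labels that survive each call.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical in every column.
demo = pd.DataFrame({
    "First Name": ["Larry", "Larry", "Maria", "Jerry"],
    "Team": ["Sales", "Sales", "Finance", "Sales"],
    "Senior Management": [True, True, False, False],
})

# Default: a row is dropped only if every value matches an earlier row.
whole_row = demo.drop_duplicates()

# subset: look for repeats in the Team column only.
by_team_first = demo.drop_duplicates(subset="Team")
by_team_last = demo.drop_duplicates(subset="Team", keep="last")
by_team_none = demo.drop_duplicates(subset="Team", keep=False)

# A list of columns: a repeat must match on both Team and Senior Management.
by_pair = demo.drop_duplicates(subset=["Team", "Senior Management"])

print(whole_row.index.tolist())
print(by_team_first.index.tolist())
print(by_team_last.index.tolist())
print(by_team_none.index.tolist())
print(by_pair.index.tolist())
```

With `keep=False`, only values that occur exactly once survive, so a column in which every value repeats yields an empty result.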

In [147]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


In [149]:
employees.drop_duplicates()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39 PM,96914,1.421,False,Product
998,Larry,Male,2013-04-20,4:45 PM,60500,11.985,False,Business Development


In [157]:
employees.drop_duplicates("Team", keep="last")  # keep selects which occurrence is retained as the non-duplicate

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
951,,Female,2010-09-14,5:19 AM,143638,9.662,,
988,Alice,Female,2004-10-05,9:34 AM,47638,11.209,False,Human Resources
989,Justin,,1991-02-10,4:58 PM,38344,3.794,False,Legal
990,Robin,Female,1987-07-24,1:35 PM,100765,10.982,True,Client Services
993,Tina,Female,1997-05-15,3:53 PM,56450,19.04,True,Engineering
994,George,Male,2013-06-21,5:47 PM,98874,4.479,True,Marketing
995,Henry,,2014-11-23,6:09 AM,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,6:30 AM,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39 PM,96914,1.421,False,Product
998,Larry,Male,2013-04-20,4:45 PM,60500,11.985,False,Business Development


In [159]:
employees.drop_duplicates("Team", keep=False)  # keep=False drops every row whose Team value appears more than once

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team


In [165]:
employees.drop_duplicates(["Senior Management","Team"]).sort_values("Team")

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
481,,Female,2013-04-27,6:40 AM,93847,1.085,,Business Development
33,Jean,Female,1993-12-18,9:07 AM,119082,16.18,False,Business Development
9,Frances,Female,2002-08-08,6:51 AM,139852,7.524,True,Business Development
25,,Male,2012-10-08,1:12 AM,37076,18.576,,Client Services
18,Diana,Female,1981-10-23,10:27 AM,132940,19.082,False,Client Services
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services
40,Michael,Male,2008-10-10,11:25 AM,99283,2.665,True,Distribution
60,Paula,,2005-11-23,2:01 PM,48866,4.271,False,Distribution
149,,Female,2014-08-17,2:00 PM,86230,8.578,,Distribution
54,Sara,Female,2007-08-15,9:23 AM,83677,8.999,False,Engineering


## The unique and nunique Methods
- The `unique` method on a **Series** returns a collection of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a *count* of the number of unique values in the **Series**/**DataFrame**.
- The `dropna` parameter of `nunique` configures whether to include or exclude missing (`NaN`) values. By default (`True`), they are excluded from the count.
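
A compact sketch of how the two methods treat a missing value, using a hypothetical `demo` frame. `unique` keeps the missing value in its result, while `nunique` leaves it out of the count unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical frame with two missing team values (illustrative values only).
demo = pd.DataFrame({
    "Team": ["Marketing", None, "Finance", "Marketing", None],
    "Salary": [97308, 61933, 130590, 97308, 45906],
})

# unique returns the distinct values, including the missing one.
distinct = demo["Team"].unique()

# nunique counts distinct values; missing values are excluded by default.
n_present = demo["Team"].nunique()
n_with_missing = demo["Team"].nunique(dropna=False)

# On a DataFrame, nunique returns one count per column.
per_column = demo.nunique()

print(distinct)
print(n_present, n_with_missing)
print(per_column)
```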

In [167]:
employees = pd.read_csv("employees.csv", parse_dates=["Start Date"], date_format="%m/%d/%Y")
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,4:47 PM,101004,1.389,True,Client Services


In [171]:
employees["Gender"].unique()

array(['Male', 'Female', nan], dtype=object)

In [175]:
employees["Team"].unique()

array(['Marketing', nan, 'Finance', 'Client Services', 'Legal', 'Product',
       'Engineering', 'Business Development', 'Human Resources', 'Sales',
       'Distribution'], dtype=object)

In [183]:
employees["Team"].nunique(dropna = False)

11

In [187]:
employees.nunique()

First Name           200
Gender                 2
Start Date           972
Last Login Time      720
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64