## DataFrames: Filtering Data

1-Filter A DataFrame Based On A Condition (Equal, Not equal, Greater than, Less than, Greater than or equal to, less than or equal to) 

2-Filter with More than One Condition (AND) 

3-Filter with More than One Condition (OR)

4-The .isin() Method

5-The .isnull() and .notnull() Methods

6-The .between() Method

7-The .duplicated() Method / Using "~" to keep only unique values

8-The .drop_duplicates() Method

9-The .unique() and .nunique() Methods

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("files/employees.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


In [3]:
# Convert the column's data type from object to datetime.
df["Start Date"] = pd.to_datetime(df["Start Date"])
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,6:53 AM,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17 AM,130590,11.858,False,Finance


In [4]:
# "Last Login Time" holds only a time of day, with no day/month/year information. If we convert it with pd.to_datetime(), today's date is filled in.

df["Last Login Time"] = pd.to_datetime(df["Last Login Time"])
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance
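
One possible way to avoid attaching today's date to a time-only column is to parse it with an explicit format and then keep just the time of day. This is a minimal sketch on made-up values that mimic the "Last Login Time" column; the names `logins`, `parsed`, and `times` are assumptions for illustration.

```python
import pandas as pd

# Made-up sample that mimics the "Last Login Time" column (time of day only).
logins = pd.Series(["12:42 PM", "6:53 AM", "11:17 AM"])

# With an explicit format, pandas does not borrow today's date;
# the date part falls back to a fixed placeholder (1900-01-01).
parsed = pd.to_datetime(logins, format="%I:%M %p")

# Keep only the time of day and drop the placeholder date.
times = parsed.dt.time
print(times.tolist())
```

The result is an object column of `datetime.time` values, so it suits display and simple comparisons rather than datetime arithmetic.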


In [5]:
df["Senior Management"] = df["Senior Management"].astype("bool")
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance
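
A caveat worth checking: `df.info()` reports only 933 non-null values in "Senior Management", and `astype("bool")` turns missing values into `True`. The sketch below uses a made-up three-value column to show the effect, along with the nullable `"boolean"` dtype as one possible alternative.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value, stored as object dtype like the CSV column.
senior = pd.Series([True, False, np.nan], dtype="object")

# astype("bool") relies on truthiness, and NaN is truthy.
as_bool = senior.astype("bool")
print(as_bool.tolist())  # [True, False, True]

# The nullable "boolean" dtype keeps the missing value as <NA> instead.
as_nullable = senior.astype("boolean")
print(as_nullable.isna().sum())  # 1
```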


In [6]:
df["Gender"] = df["Gender"].astype("category")
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance


#### Filter A DataFrame Based On A Condition

In [7]:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
#df["Start Date"] = pd.to_datetime(df["Start Date"])
#df["Last Login Time"] = pd.to_datetime(df["Last Login Time"]) 
# The parse_dates parameter of read_csv() replaces the two lines above.
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance


In [8]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance


##### Equal ==

I'm going to use Python's comparison operator, which is the double equals sign (`==`). You have to be very careful here, because a single equals sign (`=`) is assignment, not comparison: it assigns the value on the right to the Series on the left.
So, for example, what I want to do here is compare the values in the Gender column to "Male".
If I accidentally forget the second equals sign and write `df["Gender"] = "Male"`, it simply overwrites all of the values in the Gender column with the string "Male".
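
The difference between the two operators can be sketched on a tiny made-up table (the `people` DataFrame below is hypothetical, not the employees file):

```python
import pandas as pd

# Made-up miniature of the employees table.
people = pd.DataFrame({
    "First Name": ["Douglas", "Maria", "Jerry"],
    "Gender": ["Male", "Female", "Male"],
})

# Comparison: returns a boolean Series and leaves the data untouched.
is_male = people["Gender"] == "Male"
print(is_male.tolist())  # [True, False, True]

# Assignment (done on a copy here): every value in the column is overwritten.
overwritten = people.copy()
overwritten["Gender"] = "Male"
print(overwritten["Gender"].tolist())  # ['Male', 'Male', 'Male']
```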

In [9]:
# To keep only the rows where Gender is "Male", use this filter. The data shrinks from 1000 rows to 424 rows.

df[df["Gender"] == "Male"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,2021-01-14 13:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,2021-01-14 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2021-01-14 01:35:00,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
994,George,Male,2013-06-21,2021-01-14 17:47:00,98874,4.479,True,Marketing
996,Phillip,Male,1984-01-31,2021-01-14 06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,2021-01-14 12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,2021-01-14 16:45:00,60500,11.985,False,Business Development


In [10]:
# To keep only the rows where Team is "Finance", use this filter. The data shrinks from 1000 rows to 102 rows.

df[df["Team"] == "Finance"]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2021-01-14 13:00:00,138705,9.340,True,Finance
7,,Female,2015-07-20,2021-01-14 10:43:00,45906,11.598,True,Finance
14,Kimberly,Female,1999-01-14,2021-01-14 07:13:00,41426,14.543,True,Finance
46,Bruce,Male,2009-11-28,2021-01-14 22:47:00,114796,6.796,False,Finance
...,...,...,...,...,...,...,...,...
907,Elizabeth,Female,1998-07-27,2021-01-14 11:12:00,137144,10.081,False,Finance
954,Joe,Male,1980-01-19,2021-01-14 16:06:00,119667,1.148,True,Finance
987,Gloria,Female,2014-12-08,2021-01-14 05:08:00,136709,10.331,True,Finance
992,Anthony,Male,2011-10-16,2021-01-14 08:35:00,112769,11.625,True,Finance


In [11]:
# Another way:
mask = df["Team"] == "Finance"
df[mask].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2021-01-14 13:00:00,138705,9.34,True,Finance
7,,Female,2015-07-20,2021-01-14 10:43:00,45906,11.598,True,Finance


In [12]:
# A boolean column can be used directly as the mask:

df[df["Senior Management"]]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,2021-01-14 13:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,2021-01-14 16:47:00,101004,1.389,True,Client Services
6,Ruby,Female,1987-08-17,2021-01-14 16:20:00,65476,10.012,True,Product
...,...,...,...,...,...,...,...,...
991,Rose,Female,2002-08-25,2021-01-14 05:12:00,134505,11.051,True,Marketing
992,Anthony,Male,2011-10-16,2021-01-14 08:35:00,112769,11.625,True,Finance
993,Tina,Female,1997-05-15,2021-01-14 15:53:00,56450,19.040,True,Engineering
994,George,Male,2013-06-21,2021-01-14 17:47:00,98874,4.479,True,Marketing


##### Not Equal !=

In [13]:
# Rows whose Team is not equal to "Marketing" (rows with a missing Team are included too):

df[df["Team"] != "Marketing"].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2021-01-14 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2021-01-14 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2021-01-14 01:35:00,115163,10.125,False,Legal
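
Note that `!=` also keeps rows where the value is missing, as with Thomas in the output above. A small sketch on made-up data shows this, plus one way to exclude the missing rows if that is what is wanted:

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing team.
teams = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Team": ["Marketing", np.nan, "Finance"],
})

# NaN is "not equal" to everything, so Thomas passes the filter.
not_marketing = teams[teams["Team"] != "Marketing"]
print(not_marketing["First Name"].tolist())  # ['Thomas', 'Maria']

# Combine with .notnull() to keep only rows with a known, different team.
known_other = teams[(teams["Team"] != "Marketing") & teams["Team"].notnull()]
print(known_other["First Name"].tolist())  # ['Maria']
```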


##### Less Than < or Greater Than >

In [14]:
df[df["Salary"] > 110000]  # not displayed: a notebook cell shows only its last expression

df[df["Bonus %"] < 1.5].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
4,Larry,Male,1998-01-24,2021-01-14 16:47:00,101004,1.389,True,Client Services
15,Lillian,Female,2016-06-05,2021-01-14 06:09:00,59414,1.256,False,Product
58,Theresa,Female,2010-04-11,2021-01-14 07:18:00,72670,1.481,True,Engineering
77,Charles,Male,2004-09-14,2021-01-14 20:13:00,107391,1.26,True,Marketing
175,Willie,Male,1998-02-17,2021-01-14 20:20:00,146651,1.451,True,Engineering


##### Greater than or equal to >= or less than or equal to <=

In [15]:
df[df["Start Date"] <= "1985-01-01"].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,1980-08-12,2021-01-14 09:01:00,63241,15.132,True,
12,Brandon,Male,1980-12-01,2021-01-14 01:08:00,112807,17.492,True,Human Resources
18,Diana,Female,1981-10-23,2021-01-14 10:27:00,132940,19.082,False,Client Services
28,Terry,Male,1981-11-27,2021-01-14 18:30:00,124008,13.464,True,Client Services
37,Linda,Female,1981-10-19,2021-01-14 20:49:00,57427,9.557,True,Client Services
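
Comparing a datetime column to a date string works because pandas converts the string for the comparison. As a sketch on made-up rows, an explicit `pd.Timestamp` gives the same result and makes the intent visible:

```python
import pandas as pd

# Made-up sample with a datetime "Start Date" column.
starts = pd.DataFrame({
    "First Name": ["Louise", "Douglas", "Brandon"],
    "Start Date": pd.to_datetime(["1980-08-12", "1993-08-06", "1980-12-01"]),
})

cutoff = pd.Timestamp("1985-01-01")

by_string = starts[starts["Start Date"] <= "1985-01-01"]
by_timestamp = starts[starts["Start Date"] <= cutoff]

print(by_timestamp["First Name"].tolist())  # ['Louise', 'Brandon']
```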


#### Filter with More than One Condition (AND)

In [16]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance


In [17]:
mask1 = df["Gender"] == "Male"
mask2 = df["Team"] == "Marketing"

df[mask1 & mask2].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
21,Matthew,Male,1995-09-05,2021-01-14 02:12:00,100612,13.645,False,Marketing
26,Craig,Male,2000-02-27,2021-01-14 07:45:00,37598,7.757,True,Marketing
74,Thomas,Male,1995-06-04,2021-01-14 14:24:00,62096,17.029,False,Marketing
77,Charles,Male,2004-09-14,2021-01-14 20:13:00,107391,1.26,True,Marketing
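
The masks can also be written inline instead of being stored in variables. In that case each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python. A sketch on a made-up `staff` table:

```python
import pandas as pd

# Made-up sample.
staff = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Team": ["Marketing", "Marketing", "Finance", "Marketing"],
})

# Parentheses around each condition are required when combining with &.
male_marketing = staff[(staff["Gender"] == "Male") & (staff["Team"] == "Marketing")]
print(male_marketing.index.tolist())  # [0, 3]
```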


#### Filter with More than One Condition (OR)

In [18]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance


In [19]:
mask1 = df["Senior Management"] == True # You don't have to use == True. 
mask2 = df["Start Date"] < "1990-01-01"

df[mask1 | mask2].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,2021-01-14 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2021-01-14 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2021-01-14 01:35:00,115163,10.125,False,Legal


In [20]:
mask1 = df["First Name"] == "Robert"
mask2 = df["Team"] == "Client Services"
mask3 = df["Start Date"] > "2016-06-01"

df[(mask1 & mask2) | mask3]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
15,Lillian,Female,2016-06-05,2021-01-14 06:09:00,59414,1.256,False,Product
98,Tina,Female,2016-06-16,2021-01-14 19:47:00,100705,16.961,True,Marketing
387,Robert,Male,1994-10-29,2021-01-14 04:26:00,123294,19.894,False,Client Services
451,Terry,,2016-07-15,2021-01-14 00:29:00,140002,19.49,True,Marketing


#### The .isin() Method

In [25]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(1)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing


In [22]:
# with the "|" (OR) operator:

mask1 = df["Team"] == "Legal"
mask2 = df["Team"] == "Sales"
mask3 = df["Team"] == "Product"

df[mask1 | mask2 | mask3].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,2021-01-14 01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,2021-01-14 16:20:00,65476,10.012,True,Product
11,Julie,Female,1997-10-26,2021-01-14 15:19:00,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,2021-01-14 23:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,2021-01-14 06:09:00,59414,1.256,False,Product


In [23]:
# with .isin() method:

df[df["Team"].isin(["Legal","Sales","Product"])].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,2021-01-14 01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,2021-01-14 16:20:00,65476,10.012,True,Product
11,Julie,Female,1997-10-26,2021-01-14 15:19:00,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,2021-01-14 23:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,2021-01-14 06:09:00,59414,1.256,False,Product
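
As a quick check that the two approaches agree, the sketch below builds both masks on a made-up `staff` table and compares them. It also shows `~`, which inverts the mask to select every team that is not in the list.

```python
import pandas as pd

# Made-up sample.
staff = pd.DataFrame({"Team": ["Legal", "Finance", "Sales", "Product", "Marketing"]})
wanted = ["Legal", "Sales", "Product"]

with_or = (staff["Team"] == "Legal") | (staff["Team"] == "Sales") | (staff["Team"] == "Product")
with_isin = staff["Team"].isin(wanted)
print(with_or.equals(with_isin))  # True

# Invert the mask to keep the teams that are NOT in the list.
others = staff[~with_isin]
print(others["Team"].tolist())  # ['Finance', 'Marketing']
```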


#### The .isnull() and .notnull() Methods

In [24]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(1)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing


In [28]:
# .isnull() returns True where the value is null and False where it is not.
df["Team"].isnull() # true-false 
df[df["Team"].isnull()].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
10,Louise,Female,1980-08-12,2021-01-14 09:01:00,63241,15.132,True,
23,,Male,2012-06-14,2021-01-14 16:19:00,125792,5.042,True,
32,,Male,1998-08-21,2021-01-14 14:27:00,122340,6.417,True,
91,James,,2005-01-26,2021-01-14 23:00:00,128771,8.309,False,


In [29]:
# .notnull() returns True where the value is not null and False where it is null.
df["Team"].notnull() # true-false 
df[df["Team"].notnull()].head() 

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing
2,Maria,Female,1993-04-23,2021-01-14 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2021-01-14 13:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,2021-01-14 16:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,2021-01-14 01:35:00,115163,10.125,False,Legal
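
Because `True` counts as 1, summing either mask gives a quick count of missing or present values. A small sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing teams.
teams = pd.Series(["Marketing", np.nan, "Finance", np.nan], name="Team")

missing = teams.isnull()
present = teams.notnull()

print(missing.sum())  # 2
print(present.sum())  # 2

# The two masks are exact opposites of each other.
print(missing.equals(~present))  # True
```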


#### The .between() Method

In [30]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(1)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing


In [35]:
df[df["Salary"].between(60000, 70000)].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
6,Ruby,Female,1987-08-17,2021-01-14 16:20:00,65476,10.012,True,Product
10,Louise,Female,1980-08-12,2021-01-14 09:01:00,63241,15.132,True,


In [34]:
df[df["Bonus %"].between(2.0, 5.0)].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2021-01-14 06:53:00,61933,4.17,True,
20,Lois,,1995-04-22,2021-01-14 19:18:00,64714,4.934,True,Legal
40,Michael,Male,2008-10-10,2021-01-14 11:25:00,99283,2.665,True,Distribution


In [36]:
df[df["Start Date"].between("1991-01-01", "1992-01-01")].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
27,Scott,,1991-07-11,2021-01-14 18:58:00,122367,5.218,False,Legal
75,Bonnie,Female,1991-07-02,2021-01-14 01:27:00,104897,5.118,True,Human Resources
88,Donna,Female,1991-11-27,2021-01-14 13:59:00,64088,6.155,True,Legal
116,,Male,1991-06-22,2021-01-14 20:58:00,76189,18.988,True,Legal
148,Patrick,,1991-07-14,2021-01-14 02:24:00,124488,14.837,True,Sales
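
By default `.between()` includes both endpoints, so it behaves like a `>=` and `<=` pair. The sketch below checks this on made-up salaries that sit right at the boundaries:

```python
import pandas as pd

# Made-up salaries around the 60000-70000 boundaries.
salaries = pd.Series([59999, 60000, 65476, 70000, 70001], name="Salary")

in_range = salaries.between(60000, 70000)
manual = (salaries >= 60000) & (salaries <= 70000)

print(in_range.tolist())  # [False, True, True, True, False]
print(in_range.equals(manual))  # True
```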


#### The .duplicated() Method

In [38]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.sort_values("First Name", inplace=True)
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2021-01-14 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2021-01-14 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2021-01-14 14:53:00,52119,11.343,True,Client Services


In [43]:
# The first Aaron is False: the first occurrence of a name is not marked as a duplicate, but every later occurrence is. Filtering with .duplicated() therefore returns only the repeated rows.

df["First Name"].duplicated() #True-False
df[df["First Name"].duplicated()].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
327,Aaron,Male,1994-01-29,2021-01-14 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2021-01-14 14:53:00,52119,11.343,True,Client Services
937,Aaron,,1986-01-22,2021-01-14 19:39:00,63126,18.424,False,Client Services
141,Adam,Male,1990-12-24,2021-01-14 20:57:00,110194,14.727,True,Product
302,Adam,Male,2007-07-05,2021-01-14 11:59:00,71276,5.027,True,Human Resources


In [45]:
# With keep = "last", the last occurrence is treated as the non-duplicate. The default is keep = "first".

df["First Name"].duplicated(keep = "last").head()

101     True
327     True
440     True
937    False
137     True
Name: First Name, dtype: bool

In [63]:
# With keep = False, position does not matter: every value that occurs more than once is marked as duplicated.

df["First Name"].duplicated(keep = False).head()

101    True
327    True
440    True
937    True
137    True
Name: First Name, dtype: bool
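
The three `keep` options are easier to compare side by side on a short, made-up Series of names:

```python
import pandas as pd

# Made-up names: "Aaron" appears three times, the others once.
names = pd.Series(["Aaron", "Aaron", "Aaron", "Adam", "Angela"])

keep_first = names.duplicated()             # first occurrence is not flagged
keep_last = names.duplicated(keep="last")   # last occurrence is not flagged
keep_none = names.duplicated(keep=False)    # every repeated value is flagged

print(keep_first.tolist())  # [False, True, True, False, False]
print(keep_last.tolist())   # [True, True, False, False, False]
print(keep_none.tolist())   # [True, True, True, False, False]
```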

#### Using "~" to invert the mask and keep only unique values:

In [65]:
~df["First Name"].duplicated(keep = False).head()

101    False
327    False
440    False
937    False
137    False
Name: First Name, dtype: bool

In [61]:
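# Parentheses are needed here: without them, "~" is applied to the counts instead of the mask and returns negative numbers.
(~df["First Name"].duplicated(keep = False)).value_counts()

False    991
True       9
Name: First Name, dtype: int64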
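
Because a method call binds more tightly than `~`, writing `~mask.value_counts()` inverts the integer counts (bitwise, so `~n` is `-n - 1`) rather than the boolean mask. A small sketch with made-up names shows the difference; wrapping the mask in parentheses gives the intended counts:

```python
import pandas as pd

# Made-up names: "Aaron" three times, "Adam" once.
names = pd.Series(["Aaron", "Aaron", "Aaron", "Adam"])
dup = names.duplicated(keep=False)  # [True, True, True, False]

# ~ is applied to the counts (3 and 1), giving -4 and -2.
wrong = ~dup.value_counts()

# ~ is applied to the mask first, then the values are counted.
right = (~dup).value_counts()

print(wrong.to_dict())
print(right.to_dict())
```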

In [64]:
# Shows only the rows whose first name occurs exactly once:

mask = ~df["First Name"].duplicated(keep = False)
df[mask]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
8,Angela,Female,2005-11-22,2021-01-14 06:29:00,95570,18.523,True,Engineering
688,Brian,Male,2007-04-07,2021-01-14 22:47:00,93901,17.821,True,Legal
190,Carol,Female,1996-03-19,2021-01-14 03:39:00,57783,9.129,False,Finance
887,David,Male,2009-12-05,2021-01-14 08:48:00,92242,15.407,False,Legal
5,Dennis,Male,1987-04-18,2021-01-14 01:35:00,115163,10.125,False,Legal
495,Eugene,Male,1984-05-24,2021-01-14 10:54:00,81077,2.117,False,Sales
33,Jean,Female,1993-12-18,2021-01-14 09:07:00,119082,16.18,False,Business Development
832,Keith,Male,2003-02-12,2021-01-14 15:02:00,120672,19.467,False,Legal
291,Tammy,Female,1984-11-11,2021-01-14 10:30:00,132839,17.463,True,Client Services


#### The .drop_duplicates() Method

In [67]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.sort_values("First Name", inplace=True)
df.head(1)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2021-01-14 10:20:00,61602,11.849,True,Marketing


In [68]:
len(df)

1000

In [70]:
# Calling drop_duplicates() removes duplicate rows.
# With no arguments, two rows count as duplicates only when the values in all columns are equal.

len(df.drop_duplicates())

1000

In [72]:
# Drop the rows with a duplicated first name, keeping only the first occurrence of each name.

df.drop_duplicates(subset = ["First Name"], keep = "first").head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2021-01-14 10:20:00,61602,11.849,True,Marketing
137,Adam,Male,2011-05-21,2021-01-14 01:45:00,95327,15.12,False,Distribution
300,Alan,Male,1988-06-26,2021-01-14 03:54:00,111786,3.592,True,Engineering
372,Albert,Male,1997-02-01,2021-01-14 16:20:00,67827,19.717,True,Engineering
988,Alice,Female,2004-10-05,2021-01-14 09:34:00,47638,11.209,False,Human Resources


In [73]:
# Drop the rows that duplicate both first name and team, keeping the first occurrence. As a result, employees with the same name remain when they work on different teams.

df.drop_duplicates(subset = ["First Name","Team"], keep = "first").head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2021-01-14 10:20:00,61602,11.849,True,Marketing
440,Aaron,Male,1990-07-22,2021-01-14 14:53:00,52119,11.343,True,Client Services
137,Adam,Male,2011-05-21,2021-01-14 01:45:00,95327,15.12,False,Distribution
141,Adam,Male,1990-12-24,2021-01-14 20:57:00,110194,14.727,True,Product
302,Adam,Male,2007-07-05,2021-01-14 11:59:00,71276,5.027,True,Human Resources
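
The effect of `subset` and `keep` can be traced row by row on a made-up four-row table. The last call uses `keep=False`, which drops every row whose name is repeated and so matches the `~duplicated(keep=False)` filter:

```python
import pandas as pd

# Made-up sample: three Aarons (two on the same team) and one Adam.
staff = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Aaron", "Adam"],
    "Team": ["Marketing", "Marketing", "Client Services", "Product"],
})

by_name = staff.drop_duplicates(subset=["First Name"], keep="first")
by_name_team = staff.drop_duplicates(subset=["First Name", "Team"], keep="first")
only_unique = staff.drop_duplicates(subset=["First Name"], keep=False)

print(by_name.index.tolist())       # [0, 3]
print(by_name_team.index.tolist())  # [0, 2, 3]
print(only_unique.index.tolist())   # [3]
```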


#### The .unique() and .nunique() Methods

In [74]:
# Same setup as in the earlier cell:
df = pd.read_csv("files/employees.csv", parse_dates = ["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head(1)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2021-01-14 12:42:00,97308,6.945,True,Marketing


In [75]:
df["Gender"].unique()

['Male', 'Female', NaN]
Categories (2, object): ['Male', 'Female']

In [76]:
df["Team"].unique()

array(['Marketing', nan, 'Finance', 'Client Services', 'Legal', 'Product',
       'Engineering', 'Business Development', 'Human Resources', 'Sales',
       'Distribution'], dtype=object)

In [79]:
len(df["Team"].unique())

11

Number of unique values: by default, the `.nunique()` method does not count null values. It has a parameter called `dropna`, which defaults to `True`; pass `dropna = False` to include nulls in the count.

In [80]:
df["Team"].nunique()

10

In [82]:
df["Team"].nunique(dropna = False)

11
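
The relationship between `.unique()`, `.nunique()`, and `dropna` can be checked on a small made-up Series with one missing value:

```python
import numpy as np
import pandas as pd

# Made-up sample: two distinct teams plus one missing value.
teams = pd.Series(["Marketing", np.nan, "Finance", "Marketing"], name="Team")

n_with_unique = len(teams.unique())          # NaN is included in the array
n_default = teams.nunique()                  # NaN is skipped (dropna=True)
n_with_nan = teams.nunique(dropna=False)     # NaN is counted

print(n_with_unique, n_default, n_with_nan)  # 3 2 3
```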