# DataFrames II: Filtering Data

In [3]:
import pandas as pd

## This Module's Dataset + Memory Optimization
- The `pd.to_datetime` method converts a **Series** to hold datetime values.
- The `format` parameter informs pandas of the format that the times are stored in.
- We pass symbols designating the segments of the string. For example, `%m` means "month" and `%d` means "day".
- The `dt` attribute reveals an object with many datetime-related attributes and methods.
- The `dt.time` attribute extracts only the time from each value in a datetime **Series**.
- Use the `astype` method to convert the values in a **Series** to another type.
- The `parse_dates` parameter of `read_csv` is an alternate way to parse strings as datetimes.
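
As a small illustration of the conversions listed above, the sketch below builds a two-row frame by hand (the values are invented for the example, not read from `employees.csv`) and applies `pd.to_datetime`, `dt.time`, and `astype`. It uses `%I` for the hour, on the assumption that the times are written on a 12-hour clock with an AM/PM marker.

```python
import pandas as pd

# Hypothetical miniature of the employees data (values invented for illustration).
sample = pd.DataFrame({
    "Start Date": ["8/6/1993", "3/4/2005"],
    "Last Login Time": ["12:42 PM", "1:00 PM"],
    "Gender": ["Male", "Female"],
})

# Parse month/day/year strings into datetime values.
sample["Start Date"] = pd.to_datetime(sample["Start Date"], format="%m/%d/%Y")

# %I reads the hour on a 12-hour clock, so %p (AM/PM) can shift it to the afternoon.
login = pd.to_datetime(sample["Last Login Time"], format="%I:%M %p")
sample["Last Login Time"] = login.dt.time  # keep only the time component

# A text column with few distinct values can be stored as a category.
sample["Gender"] = sample["Gender"].astype("category")

print(sample.dtypes)
```

With `%I`, the string `1:00 PM` becomes `13:00:00`; the time-only column still reports an `object` dtype because it holds plain `datetime.time` values.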

In [6]:
employees= pd.read_csv('employees.csv')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance
3,Jerry,Male,3/4/2005,1:00 PM,138705,9.34,True,Finance
4,Larry,Male,1/24/1998,4:47 PM,101004,1.389,True,Client Services


In [4]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   First Name         933 non-null    object 
 1   Gender             855 non-null    object 
 2   Start Date         1000 non-null   object 
 3   Last Login Time    1000 non-null   object 
 4   Salary             1000 non-null   int64  
 5   Bonus %            1000 non-null   float64
 6   Senior Management  933 non-null    object 
 7   Team               957 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


In [11]:
employees['Start Date'] = pd.to_datetime(employees['Start Date'], format='%m/%d/%Y') 
employees.info()
# the format argument must describe how the date strings in the column are written (here: month/day/year)
# passing format='mixed' makes pandas infer the format of each value individually, which is slower and riskier, so it is not recommended

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    object        
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  933 non-null    object        
 7   Team               957 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 62.6+ KB


In [None]:
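pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time # %p matches the AM/PM marker in the string

# caution: %p only shifts the hour when it is paired with %I (12-hour clock); with %H, as used here,
# PM times are not shifted, which is why '1:00 PM' shows up as 01:00:00 below (format='%I:%M %p' would give 13:00:00)
# .dt.time extracts only the time component from each datetime value in the Series
# (useful here because the column carries no date information)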

0      12:42:00
1      06:53:00
2      11:17:00
3      01:00:00
4      04:47:00
         ...   
995    06:09:00
996    06:30:00
997    12:39:00
998    04:45:00
999    06:24:00
Name: Last Login Time, Length: 1000, dtype: object

In [18]:
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    object        
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  933 non-null    object        
 7   Team               957 non-null    object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 62.6+ KB


In [19]:
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [21]:
employees['Senior Management'] = employees['Senior Management'].astype('bool') # note: missing values (NaN) are converted to True

In [22]:
employees['Gender'] = employees['Gender'].astype('category')

In [23]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   First Name         933 non-null    object        
 1   Gender             855 non-null    category      
 2   Start Date         1000 non-null   datetime64[ns]
 3   Last Login Time    1000 non-null   object        
 4   Salary             1000 non-null   int64         
 5   Bonus %            1000 non-null   float64       
 6   Senior Management  1000 non-null   bool          
 7   Team               957 non-null    object        
dtypes: bool(1), category(1), datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 49.1+ KB


In [30]:
# dates can also be parsed while the file is being read (this only applies to columns that hold dates)
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
# (a single date_format string is applied to every column in parse_dates; pass a dict of column -> format if the formats differ)

# time-only columns such as 'Last Login Time' still have to be converted separately
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 

# for dataframe memory optimization
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


## Filter A DataFrame Based On A Condition
- Pandas needs a **Series** of Booleans to perform a filter.
- Pass the Boolean Series inside square brackets after the **DataFrame**.
- We can generate a Boolean Series using a wide variety of operations (equality, inequality, less than, greater than, inclusion, etc.).
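
The idea in the bullets above can be sketched on a tiny hand-made frame (the names and salaries are invented for illustration): a comparison produces a Boolean **Series**, and passing that Series in square brackets keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cleo", "Dan"],
    "Team": ["Finance", "Legal", "Finance", "Sales"],
    "Salary": [120000, 80000, 95000, 130000],
})

# The comparison yields one True/False per row, aligned to df's index.
mask = df["Salary"] > 100000
print(mask)

# Indexing with the mask keeps the rows where it is True.
high_earners = df[mask]
print(high_earners)
```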

In [4]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [5]:
employees['Gender'] == 'Male'
# this returns a Boolean Series showing, for each index label, whether the condition is True or False

0       True
1       True
2      False
3       True
4       True
       ...  
995    False
996     True
997     True
998     True
999     True
Name: Gender, Length: 1000, dtype: bool

In [33]:
# to extract the rows that meet a specific condition, pass the condition inside square brackets
# pandas keeps the rows where the resulting Boolean Series is True and drops the rows where it is False

employees[employees['Gender'] == 'Male']

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
...,...,...,...,...,...,...,...,...
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [35]:
employees[employees['Team'] == 'Finance']

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
14,Kimberly,Female,1999-01-14,07:13:00,41426,14.543,True,Finance
46,Bruce,Male,2009-11-28,10:47:00,114796,6.796,False,Finance
...,...,...,...,...,...,...,...,...
907,Elizabeth,Female,1998-07-27,11:12:00,137144,10.081,False,Finance
954,Joe,Male,1980-01-19,04:06:00,119667,1.148,True,Finance
987,Gloria,Female,2014-12-08,05:08:00,136709,10.331,True,Finance
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance


In [None]:
# another way to filter is to store the condition in a helper variable, which makes the code easier to read
on_finance_team = employees['Team'] == 'Finance' # Boolean Series: True where the row meets the condition
employees[on_finance_team]

employees['Senior Management'] == True
# the Senior Management column is already Boolean, so the comparison above is equivalent to the column itself
employees['Senior Management']
employees[employees['Senior Management']] # returns the rows where Senior Management is True

employees[employees['Salary'] > 110000]

employees[employees['Bonus %'] < 1.5]

employees[employees['Start Date'] < '1985-01-01'].head() # pandas converts the date string and compares it with the datetime values

# in other words, a datetime column can be compared directly with a string representation of a date

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
10,Louise,Female,1980-08-12,09:01:00,63241,15.132,True,
12,Brandon,Male,1980-12-01,01:08:00,112807,17.492,True,Human Resources
18,Diana,Female,1981-10-23,10:27:00,132940,19.082,False,Client Services
28,Terry,Male,1981-11-27,06:30:00,124008,13.464,True,Client Services
37,Linda,Female,1981-10-19,08:49:00,57427,9.557,True,Client Services


In [13]:
import datetime as dt

In [None]:
# Comparing time values is slightly more involved: a plain string will not work here, so we build a datetime.time object instead

print(dt.time(11,28))
employees[employees['Last Login Time'] > dt.time(12, 0, 0)] # last login time later than noon

11:28:00


Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
34,Jerry,Male,2004-01-10,12:56:00,95734,19.096,False,Client Services
49,Chris,,1980-01-24,12:13:00,113590,3.055,False,Sales
61,Denise,Female,2001-11-06,12:03:00,106862,3.699,False,Business Development
76,Margaret,Female,1988-09-10,12:42:00,131604,7.353,True,Distribution
...,...,...,...,...,...,...,...,...
945,Gerald,,1989-04-15,12:44:00,93712,17.426,True,Distribution
956,Beverly,Female,1986-10-17,12:51:00,80838,8.115,False,Engineering
962,Jonathan,Male,2013-08-21,12:45:00,121797,16.923,False,Product
980,Kimberly,Female,2013-01-26,12:57:00,46233,8.862,True,Engineering


In [None]:
# key takeaway: whatever goes inside the square brackets must be a Boolean Series with the same length as the original DataFrame

## Filter with More than One Condition (AND)
- Add the `&` operator in between two Boolean **Series** to filter by multiple conditions.
- We can assign the **Series** to variables to make the syntax more readable.
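
A minimal sketch of the `&` pattern on invented data follows. When the conditions are written inline, each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Team": ["Marketing", "Marketing", "Finance", "Marketing"],
    "Salary": [105000, 99000, 120000, 70000],
})

# Named conditions keep the final filter readable.
is_female = df["Gender"] == "Female"
in_marketing = df["Team"] == "Marketing"
both = df[is_female & in_marketing]

# The same filter written inline; the parentheses around each comparison are required.
inline = df[(df["Gender"] == "Female") & (df["Team"] == "Marketing")]

print(both)
```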

In [21]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [26]:
# female employees who work on the Marketing team
employees[(employees['Gender'] == 'Female') & (employees['Team'] == 'Marketing')].head()

# another way to get the same result:
is_female= employees['Gender'] == 'Female'
is_in_marketing= employees['Team'] == 'Marketing'
employees[is_female & is_in_marketing].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
43,Marilyn,Female,1980-12-07,03:16:00,73524,5.207,True,Marketing
62,,Female,2007-06-12,05:25:00,58112,19.414,True,Marketing
98,Tina,Female,2016-06-16,07:47:00,100705,16.961,True,Marketing
140,Shirley,Female,1981-02-28,01:23:00,113850,1.854,False,Marketing
158,Norma,Female,1999-02-28,08:45:00,114412,8.756,True,Marketing


In [28]:
# female employees on the Marketing team who earn over $100k a year

is_female= employees['Gender'] == 'Female'
is_in_marketing= employees['Team'] == 'Marketing'
salary_over_100= employees['Salary'] > 100000
employees[is_female & is_in_marketing & salary_over_100].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
98,Tina,Female,2016-06-16,07:47:00,100705,16.961,True,Marketing
140,Shirley,Female,1981-02-28,01:23:00,113850,1.854,False,Marketing
158,Norma,Female,1999-02-28,08:45:00,114412,8.756,True,Marketing
305,Margaret,Female,1993-02-06,01:05:00,125220,3.733,False,Marketing
319,Jacqueline,Female,1981-11-25,03:01:00,145988,18.243,False,Marketing


## Filter with More than One Condition (OR)
- Use the `|` operator in between two Boolean **Series** to filter by *either* condition.
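
The sketch below, on invented data, combines `|` with `&`. Explicit parentheses around the `&` group make the intended grouping visible instead of relying on operator precedence.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "First Name": ["Robert", "Robert", "Tina", "Lee"],
    "Team": ["Client Services", "Sales", "Marketing", "Legal"],
    "Start Date": pd.to_datetime(
        ["1994-10-29", "2001-05-02", "2016-06-16", "2010-01-01"]
    ),
})

named_robert = df["First Name"] == "Robert"
in_client_services = df["Team"] == "Client Services"
started_after_jun_2016 = df["Start Date"] > "2016-06-01"

# (Robert AND Client Services) OR started after 2016-06-01
result = df[(named_robert & in_client_services) | started_after_jun_2016]
print(result)
```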

In [30]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [32]:
# employees who are either Senior Management or started before January 1st, 1990

are_senior_management= employees['Senior Management']
started_before_jan_1990= employees['Start Date'] < '1990-01-01'

employees[are_senior_management | started_before_jan_1990].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal


In [33]:
# employees whose First Name is Robert who work in Client Services or have Start Date after 2016-06-01

named_robert= employees['First Name'] == 'Robert'
work_for_client_services= employees['Team'] == 'Client Services'
entered_after_jun_2016= employees['Start Date'] > '2016-06-01'

employees[(named_robert & work_for_client_services) | entered_after_jun_2016]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
15,Lillian,Female,2016-06-05,06:09:00,59414,1.256,False,Product
98,Tina,Female,2016-06-16,07:47:00,100705,16.961,True,Marketing
387,Robert,Male,1994-10-29,04:26:00,123294,19.894,False,Client Services
451,Terry,,2016-07-15,12:29:00,140002,19.49,True,Marketing


## The isin Method
- The `isin` **Series** method accepts a collection object like a list, tuple, or **Series**.
- The method returns True for a row if its value is found in the collection.
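
As a rough illustration on invented data, `isin` gives the same mask as chaining several equality checks with `|`, and a missing value is simply reported as False.

```python
import pandas as pd

# Hypothetical data, invented for this example; None stands for a missing team.
df = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cleo", "Dan", "Eve"],
    "Team": ["Legal", "Sales", "Finance", None, "Product"],
})

# One call instead of three chained comparisons.
mask = df["Team"].isin(["Legal", "Sales", "Product"])

# Equivalent, but more verbose.
chained = (
    (df["Team"] == "Legal")
    | (df["Team"] == "Sales")
    | (df["Team"] == "Product")
)

print(df[mask])
```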

In [35]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [39]:
# Legal team OR Sales team OR Product team
employees[employees['Team'].isin(['Legal', 'Sales', 'Product'])]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,04:20:00,65476,10.012,True,Product
11,Julie,Female,1997-10-26,03:19:00,102508,12.637,True,Legal
13,Gary,Male,2008-01-27,11:40:00,109831,5.831,False,Sales
15,Lillian,Female,2016-06-05,06:09:00,59414,1.256,False,Product
...,...,...,...,...,...,...,...,...
981,James,Male,1993-01-15,05:19:00,148985,19.280,False,Legal
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal
989,Justin,,1991-02-10,04:58:00,38344,3.794,False,Legal
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product


## The isnull and notnull Methods
- The `isnull` method returns True for `NaN` values in a **Series**.
- The `notnull` method returns True for present values in a **Series**.
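
A small sketch on invented data shows both methods producing Boolean masks that can be combined like any other condition.

```python
import pandas as pd

# Hypothetical data, invented for this example; None stands for a missing value.
df = pd.DataFrame({
    "First Name": [None, "Ben", None],
    "Team": ["Finance", None, None],
})

missing_name = df["First Name"].isnull()   # True where the name is missing
has_team = df["Team"].notnull()            # True where the team is present

# Rows with no first name but with a team assigned.
result = df[missing_name & has_team]
print(result)
```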

In [40]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [44]:
employees[employees['Team'].isnull()] # employees that don't have Team filled
employees[employees['Team'].notnull()] # employees that have Team filled

employees[ employees['First Name'].isnull() & employees['Team'].notnull() ]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
25,,Male,2012-10-08,01:12:00,37076,18.576,True,Client Services
39,,Male,2016-01-29,02:33:00,122173,7.797,True,Client Services
51,,,2011-12-17,08:29:00,41126,14.009,True,Sales
62,,Female,2007-06-12,05:25:00,58112,19.414,True,Marketing
116,,Male,1991-06-22,08:58:00,76189,18.988,True,Legal
149,,Female,2014-08-17,02:00:00,86230,8.578,True,Distribution
157,,Female,2005-07-27,08:32:00,79536,14.443,True,Product
165,,Female,2014-03-23,01:28:00,59148,9.061,True,Legal
166,,Female,1991-07-09,06:52:00,42341,7.014,True,Sales


## The between Method
- The `between` method returns True if a **Series** value is found within its range.
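
The sketch below uses invented salaries to show that both endpoints are included by default. The `inclusive="neither"` argument is an assumption about the installed version: string values for `inclusive` are available in pandas 1.3 and later.

```python
import pandas as pd

# Hypothetical salaries, invented for this example.
salary = pd.Series([59999, 60000, 65000, 70000, 70001])

# Default behaviour: both endpoints count as "between".
closed = salary.between(60000, 70000)

# Assumes pandas >= 1.3: exclude both endpoints.
open_interval = salary.between(60000, 70000, inclusive="neither")

print(salary[closed])
print(salary[open_interval])
```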

In [45]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [60]:
employees[employees['Salary'].between(60000, 70000)] # the endpoints are included in the query

employees[employees['Bonus %'].between(2.0, 5.0)]

employees[employees['Start Date'].between('1991-01-01', '1992-01-01')]

import datetime as dt
employees[employees['Last Login Time'].between(dt.time(12, 30, 0), dt.time(12, 35, 0) ) ]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
627,Anne,Female,1984-11-21,12:30:00,128305,16.636,False,Marketing
694,Barbara,Female,2011-11-12,12:35:00,85718,13.326,False,Client Services
729,Steven,Male,1986-09-07,12:32:00,43252,18.892,False,Client Services
888,Marilyn,Female,2007-10-08,12:32:00,115149,11.934,True,Legal
915,Todd,Male,1983-01-04,12:34:00,115566,6.716,True,Client Services


## The duplicated Method
- The `duplicated` method returns True if a **Series** value is a duplicate.
- Pandas will mark one occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass False to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert the values of a Boolean **Series**. Trues will become Falses, and Falses will become Trues.
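
To make the `keep` options concrete, here is a minimal sketch on an invented list of names, including the `~` inversion that isolates values appearing exactly once.

```python
import pandas as pd

# Hypothetical names, invented for this example.
names = pd.Series(["Lucas", "Maria", "Lucas", "Dennis", "Lucas", "Maria"])

first = names.duplicated()               # keep='first' is the default
last = names.duplicated(keep="last")     # the last occurrence is the non-duplicate
every = names.duplicated(keep=False)     # every repeated value is flagged

# Inverting the keep=False mask leaves only the values that occur once.
only_once = names[~every]
print(only_once)
```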

In [None]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

In [None]:
employees['First Name'].duplicated()
# True marks a value that has already appeared earlier in the Series; False marks a first (or only) occurrence

# pandas flags a value as a duplicate only from its second appearance onward (for example, if Lucas appeared 3 times in the Series above, the first occurrence would get False and the other 2 would get True)

In [None]:
employees[employees['First Name'].duplicated(keep= 'first')] # keep='first' (the default) treats the first occurrence of each value as the non-duplicate
employees[employees['First Name'].duplicated(keep= 'last')] # keep='last' treats the last occurrence of each value as the non-duplicate
employees[employees['First Name'].duplicated(keep= False)] # keep=False marks every occurrence of a repeated value as True (that is, as a duplicate)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [76]:
# df['column'].duplicated(keep=False)
# returns a Boolean Series that is True for every row whose 'column' value appears more than once

# to get the opposite, that is, the rows whose 'column' value appears exactly once, negate it with ~
# ~df['column'].duplicated(keep=False)

In [77]:
employees[~ employees['First Name'].duplicated(keep= False)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
8,Angela,Female,2005-11-22,06:29:00,95570,18.523,True,Engineering
33,Jean,Female,1993-12-18,09:07:00,119082,16.18,False,Business Development
190,Carol,Female,1996-03-19,03:39:00,57783,9.129,False,Finance
291,Tammy,Female,1984-11-11,10:30:00,132839,17.463,True,Client Services
495,Eugene,Male,1984-05-24,10:54:00,81077,2.117,False,Sales
688,Brian,Male,2007-04-07,10:47:00,93901,17.821,True,Legal
832,Keith,Male,2003-02-12,03:02:00,120672,19.467,False,Legal
887,David,Male,2009-12-05,08:48:00,92242,15.407,False,Legal


## The drop_duplicates Method
- The `drop_duplicates` method deletes rows with duplicate values.
- By default, it will remove a row if *all* of its values are shared with another row.
- The `subset` parameter configures the columns to look for duplicate values within.
- Pass a list to `subset` parameter to look for duplicates across multiple columns.
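
The following sketch, on invented rows, contrasts the default whole-row comparison with `subset` on one column and on a pair of columns.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "First Name": ["Ana", "Ben", "Cleo", "Dan"],
    "Senior Management": [True, False, True, True],
    "Team": ["Finance", "Finance", "Legal", "Finance"],
})

# No two rows match in every column, so nothing is dropped.
whole_rows = df.drop_duplicates()

# Compare on Team only: the first row of each team survives.
by_team = df.drop_duplicates(subset=["Team"])

# Compare on the (Senior Management, Team) pair.
by_pair_first = df.drop_duplicates(subset=["Senior Management", "Team"])
by_pair_last = df.drop_duplicates(subset=["Senior Management", "Team"], keep="last")

# keep=False removes every row whose Team value is repeated.
unique_teams_only = df.drop_duplicates(subset=["Team"], keep=False)
print(by_pair_first)
```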

In [78]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [81]:
employees.drop_duplicates()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


- Nothing changed because, by default, pandas treats two rows as duplicates only if they have exactly the same values in every column.
- To look for duplicates in specific columns only, pass those columns to the `subset` parameter.

In [86]:
employees.drop_duplicates(subset=['Team']) # compares only the Team column and keeps the first row found for each team
employees.drop_duplicates(subset=['Team'], keep= 'first')
employees.drop_duplicates(subset=['Team'], keep= 'last') # the keep parameter works the same way as in the duplicated method (the default is 'first')
employees.drop_duplicates(subset=['Team'], keep= False) # drops every row whose Team value appears more than once

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team


In [87]:
employees.drop_duplicates(subset=['First Name'], keep= False)
# gives the same result as employees[~ employees['First Name'].duplicated(keep= False)]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
8,Angela,Female,2005-11-22,06:29:00,95570,18.523,True,Engineering
33,Jean,Female,1993-12-18,09:07:00,119082,16.18,False,Business Development
190,Carol,Female,1996-03-19,03:39:00,57783,9.129,False,Finance
291,Tammy,Female,1984-11-11,10:30:00,132839,17.463,True,Client Services
495,Eugene,Male,1984-05-24,10:54:00,81077,2.117,False,Sales
688,Brian,Male,2007-04-07,10:47:00,93901,17.821,True,Legal
832,Keith,Male,2003-02-12,03:02:00,120672,19.467,False,Legal
887,David,Male,2009-12-05,08:48:00,92242,15.407,False,Legal


In [89]:
employees.drop_duplicates(subset=['Senior Management', 'Team']).sort_values(by= 'Team') # keeps one row per unique combination of the subset columns
# (the first occurrence of each combination)

employees.drop_duplicates(subset=['Senior Management', 'Team'], keep= 'last').sort_values(by= 'Team')

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
971,Patrick,Male,2002-12-30,02:01:00,75423,5.368,True,Business Development
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development
965,Catherine,Female,1989-09-25,01:31:00,68164,18.393,False,Client Services
990,Robin,Female,1987-07-24,01:35:00,100765,10.982,True,Client Services
946,,Female,1985-09-15,01:50:00,133472,16.941,True,Distribution
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
993,Tina,Female,1997-05-15,03:53:00,56450,19.04,True,Engineering
984,Maria,Female,2011-10-15,04:53:00,43455,13.04,False,Engineering
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance


## The unique and nunique Methods
- The `unique` method on a **Series** returns a collection of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a *count* of the number of unique values in the **Series**/**DataFrame**.
- The `dropna` parameter configures whether to include or exclude missing (`NaN`) values.
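
A short sketch on invented values shows how the two methods treat a missing entry: `unique` reports it, while `nunique` leaves it out unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical data, invented for this example; None stands for a missing team.
df = pd.DataFrame({
    "Team": ["Marketing", None, "Finance", "Marketing"],
    "Salary": [97308, 61933, 130590, 97308],
})

distinct_values = df["Team"].unique()              # includes the missing value
count_present = df["Team"].nunique()               # missing values are ignored by default
count_with_missing = df["Team"].nunique(dropna=False)

# On a DataFrame, nunique returns one count per column.
per_column = df.nunique()
print(per_column)
```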

In [90]:
employees= pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format='%H:%M %p').dt.time 
employees['Senior Management'] = employees['Senior Management'].astype('bool')
employees['Gender'] = employees['Gender'].astype('category')
employees.head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services


In [None]:
employees['Gender'].unique()
type(employees['Gender'].unique())

# Reminder: the category dtype suits columns with a small number of unique values;
# unique() on such a column returns a Categorical instead of a NumPy array

pandas.core.arrays.categorical.Categorical

In [97]:
employees['Team'].unique()
#type(employees['Team'].unique())

array(['Marketing', nan, 'Finance', 'Client Services', 'Legal', 'Product',
       'Engineering', 'Business Development', 'Human Resources', 'Sales',
       'Distribution'], dtype=object)

In [91]:
employees.nunique()

First Name           200
Gender                 2
Start Date           972
Last Login Time      542
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64

## ChatGPT Exercises

In [100]:
#1) Find the total number of employees whose Salary is between 50,000 and 100,000 (inclusive) and belong to either the "Finance" or "Marketing" teams.

result= employees[ employees['Salary'].between(50000, 100000) & employees['Team'].isin(['Finance', 'Marketing']) ]
result.head()

print(result['Salary'].min())
print(result['Salary'].max())
print(result['Team'].unique())

50366
99747
['Marketing' 'Finance']
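
The cell above checks the filter but does not print the total itself. A minimal sketch of the count on made-up data (the rows are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Salary": [50000, 75000, 100000, 120000, 60000],
    "Team": ["Finance", "Marketing", "Finance", "Finance", "Legal"],
})

in_salary_range = toy["Salary"].between(50000, 100000)  # both ends are included by default
in_target_teams = toy["Team"].isin(["Finance", "Marketing"])

matching = toy[in_salary_range & in_target_teams]
total = len(matching)  # same as (in_salary_range & in_target_teams).sum()
print(total)  # 3
```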


In [109]:
#2) Count how many employees have missing values in any of the columns.
len(employees) - len(employees.dropna(how='any'))

# 236 employees have at least one missing value in the dataframe

236
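
An alternative sketch of the same count, building a row-wise mask with `isna().any(axis=1)` instead of comparing lengths (toy data, hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the four rows have a missing value
toy = pd.DataFrame({
    "First Name": ["Ann", None, "Cid", "Dee"],
    "Team": ["Legal", "Sales", np.nan, "Legal"],
    "Salary": [1, 2, 3, 4],
})

rows_with_missing = toy.isna().any(axis=1)  # True where any column in the row is missing
n_missing = int(rows_with_missing.sum())

# The length-difference approach from the cell above gives the same number
same_via_dropna = len(toy) - len(toy.dropna(how="any"))
print(n_missing, same_via_dropna)  # 2 2
```

The mask form also makes it easy to inspect the affected rows with `toy[rows_with_missing]`.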

In [112]:
#3) Extract all employees who started working between January 1, 2015, and December 31, 2020, and sort them by their Start Date in ascending order
that_started_between_2015_and_2020 = employees['Start Date'].between('2015-01-01', '2020-12-31')
employees[that_started_between_2015_and_2020].sort_values(by='Start Date', ascending=True).head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
457,Patricia,Female,2015-01-09,04:16:00,121232,16.624,False,Legal
831,Kenneth,Male,2015-01-15,02:41:00,69112,7.588,True,Finance
872,Brenda,Female,2015-01-18,04:39:00,73749,19.332,False,Business Development
235,Norma,Female,2015-02-11,11:44:00,94393,3.643,True,Engineering
432,Jessica,,2015-03-07,08:45:00,121160,12.993,False,Client Services


In [114]:
#4) Identify all employees who are Male and are part of the Senior Management team. How many of them are there?
that_are_male = employees['Gender'] == 'Male'
are_part_of_senior_team = employees['Senior Management'] == True  # redundant for a bool column, but readable

employees[that_are_male & are_part_of_senior_team]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
12,Brandon,Male,1980-12-01,01:08:00,112807,17.492,True,Human Resources
...,...,...,...,...,...,...,...,...
974,Harry,Male,2011-08-30,06:31:00,67656,16.455,True,Client Services
979,Ernest,Male,2013-07-20,06:41:00,142935,13.198,True,Product
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance
994,George,Male,2013-06-21,05:47:00,98874,4.479,True,Marketing
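
One caveat when counting senior managers: the column was converted with `astype('bool')`, and that conversion turns missing values into `True`. A small sketch on a made-up **Series**, with the nullable `'boolean'` dtype shown as one possible alternative (not what this notebook uses):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing value
raw = pd.Series([True, False, np.nan], dtype=object)

plain = raw.astype("bool")        # NaN is truthy, so it becomes True
nullable = raw.astype("boolean")  # nullable dtype keeps the gap as <NA>

print(plain.tolist())             # [True, False, True]
print(int(plain.sum()))           # 2
print(int(nullable.sum()))        # 1 (the missing value is skipped)
```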


In [137]:
#5) For each Team, find the name of the employee(s) with the highest Salary
# employees['Team']= employees['Team'].fillna('Unknown')
# employees.sort_values(by='Salary', ascending= False).groupby('Team')[['Team', 'First Name', 'Salary']].head(1)

that_have_biggest_salary_per_team= employees.groupby(['Team'])['Salary'].idxmax()
that_have_biggest_salary_per_team

Team
Business Development    721
Client Services         287
Distribution            793
Engineering             541
Finance                 644
Human Resources         429
Legal                   981
Marketing               740
Product                 828
Sales                   186
Unknown                 850
Name: Salary, dtype: int64

In [159]:
employees.loc[that_have_biggest_salary_per_team]  # idxmax returns index labels, so select with .loc

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
721,Harold,Male,2010-04-16,05:13:00,147417,11.626,True,Business Development
287,Lois,Female,2011-11-09,07:06:00,147183,9.999,True,Client Services
793,Andrea,Female,1999-07-22,09:25:00,149105,13.707,True,Distribution
541,Ruby,Female,1999-05-01,03:36:00,147362,7.851,True,Engineering
644,Katherine,Female,1996-08-13,12:21:00,149908,18.912,False,Finance
429,Rose,Female,2015-05-28,08:40:00,149903,5.63,False,Human Resources
981,James,Male,1993-01-15,05:19:00,148985,19.28,False,Legal
740,Russell,,2009-05-09,11:59:00,149456,3.533,False,Marketing
828,Cynthia,Female,2006-07-12,08:55:00,149684,7.864,False,Product
186,,Female,2005-02-23,09:50:00,149654,1.825,True,Sales
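
`idxmax` returns only the first label per group, so ties for the top salary are dropped. A sketch on made-up data of the label-based lookup and of one way to keep ties, using `transform('max')` (an alternative technique, not the one used above):

```python
import pandas as pd

# Toy data (hypothetical); the index labels are deliberately not 0..n-1
toy = pd.DataFrame(
    {
        "First Name": ["Ann", "Bob", "Cid", "Dee"],
        "Team": ["Legal", "Legal", "Sales", "Sales"],
        "Salary": [90, 90, 70, 80],
    },
    index=[10, 11, 12, 13],
)

top_labels = toy.groupby("Team")["Salary"].idxmax()  # first label of each team's maximum
top_rows = toy.loc[top_labels]                       # labels, so .loc (.iloc would fail here)

# Keep every employee tied for the team maximum
team_max = toy.groupby("Team")["Salary"].transform("max")
top_with_ties = toy[toy["Salary"] == team_max]

print(top_rows["First Name"].tolist())       # ['Ann', 'Dee']
print(top_with_ties["First Name"].tolist())  # ['Ann', 'Bob', 'Dee']
```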


In [166]:
#6) Find the average Bonus % for each gender (Male, Female) where the employees are part of Senior Management.
employees[ employees['Senior Management'] == True ].groupby('Gender')['Bonus %'].agg(['mean', 'std'])


Unnamed: 0_level_0,mean,std
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,9.89684,5.444026
Male,9.814062,5.585128


In [217]:
#7) Convert the Last Login Time column to a datetime type, and find the employee who logged in the earliest
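employees2 = pd.read_csv('employees.csv')
employees2.head()

# Datetime parsing goes through pandas (pd.to_datetime); .dt.time then keeps only the time of day.
# Note: with %H the AM/PM marker is ignored; '%I:%M %p' would read 12-hour times.
employees2['Last Login Time'] = pd.to_datetime(employees2['Last Login Time'], format='%H:%M %p').dt.time

# idxmin returns an index label, so select the row with .loc
# (this lookup uses the `employees` DataFrame prepared earlier, not `employees2`)
employees.loc[employees['Last Login Time'].idxmin(), ['First Name', 'Start Date', 'Last Login Time']]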

First Name                       Jerry
Start Date         2005-03-04 00:00:00
Last Login Time               01:00:00
Name: 3, dtype: object
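
Because `%H` ignores AM/PM, the result above treats `1:00 PM` as `01:00:00`. A sketch of the 12-hour parse with `%I` on three made-up login strings; the minimum is taken before `.dt.time` so the comparison runs on a datetime64 **Series**:

```python
import pandas as pd

# Toy data (hypothetical), in the same "H:MM AM/PM" shape as the CSV column
logins = pd.Series(["12:42 PM", "6:53 AM", "1:00 PM"])

parsed = pd.to_datetime(logins, format="%I:%M %p")  # %I = 12-hour clock, so %p is honored
earliest_label = parsed.idxmin()                    # label of the earliest login
times = parsed.dt.time                              # time-of-day objects, for display

print(times.tolist())
print(earliest_label)  # 1
```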

In [195]:
#8) Replace all missing values in the Team column with "Unassigned" and count how many employees are now in the "Unassigned" team
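# Note: this cell only counts non-null First Name values per team;
# it does not yet fill the missing Team values with "Unassigned"
employees[['Team', 'First Name']].groupby('Team').count().sort_values(by='First Name', ascending=False)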

Unnamed: 0_level_0,First Name
Team,Unnamed: 1_level_1
Client Services,100
Business Development,99
Finance,97
Product,92
Marketing,91
Engineering,86
Legal,86
Sales,86
Human Resources,85
Distribution,77
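
A minimal sketch of the replacement step the exercise asks for, on made-up data (the column contents are hypothetical). `value_counts` counts rows, whereas `groupby(...).count()` counts non-null values of the selected column:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two rows have no team
toy = pd.DataFrame({"Team": ["Legal", np.nan, "Sales", np.nan, "Legal"]})

toy["Team"] = toy["Team"].fillna("Unassigned")
unassigned_count = int((toy["Team"] == "Unassigned").sum())
team_sizes = toy["Team"].value_counts()  # rows per team, largest first

print(unassigned_count)  # 2
print(team_sizes)
```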


In [202]:
#9) Extract all employees who either have a Bonus % greater than 10 or belong to the "HR" team, but exclude employees whose Salary is less than 40,000
that_have_bonus_greater_than_10 = employees['Bonus %'] > 10
belong_to_HR_team = employees['Team'] == 'Human Resources'  # "HR" is stored as 'Human Resources'
has_a_salary_of_at_least_40k = employees['Salary'] >= 40000  # excludes salaries below 40,000

employees[(that_have_bonus_greater_than_10 | belong_to_HR_team) & has_a_salary_of_at_least_40k]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
6,Ruby,Female,1987-08-17,04:20:00,65476,10.012,True,Product
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
8,Angela,Female,2005-11-22,06:29:00,95570,18.523,True,Engineering
...,...,...,...,...,...,...,...,...
993,Tina,Female,1997-05-15,03:53:00,56450,19.040,True,Engineering
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [211]:
employees.groupby(["Team", "Gender"])["First Name"].count()#.unstack()


Team                  Gender
Business Development  Female    49
                      Male      39
Client Services       Female    47
                      Male      38
Distribution          Female    31
                      Male      29
Engineering           Female    43
                      Male      36
Finance               Female    42
                      Male      38
Human Resources       Female    33
                      Male      43
Legal                 Female    33
                      Male      34
Marketing             Female    36
                      Male      38
Product               Female    44
                      Male      39
Sales                 Female    35
                      Male      37
Unknown               Female     7
                      Male      24
Name: First Name, dtype: int64

In [215]:
#10) Create a summary table showing the count of employees by Gender for each Team

employees.groupby(['Team','Gender'])['Gender'].count().unstack()


Gender,Female,Male
Team,Unnamed: 1_level_1,Unnamed: 2_level_1
Business Development,50,40
Client Services,48,42
Distribution,37,35
Engineering,44,40
Finance,44,41
Human Resources,37,45
Legal,34,35
Marketing,40,41
Product,45,40
Sales,39,39
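
The two summary tables above count different things: `['First Name'].count()` skips rows with a missing name, which may be why their numbers differ. A sketch on made-up data comparing both with `size()` and `pd.crosstab`, which count rows:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing name, one missing gender
toy = pd.DataFrame({
    "First Name": ["Ann", None, "Cid", "Dee"],
    "Gender": ["Female", "Female", "Male", np.nan],
    "Team": ["Legal", "Legal", "Legal", "Sales"],
})

# Rows per (Team, Gender); rows with a missing Gender are dropped from the grouping
by_size = toy.groupby(["Team", "Gender"]).size().unstack(fill_value=0)

# Non-null First Name values per (Team, Gender)
by_name = toy.groupby(["Team", "Gender"])["First Name"].count().unstack(fill_value=0)

# crosstab builds the same row-count table in one call
table = pd.crosstab(toy["Team"], toy["Gender"])

print(by_size)
print(by_name)
print(table)
```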
