# DataFrames II: Filtering Data

In [1]:
import pandas as pd
import datetime as dt

## This Module's Dataset + Memory Optimization
- The `pd.to_datetime` method converts a **Series** to hold datetime values.
- The `format` parameter informs pandas of the format that the times are stored in.
- We pass symbols designating the segments of the string. For example, %m means "month" and %d means day.
- The `dt` attribute reveals an object with many datetime-related attributes and methods.
- The `dt.time` attribute extracts only the time from each value in a datetime **Series**.
- Use the `astype` method to convert the values in a **Series** to another type.
- The `parse_dates` parameter of `read_csv` is an alternate way to parse strings as datetimes.

In [2]:
employees = pd.read_csv("data_files/employees.csv")

employees.head(3)
#employees.info()  # memory usage: 62.6+ KB

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance


In [3]:
employees.dtypes[["Gender", "Start Date", "Last Login Time", "Senior Management"]]

Gender               object
Start Date           object
Last Login Time      object
Senior Management    object
dtype: object

In [4]:
# format tell pandas what the format of the input is.
# i.e.
# str = "09-11-2000"   format = "%m-%d-%Y"
# str = "09/11/2000"   format = "%m/%d/%Y"

employees["Start Date"] = pd.to_datetime( employees["Start Date"], format="%m/%d/%Y" )

In [5]:
# %H = Hour
# %M = Minute
# %p = am/pm

# NOTE: In the absence of a date, pandas will default to 1900-01-01
# because it just store both a date and a time.
# To get around this, we use .dt.time after the to_datetime function
employees["Last Login Time"] = pd.to_datetime( employees["Last Login Time"], format="%H:%M %p" ).dt.time

In [6]:
employees["Senior Management"] = employees["Senior Management"].astype(bool)

In [7]:
employees["Gender"] = employees["Gender"].astype("category")

In [8]:
employees.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance


In [9]:
employees.dtypes[["Gender", "Start Date", "Last Login Time", "Senior Management"]]

#employees.info()  # memory usage: 49.1+ KB (21.5% optimization)

Gender                     category
Start Date           datetime64[ns]
Last Login Time              object
Senior Management              bool
dtype: object

---

## Filter a DataFrame Based On a Single Condition
- Pandas needs a **Series** of Booleans to perform a filter.
- Pass the Boolean Series inside square brackets after the **DataFrame**.
- We can generate a Boolean Series using a wide variety of operations (equality, inequality, less than, greater than, inclusion, etc)

In [10]:
employees[employees["Gender"] == "Male"]  # Return all male employees

employees[employees["Team"] == "Finance"]  # Return all employees that are in finance

# You can also break up the syntax for readability
on_marketing_team = employees["Team"] == "Marketing"
employees[on_marketing_team]

# For columns that are already a bool - such as "Senior Management" - we can skip the operator
employees[employees["Senior Management"]]

employees[employees["Salary"] > 100_000]

employees[employees["Bonus %"] < 1.5].head()


# Pandas is smart and (if using the same format) will understand date times
employees[ employees["Start Date"] < "1995-01-01" ]

# Hour, Minute, Second
time = dt.time(8, 0, 0)
employees[ employees["Last Login Time"] > time ]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
10,Louise,Female,1980-08-12,09:01:00,63241,15.132,True,
13,Gary,Male,2008-01-27,11:40:00,109831,5.831,False,Sales
...,...,...,...,...,...,...,...,...
983,John,Male,1982-12-23,10:35:00,146907,11.738,False,Engineering
985,Stephen,,1983-07-10,08:10:00,85668,1.909,False,Legal
988,Alice,Female,2004-10-05,09:34:00,47638,11.209,False,Human Resources
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance


---

## Filter with More than One Condition (AND)
- Add the `&` operator in between two Boolean **Series** to filter by multiple conditions.
- We can assign the **Series** to variables to make the syntax more readable.

In [11]:
# NOTE: For whatever reason, you must put the operators into a variable before using them.

is_female = employees["Gender"] == "Female"
is_marketing = employees["Team"] == "Marketing"

employees[ is_female & is_marketing ].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
43,Marilyn,Female,1980-12-07,03:16:00,73524,5.207,True,Marketing
62,,Female,2007-06-12,05:25:00,58112,19.414,True,Marketing
98,Tina,Female,2016-06-16,07:47:00,100705,16.961,True,Marketing
140,Shirley,Female,1981-02-28,01:23:00,113850,1.854,False,Marketing
158,Norma,Female,1999-02-28,08:45:00,114412,8.756,True,Marketing


In [12]:
is_male = employees["Gender"] == "Male"
is_engineer = employees["Team"] == "Engineering"
over_paid = employees["Salary"] > 130_000

employees[ is_male & is_engineer & over_paid ].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
171,Patrick,Male,2007-08-17,03:16:00,143499,17.495,True,Engineering
175,Willie,Male,1998-02-17,08:20:00,146651,1.451,True,Engineering
447,Gregory,Male,2009-05-15,03:52:00,142208,11.204,True,Engineering
604,Bruce,Male,2013-03-15,11:13:00,141335,15.427,True,Engineering
652,Willie,Male,2009-12-05,05:39:00,141932,1.017,True,Engineering


---

## Filter with More than One Condition (OR)
- Use the `|` operator in between two Boolean **Series** to filter by *either* condition.

In [13]:
# Find employees that are Senior Management or started before 1990

is_senior = employees["Senior Management"]
started_before_1990 = employees["Start Date"] < "1990-01-01"

employees[ is_senior | started_before_1990 ].head()

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,01:00:00,138705,9.34,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal


## Filter using both & and |

In [14]:
# First name Robbert AND works in Client Services, OR Start Date after 2016-06-01

name = employees["First Name"] == "Robert"
role = employees["Team"] == "Client Services"
start = employees["Start Date"] > "2016-06-01"


# NOTE: You MUST have () to specify the first item you want
# to evaluate, then the second operation will happen.

employees[ (name & role) | start ]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
15,Lillian,Female,2016-06-05,06:09:00,59414,1.256,False,Product
98,Tina,Female,2016-06-16,07:47:00,100705,16.961,True,Marketing
387,Robert,Male,1994-10-29,04:26:00,123294,19.894,False,Client Services
451,Terry,,2016-07-15,12:29:00,140002,19.49,True,Marketing


---

## The isin Method
- The `isin` **Series** method accepts a collection object like a list, tuple, or **Series**.
- The method returns True for a row if its value is found in the collection.

---

## The isnull and notnull Methods
- The `isnull` method returns True for `NaN` values in a **Series**.
- The `notnull` method returns True for present values in a **Series**.

---

## The between Method
- The `between` method returns True if a **Series** value is found within its range.

---

## The duplicated Method
- The `duplicated` method returns True if a **Series** value is a duplicate.
- Pandas will mark one occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass False to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert a **Series's** values. Trues will become Falses, and Falses will become trues.

---

## The drop_duplicates Method
- The `drop_duplicates` method deletes rows with duplicate values.
- By default, it will remove a row if *all* of its values are shared with another row.
- The `subset` parameter configures the columns to look for duplicate values within.
- Pass a list to `subset` parameter to look for duplicates across multiple columns.

---

## The unique and nunique Methods
- The `unique` method on a **Series** returns a collection of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a *count* of the number of unique values in the **Series**/**DataFrame**.
- The `dropna` parameter configures whether to include or exclude missing (`NaN`) values.