# DataFrames II: Filtering Data

In [None]:
import pandas as pd

## This Module's Dataset + Memory Optimization
- The `pd.to_datetime` method converts a **Series** to hold datetime values.
- The `format` parameter informs pandas of the format that the times are stored in.
- We pass symbols designating the segments of the string. For example, %m means "month" and %d means day.
- The `dt` attribute reveals an object with many datetime-related attributes and methods.
- The `dt.time` attribute extracts only the time from each value in a datetime **Series**.
- Use the `astype` method to convert the values in a **Series** to another type.
- The `parse_dates` parameter of `read_csv` is an alternate way to parse strings as datetimes.

In [None]:
employees = pd.read_csv('employees.csv')
employees.info()

In [None]:
employees

In [None]:
employees["Start Date"] = pd.to_datetime(employees["Start Date"], format="%m/%d/%Y")
employees.info()

In [None]:
employees

In [None]:
employees["Last Login Time"] = pd.to_datetime(employees["Last Login Time"], format="%H:%M %p").dt.time

In [None]:
employees.info()

In [None]:
employees

In [None]:
employees['Gender'] = employees['Gender'].astype('category')

In [None]:
employees.info()

In [None]:
employees

In [None]:
employees['Senior Management'] = employees['Senior Management'].astype(bool)

In [None]:
employees

In [None]:
employees['Senior Management'] = employees['Senior Management'].astype('category')

In [None]:
employees.info()

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees

## Filter A DataFrame  Based On A Condition
- Pandas needs a **Series** of Booleans to perform a filter.
- Pass the Boolean Series inside square brackets after the **DataFrame**.
- We can generate a Boolean Series using a wide variety of operations (equality, inequality, less than, greater than, inclusion, etc)

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

In [None]:
employees[employees['Gender'] == 'Male']

In [None]:
employees[employees['Salary'] > 148_980]

In [None]:
employees[employees['Start Date'] > '2016-05-01']

In [None]:
import datetime as dt

In [None]:
employees[employees['Last Login Time'] > dt.time(12, 5, 0)]

## Filter with More than One Condition (AND)
- Add the `&` operator in between two Boolean **Series** to filter by multiple conditions.
- We can assign the **Series** to variables to make the syntax more readable.

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

In [None]:
male_employees = employees['Gender'] == 'Male'
high_salary = employees['Team'] == 'Finance'
employees[male_employees & high_salary]

In [None]:
high_salary = employees['Salary'] > 130_000
early_start_date = employees['Start Date'] < '1981-01-01'
employees[high_salary & early_start_date]

## Filter with More than One Condition (OR)
- Use the `|` operator in between two Boolean **Series** to filter by *either* condition.

In [167]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [173]:
Dennis_name = employees['First Name'] == 'Dennis'
finance_team = employees['Team'] == 'Finance'

In [174]:
dennis_or_finance = employees[Dennis_name | finance_team]
dennis_or_finance

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
5,Dennis,Male,1987-04-18,01:35:00,115163,10.125,False,Legal
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
14,Kimberly,Female,1999-01-14,07:13:00,41426,14.543,True,Finance
...,...,...,...,...,...,...,...,...
907,Elizabeth,Female,1998-07-27,11:12:00,137144,10.081,False,Finance
954,Joe,Male,1980-01-19,04:06:00,119667,1.148,True,Finance
987,Gloria,Female,2014-12-08,05:08:00,136709,10.331,True,Finance
992,Anthony,Male,2011-10-16,08:35:00,112769,11.625,True,Finance


In [187]:
marketing_team = employees['Team'] == 'Marketing'
early_date = employees['Start Date'] < '1981-01-01'
high_salary = employees['Salary'] > 149_000

In [188]:
result = employees[high_salary | (early_date & marketing_team)]
result

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
43,Marilyn,Female,1980-12-07,03:16:00,73524,5.207,True,Marketing
160,Kathy,Female,2000-03-18,07:26:00,149563,16.991,True,Finance
186,,Female,2005-02-23,09:50:00,149654,1.825,True,Sales
429,Rose,Female,2015-05-28,08:40:00,149903,5.63,False,Human Resources
601,Christine,,1980-06-15,06:00:00,50366,9.862,True,Marketing
644,Katherine,Female,1996-08-13,12:21:00,149908,18.912,False,Finance
740,Russell,,2009-05-09,11:59:00,149456,3.533,False,Marketing
793,Andrea,Female,1999-07-22,09:25:00,149105,13.707,True,Distribution
828,Cynthia,Female,2006-07-12,08:55:00,149684,7.864,False,Product
881,Ruby,Female,1980-01-28,11:08:00,142868,6.318,False,Marketing


## The isin Method
- The `isin` **Series** method accepts a collection object like a list, tuple, or **Series**.
- The method returns True for a row if its value is found in the collection.

## The isnull and notnull Methods
- The `isnull` method returns True for `NaN` values in a **Series**.
- The `notnull` method returns True for present values in a **Series**.

## The between Method
- The `between` method returns True if a **Series** value is found within its range.

## The duplicated Method
- The `duplicated` method returns True if a **Series** value is a duplicate.
- Pandas will mark one occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass False to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert a **Series's** values. Trues will become Falses, and Falses will become trues.

## The drop_duplicates Method
- The `drop_duplicates` method deletes rows with duplicate values.
- By default, it will remove a row if *all* of its values are shared with another row.
- The `subset` parameter configures the columns to look for duplicate values within.
- Pass a list to `subset` parameter to look for duplicates across multiple columns.

## The unique and nunique Methods
- The `unique` method on a **Series** returns a collection of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a *count* of the number of unique values in the **Series**/**DataFrame**.
- The `dropna` parameter configures whether to include or exclude missing (`NaN`) values.