# DataFrames II: Filtering Data

In [1]:
import pandas as pd

## This Module's Dataset + Memory Optimization
- The `pd.to_datetime` method converts a **Series** to hold datetime values.
- The `format` parameter informs pandas of the format that the times are stored in.
- We pass symbols designating the segments of the string. For example, %m means "month" and %d means day.
- The `dt` attribute reveals an object with many datetime-related attributes and methods.
- The `dt.time` attribute extracts only the time from each value in a datetime **Series**.
- Use the `astype` method to convert the values in a **Series** to another type.
- The `parse_dates` parameter of `read_csv` is an alternate way to parse strings as datetimes.

In [None]:
employees = pd.read_csv('employees.csv')
employees.info()

In [None]:
employees

In [None]:
employees["Start Date"] = pd.to_datetime(employees["Start Date"], format="%m/%d/%Y")
employees.info()

In [None]:
employees

In [None]:
employees["Last Login Time"] = pd.to_datetime(employees["Last Login Time"], format="%H:%M %p").dt.time

In [None]:
employees.info()

In [None]:
employees

In [None]:
employees['Gender'] = employees['Gender'].astype('category')

In [None]:
employees.info()

In [None]:
employees

In [None]:
employees['Senior Management'] = employees['Senior Management'].astype(bool)

In [None]:
employees

In [None]:
employees['Senior Management'] = employees['Senior Management'].astype('category')

In [None]:
employees.info()

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees

## Filter A DataFrame  Based On A Condition
- Pandas needs a **Series** of Booleans to perform a filter.
- Pass the Boolean Series inside square brackets after the **DataFrame**.
- We can generate a Boolean Series using a wide variety of operations (equality, inequality, less than, greater than, inclusion, etc)

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

In [None]:
employees[employees['Gender'] == 'Male']

In [None]:
employees[employees['Salary'] > 148_980]

In [None]:
employees[employees['Start Date'] > '2016-05-01']

In [None]:
import datetime as dt

In [None]:
employees[employees['Last Login Time'] > dt.time(12, 5, 0)]

## Filter with More than One Condition (AND)
- Add the `&` operator in between two Boolean **Series** to filter by multiple conditions.
- We can assign the **Series** to variables to make the syntax more readable.

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

In [None]:
male_employees = employees['Gender'] == 'Male'
high_salary = employees['Team'] == 'Finance'
employees[male_employees & high_salary]

In [None]:
high_salary = employees['Salary'] > 130_000
early_start_date = employees['Start Date'] < '1981-01-01'
employees[high_salary & early_start_date]

## Filter with More than One Condition (OR)
- Use the `|` operator in between two Boolean **Series** to filter by *either* condition.

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

In [None]:
Dennis_name = employees['First Name'] == 'Dennis'
finance_team = employees['Team'] == 'Finance'

In [None]:
dennis_or_finance = employees[Dennis_name | finance_team]
dennis_or_finance

In [None]:
marketing_team = employees['Team'] == 'Marketing'
early_date = employees['Start Date'] < '1981-01-01'
high_salary = employees['Salary'] > 149_000

In [None]:
result = employees[high_salary | (early_date & marketing_team)]
result

## The isin Method
- The `isin` **Series** method accepts a collection object like a list, tuple, or **Series**.
- The method returns True for a row if its value is found in the collection.

In [None]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

In [None]:
result = employees['Team'].isin(['Finance', 'Marketing', 'Legal'])

In [None]:
employees[result]

## The isnull and notnull Methods
- The `isnull` method returns True for `NaN` values in a **Series**.
- The `notnull` method returns True for present values in a **Series**.

In [2]:
employees = pd.read_csv('employees.csv', parse_dates=['Start Date'], date_format='%m/%d/%Y')
employees['Last Login Time'] = pd.to_datetime(employees['Last Login Time'], format="%H:%M %p").dt.time
employees['Gender'] = employees['Gender'].astype('category')
employees['Senior Management'] = employees['Senior Management'].astype(bool)
employees['Senior Management'] = employees['Senior Management'].astype('category')
employees

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,06:53:00,61933,4.170,True,
2,Maria,Female,1993-04-23,11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,01:00:00,138705,9.340,True,Finance
4,Larry,Male,1998-01-24,04:47:00,101004,1.389,True,Client Services
...,...,...,...,...,...,...,...,...
995,Henry,,2014-11-23,06:09:00,132483,16.655,False,Distribution
996,Phillip,Male,1984-01-31,06:30:00,42392,19.675,False,Finance
997,Russell,Male,2013-05-20,12:39:00,96914,1.421,False,Product
998,Larry,Male,2013-04-20,04:45:00,60500,11.985,False,Business Development


In [7]:
employees['First Name'].isnull()
employees['Team'].notnull()
employees[employees['First Name'].isnull() & employees['Team'].notnull()]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
7,,Female,2015-07-20,10:43:00,45906,11.598,True,Finance
25,,Male,2012-10-08,01:12:00,37076,18.576,True,Client Services
39,,Male,2016-01-29,02:33:00,122173,7.797,True,Client Services
51,,,2011-12-17,08:29:00,41126,14.009,True,Sales
62,,Female,2007-06-12,05:25:00,58112,19.414,True,Marketing
116,,Male,1991-06-22,08:58:00,76189,18.988,True,Legal
149,,Female,2014-08-17,02:00:00,86230,8.578,True,Distribution
157,,Female,2005-07-27,08:32:00,79536,14.443,True,Product
165,,Female,2014-03-23,01:28:00,59148,9.061,True,Legal
166,,Female,1991-07-09,06:52:00,42341,7.014,True,Sales


## The between Method
- The `between` method returns True if a **Series** value is found within its range.

## The duplicated Method
- The `duplicated` method returns True if a **Series** value is a duplicate.
- Pandas will mark one occurrence of a repeated value as a non-duplicate.
- Use the `keep` parameter to designate whether the first or last occurrence of a repeated value should be considered the "non-duplicate".
- Pass False to the `keep` parameter to mark all occurrences of repeated values as duplicates.
- Use the tilde symbol (`~`) to invert a **Series's** values. Trues will become Falses, and Falses will become trues.

## The drop_duplicates Method
- The `drop_duplicates` method deletes rows with duplicate values.
- By default, it will remove a row if *all* of its values are shared with another row.
- The `subset` parameter configures the columns to look for duplicate values within.
- Pass a list to `subset` parameter to look for duplicates across multiple columns.

## The unique and nunique Methods
- The `unique` method on a **Series** returns a collection of its unique values. The method does not exist on a **DataFrame**.
- The `nunique` method returns a *count* of the number of unique values in the **Series**/**DataFrame**.
- The `dropna` parameter configures whether to include or exclude missing (`NaN`) values.