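## [ Data Filtering ]
- A way to extract only the data you want
- In other words, keep the rows that satisfy a condition and drop the rest
- Filters are built from condition checks => comparison operators: ==, !=, >, >=, <, <=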
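As a minimal sketch of the comparison operators listed above: applied to a pandas Series, each one is evaluated element by element and returns a boolean Series of the same length. The salary values here are made up for illustration and are not taken from employees.csv.

```python
import pandas as pd

# Hypothetical salaries, not from employees.csv
salary = pd.Series([90000, 140000, 150000], name='Salary')

eq_mask = salary == 140000   # element-wise equality
ne_mask = salary != 140000   # element-wise inequality
ge_mask = salary >= 140000   # greater than or equal
lt_mask = salary < 140000    # strictly less than

print(ge_mask.tolist())
```

Each result can be used directly as a row filter, which is what the rest of this notebook does.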

(1) Load modules

In [1]:
import pandas as pd
import numpy as np

(2) Prepare the data

In [2]:
file = '../../DATA/employees.csv'

(3) Load the data into a DataFrame

In [3]:
empDF = pd.read_csv(file)

(4) Inspect the data

In [4]:
empDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   First Name  933 non-null    object 
 1   Gender      854 non-null    object 
 2   Start Date  999 non-null    object 
 3   Salary      999 non-null    float64
 4   Mgmt        933 non-null    object 
 5   Team        957 non-null    object 
dtypes: float64(1), object(5)
memory usage: 47.0+ KB


(5) Filter the data

In [5]:
'Happy' == 'Happy', 'Happy' == 'happy'

(True, False)

- [Problem] Extract only the rows whose first name is Maria

In [6]:
print(empDF.columns, empDF.index, sep='\n\n')

Index(['First Name', 'Gender', 'Start Date', 'Salary', 'Mgmt', 'Team'], dtype='object')

RangeIndex(start=0, stop=1001, step=1)


- A Series of bool values used to select rows => Boolean indexing

In [7]:
empDF['First Name'] == 'Maria'

0       False
1       False
2        True
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: First Name, Length: 1001, dtype: bool

In [8]:
empDF[empDF['First Name'] == 'Maria']

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,,130590.0,False,Finance
198,Maria,Female,12/27/90,36067.0,True,Product
815,Maria,,1/18/86,106562.0,False,HR
844,Maria,,6/19/85,148857.0,False,Legal
936,Maria,Female,3/14/03,96250.0,False,Business Dev
984,Maria,Female,10/15/11,43455.0,False,Engineering


In [9]:
(empDF['First Name'] == 'Maria').sum()

6
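The count above works because True is treated as 1 and False as 0 when a boolean mask is summed. A small sketch of the same idea, using a made-up Series rather than the real data:

```python
import pandas as pd

# Hypothetical names, including one missing value
names = pd.Series(['Maria', 'Larry', None, 'Maria'], name='First Name')

mask = names == 'Maria'        # the missing value compares as False

count = int(mask.sum())        # number of matching rows
share = float(mask.mean())     # fraction of rows that match
any_match = bool(mask.any())   # True if at least one row matches

print(count, share, any_match)
```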

In [10]:
mariaMask = empDF['First Name'] == 'Maria'
mariaMask

0       False
1       False
2        True
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: First Name, Length: 1001, dtype: bool

In [11]:
empDF[mariaMask]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,,130590.0,False,Finance
198,Maria,Female,12/27/90,36067.0,True,Product
815,Maria,,1/18/86,106562.0,False,HR
844,Maria,,6/19/85,148857.0,False,Legal
936,Maria,Female,3/14/03,96250.0,False,Business Dev
984,Maria,Female,10/15/11,43455.0,False,Engineering


[Problem] From the employees data, extract the names of people whose salary is at least 140000

In [12]:
moneyMask = empDF['Salary'] >= 140000
moneyMask

0       False
1       False
2       False
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: Salary, Length: 1001, dtype: bool

In [13]:
empDF[moneyMask]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
36,Rachel,Female,2/16/09,142032.0,False,Business Dev
44,Cynthia,Female,11/16/88,145146.0,True,Product
83,Shawn,Male,9/23/05,148115.0,True,Finance
87,Annie,Female,1/30/93,144887.0,True,Sales
96,Cynthia,Female,3/21/94,142321.0,False,Finance
...,...,...,...,...,...,...
948,Ashley,Female,3/31/06,142410.0,True,Engineering
951,,Female,9/14/10,143638.0,,
979,Ernest,Male,7/20/13,142935.0,True,Product
981,James,Male,1/15/93,148985.0,False,Legal
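The problem asks for names, while the cell above returns every column. One way to keep only the name column is to pass the mask and a column label to `.loc` together. This is a sketch on a small made-up frame; `high_paid_names` is an illustrative variable name.

```python
import pandas as pd

# Hypothetical sample shaped like the employees data
df = pd.DataFrame({
    'First Name': ['Rachel', 'Larry', 'Shawn'],
    'Salary': [142032.0, 101004.0, 148115.0],
})

money_mask = df['Salary'] >= 140000

# Rows come from the mask, the column from the label
high_paid_names = df.loc[money_mask, 'First Name']

print(high_paid_names.tolist())
```

With the notebook's own objects, the equivalent expression would be `empDF.loc[moneyMask, 'First Name']`.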


[Problem] From the employees data, print the rows where gender is 'Male' and salary is at least 140000

In [14]:
(empDF['Gender'] == 'Male').head()

0     True
1     True
2    False
3    False
4     True
Name: Gender, dtype: bool

In [15]:
(empDF['Salary'] >= 140000).head()

0    False
1    False
2    False
3    False
4    False
Name: Salary, dtype: bool

- Every condition must be True => AND condition
    * Syntax : cond1 & cond2 & ... & condN
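Two details of the AND syntax are easy to miss. Each condition needs its own parentheses, because `&` binds more tightly than `==` and `>=` in Python. The keyword `and` cannot replace `&`, because it asks for a single truth value of a whole Series. The sketch below uses made-up rows to show both masks being combined and `and` failing.

```python
import pandas as pd

# Hypothetical rows; the last one has a missing gender
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', None],
    'Salary': [148115.0, 145146.0, 61933.0, 150000.0],
})

is_male = df['Gender'] == 'Male'
is_high = df['Salary'] >= 140000

# Element-wise AND: True only where both masks are True
both = is_male & is_high

# The keyword `and` raises ValueError for Series operands
try:
    is_male and is_high
    and_failed = False
except ValueError:
    and_failed = True

print(both.tolist(), and_failed)
```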

In [16]:
(empDF['Gender'] == 'Male') & (empDF['Salary'] >= 140000)

0       False
1       False
2       False
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Length: 1001, dtype: bool

In [17]:
# Note: this cell uses 148000, a stricter threshold than the 140000 in the problem statement
theMask = (empDF['Gender'] == 'Male') & (empDF['Salary'] >= 148000)
empDF[theMask]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
83,Shawn,Male,9/23/05,148115.0,True,Finance
318,Roy,Male,8/6/06,148225.0,False,Finance
800,Clarence,Male,8/5/89,148941.0,False,Product
850,Charles,Male,9/3/97,148291.0,False,
981,James,Male,1/15/93,148985.0,False,Legal


- At least one condition must be True => OR condition
    * Syntax : cond1 | cond2 | ... | condN
    * Operator : | (pipe)
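A short sketch of the OR syntax on made-up rows. It also shows `~`, the element-wise NOT operator, which is not covered in the notes above but is the natural companion to `&` and `|`.

```python
import pandas as pd

# Hypothetical rows, not from employees.csv
df = pd.DataFrame({
    'Team': ['Finance', 'Legal', 'HR'],
    'Salary': [130590.0, 148985.0, 40000.0],
})

# True where at least one of the two conditions holds
either = (df['Team'] == 'Legal') | (df['Salary'] < 50000)

# Element-wise negation: rows that satisfy neither condition
neither = ~either

print(df[either])
```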

[Problem] From the employees data, retrieve the people whose start date is before January 1, 1993 or who are not in management

In [18]:
empDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   First Name  933 non-null    object 
 1   Gender      854 non-null    object 
 2   Start Date  999 non-null    object 
 3   Salary      999 non-null    float64
 4   Mgmt        933 non-null    object 
 5   Team        957 non-null    object 
dtypes: float64(1), object(5)
memory usage: 47.0+ KB


In [19]:
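# Caution: 'Start Date' is still object (string) dtype here, so this < is a lexicographic
# string comparison, not a chronological one. Convert with pd.to_datetime for a real date filter.
theMask2 = (empDF['Start Date'] < '1993-01-01') | (empDF['Mgmt'] == False)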

In [20]:
empDF[theMask2]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,,130590.0,False,Finance
4,Larry,Male,1/24/98,101004.0,True,IT
5,Dennis,Male,4/18/87,115163.0,False,Legal
8,Angela,Female,11/22/05,95570.0,True,Engineering
11,Julie,Female,10/26/97,102508.0,True,Legal
...,...,...,...,...,...,...
992,Anthony,Male,10/16/11,112769.0,True,Finance
995,Henry,,11/23/14,132483.0,False,Distribution
996,Phillip,Male,1/31/84,42392.0,False,Finance
997,Russell,Male,5/20/13,96914.0,False,Product


In [28]:
empDF['Start Time'] = pd.to_datetime(empDF['Start Date'])


In [29]:
empDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  933 non-null    object        
 1   Gender      854 non-null    object        
 2   Start Date  999 non-null    object        
 3   Salary      999 non-null    float64       
 4   Mgmt        933 non-null    object        
 5   Team        957 non-null    object        
 6   Start Time  999 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 54.9+ KB
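With a real datetime column available, the start-date problem can be filtered chronologically. The sketch below contrasts the string comparison with the datetime comparison on three made-up rows. The explicit `format='%m/%d/%y'` is an assumption based on how the dates appear in the outputs above.

```python
import pandas as pd

# Hypothetical rows using the same m/d/yy style as the outputs above
df = pd.DataFrame({
    'First Name': ['Larry', 'Dennis', 'Angela'],
    'Start Date': ['1/24/98', '4/18/87', '11/22/05'],
    'Mgmt': [True, False, False],
})

# String comparison: character by character, not by date
string_mask = df['Start Date'] < '1993-01-01'

# Assumed format; parse once, then compare as dates
df['Start Time'] = pd.to_datetime(df['Start Date'], format='%m/%d/%y')
date_mask = df['Start Time'] < pd.Timestamp('1993-01-01')

# Started before 1993, or not in management
the_mask = date_mask | (df['Mgmt'] == False)

print(df.loc[the_mask, 'First Name'].tolist())
```

The string mask flags Larry (1998) and Angela (2005) as "before 1993" only because their strings start with `1/` and `11`, which sort before `19`. The datetime mask selects only Dennis (1987).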


In [30]:
empDF['Team'].isin(['Sales', 'Legal', 'Marketing'])

0        True
1       False
2       False
3       False
4       False
        ...  
996     False
997     False
998     False
999      True
1000    False
Name: Team, Length: 1001, dtype: bool
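`isin` tests each element for membership in a list of values, so it can stand in for a chain of `==` comparisons joined with `|`. A sketch on a made-up Series, including a missing value, which matches nothing:

```python
import pandas as pd

# Hypothetical team labels with one missing entry
team = pd.Series(['Sales', 'Finance', None, 'Legal', 'Marketing'], name='Team')

isin_mask = team.isin(['Sales', 'Legal', 'Marketing'])

# The same test written as chained OR conditions
or_mask = (team == 'Sales') | (team == 'Legal') | (team == 'Marketing')

same = bool((isin_mask == or_mask).all())

# Negate with ~ to keep everything outside the list
not_in = ~isin_mask

print(isin_mask.tolist(), same)
```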

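- Range-check boolean method => Series.between(low, high)
    * Both endpoints are included by default, so between(80000, 90000) also matches exactly 90000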

In [31]:
(empDF['Salary']>=80000) & (empDF['Salary']<90000)

0       False
1       False
2       False
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: Salary, Length: 1001, dtype: bool

In [33]:
empDF['Salary'].between(80000, 90000)

0       False
1       False
2       False
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: Salary, Length: 1001, dtype: bool
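The two cells above are not strictly equivalent. The manual mask excludes 90000, while `between` includes both endpoints by default. The sketch below checks the boundary values on made-up salaries. The string form of `inclusive` (`'left'`) assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical salaries chosen to sit on and around the boundaries
salary = pd.Series([79999.0, 80000.0, 85000.0, 90000.0, 90001.0], name='Salary')

manual = (salary >= 80000) & (salary < 90000)

# Default: both endpoints included
both_ends = salary.between(80000, 90000)

# Include only the lower endpoint, matching the manual mask
left_only = salary.between(80000, 90000, inclusive='left')

print(both_ends.tolist(), left_only.tolist())
```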

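The str accessor: Series.str applies string methods (upper, replace, ...) to each element of a Series, and missing values stay NaN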

In [34]:
empDF['First Name'].str.upper()

0       DOUGLAS
1        THOMAS
2         MARIA
3         JERRY
4         LARRY
         ...   
996     PHILLIP
997     RUSSELL
998       LARRY
999      ALBERT
1000        NaN
Name: First Name, Length: 1001, dtype: object

In [36]:
pd.Series(['foo', 'fuz', np.nan]).str.replace('foo', 'faz')

0    faz
1    fuz
2    NaN
dtype: object
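The `str` accessor also combines with boolean indexing. Since `'Happy' == 'happy'` is False, a case-insensitive filter can normalize the text before comparing. This is a sketch on made-up names. `na=False` makes missing values count as non-matches.

```python
import numpy as np
import pandas as pd

# Hypothetical names with mixed case and one missing value
names = pd.Series(['Maria', 'MARIA', 'Marianne', np.nan], name='First Name')

# Case-insensitive exact match: lower-case first, then compare
exact_ci = names.str.lower() == 'maria'

# Prefix match (case-sensitive); NaN is treated as False
starts = names.str.startswith('Mar', na=False)

print(names[exact_ci].tolist(), names[starts].tolist())
```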