### Handling Missing Values

#### Methods for handling missing values

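| Method | Description |
| -- | -- |
| .dropna | Filters out axis labels (rows or columns) that contain missing data |
| .fillna | Fills in missing data with a given value or with a fill method such as 'ffill' or 'bfill' |
| .isnull | Returns boolean values indicating which values are missing (NA) |
| .notnull | Negation of .isnull |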

In [11]:
import pandas as pd
import numpy as np

In [12]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

#### ```.isnull()``` : check whether values are missing
- Both ```None``` and ```NaN``` are treated as missing values

In [13]:
np.sum(string_data.isnull())

1

In [14]:
string_data[0] = None
np.sum(string_data.isnull())

2
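
The counts above can also be obtained with the Series' own `sum()`, and `.notnull()` gives the complementary count. The following is a minimal sketch on a hypothetical four-element Series (the variable names are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one None and one NaN among four entries
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

# isnull() flags both None and NaN; notnull() is its exact complement
missing_mask = s.isnull()
n_missing = int(missing_mask.sum())   # summing booleans counts the True values
n_present = int(s.notnull().sum())

# isna() / notna() are aliases of isnull() / notnull()
same_as_alias = missing_mask.equals(s.isna())

print(n_missing, n_present, same_as_alias)
```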

### Filtering out missing values

### ```.dropna(how, thresh, ...)```
- ```how = 'all'``` : drop only the rows in which every value is NA
- ```thresh``` : keep only rows with at least ```thresh``` non-NA values; rows with fewer are dropped

In [15]:
from numpy import nan as NA

In [18]:
data = pd.Series([1, NA, 3.5, NA, 7])
sum(data.dropna().isnull())

0

- **By default, ```.dropna()``` on a DataFrame drops every row that contains at least one NA value**

In [29]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


- **```.dropna(how = 'all')``` : drops only the rows in which all values are NA**

In [31]:
data.dropna(how = 'all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [50]:
data[4] = NA
data.dropna(axis = 'columns', how = 'all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


#### - To keep only rows that contain at least a certain number of observed values, pass that number as the ```thresh``` argument

In [75]:
df =  pd.DataFrame(np.random.randn(7, 3))
df.shape

(7, 3)

In [91]:
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

In [77]:
# Keep only rows with at least 2 non-NA values (with 3 columns, this drops rows with 2 or more missing values)
df.dropna(thresh = 2)

Unnamed: 0,0,1,2
2,2.208965,,0.085173
3,0.432232,,-1.655078
4,-1.13858,0.066222,-0.784697
5,-0.97247,-0.821108,1.15785
6,1.225096,-1.360004,-0.083296
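
Because `thresh` counts observed (non-NA) values rather than missing ones, it can help to compare it against an explicit filter. The sketch below uses a small hypothetical four-column frame (not the random data above) to show that `dropna(thresh = 3)` matches "keep rows with at least 3 non-NA values":

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 4, 2, 0 and 2 observed values per row
frame = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, np.nan],
    'b': [1.0, np.nan, np.nan, 4.0],
    'c': [1.0, np.nan, np.nan, np.nan],
    'd': [1.0, 2.0, np.nan, 4.0],
})

# Number of non-missing values in each row
non_na_counts = frame.notnull().sum(axis=1)

# thresh=3 keeps only rows with at least 3 non-missing values
kept = frame.dropna(thresh=3)

# The same selection written as a boolean filter
kept_manual = frame[non_na_counts >= 3]

print(kept)
```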


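### Filling in missing values
### ```.fillna()``` : replace missing values
- If ```value``` is a dictionary, each column can be filled with a different value
- ```method``` : 'ffill' fills a missing value with the preceding value / 'bfill' fills it with the following value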

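#### fillna function arguments
| Argument | Description |
| -- | -- |
| value | Scalar value or dictionary used to replace missing values |
| method | Fill method, 'ffill' or 'bfill'; the default is None (no fill method) |
| axis | Axis along which to fill |
| inplace | Modify the calling object instead of returning a copy (default False) |
| limit | Maximum number of consecutive missing values to fill when filling forward or backward |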

In [78]:
df.fillna(0) # replace missing values with 0
df.fillna({1 : 0.5, 2 : 0}) # fill missing values in column 1 with 0.5 and in column 2 with 0

Unnamed: 0,0,1,2
0,0.823271,0.5,0.0
1,-0.944366,0.5,0.0
2,2.208965,0.5,0.085173
3,0.432232,0.5,-1.655078
4,-1.13858,0.066222,-0.784697
5,-0.97247,-0.821108,1.15785
6,1.225096,-1.360004,-0.083296


- Modifying the existing object in place

In [79]:
_ = df.fillna(0, inplace = True)

- The interpolation methods available for reindex can also be used to fill missing values

In [80]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA

- **Filling missing values with the ```method``` argument**

In [81]:
df.fillna(method = 'ffill') # fill each missing value with the preceding value

Unnamed: 0,0,1,2
0,-0.424442,0.487937,0.728229
1,0.018264,-1.706691,1.129203
2,0.507083,-1.706691,0.411609
3,-0.118537,-1.706691,-0.913952
4,-0.887661,-1.706691,-0.913952
5,1.610355,-1.706691,-0.913952
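
The `limit` argument and statistic-based filling can be sketched on a small hypothetical Series. The example calls `.ffill()` directly, since (as far as I am aware) recent pandas releases deprecate `fillna(method = ...)` in favor of `.ffill()` and `.bfill()`; treat that version note as an assumption to check against your installed pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Forward fill, but fill at most 2 consecutive missing values
filled_limited = s.ffill(limit=2)

# Fill every missing value with the mean of the observed values (here 3.0)
filled_mean = s.fillna(s.mean())

print(filled_limited.tolist())
print(filled_mean.tolist())
```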


### Data Transformation

In [83]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

### ```.duplicated()``` : returns a boolean Series indicating whether each row is a duplicate of an earlier row
### ```.drop_duplicates()``` : removes duplicate rows

In [85]:
data.duplicated()
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


#### - Detecting duplicates based on specific columns

In [88]:
data['v1'] = range(7)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


#### - ```keep = 'last'``` : keep the last occurrence of each duplicate instead of the first

In [90]:
data.drop_duplicates(['k1', 'k2'], keep = 'last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
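
To inspect the duplicated rows themselves rather than drop them, `duplicated()` also accepts `keep = False`, which flags every member of a duplicate group. A minimal sketch, rebuilding the same `k1`/`k2` frame so the block runs on its own:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Default keep='first': only later repeats are flagged (row 6 repeats row 5)
first_flags = data.duplicated()

# keep=False: flag all rows that belong to a duplicate group
all_flags = data.duplicated(keep=False)
dup_rows = data[all_flags]

print(dup_rows)
```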


### Transforming data using a function or mapping

In [92]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [93]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

In [99]:
lowercased = data['food'].str.lower() # convert the 'food' column to lowercase

In [103]:
# Create the 'animal' column by mapping each value through meat_to_animal
data['animal'] = lowercased.map(meat_to_animal) 
# data['food'].map(lambda x : meat_to_animal[x.lower()])
data.head(5)

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
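
One behavior worth knowing: when a value has no entry in the dictionary, `.map()` yields a missing value instead of raising an error, whereas indexing the dictionary inside a lambda would raise `KeyError`. The sketch below uses a shortened hypothetical mapping and an extra food ('tofu') that is not in the original data:

```python
import pandas as pd

# Hypothetical, shortened inputs for illustration
food = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Keys missing from the dict map to NaN rather than raising
mapped = food.str.lower().map(meat_to_animal)

# Use the missing mask to list the values that had no mapping
unmapped = food[mapped.isnull()]

print(mapped)
print(unmapped.tolist())
```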


### Replacing values
### ```.replace(value, NA)``` : replace value with NA

In [105]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [112]:
data.replace(-999, np.nan) # replace -999 with NaN
data.replace([-999, -1000], np.nan) # replace both -999 and -1000 with NaN
data.replace([-999, -1000], [np.nan, 0]) # replace -999 with NaN and -1000 with 0
data.replace({-999: np.nan, -1000: 0}) # the same replacement, expressed as a dictionary

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming axis indexes

In [126]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index = ['Ohio', 'Colorado', 'New York'],
                    columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


- Define a function that takes the first four characters of a string and converts them to uppercase, then apply it to the row labels

In [127]:
transform = lambda x: x[:4].upper() 
data.index.map(transform)

data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


### ```.rename(index, columns)``` : rename row and column labels

In [128]:
data.rename(index = str.title, columns = str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [130]:
data.rename(index = {'OHIO': 'INDIANA'}, # rename the row labeled 'OHIO' to 'INDIANA'
            columns = {'three': 'peekaboo'}) # rename the column labeled 'three' to 'peekaboo'

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


- To modify the original data, pass ```inplace = True```

In [132]:
data.rename(index={'OHIO': 'INDIANA'}, inplace = True)

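### Discretization and binning
### ```pd.cut(data, bins, right, labels)``` : divide data into the intervals given by bins (binning)
- In interval notation, a parenthesis marks an open (excluded) endpoint and a square bracket marks a closed (included) endpoint, so the default is '( ~ ]'; with ```right = False``` the closed side flips to '[ ~ )'
- Returns a Categorical object
- ```labels``` : names for the resulting bins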

In [133]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

- Split ages into the groups 18-25, 26-35, 36-60, and 61 and older

In [135]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [138]:
print(cats.codes)
print(cats.categories)
pd.value_counts(cats)

[0 0 0 1 0 0 2 1 3 2 2 1]
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')


(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

- ```right = False``` : the left edge of each bin is closed and the right edge is open

In [139]:
pd.cut(ages, [18, 26, 36, 61, 100], right = False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
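
The effect of `right` is easiest to see on values that sit exactly on a bin edge. The following sketch uses hypothetical edge values (not the `ages` list above) and also shows the `include_lowest` argument, which closes the left edge of the first bin:

```python
import pandas as pd

# Hypothetical values chosen to fall on or next to bin edges
edge_ages = [18, 25, 26, 35]
edges = [18, 25, 35, 60]

# Default right=True: bins are (18, 25], (25, 35], (35, 60]; 18 falls outside
right_closed = pd.cut(edge_ages, edges)

# right=False: bins are [18, 25), [25, 35), [35, 60)
left_closed = pd.cut(edge_ages, edges, right=False)

# include_lowest=True also includes the lowest edge (18) in the first bin
with_lowest = pd.cut(edge_ages, edges, include_lowest=True)

# codes give the bin number of each value; -1 means "not in any bin" (NaN)
print(list(right_closed.codes))
print(list(left_closed.codes))
print(list(with_lowest.codes))
```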

- ```labels``` : names for the resulting bins

In [140]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels = group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

#### - If you pass ```pd.cut()``` a number of bins instead of explicit bin edges, it automatically computes equal-length bins based on the minimum and maximum values in the data

In [141]:
data = np.random.rand(20)
pd.cut(data, 4, precision = 2)

[(0.76, 0.98], (0.53, 0.76], (0.53, 0.76], (0.31, 0.53], (0.31, 0.53], ..., (0.53, 0.76], (0.76, 0.98], (0.76, 0.98], (0.31, 0.53], (0.083, 0.31]]
Length: 20
Categories (4, interval[float64, right]): [(0.083, 0.31] < (0.31, 0.53] < (0.53, 0.76] < (0.76, 0.98]]

### ```pd.qcut()``` : bin data based on sample quantiles

In [142]:
data = np.random.randn(1000)  # standard normal distribution
cats = pd.qcut(data, 4)  # cut into quartiles
print(cats)
pd.value_counts(cats)

[(-3.0389999999999997, -0.617], (0.694, 3.354], (0.0517, 0.694], (-3.0389999999999997, -0.617], (0.694, 3.354], ..., (0.694, 3.354], (-3.0389999999999997, -0.617], (-0.617, 0.0517], (0.0517, 0.694], (0.0517, 0.694]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.0389999999999997, -0.617] < (-0.617, 0.0517] < (0.0517, 0.694] < (0.694, 3.354]]


(-3.0389999999999997, -0.617]    250
(-0.617, 0.0517]                 250
(0.0517, 0.694]                  250
(0.694, 3.354]                   250
dtype: int64

In [144]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]) # quantiles (between 0 and 1) can also be passed directly

[(-3.0389999999999997, -1.224], (0.0517, 1.202], (0.0517, 1.202], (-3.0389999999999997, -1.224], (0.0517, 1.202], ..., (0.0517, 1.202], (-1.224, 0.0517], (-1.224, 0.0517], (0.0517, 1.202], (0.0517, 1.202]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.0389999999999997, -1.224] < (-1.224, 0.0517] < (0.0517, 1.202] < (1.202, 3.354]]
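
The difference between equal-width bins (`pd.cut`) and equal-frequency bins (`pd.qcut`) shows up clearly on skewed data. This is a small sketch with a hypothetical eight-value list containing one large outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample: seven small values and one large outlier
values = [1, 2, 3, 4, 5, 6, 7, 100]

# Equal-width bins over [1, 100]: almost everything lands in the first bin
width_bins = pd.cut(values, 4)
width_counts = np.bincount(width_bins.codes, minlength=4)

# Quantile-based bins: each bin receives the same number of observations
quantile_bins = pd.qcut(values, 4)
quantile_counts = np.bincount(quantile_bins.codes, minlength=4)

print(width_counts.tolist())
print(quantile_counts.tolist())
```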

### Filtering outliers

In [145]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.055803,-0.027617,0.050801,-0.02832
std,1.036646,1.029153,1.021179,1.035897
min,-3.135034,-2.869731,-3.073881,-3.60541
25%,-0.76264,-0.692929,-0.66766,-0.717449
50%,-0.087925,-0.074967,0.046357,-0.046178
75%,0.665145,0.684132,0.769069,0.660696
max,3.618152,4.155229,3.486229,3.385475


In [148]:
data[(np.abs(data) > 3).any(axis = 1)] # select all rows containing a value whose absolute value exceeds 3

col = data[2]
col[np.abs(col) > 3]

169   -3.073881
390    3.391507
701    3.486229
Name: 2, dtype: float64

- Cap values outside the range -3 to 3 **at -3 or 3**
#### ```np.sign(value)``` : returns the sign of value (-1, 0, or 1)

In [152]:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.056691,-0.029811,0.049997,-0.027263
std,1.032871,1.021778,1.018223,1.029749
min,-3.0,-2.869731,-3.0,-3.0
25%,-0.76264,-0.692929,-0.66766,-0.717449
50%,-0.087925,-0.074967,0.046357,-0.046178
75%,0.665145,0.684132,0.769069,0.660696
max,3.0,3.0,3.0,3.0
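
The same capping can also be written with `DataFrame.clip`, which some readers may find easier to follow than the `np.sign` assignment. This sketch generates its own hypothetical random frame (seeded, and not the data shown above) and checks that both approaches agree:

```python
import numpy as np
import pandas as pd

# Hypothetical data, seeded so the block is reproducible
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach from the notes: overwrite out-of-range values with sign * 3
capped_sign = frame.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip limits every value to the interval [-3, 3]
capped_clip = frame.clip(lower=-3, upper=3)

print(capped_clip.abs().max().max())
```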


### Permutation and random sampling

In [155]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

### ```np.random.permutation(n)``` : returns the integers 0 to n-1 in random order

In [156]:
sampler = np.random.permutation(5)
sampler

array([4, 2, 3, 1, 0])

### ```take()``` : select rows by integer position (equivalent to ```iloc``` here)

In [162]:
df.iloc[sampler]
# df.take(sampler)

Unnamed: 0,0,1,2,3
4,16,17,18,19
2,8,9,10,11
3,12,13,14,15
1,4,5,6,7
0,0,1,2,3


### ```.sample(n, replace)``` : draw a random sample of n rows
- ```replace``` : whether to sample with replacement

In [166]:
df.sample(n = 3)

Unnamed: 0,0,1,2,3
4,16,17,18,19
1,4,5,6,7
3,12,13,14,15


In [168]:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n = 10, replace = True)
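
For reproducible results, `.sample()` accepts a `random_state` seed, and `frac = 1` returns all rows in shuffled order. A minimal sketch (the seed values are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# The same seed gives the same sample on every call
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# frac=1 returns every row, in random order (a full shuffle)
shuffled = df.sample(frac=1, random_state=0)

# Sampling more items than exist requires replace=True
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True, random_state=1)

print(first.index.tolist())
print(len(draws))
```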

### Computing indicator / dummy variables
### ```pd.get_dummies(data, prefix)``` : convert to dummy variables
- ```prefix``` : prefix for the dummy variable column names

In [172]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
pd.get_dummies(df['key']) # convert the 'key' column of df to dummy variables

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


- ```prefix``` : prefix for the dummy variable column names

In [173]:
dummies = pd.get_dummies(df['key'], prefix = 'key')
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0
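
The prefixed dummy columns are typically joined back to the remaining columns. The sketch below also passes `dtype = int`: to my knowledge, newer pandas releases return boolean columns from `get_dummies` by default, so requesting integers explicitly is one way to reproduce the 0/1 output shown above. The name `df_with_dummy` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int requests 0/1 integers regardless of the pandas default
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Join the indicator columns onto the other column(s) of the frame
df_with_dummy = df[['data1']].join(dummies)

print(df_with_dummy)
```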


In [200]:
import os
os.chdir('/' + os.path.join('Users', '이찬솔', 'Documents', 'Python_for_Data_Analysis', 'datasets', 'movielens'))
os.getcwd()

'C:\\Users\\이찬솔\\Documents\\Python_for_Data_Analysis\\datasets\\movielens'

#### - MovieLens movie ratings data

In [188]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep = '::', engine = 'python', # multi-character separators need the Python parser
                       header = None, names = mnames)
movies.head(5)


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [192]:
all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|')) # split each 'genres' value on '|'

genres = pd.unique(all_genres)
genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

In [195]:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns = genres)
dummies.head(5)

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### ```.get_indexer()``` : compute the integer position of each value in the index

- Splitting Animation|Children's|Comedy on '|' gives 'Animation', "Children's", and 'Comedy', which sit at column positions 0, 1, and 2

In [198]:
gen = movies.genres[0]
print(gen)

gen.split('|')
dummies.columns.get_indexer(gen.split('|'))

Animation|Children's|Comedy


array([0, 1, 2], dtype=int64)

In [201]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

In [202]:
dummies.head(5)

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
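
For delimiter-separated membership strings like these, pandas also offers `Series.str.get_dummies`, which builds a comparable indicator matrix in one call. This is a sketch on three hypothetical genre strings rather than the full MovieLens file; note that the column order can differ from the loop-based result above.

```python
import pandas as pd

# Hypothetical sample of genre strings in the same 'A|B|C' format
genre_strings = pd.Series(["Animation|Children's|Comedy",
                           "Adventure|Children's|Fantasy",
                           "Comedy|Romance"])

# Split each string on '|' and create one indicator column per distinct genre
indicator = genre_strings.str.get_dummies(sep='|')

print(indicator)
```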


- **Add a prefix, then join with movies**

####  ```.add_prefix('prefix')``` : prepend a prefix to each column label

In [209]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.head(5)

Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,...,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [210]:
np.random.seed(12345)
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0
