## Missing values

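+ A missing value is a value that was never filled in during data collection.

+ For example, if a survey respondent skips a question, the answer to that question is a missing value.

+ Data that contains missing values can produce biased or distorted analysis results.

+ Remedy: remove the missing values or replace them with estimated values.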

In [2]:
x <- c(1,2,3,NA,5,NA,7)
sum(x) # returns NA: any NA in the input propagates to the result
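
The propagation seen with `sum()` applies to most arithmetic and comparison operations. The following is a minimal sketch on the same vector; it uses only base R functions.

```r
x <- c(1, 2, 3, NA, 5, NA, 7)

sum(x)      # NA: the total is unknown because two inputs are unknown
mean(x)     # NA for the same reason
x > 3       # element-wise comparison; positions 4 and 6 are NA
NA == NA    # NA, so `==` cannot be used to detect missing values
anyNA(x)    # TRUE: quick check for whether any element is missing
```

Because `NA == NA` is itself `NA`, missing values have to be detected with `is.na()` rather than with an equality test.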

### Checking for missing values

In [4]:
is.na(x)
sum(is.na(x)) 

In [5]:
table(is.na(x)) # frequency table of non-missing (FALSE) vs. missing (TRUE)


FALSE  TRUE 
    5     2 
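
Beyond counting, it is often useful to know where the missing values are and what share of the data they make up. A small sketch with base R, using the same vector:

```r
x <- c(1, 2, 3, NA, 5, NA, 7)

na_positions <- which(is.na(x))  # indices of the missing elements: 4 and 6
na_share <- mean(is.na(x))       # proportion missing: 2 out of 7

na_positions
na_share
```

`mean()` on a logical vector treats `TRUE` as 1 and `FALSE` as 0, which is why it yields the proportion of missing values.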

In [57]:
colSums(is.na(x)) # per-column NA counts; needs a 2-D object such as a data frame or matrix, so it fails on a vector

ERROR: Error in colSums(is.na(x)): 'x' must be an array of at least two dimensions
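
The error arises because `x` is a one-dimensional vector. As a sketch of the intended use, the same call works on a data frame; `df` here is a hypothetical toy data frame made up for illustration.

```r
# Hypothetical toy data frame, not part of the original data
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", NA, NA),
                 stringsAsFactors = FALSE)

col_na <- colSums(is.na(df))  # missing values per column: a = 1, b = 2
row_na <- rowSums(is.na(df))  # missing values per row: 0, 2, 1

col_na
row_na
```

For a plain vector, `sum(is.na(x))` gives the equivalent count.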


### Handling missing values: removal


In [93]:
sum(x, na.rm=TRUE)

In [94]:
na.omit(x)

In [15]:
x2 <- as.vector(na.omit(x)) # drop the NAs and store the result as a new plain vector
sum(x2)
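
The `as.vector()` call matters because `na.omit()` attaches an `na.action` attribute that records which positions were dropped. The sketch below shows that attribute and an alternative based on logical indexing; the names `cleaned` and `x_keep` are illustrative.

```r
x <- c(1, 2, 3, NA, 5, NA, 7)

cleaned <- na.omit(x)
dropped <- as.integer(attr(cleaned, "na.action"))  # positions removed: 4 and 6

x_keep <- x[!is.na(x)]  # logical indexing returns a plain vector with no attributes
total <- sum(x_keep)    # 1 + 2 + 3 + 5 + 7 = 18

dropped
total
```

Either route gives the same values; logical indexing simply skips the attribute bookkeeping.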

### Handling missing values: replacement

In [16]:
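x_mean <- mean(x, na.rm=TRUE) # named x_mean so the variable does not shadow the mean() function

In [18]:
x[is.na(x)] <- x_mean # logical indexing selects the NA elements, which are then replaced with the mean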

In [20]:
x
sum(x)
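
Mean replacement keeps the mean unchanged but shrinks the spread, which is one way a replacement can distort later analysis. A minimal sketch on the original vector; `replace()` is used here so that `x` itself is left untouched, and `x_filled` is an illustrative name.

```r
x <- c(1, 2, 3, NA, 5, NA, 7)

x_mean <- mean(x, na.rm = TRUE)           # mean of the 5 observed values
x_filled <- replace(x, is.na(x), x_mean)  # copy of x with the NAs replaced

mean(x_filled)                     # same as x_mean
var_observed <- var(x, na.rm = TRUE)
var_filled <- var(x_filled)        # smaller: the filled-in points sit exactly on the mean

var_observed
var_filled
```

The filled-in values add nothing to the sum of squared deviations while increasing the sample size, so the variance estimate drops.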

### read.csv(file name, header flag, separator, factor handling, missing-value handling)

### na.strings=c(...): the strings to be read as missing values (NA)

In [67]:
zip2013 <- read.csv('zipcode_2013.txt', header=TRUE, sep='\t', na.strings=c('', 'NA'), stringsAsFactors=FALSE)
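
To see what `na.strings` changes without needing the file, the sketch below reads a tiny tab-separated sample through the `text` argument of `read.csv()`. The sample content and the names `raw`, `with_na`, and `default_df` are hypothetical.

```r
# Hypothetical two-row sample; row 1 has an empty BUNJI, row 2 an empty RI
raw <- "ZIPCODE\tRI\tBUNJI\n135-806\tApt-A\t\n135-805\t\t565\n"

# Empty strings are read as NA in every column, including character ones
with_na <- read.csv(text = raw, header = TRUE, sep = "\t",
                    na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Default na.strings = "NA": an empty character field stays an empty string
default_df <- read.csv(text = raw, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

colSums(is.na(with_na))
default_df$RI[2] == ""   # TRUE: the blank was not converted to NA
```

Without `''` in `na.strings`, blank text fields would go unnoticed by `is.na()`.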

In [29]:
# zip2013$ZIPCODE <- as.character(zip2013$ZIPCODE)
# zip2013$RI <- as.character(zip2013$RI)
# zip2013$BUNJI <- as.character(zip2013$BUNJI)

In [30]:
str(zip2013)

'data.frame':	52144 obs. of  7 variables:
 $ ZIPCODE: chr  "135-806" "135-807" "135-806" "135-770" ...
 $ SIDO   : Factor w/ 17 levels "강원","경기",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ GUGUN  : Factor w/ 228 levels "","가평군","강남구",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ DONG   : Factor w/ 5030 levels "가경동","가곡동",..: 160 160 160 160 160 160 160 160 160 160 ...
 $ RI     : chr  "경남아파트" "우성3차아파트" "우성9차아파트" "주공아파트" ...
 $ BUNJI  : chr  "" "(1∼6동)" "(901∼902동)" "(1∼16동)" ...
 $ SEQ    : int  1 2 3 4 5 6 7 8 9 10 ...


In [45]:
table(is.na(zip2013))


 FALSE   TRUE 
312792  52216 

In [46]:
head(zip2013,10)

| | ZIPCODE | SIDO | GUGUN | DONG | RI | BUNJI | SEQ |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` |
| 1 | 135-806 | 서울 | 강남구 | 개포1동 | 경남아파트 | | 1 |
| 2 | 135-807 | 서울 | 강남구 | 개포1동 | 우성3차아파트 | (1∼6동) | 2 |
| 3 | 135-806 | 서울 | 강남구 | 개포1동 | 우성9차아파트 | (901∼902동) | 3 |
| 4 | 135-770 | 서울 | 강남구 | 개포1동 | 주공아파트 | (1∼16동) | 4 |
| 5 | 135-805 | 서울 | 강남구 | 개포1동 | 주공아파트 | (17∼40동) | 5 |
| 6 | 135-966 | 서울 | 강남구 | 개포1동 | 주공아파트 | (41∼85동) | 6 |
| 7 | 135-807 | 서울 | 강남구 | 개포1동 | 주공아파트 | (86∼103동) | 7 |
| 8 | 135-805 | 서울 | 강남구 | 개포1동 | 주공아파트 | (104∼125동) | 8 |
| 9 | 135-807 | 서울 | 강남구 | 개포1동 | 현대1차아파트 | (101∼106동) | 9 |
| 10 | 135-805 | 서울 | 강남구 | 개포1동 | | 565 | 10 |


### Print the ZIP codes of the rows where RI is not NA

In [52]:
zip2013$ZIPCODE[!is.na(zip2013$RI)]

In [56]:
length(zip2013$ZIPCODE[!is.na(zip2013$BUNJI)]) # number of rows where BUNJI is not NA

### Print the rows where RI is not NA

In [55]:
### data_object[!is.na(data_object$column_name), ]

In [None]:
zip2013[!is.na(zip2013$RI),]
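
The row-filtering pattern can be tried on a small example. The sketch below uses a hypothetical three-row data frame `df` and also shows `subset()` and `complete.cases()`, two base R alternatives.

```r
# Hypothetical toy data frame, not part of the original data
df <- data.frame(zip = c("A", "B", "C"),
                 ri = c("x", NA, "z"),
                 bunji = c(NA, "1", "2"),
                 stringsAsFactors = FALSE)

by_index <- df[!is.na(df$ri), ]        # rows 1 and 3: ri is present
by_subset <- subset(df, !is.na(ri))    # same rows, column referenced by bare name
all_present <- df[complete.cases(df), ]  # row 3 only: no NA in any column

by_index
all_present
```

The trailing comma inside `[ , ]` is what makes the logical vector select rows rather than columns.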

### Replace NA values in the RI column with '-'

In [68]:
zip2013$RI[is.na(zip2013$RI)] <- '-'

### Replace NA values in the BUNJI column with '-'

In [69]:
zip2013$BUNJI[is.na(zip2013$BUNJI)] <- '-'

In [73]:
zip2013$GUGUN[is.na(zip2013$GUGUN)] <- '-'
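
These assignments rely on the columns being character vectors, which is what `stringsAsFactors=F` arranges. If a column had been read as a factor instead, assigning a value that is not an existing level would produce NA again. A minimal sketch of that pitfall and one possible workaround, using a hypothetical factor `g`:

```r
# Hypothetical factor with one missing value
g <- factor(c("a", NA, "b"))

bad <- g
# "-" is not a level of the factor, so R warns and stores NA instead
suppressWarnings(bad[is.na(bad)] <- "-")
anyNA(bad)                 # TRUE: the replacement did not take effect

# Workaround: convert to character, replace, then rebuild the factor if needed
g_chr <- as.character(g)
g_chr[is.na(g_chr)] <- "-"
g_new <- factor(g_chr)
anyNA(g_new)               # FALSE
```

Converting to character first is only one option; adding the new level to the factor beforehand would also work.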

In [None]:
colSums(is.na(zip2013))

In [None]:
head(zip2013)

| | ZIPCODE | SIDO | GUGUN | DONG | RI | BUNJI | SEQ |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` |
| 1 | 135-806 | 서울 | 강남구 | 개포1동 | 경남아파트 | - | 1 |
| 2 | 135-807 | 서울 | 강남구 | 개포1동 | 우성3차아파트 | (1∼6동) | 2 |
| 3 | 135-806 | 서울 | 강남구 | 개포1동 | 우성9차아파트 | (901∼902동) | 3 |
| 4 | 135-770 | 서울 | 강남구 | 개포1동 | 주공아파트 | (1∼16동) | 4 |
| 5 | 135-805 | 서울 | 강남구 | 개포1동 | 주공아파트 | (17∼40동) | 5 |
| 6 | 135-966 | 서울 | 강남구 | 개포1동 | 주공아파트 | (41∼85동) | 6 |


### Handling missing values in the Titanic data

In [80]:
titan <- read.csv('titanic2.csv', na.strings=c('', 'NA'))

In [81]:
str(titan)
summary(titan)
head(titan)

'data.frame':	1310 obs. of  11 variables:
 $ pclass  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
 $ name    : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
 $ ticket  : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
 $ fare    : num  211 152 152 152 152 ...
 $ cabin   : Factor w/ 186 levels "A10","A11","A14",..: 44 80 80 80 80 150 146 16 62 NA ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...


     pclass         survived                                name     
 Min.   :1.000   Min.   :0.000   Connolly, Miss. Kate         :   2  
 1st Qu.:2.000   1st Qu.:0.000   Kelly, Mr. James             :   2  
 Median :3.000   Median :0.000   Abbing, Mr. Anthony          :   1  
 Mean   :2.295   Mean   :0.382   Abbott, Master. Eugene Joseph:   1  
 3rd Qu.:3.000   3rd Qu.:1.000   Abbott, Mr. Rossmore Edward  :   1  
 Max.   :3.000   Max.   :1.000   (Other)                      :1302  
 NA's   :1       NA's   :1       NA's                         :   1  
     sex           age              sibsp            parch      
 female:466   Min.   : 0.1667   Min.   :0.0000   Min.   :0.000  
 male  :843   1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000  
 NA's  :  1   Median :28.0000   Median :0.0000   Median :0.000  
              Mean   :29.8811   Mean   :0.4989   Mean   :0.385  
              3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000  
              Max.   :80.0000   Max.   :8.0000   M

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` | `<fct>` | `<dbl>` | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` |
| 1 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S |
| 2 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 3 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 4 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 5 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 6 | 1 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | E12 | S |


In [82]:
colSums(is.na(titan)) # age:264, cabin:1015 

#### age: replace with another value; cabin: drop the column entirely

In [83]:
titan$cabin <- NULL

In [84]:
head(titan)

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<fct>` | `<fct>` | `<dbl>` | `<int>` | `<int>` | `<fct>` | `<dbl>` | `<fct>` |
| 1 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | S |
| 2 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | S |
| 3 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | S |
| 4 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | S |
| 5 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | S |
| 6 | 1 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | S |


In [85]:
tmd <- median(titan$age, na.rm=TRUE)

In [86]:
titan$age[is.na(titan$age)] <- tmd
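
One common reason to prefer the median over the mean for a column like age is that the median is less affected by extreme values. The sketch below illustrates this on a hypothetical vector `age`; the numbers are made up and are not taken from the Titanic data.

```r
# Hypothetical ages with one extreme value and two missing entries
age <- c(22, 25, 28, NA, 30, 80, NA)

age_median <- median(age, na.rm = TRUE)  # 28: middle of 22, 25, 28, 30, 80
age_mean <- mean(age, na.rm = TRUE)      # 37: pulled upward by the value 80

age[is.na(age)] <- age_median            # same replacement pattern as above
age
```

Whether the median is the better fill-in value depends on the distribution of the column; this is only an illustration of the difference.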

In [87]:
colSums(is.na(titan))

### embarked has only 3 missing values, so remove those rows (na.omit() drops every row that has an NA in any column)

In [88]:
titan <- na.omit(titan)

In [89]:
colSums(is.na(titan))
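
`na.omit()` on a data frame removes a row if any of its columns is NA, not only the column of interest. The sketch below contrasts it with filtering on a single column, using a hypothetical four-row data frame `df`.

```r
# Hypothetical toy data frame, not part of the Titanic data
df <- data.frame(age = c(20, NA, 40, 50),
                 embarked = c("S", "C", NA, "Q"),
                 stringsAsFactors = FALSE)

any_col <- na.omit(df)                 # drops rows 2 and 3 (NA in either column)
one_col <- df[!is.na(df$embarked), ]   # drops row 3 only (NA in embarked)

nrow(any_col)   # 2
nrow(one_col)   # 3
```

When the other columns have already been filled in, as with age above, the two approaches remove the same rows for that column; they differ only if NAs remain elsewhere.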

### Save the final result

In [95]:
save(titan, file='titanic.rdata')
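
A file written with `save()` is restored with `load()`, which recreates the object under its original name. The sketch below shows the round trip on a hypothetical data frame `df` written to temporary files, along with `saveRDS()`/`readRDS()` as an alternative that lets the caller choose the name.

```r
# Hypothetical toy data frame and temporary paths, for illustration only
df <- data.frame(id = 1:3, age = c(29, 2, 30))

rdata_path <- tempfile(fileext = ".rdata")
save(df, file = rdata_path)
rm(df)
restored_names <- load(rdata_path)   # recreates `df`; returns the object names

rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_copy <- readRDS(rds_path)         # assigned to any name the caller chooses

restored_names
all.equal(df, df_copy)
```

`save()` can store several objects in one file, while `saveRDS()` stores exactly one object without its name.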