# Exploratory Data Analysis (`EDA`)
- Developing an eye for data

## 1. `EDA`
- An approach that goes beyond the technical handling of data and `draws insight from the data itself`

### 1-1) The EDA Process
#### 1. Confirm the analysis goal and the variables (columns)
#### 2. Look at the data as a whole (correlation, NA (null values), data size, etc.)
#### 3. Examine the individual attributes of the data
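The three steps above can be sketched on a tiny hand-made table. The DataFrame `df` and its values are hypothetical and exist only to make the sketch runnable; `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical miniature dataset, used only to make the sketch runnable.
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, 38.0, None, 35.0],
    "sex": ["male", "female", "female", "male"],
})

# Step 1: confirm the variables (columns) and their types.
columns = list(df.columns)
dtypes = df.dtypes

# Step 2: look at the dataset as a whole.
shape = df.shape                          # (rows, columns)
missing = df.isnull().sum()               # NaN count per column
correlation = df.corr(numeric_only=True)  # pairwise correlation, numeric columns only

# Step 3: examine an individual attribute.
sex_counts = df["sex"].value_counts()

print(columns, shape, missing.to_dict(), sex_counts.to_dict(), sep="\n")
```

Each step maps to one or two pandas calls, which is why the Titanic walkthrough that follows can stay short.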

### 1-2) EDA with Example - Titanic
#### 1. `Analysis goal` and `variables`
##### - `Analysis goal`: What characteristics did the survivors have in common?
##### - `Variables` and `key`: 10 in total (survival (key: 0 or 1), pclass, etc.)

In [2]:
# 0. Import the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
%matplotlib inline

In [5]:
# Load the CSV file
titanic_df = pd.read_csv("./train.csv")

In [8]:
# 1. Confirm the analysis goal and the variables

# Look at the first 3 rows
titanic_df.head(3)

# NaN marks a missing value; there are several ways to handle it.

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


In [9]:
# Check the data type of each column

titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
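The dtype listing is what separates the numeric columns (`int64`, `float64`) from the text columns (`object`). A minimal sketch of making that split programmatically with `select_dtypes`, on a hypothetical three-row frame rather than the real `train.csv`:

```python
import pandas as pd

# Hypothetical rows that mimic a few Titanic columns.
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Name": ["A", "B", "C"],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# Columns with a numeric dtype (int64, float64, ...).
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else (text columns in this example).
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Knowing which columns are numeric matters in the next step, because summary statistics and correlations are only defined for them.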

#### 2. Look at the data as a whole

In [10]:
# Summary statistics (numeric columns only)

titanic_df.describe()
# mean : average
# std : standard deviation

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
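Note that `count` for `Age` is 714 rather than 891: `describe()` leaves NaN values out of every statistic. The sketch below reproduces `count`, `mean`, and `std` by hand on a hypothetical five-value series to show what is being computed; `std` here is the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([22.0, 38.0, 26.0, np.nan, 35.0], name="Age")

summary = ages.describe()

# Recompute the same statistics manually on the non-missing values.
valid = ages.dropna().to_numpy()
manual_count = len(valid)
manual_mean = valid.mean()
manual_std = valid.std(ddof=1)  # sample standard deviation, as pandas uses

print(summary[["count", "mean", "std"]])
print(manual_count, manual_mean, manual_std)
```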


In [11]:
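# Check the correlation coefficients
# numeric_only=True skips the text columns (Name, Sex, ...); pandas 2.0+ raises an error without it
titanic_df.corr(numeric_only=True)
# A variable's correlation with itself is 1
# The closer the absolute value is to 1, the stronger the correlation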

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |
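By default `corr()` computes the Pearson correlation coefficient. As a rough sketch of what one cell of the matrix means, the helper `pearson` below (a name introduced here for illustration) computes the coefficient from its definition and compares it with pandas. The six rows are made-up values, not taken from `train.csv`.

```python
import numpy as np
import pandas as pd

# Made-up class/fare pairs for illustration only.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.93, 53.10, 13.00, 8.05],
})

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

manual = pearson(df["Pclass"], df["Fare"])
from_pandas = df["Pclass"].corr(df["Fare"])

print(manual, from_pandas)
```

In this toy data the coefficient is negative because a larger class number goes with a lower fare, the same direction as the `Pclass`/`Fare` value of -0.5495 in the matrix above.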


##### `Correlation` vs `Causation`
##### `Correlation`: A and B move together (when A goes up, B goes up, ...)
##### `Causation`: A causes B (A -> B)
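A correlation alone does not show that one variable causes the other. The simulated sketch below illustrates this under a stated assumption: a hidden variable `z` drives both `a` and `b`, and neither of them influences the other. All names and numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

z = rng.normal(size=n)            # hidden common cause
a = z + 0.3 * rng.normal(size=n)  # depends on z, not on b
b = z + 0.3 * rng.normal(size=n)  # depends on z, not on a

# a and b are strongly correlated even though neither causes the other.
r_ab = np.corrcoef(a, b)[0, 1]

# Once z's contribution is removed, the relationship mostly disappears.
r_without_z = np.corrcoef(a - z, b - z)[0, 1]

print(round(r_ab, 3), round(r_without_z, 3))
```

The same caution applies to the Titanic matrix: a correlated pair such as `Fare` and `Survived` marks something to investigate, not an established cause.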

In [13]:
# Count the missing values (NaN) in each column

titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
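In the output above, `Age` is missing in 177 of 891 rows (about 20%) and `Cabin` in 687 (about 77%), so the two columns likely call for different treatment. The sketch below shows a few common options on a hypothetical four-row frame. Which option suits the real data is a judgment call that the source leaves open.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in the same columns as the Titanic data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Share of missing values per column (mean of a boolean mask).
missing_ratio = df.isnull().mean()

# Option 1: drop a column that is mostly empty.
without_cabin = df.drop(columns=["Cabin"])

# Option 2: fill a numeric column with its median.
age_filled = df["Age"].fillna(df["Age"].median())

# Option 3: fill a categorical column with its most frequent value.
embarked_filled = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Option 4: keep only the rows where the chosen columns are present.
complete_rows = df.dropna(subset=["Age", "Embarked"])

print(missing_ratio)
```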