# Data Wrangling
Data wrangling is an informal term used broadly for the process of cleaning raw data and transforming it into a usable form.

In [1]:
import pandas as pd

In [2]:
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

In [3]:
dataframe = pd.read_csv(url)

In [4]:
dataframe.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


# 3.1 Creating a DataFrame

In [7]:
dataframe = pd.DataFrame()

In [10]:
dataframe['Name'] = ['Jacky Jackson', 'Steven Stevenson']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]

In [11]:
dataframe

Unnamed: 0,Name,Age,Driver
0,Jacky Jackson,38,True
1,Steven Stevenson,25,False


In [15]:
# Create a row

new_person = pd.Series(['Molly Mooney', 40, True],
                      index = ['Name', 'Age', 'Driver'])

new_person

Name      Molly Mooney
Age                 40
Driver            True
dtype: object

In [16]:
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
pd.concat([dataframe, new_person.to_frame().T], ignore_index = True)

Unnamed: 0,Name,Age,Driver
0,Jacky Jackson,38,True
1,Steven Stevenson,25,False
2,Molly Mooney,40,True


In [7]:
import numpy as np

In [8]:
# Create a DataFrame from a list (here via a NumPy array).
data = [ ['Jacky Jackson', 38, True], 
        ['Steven Stevenson', 25, False]]

matrix = np.array(data)

In [9]:
pd.DataFrame(matrix, columns = ['Name', 'Age', 'Driver'])

Unnamed: 0,Name,Age,Driver
0,Jacky Jackson,38,True
1,Steven Stevenson,25,False


In [11]:
pd.DataFrame(data, columns = ['Name', 'Age', 'Driver'])

Unnamed: 0,Name,Age,Driver
0,Jacky Jackson,38,True
1,Steven Stevenson,25,False
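
The two outputs above look identical, but the column types differ. A NumPy array holds a single dtype, so the mixed values are all converted to strings before pandas sees them. A minimal sketch of the difference, reusing the same two rows (the names `from_matrix` and `from_list` are illustrative):

```python
import numpy as np
import pandas as pd

data = [['Jacky Jackson', 38, True],
        ['Steven Stevenson', 25, False]]

# A NumPy array has one dtype, so the numbers and booleans become strings
from_matrix = pd.DataFrame(np.array(data), columns=['Name', 'Age', 'Driver'])

# Passing the list directly lets pandas infer a dtype for each column
from_list = pd.DataFrame(data, columns=['Name', 'Age', 'Driver'])

print(repr(from_matrix['Age'].iloc[0]))
print(from_list.dtypes)
```

If numeric operations on Age are needed later, building the frame from the list (or a dictionary) avoids a separate type conversion step.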


In [12]:
# Use a dictionary that maps column names to data
data = {'Name': ['Jacky Jackson', 'Steven Stevenson'],
       'Age':[38, 25],
       'Driver': [True, False]}

pd.DataFrame(data)

Unnamed: 0,Name,Age,Driver
0,Jacky Jackson,38,True
1,Steven Stevenson,25,False


In [15]:
data = [{'Name': 'Jacky Jackson', 'Age':38, 'Driver': True},
       {'Name': 'Steven', 'Age': 25}]

pd.DataFrame(data)

Unnamed: 0,Name,Age,Driver
0,Jacky Jackson,38,True
1,Steven,25,
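
When each record is a dictionary, a key missing from one record is filled with NaN, which is why Driver is empty in the second row above. A small sketch of how such gaps could be counted (the variable names are illustrative):

```python
import pandas as pd

# The second record has no 'Driver' key
records = [{'Name': 'Jacky Jackson', 'Age': 38, 'Driver': True},
           {'Name': 'Steven', 'Age': 25}]

people = pd.DataFrame(records)

# Missing keys are stored as NaN, which isna() detects
missing_per_column = people.isna().sum()
print(missing_per_column)
```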


# 3.2 Describing the Data

In [38]:
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'

In [39]:
dataframe = pd.read_csv(url)

In [40]:
dataframe.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [20]:
# Check the dimensions.
dataframe.shape

(1313, 6)

In [21]:
# Show summary statistics
dataframe.describe()

Unnamed: 0,Age,Survived,SexCode
count,756.0,1313.0,1313.0
mean,30.397989,0.342727,0.351866
std,14.259049,0.474802,0.477734
min,0.17,0.0,0.0
25%,21.0,0.0,0.0
50%,28.0,0.0,0.0
75%,39.0,1.0,1.0
max,71.0,1.0,1.0


For categorical columns such as Survived and SexCode, these summary statistics carry little meaning.
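
For coded categories, counts per category are usually more informative than quartiles or a standard deviation. A minimal sketch on a tiny stand-in frame (the values are made up for illustration, not taken from the Titanic file):

```python
import pandas as pd

# Small stand-in for the Titanic columns; values are illustrative only
sample = pd.DataFrame({'Age': [29.0, 2.0, 30.0, 25.0],
                       'Sex': ['female', 'female', 'male', 'female'],
                       'Survived': [1, 0, 0, 0]})

# Frequency of each category
survived_counts = sample['Survived'].value_counts()

# For a 0/1 code, the mean is simply the share of 1s
survival_rate = sample['Survived'].mean()

print(survived_counts)
print(survival_rate)
```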

# 3.3 Navigating DataFrames

In [23]:
# Select the first row.
dataframe.iloc[0]

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                   29
Sex                               female
Survived                               1
SexCode                                1
Name: 0, dtype: object

In [26]:
# Select rows with a slice
# The end position is excluded, so this returns rows 1 through 3

dataframe.iloc[1:4]

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1


In [27]:
dataframe.iloc[:4]

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1


In [29]:
# A DataFrame does not need an integer index.
# Any column whose values uniquely identify each row can be set as the index.

# Set the index.
dataframe = dataframe.set_index(dataframe['Name'])

In [30]:
dataframe

Name,Name,PClass,Age,Sex,Survived,SexCode
"Allen, Miss Elisabeth Walton","Allen, Miss Elisabeth Walton",1st,29.00,female,1,1
"Allison, Miss Helen Loraine","Allison, Miss Helen Loraine",1st,2.00,female,0,1
"Allison, Mr Hudson Joshua Creighton","Allison, Mr Hudson Joshua Creighton",1st,30.00,male,0,0
"Allison, Mrs Hudson JC (Bessie Waldo Daniels)","Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.00,female,0,1
"Allison, Master Hudson Trevor","Allison, Master Hudson Trevor",1st,0.92,male,1,0
...,...,...,...,...,...,...
"Zakarian, Mr Artun","Zakarian, Mr Artun",3rd,27.00,male,0,0
"Zakarian, Mr Maprieder","Zakarian, Mr Maprieder",3rd,26.00,male,0,0
"Zenni, Mr Philip","Zenni, Mr Philip",3rd,22.00,male,0,0
"Lievens, Mr Rene","Lievens, Mr Rene",3rd,24.00,male,0,0


In [31]:
# Look up a row by its label
dataframe.loc['Allen, Miss Elisabeth Walton']

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                   29
Sex                               female
Survived                               1
SexCode                                1
Name: Allen, Miss Elisabeth Walton, dtype: object

A DataFrame index can consist of unique alphanumeric strings or arbitrary numbers.

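### Two ways to index in pandas

- loc selects by index label (for example, a string).
- iloc selects by position. For example, iloc[0] returns the first row regardless of whether the index holds integers or strings.

Note: both come up constantly during data cleaning, so it pays to get comfortable with loc and iloc.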
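
The difference between the two can be sketched on a small frame with a string index. The frame and names below are illustrative assumptions:

```python
import pandas as pd

people = pd.DataFrame({'Age': [38, 25, 40]},
                      index=['Jacky', 'Steven', 'Molly'])

first_by_position = people.iloc[0]     # position 0, whatever its label is
first_by_label = people.loc['Jacky']   # the row whose index label is 'Jacky'

# Position slices exclude the end; label slices include it
by_position = people.iloc[0:1]
by_label = people.loc['Jacky':'Steven']

print(len(by_position), len(by_label))
```

The inclusive end of a label slice is easy to overlook when switching between the two.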

In [32]:
# Select rows and columns by slicing

# Select the Age through Sex columns for rows up to and including 'Allison, Miss Helen Loraine'.
dataframe.loc[:'Allison, Miss Helen Loraine', 'Age':'Sex']

Name,Age,Sex
"Allen, Miss Elisabeth Walton",29.0,female
"Allison, Miss Helen Loraine",2.0,female


In [33]:
# Equivalent to dataframe[:2].
dataframe[:'Allison, Miss Helen Loraine']

Name,Name,PClass,Age,Sex,Survived,SexCode
"Allen, Miss Elisabeth Walton","Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
"Allison, Miss Helen Loraine","Allison, Miss Helen Loraine",1st,2.0,female,0,1


In [36]:
# To select multiple columns, pass a list of names: [[col1, col2, col3, ...]].
dataframe[['Age', 'Sex']].head(2)

Name,Age,Sex
"Allen, Miss Elisabeth Walton",29.0,female
"Allison, Miss Helen Loraine",2.0,female


# 3.4 Selecting Rows Based on Conditions

In [41]:
dataframe.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [43]:
# Show the first two rows where the 'Sex' column is 'female'.
dataframe[dataframe['Sex'] == 'female'].head(2)

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1


In the code above, dataframe['Sex'] == 'female' is the condition.
Wrapping it in dataframe[] asks pandas to select every row of the DataFrame where dataframe['Sex'] is 'female'.
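
The condition itself evaluates to a boolean Series with one entry per row, and only rows marked True are kept. A minimal sketch on made-up data (the frame and the name `is_female` are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({'Name': ['Allen', 'Allison', 'Crosby'],
                       'Sex': ['female', 'male', 'female']})

# The comparison yields one True/False value per row
is_female = sample['Sex'] == 'female'

# Indexing with the mask keeps only the rows marked True
females = sample[is_female]

print(is_female)
print(females)
```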

In [46]:
# Select all rows where the passenger is a woman aged 65 or older.
# Filter rows on multiple conditions.

dataframe[(dataframe['Sex'] == 'female')&(dataframe['Age'] >= 65)]

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
73,"Crosby, Mrs Edward Gifford (Catherine Elizabet...",1st,69.0,female,1,1
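
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `==` and `>=`. A short sketch on made-up values:

```python
import pandas as pd

sample = pd.DataFrame({'Sex': ['female', 'male', 'female', 'male'],
                       'Age': [69.0, 70.0, 30.0, 20.0]})

# Both conditions must hold
older_women = sample[(sample['Sex'] == 'female') & (sample['Age'] >= 65)]

# Either condition may hold
women_or_older = sample[(sample['Sex'] == 'female') | (sample['Age'] >= 65)]

# Negate a condition
not_women = sample[~(sample['Sex'] == 'female')]

print(len(older_women), len(women_or_older), len(not_women))
```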


# 3.5 Replacing Values

In [47]:
dataframe.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [50]:
# Replace values and show the first five rows.

dataframe['Sex'].replace('female', 'Woman').head()

0    Woman
1    Woman
2     male
3    Woman
4     male
Name: Sex, dtype: object

In [51]:
# Replace several values at once

dataframe['Sex'].replace(['female', 'male'], ['Woman', 'Man']).head()

0    Woman
1    Woman
2      Man
3    Woman
4      Man
Name: Sex, dtype: object

Instead of a single column, you can use the DataFrame's replace method to find and replace values across the entire DataFrame object.

In [53]:
# Replace every value equal to 1
dataframe.replace(1, 'One').head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,One,One
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,One
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,One
4,"Allison, Master Hudson Trevor",1st,0.92,male,One,0


In [54]:
# Regular expressions are also supported.

dataframe.replace(r'1st', 'First', regex = True).head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",First,29.0,female,1,1
1,"Allison, Miss Helen Loraine",First,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",First,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",First,25.0,female,0,1
4,"Allison, Master Hudson Trevor",First,0.92,male,1,0


In [55]:
# Replace both 'female' and 'male' with 'person'.
dataframe.replace(['female', 'male'], 'person').head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,person,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,person,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,person,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,person,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,person,1,0


In [56]:
# You can also pass a dictionary that maps each old value to its new value.

dataframe.replace({'female': 1, 'male': 0}).head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,1,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,1,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,0,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,1,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,0,1,0
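
A frame-wide replace touches every column that contains the value. To limit the change to one column, replace also accepts a nested dictionary keyed by column name. A small sketch on made-up data (the Note column is hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({'Sex': ['female', 'male'],
                       'Note': ['male relative aboard', 'male']})

# Outer key = column name, inner dict = old value -> new value
scoped = sample.replace({'Sex': {'female': 'Woman', 'male': 'Man'}})

print(scoped)
```

Here the 'male' entry in Note is left alone, whereas `sample.replace('male', 'Man')` would have changed it too.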


# 3.6 Renaming Columns

In [57]:
# Rename a column and show the first five rows.

dataframe.rename(columns = {'PClass': 'Passenger Class'}).head()

Unnamed: 0,Name,Passenger Class,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [58]:
# A dictionary can rename several columns at once

dataframe.rename(columns = {'PClass': 'Passenger Class', 'Sex': 'Gender'}).head()

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


To rename all columns at once, it is convenient to build a dictionary whose keys are the existing column names and whose values are empty, as in the following code.

In [59]:
import collections

In [60]:
# Create the dictionary

column_names = collections.defaultdict(str)

In [61]:
# Create the keys (accessing a missing key adds it with an empty string)

for name in dataframe.columns:
    column_names[name]

In [62]:
# Show the dictionary
column_names

defaultdict(str,
            {'Name': '',
             'PClass': '',
             'Age': '',
             'Sex': '',
             'Survived': '',
             'SexCode': ''})
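
Once the keys exist, the empty values can be filled in and the dictionary passed to rename. A minimal sketch on a small stand-in frame; the new column names below are hypothetical:

```python
import collections
import pandas as pd

sample = pd.DataFrame({'Name': ['Allen'], 'PClass': ['1st'], 'Age': [29.0]})

# Build the skeleton: one key per existing column, empty string as value
column_names = collections.defaultdict(str)
for name in sample.columns:
    column_names[name]

# Fill in the new names (hypothetical choices)
column_names['Name'] = 'name'
column_names['PClass'] = 'passenger_class'
column_names['Age'] = 'age'

renamed = sample.rename(columns=column_names)
print(list(renamed.columns))
```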

In [64]:
# Change index label 0 to -1
# The mapping is passed as a dictionary

dataframe.rename(index = {0: -1}).head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
-1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


You can also pass a transformation function and set the axis parameter to 'columns' or 'index'.

In [65]:
dataframe.rename(str.lower, axis = 'columns').head()

Unnamed: 0,name,pclass,age,sex,survived,sexcode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


# 3.7 Finding the Minimum, Maximum, Sum, Mean, and Count

In [70]:
print('Maximum:', dataframe['Age'].max())
print('Minimum:', dataframe['Age'].min())
print('Mean:', dataframe['Age'].mean())
print('Sum:', dataframe['Age'].sum())
print('Count:', dataframe['Age'].count())

Maximum: 71.0
Minimum: 0.17
Mean: 30.397989417989415
Sum: 22980.88
Count: 756


In [71]:
dataframe.count()

Name        1313
PClass      1313
Age          756
Sex         1313
Survived    1313
SexCode     1313
dtype: int64
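
Age shows 756 while the other columns show 1313 because count skips missing values. A minimal sketch of that behaviour on a made-up Series, together with agg for computing several statistics in one call:

```python
import numpy as np
import pandas as pd

ages = pd.Series([29.0, 2.0, np.nan, 25.0], name='Age')

non_missing = ages.count()   # NaN is excluded
total_rows = len(ages)       # every row is counted

# Several aggregations at once; NaN is skipped by default
summary = ages.agg(['min', 'max', 'mean', 'sum'])

print(non_missing, total_rows)
print(summary)
```

Comparing `count()` with `len()` is a quick way to see how many values a column is missing.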