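## Pandas Basics

### The pandas Library
- A library with a wide range of features for working with tabular data
  - It is the standard library for data analysis in Python
- It is also widely used for data preprocessing, which transforms raw data so that it can be used throughout the analysis process
  - raw data: source data that has not yet been cleaned for analysis
    - It usually contains data irrelevant to the analysis goal, or columns with missing values

[Reference]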
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html


### pandas Data Structures
* pandas provides two data structures: **Series** and **DataFrame**.
* A Series is a one-dimensional data structure.
* A DataFrame is a two-dimensional data structure.


![image.png](attachment:image.png)

* index: the part shown as 0, 1, 2, 3
* column: mango, apple, banana
* value: the values stored in each cell
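The relationship between the two structures can be sketched as follows. The fruit counts are hypothetical values chosen only for illustration; the column names follow the figure above.

```python
import pandas as pd

# Hypothetical fruit counts, chosen only for illustration
fruit_df = pd.DataFrame({
    "mango": [3, 1, 4, 2],
    "apple": [5, 0, 2, 7],
    "banana": [1, 1, 0, 3],
})

apple = fruit_df["apple"]        # selecting one column yields a Series

print(fruit_df.ndim)             # 2 (rows and columns)
print(apple.ndim)                # 1
print(list(fruit_df.index))      # [0, 1, 2, 3]
print(list(fruit_df.columns))    # ['mango', 'apple', 'banana']
```

Each column of a DataFrame is a Series that shares the DataFrame's index.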

### Importing the pandas Library
- Conventionally imported under the alias pd

In [2]:
import pandas as pd  # conventional import; pandas functions are then called as pd.function_name

## Understanding Series
* A Series is a data structure made up of an index and values.

### Creating a Series

In [3]:
dlist = [70,80,90]
dlist

[70, 80, 90]

In [4]:
seriesdata = pd.Series([70, 60, 90])
seriesdata

0    70
1    60
2    90
dtype: int64

In [5]:
type(seriesdata)

pandas.core.series.Series

### Reading Series Values

In [6]:
seriesdata[2]

90

### Modifying Series Values

In [7]:
seriesdata[1] = 1
seriesdata[2] = 2

seriesdata

0    70
1     1
2     2
dtype: int64

---

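## pandas Data Types
- pandas data types differ from Python's built-in types
  - They are called dtypes; the main ones are:
    - object corresponds to Python str or mixed types (strings)
    - int64 corresponds to Python int (integers)
    - float64 corresponds to Python float (floating-point numbers)
    - bool corresponds to Python bool (True or False)
    - datetime64[ns] (dates and times) and timedelta64[ns] (the difference between two datetime64 values) are also used

> Data types occasionally cause errors, so it helps to understand them and to know how to convert between them.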
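Converting between dtypes is usually done with `astype()`. The following is a minimal sketch; the `prices` Series and its values are made up for illustration.

```python
import pandas as pd

# Numbers stored as strings are not numeric until converted
prices = pd.Series(["10", "20", "30"])

# astype() returns a new Series with the requested dtype
prices_int = prices.astype("int64")
prices_float = prices_int.astype("float64")

print(prices_int.dtype)    # int64
print(prices_float.dtype)  # float64
print(prices_int.sum())    # 60
```

Summing the original string Series would concatenate the strings instead of adding numbers, which is the kind of dtype-related surprise the note above refers to.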

In [8]:
seriesdata = pd.Series(['dave', 'alex', 'amir'])
seriesdata

0    dave
1    alex
2    amir
dtype: object

In [9]:
seriesdata = pd.Series([1, 2, 4])
seriesdata

0    1
1    2
2    4
dtype: int64

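## Understanding DataFrames
- A DataFrame is a tabular (two-dimensional) data structure.
- It consists of multiple Series combined into rows and columns.
- It is the main structure used for data processing in data analysis and machine learning.
- Because it is two-dimensional, its data is organized into rows and columns, as in Excel or CSV.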


In [10]:
import pandas as pd  # conventional import; pandas functions are then called as pd.function_name

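### Creating DataFrames and Checking Types

* A DataFrame can be created from a list, a dictionary, or a NumPy ndarray.
* Lists and ndarrays look similar but are distinct types.
* A list is a built-in Python type.
* An ndarray is a type provided by the NumPy library.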
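One way to see that the two types really differ is to apply the same operator to both. This is a small sketch; the variable names are illustrative.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

# For a list, * means repetition
doubled_list = values_list * 2
# For an ndarray, * is applied element by element
doubled_array = values_array * 2

print(doubled_list)            # [1, 2, 3, 1, 2, 3]
print(doubled_array.tolist())  # [2, 4, 6]
```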

In [11]:
import numpy as np

list1 = [1, 2, 3]
array1 = np.array(list1)
# list1
# array1

In [12]:
type(list1), type(array1)

(list, numpy.ndarray)

In [13]:
array1.tolist()

[1, 2, 3]

In [14]:
df_list1 = pd.DataFrame(list1, columns=['col1'])
df_list1

Unnamed: 0,col1
0,1
1,2
2,3


In [15]:
df_array1 = pd.DataFrame(array1, columns=['col1'])
df_array1

Unnamed: 0,col1
0,1
1,2
2,3


In [16]:
# Three column names are needed.
col_name2=['col1', 'col2', 'col3']

# Create a 2-row x 3-column list (and ndarray), then convert it to a DataFrame.
list2 = [[1, 2, 3],
         [11, 22, 33]]

df_list2 = pd.DataFrame(list2, columns=col_name2)
df_list2

Unnamed: 0,col1,col2,col3
0,1,2,3
1,11,22,33


In [17]:
array2 = np.array(list2)

df_array2 = pd.DataFrame(array2, columns=col_name2)
df_array2

Unnamed: 0,col1,col2,col3
0,1,2,3
1,11,22,33


In [18]:
# Keys map to column names; values are lists (or ndarrays)
# The variable is named data so that it does not shadow the built-in dict
data = {'col1':[1, 11], 'col2':[2, 22], 'col3':[3, 33]}
df_dict = pd.DataFrame(data)
df_dict

Unnamed: 0,col1,col2,col3
0,1,2,3
1,11,22,33


In [19]:
df_dict.values

array([[ 1,  2,  3],
       [11, 22, 33]], dtype=int64)

In [20]:
type(df_dict.values)

numpy.ndarray

In [21]:
# Convert the DataFrame to a list
df_dict.values.tolist()

[[1, 2, 3], [11, 22, 33]]

In [22]:
# Convert the DataFrame to a dictionary
df_dict.to_dict('list')


{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}

> * Always stay aware of a DataFrame's structure and dtypes when working with it.


---

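### Loading and Exploring Data

* Data for analysis usually comes from one of three sources:
  * files such as csv or xlsx that are read from disk,
  * data downloaded directly over the internet, for example from GitHub or an OpenAPI, or
  * datasets bundled with libraries such as seaborn, scikit-learn, or TensorFlow.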
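As a self-contained sketch of the first option, `pd.read_csv` can be tried without any file on disk by reading from an in-memory text buffer. The CSV text and passenger names below are hypothetical placeholders. `read_csv` also accepts a URL string in place of a file path, which covers the second option.

```python
import io

import pandas as pd

# Hypothetical CSV content; the second row has no Age value
csv_text = """PassengerId,Name,Age
1,Passenger A,22
2,Passenger B,
3,Passenger C,26
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)                # (3, 3)
print(sample_df["Age"].isna().sum())  # 1 (the empty field is read as NaN)
```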

In [63]:
titanic_df = pd.read_csv('./data/titanic_train.csv')

In [64]:
titanic_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [25]:
type(titanic_df)

pandas.core.frame.DataFrame

In [26]:
titanic_df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [27]:
titanic_df.tail(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [28]:
# Number of rows and columns in the data
titanic_df.shape

(891, 12)

In [29]:
titanic_df.info()

# Identify missing values, then decide whether to drop the column or fill them in
# Check the data type of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [30]:
titanic_df.describe()
# Check summary statistics

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [31]:
value_counts = titanic_df['Pclass'].value_counts()
value_counts

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [32]:
type(value_counts)

pandas.core.series.Series

In [33]:
E_value = titanic_df['Embarked'].value_counts()
E_value

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [34]:
titanic_pclass = titanic_df['Pclass']
print(type(titanic_pclass))

<class 'pandas.core.series.Series'>


In [35]:
titanic_pclass.head()

0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

In [36]:
Emb=titanic_df['Embarked']
Emb.head()

0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

In [37]:
value_counts = titanic_df['Pclass'].value_counts()
print(type(value_counts))
print(value_counts)

<class 'pandas.core.series.Series'>
3    491
1    216
2    184
Name: Pclass, dtype: int64


### Accessing DataFrame Columns

In [38]:
titanic_df['Age_0']=0
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_0
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0


In [39]:
titanic_df['Age_0'].value_counts()

0    891
Name: Age_0, dtype: int64

In [69]:
titanic_df['Age_by_10'] = titanic_df['Age']*10
titanic_df['Family_No'] = titanic_df['SibSp'] + titanic_df['Parch']+1
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,220.0,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,380.0,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,260.0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1


In [68]:
titanic_df['Age_by_10'] = titanic_df['Age_by_10'] + 100
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,100
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,100
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,100


In [67]:
titanic_df['Age_by_10']=0
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


### Deleting DataFrame Data

In [43]:
titanic_drop_df = titanic_df.drop('Age_0', axis=1 )
titanic_drop_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,1


In [61]:
titanic_df.head(3)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
1,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
2,5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [70]:
drop_result = titanic_df.drop(['Age_0', 'Age_by_10', 'Family_No'], axis=1, inplace=True)
print('Value returned by drop with inplace=True:', drop_result)
titanic_df.head(3)

KeyError: "['Age_0'] not found in axis"
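The KeyError above appears because `drop` was asked to remove a column that is no longer in the DataFrame. The sketch below uses a small made-up DataFrame (`demo_df` is a hypothetical name) to show what `drop` returns with and without `inplace=True`, and how `errors="ignore"` skips labels that do not exist.

```python
import pandas as pd

demo_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without inplace, drop() returns a new DataFrame and leaves demo_df untouched
dropped = demo_df.drop(["c"], axis=1)

# With inplace=True, demo_df itself changes and the return value is None
result = demo_df.drop(["b"], axis=1, inplace=True)

# errors="ignore" skips missing labels instead of raising KeyError
safe = demo_df.drop(["b", "missing"], axis=1, errors="ignore")

print(list(dropped.columns))  # ['a', 'b']
print(result)                 # None
print(list(demo_df.columns))  # ['a', 'c']
print(list(safe.columns))     # ['a', 'c']
```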

In [73]:
# Delete rows
titanic_df.drop([0,1,2], axis=0, inplace=True)

In [74]:
titanic_df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,540.0,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,20.0,5


### The Index Object

In [75]:
# Extract the Index object
indexes = titanic_df.index
indexes

RangeIndex(start=3, stop=891, step=1)

In [80]:
# Convert the Index object to an array of its values
indexes.values
# titanic_df.values

array([  3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,
        16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,
        29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
        42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,
        55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,
        68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,
        81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,
        94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104, 105, 106,
       107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
       120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132,
       133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145,
       146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158,
       159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
       172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 18

In [81]:
print(type(indexes.values))
print(indexes.values.shape)
print(indexes[:5].values)
print(indexes.values[:5])
print(indexes[6])

<class 'numpy.ndarray'>
(888,)
[3 4 5 6 7]
[3 4 5 6 7]
9


In [82]:
titanic_reset_df = titanic_df.reset_index(inplace=False)
titanic_reset_df.head(3)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
0,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
1,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1
2,5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,1


In [83]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,540.0,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,20.0,5


In [84]:
titanic_df.reset_index(inplace=True)
titanic_df.head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
0,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
1,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1
2,5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,1
3,6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,540.0,1
4,7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,20.0,5


### Data Selection

* The DataFrame [ ] operator

In [87]:
titanic_df['Pclass']

0      1
1      3
2      3
3      1
4      3
      ..
883    2
884    1
885    3
886    1
887    3
Name: Pclass, Length: 888, dtype: int64

In [90]:
# Extract a single column as a DataFrame
titanic_df[ ['Pclass'] ].head(3)

Unnamed: 0,Pclass
0,1
1,3
2,3


In [89]:
titanic_df[ ['Survived', 'Pclass'] ].head(3)

Unnamed: 0,Survived,Pclass
0,1,1
1,0,3
2,0,3


In [91]:
type(titanic_df[ 'Pclass' ]) , type(titanic_df[ ['Survived', 'Pclass'] ]) 

(pandas.core.series.Series, pandas.core.frame.DataFrame)

In [92]:
titanic_df[ [ 'Pclass' ] ].head(3)

Unnamed: 0,Pclass
0,1
1,3
2,3


In [93]:
# An integer inside [ ] is treated as a column label, so this raises a KeyError
titanic_df[0]

KeyError: 0

In [94]:
titanic_df[0:2]

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
0,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
1,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1


* The DataFrame iloc[ ] operator

  * Extracts data by position

In [95]:
titanic_df.head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
0,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
1,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1
2,5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,1
3,6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,540.0,1
4,7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,20.0,5


In [98]:
titanic_df.iloc[0, 0]

3

In [None]:
# The code below raises an error: iloc does not accept a column label.
titanic_df.iloc[0, 'Name']

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [None]:
# The code below raises an error: iloc does not accept a row label.
titanic_df.iloc['one', 0]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [99]:
titanic_df.iloc[0, 1]

4

* The DataFrame loc[ ] operator
    * Extracts data by label

In [100]:
titanic_df.loc[0, 'Name']

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

In [101]:
titanic_df.set_index('index',inplace=True)

In [102]:
titanic_df.head()

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,540.0,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,20.0,5


In [103]:
# By label
titanic_df.loc[3, 'Name']

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

In [104]:
# By position
titanic_df.iloc[0,3]

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

In [111]:
titanic_df.loc[3:5,['Name','Sex','Parch']]

index,Name,Sex,Parch
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,0
4,"Allen, Mr. William Henry",male,0
5,"Moran, Mr. James",male,0


In [109]:
titanic_df.iloc[0:3,3:7]

index,Name,Sex,Age,SibSp
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,"Allen, Mr. William Henry",male,35.0,0
5,"Moran, Mr. James",male,,0


In [None]:
# The code below raises an error because label 0 is no longer in the index.
titanic_df.loc[0, 'Name']

KeyError: 0

In [105]:
titanic_df.head()

index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_by_10,Family_No
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,350.0,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,350.0,1
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,540.0,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,20.0,5


In [None]:
# Position-based iloc slicing

titanic_df.iloc[0:3, [3,4,7]]

index,Name,Sex,Parch
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,0
4,"Allen, Mr. William Henry",male,0
5,"Moran, Mr. James",male,0


In [None]:
# Label-based loc slicing

titanic_df.loc[3:5, ['Name','Sex','Parch']]

index,Name,Sex,Parch
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,0
4,"Allen, Mr. William Henry",male,0
5,"Moran, Mr. James",male,0
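The two slices above return the same rows even though their bounds differ: a `loc` slice includes its end label, while an `iloc` slice excludes its end position. A minimal sketch, using a small made-up DataFrame whose index starts at 3 (names and values are hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Name": ["a", "b", "c", "d"], "Parch": [0, 1, 2, 3]},
    index=[3, 4, 5, 6],
)

by_label = demo_df.loc[3:5, ["Name"]]   # labels 3, 4 and 5 (end included)
by_position = demo_df.iloc[0:3, [0]]    # positions 0, 1 and 2 (end excluded)

print(by_label.equals(by_position))     # True
print(len(demo_df.loc[3:5]))            # 3
print(len(demo_df.iloc[3:5]))           # 1 (only position 3 exists)
```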


### Data Filtering

* Data is filtered using Boolean indexing.

In [112]:
df = pd.DataFrame({
    "년도": ['2000', '2010', '2020'],    
    "미국": [2.1, 2.2, 2.3],
    "한국": [0.5, 0.4, 0.45],
    "중국": [17, 13, 15],
    "등급": ['양호','위험','양호']
})
df

Unnamed: 0,년도,미국,한국,중국,등급
0,2000,2.1,0.5,17,양호
1,2010,2.2,0.4,13,위험
2,2020,2.3,0.45,15,양호


In [113]:
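# Select data with a comparison condition -> the result is a Boolean Series
# (the 년도 column holds strings, so the comparison is lexicographic)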
df['년도'] > '2010'

0    False
1    False
2     True
Name: 년도, dtype: bool

In [114]:
# Apply the Boolean mask to get the matching rows as a DataFrame
df[df['년도'] > '2010']

Unnamed: 0,년도,미국,한국,중국,등급
2,2020,2.3,0.45,15,양호


In [115]:
# Two Series can be added element-wise
df['합계'] = df['한국'] + df['중국']
df

Unnamed: 0,년도,미국,한국,중국,등급,합계
0,2000,2.1,0.5,17,양호,17.5
1,2010,2.2,0.4,13,위험,13.4
2,2020,2.3,0.45,15,양호,15.45


In [125]:
# Select rows whose value equals a given string
df[df['등급']=='양호'].loc[:,['한국']]

Unnamed: 0,한국
0,0.5
2,0.45


In [118]:
# Select rows whose value contains a given string, using str.contains()
df[df['등급'].str.contains('양호')]

Unnamed: 0,년도,미국,한국,중국,등급,합계
0,2000,2.1,0.5,17,양호,17.5
2,2020,2.3,0.45,15,양호,15.45


In [119]:
df[df['등급'].str.contains('양')]

Unnamed: 0,년도,미국,한국,중국,등급,합계
0,2000,2.1,0.5,17,양호,17.5
2,2020,2.3,0.45,15,양호,15.45


In [127]:
df.loc[ df['등급'].str.contains('양호') , ['한국']]

Unnamed: 0,한국
0,0.5
2,0.45


---

In [129]:
titanic_df = pd.read_csv('./data/titanic_train.csv')

In [130]:
# Show only passengers older than 60
titanic_df[titanic_df['Age'] > 60]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
54,55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
170,171,0,1,"Van der hoef, Mr. Wyckoff",male,61.0,0,0,111240,33.5,B19,S
252,253,0,1,"Stead, Mr. William Thomas",male,62.0,0,0,113514,26.55,C87,S
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
280,281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q
326,327,0,3,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S
438,439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0,C23 C25 C27,S


In [131]:
# Show only first-class passengers
titanic_df[titanic_df['Pclass']==1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5000,A6,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [138]:
# Show only the name and age of passengers older than 60
titanic_df[titanic_df['Age'] > 60][['Name','Age']]

Unnamed: 0,Name,Age
33,"Wheadon, Mr. Edward H",66.0
54,"Ostby, Mr. Engelhart Cornelius",65.0
96,"Goldschmidt, Mr. George B",71.0
116,"Connors, Mr. Patrick",70.5
170,"Van der hoef, Mr. Wyckoff",61.0
252,"Stead, Mr. William Thomas",62.0
275,"Andrews, Miss. Kornelia Theodosia",63.0
280,"Duane, Mr. Frank",65.0
326,"Nysveen, Mr. Johan Hansen",61.0
438,"Fortune, Mr. Mark",64.0


In [133]:
titanic_df.loc[titanic_df['Age'] > 60, ['Name','Age']].head(3)

Unnamed: 0,Name,Age
33,"Wheadon, Mr. Edward H",66.0
54,"Ostby, Mr. Engelhart Cornelius",65.0
96,"Goldschmidt, Mr. George B",71.0


In [139]:
# Show passengers who are older than 60, in first class, and female
titanic_df[ (titanic_df['Age'] > 60) & (titanic_df['Pclass']==1) & (titanic_df['Sex']=='female')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [140]:
cond1 = titanic_df['Age'] > 60
cond2 = titanic_df['Pclass']==1
cond3 = titanic_df['Sex']=='female'
titanic_df[ cond1 & cond2 & cond3]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,
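Conditions are combined with `&` (and), `|` (or) and `~` (not) rather than Python's `and`, `or` and `not`, which raise a ValueError on a Series. Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>` or `==`. A small sketch on made-up data (the `people` DataFrame is hypothetical):

```python
import pandas as pd

people = pd.DataFrame({
    "Age": [70, 30, 65, 10],
    "Pclass": [1, 1, 3, 2],
})

is_old = people["Age"] > 60
is_first = people["Pclass"] == 1

both = people[is_old & is_first]         # AND
either = people[is_old | is_first]       # OR
neither = people[~(is_old | is_first)]   # NOT

print(list(both.index))     # [0]
print(list(either.index))   # [0, 1, 2]
print(list(neither.index))  # [3]
```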


### Sorting

* Sort a DataFrame or Series with sort_values()
* by=the column(s) to sort by
* ascending=False for descending order | ascending=True for ascending order (the default)


In [144]:
titanic_sorted = titanic_df.sort_values(by=['Name'])
titanic_sorted.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
845,846,0,3,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S
746,747,0,3,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S
279,280,1,3,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.25,,S


In [162]:
# titanic_sorted = titanic_df.sort_values(by=['Name', 'Pclass'], ascending=False)
# titanic_sorted.head(3)

titanic_df.sort_values(['Name', 'Pclass'],ascending = [True, False]).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
845,846,0,3,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S
746,747,0,3,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S
279,280,1,3,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.25,,S
308,309,0,2,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0,,C
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0,,C


### Applying Aggregation Functions

In [149]:
titanic_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [148]:
titanic_df.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [150]:
titanic_df[['Age', 'Fare']].mean()

Age     29.699118
Fare    32.204208
dtype: float64

### Using groupby()

In [163]:
titanic_groupby = titanic_df.groupby(by='Pclass')
type(titanic_groupby)

pandas.core.groupby.generic.DataFrameGroupBy

In [164]:
titanic_groupby = titanic_df.groupby('Pclass').count()
titanic_groupby

Pclass,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,216,216,216,216,186,216,216,216,216,176,214
2,184,184,184,184,173,184,184,184,184,16,184
3,491,491,491,491,355,491,491,491,491,12,491


In [167]:
titanic_groupby = titanic_df.groupby('Pclass')[['Sex', 'Survived']].count()
titanic_groupby

Pclass,Sex,Survived
1,216,216
2,184,184
3,491,491


In [168]:
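# Function names are passed as strings; recent pandas versions warn when given the built-in max and min
titanic_df.groupby('Pclass')['Age'].agg(['max', 'min'])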

Pclass,max,min
1,80.0,0.92
2,70.0,0.67
3,74.0,0.42


In [None]:
agg_format={'Age':'max', 'SibSp':'sum', 'Fare':'mean'}
titanic_df.groupby('Pclass').agg(agg_format)

Pclass,Age,SibSp,Fare
1,80.0,90,84.154687
2,70.0,74,20.662183
3,74.0,302,13.67555
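When the result columns should have their own names, pandas also supports named aggregation, where each keyword argument is an output column and its value is a `(column, function)` pair. This is an alternative to the dictionary form above, not the approach used in this notebook; the `toy` data and the output names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3],
    "Age": [40.0, 60.0, 30.0, 20.0, 25.0],
    "Fare": [80.0, 100.0, 20.0, 30.0, 10.0],
})

# Each keyword names an output column: (input column, aggregation function)
summary = toy.groupby("Pclass").agg(
    max_age=("Age", "max"),
    mean_fare=("Fare", "mean"),
)

print(summary.loc[1, "max_age"])    # 60.0
print(summary.loc[2, "mean_fare"])  # 25.0
```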


### Handling Missing Data
* Check for missing data with isna() (isnull() is an alias for the same method)

In [169]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [170]:
titanic_df.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [171]:
titanic_df.isnull( ).sum( )

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [172]:
titanic_df[titanic_df['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


* Replace missing data with fillna( )

In [173]:
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('C000')
titanic_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,C000,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,C000,S


In [None]:
titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         2
dtype: int64

In [182]:
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
titanic_df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

### Transforming Data with apply and lambda Expressions

In [183]:
def get_square(a):
    return a**2

print('Square of 3:', get_square(3))

Square of 3: 9


In [184]:
lambda_square = lambda x : x ** 2
print('Square of 3:', lambda_square(3))

Square of 3: 9


In [185]:
a=[1,2,3]
squares = map(lambda x : x**2, a)
list(squares)

[1, 4, 9]

In [186]:
titanic_df['Name_len']= titanic_df['Name'].apply(lambda x : len(x))
titanic_df[['Name','Name_len']].head(3)

Unnamed: 0,Name,Name_len
0,"Braund, Mr. Owen Harris",23
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",51
2,"Heikkinen, Miss. Laina",22


In [188]:
titanic_df['Child_Adult'] = titanic_df['Age'].apply(lambda x : 'Child' if x <=15 else 'Adult' )
titanic_df[['Age','Child_Adult']].head(10)

Unnamed: 0,Age,Child_Adult
0,22.0,Adult
1,38.0,Adult
2,26.0,Adult
3,35.0,Adult
4,35.0,Adult
5,29.699118,Adult
6,54.0,Adult
7,2.0,Child
8,27.0,Adult
9,14.0,Child


In [189]:
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : 'Child' if x<=15 
else ('Adult' if x <= 60 
else 'Elderly'))
titanic_df['Age_cat'].value_counts()

Adult      786
Child       83
Elderly     22
Name: Age_cat, dtype: int64

In [None]:
# Define a function that classifies ages into finer-grained categories.
def get_category(age):
    cat = ''
    if age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else : cat = 'Elderly'
    
    return cat

# The lambda returns the result of the get_category() function defined above.
# get_category(x) receives a value from the 'Age' column and returns the matching category.
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
titanic_df[['Age','Age_cat']].head()
    

Unnamed: 0,Age,Age_cat
0,22.0,Student
1,38.0,Adult
2,26.0,Young Adult
3,35.0,Young Adult
4,35.0,Young Adult
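For fixed numeric ranges like these, `pd.cut` is a possible alternative to `apply`. The sketch below reuses the same boundaries and labels as `get_category`; the `ages` values are made up, and this is not the approach taken in the notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2.0, 14.0, 22.0, 35.0, 38.0, 71.0])

# Bin edges match the thresholds in get_category
bins = [-np.inf, 5, 12, 18, 25, 35, 60, np.inf]
labels = ["Baby", "Child", "Teenager", "Student",
          "Young Adult", "Adult", "Elderly"]

# right=True is the default, so each bin includes its upper edge (age <= edge)
age_cat = pd.cut(ages, bins=bins, labels=labels)

print(age_cat.tolist())
# ['Baby', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']
```

One difference to keep in mind: `pd.cut` leaves a missing age as missing, while `get_category` would return 'Elderly' for NaN because every comparison with NaN is False.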


## Renaming Columns

In [191]:
df = pd.DataFrame({
    "년도": ['2000', '2010', '2020'],    
    "미국": [2.1, 2.2, 2.3],
    "한국": [0.5, 0.4, 0.45],
    "중국": [17, 13, 15]    
})

In [197]:
df.rename(columns={'미국':'캐나다','한국':'일본'},inplace=True)
df

Unnamed: 0,년도,캐나다,일본,중국
0,2000,2.1,0.5,17
1,2010,2.2,0.4,13
2,2020,2.3,0.45,15


In [198]:
# inplace expects a boolean, not the string 'True'
df.rename(columns={df.columns[1]:'미국',
                   df.columns[2]:'한국'},inplace=True)
df

Unnamed: 0,년도,미국,한국,중국
0,2000,2.1,0.5,17
1,2010,2.2,0.4,13
2,2020,2.3,0.45,15
