This post introduces two ways to convert string columns (Series) in a pandas DataFrame to numeric types.

In [1]:
import numpy as np
import pandas as pd

# make a DataFrame as a sample

sample = pd.DataFrame({'column_str': ['1', '2', '3', '4', '5']})

print(sample)

  column_str
0          1
1          2
2          3
3          4
4          5


In [2]:
sample.dtypes

column_str    object
dtype: object

## 1) Converting a single string column to numeric with pd.to_numeric

In [3]:
sample['column_int'] = pd.to_numeric(sample['column_str'])
print(sample)

  column_str  column_int
0          1           1
1          2           2
2          3           3
3          4           4
4          5           5


In [4]:
sample.dtypes

column_str    object
column_int     int64
dtype: object

#### Applying to_numeric to multiple columns with apply

In [5]:
sample2 = pd.DataFrame({'col_str_1': ['1', '2', '3'], 
                   'col_str_2': ['4', '5', '6'], 
                   'col_str_3': ['5.6', '6.7', '7.8']})
print(sample2)

  col_str_1 col_str_2 col_str_3
0         1         4       5.6
1         2         5       6.7
2         3         6       7.8


In [6]:
sample2.dtypes

col_str_1    object
col_str_2    object
col_str_3    object
dtype: object

In [7]:
# convert 'col_str_1' and 'col_str_2' to numeric

sample2[['col_int_1', 'col_int_2']] = sample2[['col_str_1', 'col_str_2']].apply(pd.to_numeric)
print(sample2)

  col_str_1 col_str_2 col_str_3  col_int_1  col_int_2
0         1         4       5.6          1          4
1         2         5       6.7          2          5
2         3         6       7.8          3          6


In [8]:
sample2.dtypes

col_str_1    object
col_str_2    object
col_str_3    object
col_int_1     int64
col_int_2     int64
dtype: object

In [9]:
sample3 = sample2.apply(pd.to_numeric)
sample3.dtypes

col_str_1      int64
col_str_2      int64
col_str_3    float64
col_int_1      int64
col_int_2      int64
dtype: object
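
As a small sketch of the behavior seen above, pd.to_numeric appears to choose the result dtype from the values in each column: integer-like strings become an integer dtype, and strings with a decimal point become a float dtype. The column names `ints` and `floats` below are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data: one integer-like column, one decimal column
df = pd.DataFrame({'ints': ['1', '2', '3'],
                   'floats': ['5.6', '6.7', '7.8']})

# apply calls to_numeric once per column, so each column gets its own dtype
converted = df.apply(pd.to_numeric)
print(converted.dtypes)
```

Because the conversion runs column by column, a decimal value in one column does not turn the other columns into floats.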

### apply examples

In [10]:
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[6,23,24,30],'c':[111,222,333,444]})
print(df1)

   a   b    c
0  1   6  111
1  2  23  222
2  3  24  333
3  4  30  444


In [11]:
f = lambda x: x.max() - x.min()

In [12]:
df1.apply(f)

a      3
b     24
c    333
dtype: int64

In [13]:
df2 = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
                  index=['sanfrancisco','houston','boston','LA'])
df2

                     b         d         e
sanfrancisco  0.828293  1.948907 -0.077783
houston       0.634333 -0.554782  0.019595
boston        0.700103 -0.590163  0.398939
LA            0.280076   1.38493   0.13333


In [14]:
df2.apply(f)

b    0.548217
d    2.539070
e    0.476722
dtype: float64

You can see that f was applied to each column. The result is a Series indexed by the column names.

In [15]:
df2.apply(f, axis='columns')

sanfrancisco    2.026691
houston         1.189114
boston          1.290266
LA              1.251600
dtype: float64

Passing axis='columns' to apply calls the function once per row instead, so the result is indexed by the row labels.
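
The difference between the two directions is easier to check by hand with fixed numbers than with random ones. The following is a minimal sketch with made-up data; the names `df`, `value_range`, `by_column` and `by_row` are illustrative only.

```python
import pandas as pd

# Small deterministic frame so the results are easy to verify by hand
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 40, 20]},
                  index=['x', 'y', 'z'])

# Range (max minus min) of whatever Series apply passes in
value_range = lambda s: s.max() - s.min()

by_column = df.apply(value_range)                # one call per column
by_row = df.apply(value_range, axis='columns')   # one call per row

print(by_column)   # indexed by column name: a, b
print(by_row)      # indexed by row label: x, y, z
```

Here `by_column` holds 2 for `a` and 30 for `b`, while `by_row` holds 9, 38 and 17 for `x`, `y` and `z`.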

## 2) Converting string columns to numeric with the astype() method

In [16]:
sample2 = pd.DataFrame({'col_str_1': ['1', '2', '3'], 
                   'col_str_2': ['4', '5', '6'], 
                   'col_str_3': ['5.6', '6.7', '7.8']})
print(sample2)

  col_str_1 col_str_2 col_str_3
0         1         4       5.6
1         2         5       6.7
2         3         6       7.8


In [17]:
sample2.dtypes

col_str_1    object
col_str_2    object
col_str_3    object
dtype: object

In [18]:
sample3 = sample2.astype(float)
sample3

   col_str_1  col_str_2  col_str_3
0        1.0        4.0        5.6
1        2.0        5.0        6.7
2        3.0        6.0        7.8


In [19]:
sample3.dtypes

col_str_1    float64
col_str_2    float64
col_str_3    float64
dtype: object

### Specifying a data type for each string column individually

In [20]:
sample4 = sample2.astype({'col_str_1':int, 
                         'col_str_2':float})  # np.float was removed in NumPy 1.24; use the built-in float
sample4

   col_str_1  col_str_2 col_str_3
0          1        4.0       5.6
1          2        5.0       6.7
2          3        6.0       7.8


In [21]:
sample4.dtypes

col_str_1      int64
col_str_2    float64
col_str_3     object
dtype: object
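
Unlike pd.to_numeric, astype does not infer the target type, so the type named in the mapping has to fit the strings in that column. The sketch below is an illustration under that assumption rather than part of the original notebook. It builds the frame with dtype=object to match the dtypes shown above, and `int_cast_failed` is a hypothetical flag used only to report the outcome.

```python
import pandas as pd

# Same kind of data as sample2; dtype=object matches the dtypes shown above
df = pd.DataFrame({'col_str_1': ['1', '2', '3'],
                   'col_str_3': ['5.6', '6.7', '7.8']}, dtype=object)

# Each column is converted to the type named in the mapping
typed = df.astype({'col_str_1': int, 'col_str_3': float})
print(typed.dtypes)

# Asking for int on decimal strings fails, because '5.6' is not a valid integer literal
try:
    df.astype({'col_str_3': int})
    int_cast_failed = False
except ValueError as exc:
    int_cast_failed = True
    print('astype(int) failed:', exc)
```

For a column such as col_str_3, float is therefore the safer target type.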

### Applying the function to the whole DataFrame raises a ValueError when a column contains non-numeric strings

In [22]:
sample7 = pd.DataFrame({'col_1': ['1', '2', '3'], 
                   'col_2': ['abc', 'def', 'ghi']})
sample7.dtypes

col_1    object
col_2    object
dtype: object

In [23]:
sample7.apply(pd.to_numeric)

ValueError: ('Unable to parse string "abc" at position 0', 'occurred at index col_2')
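
One possible way around this error, not shown in the original notebook, is the errors='coerce' option of pd.to_numeric, which replaces values that cannot be parsed with NaN instead of raising. The sketch below assumes that losing the unparseable values is acceptable. The name `coerced` is illustrative.

```python
import pandas as pd

sample7 = pd.DataFrame({'col_1': ['1', '2', '3'],
                        'col_2': ['abc', 'def', 'ghi']})

# Extra keyword arguments given to apply are forwarded to pd.to_numeric;
# errors='coerce' turns unparseable strings into NaN instead of raising
coerced = sample7.apply(pd.to_numeric, errors='coerce')
print(coerced)
print(coerced.dtypes)
```

With this option col_1 is converted normally, while every value in col_2 becomes NaN. If the text column should be kept as it is, another option is to apply pd.to_numeric only to the columns known to hold numbers, as in the col_str_1 and col_str_2 example earlier.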