# Pandas Basics

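- A Python module that provides data-handling capabilities on par with R.
- You can think of it as a programmable, extensible version of Excel.
- Within a single process it is highly efficient; some describe it as Excel on steroids.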

- See the official pandas documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.html

#### 🔰 Import Module

- pandas is conventionally imported under the alias pd.
- numpy, which provides many numerical functions, is conventionally imported under the alias np.

In [22]:
import pandas as pd
import numpy as np

-----

#### 🔰 Series

- Consists of an index and values.
- Holds a single data type (dtype).
- Each column of a DataFrame is a Series.

In [23]:
pd.Series()

Series([], dtype: object)

👆 When unsure, start from the empty template shown above and fill it in.

In [24]:
pd.Series([1, 2, 3, 4])

0    1
1    2
2    3
3    4
dtype: int64

In [100]:
pd.Series([1, 2, 3, 4], dtype=float64)

NameError: name 'float64' is not defined

In [26]:
pd.Series([1, 2, 3, 4], dtype=np.float64)

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

In [27]:
pd.Series([1, 2, 3, 4], dtype=float)

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

In [28]:
pd.Series([1, 2, 3, 4], dtype=str)

0    1
1    2
2    3
3    4
dtype: object

👆 Here object is the dtype pandas uses for Python strings (it can also hold arbitrary Python objects).

In [29]:
pd.Series(np.array([1, 2, 3]))

0    1
1    2
2    3
dtype: int32

In [30]:
pd.Series({"key": "Value"})

key    Value
dtype: object

In [31]:
pd.Series([1, 2, 3, "5"])

0    1
1    2
2    3
3    5
dtype: object

👆 Because the list mixes integers and a string, the whole Series falls back to the object dtype.<br>
⭐ In other words, a Series has exactly one dtype.
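One way to see how pandas stores mixed input is to inspect the type of each element. The sketch below is illustrative only; the names `mixed`, `element_types`, `as_text`, and `text_types` are not part of the original notebook.

```python
import pandas as pd

# A Series built from mixed Python values gets the catch-all object dtype.
mixed = pd.Series([1, 2, 3, "5"])
print(mixed.dtype)

# Each element keeps its original Python type.
element_types = [type(v).__name__ for v in mixed]
print(element_types)

# astype(str) converts every element to a string explicitly.
as_text = mixed.astype(str)
text_types = [type(v).__name__ for v in as_text]
print(text_types)
```

If every value really should be text, an explicit `astype(str)` makes that intent visible instead of relying on the object fallback.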

In [32]:
data = pd.Series([1, 2, 3, 4])
data

0    1
1    2
2    3
3    4
dtype: int64

In [33]:
print(data % 2)
data % 2

0    1
1    0
2    1
3    0
dtype: int64


0    1
1    0
2    1
3    0
dtype: int64

**<Date data>**

In [34]:
dates = pd.date_range("20230101", periods=6)
dates

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D')

In [35]:
pd.date_range("20240101", periods=60)

DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08',
               '2024-01-09', '2024-01-10', '2024-01-11', '2024-01-12',
               '2024-01-13', '2024-01-14', '2024-01-15', '2024-01-16',
               '2024-01-17', '2024-01-18', '2024-01-19', '2024-01-20',
               '2024-01-21', '2024-01-22', '2024-01-23', '2024-01-24',
               '2024-01-25', '2024-01-26', '2024-01-27', '2024-01-28',
               '2024-01-29', '2024-01-30', '2024-01-31', '2024-02-01',
               '2024-02-02', '2024-02-03', '2024-02-04', '2024-02-05',
               '2024-02-06', '2024-02-07', '2024-02-08', '2024-02-09',
               '2024-02-10', '2024-02-11', '2024-02-12', '2024-02-13',
               '2024-02-14', '2024-02-15', '2024-02-16', '2024-02-17',
               '2024-02-18', '2024-02-19', '2024-02-20', '2024-02-21',
               '2024-02-22', '2024-02-23', '2024-02-24', '2024-02-25',
      

-----

#### 🔰 DataFrame

- pd.Series()
    - index, value
- pd.DataFrame()
    - index, value, **column**
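The relationship between the two structures can be sketched with a small, made-up table. The data and the names `frame` and `price` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: each dict key becomes a column name.
frame = pd.DataFrame(
    {"price": [100, 200, 300], "name": ["a", "b", "c"]},
    index=["x", "y", "z"],
)

# Selecting one column returns a Series that shares the DataFrame's index.
price = frame["price"]
print(type(price).__name__)   # Series
print(list(price.index))      # ['x', 'y', 'z']
print(list(frame.columns))    # ['price', 'name']
```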

In [36]:
# Call pattern (not runnable as written): pd.DataFrame(data, index=..., columns=...)
# In Jupyter Notebook, pressing [Shift + Tab] shows detailed help for the function (Init signature: & Docstring:).

In [37]:
# Generate random numbers sampled from the standard normal distribution
data = np.random.randn(6, 4)
data

array([[ 0.79255413, -1.81694999,  0.35020953,  1.40498027],
       [-0.06900768, -0.62575308, -0.66187013,  0.75597956],
       [ 0.82863207,  0.58713825,  0.33881898, -0.15277535],
       [ 1.56377512, -0.24221876,  0.4246531 ,  1.28469043],
       [-0.56468827, -0.4132788 , -0.28693992, -1.23037567],
       [ 1.29468486,  1.97990687, -0.57032513,  0.9340191 ]])

`(*args: int)` -> ndarray[Any, dtype[float64]]

__randn__(d0, d1, ..., dn)

Return a sample (or samples) from the "standard normal" distribution.

If positive int_like arguments are provided, randn generates an array of shape (d0, d1, ..., dn), filled with random floats sampled from a univariate "normal" (Gaussian) distribution of mean 0 and variance 1. A single float randomly sampled from the distribution is returned if no argument is provided.

In [38]:
df = pd.DataFrame(data, index=dates, columns=["A", "B", "C", "D"])
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


-----

#### 🔰 Exploring DataFrame Information (Contents)

- Methods of a pandas DataFrame object : df.head(), df.tail(), ...

- Attributes (variables) of a pandas DataFrame object : df.index, df.columns, df.values, ...

In [39]:
df.head()

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376


In [40]:
df.tail()

Unnamed: 0,A,B,C,D
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [41]:
df.index

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D')

In [42]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [43]:
df.values

array([[ 0.79255413, -1.81694999,  0.35020953,  1.40498027],
       [-0.06900768, -0.62575308, -0.66187013,  0.75597956],
       [ 0.82863207,  0.58713825,  0.33881898, -0.15277535],
       [ 1.56377512, -0.24221876,  0.4246531 ,  1.28469043],
       [-0.56468827, -0.4132788 , -0.28693992, -1.23037567],
       [ 1.29468486,  1.97990687, -0.57032513,  0.9340191 ]])

- **df.info()**

	- A method that shows an overview (basic information) of the DataFrame.
	- It is mostly used to check each column's non-null count and data type.

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2023-01-01 to 2023-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes


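- **df.describe()**

	- A method that shows a statistical summary of the DataFrame.
	- count : number of non-null values per column
	- mean : mean
	- std : standard deviation
	- min, max : minimum and maximum values
	- 25%, 50%, 75% : quartiles (50% is the median)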
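One detail worth checking is which standard deviation `describe()` reports. The sketch below, using a made-up Series named `values`, compares it with pandas' `std()` (sample standard deviation, `ddof=1`) and NumPy's default `np.std` (population standard deviation, `ddof=0`).

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])
summary = values.describe()

# pandas uses the sample standard deviation (ddof=1) ...
sample_std = values.std()
# ... while NumPy's default is the population standard deviation (ddof=0).
population_std = np.std(values.to_numpy())

print(summary["std"], sample_std, population_std)
print(summary["50%"])  # the median
```

The `std` row of `describe()` matches `values.std()`, so it is slightly larger than what `np.std` returns for the same data.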

In [45]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.640992,-0.088526,-0.067576,0.49942
std,0.811762,1.275938,0.497203,1.010835
min,-0.564688,-1.81695,-0.66187,-1.230376
25%,0.146383,-0.572635,-0.499479,0.074413
50%,0.810593,-0.327749,0.02594,0.844999
75%,1.178172,0.379799,0.347362,1.197023
max,1.563775,1.979907,0.424653,1.40498


-----

#### 🔰 Sorting Data


- __df.sort_values(by='')__

	- Sorts the data by a specific column.

In [46]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [47]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [48]:
df.sort_values(by='B', ascending=False)

Unnamed: 0,A,B,C,D
2023-01-06,1.294685,1.979907,-0.570325,0.934019
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-01,0.792554,-1.81695,0.35021,1.40498


- By default sort_values() returns a new DataFrame; to apply the sort to df itself, pass inplace=True.
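Reassigning the returned DataFrame is a common alternative to `inplace=True`. The following is a minimal sketch on a made-up frame; `scores` and `sorted_copy` are illustrative names.

```python
import pandas as pd

scores = pd.DataFrame({"B": [3, 1, 2]}, index=["r1", "r2", "r3"])

# sort_values returns a new, sorted DataFrame; the original is unchanged.
sorted_copy = scores.sort_values(by="B")
print(list(scores["B"]))       # [3, 1, 2]
print(list(sorted_copy["B"]))  # [1, 2, 3]

# Reassigning the result has the same visible effect as inplace=True.
scores = scores.sort_values(by="B", ascending=False)
print(list(scores.index))      # ['r1', 'r3', 'r2']
```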

In [49]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [50]:
df.sort_values(by='B', ascending=False, inplace=True)
df

Unnamed: 0,A,B,C,D
2023-01-06,1.294685,1.979907,-0.570325,0.934019
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-01,0.792554,-1.81695,0.35021,1.40498


#### 🔰 Selecting Data

- Selecting a single column

In [51]:
df['A']

2023-01-06    1.294685
2023-01-03    0.828632
2023-01-04    1.563775
2023-01-05   -0.564688
2023-01-02   -0.069008
2023-01-01    0.792554
Name: A, dtype: float64

In [52]:
type(df['A'])

pandas.core.series.Series

In [53]:
df.A

2023-01-06    1.294685
2023-01-03    0.828632
2023-01-04    1.563775
2023-01-05   -0.564688
2023-01-02   -0.069008
2023-01-01    0.792554
Name: A, dtype: float64

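👆 Attribute access works only when the column name is a string that is a valid Python identifier and does not clash with an existing DataFrame attribute. It does not work for numeric column names!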
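The limits of attribute access can be seen on a small, hypothetical frame with one string column name and one integer column name (`frame` and `clash` are illustrative names).

```python
import pandas as pd

# Hypothetical frame with one string column name and one integer column name.
frame = pd.DataFrame({"A": [1, 2], 0: [3, 4]})

print(list(frame.A))    # attribute access works for "A"
print(list(frame[0]))   # integer names need bracket access

# A name that collides with a DataFrame method also needs brackets.
clash = pd.DataFrame({"count": [10, 20]})
print(callable(clash.count))   # True: this is the method, not the column
print(list(clash["count"]))    # [10, 20]
```

Bracket access works in every case, which is why it is usually the safer default.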

- Selecting two or more columns

	- The argument must be a list.

In [54]:
df[['A', 'B']]

Unnamed: 0,A,B
2023-01-06,1.294685,1.979907
2023-01-03,0.828632,0.587138
2023-01-04,1.563775,-0.242219
2023-01-05,-0.564688,-0.413279
2023-01-02,-0.069008,-0.625753
2023-01-01,0.792554,-1.81695


#### 🔰 Slicing : offset index

- [n:m] : from position n up to (m-1)
- When slicing by index or column labels, the stop label is included.
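The difference between the two slicing rules is easiest to see side by side. This sketch uses a made-up Series named `letters` with string labels.

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Positional slice: the stop position is excluded.
by_position = letters.iloc[0:2]
# Label slice: the stop label is included.
by_label = letters.loc["a":"c"]

print(list(by_position.index))  # ['a', 'b']
print(list(by_label.index))     # ['a', 'b', 'c']
```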

In [55]:
df = pd.DataFrame(data, index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [56]:
df[0:3]

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775


In [57]:
df["20230101":"20230104"]

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469


- __df.loc[index, column]__

	- location
	- Selects specific rows (index) and columns by label.
	- `:` means all.

In [58]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [59]:
df.loc[:, 'A']

2023-01-01    0.792554
2023-01-02   -0.069008
2023-01-03    0.828632
2023-01-04    1.563775
2023-01-05   -0.564688
2023-01-06    1.294685
Freq: D, Name: A, dtype: float64

In [60]:
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2023-01-01,0.792554,-1.81695
2023-01-02,-0.069008,-0.625753
2023-01-03,0.828632,0.587138
2023-01-04,1.563775,-0.242219
2023-01-05,-0.564688,-0.413279
2023-01-06,1.294685,1.979907


In [61]:
df.loc["20230102", ['A', 'B']]

A   -0.069008
B   -0.625753
Name: 2023-01-02 00:00:00, dtype: float64

In [62]:
df.loc["20230102":"20230104", ['A', 'B']]

Unnamed: 0,A,B
2023-01-02,-0.069008,-0.625753
2023-01-03,0.828632,0.587138
2023-01-04,1.563775,-0.242219


In [63]:
df.loc["20230102":"20230104", 'A':'D']

Unnamed: 0,A,B,C,D
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469


- __df.iloc[rows, columns]__

	- integer location
	- Selects by 0-based integer position rather than by label.
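Labels and positions only diverge visibly when the index is not the default 0, 1, 2, .... The sketch below assumes a hypothetical Series named `items` whose integer labels differ from its positions.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that differ from positions.
items = pd.Series(["x", "y", "z"], index=[10, 20, 30])

print(items.loc[10])    # label 10      -> 'x'
print(items.iloc[0])    # position 0    -> 'x'
print(items.iloc[-1])   # last position -> 'z'

# Position 10 does not exist, so iloc raises IndexError.
try:
    items.iloc[10]
except IndexError as err:
    print("IndexError:", err)
```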

In [64]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [65]:
df.iloc[3]

A    1.563775
B   -0.242219
C    0.424653
D    1.284690
Name: 2023-01-04 00:00:00, dtype: float64

In [66]:
df.iloc[3, 2]

0.42465310412068547

In [67]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2023-01-04,1.563775,-0.242219
2023-01-05,-0.564688,-0.413279


In [68]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2023-01-02,-0.069008,-0.66187
2023-01-03,0.828632,0.338819
2023-01-05,-0.564688,-0.28694


In [69]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2023-01-01,-1.81695,0.35021
2023-01-02,-0.625753,-0.66187
2023-01-03,0.587138,0.338819
2023-01-04,-0.242219,0.424653
2023-01-05,-0.413279,-0.28694
2023-01-06,1.979907,-0.570325


-----

#### 🔰 Conditions

In [70]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [71]:
# Test which values in column A are greater than 0 (positive); the result is a boolean Series

df['A'] > 0

2023-01-01     True
2023-01-02    False
2023-01-03     True
2023-01-04     True
2023-01-05    False
2023-01-06     True
Freq: D, Name: A, dtype: bool

- masking

In [72]:
# From the whole DataFrame, show the rows whose value in column 'A' is greater than 0

df[df['A'] > 0]

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [73]:
# From the whole DataFrame, keep only the values that are greater than 0

df[df > 0]

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,,0.35021,1.40498
2023-01-02,,,,0.75598
2023-01-03,0.828632,0.587138,0.338819,
2023-01-04,1.563775,,0.424653,1.28469
2023-01-05,,,,
2023-01-06,1.294685,1.979907,,0.934019


👆 Values that do not satisfy the condition (df>0) are shown as NaN.

- NaN : Not a Number
	- pandas uses NaN to mark a missing value.
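Once a mask has produced NaN cells, they can be counted, filled, or dropped. This is a minimal sketch on a made-up 2x2 frame; `frame`, `masked`, `filled`, and `dropped` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})
masked = frame[frame > 0]          # non-positive cells become NaN

print(masked.isna().sum().sum())   # number of missing cells: 2
print(np.nan == np.nan)            # False: NaN never equals itself

filled = masked.fillna(0.0)        # replace missing cells with 0
dropped = masked.dropna()          # drop rows containing any NaN
print(filled.values.tolist())      # [[1.0, 0.0], [0.0, 4.0]]
print(len(dropped))                # 0
```

Because NaN never compares equal to itself, `isna()` is the reliable way to test for missing values.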

-----

#### 🔰 Adding a Column

- If the column does not exist in the data, it is __added__.

- If the column already exists in the data, it is __modified__.
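Both cases use the same assignment syntax. The sketch below shows them on a small, made-up frame named `frame`.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3]})

# "B" does not exist yet, so this assignment adds it.
frame["B"] = frame["A"] * 10
# "A" already exists, so this assignment overwrites it.
frame["A"] = [7, 8, 9]

print(list(frame.columns))     # ['A', 'B']
print(frame.values.tolist())   # [[7, 10], [8, 20], [9, 30]]
```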

In [74]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [75]:
df['E'] = ['one', 'one', 'two', 'three', 'four', 'seven']
df

Unnamed: 0,A,B,C,D,E
2023-01-01,0.792554,-1.81695,0.35021,1.40498,one
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598,one
2023-01-03,0.828632,0.587138,0.338819,-0.152775,two
2023-01-04,1.563775,-0.242219,0.424653,1.28469,three
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376,four
2023-01-06,1.294685,1.979907,-0.570325,0.934019,seven


- __df.isin()__

	- Checks whether each value is among the given values and returns the result as booleans.
	
	- The check can be limited to a single column or applied to the whole DataFrame.

	- (values: Iterable | Series | dict) -> Series[_bool]

		Whether elements in Series are contained in values.

		Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.

		Parameters<br>
		values : set or list-like<br>
		The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.
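The docstring's note about single strings can be checked directly. This sketch uses a made-up Series named `labels`, and also shows `~`, which negates a boolean Series.

```python
import pandas as pd

labels = pd.Series(["one", "two", "three"])

in_set = labels.isin(["two", "five"])   # membership test per element
not_in_set = ~in_set                    # ~ negates a boolean Series

print(in_set.tolist())      # [False, True, False]
print(not_in_set.tolist())  # [True, False, True]

# A bare string is rejected; wrap it in a list instead.
try:
    labels.isin("two")
except TypeError as err:
    print("TypeError:", err)
```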

In [76]:
df['E'].isin(['two'])

2023-01-01    False
2023-01-02    False
2023-01-03     True
2023-01-04    False
2023-01-05    False
2023-01-06    False
Freq: D, Name: E, dtype: bool

In [77]:
df['E'].isin(['two', 'three', 'five'])

2023-01-01    False
2023-01-02    False
2023-01-03     True
2023-01-04     True
2023-01-05    False
2023-01-06    False
Freq: D, Name: E, dtype: bool

In [78]:
# Mask the whole DataFrame: show the rows where the condition is True

df[df['E'].isin(['two', 'three', 'five'])]

Unnamed: 0,A,B,C,D,E
2023-01-03,0.828632,0.587138,0.338819,-0.152775,two
2023-01-04,1.563775,-0.242219,0.424653,1.28469,three


-----

#### 🔰 Removing a Column

- Removes a specific column.

- **del** df[columns_name] : the removal is applied to the data immediately.
- df.**drop**([columns_name], axis=1) : the removal is applied to the data only when inplace=True is passed.

In [79]:
df

Unnamed: 0,A,B,C,D,E
2023-01-01,0.792554,-1.81695,0.35021,1.40498,one
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598,one
2023-01-03,0.828632,0.587138,0.338819,-0.152775,two
2023-01-04,1.563775,-0.242219,0.424653,1.28469,three
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376,four
2023-01-06,1.294685,1.979907,-0.570325,0.934019,seven


In [80]:
del df['E']
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


- drop() needs the axis to be specified when removing columns, because the default (axis=0) targets rows.

	- axis : {0 or 'index', 1 or 'columns'}, default 0
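As an alternative to `axis`, `drop()` also accepts the `columns=` and `index=` keywords, which state the target explicitly. The sketch below uses a made-up frame; the names `by_axis`, `by_keyword`, and `without_r1` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r1", "r2"])

# Equivalent ways to drop column "B"; both return a new DataFrame.
by_axis = frame.drop(["B"], axis=1)
by_keyword = frame.drop(columns=["B"])

# Dropping a row by its index label.
without_r1 = frame.drop(index=["r1"])

print(list(by_axis.columns), list(by_keyword.columns))  # ['A'] ['A']
print(list(without_r1.index))                            # ['r2']
print(list(frame.columns))                               # ['A', 'B'] (unchanged)
```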

In [81]:
df.drop(['D'], axis=1) # axis=1: columns (vertical)

Unnamed: 0,A,B,C
2023-01-01,0.792554,-1.81695,0.35021
2023-01-02,-0.069008,-0.625753,-0.66187
2023-01-03,0.828632,0.587138,0.338819
2023-01-04,1.563775,-0.242219,0.424653
2023-01-05,-0.564688,-0.413279,-0.28694
2023-01-06,1.294685,1.979907,-0.570325


In [82]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [83]:
# With axis=0 (the default), pass row labels.

df.drop(['20230104'], inplace=True)
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


#### 🔰 apply() Method

- Applies a function to a DataFrame.

- Pass the function you want to apply as the argument, and the result of that operation is returned.

- func : function - Python function or NumPy ufunc to apply.
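For a DataFrame, `apply()` passes each column to the function by default (`axis=0`); with `axis=1` it passes each row instead. The sketch below illustrates both on a made-up frame; `column_ranges` and `row_totals` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Default axis=0: the function receives each column as a Series.
column_ranges = frame.apply(lambda col: col.max() - col.min())
# axis=1: the function receives each row as a Series.
row_totals = frame.apply(lambda row: row["A"] + row["B"], axis=1)

print(column_ranges.tolist())  # [2, 20]
print(row_totals.tolist())     # [11, 22, 33]
```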

In [84]:
df = pd.DataFrame(data, index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [85]:
df['A'].apply("sum")

3.8459502276655178

In [86]:
df['A'].apply(np.sum)

2023-01-01    0.792554
2023-01-02   -0.069008
2023-01-03    0.828632
2023-01-04    1.563775
2023-01-05   -0.564688
2023-01-06    1.294685
Freq: D, Name: A, dtype: float64

In [87]:
df[['A', 'D']].apply("sum")

A    3.845950
D    2.996518
dtype: float64

In [88]:
df[['A', 'D']].apply(np.sum)

A    3.845950
D    2.996518
dtype: float64

In [89]:
df.apply("sum")

A    3.845950
B   -0.531156
C   -0.405454
D    2.996518
dtype: float64

In [90]:
df.apply(np.sum)

A    3.845950
B   -0.531156
C   -0.405454
D    2.996518
dtype: float64

In [101]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


In [91]:
df.apply(np.cumsum) # cumulative sum of each column

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,0.723546,-2.442703,-0.311661,2.16096
2023-01-03,1.552179,-1.855565,0.027158,2.008184
2023-01-04,3.115954,-2.097784,0.451811,3.292875
2023-01-05,2.551265,-2.511062,0.164872,2.062499
2023-01-06,3.84595,-0.531156,-0.405454,2.996518


In [92]:
df['A'].apply("mean")

0.6409917046109196

In [93]:
df['A'].apply(np.mean)

2023-01-01    0.792554
2023-01-02   -0.069008
2023-01-03    0.828632
2023-01-04    1.563775
2023-01-05   -0.564688
2023-01-06    1.294685
Freq: D, Name: A, dtype: float64

In [94]:
df['A'].apply("min"), df['A'].apply("max")

(-0.5646882726181228, 1.563775118154971)

In [95]:
df['A'].apply(np.std)

2023-01-01    0.0
2023-01-02    0.0
2023-01-03    0.0
2023-01-04    0.0
2023-01-05    0.0
2023-01-06    0.0
Freq: D, Name: A, dtype: float64
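The outputs above suggest that, in the pandas version used for this notebook, a callable such as np.sum, np.mean, or np.std passed to Series.apply was called on each element separately (the standard deviation of a single value is 0.0), while the string names "sum" and "mean" produced a single aggregate. This behavior may differ between pandas versions. A minimal sketch of the less ambiguous alternatives, using a made-up Series named `series`:

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0])

# A callable passed to Series.apply is called once per element ...
doubled = series.apply(lambda x: x * 2)
# ... so reductions are clearer as direct methods (or via agg).
total = series.sum()
spread = series.std()
total_via_agg = series.agg("sum")

print(doubled.tolist())   # [2.0, 4.0, 6.0]
print(total, spread, total_via_agg)
```

Calling `sum()`, `mean()`, or `std()` directly states the intent to aggregate and avoids depending on how `apply()` dispatches.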

In [96]:
df

Unnamed: 0,A,B,C,D
2023-01-01,0.792554,-1.81695,0.35021,1.40498
2023-01-02,-0.069008,-0.625753,-0.66187,0.75598
2023-01-03,0.828632,0.587138,0.338819,-0.152775
2023-01-04,1.563775,-0.242219,0.424653,1.28469
2023-01-05,-0.564688,-0.413279,-0.28694,-1.230376
2023-01-06,1.294685,1.979907,-0.570325,0.934019


- Using a user-defined function

In [97]:
def plusminus(num):
    return "plus" if num > 0 else "minus"

In [98]:
df['A'].apply(plusminus)

2023-01-01     plus
2023-01-02    minus
2023-01-03     plus
2023-01-04     plus
2023-01-05    minus
2023-01-06     plus
Freq: D, Name: A, dtype: object

- Using a lambda function

In [99]:
df['A'].apply(lambda num: "plus" if num > 0 else "minus")

2023-01-01     plus
2023-01-02    minus
2023-01-03     plus
2023-01-04     plus
2023-01-05    minus
2023-01-06     plus
Freq: D, Name: A, dtype: object