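### Pandas
- One of the most widely used Python libraries for data processing.
- It efficiently processes and transforms two-dimensional data such as tables, Excel sheets, and CSV files.

#### Pandas components
- DataFrame: a two-dimensional dataset made up of rows and columns.
- Series: a one-dimensional dataset consisting of a single column (a column vector).
- Index: the row labels of a DataFrame or Series; by default, unique integers starting at 0.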
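
The relationship between the three components can be sketched as follows. The sample values and the names `scores` and `df` are made up for illustration.

```python
import pandas as pd

# A Series is a single labeled column of values.
scores = pd.Series([7.8, 7.6, 7.5], name="score")

# A DataFrame is a set of columns (Series) that share one Index.
df = pd.DataFrame({"country": ["A", "B", "C"], "score": scores})

print(type(df["score"]))   # selecting one column gives back a Series
print(df.index)            # default row labels: RangeIndex(start=0, stop=3, step=1)
print(list(df.columns))    # column labels are stored in an Index as well
```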

In [1]:
import pandas as pd

pd.__version__

'2.1.4'

#### DataFrame()
- To convert a dict into a DataFrame, pass it to the DataFrame constructor.
- Many things can then be adjusted, such as adding columns or changing the index labels.
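
As a minimal sketch, the index labels and the column order can also be set directly in the constructor. The `data` dict here is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
data = {"title": ["A", "B"], "audience": [100, 200]}

# `index` sets the row labels; with dict input, `columns` selects
# and orders the dict keys to use as columns.
df = pd.DataFrame(data, index=["one", "two"], columns=["audience", "title"])
print(df)
```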

In [12]:
import pandas as pd

film = {
    'title': ['명량', '극한 직업', '범죄 도시3', '국제 시장'],
    'audience': [17_615_919, 16_266_480, 10_682_674, 14_265_222],
    'country': ['한국', '한국', '한국', '한국']
}

film_df = pd.DataFrame(film)
display(film_df)

# Add a new column
film_df['income'] = [135_758_658_810, 139_657_105_516, 104_686_489_632, 110_951_970_230]
display(film_df)

# Replace the index labels
film_df.index = ['one', 'two', 'three', 'four']
display(film_df)

# Reset the index (the old index is kept as a new 'index' column)
film_df = film_df.reset_index()
display(film_df)

# Drop a feature (column)
film_df = film_df.drop(labels=['index'], axis=1)
display(film_df)

# Rename a feature
# Several features can be renamed at once:
# film_df = film_df.rename(columns={'title': 'name', k: v, k: v, ...})
film_df = film_df.rename(columns={'title': 'name'})
display(film_df)

# Drop a row by its index label
film_df = film_df.drop(index=[2])
display(film_df)

# Reset the index; drop=True discards the old index, inplace=True modifies the original
film_df.reset_index(drop=True, inplace=True)
display(film_df)

|  | title | audience | country |
| --- | --- | --- | --- |
| 0 | 명량 | 17615919 | 한국 |
| 1 | 극한 직업 | 16266480 | 한국 |
| 2 | 범죄 도시3 | 10682674 | 한국 |
| 3 | 국제 시장 | 14265222 | 한국 |

|  | title | audience | country | income |
| --- | --- | --- | --- | --- |
| 0 | 명량 | 17615919 | 한국 | 135758658810 |
| 1 | 극한 직업 | 16266480 | 한국 | 139657105516 |
| 2 | 범죄 도시3 | 10682674 | 한국 | 104686489632 |
| 3 | 국제 시장 | 14265222 | 한국 | 110951970230 |

|  | title | audience | country | income |
| --- | --- | --- | --- | --- |
| one | 명량 | 17615919 | 한국 | 135758658810 |
| two | 극한 직업 | 16266480 | 한국 | 139657105516 |
| three | 범죄 도시3 | 10682674 | 한국 | 104686489632 |
| four | 국제 시장 | 14265222 | 한국 | 110951970230 |

|  | index | title | audience | country | income |
| --- | --- | --- | --- | --- | --- |
| 0 | one | 명량 | 17615919 | 한국 | 135758658810 |
| 1 | two | 극한 직업 | 16266480 | 한국 | 139657105516 |
| 2 | three | 범죄 도시3 | 10682674 | 한국 | 104686489632 |
| 3 | four | 국제 시장 | 14265222 | 한국 | 110951970230 |

|  | title | audience | country | income |
| --- | --- | --- | --- | --- |
| 0 | 명량 | 17615919 | 한국 | 135758658810 |
| 1 | 극한 직업 | 16266480 | 한국 | 139657105516 |
| 2 | 범죄 도시3 | 10682674 | 한국 | 104686489632 |
| 3 | 국제 시장 | 14265222 | 한국 | 110951970230 |

|  | name | audience | country | income |
| --- | --- | --- | --- | --- |
| 0 | 명량 | 17615919 | 한국 | 135758658810 |
| 1 | 극한 직업 | 16266480 | 한국 | 139657105516 |
| 2 | 범죄 도시3 | 10682674 | 한국 | 104686489632 |
| 3 | 국제 시장 | 14265222 | 한국 | 110951970230 |

|  | name | audience | country | income |
| --- | --- | --- | --- | --- |
| 0 | 명량 | 17615919 | 한국 | 135758658810 |
| 1 | 극한 직업 | 16266480 | 한국 | 139657105516 |
| 3 | 국제 시장 | 14265222 | 한국 | 110951970230 |

|  | name | audience | country | income |
| --- | --- | --- | --- | --- |
| 0 | 명량 | 17615919 | 한국 | 135758658810 |
| 1 | 극한 직업 | 16266480 | 한국 | 139657105516 |
| 2 | 국제 시장 | 14265222 | 한국 | 110951970230 |


### read_csv()
- Reads a CSV file into a DataFrame.
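
`read_csv()` also accepts file-like objects, so a self-contained sketch can read from an in-memory string instead of a file on disk. The two rows below are copied from the happiness dataset shown in this section; using `index_col` and `usecols` is an illustrative choice, not something the notebook itself does.

```python
import io
import pandas as pd

csv_text = (
    "country,score,income\n"
    "Finland,7.821,High income\n"
    "Denmark,7.636,High income\n"
)

# Read every column, with the default integer index.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Read only some columns and use one of them as the index.
df_indexed = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["country", "score"],
    index_col="country",
)
print(df_indexed)
```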

In [14]:
import pandas as pd

happiness_df = pd.read_csv('./datasets/happiness_report_2022.csv')
display(happiness_df)

|  | country | score | income |
| --- | --- | --- | --- |
| 0 | Finland | 7.821 | High income |
| 1 | Denmark | 7.636 | High income |
| 2 | Iceland | 7.557 | High income |
| 3 | Switzerland | 7.512 | High income |
| 4 | Netherlands | 7.415 | High income |
| ... | ... | ... | ... |
| 141 | Botswana | 3.471 | Upper middle income |
| 142 | Rwanda | 3.268 | Low income |
| 143 | Zimbabwe | 2.995 | Lower middle income |
| 144 | Lebanon | 2.955 | Lower middle income |


#### head()
- Returns the first rows of the data (5 by default).

In [16]:
display(happiness_df.head())

|  | country | score | income |
| --- | --- | --- | --- |
| 0 | Finland | 7.821 | High income |
| 1 | Denmark | 7.636 | High income |
| 2 | Iceland | 7.557 | High income |
| 3 | Switzerland | 7.512 | High income |
| 4 | Netherlands | 7.415 | High income |


#### tail()
- Returns the last rows of the data (5 by default).

In [15]:
display(happiness_df.tail())

|  | country | score | income |
| --- | --- | --- | --- |
| 141 | Botswana | 3.471 | Upper middle income |
| 142 | Rwanda | 3.268 | Low income |
| 143 | Zimbabwe | 2.995 | Lower middle income |
| 144 | Lebanon | 2.955 | Lower middle income |
| 145 | Afghanistan | 2.404 | Low income |
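
Both `head()` and `tail()` take an optional row count. A small sketch on a made-up ten-row frame:

```python
import pandas as pd

# Hypothetical frame with rows 0..9, for illustration only.
df = pd.DataFrame({"n": range(10)})

print(df.head(3))       # first 3 rows
print(df.tail(2))       # last 2 rows
print(len(df.head()))   # with no argument, 5 rows are returned
```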


#### iloc[ ], loc[ ]
- Select the rows or columns you want.
- iloc selects by integer position, while loc selects by index label or column name.
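
With the default index, positions and labels coincide, which hides the difference between the two. The sketch below uses a made-up frame whose index labels deliberately differ from the positions.

```python
import pandas as pd

# Hypothetical data; the labels 10, 20, 30 are not the positions 0, 1, 2.
df = pd.DataFrame({"score": [7.8, 7.6, 7.5]}, index=[10, 20, 30])

print(df.iloc[0])      # position 0 -> the row labeled 10
print(df.loc[10])      # label 10   -> the same row
print(df.iloc[0:2])    # positional slice excludes the end: rows 10 and 20
print(df.loc[10:20])   # label slice includes the end: rows 10 and 20
```

Here `df.loc[0]` would raise a `KeyError`, because no row carries the label 0.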

In [23]:
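# Passing a single integer inside [] selects a row.

# iloc: select by integer position.
print(type(happiness_df.iloc[0]))
print(happiness_df.iloc[0])

# Passing a list instead returns a DataFrame rather than a Series.
print(type(happiness_df.iloc[[0]]))
display(happiness_df.iloc[[0]])

# loc: select by index label.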
print(type(happiness_df.loc[0]))
print(happiness_df.loc[0])

print(type(happiness_df.loc[[0]]))
display(happiness_df.loc[[0]])

<class 'pandas.core.series.Series'>
country        Finland
score            7.821
income     High income
Name: 0, dtype: object
<class 'pandas.core.frame.DataFrame'>


|  | country | score | income |
| --- | --- | --- | --- |
| 0 | Finland | 7.821 | High income |


<class 'pandas.core.series.Series'>
country        Finland
score            7.821
income     High income
Name: 0, dtype: object
<class 'pandas.core.frame.DataFrame'>


|  | country | score | income |
| --- | --- | --- | --- |
| 0 | Finland | 7.821 | High income |


In [32]:
# Passing two indexers separated by a comma inside [] selects rows and columns.

# All rows, last column
display(happiness_df.iloc[:, -1])
print(happiness_df.iloc[:, -1])

display(happiness_df.loc[:, 'income'])
print(happiness_df.loc[:, 'income'])

# All rows, last column (as a DataFrame)
display(happiness_df[['income']])

# All rows, several features
display(happiness_df[['score', 'income']])

0              High income
1              High income
2              High income
3              High income
4              High income
              ...         
141    Upper middle income
142             Low income
143    Lower middle income
144    Lower middle income
145             Low income
Name: income, Length: 146, dtype: object

0              High income
1              High income
2              High income
3              High income
4              High income
              ...         
141    Upper middle income
142             Low income
143    Lower middle income
144    Lower middle income
145             Low income
Name: income, Length: 146, dtype: object


0              High income
1              High income
2              High income
3              High income
4              High income
              ...         
141    Upper middle income
142             Low income
143    Lower middle income
144    Lower middle income
145             Low income
Name: income, Length: 146, dtype: object

0              High income
1              High income
2              High income
3              High income
4              High income
              ...         
141    Upper middle income
142             Low income
143    Lower middle income
144    Lower middle income
145             Low income
Name: income, Length: 146, dtype: object


|  | income |
| --- | --- |
| 0 | High income |
| 1 | High income |
| 2 | High income |
| 3 | High income |
| 4 | High income |
| ... | ... |
| 141 | Upper middle income |
| 142 | Low income |
| 143 | Lower middle income |
| 144 | Lower middle income |


|  | score | income |
| --- | --- | --- |
| 0 | 7.821 | High income |
| 1 | 7.636 | High income |
| 2 | 7.557 | High income |
| 3 | 7.512 | High income |
| 4 | 7.415 | High income |
| ... | ... | ... |
| 141 | 3.471 | Upper middle income |
| 142 | 3.268 | Low income |
| 143 | 2.995 | Lower middle income |
| 144 | 2.955 | Lower middle income |


In [54]:
import pandas as pd

happiness_df = pd.read_csv('./datasets/happiness_report_2022.csv')
# display(happiness_df)

# Select the rows whose happiness score is below 3
happiness_df_lt_3 = happiness_df[happiness_df['score'] < 3]
display(happiness_df_lt_3)

# Reset the index afterwards,
# discarding the old index
happiness_df_lt_3 = happiness_df_lt_3.reset_index(drop=True)
display(happiness_df_lt_3)


|  | country | score | income |
| --- | --- | --- | --- |
| 143 | Zimbabwe | 2.995 | Lower middle income |
| 144 | Lebanon | 2.955 | Lower middle income |
| 145 | Afghanistan | 2.404 | Low income |


|  | country | score | income |
| --- | --- | --- | --- |
| 0 | Zimbabwe | 2.995 | Lower middle income |
| 1 | Lebanon | 2.955 | Lower middle income |
| 2 | Afghanistan | 2.404 | Low income |
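
Boolean masks can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The following is a sketch on four rows taken from the dataset above; the thresholds are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Zimbabwe", "Lebanon", "Afghanistan", "Finland"],
    "score": [2.995, 2.955, 2.404, 7.821],
})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (df["score"] < 3) & (df["score"] > 2.5)

# Keep only the rows where the mask is True, then renumber the index.
filtered = df[mask].reset_index(drop=True)
print(filtered)
```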


In [55]:
happiness_df.shape

(146, 3)

In [62]:
# columns: all features
print(happiness_df.columns)
print(happiness_df.index)
print(happiness_df.index.values)
print("=" * 60)
# Summary of the whole DataFrame (info() prints by itself and returns None)
happiness_df.info()
print("=" * 60)
# dtype of each feature
print(happiness_df.dtypes)

Index(['country', 'score', 'income'], dtype='object')
RangeIndex(start=0, stop=146, step=1)
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  146 non-null    object 
 1   score    146 non-null    float64
 2   income   140 non-null    object 
dtypes: float64(1), 

In [66]:
# Convert a feature's dtype; the values are converted too (float -> int32 drops the fractional part)
happiness_df = happiness_df.astype({'score': 'int32'})
print(happiness_df.dtypes)
happiness_df

country    object
score       int32
income     object
dtype: object


|  | country | score | income |
| --- | --- | --- | --- |
| 0 | Finland | 7 | High income |
| 1 | Denmark | 7 | High income |
| 2 | Iceland | 7 | High income |
| 3 | Switzerland | 7 | High income |
| 4 | Netherlands | 7 | High income |
| ... | ... | ... | ... |
| 141 | Botswana | 3 | Upper middle income |
| 142 | Rwanda | 3 | Low income |
| 143 | Zimbabwe | 2 | Lower middle income |
| 144 | Lebanon | 2 | Lower middle income |
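
Casting a float column to an integer dtype truncates toward zero rather than rounding. If rounding is what is wanted, one option is to call `round()` first, as in this sketch on three scores from the dataset:

```python
import pandas as pd

s = pd.Series([7.821, 2.995, 2.404])

truncated = s.astype("int32")         # fractional part is dropped
rounded = s.round().astype("int32")   # round to nearest first, then cast

print(truncated.tolist())
print(rounded.tolist())
```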


#### describe()
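- Reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum of numeric data.
- The 25th and 75th percentiles can be used as the basis for defining the range of normal (non-outlier) values.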

In [69]:
import pandas as pd

happiness_df = pd.read_csv('./datasets/happiness_report_2022.csv')

In [68]:
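# Note: this cell was executed (In [68]) before the reload in In [69],
# so the output below still reflects the scores cast to int32.
display(happiness_df.describe().T)

|  | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| score | 146.0 | 5.10274 | 1.143072 | 2.0 | 4.0 | 5.0 | 6.0 | 7.0 |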


In [71]:
happiness_df.describe().T.loc['score', '25%']

4.88875

In [81]:
import numpy as np
from scipy.stats import iqr

# Quartiles via NumPy
happiness_Q1 = np.percentile(happiness_df.score, 25)
happiness_Q3 = np.percentile(happiness_df.score, 75)
print(happiness_Q1, happiness_Q3)

# The same quartiles via describe()
happiness_Q1 = happiness_df.describe().T.loc['score', '25%']
happiness_Q3 = happiness_df.describe().T.loc['score', '75%']
print(happiness_Q1, happiness_Q3)

# IQR = Q3 - Q1
iqr_value = happiness_Q3 - happiness_Q1
print(iqr_value)

# IQR via SciPy
iqr_value = iqr(happiness_df.score)
print(iqr_value)

# Normal range: from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR
lower_bound = happiness_Q1 - 1.5 * iqr_value
upper_bound = happiness_Q3 + 1.5 * iqr_value
print(f'Normal range: {lower_bound} ~ {upper_bound}')

4.88875 6.305
4.88875 6.305
1.4162499999999998
1.4162499999999998
Normal range: 2.7643750000000002 ~ 8.429375
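
Once the bounds are known, the values outside them can be selected with a boolean mask. This is a sketch on made-up scores, using `Series.quantile()` for the quartiles; the data and variable names are illustrative and not taken from the happiness dataset.

```python
import pandas as pd

# Hypothetical scores with one very low and one very high value.
scores = pd.Series([1.0, 4.0, 5.0, 5.5, 6.0, 6.5, 12.0])

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr_value = q3 - q1

lower = q1 - 1.5 * iqr_value
upper = q3 + 1.5 * iqr_value

# Values outside [lower, upper] are treated as outliers.
outliers = scores[(scores < lower) | (scores > upper)]
print(lower, upper)
print(outliers)
```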


In [88]:
# Select the 'score' column explicitly: the first positional argument of
# mean()/min()/max() on a groupby is numeric_only, not a column name.

# Mean
hp_mean_df = happiness_df.groupby('income')['score'].mean().reset_index()
display(hp_mean_df)

# Minimum
hp_min_df = happiness_df.groupby('income')['score'].min().reset_index()
display(hp_min_df)

# Maximum
hp_max_df = happiness_df.groupby('income')['score'].max().reset_index()
display(hp_max_df)

# Standard deviation
hp_std_df = happiness_df.groupby('income')['score'].std().reset_index()
display(hp_std_df)

|  | income | score |
| --- | --- | --- |
| 0 | High income | 6.684239 |
| 1 | Low income | 4.270889 |
| 2 | Lower middle income | 4.865526 |
| 3 | Upper middle income | 5.523263 |


|  | income | score |
| --- | --- | --- |
| 0 | High income | 5.425 |
| 1 | Low income | 2.404 |
| 2 | Lower middle income | 2.955 |
| 3 | Upper middle income | 3.471 |


|  | income | score |
| --- | --- | --- |
| 0 | High income | 7.821 |
| 1 | Low income | 5.164 |
| 2 | Lower middle income | 6.165 |
| 3 | Upper middle income | 6.582 |


|  | income | score |
| --- | --- | --- |
| 0 | High income | 0.552702 |
| 1 | Low income | 0.723758 |
| 2 | Lower middle income | 0.813456 |
| 3 | Upper middle income | 0.634569 |
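
The four statistics can also be computed in a single pass with `agg()`. This is a sketch on a tiny made-up frame; the group names and scores are illustrative only.

```python
import pandas as pd

# Hypothetical data with two income groups.
df = pd.DataFrame({
    "income": ["High", "High", "Low", "Low"],
    "score": [7.0, 6.0, 3.0, 5.0],
})

# One column per statistic, one row per group.
summary = (
    df.groupby("income")["score"]
      .agg(["mean", "min", "max", "std"])
      .reset_index()
)
print(summary)
```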
