### Pandas
- One of the most widely used data-processing libraries for Python.
- Efficiently processes two-dimensional data (tables, Excel sheets, CSV files, and so on).

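#### Pandas components
- DataFrame: a two-dimensional dataset made up of rows and columns.
- Series: a one-dimensional dataset consisting of a single column.
- Index: the row labels of a DataFrame or Series. Labels are usually unique, but pandas does not require them to be.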
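
As a minimal sketch of how the three components relate, the example below builds a small DataFrame and pulls one column out of it. The sample values and the names `df` and `score` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {'score': [7.8, 7.6, 7.5], 'rank': [1, 2, 3]},
    index=['a', 'b', 'c'],
)

# Selecting one column of a DataFrame gives a Series.
score = df['score']
print(type(score))

# The Series and the DataFrame share the same Index (row labels).
print(df.index.tolist())
print(score.index.equals(df.index))
```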

In [1]:
import pandas as pd
pd.__version__

'2.1.4'

#### DataFrame()
- To convert a dict into a DataFrame, pass it to the DataFrame constructor.
- The result can then be adjusted in many ways, such as adding columns or changing the index labels.
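
The column order and index labels can also be set when the DataFrame is created. The sketch below assumes a shortened version of the film dict and uses the constructor's `columns` and `index` parameters; the name `film_df` is chosen only for this example.

```python
import pandas as pd

# Shortened sample dict, used only for illustration.
film = {
    'title': ['명량', '극한 직업'],
    'audience': [17_615_919, 16_266_480],
}

# columns= selects and orders the dict keys; index= sets the row labels.
film_df = pd.DataFrame(film, columns=['audience', 'title'], index=['one', 'two'])
print(film_df)
```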

In [2]:
import pandas as pd

film = {
    'title': ['명량', '극한 직업', '범죄 도시3', '국제 시장'],
    'audience': [17_615_919, 16_266_480, 10_682_674, 14_265_222],
    'country': ['한국', '한국', '한국', '한국']
}

file_df = pd.DataFrame(film)
display(file_df)

# Add a new column
file_df['income'] = [135_758_658_810, 139_657_105_516, 104_686_489_632, 110_951_970_230]
display(file_df)

# Replace the index labels
file_df.index = ['one', 'two', 'three', 'four']
display(file_df)

# Reset the index (the old labels become an 'index' column)
file_df = file_df.reset_index()
display(file_df)

# Drop a column
file_df = file_df.drop(labels=['index'], axis=1)
display(file_df)

# Rename a column
file_df = file_df.rename(columns={'title': 'name'})
display(file_df)

# Drop a row by its index label
file_df = file_df.drop(index=[2])
display(file_df)

# Reset the index again, discarding the old labels
file_df.reset_index(drop=True, inplace=True)
display(file_df)

Unnamed: 0,title,audience,country
0,명량,17615919,한국
1,극한 직업,16266480,한국
2,범죄 도시3,10682674,한국
3,국제 시장,14265222,한국


Unnamed: 0,title,audience,country,income
0,명량,17615919,한국,135758658810
1,극한 직업,16266480,한국,139657105516
2,범죄 도시3,10682674,한국,104686489632
3,국제 시장,14265222,한국,110951970230


Unnamed: 0,title,audience,country,income
one,명량,17615919,한국,135758658810
two,극한 직업,16266480,한국,139657105516
three,범죄 도시3,10682674,한국,104686489632
four,국제 시장,14265222,한국,110951970230


Unnamed: 0,index,title,audience,country,income
0,one,명량,17615919,한국,135758658810
1,two,극한 직업,16266480,한국,139657105516
2,three,범죄 도시3,10682674,한국,104686489632
3,four,국제 시장,14265222,한국,110951970230


Unnamed: 0,title,audience,country,income
0,명량,17615919,한국,135758658810
1,극한 직업,16266480,한국,139657105516
2,범죄 도시3,10682674,한국,104686489632
3,국제 시장,14265222,한국,110951970230


Unnamed: 0,name,audience,country,income
0,명량,17615919,한국,135758658810
1,극한 직업,16266480,한국,139657105516
2,범죄 도시3,10682674,한국,104686489632
3,국제 시장,14265222,한국,110951970230


Unnamed: 0,name,audience,country,income
0,명량,17615919,한국,135758658810
1,극한 직업,16266480,한국,139657105516
3,국제 시장,14265222,한국,110951970230


Unnamed: 0,name,audience,country,income
0,명량,17615919,한국,135758658810
1,극한 직업,16266480,한국,139657105516
2,국제 시장,14265222,한국,110951970230


#### read_csv()
- Reads a CSV file into a DataFrame.

In [3]:
import pandas as pd

happyness_df =  pd.read_csv('./datasets/happiness_report_2022.csv')
display(happyness_df)


Unnamed: 0,country,score,income
0,Finland,7.821,High income
1,Denmark,7.636,High income
2,Iceland,7.557,High income
3,Switzerland,7.512,High income
4,Netherlands,7.415,High income
...,...,...,...
141,Botswana,3.471,Upper middle income
142,Rwanda,3.268,Low income
143,Zimbabwe,2.995,Lower middle income
144,Lebanon,2.955,Lower middle income
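
read_csv() also accepts a file-like object, which makes it easy to try without the dataset file. The sketch below assumes two rows copied from the output above and reads them from an in-memory string; `csv_text` and `sample_df` are names made up for this example.

```python
import io

import pandas as pd

# Two rows in the same layout as the happiness report, held in memory.
csv_text = """country,score,income
Finland,7.821,High income
Denmark,7.636,High income
"""

# The first line is used as the header, and column dtypes are inferred.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
print(sample_df.dtypes)
```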


#### head()
- Returns the first rows of the data (5 by default, or the number passed in).

In [4]:
display(happyness_df.head(10))

Unnamed: 0,country,score,income
0,Finland,7.821,High income
1,Denmark,7.636,High income
2,Iceland,7.557,High income
3,Switzerland,7.512,High income
4,Netherlands,7.415,High income
5,Luxembourg,7.404,High income
6,Sweden,7.384,High income
7,Norway,7.365,High income
8,Israel,7.364,High income
9,New Zealand,7.2,High income


#### tail()
- Returns the last rows of the data (5 by default, or the number passed in).

In [5]:
display(happyness_df.tail(10))

Unnamed: 0,country,score,income
136,Zambia,3.76,Low income
137,Malawi,3.75,Low income
138,Tanzania,3.702,Lower middle income
139,Sierra Leone,3.574,Low income
140,Lesotho,3.512,Lower middle income
141,Botswana,3.471,Upper middle income
142,Rwanda,3.268,Low income
143,Zimbabwe,2.995,Lower middle income
144,Lebanon,2.955,Lower middle income
145,Afghanistan,2.404,Low income
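
A small sketch of both methods on made-up data: calling head() with no argument relies on the default of 5 rows, while tail(3) asks for exactly three. The frame `numbers_df` is hypothetical.

```python
import pandas as pd

# Hypothetical frame with 20 rows, used only for illustration.
numbers_df = pd.DataFrame({'n': range(20)})

first_five = numbers_df.head()    # no argument: the first 5 rows
last_three = numbers_df.tail(3)   # the last 3 rows

print(first_five)
print(last_three)
```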


#### iloc[], loc[]
- Select the rows or columns you want.
- iloc selects by integer position, while loc selects by index label or column name.
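
The difference is easiest to see when the index labels do not start at 0. The sketch below assumes a three-row frame whose labels start at 1; it also shows that an iloc slice excludes its end position while a loc slice includes its end label.

```python
import pandas as pd

# Small sample frame whose index labels start at 1 instead of 0.
df = pd.DataFrame(
    {'country': ['Finland', 'Denmark', 'Iceland'],
     'score': [7.821, 7.636, 7.557]},
    index=[1, 2, 3],
)

by_position = df.iloc[0]   # first row, whatever its label is
by_label = df.loc[1]       # row whose index label is 1

# Slicing: iloc excludes the end position, loc includes the end label.
position_slice = df.iloc[1:2]   # position 1 only
label_slice = df.loc[1:2]       # labels 1 and 2

print(by_position['country'], by_label['country'])
print(len(position_slice), len(label_slice))
```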

In [6]:
happyness_df.index += 1

In [7]:
# Passing a single integer to [] selects a row.

# iloc: select by integer position.
print(happyness_df.iloc[0])
print('='*20)
display(happyness_df.iloc[[0]])

print('='*20)
# loc: select by index label.
print(happyness_df.loc[1])
print('='*20)
display(happyness_df.loc[[1]])

country        Finland
score            7.821
income     High income
Name: 1, dtype: object


Unnamed: 0,country,score,income
1,Finland,7.821,High income


country        Finland
score            7.821
income     High income
Name: 1, dtype: object


Unnamed: 0,country,score,income
1,Finland,7.821,High income


In [8]:
# Passing two selectors separated by a comma selects rows and columns.

# All rows, last column (returns a Series)
print(happyness_df.iloc[:, -1])

print('='*20)

print(happyness_df.loc[:, 'income'])

# All rows, selected columns (a list of column names returns a DataFrame)
display(happyness_df[['income']])
display(happyness_df[['score','income']])

1              High income
2              High income
3              High income
4              High income
5              High income
              ...         
142    Upper middle income
143             Low income
144    Lower middle income
145    Lower middle income
146             Low income
Name: income, Length: 146, dtype: object
1              High income
2              High income
3              High income
4              High income
5              High income
              ...         
142    Upper middle income
143             Low income
144    Lower middle income
145    Lower middle income
146             Low income
Name: income, Length: 146, dtype: object


Unnamed: 0,income
1,High income
2,High income
3,High income
4,High income
5,High income
...,...
142,Upper middle income
143,Low income
144,Lower middle income
145,Lower middle income


Unnamed: 0,score,income
1,7.821,High income
2,7.636,High income
3,7.557,High income
4,7.512,High income
5,7.415,High income
...,...,...
142,3.471,Upper middle income
143,3.268,Low income
144,2.995,Lower middle income
145,2.955,Lower middle income


In [9]:
# Select the rows whose happiness score is below 3,
# then reset the index
# and discard the old index.
happyness_df

Unnamed: 0,country,score,income
1,Finland,7.821,High income
2,Denmark,7.636,High income
3,Iceland,7.557,High income
4,Switzerland,7.512,High income
5,Netherlands,7.415,High income
...,...,...,...
142,Botswana,3.471,Upper middle income
143,Rwanda,3.268,Low income
144,Zimbabwe,2.995,Lower middle income
145,Lebanon,2.955,Lower middle income


In [10]:
# Boolean mask: True for the rows whose score is below 3.
is_low_score = happyness_df.score < 3
happyness_df = happyness_df[is_low_score]

# Reset the index and discard the old labels.
happyness_df = happyness_df.reset_index(drop=True)
display(happyness_df)

Unnamed: 0,country,score,income
0,Zimbabwe,2.995,Lower middle income
1,Lebanon,2.955,Lower middle income
2,Afghanistan,2.404,Low income
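
The same filter-then-reset pattern, sketched on a three-row sample so each intermediate value is visible. The names `mask` and `low_df` are made up for this example.

```python
import pandas as pd

# Three rows taken from the happiness data, used only for illustration.
df = pd.DataFrame({
    'country': ['Finland', 'Lebanon', 'Afghanistan'],
    'score': [7.821, 2.955, 2.404],
})

# Comparing a column with a value gives a boolean Series (the mask).
mask = df['score'] < 3
print(mask.tolist())

# Indexing with the mask keeps only the True rows; the original labels
# (1 and 2 here) survive until reset_index(drop=True) renumbers them.
low_df = df[mask].reset_index(drop=True)
print(low_df)
```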


In [11]:
import pandas as pd

happyness_df =  pd.read_csv('./datasets/happiness_report_2022.csv')

In [12]:
print(happyness_df.columns)
print(happyness_df.index)
print(happyness_df.index.values)
happyness_df.info()
happyness_df.dtypes

Index(['country', 'score', 'income'], dtype='object')
RangeIndex(start=0, stop=146, step=1)
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  146 non-null    object 
 1   score    146 non-null    float64
 2   income   140 non-null    object 
dtypes: float64(1), object(2)

country     object
score      float64
income      object
dtype: object

In [13]:
happyness_df = happyness_df.astype({'score': 'float'})
print(happyness_df.dtypes)
happyness_df

country     object
score      float64
income      object
dtype: object


Unnamed: 0,country,score,income
0,Finland,7.821,High income
1,Denmark,7.636,High income
2,Iceland,7.557,High income
3,Switzerland,7.512,High income
4,Netherlands,7.415,High income
...,...,...,...
141,Botswana,3.471,Upper middle income
142,Rwanda,3.268,Low income
143,Zimbabwe,2.995,Lower middle income
144,Lebanon,2.955,Lower middle income


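#### describe()
- Reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%; the 50% value is the median), and maximum of the numeric columns.
- The 25th and 75th percentiles (Q1 and Q3) can be used to define the range of normal values: with IQR = Q3 - Q1, values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers.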

In [14]:
happyness_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
score,146.0,5.553575,1.086843,2.404,4.88875,5.5685,6.305,7.821


In [None]:
import pandas as pd

happyness_df =  pd.read_csv('./datasets/happiness_report_2022.csv')

happyness_df.describe().T

In [None]:
import numpy as np
from scipy.stats import iqr

# First and third quartiles of the happiness score.
happyness_df_Q1 = np.percentile(happyness_df.score, 25)
happyness_df_Q3 = np.percentile(happyness_df.score, 75)
print(happyness_df_Q1, happyness_df_Q3)

# Interquartile range (Q3 - Q1).
iqr_values = iqr(happyness_df.score)

# Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers.
lower_bound = happyness_df_Q1 - 1.5 * iqr_values
upper_bound = happyness_df_Q3 + 1.5 * iqr_values
print(f'Normal range: {lower_bound} ~ {upper_bound}')
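
Once the bounds are known, they can be used to pick out the outliers. The sketch below uses a small made-up Series with one obviously extreme value and pandas' own `quantile()` instead of NumPy and SciPy; it illustrates the same 1.5 * IQR rule and is not part of the original analysis.

```python
import pandas as pd

# Hypothetical values with one extreme entry, used only for illustration.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr_value = q3 - q1

lower = q1 - 1.5 * iqr_value
upper = q3 + 1.5 * iqr_value

# Keep only the values that fall outside the normal range.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers.tolist())
```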

In [None]:
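# Mean, maximum, minimum, and standard deviation of score per income group.
# Select the 'score' column explicitly: the first positional argument of
# mean(), max(), and min() is numeric_only, not a column name.
display(happyness_df.groupby('income')['score'].mean().reset_index())

display(happyness_df.groupby('income')['score'].max().reset_index())

display(happyness_df.groupby('income')['score'].min().reset_index())

display(happyness_df.groupby('income')['score'].std().reset_index())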
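
Several statistics can also be computed in one pass with `agg()`. The sketch below assumes a four-row frame with made-up scores; `summary` is a name chosen for this example.

```python
import pandas as pd

# Hypothetical scores for two income groups, used only for illustration.
df = pd.DataFrame({
    'income': ['High income', 'High income', 'Low income', 'Low income'],
    'score': [7.0, 6.0, 3.0, 4.0],
})

# agg() with a list of function names returns one column per statistic.
summary = (
    df.groupby('income')['score']
      .agg(['mean', 'max', 'min', 'std'])
      .reset_index()
)
print(summary)
```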

#### Handling missing data
- isna() checks whether each value is missing.
- fillna() replaces missing values with another value.

In [None]:
happyness_df.isna().sum()

In [None]:
happyness_df['income'].value_counts()

In [16]:
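# Use the same capitalization as the existing 'Upper middle income' category,
# so the filled rows do not form a separate group.
happyness_df['income'] = happyness_df['income'].fillna('Upper middle income')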
happyness_df['income'].isna().sum()

0
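
The two methods can be tried together on a small frame. The sketch below uses made-up rows with one missing income value and fills it with the placeholder 'Unknown', which is an assumption of this example rather than the choice made above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing income value.
df = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'income': ['High income', np.nan, 'Low income'],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = df.isna().sum()
print(missing_counts)

# fillna() returns a new Series with the missing cells replaced.
df['income'] = df['income'].fillna('Unknown')
print(df)
```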