## 5.1. What is Pandas?

- The name pandas is derived from <b>"panel data"</b> and is also a play on <b>"Python data analysis"</b>.
> pandas specializes in processing structured (tabular) data.

- pandas is commonly used together with many machine-learning and scientific computing libraries.
> scikit-learn, scipy, statsmodels, tensorflow, pytorch, ...


- Put simply, it lets you **use Excel-like functionality from Python**.
> pandas = python + excel // pandas & excel // pandas VS MS Excel

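- However, pandas is built on top of NumPy arrays and integrates with the rest of the Python ecosystem, so it can go well beyond what Excel offers.
> pandas is better suited than Excel to high-performance data processing.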

![numpy_data_type](../images/pandas/dataframe.png)

- The basic unit for handling data in pandas is the DataFrame. It is the same concept as the familiar spreadsheet.


- Data in this form is called structured data, panel data, or tabular data.


- Learning pandas ultimately comes down to learning how to use and apply the DataFrame.


- Once you are comfortable with pandas, you can handle most structured data freely.

![pandas_files](../images/pandas/pandas_files.png)

## 5.2. Basic pandas data structures (Series, DataFrame)

In [4]:
# Import the pandas library, using pd as the alias.
import pandas as pd
import numpy as np
print(pd.__version__)

1.2.4


- A DataFrame is a two-dimensional table, and a single line of that table (a row or a column) is a Series.


- In other words, a collection of Series makes up a DataFrame.

In [5]:
# s is a pandas.Series with the elements 1, 3, 5, np.nan, 6, 8
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

- pandas also provides a function called date_range that makes it easy to generate a range of dates.

In [6]:
# pandas.date_range creates a range of 6 consecutive days starting from 20210101
dates = pd.date_range('20210101', periods=6)
dates

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:
# Create a 6x4 DataFrame of random numbers drawn from the standard normal distribution, with dates as the index and A, B, C, D as the columns
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = ['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2021-01-01,0.171061,-1.751561,-1.006392,1.3909
2021-01-02,0.092309,0.673067,-0.715019,-1.024037
2021-01-03,-0.603161,0.537562,-0.627837,-0.973696
2021-01-04,-0.015068,1.245419,-1.876582,0.872252
2021-01-05,1.783304,1.015156,-0.615886,2.042194
2021-01-06,0.483636,0.273775,0.205632,-1.273566
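The relationship between Series and DataFrame described above can be checked on a small frame. This is a minimal sketch with made-up integer values (the name `demo` and its contents are assumptions, not the random data shown above):

```python
import numpy as np
import pandas as pd

# Small deterministic frame (hypothetical values, not the random ones above)
dates = pd.date_range("20210101", periods=3)
demo = pd.DataFrame(
    np.arange(12).reshape(3, 4), index=dates, columns=["A", "B", "C", "D"]
)

col = demo["A"]      # one column: a Series indexed by the dates
row = demo.iloc[0]   # one row: a Series indexed by the column names

# A DataFrame can also be assembled from Series that share an index
rebuilt = pd.DataFrame({"A": demo["A"], "B": demo["B"]})

print(type(col).__name__, type(row).__name__)
print(rebuilt.equals(demo[["A", "B"]]))
```

Both a column and a row come back as a Series; only the index they carry differs.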


## 5.3. Basic DataFrame methods

In [18]:
# head() shows the first five rows of the DataFrame
df.head()

Unnamed: 0,A,B,C,D
2021-01-01,-0.734055,-0.562238,0.979784,0.672417
2021-01-02,-1.250791,1.064162,-0.804566,-0.076608
2021-01-03,-0.91373,-0.626797,1.062731,1.839246
2021-01-04,0.459121,-0.220419,0.483429,-1.104315
2021-01-05,-0.33513,-0.378503,-0.156978,0.405505


In [20]:
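# Three rows at a time; a notebook cell displays only its last expression, so only the tail(3) result is shown
df.head(3)  # from the top
df.tail(3)  # from the bottom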

Unnamed: 0,A,B,C,D
2021-01-04,0.459121,-0.220419,0.483429,-1.104315
2021-01-05,-0.33513,-0.378503,-0.156978,0.405505
2021-01-06,-2.36082,0.295936,0.429938,0.280345


In [21]:
# dataframe index
df.index

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [22]:
# dataframe columns
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [23]:
# dataframe values
df.values

array([[-0.73405538, -0.5622384 ,  0.979784  ,  0.67241708],
       [-1.25079113,  1.06416188, -0.80456643, -0.07660761],
       [-0.91372963, -0.62679696,  1.06273091,  1.83924594],
       [ 0.45912111, -0.2204192 ,  0.48342866, -1.10431492],
       [-0.33513043, -0.37850301, -0.15697787,  0.4055048 ],
       [-2.36081959,  0.29593611,  0.42993769,  0.28034468]])

In [24]:
# Shows an overall summary of the DataFrame: index, columns, non-null counts, dtypes, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2021-01-01 to 2021-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes


In [25]:
# Shows summary statistics for the DataFrame.
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.855901,-0.07131,0.332389,0.336098
std,0.942273,0.646582,0.709506,0.961832
min,-2.36082,-0.626797,-0.804566,-1.104315
25%,-1.166526,-0.516305,-0.010249,0.01263
50%,-0.823893,-0.299461,0.456683,0.342925
75%,-0.434862,0.166847,0.855695,0.605689
max,0.459121,1.064162,1.062731,1.839246


In [33]:
# Sort by column B in descending order
df.sort_values('B', ascending=False).head(3)  # top 3 rows by the value of column B

Unnamed: 0,A,B,C,D
2021-01-02,-1.250791,1.064162,-0.804566,-0.076608
2021-01-06,-2.36082,0.295936,0.429938,0.280345
2021-01-04,0.459121,-0.220419,0.483429,-1.104315
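As a side note to the sort-then-head pattern above, pandas also offers `DataFrame.nlargest`, which expresses "top N by a column" in one call. The sketch below uses a small made-up frame (`demo` is an assumption) without ties, where the two approaches give the same rows:

```python
import pandas as pd

# Hypothetical data with distinct values in column B
demo = pd.DataFrame({"A": [1, 2, 3, 4], "B": [0.5, -1.0, 2.0, 1.5]})

# Sort the whole frame, then keep the first three rows
top_sorted = demo.sort_values("B", ascending=False).head(3)

# Ask directly for the three rows with the largest B
top_nlargest = demo.nlargest(3, "B")

print(top_sorted.equals(top_nlargest))
```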


## 5.4. DataFrame Indexing

> Indexing: selecting the elements of the data that match a given label, position, or condition.

> It is useful for quickly finding and manipulating the data in a DataFrame that satisfies a condition.

In [8]:
# A pandas DataFrame supports basic indexing by column name.
# Index column A
df["A"]

2021-01-01    0.171061
2021-01-02    0.092309
2021-01-03   -0.603161
2021-01-04   -0.015068
2021-01-05    1.783304
2021-01-06    0.483636
Freq: D, Name: A, dtype: float64

In [10]:
# Indexing by a specific date
df.loc['2021-01-01']  # returns a pd.Series

A    0.171061
B   -1.751561
C   -1.006392
D    1.390900
Name: 2021-01-01 00:00:00, dtype: float64

In [11]:
# Indexing by position
df.iloc[2]

A   -0.603161
B    0.537562
C   -0.627837
D   -0.973696
Name: 2021-01-03 00:00:00, dtype: float64

In [16]:
# Slicing a DataFrame cuts it row by row.
# Slice the first 3 rows.
df[:3]

Unnamed: 0,A,B,C,D
2021-01-01,0.171061,-1.751561,-1.006392,1.3909
2021-01-02,0.092309,0.673067,-0.715019,-1.024037
2021-01-03,-0.603161,0.537562,-0.627837,-0.973696


In [17]:
# A DataFrame can also be sliced by index values (still row by row).
# Slice from 2021-01-02 through 2021-01-04. Unlike positional slicing, label-based slicing includes the end label.
df['2021-01-02':'2021-01-04']

Unnamed: 0,A,B,C,D
2021-01-02,0.092309,0.673067,-0.715019,-1.024037
2021-01-03,-0.603161,0.537562,-0.627837,-0.973696
2021-01-04,-0.015068,1.245419,-1.876582,0.872252
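The two slicing cells above differ in how they treat the end of the range: a slice by index label keeps the end label, while a slice by position stops before the end position. A minimal sketch on a made-up frame (`demo` is an assumption):

```python
import pandas as pd

dates = pd.date_range("20210101", periods=6)
demo = pd.DataFrame({"A": range(6)}, index=dates)  # hypothetical values 0..5

# Label-based slice: both endpoints are included (3 rows)
by_label = demo.loc["2021-01-02":"2021-01-04"]

# Position-based slice: the stop position is excluded, so 1:4 is needed for the same 3 rows
by_pos = demo.iloc[1:4]

# The same numbers as the labels' positions (1 and 3) would give only 2 rows
by_pos_short = demo.iloc[1:3]

print(len(by_label), len(by_pos), len(by_pos_short))
```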


In [18]:
df.loc['2021-01-02']

A    0.092309
B    0.673067
C   -0.715019
D   -1.024037
Name: 2021-01-02 00:00:00, dtype: float64

In [20]:

# df.loc indexes by label (key - value).
# Fetch the row whose index value is 2021-01-01.
df.loc[dates[0]]

A    0.171061
B   -1.751561
C   -1.006392
D    1.390900
Name: 2021-01-01 00:00:00, dtype: float64

In [23]:
# df.loc also supports two-dimensional indexing. [:, ["A", "B"]] means: for all rows, take only columns A and B.
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2021-01-01,0.171061,-1.751561
2021-01-02,0.092309,0.673067
2021-01-03,-0.603161,0.537562
2021-01-04,-0.015068,1.245419
2021-01-05,1.783304,1.015156
2021-01-06,0.483636,0.273775


In [25]:
# This time, slice specific rows and take columns A and B
df.loc['2021-01-03' : '2021-01-05', ['A', 'B']]

Unnamed: 0,A,B
2021-01-03,-0.603161,0.537562
2021-01-04,-0.015068,1.245419
2021-01-05,1.783304,1.015156


In [26]:
# Index a specific row by its index value
df.loc['2021-01-02', ['A', 'B']]  # Series

A    0.092309
B    0.673067
Name: 2021-01-02 00:00:00, dtype: float64

In [27]:
# This now works on the same principle as indexing a two-dimensional list.
df.loc['2021-01-01', 'C']  # value of a specific column in a specific row

-1.0063916508722561

In [28]:
# df.iloc indexes by integer position (row-based). 3 means the fourth row.
df.iloc[3]

A   -0.015068
B    1.245419
C   -1.876582
D    0.872252
Name: 2021-01-04 00:00:00, dtype: float64

In [31]:
# Two-dimensional indexing with iloc takes the rows at positions 3 and 4 and the columns at positions 0 and 1.
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2021-01-04,-0.015068,1.245419
2021-01-05,1.783304,1.015156


In [32]:
# Indexing by listing the positions explicitly instead of slicing
df.iloc[[1, 2, 4], [0, 3]]

Unnamed: 0,A,D
2021-01-02,0.092309,-1.024037
2021-01-03,-0.603161,-0.973696
2021-01-05,1.783304,2.042194


In [35]:
# Q. In two-dimensional indexing, what does a bare : mean? (Only the last expression's result is displayed.)
df.iloc[1:3, :]
df.iloc[:, 1:3]

Unnamed: 0,B,C
2021-01-01,-1.751561,-1.006392
2021-01-02,0.673067,-0.715019
2021-01-03,0.537562,-0.627837
2021-01-04,1.245419,-1.876582
2021-01-05,1.015156,-0.615886
2021-01-06,0.273775,0.205632


In [None]:
# It works like two-dimensional slicing of a numpy array.
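The comparison with NumPy can be checked directly. The sketch below uses a made-up integer array (`arr` and `demo` are assumptions) and also shows one place where the two differ: two position lists passed to `iloc` select a rectangular block, whereas plain NumPy indexing with the same two lists pairs them element by element, so `np.ix_` is used to get the block.

```python
import numpy as np
import pandas as pd

arr = np.arange(24).reshape(6, 4)               # hypothetical 6x4 data
demo = pd.DataFrame(arr, columns=["A", "B", "C", "D"])

# Slices behave the same way in iloc and in NumPy
rows_match = np.array_equal(demo.iloc[1:3, :].to_numpy(), arr[1:3, :])
cols_match = np.array_equal(demo.iloc[:, 1:3].to_numpy(), arr[:, 1:3])

# Two lists in iloc pick a 3x2 block; NumPy needs np.ix_ for the same block
block_df = demo.iloc[[1, 2, 4], [0, 3]]
block_np = arr[np.ix_([1, 2, 4], [0, 3])]
lists_match = np.array_equal(block_df.to_numpy(), block_np)

print(rows_match, cols_match, lists_match)
```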

In [39]:
# An element-wise comparison returns a boolean DataFrame of the same shape
df > 0

Unnamed: 0,A,B,C,D
2021-01-01,True,False,False,True
2021-01-02,True,True,False,False
2021-01-03,False,True,False,False
2021-01-04,False,True,False,True
2021-01-05,True,True,False,True
2021-01-06,True,True,True,False


In [41]:
# pandas supports boolean indexing. (NumPy supports it, and pandas builds on NumPy.)
# Boolean indexing selects data with a condition: a mask of True/False values picks out the matching entries.
# Build the mask that marks the elements of column A that are greater than 0.
df['A'] > 0

2021-01-01     True
2021-01-02     True
2021-01-03    False
2021-01-04    False
2021-01-05     True
2021-01-06     True
Freq: D, Name: A, dtype: bool

In [42]:
# boolean indexing: keep only the rows where column A is greater than 0
df[df['A'] > 0]

Unnamed: 0,A,B,C,D
2021-01-01,0.171061,-1.751561,-1.006392,1.3909
2021-01-02,0.092309,0.673067,-0.715019,-1.024037
2021-01-05,1.783304,1.015156,-0.615886,2.042194
2021-01-06,0.483636,0.273775,0.205632,-1.273566


In [43]:
# Replace every negative value with 0 (this modifies df in place)
df[df < 0] = 0
df

Unnamed: 0,A,B,C,D
2021-01-01,0.171061,0.0,0.0,1.3909
2021-01-02,0.092309,0.673067,0.0,0.0
2021-01-03,0.0,0.537562,0.0,0.0
2021-01-04,0.0,1.245419,0.0,0.872252
2021-01-05,1.783304,1.015156,0.0,2.042194
2021-01-06,0.483636,0.273775,0.205632,0.0


In [44]:
# With a DataFrame-shaped mask, entries where the condition is False become NaN
df[df > 0]

Unnamed: 0,A,B,C,D
2021-01-01,0.171061,,,1.3909
2021-01-02,0.092309,0.673067,,
2021-01-03,,0.537562,,
2021-01-04,,1.245419,,0.872252
2021-01-05,1.783304,1.015156,,2.042194
2021-01-06,0.483636,0.273775,0.205632,
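The two masking cells above have method equivalents that may read more explicitly: `DataFrame.where` keeps values where a condition holds, and `DataFrame.mask` replaces values where it holds. A minimal sketch on made-up numbers (`demo` is an assumption); unlike `df[df < 0] = 0`, neither method modifies the original frame:

```python
import pandas as pd

demo = pd.DataFrame({"A": [0.2, -0.6], "B": [-1.7, 0.5]})  # hypothetical values

# Same result as demo[demo > 0]: entries failing the condition become NaN
kept = demo[demo > 0]
same = demo.where(demo > 0)

# Replace negatives with 0 in a new frame, leaving demo untouched
clipped = demo.mask(demo < 0, 0)

print(kept.equals(same))
print(clipped)
```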


In [45]:
# Copy the DataFrame. The result is an independent copy.
df2 = df.copy()
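Why `copy()` matters can be seen by comparing it with plain assignment, which only creates a second name for the same object. A short sketch with hypothetical names (`original`, `alias`, `clone`):

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})

alias = original         # another name for the same DataFrame object
clone = original.copy()  # independent copy of the data (deep copy by default)

original.loc[0, "A"] = 100

print(alias.loc[0, "A"])  # follows the change, because it is the same object
print(clone.loc[0, "A"])  # keeps the value it had when the copy was made
```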

In [46]:
# A DataFrame supports dictionary-like assignment.
# Add a column E to df2 whose values are ('one', 'one', 'two', 'three', 'four', 'three').
df2['E'] = ('one', 'one', 'two', 'three', 'four', 'three')
df2

Unnamed: 0,A,B,C,D,E
2021-01-01,0.171061,0.0,0.0,1.3909,one
2021-01-02,0.092309,0.673067,0.0,0.0,one
2021-01-03,0.0,0.537562,0.0,0.0,two
2021-01-04,0.0,1.245419,0.0,0.872252,three
2021-01-05,1.783304,1.015156,0.0,2.042194,four
2021-01-06,0.483636,0.273775,0.205632,0.0,three


In [47]:
# Series.isin returns a boolean Series that is True for the rows whose value is in the given list.
df2['E'].isin(['two','four'])

2021-01-01    False
2021-01-02    False
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Freq: D, Name: E, dtype: bool

In [50]:
# The mask built from df2 can also filter df, because both share the same index
df[df2['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D
2021-01-03,0.0,0.537562,0.0,0.0
2021-01-05,1.783304,1.015156,0.0,2.042194


## 5.5. Reading and writing external data

In [20]:
# Load iris.csv from the data folder.
import pandas as pd
import numpy as np
data = pd.read_csv("data/iris.csv")
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [22]:
set(data['Species'])  # these three classes will be mapped to 0, 1, 2

{'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}

In [28]:
# Convert the Species column to numbers.
data.loc[data["Species"] == "Iris-setosa","Species"] = 0
data.loc[data["Species"] == "Iris-versicolor","Species"] = 1
data.loc[data["Species"] == "Iris-virginica","Species"] = 2
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,0
1,2,4.9,3.0,1.4,0.2,0
2,3,4.7,3.2,1.3,0.2,0
3,4,4.6,3.1,1.5,0.2,0
4,5,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,2
146,147,6.3,2.5,5.0,1.9,2
147,148,6.5,3.0,5.2,2.0,2
148,149,6.2,3.4,5.4,2.3,2


In [30]:
set(data["Species"])

{0, 1, 2}
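The three `loc` assignments above can also be written as a single lookup with `Series.map`. This is a sketch of an alternative, not the method used in the cells above; the short `species` Series stands in for the real column and the `codes` dictionary is an assumption mirroring the 0/1/2 assignment:

```python
import pandas as pd

# Stand-in for data["Species"] (hypothetical four-row sample)
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="Species",
)

# One dictionary holds the whole label-to-number mapping
codes = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
encoded = species.map(codes)

print(encoded.tolist())
print(encoded.dtype)
```

A value missing from the dictionary would become NaN, so it is worth checking the set of labels first, as the `set(data['Species'])` cell does.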

In [31]:
# Save the modified DataFrame as Iris_edited.csv.
data.to_csv("data/Iris_edited.csv")
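By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the effect and two ways around it, using an in-memory buffer and a tiny made-up frame (`demo`) instead of a file on disk:

```python
import io

import pandas as pd

demo = pd.DataFrame({"Id": [1, 2], "Species": [0, 1]})  # hypothetical data

# With no path given, to_csv returns the CSV text as a string
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

# Reading back the default output adds an "Unnamed: 0" column
reloaded = pd.read_csv(io.StringIO(with_index))

# Option 1: do not write the index; option 2: read the first column as the index
clean = pd.read_csv(io.StringIO(without_index))
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reloaded.columns))
print(list(clean.columns))
```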

In [33]:
# Load another file as well.
data2 = pd.read_csv("data/kaggle_survey_2020_responses.csv")
data2

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q2,Q3,Q4,Q5,Q6,Q7_Part_1,Q7_Part_2,Q7_Part_3,...,Q35_B_Part_2,Q35_B_Part_3,Q35_B_Part_4,Q35_B_Part_5,Q35_B_Part_6,Q35_B_Part_7,Q35_B_Part_8,Q35_B_Part_9,Q35_B_Part_10,Q35_B_OTHER
0,Duration (in seconds),What is your age (# years)?,What is your gender? - Selected Choice,In which country do you currently reside?,What is the highest level of formal education ...,Select the title most similar to your current ...,For how many years have you been writing code ...,What programming languages do you use on a reg...,What programming languages do you use on a reg...,What programming languages do you use on a reg...,...,"In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor...","In the next 2 years, do you hope to become mor..."
1,1838,35-39,Man,Colombia,Doctoral degree,Student,5-10 years,Python,R,SQL,...,,,,TensorBoard,,,,,,
2,289287,30-34,Man,United States of America,Master’s degree,Data Engineer,5-10 years,Python,R,SQL,...,,,,,,,,,,
3,860,35-39,Man,Argentina,Bachelor’s degree,Software Engineer,10-20 years,,,,...,,,,,,,,,,
4,507,30-34,Man,United States of America,Master’s degree,Data Scientist,5-10 years,Python,,SQL,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20032,126,18-21,Man,Turkey,Some college/university study without earning ...,,,,,,...,,,,,,,,,,
20033,566,55-59,Woman,United Kingdom of Great Britain and Northern I...,Master’s degree,Currently not employed,20+ years,Python,,,...,,,,,,,,,,
20034,238,30-34,Man,Brazil,Master’s degree,Research Scientist,< 1 years,Python,,,...,,,,,,,,,,
20035,625,22-24,Man,India,Bachelor’s degree,Software Engineer,3-5 years,Python,,SQL,...,Weights & Biases,,,TensorBoard,,,Trains,,,


In [39]:
# Select only the respondents with a doctoral degree.
phd = data2[data2['Q4'] == "Doctoral degree"]
phd

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q2,Q3,Q4,Q5,Q6,Q7_Part_1,Q7_Part_2,Q7_Part_3,...,Q35_B_Part_2,Q35_B_Part_3,Q35_B_Part_4,Q35_B_Part_5,Q35_B_Part_6,Q35_B_Part_7,Q35_B_Part_8,Q35_B_Part_9,Q35_B_Part_10,Q35_B_OTHER
1,1838,35-39,Man,Colombia,Doctoral degree,Student,5-10 years,Python,R,SQL,...,,,,TensorBoard,,,,,,
9,762,35-39,Man,Germany,Doctoral degree,Data Scientist,5-10 years,Python,,SQL,...,,,,,,,,,,
12,742,35-39,Man,United States of America,Doctoral degree,Research Scientist,1-2 years,,R,,...,,,,,,,,,,
21,3313,22-24,Woman,India,Doctoral degree,Statistician,3-5 years,,R,SQL,...,Weights & Biases,,,,,,,,,
33,459,30-34,Man,Other,Doctoral degree,Machine Learning Engineer,10-20 years,Python,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20003,917,30-34,Man,Colombia,Doctoral degree,Software Engineer,5-10 years,Python,,SQL,...,Weights & Biases,Comet.ml,Sacred + Omniboard,TensorBoard,Guild.ai,Polyaxon,Trains,Domino Model Monitor,,
20005,406,30-34,Man,Italy,Doctoral degree,Data Scientist,10-20 years,Python,,SQL,...,,,,,,,,,,
20007,487,45-49,Man,United States of America,Doctoral degree,Software Engineer,20+ years,Python,,,...,,,,,,,,,,
20011,375,40-44,Man,United Kingdom of Great Britain and Northern I...,Doctoral degree,Research Scientist,5-10 years,Python,R,,...,,,,,,,,,,


In [40]:
# Save only the doctoral-degree respondents to kaggle_survey_2020_phd.csv.
phd.to_csv("data/kaggle_survey_2020_phd.csv")

In [54]:
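# (OPTIONAL) Select the respondents who hold a doctoral degree and live in South Korea (Q3 is the country of residence).
# The two conditions are evaluated separately here, and only the last expression (the country mask) is displayed.
data2[data2['Q4'] == "Doctoral degree"]
data2['Q3'] == "South Korea"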

0        False
1        False
2        False
3        False
4        False
         ...  
20032    False
20033    False
20034    False
20035    False
20036    False
Name: Q3, Length: 20037, dtype: bool
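To select rows that satisfy both conditions at once, the two boolean masks can be combined with `&` (each comparison wrapped in parentheses or stored in its own variable). The sketch below uses a tiny made-up frame (`survey`) with the same column names rather than the real survey file:

```python
import pandas as pd

# Hypothetical four-row stand-in for the survey data
survey = pd.DataFrame(
    {
        "Q3": ["South Korea", "India", "South Korea", "Brazil"],
        "Q4": [
            "Doctoral degree",
            "Doctoral degree",
            "Master's degree",
            "Bachelor's degree",
        ],
    }
)

is_phd = survey["Q4"] == "Doctoral degree"
in_korea = survey["Q3"] == "South Korea"

# Element-wise AND keeps only the rows where both masks are True
phd_in_korea = survey[is_phd & in_korea]

print(phd_in_korea)
```

The Python keywords `and` / `or` do not work element-wise on a Series; `&`, `|`, and `~` are the operators to use with boolean masks.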