## 0. Pandas
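- Pandas is a Python library that provides data structures and tools for data analysis. Its main features are:
  - Data structures that can align and sort data along rows and columns
  - Unified data structures that handle both time-series and non-time-series data
  - Flexible handling of missing values
  - Data manipulation and arithmetic, such as summing all values in a given row or column
- Pandas provides two main data structures: Series and DataFrame
  - Series is suited to one-dimensional data
  - DataFrame is suited to two-dimensional data organized in rows and columns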

In [1]:
import pandas as pd
import numpy as np

## 1. Series
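- When a pandas Series is created without an explicit index, it is indexed with integers starting from 0 by default
- As in the examples below, the index can also be specified by the user
  - After creation, use the `values` attribute to inspect the data and the `index` attribute to inspect the index
  - Basic usage: pd.Series(data, index=index)
  - Values in a Series can be accessed through their index labels
  - Unlike a Python list, a Series lets the user assign custom index values and access data by those values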

In [2]:
s1 = pd.Series([1,2,3])
s1

0    1
1    2
2    3
dtype: int64

In [3]:
s2 = pd.Series(["a","b","c"])
s2

0    a
1    b
2    c
dtype: object

In [4]:
s3 = pd.Series(["a",1,"b",2])
s3

0    a
1    1
2    b
3    2
dtype: object

In [7]:
s4 = pd.Series(["a","b","c"], index=[1,2,3])
s4

1    a
2    b
3    c
dtype: object

In [8]:
s5 = pd.Series([1,2,3],["a","b","c"])
s5

a    1
b    2
c    3
dtype: int64

In [9]:
s5.values

array([1, 2, 3])

In [10]:
s5.index

Index(['a', 'b', 'c'], dtype='object')

In [11]:
s6 = pd.Series([1,2,3,4])
s6

0    1
1    2
2    3
3    4
dtype: int64

In [12]:
s6.index = ("a", "b", "c", "d")
s6

a    1
b    2
c    3
d    4
dtype: int64
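
The cells above create Series with custom indexes but do not show access by index label. A minimal sketch of that access follows; the `scores` Series and its labels are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical Series for illustration (not part of the original notebook)
scores = pd.Series([90, 85, 70], index=["kim", "lee", "park"])

by_label = scores["lee"]          # access by index label
by_label_loc = scores.loc["lee"]  # explicit label-based access
by_position = scores.iloc[1]      # explicit position-based access
subset = scores[["kim", "park"]]  # a list of labels returns a Series

print(by_label, by_label_loc, by_position)
print(subset)
```

Using `.loc` and `.iloc` makes it explicit whether a key is a label or a position, which matters most when the index itself holds integers.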

### 1.1 Basic Series functions

In [13]:
s7 = pd.Series([1,1,1,4,5,6,7,8,9,9,np.nan]) # np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
s7

0     1.0
1     1.0
2     1.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9     9.0
10    NaN
dtype: float64

In [15]:
s7.size # number of elements, NaN included

11

In [16]:
len(s7)

11

In [18]:
s7.count() # number of elements, excluding NaN

10

In [19]:
s7.shape # shape is an attribute, not a method; it returns a tuple

(11,)

In [20]:
a1 = np.array(s7)
a1

array([ 1.,  1.,  1.,  4.,  5.,  6.,  7.,  8.,  9.,  9., nan])

In [22]:
print(a1.mean())
print(s7.mean()) # Series.mean() skips NaN by default, unlike ndarray.mean()

nan
5.1


In [23]:
s7.unique() # returns the unique values as an array

array([ 1.,  4.,  5.,  6.,  7.,  8.,  9., nan])

In [24]:
s7.value_counts() # frequency of each value, excluding NaN

1.0    3
9.0    2
4.0    1
5.0    1
6.0    1
7.0    1
8.0    1
dtype: int64
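
The functions above skip NaN by default. As a rough sketch, the same data can be used to show how to include NaN in the counts and how to make the mean propagate NaN; the variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 1, 4, 5, 6, 7, 8, 9, 9, np.nan])

# Include NaN as its own category in the frequency table
counts_with_nan = s.value_counts(dropna=False)

mean_skip = s.mean()                # NaN skipped (default skipna=True)
mean_noskip = s.mean(skipna=False)  # NaN propagates, like ndarray.mean()
nan_mean = np.nanmean(s.to_numpy()) # NumPy equivalent that ignores NaN

print(counts_with_nan)
print(mean_skip, mean_noskip, nan_mean)
```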

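### 1.2 Series operations
 - An operation between a Series and a scalar is applied to each element
 - Arithmetic between two Series is also possible. Note that values are matched by index label, not by position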


In [27]:
s1 = pd.Series([1,2,3,4,5],["a","b","c","d","e"])
s2 = pd.Series([2,2,2,2,2],["a","b","c","d","e"])

print(s1)
print(s2)

a    1
b    2
c    3
d    4
e    5
dtype: int64
a    2
b    2
c    2
d    2
e    2
dtype: int64


In [29]:
s1*3

a     3
b     6
c     9
d    12
e    15
dtype: int64

In [30]:
s1 + s2

a    3
b    4
c    5
d    6
e    7
dtype: int64

In [31]:
s1["f"] = 100
s2["f"] = 200

print(s1)
print(s2)

a      1
b      2
c      3
d      4
e      5
f    100
dtype: int64
a      2
b      2
c      2
d      2
e      2
f    200
dtype: int64
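
The Series above share the same index, so the label matching is not visible. The following sketch, with hypothetical `left` and `right` Series, shows what happens when the indexes only partly overlap: labels present in just one operand produce NaN, unless a fill value is supplied through `Series.add`.

```python
import pandas as pd

# Hypothetical Series whose indexes only partly overlap
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label; "a" and "d" exist on one side only, so they become NaN
aligned = left + right

# Treat a missing operand as 0 instead of producing NaN
filled = left.add(right, fill_value=0)

print(aligned)
print(filled)
```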


### 1.3 Series update 

In [44]:
s1 = pd.Series(np.arange(2,13,2),["a","b","c","d","e","f"])
s1

a     2
b     4
c     6
d     8
e    10
f    12
dtype: int64

In [47]:
s1["a"] = 200
s1

a    200
b      4
c      6
d      8
e     10
f     12
dtype: int64

In [48]:
s1.drop("a")

b     4
c     6
d     8
e    10
f    12
dtype: int64

In [49]:
s1

a    200
b      4
c      6
d      8
e     10
f     12
dtype: int64

In [50]:
s1.drop("a", inplace = True) # inplace=True modifies the original Series
s1

b     4
c     6
d     8
e    10
f    12
dtype: int64
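
As the cells above show, `drop` returns a new Series unless `inplace=True` is passed. A common alternative, sketched here with illustrative variable names, is to keep the returned object and leave the original untouched.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(2, 13, 2), index=list("abcdef"))

# drop accepts a list of labels and returns a new Series
trimmed = s.drop(["a", "b"])

print(len(s))   # the original is unchanged
print(trimmed)
```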

### 1.4 Series selection
 - slicing
  - Positional slicing works the same way as for lists and arrays

In [51]:
s1 = pd.Series(np.arange(2,11,2),["a","b","c","d","e"])
s1

a     2
b     4
c     6
d     8
e    10
dtype: int64

In [55]:
s1[1:3]

b    4
c    6
dtype: int64
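
One difference from lists is worth a short sketch: a positional slice excludes its end point, while a slice by index label includes it. The example below uses `.iloc` and `.loc` to make the two cases explicit.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(2, 11, 2), index=list("abcde"))

by_position = s.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = s.loc["b":"d"]   # labels b, c and d; the end label is included

print(by_position)
print(by_label)
```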

In [56]:
s1 = pd.Series(np.arange(2,21,2),np.arange(10))
s1

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int64

In [57]:
s1 > 10

0    False
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
9     True
dtype: bool

In [58]:
s1[s1>10]

5    12
6    14
7    16
8    18
9    20
dtype: int64

In [59]:
(s1 > 10).sum()

5

In [60]:
s1[s1>10].sum()

80

In [62]:
s1 = pd.Series([1,2,3,4,5,np.nan])
s1

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
dtype: float64

In [63]:
pd.isnull(s1)

0    False
1    False
2    False
3    False
4    False
5     True
dtype: bool

In [64]:
s1[pd.isnull(s1)]

5   NaN
dtype: float64

In [66]:
pd.notnull(s1)

0     True
1     True
2     True
3     True
4     True
5    False
dtype: bool

In [67]:
s1[pd.notnull(s1)]

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
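
Filtering with `pd.notnull` is one way to handle missing values. As a sketch of two related methods, `dropna` removes the missing entries and `fillna` replaces them; filling with the mean is only an illustrative choice here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, np.nan])

dropped = s.dropna()          # same result as s[pd.notnull(s)]
filled = s.fillna(s.mean())   # replace NaN with the mean of the non-missing values

print(dropped)
print(filled)
```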

## 2. DataFrame
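- While a pandas Series is a one-dimensional data structure, a DataFrame is a two-dimensional data structure made up of multiple columns
- A DataFrame can be built from a NumPy array, or from a dict whose values are objects that can be converted to columns, such as Series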


In [80]:
ex = pd.DataFrame({'A': 1.,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=list(range(5)), dtype='float32'),
                   'D': np.array(np.arange(3,8,1), dtype='int32'),
                   'E': pd.Categorical(['test', 'train', 'test', 'train','test']),
                   'F': 'foo'})
ex

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,4,train,foo
2,1.0,2013-01-02,1.0,5,test,foo
3,1.0,2013-01-02,1.0,6,train,foo
4,1.0,2013-01-02,1.0,7,test,foo


- Each column of a DataFrame has its own data type
- The types can be checked through the `dtypes` attribute of the DataFrame
- Python floats are stored as float64 by default, and plain strings appear as the object dtype rather than str

In [69]:
ex.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
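
Once the dtypes are known, columns can be selected or converted by type. The following is a small sketch on a hypothetical frame (`df_types` is not part of the notebook) using `select_dtypes` and `astype`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with mixed column types
df_types = pd.DataFrame({
    "A": [1.0, 2.0],
    "D": np.array([3, 4], dtype="int32"),
    "F": ["foo", "bar"],
})

numeric = df_types.select_dtypes(include="number")  # keep numeric columns only
converted = df_types.astype({"D": "float64"})       # returns a converted copy

print(numeric.columns.tolist())
print(converted.dtypes)
```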

In [81]:
ex2 = pd.DataFrame(np.random.randn(5,2), columns = ['A','B'])
ex2

Unnamed: 0,A,B
0,0.406282,0.368642
1,1.140642,-0.07741
2,-0.812914,-0.273615
3,-0.448528,0.854089
4,-0.237988,0.196074


In [82]:
ex2.head() # shows the first 5 rows by default

Unnamed: 0,A,B
0,0.406282,0.368642
1,1.140642,-0.07741
2,-0.812914,-0.273615
3,-0.448528,0.854089
4,-0.237988,0.196074


In [83]:
ex2.tail(3)

Unnamed: 0,A,B
2,-0.812914,-0.273615
3,-0.448528,0.854089
4,-0.237988,0.196074


In [84]:
ex2.index #check index

RangeIndex(start=0, stop=5, step=1)

In [85]:
ex2.columns #check columns

Index(['A', 'B'], dtype='object')

In [87]:
ex2.values #check values

array([[ 0.40628183,  0.36864168],
       [ 1.14064213, -0.07741036],
       [-0.81291384, -0.27361517],
       [-0.44852821,  0.85408865],
       [-0.23798752,  0.1960742 ]])

In [88]:
ex2.describe() #simple statistic information of DataFrame

Unnamed: 0,A,B
count,5.0,5.0
mean,0.009499,0.213556
std,0.772063,0.434924
min,-0.812914,-0.273615
25%,-0.448528,-0.07741
50%,-0.237988,0.196074
75%,0.406282,0.368642
max,1.140642,0.854089


In [89]:
ex2.sort_index(axis =1 , ascending = False)

Unnamed: 0,B,A
0,0.368642,0.406282
1,-0.07741,1.140642
2,-0.273615,-0.812914
3,0.854089,-0.448528
4,0.196074,-0.237988


In [91]:
ex2.sort_index(axis = 0 , ascending = False)

Unnamed: 0,A,B
4,-0.237988,0.196074
3,-0.448528,0.854089
2,-0.812914,-0.273615
1,1.140642,-0.07741
0,0.406282,0.368642


In [92]:
ex2.sort_index()

Unnamed: 0,A,B
0,0.406282,0.368642
1,1.140642,-0.07741
2,-0.812914,-0.273615
3,-0.448528,0.854089
4,-0.237988,0.196074


In [94]:
ex2.sort_values(by='B')

Unnamed: 0,A,B
2,-0.812914,-0.273615
1,1.140642,-0.07741
4,-0.237988,0.196074
0,0.406282,0.368642
3,-0.448528,0.854089
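
`sort_values` also accepts a sort direction and more than one column. A minimal sketch on a hypothetical frame with fixed values, so the resulting order is predictable:

```python
import pandas as pd

# Hypothetical frame with fixed values
df_sort = pd.DataFrame({"group": ["x", "y", "x", "y"], "score": [3, 1, 2, 4]})

# Descending order on one column
desc = df_sort.sort_values(by="score", ascending=False)

# Sort by group ascending, then by score descending within each group
multi = df_sort.sort_values(by=["group", "score"], ascending=[True, False])

print(desc)
print(multi)
```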


### Selection using pandas 

In [95]:
ex2

Unnamed: 0,A,B
0,0.406282,0.368642
1,1.140642,-0.07741
2,-0.812914,-0.273615
3,-0.448528,0.854089
4,-0.237988,0.196074


In [96]:
ex2['A']

0    0.406282
1    1.140642
2   -0.812914
3   -0.448528
4   -0.237988
Name: A, dtype: float64

In [97]:
ex2.A

0    0.406282
1    1.140642
2   -0.812914
3   -0.448528
4   -0.237988
Name: A, dtype: float64

In [98]:
ex2[['A']]

Unnamed: 0,A
0,0.406282
1,1.140642
2,-0.812914
3,-0.448528
4,-0.237988


In [99]:
type(ex2['A'])

pandas.core.series.Series

In [100]:
type(ex2[['A']])

pandas.core.frame.DataFrame

In [101]:
ex2[0:3]

Unnamed: 0,A,B
0,0.406282,0.368642
1,1.140642,-0.07741
2,-0.812914,-0.273615
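
The boolean selection shown earlier for Series applies to DataFrames as well. This sketch filters rows of a hypothetical frame by a condition on one column and by a combined condition; each condition needs its own parentheses when joined with `&`.

```python
import pandas as pd

# Hypothetical frame with fixed values
df_sel = pd.DataFrame({"A": [0.4, 1.1, -0.8, -0.4], "B": [0.3, -0.1, -0.2, 0.8]})

positive_a = df_sel[df_sel["A"] > 0]                        # rows where A is positive
both_positive = df_sel[(df_sel["A"] > 0) & (df_sel["B"] > 0)]  # rows where A and B are positive

print(positive_a)
print(both_positive)
```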


## 3. Merge DataFrame

In [103]:
df1 = pd.DataFrame({'key' : list('ABCDE'),
                    'value' : np.random.randn(5)})
df1

Unnamed: 0,key,value
0,A,-0.604176
1,B,-0.882808
2,C,-0.253994
3,D,0.461608
4,E,-0.50777


In [104]:
df2 = pd.DataFrame({'key' : list('ABCXZ'),
                    'value' : np.random.randn(5)})
df2

Unnamed: 0,key,value
0,A,0.07122
1,B,0.957223
2,C,0.622761
3,X,1.802048
4,Z,-0.531795


In [105]:
pd.concat([df1,df2]) # axis = 0 (Default), concat by rows 

Unnamed: 0,key,value
0,A,-0.604176
1,B,-0.882808
2,C,-0.253994
3,D,0.461608
4,E,-0.50777
0,A,0.07122
1,B,0.957223
2,C,0.622761
3,X,1.802048
4,Z,-0.531795


In [106]:
pd.concat([df1, df2], axis = 0, ignore_index = True)

Unnamed: 0,key,value
0,A,-0.604176
1,B,-0.882808
2,C,-0.253994
3,D,0.461608
4,E,-0.50777
5,A,0.07122
6,B,0.957223
7,C,0.622761
8,X,1.802048
9,Z,-0.531795


In [108]:
pd.concat([df1, df2]).reset_index()

Unnamed: 0,index,key,value
0,0,A,-0.604176
1,1,B,-0.882808
2,2,C,-0.253994
3,3,D,0.461608
4,4,E,-0.50777
5,0,A,0.07122
6,1,B,0.957223
7,2,C,0.622761
8,3,X,1.802048
9,4,Z,-0.531795


In [109]:
pd.concat([df1,df2], axis = 1)

Unnamed: 0,key,value,key,value
0,A,-0.604176,A,0.07122
1,B,-0.882808,B,0.957223
2,C,-0.253994,C,0.622761
3,D,0.461608,X,1.802048
4,E,-0.50777,Z,-0.531795


In [110]:
df2.columns = ['key','values2']
df2

Unnamed: 0,key,values2
0,A,0.07122
1,B,0.957223
2,C,0.622761
3,X,1.802048
4,Z,-0.531795


In [111]:
pd.concat([df1,df2])

Unnamed: 0,key,value,values2
0,A,-0.604176,
1,B,-0.882808,
2,C,-0.253994,
3,D,0.461608,
4,E,-0.50777,
0,A,,0.07122
1,B,,0.957223
2,C,,0.622761
3,X,,1.802048
4,Z,,-0.531795
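
When the column names differ, `pd.concat` keeps the union of the columns and fills the gaps with NaN, as shown above. A sketch of the alternative, `join="inner"`, which keeps only the columns shared by all inputs (the frames `a` and `b` are hypothetical):

```python
import pandas as pd

# Hypothetical frames that share only the "key" column
a = pd.DataFrame({"key": ["A", "B"], "value": [1.0, 2.0]})
b = pd.DataFrame({"key": ["C", "D"], "values2": [3.0, 4.0]})

outer = pd.concat([a, b], ignore_index=True)                # union of columns (default)
inner = pd.concat([a, b], join="inner", ignore_index=True)  # shared columns only

print(outer)
print(inner)
```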


### pd.merge()

In [112]:
df1

Unnamed: 0,key,value
0,A,-0.604176
1,B,-0.882808
2,C,-0.253994
3,D,0.461608
4,E,-0.50777


In [113]:
df2

Unnamed: 0,key,values2
0,A,0.07122
1,B,0.957223
2,C,0.622761
3,X,1.802048
4,Z,-0.531795


In [114]:
pd.merge(df1, df2, on = 'key', how = 'inner')

Unnamed: 0,key,value,values2
0,A,-0.604176,0.07122
1,B,-0.882808,0.957223
2,C,-0.253994,0.622761


In [115]:
pd.merge(df1, df2, on = 'key', how = 'left')

Unnamed: 0,key,value,values2
0,A,-0.604176,0.07122
1,B,-0.882808,0.957223
2,C,-0.253994,0.622761
3,D,0.461608,
4,E,-0.50777,


In [116]:
pd.merge(df1, df2, on = 'key', how = 'right')

Unnamed: 0,key,value,values2
0,A,-0.604176,0.07122
1,B,-0.882808,0.957223
2,C,-0.253994,0.622761
3,X,,1.802048
4,Z,,-0.531795


In [117]:
pd.merge(df1, df2, on = 'key', how = 'outer')

Unnamed: 0,key,value,values2
0,A,-0.604176,0.07122
1,B,-0.882808,0.957223
2,C,-0.253994,0.622761
3,D,0.461608,
4,E,-0.50777,
5,X,,1.802048
6,Z,,-0.531795
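
To see which side each row of an outer merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses hypothetical frames with fixed values in place of the random ones above.

```python
import pandas as pd

# Hypothetical frames with the same keys as df1 and df2 but fixed values
left = pd.DataFrame({"key": list("ABCDE"), "value": [1, 2, 3, 4, 5]})
right = pd.DataFrame({"key": list("ABCXZ"), "values2": [10, 20, 30, 40, 50]})

# _merge is "both", "left_only" or "right_only" for each row
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
left_only = outer[outer["_merge"] == "left_only"]

print(outer)
print(left_only)
```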


## 4. Practice using a dataset: the iris dataset

In [3]:
from sklearn.datasets import load_iris

In [11]:
iris = load_iris() # load the iris dataset
print(iris) # the loaded data is a Bunch object, a dictionary that also supports attribute-style access
print(iris.DESCR) # the DESCR attribute holds a description of the dataset

# Inspect the value stored under each key
# features
print(iris.data)
print(iris.feature_names)

# labels
print(iris.target)
print(iris.target_names)

# Create a DataFrame from the feature data, using feature_names as the column names
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [14]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [15]:
df.tail()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3
149,5.9,3.0,5.1,1.8


In [16]:
df.index # check the index

RangeIndex(start=0, stop=150, step=1)

In [17]:
df.columns # check the columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

In [18]:
df.dtypes # check the data types

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
dtype: object

In [22]:
df[['sepal length (cm)', 'sepal width (cm)']] = df[['sepal length (cm)', 'sepal width (cm)']].astype(object)
df.dtypes # the data types have changed

sepal length (cm)     object
sepal width (cm)      object
petal length (cm)    float64
petal width (cm)     float64
dtype: object

In [23]:
df[['sepal length (cm)', 'sepal width (cm)']] = df[['sepal length (cm)', 'sepal width (cm)']].astype(float)
df.dtypes

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
dtype: object

In [24]:
df.info() # check the data types and the non-null count of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


### 4.2 Data preprocessing

In [27]:
df = df.rename(columns={'sepal length (cm)': 'sepal length', 'sepal width (cm)': 'sepal width',
                        'petal length (cm)' : 'petal length', 'petal width (cm)': 'petal width',
                        'variety' : 'species'}) # rename columns; keys not present in df, such as 'variety', are ignored
df.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


Selecting DataFrame columns
 - Use dataframe[ ] to extract columns
 - [] -> returns a Series
 - [[]] -> returns a DataFrame

In [28]:
df.columns

Index(['sepal length', 'sepal width', 'petal length', 'petal width'], dtype='object')

In [29]:
df['sepal length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal length, Length: 150, dtype: float64

In [30]:
df[['sepal length']]

Unnamed: 0,sepal length
0,5.1
1,4.9
2,4.7
3,4.6
4,5.0
...,...
145,6.7
146,6.3
147,6.5
148,6.2


In [31]:
df[['sepal length', 'sepal width']]

Unnamed: 0,sepal length,sepal width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


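Selecting DataFrame rows
- On a DataFrame, the [] operator selects columns, but slicing with [] selects rows
- Rows can also be selected with the .loc[] and .iloc[] indexers, which take square brackets rather than parentheses
 - .loc[] : uses the index labels themselves
 - .iloc[] : uses 0-based integer positions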

In [32]:
df.head(10)

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [33]:
df[0:5]

Unnamed: 0,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [34]:
df.index = df.index + 100
df.head()

Unnamed: 0,sepal length,sepal width,petal length,petal width
100,5.1,3.5,1.4,0.2
101,4.9,3.0,1.4,0.2
102,4.7,3.2,1.3,0.2
103,4.6,3.1,1.5,0.2
104,5.0,3.6,1.4,0.2


In [35]:
df.loc[[100]]

Unnamed: 0,sepal length,sepal width,petal length,petal width
100,5.1,3.5,1.4,0.2


In [37]:
df.iloc[[30]]

Unnamed: 0,sepal length,sepal width,petal length,petal width
130,4.8,3.1,1.6,0.2


In [38]:
df.iloc[[0]]

Unnamed: 0,sepal length,sepal width,petal length,petal width
100,5.1,3.5,1.4,0.2


In [44]:
df.loc[[100,101,102],["sepal length", "sepal width"]]

Unnamed: 0,sepal length,sepal width
100,5.1,3.5
101,4.9,3.0
102,4.7,3.2


In [45]:
df.iloc[[0,1,2],[0,3]]

Unnamed: 0,sepal length,petal width
100,5.1,0.2
101,4.9,0.2
102,4.7,0.2
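
The difference between `.loc` and `.iloc` also shows up in slices and in boolean selection. The following sketch uses a small hypothetical frame whose index starts at 100, like the shifted index above: a `.loc` slice includes its end label, an `.iloc` slice excludes its end position, and `.loc` accepts a boolean mask together with a column list.

```python
import pandas as pd

# Hypothetical frame with an index starting at 100
df_rows = pd.DataFrame(
    {"sepal length": [5.1, 4.9, 4.7, 4.6], "petal width": [0.2, 0.2, 0.2, 0.2]},
    index=[100, 101, 102, 103],
)

by_label = df_rows.loc[100:102]    # labels 100, 101, 102 (end label included)
by_position = df_rows.iloc[0:2]    # positions 0 and 1 (end position excluded)

# Boolean mask for the rows, list of labels for the columns
masked = df_rows.loc[df_rows["sepal length"] > 4.8, ["sepal length"]]

print(by_label)
print(by_position)
print(masked)
```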
