<a href="https://colab.research.google.com/github/KimYongHwi/ml_definitive_guide_study/blob/main/3%EC%9E%A5_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

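### Pandas
- pandas is one of the most popular Python libraries for data processing
- Most datasets are two-dimensional, organized into rows and columns
- pandas provides a rich set of features for efficiently manipulating and processing this kind of row-and-column data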

#### Main components
- DataFrame: a two-dimensional dataset with rows and columns (it has an index)
- Series: a one-dimensional dataset made up of the values of a single column
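
To make the relationship between the two structures concrete, here is a minimal sketch. The data and the names `toy_df` and `ages` are made up for illustration and do not come from the notebook. It shows that selecting one column of a DataFrame yields a Series that shares the DataFrame's index.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
toy_df = pd.DataFrame({"name": ["a", "b", "c"], "age": [22, 38, 26]})

# Selecting a single column returns a one-dimensional Series
ages = toy_df["age"]

print(type(toy_df).__name__, toy_df.shape)
print(type(ages).__name__, ages.shape)
print(ages.index.equals(toy_df.index))
```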

In [1]:
from google.colab import auth
auth.authenticate_user()

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import os
from pathlib import Path

In [5]:
folder = "ml_definitive_guide/data/titanic"

base_path = Path("/content/drive/My Drive/")
project_path = base_path / folder
os.chdir(project_path)

# Rename each subdirectory to the part of its name before the first space
for x in list(project_path.glob("*")):
    if x.is_dir():
        dir_name = str(x.relative_to(project_path))
        os.rename(dir_name, dir_name.split(" ", 1)[0])

print(f"Current working directory: {os.getcwd()}")

Current working directory: /content/drive/My Drive/ml_definitive_guide/data/titanic


In [6]:
import pandas as pd

- Use read_csv to conveniently load a CSV file into a DataFrame

In [9]:
titanic_df = pd.read_csv('./train.csv')
print('titanic variable type:', type(titanic_df))

titanic variable type: <class 'pandas.core.frame.DataFrame'>
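
When the CSV file is not at hand, the same call can be tried on in-memory text. The sketch below assumes a made-up two-row CSV string (`csv_text` and `sample_df` are hypothetical names) and relies on `read_csv` accepting a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical two-row CSV text standing in for a file on disk
csv_text = "PassengerId,Survived,Age\n1,0,22.0\n2,1,\n"

# read_csv accepts a path or a file-like object; the empty Age field is read as NaN
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.dtypes)
print(sample_df["Age"].isna().sum())
```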


- Use head to extract only the first few rows of a DataFrame

In [12]:
titanic_df.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


- Creating a DataFrame

In [13]:
dic1 = {
    'Name': ['Chulmin', 'Enukyung', 'Jinwoong', 'Soobeom'],
    'Year': [2011, 2016, 2015, 2015],
    'Gender': ['Male', 'Female', 'Male', 'Male']
}

# Convert the dictionary to a DataFrame
data_df = pd.DataFrame(dic1)
print(data_df)
print('#' * 30)

# Add a new column name ('Age'); it has no data in dic1, so it is filled with NaN
data_df = pd.DataFrame(dic1, columns=['Name', 'Year', "Gender", 'Age'])
print(data_df)
print('#' * 30)

# Assign new values to the index
data_df = pd.DataFrame(dic1, index=['one', 'two', 'three', 'four'])
print(data_df)
print('#' * 30)

       Name  Year  Gender
0   Chulmin  2011    Male
1  Enukyung  2016  Female
2  Jinwoong  2015    Male
3   Soobeom  2015    Male
##############################
       Name  Year  Gender  Age
0   Chulmin  2011    Male  NaN
1  Enukyung  2016  Female  NaN
2  Jinwoong  2015    Male  NaN
3   Soobeom  2015    Male  NaN
##############################
           Name  Year  Gender
one     Chulmin  2011    Male
two    Enukyung  2016  Female
three  Jinwoong  2015    Male
four    Soobeom  2015    Male
##############################


- Column names and index of a DataFrame

In [14]:
print('columns:', titanic_df.columns)
print('index:', titanic_df.index)
print('index value:', titanic_df.index.values)

columns: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
index: RangeIndex(start=0, stop=891, step=1)
index value: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 1

- Extracting a Series from a DataFrame, and extracting a DataFrame restricted to selected columns

In [16]:
# Passing a single column name to the [] operator of a DataFrame returns a Series
series = titanic_df['Name']
print(series.head(3))
print('## type:', type(series))

# Passing a list of several column names to [] returns a DataFrame made up of those columns
filtered_df = titanic_df[['Name', 'Age']]
print(filtered_df.head(3))
print('## type:', type(filtered_df))

# Passing a list with one column name to [] returns a DataFrame with that single column
one_col_df = titanic_df[['Name']]
print(one_col_df.head(3))
print('## type:', type(one_col_df))



0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
Name: Name, dtype: object
## type: <class 'pandas.core.series.Series'>
                                                Name   Age
0                            Braund, Mr. Owen Harris  22.0
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0
2                             Heikkinen, Miss. Laina  26.0
## type: <class 'pandas.core.frame.DataFrame'>
                                                Name
0                            Braund, Mr. Owen Harris
1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2                             Heikkinen, Miss. Laina
## type: <class 'pandas.core.frame.DataFrame'>


- shape

In [17]:
print('DataFrame shape:', titanic_df.shape)

DataFrame shape: (891, 12)


- info: reports the column names, data types, non-null counts, and total number of rows of a DataFrame

In [18]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
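
Because info reports non-null counts rather than null counts, it can help to compute the number of missing values per column directly. The following is a minimal sketch on made-up data (`toy_df`, `null_counts`, and `non_null_counts` are hypothetical names), using `isna().sum()` for nulls and `notna().sum()` for the non-null figure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing values
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})

# Number of missing values per column
null_counts = toy_df.isna().sum()

# Number of non-missing values per column (equivalent to toy_df.count())
non_null_counts = toy_df.notna().sum()

print(null_counts)
print(non_null_counts)
```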


- describe: reports the count, mean, standard deviation, minimum, maximum, and quartiles of the data (by default only for numeric columns)

In [19]:
titanic_df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
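
The default summary skips non-numeric columns such as Sex. As a rough sketch on made-up data (`toy_df`, `numeric_summary`, and `full_summary` are hypothetical names), passing `include="all"` to describe adds rows such as `unique`, `top`, and `freq` for the non-numeric columns.

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 7.92],
})

# Default behaviour: only numeric columns are summarized
numeric_summary = toy_df.describe()

# include="all" summarizes every column; statistics that do not apply are NaN
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```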


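- value_counts
  - Reports how many times each distinct value occurs
  - Gives the distribution of the individual values
  - Here it is called on a Series, obtained by selecting a single column from the DataFrame (pandas 1.1 and later also provide DataFrame.value_counts, which counts unique rows)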
  

In [25]:
value_counts = titanic_df['Pclass'].value_counts()
print(type(value_counts))
print(value_counts.sort_index())

<class 'pandas.core.series.Series'>
1    216
2    184
3    491
Name: Pclass, dtype: int64
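
Two optional parameters of value_counts are often useful next to the plain call above. The sketch below uses a made-up Series (`embarked` and the result names are hypothetical) to show `dropna=False`, which keeps missing values in the tally, and `normalize=True`, which returns relative frequencies instead of counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with one missing value
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"])

# Default: missing values are excluded, most frequent value first
counts = embarked.value_counts()

# Keep missing values as their own category
counts_with_nan = embarked.value_counts(dropna=False)

# Relative frequencies among the non-missing values
shares = embarked.value_counts(normalize=True)

print(counts)
print(counts_with_nan)
print(shares)
```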


In [22]:
titanic_pclass = titanic_df['Pclass']
print(type(titanic_pclass))

<class 'pandas.core.series.Series'>


In [23]:
titanic_pclass.head()

0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

- sort_values(): sorts by the column or columns given in by; ascending=True (the default) sorts in ascending order and ascending=False in descending order

In [29]:
titanic_df.sort_values(by='Pclass', ascending=True)

titanic_df[['Name', 'Age']].sort_values(by='Age')
titanic_df[['Name', 'Age', 'Pclass']].sort_values(by=['Pclass', 'Age'])

| | Name | Age | Pclass |
|---|---|---|---|
| 305 | Allison, Master. Hudson Trevor | 0.92 | 1 |
| 297 | Allison, Miss. Helen Loraine | 2.00 | 1 |
| 445 | Dodge, Master. Washington | 4.00 | 1 |
| 802 | Carter, Master. William Thornton II | 11.00 | 1 |
| 435 | Carter, Miss. Lucile Polk | 14.00 | 1 |
| ... | ... | ... | ... |
| 859 | Razi, Mr. Raihed | NaN | 3 |
| 863 | Sage, Miss. Dorothy Edith "Dolly" | NaN | 3 |
| 868 | van Melkebeke, Mr. Philemon | NaN | 3 |
| 878 | Laleff, Mr. Kristo | NaN | 3 |
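
When sorting by several columns, ascending may also be given as a list with one flag per column. The following is a minimal sketch on made-up data (`toy_df` and `sorted_df` are hypothetical names): Pclass is sorted ascending, Age descending within each class, and rows with a missing Age are placed last within their class, which is the default `na_position`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; one Age value is missing
toy_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, np.nan, 4.0],
})

# Pclass ascending, then Age descending within each Pclass
sorted_df = toy_df.sort_values(by=["Pclass", "Age"], ascending=[True, False])

# sort_values returns a new DataFrame; toy_df keeps its original order
print(sorted_df)
print(toy_df)
```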
