#### DataFrame
1. In addition to a row index, it also has a column index.
2. It holds tabular data.
3. It is a collection of Series that all share a single index.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from numpy.random import randn

In [3]:
np.random.seed(101)

In [4]:
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])

In [5]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In the `pd.DataFrame` call:

The first argument is the data.<br>The second argument is the row index.<br>The third argument is the column index.
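The same construction can be written with keyword arguments, which makes the role of each positional argument explicit. This is a minimal sketch: the name `df_kw` is introduced here only for illustration, while `data`, `index`, and `columns` are the parameter names of the `pd.DataFrame` constructor.

```python
import numpy as np
import pandas as pd

np.random.seed(101)  # same seed as in the notebook

# df_kw is a hypothetical name used only in this sketch.
df_kw = pd.DataFrame(
    data=np.random.randn(5, 4),           # 5 rows x 4 columns of random values
    index=['A', 'B', 'C', 'D', 'E'],      # row labels
    columns=['W', 'X', 'Y', 'Z'],         # column labels
)

print(df_kw.index.tolist())
print(df_kw.columns.tolist())
```

Passing the arguments by name avoids mixing up the row labels and the column labels when both lists happen to have the same length.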

In [6]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [7]:
type(df['W'])

pandas.core.series.Series

In [8]:
type(df)

pandas.core.frame.DataFrame

In [9]:
df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

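Attribute access with `.x` also returns the corresponding column, where `x` is replaced by the column name. This only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so the bracket form `df['W']` is the safer choice.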
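A small sketch of where attribute access and bracket access differ. The frame `demo` and its column names are made up for this illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space,
# another has the same name as a DataFrame method.
demo = pd.DataFrame({'W': [1, 2], 'unit price': [3.0, 4.0], 'count': [5, 6]})

w_attr = demo.W               # same column as demo['W']
price = demo['unit price']    # a name with a space needs brackets
count_col = demo['count']     # brackets return the column
count_attr = demo.count       # attribute access finds the DataFrame.count method instead

print(type(w_attr).__name__)
print(callable(count_attr))
```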

In [10]:
df[['W','Z']]

          W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509


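To return several columns at once, put a list of column names inside the indexing brackets, which is why there are two pairs of brackets.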
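The difference between single and double brackets can be checked by looking at the returned types. This sketch uses a small made-up frame `demo` rather than the notebook's `df`.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
demo = pd.DataFrame({'W': [1.0, 2.0], 'Z': [3.0, 4.0]}, index=['A', 'B'])

one_col = demo['W']            # a single label returns a Series
one_col_frame = demo[['W']]    # a list with one label returns a one-column DataFrame
two_cols = demo[['W', 'Z']]    # a list of labels returns a DataFrame

print(type(one_col).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```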

In [11]:
df['new'] = df['W'] + df['Y']

In [12]:
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


To create a new column, put the new column name on the left-hand side of the assignment and the values on the right.
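As a sketch of the same idea on a made-up frame, the assignment form modifies the frame directly, while `DataFrame.assign` returns a new frame and leaves the original alone. The names `demo`, `with_diff`, and `diff` are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration.
demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [10.0, 20.0]}, index=['A', 'B'])

# Assignment adds the column to demo itself; the sum is element-wise.
demo['new'] = demo['W'] + demo['Y']

# assign() returns a copy with the extra column; demo is not modified.
with_diff = demo.assign(diff=demo['Y'] - demo['W'])

print(demo.columns.tolist())
print(with_diff.columns.tolist())
```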

In [13]:
df.drop('new')

KeyError: "['new'] not found in axis"

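This raises an error because `drop` defaults to `axis = 0`, which refers to rows. pandas looks for a row labeled 'new', but 'new' is a column, so the label is not found.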

In [14]:
df.drop('new',axis = 1)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


Passing `axis = 1` tells `drop` that the label refers to a column.
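`drop` also accepts a `columns` keyword, which some readers find easier to remember than the axis number. A minimal sketch on a made-up frame `demo`:

```python
import pandas as pd

# Hypothetical frame for illustration.
demo = pd.DataFrame({'W': [1, 2], 'new': [3, 4]}, index=['A', 'B'])

by_axis = demo.drop('new', axis=1)       # label plus axis number
by_keyword = demo.drop(columns='new')    # same result, using the columns keyword

print(by_axis.columns.tolist())
print(by_axis.equals(by_keyword))
```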

In [15]:
df

          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762


Although the output of `drop` looks as if the new column was removed, `drop` returns a new DataFrame and leaves the original DataFrame unchanged.

In [16]:
df.drop('new',axis = 1,inplace = True)

In [17]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


By default, pandas guards against accidentally losing or deleting data, so to really remove something from the original DataFrame you have to pass `inplace = True`.
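Reassigning the result of `drop` is a common alternative to `inplace = True`. The sketch below compares the two on made-up frames; `first` and `second` are hypothetical names.

```python
import pandas as pd

# Two identical hypothetical frames for illustration.
first = pd.DataFrame({'W': [1, 2], 'new': [3, 4]})
second = first.copy()

# inplace=True modifies the frame and returns None.
returned = first.drop('new', axis=1, inplace=True)

# Without inplace, keep the change by reassigning the returned frame.
second = second.drop('new', axis=1)

print(returned)
print(first.equals(second))
```

Because the in-place call returns `None`, writing `df = df.drop(..., inplace=True)` would replace the frame with `None`, so the two styles should not be combined.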

In [18]:
df.drop('E')

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057


Because `drop` defaults to `axis = 0`, which means rows, there is no need to pass `axis` when dropping a row.

In [19]:
df.shape

(5, 4)

`shape` is an attribute, just as on a NumPy array, and it shows the current shape of the data as a (rows, columns) tuple.
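The positions in the `shape` tuple line up with the axis numbers used by `drop`: position 0 counts rows and position 1 counts columns. A small sketch on a made-up frame of zeros with the same labels as the notebook's `df`:

```python
import pandas as pd

# Hypothetical 5 x 4 frame filled with zeros.
demo = pd.DataFrame(0.0, index=['A', 'B', 'C', 'D', 'E'],
                    columns=['W', 'X', 'Y', 'Z'])

n_rows, n_cols = demo.shape

rows_dropped = demo.drop('E').shape           # axis 0: one row fewer
cols_dropped = demo.drop('W', axis=1).shape   # axis 1: one column fewer

print(n_rows, n_cols)
print(rows_dropped, cols_dropped)
```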

In [20]:
df

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [21]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

`loc` stands for location. Use `loc` with a row label to select that row directly. Note that a row, like a column, comes back as a Series.

In [22]:
df.iloc[2]

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

`iloc` selects a row by its integer position instead of its label. Either `loc` or `iloc` can be used to get the row you want.
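To make the label versus position distinction concrete, the sketch below selects the same row both ways on a made-up frame `demo`.

```python
import pandas as pd

# Hypothetical frame for illustration.
demo = pd.DataFrame({'W': [1.0, 2.0, 3.0], 'X': [4.0, 5.0, 6.0]},
                    index=['A', 'B', 'C'])

by_label = demo.loc['C']      # row whose label is 'C'
by_position = demo.iloc[2]    # third row, counting from 0

print(by_label.equals(by_position))
print(by_position.name)       # the row label is kept as the Series name
```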

In [23]:
df.loc['B','Y']

-0.8480769834036315

This returns a single value: the one in row B and column Y.

In [24]:
df.loc[['A','B'],['W','Y']]

          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077


To take a subset of the DataFrame, pass two lists inside the outer brackets of `loc`, in the form [[rows], [columns]].
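The same subset can also be taken by integer position with `iloc`. This is a sketch on a made-up frame `demo`, not the notebook's `df`.

```python
import pandas as pd

# Hypothetical 3 x 3 frame for illustration.
demo = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    index=['A', 'B', 'C'], columns=['W', 'X', 'Y'])

by_labels = demo.loc[['A', 'B'], ['W', 'Y']]    # row labels, then column labels
by_positions = demo.iloc[[0, 1], [0, 2]]        # row positions, then column positions

print(by_labels.values.tolist())
print(by_labels.equals(by_positions))
```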