## Slice data

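This notebook covers slicing of Series and DataFrames:
- One important point: *a basic slice (e.g. `s[:3]`) is a VIEW of the original data, so writing to the slice also changes the original.* This is the behaviour shown in the outputs below; it can differ in pandas versions where Copy-on-Write is enabled.
- BUT conditional selection (Boolean indexing) and `iloc`/`loc` selection with a list or mask return copies, so modifying the result does not change the original.
- To avoid confusion, simply call `.copy()` to create an independent copy, not a view, whenever the original must stay unchanged!

These points are the same as for NumPy arrays.
- For DataFrames, the preferred way to select data is `iloc`/`loc` rather than plain or chained bracket indexing (`df[...]`, `df[...][...]`).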
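The view-versus-copy distinction is easiest to verify on a plain NumPy array, where the rules are fixed: a basic slice shares memory with the original, while Boolean indexing and `.copy()` do not. The sketch below uses a hypothetical array `arr`. pandas behaviour additionally depends on the pandas version and on whether Copy-on-Write is enabled, so treat this as an illustration of the idea rather than a guarantee about every pandas release.

```python
import numpy as np

arr = np.arange(5)

basic = arr[:3]         # basic slice: a view on arr's data
masked = arr[arr > 2]   # Boolean indexing: a new array (copy)
safe = arr[:3].copy()   # explicit independent copy

print(np.shares_memory(arr, basic))   # True
print(np.shares_memory(arr, masked))  # False
print(np.shares_memory(arr, safe))    # False

basic[0] = 2018   # writes through to arr
safe[1] = -1      # does not touch arr
print(arr)        # [2018    1    2    3    4]
```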

In [1]:
import numpy as np
import pandas as pd

### Series slicing

In [23]:
s1 = pd.Series(np.arange(5),index=['A','B','C','D','E'])
s1

A    0
B    1
C    2
D    3
E    4
dtype: int32

In [24]:
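s2 = s1[(s1 > 2) & (s1 < 5)]
s2.iloc[0] = 6  # positional assignment; s2[0] on a string index relies on a deprecated integer fallback
s1  # conditional selection returns a copy, so the original is unchanged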

A    0
B    1
C    2
D    3
E    4
dtype: int32

In [15]:
s2=s1[:3]
s2

A    0
B    1
C    2
dtype: int32

In [16]:
s2['A']=2018
s2

A    2018
B       1
C       2
dtype: int32

In [17]:
s1

A    2018
B       1
C       2
D       3
E       4
dtype: int32
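If the original Series must stay untouched, taking an explicit copy of the slice is a simple safeguard. A minimal sketch, rebuilding the same `s1` as above and using `iloc` for the positional slice (the variable names are reused only for illustration):

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), index=['A', 'B', 'C', 'D', 'E'])

# Copy the first three elements so later writes cannot reach s1.
s2 = s1.iloc[:3].copy()
s2['A'] = 2018

print(s2['A'])  # 2018
print(s1['A'])  # 0, the original is unchanged
```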

In [18]:
s1['B']

1

In [19]:
s1[['A','B','C']]

A    2018
B       1
C       2
dtype: int32

In [20]:
s1[0:3]

A    2018
B       1
C       2
dtype: int32

In [21]:
s1[s1>2]

A    2018
D       3
E       4
dtype: int32

In [3]:
s1[(s1>2)&(s1<5)]

D    3
E    4
dtype: int32
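Conditions are combined element-wise with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `<`. A short sketch on a fresh Series `s` (a hypothetical name) with the same values as `s1`:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=['A', 'B', 'C', 'D', 'E'])

inside = s[(s > 2) & (s < 5)]    # both conditions hold: D, E
outside = s[(s < 1) | (s > 3)]   # either condition holds: A, E
not_big = s[~(s > 2)]            # negated condition: A, B, C

print(list(inside.index))   # ['D', 'E']
print(list(outside.index))  # ['A', 'E']
print(list(not_big.index))  # ['A', 'B', 'C']
```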

In [2]:
s1 = pd.Series(np.arange(5),index=['A','B','C','D','E'])
print(s1)
s2 = s1[s1 > 3]
s2.iloc[0] = 100  # positional assignment on the selected copy
print(s1)  # note: changing s2, which was picked by conditional indexing, does not change s1

A    0
B    1
C    2
D    3
E    4
dtype: int32
A    0
B    1
C    2
D    3
E    4
dtype: int32


### Pandas DataFrame slicing

In [60]:
df1 = pd.DataFrame(np.random.randn(4,5),index=['r1','r2','r3','r4'],columns=['c1','c2','c3','c4','c5'])
df1

| | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| r1 | -0.97145 | -2.293524 | -1.966442 | 0.669142 | 0.475444 |
| r2 | 0.78141 | 1.586777 | 0.764737 | -0.875183 | 0.312789 |
| r3 | 1.424498 | -0.782043 | 0.163194 | 0.537369 | -1.225784 |
| r4 | 1.433084 | -1.705762 | 3.086858 | -0.041512 | 0.245194 |


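- Bracket indexing (the "linked chain" method): select a single column (a single label inside `[]` is looked up among the columns, so it cannot select a row)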

In [61]:
df1['c1']

r1   -0.971450
r2    0.781410
r3    1.424498
r4    1.433084
Name: c1, dtype: float64

- Bracket indexing (the "linked chain" method): select multiple columns by passing a list of labels

In [62]:
df1[['c1','c4','c3']]

| | c1 | c4 | c3 |
|---|---|---|---|
| r1 | -0.97145 | 0.669142 | -1.966442 |
| r2 | 0.78141 | -0.875183 | 0.764737 |
| r3 | 1.424498 | 0.537369 | 0.163194 |
| r4 | 1.433084 | -0.041512 | 3.086858 |


- Conditional (Boolean) selection

In [63]:
df1[df1['c2']>0]

| | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| r2 | 0.78141 | 1.586777 | 0.764737 | -0.875183 | 0.312789 |


In [64]:
df1<0

| | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| r1 | True | True | True | False | False |
| r2 | False | False | False | True | False |
| r3 | False | True | False | False | True |
| r4 | False | True | False | True | False |
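A Boolean DataFrame such as `df1 < 0` can be used in a few different ways. The sketch below uses a small hypothetical frame `df` with fixed values (instead of random ones) so the results are reproducible: reducing the mask with `.any(axis=1)` gives a row filter, passing the whole mask keeps the shape and fills non-matching cells with NaN, and `.sum()` counts matches per column.

```python
import pandas as pd

df = pd.DataFrame(
    {'c1': [-1.0, 2.0, 3.0], 'c2': [4.0, -5.0, 6.0]},
    index=['r1', 'r2', 'r3'],
)

neg = df < 0                           # element-wise Boolean frame

rows_with_neg = df[neg.any(axis=1)]    # rows containing at least one negative value
masked = df[neg]                       # same shape; NaN wherever the mask is False
neg_count = neg.sum()                  # number of negative values per column

print(list(rows_with_neg.index))       # ['r1', 'r2']
print(masked)
print(neg_count)
```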


- `loc` or `iloc`: select rows and columns together. This is the **preferred method** over bracket ("linked chain") indexing.

In [65]:
df1.iloc[1]

c1    0.781410
c2    1.586777
c3    0.764737
c4   -0.875183
c5    0.312789
Name: r2, dtype: float64

In [66]:
df1.loc['r2'] #select row

c1    0.781410
c2    1.586777
c3    0.764737
c4   -0.875183
c5    0.312789
Name: r2, dtype: float64

In [74]:
df1.loc[:,'c1'] #select column

r1   -0.971450
r2    0.781410
r3    1.424498
r4    1.433084
Name: c1, dtype: float64

In [67]:
df1.loc[['r2','r3']]

| | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| r2 | 0.78141 | 1.586777 | 0.764737 | -0.875183 | 0.312789 |
| r3 | 1.424498 | -0.782043 | 0.163194 | 0.537369 | -1.225784 |


In [68]:
df1.iloc[[0,2,3],[0,2]]

| | c1 | c3 |
|---|---|---|
| r1 | -0.97145 | -1.966442 |
| r3 | 1.424498 | 0.163194 |
| r4 | 1.433084 | 3.086858 |
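One detail that is easy to trip over when mixing the two accessors: a label slice in `loc` includes its end label, while a positional slice in `iloc` excludes its end position. The sketch below uses a hypothetical integer frame `df` so the values are predictable, and also shows a Boolean row mask combined with a column slice in a single `loc` call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    index=['r1', 'r2', 'r3', 'r4'],
    columns=['c1', 'c2', 'c3', 'c4', 'c5'],
)

by_label = df.loc['r2':'r3', ['c1', 'c3']]   # label slice: 'r3' is included
by_pos = df.iloc[1:3, [0, 2]]                # positional slice: position 3 is excluded
scalar = df.loc['r2', 'c3']                  # single cell by row and column label

# Boolean row mask plus a column label slice in one call.
cond = df.loc[df['c2'] > 5, 'c1':'c2']

print(by_label.equals(by_pos))  # True
print(scalar)                   # 7
print(list(cond.index))         # ['r2', 'r3', 'r4']
```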


- `.copy()`: as with Series and NumPy arrays, writing to a slice of a DataFrame can change the original DataFrame. To avoid this, use `.copy()` to create an independent copy rather than a view.

In [69]:
df2 = df1.copy()

In [70]:
df2

| | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| r1 | -0.97145 | -2.293524 | -1.966442 | 0.669142 | 0.475444 |
| r2 | 0.78141 | 1.586777 | 0.764737 | -0.875183 | 0.312789 |
| r3 | 1.424498 | -0.782043 | 0.163194 | 0.537369 | -1.225784 |
| r4 | 1.433084 | -1.705762 | 3.086858 | -0.041512 | 0.245194 |
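The difference between `.copy()` and plain assignment can be sketched as follows. Plain assignment only creates a second name for the same object, whereas `.copy()` produces a separate DataFrame. The names `alias` and `independent` are hypothetical, and the frame uses fixed integers for reproducibility.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=['r1', 'r2'],
    columns=['c1', 'c2', 'c3'],
)

alias = df1               # same object under a second name
independent = df1.copy()  # separate object with its own data

independent.loc['r1', 'c1'] = -1
print(df1.loc['r1', 'c1'])   # 0, df1 is unaffected by changes to the copy

alias.loc['r2', 'c3'] = 500
print(df1.loc['r2', 'c3'])   # 500, alias and df1 are the same DataFrame
```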


- Add a new column

In [79]:
df2.loc[:,'c6']=['s',1,2,3]

In [80]:
df2

| | c1 | c2 | c3 | c4 | c5 | c6 |
|---|---|---|---|---|---|---|
| r1 | -0.97145 | -2.293524 | -1.966442 | 0.669142 | 0.475444 | s |
| r2 | 0.78141 | 1.586777 | 0.764737 | -0.875183 | 0.312789 | 1 |
| r3 | 1.424498 | -0.782043 | 0.163194 | 0.537369 | -1.225784 | 2 |
| r4 | 1.433084 | -1.705762 | 3.086858 | -0.041512 | 0.245194 | 3 |


In [83]:
type(df2.loc['r4','c6'])

int
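The `int` result above follows from the column holding mixed values: a column built from a string and integers is stored with `object` dtype, so each element keeps its own Python type. A small sketch on a hypothetical frame `df`, which also shows `assign` as an alternative that returns a new frame instead of modifying the existing one:

```python
import pandas as pd

df = pd.DataFrame({'c1': [0.1, 0.2, 0.3, 0.4]}, index=['r1', 'r2', 'r3', 'r4'])

# Mixed string/integer values are stored as generic Python objects.
df['c6'] = ['s', 1, 2, 3]
print(df['c6'].dtype)             # object
print(type(df.loc['r4', 'c6']))   # <class 'int'>

# assign() leaves df alone and returns a new frame with the extra column.
wider = df.assign(c8=df['c1'] * 2)
print(list(df.columns))           # ['c1', 'c6']
print(list(wider.columns))        # ['c1', 'c6', 'c8']
```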

- Assign new values to modify existing elements

In [84]:
#assign new value to modify existing elements
df2.loc['r4','c6']=99
df2.loc['r3']=[1,2,3,4,5,6]
df2

| | c1 | c2 | c3 | c4 | c5 | c6 |
|---|---|---|---|---|---|---|
| r1 | -0.97145 | -2.293524 | -1.966442 | 0.669142 | 0.475444 | s |
| r2 | 0.78141 | 1.586777 | 0.764737 | -0.875183 | 0.312789 | 1 |
| r3 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6 |
| r4 | 1.433084 | -1.705762 | 3.086858 | -0.041512 | 0.245194 | 99 |
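The same `loc` assignment pattern also covers conditional updates: a Boolean row mask on the left-hand side changes only the matching cells. A minimal sketch on a hypothetical two-column float frame `df`; note that integers assigned into float columns are stored as floats, which matches the `1.0, 2.0, ...` values in row `r3` above.

```python
import pandas as pd

df = pd.DataFrame(
    {'c1': [-1.5, 0.5, 2.5], 'c2': [1.0, -2.0, 3.0]},
    index=['r1', 'r2', 'r3'],
)

df.loc['r3', 'c2'] = 99          # single cell
df.loc['r2'] = [10, 20]          # whole row, one value per column
df.loc[df['c1'] < 0, 'c1'] = 0   # only the rows where c1 is negative

print(df['c1'].tolist())  # [0.0, 10.0, 2.5]
print(df['c2'].tolist())  # [1.0, 20.0, 99.0]
```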


- Boolean indexing

> As with arrays, a Boolean sequence passed inside `[]` selects ROWS (not columns): the rows whose entry is True are kept. (Worth memorizing!)

> To apply a mask to columns, transpose first with `.T` (a property, so no parentheses).

In [90]:
df2['c6'].isin(['s',99])

r1     True
r2    False
r3    False
r4     True
Name: c6, dtype: bool

In [92]:
df2[df2['c6'].isin(['s',99])]

| | c1 | c2 | c3 | c4 | c5 | c6 |
|---|---|---|---|---|---|---|
| r1 | -0.97145 | -2.293524 | -1.966442 | 0.669142 | 0.475444 | s |
| r4 | 1.433084 | -1.705762 | 3.086858 | -0.041512 | 0.245194 | 99 |


In [97]:
df2.T[[False,True,False,False,True,True]]

| | r1 | r2 | r3 | r4 |
|---|---|---|---|---|
| c2 | -2.29352 | 1.58678 | 2 | -1.70576 |
| c5 | 0.475444 | 0.312789 | 5 | 0.245194 |
| c6 | s | 1.0 | 6 | 99.0 |
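Transposing works, but a column mask can also be applied directly with `loc[:, mask]`, which keeps the original orientation. A sketch on a hypothetical integer frame `df` with the same six column names, comparing the two approaches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(2, 6),
    index=['r1', 'r2'],
    columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
)

col_mask = [False, True, False, False, True, True]

picked = df.loc[:, col_mask]         # mask applied along the column axis
via_transpose = df.T[col_mask].T     # transpose, mask rows, transpose back

print(list(picked.columns))          # ['c2', 'c5', 'c6']
print(picked.equals(via_transpose))  # True
```

For a frame with mixed column types, the round trip through `.T` can change dtypes, so `loc[:, mask]` is usually the safer choice.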
