## [Intro to data structures](https://pandas.pydata.org/docs/getting_started/dsintro.html)

In [3]:
import numpy as np
import pandas as pd

In [4]:
pd.__version__
np.__version__

'1.18.0'

# Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic way to create a Series is to call:

> s = pd.Series(data, index=index)

Here, data can be many different things:
* a Python dict
* an ndarray
* a scalar value (like 5)

The index argument is a list of axis labels. How it is used depends on the type of data, so the cases are covered separately below:

### 1. From an ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one is created with the values [0, ..., len(data) - 1].

In [6]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.561475
b   -1.867398
c   -0.232802
d   -0.555533
e    0.907955
dtype: float64

In [7]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [8]:
pd.Series(np.random.randn(5))

0   -0.804604
1   -1.128109
2    1.944903
3    0.431423
4   -0.219267
dtype: float64
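
The length rule above can be checked directly. This is a minimal sketch that is not part of the original notebook; the array `values` and the other variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.arange(3.0)  # hypothetical data: array([0., 1., 2.])

# With no index, pandas builds a default integer index 0..len(data)-1.
default = pd.Series(values)
print(list(default.index))

# An index whose length differs from the data is rejected with a ValueError.
try:
    pd.Series(values, index=['a', 'b'])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```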

### 2. From a dict

In [11]:
d = {'b': 1, 'c': 2, 'd': 3}
a = pd.Series(d)
a

b    1
c    2
d    3
dtype: int64

In [12]:
a.index

Index(['b', 'c', 'd'], dtype='object')

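If an index is passed, the values in data that correspond to the labels in the index are pulled out. Labels that are not keys of the dict get NaN.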

In [13]:
b = pd.Series(d, index=['a', 'b', 'c', 'd'])
b

a    NaN
b    1.0
c    2.0
d    3.0
dtype: float64

In [14]:
b.index

Index(['a', 'b', 'c', 'd'], dtype='object')
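
As a small sketch of the same lookup behaviour (the label `'x'` and the variable name `picked` are hypothetical), the passed index decides both the order of the result and which keys are kept:

```python
import pandas as pd

d = {'b': 1, 'c': 2, 'd': 3}

# The index selects and orders the dict entries; 'x' is not a key, so it becomes NaN,
# and 'c' is dropped because it is not requested.
picked = pd.Series(d, index=['d', 'x', 'b'])
print(picked)
```

Because NaN is a float, the integer values are converted to float64 as soon as one label is missing.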

### 3. From a scalar value
If data is a scalar value, an index must be provided. The value is repeated to match the length of index.

In [16]:
pd.Series(5., index=['a', 'b', 'c'])

a    5.0
b    5.0
c    5.0
dtype: float64

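## Series is ndarray-like
Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.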

In [17]:
s

a   -0.561475
b   -1.867398
c   -0.232802
d   -0.555533
e    0.907955
dtype: float64

In [18]:
s.iloc[0]  # positional access; plain s[0] on a labeled index is deprecated in recent pandas

-0.5614751476090337

In [19]:
s[:3]

a   -0.561475
b   -1.867398
c   -0.232802
dtype: float64

In [20]:
s.median()

-0.555532562878163

In [22]:
s>s.median()

a    False
b    False
c     True
d    False
e     True
dtype: bool

In [23]:
s[s>s.median()]

c   -0.232802
e    0.907955
dtype: float64

In [24]:
s.iloc[[4, 3]]  # positional access by a list of integer locations

e    0.907955
d   -0.555533
dtype: float64

In [25]:
np.exp(s)

a    0.570367
b    0.154525
c    0.792310
d    0.573767
e    2.479248
dtype: float64

In [26]:
s.dtype

dtype('float64')

In [27]:
# Series is ndarray-like: Series.array gives the backing array, Series.to_numpy() converts to a real ndarray
s.array

<PandasArray>
[ -0.5614751476090337,  -1.8673982065492853, -0.23280202035679604,
   -0.555532562878163,   0.9079553628785715]
Length: 5, dtype: float64

In [28]:
s.to_numpy()

array([-0.56147515, -1.86739821, -0.23280202, -0.55553256,  0.90795536])
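
To make the difference between a Series and a bare ndarray concrete, here is a short sketch with made-up data (`readings` is a hypothetical name, not from the notebook):

```python
import numpy as np
import pandas as pd

readings = pd.Series([10.0, 20.0, 30.0, 40.0], index=list('abcd'))  # hypothetical data

# Slicing keeps the matching labels alongside the values.
head = readings.iloc[:2]
print(list(head.index))

# NumPy functions accept a Series and return a Series with the same index.
roots = np.sqrt(readings)
print(type(roots).__name__)

# to_numpy() drops the labels and returns a plain ndarray.
arr = readings.to_numpy()
print(type(arr).__name__)
```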

## Series is dict-like
A Series is like a fixed-size dict: you can get and set values by index label.

In [30]:
s

a   -0.561475
b   -1.867398
c   -0.232802
d   -0.555533
e    0.907955
dtype: float64

In [31]:
s['a']

-0.5614751476090337

In [32]:
s['e'] = 12
s

a    -0.561475
b    -1.867398
c    -0.232802
d    -0.555533
e    12.000000
dtype: float64

In [33]:
'e' in s, 'f' in s

(True, False)

In [None]:
s['f'] # KeyError

In [35]:
s.get('f') # None

In [36]:
s.get('f', np.nan)

nan
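
The three lookup styles above can be put side by side. A minimal sketch, assuming a small hypothetical Series called `prices`:

```python
import numpy as np
import pandas as pd

prices = pd.Series({'a': 1.5, 'b': 2.5})  # hypothetical data

# Bracket lookup raises KeyError for an unknown label, like a dict.
try:
    prices['f']
    missing_raises = False
except KeyError:
    missing_raises = True

# get() returns None (or the supplied default) instead of raising.
fallback = prices.get('f')
default = prices.get('f', np.nan)

print(missing_raises, fallback, default)
```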

In [37]:
s

a    -0.561475
b    -1.867398
c    -0.232802
d    -0.555533
e    12.000000
dtype: float64

In [38]:
s + s

a    -1.122950
b    -3.734796
c    -0.465604
d    -1.111065
e    24.000000
dtype: float64

In [39]:
s * 2

a    -1.122950
b    -3.734796
c    -0.465604
d    -1.111065
e    24.000000
dtype: float64

In [40]:
np.exp(s)

a         0.570367
b         0.154525
c         0.792310
d         0.573767
e    162754.791419
dtype: float64

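A key difference between Series and ndarray is that operations between Series automatically align the data by label, so you can write computations without checking whether the Series involved have the same labels.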

In [41]:
s

a    -0.561475
b    -1.867398
c    -0.232802
d    -0.555533
e    12.000000
dtype: float64

In [42]:
s[1:] + s[:-1]

a         NaN
b   -3.734796
c   -0.465604
d   -1.111065
e         NaN
dtype: float64

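The result of an operation between unaligned Series has the union of the indexes involved. If a label is missing from either Series, the result for that label is marked as NaN.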
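
A short sketch of label alignment with two hypothetical Series, `left` and `right`, whose labels only partly overlap. It also shows `Series.add` with `fill_value`, which is one way to avoid the NaN results when a missing value should count as zero:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])      # hypothetical data
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])  # hypothetical data

# Values are matched by label, not by position; 'a' and 'd' exist on one side only.
total = left + right
print(total)

# add() with fill_value treats a label missing on one side as 0 before adding.
filled = left.add(right, fill_value=0)
print(filled)
```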

## Name attribute

In [45]:
s = pd.Series(np.random.randn(5), name='something')
s

0    0.409916
1    0.587775
2    0.793873
3    0.091883
4   -2.222871
Name: something, dtype: float64

In [47]:
s.name, s.dtype, s.size

('something', dtype('float64'), 5)

In [49]:
s.rename('new_name')  # returns a new Series; s itself is not modified
s

0    0.409916
1    0.587775
2    0.793873
3    0.091883
4   -2.222871
Name: something, dtype: float64

In [50]:
s.name

'something'

In [51]:
s2 = s.rename("different")
s2

0    0.409916
1    0.587775
2    0.793873
3    0.091883
4   -2.222871
Name: different, dtype: float64

In [52]:
s2.name    # s and s2 refer to different objects

'different'
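
The cells above show that `rename` does not change the original. A minimal sketch of the two ways to set a name, using a hypothetical Series called `original`:

```python
import pandas as pd

original = pd.Series([1, 2, 3], name='something')  # hypothetical data

# rename() returns a new Series and leaves the original name untouched.
renamed = original.rename('different')
print(original.name, renamed.name)

# Assigning to .name changes the Series in place.
original.name = 'updated'

# The name becomes the column label when the Series is turned into a DataFrame.
frame = original.to_frame()
print(list(frame.columns))
```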

# DataFrame
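DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dict of Series objects.
Like Series, DataFrame accepts many different kinds of input:
* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray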
* Structured or record ndarray
* A Series
* Another DataFrame

### 1. From dict of Series or dicts

In [56]:
d = {'one': pd.Series([1, 2, 3.], index=['a', 'b', 'c']),
     'two': pd.Series([4, 5, 6, 7.], index=['a', 'b', 'c', 'd'])
    }
d

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64, 'two': a    4.0
 b    5.0
 c    6.0
 d    7.0
 dtype: float64}

In [57]:
d['one']

a    1.0
b    2.0
c    3.0
dtype: float64

In [58]:
df = pd.DataFrame(d)
df

| | one | two |
| --- | --- | --- |
| a | 1.0 | 4.0 |
| b | 2.0 | 5.0 |
| c | 3.0 | 6.0 |
| d | NaN | 7.0 |


In [59]:
pd.DataFrame(d, index=['b', 'c'])

| | one | two |
| --- | --- | --- |
| b | 2.0 | 5.0 |
| c | 3.0 | 6.0 |


In [60]:
pd.DataFrame(d, index=['a', 'c', 'e'], columns=['two', 'three'])

| | two | three |
| --- | --- | --- |
| a | 4.0 | NaN |
| c | 6.0 | NaN |
| e | NaN | NaN |


In [61]:
df.index, df.columns

(Index(['a', 'b', 'c', 'd'], dtype='object'),
 Index(['one', 'two'], dtype='object'))

### 2. From dict of ndarrays / lists

In [66]:
d = {
    'one': [1, 2., 3, 4],
    'two': [5, 6, 7, 8]
}
d

{'one': [1, 2.0, 3, 4], 'two': [5, 6, 7, 8]}

In [67]:
pd.DataFrame(d)

| | one | two |
| --- | --- | --- |
| 0 | 1.0 | 5 |
| 1 | 2.0 | 6 |
| 2 | 3.0 | 7 |
| 3 | 4.0 | 8 |


In [68]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

| | one | two |
| --- | --- | --- |
| a | 1.0 | 5 |
| b | 2.0 | 6 |
| c | 3.0 | 7 |
| d | 4.0 | 8 |


### 3. From structured or record array

In [69]:
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])  # 'S10' is the bytes dtype; the old 'a10' alias was removed in NumPy 2.0
data

array([(0, 0., b''), (0, 0., b'')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [70]:
data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
data

array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [71]:
pd.DataFrame(data)

| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1 | 2.0 | b'Hello' |
| 1 | 2 | 3.0 | b'World' |


In [72]:
pd.DataFrame(data, index=['first', 'second'])

| | A | B | C |
| --- | --- | --- | --- |
| first | 1 | 2.0 | b'Hello' |
| second | 2 | 3.0 | b'World' |


In [73]:
pd.DataFrame(data, columns=['C', 'B', 'A', 'E'])

| | C | B | A | E |
| --- | --- | --- | --- | --- |
| 0 | b'Hello' | 2.0 | 1 | NaN |
| 1 | b'World' | 3.0 | 2 | NaN |


### 4. From a list of dicts

In [74]:
data2 = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 5}]
data2

[{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 5}]

In [75]:
pd.DataFrame(data2)

| | a | b | c |
| --- | --- | --- | --- |
| 0 | 1 | 2 | NaN |
| 1 | 3 | 4 | 5.0 |


In [76]:
pd.DataFrame(data2, index=['first', 'second'])

| | a | b | c |
| --- | --- | --- | --- |
| first | 1 | 2 | NaN |
| second | 3 | 4 | 5.0 |


In [77]:
pd.DataFrame(data2, columns=['a', 'b', 'e'])

| | a | b | e |
| --- | --- | --- | --- |
| 0 | 1 | 2 | NaN |
| 1 | 3 | 4 | NaN |


### 5. From a dict of tuples

In [79]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

| | | a | a | a | b | b |
| --- | --- | --- | --- | --- | --- | --- |
| | | b | a | c | a | b |
| A | B | 1.0 | 4.0 | 5.0 | 8.0 | 10.0 |
| A | C | 2.0 | 3.0 | 6.0 | 7.0 | NaN |
| A | D | NaN | NaN | NaN | NaN | 9.0 |
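
Passing a dict keyed by tuples produces hierarchical (MultiIndex) labels on both axes. The sketch below uses a reduced, hypothetical version of the dict above to show how such a frame can be inspected and sliced:

```python
import pandas as pd

# A smaller dict of tuples than the cell above, for illustration only.
frame = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
                      ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
                      ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}})

# Tuple keys become a MultiIndex on the columns and on the rows.
print(isinstance(frame.columns, pd.MultiIndex), isinstance(frame.index, pd.MultiIndex))

# Selecting the outer column label returns the sub-frame underneath it.
sub = frame['a']
print(list(sub.columns))

# A single cell is addressed with one tuple per axis.
cell = frame.loc[('A', 'B'), ('a', 'b')]
print(cell)
```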


## Column selection, addition, deletion

In [80]:
df

| | one | two |
| --- | --- | --- |
| a | 1.0 | 4.0 |
| b | 2.0 | 5.0 |
| c | 3.0 | 6.0 |
| d | NaN | 7.0 |


In [81]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [83]:
df['three'] = df['one'] * df['two']

In [84]:
df

| | one | two | three |
| --- | --- | --- | --- |
| a | 1.0 | 4.0 | 4.0 |
| b | 2.0 | 5.0 | 10.0 |
| c | 3.0 | 6.0 | 18.0 |
| d | NaN | 7.0 | NaN |


In [85]:
df['flag'] = df['one'] > 2

In [86]:
df

| | one | two | three | flag |
| --- | --- | --- | --- | --- |
| a | 1.0 | 4.0 | 4.0 | False |
| b | 2.0 | 5.0 | 10.0 | False |
| c | 3.0 | 6.0 | 18.0 | True |
| d | NaN | 7.0 | NaN | False |


In [87]:
del df['two']

In [88]:
df

| | one | three | flag |
| --- | --- | --- | --- |
| a | 1.0 | 4.0 | False |
| b | 2.0 | 10.0 | False |
| c | 3.0 | 18.0 | True |
| d | NaN | NaN | False |


In [89]:
flag = df.pop('flag')  # pop removes the column from df and returns it as a Series
flag

a    False
b    False
c     True
d    False
Name: flag, dtype: bool

In [90]:
df

| | one | three |
| --- | --- | --- |
| a | 1.0 | 4.0 |
| b | 2.0 | 10.0 |
| c | 3.0 | 18.0 |
| d | NaN | NaN |


In [91]:
df['foo'] = 'bar'

In [92]:
df

| | one | three | foo |
| --- | --- | --- | --- |
| a | 1.0 | 4.0 | bar |
| b | 2.0 | 10.0 | bar |
| c | 3.0 | 18.0 | bar |
| d | NaN | NaN | bar |


In [93]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [94]:
df['one'][:2]

a    1.0
b    2.0
Name: one, dtype: float64

In [95]:
df['one_trunc'] = df['one'][:2]

In [96]:
df

| | one | three | foo | one_trunc |
| --- | --- | --- | --- | --- |
| a | 1.0 | 4.0 | bar | 1.0 |
| b | 2.0 | 10.0 | bar | 2.0 |
| c | 3.0 | 18.0 | bar | NaN |
| d | NaN | NaN | bar | NaN |


In [97]:
df.insert(1, 'bar', df['one'])
df

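| | one | bar | three | foo | one_trunc |
| --- | --- | --- | --- | --- | --- |
| a | 1.0 | 1.0 | 4.0 | bar | 1.0 |
| b | 2.0 | 2.0 | 10.0 | bar | 2.0 |
| c | 3.0 | 3.0 | 18.0 | bar | NaN |
| d | NaN | NaN | NaN | bar | NaN |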
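
The column operations in this section all modify `df` in place. As a compact recap, and to contrast them with `DataFrame.assign`, which returns a new frame instead, here is a sketch on a small hypothetical frame (the column names `first` and `ratio` are made up):

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])  # hypothetical data

df['three'] = df['one'] * df['two']    # add a derived column at the end
popped = df.pop('two')                 # remove a column and keep it as a Series
df.insert(0, 'first', df['one'] * 10)  # insert at a chosen position

# assign() leaves df untouched and returns a copy with the extra column.
extended = df.assign(ratio=df['three'] / df['one'])

print(list(df.columns))
print(list(extended.columns))
```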


## Indexing / selection

| Operation | Syntax | Result |
| --- | --- | --- |
| Select column | df[col] | Series |
| Select row by label | df.loc[label] | Series |
| Select row by integer location | df.iloc[loc] | Series |
| Slice rows | df[5:10] | DataFrame |
| Select rows by boolean vector | df[bool_vec] | DataFrame |

In [100]:
df

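| | one | bar | three | foo | one_trunc |
| --- | --- | --- | --- | --- | --- |
| a | 1.0 | 1.0 | 4.0 | bar | 1.0 |
| b | 2.0 | 2.0 | 10.0 | bar | 2.0 |
| c | 3.0 | 3.0 | 18.0 | bar | NaN |
| d | NaN | NaN | NaN | bar | NaN |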


In [101]:
df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [102]:
df.loc['a']

one            1
bar            1
three          4
foo          bar
one_trunc      1
Name: a, dtype: object

In [103]:
df.iloc[2]

one            3
bar            3
three         18
foo          bar
one_trunc    NaN
Name: c, dtype: object
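
The cells above cover the first three rows of the table. A minimal sketch that exercises every row, using a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0], 'two': [5.0, 6.0, 7.0, 8.0]},
                  index=['a', 'b', 'c', 'd'])  # hypothetical data

col = df['one']              # select a column          -> Series
row_by_label = df.loc['b']   # select a row by label    -> Series
row_by_pos = df.iloc[1]      # select a row by position -> Series
rows = df[1:3]               # slice rows by position   -> DataFrame
matches = df[df['one'] > 2]  # boolean vector           -> DataFrame

print(type(col).__name__, type(rows).__name__)
print(list(rows.index), list(matches.index))
```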

## Data alignment and arithmetic

In [105]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 0.522742 | -2.391933 | -0.542012 | 0.342174 |
| 1 | -0.618796 | 0.85893 | 0.873287 | 0.800405 |
| 2 | -0.181047 | -1.414828 | 0.131944 | -0.929668 |
| 3 | 0.075224 | 1.140172 | 0.171751 | -0.254201 |
| 4 | -1.064046 | 0.020116 | -0.013411 | 0.333392 |
| 5 | 0.55966 | 1.400333 | -0.51157 | -2.300192 |
| 6 | -0.899453 | 2.191558 | 0.40506 | 0.965517 |
| 7 | 0.080676 | -1.950364 | 1.133487 | 0.170458 |
| 8 | 1.380558 | 1.148511 | 0.453455 | -0.099539 |
| 9 | -0.472998 | 0.006777 | 0.205617 | 0.872105 |


In [106]:
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
df2

| | A | B | C |
| --- | --- | --- | --- |
| 0 | 0.952947 | 0.622033 | 0.177389 |
| 1 | 1.095368 | -0.405183 | 0.440339 |
| 2 | 2.067236 | -1.396006 | -0.842835 |
| 3 | -0.272533 | -0.255988 | -1.46186 |
| 4 | -0.006401 | 0.19077 | -0.683706 |
| 5 | 0.492597 | -0.449293 | -0.113548 |
| 6 | 0.600659 | -0.620252 | 1.019686 |


In [107]:
df + df2

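| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 1.475689 | -1.769899 | -0.364622 | NaN |
| 1 | 0.476571 | 0.453746 | 1.313627 | NaN |
| 2 | 1.886189 | -2.810834 | -0.710891 | NaN |
| 3 | -0.197309 | 0.884184 | -1.290109 | NaN |
| 4 | -1.070447 | 0.210887 | -0.697117 | NaN |
| 5 | 1.052256 | 0.951039 | -0.625118 | NaN |
| 6 | -0.298794 | 1.571306 | 1.424746 | NaN |
| 7 | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN |
| 9 | NaN | NaN | NaN | NaN |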
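
Adding two DataFrames aligns on both the columns and the index, and the result carries the union of each. A small deterministic sketch with hypothetical frames `a` and `b` makes the pattern easier to follow than the random data above; it also shows `DataFrame.add` with `fill_value` as one way to treat missing entries as zero:

```python
import pandas as pd

a = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})  # hypothetical data
b = pd.DataFrame({'A': [10.0, 20.0]})                           # hypothetical data

# Row 2 and column 'B' exist only in a, so those cells become NaN.
total = a + b
print(total)

# With fill_value, a cell missing on one side is replaced by 0 before adding.
filled = a.add(b, fill_value=0)
print(filled)
```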


In [108]:
df - df.iloc[0]

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -1.141539 | 3.250862 | 1.415299 | 0.458231 |
| 2 | -0.703789 | 0.977105 | 0.673956 | -1.271842 |
| 3 | -0.447518 | 3.532105 | 0.713763 | -0.596375 |
| 4 | -1.586789 | 2.412049 | 0.528601 | -0.008782 |
| 5 | 0.036917 | 3.792265 | 0.030441 | -2.642366 |
| 6 | -1.422196 | 4.58349 | 0.947072 | 0.623343 |
| 7 | -0.442066 | 0.441568 | 1.675498 | -0.171716 |
| 8 | 0.857816 | 3.540444 | 0.995467 | -0.441713 |
| 9 | -0.99574 | 2.39871 | 0.747629 | 0.52993 |


In [109]:
index = pd.date_range('1/1/2000', periods=8)
index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [111]:
df = pd.DataFrame(np.random.randn(8, 3), index=pd.date_range('1/1/2020', periods=8), columns=list('ABC'))
df

| | A | B | C |
| --- | --- | --- | --- |
| 2020-01-01 | 2.273457 | -0.077067 | -1.051693 |
| 2020-01-02 | -0.424376 | 0.678907 | -0.038179 |
| 2020-01-03 | 1.424509 | -0.065261 | -1.292894 |
| 2020-01-04 | 1.624771 | 1.099481 | 0.772532 |
| 2020-01-05 | -0.251496 | 1.066648 | -1.522538 |
| 2020-01-06 | 1.750127 | -0.670992 | 2.211741 |
| 2020-01-07 | 1.134509 | 0.991641 | -0.201995 |
| 2020-01-08 | 2.352553 | 1.872329 | 1.496642 |


In [112]:
df['A']

2020-01-01    2.273457
2020-01-02   -0.424376
2020-01-03    1.424509
2020-01-04    1.624771
2020-01-05   -0.251496
2020-01-06    1.750127
2020-01-07    1.134509
2020-01-08    2.352553
Freq: D, Name: A, dtype: float64

In [115]:
type(df['A']), df.iloc[0]

(pandas.core.series.Series, A    2.273457
 B   -0.077067
 C   -1.051693
 Name: 2020-01-01 00:00:00, dtype: float64)

In [116]:
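# Note: this aligns the index of df['A'] (dates) with the columns of df, so nothing matches and every value is NaN.
# To subtract the column from every column row by row, use df.sub(df['A'], axis=0).
df - df['A']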

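| | 2020-01-01 00:00:00 | 2020-01-02 00:00:00 | 2020-01-03 00:00:00 | 2020-01-04 00:00:00 | 2020-01-05 00:00:00 | 2020-01-06 00:00:00 | 2020-01-07 00:00:00 | 2020-01-08 00:00:00 | A | B | C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2020-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-01-02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-01-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-01-04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-01-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-01-06 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-01-07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2020-01-08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |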
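
The all-NaN result above comes from how a Series is aligned against a DataFrame: by default its index is matched with the frame's columns, which suits a row such as `df.iloc[0]` but not a column such as `df['A']`. The sketch below, on a small hypothetical frame, shows both directions using `DataFrame.sub` with `axis=0` for the column case:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 5.0]},
                  index=pd.date_range('2020-01-01', periods=2))  # hypothetical data

# A row's index is the column labels, so it lines up and is subtracted from every row.
row_wise = df - df.iloc[0]
print(row_wise)

# A column's index is the row labels; axis=0 tells pandas to align on the rows instead.
col_wise = df.sub(df['A'], axis=0)
print(col_wise)
```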


In [117]:
df

| | A | B | C |
| --- | --- | --- | --- |
| 2020-01-01 | 2.273457 | -0.077067 | -1.051693 |
| 2020-01-02 | -0.424376 | 0.678907 | -0.038179 |
| 2020-01-03 | 1.424509 | -0.065261 | -1.292894 |
| 2020-01-04 | 1.624771 | 1.099481 | 0.772532 |
| 2020-01-05 | -0.251496 | 1.066648 | -1.522538 |
| 2020-01-06 | 1.750127 | -0.670992 | 2.211741 |
| 2020-01-07 | 1.134509 | 0.991641 | -0.201995 |
| 2020-01-08 | 2.352553 | 1.872329 | 1.496642 |


In [118]:
df * 2

| | A | B | C |
| --- | --- | --- | --- |
| 2020-01-01 | 4.546914 | -0.154134 | -2.103387 |
| 2020-01-02 | -0.848751 | 1.357815 | -0.076359 |
| 2020-01-03 | 2.849018 | -0.130523 | -2.585788 |
| 2020-01-04 | 3.249541 | 2.198961 | 1.545064 |
| 2020-01-05 | -0.502991 | 2.133297 | -3.045076 |
| 2020-01-06 | 3.500254 | -1.341983 | 4.423483 |
| 2020-01-07 | 2.269018 | 1.983282 | -0.403991 |
| 2020-01-08 | 4.705107 | 3.744658 | 2.993284 |


In [119]:
df * 5 + 2

| | A | B | C |
| --- | --- | --- | --- |
| 2020-01-01 | 13.367285 | 1.614666 | -3.258467 |
| 2020-01-02 | -0.121878 | 5.394536 | 1.809103 |
| 2020-01-03 | 9.122546 | 1.673693 | -4.464471 |
| 2020-01-04 | 10.123854 | 7.497404 | 5.862661 |
| 2020-01-05 | 0.742522 | 7.333242 | -5.61269 |
| 2020-01-06 | 10.750635 | -1.354958 | 13.058707 |
| 2020-01-07 | 7.672545 | 6.958205 | 0.990024 |
| 2020-01-08 | 13.762766 | 11.361645 | 9.48321 |


In [121]:
print(df.to_string())

                   A         B         C
2020-01-01  2.273457 -0.077067 -1.051693
2020-01-02 -0.424376  0.678907 -0.038179
2020-01-03  1.424509 -0.065261 -1.292894
2020-01-04  1.624771  1.099481  0.772532
2020-01-05 -0.251496  1.066648 -1.522538
2020-01-06  1.750127 -0.670992  2.211741
2020-01-07  1.134509  0.991641 -0.201995
2020-01-08  2.352553  1.872329  1.496642
