Pandas is built on top of NumPy, so it helps to learn NumPy first.

__Series__ and __DataFrame__ are the two main data structures in pandas.

| Data structure  | Characteristics |
| ------------- | ------------- |
| Series  | One-dimensional labeled array  |
| DataFrame  | Two-dimensional, similar to a data frame in R  |

## 1. Create a data structure 

In [1]:
import numpy as np
import pandas as pd

A Series generates a default integer index (0, 1, 2, ...) automatically when no index is given.

In [3]:
ss = pd.Series([2,6,7,8,np.nan,-4])
print(ss)

0    2.0
1    6.0
2    7.0
3    8.0
4    NaN
5   -4.0
dtype: float64
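
The default index can be replaced with explicit labels. The snippet below is a minimal sketch; the labels `a`, `b`, `c` and the name `ss_labeled` are made up for illustration and do not come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, chosen only for illustration.
labels = ["a", "b", "c"]
ss_labeled = pd.Series([2, 6, np.nan], index=labels)

# Values can now be looked up by label instead of by position.
print(ss_labeled.index.tolist())
print(ss_labeled["b"])
```

Because the data contains `np.nan`, the values are stored as `float64`, just as in the Series above.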


In [6]:
df = pd.DataFrame(np.random.randn(10,2), columns=["midterm","final"])
print(df)

    midterm     final
0 -0.360622  1.349659
1  0.819132 -1.328718
2 -0.108703 -0.285480
3 -0.951495 -1.178266
4 -2.127274  0.609882
5 -1.033892  1.594901
6 -1.526956 -1.993792
7 -0.433889  0.306501
8  1.271235  0.304408
9  1.191153 -0.404806


We can also create a DataFrame from a dictionary. Each key becomes a column, and the columns can hold different data types, as the `dtypes` output shows.

In [14]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df2)
print('\n',df2.dtypes)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

 A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
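
In the dictionary form, scalar values such as `'A': 1.` and `'F': 'foo'` are repeated for every row, while the array-like values set the number of rows. A smaller sketch of the same idea, using made-up names and scores:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the list and the array fix the length at 3 rows.
scores = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "score": np.array([90, 85, 70], dtype="int32"),
    "passed": True,  # scalar, broadcast to all 3 rows
})

print(scores.shape)
print(scores.dtypes)
```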


## 2. View data 

In [23]:
df.head()  # first 5 rows, rendered as a table

# df.tail() shows the last 5 rows

|  | midterm | final |
| ------------- | ------------- | ------------- |
| 0 | -0.360622 | 1.349659 |
| 1 | 0.819132 | -1.328718 |
| 2 | -0.108703 | -0.28548 |
| 3 | -0.951495 | -1.178266 |
| 4 | -2.127274 | 0.609882 |


In [31]:
df.index

RangeIndex(start=0, stop=10, step=1)

In [32]:
df.columns

Index(['midterm', 'final'], dtype='object')

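`.to_numpy()` converts the DataFrame to a NumPy array. A NumPy array has a single dtype for all its elements, so columns of different types are converted to one common dtype.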

In [33]:
df.to_numpy()  

array([[-0.36062213,  1.34965881],
       [ 0.81913223, -1.3287182 ],
       [-0.1087033 , -0.28548008],
       [-0.95149473, -1.17826606],
       [-2.12727402,  0.60988196],
       [-1.03389221,  1.59490136],
       [-1.52695573, -1.99379197],
       [-0.43388868,  0.30650135],
       [ 1.27123464,  0.30440786],
       [ 1.19115262, -0.40480618]])
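
To see what a uniform dtype means in practice, the sketch below compares an all-numeric DataFrame with one that mixes numbers and strings. The two small frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one numeric only, one mixing floats and strings.
numeric = pd.DataFrame({"midterm": [1.0, 2.0], "final": [3, 4]})
mixed = pd.DataFrame({"midterm": [1.0, 2.0], "grade": ["A", "C"]})

# Integers and floats share a common numeric dtype.
print(numeric.to_numpy().dtype)

# Floats and strings have no common numeric dtype, so pandas falls back to object.
print(mixed.to_numpy().dtype)
```

An `object` array is slower and loses most of NumPy's numeric functionality, so it is usually better to select the numeric columns before converting.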

In [35]:
df.describe()    # similar to summary() in R

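|  | midterm | final |
| ------------- | ------------- | ------------- |
| count | 10.0 | 10.0 |
| mean | -0.326131 | -0.102571 |
| std | 1.145209 | 1.16395 |
| min | -2.127274 | -1.993792 |
| 25% | -1.013293 | -0.984901 |
| 50% | -0.397255 | 0.009464 |
| 75% | 0.587173 | 0.534037 |
| max | 1.271235 | 1.594901 |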


A DataFrame can be sorted either by its labels (`sort_index`) or by its values (`sort_values`).

In [37]:
df.sort_index(axis=1)   # sort by name of columns

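|  | final | midterm |
| ------------- | ------------- | ------------- |
| 0 | 1.349659 | -0.360622 |
| 1 | -1.328718 | 0.819132 |
| 2 | -0.28548 | -0.108703 |
| 3 | -1.178266 | -0.951495 |
| 4 | 0.609882 | -2.127274 |
| 5 | 1.594901 | -1.033892 |
| 6 | -1.993792 | -1.526956 |
| 7 | 0.306501 | -0.433889 |
| 8 | 0.304408 | 1.271235 |
| 9 | -0.404806 | 1.191153 |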


In [39]:
df.sort_values(by='final')  # sort by values in column "final"

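|  | midterm | final |
| ------------- | ------------- | ------------- |
| 6 | -1.526956 | -1.993792 |
| 1 | 0.819132 | -1.328718 |
| 3 | -0.951495 | -1.178266 |
| 9 | 1.191153 | -0.404806 |
| 2 | -0.108703 | -0.28548 |
| 8 | 1.271235 | 0.304408 |
| 7 | -0.433889 | 0.306501 |
| 4 | -2.127274 | 0.609882 |
| 0 | -0.360622 | 1.349659 |
| 5 | -1.033892 | 1.594901 |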
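
`sort_values` also accepts `ascending=False` and a list of columns. A minimal sketch on a small made-up frame (the values are not from the random data above):

```python
import pandas as pd

# Hypothetical scores, chosen so the expected order is easy to check by eye.
scores = pd.DataFrame({
    "grade": ["C", "A", "C", "A"],
    "final": [0.3, -1.3, 1.6, 0.6],
})

# Highest final score first.
by_final_desc = scores.sort_values(by="final", ascending=False)

# Sort by grade first, then by final within each grade.
by_grade_then_final = scores.sort_values(by=["grade", "final"])

print(by_final_desc.index.tolist())
print(by_grade_then_final.index.tolist())
```

Both calls return a new DataFrame and leave `scores` in its original order.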


## 3. Selection 

In [42]:
df.midterm      # equivalent to df["midterm"]

0   -0.360622
1    0.819132
2   -0.108703
3   -0.951495
4   -2.127274
5   -1.033892
6   -1.526956
7   -0.433889
8    1.271235
9    1.191153
Name: midterm, dtype: float64

In [44]:
df[0:3]  # first 3 rows

|  | midterm | final |
| ------------- | ------------- | ------------- |
| 0 | -0.360622 | 1.349659 |
| 1 | 0.819132 | -1.328718 |
| 2 | -0.108703 | -0.28548 |


In [49]:
df.iloc[1,1]  # select by integer position: second row, second column

-1.32871819775173
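
`iloc` selects by integer position, while its counterpart `loc` selects by label. With the default integer index the two look alike, so the sketch below uses a made-up string index to show the difference.

```python
import pandas as pd

# Hypothetical frame with string labels as the index.
scores = pd.DataFrame(
    {"midterm": [0.5, -1.2, 0.8], "final": [1.3, -0.4, 0.1]},
    index=["ann", "bob", "cid"],
)

by_position = scores.iloc[1, 1]        # second row, second column
by_label = scores.loc["bob", "final"]  # same cell, addressed by label

first_two = scores.iloc[0:2]           # positions 0 and 1, end excluded
ann_to_bob = scores.loc["ann":"bob"]   # label slice, both ends included

print(by_position, by_label)
print(first_two.index.tolist(), ann_to_bob.index.tolist())
```

Note that a position slice excludes its end point, whereas a label slice includes it.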

In [51]:
df[df["final"]>0]  # select by boolean

|  | midterm | final |
| ------------- | ------------- | ------------- |
| 0 | -0.360622 | 1.349659 |
| 4 | -2.127274 | 0.609882 |
| 5 | -1.033892 | 1.594901 |
| 7 | -0.433889 | 0.306501 |
| 8 | 1.271235 | 0.304408 |


We can also filter rows with `isin()`, which tests whether each value is in a given list.

In [54]:
df["grade"] = np.repeat(["A","C"],5)
print(df)

    midterm     final grade
0 -0.360622  1.349659     A
1  0.819132 -1.328718     A
2 -0.108703 -0.285480     A
3 -0.951495 -1.178266     A
4 -2.127274  0.609882     A
5 -1.033892  1.594901     C
6 -1.526956 -1.993792     C
7 -0.433889  0.306501     C
8  1.271235  0.304408     C
9  1.191153 -0.404806     C


In [57]:
df[df["grade"].isin(["A"])]  # select those who got "A"

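|  | midterm | final | grade |
| ------------- | ------------- | ------------- | ------------- |
| 0 | -0.360622 | 1.349659 | A |
| 1 | 0.819132 | -1.328718 | A |
| 2 | -0.108703 | -0.28548 | A |
| 3 | -0.951495 | -1.178266 | A |
| 4 | -2.127274 | 0.609882 | A |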
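
`isin()` is most useful with several values, and its boolean result can be negated with `~` or combined with other conditions using `&`. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical grades and scores, for illustration only.
scores = pd.DataFrame({
    "final": [1.3, -0.4, 0.1, 0.9],
    "grade": ["A", "B", "C", "A"],
})

# Rows whose grade is either "A" or "B".
a_or_b = scores[scores["grade"].isin(["A", "B"])]

# Rows whose grade is anything except "A".
not_a = scores[~scores["grade"].isin(["A"])]

# Rows with grade "A" and a positive final score.
# Each condition needs its own parentheses when combined with &.
a_and_positive = scores[scores["grade"].isin(["A"]) & (scores["final"] > 0)]

print(a_or_b.index.tolist())
print(not_a.index.tolist())
print(a_and_positive.index.tolist())
```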
