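# Introduction to pandas

pandas is a tool built on top of NumPy and created for data analysis tasks. It brings together a number of libraries and standard data models, provides the tools needed to work efficiently with large datasets, and offers many functions and methods for processing data quickly and conveniently.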

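## Overview of pandas objects
   - Viewed from the bottom up, pandas objects can be thought of as **enhanced versions of NumPy structured arrays**: rows and columns are no longer identified only by integer positions and can also carry labels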
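As a rough illustration of the "labels on top of an array" idea, the sketch below wraps a small NumPy array in a Series. The values and the labels `'x'`, `'y'`, `'z'` are made up for this example.

```python
import numpy as np
import pandas as pd

# A plain NumPy array is addressed only by integer position.
arr = np.array([10.0, 20.0, 30.0])
print(arr[1])        # second element, by position

# A Series holds the same values but attaches a label to each one.
# The labels are hypothetical and exist only for this illustration.
ser = pd.Series(arr, index=['x', 'y', 'z'])
print(ser['y'])      # same element, looked up by label
print(ser.iloc[1])   # positional access is still available through iloc
```

The label lookup and the positional lookup return the same element; the label is simply an extra way to address it.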

### Installing pandas: pandas depends on NumPy, which pip installs automatically if it is missing
>#### pip3 install pandas

In [2]:
import numpy as np
import pandas as pd

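# Basic pandas data structures

### Two data structures are used most often
>#### 1: Series: a one-dimensional array, similar to a one-dimensional NumPy array; both resemble Python's built-in list. A Series can hold strings, booleans, numbers, and other values; when the values are of mixed types, its dtype becomes object
>#### 2: DataFrame: a two-dimensional, table-like data structure. Much of its functionality resembles R's data.frame. A DataFrame can be thought of as a container of Series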
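The "container of Series" view can be sketched as follows. The column names and numbers are invented for the example; the point is only that each column of a DataFrame comes back as a Series sharing the frame's index.

```python
import pandas as pd

# Two Series that share the same index; names and numbers are illustrative only.
height = pd.Series([1.70, 1.82], index=['ann', 'bob'])
weight = pd.Series([60.0, 75.0], index=['ann', 'bob'])

# Each dictionary key becomes a column that holds one Series.
people = pd.DataFrame({'height': height, 'weight': weight})

col = people['height']                  # selecting a column gives back a Series
print(type(col).__name__)
print(col.index.equals(people.index))   # the column shares the frame's row index
```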

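### 1: Series: can be initialized from a one-dimensional list

#### A Series is a specialized dictionary
   - A Series object can be viewed as a specialized Python dictionary that maps index labels to values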

In [18]:
population_dict = {'California':388888,
                  'Texas':3725735,
                  'New York':18976586}
population = pd.Series(population_dict)
population

California      388888
New York      18976586
Texas          3725735
dtype: int64

In [20]:
population['Texas']

3725735
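To make the dictionary analogy a little more concrete, here is a small sketch that reuses the same population figures. It shows a few dictionary-style operations, followed by two array-style operations that a plain `dict` does not provide. The variable names `large` and `in_millions` are introduced only for this example.

```python
import pandas as pd

population = pd.Series({'California': 388888,
                        'Texas': 3725735,
                        'New York': 18976586})

# Dictionary-style operations
print('Texas' in population)      # membership test against the labels
print(list(population.keys()))    # keys() returns the index

# Array-style operations that a plain dict does not offer
large = population[population > 1_000_000]   # boolean mask over the values
in_millions = population / 1_000_000         # element-wise arithmetic
print(large)
print(in_millions)
```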

In [8]:
s = pd.Series([1,3,2,np.nan,5])
print(s)

0    1.0
1    3.0
2    2.0
3    NaN
4    5.0
dtype: float64


In [9]:
S = pd.Series([1,3,2,np.nan,5],index=['a','b','c','d','e'])
print(S)

a    1.0
b    3.0
c    2.0
d    NaN
e    5.0
dtype: float64


#### Index: the row labels of the data

In [10]:
s.index

RangeIndex(start=0, stop=5, step=1)

#### Values

In [11]:
s.values

array([ 1.,  3.,  2., nan,  5.])

In [12]:
s[2]

2.0

#### Slicing

In [13]:
s[2:4]

2    2.0
3    NaN
dtype: float64

In [15]:
s[::2]

0    1.0
2    2.0
4    5.0
dtype: float64

In [21]:
# label-based slice; use S, which carries the string index 'a'..'e'
S['a':'c']

a    1.0
b    3.0
c    2.0
dtype: float64
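One detail in the slices above is easy to miss: a positional slice excludes its stop position, while a label slice includes its stop label. The sketch below spells this out with the explicit `iloc` and `loc` accessors; the variable names `by_position` and `by_label` are introduced only for this example.

```python
import numpy as np
import pandas as pd

S = pd.Series([1, 3, 2, np.nan, 5], index=['a', 'b', 'c', 'd', 'e'])

by_position = S.iloc[0:2]    # positions 0 and 1; the stop position is excluded
by_label = S.loc['a':'c']    # labels 'a', 'b', 'c'; the stop label is included

print(list(by_position.index))
print(list(by_label.index))
```

Using `iloc` and `loc` explicitly avoids any ambiguity about whether a slice is meant by position or by label.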

#### Assigning the index

In [17]:
s.index.name = 'idx'
s

idx
0    1.0
1    3.0
2    2.0
3    NaN
4    5.0
dtype: float64

In [19]:
s.index = list('abcde')
s

a    1.0
b    3.0
c    2.0
d    NaN
e    5.0
dtype: float64

### 2: DataFrame

#### Construct a range of dates to use as the row index (the first dimension)

In [4]:
date = pd.date_range('20190731',periods=6)
print(date)

DatetimeIndex(['2019-07-31', '2019-08-01', '2019-08-02', '2019-08-03',
               '2019-08-04', '2019-08-05'],
              dtype='datetime64[ns]', freq='D')
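As a side note, the same daily range can be described either by a start date plus a number of periods or by a start date and an end date. The following sketch compares the two forms; the names `by_count` and `by_end` are introduced only for this example.

```python
import pandas as pd

# Start date plus a number of daily periods
by_count = pd.date_range('2019-07-31', periods=6)

# Start date and end date; both endpoints are included
by_end = pd.date_range('2019-07-31', '2019-08-05')

print(by_count.equals(by_end))
print(len(by_count))
```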


#### Then create a DataFrame

In [5]:
df = pd.DataFrame(np.random.randn(6,4))
df

|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.034938 | -0.242218 | 1.388527 | -1.709883 |
| 1 | 0.042676 | 0.793892 | -0.122068 | -0.390821 |
| 2 | 0.208682 | -0.940596 | -2.19217 | 0.741125 |
| 3 | 1.394676 | -0.665563 | 0.454095 | 1.619273 |
| 4 | -0.70457 | 0.303478 | 0.018864 | -0.434929 |
| 5 | -1.219872 | 0.75976 | -0.695634 | 0.366834 |


In [6]:
DF = pd.DataFrame(np.random.randn(6,4), index=date)
DF

|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 2019-07-31 | 0.019553 | 0.126029 | -0.740227 | -0.12271 |
| 2019-08-01 | -0.652874 | -0.602815 | 0.399465 | -2.316788 |
| 2019-08-02 | 1.451697 | 0.608558 | -0.030341 | 1.784204 |
| 2019-08-03 | 0.067531 | 0.445565 | 1.791357 | -0.351652 |
| 2019-08-04 | 1.207922 | 0.410494 | 0.494394 | -1.691036 |
| 2019-08-05 | -0.582755 | 0.924117 | 1.986012 | 1.195991 |


In [7]:
df1 = pd.DataFrame(np.random.randn(6,4),index=date,columns=list('ABCD'))
df1

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2019-07-31 | 0.806069 | -1.334987 | -0.431877 | 1.697415 |
| 2019-08-01 | -0.440896 | -1.761438 | -1.271346 | -0.351532 |
| 2019-08-02 | -1.189142 | 0.833549 | 0.221037 | 1.090466 |
| 2019-08-03 | -1.607332 | -0.246789 | 0.126716 | -0.96284 |
| 2019-08-04 | -0.165677 | 0.700416 | 0.858383 | 0.400744 |
| 2019-08-05 | 0.262463 | -1.143 | -0.822703 | 0.211922 |


##### By default, when the index and columns arguments are not given, both are filled with integers starting from 0
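A quick way to check this default is to build a small frame without either argument and inspect its labels. The frame below is a made-up 2×3 array of zeros used only for this check.

```python
import numpy as np
import pandas as pd

# No index or columns given, so pandas numbers both axes from 0.
frame = pd.DataFrame(np.zeros((2, 3)))

print(list(frame.index))     # row labels
print(list(frame.columns))   # column labels
```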

### Data can be passed in as a dictionary

In [3]:
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A': 1., 'B': pd.Timestamp('20181001'),
                    'C': pd.Series(1, index=list(range(4)), dtype=float),
                    'D': np.array([3] * 4, dtype=int),
                    'E': pd.Categorical(["test", "train", "test", "train"]), 'F': 'abc'})
df2

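|  | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2018-10-01 | 1.0 | 3 | test | abc |
| 1 | 1.0 | 2018-10-01 | 1.0 | 3 | train | abc |
| 2 | 1.0 | 2018-10-01 | 1.0 | 3 | test | abc |
| 3 | 1.0 | 2018-10-01 | 1.0 | 3 | train | abc |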


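#### Each dictionary key becomes a column; its value can be any object that can be converted to a Series

#### A DataFrame only requires a single dtype within each column, whereas a Series has one dtype for all of its values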
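The difference can be observed through the `dtypes` and `dtype` attributes. In the sketch below, the column names `n`, `x`, `label` and the values are invented for illustration.

```python
import pandas as pd

# Each column of a DataFrame keeps its own dtype.
df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5], 'label': ['a', 'b']})
print(df.dtypes)

# A Series has exactly one dtype; mixed values fall back to the generic object dtype.
mixed = pd.Series([1, 'two', True])
print(mixed.dtype)
```

A Series can therefore hold values of different Python types, but only by storing them under the single catch-all `object` dtype.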

## Viewing data
### First and last rows

The head and tail methods show the first and last rows of the data, respectively (5 rows by default)

In [8]:
df1.head()

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2019-07-31 | 0.806069 | -1.334987 | -0.431877 | 1.697415 |
| 2019-08-01 | -0.440896 | -1.761438 | -1.271346 | -0.351532 |
| 2019-08-02 | -1.189142 | 0.833549 | 0.221037 | 1.090466 |
| 2019-08-03 | -1.607332 | -0.246789 | 0.126716 | -0.96284 |
| 2019-08-04 | -0.165677 | 0.700416 | 0.858383 | 0.400744 |


In [10]:
df1.head(2)

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2019-07-31 | 0.806069 | -1.334987 | -0.431877 | 1.697415 |
| 2019-08-01 | -0.440896 | -1.761438 | -1.271346 | -0.351532 |


In [9]:
df1.tail()

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2019-08-01 | -0.440896 | -1.761438 | -1.271346 | -0.351532 |
| 2019-08-02 | -1.189142 | 0.833549 | 0.221037 | 1.090466 |
| 2019-08-03 | -1.607332 | -0.246789 | 0.126716 | -0.96284 |
| 2019-08-04 | -0.165677 | 0.700416 | 0.858383 | 0.400744 |
| 2019-08-05 | 0.262463 | -1.143 | -0.822703 | 0.211922 |


In [12]:
df1.tail(3)

|  | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2019-08-03 | -1.607332 | -0.246789 | 0.126716 | -0.96284 |
| 2019-08-04 | -0.165677 | 0.700416 | 0.858383 | 0.400744 |
| 2019-08-05 | 0.262463 | -1.143 | -0.822703 | 0.211922 |


### Row index, column labels, and values

In [13]:
# row index
df1.index

DatetimeIndex(['2019-07-31', '2019-08-01', '2019-08-02', '2019-08-03',
               '2019-08-04', '2019-08-05'],
              dtype='datetime64[ns]', freq='D')

In [14]:
# column labels
df1.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [15]:
# underlying data as a NumPy array
df1.values

array([[ 0.80606883, -1.33498678, -0.43187729,  1.6974149 ],
       [-0.44089612, -1.76143848, -1.27134569, -0.35153243],
       [-1.1891419 ,  0.83354876,  0.22103723,  1.09046618],
       [-1.60733189, -0.24678886,  0.12671564, -0.96283979],
       [-0.16567671,  0.70041635,  0.85838324,  0.40074355],
       [ 0.26246301, -1.14299993, -0.82270319,  0.21192212]])
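In more recent pandas versions, the documentation recommends the `to_numpy()` method over the `values` attribute for this conversion. A minimal sketch, using a small made-up frame rather than `df1`:

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the numbers are arbitrary.
df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['A', 'B'])

arr = df.to_numpy()                     # explicit conversion to a NumPy array
print(type(arr).__name__, arr.shape)
print(np.array_equal(arr, df.values))   # same contents as the values attribute
```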