In [48]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd

# Introduction to pandas Objects

**pandas has three fundamental data structures: Series, DataFrame, and Index**

## 1. The pandas Series Object

### 1.1 A Series is a <u>one-dimensional array</u> of indexed data

In [49]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The `index` and `values` attributes give access to the index and the values of a Series.

In [50]:
data.index  # the index attribute returns an array-like pd.Index object

RangeIndex(start=0, stop=4, step=1)

In [51]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [52]:
print(type(data.index), type(data.values))

<class 'pandas.core.indexes.range.RangeIndex'> <class 'numpy.ndarray'>


In [53]:
# slicing
data[1:3]

1    0.50
2    0.75
dtype: float64

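Unlike a NumPy array, whose integer index is implicit, a pandas Series has an explicitly defined index associated with its values.  
A custom index can be supplied through the `index` parameter.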


In [54]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])

In [55]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [56]:
data['b']

0.5

The index can also be non-contiguous and non-sequential.

In [57]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], 
                index=[3, 4, 2, 1])

In [58]:
data

3    0.25
4    0.50
2    0.75
1    1.00
dtype: float64

In [59]:
data[4]

0.5
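With an integer index like this one, `data[4]` is a label lookup, not a positional one, which is easy to misread. The sketch below reuses the same Series and makes the intent explicit with `.loc` (labels) and `.iloc` (positions); the variable names are only illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[3, 4, 2, 1])

# .loc always interprets its argument as an index label
by_label = data.loc[4]

# .iloc always interprets its argument as a zero-based position
by_position = data.iloc[3]

print(by_label, by_position)
```

Spelling out `.loc` or `.iloc` avoids depending on how plain `[]` resolves integers for a given index.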

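### 1.2 Series as a specialized dictionary

A pd.Series can be thought of as a specialized dictionary.  
A Series is a structure that maps typed keys to a set of typed values.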

In [60]:
# create a Series from a Python dictionary
population_dict = {'California': 38332521, 
                   'Texas': 26448193, 
                   'New York': 19651127, 
                   'Florida': 19552860, 
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

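When a Series is created from a dictionary, its index follows the order of the dictionary's keys by default, as the output above shows.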

In [61]:
population['Florida']

19552860

Besides lookup by key, a Series also supports array-style operations such as slicing.

In [62]:
# slicing by label (the end label is included)
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
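The slice above happens to span the whole Series, so it does not show how label slices differ from positional ones. A minimal sketch on the same data, with illustrative variable names:

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# label-based slice: both endpoints are included
by_label = population.loc['Texas':'Florida']

# position-based slice: the stop position is excluded, as in NumPy
by_position = population.iloc[1:3]

# boolean masking also works, as with NumPy arrays
large = population[population > 20_000_000]

print(by_label, by_position, large, sep='\n\n')
```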

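### 1.3 Constructing Series objects

`index` is an optional argument, and the data can take several forms: a list or NumPy array, a scalar, or a dictionary.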

In [63]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [64]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [65]:
a = pd.Series({2: 'a', 3: 'b', 1: 'c'})
a.sort_index()

1    c
2    a
3    b
dtype: object

In [66]:
# keep only the entries whose keys appear in index
pd.Series({2: 'a', 3: 'b', 1: 'c'}, index=[2, 1])

2    a
1    c
dtype: object
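When `index` is passed together with a dictionary, the result follows the order of `index`, and a label with no matching key gets a missing value. A small sketch, assuming a label `4` that is absent from the dictionary:

```python
import pandas as pd

letters = {2: 'a', 3: 'b', 1: 'c'}

# label 4 is not a key of the dictionary, so its value is NaN
selected = pd.Series(letters, index=[3, 1, 4])

print(selected)
```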

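## 2. The pandas DataFrame Object

### 2.1 DataFrame as a generalized NumPy array

A DataFrame is a two-dimensional array with both flexible row indices and flexible column names.

A DataFrame can be thought of as a sequence of aligned Series objects, that is, Series sharing the same index.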

In [67]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297, 
             'Florida': 170312, 
             'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [75]:
states = pd.DataFrame({'population': population, 
                       'area': area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
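"Aligned" means that the columns are matched by index label, not by position. The sketch below uses two shortened Series, deliberately written in different key orders, to illustrate this; the data is a subset of the values above.

```python
import pandas as pd

# same labels, different order
pop = pd.Series({'Texas': 26448193, 'California': 38332521})
area = pd.Series({'California': 423967, 'Texas': 695662})

# the constructor aligns both columns on their index labels
states_small = pd.DataFrame({'population': pop, 'area': area})

print(states_small)
```

Each row ends up with the population and area of the same state, regardless of the order in which the Series were written.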


A DataFrame has two label-holding attributes, `index` and `columns`; the latter is an Index object that stores the column labels.

In [69]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [74]:
states.columns

Index(['population', 'area'], dtype='object')

### 2.2 DataFrame as a specialized dictionary

In [76]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
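In this dictionary view the keys are column names, which differs from a two-dimensional NumPy array, where `arr[0]` returns a row. A short sketch on a two-state subset of the data (names are illustrative):

```python
import pandas as pd

states = pd.DataFrame({
    'population': pd.Series({'California': 38332521, 'Texas': 26448193}),
    'area': pd.Series({'California': 423967, 'Texas': 695662}),
})

column = states['area']        # dictionary-style access returns a column
row = states.loc['Texas']      # rows are reached through .loc

has_area = 'area' in states    # membership tests look at column names
has_texas = 'Texas' in states  # row labels are not keys

print(list(states.keys()), has_area, has_texas)
```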

### 2.3 Constructing DataFrame objects

#### 1. From a single Series object

In [78]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### 2. From a list of dictionaries

In [82]:
data = [{'a': i, 'b': 2*i} for i in range(3)]
print(data)
pd.DataFrame(data)

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]


   a  b
0  0  0
1  1  2
2  2  4


When a key is missing from some of the dictionaries, pandas fills the corresponding entries with NaN (missing values).

In [85]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
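A side effect worth noticing is that columns containing NaN are stored as floats, which is why `a` and `c` print as `1.0` and `4.0`. The sketch below checks the dtypes and shows one possible way to fill the gaps; filling with `0` is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# columns with a missing entry are upcast to float64; 'b' stays integer
dtypes = df.dtypes.astype(str).to_dict()

# count the missing entries in each column
missing_per_column = df.isna().sum().to_dict()

# replace NaN with 0 and convert back to integers
filled = df.fillna(0).astype(int)

print(dtypes, missing_per_column, filled, sep='\n')
```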


#### 3. From a dictionary of Series objects

In [87]:
pd.DataFrame({'population': population, 'area': area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


#### 4. From a two-dimensional NumPy array

In [88]:
pd.DataFrame(np.random.rand(3, 2), 
            columns=['foo', 'bar'], 
            index=['a', 'b', 'c'])

        foo       bar
a  0.014669  0.887388
b  0.735861  0.823678
c  0.417708  0.026227


#### 5. From a NumPy structured array

In [89]:
a = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
a

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [90]:
pd.DataFrame(a)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0
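With an all-zero array it is hard to see what is carried over. The sketch below uses a structured array with made-up values to show that the field names become column names and the field dtypes are kept.

```python
import numpy as np
import pandas as pd

# made-up records: field 'A' is a 64-bit integer, field 'B' a 64-bit float
records = np.array([(1, 2.5), (2, 3.5), (3, 4.5)],
                   dtype=[('A', 'i8'), ('B', 'f8')])

df = pd.DataFrame(records)

print(df)
print(df.dtypes)
```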


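## 3. The pandas Index Object

An Index can be thought of either as an immutable array or as an ordered set (technically a multiset, since an Index may contain repeated values).

### 3.1 Index as an immutable array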

In [91]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [92]:

ind[1]

3

In [93]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [95]:
print(ind.shape, ind.ndim, ind.size, ind.dtype)

(5,) 1 5 int64


**Index objects are immutable, which makes it safe to share an index between multiple DataFrames and arrays**

In [96]:
ind[0] = 1  # raises TypeError

TypeError: Index does not support mutable operations
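Because in-place assignment is rejected, a changed index has to be built as a new object. A minimal sketch, using `Index.delete` and `Index.insert` as one possible way to do so:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[0] = 1            # item assignment is not supported
    mutated = True
except TypeError as exc:
    mutated = False
    print('TypeError:', exc)

# build a modified copy instead; the original index is left untouched
new_ind = ind.delete(0).insert(0, 1)

print(list(ind), list(new_ind))
```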

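### 3.2 Index as an ordered set

Index objects follow many of the conventions of Python's built-in set data structure, supporting union, intersection, difference, and symmetric difference.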

In [98]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [99]:
indA.intersection(indB)  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [100]:
indA.union(indB)  # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [101]:
indA.symmetric_difference(indB)  # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')
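The set operations can be collected in one place using the method forms. Older code often spells them with the operators `&`, `|`, and `^`; to my understanding, those spellings were deprecated for set operations and act element-wise in pandas 2.x, so the methods are the safer choice across versions. The sketch also adds the plain (one-sided) difference.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)          # labels in both
combined = indA.union(indB)               # labels in either
only_one = indA.symmetric_difference(indB)  # labels in exactly one
only_a = indA.difference(indB)            # labels in indA but not in indB

print(list(common), list(combined), list(only_one), list(only_a))
```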