In [2]:
import pandas as pd
from pandas import Series, DataFrame

# Chap7. Data Wrangling: Clean, Transform, Merge, Reshape

## Combining and Merging Data Sets
- `pandas.merge` joins different DataFrames on one or more keys
- `pandas.concat` stacks multiple objects together along an axis




### Database-style DataFrame Merges

Parameters of the `merge` function

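Parameters       | Descriptions
---:             | :---
left             | DataFrame to be merged on the left side
right            | DataFrame to be merged on the right side
how              | Join type: one of 'inner', 'outer', 'left', 'right'; defaults to 'inner'
on               | Column names to join on; must exist in both DataFrames. If omitted and no other keys are given, the columns common to both are used
left_on          | Columns in the left DataFrame to use as join keys
right_on         | Columns in the right DataFrame to use as join keys
left_index       | Use the row index of the left DataFrame as its join key
right_index      | Use the row index of the right DataFrame as its join key
sort             | Sort the merged data by the join keys. These notes list the default as True; recent pandas releases default to False. Disabling it can improve performance on large data sets
suffixes         | Tuple of strings appended to overlapping column names; defaults to ('_x', '_y')
copy             | If False, avoid copying data into the result where possible; copies by default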
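
The parameters above can be sketched on two small frames. The frames `left` and `right`, their column names, and their values are hypothetical and chosen only for illustration; this is a minimal sketch, not the original notebook's example.

```python
import pandas as pd

# Hypothetical example frames; names and values are illustrative only.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'value': [20, 30, 40]})

# Inner join on the shared 'key' column. The overlapping 'value' column
# receives the default suffixes '_x' and '_y'.
inner = pd.merge(left, right, on='key', how='inner')

# Outer join keeps keys found in either frame; custom suffixes replace
# the defaults.
outer = pd.merge(left, right, on='key', how='outer',
                 suffixes=('_left', '_right'))

# Join a column of the left frame against the index of the right frame.
right_indexed = right.set_index('key')
by_index = pd.merge(left, right_indexed, left_on='key', right_index=True)

print(inner)
print(outer)
print(by_index)
```

Rows whose key appears in only one frame survive the outer join with NaN in the columns that come from the other frame.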

### Merging on Index


### Concatenating Along an Axis
`concat()` works along axis=0 by default

##### Hierarchical Indexing
```python
pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])
```

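Parameters of the `concat` function

Parameters             | Descriptions
---:                   | :---
objs                   | List or dict of pandas objects to concatenate
axis                   | Axis to concatenate along; defaults to 0
join                   | Either 'inner' or 'outer' (the default): whether to intersect or union the indexes along the other axes
join_axes              | Specific indexes to use for the other n-1 axes instead of performing union/intersection logic (removed in pandas 1.0)
keys                   | Values used to form a hierarchical index along the concatenation axis
levels                 | Specific indexes to use as the levels of the hierarchical index, if keys is passed
names                  | Names for the created hierarchical levels, if keys and/or levels are passed
verify_integrity       | Check the new axis of the result for duplicates and raise an exception if any are found; defaults to False
ignore_index           | Do not preserve the indexes along the concatenation axis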
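
A short sketch of several of these options follows. The Series `s1`, `s2`, `s3` are given hypothetical values here (the notes above use the names without defining them), and the level names passed to `names` are also made up for illustration.

```python
import pandas as pd

# Hypothetical Series with non-overlapping indexes.
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

# Default: stack along axis=0, keeping the original labels.
stacked = pd.concat([s1, s2, s3])

# keys builds a hierarchical index on the concatenation axis;
# names labels its levels.
keyed = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'],
                  names=['source', 'label'])

# ignore_index discards the original labels and renumbers from 0.
renumbered = pd.concat([s1, s2, s3], ignore_index=True)

# With axis=1 the keys become column headers instead.
wide = pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

# verify_integrity raises when the new axis would contain duplicates.
try:
    pd.concat([s1, s1], verify_integrity=True)
except ValueError as exc:
    print('ValueError:', exc)

print(keyed)
print(wide)
```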

### Combining Data with Overlap

## Reshaping and Pivoting
### Reshaping with Hierarchical Indexing
- `stack`: "rotates" or pivots the columns of the data into the rows
- `unstack`: pivots the rows into the columns
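
As a minimal sketch of the two operations, assuming a small hypothetical DataFrame whose labels and values are invented for illustration:

```python
import pandas as pd

# Hypothetical frame with named row and column indexes.
df = pd.DataFrame([[0, 1, 2], [3, 4, 5]],
                  index=pd.Index(['Ohio', 'Colorado'], name='state'),
                  columns=pd.Index(['one', 'two', 'three'], name='number'))

# stack: columns move into the rows, giving a Series indexed by
# (state, number).
stacked = df.stack()

# unstack: the innermost row level (number) moves back into the columns.
restored = stacked.unstack()

# A different level can be unstacked by name.
by_state = stacked.unstack('state')

print(stacked)
print(restored)
print(by_state)
```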

### Pivoting "long" to "wide" Format

## Data Transformation
### Removing Duplicates
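- `df.duplicated()` returns a boolean Series indicating whether each row is a duplicate of an earlier row
- `df.drop_duplicates()` returns a DataFrame with the duplicate rows removed
    - `df.drop_duplicates(['col1'])` restricts the duplicate check to the given columns
    - `take_last=True` keeps the last occurrence instead of the first (the default); newer pandas versions spell this `keep='last'`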


In [3]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                  'k2': [1, 1, 2, 3, 3, 4, 4]})

In [4]:
# Returns whether each row is a duplicate of an earlier row
data.duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [6]:
# drop_duplicates
data.drop_duplicates()

    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
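
The column-subset and keep-last options can be sketched on the same data. This assumes a pandas version that accepts the `keep` argument (the replacement for `take_last`); the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# Check duplicates on k1 only; the first row of each k1 value is kept.
by_k1_first = data.drop_duplicates(['k1'])

# Same check, but keep the last row of each k1 value.
by_k1_last = data.drop_duplicates(['k1'], keep='last')

# keep=False flags every row that has a duplicate anywhere in the frame.
all_dupes = data.duplicated(keep=False)

print(by_k1_first)
print(by_k1_last)
print(all_dupes)
```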


### Transforming Data Using a Function or Mapping
`map()` is a convenient way to perform element-wise transformations and other data-cleaning operations

In [17]:
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon', 'pastrami', 'honey ham',
                          'noval lox'],
                 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig'
}

In [19]:
# Values missing from the mapping produce NaN
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
data

          food  ounces animal
0        bacon     4.0    pig
1  pulled pork     3.0    pig
2        bacon    12.0    pig
3     Pastrami     6.0    cow
4  corned beef     7.5    cow
5        Bacon     8.0    pig
6     pastrami     3.0    cow
7    honey ham     5.0    pig
8    noval lox     6.0    NaN


In [21]:
# A lambda that indexes the dict directly raises KeyError for values missing from the mapping
try:
    data['food'].map(lambda x: meat_to_animal[x.lower()])
except KeyError:
    print('KeyError')

KeyError
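
One possible way to keep the lambda form without the exception is to look keys up with `dict.get`. The sketch below uses a shortened, hypothetical version of the data above, and the `'unknown'` fallback label is an assumption added for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the food data.
data = pd.DataFrame({'food': ['bacon', 'Pastrami', 'noval lox'],
                     'ounces': [4, 6, 6]})
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# dict.get returns None for a missing key instead of raising KeyError.
animal = data['food'].map(lambda x: meat_to_animal.get(x.lower()))

# A default label can be supplied for unmapped foods.
animal_filled = data['food'].map(
    lambda x: meat_to_animal.get(x.lower(), 'unknown'))

# Equivalent to the dict-based form, using the vectorized .str accessor.
animal_str = data['food'].str.lower().map(meat_to_animal)

print(animal)
print(animal_filled)
print(animal_str)
```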


In [None]:
### Replacing Values
### Renaming Axis Indexes
### Discretization and Binning
### Detecting and Filtering Outliers
### Permutation and Random Sampling
### Computing Indicator/Dummy Variables

In [None]:
## String Manipulation
### String Object Methods
### Regular Expressions
### Vectorized String Functions in Pandas

## Example: USDA Food Database