In [6]:
import pandas as pd
import numpy as np

# pd.DataFrame()
```python
pd.DataFrame(
    data=None,
    index=None,
    columns=None,
    dtype=None,
    copy=False,
)
```
**Docstring**

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.
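
A minimal sketch of the label alignment described above. The frames `a` and `b` and their labels are made up for illustration; the point is that arithmetic matches cells by row and column label rather than by position.

```python
import pandas as pd

# Two frames whose row and column labels only partly overlap.
a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [10, 20], "y": [1, 1]}, index=["r2", "r3"])

# Addition pairs cells by (row label, column label), not by position.
total = a + b
print(total)

# Only the cell ("r2", "x") exists in both frames; every other cell is NaN.
print(total.loc["r2", "x"])
```

The result carries the union of both label sets, so it is worth checking for unexpected NaN values after arithmetic on frames with different indexes.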

**Args**

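- data: ndarray, Iterable, dict, or DataFrame. A dict can contain Series, arrays, constants, or list-like objects. If data is a dict or a list of dicts, column order follows insertion order on Python 3.6 and later.

- index: Index or array-like used for the row labels of the resulting frame. Defaults to `RangeIndex` if the input data carries no index information and no `index` is passed.

- columns: Index or array-like used for the column labels of the resulting frame. Defaults to `RangeIndex` if no column labels are provided.

- dtype: data type to force. Only a single dtype is allowed; if None, it is inferred.

- copy: whether to copy data from the inputs. Only affects DataFrame / 2d ndarray input.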
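
The following sketch exercises the arguments listed above on small made-up inputs (`arr`, `r1`/`r2`, and the record dicts are illustrative names). It only demonstrates `copy=True`, since whether `copy=False` shares memory with the input can depend on the pandas version.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])

# copy=True: the frame gets its own data, so later changes to arr do not leak in.
df = pd.DataFrame(arr, index=["r1", "r2"], columns=["a", "b"], copy=True)
arr[0, 0] = 99
print(df.loc["r1", "a"])

# No index/columns passed for a plain ndarray: both default to a RangeIndex.
plain = pd.DataFrame(arr)
print(plain.index, plain.columns)

# List of dicts: columns appear in first-seen key order, missing keys become NaN.
records = pd.DataFrame([{"name": "Lara"}, {"name": "Peter", "age": 18}])
print(list(records.columns))
```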

**See Also**

- DataFrame.from_records : Constructor from tuples, also record arrays.
- DataFrame.from_dict : From dicts of Series, arrays, or dicts.
- read_csv
- read_table
- read_clipboard

**Type**

type

In [None]:
x = {'col1': [1, 2, 3], 'col2': [3, 4, 5], "col3": list("abc")}
x = pd.DataFrame(data=x)
print(x, end="\n\n")
print(x.dtypes, end="\n\n")  # ! dtypes
print(type(x))

In [None]:
x = [[1, 2, 3], [True, False, True], list("abc")]
x = pd.DataFrame(data=x, columns=["c1", "c2", "c3"], dtype=float)
print(x, end="\n\n")
print(x.dtypes, end="\n\n")

In [None]:
x = pd.DataFrame(np.random.rand(4, 6), columns=list("abcdef"), index=list("ABCD"))
print(x.index, end="\n\n")
print(x.columns[2:], end="\n\n")
print(x.values, end="\n\n")
print(x.shape, end="\n\n")
print(x.ndim, end="\n\n")
print(x.head(2), end="\n\n")
print(x.tail(2), end="\n\n")
x.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()
print(x.describe(), end="\n\n")

In [None]:
x = [{"name": "Lara", "gender": "female"}, {"name": "Peter", "age": 18, "gender": "male"}, {"name": "Harry", "age": 24}]
x = pd.DataFrame(x)
print(x, end="\n\n")
print(x.dtypes, end="\n\n")
print(x[:2])  # with plain [], rows can only be selected with a slice
print(type(x[:2]["age"]))

## pd.DataFrame.sample()
```python
<DataFrame>.sample(
    n=None,
    frac=None,
    replace=False,
    weights=None,
    random_state=None,
    axis=None,
) -> ~FrameOrSeries
```

**Docstring**

Return a random sample of items from an axis of the object. Pass `random_state` to make the result reproducible.

**Args**

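- n: number of items to return. Cannot be used together with `frac`. Defaults to 1 if `frac=None`.

- frac: fraction of axis items to return.

- replace: whether the same row may be sampled more than once. Must be True if `frac > 1`.

- weights: str or ndarray-like. The default None gives every item equal probability. If a `Series` is passed, it is aligned with the target object on the index: index values in the weights that are not in the sampled object are ignored, and index values in the sampled object that are not in the weights get a weight of 0. When called on a `DataFrame` with `axis=0`, a column name is also accepted. Weights that are not a `Series` must have the same length as the axis being sampled. Weights that do not sum to 1 are normalized first. Missing values in the weights are treated as 0, and `inf` values are not allowed.

- random_state: int, array-like, `BitGenerator`, or `np.random.RandomState`. An int, array-like, or `BitGenerator` is used as the seed of the random number generator; an `np.random.RandomState` is used as given.

- axis: 0 or `'index'`, 1 or `'columns'`; default None, which selects the stat axis of the given data type (0 for both `Series` and `DataFrame`).

**Type**

function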
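
The weight rules above are easier to see on a small frame. This is a sketch with made-up data: the `seen` column and the `cat` label are illustrative, and sampling with `frac > 1` assumes a pandas version that supports it.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 8, 0], "seen": [10, 2, 1, 8]},
                  index=["falcon", "dog", "spider", "fish"])

# Series weights are aligned on the index: 'cat' is not in df and is ignored,
# while 'dog', 'spider' and 'fish' are missing from the weights and get 0.
w = pd.Series({"falcon": 3.0, "cat": 5.0})
picked = df.sample(n=1, weights=w, random_state=0)
print(picked.index.tolist())  # ['falcon'], the only row with a non-zero weight

# A column name can serve as weights when sampling rows of a DataFrame.
by_col = df.sample(n=2, weights="seen", random_state=1)

# The same random_state reproduces the same draw.
again = df.sample(n=2, weights="seen", random_state=1)
print(by_col.equals(again))

# frac > 1 needs replace=True because rows must repeat.
oversampled = df.sample(frac=2, replace=True, random_state=0)
print(len(oversampled))
```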


### Examples

In [None]:
x = pd.DataFrame({'num_legs': [2, 4, 8, 0], 'num_wings': [2, 0, 0, 0], 'num_specimen_seen': [10, 2, 1, 8]},
                 index=['falcon', 'dog', 'spider', 'fish'])
x

In [None]:
x.sample(n=2, weights='num_specimen_seen', random_state=1)

In [63]:
x = x.sample(frac=1, replace=False, random_state=1)
print(x[:2])

        num_legs  num_wings  num_specimen_seen
spider         8          0                  1
dog            4          0                  2


## pd.DataFrame.sort_values()

```python
<DataFrame>.sort_values(
    by,
    axis=0,
    ascending=True,
    inplace=False,
    kind='quicksort',
    na_position='last',
    ignore_index=False,
)
```

**Docstring**

Sort by the values along either axis.

**Args**

- by: str or list of str; name or list of names to sort by.
    - if `axis` is 0 or `'index'`, `by` may contain index levels and/or column labels.
    - if `axis` is 1 or `'columns'`, `by` may contain column levels and/or index labels.
    - changed in version 0.23.0: index or column level names may be specified.

- axis: {0 or 'index', 1 or 'columns'}, default 0; the axis to be sorted.

- ascending: bool or list of bool, default True; sort ascending vs. descending. Pass a list for multiple sort orders; a list of bools must match the length of `by`.

- inplace: bool, default False; if True, perform the operation in place.

- kind: {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'; choice of sorting algorithm, see also `numpy.sort`. `mergesort` is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.

- na_position: {'first', 'last'}, default 'last'; `'first'` puts NaNs at the beginning, `'last'` puts them at the end.

- ignore_index: bool, default False; if True, the resulting axis is labeled 0, 1, …, n - 1. Added in version 1.0.0.

**Returns**

- sorted_obj: DataFrame or None; a DataFrame with sorted values if `inplace=False`, None otherwise.
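
As a sketch of how several of these arguments combine, the call below sorts on two keys with a different direction per key, moves NaN to the front, and relabels the rows. The frame is a made-up reduction of the docstring example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "B", np.nan], "col2": [2, 1, 9, 8]})

# One bool per key in `by`: col1 ascending, col2 descending within equal col1.
out = df.sort_values(
    by=["col1", "col2"],
    ascending=[True, False],
    na_position="first",   # the NaN row in col1 moves to the top
    ignore_index=True,     # relabel the result 0..n-1
)
print(out)
```

The original `df` is left unchanged because `inplace` keeps its default of False.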

### Examples

>>> df = pd.DataFrame({
...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
...     'col2': [2, 1, 9, 8, 7, 4],
...     'col3': [0, 1, 9, 4, 2, 3],
... })
>>> df
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
3   NaN  8    4
4   D    7    2
5   C    4    3

Sort by col1

>>> df.sort_values(by=['col1'])
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4

Sort by multiple columns

>>> df.sort_values(by=['col1', 'col2'])
    col1 col2 col3
1   A    1    1
0   A    2    0
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4

Sort Descending

>>> df.sort_values(by='col1', ascending=False)
    col1 col2 col3
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1
3   NaN  8    4

Putting NAs first

>>> df.sort_values(by='col1', ascending=False, na_position='first')
    col1 col2 col3
3   NaN  8    4
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1

**Type**

method

## pd.DataFrame.loc
`x.loc[]`

**Docstring**

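Access a group of rows and columns by label(s) or a boolean array. ``.loc[]`` is primarily label based, but may also be used with a boolean array. Allowed inputs are:

- A single label, e.g. ``5`` or ``"a"``. Note that ``5`` is interpreted as a label of the index, never as an integer position along the index.
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``. Unlike ordinary Python slices, both the start and the stop are included.
- A boolean array of the same length as the axis being sliced, e.g. ``[True, False, True]``.
- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).

A ``KeyError`` is raised if any requested label is not found. See `Selection by Label <indexing.label>` for more.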
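
The difference between a label and a position is easiest to see with an integer index that does not start at 0. This is a small sketch on a made-up frame whose index is `[5, 6, 7]`.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30]}, index=[5, 6, 7])

by_label = df.loc[5, "v"]       # label 5 is the first row
by_position = df.iloc[0]["v"]   # position 0 is the same row

label_slice = df.loc[5:6]       # both end labels are included
position_slice = df.iloc[0:1]   # the stop position is excluded

# 0 is a valid position but not a label of this index.
try:
    df.loc[0]
    missing = False
except KeyError:
    missing = True

print(by_label, by_position, len(label_slice), len(position_slice), missing)
```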

**Type**

property


**See Also**

- DataFrame.at : Access a single value for a row/column label pair.
- DataFrame.iloc : Access group of rows and columns by integer position(s).
- DataFrame.xs : Returns a cross-section (row(s) or column(s)) from the Series/DataFrame.
- Series.loc : Access group of values using labels.

In [None]:
x = pd.DataFrame(np.random.rand(6, 4),
                 index=list("abcdef"),
                 columns=list("ABCD"))
print(x)

In [None]:
print(x.loc[["b", "e"], "D":"B":-1], end="\n\n")

In [None]:
# Boolean indexing: each boolean list must be as long as the corresponding axis of the DataFrame
print(x.loc[[False, False, True, False, True, True], [True, False, False, True]], end="\n\n")

# Build a boolean index from a condition
print(x["A"] > 0.5, end="\n\n")
print(x.loc[x['A'] > 0.5], end="\n\n")  # returns the full rows where x["A"] > 0.5
print(x[x['A'] > 0.5], end="\n\n")  # a boolean index also works without .loc[]

# Use a callable that returns a boolean index
print(x.loc[x['A'] > 0.5].loc[lambda f: f["C"] > 0.5], end="\n\n")

The same indexers can also be used to modify elements of a DataFrame.

In [None]:
x.loc["a":"b", "A":"B"] = 0.72
x.loc["z"] = 0.32  # a row or column label that does not exist yet is appended
x.loc[:, "Z"] = 0.64
x.loc[x["D"] > 0.9] = 0
print(x)

MultiIndex (hierarchical index)

In [2]:
tuples = [('cobra', 'mark i'), ('cobra', 'mark ii'), ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'), ('viper', 'mark ii'), ('viper', 'mark iii')]
index = pd.MultiIndex.from_tuples(tuples)
values = [[12, 2], [0, 4], [10, 20],
          [1, 4], [7, 1], [16, 36]]
x = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
print(x)

                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36


In [None]:
print(x.loc["cobra"], end="\n\n")
print(x.loc[("cobra", "mark i"), "shield"], end="\n\n")
print(x.loc["cobra", "mark i"], end="\n\n")  # return Series
print(x.loc[[("cobra", "mark i")]], end="\n\n")  # return DataFrame
print(x.loc[("cobra", "mark ii"):("viper", "mark ii"), "max_speed"], end="\n\n")

## pd.DataFrame.iloc
`x.iloc[]`

**Type**

property

**Docstring**

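``.iloc[]`` is primarily integer position based (from ``0`` to ``length-1`` of the axis), but may also be used with a boolean array. It raises ``IndexError`` if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing. Allowed inputs are:

- An integer, or a list or array of integers, giving positions, e.g. ``5``, ``[4, 3, 0]``.

- A slice object with ints, e.g. ``1:7``. Note that the row/column at the stop position ``7`` is not included.

- A boolean array.

- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above). This is useful in method chains, when there is no reference to the calling object but the selection should be based on some value.

See `Selection by Position <indexing.integer>` for more.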
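
A short sketch of the bounds behaviour and the callable form described above, on a made-up three-row frame. The filtered frame in the last step has no variable name of its own, which is the situation where a callable indexer helps.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A slice that runs past the end is simply truncated.
tail = df.iloc[1:7]

# A single out-of-bounds position raises IndexError.
try:
    df.iloc[7]
    raised = False
except IndexError:
    raised = True

# Callable in a method chain: f is the filtered frame, so len(f) - 1
# is the position of its last row.
last_big = df[df["a"] > 1].iloc[lambda f: [len(f) - 1]]

print(len(tail), raised)
print(last_big)
```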

**See Also**

- DataFrame.iat : Fast integer location scalar accessor.

- Series.iloc : Purely integer-location based indexing for selection by position.

In [4]:
x = pd.DataFrame([range(1, 5), range(100, 500, 100), range(1000, 5000, 1000)], 
                 columns=["O", "I", "II", "III"],
                index=list("ABC"))

for i in x:
    print(i)

O
I
II
III


In [None]:
print(x.iloc[[1, 0], ::-1], end="\n\n")
print(x.iloc[[True, False, True], [True, False, True, False]], end="\n\n")
# Use callables as indexers; the index holds strings here, so the row mask is built from positions
print(x.iloc[lambda f: np.arange(len(f)) % 2 == 0, lambda f: [2, 0, 1]], end="\n\n")

## pd.DataFrame.drop()
```python
x.drop(
    labels=None,
    axis=0,
    index=None,
    columns=None,
    level=None,
    inplace=False,
    errors='raise',
)
```

**Docstring**

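Remove rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly. With a MultiIndex, labels on different levels can be removed by specifying the level. Returns a `DataFrame` holding the remaining data; a `KeyError` is raised if any of the labels is not found in the selected axis.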

**Args**

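- labels: single label or list-like; the index or column labels to drop.

- axis: 0 or 'index', 1 or 'columns'; whether to drop labels from the index (0 or 'index') or from the columns (1 or 'columns').

- index: single label or list-like; ``labels, axis=0`` is equivalent to ``index=labels``.

- columns: single label or list-like; ``labels, axis=1`` is equivalent to ``columns=labels``.

- level: int or level name; for a MultiIndex, the level from which the labels are removed.

- inplace: bool; if True, do the operation in place and return None.

- errors: {'ignore', 'raise'}; if 'ignore', suppress the error and drop only the labels that exist.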
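
The `errors` and `inplace` arguments can be sketched as follows. The frame and the missing label `"Z"` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list("ABC"))

# errors="ignore": "Z" does not exist and is skipped, "B" is dropped.
kept = df.drop(columns=["B", "Z"], errors="ignore")
print(list(kept.columns))

# Default errors="raise": a missing label raises KeyError.
try:
    df.drop(columns=["Z"])
    raised = False
except KeyError:
    raised = True

# inplace=True modifies df itself and returns None.
result = df.drop(columns="A", inplace=True)
print(result, list(df.columns))
```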

**See Also**

- DataFrame.dropna : Return DataFrame with labels on given axis omitted where (all or any) data are missing.

- DataFrame.drop_duplicates : Return DataFrame with duplicate rows removed, optionally only considering certain columns.

- Series.drop : Return Series with specified index labels removed.

**Type**

function

In [None]:
x = pd.DataFrame(np.arange(36).reshape(6, 6), columns=list("ABCDEF"))
y = x.drop(['B', 'C'], axis=1)  # == x.drop(columns=['B', 'C'])
print(y)
y = x.drop([2, 4])  # == x.drop(index=[2, 4]); with index= there is no need to pass axis
print(y)
y = x.drop(index=[2, 4], columns=["B", "C"])
print(y)

In [None]:
# Drop columns and/or rows of MultiIndex DataFrame
m_idx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
                              ['speed', 'weight', 'length']],
                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
x = pd.DataFrame(index=m_idx, columns=['big', 'small'],
                 data=[[45, 30], [200, 100], [1.5, 1],
                       [30, 20], [250, 150], [1.5, 0.8],
                       [320, 250], [1, 0.8], [0.3, 0.2]])
print(x)

In [None]:
y = x.drop(index='length', level=1)
print(y)

## pd.DataFrame.rename()
```python
pd.DataFrame.rename(
    self,
    mapper=None,
    index=None,
    columns=None,
    axis=None,
    copy=True,
    inplace=False,
    level=None,
    errors='ignore',
)
```
**Docstring**

Alter axes labels.

Function / dict values must be unique (1-to-1). Labels not contained in
a dict / Series will be left as-is. Extra labels listed don't throw an
error.

See the :ref:`user guide <basics.rename>` for more.

**Args**

- mapper: dict-like or function; transformations to apply to that axis' values. Use either ``mapper`` and ``axis`` to specify the axis to target with ``mapper``, or ``index`` and ``columns``.

- index: dict-like or function; ``mapper, axis=0`` is equivalent to ``index=mapper``.

- columns: dict-like or function; ``mapper, axis=1`` is equivalent to ``columns=mapper``.

- axis: int or str; axis to target with ``mapper``. Can be either the axis name ('index', 'columns') or number (0, 1). The default is 'index'.

- copy: bool, default True; also copy underlying data.

- inplace: bool, default False; whether to modify the DataFrame in place rather than return a new one. If True, the value of `copy` is ignored.

- level: int or level name, default None; in case of a MultiIndex, only rename labels in the specified level.

- errors: {'ignore', 'raise'}, default 'ignore'; if 'raise', raise a `KeyError` when a dict-like `mapper`, `index`, or `columns` contains labels that are not present in the Index being transformed. If 'ignore', existing keys are renamed and extra keys are ignored.

**Returns**

- DataFrame with the renamed axis labels.

**Raises**

- KeyError: if any of the labels is not found in the selected axis and `errors='raise'`.

**See Also**

- DataFrame.rename_axis : Set the name of the axis.

``DataFrame.rename`` supports two calling conventions:

* ``(index=index_mapper, columns=columns_mapper, ...)``
* ``(mapper, axis={'index', 'columns'}, ...)``

**Type**

function
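
The `errors` and `level` arguments are not covered by the cells below, so here is a small sketch of both. The frames and the missing key `"Z"` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2]})

# Default errors="ignore": the unknown key "Z" is skipped silently.
lenient = df.rename(columns={"A": "a", "Z": "z"})
print(list(lenient.columns))

# errors="raise": the same unknown key now raises KeyError.
try:
    df.rename(columns={"Z": "z"}, errors="raise")
    raised = False
except KeyError:
    raised = True

# level=0 restricts the renaming to the outer level of a MultiIndex.
mi = pd.MultiIndex.from_tuples([("x", "one"), ("y", "two")])
nested = pd.DataFrame({"v": [1, 2]}, index=mi)
upper = nested.rename(index=str.upper, level=0)
print(list(upper.index))
```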

In [None]:
# Rename using a mapping:
x = pd.DataFrame(np.arange(24).reshape(4, 6), index=list("abcd"), columns=list("ABCDEF"))
y = x.rename(columns={"A": "a", "B": "b"})
print(y)
y = x.rename(index={"a": "x", "b": "y", "c": "z"})
print(y)

In [None]:
y = x.rename(index=str)
print(x.index)
print(y.index)

In [None]:
# Using axis-style parameters
y = x.rename(str.lower, axis='columns')
print(y)
y = x.rename({"a": "x", "b": "y", "c": "z"}, axis='index')
print(y)

## pd.DataFrame.drop_duplicates()
```python
x.drop_duplicates(
    subset=None,
    keep='first',
    inplace=False,
    ignore_index=False,
) -> Union[ForwardRef('DataFrame'), NoneType]
```

**Docstring**

Return a DataFrame with duplicate rows removed, or None if ``inplace=True``. Considering only certain columns is optional. Indexes, including time indexes, are ignored.

    
**Args**

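- subset: column label or sequence of labels; only consider these columns when identifying duplicates. By default all columns are used.

- keep: {'first', 'last', False}; determines which duplicates (if any) to keep.
    - ``"first"``: keep the first occurrence of each duplicate
    - ``"last"``: keep the last occurrence of each duplicate
    - `False`: drop all duplicates

- inplace: whether to modify the object in place; if False, a copy is returned.

- ignore_index: if True, the resulting axis is labeled 0, 1, …, n - 1.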
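
Because the cells below use random data, their output changes from run to run. The sketch here uses a small fixed frame (the column names `k` and `v` are made up) so the effect of each `keep` value can be checked by eye.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a", "c", "b"], "v": [1, 2, 3, 4, 5]})

first = df.drop_duplicates(subset="k")                 # keeps rows 0, 1, 3
last = df.drop_duplicates(subset="k", keep="last")     # keeps rows 2, 3, 4
unique_only = df.drop_duplicates(subset="k", keep=False)  # keeps row 3 only
relabeled = df.drop_duplicates(subset="k", ignore_index=True)

print(first.index.tolist())
print(last.index.tolist())
print(unique_only.index.tolist())
print(relabeled.index.tolist())
```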


**Type**

function

In [68]:
x = pd.DataFrame(np.random.randint(low=0, high=10, size=[6, 6]), columns=list("ABCDEF"))
x

   A  B  C  D  E  F
0  2  4  1  6  8  6
1  4  8  0  3  3  4
2  9  1  8  2  0  8
3  2  7  3  6  1  3
4  2  3  6  7  5  3
5  4  9  8  5  0  8


In [None]:
y = x.drop_duplicates()
print(y, end="\n\n")
y = x.drop_duplicates(subset=["C", "F"])
print(y, end="\n\n")
y = x.drop_duplicates(subset="E", keep="last")
print(y, end="\n\n")
y = x.drop_duplicates(subset="E", keep=False)
print(y, end="\n\n")
y = x.drop_duplicates(subset="E", ignore_index=True)
print(y, end="\n\n")
print(x)

In [None]:
y = x.drop_duplicates(subset="E", inplace=True)
print(y, end="\n\n")  # return None
print(x)

## pd.DataFrame.reset_index()
```python
x.reset_index(
    level: Union[Hashable, Sequence[Hashable], NoneType] = None,
    drop: bool = False,
    inplace: bool = False,
    col_level: Hashable = 0,
    col_fill: Union[Hashable, NoneType] = '',
) -> Union[ForwardRef('DataFrame'), NoneType]
```
**Docstring**

Reset the index, or a level of it. If the DataFrame has a MultiIndex, this method can remove one or more levels.

**Args**

- level: int, str, tuple, or list, default None; only remove the given levels from the index. Removes all levels by default.

- drop: bool, default False; do not try to insert the index into the DataFrame columns. This resets the index to the default integer index.

- inplace: bool, default False; modify the DataFrame in place (do not create a new object).

- col_level: int or str, default 0; if the columns have multiple levels, determines which level the labels are inserted into. By default they are inserted into the first level.

- col_fill: object, default ''; if the columns have multiple levels, determines how the other levels are named. If None, the index name is repeated.

**Returns**

- DataFrame or None: DataFrame with the new index, or None if ``inplace=True``.

**See Also**

- DataFrame.set_index : Opposite of reset_index.
- DataFrame.reindex : Change to new indices or expand indices.
- DataFrame.reindex_like : Change to same indices as other DataFrame.
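
Since `set_index` is listed as the opposite of `reset_index`, a round trip is a quick way to see what each one does. This is a sketch on a made-up two-row frame.

```python
import pandas as pd

df = pd.DataFrame({"name": ["falcon", "lion"], "max_speed": [389.0, 80.5]})

# set_index moves the column into the index ...
indexed = df.set_index("name")

# ... and reset_index moves it back and restores the default integer index.
restored = indexed.reset_index()
print(restored.equals(df))

# drop=True discards the old index instead of turning it into a column.
dropped = indexed.reset_index(drop=True)
print(list(dropped.columns))
```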

### Examples
>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN

When we reset the index, the old index is added as a column, and a
new sequential index is used:

>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

We can use the `drop` parameter to avoid the old index being added as
a column:

>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN

You can also use `reset_index` with `MultiIndex`.

>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    ( 24.0, 'fly'),
...                    ( 80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump

If the index has multiple levels, we can reset a subset of them:

>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

If we are not dropping the index, by default, it is placed in the top
level. We can place it in another level:

>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

When the index is inserted under another level, we can specify under
which one with the parameter `col_fill`:

>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump

If we specify a nonexistent level for `col_fill`, it is created:

>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump

**Type**

function