
In [1]:
from google.colab import drive
drive.mount('/content/drive/')


# CHAPTER 12 Advanced pandas


The preceding chapters have focused on introducing different types of data wrangling
workflows and features of NumPy, pandas, and other libraries. Over time, pandas has
developed a depth of features for power users. This chapter digs into a few more
advanced feature areas to help you deepen your expertise as a pandas user.

## 12.1 Categorical Data
This section introduces the pandas Categorical type. I will show how you can achieve
better performance and memory use in some pandas operations by using it. I also
introduce some tools for using categorical data in statistics and machine learning
applications.

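In short: the Categorical type offers better performance and memory use, and is useful when working with categorical data in statistics and machine learning applications.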




### Background and Motivation

In [2]:
# Frequently, a column in a table may contain repeated instances of a smaller set of
# distinct values. We have already seen functions like unique and value_counts, which
# enable us to extract the distinct values from an array and compute their frequencies,
# respectively.

# We have already met two functions for counting occurrences: unique and value_counts.
# A row or column often consists of repeated instances drawn from a small set of values.

import numpy as np
import pandas as pd

values = pd.Series(['apple', 'orange', 'apple',
            'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [3]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [4]:
values.value_counts()  # equivalent to pd.value_counts(values), which is deprecated in recent pandas

apple     6
orange    2
dtype: int64

In [5]:
# Use the take method to map integer codes back to the values they stand for.

values = pd.Series([0,1,0,0]*2)

dim = pd.Series(['apple', 'orange'])

values

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [6]:
dim

0     apple
1    orange
dtype: object

In [7]:
dim.take?
Signature: dim.take(indices, axis=0, is_copy=None, **kwargs) -> 'Series'
Docstring:
Return the elements in the given *positional* indices along an axis.

This means that we are not indexing according to actual values in
the index attribute of the object. We are indexing according to the
actual position of the element in the object.

Parameters
----------
indices : array-like
    An array of ints indicating which positions to take.
axis : {0 or 'index', 1 or 'columns', None}, default 0
    The axis on which to select elements. ``0`` means that we are
    selecting rows, ``1`` means that we are selecting columns.
is_copy : bool
    Before pandas 1.0, ``is_copy=False`` can be specified to ensure
    that the return value is an actual copy. Starting with pandas 1.0,
    ``take`` always returns a copy, and the keyword is therefore
    deprecated.

    .. deprecated:: 1.0.0
**kwargs
    For compatibility with :meth:`numpy.take`. Has no effect on the
    output.

Returns
-------
taken : same type as caller
    An array-like containing the elements taken from the object.

See Also
--------
DataFrame.loc : Select a subset of a DataFrame by labels.
DataFrame.iloc : Select a subset of a DataFrame by positions.
numpy.take : Take elements from an array along an axis.

Examples
--------
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN

Take elements at positions 0 and 3 along the axis 0 (default).

Note how the actual indices selected (0 and 1) do not correspond to
our selected indices 0 and 3. That's because we are selecting the 0th
and 3rd rows, not rows whose indices equal 0 and 3.

>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN

Take elements at indices 1 and 2 along the axis 1 (column selection).

>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN

We may take elements using negative integers for positive indices,
starting from the end of the object, just like with Python lists.

>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
File:      /usr/local/lib/python3.6/dist-packages/pandas/core/series.py
Type:      method


In [8]:
# As the take docstring shows, values can be looked up by position.

dim.take(values)  # values holds the sequence of positions to take

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

In [9]:
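# The integers 0 and 1 that stand for apple and orange are called category codes, or simply codes.

# This representation makes two operations convenient:

# Renaming categories
# Appending a new category without changing the order or position of the existing
# categories

# Renaming a category does not require rewriting the stored codes.

# Adding a new category only requires adding its entry and the corresponding code.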
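The two conveniences listed above can be illustrated with the same `codes`/`dim` layout used in the earlier cells. This is only a sketch: the names `renamed` and `extended` and the replacement labels `'green apple'` and `'banana'` are made up for illustration and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

codes = np.array([0, 1, 0, 0] * 2)      # integer category codes
dim = pd.Series(['apple', 'orange'])    # lookup table of distinct values

# Renaming a category: change one entry in the lookup table, codes stay as they are.
renamed = dim.copy()
renamed[0] = 'green apple'              # hypothetical new name
decoded_renamed = renamed.take(codes)

# Appending a category: existing positions (and therefore existing codes) are unchanged.
extended = pd.concat([dim, pd.Series(['banana'])], ignore_index=True)
decoded_extended = extended.take(codes)

print(decoded_renamed.tolist())
print(decoded_extended.tolist())
```

In both cases the eight stored codes are never modified; only the small lookup table changes.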

### Categorical Type in pandas

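The Categorical type.

pandas has a special Categorical type for holding data that uses the integer-based categorical representation described above.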

In [10]:
# Key names:
# Categorical
# astype('category') converts a column to a pandas.Categorical object.
#
# Attributes: values.categories and values.codes
# A Categorical can also be built directly with pandas.Categorical(), much like a DataFrame or Series.

# The codes can also be supplied explicitly:
# categories = [xxx]
# codes = [xx]
# my_cats_2 = pd.Categorical.from_codes(codes, categories, ordered=True or False)



In [15]:
# Example:
fruits = ['apple', 'orange', 'apple', 'apple']*2
N = len(fruits)
df = pd.DataFrame({'fruit': fruits,
          'basket_id': np.arange(N),
          'count': np.random.randint(3, 15, size=N),
          'weight': np.random.uniform(0, 4, size=N)},
          columns=['basket_id', 'fruit', 'count', 'weight'])

df

   basket_id   fruit  count    weight
0          0   apple      6  3.222433
1          1  orange      9  2.530345
2          2   apple      5  1.025972
3          3   apple      6  3.926377
4          4   apple     14  0.079198
5          5  orange     10  1.885798
6          6   apple     11  3.326514
7          7   apple     13  3.455719


In [16]:
# We want to turn fruit into a numeric feature, so first convert it to a categorical.
fruit_cat = df.fruit.astype('category')
fruit_cat

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

In [18]:
# Two attributes: categories and codes
fruit_cat.values.codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

In [19]:
fruit_cat.values.categories

Index(['apple', 'orange'], dtype='object')

In [20]:
fruit_cat.values

[apple, orange, apple, apple, apple, orange, apple, apple]
Categories (2, object): [apple, orange]

In [21]:
# We can convert the fruit column of df to the categorical type.
df.fruit = df.fruit.astype('category')
df

   basket_id   fruit  count    weight
0          0   apple      6  3.222433
1          1  orange      9  2.530345
2          2   apple      5  1.025972
3          3   apple      6  3.926377
4          4   apple     14  0.079198
5          5  orange     10  1.885798
6          6   apple     11  3.326514
7          7   apple     13  3.455719


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   basket_id  8 non-null      int64   
 1   fruit      8 non-null      category
 2   count      8 non-null      int64   
 3   weight     8 non-null      float64 
dtypes: category(1), float64(1), int64(2)
memory usage: 424.0 bytes


In [25]:
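# The info output shows that the dtype of the fruit column has changed to category.
# We can also replace the column directly with its integer codes.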
df.fruit = df.fruit.astype('category').values.codes

df

   basket_id  fruit  count    weight
0          0      0      6  3.222433
1          1      1      9  2.530345
2          2      0      5  1.025972
3          3      0      6  3.926377
4          4      0     14  0.079198
5          5      1     10  1.885798
6          6      0     11  3.326514
7          7      0     13  3.455719


In [26]:
# This achieves the conversion we wanted.

# A pandas.Categorical can also be created directly:

my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

In [27]:
# Build a Categorical from existing codes with from_codes.
categories = ['foo', 'bar', 'baz']
codes = [0,1,2,0,0,1]

my_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
my_cat

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

In [28]:
pd.Categorical.from_codes?
Signature: pd.Categorical.from_codes(cls, codes, categories=None, ordered=None, dtype=None)
Docstring:
Make a Categorical type from codes and categories or dtype.

This constructor is useful if you already have codes and
categories/dtype and so do not need the (computation intensive)
factorization step, which is usually done on the constructor.

If your data does not follow this convention, please use the normal
constructor.

Parameters
----------
codes : array-like of int
    An integer array, where each integer points to a category in
    categories or dtype.categories, or else is -1 for NaN.
categories : index-like, optional
    The categories for the categorical. Items need to be unique.
    If the categories are not given here, then they must be provided
    in `dtype`.
ordered : bool, optional
    Whether or not this categorical is treated as an ordered
    categorical. If not given here or in `dtype`, the resulting
    categorical will be unordered.
dtype : CategoricalDtype or "category", optional
    If :class:`CategoricalDtype`, cannot be used together with
    `categories` or `ordered`.


In [29]:
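# As the signature above shows (cls is the first argument), from_codes is a class method.

# An unordered categorical can be made ordered with .as_ordered(); .as_unordered() does the reverse.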
my_cat.as_unordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]

In [30]:
my_cat.as_ordered()

[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
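To make the practical effect of the ordered flag concrete, here is a small sketch that reuses the `categories` and `codes` from the cells above. The variable names `ordered_cat`, `unordered_cat`, and `is_above_foo` are illustrative only.

```python
import pandas as pd

categories = ['foo', 'bar', 'baz']
codes = [0, 1, 2, 0, 0, 1]

ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
unordered_cat = ordered_cat.as_unordered()

# Ordered: min/max and comparisons follow the category order foo < bar < baz,
# not alphabetical order.
print(ordered_cat.min(), ordered_cat.max())
is_above_foo = ordered_cat > 'foo'
print(is_above_foo)

# Unordered: order-based comparisons are rejected with a TypeError.
try:
    unordered_cat > 'foo'
except TypeError as exc:
    print('TypeError:', exc)
```

Equality comparisons (`==`, `!=`) remain available on both ordered and unordered categoricals.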

### Computations with Categoricals

This section mainly covers the performance gains from using the categorical type,

as well as binning data into categoricals with qcut.

In [31]:
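# pd.qcut produces a Categorical object.
# Its .codes and .categories attributes can be accessed directly.
# The labels argument changes how the bins are named in the output:
# bins = pd.qcut(data, 4, labels=['x1', 'x2', 'x3', 'x4'])
# bins can then be passed to groupby to compute summary statistics, because groupby accepts
# a list or Series whose length matches the data being grouped.
# Performance: the same data takes very different amounts of memory when stored as strings
# versus as a Categorical object,
# and groupby and similar operations on it take different amounts of time as well.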


In [32]:
np.random.seed(12345)
draws = np.random.randn(1000)

draws[:5]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057])

In [33]:
# Cut the 1000 draws above into four quantile-based bins (quartiles).
bins = pd.qcut(draws, 4)
bins

[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] <
                                    (0.63, 3.928]]

In [34]:
# The output shows that bins has a categories attribute.
# Let's access it through categories and codes.

bins.categories

IntervalIndex([(-2.9499999999999997, -0.684], (-0.684, -0.0101], (-0.0101, 0.63], (0.63, 3.928]],
              closed='right',
              dtype='interval[float64]')

In [37]:
bins.codes[:10]

array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

In [36]:
# The bins tell us which interval each value falls in, but named labels are easier to read.
bins = pd.qcut(draws, 4, labels=['q1', 'q2', 'q3','q4'])
bins

[q2, q3, q2, q2, q4, ..., q3, q2, q1, q3, q4]
Length: 1000
Categories (4, object): [q1 < q2 < q3 < q4]

In [39]:
# We can now use bins to group draws and compute some summary statistics.

bins = pd.Series(bins, name='quartile')

results = pd.Series(draws).groupby(bins).agg(['count', 'min', 'max', 'mean']).reset_index()
results  # reset_index moves quartile out of the index, so it sits as a regular column next to count, min, max, and mean

  quartile  count       min       max      mean
0       q1    250 -2.949343 -0.685484 -1.215981
1       q2    250 -0.683066 -0.010115 -0.362423
2       q3    250 -0.010032  0.628894  0.307840
3       q4    250  0.634238  3.927528  1.261160


In [41]:
# Next, compare the memory used by a categorical Series and a string Series.

N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

# Convert the Series to categorical.

categories = labels.astype('category')


In [44]:
# Check with the memory_usage() method.

labels.memory_usage()/categories.memory_usage()

7.999756807782151

In [45]:
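# The string Series takes about eight times the memory of the categorical one.
# (By default memory_usage() counts only the 8-byte object pointers, not the strings
# themselves, against the 1-byte int8 codes.)
# The conversion to category is a one-time cost: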

%time _=labels.astype('category')

CPU times: user 377 ms, sys: 4.52 ms, total: 382 ms
Wall time: 383 ms
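The ratio of about 8 above comes from the default `deep=False` accounting. As a rough sketch, the comparison can be repeated with `deep=True`, which also counts the Python string objects. The row count here is deliberately smaller than in the cell above so that the example runs quickly, `dtype=object` is set explicitly so the result does not depend on a newer default string dtype, and the exact ratios will vary by platform and pandas version.

```python
import pandas as pd

N = 100_000  # smaller than the 10,000,000 rows used above, to keep this quick
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4), dtype=object)
categories = labels.astype('category')

# deep=False (default): object columns are counted as 8-byte pointers only.
shallow_ratio = labels.memory_usage() / categories.memory_usage()

# deep=True: the Python string objects themselves are included.
deep_ratio = labels.memory_usage(deep=True) / categories.memory_usage(deep=True)

print(f'shallow ratio: {shallow_ratio:.1f}')
print(f'deep ratio:    {deep_ratio:.1f}')
```

With `deep=True` the advantage of the categorical representation is typically larger, because each distinct string is stored once in the categories rather than once per row.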


### Categorical Methods

Series containing categorical data have several special methods, available through the cat
accessor, similar to the Series.str specialized string methods. The accessor also provides
convenient access to the categories and codes. Consider the Series:

Just as string Series have Series.str, categorical Series come with many methods of their own.

In [46]:
pd.Series.str?
Init signature: pd.Series.str(*args, **kwargs)
Docstring:     
Vectorized string functions for Series and Index. NAs stay NA unless
handled otherwise by a particular method. Patterned after Python's string
methods, with some inspiration from R's stringr package.

Examples
--------
>>> s.str.split('_')
>>> s.str.replace('_', '')
File:           /usr/local/lib/python3.6/dist-packages/pandas/core/strings.py
Type:           type

In [None]:
!cat /usr/local/lib/python3.6/dist-packages/pandas/core/strings.py

In [None]:

Series.str.split : Split strings around given separator/delimiter.
Series.str.rsplit : Splits string around given separator/delimiter, starting from the right.
Series.str.join : Join lists contained as elements in the Series/Index with passed delimiter.
str.split : Standard library version for split.
str.rsplit : Standard library version for rsplit.
str.partition : Standard library version.
Series.str.rjust : Fills the left side of strings with an arbitrary character.
Series.str.ljust : Fills the right side of strings with an arbitrary character.
Series.str.pad : Fills the specified sides of strings with an arbitrary character.
Series.str.center : Fills both sides of strings with an arbitrary character.
Series.str.strip : Remove leading and trailing characters in Series/Index.
Series.str.lstrip : Remove leading characters in Series/Index.
Series.str.rstrip : Remove trailing characters in Series/Index.
Series.str.isalpha : Check whether all characters are alphabetic.
Series.str.isnumeric : Check whether all characters are numeric.
Series.str.isalnum : Check whether all characters are alphanumeric.
Series.str.isdigit : Check whether all characters are digits.
Series.str.isdecimal : Check whether all characters are decimal.
Series.str.isspace : Check whether all characters are whitespace.
Series.str.islower : Check whether all characters are lowercase.
Series.str.isupper : Check whether all characters are uppercase.
Series.str.istitle : Check whether all characters are titlecase.
Series.str.lower : Converts all characters to lowercase.
Series.str.upper : Converts all characters to uppercase.
Series.str.title : Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize : Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase : Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold : Removes all case distinctions in the string.
str.len : Python built-in function returning the length of an object.
Series.size : Returns the length of the Series.

In [None]:
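# The cat accessor provides access to the codes and categories.

# cat.set_categories() extends the set of categories beyond the ones actually observed in the data.
# Categories that were never observed still show up in statistics such as value_counts, just with a count of 0.

# In large datasets many categories may end up unused; to drop the unobserved ones, use remove_unused_categories.


# Dummy variables (one-hot encoding): convert with pandas.get_dummies(series)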
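The notes above mention `set_categories` and `remove_unused_categories`. The following is a minimal sketch of both, using a small letter Series like the one in the later cells; the names `actual_categories`, `cat_s3`, and `trimmed` are chosen here for illustration.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')

# Declare a wider set of allowed categories than the data actually contains.
actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
counts = cat_s2.value_counts()   # 'e' is reported with a count of 0

# After filtering, categories with no remaining rows are still attached...
cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
# ...until they are trimmed explicitly.
trimmed = cat_s3.cat.remove_unused_categories()

print(counts)
print(cat_s3.cat.categories.tolist())
print(trimmed.cat.categories.tolist())
```

Both methods return a new Series and leave the original untouched.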


![image](https://img-blog.csdnimg.cn/20200825144226828.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)
![image](https://img-blog.csdnimg.cn/20200825145024196.png#pic_center)

In [48]:
# Example:

s = pd.Series(['a', 'b', 'c', 'd']*2)

cat_s = s.astype('category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [49]:
# The cat accessor exposes the categorical methods and attributes.

cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [50]:
cat_s.cat.as_ordered()

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a < b < c < d]

In [51]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [52]:
cat_s2 = cat_s.cat.add_categories(['e'])
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

In [53]:
cat_s2.value_counts()

d    2
c    2
b    2
a    2
e    0
dtype: int64

In [54]:
cat_s2.cat.remove_unused_categories()

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [55]:
cat_s2.cat.remove_categories(['d'])

0      a
1      b
2      c
3    NaN
4      a
5      b
6      c
7    NaN
dtype: category
Categories (4, object): [a, b, c, e]

In [56]:
# Dummy variables with get_dummies

cat_s = pd.Series(['a','b','c','d']*2, dtype='category')
cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

In [57]:
pd.get_dummies(cat_s)  # this turns one feature into four indicator features

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
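In practice the indicator columns are usually given a prefix and joined back onto the original table. The sketch below assumes a hypothetical column name `letter`. It also passes `dtype=int` to get the 0/1 integers shown above, since recent pandas versions return booleans by default.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# prefix labels the new columns; dtype=int requests 0/1 integers instead of booleans.
dummies = pd.get_dummies(cat_s, prefix='letter', dtype=int)

# Attach the indicator columns to the original data.
df = pd.DataFrame({'letter': cat_s}).join(dummies)
print(df.head(4))
```

Each row has exactly one indicator set to 1, matching that row's category.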


## 12.2 Advanced GroupBy Use

### Group Transforms and “Unwrapped” GroupBys

### Grouped Time Resampling

## 12.3 Techniques for Method Chaining

### The pipe Method