
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
cd /content/drive/My Drive/examples

/content/drive/My Drive/examples


# Chapter 7: Data Cleaning and Preparation


During the course of doing data analysis and modeling, a significant amount of time
is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the
way that data is stored in files or databases is not in the right format for a particular
task. Many researchers choose to do ad hoc processing of data from one form to
another using a general-purpose programming language, like Python, Perl, R, or Java,
or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the
built-in Python language features, provides you with a high-level, flexible, and fast set
of tools to enable you to manipulate data into the right form.


In this chapter I discuss tools for **missing data**, **duplicate data**, **string manipulation**,
and **some other analytical data transformations**. In the next chapter, I focus on
combining and rearranging datasets in various ways.

## Handling Missing Data


For numeric data, pandas uses the floating-point
value ***NaN*** (Not a Number) to represent missing data. We call this a
***sentinel value*** that
can be easily detected:

In [3]:
import pandas as pd
from pandas import DataFrame
from pandas import Series
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [8]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [9]:
string_data.isnull()  # NA stands for "not available"

0    False
1    False
2     True
3    False
dtype: bool

In [6]:
string_data[0] = None
string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [7]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
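
Both `None` and `NaN` are treated as missing in an object Series. The following is a minimal sketch of that behavior, not part of the original notebook: the variable names `s` and `mask` are mine, and `isna`/`notna` are the newer aliases of `isnull`/`notnull`.

```python
import numpy as np
import pandas as pd

s = pd.Series(['aardvark', None, np.nan, 'avocado'])

# isna is an alias of isnull; both flag None and NaN as missing
mask = s.isna()
print(mask.tolist())            # [False, True, True, False]
print(s.isnull().equals(mask))  # True
print(int(s.notna().sum()))     # 2
```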

**Some functions related to missing data handling:**


![链接文字](https://img-blog.csdnimg.cn/20200814104309972.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

### Filtering Out Missing Data


There are a few ways to filter out missing data. While you always have the option to
do it by hand using **pandas.isnull** and **boolean** indexing, the dropna method is often more convenient.
On a Series, it returns the Series with only the non-null data and index values:

In [5]:
from numpy import nan as NA

# For a Series
data = pd.Series([1, NA, 3.5, NA, 9])
data.dropna()

0    1.0
2    3.5
4    9.0
dtype: float64

In [None]:
# Equivalent to:

data[data.notnull()]

0    1.0
2    3.5
4    9.0
dtype: float64

In [None]:
# For a DataFrame things are a bit more complex: by default, dropna drops every row that contains any NA

data = DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [None]:
data.dropna()

     0    1    2
0  1.0  6.5  3.0


data.dropna?

![链接文字](https://img-blog.csdnimg.cn/20200814105608645.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

As the documentation above shows, the behavior of dropna can be controlled through its parameters.

In [None]:
data.dropna(axis=0, how= 'all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


In [None]:
data.dropna(axis= 1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [None]:
print(data)
data.dropna(thresh= 2) # keep only rows with at least 2 non-NA values

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


     0    1    2
0  1.0  6.5  3.0
3  NaN  6.5  3.0
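
To compare the dropna options side by side, here is a small sketch on the same four-row frame. The variable names are mine, and `subset` is a further dropna parameter that the cells above do not use.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.0]])

any_na = frame.dropna()               # default how='any': drop rows with any NA
all_na = frame.dropna(how='all')      # drop only rows where every value is NA
at_least_2 = frame.dropna(thresh=2)   # keep rows with at least 2 non-NA values
subset_0 = frame.dropna(subset=[0])   # only column 0 decides whether a row is dropped

print(list(any_na.index))      # [0]
print(list(all_na.index))      # [0, 1, 3]
print(list(at_least_2.index))  # [0, 3]
print(list(subset_0.index))    # [0, 1]
```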


### Filling In Missing Data


Rather than filtering out missing data (and potentially discarding other data along
with it), you may want to fill in the “holes” in any number of ways. Dropping always
removes an entire row, and that row may still hold valid values, so filling is often
the better option. For most purposes, the ***fillna*** method is the workhorse function to use.
Calling fillna with a constant replaces missing values with that value:



In [None]:
df = DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
df

          0         1         2
0  2.598565       NaN       NaN
1  0.350325       NaN       NaN
2  1.537747       NaN -0.581092
3  0.419049       NaN -1.123682
4 -1.236840  0.439199  0.292599
5  1.447771 -1.378561  0.438132
6  0.002066  0.015012 -1.512470


In [None]:
df.fillna(0) # fill NA values with 0

          0         1         2
0  2.598565  0.000000  0.000000
1  0.350325  0.000000  0.000000
2  1.537747  0.000000 -0.581092
3  0.419049  0.000000 -1.123682
4 -1.236840  0.439199  0.292599
5  1.447771 -1.378561  0.438132
6  0.002066  0.015012 -1.512470


![链接文字](https://img-blog.csdnimg.cn/20200814111702964.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)


![链接文字](https://img-blog.csdnimg.cn/20200814112048496.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

In [None]:
df.fillna({1:0.5, 2:0}) # fill NA in column 1 with 0.5 and NA in column 2 with 0

          0         1         2
0  2.598565  0.500000  0.000000
1  0.350325  0.500000  0.000000
2  1.537747  0.500000 -0.581092
3  0.419049  0.500000 -1.123682
4 -1.236840  0.439199  0.292599
5  1.447771 -1.378561  0.438132
6  0.002066  0.015012 -1.512470


In [None]:
_ = df.fillna(0,inplace=True)
df

          0         1         2
0  2.598565  0.000000  0.000000
1  0.350325  0.000000  0.000000
2  1.537747  0.000000 -0.581092
3  0.419049  0.000000 -1.123682
4 -1.236840  0.439199  0.292599
5  1.447771 -1.378561  0.438132
6  0.002066  0.015012 -1.512470


In [None]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:,1] = NA
df.iloc[4:,2] = np.nan
df

          0         1         2
0 -0.664088  1.844724  1.695235
1 -0.608011 -0.168809  1.517685
2  0.077567       NaN  0.080680
3  0.690576       NaN -0.941624
4 -0.720014       NaN       NaN
5  0.776801       NaN       NaN


In [None]:
df.fillna(method='ffill')

          0         1         2
0 -0.664088  1.844724  1.695235
1 -0.608011 -0.168809  1.517685
2  0.077567 -0.168809  0.080680
3  0.690576 -0.168809 -0.941624
4 -0.720014 -0.168809 -0.941624
5  0.776801 -0.168809 -0.941624


In [None]:
df.fillna(method='ffill', limit= 2)

          0         1         2
0 -0.664088  1.844724  1.695235
1 -0.608011 -0.168809  1.517685
2  0.077567 -0.168809  0.080680
3  0.690576 -0.168809 -0.941624
4 -0.720014       NaN -0.941624
5  0.776801       NaN -0.941624


In [None]:
df.fillna(method='bfill', limit= 1 ,axis= 1)

          0         1         2
0 -0.664088  1.844724  1.695235
1 -0.608011 -0.168809  1.517685
2  0.077567  0.080680  0.080680
3  0.690576 -0.941624 -0.941624
4 -0.720014       NaN       NaN
5  0.776801       NaN       NaN


In [None]:
df.fillna(method='ffill', limit= 1, axis=1)

          0         1         2
0 -0.664088  1.844724  1.695235
1 -0.608011 -0.168809  1.517685
2  0.077567  0.077567  0.080680
3  0.690576  0.690576 -0.941624
4 -0.720014 -0.720014       NaN
5  0.776801  0.776801       NaN
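
Recent pandas releases (2.1 and later, as far as I know) deprecate the `method` argument of fillna in favor of the dedicated `ffill` and `bfill` methods. A minimal sketch of the equivalents follows. The small frame and its column names `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                      'b': [np.nan, 2.0, np.nan, 4.0]})

forward = frame.ffill()         # same idea as fillna(method='ffill')
limited = frame.ffill(limit=1)  # fill at most one consecutive NA per gap
backward = frame.bfill()        # same idea as fillna(method='bfill')

print(forward['a'].tolist())         # [1.0, 1.0, 1.0, 1.0]
print(limited['a'].isna().tolist())  # [False, False, True, True]
print(backward['b'].tolist())        # [2.0, 2.0, 4.0, 4.0]
```

Both methods also accept `axis=1` to fill along rows, as in the cells above.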


In [None]:
# You can get creative with the fill value, for example the mean of the Series:

data = Series([1, np.nan, 3.5, np.nan, 5])
print(data)
data.fillna(data.mean())

0    1.0
1    NaN
2    3.5
3    NaN
4    5.0
dtype: float64


0    1.000000
1    3.166667
2    3.500000
3    3.166667
4    5.000000
dtype: float64
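
The same idea extends to a DataFrame: passing a Series of per-column statistics fills each column with its own value. This is a sketch with a made-up two-column frame (`x`, `y`), not data from the notebook.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, 10.0, 20.0]})

# frame.mean() is a Series indexed by column name (NA values are skipped),
# so each column is filled with its own mean
filled = frame.fillna(frame.mean())

print(filled['x'].tolist())  # [1.0, 2.0, 3.0]
print(filled['y'].tolist())  # [15.0, 10.0, 20.0]
```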

## Data Transformation


So far in this chapter we’ve been concerned with rearranging data. Filtering, cleaning,
and other transformations are another class of important operations.



### Removing Duplicates 

This covers redundant and duplicated rows in the data.

In [None]:
data = DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


The DataFrame method  ***duplicated*** returns a boolean Series indicating whether each
row is a duplicate (has been observed in a previous row) or not:


![链接文字](https://img-blog.csdnimg.cn/20200814114327579.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

In [None]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [None]:
# drop_duplicates returns a DataFrame with the duplicate rows removed

data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4


In [None]:
# add new columns.
data['v'] = range(7)
print(data)
data.drop_duplicates(['k1'])

    k1  k2  v
0  one   1  0
1  two   1  1
2  one   2  2
3  two   3  3
4  one   3  4
5  two   4  5
6  two   4  6


    k1  k2  v
0  one   1  0
1  two   1  1


In [None]:
data.drop_duplicates(['k1','k2'], keep='last')

    k1  k2  v
0  one   1  0
1  two   1  1
2  one   2  2
3  two   3  3
4  one   3  4
6  two   4  6


In [None]:
data.drop_duplicates(['k1','k2'], keep = 'first')

    k1  k2  v
0  one   1  0
1  two   1  1
2  one   2  2
3  two   3  3
4  one   3  4
5  two   4  5
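
Besides `'first'` and `'last'`, the `keep` parameter also accepts `False`, which marks every copy of a duplicated row. A short sketch on the same k1/k2 data (the variable names are mine):

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False flags all copies of a duplicated row, not just the later ones
every_copy = data.duplicated(keep=False)
print(data[every_copy].index.tolist())  # [5, 6]

# subset restricts the comparison to the listed columns
by_k1 = data.duplicated(subset=['k1'])
print(int(by_k1.sum()))                 # 5

# drop_duplicates(keep=False) removes every row that has a duplicate
no_repeats = data.drop_duplicates(keep=False)
print(len(no_repeats))                  # 5
```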


### Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values
in an array, Series, or column in a DataFrame. Consider the following hypothetical
data collected about various kinds of meat:


In [None]:
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
          'Pastrami', 'corned beef', 'Bacon',
          'pastrami', 'honey ham', 'nova lox'],
          'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# Glossary: pastrami is spiced smoked beef, corned beef is salt-cured beef,
# and nova lox is salt-cured salmon.

data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


In [None]:
# mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon',
}

In [None]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [None]:
# Apply the map method to lowercased.

data['animal'] = lowercased.map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


In [None]:
# A lambda function works as well.

data['meat2'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
data

          food  ounces  animal   meat2
0        bacon     4.0     pig     pig
1  pulled pork     3.0     pig     pig
2        bacon    12.0     pig     pig
3     Pastrami     6.0     cow     cow
4  corned beef     7.5     cow     cow
5        Bacon     8.0     pig     pig
6     pastrami     3.0     cow     cow
7    honey ham     5.0     pig     pig
8     nova lox     6.0  salmon  salmon
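
The two approaches differ when a value has no entry in the mapping: passing the dict to map yields NaN, while indexing the dict inside a lambda raises KeyError. A small sketch, where the `'tofu'` entry and the `'unknown'` fallback are made up for illustration:

```python
import pandas as pd

foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
mapping = {'bacon': 'pig', 'pastrami': 'cow'}

# dict mapping: keys that are missing from the dict become NaN
mapped = foods.str.lower().map(mapping)
print(mapped.isna().tolist())  # [False, False, True]

# lambda with plain dict indexing: a missing key raises KeyError
missing = None
try:
    foods.map(lambda x: mapping[x.lower()])
except KeyError as exc:
    missing = exc.args[0]
print(missing)                 # tofu

# dict.get with a default avoids both the NaN and the exception
safe = foods.map(lambda x: mapping.get(x.lower(), 'unknown'))
print(safe.tolist())           # ['pig', 'cow', 'unknown']
```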


### Replacing Values

The key method: ***replace***


map can handle some replacement tasks, but replace is simpler and more flexible.




In [None]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [None]:
# replace -999 with NaN

data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [None]:
# To replace multiple values at once,
# pass a list of them followed by the substitute value.

data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64


In [None]:
# To use a different replacement for each value,
# pass a list of substitutes.

data.replace([-999,-1000], ['xx','yy'])

0     1
1    xx
2     2
3    xx
4    yy
5     3
dtype: object

In [None]:
# The argument can also be a dict:
# keys are the values to replace, values are their replacements.

data.replace({-999:'xxxx', -1000:'yyyy'})

0       1
1    xxxx
2       2
3    xxxx
4    yyyy
5       3
dtype: object
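
On a DataFrame, replace can also be limited to one column by nesting the dict. The sketch below uses a made-up frame with columns `a` and `b`, where -999 is a sentinel only in column `a`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, -999.0], 'b': [-999.0, 2.0]})

# outer key: column to look in; inner dict: old value -> new value
only_a = frame.replace({'a': {-999.0: np.nan}})

print(only_a['a'].isna().tolist())  # [False, True]
print(only_a['b'].tolist())         # [-999.0, 2.0]
```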

### Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping
of some form to produce new, differently labeled objects. You can also modify
the axes in-place without creating a new data structure.

In [None]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
        index=['Ohio', 'Colorado', 'New York'],
        columns=['one', 'two', 'three', 'four'])

data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11


In [None]:
# equivalent to: data.index.map(lambda x: x[:4].upper())

transform = lambda x: x[:4].upper()
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [None]:
# use it to change the index label.
data.index = data.index.map(lambda x:x[:4].upper())
data

      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11


In [None]:
data.columns = data.columns.map(lambda x: x.upper())
data

      ONE  TWO  THREE  FOUR
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11


In [None]:
# rename
k = data.rename(index = str.title, columns= str.lower)
print(data)
k
# rename returns a modified copy; the original DataFrame is left unchanged.

      ONE  TWO  THREE  FOUR
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11


      one  two  three  four
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11


In [None]:
# Notably, rename also accepts a dict-like object,
# which lets you change only a subset of the labels.
data.rename(index={'OHIO': 'INDIANA'}, columns={'ONE': 111})


         111  TWO  THREE  FOUR
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11


In [None]:
# rename builds a new DataFrame and leaves the original untouched.
# To modify the original instead, pass inplace=True.
data.rename(index={'OHIO':'INDIANA'}, inplace=True)
data

         ONE  TWO  THREE  FOUR
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11


### Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis.

Suppose you have data about a group of people in a study, and you want to group
them into discrete age buckets:

This requires a list of bin edges and the key function ***cut***:




In [None]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [None]:
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [None]:
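# The output above is not very intuitive: it only shows which bin each age falls into.
# A plot would be clearer; visualization is covered in Chapter 9.

cats.codes
# The codes 0, 1, 2, 3 are the automatically assigned bin numbers; each age is represented by the code of its bin.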

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [None]:
cats.categories  # closed indicates which side of each interval is inclusive (the square bracket)

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [None]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [None]:
pd.cut(ages, [18, 26, 36, 61, 100], right = False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

In [None]:
# you can also pass your own bin names by passing a list or array to the labels option:
group_names =  ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages,bins,labels = group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
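
The labeled bins are most useful when stored as a column and summarized. A minimal sketch using the same ages and bins; the `people` frame and its column names are my own choice.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

people = pd.DataFrame({'age': ages})
# each age is assigned the label of the right-closed bin it falls into
people['group'] = pd.cut(people['age'], bins, labels=group_names)

counts = people['group'].value_counts().to_dict()
print(counts)
```

The counts agree with the `value_counts` output shown earlier for the unlabeled intervals: 5, 3, 3 and 1.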

In [None]:
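# You can also pass an integer number of bins: cut then computes equal-length bins from the minimum and maximum values; precision sets the number of decimal digits.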
data = np.random.rand(20)
x = pd.cut(data, 4, precision=3)
x

[(0.083, 0.304], (0.745, 0.965], (0.745, 0.965], (0.524, 0.745], (0.304, 0.524], ..., (0.304, 0.524], (0.745, 0.965], (0.524, 0.745], (0.083, 0.304], (0.304, 0.524]]
Length: 20
Categories (4, interval[float64]): [(0.083, 0.304] < (0.304, 0.524] < (0.524, 0.745] < (0.745, 0.965]]

In [None]:
x.categories

IntervalIndex([(0.083, 0.304], (0.304, 0.524], (0.524, 0.745], (0.745, 0.965]],
              closed='right',
              dtype='interval[float64]')

In [None]:
x.codes

array([0, 3, 3, 2, 1, 1, 3, 0, 3, 0, 0, 2, 2, 0, 0, 1, 3, 2, 0, 1],
      dtype=int8)

In [None]:
# qcut bins the data based on sample quantiles. With plain cut, some bins may end up with few or no points,
# whereas qcut produces bins that hold roughly the same number of points.

data = np.random.randn(1000)

cats = pd.qcut(data, 4)

cats

[(0.707, 3.134], (0.707, 3.134], (-3.381, -0.681], (0.0055, 0.707], (0.707, 3.134], ..., (0.707, 3.134], (0.0055, 0.707], (0.0055, 0.707], (-0.681, 0.0055], (0.707, 3.134]]
Length: 1000
Categories (4, interval[float64]): [(-3.381, -0.681] < (-0.681, 0.0055] < (0.0055, 0.707] <
                                    (0.707, 3.134]]

In [None]:
cats.codes

array([3, 3, 0, 2, 3, 0, 1, 2, 3, 3, 0, 3, 1, 2, 1, 0, 3, 0, 3, 2, 3, 2,
       ...], dtype=int8)

In [None]:
cats.categories

IntervalIndex([(-3.381, -0.681], (-0.681, 0.0055], (0.0055, 0.707], (0.707, 3.134]],
              closed='right',
              dtype='interval[float64]')

In [None]:
pd.value_counts(cats) # each bin holds the same number of points, although the middle bins are much narrower

(0.707, 3.134]      250
(0.0055, 0.707]     250
(-0.681, 0.0055]    250
(-3.381, -0.681]    250
dtype: int64

In [None]:
# similar to cut, we can also pass our own quantiles.

cuts = pd.qcut(data, [0., 0.1, 0.4, 0.9, 1.])
cuts

[(-0.23, 1.367], (1.367, 3.134], (-3.381, -1.315], (-0.23, 1.367], (1.367, 3.134], ..., (1.367, 3.134], (-0.23, 1.367], (-0.23, 1.367], (-0.23, 1.367], (-0.23, 1.367]]
Length: 1000
Categories (4, interval[float64]): [(-3.381, -1.315] < (-1.315, -0.23] < (-0.23, 1.367] <
                                    (1.367, 3.134]]

In [None]:
cuts.categories

IntervalIndex([(-3.381, -1.315], (-1.315, -0.23], (-0.23, 1.367], (1.367, 3.134]],
              closed='right',
              dtype='interval[float64]')

In [None]:
pd.value_counts(cuts)  # the counts match the requested quantile edges (0, 0.1, 0.4, 0.9, 1): 10%, 30%, 50% and 10% of the data

(-0.23, 1.367]      500
(-1.315, -0.23]     300
(1.367, 3.134]      100
(-3.381, -1.315]    100
dtype: int64
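
To see the difference between cut and qcut on the same data, here is a sketch with seeded random values so that it can be rerun. The seed and the variable names are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.standard_normal(1000)

# qcut: bin edges are sample quantiles, so each bin holds the same number of points
quartile_counts = pd.Series(pd.qcut(values, 4)).value_counts()

# cut: bins have equal width, so the counts follow the shape of the distribution
width_counts = pd.Series(pd.cut(values, 4)).value_counts()

print(sorted(quartile_counts.tolist()))  # [250, 250, 250, 250]
print(width_counts.sort_index())
```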

### Detecting and Filtering Outliers

Filtering or transforming outliers is largely a matter of applying array operations.
Consider a DataFrame with some normally distributed data:

In [None]:
data = pd.DataFrame(np.random.randn(1000,4))
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.002005    -0.027852     0.039779    -0.006305
std       1.028336     1.011777     0.986038     1.016928
min      -3.034914    -3.339943    -2.769556    -3.616945
25%      -0.705780    -0.691520    -0.646600    -0.666185
50%       0.026480    -0.007970     0.011974    -0.020405
75%       0.728928     0.657568     0.694957     0.658949
max       4.694537     3.343568     3.177311     3.286916


In [None]:
# find values in one of the columns exceeding 3 in absolute value:

col = data[2]

col[np.abs(col) > 3]

433    3.177311
Name: 2, dtype: float64

In [None]:
# select all rows having a value exceeding 3 or -3:

data[(np.abs(data) >3).any(axis = 1)]

            0         1         2         3
5    3.669065  0.046218 -0.692751 -1.854641
181  1.357615 -0.123394  0.507067  3.286916
215 -3.034914 -0.357530 -1.842489 -2.306886
381 -0.019823 -3.339943 -0.899029  0.469548
433  0.855785 -1.550040  3.177311  2.194434
434  4.694537  0.117893 -1.301424 -0.517478
466  0.182538  3.343568 -0.475942 -0.103123
566 -0.327155 -1.109378 -1.610860 -3.616945
574  0.191986 -3.101971  0.629085  0.941641
709  0.524670  0.090949 -0.862245 -3.056611


In [None]:
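# Goal: cap values whose absolute value exceeds 3 at 3 or -3.
# Boolean indexing shows only the selected values; everything else appears as NaN.

data[np.abs(data) >3]

      0   1   2   3
0   NaN NaN NaN NaN
1   NaN NaN NaN NaN
2   NaN NaN NaN NaN
3   NaN NaN NaN NaN
4   NaN NaN NaN NaN
..   ..  ..  ..  ..
995 NaN NaN NaN NaN
996 NaN NaN NaN NaN
997 NaN NaN NaN NaN
998 NaN NaN NaN NaN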


In [None]:
data[np.abs(data) >3] = np.sign(data)*3 # only positions where the mask is True are assigned; all other values stay unchanged
data  # np.sign returns -1, 0 or 1 according to the sign of each value

            0         1         2         3
0    0.009858 -0.459252  0.265714  0.724388
1    0.134278 -0.391892 -0.899480  1.393351
2   -1.337335 -1.638208 -0.593843  0.030725
3    2.047114  0.204540  0.814395 -0.293013
4   -1.019289  1.844087  1.378658  0.101328
..        ...       ...       ...       ...
995  0.048140 -1.213054  0.612229 -1.561475
996 -0.401492 -0.237421 -1.010551 -0.751552
997  2.141379 -2.117374  0.748302  0.048246
998 -0.454627 -0.518034 -0.312112 -0.413945


In [None]:
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.004307    -0.027753     0.039601    -0.005918
std       1.019590     1.009325     0.985489     1.013859
min      -3.000000    -3.000000    -2.769556    -3.000000
25%      -0.705780    -0.691520    -0.646600    -0.666185
50%       0.026480    -0.007970     0.011974    -0.020405
75%       0.728928     0.657568     0.694957     0.658949
max       3.000000     3.000000     3.000000     3.000000
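
Capping at plus or minus 3 can also be written with the DataFrame `clip` method, which is not used in the original notebook. The sketch below compares it with the mask-and-sign approach on seeded data; the seed and the variable names are my own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

# approach from the cells above: assign sign * 3 where |value| > 3
capped_mask = frame.copy()
capped_mask[capped_mask.abs() > 3] = np.sign(capped_mask) * 3

# alternative: clip bounds every value to the interval [-3, 3]
capped_clip = frame.clip(lower=-3, upper=3)

same = np.allclose(capped_mask.to_numpy(), capped_clip.to_numpy())
print(same)
print(capped_clip.abs().max().max() <= 3)
```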


### Permutation and Random Sampling

Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the ***numpy.random.permutation*** function. Calling ***permutation*** with the length of the axis you want to permute produces an array of integers indicating the new
ordering:


In [None]:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
df

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


In [None]:
sampler = np.random.permutation(5) # returns a randomly permuted array of the integers 0 to 4
sampler

array([0, 3, 4, 1, 2])

![链接文字](https://img-blog.csdnimg.cn/20200814193827718.png#pic_center)


As the documentation shows, the ***take()*** function is similar to ***iloc[]***.

In [None]:
# the array can then be used with iloc, or equivalently with the take function.

df.take(sampler)

    0   1   2   3
0   0   1   2   3
3  12  13  14  15
4  16  17  18  19
1   4   5   6   7
2   8   9  10  11


In [None]:
# iloc
df.iloc[sampler] 

    0   1   2   3
0   0   1   2   3
3  12  13  14  15
4  16  17  18  19
1   4   5   6   7
2   8   9  10  11


In [None]:
# np.random.permutation returns an array of positions.
# pandas also has a built-in method for random selection: sample

df.sample(n=2)

    0   1   2   3
0   0   1   2   3
4  16  17  18  19


In [None]:
# to allow repeated rows, sample with replacement by passing replace=True

choices = Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace= True)

In [None]:
draws

0    5
3    6
2   -1
1    7
2   -1
2   -1
0    5
3    6
0    5
1    7
dtype: int64
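
sample can also shuffle a whole DataFrame, and `random_state` makes the result repeatable. A small sketch; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# frac=1 draws every row once (without replacement), i.e. a full shuffle
shuffled = frame.sample(frac=1, random_state=42)
again = frame.sample(frac=1, random_state=42)

print(shuffled.equals(again))         # True
print(sorted(shuffled.index))         # [0, 1, 2, 3, 4]
```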

### Computing Indicator/Dummy Variables

Converting a categorical column into indicator (dummy) columns:

In [None]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
          'data1': range(6)})
df

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


In [None]:
# the key function: get_dummies()
pd.get_dummies(df['key'])

   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0


In [None]:
dummies = pd.get_dummies(df['key'], prefix='key')
dummies

   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0


In [None]:
df_with_dummy = df[['data1']].join(dummies)

In [None]:
df_with_dummy

   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0


In [None]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')  # a multi-character separator requires the Python parser engine


  


In [None]:
movies[:10]

   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller


In [None]:
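# What are we trying to do here?
# Build a DataFrame of genre indicators: one row per movie, one column per genre,
# with 1 if the movie belongs to that genre and 0 otherwise.

# First, collect all the genres that occur.

all_genres = []
for x in movies.genres:
  all_genres.extend(x.split('|'))  # split each genre string and append the pieces to all_genres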

all_genres[:10]

['Animation',
 "Children's",
 'Comedy',
 'Adventure',
 "Children's",
 'Fantasy',
 'Comedy',
 'Romance',
 'Comedy',
 'Drama']

In [None]:
# The list contains repeats, so reduce it to the unique genres.
genres = pd.unique(all_genres)
genres[:10]

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller'], dtype=object)

In [None]:
# Then build the structure: an all-zeros DataFrame with one column per genre.
from pandas import DataFrame as DF
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = DF(zero_matrix, columns = genres)
dummies.head()

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Next, fill in the indicators from each movie's genres.

for i,gen in enumerate(movies.genres):
  indices = dummies.columns.get_indexer(gen.split('|'))  # column positions of this movie's genres
  # set those positions to 1
  dummies.iloc[i, indices] = 1.0
  

In [None]:
# Display the result.
dummies.head()

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Finally, combine the indicators with the original data.
movies_windic = movies.join((dummies.add_prefix('Genre_')))  # prefix the column names with 'Genre_', then join with movies

movies_windic.head()

Unnamed: 0,movie_id,title,genres,Genre_Animation,Genre_Children's,Genre_Comedy,Genre_Adventure,Genre_Fantasy,Genre_Romance,Genre_Drama,Genre_Action,Genre_Crime,Genre_Thriller,Genre_Horror,Genre_Sci-Fi,Genre_Documentary,Genre_War,Genre_Musical,Genre_Mystery,Genre_Film-Noir,Genre_Western
0,1,Toy Story (1995),Animation|Children's|Comedy,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),Adventure|Children's|Fantasy,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),Comedy,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
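
For delimiter-separated memberships like these, pandas also offers `Series.str.get_dummies`, which builds the indicator columns in one call. A sketch on three of the rows shown above, so it runs without `movies.dat`; note that this method orders the columns alphabetically rather than by first appearance.

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Heat (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Action|Crime|Thriller'],
})

# split each string on '|' and create one 0/1 column per distinct genre
dummies = movies['genres'].str.get_dummies('|')
movies_windic = movies.join(dummies.add_prefix('Genre_'))

print(list(dummies.columns))
print(movies_windic.loc[0, 'Genre_Comedy'])  # 1
```

For very large datasets the explicit loop and this shortcut may differ in speed; I have not measured either here.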


## String Manipulation


Python has long been a popular raw data manipulation language in part due to its
ease of use for string and text processing. Most text operations are made simple with
the string object’s built-in methods. For more complex pattern matching and text
manipulations, regular expressions may be needed. pandas adds to the mix by enabling
you to apply string and regular expressions concisely on whole arrays of data,
additionally handling the annoyance of missing data.

Key points: text manipulation, regular expressions, concise application to whole arrays, and built-in handling of missing data.

### String Object Methods

In [11]:
# Python built-in string method: split.

val = 'a,b,  guido'
print(val)
b = val.split(',') # return a list.
print(b)

a,b,  guido
['a', 'b', '  guido']


In [12]:
# split is often combined with strip to trim whitespace (including line breaks).

pieces = [x.strip() for x in val.split(',')]
pieces # the spaces before 'guido' have been trimmed

['a', 'b', 'guido']

In [13]:
# let's concatenate the pieces together.
first, second, third = pieces

result = first + '::' + second + '::' + third
result

'a::b::guido'

In [14]:
# That is not a generic method; use the join method instead.

result2 = "::".join(pieces)
result2

'a::b::guido'

In [15]:
# keywords: in, index, find.

'guido' in val

True

In [16]:
val.index(',') # index of the first ','

1

In [17]:
val.find(',')

1

In [19]:
# but index and find differ slightly.

print(val.find('red')) # find returns -1 if the substring is not found



-1


In [20]:
print(val.index('red')) # index raises an exception if the substring is not found

ValueError: substring not found

In [21]:
print(val)

# the count method:

print(val.count(','))

a,b,  guido
2


In [24]:
# the replace method:
# replace substitutes occurrences of one pattern for another.
# It is also commonly used to delete patterns, by passing an empty string.

val2 = val.replace(",", '::::')
val2

'a::::b::::  guido'

In [25]:
val.replace(",","") # delete the commas

'ab  guido'

![链接文字](https://img-blog.csdnimg.cn/20200815105800672.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

In [27]:
# rjust and ljust pad the string to a given width with a fill character

val.rjust(14, '&')

'&&&a,b,  guido'

In [28]:
val.ljust(15, "*")

'a,b,  guido****'

In [30]:
val.casefold() # lowercases aggressively, for caseless comparison

'a,b,  guido'

### Regular Expressions


The re module functions fall into three categories: **pattern matching**, **substitution**,
and **splitting**. Naturally these are all related; a regex describes a pattern to locate in the
text, which can then be used for many purposes. Let’s look at a simple example:


![链接文字](https://img-blog.csdnimg.cn/20200815145637986.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)


In [41]:
# module : re

import re
text = "foo bar\t baz \tqux"

In [42]:
re.split(r'\s+', text) # the regex \s+ matches one or more whitespace characters (spaces, tabs, newlines)

['foo', 'bar', 'baz', 'qux']

In [43]:
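# Internally, the regular expression is first compiled, and then its split method is called on the text.
# You can compile it yourself to get a reusable regex object:

regex = re.compile(r'\s+')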

In [44]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [45]:
# To get a list of all the matches in text instead:
re.findall(r'\s+', text)

[' ', '\t ', ' \t']

In [46]:
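# Note: if you intend to apply the same expression many times, compile it once with re.compile and reuse the object.

**match**: matches only at the beginning of the string.

**findall**: returns all matches.

**search**: returns only the first match.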

In [47]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'   # matches most email addresses

In [48]:
regex = re.compile(pattern, flags= re.IGNORECASE) # re.I is shorthand for re.IGNORECASE, which makes the regex case-insensitive
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [50]:
m = regex.search(text)
m

<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>

In [51]:
text[m.start():m.end()]

'dave@google.com'

In [53]:
ma = regex.match(text)
ma  # None: match only succeeds if the pattern occurs at the very start of the string

In [54]:
print(regex.sub("REDACTED", text)) # replace every match with the given string

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



In [62]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.I)
regex

re.compile(r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})',
re.IGNORECASE|re.UNICODE)

In [63]:
m = regex.match('wesn@bringt.net')
m.groups()

('wesn', 'bringt', 'net')

In [64]:
regex.findall(text) # returns a list of tuples, one per match

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [67]:
# sub can refer to the groups of each match with \1, \2, and so on.

print(regex.sub(r'USER NAME: \1, DOMAIN: \2, SUFFIX:\3', text))

Dave USER NAME: dave, DOMAIN: google, SUFFIX:com
Steve USER NAME: steve, DOMAIN: gmail, SUFFIX:com
Rob USER NAME: rob, DOMAIN: gmail, SUFFIX:com
Ryan USER NAME: ryan, DOMAIN: yahoo, SUFFIX:com
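
Groups can also be given names with the `(?P<name>...)` syntax, which makes the pieces easier to refer to than by position. A sketch based on the same email pattern; the group names `user`, `domain` and `suffix` are my own choice.

```python
import re

pattern = r'(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

m = regex.match('wesn@bringt.net')
# groupdict maps each group name to the text it matched
parts = m.groupdict()
print(parts)  # {'user': 'wesn', 'domain': 'bringt', 'suffix': 'net'}

# named groups can be referenced in sub with \g<name>
swapped = regex.sub(r'\g<domain>.\g<suffix> (\g<user>)', 'wesn@bringt.net')
print(swapped)  # bringt.net (wesn)
```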



### Vectorized String Functions in Pandas



Cleaning up a messy dataset for analysis often requires a lot of string munging and
regularization. To complicate matters, a column containing strings will sometimes
have missing data:


In [69]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
      'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [70]:
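# Everything so far used Python's built-in re module; a pandas Series exposes the same functionality
# through its str attribute, which applies pattern matching to every element and skips NA values.

# Usage mirrors the re functions.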

data = Series(data)
data


Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [71]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

In [73]:
# the key method: str.contains

data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [75]:
# the pattern defined earlier:

pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [76]:
data.str.findall(pattern, flags=re.IGNORECASE)


Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [89]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [93]:
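data.str.get(1)  # second character of each string (not displayed: only the last expression in a cell is shown)
data.str[:]      # a full slice returns each string unchanged, which is why the output equals the original data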

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [94]:
data.str.get(2)  # the third character of each string

Dave       v
Steve      e
Rob        b
Wes      NaN
dtype: object
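
When the pattern has groups, `str.extract` is often more convenient than `findall`, because it returns the captured pieces as DataFrame columns and keeps NaN for missing entries. A sketch on the same data; the column names assigned at the end are mine.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# one column per capture group; rows that are NA stay NA in every column
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['user', 'domain', 'suffix']

print(parts.loc['Dave', 'domain'])     # google
print(parts.loc['Wes'].isna().all())   # True
```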

A partial list of the methods available through the str accessor:

![链接文字](https://img-blog.csdnimg.cn/20200815155119310.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dhb2N1aTg4Mw==,size_16,color_FFFFFF,t_70#pic_center)

In [95]:
# Methods such as findall, match, extract and contains accept regular expressions; startswith and endswith take literal strings only.