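## Data Cleaning and Preparation

- Loading, cleaning, transforming, and rearranging data: this kind of work is often reported to take up 80% or more of an analyst's time.
- This chapter covers tools for missing data, duplicate data, string manipulation, and other analytical data transformations.
- The next chapter focuses on combining and rearranging datasets in various ways.

### Handling Missing Data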

In [36]:
import pandas as pd
import numpy as np

In [37]:
# All descriptive statistics on pandas objects exclude missing data by default.
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [38]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [39]:
# Python's built-in None value (like NaN) is treated as NA in object arrays:
string_data[0] = None
print(string_data)
string_data.isnull()

0         None
1    artichoke
2          NaN
3      avocado
dtype: object


0     True
1    False
2     True
3    False
dtype: bool

#### Filtering Out Missing Data

You can filter out missing data by hand with pandas.isnull and boolean indexing, but dropna is usually more convenient.

In [40]:
from numpy import nan as NA

In [41]:
data = pd.Series([1,NA,3.5,NA,7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [42]:
data.dropna() # returns a new Series; the original is not modified

0    1.0
2    3.5
4    7.0
dtype: float64

In [43]:
data[data.notnull()] # equivalent to data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [44]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [45]:
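# With DataFrame objects, things are a bit more complex.
# You may want to drop rows or columns that are all NA, or only those containing any NA. By default, dropna drops any row containing a missing value: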
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [46]:
# Passing how='all' drops only the rows in which every value is NA:
data.dropna(how='all')


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [47]:
# To drop columns in the same way, pass axis=1:
data[4] = NA
print(data)
data.dropna(axis=1,how="all")

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [48]:
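# These ways of filtering DataFrame rows tend to come up with time series data.
# Suppose you want to keep only rows containing at least a certain number of observations. You can indicate this with the thresh argument: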
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA
print(df)
df.dropna(thresh=2)


          0         1         2
0 -0.543985       NaN       NaN
1 -1.134888       NaN       NaN
2  0.102892       NaN -1.439079
3 -0.291253       NaN  0.641031
4  0.444607 -1.576207  0.146887
5 -0.725376 -0.335850  0.212150
6 -0.782123 -0.014961 -0.203287


Unnamed: 0,0,1,2
2,0.102892,,-1.439079
3,-0.291253,,0.641031
4,0.444607,-1.576207,0.146887
5,-0.725376,-0.33585,0.21215
6,-0.782123,-0.014961,-0.203287
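
The following is a minimal sketch of how `thresh` counts non-NA values per row. It also shows the `subset` argument of `dropna`, which the notes above do not use. The `demo` frame and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for illustration.
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, np.nan],
})

# thresh=2 keeps the rows that have at least two non-NA values.
at_least_two = demo.dropna(thresh=2)

# subset restricts the NA check to the listed columns.
a_present = demo.dropna(subset=["a"])

print(at_least_two.index.tolist())
print(a_present.index.tolist())
```

Row 1 is entirely NA, so `thresh=2` drops it; rows 1 and 2 lack a value in column `a`, so `subset=["a"]` drops both.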


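#### Filling In Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the "holes" in one of several ways. For most purposes, fillna is the workhorse method. Calling fillna with a constant replaces missing values with that value: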


In [49]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.543985,0.0,0.0
1,-1.134888,0.0,0.0
2,0.102892,0.0,-1.439079
3,-0.291253,0.0,0.641031
4,0.444607,-1.576207,0.146887
5,-0.725376,-0.33585,0.21215
6,-0.782123,-0.014961,-0.203287


In [50]:
# Calling fillna with a dict, you can use a different fill value for each column:
df.fillna({1:0.6666,2:0.8888})

Unnamed: 0,0,1,2
0,-0.543985,0.6666,0.8888
1,-1.134888,0.6666,0.8888
2,0.102892,0.6666,-1.439079
3,-0.291253,0.6666,0.641031
4,0.444607,-1.576207,0.146887
5,-0.725376,-0.33585,0.21215
6,-0.782123,-0.014961,-0.203287


In [54]:
# fillna returns a new object, but you can also modify the existing object in place:
_ = df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,-0.543985,0.0,0.0
1,-1.134888,0.0,0.0
2,0.102892,0.0,-1.439079
3,-0.291253,0.0,0.641031
4,0.444607,-1.576207,0.146887
5,-0.725376,-0.33585,0.21215
6,-0.782123,-0.014961,-0.203287


In [55]:
# The same interpolation methods available for reindexing can be used with fillna:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.538651,-0.863332,-1.191394
1,0.018741,-0.561164,1.407663
2,-0.626215,,2.623558
3,-0.843607,,-0.757175
4,-0.468638,,
5,-0.477559,,


In [56]:
# Forward fill; recent pandas releases deprecate the method argument in favor of df.ffill()
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-0.538651,-0.863332,-1.191394
1,0.018741,-0.561164,1.407663
2,-0.626215,-0.561164,2.623558
3,-0.843607,-0.561164,-0.757175
4,-0.468638,-0.561164,-0.757175
5,-0.477559,-0.561164,-0.757175


In [59]:
df.fillna(method='ffill', limit=2) # limit caps how many consecutive NA values are filled

Unnamed: 0,0,1,2
0,-0.538651,-0.863332,-1.191394
1,0.018741,-0.561164,1.407663
2,-0.626215,-0.561164,2.623558
3,-0.843607,-0.561164,-0.757175
4,-0.468638,,-0.757175
5,-0.477559,,-0.757175


In [60]:
# With fillna you can do lots of other things with a little creativity.
# For example, you might fill missing values with the mean or median of a Series:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
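
The forward-fill calls above use `fillna(method=...)`. As a hedged sketch of the dedicated `ffill` and `bfill` methods, which do the same job without the `method` argument, consider the hypothetical Series `s` below (it is not part of the original notes).

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three missing values.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                  # propagate the last valid value forward
forward_limited = s.ffill(limit=2)   # fill at most two consecutive NAs
backward = s.bfill()                 # propagate the next valid value backward

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```

With `limit=2`, the third missing value in the run stays NA, which mirrors the `limit` behavior shown for the DataFrame above.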

### Data Transformation

#### Removing Duplicates

In [61]:
# Duplicate rows may be found in a DataFrame for any number of reasons.
data =  pd.DataFrame({
    'k1':['one','two']*3+['two'],
    'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [62]:
# The DataFrame method duplicated returns a boolean Series
# indicating whether each row is a duplicate (identical to a row seen earlier) or not:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [63]:
# drop_duplicates returns a DataFrame containing the rows for which duplicated is False.
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [65]:
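# Both of these methods by default consider all of the columns.
# Alternatively, you can specify any subset of columns to detect duplicates. Suppose we have an additional column of values and want to filter duplicates based only on the 'k1' column: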
data['v1'] = range(7)
print(data)
data.drop_duplicates(['k1'])

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6


Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [66]:
# duplicated and drop_duplicates by default keep the first observed value combination.
# Passing keep='last' keeps the last one instead:
data.drop_duplicates(['k1','k2'],keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
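
Besides `'first'` and `'last'`, the `keep` argument also accepts `False`, which treats every member of a duplicated group as a duplicate. The sketch below rebuilds the `k1`/`k2` columns from above under the hypothetical name `frame` so that it runs on its own.

```python
import pandas as pd

# Same k1/k2 values as the example above, rebuilt so the sketch is self-contained.
frame = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one", "two", "two"],
    "k2": [1, 1, 2, 3, 3, 4, 4],
})

# keep=False flags every row that has an identical twin, not only the later copy.
all_copies = frame.duplicated(keep=False)
print(frame[all_copies])

# With keep=False, drop_duplicates removes all members of each duplicated group.
unique_only = frame.drop_duplicates(keep=False)
print(unique_only.index.tolist())
```

This is handy for inspecting duplicated records side by side before deciding which copy to keep.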


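#### Transforming Data Using a Function or Mapping

In [76]:
# For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.
# Consider the following hypothetical data collected about various kinds of meat: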

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon','Pastrami', 'corned beef', 'Bacon',
        'pastrami', 'honey ham', 'nova lox'],'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5,6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [77]:
# Suppose you want to add a column indicating the kind of animal each food came from, using this mapping:
meat_to_animal = {
     'bacon': 'pig',
     'pulled pork': 'pig',
     'pastrami': 'cow',
     'corned beef': 'cow',
     'honey ham': 'pig',
     'nova lox': 'salmon',}
meat_to_animal

{'bacon': 'pig',
 'pulled pork': 'pig',
 'pastrami': 'cow',
 'corned beef': 'cow',
 'honey ham': 'pig',
 'nova lox': 'salmon'}

In [82]:
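# The map method on a Series accepts a function or a dict-like object containing a mapping,
# but here we have a small problem: some of the meats are capitalized and others are not.
# So we first convert each value to lowercase with the str.lower Series method: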
lowercased = data['food'].str.lower()
lowercased
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [81]:
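# We could also pass a function that does all the work:
animals = data['food'].map(lambda x:meat_to_animal[x.lower()])
data['animal'] = animals
data
# Using map is a convenient way to perform element-wise transformations and other data cleaning operations.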

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
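
The two `map` variants above behave differently when a value has no entry in the mapping. The sketch below illustrates this with a shortened mapping and a deliberately unmapped food; the `foods` Series and the `"unknown"` fallback label are assumptions made for the example.

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
foods = pd.Series(["Bacon", "pastrami", "tofu"])  # 'tofu' is deliberately unmapped

# Mapping with a dict leaves unmatched values as missing.
via_dict = foods.str.lower().map(meat_to_animal)

# Indexing the dict inside a function raises KeyError for unmatched values.
try:
    foods.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True

# dict.get with a default gives the function-based variant a fallback.
via_get = foods.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(via_dict.tolist())
print(raised)
print(via_get.tolist())
```

Which behavior is preferable depends on whether unmapped values should surface as missing data or stop the pipeline.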


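#### Replacing Values

In [90]:
# Filling in missing data with fillna is a special case of more general value replacement.
# As you've already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so.
# Consider this Series: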
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [91]:
# The -999 values might be sentinel values for missing data. To replace them with NA values,
# we can use replace, which produces a new Series (unless you pass inplace=True):
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [92]:
# To replace multiple values at once, pass a list followed by the substitute value:
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [93]:
# To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [94]:
# The argument can also be passed as a dict:
data.replace({-999:np.nan, -1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The data.replace method is distinct from data.str.replace, which performs string substitution element-wise. Series string methods are covered later in this chapter.
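
A small sketch of that difference, using a hypothetical Series `codes`: `Series.replace` compares whole values, while `Series.str.replace` substitutes inside each string.

```python
import pandas as pd

# Hypothetical data for illustration.
codes = pd.Series(["a-1", "b-2", "-"])

# Series.replace matches whole values: only the element equal to "-" changes.
whole_value = codes.replace("-", "missing")

# Series.str.replace substitutes inside every string; regex=False treats "-" literally.
substring = codes.str.replace("-", "_", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```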

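#### Renaming Axis Indexes

Like values in a Series, axis labels can be transformed by a function or a mapping of some form to produce new, differently labeled objects.
You can also modify the axes in place without creating a new data structure.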

In [104]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),index=['Ohio', 'Colorado', 'New York'],columns=['one', 'two', 'three', 'four'])
data.index


Index(['Ohio', 'Colorado', 'New York'], dtype='object')

In [102]:
# Like a Series, the axis indexes have a map method:
transform = lambda x: x[:4].upper()
data.index.map(transform)

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [103]:
# You can assign to index, modifying the DataFrame in place:
data.index = data.index.map(transform)
data

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [119]:
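# rename returns a transformed copy and leaves the original untouched by default.
# Here inplace=True is passed as well, so data itself is modified and the later cells see the new labels:
data.rename(index = str.title, columns = str.upper, inplace = True)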
data

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [121]:
# Notably, rename can be used with a dict-like object to supply new values for a subset of the axis labels:
data.rename(index = {'Ohio':'INDIANA'}, columns={'THREE':'peekaboo'})

Unnamed: 0,ONE,TWO,peekaboo,FOUR
INDIANA,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [123]:
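# rename saves you from the chore of copying the DataFrame manually and assigning to its index and columns attributes.
# To modify the dataset in place, pass inplace=True. The key must match the current label ('Ohio' at this point); unmatched keys are silently ignored:
data.rename(index={'Ohio': 'INDIANA'}, inplace=True)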

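#### Discretization and Binning

In [152]:
# Continuous data is often discretized, or otherwise separated into "bins", for analysis.
# Suppose you have data about a group of people in a study and want to group them into discrete age buckets.
ages =  [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Let's divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older.
bins =  [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)

print(cats)


# pandas returns a special Categorical object. The output describes the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin name;
# internally it holds a categories array naming the distinct categories, plus a codes attribute that labels each value in ages with the position of its bin:

print(cats.codes)
print(cats.categories)
pd.value_counts(cats)
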
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
[0 0 0 1 0 0 2 1 3 2 2 1]
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')


(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [153]:
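# Consistent with mathematical interval notation, a parenthesis means the side is open, while a square bracket means it is closed (inclusive).
# You can change which side is closed by passing right=False: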
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

In [154]:
# You can also supply your own bin names by passing a list or array to the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

In [157]:
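# If you pass cut an integer number of bins instead of explicit bin edges, pandas computes equal-length bins from the minimum and maximum values in the data.
# Consider some uniformly distributed data chopped into fourths (the precision=2 option limits the decimal precision to two digits):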
data = np.random.rand(20)
pd.cut(data, 4, precision=2)

[(0.73, 0.96], (0.28, 0.5], (0.73, 0.96], (0.73, 0.96], (0.048, 0.28], ..., (0.73, 0.96], (0.048, 0.28], (0.048, 0.28], (0.28, 0.5], (0.5, 0.73]]
Length: 20
Categories (4, interval[float64]): [(0.048, 0.28] < (0.28, 0.5] < (0.5, 0.73] < (0.73, 0.96]]

In [160]:
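# qcut is a closely related function that bins the data based on sample quantiles.
# Depending on the distribution of the data, cut will usually not give each bin the same number of data points.
# Since qcut uses sample quantiles instead, it yields bins holding roughly equal numbers of points (not bins of equal width):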

data = np.random.randn(1000)

In [163]:
cats = pd.qcut(data,4)
pd.value_counts(cats)

(0.69, 2.873]                    250
(0.0231, 0.69]                   250
(-0.638, 0.0231]                 250
(-3.4619999999999997, -0.638]    250
dtype: int64

In [166]:
# Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):
pd.value_counts(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]))

(0.0231, 1.225]                  400
(-1.276, 0.0231]                 400
(1.225, 2.873]                   100
(-3.4619999999999997, -1.276]    100
dtype: int64

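In later chapters, cut and qcut come up again in the discussion of aggregation and group operations, since these discretization functions are especially useful for quantile and group analysis.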
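
As a brief preview of that use, the sketch below groups a hypothetical Series `values` by its quartile bin and summarizes each group. The quartile labels `Q1` to `Q4` and the seeded generator are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 draws from a standard normal distribution.
rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(1000))

# Quartile bins with custom labels; each bin holds about a quarter of the points.
quartile = pd.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Group the raw values by their bin and summarize each group.
summary = values.groupby(quartile, observed=True).agg(["count", "min", "max"])
print(summary)
```

Because the bins come from sample quantiles, the `count` column is the same for every group, while `min` and `max` show where each quartile begins and ends.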

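#### Detecting and Filtering Outliers

In [173]:
# Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data: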
data = pd.DataFrame(np.random.randn(1000, 4))
data

Unnamed: 0,0,1,2,3
0,0.884562,-0.037524,1.774746,-0.762204
1,-1.179343,-1.206703,-0.120006,-0.040957
2,-0.479342,1.919852,-1.572241,-0.157997
3,0.636788,0.036732,-0.201507,1.615900
4,0.547398,2.233941,-0.204458,1.480093
...,...,...,...,...
995,-0.750497,-0.150814,-0.547285,-0.300477
996,0.143815,-1.140225,1.869420,1.778786
997,-1.436742,0.410095,-0.437544,0.514214
998,0.632389,-0.284183,-0.942877,0.606675


In [174]:
# Suppose you want to find values in one of the columns exceeding 3 in absolute value:
col = data[2]
col[np.abs(col)>3]

182   -3.230561
602    3.105307
Name: 2, dtype: float64

In [175]:
# To select all rows having a value greater than 3 or less than -3, you can use the any method on a boolean DataFrame:
data[(np.abs(data) > 3).any(axis=1)]

Unnamed: 0,0,1,2,3
182,-1.479818,-0.644895,-3.230561,-0.074959
270,1.302572,-0.071369,-1.279697,3.166513
359,-2.779244,-0.70738,0.472663,-3.062822
466,0.169714,-0.193251,0.348998,-3.003322
602,-1.432046,0.612466,3.105307,0.120362
646,0.359534,3.599658,-0.041371,-0.080332
648,-1.50982,3.107938,-0.880572,-0.029647
679,1.393324,-3.181249,0.731913,0.378222
722,1.262837,3.00265,0.220749,0.181878
729,1.914932,0.107091,1.151259,3.182782


In [180]:
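# Values can be set based on these criteria. The code below caps values outside the interval -3 to 3:

# Values whose absolute value exceeds 3 are set to +3 or -3 according to their sign
data[np.abs(data) > 3] = np.sign(data) * 3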
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.024222,0.038519,-0.000353,-0.009315
std,1.034007,1.036672,0.999945,1.041793
min,-2.807908,-3.0,-3.0,-3.0
25%,-0.747162,-0.631713,-0.654297,-0.661615
50%,0.006125,0.05818,-0.026722,0.010938
75%,0.732548,0.689401,0.69123,0.678682
max,2.925559,3.0,3.0,3.0


In [181]:
# The statement np.sign(data) produces 1 and -1 values based on whether the values in data are positive or negative:
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,1.0,-1.0,1.0,-1.0
1,-1.0,-1.0,-1.0,-1.0
2,-1.0,1.0,-1.0,-1.0
3,1.0,1.0,-1.0,1.0
4,1.0,1.0,-1.0,1.0
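
The capping step can also be written with `DataFrame.clip`. The sketch below is an alternative formulation, not the approach used in the notes; it checks on a hypothetical `frame` that both give the same result. The data is scaled by 2 only so that values beyond 3 are plentiful.

```python
import numpy as np
import pandas as pd

# Hypothetical data, scaled so that values beyond +/-3 occur often.
rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.standard_normal((1000, 4)) * 2)

# Approach from the notes: overwrite outliers with +3 or -3 according to sign.
capped_sign = frame.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip limits every value to the closed interval [-3, 3].
capped_clip = frame.clip(lower=-3, upper=3)

print(np.allclose(capped_sign.to_numpy(), capped_clip.to_numpy()))
```

`clip` returns a new object and leaves `frame` unchanged, whereas the boolean assignment modifies the object it is applied to.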


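#### Permutation and Random Sampling

In [188]:
# Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do with numpy.random.permutation.
# Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering: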
df = pd.DataFrame(np.arange(5*4).reshape(5,4))
sampler = np.random.permutation(5)
sampler

array([1, 4, 0, 3, 2])

In [190]:
# That array can then be used in iloc-based indexing or the equivalent take function:
df.take(sampler) # reorders the rows of df to follow the shuffled positions in sampler

Unnamed: 0,0,1,2,3
1,4,5,6,7
4,16,17,18,19
0,0,1,2,3
3,12,13,14,15
2,8,9,10,11


In [191]:
# To select a random subset without replacement, you can use the sample method on Series and DataFrame:
df.sample(n=3)

Unnamed: 0,0,1,2,3
3,12,13,14,15
0,0,1,2,3
1,4,5,6,7


In [196]:
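# To generate a sample with replacement (allowing repeat choices), pass replace=True to sample. This is required whenever n exceeds the number of available values, as it does here: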
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws

1    7
2   -1
1    7
1    7
4    4
2   -1
1    7
4    4
1    7
0    5
dtype: int64
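
The sampling calls above give different results on every run. As a hedged sketch, the `random_state` argument of `sample` makes a draw repeatable, and the same example shows why `replace=True` is needed when `n` exceeds the number of values. The seed 42 is an arbitrary choice.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# The same random_state yields the same draw each time.
first = choices.sample(n=3, random_state=42)
second = choices.sample(n=3, random_state=42)

# Asking for more values than exist fails without replacement.
try:
    choices.sample(n=10)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

# With replace=True the same request succeeds, and values repeat.
with_replacement = choices.sample(n=10, replace=True, random_state=42)

print(first.equals(second), oversample_allowed, len(with_replacement))
```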

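#### Computing Indicator/Dummy Variables

- Converting a categorical variable into a "dummy" or "indicator" matrix is another type of transformation used in statistical modeling and machine learning.
- If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing all 1s and 0s.
- pandas has a get_dummies function for doing this, though devising one yourself is not difficult.
- Let's return to an earlier example DataFrame: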

In [202]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df
pd.get_dummies(df['key']) # convert the key column into dummy variables

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [204]:
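# In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data.
# get_dummies has a prefix argument for doing this: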
dummies = pd.get_dummies(df['key'],prefix='fix')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy

Unnamed: 0,data1,fix_a,fix_b,fix_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


In [234]:
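# If a row in a DataFrame belongs to multiple categories, things are a bit more complicated.
# Let's look at MovieLens data; the book uses the 1M dataset, described in more detail in its Chapter 14. The CSV read here already has a header row, so mnames is not passed to read_csv: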
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_csv('/Users/eillhuang/Desktop/数据分析案例/movies.csv')
print(movies[:5])
print(movies.info())
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
all_genres
genres = pd.unique(all_genres)
genres


   movieId  ...                                       genres
0        1  ...  Adventure|Animation|Children|Comedy|Fantasy
1        2  ...                   Adventure|Children|Fantasy
2        3  ...                               Comedy|Romance
3        4  ...                         Comedy|Drama|Romance
4        5  ...                                       Comedy

[5 rows x 3 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None


array(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Mystery', 'Sci-Fi', 'War', 'Musical', 'Documentary', 'IMAX',
       'Western', 'Film-Noir', '(no genres listed)'], dtype=object)

In [237]:
# One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
zero_matrix = np.zeros((len(movies), len(genres)))
zero_matrix
dummies = pd.DataFrame(zero_matrix, columns=genres)
dummies


Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9738,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9740,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [242]:
# Now, iterate through each movie and set the matching entries in each row of dummies to 1.
# To do this, we use dummies.columns to compute the column indices for each genre:
gen = movies.genres[0]
print(gen)
print(gen.split('|'))
dummies.columns.get_indexer(gen.split('|'))

Adventure|Animation|Children|Comedy|Fantasy
['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']


array([0, 1, 2, 3, 4])

In [250]:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
print(dummies)
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0,:]

      Adventure  Animation  Children  ...  Western  Film-Noir  (no genres listed)
0           1.0        1.0       1.0  ...      0.0        0.0                 0.0
1           1.0        0.0       1.0  ...      0.0        0.0                 0.0
2           0.0        0.0       0.0  ...      0.0        0.0                 0.0
3           0.0        0.0       0.0  ...      0.0        0.0                 0.0
4           0.0        0.0       0.0  ...      0.0        0.0                 0.0
...         ...        ...       ...  ...      ...        ...                 ...
9737        0.0        1.0       0.0  ...      0.0        0.0                 0.0
9738        0.0        1.0       0.0  ...      0.0        0.0                 0.0
9739        0.0        0.0       0.0  ...      0.0        0.0                 0.0
9740        0.0        1.0       0.0  ...      0.0        0.0                 0.0
9741        0.0        0.0       0.0  ...      0.0        0.0                 0.0

[9742 rows x 20

movieId                                                               1
title                                                  Toy Story (1995)
genres                      Adventure|Animation|Children|Comedy|Fantasy
Genre_Adventure                                                       1
Genre_Animation                                                       1
Genre_Children                                                        1
Genre_Comedy                                                          1
Genre_Fantasy                                                         1
Genre_Romance                                                         0
Genre_Drama                                                           0
Genre_Action                                                          0
Genre_Crime                                                           0
Genre_Thriller                                                        0
Genre_Horror                                                    

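- For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy.
- It would be better to write a lower-level function that writes directly to a NumPy array, and then wrap the result in a DataFrame.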
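
One possible shape for such a function is sketched below. It is an illustration under assumptions, not the implementation the notes have in mind: the `toy` frame stands in for the movies table, and the genre columns are sorted alphabetically rather than kept in order of first appearance.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movies table.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Comedy|Drama", "Drama", "Action|Comedy"],
})

split_genres = [g.split("|") for g in toy["genres"]]
genres = sorted({label for labels in split_genres for label in labels})
column_of = {label: j for j, label in enumerate(genres)}

# Row and column positions of every 1 in the indicator matrix.
rows = [i for i, labels in enumerate(split_genres) for _ in labels]
cols = [column_of[label] for labels in split_genres for label in labels]

# Fill the NumPy array in a single indexed assignment, then wrap it.
matrix = np.zeros((len(toy), len(genres)), dtype=np.int8)
matrix[rows, cols] = 1
dummies = pd.DataFrame(matrix, columns=genres, index=toy.index)

print(dummies)
```

The per-row `iloc` assignments are replaced by one array write. For pipe-separated strings like these, the built-in `Series.str.get_dummies('|')` is another option worth trying first.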

In [256]:
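# A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345) # numpy.random.seed sets the random seed so that the example is deterministic. https://blog.csdn.net/weixin_41571493/article/details/80549833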
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0


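### String Manipulation

- Python has long been a popular raw data manipulation language, in part because of its ease of use for string and text processing.
- Most text operations are made simple with the string object's built-in methods.
- For more complex pattern matching and text manipulations, regular expressions may be needed.
- pandas lets you apply string and regular expression operations concisely to whole arrays of data, and it also handles the annoyance of missing data.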
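
To make the last point concrete, here is a minimal sketch using the `str` accessor on a hypothetical Series `emails` that contains a missing value. The names and addresses are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
emails = pd.Series(
    ["dave@google.com", "steve@gmail.com", np.nan],
    index=["Dave", "Steve", "Wes"],
)

# Element-wise string methods skip the missing entry instead of raising.
upper = emails.str.upper()
has_gmail = emails.str.contains("gmail")

# A regex with two groups is applied to every element; each group becomes a column.
parts = emails.str.extract(r"([^@]+)@(.+)")

print(upper)
print(has_gmail)
print(parts)
```

Calling `.upper()` on the raw values in a plain Python loop would fail on the NaN entry, which is the annoyance the accessor takes care of.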

#### String Object Methods


In [259]:
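# In many string munging and scripting applications, built-in string methods are sufficient.
# For example, a comma-separated string can be broken into pieces with split: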
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

In [261]:
# split is often combined with strip to trim whitespace (including line breaks):
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [263]:
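# These substrings could be concatenated together with a two-colon delimiter using addition:
first, second, third = pieces
first + "::" + second + "::" + third
# or, more idiomatically, pass the list to the join method of the '::' string:
"::".join(pieces)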

'a::b::guido'

In [267]:
# Other methods are concerned with locating substrings. Python's in keyword is the best way to detect a substring, though index and find can also be used:
'guido' in val

True

In [265]:
# Note the difference between find and index: index raises an exception if the string isn't found (whereas find returns -1):
val.index(',')


1

In [266]:
val.find(":")

-1

In [268]:
# Relatedly, count returns the number of occurrences of a particular substring:
val.count(',')

2

In [269]:
# replace substitutes occurrences of one pattern for another. It is also commonly used to delete patterns by passing an empty string:
val.replace(',',"::")

'a::b::  guido'

In [270]:
val.replace(',',"")

'ab  guido'

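#### Regular Expressions

Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in re module is responsible for applying regular expressions to strings; a few examples of its use follow.

The re module functions fall into three categories: pattern matching, substitution, and splitting. Naturally these are all related.
A regex describes a pattern to locate in the text, which can then be used for many purposes.
Let's look at a simple example: suppose we want to split a string with a variable number of whitespace characters (tabs, spaces, and newlines). The regex describing one or more whitespace characters is \s+: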

In [273]:
import re
text = 'foo  bar\t baz \tqux'
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

In [275]:
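# When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text.
# You can compile the regex yourself with re.compile, forming a reusable regex object:
regex = re.compile(r'\s+')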
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [276]:
# If you want a list of all the substrings matching the regex instead, use the findall method:
regex.findall(text)

['  ', '\t ', ' \t']

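- match and search are closely related to findall.
- findall returns all matches in a string,
- while search returns only the first match.
- match is more rigid: it only matches at the beginning of the string.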
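
A compact sketch of the three calls side by side, using a hypothetical digit pattern and sample strings rather than the email example that follows:

```python
import re

# Hypothetical pattern and text for illustration.
regex = re.compile(r"\d+")
text = "abc 123 def 456"

all_matches = regex.findall(text)   # every non-overlapping match, as strings
first = regex.search(text)          # match object for the first match anywhere
at_start = regex.match(text)        # None: the text does not start with digits
anchored = regex.match("789 xyz")   # succeeds: digits at position 0

print(all_matches)
print(first.group(), first.span())
print(at_start)
print(anchored.group())
```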

In [278]:
# A regular expression that identifies most email addresses:

text = """Dave dave@google.com
   Steve steve@gmail.com
   Rob rob@gmail.com
   Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [281]:
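# search returns a match object for the first email address in the text.
# For the preceding regex, the match object can only tell us the start and end position of the pattern in the string: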
m = regex.search(text)
print(m)
print(text[m.start():m.end()])

<re.Match object; span=(5, 20), match='dave@google.com'>
dave@google.com


In [282]:
# regex.match only matches if the pattern occurs at the start of the string; otherwise it returns None:
print(regex.match(text))

None


In [283]:
# Relatedly, sub returns a new string with occurrences of the pattern replaced by a new string:
print(regex.sub('REDACTED', text))

Dave REDACTED
   Steve REDACTED
   Rob REDACTED
   Ryan REDACTED



In [285]:
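# Suppose you want to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
# A match object produced by this modified regex returns a tuple of the pattern components with its groups method: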
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [286]:
# findall returns a list of tuples when the pattern has groups:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [288]:
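# sub also has access to the groups in each match using special symbols like \1 and \2.
# The symbol \1 corresponds to the first matched group, \2 to the second, and so forth: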
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
   Steve Username: steve, Domain: gmail, Suffix: com
   Rob Username: rob, Domain: gmail, Suffix: com
   Ryan Username: ryan, Domain: yahoo, Suffix: com
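
Numbered groups become hard to read as a pattern grows. As a hedged extension of the example above, the sketch below gives the three groups names with the `(?P<name>...)` syntax of the re module. The group names `username`, `domain`, and `suffix` are chosen for this illustration.

```python
import re

# Same email pattern as above, with each group given a name.
pattern = (
    r"(?P<username>[A-Z0-9._%+-]+)"
    r"@(?P<domain>[A-Z0-9.-]+)"
    r"\.(?P<suffix>[A-Z]{2,4})"
)
regex = re.compile(pattern, flags=re.IGNORECASE)

# groupdict returns the captured parts keyed by group name.
m = regex.match("wesm@bright.net")
parts = m.groupdict()

# In a replacement string, \g<name> refers to a named group.
redacted = regex.sub(r"\g<username>@***", "Dave dave@google.com")

print(parts)
print(redacted)
```

`groups()` and the `\1` style references keep working on a pattern with named groups, so the names can be added without changing existing code.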

