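# Chapter 7 Data Cleaning and Preparation
Cleaning and preparing data is a major part of any data analysis. Because real data is often incomplete and messy, most of the time in an analysis tends to go into handling the data itself. Accurate data is also a precondition for trustworthy results, so the cleaning step deserves careful attention.
The previous chapter covered how to read data from text files, JSON files, XML files, databases, and the network. This chapter addresses what to do with the data once it has been loaded.

This chapter covers:
* Handling missing data
* Data transformation
* String manipulation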

## 7.1 Handling Missing Data

First, import NumPy and pandas.

In [None]:
import pandas as pd
import numpy as np

In [None]:
string_data = pd.Series(['aardvark','artichoke',np.nan, 'avocado'])

In [None]:
string_data

In [None]:
string_data.isnull()

In [None]:
string_data[0] = None

In [None]:
string_data.isnull()

---
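pandas provides several methods for working with missing values:
* `dropna`: filters out missing values
* `fillna`: fills missing values with a given value
* `isnull`: returns a boolean mask marking the missing values
* `notnull`: the opposite of `isnull`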
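
As a quick illustration of the four methods listed above, here is a minimal sketch on a small made-up Series. The variable names are illustrative only and do not appear elsewhere in this chapter.

```python
import numpy as np
import pandas as pd

# A tiny Series with one missing entry (made-up values)
s = pd.Series([1.0, np.nan, 3.0])

mask_missing = s.isnull()    # True where the value is missing
mask_present = s.notnull()   # element-wise opposite of isnull
dropped = s.dropna()         # new Series without the missing entry
filled = s.fillna(0.0)       # new Series with the gap replaced by 0.0

print(mask_missing.tolist())
print(dropped.tolist())
print(filled.tolist())
```

None of these calls changes `s` itself; each one returns a new object.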

In [None]:
from numpy import nan as NA

In [None]:
data = pd.Series([1,NA,3,5,NA,7])

In [None]:
data

### Filtering Out Missing Data

`dropna` filters out the entries that contain missing values. It does not modify the original data; it returns a filtered copy.

In [None]:
data.dropna()

In [None]:
data.dropna?


In [None]:
data[data.notnull()]

In [None]:
data


In [None]:
data = pd.DataFrame([[1.,3.,5.],[1.,NA,NA],[NA,5.,6.]])

In [None]:
data

In [None]:
cleaned = data.dropna()

In [None]:
cleaned

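By default, `dropna` drops any row that contains at least one missing value. Passing `how='all'` drops only the rows in which every value is missing, `axis=1` applies the same logic to columns instead of rows, and `thresh` keeps the rows that have at least that many non-missing values. See `data.dropna?` for details.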

In [None]:
cleaned = data.dropna(how='all')

In [None]:
cleaned

In [None]:
cleaned = data.dropna(axis=1)

In [None]:
cleaned

In [None]:
df = pd.DataFrame(np.random.randn(7,3))

In [None]:
df.iloc[:4,1] = NA
df.iloc[:2,2] = NA

In [None]:
df

In [None]:
df.dropna()

In [None]:
df.dropna(thresh = 2)
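
Because `df` above is filled with random numbers, the effect of each `dropna` option is easier to see on fixed values. The following sketch uses a small hand-written frame, which is an assumption for illustration and not data from this chapter.

```python
import numpy as np
import pandas as pd

# Rows have 3, 1, 2 and 0 non-missing values respectively
frame = pd.DataFrame(
    [[1.0, 3.0, 5.0],
     [1.0, np.nan, np.nan],
     [np.nan, 5.0, 6.0],
     [np.nan, np.nan, np.nan]]
)

any_missing = frame.dropna()            # drop rows with at least one NaN
all_missing = frame.dropna(how="all")   # drop only rows that are entirely NaN
at_least_two = frame.dropna(thresh=2)   # keep rows with >= 2 non-missing values
by_column = frame.dropna(axis=1)        # drop columns with at least one NaN

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(by_column.shape)
```

Every column here contains at least one missing value, so the column-wise call leaves a frame with no columns, which shows how aggressive the default can be.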

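### Filling In Missing Data

Dropping every missing value can throw away useful information. Real-world data is messy and missing values are normal; if all of them were filtered out, there might be too little left to analyze.
The following cells show how to fill in missing values so that as much of the data as possible is used while the results are distorted as little as possible.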

In [None]:
df.fillna(0)

In [None]:
df.fillna({1:0.5,2:0})


`fillna` can also overwrite the original data in place:

In [None]:
df.fillna(0, inplace=True)

In [None]:
df


In [None]:
df = pd.DataFrame(np.random.randn(6,3))

In [None]:
df


In [None]:

df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA

In [None]:
df

In [None]:
# Forward fill; the `method` argument of fillna is deprecated in recent pandas
df.ffill()

In [None]:
df.ffill(limit=2)
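
Forward filling copies the last valid value into the gaps that follow it, and `limit` caps how many consecutive gaps are filled. The sketch below shows this on a hand-written Series (made-up values) using the `ffill` method.

```python
import numpy as np
import pandas as pd

# One valid value followed by a run of three gaps
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

filled_all = s.ffill()         # every gap takes the value 1.0
limited = s.ffill(limit=2)     # only the first two gaps are filled

print(filled_all.tolist())
print(limited.isnull().tolist())
```

With `limit=2`, the third gap stays missing, which is useful when a stale value should not be carried forward indefinitely.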

In [None]:
data = pd.Series([1,NA,3,NA,7])

`fillna` also accepts a computed value, such as the mean of the Series, which makes it a flexible tool for filling gaps.

In [None]:
data.fillna(data.mean())
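
The same idea extends to a DataFrame: passing the Series returned by `mean()` fills each column with its own mean. This is a minimal sketch on made-up values, not data from this chapter.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 4.0, 8.0]})

col_means = frame.mean()           # NaN is skipped: a -> 2.0, b -> 6.0
filled = frame.fillna(col_means)   # each column is filled with its own mean

print(filled)
```

The original `frame` is left untouched because `fillna` returns a new object unless `inplace=True` is passed.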

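## 7.2 Data Transformation
Section 7.1 covered filtering and filling missing data. At this point the data is in memory, and its missing values have either been dropped or replaced with a reasonable substitute.
This section works through a number of operations in turn: removing duplicates, mapping values, renaming axis indexes, binning, detecting outliers, random sampling, and computing indicator (dummy) variables.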


In [None]:
data = pd.DataFrame({"k1":['one','two']*3 + ['two'], 'k2':[1,1,2,3,3,4,4]})

In [None]:
data

In [None]:
data.duplicated()

In [None]:
data.drop_duplicates()

In [None]:
data['v1'] = range(7)

In [None]:
data.drop_duplicates()
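
Once a column such as `v1` makes every row unique, `drop_duplicates()` with no arguments removes nothing. Duplicates can then be judged on a subset of columns instead. The sketch below uses a small made-up frame; the `subset` and `keep` arguments are part of the pandas API, while the data is an assumption for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                      "k2": [1, 1, 1, 1],
                      "v1": [0, 1, 2, 3]})

full_dupes = frame.duplicated()                      # all False: v1 differs
key_dupes = frame.duplicated(subset=["k1", "k2"])    # compare k1 and k2 only
first_kept = frame.drop_duplicates(subset=["k1", "k2"])
last_kept = frame.drop_duplicates(subset=["k1", "k2"], keep="last")

print(key_dupes.tolist())
print(first_kept["v1"].tolist())
print(last_kept["v1"].tolist())
```

By default the first occurrence of each combination is kept; `keep="last"` keeps the last one instead.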

In [None]:
lowerstr = data['k1'].str.lower()

In [None]:
lowerstr

In [None]:
meat_to_animal = {'one':'cat','two':'dog','three':'fish'}

In [None]:
data['k1'] = lowerstr.map(meat_to_animal)

In [None]:
data

In [None]:
animal_to_meat = {'cat':'k1','dog':'k2'}

In [None]:
data['k1'].map(lambda x : animal_to_meat[x.lower()])
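
One difference between the two forms of `map` shown above is how they treat a value with no mapping. A dict produces a missing value, while a lambda that indexes the dict raises `KeyError`. A minimal sketch on made-up values, with hypothetical names:

```python
import pandas as pd

s = pd.Series(["One", "two", "THREE"])
lookup = {"one": "cat", "two": "dog"}   # no entry for "three"

# Dict mapping: unmatched values become NaN
mapped = s.str.lower().map(lookup)

# Function mapping: indexing the dict raises KeyError for "three"
try:
    s.map(lambda x: lookup[x.lower()])
    raised = False
except KeyError:
    raised = True

print(mapped.tolist())
print(raised)
```

Normalizing case with `str.lower()` before mapping avoids spurious mismatches such as "One" versus "one".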

In [None]:
data = pd.DataFrame(np.arange(12).reshape((3,4,)),index=['O','C','N'],columns=['one','two','three','four'])

In [None]:
data


In [None]:
transform = lambda x : x[:4].upper()

In [None]:
transform

In [None]:
data.index.map(transform)

In [None]:
data.rename(index=str.title,columns=str.upper)

In [None]:
ages=[20,22,34,56,78,90,12,34,34]

In [None]:
bins = [15,40,60,100]

In [None]:
cats = pd.cut(ages ,bins)

In [None]:
cats

In [None]:
cats.codes

In [None]:
cats.categories

In [None]:
# The top-level pd.value_counts is deprecated in recent pandas
pd.Series(cats).value_counts()

In [None]:
group_names = ['Youth','YoungAdult','middleAge']

In [None]:
cats = pd.cut(ages ,bins,labels=group_names)

In [None]:
cats
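
Two details of `pd.cut` are easy to miss: bins are closed on the right by default, and values outside every bin become missing (code `-1`). The sketch below shows both on a few made-up ages with the same bin edges as above.

```python
import pandas as pd

ages = [12, 20, 40, 61]
bins = [15, 40, 60, 100]

# Default: intervals (15, 40], (40, 60], (60, 100]
right_closed = pd.cut(ages, bins)

# right=False: intervals [15, 40), [40, 60), [60, 100)
left_closed = pd.cut(ages, bins, right=False)

print(right_closed.codes.tolist())
print(left_closed.codes.tolist())
```

The age 12 lies below the first edge and gets code `-1` in both cases, while 40 moves from the first bin to the second when the closed side changes.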

In [None]:
data = np.random.rand(20)

In [None]:
pd.cut(data,4 ,precision=2)

In [None]:
data = np.random.randn(1000)

In [None]:
cats= pd.cut(data,4)

In [None]:
cats

In [None]:
pd.Series(cats).value_counts()

In [None]:
data = pd.DataFrame(np.random.randn(1000,4))

In [None]:
data.describe()

In [None]:
col = data[2]

In [None]:
# Select the values whose absolute value exceeds 3
col[np.abs(col) > 3]

In [None]:
col

In [None]:
np.sign(data).head()
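
The cells above locate outliers in a single column. A common follow-up is to select every row that has an outlier in any column and then cap the values. The sketch below uses made-up numbers and `DataFrame.clip`, which is one possible way to cap values and not necessarily the approach intended by this chapter.

```python
import pandas as pd

frame = pd.DataFrame({"x": [0.5, -4.2, 3.7],
                      "y": [2.9, 3.0, -3.1]})

# Rows where any column has an absolute value above 3
outlier_rows = frame[(frame.abs() > 3).any(axis=1)]

# Cap every value to the interval [-3, 3]
capped = frame.clip(lower=-3, upper=3)

print(outlier_rows.index.tolist())
print(capped)
```

The value 3.0 itself is not flagged because the comparison is strict.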

In [None]:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))

In [None]:
df

In [None]:
sampler = np.random.permutation(5)

In [None]:
sampler

In [None]:
df.take(sampler)

In [None]:
choies = pd.Series([5,7,-1,6,4])

In [None]:
draws = choies.sample(n=10,replace=True)

In [None]:
draws

In [None]:
df = pd.DataFrame({'key':['b','b','a','c','a']})

In [None]:
pd.get_dummies(df['key'])

In [None]:
dumies = pd.get_dummies(df['key'],prefix='key')

In [None]:
dumies

In [None]:
df_with_dumpy = df[['key']].join(dumies)

In [None]:
df_with_dumpy


In [None]:
mnames = ['move_id','title','genres']

In [None]:
# A multi-character separator requires the Python parsing engine
moives = pd.read_table('datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')

In [None]:
moives[:10]

In [None]:
all_rangs = []

In [None]:
for x in moives.genres:
    all_rangs.extend(x.split('|'))

In [None]:
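# Wrap the list in a Series; passing a plain list to pd.unique is deprecated in recent pandas
genres = pd.Series(all_rangs).unique()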

In [None]:
genres

In [None]:
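zreo_mx = np.zeros((len(moives), len(genres)))

In [None]:
zreo_mx

In [None]:
# Build an all-zero indicator frame, then set a 1 for every genre a movie belongs to
dumies = pd.DataFrame(zreo_mx, columns=genres)
for i, gen in enumerate(moives.genres):
    indices = dumies.columns.get_indexer(gen.split('|'))
    dumies.iloc[i, indices] = 1

In [None]:
moives_w = moives.join(dumies.add_prefix('Genre_'))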

In [None]:
# iloc is indexed with square brackets; this shows the first row
moives_w.iloc[0]

In [None]:
print(moives_w)
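
For delimiter-separated labels like the `genres` column, pandas also offers `Series.str.get_dummies`, which builds the indicator columns in one call. The sketch below runs on a tiny made-up frame, not on the MovieLens file, and is an alternative to a manual loop and not the method used above.

```python
import pandas as pd

# Made-up stand-in for a table with a pipe-separated label column
titles = pd.DataFrame({"title": ["A", "B"],
                       "genres": ["Comedy|Drama", "Drama"]})

# One indicator column per distinct label, split on "|"
indicators = titles["genres"].str.get_dummies("|")
joined = titles.join(indicators.add_prefix("Genre_"))

print(joined)
```

For very large tables a hand-built NumPy matrix may be faster, so this is best seen as a convenience.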

In [None]:
# seed() returns None, so seed first and then draw the random values
np.random.seed(12345)
values = np.random.rand(10)

In [None]:
values

In [None]:
print(values)
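
Numeric values like these are often turned into indicator columns by combining `pd.cut` with `pd.get_dummies`. The sketch below assumes that is the intended next step and uses three fixed values with hypothetical bin edges so the result is reproducible.

```python
import pandas as pd

values = [0.1, 0.5, 0.9]
bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]   # five equal-width bins (assumed edges)

# Bin the values, then create one indicator column per bin
indicators = pd.get_dummies(pd.cut(values, bins))

print(indicators)
```

Each row has exactly one active column, the one for the bin that contains the value.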

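## 7.3 String Manipulation


Python is well suited to processing strings and text. Its built-in string methods and regular expressions handle most tasks conveniently, and a rich set of libraries is available for the rest. This section covers the basics.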

In [None]:
val = 'a,b,val'

In [None]:
val.split(',')

In [None]:
piece = [x.strip() for x in val.split(',')]

In [None]:
piece

In [None]:
'::'.join(piece)

In [None]:
'a' in val

In [None]:
val.index(',')

In [None]:
val.find(':')

In [None]:
val.count(',')

In [None]:
val.replace(',',':')
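
The difference between `find` and `index` matters when the substring may be absent: `find` returns `-1`, while `index` raises `ValueError`. A minimal sketch with a made-up string that also shows why `strip` is usually paired with `split`:

```python
val = "a,b,  guido"

# split keeps surrounding whitespace, so strip each piece
pieces = [x.strip() for x in val.split(",")]

missing = val.find(":")        # -1 when the substring is not found

try:
    val.index(":")             # raises ValueError when not found
    raised = False
except ValueError:
    raised = True

print(pieces)
print(missing, raised)
```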

### Regular Expressions

In [None]:
import re

In [None]:
text = "foo bar\t baz \tqux"

In [None]:
# Use a raw string so that \s is not treated as a string escape
re.split(r'\s+', text)

In [None]:
regex = re.compile(r'\s+')

In [None]:
regex

In [None]:
regex.split(text)

In [None]:
regex.findall(text)

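A compiled pattern can also describe a more specific rule. The pattern below matches most email addresses, and `re.IGNORECASE` makes it case-insensitive. Because the sample `text` above contains no email address, the calls that follow return an empty list and `None` for it.

In [None]:
patten = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

In [None]:
regex = re.compile(patten, flags=re.IGNORECASE)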

In [None]:
regex.findall(text)

In [None]:
m = regex.search(text)

In [None]:
m

In [None]:
print(regex.match(text))
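
To see `findall`, `search`, and `match` return something on an email pattern, the text needs to contain addresses. The sketch below uses a made-up two-line string with placeholder addresses; the names and domains are assumptions for illustration.

```python
import re

# Made-up sample text with placeholder addresses
text = "Dave dave@example.com\nSteve steve@example.org\n"

pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
regex = re.compile(pattern, flags=re.IGNORECASE)

found = regex.findall(text)                     # all addresses in the text
first = regex.search(text)                      # first match anywhere in the text
first_addr = text[first.start():first.end()]
at_start = regex.match(text)                    # only matches at the very beginning

print(found)
print(first_addr)
print(at_start)
```

`match` returns `None` here because the text begins with a name, not with an address.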

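pandas also provides string and text processing through the `str` accessor of a Series, which skips missing values instead of raising an error.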

In [None]:
data = {'dava':'d@google.com','s':'s@gmail.com','r':'r@gmail.com','w':np.nan}

In [None]:
data


In [None]:
data = pd.Series(data)

In [None]:
data

In [None]:
data.isnull()

In [None]:
data.str.contains('gmail')

In [None]:
data.str.findall(patten, flags=re.IGNORECASE)

In [None]:
# str.match returns booleans, so collect the matched strings with findall before indexing
matches = data.str.findall(patten, flags=re.IGNORECASE)

In [None]:
matches.str[0]

In [None]:
data.str[:5]
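
When the parts of each address are needed as separate columns, `Series.str.extract` with capture groups is one option. The sketch below uses made-up placeholder addresses and a grouped version of the email pattern; both are assumptions for illustration.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series({"dave": "dave@example.com",
                    "rob": "rob@example.org",
                    "wes": np.nan})

# Literal substring test; na=False reports missing entries as False
has_org = emails.str.contains("example.org", regex=False, na=False)

# Three capture groups: user name, domain name, domain suffix
grouped = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
parts = emails.str.extract(grouped, flags=re.IGNORECASE)

print(has_org.tolist())
print(parts)
```

Each capture group becomes one column of the result, and the row for the missing entry contains only missing values.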

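# Summary
Efficient data preparation leaves more time for the analysis itself, so the techniques in this chapter, as groundwork for data analysis, are well worth the time invested in learning them.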