<div align="center"><h1>Chapter 4&nbsp;&nbsp;Data Cleaning and Preprocessing</h1></div>

# I. Notes

- Description: source code for this chapter.
- Author: 方伟 (FangWei)
- Development environment: Windows DEV Channel, Build 22533.1001, 64-bit
- Python version: 3.10.1, 64-bit

# II. Code

## 4.1 Comparative Visualization

In [1]:
import pandas as pd
data = pd.read_csv('data.csv')

### 4.1.1 Viewing the Data

In [2]:
print(data.head(3))

   user_id   age level     sex  orders  values         recent_date
0      662  24.0  High    Male     197  172146  2016/7/23 12:24:28
1      833  17.0  High    Male     227  198124  2016/7/23 12:24:28
2     2289  30.0  High  Female     302  190385  2016/7/23 12:24:28


### 4.1.2 Checking Data Types

In [3]:
print(data.dtypes)

user_id          int64
age            float64
level           object
sex             object
orders           int64
values           int64
recent_date     object
dtype: object


### 4.1.3 Analyzing the Distribution

In [4]:
print(data.describe(include='all').round(2))

        user_id      age level   sex   orders      values          recent_date
count   2849.00  2848.00  2845  2848  2849.00     2849.00                 2849
unique      NaN      NaN     4     2      NaN         NaN                  286
top         NaN      NaN  High  Male      NaN         NaN  2018/10/12 12:24:28
freq        NaN      NaN  2443  1864      NaN         NaN                   19
mean    5067.12    38.78   NaN   NaN   491.38   192289.28                  NaN
std     2884.01    25.47   NaN   NaN   483.49   106252.24                  NaN
min        6.00    17.00   NaN   NaN    25.00    19395.00                  NaN
25%     2566.00    28.00   NaN   NaN   207.00   120068.00                  NaN
50%     5166.00    37.00   NaN   NaN   328.00   180052.00                  NaN
75%     7496.00    47.00   NaN   NaN   587.00   239439.00                  NaN
max     9999.00  1200.00   NaN   NaN  4286.00  1184622.00                  NaN


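## 4.2 Handling Missing Values

### 4.2.1 Viewing Records with Missing Values

In [5]:
na_records = data.isnull().any(axis=1) # per row: does the row contain any NA?
print(na_records.sum()) # total number of records containing NA
print(na_records[na_records]) # row indices of the records containing NA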

6
13      True
16      True
1476    True
1761    True
2140    True
2836    True
dtype: bool


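### 4.2.2 Viewing Columns with Missing Values

In [6]:
na_cols = data.isnull().any(axis=0) # per column: does the column contain any NA?
print(na_cols.sum()) # total number of columns containing NA
print(na_cols[na_cols]) # names of the columns containing NA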

3
age      True
level    True
sex      True
dtype: bool


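### 4.2.3 Handling NA Values

1. Filling NA values

In [7]:
# Filling categorical fields
data[['level','sex']] = data[['level','sex']].fillna('others') # fill both columns with 'others'

In [8]:
# Filling numeric fields. The lines below are alternative strategies; because they
# run in sequence, only the first one finds any NA left to fill.
data['age'] = data['age'].fillna(0) # fill with 0
data['age'] = data['age'].fillna(data['age'].mean()) # fill with the mean
data['age'] = data['age'].ffill() # fill with the previous value
data['age'] = data['age'].bfill() # fill with the next value
data['age'] = data['age'].interpolate(method='linear') # fill by interpolation; other methods can be specified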

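> Note: records with a missing date should usually be dropped, because a date cannot be filled in "accurately". Boolean fields should be handled like categorical fields: convert them to a categorical type first, then fill with "others".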
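
The numeric fill strategies in cell In [8] give different results on the same gaps, which is easier to see side by side. The following is a minimal sketch on a made-up Series (the values are illustrative and do not come from `data.csv`); it also shows one possible way to treat a boolean field as categorical before filling, as the note above suggests.

```python
import numpy as np
import pandas as pd

# Toy data (not from data.csv): one single gap and one gap two values wide
ages = pd.Series([20.0, np.nan, 40.0, np.nan, np.nan, 70.0])

filled = pd.DataFrame({
    'zero': ages.fillna(0),                       # constant fill
    'mean': ages.fillna(ages.mean()),             # mean of the observed values
    'ffill': ages.ffill(),                        # carry the previous value forward
    'bfill': ages.bfill(),                        # carry the next value backward
    'linear': ages.interpolate(method='linear'),  # straight line between neighbours
})
print(filled)

# Boolean field with a gap: map to string categories first, then fill
flags = pd.Series([True, None, False], dtype=object)
flags_filled = flags.map({True: 'True', False: 'False'}).fillna('others')
print(flags_filled.tolist())
```

Each strategy is applied to the original `ages` here, so every column shows what that strategy alone would do.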

2. Dropping NA values

In [9]:
data_dropna = data.dropna() # drop every row that contains NA

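## 4.3 Handling Outliers

### 4.3.1 Identifying Outliers

1. Judging and selecting by rule of thumb

In [10]:
data_sets = data[(data['age']>0)&(data['age']<=100)] # keep shoppers whose age is above 0 and at most 100

2. Judging and handling by mean and standard deviation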

In [11]:
import numpy as np

In [12]:
def process_outlier(sub_data,each_col):
    # Work on a float copy: the caller's frame stays untouched and the mean can be stored as-is
    sub_data = sub_data.astype({each_col: 'float64'})
    _mean = sub_data[each_col].mean()
    _std = sub_data[each_col].std()
    scope_min,scope_max = _mean-2*_std,_mean+2*_std # normal range: mean ± 2 standard deviations
    is_outlier = (sub_data[each_col] <scope_min) | (sub_data[each_col] >scope_max)
    sub_data.loc[is_outlier, each_col] = _mean # replace outliers with the mean
    print(is_outlier.sum()) # number of outliers replaced
    return sub_data

In [13]:
data['orders'] = process_outlier(data[['orders']], 'orders')

135


3. Judging outliers by quantiles

In [14]:
def process_outlier(sub_data,each_col):
    # Work on a float copy: the caller's frame stays untouched and the mean can be stored as-is
    sub_data = sub_data.astype({each_col: 'float64'})
    per_25 = sub_data[each_col].quantile(0.25)
    per_75 = sub_data[each_col].quantile(0.75)
    spacing = per_75 -per_25 # interquartile range (IQR)
    scope_min,scope_max = per_25-1.5 * spacing,per_75+1.5 * spacing
    is_outlier = (sub_data[each_col] <scope_min) | (sub_data[each_col] >scope_max)
    sub_data.loc[is_outlier, each_col] = sub_data[each_col].mean() # replace outliers with the column mean
    print(is_outlier.sum()) # number of outliers replaced
    return sub_data

In [15]:
data['values'] = process_outlier(data[['values']], 'values')

92


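> Note: to delete outliers instead of replacing them, flag the outliers in every column first, then combine the flags and delete the union of flagged rows in a single step. Do not delete them column by column.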
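
The flag-then-delete procedure in the note can be sketched as follows. This is an illustrative example on a made-up frame, not the chapter's data; the helper name `iqr_outlier_mask` is hypothetical, and the 1.5 × IQR rule is borrowed from the quantile-based cell above. Flagging every column against the original data matters because deleting rows for one column would change the quantiles seen by the next.

```python
import pandas as pd

# Toy frame (not data.csv); each column has one obvious outlier, in different rows
df = pd.DataFrame({
    'orders': [10, 12, 11, 13, 500, 12, 11, 10],
    'values': [100, 110, 9000, 105, 98, 102, 101, 99],
})

def iqr_outlier_mask(col):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for one column."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Flag every column against the ORIGINAL data, then combine the flags
masks = df.apply(iqr_outlier_mask)  # one boolean column per input column
drop_rows = masks.any(axis=1)       # union: outlier in at least one column
cleaned = df[~drop_rows]            # delete all flagged rows in one step
print(drop_rows.sum(), cleaned.shape)
```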

## 4.4 Handling Duplicates

### 4.4.1 Identifying Duplicates

In [16]:
print(data[data.duplicated()])

      user_id   age level   sex  orders    values         recent_date
2847     6249  17.0  High  Male   308.0  132755.0  2019/4/18 12:24:28


### 4.4.2 Removing Duplicates

In [17]:
data_dropduplicates = data.drop_duplicates()

## 4.5 Data Sampling

### 4.5.1 Random Sampling

In [18]:
# Sample a fixed number of rows
data_sample1 = data.sample(n=1000)

In [19]:
# Sample a fixed fraction of rows
data_sample2 = data.sample(frac=0.8)

### 4.5.2 Stratified Sampling

In [20]:
def sub_sample(data,group_name):
    return data[data['level']==group_name].sample(frac=0.8)

In [21]:
names = data['level'].unique()
all_samples = [sub_sample(data,group_name) for group_name in names]
samples_pd = pd.concat(all_samples,axis=0)

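> Note: fill NA values before sampling. A row whose `level` is NaN matches no group name, so it would be silently left out of every stratum.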

In [22]:
print(data.groupby(['level'],as_index=False)['user_id'].count().T)

            0    1       2      3       4
level    High  Low  Normal  Other  others
user_id  2443  284      87     31       4


In [23]:
print(samples_pd.groupby(['level'],as_index=False)['user_id'].count().T)

            0    1       2      3       4
level    High  Low  Normal  Other  others
user_id  1954  227      70     25       3
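
The per-level sampling above can also be written with `DataFrameGroupBy.sample`, available in pandas 1.1 and later. The sketch below uses a made-up frame rather than `data.csv`, and `random_state=0` is an arbitrary choice added only to make the draw repeatable.

```python
import pandas as pd

# Toy frame (not data.csv) with the same kind of 'level' column
df = pd.DataFrame({
    'user_id': range(100),
    'level': ['High'] * 70 + ['Low'] * 20 + ['Normal'] * 10,
})

# Draw 80% of the rows from every level
stratified = df.groupby('level').sample(frac=0.8, random_state=0)
print(stratified['level'].value_counts().sort_index())
```

By default `groupby` also skips rows whose key is NaN, so the advice to fill NA values first applies to this form as well.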


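## 4.6 Data Format and Value Conversion

### 4.6.1 Converting Strings to Dates

In [24]:
print(data['recent_date'].dtype)
data['recent_date'] = pd.to_datetime(data['recent_date'], format='%Y/%m/%d %H:%M:%S') # vectorized parsing; pd.datetime is not available in current pandas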
print(data['recent_date'].dtype)

object
datetime64[ns]


### 4.6.2 Extracting the Date and Time

In [25]:
single_dt = data['recent_date'].iloc[0]
print(single_dt)

2016-07-23 12:24:28


### 4.6.3 Extracting Time Elements

In [26]:
print(single_dt.date())
print(single_dt.time())

2016-07-23
12:24:28


In [27]:
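# Elements to extract
dt_elements =\
['day', # day of the month
 'dayofweek', # day of the week (Monday=0)
 'daysinmonth', # number of days in the month
 'is_leap_year',# whether the year is a leap year
 'is_month_end', # whether it is the last day of the month
 'is_month_start',# whether it is the first day of the month
 'is_quarter_end',# whether it is the last day of the quarter
 'is_quarter_start',# whether it is the first day of the quarter
 'is_year_end',# whether it is the last day of the year
 'is_year_start',# whether it is the first day of the year
 'month',# month
 'quarter', # quarter
 'week', # week number
 'weekday', # day of the week
 'weekofyear', # week of the year
 'year', # year
 'hour', # hour
 'minute', # minute
 'second' # second
]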

In [28]:
for i in dt_elements:
    value = getattr(single_dt, i) # look the element up by name instead of using eval
    print(i, value() if callable(value) else value) # weekday is a method; the rest are attributes

day 23
dayofweek 5
daysinmonth 31
is_leap_year True
is_month_end False
is_month_start False
is_quarter_end False
is_quarter_start False
is_year_end False
is_year_start False
month 7
quarter 3
week 29
weekday 5
weekofyear 29
year 2016
hour 12
minute 24
second 28
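
The cells above extract elements from a single timestamp. For a whole column, the same elements are exposed through the `.dt` accessor. The following sketch assumes a small made-up Series in the same text format as `recent_date`; the ISO week number is taken from `dt.isocalendar()`, which requires pandas 1.1 or later.

```python
import pandas as pd

# Two sample timestamps in the same text format as recent_date
dates = pd.to_datetime(
    pd.Series(['2016/7/23 12:24:28', '2018/10/12 12:24:28']),
    format='%Y/%m/%d %H:%M:%S',
)

parts = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'dayofweek': dates.dt.dayofweek,      # Monday=0 ... Sunday=6
    'week': dates.dt.isocalendar().week,  # ISO week number
    'is_month_end': dates.dt.is_month_end,
})
print(parts)
```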


## 4.7 Standardization and Normalization

In [29]:
from sklearn.preprocessing import StandardScaler,MinMaxScaler

### 4.7.1 Z-Score Standardization

In [30]:
ss_model = StandardScaler()
data['age']=ss_model.fit_transform(data[['age']])
print(data[['age']].describe().T.round(2))

      count  mean  std   min   25%   50%   75%    max
age  2849.0  -0.0  1.0 -1.52 -0.42 -0.07  0.32  45.58


### 4.7.2 Min-Max Normalization

In [31]:
mm_model = MinMaxScaler((0,1))
data['orders']=mm_model.fit_transform(data[['orders']])
print(data[['orders']].describe().T.round(2))

         count  mean   std  min   25%   50%   75%  max
orders  2849.0  0.27  0.21  0.0  0.13  0.21  0.34  1.0
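
To make the two transforms concrete, the sketch below checks them against their textbook formulas on a made-up column (not `data.csv`): z-score is `(x - mean) / std` and min-max is `(x - min) / (max - min)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column (not data.csv); scikit-learn expects a 2-D array
x = np.array([[10.0], [20.0], [30.0], [40.0], [100.0]])

z_sklearn = StandardScaler().fit_transform(x)
z_manual = (x - x.mean()) / x.std()  # np.std defaults to the population std (ddof=0)

mm_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)
mm_manual = (x - x.min()) / (x.max() - x.min())

print(np.allclose(z_sklearn, z_manual), np.allclose(mm_sklearn, mm_manual))
```

`StandardScaler` divides by the population standard deviation, while `describe()` reports the sample standard deviation (ddof=1), so the `std` shown by `describe()` after standardization is close to 1 rather than exactly 1.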


## 4.8 Discretization and Binarization

### 4.8.1 Discretization with Custom Intervals

In [32]:
bins = [0, 2000, 100000, 200000, 500000,1000000,100000000]
data['values_cut'] = pd.cut(data['values'],bins)
print(data['values_cut'].head(3))

0    (100000, 200000]
1    (100000, 200000]
2    (100000, 200000]
Name: values_cut, dtype: category
Categories (6, interval[int64]): [(0, 2000] < (2000, 100000] < (100000, 200000] < (200000, 500000] < (500000, 1000000] < (1000000, 100000000]]


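### 4.8.2 Quantile-Based Discretization

In [33]:
# Note: with an integer as bins, pd.cut builds 4 equal-width bins; pd.qcut is the quantile-based variant
data['values_cut2'] = pd.cut(data['values'],4,labels=['bad', 'medium', 'good', 'awesome'])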
print(data['values_cut2'].head(3))

0    medium
1    medium
2    medium
Name: values_cut2, dtype: category
Categories (4, object): [bad < medium < good < awesome]
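
`pd.cut` with an integer splits the value range into equal-width bins, whereas `pd.qcut` splits at quantiles so that each bin holds roughly the same number of rows. The difference shows up on skewed data. The sketch below uses a made-up Series (not `data.csv`) with the same four labels.

```python
import pandas as pd

# Skewed toy data (not data.csv): most values are small, one is very large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])
labels = ['bad', 'medium', 'good', 'awesome']

equal_width = pd.cut(values, 4, labels=labels)   # 4 bins of equal width over [min, max]
equal_count = pd.qcut(values, 4, labels=labels)  # 4 bins split at the 25%/50%/75% quantiles

print(equal_width.value_counts().sort_index().tolist())
print(equal_count.value_counts().sort_index().tolist())
```

With equal-width bins the single large value pushes almost every row into the first bin; with quantile bins the rows are spread evenly.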


### 4.8.3 Binarization by a Specified Threshold

In [34]:
from sklearn.preprocessing import Binarizer
bin_model = Binarizer(threshold=data['age'].mean())
data['age_bin'] = bin_model.fit_transform(data[['age']])
print(data['age_bin'].head(4))

0    0.0
1    0.0
2    0.0
3    1.0
Name: age_bin, dtype: float64


## 4.9 Handling Categorical Features

### 4.9.1 Converting Categories to Integer Indices

In [35]:
from sklearn.preprocessing import LabelEncoder
model_le = LabelEncoder()
data['level']=model_le.fit_transform(data['level'])
print(data['level'].head(3))

0    0
1    0
2    0
Name: level, dtype: int32


### 4.9.2 One-Hot Encoding

In [36]:
object_data = data[['sex']]
convert_data=pd.get_dummies(object_data, dtype=int) # dtype=int keeps 0/1 columns; newer pandas returns booleans by default
print(convert_data.head(3))

   sex_Female  sex_Male  sex_others
0           0         1           0
1           0         1           0
2           1         0           0


## 4.10 Feature Selection

In [37]:
data_merge = pd.concat((data[['age','orders']],convert_data,data[['age_bin']]),axis=1)

In [38]:
data_merge.head(3)

        age    orders  sex_Female  sex_Male  sex_others  age_bin
0 -0.579592  0.120364           0         1           0      0.0
1 -0.854377  0.141358           0         1           0      0.0
2 -0.344063  0.193842           1         0           0      0.0


In [39]:
from sklearn.feature_selection import VarianceThreshold
model_vart = VarianceThreshold(threshold=0.1)
feature = model_vart.fit_transform(data_merge)
print(np.round(model_vart.variances_,2))
print(feature.shape)

[1.   0.04 0.23 0.23 0.   0.25]
(2849, 4)
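
`fit_transform` returns a bare array, so the names of the surviving columns are lost. In the output above, four of the six variances exceed 0.1, which matches the four remaining columns. One way to recover the names is `get_support()`, sketched here on a made-up frame (not `data_merge`); the column names are illustrative only.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame (not data_merge): 'constant' never varies, 'rare_flag' varies very little
df = pd.DataFrame({
    'scaled': [-1.5, -0.5, 0.0, 0.5, 1.5, 0.0, -1.0, 1.0, 0.5, -0.5],
    'constant': [0] * 10,
    'rare_flag': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    'balanced_flag': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

selector = VarianceThreshold(threshold=0.1)
selector.fit(df)

# get_support() is a boolean mask over the input columns
kept = df.columns[selector.get_support()].tolist()
print(dict(zip(df.columns, selector.variances_.round(2))))
print(kept)
```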


## 4.11 Word Segmentation

In [40]:
import jieba

In [41]:
with open('text.txt',encoding='utf8') as f:
    text_data = f.readlines()
print(text_data[0])

Python作为数据工作领域的关键武器之一，具有开源、多场景应用、快速上手、完善的生态和服务体系等特征，使其在数据分析中的任何场景都能游刃有余；即使是在为数不多的短板上，Python仍然可以基于其“胶水”特征，引入对应的第三方工具/库/程序等来实现全场景、全应用的覆盖。在海量数据背景下，Python对超大数据规模的支持性能、数据分析处理能力和建模的专业程度以及开发便捷性的综合能力要远远高于其他工具。因此，Python几乎是数据分析的不二之选。



In [42]:
def jieba_cut(string):
    return list(jieba.cut(string))  # segment in accurate mode (the jieba default)

In [43]:
cut_words = [jieba_cut(i) for i in text_data]
print(cut_words[0][:5])


['Python', '作为', '数据', '工作', '领域']


## 4.12 Converting Text to Vectors

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer as TV

In [45]:
stop_words = ['，','。','\n', '/','“', '”', '、', '；']

In [46]:
vectorizer = TV(stop_words=stop_words,tokenizer=jieba_cut)
X = vectorizer.fit_transform(text_data)
print(vectorizer.get_feature_names_out()[:10].tolist()) # get_feature_names() was removed in scikit-learn 1.2
print(X.shape)
print(X.toarray()[0])

['python', '一笔带过', '上', '下', '不', '不二', '不仅', '与', '专业', '中']
(3, 152)
[0.19702138 0.         0.12685045 0.08339655 0.         0.08339655
 0.         0.         0.08339655 0.06342522 0.08339655 0.
 0.08339655 0.08339655 0.         0.         0.         0.08339655
 0.04925534 0.08339655 0.         0.08339655 0.         0.08339655
 0.08339655 0.         0.16679309 0.06342522 0.16679309 0.08339655
 0.08339655 0.         0.         0.08339655 0.         0.
 0.         0.         0.         0.         0.         0.08339655
 0.         0.         0.08339655 0.         0.         0.
 0.09851069 0.04925534 0.         0.14776603 0.         0.25018964
 0.08339655 0.         0.08339655 0.         0.06342522 0.
 0.08339655 0.         0.         0.06342522 0.08339655 0.08339655
 0.         0.06342522 0.16679309 0.08339655 0.         0.09851069
 0.         0.06342522 0.08339655 0.08339655 0.08339655 0.
 0.08339655 0.         0.08339655 0.         0.08339655 0.08339655
 0.14776603 0.14776603 0.      