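### Prerequisites

- pandas basics: pd.read_csv
- Regular expression basics

Automated reporting places higher demands on data quality, yet errors and omissions are entirely normal in practice. Rather than only fixing bugs after a problem appears, we should define the strictest reasonable format rules from the start, and report and handle unexpected cases.

## Reading CSV files
CSV files are a common data source, so we use one as the example here.

### First, inspect the data to be read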

In [18]:
import pandas as pd
import numpy as np

In [5]:
# The header turns out to be on line 8 of the file, so set header=7 (0-indexed)
data = pd.read_csv('data.csv', header=7, index_col=0)
data.head()

#,Product Name,Brand,Price,Category,Rank,Sales,Revenue,Reviews,Rating,Seller,LQS,ASIN,Link
1,Mind Reader Adjustable Height Ergonomic Foot R...,Mind Reader,$14.99,Office Products,286,4440,"$66,556",309,4.0,AMZ,N.A.,B07FMGMVT8,https://www.amazon.com/dp/B07FMGMVT8
2,AmazonBasics Foot Rest - Black,AmazonBasics,$13.19,Office Products,539,3115,"$41,087",657,4.0,N.A.,5,B01DN8TG46,https://www.amazon.com/dp/B01DN8TG46
3,Sleepy Ride - Airplane Footrest Made with Prem...,Sleepy Ride,$19.97,Office Products,1067,2075,"$41,438",386,4.5,FBA,5,B01M35M87O,https://www.amazon.com/dp/B01M35M87O
4,Rest My Sole - Foot Rest Cushion for Under Des...,Well Desk,$26.95,Office Products,1159,1661,"$44,764",188,4.5,FBA,8,B075RYDWZH,https://www.amazon.com/dp/B075RYDWZH
5,"Andyer Andyer Foot Rest, Portable Travel Footr...",Andyer,$10.99,Home & Kitchen,6169,1384,"$15,210",215,4.0,FBA,6,B072VJ9BKX,https://www.amazon.com/dp/B072VJ9BKX
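The header offset used above comes from looking at the raw file. A minimal sketch of that inspection step follows; the sample text is a hypothetical stand-in for data.csv, and `peek_lines` is an illustrative helper, not part of pandas. Locating the header by its leading `#,` is an assumption based on the index column being named `#`.

```python
import io
from itertools import islice

def peek_lines(source, n=10):
    """Return the first n lines of a text file-like object, without newlines."""
    return [line.rstrip("\n") for line in islice(source, n)]

# Hypothetical stand-in for data.csv: metadata lines precede the real header.
sample = io.StringIO(
    "Export report\n"
    "Generated by a hypothetical tool\n"
    "Marketplace: example\n"
    "#,Product Name,Price\n"
    "1,Foot Rest,$14.99\n"
)

lines = peek_lines(sample, n=5)
for number, line in enumerate(lines):
    print(number, repr(line))

# Assumption: the header is the first line that starts with "#,".
header_index = next(i for i, line in enumerate(lines) if line.startswith("#,"))
print("header =", header_index)
```

With a real file, `open('data.csv', encoding='utf-8')` can be passed in place of the `StringIO` object. Note that `pd.read_csv` skips blank lines by default when counting the `header` row, so the index found this way may need adjusting if the preamble contains empty lines.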


### Specify the format of the target columns

In [7]:
data.dtypes

Product Name    object
Brand           object
Price           object
Category        object
Rank            object
Sales           object
Revenue         object
Reviews          int64
Rating          object
Seller          object
LQS             object
ASIN            object
Link            object
dtype: object

As we can see, the columns Price, Rank, Sales, Revenue, Reviews, Rating, and LQS should all be numeric, but only the Reviews column is read as numeric by default.
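To see why a column falls back to `object`, it can help to list the values that block numeric parsing. The following is a rough sketch using `pd.to_numeric` with `errors="coerce"`; the small frame is a hypothetical miniature of the table above, and `non_numeric_values` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical miniature of the table above, with every column read as text.
frame = pd.DataFrame({
    "Price": ["$14.99", "$13.19"],
    "Rank": ["286", "1,067"],
    "Reviews": ["309", "657"],
    "LQS": ["N.A.", "5"],
})

def non_numeric_values(series):
    """Return the distinct non-missing values that pd.to_numeric cannot parse."""
    parsed = pd.to_numeric(series, errors="coerce")
    return sorted(series[parsed.isna() & series.notna()].unique())

report = {column: non_numeric_values(frame[column]) for column in frame.columns}
for column, offenders in report.items():
    print(column, offenders)
```

A report like this shows which symbols (currency signs, thousands separators, placeholders such as N.A.) the format rules need to cover.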

#### Specifying formats with dtype

In [19]:
dtype = {'#':int,
         'Product Name':str,
         'Brand':str,
         'Price':float,
         'Category':str,
         'Rank':int,
         'Sales':int,
         'Revenue':int,
         'Reviews':int,
         'Rating':float,
         'Seller':str,
         'LQS':int,
         'ASIN':str,
         'Link':str
        }
try:
    data = pd.read_csv('data.csv', dtype=dtype, header=7, index_col=0)
except Exception as e:  # avoid BaseException, which would also swallow KeyboardInterrupt
    print(e)

invalid literal for int() with base 10: '1,067'


As we can see, dtype cannot simply ignore non-numeric characters during conversion, so we need a more powerful way to specify formats.
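Before writing a custom converter, it is worth knowing how far the built-in `read_csv` options go. The sketch below uses the `thousands` and `na_values` parameters on a hypothetical in-memory sample. They cover the thousands separator and the N.A. placeholder, but the Price column keeps its `$` and stays text, which is why a stronger rule is still needed.

```python
import io
import pandas as pd

# Hypothetical three-column sample mimicking the problem values in data.csv.
sample = io.StringIO(
    'Price,Rank,LQS\n'
    '$14.99,286,N.A.\n'
    '$19.97,"1,067",5\n'
)

# thousands="," strips the separator in numeric fields;
# na_values adds "N.A." to the strings treated as missing.
frame = pd.read_csv(sample, thousands=",", na_values=["N.A."])
print(frame.dtypes)
print(frame)
```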

#### Converting formats with converters

In [50]:
import re

# Compile once: an unsigned integer or decimal number, e.g. 286 or 14.99
NUMBER_PATTERN = re.compile(r'\d+\.?\d*')

# Extract a number with a regular expression
def str2num(string):
    if not isinstance(string, str):
        string = str(string)
    # Drop thousands separators so '66,556' becomes '66556'
    string = string.replace(',', '')
    match = NUMBER_PATTERN.search(string)
    if match:
        return float(match.group())
    # No digits found (e.g. 'N.A.'): return NaN
    return float('nan')

converters = {'Price': str2num,
              'Rank': str2num,
              'Rating': str2num,
              'Sales': str2num,
              'Revenue': str2num,
              'Reviews': str2num
             }
try:
    data = pd.read_csv('data.csv', converters=converters, header=7, index_col=0)
except Exception as e:
    print(e)
data.head()

#,Product Name,Brand,Price,Category,Rank,Sales,Revenue,Reviews,Rating,Seller,LQS,ASIN,Link
1,Mind Reader Adjustable Height Ergonomic Foot R...,Mind Reader,14.99,Office Products,286.0,4440.0,66556.0,309.0,4.0,AMZ,N.A.,B07FMGMVT8,https://www.amazon.com/dp/B07FMGMVT8
2,AmazonBasics Foot Rest - Black,AmazonBasics,13.19,Office Products,539.0,3115.0,41087.0,657.0,4.0,N.A.,5,B01DN8TG46,https://www.amazon.com/dp/B01DN8TG46
3,Sleepy Ride - Airplane Footrest Made with Prem...,Sleepy Ride,19.97,Office Products,1067.0,2075.0,41438.0,386.0,4.5,FBA,5,B01M35M87O,https://www.amazon.com/dp/B01M35M87O
4,Rest My Sole - Foot Rest Cushion for Under Des...,Well Desk,26.95,Office Products,1159.0,1661.0,44764.0,188.0,4.5,FBA,8,B075RYDWZH,https://www.amazon.com/dp/B075RYDWZH
5,"Andyer Andyer Foot Rest, Portable Travel Footr...",Andyer,10.99,Home & Kitchen,6169.0,1384.0,15210.0,215.0,4.0,FBA,6,B072VJ9BKX,https://www.amazon.com/dp/B072VJ9BKX
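str2num quietly turns unparseable values into NaN. Since the goal is also to report unexpected cases, one possible extension is a converter that records what it could not parse. This is only a sketch: `make_converter` and the `failures` list are hypothetical names, and the LQS values are taken from the sample rows above.

```python
import math
import re

_NUMBER = re.compile(r"\d+\.?\d*")

def make_converter(column, failures):
    """Build a converter that records values it cannot parse.

    `failures` is a plain list that collects (column, raw_value) pairs.
    """
    def convert(value):
        match = _NUMBER.search(str(value).replace(",", ""))
        if match:
            return float(match.group())
        failures.append((column, value))
        return float("nan")
    return convert

failures = []
to_lqs = make_converter("LQS", failures)
values = [to_lqs(raw) for raw in ["5", "N.A.", "8"]]
print(values)
print(failures)
```

Because each entry in a `converters` dict is just a one-argument function, a converter built this way could be passed to `pd.read_csv` in the same manner as str2num, and the `failures` list inspected or logged after the read.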


Decouple the data-processing modules: put str2num in a tools package and the data-reading code in a datapipeline package.
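One way this split might look is sketched below in a single file, with comments marking where each part would live. The module file names, the `load_products` function, and the `NUMERIC_COLUMNS` constant are assumptions made for illustration; only the package names tools and datapipeline come from the text.

```python
import io
import re
import pandas as pd

# --- would live in the tools package (hypothetical file: tools/parsing.py) ---
_NUMBER = re.compile(r"\d+\.?\d*")

def str2num(value):
    """Extract the first number from a value; NaN when none is found."""
    match = _NUMBER.search(str(value).replace(",", ""))
    return float(match.group()) if match else float("nan")

# --- would live in the datapipeline package (hypothetical file: datapipeline/reader.py) ---
NUMERIC_COLUMNS = ("Price", "Rank", "Sales", "Revenue", "Reviews", "Rating")

def load_products(source, header=7):
    """Read the product CSV, applying str2num to the numeric columns."""
    converters = {column: str2num for column in NUMERIC_COLUMNS}
    return pd.read_csv(source, converters=converters, header=header, index_col=0)

# Tiny in-memory stand-in for data.csv (header on the first line, so header=0).
sample = io.StringIO(
    '#,Product Name,Price,Rank,Sales,Revenue,Reviews,Rating\n'
    '1,Foot Rest,$14.99,286,"4,440","$66,556",309,4.0\n'
)
products = load_products(sample, header=0)
print(products.dtypes)
```

Keeping the parsing rule separate from the reader means str2num can be unit-tested on plain strings, while the reader only decides which columns the rule applies to.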