<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Introduction to Data Science - The Python Way </h1>

<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Lesson 4: Data Cleaning </h1>


<hr>

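<h1 align="center">CO2 Emissions by Country and Year</h1>

Carbon dioxide emissions stem from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during the consumption of solid, liquid, and gas fuels and during gas flaring.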

http://data.worldbank.org/indicator/EN.ATM.CO2E.PC/

<h2 align=center>Getting the Data</h2>

The data can be downloaded from the World Bank [link](http://data.worldbank.org/indicator/EN.ATM.CO2E.PC/) or from Box [link](https://ibm.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv).

#### On Linux, we can use the `wget` command to fetch the csv file from the link

In [None]:
!wget --output-document resources/data/co2emissions.csv https://ibm.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv

<hr>

<h2 align=center>Importing the Data with Pandas</h2>

#### Import the required `pandas` library

In [1]:
import pandas as pd

#### Use `pd.read_csv` to import the data

In [2]:
data = pd.read_csv("resources/data/co2emissions.csv", skiprows = 4)
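
The `skiprows = 4` argument suggests that the file begins with four preamble lines before the real header row. Here is a minimal sketch of that behavior on a made-up in-memory file; the preamble text and the values are assumptions, not the actual World Bank content.

```python
import io
import pandas as pd

# Hypothetical miniature of the download: four preamble lines,
# then the header row, then the data. Every line ends with a comma.
raw = (
    "Data Source,World Development Indicators,\n"
    "Last Updated Date,2016-01-01,\n"
    "Note,hypothetical preamble line,\n"
    "Note,another hypothetical preamble line,\n"
    "Country Name,Country Code,1960,1961,\n"
    "Aruba,ABW,,,\n"
    "Angola,AGO,0.104357,0.084718,\n"
)

# Skip the first four lines so that line five becomes the header.
toy = pd.read_csv(io.StringIO(raw), skiprows=4)

print(toy.columns.tolist())
# → ['Country Name', 'Country Code', '1960', '1961', 'Unnamed: 4']
```

The trailing comma on each line creates an extra, empty column that pandas names `Unnamed: N`. That is a likely origin of the `Unnamed: 60` column in `data`.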

#### Display `data`; pandas abbreviates the output to the first and last rows

In [3]:
data

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,25.613715,24.750133,24.876706,24.182702,23.922412,,,,,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,6.350868,6.296125,6.049173,6.124770,5.968685,,,,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,...,0.088141,0.158962,0.249074,0.302936,0.425262,,,,,
3,Angola,AGO,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.104357,0.084718,0.216025,0.206877,0.216174,0.206089,...,1.311096,1.369425,1.430873,1.401654,1.354008,,,,,
4,Albania,ALB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.258195,1.374186,1.439956,1.181681,1.111742,1.166099,...,1.507536,1.580113,1.533178,1.515632,1.607038,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
244,"Yemen, Rep.",YEM,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.703403,0.507631,0.728004,0.537606,0.658012,0.699574,...,0.981422,0.989905,1.026251,1.090060,0.919968,,,,,
245,South Africa,ZAF,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,5.629718,5.694383,5.729712,5.799844,6.170936,6.467359,...,9.063326,9.506481,9.545495,8.957154,9.257216,,,,,
246,"Congo, Dem. Rep.",COD,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.152228,0.150782,0.135559,0.139446,0.116926,0.142290,...,0.047312,0.048411,0.045604,0.049328,0.050303,,,,,
247,Zambia,ZMB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,0.950434,1.100197,...,0.150265,0.164692,0.184058,0.192079,0.212450,,,,,


<h1 align=center>Data Cleaning</h1>

#### Look at the data: what quality problems does it have, and how should we fix them?

For example, what is wrong with the rows below?

In [4]:
data.loc[[9,182,220,242]]

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,Unnamed: 60
9,American Samoa,ASM,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,,,,,
182,Puerto Rico,PRI,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,,,,,
220,Tajikistan,TJK,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,0.463345,0.409462,0.382775,0.373873,0.358948,,,,,
242,World,WLD,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,3.096144,3.067062,3.13802,3.242364,3.358696,3.437595,...,4.686122,4.741826,4.66312,4.840436,4.944676,,,,,
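
`data.loc[[...]]` selects rows by index label, not by position. The difference matters later in this lesson, because filtering rows keeps the original labels. A small sketch on a made-up frame (the four codes are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"Country Code": ["ABW", "ARB", "AFG", "WLD"]})

# Removing a row keeps the remaining labels: the index is now 0, 2, 3.
filtered = toy[toy["Country Code"] != "ARB"]

# .loc looks rows up by label, .iloc by position.
by_label = filtered.loc[[2, 3], "Country Code"].tolist()
by_position = filtered.iloc[[1, 2]]["Country Code"].tolist()

print(filtered.index.tolist())  # → [0, 2, 3]
print(by_label)                 # → ['AFG', 'WLD']
print(by_position)              # → ['AFG', 'WLD']
```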


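## Data quality problems:
1. Some rows are aggregates of several countries rather than actual countries (for example, "World").
2. Some columns are irrelevant and can be removed (for example, "Indicator Name").
3. Some years have no data for any country (for example, 2012 through 2015).
4. Some countries have no data for any year (for example, "American Samoa").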

<br>

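<h2> 1. Some rows are aggregates of several countries rather than actual countries (for example, "World"). </h2>

**Goal:**  
Remove the rows that are not actual countries. Fortunately, the World Bank provides metadata that indicates which rows are countries and which are aggregates of several countries.
- Import countries_metadata.csv
- Merge the metadata with `data` on `Country Code`

#### Get `countries_metadata.csv`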

In [None]:
!wget --output-document resources/data/countries_metadata.csv https://ibm.box.com/shared/static/qh3o86mpij17ot7anydcmbwt41lwxvln.csv

#### Import `countries_metadata.csv`

In [5]:
metadata = pd.read_csv("resources/data/countries_metadata.csv", encoding = "utf-8")

In [6]:
metadata.head(10)

Unnamed: 0,Country Name,Country Code,Region,IncomeGroup,SpecialNotes,Unnamed: 5
0,Aruba,ABW,Latin America & Caribbean,High income: nonOECD,SNA data for 2000-2011 are updated from offici...,
1,Afghanistan,AFG,South Asia,Low income,Fiscal year end: March 20; reporting period fo...,
2,Angola,AGO,Sub-Saharan Africa,Upper middle income,"April 2013 database update: Based on IMF data,...",
3,Albania,ALB,Europe & Central Asia,Upper middle income,,
4,Andorra,AND,Europe & Central Asia,High income: nonOECD,,
5,Arab World,ARB,,,Arab World aggregate. Arab World is composed o...,
6,United Arab Emirates,ARE,Middle East & North Africa,High income: nonOECD,April 2013 database update: Based on data from...,
7,Argentina,ARG,Latin America & Caribbean,High income: nonOECD,The base year has changed to 2004.,
8,Armenia,ARM,Europe & Central Asia,Lower middle income,,
9,American Samoa,ASM,East Asia & Pacific,Upper middle income,,


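#### How can we tell whether a listed "Country Name" is a country or a region made up of several countries?

Note that when a row is an aggregate region such as "Arab World", `Region` and `IncomeGroup` are always NaN (Not a Number). We can use this rule to remove every row that is not a country.

#### Merge `data` and `metadata` on the key `Country Code`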

In [7]:
merge = pd.merge(data, metadata, on = "Country Code")
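
`pd.merge` performs an inner join by default, so a code that appears in only one table is dropped without warning. Columns that share a name in both tables but are not the join key receive the suffixes `_x` and `_y`. The sketch below shows both effects on made-up frames; the names `toy_data` and `toy_meta` and the code `XXX` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `data` and `metadata`; the values are illustrative only.
toy_data = pd.DataFrame({
    "Country Name": ["Aruba", "World", "Nowhere"],
    "Country Code": ["ABW", "WLD", "XXX"],
    "2011": [23.9, 4.9, 1.0],
})
toy_meta = pd.DataFrame({
    "Country Name": ["Aruba", "World"],
    "Country Code": ["ABW", "WLD"],
    "Region": ["Latin America & Caribbean", None],
})

# Default inner join: the unmatched code "XXX" disappears.
inner = pd.merge(toy_data, toy_meta, on="Country Code")

# how="left" keeps every row of toy_data; indicator=True adds a `_merge`
# column that records whether a match was found in toy_meta.
left = pd.merge(toy_data, toy_meta, on="Country Code", how="left", indicator=True)
unmatched = left.loc[left["_merge"] == "left_only", "Country Code"].tolist()

print(inner.columns.tolist())
# → ['Country Name_x', 'Country Code', '2011', 'Country Name_y', 'Region']
print(len(inner))  # → 2
print(unmatched)   # → ['XXX']
```

Running the same `indicator=True` check on the real tables is one way to see which codes, if any, have no metadata.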

In [8]:
merge

Unnamed: 0,Country Name_x,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2012,2013,2014,2015,Unnamed: 60,Country Name_y,Region,IncomeGroup,SpecialNotes,Unnamed: 5
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,Aruba,Latin America & Caribbean,High income: nonOECD,SNA data for 2000-2011 are updated from offici...,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,Andorra,Europe & Central Asia,High income: nonOECD,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,...,,,,,,Afghanistan,South Asia,Low income,Fiscal year end: March 20; reporting period fo...,
3,Angola,AGO,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.104357,0.084718,0.216025,0.206877,0.216174,0.206089,...,,,,,,Angola,Sub-Saharan Africa,Upper middle income,"April 2013 database update: Based on IMF data,...",
4,Albania,ALB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.258195,1.374186,1.439956,1.181681,1.111742,1.166099,...,,,,,,Albania,Europe & Central Asia,Upper middle income,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,"Yemen, Rep.",YEM,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.703403,0.507631,0.728004,0.537606,0.658012,0.699574,...,,,,,,"Yemen, Rep.",Middle East & North Africa,Lower middle income,Based on official government statistics and In...,
244,South Africa,ZAF,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,5.629718,5.694383,5.729712,5.799844,6.170936,6.467359,...,,,,,,South Africa,Sub-Saharan Africa,Upper middle income,Fiscal year end: March 31; reporting period fo...,
245,"Congo, Dem. Rep.",COD,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.152228,0.150782,0.135559,0.139446,0.116926,0.142290,...,,,,,,"Congo, Dem. Rep.",Sub-Saharan Africa,Low income,Based on official government statistics; the n...,
246,Zambia,ZMB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,0.950434,1.100197,...,,,,,,Zambia,Sub-Saharan Africa,Lower middle income,The new base year is 2010. National accounts d...,


**Note:** When a row is not an actual country, the value of `Region` is NaN.

#### Remove the rows where `Region` is NaN

In [9]:
merge = merge[pd.notnull(merge['Region'])]
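
The boolean mask above can also be written with `DataFrame.dropna` restricted to a single column. This sketch compares the two forms on a made-up frame; it is an alternative spelling, not a change to the method used in this lesson.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Code": ["ABW", "ARB", "AFG"],
    "Region": ["Latin America & Caribbean", None, "South Asia"],
})

# Form used in the lesson: keep rows whose Region is not null.
by_mask = toy[pd.notnull(toy["Region"])]

# Alternative: drop rows with a missing value in the Region column only.
by_dropna = toy.dropna(subset=["Region"])

print(by_mask.equals(by_dropna))           # → True
print(by_dropna["Country Code"].tolist())  # → ['ABW', 'AFG']
```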

In [10]:
merge

Unnamed: 0,Country Name_x,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2012,2013,2014,2015,Unnamed: 60,Country Name_y,Region,IncomeGroup,SpecialNotes,Unnamed: 5
0,Aruba,ABW,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,Aruba,Latin America & Caribbean,High income: nonOECD,SNA data for 2000-2011 are updated from offici...,
1,Andorra,AND,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,,,...,,,,,,Andorra,Europe & Central Asia,High income: nonOECD,,
2,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,...,,,,,,Afghanistan,South Asia,Low income,Fiscal year end: March 20; reporting period fo...,
3,Angola,AGO,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.104357,0.084718,0.216025,0.206877,0.216174,0.206089,...,,,,,,Angola,Sub-Saharan Africa,Upper middle income,"April 2013 database update: Based on IMF data,...",
4,Albania,ALB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,1.258195,1.374186,1.439956,1.181681,1.111742,1.166099,...,,,,,,Albania,Europe & Central Asia,Upper middle income,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,"Yemen, Rep.",YEM,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.703403,0.507631,0.728004,0.537606,0.658012,0.699574,...,,,,,,"Yemen, Rep.",Middle East & North Africa,Lower middle income,Based on official government statistics and In...,
244,South Africa,ZAF,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,5.629718,5.694383,5.729712,5.799844,6.170936,6.467359,...,,,,,,South Africa,Sub-Saharan Africa,Upper middle income,Fiscal year end: March 31; reporting period fo...,
245,"Congo, Dem. Rep.",COD,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,0.152228,0.150782,0.135559,0.139446,0.116926,0.142290,...,,,,,,"Congo, Dem. Rep.",Sub-Saharan Africa,Low income,Based on official government statistics; the n...,
246,Zambia,ZMB,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,,,,0.950434,1.100197,...,,,,,,Zambia,Sub-Saharan Africa,Lower middle income,The new base year is 2010. National accounts d...,


<br>

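<h2>2. Some columns are irrelevant and can be removed.</h2>

**Goal:**  
Remove the following irrelevant columns:
- Column 3: **"Indicator Name"**
- Column 4: **"Indicator Code"**
- The two empty columns **"Unnamed: 60"** and **"Unnamed: 5"**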

In [11]:
merge.columns

Index(['Country Name_x', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', 'Unnamed: 60', 'Country Name_y', 'Region',
       'IncomeGroup', 'SpecialNotes', 'Unnamed: 5'],
      dtype='object')

In [None]:
? merge.drop

In [12]:
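# Drop the two empty columns by zero-indexed position: 60 is 'Unnamed: 60', 65 is 'Unnamed: 5'
merge = merge.drop(merge.columns[[60,65]], axis=1)
merge = merge.drop('Indicator Name', axis=1)
merge = merge.drop('Indicator Code', axis=1) # pandas 2.0+ requires axis to be passed by keyword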
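
Dropping by position breaks silently if the column order ever changes. A more defensive sketch selects the columns by name instead. The frame below is made up, and the `startswith("Unnamed")` rule is an assumption about how the empty columns are named.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Code": ["ABW"],
    "Indicator Name": ["CO2 emissions (metric tons per capita)"],
    "2011": [23.9],
    "Unnamed: 60": [float("nan")],
    "Unnamed: 5": [float("nan")],
})

# Collect the auto-named empty columns by name rather than by position.
unnamed = [c for c in toy.columns if c.startswith("Unnamed")]

# drop(columns=...) removes columns without needing an axis argument.
cleaned = toy.drop(columns=unnamed + ["Indicator Name"])

print(unnamed)                   # → ['Unnamed: 60', 'Unnamed: 5']
print(cleaned.columns.tolist())  # → ['Country Code', '2011']
```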

In [13]:
merge.columns

Index(['Country Name_x', 'Country Code', '1960', '1961', '1962', '1963',
       '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972',
       '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981',
       '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990',
       '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999',
       '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015',
       'Country Name_y', 'Region', 'IncomeGroup', 'SpecialNotes'],
      dtype='object')

<h2>3. Some years have no data for any country.</h2>

**Goal:**  
Count the rows that contain data for each year, excluding NaN values.


In [14]:
merge.count()

Country Name_x    215
Country Code      215
1960              151
1961              152
1962              153
                 ... 
2015                0
Country Name_y    215
Region            215
IncomeGroup       215
SpecialNotes      136
Length: 62, dtype: int64

Looking at 2015, it appears that no row contains data for that year.

In [15]:
merge['2015']

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
243   NaN
244   NaN
245   NaN
246   NaN
247   NaN
Name: 2015, Length: 215, dtype: float64

#### Remove the columns in which no row contains data

In [16]:
merge = merge.drop(['2012','2013','2014','2015'], axis = 1)
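
The year list above is typed by hand after reading the `count()` output. As a sketch of an alternative, the empty columns can be found from the counts themselves, or dropped in one step with `dropna(axis=1, how="all")`. The frame is made up for illustration.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "Country Code": ["ABW", "AFG"],
    "2011": [23.9, 0.43],
    "2012": [nan, nan],
    "2013": [nan, nan],
})

# count() gives the number of non-NaN values per column;
# a count of zero means the column is empty.
counts = toy.count()
empty_cols = counts[counts == 0].index.tolist()

# Drop every column whose values are all NaN.
trimmed = toy.dropna(axis=1, how="all")

print(empty_cols)                # → ['2012', '2013']
print(trimmed.columns.tolist())  # → ['Country Code', '2011']
```

On the real table this would also remove any other all-NaN column, so it is worth printing `empty_cols` before dropping.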

In [17]:
merge.count() #double-check that columns have been removed

Country Name_x    215
Country Code      215
1960              151
1961              152
1962              153
1963              154
1964              159
1965              159
1966              159
1967              159
1968              158
1969              159
1970              161
1971              162
1972              164
1973              164
1974              164
1975              164
1976              164
1977              164
1978              164
1979              164
1980              164
1981              164
1982              164
1983              164
1984              164
1985              164
1986              165
1987              165
1988              165
1989              165
1990              167
1991              168
1992              188
1993              188
1994              189
1995              192
1996              191
1997              193
1998              193
1999              193
2000              194
2001              194
2002              195
2003      

<h2>4. Some countries have no data for any year.</h2>

**Goal:**  
Use the row mean to identify which countries contain no data.

Compute the mean of each row (along axis 1).

In [18]:
merge.mean(axis=1, numeric_only=True) #Mean of the numeric (year) columns per row; numeric_only skips the text columns, which pandas 2.0+ would otherwise reject

0      21.766264
1       6.926680
2       0.139113
3       0.610377
4       1.657433
         ...    
243     0.737396
244     8.230187
245     0.104890
246     0.495172
247     1.228627
Length: 215, dtype: float64

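A row that contains no data in any year has a mean of NaN, although the abbreviated output above may not show such a row.

#### Remove the rows that contain no data for any year

In [19]:
merge = merge[pd.notnull(merge.mean(axis=1, numeric_only=True))]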
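
The row mean works as an "is everything missing?" test because the mean of an all-NaN row is itself NaN. The same filter can be written directly with `dropna(how="all")` over the year columns. The sketch below compares both on a made-up frame; picking year columns with `str.isdigit` is an assumption that suits column names such as `'1960'`.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "Country Code": ["ABW", "ASM", "AFG"],
    "1960": [nan, nan, 0.046068],
    "2011": [23.922412, nan, 0.425262],
})

# Year columns are the ones whose names consist only of digits.
year_cols = [c for c in toy.columns if c.isdigit()]

# Lesson's approach: the mean of an all-NaN row is NaN.
by_mean = toy[pd.notnull(toy[year_cols].mean(axis=1))]

# Alternative: drop rows in which every year column is NaN.
by_dropna = toy.dropna(subset=year_cols, how="all")

print(by_mean.equals(by_dropna))           # → True
print(by_dropna["Country Code"].tolist())  # → ['ABW', 'AFG']
```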

<hr>

<h2 align = "center">Data Cleaning ... Done!</h2>

### Rename for convenience

In [20]:
mydf = merge
mydf

Unnamed: 0,Country Name_x,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2006,2007,2008,2009,2010,2011,Country Name_y,Region,IncomeGroup,SpecialNotes
0,Aruba,ABW,,,,,,,,,...,24.766706,25.613715,24.750133,24.876706,24.182702,23.922412,Aruba,Latin America & Caribbean,High income: nonOECD,SNA data for 2000-2011 are updated from offici...
1,Andorra,AND,,,,,,,,,...,6.553477,6.350868,6.296125,6.049173,6.124770,5.968685,Andorra,Europe & Central Asia,High income: nonOECD,
2,Afghanistan,AFG,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,0.107674,0.123782,...,0.065816,0.088141,0.158962,0.249074,0.302936,0.425262,Afghanistan,South Asia,Low income,Fiscal year end: March 20; reporting period fo...
3,Angola,AGO,0.104357,0.084718,0.216025,0.206877,0.216174,0.206089,0.265164,0.166659,...,1.200877,1.311096,1.369425,1.430873,1.401654,1.354008,Angola,Sub-Saharan Africa,Upper middle income,"April 2013 database update: Based on IMF data,..."
4,Albania,ALB,1.258195,1.374186,1.439956,1.181681,1.111742,1.166099,1.333055,1.363746,...,1.291548,1.507536,1.580113,1.533178,1.515632,1.607038,Albania,Europe & Central Asia,Upper middle income,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,"Yemen, Rep.",YEM,0.703403,0.507631,0.728004,0.537606,0.658012,0.699574,0.605767,0.524197,...,0.985853,0.981422,0.989905,1.026251,1.090060,0.919968,"Yemen, Rep.",Middle East & North Africa,Lower middle income,Based on official government statistics and In...
244,South Africa,ZAF,5.629718,5.694383,5.729712,5.799844,6.170936,6.467359,6.332753,6.465648,...,8.802475,9.063326,9.506481,9.545495,8.957154,9.257216,South Africa,Sub-Saharan Africa,Upper middle income,Fiscal year end: March 31; reporting period fo...
245,"Congo, Dem. Rep.",COD,0.152228,0.150782,0.135559,0.139446,0.116926,0.142290,0.134675,0.124706,...,0.045579,0.047312,0.048411,0.045604,0.049328,0.050303,"Congo, Dem. Rep.",Sub-Saharan Africa,Low income,Based on official government statistics; the n...
246,Zambia,ZMB,,,,,0.950434,1.100197,0.953158,1.263628,...,0.179774,0.150265,0.164692,0.184058,0.192079,0.212450,Zambia,Sub-Saharan Africa,Lower middle income,The new base year is 2010. National accounts d...
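
Note that `mydf = merge` does not copy anything: both names refer to the same DataFrame object. If an independent table is wanted, `.copy()` creates one. A short sketch on a made-up frame:

```python
import pandas as pd

original = pd.DataFrame({"Country Code": ["ABW", "AFG"], "2011": [23.9, 0.43]})

alias = original            # second name for the same object
snapshot = original.copy()  # independent copy of the data

# Modify the frame through its first name.
original.loc[0, "2011"] = 0.0

print(alias is original)               # → True
print(float(alias.loc[0, "2011"]))     # → 0.0
print(float(snapshot.loc[0, "2011"]))  # → 23.9
```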


<hr>

Code summary:

In [22]:
import pandas as pd

#Download data
#!wget --output-document resources/data/co2emissions.csv https://ibm.box.com/shared/static/3yzxbbizo49bkl8cnjw15tymzfwkycj4.csv
#!wget --output-document resources/data/countries_metadata.csv https://ibm.box.com/shared/static/qh3o86mpij17ot7anydcmbwt41lwxvln.csv
    
#Import data
data = pd.read_csv("resources/data/co2emissions.csv", skiprows = 4)
metadata = pd.read_csv("resources/data/countries_metadata.csv", encoding = "utf-8")

#Merge data
merge = pd.merge(data, metadata, on = "Country Code")

#Remove non-country regions
merge = merge[pd.notnull(merge['Region'])]

#Drop empty and irrelevant columns
merge = merge.drop(merge.columns[[60,65]], axis=1)
merge = merge.drop(['Indicator Name', 'Indicator Code','2012', '2013', '2014', '2015'], axis=1)

#Drop some rows with no data
merge = merge[pd.notnull(merge.mean(axis=1, numeric_only=True))]

#Rename
mydf = merge

In [23]:
mydf

Unnamed: 0,Country Name_x,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2006,2007,2008,2009,2010,2011,Country Name_y,Region,IncomeGroup,SpecialNotes
0,Aruba,ABW,,,,,,,,,...,24.766706,25.613715,24.750133,24.876706,24.182702,23.922412,Aruba,Latin America & Caribbean,High income: nonOECD,SNA data for 2000-2011 are updated from offici...
1,Andorra,AND,,,,,,,,,...,6.553477,6.350868,6.296125,6.049173,6.124770,5.968685,Andorra,Europe & Central Asia,High income: nonOECD,
2,Afghanistan,AFG,0.046068,0.053615,0.073781,0.074251,0.086317,0.101499,0.107674,0.123782,...,0.065816,0.088141,0.158962,0.249074,0.302936,0.425262,Afghanistan,South Asia,Low income,Fiscal year end: March 20; reporting period fo...
3,Angola,AGO,0.104357,0.084718,0.216025,0.206877,0.216174,0.206089,0.265164,0.166659,...,1.200877,1.311096,1.369425,1.430873,1.401654,1.354008,Angola,Sub-Saharan Africa,Upper middle income,"April 2013 database update: Based on IMF data,..."
4,Albania,ALB,1.258195,1.374186,1.439956,1.181681,1.111742,1.166099,1.333055,1.363746,...,1.291548,1.507536,1.580113,1.533178,1.515632,1.607038,Albania,Europe & Central Asia,Upper middle income,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,"Yemen, Rep.",YEM,0.703403,0.507631,0.728004,0.537606,0.658012,0.699574,0.605767,0.524197,...,0.985853,0.981422,0.989905,1.026251,1.090060,0.919968,"Yemen, Rep.",Middle East & North Africa,Lower middle income,Based on official government statistics and In...
244,South Africa,ZAF,5.629718,5.694383,5.729712,5.799844,6.170936,6.467359,6.332753,6.465648,...,8.802475,9.063326,9.506481,9.545495,8.957154,9.257216,South Africa,Sub-Saharan Africa,Upper middle income,Fiscal year end: March 31; reporting period fo...
245,"Congo, Dem. Rep.",COD,0.152228,0.150782,0.135559,0.139446,0.116926,0.142290,0.134675,0.124706,...,0.045579,0.047312,0.048411,0.045604,0.049328,0.050303,"Congo, Dem. Rep.",Sub-Saharan Africa,Low income,Based on official government statistics; the n...
246,Zambia,ZMB,,,,,0.950434,1.100197,0.953158,1.263628,...,0.179774,0.150265,0.164692,0.184058,0.192079,0.212450,Zambia,Sub-Saharan Africa,Lower middle income,The new base year is 2010. National accounts d...


Want to export the result to a csv file?

In [24]:
mydf.to_csv("resources/data/co2emissions_cleaned.csv", index = False) #See Recent Data for exported csv
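
`index = False` keeps the row labels out of the file. Without it, the labels are written as a first column with an empty header, and reading the file back produces an `Unnamed: 0` column. A sketch using an in-memory buffer instead of a real file, on a made-up frame:

```python
import io
import pandas as pd

# Row labels 0 and 2 mimic the gaps left by earlier filtering.
toy = pd.DataFrame({"Country Code": ["ABW", "AFG"], "2011": [23.9, 0.43]}, index=[0, 2])

# With index=False only the data columns are written.
buf = io.StringIO()
toy.to_csv(buf, index=False)
header_without_index = buf.getvalue().splitlines()[0]

# With the default index=True the labels become an unnamed first column.
# to_csv returns the CSV text when no path is given.
text_with_index = toy.to_csv()
reloaded = pd.read_csv(io.StringIO(text_with_index))

print(header_without_index)       # → Country Code,2011
print(reloaded.columns.tolist())  # → ['Unnamed: 0', 'Country Code', '2011']
```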

<hr>
<div class="alert alert-success alertsuccess" style="margin-top: 0px">
<h4> [tip] Data cleaning rules come from understanding the business domain </h4>
<p></p>
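As the cleaning process above shows, both distinguishing countries from aggregate regions and removing countries that contain no data require explicit cleaning rules. Those rules come from our understanding of the domain; they cannot be derived from the data alone.
<li>This shows once again that data analysis is business-driven and needs the support of a business model.</li>
<li>What the computer contributes is computing power, which speeds up the analysis.</li>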
<p></p>
</div>
<hr>