## Udacity Machine Learning: The Data Analysis Process

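### Step 1: Ask Questions
You either obtain a batch of data and ask questions based on it, or ask questions first and then gather data to answer them. In both cases, good questions help you focus on the relevant parts of your data and lead to insightful analysis.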

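### Step 2: Wrangle Data
You get the data you need in three steps: gather, assess, and clean. You gather the data needed to answer your questions, assess it to identify any problems in its quality or structure, and clean it by modifying, replacing, or removing data so that your dataset is as high-quality and well-structured as possible.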
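
The gather, assess, and clean steps can be sketched in pandas as follows. This is only an illustration under stated assumptions: the inline CSV text and its column names (`id`, `score`, `grade`) are made up, and filling with the column mean is just one possible cleaning choice.

```python
import io
import pandas as pd

# Gather: a tiny made-up CSV stands in for a real data source.
raw = io.StringIO(
    "id,score,grade\n"
    "1,90.0,A\n"
    "2,,B\n"
    "2,,B\n"
    "3,75.5,C\n"
)
df = pd.read_csv(raw)

# Assess: count missing values per column and fully duplicated rows.
missing_per_column = df.isnull().sum()
duplicate_rows = df.duplicated().sum()

# Clean: drop the duplicates, then fill the missing score with the column mean.
clean = df.drop_duplicates().copy()
clean["score"] = clean["score"].fillna(clean["score"].mean())

print(missing_per_column)
print(duplicate_rows)
print(clean)
```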

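### Step 3: Perform EDA (Exploratory Data Analysis)
You explore and then augment your data to maximize the potential of your analyses, visualizations, and models. Exploring involves finding patterns in the data, visualizing relationships in it, and building intuition about what you are working with. After exploring, you can remove outliers and create better features from the data, which is known as feature engineering.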
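
A minimal sketch of these ideas, assuming a made-up table of measurements. The 1.5 × IQR rule used to drop outliers is one common convention rather than something this course prescribes, and the `area_per_radius` feature is hypothetical.

```python
import pandas as pd

# Made-up measurements; the last radius is an obvious outlier.
df = pd.DataFrame({
    "radius": [10.0, 11.0, 12.0, 11.5, 10.5, 60.0],
    "area": [314.0, 380.0, 452.0, 415.0, 346.0, 11310.0],
})

# Explore: summary statistics and pairwise correlation.
summary = df.describe()
correlation = df.corr()

# Remove outliers: keep rows whose radius lies within 1.5 * IQR of the quartiles.
q1 = df["radius"].quantile(0.25)
q3 = df["radius"].quantile(0.75)
iqr = q3 - q1
in_range = df["radius"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = df[in_range].copy()

# Feature engineering: derive a new column from existing ones.
trimmed["area_per_radius"] = trimmed["area"] / trimmed["radius"]
print(trimmed)
```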

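### Step 4: Draw Conclusions (or Even Make Predictions)
This step is typically approached with machine learning or inferential statistics, which are beyond the scope of this course. This lesson focuses on drawing conclusions with descriptive statistics.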

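### Step 5: Communicate Your Results
You often need to justify the insights you have found and convey what they mean. Or, if your end goal is to build a system, you usually need to share what you built, explain how you reached your design decisions, and report how well the system performs. There are many ways to communicate results: reports, slides, blog posts, emails, presentations, or even conversations. Data visualization will always be very valuable.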

## Asking Questions

In [2]:
import pandas as pd

In [7]:
df = pd.read_csv('cancer_data.csv')
df.head()  # show the first 5 rows of the data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_max,texture_max,perimeter_max,area_max,smoothness_max,compactness_max,concavity_max,concave_points_max,symmetry_max,fractal_dimension_max
0,842302,M,17.99,,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Reference: [the `enumerate` function](http://www.runoob.com/python/python-func-enumerate.html)

In [4]:
# Iterate over the column names, using enumerate to attach an index to each
for i, v in enumerate(df.columns):
    print(i, v)

0 id
1 diagnosis
2 radius_mean
3 texture_mean
4 perimeter_mean
5 area_mean
6 smoothness_mean
7 compactness_mean
8 concavity_mean
9 concave_points_mean
10 symmetry_mean
11 fractal_dimension_mean
12 radius_SE
13 texture_SE
14 perimeter_SE
15 area_SE
16 smoothness_SE
17 compactness_SE
18 concavity_SE
19 concave_points_SE
20 symmetry_SE
21 fractal_dimension_SE
22 radius_max
23 texture_max
24 perimeter_max
25 area_max
26 smoothness_max
27 compactness_max
28 concavity_max
29 concave_points_max
30 symmetry_max
31 fractal_dimension_max


#### Asking questions based on the data:
For example: how is tumor radius related to smoothness? How is tumor area related to the diagnosis?
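
Questions like these can be approached with descriptive statistics. The sketch below reuses a few column names from `cancer_data.csv`, but the rows are made up for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows that reuse column names from cancer_data.csv; the values are not real.
sample = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B", "B"],
    "radius_mean": [18.0, 20.0, 12.0, 11.0, 13.0],
    "area_mean": [1000.0, 1300.0, 450.0, 380.0, 520.0],
    "smoothness_mean": [0.12, 0.10, 0.09, 0.08, 0.10],
})

# Question: how does tumor area differ by diagnosis?
area_by_diagnosis = sample.groupby("diagnosis")["area_mean"].mean()

# Question: how are radius and smoothness related?
radius_smoothness_corr = sample["radius_mean"].corr(sample["smoothness_mean"])

print(area_by_diagnosis)
print(radius_smoothness_corr)
```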

## Wrangling Data

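### Reading CSV Files
`read_csv()` loads data from a CSV file into a pandas DataFrame. You only need to specify the file path. I stored `student_scores.csv` in the same directory as this Jupyter notebook, so the file name alone is enough.
> Udacity notebook: `reading_csv-zh.ipynb`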

In [3]:
df = pd.read_csv('student_scores.csv')
df.head()  # show the first few rows; defaults to 5 when no argument is given

Unnamed: 0,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
0,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
1,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
2,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
3,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
4,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0


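CSV stands for comma-separated values, but the values in a file may actually be separated by other characters, such as tabs or spaces.
The `sep` parameter of `read_csv` tells pandas which delimiter to split on.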

In [4]:
df = pd.read_csv('student_scores.csv',sep = ':')
df.head()

Unnamed: 0,"ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final"
0,"27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0"
1,"30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0"
2,"39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0"
3,"28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0"
4,"27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0"


In this example the CSV file is comma-separated and contains no colons, so all the values on each line are read into a single column.
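
For contrast, here is a small sketch where `sep=':'` is the right choice. The colon-delimited text is hypothetical and is read from an in-memory buffer rather than from a file on disk.

```python
import io
import pandas as pd

# Hypothetical colon-delimited text, standing in for a file on disk.
colon_text = "ID:Name:Final\n27604:Joe:95.0\n30572:Alex:91.0\n"

# With the default comma separator, each line collapses into one wide column.
wrong = pd.read_csv(io.StringIO(colon_text))

# With sep=':', the three fields are split into separate columns.
right = pd.read_csv(io.StringIO(colon_text), sep=":")

print(wrong.shape)
print(right.shape)
```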

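### Headers
Another feature of `read_csv` is specifying which line of the file is the header, which in turn sets the column labels.

Usually the first line is the header, but if the file has extra meta information at the top, we may want a different line to be the header. The `header` parameter does this.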

In [7]:
df = pd.read_csv('student_scores.csv',header = 2)
df.head()

Unnamed: 0,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0.1,91.0
0,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
1,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
2,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0


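The `header` parameter counts lines from 0, so `header = 2` takes the third line of the file (the second data row, Alex's record) as the header, and every line above it is dropped.

By default the header is inferred, which behaves like `header = 0` when `names` is not given. When reading a file that has no column labels, pass `header = None` so that the first line is kept as data.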
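
The two cases can be sketched with in-memory text. Both snippets of CSV text are hypothetical: one has two lines of meta information above the real header, and the other has no header at all.

```python
import io
import pandas as pd

# Hypothetical file with two lines of meta information above the real header.
with_meta = (
    "exported by: registrar\n"
    "term: spring\n"
    "ID,Name,Final\n"
    "27604,Joe,95.0\n"
    "30572,Alex,91.0\n"
)
df_meta = pd.read_csv(io.StringIO(with_meta), header=2)

# Hypothetical file with no column labels.
no_labels = "27604,Joe,95.0\n30572,Alex,91.0\n"

# Default behavior: the first record is consumed as the header and lost as data.
df_default = pd.read_csv(io.StringIO(no_labels))

# header=None: every line is data and the columns are numbered 0, 1, 2.
df_none = pd.read_csv(io.StringIO(no_labels), header=None)

print(df_meta)
print(len(df_default), len(df_none))
```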

In [8]:
df = pd.read_csv('student_scores.csv', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,ID,Name,Attendance,HW,Test1,Project1,Test2,Project2,Final
1,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
2,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
3,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
4,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0


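Column labels can also be supplied yourself with the `names` parameter. If the file already contains column labels, also pass `header = 0` so that pandas replaces that line instead of treating it as data.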

In [9]:
labels = ['id1', 'name1', 'attendance1', 'hw1', 'test11', 'project11', 'test21', 'project21', 'final1']
df = pd.read_csv('student_scores.csv', header=0, names=labels)
df.head()

Unnamed: 0,id1,name1,attendance1,hw1,test11,project11,test21,project21,final1
0,27604,Joe,0.96,0.97,87.0,98.0,92.0,93.0,95.0
1,30572,Alex,1.0,0.84,92.0,89.0,94.0,92.0,91.0
2,39203,Avery,0.84,0.74,68.0,70.0,84.0,90.0,82.0
3,28592,Kris,0.96,1.0,82.0,94.0,90.0,81.0,84.0
4,27492,Rick,0.32,0.85,98.0,100.0,73.0,82.0,88.0
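
To see why `header=0` matters here, the sketch below reads the same text with and without it. The shortened CSV text and the replacement labels are made up for illustration.

```python
import io
import pandas as pd

text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"
labels = ["student_id", "student_name", "final_score"]  # hypothetical labels

# names alone: pandas assumes the file has no header, so the old labels become a data row.
kept_as_data = pd.read_csv(io.StringIO(text), names=labels)

# names with header=0: the first line is consumed as the header and replaced by labels.
replaced = pd.read_csv(io.StringIO(text), header=0, names=labels)

print(kept_as_data)
print(replaced)
```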


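### Index

Instead of the default index (integers counting up from 0 in steps of 1), you can set one or more columns as the DataFrame's index.

To set the `Name` and `ID` columns as the index, pass them to the `index_col` parameter of `read_csv`: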

In [13]:
df = pd.read_csv('student_scores.csv',index_col = ['Name','ID'])
df.head()

Name,ID,Attendance,HW,Test1,Project1,Test2,Project2,Final
Joe,27604,0.96,0.97,87.0,98.0,92.0,93.0,95.0
Alex,30572,1.0,0.84,92.0,89.0,94.0,92.0,91.0
Avery,39203,0.84,0.74,68.0,70.0,84.0,90.0,82.0
Kris,28592,0.96,1.0,82.0,94.0,90.0,81.0,84.0
Rick,27492,0.32,0.85,98.0,100.0,73.0,82.0,88.0
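
One reason to choose a meaningful index is label-based lookup with `.loc`. A minimal sketch, assuming a shortened, made-up version of the student scores text:

```python
import io
import pandas as pd

text = "ID,Name,Final\n27604,Joe,95.0\n30572,Alex,91.0\n"

# Single-column index: look up a row by name.
by_name = pd.read_csv(io.StringIO(text), index_col="Name")
joe_final = by_name.loc["Joe", "Final"]

# Two-column index: look up a row by a (Name, ID) pair.
by_name_and_id = pd.read_csv(io.StringIO(text), index_col=["Name", "ID"])
alex_final = by_name_and_id.loc[("Alex", 30572), "Final"]

print(joe_final, alex_final)
```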


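`read_csv()` can do much more on its own, such as parsing dates, handling null values, and skipping rows. These operations can also be done as separate steps after calling `read_csv()`. We will modify the data with other methods, but you can see how to do it with this function [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).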
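
As a rough illustration, the sketch below uses the `skiprows`, `parse_dates`, and `na_values` parameters of `read_csv()` on made-up text. The `Submitted` column and the `missing` marker are assumptions, not part of `student_scores.csv`.

```python
import io
import pandas as pd

text = (
    "exported from a hypothetical gradebook\n"
    "ID,Name,Submitted,Final\n"
    "27604,Joe,2018-05-01,95.0\n"
    "30572,Alex,2018-05-03,missing\n"
)

df = pd.read_csv(
    io.StringIO(text),
    skiprows=1,                  # skip the note on the first line
    parse_dates=["Submitted"],   # parse this column as datetimes
    na_values=["missing"],       # treat this marker as a null value
)

# Filling the null is done as a separate step after loading.
df["Final"] = df["Final"].fillna(df["Final"].mean())
print(df.dtypes)
print(df)
```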

#### Quiz #1
Use `read_csv()` to read in `cancer_data.csv` with an appropriate column as the index. Then call `.head()` on the DataFrame to check that it worked.

In [14]:
df_cancer = pd.read_csv('cancer_data.csv',index_col =['id'])
df_cancer.head()

id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,...,radius_max,texture_max,perimeter_max,area_max,smoothness_max,compactness_max,concavity_max,concave_points_max,symmetry_max,fractal_dimension_max
842302,M,17.99,,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
84348301,M,11.42,20.38,77.58,386.1,,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,,0.8663,0.6869,0.2575,0.6638,0.173
84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


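#### Quiz #2
Using the feature descriptions on this [site](http://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant), use `read_csv()` to read in `powerplant_data.csv` with more descriptive column names. Then call `.head()` on the DataFrame to check that it worked. *Hint: first call `read_csv()` with only the file path to see what the data looks like.*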

In [19]:
powerplant_list = ['Tempreature','Vacuum','Ambient Pressure','Relative Humidity','Net hourly electrical energy output']
df_powerplamt = pd.read_csv('powerplant_Data.csv',header = 0,names = powerplant_list)
df_powerplamt.head()

Unnamed: 0,Tempreature,Vacuum,Ambient Pressure,Relative Humidity,Net hourly electrical energy output
0,8.34,40.77,1010.84,90.01,480.48
1,23.64,58.49,1011.4,74.2,445.75
2,29.74,56.9,1007.15,41.91,438.76
3,19.07,49.69,1007.22,76.79,453.09
4,11.8,40.66,1017.13,97.2,464.43


### Writing CSV Files

The `to_csv` method saves a DataFrame to a CSV file.

In [20]:
df_powerplamt.to_csv('powerplant_data_edited.csv')

In [21]:
df = pd.read_csv('powerplant_data_edited.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Tempreature,Vacuum,Ambient Pressure,Relative Humidity,Net hourly electrical energy output
0,0,8.34,40.77,1010.84,90.01,480.48
1,1,23.64,58.49,1011.4,74.2,445.75
2,2,29.74,56.9,1007.15,41.91,438.76
3,3,19.07,49.69,1007.22,76.79,453.09
4,4,11.8,40.66,1017.13,97.2,464.43


The `Unnamed: 0` column appears because `to_csv()` saves the index by default. To leave the index out, pass `index=False`.

In [24]:
df_powerplamt.to_csv('powerplant_data_edited.csv',index = False)
df = pd.read_csv('powerplant_data_edited.csv')
df.head()

Unnamed: 0,Tempreature,Vacuum,Ambient Pressure,Relative Humidity,Net hourly electrical energy output
0,8.34,40.77,1010.84,90.01,480.48
1,23.64,58.49,1011.4,74.2,445.75
2,29.74,56.9,1007.15,41.91,438.76
3,19.07,49.69,1007.22,76.79,453.09
4,11.8,40.66,1017.13,97.2,464.43
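
If the index was already saved, another option is to tell `read_csv()` that the first column is the index by passing `index_col=0`. A minimal sketch, assuming a made-up two-row frame and an in-memory buffer in place of a file on disk:

```python
import io
import pandas as pd

# Made-up frame standing in for the power plant data.
df = pd.DataFrame({"Temperature": [8.34, 23.64], "Vacuum": [40.77, 58.49]})

# Write with the index (the default) to an in-memory buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# Read the first column back as the index, so no 'Unnamed: 0' column appears.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```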
