***
***
# Introduction to Machine Learning:

###  Starting with the Titanic

***
***

王成军 

wangchengjun@nju.edu.cn


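1. **Introduction to Machine Learning**: Starting with the Titanic

> This part covers the basic logic of machine learning in Python (participants should install Anaconda in advance), Scikit-Learn, model parameters and model validation, and feature extraction.

2. **First Steps in Machine Learning**: Naive Bayes and Linear Regression
3. **Advanced Machine Learning**: Support Vector Machines and Random Forests
4. **Extending Machine Learning**: Neural Network Models with PyTorch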

https://github.com/computational-class/machine-learning

![](./img/machine.jpg)

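## 1. Supervised Learning

How it works:
- These algorithms work with a target variable (also called the outcome or dependent variable).
- The target is predicted from a set of known predictors (independent variables).
- From these variables, we learn a function that maps input values to the expected output.
- Training continues until the model reaches the desired accuracy on the training data.
- Examples of supervised learning: regression, decision trees, random forests, K-nearest neighbors, and logistic regression.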
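
The mechanism above can be sketched in a few lines of Scikit-Learn. This is a minimal illustration, not part of the original course code: the data are synthetic (generated with `make_classification`), and logistic regression is chosen only because it appears in the list of examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration):
# X holds the predictors, y holds the target variable.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Learn a function that maps input values to the expected output.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Apply the learned mapping and measure accuracy on the training data.
predictions = model.predict(X)
train_accuracy = model.score(X, y)
print(predictions[:5], train_accuracy)
```

Accuracy measured on the training data tends to be optimistic; a held-out set gives a fairer estimate.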

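## 2. Unsupervised Learning

How it works:
- These algorithms have no target or outcome variable to predict or estimate.
- They are used to cluster observations into groups.
- This kind of analysis is widely used for customer segmentation, dividing users into groups that receive different interventions.
- Examples of unsupervised learning: association rule algorithms and K-means.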
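
As a small sketch of clustering without a target variable, the snippet below runs K-means on synthetic two-dimensional points. The data set (`make_blobs`) and the choice of three clusters are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points drawn around three centers; the true labels are discarded
# because unsupervised learning does not use a target variable.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Group the points into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])               # cluster index assigned to the first ten points
print(kmeans.cluster_centers_)   # coordinates of the three cluster centers
```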

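## 3. Reinforcement Learning

How it works:
- These algorithms train a machine to make decisions.
- The machine is placed in an environment where it trains itself through repeated trial and error.
- It learns from past experience and tries to use the best knowledge it has acquired to make accurate business decisions.
- The Markov decision process is the standard formal example of reinforcement learning; AlphaGo is a well-known application.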

> Chess. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the
reward can be defined as win or lose at the end of the game:
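
Chess is far too large for a short example, so the sketch below illustrates the same trial-and-error loop on a hypothetical five-cell corridor: the agent starts in cell 0 and receives a reward of 1 only when it reaches cell 4. The environment, the hyperparameters, and the use of tabular Q-learning are all assumptions for illustration and are not taken from the course material.

```python
import random

N_STATES = 5          # cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]    # index 0 moves left, index 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]

for episode in range(200):
    state, done = 0, False
    while not done:
        # Explore with probability EPSILON (or on ties), otherwise act greedily.
        if random.random() < EPSILON or q[state][0] == q[state][1]:
            a = random.randrange(2)
        else:
            a = 0 if q[state][0] > q[state][1] else 1
        next_state, reward, done = step(state, ACTIONS[a])
        # Q-learning update: move the estimate toward reward + discounted best next value.
        target = reward + (0.0 if done else GAMMA * max(q[next_state]))
        q[state][a] += ALPHA * (target - q[state][a])
        state = next_state

# Greedy policy for the non-terminal cells.
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(N_STATES - 1)]
print(policy)
```

The reward arrives only at the end of an episode, as in the chess description, and the value estimates propagate backwards from the goal over repeated episodes.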

<img src = './img/mlprocess.png' width = 800>

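- Linear regression
- Logistic regression
- Decision trees
- SVM
- Naive Bayes
---
- K-nearest neighbors
- K-means
- Random forests
- Dimensionality reduction algorithms
- Gradient Boosting and AdaBoost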




> # Titanic Data Analysis


Computational Communication website http://computational-communication.com

In [2]:
import pandas as pd
train = pd.read_csv('./data/titanic_train.csv', 
                    sep = ",")

In [3]:
train.head() 

| Unnamed: 0.1 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [39]:
# Impute missing Age and Fare values with the median
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Fare"] = train["Fare"].fillna(train["Fare"].median())
# Convert the male and female groups to integer form
# (map assigns the whole column at once and avoids chained indexing)
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")
# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
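
The imputation and integer-encoding steps can be checked on a tiny frame before running them on the full data. The four-row `toy` frame below is hypothetical; only the column names are taken from the Titanic data. The sketch uses `Series.map` with a dictionary, which replaces a whole column in one statement.

```python
import pandas as pd

# Hypothetical miniature of the Titanic columns (values invented for illustration).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

# Fill the missing age with the median of the observed ages (22, 26, 35).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
# Recode the categories as integers.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})

print(toy)
```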

In [31]:
# Import the decision tree module from scikit-learn
from sklearn import tree

# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit the first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Print the feature importances and the accuracy on the training data
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

[ 0.12294397  0.31274009  0.23680307  0.32751287]
0.977553310887
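
The score of about 0.98 above is computed on the same rows the tree was trained on, so it says little about how the model performs on unseen passengers. The sketch below shows the usual remedy, a held-out split, on synthetic data; the data set and the 25% test size are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), standing in for a real data set.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unconstrained tree can memorize the training rows.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(train_acc, test_acc)
```

A gap between the two numbers is the signature of overfitting.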


In [32]:
import numpy as np

test = pd.read_csv('./data/titanic_test.csv', sep = ",")
# Impute the missing Fare and Age values with the median
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
test["Age"] = test["Age"].fillna(test["Age"].median())
# Convert the male and female groups to integer form
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

# Impute the Embarked variable
test["Embarked"] = test["Embarked"].fillna("S")
# Convert the Embarked classes to integer form
test["Embarked"] = test["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

# Make predictions on the test set
my_prediction = my_tree_one.predict(test_features)

# Create a data frame indexed by PassengerId with one column, Survived, holding the predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, index = PassengerId, columns = ["Survived"])


In [33]:
my_solution[:3]

| Unnamed: 0 | Survived |
|---|---|
| 892 | 0 |
| 893 | 0 |
| 894 | 1 |


In [17]:
# Check that your data frame has 418 entries
my_solution.shape

(418, 1)

In [30]:
# Write the solution to a CSV file, labelling the index column PassengerId
my_solution.to_csv("../data/tatanic_solution_one.csv", 
                   index_label = ["PassengerId"])

In [34]:
# Create a new array with the added features: features_two
features_two = train[["Pclass", "Age", "Sex", "Fare",
                      "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, 
                                          min_samples_split = min_samples_split, 
                                          random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

# Print the training accuracy of the new decision tree
print(my_tree_two.score(features_two, target))

0.905723905724
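
The training accuracy drops once `max_depth` and `min_samples_split` constrain the tree, which is expected: a lower training score does not by itself show that the model generalizes better. One way to compare settings is cross-validation. The sketch below does this on synthetic data; the data set and the candidate depths are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=1)

cv_means = {}
for depth in [2, 5, 10, None]:   # None lets the tree grow without a depth limit
    clf = DecisionTreeClassifier(max_depth=depth, min_samples_split=5,
                                 random_state=1)
    # 5-fold cross-validation: each fold is scored by a model trained on the other four.
    scores = cross_val_score(clf, X, y, cv=5)
    cv_means[depth] = scores.mean()
    print(depth, round(scores.mean(), 3))
```

The depth with the highest mean cross-validated accuracy is a reasonable choice, although the differences may be small on a given data set.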


In [35]:
# Create a copy of the training set and add the new variable
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1

# Create a new decision tree: my_tree_three
features_three = train_two[["Pclass", "Sex", "Age",
                            "Fare", "SibSp", "Parch", "family_size"]].values

my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)

# Print the training accuracy of this decision tree
print(my_tree_three.score(features_three, target))


0.979797979798


In [36]:
#Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

#We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

#Building the Forest: my_forest
n_estimators = 100
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, 
                                n_estimators = n_estimators, random_state = 1)
my_forest = forest.fit(features_forest, target)

#Print the score of the random forest
print(my_forest.score(features_forest, target))

# Compute predictions and print the length of the prediction vector: test_features, pred_forest
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
print(pred_forest[:3])

0.939393939394
418
[0 0 0]


In [22]:
#Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)

#Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_two, target))

[ 0.14130255  0.17906027  0.41616727  0.17938711  0.05039699  0.01923751
  0.0144483 ]
[ 0.10384741  0.20139027  0.31989322  0.24602858  0.05272693  0.04159232
  0.03452128]
0.905723905724
0.939393939394
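
In the output above, the importance values follow the column order of `features_two` (Pclass, Age, Sex, Fare, SibSp, Parch, Embarked), which is hard to read from a bare array. A small sketch of labelling and sorting the importances follows; the data are synthetic and the feature names `f0` to `f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names and synthetic data (assumptions for illustration).
names = ["f0", "f1", "f2", "f3", "f4"]
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Attach the names to the importances and sort from most to least important.
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The same pattern applies to the Titanic models by passing the list of column names used to build the feature array.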


# Reading Materials
Essentials of Machine Learning Algorithms (with Python and R code) http://blog.csdn.net/a6225301/article/details/50479672

The "Python Machine Learning" book code repository and info resource https://github.com/rasbt/python-machine-learning-book

An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2013) : Python code https://github.com/JWarmenhoven/ISLR-python

BuildingMachineLearningSystemsWithPython https://github.com/luispedro/BuildingMachineLearningSystemsWithPython

# Homework
https://www.datacamp.com/community/tutorials/the-importance-of-preprocessing-in-data-science-and-the-machine-learning-pipeline-i-centering-scaling-and-k-nearest-neighbours