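# Improving the model
## Fitting nonlinear structure with sklearn's random forest
- Run cross-validation with the random forest algorithm stored in **alg**, using the columns in **predictors** to predict the Survived column. Assign the result to **scores**.
- Use **model_selection.cross_val_score** to do this.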

In [1]:
import pandas as pd
titanic = pd.read_csv('Data/train.csv')
# Fill missing values in Age with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
# Encode Sex as numbers
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
# Encode Embarked as numbers (missing values are filled with 'S' first)
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.801346801347
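
To make explicit what `cross_val_score` does with `cv=3`, the sketch below reproduces the fold loop by hand. It uses a synthetic dataset from `make_classification` as a stand-in for the Titanic frame (an assumption, so that the block runs on its own), so its scores are unrelated to the number above. For a classifier, an integer `cv` corresponds to an unshuffled `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for titanic[predictors] / titanic["Survived"] (assumption).
X, y = make_classification(n_samples=300, n_features=7, random_state=1)

alg = RandomForestClassifier(random_state=1, n_estimators=10)

# One call: returns one accuracy score per fold.
scores = cross_val_score(alg, X, y, cv=3)

# The same procedure spelled out: fit a fresh copy of the model on each
# training split, then score it on the held-out split.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    model = clone(alg).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print(scores, scores.mean())
print(manual_scores, manual_scores.mean())
```

Looking at the individual fold scores, and not only their mean, gives a rough sense of how much the estimate varies from split to split.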


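# Tuning parameters
- Increasing the number of trees generally improves prediction accuracy, with diminishing returns, but training more trees takes more time.
- Adjust the **min_samples_split** and **min_samples_leaf** parameters to reduce overfitting.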

In [3]:
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.820426487093
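
Rather than editing the three parameters by hand, the search can be automated. The following is a minimal sketch using `GridSearchCV` on a synthetic dataset (a stand-in for the Titanic data, so the scores it prints say nothing about the numbers above). The candidate values in `param_grid` are illustrative choices, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=7, random_state=1)

# Candidate values to try; every combination is cross-validated.
param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(X, y)

# Best combination found and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)
```

Because the best score is chosen from many candidates on the same folds, it tends to be slightly optimistic. A held-out test set gives a fairer final estimate.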


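# Generating new features
- Some ideas:
  - The length of the name, which may be related to how wealthy the person was and therefore to their position on the Titanic.
  - The total number of people in a family (**SibSp**+**Parch**).
- Use the **.apply** method of pandas dataframes to generate features. It applies a function to every element of the dataframe or series you call it on. We can also pass in a **lambda** function, which lets us define the function anonymously.
- The syntax of an anonymous function is **lambda x: len(x)**. x takes the value passed in as its input, in this case the passenger's name. The expression to the right of the colon is applied to x and its result is returned. The **.apply** method collects all of these outputs and builds a pandas series from them. We can assign this series to a dataframe column.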

In [4]:
# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
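
The same two features can be checked on a tiny frame where the results are easy to verify by eye. This is a sketch with made-up passengers (hypothetical rows, not taken from `train.csv`); it also shows `.str.len()`, a built-in pandas accessor that gives the same result as the lambda.

```python
import pandas as pd

# Toy frame with invented passengers (hypothetical data).
df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Doe, Mrs. Jane Ann"],
    "SibSp": [1, 0],
    "Parch": [0, 2],
})

# Column arithmetic is vectorized, so no .apply is needed here.
df["FamilySize"] = df["SibSp"] + df["Parch"]

# .apply calls the function once per element and collects the results in a Series.
df["NameLength"] = df["Name"].apply(lambda x: len(x))

# Equivalent result using the string accessor.
same = df["Name"].str.len()

print(df[["FamilySize", "NameLength"]])
```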

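## Using titles
- Extract each passenger's title from their name. Titles look like **Master., Mr., Mrs.**: they start with a capital letter, continue with lowercase letters, and end with a period. Some titles are very common, while others form a long tail of one-off titles used by only one or two passengers. The first step is to extract the title with a **regular expression**, then map each unique title to an integer value.
- We then end up with a numeric column that corresponds exactly to Title.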

In [None]:
import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(titles.value_counts())

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(titles.value_counts())

# Add in the title column.
titanic["Title"] = titles
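
As an alternative to the per-element function and the replacement loop, the extraction and mapping can be done with vectorized pandas calls. The sketch below uses a few invented names (hypothetical, not rows from `train.csv`) and an abbreviated mapping that reuses the codes above.

```python
import pandas as pd

# Invented names in the same "Surname, Title. Given names" layout (hypothetical).
names = pd.Series([
    "Smith, Mr. John",
    "Doe, Mrs. Jane Ann",
    "Roe, Miss. Mary",
    "Poe, Capt. Edward",
])

# expand=False returns a Series holding the single capture group.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Abbreviated version of the mapping, with the same codes.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Capt": 7}

# .map looks each title up in the dict; titles missing from it become NaN.
title_codes = titles.map(title_mapping)

print(titles.tolist())
print(title_codes.tolist())
print("unmapped:", int(title_codes.isna().sum()))
```

One difference from the loop is worth keeping in mind: `.map` turns any title that is missing from the dictionary into `NaN`, whereas the loop leaves it as the original string. Checking the unmapped count makes such gaps visible.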

In [17]:
import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, return the whole match. group() with no argument
    # includes the leading space and the trailing period; group(1) would return only the letters.
    if title_search:
        return title_search.group()
    return ""

name = titanic['Name'][1:5]
titles = name.apply(get_title)
print(titles)

1      Mrs.
2     Miss.
3      Mrs.
4       Mr.
Name: Name, dtype: object
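
The trailing periods in this output come from calling `group()` without an argument. The short sketch below, on one invented name (hypothetical), contrasts the whole match with the first capture group.

```python
import re

name = "Smith, Mrs. Jane Ann"  # invented example name
match = re.search(r" ([A-Za-z]+)\.", name)

whole = match.group()    # entire match: leading space, title, and period
title = match.group(1)   # first capture group: the letters only

print(repr(whole))  # ' Mrs.'
print(repr(title))  # 'Mrs'
```

The keys of `title_mapping` have no space or period, so `group(1)` is the form that lines up with them.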
