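### Titanic Survivors
This is a demo of Titanic survivor prediction, meant as a quick way to get familiar with the whole workflow:
1. Data processing
2. Feature processing
3. Model selection
4. Training
5. Producing results

Data source: https://ai-training-data-1253534728.cos.ap-beijing.myqcloud.com/public/titanic.zip

Download the data into a `dataset_temp` folder at the same level as this file, and do not commit it to the remote repository.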
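
The download step can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper names `extract_dataset` and `download_dataset` are hypothetical, and it assumes the archive unpacks into a `titanic/` subfolder, since the notebook later reads from `dataset_temp/titanic`.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the "Data source" line above
DATA_URL = "https://ai-training-data-1253534728.cos.ap-beijing.myqcloud.com/public/titanic.zip"


def extract_dataset(zip_path, dest_dir="dataset_temp"):
    """Unpack the archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_dataset(url=DATA_URL, dest_dir="dataset_temp"):
    """Download the zip into dest_dir, then unpack it in place."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "titanic.zip"
    urllib.request.urlretrieve(url, zip_path)
    return extract_dataset(zip_path, dest)
```

Calling `download_dataset()` once from the notebook's folder should be enough. Listing `dataset_temp/` in `.gitignore` is one way to keep the data out of the remote repository.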
```

In [6]:
import sys
import pandas as pd
import numpy as np
import sklearn

# Import some algorithms
from sklearn import ensemble
# For label encoding
from sklearn.preprocessing import LabelEncoder
# For feature selection
from sklearn import feature_selection
# For data splitting and cross-validation
from sklearn import model_selection
# For model evaluation
from sklearn import metrics

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Set the plotting style
mpl.style.use('ggplot')

1. Read the data and fill in missing values

In [7]:
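base_path = 'dataset_temp/titanic'
data_train = pd.read_csv(f'{base_path}/train.csv')
data_test = pd.read_csv(f'{base_path}/test.csv')
# Note: in the Kaggle Titanic dataset, gender_submission.csv is the sample
# submission (survival predicted from sex alone). Assuming this is the same file,
# the 'Survived' values merged into the test set are placeholders, not ground truth.
data_test_survived = pd.read_csv(f'{base_path}/gender_submission.csv')
data_test = pd.merge(data_test, data_test_survived, on='PassengerId', how='left')

# Tag each row with its origin so the two sets can be separated again later
data_train['type'] = 'train'
data_test['type'] = 'test'

data_all = pd.concat([data_train, data_test], sort=False)
data_all.columns = data_all.columns.str.lower()

# Fill missing values. Assign the result back rather than calling
# fillna(..., inplace=True) on a selected column, which is chained assignment.
data_all['age'] = data_all['age'].fillna(data_all['age'].median())
data_all['fare'] = data_all['fare'].fillna(data_all['fare'].median())
# Port of embarkation: fill missing values with the mode
data_all['embarked'] = data_all['embarked'].fillna(data_all['embarked'].mode()[0])

# Cabin number: too many values are missing, so drop the column
data_all = data_all.drop(columns=['cabin'])

data_all.isnull().sum()
data_all.head()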


,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,embarked,type
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,train
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,train
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,train
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,train
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,train
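
To see the imputation rule in isolation, here is a small sketch on a made-up frame (the `toy` values are illustrative, not taken from the dataset). It assumes the same policy as the cell above: median for numeric columns, mode for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic columns, with one gap per column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "embarked": ["S", "C", None, "S"],
})

missing_before = toy.isnull().sum()

# Numeric columns: replace NaN with the column median
for col in ["age", "fare"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical column: replace missing entries with the most frequent value
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

missing_after = toy.isnull().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Assigning the filled column back to `toy[col]` keeps the update explicit and avoids in-place modification of an intermediate object.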


Feature construction

In [8]:

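data_all['family_size'] = data_all['sibsp'] + data_all['parch'] + 1
# is_alone is 1 when the passenger travels without family, otherwise 0
data_all['is_alone'] = (data_all['family_size'] == 1).astype(int)
# Extract the title (Mr, Mrs, Miss, ...) from names such as "Braund, Mr. Owen Harris"
data_all['title'] = data_all['name'].str.split(',', expand=True)[1].str.split('.', expand=True)[0].str.strip()

# Fare is continuous, so split it into 10 intervals. pd.cut makes the intervals
# equal in width, so the number of passengers per interval is uneven
# (pd.qcut would give roughly equal counts instead).
data_all['fare_bin'] = pd.cut(data_all['fare'].astype(int), 10)
# Age: also binned with cut, so each group holds a different number of passengers
data_all['age_bin'] = pd.cut(data_all['age'].astype(int), 10)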

# rarely_title = (data_all['title'].value_counts() < 10)
# rarely_title.value_counts()
# data_all['title'] = data_all['title'].apply(lambda x: 'Rarely' if rarely_title.loc[x] == True else x)

data_all.describe(include='all')
data_all.isnull().sum()


passengerid    0
survived       0
pclass         0
name           0
sex            0
age            0
sibsp          0
parch          0
ticket         0
fare           0
embarked       0
type           0
family_size    0
is_alone       0
title          0
fare_bin       0
age_bin        0
dtype: int64
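
The difference between equal-width and equal-frequency binning is easy to check on a tiny series. This is an illustrative sketch with made-up fares, not output from the notebook; `pd.cut` and `pd.qcut` are the real pandas functions, everything else is assumed for the example.

```python
import pandas as pd

# Hypothetical skewed fares: most are small, two are very large
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100, 500], name="fare")

# Equal-width bins: the range 1..500 is split in half, so almost every
# value lands in the first bin
equal_width = pd.cut(fares, 2)

# Equal-frequency bins: edges are placed at quantiles, so each bin holds
# about the same number of values
equal_freq = pd.qcut(fares, 2)

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
print(width_counts, freq_counts)
```

On heavily repeated values `pd.qcut` can fail because several quantile edges coincide; its `duplicates="drop"` option merges such edges at the cost of fewer bins.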

#### Building new fields with LabelEncoder
Convert string-valued columns into integer codes that a model can work with.
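
As a minimal sketch of what the encoding does, the example below runs `LabelEncoder` on a made-up list of ports (the `ports` values are illustrative). It also shows how to recover the label-to-code mapping from `classes_`, which is lost if one encoder object is refitted for each column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical embarkation ports
ports = ["S", "C", "S", "Q"]

encoder = LabelEncoder()
# Codes are assigned in sorted order of the distinct labels
codes = encoder.fit_transform(ports)

# classes_ holds the labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# inverse_transform maps the codes back to the original strings
decoded = encoder.inverse_transform(codes)
print(codes.tolist(), mapping, list(decoded))
```

The integer codes imply an ordering that the original categories may not have. Tree-based models usually tolerate this, while linear models may be better served by one-hot encoding.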

In [9]:
data_all.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, 0 to 417
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   passengerid  1309 non-null   int64   
 1   survived     1309 non-null   int64   
 2   pclass       1309 non-null   int64   
 3   name         1309 non-null   object  
 4   sex          1309 non-null   object  
 5   age          1309 non-null   float64 
 6   sibsp        1309 non-null   int64   
 7   parch        1309 non-null   int64   
 8   ticket       1309 non-null   object  
 9   fare         1309 non-null   float64 
 10  embarked     1309 non-null   object  
 11  type         1309 non-null   object  
 12  family_size  1309 non-null   int64   
 13  is_alone     1309 non-null   int64   
 14  title        1309 non-null   object  
 15  fare_bin     1309 non-null   category
 16  age_bin      1309 non-null   category
dtypes: category(2), float64(2), int64(7), object(6)
memory usage: 167.1+ KB


In [10]:
label_encoder = LabelEncoder()
data_all['sex_code'] = label_encoder.fit_transform(data_all['sex'])
data_all['embarked_code'] = label_encoder.fit_transform(data_all['embarked'])
data_all['title_code'] = label_encoder.fit_transform(data_all['title'])
data_all['fare_code'] = label_encoder.fit_transform(data_all['fare_bin'])
data_all['age_code'] = label_encoder.fit_transform(data_all['age_bin'])

# Convert survived to int
data_all['survived'] = data_all['survived'].astype(int)
data_all.head()

# Save data_all to a CSV file
data_all.to_csv(f'{base_path}/data_all.csv', index=False)