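Tips for improving model performance
- Handle the data types commonly found in real-world datasets (missing values, categorical variables)
- Design pipelines to improve the quality of machine learning code
- Use advanced techniques for model validation (cross-validation)
- Build state-of-the-art models that are widely used to win Kaggle competitions (XGBoost)
- Avoid common and important data science mistakes (leakage)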

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
train = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Obtain target and predictors
y = train.SalePrice
X = train.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

for i in [X_train, X_valid, y_train, y_valid]:
    i.index=range(len(i))
    
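names = ['Dwelling type','Zoning classification','Street frontage length','Lot area','Road type','Alley access type','Lot shape','Land contour','Utilities type','Lot configuration','Land slope','Location within Ames','Proximity to roads/railroad','Proximity to roads/railroad 2','Building type','House style (stories)','Overall material and finish rating','Overall condition rating','Original construction year','Remodel date','Roof type','Roof material','Exterior covering','Exterior covering 2','Masonry veneer type','Masonry veneer area','Exterior material quality','Exterior material condition','Foundation type','Basement height rating','Basement condition rating','Garden-level wall exposure','Basement finished area rating 1','Basement finished area 1','Basement finished area rating 2','Basement finished area 2','Unfinished basement area','Total basement area','Heating type','Heating quality rating','Central air conditioning','Electrical system','First floor area','Second floor area','Low-quality finished area','Above-ground living area','Basement full bathrooms','Basement half bathrooms','Above-ground full bathrooms','Above-ground half bathrooms','Above-ground bedrooms','Above-ground kitchens','Kitchen quality','Number of rooms','Home functionality rating','Number of fireplaces','Fireplace quality','Garage location and type','Garage year built','Garage interior finish rating','Garage car capacity','Garage area','Garage quality rating','Garage condition rating','Paved driveway','Wood deck area','Open porch area','Enclosed porch area','Three-season porch area','Screen porch area','Pool area','Pool quality','Fence quality','Miscellaneous feature','Miscellaneous feature value','Month sold','Year sold','Sale type','Sale condition']
# names describes all 79 predictor columns, so pair it with the full column
# list; zipping with the numeric-only X.columns would misalign the labels
descrip = [*zip(train.drop(['SalePrice'], axis=1).columns, names)]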

In [3]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 36 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   LotFrontage    1201 non-null   float64
 2   LotArea        1460 non-null   int64  
 3   OverallQual    1460 non-null   int64  
 4   OverallCond    1460 non-null   int64  
 5   YearBuilt      1460 non-null   int64  
 6   YearRemodAdd   1460 non-null   int64  
 7   MasVnrArea     1452 non-null   float64
 8   BsmtFinSF1     1460 non-null   int64  
 9   BsmtFinSF2     1460 non-null   int64  
 10  BsmtUnfSF      1460 non-null   int64  
 11  TotalBsmtSF    1460 non-null   int64  
 12  1stFlrSF       1460 non-null   int64  
 13  2ndFlrSF       1460 non-null   int64  
 14  LowQualFinSF   1460 non-null   int64  
 15  GrLivArea      1460 non-null   int64  
 16  BsmtFullBath   1460 non-null   int64  
 17  BsmtHalfBath   1460 non-null   int64  
 18  FullBath

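Most machine learning libraries (including scikit-learn) raise an error when you try to build a model on data with missing values, so you need to pick one of the strategies below.
- (1) Drop the columns that contain missing values. Note: unless most values in a dropped column are missing, the model loses access to a lot of (potentially useful) information.
- (2) Fill in the missing values (imputation). Use this when the share of missing values is fairly low. The filled-in value will not be exactly right in most cases, but it usually yields a more accurate model than dropping the column entirely.
- (3) An extension to imputation. Imputed values may be systematically above or below the actual values (which were never collected), or rows with missing values may be unusual in some other way. In that case, telling the model which values were originally missing and then filled in may help it predict better. To do so, add one new column for each column with missing values, marking where the values were missing. Sometimes this improves results meaningfully; sometimes it does not help at all.
***When handling missing values, use the drop-columns strategy as the baseline against which the other strategies are compared.***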
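
The three strategies can be sketched on a toy frame as follows. This is a minimal illustration rather than the notebook's own code: the frames `toy_train` and `toy_valid` and their columns are made up, and the fill value is assumed to be learned from the training rows only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the housing dataset)
toy_train = pd.DataFrame({"area": [50.0, 70.0, np.nan, 80.0],
                          "rooms": [2, 3, 3, 4]})
toy_valid = pd.DataFrame({"area": [np.nan, 60.0],
                          "rooms": [2, 3]})

# Strategy 1: drop every column that has a missing value in the training data
missing_cols = [c for c in toy_train.columns if toy_train[c].isnull().any()]
dropped_train = toy_train.drop(columns=missing_cols)
dropped_valid = toy_valid.drop(columns=missing_cols)

# Strategy 2: fill with a statistic learned from the training data only
train_means = toy_train[missing_cols].mean()
filled_train = toy_train.fillna(train_means)
filled_valid = toy_valid.fillna(train_means)

# Strategy 3: record where values were missing, then fill as in strategy 2
flagged_train = toy_train.copy()
flagged_valid = toy_valid.copy()
for c in missing_cols:
    flagged_train[c + "_was_missing"] = toy_train[c].isnull().astype(int)
    flagged_valid[c + "_was_missing"] = toy_valid[c].isnull().astype(int)
flagged_train = flagged_train.fillna(train_means)
flagged_valid = flagged_valid.fillna(train_means)
```

Each strategy applies the same decision to the training and validation frames, so both end up with identical columns.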

In [4]:
# Count the null values per column and keep only the columns that have any
df = X.isnull().sum()
df[df>0]

LotFrontage    259
MasVnrArea       8
GarageYrBlt     81
dtype: int64
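
Raw counts do not show how much of each column is missing, and that share is what informs the choice between dropping and imputing. A small sketch, assuming a hypothetical frame `toy` (on the real data the same calls would be made on `X`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for X
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                    "b": [1.0, 2.0, 3.0, 4.0],
                    "c": [np.nan, 2.0, 3.0, 4.0]})

# isnull().mean() gives the fraction of missing rows per column
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

For the counts above, `LotFrontage` is missing in 259 of 1460 rows, roughly 18%, so none of the three columns is mostly empty.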

In [5]:
# Build an initial model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
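    # 'absolute_error' is the name used since scikit-learn 1.0 for the criterion formerly called 'mae'
    model = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)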
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

## Handling missing values

In [6]:
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop the columns that contain missing values
reduced_X_train = X_train.drop(cols_with_missing,axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing,axis=1)

print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE (Drop columns with missing values):
17895.090633561642


In [7]:
from sklearn.impute import SimpleImputer

# Impute missing values with the column mean (here this scores worse than simply dropping the columns)
myImputer = SimpleImputer(strategy='mean') 
imputed_X_train = pd.DataFrame(myImputer.fit_transform(X_train),columns=X_train.columns)
imputed_X_valid = pd.DataFrame(myImputer.transform(X_valid),columns=X_valid.columns)

print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE (Imputation):
18125.367020547947


In [20]:
# Impute missing values and flag which entries were missing (this approach did not help here)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull() # flag the missing entries
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation; the boolean indicator columns are converted to 0/1 automatically
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
18403.17455479452
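
scikit-learn's `SimpleImputer` can also append the missing-value indicators itself through its `add_indicator` parameter, which avoids the manual loop. A minimal sketch on made-up data (the frames `toy_train` and `toy_valid` are assumptions, not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data
toy_train = pd.DataFrame({"area": [50.0, 70.0, np.nan, 80.0],
                          "rooms": [2.0, 3.0, 3.0, 4.0]})
toy_valid = pd.DataFrame({"area": [np.nan, 60.0],
                          "rooms": [2.0, 3.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_out = imputer.fit_transform(toy_train)
valid_out = imputer.transform(toy_valid)

# Rebuild column names: original features first, indicator columns after
missing_cols = toy_train.columns[imputer.indicator_.features_]
out_cols = list(toy_train.columns) + [c + "_was_missing" for c in missing_cols]
train_plus = pd.DataFrame(train_out, columns=out_cols)
valid_plus = pd.DataFrame(valid_out, columns=out_cols)
```

The indicator columns are decided at fit time, so a column that is complete in the training data gets no indicator even if it has gaps in the validation data.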


In [21]:
cols_with_missing

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [9]:
# Try other fill strategies tailored to each column with missing values
fill_LotFrontage_df = X_train.groupby('MSSubClass').LotFrontage.mean().reset_index().rename(columns={'LotFrontage':'fill_LotFrontage'})

def get_final_data(X_train,fill_LotFrontage_df):
    # Fill LotFrontage with the training-set mean of the same MSSubClass
    # (the merged helper column fill_LotFrontage stays in the returned frame)
    final_X_train = X_train.merge(fill_LotFrontage_df,on='MSSubClass',how='left')
    final_X_train.loc[final_X_train.LotFrontage.isnull(),'LotFrontage'] = final_X_train.fill_LotFrontage
    # Fill MasVnrArea with the mean of whichever frame is passed in
    final_X_train['MasVnrArea'] =  final_X_train['MasVnrArea'].fillna(final_X_train['MasVnrArea'].mean())
    # Fill GarageYrBlt with the constant 2011
    final_X_train['GarageYrBlt'] =  final_X_train['GarageYrBlt'].fillna(2011)
    return final_X_train
final_X_train = get_final_data(X_train,fill_LotFrontage_df)
final_X_valid = get_final_data(X_valid,fill_LotFrontage_df)

print("MAE (own):")
print(score_dataset(final_X_train, final_X_valid, y_train, y_valid))

MAE (own):
18075.00859589041
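
A variant of the group-wise fill could learn every fill value from the training data and fall back to the overall training mean when a group is unseen or has no observed values. The following is a sketch under assumptions, not the notebook's code: the helper name `fill_by_group` and the toy frames are hypothetical, and it uses `Series.map` instead of a merge so that no helper column ends up among the features.

```python
import numpy as np
import pandas as pd

def fill_by_group(train, other, group_col, value_col):
    """Fill NaNs in other[value_col] with per-group means learned from train."""
    group_means = train.groupby(group_col)[value_col].mean()
    overall_mean = train[value_col].mean()
    # Map each row's group to its training mean; unseen groups give NaN,
    # which is then replaced by the overall training mean
    fallback = other[group_col].map(group_means).fillna(overall_mean)
    filled = other.copy()
    filled[value_col] = filled[value_col].fillna(fallback)
    return filled

# Hypothetical toy data reusing two column names from the notebook
toy_train = pd.DataFrame({"MSSubClass": [20, 20, 60, 60],
                          "LotFrontage": [60.0, 80.0, 100.0, np.nan]})
toy_valid = pd.DataFrame({"MSSubClass": [20, 60, 90],
                          "LotFrontage": [np.nan, np.nan, np.nan]})

filled_train = fill_by_group(toy_train, toy_train, "MSSubClass", "LotFrontage")
filled_valid = fill_by_group(toy_train, toy_valid, "MSSubClass", "LotFrontage")
print(filled_valid)
```

Here class 90 never appears in the training rows, so its gap is filled with the overall training mean instead of staying empty.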
