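This is almost my first Kaggle competition, so I have written out the reasoning behind each step in some detail to make it easy to revisit later.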


In [None]:
# Default boilerplate from the Kaggle notebook template
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



In [None]:
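# ① First, import the libraries.
# numpy: array handling and vectorized operations.
# pandas: loading data, handling it as tables, and simple visualization.
# seaborn: visualization.
# matplotlib: visualization.
# scipy: more advanced computations than NumPy provides.
# sklearn: modeling.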

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats
from scipy.stats import norm, skew #for some statistics

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

%matplotlib inline

In [None]:
# ② Load the dataset.
# Read the training data and the test data.
train_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
test_df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/test.csv")

In [None]:
# ②' Show the first 5 rows of the training data.
train_df.head(5)

In [None]:
# ③ Inspect the data.

# 1. Which variables (features) are there?
train_df.columns.values

In [None]:
# 2. What are the data types?
train_df.info()

In [None]:

# 3. Are there missing values?
# With the default setting (usually around 60 rows) the middle of the output is truncated, so raise the limit to 500 rows.
pd.set_option('display.max_rows', 500)
train_df.isnull().sum()
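
With roughly 80 columns, the raw `isnull().sum()` output is long. A minimal sketch of a more compact view, assuming a made-up toy frame (`toy_df` and its values are hypothetical, standing in for `train_df`): keep only the columns that actually have missing values and show both the count and the ratio.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Number and share of missing values per column.
summary = pd.DataFrame({
    "count": toy_df.isnull().sum(),
    "ratio": toy_df.isnull().mean(),
})

# Keep only columns with at least one missing value, most affected first.
missing_summary = summary[summary["count"] > 0].sort_values("count", ascending=False)
print(missing_summary)
```

The same three lines should work on `train_df` unchanged, since they only rely on `isnull()`.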

In [None]:
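# 4. Are there repeated values in the categorical columns?
# 'O' means the object dtype (strings and categorical data); describe() then reports count, unique, top, and freq for each such column.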
train_df.describe(include=['O'])

In [None]:
# 5. Are the variables correlated?
# Keep only the numeric columns.
numeric_df = train_df.select_dtypes(include=[np.number])
# Compute the correlation matrix with .corr() and draw it as a heatmap.
sns.heatmap(numeric_df.corr(), square=True)

In [None]:
# Rank the correlations with SalePrice in detail.
corr = numeric_df.corr()
corr.sort_values('SalePrice', ascending=False)
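
Sorting the whole matrix still prints every column. If only the relationship to the target matters, a single column of the matrix is enough. The sketch below uses a made-up toy frame (`toy_df` is hypothetical) and ranks by absolute value so that strong negative correlations are not pushed to the bottom.

```python
import pandas as pd

# Toy numeric frame standing in for numeric_df (values are made up).
toy_df = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "GrLivArea": [900, 1200, 1500, 1400, 2000],
    "YrSold": [2010, 2008, 2009, 2006, 2007],
    "SalePrice": [100000, 150000, 200000, 210000, 320000],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy_df.corr()["SalePrice"].drop("SalePrice")

# Rank by strength of the relationship, ignoring the sign.
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```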

In [None]:
# Draw scatter plots of SalePrice against the variables with the highest correlations.
pals = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF']
for pal in pals:
    train_df.plot.scatter(x=pal, y='SalePrice')

A positive correlation with SalePrice was confirmed for GrLivArea, GarageArea, and TotalBsmtSF.

In [None]:
# ④ Impute missing values.
# Concatenate the training and test data so that missing values can be imputed in one pass.
combined_df = pd.concat((train_df, test_df))
combined_df.head(5)

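What does a missing value mean?

First, understand the three types of missingness.
MCAR (Missing Completely At Random):
The missingness is entirely random (for example, a value happened to be dropped by a data collection error).

MAR (Missing At Random):
The missingness depends on other observed variables (for example, people with a low value on one test are less often given a follow-up test, so the follow-up value is missing).

MNAR (Missing Not At Random):
The missingness depends on the value of the variable itself (for example, high earners not reporting their income).

On top of that, the meaning of a missing value has to be judged with domain knowledge.

This requires understanding what each column means, so check data_description.txt. If the reason a value is missing can be explained logically (setting aside what actually happened), it can be filled with None or 0.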
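
One rough way to probe which of the three mechanisms might apply is to compare another observed variable between rows where a column is missing and rows where it is present. The sketch below is only an illustration on made-up data (`toy_df` and its values are hypothetical). A clear difference hints at MAR, but a comparison like this cannot distinguish MNAR, because the unobserved values themselves are not available.

```python
import numpy as np
import pandas as pd

# Toy data: LotFrontage is missing only for the larger lots (made-up values).
toy_df = pd.DataFrame({
    "LotArea": [8000, 9000, 9500, 15000, 16000, 20000],
    "LotFrontage": [60.0, 65.0, 70.0, np.nan, np.nan, np.nan],
})

# Label each row by whether LotFrontage is missing.
status = toy_df["LotFrontage"].isnull().map({True: "missing", False: "observed"})

# Compare an observed variable between the two groups.
mean_by_missing = toy_df.groupby(status)["LotArea"].mean()
print(mean_by_missing)
```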

⭐️ Examples where a missing value carries meaning

MiscFeature: Miscellaneous feature not covered in other categories
		
       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None  → none of these features is present

       
Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access  → there is no alley

Fence: Fence quality
		
       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence  → there is no fence
       
FireplaceQu: Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace  → there is no fireplace


In [None]:
# Where a missing value carries meaning, fill with "None" or 0.
combined_df["PoolQC"] = combined_df["PoolQC"].fillna("None")
combined_df["MiscFeature"] = combined_df["MiscFeature"].fillna("None")
combined_df["Alley"] = combined_df["Alley"].fillna("None")
combined_df["Fence"] = combined_df["Fence"].fillna("None")
combined_df["FireplaceQu"] = combined_df["FireplaceQu"].fillna("None")

for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    combined_df[col] = combined_df[col].fillna('None')

# Same for the basement and masonry veneer columns.
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    combined_df[col] = combined_df[col].fillna('None')

combined_df["MasVnrType"] = combined_df["MasVnrType"].fillna("None")

In [None]:
# Fill the numeric basement, masonry veneer, and garage columns with 0.
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    combined_df[col] = combined_df[col].fillna(0)

combined_df["MasVnrArea"] = combined_df["MasVnrArea"].fillna(0)

combined_df["GarageYrBlt"] = combined_df["GarageYrBlt"].fillna(0)

for col in ('GarageArea', 'GarageCars'):
    combined_df[col] = combined_df[col].fillna(0)
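
The fills above repeat the same pattern column by column. As a possible alternative, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values, so the whole rule set can be written in one place. This is a sketch on a made-up toy frame (`toy_df`, `none_cols`, and `zero_cols` are hypothetical names), not a replacement the notebook depends on.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric "absence" column each.
toy_df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd"],
    "GarageType": ["Attchd", np.nan],
    "GarageArea": [548.0, np.nan],
})

none_cols = ["PoolQC", "GarageType"]  # missing means "feature absent"
zero_cols = ["GarageArea"]            # missing means "size is zero"

# One dictionary describing the fill value for each column.
fill_values = {**{c: "None" for c in none_cols}, **{c: 0 for c in zero_cols}}
toy_df = toy_df.fillna(value=fill_values)
print(toy_df)
```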

⭐️ Examples where a missing value carries no meaning
MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ	Typical Functionality
       Min1	Minor Deductions 1
       Min2	Minor Deductions 2
       Mod	Moderate Deductions
       Maj1	Major Deductions 1
       Maj2	Major Deductions 2
       Sev	Severely Damaged
       Sal	Salvage only
       
Electrical: Electrical system

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed
       
KitchenQual: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       
Exterior1st: Exterior covering on house

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast	
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles
       
Exterior2nd: Exterior covering on house (if more than one material)

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles
	
MasVnrType: Masonry veneer type

       BrkCmn	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       None	None
       Stone	Stone

SaleType: Type of sale
		
       WD 	Warranty Deed - Conventional
       CWD	Warranty Deed - Cash
       VWD	Warranty Deed - VA Loan
       New	Home just constructed and sold
       COD	Court Officer Deed/Estate
       Con	Contract 15% Down payment regular terms
       ConLw	Contract Low Down payment and low interest
       ConLI	Contract Low Interest
       ConLD	Contract Low Down
       Oth	Other

In [None]:
combined_df['MSZoning'] = combined_df['MSZoning'].fillna(combined_df['MSZoning'].mode()[0])
combined_df["Functional"] = combined_df["Functional"].fillna("Typ")
combined_df['Electrical'] = combined_df['Electrical'].fillna(combined_df['Electrical'].mode()[0])
combined_df['KitchenQual'] = combined_df['KitchenQual'].fillna(combined_df['KitchenQual'].mode()[0])
combined_df['Exterior1st'] = combined_df['Exterior1st'].fillna(combined_df['Exterior1st'].mode()[0])
combined_df['Exterior2nd'] = combined_df['Exterior2nd'].fillna(combined_df['Exterior2nd'].mode()[0])
combined_df['SaleType'] = combined_df['SaleType'].fillna(combined_df['SaleType'].mode()[0])

Columns where almost every row has the same value carry little information and also need handling.

In [None]:
combined_df['Utilities'].value_counts()

# AllPub    2916
# NoSeWa       1
# Name: Utilities, dtype: int64

In [None]:
combined_df = combined_df.drop(['Utilities'], axis=1)
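
Utilities was spotted by eye, but the same check can be run over every column. The sketch below computes the share of the most frequent value per column on a made-up toy frame; `toy_df` and the 0.85 cutoff are assumptions chosen for illustration, not values taken from this notebook.

```python
import pandas as pd

# Toy frame: Utilities is almost constant, Street is evenly split (made-up values).
toy_df = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSeWa"],
    "Street": ["Pave"] * 5 + ["Grvl"] * 5,
})

# Share of rows taken by the most frequent value in each column.
top_share = toy_df.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Hypothetical cutoff: flag columns dominated by a single value.
near_constant = top_share[top_share > 0.85].index.tolist()
print(top_share)
print(near_constant)
```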

In [None]:
# Check whether any missing values remain.
combined_df.isnull().sum().sort_values(ascending=False)

In [None]:
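# LotFrontage still has missing values, so check what it means.
# LotFrontage: Linear feet of street connected to property
# It depends on whether the area is urban or rural, and on how large the house is within that area.
# So fill the missing values with the median of each neighborhood.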

median_by_neighborhood = combined_df.groupby('Neighborhood')['LotFrontage'].median()

def fill_lot_frontage(row):
    if pd.isnull(row['LotFrontage']):
        return median_by_neighborhood[row['Neighborhood']]
    else:
        return row['LotFrontage']

combined_df['LotFrontage'] = combined_df.apply(fill_lot_frontage, axis=1)
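
The row-wise `apply` above works, but the same neighborhood-median fill can be written without a Python-level loop over rows. A minimal sketch on made-up data (`toy_df` is hypothetical): `groupby(...).transform("median")` returns a Series aligned with the original rows, which `fillna` can then use directly.

```python
import numpy as np
import pandas as pd

# Toy data with one missing LotFrontage per neighborhood (made-up values).
toy_df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 40.0, np.nan, 50.0],
})

# Median of each row's own neighborhood, aligned to the original index.
neighborhood_median = toy_df.groupby("Neighborhood")["LotFrontage"].transform("median")

# Only the missing entries are replaced; observed values are left as they are.
toy_df["LotFrontage"] = toy_df["LotFrontage"].fillna(neighborhood_median)
print(toy_df)
```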



In [None]:
# Check whether any missing values remain.
combined_df.isnull().sum().sort_values(ascending=False)

Confirmed that no missing values remain (apart from SalePrice, which is absent for the test rows by design)!

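Next, convert the categorical data to numbers.
There are two main techniques.
label encoding → categorical data with an order
one-hot encoding → categorical data without an order

Check data_description.txt for the categorical variables that have an order.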

LotShape: General shape of property

       Reg	Regular	
       IR1	Slightly irregular
       IR2	Moderately Irregular
       IR3	Irregular

LandSlope: Slope of property
		
       Gtl	Gentle slope
       Mod	Moderate Slope	
       Sev	Severe Slope

HouseStyle: Style of dwelling
	
       1Story	One story
       1.5Fin	One and one-half story: 2nd level finished
       1.5Unf	One and one-half story: 2nd level unfinished
       2Story	Two story
       2.5Fin	Two and one-half story: 2nd level finished
       2.5Unf	Two and one-half story: 2nd level unfinished
       SFoyer	Split Foyer
       SLvl	Split Level

OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor
	
OverallCond: Rates the overall condition of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average	
       5	Average
       4	Below Average	
       3	Fair
       2	Poor
       1	Very Poor

ExterQual: Evaluates the quality of the material on the exterior 
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
		
ExterCond: Evaluates the present condition of the material on the exterior
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

BsmtQual: Evaluates the height of the basement

       Ex	Excellent (100+ inches)	
       Gd	Good (90-99 inches)
       TA	Typical (80-89 inches)
       Fa	Fair (70-79 inches)
       Po	Poor (<70 inches
       NA	No Basement

BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement

BsmtExposure: Refers to walkout or garden level walls

       Gd	Good Exposure
       Av	Average Exposure (split levels or foyers typically score average or above)	
       Mn	Mimimum Exposure
       No	No Exposure
       NA	No Basement
	
BsmtFinType1: Rating of basement finished area

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinshed
       NA	No Basement

BsmtFinType2: Rating of basement finished area (if multiple types)

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinshed
       NA	No Basement

HeatingQC: Heating quality and condition

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

KitchenQual: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

FireplaceQu: Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace

GarageQual: Garage quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage

GarageCond: Garage condition

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage

PoolQC: Pool quality
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       NA	No Pool
		
Fence: Fence quality
		
       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence
	
MiscFeature: Miscellaneous feature not covered in other categories
		
       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None
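
Before encoding the real columns, the two techniques can be sketched on a toy frame. The data below is made up, and `Foundation` is used only as an example of an unordered column; `quality_order` is a hypothetical mapping written by hand from the Ex/Gd/TA/Fa/Po scale listed above.

```python
import pandas as pd

# Toy frame: one ordered quality column and one unordered column (made-up values).
toy_df = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", "Fa"],
    "Foundation": ["PConc", "CBlock", "PConc", "BrkTil"],
})

# Ordered category: map each level to an integer that respects the order.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
toy_df["ExterQual"] = toy_df["ExterQual"].map(quality_order)

# Unordered category: one indicator column per level.
encoded = pd.get_dummies(toy_df, columns=["Foundation"])
print(encoded)
```

The ordered column stays a single numeric column, while the unordered one expands into as many columns as it has levels.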

In [None]:
# Label-encode the ordered variables.
# Note: LabelEncoder numbers the labels in sorted order, not in quality order.
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')

for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(combined_df[c].values)) 
    combined_df[c] = lbl.transform(list(combined_df[c].values))
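
`LabelEncoder` assigns integers according to the sorted order of the labels, so for string scales the codes follow the alphabet rather than the quality ranking. Tree models such as LightGBM can often cope with that, but if the order should be preserved, one option is an ordered `pd.Categorical`. The sketch below compares both on made-up data; `ordered_levels` is a hand-written assumption based on the scale in data_description.txt.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One value per level of a quality scale, including the filled-in "None".
quality = pd.Series(["Po", "Fa", "TA", "Gd", "Ex", "None"])

# LabelEncoder: codes follow the sorted labels (Ex, Fa, Gd, None, Po, TA).
lbl = LabelEncoder()
alphabetical_codes = lbl.fit_transform(quality)

# Ordered categorical: codes follow the order written out explicitly.
ordered_levels = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
ordered_codes = pd.Categorical(quality, categories=ordered_levels, ordered=True).codes

print(list(lbl.classes_))
print(list(alphabetical_codes))
print(list(ordered_codes))
```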

In [None]:
combined_df[['FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold']].head(10)

In [None]:
# One-hot encode the unordered categorical variables.

combined_df = pd.get_dummies(combined_df, drop_first=True)
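
Encoding the combined frame, rather than the training and test data separately, keeps the resulting columns identical on both sides. The toy sketch below (made-up frames `toy_train` and `toy_test`) shows how separate encoding can produce mismatched columns when a level appears in only one of the two sets.

```python
import pandas as pd

# Each toy set contains a SaleType level that the other one lacks.
toy_train = pd.DataFrame({"SaleType": ["WD", "New", "COD"]})
toy_test = pd.DataFrame({"SaleType": ["WD", "Oth"]})

# Encoding separately: the column sets differ.
separate_train = pd.get_dummies(toy_train)
separate_test = pd.get_dummies(toy_test)
print(list(separate_train.columns), list(separate_test.columns))

# Encoding after concatenation: both parts share the same columns.
combined = pd.concat([toy_train, toy_test], keys=["train", "test"])
combined_encoded = pd.get_dummies(combined)
train_encoded = combined_encoded.loc["train"]
test_encoded = combined_encoded.loc["test"]
print(list(train_encoded.columns))
```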

In [None]:
# Extract the categorical columns of combined_df.
categorical_columns = combined_df.select_dtypes(include=['object', 'category']).columns
print(categorical_columns)

No categorical columns remain!

In [None]:
# Build the training and test data.

# The first len(train_df) rows, matching the original training data, become the training data.
train_df = combined_df[:len(train_df)]

# The rest is the test data. Drop SalePrice, which the test data should not have.
test_df = combined_df[len(train_df):].drop(columns=['SalePrice'])

# X_train is train_df without SalePrice.
X_train = train_df.drop("SalePrice", axis=1)

# y_train is the SalePrice column of train_df.
y_train = train_df["SalePrice"]

# X_test is test_df.
X_test  = test_df

print(X_train.shape, y_train.shape, X_test.shape)
# (1460, 201) (1460,) (1459, 201)

In [None]:
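# Submissions are scored with Root-Mean-Squared-Error (RMSE), so prepare the same metric locally.
# RMSE measures how far the predictions are from the true values on average, regardless of whether the error is positive or negative.
# (The competition computes it on the logarithm of the prices; this notebook uses the raw prices as a local check.)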

from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def run_cv(model):
    cv = KFold(n_splits=3, random_state=42, shuffle=True)
    rmse_results = []
    models = []
    for trn_index, val_index in cv.split(X_train):
        # cv.split returns positional indices, so select rows with .iloc
        X_trn, X_val = X_train.iloc[trn_index], X_train.iloc[val_index]
        y_trn, y_val = y_train.iloc[trn_index], y_train.iloc[val_index]
        # Train a fresh copy per fold so that models holds independent fits
        fold_model = clone(model)
        fold_model.fit(X_trn, y_trn)
        pred = fold_model.predict(X_val)
        # Compute the validation score
        rmse = np.sqrt(mean_squared_error(y_val, pred))
        print("RMSE:", rmse)
        rmse_results.append(rmse)
        models.append(fold_model)

    print(rmse_results)
    print("Average:", np.mean(rmse_results))
    return models
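
To make the metric concrete, RMSE can be computed by hand on a few made-up numbers. The sketch below also shows the log-scale variant, which penalizes relative rather than absolute errors; `rmse` is a small helper defined here for illustration and is not part of scikit-learn.

```python
import numpy as np

# Made-up true and predicted prices.
y_true = np.array([100000.0, 200000.0, 300000.0])
y_pred = np.array([110000.0, 190000.0, 300000.0])

def rmse(actual, predicted):
    """Square the errors, average them, then take the square root."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Errors of +10000 and -10000 contribute equally because they are squared.
raw_rmse = rmse(y_true, y_pred)

# On the log scale, a 10% miss on a cheap house costs about as much as on an expensive one.
log_rmse = rmse(np.log(y_true), np.log(y_pred))

print(raw_rmse, log_rmse)
```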

In [None]:
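# Implement LightGBM.
# Build the model as a regression problem.
# Use RMSE as the evaluation metric.
# shuffle=True and random_state=42 shuffle the data reproducibly before splitting.
# Create a zero-filled array to accumulate the predictions for the test data.
# Cross-validation.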
from sklearn.model_selection import KFold
import lightgbm as lgb

lgb_params = {
    "objective":"regression",
    "metric": "rmse" 
}
cv = KFold(n_splits=3, random_state=42, shuffle=True)
rmse_results = []
lgbm_models = []

test_preds = np.zeros(len(X_test))

for trn_index, val_index in cv.split(X_train, y_train):
    X_trn, X_val = X_train.loc[trn_index], X_train.loc[val_index]
    y_trn, y_val = y_train[trn_index], y_train[val_index]
    
    train_lgb = lgb.Dataset(X_trn, y_trn)
    validation_lgb = lgb.Dataset(X_val, y_val)
    model = lgb.train(
        lgb_params, train_lgb,
        num_boost_round=1000,  # boost for at most 1000 rounds
        valid_sets=[train_lgb, validation_lgb],  # evaluate on both the training and validation data
        callbacks=[lgb.log_evaluation(period=10),  # print the evaluation results every 10 rounds
                   lgb.early_stopping(stopping_rounds=100)]  # stop early after 100 rounds without improvement
    )
    pred = model.predict(X_val)  # predict the validation data with the trained model
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    print("RMSE:", rmse)
    rmse_results.append(rmse)
    lgbm_models.append(model)

    # Add this fold's predictions for X_test, divided by the number of folds so that the sum is an average
    test_preds += model.predict(X_test) / cv.n_splits

print(rmse_results)
print("Average:", np.mean(rmse_results))
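
The fold-averaging pattern used for `test_preds` can be seen in isolation on synthetic data. In the sketch below, scikit-learn's `LinearRegression` is swapped in for LightGBM purely to keep the example small, and `X_toy`, `y_toy`, and `X_new` are made-up arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic, noiseless linear data: y = 3*x0 - 2*x1 + 1.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(30, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.0
X_new = np.array([[1.0, 1.0], [0.0, 0.0]])  # rows to predict, like X_test

cv = KFold(n_splits=3, shuffle=True, random_state=42)
avg_preds = np.zeros(len(X_new))

for trn_idx, val_idx in cv.split(X_toy):
    # One model per fold, trained on that fold's training part only.
    fold_model = LinearRegression().fit(X_toy[trn_idx], y_toy[trn_idx])
    # Each fold contributes 1/n_splits of its prediction to the average.
    avg_preds += fold_model.predict(X_new) / cv.n_splits

print(avg_preds)
```

Averaging over folds tends to reduce the variance of the final prediction compared with relying on a single fold's model.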

In [None]:
# Overwrite the prediction column of the sample submission file.
submission = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission['SalePrice']  = test_preds

# Write the submission file.
submission.to_csv("submission.csv", index=False)