## ProcessDataForSubmittingTrainV8XGB
Preprocessing code for V8, in which levelplan has been converted to dummy variables.

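### 0 Output data specification
#### Labeled data: training set
- Target variable: Y_train.csv (with index and column headers)
- Explanatory variables: X_train.csv (with index and column headers)

#### Labeled data: validation set
- Target variable: Y_eval.csv (with index and column headers)
- Explanatory variables: X_eval.csv (with index and column headers)

#### Labeled data: metadata ⇒ initially thought to be necessary, but dropped because the same columns are already in X_eval and Y_eval
- train_meta.csv: holds id, pj_no, and area
- eval_meta.csv: same as above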

### 2 Local processing
- For X_train and X_eval, drop the id and pj_no columns before using them for training and evaluation
- For Y_train and Y_eval, use the tanka_pr column
#### Labeled data: training set

In [11]:
# Common setup
# Create x_train, y_train, x_eval, y_eval
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [12]:
train_x = pd.read_csv("data/processed_train_goto_x_v8.csv")
train_y = pd.read_csv("data/processed_train_goto_y_v8.csv")

In [13]:
# Split the data and write it out
X_train, X_eval, Y_train, Y_eval = train_test_split( train_x, train_y, train_size=0.8, random_state = 19711022)

In [14]:
X_train.to_csv("data/X_train.csv", index=False)
X_eval.to_csv("data/X_eval.csv", index=False)
Y_train.to_csv("data/Y_train.csv", index=False)
Y_eval.to_csv("data/Y_eval.csv", index=False)
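
Because `train_x` and `train_y` are passed to `train_test_split` together, each X file and its matching Y file should keep the same row order. The sketch below is a small check of that property on made-up data. It assumes both frames carry an `id` column, `same_row_order` is a hypothetical helper name, and a NumPy permutation stands in for `train_test_split` so that the example runs without scikit-learn.

```python
import numpy as np
import pandas as pd

def same_row_order(x: pd.DataFrame, y: pd.DataFrame, key: str = "id") -> bool:
    """True if x and y hold the same key values in the same row order."""
    return len(x) == len(y) and bool((x[key].to_numpy() == y[key].to_numpy()).all())

# Made-up data standing in for the processed train files (assumption)
x = pd.DataFrame({"id": range(10), "pj_no": range(100, 110), "feature": np.arange(10) * 1.5})
y = pd.DataFrame({"id": range(10), "keiyaku_pr": np.arange(10) * 1000})

# Stand-in for train_test_split: one shuffled index applied to both frames
rng = np.random.default_rng(19711022)
order = rng.permutation(len(x))
cut = int(len(x) * 0.8)
x_tr, x_ev = x.iloc[order[:cut]], x.iloc[order[cut:]]
y_tr, y_ev = y.iloc[order[:cut]], y.iloc[order[cut:]]

aligned = same_row_order(x_tr, y_tr) and same_row_order(x_ev, y_ev)
print(aligned, len(x_tr), len(x_ev))
```

The same helper could be applied to the frames read back from `data/X_train.csv` and `data/Y_train.csv` before training.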

### Local processing: load the data, train, and compute the score against the validation data

In [15]:
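def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray):
    """Return the mean absolute percentage error (in percent) of y_pred against y_true."""
    diff = 0
    n = len(y_true)
    for i in range(n):
        # Relative error of one sample; assumes y_true[i] is positive
        diff += abs(y_true[i] - y_pred[i]) / y_true[i]
    score = 100 * diff / n

    return score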
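
The loop above can also be written as a single vectorized NumPy expression. The sketch below is one possible variant, not the function that produced the scores in this notebook, and the name `mape_vectorized` is hypothetical. It flattens both inputs first: `DataFrame.values` has shape `(n, 1)` while `predict` returns shape `(n,)`, and subtracting those directly would broadcast to an `(n, n)` matrix. It assumes every true value is positive.

```python
import numpy as np

def mape_vectorized(y_true, y_pred) -> float:
    """Mean absolute percentage error in percent, returned as a plain float."""
    # Flatten so that (n, 1) and (n,) inputs are compared element by element
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(100.0 * np.mean(np.abs(y_true - y_pred) / y_true))

# Made-up values: both samples are off by 10 percent
y_true = np.array([[100.0], [200.0]])   # shape (2, 1), like ans_y.values
y_pred = np.array([110.0, 180.0])       # shape (2,), like model.predict(...)
score = mape_vectorized(y_true, y_pred)
print(score)
```

Unlike the loop, this returns a scalar rather than a one-element array.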

In [16]:
train_x = pd.read_csv('data/X_train.csv').drop(['id','pj_no'],axis=1)
train_y = pd.read_csv('data/Y_train.csv').drop(['id'],axis=1)

In [None]:
import time
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=400, max_depth=200, min_samples_split=2,n_jobs=-1)
start = time.perf_counter()
model.fit(train_x.values, train_y.values.ravel() )
end = time.perf_counter()
print(end-start)
eval_x = pd.read_csv('data/X_eval.csv').drop(['id','pj_no'],axis=1)
ans_y = pd.read_csv('data/Y_eval.csv').drop(['id'],axis=1)
pred_y = model.predict(eval_x.values)
print( mean_absolute_percentage_error(ans_y.values,pred_y))

#### Trying XGBoost

In [None]:
import xgboost as xgb
from xgboost import XGBRegressor
import time

#params = {
#    'n_estimators':700,
#    'max_depth':6,
#    'min_child_weight':9,
#    'gamma':0,
#    'subsample':1.0,
#    'colsample_bytree':0.6,
#    'learning_rate':0.1
#}

params = {
    'n_estimators':500,
    'max_depth':3,
    'min_child_weight':2,
    'gamma':0.0,
    'subsample':0.9,
    'colsample_bytree':0.8,
    'learning_rate':0.1
}

In [None]:
xgboost_opt = XGBRegressor(**params, seed=42, n_jobs=-1)
start = time.perf_counter()
xgboost_opt.fit(train_x, train_y)
end = time.perf_counter()
print(end-start)

In [None]:
eval_x = pd.read_csv('data/X_eval.csv').drop(['id','pj_no'],axis=1)
ans_y = pd.read_csv('data/Y_eval.csv').drop(['id'],axis=1)
pred_y = xgboost_opt.predict(eval_x)
print( mean_absolute_percentage_error(ans_y.values,pred_y))

In [17]:
import xgboost as xgb
from xgboost import XGBRegressor
import time
params = {
    'n_estimators':700,
    'max_depth':6,
    'min_child_weight':9,
    'gamma':0,
    'subsample':1.0,
    'colsample_bytree':0.6,
    'learning_rate':0.1
}

In [18]:
xgboost_opt = XGBRegressor(**params, seed=42, n_jobs=-1)
start = time.perf_counter()
xgboost_opt.fit(train_x, train_y)
end = time.perf_counter()
print(end-start)

17.778532050999956


In [19]:
eval_x = pd.read_csv('data/X_eval.csv').drop(['id','pj_no'],axis=1)
ans_y = pd.read_csv('data/Y_eval.csv').drop(['id'],axis=1)
pred_y = xgboost_opt.predict(eval_x)
print( mean_absolute_percentage_error(ans_y.values,pred_y))

[8.5853159]


In [20]:
eval_x = pd.read_csv('data/X_eval.csv')

In [21]:
eval_x['pred_keiyaku_pr']=pd.Series(pred_y)

In [22]:
df =pd.merge(eval_x, pd.read_csv('data/Y_eval.csv'), on='id')

In [23]:
df['diff']=(abs(df['pred_keiyaku_pr']-df['keiyaku_pr'])/df['keiyaku_pr'])*100

In [24]:
df.to_csv('data/difference.csv')
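
Once the per-row percentage error is available, sorting by it is a quick way to find the rows the model handles worst. The sketch below uses made-up values with the same column names as the cells above; `worst_rows` is a hypothetical helper name. It joins predictions to answers on `id` rather than by position, so it stays correct even if the two tables are ordered differently.

```python
import pandas as pd

def worst_rows(pred: pd.DataFrame, ans: pd.DataFrame, top: int = 3) -> pd.DataFrame:
    """Return the `top` rows with the largest absolute percentage error."""
    df = pd.merge(pred, ans, on="id")
    df["diff"] = (df["pred_keiyaku_pr"] - df["keiyaku_pr"]).abs() / df["keiyaku_pr"] * 100
    return df.sort_values("diff", ascending=False).head(top)

# Made-up predictions and answers; note the answers are in a different row order
pred = pd.DataFrame({"id": [1, 2, 3, 4], "pred_keiyaku_pr": [110.0, 150.0, 300.0, 396.0]})
ans = pd.DataFrame({"id": [4, 3, 2, 1], "keiyaku_pr": [400.0, 300.0, 200.0, 100.0]})

worst = worst_rows(pred, ans, top=2)
print(worst)
```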

### Decided to submit the n_estimators=700 case (7/7)

In [None]:
test_x = pd.read_csv("data/processed_test_goto_x.csv")
test_pred = xgboost_opt.predict(test_x.drop(['id','pj_no'],axis=1))
submit = pd.DataFrame(test_x[['id']])
submit['keiyaku_pr']=pd.Series(test_pred).astype(np.int64)
submit.to_csv('data/submit4.tsv',sep='\t',header=None, index=False)

### From here on: preparing the data for SageMaker

In [None]:
train_x = pd.read_csv('data/X_train.csv')
train_y = pd.read_csv('data/Y_train.csv')

In [None]:
train_input = pd.concat([train_y.drop(['id','keiyaku_pr','tc_mseki'],axis=1),train_x.drop(['id','pj_no'],axis=1)],axis=1)
train_input.to_csv('data/sagemaker_input.csv', header=None, index=False)
eval_x = pd.read_csv('data/X_eval.csv')
eval_x.drop(['id','pj_no'],axis=1).to_csv('data/sagemaker_eval_input.csv',header=None, index=False)
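
The training file written above follows the CSV layout that SageMaker's built-in XGBoost algorithm expects: no header row, with the target in the first column. The evaluation file holds features only. The sketch below reproduces that layout on made-up data so the column order can be checked before uploading. `feature_a` and `feature_b` are placeholder column names, and the Y columns are assumed to be the ones dropped in the cell above plus `tanka_pr`.

```python
import io
import pandas as pd

# Made-up frames with the column names used in this notebook (values are assumptions)
train_x = pd.DataFrame({"id": [1, 2], "pj_no": [10, 20],
                        "feature_a": [0.5, 1.5], "feature_b": [3, 4]})
train_y = pd.DataFrame({"id": [1, 2], "keiyaku_pr": [1000, 2400],
                        "tc_mseki": [10.0, 12.0], "tanka_pr": [100.0, 200.0]})

# Target first, then the features without the identifier columns
train_input = pd.concat(
    [train_y[["tanka_pr"]], train_x.drop(["id", "pj_no"], axis=1)],
    axis=1,
)

# Write to an in-memory buffer instead of a file so the result can be inspected
buf = io.StringIO()
train_input.to_csv(buf, header=False, index=False)
rows = [line.split(",") for line in buf.getvalue().splitlines()]
print(rows)
```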


### Computing accuracy from the SageMaker output

In [None]:
pred2_y = pd.read_csv('data/sagemaker_eval_input.csv.out', header=None)
ans_y = pd.read_csv('data/Y_eval.csv').drop(['id','keiyaku_pr','tc_mseki'],axis=1)

In [None]:
print( mean_absolute_percentage_error(ans_y.values,pred2_y.values))

### Creating the prediction input data for SageMaker

In [None]:
test_x = pd.read_csv("data/processed_test_goto_x.csv")

In [None]:
test_input = test_x.drop(['id','pj_no'],axis=1)
test_input.to_csv('data/sagemaker_test_input.csv', header=None, index=False)

### Building the submission data from the SageMaker output

In [None]:
tanka = pd.read_csv("data/sagemaker_test_input.csv.out", header=None )

In [None]:
test_x = pd.read_csv("data/processed_test_goto_x.csv")

In [None]:
submit = pd.DataFrame(test_x[['id', 'tc_mseki']])

In [None]:
submit['tanka_pr']=tanka

In [None]:
submit['price']=(submit['tc_mseki']*submit['tanka_pr']).astype(np.int64)

In [None]:
submit.loc[:,['id','price']].to_csv('data/submit3.tsv',sep='\t',header=None, index=False)
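
The SageMaker model predicts `tanka_pr`, which appears to be a unit price, so the submitted value is recovered by multiplying it by the area `tc_mseki`. The sketch below shows that conversion on made-up numbers. Note that `astype(np.int64)` truncates the fractional part; rounding first, shown in the `price_rounded` column, is an alternative and not what this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up test rows and predictions (assumptions, not real data)
submit = pd.DataFrame({"id": [101, 102], "tc_mseki": [50.0, 33.3]})
tanka = pd.DataFrame([[1200.5], [999.9]])  # one header-less column, like the .out file

submit["tanka_pr"] = tanka[0]
raw = submit["tc_mseki"] * submit["tanka_pr"]

submit["price"] = raw.astype(np.int64)                   # truncates, as in the notebook
submit["price_rounded"] = raw.round().astype(np.int64)   # alternative: round to nearest
print(submit[["id", "price", "price_rounded"]])
```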

In [None]:
submit.head()