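Rather than installing PySpark on a local machine and running a notebook there, this exercise uses Google Colab, which comes with many machine learning libraries preinstalled and provides good hardware.

To do so, install the pyspark and Py4J packages. Py4J lets a Python program access objects that live in a Java Virtual Machine. Spark runs here in local standalone mode.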

In [1]:
!pip install pyspark==3.0.1 py4j==0.10.9

Collecting pyspark==3.0.1
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
     |████████████████████████████████| 204.2 MB 34 kB/s 
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 20.7 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612242 sha256=859bd4eb589e2cc19c4a9667ee6bd18af40b5556b182fe8e3c3aa0c9bf9810fe
  Stored in directory: /root/.cache/pip/wheels/5e/34/fa/b37b5cef503fc5148b478b2495043ba61b079120b7ff379f9b
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Taipei Housing Price Prediction") \
    .getOrCreate()

# Building a Taipei Housing Price Prediction Model




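Dataset description

The task is to build a regression model that predicts housing prices (more precisely, the price per pyeong, a unit of floor area) from housing transaction records collected in the Sindian district of Taipei, Taiwan. The training data provides six features and a label holding the price per pyeong. Note again that the label is the price per pyeong, not the final sale price of the house.

Each column is described below. All fields are floating-point except X4, which is an integer.

* X1: Transaction date as a real number. The fractional part encodes the month; for example, 2013.250 means March 2013 (0.250 = 3/12)
* X2: House age (years)
* X3: Distance to the nearest subway station (meters)
* X4: Number of convenience stores within walking distance of the house
* X5: Latitude of the house
* X6: Longitude of the house
* Y: House price per pyeong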
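
The X1 encoding can be unpacked with a few lines of Python. This is a minimal sketch that follows the 0.250 = 3/12 convention stated above. The helper name `decode_transaction_date` is hypothetical, and treating a fractional part of zero as December of the previous year is an assumption, not something the dataset description states.

```python
from typing import Tuple


def decode_transaction_date(x1: float) -> Tuple[int, int]:
    """Split a fractional-year value such as 2013.250 into (year, month)."""
    year = int(x1)
    # The fractional part is month / 12, stored with three decimals in the data.
    month = round((x1 - year) * 12)
    if month == 0:
        # Assumption: a zero fraction means December of the previous year.
        return year - 1, 12
    return year, month


print(decode_transaction_date(2013.250))  # (2013, 3)
```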



In [3]:
spark

In [4]:
!wget https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/Taipei_sindan_housing.csv

--2021-07-26 14:16:16--  https://grepp-reco-test.s3.ap-northeast-2.amazonaws.com/Taipei_sindan_housing.csv
Resolving grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)... 52.219.56.39
Connecting to grepp-reco-test.s3.ap-northeast-2.amazonaws.com (grepp-reco-test.s3.ap-northeast-2.amazonaws.com)|52.219.56.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20014 (20K) [text/csv]
Saving to: ‘Taipei_sindan_housing.csv’


2021-07-26 14:16:17 (74.7 KB/s) - ‘Taipei_sindan_housing.csv’ saved [20014/20014]



In [5]:
!ls -tl

total 24
-rw-r--r-- 1 root root 20014 Jul 17 17:35 Taipei_sindan_housing.csv
drwxr-xr-x 1 root root  4096 Jul 16 13:20 sample_data


In [43]:
data = spark.read.csv('./Taipei_sindan_housing.csv', header=True, inferSchema=True)

In [44]:
data.printSchema()

root
 |-- X1: double (nullable = true)
 |-- X2: double (nullable = true)
 |-- X3: double (nullable = true)
 |-- X4: integer (nullable = true)
 |-- X5: double (nullable = true)
 |-- X6: double (nullable = true)
 |-- Y: double (nullable = true)



In [45]:
data.show()

+--------+----+--------+---+--------+---------+----+
|      X1|  X2|      X3| X4|      X5|       X6|   Y|
+--------+----+--------+---+--------+---------+----+
|2012.917|32.0|84.87882| 10|24.98298|121.54024|37.9|
|2012.917|19.5|306.5947|  9|24.98034|121.53951|42.2|
|2013.583|13.3|561.9845|  5|24.98746|121.54391|47.3|
|  2013.5|13.3|561.9845|  5|24.98746|121.54391|54.8|
|2012.833| 5.0|390.5684|  5|24.97937|121.54245|43.1|
|2012.667| 7.1| 2175.03|  3|24.96305|121.51254|32.1|
|2012.667|34.5|623.4731|  7|24.97933|121.53642|40.3|
|2013.417|20.3|287.6025|  6|24.98042|121.54228|46.7|
|  2013.5|31.7|5512.038|  1|24.95095|121.48458|18.8|
|2013.417|17.9| 1783.18|  3|24.96731|121.51486|22.1|
|2013.083|34.8|405.2134|  1|24.97349|121.53372|41.4|
|2013.333| 6.3|90.45606|  9|24.97433| 121.5431|58.1|
|2012.917|13.0|492.2313|  5|24.96515|121.53737|39.3|
|2012.667|20.4|2469.645|  4|24.96108|121.51046|23.8|
|  2013.5|13.2|1164.838|  4|24.99156|121.53406|34.3|
|2013.583|35.7|579.2083|  2| 24.9824|121.54619

In [9]:
data.select(['*']).describe().show()

+-------+------------------+------------------+------------------+------------------+--------------------+--------------------+------------------+
|summary|                X1|                X2|                X3|                X4|                  X5|                  X6|                 Y|
+-------+------------------+------------------+------------------+------------------+--------------------+--------------------+------------------+
|  count|               414|               414|               414|               414|                 414|                 414|               414|
|   mean|2013.1489710144933| 17.71256038647343|1083.8856889130436| 4.094202898550725|  24.969030072463745|  121.53336108695667| 37.98019323671498|
| stddev|0.2819672402629999|11.392484533242524| 1262.109595407851|2.9455618056636177|0.012410196590450208|0.015347183004592374|13.606487697735316|
|    min|          2012.667|               0.0|          23.38284|                 0|            24.93207|           1

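## Building the feature vector

- X1 is the transaction date, which was assumed to have little bearing on the price per pyeong, so it is excluded from the feature vector. The correlation coefficient supports this choice: the correlation between X1 and Y is only 0.08749061, a very weak relationship.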
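
For reference, the figure quoted above is a Pearson correlation coefficient, which is the default method of `Correlation.corr`. A minimal pure-Python sketch of that formula follows. The function name `pearson` and the toy inputs are illustrative assumptions and are not taken from the dataset.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: co-variation of xs and ys, normalized to [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (not from the dataset): a perfectly linear relationship.
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

A value near 0, such as the one reported for X1, means the two columns show almost no linear relationship.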

In [61]:
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

feature_columns = data.columns[1:-1]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Feature vector over all columns, used only to check the correlation coefficients
feature_columns2 = data.columns[:]
assembler2 = VectorAssembler(inputCols=feature_columns2, outputCol="features")

In [62]:
feature_columns

['X2', 'X3', 'X4', 'X5', 'X6']

In [63]:
data_2 = assembler.transform(data)

In [86]:
feature_columns2

['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'Y']

In [66]:
scaler = MinMaxScaler(inputCol="features", outputCol="features_scaled")
scaler_model = scaler.fit(data_2)
data_2 = scaler_model.transform(data_2)

In [109]:
data_2.show()

+--------+----+--------+---+--------+---------+----+--------------------+--------------------+
|      X1|  X2|      X3| X4|      X5|       X6|   Y|            features|     features_scaled|
+--------+----+--------+---+--------+---------+----+--------------------+--------------------+
|2012.917|32.0|84.87882| 10|24.98298|121.54024|37.9|[32.0,84.87882,10...|[0.73059360730593...|
|2012.917|19.5|306.5947|  9|24.98034|121.53951|42.2|[19.5,306.5947,9....|[0.44520547945205...|
|2013.583|13.3|561.9845|  5|24.98746|121.54391|47.3|[13.3,561.9845,5....|[0.30365296803652...|
|  2013.5|13.3|561.9845|  5|24.98746|121.54391|54.8|[13.3,561.9845,5....|[0.30365296803652...|
|2012.833| 5.0|390.5684|  5|24.97937|121.54245|43.1|[5.0,390.5684,5.0...|[0.11415525114155...|
|2012.667| 7.1| 2175.03|  3|24.96305|121.51254|32.1|[7.1,2175.03,3.0,...|[0.16210045662100...|
|2012.667|34.5|623.4731|  7|24.97933|121.53642|40.3|[34.5,623.4731,7....|[0.78767123287671...|
|2013.417|20.3|287.6025|  6|24.98042|121.54228|46.
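
The scaled values above follow the min-max formula (x - min) / (max - min), applied to each feature separately. Below is a small sketch for the X2 (house age) column. The minimum 0.0 appears in the `describe()` output, while the maximum 43.8 is not shown anywhere in the notebook and is inferred here from the scaled values, so treat it as an assumption. The helper name `min_max_scale` is also hypothetical.

```python
def min_max_scale(x: float, lo: float, hi: float) -> float:
    """Rescale x linearly so that lo maps to 0.0 and hi maps to 1.0."""
    return (x - lo) / (hi - lo)


X2_MIN = 0.0   # from the describe() output
X2_MAX = 43.8  # assumption: inferred from the scaled values, not shown above

# House ages from the first three rows of the table.
for age in (32.0, 19.5, 13.3):
    print(age, min_max_scale(age, X2_MIN, X2_MAX))
```

Under that assumption the results agree with the leading entries of `features_scaled` in the first three rows (0.7305..., 0.4452..., 0.3036...).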

In [87]:
data_3 = assembler2.transform(data)

In [88]:
data_2.show()

+--------+----+--------+---+--------+---------+----+--------------------+
|      X1|  X2|      X3| X4|      X5|       X6|   Y|            features|
+--------+----+--------+---+--------+---------+----+--------------------+
|2012.917|32.0|84.87882| 10|24.98298|121.54024|37.9|[32.0,84.87882,10...|
|2012.917|19.5|306.5947|  9|24.98034|121.53951|42.2|[19.5,306.5947,9....|
|2013.583|13.3|561.9845|  5|24.98746|121.54391|47.3|[13.3,561.9845,5....|
|  2013.5|13.3|561.9845|  5|24.98746|121.54391|54.8|[13.3,561.9845,5....|
|2012.833| 5.0|390.5684|  5|24.97937|121.54245|43.1|[5.0,390.5684,5.0...|
|2012.667| 7.1| 2175.03|  3|24.96305|121.51254|32.1|[7.1,2175.03,3.0,...|
|2012.667|34.5|623.4731|  7|24.97933|121.53642|40.3|[34.5,623.4731,7....|
|2013.417|20.3|287.6025|  6|24.98042|121.54228|46.7|[20.3,287.6025,6....|
|  2013.5|31.7|5512.038|  1|24.95095|121.48458|18.8|[31.7,5512.038,1....|
|2013.417|17.9| 1783.18|  3|24.96731|121.51486|22.1|[17.9,1783.18,3.0...|
|2013.083|34.8|405.2134|  1|24.97349|1

In [89]:
data_3.show()

+--------+----+--------+---+--------+---------+----+--------------------+
|      X1|  X2|      X3| X4|      X5|       X6|   Y|            features|
+--------+----+--------+---+--------+---------+----+--------------------+
|2012.917|32.0|84.87882| 10|24.98298|121.54024|37.9|[2012.917,32.0,84...|
|2012.917|19.5|306.5947|  9|24.98034|121.53951|42.2|[2012.917,19.5,30...|
|2013.583|13.3|561.9845|  5|24.98746|121.54391|47.3|[2013.583,13.3,56...|
|  2013.5|13.3|561.9845|  5|24.98746|121.54391|54.8|[2013.5,13.3,561....|
|2012.833| 5.0|390.5684|  5|24.97937|121.54245|43.1|[2012.833,5.0,390...|
|2012.667| 7.1| 2175.03|  3|24.96305|121.51254|32.1|[2012.667,7.1,217...|
|2012.667|34.5|623.4731|  7|24.97933|121.53642|40.3|[2012.667,34.5,62...|
|2013.417|20.3|287.6025|  6|24.98042|121.54228|46.7|[2013.417,20.3,28...|
|  2013.5|31.7|5512.038|  1|24.95095|121.48458|18.8|[2013.5,31.7,5512...|
|2013.417|17.9| 1783.18|  3|24.96731|121.51486|22.1|[2013.417,17.9,17...|
|2013.083|34.8|405.2134|  1|24.97349|1

In [90]:
# Check the correlation coefficients

from pyspark.ml.stat import Correlation
from pyspark.ml.stat import ChiSquareTest

r = Correlation.corr(data_3, "features").head()
print(str(r[0]))

DenseMatrix([[ 1.        ,  0.01754877,  0.06087995,  0.00963544,  0.03505776,
              -0.04108178,  0.08749061],
             [ 0.01754877,  1.        ,  0.02562205,  0.04959251,  0.0544199 ,
              -0.04852005, -0.21056705],
             [ 0.06087995,  0.02562205,  1.        , -0.60251914, -0.59106657,
              -0.80631677, -0.67361286],
             [ 0.00963544,  0.04959251, -0.60251914,  1.        ,  0.44414331,
               0.44909901,  0.57100491],
             [ 0.03505776,  0.0544199 , -0.59106657,  0.44414331,  1.        ,
               0.41292394,  0.54630665],
             [-0.04108178, -0.04852005, -0.80631677,  0.44909901,  0.41292394,
               1.        ,  0.52328651],
             [ 0.08749061, -0.21056705, -0.67361286,  0.57100491,  0.54630665,
               0.52328651,  1.        ]])
pValues: [0.02383460630373746,0.9998812046664141,0.08981892410895442,0.9265205413724785,0.9921131792018435]
degreesOfFreedom: [63215, 69402, 2690, 62677, 62139
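
The last row of the correlation matrix holds each column's correlation with Y. Here is a short sketch that ranks the features by the strength of that relationship, using the values copied from the output above. The variable names are illustrative.

```python
# Correlation of each feature with Y, copied from the last row of the matrix.
corr_with_y = {
    "X1": 0.08749061,
    "X2": -0.21056705,
    "X3": -0.67361286,
    "X4": 0.57100491,
    "X5": 0.54630665,
    "X6": 0.52328651,
}

# Sort by magnitude, strongest first; the sign only gives the direction.
ranked = sorted(corr_with_y.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.4f}")
```

X3 (distance to the nearest subway station) has the strongest relationship with the price, and it is negative. X1 is the weakest, which is consistent with leaving it out of the feature vector.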

## Splitting into training and test sets and building Linear Regression and RandomForestRegressor models

In [67]:
train, test = data_2.randomSplit([0.7, 0.3])

In [68]:
from pyspark.ml.regression import LinearRegression, RandomForestRegressionModel, RandomForestRegressor

lr_algo = LinearRegression(featuresCol="features_scaled", labelCol="Y")
lr_model = lr_algo.fit(train)

rf_algo = RandomForestRegressor(featuresCol="features_scaled", labelCol="Y")
rf_model = rf_algo.fit(train)

## Measuring model performance

#### Linear Regression performance

In [69]:
evaluation_summary = lr_model.evaluate(test)

In [70]:
evaluation_summary

<pyspark.ml.regression.LinearRegressionSummary at 0x7f8120588590>

In [71]:
evaluation_summary.meanAbsoluteError

6.107269912451954

In [72]:
evaluation_summary.rootMeanSquaredError

8.399740591761777

In [73]:
evaluation_summary.r2

0.6030886195458149
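
The three summary values are standard regression metrics. Below is a minimal pure-Python sketch of how they are defined, assuming plain lists of labels and predictions. The function names and toy numbers are illustrative and are not part of the PySpark API.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R squared: 1 minus residual sum of squares over total sum of squares."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy labels and predictions (not from the dataset).
labels = [3.0, 5.0, 7.0]
preds = [2.0, 5.0, 9.0]
print(mae(labels, preds), rmse(labels, preds), r2(labels, preds))
```

MAE and RMSE are in the same unit as Y, so lower is better; for R squared, values closer to 1 are better.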

#### Random Forest performance

In [74]:
rf_predictions = rf_model.transform(test)

In [75]:
rf_predictions.show()

+--------+----+--------+---+--------+---------+----+--------------------+--------------------+------------------+
|      X1|  X2|      X3| X4|      X5|       X6|   Y|            features|     features_scaled|        prediction|
+--------+----+--------+---+--------+---------+----+--------------------+--------------------+------------------+
|2012.667| 0.0|185.4296|  0| 24.9711| 121.5317|37.9|[0.0,185.4296,0.0...|[0.0,0.0250666403...| 50.38259334118262|
|2012.667| 1.5|23.38284|  7|24.96772|121.54102|47.7|[1.5,23.38284,7.0...|[0.03424657534246...|  49.7664117086627|
|2012.667| 3.1|383.8624|  5|24.98085|121.54391|56.2|[3.1,383.8624,5.0...|[0.07077625570776...|53.813306799467625|
|2012.667| 3.1|577.9615|  6|24.97201|121.54722|47.7|[3.1,577.9615,6.0...|[0.07077625570776...|44.959567270209064|
|2012.667| 5.6|90.45606|  9|24.97433| 121.5431|50.0|[5.6,90.45606,9.0...|[0.12785388127853...| 55.39846014236656|
|2012.667|12.9|492.2313|  5|24.96515|121.53737|42.5|[12.9,492.2313,5....|[0.294520547945

In [76]:
from pyspark.ml.evaluation import RegressionEvaluator

rf_evaluator = RegressionEvaluator(
    labelCol='Y', predictionCol='prediction', metricName='rmse'
)
rmse = rf_evaluator.evaluate(rf_predictions)

In [77]:
rmse

6.315941496553416

## Inspecting model predictions

In [None]:
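# `model` is not defined anywhere in this notebook; the fitted
# linear regression model (lr_model) is assumed here.
predictions = lr_model.transform(test)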

In [None]:
predictions.show()

+--------+----+--------+---+--------+---------+----+--------------------+------------------+
|      X1|  X2|      X3| X4|      X5|       X6|   Y|            features|        prediction|
+--------+----+--------+---+--------+---------+----+--------------------+------------------+
|2012.667| 5.7|90.45606|  9|24.97433| 121.5431|53.5|[5.7,90.45606,9.0...|  52.1107644492638|
|2012.667|14.6|339.2289|  1|24.97519|121.53151|26.5|[14.6,339.2289,1....| 40.72810659814945|
|2012.667|15.6|289.3248|  5|24.98203|121.54348|46.1|[15.6,289.3248,5....|46.277467578843016|
| 2012.75| 0.0|185.4296|  0| 24.9711| 121.5317|52.2|[0.0,185.4296,0.0...| 43.36030309039029|
| 2012.75| 7.8|104.8101|  5|24.96674|121.54067|38.4|[7.8,104.8101,5.0...|45.253680474874955|
| 2012.75|12.5|1144.436|  4|24.99176|121.53456|34.1|[12.5,1144.436,4....|   45.342533992163|
| 2012.75|13.5|4197.349|  0|24.93885|121.50383|18.6|[13.5,4197.349,0....|14.266340720300605|
| 2012.75|14.1|2615.465|  0|24.95495|121.56174|21.8|[14.1,2615.465,0..

In [None]:
predictions.select(predictions.columns[6:]).show()

+----+--------------------+------------------+
|   Y|            features|        prediction|
+----+--------------------+------------------+
|53.5|[5.7,90.45606,9.0...|  52.1107644492638|
|26.5|[14.6,339.2289,1....| 40.72810659814945|
|46.1|[15.6,289.3248,5....|46.277467578843016|
|52.2|[0.0,185.4296,0.0...| 43.36030309039029|
|38.4|[7.8,104.8101,5.0...|45.253680474874955|
|34.1|[12.5,1144.436,4....|   45.342533992163|
|18.6|[13.5,4197.349,0....|14.266340720300605|
|21.8|[14.1,2615.465,0....|23.100629876355697|
|55.1|[15.4,205.367,7.0...| 49.44895401662302|
|37.5|[15.6,752.7669,2....|40.316845248177515|
|37.4|[17.7,350.8515,1....|39.879988253987676|
|25.0|[29.6,769.4034,7....|42.909815333051256|
|21.5|[31.4,1447.286,3....| 33.15386951572032|
|34.2|[37.9,488.5727,1....|32.891978335143676|
|55.3|[0.0,185.4296,0.0...| 43.36030309039029|
|45.4|[0.0,274.0144,1.0...|45.079733801656175|
|71.0|[0.0,292.9978,6.0...|50.445714698052825|
|54.4|[3.4,56.47425,7.0...| 46.44391499712651|
|35.6|[5.1,18

## GBT Regressor

In [78]:
from pyspark.ml.regression import GBTRegressor

gbt_algo = GBTRegressor(featuresCol="features_scaled", labelCol="Y")
gbt_model = gbt_algo.fit(train)

In [79]:
gbt_predictions = gbt_model.transform(test)
gbt_predictions.show()

+--------+----+--------+---+--------+---------+----+--------------------+--------------------+------------------+
|      X1|  X2|      X3| X4|      X5|       X6|   Y|            features|     features_scaled|        prediction|
+--------+----+--------+---+--------+---------+----+--------------------+--------------------+------------------+
|2012.667| 0.0|185.4296|  0| 24.9711| 121.5317|37.9|[0.0,185.4296,0.0...|[0.0,0.0250666403...|51.417651578220095|
|2012.667| 1.5|23.38284|  7|24.96772|121.54102|47.7|[1.5,23.38284,7.0...|[0.03424657534246...|49.496131526054974|
|2012.667| 3.1|383.8624|  5|24.98085|121.54391|56.2|[3.1,383.8624,5.0...|[0.07077625570776...| 59.46086233912603|
|2012.667| 3.1|577.9615|  6|24.97201|121.54722|47.7|[3.1,577.9615,6.0...|[0.07077625570776...| 37.66555176750972|
|2012.667| 5.6|90.45606|  9|24.97433| 121.5431|50.0|[5.6,90.45606,9.0...|[0.12785388127853...|57.801495410310444|
|2012.667|12.9|492.2313|  5|24.96515|121.53737|42.5|[12.9,492.2313,5....|[0.294520547945

In [80]:
gbt_evaluator = RegressionEvaluator(
    labelCol='Y', predictionCol='prediction', metricName='rmse'
)
rmse = gbt_evaluator.evaluate(gbt_predictions)
print(rmse)

9.18913920667136


## Building ML Pipelines

In [100]:
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

assembler = VectorAssembler(inputCols=['X2', 'X3', 'X4', 'X5', 'X6'], outputCol="features")

minmax_scaler = MinMaxScaler(inputCol="features", outputCol="features_scaled")

from pyspark.ml.regression import LinearRegression, RandomForestRegressor, GBTRegressor

rf_algo = RandomForestRegressor(featuresCol="features_scaled", labelCol="Y")
lr_algo = LinearRegression(featuresCol="features_scaled", labelCol="Y")
gbt_algo = GBTRegressor(featuresCol="features_scaled", labelCol="Y")

rf_stages = [assembler, minmax_scaler, rf_algo]
lr_stages = [assembler, minmax_scaler, lr_algo]
gbt_stages = [assembler, minmax_scaler, gbt_algo]

In [101]:
from pyspark.ml import Pipeline

lr_pipeline = Pipeline(stages = lr_stages)
rf_pipeline = Pipeline(stages = rf_stages)
gbt_pipeline = Pipeline(stages = gbt_stages)

In [107]:
train, test = data.randomSplit([0.7, 0.3])

In [108]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol='Y', predictionCol='prediction', metricName='rmse'
)

In [109]:
lr_model = lr_pipeline.fit(train)
lr_cv_predictions = lr_model.transform(test)
evaluator.evaluate(lr_cv_predictions)

7.536625360025496

In [110]:
rf_model = rf_pipeline.fit(train)
rf_cv_predictions = rf_model.transform(test)
evaluator.evaluate(rf_cv_predictions)

6.127043389391001

In [111]:
gbt_model = gbt_pipeline.fit(train)
gbt_cv_predictions = gbt_model.transform(test)
evaluator.evaluate(gbt_cv_predictions)

7.797934389010835
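
One practical difference from the earlier cells: there the MinMaxScaler was fit on the full dataset before the split, whereas `Pipeline.fit(train)` fits every stage, including the scaler, on the training split only. Below is a toy pure-Python sketch of that fit-on-train, transform-on-test pattern. The `MinMax` class is a hypothetical stand-in, not the PySpark implementation.

```python
from typing import List, Sequence


class MinMax:
    """Toy one-column scaler that learns min and max in fit()."""

    def fit(self, values: Sequence[float]) -> "MinMax":
        self.lo = min(values)
        self.hi = max(values)
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [(v - self.lo) / (self.hi - self.lo) for v in values]


train_col = [0.0, 5.0, 10.0]  # toy training values
test_col = [2.5, 20.0]        # toy test values, one outside the training range

scaler = MinMax().fit(train_col)   # statistics come from the training split only
print(scaler.transform(test_col))  # [0.25, 2.0]
```

Because the test values never influence the learned minimum and maximum, the test score is a fairer estimate of performance on unseen data.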

## ML Tuning

In [116]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

rf_paramGrid = (ParamGridBuilder()
      .addGrid(rf_algo.numTrees, [10, 20, 30, 40, 50])
      .addGrid(rf_algo.maxDepth, [1, 3, 5, 10])
      .addGrid(rf_algo.maxBins, [2, 8, 32, 64])
      .build())

gbt_paramGrid = (ParamGridBuilder()
      .addGrid(gbt_algo.maxDepth, [3, 5, 10])
      .addGrid(gbt_algo.maxBins, [8, 32, 64])
      .addGrid(gbt_algo.maxIter, [10, 20, 30])
      .build())

rf_cv = CrossValidator(
    estimator=rf_pipeline,
    estimatorParamMaps=rf_paramGrid,
    evaluator=evaluator,
    numFolds=5
)

gbt_cv = CrossValidator(
    estimator=gbt_pipeline,
    estimatorParamMaps=gbt_paramGrid,
    evaluator=evaluator,
    numFolds=5
)
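
A grid search multiplies quickly. The sketch below counts the candidate settings in the two grids above using `itertools.product`; with `numFolds=5`, each candidate is trained once per fold. The variable and function names are illustrative.

```python
from itertools import product

# Same value lists as in the ParamGridBuilder calls above.
rf_grid = {
    "numTrees": [10, 20, 30, 40, 50],
    "maxDepth": [1, 3, 5, 10],
    "maxBins": [2, 8, 32, 64],
}
gbt_grid = {
    "maxDepth": [3, 5, 10],
    "maxBins": [8, 32, 64],
    "maxIter": [10, 20, 30],
}
NUM_FOLDS = 5


def count_fits(grid, num_folds):
    """Return (number of candidates, number of fits during cross-validation)."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates, n_candidates * num_folds


print(count_fits(rf_grid, NUM_FOLDS))   # (80, 400)
print(count_fits(gbt_grid, NUM_FOLDS))  # (27, 135)
```

After the search, CrossValidator also refits the best setting once on the full training set, so expect these cells to take noticeably longer than the single pipeline fits.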

In [114]:
# Run Cross Validations
rf_cvModel = rf_cv.fit(train)
rf_cv_predictions = rf_cvModel.transform(test)
evaluator.evaluate(rf_cv_predictions)

5.595301755153252

In [115]:
import pandas as pd

params = [{p.name: v for p, v in m.items()} for m in rf_cvModel.getEstimatorParamMaps()]
pd.DataFrame.from_dict([
    {rf_cvModel.getEvaluator().getMetricName(): metric, **ps} 
    for ps, metric in zip(params, rf_cvModel.avgMetrics)
])

          rmse  numTrees  maxDepth  maxBins
0    10.623436        10         1        2
1     9.999940        10         1        8
2     9.989737        10         1       32
3     9.970716        10         1       64
4     9.216276        10         3        2
...        ...       ...       ...      ...
75    8.024334        50         5       64
76    9.288244        50        10        2
77    8.356851        50        10        8
78    7.979761        50        10       32


In [117]:
gbt_cvModel = gbt_cv.fit(train)
gbt_cv_predictions = gbt_cvModel.transform(test)
evaluator.evaluate(gbt_cv_predictions)

7.008010310257546

In [118]:
import pandas as pd

params = [{p.name: v for p, v in m.items()} for m in gbt_cvModel.getEstimatorParamMaps()]
pd.DataFrame.from_dict([
    {gbt_cvModel.getEvaluator().getMetricName(): metric, **ps} 
    for ps, metric in zip(params, gbt_cvModel.avgMetrics)
])

       rmse  maxDepth  maxBins  maxIter
0  8.788754         3        8       10
1  8.828615         3        8       20
2  8.92704          3        8       30
3  8.813503         3       32       10
4  8.647959         3       32       20
5  8.62778          3       32       30
6  9.180343         3       64       10
7  9.096169         3       64       20
8  9.090386         3       64       30
9  8.998213         5        8       10
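
`avgMetrics` holds the RMSE averaged over the folds for each parameter map, so the preferred setting is the one with the smallest value. Here is a sketch using only the rows visible above; the full grid has 27 rows, so the best setting overall may differ.

```python
# Rows visible in the table above: (rmse, maxDepth, maxBins, maxIter).
visible_rows = [
    (8.788754, 3, 8, 10),
    (8.828615, 3, 8, 20),
    (8.92704, 3, 8, 30),
    (8.813503, 3, 32, 10),
    (8.647959, 3, 32, 20),
    (8.62778, 3, 32, 30),
    (9.180343, 3, 64, 10),
    (9.096169, 3, 64, 20),
    (9.090386, 3, 64, 30),
    (8.998213, 5, 8, 10),
]

# RMSE is an error metric, so the smallest average wins.
best = min(visible_rows, key=lambda row: row[0])
print(best)  # (8.62778, 3, 32, 30)
```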


## Conclusion

After applying cross-validation over a range of hyperparameters, the RandomForestRegressor model reached a final RMSE of 5.595 on the test set. The GBTRegressor model came out somewhat higher (worse), at 7.008.
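
To put the tuning gains in perspective, the test RMSE before and after cross-validation can be compared directly, assuming both runs used the same train/test split created in `In [107]`. Below is a small sketch of that arithmetic with the numbers reported above; the helper name `pct_reduction` is hypothetical.

```python
def pct_reduction(before: float, after: float) -> float:
    """Relative RMSE reduction in percent (positive means improvement)."""
    return (before - after) / before * 100.0


# (pipeline RMSE with default parameters, RMSE after cross-validation)
results = {
    "RandomForestRegressor": (6.127043389391001, 5.595301755153252),
    "GBTRegressor": (7.797934389010835, 7.008010310257546),
}

for name, (before, after) in results.items():
    print(f"{name}: {pct_reduction(before, after):.2f}% lower RMSE after tuning")
```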