**[Intermediate Machine Learning Home Page](https://www.kaggle.com/learn/intermediate-machine-learning)**

---

*This exercise involves you writing code, and we check it automatically to tell you if it's right. We're having a temporary problem with our checking infrastructure, which in some cases shows a bar that says `None` when you have the right answer. We're sorry, and we're fixing it. In the meantime, a bar saying `None` means your answer is correct.*

In this exercise, you will use **pipelines** to improve the efficiency of your machine learning code.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [1]:
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex4 import *
print("Setup Complete")

Setup Complete


You will work with data from the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course). 

![Ames Housing dataset image](https://i.imgur.com/lTJVG4e.png)

Run the next code cell without changes to load the training and validation sets in `X_train`, `X_valid`, `y_train`, and `y_valid`.  The test set is loaded in `X_test`.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
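
The column selection in the cell above can be illustrated on a tiny made-up DataFrame. This is only a sketch: `toy_df`, its values, and the `max_cardinality` threshold are hypothetical and not part of the exercise, which uses a threshold of 10 on the real training data.

```python
import pandas as pd

# Hypothetical toy data (not taken from the housing dataset)
toy_df = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype=object),  # 2 unique values
    "Neighborhood": pd.Series(["A", "B", "C", "D"], dtype=object),        # 4 unique values
    "LotArea": [8450, 9600, 11250, 9550],                                 # integer column
    "LotFrontage": [65.0, 80.0, None, 60.0],                              # float column with a missing value
})

# Assumed threshold for this toy example only
max_cardinality = 3

# Text columns whose number of unique values is below the threshold
low_card_cols = [cname for cname in toy_df.columns
                 if toy_df[cname].dtype == "object"
                 and toy_df[cname].nunique() < max_cardinality]

# Numerical columns, selected by dtype as in the cell above
num_cols = [cname for cname in toy_df.columns
            if toy_df[cname].dtype in ["int64", "float64"]]

print("Low-cardinality categorical columns:", low_card_cols)
print("Numerical columns:", num_cols)
```

Here `Neighborhood` is left out because every row has a different value, so one-hot encoding it would add one column per row.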

In [5]:
X_train.head()

| Id | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Condition1 | Condition2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 619 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Norm | Norm | ... | 774 | 0 | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 |
| 871 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | PosN | Norm | ... | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 93 | RL | Pave | Grvl | IR1 | HLS | AllPub | Inside | Gtl | Norm | Norm | ... | 432 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 818 | RL | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | Norm | Norm | ... | 857 | 150 | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
| 303 | RL | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Norm | Norm | ... | 843 | 468 | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |


The next code cell uses code from the tutorial to preprocess the data and train a model.  Run this code without changes.

In [15]:
?Pipeline

In [23]:
from sklearn.compose import ColumnTransformer
# ColumnTransformer applies transformers to specific columns of an array or pandas DataFrame.

from sklearn.pipeline import FeatureUnion
# FeatureUnion concatenates the results of multiple transformer objects.

from sklearn.pipeline import Pipeline
# Pipeline chains a sequence of transforms with a final estimator.

from sklearn.impute import SimpleImputer
# SimpleImputer is an imputation transformer that fills in missing values.

from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

## Example ##
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), 
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, numerical_cols),
                                               ('cat', categorical_transformer, categorical_cols)])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

print('MAE:', mean_absolute_error(y_valid, preds))

MAE: 17861.780102739725


The code yields a value around 17862 for the mean absolute error (MAE).  In the next step, you will amend the code to do better.
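
To see what the bundled preprocessing does to each column type, here is a minimal sketch on a small made-up DataFrame. The `toy_*` names and the data are hypothetical and only for illustration. The transformers are configured the same way as in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not taken from the housing dataset)
toy = pd.DataFrame({
    "area": [100.0, np.nan, 80.0, 120.0],
    "street": pd.Series(["Pave", "Grvl", np.nan, "Pave"], dtype=object),
})

# Numerical columns: strategy='constant' fills missing numbers with 0 by default
toy_numerical = SimpleImputer(strategy="constant")

# Categorical columns: fill with the most frequent value, then one-hot encode
toy_categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each transformer to its own columns
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", toy_numerical, ["area"]),
    ("cat", toy_categorical, ["street"]),
])

toy_out = toy_preprocessor.fit_transform(toy)

# The result can be a sparse matrix or a dense array; convert to dense for printing
toy_dense = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)

# Output columns: area, street == 'Grvl', street == 'Pave'
print(toy_dense)
```

The missing `area` becomes 0, and the missing `street` is treated as `Pave` (the most frequent value) before encoding.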

# Step 1: Improve the performance

### Part A

Now, it's your turn!  In the code cell below, define your own preprocessing steps and random forest model.  Fill in values for the following variables:
- `numerical_transformer`
- `categorical_transformer`
- `model`

To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.

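Scikit-Learn's design philosophy
Scikit-Learn's API is very well designed. Its main design principles are as follows.  

Consistency: All objects share a consistent and simple interface.  

Estimators: An object that estimates a set of model parameters from a dataset is called an estimator (for example, an imputer object is an estimator). The estimation itself is performed by the fit() method, which takes a single dataset as its parameter (or two for supervised learning algorithms, where the second dataset contains the labels). Any other parameter needed for the estimation is treated as a hyperparameter (for example, the imputer's strategy parameter) and is stored as an instance variable (usually passed in through the constructor).  

Transformers: An estimator that transforms a dataset (such as an imputer) is called a transformer. Here too the API is very simple: the transform() method takes the dataset as a parameter and returns the transformed dataset. The transformation generally depends on the learned model parameters, as it does for an imputer. Every transformer also has a fit_transform() method, which is equivalent to calling fit() and then transform() (fit_transform() is sometimes optimized and runs faster).  

Predictors: Some estimators can make predictions for a given dataset. For example, the LinearRegression model from the previous chapter is a predictor: it predicted life satisfaction given a country's GDP per capita. A predictor's predict() method takes a new dataset and returns the corresponding predictions. It also has a score() method that measures the quality of the predictions on a test set (together with the labels, for supervised learning algorithms).

Inspection: Every estimator's hyperparameters are directly accessible as public instance variables (for example, imputer.strategy), and every estimator's learned model parameters are also available as public instance variables with an underscore suffix (for example, imputer.statistics_).  

Nonproliferation of classes: Datasets are represented as NumPy arrays or SciPy sparse matrices rather than as separate classes. Hyperparameters are ordinary Python strings or numbers.  

Composition: Existing building blocks are reused as much as possible. For example, as we will see, it is easy to build a Pipeline estimator by chaining several transformers and placing a single estimator at the end.  

Sensible defaults: Scikit-Learn provides reasonable default values for most parameters, so you can quickly build a baseline system that works.  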
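
The estimator, transformer, and predictor interfaces described above can be tried on a few numbers. This is a minimal sketch: the arrays `X_demo` and `y_demo` are made up for illustration, and `LinearRegression` stands in for any predictor.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Made-up data with missing values (illustration only)
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 30.0],
                   [3.0, np.nan]])
y_demo = np.array([1.0, 2.0, 3.0])

# Estimator: the hyperparameter is passed to the constructor
# and stored as a public instance variable
imputer = SimpleImputer(strategy="mean")
print(imputer.strategy)

# fit() learns the model parameters; learned values carry an underscore suffix
imputer.fit(X_demo)
print(imputer.statistics_)  # per-column means, ignoring missing values

# Transformer: transform() returns the transformed dataset
X_filled = imputer.transform(X_demo)

# fit_transform() is equivalent to calling fit() and then transform()
X_filled_again = SimpleImputer(strategy="mean").fit_transform(X_demo)

# Predictor: predict() returns predictions, score() measures their quality
reg = LinearRegression().fit(X_filled, y_demo)
demo_preds = reg.predict(X_filled)
demo_score = reg.score(X_filled, y_demo)  # R^2 for regressors
```

A `Pipeline` relies on exactly this shared interface: it calls `fit_transform()` on each intermediate step and `fit()` or `predict()` on the final one.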

In [58]:
## Exercise ##
from sklearn.preprocessing import MinMaxScaler
# Preprocessing for numerical data with Min-Max Scaler
# numerical_transformer = Pipeline(steps=[("impute", SimpleImputer(strategy="mean"))]) # variant without a scaler
numerical_transformer = Pipeline(steps=[("impute", SimpleImputer(strategy="constant")), 
                                        ('scaler', MinMaxScaler())]) 

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                          ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
# ColumnTransformer applies each pipeline defined above to its own set of columns
preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, numerical_cols), 
                                               ('cat', categorical_transformer, categorical_cols)])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0) # Your code here

# Check your answer
step_1.a.check()

<span style="color:#33cc33">Correct</span>

In [10]:
# Lines below will give you a hint or solution code
# step_1.a.hint()
# step_1.a.solution()

### Part B

Run the code cell below without changes.

To pass this step, you need to have defined a pipeline in **Part A** that achieves lower MAE than the code above.  You're encouraged to take your time here and try out many different approaches, to see how low you can get the MAE!  (_If your code does not pass, please amend the preprocessing steps and model in Part A._)
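
One way to try out many approaches is to wrap the pipeline construction in a small function and loop over candidate settings. The sketch below assumes synthetic stand-in data so that it runs on its own. The helper `score_candidate`, the `demo_*` variables, and the column names are hypothetical, and the numbers it prints say nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical; NOT the housing dataset)
rng = np.random.default_rng(0)
n_rows = 200
area = rng.uniform(50, 200, n_rows)
quality = rng.choice(["low", "mid", "high"], n_rows)
bonus_by_quality = {"low": 0.0, "mid": 20000.0, "high": 50000.0}
bonus = np.array([bonus_by_quality[q] for q in quality])
price = 1000.0 * area + bonus + rng.normal(0, 5000, n_rows)
area[rng.random(n_rows) < 0.1] = np.nan  # inject some missing values

demo_X = pd.DataFrame({"area": area,
                       "quality": pd.Series(quality, dtype=object)})
demo_y = pd.Series(price)

demo_X_train, demo_X_valid, demo_y_train, demo_y_valid = train_test_split(
    demo_X, demo_y, train_size=0.8, test_size=0.2, random_state=0)


def score_candidate(num_strategy):
    """Build, fit, and score one candidate pipeline; return the validation MAE."""
    numerical = SimpleImputer(strategy=num_strategy)
    categorical = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    candidate_preprocessor = ColumnTransformer(transformers=[
        ("num", numerical, ["area"]),
        ("cat", categorical, ["quality"]),
    ])
    candidate = Pipeline(steps=[
        ("preprocessor", candidate_preprocessor),
        ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
    ])
    candidate.fit(demo_X_train, demo_y_train)
    return mean_absolute_error(demo_y_valid, candidate.predict(demo_X_valid))


# Compare imputation strategies for the numerical column
results = {s: score_candidate(s) for s in ["constant", "mean", "median"]}
for strategy, mae in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{strategy:>8}: MAE = {mae:,.0f}")
```

The same loop could vary model settings such as `n_estimators` instead. Because every candidate is scored on the same validation split, the MAE values are directly comparable.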

In [59]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

# Check your answer
step_1.b.check()

MAE: 17848.66469178082


<span style="color:#33cc33">Correct</span>

In [13]:
# Line below will give you a hint
# step_1.b.hint()

# Step 2: Generate test predictions

Now, you'll use your trained model to generate predictions with the test data.

In [20]:
# Preprocessing of test data, get predictions
preds_test = my_pipeline.predict(X_test) # Your code here

# Check your answer
step_2.check()

<span style="color:#33cc33">Correct</span>

In [21]:
# Lines below will give you a hint or solution code
# step_2.hint()
# step_2.solution()

Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.

In [None]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

# Step 3: Submit your results

Once you have successfully completed Step 2, you're ready to submit your results to the leaderboard!  If you choose to do so, make sure that you have already joined the competition by clicking on the **Join Competition** button at [this link](https://www.kaggle.com/c/home-data-for-ml-course).  
- Begin by clicking on the blue **COMMIT** button in the top right corner.  This will generate a pop-up window.  
- After your code has finished running, click on the blue **Open Version** button in the top right of the pop-up window.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
- Click on the **Output** tab on the left of the screen.  Then, click on the **Submit to Competition** button to submit your results to the leaderboard.
- If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your model and repeat the process.

# Keep going

Move on to learn about [**cross-validation**](https://www.kaggle.com/alexisbcook/cross-validation), a technique you can use to obtain more accurate estimates of model performance!


---

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum) to chat with other Learners.*