# Selecting Data for Modeling

Your dataset has too many variables to wrap your head around, or even to print out nicely.  How can you pare down this overwhelming amount of data to something you can understand?

We'll start by picking a few variables using our intuition. Later courses will show you statistical techniques to automatically prioritize variables.

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the **columns** property of the DataFrame (the bottom line of code below).

In [1]:
import pandas as pd

melbourne_file_path = './input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [4]:
# The Melbourne data has some missing values (houses for which some variables weren't recorded).
# We'll learn to handle missing values in a later tutorial.
# Your Iowa data doesn't have missing values in the columns you use.
# So we will take the simplest option for now, and drop houses with missing values from our data.
# Don't worry about this much for now, though the code is:

# dropna with axis=0 drops every row that has a missing value (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
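
To make the effect of `dropna(axis=0)` concrete, here is a minimal sketch on a small, made-up DataFrame. The name `toy_houses` and its values are assumptions for illustration only; they are not taken from the Melbourne data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Melbourne dataset)
toy_houses = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
    "YearBuilt": [1900.0, 1900.0, np.nan, 1910.0],
})

# axis=0 drops rows (houses) that contain at least one missing value
complete_houses = toy_houses.dropna(axis=0)

print(len(toy_houses), "rows before,", len(complete_houses), "rows after")
# The surviving rows keep their original labels
print(complete_houses.index.tolist())
```

Because the surviving rows keep their original labels, the row labels of a DataFrame can skip numbers after `dropna`, as they do in the `head` output later in this lesson.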

There are many ways to select a subset of your data. The [Pandas Micro-Course](https://www.kaggle.com/learn/pandas) covers these in more depth, but we will focus on two approaches for now.

1. Dot notation, which we use to select the "prediction target"
2. Selecting with a column list, which we use to select the "features"


## Selecting The Prediction Target 
You can pull out a variable with **dot-notation**. This single column is stored in a **Series**, which is broadly like a DataFrame with only a single column of data.

We'll use the dot notation to select the column we want to predict, which is called the **prediction target**. By convention, the prediction target is called **y**. So the code we need to save the house prices in the Melbourne data is:

In [5]:
y = melbourne_data.Price
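
As a small sketch of what dot notation returns, the snippet below uses a made-up DataFrame (`toy_houses` is a hypothetical name, and the values are illustrative). It also shows the equivalent bracket form, which is the one to use when a column name contains spaces or clashes with a DataFrame attribute.

```python
import pandas as pd

# Hypothetical toy data (not the Melbourne dataset)
toy_houses = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y_dot = toy_houses.Price          # dot notation
y_bracket = toy_houses["Price"]   # equivalent bracket notation

print(type(y_dot).__name__)       # the selected column is a Series
print(y_dot.equals(y_bracket))    # both forms select the same data
```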

## Choosing "Features"

The columns that are fed into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example:

In [7]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

By convention, this data is called **X**.

In [8]:
X = melbourne_data[melbourne_features]
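
The following sketch shows, on made-up data, that selecting with a list of column names returns a DataFrame rather than a Series, even when the list holds a single name. The names `toy_houses`, `toy_features`, and `X_toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (not the Melbourne dataset)
toy_houses = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Landsize": [156.0, 134.0, 120.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

# A list of column names, each a quoted string
toy_features = ["Rooms", "Landsize"]
X_toy = toy_houses[toy_features]

# A one-item list still gives a DataFrame, not a Series
one_column = toy_houses[["Rooms"]]

print(type(X_toy).__name__, list(X_toy.columns))
print(type(one_column).__name__, one_column.shape)
```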

Let's quickly review the data we'll be using to predict house prices with the `describe` method, which summarizes each column, and the `head` method, which shows the top few rows.

In [9]:
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [10]:
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | -37.806 | 144.9954 |


Visually checking your data with these commands is an important part of a data scientist's job.  You'll frequently find surprises in the dataset that deserve further inspection.

---
# Building Your Model

You will use the **scikit-learn** library to create your models.  When coding, this library is written as **sklearn**, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames. 

The steps to building and using a model are:
* **Define:** What type of model will it be?  A decision tree?  Some other type of model? Other parameters of the model type are specified here too.
* **Fit:** Capture patterns from provided data. This is the heart of modeling.
* **Predict:** Use the fitted model to generate predictions for a set of houses.
* **Evaluate:** Determine how accurate the model's predictions are.
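
The four steps above can be sketched end to end on a tiny, made-up dataset. This is an illustration under stated assumptions, not the lesson's own evaluation method: the data (`toy_X`, `toy_y`) is hypothetical, and mean absolute error is chosen here only as one common way to measure accuracy.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data (not the Melbourne dataset)
toy_X = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2],
    "Landsize": [156.0, 134.0, 120.0, 245.0, 256.0],
})
toy_y = pd.Series([1035000.0, 1465000.0, 1600000.0, 1876000.0, 1636000.0])

# Define: choose the model type and its parameters
toy_model = DecisionTreeRegressor(random_state=1)

# Fit: capture patterns from the provided data
toy_model.fit(toy_X, toy_y)

# Predict: generate predictions (here, for the same rows used in fitting)
toy_predictions = toy_model.predict(toy_X)

# Evaluate: compare predictions with the known prices
# (assumption: mean absolute error as the accuracy measure)
toy_mae = mean_absolute_error(toy_y, toy_predictions)
print(toy_mae)
```

Evaluating on the same rows the model was fitted on gives an overly optimistic error, so treat this only as a demonstration of the workflow.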

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [8]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

Many machine learning models allow some randomness in model training. Specifying a number for `random_state` ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
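
As a quick check of the reproducibility claim, the sketch below fits two trees with the same `random_state` on the same randomly generated data and compares their predictions. The data and the column names `a`, `b`, `c` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical random data, generated with a fixed seed
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
toy_y = rng.normal(size=50)
new_X = pd.DataFrame(rng.normal(size=(10, 3)), columns=["a", "b", "c"])

# Two separate models, same data, same random_state
model_a = DecisionTreeRegressor(random_state=1).fit(toy_X, toy_y)
model_b = DecisionTreeRegressor(random_state=1).fit(toy_X, toy_y)

# With identical settings and data, the predictions match exactly
same_predictions = np.array_equal(model_a.predict(new_X), model_b.predict(new_X))
print(same_predictions)
```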

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.


In [9]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
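
To predict for a house that is not in the training data, one approach is to build a one-row DataFrame with the same feature columns, in the same order, as the data used for fitting. The sketch below assumes a made-up training set and a hypothetical `new_house`; it is not fitted on the Melbourne data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy training data (not the Melbourne dataset)
toy_X = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [156.0, 134.0, 120.0],
})
toy_y = pd.Series([1035000.0, 1465000.0, 1600000.0])

toy_model = DecisionTreeRegressor(random_state=1).fit(toy_X, toy_y)

# A new, unseen house described with the same feature columns
new_house = pd.DataFrame({"Rooms": [3], "Landsize": [150.0]})

# predict returns one value per row of the input
predicted_price = toy_model.predict(new_house)
print(predicted_price)
```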


# Your Turn
Try it out yourself in the **[Model Building Exercise](https://www.kaggle.com/kernels/fork/1404276)**

---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161285) to chat with other Learners.*