## Selecting Data for Modeling


In [16]:
import pandas as pd

melbourne_data = pd.read_csv('data/melbourne-housing-snapshot/melb_data.csv')
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

There are many ways to select a subset of the data, but we will focus on two approaches for now:

1. Dot notation, which we use to select the "prediction target"
2. Selecting with a column list, which we use to select the "features"
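
As a rough sketch of the two approaches, the snippet below uses a tiny made-up DataFrame. The name `toy` and its values are illustrative assumptions, not the Melbourne data. Dot notation returns a Series, while a list of column names returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [850000.0, 1465000.0, 1600000.0],
})

# 1. Dot notation selects one column and returns a Series
target = toy.Price

# 2. A list of column names selects several columns and returns a DataFrame
features = toy[["Rooms", "Bathroom"]]

print(type(target).__name__)    # Series
print(type(features).__name__)  # DataFrame
```

Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute; `toy["Price"]` selects the same Series and works for any column name.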

## Selecting the Prediction Target

- You can pull out a variable with dot notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.
- We'll use dot notation to select the column we want to predict, which is called the `prediction target`. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is:

In [17]:
y = melbourne_data.Price
print(y)

0        1480000.0
1        1035000.0
2        1465000.0
3         850000.0
4        1600000.0
           ...    
13575    1245000.0
13576    1031000.0
13577    1170000.0
13578    2500000.0
13579    1285000.0
Name: Price, Length: 13580, dtype: float64


## Choosing "Features"

- The columns that are fed into our model (and later used to make predictions) are called "features". In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

- For now we'll build a model with only a few features. Later we'll see how to iterate and compare models built with different features.

- We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).



In [18]:
melbourne_features = ['Rooms','Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

# By convention, this data is called X.

X = melbourne_data[melbourne_features]

Let's quickly review the data we'll be using to predict house prices, using the `describe` method and the `head` method, which shows the top few rows.

In [19]:
X.describe()

            Rooms    Bathroom    Landsize   Lattitude  Longtitude
count     13580.0     13580.0     13580.0     13580.0     13580.0
mean     2.937997    1.534242  558.416127  -37.809203  144.995216
std      0.955748    0.691712 3990.669241     0.07926    0.103916
min           1.0         0.0         0.0   -38.18255   144.43181
25%           2.0         1.0       177.0  -37.856822    144.9296
50%           3.0         1.0       440.0  -37.802355    145.0001
75%           3.0         2.0       651.0    -37.7564  145.058305
max          10.0         8.0    433014.0   -37.40853   145.52635


Visually checking your data with these commands is an important part of a data scientist's job.

## Building Your Model

We will use the scikit-learn library to create our models. When coding, this library is written as `sklearn`. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

- <b>Define:</b> What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- <b>Fit:</b> Capture patterns from the provided data. This is the heart of modeling.
- <b>Predict:</b> Just what it sounds like.
- <b>Evaluate:</b> Determine how accurate the model's predictions are.
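
The four steps above can be sketched end to end on a small made-up dataset. Everything named `toy_*` is hypothetical, and `mean_absolute_error` is just one possible metric, chosen here as an assumption because the text does not name one.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data (not the Melbourne dataset)
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5],
    "Landsize": [150.0, 200.0, 400.0, 250.0, 600.0],
    "Price": [800000.0, 1000000.0, 1500000.0, 1100000.0, 2000000.0],
})
toy_X = toy[["Rooms", "Landsize"]]
toy_y = toy.Price

# Define: choose the model type and its parameters
toy_model = DecisionTreeRegressor(random_state=1)

# Fit: learn patterns from the features and the target
toy_model.fit(toy_X, toy_y)

# Predict: one prediction per row passed in
toy_predictions = toy_model.predict(toy_X)

# Evaluate: average absolute gap between predictions and actual prices,
# measured here on the same rows that were used for fitting
toy_mae = mean_absolute_error(toy_y, toy_predictions)

print(toy_predictions)
print(toy_mae)
```

Because the error is measured on the rows the model was fitted on, it is an optimistic figure and says little about how the model would do on houses it has not seen.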

In [21]:
# Example of defining a decision tree model with scikit-learn
# and fitting it with the features and target variable.
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure the same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

- Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.

- We now have a fitted model that we can use to make predictions.
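
To illustrate the `random_state` point, here is a small sketch on randomly generated data. The arrays `demo_X` and `demo_y` and the seed values are arbitrary assumptions. Two models built with the same `random_state` and fitted on the same data give identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical random data, generated only for this demonstration
rng = np.random.default_rng(0)
demo_X = rng.random((50, 3))
demo_y = rng.random(50)

# Two separate models with the same random_state, fitted on the same data
model_a = DecisionTreeRegressor(random_state=1).fit(demo_X, demo_y)
model_b = DecisionTreeRegressor(random_state=1).fit(demo_X, demo_y)

# Compare their predictions on a few new rows
new_rows = rng.random((5, 3))
same = np.array_equal(model_a.predict(new_rows), model_b.predict(new_rows))
print(same)  # True
```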

In [24]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
# Note: predict(X) returns one prediction per row of X, not only the 5 rows printed above
print(melbourne_model.predict(X))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
The predictions are
[1480000. 1035000. 1465000. ... 1170000. 2500000. 1285000.]
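
The output above contains a prediction for every row of `X`, because `predict` returns one value per row it receives. To predict only for the rows shown by `head()`, pass those rows instead. Here is a minimal sketch on made-up data, where the `toy` names and values are hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data with more than five rows
toy = pd.DataFrame({
    "Rooms": [2, 2, 3, 3, 4, 4, 5],
    "Landsize": [120.0, 160.0, 130.0, 95.0, 125.0, 300.0, 450.0],
    "Price": [900000.0, 950000.0, 1200000.0, 1100000.0,
              1500000.0, 1700000.0, 2100000.0],
})
toy_X = toy[["Rooms", "Landsize"]]
toy_model = DecisionTreeRegressor(random_state=1).fit(toy_X, toy.Price)

# Predicting on every row gives one prediction per row
all_predictions = toy_model.predict(toy_X)

# Predicting on head() gives predictions for the first five rows only
first_five = toy_model.predict(toy_X.head())

print(len(all_predictions))  # 7
print(len(first_five))       # 5
```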
