## Selecting Data for Modeling

The Melbourne dataset has too many variables to wrap your head around, or even to print out nicely, so we start by choosing a few of them.

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the **columns**\
property of the DataFrame (the bottom line of code below).

In [2]:
import pandas as pd

In [15]:
melbourne_file_path = './data/002/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

# print all columns
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [16]:
melbourne_data.head(2)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0


In [17]:
# The Melbourne data has some missing values (some houses for which some variables weren't recorded.)

# `dropna` drops missing values (think of na as "not available")
clear_data = melbourne_data.dropna(axis=0)
clear_data.head(5)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0
6,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,7/05/2016,2.5,3067.0,...,2.0,0.0,245.0,210.0,1910.0,Yarra,-37.8024,144.9993,Northern Metropolitan,4019.0
7,Abbotsford,98 Charles St,2,h,1636000.0,S,Nelson,8/10/2016,2.5,3067.0,...,1.0,2.0,256.0,107.0,1890.0,Yarra,-37.806,144.9954,Northern Metropolitan,4019.0
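To see what `dropna(axis=0)` does without loading the full file, here is a minimal sketch on a made-up three-row DataFrame. The `toy` frame and its values are assumptions for illustration only; they are not taken from the Melbourne data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Melbourne file
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [np.nan, 150.0, 142.0],
    "YearBuilt": [np.nan, 1900.0, np.nan],
})

# Count the missing values in each column before dropping anything
print(toy.isna().sum())

# axis=0 drops every row that contains at least one missing value
complete_rows = toy.dropna(axis=0)
print(len(toy), "rows before,", len(complete_rows), "rows after")
```

Note that `dropna` returns a new DataFrame and leaves the original untouched, which is why the result is assigned to a new name.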


## Selecting the prediction target
You can pull out a variable with `dot-notation`. This single column is stored in a `Series`, which is broadly like a\
DataFrame with only a single column of data.

**We'll use dot notation to select the column we want to predict**, which is called the `prediction target`.\
By convention, the prediction target is called `y`.
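As a small sketch of the idea, dot notation and bracket notation return the same `Series`. The `toy` DataFrame below is a hypothetical stand-in for `melbourne_data`.

```python
import pandas as pd

# Hypothetical two-row stand-in for the Melbourne data
toy = pd.DataFrame({"Price": [1480000.0, 1035000.0], "Rooms": [2, 2]})

y_dot = toy.Price          # dot notation
y_bracket = toy["Price"]   # bracket notation selects the same column

print(type(y_dot).__name__)      # the selected column is a Series
print(y_dot.equals(y_bracket))   # both notations give identical data
```

Bracket notation is the safer choice when a column name contains spaces or clashes with a DataFrame method name, since dot notation only works for names that are valid Python attributes.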

In [18]:
# Load the DataFrame again
melbourne_file_path = './data/002/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)

In [19]:
melbourne_data.head(1)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0


In [20]:
y = melbourne_data.Price
y

0        1480000.0
1        1035000.0
2        1465000.0
3         850000.0
4        1600000.0
           ...    
13575    1245000.0
13576    1031000.0
13577    1170000.0
13578    2500000.0
13579    1285000.0
Name: Price, Length: 13580, dtype: float64

### Choosing "Features"
The columns that are inputted into our model (and later used to make predictions) are called "`features`".\
In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns\
except the target as features. Other times you'll be better off with fewer features.

We select multiple features by providing a list of column names inside brackets.\
Each item in that list should be a string (with quotes).\
By convention, this data is called `X`.

Example:
```Python
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
```
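For the case where every column except the target is used as a feature, one possible approach is to drop the target column instead of listing the features by hand. This is a sketch on a hypothetical `toy` DataFrame, not the notebook's own method.

```python
import pandas as pd

# Hypothetical stand-in for the Melbourne data
toy = pd.DataFrame({
    "Rooms": [2, 3],
    "Bathroom": [1.0, 2.0],
    "Price": [1035000.0, 1465000.0],
})

y_toy = toy["Price"]

# All columns except the target; drop returns a copy and leaves toy unchanged
X_all = toy.drop(columns=["Price"])

# A hand-picked subset: a list of names inside brackets gives a DataFrame
X_few = toy[["Rooms"]]

print(list(X_all.columns))
print(list(X_few.columns))
```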

In [21]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

X = melbourne_data[melbourne_features]
X
# Visually checking your data (here, or with `head` and `describe`) is an important part of a data scientist's job.

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
0,2,1.0,202.0,-37.79960,144.99840
1,2,1.0,156.0,-37.80790,144.99340
2,3,2.0,134.0,-37.80930,144.99440
3,3,2.0,94.0,-37.79690,144.99690
4,4,1.0,120.0,-37.80720,144.99410
...,...,...,...,...,...
13575,4,2.0,652.0,-37.90562,145.16761
13576,3,2.0,333.0,-37.85927,144.87904
13577,3,2.0,436.0,-37.85274,144.88738
13578,4,1.0,866.0,-37.85908,144.89299


In [22]:
X.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,13580.0,13580.0,13580.0,13580.0,13580.0
mean,2.937997,1.534242,558.416127,-37.809203,144.995216
std,0.955748,0.691712,3990.669241,0.07926,0.103916
min,1.0,0.0,0.0,-38.18255,144.43181
25%,2.0,1.0,177.0,-37.856822,144.9296
50%,3.0,1.0,440.0,-37.802355,145.0001
75%,3.0,2.0,651.0,-37.7564,145.058305
max,10.0,8.0,433014.0,-37.40853,145.52635


### Building your model
You will use the `scikit-learn` library to create your models. When coding, this library is written as `sklearn`, as you\
will see in the sample code.\
`Scikit-learn` is easily the most popular library for modeling the types of data typically stored\
in DataFrames.

The steps to building and using a model are:
- **Define**: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- **Fit**: Capture patterns from provided data. This is the heart of modeling.
- **Predict**: Just what it sounds like.
- **Evaluate**: Determine how accurate the model's predictions are.
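
The four steps above can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the data is randomly generated, and mean absolute error is just one common choice of metric.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): price grows with the number of rooms, plus noise
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=50)
X_toy = rooms.reshape(-1, 1)
y_toy = 200_000 * rooms + rng.normal(0, 10_000, size=50)

model = DecisionTreeRegressor(random_state=1)   # Define
model.fit(X_toy, y_toy)                         # Fit
predictions = model.predict(X_toy)              # Predict
mae = mean_absolute_error(y_toy, predictions)   # Evaluate

print(f"In-sample mean absolute error: {mae:,.0f}")
```

Because the error is measured on the same rows the model was fitted on, it is likely to look better than it would on houses the model has not seen.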

Here is an example of defining a decision tree model with `scikit-learn` and fitting it with the features and target variable.

In [11]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run.
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=1, splitter='best')

`random_state`
Many machine learning models allow some randomness in model training.\
Specifying a number for `random_state` ensures you get the same results in each run.
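
A minimal sketch of that reproducibility, using synthetic data: two trees built with the same `random_state` give identical predictions. The data and the `max_features=1` setting are assumptions chosen so that the random draw actually plays a role during fitting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=100)
X_new = rng.normal(size=(20, 3))

# max_features=1 makes each split start from a randomly drawn feature,
# which is one place where the seed influences the fitted tree
model_a = DecisionTreeRegressor(max_features=1, random_state=1).fit(X_toy, y_toy)
model_b = DecisionTreeRegressor(max_features=1, random_state=1).fit(X_toy, y_toy)

# Same seed, same data: the two models agree on unseen rows
same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print(same)
```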

In [12]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941
The predictions are
[1480000. 1035000. 1465000.  850000. 1600000.]
