# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The data from the tutorial, the Melbourne data, is not available in this workspace.  You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum for any questions or comments. 

**Write Your Code Below...**



# Level 1-2: Starting Your ML Project

## Your Turn

**Remember, the notebook you want to "fork" is [here](https://www.kaggle.com/dansbecker/my-first-machine-learning-model/).**

Run the equivalent commands (to read the data and print the summary) in the code cell below. The file path for your data is already shown in your coding notebook. Look at the mean, minimum and maximum values for the first few fields. Are any of the values so crazy that they make you think you've misinterpreted the data?

There are a lot of fields in this data. You don't need to look at it all quite yet.

When your code is correct, you'll see the size, in square feet, of the smallest lot in your dataset. This is from the **min** value of **LotArea**, and you can see the **max** size too. You should notice that it's a big range of lot sizes!

You'll also see some columns filled with `...`. That indicates that there were too many columns of data to print, so the middle ones were omitted.

We'll take care of both issues in the next step.
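One way to keep the summary readable, offered only as a sketch: transpose the output of `describe()` so that each field becomes a row. The small DataFrame below is made-up stand-in data, not the Iowa file; only the column names are borrowed from it, and the names `toy`, `summary` and `fields_with_gaps` are illustrative.

```python
import pandas as pd

# Made-up stand-in rows; the real notebook reads ../input/train.csv instead.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "LotFrontage": [65.0, 80.0, None, 60.0, 84.0],  # one missing value on purpose
})

# Transposing puts one field per row, so a wide table is less likely to be cut off.
summary = toy.describe().T
print(summary[["count", "mean", "min", "max"]])

# A count below the number of rows signals missing values in that field.
fields_with_gaps = summary.index[summary["count"] < len(toy)].tolist()
print(fields_with_gaps)
```

Comparing `count` against the number of rows is a quick sanity check to run alongside the mean, minimum and maximum.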

In [57]:
# Import relevant libraries and dependencies
import pandas as pd

# save filepath to variable for easier access
main_file_path = '../input/train.csv'

# read the data and store data in DataFrame titled df
df = pd.read_csv(main_file_path)

# print a summary of the data
print(df.describe())

                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726   
std       1.112799    30.202904     20.645407   181.066207   456.098091   
min       1.000

# Level 1-3: Selecting and Filtering in Pandas

## Your Turn

In the notebook with your code:

1. Print a list of the columns
2. From the list of columns, find the name of the column with the sales prices of the homes. Use dot notation to extract this to a variable (as you saw above to create `melbourne_price_data`).
3. Use the head command to print out the top few lines of the variable you just created.
4. Pick any two variables and store them to a new DataFrame (as you saw above to create two_columns_of_data.)
5. Use the describe command with the DataFrame you just created to see summaries of those variables. 
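The steps above might look like the following sketch. It uses a few made-up rows in place of the Iowa data, and the variable names are illustrative. Note that dot notation only works when the column name is a valid Python identifier, so a column such as `1stFlrSF` has to be selected with brackets.

```python
import pandas as pd

# Made-up rows standing in for the Iowa data
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500],
    "YrSold": [2008, 2007, 2008],
    "1stFlrSF": [856, 1262, 920],
    "RoofStyle": ["Gable", "Gable", "Hip"],
})

# Step 1: print the list of columns
print(toy.columns.tolist())

# Steps 2-3: dot notation, then the first few rows
prices = toy.SalePrice
print(prices.head())

# toy.1stFlrSF would be a syntax error, so use brackets for this column
first_floor = toy["1stFlrSF"]

# Steps 4-5: two columns in a new DataFrame, then a summary
two_columns = toy[["YrSold", "RoofStyle"]]
numeric_summary = two_columns.describe()            # numeric columns only
full_summary = two_columns.describe(include="all")  # text columns as well
print(numeric_summary)
print(full_summary)
```

By default `describe()` skips non-numeric columns when numeric ones are present, which is why a text column can vanish from the summary unless `include="all"` is passed.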

In [58]:
# Print a list of the columns
print(df.columns)

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [59]:
# Extract the sales price using dot notation
price_data = df.SalePrice

# Display the head of the variable
print(price_data.head(5))

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64


In [60]:
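# Store two columns of interest in a new DataFrame
columns_of_interest = ["YrSold", "RoofStyle"]
two_columns_of_data = df[columns_of_interest]

# Describe the two columns. RoofStyle is non-numeric, so describe() leaves it out by default.
print(two_columns_of_data.describe())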

            YrSold
count  1460.000000
mean   2007.815753
std       1.328095
min    2006.000000
25%    2007.000000
50%    2008.000000
75%    2009.000000
max    2010.000000


# Level 1-4: Your First Scikit-Learn Model

## Your Turn

Now it's time for you to define and fit a model for your data (in your notebook).

1. Select the target variable you want to predict. You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable). Save this to a new variable called y.
2. Create a **list** of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):
 * LotArea
 * YearBuilt
 * 1stFlrSF
 * 2ndFlrSF
 * FullBath
 * BedroomAbvGr
 * TotRmsAbvGrd
3. Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name X.
4. Create a `DecisionTreeRegressor` model and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.
5. Fit the model you have created using the data in X and the target data you saved above.
6. Make a few predictions with the model's predict command and print out the predictions.

In [61]:
# Select target variable to be predicted
y = df["SalePrice"]

# Select predictors
list_of_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Create a predictors dataFrame
X = df[list_of_predictors]

# Import the decision tree
from sklearn.tree import DecisionTreeRegressor

# Define the model
iowa_model = DecisionTreeRegressor()

# Fit the model
iowa_model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [62]:
# Making a few predictions ...
print("\n" + "Making predictions for the following 5 houses:")
print(X.head())

print("\n" + "The predictions are")
print(iowa_model.predict(X.head()))

print("\n" + "Compared to their real values:")
print(y.head())


Making predictions for the following 5 houses:
   LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0     8450       2003       856       854         2             3   
1     9600       1976      1262         0         2             3   
2    11250       2001       920       866         2             3   
3     9550       1915       961       756         1             3   
4    14260       2000      1145      1053         2             4   

   TotRmsAbvGrd  
0             8  
1             6  
2             6  
3             7  
4             9  

The predictions are
[208500. 181500. 223500. 140000. 250000.]

Compared to their real values:
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
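The predictions match the real values exactly because the model is being asked about houses it already saw during fitting. The sketch below uses randomly generated data (not the Iowa data) to suggest why this says little about accuracy on new homes: an unrestricted decision tree can memorize its training rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic predictors and a noisy target for the rows used in fitting
X_seen = rng.uniform(0, 100, size=(200, 3))
y_seen = X_seen.sum(axis=1) + rng.normal(0, 10, size=200)

# A second synthetic batch the model never sees during fitting
X_new = rng.uniform(0, 100, size=(200, 3))
y_new = X_new.sum(axis=1) + rng.normal(0, 10, size=200)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_seen, y_seen)

in_sample_mae = mean_absolute_error(y_seen, tree.predict(X_seen))
new_data_mae = mean_absolute_error(y_new, tree.predict(X_new))

print("MAE on rows used for fitting:", in_sample_mae)
print("MAE on rows not seen before: ", new_data_mae)
```

The gap between the two numbers is the reason for holding back validation data before judging a model.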


# Level 1-5: Model Validation

## Your Turn

1. Use the train_test_split command to split up your data.
2. Fit the model with the training data
3. Make predictions with the validation predictors
4. Calculate the mean absolute error between your predictions and the actual target values for the validation data.
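For step 4, the mean absolute error is the average of the absolute differences between each prediction and the corresponding actual value. A small sketch with invented numbers, computing it by hand and with scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented sale prices and predictions, purely for illustration
actual = np.array([200000, 150000, 310000, 120000])
predicted = np.array([210000, 140000, 300000, 125000])

# Absolute errors are 10000, 10000, 10000 and 5000, so the mean is 8750
mae_by_hand = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

Because the errors are not squared, the result stays in the same units as the target, here dollars.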

In [63]:
# Split data into training and validation data
from sklearn.model_selection import train_test_split

"""
N.B.
The split is based on a random number generator. Supplying a numeric value to the random_state argument guarantees we get the same split every time we
run this script.
"""

# Split for training and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define the model
iowa_model = DecisionTreeRegressor()

# Fit the model with the training data
iowa_model.fit(train_X, train_y)

# Make predictions on the validation data
val_predictions = iowa_model.predict(val_X)

# Calculate and print the mean absolute error
from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(val_y, val_predictions))

32069.046575342465
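To see what `random_state` does to the split itself, here is a small sketch on made-up data (20 rows with invented values; the names `toy_X` and `toy_y` are illustrative). It also shows the default split proportions: when neither `test_size` nor `train_size` is given, `train_test_split` holds back 25% of the rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 20 made-up rows
toy_X = pd.DataFrame({"LotArea": range(1000, 1020)})
toy_y = pd.Series(range(20), name="SalePrice")

# Two calls with the same seed
first = train_test_split(toy_X, toy_y, random_state=0)
second = train_test_split(toy_X, toy_y, random_state=0)

# The validation rows are identical across the two calls
same_split = first[1].index.equals(second[1].index)
print(same_split)

# Default proportions: 75% training, 25% validation
train_rows, val_rows = len(first[0]), len(first[1])
print(train_rows, val_rows)
```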


# Level 1-6: Underfitting, Overfitting and Model Optimization

## Your Turn

In the near future, you'll be proficient at writing functions like `get_mae` yourself. For now, just copy it over to your work area. Then use a for loop that tries different values of `max_leaf_nodes` and calls the `get_mae` function on each to find the ideal number of leaves for your Iowa data.

You should see that the ideal number of leaves for Iowa data is less than the ideal number of leaves for the Melbourne data. Remember that a lower MAE is better.
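As a rough sketch of the search, the idea is to score each candidate tree size on validation data and keep the one with the lowest MAE. The example below runs on randomly generated data rather than the Iowa data, and the helper `validation_mae` and the candidate sizes are illustrative choices, not part of the tutorial.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the Iowa predictors and target
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 100, size=(400, 3))
toy_y = toy_X[:, 0] * 2 + toy_X[:, 1] + rng.normal(0, 15, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

def validation_mae(max_leaf_nodes):
    """Fit a tree of the given size and return its MAE on the validation rows."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, model.predict(va_X))

# Score a handful of candidate sizes, then keep the one with the lowest MAE
candidates = [5, 25, 50, 100, 250]
scores = {n: validation_mae(n) for n in candidates}
best_size = min(scores, key=scores.get)

print(best_size, scores[best_size])
```

Keeping every score in a dictionary makes it easy to look at the whole curve afterwards, not just the winner.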

In [64]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [65]:
best_choice = float("inf")  # Start above any possible MAE so the first result always counts

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in range(5, 5000, 1):
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    
    if my_mae < best_choice:
        best_choice = my_mae
        print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  35190
Max leaf nodes: 6  		 Mean Absolute Error:  33967
Max leaf nodes: 7  		 Mean Absolute Error:  33636
Max leaf nodes: 8  		 Mean Absolute Error:  31908
Max leaf nodes: 9  		 Mean Absolute Error:  31416
Max leaf nodes: 10  		 Mean Absolute Error:  30616
Max leaf nodes: 11  		 Mean Absolute Error:  30166
Max leaf nodes: 15  		 Mean Absolute Error:  29666
Max leaf nodes: 16  		 Mean Absolute Error:  29056
Max leaf nodes: 17  		 Mean Absolute Error:  28914
Max leaf nodes: 18  		 Mean Absolute Error:  28771
Max leaf nodes: 21  		 Mean Absolute Error:  28533
Max leaf nodes: 22  		 Mean Absolute Error:  28462
Max leaf nodes: 29  		 Mean Absolute Error:  28282
Max leaf nodes: 31  		 Mean Absolute Error:  27965
Max leaf nodes: 32  		 Mean Absolute Error:  27852
Max leaf nodes: 35  		 Mean Absolute Error:  27556
Max leaf nodes: 36  		 Mean Absolute Error:  27372
Max leaf nodes: 77  		 Mean Absolute Error:  27344
Max leaf nodes: 78  		 Mean Absolute

# Level 1-7: Random Forests

## Your Turn

Run the RandomForestRegressor on your data. You should see a big improvement over your best Decision Tree models.

In [66]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
iowa_predictions_random_forest = forest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_predictions_random_forest))

23308.836438356167
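The score above will differ from run to run, because `RandomForestRegressor()` is created without a `random_state`. The default number of trees also depends on the scikit-learn version: 10 in the version that produced the output in this notebook, 100 from version 0.22 onward. A sketch on randomly generated data with both set explicitly follows; the values 100 and 0 are arbitrary choices, and `forest_mae` is an illustrative helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Iowa predictors and target
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 100, size=(300, 3))
toy_y = toy_X[:, 0] * 2 + toy_X[:, 1] + rng.normal(0, 15, size=300)
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

def forest_mae():
    """Fit a forest with a fixed size and seed, and return its validation MAE."""
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, forest.predict(va_X))

# With the seed fixed, repeated runs give the same score
first_run = forest_mae()
second_run = forest_mae()
print(first_run, second_run)
```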


# Level 1-8: Submitting from a Kernel

In [69]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Read the data
train = pd.read_csv('../input/train.csv')

# pull data into target (y) and predictors (X)
train_y = train.SalePrice
predictor_cols = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']

# Create training predictors data
train_X = train[predictor_cols]

my_model = RandomForestRegressor()
my_model.fit(train_X, train_y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [70]:
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted_prices = my_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

[129550. 199500. 162180. ... 150390. 184850. 274180.]


In [71]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# You could use any filename. We choose submission.csv here.
my_submission.to_csv('submission.csv', index=False)
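Before submitting, it can be worth reading the file back to confirm it has the expected shape. The sketch below does this with made-up ids and prices and an in-memory buffer instead of a real file; the checks shown are only suggestions.

```python
import io

import pandas as pd

# Made-up ids and predictions standing in for the real submission
toy_submission = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [129550.0, 199500.0, 162180.0],
})

# Write to an in-memory buffer and read it back, as if it were submission.csv
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Exactly the two expected columns, and no missing predictions
columns_ok = list(check.columns) == ["Id", "SalePrice"]
no_missing = not check.isnull().values.any()
print(columns_ok, no_missing, len(check))
```

With `index=False`, the row index is left out of the file, so only the `Id` and `SalePrice` columns are written.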