# Introduction #
You will work with data from the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course) to predict home prices in Iowa using 79 explanatory variables describing (almost) every aspect of the homes.

The next cell loads the training and validation features into `X_train` and `X_valid`, along with the prediction targets `y_train` and `y_valid`. The test features are loaded into `X_test`. Only seven of the explanatory variables are used in this exercise.

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('train_CV.csv', index_col='Id')
X_test_full = pd.read_csv('test_CV.csv', index_col='Id')

# Obtain target and predictors
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)
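
To make the 80/20 split concrete, here is a minimal sketch on a small synthetic frame. The `toy_` names and values are made up for illustration and are not taken from the competition data. It shows that `train_size=0.8` keeps 80% of the rows for training and that a fixed `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, one target
toy_X = pd.DataFrame({'LotArea': range(1000, 11000, 1000)})
toy_y = pd.Series(range(100, 1100, 100), name='SalePrice')

toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0)

# 8 rows go to training, 2 to validation
print(len(toy_X_train), len(toy_X_valid))

# Rows keep their original index labels, so features and targets stay aligned
print(toy_X_train.index.equals(toy_y_train.index))

# The same random_state gives the same split on a second call
again_train, again_valid = train_test_split(
    toy_X, train_size=0.8, test_size=0.2, random_state=0)
print(again_train.index.equals(toy_X_train.index))
```

Changing `random_state` would shuffle the rows differently, so validation scores from different seeds are not directly comparable.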

Printing the first several rows of the data, for example with `X_train.head()`, is a quick way to get an overview of the features used in the price prediction model. The next cell defines five random forest models to compare.

In [12]:
from sklearn.ensemble import RandomForestRegressor

# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]
model_names = ["Model 1", "Model 2", "Model 3", "Model 4", "Model 5"] 

To select the best of the five models, we define a function `score_model()` below. It fits a model on the training data and returns its mean absolute error (MAE) on the validation set. Recall that the best model is the one with the lowest MAE.

In [13]:
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    # Fit on the training split, then score predictions on the validation split
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

# Print each model's validation MAE (%d truncates it to a whole number)
for i, model in enumerate(models, start=1):
    mae = score_model(model)
    print("Model %d MAE: %d" % (i, mae))

Model 1 MAE: 24015
Model 2 MAE: 23740
Model 3 MAE: 23528
Model 4 MAE: 23996
Model 5 MAE: 23706
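
The MAE values printed above are in the same units as `SalePrice`. As a rough illustration of what the metric measures, the sketch below computes it by hand on three made-up prices and compares the result with `mean_absolute_error`. The numbers are invented for the example and have nothing to do with the competition data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true prices and predictions, for illustration only
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 330000])

# MAE is the mean of |true - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Here the absolute errors are 10000, 10000 and 30000, so both values come out to about 16666.67. Over- and under-predictions count equally because only the size of each error matters.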


# Step 1: Evaluate several models
Which of the five models is best? It is the one with the lowest validation MAE.

In [14]:
# Score each model on the validation set, keyed by its name
scores = {name: score_model(model) for name, model in zip(model_names, models)}

# The best model is the one with the lowest validation MAE
best_name = min(scores, key=scores.get)
best_model = dict(zip(model_names, models))[best_name]
print(f'best model is: {best_name} : {best_model}')

best model is: Model 3 : RandomForestRegressor(criterion='absolute_error', random_state=0)
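
The selection step relies on `min` with a `key` function: iterating over a dict yields its keys, and `key=scores.get` ranks each key by its stored value. The sketch below shows this on the validation MAEs printed earlier, hard-coded rather than recomputed. The `_example` names are introduced here only for illustration.

```python
# Validation MAEs copied from the printed output above
scores_example = {"Model 1": 24015, "Model 2": 23740, "Model 3": 23528,
                  "Model 4": 23996, "Model 5": 23706}

# min() compares the keys by their associated MAE
best_name_example = min(scores_example, key=scores_example.get)
print(best_name_example, scores_example[best_name_example])

# Sorting the items gives the full ranking, lowest MAE first
ranking_example = sorted(scores_example.items(), key=lambda item: item[1])
print([name for name, _ in ranking_example])
```

The ranking also shows how close the candidates are: the gap between the best and worst model here is under 500, which is small relative to the MAE itself.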


# Step 2: Generate test predictions
Great. You know how to evaluate what makes an accurate model. Now it's time to go through the modeling process and make predictions.



The code below fits the selected model on all of the labeled data (`X` and `y`, which cover both the training and validation rows), then generates test predictions and saves them to a CSV file.

In [15]:
my_model = best_model

my_model.fit(X, y)

# Generate test predictions
preds_test = my_model.predict(X_test)
# Save predictions in csv format
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
print('Predictions saved to submission.csv!')


Predictions saved to submission.csv!
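
Before uploading, it can be worth reading the file back to confirm its layout. The sketch below does this on a tiny stand-in file written to a temporary directory. The ids and prices are invented, and the two-column `Id`/`SalePrice` layout is taken from the code above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the real predictions (invented values)
toy_output = pd.DataFrame({'Id': [1461, 1462, 1463],
                           'SalePrice': [120000.0, 155000.5, 180250.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'submission.csv')
    toy_output.to_csv(path, index=False)

    # Read the file back while the temporary directory still exists
    check = pd.read_csv(path)

# Basic sanity checks: expected columns, unique ids, no missing predictions
print(list(check.columns))
print(check['Id'].is_unique, check['SalePrice'].notna().all())
```

For the real file, the same checks would apply to `pd.read_csv('submission.csv')`, with the row count expected to match `len(X_test)`.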
