# Introduction
**This will be your workspace for Kaggle's Machine Learning education track.**

You will build and continually improve a model to predict housing prices as you work through each tutorial.  Fork this notebook and write your code in it.

The data from the tutorial, the Melbourne data, is not available in this workspace.  You will need to translate the concepts to work with the data in this notebook, the Iowa data.

Come to the [Learn Discussion](https://www.kaggle.com/learn-forum) forum for any questions or comments. 

**Write Your Code Below ...**



# What this notebook is ... What this notebook is not!

This workbook works through the examples in the machine learning track and adapts them to a second dataset, with the odd code snippet added where I have experimented. The full CRISP-DM process and submission for the

> **House Prices: Advanced Regression Techniques**    
> Predict sales prices and practice feature engineering, RFs, and gradient boosting

competition will follow in a further notebook where I put all of the skills and topics touched upon here together neatly and efficiently with the aim of creating the best model possible.

# Level 1-2: Starting Your ML Project

## Your Turn

**Remember, the notebook you want to "fork" is [here](https://www.kaggle.com/dansbecker/my-first-machine-learning-model/).**

Run the equivalent commands (to read the data and print the summary) in the code cell below. The file path for your data is already shown in your coding notebook. Look at the mean, minimum and maximum values for the first few fields. Are any of the values so crazy that it makes you think you've misinterpreted the data?

There are a lot of fields in this data. You don't need to look at it all quite yet.

When your code is correct, you'll see the size, in square feet, of the smallest lot in your dataset. This is from the **min** value of **LotArea**, and you can see the **max** size too. You should notice that it's a big range of lot sizes!

You'll also see some columns filled with `....` That indicates that we had too many columns of data to print, so the middle ones were omitted from printing.

We'll take care of both issues in the next step.

In [None]:
# Import relevant libraries and dependencies
import pandas as pd

# save filepath to variable for easier access
main_file_path = '../input/train.csv'

# read the data and store data in DataFrame titled df
df = pd.read_csv(main_file_path)

# print a summary of the data
print(df.describe())
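
The column truncation described above can also be lifted temporarily through pandas display options. The following is a minimal sketch, assuming a made-up 30-column frame named `wide` standing in for the Iowa data; the width of 1000 characters is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Iowa data: 30 numeric columns, 5 rows each
wide = pd.DataFrame({f"col_{i}": range(5) for i in range(30)})

# Inside this block pandas prints every column instead of eliding the middle ones
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full_text = repr(wide.describe())

print(full_text)
```

The options revert to their previous values as soon as the `with` block ends, so later cells print as before.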

# Level 1-3: Selecting and Filtering in Pandas

## Your Turn

In the notebook with your code:

1. Print a list of the columns
2. From the list of columns, find a name of the column with the sales prices of the homes. Use the dot notation to extract this to a variable (as you saw above to create melbourne_price_data.)
3. Use the head command to print out the top few lines of the variable you just created.
4. Pick any two variables and store them to a new DataFrame (as you saw above to create two_columns_of_data.)
5. Use the describe command with the DataFrame you just created to see summaries of those variables. 

In [None]:
# Print a list of the columns
print(df.columns)

In [None]:
# Extract the sales price using dot notation
price_data = df.SalePrice

# Display the head of the variable
print(price_data.head(5))

In [None]:
# Creation of two columns of data
columns_of_interest = ["YrSold", "RoofStyle"]

# Describe the two columns (include="all" so the text column RoofStyle is summarised too)
print(df[columns_of_interest].describe(include="all"))

# Level 1-4: Your First Scikit-Learn Model

## Your Turn

Now it's time for you to define and fit a model for your data (in your notebook).

1. Select the target variable you want to predict. You can go back to the list of columns from your earlier commands to recall what it's called (hint: you've already worked with this variable). Save this to a new variable called y.
2. Create a **list** of the names of the predictors we will use in the initial model. Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):
 * LotArea
 * YearBuilt
 * 1stFlrSF
 * 2ndFlrSF
 * FullBath
 * BedroomAbvGr
 * TotRmsAbvGrd
3. Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name X.
4. Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model). Ensure you've done the relevant import so you can run this command.
5. Fit the model you have created using the data in X and the target data you saved above.
6. Make a few predictions with the model's predict command and print out the predictions.

In [None]:
# Select target variable to be predicted
y = df["SalePrice"]

# Select predictors
list_of_predictors = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']

# Create a predictors dataFrame
X = df[list_of_predictors]

# Import the decision tree
from sklearn.tree import DecisionTreeRegressor

# Define the model
iowa_model = DecisionTreeRegressor()

# Fit the model
iowa_model.fit(X, y)

In [None]:
# Making a few predictions ...
print("\n" + "Making predictions for the following 5 houses:")
print(X.head())

print("\n" + "The predictions are")
print(iowa_model.predict(X.head()))

print("\n" + "Compared to their real values:")
print(y.head())

# Level 1-5: Model Validation

## Your Turn

1. Use the train_test_split command to split up your data.
2. Fit the model with the training data
3. Make predictions with the validation predictors
4. Calculate the mean absolute error between your predictions and the actual target values for the validation data.
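
To see why a fixed seed makes the split in step 1 repeatable, here is a from-scratch sketch of a seeded shuffle-and-cut. The helper `split_rows` and its parameters are hypothetical names for illustration, and this is not how scikit-learn implements `train_test_split`; it only shows the idea.

```python
import random

def split_rows(rows, val_fraction=0.25, seed=0):
    """Shuffle row indices with a seeded generator, then cut off a validation slice."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle order
    n_val = int(round(len(rows) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

rows = list(range(20))
train_a, val_a = split_rows(rows, seed=0)
train_b, val_b = split_rows(rows, seed=0)

# Both calls produce identical splits because the seed is the same
print(train_a == train_b, val_a == val_b)
```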

In [None]:
# Split data into training and validation data
from sklearn.model_selection import train_test_split

# N.B. The split is based on a random number generator. Supplying a numeric
# value to the random_state argument guarantees we get the same split every
# time we run this script.

# Split for training and validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# Define the model
iowa_model = DecisionTreeRegressor()

# Fit the model with the training data
iowa_model.fit(train_X, train_y)

# Make predictions on the validation data
val_predictions = iowa_model.predict(val_X)

# Calculate and print the mean absolute error
from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(val_y, val_predictions))
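
For reference, the mean absolute error is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand; the function name and the three prices are made up for illustration.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices and predictions
actual_prices = [200000, 150000, 320000]
predicted_prices = [210000, 140000, 300000]

mae = mean_absolute_error_by_hand(actual_prices, predicted_prices)
print(mae)  # (10000 + 10000 + 20000) / 3
```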

# Level 1-6: Underfitting, Overfitting and Model Optimization

## Your Turn

In the near future, you'll be efficient writing functions like `get_mae` yourself. For now, just copy it over to your work area. Then use a for loop that tries different values of `max_leaf_nodes` and calls the `get_mae` function on each to find the ideal number of leaves for your Iowa data.

You should see that the ideal number of leaves for the Iowa data is less than the ideal number of leaves for the Melbourne data. Remember that a lower MAE is better.

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return mae

In [None]:
best_choice = float("inf")  # best MAE seen so far; start above any achievable value

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in range(5, 5000, 1):
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    
    if my_mae < best_choice:
        best_choice = my_mae
        print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
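
An alternative to tracking the running best inside the loop is to record every score and pick the minimum afterwards. The sketch below assumes a hypothetical `toy_mae` function in place of `get_mae`; its U shape (poor with too few leaves, poor again with too many) is invented purely to mimic underfitting and overfitting.

```python
def toy_mae(max_leaf_nodes):
    """Hypothetical stand-in for get_mae: U-shaped, lowest at 50 leaves."""
    return 20000 + (max_leaf_nodes - 50) ** 2

candidates = [5, 25, 50, 100, 250, 500]

# Score every candidate once, then select the tree size with the lowest MAE
scores = {n: toy_mae(n) for n in candidates}
best_leaves = min(scores, key=scores.get)

print(best_leaves, scores[best_leaves])
```

Keeping the full `scores` dictionary also makes it easy to plot MAE against tree size later.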

# Level 1-7: Random Forests

## Your Turn

Run the RandomForestRegressor on your data. You should see a big improvement over your best Decision Tree models.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(train_X, train_y)
iowa_predictions_random_forest = forest_model.predict(val_X)
print(mean_absolute_error(val_y, iowa_predictions_random_forest))

# Level 1-8: Submitting from a Kernel

In [None]:
"""
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Read the data
train = pd.read_csv('../input/train.csv')

# pull data into target (y) and predictors (X)
train_y = train.SalePrice
predictor_cols = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']

# Create training predictors data
train_X = train[predictor_cols]

my_model = RandomForestRegressor()
my_model.fit(train_X, train_y)
"""

In [None]:
"""
# Read the test data
test = pd.read_csv('../input/test.csv')
# Treat the test data in the same way as training data. In this case, pull same columns.
test_X = test[predictor_cols]
# Use the model to make predictions
predicted_prices = my_model.predict(test_X)
# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)
"""

In [None]:
"""
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)
"""

# Level 2-1: Handling Missing Values

## Your Turn
1) Find some columns with missing values in your dataset.

2) Use the Imputer class so you can impute missing values

3) Add columns with missing values to your predictors.

If you find the right columns, you may see an improvement in model scores. That said, the Iowa data doesn't have a lot of columns with missing values. So, whether you see an improvement at this point depends on some other details of your model.

Once you've added the Imputer, keep using those columns for future steps. In the end, it will improve your model (and in most other datasets, it is a big improvement).
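
As a rough picture of what mean imputation does, the sketch below fills missing values with the training mean and records which entries were filled. The helper name and the `LotFrontage` values are made up, and this is a simplification of what the Imputer class does, not its implementation.

```python
def impute_with_indicator(train_col, other_col):
    """Fill None in other_col with the mean of train_col; also flag what was filled."""
    observed = [v for v in train_col if v is not None]
    train_mean = sum(observed) / len(observed)  # learned from training data only
    filled = [train_mean if v is None else v for v in other_col]
    was_missing = [v is None for v in other_col]
    return filled, was_missing

# Hypothetical LotFrontage values; None marks a missing entry
lot_frontage_train = [60.0, None, 80.0, 70.0]
lot_frontage_val = [None, 65.0]

filled_train, flags_train = impute_with_indicator(lot_frontage_train, lot_frontage_train)
filled_val, flags_val = impute_with_indicator(lot_frontage_train, lot_frontage_val)

print(filled_train, flags_train)
print(filled_val, flags_val)
```

The validation column is filled with the mean learned from the training column, which mirrors calling `fit_transform` on the training data and `transform` on the validation data.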

In [None]:
# Import relevant libraries and dependencies
import pandas as pd

# save filepath to variable for easier access
iowa_train_path = '../input/train.csv'

# read the data and store data in DataFrame titled df
df = pd.read_csv(iowa_train_path)

# print a summary of the data
print(df.describe())

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

iowa_target = df.SalePrice
iowa_predictors = df.drop(['SalePrice'], axis=1)

# For the sake of keeping the example simple, we'll use only numeric predictors. 
iowa_numeric_predictors = iowa_predictors.select_dtypes(exclude=['object'])

print(iowa_numeric_predictors.columns)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iowa_numeric_predictors, 
                                                    iowa_target,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=42)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

In [None]:
cols_with_missing = [col for col in X_train.columns 
                                 if X_train[col].isnull().any()]

reduced_X_train = X_train.drop(cols_with_missing, axis=1)

reduced_X_test  = X_test.drop(cols_with_missing, axis=1)

# Finding (numeric) columns with missing values in the Iowa data set
print("Columns with missing data:")
print(iowa_numeric_predictors[cols_with_missing].columns)

print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

In [None]:
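# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer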

my_imputer = Imputer()
# See comment section about preserving the structure of the data ...
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_train.columns = X_train.columns

imputed_X_test = pd.DataFrame(my_imputer.transform(X_test))
imputed_X_test.columns = X_test.columns

print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

In [None]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

cols_with_missing = (col for col in X_train.columns 
                                 if X_train[col].isnull().any())
for col in cols_with_missing:
    imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()

# Imputation
my_imputer = Imputer()
# The imputer returns a NumPy array, so save the column names first and restore them afterwards
plus_columns = imputed_X_train_plus.columns
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(imputed_X_train_plus),
                                    columns=plus_columns)

imputed_X_test_plus = pd.DataFrame(my_imputer.transform(imputed_X_test_plus),
                                   columns=plus_columns)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

# Level 2-2: Using Categorical Data with One Hot Encoding

## Your Turn
Use one-hot encoding to allow categoricals in your course project. Then add some categorical columns to your X data. If you choose the right variables, your model will improve quite a bit. Once you've done that, return to Learning Machine Learning, where you can continue improving your model.
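
To make the idea of one-hot encoding concrete, here is a from-scratch sketch that turns one categorical column into one 0/1 column per distinct label. The function and its `prefix` parameter are hypothetical; `pd.get_dummies` does the real work in the cells that follow.

```python
def one_hot_encode(values, prefix):
    """Return one 0/1 column per distinct label, keyed as '<prefix>_<label>'."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [int(value == category) for value in values]
        for category in categories
    }

# Hypothetical sample of the RoofStyle column
roof_styles = ["Gable", "Hip", "Gable", "Flat"]
encoded = one_hot_encode(roof_styles, prefix="RoofStyle")

for name, column in encoded.items():
    print(name, column)
```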

In [None]:
# Read the data
import pandas as pd
train_data = pd.read_csv('../input/train.csv')
test_data = pd.read_csv('../input/test.csv')

# Drop houses where the target is missing
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# Since missing values aren't the focus of this tutorial, we use the simplest
# possible approach, which drops these columns. 
# For more detail (and a better approach) to missing values, see
# https://www.kaggle.com/dansbecker/handling-missing-values
cols_with_missing = [col for col in train_data.columns 
                                 if train_data[col].isnull().any()]                                  
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis=1)

# "cardinality" means the number of unique values in a column.
# We use it as our only way to select categorical columns here. This is convenient, though
# a little arbitrary.
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]

numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]

my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

In [None]:
print(train_predictors.info())

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

def get_mae(X, y):
    # multiply by -1 to get a positive MAE, since sklearn's convention returns a negative value
    return -1 * cross_val_score(RandomForestRegressor(50), 
                                X, y, 
                                scoring = 'neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left', 
                                                                    axis=1)
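
The effect of `align` with `join='left'` is easier to see on a tiny example. The sketch below uses two made-up frames in which the test data contains a category (`Shed`) that the training data lacks, and lacks two that the training data has.

```python
import pandas as pd

# Hypothetical miniature train/test sets with different RoofStyle categories
train_small = pd.DataFrame({"RoofStyle": ["Gable", "Hip", "Flat"]})
test_small = pd.DataFrame({"RoofStyle": ["Gable", "Shed"]})

train_encoded = pd.get_dummies(train_small)
test_encoded = pd.get_dummies(test_small)

# join="left" keeps exactly the training columns, in the training order
aligned_train, aligned_test = train_encoded.align(test_encoded, join="left", axis=1)

print(list(aligned_train.columns))
print(list(aligned_test.columns))
print(aligned_test)
```

Columns that exist only in the test data are dropped, and columns that exist only in the training data appear in the test frame filled with missing values. Those would still need filling (for example by passing `fill_value=0` to `align`) before a model can use them.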

# Level 2-3: Learning to use XGBoost

## Your Turn

Convert your model to use XGBoost.

Use early stopping to find a good value for n_estimators. Then re-estimate the model with all of your training data, and that value of n_estimators.
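
The logic behind early stopping can be sketched in a few lines: keep training while the validation error improves, and stop once it has failed to improve for a set number of rounds. The function name, the `patience` parameter and the error values below are all hypothetical, and XGBoost's own bookkeeping may differ in detail.

```python
def best_round_with_early_stopping(val_errors, patience):
    """Return the 1-based round with the lowest validation error, stopping once
    `patience` rounds have passed without improving on the best seen so far."""
    best_error = float("inf")
    best_round = 0
    for round_number, error in enumerate(val_errors, start=1):
        if error < best_error:
            best_error, best_round = error, round_number
        elif round_number - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round

# Hypothetical validation errors after each boosting round
errors = [30.0, 25.0, 22.0, 21.5, 21.8, 21.7, 21.9, 22.4, 23.0]
chosen_rounds = best_round_with_early_stopping(errors, patience=3)
print(chosen_rounds)
```

The round returned here plays the role of the `n_estimators` value you would then use when refitting on all of the training data.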

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
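# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer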

data = pd.read_csv('../input/train.csv')
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# .values replaces as_matrix(), which newer pandas versions no longer provide
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)

my_imputer = Imputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)

In [None]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
# verbose=False avoids printing out updates with each cycle
my_model.fit(train_X, train_y, verbose=False)

In [None]:
# make predictions
predictions = my_model.predict(test_X)

from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

In [None]:
my_model = XGBRegressor(n_estimators=1000)
my_model.fit(train_X, train_y, early_stopping_rounds=5, 
             eval_set=[(test_X, test_y)], verbose=False)

In [None]:
my_model = XGBRegressor(n_estimators=10000, learning_rate=0.01)
my_model.fit(train_X, train_y, early_stopping_rounds=10, 
             eval_set=[(test_X, test_y)], verbose=False)

In [None]:
# make predictions
predictions = my_model.predict(test_X)

from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

# Level 2-4: Partial Dependence Plots

## Your Turn

Pick three predictors in your project. Formulate a hypothesis about what the partial dependence plot will look like. Create the plots, and check the results against your hypothesis.


In [None]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
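# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer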
from sklearn.model_selection import train_test_split

data = pd.read_csv('../input/train.csv')
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# .values replaces as_matrix(), which newer pandas versions no longer provide
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)

my_imputer = Imputer()
train_X = my_imputer.fit_transform(train_X)
train_X = pd.DataFrame(train_X)
train_X.columns = X.columns

test_X = my_imputer.transform(test_X)
test_X = pd.DataFrame(test_X)
test_X.columns = X.columns

#print(train_X.info())


In [None]:
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence

# scikit-learn originally implemented partial dependence plots only for Gradient Boosting models
# this was due to an implementation detail, and a future release will support all model types.
my_model = GradientBoostingRegressor()
# fit the model as usual
my_model.fit(train_X, train_y)
# Here we make the plot
my_plots = plot_partial_dependence(my_model,
                                   features=['LotArea', 'Fireplaces', 'PoolArea'], # columns to plot, selected by name
                                   X=train_X,            # raw predictors data.
                                   feature_names=list(train_X.columns), # one name per column of X, in order
                                   grid_resolution=10) # number of values to plot on x axis
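
What a partial dependence plot shows can be sketched by brute force: fix one feature at a given value for every row, average the model's predictions, and repeat for each value on a grid. The helper and the `toy_model` below are hypothetical, and scikit-learn may compute the curve differently for gradient boosting models.

```python
def partial_dependence_by_hand(predict, rows, feature_index, grid):
    """For each grid value, overwrite one feature in every row and average the predictions."""
    averages = []
    for value in grid:
        modified = [row[:feature_index] + [value] + row[feature_index + 1:] for row in rows]
        predictions = [predict(row) for row in modified]
        averages.append(sum(predictions) / len(predictions))
    return averages

def toy_model(row):
    """Hypothetical fitted model: price rises with lot area and number of fireplaces."""
    lot_area, fireplaces = row
    return 50000 + 10 * lot_area + 5000 * fireplaces

# Hypothetical rows of [LotArea, Fireplaces]
rows = [[8000, 0], [9500, 1], [11000, 2]]
pd_fireplaces = partial_dependence_by_hand(toy_model, rows, feature_index=1, grid=[0, 1, 2])
print(pd_fireplaces)
```

Plotting the grid against these averages gives the curve for that feature; here it rises by 5000 per fireplace because the toy model is linear.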

# Level 2-5: Pipelines

## Your Turn

Take your modeling code and convert it to use pipelines. For now, you'll need to do one-hot encoding of categorical variables outside of the pipeline (i.e. before putting the data in the pipeline).
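
Conceptually, a pipeline chains a transformer and a model so that one `fit` call trains both and one `predict` call applies both. The toy classes below (`MeanFill`, `RatioModel`, `TinyPipeline`) are invented for illustration on a single feature and are not scikit-learn's implementation.

```python
class MeanFill:
    """Transformer: learns the mean of the observed values, then fills None with it."""
    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, values):
        return [self.mean_ if v is None else v for v in values]

class RatioModel:
    """Toy estimator: predicts y as a constant multiple of x."""
    def fit(self, xs, ys):
        self.ratio_ = sum(ys) / sum(xs)
        return self

    def predict(self, xs):
        return [self.ratio_ * x for x in xs]

class TinyPipeline:
    """Runs the transformer, then the model, for both fitting and predicting."""
    def __init__(self, transformer, model):
        self.transformer = transformer
        self.model = model

    def fit(self, xs, ys):
        filled = self.transformer.fit(xs).transform(xs)
        self.model.fit(filled, ys)
        return self

    def predict(self, xs):
        # The transformer reuses what it learned during fit
        return self.model.predict(self.transformer.transform(xs))

pipeline = TinyPipeline(MeanFill(), RatioModel()).fit([1.0, None, 3.0], [10.0, 20.0, 30.0])
preds = pipeline.predict([None, 4.0])
print(preds)
```

Because the fill value is learned inside `fit`, new data passed to `predict` is always treated with statistics from the training data only.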



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read Data
data = pd.read_csv('../input/train.csv')

cols_with_missing = [col for col in data.columns if data[col].isnull().any()]

data = data.drop(cols_with_missing, axis=1)

# Drop target to get predictors
predictors = data.drop(['SalePrice'], axis=1)

# Exclude objects
X_numeric = predictors.select_dtypes(exclude=['object'])
X_objects = predictors.select_dtypes(include=['object'])

one_hot_encoded_training_predictors = pd.get_dummies(X_objects)

#X = X_numeric + one_hot_encoded_training_predictors
X = pd.concat([X_numeric, one_hot_encoded_training_predictors], axis=1)

# Define target
y = data.SalePrice

# Train-Test split
train_X, test_X, train_y, test_y = train_test_split(X, y)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
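# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer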

my_pipeline = make_pipeline(Imputer(), RandomForestRegressor())

In [None]:
my_pipeline.fit(train_X, train_y)

predictions = my_pipeline.predict(test_X)

from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))

# Level 2-6: Cross-Validation

## Your Turn

Convert the code for your on-going project over from train-test split to cross-validation. Make sure to remove all code that divides your dataset into training and testing datasets. Leaving code you don't need any more would be sloppy.

Add or remove a predictor from your models. See the cross-validation score using both sets of predictors, and see how you can compare the scores.
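
Cross-validation replaces the single train-test split with several: each fold takes a turn as the validation set while the rest is used for training. The generator below is a hypothetical from-scratch sketch of contiguous, unshuffled folds; `cross_val_score` handles this for you in the cells that follow.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_rows % k folds get one extra row so every row is used exactly once
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

Each row lands in a validation fold exactly once, so the averaged score uses all of the data for both training and validation.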

In [12]:
import pandas as pd

# Read Data
data = pd.read_csv('../input/train.csv')

cols_with_missing = [col for col in data.columns if data[col].isnull().any()]

data = data.drop(cols_with_missing, axis=1)

# Drop target to get predictors
predictors = data.drop(['SalePrice'], axis=1)

# Exclude objects
X_numeric = predictors.select_dtypes(exclude=['object'])
X_objects = predictors.select_dtypes(include=['object'])

one_hot_encoded_training_predictors = pd.get_dummies(X_objects)

# X_numeric + one_hot_encoded_training_predictors would add element-wise (all NaN here), so concatenate column-wise instead
X = pd.concat([X_numeric, one_hot_encoded_training_predictors], axis=1)

# Define target
y = data.SalePrice

In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
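# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer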

my_pipeline = make_pipeline(Imputer(), RandomForestRegressor())

In [14]:
from sklearn.model_selection import cross_val_score

scores_numeric = cross_val_score(my_pipeline, X_numeric, y, scoring='neg_mean_absolute_error', cv=5)
scores_object = cross_val_score(my_pipeline, one_hot_encoded_training_predictors, y, scoring='neg_mean_absolute_error', cv=5)
scores_combined = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error', cv=5)

print("Numeric scores:", scores_numeric)
print("Object scores:", scores_object)
print("Combined scores:", scores_combined)

Numeric scores: [-19294.5890411  -18679.07465753 -18764.49246575 -17220.42534247
 -21139.03150685]
Object scores: [-27048.75884306 -31513.93325312 -33642.66630708 -28467.02611412
 -28587.48122895]
Combined scores: [-18611.03561644 -19121.68321918 -18443.63013699 -17364.38938356
 -21865.44726027]


In [15]:
print("Mean Absolute Error _ Numeric %.2f" %(-1 * scores_numeric.mean()))
print("Mean Absolute Error _ Objects %.2f" %(-1 * scores_object.mean()))
print("Mean Absolute Error _ Combined %.2f" %(-1 * scores_combined.mean()))

Mean Absolute Error _ Numeric 19019.52
Mean Absolute Error _ Objects 29851.97
Mean Absolute Error _ Combined 19081.24


# Level 2-7: Data Leakage

## Exercise

Review the data in your ongoing project. Are there any predictors that may cause leakage? As a hint, most datasets from Kaggle competitions don't have these variables. Once you get past those carefully curated datasets, this becomes a common issue.

In [16]:
import pandas as pd

# Read Data
data = pd.read_csv('../input/train.csv')

In [17]:
print(data.columns)

