**[Machine Learning Micro-Course Home Page](https://www.kaggle.com/learn/machine-learning)**

---


# Introduction
Machine learning competitions are a great way to improve your data science skills and measure your progress. 

In this exercise, you will create and submit predictions for a Kaggle competition. You can then refine your model (e.g. by adding features) and see how you stack up against others taking this micro-course.

The steps in this notebook are:
1. Build a Random Forest model with all of your data (**X** and **y**)
2. Read in the "test" data, which doesn't include values for the target.  Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.

## Recap
Here's the code you've written so far. Start by running it again.

In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from learntools.core import *



# Path of the file to read. We changed the directory structure to simplify submitting to a competition
iowa_file_path = '../input/train.csv'
iowa_test_file_path = '../input/test.csv'

train_data = pd.read_csv(iowa_file_path)
test_data = pd.read_csv(iowa_test_file_path)
# Create target object and call it y
y = train_data.SalePrice
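# Separate the features from the target
train_features = train_data.drop(['SalePrice'], axis = 1)
# Earlier lessons used a hand-picked feature list; this notebook uses every column instead
#features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
#X = train_data[features]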

In [2]:
# fill in missing numeric values
from sklearn.impute import SimpleImputer

#Impute
train_data_num = train_features.select_dtypes(exclude=['object'])
test_data_num = test_data.select_dtypes(exclude=['object'])
imputer = SimpleImputer()
train_num_cleaned = imputer.fit_transform(train_data_num)
test_num_cleaned = imputer.transform(test_data_num)

# fit_transform/transform return NumPy arrays, so rebuild DataFrames and restore the column names
train_num_cleaned = pd.DataFrame(train_num_cleaned)
test_num_cleaned = pd.DataFrame(test_num_cleaned)

train_num_cleaned.columns = train_data_num.columns
test_num_cleaned.columns = test_data_num.columns
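
As a minimal sketch of what the imputation cell does, the toy example below (made-up values, not the competition data) shows `SimpleImputer` with its default `strategy="mean"` learning column means from the training frame and reusing them on the test frame. Passing `columns=` and `index=` restores the labels that the returned NumPy array lacks.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frames; the values are invented for illustration only
train_num = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                          "LotArea": [8000.0, 9000.0, 10000.0]})
test_num = pd.DataFrame({"LotFrontage": [np.nan], "LotArea": [np.nan]})

imputer = SimpleImputer()  # default strategy is "mean"

# Fit on the training data only, then apply the same means to the test data
train_filled = pd.DataFrame(imputer.fit_transform(train_num),
                            columns=train_num.columns, index=train_num.index)
test_filled = pd.DataFrame(imputer.transform(test_num),
                           columns=test_num.columns, index=test_num.index)

print(train_filled)  # the missing LotFrontage becomes 70.0, the mean of 60 and 80
print(test_filled)   # filled with the training means: 70.0 and 9000.0
```

Fitting on the training data and only transforming the test data keeps information from the test set out of the imputation statistics.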

In [3]:
# string columns: transform to dummies
train_data_str = train_data.select_dtypes(include=['object'])
test_data_str = test_data.select_dtypes(include=['object'])
train_str_dummy = pd.get_dummies(train_data_str)
test_str_dummy = pd.get_dummies(test_data_str)
train_dummy, test_dummy = train_str_dummy.align(test_str_dummy, 
                                                join = 'left', 
                                                axis = 1)
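
The toy sketch below (invented rows, with a column name borrowed from the data set) illustrates why `align` with `join='left'` can leave all-NaN columns in the test frame: a category that appears only in the training data has no dummy column in the test data. Passing `fill_value=0` to `align` is one alternative to dropping those columns later. It is shown here as an option, not as what this notebook does.

```python
import pandas as pd

# "ClyTile" occurs in the toy training data but never in the toy test data
train = pd.DataFrame({"RoofMatl": ["CompShg", "ClyTile", "CompShg"]})
test = pd.DataFrame({"RoofMatl": ["CompShg", "CompShg"]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# join='left' keeps the training columns; columns absent from test become NaN
_, test_left = train_d.align(test_d, join="left", axis=1)
missing_cols = [c for c in test_left.columns if test_left[c].isnull().any()]
print(missing_cols)

# Alternative: fill the absent dummy columns with 0 instead of NaN
_, test_filled = train_d.align(test_d, join="left", axis=1, fill_value=0)
print(test_filled)
```

Filling with 0 keeps the column in both frames, which encodes "this category was not observed" rather than discarding the feature.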

In [4]:
print(train_num_cleaned.columns)
print(train_num_cleaned.index)
print(test_num_cleaned.columns)
print(test_num_cleaned.index)
print(train_dummy.columns)
print(train_dummy.index)
print(test_dummy.columns)
print(test_dummy.index)

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold'],
      dtype='object')
RangeIndex(start=0, stop=1460, step=1)
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'F

In [5]:
# both objects are already DataFrames (rebuilt after imputing), so this re-wrap is a no-op
train_num_cleaned = pd.DataFrame(train_num_cleaned)
test_num_cleaned = pd.DataFrame(test_num_cleaned)

In [6]:
# joining numeric and string data
train_all_clean = pd.concat([train_num_cleaned, train_dummy], axis = 1)
test_all_clean = pd.concat([test_num_cleaned, test_dummy], axis = 1)

In [7]:
# detect NaN in already cleaned test data 
# (there could be completely empty columns)
cols_with_missing = [col for col in test_all_clean.columns
                                if test_all_clean[col].isnull().any()]
for col in cols_with_missing:
    print(col, test_all_clean[col].describe())

Utilities_NoSeWa count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: Utilities_NoSeWa, dtype: float64
Condition2_RRAe count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: Condition2_RRAe, dtype: float64
Condition2_RRAn count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: Condition2_RRAn, dtype: float64
Condition2_RRNn count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: Condition2_RRNn, dtype: float64
HouseStyle_2.5Fin count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: HouseStyle_2.5Fin, dtype: float64
RoofMatl_ClyTile count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: RoofMatl_ClyTile, dtype: float64
RoofMatl_Membran count    0.0
mean     NaN
s

In [8]:
# since there are empty columns in test we need to drop them in train and test
train_all_clean_no_nan = train_all_clean.drop(cols_with_missing, axis = 1)
test_all_clean_no_nan = test_all_clean.drop(cols_with_missing, axis = 1)

In [9]:
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(train_all_clean_no_nan, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))


Validation MAE when not specifying max_leaf_nodes: 26,207
Validation MAE for best value of max_leaf_nodes: 25,771
Validation MAE for Random Forest Model: 18,688
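
For reference, the validation scores above are mean absolute errors: the average of |actual price − predicted price|. The sketch below uses three made-up prices to show that the manual formula matches `mean_absolute_error`, and that the metric gives the same value whichever argument comes first.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200_000, 150_000, 310_000])
y_pred = np.array([210_000, 140_000, 300_000])

# MAE by hand: average absolute difference between truth and prediction
manual_mae = np.mean(np.abs(y_true - y_pred))

# scikit-learn's signature is mean_absolute_error(y_true, y_pred)
sklearn_mae = mean_absolute_error(y_true, y_pred)

print("{:,.0f}".format(manual_mae))  # every error is 10,000, so this prints 10,000
```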


# Creating a Model For the Competition

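Build a Random Forest model and train it on all of the training data: the cleaned features (`train_all_clean_no_nan`, which plays the role of **X**) and the target **y**.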

In [10]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor()

# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(train_all_clean_no_nan, y)




RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

# Make Predictions
The "test" data was read in and cleaned above. Apply your model to it to make predictions.

In [11]:
test_X = test_all_clean_no_nan

# make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(test_X)

In [12]:
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
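
Before submitting, it can help to sanity-check the file. The sketch below uses a hypothetical helper, `check_submission` (not part of Kaggle or `learntools`), and made-up Ids and prices. It assumes the competition expects exactly the `Id` and `SalePrice` columns written above, with one row per test Id.

```python
import io
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Hypothetical helper: raise AssertionError if the frame looks malformed."""
    assert list(submission.columns) == ["Id", "SalePrice"], "unexpected columns"
    assert len(submission) == len(expected_ids), "row count differs from test data"
    assert submission["Id"].tolist() == expected_ids.tolist(), "Ids missing or reordered"
    assert submission["SalePrice"].notnull().all(), "missing predictions"

# Toy stand-ins for test_data.Id and test_preds
toy_ids = pd.Series([1461, 1462, 1463])
toy_output = pd.DataFrame({"Id": toy_ids,
                           "SalePrice": [169000.0, 187500.0, 183000.0]})

# Write to an in-memory buffer instead of disk, then read it back as Kaggle would
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

check_submission(round_trip, toy_ids)  # passes silently when the file is well formed
```

In the notebook itself, the equivalent call would read `submission.csv` back with `pd.read_csv` and compare it against `test_data.Id`.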

# Test Your Work
After filling in the code above:
1. Click the **Commit and Run** button. 
2. After your code has finished running, click the small double brackets **<<** in the upper left of your screen.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
3. Go to the output tab at top of your screen. Select the button to submit your file to the competition.  
4. If you want to keep working to improve your model, select the edit button. Then you can change your model and repeat the process.

Congratulations, you've started competing in Machine Learning competitions.

# Continuing Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  Look at the list of columns and think about what might affect home prices.  Some features will cause errors because of issues like missing values or non-numeric data types. 
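
One way to spot such features before adding them is to list the columns that contain missing values or are not numeric. The sketch below runs on a toy frame with invented rows; the same two expressions should work on the training data loaded earlier.

```python
import numpy as np
import pandas as pd

# Toy frame: one clean numeric column, one with a gap, one text column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Columns with at least one missing value (these need imputation or dropping)
cols_with_missing = [c for c in toy.columns if toy[c].isnull().any()]

# Columns that are not numeric (these need encoding, e.g. pd.get_dummies)
non_numeric_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

print(cols_with_missing)
print(non_numeric_cols)
```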

Level 2 of this micro-course will teach you how to handle these types of features. You will also learn to use **xgboost**, a gradient boosting library that often gives better accuracy than Random Forest on tabular data like this.


# Other Micro-Courses
The **[Pandas Micro-Course](https://kaggle.com/Learn/Pandas)** will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** micro-course, where you will build models with better-than-human level performance at computer vision tasks.


