This is the workbook that accompanies the "[Welcome to Data Science in R](https://www.kaggle.com/rtatman/welcome-to-data-science-with-r/)" lesson. You'll probably find it easier to complete the exercises if you read & follow the lesson itself. :)
___

# Table of Contents

* [Starting your machine learning project](#Starting-your-machine-learning-project)
* [Selecting and filtering data with the Tidyverse](#Selecting-and-filtering-data-with-the-Tidyverse)
* [Running your first model](#Running-your-first-model)
* [How do we know if our model is good?](#How-do-we-know-if-our-model-is-good?)
* [Underfitting/overfitting and improving your model](#Underfitting/overfitting-and-improving-your-model)
* [A different type of model: Random forests](#A-different-type-of-model:-Random-forests)

# Starting your machine learning project

I've started off by reading your data in for you. Make sure you run the first cell before you try to run any others!

In [1]:
# load in the tidyverse package
library(tidyverse)

# read the data and store it in a tibble
iowa_data <- read_csv("../input/train.csv") 

# make sure Condition1 is a factor & not a char
iowa_data$Condition1 <- as.factor(iowa_data$Condition1)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
✔ tibble  1.4.2          ✔ dplyr   0.7.4     
✔ tidyr   0.8.0          ✔ stringr 1.2.0     
✔ readr   1.2.0          ✔ forcats 0.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Parsed with column specification:
cols(
  .default = col_character(),
  Id = col_double(),
  MSSubClass = col_double(),
  LotFrontage = col_double(),
  LotArea = col_double(),
  OverallQual = col_double(),
  OverallCond = col_double(),
  YearBuilt = col_double(),
  YearRemodAdd = col_double(),
  MasVnrArea = col_double(),
  BsmtFinSF1 = col_double(),
  BsmtFinSF2 = col_double(),
  BsmtUnfSF = col_double(),
  TotalBsmtSF = col_double(),
  `1stFlrSF` = col_double(),
  `2ndFlrSF` = col_double(),
  LowQualFinSF = col_double(),
  GrLivArea = col_double(),
  BsmtFullBath = col

In [3]:
# Your turn: summarize the iowa_data dataframe
summary(iowa_data)

       Id           MSSubClass      MSZoning          LotFrontage    
 Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
 1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
 Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
 Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
 3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
 Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
                                                     NA's   :259     
    LotArea          Street             Alley             LotShape        
 Min.   :  1300   Length:1460        Length:1460        Length:1460       
 1st Qu.:  7554   Class :character   Class :character   Class :character  
 Median :  9478   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 10517                                                           
 3rd Qu.: 11602                                                  

# Running your first model

Now it's time for you to define and fit a model for your data.
1. Select the target variable you want to predict. You can get a list of the columns in a data frame using the function names(), which is done for you in the cell below.
2. Fit a model that can predict your target variable using the following predictors: 
    * LotArea
    * YearBuilt
    * Condition1 (how close to the main road the house is)
    * FullBath
    * BedroomAbvGr
    * TotRmsAbvGrd

3. Make a few predictions with the predict() function and print them out.
4. Optional: Plot the decision tree.
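
The four steps above can be sketched end to end on a dataset that ships with R. This is only an illustration, assuming the built-in `mtcars` data as a stand-in for the Iowa housing data; `toy_fit` and `toy_preds` are hypothetical names that don't appear in the lesson.

```r
# A sketch of the select/fit/predict/plot steps, using R's built-in mtcars
# data in place of the Iowa housing data (an assumption for illustration)
library(rpart)

# step 1: the target here is mpg; names() lists the available columns
names(mtcars)

# step 2: fit a decision tree using a few predictors
toy_fit <- rpart(mpg ~ wt + hp + cyl, data = mtcars)

# step 3: predict for the first few rows and print the results
toy_preds <- predict(toy_fit, head(mtcars))
print(toy_preds)

# step 4 (optional): draw the tree, then add the split labels
plot(toy_fit, uniform = TRUE)
text(toy_fit, cex = 0.8)
```

For the workbook itself you would swap in `iowa_data`, `SalePrice` and the predictors listed above.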

In [4]:
# Your turn: build a model to predict housing prices in Iowa

# library for building decision trees
library(rpart)

# print a list of the column names
names(iowa_data)

In [5]:
#fit the model
fit <- rpart(SalePrice ~ LotArea + YearBuilt + Condition1 + FullBath + BedroomAbvGr + TotRmsAbvGrd, data = iowa_data)

In [6]:
print("Making predictions for the following 6 houses:")
print(head(iowa_data))

print("The predictions are")
print(predict(fit, head(iowa_data)))

print("Actual price")
print(head(iowa_data$SalePrice))

[1] "Making predictions for the following 6 houses:"
# A tibble: 6 x 81
     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
  <dbl>      <dbl> <chr>          <dbl>   <dbl> <chr>  <chr> <chr>   
1  1.00       60.0 RL              65.0    8450 Pave   <NA>  Reg     
2  2.00       20.0 RL              80.0    9600 Pave   <NA>  Reg     
3  3.00       60.0 RL              68.0   11250 Pave   <NA>  IR1     
4  4.00       70.0 RL              60.0    9550 Pave   <NA>  IR1     
5  5.00       60.0 RL              84.0   14260 Pave   <NA>  IR1     
6  6.00       50.0 RL              85.0   14115 Pave   <NA>  IR1     
# ... with 73 more variables: LandContour <chr>, Utilities <chr>,
#   LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <fct>,
#   Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
#   OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
#   RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>

In [7]:
# package with the mae function
library(modelr)

# get the mean absolute error for our model
mae(model = fit, data = iowa_data)
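
The mean absolute error (MAE) is the average size of the prediction errors, ignoring whether each prediction was too high or too low. As a rough sketch of the arithmetic, here it is computed by hand on made-up numbers; `actual`, `predicted` and `manual_mae` are hypothetical names, and the prices are invented for illustration.

```r
# made-up sale prices and predictions, for illustration only
actual    <- c(200000, 150000, 320000, 180000)
predicted <- c(210000, 140000, 300000, 185000)

# absolute size of each error, ignoring its direction
abs_errors <- abs(actual - predicted)

# MAE is the mean of those absolute errors
manual_mae <- mean(abs_errors)
print(manual_mae)
```

Here the individual errors are 10000, 10000, 20000 and 5000, so the MAE is 11250: on average, a prediction is off by that many dollars.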

# How do we know if our model is good?

In [11]:
# Your turn: split your training data into test & training sets
# split our data so that 40% is in the test set and 60% is in the training set
splitData <- resample_partition(iowa_data, c(test = 0.4, train = 0.6))

# how many cases are in test & training set? 
lapply(splitData, dim)
# Fit a new model to your training set...

fit2 <- rpart(SalePrice ~ LotArea + YearBuilt + Condition1 + FullBath + BedroomAbvGr + TotRmsAbvGrd, data = splitData$train)

# get the mean absolute error for our new model, based on our test data
mae(model = fit2, data = splitData$test)
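
The split is drawn at random, so the rows in each set (and the resulting MAE) will differ from run to run. One common way to make a split repeatable is to fix the random seed first. The sketch below shows the idea with base R's sample() on the built-in `mtcars` data rather than with resample_partition() on the Iowa data; the seed value and the variable names are arbitrary choices for illustration.

```r
# fix the random seed so the same rows are drawn on every run
# (42 is an arbitrary choice)
set.seed(42)

n <- nrow(mtcars)

# randomly pick 60% of the row indices for training
train_idx <- sample(n, size = round(0.6 * n))

# rows that were picked go to training, all remaining rows go to testing
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]

# the two sets together account for every row, with no overlap
print(c(train = nrow(train_set), test = nrow(test_set)))
```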





# Underfitting/overfitting and improving your model

Use a for loop that tries different values of *maxdepth* and calls the *get_mae* function on each to find the ideal maximum tree depth for your Iowa data.

In [13]:
# a function to get the mean absolute error for a given max depth. You should pass in
# the target as the name of the target column and the predictors as a vector where
# each item in the vector is the name of a column
get_mae <- function(maxdepth, target, predictors, training_data, testing_data){
    
    # turn the predictors & target into a formula to pass to rpart()
    predictors <- paste(predictors, collapse="+")
    formula <- as.formula(paste(target,"~",predictors,sep = ""))
    
    # build our model
    model <- rpart(formula, data = training_data,
                   control = rpart.control(maxdepth = maxdepth))
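    # get the MAE on the testing data (stored under a different name so the
    # result doesn't shadow the mae() function)
    mae_value <- mae(model, testing_data)
    return(mae_value)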
}

In [14]:
# Your turn: use the get_mae function to find the maxdepth that leads to the 
# lowest mean absolute error for this dataset
# target & predictors to feed into our formula
target <- "SalePrice"
predictors <-  c("LotArea", "YearBuilt", "Condition1", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd")

# get the MAE for maxdepths between 1 & 10
for(i in 1:10){
    # store the result under a name that doesn't clash with the mae() function
    current_mae <- get_mae(maxdepth = i, target = target, predictors = predictors,
                           training_data = splitData$train, testing_data = splitData$test)
    print(glue::glue("Maxdepth: ",i,"\t MAE: ",current_mae))
}

Maxdepth: 1	 MAE: 46033.335833377
Maxdepth: 2	 MAE: 41143.1294116184
Maxdepth: 3	 MAE: 37006.8686716587
Maxdepth: 4	 MAE: 36327.6225125628
Maxdepth: 5	 MAE: 35816.0910402526
Maxdepth: 6	 MAE: 35816.0910402526
Maxdepth: 7	 MAE: 35816.0910402526
Maxdepth: 8	 MAE: 35816.0910402526
Maxdepth: 9	 MAE: 35816.0910402526
Maxdepth: 10	 MAE: 35816.0910402526
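
In the output above the MAE stops improving at a maxdepth of 5, so the deeper settings add nothing. Rather than reading the best value off the printed list, you could collect the errors in a vector and pick the smallest. The following is a minimal sketch of that idea, assuming the built-in `mtcars` data and a hand-computed MAE instead of the Iowa data and get_mae(); `depth_mae`, `maes` and `best_depth` are hypothetical names.

```r
library(rpart)

# arbitrary seed so the split is repeatable
set.seed(1)
train_idx <- sample(nrow(mtcars), size = 22)
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]

# hypothetical helper: test-set MAE for a tree with a given maximum depth
depth_mae <- function(depth) {
    model <- rpart(mpg ~ wt + hp + cyl, data = train_set,
                   control = rpart.control(maxdepth = depth))
    mean(abs(test_set$mpg - predict(model, test_set)))
}

# compute the MAE for each candidate depth
depths <- 1:5
maes <- sapply(depths, depth_mae)
print(data.frame(maxdepth = depths, mae = maes))

# which.min() returns the position of the first minimum, so ties are
# resolved in favour of the shallowest tree
best_depth <- depths[which.min(maes)]
print(best_depth)
```

Preferring the shallowest tree among ties is a deliberate choice here: a simpler model with the same test error is usually the safer pick.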


# A different type of model: Random forests

Now it's your turn to fit a random forest to your data. You'll need to load the randomForest package first, so be sure to run the first cell in this section before you call the randomForest() function, or you'll get an error!

In [15]:
# load the library we'll use for random forests
library(randomForest)

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin



In [16]:
# Your turn: Train a random forest using the same features as you used
# to train your original decision tree.
# fit a random forest model to our training set
fitRandomForest <- randomForest(SalePrice ~ LotArea + YearBuilt + Condition1 + FullBath + BedroomAbvGr + TotRmsAbvGrd, data = splitData$train)

# get the mean absolute error for our new model, based on our test data
mae(model = fitRandomForest, data = splitData$test)

# Check out the MAE. Did you see an improvement over your original model?