Kaggle - Validate Your First ML Model

*  In this notebook we begin to [validate a model built on the Iowa dataset](https://www.kaggle.com/code/dansbecker/model-validation/data); the full process is split across several notebooks
*  We are going to use the [pandas library](https://pandas.pydata.org/docs/) (version 2.1.4, dated Dec 08, 2023)
*  We are also going to use the [scikit-learn library](https://scikit-learn.org/stable/) (version 1.3.2, dated October 2023)

We can drag and drop the data files (CSV files) that we want to work with from our local drive into the Google Colab file browser (the folder icon on the left side of the Colab screen):
1.   Download the [Kaggle Iowa Housing Training Dataset](https://www.kaggle.com/code/dansbecker/model-validation/data) to your desktop
2.   Click on the folder icon on the left side of the Colab screen, approximately halfway down
3.   Drag and drop the train.csv file from your desktop into the folder to upload it to Google Colab
4.   You will need to repeat this upload every time you start a new Colab session with the notebook
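
Because the upload has to be repeated, it can help to confirm that the file is actually present before trying to load it. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_files` is hypothetical (it is not part of Colab, pandas, or scikit-learn), and it assumes the uploaded file sits in the notebook's current working directory.

```python
from pathlib import Path


def find_missing_files(filenames, directory="."):
    """Return the names in `filenames` that are not present in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


# Check for the training file before calling pd.read_csv on it.
missing = find_missing_files(["train.csv"])
if missing:
    print("Upload these files before continuing:", missing)
else:
    print("All data files are present.")
```

If the message lists train.csv, repeat the drag-and-drop steps above and rerun the cell.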

Our first script does the following:

*   Imports the pandas and scikit-learn libraries
*   Loads the data
*   Breaks out the columns in the dataset for our review

In [1]:
# Import pandas and the scikit-learn decision tree regressor
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Load data
home_data = pd.read_csv('train.csv')

# Examine the columns in the dataset
home_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

Our next script does the following:

*   Identifies the target of the analysis and the features that we believe affect it
*   Creates a [decision tree](https://scikit-learn.org/stable/modules/tree.html) model (a `DecisionTreeRegressor`)
*   Fits the model to the data with the estimator's [.fit()](https://scikit-learn.org/stable/developers/develop.html) method
*   Produces predictions with [.predict()](https://scikit-learn.org/stable/developers/develop.html)

In [2]:
# Choose target and features
y = home_data.SalePrice
feature_columns = ['LotArea',
                   'YearBuilt',
                   '1stFlrSF',
                   '2ndFlrSF',
                   'FullBath',
                   'BedroomAbvGr',
                   'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()

# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())

First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]
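
The predictions above match the actual prices exactly because the model is being scored on the same rows it was trained on, and an unrestricted decision tree can simply memorize them. The sketch below illustrates why a held-out validation set gives a more honest picture. It is an illustration under stated assumptions, not this notebook's method: the data is synthetic (randomly generated, not the Iowa dataset), and the names `demo_X`, `demo_y`, `in_sample_mae`, and `validation_mae` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the Iowa dataset): two features and a noisy price.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "LotArea": rng.uniform(2000, 20000, size=200),
    "YearBuilt": rng.integers(1900, 2010, size=200),
})
demo_y = (10 * demo_X["LotArea"]
          + 500 * (demo_X["YearBuilt"] - 1900)
          + rng.normal(0, 20000, size=200))

# Hold out part of the data so the model is scored on rows it has never seen.
train_X, val_X, train_y, val_y = train_test_split(demo_X, demo_y, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)

# Mean absolute error on the training rows versus the held-out rows.
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
validation_mae = mean_absolute_error(val_y, model.predict(val_X))

print("In-sample MAE:", in_sample_mae)
print("Validation MAE:", validation_mae)
```

On data like this, the in-sample error is expected to be essentially zero while the validation error is much larger, which is why in-sample predictions alone say little about how the model will perform on new homes.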
