Kaggle - First ML Model


*   Tutorial link: https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model/tutorial
*   We are going to examine the Iowa housing dataset in this notebook.



We can drag and drop the data files (CSV files) that we want to work with from our local drive into the Google Colab file panel (the folder icon on the left side of the Colab screen):
1.   Download the [Kaggle Iowa Housing Data](https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model/data) to your desktop
2.   Click the folder icon on the left side of the Colab screen, roughly halfway down
3.   Drag and drop the train.csv file from your desktop into the folder to upload it to Google Colab
4.   You will need to repeat this upload every time you start a new session with the notebook, because uploaded files do not persist once the Colab runtime resets
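
Because the upload has to be repeated each session, it can help to check that the file is present before reading it. The following is a minimal sketch; the helper name `require_file` is hypothetical and not part of the tutorial.

```python
from pathlib import Path


def require_file(path):
    """Return the path if the file exists, otherwise raise a helpful error.

    Hypothetical helper, not part of the Kaggle tutorial.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(
            f"{file_path} not found. Upload it to the Colab file panel first."
        )
    return file_path


# Example use (assumes train.csv has been uploaded):
# iowa_data = pd.read_csv(require_file('train.csv'))
```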

This script lists all the columns in the dataset.

In [21]:
import pandas as pd

iowa_data = pd.read_csv('train.csv')
iowa_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

We can check the result of the previous script by displaying the full DataFrame. Both scripts show 81 columns.

In [22]:
df = pd.read_csv('train.csv')
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125
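
The row and column counts can also be read directly instead of being counted from the display. This is a small sketch on a toy DataFrame (three columns, with values copied from the first three rows shown above); with the real data, the same calls would be made on `df`.

```python
import pandas as pd

# Toy stand-in for the real data: three rows and three columns only.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})

# .shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")

# The number of column labels matches the second element of .shape.
print(len(toy.columns) == n_cols)
```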


We can address missing data by dropping every row that contains a missing value, using this script

*   Dropping from 1460 rows to 0 rows shows that every row has at least one missing value, so this approach would discard the entire dataset


In [None]:
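# Drop every row that contains at least one missing value.
# Use a new name so iowa_data keeps all 1460 rows for the cells below.
iowa_data_complete = iowa_data.dropna(axis=0)
iowa_data_complete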

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
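
To see why no rows survive, it helps to count the missing values per column before dropping anything. The sketch below uses a toy DataFrame with four columns, with values copied from the last three rows displayed earlier; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in: Alley is missing in every row, Fence in one row.
toy = pd.DataFrame({
    "Id": [1457, 1458, 1459],
    "Alley": [np.nan, np.nan, np.nan],
    "Fence": ["MnPrv", "GdPrv", np.nan],
    "SalePrice": [210000, 266500, 142125],
})

# Count missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# axis=0 drops rows with any missing value; one always-empty column
# such as Alley is enough to remove every row.
rows_kept = toy.dropna(axis=0)

# axis=1 drops columns with any missing value instead, keeping all rows.
columns_kept = toy.dropna(axis=1)
print(len(rows_kept), list(columns_kept.columns))
```

Dropping columns rather than rows is one alternative, not necessarily the best one for a given model.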


The next script selects the SalePrice column, which we will use as the prediction target (conventionally called y).

In [23]:
y = iowa_data.SalePrice
y

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.
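
As a sketch of the "all columns except the target" case, the target column can be dropped and the rest kept as features. The toy DataFrame below uses values from the first three rows of the data; the variable names are illustrative. The numeric filter is an added assumption: scikit-learn's decision tree expects numeric input, so text columns such as MSZoning would need encoding before they could be used.

```python
import pandas as pd

# Toy stand-in for the real data.
toy = pd.DataFrame({
    "FullBath": [2, 2, 2],
    "BedroomAbvGr": [3, 3, 3],
    "MSZoning": ["RL", "RL", "RL"],
    "SalePrice": [208500, 181500, 223500],
})

# Every column except the target.
all_features = toy.drop(columns=["SalePrice"])

# Keep only the numeric columns among them.
numeric_features = all_features.select_dtypes(include="number")

print(list(all_features.columns))
print(list(numeric_features.columns))
```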

Our next script selects a few features and assigns them to the variable X.

In [28]:
iowa_features = ['FullBath', 'BedroomAbvGr']
X = iowa_data[iowa_features]
X

Unnamed: 0,FullBath,BedroomAbvGr
0,2,3
1,2,3
2,2,3
3,1,3
4,2,4
...,...,...
1455,2,3
1456,2,3
1457,2,4
1458,1,2


We can review the data we'll be using to predict house prices using the describe method and the head method, which shows the top few rows.

In [29]:
X.describe()

Unnamed: 0,FullBath,BedroomAbvGr
count,1460.0,1460.0
mean,1.565068,2.866438
std,0.550916,0.815778
min,0.0,0.0
25%,1.0,2.0
50%,2.0,3.0
75%,2.0,3.0
max,3.0,8.0


In [30]:
X.head()

Unnamed: 0,FullBath,BedroomAbvGr
0,2,3
1,2,3
2,2,3
3,1,3
4,2,4


Next, we import DecisionTreeRegressor from the scikit-learn library, define a decision tree model, and fit it to the features (X) and the target variable (y).

Remember that decision trees learn a set of if-then-else decision rules from the data.

In [31]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit model
iowa_model.fit(X, y)
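
The if-then-else rules a fitted tree has learned can be printed with scikit-learn's `export_text`. This sketch fits a separate toy model on the first five rows shown earlier, so the rules it prints are not those of `iowa_model`; the variable names are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# First five houses from the data, used here as a toy training set.
toy = pd.DataFrame({
    "FullBath": [2, 2, 2, 1, 2],
    "BedroomAbvGr": [3, 3, 3, 3, 4],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
features = ["FullBath", "BedroomAbvGr"]

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy[features], toy["SalePrice"])

# Render the learned splits as indented text rules.
rules = export_text(toy_model, feature_names=features)
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf shows the predicted value for houses that reach it.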

With the model fitted, we can make some predictions for the first five houses.

In [32]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(iowa_model.predict(X.head()))

Making predictions for the following 5 houses:
   FullBath  BedroomAbvGr
0         2             3
1         2             3
2         2             3
3         1             3
4         2             4
The predictions are
[213687.84650113 213687.84650113 213687.84650113 138216.20170455
 212340.12820513]
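
The first three houses share the same FullBath and BedroomAbvGr values and receive the same prediction. A likely explanation is that a fully grown regression tree cannot separate rows with identical features, so it predicts the mean training price for each feature combination. The sketch below checks this on a toy training set made of the first five rows; it is an illustration under that assumption, not a reproduction of the numbers above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# First five houses from the data, used here as a toy training set.
toy = pd.DataFrame({
    "FullBath": [2, 2, 2, 1, 2],
    "BedroomAbvGr": [3, 3, 3, 3, 4],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
features = ["FullBath", "BedroomAbvGr"]

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy[features], toy["SalePrice"])
tree_predictions = toy_model.predict(toy[features])

# Mean sale price of each (FullBath, BedroomAbvGr) combination,
# aligned row by row with the original data.
group_means = toy.groupby(features)["SalePrice"].transform("mean").to_numpy()

print(tree_predictions)
print(group_means)
print(np.allclose(tree_predictions, group_means))
```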
