# Basics of creating predictions for Kaggle competitions and submitting them

This notebook uses the house prices beginner competition (https://www.kaggle.com/c/home-data-for-ml-course). It was run in a Kaggle cloud instance, but you can run it in a standard Jupyter notebook environment if you prefer (you would need to update the file paths accordingly).

We begin by importing the data reading libraries and optionally printing the contents of Kaggle's input directory.

In [1]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/home-data-for-ml-course/train.csv
/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz
/kaggle/input/home-data-for-ml-course/data_description.txt
/kaggle/input/home-data-for-ml-course/test.csv.gz
/kaggle/input/home-data-for-ml-course/test.csv
/kaggle/input/home-data-for-ml-course/train.csv.gz
/kaggle/input/home-data-for-ml-course/sample_submission.csv


Next we read in the training and test files, using the Id column as the index.

In [2]:
train = pd.read_csv('../input/home-data-for-ml-course/train.csv', index_col='Id')
test = pd.read_csv('../input/home-data-for-ml-course/test.csv', index_col='Id')

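For this next step I create a list of all the columns with integer or floating point values. By default, describe() summarizes only the numeric columns, so its column list is exactly that set.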

In [3]:
cols = list(train.describe().columns)
train[cols].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   LotFrontage    1201 non-null   float64
 2   LotArea        1460 non-null   int64  
 3   OverallQual    1460 non-null   int64  
 4   OverallCond    1460 non-null   int64  
 5   YearBuilt      1460 non-null   int64  
 6   YearRemodAdd   1460 non-null   int64  
 7   MasVnrArea     1452 non-null   float64
 8   BsmtFinSF1     1460 non-null   int64  
 9   BsmtFinSF2     1460 non-null   int64  
 10  BsmtUnfSF      1460 non-null   int64  
 11  TotalBsmtSF    1460 non-null   int64  
 12  1stFlrSF       1460 non-null   int64  
 13  2ndFlrSF       1460 non-null   int64  
 14  LowQualFinSF   1460 non-null   int64  
 15  GrLivArea      1460 non-null   int64  
 16  BsmtFullBath   1460 non-null   int64  
 17  BsmtHalfBath   1460 non-null   int64  
 18  FullBath
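As a side note, the same column list can be built more explicitly with `select_dtypes`. The sketch below uses a small toy frame (the values are made up for illustration) rather than the competition data, and simply shows that both approaches pick the same columns.

```python
import pandas as pd

# Toy frame standing in for train; the values are illustrative only.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],
    "Street": ["Pave", "Pave", "Grvl"],
})

# describe() summarizes numeric columns by default, so its columns are the numeric ones.
cols_from_describe = list(df.describe().columns)

# select_dtypes states the same intent directly.
cols_from_dtypes = list(df.select_dtypes(include="number").columns)

print(cols_from_describe)
print(cols_from_dtypes)
```

Either form works here; `select_dtypes` avoids computing summary statistics that are then thrown away.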

Here, for the sake of simplicity, we fill in 0 for all the missing values. With the exception of garage year built, where 0 is not a plausible year, this is probably the best method of imputation.

In [4]:
train = train[cols].fillna(0)
# 17701train['SalePrice'] = train['SalePrice'].astype('float32')
cols.remove('SalePrice')
test = test[cols].fillna(0)
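To illustrate why the garage year column is the exception, here is a minimal sketch on a toy frame. The column name `GarageYrBlt` and the median fill are assumptions made for illustration, not what this notebook does; the notebook itself fills every column with 0.

```python
import pandas as pd

# Toy data; GarageYrBlt is assumed to be the garage year column.
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0],
    "GarageYrBlt": [2003.0, None, 1976.0],
})

# Zero fill, as in the cell above: the missing garage year becomes the year 0.
zero_filled = toy.fillna(0)

# Hypothetical alternative: fill only GarageYrBlt with its median, 0 elsewhere.
median_year = toy["GarageYrBlt"].median()
mixed = toy.fillna({"GarageYrBlt": median_year}).fillna(0)

print(zero_filled)
print(mixed)
```

If you try something like this on the real data, compute the fill value from the training set and reuse it for the test set.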

Here I simply change the data type of each column to 32-bit floating point. In testing this improved accuracy, though I have no clue why.

In [5]:
for col in cols:
    train[col] = train[col].astype('float32')
    test[col] = test[col].astype('float32')
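The loop above can also be written as a single `astype` call with a column-to-dtype mapping. The sketch below shows this on a toy frame with made-up values. One possibly relevant detail: as far as I know, scikit-learn's tree-based models convert their input to float32 internally, so the accuracy difference observed above may simply be run-to-run noise from the unseeded split and forest rather than an effect of the cast.

```python
import numpy as np
import pandas as pd

# Toy frame with one integer and one float column.
toy = pd.DataFrame({"LotArea": [8450, 9600], "LotFrontage": [65.0, 80.0]})
num_cols = ["LotArea", "LotFrontage"]

# Convert all listed columns in one call instead of looping over them.
toy = toy.astype({col: "float32" for col in num_cols})

print(toy.dtypes)
```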

Here we create our training and validation sets. Note that X_test and y_test below hold the validation split, not the competition test data.

In [6]:
from sklearn.model_selection import train_test_split

X = train[cols]
y = train.SalePrice

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
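Because the split above has no fixed seed, each run produces a different validation set and therefore a slightly different score. A minimal sketch of a repeatable split follows; it uses synthetic data, and `random_state=0` is an arbitrary choice of mine, not something from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Any fixed integer makes the split repeatable across runs.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_val.shape)
```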

Now we create and train a random forest model. n_estimators (the number of trees in the forest) is set to 60 as my rough estimate of the capacity the problem needs; I didn't experiment with this hyperparameter, so it may be wildly wrong. We also check the validation mean absolute error.

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rfr = RandomForestRegressor(n_estimators=60)
rfr.fit(X_train, y_train)
preds = rfr.predict(X_test)
print('MAE for validation set is:', mean_absolute_error(y_test, preds))

MAE for validation set is: 17708.77492117852
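Since n_estimators was not tuned above, one simple way to explore it is to fit the model with a few values and compare validation MAE. The sketch below does this on synthetic data with fixed seeds so the comparison is repeatable. The candidate values (10, 60, 200) and the data-generating formula are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the house price data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] * 3.0 + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit one forest per candidate tree count and record the validation MAE.
results = {}
for n in (10, 60, 200):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    results[n] = mean_absolute_error(y_val, model.predict(X_val))

for n, mae in results.items():
    print(n, round(mae, 3))
```

In general, adding trees mostly reduces the variance of the forest's predictions at the cost of training time, while parameters such as max_depth control how complex each individual tree can be.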


In [8]:
submission = pd.read_csv('../input/home-data-for-ml-course/sample_submission.csv', index_col='Id')
submission

Id,SalePrice
1461,169277.052498
1462,187758.393989
1463,183583.683570
1464,179317.477511
1465,150730.079977
...,...
2915,167081.220949
2916,164788.778231
2917,219222.423400
2918,184924.279659


Next we predict on the test data and save the predictions to a dataframe object.

In [9]:
_t = test.copy()
_t['SalePrice'] = rfr.predict(_t)
out = _t[['SalePrice']]
out

Id,SalePrice
1461,130742.500000
1462,155088.333333
1463,180996.250000
1464,183562.500000
1465,199342.016667
...,...
2915,87458.216667
2916,92140.000000
2917,161234.083333
2918,113100.716667


Now we write this dataframe to a CSV file.

In [10]:
out.to_csv('submission.csv')

In the two cells below we check that the format of our file matches that of the sample submission.

In [11]:
pd.read_csv('./submission.csv')

,Id,SalePrice
0,1461,130742.500000
1,1462,155088.333333
2,1463,180996.250000
3,1464,183562.500000
4,1465,199342.016667
...,...,...
1454,2915,87458.216667
1455,2916,92140.000000
1456,2917,161234.083333
1457,2918,113100.716667


In [12]:
pd.read_csv('../input/home-data-for-ml-course/sample_submission.csv')

,Id,SalePrice
0,1461,169277.052498
1,1462,187758.393989
2,1463,183583.683570
3,1464,179317.477511
4,1465,150730.079977
...,...,...
1454,2915,167081.220949
1455,2916,164788.778231
1456,2917,219222.423400
1457,2918,184924.279659
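The visual comparison above can also be done programmatically. The sketch below is one way to do it, using small in-memory stand-ins for the two files so that it runs anywhere. The three specific checks (same columns, same Ids, no missing predictions) are my assumption about what "matching format" should mean here.

```python
import io
import pandas as pd

# In-memory stand-ins for submission.csv and sample_submission.csv.
ours = pd.read_csv(io.StringIO(
    "Id,SalePrice\n1461,130742.5\n1462,155088.333333\n"
))
sample = pd.read_csv(io.StringIO(
    "Id,SalePrice\n1461,169277.052498\n1462,187758.393989\n"
))

# Same column names in the same order.
same_columns = list(ours.columns) == list(sample.columns)
# Same Ids in the same order.
same_ids = ours["Id"].equals(sample["Id"])
# No missing predictions.
no_missing = bool(ours["SalePrice"].notna().all())

print(same_columns, same_ids, no_missing)
```

To run the same checks on the real files, replace the `io.StringIO(...)` arguments with the two file paths used in the cells above.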


At this point there are two ways to evaluate the model on the test data. You can take the CSV you created and upload it to Kaggle on the competition leaderboard page. Alternatively, if you ran and committed this notebook on Kaggle, you can go to the committed viewer page, look at the output section, and select submission.csv; there will be an option 'submit to competition'.