# Training a model on a standalone tabular dataset
Example of making a standalone dataset available for training a fastai deep learning application.

In this notebook we'll go through the steps to train a model on the Kuala Lumpur property dataset: https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur



In [940]:
# imports for notebook boilerplate
!pip install -Uqq fastbook
import fastbook
from fastbook import *
from fastai.tabular.all import *


In [941]:
# imports for this notebook
import re

In [942]:
# set up the notebook for fast.ai
fastbook.setup_book()

# Ingest the dataset

The following cells assume that you have completed the following steps:
- Download data_kaggle.csv.zip from https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur
- Unzip the downloaded file to extract data_kaggle.csv
- In your Gradient environment, create the folder /storage/archive/kl_property
- Upload data_kaggle.csv to /storage/archive/kl_property


In [943]:
# define a target path for this house price dataset
path = URLs.path('kl_property')

In [944]:
# ingest the dataset into a Pandas dataframe
df_train = pd.read_csv(path/'data_kaggle.csv')

In [945]:
df_train.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur","RM 1,250,000",2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur","RM 6,800,000",6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur","RM 1,030,000",3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
3,"Cheras, Kuala Lumpur",,,,,,,
4,"Bukit Jalil, Kuala Lumpur","RM 900,000",4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished


In [946]:
df_train.shape

(53883, 8)

# Preprocessing to clean up the dataset
Unlike some other datasets featured on Kaggle, this dataset has many interesting anomalies that need to be cleaned up before fastai data preparation can be applied to it.

Here are the issues that need to be corrected with this dataset:
- Price column has some missing values. We need to remove the rows where Price is missing
- Price column includes the ringgit symbol (the symbol for the Malaysian currency). We need to remove this symbol so that this column can be treated as a continuous column
- Size column needs to be split into two columns, one with the size type and the other with the size (area)
- Size (area) column needs to be updated to remove the unit ("sq. ft.") and to convert dimension expressions such as "22x80" into a single area value
- Size entries like "5700 sf sq. ft." and "646sf~1001sf sq. ft." need to be dealt with: remove the rows with ranges or constructs like "22&#8217;x100&#8217; sq. ft."
- Rooms column has an assortment of numeric values, combinations "2 + 1" and strings "Studio"
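
The cells below clean up Price and Size but leave Rooms untouched. As a minimal sketch of one way the Rooms values could be made numeric, the hypothetical helper `parse_rooms` below treats "Studio" as one room and "a+b" as a + b rooms. Both conventions are assumptions made for illustration, not choices taken from this notebook.

```python
import re

def parse_rooms(value):
    """Convert a Rooms entry such as "2+1", "6" or "Studio" to a number.

    Assumptions (not from the notebook): "Studio" counts as one room,
    "a+b" counts as a + b rooms, and anything else becomes None.
    """
    if value is None:
        return None
    text = str(value).strip().lower()
    if text == "studio":
        return 1
    # accept "6", "2+1" and "2 + 1"
    match = re.fullmatch(r"(\d+)(?:\s*\+\s*(\d+))?", text)
    if match is None:
        return None
    main, extra = match.groups()
    return int(main) + (int(extra) if extra else 0)

print([parse_rooms(v) for v in ["2+1", "6", "Studio", "2 + 1", "n/a"]])
```

Applied with `df_train['Rooms'].apply(parse_rooms)`, this would leave unparseable entries as missing values to be handled in a later step.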



In [947]:
# function to remove the currency symbol
def remove_currency(currency_string, input_string):
    output_string = re.sub(currency_string,'',input_string)
    return(output_string)
    

In [948]:
# function to remove everything after the space in a string
def remove_after_space(input_string):
    # remove leading and trailing spaces
    input_string = input_string.strip()
    #print('input:', input_string)
    # remove everything after internal spaces
    output_string = re.sub(r'\s* .*', '', input_string)
    output_string = re.sub(r'\([^)]*\)','',output_string)
    #print('output:',output_string)
    return(output_string)

In [949]:
test1 = "40x85(3400)"
test2 = "foo"
remove_after_space(test1)

'40x85'

In [950]:
# remove rows with missing Price values
df_train.dropna(subset=['Price'], inplace=True)
# remove currency symbol from remaining rows
df_train['Price'] = df_train['Price'].apply(lambda x: remove_currency("RM ",x))


# convert Price column to float
df_train['Price'] = pd.to_numeric(df_train['Price'].str.replace(',',''), errors='coerce')
df_train.head()


Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,Land area : 7200 sq. ft.,Partly Furnished


In [951]:
df_train.shape

(53635, 8)
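
The Price clean-up above can also be written as a single vectorized expression. The sketch below is an alternative to the `remove_currency` helper rather than the method used in this notebook: it strips every character that is not a digit or a decimal point. The sample values are copied from the `head()` output shown earlier.

```python
import pandas as pd

# sample Price values taken from the head() output, plus one missing value
prices = pd.Series(["RM 1,250,000", "RM 6,800,000", None, "RM 900,000"])

cleaned = pd.to_numeric(
    # drop missing rows, then keep only digits and the decimal point
    prices.dropna().str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
print(cleaned.tolist())
```

Stripping by character class removes the currency prefix and the thousands separators in one pass, at the cost of silently accepting any other stray text in the column.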

In [952]:
# lowercase values in the Size column
df_train['Size'] = df_train['Size'].str.lower()
#  remove remaining records that have "sf","acres", or "#" in the Size column

df_train = df_train[~df_train.Size.str.contains("sf",na=False)]
df_train = df_train[~df_train.Size.str.contains("acre",na=False)]
df_train = df_train[~df_train.Size.str.contains("#",na=False)]

# split the Size column into two columns and make the remaining Size column numeric
df_train[['Size_type','Size']] = df_train['Size'].str.split(':',expand=True)
df_train = df_train[~df_train.Size.str.contains("kuala",na=False)]
df_train = df_train[~df_train.Size.str.contains("malaysia",na=False)]
df_train = df_train[~df_train.Size.str.contains("nil",na=False)]
df_train = df_train[~df_train.Size.str.contains("corner",na=False)]
df_train = df_train[~df_train.Size.str.contains("unknown",na=False)]
df_train = df_train[~df_train.Size.str.contains("n/a",na=False)]
df_train = df_train[~df_train.Size.str.contains("na",na=False)]
df_train = df_train[~df_train.Size.str.contains("wp",na=False)]
df_train = df_train[~df_train.Size.str.contains("xx",na=False)]
df_train = df_train[~df_train.Size.str.contains("intermediate",na=False)]
df_train = df_train[~df_train.Size.str.contains("wilayah",na=False)]
df_train = df_train[~df_train.Size.str.contains("-",na=False)]
df_train = df_train[~df_train.Size.str.contains(r"\+",na=False)]
df_train = df_train[~df_train.Size.str.contains("'",na=False)]
df_train = df_train[~df_train.Size.str.contains("~",na=False)]
# remove commas and the unit, and replace "x" with "*" so "22x80" becomes "22*80" and can yield a scalar when eval() is applied
# df_train['Size'] = pd.to_numeric(df_train['Size'].str.replace(',','').str.replace(' sq. ft.','').str.replace("x","*"), errors='coerce')

df_train['Size'] = df_train['Size'].str.replace(',','').str.replace('`','').str.replace('@','x').str.replace(r'\+ sq. ft.','')
#
df_train['Size'] = df_train['Size'].str.replace(' sq. ft.','').str.replace('sf sq. ft.','').str.replace('ft','').str.replace('sq','').str.replace("xx","*").str.replace("x ","*").str.replace(" x","*").str.replace("x","*").str.replace("X","*").str.replace("'",'')

df_train.head(15)


Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing,Size_type
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,1335,Fully Furnished,built-up
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,6900,Partly Furnished,land area
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),1875,Partly Furnished,built-up
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),1513,Partly Furnished,built-up
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,7200,Partly Furnished,land area
7,"Taman Tun Dr Ismail, Kuala Lumpur",2600000,5,4.0,4.0,Semi-detached House,3600,Partly Furnished,land area
8,"Taman Tun Dr Ismail, Kuala Lumpur",1950000,4+1,4.0,3.0,2-sty Terrace/Link House (EndLot),25*75,Partly Furnished,land area
9,"Sri Petaling, Kuala Lumpur",385000,3,2.0,1.0,Apartment (Intermediate),904,Partly Furnished,built-up
11,"Taman Tun Dr Ismail, Kuala Lumpur",1680000,4,3.0,,2-sty Terrace/Link House (Intermediate),22 *80,Partly Furnished,land area
12,"Taman Tun Dr Ismail, Kuala Lumpur",1700000,3+1,3.0,,2-sty Terrace/Link House (Intermediate),1900,Partly Furnished,land area


In [953]:
df_train.shape

(53333, 9)

In [954]:
def apply_eval(input_string):
    #print("s: ", str(input_string))          
    return(eval(str(input_string)))
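
Because `eval()` executes whatever expression it is given, it is worth considering a stricter parser for values read from a CSV file. The following is a minimal sketch of one alternative, not the approach used in this notebook: the hypothetical helper `size_to_area` accepts only numbers joined by "*" and returns `None` for anything else. Note that it strips spaces rather than truncating at the first space, so "22 *80" is read as 22 times 80.

```python
import re

# Accept only numbers joined by "*", for example "1335" or "25*75".
_SIZE_PATTERN = re.compile(r"\d+(?:\.\d+)?(?:\*\d+(?:\.\d+)?)*")

def size_to_area(text):
    """Multiply the numeric factors in a cleaned Size string.

    Returns None when the string is not a plain product of numbers,
    instead of passing it to eval().
    """
    cleaned = str(text).replace(" ", "")
    if not _SIZE_PATTERN.fullmatch(cleaned):
        return None
    area = 1.0
    for factor in cleaned.split("*"):
        area *= float(factor)
    return area

for sample in ["1335", "25*75", "22 *80", "__import__('os')"]:
    print(sample, "->", size_to_area(sample))
```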

In [955]:
# replace missing values in the Size column
df_train['Size'] = df_train['Size'].fillna("0")
df_train.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing,Size_type
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,1335,Fully Furnished,built-up
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,6900,Partly Furnished,land area
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),1875,Partly Furnished,built-up
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),1513,Partly Furnished,built-up
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,7200,Partly Furnished,land area


In [956]:

# remove duplicates of the form "2850 38x25" by removing everything after space in Size field
df_train['Size'] = df_train['Size'].apply(lambda x: remove_after_space(x))
df_train.head(15)

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing,Size_type
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,1335,Fully Furnished,built-up
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,6900,Partly Furnished,land area
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),1875,Partly Furnished,built-up
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),1513,Partly Furnished,built-up
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,7200,Partly Furnished,land area
7,"Taman Tun Dr Ismail, Kuala Lumpur",2600000,5,4.0,4.0,Semi-detached House,3600,Partly Furnished,land area
8,"Taman Tun Dr Ismail, Kuala Lumpur",1950000,4+1,4.0,3.0,2-sty Terrace/Link House (EndLot),25*75,Partly Furnished,land area
9,"Sri Petaling, Kuala Lumpur",385000,3,2.0,1.0,Apartment (Intermediate),904,Partly Furnished,built-up
11,"Taman Tun Dr Ismail, Kuala Lumpur",1680000,4,3.0,,2-sty Terrace/Link House (Intermediate),22,Partly Furnished,land area
12,"Taman Tun Dr Ismail, Kuala Lumpur",1700000,3+1,3.0,,2-sty Terrace/Link House (Intermediate),1900,Partly Furnished,land area


In [957]:
# convert Size to numeric
# df_train['Size'] = pd.to_numeric(df_train['Size'])
# apply arithmetic 
# df_train['Size'] = df_train['Size'].apply(lambda x: eval(str(x)))
df_train['Size'] = df_train['Size'].apply(lambda x: apply_eval(x))
df_train.head(15)

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing,Size_type
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,1335.0,Fully Furnished,built-up
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,6900.0,Partly Furnished,land area
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),1875.0,Partly Furnished,built-up
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),1513.0,Partly Furnished,built-up
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,7200.0,Partly Furnished,land area
7,"Taman Tun Dr Ismail, Kuala Lumpur",2600000,5,4.0,4.0,Semi-detached House,3600.0,Partly Furnished,land area
8,"Taman Tun Dr Ismail, Kuala Lumpur",1950000,4+1,4.0,3.0,2-sty Terrace/Link House (EndLot),1875.0,Partly Furnished,land area
9,"Sri Petaling, Kuala Lumpur",385000,3,2.0,1.0,Apartment (Intermediate),904.0,Partly Furnished,built-up
11,"Taman Tun Dr Ismail, Kuala Lumpur",1680000,4,3.0,,2-sty Terrace/Link House (Intermediate),22.0,Partly Furnished,land area
12,"Taman Tun Dr Ismail, Kuala Lumpur",1700000,3+1,3.0,,2-sty Terrace/Link House (Intermediate),1900.0,Partly Furnished,land area


# Define the target, continuous and categorical columns

In [73]:
# select a subset of columns to train the model on
cat_select = ['Neighborhood','HouseStyle','Exterior1st','CentralAir','KitchenQual']
cont_select = ['LotFrontage','LotArea','OverallCond','YearBuilt','GrLivArea','FullBath','HalfBath','BedroomAbvGr','GarageCars']

In [74]:
print("len cont is ",len(cont))
print("len cat is ",len(cat))

len cont is  37
len cat is  43


In [75]:
df_train['SalePrice'].value_counts()

140000    20
135000    17
145000    14
155000    14
190000    13
          ..
84900      1
424870     1
415298     1
62383      1
34900      1
Name: SalePrice, Length: 663, dtype: int64

# Set target
Adjust the target column for binary classification.

In [76]:
# function to replace target values with value indicating whether the input is over or under the mean
def under_over(x,mean_x):
    if (x <= mean_x):
        returner = 0.0
    else:
        returner = 1.0
    return(returner)
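
The same labelling can be expressed without a row-by-row function by comparing the whole column with its mean. This is a sketch on made-up values. Unlike the cell below, it does not truncate the mean to an integer, which is an assumption that could change the label of a value sitting exactly on the boundary.

```python
import pandas as pd

# made-up prices for illustration only
prices = pd.Series([100.0, 200.0, 300.0, 1000.0])
mean_price = prices.mean()

# 1.0 where the price is above the mean, 0.0 otherwise
labels = (prices > mean_price).astype(float)
print(mean_price, labels.tolist())
```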

In [77]:
# set target column
# df.loc[df.ID == 103, 'FirstName'] = "Matt"
mean_sp = int(df_train['SalePrice'].mean())
#df_train['SalePrice'] = df_train.loc[df_train.SalePrice <= mean_sp,'SalePrice'] = 0.0
#df_train['SalePrice'] = df_train.loc[df_train.SalePrice > mean_sp,'SalePrice'] = 1.0
# df['Date'] = df['Date'].apply(lambda x: int(str(x)[-4:]))
df_train['SalePrice'] = df_train['SalePrice'].apply(lambda x: under_over(x,mean_sp))
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,1.0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,1.0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,1.0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,0.0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,1.0


In [78]:
mean_sp

180921

In [79]:
df_train['SalePrice'].value_counts()

0.0    900
1.0    560
Name: SalePrice, dtype: int64

# Check for missing values

In [80]:
# df_train.isnull().sum() > 0
count = df_train.isna().sum()
df_train_missing = (pd.concat([count.rename('missing_count'),
                     count.div(len(df_train))
                          .rename('missing_ratio')],axis = 1)
             .loc[count.ne(0)])

In [81]:
df_train_missing

Unnamed: 0,missing_count,missing_ratio
LotFrontage,259,0.177397
Alley,1369,0.937671
MasVnrType,8,0.005479
MasVnrArea,8,0.005479
BsmtQual,37,0.025342
BsmtCond,37,0.025342
BsmtExposure,38,0.026027
BsmtFinType1,37,0.025342
BsmtFinType2,38,0.026027
Electrical,1,0.000685


In [82]:
df_train_missing.shape

(19, 2)
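
The missing-value summary is computed several times in this notebook with the same expression. A small helper, sketched here under the hypothetical name `missing_summary`, wraps that expression so it can be reused for any dataframe. The demo dataframe is made up for illustration.

```python
import pandas as pd

def missing_summary(df):
    """Return missing_count and missing_ratio for columns with missing values."""
    count = df.isna().sum()
    summary = pd.concat(
        [count.rename("missing_count"),
         count.div(len(df)).rename("missing_ratio")],
        axis=1,
    )
    # keep only the columns that have at least one missing value
    return summary.loc[count.ne(0)]

demo = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_summary(demo))
```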

In [83]:
count2 = df_test.isna().sum()
df_test_missing = (pd.concat([count2.rename('missing_count'),
                     count2.div(len(df_test))
                          .rename('missing_ratio')],axis = 1)
             .loc[count2.ne(0)])

In [84]:
df_test_missing

Unnamed: 0,missing_count,missing_ratio
MSZoning,4,0.002742
LotFrontage,227,0.155586
Alley,1352,0.926662
Utilities,2,0.001371
Exterior1st,1,0.000685
Exterior2nd,1,0.000685
MasVnrType,16,0.010966
MasVnrArea,15,0.010281
BsmtQual,44,0.030158
BsmtCond,45,0.030843


In [85]:
# check to see missing value col count in test set
df_test_missing.shape

(33, 2)

# Replace missing values

In [86]:

# for categorical columns, replace missing values with the most common categorical value in that column
df_train[cat] = df_train[cat].fillna(df_train[cat].mode().iloc[0])
df_test[cat] = df_test[cat].fillna(df_test[cat].mode().iloc[0])
# for continuous columns, replace missing values with 0
df_train[cont] = df_train[cont].fillna(0.0)
df_test[cont] = df_test[cont].fillna(0.0)
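
The cell above fills the test set from the test set's own most common values. A frequently used alternative is to compute the fill values once on the training data and apply them to both dataframes, so that both are processed identically. The sketch below illustrates that variant on made-up data. The column names `kind` and `area` are placeholders, not columns from this dataset.

```python
import pandas as pd

# made-up train and test frames for illustration only
train = pd.DataFrame({"kind": ["a", "a", None, "b"],
                      "area": [10.0, None, 30.0, 40.0]})
test = pd.DataFrame({"kind": [None, "b"],
                     "area": [None, 5.0]})

cat_cols, cont_cols = ["kind"], ["area"]

# fill values come from the training data only
fill_values = train[cat_cols].mode().iloc[0].to_dict()
fill_values.update({col: 0.0 for col in cont_cols})

train = train.fillna(fill_values)
test = test.fillna(fill_values)
print(test)
```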


# Confirm missing values dealt with

In [87]:
# check for missing values in df_train
count = df_train.isna().sum()
df_train_missing = (pd.concat([count.rename('missing_count'),
                     count.div(len(df_train))
                          .rename('missing_ratio')],axis = 1)
             .loc[count.ne(0)])

In [88]:
df_train_missing

Unnamed: 0,missing_count,missing_ratio


In [89]:
# check for missing values in df_test
count = df_test.isna().sum()
df_test_missing = (pd.concat([count.rename('missing_count'),
                     count.div(len(df_test))
                          .rename('missing_ratio')],axis = 1)
             .loc[count.ne(0)])

In [90]:
df_test_missing

Unnamed: 0,missing_count,missing_ratio


# Define TabularDataLoaders

In [91]:
# define TabularDataLoaders object 
# valid_idx: the indices to use for the validation set
# what happens when we try to run this without dealing with missing values first
procs = [Categorify, Normalize]
#dls_house=TabularDataLoaders.from_df(df_train,path,procs= procs, 
#                               cat_names= cat, cont_names = cont, y_names = dep_var, valid_idx=list(range((df_train.shape[0]-100),df_train.shape[0])), bs=64)
dls_house=TabularDataLoaders.from_df(df_train,path,procs= procs, 
                               cat_names= cat_select, cont_names = cont_select, y_names = dep_var, valid_idx=list(range((df_train.shape[0]-100),df_train.shape[0])), bs=64)
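
The call above holds out the last 100 rows as the validation set. If the rows are stored in some meaningful order, a random selection may be more representative. The sketch below builds such an index list with the standard library. The helper name `random_valid_idx` and the seed are assumptions made for illustration, and 1460 is the row count implied by the value counts shown earlier.

```python
import random

def random_valid_idx(n_rows, n_valid, seed=42):
    """Pick n_valid distinct row positions at random, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), n_valid))

valid_idx = random_valid_idx(n_rows=1460, n_valid=100)
print(len(valid_idx), valid_idx[:5])
```

The resulting list could be passed as `valid_idx` in place of the `list(range(...))` expression.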
                               

In [92]:
dls_house.valid.show_batch()

Unnamed: 0,Neighborhood,HouseStyle,Exterior1st,CentralAir,KitchenQual,LotFrontage,LotArea,OverallCond,YearBuilt,GrLivArea,FullBath,HalfBath,BedroomAbvGr,GarageCars,SalePrice
0,SWISU,2Story,MetalSd,Y,TA,51.0,9842.000033,6.0,1921.0,2600.99996,3.0,1.0,4.0,2.0,1.0
1,StoneBr,1Story,VinylSd,Y,Gd,124.0,16157.999818,5.0,2005.000002,1530.0,2.0,-8.687398e-09,3.0,2.0,1.0
2,NAmes,1.5Fin,VinylSd,Y,TA,9.840148e-07,12513.000003,4.0,1920.0,1737.999994,2.0,-8.687398e-09,4.0,1.0,0.0
3,Gilbert,2Story,VinylSd,Y,Gd,73.0,8498.999933,5.0,2006.000002,1412.000004,2.0,1.0,3.0,2.0,0.0
4,Somerst,2Story,MetalSd,Y,Gd,30.0,3180.00021,5.0,2005.000002,1199.999988,2.0,1.0,2.0,2.0,0.0
5,Somerst,2Story,VinylSd,Y,Gd,9.840148e-07,7499.999856,5.0,2000.0,1673.999994,2.0,1.0,3.0,2.0,1.0
6,CollgCr,2Story,VinylSd,Y,Gd,68.0,9179.000068,5.0,1999.0,1790.00001,2.0,1.0,3.0,2.0,1.0
7,MeadowV,2Story,CemntBd,Y,TA,41.0,2664.999956,6.0,1977.0,1474.999998,2.0,-8.687398e-09,4.0,1.0,0.0
8,CollgCr,1Story,VinylSd,Y,Gd,9.840148e-07,4435.000005,5.0,2002.999998,847.999998,1.0,-8.687398e-09,1.0,2.0,0.0
9,CollgCr,1Story,VinylSd,Y,Gd,48.0,10634.999996,5.0,2002.999998,1667.999994,2.0,-8.687398e-09,3.0,2.0,1.0


In [93]:
# define and fit the model
learn = tabular_learner(dls_house, layers=[200,100], metrics=accuracy)
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,time
0,0.181128,0.241859,0.61,00:00


# Apply trained model to the test dataset

In [152]:
# apply model to the test set
# details of test_dl here: https://docs.fast.ai/tutorial.tabular
dl = learn.dls.test_dl(df_test)

In [153]:
learn.get_preds(dl=dl)


(tensor([[-0.0106],
         [ 0.4776],
         [ 0.1401],
         ...,
         [ 0.0540],
         [-0.0338],
         [ 0.2124]]),
 None)
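
The predictions above are raw scores rather than 0/1 labels. One simple way to turn them into labels is to apply a cut-off, sketched below in plain Python. The threshold of 0.5 is an assumption chosen because the target was encoded as 0.0 and 1.0. The sample scores are the six values visible in the output above.

```python
def to_labels(scores, threshold=0.5):
    """Map each raw score to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if score > threshold else 0.0 for score in scores]

# the six scores visible in the get_preds output
raw_scores = [-0.0106, 0.4776, 0.1401, 0.0540, -0.0338, 0.2124]
print(to_labels(raw_scores))
```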

In [42]:
??tabular_learner

Signature:
tabular_learner(
    dls,
    layers=None,
    emb_szs=None,
    config=None,
    n_out=None,
    y_range=None,
    loss_func=None,
    opt_func=<function Adam at 0x7ff836f7e820>,
    lr=0.001,
    splitter=<function trainable_params at 0x7ff838c60ca0>,