# Training a model on a standalone tabular dataset
Example of making a standalone dataset available for training a fastai deep learning application.

In this notebook we'll go through the steps to train a model on the Kuala Lumpur property dataset: https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur



In [1069]:
# imports for notebook boilerplate
!pip install -Uqq fastbook
import fastbook
from fastbook import *
from fastai.tabular.all import *


In [1070]:
# imports for this notebook
import re

In [1071]:
# set up the notebook for fast.ai
fastbook.setup_book()

# Ingest the dataset

The following cells assume that you have completed the following steps:
- Download data_kaggle.csv.zip from https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur
- Unzip the downloaded file to extract data_kaggle.csv
- In your Gradient environment, create the folder /storage/archive/kl_property
- Upload data_kaggle.csv to /storage/archive/kl_property
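
The unzip step above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `extract_dataset` and the location of the downloaded archive are assumptions, and the archive is assumed to hold `data_kaggle.csv` at its top level.

```python
import zipfile
from pathlib import Path

def extract_dataset(zip_path, target_dir, member="data_kaggle.csv"):
    """Extract one CSV from a downloaded archive into target_dir and return its path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # fail early if the archive does not contain the expected file
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        archive.extract(member, path=target_dir)
    return target_dir / member

# Hypothetical usage; adjust the location of the downloaded archive:
# extract_dataset("data_kaggle.csv.zip", "/storage/archive/kl_property")
```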


In [1072]:
# define a target path for this house price dataset
path = URLs.path('kl_property')

In [1073]:
# ingest the dataset into a Pandas dataframe
df_train = pd.read_csv(path/'data_kaggle.csv')

In [1074]:
df_train.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur","RM 1,250,000",2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur","RM 6,800,000",6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur","RM 1,030,000",3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
3,"Cheras, Kuala Lumpur",,,,,,,
4,"Bukit Jalil, Kuala Lumpur","RM 900,000",4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished


In [1075]:
df_train.shape

(53883, 8)

# Preprocessing to clean up the dataset
Unlike some other datasets featured on Kaggle, this dataset has many anomalies that need to be cleaned up before fastai data preparation can be applied to it. In particular, the Size column has values that were entered free-form, so it needs a lot of work. For this column we've added processing to get a useful numerical value from the column's values where possible; for values that are difficult to parse, we drop the row. We lose about 1% of the rows in this way, a reasonable tradeoff to keep the cleanup code as simple as possible.

Here are the issues that need to be corrected with this dataset:
- Price column has some missing values. We need to remove the rows with these values
- Price column includes the ringgit symbol "RM" (the symbol for the Malaysian currency). We need to remove this symbol so that this column can be treated as a continuous column
- Size column needs to be split into two columns, one with the size type and the other with the size (area)
- Size (area) column needs to be updated to remove the unit of measure ("sq. ft.") and to convert dimension expressions such as "22x80" into scalar areas
- Size entries like "5700 sf sq. ft." and "646sf~1001sf sq. ft." need to be dealt with: remove the rows with ranges or constructs like "22’x100’ sq. ft.", as well as rows that contain strings that cannot be converted into numerics
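
To make the first few cleanup steps concrete, here is a small sketch in plain Python applied to single values taken from the table above. The helper names `clean_price` and `split_size` are hypothetical and are not the functions used in the cells that follow; they only illustrate the intended transformations.

```python
import re

def clean_price(text):
    """Turn 'RM 1,250,000' into 1250000.0; return None if it is not a plain number."""
    digits = text.replace("RM", "").replace(",", "").strip()
    return float(digits) if re.fullmatch(r"\d+(\.\d+)?", digits) else None

def split_size(text):
    """Split 'Built-up : 1,335 sq. ft.' into a (size type, area text) pair."""
    size_type, _, area = text.lower().partition(":")
    area = area.replace(",", "").replace("sq. ft.", "").strip()
    return size_type.strip(), area

print(clean_price("RM 1,250,000"))
print(split_size("Built-up : 1,335 sq. ft."))
print(split_size("Land area : 6900 sq. ft."))
```

The area is still a string at this point; free-form entries such as ranges need the row-level filtering described in the list above.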




In [1076]:
# function to remove the currency symbol
def remove_currency(currency_string, input_string):
    output_string = re.sub(currency_string,'',input_string)
    return(output_string)
    

In [1077]:
# function to remove everything after the space in a string
def remove_after_space(input_string):
    # remove leading and trailing spaces
    input_string = input_string.strip()
    #print('input:', input_string)
    # remove everything after internal spaces
    output_string = re.sub(r'\s* .*', '', input_string)
    output_string = re.sub(r'\([^)]*\)','',output_string)
    #print('output:',output_string)
    return(output_string)

In [1078]:
# remove rows with missing Price values
df_train.dropna(subset=['Price'], inplace=True)
# remove currency symbol from remaining rows
df_train['Price'] = df_train['Price'].apply(lambda x: remove_currency("RM ",x))


# convert Price column to float
df_train['Price'] = pd.to_numeric(df_train['Price'].str.replace(',',''), errors='coerce')
df_train.head()


Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,Land area : 7200 sq. ft.,Partly Furnished


In [1079]:
df_train.shape

(53635, 8)

In [1080]:
# lowercase values in the Size column
df_train['Size'] = df_train['Size'].str.lower()
# remove remaining records that have "sf", "acre", or "#" in the Size column

df_train = df_train[~df_train.Size.str.contains("sf",na=False)]
df_train = df_train[~df_train.Size.str.contains("acre",na=False)]
df_train = df_train[~df_train.Size.str.contains("#",na=False)]

# split the Size column into two columns and make the remaining Size column numeric
df_train[['Size_type','Size']] = df_train['Size'].str.split(':',expand=True)
df_train = df_train[~df_train.Size.str.contains("kuala",na=False)]
df_train = df_train[~df_train.Size.str.contains("malaysia",na=False)]
df_train = df_train[~df_train.Size.str.contains("nil",na=False)]
df_train = df_train[~df_train.Size.str.contains("corner",na=False)]
df_train = df_train[~df_train.Size.str.contains("unknown",na=False)]
df_train = df_train[~df_train.Size.str.contains("n/a",na=False)]
df_train = df_train[~df_train.Size.str.contains("na",na=False)]
df_train = df_train[~df_train.Size.str.contains("wp",na=False)]
df_train = df_train[~df_train.Size.str.contains("xx",na=False)]
df_train = df_train[~df_train.Size.str.contains("intermediate",na=False)]
df_train = df_train[~df_train.Size.str.contains("wilayah",na=False)]
df_train = df_train[~df_train.Size.str.contains("-",na=False)]
df_train = df_train[~df_train.Size.str.contains(r"\+",na=False)]
df_train = df_train[~df_train.Size.str.contains("'",na=False)]
df_train = df_train[~df_train.Size.str.contains("~",na=False)]
# remove commas and the unit of measure, and replace "x" with "*" so "22x80" becomes "22*80" and can yield a scalar when eval() is applied
# df_train['Size'] = pd.to_numeric(df_train['Size'].str.replace(',','').str.replace(' sq. ft.','').str.replace("x","*"), errors='coerce')

df_train['Size'] = df_train['Size'].str.replace(',','').str.replace('`','').str.replace('@','x').str.replace(r'\+ sq. ft.','')
# strip the remaining unit text and normalize every "x" separator to "*"
df_train['Size'] = df_train['Size'].str.replace(' sq. ft.','').str.replace('sf sq. ft.','').str.replace('ft','').str.replace('sq','').str.replace("xx","*").str.replace("x ","*").str.replace(" x","*").str.replace("x","*").str.replace("X","*").str.replace("'",'')

df_train.head()


Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing,Size_type
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,1335,Fully Furnished,built-up
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,6900,Partly Furnished,land area
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),1875,Partly Furnished,built-up
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),1513,Partly Furnished,built-up
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,7200,Partly Furnished,land area


In [1081]:
df_train.shape

(53333, 9)
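
The row counts reported so far let us check how much data the cleanup costs. The short calculation below uses only the three `shape` outputs shown above; the variable names are illustrative.

```python
rows_raw = 53883          # shape right after ingest
rows_with_price = 53635   # after dropping rows with a missing Price
rows_after_size = 53333   # after dropping rows with a hard-to-parse Size

dropped_price = rows_raw - rows_with_price
dropped_size = rows_with_price - rows_after_size
total_fraction = (rows_raw - rows_after_size) / rows_raw

print(dropped_price, dropped_size)
print(f"{dropped_size / rows_with_price:.2%} of priced rows dropped for Size")
print(f"{total_fraction:.2%} of all rows dropped in total")
```

The Size filters remove a bit over half a percent of the priced rows, and the two cleanup steps together remove about 1% of the original rows.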

In [1083]:
# replace missing values in the Size column
df_train['Size'] = df_train['Size'].fillna("0")


In [1084]:
# remove duplicates of the form "2850 38x25" by removing everything after space in Size field
df_train['Size'] = df_train['Size'].apply(lambda x: remove_after_space(x))


In [1085]:
# apply eval() to the Size column to convert "24 x 12" values to numeric values
df_train['Size'] = df_train['Size'].apply(lambda x: eval(str(x)))
df_train.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing,Size_type
0,"KLCC, Kuala Lumpur",1250000,2+1,3.0,2.0,Serviced Residence,1335.0,Fully Furnished,built-up
1,"Damansara Heights, Kuala Lumpur",6800000,6,7.0,,Bungalow,6900.0,Partly Furnished,land area
2,"Dutamas, Kuala Lumpur",1030000,3,4.0,2.0,Condominium (Corner),1875.0,Partly Furnished,built-up
4,"Bukit Jalil, Kuala Lumpur",900000,4+1,3.0,2.0,Condominium (Corner),1513.0,Partly Furnished,built-up
5,"Taman Tun Dr Ismail, Kuala Lumpur",5350000,4+2,5.0,4.0,Bungalow,7200.0,Partly Furnished,land area
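
Because `eval()` executes any Python expression it is given, applying it to free-form text is riskier than this step needs. As a hedged alternative, the sketch below accepts only plain numbers and `*`-separated products such as "22*80" and returns `None` for anything else. The helper name `area_from_text` is hypothetical and is not part of the notebook.

```python
import re
from math import prod

def area_from_text(text):
    """Evaluate '1335' or '22*80' as a product of numbers; return None otherwise."""
    text = text.strip()
    # accept only digits (optionally with a decimal part) joined by '*'
    if not re.fullmatch(r"\d+(\.\d+)?(\*\d+(\.\d+)?)*", text):
        return None
    return float(prod(float(part) for part in text.split("*")))

print(area_from_text("1335"))
print(area_from_text("22*80"))
print(area_from_text("__import__('os')"))
```

Rows for which the helper returns `None` could then be dropped in the same way as the other unparseable Size values.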


# Define the target, continuous and categorical columns

In [1044]:
# define transforms to apply to the tabular dataset
procs = [FillMissing,Categorify]
# define the dependent variable (y value)
dep_var = 'Price'
# define columns that are continuous / categorical
cont,cat = cont_cat_split(df_train, 1, dep_var=dep_var) 
print("continuous columns are: ",cont)
print("categorical columns are: ",cat)

continuous columns are:  ['Bathrooms', 'Car Parks', 'Size']
categorical columns are:  ['Location', 'Rooms', 'Property Type', 'Furnishing', 'Size_type']


# Check for missing values

In [1047]:
# get a count by column of missing values
count = df_train.isna().sum()
df_train_missing = (pd.concat([count.rename('missing_count'),
                     count.div(len(df_train))
                          .rename('missing_ratio')],axis = 1)
             .loc[count.ne(0)])

In [1048]:
df_train_missing

Unnamed: 0,missing_count,missing_ratio
Rooms,1585,0.029719
Bathrooms,1887,0.035381
Car Parks,17281,0.324021
Furnishing,6760,0.126751
Size_type,1024,0.0192


In [1049]:
df_train_missing.shape

(5, 2)

# Set the target
Adjust the Price column for binary classification: each value becomes 0.0 if it is at or below the mean price and 1.0 if it is above the mean.

In [1050]:
# function to replace target values with value indicating whether the input is over or under the mean
def under_over(x,mean_x):
    if (x <= mean_x):
        returner = 0.0
    else:
        returner = 1.0
    return(returner)

In [1051]:
# set target column
mean_sp = int(df_train['Price'].mean())
df_train['Price'] = df_train['Price'].apply(lambda x: under_over(x,mean_sp))
df_train.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing,Size_type
0,"KLCC, Kuala Lumpur",0.0,2+1,3.0,2.0,Serviced Residence,1335.0,Fully Furnished,built-up
1,"Damansara Heights, Kuala Lumpur",1.0,6,7.0,,Bungalow,6900.0,Partly Furnished,land area
2,"Dutamas, Kuala Lumpur",0.0,3,4.0,2.0,Condominium (Corner),1875.0,Partly Furnished,built-up
4,"Bukit Jalil, Kuala Lumpur",0.0,4+1,3.0,2.0,Condominium (Corner),1513.0,Partly Furnished,built-up
5,"Taman Tun Dr Ismail, Kuala Lumpur",1.0,4+2,5.0,4.0,Bungalow,7200.0,Partly Furnished,land area


In [1052]:
mean_sp

1967793

In [1053]:
# check the proportion of Price values
df_train['Price'].value_counts()

0.0    39909
1.0    13424
Name: Price, dtype: int64
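
Since the two classes are unbalanced, it helps to know what accuracy a model would reach by always predicting the more common class. The calculation below uses only the counts printed above; the variable names are illustrative.

```python
counts = {0.0: 39909, 1.0: 13424}   # value_counts() output above

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(total)
print(majority_class, f"{baseline_accuracy:.4f}")
```

Roughly three quarters of the listings are at or below the mean price, so an accuracy near 0.75 on the full dataset would be no better than this trivial baseline.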

In [1056]:
df_train.shape

(53333, 9)

# Define TabularDataLoaders

In [1057]:
# define TabularDataLoaders object using the dataframe, the list of pre-processing steps, and the categorical and continuous
# column lists
# valid_idx: the indices to use for the validation set (here, the last 5000 rows)
procs = [FillMissing,Categorify, Normalize]
dls = TabularDataLoaders.from_df(df_train,path,procs= procs, 
                               cat_names= cat, cont_names = cont, y_names = dep_var, valid_idx=list(range((df_train.shape[0]-5000),df_train.shape[0])), bs=64)
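
The `valid_idx` expression holds out the final 5000 row positions. The sketch below reproduces that split in plain Python to show the resulting sizes; the helper name `last_n_split` is hypothetical.

```python
def last_n_split(n_rows, n_valid):
    """Return (train_idx, valid_idx) with the final n_valid positions held out."""
    if not 0 < n_valid < n_rows:
        raise ValueError("n_valid must be between 1 and n_rows - 1")
    cut = n_rows - n_valid
    return list(range(cut)), list(range(cut, n_rows))

train_idx, valid_idx = last_n_split(53333, 5000)
print(len(train_idx), len(valid_idx), valid_idx[0], valid_idx[-1])
```

Because this split follows row order rather than sampling at random, the validation set is only representative if the listings are not ordered in some meaningful way.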
                               

In [1058]:
# display a sample batch
dls.valid.show_batch()

Unnamed: 0,Location,Rooms,Property Type,Furnishing,Size_type,Bathrooms_na,Car Parks_na,Bathrooms,Car Parks,Size,Price
0,"Pantai, Kuala Lumpur",3+1,Condominium (Intermediate),Partly Furnished,built-up,False,False,2.0,2.0,1300.000025,0.0
1,"KLCC, Kuala Lumpur",Studio,Serviced Residence (Intermediate),Fully Furnished,built-up,False,False,1.0,1.0,410.999992,0.0
2,"Cheras, Kuala Lumpur",5+1,3-sty Terrace/Link House (Intermediate),Partly Furnished,land area,False,False,4.0,3.0,2800.0,0.0
3,"Cheras, Kuala Lumpur",4,2-sty Terrace/Link House (Intermediate),Partly Furnished,land area,False,False,3.0,2.0,1690.000006,0.0
4,"Sungai Besi, Kuala Lumpur",5,3-sty Terrace/Link House,Unfurnished,land area,False,True,5.0,2.0,1920.000014,0.0
5,"KLCC, Kuala Lumpur",2,Serviced Residence (Intermediate),Partly Furnished,built-up,False,False,1.0,1.0,661.999954,0.0
6,"Brickfields, Kuala Lumpur",3+1,Apartment,Fully Furnished,built-up,False,True,2.0,2.0,1499.999991,0.0
7,"Damansara Heights, Kuala Lumpur",4+1,Bungalow,Fully Furnished,land area,False,True,5.0,2.0,11721.999673,1.0
8,"Jalan Klang Lama (Old Klang Road), Kuala Lumpur",3+1,Apartment,Fully Furnished,built-up,False,False,2.0,2.0,1240.000036,0.0
9,"Setapak, Kuala Lumpur",4,Serviced Residence (Intermediate),Partly Furnished,built-up,False,False,4.0,3.0,1800.000035,0.0


In [1061]:
# define and fit the model
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(5)

epoch,train_loss,valid_loss,accuracy,time
0,0.057596,0.056111,0.7662,00:09
1,0.050664,0.046354,0.7662,00:09
2,0.046991,0.04578,0.7662,00:09
3,0.040996,0.05582,0.7662,00:09
4,0.037257,0.044228,0.7662,00:09


In [1060]:
# show the loss function used by the learner
learn.loss_func

FlattenedLoss of MSELoss()
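
The `MSELoss` shown here suggests that fastai treated the float-valued Price column as a regression target, which may also explain why the accuracy column above stays at 0.7662 in every epoch. The sketch below assumes that the `accuracy` metric takes an argmax over the last axis of the model output; with a single output column the argmax is always index 0, so the metric would simply report the share of 0.0 targets. The output values are illustrative only and are not taken from the model.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def argmax_accuracy(outputs, targets):
    """Accuracy as a classification metric would compute it: argmax per row."""
    hits = sum(argmax(row) == int(t) for row, t in zip(outputs, targets))
    return hits / len(targets)

# Single-column regression outputs (illustrative values only).
outputs = [[0.02], [0.95], [0.45], [0.81]]
targets = [0.0, 1.0, 0.0, 1.0]

# Every row "predicts" index 0, so only the 0.0 targets count as hits.
print(argmax_accuracy(outputs, targets))
```

Under that assumption, 0.7662 of the 5000 validation rows corresponds to 3831 rows with a 0.0 target. If a classifier is the goal, one option to explore is passing `y_block=CategoryBlock()` to `TabularDataLoaders.from_df`.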

In [1068]:
# show a set of results from the model
learn.show_results()

Unnamed: 0,Location,Rooms,Property Type,Furnishing,Size_type,Bathrooms_na,Car Parks_na,Bathrooms,Car Parks,Size,Price,Price_pred
0,53.0,26.0,61.0,2.0,1.0,1.0,1.0,0.577252,1.840694,-0.018127,0.0,0.02009
1,88.0,18.0,84.0,0.0,1.0,1.0,1.0,-0.672398,-0.93277,-0.041041,0.0,0.006296
2,23.0,22.0,60.0,1.0,1.0,1.0,1.0,-0.672398,-0.93277,-0.037316,0.0,0.00531
3,45.0,43.0,88.0,1.0,1.0,1.0,2.0,-1.297223,-0.008282,-0.046228,0.0,-0.000452
4,3.0,26.0,60.0,2.0,1.0,1.0,1.0,0.577252,0.916206,0.011475,1.0,1.013828
5,45.0,22.0,84.0,1.0,1.0,1.0,1.0,-0.672398,-0.93277,-0.025811,0.0,0.445047
6,80.0,23.0,61.0,1.0,1.0,1.0,2.0,0.577252,-0.008282,-0.004886,1.0,0.809427
7,16.0,23.0,61.0,2.0,1.0,1.0,1.0,-0.047573,-0.008282,-0.01606,0.0,0.023392
8,10.0,26.0,13.0,2.0,2.0,1.0,1.0,-0.047573,-0.008282,-0.018498,1.0,0.535222
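
The `Price_pred` values are continuous, so one simple way to read them as class predictions is to threshold them. The sketch below applies a threshold of 0.5 (an assumption, chosen as the midpoint between the two target values) to the nine rows shown above; the helper name `threshold_accuracy` is hypothetical.

```python
# (Price, Price_pred) pairs copied from the show_results() output above.
rows = [
    (0.0, 0.02009), (0.0, 0.006296), (0.0, 0.00531),
    (0.0, -0.000452), (1.0, 1.013828), (0.0, 0.445047),
    (1.0, 0.809427), (0.0, 0.023392), (1.0, 0.535222),
]

def threshold_accuracy(pairs, threshold=0.5):
    """Share of rows whose prediction is on the same side of threshold as the target."""
    hits = sum((pred > threshold) == (target > threshold) for target, pred in pairs)
    return hits / len(pairs)

print(threshold_accuracy(rows))
```

All nine displayed rows fall on the correct side of the threshold, but nine rows are far too few to estimate accuracy on the whole validation set.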
