# Training a model on a standalone tabular dataset
Example of making a standalone dataset available for training a fastai deep learning application.

In this notebook we'll go through the steps to train a model on the Kuala Lumpur property dataset: https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur



In [125]:
# imports for notebook boilerplate
!pip install -Uqq fastbook
import fastbook
from fastbook import *
from fastai.tabular.all import *


In [126]:
# imports for this notebook
import re

In [127]:
# set up the notebook for fast.ai
fastbook.setup_book()

# Ingest the dataset

The cells below assume that you have completed these steps:
- Download data_kaggle.csv.zip from https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur
- Unzip the downloaded file to extract data_kaggle.csv
- In your Gradient environment, create the folder /storage/archive/kl_property
- Upload data_kaggle.csv to /storage/archive/kl_property
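
If you would rather script the unzip-and-place steps above than do them by hand, something like the following could work. This is a minimal sketch using only the Python standard library. `extract_dataset` is a hypothetical helper (not part of fastai or this notebook), and it assumes the archive holds `data_kaggle.csv` at its top level.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir, member="data_kaggle.csv"):
    """Extract one CSV from a downloaded archive into target_dir.

    Hypothetical helper: not part of fastai or the original notebook.
    """
    target = Path(target_dir)
    # create the target folder (and any parents) if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # fail early with a clear message if the expected file is absent
        if member not in archive.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        archive.extract(member, path=target)
    return target / member
```

For example, `extract_dataset("data_kaggle.csv.zip", "/storage/archive/kl_property")` would leave the CSV where the cells below expect it, assuming the zip file sits in the current working directory.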


In [128]:
# define a target path for this house price dataset
path = URLs.path('kl_property')

In [129]:
# ingest the dataset into a Pandas dataframe
df_train = pd.read_csv(path/'data_kaggle.csv')

In [130]:
df_train.head()

Unnamed: 0,Location,Price,Rooms,Bathrooms,Car Parks,Property Type,Size,Furnishing
0,"KLCC, Kuala Lumpur","RM 1,250,000",2+1,3.0,2.0,Serviced Residence,"Built-up : 1,335 sq. ft.",Fully Furnished
1,"Damansara Heights, Kuala Lumpur","RM 6,800,000",6,7.0,,Bungalow,Land area : 6900 sq. ft.,Partly Furnished
2,"Dutamas, Kuala Lumpur","RM 1,030,000",3,4.0,2.0,Condominium (Corner),"Built-up : 1,875 sq. ft.",Partly Furnished
3,"Cheras, Kuala Lumpur",,,,,,,
4,"Bukit Jalil, Kuala Lumpur","RM 900,000",4+1,3.0,2.0,Condominium (Corner),"Built-up : 1,513 sq. ft.",Partly Furnished


In [131]:
df_train.shape

(53883, 8)

In [None]:
# control whether the dependent variable is continuous or categorical
# if this switch is set to True then the values in the Price column are replaced with
# string indicators: "0" if Price is less than or equal to the mean; "1" if Price is above the mean
categorical_target = True

# Preprocessing to clean up the dataset
Unlike some other datasets featured on Kaggle, this dataset has many anomalies that need to be cleaned up before fastai data preparation can be applied to it. In particular, the Size column has values that were entered free form, so it needs a lot of work. For this column we've added processing to extract a useful numerical value from the column's values where possible; for values that are difficult to parse, we drop the row. We lose about 1% of the rows this way, a reasonable tradeoff to keep the cleanup code as simple as possible.

Here are the issues that need to be corrected with this dataset:
- The Price column has some missing values. We need to remove the rows with these values.
- The Price column includes the ringgit symbol ("RM", the symbol for the Malaysian currency). We need to remove this symbol so that the column can be treated as a continuous column.
- The Size column needs to be split into two columns, one with the size type and the other with the size (area).
- The Size (area) column needs to be updated to remove the unit of measure ("sq. ft.") and to convert dimension expressions (such as "22x80") into scalars.
- Size entries like "5700 sf sq. ft." and "646sf~1001sf sq. ft." need special handling: we remove the rows with ranges or constructs like "22’x100’ sq. ft.", as well as rows that contain strings that cannot be converted into numerics.
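
To make the cleanup rules above concrete, here is a small pure-Python sketch applied to individual strings of the kind shown in `df_train.head()`. The helpers `parse_price` and `split_size` are hypothetical names used for illustration; the notebook itself does the equivalent work with pandas string methods in the cells that follow. This sketch leaves dimension products such as "22x80" unparsed.

```python
import re


def parse_price(raw):
    """Turn a string like 'RM 1,250,000' into a float; return None if it cannot be parsed."""
    if raw is None:
        return None
    cleaned = raw.replace("RM", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def split_size(raw):
    """Split 'Built-up : 1,335 sq. ft.' into ('built-up', 1335.0).

    Returns (size_type, None) when the area part is not a plain number,
    for example a range such as '646sf~1001sf sq. ft.'.
    """
    size_type, _, area = raw.lower().partition(":")
    area = area.replace(",", "").replace("sq. ft.", "").strip()
    # accept only plain integers or decimals; anything else is left unparsed
    if re.fullmatch(r"\d+(\.\d+)?", area):
        return size_type.strip(), float(area)
    return size_type.strip(), None


print(parse_price("RM 1,250,000"))
print(split_size("Built-up : 1,335 sq. ft."))
print(split_size("Land area : 646sf~1001sf sq. ft."))
```

Returning `None` for unparseable areas mirrors the notebook's choice to drop hard-to-parse rows rather than guess at their meaning.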




In [None]:
# function to remove the currency symbol
def remove_currency(currency_string, input_string):
    # escape the symbol so that it is matched literally rather than as a regex pattern
    output_string = re.sub(re.escape(currency_string),'',input_string)
    return(output_string)
    

In [None]:
# function to remove everything after the space in a string
def remove_after_space(input_string):
    # remove leading and trailing spaces
    input_string = input_string.strip()
    #print('input:', input_string)
    # remove everything from the first internal space onward
    output_string = re.sub(r'\s* .*', '', input_string)
    # remove any parenthesized text that remains
    output_string = re.sub(r'\([^)]*\)','',output_string)
    #print('output:',output_string)
    return(output_string)

In [None]:
# remove rows with missing Price values
df_train.dropna(subset=['Price'], inplace=True)
# remove currency symbol from remaining rows
df_train['Price'] = df_train['Price'].apply(lambda x: remove_currency("RM ",x))


# convert Price column to float
df_train['Price'] = pd.to_numeric(df_train['Price'].str.replace(',',''), errors='coerce')
df_train.head()


In [None]:
df_train.shape

In [None]:
# lowercase values in the Size column
df_train['Size'] = df_train['Size'].str.lower()
#  remove remaining records that have "sf","acres", or "#" in the Size column

df_train = df_train[~df_train.Size.str.contains("sf",na=False)]
df_train = df_train[~df_train.Size.str.contains("acre",na=False)]
df_train = df_train[~df_train.Size.str.contains("#",na=False)]

# split the Size column into two columns and make the remaining Size column numeric
df_train[['Size_type','Size']] = df_train['Size'].str.split(':',expand=True)
df_train = df_train[~df_train.Size.str.contains("kuala",na=False)]
df_train = df_train[~df_train.Size.str.contains("malaysia",na=False)]
df_train = df_train[~df_train.Size.str.contains("nil",na=False)]
df_train = df_train[~df_train.Size.str.contains("corner",na=False)]
df_train = df_train[~df_train.Size.str.contains("unknown",na=False)]
df_train = df_train[~df_train.Size.str.contains("n/a",na=False)]
df_train = df_train[~df_train.Size.str.contains("na",na=False)]
df_train = df_train[~df_train.Size.str.contains("wp",na=False)]
df_train = df_train[~df_train.Size.str.contains("xx",na=False)]
df_train = df_train[~df_train.Size.str.contains("intermediate",na=False)]
df_train = df_train[~df_train.Size.str.contains("wilayah",na=False)]
df_train = df_train[~df_train.Size.str.contains("-",na=False)]
df_train = df_train[~df_train.Size.str.contains(r"\+",na=False)]
df_train = df_train[~df_train.Size.str.contains("'",na=False)]
df_train = df_train[~df_train.Size.str.contains("~",na=False)]
# remove commas and the unit of measure, and replace "x" with "*" so that "22x80" becomes "22*80" and can yield a scalar when eval() is applied
# df_train['Size'] = pd.to_numeric(df_train['Size'].str.replace(',','').str.replace(' sq. ft.','').str.replace("x","*"), errors='coerce')

df_train['Size'] = df_train['Size'].str.replace(',','').str.replace('`','').str.replace('@','x').str.replace(r'\+ sq. ft.','')
df_train['Size'] = df_train['Size'].str.replace(' sq. ft.','').str.replace('sf sq. ft.','').str.replace('ft','').str.replace('sq','').str.replace("xx","*").str.replace("x ","*").str.replace(" x","*").str.replace("x","*").str.replace("X","*").str.replace("'",'')

df_train.head()


In [None]:
df_train.shape

In [None]:
# replace missing values in the Size column
df_train['Size'] = df_train['Size'].fillna("0")


In [None]:
# remove duplicates of the form "2850 38x25" by removing everything after space in Size field
df_train['Size'] = df_train['Size'].apply(lambda x: remove_after_space(x))


In [None]:
# apply eval() to the Size column to convert product expressions such as "24*12" to numeric values
# note: eval() executes arbitrary Python, so only use it on data you trust
df_train['Size'] = df_train['Size'].apply(lambda x: eval(str(x)))
df_train.head()
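
Because `eval()` runs whatever Python expression it is given, a stricter parser may be preferable for free-form data. The sketch below is one possible alternative rather than the notebook's method: `size_to_number` is a hypothetical helper that accepts only non-negative numbers joined by `*` and returns `None` for anything else. Unlike `eval()`, it always returns a float.

```python
import re
from functools import reduce
from operator import mul

# matches one or more non-negative numbers joined by '*', e.g. '1335' or '22*80'
_PRODUCT_RE = re.compile(r"\d+(?:\.\d+)?(?:\*\d+(?:\.\d+)?)*")


def size_to_number(expr):
    """Evaluate '22*80'-style strings without eval(); return None if malformed."""
    expr = str(expr).strip()
    if not _PRODUCT_RE.fullmatch(expr):
        return None
    # multiply the factors together, starting from 1.0
    factors = [float(part) for part in expr.split("*")]
    return reduce(mul, factors, 1.0)


print(size_to_number("22*80"))
print(size_to_number("1335"))
print(size_to_number("__import__('os')"))
```

In the notebook this could replace the `eval()` call with `df_train['Size'].apply(size_to_number)`, followed by dropping the rows where the result is missing.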

# Check for missing values

In [None]:
# get a count by column of missing values
count = df_train.isna().sum()
df_train_missing = (pd.concat([count.rename('missing_count'),
                     count.div(len(df_train))
                          .rename('missing_ratio')],axis = 1)
             .loc[count.ne(0)])

In [None]:
df_train_missing

In [None]:
df_train_missing.shape

# Set the target
Adjust the Price column for binary classification

In [None]:
# function to replace target values with value indicating whether the input is over or under the mean
# note that setting the target to be a string like this results in much higher accuracy (94%) vs. setting
# the target to be a float (accuracy ~ 76%)
def under_over(x,mean_x):
    if (x <= mean_x):
        #returner = 0.0
        returner = "0"
    else:
        returner = "1"
    return(returner)

In [None]:
# set target column
mean_sp = int(df_train['Price'].mean())
if categorical_target:
    df_train['Price'] = df_train['Price'].apply(lambda x: under_over(x,mean_sp))
df_train.head()

In [None]:
mean_sp

In [None]:
# check the proportion of Price values
df_train['Price'].value_counts()
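
`value_counts()` reports raw counts; passing `normalize=True` returns shares instead. Shares are useful here because the share of the larger class is the accuracy you would get by always predicting that class, which gives a baseline for judging the trained model. The following is a minimal pure-Python sketch of that calculation. `class_proportions` is a hypothetical helper, and the labels are made up for illustration, not taken from the dataset.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: share of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# made-up labels for illustration only, not taken from the dataset
example_labels = ["0", "0", "0", "1"]
shares = class_proportions(example_labels)
print(shares)
print("majority-class baseline:", max(shares.values()))
```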

In [None]:
df_train.shape

# Define the target, continuous and categorical columns

In [None]:
# define transforms to apply to the tabular dataset
procs = [FillMissing,Categorify]
# define the dependent variable (y value)
dep_var = 'Price'
# define columns that are continuous / categorical
cont,cat = cont_cat_split(df_train, 1, dep_var=dep_var) 
print("continuous columns are: ",cont)
print("categorical columns are: ",cat)

# Define TabularDataLoaders

In [None]:
# define TabularDataLoaders object using the dataframe, the list of pre-processing steps, the categorical and continuous
# column lists
# valid_idx: the indices to use for the validation set (here, the last 5,000 rows of the dataframe)
procs = [FillMissing,Categorify, Normalize]
dls = TabularDataLoaders.from_df(df_train,path,procs= procs, 
                               cat_names= cat, cont_names = cont, y_names = dep_var, valid_idx=list(range((df_train.shape[0]-5000),df_train.shape[0])), bs=64)
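
The `valid_idx` expression above holds out the final 5,000 rows. The sketch below shows the same index arithmetic on small made-up numbers, assuming the indices are treated as row positions. `last_n_indices` is a hypothetical helper, not a fastai function.

```python
def last_n_indices(n_rows, n_valid):
    """Return the positions of the final n_valid rows out of n_rows."""
    if not 0 < n_valid < n_rows:
        raise ValueError("n_valid must be between 1 and n_rows - 1")
    return list(range(n_rows - n_valid, n_rows))


# small made-up numbers for illustration
print(last_n_indices(10, 3))
```

Holding out the tail of the file is simple and reproducible, but if the listings happen to be ordered in some meaningful way the validation set may not be representative of the whole dataset. A random split is a common alternative in that case.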
                               

In [None]:
# display a sample batch
dls.valid.show_batch()

In [None]:
# define and fit the model
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(3)

In [None]:
# show the loss function used by the learner
learn.loss_func

In [None]:
# show a set of results from the model
learn.show_results()

# Examine the structure of the trained model

Use the summary() function to see the structure of the trained model, including:

- the layers that make up the model
- total parameters
- loss function
- optimizer function
- callbacks

In [None]:
learn.summary()