# Regression

This tutorial uses safeds on **house sales data** to predict house prices.


1. Load your data into a `Table`. The data is available under `docs/tutorials/data/house_sales.csv`:


In [1]:
from safeds.data.tabular.containers import Table

pricing = Table.from_csv_file("data/house_sales.csv")
# For visualisation purposes we only print out the first 15 rows.
pricing.slice_rows(start=0, length=15)

id,year,month,day,zipcode,latitude,longitude,sqft_lot,sqft_living,sqft_above,sqft_basement,floors,bedrooms,bathrooms,waterfront,view,condition,grade,year_built,year_renovated,sqft_lot_15nn,sqft_living_15nn,price
i64,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64
0,2014,5,2,98001,47.3406,-122.269,9397,2200,2200,0,2.0,4,2.5,0,1,3,8,1987,0,9176,2310,285000
1,2014,5,2,98003,47.3537,-122.303,10834,2090,1360,730,1.0,3,2.5,0,1,4,8,1987,0,8595,1750,285000
2,2014,5,2,98006,47.5443,-122.177,8119,2160,1080,1080,1.0,4,2.25,0,1,3,8,1966,0,9000,1850,440000
3,2014,5,2,98006,47.5746,-122.135,8800,1450,1450,0,1.0,4,1.0,0,1,4,7,1954,0,8942,1260,435000
4,2014,5,2,98006,47.5725,-122.133,10000,1920,1070,850,1.0,4,1.5,0,1,4,7,1954,0,10836,1450,430000
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
10,2014,5,2,98023,47.3256,-122.378,33151,3240,3240,0,2.0,3,2.5,0,3,3,10,1995,0,24967,4050,604000
11,2014,5,2,98024,47.5643,-121.897,16215,1580,1580,0,1.0,3,2.25,0,1,4,7,1978,0,16215,1450,335000
12,2014,5,2,98027,47.4635,-121.991,35100,1970,1970,0,2.0,3,2.25,0,1,4,9,1977,0,35100,2340,437500
13,2014,5,2,98027,47.4634,-121.987,37277,2710,2710,0,2.0,4,2.75,0,1,3,9,2000,0,39299,2390,630000


2. Split the house sales dataset into two tables: a training set containing 60% of the data, which we will use later to train a model that predicts the house price, and a testing set containing the remaining 40%.
Remove the `price` column from the test set so that the model can predict it later:


In [2]:
train_table, testing_table = pricing.split_rows(0.60)

test_table = testing_table.remove_columns(["price"]).shuffle_rows()
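
To make the idea of a 60/40 split concrete, here is a minimal sketch in plain Python that does not use safeds. The helper `split_rows_sketch`, the fixed seed, and the toy rows are all hypothetical and only for illustration. Shuffling before the cut is an assumption of this sketch, not a statement about how the library splits.

```python
import random


def split_rows_sketch(rows, fraction_in_first, seed=0):
    """Shuffle a copy of the rows, then cut it at the given fraction.

    Hypothetical helper for illustration; not part of safeds.
    """
    if not 0.0 <= fraction_in_first <= 1.0:
        raise ValueError("fraction_in_first must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumption: shuffle before cutting
    cut = round(len(shuffled) * fraction_in_first)
    return shuffled[:cut], shuffled[cut:]


# Toy rows (made up), standing in for the house sales table.
rows = [{"id": i, "price": 100_000 + 1_000 * i} for i in range(10)]

train_rows, test_rows_with_price = split_rows_sketch(rows, 0.60)

# Drop the target from the test rows so that it has to be predicted.
test_rows = [
    {name: value for name, value in row.items() if name != "price"}
    for row in test_rows_with_price
]

print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two parts, so the model is later evaluated only on rows it never saw during training.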

3. Mark the `price` column as the target variable to be predicted. Mark the `id` column as an extra column, which the model ignores completely:

In [3]:
extra_names = ["id"]

train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
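
The three column roles (features, target, extras) can be pictured with a small plain-Python sketch. The helper `split_roles` is hypothetical and does not reflect how safeds stores a tabular dataset internally. The two rows reuse a few values from the first rows of the table shown earlier.

```python
def split_roles(rows, target_name, extra_names):
    """Separate each row into features, target, and extra columns.

    Hypothetical helper for illustration; not part of safeds.
    """
    features, targets, extras = [], [], []
    for row in rows:
        targets.append(row[target_name])
        extras.append({name: row[name] for name in extra_names})
        features.append(
            {
                name: value
                for name, value in row.items()
                if name != target_name and name not in extra_names
            }
        )
    return features, targets, extras


rows = [
    {"id": 0, "sqft_living": 2200, "bedrooms": 4, "price": 285000},
    {"id": 1, "sqft_living": 2090, "bedrooms": 3, "price": 285000},
]

features, targets, extras = split_roles(rows, "price", ["id"])

print(features)  # only these columns would be used for learning
print(targets)   # the values to predict
print(extras)    # carried along, but not used for learning
```

Keeping `id` out of the features makes sense because a row identifier carries no information about the price of a house.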


4. Use a `DecisionTreeRegressor` as the model for the regression. Pass `train_tabular_dataset` to the model's `fit` method:


In [4]:
from safeds.ml.classical.regression import DecisionTreeRegressor

model = DecisionTreeRegressor()
fitted_model = model.fit(train_tabular_dataset)
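
For intuition about what a decision tree regressor learns, the sketch below fits the simplest possible tree: a single split on one feature, with each leaf predicting the mean target of its training rows. This is a simplified illustration in plain Python, not the implementation used by safeds. The function names and the toy data are made up, and choosing the split by squared error is an assumption of this sketch.

```python
def mean(values):
    return sum(values) / len(values)


def squared_error(values):
    """Sum of squared deviations from the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(xs, ys):
    """Find the threshold whose two leaf means give the lowest squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = squared_error(left) + squared_error(right)
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, threshold, mean(left), mean(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    """Predict the mean of the leaf that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Toy data (made up): living area in square feet and sale price.
sqft_living = [1000, 1200, 1500, 3000, 3200, 3500]
price = [200_000, 220_000, 210_000, 600_000, 620_000, 610_000]

stump = fit_stump(sqft_living, price)
print(stump)
print(predict_stump(stump, 1100), predict_stump(stump, 3100))
```

A full decision tree repeats this splitting step recursively inside each leaf and across all feature columns, which lets it capture far more structure than a single threshold.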

5. Use the fitted decision tree regressor, which we trained on the training dataset, to predict the house prices in the test dataset:


In [5]:
prediction = fitted_model.predict(
    test_table
)
# For visualisation purposes we only print out the first 15 rows.
prediction.to_table().slice_rows(start=0, length=15)

id,year,month,day,zipcode,latitude,longitude,sqft_lot,sqft_living,sqft_above,sqft_basement,floors,bedrooms,bathrooms,waterfront,view,condition,grade,year_built,year_renovated,sqft_lot_15nn,sqft_living_15nn,price
i64,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64,f64
20953,2015,4,30,98144,47.5835,-122.313,2665,2960,1950,1010,2.0,7,4.0,0,1,3,9,1927,2013,4410,1970,909625.0
21205,2015,5,5,98052,47.6842,-122.155,7800,2300,2300,0,2.0,3,2.5,0,3,3,9,1997,0,8187,2300,776695.0
1360,2014,5,23,98115,47.684,-122.281,5000,1814,944,870,1.0,4,1.75,0,1,4,7,1951,0,5000,1290,544614.75
15230,2015,1,21,98077,47.7696,-122.021,217800,3810,3810,0,2.0,4,3.0,0,1,3,9,2003,0,217364,2580,867816.666667
12893,2014,11,21,98031,47.4014,-122.186,8400,1070,1070,0,1.0,2,2.0,0,1,4,7,1980,0,8190,1430,227493.75
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
8807,2014,9,12,98045,47.4759,-121.735,5978,2640,2640,0,2.0,3,3.0,0,1,3,9,2012,0,6060,2680,499355.2
13089,2014,11,25,98117,47.6802,-122.358,5050,2090,1090,1000,1.0,4,1.75,0,1,4,7,1916,0,5000,1760,492833.333333
12016,2014,11,6,98059,47.5305,-122.135,12968,5550,5550,0,2.0,4,4.25,0,1,3,11,2005,0,13001,4750,1.35245e6
17314,2015,3,10,98052,47.631,-122.098,9500,1650,1650,0,1.0,3,1.75,0,1,3,8,1967,0,9375,1880,523916.666667


6. Compute the mean absolute error of the model on the original `testing_table`, which still contains the `price` column:


In [6]:
test_tabular_dataset = testing_table.to_tabular_dataset("price", extra_names=extra_names)

fitted_model.mean_absolute_error(test_tabular_dataset)


93590.45902700891
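
The mean absolute error is the average of the absolute differences between actual and predicted values, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, and it is expressed in the same unit as `price`. A value of roughly 93,590 therefore means that the predictions miss the actual sale price by about that amount on average. The sketch below computes the metric by hand in plain Python. The function name and the three toy prices are made up for illustration and are unrelated to the dataset.

```python
def mean_absolute_error_sketch(actual, predicted):
    """Average absolute difference between actual and predicted values.

    Standalone illustration; not the safeds method of a similar name.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy values (made up): errors of 10,000, 20,000 and 0.
actual = [300_000, 450_000, 500_000]
predicted = [310_000, 430_000, 500_000]

mae = mean_absolute_error_sketch(actual, predicted)
print(mae)  # 10000.0
```

Because every error counts in proportion to its size, a few badly mispriced houses raise the metric less than they would with a squared-error metric.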