# Regression

This tutorial uses safeds on **house sales data** to predict house prices.


## File and Imports

Start by creating a Python file with the suffix `.py`.

Import the classes you want to use from safeds.


## Reading Data

Load your data into a `Table`. The data is available under `docs/tutorials/data/house_sales.csv`:

In [14]:
from safeds.data.tabular.containers import Table

pricing = Table.from_csv_file("data/house_sales.csv")
# For visualisation purposes we only print out the first 15 rows.
pricing.slice_rows(start=0, length=15)

id,year,month,day,zipcode,latitude,longitude,sqft_lot,sqft_living,sqft_above,sqft_basement,floors,bedrooms,bathrooms,waterfront,view,condition,grade,year_built,year_renovated,sqft_lot_15nn,sqft_living_15nn,price
i64,i64,i64,i64,i64,f64,f64,i64,i64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64
0,2014,5,2,98001,47.3406,-122.269,9397,2200,2200,0,2.0,4,2.5,0,1,3,8,1987,0,9176,2310,285000
1,2014,5,2,98003,47.3537,-122.303,10834,2090,1360,730,1.0,3,2.5,0,1,4,8,1987,0,8595,1750,285000
2,2014,5,2,98006,47.5443,-122.177,8119,2160,1080,1080,1.0,4,2.25,0,1,3,8,1966,0,9000,1850,440000
3,2014,5,2,98006,47.5746,-122.135,8800,1450,1450,0,1.0,4,1.0,0,1,4,7,1954,0,8942,1260,435000
4,2014,5,2,98006,47.5725,-122.133,10000,1920,1070,850,1.0,4,1.5,0,1,4,7,1954,0,10836,1450,430000
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
10,2014,5,2,98023,47.3256,-122.378,33151,3240,3240,0,2.0,3,2.5,0,3,3,10,1995,0,24967,4050,604000
11,2014,5,2,98024,47.5643,-121.897,16215,1580,1580,0,1.0,3,2.25,0,1,4,7,1978,0,16215,1450,335000
12,2014,5,2,98027,47.4635,-121.991,35100,1970,1970,0,2.0,3,2.25,0,1,4,9,1977,0,35100,2340,437500
13,2014,5,2,98027,47.4634,-121.987,37277,2710,2710,0,2.0,4,2.75,0,1,3,9,2000,0,39299,2390,630000


## Cleaning your Data

At this point, it is common to clean the data. Here is an example of how to do so:

In [17]:
# removes columns "latitude" and "longitude" from table
pricing_columns = pricing.remove_columns(["latitude", "longitude"])
# removes rows which contain missing values
pricing_values = pricing_columns.remove_rows_with_missing_values()
# removes rows which contain outliers
pricing_outliers = pricing_values.remove_rows_with_outliers()
# For visualisation purposes we only print out the first 5 rows.
pricing_outliers.slice_rows(start=0, length=5)

id,year,month,day,zipcode,sqft_lot,sqft_living,sqft_above,sqft_basement,floors,bedrooms,bathrooms,waterfront,view,condition,grade,year_built,year_renovated,sqft_lot_15nn,sqft_living_15nn,price
i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64
0,2014,5,2,98001,9397,2200,2200,0,2.0,4,2.5,0,1,3,8,1987,0,9176,2310,285000
1,2014,5,2,98003,10834,2090,1360,730,1.0,3,2.5,0,1,4,8,1987,0,8595,1750,285000
2,2014,5,2,98006,8119,2160,1080,1080,1.0,4,2.25,0,1,3,8,1966,0,9000,1850,440000
3,2014,5,2,98006,8800,1450,1450,0,1.0,4,1.0,0,1,4,7,1954,0,8942,1260,435000
4,2014,5,2,98006,10000,1920,1070,850,1.0,4,1.5,0,1,4,7,1954,0,10836,1450,430000
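
To illustrate what removing outlier rows can mean, here is a minimal sketch in plain Python. It assumes outliers are flagged by z-score, that is, by how many standard deviations a value lies from its column mean. This is an assumption for illustration, not the Safe-DS implementation; check the Safe-DS API reference for the exact rule and default threshold. The function name `remove_outlier_rows`, its parameters, and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def remove_outlier_rows(rows, columns, z_threshold=3.0):
    """Keep rows whose values in `columns` all lie within `z_threshold` standard deviations."""
    # Compute mean and (population) standard deviation once per column.
    stats = {}
    for column in columns:
        values = [row[column] for row in rows]
        stats[column] = (mean(values), pstdev(values))

    def is_outlier(row):
        for column, (mu, sigma) in stats.items():
            # A constant column has sigma == 0 and cannot contain outliers.
            if sigma > 0 and abs(row[column] - mu) / sigma > z_threshold:
                return True
        return False

    return [row for row in rows if not is_outlier(row)]


# Hypothetical toy data: nine similar values and one extreme value.
sqft_values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
toy_rows = [{"id": i, "sqft_living": v} for i, v in enumerate(sqft_values)]

# A lower threshold than 3 is used here because the toy sample is tiny.
cleaned_rows = remove_outlier_rows(toy_rows, ["sqft_living"], z_threshold=2.0)
print(len(toy_rows), "->", len(cleaned_rows))
```

With very small samples, a single extreme value can never reach a z-score above 3, which is why the toy example lowers the threshold.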


For further data cleaning steps, see: https://library.safeds.com/en/stable/tutorials/data_processing

## Create Training and Testing Set

Split the house sales dataset into two tables. The training set contains 60% of the data and is used later to fit the model that predicts the house prices. The testing set contains the remaining 40%. Remove the `price` column from the test set so that it can be predicted later:


In [18]:
train_table, testing_table = pricing_outliers.split_rows(0.60)

test_table = testing_table.remove_columns(["price"]).shuffle_rows()
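
The idea behind the split can be sketched in plain Python. This is only an illustration of a 60/40 split followed by dropping the target column. The helper `split_by_fraction` and the toy rows are hypothetical, and the sketch does not shuffle, which may differ from what Safe-DS does internally.

```python
def split_by_fraction(rows, fraction_in_first):
    """Split a list of rows into two lists; the first gets the given fraction."""
    if not 0 <= fraction_in_first <= 1:
        raise ValueError("fraction_in_first must be between 0 and 1")
    cut = round(len(rows) * fraction_in_first)
    return rows[:cut], rows[cut:]


# Hypothetical toy rows with an id and a price.
toy_rows = [{"id": i, "price": 100_000 + 10_000 * i} for i in range(10)]

toy_train, toy_test = split_by_fraction(toy_rows, 0.60)

# Drop the target column from the test rows so it has to be predicted.
toy_test_features = [
    {name: value for name, value in row.items() if name != "price"}
    for row in toy_test
]
print(len(toy_train), len(toy_test))
```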

Mark the `price` column as the target variable to be predicted. Pass the `id` column as an extra column, which the model ignores completely:

In [19]:
extra_names = ["id"]

train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)
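
The three column roles can be made explicit with a short plain-Python sketch. It assumes that every column that is neither the target nor an extra column is used as a feature; this is an assumption about the behavior, so consult the Safe-DS documentation to confirm. The variable names `all_column_names` and `feature_names` are introduced here for illustration only.

```python
# Column names of the cleaned table, as shown in the output above.
all_column_names = [
    "id", "year", "month", "day", "zipcode", "sqft_lot", "sqft_living",
    "sqft_above", "sqft_basement", "floors", "bedrooms", "bathrooms",
    "waterfront", "view", "condition", "grade", "year_built",
    "year_renovated", "sqft_lot_15nn", "sqft_living_15nn", "price",
]

target_name = "price"   # the value to predict
extra_names = ["id"]    # kept in the table, but not used for training

# Assumption: everything else is treated as a feature.
feature_names = [
    name
    for name in all_column_names
    if name != target_name and name not in extra_names
]
print(len(feature_names), "feature columns")
```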

## Creating and Fitting a Regressor

Use a `DecisionTreeRegressor` as the model for the regression. Pass `train_tabular_dataset` to the `fit` method of the model:


In [20]:
from safeds.ml.classical.regression import DecisionTreeRegressor

model = DecisionTreeRegressor()
fitted_model = model.fit(train_tabular_dataset)
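
To give an intuition for what a decision tree regressor learns, the following sketch fits a tree with a single split (a "stump") on one feature. It picks the threshold that minimizes the squared error and predicts the mean price of the training rows on each side. This is a simplified illustration, not the algorithm used by Safe-DS; the functions `fit_stump` and `predict_stump` and the toy values are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Find the threshold on one feature that minimizes the total squared error.

    Requires at least two distinct feature values.
    """
    best = None
    # Every distinct value except the smallest is a candidate threshold.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    """Predict the mean target value of the side of the split that x falls on."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


# Hypothetical toy data: living area and price.
toy_sqft = [1000, 1200, 1500, 2500, 3000, 3200]
toy_price = [200_000, 220_000, 210_000, 500_000, 520_000, 510_000]

stump = fit_stump(toy_sqft, toy_price)
print(stump, predict_stump(stump, 1100), predict_stump(stump, 2800))
```

Because each prediction is the mean of the training targets in a leaf, predicted prices from a tree are often non-round numbers even when all training prices are integers.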

## Predicting with the Fitted Regressor

Use the fitted decision tree regressor, which was trained on the training dataset, to predict the prices of the houses in the test dataset.


In [21]:
prediction = fitted_model.predict(
    test_table
)
# For visualisation purposes we only print out the first 15 rows.
prediction.to_table().slice_rows(start=0, length=15)

id,year,month,day,zipcode,sqft_lot,sqft_living,sqft_above,sqft_basement,floors,bedrooms,bathrooms,waterfront,view,condition,grade,year_built,year_renovated,sqft_lot_15nn,sqft_living_15nn,price
i64,i64,i64,i64,i64,i64,i64,i64,i64,f64,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64,f64
13188,2014,12,1,98033,7300,1240,1240,0,1.0,3,1.0,0,1,3,7,1968,0,8260,1240,596442.857143
2387,2014,6,10,98106,6771,1780,1230,550,1.0,3,2.5,0,1,3,7,1990,0,6771,1780,388525.0
15530,2015,1,28,98188,7492,2560,2560,0,2.0,4,2.5,0,1,3,8,2014,0,11541,1260,339400.0
3605,2014,6,25,98136,1493,1350,1050,300,2.0,2,2.25,0,1,3,8,2007,0,1202,1250,345277.777778
1829,2014,6,2,98112,4337,1840,1840,0,2.0,4,1.5,0,1,4,8,1917,0,4337,2250,703114.666667
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
13522,2014,12,5,98052,9250,2150,2150,0,2.0,3,2.25,0,1,3,8,1984,0,9266,2240,489377.142857
16479,2015,2,22,98136,4800,1490,750,740,1.0,2,1.75,0,1,4,7,1918,0,6000,1400,553560.0
18536,2015,3,27,98118,1228,1100,900,200,2.0,2,1.5,0,1,3,7,2007,0,1380,1340,293458.333333
13785,2014,12,10,98133,5400,1240,1060,180,1.0,3,1.0,0,1,4,7,1940,0,5400,1429,565200.0


## Evaluating the Fitted Regressor

You can compute the mean absolute error of the model on the initial `testing_table` as follows:

In [22]:
test_tabular_dataset = testing_table.to_tabular_dataset("price", extra_names=extra_names)

fitted_model.mean_absolute_error(test_tabular_dataset)

92672.57153031077
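
The mean absolute error is the average of the absolute differences between the actual and the predicted values, so it is expressed in the same unit as the price. A minimal plain-Python sketch of the metric follows; the function below is a stand-alone illustration with hypothetical toy values, not the Safe-DS method itself.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy values.
actual_prices = [285_000, 440_000, 435_000]
predicted_prices = [295_000, 420_000, 435_000]

mae = mean_absolute_error(actual_prices, predicted_prices)
print(mae)
```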

## Full Code

In [23]:
from safeds.data.tabular.containers import Table
from safeds.ml.classical.regression import DecisionTreeRegressor

pricing = Table.from_csv_file("data/house_sales.csv")

pricing_columns = pricing.remove_columns(["latitude", "longitude"])
pricing_values = pricing_columns.remove_rows_with_missing_values()
pricing_outliers = pricing_values.remove_rows_with_outliers()

train_table, testing_table = pricing_outliers.split_rows(0.60)
test_table = testing_table.remove_columns(["price"]).shuffle_rows()

extra_names = ["id"]
train_tabular_dataset = train_table.to_tabular_dataset("price", extra_names=extra_names)

model = DecisionTreeRegressor()
fitted_model = model.fit(train_tabular_dataset)
prediction = fitted_model.predict(test_table)

test_tabular_dataset = testing_table.to_tabular_dataset("price", extra_names=extra_names)
fitted_model.mean_absolute_error(test_tabular_dataset)

92482.6376337888