# Task for Today  

***

## Poland House Price Prediction  

Given *data about houses in Poland*, let's try to predict the **price** of a given house.

We will use a gradient boosting regression model to make our predictions.

# Getting Started

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.ensemble import GradientBoostingRegressor

In [2]:
# ISO-8859-2 (Latin-2) decodes Polish letters such as ł and ń; latin-1 renders them as ³ and ñ
data = pd.read_csv('../input/house-prices-in-poland/Houses.csv', encoding='iso-8859-2')

In [3]:
data

|  | Unnamed: 0 | address | city | floor | id | latitude | longitude | price | rooms | sq | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Podgórze Zabłocie Stanisława Klimeckiego | Kraków | 2.0 | 23918.0 | 50.049224 | 19.970379 | 749000.0 | 3.0 | 74.05 | 2021.0 |
| 1 | 1 | Praga-Południe Grochowska | Warszawa | 3.0 | 17828.0 | 52.249775 | 21.106886 | 240548.0 | 1.0 | 24.38 | 2021.0 |
| 2 | 2 | Krowodrza Czarnowiejska | Kraków | 2.0 | 22784.0 | 50.066964 | 19.920025 | 427000.0 | 2.0 | 37.00 | 1970.0 |
| 3 | 3 | Grunwald | Poznań | 2.0 | 4315.0 | 52.404212 | 16.882542 | 1290000.0 | 5.0 | 166.00 | 1935.0 |
| 4 | 4 | Ochota Gotowy budynek. Stan deweloperski. Osta... | Warszawa | 1.0 | 11770.0 | 52.212225 | 20.972630 | 996000.0 | 5.0 | 105.00 | 2020.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23759 | 23759 | Stare Miasto Naramowice | Poznań | 0.0 | 3976.0 | 52.449649 | 16.949408 | 543000.0 | 4.0 | 77.00 | 2020.0 |
| 23760 | 23760 | Włochy | Warszawa | 4.0 | 10206.0 | 52.186109 | 20.948438 | 910000.0 | 3.0 | 71.00 | 2017.0 |
| 23761 | 23761 | Nowe Miasto Malta ul. Katowicka | Poznań | 0.0 | 4952.0 | 52.397345 | 16.961939 | 430695.0 | 3.0 | 50.67 | 2022.0 |
| 23762 | 23762 | Podgórze Duchackie Walerego Sławka | Kraków | 6.0 | 24148.0 | 50.024231 | 19.959569 | 359000.0 | 2.0 | 38.86 | 2021.0 |
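Polish letters such as `ł` and `ń` occupy single bytes in ISO-8859-2, and those same bytes decode to `³` and `ñ` under `latin-1`. The sketch below uses only the standard library and assumes the CSV is ISO-8859-2 encoded (the source does not state the file's encoding); the sample string is illustrative.

```python
# Assumption: the file's bytes are ISO-8859-2 (Latin-2); the dataset does not document this.
sample = "Zabłocie, Poznań"
raw = sample.encode("iso-8859-2")

# Decode the same bytes with two different single-byte codecs.
as_latin1 = raw.decode("latin-1")      # ł becomes ³, ń becomes ñ
as_latin2 = raw.decode("iso-8859-2")   # round-trips to the original text

# Text already read as latin-1 can be repaired without re-reading the file,
# because latin-1 maps every byte to exactly one character.
repaired = as_latin1.encode("latin-1").decode("iso-8859-2")

print(as_latin1)
print(as_latin2)
print(repaired == sample)
```

If the assumption holds, a frame that was already loaded with `latin-1` could be repaired column by column, for example with `data['address'].str.encode('latin-1').str.decode('iso-8859-2')`.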


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23764 entries, 0 to 23763
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  23764 non-null  int64  
 1   address     23764 non-null  object 
 2   city        23764 non-null  object 
 3   floor       23764 non-null  float64
 4   id          23764 non-null  float64
 5   latitude    23764 non-null  float64
 6   longitude   23764 non-null  float64
 7   price       23764 non-null  float64
 8   rooms       23764 non-null  float64
 9   sq          23764 non-null  float64
 10  year        23764 non-null  float64
dtypes: float64(8), int64(1), object(2)
memory usage: 2.0+ MB


# Preprocessing

In [5]:
def preprocess_inputs(df):
    df = df.copy()
    
    # Drop unused columns
    df = df.drop(['Unnamed: 0', 'address', 'id'], axis=1)
    
    # Split df into X and y
    y = df['price']
    X = df.drop('price', axis=1)
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    
    return X_train, X_test, y_train, y_test

In [6]:
X_train, X_test, y_train, y_test = preprocess_inputs(data)

In [7]:
X_train

|  | city | floor | latitude | longitude | rooms | sq | year |
|---|---|---|---|---|---|---|---|
| 23266 | Kraków | 3.0 | 50.046943 | 19.997153 | 1.0 | 23.75 | 2020.0 |
| 4050 | Warszawa | 0.0 | 52.322818 | 21.057657 | 4.0 | 76.60 | 2002.0 |
| 20631 | Kraków | 0.0 | 50.023016 | 19.908364 | 2.0 | 76.45 | 2018.0 |
| 23295 | Warszawa | 1.0 | 52.245964 | 21.133045 | 2.0 | 35.17 | 2022.0 |
| 23585 | Warszawa | 0.0 | 52.188721 | 21.058435 | 3.0 | 49.00 | 1970.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10955 | Warszawa | 4.0 | 52.277527 | 21.022353 | 2.0 | 37.50 | 1968.0 |
| 17289 | Kraków | 4.0 | 50.056192 | 19.928406 | 1.0 | 35.35 | 2020.0 |
| 5192 | Warszawa | 3.0 | 52.231958 | 21.006725 | 2.0 | 36.00 | 2020.0 |
| 12172 | Poznań | 0.0 | 52.387661 | 16.914801 | 1.0 | 20.00 | 1902.0 |


In [8]:
y_train

23266    329000.0
4050     575000.0
20631    890000.0
23295    269051.0
23585    549000.0
           ...   
10955    468750.0
17289    618625.0
5192     459000.0
12172    187500.0
235      620000.0
Name: price, Length: 16634, dtype: float64
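The `Length: 16634` reported for `y_train` follows from `train_size=0.7` on 23,764 rows: 0.7 × 23764 = 16634.8, and `train_test_split` rounds the training count down when only `train_size` is given, leaving the remainder for the test set. A small check on placeholder indices (not the real data) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder row indices standing in for the 23,764 rows of the dataset.
rows = np.arange(23764)

# Same split arguments as in preprocess_inputs.
train_rows, test_rows = train_test_split(rows, train_size=0.7, shuffle=True, random_state=1)

n_train, n_test = len(train_rows), len(test_rows)
print(n_train, n_test)
```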

# Building Pipeline

In [9]:
nominal_transformer = Pipeline(steps=[
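    # Dense one-hot output; scikit-learn 1.2+ renames this argument to sparse_output (sparse was removed in 1.4)
    ('onehot', OneHotEncoder(sparse=False))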
])

preprocessor = ColumnTransformer(transformers=[
    ('nominal', nominal_transformer, ['city'])
], remainder='passthrough')

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),
    ('regressor', GradientBoostingRegressor())
])
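To see what the `preprocessor` step hands to the scaler, the sketch below applies the same `ColumnTransformer` layout to a four-row toy frame (made-up values, not rows from the dataset). `sparse_threshold=0` is an addition here, used to force a dense array regardless of scikit-learn version; the notebook relies on `sparse=False` for that instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    'city': ['Kraków', 'Warszawa', 'Poznań', 'Warszawa'],
    'rooms': [3.0, 1.0, 5.0, 2.0],
    'sq': [74.0, 24.0, 166.0, 40.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[('nominal', OneHotEncoder(), ['city'])],
    remainder='passthrough',   # numeric columns are appended unchanged
    sparse_threshold=0,        # always return a dense NumPy array
)

encoded = toy_preprocessor.fit_transform(toy)

# Three one-hot city columns come first, followed by rooms and sq.
print(encoded.shape)
print(encoded)
```

Each distinct city becomes its own 0/1 column, so the three cities seen in the notebook's output would add three columns in front of the numeric features.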

# Training

In [10]:
model.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('nominal',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(sparse=False))]),
                                                  ['city'])])),
                ('scaler', StandardScaler()),
                ('regressor', GradientBoostingRegressor())])

# Results

In [11]:
y_pred = model.predict(X_test)

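# Root mean squared error, in the same units as price
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
print("RMSE: {:.5f}".format(rmse))

# Squared error of a baseline that always predicts the mean test price
baseline_errors = np.sum((y_test - np.mean(y_test))**2)
# Squared error of the model's predictions
model_errors = np.sum((y_test - y_pred)**2)

# R^2: the fraction of the baseline's squared error that the model removes
r2 = 1 - (model_errors / baseline_errors)
print("R^2 Score: {:.5f}".format(r2))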

RMSE: 246098.14872
R^2 Score: 0.78564
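The hand-written RMSE and R^2 can be cross-checked against `sklearn.metrics`. The sketch below uses four invented prices rather than the notebook's predictions, so its numbers say nothing about the model itself.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented values for illustration; not taken from the dataset.
y_true = np.array([300000.0, 450000.0, 600000.0, 750000.0])
y_hat = np.array([320000.0, 430000.0, 650000.0, 700000.0])

# Manual formulas, written the same way as in the notebook.
rmse_manual = np.sqrt(np.mean((y_true - y_hat)**2))
r2_manual = 1 - np.sum((y_true - y_hat)**2) / np.sum((y_true - np.mean(y_true))**2)

# Library equivalents; the square root is taken explicitly.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print("RMSE: {:.5f} vs {:.5f}".format(rmse_manual, rmse_sklearn))
print("R^2:  {:.5f} vs {:.5f}".format(r2_manual, r2_sklearn))
```

Read this way, the reported R^2 of 0.78564 means the model's squared error on the test set is about 21% of what predicting the mean test price for every house would give.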


# Data Every Day  

This notebook is featured on Data Every Day, a YouTube series where I train models on a new dataset each day.  

***

Check it out!  
https://youtu.be/9-T48384oM0