# Description of the dataset

1. address - Full address
2. city - City of the apartment: Warszawa (Warsaw), Kraków (Cracow), or Poznań (Poznan)
3. floor - The number of the floor where the apartment is located
4. id - id
5. latitude - latitude
6. longitude - longitude
7. price - Price of apartment in PLN [TARGET]
8. rooms - Number of rooms in the apartment
9. sq - Area of the apartment in square meters
10. year - Year of the building / apartment
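
The column list above can also be written down as a quick check that a loaded frame contains every documented column. This is only a sketch: the helper name `missing_columns` and the one-row demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column names taken from the description list above.
EXPECTED_COLUMNS = [
    "address", "city", "floor", "id", "latitude",
    "longitude", "price", "rooms", "sq", "year",
]


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `frame`."""
    return [name for name in expected if name not in frame.columns]


# Hypothetical one-row frame that lacks 'year', for illustration only.
demo = pd.DataFrame({name: [None] for name in EXPECTED_COLUMNS if name != "year"})
print(missing_columns(demo))   # ['year']
```

An empty result means every documented column is present; extra columns in the frame are ignored by this check.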

# Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Loading the dataset

In [16]:
# Load the dataset from the 'data' folder
file_path = '../data/Houses.csv'

try:
    dataset = pd.read_csv(file_path, encoding='utf-8')
except UnicodeDecodeError:
    # If utf-8 fails, try 'latin1' encoding
    dataset = pd.read_csv(file_path, encoding='latin1')
    
print(dataset)

       Unnamed: 0                                            address  \
0               0           Podgórze Zab³ocie Stanis³awa Klimeckiego   
1               1                          Praga-Po³udnie Grochowska   
2               2                            Krowodrza Czarnowiejska   
3               3                                           Grunwald   
4               4  Ochota Gotowy budynek. Stan deweloperski. Osta...   
...           ...                                                ...   
23759       23759                            Stare Miasto Naramowice   
23760       23760                                             W³ochy   
23761       23761                    Nowe Miasto Malta ul. Katowicka   
23762       23762                 Podgórze Duchackie Walerego S³awka   
23763       23763                                            Ursynów   

           city  floor       id   latitude  longitude      price  rooms  \
0        Kraków    2.0  23918.0  50.049224  19.970379   7490
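
The `³` and `ñ` characters in the printed addresses (for example `Zab³ocie` and `W³ochy`) suggest that the `latin1` fallback decoded the file without raising an error but mapped the Polish letters `ł` and `ń` to the wrong characters. The sketch below shows the effect on a short byte string and a fallback that tries a Central European encoding instead. It assumes the file was saved as `cp1250`, which the source does not confirm (`iso-8859-2` would produce the same symptoms for these letters), and the helper name `first_working_encoding` is hypothetical.

```python
# Bytes as they might appear in the file, ASSUMING it was saved as cp1250.
sample = "Podgórze Zabłocie;Poznań".encode("cp1250")


def first_working_encoding(raw, candidates=("utf-8", "cp1250")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError("no candidate encoding could decode the data")


encoding = first_working_encoding(sample)
print(encoding)                   # cp1250
print(sample.decode(encoding))    # Podgórze Zabłocie;Poznań
print(sample.decode("latin1"))    # Podgórze Zab³ocie;Poznañ
```

If the assumption holds for the real file, passing the detected name to `pd.read_csv(file_path, encoding=...)` in place of `'latin1'` should keep the city and address names intact. Because `latin1` accepts every byte value, it never raises and therefore cannot signal a wrong guess.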

# Dropping unnecessary columns

In [17]:
# Drop unnecessary columns for now
dataset = dataset.drop(['id', 'address', 'latitude', 'longitude'], axis=1)

print(dataset)

       Unnamed: 0      city  floor      price  rooms      sq    year
0               0    Kraków    2.0   749000.0    3.0   74.05  2021.0
1               1  Warszawa    3.0   240548.0    1.0   24.38  2021.0
2               2    Kraków    2.0   427000.0    2.0   37.00  1970.0
3               3    Poznañ    2.0  1290000.0    5.0  166.00  1935.0
4               4  Warszawa    1.0   996000.0    5.0  105.00  2020.0
...           ...       ...    ...        ...    ...     ...     ...
23759       23759    Poznañ    0.0   543000.0    4.0   77.00  2020.0
23760       23760  Warszawa    4.0   910000.0    3.0   71.00  2017.0
23761       23761    Poznañ    0.0   430695.0    3.0   50.67  2022.0
23762       23762    Kraków    6.0   359000.0    2.0   38.86  2021.0
23763       23763  Warszawa    2.0   604800.0    3.0   63.00  1978.0

[23764 rows x 7 columns]
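
The frame printed above still contains `Unnamed: 0`, which duplicates the row index and looks like an index that was written into the CSV when it was saved. If it is left in place it will end up among the features. A minimal sketch of two ways to remove it, using a small in-memory CSV as a stand-in for the real file (the sample rows are copied from the output above, the CSV layout is an assumption):

```python
import io

import pandas as pd

# Tiny stand-in for Houses.csv; the leading empty header is what pandas
# names 'Unnamed: 0' when a saved index is read back as a column.
csv_text = ",city,price\n0,Warszawa,240548.0\n1,Kraków,427000.0\n"

as_column = pd.read_csv(io.StringIO(csv_text))
print(list(as_column.columns))   # ['Unnamed: 0', 'city', 'price']

# Option 1: treat the first column as the index while reading.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column after loading, alongside the other unused ones.
dropped = as_column.drop(columns=["Unnamed: 0"])

print(list(as_index.columns))    # ['city', 'price']
print(list(dropped.columns))     # ['city', 'price']
```

Either option leaves only real attributes of the apartment in the frame; which one fits better depends on whether the saved index is needed later.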


# Separating features (X) and target variable (y)

In [19]:
# Separate features (X) and target variable (y)
X = dataset.drop('price', axis=1)
y = dataset['price']

In [20]:
print(X)

       Unnamed: 0      city  floor  rooms      sq    year
0               0    Kraków    2.0    3.0   74.05  2021.0
1               1  Warszawa    3.0    1.0   24.38  2021.0
2               2    Kraków    2.0    2.0   37.00  1970.0
3               3    Poznañ    2.0    5.0  166.00  1935.0
4               4  Warszawa    1.0    5.0  105.00  2020.0
...           ...       ...    ...    ...     ...     ...
23759       23759    Poznañ    0.0    4.0   77.00  2020.0
23760       23760  Warszawa    4.0    3.0   71.00  2017.0
23761       23761    Poznañ    0.0    3.0   50.67  2022.0
23762       23762    Kraków    6.0    2.0   38.86  2021.0
23763       23763  Warszawa    2.0    3.0   63.00  1978.0

[23764 rows x 6 columns]


In [21]:
print(y)

0         749000.0
1         240548.0
2         427000.0
3        1290000.0
4         996000.0
           ...    
23759     543000.0
23760     910000.0
23761     430695.0
23762     359000.0
23763     604800.0
Name: price, Length: 23764, dtype: float64


# Handling missing data

In [22]:
missing_data = dataset.isnull()
missing_data_count_per_column = missing_data.sum()

In [23]:
print(missing_data_count_per_column)

Unnamed: 0    0
city          0
floor         0
price         0
rooms         0
sq            0
year          0
dtype: int64


The output shows that no column contains missing values: every count in missing_data_count_per_column is 0.
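
To see what this check does and does not count, here is the same `isnull().sum()` pattern on a small frame with known gaps. The toy values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for this example.
toy = pd.DataFrame(
    {
        "floor": [2.0, np.nan, 0.0],
        "rooms": [3.0, 1.0, 2.0],
        "year": [2021.0, np.nan, np.nan],
    }
)

# Count NaN entries per column, as in the cells above.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().any().any())
print(has_missing)
```

Note that `isnull()` only flags `NaN`/`None`. A placeholder such as `0.0` in `floor` is counted as present, so a zero count here does not rule out implausible values in the real data.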