# The Data

In every field of Artificial Intelligence and Machine Learning, data is the fuel that allows us to build models. Every project should start with a thorough analysis of the available data.

The dataset we are going to study is *Real Estate Dataset.csv* from Kaggle, available at https://www.kaggle.com/datasets/arnoldkakas/real-estate-dataset?resource=download.

This dataset contains raw advertising data from the nehnutelnosti.sk website, gathered through ethical scraping methods in compliance with the robots.txt protocol in mid-November 2023. Focused exclusively on apartments, it excludes houses and other real estate types. The dataset's structure comprises various columns detailing apartment listings. Intentionally unprocessed, this dataset offers a valuable learning resource for those looking to develop fundamental data science skills, including data cleaning, exploratory data analysis (EDA), and predictive model training. It presents a unique chance to refine your abilities while engaging with authentic, real-world data.

Here is the description of each column of the dataset. 
- name_nsi: Name of the commune 
- price: Price in EUR 
- index: "Index of Living," ranging from 0 to 10, calculated by the Slovak startup City Performer (https://cityperformer.com/). It considers six categories: environment, quality_of_living, safety, transport, services, and relax.
- environment: Component of the index 
- quality_of_living: Component of the index 
- safety: Component of the index 
- transport: Component of the index 
- services: Component of the index 
- relax: Component of the index 
- condition: Condition of the listed apartment 
- area: Area in square meters 
- energy_costs: Energy costs in EUR 
- provision: Binary indicator; 1 if the provision of the agency is included in the price, else 0 
- certificate: Energy certificate of the building 
- construction_type: Construction type of the building 
- orientation: Geographical orientation 
- year_built: Year of construction 
- last_reconstruction: Year of the last reconstruction (no specification of what reconstruction means) 
- total_floors: Number of total floors in the building
- floor: Number of the listed apartment's floor 
- lift: Binary indicator; 1 if the building has a lift, else 0 
- balkonies: Number of balconies (the column name is spelled this way in the dataset) 
- loggia: Number of loggias 
- cellar: Binary indicator; 1 if the building has a cellar, else 0 
- type: Type of apartment 
- rooms: Number of rooms 
- district: District where the commune belongs 
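
Several of the columns above are described as binary indicators, which is easy to verify before any analysis. The following is a minimal sketch of such a check: the helper name `non_binary_columns` is hypothetical, and the small frame only reuses the values of the first five rows of the dataset instead of loading the full file.

```python
import pandas as pd

# Columns documented as binary indicators in the list above.
BINARY_COLUMNS = ["provision", "lift", "cellar"]


def non_binary_columns(frame, columns=BINARY_COLUMNS):
    """Return the columns that contain values other than 0 or 1 (NaN is ignored)."""
    offenders = []
    for col in columns:
        values = frame[col].dropna()
        if not values.isin([0, 1]).all():
            offenders.append(col)
    return offenders


# Values taken from the first five rows of the dataset (illustration only).
binary_sample = pd.DataFrame({
    "provision": [0, 0, 0, 1, 0],
    "lift": [0, 0, 0, 1, 0],
    "cellar": [0, 0, 0, 0, 0],
})
print(non_binary_columns(binary_sample))
```

An empty list means every checked column matches its description; on the full dataset the result may differ.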

In the next notebook, we will try to predict the price of the apartments from the other features. In this notebook, we therefore focus on how each feature relates to the price.
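
One possible first look at how the numeric columns relate to the price is a Pearson correlation. The sketch below is an assumption about how this could be done, not the method used later in the notebook: `price_correlations` is a hypothetical helper, and the five-row frame reuses values from the first rows of the dataset purely for illustration. Pearson correlation only captures linear relationships, and categorical columns such as `condition` are skipped.

```python
import pandas as pd


def price_correlations(frame, target="price"):
    """Pearson correlation of every numeric column with the target column."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Strongest relationships first, regardless of sign.
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Values taken from the first five rows of the dataset (far too few for conclusions).
corr_sample = pd.DataFrame({
    "price": [42000, 42000, 107000, 105000, 82000],
    "area": [58, 58, 40, 76, 63],
    "rooms": [3, 3, 1, 3, 2],
    "condition": [
        "Original condition",
        "Original condition",
        "Partial reconstruction",
        "Complete reconstruction",
        "Partial reconstruction",
    ],
})
print(price_correlations(corr_sample))
```

On the full dataset, the same call would be `price_correlations(df)` once `df` has been loaded.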

In [1]:
import pandas as pd

In [4]:
df = pd.read_csv('Data_Set/Real Estate Dataset.csv', sep=";")
# display all the columns in the notebook
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,name_nsi,price,index,environment,quality_of_living,safety,transport,services,relax,condition,area,energy_costs,provision,certificate,construction_type,orientation,year_built,last_reconstruction,total_floors,floor,lift,balkonies,loggia,cellar,type,rooms,district
0,Semerovo,42000,,,,,,,,Original condition,58,,0,,,,,,,,0,,,0,3-room apartment,3,Nové Zámky
1,Semerovo,42000,,,,,,,,Original condition,58,,0,none,Brick,,,,2.0,,0,,,0,3-room apartment,3,Nové Zámky
2,Štúrovo,107000,83.0,,,,,,,Partial reconstruction,40,,0,,,,,,5.0,3.0,0,,,0,1-room apartment,1,Nové Zámky
3,Štúrovo,105000,,,,,,,,Complete reconstruction,76,200.0,1,C,,,,,7.0,4.0,1,,,0,3-room apartment,3,Nové Zámky
4,Štúrovo,82000,,,,,,,,Partial reconstruction,63,,0,,,,,2018.0,,2.0,0,,,0,2-room apartment,2,Nové Zámky
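
The first rows already show many empty cells, so it is worth measuring how much of each column is missing. A minimal sketch, assuming a hypothetical helper `missing_share` and a small frame rebuilt from a few columns of the five rows displayed above:

```python
import numpy as np
import pandas as pd


def missing_share(frame):
    """Fraction of missing values per column, highest first."""
    return frame.isna().mean().sort_values(ascending=False)


# A few columns of the five rows displayed above; empty cells become NaN.
head_sample = pd.DataFrame({
    "price": [42000, 42000, 107000, 105000, 82000],
    "area": [58, 58, 40, 76, 63],
    "energy_costs": [np.nan, np.nan, np.nan, 200.0, np.nan],
    "total_floors": [np.nan, 2.0, 5.0, 7.0, np.nan],
    "floor": [np.nan, np.nan, 3.0, 4.0, 2.0],
})
print(missing_share(head_sample))
```

Applied to the full frame with `missing_share(df)`, this gives a quick ranking of the columns that will need the most cleaning.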


In [5]:
print(df.columns)

Index(['name_nsi', 'price', 'index', 'environment', 'quality_of_living',
       'safety', 'transport', 'services', 'relax', 'condition', 'area',
       'energy_costs', 'provision', 'certificate', 'construction_type',
       'orientation', 'year_built', 'last_reconstruction', 'total_floors',
       'floor', 'lift', 'balkonies', 'loggia', 'cellar', 'type', 'rooms',
       'district'],
      dtype='object')
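
The printed column list spells the balcony column `balkonies`. Renaming it is optional; the sketch below shows one way to do it, assuming the corrected spelling `balconies` is preferred in later code. The helper name `normalize_column_names` is hypothetical, and the empty frame only mimics a few of the column names printed above.

```python
import pandas as pd


def normalize_column_names(frame):
    """Return a copy of the frame with `balkonies` renamed to `balconies`."""
    return frame.rename(columns={"balkonies": "balconies"})


# Empty frame carrying a few of the column names printed above.
columns_sample = pd.DataFrame(
    columns=["floor", "lift", "balkonies", "loggia", "cellar"]
)
print(list(normalize_column_names(columns_sample).columns))
# ['floor', 'lift', 'balconies', 'loggia', 'cellar']
```

`DataFrame.rename` returns a new frame, so the original `df` keeps its column names unless the result is assigned back.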
