In [33]:
import pandas as pd
import os

## Reproduction of "Determinants of apartment prices in Warsaw" by Xinyue Fang

The study by Xinyue Fang explains how different determinants affect housing prices at the micro level. It focuses on factors that influence apartment prices in Warsaw and formulates the following hypotheses:
1. Apartment type influences price; tenement apartments are assumed to be the cheapest.
2. The relationship between size and price is non-linear; price is expected to rise rapidly as flat size increases.
3. Positive relationship between floor and price: the higher the apartment is situated, the higher the price. Flats on the lowest floor should be the cheapest.
4. Negative relationship between distance from the city center and price.
5. Positive relationship between the number of amenities and price.
6. Relationship between type of ownership and price.
7. Positive relationship between the existence of an elevator and price.
8. Positive relationship between the existence of a balcony and price.
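
The hypotheses above can be translated into regressors before any model is fitted. The sketch below is an illustration only, not the specification used in the original study. The helper name `build_features`, the derived column names, the `floor <= 1` rule for the lowest floor, the use of `poiCount` as a proxy for amenities, and the toy rows are all assumptions. The input column names match the Kaggle dataset used later in this notebook, and the flags are assumed to be stored as 'yes'/'no' strings.

```python
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive regressors that mirror hypotheses 1-8."""
    features = pd.DataFrame(index=df.index)
    # H2: a squared term lets price respond non-linearly to size.
    features['squareMeters'] = df['squareMeters']
    features['squareMeters_sq'] = df['squareMeters'] ** 2
    # H3: floor as a number plus a flag for the lowest floor (assumed: floor <= 1).
    features['floor'] = df['floor']
    features['is_lowest_floor'] = (df['floor'] <= 1).astype(int)
    # H4, H5: distance to the centre and number of nearby points of interest.
    features['centreDistance'] = df['centreDistance']
    features['poiCount'] = df['poiCount']
    # H7, H8: 'yes'/'no' strings (assumed encoding) become 0/1 indicators.
    for col in ['hasElevator', 'hasBalcony']:
        features[col] = (df[col] == 'yes').astype(int)
    # H1, H6: one indicator per category, dropping the first as the baseline.
    dummies = pd.get_dummies(df[['type', 'ownership']], drop_first=True, dtype=int)
    return pd.concat([features, dummies], axis=1)


# Invented rows, only to show the shape of the result.
toy = pd.DataFrame({
    'type': ['tenement', 'blockOfFlats', 'apartmentBuilding'],
    'squareMeters': [40.0, 55.0, 80.0],
    'floor': [1.0, 3.0, 7.0],
    'centreDistance': [2.5, 6.0, 4.2],
    'poiCount': [30.0, 12.0, 20.0],
    'ownership': ['condominium', 'cooperative', 'condominium'],
    'hasBalcony': ['no', 'yes', 'yes'],
    'hasElevator': ['no', 'no', 'yes'],
})
X = build_features(toy)
print(list(X.columns))
```

Dropping the first category of each dummy group avoids perfect collinearity with an intercept; which category serves as the baseline is a modelling choice and may differ from the original paper.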


The original dataset was taken from https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland . The original research was based on data from November 2023. Since Python makes it easy to process additional files, we decided to create two separate datasets: one that only reproduces the research, and an extended one with data from November 2023 up to June 2024. The source dataset from Kaggle includes more variables than the research requires, as well as rows for cities that are not part of the study, so the datasets need some adjusting.
How to proceed:
1. Download the data from the aforementioned source.
2. Place the CSV files in the data/source directory.
3. Recreate the original dataset for November 2023.
4. Create a copy of the dataset from the previous step and add data from the subsequent months found in the source directory.

In [67]:
def transform_data_from_file(file_path: str) -> pd.DataFrame:
    """Load one monthly CSV and keep only the rows and columns used in the study."""
    df_transformed = pd.read_csv(file_path)
    # Keep Warsaw offers only.
    df_transformed = df_transformed[df_transformed['city'] == 'warszawa']
    # Keep only the variables used in the original research.
    df_transformed = df_transformed[
        ['id', 'city', 'type', 'squareMeters', 'floor', 'floorCount', 'centreDistance', 'poiCount', 'ownership',
         'hasBalcony', 'hasElevator', 'price']]
    df_transformed = df_transformed[df_transformed['ownership'].isin(['cooperative', 'condominium'])]
    df_transformed = df_transformed[df_transformed['type'].isin(['apartmentBuilding', 'tenement', 'blockOfFlats'])]
    # Range checks: every condition must hold, so the masks are combined with & rather than |.
    df_transformed = df_transformed[
        (df_transformed['squareMeters'] > 0) & (df_transformed['floor'] >= 0) & (df_transformed['floorCount'] >= 0) & (
                df_transformed['poiCount'] >= 0) & (df_transformed['centreDistance'] >= 0) & (
                    df_transformed['price'] >= 0)]

    # Drop incomplete rows and repeated offers, then the now-constant city column.
    df_transformed = df_transformed.dropna()
    df_transformed = df_transformed.drop_duplicates(subset='id')
    df_transformed = df_transformed.drop(columns=['city'])
    return df_transformed


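def aggregate_data(source_dir: str = 'data/source') -> pd.DataFrame:
    """Concatenate all monthly CSV files from source_dir into one de-duplicated frame."""
    df_list = []
    # Sort the listing so the result does not depend on filesystem order.
    for file in sorted(os.listdir(source_dir)):
        if not file.endswith('.csv'):
            continue
        df_list.append(transform_data_from_file(os.path.join(source_dir, file)))
    aggregated_df = pd.concat(df_list, ignore_index=True)
    # Remove offers repeated across months, keeping the first occurrence.
    aggregated_df = aggregated_df.drop_duplicates(subset='id')
    aggregated_df = aggregated_df.drop(columns=['id'])
    return aggregated_df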


extended_df = aggregate_data()
original_df = transform_data_from_file('data/source/apartments_pl_2023_11.csv')
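
A range check of the kind used in `transform_data_from_file` only removes invalid rows when every condition has to hold at once, which in pandas means combining the boolean masks with `&`. With `|`, a row passes as soon as any single condition is true. The toy frame below uses invented values, not rows from the Kaggle data, to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    'squareMeters': [50.0, -1.0, 60.0],
    'price': [700000, 650000, -5],
})

# Every condition must hold: only the first row is valid.
all_valid = toy[(toy['squareMeters'] > 0) & (toy['price'] >= 0)]

# Any condition may hold: all three rows slip through.
any_valid = toy[(toy['squareMeters'] > 0) | (toy['price'] >= 0)]

print(len(all_valid), len(any_valid))  # 1 3
```

The parentheses around each comparison are required, because `&` and `|` bind more tightly than `>` and `>=` in Python.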


In [68]:
extended_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15805 entries, 0 to 27641
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   type            15805 non-null  object 
 1   squareMeters    15805 non-null  float64
 2   floor           15805 non-null  float64
 3   floorCount      15805 non-null  float64
 4   centreDistance  15805 non-null  float64
 5   poiCount        15805 non-null  float64
 6   ownership       15805 non-null  object 
 7   hasBalcony      15805 non-null  object 
 8   hasElevator     15805 non-null  object 
 9   price           15805 non-null  int64  
dtypes: float64(5), int64(1), object(4)
memory usage: 1.3+ MB


In [69]:
original_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2899 entries, 10627 to 15415
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              2899 non-null   object 
 1   type            2899 non-null   object 
 2   squareMeters    2899 non-null   float64
 3   floor           2899 non-null   float64
 4   floorCount      2899 non-null   float64
 5   centreDistance  2899 non-null   float64
 6   poiCount        2899 non-null   float64
 7   ownership       2899 non-null   object 
 8   hasBalcony      2899 non-null   object 
 9   hasElevator     2899 non-null   object 
 10  price           2899 non-null   int64  
dtypes: float64(5), int64(1), object(5)
memory usage: 271.8+ KB
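
The two summaries differ by one column: `original_df` still carries `id`, because `transform_data_from_file` keeps it, while `aggregate_data` drops it after de-duplication. If both frames are meant to feed the same model, one option is to reduce the single-month frame to the columns of the extended one. A minimal sketch with stand-in frames follows; the helper name `align_columns` and the sample values are hypothetical.

```python
import pandas as pd


def align_columns(df: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only the columns of `reference`, in its order."""
    missing = [col for col in reference.columns if col not in df.columns]
    if missing:
        raise KeyError(f'columns missing from df: {missing}')
    return df[list(reference.columns)].reset_index(drop=True)


# Stand-ins for original_df (with id) and extended_df (without id).
original_like = pd.DataFrame({
    'id': ['a', 'b'],
    'squareMeters': [40.0, 55.0],
    'price': [600000, 820000],
})
extended_like = pd.DataFrame({'squareMeters': [40.0], 'price': [600000]})

aligned = align_columns(original_like, extended_like)
print(list(aligned.columns))  # ['squareMeters', 'price']
```

In the notebook itself the call would be `align_columns(original_df, extended_df)`; resetting the index also replaces the sparse row labels left over from filtering.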
