# RWI-GEO-RED Panel Data Preprocessing: Houses for Sale (HK, Hauskauf)

## About the Dataset


### Dataset: CampusFile_HK_cities.csv

### Data Structure
- **Temporal Coverage**: Panel data with listing dates from 2007 to 2024
- **Geographic Coverage**: 15 largest cities in Germany

## Setup: Load Required Libraries

In [1]:
library(dplyr)
library(ggplot2)
library(tidyr)



## Load the Hauskauf (houses for sale) data

In [2]:
raw_data <- read.csv('data/RWI-GEO-RED/panel-15-largest-cities-germany/CampusFile_HK_cities.csv')

In [3]:
df <- raw_data
total_rows_in_raw_data <- nrow(df)
total_rows_in_raw_data

## Preprocess the data

In [4]:
summary(df)

      obid                plz          kaufpreis       mieteinnahmenpromonat
 Min.   : 25586984   Min.   :   -9   Min.   :  29000   Min.   :    -9.0     
 1st Qu.: 59750323   1st Qu.:13503   1st Qu.: 260000   1st Qu.:    -9.0     
 Median : 84395616   Median :30539   Median : 399900   Median :    -9.0     
 Mean   : 89238240   Mean   :36025   Mean   : 506339   Mean   :   197.9     
 3rd Qu.:116947671   3rd Qu.:50765   3rd Qu.: 649900   3rd Qu.:    -9.0     
 Max.   :156377699   Max.   :99571   Max.   :2444444   Max.   :100000.0     
   heizkosten         baujahr     letzte_modernisierung  wohnflaeche   
 Min.   : -9.000   Min.   :1500   Min.   :  -9.0        Min.   : 54.0  
 1st Qu.: -9.000   1st Qu.:1952   1st Qu.:  -9.0        1st Qu.:120.0  
 Median : -8.000   Median :1979   Median :  -9.0        Median :145.0  
 Mean   : -8.492   Mean   :1974   Mean   : 327.5        Mean   :174.8  
 3rd Qu.: -8.000   3rd Qu.:2009   3rd Qu.:  -9.0        3rd Qu.:198.2  
 Max.   :150.000   Max.   :20

## Date range of the house listings

In [5]:
# adat: date of the listing (year-month); "-01" is appended so it parses as a Date
df %>%
    summarise(
        min_adat = min(as.Date(paste0(adat, "-01")), na.rm = TRUE),
        max_adat = max(as.Date(paste0(adat, "-01")), na.rm = TRUE)
    )

| min_adat `<date>` | max_adat `<date>` |
|---|---|
| 2007-01-01 | 2024-12-01 |


The listings span January 2007 to December 2024.
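
The date handling in the cell above can be checked on a few toy values. This is a minimal sketch that assumes `adat` holds year-month strings such as `"2007-01"`, as the `paste0(adat, "-01")` call implies; the vector `adat_toy` is invented for illustration and is not taken from the dataset.

```r
# Hypothetical year-month strings in the same shape as adat (assumption)
adat_toy <- c("2015-06", "2007-01", "2024-12")

# Append a day so that as.Date() can parse the value; an explicit format
# avoids relying on as.Date()'s format guessing
listing_dates <- as.Date(paste0(adat_toy, "-01"), format = "%Y-%m-%d")

# Earliest and latest listing month
date_range <- range(listing_dates)
date_range

# The listing year can be derived from the parsed date if needed later
listing_year <- as.integer(format(listing_dates, "%Y"))
listing_year
```

Passing `format` explicitly means that a value that does not match the pattern becomes `NA` rather than depending on how `as.Date()` guesses the format.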

## Extract only Berlin samples

We only want to model the Berlin real estate listings, so we keep only the rows whose postal code (`plz`) lies between 10115 and 14199.

In [6]:
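extract_berlin_samples <- function(source_df) {
    # as.character() guards against plz having been read as a factor
    plz_int <- as.integer(as.character(source_df$plz))
    # Keep rows with a valid plz in the Berlin range 10115 to 14199 (inclusive);
    # missing values and the -9 sentinel fall outside the range and are dropped
    source_df[!is.na(plz_int) & plz_int >= 10115 & plz_int <= 14199, ]
}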

In [7]:
dfb <- extract_berlin_samples(df)

In [8]:
cat("Total rows in raw data:", total_rows_in_raw_data, "\n")
cat("Rows in Berlin sample:", nrow(dfb), "\n")
cat("Percentage of Berlin rows:", round(100 * nrow(dfb) / total_rows_in_raw_data, 2), "%\n")

Total rows in raw data: 412816 
Rows in Berlin sample: 89753 
Percentage of Berlin rows: 21.74 %
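
The same postal-code filter can also be written with `dplyr`, which is already loaded in this notebook. The sketch below is an alternative formulation, not the function used above; `toy_listings` and its values are invented for illustration.

```r
library(dplyr)

# Toy listings; the plz values are invented for illustration
toy_listings <- data.frame(
  obid = 1:5,
  plz  = c(10115, 14199, 80331, -9, NA)
)

# between() is inclusive on both ends; filter() drops rows where the
# condition is FALSE or NA, so the -9 sentinel and the NA row are removed
toy_berlin <- toy_listings %>%
  filter(between(plz, 10115, 14199))

toy_berlin$obid
```

Only the first two toy rows fall inside the range, which shows that both boundary values are kept and that sentinel or missing postal codes are excluded.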


## Sentinel cleanup

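The data uses negative integer codes as placeholders for missing information, for example:

- `-9`: missing data
- `-8`: did not answer

The cleanup here treats the codes -9 through -5 as sentinels and replaces them with `NA`, so that they cannot distort summaries such as the negative mean of `heizkosten` in the `summary(df)` output.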

In [12]:
# Utility function: replace sentinel codes with NA
sentinels <- c(-9, -8, -7, -6, -5)
replace_sentinels <- function(x, s = sentinels) {
  x[x %in% s] <- NA
  x
}
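
To see why the sentinel codes matter, the following sketch compares a mean before and after cleanup. The vector `heizkosten_toy` is invented for illustration, and `sentinel_codes` repeats the values used above so that the block runs on its own.

```r
# Toy heating-cost values with sentinel codes mixed in (invented data)
heizkosten_toy <- c(-9, -8, 120, 95, -9, 150)
sentinel_codes <- c(-9, -8, -7, -6, -5)

# The raw mean is pulled down by the placeholder codes
raw_mean <- mean(heizkosten_toy)

# Same replacement logic as replace_sentinels(): matching values become NA
cleaned <- replace(heizkosten_toy, heizkosten_toy %in% sentinel_codes, NA)

# na.rm = TRUE averages only the three real observations
clean_mean <- mean(cleaned, na.rm = TRUE)

c(raw = raw_mean, cleaned = clean_mean)
```

After the replacement, functions such as `mean()` need `na.rm = TRUE`, otherwise they return `NA` for any column that contained a sentinel.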

In [15]:
colnames(df)

In [16]:
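# Apply the cleanup to every column of df.
# Note: dfb was extracted before this step, so it is not affected
# unless the Berlin subset is extracted again from the cleaned df.
cols_with_sentinels <- colnames(df)
df[cols_with_sentinels] <- lapply(df[cols_with_sentinels], replace_sentinels)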
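
A possible variant restricts the cleanup to numeric columns and then counts the resulting `NA` values per column as a sanity check. This is a sketch under the assumption that sentinel codes only need to be handled in numeric columns; it is not the approach used in the cell above, and `toy` with its column `typ` is invented for illustration.

```r
library(dplyr)

sentinel_codes <- c(-9, -8, -7, -6, -5)

# Toy data frame; column names mirror the dataset, values are invented
toy <- data.frame(
  kaufpreis = c(250000, -9, 399900),
  baujahr   = c(1979, 2009, -8),
  typ       = c("a", "b", "c")   # hypothetical character column
)

# across(where(is.numeric), ...) applies the replacement to numeric columns only
toy_clean <- toy %>%
  mutate(across(where(is.numeric), ~ replace(.x, .x %in% sentinel_codes, NA)))

# Number of missing values per column after the cleanup
na_counts <- colSums(is.na(toy_clean))
na_counts
```

Counting `NA` values per column right after the cleanup gives a quick view of how much information each variable actually carries.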