# Overcoming the problem of biased estimates when analyzing open market data with the Jackknife resampling method

Cyrill A. Murashev, 2023-02-03

## Abstract

Appraisers often need to analyze and describe market data collected on open markets. They almost never have data on the whole market; instead they work with samples, which can be very small relative to the general population. This is where the problem of biased estimates arises: any statistical estimate computed from a sample describes the sample itself and may be biased relative to the value that an analysis of the entire general population would give. Appraisers often say that they have calculated some descriptive statistics of the market, such as the mean or median price, the maximum, or the minimum. We should understand that these are only estimates for samples, not for the entire market. In this article we consider the minimum theoretical basis of the Jackknife method and then apply it to real market data using Python. We will learn how to determine whether an estimate is biased and how to automatically reduce the linear component of that bias.

## Basis of the Jackknife method

### Introduction

First, we need to recall why appraisers need statistics. Usually they have some distribution of the features of objects in a sample of analogues collected on the open market, and they try to estimate the values of those features: the mean, median, maximum, minimum, variance, and so on. Sometimes they also need to compare two or more subsamples to decide whether adjustments are required based on the difference in feature values. More often than not, appraisers deal with samples, not with the entire market. Thus, they can obtain only sample estimates of the feature values, not their true values.
The Jackknife method addresses two issues:
- it reduces the bias of the sample estimate relative to the true value in the general population;
- it estimates the variance of the adjusted estimate.

Suppose we have some feature *X* (the unit price, for example) whose distribution in the general population is unknown to us, and a sample consisting of *n* elements $[x_{1},\ldots, x_{n}]$. We want to estimate the expectation of *X*, written $\mathbb{E}[X]$. In general, the expectation can be written as follows:
$$\mathbb{E}[X] = \sum_{j=1}^{n \gg 1} p(x_{j})x_{j}.$$

But we have only a sample, which consists of a very limited number of observations, far from infinity. So we cannot compute the expectation itself, only the sample mean, written as
$$\hat{\mu}=\dfrac{1}{n} \sum_{i=1}^{n \ll \infty}x_{i}.$$

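Therefore, we use observed frequencies rather than probabilities. In general $\mathbb{E}[X] \neq \hat{\mu}$, and we can write $\hat{\mu} = \mathbb{E}[X] + \mathrm{error}$, where $\hat{\mu}$ is the estimate of the expectation. For the sample mean under random sampling this error averages out to zero over repeated samples, but for many other statistics part of it is a systematic shift between the true and the estimated value, and this shift is called the bias.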


### General concept of the Jackknife method

We have considered a special case. We can now move on to the more general concept of an estimator's bias. Consider a random variable *X* with an unknown distribution *U* and some parameter of that distribution, denoted $\theta$, whose value we aim to determine. Using the abstract parameter $\theta$ instead of a specific one emphasizes the universality of the Jackknife method, which can detect bias for any parameter of a distribution and automatically correct its linear component. We also have $\hat{\theta}$, the sample estimate obtained by applying some function to the sample. Because $\hat{\theta}$ is computed from a sample, while we aim to estimate $\theta$ for the general population (the entire market, in the context of valuation), $\hat{\theta}$ may be biased relative to $\theta$. Mathematically, this means that the expectation of $\hat{\theta}$ is not equal to $\theta$:
$$\mathbb{E}(\hat{\theta}) \neq \theta.$$
In this case, we can write
$$\mathbb{E}(\hat{\theta}_{n}) = \theta + \frac{\alpha}{n} + \frac{\beta}{n^{2}} + \frac{\gamma}{n^{3}} + \ldots,$$
where $\theta$ is the true value of the parameter for the general population, and $\frac{\alpha}{n}$, $\frac{\beta}{n^{2}}$, $\frac{\gamma}{n^{3}}$, $\ldots$ are the linear, quadratic, cubic and higher-order components of the bias (in powers of $1/n$). All components shrink as the sample grows, each at its own rate. The linear term contributes the largest error for large *n* because it decreases the slowest.
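
As a concrete illustration (a standard textbook fact, not something derived from the data in this article), consider the plug-in variance, i.e., the variance estimate with divisor $n$. For independent observations with true variance $\sigma^{2}$, its expectation has exactly the form of the expansion above:

$$\hat{\sigma}^{2}_{n} = \frac{1}{n}\sum_{i=1}^{n}(x_{i}-\hat{\mu})^{2}, \qquad \mathbb{E}(\hat{\sigma}^{2}_{n}) = \frac{n-1}{n}\,\sigma^{2} = \sigma^{2} - \frac{\sigma^{2}}{n}.$$

Here $\theta = \sigma^{2}$, $\alpha = -\sigma^{2}$, and all higher-order coefficients are zero, so the bias of this estimator is purely linear in $1/n$.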

The Jackknife method eliminates the linear component of the bias. Let's introduce some new definitions.

$\hat{\theta}_{(i)}$ is the value of $\hat{\theta}$ obtained not from the full sample but from the sample with observation *i* excluded, where *i* runs from 1 to *n*. Since each such sample contains $n-1$ observations,
$$\mathbb{E}(\hat{\theta}_{(i)}) = \theta + \frac{\alpha}{n-1} + \frac{\beta}{(n-1)^{2}} + \frac{\gamma}{(n-1)^{3}} + \ldots$$
$\overline{\theta}$ is the mean of all $\hat{\theta}_{(i)}$:
$$\overline{\theta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}.$$
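
The following short derivation is a sketch of why these two quantities are enough to remove the linear term. It uses only the two expansions above and the fact that $\overline{\theta}$ is an average of leave-one-out estimates, so it has the same expectation as each of them. Subtracting the expansions gives

$$\mathbb{E}(\overline{\theta}) - \mathbb{E}(\hat{\theta}_{n}) = \alpha\left(\frac{1}{n-1}-\frac{1}{n}\right) + \beta\left(\frac{1}{(n-1)^{2}}-\frac{1}{n^{2}}\right) + \ldots = \frac{\alpha}{n(n-1)} + \frac{\beta(2n-1)}{n^{2}(n-1)^{2}} + \ldots$$

Multiplying by $(n-1)$ recovers the linear component of the bias up to terms of order $1/n^{2}$:

$$(n-1)\left(\mathbb{E}(\overline{\theta}) - \mathbb{E}(\hat{\theta}_{n})\right) = \frac{\alpha}{n} + \frac{\beta(2n-1)}{n^{2}(n-1)} + \ldots$$

Subtracting this quantity from $\hat{\theta}_{n}$ therefore cancels the $\alpha/n$ term:

$$\mathbb{E}\left(\hat{\theta}_{n} - (n-1)(\overline{\theta} - \hat{\theta}_{n})\right) = \theta - \frac{\beta}{n(n-1)} + \ldots$$

What remains is of order $1/n^{2}$, which is why the method is said to eliminate only the linear component of the bias.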


### Summary

So, to apply the Jackknife method, i.e., to detect the presence of a bias and automatically eliminate its linear component, the following steps are required.
1. Suppose we need to estimate some parameter $\theta$ of a random variable *X*.
1. Obtain an estimate $\hat{\theta}$ from the sample using some mathematical function $\hat{\theta}=F(x_{1},\ldots,x_{n})$.
1. $\hat{\theta}$ can be biased.
1. $\mathbb{E}(\hat{\theta}) = \theta + \mathrm{bias}$.
1. Create *n* new samples by sequentially excluding one *x* from the initial sample.
1. Calculate $\hat{\theta}_{(i)}$ for each new sample using the same function **F**.
1. Calculate the mean of all $\hat{\theta}_{(i)}$ and denote it as $\overline{\theta}$.
1. Estimate the bias using the formula
$$\widehat{bias}_{jack} = (n-1)(\overline{\theta} - \hat{\theta}).$$
1. Eliminate the linear component of the bias using the formula
$$\hat{\theta}_{jacked} = \hat{\theta} - \widehat{bias}_{jack}.$$
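
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in the article: the function name `jackknife_bias_correct` and the five `prices` values are made up for the example. The statistic chosen here is the plug-in variance (`np.var` with its default divisor `n`), because it is a well-known biased estimator.

```python
import numpy as np


def jackknife_bias_correct(x, stat):
    """Return (theta_hat, bias_jack, theta_jacked) for a 1-D sample.

    `stat` is any function mapping a 1-D array to a number
    (the function F in the list of steps).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    theta_hat = stat(x)  # estimate on the full sample
    # estimates on the n leave-one-out samples
    theta_loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    theta_bar = theta_loo.mean()  # mean of the leave-one-out estimates
    bias_jack = (n - 1) * (theta_bar - theta_hat)  # jackknife bias estimate
    theta_jacked = theta_hat - bias_jack  # bias-corrected estimate
    return theta_hat, bias_jack, theta_jacked


# Hypothetical unit prices, NOT taken from the dataset used in this article
prices = np.array([120.0, 135.0, 150.0, 160.0, 210.0])

theta_hat, bias_jack, theta_jacked = jackknife_bias_correct(prices, np.var)
print("plug-in variance:      ", theta_hat)
print("jackknife bias:        ", bias_jack)
print("corrected variance:    ", theta_jacked)
print("np.var(prices, ddof=1):", np.var(prices, ddof=1))
```

For the plug-in variance the corrected value coincides with the usual unbiased sample variance (divisor `n - 1`), which makes this statistic a convenient sanity check for any Jackknife implementation.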

## Practical implementation in Python

We will use a dataset containing 34821 observations of the residential real estate market in St. Petersburg, obtained by web scraping in September 2021. Let's assume that this dataset describes the entire market, so we can treat it as the general population. Next, we will draw a subsample of only 25 observations, which reflects the typical number of observations an appraiser deals with. We will calculate “the expectation” for our “general population”, then calculate the mean for the sample. Finally, we will apply the Jackknife method and compare its estimate of the mean with the sample mean.

In [13]:
# import libraries
import numpy as np
import pandas as pd
from astropy.stats import jackknife_resampling, jackknife_stats

In [52]:
# import data
df = pd.read_csv("spba-flats-210928.csv", index_col=False)
print(df)

       Unnamed: 0                                     links   price_m  county
0               1  https://spb.cian.ru/sale/flat/262765174/  155460.0  sadadm
1               2  https://spb.cian.ru/sale/flat/263280601/  295455.0  sadadm
2               3  https://spb.cian.ru/sale/flat/261612519/  310559.0  sadadm
3               4  https://spb.cian.ru/sale/flat/263094016/  100000.0  sadadm
4               5  https://spb.cian.ru/sale/flat/262339898/  145929.0  sadadm
...           ...                                       ...       ...     ...
34816       34817  https://spb.cian.ru/sale/flat/256621764/   70093.0  llobol
34817       34818  https://spb.cian.ru/sale/flat/261430727/   67227.0  llobol
34818       34819  https://spb.cian.ru/sale/flat/246538655/   86207.0  llobol
34819       34820  https://spb.cian.ru/sale/flat/246587468/   65455.0  llobol
34820       34821  https://spb.cian.ru/sale/flat/239698989/   89041.0  llobol

[34821 rows x 4 columns]


In [53]:
# calculate mean and maximum for the "general population"
gp_mean = df['price_m'].mean()
gp_max = df['price_m'].max()
print("The mean of unit price for 'general population' is", gp_mean)
print("The maximum of unit price for 'general population' is", gp_max)

The mean of unit price for 'general population' is 176116.52505671864
The maximum of unit price for 'general population' is 1624829.0


In [68]:
# create sample
sam_size = 25
ran_sam = df.sample(n=sam_size)
print(ran_sam)

       Unnamed: 0                                     links   price_m  county
20606       20607  https://spb.cian.ru/sale/flat/264064476/  213889.0  sprkol
11160       11161  https://spb.cian.ru/sale/flat/261519367/  161667.0  skryuz
21112       21113  https://spb.cian.ru/sale/flat/263895261/  148830.0  sprlax
19612       19613  https://spb.cian.ru/sale/flat/250041085/  189706.0  sprn65
5076         5077  https://spb.cian.ru/sale/flat/263660571/  139382.0  skapis
13890       13891  https://spb.cian.ru/sale/flat/263013521/  231935.0  smonow
12315       12316  https://spb.cian.ru/sale/flat/262914352/  198765.0  smozwy
8250         8251  https://spb.cian.ru/sale/flat/263518941/  181985.0  skrpol
243           244  https://spb.cian.ru/sale/flat/251281327/  210697.0  sadeka
15608       15609  https://spb.cian.ru/sale/flat/260189091/  134454.0  snenew
7387         7388  https://spb.cian.ru/sale/flat/257597416/  108974.0  skomet
5179         5180  https://spb.cian.ru/sale/flat/261053782/  150

In [67]:
# calculate mean and maximum for the sample
rs_mean = ran_sam['price_m'].mean()
rs_max = ran_sam['price_m'].max()
print("The mean of unit price for random sample is", rs_mean)
print("The maximum of unit price for random sample is", rs_max)

The mean of unit price for random sample is 147030.5
The maximum of unit price for random sample is 181481.0


In [69]:
# extract the unit prices of the sample as a NumPy array
new_df = ran_sam["price_m"]
array = new_df.to_numpy()
print(new_df)

20606    213889.0
11160    161667.0
21112    148830.0
19612    189706.0
5076     139382.0
13890    231935.0
12315    198765.0
8250     181985.0
243      210697.0
15608    134454.0
7387     108974.0
5179     150000.0
22767    121429.0
22444    223214.0
10338    177384.0
13155    195000.0
804      138900.0
18970    221212.0
21243    233333.0
28099    110754.0
31551     83459.0
4399     151282.0
12014    198780.0
22696    145516.0
23239    204433.0
Name: price_m, dtype: float64


In [70]:
# obtain Jackknife resamples
resamples = jackknife_resampling(array)
print(resamples)

[[161667. 148830. 189706. 139382. 231935. 198765. 181985. 210697. 134454.
  108974. 150000. 121429. 223214. 177384. 195000. 138900. 221212. 233333.
  110754.  83459. 151282. 198780. 145516. 204433.]
 [213889. 148830. 189706. 139382. 231935. 198765. 181985. 210697. 134454.
  108974. 150000. 121429. 223214. 177384. 195000. 138900. 221212. 233333.
  110754.  83459. 151282. 198780. 145516. 204433.]
 [213889. 161667. 189706. 139382. 231935. 198765. 181985. 210697. 134454.
  108974. 150000. 121429. 223214. 177384. 195000. 138900. 221212. 233333.
  110754.  83459. 151282. 198780. 145516. 204433.]
 [213889. 161667. 148830. 139382. 231935. 198765. 181985. 210697. 134454.
  108974. 150000. 121429. 223214. 177384. 195000. 138900. 221212. 233333.
  110754.  83459. 151282. 198780. 145516. 204433.]
 [213889. 161667. 148830. 189706. 231935. 198765. 181985. 210697. 134454.
  108974. 150000. 121429. 223214. 177384. 195000. 138900. 221212. 233333.
  110754.  83459. 151282. 198780. 145516. 204433.]
 [213

In [71]:
# obtain Jackknife resamples shape
resamples.shape

(25, 24)
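
The shape `(25, 24)` and the rows printed above suggest that row *i* of the result is the original sample with observation *i* removed. The NumPy-only sketch below reproduces that idea on a toy array. It is an illustration of leave-one-out resampling, not astropy's actual implementation; the helper name `leave_one_out` and the `toy` values are made up.

```python
import numpy as np


def leave_one_out(x):
    """Return an (n, n-1) array whose row i is x without its i-th element."""
    x = np.asarray(x)
    return np.array([np.delete(x, i) for i in range(x.size)])


# Toy data for illustration only
toy = np.array([10.0, 20.0, 30.0, 40.0])
resamples_toy = leave_one_out(toy)

print(resamples_toy.shape)  # (4, 3)
print(resamples_toy[0])     # [20. 30. 40.]
```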

In [77]:
# obtain the Jackknife estimate of the mean, its bias,
# its standard error, and its 95% confidence interval
test_statistic = np.mean

estimate, bias, stderr, conf_interval = jackknife_stats(
    array, test_statistic, 0.95)

mean_jacked = estimate
print("the jacked mean is ", mean_jacked)
# NOTE: the Jackknife bias estimate itself is returned in `bias`;
# the value below is the gap between the jacked mean and rs_mean,
# which was computed earlier in cell In [67]
bias_jack = abs(mean_jacked - rs_mean)
print("the difference between the jacked mean and rs_mean is ", bias_jack)
std_error = stderr
print("the standard error got from the Jackknife is ", std_error)
conf_int = conf_interval
print("the confidence interval (95%) of jacked mean is ", conf_int)

the jacked mean is  170999.2
the difference between the jacked mean and rs_mean is  23968.70000000001
the standard error got from the Jackknife is  8468.584368318783
the confidence interval (95%) of jacked mean is  [154401.07963806 187597.32036194]


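As we can see, the expectation (the mean of our “general population”) is 176116.525, and the Jackknife estimate of the mean is 170999.200 with a standard error of 8468.584. The 95% confidence interval [154401.080, 187597.320] includes the true value of the expectation. One caveat is needed when reading the other figures. The sample mean of 147030.500 and the sample maximum of 181481.0 were printed by cell In [67], which was executed before the sample was redrawn in cell In [68], so they describe a different draw: the 25 prices listed above have a maximum of 233333.0 and a mean of 170999.200, which is exactly the Jackknife estimate. This is the expected behaviour. The sample mean is an unbiased estimator, so the Jackknife bias correction for it is zero, and the difference of 23968.700 reflects sampling variability between two draws rather than a bias removed by the method. For the mean, the practical gain from the Jackknife is the standard error and the confidence interval. The bias correction itself becomes non-trivial for biased statistics, such as the plug-in variance (the one with divisor *n*).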
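
The figures printed by the last cell can be checked without astropy. The sketch below takes the 25 unit prices listed in the output of cell In [69] and applies the bias formula from the Summary together with the usual Jackknife standard error formula. The normal-approximation confidence interval is an assumption about how the interval above was built, and the variable names are chosen for this example only.

```python
import numpy as np
from statistics import NormalDist

# The 25 unit prices printed in cell In [69]
prices = np.array([
    213889., 161667., 148830., 189706., 139382., 231935., 198765.,
    181985., 210697., 134454., 108974., 150000., 121429., 223214.,
    177384., 195000., 138900., 221212., 233333., 110754., 83459.,
    151282., 198780., 145516., 204433.,
])
n = prices.size
stat = np.mean

theta_hat = stat(prices)  # plain sample mean
# leave-one-out estimates and their mean
theta_loo = np.array([stat(np.delete(prices, i)) for i in range(n)])
theta_bar = theta_loo.mean()

bias = (n - 1) * (theta_bar - theta_hat)  # jackknife bias estimate
estimate = theta_hat - bias               # bias-corrected estimate
# usual jackknife standard error
stderr = np.sqrt((n - 1) * np.mean((theta_loo - theta_bar) ** 2))

# 95% interval under a normal approximation (assumed, see the text above)
z = NormalDist().inv_cdf(0.975)
conf_int = (estimate - z * stderr, estimate + z * stderr)

print("sample mean:       ", theta_hat)
print("jackknife bias:    ", bias)
print("jackknife estimate:", estimate)
print("standard error:    ", stderr)
print("95% interval:      ", conf_int)
```

For the mean, the bias estimate is zero up to floating-point rounding, so the Jackknife estimate and the sample mean of these 25 prices coincide. Replacing `stat` with `np.var` gives a case where the correction is not zero.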


## Afterword

The Jackknife method is a simple and computationally efficient tool for adjusting sample estimators. I hope that it will help many appraisers in their everyday practice, and perhaps it will inspire someone to use machine learning methods more widely in valuation.