# Overcoming the problem of biased estimates in the analysis of open market data with the Jackknife resampling method

Cyrill A. Murashev, 2023-02-07

## Abstract

Appraisers often need to analyze and describe market data collected in open markets. They almost never have data for the whole market; instead they work with samples that may be very small compared to the whole population. This gives rise to the problem of biased estimates: any statistic computed from a sample is an estimate for that sample, and it may be biased relative to the value that an analysis of the entire general population would give. Appraisers often say that they have calculated some descriptive statistics of the market, such as the mean or median price, the maximum and minimum, or the skewness and kurtosis. But we should understand that these are only estimates for samples, not for the entire market. Today we will look at the minimal theoretical basis of the Jackknife method and then implement it on real market data using the Python language. We will learn how to determine whether bias exists for an estimate and how to automatically reduce its linear component. This paper is available in [English](https://github.com/Kirill-Murashev/AI_for_valuers_book/blob/main/Parts-Chapters/Jackknife/jackknife.ipynb), [Spanish](https://github.com/Kirill-Murashev/AI_for_valuers_book/blob/main/Parts-Chapters/Jackknife/jackknife-esp.ipynb), and [Russian](https://github.com/Kirill-Murashev/AI_for_valuers_book/blob/main/Parts-Chapters/Jackknife/jackknife-nov.ipynb). The English version is the most current and the most frequently updated. If there are any discrepancies between the versions, the English version should be relied upon.

## Fundamentals of the Jackknife Method

### Introduction

First, we need to remember why appraisers need statistics. Usually they have some distribution of features of objects from the sample of analogues collected on the open market. And they try to get some estimates of the values of these characteristics. It can be mean, median, maximum, minimum, variance, etc. Sometimes they also need to compare two or more subsamples to decide if some adjustments are needed based on the difference of the feature values. As we can guess, most of the time appraisers are dealing with samples, not the entire market. Thus, appraisers can only obtain sample estimates of the feature values, not their true values. 
The jackknife method can address two issues:
- reduce the bias of the sample estimate relative to the true value from the general population;
- estimate the variance of the adjusted parameter estimate.

Suppose we have some characteristic *X* (it could be the unit price, for example), the distribution of which in the general population is unknown to us. But we have a sample consisting of *n* elements $[x_{1},\ldots, x_{n}]$. We want to estimate the expectation of *X*, which can be written as $\mathbb{E}[X]$. In general, the expectation can be written as follows
$$\mathbb{E}[X] = \sum_{j=1}^{n \gg 1} p(x_{j})x_{j}.$$

But we only have a sample, which of course consists of a very limited number of observations, far from infinity. So we cannot estimate the expectation, only the sample mean, which is written as
$$\hat{\mu}=\dfrac{1}{n} \sum_{i=1}^{n \ll \infty}x_{i}.$$

Therefore, we do not use probabilities, but observed frequencies. It's obvious that $\mathbb{E}[X] \neq \hat{\mu}$, but $\hat{\mu} = \mathbb{E}[X] + \mathrm{bias}$, where $\hat{\mu}$ is the estimate of the expectation, and the bias is some systematic shift between the true and estimated values of the expectation.

Note that the sample mean is a special case in which the method cannot detect any bias: the Jackknife is itself based on averaging, and the average of the leave-one-out means always equals the sample mean. This is its main limitation. But it is quite good for dealing with moments of higher order than the mean, such as skewness and kurtosis.
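
The zero bias estimate for the mean can be checked directly. The sketch below uses a small hypothetical array of unit prices (illustrative values, not taken from the dataset) and shows that the average of the leave-one-out means coincides with the full-sample mean, so the Jackknife bias estimate vanishes up to floating-point rounding.

```python
import numpy as np

# Hypothetical unit prices, used only for illustration
x = np.array([120.0, 135.0, 150.0, 180.0, 410.0])
n = len(x)

# Leave-one-out means: drop observation i and average the rest
loo_means = np.array([np.delete(x, i).mean() for i in range(n)])

# Jackknife bias estimate: (n - 1) * (mean of leave-one-out means - sample mean)
bias_mean = (n - 1) * (loo_means.mean() - x.mean())

print(x.mean(), loo_means.mean(), bias_mean)
```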


### General Concept of the Jackknife Method

We have considered a special case. We can now move on to the more general concept of estimator bias. Let's consider the random variable *X* with the unknown distribution *U*. Its distribution has a parameter $\theta$ whose value we want to determine. Using the abstract parameter $\theta$ instead of a specific one emphasizes the universality of the Jackknife method, which is able to detect bias for any parameter of the distribution and automatically correct its linear component. We also have $\hat{\theta}$, the sample estimate obtained by applying some function to the sample. Because $\hat{\theta}$ was obtained from a sample, while we want to estimate $\theta$ for the general population, i.e., the entire market in the context of valuation, $\hat{\theta}$ has a bias relative to $\theta$. Mathematically, this means that the expectation of $\hat{\theta}$ is not equal to $\theta$:
$$\mathbb{E}(\hat{\theta}) \neq \theta.$$
In this case, we can say that
$$\mathbb{E}(\hat{\theta}_{n}) = \theta + \frac{\alpha}{n} + \frac{\beta}{n^{2}} + \frac{\gamma}{n^{3}} + \ldots + \frac{\omega}{n^{(k\rightarrow \infty)}},$$
where $\theta$ is the true value of the parameter for the general population, and $\frac{\alpha}{n}$, $\frac{\beta}{n^{2}}$, $\frac{\gamma}{n^{3}}$, $\ldots$, $\frac{\omega}{n^{(k\rightarrow \infty)}}$ are the linear, quadratic, cubic, and higher-order components of the bias. All components shrink as the sample grows, in proportion to $1/n$, $1/n^{2}$, $1/n^{3}$, and so on. The linear term introduces the largest error because it decreases the slowest of all the terms.

The Jackknife method eliminates the linear component of the bias. Let's introduce some new definitions.

$\hat{\theta}_{(i)}$ is the value of $\hat{\theta}$ that would be obtained if the calculation were based not on the full sample but on the sample with observation *i* excluded, where *i* runs from 1 to *n*. Then
$$\mathbb{E}(\hat{\theta}_{(i)}) = \theta + \frac{\alpha}{n-1} + \frac{\beta}{(n-1)^{2}} + \frac{\gamma}{(n-1)^{3}} + \ldots + \frac{\omega}{(n-1)^{(k\rightarrow \infty)}}.$$
$\overline{\theta}$ is the mean value of all $\hat{\theta}_{(i)}$:
$$\overline{\theta} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}.$$
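
To see why averaging the leave-one-out estimates helps, the two expansions can be compared term by term. The following is a short derivation sketch that keeps only the first two bias terms; it assumes the expansions above hold with the same coefficients $\alpha$ and $\beta$ for samples of size $n$ and $n-1$. Since $\overline{\theta}$ is an average of leave-one-out estimates, it has the same expectation as each of them:
$$\mathbb{E}(\overline{\theta}) = \theta + \frac{\alpha}{n-1} + \frac{\beta}{(n-1)^{2}} + \ldots$$
Scaling the difference between $\overline{\theta}$ and $\hat{\theta}$ by $(n-1)$ isolates the linear bias term:
$$\mathbb{E}\left[(n-1)(\overline{\theta} - \hat{\theta})\right] = (n-1)\left(\frac{\alpha}{n-1} - \frac{\alpha}{n}\right) + (n-1)\left(\frac{\beta}{(n-1)^{2}} - \frac{\beta}{n^{2}}\right) + \ldots = \frac{\alpha}{n} + \frac{(2n-1)\beta}{n^{2}(n-1)} + \ldots$$
Subtracting this quantity from $\hat{\theta}$ gives $n\hat{\theta} - (n-1)\overline{\theta}$, whose expectation no longer contains the $1/n$ term:
$$\mathbb{E}\left[n\hat{\theta} - (n-1)\overline{\theta}\right] = \theta - \frac{\beta}{n(n-1)} + \ldots$$
The remaining bias is of order $1/n^{2}$, which is what "eliminating the linear component" means here.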


### Summary

Therefore, to apply the Jackknife method, i.e., to detect the presence of a bias and automatically eliminate its linear component, the following set of steps is required.
1. Suppose we need to estimate some parameter $\theta$ of a random variable *X*.
1. Obtain a sample estimate $\hat{\theta}$ using a mathematical function $\hat{\theta}=F(x_{1},\ldots,x_{n})$.
1. $\hat{\theta}$ can be biased.
1. $\mathbb{E}(\hat{\theta}) = \theta + bias$.
1. Create *n* new samples by sequentially excluding one observation *x* from the initial sample.
1. Calculate $\hat{\theta}_{(i)}$ for each new sample using the same function *F*.
1. Calculate the mean of all $\hat{\theta}_{(i)}$ and label it as $\overline{\theta}$.
1. Calculate the bias using the following formula
$$\widehat{bias}_{jack} = (n-1)(\overline{\theta} - \hat{\theta}).$$
1. Eliminate the linear component of the bias by using the formula
$$\hat{\theta}_{jacked} = \hat{\theta} - \widehat{bias}_{jack}.$$
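
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used later in the paper: the function name `jackknife_bias_correct` and the sample values are hypothetical. The plug-in variance (`np.var` with its default `ddof=0`) is used as the statistic because its bias is known to be purely of order $1/n$.

```python
import numpy as np

def jackknife_bias_correct(x, stat):
    """Return (theta_hat, bias_jack, theta_jacked) for a 1-D sample x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Estimate on the full sample
    theta_hat = stat(x)
    # Estimates on the n leave-one-out samples
    theta_loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    theta_bar = theta_loo.mean()
    # Jackknife bias estimate and bias-corrected estimate
    bias_jack = (n - 1) * (theta_bar - theta_hat)
    theta_jacked = theta_hat - bias_jack
    return theta_hat, bias_jack, theta_jacked

# Hypothetical sample, for illustration only
x = np.array([94.0, 100.0, 113.0, 136.0, 146.0, 154.0, 183.0, 321.0])

theta_hat, bias_jack, theta_jacked = jackknife_bias_correct(x, np.var)
print(theta_hat, bias_jack, theta_jacked, np.var(x, ddof=1))
```

Because the bias of the plug-in variance has no terms beyond the linear one, the corrected value should coincide with the usual unbiased sample variance `np.var(x, ddof=1)`, which makes this a convenient sanity check.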

## Python practical implementation

Today we will use a dataset containing 34821 observations of the residential real estate market in St. Petersburg. It was obtained from web scraping in September 2021. Let's assume that this dataset contains data about the whole market, so we can use it as a general population. Next, we will create a subsample with only 25 observations, which is the typical number of observations that an appraiser deals with. We will calculate the "expectation" for our "general population", then we will calculate the mean for the sample. Finally, we will apply the Jackknife method and compare the result of the mean calculation to the sample mean.

In [194]:
# import libraries
import numpy as np
import pandas as pd
from astropy.stats import jackknife_resampling, jackknife_stats
from scipy.stats import kurtosis, skew

In [62]:
# import data
df = pd.read_csv("spba-flats-210928.csv", index_col=False)
print(df)

       Unnamed: 0                                     links  price_m  county
0               1  https://spb.cian.ru/sale/flat/262765174/   155460  sadadm
1               2  https://spb.cian.ru/sale/flat/263280601/   295455  sadadm
2               3  https://spb.cian.ru/sale/flat/261612519/   310559  sadadm
3               4  https://spb.cian.ru/sale/flat/263094016/   100000  sadadm
4               5  https://spb.cian.ru/sale/flat/262339898/   145929  sadadm
...           ...                                       ...      ...     ...
34816       34817  https://spb.cian.ru/sale/flat/256621764/    70093  llobol
34817       34818  https://spb.cian.ru/sale/flat/261430727/    67227  llobol
34818       34819  https://spb.cian.ru/sale/flat/246538655/    86207  llobol
34819       34820  https://spb.cian.ru/sale/flat/246587468/    65455  llobol
34820       34821  https://spb.cian.ru/sale/flat/239698989/    89041  llobol

[34821 rows x 4 columns]


In [193]:
# calculate statistics for the "general population"
gp_est    = df['price_m'].mean()
gp_sem    = df['price_m'].sem()
gp_min    = df['price_m'].min()
gp_25q    = df['price_m'].quantile(0.25)
gp_median = df['price_m'].median()
gp_75q    = df['price_m'].quantile(0.75)
gp_max    = df['price_m'].max()
gp_skew   = df['price_m'].skew()
gp_kurt   = df['price_m'].kurtosis()
gp_ran    = gp_max - gp_min

print("The estimation of unit price for population is", gp_est)
print("The standard error of unit price mean for population is", gp_sem)
print("The minimum of unit price for population is", gp_min)
print("The 0.25 quantile of unit price for population is", gp_25q)
print("The median of unit price for population is", gp_median)
print("The 0.75 quantile of unit price for population is", gp_75q)
print("The maximum of unit price for population is", gp_max)
print("The skewness of unit price for population is", gp_skew)
print("The kurtosis of unit price for population is", gp_kurt)
print("The range of unit price for population is", gp_ran)

The estimation of unit price for population is 176132.997530226
The standard error of unit price mean for population is 411.42108784161167
The minimum of unit price for population is 11817
The 0.25 quantile of unit price for population is 135870.0
The median of unit price for population is 162544.0
The 0.75 quantile of unit price for population is 196078.0
The maximum of unit price for population is 1624829
The skewness of unit price for population is 4.425121271105129
The kurtosis of unit price for population is 44.47491746881878
The range of unit price for population is 1613012


In [165]:
# create sample
sam_size = 25
ran_sam = df.sample(n=sam_size)
print(ran_sam)

       Unnamed: 0                                     links  price_m  county
3306         3307  https://spb.cian.ru/sale/flat/263952864/   208706  sfrn75
25446       25447  https://spb.cian.ru/sale/flat/264176271/   183721  swyswe
11568       11569  https://spb.cian.ru/sale/flat/262828919/   135742  skupes
33301       33302  https://spb.cian.ru/sale/flat/264080846/   106557  lwsswe
19647       19648  https://spb.cian.ru/sale/flat/259620581/   187500  sprn65
12156       12157  https://spb.cian.ru/sale/flat/263812187/   154135  smogag
33578       33579  https://spb.cian.ru/sale/flat/249136615/    94340  lkiotr
25289       25290  https://spb.cian.ru/sale/flat/262800893/   231783  swysam
18805       18806  https://spb.cian.ru/sale/flat/264405723/   182927  spechk
28863       28864  https://spb.cian.ru/sale/flat/261066046/   100000  lwsser
15471       15472  https://spb.cian.ru/sale/flat/264113874/   146245  snenar
24352       24353  https://spb.cian.ru/sale/flat/260113706/   154694  swyn15

In [166]:
# calculate statistics for the random sample
rs_mean   = ran_sam['price_m'].mean()
rs_sem    = ran_sam['price_m'].sem()
rs_min    = ran_sam['price_m'].min()
rs_25q    = ran_sam['price_m'].quantile(0.25)
rs_median = ran_sam['price_m'].median()
rs_75q    = ran_sam['price_m'].quantile(0.75)
rs_max    = ran_sam['price_m'].max()
rs_skew   = ran_sam['price_m'].skew()
rs_kurt   = ran_sam['price_m'].kurtosis()
rs_ran   = rs_max - rs_min

print("The mean of unit price for random sample is", rs_mean)
print("The standard error of unit price mean for random sample is", rs_sem)
print("The minimum of unit price for random sample is", rs_min)
print("The 0.25 quantile of unit price for random sample is", rs_25q)
print("The median of unit price for random sample is", rs_median)
print("The 0.75 quantile of unit price for random sample is", rs_75q)
print("The maximum of unit price for random sample is", rs_max)
print("The skewness of unit price for random sample is", rs_skew)
print("The kurtosis of unit price for random sample is", rs_kurt)
print("The range of unit price for random sample is", rs_ran)

The mean of unit price for random sample is 193665.44
The standard error of unit price mean for random sample is 20251.124762679232
The minimum of unit price for random sample is 63495
The 0.25 quantile of unit price for random sample is 146245.0
The median of unit price for random sample is 183333.0
The 0.75 quantile of unit price for random sample is 208706.0
The maximum of unit price for random sample is 574103
The skewness of unit price for random sample is 2.3536484943924028
The kurtosis of unit price for random sample is 7.829781802086403
The range of unit price for random sample is 510608


In [167]:
# extract the unit prices of the sample as a NumPy array
new_df = ran_sam["price_m"]
array = new_df.to_numpy()
print(new_df)

3306     208706
25446    183721
11568    135742
33301    106557
19647    187500
12156    154135
33578     94340
25289    231783
18805    182927
28863    100000
15471    146245
24352    154694
12759    186667
18902    321429
802      193998
8789     112931
22581    183333
6809     152813
12278    212903
33868     63495
19219    273146
18680    574103
13881    193629
13526    318386
16092    168453
Name: price_m, dtype: int64


In [168]:
# obtain Jackknife resamples
resamples = jackknife_resampling(array)
print(resamples)

[[183721. 135742. 106557. 187500. 154135.  94340. 231783. 182927. 100000.
  146245. 154694. 186667. 321429. 193998. 112931. 183333. 152813. 212903.
   63495. 273146. 574103. 193629. 318386. 168453.]
 [208706. 135742. 106557. 187500. 154135.  94340. 231783. 182927. 100000.
  146245. 154694. 186667. 321429. 193998. 112931. 183333. 152813. 212903.
   63495. 273146. 574103. 193629. 318386. 168453.]
 [208706. 183721. 106557. 187500. 154135.  94340. 231783. 182927. 100000.
  146245. 154694. 186667. 321429. 193998. 112931. 183333. 152813. 212903.
   63495. 273146. 574103. 193629. 318386. 168453.]
 [208706. 183721. 135742. 187500. 154135.  94340. 231783. 182927. 100000.
  146245. 154694. 186667. 321429. 193998. 112931. 183333. 152813. 212903.
   63495. 273146. 574103. 193629. 318386. 168453.]
 [208706. 183721. 135742. 106557. 154135.  94340. 231783. 182927. 100000.
  146245. 154694. 186667. 321429. 193998. 112931. 183333. 152813. 212903.
   63495. 273146. 574103. 193629. 318386. 168453.]
 [208

In [169]:
# obtain Jackknife resamples shape
resamples.shape

(25, 24)
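
Each of the 25 rows above is the original sample with one observation left out, which is why the shape is (25, 24). A minimal NumPy sketch of the same construction is shown below on a short hypothetical array; it is meant only to illustrate the layout of the resamples, not to replace `jackknife_resampling`.

```python
import numpy as np

# Hypothetical short array, for illustration only
x = np.array([10.0, 20.0, 30.0, 40.0])
n = len(x)

# Boolean mask that is False on the diagonal: row i drops element i
mask = ~np.eye(n, dtype=bool)

# Repeat x in every row, keep the off-diagonal entries, restore the 2-D shape
resamples_manual = np.tile(x, (n, 1))[mask].reshape(n, n - 1)

print(resamples_manual)
print(resamples_manual.shape)
```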

In [170]:
# obtain Jackknife estimate for the mean, its bias,
# its standard error, and its 95% confidence interval
test_statistic = np.mean

estimate, bias, stderr, conf_interval = jackknife_stats(
    array, test_statistic, 0.95)

mean_jacked = estimate
print("the jacked mean is", mean_jacked)
mean_true_bias =  rs_mean - gp_est
print("the true bias of the mean is", mean_true_bias)
mean_bias_jack = bias
print("the bias of the mean obtained by the Jackknife is", mean_bias_jack)
mean_corr_bias_perc = mean_bias_jack / mean_true_bias
print("the corrected percentage of the bias is", mean_corr_bias_perc)
mean_std_error = stderr
print("the standard error of the mean obtained by the Jackknife is", mean_std_error)
mean_conf_int = conf_interval
print("the confidence interval (95%) of the jacked mean is", mean_conf_int)

the jacked mean is 193665.4400000007
the true bias of the mean is 17532.44246977399
the bias of the mean obtained by the Jackknife is -6.984919309616089e-10
the corrected percentage of the bias is -3.983996708763267e-14
the standard error of the mean obtained by the Jackknife is 20251.12476267923
the confidence interval (95%) of the jacked mean is [153973.96481872 233356.91518128]


As we can see, the expectation is 176133, the sample mean is 193665, and the adjusted mean obtained by the Jackknife method is also 193665. The 95% confidence interval for the expectation is [153974, 233357], and it contains the true value of the expectation. The reported bias of about -7e-10 is floating-point rounding error, not a real correction: as we discussed earlier, the mean is the one parameter of a distribution that, by its nature, cannot be adjusted by the Jackknife method. Now let's look at other parameters.
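
The standard error and confidence interval reported above can be reproduced by hand. The sketch below assumes the usual Jackknife standard error formula and a normal-approximation interval (estimate plus or minus z times the standard error); these assumptions agree with the printed numbers, but the code is an illustration and not the library's source. The array holds the 25 unit prices of the random sample printed earlier.

```python
import numpy as np
from scipy.stats import norm

# The 25 unit prices of the random sample printed above
x = np.array([208706, 183721, 135742, 106557, 187500, 154135, 94340,
              231783, 182927, 100000, 146245, 154694, 186667, 321429,
              193998, 112931, 183333, 152813, 212903, 63495, 273146,
              574103, 193629, 318386, 168453], dtype=float)
n = len(x)
stat = np.mean

# Full-sample and leave-one-out estimates
theta_hat = stat(x)
theta_loo = np.array([stat(np.delete(x, i)) for i in range(n)])
theta_bar = theta_loo.mean()

# Bias estimate and bias-corrected estimate
bias_jack = (n - 1) * (theta_bar - theta_hat)
theta_jacked = theta_hat - bias_jack

# Jackknife standard error: sqrt((n - 1) / n * sum of squared deviations)
se_jack = np.sqrt((n - 1) / n * np.sum((theta_loo - theta_bar) ** 2))

# Two-sided 95% interval under the normal approximation
z = norm.ppf(0.5 + 0.95 / 2)
ci = (theta_jacked - z * se_jack, theta_jacked + z * se_jack)

print(theta_jacked, se_jack, ci)
```

For the mean, the Jackknife standard error equals the ordinary standard error of the mean, which is why it matches the value of `rs_sem` computed for the sample.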


In [192]:
# obtain Jackknife estimate for the median, its bias,
# its standard error, and its 95% confidence interval
test_statistic = np.median

estimate, bias, stderr, conf_interval = jackknife_stats(
    array, test_statistic, 0.95)

median_jacked = estimate
print("the jacked median is", median_jacked)
median_true_bias =  rs_median - gp_median
print("the true bias of the median is", median_true_bias)
median_bias_jack = bias
print("the bias of the median obtained by the Jackknife is", median_bias_jack)
median_corr_bias_perc = median_bias_jack / median_true_bias
print("the corrected percentage of the bias is", median_corr_bias_perc)
median_std_error = stderr
print("the standard error of the median obtained by the Jackknife is", median_std_error)
median_conf_int = conf_interval
print("the confidence interval (95%) of the jacked median is", median_conf_int)

the jacked median is 183445.31999999983
the true bias of the median is 20789.0
the bias of the median obtained by the Jackknife is -112.31999999983236
the corrected percentage of the bias is -0.0054028572802844
the standard error of the median obtained by the Jackknife is 952.8097934005506
the confidence interval (95%) of the jacked median is [181577.84712082 185312.79287918]


In [190]:
# obtain Jackknife estimate for the minimum, its bias,
# its standard error, and its 95% confidence interval
test_statistic = np.min

estimate, bias, stderr, conf_interval = jackknife_stats(
    array, test_statistic, 0.95)

min_jacked = estimate
print("the jacked min is", min_jacked)
min_true_bias =  rs_min - gp_min
print("the true bias of the minimum is", min_true_bias)
min_bias_jack = bias
print("the bias of the minimum obtained by the Jackknife is", min_bias_jack)
min_corr_bias_perc = min_bias_jack / min_true_bias
print("the corrected percentage of the bias is", min_corr_bias_perc)
min_std_error = stderr
print("the standard error of the minimum obtained by the Jackknife is", min_std_error)
min_conf_int = conf_interval
print("the confidence interval (95%) of the jacked minimum is", min_conf_int)

the jacked min is 33883.79999999993
the true bias of the minimum is 51678
the bias of the minimum obtained by the Jackknife is 29611.20000000007
the corrected percentage of the bias is 0.5729943109253468
the standard error of the minimum obtained by the Jackknife is 29611.2
the confidence interval (95%) of the jacked minimum is [-24153.08553901  91920.68553901]


In [174]:
# obtain Jackknife estimate for the maximum, its bias,
# its standard error, and its 95% confidence interval
test_statistic = np.max

estimate, bias, stderr, conf_interval = jackknife_stats(
    array, test_statistic, 0.95)

maximum_jacked = estimate
print("the jacked maximum is", maximum_jacked)
maximum_true_bias =  rs_max - gp_max
print("the true bias of the maximum is", maximum_true_bias)
maximum_bias_jack = bias
print("the bias of the maximum obtained by the Jackknife is", maximum_bias_jack)
maximum_corr_bias_perc = maximum_bias_jack / maximum_true_bias
print("the corrected percentage of the bias is", maximum_corr_bias_perc)
maximum_std_error = stderr
print("the standard error of the maximum obtained by the Jackknife is", maximum_std_error)
maximum_conf_int = conf_interval
print("the confidence interval (95%) of the jacked maximum is", maximum_conf_int)

the jacked maximum is 816670.0399999991
the true bias of the maximum is -1050726
the bias of the maximum obtained by the Jackknife is -242567.0399999991
the corrected percentage of the bias is 0.23085660771694913
the standard error of the maximum obtained by the Jackknife is 242567.04
the confidence interval (95%) of the jacked maximum is [ 341247.37776351 1292092.70223649]


In [185]:
# obtain Jackknife estimate for the skewness, its bias,
# its standard error, and its 95% confidence interval
test_statistic = skew

estimate, bias, stderr, conf_interval = jackknife_stats(
    array, test_statistic, 0.95)

skew_jacked = estimate
print("the jacked skewness is", skew_jacked)
skew_true_bias =  rs_skew - gp_skew
print("the true bias of the skewness is", skew_true_bias)
skew_bias_jack = bias
print("the bias of the skewness obtained by the Jackknife is", skew_bias_jack)
skew_corr_bias_perc = skew_bias_jack / skew_true_bias
print("the corrected percentage of the bias is", skew_corr_bias_perc)
skew_std_error = stderr
print("the standard error of the skewness obtained by the Jackknife is", skew_std_error)
skew_conf_int = conf_interval
print("the confidence interval (95%) of the jacked skewness is", skew_conf_int)

the jacked skewness is 3.7874875564237587
the true bias of the skewness is -2.0714727767127266
the bias of the skewness obtained by the Jackknife is -1.5774797157901084
the corrected percentage of the bias is 0.7615256804356616
the standard error of the skewness obtained by the Jackknife is 1.6269608977678238
the confidence interval (95%) of the jacked skewness is [0.59870279 6.97627232]


In [191]:
# obtain Jackknife estimate for the kurtosis, its bias,
# its standard error, and its 95% confidence interval
test_statistic = kurtosis

estimate, bias, stderr, conf_interval = jackknife_stats(
    array, test_statistic, 0.95)

kurt_jacked = estimate
print("the jacked kurtosis is", kurt_jacked)
kurt_true_bias =  rs_kurt - gp_kurt
print("the true bias of the kurtosis is", kurt_true_bias)
kurt_bias_jack = bias
print("the bias of the kurtosis obtained by the Jackknife is", kurt_bias_jack)
kurt_corr_bias_perc = kurt_bias_jack / kurt_true_bias
print("the corrected percentage of the bias is", kurt_corr_bias_perc)
kurt_std_error = stderr
print("the standard error of the kurtosis obtained by the Jackknife is", kurt_std_error)
kurt_conf_int = conf_interval
print("the confidence interval (95%) of the jacked kurtosis is", kurt_conf_int)

the jacked kurtosis is 14.227526609758394
the true bias of the kurtosis is -36.64513566673238
the bias of the kurtosis obtained by the Jackknife is -8.109145853579356
the corrected percentage of the bias is 0.22128846587791726
the standard error of the kurtosis obtained by the Jackknife is 6.04963781033231
the confidence interval (95%) of the jacked kurtosis is [ 2.37045438 26.08459884]


### Results

The results of applying the Jackknife method are summarized in the table below.

|Estimate |True value | Sample estimate | Jacked estimate | True bias | Jackknife bias estimate | Share of true bias corrected | 95% confidence interval | CI contains true value |
| :- | :- | :- | :- | :- | :- | :- | :- | :- |
|Mean|176133|193665|193665|17532|0|0|153974, 233357|yes|
|Minimum|11817|63495|33884|51678|29611|0.573|-24153, 91921|yes|
|Maximum|1624829|574103|816670|-1050726|-242567|0.231|341247, 1292093|no|
|Skewness|4.425|2.354|3.787|-2.071|-1.577|0.762|0.599, 6.976|yes|
|Kurtosis|44.475|7.830|14.228|-36.645|-8.109|0.221|2.370, 26.085|no|
|Median|162544|183333|183445|20789|-112|-0.005|181578, 185313|no|

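The problems of using the Jackknife method for mean adjustment were discussed earlier. As we can see from the table, this method is also not good for median adjustment: the estimated bias is negligible, and the confidence interval does not contain the true median. However, it is useful for adjusting higher-order moments such as skewness and kurtosis, as well as for adjusting extreme values (the minimum and maximum). One caveat applies to the skewness and kurtosis rows: the sample estimates come from pandas, whose `skew()` and `kurtosis()` apply a small-sample correction, while `scipy.stats.skew` and `scipy.stats.kurtosis`, which were passed to `jackknife_stats`, use the uncorrected definitions by default, so the two are not exactly the same statistic.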

## Afterword

The Jackknife method is a simple and computationally efficient tool for adjusting some sample estimates. It's not a Holy Grail, but it can solve some of the problems of estimating the parameters of an open market. I hope that its application will help many appraisers in their daily practice. And perhaps it will inspire someone to use machine learning methods more widely in valuation activities. The next topic will be bootstrapping, a more powerful and universal method. Stay tuned for updates.