---
layout: site
title: Descriptive Statistics
---
Descriptive statistics are used to quantitatively describe the main characteristics of the data. They provide meaningful summaries computed over different observations or data records collected in a study. These summaries typically form the basis of the initial data exploration as part of a more extensive statistical analysis. Such a quantitative analysis assumes that every variable (also known as an attribute, feature, or column) in the data has a specific level of measurement [Stevens1946].
The measurement level of a variable, often called its variable type, can either be scale or categorical. A scale variable represents data measured on an interval scale or ratio scale. Examples of scale variables include ‘Height’, ‘Weight’, ‘Salary’, and ‘Temperature’. Scale variables are also referred to as quantitative or continuous variables. In contrast, a categorical variable has a fixed, limited number of distinct values or categories. Examples of categorical variables include ‘Gender’, ‘Region’, ‘Hair color’, ‘Zipcode’, and ‘Level of Satisfaction’. Categorical variables can further be classified into two types, nominal and ordinal, depending on whether their categories can be ordered via an intrinsic ranking. For example, there is no meaningful ranking among the distinct values of the ‘Hair color’ variable, while the categories of ‘Level of Satisfaction’ can be ranked from highly dissatisfied to highly satisfied.
The input dataset for descriptive statistics is provided in the form of a matrix, whose rows are the records (data points) and whose columns are the features (i.e. variables). Some scripts allow this matrix to be vertically split into two or three matrices. Descriptive statistics are computed over the specified features (columns) in the matrix. Which statistics are computed depends on the types of the features. It is important to keep in mind the following caveats and restrictions:
- Given a finite set of data records, i.e. a sample, we take their feature values and compute their sample statistics. These statistics will vary from sample to sample even if the underlying distribution of feature values remains the same. Sample statistics are accurate for the given sample only. If the goal is to estimate the distribution statistics that are parameters of the (hypothesized) underlying distribution of the features, the corresponding sample statistics may sometimes be used as approximations, but their accuracy will vary.
- In particular, the accuracy of the estimated distribution statistics will be low if the number of values in the sample is small. That is, for small samples, the computed statistics may depend on the randomness of the individual sample values more than on the underlying distribution of the features.
- The accuracy will also be low if the sample records cannot be assumed mutually independent and identically distributed (i.i.d.), that is, sampled at random from the same underlying distribution. In practice, feature values in one record often depend on other features and other records, including unknown ones.
- Most of the computed statistics will have low estimation accuracy in the presence of extreme values (outliers) or if the underlying distribution has heavy tails, for example obeys a power law. However, a few of the computed statistics, such as the median and Spearman’s rank correlation coefficient, are robust to outliers.
- Some sample statistics are reported with their sample standard errors in an attempt to quantify their accuracy as distribution parameter estimators. But these sample standard errors, in turn, only estimate the underlying distribution’s standard errors and will have low accuracy for small samples, outliers in the samples, or heavy-tailed distributions.
- We assume that the quantitative (scale) feature columns do not contain missing values, infinite values, NaNs, or coded non-numeric values, unless otherwise specified. We assume that each categorical feature column contains positive integers from 1 to the number of categories; for ordinal features, the natural order on the integers should coincide with the order on the categories.
Univariate statistics are the simplest form of descriptive statistics
in data analysis. They are used to quantitatively describe the main
characteristics of each feature in the data. For a given dataset matrix,
script Univar-Stats.dml
computes certain univariate
statistics for each feature column in the matrix. The feature type
governs the exact set of statistics computed for that feature. For
example, the statistic mean can only be computed on a quantitative
(scale) feature such as ‘Height’ or ‘Temperature’. It does not make sense
to compute the mean of a categorical attribute like ‘Hair Color’.
The output matrix of Univar-Stats.dml has one row per univariate statistic and one column per input feature. Table 1 below lists the meaning of each row; a "+" sign indicates that the statistic is applicable to scale and/or to categorical features.
Row | Name of Statistic | Scale | Category |
---|---|---|---|
1 | Minimum | + | |
2 | Maximum | + | |
3 | Range | + | |
4 | Mean | + | |
5 | Variance | + | |
6 | Standard deviation | + | |
7 | Standard error of mean | + | |
8 | Coefficient of variation | + | |
9 | Skewness | + | |
10 | Kurtosis | + | |
11 | Standard error of skewness | + | |
12 | Standard error of kurtosis | + | |
13 | Median | + | |
14 | Interquartile mean | + | |
15 | Number of categories | | + |
16 | Mode | | + |
17 | Number of modes | | + |
Given an input matrix X, this script computes the set of all relevant univariate statistics for each feature column X[,i] in X. The list
of statistics to be computed depends on the type, or measurement
level, of each column. The command-line argument points to a vector
containing the types of all columns. The types must be provided as per
the following convention: 1 = scale, 2 = nominal,
3 = ordinal.
Below we list all univariate statistics computed by script Univar-Stats.dml. The statistics are collected by
relevance into several groups, namely: central tendency, dispersion,
shape, and categorical measures. The first three groups contain
statistics computed for a quantitative (also known as: numerical, scale,
or continuous) feature; the last group contains the statistics for a
categorical (either nominal or ordinal) feature.
Let $n$ be the number of data records (rows) with feature values, and let $idx$ be the index of the feature column under consideration; we compute sample statistics of feature column X[,idx]. Let $v = (v_1, v_2, \ldots, v_n)$ denote the values of X[,idx] in their original unsorted order.
Figure: The computation of quartiles, median, and interquartile mean from the empirical distribution function of the 10-point sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8}. Each vertical step in the graph has height $1{/}n = 0.1$; the values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ mark the first, second (median), and third quartiles. (Figure not shown.)
Sample statistics that describe the location of the quantitative (scale) feature distribution, representing it with a single value.
Mean (output row 4): The arithmetic average over a sample of a quantitative feature. Computed as the ratio between the sum of values and the number of values: $$\bar{v} \,=\, \Big(\sum\nolimits_{i=1}^n v_i\Big) \big/ n$$
Note that the mean is significantly affected by extreme values in the sample and may be misleading as a central tendency measure if the feature varies on exponential scale. For example, the mean of {0.01, 0.1, 1.0, 10.0, 100.0} is 22.222, greater than all the sample values except the largest.
Median (output row 13): The "middle" value that separates the higher half of the sample values (in sorted order) from the lower half. To compute the median, we sort the sample in increasing order, preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. If $n$ is odd, the median equals the middle value $v^s_{(n+1)/2}$; if $n$ is even, it is the average of the two middle values, $\big(v^s_{n/2} + v^s_{n/2+1}\big)/2$. Example: the median of {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8} is $(5.3 + 5.7)/2 = 5.5$.
Unlike the mean, the median is not sensitive to extreme values in the sample, i.e. it is robust to outliers. It works better as a measure of central tendency for heavy-tailed distributions and features that vary on exponential scale. However, the median is sensitive to small sample size.
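
For concreteness, here is a minimal Python sketch (not part of Univar-Stats.dml) that reproduces the mean and median values quoted above for the example samples.

```python
# Minimal illustration of mean vs. median (not the DML script itself).
import statistics

sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
print(statistics.mean(sample))    # 5.2
print(statistics.median(sample))  # 5.5, the average of the two middle values

# The mean is pulled toward extreme values; the median is not.
skewed = [0.01, 0.1, 1.0, 10.0, 100.0]
print(statistics.mean(skewed))    # 22.222
print(statistics.median(skewed))  # 1.0
```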
Interquartile mean (output row 14): For a sample of a quantitative feature, this is the mean of the values greater than or equal to the first quartile and less than or equal to the third quartile, i.e. the mean of the middle 50% of the sorted sample, with the two boundary values weighted by the fraction of their probability mass that falls within the interquartile range. To compute the measure, we sort the sample in increasing order, preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$, and compute the weighted average
$$\frac{1}{3{/}4 - 1{/}4} \left[
\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,+\,
\sum_{j<i<k} \left(\frac{i}{n} - \frac{i{-}1}{n}\right) v^s_i
\,+\, \left(\frac{3}{4} - \frac{k{-}1}{n}\right) v^s_k\right]$$
where $j$ and $k$ are the smallest indices such that $j/n > 1/4$ and $k/n \geq 3/4$. In other words, all sample values between the first and the third quartiles are averaged, with the two boundary values $v^s_j$ and $v^s_k$ entering with reduced weights. Example: the interquartile mean of the 10-point sample above equals 5.31.
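
The following Python sketch implements the interquartile-mean formula above literally (an illustration, not the DML implementation), assuming $j$ and $k$ are chosen as just described; it returns 5.31 for the example sample.

```python
# Interquartile mean per the formula above (illustrative sketch, not the DML script).
def interquartile_mean(values):
    v = sorted(values)
    n = len(v)
    j = next(i for i in range(1, n + 1) if i / n > 0.25)    # smallest j with j/n > 1/4
    k = next(i for i in range(1, n + 1) if i / n >= 0.75)   # smallest k with k/n >= 3/4
    total = (j / n - 0.25) * v[j - 1]                       # left boundary value, partial weight
    total += sum(v[i - 1] / n for i in range(j + 1, k))     # interior values, weight 1/n each
    total += (0.75 - (k - 1) / n) * v[k - 1]                # right boundary value, partial weight
    return total / 0.5                                      # normalize by 3/4 - 1/4

print(interquartile_mean([2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]))  # ~5.31
```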
Statistics that describe the amount of variation or spread in a quantitative (scale) data feature.
Variance (output row 5): A measure of dispersion, or spread-out, of sample values around their mean, expressed in units that are the square of those of the feature itself. Computed as the sum of squared differences between the values in the sample and their mean, divided by one less than the number of values: $$\textrm{var}\,v \,=\, \frac{1}{n-1}\sum\nolimits_{i=1}^n (v_i - \bar{v})^2$$ where $\bar{v}$ is the sample mean.
Standard deviation (output row 6): A measure of dispersion around the mean, the square root of variance. Computed by taking the square root of the sample variance; see Variance above on computing the variance. Example: the standard deviation of sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8} equals 1.8. At least two values are required to avoid division by zero. Note that standard deviation is sensitive to outliers.
Standard deviation is used in conjunction with the mean to determine an interval containing a given percentage of the feature values, assuming the normal distribution. In a large sample from a normal distribution, around 68% of the cases fall within one standard deviation and around 95% of cases fall within two standard deviations of the mean. For example, if the mean age is 45 with a standard deviation of 10, around 95% of the cases would be between 25 and 65 in a normal distribution.
Coefficient of variation (output row 8): The ratio of the standard deviation to the
mean, i.e. the relative standard deviation, of a quantitative feature
sample. Computed by dividing the sample standard deviation by the
sample mean, see above for their computation details. Example: the
coefficient of variation for sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7,
6.1, 6.4, 7.2, 7.8} equals $1.8{/}5.2 \approx 0.346$.
This metric is used primarily with non-negative features such as
financial or population data. It is sensitive to outliers. Note: zero
mean causes division by zero, returning infinity or NaN
. At least two
values (records) are required to compute the standard deviation.
Minimum (output row 1): The smallest value of a quantitative sample, computed as $\min v = v^s_1$, i.e. the first value of the sorted sample.
Maximum (output row 2): The largest value of a quantitative sample, computed as $\max v = v^s_n$, i.e. the last value of the sorted sample.
Range (output row 3): The difference between the largest and the smallest value of a quantitative sample, computed as $\max v - \min v = v^s_n - v^s_1$.
Standard error of the mean (output row 7): A measure of how much the value of the sample mean may vary from sample to sample taken from the same (hypothesized) distribution of the feature. It helps to roughly bound the distribution mean, i.e.the limit of the sample mean as the sample size tends to infinity. Under certain assumptions (e.g. normality and large sample), the difference between the distribution mean and the sample mean is unlikely to exceed 2 standard errors.
The measure is computed by dividing the sample standard deviation by the square root of the number of values $n$; see the standard deviation above for its computation.
Note that the standard error itself is subject to sample randomness. Its accuracy as an error estimator may be low if the sample is small or not i.i.d., if there are outliers, or if the distribution has heavy tails.
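
As a quick cross-check of the dispersion measures above, the following Python sketch (an illustration, not the DML script) computes them for the running example sample.

```python
# Dispersion statistics for the example sample (illustrative sketch).
import math

v = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
n = len(v)
mean = sum(v) / n                                      # 5.2
variance = sum((x - mean) ** 2 for x in v) / (n - 1)   # row 5: 3.24
std_dev = math.sqrt(variance)                          # row 6: 1.8
coeff_var = std_dev / mean                             # row 8: ~0.346
minimum, maximum = min(v), max(v)                      # rows 1-2: 2.2, 7.8
value_range = maximum - minimum                        # row 3: 5.6
std_err_mean = std_dev / math.sqrt(n)                  # row 7: ~0.569
print(variance, std_dev, coeff_var, value_range, std_err_mean)
```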
Statistics that describe the shape and symmetry of the quantitative (scale) feature distribution estimated from a sample of its values.
Skewness (output row 9): It measures how symmetrically the values of a feature are spread out around the mean. A significant positive skewness implies a longer (or fatter) right tail, i.e. feature values tend to lie farther away from the mean on the right side. A significant negative skewness implies a longer (or fatter) left tail. The normal distribution is symmetric and has a skewness value of 0; however, its sample skewness is likely to be nonzero, just close to zero. As a guideline, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.
Skewness is computed as the third central moment of the sample (the average cubed deviation of the values from their sample mean) divided by the cube of the standard deviation.
Standard error in skewness (output row 11): A measure of how much the sample
skewness may vary from sample to sample, assuming that the feature is
normally distributed, which makes its distribution skewness equal 0.
Given the number $n$ of values in the sample, the standard error of skewness is estimated as $\sqrt{6n(n-1)\big/\big((n-2)(n+1)(n+3)\big)}$.
This measure can tell us, for example:
- If the sample skewness lands within two standard errors from 0, its positive or negative sign is non-significant, may just be accidental.
- If the sample skewness lands outside this interval, the feature is unlikely to be normally distributed.
At least 3 values ($n \geq 3$) are required to compute the standard error of skewness.
Kurtosis (output row 10): As a distribution parameter, kurtosis is a measure of the extent to which feature values cluster around a central point. In other words, it quantifies "peakedness" of the distribution: how tall and sharp the central peak is relative to a standard bell curve.
Positive kurtosis (leptokurtic distribution) indicates that, relative to a normal distribution:
- Observations cluster more about the center (peak-shaped)
- The tails are thinner at non-extreme values
- The tails are thicker at extreme values
Negative kurtosis (platykurtic distribution) indicates that, relative to a normal distribution:
- Observations cluster less about the center (box-shaped)
- The tails are thicker at non-extreme values
- The tails are thinner at extreme values
Kurtosis of a normal distribution is zero; however, the sample kurtosis (computed here) is likely to deviate from zero.
Sample kurtosis is computed as the fourth central moment of the sample (the average fourth-power deviation of the values from their sample mean) divided by the fourth power of the standard deviation, minus 3, so that a normal distribution scores 0.
Note that kurtosis is sensitive to outliers, and requires at least two
different sample values. Example: for sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8} with the mean of 5.2 and the standard deviation of 1.8, the sample kurtosis is negative, reflecting a flatter, more box-like spread than a normal bell curve.
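
For a rough numerical check, the sketch below uses SciPy's moment-based estimators of skewness and excess kurtosis; the exact bias corrections applied by Univar-Stats.dml may differ slightly, so the values should be read as approximations.

```python
# Moment-based skewness and excess kurtosis (illustrative; estimator details
# may differ from the DML script).
from scipy.stats import skew, kurtosis

v = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
print(skew(v, bias=True))                   # slightly negative: longer left tail
print(kurtosis(v, fisher=True, bias=True))  # negative: flatter than a normal bell curve
```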
Standard error in kurtosis (output row 12): A measure of how much the sample
kurtosis may vary from sample to sample, assuming that the feature is
normally distributed, which makes its distribution kurtosis equal 0.
Given the number $n$ of values in the sample, the standard error of kurtosis is estimated as twice the standard error of skewness times $\sqrt{(n^2 - 1)\big/\big((n-3)(n+5)\big)}$; for large $n$ it is approximately $\sqrt{24/n}$.
This measure can tell us, for example:
- If the sample kurtosis lands within two standard errors from 0, its positive or negative sign is non-significant, may just be accidental.
- If the sample kurtosis lands outside this interval, the feature is unlikely to be normally distributed.
At least 4 values ($n \geq 4$) are required to compute the standard error of kurtosis.
Statistics that describe the sample of a categorical feature, either nominal or ordinal. We represent all categories by integers from 1 to the number of categories; we call these integers category IDs.
Number of categories (output row 15): The maximum category ID that occurs in the sample. Note that some categories with IDs smaller than this maximum ID may have no occurrences in the sample, without reducing the number of categories. However, any categories with IDs larger than the maximum ID with no occurrences in the sample will not be counted. Example: in sample {1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8} the number of categories is reported as 8. Category IDs 2 and 6, which have zero occurrences, are still counted; but if there is a category with ID${}=9$ and zero occurrences, it is not counted.
Mode (output row 16): The most frequently occurring category value. If several values share the greatest frequency of occurrence, then each of them is a mode; but here we report only the smallest of these modes. Example: in sample {1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8} the modes are 3 and 7, with 3 reported.
Computed by counting the number of occurrences for each category, then taking the smallest category ID that has the maximum count. Note that the sample modes may be different from the distribution modes, i.e. the categories whose (hypothesized) underlying probability is the maximum over all categories.
Number of modes (output row 17): The number of category values that each have the largest frequency count in the sample. Example: in sample {1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8} there are two category IDs (3 and 7) that occur the maximum count of 4 times; hence, we return 2.
Computed by counting the number of occurrences for each category, then counting how many categories have the maximum count. Note that the sample modes may be different from the distribution modes, i.e. the categories whose (hypothesized) underlying probability is the maximum over all categories.
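
The categorical measures above are simple counting operations; here is an illustrative Python sketch (not the DML script) that reproduces the reported values for the example sample of category IDs.

```python
# Number of categories, mode, and number of modes (illustrative sketch).
from collections import Counter

ids = [1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8]
counts = Counter(ids)

num_categories = max(ids)                                        # row 15: 8
max_count = max(counts.values())                                 # highest frequency: 4
modes = sorted(c for c, f in counts.items() if f == max_count)   # [3, 7]
mode = modes[0]                                                  # row 16: 3 (smallest mode)
num_modes = len(modes)                                           # row 17: 2
print(num_categories, mode, num_modes)
```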
The output matrix containing all computed statistics has 17 rows, one per univariate statistic, and one column per feature column of the input matrix X. Each row corresponds to a particular statistic, according to the convention specified in Table 1. The first 14 statistics are applicable to scale (quantitative) columns, and the last 3 statistics are applicable to categorical (nominal or ordinal) columns.
Bivariate statistics are used to quantitatively describe the association between two features in a sample, for example to test their statistical (in-)dependence or to measure how accurately one feature predicts the other. The bivar-stats.dml script computes common bivariate statistics, such as Pearson’s correlation coefficient and Pearson’s $\chi^2$, in parallel for many pairs of input variables. For a given dataset matrix, bivar-stats.dml computes these bivariate statistics for
the given feature (column) pairs in the matrix. The feature types govern
the exact set of statistics computed for that pair. For example,
Pearson’s correlation coefficient can only be computed on
two quantitative (scale) features like ‘Height’ and ‘Temperature’. It
does not make sense to compute the linear correlation of two categorical
attributes like ‘Hair Color’.
The output matrices of bivar-stats.dml have one row per bivariate statistic and one column per pair of input features. Table 2 below lists the meaning of each matrix and each row.
Output File / Matrix | Row | Name of Statistic |
---|---|---|
All Files | 1 | 1-st feature column |
" | 2 | 2-nd feature column |
bivar.scale.scale.stats | 3 | Pearson’s correlation coefficient |
bivar.nominal.nominal.stats | 3 | Pearson’s $\chi^2$ |
" | 4 | Degrees of freedom |
" | 5 | $P\textrm{-}$value of Pearson’s $\chi^2$ |
" | 6 | Cramér’s $V$ |
bivar.nominal.scale.stats | 3 | Eta statistic |
" | 4 | $F$ statistic |
bivar.ordinal.ordinal.stats | 3 | Spearman’s rank correlation coefficient |
Script bivar-stats.dml
takes an input matrix X
whose
columns represent the features and whose rows represent the records of a
data sample. Given X
, the script computes certain relevant bivariate
statistics for specified pairs of feature columns X[,i]
and
X[,j]
. Command-line parameters index1
and index2
specify the
files with column pairs of interest to the user. Namely, the file given
by index1
contains the vector of the 1st-attribute column indices and
the file given by index2
has the vector of the 2nd-attribute column
indices, with "1st" and "2nd" referring to their places in bivariate
statistics. Note that both index1
and index2
files should contain a
1-row matrix of positive integers.
The bivariate statistics to be computed depend on the types, or
measurement levels, of the two columns. The types for each pair are
provided in the files whose locations are specified by types1
and
types2
command-line parameters. These files are also 1-row matrices,
i.e. vectors, that list the 1st-attribute and the 2nd-attribute column
types in the same order as their indices in the index1
and index2
files. The types must be provided as per the following convention:
1 = scale, 2 = nominal, 3 = ordinal.
The script organizes its results into (potentially) four output matrices, one for each combination of input feature types; an "ordinal" column is sometimes treated as "nominal" for statistics that ignore the ordering. Table 2 describes what each row of each output matrix contains. In particular, the script computes the following statistics:
- For a pair of scale (quantitative) columns, Pearson’s correlation coefficient.
- For a pair of nominal columns (with finite-sized, fixed, unordered domains), Pearson’s $\chi^2$, its degrees of freedom and $P\textrm{-}$value, and Cramér’s $V$.
- For a pair of one scale column and one nominal column, the Eta statistic and the $F$ statistic.
- For a pair of ordinal columns (ordered domains depicting ranks), Spearman’s rank correlation coefficient.
Note that, as shown in Table 2, the output matrices
contain the column indices of the features involved in each statistic.
Moreover, if the output matrix does not contain a value in a certain
cell, then it should be interpreted as a 0.
Below we list all bivariate statistics computed by script
bivar-stats.dml
. The statistics are collected into
several groups by the type of their input features. We refer to the two
input features as $v_1$ and $v_2$, with values $v_{1,i}$ and $v_{2,i}$ in the $i$-th record; we denote by $n$ the number of records, i.e. the sample size.
Sample statistics that describe association between two quantitative (scale) features. A scale feature has numerical values, with the natural ordering relation.
Pearson’s correlation coefficient: A measure of linear dependence between two numerical features:
$$r \,=\, \frac{Cov(v_1, v_2)}{\sqrt{Var\,v_1 \,Var\,v_2}} \,=\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)(v_{2,i} - \bar{v}_2)}{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2}}}$$
Commonly denoted by $r$, Pearson’s correlation coefficient ranges from $-1$ to $+1$: values near $+1$ or $-1$ indicate a strong positive or negative linear relationship, while values near 0 indicate little or no linear relationship. Suppose that we use simple linear regression to represent one feature given the other, say represent $v_2$ as a linear function of $v_1$ fitted by least squares; then $r^2$ equals the fraction of the variance of $v_2$ explained by this regression. In other words, $r$ quantifies both the direction and the strength of the linear association between the two features.
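
As a small illustration (with made-up ‘Height’ and ‘Weight’ values, not data from the script), Pearson's correlation coefficient can be computed as follows in Python.

```python
# Pearson's correlation coefficient for two scale features (illustrative sketch).
import numpy as np

height = np.array([1.62, 1.70, 1.75, 1.80, 1.68, 1.85])   # made-up values
weight = np.array([58.0, 66.0, 71.0, 80.0, 63.0, 84.0])   # made-up values
r = np.corrcoef(height, weight)[0, 1]
print(r)  # close to +1: strong positive linear association
```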
Sample statistics that describe association between two nominal categorical features. Both features’ value domains are encoded with positive integers in arbitrary order: nominal features do not order their value domains.
Pearson’s $\chi^2$: A measure of how much the frequencies of value pairs of two categorical features deviate from statistical independence. Under independence, the probability of every value pair must equal the product of probabilities of each value in the pair: $$Prob[a \textrm{ and } b] \,=\, Prob[a] \cdot Prob[b]$$ The statistic compares the observed frequency $O_{a,b}$ of each value pair with the frequency $E_{a,b}$ expected under independence (estimated from the individual value frequencies): $$\chi^2 \,=\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}$$ where the sum ranges over all pairs of categories of the two features; large values indicate a departure from independence.
Degrees of freedom: An integer parameter required for the interpretation of the Pearson’s $\chi^2$ measure. Computed as $(k_1 - 1)(k_2 - 1)$, where $k_1$ and $k_2$ are the numbers of categories of the two features.
$P\textrm{-}$value of Pearson’s $\chi^2$: A measure of how likely we would observe the current frequencies of value pairs of two categorical features assuming their statistical independence. More precisely, it computes the probability that, under the independence hypothesis, the $\chi^2$ statistic would be at least as large as the value actually observed. As any probability, it ranges between 0 and 1; a small $P\textrm{-}$value (for example, below 0.05) suggests that the independence hypothesis should be rejected.
Cramér’s $V$: A measure for the strength of association, i.e. of statistical dependence, between two categorical features, conceptually similar to Pearson’s correlation coefficient. It divides the observed Pearson’s $\chi^2$ by its maximum attainable value and takes the square root: $$V \,=\, \sqrt{\frac{\chi^2}{n \cdot \min(k_1 - 1,\, k_2 - 1)}}$$ where $n$ is the number of records and $k_1$, $k_2$ are the numbers of categories of the two features; $V$ ranges from 0 (no association) to 1 (maximal association). As opposed to the $P\textrm{-}$value of Pearson’s $\chi^2$, which quantifies the statistical significance of the association, Cramér’s $V$ quantifies its strength and is far less dependent on the sample size.
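
The following Python sketch (made-up contingency counts, not output of bivar-stats.dml) illustrates how Pearson's $\chi^2$, its degrees of freedom and $P\textrm{-}$value, and Cramér's $V$ relate to each other.

```python
# Chi-square, degrees of freedom, p-value, and Cramer's V for two nominal
# features, starting from their contingency table (illustrative sketch).
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10,  5],
                  [12, 25,  8],
                  [ 6,  9, 20]])                  # made-up pair frequencies
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(chi2, dof, p_value, cramers_v)
```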
Sample statistics that describe association between a categorical feature (order ignored) and a quantitative (scale) feature. The values of the categorical feature must be coded as positive integers.
Eta statistic: A measure for the strength of
association (statistical dependence) between a nominal feature and a
scale feature, conceptually similar to Pearson’s correlation
coefficient. Ranges from 0 to 1, approaching 0 when there is no
association and approaching 1 when there is a strong association. The
nominal feature, treated as the independent variable, is assumed to have
relatively few possible values, all with large frequency counts. The
scale feature is treated as the dependent variable. Denoting the nominal feature by $x$ and the scale feature by $y$, the Eta statistic is the square root of $\eta^2$, computed as
$$\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\quad\textrm{where}\quad
\hat{y}[x] \,=\, \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n
\begin{cases} y_i & \textrm{if } x_i = x \\ 0 & \textrm{otherwise}\end{cases}$$
and $\mathop{\mathrm{freq}}(x)$ is the number of records whose nominal value equals $x$. In other words, $\hat{y}[x]$ is the average of the scale feature over the records in category $x$, and $\bar{y}$ is its overall average.
$F$ statistic: A measure of how much the values of the scale feature, denoted here by $y$, deviate from statistical independence of the nominal feature, denoted by $x$. The statistic relies on the following model assumptions:
- The scale feature $y$ has an approximately normal distribution whose mean may depend only on $x$ and whose variance is the same for all values of $x$.
- The nominal feature $x$ has a relatively small value domain with large frequency counts; the $x$-values are treated as fixed (non-random).
- All records are sampled independently of each other.
To compute the $F$ statistic, we use the per-category averages $\hat{y}[x]$ defined above as "predictors" of $y$ given $x$, and form two sums of squares:
- Residual sum-of-squares, measuring the "predictor" accuracy: $\mathrm{RSS} = \sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2$.
- Explained sum-of-squares, measuring the "predictor" variability: $\mathrm{ESS} = \sum_{i=1}^{n} \big(\hat{y}[x_i] - \bar{y}\big)^2$.
Here each sum of squares is divided by its degrees of freedom, and the $F$ statistic is their ratio: $$F \,=\, \frac{\mathrm{ESS} / (k - 1)}{\mathrm{RSS} / (n - k)}$$ where $k$ is the number of distinct categories of $x$ that occur in the sample.
The $F$ statistic can be used to test the hypothesis that the scale feature is independent of the nominal feature, i.e. that the mean of $y$ is the same for all categories of $x$: under this hypothesis and the model assumptions above, the statistic follows an $F$-distribution with $(k - 1,\, n - k)$ degrees of freedom, so unusually large values indicate dependence.
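
To make the two definitions concrete, here is an illustrative Python sketch (made-up data, not the DML implementation) that computes $\eta$ and the $F$ statistic directly from the sums of squares described above.

```python
# Eta and F statistic for a nominal feature x and a scale feature y (illustrative sketch).
import numpy as np
from collections import defaultdict

x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])                       # category IDs (made up)
y = np.array([4.1, 3.9, 4.3, 5.8, 6.1, 5.6, 7.9, 8.3, 8.0, 7.7])   # scale values (made up)

groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)

y_bar = y.mean()
y_hat = np.array([np.mean(groups[xi]) for xi in x])   # per-category mean of y

rss = np.sum((y - y_hat) ** 2)        # residual sum-of-squares
ess = np.sum((y_hat - y_bar) ** 2)    # explained sum-of-squares
tss = np.sum((y - y_bar) ** 2)

n, k = len(y), len(groups)
eta = np.sqrt(1.0 - rss / tss)
f_stat = (ess / (k - 1)) / (rss / (n - k))
print(eta, f_stat)
```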
Sample statistics that describe association between two ordinal categorical features. Both features’ value domains are encoded with positive integers, so that the natural order of the integers coincides with the order in each value domain.
Spearman’s rank correlation coefficient: A measure for
the strength of association (statistical dependence) between two ordinal
features, conceptually similar to Pearson’s correlation
coefficient. Specifically, it is Pearson’s correlation
coefficient applied to the feature vectors in which all values
are replaced by their ranks, i.e. their positions if the vector is
sorted. The ranks of identical (duplicate) values are replaced with
their average rank; for example, two tied values that would occupy ranks 3 and 4 in the sorted order are both assigned rank 3.5.
Our implementation of Spearman’s rank correlation
coefficient is geared towards features having small value domains
and large counts for the values. Given the two input vectors, we form a
contingency table $T$ of pairwise frequency counts, together with the marginal frequency counts $f_1$ and $f_2$ of the individual values of each feature, and compute the average rank of each value from these counts. The coefficient is then computed as
$$\rho \,=\, \frac{Cov_T(r_1, r_2)}{\sqrt{Var_{f_1}(r_1)\,Var_{f_2}(r_2)}} \,=\, \frac{\sum_{i,j} T_{i,j}\, (r_{1,i} - \bar{r}_1)(r_{2,j} - \bar{r}_2)}{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2}}}$$
where $T_{i,j}$ is the number of records in which the first feature takes its $i$-th value and the second feature takes its $j$-th value, $f_{1,i} = \sum_j T_{i,j}$ and $f_{2,j} = \sum_i T_{i,j}$ are the marginal frequency counts, $r_{1,i}$ and $r_{2,j}$ are the average ranks of the corresponding values, and $\bar{r}_1$, $\bar{r}_2$ are their frequency-weighted mean ranks.
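
For small data, the same coefficient can be obtained directly with SciPy, which also handles tied ranks by averaging; this is an illustrative sketch with made-up ordinal ratings, not the contingency-table implementation described above.

```python
# Spearman's rank correlation for two ordinal features (illustrative sketch).
from scipy.stats import spearmanr

satisfaction = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]   # made-up ordinal codes
loyalty      = [1, 1, 2, 2, 3, 4, 4, 5, 4, 5]   # made-up ordinal codes
rho, p_value = spearmanr(satisfaction, loyalty)
print(rho, p_value)  # rho close to +1: strong monotone association
```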
A collection of (potentially) 4 matrices. Each matrix contains bivariate
statistics that resulted from a different combination of feature types.
There is one matrix for scale-scale statistics (which includes
Pearson’s correlation coefficient), one for nominal-nominal
statistics (which include Pearson’s $\chi^2$, its degrees of freedom and $P\textrm{-}$value, and Cramér’s $V$), one for nominal-scale statistics (the Eta and $F$ statistics), and one for ordinal-ordinal statistics (Spearman’s rank correlation coefficient).
The stratstats.dml
script computes common bivariate
statistics, such as correlation, slope, and their p-value, in parallel
for many pairs of input variables in the presence of a confounding
categorical variable. The values of this confounding variable group the
records into strata (subpopulations), in which all bivariate pairs are
assumed free of confounding. The script uses the same data model as in
one-way analysis of covariance (ANCOVA), with strata representing
population samples. It also outputs univariate stratified and bivariate
unstratified statistics.
To see how data stratification mitigates confounding, consider an (artificial) example in Table 3. A highly seasonal retail item was marketed with and without a promotion over the final 3 months of the year. In each month the sale was more likely with the promotion than without it. But during the peak holiday season, when shoppers came in greater numbers and bought the item more often, the promotion was less frequently used. As a result, if the 4-th quarter data is pooled together, the promotion’s effect becomes reversed and magnified. Stratifying by month restores the positive correlation.
The script computes its statistics in parallel over all possible pairs
from two specified sets of covariates. The 1-st covariate is a column in
one input matrix and the 2-nd covariate is a column in another (possibly the same) input matrix; a further input column assigns each record to a stratum.
Both covariates in each pair must be numerical, with the 2-nd covariate normally distributed given the 1-st covariate (see Details). Missing covariate values or strata are represented by "NaN". Records with NaN’s are selectively omitted wherever their NaN’s are material to the output statistic.
Stratification example: the effect of the promotion on average sales becomes reversed and amplified (from $+0.1$ within each month to $-0.5$ overall) if we ignore the months.
Month | Oct | Oct | Nov | Nov | Dec | Dec | Oct–Dec | Oct–Dec |
---|---|---|---|---|---|---|---|---|
Customers (millions) | 0.6 | 1.4 | 1.4 | 0.6 | 3.0 | 1.0 | 5.0 | 3.0 |
Promotions (0 or 1) | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
Avg sales per 1000 | 0.4 | 0.5 | 0.9 | 1.0 | 2.5 | 2.6 | 1.8 | 1.3 |
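
The numbers in Table 3 can be checked directly; the short Python sketch below reproduces the within-month effect of $+0.1$ and the pooled effect of $-0.5$.

```python
# Reproducing the stratification example of Table 3 (illustrative sketch).
months = ["Oct", "Nov", "Dec"]
customers = {("Oct", 0): 0.6, ("Oct", 1): 1.4,
             ("Nov", 0): 1.4, ("Nov", 1): 0.6,
             ("Dec", 0): 3.0, ("Dec", 1): 1.0}   # millions of customers
avg_sales = {("Oct", 0): 0.4, ("Oct", 1): 0.5,
             ("Nov", 0): 0.9, ("Nov", 1): 1.0,
             ("Dec", 0): 2.5, ("Dec", 1): 2.6}   # avg sales per 1000 customers

# Within each month, the promotion adds +0.1 to the average sales.
for m in months:
    print(m, round(avg_sales[(m, 1)] - avg_sales[(m, 0)], 2))

# Pooled over Oct-Dec, the effect looks reversed (-0.5), because the promotion
# was used least during the high-sales holiday month.
def pooled(promo):
    weight = sum(customers[(m, promo)] for m in months)
    return sum(customers[(m, promo)] * avg_sales[(m, promo)] for m in months) / weight

print(round(pooled(1) - pooled(0), 2))  # -0.5
```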
The stratstats.dml output matrix has one row per distinct pair of 1-st and 2-nd covariates, and 40 columns with the statistics listed in Table 4 below; in the table, $x$ denotes the 1-st covariate, $y$ the 2-nd covariate, and "strata" the confounding stratum variable.
Group | Col | Meaning | Group | Col | Meaning |
---|---|---|---|---|---|
1-st covariate | 01 | 1-st covariate ($x$) column index | 2-nd covariate | 11 | 2-nd covariate ($y$) column index |
" | 02 | presence count for $x$ | " | 12 | presence count for $y$ |
" | 03 | global mean of $x$ | " | 13 | global mean of $y$ |
" | 04 | global std. dev. of $x$ | " | 14 | global std. dev. of $y$ |
" | 05 | stratified std. dev. of $x$ | " | 15 | stratified std. dev. of $y$ |
" | 06 | $R^2$ for $x \sim$ strata | " | 16 | $R^2$ for $y \sim$ strata |
" | 07 | adjusted $R^2$ for $x \sim$ strata | " | 17 | adjusted $R^2$ for $y \sim$ strata |
" | 08 | p-value, $x \sim$ strata | " | 18 | p-value, $y \sim$ strata |
" | 09-10 | reserved | " | 19-20 | reserved |
$y \sim x$, no strata | 21 | presence count for the pair $(x, y)$ | $y \sim x$ and strata | 31 | presence count for $(x, y, \textrm{stratum})$ |
" | 22 | regression slope | " | 32 | regression slope |
" | 23 | regres. slope std. dev. | " | 33 | regres. slope std. dev. |
" | 24 | correlation | " | 34 | correlation |
" | 25 | residual std. dev. | " | 35 | residual std. dev. |
" | 26 | $R^2$ of the regression | " | 36 | $R^2$ of the regression |
" | 27 | adjusted $R^2$ of the regression | " | 37 | adjusted $R^2$ of the regression |
" | 28 | p-value for "slope = 0" | " | 38 | p-value for "slope = 0" |
" | 29 | reserved | " | 39 | # strata with at least two complete records |
" | 30 | reserved | " | 40 | reserved |
Suppose we have a pair of covariates, $x$ and $y$, and a stratum variable that partitions the $n$ records into $k$ strata, with stratum $i$ containing $n_i$ records. We denote the covariate values in stratum $i$ by $x_{i,j}$ and $y_{i,j}$ for $j = 1, \ldots, n_i$.
We assume a linear regression model for $y$ given $x$ in which every stratum has its own intercept $\alpha_i$ but all strata share the same slope $\beta$:
$$y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \varepsilon_{i,j}, \quad\textrm{where}\quad \varepsilon_{i,j} \sim Normal(0, \sigma^2)$$
Here the $\varepsilon_{i,j}$ are independent normally distributed noise terms with zero mean and common variance $\sigma^2$. We estimate $\beta$ by least squares, using the per-stratum means
$$\bar{x}_i \,=\, \Big(\sum\nolimits_{j=1}^{n_i} x_{i,j}\Big) \big/ n_i\,;\quad \bar{y}_i \,=\, \Big(\sum\nolimits_{j=1}^{n_i} y_{i,j}\Big) \big/ n_i$$
If the slope $\beta$ were known, the least-squares intercepts would be $\alpha_i = \bar{y}_i - \beta \bar{x}_i$; substituting them into the sum of squared residuals gives
$$\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2 \,=\, \beta^{2}V_x \,-\, 2\beta\, V_{x,y} \,+\, V_y$$
where
$$\begin{aligned}
V_x \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad
V_y \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2; \\
V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big)
\end{aligned}$$
They are stratified because we compute the sample (co-)variances within each stratum separately and then combine them by summation. Minimizing the sum of squared residuals over $\beta$ gives the estimate $\hat{\beta} = V_{x,y} / V_x$ and the residual sum-of-squares
$$\mathrm{RSS} \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2 \,=\, V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big)$$
The quantity $\hat{R} = V_{x,y} \big/ \sqrt{V_x V_y}$ is the stratified correlation between the two covariates, and $\hat{R}^2$ estimates the fraction of the stratified variance of the 2-nd covariate explained by the 1-st covariate under the regression model above.
The residual variance $\sigma^2$ is estimated by dividing $\mathrm{RSS}$ by its degrees of freedom, which yields the residual standard deviation $\hat{\sigma}$ reported in the output. The $t$-statistic for testing the hypothesis "slope${}=0$" is the ratio of $\hat{\beta}$ to its estimated standard deviation:
$$st.dev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\big/\sqrt{V_x} \quad\Longrightarrow\quad t \,=\, \hat{R}\sqrt{V_y}\,\big/\,\hat{\sigma} \,=\, \hat{\beta}\,\big/\,st.dev(\hat{\beta})_{\mathrm{est}}$$
The standard deviation estimate for $\hat{\beta}$ is included in the stratstats.dml output.
The output matrix format is defined in Table 4.
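
To connect the formulas above with the output columns, here is an illustrative Python sketch (made-up data; the residual degrees of freedom are assumed to be $n - k - 1$) that computes the stratified slope, correlation, and the $t$-statistic for the "slope = 0" test.

```python
# Stratified slope, correlation, and t-statistic (illustrative sketch, not stratstats.dml).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 1.5, 2.5, 3.5, 2.0, 3.0, 4.0])   # 1-st covariate (made up)
y = np.array([2.1, 2.9, 4.2, 3.0, 4.1, 4.8, 4.2, 5.1, 5.9])   # 2-nd covariate (made up)
s = np.array([1,   1,   1,   2,   2,   2,   3,   3,   3  ])   # stratum IDs (made up)

Vx = Vy = Vxy = 0.0
for stratum in np.unique(s):
    xs, ys = x[s == stratum], y[s == stratum]
    Vx  += np.sum((xs - xs.mean()) ** 2)
    Vy  += np.sum((ys - ys.mean()) ** 2)
    Vxy += np.sum((xs - xs.mean()) * (ys - ys.mean()))

n, k = len(x), len(np.unique(s))
beta_hat = Vxy / Vx                        # stratified regression slope
R_hat = Vxy / np.sqrt(Vx * Vy)             # stratified correlation
rss = Vy * (1.0 - R_hat ** 2)              # residual sum of squares
sigma_hat = np.sqrt(rss / (n - k - 1))     # residual std. dev. (assumed df = n - k - 1)
t = beta_hat / (sigma_hat / np.sqrt(Vx))   # t-statistic for "slope = 0"
print(beta_hat, R_hat, t)
```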