---
output:
html_document:
css: ../Data_Mgt_Analysis_and_Graphics_R/Pages/pages.css
---
# Introduction to Descriptive Statistics{#chap2}
```{r globalOptions, echo=FALSE}
library(knitr)
opts_chunk$set(
collapse = TRUE,
dev = "png",
warning = FALSE,
message = FALSE,
fig.path = "figures/"
)
```
## Goal {-}
The main goal of this chapter is to introduce "Descriptive Statistics" as a foundation for data analysis.
## What we shall cover {-}
By the end of this chapter you should:
* understand measures of location/relative standing (quantiles and percentiles), center (mean, median and mode), variability/spread (range, variance and standard deviation), skewness and kurtosis
* know how to graphically display descriptive summaries as an addition or alternative to displaying numerical summaries
## Descriptive Statistics Overview
One of the first tasks a data analyst performs after a quick [Exploratory Data Analysis](#eda) is to describe the variables in a given data set. The aim is to understand the information carried by these variables, which is achieved by computing summaries of each variable and making visual displays. In this regard, when we say descriptive statistics, we mean the numbers and graphs used to describe and summarize a given data set.
There could be many descriptive statistics computed and/or graphed to describe an individual variable, but we often report the most informative descriptive statistic per variable.
So what are some of these descriptive statistics?
Consider a numerical variable like the scores of students in a classroom. For this particular variable, what information would be of interest to us? It would be informative to know the average score, the range between the highest and lowest scores, or the percentage of students at the lower or upper extremes (we call these outliers). It could also be quite informative to visualize the position of each score. Based on this, we would compute some values to give us this information; these values are what we call the descriptive statistics of a variable. In our report, we would not include all of these summaries, only those we found to be informative. For example, if we did not have outliers, we would not report on them; we could simply report the average. We would also not include graphs of individual observations if they did not show an interesting pattern (clustering or presence of outliers).
In this chapter, we will go over the concepts of descriptive statistics (theoretically) and then follow up with a practical session involving an actual analysis of a data set, followed by a demonstration of how to write an analytical report.
With that in mind, for our concept-building section, we shall discuss two types of quantiles (percentiles and quartiles), three measures of central tendency or location (mean, median, and mode), four measures of spread or dispersion (range, inter-quartile range (IQR), variance, and standard deviation) and finally two measures of the shape of a data distribution (skewness and kurtosis).
## Measures of Descriptive Statistics
In this section we begin by gaining theoretical knowledge of some of the most informative measures of descriptive statistics, namely:
* Quantiles: percentiles and quartiles
* Measures of central tendency: mean, median and mode
* Measures of spread/dispersion: range, inter-quartile range, variance, and standard deviation
* Measures of distribution shape: skewness and kurtosis
For each of these measures, we will discuss their numerical and graphical representation and follow up with a demonstration of how they are computed in R.
### Quantiles
In the simplest terms, quantiles are statistical measures which give cut points such that certain proportions of observations lie below and above them. There are two commonly used types, percentiles and quartiles (both are kinds of quantiles, our title term).
Quantiles are used to inform on the data distribution. For example, we could say 90% of all callers to a customer care center were satisfied with the services offered, or that most students scored between the second quartile (median/50%) and the third quartile (75%). Saying this rather than quoting actual values or scores can be quite meaningful, as we get a general picture of where an individual value/score lies within a group of observations.
In general, we use quantiles when we want to describe an individual value in relation to the other values.
#### Percentiles
There are quite a number of definitions of percentiles, but the underlying concept is that a percentile gives a value below which a given percent of observations occur, with the remaining percent occurring above it. To understand this, think of a number line with percentages from 1 to 99 (the first value would be 1% and the last 99%); a score in the 25th percentile means 25% of the observations are below it and 75% above it.
![Twentyfifth percentile](figures/twentyfifth-02.png)
With that understanding, suppose we were choosing a statistical program for our organization and we are told our preferred program, R, had a score of 286 out of a possible 300. This is good information but leaves us with a number of questions, the topmost being: *how does 286 compare to the scores of the other programs?* Percentiles can be handy here, but we need the entire data set to compute a percentile. Therefore, let's use the following hypothetical data (distribution) to learn how to compute percentiles.
```
174 287 236 211 156 286 232 188 182 276 229
```
The first thing we want to do is order our data set from lowest to highest value.
```
156 174 182 188 211 229 232 236 276 286 287
```
```{r "props-02", echo=FALSE}
props <- seq(0, 1, length.out = 11)
```
Then we want to compute the proportion for each value, that is, get values between 0 and 1 of the same length as our data set. This should give us `r paste(props[-length(props)], collapse = ", ")` and `r props[length(props)]`.
Our percentiles will be these proportions multiplied by 100. We can tabulate the percentiles as follows:
Score | Rank | Percentile
---------|--------|------------
156 | 1 | 0
174 | 2 | 10
182 | 3 | 20
188 | 4 | 30
211 | 5 | 40
229 | 6 | 50
232 | 7 | 60
236 | 8 | 70
276 | 9 | 80
286 | 10 | 90
287 | 11 | 100
From this table we can easily see that the score of 286 is at the 90^th^ percentile. This is certainly much more informative than just saying R scored 286 out of 300.
Take note, percentile and percentage are two totally different terms. Scoring 90 percent is not the same as being in the 90^th^ percentile. For example, a range of scores such as 85 to 92 could all lie in the 90^th^ percentile, but only a score of 90 is 90 percent.
**Computing percentiles in R**
As mentioned before, we really do not need to memorize formulas or do manual computations; we just need to understand how they are used and then let a statistical program like R do the computation.
In R, to get the percentile of any value in a distribution, we first tell R which data we are using, sort the data and identify the index of the value of interest, compute the quantiles with the function `quantile`, and then subset the output of `quantile` with that index. For the `quantile` function, we input the proportion of each value, that is, the probability of observing each value.
```{r "percentile-02"}
# Data
scores <- c(174, 286, 287, 236, 211, 156, 232, 188, 182, 276, 229)
# Index of interested score
rank <- which(sort(scores) == 286)
# Percentiles of all scores
p <- quantile(scores, probs = seq(0, 1, length.out = length(scores)))
p
# Percentile for interested score
cat("\n", names(p[rank]), "\n")
```
Please read the help page `?quantile` to understand the algorithms available in R for computing percentiles; there are nine of them.
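For instance, the default algorithm is `type = 7`; a quick illustration of how the choice of algorithm can change the answer for the same probability:
```{r "quantiletypes-02"}
# The default (type 7) interpolates between order statistics,
# while type 6 uses a different plotting position
quantile(scores, probs = 0.25, type = 7)
quantile(scores, probs = 0.25, type = 6)
```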
#### Quartiles{#quartiles}
Quartiles (with an "r", not an "n") are similar to percentiles except that with quartiles we use fractions of the data instead of percentages. That is, both percentiles and quartiles divide data; percentiles divide data such that a certain percent lies below a given point and the rest above, while quartiles do the same with fractions. To understand these two terms better, let's first get the fractions of a data set and then see how they differ from percentiles.
To obtain the sample fractions of a given data set we begin by ordering it, obtaining the order statistics. These order statistics are the quantiles, and the fraction of each is its rank minus one divided by the number of observations minus one:
$$f_i = \frac{i - 1}{n - 1}$$
Where:
$i$ = rank of an ordered value
$n$ = number of observations
For our program scores data, we can compute the fractions as shown in the table:
```{r "fractions-02", echo=FALSE}
fracs <- round((1:length(scores) - 1)/(length(scores) - 1), 2)
data.frame(Quantile = sort(scores), Sample_fraction = fracs)
```
Notice our fractions are different from our percentiles:
```{r "difffracperc-02", echo=FALSE}
data.frame(Quantile = sort(scores), Fraction = fracs, Percentile = sub("(\\d+)%", "\\1", names(p)))
```
Four quarters are often reported for a variable; these quarters, as the name suggests, partition data into four equal parts. They can be quite informative as they can show unique features of the data like concentrations and isolated values at extreme points (outliers). It might not be appropriate to compute quartiles if the data is multi-modal (it has more than one concentration), but let's discuss this limitation when we get to the mode under measures of central tendency.
There are three cut-off points that divide a data set into four equal parts; these are Q1 (first quartile), Q2 (second quartile), and Q3 (third quartile). Q1 splits the lowest 25% of the data from the highest 75%; this is the same as the 25^th^ percentile. Q2 splits the data into halves; this is the same as the 50^th^ percentile or, as we shall discuss later, the median of a distribution. Q3 splits the top 25% of the data from the lower 75%; this is the same as the 75^th^ percentile.
![Quantiles: Percentiles and Quartiles](figures/quartiles-02.png)
Going back to our scores data set and the fractions of our quantiles (order statistics), we cannot find a value where 25% of the data are below and 75% above, nor a value where 25% are above and 75% below. However, we can find a value where 50% are above and 50% below: the score 229. We therefore can get Q2 but not Q1 and Q3.
To get these missing values we need a mathematical concept called linear interpolation. Linear interpolation simply means estimating a new data point between two known points.
In our case, the first new point we want is a score that cuts off the data such that 25% are below and 75% are above. Looking at our table of quantiles and their fractions, we see 0.25 (25/100) is between 0.2 and 0.3, so the score we seek is between 182 and 188. We now need to interpolate this score using these four pieces of information.
To interpolate this score, we need to determine the type of change as well as its rate [^2]. The change between these two points is an increase: scores increase from 182 to 188 (a difference of 6) while fractions increase from 0.2 to 0.3 (a difference of 0.1). The rate of increase is the change in scores divided by the change in fractions, that is `r 6/0.1`. Our unknown score sits at fraction 0.25, which is `r 0.25 - 0.2` above 0.2, so the increase from 182 is `r (0.25 - 0.2) * (6/0.1)`; adding this to 182 gives `r 182 + (.25 - .2) * (6/.1)`. We can therefore conclude that the score that cuts off values such that 25% are below and 75% are above is 185.
[^2]: Rate of change is change of one variable given change in another variable.
Using the same line of reasoning, we can establish that `r 236 + (.75 - .7) * ((276-236)/(0.8-0.7))` (236 + (0.75 - 0.7) * ((276 - 236)/(0.8 - 0.7))) cuts off values such that 25% are above it and 75% are below it.
Let's look at how to compute these values in R.
**Getting quartiles in R**
In R, we can again use the function `quantile` to get our quartiles, in this case inputting the proportions for the three cut-off points:
```{r "quartiles-02"}
quantile(scores, seq(0.25, 0.75, 0.25))
```
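These agree with our hand calculations; in fact, R's default algorithm performs exactly the linear interpolation we did above, which we can verify with `approx()` (a minimal sketch using our ordered scores and their fractions):
```{r "interpcheck-02"}
# Interpolate Q1 and Q3 from the order statistics and their fractions
fr <- seq(0, 1, length.out = length(scores))
approx(x = fr, y = sort(scores), xout = c(0.25, 0.75))$y
```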
We can also get this and other information using the functions "summary" and "fivenum". Note that "fivenum" computes Tukey's five-number summary; its output is an unnamed vector, which can be useful for further computation.
```{r "summaries-02"}
summary(scores)
fivenum(scores)
```
#### Graphical Display for Quantiles
There are about four graphs commonly used to display quantiles; these are:
* Box plot
* QQplots
* Empirical Shift function plots and
* Symmetry plots
In this section we will look at the first two displays.
##### Box plots {-}
Box plots, or more appropriately box-and-whisker plots, are one of the most informative graphical displays of a distribution. Even though they have lately been superseded by fancier displays, their simplicity makes them stand the test of time.
Box-and-whisker plots are best used to show outliers (values occurring at extreme points) and to compare two or more distributions.
<p id="iqr">To draw a box-and-whisker plot, draw a box from Q1 to Q3, marking Q2 with a vertical line. The width of this box is the inter-quartile range (IQR) and it represents the middle 50% of the data (75% - 25%). Draw whiskers as lines extending up to 1.5 times the IQR below Q1 and above Q3.</p>
![Box-and-whiskers plot](figures/boxplot1-02.png)
The value 1.5 is an arbitrary number with no specific meaning behind it; however, it serves its purpose in identifying outliers.
###### Constructing box-and-whisker graphs by hand {-}
As an example, suppose we had the following hypothetical values for students' scores:
```{r "scores2data-02", echo=FALSE}
set.seed(285)
scores2 <- c(round(rnorm(n = 15, mean = 77, sd = 5)), round(rnorm(4, 20, 10)), round(rnorm(2, 97)))
cat(scores2)
```
To draw a box-and-whisker plot for this distribution, we first get its quartiles:
```{r "quartiles2-02", echo=FALSE}
quart <- quantile(scores2, probs = seq(0.25, 0.75, 0.25))
quart
```
Then we compute the IQR and the whisker lengths. The IQR is the difference between Q3 and Q1. Whiskers are lines extending from Q1 and Q3, referred to as the lower and upper whiskers; each extends up to 1.5 times the IQR below Q1 or above Q3.
For our hypothetical data set, IQR = `r iqr <- as.vector(quart[3] - quart[1]); iqr`. Our whiskers are computed as:
*Lower whisker*
```{r "outliers1-02", echo = FALSE}
iqr15 <- iqr*1.5
q1 <- as.vector(quart[1])
outliers1 <- sort(scores2[scores2 < q1 - iqr15])
```
1.5 times the IQR is `r iqr15`. Subtracting `r iqr15` from Q1, which was `r q1`, we get `r q1 - iqr15`. Our lower whisker extends from Q1 down towards `r q1 - iqr15`; all values below this fence are outliers, namely `r paste0(outliers1[-length(outliers1)], collapse = ", ")` and `r outliers1[length(outliers1)]`.
*Upper whisker*
For the upper whisker, we add `r iqr15` to Q3 (`r q3 <- as.vector(quart[3]); q3`), giving us `r q3 + iqr15`. Since we have no scores above this fence, we draw the whisker from Q3 to our highest score, which is 99.
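As a cross-check, we can compute both fences directly in R with the built-in `IQR()` function; a small sketch using the same data:
```{r "fences-02"}
# Whisker fences: 1.5 * IQR below Q1 and above Q3
q <- quantile(scores2, c(0.25, 0.75))
fence <- 1.5 * IQR(scores2)
c(lower = q[[1]] - fence, upper = q[[2]] + fence)
# Values outside the fences are the outliers
scores2[scores2 < q[[1]] - fence | scores2 > q[[2]] + fence]
```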
With this information, we can now draw our box plot.
![Box plot for scores](figures/boxplot2-02.png)
###### Using R to plot box-and-whiskers {-}
In R, plotting a box-and-whisker plot is just one function call, "boxplot".
```{r "boxplotR-02", fig.cap="Box plot in R"}
boxplot(scores2, col = "grey90", ylab = "Scores", pch = 21, bg = 4, horizontal = TRUE)
title("Box plot for student's scores")
```
###### Interpreting box-and-whisker plot {-}
From our plot, it is clear that most students performed well, clustering around the median score of `r quantile(scores2, 0.5)`; however, four students performed much worse than the others.
##### Quantile plots {-}
These plots display sample fractions against the quantiles they correspond to. To draw one we just need to compute the fractions and plot them.
Using our students' scores data set (`scores2`), we can get the following fractions:
```{r "quantiles3-02", echo=FALSE}
quants <- seq(0, 1, length.out = length(scores2))
quants
```
We will plot the fractions we have just computed on the x-axis and our ordered scores (quantiles) on the y-axis, with a line passing through all points.
Because of the interpolation involved, drawing this plot by hand is not a good idea, so we will use R.
###### Quantile plots in R {-}
There is no dedicated function for this plot, but since it is a line graph, the standard "plot()" function does the trick.
```{r "quantileplot-02", fig.cap="Q-plot in R"}
plot(quants, sort(scores2), type = "l", ann = FALSE)
title("Quantile plot in R", xlab = "Sample fractions", ylab = "Quartiles")
# Add points to show the linear interpolants
points(quants, sort(scores2), pch = 21, bg = 4)
```
##### Quantile-Quantile (QQ) Plots {-}
QQ plots are graphical displays for comparing two data sets; these can either be two observed samples, or one observed sample and a theoretical distribution. The quantiles of the first are plotted against the quantiles of the second. The pattern of the points is used to:
* Assess whether distributions being compared are similar
* Compare shapes of distribution
* Assess goodness of fit
As an example, let's add a second class of scores with these values.
```{r "secores3data-02", echo=FALSE}
set.seed(4985)
scores3 <- round(rnorm(35, 78, 10))
props2 <- seq(0, 1, length.out = length(scores3))
cat(scores3)
```
We want to compare this distribution with that of our first class. To do this we compute sample fractions for both samples. But first, take note that these two classes are not the same size: class one has `r length(scores2)` scores and class two has `r length(scores3)`. Since we want to plot them on the same axes, we standardize by taking the number of fractions for each sample to be the larger of the two sample sizes.
Therefore, our first task is to get the larger of the two sizes, which is `r length(scores3)` (the size of the second class), and then compute the fractions. This should give us `r paste(round(props2[-length(props2)], 2), collapse = ", ")` and `r round(props2[length(props2)], 2)`.
Now we can get the quantiles for our classes using these fractions. Since we are using the sample size of the second class (35), the quantiles for the second class are simply the order statistics of its scores. For the first class, however, we need to interpolate the quantiles. We have seen how to interpolate these values, hence we will use R to make our work easier.
**Using R**
Let's compute the quantiles for the scores of the first and second classes.
```{r "samplefractions-02"}
# Get number of fractions
n <- max(length(scores2), length(scores3))
# Compute quantiles
quantileClass1 <- quantile(scores2, seq(0, 1, length.out = n))
cat("Quantiles for first class:\n", quantileClass1, "\n\n")
quantileClass2 <- quantile(scores3, seq(0, 1, length.out = n))
cat("Quantiles for second class:\n", quantileClass2)
```
We can now plot these two samples using the plot function. But you should know that R has a handy function which we can call with our two distributions; it does all the calculations and makes the QQ plot for us. This function is "qqplot".
Let's compare the QQ plot generated from our computations with the one from the `qqplot` function.
```{r "qqplot1-02"}
# Save the current settings while switching to a 1x2 layout
op <- par(mfrow = c(1, 2))
plot(quantileClass1, quantileClass2, ann = FALSE, pch = 21, bg = 4)
title("Computed QQ plot", xlab = "Class 1", ylab = "Class 2")
qqplot(scores2, scores3, ann = FALSE, pch = 21, bg = 4)
title("Using 'qqplot()'", xlab = "Class 1", ylab = "Class 2")
par(op) # restore the previous settings
```
These plots look the same; now we need to interpret them.
**Interpreting QQ plots**
There are at least three distributional properties a QQ plot can tell us about: skewness, tail behavior and modality. It should however be noted that patterns for distributions with small sample sizes are often unclear, as in our case (sample sizes of 21 and 35).
In general, if the points on a QQ plot lie on the line x=y, the two distributions are said to be similar. If the points form a line but do not necessarily lie on x=y, the distributions are linearly related and generally come from the same family of probability distributions. We will discuss [probability distributions](#probdist) and their implications in chapter three. Given this information, let's add the x=y line to our plot.
```{r "qqplot2-02"}
qqplot(x = scores2, y = scores3, xlab = "Class 1", ylab = "Class 2", main = "QQ plot in R", pch = 21, bg = 4)
lines(x = 1:99, y = 1:99)
```
To draw the x=y line we used the function `lines`, passing to it values on the line x=y. From this line we see that the scores of class one and class two do not form a linear relationship. We can thus conclude they do not have similar distributions.
Though not too clear (due to the small sample sizes), the distribution seems bimodal (two peaks), given the "s" shape we see. These two peaks are the concentrations of points around (20, 70) and (80, 80).
Something else we can see from our graph is tail behavior, or isolated values at the extremes; this can be an indicator of outliers (values far from what is expected).
It is useful to note that QQ plots are usually not reported; they are more applicable as an Exploratory Data Analysis technique (an analyst's tool, so to speak). That is, they are more suitable for guiding data analysis than for being a finding to be reported.
We shall revisit QQ plots when we discuss probability distributions; at that point we will have covered some of the issues mentioned here, like skewness and modality.
### Measures of Central Tendency
To best understand measures of central tendency or location, think of our first example on scores: we wanted to make an informed selection of one statistical program from among a number of programs. To this end we were told our preferred program, R, scored 286 out of 300. From our discussion on quantiles we discovered that this placed it in the 90^th^ percentile. That is certainly good information; however, an astute analyst would also want to know where the other programs are located in the distribution. More specifically, you would want to know the distance of 286 from the center of the distribution. Measures of central tendency answer this: they summarize data into a single, representative value.
There are three commonly used measures of central tendency: "mean", "median" and "mode". In this section we look at each one of them while noting its applicability.
#### Mean
The mean, specifically the arithmetic mean, indicates the center of a distribution. It is computed as the sum of all values divided by the number of values. So, if you have a variable, the mean is the sum of all values in that variable divided by the number of elements in it. The mean is appropriate for numerical variables (discrete [^3] and continuous [^4]), but not for qualitative or categorical data.
[^3]: Discrete variables are numerical variables that take on only certain values. Simply put, these are whole numbers, without decimal notation, that cannot be subdivided; for example, the number of students in a classroom or the number of cats in a household.
[^4]: Continuous variables are numerical variables whose values can take any value within a range. These values can have fractions or decimal places; for example, normal human temperature is said to be between 36.1 and 37.5 degrees Celsius, or between 96.9 and 99.5 degrees Fahrenheit.
##### Mean for numeric data{#numericmean}
Going back to our first example on scores for statistical programs, we compute the mean as the total of all values divided by the number of values, that is, the sum of `r paste(paste(scores[-length(scores)], collapse = ", "), "and", scores[length(scores)])` divided by `r length(scores)`, giving us `r sum(scores)/length(scores)`.
**Mathematical notation**
Based on the notion that the mean is the sum of all values divided by the number of values, then, given values x~1~, x~2~, x~3~, ..., x~n~, the mean is:
$$\frac{x_1 + x_2 + x_3 + ... + x_n}{n}$$
This is mathematically expressed as:
$$\bar{x} = \frac{\sum\limits^{n}_{i=1}{x_i}}{n}$$
Where:
$\bar{x}$ is Sample mean
$\sum$ is Greek capital letter sigma meaning "sum of"
$n$ is sample size
This mathematical expression is often reduced to:
$$\bar{x} = \frac{\sum{x}}{n}$$
In statistics, it is important to distinguish between population parameters [^5] and sample statistics [^6]. The mathematical expression given above is a sample statistic; if we were dealing with an entire population, the population mean would be given by:
[^5]: A parameter describes an entire population; parameters are often unknown.
[^6]: A statistic describes a fraction of a population (a sample) and is used to estimate a population parameter.
$$\mu = \frac{\sum{X}}{N}$$
Where:
$\mu$ is population mean
$X$ are observations
$N$ is population size
##### Computing Mean in R {-}
Getting the mean in R is just one function call, the function "mean".
```{r "mean-02"}
mean(scores2)
```
Do take note: if the data contains missing values (NAs), you need to tell R by setting the argument "na.rm" to TRUE, otherwise the output will be NA.
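A quick illustration with a hypothetical vector containing a missing value:
```{r "meanNA-02"}
x <- c(174, 286, NA, 236)
mean(x)               # returns NA because of the missing value
mean(x, na.rm = TRUE) # drops the NA before computing
```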
#### Univariate Frequency Distributions {-}
When we have a discrete variable with few unique values, or a continuous variable with known ranges, it is useful to convert it to grouped data.
Grouping data involves categorizing or batching together observations. Grouping not only helps describe similar observations, it also helps us see the underlying distribution: average, spread, skewness, modality or peakedness, and extreme or isolated values.
Grouped data is often presented in frequency tables. Such a table can be used for grouped and ungrouped data; ungrouped data are often the unique values of a discrete variable. Frequency tables are also called frequency distributions as they tabulate frequencies alongside their corresponding observations. A frequency is the number of times an observation occurs.
With that understanding, let's look at two examples of frequency distributions: one for ungrouped data and the other for grouped data.
For our first example, on ungrouped data, consider the following data set: a list of responses from analysts asked how many times they used R in the last week.
{0, 0, 1, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5}
From this data, we can see there are a number of repeated values, or equivalently few distinct values. Based on this, we can summarize the data by counting the number of occurrences of each unique value (0, 1, 2, 3, 4 and 5) and tabulating them as follows.
Usage | Frequency
---------|----------------
0 | 2
1 | 1
2 | 1
3 | 6
4 | 11
5 | 9
What we have just created is an ungrouped frequency table.
Now let's look at grouped frequency distributions.
Suppose we have the following data on the number of years some of the most popular programs have been in existence:
{1, 4, 6, 7, 7, 8, 9, 12, 12, 12, 13, 15, 15, 16, 17, 17, 18, 19, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 25}
There are a few terms or concepts we need to appreciate as we construct frequency tables for grouped distributions. These are:
* Class: A range of values like "1-5" or "6-7". These can also be considered sub-sets of a data distribution.
* Class size: The number of values in a class; for example the class "1-5" has 5 values: 1, 2, 3, 4, and 5.
* Class limits: The minimum and maximum values of a class, for example 1 and 5 for the class "1-5". These are specified as the lower and upper limits.
* Class boundaries: Also called true class limits, computed as the average of the upper limit of one class and the lower limit of the next class. As an example, if we have three classes "1-5", "6-10" and "11-15", the boundaries between the first two pairs of classes are (5 + 6)/2 = `r (5+6)/2` and (10 + 11)/2 = `r (10+11)/2`. Notice we are adding half a point to each upper limit, so we can do the same for the remaining boundaries, beginning just before the first class and ending just after the last: 0.5, 5.5, 10.5, and 15.5.
* Class width/interval: The difference between the upper and lower boundaries of a class, for example 5.5 - 0.5 = `r 5.5-0.5`. It is also the difference between the lower limits (or the upper limits) of two consecutive classes, for example 6 - 1 = `r 6 - 1`.
* Class mark/midpoint: The middle value of a class, computed as the average of the upper and lower limits (or of the upper and lower boundaries) of the class.
When constructing a frequency distribution table, a few issues need to be settled, including the number of classes and the class width. It is important not to have too many or too few classes, as that would obscure features of our distribution or make the frequency distribution hard to interpret. It is also important to choose the class width so that we do not end up with more empty classes than necessary.
There are a couple of formulas out there for estimating class width, and there are also recommendations to use class widths of 2, 5, or 10. I suggest the latter recommendation, guided by the data. For example, for our data with values from 1 to 25, a class width of 5 gives a total of `r 25/5` classes. A width of 2 would make our frequency distribution too big, about 12 classes and some left over, while a width of 10 would mean only two or three classes, a bit too few. So 5 classes seems ideal.
Now that we know how many classes we will have and their width, we can construct our classes, bearing in mind that they need to be mutually exclusive (each value belongs to exactly one class). Based on this we have the following classes and the number of observations that fall in each.
Years | Frequency
---------|------------
1 - 5 | 2
6 - 10 | 5
11- 15 | 6
16- 20 | 9
21- 25 | 8
What we have above is a grouped frequency distribution table.
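In R, a grouped frequency table like this can be built with `cut()` and `table()`; a minimal sketch, assuming the years data above:
```{r "groupedfreq-02"}
# Group the years into five classes of width 5 and count them
yrs <- c(1, 4, 6, 7, 7, 8, 9, 12, 12, 12, 13, 15, 15, 16, 17, 17, 18,
         19, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 25)
table(cut(yrs, breaks = seq(0, 25, by = 5),
          labels = c("1-5", "6-10", "11-15", "16-20", "21-25")))
```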
With this brief introduction to frequency distributions, let's now see how to compute their descriptive statistics.
##### Mean for frequency distributions{#meanDist}
In this section we will look at how to compute averages for ungrouped and grouped distributions.
###### Mean for ungrouped distributions{#ungroupedMean}
To learn how to compute the mean of an ungrouped frequency distribution, let's build on our understanding of the mean for non-frequency data, which we defined as the sum of all values divided by the number of values. For an ungrouped distribution, we reconstruct the sum of all values by multiplying each observation by its frequency and summing these products, then divide by the total frequency.
As an example, let's revisit our data on responses from analysts.
Usage | Frequency
---------|-------------------
0 | 2
1 | 1
2 | 1
3 | 6
4 | 11
5 | 9
We compute the mean for this data by multiplying each usage value (observation) by its frequency, summing the products, and finally dividing by the number of responses (the number of analysts); that is,
$$\frac{(0*2) + (1*1) + (2*1) + (3*6) + (4*11) + (5*9)}{2+1+1+6+11+9}$$
We get a mean of `r usage <- ((0*2)+(1*1)+(2*1)+(3*6)+(4*11)+(5*9))/(2+1+1+6+11+9); usage`. Based on this finding we can conclude that, on average, analysts in our organization used R about `r round(usage, 1)` times last week.
We can mathematically express this computation as:
$$\bar{x} = \frac{{\sum\limits^{n}_{i=1}}{f_ix_i}}{\sum\limits^{n}_{i=1}{f_i}}$$
Where:
n = number of unique observations or number of rows in frequency table
f = frequency
x = an observation in a frequency table
or simply as:
$$\bar{x} = \frac{\sum{fx}}{\sum{f}}$$
**Computing mean for ungrouped distribution in R**
In R, we can compute the mean of the raw responses with the "mean" function. To generate a frequency table we use the function "table". Since table() does not produce a very presentable table, we transform it into a data frame with the "as.data.frame" function, giving us a table similar to the one we constructed manually.
```{r "ungroupeddata1-02"}
# Input data
ungpd1 <- c(0, 0, 1, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5)
# Obtain mean
mean(ungpd1)
# Make a frequency table
(ungpd1Tab <- table(ungpd1))
# Turn it into a presentable table
ungpd1Tab <- as.data.frame(ungpd1Tab)
names(ungpd1Tab)[1] <- "Usage"
ungpd1Tab
```
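If we only had the frequency table and not the raw responses, the same mean can be recovered with `weighted.mean()`; a small sketch using the table we just built:
```{r "weightedmean-02"}
# Convert the factor column back to numbers, then weight by frequency
usage_vals <- as.numeric(as.character(ungpd1Tab$Usage))
weighted.mean(usage_vals, w = ungpd1Tab$Freq)
```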
###### Mean for grouped distributions{#groupedMean}
Let's use our second frequency distribution example to compute the mean of a grouped distribution. This was our distribution:
Years | Frequency
---------|------------
1 - 5 | 2
6 - 10 | 5
11- 15 | 6
16- 20 | 9
21- 25 | 8
We have just defined the mean of an ungrouped distribution as the sum of the products of observations and frequencies divided by the total frequency. We are going to use this definition with a slight amendment: what we consider to be our observations.
When dealing with ungrouped data it was easy to recreate our original values by multiplying observations by their frequencies; for grouped distributions we cannot do this, because we simply do not know the exact values of the observations within a class (a range of values). So we go for the next best thing, an estimate: the class midpoint or class mark. Once we have these midpoints, we compute the mean just as we did with ungrouped data.
To show each computation, let's take our frequency table and add columns for the midpoints and the products of midpoints and frequencies.
Years | Midpoints | Frequencies | Product
---------|--------------|--------------|-------------
1 - 5 |`r (1+5)/2` | 2 |`r (a <- ((1+5)/2) * 2)`
6 - 10 |`r (6+10)/2` | 5 |`r (b <- ((6+10)/2) * 5)`
11 - 15 |`r (11+15)/2` | 6 |`r (c <- ((11+15)/2) * 6)`
16 - 20 |`r (16+20)/2` | 9 |`r (d <- ((16+20)/2) * 9)`
21 - 25 |`r (21+25)/2` | 8 |`r (e <- ((21+25)/2) * 8)`
Total | |`r (f <- 2+5+6+9+8)`|`r (p <- a+b+c+d+e)`
The mean for this distribution is thus `r p` divided by `r f`, which is `r (m <- p/f)`. We can therefore conclude that the average number of years statistical packages have been in existence is about `r round(m, 1)` years. We can compare this average with the number of years R has been in existence, `r 2017-1993` (from 1993 to 2017); it looks like R has some mileage over most programs (hypothetically speaking).
There are two things to appreciate as we conclude this section on the mean for grouped data:
* The mean of grouped data is an estimate: Unlike the mean of an ungrouped distribution, the mean of grouped data is an approximation, as it uses midpoints rather than actual values. It is therefore important to collect responses as ungrouped values; it is easier to group observations during analysis than to reconstruct actual values from classes.
* Don't use the mean for frequency distributions with open classes: For open classes like "15+" or "65 and above", use the [mode](#mode) as the measure of central tendency instead, because a midpoint cannot be computed for an open-ended class.
**Computing grouped mean in R**
Unfortunately there is no single function for calculating a grouped mean in R, so we go through a number of steps to compute it.
```{r "groupedmean-02"}
# Data (levels given explicitly so the factor sorts in class order)
years <- factor(c("1-5", "6-10", "11-15", "16-20", "21-25"),
                levels = c("1-5", "6-10", "11-15", "16-20", "21-25"),
                ordered = TRUE)
freq <- c(2L, 5L, 6L, 9L, 8L)
gpd1 <- data.frame(Years = years, Freq = freq)
gpd1
# Number of observations
n <- sum(gpd1$Freq)
# Midpoints (average of the lower and upper class limits)
midpoint <- c((1+5)/2, (6+10)/2, (11+15)/2, (16+20)/2, (21+25)/2)
gpd1[3] <- midpoint
names(gpd1)[3] <- "Midpoint"
gpd1
# Product of midpoints and frequency
gpd1[4] <- gpd1$Freq * gpd1$Midpoint
names(gpd1)[4] <- "Product"
gpd1
# Mean
sum(gpd1$Product)/n
```
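The same result can be obtained in one call with `weighted.mean()`:
```{r "groupedwmean-02"}
weighted.mean(gpd1$Midpoint, w = gpd1$Freq)
```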
#### Median{#median}
The median is simply the middle observation of an ordered distribution. To locate this middle value, we have to determine whether the distribution has an even or an odd number of observations.
##### Median for odd numbered distributions
For an odd number of observations, the middle value is easy to locate: it is the value that splits the distribution such that there are equal numbers of values before and after it. For example, a distribution with 21 observations has the eleventh observation as its median, since there are ten values before it and another ten after it. In other words, the median position for an odd-numbered distribution is the number of observations divided by two, rounded up to the next whole number: 21/2 = 10.5, and 10.5 rounded up is 11.
Using this reasoning we can write our own formula for the median of an odd-numbered distribution:
$$Median_{(odd)} = data[\lceil \frac{n}{2} \rceil]$$
Where:
data = the distribution
[] = subset notation
$\lceil \rceil$ = ceiling, round up to the next whole number
n = number of observations in the distribution
Now let's get the median of our data on scores for statistical programs.
First we order our data from the lowest value to the highest value:
```{r "median1-02", echo=FALSE}
cat(sort(scores))
```
Since the number of elements in this data set is odd (`r length(scores)`), we can use our formula: the median is at position ceiling(`r length(scores)`/2) = `r ceiling(length(scores)/2)`, which is `r sort(scores)[ceiling(length(scores)/2)]`.
**Computing median in R**
In R, the median of a numerical distribution is one function call, whether the distribution is odd- or even-numbered.
```{r "median2-02"}
median(scores)
```
##### Median for even numbered distributions
For even-numbered distributions, the median is the average of the two middle values. For example, a distribution with 20 observations has as its median the average of the tenth and eleventh observations.
To get these two middle positions, we take half the number of observations (20/2) and half the number of observations plus one (20/2 + 1).
As before, we can write our own formula for the median of an even-numbered distribution:
$$Median_{(even)} = \frac{{data[\frac{n}{2}]+data[\frac{n}{2}+1]}}{2}$$
Where:
data = distribution
[] = subset
n = number of observations in the distribution
Now, using our scores data set, let's add a score of 234 to make an even numbered distribution. This is how it looks when ordered:
```{r "median3-02", echo=FALSE}
amendedScores <- sort(c(scores, 234))
cat(sort(amendedScores))
```
Our data now has twelve values; using our derived formula, we can compute the median as
$$Median_{(even)} = \frac{scores[\frac{12}{2}]+scores[\frac{12}{2}+1]}{2}$$
This should output `r (amendedScores[length(amendedScores)/2] + amendedScores[length(amendedScores)/2 + 1])/2`.
Another way to locate the median is to take the number of observations (n) plus one, divided by two; this gives the median's position rather than its value:
$$Position_{(median)} = \frac{n + 1}{2}$$
For even n this position falls halfway between the two middle values, whose average is the median. This formula is certainly simpler, but not as intuitive as ours.
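We can confirm our hand computation against R's own `median()`:
```{r "medianeven-02"}
# Our even-n formula versus R's built-in median()
(amendedScores[length(amendedScores)/2] + amendedScores[length(amendedScores)/2 + 1])/2
median(amendedScores)
```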
##### Median for frequency distributions
As with non-frequency distributions, computing the median of a frequency distribution depends on whether the total frequency is odd or even.
Since we now know the difference between the median of odd- and even-numbered distributions, in this section we will focus on locating the median of a frequency distribution, using our two data sets: the analysts' responses and the years statistical programs have been in existence.
By and large, medians for ungrouped and grouped distributions go through the same process. We first determine whether the total frequency is odd or even by summing all frequencies, then use the appropriate formula to compute the position of the median, and finally identify the observation or class containing the median by accumulating frequencies.
Let's see how this actually works.
###### Ungrouped distributions
Using our analyst response data, let's determine its median.
We begin by finding out if it is an odd- or even-numbered distribution by summing the frequencies (2, 1, 1, 6, 11, and 9). This gives `r 2+1+1+6+11+9`, an even number.
Since it is even, we use our second formula to locate the position of the median: the median lies at position `r (30+1)/2` ((30+1)/2).
To identify the observation at this position we generate cumulative frequencies, and the best way to do that is to add a column to our frequency distribution table.
Usage | Frequency | Cumulative frequency
---------|-----------------|-------------------------
0 | 2 | 2
1 | 1 | `r 2+1`
2 | 1 | `r 2+1+1`
3 | 6 | `r 2+1+1+6`
4 | 11 | `r 2+1+1+6+11`
5 | 9 | `r 2+1+1+6+11+9`
From our cumulative frequencies, we can see position `r (30+1)/2` falls within the usage value 4 (its cumulative frequency, 21, is the first to reach past 15.5); hence the median is 4.
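The cumulative frequencies themselves are one call to `cumsum()`:
```{r "cumfreq-02"}
# Cumulative frequencies of the usage table
freqs <- c(2, 1, 1, 6, 11, 9)
cumsum(freqs)
```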
**Locating median for ungrouped distributions in R**
The median of an ungrouped frequency distribution is computed the same way as for a non-frequency distribution, using the function "median" on the raw values.
```{r "ungroupedmedian-02"}
ungpd1
median(ungpd1)
```
###### Grouped distributions
The median of a grouped distribution is located exactly as for ungrouped distributions. That is, we determine whether the total frequency is odd or even, compute the position of the median, and identify the class containing that position.
Our total frequency is `r 2+5+6+9+8`, the same as before, so we are looking for the class containing position 15.5.
We generate cumulative frequencies.
Years | Frequencies | Cumulative frequency
---------|---------------|------------------------
1 - 5 | 2 | 2
6 - 10 | 5 | `r 2+5`
11 - 15 | 6 | `r 2+5+6`
16 - 20 | 9 | `r 2+5+6+9`
21 - 25 | 8 | `r 2+5+6+9+8`
From these cumulative frequencies we find 15.5 falls in the fourth class, hence the median class is "16-20".
**Locating median for grouped distribution in R**
Base R does not have a function to compute the median of grouped data, but it is not hard to locate the median class ourselves. Since the reconstructed vector of classes is already in class order, we simply index it at the median position; note we cannot rely on `median()` itself for character class labels.
```{r "groupedmedian-02"}
# Reconstruct the classes, one entry per observation (already in class order)
dat <- rep(as.character(years), freq)
# Even total frequency (30): the median position (30 + 1)/2 = 15.5 falls
# between positions 15 and 16, both of which are in the same class here
dat[ceiling((length(dat) + 1)/2)]
# Odd total frequency (31): the median position is exactly (31 + 1)/2 = 16
dat <- c("1-5", dat) # prepend a value so the class order is preserved
dat[(length(dat) + 1)/2]
```
#### Mode
The mode is the most frequently occurring value or category. It is the only measure of central tendency suitable for categorical or qualitative data, and also the only one that can have more than one value, or none at all.
A distribution can have no mode (all observations occur equally often), one mode ("unimodal"), two modes ("bimodal"), or more than two modes ("multimodal").
##### Mode: numerical distributions
For discrete distributions, to get the most frequently occurring value we generate frequencies and then determine which observation has the highest frequency. For continuous distributions, we need to group/categorize the observations first; we will discuss these in the section on grouped distributions.
As examples of discrete distributions, let's look at situations where there is no mode, one mode (unimodal), two modes (bimodal), and more than two modes (multimodal).
**Uniform Distribution**
Uniform distributions have no mode: every observation occurs with the same frequency. Here is an example of a data set with no mode.
```{r "uniformdist-02", echo=FALSE}
uniform <- c(rep(65, 5), rep(66, 5), rep(67, 5), rep(68, 5), rep(69, 5), rep(70, 5))
cat(uniform)
```
We can establish the lack of a mode using a frequency table.
Value | Frequency
---------|-------------
65 | 5
66 | 5
67 | 5
68 | 5
69 | 5
70 | 5
All observations have the same frequency.
**Unimodal distributions**
Unimodal distributions have one peak, that is, one most frequently occurring observation. For example, the following distribution:
```{r "modenum1-02", echo=FALSE}
set.seed(583)
cat(round(rnorm(100, 65)))
```
We can create the following frequency distribution:
Values | Frequency
---------|--------------
63 | 4
64 | 24
65 | 33
66 | 30
67 | 6
68 | 3
From this table, it is clear that the most frequently occurring value is 65, as it has the highest frequency (33).
**Bimodal distributions**
Bimodal distributions have exactly two modes; that is, two most frequently occurring values. We can see this in the following distribution:
```{r "bimodal-02", echo=FALSE}
bimodal <- c(rep(c(39:41, 68:71), c(2, 6, 3, 1, 3, 6, 2)))
cat(bimodal)
```
This distribution has the following frequency distribution:
Value | Frequency
---------|-------------
39 | 2
40 | 6
41 | 3
68 | 1
69 | 3
70 | 6
71 | 2
The two modes in this distribution are 40 and 70 as each has six observations.
**Multimodal distributions**
Multimodal distributions have more than two modes; here is an example of a distribution with three modes.
```{r "multimodal1-02"}
multimodal <- c(bimodal, rep(72, 6))
cat(multimodal)
```
From the following frequency distribution, we can see there are three modes, 40, 70 and 72.
Value | Frequency
---------|------------
39 | 2
40 | 6
41 | 3
68 | 1
69 | 3
70 | 6
71 | 2
72 | 6
**Getting the mode of a distribution in R**
R does not have a function for the statistical mode; the "mode" function in R does something else entirely (it returns an object's internal storage type). However, getting the statistical mode is easy once we know what it is.
To get the mode we table our values (make a frequency table), find the maximum frequency using the function "which.max", and return the corresponding value using the "names" function.
```{r "modenum2-02"}
# Data
set.seed(583)
mode1 <- round(rnorm(100, 65))
# Frequency table
table(mode1)
# Mode
names(which.max(table(mode1)))
```
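One caveat: `which.max()` returns only the first maximum, so for bimodal or multimodal data the approach above silently drops the other modes. A small sketch of a helper (our own, not base R) that returns all of them:
```{r "allmodes-02"}
# Return every value whose frequency equals the maximum
all_modes <- function(x) {
  tab <- table(x)
  names(tab)[tab == max(tab)]
}
all_modes(multimodal) # the three modes: 40, 70 and 72
```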
##### Mode: Frequency distributions
In this section we discuss how to get the mode of grouped and ungrouped distributions.
###### Mode: ungrouped distributions
Getting the mode of an ungrouped frequency distribution is the same as for a non-frequency distribution, and actually easier: since we already have the frequencies, we only need to determine which is the highest.
Using our analyst response data, the mode is 4, since it has the highest frequency (11).
In R we only need the function "which.max" to locate the mode.
```{r "modeungouped-02"}
# Frequency table (as a dataframe)
ungpd1Tab
# Mode
ungpd1Tab[which.max(ungpd1Tab$Freq), 1]
```
###### Mode: Grouped distributions
The mode of grouped data applies both to categorical distributions and to continuous distributions; for continuous distributions, the categories/groups need to be constructed first.
From these groups we get the modal class the same way we got the mode of an ungrouped frequency distribution, that is, we identify the class with the highest frequency.
For our data on the number of years statistical programs have been in existence, we can easily locate the modal class as the fourth class, "16-20", which has 9 observations.
**Locating mode for grouped distribution in R**
In R, we again use the function "which.max" and subset the class (years):
```{r "modegrouped-02"}
gpd1$Years[which.max(gpd1$Freq)]
```
Here is a function for determining the modal class given a continuous distribution and breaks. Breaks are the cut-off points by which the distribution will be grouped.
```{r "modegroupedfunction-02"}
mode_grouped <- function(x, breaks, class = TRUE) {
  n <- length(breaks)
  # Class labels built from consecutive breaks
  nms <- sapply(1:(n-1), function(i) paste(breaks[i], "-", breaks[i+1]))
  # Indices of the observations falling in each class
  # (each class includes its lower break and excludes its upper one)
  freq <- lapply(1:(n-1), function(i) which(x >= breaks[i] & x < breaks[i+1]))
  names(freq) <- nms
  if (class) {
    # Return the label of the class with the highest count
    names(which.max(sapply(freq, length)))
  } else {
    # Return the index of the class with the highest count
    which.max(sapply(freq, length))
  }
}
```
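As a quick usage example, applying it to the second class's scores (assuming breaks wide enough to span the data):
```{r "modegroupedusage-02"}
mode_grouped(scores3, breaks = seq(50, 110, by = 10))
```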
#### Comparison of measures of central tendency
We have just concluded a good discussion of measures of central tendency. From it we can compute the mean, median and mode of any numeric or frequency distribution. This is certainly great, but do we need to report all of them? Certainly not: each measure has its own merits, demerits and applicability. Let's discuss these aspects.
**Central tendency for qualitative distributions**
When dealing with qualitative or grouped data, the mode is the most appropriate measure of central tendency. To see why, think of a variable such as educational level with values high, medium and low. Would it make sense to say *the average level of education is 10.6*? What would ten mean and, more specifically, what would the point six indicate? It is far more informative to hear "most respondents have high education".
**Mean and median**
When data has extreme values, the median is more appropriate than the mean. The basic reason is that the mean uses all values in a distribution while the median uses only their positions. Think of it this way: you cannot compute the mean without knowing the values in a distribution, but you can say where the median is located just by knowing how many values the distribution has.
For example, the following distribution has eleven values from 53 to 64 and no extreme values.
```{r "data1-02", echo=FALSE}
set.seed(4)
data1 <- sort(round(rnorm(11, 60)))
cat(data1)
```
The mean of this distribution is `r mean(data1)` and the median is `r median(data1)`, a difference of `r median(data1) - mean(data1)`. The difference between mean and median is small, and either could be used to report the centrality of the distribution, though most analysts in this case would report the mean.
Now let's add just one extreme value (2) and assess its impact on the mean and median.
```{r "data2-02", echo=FALSE}
data2 <- sort(c(2, data1))
cat(data2)
```
Now our distribution begins at 2, not 53. The mean is now `r mean(data2)` and the median `r median(data2)`, a difference of `r median(data2) - mean(data2)`. For this distribution, which average is more appropriate: a mean of about `r round(mean(data2))` or a median of about `r round(median(data2))`?
![Comparison of mean and median with extreme value](figures/comparison-02.png)
As shown in the figure above, the mean is not an accurate measure of centrality for this distribution; it is pulled by the extreme value (2). Therefore, when reporting averages for distributions with extreme values, it is more meaningful to report the median.
#### Summary - Mean, Median, Mode
The mean, also called the average, is computed as the sum of all values divided by the number of values. The median is the center of a distribution: the value in the middle when the distribution is arranged in order. The mode is the most frequently occurring value in a distribution.
Mean, median and mode provide us with a descriptive, representative value for a distribution. All three measures can describe numerical distributions (discrete and continuous), but the mean and median are more appropriate. When a numerical distribution has extreme values, it is best to use the median as the representative value.