# Descriptive Statistics

![](banner_descriptive_statistics.jpg)

_<p style="text-align: center;"> What are the chances that the wildebeest makes it across the river? </p>_

In [1]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)

## Introduction


_<p style="text-align: center;"> “The object of statistical science is to discover methods of condensing information concerning<br> large groups of allied facts into brief and compendious expressions suitable for discussion.”<br>- Sir Francis Galton </p>_

## Terms

* A **statistic** is a single value (or small set of values) that summarizes the distribution of one or more **variables**.

## Data

Consider the following pedagogical dataset called `data`.

In [2]:
data = data.frame(x1=c(3,4,5,4,5,5,8,7,9,9),
                  x2=c(25,42,53,47,51,65,81,79,95,93),
                  x3=c(95,93,81,79,53,47,51,65,42,25),
                  x4=c(22,41,NA,NA,51,63,89,NA,93,97),
                  x5=c("A","B","C","B","B","C","D","B","E","D"))
fmt(data)

x1,x2,x3,x4,x5
3,25,95,22.0,A
4,42,93,41.0,B
5,53,81,,C
4,47,79,,B
5,51,53,51.0,B
5,65,47,63.0,C
8,81,51,89.0,D
7,79,65,,B
9,95,42,93.0,E
9,93,25,97.0,D


## Descriptive Statistics about One Variable

Variable type:

In [3]:
class(data$x1)
class(data$x2)
class(data$x3)
class(data$x4)
class(data$x5)

Count:

In [4]:
length(data$x1) # count

Mean:

In [5]:
mean(data$x1)

Median:

In [6]:
median(data$x1)
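
As a minimal sketch of what `mean` and `median` compute, the same values can be worked out by hand. The vector `x1` below is a copy of `data$x1`, and the names `mean_by_hand` and `median_by_hand` are illustrative only.

```r
x1 <- c(3, 4, 5, 4, 5, 5, 8, 7, 9, 9) # same values as data$x1

# Mean: sum of the values divided by the number of values.
mean_by_hand <- sum(x1) / length(x1)

# Median: middle of the sorted values; with an even count,
# average the two middle values.
s <- sort(x1)
n <- length(s)
median_by_hand <- if (n %% 2 == 1) s[(n + 1) / 2] else (s[n / 2] + s[n / 2 + 1]) / 2

all.equal(mean_by_hand, mean(x1))     # TRUE
all.equal(median_by_hand, median(x1)) # TRUE
```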

Unique values:

In [7]:
unique(data$x1)
unique(data$x5)

Count of unique values:

In [8]:
length(unique(data$x1)) # number of unique values
length(unique(data$x5)) # number of unique values

Frequency table:

In [9]:
table(data$x5) # frequencies


A B C D E 
1 4 2 2 1 

Relative frequency table:

In [10]:
table(data$x5) / length(data$x5) # relative frequencies


  A   B   C   D   E 
0.1 0.4 0.2 0.2 0.1 

Percent relative frequency table:

In [11]:
(table(data$x5) / length(data$x5)) * 100 # percent frequencies


 A  B  C  D  E 
10 40 20 20 10 
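
The relative frequencies above can also be obtained with base R's `prop.table`, which divides each count by the total. This is a small sketch; `x5` is a copy of `data$x5`, and `freq`, `rel`, and `pct` are illustrative names.

```r
x5 <- c("A", "B", "C", "B", "B", "C", "D", "B", "E", "D") # same values as data$x5

freq <- table(x5)       # frequencies
rel  <- prop.table(freq) # relative frequencies (they sum to 1)
pct  <- 100 * rel        # percent relative frequencies (they sum to 100)

rel
pct
```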

Frequency of value:

In [12]:
table(data$x5)["B"] # frequency of B
table(data$x5)[2] # frequency of value in position 2

Mode:

In [13]:
data$x5
table(data$x5)
max(table(data$x5))
table(data$x5) == max(table(data$x5))
names(table(data$x5))
names(table(data$x5))[table(data$x5) == max(table(data$x5))]

factor(names(table(data$x5))[table(data$x5) == max(table(data$x5))], levels=unique(data$x5)) # mode


A B C D E 
1 4 2 2 1 

In [14]:
data$x5
table(data$x5)
as.data.frame(table(data$x5))
which.max(as.data.frame(table(data$x5))$Freq)
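# which.max() gives a position in the frequency table, not in data$x5, so index the table's labels
as.data.frame(table(data$x5))$Var1[which.max(as.data.frame(table(data$x5))$Freq)] # mode (first one if tied)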


A B C D E 
1 4 2 2 1 

Var1,Freq
A,1
B,4
C,2
D,2
E,1
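
Base R has no built-in function for the statistical mode, so the steps above are often wrapped in a small helper. The function `modes` below is a hypothetical name, not part of any package. It is a sketch that returns every most-frequent value as a character vector, so ties are kept rather than dropped.

```r
# Hypothetical helper: all most-frequent values of x, as character strings.
modes <- function(x) {
  counts <- table(x) # table() ignores NA values by default
  names(counts)[counts == max(counts)]
}

x5 <- c("A", "B", "C", "B", "B", "C", "D", "B", "E", "D") # same values as data$x5

modes(x5)                            # "B"
modes(c("a", "b", "a", "b", "c"))    # "a" "b" (a tie)
```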


Minimum, maximum:

In [15]:
min(data$x1) # minimum
max(data$x1) # maximum
range(data$x1) # minimum and maximum

Range:

In [16]:
range(data$x1)[2] - range(data$x1)[1] # range

Variance:

In [17]:
var(data$x1) # variance

Standard deviation:

In [18]:
sd(data$x1) # standard deviation
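
To make the definitions concrete, here is a sketch that computes the variance and standard deviation by hand. Note that `var` and `sd` use the sample formulas, which divide by n - 1 rather than n. The vector `x1` is a copy of `data$x1`, and the `_by_hand` names are illustrative.

```r
x1 <- c(3, 4, 5, 4, 5, 5, 8, 7, 9, 9) # same values as data$x1

n   <- length(x1)
dev <- x1 - mean(x1) # deviation of each value from the mean

var_by_hand <- sum(dev^2) / (n - 1) # sample variance
sd_by_hand  <- sqrt(var_by_hand)    # sample standard deviation

all.equal(var_by_hand, var(x1)) # TRUE
all.equal(sd_by_hand, sd(x1))   # TRUE
```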

Quartiles:

In [19]:
quantile(data$x1) # quartiles

Percentiles:

In [20]:
quantile(data$x1, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)) # 10th, 20th, ..., 100th percentiles (deciles)

Inter-quartile range:

In [21]:
IQR(data$x1) # inter-quartile range = 75th percentile minus 25th percentile
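
The inter-quartile range can be reproduced from `quantile` directly. This sketch relies on the default interpolation method of `quantile` (`type = 7`), which is also what `IQR` uses by default. The vector `x1` is a copy of `data$x1`, and `q` and `iqr_by_hand` are illustrative names.

```r
x1 <- c(3, 4, 5, 4, 5, 5, 8, 7, 9, 9) # same values as data$x1

# 25th and 75th percentiles, without the "25%" / "75%" name labels
q <- quantile(x1, c(0.25, 0.75), names = FALSE)

iqr_by_hand <- q[2] - q[1]

all.equal(iqr_by_hand, IQR(x1)) # TRUE
```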

Normalized values:

In [22]:
scale(data$x1) # z-scores: (value - mean) / standard deviation, also known as normalized data

-1.328283
-0.8702544
-0.4122258
-0.8702544
-0.4122258
-0.4122258
0.9618601
0.5038315
1.4198887
1.4198887
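
As a sketch of what `scale` does with its default arguments, each value has the mean subtracted and is then divided by the sample standard deviation. The vector `x1` is a copy of `data$x1`, and `z` is an illustrative name.

```r
x1 <- c(3, 4, 5, 4, 5, 5, 8, 7, 9, 9) # same values as data$x1

z <- (x1 - mean(x1)) / sd(x1) # z-scores by hand

# scale() returns a one-column matrix with attributes; as.numeric() drops them
all.equal(z, as.numeric(scale(x1))) # TRUE

mean(z) # approximately 0
sd(z)   # 1
```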


## Descriptive Statistics about Two Variables

Covariance:

In [23]:
cov(data$x1, data$x2) # covariance

Correlation:

In [24]:
cor(data$x1, data$x2) # correlation
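
The two statistics are closely related: the (Pearson) correlation is the covariance divided by the product of the two standard deviations. A minimal sketch, using copies of `data$x1` and `data$x2` and illustrative `_by_hand` names:

```r
x1 <- c(3, 4, 5, 4, 5, 5, 8, 7, 9, 9)           # same values as data$x1
x2 <- c(25, 42, 53, 47, 51, 65, 81, 79, 95, 93) # same values as data$x2

n <- length(x1)

# Sample covariance: average product of deviations, dividing by n - 1
cov_by_hand <- sum((x1 - mean(x1)) * (x2 - mean(x2))) / (n - 1)

# Correlation: covariance rescaled to lie between -1 and 1
cor_by_hand <- cov_by_hand / (sd(x1) * sd(x2))

all.equal(cov_by_hand, cov(x1, x2)) # TRUE
all.equal(cor_by_hand, cor(x1, x2)) # TRUE
```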

## Descriptive Statistics about Three Variables

Correlation table:

In [25]:
cor(data[,c("x1","x2","x3")]) # correlations

,x1,x2,x3
x1,1.0,0.9736893,-0.8270139
x2,0.9736893,1.0,-0.8563633
x3,-0.8270139,-0.8563633,1.0


## Handling Missing Values

In [26]:
mean(data$x4)             # NA, because x4 contains missing values
mean(data$x4, na.rm=TRUE) # mean of the non-missing values only

In [27]:
cor(data$x1, data$x4) # NA, because x4 contains missing values
cor(data$x1[!is.na(data$x4)], data$x4[!is.na(data$x4)]) # drop the positions where x4 is missing from BOTH vectors
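
The same result can be had more compactly with the `use` argument of `cor`. With `use = "complete.obs"`, any position where either vector is missing is dropped before the correlation is computed. In this sketch `x1` and `x4` are copies of the dataset columns, and `keep`, `r_manual`, and `r_use` are illustrative names.

```r
x1 <- c(3, 4, 5, 4, 5, 5, 8, 7, 9, 9)            # same values as data$x1
x4 <- c(22, 41, NA, NA, 51, 63, 89, NA, 93, 97) # same values as data$x4

keep <- !is.na(x4) # TRUE where x4 is present

r_manual <- cor(x1[keep], x4[keep])          # filter both vectors by hand
r_use    <- cor(x1, x4, use = "complete.obs") # let cor() drop incomplete pairs

all.equal(r_manual, r_use) # TRUE
```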

In [28]:
cor(data[,c("x1","x2","x3","x4")]) # correlations involving x4 are NA because x4 contains missing values

,x1,x2,x3,x4
x1,1.0,0.9736893,-0.8270139,
x2,0.9736893,1.0,-0.8563633,
x3,-0.8270139,-0.8563633,1.0,
x4,,,,1.0
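
For a correlation table, the `use` argument of `cor` offers two common ways to fill in the missing entries. This is a sketch rather than a recommendation: `"complete.obs"` drops every row that has any missing value, while `"pairwise.complete.obs"` drops rows separately for each pair of variables. The data frame `df` is a copy of the numeric columns of `data`, and `complete` and `pairwise` are illustrative names.

```r
df <- data.frame(x1 = c(3, 4, 5, 4, 5, 5, 8, 7, 9, 9),
                 x2 = c(25, 42, 53, 47, 51, 65, 81, 79, 95, 93),
                 x3 = c(95, 93, 81, 79, 53, 47, 51, 65, 42, 25),
                 x4 = c(22, 41, NA, NA, 51, 63, 89, NA, 93, 97))

# Every entry uses only the 7 rows in which no variable is missing
complete <- cor(df, use = "complete.obs")

# Each entry uses all rows in which that particular pair is present,
# so x1 vs x2 still uses all 10 rows
pairwise <- cor(df, use = "pairwise.complete.obs")

complete
pairwise
```

With the pairwise option, different entries can be based on different subsets of rows, which is worth keeping in mind when comparing them.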


## Code

### Useful Functions

In [29]:
# help(as.numeric) # from base package
# help(class)      # from base package
# help(cor)        # from stats package
# help(cov)        # from stats package
# help(factor)     # from base package
# help(IQR)        # from stats package
# help(length)     # from base package
# help(max)        # from base package
# help(mean)       # from base package
# help(median)     # from stats package
# help(min)        # from base package
# help(names)      # from base package
# help(quantile)   # from stats package
# help(range)      # from base package
# help(scale)      # from base package
# help(sd)         # from stats package
# help(table)      # from base package
# help(unique)     # from base package
# help(var)        # from stats package



<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised January 25, 2021
</span>
</p>