<div style="text-align: center; font-size: 30px;">
Statistics Labs<br/>
</div>
<div style="text-align: center; font-size: 30px;">
Descriptive Statistics
</div>
<div style="text-align: center; font-size: 16px; font-style: italic">
Material prepared by M. Dolores Frías, Jesús Fernández, and Carmen M. Sordo, senior lecturers from the Department of Applied Mathematics and Computer Science at the University of Cantabria.
</div>

# Objectives

Applying a descriptive statistical analysis to a dataset is often the first step before considering more complex methods, as it allows summarizing information from the data while identifying potential patterns and unique characteristics.

In this exercise, we will learn how to perform a descriptive analysis of data using R. To achieve this, we will calculate certain statistical measures covered in class, summarize the data in frequency tables, and generate some of the most common types of graphs. It is important to remember that the applied methodology will vary depending on the nature of the studied data: qualitative, ordinal, discrete, or continuous.

Keep in mind that R is merely a tool that allows us to explore different datasets. To use it properly, we must understand which statistical methods to apply in each case and have the ability to analyze and interpret the obtained results.

We will focus mainly on one-variable (unidimensional) analysis, but we will also cover the essentials of analyzing two variables jointly.


# Frequency Tables

Tables are a widely used tool for obtaining summarized information from a dataset, as they show the frequency of each possible value (or category) that a variable can take. 

We will start by analyzing one-dimensional frequency tables and at the end of the section we will see how to construct two-dimensional tables.

## One-dimensional table

A complete frequency table for one variable includes information on absolute frequency, relative frequency, cumulative absolute frequency, and cumulative relative frequency. 

Remember that tables are constructed differently depending on the nature of the variable. If the variable is continuous or discrete with many possible values, then the data must be classified into intervals. Let's see how to obtain frequency tables in each case using R.

To begin, we will load the file *pulsations.rda*, available on the course webpage, which we will use in this exercise. This file contains records of 92 individuals across different variables. We can view its content using the command `head(pulsations)`. 

Some individuals in this sample were asked to run for a period, while others were not. The first column (*Pulse1*) contains each individual's heart rate at the beginning of the test, while the second column (*Pulse2*) records the heart rate at the end of the test. The third column (*Run*) indicates whether the individual was in the running group or not. The fourth column (*Smoke*) specifies whether the individual smokes or not, and the fifth column (*Sex*) indicates whether the individual is male or female. The sixth (*Height*) and seventh (*Weight*) columns record the individual's height (cm) and weight (kg), respectively. Finally, the eighth column (*Activity*) indicates the type of physical activity the individual usually engages in daily: none, low, medium, or high.

As always, we will first set the working directory using the `setwd` command. For example, if we have created the folder *C:/Desktop/Statistics/P1* on our computer and downloaded the necessary data files for this exercise, we should write and execute the following command in RStudio at the beginning of the program:
```R
setwd("C:/Desktop/Statistics/P1")

```

In [None]:
# Set working directory
setwd("data/") 

In [None]:
# Load the data
load("pulsations.rda") 
# Data frame structure
str(pulsations)
# First records of the data frame
head(pulsations)
attach(pulsations)

As we can see, the file contains variables of different types, as shown by the command `str(pulsations)`. 

Remember that the `attach` command allows us to treat the columns of a data frame as vectors, making it possible to apply any command directly to the variable name instead of referencing the data frame explicitly. For example, we can now execute `length(Height)` instead of writing `length(pulsations$Height)`. The command `detach` undoes the `attach` action. 
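
As a hedged aside, the same column access is possible without `attach`, either with `$` or with the `with` function. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from *pulsations.rda*):

```R
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Height = c(170, 182, 165),
                  Sex = factor(c("Female", "Male", "Female")))

n_dollar <- length(toy$Height)        # explicit reference to the data frame
n_with <- with(toy, length(Height))   # evaluate the expression inside the data frame
n_dollar
n_with
```

Both calls return the same value. Using `with` avoids leaving attached data frames behind, which can matter when several data frames share column names.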

Absolute frequencies for qualitative, semi-quantitative (factor), or discrete variables with a limited number of possible values can be obtained directly by applying the `table` function to the variable. This function generates a table where the possible values of the variable are displayed along with the number of times each value appears in the sample (absolute frequencies). 

We will apply this function to the *Sex* variable as follows:

In [None]:
table(Sex)

Since this is a qualitative variable, it only makes sense to obtain absolute and relative frequencies in the frequency table.

In [None]:
xi <- levels(Sex) # Categories
ni <- as.vector(table(Sex)) # Absolute frequency
fi <- ni/sum(ni) # Relative frequency
# Display the table in a formatted manner by creating a data frame.
data.frame(xi=xi, ni=ni, fi=fi)

The `as.vector` command used in the previous code allows us to extract only the absolute frequencies from the output of `table`, which we store in the variable `ni`, representing absolute frequency. 

Additionally, we can use the `levels` function to retrieve the categories of the *Sex* variable, as it is a qualitative variable defined as a factor in R. You can run `class(Sex)` to check the variable type.

We can also improve the table output by rounding the relative frequencies to two decimal places using the `round` function.

In [None]:
fi <- round(ni/sum(ni),2) # Relative frequency
# Display the data in a formatted manner by creating a data frame.
data.frame(xi=xi, ni=ni, fi=fi)

Analyzing the previous table, we see that the 57 men in the sample represent 62% of the total data.
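
As a possible shortcut, base R's `prop.table` function divides each count by the total, which is the same operation as `ni/sum(ni)`. A minimal sketch on a made-up factor (`sex_toy` is hypothetical, not the course data):

```R
# Hypothetical toy factor, used only for illustration
sex_toy <- factor(c("Male", "Female", "Male", "Male", "Female"))

ni_toy <- table(sex_toy)                 # absolute frequencies
fi_toy <- round(prop.table(ni_toy), 2)   # relative frequencies (ni / sum(ni))
data.frame(xi = names(ni_toy), ni = as.vector(ni_toy), fi = as.vector(fi_toy))
```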

Now, let's see how to obtain the frequency table for an ordinal variable such as *Activity*. In this case, it makes sense to analyze cumulative frequencies (*Ni*, *Fi*), as the categories of the variable indicate an order.

In [None]:
table(Activity)

Observing the result, we see that R, by default, orders the categories of this variable alphabetically (*High, Low, Medium*) when it should be arranged from least to most activity. 

To correctly interpret the cumulative frequencies, we must reorder these categories properly as follows:

In [None]:
activity.order <- ordered(Activity, levels=c( "Low", "Medium", "High"))
levels(activity.order) 

With this correction, we obtain the complete frequency table.

In [None]:
xi <- levels(activity.order) # Categories
ni <- as.vector(table(activity.order)) # Absolute frequency
Ni <- cumsum(ni) # Cumulative absolute frequency
fi <- round(ni/sum(ni),2) # Relative frequency
Fi <- round(cumsum(ni/sum(ni)),2) # Cumulative relative frequency (accumulate first, then round)
# Display the table in a formatted manner by creating a data frame.
data.frame(xi=xi, ni=ni, Ni=Ni, fi=fi, Fi=Fi) 

Once we have the full table, it is important to analyze the results, as it contains a lot of information about the sample. For example, in this case, we see that only 9 people (10% of the sample) have a low activity level. Additionally, we observe that 76% of individuals engaged in low or moderate physical activity.

The `table` function can also be applied to quantitative variables, returning the absolute frequencies for each value of the variable. For example:

In [None]:
ni <- as.vector(table(Pulse1))
xi <- sort(unique(Pulse1)) # Sort the possible values of the variable.
Ni <- cumsum(ni)
fi <- ni/sum(ni)
Fi <- cumsum(fi)
# Display the table in a formatted manner by creating a data frame.
data.frame(xi=xi, ni=ni, Ni=Ni, fi=fi, Fi=Fi)

We can see that the frequency table for the variable *Pulse1* does not effectively summarize the information, as there are many values with very low absolute frequencies (only 1 or 2 individuals with that pulse rate). 

As we already know, when dealing with continuous data or discrete data with many different values (as in the case of *Pulse1*), it is necessary to group them into classes. 

In our case, we will implement Sturges' rule in R to determine the number of classes to define. Then, we will use the `cut` function, which groups data into intervals, allowing us to specify the classes we want to create.
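
As a rough cross-check, Sturges' rule is commonly written as $1+\log_2 n$. The sketch below, which only assumes the sample size of 92 stated earlier, compares rounding this value to the nearest integer with the built-in `nclass.Sturges` function from the `grDevices` package, which rounds up instead:

```R
n_toy <- 92                                             # sample size stated for the file
k_nearest <- floor(3/2 + log2(n_toy))                   # 1 + log2(n), rounded to the nearest integer
k_ceiling <- grDevices::nclass.Sturges(seq_len(n_toy))  # ceiling(1 + log2(n))
c(k_nearest, k_ceiling)
```

For this sample size both versions suggest 8 classes, but they can differ for other values of n because they round differently.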

In [None]:
n <- length(Pulse1) # Sample size
# Sturges' criterion (1 + log2(n), rounded to the nearest integer). The floor function rounds the value down.
nclass <- floor(3/2+log(n)/log(2)); nclass 
# Range
range(Pulse1) 

In [None]:
# Interval limits
interval_boundaries <- seq(44, 105, 7)
# Classification of data into intervals.
data_class <- cut(Pulse1, breaks=interval_boundaries); data_class
# Absolute frequency
ni <- as.vector(table(data_class))

Observe the output of *data_class*, as it corresponds to the classified data in intervals. That is, each value of *Pulse1* has been assigned to one of the defined intervals.

From the *ni* vector, the remaining frequencies can be calculated to complete the frequency table using the following commands:

In [None]:
# Intervals and class mark
intervals <- levels(data_class)
xi <- interval_boundaries[1:nclass]+diff(interval_boundaries)/2
# Cumulative absolute frequency
Ni <- cumsum(ni)
# Relative frequency
fi <- ni / sum(ni)
# Cumulative relative frequency
Fi <- cumsum(fi)
# Display the table in a formatted manner by creating a data frame.
data.frame("(Li-1,Li]"=intervals,xi,ni,Ni,fi,Fi, check.names=FALSE)

By default, the `cut` function closes intervals on the right (see the variable `data_class`). This function has an argument that allows changing this option, which you can verify by checking the function's help documentation.
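
The argument in question is `right`. A related argument, `include.lowest`, matters when an observation coincides with the outermost boundary. A minimal sketch with made-up values (the vector and the breaks are hypothetical, chosen so that every value sits exactly on a boundary):

```R
x_toy <- c(44, 51, 58)        # hypothetical values sitting exactly on the boundaries
breaks_toy <- c(44, 51, 58)

right_closed <- cut(x_toy, breaks = breaks_toy)                        # (44,51] (51,58]
left_closed <- cut(x_toy, breaks = breaks_toy, right = FALSE)          # [44,51) [51,58)
with_lowest <- cut(x_toy, breaks = breaks_toy, include.lowest = TRUE)  # [44,51] (51,58]

as.character(right_closed)   # 44 falls outside every interval and becomes NA
as.character(left_closed)    # 58 falls outside every interval and becomes NA
as.character(with_lowest)    # no value is lost
```

Checking for `NA` values after calling `cut` is a simple way to confirm that the chosen boundaries cover the whole sample.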

We can improve the presentation of the table results by rounding the values of $f_i$ and $F_i$ to two decimal places using the `round` function as follows:

In [None]:
data.frame("(Li-1,Li]"=intervals,xi,ni,Ni,fi=round(fi,2),Fi=round(Fi,2), check.names=FALSE)

Among many other insights, we can see that the modal class corresponds to the interval (65,72], which has the highest absolute/relative frequency. Additionally, we observe that 57% of the individuals in the sample have a resting pulse rate of 72 or lower before engaging in physical activity.

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN <br></strong>

<br>
Using the data from the *pulsations.rda* file, answer the following questions:

1. Calculate the absolute frequency of women.
2. Obtain the percentage of smokers.
3. Calculate the percentage of women who are smokers.
4. Determine the relative frequency of individuals who are shorter than 183 cm.
5. Determine the relative frequency of individuals who are women.
6. Determine the relative frequency of men whose usual sports practice is low.
7. Determine the relative frequency of men who smoke.
8. Among women, determine the relative frequency of those who are shorter than 170 cm.
9. Determine the relative frequency of individuals whose resting pulse rate (*Pulse1*) is higher than 84.
10. Determine the relative frequency of individuals whose resting pulse rate is higher than 84 and whose usual sports practice is not high.
11. Determine the relative frequency of individuals with a resting pulse rate higher than 84, who smoke, and are women.
12. Among smokers, determine the relative frequency of those whose *Pulse1* is higher than 84.
</div>

## Two-dimensional table

We will now focus on how to build two-dimensional tables with R. As an example, we construct the two-way table for two non-numeric variables from the pulsations file, *Activity* and *Smoke*.

In [None]:
table("Activity"=activity.order, "Smoke"=Smoke)

From the results of the table, we can see, for example, that 16 individuals who engage in high levels of physical activity are non-smokers. 

The values shown are absolute frequencies, but we could obtain the relative frequencies simply by dividing by the sample size. Those 16 individuals represent 17.4% of the sample.

In [None]:
round(table("Activity"=activity.order, "Smoke"=Smoke)/n,3)

As in the unidimensional case, if we want to group data from continuous or discrete variables with many possible values, we must first classify the values into intervals using the `cut` function. This function allows us to define the range of each interval and specify whether the intervals should be closed on the right or the left.

Thus, knowing that Sturges' rule suggests defining 8 intervals, the contingency table for the variables *Pulse1* and *Pulse2* can be obtained as follows:

In [None]:
range(Pulse1)
range(Pulse2)
# Two dimensional table
table("Pulse 1"=cut(Pulse1, seq(44, 105, 7)), "Pulse 2"=cut(Pulse2, seq(50, 150, 12), right=FALSE))

We observe, for instance, that there are 16 individuals with a resting pulse rate in the interval (65,72] and a pulse after the activity in the interval [62,74).
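
Beyond joint frequencies, it is often useful to condition on one of the variables. A minimal sketch, assuming two made-up factors (`act_toy` and `smoke_toy` are hypothetical, not the course data), uses `addmargins` for the totals and `prop.table` with the `margin` argument for row-wise relative frequencies:

```R
# Hypothetical toy factors, used only for illustration
act_toy <- factor(c("Low", "High", "High", "Medium", "Medium", "Medium"),
                  levels = c("Low", "Medium", "High"))
smoke_toy <- factor(c("Yes", "No", "No", "No", "Yes", "No"))

tab_toy <- table(Activity = act_toy, Smoke = smoke_toy)
addmargins(tab_toy)                           # absolute frequencies with row and column totals
row_prop <- prop.table(tab_toy, margin = 1)   # relative frequencies within each activity level
round(row_prop, 2)
```

With `margin = 1` each row adds up to 1, and with `margin = 2` each column does.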

# Statistics

There are several functions that allow us to summarize information from a dataset, with each function providing insight into a different aspect of the sample in a single value.

In this exercise, we will learn how to compute different statistical measures in R, focusing on a single variable. We will continue working with the data from the *pulsations.rda* file.

The `summary` command provides a brief descriptive analysis for each variable contained in this data frame. Note that statistics exist only for quantitative variables. For them, this function displays values corresponding to the maximum, minimum, quartiles, and mean. Meanwhile, for qualitative or ordinal variables (referred to as factors in R), it only shows the absolute frequency of each category in which the variable is classified (along with the count of missing values, if any).

In [None]:
summary(pulsations)

If the dataset contains many variables, the output of `summary` can be long and difficult to read on the screen.

We can also filter individuals from the sample who meet a specific condition. For example, in this case, we compute the basic statistics of the *Height* variable for individuals who smoke:

In [None]:
summary(Height) # Brief descriptive analysis for the variable Height
HeightSmoke <- Height[Smoke=="Yes"]
summary(HeightSmoke) # Basic statistics of the Height variable for individuals who smoke

For example, we can see that the range of heights among smokers is narrower than in the full sample.
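
Conditions can also be combined with `&` (and) or `|` (or), and the mean of a logical vector gives the relative frequency of the cases that satisfy the condition. A small sketch with made-up vectors (hypothetical values, not the course data):

```R
# Hypothetical toy vectors, used only for illustration
height_toy <- c(160, 172, 181, 168, 190, 175)
smoker_toy <- c("Yes", "No", "Yes", "No", "No", "Yes")

tall_smoker <- height_toy > 170 & smoker_toy == "Yes"  # logical vector, one entry per individual
sum(tall_smoker)                                        # absolute frequency
mean(tall_smoker)                                       # relative frequency over the whole sample
mean(height_toy[smoker_toy == "Yes"] > 170)             # relative frequency among smokers only
```

The last two lines differ only in the reference group: the whole sample in the first case, the smokers in the second.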

Now, let's outline the functions that compute the most important statistical measures. As we know, these measures are divided into four major groups: location, dispersion, position, and shape.

## Location Statistics

Location statistics, also known as measures of central tendency, indicate the value around which the data cluster. The most important ones are the mean, the median, and the mode.

### Mean
The arithmetic mean is calculated using the `mean` function:

In [None]:
mean(Height, na.rm=TRUE)

The argument `na.rm=TRUE` is optional and is used to indicate that missing data should not be considered when calculating the statistic. This option is available in many other R functions.

Calculate the mean of the vector `c(2,2,2,NA)` both with and without the `na.rm=TRUE` option to observe the difference in the result. The value `NA` stands for *Not Available*, meaning that no data is recorded for that entry. 

In some cases, a dataset may contain the value `NaN`, which stands for *Not a Number*. The `NaN` value arises from operations such as `0/0`. The `na.rm=TRUE` option excludes both cases from the mean calculation.
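
The difference between the two kinds of missing value can be inspected with `is.na` and `is.nan`. A minimal sketch on a made-up vector (`x_na` is hypothetical):

```R
x_na <- c(2, 2, 2, NA, 0/0)           # 0/0 produces NaN

is.na(x_na)                           # TRUE for both NA and NaN
is.nan(x_na)                          # TRUE only for NaN
m_raw <- mean(x_na)                   # missing result, because missing values propagate
m_clean <- mean(x_na, na.rm = TRUE)   # both NA and NaN are dropped before averaging
m_raw
m_clean
```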

### Median 

This statistic can be computed using two functions: the `quantile` function, specifying the quantile order (0.5 in this case), or using the `median` function:

In [None]:
quantile(Height, c(0.5))
median(Height)

### Mode

Base R has no function that computes the statistical mode of a sample (the existing `mode` function returns the storage type of an object), so we can write our own. The function should take a vector as input and return the mode or modes as output. For example:

In [None]:
compute.mode <- function(x) {
  return(as.numeric(names(which(table(x)==max(table(x))))))
}

Let's check the use of this function for the *Pulse1* variable:

In [None]:
compute.mode(Pulse1)
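
Because of the `as.numeric` conversion, the function above is meant for numeric variables. A possible variant for qualitative variables (`compute.mode.any` is a hypothetical name introduced here) returns the labels as character strings instead:

```R
compute.mode.any <- function(x) {
  freq <- table(x)                  # absolute frequencies, computed once
  names(freq)[freq == max(freq)]    # label(s) with the highest frequency
}

mode_chr <- compute.mode.any(c("Low", "Medium", "Medium", "High"))
mode_num <- compute.mode.any(c(70, 72, 72, 70, 68))   # two modes, returned as character strings
mode_chr
mode_num
```

The price of this generality is that numeric modes come back as text and need `as.numeric` before further calculations.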

## Dispersion Statistics

This group of statistics provides information about the spread of the data.

### Quasi-variance and Quasi-standard deviation

It is important to note that the `var` and `sd` functions in R compute the **quasi-variance** and **quasi-standard deviation**, respectively. Specifically, these functions calculate the following formulas:

<div><img alt="" src="./figuras/formula_cuasivar_cuasdesv.png" width="400"/></div>

Variance and standard deviation can be obtained from these results by recalling that:
<div><img alt="" src="./figuras/formula_var.png" width="150"/></div>

Create two functions that directly compute variance and standard deviation.

### Range or Amplitude

The range is the difference between the maximum and minimum values. In R, this requires combining two commands:

In [None]:
diff(range(Height))

If this calculation will be used several times, it is better to define a function that performs it, which we can call, for example, `range_value`:

In [None]:
range_value <- function(x){diff(range(x))}
range_value(Height)

### Interquartile Range (IQR)

The interquartile range quantifies the dispersion of the central 50% of the data after sorting them from lowest to highest. To calculate this statistic, we use the `IQR` function:

In [None]:
IQR(Height)

### Coefficient of Variation

There is no built-in function in R's base package to compute this statistic, but we can define the function `CV` in its simplest form by dividing the sample standard deviation by the mean:

In [None]:
CV <- function(x) {sd(x)/mean(x)}
CV(Height)

or, to obtain a numerical value even when there are missing data,

In [None]:
CV <- function(x) {sd(x, na.rm=TRUE) / mean(x, na.rm=TRUE)}
CV(c(Height, NA))

## Measures of Position

These measures indicate the value of the variable that occupies a specific position in an ordered distribution, from smallest to largest.

### Maximum and Minimum

In R, they are calculated using the `max` and `min` functions.

In [None]:
max(Height)
min(Height)
range(Height) # Show both values with just one command.

### Quantiles

As mentioned earlier, any quantile can be calculated using the `quantile` function.

In [None]:
quantile(Height, c(0.25,0.5,0.75)) # Quartiles

In [None]:
quantile(Height, c(1/3,2/3)) # Terciles

Check the help documentation for the `quantile` function, as it includes 9 different algorithms for calculating quantiles based on the `type` option.
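
To see that the choice of algorithm can matter, the sketch below compares the default (`type=7`, which interpolates linearly) with `type=1` (the inverse of the empirical distribution function) on a small made-up sample (`x_q` is hypothetical):

```R
x_q <- c(1, 2, 3, 4)                       # hypothetical sample with an even number of values
q_default <- quantile(x_q, 0.5)            # type = 7 (default): interpolates between 2 and 3
q_type1 <- quantile(x_q, 0.5, type = 1)    # type = 1: always returns an observed value
unname(c(q_default, q_type1))
```

With larger samples the differences between types are usually small, but they can explain discrepancies between R and hand calculations or other software.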

## Shape Statistics

### Skewness and Kurtosis

R's base package does not include functions to compute skewness or kurtosis. However, we can install the `moments` package, which provides these functions (`kurtosis`, `skewness`), or alternatively, we can implement our own functions to compute these statistics based on the formulas covered in class.

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN</strong>

- Create the functions *kurtosis* and *skewness* to compute kurtosis and skewness, respectively.

- Answer the following questions using the dataset from *pulsations.rda*:
  1. Interquartile range of *Weight*.
  2. Variance of *Weight*.
  3. Is the distribution of *Weight* symmetric?
  4. How many men are in the sample?
  5. Is the distribution of *Weight* platykurtic, mesokurtic, or leptokurtic?
  6. Which of the two variables, *Weight* or *Height*, shows greater dispersion?
  7. Compute the kurtosis of the *Weight* variable for non-smokers.
  8. Compute the quasi-standard deviation of the height of the women.

<br>

- Throughout the year, your monthly mobile phone bills were:

     23, 33, 25, 45, 10, 28, 39, 27, 15, 38, 34, 29

    1. How much did you spend in total over the year?
    2. What was the minimum expense?
    3. What was the maximum expense?
    4. In which months was the expense lower than the average expense?
    5. Is there a lot of variation in expenses from month to month? How would you estimate this variation?

<br>

- The data file *santander.txt* contains monthly average temperature records in Santander from 1950 to 2003. The task is:
    1. Compute the median, mean, and standard deviation of the monthly average temperatures for January and July between the years 1950 and 1980. Comment on the differences.
    2. Determine the skewness and kurtosis coefficients of the monthly average temperatures for June, July, and August between 1950 and 1980. In which case does the series show the greatest asymmetry?
</div>

# Graphical Representations

Plots are tools to visualize the distribution of a variable. They allow us to capture the main characteristics of a dataset quickly and with little effort, which makes them a complementary and crucial tool for conducting a descriptive statistical analysis of a data sample.

We describe here only the graphs of interest for our course, although many others exist.

## Unidimensional Graphical Representations

First, we analyze plots that consider just one variable: pie charts, bar plots, histograms, and box plots.

### Pie Chart

A pie chart is recommended for qualitative variables. This type of representation displays the relative sizes of absolute or relative frequencies for each category of the variable. 

In cases where categories have similar frequencies, this may not be the best way to present the information, as it can be difficult for the eye to distinguish relative areas.

The command in R to create a pie chart is `pie`:

In [None]:
pie(table(Smoke), main="Smoke")

Note that to create a pie chart, you first need to calculate the absolute frequencies of the variable using the `table` function.

In this case, with only two categories, the chart clearly shows that the number of non-smokers is substantially higher than the number of smokers.

The appearance of this type of chart can be improved using additional parameters in the `pie` function. Some options include `main` to add a title, `col` to define colors, `labels` to add labels, etc. Check the function's help documentation for more details.

In [None]:
# Percentage of each category, rounded to the nearest integer
pct <- round(100*prop.table(table(Smoke)))
pie(table(Smoke), labels=paste0(names(pct), " (", pct, "%)"),
                  main="Tobacco consumption") 

As shown below, a bar plot is an easier type of graph to analyze, as the human eye is better at distinguishing linear measurements than relative sizes.

### Bar Plot

A bar chart is a tool that allows us to visualize the distribution of qualitative, semi-quantitative, or discrete variables with a small number of possible values. The command to generate this type of chart in R is `barplot`.

In [None]:
barplot(table(Smoke))

Again, within the function, it is necessary to calculate the absolute frequencies of the variable using the `table` command. 

In this case, we have represented the absolute frequency, but it is also possible to display the relative frequency. 
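
A minimal sketch of the relative-frequency version, assuming a made-up factor (`smoke_bar` is hypothetical, not the course data), passes the output of `prop.table` to `barplot`:

```R
smoke_bar <- factor(c("No", "No", "Yes", "No", "Yes"))   # hypothetical data
fi_bar <- prop.table(table(smoke_bar))                   # relative frequencies
bp <- barplot(fi_bar, ylab = "fi", ylim = c(0, 1), main = "Relative frequencies")
# Write each relative frequency slightly above its bar
text(x = bp, y = as.vector(fi_bar) + 0.05, labels = round(as.vector(fi_bar), 2))
```

The shape of the chart is the same as with absolute frequencies; only the scale of the vertical axis changes.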

If we compare the obtained bar chart with the previous pie chart, we can see that it is much easier for the eye to determine the frequency for each category (smokers and non-smokers) in the bar chart than in the pie chart. 

Additionally, this type of chart can be optimized by adding more arguments to the function. Some of the most commonly used options include:
- `main` to add a title to the chart,
- `xlab` and `ylab` to provide labels for the axes,
- `xlim` and `ylim` to specify a range of values for the axes,
- `col` to define the color of the bars, etc.

Check the function’s help documentation to explore all available options. Here is the improved chart, making it easier to interpret the results as all details of the analyzed variable are displayed.

In [None]:
bplot_smoke <- barplot(table(Smoke), ylab="ni", ylim=c(0,70), main="Tobacco consumption", 
                       col="lightblue")
text(x = bplot_smoke, y=as.vector(table(Smoke))+2, labels = as.vector(table(Smoke)))

Additionally, we have added a label displaying the absolute frequency of each category on each bar using the `text` function. We can observe that 64 individuals in the sample do not smoke, compared to 28 who do.

Let's analyze another example. In this case, we generate the bar plot for the ordinal variable *Activity*:

In [None]:
barplot(table(Activity), xlab="Activity", ylab="ni")   

Again, if we consider the original variable, we observe that the order of the categories in this chart is incorrect. Since this is an ordinal variable, the categories should be arranged from lower to higher activity levels. 

We encountered this same issue when calculating the frequency table for this variable. Just as we did then, we should use the ordered variable we previously created:

In [None]:
bplot_act <- barplot(table(activity.order), ylab="ni", xlab="Activity", ylim=c(0,70)) 
text(x = bplot_act, y=as.vector(table(activity.order))+2, labels = as.vector(table(activity.order)))

### Histogram

A histogram is a visual representation of the frequency distribution of a sample. It allows us to identify the values around which a relatively large portion of the data is clustered and where fewer data points are located. 

This type of graph is used to represent the distribution of quantitative variables grouped into class intervals, which explains why the bars are adjacent to each other. 

In R, the function `hist` is used to generate a histogram from a dataset.

In [None]:
hist(Height)

This function accepts various arguments to optimize the histogram. For example, we can manually define the intervals or classes of the histogram, or choose one of the available methods to determine the number of classes, including Sturges' rule. 

Additionally, we can specify whether the interval should be closed on the right `(,]` or on the left `[,)` using the argument `right=TRUE` or `right=FALSE`.

As with the previously discussed charts, it is highly recommended to include a title and axis labels in the graph to clearly indicate the variables being represented. 

Thus, we can improve the previous histogram of the *Height* variable as follows:

In [None]:
# Establish the breaks for the intervals
interval_boundaries <- seq(153,193,5); interval_boundaries
# Histogram
hist(Height, breaks=interval_boundaries, col="darkgray", main="Histogram of Height", 
     xlim=c(150, 200), xlab="Height (cm)", ylab="ni")

From the obtained histogram, we can see that the heights of most individuals in the study are concentrated around 175 cm, with very few individuals below 158 cm or above 188 cm. The shape of the histogram also indicates that the data distribution is left-skewed.

By default, the `hist` function represents absolute frequency, but we can determine the probability of a given range of values in the sample by displaying the probability density instead of the absolute value. To do this, we set the argument `freq` to `FALSE`. In this case, the total area of the histogram equals 1. 

Refer to the function’s help documentation for more details.

In [None]:
hist(Height, freq=FALSE, breaks=interval_boundaries, col="darkgray", 
     main="Probability density", xlim=c(150, 200), xlab="Height (cm)", 
     ylab="Density")

Observe how the values on the vertical axis change in this case. For example, from this new histogram, we can see that around 2% of individuals in this sample have a height between 158 and 163 cm.
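
When reading a density histogram, keep in mind that the proportion of observations in a class is the area of its bar (density times class width), not its height alone. A minimal sketch with simulated heights (the data below are randomly generated, not the course data):

```R
set.seed(1)                                       # reproducible simulated data
heights_sim <- rnorm(92, mean = 173, sd = 9)      # hypothetical heights in cm
h_sim <- hist(heights_sim, plot = FALSE)          # compute the histogram without drawing it

prop_sim <- h_sim$density * diff(h_sim$breaks)    # bar area = density x class width
sum(prop_sim)                                     # the areas add up to 1
all.equal(prop_sim, h_sim$counts / sum(h_sim$counts))  # same as the relative frequencies
```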

Additionally, the `hist` function allows us to store its output in a variable, enabling further calculations. One example of using this stored output is to generate a frequency table for the variable with fewer lines of code compared to the approach we implemented in the *Frequency Tables* section.

In [None]:
h <- hist(Height, breaks=interval_boundaries, plot=FALSE); h

Note that the histogram object created (`h`) is a list containing six components:
- `breaks` stores the interval boundaries.
- `counts` holds the number of observations in each interval, representing the absolute frequencies.
- `density` stores the probability density values for each interval.
- `mids` contains the class midpoints.
- `xname` saves the variable name.
- `equidist` is a logical value indicating whether the intervals are equally spaced.

In this case, we will use `mids`, `breaks`, and `counts` to complete the table of frequencies for the *Height* variable.

In [None]:
# Number of intervals (named nint to avoid reusing the name of R's c() function)
nint <- length(h$mids)
# Table of frequencies
data.frame("(Li-1"=h$breaks[1:nint], "Li]"=h$breaks[2:(nint+1)], xi=h$mids,
            ni=h$counts, check.names=FALSE)

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN<br></strong>

<br>

Using the data from the *pulsations.rda* file, plot the histogram for the *Weight* variable. What do you observe? 

Overlay the histograms for men (in red) and women (in blue) on top of the overall histogram. 

**Note:** To add a histogram to an already displayed plot, use the argument `add=TRUE`.
</div>

### Box Plot

Box plots are very useful for representing the distribution of quantitative variables. In R, they are generated using the `boxplot` function. 

For example, if we want to create a box plot for the *Height* variable, we should execute the following command:

In [None]:
boxplot(Height, ylab="Height (cm)", main="Box plot of Height")

In this case, we have added a title to the figure using the `main` argument.

The `boxplot` function also allows us to create box plots for a specific variable classified by groups. For example, we can generate a box plot for the *Height* variable, separated by *Sex*, using the following command:

In [None]:
boxplot(Height~Sex, xlab="Sex", ylab="Height (cm)", main="Box plot of Height by Sex")

From the previous figure, we can observe considerable differences in height between men and women. For example, the lowest height among men in this sample corresponds to the median height of women, meaning that 50% of women are shorter than the shortest man in the sample.

If a variable contains outliers, these will also appear in the box plot as circles above the upper whisker or below the lower whisker.
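
By default, `boxplot` flags as outliers the values lying more than 1.5 times the box length beyond the box (argument `range=1.5`). The sketch below uses `boxplot.stats` on made-up heights (`heights_out` is hypothetical) to list those values without drawing the plot:

```R
heights_out <- c(160, 165, 168, 170, 172, 175, 178, 210)  # hypothetical heights; 210 is unusually large
stats_out <- boxplot.stats(heights_out)   # same rule as boxplot, with coef = 1.5 by default
stats_out$stats                           # lower whisker, lower hinge, median, upper hinge, upper whisker
stats_out$out                             # values flagged as outliers
```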

In some cases, it is useful to display two figures in a single window to facilitate comparison. For instance, we might be interested in comparing the box plots of the *Height* variable classified by sex and by smoking status.

In [None]:
# Split the graphics window into two sections
par(mfrow=c(1,2), mar=c(5,4,1,1))
boxplot(Height~Sex, ylab="Height (cm)")
boxplot(Height~Smoke, ylab="Height (cm)")

The `mfrow` argument used in the `par` function specifies the number of figures and the order in which they will be arranged in the graphics window. In this example, *c(1,2)* sets up 1 row and 2 columns.

The `mar` parameter controls the margin size in a clockwise direction, starting from the bottom of the graphics window. Modify these numbers to see the differences. 

Many of the graphical functions we have seen also allow adding a new plot to an existing one. To do this, we must include the option `add=TRUE` (or the abbreviated form `add=T`) in the R commands we execute.

In [None]:
par(mfrow=c(1,1)) # Reset the graphics window in RStudio to display a single panel
hist(Height, main="Histogram of Height", xlab="Height (cm)", ylab="ni")
boxplot(Height, add=T, horizontal=T, boxwex=5, border="red",col="pink", lwd=3)

In this case, the histogram of the *Height* variable has been drawn with a box plot overlaid on it. The combination of both figures allows for a more detailed statistical analysis of the dataset, providing insights into aspects such as the interquartile range, skewness, and more.

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN<br></strong>

<br>


Using the data from the *pulsations.rda* file:

1. Create a box plot for the *Pulse1* variable. By analyzing the graph, estimate the approximate values for the minimum pulse, maximum pulse, first quartile, median, and third quartile.
2. Compute a new variable that represents the difference between *Pulse2* and *Pulse1*, then create a box plot of this new variable against *Run*.
3. Group the data by physical activity level (*Activity*) and represent the pulse difference for individuals who ran.
</div>

## Two-dimensional Graphical Representations

We can analyze the relationship between two variables through a plot. Here we will focus on the scatter plot.

### Scatter plot

Scatter plots visualize the relationship between two numeric variables, where one variable is displayed on the x-axis, and the other variable is displayed on the y-axis. Each point represents an observation, with its position determined by the values of the two variables. Scatter plots are useful for identifying trends, correlations, clusters, and potential outliers in data.

In R we can obtain this graph using the `plot` function. As an example, we will study the relationship between the variables *Height* and *Weight* from the file *pulsations.rda*:

In [None]:
plot(Height, Weight, main="Scatter plot", pch=19, xlab="Height (cm)", ylab="Weight (kg)")

We can improve the previous representation using the `scatterplot` function from the `car` library, which we need to install (this process is done only once) and load with the `library` function. Remember that we should load this library every time we start R and want to use a function from this package:

In [None]:
install.packages("car")
library(car)
scatterplot(Weight~Height, smooth=FALSE)

The `scatterplot` function draws the least-squares regression line that best fits the data points. It also plots the box plot of each variable along its corresponding axis. Consult the help documentation of this function to see all the options it offers.

From the figure, it is evident that there is a direct linear relationship between both variables. It is linear because the data points cluster around a line, and it’s direct because when one variable increases, the other also increases (positive slope of the line). 

To quantify this linear relationship between the two variables, we can calculate the coefficient of linear correlation. The `cor.test` or `cor` functions provide this value:

In [None]:
cor(Height, Weight)

By default, it calculates the Pearson correlation coefficient, although, as indicated in the documentation of this function, it can also calculate the Kendall or Spearman correlation coefficients.

As expected, the coefficient of linear correlation is relatively high and positive, confirming the direct linear relationship between the variables observed in the scatter plot.
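
To illustrate the difference between the methods, the sketch below uses two made-up vectors (`x_cor` and `y_cor` are hypothetical) whose relationship is increasing but not linear. The Pearson coefficient stays below 1, while the rank-based Spearman coefficient reaches 1:

```R
x_cor <- c(1, 2, 3, 4, 5)
y_cor <- x_cor^2                                       # increasing but not linear in x_cor

r_pearson <- cor(x_cor, y_cor)                         # Pearson (default)
r_spearman <- cor(x_cor, y_cor, method = "spearman")   # based on ranks only
ct <- cor.test(x_cor, y_cor)                           # Pearson coefficient plus a significance test
c(r_pearson, r_spearman, unname(ct$estimate))
```

The `estimate` component of `cor.test` holds the same coefficient that `cor` returns. The rest of its output (p-value, confidence interval) belongs to inferential statistics.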

<div class="alert alert-block alert-info">
<strong>PRACTICE ON YOUR OWN<br></strong>

<br>

Make a scatter plot of *Height* versus *Weight* separated by *Sex* and answer the following questions:
- Does height increase in the same proportion as weight in men and women?
- Which of the two groups has a higher correlation value? Indicate the value in each case.
  
Note that it is necessary to filter the data by sex before calculating the correlation coefficient.
</div>