```{r include=FALSE}
library(knitr)
library(ggplot2)
```


# Introduction

## Overview

This presentation demonstrates *Bivariate Analysis* on a dataset of nations, used to infer relationships between the following three features:

::: {.incremental}
- **Logarithm of GDP per Capita**: Natural logarithm (base $e$) of Gross Domestic Product per person. [*lngdp*^[GDP per capita is used.]]
- **Basic Sanitation Access**: Percentage of people using at least basic sanitation facilities. [*snt*]
- **Life Expectancy**: The average number of years a newborn child is expected to live. [*lfx*]
:::

## Data


```{r echo=TRUE, collapse=TRUE}
script.dir <- getSrcDirectory(function(x) {x})
setwd(script.dir)

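# Convert strings such as "12.3k" to numbers (the suffix k means thousand).
# Values that cannot be parsed become NA.
numerise = function(x){
  x <- as.character(x)
  is_k <- grepl("k$", x)
  x[is_k] <- as.numeric(sub("k$", "", x[is_k]))*10^3
  x <- as.numeric(x)
  return(x)
}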

d1_raw = read.csv("./Data/gdp.csv")
d2_raw = read.csv("./Data/sanitation.csv")
d3_raw = read.csv("./Data/life_expectancy.csv")

yearname = "X2010"

d1 = d1_raw[!is.na(numerise(d1_raw[, yearname])),][,c("country", yearname)]
colnames(d1)[2] = "lngdp"
d2 = d2_raw[!is.na(numerise(d2_raw[, yearname])),][,c("country", yearname)]
colnames(d2)[2] = "snt"
d3 = d3_raw[!is.na(numerise(d3_raw[, yearname])),][,c("country", yearname)]
colnames(d3)[2] = "lfx"

dtemp = merge(x = d1, y = d2, by = "country")
d = merge(x = dtemp, y = d3, by = "country")

d$lngdp = log(numerise(d$lngdp))

write.csv(d, "./Data/assembled.csv", row.names = FALSE)

kable(head(d, 5L))
```
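
The `k` suffix handling can be checked on a few hand-written strings. The sketch below uses a hypothetical stand-alone helper, `to_number`, that mirrors the idea behind `numerise`; the input values are made up for illustration and are not taken from the data files.

```{r echo=TRUE}
# Hypothetical helper (an assumption for illustration): parse "1.5k" as 1500.
to_number = function(x) {
  x = as.character(x)
  has_k = grepl("k$", x)                                 # entries ending in "k"
  out = suppressWarnings(as.numeric(sub("k$", "", x)))   # unparsable -> NA
  out[has_k] = out[has_k] * 1e3
  out
}

to_number(c("850", "1.5k", "n/a"))
```

Entries that cannot be parsed come back as `NA`, which is what lets the rows be filtered with `!is.na(...)`.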


::: aside
<sub>FREE DATA FROM [UN](https://un.org), [WORLD BANK](https://worldbank.org), [WHO](https://who.org), [IHME](http://www.healthdata.org/) VIA [GAPMINDER.ORG](https://gapminder.org), [CC-BY LICENSE](https://creativecommons.org/licenses/by/2.0/).</sub>
:::

# Univariate Statistics

## Mean and Standard Deviation {.smaller}

*Sample Mean* $\bar{x}$ is a measure of central tendency of a random variable $x$.

*Standard deviation* $s_x$ is a measure of dispersion in random variable $x$.

Note that $x_i$ is the i<sup>th</sup> observation of the random variable $x$. 

$$
\begin{align}
\bar{x} = {\frac{\sum _{i=1}^{n}(x_{i})}{n}}
&&
s_x = \sqrt{\frac{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}{n - 1}}
\end{align}
$$


```{r echo=TRUE}
# Summary statistics of each variable (one row per variable).
d_stats = data.frame(
  row.names = "Variable",
  Variable = c(
    "*ln(GDP)*",
    "*Sanitation*",
    "*Life Exp.*"
  ),
  Mean = c(
    mean(d$lngdp),
    mean(d$snt),
    mean(d$lfx)
  ),
  SD = c(
    sd(d$lngdp),
    sd(d$snt),
    sd(d$lfx)
  )
)

kable(
  d_stats,
  col.names = c(
    "Mean $\\bar{x}$",
    "Standard Deviation $s_x$"
  ),
  digits=5
)
```
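
As a quick sanity check, the two statistics can be computed by hand and compared with R's built-ins. This is a minimal sketch on made-up toy values (`x_demo` is not part of the dataset); note that R's `sd()` divides by $n - 1$ rather than $n$.

```{r echo=TRUE}
# Toy data (hypothetical values, for illustration only).
x_demo = c(2, 4, 4, 4, 5, 5, 7, 9)
n_demo = length(x_demo)

mean_by_hand = sum(x_demo) / n_demo
# sd() uses the n - 1 denominator, so the hand computation does too.
sd_by_hand = sqrt(sum((x_demo - mean_by_hand)^2) / (n_demo - 1))

all.equal(mean_by_hand, mean(x_demo))
all.equal(sd_by_hand, sd(x_demo))
```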


# Scatter Plot

A *scatter plot* uses a Cartesian coordinate system to display the values of two variables for a set of data. The data are displayed as a collection of points, with one variable determining the *abscissa* and the other determining the *ordinate*. It helps us:

- take a quick look at how two variables relate.
- suggest the kind of correlation between the variables.
- estimate the direction of the correlation.

## Sanitation vs. GDP per Capita

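```{r echo=TRUE}
# Scatter plot of y_map against x_map. When no axis label is given, the
# expression supplied by the caller (e.g. "d$lngdp") is used as the label.
sctrplot = function(d, x_map, y_map, x_lab=NULL, y_lab=NULL){
  if (is.null(x_lab)) x_lab = deparse(substitute(x_map))
  if (is.null(y_lab)) y_lab = deparse(substitute(y_map))
  ggplot(d, mapping = aes(x = x_map, y = y_map))+
    geom_point()+
    labs(x = x_lab, y = y_lab)+
    theme_minimal()+
    theme(
      plot.background = element_rect(fill = "#f9f5d7"),
      panel.grid = element_line(colour = "#d5c4a1"),
      axis.line = element_line(colour = "#928374")
      )
}

sctrplot(d, d$lngdp, d$snt, "ln(GDP per capita)", "Basic sanitation access (%)")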
```


## Life Expectancy vs. GDP per Capita

```{r echo=TRUE}
sctrplot(d, d$lngdp, d$lfx, "ln(GDP per capita)", "Life expectancy (years)")
```


## Life Expectancy vs. Sanitation

```{r echo=TRUE}
sctrplot(d, d$snt, d$lfx, "Basic sanitation access (%)", "Life expectancy (years)")
```


# Bivariate Statistics

## Covariance and Correlation Matrices {.smaller}

*Covariance* $\operatorname{cov}(x, y)$ is a measure of the joint variability of two random variables $x$, $y$.

*Correlation* is a statistical relationship, causal or spurious, between two random variables $x$, $y$. Its strength and direction are measured here by *Pearson's* correlation coefficient $r_{x,y}$, which ranges from $-1$ to $1$.

$$
\begin{align}
\operatorname{cov}(x,y) = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n - 1}
&&
r_{x,y} = \frac{\operatorname{cov}(x,y)}{s_x s_y}
\end{align}
$$

:::: {.columns}

::: {.column width="50%"}


```{r echo=TRUE}
cov_mat = cov(d[, 2:4])

kable(cov_mat, digits=3)
```


$$A_{i,j} = \operatorname{cov}(x_i, x_j)$$
:::

::: {.column width="50%"}


```{r echo=TRUE}
cor_mat = cor(d[, 2:4])

kable(cor_mat, digits=3)
```


$$A_{i,j} = r_{x_i, x_j}$$
:::

::::
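
The entries of both matrices can be reproduced by hand for a single pair of variables. The sketch below uses made-up toy vectors (`x_demo` and `y_demo` are assumptions for illustration, not columns of the dataset) and relies on the fact that R's `cov()` and `sd()` both divide by $n - 1$.

```{r echo=TRUE}
# Toy data (hypothetical values, for illustration only).
x_demo = c(1, 2, 3, 4, 5)
y_demo = c(2, 1, 4, 3, 6)
n_demo = length(x_demo)

# Sample covariance with the n - 1 denominator, as used by cov().
cov_by_hand = sum((x_demo - mean(x_demo)) * (y_demo - mean(y_demo))) / (n_demo - 1)
# Pearson's r: covariance scaled by both standard deviations.
r_by_hand = cov_by_hand / (sd(x_demo) * sd(y_demo))

all.equal(cov_by_hand, cov(x_demo, y_demo))
all.equal(r_by_hand, cor(x_demo, y_demo))
```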

## Other Correlation Coefficients


```{r echo=TRUE}
d_cor = data.frame(
  row.names = "Variable",
  Variable = c(
    "*Sanitation vs. ln(GDP)*",
    "*Life Exp. vs. ln(GDP)*",
    "*Life Exp. vs. Sanitation*"
  ),
  Pearson = c(
    cor(d$snt, d$lngdp, method="pearson"),
    cor(d$lfx, d$lngdp, method="pearson"),
    cor(d$lfx, d$snt, method="pearson")
  ),
  Spearman = c(
    cor(d$snt, d$lngdp, method="spearman"),
    cor(d$lfx, d$lngdp, method="spearman"),
    cor(d$lfx, d$snt, method="spearman")
  ),
  Kendall = c(
    cor(d$snt, d$lngdp, method="kendall"),
    cor(d$lfx, d$lngdp, method="kendall"),
    cor(d$lfx, d$snt, method="kendall")
  )
)

kable(
  d_cor,
  col.names = c(
    "*Pearson's* $r$",
    "*Spearman's* $r_s$",
    "*Kendall's* $\\tau$"
  ),
  digits=5
)
```
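
The three coefficients answer slightly different questions: Pearson's $r$ measures linear association, Spearman's $r_s$ is Pearson's $r$ computed on the ranks of the data, and Kendall's $\tau$ compares the numbers of concordant and discordant pairs. A minimal sketch on made-up toy data (a monotone but non-linear relationship, chosen as an assumption to make the difference visible):

```{r echo=TRUE}
# Toy data (hypothetical): y grows monotonically with x, but not linearly.
x_demo = c(1, 2, 3, 4, 5)
y_demo = x_demo^3

pearson_demo  = cor(x_demo, y_demo, method = "pearson")   # below 1: not a straight line
spearman_demo = cor(x_demo, y_demo, method = "spearman")  # 1: ranks agree perfectly
kendall_demo  = cor(x_demo, y_demo, method = "kendall")   # 1: every pair is concordant

# Spearman's coefficient is Pearson's r applied to the ranks.
all.equal(spearman_demo, cor(rank(x_demo), rank(y_demo)))
```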


# Linear Regression

*Simple Univariate Linear Regression* is a method for estimating the relationship $y_i=f(x_i)$ between a *response* variable $y$ and a *predictor* variable $x$, as a straight line that closely fits the $y$ vs. $x$ *scatter plot*.

$$
y_i = a + b x_i + e_i.
$$

Where $a$ is the intercept, $b$ is the slope, and $e_i$ is the i<sup>th</sup> residual. In ordinary least squares, the fitted line is the one that minimises the sum of squared residuals $\sum_{i=1}^{n} e_i^2$.
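
Assuming the line is fitted by ordinary least squares (which is what R's `lm()` does), the estimates have a closed form: $b = \operatorname{cov}(x, y) / s_x^2$ and $a = \bar{y} - b\,\bar{x}$. The sketch below checks this on made-up toy data; `x_demo`, `y_demo`, `a_hat` and `b_hat` are hypothetical names used only for illustration.

```{r echo=TRUE}
# Toy data (hypothetical values, roughly following y = 2x).
x_demo = c(1, 2, 3, 4, 5)
y_demo = c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates of slope and intercept.
b_hat = cov(x_demo, y_demo) / var(x_demo)
a_hat = mean(y_demo) - b_hat * mean(x_demo)

# The same fit via lm(); coefficients are returned as (intercept, slope).
fit_demo = lm(y_demo ~ x_demo)
all.equal(c(a_hat, b_hat), unname(coef(fit_demo)))
```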

## 