# Multivariate Data Sets

---
> Erick Eduardo Aguilar Hernández:
> * mat.ErickAguilar@gmail.com.mx
> * isc.ErickAguilar@gmail.com

---
Multivariate analysis is the branch of statistics that generalizes the methods of inferential statistics so that a population $X$ can be characterized through a finite collection of random variables $X_i$, i.e. through the multivariate distribution of a random vector. For example, a species of animal can be characterized through quantitative and qualitative variables such as body height, body width, weight, head height, head width, leg size, hair color, eye color, sex, etc. These variables are called explicative variables.

$$ \vec{X} = (X_1,X_2,...,X_p) $$

As in classical inferential statistics, the main idea of multivariate analysis is to generalize patterns or draw useful conclusions about a multivariate population from the information in a sample; in this case, however, the information is multidimensional.

#### Statistical learning and machine learning

There are often situations in which it is necessary to make inferences about the future behavior of one or several variables in terms of random vectors; to infer which population a random vector belongs to, when several populations share the same explicative variables but have different distributions; or to find cluster boundaries and structures, when different populations are mixed and the membership of each vector is unknown.

For these situations and others, there are results from multivariate analysis that provide methods for approximate, rather than exact theoretical, solutions to the problem; this is called **statistical learning**. When a computational approach is added, taking into account algorithms, complexity, cost, data structures, etc., the resulting set of techniques is known as **machine learning**.


#### Data matrix

Suppose that you have $n$ observations of the random vector $\vec{X}$ (whose distribution is that of the population), such that each vector has $p$ explicative variables. Then the set of observations $\{\vec{X}_i\}_{i=1}^n=\{(X_1,...,X_p)_i\}_{i=1}^n$ can be represented as a matrix called the data matrix $\textbf{X}_{n \times p}$. Each row of this matrix corresponds to one observation and each column to one of the explicative variables.

$$\textbf{X}_{n \times p}=
\left( \begin{array}{ccccc}
x_{1 1} & \cdots & x_{1 j} & \cdots & x_{1 p} \\
\vdots  & \ddots & \vdots  & \ddots & \vdots \\
x_{i 1} & \cdots & x_{i j} & \cdots & x_{i p} \\
\vdots  & \ddots & \vdots  & \ddots & \vdots \\
x_{n 1} & \cdots & x_{n j} & \cdots & x_{n p} \end{array} \right)
$$

**Notation:**
* $\textbf{x}_i$ indicates the i-th row of $\textbf{X}$; however, it will be treated as a column vector in operations.
* $X_j$ indicates the j-th column of $\textbf{X}$.
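
The row and column notation can be illustrated with a minimal sketch in plain Python. The values below are a toy matrix invented for illustration (they are not taken from any dataset), and the helper names `row` and `column` are hypothetical; they use 1-based indices to match the notation in the text.

```python
# Toy data matrix: n = 4 observations (rows), p = 3 variables (columns).
# The numbers are illustrative only.
X = [
    [5.0, 3.0, 1.0],
    [4.0, 3.0, 2.0],
    [6.0, 2.0, 4.0],
    [5.0, 4.0, 1.0],
]

n = len(X)      # number of observations
p = len(X[0])   # number of variables


def row(X, i):
    """Return the i-th observation x_i (1-based index)."""
    return list(X[i - 1])


def column(X, j):
    """Return the j-th variable X_j (1-based index)."""
    return [obs[j - 1] for obs in X]


print(n, p)          # shape of the data matrix
print(row(X, 2))     # second observation
print(column(X, 3))  # third variable across all observations
```

Storing the matrix as a list of rows mirrors the convention that each row is one observation; a library such as NumPy or a Spark DataFrame would typically be used for real data.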

**Example 1.1 - 1 [Iris plants]**: The following dataset contains samples from 3 populations of iris plants, with 50 observations each. The 3 populations share the same explicative variables:
1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm
5. Species: 
      - Iris Setosa
      - Iris Versicolour
      - Iris Virginica

URL of the dataset.
https://archive.ics.uci.edu/ml/datasets/iris

In [None]:
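from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Column as C

# Reuse the running SparkContext if there is one; otherwise start a local
# context that uses all available cores.
conf = SparkConf().setMaster('local[*]')
sc = SparkContext.getOrCreate(conf)
sqlContext = SQLContext(sc)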

In [None]:
irisPath = '../DataSets/Iris.csv'
lim = 4  # number of rows to preview per species
# Read the CSV using its header row and let Spark infer the column types.
irisDF = sqlContext.read.format('com.databricks.spark.csv')\
                   .options(header='true',inferschema='true')\
                   .load(irisPath)
# One DataFrame per population (species).
irisSetosaDF = irisDF.where(irisDF.Species == 'Iris-setosa')
irisVersicolorDF = irisDF.where(irisDF.Species == 'Iris-versicolor')
irisVirginicaDF = irisDF.where(irisDF.Species == 'Iris-virginica')
# Preview the first rows of each species in a single table.
irisSetosaDF.limit(lim)\
            .union(irisVersicolorDF.limit(lim))\
            .union(irisVirginicaDF.limit(lim))\
            .toPandas()

### Plotting Multivariate Data

**Scatter plot matrix**: A scatter plot matrix arranges all possible two-way scatter plots in a q × q matrix, where q is the number of variables being plotted. These displays can be enhanced with brushing, in which individual points or groups of points are selected in one plot and simultaneously highlighted in the other plots.

**Example 1.1 - 2 [Scatter plot matrix of iris plants]**

In [None]:
import seaborn as sns
smp = sns.pairplot(irisDF.toPandas(),hue="Species",diag_kind='hist')
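
To make "all possible two-way scatter plots" concrete, the following sketch counts the panels of a scatter plot matrix for the four numeric iris variables. The strings are descriptive labels taken from the variable list above, not necessarily the column names in the CSV file.

```python
from itertools import combinations, product

# Descriptive labels for the four numeric variables (assumed labels,
# not the actual column names of the file).
variables = ["sepal length", "sepal width", "petal length", "petal width"]
q = len(variables)

# Every cell of the q x q grid pairs one variable with another.
panels = list(product(variables, repeat=2))

# Off-diagonal cells are the actual two-way scatter plots; the diagonal
# is usually replaced by a univariate plot such as a histogram.
off_diagonal = [(a, b) for a, b in panels if a != b]

# Each unordered pair appears twice (once above and once below the diagonal).
distinct_pairs = list(combinations(variables, 2))

print(len(panels), len(off_diagonal), len(distinct_pairs))
```

With q variables the grid has q² cells, of which q(q-1) are scatter plots and q(q-1)/2 show distinct pairs of variables.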

# Multivariate descriptive statistics
---

#### Mean vector

For the j-th column of the data matrix, the sample mean of the values $\{x_{1j},...,x_{nj}\}$ is given by $\bar{x_j}=\frac{1}{n}\sum_{i=1}^{n}x_{ij}$. If we assume that the sample is independent and identically distributed, then $E[X_j]=\mu_j=E[\bar{x_j}]$, because $\bar{x_j}$ is an unbiased estimator of $\mu_j$. On the other hand, we know from the theory of multivariate distributions that if $\vec{X}$ is a random vector of dimension $p$, then its expected value is given by:

$$
\begin{align}
\vec{\mu} = E[\vec{X}] & = (E[X_1],\dotsc,E[X_j],\dotsc,E[X_p]) \\
& = (E[\bar{x_1}],\dotsc,E[\bar{x_j}],\dotsc,E[\bar{x_p}]) \\
& = E[(\bar{x_1},\dotsc,\bar{x_j},\dotsc,\bar{x_p})]\\
& = E\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_{i1},\dotsc,\frac{1}{n}\sum_{i=1}^{n}x_{ij},\dotsc,\frac{1}{n}\sum_{i=1}^{n}x_{ip}\right)\right]\\
& = E\left[\frac{1}{n} \left(\sum_{i=1}^{n}x_{i1},\dotsc,\sum_{i=1}^{n}x_{ij},\dotsc,\sum_{i=1}^{n}x_{ip}\right)\right]\\
& = E\left[\frac{1}{n} \sum_{i=1}^{n} \left(x_{i1},\dotsc,x_{ij},\dotsc,x_{ip}\right)\right]\\
\vec{\mu} & = E\left[\frac{1}{n} \sum_{i=1}^{n} \textbf{x}_i\right]
\end{align}
$$

This means that $\frac{1}{n} \sum_{i=1}^{n} \textbf{x}_i$, denoted by $\bar{\textbf{x}}$, is an unbiased estimator of the expected value $\vec{\mu}$ of the random vector $\vec{X}$; it is called the mean vector.
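
The key step of the derivation, that the vector of column means equals the average of the row vectors, can be checked numerically. The sketch below uses plain Python and a toy data matrix invented for illustration; the function names are hypothetical.

```python
# Toy data matrix: n = 4 observations, p = 3 variables (illustrative values).
X = [
    [5.0, 3.0, 1.0],
    [4.0, 3.0, 2.0],
    [6.0, 2.0, 4.0],
    [5.0, 4.0, 1.0],
]


def mean_vector_by_columns(X):
    """Compute each column mean separately: (x̄_1, ..., x̄_p)."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]


def mean_vector_by_rows(X):
    """Sum the observation vectors x_i component-wise, then divide by n."""
    n = len(X)
    p = len(X[0])
    total = [0.0] * p
    for x_i in X:
        total = [t + v for t, v in zip(total, x_i)]
    return [t / n for t in total]


by_columns = mean_vector_by_columns(X)
by_rows = mean_vector_by_rows(X)
print(by_columns)
print(by_rows)
```

Both functions return the same vector, which is the sample mean vector $\bar{\textbf{x}}$ of the toy matrix. On a real dataset the same quantity would normally be obtained with a library routine rather than explicit loops.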

#### References:
* "Multivariate Statistics", John I. Marden, Department of Statistics, University of Illinois at Urbana-Champaign.
* "Nuevos Métodos de Análisis Multivariante", Carles M. Cuadras, © C. M. Cuadras, CMC Editions, Manacor 30, 08023 Barcelona, Spain.
* "Applied Multivariate Statistical Analysis", Penn State University, online course STAT 505 (https://onlinecourses.science.psu.edu/stat505/).