# Descriptive Statistics with Python and R


## Index:
* [Data-set](#1)
* [Selecting](#2)
* [Sampling](#3)
* [Filtering](#4)
* [Mutate](#5)
* [Arrange](#6)
* [Rename](#7)
* [Gather](#8)
* [Spread](#9)
* [Separate](#10)
* [Unite](#11)
* [Joins](#12)
* *  [Inner Join](#13)
* * [Full Join](#14)
* * [Left Join](#15)
* * [Right Join](#16)
* * [Semi Join](#17)
* * [Anti Join](#18)
* * [Union](#19)
* * [Intersect](#20)
* * [Difference](#21)
*  [Concatenate](#22)
*  [Group and Summarize](#23)
*  [Other useful functions](#24)

## Data-Set <a class="anchor" id="1"></a>

We start by loading the main data-set used throughout this article:

### Working with `Python`:

In [3]:
import pandas as pd

from IPython.display import display
pd.options.display.max_columns = None

import warnings
warnings.filterwarnings('ignore')

We load the data-set that we will use in this article using the following link:

https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Descriptive%20Statisitcs%20in%20Python%20and%20R/datosAragon.csv

In [4]:
url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Descriptive%20Statisitcs%20in%20Python%20and%20R/datosAragon.csv'

data_Python = pd.read_csv(url)

data_Python

Unnamed: 0,ca,datosECVmas16.prov,nomprov,gen,edad,nac,neduc,sitlab,ingnorm,horas,factorel
0,2,22,Huesca,1,3,1,3,1,21237.1,36.42,393.7
1,2,22,Huesca,2,2,1,2,1,17810.8,31.72,393.7
2,2,22,Huesca,1,1,1,2,1,11889.1,31.88,393.7
3,2,22,Huesca,1,1,1,2,1,16000.5,38.18,393.7
4,2,22,Huesca,1,1,1,2,3,21169.6,0.00,393.7
...,...,...,...,...,...,...,...,...,...,...,...
1231,2,50,Zaragoza,1,2,1,2,1,11760.6,28.79,2206.1
1232,2,50,Zaragoza,1,2,1,2,2,19321.6,0.00,124.4
1233,2,50,Zaragoza,2,2,1,2,1,19924.8,37.21,124.4
1234,2,50,Zaragoza,2,3,1,1,3,13042.5,0.00,246.5


Throughout this article we will use the Python package `dfply`

In [5]:
# pip install dfply 

from dfply import *

We prepare the data-set:

We select the columns that we will use:

In [6]:
data_Python = (data_Python.T >> row_slice( list(range(2,10)) ) ).T

data_Python

Unnamed: 0,nomprov,gen,edad,nac,neduc,sitlab,ingnorm,horas
0,Huesca,1,3,1,3,1,21237.1,36.42
1,Huesca,2,2,1,2,1,17810.8,31.72
2,Huesca,1,1,1,2,1,11889.1,31.88
3,Huesca,1,1,1,2,1,16000.5,38.18
4,Huesca,1,1,1,2,3,21169.6,0.0
...,...,...,...,...,...,...,...,...
1231,Zaragoza,1,2,1,2,1,11760.6,28.79
1232,Zaragoza,1,2,1,2,2,19321.6,0.0
1233,Zaragoza,2,2,1,2,1,19924.8,37.21
1234,Zaragoza,2,3,1,1,3,13042.5,0.0


We rename some of these columns:

In [7]:
data_Python = data_Python >> rename( genero=X.gen , 
                              provincia=X.nomprov , 
                              ingresos=X.ingnorm ) 
    

data_Python

Unnamed: 0,provincia,genero,edad,nac,neduc,sitlab,ingresos,horas
0,Huesca,1,3,1,3,1,21237.1,36.42
1,Huesca,2,2,1,2,1,17810.8,31.72
2,Huesca,1,1,1,2,1,11889.1,31.88
3,Huesca,1,1,1,2,1,16000.5,38.18
4,Huesca,1,1,1,2,3,21169.6,0.0
...,...,...,...,...,...,...,...,...
1231,Zaragoza,1,2,1,2,1,11760.6,28.79
1232,Zaragoza,1,2,1,2,2,19321.6,0.0
1233,Zaragoza,2,2,1,2,1,19924.8,37.21
1234,Zaragoza,2,3,1,1,3,13042.5,0.0


We can see the structure of the variables of our data-set:

In [8]:
data_Python.dtypes

provincia    object
genero       object
edad         object
nac          object
neduc        object
sitlab       object
ingresos     object
horas        object
dtype: object

We will convert their types as follows:

- ingresos and horas to 'float' (numeric)

- The rest to 'category' (categorical)

In [9]:

data_Python['ingresos'] = data_Python['ingresos'].astype(float) 
data_Python['horas'] = data_Python['horas'].astype(float) 

data_Python['genero'] = data_Python['genero'].astype('category')
data_Python['edad'] = data_Python['edad'].astype('category')
data_Python['nac'] = data_Python['nac'].astype('category')
data_Python['neduc'] = data_Python['neduc'].astype('category')
data_Python['sitlab'] = data_Python['sitlab'].astype('category')
data_Python['provincia'] = data_Python['provincia'].astype('category')

Now we check that the changes have been applied correctly:

In [10]:
data_Python.dtypes 

provincia    category
genero       category
edad         category
nac          category
neduc        category
sitlab       category
ingresos      float64
horas         float64
dtype: object

In [11]:
data_Python

Unnamed: 0,provincia,genero,edad,nac,neduc,sitlab,ingresos,horas
0,Huesca,1,3,1,3,1,21237.1,36.42
1,Huesca,2,2,1,2,1,17810.8,31.72
2,Huesca,1,1,1,2,1,11889.1,31.88
3,Huesca,1,1,1,2,1,16000.5,38.18
4,Huesca,1,1,1,2,3,21169.6,0.00
...,...,...,...,...,...,...,...,...
1231,Zaragoza,1,2,1,2,1,11760.6,28.79
1232,Zaragoza,1,2,1,2,2,19321.6,0.00
1233,Zaragoza,2,2,1,2,1,19924.8,37.21
1234,Zaragoza,2,3,1,1,3,13042.5,0.00


### Working with `R`

To work with R inside Python we will use the Python package `rpy2`

In [12]:
# pip install rpy2

import rpy2

%load_ext rpy2.ipython

import rpy2.robjects as robjects

Unable to determine R home: [WinError 2] El sistema no puede encontrar el archivo especificado


We prepare the data-set:

In [13]:
%%R

library(tidyverse)

url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Descriptive%20Statisitcs%20in%20Python%20and%20R/datosAragon.csv'

data_R <- read_csv(url)

data_R <- data_R %>% select(3:10)

data_R <- data_R %>% rename("genero"="gen",
         "provincia"="nomprov", "ingresos"="ingnorm")

data_R <- as.data.frame(data_R)

R[write to console]: -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

R[write to console]: v ggplot2 3.3.6     v purrr   0.3.4
v tibble  3.1.7     v dplyr   1.0.9
v tidyr   1.2.0     v stringr 1.4.0
v readr   2.1.2     v forcats 0.5.1

R[write to console]: -- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



Rows: 1236 Columns: 11
-- Column specification --------------------------------------------------------
Delimiter: ","
chr  (1): nomprov
dbl (10): ca, datosECVmas16.prov, gen, edad, nac, neduc, sitlab, ingnorm, ho...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [14]:
%%R

head(data_R)

  provincia genero edad nac neduc sitlab ingresos horas
1    Huesca      1    3   1     3      1  21237.1 36.42
2    Huesca      2    2   1     2      1  17810.8 31.72
3    Huesca      1    1   1     2      1  11889.1 31.88
4    Huesca      1    1   1     2      1  16000.5 38.18
5    Huesca      1    1   1     2      3  21169.6  0.00
6    Huesca      1    2   1     2      1  16001.3 34.52


## Data-set Description  

The variables are defined as follows:

-   **provincia**: the province of the Aragon community to which the individual belongs.

-   **genero**: the sex of the individual. It takes 1 for male and 2 for female.

-   **edad**: the age range of the individual. It takes 1 if the age is between 16 and 24 years, 2 if it is between 25 and 49, 3 if it is between 50 and 64, and 4 if it is 65 or older.

-   **nacionalidad** (**nac**): the nationality of the individual. It takes 1 if they are from Spain and 2 if they are from another country.

-   **situacion laboral** (**sitlab**): the employment status of the individual. It takes 1 if they are working, 2 if they are unemployed, and 3 if they are inactive.

-   **ingresos**: the income of the individual.

-   **horas**: the number of hours worked by the individual.

-   **neduc**: the education level of the individual. It takes 1 for a low education level, 2 for an intermediate level, and 3 for a high level.
  

## Statistical Variable <a class="anchor" id="2"></a>

$$
X_k= \begin{pmatrix}
x_{1k} \\
x_{2k}\\
... \\
x_{nk} 
\end{pmatrix} 
$$

is a **statistical variable**: a vector with the values (observations) of the variable $X_k$ for the $n$ elements or individuals of a sample.

Where:  $x_{ik}$  is the value (observation) of the variable $X_k$ for the $i$-th element of the sample.

## Data Matrix   <a class="anchor" id="3"></a>

In general, suppose we have observed $p$ variables on a sample $\varepsilon$ with $n$ elements/individuals.

The data matrix $X$ containing the measurements of the variables $X_1,...,X_p$ on the sample is:

$$
X= \begin{pmatrix}
x_{11} & x_{12}&...&x_{1p}\\
x_{21} & x_{22}&...&x_{2p}\\
...&...&...&...\\
x_{n1}& x_{n2}&...&x_{np}
\end{pmatrix}
$$

Note:  &nbsp;   $X$ is an $n \times p$ matrix.

## Data Matrix Representation by Rows   <a class="anchor" id="4"></a>

$$
X= \begin{pmatrix}
x_{1}^{t} \\
x_{2} ^t \\
... \\
x_{n} ^t 
\end{pmatrix}
$$

Where:

$x_i ^t = (x_{i1}, x_{i2}, ..., x_{ip} )$ is the vector with the values of the $p$ variables $X_1,...,X_p$ for the $i$-th element/individual of the sample, for $i=1,...,n$

## Data Matrix Representation by Columns   <a class="anchor" id="5"></a>


We can express:

$$
X= (X_1 , X_2 ,..., X_p )
$$

Where: 
$$
X_k= \begin{pmatrix}
x_{1k} \\
x_{2k}\\
... \\
x_{nk} 
\end{pmatrix} 
$$

for $k=1,2,...,p$
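As a small illustration of the row and column representations, the following sketch builds a toy data matrix with NumPy and extracts one row vector $x_i^t$ and one column $X_k$. The values are invented for the example and are not taken from the article's data-set.

```python
import numpy as np

# Toy data matrix with n = 4 individuals (rows) and p = 3 variables (columns).
# The values are made up for illustration only.
X = np.array([
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 1.0],
    [3.0, 30.0, 0.0],
    [4.0, 40.0, 1.0],
])

n, p = X.shape      # X is an n x p matrix

x_2 = X[1, :]       # row vector x_2^t: values of the p variables for individual 2
X_3 = X[:, 2]       # column X_3: values of variable 3 for the n individuals

print(n, p)         # 4 3
print(x_2)
print(X_3)
```

Note that NumPy indices start at 0, so individual $i$ corresponds to row `i - 1` and variable $k$ to column `k - 1`.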


## Range of a Statistical Variable   <a class="anchor" id="6"></a>



The **range** of a statistical variable  $X_k$ is the set of values that the variable can take. 


Depending on its range, a variable is classified as **quantitative** or **categorical**. This classification is particularly relevant in statistics.

<br /> 

Examples:

 
$X_k =$ Incomes of 1000 employees of Amazon  &nbsp; $\Rightarrow$  &nbsp;
$Range(X_k) =[0, \infty )$

 
$X_k =$ brand of the car of 50 footballers &nbsp; $\Rightarrow$  &nbsp;
$Range(X_k) = \lbrace Mercedes, Audi, ... \rbrace$

 
$X_k =$ Number of houses of 10 urbanizations  &nbsp; $\Rightarrow$  &nbsp;
$Range(X_k) = \lbrace 0, 1, 2,... \rbrace$

## Types of Statistical Variable   <a class="anchor" id="7"></a>



The variable $X_k$ is **quantitative** if the elements of its range are
conceptually numbers.

The variable $X_k$ is **categorical** if the elements of its range are labels or categories (they can be numbers at a symbolic level but not at a conceptual level).


### Types of Quantitative Variables <a class="anchor" id="8"></a>


 

#### Discrete and Continuous Variables <a class="anchor" id="9"></a>
 

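Let $X_k$ be a **quantitative** variable.

The variable $X_k$ is **discrete** if its range is a countable set.

The variable $X_k$ is **continuous** if its range is not a countable set.

<br /> 

**Note:**

In particular, variables whose **range** is a **finite** set are **discrete**.

The converse does not hold: a variable whose **range** is infinite but countable, such as a count with range $\lbrace 0, 1, 2,... \rbrace$, is still **discrete**. Only variables whose **range** is uncountable, such as an interval of real numbers, are **continuous**.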

 <br /> 

### Types of Categorical Variables   <a class="anchor" id="10"></a>

Let $X_k$ be a **categorical** variable.

$X_k$ is **r-ary** if its range has **r** elements that are categories or labels.

In statistics, **binary** (2-ary) categorical variables are particularly important.
 

 

#### Nominal and Ordinal Variables <a class="anchor" id="11"></a>


Let $X_k$ be an $r$-ary **categorical** variable.

The variable $X_k$ is **nominal** if **there is no ordering** between the $r$ categories of its range.

The variable $X_k$ is **ordinal** if **there is an ordering** between the $r$ categories of its range.

 <br /> 

**Examples:**  

$Range(X_k)= \lbrace Apple , Samsung, Oppo \rbrace \Rightarrow X_k$
is nominal 

$Range(X_k)= \lbrace bad , fair, good \rbrace \Rightarrow X_k$ is ordinal
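The nominal/ordinal distinction can also be expressed in pandas. The sketch below is only an illustration with a hypothetical series of education levels (coded 1, 2, 3 as in the data-set description); it shows that an ordered categorical supports order comparisons, while a plain `category` column does not carry an order.

```python
import pandas as pd

# Hypothetical education levels: 1 = low, 2 = intermediate, 3 = high.
neduc = pd.Series([1, 3, 2, 2, 1, 3])

# Nominal: categories without an order (only equality comparisons make sense).
nominal = neduc.astype('category')

# Ordinal: the same labels, but with the explicit order 1 < 2 < 3.
ordinal = pd.Series(pd.Categorical(neduc, categories=[1, 2, 3], ordered=True))

print(nominal.cat.ordered)     # False
print(ordinal.cat.ordered)     # True

# Order comparisons are allowed on the ordered version.
n_at_least_intermediate = int((ordinal >= 2).sum())
print(n_at_least_intermediate) # 4
```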



## Descriptive Statistic   <a class="anchor" id="11"></a>

A descriptive statistic is a function of the elements of a sample.

In general, any function applied to a statistical variable is a statistic.

We will see some of the most important statistics.
 

 

### Mean   <a class="anchor" id="12"></a>

 

The arithmetic mean (or simply mean) of a variable $X_k$ is defined as: 

$$
\overline{X_k}=  \frac{1}{n} \cdot \sum_{i=1}^{n} x_{ik} 
$$

 
**Some Properties:**

$$
\sum_{i=1}^{n} \left( x_{ik} - \overline{X_k} \right) = 0
$$
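The property above (the deviations from the mean add up to zero) can be checked numerically. This is a minimal sketch on a hypothetical sample of five values, using only the Python standard library.

```python
from statistics import fmean

# Hypothetical sample of five observations.
x = [2.0, 4.0, 4.0, 5.0, 10.0]

mean_x = fmean(x)                        # (1/n) * sum of the observations
deviations = [xi - mean_x for xi in x]   # x_i - mean, for every observation

print(mean_x)            # 5.0
print(sum(deviations))   # 0.0
```

With real data the sum is usually not exactly zero but a value very close to it, because of floating-point rounding.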
  


#### Mean in R   <a class="anchor" id="12"></a>


In [15]:
%%R
head(data_R)

  provincia genero edad nac neduc sitlab ingresos horas
1    Huesca      1    3   1     3      1  21237.1 36.42
2    Huesca      2    2   1     2      1  17810.8 31.72
3    Huesca      1    1   1     2      1  11889.1 31.88
4    Huesca      1    1   1     2      1  16000.5 38.18
5    Huesca      1    1   1     2      3  21169.6  0.00
6    Huesca      1    2   1     2      1  16001.3 34.52


One way to compute the mean of a variable in R:

In [16]:
%%R
mean(data_R$ingresos)

[1] 14078.77



#### Mean in Python   <a class="anchor" id="12"></a>





In [17]:
data_Python >> head()

Unnamed: 0,provincia,genero,edad,nac,neduc,sitlab,ingresos,horas
0,Huesca,1,3,1,3,1,21237.1,36.42
1,Huesca,2,2,1,2,1,17810.8,31.72
2,Huesca,1,1,1,2,1,11889.1,31.88
3,Huesca,1,1,1,2,1,16000.5,38.18
4,Huesca,1,1,1,2,3,21169.6,0.0


One way to compute the mean of a variable in Python:

In [18]:
data_Python[['ingresos']].mean() 

ingresos    14078.766909
dtype: float64

Another way:

In [19]:
( data_Python >> select(X.ingresos) ).mean() 

ingresos    14078.766909
dtype: float64


### Mean Vector   <a class="anchor" id="12"></a>

Given a data matrix $X$ of size $n \times p$,

The mean vector of $X$ is: 
$$
\overline{X}=( \overline{X_1} , \overline{X_2} , ... , \overline{X_p} ) ^t
$$



### Mean Vector in Python  <a class="anchor" id="12"></a>


We can also compute the mean of each numeric variable of a data-set in an easy way:

In [20]:
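# numeric_only=True restricts the computation to the numeric columns;
# recent pandas versions raise an error when asked to average categorical ones
data_Python.mean(numeric_only=True)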

ingresos    14078.766909
horas          16.922532
dtype: float64

### Mean Vector in R  <a class="anchor" id="12"></a>

In [21]:
%%R

colMeans(data_R[ , 2:length(data_R)])

      genero         edad          nac        neduc       sitlab     ingresos 
    1.502427     2.555016     1.046117     1.805016     1.978155 14078.766909 
       horas 
   16.922532 




#### Matrix Expression of the Mean Vector   <a class="anchor" id="12"></a>

The matrix expression of the mean vector of $X$ is:
 
$$
\overline{X}= \dfrac{1}{n} \cdot X\hspace{0.05cm}^t \cdot \overrightarrow{1}_{nx1}
$$
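The matrix expression can be checked against a direct column-wise mean. The sketch below uses a randomly generated toy matrix (an assumption for illustration, not the article's data).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))       # toy 6 x 3 data matrix (random values)

n = X.shape[0]
ones = np.ones((n, 1))            # the n x 1 vector of ones

mean_matrix = (X.T @ ones) / n    # (1/n) * X^t * 1, a p x 1 column vector
mean_direct = X.mean(axis=0)      # column means computed directly

print(np.allclose(mean_matrix.ravel(), mean_direct))   # True
```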





### Weighted Mean   <a class="anchor" id="12"></a>



Given the variable $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$

and a vector of weights, one for each observation of the variable
$X_k$ : $w=(w_1,w_2,...,w_n)^t$

 
The **weighted mean** of the variable $X_k$ with the weight vector $w$
is: 
$$
\overline{X_k} (w) =   \dfrac{\sum_{i=1}^{n}  x_{ik}\cdot w_i  }{ \sum_{i=1}^{n}  w_{i} }  
$$
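A minimal sketch of the weighted mean formula, with hypothetical values and weights. It also shows that equal weights give back the arithmetic mean.

```python
# Hypothetical observations and weights.
x = [10.0, 20.0, 30.0]
w = [1.0, 1.0, 2.0]

# sum(x_i * w_i) / sum(w_i)
weighted_mean = sum(xi * wi for xi, wi in zip(x, w)) / sum(w)
print(weighted_mean)   # 22.5

# With equal weights the weighted mean is the arithmetic mean.
equal_w = [1.0, 1.0, 1.0]
unweighted = sum(xi * wi for xi, wi in zip(x, equal_w)) / sum(equal_w)
print(unweighted)      # 20.0
```

With NumPy, `np.average(x, weights=w)` computes the same quantity.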





### Geometric Mean   <a class="anchor" id="12"></a>


Given the variable $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$


The **geometric mean** of the variable $X_k$ is:

$$
\overline{X_k}_{geom} =   \sqrt[n]{\prod_{i=1}^{n}  x_{ik}} = \sqrt[n]{x_{1k}\cdot x_{2k}\cdot...\cdot x_{nk}} 
$$
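The geometric mean (the $n$-th root of the product of the $n$ observations, which must be positive) can be sketched in two equivalent ways. The sample is hypothetical; the logarithmic form is often preferred in practice because the raw product can overflow for long series.

```python
import math

# Hypothetical positive observations.
x = [1.0, 3.0, 9.0]
n = len(x)

# Direct form: n-th root of the product.
geom_root = math.prod(x) ** (1 / n)

# Equivalent form: exponential of the mean of the logarithms.
geom_logs = math.exp(sum(math.log(xi) for xi in x) / n)

print(geom_root)
print(geom_logs)
```

Both values are equal to 3 up to floating-point rounding.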


### Median  <a class="anchor" id="12"></a>

The **median** of a variable $X_k$ is a value such that half of the observations of  $X_k$ are **less** than that value.

 

Given the variable $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$


We order its values from smallest to largest:

$x_{(1)k} \leq x_{(2)k} \leq ...\leq x_{(n)k}$

 <br /> 

The median of the variable $X_k$ is:

$$
Median(X_k)=  \left\lbrace\begin{array}{l} \dfrac{ x_{(n/2)k} + x_{(n/2 + 1)k} }{2} \hspace{0.3cm},\text{ if $n$ is  {even}} \\ x_{(\lceil n/2 \rceil)k} \hspace{0.3cm},\text{   if $n$ is  {odd}  }  \end{array}\right.
$$
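The piecewise definition can be written directly in code. The function name `median_by_definition` is hypothetical and only meant to mirror the formula; the result is compared with `statistics.median` from the standard library.

```python
from statistics import median

def median_by_definition(values):
    """Median following the piecewise formula (sort, then pick the middle)."""
    ordered = sorted(values)              # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample is empty")
    if n % 2 == 0:
        # n even: average of x_(n/2) and x_(n/2 + 1) (1-based positions)
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    # n odd: x_(ceil(n/2)), which is index n // 2 with 0-based indexing
    return ordered[n // 2]

odd_sample = [3, 1, 2]
even_sample = [4, 1, 3, 2]

print(median_by_definition(odd_sample))    # 2
print(median_by_definition(even_sample))   # 2.5
```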




#### Median in R   <a class="anchor" id="12"></a>


In [22]:
%%R

median(data_R$ingresos) 

[1] 12331


We can verify that the definition of the median holds:

In [23]:
%%R

sum(data_R$ingresos < median(data_R$ingresos))/length(data_R$ingresos)

[1] 0.5



#### Median in Python   <a class="anchor" id="12"></a>



In [24]:
( data_Python >> select(X.ingresos) ).median() 

ingresos    12331.0
dtype: float64

In [25]:
data_Python[['ingresos']].median() 

ingresos    12331.0
dtype: float64

In [26]:
data_Python[['ingresos']] < data_Python[['ingresos']].median()

Unnamed: 0,ingresos
0,False
1,False
2,True
3,False
4,False
...,...
1231,True
1232,False
1233,False
1234,False


As in R, we can verify that the definition of the median holds:

In [27]:
( data_Python[['ingresos']] < data_Python[['ingresos']].median() ).sum() / len(data_Python)

ingresos    0.5
dtype: float64

## Quantiles

The quantile of order $p$ of the variable $X_k$ is the value $Q(p, X_k)$ such that the proportion of observations of $X_k$ that are less than $Q(p, X_k)$ is $p$


More formally:
<br>



$Q(p, X_k)$   &nbsp; is the quantile of order $p$ of $X_k$
&nbsp; $\Leftrightarrow$ 

<br>

$$
\Leftrightarrow \hspace{0.3cm} \dfrac{\# \lbrace \hspace{0.05cm} i=1,..,n  \hspace{0.15cm} / \hspace{0.15cm}  x_{ik} < Q(p, X_k) \hspace{0.05cm}   \rbrace}{n} = p
$$

<br>

Note:  the median is the quantile of order $p=0.5$
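The definition can be checked on a simple hypothetical sample. Note that `np.quantile` interpolates linearly between observations by default, so in general the proportion of observations below the computed quantile is only approximately $p$ (for example when there are repeated values); in this sample without ties it is exact.

```python
import numpy as np

# Hypothetical sample: the integers 1, 2, ..., 100 (no repeated values).
x = np.arange(1, 101)

p = 0.25
q = np.quantile(x, p)                # linear interpolation by default

proportion_below = np.mean(x < q)    # share of observations strictly below q

print(q)                  # 25.75
print(proportion_below)   # 0.25
```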



### Quantiles in R

In [28]:
%%R

quantile(data_R$ingresos)

      0%      25%      50%      75%     100% 
 -999.50  8310.60 12331.00 18269.38 58470.40 


In [29]:
%%R

quantile(data_R$ingresos, 0.85)

    85% 
21840.8 


In [30]:
%%R

quantile(data_R$ingresos, 0.37)

     37% 
10329.74 


### Quantiles in Python

In [31]:
data_Python

Unnamed: 0,provincia,genero,edad,nac,neduc,sitlab,ingresos,horas
0,Huesca,1,3,1,3,1,21237.1,36.42
1,Huesca,2,2,1,2,1,17810.8,31.72
2,Huesca,1,1,1,2,1,11889.1,31.88
3,Huesca,1,1,1,2,1,16000.5,38.18
4,Huesca,1,1,1,2,3,21169.6,0.00
...,...,...,...,...,...,...,...,...
1231,Zaragoza,1,2,1,2,1,11760.6,28.79
1232,Zaragoza,1,2,1,2,2,19321.6,0.00
1233,Zaragoza,2,2,1,2,1,19924.8,37.21
1234,Zaragoza,2,3,1,1,3,13042.5,0.00


In [32]:
import numpy as np

np.quantile( data_Python[['ingresos']] , [0, 0.25, 0.5, 0.75 , 1])

array([ -999.5  ,  8310.6  , 12331.   , 18269.375, 58470.4  ])

In [33]:
np.quantile( data_Python[['ingresos']] , 0.37)

10329.735

In [34]:
np.quantile( data_Python[['ingresos']] , 0.85)

21840.800000000003

We can create a data-frame with the main quantiles of the quantitative variables ingresos and horas:

In [35]:
a = np.quantile( data_Python[['ingresos']] , [0, 0.25, 0.5, 0.75 , 1])
b = np.quantile( data_Python[['horas']] , [0, 0.25, 0.5, 0.75 , 1])

a = pd.DataFrame( { 'ingresos' : a } )
b = pd.DataFrame( { 'horas' : b } )

c = a >> bind_cols(b , join='inner')

c['index']=['Q(0)','Q(0.25)', 'Q(0.5)', 'Q(0.75)', 'Q(1)'] 

c = c.set_index('index')

In [36]:
c

index,ingresos,horas
Q(0),-999.5,0.0
Q(0.25),8310.6,0.0
Q(0.5),12331.0,12.385
Q(0.75),18269.375,33.8625
Q(1),58470.4,42.78


## Variance and Standard Deviation


The variance of the variable $X_k$ is: 
$$
\sigma^2(X_k)=  \dfrac{1}{n} \sum_{i=1}^{n} (x_{ik} - \overline{X_k})^2 
$$


The standard deviation of the variable $X_k$ is: 
$$
\sigma(X_k)= \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (x_{ik} - \overline{X_k})^2}
$$
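One practical detail: the definition above divides by $n$, whereas R's `var()` and `sd()` and pandas' `.var()` and `.std()` divide by $n-1$ by default. The following sketch, on a hypothetical sample and using only the standard library, shows both versions. With a sample as large as the one in this article ($n=1236$) the two results are very close.

```python
from statistics import pvariance, pstdev, variance

# Hypothetical sample with n = 8 observations.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)

var_n = pvariance(x)           # divides by n, as in the definition above
sd_n = pstdev(x)               # square root of var_n
var_n_minus_1 = variance(x)    # divides by n - 1

print(var_n)    # 4.0
print(sd_n)     # 2.0
print(var_n_minus_1)

# The two variances differ only by the factor n / (n - 1).
print(var_n * n / (n - 1))
```

In pandas, the $1/n$ version can be requested with `.var(ddof=0)` and `.std(ddof=0)`.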




### Variance and Standard Deviation in R


In [37]:
%%R

var(data_R$ingresos)

[1] 67924827


In [38]:
%%R

sd(data_R$ingresos)

[1] 8241.652


### Variance and Standard Deviation in Python


In [39]:
round( data_Python[['ingresos']].var() )

ingresos    67924827.0
dtype: float64

In [40]:
data_Python[['ingresos']].std()

ingresos    8241.651951
dtype: float64

## Basic descriptive summary in Python

In [41]:
data_Python

Unnamed: 0,provincia,genero,edad,nac,neduc,sitlab,ingresos,horas
0,Huesca,1,3,1,3,1,21237.1,36.42
1,Huesca,2,2,1,2,1,17810.8,31.72
2,Huesca,1,1,1,2,1,11889.1,31.88
3,Huesca,1,1,1,2,1,16000.5,38.18
4,Huesca,1,1,1,2,3,21169.6,0.00
...,...,...,...,...,...,...,...,...
1231,Zaragoza,1,2,1,2,1,11760.6,28.79
1232,Zaragoza,1,2,1,2,2,19321.6,0.00
1233,Zaragoza,2,2,1,2,1,19924.8,37.21
1234,Zaragoza,2,3,1,1,3,13042.5,0.00


Numeric description of the quantitative variables of the data-set:

In [42]:
data_Python.describe()

Unnamed: 0,ingresos,horas
count,1236.0,1236.0
mean,14078.766909,16.922532
std,8241.651951,17.054312
min,-999.5,0.0
25%,8310.6,0.0
50%,12331.0,12.385
75%,18269.375,33.8625
max,58470.4,42.78


Numeric description of the categorical variables of the data-set:

In [43]:
( data_Python >> select( ~X.ingresos , ~X.horas ) ).describe()

Unnamed: 0,provincia,genero,edad,nac,neduc,sitlab
count,1236,1236,1236,1236,1236,1236
unique,3,2,4,2,3,3
top,Zaragoza,2,2,1,2,1
freq,932,621,529,1179,591,618


## Basic descriptive summary in R

In [44]:
%%R

summary(data_R)

  provincia             genero           edad            nac       
 Length:1236        Min.   :1.000   Min.   :1.000   Min.   :1.000  
 Class :character   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000  
 Mode  :character   Median :2.000   Median :2.000   Median :1.000  
                    Mean   :1.502   Mean   :2.555   Mean   :1.046  
                    3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:1.000  
                    Max.   :2.000   Max.   :4.000   Max.   :2.000  
     neduc           sitlab         ingresos           horas      
 Min.   :1.000   Min.   :1.000   Min.   : -999.5   Min.   : 0.00  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 8310.6   1st Qu.: 0.00  
 Median :2.000   Median :1.500   Median :12331.0   Median :12.38  
 Mean   :1.805   Mean   :1.978   Mean   :14078.8   Mean   :16.92  
 3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:18269.4   3rd Qu.:33.86  
 Max.   :3.000   Max.   :3.000   Max.   :58470.4   Max.   :42.78  




## Centered Data Matrix


Given a data matrix $X$ of size $n \times p$


$X$ is **centered** &nbsp; $\Leftrightarrow$  &nbsp; the **mean** of every variable (column) of the data matrix is **zero**



**Centering Operation**

The operation to center $X$ is the following:

$$
X  -  \overrightarrow{1}_{nx1}  \cdot \overline{X}\hspace{0.05cm}^t =H_{n}\cdot X_{nxp}
$$

The result is a **centered matrix** (the mean of each of its columns is zero).

Where:

$$
H_{n}=I_n - \dfrac{1}{n} \cdot  \overrightarrow{1}_{nx1} \cdot \overrightarrow{1}_{nx1}^{\hspace{0.05cm}t}
$$

is the **centering matrix**.
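The centering operation can be verified numerically. The sketch below builds $H_n$ for a small random matrix (toy data, assumed for illustration) and checks that multiplying by $H_n$ is the same as subtracting the column means, and that the resulting columns have mean zero.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, size=(5, 2))      # toy 5 x 2 data matrix (random values)

n = X.shape[0]
ones = np.ones((n, 1))                    # n x 1 vector of ones

H = np.eye(n) - (ones @ ones.T) / n       # H_n = I_n - (1/n) * 1 * 1^t
X_centered = H @ X                        # centering through the matrix H_n

x_bar = X.mean(axis=0, keepdims=True)     # mean vector as a 1 x p row
X_centered_direct = X - ones @ x_bar      # X - 1 * mean^t

print(np.allclose(X_centered, X_centered_direct))       # True
print(np.allclose(X_centered.mean(axis=0), 0.0))        # True
```

For large $n$ the direct subtraction is cheaper, since $H_n$ is an $n \times n$ matrix.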



## Scaled Variables

Given the quantitative variable $X_k$


$X_k^{scale}$ is the **scaled** version of the variable $X_k$ if it is defined as:

$$
x_{ik}^{scale}= \dfrac{ x_{ik} - \overline{X_k} }{ \sigma(X_k)}   
$$

for $i=1,...,n$
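A minimal sketch of the scaling operation on a hypothetical sample. After scaling, the variable has mean 0 and standard deviation 1. `np.std` divides by $n$ by default, which matches the definition of $\sigma(X_k)$ used in this article.

```python
import numpy as np

# Hypothetical quantitative variable (mean 5, standard deviation 2).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

x_scaled = (x - x.mean()) / x.std()   # (x_i - mean) / sigma, with sigma using 1/n

print(x_scaled)
print(x_scaled.mean())   # approximately 0
print(x_scaled.std())    # approximately 1
```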



## Scaled Data Matrix

Let &nbsp; $X=(X_1,...,X_p)$ &nbsp; be a data matrix with $p$ variables,


$X^{scale}=(X_1^{scale},...,X_p^{scale})$ &nbsp; is &nbsp; $X$ &nbsp; but **scaled**.  



## Covariance


Given the variables $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$ and
$X_r=(x_{1r}, x_{2r},...,x_{nr})^t$


The covariance between the variables $X_k$ and $X_r$ is defined as:

$$
S(X_k, X_r) = \frac{1}{n} \cdot \sum_{i=1}^{n} \left(x_{ik} - \overline{X_k}\right)\cdot \left( x_{ir} - \overline{X_r} \right)
$$




## Properties of covariance


-   $S(X_k,X_r) \in (-\infty, \infty)$

-   $S(X_k,X_r) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{n} \sum_{i=1}^{n} (x_{ik} \cdot x_{ir}) - \overline{x_k} \cdot \overline{x_r} \hspace{0.2cm} = \hspace{0.2cm} \overline{x_k\cdot x_r} - \overline{x_k} \cdot \overline{x_r}$

-   $S(X_k, a + b\cdot X_r) = b\cdot S(X_k,X_r)$

-   $S(X_k,X_r) = S(X_r,X_k)$

-   $S(X_k,X_r) > 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm} $ **Positive Relationship** between  $X_k$ and $X_r$

-   $S(X_k,X_r) < 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}$ **Negative Relationship** between  $X_k$ and $X_r$ 

-   $S(X_k,X_r) = 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}$ **There is no linear relationship** between $X_k$ and $X_r$
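Some of these properties can be checked numerically. The sketch below uses a hypothetical helper `cov_n` that implements the $1/n$ definition, with made-up data. The last example illustrates why a zero covariance should be read with care: two variables can be exactly related (here $y = x^2$) and still have covariance zero.

```python
import math

def cov_n(x, y):
    """Covariance with the 1/n definition (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
n = len(x)

s_xy = cov_n(x, y)
print(s_xy)   # 0.75

# Alternative formula: mean of the products minus product of the means.
alt = sum(a * b for a, b in zip(x, y)) / n - (sum(x) / n) * (sum(y) / n)

# Linear transformation: S(X, a + b*Y) = b * S(X, Y).
a_const, b_const = 10.0, -3.0
s_transformed = cov_n(x, [a_const + b_const * v for v in y])

# Symmetry: S(X, Y) = S(Y, X).
s_yx = cov_n(y, x)

# Zero covariance despite an exact (non-linear) relationship y = x^2.
u = [-1.0, 0.0, 1.0]
s_nonlinear = cov_n(u, [v ** 2 for v in u])
print(s_nonlinear)   # 0.0
```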




### Covariance in R


In [45]:
%%R

cov(data_R$ingresos , data_R$horas)

[1] 32453.53


### Covariance in Python

In [46]:
data_Python[['ingresos' , 'horas']].cov() 

Unnamed: 0,ingresos,horas
ingresos,67924830.0,32453.532338
horas,32453.53,290.84955


In [47]:
(data_Python[['ingresos' , 'horas']].cov()).iloc[ [0] , [1] ]

Unnamed: 0,horas
ingresos,32453.532338


In [48]:
np.cov(data_Python['ingresos'], data_Python['horas'])

array([[6.79248269e+07, 3.24535323e+04],
       [3.24535323e+04, 2.90849550e+02]])

## Covariance Matrix

The **covariance matrix** of a given data matrix $X$ is:

$$
S_X = \left( \hspace{0.2cm} S(X_k , X_r) \hspace{0.2cm} \right)_{k,r=1,...,p}
$$




**Matrix expression of the covariance matrix :**

$$
S_X=\dfrac{1}{n} \cdot  X\hspace{0.1cm}^t \cdot H \cdot X
$$

Where:   $H$ is the centering matrix
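The matrix expression can be compared with NumPy's own covariance function. The data here are random (an assumption for illustration). `np.cov` divides by $n-1$ by default, as do R's `cov()` and pandas' `.cov()`, so `bias=True` is passed to obtain the $1/n$ version used in the formula.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))                  # toy 30 x 4 data matrix

n = X.shape[0]
H = np.eye(n) - np.ones((n, n)) / n           # centering matrix

S_matrix = (X.T @ H @ X) / n                  # (1/n) * X^t * H * X
S_numpy = np.cov(X, rowvar=False, bias=True)  # columns are variables, 1/n normalization

print(np.allclose(S_matrix, S_numpy))         # True
```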



### Covariance Matrix in R

In [49]:
%%R

X1<-rnorm(30)
X2<-rnorm(30)
X3<-rnorm(30)
X4<-rnorm(30)

cov(cbind(X1,X2,X3,X4))

            X1          X2          X3          X4
X1  0.48804796  0.20208157 -0.03834544  0.15834036
X2  0.20208157  0.77934766  0.15620418 -0.07888175
X3 -0.03834544  0.15620418  0.73153554  0.11244826
X4  0.15834036 -0.07888175  0.11244826  0.70203843


### Covariance Matrix in Python

In [50]:
X1 = np.random.randn(30)
X2 = np.random.randn(30)
X3 = np.random.randn(30)
X4 = np.random.randn(30)

In [51]:
df_example = pd.DataFrame([X1 , X2 , X3, X4]).T
df_example.head(7)

Unnamed: 0,0,1,2,3
0,-1.088834,-0.40478,1.025578,-2.466817
1,0.954078,0.236629,1.486752,1.69321
2,0.390823,0.956551,1.42528,0.789308
3,0.403564,0.258544,0.97518,-1.891103
4,1.373876,1.820792,0.994045,-2.191246
5,-0.476706,-0.476832,0.911187,0.670717
6,-1.381306,-0.659748,0.868794,-1.115348


In [52]:
df_example.cov()

Unnamed: 0,0,1,2,3
0,0.878786,0.046805,0.127103,0.007163
1,0.046805,0.809996,0.223252,-0.010066
2,0.127103,0.223252,1.317426,-0.228035
3,0.007163,-0.010066,-0.228035,1.125517


## Pearson Linear Correlation


Given the variables $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$ $\hspace{0.05cm}$ and $\hspace{0.05cm}$
$X_r=(x_{1r}, x_{2r},...,x_{nr})^t$

<br>

The **Pearson linear correlation** between the variables $X_k$ and $X_r$ is defined as:

$$
r(X_k,X_r) = \frac{S(X_k,X_r)}{\sigma(X_k) \cdot \sigma(X_r)} 
$$



## Properties of Pearson linear correlation


-   $r(X_k,X_r) \in [-1,1]$

-   $r(X_k,a + b\cdot X_r) = r(X_k,X_r)$ for $b > 0$ (for $b < 0$ the sign changes)

-   The sign of $r(X_k,X_r)$ is equal to the sign of $S(X_k,X_r)$

-   $r(X_k,X_r) = \pm 1 \hspace{0.1cm} \Rightarrow \hspace{0.1cm} $ perfect linear relationship between $X_k$ and $X_r$

-   $r(X_k,X_r) = 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}$ there is no linear relationship between  $X_k$ and  $X_r$

-   $r(X_k,X_r) \rightarrow \pm 1 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}$ strong linear relationship between $X_k$ and $X_r$

-   $r(X_k,X_r) \rightarrow 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}$ weak linear relationship between $X_k$ and $X_r$

-   $r(X_k,X_r) >0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}$ positive relationship between $X_k$ and $X_r$

-   $r(X_k,X_r) <0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}$ negative relationship between  $X_k$ and $X_r$



### Pearson Correlation in R


In [53]:
%%R

cor(data_R$ingresos, data_R$horas  , method = "pearson")

[1] 0.2308945


### Pearson Correlation in Python


In [54]:
data_Python[['ingresos' , 'horas']].corr(method='pearson')

Unnamed: 0,ingresos,horas
ingresos,1.0,0.230894
horas,0.230894,1.0


In [55]:
(data_Python[['ingresos' , 'horas']].corr(method='pearson')).iloc[ [0] , [1] ]

Unnamed: 0,horas
ingresos,0.230894



## Pearson Correlation Matrix

The Pearson correlation matrix of the data matrix $X$ is :

$$
R_X = \left( \hspace{0.2cm} r(X_k , X_r) \hspace{0.2cm} \right)_{k,r=1,...,p}
$$



**Matrix expression of the correlation matrix**

$$
R_X= D_s^{-1} \cdot S_X \cdot D_s^{-1}
$$

Where:
$$
D_s = diag \left( \sigma(X_1) ,  \sigma(X_2) ,..., \sigma(X_p) \right)  
$$
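The matrix expression $R_X = D_s^{-1} \cdot S_X \cdot D_s^{-1}$ can be checked against `np.corrcoef`. The data are random (an assumption for illustration), and both $S_X$ and the standard deviations use the $1/n$ normalization so that they are consistent with each other.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))                  # toy 30 x 4 data matrix

S = np.cov(X, rowvar=False, bias=True)        # covariance matrix with 1/n
D_inv = np.diag(1.0 / X.std(axis=0))          # inverse of diag(sigma_1, ..., sigma_p)

R_matrix = D_inv @ S @ D_inv                  # D^-1 * S * D^-1
R_numpy = np.corrcoef(X, rowvar=False)        # Pearson correlation matrix

print(np.allclose(R_matrix, R_numpy))         # True
print(np.allclose(np.diag(R_matrix), 1.0))    # True: every variable has correlation 1 with itself
```

The correlation does not depend on whether $n$ or $n-1$ is used, as long as the same choice is made for the covariance and for the standard deviations.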

 


### Correlation Matrix in R


In [56]:
%%R

cor(cbind(X1,X2,X3,X4))

            X1         X2          X3         X4
X1  1.00000000  0.3276648 -0.06417483  0.2705079
X2  0.32766483  1.0000000  0.20687562 -0.1066425
X3 -0.06417483  0.2068756  1.00000000  0.1569114
X4  0.27050788 -0.1066425  0.15691144  1.0000000


### Correlation Matrix in Python

In [57]:
corr_matrix = ( df_example.corr(method='pearson') >> rename(X1=0 , X2=1 , X3=2 , X4=3 ) )

corr_matrix['']=['X1','X2', 'X3', 'X4'] 

corr_matrix = corr_matrix.set_index('')

corr_matrix

Unnamed: 0,X1,X2,X3,X4
X1,1.0,0.055477,0.118127,0.007202
X2,0.055477,1.0,0.216118,-0.010543
X3,0.118127,0.216118,1.0,-0.187268
X4,0.007202,-0.010543,-0.187268,1.0


## Frequency distribution


### Absolute frequency distribution of an element

Given a variable $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$ $\hspace{0.03cm}$ and
$\hspace{0.03cm}$ $a \in Range(X_k)$

<br>

The **absolute frequency** of the **element** $a$ in $X_k$ is defined as:

$$
Fabs(a ,X_k) \hspace{0.1cm}=\hspace{0.1cm} \# \lbrace i \hspace{0.05cm} / \hspace{0.05cm} x_{ik}=a \rbrace 
$$

$$
= \text{nº of observations of the variable} \hspace{0.12cm} X_k \hspace{0.12cm} \text{that are equal to the value} \hspace{0.12cm} a  
$$
 
<br>

For example:  $Fabs(a=1000, X_k)$ could be the number of employees of a certain company with a salary of $a=1000$ euros.

<br>

**Note:**

If $\hspace{0.05cm}$ $X_k$ $\hspace{0.05cm}$ is continuous, usually $\hspace{0.05cm}$ $Fabs(a , X_k) = 0$ $\hspace{0.05cm}$ for many values $\hspace{0.05cm}$ $a$


### Absolute frequency of an element in R

It is mainly useful for categorical variables

In [58]:
%%R

frecuencia_absoluta_elemento <- 
function( variable, elemento){
  
frecuencia_absoluta_elemento <-  
sum(variable == elemento)
  
return(frecuencia_absoluta_elemento)  
} 


Let's see how the function works :

In [59]:
%%R 

frecuencia_absoluta_elemento( round(data_R$ingresos, 3), 8736.2)  

[1] 1


In [60]:
%%R 

frecuencia_absoluta_elemento( round(data_R$ingresos, 3), 8736)  

[1] 0


In [61]:
%%R 

frecuencia_absoluta_elemento(data_R$genero , 1)

[1] 615


In [62]:
%%R 

frecuencia_absoluta_elemento(data_R$neduc , 3)  

[1] 202


### Absolute frequency  of an element in Python


In [63]:
def freq_abs_element_py( variable , element ) :

    # variable must be a pandas Series (like df['X2'])
    # element must be a constant (like a number or a string)

    freq_abs_element = ( variable == element ).sum() 
    
    return freq_abs_element 

Let's see how the function works :

In [64]:
freq_abs_element_py( data_Python['genero'] , 1)

615

We can check it:

In [65]:
(data_Python['genero'] == 1).sum()

615

In [66]:
freq_abs_element_py( data_Python['ingresos'] , 8736.2)

1

In [67]:
freq_abs_element_py( data_Python['ingresos'] , 8736)

0

In [68]:
freq_abs_element_py( data_Python['neduc'] , 3)

202


## Absolute frequency of a set

 Given a variable  $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$ $\hspace{0.05cm}$ and $\hspace{0.05cm}$ $A \subset Range(X_k)$

 <br>

The absolute frequency of the set $A$ in $X_k$ is defined as:

$$
Fabs(A, X_k) = \sum_{a \in A} Fabs(a , X_k ) = 
$$

$$
= \text{ nº of observations of} \hspace{0.1cm} X_k \hspace{0.1cm} \text{that belong to} \hspace{0.1cm} A
$$
 
<br>

For example:  $Fabs(A=[500,1500] , X_k)$  could be the number of employees with a salary between 500 and 1500 euros.

<br>

**Note :**

$Fabs([b_1,b_2], X_k)$ $\hspace{0.1cm}$ is a particular case of $\hspace{0.1cm}$ $Fabs(A, X_k)$ $\hspace{0.1cm}$ with $\hspace{0.1cm}$ $A=[b_1,b_2]$

 



### Absolute Frequency of an interval in R

 Useful for quantitative variables (especially continuous ones)


In [69]:
%%R

frecuencia_absoluta_intervalo <- 
  
function( variable, cota_inferior, cota_superior ){
  
frecuencia_absoluta_intervalo <-  

sum(variable >= cota_inferior & 
variable <= cota_superior)
  
return(frecuencia_absoluta_intervalo)  
}

In [70]:
%%R

frecuencia_absoluta_intervalo(data_R$ingresos , 8000, 12000)

[1] 305


In [71]:
%%R

frecuencia_absoluta_intervalo(data_R$neduc , 3, 4)

[1] 202


### Absolute Frequency of an interval in Python

In [72]:
def freq_abs_interval_py( df, variable , lower_bound , upper_bound) :


    # To use this function you must have the dfply package installed
    # df is a data-frame
    # variable is the name (a string) of a column of the data-frame df
    # lower_bound and upper_bound are numbers such that lower_bound <= upper_bound
   
    freq_abs_interval = len( df >> filter_by( (X[variable] >= lower_bound) & (X[variable] <= upper_bound)) ) 
    return freq_abs_interval

In [73]:
freq_abs_interval_py(data_Python , 'ingresos' , 8000 , 12000)

305

In [74]:
len(data_Python >> filter_by( (X.ingresos >= 8000) & (X.ingresos <= 12000 ))) 

305

In [75]:
len(data_Python >> filter_by( (X['ingresos'] >= 8000) & (X['ingresos'] <= 12000 ))) 

305

To use `freq_abs_interval_py` with categorical variables that are encoded as numbers, we first have to convert them to integer or float:

In [76]:
## The following code gives an error:

# freq_abs_interval_py(data_Python , 'neduc' , 3 , 4)

TypeError: Unordered Categoricals can only compare equality or not

In [None]:
data_Python['neduc'] = data_Python['neduc'].astype('int')

In [None]:
freq_abs_interval_py(data_Python , 'neduc' , 3 , 4)

202


### Absolute Frequency of a discrete set in R

Useful for categorical and discrete quantitative variables.

In [86]:
%%R

frecuencia_absoluta_conjunto_discreto <- 

function(variable, conjunto){
cont=0
for( i in conjunto){
  
  if( any(variable == i) ) {
   
  cont = cont + sum(variable==i)
  }
}
return(cont)  
}

In [89]:
%%R

A=c(3, 4)

frecuencia_absoluta_conjunto_discreto(data_R$neduc , A)

[1] 202


In [88]:
%%R

A=c(1,3,4)

frecuencia_absoluta_conjunto_discreto(data_R$neduc , A)

[1] 645


### Absolute Frequency of a discrete set in Python


In [81]:
def freq_abs_set_py(variable , A) :

    cont=0

    for i in A : 

        if any(variable == i) :

            cont = cont + (variable == i).sum()

    return cont

In [83]:
A = pd.Series([ 3 , 4])

freq_abs_set_py(data_Python['neduc'] , A)

202

In [90]:
A = pd.Series([1, 3 , 4])

freq_abs_set_py(data_Python['neduc'] , A)

645
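
The same count can also be obtained in a single pass with a membership test. The sketch below is an assumption-laden alternative, not the function used above: `freq_abs_set_sketch` is a hypothetical name, it works on any iterable of values, and the toy list is invented for illustration. One difference is worth noting: because it converts `A` to a set, a value repeated in `A` is counted only once, whereas the loop-based version would count it once per repetition.

```python
def freq_abs_set_sketch(variable, A):
    # Hypothetical helper: number of observations of `variable`
    # whose value belongs to the collection A.
    members = set(A)  # duplicates in A are collapsed here
    return sum(1 for value in variable if value in members)

# Toy data invented for illustration (education levels coded 1 to 4)
neduc_toy = [1, 2, 2, 3, 1, 4, 2, 3]

count_3_4 = freq_abs_set_sketch(neduc_toy, [3, 4])
count_1_3_4 = freq_abs_set_sketch(neduc_toy, [1, 3, 4])

print(count_3_4)    # 3
print(count_1_3_4)  # 5
```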


## Relative frequency of an element

 

Given a variable $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$ and
 $a \in Recorrido(X_k)$

 

The **relative frequency** of the **element** $a$ in $X_k$ is defined as:

$$
Fre(a,X_k) =  \dfrac{Fabs(a,X_k) }{n} = \\ =\text{ proportion of observations of $X_k$ that are equal to the value $a$}
$$
 
 

For example:  $Fre(a=1000,X_k)$ could be the proportion of employees of
a certain company with a salary of 1000 euros.

 



## Relative frequency of a set

 

Given a variable  $X_k=(x_{1k}, x_{2k},...,x_{nk})^t$  and
 $A \subset Recorrido(X_k)$

The **relative frequency** of the **set** $A$ in $X_k$ is defined as:

$$
Fre(A,X_k) =  \dfrac{Fabs(A ,X_k) }{n} = \\ = \text{ proportion of observations of $X_k$ that belong to $A$}
$$

 

For example:  $Fre(A=[500,1500],X_k)$ could be the proportion of
employees of a certain company with a salary between 500 and 1500 euros.

 



### Relative Frequency of an element in R

```{r}
frecuencia_relativa_elemento <- 

function( variable , elemento ){
  
frecuencia_relativa_elemento <-

frecuencia_absoluta_elemento(variable , elemento)/
  length(variable)
  
  return(frecuencia_relativa_elemento)

}
```

```{r}
frecuencia_relativa_elemento(Datos$genero, 1)
```
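
A Python counterpart could look like the minimal sketch below. It is an assumption, not code from the original notebook: `rel_freq_element_sketch` is a hypothetical name, it accepts any iterable of values, and the toy list stands in for a gender column coded as 1 and 2.

```python
def rel_freq_element_sketch(variable, element):
    # Hypothetical helper: proportion of observations equal to `element`.
    values = list(variable)
    abs_freq = sum(1 for v in values if v == element)  # absolute frequency
    return abs_freq / len(values)                      # divide by n

# Toy data invented for illustration
genero_toy = [1, 2, 1, 1, 2]

rel_freq_1 = rel_freq_element_sketch(genero_toy, 1)
print(rel_freq_1)  # 0.6
```

By construction, the relative frequencies of all the distinct values of a variable add up to 1.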

 

### Relative Frequency of an interval in R

```{r}
frecuencia_relativa_intervalo <- 
  
  function( variable, cota_inferior, cota_superior ){
  
  frecuencia_relativa_intervalo <- 
  frecuencia_absoluta_intervalo(variable , 
    cota_inferior, cota_superior)/
    length(variable)
  
  return(frecuencia_relativa_intervalo)
  
}
```

```{r}
frecuencia_relativa_intervalo(Datos$ingresos, 8000, 12000)
```
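
The same idea can be sketched in Python. The helper `rel_freq_interval_sketch` below is a hypothetical name and the income values are invented for illustration, so the result is not comparable with the one obtained on the real data-set.

```python
def rel_freq_interval_sketch(variable, lower_bound, upper_bound):
    # Hypothetical helper: proportion of observations that lie in
    # the closed interval [lower_bound, upper_bound].
    values = list(variable)
    abs_freq = sum(1 for v in values if lower_bound <= v <= upper_bound)
    return abs_freq / len(values)

# Toy incomes invented for illustration
ingresos_toy = [7500.0, 8000.0, 9500.5, 12000.0, 15000.0, 11000.0, 20000.0, 8736.2]

rel_freq_8000_12000 = rel_freq_interval_sketch(ingresos_toy, 8000, 12000)
print(rel_freq_8000_12000)  # 0.625
```

Both bounds are included, which matches the `>=` and `<=` comparisons used in the absolute-frequency functions.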

 

### Relative Frequency of a discrete set in R

```{r}
frecuencia_relativa_conjunto_discreto <- 
function( variable, conjunto ){
  
frecuencia_relativa_conjunto_discreto <-
frecuencia_absoluta_conjunto_discreto(variable,conjunto)/
  length(variable)
  
  return(frecuencia_relativa_conjunto_discreto)
  
}
```

```{r}
A=c(1, 3)

frecuencia_relativa_conjunto_discreto(Datos$neduc, A)
```
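
A possible Python version is sketched below under the same caveats: `rel_freq_set_sketch` is a hypothetical name and the education levels are toy values, not the real `neduc` column.

```python
def rel_freq_set_sketch(variable, A):
    # Hypothetical helper: proportion of observations whose value
    # belongs to the discrete set A.
    values = list(variable)
    members = set(A)
    abs_freq = sum(1 for v in values if v in members)
    return abs_freq / len(values)

# Toy data invented for illustration (education levels coded 1 to 4)
neduc_toy = [1, 2, 2, 3, 1, 4, 2, 3]

rel_freq_1_3 = rel_freq_set_sketch(neduc_toy, [1, 3])
print(rel_freq_1_3)  # 0.5
```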

 



## Cumulative Frequencies

 

The **cumulative absolute frequency** of the element $a$ in $X_k$ is defined
as:

$$
Fabscum(a ,X_k)=Fabs \left( \lbrace   i=1,...,n  / x_{ik} \leq a  \rbrace , X_k \right) = \\ = \text{number of observations of $X_k$ that are less than or equal to $a$}
$$
 

The **cumulative relative frequency** of the element $a$ in $X_k$ is defined
as: 

$$
Frecum(a,X_k)= \dfrac{Fabscum(a,X_k)}{n} \\ = \text{proportion of observations of $X_k$ that are less than or equal to $a$}
$$

 



### Cumulative Absolute Frequency in R

```{r}
frecuencia_absoluta_acumulada <- 

function(variable, elemento){
  
frecuencia_absoluta_acumulada <- 
frecuencia_absoluta_intervalo(variable,-Inf,elemento)
    
 return(frecuencia_absoluta_acumulada)
}
```

```{r}
frecuencia_absoluta_acumulada(Datos$ingresos, 8000)
```
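
The R function reuses the interval frequency with `-Inf` as the lower bound. A minimal Python sketch of the same trick follows; the names `freq_abs_interval_sketch` and `cum_abs_freq_sketch` are hypothetical, and the incomes are invented for illustration.

```python
def freq_abs_interval_sketch(variable, lower_bound, upper_bound):
    # Number of observations in the closed interval [lower_bound, upper_bound]
    return sum(1 for v in variable if lower_bound <= v <= upper_bound)

def cum_abs_freq_sketch(variable, element):
    # Cumulative absolute frequency: the interval (-inf, element]
    return freq_abs_interval_sketch(variable, float("-inf"), element)

# Toy incomes invented for illustration
ingresos_toy = [7500.0, 8000.0, 9500.5, 12000.0, 15000.0, 11000.0, 20000.0, 8736.2]

cum_abs_8000 = cum_abs_freq_sketch(ingresos_toy, 8000)
print(cum_abs_8000)  # 2
```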

 

### Cumulative Relative Frequency in R

```{r}
frecuencia_relativa_acumulada <- 

function(variable, elemento){
  
frecuencia_relativa_acumulada <-

frecuencia_absoluta_acumulada(variable,elemento)/
  length(variable)
    
  return(frecuencia_relativa_acumulada)
}
```

```{r}
frecuencia_relativa_acumulada(Datos$ingresos, 8000)
```
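
In Python, the cumulative relative frequency can be sketched in the same way. The example below is only illustrative: `cum_rel_freq_sketch` is a hypothetical name and the data are toy education levels. It evaluates the function at every distinct value, which shows that the result is non-decreasing and reaches 1 at the maximum.

```python
def cum_rel_freq_sketch(variable, element):
    # Hypothetical helper: proportion of observations that are
    # less than or equal to `element`.
    values = list(variable)
    return sum(1 for v in values if v <= element) / len(values)

# Toy data invented for illustration (education levels coded 1 to 4)
neduc_toy = [1, 2, 2, 3, 1, 4, 2, 3]

# Cumulative relative frequency at each distinct value, in increasing order
cum_table = {a: cum_rel_freq_sketch(neduc_toy, a) for a in sorted(set(neduc_toy))}
print(cum_table)  # {1: 0.25, 2: 0.625, 3: 0.875, 4: 1.0}
```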
