# Descriptive Statistics with Python and R


## Index:
* [Data-set](#1)
* [Selecting](#2)
* [Sampling](#3)
* [Filtering](#4)
* [Mutate](#5)
* [Arrange](#6)
* [Rename](#7)
* [Gather](#8)
* [Spread](#9)
* [Separate](#10)
* [Unite](#11)
* [Joins](#12)
* *  [Inner Join](#13)
* * [Full Join](#14)
* * [Left Join](#15)
* * [Right Join](#16)
* * [Semi Join](#17)
* * [Anti Join](#18)
* * [Union](#19)
* * [Intersect](#20)
* * [Difference](#21)
*  [Concatenate](#22)
*  [Group and Summarize](#23)
*  [Other useful functions](#24)

## Data-Set <a class="anchor" id="1"></a>

We start by loading the main data set that we will work with:

Working with `Python`:

In [119]:
import pandas as pd

from IPython.display import display
pd.options.display.max_columns = None

import warnings
warnings.filterwarnings('ignore')

In [120]:
url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Descriptive%20Statisitcs%20in%20Python%20and%20R/datosAragon.csv'

data_Python = pd.read_csv(url)

data_Python

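| | ca | datosECVmas16.prov | nomprov | gen | edad | nac | neduc | sitlab | ingnorm | horas | factorel |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 22 | Huesca | 1 | 3 | 1 | 3 | 1 | 21237.1 | 36.42 | 393.7 |
| 1 | 2 | 22 | Huesca | 2 | 2 | 1 | 2 | 1 | 17810.8 | 31.72 | 393.7 |
| 2 | 2 | 22 | Huesca | 1 | 1 | 1 | 2 | 1 | 11889.1 | 31.88 | 393.7 |
| 3 | 2 | 22 | Huesca | 1 | 1 | 1 | 2 | 1 | 16000.5 | 38.18 | 393.7 |
| 4 | 2 | 22 | Huesca | 1 | 1 | 1 | 2 | 3 | 21169.6 | 0.00 | 393.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1231 | 2 | 50 | Zaragoza | 1 | 2 | 1 | 2 | 1 | 11760.6 | 28.79 | 2206.1 |
| 1232 | 2 | 50 | Zaragoza | 1 | 2 | 1 | 2 | 2 | 19321.6 | 0.00 | 124.4 |
| 1233 | 2 | 50 | Zaragoza | 2 | 2 | 1 | 2 | 1 | 19924.8 | 37.21 | 124.4 |
| 1234 | 2 | 50 | Zaragoza | 2 | 3 | 1 | 1 | 3 | 13042.5 | 0.00 | 246.5 |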


In [121]:
# pip install dfply 

In [122]:
from dfply import *

In [123]:
list(range(2,10)) 

[2, 3, 4, 5, 6, 7, 8, 9]

In [124]:
data_Python = (data_Python.T >> row_slice( list(range(2,10)) ) ).T

data_Python

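| | nomprov | gen | edad | nac | neduc | sitlab | ingnorm | horas |
|---|---|---|---|---|---|---|---|---|
| 0 | Huesca | 1 | 3 | 1 | 3 | 1 | 21237.1 | 36.42 |
| 1 | Huesca | 2 | 2 | 1 | 2 | 1 | 17810.8 | 31.72 |
| 2 | Huesca | 1 | 1 | 1 | 2 | 1 | 11889.1 | 31.88 |
| 3 | Huesca | 1 | 1 | 1 | 2 | 1 | 16000.5 | 38.18 |
| 4 | Huesca | 1 | 1 | 1 | 2 | 3 | 21169.6 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1231 | Zaragoza | 1 | 2 | 1 | 2 | 1 | 11760.6 | 28.79 |
| 1232 | Zaragoza | 1 | 2 | 1 | 2 | 2 | 19321.6 | 0.0 |
| 1233 | Zaragoza | 2 | 2 | 1 | 2 | 1 | 19924.8 | 37.21 |
| 1234 | Zaragoza | 2 | 3 | 1 | 1 | 3 | 13042.5 | 0.0 |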
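
The double transpose used above selects the right columns, but in pandas transposing a frame with mixed column types generally yields `object` columns, so numeric columns may lose their dtype along the way. Below is a minimal sketch of a positional alternative based on `DataFrame.iloc`. The small frame `toy` and its values are hypothetical stand-ins for the real data, and the slice bounds are chosen only for this toy frame.

```python
import pandas as pd

# Hypothetical stand-in for the real data (a few columns, two rows).
toy = pd.DataFrame({
    "ca": [2, 2],
    "nomprov": ["Huesca", "Zaragoza"],
    "gen": [1, 2],
    "ingnorm": [21237.1, 13042.5],
    "factorel": [393.7, 246.5],
})

# Positional column slice: keeps columns 1, 2 and 3 and their dtypes.
by_position = toy.iloc[:, 1:4]

# Same selection through a double transpose: the columns come back as object.
by_transpose = toy.T.iloc[1:4].T

print(by_position.dtypes)
print(by_transpose.dtypes)
```

On the real data, the equivalent call would presumably be `data_Python.iloc[:, 2:10]`, which mirrors `select(3:10)` in the R code once the 0-based indexing of Python is taken into account.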


In [125]:
data_Python = data_Python >> rename( genero=X.gen , provincia=X.nomprov , ingresos=X.ingnorm ) 

data_Python

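| | provincia | genero | edad | nac | neduc | sitlab | ingresos | horas |
|---|---|---|---|---|---|---|---|---|
| 0 | Huesca | 1 | 3 | 1 | 3 | 1 | 21237.1 | 36.42 |
| 1 | Huesca | 2 | 2 | 1 | 2 | 1 | 17810.8 | 31.72 |
| 2 | Huesca | 1 | 1 | 1 | 2 | 1 | 11889.1 | 31.88 |
| 3 | Huesca | 1 | 1 | 1 | 2 | 1 | 16000.5 | 38.18 |
| 4 | Huesca | 1 | 1 | 1 | 2 | 3 | 21169.6 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1231 | Zaragoza | 1 | 2 | 1 | 2 | 1 | 11760.6 | 28.79 |
| 1232 | Zaragoza | 1 | 2 | 1 | 2 | 2 | 19321.6 | 0.0 |
| 1233 | Zaragoza | 2 | 2 | 1 | 2 | 1 | 19924.8 | 37.21 |
| 1234 | Zaragoza | 2 | 3 | 1 | 1 | 3 | 13042.5 | 0.0 |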
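
For comparison, the same renaming can be done without `dfply`, using the `rename` method of pandas. This is only a sketch on a one-row hypothetical frame (`toy`). Note that the mapping direction is `old name -> new name`, the reverse of the `new=X.old` form used above.

```python
import pandas as pd

# Hypothetical one-row frame with the original column names.
toy = pd.DataFrame({"nomprov": ["Huesca"], "gen": [1], "ingnorm": [21237.1]})

# Keys are the current names, values are the new names.
renamed = toy.rename(columns={
    "gen": "genero",
    "nomprov": "provincia",
    "ingnorm": "ingresos",
})

print(list(renamed.columns))
```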


Working with `R`:

In [126]:
import rpy2

In [127]:
%load_ext rpy2.ipython



In [128]:
import rpy2.robjects as robjects

In [129]:
%%R

library(tidyverse)

url = 'https://raw.githubusercontent.com/FabioScielzoOrtiz/Estadistica4all-blog/main/Descriptive%20Statisitcs%20in%20Python%20and%20R/datosAragon.csv'

data_R <- read_csv(url)

data_R <- data_R %>% select(3:10)

data_R <- data_R %>% rename("genero"="gen",
         "provincia"="nomprov", "ingresos"="ingnorm")

data_R <- as.data.frame(data_R)

Rows: 1236 Columns: 11
-- Column specification --------------------------------------------------------
Delimiter: ","
chr  (1): nomprov
dbl (10): ca, datosECVmas16.prov, gen, edad, nac, neduc, sitlab, ingnorm, ho...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [130]:
%%R

head(data_R)

  provincia genero edad nac neduc sitlab ingresos horas
1    Huesca      1    3   1     3      1  21237.1 36.42
2    Huesca      2    2   1     2      1  17810.8 31.72
3    Huesca      1    1   1     2      1  11889.1 31.88
4    Huesca      1    1   1     2      1  16000.5 38.18
5    Huesca      1    1   1     2      3  21169.6  0.00
6    Huesca      1    2   1     2      1  16001.3 34.52


The variables are defined as follows:

-   **provincia**: the province of the Aragon region to which each individual in the sample belongs.

-   **genero**: the sex of the individual. It takes the value 1 for male and 2 for female.

-   **edad**: the age range of the individual.
    It takes the value 1 for ages 16 to 24, 2 for ages 25 to 49, 3 for ages 50 to 64, and 4 for ages 65 or older.

-   **nacionalidad** (**nac**): the nationality of the individual. It takes the value 1 for Spain and 2 for any other country.

-   **situacion laboral** (**sitlab**): the employment status of the individual. It takes the value 1 if they are working, 2 if they are unemployed, and 3 if they are inactive.

-   **ingresos**: the income of the individual.

-   **horas**: the number of hours worked by the individual.
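
Since several variables are stored as numeric codes, it can help to decode them into readable labels before building tables or plots. The sketch below does this with `Series.map` on a small hypothetical sample. The codes follow the definitions above, while the English label strings and the names `sample`, `labels` and `decoded` are our own choices.

```python
import pandas as pd

# Hypothetical mini-sample using the same column names and codes as above.
sample = pd.DataFrame({
    "genero": [1, 2, 1],
    "edad": [3, 2, 1],
    "sitlab": [1, 1, 3],
})

# Code-to-label dictionaries, one per coded variable.
labels = {
    "genero": {1: "male", 2: "female"},
    "edad": {1: "16-24", 2: "25-49", 3: "50-64", 4: "65+"},
    "sitlab": {1: "working", 2: "unemployed", 3: "inactive"},
}

# Replace each code by its label, leaving the original frame untouched.
decoded = sample.copy()
for column, mapping in labels.items():
    decoded[column] = sample[column].map(mapping)

print(decoded)
```

Any code that is missing from a dictionary becomes `NaN` under `map`, which makes unexpected values easy to spot.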

## Statistical Variable <a class="anchor" id="2"></a>

$$
X_k= \begin{pmatrix}
x_{1k} \\
x_{2k}\\
... \\
x_{nk} 
\end{pmatrix} 
$$

is a **statistical variable**: a vector containing the values (observations) of the variable $X_k$ for the $n$ individuals (elements) of a sample.

Here, $x_{ik}$ is the value (observation) of the variable $X_k$ for element $i$ of the sample, that is, the $i$-th observation of $X_k$.
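
In pandas terms, a statistical variable corresponds to a single column of a data frame. A minimal sketch follows, using the first three `ingresos` and `horas` values shown earlier. The names `data`, `X_k`, `n` and `x_2k` are illustrative only.

```python
import pandas as pd

# First three observations of two variables from the data set shown earlier.
data = pd.DataFrame({
    "ingresos": [21237.1, 17810.8, 11889.1],
    "horas": [36.42, 31.72, 31.88],
})

X_k = data["ingresos"]   # the statistical variable X_k as a vector
n = len(X_k)             # number of individuals in the sample
x_2k = X_k.iloc[1]       # x_{2k}: Python is 0-based, so i = 2 is position 1

print(n, x_2k)
```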

## Data Matrix   <a class="anchor" id="3"></a>

In general, suppose we have observed $p$ variables on a sample
$\varepsilon$ of $n$ elements or individuals.

The data matrix $X$ of the variables $X_1,...,X_p$ measured on the
$n$ individuals or elements of $\varepsilon$ is:

$$
X= \begin{pmatrix}
x_{11} & x_{12}&...&x_{1p}\\
x_{21} & x_{22}&...&x_{2p}\\
...&...&...&...\\
x_{n1}& x_{n2}&...&x_{np}
\end{pmatrix}
$$

Note: $X$ is an $n \times p$ matrix.
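
The same idea can be checked numerically. The sketch below builds a small data matrix with NumPy from the first three rows of `genero`, `edad`, `ingresos` and `horas` shown earlier, so here $n = 3$ and $p = 4$. The variable names are illustrative.

```python
import numpy as np

# Rows are individuals, columns are the variables genero, edad, ingresos, horas.
X = np.array([
    [1, 3, 21237.1, 36.42],
    [2, 2, 17810.8, 31.72],
    [1, 1, 11889.1, 31.88],
])

n, p = X.shape        # n individuals, p variables
x_23 = X[1, 2]        # x_{23}: second individual, third variable (0-based indices)
X_3 = X[:, 2]         # the third column, i.e. the statistical variable X_3

print(n, p, x_23)
```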