# **Introduction to R for Data Analysis**
## Chapter 3: Data Manipulation

**Author:** Juan Martin Bellido  

**Description**  
In this chapter we will learn to use the *dplyr* library to perform advanced data manipulation operations.

**Feedback? Comments?** Please share them with me on [LinkedIn](https://www.linkedin.com/in/jmartinbellido/)  



## INDEX
---
1. Select columns
2. Filter rows
3. Sort a *data frame*
4. Create new fields
5. Aggregate data
6. Exercises


Conventions used in this document
> 👉 *This is a note or observation*

> ⚠️ *This is a warning*

In [None]:
# install the "data.table" library because Google Colab does not include it by default
install.packages("data.table")

In [None]:
# load the libraries we will use
library(dplyr)
library(data.table)

In [4]:
# (optional) edit the global options to prevent R from using scientific notation
options(scipen=999)

# 1. Select columns 
---

We begin by learning how to reduce the number of fields (columns) in a *data frame*.

### Selecting columns

To select specific columns in a *data frame* we use the function `dplyr::select()`.

```
dplyr::select(df, field_1, field_2, ...)
```



In [5]:
# import a df
df_james_bond = data.table::fread("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
glimpse(df_james_bond)

Rows: 26
Columns: 7
$ Film                <chr> "Dr. No", "From Russia with Love", "Goldfinger", "…
$ Year                <int> 1962, 1963, 1964, 1965, 1967, 1967, 1969, 1971, 19…
$ Actor               <chr> "Sean Connery", "Sean Connery", "Sean Connery", "S…
$ Director            <chr> "Terence Young", "Terence Young", "Guy Hamilton", …
$ `Box Office`        <dbl> 448.8, 543.8, 820.4, 848.1, 315.0, 514.2, 291.5, 4…
$ Budget              <dbl> 7.0, 12.6, 18.6, 41.9, 85.0, 59.9, 37.3, 34.7, 30.…
$ `Bond Actor Salary` <dbl> 0.6, 1.6, 3.2, 4.7, NA, 4.4, 0.6, 5.8, NA, NA, NA,…


In [6]:
# we start by selecting two specific columns using the basic syntax
dplyr::select(df_james_bond, Film, Director)

Film,Director
<chr>,<chr>
Dr. No,Terence Young
From Russia with Love,Terence Young
Goldfinger,Guy Hamilton
Thunderball,Terence Young
Casino Royale,Ken Hughes
You Only Live Twice,Lewis Gilbert
On Her Majesty's Secret Service,Peter R. Hunt
Diamonds Are Forever,Guy Hamilton
Live and Let Die,Guy Hamilton
The Man with the Golden Gun,Guy Hamilton



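Base R does allow nested function calls, such as `select(filter(df, ...), ...)`, but nested calls must be read from the inside out and quickly become hard to follow. To chain operations in the order in which they run we use the *pipe operator* (`%>%`), which comes from the *magrittr* library and becomes available when we load *dplyr*. Since version 4.1, R also ships a native pipe (`|>`).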


```
object %>% function() %>% function() ...
```
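
To make the equivalence concrete, here is a minimal sketch comparing a nested call with its piped form. The `films` data frame and its columns are hypothetical, invented only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
films <- data.frame(
  title = c("A", "B", "C"),
  year  = c(1999, 2005, 2012)
)

# Nested form: read from the inside out (filter first, then select)
nested <- select(filter(films, year > 2000), title)

# Piped form: read from left to right, in execution order
piped <- films %>% filter(year > 2000) %>% select(title)

# Both forms should produce the same result
identical(nested, piped)
```

The pipe simply passes the object on its left as the first argument of the function on its right, so both versions run the same operations.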


In [7]:
# we repeat the previous operation, this time using the "pipe operator"
df_james_bond %>% select(
  Film        # column 1
  ,Director   # column 2
)

Film,Director
<chr>,<chr>
Dr. No,Terence Young
From Russia with Love,Terence Young
Goldfinger,Guy Hamilton
Thunderball,Terence Young
Casino Royale,Ken Hughes
You Only Live Twice,Lewis Gilbert
On Her Majesty's Secret Service,Peter R. Hunt
Diamonds Are Forever,Guy Hamilton
Live and Let Die,Guy Hamilton
The Man with the Golden Gun,Guy Hamilton


In [8]:
# additionally, we can rename variables while selecting columns
df_james_bond %>% select(
  james_bond_film = Film                    # select a first column and rename it
  ,film_director = Director                 # select a second column and rename it
  ,film_budget = Budget                     # select a third column and rename it
  ,bond_actor_salary = `Bond Actor Salary`  # by giving the variable a name without spaces, we no longer need the backticks
)

james_bond_film,film_director,film_budget,bond_actor_salary
<chr>,<chr>,<dbl>,<dbl>
Dr. No,Terence Young,7.0,0.6
From Russia with Love,Terence Young,12.6,1.6
Goldfinger,Guy Hamilton,18.6,3.2
Thunderball,Terence Young,41.9,4.7
Casino Royale,Ken Hughes,85.0,
You Only Live Twice,Lewis Gilbert,59.9,4.4
On Her Majesty's Secret Service,Peter R. Hunt,37.3,0.6
Diamonds Are Forever,Guy Hamilton,34.7,5.8
Live and Let Die,Guy Hamilton,30.8,
The Man with the Golden Gun,Guy Hamilton,27.7,


In [9]:
# we can negate (-) columns to exclude them
df_james_bond %>% select(
  -Director
  ,-Budget
)
# in this case, we have selected ALL the variables in the data frame except the two we negated

Film,Year,Actor,Box Office,Bond Actor Salary
<chr>,<int>,<chr>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,448.8,0.6
From Russia with Love,1963,Sean Connery,543.8,1.6
Goldfinger,1964,Sean Connery,820.4,3.2
Thunderball,1965,Sean Connery,848.1,4.7
Casino Royale,1967,David Niven,315.0,
You Only Live Twice,1967,Sean Connery,514.2,4.4
On Her Majesty's Secret Service,1969,George Lazenby,291.5,0.6
Diamonds Are Forever,1971,Sean Connery,442.5,5.8
Live and Let Die,1973,Roger Moore,460.3,
The Man with the Golden Gun,1974,Roger Moore,334.0,


In [10]:
# we can force unique combinations using the unique() function
df_james_bond %>% select(
  Director                      # select the Director variable
) %>% unique()                  # keep unique values only

Director
<chr>
Terence Young
Guy Hamilton
Ken Hughes
Lewis Gilbert
Peter R. Hunt
John Glen
Irvin Kershner
Martin Campbell
Roger Spottiswoode
Michael Apted
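
As a complement to `unique()`, *dplyr* offers its own `distinct()` function, which selects and de-duplicates in a single step. The sketch below uses a hypothetical `directors` data frame, invented for this illustration, to compare both approaches.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
directors <- data.frame(
  director = c("Terence Young", "Guy Hamilton", "Terence Young"),
  year     = c(1962, 1964, 1965)
)

# select() followed by base R unique()
via_unique <- directors %>% select(director) %>% unique()

# dplyr::distinct() keeps only the named column and drops repeated values
via_distinct <- directors %>% distinct(director)

via_distinct
```

Both results contain the same two directors; which one to use is largely a matter of style.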


# 2. Filter rows
---
Next, we will learn a simple way to filter observations based on logical criteria using the function `dplyr::filter()`.


```
dplyr::filter(df, condition_1, condition_2, ...)
```



In [11]:
# filter the df using one condition
df_james_bond %>% filter(Year>2000) # only films released after 2000

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [12]:
# we can combine multiple manipulation functions using the "pipe operator"
# the order of the functions is essential, because the operations are applied step by step
df_james_bond %>% select(Film,Year) %>% filter(Year>2000) 
# if we had not selected the "Year" column, we could not filter by that field afterwards

Film,Year
<chr>,<int>
Die Another Day,2002
Casino Royale,2006
Quantum of Solace,2008
Skyfall,2012
Spectre,2015


In [13]:
# in the next exercise, we filter using two conditions
# adding a new condition as a separate argument is exactly the same as adding it with the "&" (AND) operator
df_james_bond %>% filter(
  Year>2000
  ,Actor == 'Daniel Craig'
)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [14]:
# we check that we get the same result using a single condition made up of two elements
# we filter for films released after 2000 whose actor is Daniel Craig
df_james_bond %>% filter(
  Year>2000 & Actor == 'Daniel Craig' 
)


Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [15]:
# it would be different if the two elements of the condition were combined with OR ("|")
df_james_bond %>% filter(
  Year>2000 | Actor == 'Daniel Craig' 
)
# we filter for films released after 2000 or whose actor is Daniel Craig

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [16]:
# filter a field against a vector, using the "%in%" operator
df_james_bond %>% filter(
  Actor %in% c('Daniel Craig','Pierce Brosnan','Sean Connery') 
)
# we filter for films whose actor is any of those specified

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5


In [17]:
# we can negate conditions using the "!" (negation) operator
df_james_bond %>% filter(
  !Actor %in% c('Daniel Craig','Pierce Brosnan','Sean Connery') 
)
# we filter for films whose actor is NOT any of those specified

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2
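
One detail worth keeping in mind when filtering columns that contain missing values, such as `Bond Actor Salary`: `filter()` keeps only the rows where the condition is `TRUE`, so rows where it evaluates to `NA` are dropped. The sketch below illustrates this with a hypothetical `salaries` data frame, invented for this example.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
salaries <- data.frame(
  film   = c("A", "B", "C"),
  salary = c(0.6, NA, 5.8)
)

# Rows where the condition is NA are dropped, just like rows where it is FALSE
above_one <- salaries %>% filter(salary > 1)

# To inspect or keep missing values, test for them explicitly with is.na()
missing_salary       <- salaries %>% filter(is.na(salary))
above_one_or_missing <- salaries %>% filter(salary > 1 | is.na(salary))

above_one_or_missing
```

Making the `is.na()` test explicit documents whether missing values are meant to be excluded or retained.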


# 3. Sort a *data frame*
---

In [18]:
df_james_bond = data.table::fread("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
glimpse(df_james_bond)

Rows: 26
Columns: 7
$ Film                <chr> "Dr. No", "From Russia with Love", "Goldfinger", "…
$ Year                <int> 1962, 1963, 1964, 1965, 1967, 1967, 1969, 1971, 19…
$ Actor               <chr> "Sean Connery", "Sean Connery", "Sean Connery", "S…
$ Director            <chr> "Terence Young", "Terence Young", "Guy Hamilton", …
$ `Box Office`        <dbl> 448.8, 543.8, 820.4, 848.1, 315.0, 514.2, 291.5, 4…
$ Budget              <dbl> 7.0, 12.6, 18.6, 41.9, 85.0, 59.9, 37.3, 34.7, 30.…
$ `Bond Actor Salary` <dbl> 0.6, 1.6, 3.2, 4.7, NA, 4.4, 0.6, 5.8, NA, NA, NA,…



The function `dplyr::arrange()` lets us set a criterion for sorting the rows of a *data frame*.

```
dplyr::arrange(df, column_1, column_2, ...)
```



In [19]:
# sort the data frame by a text variable
df_james_bond %>% arrange(Actor) 
# by default the order is ascending; for a text field this means alphabetical order

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


In [20]:
# sort by two variables
df_james_bond %>% arrange(Actor,`Box Office`)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1


In [21]:
# to switch to descending order, we wrap the column in the desc() function inside arrange()
df_james_bond %>% arrange(desc(`Box Office`))

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
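
Ascending and descending criteria can also be mixed in the same call. The following sketch, based on a hypothetical `films` data frame invented for this illustration, sorts by actor in ascending order and by box office in descending order; it also shows that `arrange()` places missing values at the end, even inside `desc()`.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
films <- data.frame(
  actor      = c("Moore", "Craig", "Craig", "Moore"),
  box_office = c(535.0, 581.5, 943.5, NA)
)

# actor ascending (default), box_office descending within each actor
sorted <- films %>% arrange(actor, desc(box_office))

sorted
```

The first column listed acts as the main criterion; later columns only break ties among rows that share the same value in the earlier ones.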


# 4. Create new fields
---
The function `dplyr::mutate()` lets us create new columns in a *data frame*.

```
dplyr::mutate(df, new_field_1 = expression_1, new_field_2 = expression_2, ...)
```



In [22]:
# create a new column as the quotient of two existing variables
df_james_bond %>% mutate(
  profit = `Box Office`/Budget    # we name our new column "profit"; it holds box office per unit of budget
) %>% arrange(desc(profit))       # sort the result by the newly defined column


Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,profit
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6,64.114286
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2,44.107527
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6,43.15873
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7,20.24105
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,,14.944805
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8,12.752161
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,,12.057762
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,,11.818182
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4,8.584307
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6,7.815013


In [23]:
# we repeat the previous exercise, this time using round() to round the result to a fixed number of decimals
df_james_bond %>% mutate(
  profit = round(`Box Office`/Budget,2),  # the second argument of round() sets the number of decimals
  profit_EUR = round(profit / 1.2)        # by default, round() rounds to whole numbers
) %>% arrange(desc(profit_EUR))


Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,profit,profit_EUR
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6,64.11,53
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2,44.11,37
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6,43.16,36
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7,20.24,17
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,,14.94,12
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8,12.75,11
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,,12.06,10
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,,11.82,10
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4,8.58,7
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6,7.82,7


#### New variables from logical tests

When working with data, we often want to create new variables by assigning a value to each row based on a logical test, rather than by applying arithmetic to existing variables.

Two functions are particularly popular in R for this task:

*   `if_else()`
*   `case_when()`




We start with the function `dplyr::if_else()`.

```
dplyr::if_else(condition, value_if_true, value_if_false)
```
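
Before applying it to the data frame, it may help to see how `if_else()` behaves on a plain vector, in particular when the condition is `NA`. The sketch below uses a hypothetical `budget` vector, invented for this illustration, together with the optional `missing` argument of `if_else()`.

```r
library(dplyr)

# Hypothetical toy vector, invented for this illustration
budget <- c(7.0, 154.2, NA)

# The optional "missing" argument sets the value used when the condition is NA
segment <- if_else(
  budget > 100,
  "HIGH BUDGET",
  "LOW BUDGET",
  missing = "UNKNOWN"
)

segment
```

Without `missing`, the third element would simply be `NA`. Note also that `if_else()` expects the "true" and "false" values to be of compatible types, for example both character.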

In [24]:
# when defining a new variable with mutate(), we combine it with if_else() to build a logical test
# we create a new variable that classifies films as TOP MOVIE vs. NOT IN THE TOP
df_james_bond %>% mutate(
  film_segment = if_else(
    Actor == 'Sean Connery' | Budget> 100,'TOP MOVIE','NOT IN THE TOP'
  )
)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,film_segment
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6,TOP MOVIE
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6,TOP MOVIE
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2,TOP MOVIE
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7,TOP MOVIE
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,,NOT IN THE TOP
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4,TOP MOVIE
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6,NOT IN THE TOP
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8,TOP MOVIE
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,,NOT IN THE TOP
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,,NOT IN THE TOP


In [25]:
# we can nest if_else() calls to build more complex tests
df_james_bond %>% mutate(
  film_segment = if_else(
    Actor == 'Sean Connery' | Budget> 100
    ,'1st CLASS MOVIE'
    ,if_else(
      Actor == 'Roger Moore' | Budget> 100
      ,'2nd CLASS MOVIE'
      ,'NOT IN THE TOP'
    )
  )
)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,film_segment
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6,1st CLASS MOVIE
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6,1st CLASS MOVIE
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2,1st CLASS MOVIE
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7,1st CLASS MOVIE
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,,NOT IN THE TOP
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4,1st CLASS MOVIE
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6,NOT IN THE TOP
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8,1st CLASS MOVIE
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,,2nd CLASS MOVIE
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,,2nd CLASS MOVIE


Alternatively, we can use the function `dplyr::case_when()`.

```
dplyr::case_when(
   condition_1 ~ value_if_true
  ,condition_2 ~ value_if_true
  ,condition_3 ~ value_if_true
  ...
)
```

In [26]:
# the case_when() function, modeled on the SQL CASE WHEN expression, offers a cleaner syntax for building complex conditions
# note that conditions are applied in order and do not need to be mutually exclusive
# for each row, the first condition that is met determines the value; later conditions are ignored for that row

df_james_bond %>% mutate(
  film_segment = case_when(
    Actor == 'Sean Connery' | Budget > 100 ~ '1st CLASS MOVIE',     # condition 1
    Actor == 'Roger Moore' | Budget > 100 ~ '2nd CLASS MOVIE',      # condition 2
    Actor == 'Daniel Craig' | Budget > 100 ~ '3rd CLASS MOVIE',     # condition 3
    TRUE ~ 'NOT IN THE TOP'                                         # (optional) TRUE always holds, so it works as "if none of the others is met"
  )
)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary,film_segment
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6,1st CLASS MOVIE
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6,1st CLASS MOVIE
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2,1st CLASS MOVIE
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7,1st CLASS MOVIE
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,,NOT IN THE TOP
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4,1st CLASS MOVIE
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6,NOT IN THE TOP
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8,1st CLASS MOVIE
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,,2nd CLASS MOVIE
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,,2nd CLASS MOVIE
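
Because the first matching condition wins, the order in which conditions are written changes the result. The sketch below shows this on a hypothetical `budget` vector, invented for this illustration, by writing the same two conditions in both orders.

```r
library(dplyr)

# Hypothetical toy vector, invented for this illustration
budget <- c(50, 120, 250)

# Most restrictive condition first: each value lands in the intended segment
ordered_ok <- case_when(
  budget > 200 ~ "VERY HIGH",
  budget > 100 ~ "HIGH",
  TRUE         ~ "REGULAR"
)

# Conditions swapped: "budget > 100" also captures 250,
# so "VERY HIGH" is never assigned
ordered_wrong <- case_when(
  budget > 100 ~ "HIGH",
  budget > 200 ~ "VERY HIGH",
  TRUE         ~ "REGULAR"
)

ordered_ok
ordered_wrong
```

A practical rule of thumb is to list conditions from the most specific to the most general.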


# 5. Aggregate data
---

An aggregation summarises the data in a *data frame*, changing the original unit of observation and taking the information to a higher level of abstraction. Every aggregation uses a specific aggregation function to determine the type of operation.

*Basic aggregation functions*

| Function       | Description        |
|----------------|--------------------|
| *sum()*        | Sum                |
| *mean()*       | Mean               |
| *median()*     | Median             |
| *sd()*         | Standard deviation |
| *min()*        | Minimum            |
| *max()*        | Maximum            |
| *n()*          | Count              |
| *n_distinct()* | Count distinct     |
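
As a quick orientation, the sketch below applies every function in the table to a hypothetical `toy` data frame, invented for this illustration. Note that `n()` can only be used inside *dplyr* verbs such as `summarise()`, whereas the other functions also work on plain vectors.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 2, 3, 6)
)

stats <- toy %>% summarise(
  total   = sum(value),        # 1 + 2 + 3 + 6
  average = mean(value),
  middle  = median(value),
  spread  = sd(value),         # sample standard deviation
  lowest  = min(value),
  highest = max(value),
  rows    = n(),               # number of observations
  groups  = n_distinct(group)  # number of distinct values in "group"
)

stats
```

The result is a single row, because without grouping the whole data frame is treated as one unit.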


In [27]:
# import a df
df_james_bond = data.table::fread("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
glimpse(df_james_bond)

Rows: 26
Columns: 7
$ Film                <chr> "Dr. No", "From Russia with Love", "Goldfinger", "…
$ Year                <int> 1962, 1963, 1964, 1965, 1967, 1967, 1969, 1971, 19…
$ Actor               <chr> "Sean Connery", "Sean Connery", "Sean Connery", "S…
$ Director            <chr> "Terence Young", "Terence Young", "Guy Hamilton", …
$ `Box Office`        <dbl> 448.8, 543.8, 820.4, 848.1, 315.0, 514.2, 291.5, 4…
$ Budget              <dbl> 7.0, 12.6, 18.6, 41.9, 85.0, 59.9, 37.3, 34.7, 30.…
$ `Bond Actor Salary` <dbl> 0.6, 1.6, 3.2, 4.7, NA, 4.4, 0.6, 5.8, NA, NA, NA,…


### Introduction to aggregations

We use the function `dplyr::summarise()` to perform an aggregation.

```
dplyr::summarise(df, agg_1, agg_2, ...)
```



In [28]:
# we create a first aggregation, computing the total sum of the "Box Office" variable
## this can be read as the total revenue generated by all the James Bond films
df_james_bond %>% summarise(
  total_box_office = sum(`Box Office`)    # we name the variable and use an aggregation function to set the criterion
)

total_box_office
<dbl>
12781.9


In [29]:
# missing values (NA) can cause problems
## we use an additional argument in the aggregation function so that observations with NA values are ignored
## in the example below, we sum the actors' salaries across all films
df_james_bond %>% summarise(
  total_bond_salary_v1 = sum(`Bond Actor Salary`)                
  ,total_bond_salary_v2 = sum(`Bond Actor Salary`,na.rm=TRUE)    # the second argument "na.rm = TRUE" (NA remove) skips NA observations in the calculation
)
## the first variable will be NA, because at least one observation is NA

total_bond_salary_v1,total_bond_salary_v2
<dbl>,<dbl>
,123.3


In [30]:
# next, we use several aggregation functions to explore the data available in the dataset
df_james_bond %>% summarise(
  avg_box_office = mean(`Box Office`)
  ,avg_budget = mean(Budget)
  ,avg_bond_actor_salary = mean(`Bond Actor Salary`,na.rm = TRUE)
)

avg_box_office,avg_budget,avg_bond_actor_salary
<dbl>,<dbl>,<dbl>
491.6115,80.71923,6.85


### Grouped aggregations

We can group aggregations using the function `dplyr::group_by()`.

```
dplyr::group_by(df, field_1, field_2, ...)
```





In [31]:
# in the exercise below, we compute metrics by film Director
# note: the original unit of observation in the data frame is "film" (one film per row)
# we now want to move it to a higher level: director (there are several films per director)

df_james_bond %>% group_by(
  Director
) %>% summarise(
  avg_box_office = mean(`Box Office`)
  ,median_budget = median(Budget)
  ,avg_bond_actor_salary = mean(`Bond Actor Salary`,na.rm = TRUE)
) %>% arrange(desc(avg_bond_actor_salary))

Director,avg_box_office,median_budget,avg_bond_actor_salary
<chr>,<dbl>,<dbl>,<dbl>
Lee Tamahori,465.4,154.2,17.9
Sam Mendes,835.1,188.25,14.5
Michael Apted,439.5,158.3,13.5
Roger Spottiswoode,463.2,133.9,10.0
Marc Forster,514.2,181.4,8.1
John Glen,332.56,56.7,7.5
Guy Hamilton,514.3,29.25,4.5
Lewis Gilbert,527.4,59.9,4.4
Martin Campbell,550.0,111.1,4.2
Terence Young,613.5667,12.6,2.3


In [32]:
# we run a similar exercise, aggregating the data at actor level
df_james_bond %>% group_by(Actor) %>% summarise(
  total_salary = sum(`Bond Actor Salary`,na.rm = TRUE)
  ,max_salary_in_movie = max(`Bond Actor Salary`,na.rm = TRUE)  # max() applied to the actor's salary per film (highest salary in a single film)
  ,count_movies = n()                                           # n() counts observations (number of films)
) %>% arrange(desc(total_salary))

ℹ In argument: `max_salary_in_movie = max(`Bond Actor Salary`, na.rm = TRUE)`.
ℹ In group 2: `Actor = "David Niven"`.
! no non-missing arguments to max; returning -Inf


Actor,total_salary,max_salary_in_movie,count_movies
<chr>,<dbl>,<dbl>,<int>
Pierce Brosnan,46.5,17.9,4
Daniel Craig,25.9,14.5,4
Sean Connery,20.3,5.8,7
Roger Moore,16.9,9.1,7
Timothy Dalton,13.1,7.9,2
George Lazenby,0.6,0.6,1
David Niven,0.0,-inf,1
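
The warning and the `-Inf` value for David Niven appear because every salary in that group is `NA`, so `max()` has nothing left to compare once `na.rm = TRUE` removes them. One possible way to handle this, sketched below on a hypothetical `salaries` data frame invented for this illustration, is to return `NA` explicitly when a group has no recorded values.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
salaries <- data.frame(
  actor  = c("Niven", "Lazenby", "Moore", "Moore"),
  salary = c(NA, 0.6, 7.8, 9.1)
)

max_by_actor <- salaries %>%
  group_by(actor) %>%
  summarise(
    # if every value in the group is NA, return NA instead of calling max()
    max_salary = if (all(is.na(salary))) NA_real_ else max(salary, na.rm = TRUE)
  )

max_by_actor
```

Returning `NA` keeps "no data available" distinguishable from a real numeric result, and avoids the warning.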


### Conditional aggregations

Sometimes we want each aggregated variable to be computed over a specific subset of observations (rows). In other words, we can attach a condition to each variable when setting up its aggregation function.  


In [33]:
# in the example below, we run a global aggregation (no grouping), but set conditions on the aggregated variables
df_james_bond %>% summarise(
  sum_salary_Roger_Moore = sum(`Bond Actor Salary`[Actor == 'Roger Moore'],na.rm = TRUE)    # sum of salaries only where the actor is Roger Moore
  ,sum_salary_Daniel_Craig = sum(`Bond Actor Salary`[Actor == 'Daniel Craig'],na.rm = TRUE) # sum of salaries only where the actor is Daniel Craig
)

sum_salary_Roger_Moore,sum_salary_Daniel_Craig
<dbl>,<dbl>
16.9,25.9


In [34]:
# conditional aggregations can be used to "pivot" (or "transpose") a table (that is, swap rows for columns)
# in the exercise below we do not use a conditional aggregation, and we get the same information reported in a different layout

df_james_bond %>% filter(
  Actor %in% c('Roger Moore','Daniel Craig')            # filter for two actors      
) %>% group_by(
  Actor                                                 # group by actor (the aggregated variable is computed per actor)
) %>% summarise(
  sum_salary = sum(`Bond Actor Salary`,na.rm = TRUE)    # sum of salaries across films
)

Actor,sum_salary
<chr>,<dbl>
Daniel Craig,25.9
Roger Moore,16.9


In [35]:
# another exercise with conditional aggregations
# in this case, we group the aggregated metrics by Director (note that we always group by categorical variables)

df_james_bond %>% group_by(Director) %>% summarise(
  total_actor_salary = sum(`Bond Actor Salary`,na.rm = TRUE)                                # this aggregated variable is not conditional; we are not restricting the observations being aggregated
  ,sum_salary_Roger_Moore = sum(`Bond Actor Salary`[Actor == 'Roger Moore'],na.rm = TRUE)   # aggregate salaries for actor Roger Moore
  ,sum_salary_Daniel_Craig = sum(`Bond Actor Salary`[Actor == 'Daniel Craig'],na.rm = TRUE) # aggregate salaries for actor Daniel Craig
) %>% arrange(desc(total_actor_salary))

## as a result, we get (i) the total sum of actor salaries by Director and (ii) the sum of salaries for two specific actors
## Roger Moore's salary is only recorded for films directed by John Glen; his films with Guy Hamilton and Lewis Gilbert have NA salaries, so with na.rm = TRUE the conditional sum is 0 for every other director

Director,total_actor_salary,sum_salary_Roger_Moore,sum_salary_Daniel_Craig
<chr>,<dbl>,<dbl>,<dbl>
John Glen,30.0,16.9,0.0
Lee Tamahori,17.9,0.0,0.0
Sam Mendes,14.5,0.0,14.5
Michael Apted,13.5,0.0,0.0
Roger Spottiswoode,10.0,0.0,0.0
Guy Hamilton,9.0,0.0,0.0
Martin Campbell,8.4,0.0,3.3
Marc Forster,8.1,0.0,8.1
Terence Young,6.9,0.0,0.0
Lewis Gilbert,4.4,0.0,0.0
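
The same idea extends from conditional sums to conditional counts: summing a logical condition counts the rows where it is `TRUE`. The sketch below combines both on a hypothetical `films` data frame, invented for this illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
films <- data.frame(
  director = c("Glen", "Glen", "Mendes", "Mendes"),
  actor    = c("Moore", "Dalton", "Craig", "Craig"),
  salary   = c(7.8, 5.2, 14.5, NA)
)

by_director <- films %>%
  group_by(director) %>%
  summarise(
    films_total      = n(),                                           # all films per director
    films_with_craig = sum(actor == "Craig"),                         # conditional count
    salary_craig     = sum(salary[actor == "Craig"], na.rm = TRUE)    # conditional sum
  )

by_director
```

When no row in a group meets the condition, the subset is empty and `sum()` returns 0, which is why directors without matching films show 0 rather than `NA`.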


# 6. Exercises
---
> 👉 You can find the solutions to the exercises [here](https://nbviewer.org/github/SomosDataWizards/R-Curso-Introductorio-Ejercicios/blob/main/Capitulo_3_Ejercicios.ipynb)


### Exercise #1
Starting from the Star Wars characters dataset, filter for the characters who come from "Tatooine", "Naboo" or "Kashyyyk". Select only the columns *name*, *homeworld* and *species*.

> *Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv*


In [36]:
# load the libraries
library(dplyr)
library(data.table)

In [37]:
# import the df
df_star_wars = fread("https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv")
glimpse(df_star_wars)

Rows: 87
Columns: 10
$ name       <chr> "Mon Mothma", "Yoda", "Tion Medon", "Ratts Tyerell", "Luke …
$ height     <int> 150, 66, 206, 79, 172, 96, 165, 228, 188, 188, 184, 150, 18…
$ mass       <dbl> NA, 17.0, 80.0, 15.0, 77.0, 32.0, 75.0, 112.0, 79.0, 84.0, …
$ hair_color <chr> "auburn", "white", "none", "none", "blond", "", "brown", "b…
$ skin_color <chr> "fair", "green", "grey", "grey & blue", "fair", "white & bl…
$ eye_color  <chr> "blue", "brown", "black", "unknown", "blue", "red", "blue",…
$ birth_year <dbl> 48.0, 896.0, NA, NA, 19.0, 33.0, 47.0, 200.0, NA, 72.0, NA,…
$ gender     <chr> "female", "male", "male", "male", "male", "", "female", "ma…
$ homeworld  <chr> "Chandrila", "", "Utapau", "Aleen Minor", "Tatooine", "Nabo…
$ species    <chr> "Human", "Yoda's species", "Pau'

### Exercise #2
Import the IMDB movie ratings dataset and filter for movies  
(i) whose lead actor (*actor_1_name*) is Johnny Depp and whose rating (*imdb_score*) is greater than 7, or  
(ii) whose director is James Cameron and whose rating is greater than 8.

Select only the variables *actor_1_name*, *director_name* and *imdb_score*.

> *Dataset https://data-wizards.s3.amazonaws.com/datasets/movies.csv*
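
As a hint, two compound conditions joined by "or" can be written by wrapping each `&` condition in parentheses and combining them with `|`. The sketch below uses a made-up data frame; `df_toy`, its column names and its values are hypothetical and only illustrate the shape of the filter.

```r
library(dplyr)

# hypothetical toy data, NOT the exercise dataset
df_toy = data.frame(
  lead_actor = c("Actor A", "Actor A", "Actor B", "Actor C"),
  director   = c("Director X", "Director Y", "Director X", "Director X"),
  score      = c(7.5, 6.0, 8.4, 7.9)
)

# (condition 1 AND condition 2) OR (condition 3 AND condition 4)
df_result = df_toy %>%
  filter(
    (lead_actor == "Actor A" & score > 7) |
      (director == "Director X" & score > 8)
  ) %>%
  select(lead_actor, director, score)

print(df_result)
```

The parentheses are not strictly required, because `&` binds more tightly than `|` in R, but they make the intent explicit.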

In [38]:
# load the libraries
require(dplyr)
require(data.table)

In [39]:
# import the dataset
df_movies = fread("https://data-wizards.s3.amazonaws.com/datasets/movies.csv")
glimpse(df_movies)

Rows: 4,916
Columns: 28
$ color                     <chr> "Color", "Color", "Color", "Color", "", "Col…
$ director_name             <chr> "James Cameron", "Gore Verbinski", "Sam Mend…
$ num_critic_for_reviews    <int> 723, 302, 602, 813, NA, 462, 392, 324, 635, …
$ duration                  <int> 178, 169, 148, 164, NA, 132, 156, 100, 141, …
$ director_facebook_likes   <int> 0, 563, 0, 22000, 131, 475, 0, 15, 0, 282, 0…
$ actor_3_facebook_likes    <int> 855, 1000, 161, 23000, NA, 530, 4000, 284, 1…
$ actor_2_name              <chr> "Joel David Moore", "Orlando Bloom", "Rory K…
$ actor_1_facebook_likes    <int> 1000, 40000, 11000, 27000, 131, 640, 24000, …
$ gross                     <dbl> 760505847, 309404152, 200074175, 448130642, …
$ genres                    <chr> "Action|Advent

### Exercise #3
Import the WHO (*World Health Organization*) dataset and create a new variable that flags whether a country is below the world median GDP per capita. Filter for *European countries that are below the world median GDP per capita* and select only the relevant variables.

> *Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_na_who.csv*
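
As a hint, a "below the overall median" flag can be created with `mutate()` and then used in `filter()`. The sketch below works on a made-up data frame: `df_toy`, the `continent` and `gdp_per_capita` columns and all values are hypothetical, so the real dataset will need its own column names and its own way of identifying European countries. `na.rm = TRUE` is used on the assumption that the column may contain missing values.

```r
library(dplyr)

# hypothetical toy data, NOT the exercise dataset
df_toy = data.frame(
  country        = c("A", "B", "C", "D", "E"),
  continent      = c("Europe", "Europe", "Asia", "Africa", "Europe"),
  gdp_per_capita = c(1000, 30000, 8000, NA, 5000)
)

df_result = df_toy %>%
  # the median is computed over all rows, ignoring missing values
  mutate(below_median = gdp_per_capita < median(gdp_per_capita, na.rm = TRUE)) %>%
  # keep European rows that are flagged; rows where the flag is NA are dropped
  filter(continent == "Europe", below_median) %>%
  select(country, gdp_per_capita, below_median)

print(df_result)
```

Creating the flag before filtering matters here: the median has to be computed over all countries, not only over the European ones.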








In [40]:
# load the libraries
require(dplyr)
require(data.table)

In [41]:
# import the dataset
df_who = fread("https://data-wizards.s3.amazonaws.com/datasets/dataset_na_who.csv")
glimpse(df_who)

Rows: 196
Columns: 13
$ Country                                                  <chr> "Afghanistan"…
$ CountryID                                                <int> 1, 2, 3, 4, 5…
$ ContinentID                                              <int> 1, 2, 3, 2, 3…
$ `Adolescent fertility rate (%)`                          <int> 151, 27, 6, N…
$ `Adult literacy rate (%)`                                <dbl> 28.0, 98.7, 6…
$ `Gross national income per capita (PPP international $)` <int> NA, 6000, 594…
$ `Net primary school enrolment ratio female (%)`          <int> NA, 93, 94, 8…
$ `Net primary school enrolment ratio male (%)`            <int> NA, 94, 96, 8…
$ `Population (in thousands) total`                        <int> 26088, 3172, …
$ `Population annual growth rate (%)`                      <

### Exercise #4
Aggregate total *revenue* by sector and sort the result in descending order of *revenue*. Among the companies included in the ranking, which sectors generate the most revenue?


> *Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv*
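
As a hint, a grouped total followed by a descending sort can be sketched as below. The data frame is made up: `df_toy`, `category`, `sales` and `total_sales` are hypothetical names chosen for illustration.

```r
library(dplyr)

# hypothetical toy data, NOT the exercise dataset
df_toy = data.frame(
  category = c("food", "tech", "food", "tech", "toys"),
  sales    = c(10, 50, 20, 5, 40)
)

df_result = df_toy %>%
  group_by(category) %>%                  # one group per category
  summarise(total_sales = sum(sales)) %>% # one row per group
  arrange(desc(total_sales))              # largest total first

print(df_result)
```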


In [42]:
# load the libraries
require(dplyr)
require(data.table)

In [43]:
# import the dataset
df_fortune1000 = fread("https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv")
glimpse(df_fortune1000)

Rows: 1,000
Columns: 8
$ Rank      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ Company   <chr> "Walmart", "Exxon Mobil", "Apple", "Berkshire Hathaway", "Mc…
$ Sector    <chr> "Retailing", "Energy", "Technology", "Financials", "Health C…
$ Industry  <chr> "General Merchandisers", "Petroleum Refining", "Computers, O…
$ Location  <chr> "Bentonville, AR", "Irving, TX", "Cupertino, CA", "Omaha, NE…
$ Revenue   <int> 482130, 246204, 233715, 210821, 181241, 157107, 153290, 1523…
$ Profits   <int> 14694, 16150, 53394, 24083, 1476, 5813, 5237, 9687, 7373, 13…
$ Employees <int> 2300000, 75600, 110000, 331000, 70400, 200000, 199000, 21500…


### Exercise #5
Starting from the dataset of government employees, compute the median base salary by department. Consider only full-time employees.


> *Dataset https://data-wizards.s3.amazonaws.com/datasets/employees.csv*
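
As a hint, filtering first and aggregating afterwards can be sketched as below, so that the excluded rows never enter the median. The data frame is made up: `df_toy`, `team`, `contract`, `pay` and `median_pay` are hypothetical names chosen for illustration.

```r
library(dplyr)

# hypothetical toy data, NOT the exercise dataset
df_toy = data.frame(
  team     = c("ops", "ops", "ops", "sales", "sales", "sales"),
  contract = c("full", "full", "part", "full", "part", "full"),
  pay      = c(100, 300, 50, 200, 80, 400)
)

df_result = df_toy %>%
  filter(contract == "full") %>%        # drop rows that should not count
  group_by(team) %>%                    # one group per team
  summarise(median_pay = median(pay))   # median within each group

print(df_result)
```

If the salary column contained missing values, `median(pay, na.rm = TRUE)` would be needed to avoid `NA` results.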


In [44]:
# load the libraries
require(dplyr)
require(data.table)

In [45]:
# import the dataset
df_employees = fread("https://data-wizards.s3.amazonaws.com/datasets/employees.csv")
glimpse(df_employees)

Rows: 2,000
Columns: 10
$ UNIQUE_ID         <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
$ POSITION_TITLE    <chr> "ASSISTANT DIRECTOR (EX LVL)", "LIBRARY ASSISTANT", …
$ DEPARTMENT        <chr> "Municipal Courts Department", "Library", "Houston P…
$ BASE_SALARY       <dbl> 121862, 26125, 45279, 63166, 56347, 66614, 71680, 42…
$ RACE              <chr> "Hispanic/Latino", "Hispanic/Latino", "White", "Whit…
$ EMPLOYMENT_TYPE   <chr> "Full Time", "Full Time", "Full Time", "Full Time", …
$ GENDER            <chr> "Female", "Female", "Male", "Male", "Male", "Male", …
$ EMPLOYMENT_STATUS <chr> "Active", "Active", "Active", "Active", "Active", "A…
$ HIRE_DATE         <IDate> 2006-06-12, 2000-07-19, 2015-02-03, 1982-02-08, 19…
$ JOB_DATE          <IDate> 2012-10-13, 2010-09-