# **Introduction to R for Data Analysis**
## Chapter 2: Data Frames
---
**Author:** Juan Martin Bellido  

**Description**  
In this chapter we start working with Data Frames, that is, tables of data, performing operations using only syntax available natively in R.

**Feedback? Comments?** Please share them with me by writing to me on [LinkedIn](https://www.linkedin.com/in/jmartinbellido/)  

**Additional Material**
* [Jupyter Notebook commands](https://datawizards.es/contenido/codigo-para-analisis-de-datos/guias/comandos-rapidos-jupyter)
* [Markdown syntax](https://datawizards.es/contenido/codigo-para-analisis-de-datos/guias/sintaxis-markdown)


## CONTENTS
---
1. Introduction to Data Frames
2. Selecting elements in a Data Frame
3. Basic operations with Data Frames
4. Exercises


In [None]:
# install the "data.table" package
install.packages("data.table")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# load packages (library() stops with an error if the package is missing)
library(data.table)

# 1. Introduction to Data Frames
---

### Importing a Data Frame

In [None]:
# import a df
df_jamesbond = data.table::fread("https://data-wizards.s3.amazonaws.com/datasets/jamesbond.csv")
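The same import step can be tried without network access. The sketch below is only an illustration: it uses base R's `read.csv()` on a small, made-up CSV string (`csv_text` and `df_toy` are hypothetical names, and the two rows are toy data, not the full James Bond file).

```r
# Hypothetical toy CSV typed inline, so the example needs no download
csv_text <- "Film,Year,Box Office
Dr. No,1962,448.8
Goldfinger,1964,820.4"

# check.names = FALSE keeps the header "Box Office" as written
# instead of converting it to the syntactic name "Box.Office"
df_toy <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

print(df_toy)
```

Reading from a local file works the same way: pass the file path as the first argument instead of `text =`.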

### Exporting a Data Frame

In [None]:
# export a df (with no file argument, fwrite() prints the CSV to the console)
data.table::fwrite(df_jamesbond)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Dr. No,1962,Sean Connery,Terence Young,448.8,7,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315,85,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334,27.7,
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533,45.1,
Moonraker,1979,Roger Moore,Lewis Gilbert,535,91.5,
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380,86,
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
A View to a Kill,1985,Rog
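The call above prints the CSV to the console because no file was given. To write to disk, `fwrite()` takes the destination path as its `file` argument. A minimal base R sketch of the same round trip follows, using `write.csv()` and a temporary file; `df_toy`, `out_path` and `df_back` are hypothetical names and the rows are toy data.

```r
# Hypothetical toy data frame
df_toy <- data.frame(
  Film = c("Dr. No", "Goldfinger"),
  Year = c(1962L, 1964L),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the example leaves nothing behind in the working directory
out_path <- tempfile(fileext = ".csv")
write.csv(df_toy, out_path, row.names = FALSE)  # row.names = FALSE avoids an extra index column

# Read the file back to confirm the export worked
df_back <- read.csv(out_path, stringsAsFactors = FALSE)
print(df_back)
```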

### First operations with a Data Frame

In [None]:
# we use the str() function to explore our data frame
str(df_jamesbond)

Classes ‘data.table’ and 'data.frame':	26 obs. of  7 variables:
 $ Film             : chr  "Dr. No" "From Russia with Love" "Goldfinger" "Thunderball" ...
 $ Year             : int  1962 1963 1964 1965 1967 1967 1969 1971 1973 1974 ...
 $ Actor            : chr  "Sean Connery" "Sean Connery" "Sean Connery" "Sean Connery" ...
 $ Director         : chr  "Terence Young" "Terence Young" "Guy Hamilton" "Terence Young" ...
 $ Box Office       : num  449 544 820 848 315 ...
 $ Budget           : num  7 12.6 18.6 41.9 85 59.9 37.3 34.7 30.8 27.7 ...
 $ Bond Actor Salary: num  0.6 1.6 3.2 4.7 NA 4.4 0.6 5.8 NA NA ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [None]:
dim(df_jamesbond)

In [None]:
nrow(df_jamesbond)

In [None]:
ncol(df_jamesbond)

In [None]:
head(df_jamesbond)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


In [None]:
tail(df_jamesbond)

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,


In [None]:
names(df_jamesbond)
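As a quick illustration of how these inspection functions relate to each other, here is a sketch on a hypothetical three-row data frame (`df_toy` is made-up toy data): `dim()` returns the row count followed by the column count, so it agrees with `nrow()` and `ncol()`.

```r
# Hypothetical toy data frame with 3 rows and 2 columns
df_toy <- data.frame(
  Film = c("Dr. No", "Goldfinger", "Skyfall"),
  Year = c(1962L, 1964L, 2012L),
  stringsAsFactors = FALSE
)

dim(df_toy)    # rows first, then columns: 3 2
nrow(df_toy)   # 3
ncol(df_toy)   # 2
names(df_toy)  # "Film" "Year"
```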

# 2. Selecting elements in a Data Frame
---

### Selecting columns

In [None]:
# select a column by name
df_jamesbond$Film

In [None]:
# note: if the field name contains spaces, we must wrap it in quotes (backticks also work)
df_jamesbond$'Box Office'
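A small sketch of the options for a column name that contains a space, on hypothetical toy data (`df_toy` is made up; `check.names = FALSE` is needed so that `data.frame()` keeps the space in the name). Backticks and double square brackets both reach the same column.

```r
# Hypothetical toy data frame with a space in one column name
df_toy <- data.frame(
  Film = c("Dr. No", "Goldfinger"),
  `Box Office` = c(448.8, 820.4),
  check.names = FALSE,       # keep "Box Office" rather than "Box.Office"
  stringsAsFactors = FALSE
)

df_toy$`Box Office`      # backticks around the non-syntactic name
df_toy[["Box Office"]]   # double brackets with the name as a string
```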

In [None]:
# select a column by name, using bracket notation
df_jamesbond[,'Film']

Film
<chr>
Dr. No
From Russia with Love
Goldfinger
Thunderball
Casino Royale
You Only Live Twice
On Her Majesty's Secret Service
Diamonds Are Forever
Live and Let Die
The Man with the Golden Gun


In [None]:
# select columns by name
# select multiple columns using a vector
df_jamesbond[,c('Film','Director')]

Film,Director
<chr>,<chr>
Dr. No,Terence Young
From Russia with Love,Terence Young
Goldfinger,Guy Hamilton
Thunderball,Terence Young
Casino Royale,Ken Hughes
You Only Live Twice,Lewis Gilbert
On Her Majesty's Secret Service,Peter R. Hunt
Diamonds Are Forever,Guy Hamilton
Live and Let Die,Guy Hamilton
The Man with the Golden Gun,Guy Hamilton


In [None]:
# select a column by position
df_jamesbond[,1]

Film
<chr>
Dr. No
From Russia with Love
Goldfinger
Thunderball
Casino Royale
You Only Live Twice
On Her Majesty's Secret Service
Diamonds Are Forever
Live and Let Die
The Man with the Golden Gun


In [None]:
# select columns by position
# select multiple columns using a vector
df_jamesbond[,c(1,2)]

Film,Year
<chr>,<int>
Dr. No,1962
From Russia with Love,1963
Goldfinger,1964
Thunderball,1965
Casino Royale,1967
You Only Live Twice,1967
On Her Majesty's Secret Service,1969
Diamonds Are Forever,1971
Live and Let Die,1973
The Man with the Golden Gun,1974


In [None]:
# select columns by position
# select columns using a range
df_jamesbond[,1:3]

Film,Year,Actor
<chr>,<int>,<chr>
Dr. No,1962,Sean Connery
From Russia with Love,1963,Sean Connery
Goldfinger,1964,Sean Connery
Thunderball,1965,Sean Connery
Casino Royale,1967,David Niven
You Only Live Twice,1967,Sean Connery
On Her Majesty's Secret Service,1969,George Lazenby
Diamonds Are Forever,1971,Sean Connery
Live and Let Die,1973,Roger Moore
The Man with the Golden Gun,1974,Roger Moore
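The outputs above are tables because `df_jamesbond` was read with `fread()` and is therefore a `data.table`. On a plain base R `data.frame`, selecting a single column with brackets returns a vector unless `drop = FALSE` is passed. The sketch below shows the difference on hypothetical toy data (`df_toy` is made up).

```r
# Hypothetical toy data frame (a plain data.frame, not a data.table)
df_toy <- data.frame(
  Film = c("Dr. No", "Goldfinger"),
  Year = c(1962L, 1964L),
  stringsAsFactors = FALSE
)

one_col_vector <- df_toy[, "Film"]                # a character vector
one_col_table  <- df_toy[, "Film", drop = FALSE]  # still a data frame with one column
two_cols       <- df_toy[, c("Film", "Year")]     # several columns always give a data frame

class(one_col_vector)
class(one_col_table)
class(two_cols)
```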


### Selecting rows

In [None]:
# select a row by position
df_jamesbond[1,]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7,0.6


In [None]:
# select several rows by position, using a vector
df_jamesbond[c(1,4,5),]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [None]:
# select rows by position, using a range
df_jamesbond[1:10,]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,


### Combining methods

In [None]:
# select a row by position and a column by name
df_jamesbond[1,'Actor']

Actor
<chr>
Sean Connery


In [None]:
# select rows by range and columns by name
df_jamesbond[1:10,c('Film','Actor','Director')]

Film,Actor,Director
<chr>,<chr>,<chr>
Dr. No,Sean Connery,Terence Young
From Russia with Love,Sean Connery,Terence Young
Goldfinger,Sean Connery,Guy Hamilton
Thunderball,Sean Connery,Terence Young
Casino Royale,David Niven,Ken Hughes
You Only Live Twice,Sean Connery,Lewis Gilbert
On Her Majesty's Secret Service,George Lazenby,Peter R. Hunt
Diamonds Are Forever,Sean Connery,Guy Hamilton
Live and Let Die,Roger Moore,Guy Hamilton
The Man with the Golden Gun,Roger Moore,Guy Hamilton


### Filtering rows with logical conditions

In [None]:
# create a logical test
# note: the result is always a vector of booleans, with one boolean per row in our df
df_jamesbond$Actor == 'Sean Connery'

In [None]:
# alternatively, we could get the same result with the following syntax
as.vector(df_jamesbond[,'Actor'] == 'Sean Connery')

In [None]:
# use the boolean vector to filter rows
# as a result, we keep only the rows that correspond to TRUE values in our vector
cond = df_jamesbond$Actor == 'Sean Connery'
df_jamesbond[cond,]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,


In [None]:
# we could get the same result without declaring auxiliary objects
df_jamesbond[df_jamesbond$Actor == 'Sean Connery',]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
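To make the mechanics of the boolean vector explicit, here is a sketch on hypothetical toy data (`df_toy` is made up). It checks that the condition has one value per row, counts the matches with `sum()` and lists their positions with `which()`.

```r
# Hypothetical toy data frame
df_toy <- data.frame(
  Film  = c("Dr. No", "Casino Royale", "Goldfinger", "Skyfall"),
  Actor = c("Sean Connery", "David Niven", "Sean Connery", "Daniel Craig"),
  stringsAsFactors = FALSE
)

cond <- df_toy$Actor == "Sean Connery"

length(cond) == nrow(df_toy)  # TRUE: one boolean per row
sum(cond)                     # 2: TRUE counts as 1, so this is the number of matching rows
which(cond)                   # 1 3: row positions of the matches

# Filtering with the boolean vector keeps rows 1 and 3
df_toy[cond, ]
```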


In [None]:
# next, we create two auxiliary objects holding conditions
cond = df_jamesbond$Actor == 'Sean Connery'
cond_2 = df_jamesbond$'Box Office' > 800

In [None]:
# combine both conditions using the "and" operator (&)
# this way, we require both conditions to hold at the same time
df_jamesbond[cond & cond_2,]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


In [None]:
# combine both conditions using the "or" operator (|)
# this way, we only require that at least one of the conditions holds (not necessarily both at once)
df_jamesbond[cond | cond_2,]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
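Conditions can also be negated, either with `!=` inside the test or with `!` in front of an existing boolean vector. The sketch below uses hypothetical toy data (`df_toy`, `is_connery` and `high_gross` are made-up names, and `BoxOffice` is written without a space for brevity).

```r
# Hypothetical toy data frame
df_toy <- data.frame(
  Film      = c("Dr. No", "Casino Royale", "Goldfinger", "Skyfall"),
  Actor     = c("Sean Connery", "David Niven", "Sean Connery", "Daniel Craig"),
  BoxOffice = c(448.8, 315, 820.4, 943.5),
  stringsAsFactors = FALSE
)

is_connery <- df_toy$Actor == "Sean Connery"
high_gross <- df_toy$BoxOffice > 800

# Two equivalent ways to keep films NOT starring Sean Connery
not_connery_a <- df_toy[!is_connery, ]
not_connery_b <- df_toy[df_toy$Actor != "Sean Connery", ]

# Negation combines with & and | like any other condition
not_connery_high <- df_toy[!is_connery & high_gross, ]

print(not_connery_a)
print(not_connery_high)
```

Parentheses help when negating a combined condition: `!(a & b)` negates the whole expression, while `!a & b` negates only `a`.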


In [None]:
# the "in" operator (%in%) lets us test more than one value for the same variable
cond = df_jamesbond$Actor %in% c('Sean Connery','Roger Moore','Pierce Brosnan')
df_jamesbond[cond,]

Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
The Man with the Golden Gun,1974,Roger Moore,Guy Hamilton,334.0,27.7,
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
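One practical difference between `%in%` and `==` shows up with missing values: `%in%` always returns `TRUE` or `FALSE`, while `==` returns `NA` when the value is missing. A sketch on a hypothetical vector (`actors` and `classic` are made-up names):

```r
# Hypothetical vector with one missing value
actors  <- c("Sean Connery", "Roger Moore", NA, "Daniel Craig")
classic <- c("Sean Connery", "Roger Moore")

in_classic  <- actors %in% classic       # TRUE TRUE FALSE FALSE
not_classic <- !(actors %in% classic)    # FALSE FALSE TRUE TRUE
eq_connery  <- actors == "Sean Connery"  # TRUE FALSE NA FALSE

print(in_classic)
print(not_classic)
print(eq_connery)
```

Wrapping `%in%` in `!( ... )` gives a "not in" test, which is handy when a filter must exclude several values at once.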


# 3. Basic operations with Data Frames
---

In [None]:
# import the Data Frame
df_cars = data.table::fread('https://data-wizards.s3.amazonaws.com/datasets/dataset_us_cars.csv')
str(df_cars)

Classes ‘data.table’ and 'data.frame':	2499 obs. of  7 variables:
 $ year   : int  2008 2011 2018 2014 2018 2018 2010 2017 2018 2017 ...
 $ brand  : chr  "toyota" "ford" "dodge" "ford" ...
 $ price  : int  6300 2899 5350 25000 27700 5700 7300 13350 14600 5250 ...
 $ mileage: int  274117 190552 39590 64146 6654 45561 149050 23525 9371 63418 ...
 $ color  : chr  "black" "silver" "silver" "blue" ...
 $ state  : chr  "new jersey" "tennessee" "georgia" "virginia" ...
 $ country: chr  "usa" "usa" "usa" "usa" ...
 - attr(*, ".internal.selfref")=<externalptr> 


### Creating a new column from existing fields

We can create new columns from fields that already exist in our Data Frame.

In [None]:
# create a new field as an operation on existing columns
# version 1
df_cars$price_eur = df_cars$price / 1.2

In [None]:
# create a new field as an operation on existing columns
# version 2
df_cars[,'price_eur'] = df_cars$price / 1.2

In [None]:
# create a new field as an operation on existing columns
# version 3
df_cars[,'price_eur'] = df_cars[,'price'] / 1.2

### Editing fields

In [None]:
# edit an existing field in our Data Frame
# the syntax is identical to the one we use to declare a field
df_cars$price_eur = round(df_cars$price_eur, 2)

In [None]:
# display the df
df_cars[1:5,c('brand','price_eur')]

brand,price_eur
<chr>,<dbl>
toyota,5250.0
ford,2415.83
dodge,4458.33
ford,20833.33
chevrolet,23083.33
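The create-then-edit steps can also be written as a single assignment. The sketch below assumes a fixed conversion constant of 1.2, the same value the chapter divides by; `df_toy` and `usd_per_eur` are hypothetical names, and the two rows reuse the first two prices shown above.

```r
# Hypothetical toy data frame with the first two prices from the output above
df_toy <- data.frame(
  brand = c("toyota", "ford"),
  price = c(6300, 2899),
  stringsAsFactors = FALSE
)

# Assumed fixed conversion constant; naming it makes the intent of "/ 1.2" explicit
usd_per_eur <- 1.2

# Create the column and round it in one step
df_toy$price_eur <- round(df_toy$price / usd_per_eur, 2)

print(df_toy)
```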


### Removing fields from a Data Frame

In [None]:
# data.table lets us put a minus sign before the column names to exclude them from the selection
# (on a base data.frame, negating a character vector raises an error)
df_cars[,-c('brand','price')]

year,mileage,color,state,country,price_eur
<int>,<int>,<chr>,<chr>,<chr>,<dbl>
2008,274117,black,new jersey,usa,5250.00
2011,190552,silver,tennessee,usa,2415.83
2018,39590,silver,georgia,usa,4458.33
2014,64146,blue,virginia,usa,20833.33
2018,6654,red,florida,usa,23083.33
2018,45561,white,texas,usa,4750.00
2010,149050,black,georgia,usa,6083.33
2017,23525,gray,california,usa,11125.00
2018,9371,silver,florida,usa,12166.67
2017,63418,black,texas,usa,4375.00


In [None]:
# overwrite the df after dropping a column
df_cars = df_cars[,-'price_eur']

In [None]:
str(df_cars)

Classes ‘data.table’ and 'data.frame':	2499 obs. of  7 variables:
 $ year   : int  2008 2011 2018 2014 2018 2018 2010 2017 2018 2017 ...
 $ brand  : chr  "toyota" "ford" "dodge" "ford" ...
 $ price  : int  6300 2899 5350 25000 27700 5700 7300 13350 14600 5250 ...
 $ mileage: int  274117 190552 39590 64146 6654 45561 149050 23525 9371 63418 ...
 $ color  : chr  "black" "silver" "silver" "blue" ...
 $ state  : chr  "new jersey" "tennessee" "georgia" "virginia" ...
 $ country: chr  "usa" "usa" "usa" "usa" ...
 - attr(*, ".internal.selfref")=<externalptr> 
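The minus-sign syntax with column names shown in this section relies on the data frame being a `data.table`. On a plain base R `data.frame`, two common alternatives are to filter `names()` with `%in%` or to assign `NULL` to the column. A sketch on hypothetical toy data (`df_toy` and `kept` are made-up names):

```r
# Hypothetical toy data frame (a plain data.frame, not a data.table)
df_toy <- data.frame(
  brand     = c("toyota", "ford"),
  price     = c(6300, 2899),
  price_eur = c(5250, 2415.83),
  stringsAsFactors = FALSE
)

# Option 1: keep every column whose name is NOT in the list
# drop = FALSE keeps the result a data frame even if one column remains
kept <- df_toy[, !(names(df_toy) %in% c("brand", "price")), drop = FALSE]
names(kept)

# Option 2: remove a single column in place by assigning NULL
df_toy$price_eur <- NULL
names(df_toy)
```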


# 4. Exercises
---
> 👉 You can find the solutions to the exercises [here](https://nbviewer.org/github/SomosDataWizards/R-Curso-Introductorio-Ejercicios/blob/main/Capitulo_2_Ejercicios.ipynb)

### Exercise #1

##### 1A. Import the dataframe. Select the columns "name", "homeworld" and "species" for the first 10 rows of the DataFrame.

##### 1B. Select characters that a) are not of *species* human and b) have *homeworld* Naboo, Endor or Kashyyyk.

> *Dataset https://data-wizards.s3.amazonaws.com/datasets/dataset_star_wars.csv*

### Exercise #2

##### EX 2A. Import the dataframe. Select the columns *country*, *director_name* and *imdb_score* for the first 10 rows of the dataframe.
##### EX 2B. Select movies:  
(i) *produced outside the USA or with an IMDB score above 8.5*, and  
(ii) *directed by one of the following directors: James Cameron, Peter Jackson, Tim Burton.*

> Dataset https://data-wizards.s3.amazonaws.com/datasets/movies.csv

### Exercise #3  

##### EX 3A. Import the dataframe. Select the columns "Company", "Sector" and "Revenue".
##### EX 3B. Create a column "Profits_per_Employee" by dividing the column "Profits" by the column "Employees".


> Dataset https://data-wizards.s3.amazonaws.com/datasets/fortune1000.csv
