# **Lab 4: Transforming data: Intro to dplyr**

## **dplyr verbs (functions)**
Five dplyr verbs handle the vast majority of your data manipulation needs:

*   filter() - for picking observations by their values,
*   select() - for picking variables by their names,
*   arrange() - for reordering the rows,
*   mutate() - for creating new variables with functions on existing variables,
*   summarise() - for collapsing many values down to a single summary.




## **The structure of dplyr functions**

All verbs work similarly:


*   The first argument is a tibble (or data frame)
*   The subsequent ones describe what to do, using the variable names
*   The result is a new tibble
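
The three points above can be sketched on a small toy tibble. The data and the names `films`, `high_scores`, and `by_budget` are made up for illustration and are not part of `movies.csv`:

```r
library(dplyr)

# Hypothetical toy data (not from movies.csv)
films <- tibble(
  title  = c("A", "B", "C"),
  score  = c(6.1, 8.3, 7.4),
  budget = c(10, 50, 30)
)

# First argument: the tibble.
# Remaining arguments: what to do, written with bare column names.
high_scores <- filter(films, score > 7)
by_budget   <- arrange(films, desc(budget))

# Each call returns a new tibble
class(high_scores)
nrow(high_scores)
by_budget$title
```

Because every verb takes a tibble and returns a tibble, the output of one verb can be passed directly as the first argument of the next.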



## **The movie industry dataset**

`movies.csv` contains information on the last three decades of movies.
The data has been scraped from the IMDb website and can be accessed from a [github repo](https://raw.githubusercontent.com/Juanets/movie-stats/master/movies.csv).

In [0]:
library(tidyverse)

In [0]:
url <- "https://raw.githubusercontent.com/Juanets/movie-stats/master/movies.csv"
movies <- read_csv(url)
locale <- Sys.setlocale(category = "LC_ALL", locale = "C")
movies

## **filter(): retain rows matching criteria**

filter() allows you to subset observations based on their values.

In [0]:
# note: both comma and "&" represent AND condition
filter(movies, genre == "Comedy", director == "Woody Allen")

dplyr executes the filtering and returns a new data frame; it never modifies the original. To keep the result, assign it to a variable with `<-`.
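
A minimal sketch of this behaviour, using a made-up tibble (`toy` and `comedies` are illustrative names, not part of the lab data):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  genre = c("Comedy", "Drama", "Comedy"),
  score = c(6.5, 8.0, 7.2)
)

# The result is printed but not stored; `toy` is unchanged
filter(toy, genre == "Comedy")
nrow(toy)

# Assign the result to keep it
comedies <- filter(toy, genre == "Comedy")
nrow(comedies)
```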

## **Logical operators**

In [0]:
# Using AND operator
filter(movies, country == "USA", budget > 2.5e8)
# same as filter(movies, country == "USA" & budget > 2.5e8)

In [0]:
# Using OR operator
filter(movies, country == "USA" | budget > 2.5e8)

In [0]:
# Using xor(): elementwise exclusive OR (exactly one of the two conditions is TRUE)
filter(movies, xor(score > 9, budget > 2.5e8))

In [0]:
# you can also use %in% operator
filter(movies, country %in% c("Peru", "Colombia", "Chile"))

In R, to test whether a value is missing, use the is.na() function. In particular, do not check for equality with NA:

In [0]:
x <- 1

In [0]:
x == NA # returns NA, not TRUE or FALSE

In [0]:
is.na(x)

Similarly, never put an equality condition with NA in your dplyr filter() statements.

In [0]:
# create a dataframe
df <- tibble(x = c(1, NA, 3))
print(df)

In [0]:
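filter(df, x > 1) # the NA row is dropped because NA > 1 evaluates to NA

In [0]:
filter(df, is.na(x) | x > 1) # filter() keeps only rows where the condition is TRUE; use is.na() to keep NA rows explicitly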

## **Exercise 1:**


1.   Write code using filter that will allow you to output movies with `country` USA or UK and `genre` Action or Drama.
2.   Write code using filter that will allow you to output movies with `released` later than 2014-12-01. (hint: `movies$released <- as.Date(movies$released)`)


## **select(): pick columns by name**

select() lets you choose a subset of variables, specified by name.
Note that column names do not need quotation marks in dplyr:

In [0]:
# select 4 columns
select(movies, name, country, year, genre)

In [0]:
select(movies, name, genre:score) # use a colon to select contiguous columns

In [0]:
select(movies, -(star:writer)) # To drop columns use a minus, "-"

## **select() helpers**
You can use the following functions to help select the columns:


*   starts_with()
*   ends_with()
*   contains()
*   matches() (matches a regular expression)
*   num_range("x", 1:4): picks variables x1, x2, x3, x4

Example:



In [0]:
select(movies, starts_with("r"))
select(movies, ends_with("e"))
select(movies, contains("re"))
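
The cell above does not cover `matches()` and `num_range()`. Here is a minimal sketch of both on a made-up tibble (`wide` and its columns are hypothetical, chosen only so that numbered columns exist):

```r
library(dplyr)

# Hypothetical toy data with numbered columns
wide <- tibble(
  id    = 1:2,
  x1    = c(0.1, 0.2),
  x2    = c(1, 2),
  x3    = c(10, 20),
  notes = c("a", "b")
)

# num_range(): prefix plus a numeric range
sel_range <- select(wide, num_range("x", 1:2))

# matches(): columns whose names match a regular expression
sel_regex <- select(wide, matches("^x[0-9]$"))

# Helpers can be combined with plain column names
sel_mixed <- select(wide, id, starts_with("x"))

names(sel_range)
names(sel_regex)
names(sel_mixed)
```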

## **Exercise 2:**

Write code that will have company as the first column and the columns starting with the letter 'g' as the following columns. Output the first 20 rows of such a dataset.

## **arrange(): reorder rows**

arrange() takes a data frame and a set of column names to order by.
For descending order, use the function desc() around the column name.

In [0]:
print(arrange(movies, runtime), n = 4)

In [0]:
# use `desc` for descending
print(arrange(movies, desc(budget)), n = 4)
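
arrange() also accepts several columns at once; each additional column breaks ties in the preceding ones. A minimal sketch on made-up data (`toy` and `sorted` are illustrative names):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  title = c("A", "B", "C", "D"),
  year  = c(2001, 1999, 2001, 1999),
  score = c(7.0, 6.5, 8.2, 9.1)
)

# Sort by year (ascending), then by score (descending) within each year
sorted <- arrange(toy, year, desc(score))
sorted$title
```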

Missing values are always sorted at the end:

In [0]:
df <- tibble(x = c(5, NA, 2))
arrange(df, x)

In [0]:
arrange(df, desc(x))

## **Exercise 3:**

Use arrange to sort the `movies` dataset by ascending order of the product of the budget and score variables. Output the first 20 rows of the new dataset.


## **mutate(): add new variables**

mutate() adds new columns that are functions of the existing ones:

In [0]:
movies <- mutate(movies, profit = gross - budget)
select(movies, name, gross, budget, profit)

To discard old variables, use transmute() instead of mutate().
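
A minimal sketch of the difference, on a made-up tibble (`toy`, `kept`, and `only_new` are illustrative names):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name   = c("A", "B"),
  gross  = c(120, 40),
  budget = c(100, 60)
)

# mutate() keeps every existing column and appends the new one
kept <- mutate(toy, profit = gross - budget)

# transmute() keeps only the columns named in the call
only_new <- transmute(toy, name, profit = gross - budget)

names(kept)
names(only_new)
```

Newer dplyr releases also provide a `.keep` argument in mutate() that controls which existing columns are retained; check the documentation of your installed version before relying on it.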

In [0]:
# Generating multiple new variables
movies <- mutate(
  movies,
  profit = gross - budget,
  gross_in_mil = gross / 10^6,
  budget_in_mil = budget / 10^6,
  profit_in_mil = profit / 10^6
)
select(movies, name, year, country, contains("_in_mil"), profit)

Any vectorized function can be used with mutate(), including:


*   arithmetic operators (+, -, *, /, ^, %/%, %%),
*   comparison operators (<, <=, >, >=, ==, !=),
*   logarithmic and exponential transformations (log, log10, exp),
*   offsets (lead, lag),
*   cumulative and rolling aggregates (cumsum, cumprod, cummin, cummax),
*   ranking (min_rank, percent_rank).
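
A few of these functions are sketched below inside a single mutate() call. The tibble `toy` and the new column names are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  year  = 2001:2004,
  gross = c(10, 40, 20, 40)
)

out <- mutate(
  toy,
  log_gross  = log10(gross),           # logarithmic transformation
  prev_gross = lag(gross),             # previous row's value; NA in the first row
  change     = gross - prev_gross,     # columns created earlier in the call can be reused
  running    = cumsum(gross),          # cumulative sum
  gross_rank = min_rank(desc(gross))   # rank 1 = largest gross; ties share the lowest rank
)
out
```

Each function here takes a vector and returns a vector of the same length, which is what mutate() requires for a new column.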

