### Introducción a la Programación para Ciencia de Datos

### R (Part 2)
_Rocío Romero Zaliz_ - rocio@decsai.ugr.es

* Podcast SintonIA: https://www.spreaker.com/show/sintonia-la-ia-en-las-ondas
* Instagram SintonIA: @sintonia_dasci
* TikTok SintonIA: @sintonia_dasci

* Twitter: @RCRZ_UGR

# Index
* String Manipulation
* Input/Output
* Functions
* R Programming Structures 
* Performance Enhancement: Speed and Memory
* Graphs with `ggplot`

# String Manipulation using basic R

R is primarily a statistical language in which numeric vectors and matrices play a central role, but character strings are needed too, and base R provides a number of string-manipulation utilities.

In [1]:
texto <- "Ciencia de datos"
class(texto)

In [2]:
length(texto)

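The function `nchar` returns the number of characters in a string, whereas `length` returns the number of elements in the vector (1 for a single string).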

In [3]:
nchar("Ciencia de Datos")
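The difference between `length` and `nchar` can be sketched as follows. The vector `palabras` and the other variable names are illustrative and do not appear in the original cells.

```r
texto <- "Ciencia de datos"

# length() counts the elements of the vector, not the characters
n_elementos <- length(texto)

# nchar() counts the characters of each element
n_caracteres <- nchar(texto)

# nchar() is vectorized: it returns one count per element
palabras <- c("hola", "mundo", "")
largos <- nchar(palabras)

print(n_elementos)
print(n_caracteres)
print(largos)
```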

In [4]:
textos <- c("hola", 'mundo')
length(textos)

In [5]:
class(textos)

In [6]:
print(textos)

[1] "hola"  "mundo"


In [7]:
# String construction: character(10) creates a vector of 10 empty strings

empty_str <- character(10)
empty_str

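The function `paste` concatenates its arguments element by element, separating them with `sep` (a space by default). When `collapse` is given, the resulting strings are also joined into one long string.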

In [8]:
paste("Ciencia", "de", "Datos")

In [9]:
paste("Ciencia", "de", "Datos", sep="_") 

In [10]:
paste("Ciencia", c("hola", "mundo"), "cierta")

In [11]:
paste(1:3, 1:5, sep="_") 

In [12]:
paste(1:3, 1:5, sep="_", collapse="|") 
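As a minimal sketch of what the last two calls do: `paste` recycles the shorter vector, and `collapse` then merges the result into a single string. The variable names are illustrative; `paste0` is the base R shorthand for `paste(..., sep = "")`.

```r
# Shorter vectors are recycled to the length of the longest one
pares <- paste(1:3, 1:5, sep = "_")
print(pares)

# collapse joins the elements of the result into a single string
unido <- paste(1:3, 1:5, sep = "_", collapse = "|")
print(unido)

# paste0() is shorthand for paste(..., sep = "")
etiquetas <- paste0("var", 1:3)
print(etiquetas)
```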

In [None]:
help(paste)

## Example

* Given a dataset, replace its column names with generic ones

In [13]:
data <- data.frame(name = c("Susan", "Greg", "Amy", "Laura", "David"), lastname = c("Wilson", "Gray", "Sanders", "Xeon", "Rogers"), gender = c("F", "M", "F", "F", "M"), age = c(23, 46, 32, 90, 53))
data

name,lastname,gender,age
Susan,Wilson,F,23
Greg,Gray,M,46
Amy,Sanders,F,32
Laura,Xeon,F,90
David,Rogers,M,53


In [14]:
colnames(data)

In [15]:
colnames(data) <- paste("var", 1:dim(data)[2], sep="_")
data

var_1,var_2,var_3,var_4
Susan,Wilson,F,23
Greg,Gray,M,46
Amy,Sanders,F,32
Laura,Xeon,F,90
David,Rogers,M,53


## Basic string functions
The call `substr(x, start, stop)` returns the substring of `x` between character positions `start` and `stop`, both inclusive. Both positions are required.

In [16]:
substr("Ciencia de Datos", 1, 7)

In [17]:
substr("Ciencia de Datos", 12)

ERROR: Error in substr("Ciencia de Datos", 12): argument "stop" is missing, with no default
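Two possible ways to take everything from a given position to the end of the string are sketched below. Neither appears in the original notebook: the first passes `nchar(x)` as the end position, and the second uses the base R function `substring`, whose last position has a default value.

```r
x <- "Ciencia de Datos"

# Option 1: give substr() an explicit end position
fin1 <- substr(x, 12, nchar(x))

# Option 2: substring() defaults its last position to a very large number
fin2 <- substring(x, 12)

print(fin1)
print(fin2)
```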


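The call `strsplit(x, split)` splits each string in `x` at the matches of `split` and returns an R list with one character vector of substrings per element of `x`. By default, `split` is interpreted as a regular expression.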

In [18]:
date()

In [19]:
strsplit(date(), split=" ")

In [20]:
strsplit(c(date(), "hola mundo"), split=" ")

In [21]:
strsplit(c("Esta es una frase", "Esta es otra linda frase"), split="a ")

In [22]:
strsplit(c("Esta es una frase", "Esta es otra frase"), split=c(" ", "e"))
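Because `strsplit` returns a list, its pieces are usually accessed with `[[ ]]` or with an apply-style function. The following sketch shows both, using illustrative variable names.

```r
frases <- c("Esta es una frase", "Esta es otra linda frase")
palabras <- strsplit(frases, split = " ")

# One list element per input string
print(class(palabras))
print(lengths(palabras))

# Double brackets extract the words of a single sentence
primera <- palabras[[1]]
print(primera)

# sapply() applies the same extraction to every element, here the last word
ultimas <- sapply(palabras, function(p) p[length(p)])
print(ultimas)
```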

## Example

* Split the names of the files in the working directory into base name and extension

In [23]:
files <- list.files()
files

In [24]:
strsplit(files, split=".")

In [25]:
strsplit(files, split="\\.")

In [26]:
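matrix(unlist(strsplit(files, split="\\.")), ncol=2, byrow=TRUE)

Warning: "data length [23] is not a sub-multiple or multiple of the number of rows [12]"

The warning, and the misaligned rows in the output below, arise because one file name contains two periods (`03_Curso_2.2_Data_types_II_Listas_2020.pdf`) and therefore splits into three pieces instead of two.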

0,1
01_Curso_Jornada_1_R_Rstudio_2022,pdf
02_Curso_2_Data_types_I_2022,pdf
03_Curso_2,2_Data_types_II_Listas_2020
pdf,03_Factores_y_DF_Listas_2022
pdf,04_dplyr_joins_2022
pdf,05_forcats_2022
pdf,Functions
ipynb,Graphs with ggplot
ipynb,Input-Output
ipynb,Programming
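One way to avoid the misalignment is to split only at the last period. The sketch below uses the `tools` package that ships with R, plus an equivalent regular expression. The file names are hypothetical and were chosen so that one of them has a period inside its base name.

```r
# Hypothetical file names, one of them with a period inside the base name
files <- c("notes_2.2_lists.pdf", "Functions.ipynb", "joins_2022.pdf")

# Helpers from the 'tools' package split only at the last period
base_name <- tools::file_path_sans_ext(files)
extension <- tools::file_ext(files)

partes <- cbind(base_name, extension)
print(partes)

# The same idea with a regular expression:
# "\\.[^.]*$" matches the last period and everything after it
base_regex <- sub("\\.[^.]*$", "", files)
print(base_regex)
```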


## Regular expressions

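* A regular expression is a pattern that describes a set of strings, similar to a wildcard but more expressive.
* It is shorthand for specifying broad classes of strings.
* In R, the string functions `grep`, `grepl`, `regexpr`, `gregexpr`, `sub`, `gsub` and `strsplit` interpret their pattern as a regular expression by default, so characters such as `.` have a special meaning unless they are escaped or `fixed = TRUE` is used.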
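A small sketch of the difference between a regular expression and a literal pattern, using the period as the special character. The inputs and variable names are illustrative.

```r
x <- c("a.b.c", "abc")

# As a regular expression, "." matches any character, so both strings match
como_regex <- grepl(".", x)

# With fixed = TRUE the pattern is a literal period
como_literal <- grepl(".", x, fixed = TRUE)

# Escaping the period with "\\." has the same effect
escapado <- grepl("\\.", x)

# sub() replaces the first match, gsub() replaces all of them
primero <- sub(".", "-", "a.b.c", fixed = TRUE)
todos <- gsub(".", "-", "a.b.c", fixed = TRUE)

print(como_regex)
print(como_literal)
print(primero)
print(todos)
```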

In [27]:
date()

In [28]:
strsplit(date(), split="[0-9]+")

In [29]:
# For example, the expression "[ia]" matches any string that contains
# at least one of the letters i or a

donde <- grep("[ia]", c("Ciencia","de","Datos"))
donde

In [30]:
c("Ciencia","de","Datos")[donde]

In [31]:
# A period (.) represents any single character

grep("e.", c("Ciencia","de","Datos"))
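`grep` returns indices by default. The sketch below shows two common variations, `value = TRUE` and `grepl`, and contrasts the pattern `"e."` with `"e$"` (an "e" at the end of the string). The variable names are illustrative.

```r
palabras <- c("Ciencia", "de", "Datos")

# Indices, matching values, and a logical vector for the same pattern
indices <- grep("[ia]", palabras)
valores <- grep("[ia]", palabras, value = TRUE)
logico <- grepl("[ia]", palabras)

# "e." needs some character after the "e"; "e$" needs the "e" at the end
e_seguida <- grep("e.", palabras, value = TRUE)
e_final <- grep("e$", palabras, value = TRUE)

print(indices)
print(valores)
print(logico)
print(e_seguida)
print(e_final)
```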

In [32]:
# Another example using strsplit()

strsplit("a.b.c", ".")

In [33]:
strsplit("a.b.c", "\\.")

In [34]:
strsplit("a.b.c", "[.]")

## Example

* Anonymize data: replace first and last names with their lowercase initials

In [None]:
data

In [None]:
ncolumn <- paste(tolower(substr(data$var_1, 1, 1)), tolower(substr(data$var_2, 1, 1)), sep = "")
ncolumn

In [None]:
paste(colnames(data)[1:2], collapse="_and_")

In [None]:
data$var_1 <- ncolumn
colnames(data)[1] <- paste(colnames(data)[1:2], collapse="_and_")
data <- data[,-2]
data
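The same anonymization steps can be sketched in a self-contained form that leaves the input untouched. The two-row data frame `df` and the names `iniciales` and `anon` are illustrative, not part of the original notebook.

```r
# Small illustrative data frame with the same layout as the renamed data
df <- data.frame(var_1 = c("Susan", "Greg"),
                 var_2 = c("Wilson", "Gray"),
                 var_3 = c("F", "M"),
                 var_4 = c(23, 46))

# Lowercase initials of the first two columns, e.g. "Susan", "Wilson" -> "sw"
iniciales <- tolower(paste0(substr(df$var_1, 1, 1), substr(df$var_2, 1, 1)))

# Build the anonymized data frame without modifying df
anon <- data.frame(iniciales, df[, -(1:2)])
colnames(anon)[1] <- paste(colnames(df)[1:2], collapse = "_and_")

print(anon)
```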

More information: [R manual page on regular expressions](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html)

## Homework (1st part)
1. Create a string array containing your first name and last names (e.g., ["Rocio", "Romero", "Zaliz"]). Using that array and the R functions you just learned, create a new string with the initial of your first name, a dot, and your last names (e.g., "R. Romero Zaliz").

In [None]:
# Write code here

2. Given an array of strings representing dates (e.g., [“2005-11-28”, “2015-10-18”, “2000-01-01”]), show only those corresponding to odd months (use format YEAR-MONTH-DAY).

In [None]:
# Write code here

3. Given a string with several words (e.g., “Esta es una frase, pero no cualquier frase.”) create an array with each of the words in the string (e.g., ["Esta","es","una","frase","pero","no","cualquier","frase"]). Take into account all possible punctuation characters.

In [None]:
# Write code here

4. Find, in an array of strings, those whose only vowels are "a" and/or "e", or that contain no vowels at all (consider both uppercase and lowercase).

In [None]:
# Write code here

5. Given three numeric arrays representing days, months and years, create an array with dates (only if they are valid) (Hint: research the as.Date function).

In [None]:
# Write code here

# String manipulation using stringr (tidyverse) package

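In [39]:
# Package names must be quoted. Run the installation once, then keep it commented out.
# install.packages("tidyverse")  # stringr is installed as part of the tidyverse

library(tidyverse)
library(stringr)

# Recreate the original data frame, since the earlier examples renamed and anonymized its columns
data <- data.frame(name = c("Susan", "Greg", "Amy", "Laura", "David"), lastname = c("Wilson", "Gray", "Sanders", "Xeon", "Rogers"), gender = c("F", "M", "F", "F", "M"), age = c(23, 46, 32, 90, 53))
print(texto)
print(data)

Function <s>`nchar`</s> `str_length` finds the length of a string

In [38]:
str_length(texto)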


In [None]:
print(data$name)

data$name %>% str_length()

Function <s>`paste`</s> `str_c` concatenates several strings element by element; with `collapse`, it returns the result as one long string.

In [None]:
data$name %>% str_c(collapse = ", ")

In [None]:
str_c(data$name, data$lastname, sep = " - ")
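`str_c` differs from `paste` in two ways that are easy to miss: its default separator is the empty string, and missing values propagate. A minimal sketch with illustrative variable names:

```r
library(stringr)

# str_c() uses sep = "" by default, like paste0()
sin_sep <- str_c("Ciencia", "de", "Datos")

# Missing values propagate in str_c(), while paste() turns NA into the text "NA"
con_na_stringr <- str_c("Susan", NA)
con_na_base <- paste("Susan", NA)

# sep joins element by element; collapse then merges everything into one string
unido <- str_c(c("Susan", "Greg"), c("Wilson", "Gray"), sep = " ", collapse = "; ")

print(sin_sep)
print(con_na_stringr)
print(con_na_base)
print(unido)
```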

The call <s>`substr(x, start, stop)`</s> `str_sub(x, start, stop)` returns the substring in the given character position range start:stop for string x. 

In [None]:
data$name %>% str_sub(2, 3)
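Unlike `substr`, `str_sub` accepts negative positions, which count from the end of the string, and its end position is optional. A short sketch, where the vector `nombres` simply repeats the names used in `data`:

```r
library(stringr)

nombres <- c("Susan", "Greg", "Amy", "Laura", "David")

# Characters 2 to 3 of each name
medio <- str_sub(nombres, 2, 3)

# Negative positions count from the end: the last two characters
final <- str_sub(nombres, -2, -1)

# When the end is omitted, the substring runs to the last character
desde3 <- str_sub(nombres, 3)

print(medio)
print(final)
print(desde3)
```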

## Regular expressions

Function `str_detect` tells you if there’s any match to the pattern

In [None]:
# Flatten the first row into a character vector before matching
str_detect(unlist(data[1, ]), "[aeiou]")

In [None]:
data$name %>% str_detect("[aeiou]")

Function `str_subset` extracts the matching components

In [None]:
data$name %>% str_subset("[aeiou]")
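The relationship between `str_detect`, `str_subset` and `str_which` can be sketched as follows. The vector `nombres` repeats the names used in `data`; the other variable names are illustrative. Note that "Amy" does not match because the pattern only lists lowercase vowels.

```r
library(stringr)

nombres <- c("Susan", "Greg", "Amy", "Laura", "David")
patron <- "[aeiou]"

# Logical vector: does each name contain a lowercase vowel?
coincide <- str_detect(nombres, patron)

# str_subset() keeps the matching strings, str_which() their positions
subconjunto <- str_subset(nombres, patron)
indices <- str_which(nombres, patron)

# negate = TRUE keeps the strings that do not match
sin_vocal <- str_subset(nombres, patron, negate = TRUE)

print(coincide)
print(subconjunto)
print(indices)
print(sin_vocal)
```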

In [None]:
data$name %>% str_view("[aeiou]", match = TRUE)

In [None]:
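# Note: str_view_all() is deprecated as of stringr 1.5.0, where str_view() already highlights every match
data$name %>% str_view_all("[aeiou]", match = TRUE)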

Function `str_count` counts the number of matches of a pattern in each string

In [None]:
data$name %>% str_count("[aeiou]")

Function `str_extract` extracts the text of the match

In [None]:
data$name %>% str_extract("[aeiou]")

In [None]:
data$name %>% str_extract_all("[aeiou]")
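A sketch of how `str_extract`, `str_extract_all` and `str_count` relate: the first returns only the first match (or `NA`), the second returns a list with every match, and the length of each list element equals the count. The vector `nombres` repeats the names used in `data`.

```r
library(stringr)

nombres <- c("Susan", "Greg", "Amy", "Laura", "David")

# First match per string, NA when there is none
primera <- str_extract(nombres, "[aeiou]")

# All matches per string, returned as a list of character vectors
todas <- str_extract_all(nombres, "[aeiou]")

# Number of matches per string
cuenta <- str_count(nombres, "[aeiou]")

print(primera)
print(todas[[4]])
print(cuenta)
```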

Function `str_match` extracts parts of the match defined by parentheses

In [None]:
print(data$name)

data$name %>% str_match("(.)[aeiou](..)")
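`str_match` returns a character matrix: the first column holds the full match and each further column holds one parenthesized group. Rows without a match are filled with `NA`. The sketch below runs the same pattern on a vector `nombres` that repeats the names in `data`.

```r
library(stringr)

nombres <- c("Susan", "Greg", "Amy", "Laura", "David")

# (.) one character, then a lowercase vowel, then (..) two more characters
m <- str_match(nombres, "(.)[aeiou](..)")

completo <- m[, 1]  # full match
grupo1 <- m[, 2]    # first group: the character before the vowel
grupo2 <- m[, 3]    # second group: the two characters after the vowel

print(m)
```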

Function `str_replace` replaces the matches with new text

In [None]:
data$name %>% str_replace("[aeiou]", "*")

In [None]:
data$name %>% str_replace_all("[aeiou]", "*")
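The pattern `"[aeiou]"` only covers lowercase vowels, so the "A" in "Amy" is left alone. One possible way to ignore case is to wrap the pattern in stringr's `regex()` helper, as in this sketch (the vector `nombres` repeats the names in `data`):

```r
library(stringr)

nombres <- c("Susan", "Greg", "Amy", "Laura", "David")

# Only the first lowercase vowel of each name
primera <- str_replace(nombres, "[aeiou]", "*")

# Every lowercase vowel
todas <- str_replace_all(nombres, "[aeiou]", "*")

# Every vowel regardless of case
sin_caso <- str_replace_all(nombres, regex("[aeiou]", ignore_case = TRUE), "*")

print(primera)
print(todas)
print(sin_caso)
```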

Function <s>`strsplit`</s> `str_split` splits up a string into multiple pieces

In [None]:
data$name %>% str_split("[aeiou]")
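Like `strsplit`, `str_split` returns a list by default. Two options that may be convenient are sketched below with illustrative inputs: `simplify = TRUE` returns a character matrix, and `fixed()` treats the pattern as a literal string instead of a regular expression.

```r
library(stringr)

x <- c("a-b-c", "d-e-f")

# Default: a list with one character vector per input string
lista <- str_split(x, "-")

# simplify = TRUE returns a matrix with one row per input string
matriz <- str_split(x, "-", simplify = TRUE)

# fixed() makes the period a literal character
partes <- str_split("a.b.c", fixed("."))[[1]]

print(lista)
print(matriz)
print(partes)
```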

## Cheat sheets

* https://github.com/rstudio/cheatsheets/raw/main/strings.pdf

## Example

In [None]:
#install.packages("ISLR")

library("ISLR")
College

In [None]:
college.names <- rownames(College)
college.names

1. Get a vector which contains all colleges with `Texas` in their name. How many are there?

In [None]:
college.names %>% str_detect('Texas')

In [None]:
college.names %>% str_detect('Texas') %>% sum()

In [None]:
college.names %>% str_view("Texas", match = TRUE)

In [None]:
college.names %>% str_subset("Texas")

2. Get a vector with the indices of all rows of the College dataset whose name contains the term ‘University’

In [None]:
college.names %>% str_which("University")

In [None]:
college.names[1:3]

3. How many ‘Universities’ are in the dataset vs ‘Colleges’?

In [None]:
college.names %>% str_detect("University") %>% sum()
college.names %>% str_detect("College") %>% sum()

In [None]:
345+406
length(college.names)

In [None]:
universities <- college.names %>% str_detect("University")
colleges <- college.names %>% str_detect("College")
other <- college.names[!(universities | colleges)]

print(other)

In [None]:
777-751

In [None]:
both <- college.names[(universities & colleges)]
both

In [None]:
college.names %>% str_detect("Univ\\.") %>% sum()
college.names %>% str_detect("Coll\\.") %>% sum()

In [None]:
college.names %>% str_detect("Univ[.e]") %>% sum()
college.names %>% str_detect("Coll[.e]") %>% sum()
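The counting logic above can be sketched on a small vector without loading `ISLR`. The names below are hypothetical, not taken from the College dataset, and the alternation patterns are one possible, slightly stricter alternative to `"Univ[.e]"` and `"Coll[.e]"`: they match only the full word or the abbreviation with a period.

```r
library(stringr)

# Hypothetical institution names, not taken from the College dataset
nombres <- c("Example State University", "Sample College",
             "Univ. of Example", "Sample Coll. of Arts", "Example Institute")

# Full word or abbreviation followed by a literal period
es_univ <- str_detect(nombres, "Univ(ersity|\\.)")
es_coll <- str_detect(nombres, "Coll(ege|\\.)")

# Names that belong to neither group
otros <- nombres[!(es_univ | es_coll)]

print(sum(es_univ))
print(sum(es_coll))
print(otros)
```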

## Homework (2nd part)

Repeat the exercises from Homework (1st part) using `stringr` functions:

1. Create a string array containing your first name and last names (e.g., ["Rocio", "Romero", "Zaliz"]). Using that array and the R functions you just learned, create a new string with the initial of your first name, a dot, and your last names (e.g., "R. Romero Zaliz").

In [None]:
# Write code here

2. Given an array of strings representing dates (e.g., [“2005-11-28”, “2015-10-18”, “2000-01-01”]), show only those corresponding to odd months (use format YEAR-MONTH-DAY).

In [None]:
# Write code here

3. Given a string with several words (e.g., “Esta es una frase, pero no cualquier frase.”) create an array with each of the words in the string (e.g., ["Esta","es","una","frase","pero","no","cualquier","frase"]). Take into account all possible punctuation characters.

In [None]:
# Write code here

4. Find, in an array of strings, those whose only vowels are "a" and/or "e", or that contain no vowels at all (consider both uppercase and lowercase).

In [None]:
# Write code here

5. Given three numeric arrays representing days, months and years, create an array with dates (only if they are valid) (Hint: research the as.Date function).

In [None]:
# Write code here
