[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/blob/main/09-Importing-Data.ipynb)






# <a name="08-title"><font size="6">Module 09: Importing Data</font></a>

---

# <a name="wrangle">What is Data Wrangling?</a>

---

<img src="https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/main/Images/data-cycle.png"  alt = "The Data Cycle" width="80%">


Over the past couple of modules, we have focused on data manipulation and visualization with core `tidyverse` packages. In this module, we turn our attention to the <font color="dodgerblue">**import**</font> stage, which is typically the very first step in the <font color="dodgerblue">**data wrangling**</font> process. We need some data before we can wrestle with it!

Data wrangling is the process of importing messy, unrefined data and then cleaning, reshaping, and transforming it into neatly formatted data that is ready for further analysis.

- Data wrangling requires us to be comfortable writing code.
- Data wrangling requires us to work with different forms of data.
  - It's important to be able to work not only with numbers but also with character strings, categorical variables, logical variables, regular expressions, and dates.
- Data wrangling requires a strong knowledge of the different structures that can hold your datasets.
- Minimizing duplication and writing simple and readable code is important to becoming an effective and efficient data analyst.



# <a name="import">The `readr` Package</a>

---

![The readr logos](https://readr.tidyverse.org/logo.png)


We want to work with custom data that is interesting to us and not already available in an existing R package. This requires us to find or collect our own data, and then import or enter the external data into R. In this module, we will learn how to load files in R with the [`readr` package](https://readr.tidyverse.org/index.html), which is part of the core `tidyverse`.


<br>  



## <a name="load-tidyverse">Loading `tidyverse` Packages</a>

---

The [`tidyverse`](https://www.tidyverse.org/) is a [collection of packages](https://www.tidyverse.org/packages/) by [Hadley Wickham](https://blog.revolutionanalytics.com/2016/09/tidyverse.html) that "share an underlying design philosophy, grammar, and data structures" of tidy data. We can load individual packages within the `tidyverse` one by one, or we can load all packages at once with the command `library(tidyverse)`.

- The `readr` package is one of the core `tidyverse` packages.
- The `dplyr`, `ggplot2`, and `tidyr` packages are also `tidyverse` packages.


In [None]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## <a name="csv-file">Comma-Separated Files</a>

---

One of the most common and widely supported formats for storing data is the [**comma-separated values file (or CSV file)**](https://en.wikipedia.org/wiki/Comma-separated_values#:~:text=Comma%2Dseparated%20values%20(CSV),typically%20represents%20one%20data%20record.). CSV files:

- Store data in plain text.
- Use commas to separate adjacent cell values.
- Use newlines to separate rows.
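
The three properties above can be seen in a minimal sketch that uses only base R. The text stored in `csv_text` is made-up illustrative data, not one of the course files.

```r
# A tiny CSV "file" written out as plain text (hypothetical data).
csv_text <- "name,age\nAda,36\nGrace,45\n"

# cat() shows the raw text: commas separate values, newlines separate rows.
cat(csv_text)

# Base R can parse the same text into a data frame.
toy <- read.csv(text = csv_text)
dim(toy)  # number of rows, then number of columns
```

Because the file is just text, any text editor or spreadsheet program can open it.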

Below are examples of CSV files:

- Data from the top 100 songs in each year from 2000 to 2019: <https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Statistical-Theory/main/Data/spotify-hits.csv>.

- A much smaller data set stored in a CSV file that we will begin working with: <https://pos.it/r4ds-students-csv>.


# <a name="import-readr">Importing CSV Files with `read_csv()`</a>

---

The `readr` package is a core `tidyverse` package. According to the developers of [`readr`](https://readr.tidyverse.org/index.html):

> The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results

Compared to the equivalent `base` R functions such as `read.csv()` and `read.table()`, `readr` functions such as `read_csv()` and `read_table()`:

-  Are typically much faster and more efficient.
-  Bring consistency to importing.
-  Produce data frames in a tibble format.
-  Have more flexible column specifications.

We can run commands `?read_table` and `?read_csv` to access help documentation for more details.
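
As a small sketch of the "tibble format" point, the code below reads the same made-up text with both functions and compares the class of each result. The data in `csv_text` is hypothetical, and `I()` is used to tell `readr` that the string is literal data rather than a file path.

```r
library(readr)

# Hypothetical data used only for this comparison.
csv_text <- "id,score\n1,90\n2,95\n"

# readr: I() marks the string as literal data rather than a path.
with_readr <- read_csv(I(csv_text), show_col_types = FALSE)

# base R: the text argument plays a similar role.
with_base <- read.csv(text = csv_text)

class(with_readr)  # includes "tbl_df", so the result prints as a tibble
class(with_base)   # a plain data frame
```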



## <a name="use-read-csv">The `read_csv()` Function</a>

---

The function `read_csv()` from the `readr` package can be used to import a CSV file into R as a data frame that is stored as a tibble. The first argument in `read_csv()` is the path to the file.

- If the file is stored locally on your computer, then we can type the location of the file in quotes.
  - If working in Colab, then it works best if the file is stored in Google Drive.
  - If working in RStudio or other software, we can store the file locally on our computer in any location.
- If the file is saved online, then we can type the link address to the file in quotes.
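
For the local-file case, a minimal sketch might look like the following. The file here is a temporary one created just for illustration; in practice `local_path` would be the location of a file on your own computer or in Google Drive.

```r
library(readr)

# Hypothetical local file: write a small CSV to a temporary location.
local_path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, group = c("a", "b", "a")), local_path)

# The first argument of read_csv() is the path, given as a character string.
from_disk <- read_csv(local_path, show_col_types = FALSE)
print(from_disk)
```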

Run the code cell below to import the data frame with illustrative (fake) data<sup>1</sup>.

<br>

<font size=2>1. Data and examples below are motivated by examples from [Chapter 7 of R for Data Science (2e)](https://r4ds.hadley.nz/data-import.html) written by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.</font>




In [None]:
students <- read_csv("https://pos.it/r4ds-students-csv")

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


The message indicates:

- the number of rows and columns of data,
- the delimiter that was used,
- the column specifications (names of columns organized by the type of data the column contains), and
- some information about retrieving the full column specification and how to quiet this message.
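
The last point can be sketched with a small stand-in for the file. The data in `toy_csv` is hypothetical; `show_col_types` and `spec()` are the two tools named in the message itself.

```r
library(readr)

# Small stand-in for a file (hypothetical data); I() marks it as literal text.
toy_csv <- I("id,name\n1,Ada\n2,Grace\n")

# show_col_types = FALSE suppresses the column specification message.
toy <- read_csv(toy_csv, show_col_types = FALSE)

# spec() retrieves the full column specification that readr guessed.
toy_spec <- spec(toy)
print(toy_spec)
```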


In [None]:
print(students)

# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          NA
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6


## <a name="quest1">Question 1</a>

---

Based on the output generated above, what are some issues with the data frame `students` that should be cleaned up so the data is tidy?


### <a name="sol1">Solution to Question 1</a>

---



<br>  
<br>  


# <a name="base-read">Importing CSV Files with `base::read.csv()`</a>

---

The [`read.table()`](https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table) function is a multipurpose function in `base` R for importing data. The more focused functions [`read.csv()`](https://www.geeksforgeeks.org/read-contents-of-a-csv-file-in-r-programming-read-csv-function/) and [`read.delim()`](https://www.geeksforgeeks.org/how-to-use-read-delim-in-r/) are wrappers around `read.table()` with default options preset for importing more specific file types.


The `base` R function `read.csv()` has very similar syntax to the `readr` function `read_csv()`; namely, the first argument (and only required argument) in `read.csv()` is the path to the file.

- Run `?read.csv()` to access help documentation.
- Notice the `tidyverse` function has an underscore `_` and the `base` R function has `.` between `read` and `csv`.



## <a name="quest2">Question 2</a>


---

Run the code cells below to load the same CSV file using the `base` R function `read.csv()`. Based on the output from the second code cell, do you notice any difference(s) in how the data was imported?

In [None]:
students_ver2 <- read.csv("https://pos.it/r4ds-students-csv")

In [None]:
students_ver2

Student.ID,Full.Name,favourite.food,mealPlan,AGE
<int>,<chr>,<chr>,<chr>,<chr>
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6


### <a name="sol2">Solution to Question 2</a>

---




<br>  
<br>  


# <a name="missing">Coding Missing Values as `NA` on Import</a>

---


After importing the data, the next step usually requires transforming it in some way to make it easier to analyze. One of the most common transformations is converting non-standard ways of coding missing values into proper `NA` values in R.

- The argument `na` can be added to `read_csv()` to customize the strings that are read as `NA`.
  - By default, `read_csv()` only recognizes empty strings (`""`) and the character string `"NA"` as `NA`s.
- The argument `na.strings` can be added to `read.csv()` to customize the strings that are read as `NA`.
  - By default, `read.csv()` only recognizes the character string `"NA"` as `NA`s.
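
The effect of these arguments can be sketched on a small made-up example. The text in `toy_csv` is hypothetical and contains one empty field, one `NA`, and one `N/A`.

```r
library(readr)

# Hypothetical data: an empty field, the string NA, and the string N/A.
toy_csv <- "id,food\n1,\n2,NA\n3,N/A\n4,Pizza\n"

default_na <- read_csv(I(toy_csv), show_col_types = FALSE)
custom_na  <- read_csv(I(toy_csv), na = c("", "NA", "N/A"), show_col_types = FALSE)

is.na(default_na$food)  # TRUE TRUE FALSE FALSE: "N/A" is kept as a string
is.na(custom_na$food)   # TRUE TRUE TRUE FALSE: "N/A" is now missing

# The base R counterpart uses na.strings instead of na.
base_na <- read.csv(text = toy_csv, na.strings = c("", "NA", "N/A"))
is.na(base_na$food)     # TRUE TRUE TRUE FALSE
```

Note that supplying `na` or `na.strings` replaces the default list rather than adding to it, so any default strings we still want must be listed again.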


We would also like to recognize the string `"N/A"` as a missing value when importing the data in `r4ds-students-csv`.



In [None]:
students <- read_csv("https://pos.it/r4ds-students-csv", na = c("", "N/A"))
print(students)

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    NA                 Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          NA
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6


In [None]:
students_ver2 <- read.csv("https://pos.it/r4ds-students-csv", na.strings = c("", "N/A"))
print(students_ver2)

  Student.ID        Full.Name     favourite.food            mealPlan  AGE
1          1   Sunil Huffmann Strawberry yoghurt          Lunch only    4
2          2     Barclay Lynn       French fries          Lunch only    5
3          3    Jayendra Lyne               <NA> Breakfast and lunch    7
4          4     Leon Rossini          Anchovies          Lunch only <NA>
5          5 Chidiegwu Dunkel              Pizza Breakfast and lunch five
6          6    Güvenç Attila          Ice cream          Lunch only    6


In [None]:
na_stuff <- c("N/A", "", "-999", "not available", "missing", "NA")

students <- read_csv("https://pos.it/r4ds-students-csv", na = na_stuff)
print(students)

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    NA                 Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          NA
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6


# <a name="col-names">Column Names</a>

---

By default, both `read_csv()` and `read.csv()` assume the first row of the imported file contains the column names and not data related to an observation.

- The option `col_names = TRUE` is the default for `read_csv()`.
- The option `header = TRUE` is the default for `read.csv()`.

By default, column names are defined from the values in each column of the first row. In the case of `r4ds-students-csv`, when the column names are imported using `read_csv()`:

- `Student ID` and `Full Name` columns are surrounded by backticks since they have a space.
  - These are **non-syntactic** names.
  - R's rules for variable names forbid spaces.
  - To refer to these variables in our code, we would need to surround them with backticks.
- The other column names (`favourite.food`, `mealPlan`, and `AGE`) are allowable variable names in R, but they do not follow `snake_case` conventions.

In [None]:
# column names with spaces are problematic
summary(students$Full Name)

ERROR: Error in parse(text = input): <text>:2:23: unexpected symbol
1: # column names with spaces are problematic
2: summary(students$Full Name
                         ^


In [None]:
# need to write them inside ` `
summary(students$`Full Name`)

   Length     Class      Mode 
        6 character character 

## <a name="rename-dplyr">Renaming Columns After Import</a>

---

We can import the file into R with the default column names and then update the column names using functions such as `dplyr::rename`.

In [None]:
students |>
  rename(
    student_id = `Student ID`,  # rename first column
    full_name = `Full Name`,  # rename second column
    favorite_food = favourite.food, # rename third column
    meal_plan = mealPlan,  # rename fourth column
    age = AGE  # rename last column
  )

student_id,full_name,favorite_food,meal_plan,age
<dbl>,<chr>,<chr>,<chr>,<chr>
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6


### <a name="janitor">Renaming with `janitor::clean_names`</a>

---

The [`janitor`](https://cran.r-project.org/web/packages/janitor/index.html) package contains useful functions for data cleaning. According to their documentation:


> The main janitor functions can: perfectly format data.frame column names; provide quick counts of variable combinations (i.e., frequency tables and crosstabs); and explore duplicate records.

The `janitor` package is not a core package in the `tidyverse`, but it does work well within the pipe `|>` workflow.

- The `janitor` package is not installed in Colab, so we need to first install the package.
- Then we load the package with `library(janitor)`.

In [None]:
install.packages("janitor")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(janitor)


Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test




In [None]:
students <- read_csv("https://pos.it/r4ds-students-csv", na = c("", "N/A"))
print(students)

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    NA                 Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          NA
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6


In [None]:
students <- students |> clean_names()
print(students)

# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan           age
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2          2 Barclay Lynn     French fries       Lunch only          5
3          3 Jayendra Lyne    NA                 Breakfast and lunch 7
4          4 Leon Rossini     Anchovies          Lunch only          NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6          6 Güvenç Attila    Ice cream          Lunch only          6


## <a name="rename-readr">Naming Columns in `read_csv`</a>

---


We can customize the column names during the import with `read_csv`.

- If the imported CSV file does not have any column headers, then we can set the column names with the `col_names` argument.
- If the imported CSV file has column headers that we want to ignore and rename, then:
  - We can ignore the first row in the CSV with `skip` argument.
  - Then define new column names with the `col_names` argument.
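
The first case above, a file with no header row, can be sketched as follows. The text in `no_header` is hypothetical data standing in for such a file.

```r
library(readr)

# Hypothetical file contents with no header row.
no_header <- I("1,Ada,36\n2,Grace,45\n")

# Without col_names, the first row of data would be used as the header.
# Supplying a character vector names the columns and keeps every row as data.
toy <- read_csv(
  no_header,
  col_names = c("student_id", "full_name", "age"),
  show_col_types = FALSE
)
print(toy)
```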


In [None]:
students <- read_csv("https://pos.it/r4ds-students-csv",
  na = c("N/A", ""),
  skip = 1,  # remove header row
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
  )
print(students)

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): full_name, favourite_food, meal_plan, age
dbl (1): student_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan           age
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2          2 Barclay Lynn     French fries       Lunch only          5
3          3 Jayendra Lyne    NA                 Breakfast and lunch 7
4          4 Leon Rossini     Anchovies          Lunch only          NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6          6 Güvenç Attila    Ice cream          Lunch only          6


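The arguments are very similar when using `base::read.csv`. Typically, we change the underscore `_` to a `.` in the name of the argument. One difference is that `read.csv()` still reads the first row as a header by default (`header = TRUE`), so `col.names` simply replaces the header names and `skip = 1` is not needed.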

In [None]:
new_names <- c("student_id", "full_name", "favourite_food", "meal_plan", "age")

students_ver2 <- read.csv("https://pos.it/r4ds-students-csv",
  na.strings = c("N/A", ""),
#  skip = 1,  # skipping the header row should NOT be used with read.csv
  col.names = new_names  # rename columns
  )

print(students_ver2)

  student_id        full_name     favourite_food           meal_plan  age
1          1   Sunil Huffmann Strawberry yoghurt          Lunch only    4
2          2     Barclay Lynn       French fries          Lunch only    5
3          3    Jayendra Lyne               <NA> Breakfast and lunch    7
4          4     Leon Rossini          Anchovies          Lunch only <NA>
5          5 Chidiegwu Dunkel              Pizza Breakfast and lunch five
6          6    Güvenç Attila          Ice cream          Lunch only    6


# <a name="finish-clean">Final Tidying</a>

---

The tibble generated by `read_csv()`, now stored in `students`, has variables with the following data types:

- `student_id` is a `double` (decimal).
- `full_name`, `favourite_food`, `meal_plan`, and `age` are all `character`.




## <a name="quest3">Question 3</a>


---

Which of the variables currently stored in `students` should be converted to a different data type?

In the code cell below, convert the variable(s) using `base` R functions such as `as.double`, `as.character`, `as.integer` and/or `factor`.

- Store the new data frame back to a data frame with the same name, `students`.
- Print the updated data frame to the screen and confirm the variables have been properly converted.

In [None]:
# solution to question 3


## <a name="if-else">Updating Values with `dplyr::if_else()`</a>

---

The `dplyr` package contains the `if_else()` function which has three arguments.

```
if_else(condition, true, false)
```

1. The first argument, `condition`, should be a logical vector.
2. The result contains the value of the second argument, `true`, wherever `condition` is `TRUE`.
3. The result contains the value of the third argument, `false`, wherever `condition` is `FALSE`.
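
A minimal sketch of these three arguments on a made-up numeric vector (the name `scores` and its values are hypothetical):

```r
library(dplyr)

scores <- c(55, 82, NA, 70)  # hypothetical values

# Elements where the condition is TRUE get "pass"; FALSE elements get "fail".
# Where the condition itself is NA, the result is NA.
result <- if_else(scores >= 70, "pass", "fail")
result  # "fail" "pass" NA "pass"
```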




## <a name="quest4">Question 4</a>


---

Consider the almost tidy `students` data that is printed to the screen after running the first code cell below. Then, without running the command, first determine what would be the output of the following command:

```
if_else(students$meal_plan == "Breakfast and lunch", "Both", "One meal")
```

Type an explanation of what the output would be in the text cell below. Then run the code cell after the text cell to check your work.

<br>  


In [None]:
# run first to see the data frame students
print(students)

# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan           age
       <int> <chr>            <chr>              <fct>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2          2 Barclay Lynn     French fries       Lunch only          5
3          3 Jayendra Lyne    NA                 Breakfast and lunch 7
4          4 Leon Rossini     Anchovies          Lunch only          NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6          6 Güvenç Attila    Ice cream          Lunch only          6


### <a name="sol4">Solution to Question 4</a>

---




<br>  
<br>  


In [None]:
# check your answer to question 4
if_else(students$meal_plan == "Breakfast and lunch", "Both", "One meal")

In the code cell below, the command `if_else(age == "five", "5", age)` will convert any `age` value equal to the character string `"five"` to the character string `"5"`. If the value of `age` is not equal to `"five"`, then the value of `age` is not changed.



In [None]:
if_else(students$age == "five", "5", students$age)

In the code cell below, we run through the entire import and tidying process from start to finish!

In [None]:
students <- read_csv("https://pos.it/r4ds-students-csv", na = c("N/A", ""))

students <- students |>
  clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    student_id = as.integer(student_id),
    age = if_else(age == "five", "5", age),
    age = as.integer(age)
  )

print(students)

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <int> <chr>            <chr>              <fct>               <int>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6


# <a name="other-files">Other File Types</a>

---

- `read_csv2()` reads semicolon-separated files. These use `;` instead of `,` to separate fields and are common in countries that use `,` as the decimal marker.

- `read_tsv()` reads tab-delimited files.

- `read_delim()` reads in files with any delimiter, attempting to automatically guess the delimiter if you do not specify it.

- `read_table()` reads a common variation of fixed-width files where columns are separated by white space.

- `read_fwf()` reads fixed-width files.
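
A brief sketch of three of these functions, each reading a small made-up string (hypothetical data) that is wrapped in `I()` so `readr` treats it as literal text:

```r
library(readr)

# Semicolon-separated, with a comma as the decimal mark.
semi <- read_csv2(I("x;y\n1,5;a\n2,5;b\n"), show_col_types = FALSE)

# Tab-separated.
tabs <- read_tsv(I("x\ty\n1\ta\n2\tb\n"), show_col_types = FALSE)

# Any single-character delimiter, given explicitly with delim.
pipes <- read_delim(I("x|y\n1|a\n2|b\n"), delim = "|", show_col_types = FALSE)

semi$x  # 1.5 2.5: the comma was read as a decimal mark
```

All three calls return tibbles with the same two columns, `x` and `y`; only the way the fields are separated in the raw text differs.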



# <a name="col-types">Column Types</a>

---

A CSV file does not contain any information about the type of each variable (whether a variable is a logical, number, string, etc.), so `readr` will try its best to guess the type.  `readr` uses the following heuristic to figure out the column types:


1. Does it contain only `F`, `T`, `FALSE`, or `TRUE` (ignoring case)? If so, it's a `logical`.
2. Does it contain only numbers? If so, it's a number (`double`).
3. Does it match the ISO8601 standard? If so, it's a `date` or `date-time`.
4. Otherwise, it must be a `character`.

In [None]:
read_csv("
  logical, numeric, date, string
  TRUE, 1, 2021-01-15, abc
  false , 4.5, 2021-02-15, def
  T, Inf, 2021-02-16, ghi
")

Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): string
dbl  (1): numeric
lgl  (1): logical
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


logical,numeric,date,string
<lgl>,<dbl>,<date>,<chr>
TRUE,1.0,2021-01-15,abc
FALSE,4.5,2021-02-15,def
TRUE,Inf,2021-02-16,ghi


## <a name="ignore-comment">Ignoring Comments in a File</a>

---

In the code cell below, we store CSV-formatted text in the character string `my_data`. We have added an additional row to the previous table, and we wish to indicate this with a comment.

In [None]:
my_data <- "
  logical, numeric, date, string
  TRUE, 1, 2021-01-15, abc
  false , 4.5, 2021-02-15, def
  T, Inf, 2021-02-16, ghi
   , ?, hello, FALSE  # we added this extra row
"

read_csv(my_data)

Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): numeric, date, string
lgl (1): logical

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


logical,numeric,date,string
<lgl>,<chr>,<chr>,<chr>
TRUE,1,2021-01-15,abc
FALSE,4.5,2021-02-15,def
TRUE,Inf,2021-02-16,ghi
,?,hello,FALSE # we added this extra row


In [None]:
# the comment argument drops all text that comes after a specified string
read_csv(my_data, comment = "#")

Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): numeric, date, string
lgl (1): logical

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


logical,numeric,date,string
<lgl>,<chr>,<chr>,<chr>
TRUE,1,2021-01-15,abc
FALSE,4.5,2021-02-15,def
TRUE,Inf,2021-02-16,ghi
,?,hello,FALSE


## <a name="def-types">Defining Column Types</a>

---

The command `read_csv(my_data, comment = "#")` produces a data frame with the column types indicated in the output above.

- The `numeric` column contains a missing value coded as `"?"`, which changed the column type to `character`.
- The `date` column contains the entry `hello`, which is not a valid date, so the column type changed to `character`.
- The `string` column treats the entry in the last row as the string `"FALSE"` and not as a logical.


We can override the column types chosen by the `read_csv()` with the `col_types` argument.

In [None]:
read_csv(
  my_data,
  comment = "#",
  col_types = list(numeric = col_double(), date = col_date())
)

“One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)”


logical,numeric,date,string
<lgl>,<dbl>,<date>,<chr>
TRUE,1.0,2021-01-15,abc
FALSE,4.5,2021-02-15,def
TRUE,Inf,2021-02-16,ghi
,,,FALSE
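
As a further sketch of `col_types`, the example below uses `cols()` to force an identifier column to be read as text while leaving the other column to be guessed. The data and the column names `student_id` and `score` are hypothetical.

```r
library(readr)

# Hypothetical data: IDs look numeric, but arithmetic on them is meaningless.
ids <- "student_id,score\n1001,90\n1002,85\n"

# By default, readr guesses that student_id is a number.
guessed <- read_csv(I(ids), show_col_types = FALSE)

# cols() overrides only the named column; score is still guessed.
as_text <- read_csv(I(ids), col_types = cols(student_id = col_character()))

class(guessed$student_id)  # "numeric"
class(as_text$student_id)  # "character"
```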


### <a name="fix-problem">Fixing Error Messages</a>

---

The output from the previous code cell seems to indicate there was a potential problem importing the file. The warning says we can find out more with `problems()`.




In [None]:
df <- read_csv(
  my_data,
  comment = "#",
  col_types = list(numeric = col_double(), date = col_date())
)

“One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)”


In [None]:
problems(df)

row,col,expected,actual,file
<int>,<int>,<chr>,<chr>,<chr>
5,2,a double,?,/tmp/Rtmp0h159P/filee270e90bee
5,3,date in ISO8601,hello,/tmp/Rtmp0h159P/filee270e90bee


This tells us that there was a problem in row 5, in columns 2 and 3, where `readr` expected a double and a date but got `?` and `hello`, respectively. We may choose to resolve this problem by changing both to `NA` in the import.


In [None]:
df <- read_csv(
  my_data,
  comment = "#",
  col_types = list(numeric = col_double(), date = col_date()),
  na = c("?", "hello")
)
print(df)

“One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)”


# A tibble: 4 × 4
  logical numeric date       string
  <lgl>     <dbl> <date>     <chr>
1 TRUE        1   2021-01-15 abc
2 FALSE       4.5 2021-02-15 def
3 TRUE      Inf   2021-02-16 ghi
4 NA         NA   NA         FALSE


## <a name="quest5">Question 5</a>


---

The output of the previous code cell still indicates there were problems when importing the CSV file. Identify and correct the problem so you can import `my_data` with `read_csv()` without any parsing issues.

In [None]:
# solution to question 5


## <a name="col-types">Column Types in `readr`</a>

---

`readr` provides a total of nine column types for you to use:

- `col_logical()` and `col_double()` read logicals and real numbers.
  - These are rarely needed (except as above), since `readr` will usually guess them for you.
- `col_integer()` reads integers.
- `col_character()` reads strings.
  - This is useful to specify explicitly when a column holds numeric identifiers (such as phone numbers or student IDs) that it doesn't make sense to apply mathematical operations to.
- `col_factor()`, `col_date()`, and `col_datetime()` create factors, dates, and date-times respectively.
- `col_number()` is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies.
- `col_skip()` skips a column so it is not included in the result.
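
As a rough illustration of two of the parsers listed above, the sketch below reads a small made-up price list with `col_number()` and `col_factor()`. The data and the names `price_csv` and `prices` are hypothetical, and we assume `readr` 2.0 or later so that `I()` can mark a string as literal data rather than a file path.

```r
library(readr)

# Hypothetical inline data; I() tells readr the string is data, not a path
price_csv <- I("item,price,size\nshirt,\"$1,250.50\",M\nhat,$15,S\nscarf,$40,M\n")

prices <- read_csv(
  price_csv,
  col_types = cols(
    item  = col_character(),
    price = col_number(),                          # drops "$" and grouping commas
    size  = col_factor(levels = c("S", "M", "L"))  # fixed set of allowed levels
  )
)
print(prices)
```

Listing the factor levels explicitly keeps their order under our control instead of depending on the order in which values happen to appear in the file.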


In [None]:
read_csv("https://pos.it/r4ds-students-csv")

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Student ID,Full Name,favourite.food,mealPlan,AGE
<dbl>,<chr>,<chr>,<chr>,<chr>
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6


In [None]:
# using col_skip to ignore some columns
read_csv(
  "https://pos.it/r4ds-students-csv",
  col_types = list(favourite.food = col_skip(), AGE = col_skip())
)

Student ID,Full Name,mealPlan
<dbl>,<chr>,<chr>
1,Sunil Huffmann,Lunch only
2,Barclay Lynn,Lunch only
3,Jayendra Lyne,Breakfast and lunch
4,Leon Rossini,Lunch only
5,Chidiegwu Dunkel,Breakfast and lunch
6,Güvenç Attila,Lunch only


In [None]:
# setting all columns to the same type
read_csv(
  "https://pos.it/r4ds-students-csv",
  col_types = cols(.default = col_double())  # use double for all
)

“One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)”


Student ID,Full Name,favourite.food,mealPlan,AGE
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,,,,4.0
2,,,,5.0
3,,,,7.0
4,,,,
5,,,,
6,,,,6.0
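
Forcing every column to a double turns the text columns into `NA`, so in practice `.default` is often combined with named overrides. The sketch below is one way this might look. It uses a small hypothetical inline data set (`roster_csv`) that imitates the students file and assumes `readr` 2.0 or later for `I()`.

```r
library(readr)

# Hypothetical inline data resembling the first rows of the students file
roster_csv <- I("Student ID,Full Name,AGE\n1,Sunil Huffmann,4\n2,Barclay Lynn,5\n")

roster <- read_csv(
  roster_csv,
  col_types = cols(
    .default = col_character(),     # every column is text unless named below
    `Student ID` = col_integer()    # backticks are needed because of the space
  )
)
print(roster)
```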


## <a name="cols-only">Selecting a Subset of Columns to Import with `cols_only()`</a>

---

The `cols_only()` function can be used within the `col_types` argument to read in only the columns we specify. This can help save time when we have a data set with many columns.

In [None]:
read_csv(
  "https://pos.it/r4ds-students-csv",
  col_types = cols_only(
    favourite.food = col_character(),
    AGE = col_character()
  )
)

favourite.food,AGE
<chr>,<chr>
Strawberry yoghurt,4
French fries,5
,7
Anchovies,
Pizza,five
Ice cream,6


# <a name="mult-files">Reading Data from Multiple Files</a>

---

Sometimes our data is split across multiple files instead of being contained in a single file. For example, suppose we have sales data for multiple months, with each month's data in a separate file: `01-sales.csv` for January, `02-sales.csv` for February, and `03-sales.csv` for March. With `read_csv()` we can read these files in at once and stack them on top of each other in a single data frame.

- The `id` argument adds a new column to the resulting data frame, named by the string we supply (we use `month_id` for the sales files), that identifies the file each row comes from.
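
Before working with the hosted sales files, the same idea can be tried on small local files. The sketch below is only an illustration: the two CSV files, their contents, and the names `sales_dir`, `local_sales`, and `source_file` are all made up, and the files are written to a temporary folder so nothing is left behind.

```r
library(readr)

# Hypothetical monthly files written to a temporary folder for illustration
sales_dir <- tempfile("sales")
dir.create(sales_dir)
jan_path <- file.path(sales_dir, "01-sales.csv")
feb_path <- file.path(sales_dir, "02-sales.csv")
writeLines(c("month,n", "January,3", "January,9"), jan_path)
writeLines(c("month,n", "February,8"), feb_path)

# Passing a vector of paths stacks the files; id records where each row came from
local_sales <- read_csv(
  c(jan_path, feb_path),
  id = "source_file",
  show_col_types = FALSE
)

# Keep only the file name, dropping the folder part of the path
local_sales$source_file <- basename(local_sales$source_file)
print(local_sales)
```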

In [None]:
sales_files <- c(
  "https://pos.it/r4ds-01-sales",
  "https://pos.it/r4ds-02-sales",
  "https://pos.it/r4ds-03-sales"
)

sales_df <- read_csv(sales_files, id = "month_id")
print(sales_df)

Rows: 19 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): month
dbl (4): year, brand, item, n

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 19 × 6
   month_id                     month     year brand  item     n
   <chr>                        <chr>    <dbl> <dbl> <dbl> <dbl>
 1 https://pos.it/r4ds-01-sales January   2019     1  1234     3
 2 https://pos.it/r4ds-01-sales January   2019     1  8721     9
 3 https://pos.it/r4ds-01-sales January   2019     1  1822     2
 4 https://pos.it/r4ds-01-sales January   2019     2  3333     1
 5 https://pos.it/r4ds-01-sales January   2019     2  2156     9
 6 https://pos.it/r4ds-01-sales January   2019     2  3987     6
 7 https://pos.it/r4ds-01-sales January   2019     2  3827     6
 8 https://pos.it/r4ds-02-sales February  2019     1

In the code cell below, we use the `str_replace_all()` function from the `stringr` package (a core `tidyverse` package) to replace each file name with a two-digit month code (stored as a character string).

In [None]:
sales_df$month_id <- str_replace_all(
  sales_df$month_id,
  c("https://pos.it/r4ds-01-sales" = "01",
    "https://pos.it/r4ds-02-sales" = "02",
    "https://pos.it/r4ds-03-sales" = "03" )
)
print(sales_df)

# A tibble: 19 × 6
   month_id month     year brand  item     n
   <chr>    <chr>    <dbl> <dbl> <dbl> <dbl>
 1 01       January   2019     1  1234     3
 2 01       January   2019     1  8721     9
 3 01       January   2019     1  1822     2
 4 01       January   2019     2  3333     1
 5 01       January   2019     2  2156     9
 6 01       January   2019     2  3987     6
 7 01       January   2019     2  3827     6
 8 02       February  2019     1  1234     8
 9 02       February  2019     1  8721     2
10 02       February  2019     1  1822     3
11 02       February  2

# <a name="write">Writing to a File</a>

---

`readr` has useful functions for writing data to files that we can export outside of R.

- `write_csv(x, "file")`: writes data to a comma-separated file.
- `write_tsv(x, "file")`: writes data to a tab-separated file.
- Other write options include `write_delim()`, `write_csv2()`, `write_excel_csv()`, and `write_excel_csv2()`.

The most important arguments to these functions are `x` (the data frame to save) and `file` (the name and location of the file to write). We can specify other options, such as how missing values are coded, by entering additional arguments.
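
As a small sketch of one such additional argument, the code below writes a made-up data frame with the `na` argument of `write_csv()` and then prints the raw lines of the file. The data frame `scores`, the temporary file `out_file`, and the code `"missing"` are illustrative choices, not part of the course data.

```r
library(readr)

# Hypothetical data frame with one missing value
scores <- data.frame(name = c("Ana", "Ben"), score = c(90, NA))

# Write to a temporary file, coding missing values as the text "missing"
out_file <- tempfile(fileext = ".csv")
write_csv(scores, out_file, na = "missing")

# Inspect the raw text that was written
written_lines <- readLines(out_file)
print(written_lines)
```

Whatever code is chosen here should also be passed to the `na` argument of `read_csv()` when the file is imported again, so the values are recognized as missing.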

- If working in Google Colab, the file is saved to the `/content` folder of the Colab session's file system.
  - Select the files menu tab from the options on the left side bar (the image of the file boxed in red in the figure above).
  - Click on the three vertical dots on the right of the file name (see red arrow in figure above).
  - Choose `Download` from the available menu options.
  - Choose where you want to download the file onto your computer.

- In RStudio and other software, we can specify the location on our computer where the file will be saved.
  - The `getwd()` function can be used to check the location of the current working directory.
  - The `setwd()` function can be used to set/change the working directory.

In [None]:
getwd()

## <a name="quest6">Question 6</a>

---

Run the first code cell below to print the tidy data frame `students` from earlier to the screen. Then use the `write_csv()` function to write `students` to a CSV file named `students_data.csv`. Finally, import the file `students_data.csv` that you created back into R and compare the resulting data frame with `students`. Are the original `students` data frame and the newly imported data the same? If not, what is different?

In [None]:
print(students)

# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <int> <chr>            <chr>              <fct>               <int>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6


In [None]:
# solution to question 6


### <a name="sol6">Solution to Question 6</a>

---



<br>  
<br>  

## <a name="rds">`read_rds()` and `write_rds()`</a>

---


The [`write_rds()` and `read_rds()`](https://readr.tidyverse.org/reference/read_rds.html) functions are consistent wrappers around the base functions `saveRDS()` and `readRDS()` that store data in R's custom binary format called [RDS (or R Data Serialization)](https://www.geeksforgeeks.org/data-serialization-rds-using-r/). This means that when we reload the object, we are loading the exact same R object that we stored, including column types such as factors.

In [None]:
write_rds(students, "students_data.rds")

In [None]:
read_rds("students_data.rds")

student_id,full_name,favourite_food,meal_plan,age
<int>,<chr>,<chr>,<fct>,<int>
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4.0
2,Barclay Lynn,French fries,Lunch only,5.0
3,Jayendra Lyne,,Breakfast and lunch,7.0
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,5.0
6,Güvenç Attila,Ice cream,Lunch only,6.0
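
One way to check the claim that the reloaded object matches the stored one is to compare the two with `identical()`. The sketch below uses a small made-up data frame (`pets`) with a factor and an integer column, and a temporary file so it does not depend on `students_data.rds`.

```r
library(readr)

# Hypothetical data frame with a factor column and an integer column
pets <- data.frame(
  name = c("Rex", "Tom"),
  kind = factor(c("dog", "cat"), levels = c("cat", "dog")),
  age  = c(3L, 5L)
)

# Save to a temporary RDS file and load it back
rds_file <- tempfile(fileext = ".rds")
write_rds(pets, rds_file)
pets_again <- read_rds(rds_file)

# Compare the stored and reloaded objects, including types and factor levels
same_object <- identical(pets, pets_again)
print(same_object)
```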


# <a name="tribble">Data Entry with `tribble()`</a>

---

We have seen how to create a tibble by entering data into the `tibble()` function where we can enter the data column by column.

In [None]:
# defining a tibble column by column
tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

x,y,z
<dbl>,<chr>,<dbl>
1,h,0.08
2,m,0.83
5,g,0.6


Laying out the data by column can make it hard to see how the rows are related, so an alternative is `tribble()`, short for <font color="dodgerblue">**transposed tibble**</font>, which lets you lay out your data row by row:

- Column headings start with `~`, and
- Entries are separated by commas.

This makes it possible to lay out small amounts of data in an easy-to-read form.

In [None]:
tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)

x,y,z
<dbl>,<chr>,<dbl>
1,h,0.08
2,m,0.83
5,g,0.6


# <a name="other-intro">Reading Files from Other Software</a>

---



## <a name="excel">Excel Files</a>

---


There are several packages available to connect R with Excel (e.g., `xlsx`, `gdata`, `RODBC`, `XLConnect`, `RExcel`, and others).

The most popular package (developed by Hadley Wickham et al.) for importing Excel files with the extension `*.xlsx` is the `tidyverse` package [`readxl`](https://readxl.tidyverse.org/), which works very similarly to `readr`.

- It is not a core `tidyverse` package, so we must load it separately.
- `readxl` is already installed in Colab, so we only need to load it.



In [None]:
library(readxl)

In [None]:
# download file with gdp and health data
download.file("https://raw.githubusercontent.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/main/Data/health-wealth.xlsx", "health-wealth.xlsx")

In [None]:
# import one worksheet from the Excel workbook
health_wealth <- read_excel("health-wealth.xlsx", sheet = "Health")
print(health_wealth)

# A tibble: 222 × 14
   `Country Name`     `Country Code` `Series Name` `Series Code` `2011 [YR2011]`
   <chr>              <chr>          <chr>         <chr>         <chr>
 1 Afghanistan        AFG            Mortality ra… SP.DYN.IMRT.… 61.7
 2 Albania            ALB            Mortality ra… SP.DYN.IMRT.… 10.8
 3 Algeria            DZA            Mortality ra… SP.DYN.IMRT.… 22.9
 4 American Samoa     ASM            Mortality ra… SP.DYN.IMRT.… ..
 5 Andorra            AND            Mortality ra… SP.DYN.IMRT.… 4
 6 Angola             AGO            Mortality ra… SP.DYN.IMRT.… 71.40000000000…
 7 Antigua and Barbu… ATG            Mortality ra… SP.DYN.IMRT.… 8.199999999999…
 8 Argentina          ARG            Mortality ra… SP.DYN.IMRT.…

In [None]:
# use the janitor::clean_names function to rename all columns
health_wealth <- health_wealth |> clean_names()
print(health_wealth)

# A tibble: 222 × 14
   country_name   country_code series_name series_code x2011_yr2011 x2012_yr2012
   <chr>          <chr>        <chr>       <chr>       <chr>        <chr>
 1 Afghanistan    AFG          Mortality … SP.DYN.IMR… 61.7         59.5
 2 Albania        ALB          Mortality … SP.DYN.IMR… 10.8         10
 3 Algeria        DZA          Mortality … SP.DYN.IMR… 22.9         22.5
 4 American Samoa ASM          Mortality … SP.DYN.IMR… ..           ..
 5 Andorra        AND          Mortality … SP.DYN.IMR… 4            3.9
 6 Angola         AGO          Mortality … SP.DYN.IMR… 71.40000000… 67.3
 7 Antigua and B… ATG          Mortality … SP.DYN.IMR… 8.199999999… 7.8
 8 Argentina      ARG          Mortality … SP.

In [None]:
# importing with additional options
read_excel(
  "health-wealth.xlsx",
  sheet = "Health",  # import only the data in the sheet named "Health"
  range = "A3:E9",  # import only this cell range from the indicated sheet
  col_names = c("country_name", "country_code", "series_name", "series_code", "2011")  # rename columns
)

country_name,country_code,series_name,series_code,2011
<chr>,<chr>,<chr>,<chr>,<chr>
Albania,ALB,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,10.8
Algeria,DZA,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,22.9
American Samoa,ASM,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,..
Andorra,AND,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,4
Angola,AGO,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,71.400000000000006
Antigua and Barbuda,ATG,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,8.1999999999999993
Argentina,ARG,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,12.4
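
When we do not already know which worksheets a workbook contains, `excel_sheets()` can list them before we import. The sketch below does not use the course file. Instead it assumes the example workbook `datasets.xlsx` that ships with `readxl` (located with `readxl_example()`), so it runs without downloading anything.

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
example_path <- readxl_example("datasets.xlsx")

# List the worksheet names so we know what can be passed to `sheet`
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# Import the header row plus three data rows from the first three columns
iris_part <- read_excel(example_path, sheet = "iris", range = "A1:C4")
print(iris_part)
```

Because the range starts at row 1, the first row of the range is used for the column names, so no `col_names` vector is needed here.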


## <a name="import-other">Importing Data from Other Statistical Packages</a>

---


For Stata and Systat, use the [`foreign`](https://cran.r-project.org/web/packages/foreign/index.html) package which is already installed in Colab (so we only need to load it with a `library()` command).

- `read.dta("myfile.dta")`: Imports a Stata `*.dta` file into an R data frame.
- `read.systat("myfile.syd")`: Imports a Systat `*.syd` (or `*.sys`) file into an R data frame.




In [None]:
library(foreign)
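
Without a Stata file at hand, one way to try `read.dta()` is to first create a small `*.dta` file with `write.dta()`, which is also part of `foreign`. The sketch below does this with a made-up data frame (`survey`) and a temporary file. Note that `read.dta()` is documented to support only older Stata file versions, so files saved by a recent Stata release may need a different tool.

```r
library(foreign)

# Hypothetical survey data used only for this round trip
survey <- data.frame(id = 1:3, income = c(52.5, 61.0, 48.2))

# Write a Stata file to a temporary location, then import it again
dta_file <- tempfile(fileext = ".dta")
write.dta(survey, dta_file)
survey_again <- read.dta(dta_file)
print(survey_again)
```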

For SPSS and SAS, we recommend the [`Hmisc`](https://cran.r-project.org/web/packages/Hmisc/index.html) package for ease and functionality.

- For SPSS files:
  - Export your data in SPSS to a portable file format `*.por`.
  - `spss.get("myfile.por")` imports the data into R.

- For SAS files:
  - Export your data in SAS to an XPT file `*.xpt`
  - `sasxport.get("myfile.xpt")` imports the data into R.


In [None]:
install.packages("Hmisc")
library(Hmisc)

## <a name="CC License">Creative Commons License Information</a>
---

![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

Materials created by the [Department of Mathematical and Statistical Sciences at the University of Colorado Denver](https://github.com/CU-Denver-MathStats-OER/)
are licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/).