# Chapter 11 - Reading SAS Datasets

## Table of Contents:

1. [Objectives](#Objectives)
2. [Reading a Single Dataset](#Reading-a-Single-Dataset)
3. [Manipulating Data](#Manipulating-Data)
4. [Using BY-Group Processing](#Using-BY-Group-Processing)
5. [Detecting the End of a Dataset](#Detecting-the-End-of-a-Dataset)

### Objectives

* create a new dataset from an existing dataset
* use `BY` groups to process observations
* read observations by observation number
* stop processing when necessary
* explicitly write observations to an output dataset
* detect the last observation in a dataset
* identify differences in DATA step processing for raw data and SAS datasets

[(back to top)](#Table-of-Contents:)

### Reading a Single Dataset

To create a new dataset from an existing dataset, use the DATA step. The new dataset name will go in the DATA statement while the source dataset goes in the `SET` statement:

    data <output dataset>;
        set <source dataset>;
        ...
    run;
    
For example:

    data class;
        set sashelp.class;
        where sex = 'F';
    run;

    proc print data = class label;
    run;

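| Obs | Name | Sex | Age | Height | Weight |
|---|---|---|---|---|---|
| 1 | Alice | F | 13 | 56.5 | 84.0 |
| 2 | Barbara | F | 13 | 65.3 | 98.0 |
| 3 | Carol | F | 14 | 62.8 | 102.5 |
| 4 | Jane | F | 12 | 59.8 | 84.5 |
| 5 | Janet | F | 15 | 62.5 | 112.5 |
| 6 | Joyce | F | 11 | 51.3 | 50.5 |
| 7 | Judy | F | 14 | 64.3 | 90.0 |
| 8 | Louise | F | 12 | 56.3 | 77.0 |
| 9 | Mary | F | 15 | 66.5 | 112.0 |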


[(back to top)](#Table-of-Contents:)

### Manipulating Data

All the techniques from Chapter 10 can be used to manipulate the data on its way to the output dataset; they are not reviewed here. The `DROP` and `KEEP` statements also have dataset-option forms, `DROP =` and `KEEP =`, which can be attached to a specific dataset in the `SET` statement:

    data <output dataset>;
        set <source dataset> (drop = var1);
        ...
    run;
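
As a concrete sketch of these dataset options, the steps below read a subset of variables from `sashelp.class`, the sample dataset used in the previous section. The output names `class_small` and `class_small2` are made up for illustration.

```sas
/* Read only name, sex, and age; the other variables never enter the PDV */
data class_small;
    set sashelp.class (keep = name sex age);
run;

/* Same variables in the result, this time listing what to leave out */
data class_small2;
    set sashelp.class (drop = height weight);
run;
```

Because the option is attached to the source dataset, the excluded variables are not available to any statement inside the DATA step.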
    
To read a specific observation directly by its observation number, use the `POINT =` option in the `SET` statement:

    data <output dataset>;
        numvar = <observation number>;
        set <source dataset> point = numvar;
        ...
        output;    /* STOP ends the step before the implicit output */
        stop;
    run;
    
The observation number **has** to be supplied through a numeric variable; `POINT =` does not accept a literal, and SAS returns an error if one is used. The `STOP` statement is needed because `POINT =` reads by direct access, so the DATA step never reaches the end of the file and would otherwise loop continuously. If the `POINT =` variable holds an invalid observation number, SAS sets `_ERROR_ = 1`; testing for this condition and then executing `STOP` is an alternative way to end the step.
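
A minimal sketch of direct access, assuming the standard `sashelp.class` sample dataset with 19 observations. The dataset names `fifth_obs` and `picked_obs` and the variable `obs_num` are hypothetical. The second step deliberately asks for an out-of-range observation to illustrate the `_ERROR_` behavior described above.

```sas
/* Read only observation 5 */
data fifth_obs;
    obs_num = 5;                         /* POINT = needs a variable, not a literal */
    set sashelp.class point = obs_num;
    output;                              /* write the observation explicitly */
    stop;                                /* direct access never reaches end of file */
run;

/* Request several observation numbers; 25 is out of range on purpose */
data picked_obs;
    do obs_num = 3, 7, 25;
        set sashelp.class point = obs_num;
        if _error_ then do;
            put 'Invalid observation number: ' obs_num;
            _error_ = 0;                 /* reset the flag so the step carries on */
        end;
        else output;
    end;
    stop;
run;
```

The `POINT =` variable is temporary, so `obs_num` should not appear in either output dataset.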

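Use the `OUTPUT` statement to write the observation currently in the program data vector (PDV) to the output dataset at that point in the code. Once an `OUTPUT` statement appears anywhere in the DATA step, SAS no longer writes an observation automatically at the end of each iteration. For example: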

    data <output dataset>;
        set <source dataset>;
        if <expression1> then output;
        else if <expression2> then delete;
        else output;
    run;
    
`OUTPUT` can also direct observations to two or more datasets in the same DATA step:

    data <output dataset1> <output dataset2>;
        set <source dataset>;
        if <expression1> then output <output dataset1>;
        else output <output dataset2>;
    run;
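
For instance, the template above could be filled in as follows to split `sashelp.class` by sex. The dataset names `girls` and `boys` are chosen only for this illustration.

```sas
/* Route each observation to one of two output datasets */
data girls boys;
    set sashelp.class;
    if sex = 'F' then output girls;
    else output boys;
run;
```

Every dataset named in an `OUTPUT` statement must also be listed in the `DATA` statement.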

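When a `STOP` statement ends the step, an explicit `OUTPUT` statement must be used as well: `STOP` halts processing immediately, before the implicit output at the end of the iteration, so nothing is written otherwise.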

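Regarding DATA step processing, SAS retains the values of all variables read by the `SET` statement from one iteration to the next; they change only when the next observation is read. This differs from reading raw data, where the variables are reset to missing at the start of each iteration. Variables newly created in the DATA step (for example, by an assignment statement) are still set to missing at the start of each iteration.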
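
One common idiom that relies on this retention is sketched below: a one-row summary dataset is read once, and its value stays in the PDV for every later iteration. The use of `PROC MEANS` to build the summary, and the names `stats`, `mean_height`, `class_dev`, and `deviation`, are assumptions made for this illustration and are not part of the chapter.

```sas
/* Build a one-row dataset holding the overall mean height */
proc means data = sashelp.class noprint;
    var height;
    output out = stats mean = mean_height;
run;

data class_dev;
    /* Read the summary row once; mean_height is retained afterwards */
    if _n_ = 1 then set stats (keep = mean_height);
    set sashelp.class;
    /* deviation is created in this step, so it starts each iteration as missing */
    deviation = height - mean_height;
run;
```

If variables from `SET` were reset on every iteration, `mean_height` would be missing from the second observation onward.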

[(back to top)](#Table-of-Contents:)

### Using BY-Group Processing

A `BY`-group is a set of observations that share the same value of a `BY` variable. The dataset must already be sorted (or indexed) by that variable before it is read with a `BY` statement. For example:

    data temp;
        set salary;
        by dept;
    run;
    
SAS processes all the observations in the first `BY`-group it sees, then moves on to the observations in the next `BY`-group, and so on. The `BY` statement also gives you access to two additional temporary variables:
* `FIRST.variable` - 1 (true) for the first observation in a `BY`-group, 0 (false) for the rest
* `LAST.variable` - 1 (true) for the last observation in a `BY`-group, 0 (false) for the rest

Here is an example of usage:

    proc sort data = sashelp.class out = class;
        by sex age;
    run;

    data class;
        length status $ 8;
        set class;
        by sex;
        if first.sex then status = 'Youngest';
        if last.sex then status = 'Oldest';
    run;

    proc print data = class label;
    run;
    

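| Obs | status | Name | Sex | Age | Height | Weight |
|---|---|---|---|---|---|---|
| 1 | Youngest | Joyce | F | 11 | 51.3 | 50.5 |
| 2 | | Jane | F | 12 | 59.8 | 84.5 |
| 3 | | Louise | F | 12 | 56.3 | 77.0 |
| 4 | | Alice | F | 13 | 56.5 | 84.0 |
| 5 | | Barbara | F | 13 | 65.3 | 98.0 |
| 6 | | Carol | F | 14 | 62.8 | 102.5 |
| 7 | | Judy | F | 14 | 64.3 | 90.0 |
| 8 | | Janet | F | 15 | 62.5 | 112.5 |
| 9 | Oldest | Mary | F | 15 | 66.5 | 112.0 |
| 10 | Youngest | Thomas | M | 11 | 57.5 | 85.0 |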
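
Another typical use of `FIRST.` and `LAST.` is to produce one summary row per `BY`-group. The sketch below counts the students of each sex. It assumes the sum statement (`n_students + 1;`), which retains its variable across iterations, and the names `class_sorted`, `sex_counts`, and `n_students` are hypothetical.

```sas
proc sort data = sashelp.class out = class_sorted;
    by sex;
run;

data sex_counts (keep = sex n_students);
    set class_sorted;
    by sex;
    if first.sex then n_students = 0;   /* reset the counter at the start of each group */
    n_students + 1;                     /* sum statement: value carries over between iterations */
    if last.sex then output;            /* write one row per group */
run;
```

The result should contain one observation for each value of `sex`.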


[(back to top)](#Table-of-Contents:)

### Detecting the End of a Dataset

Use the `END =` option in the `SET` statement to create a temporary variable that is 1 (true) when the DATA step is processing the last observation of the dataset and 0 (false) otherwise:

    set <source dataset> end = <end variable>;
    
Then, use the variable in an `IF` statement:

    if <end variable> then <statement>;
    
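As a note, the `END =` option cannot be specified with the `POINT =` option: with direct access SAS never detects the end of the file, so the `END =` variable would never be set to 1.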
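
Put together, a sketch of the `END =` pattern might look like this. It accumulates totals over `sashelp.class` and writes a single observation once the last row has been read. The names `class_total`, `last_obs`, `n_students`, and `total_weight` are illustrative, and the sum statements are an assumption carried over from general DATA step usage.

```sas
data class_total (keep = n_students total_weight);
    set sashelp.class end = last_obs;
    n_students + 1;               /* running count of observations */
    total_weight + weight;        /* running total of weight */
    if last_obs then output;      /* write only after the final observation is read */
run;
```

Like the `POINT =` variable, the `END =` variable is temporary and is not written to the output dataset.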

[(back to top)](#Table-of-Contents:)