# Chapter 06 - Understanding DATA Step Processing

## Table of Contents:

1. [Objectives](#Objectives)
2. [How SAS Processes Programs](#How-SAS-Processes-Programs)
3. [Compilation Phase](#Compilation-Phase)
4. [Execution Phase](#Execution-Phase)
5. [Debugging a DATA Step](#Debugging-a-DATA-Step)
6. [Testing Your Programs](#Testing-Your-Programs)

### Objectives

* identify the 2 phases that occur when a DATA step is processed
* interpret automatic variables
* identify the processing phase in which an error occurs
* debug SAS DATA steps
* validate and clean invalid data
* test programs by limiting the number of observations that are created
* flag errors in the SAS log

[(back to top)](#Table-of-Contents:)

### How SAS Processes Programs

A SAS DATA step is processed in 2 phases:
1. compilation phase - each statement is scanned for syntax errors and the descriptor portion of the dataset is created
2. execution phase - the input data is read and processed, with the statements in the DATA step executing once per observation

If a syntax error is found, it is reported during the compilation phase and, in most cases, the DATA step does not go on to execute.

[(back to top)](#Table-of-Contents:)

### Compilation Phase

At the beginning of the compilation phase, an input buffer is first created to hold a single observation from an external file. No input buffer is created for a SAS dataset.

Next, a Program Data Vector (PDV) is created where SAS will hold the current observation. The PDV contains 2 automatic variables that are not output to the resulting dataset:
* `_N_` counts the number of times the DATA step has begun to execute, including the current iteration
* `_ERROR_` is a boolean flag (0 or 1) that signals whether an error has occurred during the current DATA step iteration
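
As a minimal sketch of how the automatic variables behave, the step below writes them to the log and then lists the variables of the output dataset. It assumes the `sashelp.class` sample dataset is available; the output name `work.auto_vars` is only illustrative.

```sas
data work.auto_vars;
    set sashelp.class;
    /* Log the automatic variables for the first three iterations only */
    if _n_ <= 3 then put _n_= _error_= name=;
run;

/* The variable list shows only the columns that came from sashelp.class */
proc contents data=work.auto_vars;
run;
```

The `PROC CONTENTS` listing should not include `_N_` or `_ERROR_`, since they exist only in the PDV.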

A PDV looks like the following:

<table>
    <tr>
        <td><code>_N_</code></td><td><code>_ERROR_</code></td><td><code>var1</code></td><td><code>var2</code></td><td><code>var3</code></td><td><code>var4</code></td><td><code>var5</code></td><td><code>...</code></td>
    </tr>
</table>

At this stage, the PDV holds no data values.

Syntax checking is performed as each statement is scanned. Most syntax errors prevent the DATA step from executing, even though SAS continues scanning so that further errors can be reported. As variables are encountered, their attributes (such as type and length) are set and space is allocated in the PDV to accommodate them. Finally, the descriptor portion of the dataset is generated.
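
Because lengths are fixed during compilation, a character variable created by assignment takes its length from the first value assigned to it in program order. The sketch below illustrates this with `sashelp.class`; the variable name `group` and the output dataset names are hypothetical.

```sas
data work.length_demo;
    set sashelp.class;
    /* The first assignment that SAS compiles fixes the length of group at 3 */
    if sex = 'M' then group = 'Boy';
    else group = 'Girl';   /* stored as 'Gir' because the length is already 3 */
run;

data work.length_fixed;
    /* Declaring the length first avoids the truncation */
    length group $ 4;
    set sashelp.class;
    if sex = 'M' then group = 'Boy';
    else group = 'Girl';
run;
```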

In summary, the compilation phase consists of:
1. input buffer creation (if source data is from external file)
2. Program Data Vector (PDV) is created to hold the current observation
3. syntax checking occurs
4. variable attributes such as type and length are set
5. the descriptor portion of the SAS dataset is created
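
One way to see that the descriptor portion is built at compile time is a step whose `SET` statement is compiled but never executed. This is a sketch that assumes `sashelp.class` is available; `work.structure_only` is an illustrative name.

```sas
data work.structure_only;
    /* Compiled, so its variables enter the PDV, but the condition is never true */
    if 0 then set sashelp.class;
    /* End the step before any observation is written */
    stop;
run;

/* Shows the variables and their attributes, with 0 observations */
proc contents data=work.structure_only;
run;
```

The resulting dataset should have the same variables as `sashelp.class` and no observations, which reflects work done entirely during compilation.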

[(back to top)](#Table-of-Contents:)

### Execution Phase

Once the compilation phase is over, the DATA step moves on to the execution phase. During execution, each observation from the source passes through the following intermediaries on its way to the output dataset:

<table>
    <tr>
        <th>Source</th>
        <th>Intermediary</th>
    </tr>
    <tr>
        <td>external file</td>
        <td>input buffer => PDV</td>
    </tr>
    <tr>
        <td>SAS dataset</td>
        <td>PDV</td>
    </tr>
</table>

The statements in the DATA step then operate on the values in the PDV, and any new or changed values are stored there. The DATA step executes once for each record in the input file, unless otherwise directed by additional statements.

At the beginning of the execution phase, `_N_ = 1` and `_ERROR_ = 0`. All other variables are set to missing.

Then, the `SET` statement (for a SAS dataset) or the `INFILE` and `INPUT` statements (for an external file) identify the source and read the first observation. SAS places each value in the matching variable in the PDV. Assignment statements then execute, and each result replaces the value that the variable previously held in the PDV.

At the end of the DATA step (indicated by a `RUN` statement), the values in the PDV are written to the output dataset. Then, SAS returns to the top of the DATA step, `_N_` is incremented by 1, and `_ERROR_` is reset to 0. This cycle repeats for each observation until SAS reaches the end of the data source.
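
The cycle can be traced by writing the whole PDV to the log at the top and bottom of each iteration. The following is a sketch that assumes `sashelp.class` is available; `work.trace` and the derived variable `height_cm` are hypothetical.

```sas
data work.trace;
    put 'Top of iteration:';
    put _all_;                    /* PDV contents before anything is read */
    set sashelp.class(obs=3);     /* read only three observations */
    height_cm = height * 2.54;    /* hypothetical derived variable */
    put 'Bottom of iteration:';
    put _all_;                    /* PDV contents just before output */
run;
```

In the log, the top-of-iteration lines should appear four times and the bottom-of-iteration lines three times, because the step only stops when `SET` tries to read past the last observation. Note also that `height_cm` is missing at the top of each iteration, while values read by `SET` are still present from the previous one.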

In summary, the execution phase consists of:
1. `_N_` is set to 1 on the first iteration (and incremented by 1 on each later iteration) and `_ERROR_` is set to 0
2. variables created by `INPUT` or assignment statements are set to missing (values read by `SET` are retained until the next observation overwrites them)
3. each statement is executed sequentially
4. the `SET` or `INPUT` statement reads the current observation into the PDV
5. assignment statements modify the current observation
6. PDV values are written to the output dataset at the end of each iteration of the DATA step
7. if not the end of file, the DATA step restarts at the top using the next observation
8. if end of file, the DATA step ends
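
For the raw-data path (input buffer, then PDV), a small in-stream example can show `_ERROR_` changing during execution. This is a sketch with made-up records; the dataset name `work.scores` and its variables are hypothetical.

```sas
data work.scores;
    input name $ score;
    /* _ERROR_ becomes 1 for the iteration in which invalid data is read */
    if _error_ then put 'Problem record: ' _n_= name=;
    datalines;
Ann 90
Bob abc
Cid 85
;
run;
```

The second record has a non-numeric value for `score`, so SAS sets `score` to missing for that observation, notes the invalid data in the log, and sets `_ERROR_` to 1 for that iteration only. All three observations are still written to the output dataset.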

[(back to top)](#Table-of-Contents:)

### Debugging a DATA Step

Debugging is addressed in Chapter 04. Refer to that document for more information.

To detect invalid data, use any of these procedures:
* `PROC PRINT`
* `PROC FREQ`
* `PROC MEANS`

Their usage will not be covered here. Subsetting can be used to help clean the data and check for invalid observations. Observations with invalid data can be deleted if need be:

    if <expression that identifies invalid data> then delete;
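
A concrete version of this pattern might look like the sketch below. The validity rule (a missing or non-positive `age`) and the output name `work.class_clean` are assumptions chosen for illustration, not rules from this chapter.

```sas
data work.class_clean;
    set sashelp.class;
    /* Hypothetical rule: drop rows whose age is missing or not positive */
    if missing(age) or age <= 0 then delete;
run;
```

`DELETE` stops processing the current observation and returns to the top of the DATA step without writing it to the output dataset.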

[(back to top)](#Table-of-Contents:)

### Testing Your Programs

Apart from limiting the number of observations that are processed or printed, the `PUT` statement is useful for testing. `PUT` writes text and variable values to the SAS log, which makes it a simple diagnostic tool. For example:

    data class;
        set sashelp.class;
        if sex = 'M' then put 'Male';
    run;
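
Limiting the number of observations can be combined with `PUT` in the same test step. The sketch below assumes `sashelp.class` is available; the output name `work.class_test` and the age threshold are hypothetical.

```sas
data work.class_test;
    set sashelp.class(obs=5);   /* OBS= stops reading after the fifth observation */
    put _n_= name= age=;        /* name=value pairs are easy to scan in the log */
    /* Hypothetical check: flag selected rows with a message in the log */
    if age >= 14 then put 'NOTE: age is 14 or older for ' name=;
run;
```

Starting a message with `NOTE:`, `WARNING:`, or `ERROR:` is a common convention because many SAS log viewers highlight such lines, which helps flagged records stand out.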

[(back to top)](#Table-of-Contents:)