# Chapter 05 - Creating SAS Data Sets from External Files

## Table of Contents:

1. [Objectives](#Objectives)
2. [Creating SAS Datasets from Raw Data Files](#Creating-SAS-Datasets-from-Raw-Data-Files)
3. [Writing a DATA Step Program](#Writing-a-DATA-Step-Program)
4. [Submitting the DATA Step Program](#Submitting-the-DATA-Step-Program)
5. [Creating and Modifying Variables](#Creating-and-Modifying-Variables)
6. [Subsetting Data](#Subsetting-Data)
7. [Reading Instream Data](#Reading-Instream-Data)
8. [Creating a Raw Data File](#Creating-a-Raw-Data-File)
9. [Reading/Creating Microsoft Excel Data](#Reading/Creating-Microsoft-Excel-Data)

### Objectives <a class="anchor" id="objectives"></a>

* reference a SAS library
* reference a raw data file
* name a SAS dataset to be created
* specify a raw data file to be read
* read standard character and numeric values in fixed fields
* create new variables and assign values
* select observations based on conditions
* read instream data
* submit and verify a DATA step program
* read a SAS data set and write the observations out to a raw data file
* use the DATA step to create a SAS dataset from an Excel worksheet
* use the SAS/ACCESS LIBNAME statement to read from an Excel worksheet
* create an Excel worksheet from a SAS dataset
* use the IMPORT procedure to read external files

[(back to top)](#Table-of-Contents:)

### Creating SAS Datasets from Raw Data Files

Before creating SAS datasets from raw data files, you can view the files within SAS using the `FSLIST` procedure. See the SAS help documentation for more information.

In order to read a raw data file, the DATA step needs to include the following instructions:
* the location/name of the external text file
* a reference that identifies the external file
* a description of the data values to be read

In addition to creating a _libref_, a reference to a filename can be made using the `FILENAME` statement:

    filename fileref 'filename';

Once a _fileref_ is specified, it can be used in SAS commands and statements.

If you are unsure about the contents and structure of the raw data file, `PROC FSLIST` can be used to display the contents within SAS:

    proc fslist fileref = <fileref or 'file path'>;
    run;

[(back to top)](#Table-of-Contents:)

### Writing a DATA Step Program

To read a raw data file, use the `INFILE` statement to indicate the file within a DATA step:

    data <dataset>;
        infile <raw file path/ref>;
    run;
    
Standard numeric data includes:
* numerals
* decimal points
* numbers in scientific/E notation
* +/- signs

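Data organization:
* fixed field
    * values arranged in columns, so each field starts and ends at the same position in every record
* free format
    * delimited data not arranged in columns

Column input reads fixed-field data:
* specifies actual column locations for values
* can only be used with standard character or numeric values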

Nonstandard numeric data includes:
* values that contain special characters, such as percent signs, dollar signs, and commas
* date and time values
* data in fraction, integer binary, real binary, and hexadecimal forms

The `INPUT` statement describes the fields of raw data to be read and placed within a DATA step:

    input <variable> <$ if char> <startcol>-<endcol>;

For example (this dataset does not actually exist):

    filename exer 'C:\Users\exer.dat';
    data exercise;
        infile exer;
        input ID $ 1-4 Age 6-7 ActLevel $ 9-12 Sex $ 14;
    run;

[(back to top)](#Table-of-Contents:)

### Submitting the DATA Step Program

As a tip, to limit the amount of processing SAS needs to do while debugging, set `OBS = 1` (for example, as an option on the `INFILE` statement) so that only the first record is read, and print the resulting data. Once no errors appear in the log, remove the `OBS = 1` option.
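
A minimal sketch of this tip, assuming a few instream records stand in for the raw data file. The dataset name `check` and the three records are made up for illustration:

```sas
data check;
    infile datalines obs = 1;   /* read only the first record while debugging */
    input ID $ 1-4 Age 6-7;
    datalines;
2458 27
2462 34
2501 31
;
run;

/* Print the result and check the log before removing OBS = 1 */
proc print data = check;
run;
```

With a real file, `datalines` on the `INFILE` statement would be replaced by the fileref or file path.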

[(back to top)](#Table-of-Contents:)

### Creating and Modifying Variables

The DATA step can be used to:
* transform variables
* create new variables
* conditionally process variables
* calculate new variables
* assign new variables

The following arithmetic operators are available:

<table>
    <tr>
        <td>**Operator**</td>
        <td>**Meaning**</td>
    </tr>
    <tr>
        <td>-</td>
        <td>negative prefix</td>
    </tr>
    <tr>
        <td><code>**</code></td>
        <td>exponentiation</td>
    </tr>
    <tr>
        <td><code>*</code></td>
        <td>multiplication</td>
    </tr>
    <tr>
        <td>/</td>
        <td>division</td>
    </tr>
    <tr>
        <td>+</td>
        <td>addition</td>
    </tr>
    <tr>
        <td>-</td>
        <td>subtraction</td>
    </tr>
</table>

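SAS follows the PEMDAS order of operations: expressions in parentheses are evaluated first, then exponentiation, then multiplication and division, then addition and subtraction.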
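
As a quick check of the evaluation order, the following sketch writes two results to the log. The variable names `a` and `b` are arbitrary:

```sas
data _null_;
    a = 2 + 3 * 4 ** 2;     /* exponentiation, then multiplication, then addition */
    b = (2 + 3) * 4 ** 2;   /* the parenthesized sum is evaluated first */
    put a= b=;              /* writes a=50 b=80 to the log */
run;
```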

Comparison operators are:

<table>
    <tr>
        <td>**Operator**</td>
        <td>**Meaning**</td>
    </tr>
    <tr>
        <td>=</td>
        <td>equal to</td>
    </tr>
    <tr>
        <td>~=</td>
        <td>not equal to</td>
    </tr>
    <tr>
        <td>&gt;</td>
        <td>greater than</td>
    </tr>
    <tr>
        <td>&lt;</td>
        <td>less than</td>
    </tr>
    <tr>
        <td>&gt;=</td>
        <td>greater than or equal to</td>
    </tr>
    <tr>
        <td>&lt;=</td>
        <td>less than or equal to</td>
    </tr>
</table>

Logical operators are:

<table>
    <tr>
        <td>**Operator**</td>
        <td>**Meaning**</td>
    </tr>
    <tr>
        <td>&amp;</td>
        <td>logical AND</td>
    </tr>
    <tr>
        <td>|</td>
        <td>logical OR</td>
    </tr>
</table>
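
Comparison and logical operators can be combined in a single expression. The sketch below uses `sashelp.class`; the output dataset `flagged` and the variables `older_girl` and `outside_range` are hypothetical names chosen for illustration:

```sas
data flagged;
    set sashelp.class;
    /* A true condition evaluates to 1 and a false one to 0 */
    older_girl    = (age >= 14 & sex = 'F');
    outside_range = (age < 12 | age > 15);
run;

proc print data = flagged (obs = 5) noobs;
    var name age sex older_girl outside_range;
run;
```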

An example of variable assignment is as follows:

In [1]:
data class;
    set sashelp.class;
    age_days = age*365; * Created a new variable called AGE_DAYS;
    label age_days = 'Days Alive';
run;

proc print data = class (obs = 10) label noobs;
    var name age_days;
run;

Name,Days Alive
Alfred,5110
Alice,4745
Barbara,4745
Carol,5110
Henry,5110
James,4380
Jane,4380
Janet,5475
Jeffrey,4745
John,4380


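SAS provides a shorthand for accumulating a value, the sum statement:

    var1 + var2;

The result of the expression is added to `VAR1`. There is no corresponding subtraction form; to subtract, add a negated expression:

    var1 + (-var2);

In a sum statement, SAS automatically initializes `VAR1` to 0 and retains its value from one observation to the next. This applies only to the shorthand. If `VAR1` has not been given a value and an ordinary assignment statement is used instead:

    var1 = var1 + var2;

then SAS places a missing value into `VAR1`.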
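
The difference between the shorthand `var1 + var2;` and the assignment form can be seen in a small sketch. The dataset `running` and the variables `amount`, `total`, and `untracked` are made up for illustration:

```sas
data running;
    input amount;
    total + amount;                   /* shorthand: TOTAL starts at 0 and accumulates */
    untracked = untracked + amount;   /* assignment: UNTRACKED is missing, so the sum is missing */
    datalines;
10
20
30
;
run;

proc print data = running noobs;
run;
```

Here `total` takes the values 10, 30, and 60, while `untracked` is missing in every observation.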

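Date values can be assigned to variables using date constants: a quoted date followed by the letter `D`, where `dd` is the day, `mmm` is the three-letter month abbreviation, and `yy` (or `yyyy`) is the year. A date constant has the following form: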

    'ddmmmyy'd
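
SAS stores a date as the number of days since January 1, 1960, which the following sketch makes visible in the log. The variable names are arbitrary, and four-digit years are used to avoid any ambiguity:

```sas
data _null_;
    origin = '01jan1960'd;   /* stored as 0 */
    later  = '15jan1960'd;   /* stored as 14 */
    put origin= later=;      /* writes origin=0 later=14 to the log */
run;
```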

[(back to top)](#Table-of-Contents:)

### Subsetting Data

There are 2 ways of subsetting data in a DATA step:
* IF statement
* WHERE statement

The `WHERE` statement uses the same syntax in a DATA step as it does in a `PROC PRINT` step. The subsetting `IF` statement has the following syntax; only observations for which the expression is true are kept:

    if <expression>;
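
Both approaches are sketched below on `sashelp.class`. The output dataset names `teens_if` and `teens_where` are hypothetical:

```sas
/* Subsetting IF: only observations that satisfy the condition are written out */
data teens_if;
    set sashelp.class;
    if age >= 13;
run;

/* WHERE: the condition is applied as observations are read from the input dataset */
data teens_where;
    set sashelp.class;
    where age >= 13;
run;
```

The two datasets contain the same observations here. When reading raw data with `INPUT`, the subsetting `IF` is the one to use, since `WHERE` applies only to variables in an existing SAS dataset.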

[(back to top)](#Table-of-Contents:)

### Reading Instream Data

To enter observations directly in a SAS program, use the `DATALINES` statement. The general form is:

    data <dataset>;
        input <variables>;
        datalines;
    ... list data here ...
    ;
    run;

The `INPUT` keyword is followed by the list of variables and their start and end columns. Don't forget to include the `$` symbol after the variable name, before its column range, if the variable is meant to be character.

An example of this in action can be seen below:

In [2]:
data stress; 
   input ID $ 1-4 Name $ 6-25 RestHR 27-29 MaxHR 31-33  
         RecHR 35-37 TimeMin 39-40 TimeSec 42-43  
         Tolerance $ 45;  
   if tolerance='D';  
   TotalTime=(timemin*60)+timesec;  
   datalines;  
2458 Murray, W            72  185 128 12 38 D  
2462 Almers, C            68  171 133 10  5 I  
2501 Bonaventure, T       78  177 139 11 13 I 
2523 Johnson, R           69  162 114  9 42 S 
2539 LaMance, K           75  168 141 11 46 D  
2544 Jones, M             79  187 136 12 26 N  
2552 Reberson, P          69  158 139 15 41 D  
2555 King, E              70  167 122 13 13 I  
2563 Pitts, D             71  159 116 10 22 S  
2568 Eberhardt, S         72  182 122 16 49 N 
2571 Nunnelly, A          65  181 141 15  2 I  
2572 Oberon, M            74  177 138 12 11 D  
2574 Peterson, V          80  164 137 14  9 D  
2575 Quigley, M           74  152 113 11 26 I  
2578 Cameron, L           75  158 108 14 27 I  
2579 Underwood, K         72  165 127 13 19 S  
2584 Takahashi, Y         76  163 135 16  7 D  
2586 Derber, B            68  176 119 17 35 N  
2588 Ivan, H              70  182 126 15 41 N  
2589 Wilcox, E            78  189 138 14 57 I  
2595 Warren, C            77  170 136 12 10 S  
;
run;

proc print data = stress (obs = 5);
run;

Obs,ID,Name,RestHR,MaxHR,RecHR,TimeMin,TimeSec,Tolerance,TotalTime
1,2458,"Murray, W",72,185,128,12,38,D,758
2,2539,"LaMance, K",75,168,141,11,46,D,706
3,2552,"Reberson, P",69,158,139,15,41,D,941
4,2572,"Oberon, M",74,177,138,12,11,D,731
5,2574,"Peterson, V",80,164,137,14,9,D,849


[(back to top)](#Table-of-Contents:)

### Creating a Raw Data File

Within a null DATA step (`DATA _NULL_`, which runs without creating a dataset), the `FILE` statement can be used to write observations from a SAS dataset to a raw data file:

    file <filename/path/ref> <options> <os-options>;

The `PUT` statement can be used in conjunction to describe the lines that are written to the file:

    put <variable list with start/end column information>;
    
For example:

    data _null_;
        set <dataset>;
        file <filename/path/ref>;
        put <variable list with start/end column information>;
    run;
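
A concrete sketch of this pattern, assuming `sashelp.class` as the input and a temporary file as the destination. The fileref `rawout` and the column positions are choices made for illustration:

```sas
filename rawout temp;   /* temporary file; replace with a real path as needed */

/* Write three variables to fixed columns in the raw file */
data _null_;
    set sashelp.class;
    file rawout;
    put Name 1-8 Sex 10 Age 12-13;
run;

/* Read the file back and echo each record to the log to verify it */
data _null_;
    infile rawout;
    input;
    put _infile_;
run;
```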

[(back to top)](#Table-of-Contents:)

### Reading/Creating Microsoft Excel Data

Excel data can be accessed using the following methods (these may change in newer SAS releases):
* SAS/ACCESS `LIBNAME` statement
* Import Wizard

Only the SAS/ACCESS `LIBNAME` statement will be covered here. For more information on the Import Wizard, refer to the SAS documentation. The SAS/ACCESS `LIBNAME` statement associates a SAS name with an Excel workbook by pointing to its location on disk. To use it, a licensed SAS/ACCESS Interface to PC Files must be installed. The workbook then becomes a SAS library while its worksheets become individual datasets. Some limitations on Excel versions:

<table>
    <tr>
        <td>**SAS version**</td>
        <td>**Excel version**</td>
    </tr>
    <tr>
        <td>9.2 or higher</td>
        <td>any Excel version</td>
    </tr>
    <tr>
        <td>9.1 or lower</td>
        <td>Excel 2003 or earlier</td>
    </tr>
</table>

The `LIBNAME` statement is as follows:
   
      libname <libref> '<location on disk>' <options>;
      
Excel worksheet names have the special character ($) at the end. To reference the dataset generated from a worksheet, a name literal needs to be used. The 2-level structure for the dataset name looks like this:

    libref.'worksheet_name$'n
    
The name literal `'...'n` allows for $ to be recognized.

Once a _libref_ is assigned to a workbook, the workbook cannot be used by any other program until the _libref_ is cleared. To disassociate a _libref_, use the CLEAR option:

    libname <libref> clear;
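
Putting these pieces together, a minimal sketch of reading a worksheet might look like the following. The libref `results`, the workbook path, and the worksheet name `ActivityLevels` are all hypothetical, and the code assumes SAS/ACCESS Interface to PC Files is installed:

```sas
/* Hypothetical workbook path; point this at a real file */
libname results 'C:\Users\exercise.xlsx';

/* The worksheet is referenced as a dataset through a name literal */
proc print data = results.'ActivityLevels$'n;
run;

/* Release the workbook so other programs can open it */
libname results clear;
```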
    
To create an Excel worksheet, assign a _libref_ to the workbook that you want the worksheet to go in. After that, in the DATA step, set the library of the created dataset to be that _libref_:

    libname <new_libref> '<new path>';
    data new_libref.dataset;
        set work.dataset;
    run;

[(back to top)](#Table-of-Contents:)