# Chapter 08 - Producing Descriptive Statistics

## Table of Contents:
1. [Objectives](#objectives)
2. [Computing Statistics with PROC MEANS](#statprocmeans)
3. [Creating a Summarized Dataset Using PROC MEANS](#summaryprocmeans)
4. [Producing Frequency Tables Using PROC FREQ](#freqtableprocfreq)

## Objectives: <a class="anchor" id="objectives"></a>

* determine the n-count, mean, standard deviation, min, and max of numeric variables using PROC MEANS
* control the number of decimal places used in PROC MEANS output
* specify the variables for which to produce statistics
* use PROC SUMMARY to produce the same results as PROC MEANS
* describe the difference between the SUMMARY and MEANS procedures
* create 1-way frequency tables for categorical data using PROC FREQ
* create 2-way and n-way crossed frequency tables
* control the layout and complexity of crossed frequency tables

## Computing Statistics with PROC MEANS <a class="anchor" id="statprocmeans"></a>

PROC MEANS by default prints out the n-count (non-missing), the mean, the standard deviation, the minimum, and the maximum values for every numeric variable in a dataset. You can control what statistics it outputs in the options for the PROC MEANS statement:

    proc means data = <dataset> <statistical options>;
    run;
    
A list of these statistical keywords and options can be found in the SAS documentation.
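
As a minimal sketch of requesting specific statistics, the step below asks for the count, the number of missing values, the mean, the median, and the standard deviation instead of the default set. The choice of `sashelp.class` and of these particular keywords is an assumption made for illustration.

```sas
/* Request a custom set of statistics for every numeric variable. */
/* The keywords listed on the PROC MEANS statement replace the    */
/* default N, MEAN, STD, MIN, and MAX.                            */
proc means data = sashelp.class n nmiss mean median std;
run;
```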

By default, PROC MEANS chooses the number of decimal places for each statistic using the BEST. format for the field width. To control the maximum number of decimal places produced by PROC MEANS, use the MAXDEC=n option:

    proc means data = <dataset> <statistical options> maxdec = <n>;
    run;
    
To specify the variables that are processed using PROC MEANS, use the VAR statement:

    proc means data = <dataset> <statistical options> maxdec = <n>;
        var <variable list>;
    run;
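
The template above might be filled in as follows. This is only a sketch: the dataset (`sashelp.class`), the two variables, and the two-decimal limit are illustrative assumptions.

```sas
/* Report only the mean and standard deviation, limited to two */
/* decimal places, for the two variables named in VAR.         */
proc means data = sashelp.class mean std maxdec = 2;
    var height weight;
run;
```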

PROC MEANS can also perform group processing using either a CLASS or a BY statement. Either statement computes the requested statistics separately for each level of the listed categorical variables. With CLASS, the report also shows the number of observations (N Obs) in each group:

    proc means data = <dataset> <statistical options> maxdec = <n>;
        class <variable list>;
    run;
    
    -or-
    
    proc means data = <dataset> <statistical options> maxdec = <n>;
        by <variable list>;
    run;

The BY statement requires the dataset to be sorted by the BY variables (for example, with PROC SORT), whereas the CLASS statement does not. BY-group processing can be more efficient with categorical variables that contain many levels.
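
The difference between the two statements can be sketched as below. The dataset `class_sorted` is a hypothetical name introduced here; the sort step is needed only for the BY version.

```sas
/* BY processing: sort a copy of the data by the grouping variable first. */
proc sort data = sashelp.class out = class_sorted;
    by sex;
run;

proc means data = class_sorted mean maxdec = 1;
    by sex;
    var height;
run;

/* CLASS processing: no prior sort is required. */
proc means data = sashelp.class mean maxdec = 1;
    class sex;
    var height;
run;
```

The BY version prints a separate table for each level of `sex`, while the CLASS version prints one table with a row per level.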

## Creating a Summarized Dataset Using PROC MEANS <a class="anchor" id="summaryprocmeans"></a>

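An output dataset can be generated from PROC MEANS with the OUTPUT statement. If no statistics are requested, the output dataset contains N, MEAN, STD, MIN, and MAX. To choose the statistics and name the resulting variables, follow the OUT= option with statistic keywords in the form `<statistic> = <variable names>`. For example: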

    proc means data = sashelp.class maxdec = 4;
        class sex;
        var age height weight;
        output out = classdata
            mean = AvgAge AvgHeight AvgWeight
            std = STDAge STDHeight STDWeight
            median = MedAge MedHeight MedWeight;
    run;

    proc print data = classdata;
    run;

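| Sex | N Obs | Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|---|---|
| F | 9 | Age | 9 | 13.2222 | 1.3944 | 11.0000 | 15.0000 |
| | | Height | 9 | 60.5889 | 5.0183 | 51.3000 | 66.5000 |
| | | Weight | 9 | 90.1111 | 19.3839 | 50.5000 | 112.5000 |
| M | 10 | Age | 10 | 13.4000 | 1.6465 | 11.0000 | 16.0000 |
| | | Height | 10 | 63.9100 | 4.9379 | 57.3000 | 72.0000 |
| | | Weight | 10 | 108.9500 | 22.7272 | 83.0000 | 150.0000 |

| Obs | Sex | \_TYPE\_ | \_FREQ\_ | AvgAge | AvgHeight | AvgWeight | STDAge | STDHeight | STDWeight | MedAge | MedHeight | MedWeight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 0 | 19 | 13.3158 | 62.3368 | 100.026 | 1.49267 | 5.12708 | 22.7739 | 13.0 | 62.8 | 99.5 |
| 2 | F | 1 | 9 | 13.2222 | 60.5889 | 90.111 | 1.39443 | 5.01833 | 19.3839 | 13.0 | 62.5 | 90.0 |
| 3 | M | 1 | 10 | 13.4 | 63.91 | 108.95 | 1.64655 | 4.93794 | 22.7272 | 13.5 | 64.15 | 107.25 |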


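In addition, PROC MEANS adds \_TYPE\_ and \_FREQ\_ to the output dataset. \_FREQ\_ is the number of observations in the group, and \_TYPE\_ identifies which combination of CLASS variables defines the row: in the output above, \_TYPE\_ = 0 is the overall summary and \_TYPE\_ = 1 is the summary by Sex.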
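
In the output shown above, \_TYPE\_ is 0 on the overall row and 1 on the per-Sex rows. One way to keep only the per-group rows is to filter on \_TYPE\_, as in this sketch. The dataset name `heightstats` and the variable name `AvgHeight` are hypothetical, and NOPRINT is used only to suppress the printed report.

```sas
/* Build a summary dataset without printing the PROC MEANS report. */
proc means data = sashelp.class noprint;
    class sex;
    var height;
    output out = heightstats mean = AvgHeight;
run;

/* Print only the rows summarized by Sex (_TYPE_ = 1), */
/* leaving out the overall row (_TYPE_ = 0).           */
proc print data = heightstats;
    where _type_ = 1;
run;
```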

PROC SUMMARY can also be used to create a statistics summary dataset. The difference is in the default output: PROC MEANS prints a report automatically, whereas PROC SUMMARY does not generate a report unless the PRINT option is specified in its statement. Other than that, the syntax for PROC SUMMARY is the same.
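
A PROC SUMMARY step with the same statement layout as the earlier PROC MEANS examples might look like the sketch below. The PRINT option is stated explicitly so that a report is requested, and the dataset name `classsummary` and the output variable names are hypothetical.

```sas
/* PROC SUMMARY with an explicit PRINT option and an output dataset. */
proc summary data = sashelp.class print maxdec = 2;
    class sex;
    var height weight;
    output out = classsummary mean = AvgHeight AvgWeight;
run;
```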

## Producing Frequency Tables Using PROC FREQ <a class="anchor" id="freqtableprocfreq"></a>

PROC FREQ produces 1-way and n-way frequency tables, which give a sense of how the data are distributed. For a 1-way table, its report contains four statistics:
* frequency
* percent
* cumulative frequency
* cumulative percent

Its basic syntax is:

    proc freq data = <dataset>;
        tables <variable list>;
    run;
    
where the optional TABLES statement allows you to select the variables to process. If TABLES is omitted, PROC FREQ produces a 1-way table for every variable in the dataset. To suppress cumulative frequencies and percentages, add the NOCUM option to the TABLES statement:

    tables <variable list> / nocum;
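
For instance, a 1-way table request with NOCUM could be written as in this sketch, where `sashelp.class` and the two variables are illustrative choices:

```sas
/* One 1-way frequency table per listed variable, showing only */
/* frequency and percent (cumulative columns suppressed).      */
proc freq data = sashelp.class;
    tables sex age / nocum;
run;
```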

To generate 2-way or n-way tables, place an asterisk (*) between the variables in the TABLES statement:

    tables var1*var2;
    
When crosstabulations are requested, PROC FREQ outputs:
* cell frequency
* cell percentage of total frequency
* cell percentage of marginal row frequency
* cell percentage of marginal column frequency

For example:

    proc freq data = sashelp.class;
        tables sex*age;
    run;

Table of Sex by Age (each cell lists Frequency / Percent / Row Pct / Col Pct; the Total row and column list Frequency / Percent)

| Sex | 11 | 12 | 13 | 14 | 15 | 16 | Total |
|---|---|---|---|---|---|---|---|
| F | 1 / 5.26 / 11.11 / 50.00 | 2 / 10.53 / 22.22 / 40.00 | 2 / 10.53 / 22.22 / 66.67 | 2 / 10.53 / 22.22 / 50.00 | 2 / 10.53 / 22.22 / 50.00 | 0 / 0.00 / 0.00 / 0.00 | 9 / 47.37 |
| M | 1 / 5.26 / 10.00 / 50.00 | 3 / 15.79 / 30.00 / 60.00 | 1 / 5.26 / 10.00 / 33.33 | 2 / 10.53 / 20.00 / 50.00 | 2 / 10.53 / 20.00 / 50.00 | 1 / 5.26 / 10.00 / 100.00 | 10 / 52.63 |
| Total | 2 / 10.53 | 5 / 26.32 | 3 / 15.79 | 4 / 21.05 | 4 / 21.05 | 1 / 5.26 | 19 / 100.00 |


With n-way tables, the last two variables in the TABLES statement become the rows and columns of 2-way tables, and one such table is produced for each combination of levels of the other variables. Because of this stratification, n-way tables can be difficult to read. To output all of the information in a single table with a list format, use the LIST option in the TABLES statement:

    tables var1*var2*...*varn / list;
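
A 3-way request in list format might look like the sketch below. The use of `sashelp.cars` and its `origin`, `type`, and `drivetrain` variables is an assumption made for illustration.

```sas
/* Without LIST, this would print one Type-by-DriveTrain table per */
/* level of Origin. With LIST, each combination of the three       */
/* variables appears as a single row in one table.                 */
proc freq data = sashelp.cars;
    tables origin*type*drivetrain / list;
run;
```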
    
To suppress parts of a crosstabulation, add any of the following options to the TABLES statement:
* NOFREQ - suppresses cell frequencies
* NOPERCENT - suppresses cell percentages
* NOROW - suppresses row percentages
* NOCOL - suppresses column percentages
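
As a sketch of combining these options, the step below keeps only the cell frequencies of the Sex by Age crosstabulation. The dataset and variables are the ones used in the earlier example.

```sas
/* Show cell frequencies only: overall, row, and column */
/* percentages are all suppressed.                      */
proc freq data = sashelp.class;
    tables sex*age / nopercent norow nocol;
run;
```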

In addition, to change how a crosstabulation displays variable values, apply formats with the FORMAT statement:

    proc freq data = <dataset>;
        tables <variable list>;
        format <variable list> <format>;
    run;
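
One possible use of a format is to collapse a variable's values into groups before they are tabulated. In this sketch, `agegrp` is a hypothetical user-defined format created with PROC FORMAT, and the cut point at age 13 is arbitrary.

```sas
/* Define a format that maps ages to two groups. */
proc format;
    value agegrp
        low - 12  = '12 and under'
        13 - high = '13 and over';
run;

/* The crosstabulation uses the formatted values of Age, */
/* so its columns are the two groups defined above.      */
proc freq data = sashelp.class;
    tables sex*age;
    format age agegrp.;
run;
```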