Proc means is used to compute statistics.
Common supported statistics:
- Keyword Description
- N Counting
- SUM Sum
- MIN Minimum
- MEAN Mean
- MEDIAN Median
- MAX Maximum
- STD Standard deviation
- QRANGE Interquartile range
- VAR Variance
- SKEW Skew
In addition, there are several percentiles: P1, P5, P10, P25, P50 (=Median), P75, P90, P95 and P99.
Example:
/* summary statistics for full sample */
proc means data=work.myData n mean median stddev;
OUTPUT OUT=work.meansout n= mean= median= stddev= /autoname;
var var1 var2 var3;
run;
The above example computes the count, mean, median and standard deviation for var1, var2, and var3, and will print these, as well as generate a dataset (output out=
).
By default, proc means will print the statistics to the screen. To supress this, add the 'noprint' keyword (proc means data=work.myData noprint
).
The autoname keyword will automatically create variable names for the different statistics, by concatenating the variable names with the statistic (in the example, var1_N, var1_Mean, var1_Median, var1_Stddev, similarly for var2). The output dataset will also hold a frequency count (FREQ). The difference between _N and FREQ is that missings are included in FREQ, but not counted towards N.
Proc means supports 'by', which means that statistics can be computed by group (for example, by firm, industry, year, etc). Example:
/* as above, but repeated by a group (requires sorting on that variable) */
proc means data=work.myData n mean median stddev;
OUTPUT OUT=work.meansout n= mean= median= stddev= /autoname;
var var1 var2 var3;
by groupvar;
run;
When 'by' is used, the output dataset will have one row for each group. A column is added that holds the variable used to group by (in the example 'groupvar').
Proc rank is used to create ranked variables, and variables to create low/high, terciles, quantiles, etc.
Example:
proc rank data = myComp out=myComp2 groups = 10;
var roa size ;
ranks roa_d size_d ;
run;
'var' holds the list of variables to rank, and 'ranks' holds the variable names to give the ranked variables.
The above example will create two decile rank variables (roa_d and size_d), with values 0-9. If roa and size have missing values, the ranked variable will also have a missing value.
Proc rank supports the 'by' keyword, for example to rank by year, or industry.
Proc standard is used to demean (subtract the mean) or standardize (mean 0, standard deviation 1).
Example:
proc standard data=dsIn mean=0 std=1 out=dsOut;
var at sale;
by fyear;
run;
The above example standardizes the data by year (so for each year the variables at and sale will have an average of 0 and standard deviation of 1).