# Summarizing, Reporting

## Calculating Summary Statistics

### PROC MEANS
The following program:
    * reads data
    * computes new variable
    * sorts the data
    * summarizes the data by month

In [2]:
data sales;
    infile 'Flowers.dat';
    input customerid $ @9 saledate mmddyy10. petunia snapdragon marigold;
    month = month(saledate);
proc sort data = sales;
    by month;
* calculate the means by month for flower sales;
proc means data = sales;
    by month;
    var petunia snapdragon marigold;
    title 'summary of flower sales by month';
run;

Variable,N,Mean,Std Dev,Minimum,Maximum
petunia snapdragon marigold,3 3 3,86.6666667 113.3333333 81.6666667,35.1188458 41.6333200 25.6580072,50.0000000 80.0000000 60.0000000,120.0000000 160.0000000 110.0000000

Variable,N,Mean,Std Dev,Minimum,Maximum
petunia snapdragon marigold,4 4 4,81.2500000 97.5000000 83.7500000,16.5201897 47.8713554 19.7378655,60.0000000 60.0000000 60.0000000,100.0000000 160.0000000 100.0000000


### Exporting Summary Statistics

In [4]:
data sales;
    infile 'Flowers.dat';
    input customerid $ @9 saledate mmddyy10. petunia snapdragon marigold;
    month = month(saledate);
proc sort data = sales;
    by customerid;
* calculate the means by customerid for flower sales;
proc means noprint data = sales;
    by customerid;
    var petunia snapdragon marigold;
    output out = totals mean(petunia snapdragon marigold) =
            mean_petunia mean_snapdragon mean_marigold
        sum(petunia snapdragon marigold) = sum_petunia sum_snapdragon sum_marigold;
proc print data = totals;
    title "sum of flower data over customer id";
    format meanpetunia meansnapdragon meanmarigold 3.;
run;

Obs,customerid,_TYPE_,_FREQ_,mean_petunia,mean_snapdragon,mean_marigold,sum_petunia,sum_snapdragon,sum_marigold
1,756-01,0,3,101.667,116.667,95.0,305,350,285
2,834-01,0,2,85.0,110.0,80.0,170,220,160
3,901-02,0,2,55.0,80.0,67.5,110,160,135


## Counting Data
### PROC FREQ
The following program:
* reads data
* produces one-way and two-way frequencies

Two-way frequency table (bottom):
Each cell contains:
* frequency
* percentage
* percentage for that row
* percentage for that column

In [7]:
data orders;
    infile "Coffee.dat";
    input cofee $ window $ @@;
* print tables for window and window by coffee;
proc freq data = orders;
    tables window window * cofee;
run;

window,Frequency,Percent,Cumulative Frequency,Cumulative Percent
d,13,43.33,13,43.33
w,17,56.67,30,100.0

Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee
window,cofee,cofee,cofee,cofee,cofee,cofee
window,Kon,cap,esp,ice,kon,Total
Frequency Percent Row Pct Col Pct,,,,,,
d,1 3.45 8.33 100.00,2 6.90 16.67 33.33,6 20.69 50.00 75.00,2 6.90 16.67 50.00,1 3.45 8.33 10.00,12 41.38
w,0 0.00 0.00 0.00,4 13.79 23.53 66.67,2 6.90 11.76 25.00,2 6.90 11.76 50.00,9 31.03 52.94 90.00,17 58.62
Total,1 3.45,6 20.69,8 27.59,4 13.79,10 34.48,29 100.00
Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1
Frequency Percent Row Pct Col Pct,Table of window by cofee window cofee Kon cap esp ice kon Total d 1 3.45 8.33 100.00 2 6.90 16.67 33.33 6 20.69 50.00 75.00 2 6.90 16.67 50.00 1 3.45 8.33 10.00 12 41.38  w 0 0.00 0.00 0.00 4 13.79 23.53 66.67 2 6.90 11.76 25.00 2 6.90 11.76 50.00 9 31.03 52.94 90.00 17 58.62  Total 1 3.45 6 20.69 8 27.59 4 13.79 10 34.48 29 100.00 Frequency Missing = 1,,,,,

Frequency Percent Row Pct Col Pct

Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee,Table of window by cofee
window,cofee,cofee,cofee,cofee,cofee,cofee
window,Kon,cap,esp,ice,kon,Total
d,1 3.45 8.33 100.00,2 6.90 16.67 33.33,6 20.69 50.00 75.00,2 6.90 16.67 50.00,1 3.45 8.33 10.00,12 41.38
w,0 0.00 0.00 0.00,4 13.79 23.53 66.67,2 6.90 11.76 25.00,2 6.90 11.76 50.00,9 31.03 52.94 90.00,17 58.62
Total,1 3.45,6 20.69,8 27.59,4 13.79,10 34.48,29 100.00
Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1,Frequency Missing = 1


## Tabular Reports
### Proc Tabulate
The following program:
* reads the data
* creates a 3-dimensional report with the values of the Port for the pages, Locomotion for the rows and Type for the columns

"CLASS" statement tells SAS which variables contain categorical data to be used for dividing observations into groups

"TABLE" statement tells SAS how to organize your table and what numbers to compute
    * can specify up to 3 dimensions (separated by columns)
    * dimensions tell SAS which variables to use for the pages, rows and columns in te report

In [11]:
data boats;
    infile "Boats.dat";
    input Name $ 1-12 Port $ 14-20 Locomotion $ 22-26 Type $ 28-30 Price 32-36;
run;
*Tabulations with 3 dimensions;
proc tabulate data = boats;
    class port locomotion type;
    table port, locomotion, type;
    title 'number of boats by port, locomotion, and type';
run;

Unnamed: 0_level_0,Type,Type
Unnamed: 0_level_1,cat,yac
Unnamed: 0_level_2,N,N
Locomotion,.,1
power,.,1
sail,1,.
Port Lahaina,,
Type cat yac N N Locomotion . 1 power sail 1 .,,

Unnamed: 0_level_0,Type,Type
Unnamed: 0_level_1,cat,yac
Unnamed: 0_level_2,N,N
Locomotion,.,1
power,.,1
sail,1,.

Unnamed: 0_level_0,Type,Type,Type
Unnamed: 0_level_1,cat,sch,yac
Unnamed: 0_level_2,N,N,N
Locomotion,3.0,.,1.0
power,3.0,.,1.0
sail,1.0,2,1.0
Port Maalea,,,
Type cat sch yac N N N Locomotion 3 . 1 power sail 1 2 1,,,

Unnamed: 0_level_0,Type,Type,Type
Unnamed: 0_level_1,cat,sch,yac
Unnamed: 0_level_2,N,N,N
Locomotion,3,.,1
power,3,.,1
sail,1,2,1


### Adding Statistics to PROC Tabulate

The following program:
* concatenates, crosses and groupes variables and statistics

In [13]:
data boats;
    infile "Boats.dat";
    input Name $ 1-12 Port $ 14-20 Locomotion $ 22-26 Type $ 28-30 Price 32-36;
run;
*tabulations with 2 dimensions and statistics;
proc tabulate data = boats;
    class locomotion type;
    var price;
    table locomotion all, mean * price*(type all);
    title 'mean price by locomotion and type';
run;

Unnamed: 0_level_0,Mean,Mean,Mean,Mean
Unnamed: 0_level_1,Price,Price,Price,Price
Unnamed: 0_level_2,Type,Type,Type,All
Unnamed: 0_level_3,cat,sch,yac,All
Locomotion,73.47,.,79.9,76.04
power,73.47,.,79.9,76.04
sail,102.45,136.25,72.9,110.06
All,85.06,136.25,77.57,93.05


Same output with currency formatting

In [18]:
data boats;
    infile "Boats.dat";
    input Name $ 1-12 Port $ 14-20 Locomotion $ 22-26 Type $ 28-30 Price 32-36;
run;
*tabulations with 2 dimensions and statistics;
proc tabulate data = boats format = dollar9.2;
    class locomotion type;
    var price;
    table locomotion all, mean * price*(type all)
        /box='full day excursions' misstext='none';
    title 'mean price by locomotion and type';
run;

full day excursions,Mean,Mean,Mean,Mean
full day excursions,Price,Price,Price,Price
full day excursions,Type,Type,Type,All
full day excursions,cat,sch,yac,All
Locomotion,$73.47,none,$79.90,$76.04
power,$73.47,none,$79.90,$76.04
sail,$102.45,$136.25,$72.90,$110.06
All,$85.06,$136.25,$77.57,$93.05


### Changing Headers in PROC TABULATE Output

In [17]:
data boats;
    infile "Boats.dat";
    input Name $ 1-12 Port $ 14-20 Locomotion $ 22-26 Type $ 28-30 Price 32-36;
run;
*changing headers;
proc format;
    value $typ 'cat' = 'catamaran'
                'sch' = 'schooner'
                'yac' = 'yacht';
run;
proc tabulate data = boats format = dollar9.2;
    class locomotion type;
    var price;
    format type $typ.;
    table locomotion='' all,
        mean='' * price='mean price by type of boat'* (type='' all)
        /box='full day excursions' misstext='none';
    title;
run;

full day excursions,mean price by type of boat,mean price by type of boat,mean price by type of boat,mean price by type of boat
full day excursions,catamaran,schooner,yacht,All
power,$73.47,none,$79.90,$76.04
sail,$102.45,$136.25,$72.90,$110.06
All,$85.06,$136.25,$77.57,$93.05


#### Multiple Format Output Types

In [23]:
data boats;
    infile "Boats.dat";
    input Name $ 1-12 Port $ 14-20 Locomotion $ 22-26 Type $ 28-30 Price 32-36 Length 38-40;
run;
proc tabulate data = boats;
    class locomotion type;
    var price length;
    table locomotion all,
        mean * (price*format=dollar6.2 length*format=6.0) * (type all);
    title 'price and length by type of boat';
run;

Unnamed: 0_level_0,Mean,Mean,Mean,Mean,Mean,Mean,Mean,Mean
Unnamed: 0_level_1,Price,Price,Price,Price,Length,Length,Length,Length
Unnamed: 0_level_2,Type,Type,Type,All,Type,Type,Type,All
Unnamed: 0_level_3,cat,sch,yac,All,cat,sch,yac,All
Locomotion,$73.47,.,$79.90,$76.04,8,.,5,7
power,$73.47,.,$79.90,$76.04,8,.,5,7
sail,102.45,136.25,$72.90,110.06,6,6,6,6
All,$85.06,136.25,$77.57,$93.05,7,6,5,6


# PROC REPORT

COLUMN is similar to VAR (proc tabulate)

The following program:
* reads the data
* runs 2 reports:
    * report 1 : has no column statement, so SAS uses all the variables
    * report 2 : uses a column statement to select just the numeric variables

In [25]:
data natparks;
    infile "Parks.dat";
    input name $ 1-21 type $ region $ museums camping;
run;
proc report data = natparks nowindows headline;
    title 'report with character and numeric variables';
run;
proc report data = natparks nowindows headline;
    column museums camping;
    title 'report with only numeric variables';
run;

name,type,region,museums,camping
Dinosaur,NM,West,2,6
Ellis Island,NM,East,1,0
Everglades,NP,East,5,2
Grand Canyon,NP,West,5,3
Great Smoky Mountains,NP,East,3,10
Hawaii Volcanoes,NP,West,2,2
Lava Beds,NM,West,1,1
Statue of Liberty,NM,East,1,0
Theodore Roosevelt,NP,,2,2
Yellowstone,NP,West,9,11

museums,camping
33,50


### DEFINE statements
Specifies options for an individual variable

DEFINE varible/ options 'column-header'

The following program:
* contains 2 DEFINE statements
 1. defines region as having a usage type of ORDER
 2. specifies a column header for the variable Camping ( a numeric variable with default usage of ANALYSIS)

In [29]:
data natparks;
    infile "Parks.dat";
    input name $ 1-21 type $ region $ museums camping;
run;
*proc report with ORder varible, missing option, and column header;
proc report data = natparks nowindows headline missing; *since missing included, obs with missing region values will be included in report;
    column region name museums camping;
    define region/order;
    define camping/analysis 'camp/grounds';
    title 'national parks and monuments arranged by region';
run;

region,name,museums,camp grounds
,Theodore Roosevelt,2,2
East,Ellis Island,1,0
,Everglades,5,2
,Great Smoky Mountains,3,10
,Statue of Liberty,1,0
West,Dinosaur,2,6
,Grand Canyon,5,3
,Hawaii Volcanoes,2,2
,Lava Beds,1,1
,Yellowstone,9,11


### Summary Reports
2 different usage types to "roll up" data:
* GROUP - produces summary rows
* ACROSS - produces summary columns

The following program: 
* contains 2 proc reports

    1 - Region & Type are both defined as group varaibles
    
    2 - Region is still a group variable, but type is an across varaible

In [31]:
data natparks;
    infile "Parks.dat";
    input name $ 1-21 type $ region $museums camping;
run;
* region and type as group variables;
proc report data = natparks nowindows headline;
    column region type museums camping;
    define region/group;
    define type/group;
    title 'summary report with 2 group variables';
run;
* region as group adn type as across with sums;
proc report data = natparks nowindows headline;
    column region type,(museums camping);
    define region/group;
    define type/across;
    title 'summary report with a group and an across variable';
run;

region,type,museums,camping
East,NM,2,0
,NP,8,12
West,NM,3,7
,NP,18,29

Unnamed: 0_level_0,type,type,type,type
Unnamed: 0_level_1,NM,NM,NP,NP
region,museums,camping,museums,camping
East,2,0,8,12
West,3,7,18,29


### Summary Breaks

The following program:

Defines Region as an order variable and then uses both BREAK and RBREAK sttements with the AFTER location.

The SUMMARIZE option tells SAS to print totals for numeric variables while the OL and SKIP options tell SAS to draw a line above the totals and skip a line under the totals

In [33]:
data natparks;
    infile "Parks.dat";
    input Name $ 1-21 Type $ Region $ Museums Camping;
run;
* PROC REPORT with breaks;
proc report data = natparks nowindows headline;
    column name region museums camping;
    define region/order;
    break after region/summarize ol skip;
    rbreak after/summarize ol skip;
    title 'detail report with summary breaks';
run;

Name,Region,Museums,Camping
Ellis Island,East,1,0
Everglades,,5,2
Great Smoky Mountains,,3,10
Statue of Liberty,,1,0
,East,10,12
Dinosaur,West,2,6
Grand Canyon,,5,3
Hawaii Volcanoes,,2,2
Lava Beds,,1,1
Yellowstone,,9,11


### Adding Statistics

To request a statistic for a particular variable, insert a comma between the statistic and a variable in the COLUMN statement

In [34]:
data natparks;
    infile "Parks.dat";
    input Name $ 1-21 Type $ Region $ Museums Camping;
run;
* statistics in a column statement with 2 group variables;
proc report data = natparks nowindows headline;
    column region type n (museums camping),mean;
    define region/group;
    define type/group;
    title 'Statistics with 2 group variables';
run;
* statistics in a column statement with group and across variables;
proc report data = natparks nowindows headline;
    column region type n (museums camping),mean;
    define region/group;
    define type/across;
    title 'Statistics with a group and across variable';
run;

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Museums,Camping
Region,Type,n,mean,mean
East,NM,2,1.0,0.0
,NP,2,4.0,6.0
West,NM,2,1.5,3.5
,NP,4,4.5,7.25

Unnamed: 0_level_0,Type,Type,Unnamed: 3_level_0,Museums,Camping
Region,NM,NP,n,mean,mean
East,2,2,4,2.5,3
West,2,4,6,3.5,6


### Adding Computed Variables

The following program:
* computes 2 variables named Facilities and Note
    * Facilities is a numeric variable equal to the number of museums + number of campgrounds
    * Note is a character varible which is equal to 'No Camping' for parks that have no campgrounds

In [36]:
data natparks;
    infile "Parks.dat";
    input Name $ 1-21 Type $ Region $ Museums Camping;
run;
* compute new variables that are numeric and character;
proc report data = natparks nowindows headline;
    column name region museums camping facilities note;
    define museums/analysis sum noprint;
    define camping/analysis sum noprint;
    define facilities/computed 'Camping/and/Museums';
    define note/Computed;
    compute facilities;
        facilities = museums.sum + camping.sum;
    endcomp;
    compute note/char length=10;
        if camping.sum = 0 then note = 'no camping';
    endcomp;
    title 'report with two computed variables';
run;

Name,Region,Camping and Museums,note
Dinosaur,West,8,
Ellis Island,East,1,no camping
Everglades,East,7,
Grand Canyon,West,8,
Great Smoky Mountains,East,13,
Hawaii Volcanoes,West,4,
Lava Beds,West,2,
Statue of Liberty,East,1,no camping
Theodore Roosevelt,,4,
Yellowstone,West,20,


### Grouping Data in Procedures with User-Define Formats

The following program:
* reads data
* creates 2 user-defined formats to group the age data AND 1 user-defined format for the book type data:

In [41]:
data books;
    infile "LibraryBooks.dat";
    input age booktype $ @@;
run;
* define formats to group the data;
proc format;
    value agegpa
        0-18 = '0 to 18'
        19-25 = '19 to 25'
        26-49 = '26 to 49'
        50-HIGH = '50 + ';
    value agegpb
        0-25 = '0 to 25'
        26-HIGH = '26+ ';
    value $typ
        'bio','non','ref' = 'non-fiction'
        'fic','mys','sci' = 'Fiction';
run;
* create two way table with Age grouped into 4 categories;
proc freq data = books;
    title 'patron age by book type: four age groups';
    tables booktype * age/nopercent norow nocol;
    format age agegpa. booktype $typ.;
run;
* create two way table with age grouped into 2 categories;
proc freq data = books;
    title 'patron age by book type: two age groups';
    tables booktype * age/nopercent norow nocol;
    format age agegpb. booktype $typ.;
run;

Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age
booktype,age,age,age,age,age
booktype,0 to 18,19 to 25,26 to 49,50 +,Total
Frequency,,,,,
non-fiction,3,0.0,3.0,8.0,14.0
Fiction,6,3.0,12.0,10.0,31.0
Total,9,3.0,15.0,18.0,45.0
Frequency,Table of booktype by age booktype age 0 to 18 19 to 25 26 to 49 50 + Total non-fiction 3 0 3 8 14 Fiction 6 3 12 10 31 Total 9 3 15 18 45,,,,

Frequency

Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age
booktype,age,age,age,age,age
booktype,0 to 18,19 to 25,26 to 49,50 +,Total
non-fiction,3,0,3,8,14
Fiction,6,3,12,10,31
Total,9,3,15,18,45

Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age
booktype,age,age,age
booktype,0 to 25,26+,Total
Frequency,,,
non-fiction,3,11.0,14.0
Fiction,9,22.0,31.0
Total,12,33.0,45.0
Frequency,Table of booktype by age booktype age 0 to 25 26+ Total non-fiction 3 11 14 Fiction 9 22 31 Total 12 33 45,,

Frequency

Table of booktype by age,Table of booktype by age,Table of booktype by age,Table of booktype by age
booktype,age,age,age
booktype,0 to 25,26+,Total
non-fiction,3,11,14
Fiction,9,22,31
Total,12,33,45
