
## SAS_2_R using dplyr for Data Manipulation

**{dplyr} to SAS Function Equivalents**

| **dplyr Function** | **Purpose**                                | **SAS Equivalent**                            |
| ------------------ | ------------------------------------------ | --------------------------------------------- |
| `filter()`         | Subset rows based on conditions            | `WHERE` statement or subsetting `IF`          |
| `select()`         | Select specific columns                    | `KEEP`/`DROP` (statement or data set option); `VAR` in `PROC` steps |
| `mutate()`         | Create or modify variables                 | Assignment statements in a `DATA` step        |
| `arrange()`        | Sort rows                                  | `PROC SORT`                                   |
| `summarise()`      | Summarize data                             | `PROC SUMMARY` or `PROC MEANS`                |
| `group_by()`       | Group data before summarizing              | `CLASS` or `BY` statement in `PROC` steps     |
| `distinct()`       | Get unique rows                            | `PROC SORT NODUPKEY`                          |
| `rename()`         | Rename variables                           | `RENAME` statement in `DATA` step             |
| `left_join()`      | Merge datasets by key (keep all left rows) | `MERGE` in `DATA` step with `BY`, use `IN=`   |
| `inner_join()`     | Merge datasets (only matching rows)        | `PROC SQL` `INNER JOIN` or `MERGE` with logic |


**Create the data in SAS**

In [None]:
data students;
    input ID Name $ Grade $ Homework;
    datalines;
1 Alice A 99
2 Bob A 99
3 Charlie B 85
;
run;

data scores;
  input ID Score;
  datalines;
1 85
2 90
4 88
;
run;


In [None]:
proc print data=students; run;
proc print data=scores; run;

Obs,ID,Name,Grade,Homework
1,1,Alice,A,99
2,2,Bob,A,99
3,3,Charlie,B,85

Obs,ID,Score
1,1,85
2,2,90
3,4,88


**Create the data in R**

In [None]:
install.packages("dplyr")
library(dplyr)

In [None]:
students <- tribble(
  ~ID, ~Name,    ~Grade, ~Homework,
   1,  "Alice",  "A",     99,
   2,  "Bob",    "A",     99,
   3,  "Charlie","B",     85
)

scores <- tibble(ID = c(1, 2, 4), Score = c(85, 90, 88))

In [None]:
# glimpse(students)
# glimpse(scores)
students
scores

ID,Name,Grade,Homework
<dbl>,<chr>,<chr>,<dbl>
1,Alice,A,99
2,Bob,A,99
3,Charlie,B,85


ID,Score
<dbl>,<dbl>
1,85
2,90
4,88


**1. Filter rows**

**SAS**

In [None]:
data temp;
    set students;
    where id ne 2;
run;
proc print; run;

Obs,ID,Name,Grade,Homework
1,1,Alice,A,99
2,3,Charlie,B,85


**R (dplyr)**

In [None]:
students |>
  filter(ID != 2)

ID,Name,Grade,Homework
<dbl>,<chr>,<chr>,<dbl>
1,Alice,A,99
3,Charlie,B,85


**2. Select columns**

**SAS**

In [None]:
data temp;
    set students(keep=Name);
run;

In [None]:
proc print; run;

Obs,Name
1,Alice
2,Bob
3,Charlie



**R (dplyr)**

In [None]:
students |>
  select(Name)

Name
<chr>
Alice
Bob
Charlie
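
The equivalents table also lists `DROP` as a counterpart of `select()`. As a minimal sketch, assuming the same toy `students` data as above, a column can be dropped by negating it inside `select()`; the name `dropped` is only illustrative.

```r
library(dplyr)

# Same toy data as the notebook's students table
students <- tribble(
  ~ID, ~Name,     ~Grade, ~Homework,
   1,  "Alice",   "A",     99,
   2,  "Bob",     "A",     99,
   3,  "Charlie", "B",     85
)

# Drop Homework and keep every other column,
# roughly what set students(drop=Homework) does in SAS
dropped <- students |>
  select(-Homework)

names(dropped)
```

Negating a column keeps the remaining columns in their original order, so this is usually shorter than listing everything to keep.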


**3. Create new variable**

**SAS**

In [None]:
data temp;
    set scores;
    if Score > 89 then Grade = "A"; else Grade = "B";
run;
proc print; run;

Obs,ID,Score,Grade
1,1,85,B
2,2,90,A
3,4,88,B


**R (dplyr)**

In [None]:
mutate(scores, Grade = ifelse(Score > 89, "A", "B"))

ID,Score,Grade
<dbl>,<dbl>,<chr>
1,85,B
2,90,A
4,88,B
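
The R cell above uses base R's `ifelse()`. A possible alternative, shown here as a sketch rather than the notebook's own method, is dplyr's `if_else()`, which checks that both branches have the same type. The name `graded` is illustrative.

```r
library(dplyr)

# Same toy data as the notebook's scores table
scores <- tibble(ID = c(1, 2, 4), Score = c(85, 90, 88))

# if_else() is dplyr's type-strict counterpart of base ifelse()
graded <- scores |>
  mutate(Grade = if_else(Score > 89, "A", "B"))

graded
```

With this data the result should match the `ifelse()` version; the stricter type check mainly matters when the two branches could accidentally mix types.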


**4. Sort data**

**SAS**

In [None]:
proc sort data=scores out=temp; by descending Score; run;
proc print; run;

Obs,ID,Score
1,2,90
2,4,88
3,1,85


**R (dplyr)**

In [None]:
scores |>
  arrange(desc(Score))

ID,Score
<dbl>,<dbl>
2,90
4,88
1,85


**5. Summarize by group**

**SAS**

In [None]:
proc means data=students noprint;
class Grade;
var homework;
output out=outdata mean=mean;
run;

proc print; run;

Obs,Grade,_TYPE_,_FREQ_,mean
1,,0,3,94.3333
2,A,1,2,99.0
3,B,1,1,85.0


**R (dplyr)**

In [None]:
students |>
  group_by(Grade) |>
  summarise(mean_homework = mean(Homework, na.rm = TRUE))


Grade,mean_homework
<chr>,<dbl>
A,99
B,85
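
The two outputs are not identical: `PROC MEANS` with `CLASS` also returns an overall row (`_TYPE_` = 0) and a `_FREQ_` count, while the dplyr result has only the per-group means. One way to approximate the SAS output in R, sketched here with the illustrative names `by_grade`, `overall`, and `combined`, is to compute both summaries and stack them.

```r
library(dplyr)

# Same toy data as the notebook's students table
students <- tribble(
  ~ID, ~Name,     ~Grade, ~Homework,
   1,  "Alice",   "A",     99,
   2,  "Bob",     "A",     99,
   3,  "Charlie", "B",     85
)

# Per-group count and mean (like the _TYPE_ = 1 rows)
by_grade <- students |>
  group_by(Grade) |>
  summarise(n = n(), mean_homework = mean(Homework, na.rm = TRUE))

# Overall count and mean (like the _TYPE_ = 0 row)
overall <- students |>
  summarise(n = n(), mean_homework = mean(Homework, na.rm = TRUE))

# Stack them; Grade is NA for the overall row
combined <- bind_rows(by_grade, overall)

combined
```

Here `n` plays the role of `_FREQ_`, and the overall row appears last with a missing `Grade` instead of first as in the SAS listing.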


**6. Join tables**

**SAS**

In [None]:
proc sort data=students; by ID; run;
proc sort data=scores; by ID; run;

data leftmerge;
    merge students (in=a) scores; by ID;
    if a;
run;

proc print; run;

Obs,ID,Name,Grade,Homework,Score
1,1,Alice,A,99,85
2,2,Bob,A,99,90
3,3,Charlie,B,85,.


In [None]:
data innermerge;
    merge students (in=a) scores (in=b); by ID;
    if a and b;
run;

proc print; run;

Obs,ID,Name,Grade,Homework,Score
1,1,Alice,A,99,85
2,2,Bob,A,99,90


**R (dplyr)**

In [None]:
left_join(students, scores, by = "ID")

ID,Name,Grade,Homework,Score
<dbl>,<chr>,<chr>,<dbl>,<dbl>
1,Alice,A,99,85.0
2,Bob,A,99,90.0
3,Charlie,B,85,


In [None]:
inner_join(students, scores, by = "ID")

ID,Name,Grade,Homework,Score
<dbl>,<chr>,<chr>,<dbl>,<dbl>
1,Alice,A,99,85
2,Bob,A,99,90


**7. Remove duplicates (distinct values)**

**SAS**

In [None]:
proc sort data=students nodupkey out=temp;
    by Grade;
run;
proc print; run;

Obs,ID,Name,Grade,Homework
1,1,Alice,A,99
2,3,Charlie,B,85


**R (dplyr)**

In [None]:
students |>
  distinct(Grade, .keep_all = TRUE)

ID,Name,Grade,Homework
<dbl>,<chr>,<chr>,<dbl>
1,Alice,A,99
3,Charlie,B,85


**8. Rename variables**

**SAS**

In [None]:
data temp;
    set students;
    rename Homework = HomeworkScore Name = StudentName;
run;
proc print; run;

Obs,ID,StudentName,Grade,HomeworkScore
1,1,Alice,A,99
2,2,Bob,A,99
3,3,Charlie,B,85


**R (dplyr)**

In [None]:
# rename(new_name = old_name)

temp <- students %>%
  rename(
    HomeworkScore = Homework,
    StudentName = Name
  )

temp

ID,StudentName,Grade,HomeworkScore
<dbl>,<chr>,<chr>,<dbl>
1,Alice,A,99
2,Bob,A,99
3,Charlie,B,85
