### Data Manipulation

In this notebook we do some data manipulation in the `R` programming language using the [`dplyr`](https://dplyr.tidyverse.org/) package. First we need to make sure it is installed and loaded. The following command installs `dplyr` if it is missing and then loads it:

```r
pacman::p_load(pacman, dplyr)
```
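Note that `pacman::p_load()` can only run if `pacman` itself is already installed. A minimal sketch of a setup that covers this case, assuming a CRAN mirror is reachable (the mirror URL below is an assumption, any CRAN mirror works):

```r
# Install pacman only if it is missing (assumes network access to a CRAN mirror)
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman", repos = "https://cloud.r-project.org")
}

# p_load() installs dplyr if needed and then attaches it
pacman::p_load(dplyr)
```

If you prefer not to depend on `pacman`, `install.packages("dplyr")` followed by `library(dplyr)` achieves the same thing.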


We are going to look at the basic verbs of `dplyr`'s grammar of data manipulation, such as:

1. `filter()` picks cases based on their values.
2. `select()` picks variables based on their names.
3. `mutate()` adds new variables that are functions of existing variables.
4. `arrange()` changes the ordering of the rows.
5. `summarise()` reduces multiple values down to a single summary, usually together with `group_by()`.
6. `rename()` changes the names of variables.

In [19]:
pacman::p_load(pacman, dplyr)

In [5]:
head(iris)

|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
|   | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |


### 1. `filter()` 
- Subset rows based on a condition

In [25]:
df <- filter(iris, Species == "setosa" | Species == "virginica")
table(df$Species)


    setosa versicolor  virginica 
        50          0         50 

Equivalently, `%in%` matches `Species` against a vector of allowed values:

In [28]:
df <- filter(iris, Species %in% c("setosa", "virginica"))
table(df$Species)


    setosa versicolor  virginica 
        50          0         50 
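Both tables still list `versicolor` with a count of 0. This is because `filter()` removes rows but keeps all levels of the `Species` factor. A minimal sketch of removing the unused level with base R's `droplevels()` (the name `df_dropped` is only illustrative):

```r
library(dplyr)

# filter() keeps every factor level, so "versicolor" remains as an empty level
df <- filter(iris, Species %in% c("setosa", "virginica"))

# droplevels() removes factor levels that no longer occur in the data
df_dropped <- droplevels(df)
table(df_dropped$Species)
```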

In [32]:
df2 <- iris %>% filter(Species == "setosa")
summary(df2$Species)
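`filter()` also accepts several conditions at once. As a small sketch (the cutoff of 5 and the names `a` and `b` are arbitrary choices for illustration), conditions separated by commas are combined with a logical AND, which is the same as joining them with `&`:

```r
library(dplyr)

# Comma-separated conditions must all be TRUE for a row to be kept
a <- filter(iris, Species == "setosa", Sepal.Length > 5)

# The same filter written with an explicit &
b <- filter(iris, Species == "setosa" & Sepal.Length > 5)

# Compare the two results
identical(a, b)
```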

### 2. `select()`
- Choose columns



In [35]:
df3 <- iris %>% select(Sepal.Length, Species)
head(df3)

|   | Sepal.Length | Species |
|---|---|---|
|   | `<dbl>` | `<fct>` |
| 1 | 5.1 | setosa |
| 2 | 4.9 | setosa |
| 3 | 4.7 | setosa |
| 4 | 4.6 | setosa |
| 5 | 5.0 | setosa |
| 6 | 5.4 | setosa |


You can also select columns by position instead of by name.

In [37]:
df3 <- select(iris, 1, 2, 3)
names(df3)
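Besides names and positions, `select()` understands a few other forms. The sketch below shows three of them on `iris`: the `starts_with()` helper, a negative selection, and a name range. The result names (`sepal_cols`, `no_species`, `first_three`) are only illustrative.

```r
library(dplyr)

# Columns whose names start with "Sepal"
sepal_cols <- select(iris, starts_with("Sepal"))

# Every column except Species
no_species <- select(iris, -Species)

# A contiguous range of columns, given by the first and last name
first_three <- select(iris, Sepal.Length:Petal.Length)

names(first_three)
```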

### 3. `mutate()`
- Create or modify columns

In [49]:
df6 <- iris %>% mutate(Sepal.Area = Sepal.Length * Sepal.Width)
names(df6)

The same call without the pipe:

In [51]:
df7 <- mutate(iris, Sepal.Area = Sepal.Length * Sepal.Width)
names(df7)
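A single `mutate()` call can create several columns, and later expressions can use columns created earlier in the same call. Assigning to an existing name overwrites that column. A small sketch (the column names `Large.Sepal` and the result name `df_new` are made up for illustration):

```r
library(dplyr)

df_new <- iris %>%
  mutate(
    Sepal.Area  = Sepal.Length * Sepal.Width,
    # Sepal.Area already exists at this point, so it can be reused
    Large.Sepal = Sepal.Area > mean(Sepal.Area),
    # Assigning to an existing name replaces that column
    Species     = toupper(as.character(Species))
  )

head(df_new, 3)
```

The original `iris` is untouched: its `Species` column is still a lower-case factor.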

### 4. `arrange()` 
- Sort rows

In [55]:
# Sort by Sepal.Length (ascending, the default)
df <- iris %>% arrange(Sepal.Length)

# Sort by Sepal.Length (descending)
df10 <- iris %>% arrange(desc(Sepal.Length))
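`arrange()` accepts more than one column; later columns break ties in earlier ones. A brief sketch that sorts by species and then, within each species, by `Sepal.Length` from largest to smallest (the name `sorted` is only illustrative):

```r
library(dplyr)

# Species first (in factor-level order), then Sepal.Length descending within each species
sorted <- iris %>% arrange(Species, desc(Sepal.Length))

head(sorted, 3)
```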

### 5. `summarise()` + `group_by()` 
- Aggregate/group data

In [58]:
df8 <- iris %>%
  group_by(Species) %>%
  summarise(Avg_Sepal_Length = mean(Sepal.Length))
head(df8)

| Species | Avg_Sepal_Length |
|---|---|
| `<fct>` | `<dbl>` |
| setosa | 5.006 |
| versicolor | 5.936 |
| virginica | 6.588 |


Alternatively, call `group_by()` directly and pipe only its result into `summarise()`.

In [60]:
df8 <- group_by(iris, Species) %>%
  summarise(Avg_Sepal_Length = mean(Sepal.Length))
head(df8)

| Species | Avg_Sepal_Length |
|---|---|
| `<fct>` | `<dbl>` |
| setosa | 5.006 |
| versicolor | 5.936 |
| virginica | 6.588 |
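`summarise()` can compute several summaries in one call, and `n()` counts the rows in each group. Without `group_by()`, the whole data frame is treated as a single group. A minimal sketch (the column names `n`, `SD_Sepal_Length` and the result names are chosen here for illustration):

```r
library(dplyr)

# One row per species, with several summary columns
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    n                = n(),               # number of rows in the group
    Avg_Sepal_Length = mean(Sepal.Length),
    SD_Sepal_Length  = sd(Sepal.Length)
  )

# Without group_by(), summarise() returns a single row for the whole data set
overall <- iris %>% summarise(Avg_Sepal_Length = mean(Sepal.Length))

iris_summary
```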


### 6. `rename()`
- Rename columns

In [70]:
# Rename Sepal.Length to Sepal_Length and Sepal.Width to Sepal_Width (new_name = old_name)
df9 <- iris %>% rename(Sepal_Length = Sepal.Length, Sepal_Width = Sepal.Width)
colnames(df9)
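When many columns need the same kind of change, listing each pair by hand gets tedious. A sketch using `rename_with()`, which applies a function to the column names; this assumes dplyr 1.0.0 or later, and the name `df_renamed` is only illustrative:

```r
library(dplyr)

# Replace every "." in the column names with "_"
# fixed = TRUE makes gsub() treat "." as a literal dot, not a regex wildcard
df_renamed <- iris %>%
  rename_with(~ gsub(".", "_", .x, fixed = TRUE))

colnames(df_renamed)
```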

### 🧠 Notes:
- `%>%` is the pipe operator. It passes the left-hand side as the first argument to the function on the right.
- Each function returns a new dataframe; original data isn't modified unless reassigned.
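Both notes can be checked directly. The sketch below compares a piped call with the equivalent direct call and confirms that `iris` keeps all of its rows afterwards (the cutoff of 7 and the names `piped` and `direct` are arbitrary):

```r
library(dplyr)

# The pipe supplies iris as the first argument of filter()
piped  <- iris %>% filter(Sepal.Length > 7)
direct <- filter(iris, Sepal.Length > 7)

# Compare the two results
identical(piped, direct)

# iris itself is unchanged; only the assigned results hold the filtered rows
nrow(iris)
```

R 4.1.0 and later also provide a native pipe, `|>`, which works the same way for simple calls like these.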


Putting the verbs together, for each species:
- calculate the mean `Sepal.Length`
- calculate the max `Petal.Width`
- sort the result by mean `Sepal.Length` (descending)

In [76]:
iris %>%
  group_by(Species) %>%
  summarise(
    Mean_Sepal = mean(Sepal.Length),
    Max_Petal_Width = max(Petal.Width)
  ) %>%
  arrange(desc(Mean_Sepal))

| Species | Mean_Sepal | Max_Petal_Width |
|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` |
| virginica | 6.588 | 2.5 |
| versicolor | 5.936 | 1.8 |
| setosa | 5.006 | 0.6 |
