## Package Installation
Install the package if it is not already available in your `R` environment.

In [None]:
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")

## Importing the package

In [None]:
library(tidyverse)

## Data Set/Frame

Now let's create some data to explore with our newly installed `tidyverse` library, which bundles `tibble`, `dplyr`, and several other packages.

In [None]:
# Build a tibble of student data
df_students <- tibble(
  
  # Student names
  name = c('Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky',
           'Frederic', 'Jimmie', 'Rhonda', 'Giovanni',
           'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
           'Jakeem','Helena','Ismat','Anila','Skye','Daniel',
           'Aisha'),
  
  # Study hours
  study_hours = c(10.0, 11.5, 9.0, 16.0, 9.25, 1.0, 11.5, 9.0,
                 8.5, 14.5, 15.5, 13.75, 9.0, 8.0, 15.5, 8.0,
                 9.0, 6.0, 10.0, 12.0, 12.5, 12.0),
  
  # Grades
  grade = c(50, 50, 47, 97, 49, 3, 53, 42, 26,
             74, 82, 62, 37, 15, 70, 27, 36, 35,
             48, 52, 63, 64)
)

# Print the tibble
df_students

## Importing data from an Online CSV File

We can use the `read_csv` function to load the file, then `slice_head` to show only the first few rows and `slice` to pick rows by position.

In [None]:
# Import the file with read_csv (from readr, loaded with tidyverse)
students <- read_csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/grades.csv")

# Print the first 10 rows
slice_head(students, n = 10)

# Print rows 5 through 10
slice(students, 5:10)

## Let's do some Queries

We can filter rows with the `filter` function from the `dplyr` library, like this:

In [None]:
# Fetch the rows where the column `Name` has the value `Jenny`
filter(students, Name == "Jenny")

# `%in%` works much like the SQL `IN` operator:
# it keeps the rows where `Name` is either `Jenny` or `Giovanni`.
# The values to match are given as a vector, built with `c`
filter(students, Name %in% c("Jenny", "Giovanni"))

# We can combine multiple conditions by separating them with a comma
# This works as an `AND` operation
filter(students, StudyHours > 12, Grade > 80)

# The same `AND` query, written with the `&` operator

filter(students, StudyHours > 12 & Grade > 80)
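The two calls above both express a logical AND. For comparison, a true OR uses the `|` operator. The following is a minimal sketch on a small made-up tibble (`demo` and its values are illustrative, not taken from the CSV file):

```r
library(dplyr)  # dplyr is part of the tidyverse

# Hypothetical data, for illustration only
demo <- tibble(
  Name       = c("A", "B", "C", "D"),
  StudyHours = c(14, 8, 13, 6),
  Grade      = c(85, 90, 40, 30)
)

# AND: both conditions must hold for a row to be kept
both <- filter(demo, StudyHours > 12, Grade > 80)

# OR: a row is kept if at least one condition holds
either <- filter(demo, StudyHours > 12 | Grade > 80)

both$Name    # only "A" satisfies both conditions
either$Name  # "A", "B" and "C" each satisfy at least one
```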

## Check Missing Values

You can check for missing values with base R functions such as `anyNA`, `is.na`, `colSums`, and `rowSums`. You can also combine `is.na` with the `filter` function to build more complex queries.

In [None]:
# Returns a single TRUE if there is at least one missing value, otherwise FALSE
anyNA(students)

# Returns a logical matrix the same shape as the table: TRUE where a value
# is missing and FALSE where it is present
is.na(students)

# Counts the missing values in each column
colSums(is.na(students))

# Counts the missing values in each row
rowSums(is.na(students))
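As a sketch of how `is.na` can be combined with `filter`, the example below uses a small made-up tibble (`demo` is illustrative, not the CSV data). It assumes dplyr 1.0.4 or later, where `if_any` is available:

```r
library(dplyr)  # dplyr is part of the tidyverse

# Hypothetical data with gaps, for illustration only
demo <- tibble(
  Name       = c("A", "B", "C"),
  StudyHours = c(10, NA, 12),
  Grade      = c(50, 60, NA)
)

# Rows where the Grade column is missing
missing_grade <- filter(demo, is.na(Grade))

# Rows where at least one column is missing
missing_any <- filter(demo, if_any(everything(), is.na))

missing_grade$Name  # "C"
missing_any$Name    # "B" and "C"
```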

## Let's Change Data

You can also modify the data, for example by replacing missing values with something useful. To do this we can use `tidyr`, which comes with `tidyverse`. Its job is to tidy up the data and make it meaningful.

In [None]:
# mutate is used to add or change columns

# replace_na replaces each missing value with the value we supply

# mean(StudyHours, na.rm = TRUE) is the mean of the column computed
# over the available values only; na.rm = TRUE excludes missing values
students <- students %>%
  mutate(StudyHours = replace_na(StudyHours, mean(StudyHours, na.rm = TRUE)))

# drop_na removes every row that still has a missing value in any column
students <- students %>%
  drop_na()
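To make the effect of these two steps concrete, here is a minimal sketch on a three-row made-up tibble (`demo`, `filled` and `cleaned` are illustrative names, not part of the original data):

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one gap in each numeric column
demo <- tibble(
  Name       = c("A", "B", "C"),
  StudyHours = c(10, NA, 14),
  Grade      = c(50, 60, NA)
)

# Step 1: fill the missing StudyHours with the mean of the observed values
filled <- demo %>%
  mutate(StudyHours = replace_na(StudyHours, mean(StudyHours, na.rm = TRUE)))

# Step 2: drop the rows that still contain a missing value (here, row C)
cleaned <- filled %>%
  drop_na()

filled$StudyHours  # 10 12 14
cleaned$Name       # "A" "B"
```

Note that the order matters: row B survives only because its `StudyHours` was filled in before `drop_na` ran.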

## Now let's do some calculations

Let's see how to compute summary values with base R and `dplyr`.

In [None]:
# Take the mean of all the values in the column `StudyHours`
mean_study <- mean(students$StudyHours)

# The same calculation for the `Grade` column, written as a pipeline
mean_grade <- students %>%
    pull(Grade) %>%
    mean()
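Both calculations assume that missing values have already been handled. As a quick base R sketch of why that matters (the vector `x` is made up for illustration), `mean` returns `NA` as soon as one value is missing, unless `na.rm = TRUE` is set:

```r
# Hypothetical vector with one missing value
x <- c(10, NA, 14)

# Without na.rm, a single NA makes the whole result NA
with_na <- mean(x)

# With na.rm = TRUE, the mean is taken over the observed values only
without_na <- mean(x, na.rm = TRUE)

with_na     # NA
without_na  # 12
```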

## Make your first query

Now that we know a handful of functions, let's extract some meaningful information: students with a grade of 60 or higher are assigned PASS, and the rest are assigned FAIL.

In [None]:
# Create a new column named `Status`: PASS if Grade
# is 60 or higher, otherwise FAIL
results <- students %>%
  mutate(Status = if_else(Grade >= 60, "PASS", "FAIL"))

# Now let's see the mean StudyHours and Grade of the students
# who passed and of those who failed
results %>%
  group_by(Status) %>%
  summarise(mean_study = mean(StudyHours), mean_grade = mean(Grade))

# The same can be done more generally: if you have many columns
# and want the mean of each one, use across() with where(is.numeric)
# to take the mean of every numeric column
results %>% 
  group_by(Status) %>% 
  summarise(across(where(is.numeric), mean))

## Select specific columns and Order them

In [None]:
# Select specific columns, here Name and Status
select(results, Name, Status)

# Select all columns except Status
select(results, !Status)

# Sort the results by Grade, highest first
results %>%
  arrange(desc(Grade))
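The steps in this section can also be chained into one pipeline. A minimal sketch on a small made-up tibble (`demo` is illustrative, not the CSV data), listing the two highest grades:

```r
library(dplyr)  # dplyr is part of the tidyverse

# Hypothetical results table, for illustration only
demo <- tibble(
  Name   = c("A", "B", "C", "D"),
  Grade  = c(70, 95, 40, 82),
  Status = c("PASS", "PASS", "FAIL", "PASS")
)

# Keep two columns, sort by Grade (highest first), then take the top 2 rows
top2 <- demo %>%
  select(Name, Grade) %>%
  arrange(desc(Grade)) %>%
  slice_head(n = 2)

top2$Name  # "B" "D"
```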