# Data Visualization in R

We can use the ggplot2 package, which is loaded as part of the tidyverse, to plot graphs and inspect the data visually.

In [None]:
# Import Libs
library(tidyverse)

# load from CSV
df_students <- read.csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/grades.csv")

# remove rows with no data
df_students <- df_students %>% drop_na()

# Add a new column for Pass and Fail
df_students <- df_students %>% mutate(Status = if_else(Grade >= 60, "Pass", "Fail"))

## Visualize the Data in a Simple Graph

You initialize a graphic by using the function ggplot() and the data frame to use for the plot. ggplot(data = df_students) basically creates an empty graph to which you can add layers by using a plus sign (+).

geom_col() then adds a layer of bars whose height corresponds to the variables that are specified by the mapping argument. The mapping argument is always paired with aes(), which specifies how variables in the data are mapped. What goes into aes() are variables found in the data. In this case, you specified that you want to map Name to the x-axis and Grade to the y-axis.

In [None]:
ggplot(data = df_students) +
  geom_col(mapping = aes(x = Name, y = Grade))

![Alt text](image-1.png)
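
Because anything placed inside aes() is read from the data, the same mechanism can drive bar color from a column such as Status. The sketch below is only an illustration: `sample_students` and its values are hypothetical stand-ins for df_students, so the example runs without downloading the CSV.

```r
library(ggplot2)

# Hypothetical sample data standing in for df_students (not the real grades)
sample_students <- data.frame(
  Name = c("Ana", "Ben", "Cho", "Dev"),
  Grade = c(72, 48, 91, 55)
)

# Same pass mark as in the loading cell: 60 or above is a pass
sample_students$Status <- ifelse(sample_students$Grade >= 60, "Pass", "Fail")

# fill sits inside aes(), so the bar color is mapped from the Status column
status_plot <- ggplot(data = sample_students) +
  geom_col(mapping = aes(x = Name, y = Grade, fill = Status))

status_plot
```

A value placed outside aes(), such as fill = "midnightblue", is treated as a fixed setting and applies the same color to every bar.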

## Improving the Previous Graph

We can add colors and titles to make the graph more readable and more visually appealing.

In [None]:
# Change the default grey background
theme_set(theme_light())


ggplot(data = df_students) +
  geom_col(mapping = aes(x = Name, y = Grade),
           # Specify color and transparency of the bars
           fill = "midnightblue", alpha = 0.7) +
  # Add a title to the chart
  ggtitle("Student Grades") +
  # Add labels to axes
  xlab("Student") +
  ylab("Grade")

![Alt text](image.png)

In [None]:
# You can adjust the theme and change whatever you like
ggplot(data = df_students) +
  geom_col(mapping = aes(x = Name, y = Grade),
           fill = "midnightblue", alpha = 0.7) +
  ggtitle("Student Grades") +
  xlab("Student") +
  ylab("Grade") +
  theme(
    plot.title = element_text(hjust = 0.5),
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(colour = "gray",
                                      linetype = "dashed", size = 1),
    axis.text.x = element_text(angle = 70)
  )

![Alt text](image3.png)

## Histogram

ggplot2 is extremely flexible, and we can use it to go further and see where most of the grades lie. For example, the histogram below shows that most of our students' grades lie near 50.

In [None]:
ggplot(data = df_students) +
  geom_histogram(mapping = aes(x = Grade), binwidth = 20, boundary = 0.5, fill = "midnightblue", alpha = 0.7) +
  xlab('Grade') +
  ylab('Frequency') +
  ggtitle("Grade Distribution") +
  theme(plot.title = element_text(hjust = 0.5))

![Alt text](image.png)
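
To check what the histogram shows, the binning can be reproduced by hand. This is a rough sketch rather than what ggplot2 does internally: `sample_grades` is hypothetical data, and the bin edges assume grades run from 0 to 100 with binwidth = 20 and boundary = 0.5, as in the call above.

```r
# Hypothetical grades standing in for df_students$Grade
sample_grades <- c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74)

# Bin edges implied by binwidth = 20 and boundary = 0.5 (assumed 0-100 range)
bin_edges <- seq(from = 0.5, to = 100.5, by = 20)

# Assign each grade to a bin and count the grades per bin
bin_counts <- table(cut(sample_grades, breaks = bin_edges))
print(bin_counts)

# Position of the fullest bin among the five bins
fullest_bin <- which.max(bin_counts)
names(bin_counts)[fullest_bin]
```

A smaller binwidth gives more, narrower bars, while boundary shifts where the bin edges fall without changing their width.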

## Statistics

Now we can move on to some data analysis using measures of central tendency. Base R provides `min()`, `max()`, `mean()`, and `median()`; the `statip` library adds `mfv()`, which returns the most frequent value (the mode). Here's an example of how to use them.

In [None]:
# Load statip into the current R session
library(statip)

# Get summary statistics
min_val <- min(df_students$Grade)
max_val <- max(df_students$Grade)
mean_val <- mean(df_students$Grade)
med_val <- median(df_students$Grade)
mod_val <- mfv(df_students$Grade)

# Print the stats
cat(
  "Minimum: ", round(min_val, 2),
   "\nMean: ", round(mean_val, 2),
   "\nMedian: ", round(med_val, 2),
   "\nMode: ", round(mod_val, 2),
   "\nMaximum: ", round(max_val, 2)
)
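
If statip is not installed, the mode can be approximated with base R alone. The helper below, `most_frequent()`, is a hypothetical name introduced for this sketch; it is meant to mirror the way mfv() reports every value tied for the highest count, but it is not the statip implementation.

```r
# Hypothetical helper: return every value that shares the highest frequency
most_frequent <- function(x) {
  counts <- table(x)                          # frequency of each distinct value
  top <- names(counts)[counts == max(counts)] # values tied for the top count
  as.numeric(top)
}

# Hypothetical grades standing in for df_students$Grade
sample_grades <- c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74)

single_mode <- most_frequent(sample_grades)
tied_modes <- most_frequent(c(1, 1, 2, 2, 3))

single_mode
tied_modes
```

Returning all tied values matters for the plots that follow: geom_vline() draws one line per value in xintercept, so a data set with two modes produces two mode lines.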

In [None]:
# Plot a histogram
ggplot(data = df_students) +
  geom_histogram(mapping = aes(x = Grade), binwidth = 15, fill = "midnightblue", alpha = 0.7, boundary = 0.5) +
  
# Add lines for the statistics
  geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3) +
  geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3) +
  geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 ) +
  geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 ) +
  geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3 ) +
  
# Add titles and labels
  ggtitle('Data Distribution')+
  xlab('Value')+
  ylab('Frequency')+
  theme(plot.title = element_text(hjust = 0.5))

![Alt text](image-1.png)

## Let's Write a Generic Function

The function below works on numeric data and returns several outputs. It reports the measures of central tendency along with the minimum and maximum, and it visualizes the data in two forms: a histogram and a box plot. The histogram also includes vertical lines marking the minimum, mean, median, mode, and maximum.

In [None]:
library(tidyverse)
library(patchwork)
# Create a function that you can reuse
show_distribution <- function(var_data, binwidth) {
  
  # Get summary statistics by first extracting values from the column
  min_val <- min(pull(var_data))
  max_val <- max(pull(var_data))
  mean_val <- mean(pull(var_data))
  med_val <- median(pull(var_data))
  mod_val <- statip::mfv(pull(var_data))

  # Build a text summary of the stats
  stats <- glue::glue(
  'Minimum: {format(round(min_val, 2), nsmall = 2)}
   Mean: {format(round(mean_val, 2), nsmall = 2)}
   Median: {format(round(med_val, 2), nsmall = 2)}
   Mode: {format(round(mod_val, 2), nsmall = 2)}
   Maximum: {format(round(max_val, 2), nsmall = 2)}'
  )
  
  # Plot the histogram
  hist_gram <- ggplot(var_data) +
  geom_histogram(aes(x = pull(var_data)), binwidth = binwidth,
                 fill = "midnightblue", alpha = 0.7, boundary = 0.4) +
    
  # Add lines for the statistics
  geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3) +
  geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3) +
  geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 ) +
  geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 ) +
  geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3 ) +
    
  # Add titles and labels
  ggtitle('Data Distribution') +
  xlab('')+
  ylab('Frequency') +
  theme(plot.title = element_text(hjust = 0.5))
  
  # Plot the box plot
  bx_plt <- ggplot(data = var_data) +
  geom_boxplot(mapping = aes(x = pull(var_data), y = 1),
               fill = "#E69F00", color = "gray23", alpha = 0.7) +
    
    # Add titles and labels
  xlab("Value") +
  ylab("") +
  theme(plot.title = element_text(hjust = 0.5))
  
  return (list(
    stats,
    hist_gram,
    bx_plt
  ))
} # End of function

df_students <- read.csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/grades.csv")

df_students <- df_students %>%
  drop_na()
# Alternative: replace missing grades with the mean instead of dropping rows
# mutate(Grade = replace_na(Grade, as.integer(mean(Grade, na.rm = TRUE))))

col <- df_students %>% select(Grade)

# Call the function
show_distribution(var_data = col, binwidth = 20)
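
The function returns its two plots as separate list elements, so they print one after the other. Since patchwork is already loaded, one option is to stack them into a single figure with the `/` operator. The sketch below shows the idea on hypothetical data; `hist_sketch`, `box_sketch`, and `sample_grades` are illustrative names, not part of the function above.

```r
library(ggplot2)
library(patchwork)

# Hypothetical grades standing in for the Grade column
sample_grades <- data.frame(
  Grade = c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74)
)

# A plain histogram, similar in spirit to hist_gram
hist_sketch <- ggplot(sample_grades) +
  geom_histogram(aes(x = Grade), binwidth = 20, boundary = 0.4)

# A horizontal box plot, similar in spirit to bx_plt
box_sketch <- ggplot(sample_grades) +
  geom_boxplot(aes(x = Grade, y = 1))

# patchwork's `/` places the first plot above the second
stacked <- hist_sketch / box_sketch
stacked
```

Stacking the plots this way lines up their x-axes visually, which makes it easier to compare the box plot's quartiles with the histogram bars.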

## A Density Plot Function

This function smooths out the histogram into a density curve, which gives another way of visualizing the data.

In [None]:
# Create a function that returns a density plot
show_density <- function(var_data) {
  
  # Get statistics
  mean_val <- mean(pull(var_data))
  med_val <- median(pull(var_data))
  mod_val <- statip::mfv(pull(var_data))
  
  
  # Plot the density plot
  density_plot <- ggplot(data = var_data) +
  geom_density(aes(x = pull(var_data)), fill="orangered", color="white", alpha=0.4) +
    
  # Add lines for the statistics
  geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3) +
  geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 ) +
  geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 ) +
    
  # Add titles and labels
  ggtitle('Data Density') +
  xlab('') +
  ylab('Density') +
  theme(plot.title = element_text(hjust = 0.5))
  
  
  
  return(density_plot) # End of returned outputs
  
} # End of function


# Get the density of Grade
col <- df_students %>% select(Grade)
show_density(var_data = col)
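
The curve drawn by geom_density() is a kernel density estimate, which base R can also compute with density(). As a rough sketch on hypothetical data (`sample_grades` is not the real grade column), the peak of the estimate can be located numerically:

```r
# Hypothetical grades standing in for df_students$Grade
sample_grades <- c(50, 50, 47, 97, 49, 3, 53, 42, 26, 74)

# Kernel density estimate with the default Gaussian kernel and bandwidth
grade_density <- density(sample_grades)

# x position where the estimated density is highest
peak_grade <- grade_density$x[which.max(grade_density$y)]
peak_grade
```

The peak marks where the grades are most concentrated, which is why it usually sits close to the mode line on the density plot.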
