# Let's Work on Real-World Data Problems

In this notebook, we will work with real-world data and try to deal with problems such as:

- Outliers
- Clusters
- Empty Data Cells

These problems must be eliminated from a data set to properly analyse it and get the best possible results.

In [None]:
# Install necessary packages
install.packages(c("tidyverse", "patchwork", "statip", "glue"))

# Import the packages
library(tidyverse)
library(patchwork)
library(statip)
library(glue)

## Use Previous Model on StudyHours

In this piece of code, we will reuse the distribution function we created in Exercise #02, this time applying it to `StudyHours` rather than `Grade`, and see what results we get.

In [None]:
# import data from the source
students <- read.csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/grades.csv")

# remove the records with no data in them
students <- students %>% drop_na()

# Add a column of pass or fail
students <- students %>% 
  mutate(Status = if_else(Grade >= 60, "PASS", "FAIL"))

# distribution function
show_distribution <- function (data, binwidth) {

  # get the minimum, maximum, and measures of central tendency
  data_min <- min(pull(data))
  data_max <- max(pull(data))
  data_mean <- mean(pull(data))
  data_median <- median(pull(data))
  data_mode <- mfv(pull(data))

  # print the values
  stats <- glue(
    'Minimum: {format(round(data_min, 2), nsmall = 2)}
     Maximum: {format(round(data_max, 2), nsmall = 2)}
     Median: {format(round(data_median, 2), nsmall = 2)}
     Mode: {format(round(data_mode, 2), nsmall = 2)}
     Mean: {format(round(data_mean, 2), nsmall = 2)}'
  )

  # plotting a histogram based on the given data set
  histogram <- ggplot(data) +
    geom_histogram(mapping = aes(x = pull(data)),
      binwidth = binwidth, fill = "midnightblue", alpha = 0.7, boundary = 0.4) + 

  # Adding lines for the minimum, maximum, and measures of central tendency
  geom_vline(xintercept = data_min, color = "gray33", linetype = "dashed", size = 1.0) +
  geom_vline(xintercept = data_max, color = "gray33", linetype = "dashed", size = 1.0) +
  geom_vline(xintercept = data_mode, color = "green", linetype = "dashed", size = 1.0) +
  geom_vline(xintercept = data_median, color = "blue", linetype = "dashed", size = 1.0) +
  geom_vline(xintercept = data_mean, color = "red", linetype = "dashed", size = 1.0) +

  # Adding titles and legends
  ggtitle("Data Distribution") +
  xlab("") +
  ylab("Frequency") +

  # adjusting the title position on the graph
  theme(plot.title = element_text(hjust = 0.5, size = 20))

  # Lets plot the box plot graph now
  box_plot <- ggplot(data) +
    geom_boxplot(mapping = aes(x = pull(data), y = 1), 
      fill = "yellow", color = "gray33") + 

  # Lets now add titles and labels to this graph
  ggtitle("Box Plot") +
  xlab("Value") +
  ylab("") +
  
  # adjust title properties
  theme(plot.title = element_text(hjust = 0.5, size = 20))

  return (
    list(
      stats, histogram, box_plot
    )
  )
}

## Let's remove the outlier

We'll attempt to remove the outlier by keeping only the students whose `StudyHours` is greater than 1.

In [None]:
# Getting students with study Hours greater than 1
col <- students %>%
  select(StudyHours) %>%
  filter(StudyHours > 1)

# calling the function with new refined data
show_distribution(col, 2)

## Now let's work on the 99th Percentile

In this piece of code, we'll remove the students at or below the 1st percentile of `StudyHours` and work with the rest of the data. This can be achieved using the `quantile` function from the stats package.

In [None]:
# 1% quantile
q01 <- students %>%
  pull(StudyHours) %>%
  quantile(1/100, names = FALSE)

# keeping only the students above the 1st percentile
col <- students %>%
  select(StudyHours) %>%
  filter(StudyHours > q01)

# calling the function with new refined data
show_distribution(col, 2.5)
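
To see what the quantile filter does on a small scale, here is a minimal sketch on made-up values. The vector `hours` and the names `q01_toy` and `kept` are hypothetical and are not part of grades.csv. It relies on the default interpolation used by `quantile` (type 7), so the 1st percentile falls between the two smallest values and only the minimum is dropped.

```r
# Toy data (hypothetical values, not taken from grades.csv)
hours <- c(1, 6, 8, 9, 10, 10, 11, 12, 13, 16)

# 1st percentile with R's default interpolation (type 7)
q01_toy <- quantile(hours, probs = 0.01, names = FALSE)

# Keep only the observations strictly above that threshold
kept <- hours[hours > q01_toy]

print(q01_toy)
print(kept)
```

With a data set this small, filtering on the 1st percentile behaves much like removing the single lowest value; on larger data sets it removes roughly the lowest 1% of observations.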

## With the Density Graph

We can reuse the density plot function from before to visualize the data and see that most of our data lies around 8-10 hours.

In [None]:
# Create a function that returns a density plot
show_density <- function(var_data) {
  
  # Get statistics
  mean_val <- mean(pull(var_data))
  med_val <- median(pull(var_data))
  mod_val <- statip::mfv(pull(var_data))
  
  
  # Plot the density plot
  density_plot <- ggplot(data = var_data) +
  geom_density(aes(x = pull(var_data)), fill="orangered", color="white", alpha=0.4) +
    
  # Add lines for the statistics
  geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3) +
  geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 ) +
  geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 ) +
    
  # Add titles and labels
  ggtitle('Data Density') +
  xlab('') +
  ylab('Density') +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20))
  
  return(density_plot) # End of returned outputs
  
} # End of function

# 1% quantile
q01 <- students %>%
  pull(StudyHours) %>%
  quantile(1/100, names = FALSE)

# keeping only the students above the 1st percentile
col <- students %>%
  select(StudyHours) %>%
  filter(StudyHours > q01)

# Get the density of StudyHours
show_density(var_data = col)

## Now let's get Measures of Variance

Measures of variance help us understand how spread out the data is.

__Range__: The difference between the maximum and minimum. There's no built-in function for this, but it's easy to calculate by using the min and max functions. Another approach would be to use Base R's base::range() which returns a vector containing the minimum and maximum of all the given arguments. Wrapping this in base::diff() will get you well on your way to finding the range.

__Variance__: The average of the squared differences from the mean. You can use the built-in var function to find this; note that it computes the sample variance, dividing by n - 1 rather than n.

__Standard deviation__: The square root of the variance. You can use the built-in sd function to find this.
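
As a quick illustration of the three measures above, here is a minimal sketch on made-up numbers. The vector `x` and the result names are hypothetical; the functions used are all from base R and the stats package.

```r
# Toy data (hypothetical values chosen so the mean is exactly 5)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range computed two ways: max minus min, and diff() wrapped around range()
range_minmax <- max(x) - min(x)
range_diff <- diff(range(x))

# Sample variance (divides by n - 1) and its square root
variance <- var(x)
std_dev <- sd(x)

print(c(range = range_diff, variance = variance, sd = std_dev))
```

Both ways of computing the range give the same number, and `sd(x)` equals `sqrt(var(x))`.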

In [None]:
# Selecting the columns on which Variance is to be calculated
cols <- students %>%
  select(Grade, StudyHours)

# Map function works like any language's map function
map(cols, function(column) {
  range <- max(column) - min(column)
  variance <- var(column)
  std <- sd(column)

  glue('
  Range: {format(round(range, 2), nsmall = 2)}
  Variance: {format(round(variance, 2), nsmall = 2)}
  Standard Deviation: {format(round(std, 2), nsmall = 2)}
  ', .sep = '\n')
})

The greater the standard deviation, the more variance there is when you compare values in the distribution with the distribution mean. That is, the data is more spread out.

When you're working with a normal distribution, the standard deviation works with the particular characteristics of a normal distribution to provide even greater insight. This can be summarized by using the 68–95–99.7 rule, also known as the empirical rule, which is described as follows:

In any normal distribution:

- Approximately 68.27 percent of values fall within one standard deviation from the mean.
- Approximately 95.45 percent of values fall within two standard deviations from the mean.
- Approximately 99.73 percent of values fall within three standard deviations from the mean.
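
These percentages can be checked from the cumulative distribution function of the standard normal distribution. The sketch below uses `pnorm` from the stats package; the helper name `within_k_sd` is hypothetical and introduced only for illustration.

```r
# Share of a normal distribution lying within k standard deviations of the mean
within_k_sd <- function(k) {
  pnorm(k) - pnorm(-k)
}

# Evaluate for one, two, and three standard deviations
coverage <- sapply(1:3, within_k_sd)

# Show as percentages rounded to two decimals
print(round(100 * coverage, 2))
```

The rule only holds for data that is approximately normal, so it is worth looking at the histogram or density plot before relying on it.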

In [None]:
# use this function to quickly evaluate a data set
summary(students)

# this library does descriptive analysis on your data
# (install it first with install.packages("summarytools") if needed)
library(summarytools)

# gives lots of details like mean, median, etc.
descr(students, stats = "common")

## Let's examine boxplot

At this point in the notebook, we take a closer look at box plots and see how they help us understand the spread of the data, in this case how study hours differ between students who passed and students who failed.

In [None]:
# getting the 1st percentile
q01 <- students %>%
  pull(StudyHours) %>%
  quantile(1/100, names = FALSE)

# filtering and removing the outliers using the 1st percentile
col <- students %>%
  select(StudyHours) %>%
  filter(StudyHours > q01)

# plotting the boxplot chart
students %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = Status, y = StudyHours), fill = "yellow", alpha = 0.4) +

# adding our own custom number of data labels on y axis
scale_y_continuous(limits = c(0,16), breaks = seq(0, 16, by = 1)) +

# adding titles and labels
ggtitle("Box Plot") +
theme(axis.text.x = element_text(size = 20),
      axis.text.y = element_text(size = 20),
      plot.title = element_text(hjust = 0.5, size = 20),
      panel.grid.major.y = element_line(colour = "gray33", linetype = "dashed", linewidth = 0.5)
      )

![Alt text](Assets/image.png)

## Let's pivot the table

We can use the `pivot_longer` function from tidyr to reshape the table from wide to long format, so that each row holds one name, one metric, and one value.

In [None]:
# Pivot data from wide to long
df_sample_long <- students %>%

  # Select all except the Status
  select(!Status) %>% 

  # Pivot all except the Name column
  pivot_longer(!Name, names_to = "Metrics", values_to = "Values")
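
To make the reshaping concrete, here is a minimal sketch on a two-row table. The data frame `wide` and its values are made up for illustration; only `pivot_longer` from tidyr is used.

```r
library(tidyr)

# Hypothetical wide table: one row per student, one column per metric
wide <- data.frame(
  Name = c("Ann", "Bob"),
  StudyHours = c(10, 8),
  Grade = c(50, 47)
)

# Pivot every column except Name into Metrics/Values pairs
long <- pivot_longer(wide, !Name, names_to = "Metrics", values_to = "Values")

print(long)
```

Each student now appears once per metric, so two students with two metrics become four rows. This long layout is convenient for faceting or coloring by `Metrics` in ggplot2.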

## What's the Correlation Coefficient?

It measures the strength and direction of the linear relationship between two properties, and you can read it like this:

- If it is close to 1, it means they are strongly related such that if one increases the other increases as well
- If it is close to -1, it means they are strongly negatively related such that if one increases the other decreases.
- If it is close to 0, it means the two properties have little or no linear relationship.
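
The three cases can be reproduced with a few made-up vectors. This is only a sketch: `x`, `y_unrelated`, and the result names are hypothetical, and `cor` computes the Pearson coefficient by default.

```r
# Toy data (hypothetical values)
x <- c(1, 2, 3, 4, 5)

# y rises in step with x: coefficient close to 1
r_positive <- cor(x, 2 * x + 1)

# y falls as x rises: coefficient close to -1
r_negative <- cor(x, -3 * x)

# y has no linear trend in x: coefficient close to 0
y_unrelated <- c(1, 3, 2, 3, 1)
r_none <- cor(x, y_unrelated)

print(c(positive = r_positive, negative = r_negative, none = r_none))
```

A coefficient near 0 only rules out a straight-line relationship; the two variables may still be related in a curved way, as in the last vector.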

In [None]:
# Compute Correlation coefficient as
cor(students$StudyHours, students$Grade)

## Let's get the Linear Equation for our data

We can use a linear regression model to predict a student's grade from their study hours, or fit the same kind of model to other data sets to extract meaningful information from them.

In [None]:
# Extract the columns to apply the linear regression to
regression <- students %>%
  select(Grade, StudyHours)

# Apply the linear model
linear <- lm(Grade ~ StudyHours, data = regression)

# extract the values from it
intercept_c <- linear$coefficients[1]
slope_m <- linear$coefficients[2]

# print out the data
glue('
  slope: {format(round(slope_m, 4), nsmall = 4)}
  y-intercept: {format(round(intercept_c, 4), nsmall = 4)}
  f(x) = {format(round(slope_m, 4), nsmall = 4)}x + {format(round(intercept_c, 4), nsmall = 4)}
')
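
For a single predictor, the slope and intercept that `lm` reports can also be obtained from the closed-form least-squares formulas: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below checks this on made-up data; the vectors `hours` and `grade` and the other names are hypothetical.

```r
# Toy data (hypothetical values, not taken from grades.csv)
hours <- c(2, 4, 6, 8, 10)
grade <- c(20, 35, 45, 60, 70)

# Closed-form least-squares estimates for one predictor
slope <- cov(hours, grade) / var(hours)
intercept <- mean(grade) - slope * mean(hours)

# The same fit through lm()
fit <- lm(grade ~ hours)

# Compare: coef() returns the intercept first, then the slope
same_fit <- isTRUE(all.equal(unname(coef(fit)), c(intercept, slope)))
print(c(slope = slope, intercept = intercept))
print(same_fit)
```

Using `coef(fit)` is an alternative to indexing `fit$coefficients` directly and returns the same named vector.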

In [None]:
# Now let's use the above model to compute the predicted grade and the error for every student
df_regression <- regression %>%
  mutate(fx = (slope_m * StudyHours + intercept_c), error = (fx - Grade))

# And plot it as
df_regression %>% 
  ggplot() +
  geom_point(aes(x = StudyHours, y = Grade)) +
  # Add a line based on the linear model
  geom_abline(intercept = intercept_c, slope = slope_m, color = "springgreen3", size = 1) +
  ggtitle('Study Time vs Grade') +
  theme(plot.title = element_text(hjust = 0.5))

![Alt text](Assets/image2.png)

## A model to predict student score

Using the slope and intercept fitted above from the sample data, we have a linear model that estimates a student's grade from their study hours. But remember, we are only using this linear model because we found the `correlation coefficient` to be close to 1, which means study hours and grades are very closely related to each other.

In [None]:
# it returns the grade based on previous slope_m and intercept_c calculations
predict_grade <- function (study_hours) {
  return (slope_m * study_hours + intercept_c)
} # end of function

# print the result, clamped so it is never less than 0 or greater than 100
max(0, min(100, predict_grade(17)))
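
The `max(0, min(100, ...))` pattern works for a single value, but `min` and `max` collapse a vector into one number. If several study-hour values need to be predicted at once, one option is the element-wise `pmin` and `pmax`. The sketch below uses hypothetical coefficients (`slope_toy`, `intercept_toy`) rather than the ones fitted in this notebook, and `predict_grade_clamped` is a name introduced only for illustration.

```r
# Hypothetical coefficients for illustration only;
# in the notebook these would come from the fitted lm() model
slope_toy <- 6.25
intercept_toy <- 8.5

# Predict grades for a vector of study hours, clamped to the 0-100 range
predict_grade_clamped <- function(study_hours) {
  raw <- slope_toy * study_hours + intercept_toy
  pmax(0, pmin(100, raw))
}

predictions <- predict_grade_clamped(c(-5, 0, 8, 17))
print(predictions)
```

Clamping keeps the output in a valid range, but a prediction that hits 0 or 100 is a sign that the input lies outside the range of study hours the model was fitted on.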