# Predicting Attendance to a Test Preparation Course Based on Candidates' Scores

## Introduction 

Test preparation courses are a form of shadow education, defined as "educational activities, such as tutoring and extra classes, occurring outside of the formal channels of an educational system" (Buchmann et al., 436). They are used with the intention of improving students' chances of success in high school courses and of gaining admission to the post-secondary institution of their choice. Some companies offering these courses are confident enough in their services to offer clients a refund if a high score is not achieved (Buchmann et al., 440).

Predictive Question: Can we use the exam scores of students to predict whether they attended a test preparation course?

We use the all_exams.csv data set to predict whether a student took a test preparation course from their math, reading, and writing exam scores. The data set describes high school students from the US and also includes each student’s gender, race/ethnicity, parental level of education, and lunch access. The sample size was increased to 1200 by combining multiple downloads, since the data is generated on the fly. We expect the larger sample to improve the model's accuracy, because the classifier has more examples to learn from.

## Methods

In [None]:
library(tidyverse)
library(tidymodels)
library(RColorBrewer)
library(GGally)

### Loading the Data

In [None]:
options(repr.matrix.max.rows = 10)
all_exams<-read_csv("https://raw.githubusercontent.com/SopTes27/group26_project/main/GP_data/all_exams.csv")
all_exams

### Wrangling and Cleaning the Data

We drop the X1 column from the original data set because it is not used in our model. Then we convert the gender, race/ethnicity, parental level of education, lunch, and test preparation course columns to factors (categorical variables).

In [None]:
colnames(all_exams)<-c("X1", "gender", "race_ethnicity", "parental_level_of_education",
"lunch", "test_preparation_course", "math_score", "reading_score", "writing_score")

tidying_data <-select(all_exams, gender:writing_score)%>%
    mutate(across(gender:test_preparation_course, as.factor))
tidying_data

Using the `tidying_data` dataset created in the previous step, we add a new column called `avg_grade`, which holds the mean of each student's math, reading, and writing scores. We then keep only `test_preparation_course`, `math_score`, `reading_score`, `writing_score`, and `avg_grade`, and name the resulting dataset `exams_data`. The average grade will be used as a predictor in the data analysis performed later on.

In [None]:
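exams_data <- tidying_data %>%
    rowwise() %>%
    # c_across() gathers the three scores in each row; a bare math_score:writing_score
    # would build an integer sequence between the two values and skip reading_score
    mutate(avg_grade = mean(c_across(math_score:writing_score))) %>%
    ungroup() %>%
    select(test_preparation_course, math_score, reading_score, writing_score, avg_grade)
exams_data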
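
As a sanity check on the averaging step, here is a minimal sketch on a hypothetical three-row data frame (the values in `toy` are made up for illustration and are not from all_exams.csv). It uses base R's `rowMeans()` to average the three score columns, and it also shows that the colon operator between two numbers builds an integer sequence, so the mean of `a:b` is not the mean of three scores.

```r
# Toy scores; illustrative values only, not taken from all_exams.csv
toy <- data.frame(
  math_score    = c(70, 55, 90),
  reading_score = c(90, 60, 85),
  writing_score = c(80, 65, 95)
)

# Row-wise mean of the three score columns
toy$avg_grade <- rowMeans(toy[, c("math_score", "reading_score", "writing_score")])

# For the first row, 70:80 is the sequence 70, 71, ..., 80;
# its mean depends only on the endpoints and ignores reading_score
seq_mean <- mean(toy$math_score[1]:toy$writing_score[1])

print(toy)
print(seq_mean)
```

For the first toy row the true average is 80, while the sequence-based mean is 75, which is why the three scores need to be collected explicitly before averaging.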

The `exams_data` dataset is split into a training set, `exam_train`, containing 75% of the data, and a testing set, `exam_test`, containing the remaining 25%. The split is stratified by `test_preparation_course`, and the seed is set to 2021 so that it is reproducible.

In [None]:
set.seed(2021)

data_split <- initial_split(exams_data, prop = 0.75, strata = test_preparation_course)
exam_train <- training(data_split)
exam_test <- testing(data_split)

glimpse(exam_train)
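
To illustrate what stratifying the split does, the following sketch runs `initial_split()` from the rsample package (part of tidymodels) on a hypothetical imbalanced data frame. The data frame `toy`, its column names, and the 130/70 class counts are assumptions made for this example.

```r
library(rsample)

set.seed(2021)

# Hypothetical imbalanced data: 130 "none" rows and 70 "completed" rows
toy <- data.frame(
  course = factor(rep(c("none", "completed"), times = c(130, 70))),
  score  = round(runif(200, min = 40, max = 100))
)

# Stratifying on `course` samples 75% within each class separately
toy_split <- initial_split(toy, prop = 0.75, strata = course)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions should stay close to 0.65 / 0.35 in both pieces
prop.table(table(toy_train$course))
prop.table(table(toy_test$course))
```

Without `strata`, the class proportions in a small test set could drift away from those of the full data purely by chance.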

### Exploratory Data Analysis - Creating a Summary and Visualization of the `exams_data` Dataset

First, the training and testing datasets were examined for any missing values. 

In [None]:
sum(is.na(exam_train))

In [None]:
sum(is.na(exam_test))

Next, we count the observations in each class of the training and testing datasets to determine whether there is a class imbalance in the data before upsampling. From the two tables below, we can conclude that there is a class imbalance in the training data, because students who did not take the test preparation course are more common than those who did.

In [None]:
num_obs_train <- nrow(exam_train)
exam_train %>%
  group_by(test_preparation_course) %>%
  summarize(
    count = n(), 
    percentage = n() / num_obs_train 
  )

In [None]:
num_obs_test <- nrow(exam_test)
exam_test %>%
    group_by(test_preparation_course)%>%
    summarize(
        count = n(), 
        percentage = n() / num_obs_test
    )

Due to the class imbalance in the training data, upsampling is conducted on only the training dataset to balance the data, as shown below.

In [None]:
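# In recent tidymodels releases step_upsample() is provided by the themis package,
# so library(themis) may be needed before running this cell
exam_recipe <- recipe(test_preparation_course ~ ., data = exam_train)%>% 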
  step_upsample(test_preparation_course, over_ratio = 1, skip = FALSE)%>%
  prep() 
exam_recipe

upsampled_exam <- bake(exam_recipe, exam_train)

upsampled_exam %>%
  group_by(test_preparation_course) %>%
  summarize(n = n())
upsampled_exam
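
The idea behind upsampling with a ratio of 1 can be sketched in base R: rows of the smaller class are resampled with replacement until both classes have the same count. This is only an illustration of the concept on a hypothetical data frame (`toy_train` and its values are made up), not the implementation used by `step_upsample()`.

```r
set.seed(2021)

# Hypothetical imbalanced training data (illustrative values)
toy_train <- data.frame(
  course = factor(rep(c("none", "completed"), times = c(8, 4))),
  score  = c(52, 60, 61, 58, 70, 66, 49, 73, 81, 77, 69, 88)
)

counts   <- table(toy_train$course)
majority <- max(counts)

# For each class, keep its rows and, if it is smaller than the majority class,
# add rows drawn with replacement until it reaches the majority count
balanced_idx <- unlist(lapply(names(counts), function(cls) {
  idx   <- which(toy_train$course == cls)
  extra <- majority - length(idx)
  c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
}))

toy_upsampled <- toy_train[balanced_idx, ]
table(toy_upsampled$course)
```

Because the added rows are duplicates, upsampling is applied to the training data only; duplicating test rows would distort the accuracy estimate.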

The table below summarizes the mean value of each predictor variable in the training set, grouped by test preparation course status. These predictors will be used later in our data analysis.

In [None]:
predictor_means <- exam_train%>%
    group_by(test_preparation_course)%>%
    summarize(
        math_score_average=mean(math_score),
        writing_score_average=mean(writing_score),
        reading_score_average=mean(reading_score),
        total_average_score=mean(avg_grade)
    )
predictor_means

Finally, the table below summarizes all of the variables in the training data set.

In [None]:
summary(exam_train) 
do.call(cbind, lapply(exam_train, summary))

The final step of the exploratory data analysis was to create a visualization of the pairwise relationships among the predictor variables.

In [None]:
options(repr.plot.width = 15, repr.plot.height = 20) 
predictor_plots <-ggplot(exam_train, aes(x=test_preparation_course, fill=test_preparation_course))+
geom_bar()+
labs(fill="Test Preparation Course")+
ggtitle("Predictors Pairwise Matrix Plot")

bar_legend<-grab_legend(predictor_plots)

Pairwise_Matrix_legend<- ggpairs(exam_train, title = "Pairwise Matrix Plot", legend = bar_legend,
                           aes(alpha = 0.2, color = test_preparation_course))+
labs(fill="Test Preparation Course")
Pairwise_Matrix_legend

### Data Analysis - Performing KNN Classification

Before building the recipe that scales and centers the data, we create the folds for 5-fold cross-validation, which will be used to tune the number of neighbors. The strata argument is set to our categorical target variable, `test_preparation_course`.

In [None]:
exam_vfold <- vfold_cv(exam_train, v = 5, strata = test_preparation_course)

To create our KNN classification model, we will first create a recipe using the training data. The recipe specifies the target variable (test_preparation_course) and the predictors, and also scales and centers the predictors.

In [None]:
exam_recipe <- recipe(test_preparation_course ~ ., data = exam_train) %>%
                step_scale(all_predictors()) %>%
                step_center(all_predictors())
exam_recipe
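
What scaling and centering amount to can be shown on a single hypothetical predictor in base R. In this sketch (the data frames `train` and `new` and their values are assumptions for illustration), the mean and standard deviation are estimated from the training data only and then reused for new observations, which mirrors how a recipe is expected to treat training and testing data.

```r
# Hypothetical training and new observations for one predictor
train <- data.frame(math_score = c(50, 60, 70, 80, 90))
new   <- data.frame(math_score = c(70, 100))

# Statistics are estimated from the training data only
train_mean <- mean(train$math_score)
train_sd   <- sd(train$math_score)

# Standardize: subtract the training mean, divide by the training standard deviation
train$math_std <- (train$math_score - train_mean) / train_sd
new$math_std   <- (new$math_score - train_mean) / train_sd

print(train)
print(new)
```

After this transformation the training column has mean 0 and standard deviation 1, so no single score dominates the distance calculation in KNN just because of its spread.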

Next, we create the K-nearest neighbors classifier and mark the number of neighbors as a parameter to be tuned. Cross-validation is then used to estimate the accuracy of the classifier for each candidate value.

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
       set_engine("kknn") %>%
       set_mode("classification")
knn_tune
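
With `weight_func = "rectangular"`, every one of the K nearest neighbors gets an equal vote. The base R sketch below shows that idea for a single new observation. The helper `knn_predict_one()`, the toy scores, and the class labels "completed" and "none" are hypothetical and written for this illustration; this is not how the kknn engine is implemented.

```r
# Classify one new point by an unweighted majority vote among its k nearest training points
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  dists   <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  votes   <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical (math, writing) scores and class labels
train_x <- matrix(c(60, 62,
                    65, 61,
                    58, 66,
                    85, 88,
                    90, 84,
                    82, 91),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("none", "none", "none", "completed", "completed", "completed"))

high <- knn_predict_one(train_x, train_y, new_x = c(84, 86), k = 3)
low  <- knn_predict_one(train_x, train_y, new_x = c(61, 63), k = 3)
print(c(high, low))
```

In the actual analysis the predictors are standardized by the recipe before distances are computed; the sketch skips that step because both toy columns are on the same scale.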

Then, we create a data frame named `k_vals` containing the sequence of K values from 1 to 20 that we would like to test. This data frame is passed to the `grid` argument of the `tune_grid` function.

In [None]:
k_vals <- tibble(neighbors = seq(from = 1, to = 20))
knn_results <- workflow() %>%
       add_recipe(exam_recipe) %>%
       add_model(knn_tune) %>%
       tune_grid(resamples = exam_vfold, grid = k_vals) %>%
       collect_metrics()
knn_results

As the last step in tuning our KNN classification model, we plot the estimated accuracy against K to decide which K value is best. From the plot below, K = 5 is the best choice because it has the highest accuracy on the graph, and values greater than 5 do not bring any notable increase in accuracy.

In [None]:
accuracies <- knn_results %>% 
       filter(.metric == "accuracy" )

accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
       geom_point() +
       geom_line() +
       labs(x = "Neighbors", y = "Accuracy Estimate") +
       scale_x_continuous(breaks = seq(0, 20, by = 1)) +  
       scale_y_continuous(limits = c(0.4, 1.0)) 
accuracy_versus_k
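
Besides reading the best K off the plot, it can also be picked programmatically. The sketch below does this on a hypothetical table shaped like the output of `collect_metrics()`. The accuracy values in `toy_metrics` are made up for illustration and are not results from this analysis.

```r
# Hypothetical cross-validation results; the numbers are illustrative only
toy_metrics <- data.frame(
  neighbors = c(1, 3, 5, 7, 9),
  .metric   = "accuracy",
  mean      = c(0.58, 0.61, 0.66, 0.65, 0.66)
)

acc <- toy_metrics[toy_metrics$.metric == "accuracy", ]

# which.max() returns the first maximum, so ties go to the smaller K
best_k <- acc$neighbors[which.max(acc$mean)]
print(best_k)
```

Preferring the smaller K on a tie is a design choice of this sketch; a larger K would give a smoother classifier at the same estimated accuracy.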

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")
knn_spec

In [None]:
knn_fit <- workflow() %>%
  add_recipe(exam_recipe) %>%   # reuse the scaling and centering steps from tuning
  add_model(knn_spec) %>%
  fit(data = exam_train)
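
The cells above stop at fitting the model. One way the fitted classifier could then be scored is with `accuracy()` and `conf_mat()` from the yardstick package (part of tidymodels). The sketch below applies them to a hypothetical table of five predictions: the data frame `toy_predictions`, its values, and the level names "completed" and "none" are assumptions for illustration, not output from this analysis.

```r
library(yardstick)

lvls <- c("completed", "none")

# Hypothetical true labels and predicted labels for five students
toy_predictions <- data.frame(
  test_preparation_course = factor(c("completed", "none", "none", "completed", "none"),
                                   levels = lvls),
  .pred_class             = factor(c("completed", "none", "completed", "completed", "none"),
                                   levels = lvls)
)

# Share of predictions that match the true label
toy_accuracy <- accuracy(toy_predictions,
                         truth = test_preparation_course,
                         estimate = .pred_class)
print(toy_accuracy)

# Cross-tabulation of predicted against true labels
print(conf_mat(toy_predictions,
               truth = test_preparation_course,
               estimate = .pred_class))
```

With the real objects, the prediction table would presumably be built by binding the output of `predict(knn_fit, exam_test)` to `exam_test` with `bind_cols()`.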

### Visualization of the Data Analysis

To visualize our data analysis, we plotted two decision boundary graphs. The first graph, Figure 3, has the math score on the x-axis versus the writing score on the y-axis. The second graph, Figure 4, has the reading score on the x-axis, and the writing score on the y-axis. These visualizations can be used to...

In [None]:
# Ten points per predictor keep the grid at 10^4 rows; 100 points per predictor
# across four predictors would give 100^4 = 10^8 rows, too many to predict on
are_grid <- seq(min(exam_test$math_score), 
                max(exam_test$math_score), 
                length.out = 10)
smo_grid <- seq(min(exam_test$writing_score), 
                max(exam_test$writing_score), 
                length.out = 10)
wot_grid <- seq(min(exam_test$reading_score), 
                max(exam_test$reading_score), 
                length.out = 10)
oi_grid <- seq(min(exam_test$avg_grade), 
                max(exam_test$avg_grade), 
                length.out = 10)

asgrid <- as_tibble(expand.grid(math_score = are_grid, 
                                writing_score = smo_grid,
                                reading_score = wot_grid,
                                avg_grade = oi_grid))
knnPredGrid <- predict(knn_fit, asgrid)

wkflw_plot <-
  ggplot() +
  geom_point(data = exam_test, 
             mapping = aes(x = math_score, 
                           y = writing_score, 
                           color = test_preparation_course), 
             alpha = 0.75) +
  geom_point(data = exam_test, 
             mapping = aes(x = math_score, 
                           y = writing_score, 
                           color = test_preparation_course), 
             alpha = 0.02, 
             size = 5) +
  labs(color = "Attendance to Test Preparation Course", 
       x = "Math Scores", 
       y = "Writing Scores") +
  scale_color_manual(labels = c("Completed", "Not Completed"), 
                     values = c("orange2", "steelblue2"))

wkflw_plot

In [None]:
wkflw_plot <-
  ggplot() +
  geom_point(data = exam_test, 
             mapping = aes(x = reading_score, 
                           y = writing_score, 
                           color = test_preparation_course), 
             alpha = 0.75) +
  geom_point(data = exam_test, 
             mapping = aes(x = reading_score, 
                           y = writing_score, 
                           color = test_preparation_course), 
             alpha = 0.02, 
             size = 5) +
  labs(color = "Attendance to Test Preparation Course", 
       x = "Reading Scores", 
       y = "Writing Scores") +
  scale_color_manual(labels = c("Completed", "Not Completed"), 
                     values = c("orange2", "steelblue2"))

wkflw_plot

The graphs above are just two examples of the relationships between some of the predictor variables we have chosen. As seen above, students who completed the test preparation course tend to score higher than those who did not.

## Discussion

Our model predicted whether a student attended a test prep course based on their math, reading, writing, and average scores. The conclusion drawn from our data analysis was that…

Previous studies of test preparation courses have shown that students who attended such courses received higher scores than those who studied independently (Buchmann et al., 450). Although the increase in scores was not large, it was noticeable enough to improve students’ chances of being admitted to their college of choice (Buchmann et al., 450). This led us to expect a correlation between high exam scores and the completion of a test preparation course.

The information extracted from this data analysis is important in determining the effect of the test preparation course on students’ performance. Based on the results of this analysis, future projects could compare the impact of test preparation courses with that of self-study. Other factors that are not captured in this data set could also be explored. For example, it has been shown that taking test preparation courses in certain years may be more effective than in others when studying for college exams (Devine-Eller, 475). Future studies may be interested in determining the potential benefits and drawbacks of attending test preparation courses at different points in a student’s high school career.


## References

Alon, S. "Commentaries: Racial Differences in Test Preparation Strategies: A Commentary on Shadow Education, American Style: Test Preparation, the SAT and College Enrollment." Social Forces, vol. 89, no. 2, 2010, pp. 463-474.

Devine-Eller, Audrey. “Timing Matters: Test Preparation, Race, and Grade Level.” Sociological Forum, vol. 27, no. 2, [Wiley, Springer], 2012, pp. 458–80, http://www.jstor.org/stable/23262117.

