# Group 26 Project Proposal

### Introduction: 

There are a variety of test preparation classes available to students at varying prices. These courses claim to improve students' performance on exams. 

Predictive Question: Can we use the exam scores of students to predict whether they attended a test preparation course?

We use the `all_exams.csv` data set, which describes high school students from the US, to predict whether a student took a test prep course from their math, reading, and writing exam scores. The data set also records each student's gender, race/ethnicity, parental level of education, and lunch access.
We increased the sample size to 1200 by combining the downloaded files, since the data is generated on demand. We expect the larger sample to improve our model's accuracy because the classifier will have more examples to learn from.


### Primary Exploratory Data Analysis:

In [None]:
#Run this cell 
library(tidyverse)
library(tidymodels)
library(RColorBrewer)
library(GGally)
library(themis)

In [None]:
options(repr.matrix.max.rows = 10)
all_exams<-read_csv("GP_data/all_exams.csv")
all_exams

In [None]:
#Tidying the data by changing the column names and the chr data types to fct data types

colnames(all_exams)<-c("X1", "gender", "race_ethnicity", "parental_level_of_education",
"lunch", "test_preparation_course", "math_score", "reading_score", "writing_score")

tidying_data <-select(all_exams, gender:writing_score)%>%
    mutate(across(gender:test_preparation_course, as.factor))
tidying_data

In [None]:
#Creating the new dataset we will be using

exams_data<-tidying_data %>%
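    rowwise()%>%
    #mean() needs one vector of the three scores; math_score:writing_score would build an integer sequence instead
    mutate(avg_grade=mean(c(math_score, reading_score, writing_score)))%>%
    ungroup()%>%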
    select(test_preparation_course, math_score, reading_score, writing_score, avg_grade)
exams_data

In [None]:
#Splitting the dataset into training and testing data

set.seed(2021)

data_split <- initial_split(exams_data, prop = 0.75, strata = test_preparation_course)
exam_train <- training(data_split)
exam_test <- testing(data_split)

glimpse(exam_train)

In the cells above, the packages were loaded and the data was read and tidied for easier handling in the future. Additionally, the dataset was split into training and testing data.

In [None]:
exam_recipe <- recipe(test_preparation_course ~ ., data = exam_train)%>% 
  step_upsample(test_preparation_course, over_ratio = 1, skip = FALSE)%>%
  prep()

exam_recipe


A class imbalance was present in our data. Students who did not take the test preparation course were more common than those who did. For this reason, the data was balanced using the code above.
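One way to confirm that upsampling balances the classes is to bake the prepped recipe and count the rows per class. The sketch below is a minimal illustration under stated assumptions: `sim_train` is a small simulated stand-in (not our data), and the 60/30 class split is chosen only to make the imbalance visible.

```r
library(tidyverse)
library(tidymodels)
library(themis)

set.seed(2021)

# Simulated stand-in for the training data (assumption, not the project's data):
# 60 students without the course and 30 with it.
sim_train <- tibble(
  test_preparation_course = factor(rep(c("none", "completed"), times = c(60, 30))),
  math_score = round(rnorm(90, mean = 66, sd = 15))
)

# Same upsampling step as in the recipe above, applied to the stand-in data.
sim_recipe <- recipe(test_preparation_course ~ ., data = sim_train) %>%
  step_upsample(test_preparation_course, over_ratio = 1, skip = FALSE) %>%
  prep()

# new_data = NULL returns the processed training set kept by prep().
class_counts <- bake(sim_recipe, new_data = NULL) %>%
  count(test_preparation_course)

# Both classes should now have the same number of rows.
class_counts
```

Running the same `bake()` and `count()` calls on `exam_recipe` should show whether our own training data ends up balanced.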

In [None]:
#Determine the average score of all the exams for students who have completed the test prep course and those who haven't

predictor_means <- exam_train%>%
    group_by(test_preparation_course)%>%
    summarize(
        math_score_average=mean(math_score),
        writing_score_average=mean(writing_score),
        reading_score_average=mean(reading_score),
        total_average_score=mean(avg_grade)
    )
predictor_means

num_obs <- nrow(exam_train)
exam_train %>%
  group_by(test_preparation_course) %>%
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )
num_obs

In [None]:
Pairwise_Matrix <- ggpairs(exam_train, title = "Pairwise Matrix Plot", 
                           aes(alpha = 0.2, color = test_preparation_course))
Pairwise_Matrix

The final step of our exploratory data analysis was to create two tables and a matrix plot. The first table summarizes the average of each predictor variable within each class, and the second gives the number and percentage of observations in each class. The plot above allows for easy comparison of the distributions of each predictor variable between the two classes.

### Methods:

Our data analysis will be conducted using the K-nearest neighbors classification algorithm. The predictors will be the quantitative variables: the math, reading, writing, and average exam scores. A model specification will be created and then fitted to the training data. Before scaling and centering the data, cross-validation will be applied to ensure that no selection bias is present. After scaling and centering, a prediction can be made on a new observation. Training data will be used to build the classifier and testing data will be used to estimate the model’s accuracy. Lastly, the model will be retrained with the testing data.
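The cross-validation step described above might look roughly like the following sketch. It is not our final analysis: `sim_train` is a simulated stand-in with random labels, and the five folds and the grid of K values are assumptions picked for illustration. Because scaling and centering sit inside the recipe, they are re-estimated within each fold.

```r
library(tidyverse)
library(tidymodels)

set.seed(2021)

# Simulated stand-in for exam_train (assumption, not the project's data).
n <- 300
sim_train <- tibble(
  test_preparation_course = factor(sample(c("completed", "none"), n, replace = TRUE)),
  math_score = round(rnorm(n, mean = 66, sd = 15)),
  reading_score = round(rnorm(n, mean = 69, sd = 15)),
  writing_score = round(rnorm(n, mean = 68, sd = 15))
) %>%
  mutate(avg_grade = (math_score + reading_score + writing_score) / 3)

# Scale and center every predictor.
knn_recipe <- recipe(test_preparation_course ~ ., data = sim_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# K-nearest neighbors specification with the number of neighbors left to tune.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Five stratified folds and a small grid of candidate K values (both assumptions).
folds <- vfold_cv(sim_train, v = 5, strata = test_preparation_course)
k_grid <- tibble(neighbors = seq(from = 1, to = 21, by = 4))

# Cross-validated accuracy for each candidate K.
accuracies <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_tune_spec) %>%
  tune_grid(resamples = folds, grid = k_grid) %>%
  collect_metrics() %>%
  filter(.metric == "accuracy")

# K with the highest mean accuracy across folds.
best_k <- accuracies %>%
  slice_max(mean, n = 1, with_ties = FALSE) %>%
  pull(neighbors)

accuracies
best_k
```

The selected `best_k` would then be used in a final model specification that is fitted on the training data and evaluated once on the testing data.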

Based on the distribution plots in the Pairwise Matrix Plot above, the math, writing, reading and average exam scores were chosen as predictors. The red peak, representative of the students who completed the test preparation course, shows higher scores for every exam compared to students who did not complete the test preparation course.

Two decision boundary plots will be created for our visualization. The first will have the math score on the x-axis and the writing score on the y-axis. The second will have the reading score on the x-axis and the writing score on the y-axis.
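As a rough sketch of how the first of these plots could be built, the code below fits a two-predictor classifier, predicts the class over a grid of score combinations, and draws the grid behind the observations. The data in `sim_scores` is simulated (not ours), and `neighbors = 15` and the 50-by-50 grid are placeholder assumptions.

```r
library(tidyverse)
library(tidymodels)

set.seed(2021)

# Simulated stand-in data (assumption, not the project's data).
sim_scores <- tibble(
  test_preparation_course = factor(sample(c("completed", "none"), 200, replace = TRUE)),
  math_score = round(rnorm(200, mean = 66, sd = 15)),
  writing_score = round(rnorm(200, mean = 68, sd = 15))
)

# Two-predictor recipe and a K-nearest neighbors model with a placeholder K.
recipe_2d <- recipe(test_preparation_course ~ math_score + writing_score,
                    data = sim_scores) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 15) %>%
  set_engine("kknn") %>%
  set_mode("classification")

fit_2d <- workflow() %>%
  add_recipe(recipe_2d) %>%
  add_model(knn_spec) %>%
  fit(data = sim_scores)

# Grid of score combinations covering the observed range of both exams.
score_grid <- expand_grid(
  math_score = seq(min(sim_scores$math_score), max(sim_scores$math_score), length.out = 50),
  writing_score = seq(min(sim_scores$writing_score), max(sim_scores$writing_score), length.out = 50)
)

# Predicted class at every grid point.
grid_pred <- bind_cols(score_grid, predict(fit_2d, new_data = score_grid))

# Faint grid points show the predicted regions; solid points are the observations.
boundary_plot <- ggplot() +
  geom_point(data = grid_pred,
             aes(x = math_score, y = writing_score, color = .pred_class),
             alpha = 0.1, size = 2) +
  geom_point(data = sim_scores,
             aes(x = math_score, y = writing_score, color = test_preparation_course)) +
  labs(x = "Math score", y = "Writing score", color = "Test prep course")

boundary_plot
```

The second plot would follow the same pattern with `reading_score` in place of `math_score`.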


### Expected Outcomes and Significance:

Our model will predict whether a student attended a test prep course based on their scores on the math, reading, and writing exams. We expect a correlation between high exam scores and completion of a test preparation course. This would indicate how effective the test preparation course is at improving students’ performance. Based on the results of this analysis, future projects could compare the impact of test preparation courses with that of self-study.