# Predicting test preparation course attendance from candidates' exam scores

In [14]:
library(tidyverse)
library(tidymodels)
library(RColorBrewer)
library(GGally)

## Introduction 

#### There are a variety of test preparation classes available to students at varying prices. These courses claim to improve students' performance on exams.

#### Predictive Question: Can we use the exam scores of students to predict whether they attended a test preparation course?

The `all_exams.csv` data set is used to determine whether a student took a test preparation course, with their math, reading, and writing exam scores serving as predictors. The data set describes high school students from the US and also records each student's gender, race/ethnicity, parental level of education, and lunch access. Because the source generates the data on demand, several downloaded samples were combined to bring the sample size to 1200. We expect the larger sample to improve the model's accuracy, since the classifier has more examples to learn from.
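Combining several downloads into one data set can be sketched as below. This is a minimal illustration, not the code used to build `all_exams.csv`: the two small tibbles and the names `batch_a`, `batch_b`, and `source_file` are hypothetical stand-ins for separately downloaded files.

```r
library(dplyr)

# Hypothetical stand-ins for two separately downloaded batches
batch_a <- tibble(gender = c("male", "female"), `math score` = c(69, 65))
batch_b <- tibble(gender = "female", `math score` = 57)

# Stack the batches row-wise; .id records which batch each row came from
combined <- bind_rows(list(batch_a = batch_a, batch_b = batch_b), .id = "source_file")
nrow(combined)
```

Keeping the `source_file` column makes it possible to check later whether the batches differ systematically.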

## 1. Primary Exploratory Data Analysis:

### Read data

In [15]:
options(repr.matrix.max.rows = 10)
all_exams<-read_csv("https://raw.githubusercontent.com/SopTes27/group26_project/main/GP_data/all_exams.csv")
all_exams

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  X1 = col_double(),
  gender = col_character(),
  `race/ethnicity` = col_character(),
  `parental level of education` = col_character(),
  lunch = col_character(),
  `test preparation course` = col_character(),
  `math score` = col_double(),
  `reading score` = col_double(),
  `writing score` = col_double()
)



X1,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
1,male,group D,some college,standard,none,69,63,62
2,female,group E,bachelor's degree,free/reduced,completed,65,78,80
3,female,group C,some high school,standard,none,57,56,59
4,female,group D,associate's degree,free/reduced,none,62,73,71
5,male,group C,some college,free/reduced,none,46,52,46
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
1196,male,group C,associate's degree,standard,completed,65,63,62
1197,female,group C,some high school,standard,none,69,74,68
1198,female,group D,some college,free/reduced,none,43,48,45
1199,female,group B,some college,standard,none,51,51,57


### Wrangle and clean the data

The original data contains a row-index column, `X1`, that will not be used in our model, so we drop it. We then convert gender, race/ethnicity, parental level of education, lunch, and test preparation course to factors, since they are categorical variables.

In [16]:
# Replace spaces and slashes in the column names with underscores
colnames(all_exams) <- c("X1", "gender", "race_ethnicity", "parental_level_of_education",
                         "lunch", "test_preparation_course", "math_score", "reading_score", "writing_score")

# Drop the row index X1 and convert the categorical columns to factors
tidying_data <- select(all_exams, gender:writing_score) %>%
    mutate(across(gender:test_preparation_course, as.factor))
tidying_data

gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
<fct>,<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>
male,group D,some college,standard,none,69,63,62
female,group E,bachelor's degree,free/reduced,completed,65,78,80
female,group C,some high school,standard,none,57,56,59
female,group D,associate's degree,free/reduced,none,62,73,71
male,group C,some college,free/reduced,none,46,52,46
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
male,group C,associate's degree,standard,completed,65,63,62
female,group C,some high school,standard,none,69,74,68
female,group D,some college,free/reduced,none,43,48,45
female,group B,some college,standard,none,51,51,57
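As an alternative to listing every new column name by position, the renaming could be done with a pattern. The sketch below assumes, as in this data set, that the only characters to replace are spaces and slashes; the one-row tibble `raw` is a hypothetical stand-in for `all_exams`.

```r
library(dplyr)

# Hypothetical one-row stand-in with the original column names
raw <- tibble(
  `race/ethnicity` = "group D",
  `test preparation course` = "none",
  `math score` = 69
)

# Replace every space or slash in the column names with an underscore
clean <- raw %>%
  rename_with(~ gsub("[ /]", "_", .x))
names(clean)
```

This approach does not depend on the column order, so it keeps working if columns are added or rearranged.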


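#### Tidy the data and create a new data set for use in the following steps.

Starting from the `tidying_data` produced in the previous step, we compute an average grade for each student, `avg_grade`, and keep only `test_preparation_course`, `math_score`, `reading_score`, `writing_score`, and `avg_grade`. The new data set is called `exams_data`. Note that inside `mean()`, `math_score:writing_score` is evaluated as a numeric sequence running from the math score to the writing score, so `avg_grade` is the midpoint of those two scores (for the first student, (69 + 62) / 2 = 65.5) and the reading score does not enter it.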

In [17]:
exams_data<-tidying_data %>%
    rowwise(math_score:writing_score)%>%
    mutate(avg_grade=mean(math_score:writing_score))%>%
    select(test_preparation_course, math_score, reading_score, writing_score, avg_grade)
exams_data

test_preparation_course,math_score,reading_score,writing_score,avg_grade
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
none,69,63,62,65.5
completed,65,78,80,72.5
none,57,56,59,58.0
none,62,73,71,66.5
none,46,52,46,46.0
⋮,⋮,⋮,⋮,⋮
completed,65,63,62,63.5
none,69,74,68,68.5
none,43,48,45,44.0
none,51,51,57,54.0
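The `avg_grade` values above equal the midpoint of the math and writing scores (65.5 for scores of 69, 63, and 62), because `math_score:writing_score` inside `mean()` builds a numeric sequence rather than selecting columns. If a mean over all three subjects is wanted, one option is `c_across()`, sketched below on two rows taken from the table. The column names `seq_mean` and `three_subject_mean` are illustrative, and switching the notebook to this version would change every downstream number that uses `avg_grade`.

```r
library(dplyr)

# First two students from the table above
scores <- tibble(
  math_score    = c(69, 65),
  reading_score = c(63, 78),
  writing_score = c(62, 80)
)

compared <- scores %>%
  rowwise() %>%
  mutate(
    # Mean of the sequence math:writing, i.e. the math/writing midpoint
    seq_mean = mean(math_score:writing_score),
    # Mean of the three subject scores in this row
    three_subject_mean = mean(c_across(math_score:writing_score))
  ) %>%
  ungroup()
compared
```

The final `ungroup()` drops the row-wise grouping so that later steps treat the result as an ordinary tibble.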


## 2. Splitting the data into training and test sets and analyzing the data:

To create the training and test sets, we first split `exams_data` with the `initial_split()` function, stratifying on `test_preparation_course`. 75% of the rows go to the training set, `exam_train`, and the remaining 25% go to the test set, `exam_test`.

The seed is set to 2021 so that the split is reproducible.


In [19]:
set.seed(2021)

data_split <- initial_split(exams_data, prop = 0.75, strata = test_preparation_course)
exam_train <- training(data_split)
exam_test <- testing(data_split)

glimpse(exam_train)

Rows: 901
Columns: 5
Rowwise: math_score, reading_score, writing_score
$ test_preparation_course <fct> none, completed, none, none, none, none, none…
$ math_score              <dbl> 69, 65, 57, 62, 46, 39, 78, 57, 80, 85, 79, 5…
$ reading_score           <dbl> 63, 78, 56, 73, 52, 35, 90, 62, 86, 86, 75, 4…
$ writing_score           <dbl> 62, 80, 59, 71, 46, 28, 84, 54, 91, 84, 71, 3…
$ avg_grade               <dbl> 65.5, 72.5, 58.0, 66.5, 46.0, 33.5, 81.0, 55.…


In [20]:
sum(is.na(exam_train))

In the cells above, the packages were loaded and the data was read and tidied for easier handling later. The data set was then split into training and testing data. The `sum(is.na(exam_train))` check confirms that the training data has no missing values.
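The two checks described above, that a stratified split keeps the class proportions and that no values are missing, can be sketched on toy data. The tibble `toy` and its values are hypothetical; only the function calls mirror the ones used in this notebook.

```r
library(dplyr)
library(rsample)

set.seed(2021)

# Hypothetical toy data: 100 "completed" and 200 "none" students
toy <- tibble(
  course = factor(rep(c("completed", "none"), times = c(100, 200))),
  score  = rnorm(300, mean = 67, sd = 15)
)

toy_split <- initial_split(toy, prop = 0.75, strata = course)
toy_train <- training(toy_split)

# Class proportions before and after the split should be close
full_props  <- prop.table(table(toy$course))
train_props <- prop.table(table(toy_train$course))
full_props
train_props

# Missing values per column, rather than one overall total
missing_per_column <- colSums(is.na(toy_train))
missing_per_column
```

Counting missing values per column, instead of a single total, shows where any gaps are if the total is not zero.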

In this step we build a recipe that sets `test_preparation_course` as the outcome variable and uses all other columns as predictors. The `step_upsample()` step replicates rows of the data set so that every level of the outcome factor occurs equally often; `over_ratio = 1` brings the less frequent class up to the same count as the most frequent one.

In [26]:
exam_recipe <- recipe(test_preparation_course ~ ., data = exam_train)%>% 
  step_upsample(test_preparation_course, over_ratio = 1, skip = FALSE)%>%
  prep()
exam_recipe

upsampled_exam <- bake(exam_recipe, exam_train)

upsampled_exam %>%
  group_by(test_preparation_course) %>%
  summarize(n = n())
upsampled_exam

Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          4

Training data contained 901 data points and no missing data.

Operations:

Up-sampling based on test_preparation_course [trained]

`summarise()` ungrouping output (override with `.groups` argument)



test_preparation_course,n
<fct>,<int>
completed,592
none,592


math_score,reading_score,writing_score,avg_grade,test_preparation_course
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
46,60,64,55.0,completed
33,45,52,42.5,completed
77,84,85,81.0,completed
72,82,86,79.0,completed
42,38,47,44.5,completed
⋮,⋮,⋮,⋮,⋮
73,70,71,72,none
56,66,62,59,none
58,57,58,58,none
43,48,45,44,none


A class imbalance was present in our data: in the training set, 592 students (about 65.7%) did not take the test preparation course and only 309 (about 34.3%) did. For this reason, the data was balanced using the code above, which upsamples the `completed` class to 592 rows.
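To make the idea behind upsampling concrete, the base R sketch below balances a toy data set by resampling the smaller class with replacement. It illustrates the general technique only and is not how `step_upsample()` is implemented; the data frame `toy` and the helper `upsample_group()` are hypothetical.

```r
set.seed(2021)

# Hypothetical toy data: 6 "none" rows and 3 "completed" rows
toy <- data.frame(
  course = factor(c(rep("none", 6), rep("completed", 3))),
  score  = c(64, 70, 58, 66, 61, 72, 75, 81, 69)
)

# Every class is brought up to the size of the largest class
target <- max(table(toy$course))

# Keep the original rows and append randomly repeated ones until the target is met
upsample_group <- function(grp) {
  extra <- target - nrow(grp)
  if (extra > 0) {
    picked <- sample.int(nrow(grp), size = extra, replace = TRUE)
    grp <- rbind(grp, grp[picked, , drop = FALSE])
  }
  grp
}

toy_balanced <- do.call(rbind, lapply(split(toy, toy$course), upsample_group))
table(toy_balanced$course)
```

Because the added rows are exact copies, upsampling adds no new information; it only changes how much weight the smaller class carries during training.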

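In this step we calculate, for each of the two groups, the average score in every subject and the average of `avg_grade`, followed by the count and percentage of each group in the training set. We then call `summary()` on `exam_train` to summarize each column. Finally, `lapply()` applies `summary()` to every column of `exam_train`, and `do.call()` passes the resulting list to `cbind()`, which combines the summaries by columns into a single table. Because the summary of the factor column `test_preparation_course` holds only the two class counts, `cbind()` recycles them down that column, so only the numeric columns of that table are meaningful.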

In [34]:
predictor_means <- exam_train%>%
    group_by(test_preparation_course)%>%
    summarize(
        math_score_average=mean(math_score),
        writing_score_average=mean(writing_score),
        reading_score_average=mean(reading_score),
        total_average_score=mean(avg_grade)
    )
predictor_means

num_obs <- nrow(exam_train)
exam_train %>%
  group_by(test_preparation_course) %>%
  summarize(
    count = n(),
    percentage = n() / num_obs * 100
  )
num_obs
summary(exam_train)
do.call(cbind, lapply(exam_train, summary))

`summarise()` ungrouping output (override with `.groups` argument)



test_preparation_course,math_score_average,writing_score_average,reading_score_average,total_average_score
<fct>,<dbl>,<dbl>,<dbl>,<dbl>
completed,70.89644,75.24272,74.23301,73.06958
none,64.9848,64.67399,66.88514,64.82939


`summarise()` ungrouping output (override with `.groups` argument)



test_preparation_course,count,percentage
<fct>,<int>,<dbl>
completed,309,34.29523
none,592,65.70477


 test_preparation_course   math_score     reading_score    writing_score  
 completed:309           Min.   : 14.00   Min.   : 15.00   Min.   : 13.0  
 none     :592           1st Qu.: 56.00   1st Qu.: 59.00   1st Qu.: 58.0  
                         Median : 67.00   Median : 69.00   Median : 68.0  
                         Mean   : 67.01   Mean   : 69.41   Mean   : 68.3  
                         3rd Qu.: 78.00   3rd Qu.: 80.00   3rd Qu.: 79.0  
                         Max.   :100.00   Max.   :100.00   Max.   :100.0  
   avg_grade     
 Min.   : 16.50  
 1st Qu.: 57.50  
 Median : 67.50  
 Mean   : 67.66  
 3rd Qu.: 79.00  
 Max.   :100.00  

statistic,test_preparation_course,math_score,reading_score,writing_score,avg_grade
Min.,309,14.0,15.0,13.0,16.5
1st Qu.,592,56.0,59.0,58.0,57.5
Median,309,67.0,69.0,68.0,67.5
Mean,592,67.01221,69.40511,68.29856,67.65538
3rd Qu.,309,78.0,80.0,79.0,79.0
Max.,592,100.0,100.0,100.0,100.0
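In the table above, the `test_preparation_course` column alternates between 309 and 592 because `cbind()` recycles the two class counts to fill six rows. One way to avoid this, sketched below on a hypothetical four-row data frame `toy`, is to restrict the `lapply()` call to the numeric columns.

```r
# Hypothetical toy data with one factor column and two numeric columns
toy <- data.frame(
  course        = factor(c("none", "completed", "none", "none")),
  math_score    = c(69, 65, 57, 62),
  reading_score = c(63, 78, 56, 73)
)

# Identify the numeric columns, summarize each, and bind the summaries column-wise
numeric_cols   <- vapply(toy, is.numeric, logical(1))
summary_matrix <- do.call(cbind, lapply(toy[numeric_cols], summary))
summary_matrix
```

Each column of `summary_matrix` then holds the six summary statistics of one numeric variable, with the statistic names as row names.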
