# Title: Predicting the Performance of a Student

### Introduction

A student's performance is not only influenced by how hard they work; it can be affected by many factors, including school quality, relationships, health conditions, and so on. The question is, how can one accurately predict the performance of a student?

Using a dataset compiled by Paulo Cortez, we attempt to answer the question above. The dataset contains data collected from school reports and questionnaires at two Portuguese secondary schools, and includes the first period, second period, and final grades of several hundred students. It also tracks various social and health-related information, ranging from family situations to alcohol consumption. Since the original dataset includes too many variables to reasonably consider, the variables we use have been filtered and narrowed down to a manageable number. In this project, we are only interested in predicting a student's math performance.

### Preliminary exploratory data analysis

In [1]:
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)


In [2]:
temp <- tempfile()
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip", temp)
data <- read_delim(unz(temp, "student-mat.csv"), delim = ";") %>% 
        select(age, absences, G1, G2, G3)
unlink(temp)

Parsed with column specification:
cols(
  .default = col_character(),
  age = col_double(),
  Medu = col_double(),
  Fedu = col_double(),
  traveltime = col_double(),
  studytime = col_double(),
  failures = col_double(),
  famrel = col_double(),
  freetime = col_double(),
  goout = col_double(),
  Dalc = col_double(),
  Walc = col_double(),
  health = col_double(),
  absences = col_double(),
  G1 = col_double(),
  G2 = col_double(),
  G3 = col_double()
)

See spec(...) for full column specifications.



In [3]:
# splitting data into training data/testing data
data_split <- initial_split(data, prop = 0.75, strata = G3)

training_data <- training(data_split)
testing_data <- testing(data_split)

training_data
testing_data

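| age | absences | G1 | G2 | G3 |
|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 18 | 6 | 5 | 6 | 6 |
| 15 | 2 | 15 | 14 | 15 |
| 15 | 0 | 16 | 18 | 19 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 17 | 3 | 14 | 16 | 16 |
| 18 | 0 | 11 | 12 | 10 |
| 19 | 5 | 8 | 9 | 9 |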


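| age | absences | G1 | G2 | G3 |
|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 17 | 4 | 5 | 5 | 6 |
| 15 | 10 | 7 | 8 | 10 |
| 16 | 4 | 6 | 10 | 10 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 18 | 7 | 6 | 5 | 6 |
| 20 | 11 | 9 | 9 | 9 |
| 21 | 3 | 10 | 8 | 7 |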
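Because `initial_split` assigns rows at random, rerunning the split cell above produces a different training and testing set each time, so the printed rows and summary values will not be reproduced exactly. A minimal sketch of one way to make the split repeatable is to call `set.seed` immediately before splitting. The seed value `2020` and the helper name `split_with_seed` are assumptions made for illustration, the twelve rows are only the ones visible in the printed outputs above, and `strata` is omitted here solely because such a small sample cannot be stratified sensibly.

```r
library(tibble)
library(rsample)

# The twelve rows visible in the printed outputs above (not the full dataset)
sample_rows <- tribble(
    ~age, ~absences, ~G1, ~G2, ~G3,
    18,  6,  5,  6,  6,
    15,  2, 15, 14, 15,
    15,  0, 16, 18, 19,
    17,  3, 14, 16, 16,
    18,  0, 11, 12, 10,
    19,  5,  8,  9,  9,
    17,  4,  5,  5,  6,
    15, 10,  7,  8, 10,
    16,  4,  6, 10, 10,
    18,  7,  6,  5,  6,
    20, 11,  9,  9,  9,
    21,  3, 10,  8,  7
)

# Hypothetical helper: fix the random seed, then split the data
split_with_seed <- function(df, seed) {
    set.seed(seed)
    initial_split(df, prop = 0.75)
}

# Two independent runs with the same (arbitrary) seed
first_run  <- training(split_with_seed(sample_rows, 2020))
second_run <- training(split_with_seed(sample_rows, 2020))

# Check whether both runs selected exactly the same training rows
same_split <- identical(first_run, second_run)
same_split
```

In the notebook itself, the equivalent change would be a single `set.seed(...)` call placed before `initial_split(data, prop = 0.75, strata = G3)`.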


In [4]:
# Summarizing training data
training_count <- training_data %>% 
    summarize(count = n()) 

# Counting missing values
missing_count <- training_data %>% 
    summarize(missing_G1 = sum(is.na(G1)), missing_G2 = sum(is.na(G2)), missing_G3 = sum(is.na(G3)))

# Mean of Grades
training_mean <- training_data %>% 
    summarize(mean_G1 = mean(G1), mean_G2 = mean(G2), mean_G3 = mean(G3))

training_summary <- training_count %>% 
    bind_cols(missing_count, training_mean)

training_summary

| count | missing_G1 | missing_G2 | missing_G3 | mean_G1 | mean_G2 | mean_G3 |
|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 298 | 0 | 0 | 0 | 10.9698 | 10.7047 | 10.38591 |


### Methods

Of the thirty-two factors and single output target in the dataset, we will use four factors to predict the final grade.

The variables used are as follows; the first four are the predictors and the last is the target:

* Age (from 15 to 22)
* Number of absences (from 0 to 93)
* First period grade, `G1` (from 0 to 20)
* Second period grade, `G2` (from 0 to 20)
* Final grade, `G3` (from 0 to 20), the value being predicted

Using age, absences, and the first and second period grades as predictors, we aim to predict the final math grade (`G3`) of a secondary school student.
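The proposal does not name a specific model, so the following is only a sketch of how such a prediction could be set up with `tidymodels`, under the assumption that an ordinary linear regression is used. The object names (`sample_train`, `sample_test`, `lm_spec`, `lm_recipe`, `lm_fit`) are hypothetical, and the two small tables contain only the rows visible in the printed outputs earlier in this notebook; a real analysis would fit on the full `training_data` and evaluate on `testing_data`.

```r
library(tidyverse)
library(tidymodels)

# Rows visible in the printed training output (illustration only)
sample_train <- tribble(
    ~age, ~absences, ~G1, ~G2, ~G3,
    18, 6,  5,  6,  6,
    15, 2, 15, 14, 15,
    15, 0, 16, 18, 19,
    17, 3, 14, 16, 16,
    18, 0, 11, 12, 10,
    19, 5,  8,  9,  9
)

# Rows visible in the printed testing output (illustration only)
sample_test <- tribble(
    ~age, ~absences, ~G1, ~G2, ~G3,
    17,  4,  5,  5,  6,
    15, 10,  7,  8, 10,
    16,  4,  6, 10, 10,
    18,  7,  6,  5,  6,
    20, 11,  9,  9,  9,
    21,  3, 10,  8,  7
)

# Assumed model: linear regression fitted with the "lm" engine
lm_spec <- linear_reg() %>%
    set_engine("lm") %>%
    set_mode("regression")

# G3 is the target; the four chosen variables are the predictors
lm_recipe <- recipe(G3 ~ age + absences + G1 + G2, data = sample_train)

lm_fit <- workflow() %>%
    add_recipe(lm_recipe) %>%
    add_model(lm_spec) %>%
    fit(data = sample_train)

# Predict final grades for the held-out rows and keep the true values alongside
predictions <- predict(lm_fit, sample_test) %>%
    bind_cols(sample_test)

# Root mean squared error between predicted and actual final grades
test_rmse <- predictions %>%
    rmse(truth = G3, estimate = .pred)

test_rmse
```

With only six training rows the fitted coefficients carry no real meaning; the sketch is meant to show the shape of the workflow, not a result.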

We will use the `ggplot2` library to create a clear visualization of our results.
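As a rough illustration of the kind of figure `ggplot2` could produce here, the sketch below draws a scatter plot of second period grade against final grade. The choice of plot and the names `sample_rows` and `grade_plot` are assumptions, since the proposal does not specify which visualization will be used, and the data are only the twelve rows visible in the printed outputs earlier in this notebook.

```r
library(ggplot2)
library(tibble)

# The twelve rows visible in the printed outputs (not the full dataset)
sample_rows <- tribble(
    ~age, ~absences, ~G1, ~G2, ~G3,
    18,  6,  5,  6,  6,
    15,  2, 15, 14, 15,
    15,  0, 16, 18, 19,
    17,  3, 14, 16, 16,
    18,  0, 11, 12, 10,
    19,  5,  8,  9,  9,
    17,  4,  5,  5,  6,
    15, 10,  7,  8, 10,
    16,  4,  6, 10, 10,
    18,  7,  6,  5,  6,
    20, 11,  9,  9,  9,
    21,  3, 10,  8,  7
)

# Scatter plot of second period grade (x) against final grade (y)
grade_plot <- ggplot(sample_rows, aes(x = G2, y = G3)) +
    geom_point(alpha = 0.6) +
    labs(x = "Second period grade (G2)",
         y = "Final grade (G3)",
         title = "Final grade versus second period grade") +
    theme(text = element_text(size = 14))

grade_plot
```

The same pattern would apply to any of the other predictors by changing the `x` aesthetic.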

### Expected outcomes and significance

In this project, we expect to identify which features can effectively predict a student's math performance. Such findings could show where students need to improve in order to raise their math performance. However, these findings might not generalize to other subjects, so predicting a student's performance in classes besides math remains a question for future work.

**Reference**

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (Eds.), Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008, EUROSIS, ISBN 978-9077381-39-7.
