# Which study habits are most significantly associated with student performance?

### Group 35 - Alex Han, Ihor Parkhomenko, Jingyi Ying, Hank Zhao

## Introduction

Some students struggle to get good grades because they have not learned how to study effectively (Fournier and Hess). Therefore, we want to create a statistical model that explains which study habits are significantly associated with students' grades.

We plan to analyze the student performance data collected by surveying 101 Turkish students in 2019 (Yılmaz and Sekeroglu). We can use these data to assess the population relationship between study habits and end-of-term grades because the authors randomized the sample by surveying students from diverse courses (Yılmaz and Sekeroglu).

Thus, we plan to perform an inferential analysis to create a generative model describing the behavioral attributes associated with good performance.

## Methods and Results

### Libraries and Tooling

In [1]:
# Loading the packages
# NOTE: plyr must be loaded before dplyr (which tidyverse attaches);
# do not load plyr again afterwards.
library(plyr)
library(tidyverse)

# Setting up the color scheme
library(RColorBrewer)

In [2]:
# Setting Seed
set.seed(1)

### Data Cleaning

The original data set encodes every variable and its values as numeric codes, which hides their descriptive meaning. Table 1.0 shows what the original variables and values mean and how they will be transformed for our report.

| Original Variable Name/Number | Renamed Variable Name | Description | Possible Values | 
| --- | --- | --- | --- |
| STUDENT ID | id | any, unique student id | |
| 1 | age | a | |
| 2 | sex | a | |
| 3 | high_school_type | a | |
| 4 | scholarship_type | a | |
| 5 | does_work | a | |
| 6 | does_arts_or_sports | a | |
| 7 | has_partner | a | |
| 8 | total_salary | a | |
| 9 | transportation_to_university | a | |
| 10 | accommodation | a | |
| 11 | education_mother | a | |
| 12 | education_father | a | |
| 13 | num_siblings | a | |
| 14 | parental_status | a | |
| 15 | occupation_mother | a | |
| 16 | occupation_father | a | |
| 17 | study_hours | a | |
| 18 | reading_freq_non_scientific | a | |
| 19 | reading_freq_scientific | a | |
| 20 | conference_attendance | a | |
| 21 | project_impact_to_success | a | |
| 22 | class_attendance | a | |
| 23 | midterm_prepration_style | a | |
| 24 | midterm_preparation_start_time | a | |
| 25 | takes_class_notes | a | |
| 26 | listens_class | a | |
| 27 | enjoys_discussions | a | |
| 28 | flip_classroom | a | |
| 29 | cumulative_gpa | a | |
| 30 | expected_cumulative_gpa | a | |
| COURSE ID | course_id | a | |
| GRADE | grade | a |  |

Table 1.0 - Raw Data Description

We first download the data directly from the UCI Machine Learning Repository. We then rename the columns to be more descriptive, based on the attribute information section on the website.

In [18]:
# Download the data from UCI Machine Learning Repository
raw_data <- read_delim(
  file = "https://archive.ics.uci.edu/ml/machine-learning-databases/00623/DATA.csv",
  delim = ";"
)

Rows: 145 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (1): STUDENT ID
dbl (32): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [19]:
head(raw_data)

STUDENT ID,1,2,3,4,5,6,7,8,9,⋯,23,24,25,26,27,28,29,30,COURSE ID,GRADE
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
STUDENT1,2,2,3,3,1,2,2,1,1,⋯,1,1,3,2,1,2,1,1,1,1
STUDENT2,2,2,3,3,1,2,2,1,1,⋯,1,1,3,2,3,2,2,3,1,1
STUDENT3,2,2,2,3,2,2,2,2,4,⋯,1,1,2,2,1,1,2,2,1,1
STUDENT4,1,1,1,3,1,2,1,2,1,⋯,1,2,3,2,2,1,3,2,1,1
STUDENT5,2,2,1,3,2,2,1,3,1,⋯,2,1,2,2,2,1,2,2,1,1
STUDENT6,2,2,2,3,2,2,2,2,1,⋯,1,1,1,2,1,2,4,4,1,2


In [20]:
# Rename columns to be more descriptive
raw_data <- raw_data %>%
    dplyr::rename(
        id = `STUDENT ID`,
        age = `1`,
        sex = `2`,
        high_school_type = `3`,
        scholarship_type = `4`,
        does_work = `5`,
        does_arts_or_sports = `6`,
        has_partner = `7`,
        total_salary = `8`,
        transportation_to_university = `9`,
        accommodation = `10`,
        education_mother = `11`,
        education_father = `12`,
        num_siblings = `13`,
        parental_status = `14`,
        occupation_mother = `15`,
        occupation_father = `16`,
        study_hours= `17`,
        reading_freq_non_scientific = `18`,
        reading_freq_scientific = `19`,
        conference_attendance = `20`,
        project_impact_to_success = `21`,
        class_attendance = `22`,
        midterm_prepration_style = `23`,
        midterm_preparation_start_time = `24`,
        takes_class_notes = `25`,
        listens_class = `26`,
        enjoys_discussions = `27`,
        flip_classroom = `28`,
        cumulative_gpa = `29`,
        expected_cumulative_gpa = `30`,
        course_id = `COURSE ID`,
        grade = `GRADE`,        
    )

The values are in various scales holding different meanings (e.g. 1 means private for high_school_type yet means None for study_hours). Thus, we will convert them to their original meaning and also change them to factors.

In [None]:
# TODO: add the remaining value labels from the data set documentation.
# Only the age mapping is filled in so far; every other coded column is
# converted to a factor that keeps its numeric codes as levels.
raw_data <- raw_data %>%
    dplyr::mutate(
        age = factor(plyr::mapvalues(age, from = c(1, 2, 3), to = c("18-21", "22-25", "26+"))),
        dplyr::across(-c(id, age), as.factor)
    )
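
As a minimal sketch of the recoding idea, the base R `factor()` function can attach labels to numeric codes in one step, as an alternative to `plyr::mapvalues()`. The vector `codes` below is made-up illustration data, not taken from the data set; only the age labels come from the mapping used above.

```r
# Hypothetical vector of age codes (illustration only, not real survey data)
codes <- c(1, 3, 2, 2, 1)

# Map each numeric code to its label and store the result as a factor
age <- factor(
  codes,
  levels = c(1, 2, 3),
  labels = c("18-21", "22-25", "26+")
)

print(age)
print(table(age))
```

Listing `levels` explicitly keeps the label order fixed even when a code does not occur in the data.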

For the study habits analysis, we reload the data directly from the data set's website so that the original numeric column names are available again.

In [None]:
# Downloading the data
raw_data <- read_delim(
  file = "https://archive.ics.uci.edu/ml/machine-learning-databases/00623/DATA.csv",
  delim = ";"
)

raw_data %>%
  head()

The data set contains 3 types of variables: 
- personal information
- family information
- education habits

For our analysis, we need only the last one, so we select that subset of the data below.

In [None]:
# Filtering to include only study habits
student_data <- raw_data %>%
  select(`17`:`GRADE`) %>%
  select(-`21`, -`27`, -`28`, -`29`, -`30`, -`COURSE ID`)

student_data %>% 
  head()

Additionally, the original data set's column names are numbers, which correspond to various attributes. We rename them according to the data set's documentation to make our further analysis clearer.

In [None]:
# Renaming columns (the dplyr:: prefix avoids a clash with plyr::rename)
student_data <- student_data %>%
  dplyr::rename(
    study_hours = `17`,
    reading_non_scientific = `18`,
    reading_scientific = `19`,
    attendance_seminars = `20`,
    attendance_classes = `22`,
    preparation_to_midterm_1 = `23`,
    preparation_to_midterm_2 = `24`,
    taking_notes = `25`,
    listening = `26`,
    grade = `GRADE`
  )

student_data %>%
  head()

The recorded values are also numeric codes that need to be translated using the documentation. Thus, we convert all study habit columns to factors so that they are interpreted as categories when building the model, while `grade` stays numeric.

In [None]:
# Converting study habit columns to factors; grade is left as a number
student_data <- student_data %>%
  dplyr::mutate(dplyr::across(-grade, as.factor))

The wrangled data set is summarized below.

In [None]:
cat("Number of observations: ", nrow(student_data), "\n")

student_data %>%
  str()

# Number of students per final grade
# (the dplyr:: prefix avoids a clash with plyr::summarise)
student_data %>%
  dplyr::group_by(grade) %>%
  dplyr::summarise(n = dplyr::n())

Scholars argue that spending time studying does not necessarily lead to better performance (Fournier and Hess). We can check whether our data reflects this.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 5)

# Plotting the association between studying hours and grades
study_hours_plot <- student_data %>%
  ggplot(aes(x = study_hours, y = grade)) +
  geom_boxplot() +
  labs(
    title = "Student final grades per study hours per week",
    x = "Study hours per week",
    y = "Final grade"
  ) +
  theme(text = element_text(size = 13))

study_hours_plot

The plot suggests that more study hours do not clearly translate into higher grades, which is consistent with the scholarly claim (Fournier and Hess). However, a boxplot alone is not a formal test: we need to fit a model with `study_hours` as the single predictor and carry out a regression analysis before concluding whether there is an association.
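
A minimal sketch of such a single-predictor analysis is shown here on simulated data, since the model has not been fitted to the real data yet. The data frame `toy` and its values are hypothetical; only the column names `study_hours` and `grade` mirror the wrangled data set.

```r
set.seed(1)

# Hypothetical stand-in for the wrangled data (simulated, not real results)
toy <- data.frame(
  study_hours = factor(sample(1:5, 100, replace = TRUE)),
  grade = sample(0:7, 100, replace = TRUE)
)

# Linear model with study_hours as the only (categorical) predictor
fit <- lm(grade ~ study_hours, data = toy)

# Overall F-test: do mean grades differ across study_hours categories?
f_test <- anova(fit)
p_value <- f_test[["Pr(>F)"]][1]

print(f_test)
```

A small `p_value` would indicate evidence of an association between study hours and grades; a large one would mean the data give no such evidence.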

### Exploratory Data Analysis

#### Exploring Grade Distribution by Courses

#### Exploring Trends in Background Characteristics 

#### Summary Statistics for Quantitative Variables

In [None]:
student_data

#### Correlation Matrix

#### Dimensionality Reduction

### Model Fitting

#### Creating Training, Validation, and Testing Sets

#### Mallows' Cp v P

#### Max Model

#### Simple Model


## Discussion

#### Our analysis is trustworthy because:

- It is based on a relatively recent data set
  
- The data were collected following guidelines that ensure randomization
  
- It uses inferential techniques that let us test the significance of the explanatory variables and build a generative model
  

#### Methods we plan to use:

We want to perform variable selection for our generative model by:

1. Splitting the data into training and testing data sets
  
2. Creating a model that uses all input variables and fitting it with multiple linear regression (MLR)
  
3. Selecting a subset of statistically significant variables using forward selection
  
4. Creating another model that uses only the statistically significant variables from above
  
5. Comparing the models using the adjusted $R^2$
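
The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the final analysis: the data frame `toy` is simulated, the 70/30 split is an arbitrary choice, and `step()` selects variables by AIC rather than by p-values, so it stands in for whichever forward selection criterion is eventually used.

```r
set.seed(1)

# Hypothetical stand-in for the wrangled study habit data (simulated)
n <- 200
toy <- data.frame(
  taking_notes = factor(sample(1:3, n, replace = TRUE)),
  listening    = factor(sample(1:3, n, replace = TRUE)),
  study_hours  = factor(sample(1:5, n, replace = TRUE))
)
# Assumption for illustration: the grade depends on taking_notes only
toy$grade <- 2 + as.integer(toy$taking_notes) + rnorm(n)

# 1. Split into training (70%) and testing (30%) sets
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# 2. Full model using all input variables
full_model <- lm(grade ~ ., data = train)

# 3-4. Forward selection from the intercept-only model (AIC-based);
#      the result is the reduced model with the selected variables
null_model <- lm(grade ~ 1, data = train)
selected_model <- step(
  null_model,
  scope = list(lower = ~ 1, upper = formula(full_model)),
  direction = "forward",
  trace = 0
)

# 5. Compare the two models using the adjusted R^2
adj_r2 <- c(
  full     = summary(full_model)$adj.r.squared,
  selected = summary(selected_model)$adj.r.squared
)

# Out-of-sample check on the held-out testing set
rmse <- function(model) {
  sqrt(mean((test$grade - predict(model, newdata = test))^2))
}
test_rmse <- c(full = rmse(full_model), selected = rmse(selected_model))

print(formula(selected_model))
print(adj_r2)
print(test_rmse)
```

Holding out the testing set keeps the comparison honest, because variables chosen on the training data can look better there than they would on new students.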
  

#### Expectations:

We want to find the study habits that are statistically associated with final grades. Fournier and Hess claim that note-taking is crucial, and Wolfe argues that listening in class significantly impacts school performance as well (Fournier and Hess; Wolfe). Therefore, we expect our generative model to contain the `taking_notes` and `listening` variables.

#### Project impact:

This analysis will help students to spot education habits that significantly impact school performance. Our hope is that students will become aware of those and use them while planning their future studies.

## References

Fournier, Caitlin, and Kayla Hess. “Cracking the Code on How to Study.” *The Phi Delta Kappan*, vol. 94, no. 4, 2012, pp. 72–73, www.jstor.org/stable/41763743. Accessed 3 Nov. 2022.

Wolfe, Alison M. *Student Attitudes toward Study Skills*. Elmira College, 2009, https://alisonwolfe.com/wordpress/wp-content/uploads/Student_Attitudes_Study_Skills1.pdf.

Yılmaz, Nevriye, and Boran Sekeroglu. “UCI Machine Learning Repository: Higher Education Students Performance Evaluation Dataset Data Set.” *Archive.ics.uci.edu*, 30 Jan. 2021, https://archive.ics.uci.edu/ml/datasets/Higher+Education+Students+Performance+Evaluation+Dataset. Accessed 3 Nov. 2022.

Yılmaz, Nevriye, and Boran Sekeroglu. “Student Performance Classification Using Artificial Intelligence Techniques.” *Advances in Intelligent Systems and Computing*, vol. 1095, 20 Nov. 2019, pp. 596–603, 10.1007/978-3-030-35249-3_76. Accessed 3 Nov. 2022.

**Word count:** 490