## Which study habits impact student performance most significantly?

## Introduction (100 words)

Some students struggle to get good grades because they have not learned how to study effectively (!!! citation). We therefore want to build a statistical model that explains which study habits have the most significant impact on students' performance.

For this, we will analyze student performance data collected in 2021 by surveying students from the Engineering and Educational Sciences faculties (!!! data set). The data set contains variables describing students' study habits and end-of-term grades. We plan to perform an inferential analysis and build a generative model describing the behavioral attributes associated with good performance.

## Exploratory Data Analysis (200 words)

In [1]:
# Loading the packages and setting the seed
library(tidyverse)
set.seed(1)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


We can download the data directly from the data set's website.

In [7]:
# Downloading the data
raw_data <- read_delim(
  file = "https://archive.ics.uci.edu/ml/machine-learning-databases/00623/DATA.csv",
  delim = ";"
)

raw_data %>%
  head()

Rows: 145 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (1): STUDENT ID
dbl (32): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


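| STUDENT ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ⋯ | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | COURSE ID | GRADE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| STUDENT1 | 2 | 2 | 3 | 3 | 1 | 2 | 2 | 1 | 1 | ⋯ | 1 | 1 | 3 | 2 | 1 | 2 | 1 | 1 | 1 | 1 |
| STUDENT2 | 2 | 2 | 3 | 3 | 1 | 2 | 2 | 1 | 1 | ⋯ | 1 | 1 | 3 | 2 | 3 | 2 | 2 | 3 | 1 | 1 |
| STUDENT3 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 4 | ⋯ | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 1 | 1 |
| STUDENT4 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 2 | 1 | ⋯ | 1 | 2 | 3 | 2 | 2 | 1 | 3 | 2 | 1 | 1 |
| STUDENT5 | 2 | 2 | 1 | 3 | 2 | 2 | 1 | 3 | 1 | ⋯ | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 2 | 1 | 1 |
| STUDENT6 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 1 | ⋯ | 1 | 1 | 1 | 2 | 1 | 2 | 4 | 4 | 1 | 2 |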


The data set contains three types of variables:
- personal information
- family information
- education habits

For our analysis, we need only the last one, so we select that subset of the data below.

In [3]:
# Selecting only the study-habit columns and the grade
student_data <- raw_data %>%
  select(`17`:`GRADE`) %>%
  select(-`21`, -`27`, -`28`, -`29`, -`30`, -`COURSE ID`)

student_data %>% 
  head()

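| 17 | 18 | 19 | 20 | 22 | 23 | 24 | 25 | 26 | GRADE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 3 | 2 | 2 | 1 | 1 | 1 | 1 | 3 | 2 | 1 |
| 2 | 2 | 2 | 1 | 1 | 1 | 1 | 3 | 2 | 1 |
| 2 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
| 3 | 1 | 2 | 1 | 1 | 1 | 2 | 3 | 2 | 1 |
| 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 |
| 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |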


Additionally, the column names in the original data set are numbers that correspond to various attributes. We rename them according to the data set's documentation to make the rest of the analysis easier to follow.

In [4]:
# Renaming the columns according to the data set's documentation
student_data <- student_data %>%
  rename(
    weekly_study_hours = `17`,
    reading_non_scientific = `18`,
    reading_scientific = `19`,
    attendance_seminars = `20`,
    attendance_classes = `22`,
    preparation_to_midterm_1 = `23`,
    preparation_to_midterm_2 = `24`,
    taking_notes = `25`,
    listening = `26`,
    grade = `GRADE`
  )

student_data %>%
  head()

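| weekly_study_hours | reading_non_scientific | reading_scientific | attendance_seminars | attendance_classes | preparation_to_midterm_1 | preparation_to_midterm_2 | taking_notes | listening | grade |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 3 | 2 | 2 | 1 | 1 | 1 | 1 | 3 | 2 | 1 |
| 2 | 2 | 2 | 1 | 1 | 1 | 1 | 3 | 2 | 1 |
| 2 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 1 |
| 3 | 1 | 2 | 1 | 1 | 1 | 2 | 3 | 2 | 1 |
| 2 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 |
| 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 |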


The values themselves are also numeric codes that must be translated using the documentation. We therefore convert all columns to factors, so that they are treated as categories rather than quantities when building the model.

In [5]:
# Converting all columns to factors
student_data <- student_data %>%
  mutate(across(where(is.numeric), as.factor))
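
To illustrate what this conversion changes, here is a minimal sketch in base R on a made-up vector of codes. The vector `codes` and the labels `"low"`, `"medium"`, `"high"` are hypothetical and are not taken from the data set or its documentation.

```r
# Toy vector of numeric category codes (made up, not from the survey data)
codes <- c(3, 2, 2, 1, 3)

# Stored as numbers, the codes would be treated as a quantity
is.numeric(codes)

# Stored as a factor, each distinct code becomes its own category (level)
codes_factor <- as.factor(codes)
levels(codes_factor)

# Hypothetical labels; real labels would have to come from the documentation
codes_labelled <- factor(
  codes,
  levels = c(1, 2, 3),
  labels = c("low", "medium", "high")
)
table(codes_labelled)
```

Attaching labels in this way is optional, but it may make model output easier to read than the bare codes.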

The wrangled data set is summarized below.

In [6]:
cat("Number of observations: ", nrow(student_data), "\n")
student_data %>%
  str()

Number of observations:  145 
tibble [145 × 10] (S3: tbl_df/tbl/data.frame)
 $ weekly_study_hours      : Factor w/ 5 levels "1","2","3","4",..: 3 2 2 3 2 1 2 1 1 2 ...
 $ reading_non_scientific  : Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 2 2 ...
 $ reading_scientific      : Factor w/ 3 levels "1","2","3": 2 2 2 2 1 2 2 2 2 2 ...
 $ attendance_seminars     : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 1 1 1 ...
 $ attendance_classes      : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 1 1 2 ...
 $ preparation_to_midterm_1: Factor w/ 3 levels "1","2","3": 1 1 1 1 2 1 1 3 1 1 ...
 $ preparation_to_midterm_2: Factor w/ 3 levels "1","2","3": 1 1 1 2 1 1 1 1 1 1 ...
 $ taking_notes            : Factor w/ 3 levels "1","2","3": 3 3 2 3 2 1 3 3 3 2 ...
 $ listening               : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 3 2 2 2 ...
 $ grade                   : Factor w/ 8 levels "0","1","2","3",..: 2 2 2 2 2 3 6 3 6 1 ...


!!! Fit a LR model and find its $R^2$
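
One possible shape for this step is sketched below on simulated data, not on the survey data. The data frame `toy`, its columns `study_hours`, `takes_notes`, and `score`, and the simulated effect sizes are all assumptions made for illustration. The sketch assumes a numeric response, which `lm()` requires.

```r
# Simulated toy data (hypothetical; not the survey data)
set.seed(1)
n <- 60
toy <- data.frame(
  study_hours = factor(sample(1:3, n, replace = TRUE)),
  takes_notes = factor(sample(1:2, n, replace = TRUE))
)

# Hypothetical numeric outcome built from the codes plus random noise
toy$score <- 50 +
  5 * as.integer(toy$study_hours) +
  3 * as.integer(toy$takes_notes) +
  rnorm(n, sd = 4)

# Fit a linear regression; factor predictors are expanded into dummy variables
fit <- lm(score ~ study_hours + takes_notes, data = toy)

# Extract R^2 and adjusted R^2 from the model summary
r_squared <- summary(fit)$r.squared
adj_r_squared <- summary(fit)$adj.r.squared
print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
```

Because `grade` is stored as a factor in the wrangled data, it would need to be converted back to a number, for example with `as.numeric(as.character(grade))`, before being used as the response in a model like this.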

## Methods: Plan (200 words)

## References