Student Performance Data set  
https://archive.ics.uci.edu/ml/datasets/Student+Performance 

# What factors influence a secondary student's grades?

## Introduction:
By Sébastien Rhéaume, Ryan Shar, Justing Wong, Robert Yip

A student's grades are often treated as the main indicator of the student's performance in school. While innate intelligence is often assumed to be the chief predictor of grades, other factors likely affect them as well. In our study, we investigate whether certain environmental factors significantly influence a student's performance. We selected math grades as the indicator of performance because math is a subject that can be evaluated objectively. Our data set covers 395 secondary students from two Portuguese schools, with their mathematics grades and environmental factors collected through questionnaires.


## Preliminary Exploratory Data Analysis
1. Demonstrate that the dataset can be read from the web into R 


In [3]:
library(tidyverse)


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



### Read data set

In [21]:
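# URL of the zipped data set on the UCI repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip"

# download the archive to a temporary file and read the math table from it
temp <- tempfile()
download.file(url, temp)
data <- read_csv2(unz(temp, "student-mat.csv"))   # the file is semicolon-delimited, hence read_csv2
print.data.frame(head(data))     # print.data.frame shows every column; the default tibble print hides columns of a wide table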

Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.

Parsed with column specification:
cols(
  .default = col_character(),
  age = col_double(),
  Medu = col_double(),
  Fedu = col_double(),
  traveltime = col_double(),
  studytime = col_double(),
  failures = col_double(),
  famrel = col_double(),
  freetime = col_double(),
  goout = col_double(),
  Dalc = col_double(),
  Walc = col_double(),
  health = col_double(),
  absences = col_double(),
  G1 = col_double(),
  G2 = col_double(),
  G3 = col_double()
)

See spec(...) for full column specifications.



  school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
2     GP   F  17       U     GT3       T    1    1  at_home    other     course
3     GP   F  15       U     LE3       T    1    1  at_home    other      other
4     GP   F  15       U     GT3       T    4    2   health services       home
5     GP   F  16       U     GT3       T    3    3    other    other       home
6     GP   M  16       U     LE3       T    4    3 services    other reputation
  guardian traveltime studytime failures schoolsup famsup paid activities
1   mother          2         2        0       yes     no   no         no
2   father          1         2        0        no    yes   no         no
3   mother          1         2        3       yes     no  yes         no
4   mother          1         3        0        no    yes  yes        yes
5   father          1         2        0        no    yes  yes        

### Wrangling table columns into factors
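For simplicity, our study looks only at the following 15 variables: 12 candidate predictors and the three period grades.
1. sex - Student's sex (F or M)
2. age - Student's age in years
3. famsize - Family size (LE3 = 3 or fewer, GT3 = more than 3)
4. Medu - Mother's education level (0-4)
5. Fedu - Father's education level (0-4)
6. traveltime - Home to school travel time (1 = under 15 minutes, 4 = over 1 hour)
7. studytime - Weekly study time (1 = under 2 hours, 4 = over 10 hours)
8. failures - Number of past class failures
9. paid - Is the student attending extra paid lessons? (yes/no)
10. activities - Is the student participating in extracurriculars? (yes/no)
11. internet - Is there internet access at home? (yes/no)
12. romantic - Is the student in a romantic relationship? (yes/no)
13. G1 - First period grade (0-20)
14. G2 - Second period grade (0-20)
15. G3 - Final grade (0-20)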

In [27]:
# select the columns used in this study
data <- data %>%
    select(sex, age, famsize, Medu, Fedu, traveltime, studytime, failures, paid, activities, internet, romantic, G1, G2, G3)

# convert the categorical columns to factors; age, failures, and the grades stay numeric
col_factors <- c("sex", "famsize", "Medu", "Fedu", "traveltime", "studytime", "paid", "activities", "internet", "romantic")
data <- data %>%
    mutate(across(all_of(col_factors), factor))
head(data)

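| sex | age | famsize | Medu | Fedu | traveltime | studytime | failures | paid | activities | internet | romantic | G1 | G2 | G3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` |
| F | 18 | GT3 | 4 | 4 | 2 | 2 | 0 | no | no | no | no | 5 | 6 | 6 |
| F | 17 | GT3 | 1 | 1 | 1 | 2 | 0 | no | no | yes | no | 5 | 5 | 6 |
| F | 15 | LE3 | 1 | 1 | 1 | 2 | 3 | yes | no | yes | no | 7 | 8 | 10 |
| F | 15 | GT3 | 4 | 2 | 1 | 3 | 0 | yes | yes | yes | yes | 15 | 14 | 15 |
| F | 16 | GT3 | 3 | 3 | 1 | 2 | 0 | yes | no | no | no | 6 | 10 | 10 |
| M | 16 | LE3 | 4 | 3 | 1 | 2 | 0 | yes | yes | yes | no | 15 | 15 | 15 |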


### 3.
Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.   
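One way such a summary table could be built is sketched below. It is only an illustration under stated assumptions: `students` is a small made-up stand-in for the wrangled table, and the 75/25 split and the seed are arbitrary choices, not values taken from this project.

```r
library(dplyr)

# Hypothetical stand-in for the wrangled `data` table (values are made up)
students <- data.frame(
  sex       = factor(c("F", "F", "M", "M", "F", "M", "F", "M")),
  studytime = factor(c(2, 2, 1, 3, 2, 1, 4, 2)),
  failures  = c(0, 0, 3, 0, NA, 1, 0, 0),
  G3        = c(6, 10, 15, 15, 10, 8, 17, 11)
)

# Keep 75% of the rows for training; the ratio and the seed are assumptions
set.seed(2020)
train_idx <- sample(nrow(students), size = floor(0.75 * nrow(students)))
train <- students[train_idx, ]

# Flag rows that contain at least one missing value
train$has_na <- !complete.cases(train)

# One row per level of `sex`: observation count, predictor means, rows with NA
train_summary <- train %>%
  group_by(sex) %>%
  summarize(
    n             = n(),
    mean_G3       = mean(G3),
    mean_failures = mean(failures, na.rm = TRUE),
    rows_with_na  = sum(has_na),
    .groups       = "drop"
  )

print(train_summary)
```

Grouping by a different factor (for example `studytime`) gives the class counts for that variable instead.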

### 4.
Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
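A plot of this kind might look like the sketch below, which compares the distribution of the final grade across study-time levels. The `train` table here is a made-up stand-in for the real training split, and the choice of a boxplot is an assumption; a histogram or bar chart per predictor would serve equally well.

```r
library(ggplot2)

# Hypothetical stand-in for the training split (values are made up)
train <- data.frame(
  studytime = factor(c(2, 2, 1, 3, 2, 1, 4, 2)),
  G3        = c(6, 10, 15, 15, 10, 8, 17, 11)
)

# One box per study-time level, showing the spread of final grades
grade_plot <- ggplot(train, aes(x = studytime, y = G3)) +
  geom_boxplot() +
  labs(
    x     = "Weekly study time (level)",
    y     = "Final grade (G3)",
    title = "Final grade by weekly study time (training data)"
  )

print(grade_plot)
```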


## Methods

For simplicity, our study uses only the 15 variables listed under "Wrangling table columns into factors": 12 candidate predictors (sex through romantic) and the three period grades G1, G2, and G3.

Our goal is to predict a student's grades from the other environmental variables. For simplicity, we sum the three grades (G1, G2, and G3) into a single total. Because this total is numeric, we need a regression analysis whose predictors are a mix of numeric and factor variables.
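The approach described above can be sketched as follows. This is a minimal illustration, assuming ordinary least squares via base R's `lm()` as the regression method (the proposal does not fix a specific method) and using a made-up `students` table with only four of the predictors.

```r
# Hypothetical stand-in for the wrangled table (random, made-up values)
set.seed(1)
n <- 40
students <- data.frame(
  sex       = factor(rep(c("F", "M"), each = n / 2)),
  age       = sample(15:19, n, replace = TRUE),
  studytime = factor(rep(1:4, length.out = n)),
  failures  = sample(0:3, n, replace = TRUE),
  G1        = sample(5:18, n, replace = TRUE),
  G2        = sample(5:18, n, replace = TRUE),
  G3        = sample(5:18, n, replace = TRUE)
)

# Response: the sum of the three period grades
students$total <- students$G1 + students$G2 + students$G3

# lm() expands each factor into dummy variables and treats
# age and failures as numeric predictors
fit <- lm(total ~ sex + age + studytime + failures, data = students)

# Coefficient table with estimates, standard errors, and p-values
coef_table <- summary(fit)$coefficients
print(coef_table)
```

The p-value column of `coef_table` is one place to look when judging which factors appear significant, keeping in mind that the values printed here come from random data and carry no meaning.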

### Expected Outcomes and Significance

#### What do you expect to find?
We expect some of the factors to have a significant influence on a student's grades and others to have none. As students ourselves, we hypothesize that the significant factors will be strong predictors of a student's grades.

#### What impact could such findings have?
If a student's exposure to the significant factors is improved, the student's grades may increase. Such findings could help parents plan for their children's success.

#### What future questions could this lead to?
Do the same factors matter at different schools and in different subjects, and perhaps for university students as well?
