# Stat 201 Project

## Introduction

- background
- significance

- explain dataset (https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success)


Higher education institutions continually strive to improve student retention and academic success. To help reduce failure and dropout rates, a comprehensive dataset was assembled by combining information from several disjoint databases on students enrolled in different undergraduate degree programs. The institution that provided this dataset offers programs across a diverse range of disciplines, including agronomy, design, education, nursing, journalism, management, social service, and technologies. A university or college degree can open significantly different career paths, but not every student is well supported at these institutions. Understanding the factors that influence academic success can inform the design of academic programs that support students better.

Research Question

The central question addressed by this project is as follows: How can we predict students' likelihood of dropping out and their likelihood of academic success in various undergraduate degree programs, given the demographic, socioeconomic, and academic path information available at the time of enrollment?

### Research Question

**Is the dropout rate of students whose parents received higher-level education greater than that of students whose parents have not received higher-level education?**
- higher-level education refers to education beyond the high school level
- parents are considered to have received higher-level education as long as *at least one* parent has received higher-level education

$H_0: p_1 = p_2$  
$H_a: p_1 > p_2$  
$p_1$: proportion of dropouts among students with at least one parent who received higher-level education  
$p_2$: proportion of dropouts among students whose parents have **not** received higher-level education

Hypothesis testing will be conducted with $\alpha = 1$.
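
For reference, one common way to test these hypotheses is the pooled two-proportion $z$-test. The sketch below assumes large, independent samples. The symbols $\hat{p}_1$, $\hat{p}_2$ (sample dropout proportions), $n_1$, $n_2$ (group sizes), and $\hat{p}$ (pooled proportion) are introduced here for illustration and are not defined elsewhere in this proposal.

$$\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}, \qquad Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Under $H_0$, $Z$ is approximately standard normal, so $H_0$ would be rejected in favour of $H_a$ when $Z$ exceeds the $1 - \alpha$ quantile of the standard normal distribution.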


Variables of Interest:

Some of the variables of interest in this project include students' mean grades and gender diversity at the university.

Category-based outcomes: dropout, success, and average (with the probability of each outcome).

We are also interested in the level of confidence in our predictions and the degree of variability in student outcomes within each category.


Using these variables and statistical inference, we aim to provide insights that help universities and colleges enhance their support systems, ultimately improving student success and retention across a range of undergraduate degree programs.

## Preliminary Analysis

This section will include:
- Reading data from UCI database
- Wrangling, analyzing, and plotting relevant data
    - Classifying parents' education levels
    - Calculating relative dropout rates
- Computing point estimates

In [43]:
# Constants
DATASET_URL <- "https://raw.githubusercontent.com/MehrshadEsm/stat-201-project/main/data.csv"
DATA_FIELDS <- c("Mother's qualification", "Father's qualification", "Admission grade") 

In [44]:
# Load libraries and set seed
set.seed(1234)
library(tidyverse)
library(tidymodels)
library(infer)

### Reading and Cleaning Data

The `Target` column describes three possible outcomes for a student: "Graduate", "Dropout", or "Enrolled". This analysis focuses specifically on the proportion of students who have dropped out.

Moreover, `Mother's qualification` and `Father's qualification` describe the parents' education levels, where each number corresponds to an education experience (details [here](#Wrangling-and-Visualizing-Data)). Note that the numeric order is arbitrary and is not indicative of education level!

The student's admission grade is also a relevant variable that can shed further light on our investigation, so we keep it in the table.

In [45]:
# Load data from url
student_data_raw <- read_delim(DATASET_URL, delim = ";")

# select target columns
student_data <- 
    student_data_raw |>
    select(Target, all_of(DATA_FIELDS))

cat("[Table 1] Unwrangled student data with relevant variables")
head(student_data)

Rows: 4424 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr  (1): Target
dbl (36): Marital status, Application mode, Application order, Course, Dayti...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


[Table 1] Unwrangled student data with relevant variables

| Target | Mother's qualification | Father's qualification | Admission grade |
|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Dropout | 19 | 12 | 127.3 |
| Graduate | 1 | 3 | 142.5 |
| Dropout | 37 | 37 | 124.8 |
| Graduate | 38 | 37 | 119.6 |
| Graduate | 37 | 38 | 141.5 |
| Graduate | 37 | 37 | 114.8 |


### Wrangling and Visualizing Data 

We need to classify parents' education levels into two categories: those who have received higher education and those who have not. Using the UCI dataset description, we classify the numeric education codes as follows:
- **Received higher education:**  
**2** - Bachelor's Degree, 
**3** - Degree, 
**4** - Master's, 
**5** - Doctorate, 
**22** - Technical-professional course, 
**40** - degree (1st cycle), 
**41** - Specialized higher studies course, 
**42** - Professional higher technical course, 
**43** - Master (2nd cycle), 
**44** - Doctorate (3rd cycle),
**39** - Technological specialization course.

- **Have not received higher education:**  
**1** - Secondary Education, 
**9** - 12th Year of Schooling - Not Completed,
**10** - 11th Year of Schooling - Not Completed, 
**11** - 7th Year, 
**12** - 11th Year of Schooling,
**14** - 10th Year of Schooling, 
**19** - Basic Education 3rd Cycle (9th/10th/11th Year), 
**26** - 7th year of schooling, 
**27** - 2nd cycle of the general high school course, 
**29** - 9th Year of Schooling - Not Completed, 
**30** - 8th year of schooling, 
**35** - Can't read or write, 
**36** - Can read without having a 4th year of schooling, 
**37** - Basic education 1st cycle (4th/5th year), 
**38** - Basic Education 2nd Cycle (6th/7th/8th Year). 

- **Outliers (We ignore these)**:  
**34** - Unknown , 
**6** - Frequency of Higher Education 

Below is a table of the results:

In [53]:
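# Qualification codes grouped according to the UCI dataset description
higher_edu <- c(2, 3, 4, 5, 22, 40, 41, 42, 43, 44, 39)
not_higher_edu <- c(1, 9, 10, 11, 12, 14, 19, 26, 27, 29, 30, 35, 36, 37, 38)  # listed for reference
outliers <- c(34, 6)

# Drop students with an outlier code for either parent, then flag whether
# at least one parent (one_higher_edu) or both parents (both_higher_edu)
# received higher education
student_data_edu <- 
    student_data |>
    filter(!(`Mother's qualification` %in% outliers | `Father's qualification` %in% outliers)) |>
    mutate(one_higher_edu  = (`Mother's qualification` %in% higher_edu) | (`Father's qualification` %in% higher_edu),
           both_higher_edu = (`Mother's qualification` %in% higher_edu) & (`Father's qualification` %in% higher_edu))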

cat("[Table 2] Wrangled summary of student dropout rates with classification")
head(student_data_edu)

[Table 2] Wrangled summary of student dropout rates with classification

| Target | Mother's qualification | Father's qualification | Admission grade | one_higher_edu | both_higher_edu |
|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<lgl>` | `<lgl>` |
| Dropout | 19 | 12 | 127.3 | False | False |
| Graduate | 1 | 3 | 142.5 | True | False |
| Dropout | 37 | 37 | 124.8 | False | False |
| Graduate | 38 | 37 | 119.6 | False | False |
| Graduate | 37 | 38 | 141.5 | False | False |
| Graduate | 37 | 37 | 114.8 | False | False |
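
Because the wrangling cell flags a parent as higher-educated only when the code appears in `higher_edu`, any code that is missing from all three lists is silently treated as "not higher education". A minimal base-R sketch of a sanity check for this is given below. The vector `observed_codes` is a hypothetical stand-in (codes seen in Tables 1 and 2 plus a made-up code 99); in the notebook it could instead be built from the two qualification columns of `student_data`.

```r
# Code lists copied from the wrangling cell
higher_edu     <- c(2, 3, 4, 5, 22, 40, 41, 42, 43, 44, 39)
not_higher_edu <- c(1, 9, 10, 11, 12, 14, 19, 26, 27, 29, 30, 35, 36, 37, 38)
outliers       <- c(34, 6)

# The three lists should never overlap
all_codes <- c(higher_edu, not_higher_edu, outliers)
stopifnot(!any(duplicated(all_codes)))

# Hypothetical sample of codes (99 is made up to show the check firing).
# In the notebook this could be:
#   unique(c(student_data$`Mother's qualification`,
#            student_data$`Father's qualification`))
observed_codes <- c(19, 12, 1, 3, 37, 38, 99)

# Codes present in the data that none of the lists account for
unclassified <- setdiff(observed_codes, all_codes)
print(unclassified)
```

If `unclassified` is not empty on the real data, those codes would need to be assigned to one of the groups explicitly before the dropout proportions are interpreted.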


Next, we summarize the data into point estimates of the dropout rates.

In [57]:
# proportion table
student_dropout_props_one <-
    student_data_edu |>
    group_by(one_higher_edu) |>
    summarize(dropout_prop = sum(Target == "Dropout") / n(),
              size = n(), 
              grade = mean(`Admission grade`))

student_dropout_props_both <-
    student_data_edu |>
    group_by(both_higher_edu) |>
    summarize(dropout_prop = sum(Target == "Dropout") / n(),
              size = n(), 
              grade = mean(`Admission grade`))

cat("[Table 3] Summary table of student dropout proportions")
student_dropout_props_one

[Table 3] Summary table of student dropout proportions

| one_higher_edu | dropout_prop | size | grade |
|---|---|---|---|
| `<lgl>` | `<dbl>` | `<int>` | `<dbl>` |
| False | 0.3059873 | 3474 | 126.5741 |
| True | 0.3125778 | 803 | 129.6268 |


At a glance, students with at least one parent who received higher education appear slightly more likely to drop out (about 31.3% versus 30.6%), although the difference in dropout rates is quite small. This seems to match our initial assumption; however, we need to perform further tests to verify the accuracy and validity of our sample point estimates.
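
As a rough illustration of such a test, the base-R sketch below runs a one-sided two-proportion $z$-test on counts reconstructed from Table 3 (each count is `round(dropout_prop * size)`). The significance level `alpha = 0.05` is an assumption chosen only for illustration, and the variable names are hypothetical. The built-in `prop.test()` is used as a cross-check of the manual calculation.

```r
# Counts reconstructed from Table 3 as round(dropout_prop * size)
n1 <- 803;  x1 <- round(0.3125778 * n1)  # at least one parent with higher education
n2 <- 3474; x2 <- round(0.3059873 * n2)  # no parent with higher education

alpha <- 0.05  # assumed significance level, for illustration only

# Sample proportions and pooled proportion under H0: p1 = p2
p1_hat <- x1 / n1
p2_hat <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)

# Pooled standard error and one-sided z-test for Ha: p1 > p2
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z_stat  <- (p1_hat - p2_hat) / se_pool
p_value <- pnorm(z_stat, lower.tail = FALSE)

# Cross-check with the built-in test (no continuity correction)
check <- prop.test(x = c(x1, x2), n = c(n1, n2),
                   alternative = "greater", correct = FALSE)

cat("z =", round(z_stat, 3), " p-value =", round(p_value, 3), "\n")
reject_h0 <- p_value < alpha
```

Whether `reject_h0` is `TRUE` depends on the significance level chosen, so the value of `alpha` should be fixed before looking at the result.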

### Visualize Data

Planned visualizations: a bar chart comparing dropout proportions between the two parental-education groups, and possibly a chart showing the distribution of parents' education levels.
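
One possible starting point is a simple bar chart of the dropout proportions from Table 3. The sketch below assumes `ggplot2` is available (it is part of the `tidyverse` already loaded in this notebook) and re-enters the Table 3 values by hand so that it runs on its own; in the notebook, `student_dropout_props_one` could be passed to `ggplot()` directly. The object name `dropout_plot` and the axis labels are illustrative choices.

```r
library(ggplot2)

# Values copied from Table 3
dropout_props <- data.frame(
  one_higher_edu = c(FALSE, TRUE),
  dropout_prop   = c(0.3059873, 0.3125778)
)

# One bar per parental-education group, on a 0 to 1 proportion scale
dropout_plot <-
  ggplot(dropout_props,
         aes(x = one_higher_edu, y = dropout_prop, fill = one_higher_edu)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "At least one parent received higher education",
       y = "Proportion of students who dropped out",
       title = "Dropout proportion by parental education")

print(dropout_plot)
```

Keeping the y-axis on the full 0 to 1 scale makes it visually clear how small the gap between the two groups is.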



## Methods and Results
The previous sections will carry over to your final report (you’ll be allowed to improve them based on feedback you get). Begin this Methods section with a brief description of “the good things” about this report – specifically, in what ways is this report trustworthy?

Continue by explaining why the plot(s) and estimates that you produced are not enough to give to a stakeholder, and what you should provide in addition to address this gap. Make sure your plans include at least one hypothesis test and one confidence interval. If possible, compare both the bootstrapping and asymptotics methods.
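
To illustrate how a bootstrap interval and an asymptotic interval for $p_1 - p_2$ could be compared, here is a minimal base-R sketch. It rebuilds the two groups from the counts implied by Table 3 (251 of 803 and 1063 of 3474 dropouts), resamples within each group, and uses a 95% level with 2000 replicates. The level, the number of replicates, the within-group resampling scheme, and all variable names are assumptions made for illustration rather than settled choices for the final report.

```r
set.seed(1234)  # same seed as the notebook, for reproducibility

# Groups rebuilt from the counts implied by Table 3 (1 = dropout, 0 = otherwise)
higher    <- rep(c(1, 0), times = c(251, 803 - 251))
no_higher <- rep(c(1, 0), times = c(1063, 3474 - 1063))

# Observed difference in dropout proportions (p1_hat - p2_hat)
p1_hat   <- mean(higher)
p2_hat   <- mean(no_higher)
obs_diff <- p1_hat - p2_hat

# Bootstrap: resample each group with replacement, recompute the difference
reps <- 2000  # assumed number of bootstrap replicates
boot_diffs <- replicate(reps, {
  mean(sample(higher, replace = TRUE)) - mean(sample(no_higher, replace = TRUE))
})
boot_ci <- quantile(boot_diffs, c(0.025, 0.975))  # percentile interval

# Asymptotic (Wald) interval using the unpooled standard error
se_diff <- sqrt(p1_hat * (1 - p1_hat) / length(higher) +
                p2_hat * (1 - p2_hat) / length(no_higher))
wald_ci <- obs_diff + c(-1, 1) * qnorm(0.975) * se_diff

print(boot_ci)
print(wald_ci)
```

With samples of this size the two intervals would be expected to be close to each other, which is itself a useful check on the normal approximation.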

Finish this section by reflecting on how your final report might play out:

What do you expect to find?
What impact could such findings have?
What future questions could this lead to?

### Expectations/Impacts/Future Questions

## Discussion

## References

https://oa.mg/work/10.1007/978-3-030-72657-7_16


At least two citations of literature relevant to the project. The citation format is your choice – just be consistent. Make sure to cite the source of your data as well.
