# COGS 108 - Data Checkpoint

## Authors

- Michelle Ma: Dataset #2
- Yves Mojica: Dataset #1
- Edgar Seecof: Dataset #1
- Travon Williams: Data overview
- Felix Xie: Dataset #2

## Research Question

To what extent does the early academic performance of first-year STEM majors predict student dropout at research-focused institutions? Specifically, using students' first-year GPA, course completion rates, and credit accumulation as predictors, can we model the probability that a student drops out within one year?
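
One way to make this question concrete, assuming a logistic regression (a modeling choice that the question itself does not fix), is to write the dropout probability as a function of the three predictors:

$$
P(\text{dropout within one year} \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta_1\,\text{GPA} + \beta_2\,\text{completion rate} + \beta_3\,\text{credits}\right)\right)}
$$

Here the coefficients $\beta_0, \dots, \beta_3$ would be estimated from the data, and a negative $\beta_1$, for example, would mean that a higher first-year GPA is associated with a lower dropout probability.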



## Background and Prior Work

Predicting student dropout in higher education has become a prominent topic in educational research because identifying at-risk students early lets universities and colleges support them proactively and try to prevent them from leaving. Student attrition is costly to higher education institutions: it represents lost tuition revenue, lower completion metrics that reflect poorly on the institution, and an inefficient allocation of resources. It is arguably even worse for students, who may incur debt, enter their careers later, and suffer negative psychological consequences from leaving college early. At the societal level, dropout undermines workforce development and deepens educational and economic inequality.

These costs have motivated a growing body of work that models student dropout risk from early academic data, providing a basis for data-driven interventions that mitigate the consequences of dropout. Previous research suggests that much of dropout occurs in the first years of college and is related to students' early academic performance, which can affect their beliefs about their academic fit and future success. The proposed mechanism is that early academic successes and failures act as feedback that updates students' beliefs about their own abilities. Furthermore, when students fail to meet expectations early on, they may question whether higher education is a worthwhile financial investment, which raises the chance of dropout.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Building on this theoretical work, researchers in educational data mining have framed dropout prediction as a supervised machine learning problem. Studies applying traditional ML classification models such as logistic regression, decision trees, and boosting methods have found that academic performance is a relatively consistent predictor of dropout (treated as a binary classification task), especially when the scope is restricted to first-year dropout.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) This work highlights the feasibility of using a traditional ML approach to predict dropout, and motivates us to focus on early academic indicators.

A similar study, whose dataset is hosted on the UCI Machine Learning Repository, extends this to a multi-class classification task with three labels: graduated, dropped out, or still enrolled after the expected time to degree. It reports similar results, arguing that early academic performance is a consistent predictor of student outcomes, while noting that other factors, such as financial situation and socioeconomic status, also play a role.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) This study provides both a validated, cleaned dataset and a methodological reference point for modeling dropout.

1. <a name="cite_note-1"></a> Stinebrickner, T., & Stinebrickner, R. (2014). A major in science? Initial beliefs and final outcomes for college major and dropout. NBER Working Paper No. 18945. https://www.nber.org/papers/w18945
2. <a name="cite_note-2"></a> Lakkaraju, H., Aguiar, E., Shan, C., Miller, D., Bhanpuri, N., Ghani, R., & Addison, K. (2015). A machine learning framework to identify students at risk of adverse academic outcomes. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://dl.acm.org/doi/10.1145/2783258.2788620
3. <a name="cite_note-3"></a> Martins, M. V., Tolledo, D., Oliveira, J., & Gonçalves, R. (2021). Early prediction of student’s performance in higher education: A case study. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Predict+Students+Dropout+and+Academic+Success

## Hypothesis


After looking at the data available to us, we predict a slight correlation between a student's early academic performance and their likelihood of dropping out of university. Repeated class failure doesn't necessarily mean that a student will inevitably drop out; other factors are likely more strongly correlated with early dropout than failing classes in the first year. Students can always catch up, but those who struggle early on have a harder start to their college career, which makes dropping out more likely.

## Data
1. Our ideal dataset would include first-year GPA, course completion rate, credit accumulation, and dropout status within one year, documented as a binary variable (0 = student dropped out, 1 = student remained enrolled) to allow for efficient data processing. Depending on the direction of the project, we may also want to incorporate additional variables such as demographics (age, gender, race/ethnicity), socioeconomic factors (family income bracket, first-generation status), high school GPA, field of study, institution type, and campus size. These additional variables would provide more context for each observation and allow us to draw more nuanced conclusions.
   
    In terms of sample size, we would aim to include several thousand students, ideally over 5,000, to ensure sufficient statistical power and representation of dropout cases. The data would come from undergraduate students entering college for the first time and would be collected using academic records and enrollment statuses during students’ first year of college. These data could be obtained through institutional records, such as registrar data for GPA, credits earned, and course completion, and enrollment databases for registration and withdrawal statuses.

    The data should be stored in a clean, tidy, and structured dataset where each row represents a single student and each column represents one variable. Time-based academic variables, such as term GPAs, should be stored in separate columns (e.g., “Fall GPA” and “Spring GPA”). To protect student privacy, each record should also include a unique anonymized student ID rather than personally identifiable information.


2. One potential dataset for this project is the Predict Students’ Dropout and Academic Success dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/697/predict+students%27+dropout+and+academic+success). This dataset is publicly available and can be downloaded directly without requesting permission, as it is released under a Creative Commons license. It contains data on approximately 4,400 undergraduate students, including demographic, socioeconomic, and academic performance information collected at enrollment and during the first year. Important variables for this project include semester grades, number of curricular units approved, number of units enrolled, and a categorical outcome variable indicating whether a student dropped out, remained enrolled, or graduated. This outcome variable can be converted into a binary indicator of dropout within one year.

    Another useful dataset is the University Student Dropout Longitudinal Dataset hosted on Zenodo and described in an academic data paper (https://zenodo.org/records/17239943). The data are publicly accessible and can be downloaded as CSV files without special permission, though proper citation is required. This dataset tracks students across multiple academic terms and includes detailed information on course enrollments, grades, and credits earned. Key variables relevant to this project include course completion records, cumulative credits earned by term, and enrollment status across semesters, which can be used to derive first-year GPA, completion rates, and dropout within one year. The dataset also includes variables such as parental education level, placement exam results, age, number of assignments submitted, and number of exams taken, which may provide additional context in the analysis.
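
As a rough sketch of how first-year GPA and completion rate could be derived from student-course records like those in the second dataset, the snippet below aggregates a toy table by student. The column names (`student_id`, `grade`, `credits`, `passed`) and the 0 to 10 grade scale are hypothetical placeholders, not the dataset's actual fields.

```python
import pandas as pd

# Toy student-course records; all column names and values are hypothetical.
records = pd.DataFrame({
    "student_id": ["a", "a", "a", "b", "b"],
    "grade":      [8.0, 6.0, 3.0, 9.0, 7.0],   # assumed 0-10 scale
    "credits":    [6, 6, 6, 6, 3],
    "passed":     [True, True, False, True, True],
})

# Credit-weighted grade and credits actually earned (0 for failed courses)
records["weighted_grade"] = records["grade"] * records["credits"]
records["credits_earned"] = records["credits"].where(records["passed"], 0)

per_student = records.groupby("student_id").agg(
    credits_enrolled=("credits", "sum"),
    credits_earned=("credits_earned", "sum"),
    weighted_grade=("weighted_grade", "sum"),
)
per_student["gpa"] = per_student["weighted_grade"] / per_student["credits_enrolled"]
per_student["completion_rate"] = per_student["credits_earned"] / per_student["credits_enrolled"]
print(per_student[["gpa", "completion_rate"]])
```

Weighting grades by credits is one common convention; an unweighted mean would be an equally defensible starting point.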

## Data overview

### Dataset #1

- Dataset Name: Predict Students’ Dropout and Academic Success Dataset

- Link to dataset: https://archive.ics.uci.edu/dataset/697/predict+students%27+dropout+and+academic+success

- Number of observations: 4,424
- Number of variables: 37

#### Relevant variables:

- Academic performance and grades

- Student background information

- Enrollment characteristics

- Dropout or academic success labels

#### Shortcomings:

- Limited demographic diversity

- May not generalize beyond the sampled institutions

- Some variables may contain missing or inconsistent values


#### Description:

- This dataset contains student-level academic and demographic information used to predict dropout and academic success. Each row represents an individual student record with performance and background features.



### Dataset #2

- Dataset Name: University Dropout Dataset (2022)

- Link to dataset: https://zenodo.org/records/17239943

- Number of observations: 159,173
- Number of variables: 169

#### Relevant variables:

- Academic performance and grades

- Learning management system engagement

- Campus activity indicators

- Student background information

#### Shortcomings:

- High missingness in engagement variables

- Single-institution focus

- Differences in grading systems

  

#### Description:

- This dataset contains anonymized student academic and engagement records used to analyze dropout behavior and academic success.



#### Combining the datasets


We will use our two datasets to compare patterns between student dropout and class failure in the first year. The datasets come from different sources and have different structures, but they share common themes, most importantly students' academic performance data. We will standardize the relevant variables and look for trends that hold across both datasets to find the most consistent predictors of dropout.
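
A minimal sketch of what standardizing across datasets could look like, assuming a z-score within each dataset so that grades on different scales become comparable. The toy frames, values, and column names (`grade`, `dropout`) are made up for illustration.

```python
import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    """Standardize a series to mean 0 and (sample) standard deviation 1."""
    return (s - s.mean()) / s.std()

# Toy stand-ins for the two datasets; values and column names are hypothetical.
d1 = pd.DataFrame({"grade": [10.0, 12.0, 14.0, 16.0], "dropout": [1, 1, 0, 0]})  # 0-20 scale
d2 = pd.DataFrame({"grade": [4.0, 5.0, 7.0, 9.0], "dropout": [1, 0, 0, 0]})      # 0-10 scale

frames = []
for name, df in [("dataset_1", d1), ("dataset_2", d2)]:
    out = df.copy()
    out["grade_z"] = zscore(out["grade"])  # standardized within each dataset
    out["source"] = name
    frames.append(out)

combined = pd.concat(frames, ignore_index=True)

# Mean standardized grade by dropout status, separately for each source
summary = combined.groupby(["source", "dropout"])["grade_z"].mean()
print(summary)
```

Standardizing within each source keeps the comparison about relative standing, since the raw grading scales are not directly comparable.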

In [1]:
# Imports and Setup
import pandas as pd
import numpy as np
import os

RAW_DATA_DIR = 'data/00-raw/'
INT_DATA_DIR = 'data/01-interim/'
PROCESSED_DATA_DIR = 'data/02-processed/'

In [2]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [3]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') 

import get_data 

datafiles = [
    {
        'url': 'https://archive.ics.uci.edu/static/public/697/predict+students+dropout+and+academic+success.zip',
        'filename': 'predict_students_dropout.zip'
    },
    { 
        'url': 'https://zenodo.org/records/17239943/files/dataset_2022_hash.zip?download=1', 
        'filename':'university_dropout_2022.zip'
    } # Decompressed later using pd.read_csv
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')




Successfully downloaded: predict_students_dropout.zip




Successfully downloaded: university_dropout_2022.zip





### Students’ Dropout and Academic Success dataset

The Students' Dropout and Academic Success dataset consists of one large CSV file containing 4,424 student records and 37 columns (36 feature variables plus the target). This dataset was created to identify students at risk of dropping out early in their academic career. The variables range from general attributes such as age or gender to more specific ones such as admission grades or daytime/evening attendance. For our project, we'll mostly look at the variables related to academic performance, but a few others might provide interesting information that we'll also keep an eye on. The link to the dataset can be found [here](https://archive.ics.uci.edu/dataset/697/predict+students%27+dropout+and+academic+success).

The columns we will definitely use are the curricular-unit columns for the first and second semesters (grade average, units enrolled, credited, evaluations, without evaluations, and approved) and the target column. Early academic performance could also be influenced by other factors in the dataset, such as previous qualifications or admission grades.

The grade average is a student's average grade in a given semester and serves as our measure of academic performance. It is on the Portuguese 0 to 20 scale, so the equivalent of a 4.0 GPA would be a 20.0. As rough working thresholds, we consider anything below 14.0 a poor GPA (about a 2.0 in the U.S.), 14.0 to 16.0 a proficient GPA (below a 3.0), and anything above 16.0 a good GPA.

The number of units enrolled counts how many curricular units a student is taking in the semester. It can be compared with the number of approved units, which counts the units a student earned by passing their courses. The number of evaluations is the number of exams (evaluations) a student took in a semester, and the number of credited units refers to transfer units from previous coursework. The final column, "Target", records each student's status at the end of the study period as a categorical value: "Enrolled", "Graduate", or "Dropout". Together, these six main columns describe a student's early academic performance through grades and curricular units, where each unit represents about 25 to 30 hours of work.
- Enrolled: Number of curricular units being taken in a semester
- Credited: Number of transfer curricular units from prior courses
- Evaluated: Number of exams taken in curricular units in a semester
- Not Evaluated: Number of curricular units without any exams/evaluations
- Approved: Number of curricular units passed
- Target: Status of student, enrolled, graduate, or dropout
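
The columns listed above could be turned into model-ready features along these lines. This is only a sketch on toy rows: the column names match the dataset, but the values are illustrative, and the `completion_rate_1st` and `remained` columns are names introduced here. The binary encoding follows the one described in the Data section (0 = dropped out, 1 = remained enrolled), with graduates counted as not dropped out.

```python
import numpy as np
import pandas as pd

# Toy rows using the dataset's column names; the values are illustrative only.
toy = pd.DataFrame({
    "Curricular units 1st sem (enrolled)": [6, 6, 0],
    "Curricular units 1st sem (approved)": [6, 3, 0],
    "Target": ["Graduate", "Dropout", "Enrolled"],
})

enrolled = toy["Curricular units 1st sem (enrolled)"]
approved = toy["Curricular units 1st sem (approved)"]

# Completion rate = approved / enrolled; left as NaN when no units were enrolled
toy["completion_rate_1st"] = approved / enrolled.replace(0, np.nan)

# Binary outcome: 0 = dropped out, 1 = graduated or still enrolled
toy["remained"] = (toy["Target"] != "Dropout").astype(int)
print(toy[["completion_rate_1st", "remained"]])
```

Leaving the rate undefined for students with zero enrolled units avoids treating "no activity" as a 0% completion rate.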

Two additional columns of interest are previous qualification and admission grade. Previous qualification is an integer code for the student's education level before entering university (the dataset also includes a separate previous qualification grade on a 0 to 200 scale). Admission grade also ranges from 0 to 200 and may influence early academic performance; we consider an admission grade below 100 poor, one from 100 up to 150 proficient, and anything higher good.

These are all useful metrics, and the dataset is mostly clean and ready to analyze, but we have a few concerns. First, all the data come from Portuguese students, so GPA, course units, and the overall academic system differ from what we are used to. We may need to make some values more readable, for example by converting Portuguese grades to the U.S. 4.0 scale, and we will have to account for differences between Portuguese and U.S. higher education. A smaller concern is that, although the dataset has no missing values, many rows contain 0 in the curricular-unit columns (for example, 0 units enrolled and a 0.0 grade average), which may reflect students with no recorded activity rather than true scores. So while the dataset is tidy, some rows are effectively incomplete and may not be usable. Despite these concerns, the dataset has a lot of information we can use, and the issues can be worked around.
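
The admission-grade bands described above can be expressed with `pd.cut`. This is a minimal sketch on made-up grades; the band labels and cut points are the rough thresholds from the text, not an official classification.

```python
import numpy as np
import pandas as pd

# Made-up admission grades on the 0-200 scale
admission = pd.Series([95.0, 100.0, 127.3, 150.0, 190.0], name="Admission grade")

# Below 100 is "poor", 100 up to (not including) 150 is "proficient", 150+ is "good"
bands = pd.cut(
    admission,
    bins=[0, 100, 150, np.inf],
    labels=["poor", "proficient", "good"],
    right=False,  # intervals are [lower, upper)
)
print(bands.tolist())
```

With `right=False`, a grade of exactly 100 falls in "proficient" and exactly 150 in "good", which matches "below 100" and "below 150" in the description.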

3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


In [4]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## 3.A LOAD DATASET
data1 = pd.read_csv(
    f'{RAW_DATA_DIR}predict_students_dropout.zip',
    sep=';',
    compression='zip' # Webpage download default as .zip
)

In [5]:
## 3.B MAKE TIDY OR SHOW TIDY
data1.head(10)

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate
5,2,39,1,9991,0,19,133.1,1,37,37,...,0,5,17,5,11.5,5,16.2,0.3,-0.92,Graduate
6,1,1,1,9500,1,1,142.0,1,19,38,...,0,8,8,8,14.345,0,15.5,2.8,-4.06,Graduate
7,1,18,4,9254,1,1,119.0,1,37,37,...,0,5,5,0,0.0,0,15.5,2.8,-4.06,Dropout
8,1,1,3,9238,1,1,137.0,62,1,1,...,0,6,7,6,14.142857,0,16.2,0.3,-0.92,Graduate
9,1,1,1,9238,1,1,138.0,1,1,19,...,0,6,14,2,13.5,0,8.9,1.4,3.51,Dropout


In [6]:
## 3.C DEMONSTRATE SIZE OF DATASET
print("Dataset shape (rows, columns):", data1.shape)
data1.dtypes

Dataset shape (rows, columns): (4424, 37)


Marital status                                      int64
Application mode                                    int64
Application order                                   int64
Course                                              int64
Daytime/evening attendance\t                        int64
Previous qualification                              int64
Previous qualification (grade)                    float64
Nacionality                                         int64
Mother's qualification                              int64
Father's qualification                              int64
Mother's occupation                                 int64
Father's occupation                                 int64
Admission grade                                   float64
Displaced                                           int64
Educational special needs                           int64
Debtor                                              int64
Tuition fees up to date                             int64
Gender        

**This data set was already used for more formal projects and has already been cleaned of missing values.**

In [7]:
## 3.D FIND OUT HOW MUCH DATA IS MISSING AND WHERE
col_missing_count = data1.isnull().sum() # missing values per column
total_missing_count = col_missing_count.sum()
print(f"There are {total_missing_count} missing values.")

There are 0 missing values.


In [8]:
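## 3.E FIND AND FLAG ANY OUTLIERS OR SUSPICIOUS ENTRIES
numeric_cols = data1.select_dtypes(include = ['number']).columns
outliers = pd.DataFrame(index = data1.index)
## iterate across all numeric columns, flag values outside 1.5 * IQR, and store the flags in the outliers df
## note: many integer columns are category codes, so their flags are not meaningful outliers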
for col in numeric_cols:
    ##https://stackoverflow.com/questions/23228244/how-do-you-find-the-iqr-in-numpy
    q75, q25 = np.nanpercentile(data1[col], [75, 25])
    iqr = q75 - q25
    lower_bound = q25 - 1.5 * iqr
    upper_bound = q75 + 1.5 * iqr
    outliers[f'{col}_outlier'] = (data1[col] < lower_bound) | (data1[col] > upper_bound)

print('Number of outliers in each column')
print(outliers.sum())

Number of outliers in each column
Marital status_outlier                                     505
Application mode_outlier                                     0
Application order_outlier                                  541
Course_outlier                                             442
Daytime/evening attendance\t_outlier                       483
Previous qualification_outlier                             707
Previous qualification (grade)_outlier                     179
Nacionality_outlier                                        110
Mother's qualification_outlier                               0
Father's qualification_outlier                               0
Mother's occupation_outlier                                182
Father's occupation_outlier                                177
Admission grade_outlier                                     86
Displaced_outlier                                            0
Educational special needs_outlier                           51
Debtor_outlier       

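**This data is already very clean: there are no missing values and no implausible values. Many of the large outlier counts above come from integer-coded categorical columns (e.g., marital status, course), where the IQR rule is not meaningful. The only thing we may have to change is the data types used for certain variables.**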
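
Since the IQR rule only makes sense for quantities, one option is to apply it to a hand-picked list of continuous columns. The sketch below shows the idea on toy data; which columns count as continuous is an assumption that would need to be checked against the dataset's documentation, and `iqr_outlier_mask` is a helper introduced here.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy data: one continuous column and one category code
toy = pd.DataFrame({
    "Admission grade": [120.0, 125.0, 130.0, 135.0, 190.0],
    "Marital status": [1, 1, 1, 1, 2],  # category code, not a quantity
})

# Assumption: the list of continuous columns is chosen manually
continuous_cols = ["Admission grade"]
flags = toy[continuous_cols].apply(iqr_outlier_mask)
print(flags.sum())
```

Flagged values are candidates for inspection rather than automatic removal; a high admission grade can be legitimate.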

In [9]:
## 3.F CLEAN THE DATA
if data1.isna().sum().sum() == 0:
    print('No missing values')
else: 
    print(f'There are {data1.isna().sum().sum()} missing values')

No missing values


In [10]:
## 3.G MOVE TO PROCESSED
data1_clean = data1.copy()
os.makedirs(PROCESSED_DATA_DIR, exist_ok=True) # make sure the output directory exists
processed_path = os.path.join(PROCESSED_DATA_DIR, 'predict_students_dropout_clean.csv')
data1_clean.to_csv(processed_path, index=False, sep = ';')

In [11]:
## 4. SUMMARY STATISTICS
data1.describe()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
count,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,...,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0
mean,1.178571,18.669078,1.727848,8856.642631,0.890823,4.577758,132.613314,1.873192,19.561935,22.275316,...,0.137658,0.541817,6.232143,8.063291,4.435805,10.230206,0.150316,11.566139,1.228029,0.001969
std,0.605747,17.484682,1.313793,2063.566416,0.311897,10.216592,13.188332,6.914514,15.603186,15.343108,...,0.69088,1.918546,2.195951,3.947951,3.014764,5.210808,0.753774,2.66385,1.382711,2.269935
min,1.0,1.0,0.0,33.0,0.0,1.0,95.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.6,-0.8,-4.06
25%,1.0,1.0,1.0,9085.0,1.0,1.0,125.0,1.0,2.0,3.0,...,0.0,0.0,5.0,6.0,2.0,10.75,0.0,9.4,0.3,-1.7
50%,1.0,17.0,1.0,9238.0,1.0,1.0,133.1,1.0,19.0,19.0,...,0.0,0.0,6.0,8.0,5.0,12.2,0.0,11.1,1.4,0.32
75%,1.0,39.0,2.0,9556.0,1.0,1.0,140.0,1.0,37.0,37.0,...,0.0,0.0,7.0,10.0,6.0,13.333333,0.0,13.9,2.6,1.79
max,6.0,57.0,9.0,9991.0,1.0,43.0,190.0,109.0,44.0,44.0,...,12.0,19.0,23.0,33.0,20.0,18.571429,12.0,16.2,3.7,3.51


In [12]:
## 5. TRANSFORMING PORTUGUESE GRADING SCALES TO 4.0
## Qualification and admission grades are on a 0-200 scale, while semester grade averages
## are on a 0-20 scale, so each column is divided by its own maximum.
## This linear rescaling is only a rough approximation of a U.S. GPA.
max_usa = 4.0

grade_scale_max = {
    'Previous qualification (grade)': 200.0,
    'Admission grade': 200.0,
    'Curricular units 1st sem (grade)': 20.0,
    'Curricular units 2nd sem (grade)': 20.0
}
grade_cols = list(grade_scale_max)

for col, max_port in grade_scale_max.items():
    data1[col] = (data1[col] / max_port) * max_usa

print(data1[grade_cols])

      Previous qualification (grade)  Admission grade  \
0                               2.44            2.546   
1                               3.20            2.850   
2                               2.44            2.496   
3                               2.44            2.392   
4                               2.00            2.830   
...                              ...              ...   
4419                            2.50            2.444   
4420                            2.40            2.380   
4421                            3.08            2.990   
4422                            3.60            3.076   
4423                            3.04            3.040   

      Curricular units 1st sem (grade)  Curricular units 2nd sem (grade)  
0                             0.000000                          0.000000  
1                             2.800000                          2.733333  
2                             0.000000                          0.000000  
3              

### University Student Dropout Dataset

The University Student Dropout dataset is organized as yearly CSV files named dataset_{year}.csv, with each row corresponding to a student-course enrollment for that academic year. Each file integrates data from four sources: students, programs, courses, and digital logs, and also groups variables into six thematic categories: context, admission pathways, socio-economic and demographic background, academic data, digital logs, and Wi-Fi access. Contextual attributes include anonymized identifiers for students, courses, academic programs, and campuses, as well as the academic year and group IDs, capturing where and how each student is enrolled. Admission pathway variables describe how the student entered the university, including year of enrollment, type of admission, entry exam grades (scaled to 10 or 14), and program selection preference. Socioeconomic and demographic variables capture parental education, student dedication to studies, and whether the student had to move provinces to attend university, providing insight into economic or social challenges that might affect retention.

Academic data is the most detailed category, including grades, credits enrolled and earned across multiple years, semester performance, adjustments for credit recognition, internships, activities, and overall progress toward degree completion. Metrics like cumulative GPA, credits passed per semester, and credit completion rates across previous years allow for longitudinal assessment of academic success and dropout risk. Digital logs track Learning Management System (LMS) site engagement monthly, including number of visits, events, assignment and test submissions, total minutes spent online, and usage of course resources. For 2021 and 2022, Wi-Fi access records provide an additional proxy for on-campus presence, recording the number of days each student accessed the university network per month. All variables are anonymized using hash codes, and numerical metrics such as grades are scaled (e.g., 0–10 or 0–14 for entry exams), while credit counts are in academic credit units. LMS and Wi-Fi activity metrics are counts of actions, logins, or days.
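
The monthly engagement columns follow a `<metric>_<year>_<month>` naming pattern (for example `n_wifi_days_2023_7`). As a sketch of how such wide columns could be reshaped into one row per student and month, the snippet below works on a toy frame with made-up values; only the column naming pattern is taken from the data.

```python
import pandas as pd

# Toy frame mimicking the monthly column pattern; the values are made up.
wide = pd.DataFrame({
    "dni_hash": ["s1", "s2"],
    "n_wifi_days_2023_6": [10, 0],
    "n_wifi_days_2023_7": [4, 2],
    "pft_visits_2023_6": [30, 5],
    "pft_visits_2023_7": [12, 8],
})

long_df = wide.melt(id_vars="dni_hash", var_name="column", value_name="value")

# Split "<metric>_<year>_<month>" into its three parts
parts = long_df["column"].str.extract(r"^(?P<metric>.+)_(?P<year>\d{4})_(?P<month>\d{1,2})$")
long_df = pd.concat([long_df.drop(columns="column"), parts], axis=1)
long_df[["year", "month"]] = long_df[["year", "month"]].astype(int)

# One row per student and month, one column per metric
monthly = long_df.pivot_table(
    index=["dni_hash", "year", "month"], columns="metric", values="value"
).reset_index()
print(monthly)
```

A long layout like this makes it easier to aggregate engagement over a chosen window, such as the first semester.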

While these metrics provide valuable insights, several concerns about the dataset should be noted. It is drawn from a single Spanish technological university, which limits generalizability to other fields or institutions, particularly in the humanities or social sciences. Early dropouts may be underrepresented, and some variables, such as parental education, employment, and student dedication, may be self-reported and incomplete. Engagement measures may also reflect infrastructure availability or device usage rather than actual participation. Finally, identifiers are anonymized, which may reduce the precision of longitudinal tracking, and the data does not include periods affected by the COVID-19 pandemic, meaning it may not capture disruptions caused by virtual or hybrid learning environments. Despite these limitations, the dataset provides a detailed framework for studying factors influencing student retention and academic success.


In [13]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE

# A - Load the Dataset (Just the 2022 Subset for now - it is quite large)
data2 = pd.read_csv(
    f'{RAW_DATA_DIR}university_dropout_2022.zip',
    sep=';',
    compression='zip' # Webpage download default as .zip
)



Based on the dataset's description on Zenodo, the dataset has been thoroughly tidied, which is demonstrated below.

In [14]:
# B - Tidiness

# Show that each row is a single observation by checking for duplicates across the student (dni_hash), course (asi_hash), and entry-year (anyo_ingreso) identifiers
duplicates = data2.duplicated(subset=['dni_hash', 'asi_hash', 'anyo_ingreso'])
print("Number of duplicate rows:", duplicates.sum())


# Show that columns are aptly named
print('='*50)
print(data2.columns)
print('='*50)
print(data2.dtypes) # note that for now, dtypes are often objects because pandas interprets the comma usage in certain numbers as a string most likely

# Show a preview of what the data looks like, demonstrating that columns are properly named, there are no overlapping values, and columns are generally meaningful
print('='*50)
data2.head(10)

Number of duplicate rows: 0
Index(['dni_hash', 'tit_hash', 'asi_hash', 'anyo_ingreso', 'tipo_ingreso',
       'nota10_hash', 'nota14_hash', 'campus_hash', 'estudios_p_hash',
       'estudios_m_hash',
       ...
       'n_resource_days_2023_6', 'pft_events_2023_7', 'pft_days_logged_2023_7',
       'pft_visits_2023_7', 'pft_assignment_submissions_2023_7',
       'pft_test_submissions_2023_7', 'pft_total_minutes_2023_7',
       'n_wifi_days_2023_7', 'resource_events_2023_7',
       'n_resource_days_2023_7'],
      dtype='object', length=169)
dni_hash                       object
tit_hash                       object
asi_hash                       object
anyo_ingreso                   object
tipo_ingreso                   object
                                ...  
pft_test_submissions_2023_7    object
pft_total_minutes_2023_7       object
n_wifi_days_2023_7             object
resource_events_2023_7         object
n_resource_days_2023_7         object
Length: 169, dtype: object


| | dni_hash | tit_hash | asi_hash | anyo_ingreso | tipo_ingreso | nota10_hash | nota14_hash | campus_hash | estudios_p_hash | estudios_m_hash | ... | n_resource_days_2023_6 | pft_events_2023_7 | pft_days_logged_2023_7 | pft_visits_2023_7 | pft_assignment_submissions_2023_7 | pft_test_submissions_2023_7 | pft_total_minutes_2023_7 | n_wifi_days_2023_7 | resource_events_2023_7 | n_resource_days_2023_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 319636fc9270 | 620c9c332101 | 4596fcf257c4 | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 1 | 319636fc9270 | 620c9c332101 | 81f4b5a1d0a8 | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 2 | 319636fc9270 | 620c9c332101 | 442fcac005ed | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 3 | 319636fc9270 | 620c9c332101 | 3dc87ab71825 | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 4 | 319636fc9270 | 620c9c332101 | 677c622c0bfb | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 5 | 319636fc9270 | 620c9c332101 | 2344965e8b89 | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 6 | 319636fc9270 | 620c9c332101 | 5f52e54c6a9c | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 7 | 319636fc9270 | 620c9c332101 | 8b8b029f1142 | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 8 | 319636fc9270 | 620c9c332101 | 705d739be21c | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |
| 9 | 319636fc9270 | 620c9c332101 | 696d9363dc5a | 20120 | NAP | | 9456 | e4f95d56d90df35e | F | L | ... | | | | | | | | | | |


In [15]:
# C - Size of Dataset
print("Dataset shape (rows, columns):", data2.shape)
print("Number of observations of student-course-year (rows):", data2.shape[0])
print("Number of variables (columns):", data2.shape[1])

Dataset shape (rows, columns): (159173, 169)
Number of observations of student-course-year (rows): 159173
Number of variables (columns): 169


As the paper accompanying this dataset also notes, much of the missingness is systematic rather than random, and a large share of the columns are mostly empty. This is demonstrated below.

In [16]:
# D - Missing Data Exploration
# Basic exploratory analysis of the missing data as proportions and counts
missing_counts = data2.isnull().sum()
missing_percent = (missing_counts / len(data2)) * 100

missing_df = pd.concat([missing_counts, missing_percent], axis=1)
missing_df.columns = ['missing_count', 'missing_pct']

missing_df = missing_df[missing_df['missing_count'] > 0].sort_values(by='missing_pct', ascending=False)
missing_df

| | missing_count | missing_pct |
|---|---|---|
| pft_test_submissions_2023_7 | 159148 | 99.984294 |
| pft_assignment_submissions_2023_7 | 158800 | 99.765664 |
| es_retitulado | 158630 | 99.658862 |
| total1 | 157656 | 99.046949 |
| es_adaptado | 156916 | 98.582046 |
| ... | ... | ... |
| rendimiento_cuat_a | 12207 | 7.669014 |
| rendimiento_cuat_b | 9051 | 5.686266 |
| rendimiento_total | 8819 | 5.540513 |
| estudios_m_hash | 918 | 0.576731 |


In [17]:
# D - Missing Data Exploration
# A deeper dive into why some data is missing in the way it is
# Let us take a look at the missing wifi monitoring usage by campus
data2.groupby('campus_hash')['n_wifi_days_2023_7'] \
     .apply(lambda x: x.isnull().mean()*100) \
     .sort_values(ascending=False)

campus_hash
1398b376fdcce25c    88.458559
297c138806bdb5dd    87.812500
9103a6c82e355433    85.704161
3ca0e4af1c44f084    84.429455
7b778e4c1d1f33c9    83.958427
0f01a84bff1b2bf4    82.044018
60f19cd67252161d    79.815005
85ff657216cc9b54    79.775281
1a9d786be0ff0bfe    79.341426
6781b441c78d2643    78.329399
48c6e3d042649ef6    77.824773
234001f5d5f1eca4    77.424844
f9418773503e50b6    77.080491
47cfe5eb8ada0e74    76.700434
79df3742da86cfd4    76.659119
40f5b57b09f073ed    76.653696
86348ea0bf50ebf0    75.927487
4e808094851fc2ea    75.469381
16a36e86f6fed5d4    75.291622
e984139bcc2c5043    75.257732
f32b702fba23083f    74.931880
5d9d4510699dac58    74.718222
911ac1b13dac6fe9    73.259053
ddf9288fd8062579    72.413793
0672d49fe5a7035e    72.110665
eb074cd8374ba297    71.810089
f2a369a3b17169d7    71.753555
52025890fa603dbc    70.521364
2f4c06aba0f9a393    70.130678
0448d563bf72277a    70.024096
8138689887e6817e    68.799798
c8361f9b468e68c8    65.359477
e4f95d56d90df35e    64.51633

This shows that some campuses, such as `bc5d84bed7dee3e1`, have a comparatively low share (38%) of missing student wifi data, while others, like `1398b376fdcce25c`, are missing around 88%. The gap points to a systematic disparity in how well individual campuses report this data. While we cannot describe every inconsistency in a dataset of this size, this subsample illustrates that much of the missingness has a systematic cause. The accompanying paper explores this in more depth; in principle, because the dataset was compiled from various databases and resources and then homogenized, inconsistencies are bound to appear. Furthermore, some courses may be more open to using LMS tools or digital platforms than others, which would also produce systematic missingness.
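
The per-campus comparison can also be checked on a small made-up table where the answer is easy to verify by hand. The sketch below is illustrative only: the campus labels and values are invented, and it uses a boolean mask with `groupby(...).mean()`, which should be equivalent to the `apply` call used in the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration, not taken from the real dataset)
toy = pd.DataFrame({
    'campus_hash': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'n_wifi_days': [np.nan, np.nan, np.nan, 3, 5, np.nan, 2, 7],
})

# Percent of missing wifi values within each campus:
# isna() gives True/False, and the group mean of that mask is the missing share
missing_by_campus = (
    toy['n_wifi_days'].isna()
       .groupby(toy['campus_hash'])
       .mean()
       .mul(100)
       .sort_values(ascending=False)
)
print(missing_by_campus)
```

Here campus `a` is missing 3 of 4 values and campus `b` is missing 1 of 4, so a large gap between groups is exactly the pattern we would read as systematic rather than random missingness.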

#### Outlier Discovery

The accompanying paper describes that, during anonymization, variables that could compromise student anonymity were dealt with. For example, if a certain combination of variables could identify a student, it was deleted. Duplicate entries were also deleted. Finally, a general check was conducted to ensure that data types remained consistent and that value ranges were well within the expected distribution across datasets.

#### Data Cleaning

This dataset features a high rate of missingness. As such, the general rule we chose for now was to delete any column above a high missingness threshold. In this case, we dropped the columns with more than 90% overall missingness (34 of the 169 columns), since even if this data may be useful, the sheer proportion of missing values would make it less impactful. While aggressive, this will help us narrow the scope of our final project. A test also revealed that an even more aggressive approach, dropping every row with any missing value, would cut the dataset to only 162 entries, so we did not use it. However, rows in which every entry is NA were deleted. Another notable feature is that the CSV is ';'-delimited and uses commas as decimal separators, which is typical in much of Europe. As such, numbers are converted to decimal-point format and cast to float/int.
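
As an aside on the decimal-comma format, pandas can also parse such numbers at load time through the `decimal` parameter of `pd.read_csv`. The sketch below is a possible alternative to converting strings afterwards, not what our notebook currently does; the inline CSV text and its column names are invented, and it assumes the file uses the decimal comma consistently.

```python
import io
import pandas as pd

# Toy semicolon-delimited text with decimal commas (invented values)
raw = "student;nota;creditos\ns1;9,456;6\ns2;7,5;4,5\n"

# decimal=',' tells pandas to treat the comma as the decimal separator
toy = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(toy.dtypes)
print(toy)
```

One caveat with either approach is that an identifier column made up only of digits would also be read as a number, so hash columns may need to be kept as strings explicitly (for example with the `dtype` argument).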

For specific column-based adjustments, a few rules were established to deal with missingness:

1) Leave identifying hashes alone
2) Leave demographic/enrollment data alone
3) Fill credit/coursework columns with 0 for NA entries
4) Fill activity/practical work columns with 0 for NA entries
5) Fill LMS/Wifi/Digital Engagement columns with 0 for NA entries

This was done because empty demographic, enrollment, and identifying-hash fields are likely truly missing data. In the other categories, however, an empty field can simply mean that the student did not do the activity the column measures. For example, no entry for credits enrolled in a specific course in a specific semester may just mean that the student didn't take that course.
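
The effect of these rules is easiest to see on a tiny example. The sketch below uses an invented three-row frame; the column names `cred_mat_a` and `n_wifi_days_2023_1` are hypothetical stand-ins chosen to match the naming pattern, and the marker list is a shortened version of the one used in the cleaning cell.

```python
import numpy as np
import pandas as pd

# Toy frame (invented): one demographic column, one credit column, one wifi column
toy = pd.DataFrame({
    'estudios_m_hash': ['L', None, 'F'],
    'cred_mat_a': [6.0, np.nan, 4.5],
    'n_wifi_days_2023_1': [np.nan, 12.0, np.nan],
})

# Substrings marking count/activity columns where NA is read as "no activity"
zero_fill_markers = ['cred_mat', 'cred_sup', 'n_wifi_days', 'pft_']
zero_fill_cols = [c for c in toy.columns if any(m in c for m in zero_fill_markers)]

# Fill only the matched columns; the demographic column keeps its NA
toy[zero_fill_cols] = toy[zero_fill_cols].fillna(0)
print(toy)
```

The demographic column still shows its missing entry afterwards, while the credit and wifi columns read 0 wherever they were empty.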


In [18]:
# F - Data Cleaning

data_cleaned_2 = data2.copy()

# Step 1: Drop columns with 90% or more missingness (keep only columns below the threshold)
threshold = 0.90
data_cleaned_2 = data_cleaned_2.loc[:, data_cleaned_2.isna().mean() < threshold].copy()

# Step 2: Drop rows that are all NA
data_cleaned_2 = data_cleaned_2.dropna(axis=0, how='all')

# Step 3: Convert comma-based numbers to floats
for col in data_cleaned_2.select_dtypes(include='object').columns:
    try:
        data_cleaned_2[col] = data_cleaned_2[col].str.replace(',', '.', regex=False).astype(float)
    except (AttributeError, TypeError, ValueError):
        pass  # leave non-numeric columns as object

# Step 4: Fill NA with 0 for predetermined count/activity columns
fillna_cols = [col for col in data_cleaned_2.columns 
               if any(x in col for x in ['n_wifi_days', 'resource_events', 'n_resource_days', 
                                         'pft_', 'actividades', 'total1', 'cred_mat', 'cred_sup'])]

for col in fillna_cols:
    if pd.api.types.is_numeric_dtype(data_cleaned_2[col]):
        data_cleaned_2[col] = data_cleaned_2[col].fillna(0)

data_cleaned_2.head()

| | dni_hash | tit_hash | asi_hash | anyo_ingreso | tipo_ingreso | nota10_hash | nota14_hash | campus_hash | estudios_p_hash | estudios_m_hash | ... | resource_events_2023_5 | n_resource_days_2023_5 | pft_events_2023_6 | pft_days_logged_2023_6 | pft_visits_2023_6 | pft_total_minutes_2023_6 | n_wifi_days_2023_6 | resource_events_2023_6 | n_resource_days_2023_6 | n_wifi_days_2023_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 319636fc9270 | 620c9c332101 | 4596fcf257c4 | 2012.0 | NAP | | 9.456 | e4f95d56d90df35e | F | L | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 319636fc9270 | 620c9c332101 | 81f4b5a1d0a8 | 2012.0 | NAP | | 9.456 | e4f95d56d90df35e | F | L | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 319636fc9270 | 620c9c332101 | 442fcac005ed | 2012.0 | NAP | | 9.456 | e4f95d56d90df35e | F | L | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 319636fc9270 | 620c9c332101 | 3dc87ab71825 | 2012.0 | NAP | | 9.456 | e4f95d56d90df35e | F | L | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 319636fc9270 | 620c9c332101 | 677c622c0bfb | 2012.0 | NAP | | 9.456 | e4f95d56d90df35e | F | L | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [19]:
print("Shape of cleaned dataframe:", data_cleaned_2.shape)

print("\nData types:")
print(data_cleaned_2.dtypes.value_counts())

processed_file_path = os.path.join(PROCESSED_DATA_DIR, 'university_dropout_cleaned_2022.csv')
data_cleaned_2.to_csv(processed_file_path, index=False, sep=';')

print(f"\nCleaned dataset saved to: {processed_file_path}")

Shape of cleaned dataframe: (159173, 135)

Data types:
float64    115
object      13
int64        7
Name: count, dtype: int64

Cleaned dataset saved to: data/02-processed/university_dropout_cleaned_2022.csv


## Ethics

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> We will have to read through the paper that originally collected this data and determine whether the data was appropriately collected and whether or not students were given adequate warning as to what was being collected.

 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> The data was collected by several researchers in Portugal for the purposes of their research. As such, we will have to consider the bias that the researchers could have introduced through the data collection, as well as biases that may originate from the data being collected in Portugal.

 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Although we currently believe the data we have is anonymous, we will need to re-examine the source and determine if there is any additional information that needs to be censored.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We will have to perform our analysis both with and without gender, our protected group, to see whether including it results in some form of algorithmic bias, so that we can hopefully correct for it and provide an unbiased collection of data.


### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Since we are not generating any new data and are instead using an existing dataset, there will not be anything for us to hide, although we could hide the conclusions of our data at the end of the analysis.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> The data was already collected by others so we cannot support this for the original participants.

 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> The data was already collected by others so we cannot support this for the original participants.

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> While we intend for our analysis to be qualitative, we could reach out to experts in education inequity to get a better handle on how accurate our analysis is from their perspective.

 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Since many variables in the data are discrete, including several that might have been better represented as continuous, we will take particular care in determining which of them can be used and how best to interpret them.

 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> We will have to ensure that different data points are correctly weighted so that we are not forcing our analysis towards a particular conclusion.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> Since the data does not have any specific PII, we do not anticipate having to do anything specific for our analysis.

 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> We intend to have a well-formatted and easily readable Jupyter notebook so that it is easy to see what we did, making the analysis easily auditable.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> We will have to be careful with the careers of the parents, as these may be proxies that result in us finding that socioeconomic status is the sole determinant of success.

 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will have to test whether the model is appropriately fair across the binary groupings in the data, such as rural vs. urban addresses, which are not actual addresses but a single value for rural and another for urban.

 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Since we intend to find out which factors can be used as predictors of a student's chance of dropping out, we need to be cautious that this can result in false positives and false negatives. If our model places certain students in the dropout or graduate pile, this could change how money is spent in a school, which we would like to avoid.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> We hope to make the model very clear and to explain all of the analysis that we do in the Jupyter notebook so that we can go back and understand and justify the decisions that the model made.

 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We will have to ensure that we communicate the various limitations present in our data and analysis so that whoever looks at our model understands the potential flaws in our analysis and modeling.
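
For D.2, one way we might check for disparate error rates is to compute false negative and false positive rates separately for each group. This is a minimal sketch on invented labels, not a result from our data; the column names `group`, `dropout`, and `predicted` are hypothetical.

```python
import pandas as pd

# Invented labels: 1 = dropped out, 0 = stayed
toy = pd.DataFrame({
    'group':     ['rural', 'rural', 'rural', 'rural', 'urban', 'urban', 'urban', 'urban'],
    'dropout':   [1, 1, 0, 0, 1, 1, 1, 0],
    'predicted': [1, 0, 0, 0, 1, 1, 0, 1],
})

# False negative rate: share of actual dropouts the model missed, per group
actual_dropouts = toy[toy['dropout'] == 1]
fnr = (actual_dropouts['predicted'] == 0).groupby(actual_dropouts['group']).mean()

# False positive rate: share of non-dropouts flagged as dropouts, per group
non_dropouts = toy[toy['dropout'] == 0]
fpr = (non_dropouts['predicted'] == 1).groupby(non_dropouts['group']).mean()

print(pd.DataFrame({'fnr': fnr, 'fpr': fpr}))
```

A large gap between groups in either rate would be a signal to revisit the features or the decision threshold before drawing conclusions.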

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> We do not yet have a clear plan for how to monitor the model after it is deployed but, if it were to be used, the users would likely have to be careful with how they use it when modifying school programs and spending.

 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> No, we have not discussed this, but it would be difficult to address if it were to happen since we do not know the respondents and are geographically very far away from them.

 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> We think that we would be able to just turn off the script. In the end our analysis will not continue to aggregate more data so we do not anticipate any issues that require deleting the model.

 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> We have yet to take steps towards this, but we need to ensure that the model's results cannot be misinterpreted and abused for someone else's purposes.


## Team Expectations 

* Our primary form of communication is Discord. We expect all members to respond to and/or acknowledge all members' messages within one day. We plan to meet once a week, either in person or virtually, on Monday afternoons.

* We expect all members to maintain a respectful and polite tone when communicating with others. Don't be mean, even if there are disagreements. We want to keep an open mind, value everyone's opinions equally, and be proactive in brainstorming solutions for the good of the team. For example, if there are conflicting perspectives, we can communicate our opinions by saying "I don't think moving forward with X is within our group's best interest because of Y. Instead, we should explore Z."

* Ideally, we want to make unanimous decisions. However, this is not always possible, so we will default to majority vote rules. If a member does not reply or acknowledge a proposal/message within a day, we can move forward without their input. Team members can react to Discord messages as a form of acknowledgement, especially if they're unable to respond immediately.

* Every member will get first-hand experience pertaining to all aspects of our project. Having members do a little bit of everything will ensure that we are all able to develop our skills individually. We will delegate tasks during our weekly meetings and send a message in our "to-do" channel on Discord.

* If there are any issues, we expect each other to speak up EARLY before the deadline. As a general rule of thumb, we expect members to reach out at least a day or two PLUS the expected time it takes to complete the specific task if there are any issues or concerns.


## Project Timeline Proposal

Tentative timeline that is subject to change throughout the rest of the quarter

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 1/26 | 3 PM | Reviewed Lecture slides and any information related to our project | Determine best form of communication; Introduce Ourselves; Review previous projects; Begin brainstorming possible project ideas |
| 2/2 | 2 PM | Brainstorm Ideas For Final Project | Discuss ideal dataset(s) and project ideas; Draft project proposal/Assign Individual parts |
| 2/9 | 2 PM | Finish and finalize project proposal | Discuss Wrangling and possible analytical approaches; Discuss overall organization of project and procedures; Work on data |
| 2/16 or 2/18 | 2 PM | Review dataset and have it prepared for analysis | Work on Data Checkpoint; Discuss Analysis Plan |
| 2/23 | 2 PM | Finish data checkpoint and all things related to data | Work on analysis of our data |
| 3/2 | 2 PM | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/9 | 2 PM | Work on Final Project | Finishing touches; Turn in Final Project; Group Project Surveys |
| 3/16(?) | 2 PM | Work on Final Project | Buffer Day if necessary |