Project Repo for the Spring 2022 Machine Learning class at Hertie

AnnaWeronikaMatysiak/PISA_Revisited

PISA Revisited

Summary

Background

The OECD’s Programme for International Student Assessment (PISA) has consistently found that girls outperform boys in reading, among other domains, and that this gender gap is large, worldwide, and persistent throughout primary and secondary schooling. The cited literature highlights how girls’ academic strength relative to their male peers may affect their confidence and interests across subjects, which could help explain differences in girls’ career aspirations, such as a lower likelihood of entering STEM fields.

This project uses a machine learning framework to identify the strongest predictors of reading scores for boys and girls in the complete 2018 PISA dataset. Using the scikit-learn library for Python, a multiple regression was trained on the data and used as the baseline model. Subsequent models introduced a Ridge penalty, polynomial regression, and a Random Forest Regressor. The regressions with a Ridge penalty and with polynomial features performed worse than the baseline regression, while an Extra Trees Regressor slightly improved on the results of the Random Forest algorithm.
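
The model comparison described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the data is synthetic (generated with `make_regression`), and the hyperparameters, the 5-fold cross-validation, the RMSE metric, and the helper name `compare_models` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the PISA covariates and reading scores (not real data).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Candidate models; all hyperparameters here are illustrative assumptions.
models = {
    "baseline": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}


def compare_models(models, X, y, cv=5):
    """Return the mean cross-validated RMSE for each model (hypothetical helper)."""
    results = {}
    for name, model in models.items():
        # scikit-learn returns negated RMSE so that higher is better; flip the sign.
        scores = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        results[name] = -scores.mean()
    return results


rmse = compare_models(models, X, y)
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.2f}")
```

Scoring every model with the same cross-validation splits and the same metric keeps the comparison against the baseline like for like. The ranking on synthetic data says nothing about the ranking on the PISA data.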

Methods

Data preprocessing and analysis were conducted using Python and the scikit-learn library.

Data

The dataset underlying this project is the full student questionnaire from the 2018 iteration of PISA. Loading the full dataset produced a pandas dataframe with 612,004 observations and 1,120 columns. Guided by a literature review and previous research, variables with high construct validity were selected, including but not limited to items related to self-efficacy, reading habits and attitudes, school environment, teacher interaction, and parental involvement. The final dataset therefore included 205 variables as covariates, with reading score as the dependent variable. A random sample of 100,000 observations was drawn for further processing.
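
The selection and sampling steps could look roughly like the sketch below. It is illustrative only: the dataframe is a small synthetic stand-in, the column names (`read_score`, `self_efficacy`, and so on) and the helper `select_and_sample` are hypothetical, and the sizes are scaled down from the 205 covariates and 100,000 observations described above.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the full questionnaire dataframe (not real data).
rng = np.random.default_rng(0)
full = pd.DataFrame(
    rng.normal(size=(1000, 6)),
    columns=[
        "read_score",       # hypothetical name for the reading score
        "self_efficacy",    # hypothetical covariate names
        "reading_habits",
        "teacher_support",
        "unused_a",         # columns that the variable selection would drop
        "unused_b",
    ],
)

covariates = ["self_efficacy", "reading_habits", "teacher_support"]
target = "read_score"


def select_and_sample(df, covariates, target, n, seed=42):
    """Keep the chosen covariates plus the target, then draw n rows at random."""
    subset = df[covariates + [target]]
    # Sampling without replacement; a fixed seed makes the draw reproducible.
    return subset.sample(n=n, random_state=seed)


sample = select_and_sample(full, covariates, target, n=100)
X = sample[covariates]  # covariate matrix
y = sample[target]      # dependent variable
print(X.shape, y.shape)
```

Fixing `random_state` means the same subsample is drawn on every run, which helps when results are compared across models.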

Contributors

Further Resources

License

The material in this repository is made available under the MIT license.
