PISA Revisited

Summary

Background

The OECD’s Programme for International Student Assessment (PISA) has consistently found that girls outperform boys in reading, among other domains, and that this gender gap is large, worldwide, and persistent throughout primary and secondary schooling. The literature highlights how girls’ academic strength relative to their male peers may affect their confidence and interests across subjects, which may in turn help explain differences in girls’ career aspirations, such as a lower likelihood of entering STEM fields.

This project uses a machine learning framework to identify the strongest predictors of reading scores for boys and girls in the complete 2018 PISA dataset. Using the scikit-learn library for Python, a multiple regression was trained on the data and used as the baseline model. Subsequent models introduced a Ridge penalty, polynomial regression, and a Random Forest Regressor. The Ridge and polynomial regressions performed worse than the baseline regression, while an Extra Trees Regressor slightly improved on the results of the Random Forest algorithm.
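The model comparison described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the synthetic data from `make_regression`, the hyperparameters (`alpha`, `degree`, `n_estimators`), the 80/20 split, and the choice of RMSE as the metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the PISA covariates and reading scores (assumption).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One entry per model family named in the summary; settings are illustrative.
models = {
    "baseline_linear": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model and return its test-set RMSE keyed by model name."""
    rmse = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))
    return rmse


scores = evaluate(models, X_train, y_train, X_test, y_test)

# Print models from lowest to highest error.
for name, value in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {value:.2f}")
```

Because every candidate is scored on the same held-out split, each one can be compared directly against the baseline regression. The ranking on synthetic data says nothing about the ranking on the PISA data.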

Methods

Data preprocessing and analysis were conducted using Python and the scikit-learn library.

Data

The underlying dataset of this project is based on the full student questionnaire of the 2018 iteration of PISA. The initial loading of the full dataset produced a pandas dataframe with 612,004 observations and 1,120 columns. Guided by a literature review and previous research, variables with high construct validity were selected, including but not limited to items related to self-efficacy, reading habits and attitudes, school environment, teacher interaction, and parental involvement. As a result, the final dataset included 205 variables as covariates and reading score as the dependent variable. A random sample of 100,000 observations was drawn for further processing.
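The selection and sampling steps described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's preprocessing code: the toy dataframe, the column names (`reading_score`, `self_efficacy`, `reading_enjoyment`, `unrelated_item`), the helper `select_and_sample`, and the seed are all hypothetical, and the sample size is scaled down from 100,000 to fit the toy data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the full questionnaire dataframe (hypothetical columns).
raw = pd.DataFrame(
    {
        "reading_score": rng.normal(500, 100, size=1000),
        "self_efficacy": rng.normal(size=1000),
        "reading_enjoyment": rng.normal(size=1000),
        "unrelated_item": rng.normal(size=1000),
    }
)

TARGET = "reading_score"  # dependent variable (hypothetical name)
SELECTED = ["self_efficacy", "reading_enjoyment"]  # chosen covariates (hypothetical)


def select_and_sample(df, covariates, target, n, seed=42):
    """Keep only the chosen covariates plus the target, then draw a random sample."""
    subset = df[covariates + [target]]
    n = min(n, len(subset))  # never request more rows than exist
    return subset.sample(n=n, random_state=seed)


sample = select_and_sample(raw, SELECTED, TARGET, n=100)
X = sample[SELECTED]
y = sample[TARGET]
print(X.shape, y.shape)
```

Fixing `random_state` makes the sample reproducible, so later modeling steps see the same rows on every run.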

AnnaWeronikaMatysiak/PISA_Revisited
