### Purpose: Load and inspect the dataset

##### Load dataset

In [24]:
%run 00_project_setup.ipynb

In [25]:
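# Fetch the dataset from the UCI Machine Learning Repository (id 697).
# fetch_ucirepo is assumed to be imported by 00_project_setup.ipynb
# (it is provided by the ucimlrepo package).
predict_students_dropout_and = fetch_ucirepo(id=697)

# Data as pandas DataFrames: features in x, target column in y
x = predict_students_dropout_and.data.features
y = predict_students_dropout_and.data.targets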

In [26]:
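# Build one Markdown line per metadata field; the two trailing spaces force a line break.
# Markdown is assumed to be imported from IPython.display by 00_project_setup.ipynb.
md = "### Dataset Metadata\n"
for k, v in predict_students_dropout_and.metadata.items():
    md += f"**{k}:** {v}  \n"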

Markdown(md)

### Dataset Metadata
**uci_id:** 697  
**name:** Predict Students' Dropout and Academic Success  
**repository_url:** https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success  
**data_url:** https://archive.ics.uci.edu/static/public/697/data.csv  
**abstract:** A dataset created from a higher education institution (acquired from several disjoint databases) related to students enrolled in different undergraduate degrees, such as agronomy, design, education, nursing, journalism, management, social service, and technologies.
The dataset includes information known at the time of student enrollment (academic path, demographics, and social-economic factors) and the students' academic performance at the end of the first and second semesters. 
The data is used to build classification models to predict students' dropout and academic sucess. The problem is formulated as a three category classification task, in which there is a strong imbalance towards one of the classes.  
**area:** Social Science  
**tasks:** ['Classification']  
**characteristics:** ['Tabular']  
**num_instances:** 4424  
**num_features:** 36  
**feature_types:** ['Real', 'Categorical', 'Integer']  
**demographics:** ['Marital Status', 'Education Level', 'Nationality', 'Occupation', 'Gender', 'Age']  
**target_col:** ['Target']  
**index_col:** None  
**has_missing_values:** no  
**missing_values_symbol:** None  
**year_of_dataset_creation:** 2021  
**last_updated:** Mon Feb 26 2024  
**dataset_doi:** 10.24432/C5MC89  
**creators:** ['Valentim Realinho', 'Mónica Vieira Martins', 'Jorge Machado', 'Luís Baptista']  
**intro_paper:** {'ID': 99, 'type': 'NATIVE', 'title': "Early prediction of student's performance in higher education: a case study", 'authors': 'Mónica V. Martins, Daniel Tolledo, Jorge Machado, Luís M. T. Baptista, and Valentim Realinho', 'venue': 'Trends and Applications in Information Systems and Technologies', 'year': 2021, 'journal': 'Advances in Intelligent Systems and Computing series', 'DOI': 'http://www.doi.org/10.1007/978-3-030-72657-7_16', 'URL': 'http://www.worldcist.org/2021/', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}  
**additional_info:** {'summary': None, 'purpose': 'The dataset was created in a project that aims to contribute to the reduction of academic dropout and failure in higher education, by using machine learning techniques to identify students at risk at an early stage of their academic path, so that strategies to support them can be put into place. \n\nThe dataset includes information known at the time of student enrollment – academic path, demographics, and social-economic factors. \n\nThe problem is formulated as a three category classification task (dropout, enrolled, and graduate) at the end of the normal duration of the course. \n', 'funded_by': 'This dataset is supported by program SATDAP - Capacitação da Administração Pública under grant POCI-05-5762-FSE-000191, Portugal.', 'instances_represent': 'Each instance is a student', 'recommended_data_splits': 'The dataset was used, in our project, with a data split of 80% for training and 20% for test.', 'sensitive_data': None, 'preprocessing_description': 'We performed a rigorous data preprocessing to handle data from anomalies, unexplainable outliers, and missing values.', 'variable_info': None, 'citation': 'If you use this dataset in experiments for a scientific publication, please kindly cite our paper: \nM.V.Martins, D. Tolledo, J. Machado, L. M.T. Baptista, V.Realinho. (2021) "Early prediction of student’s performance in higher education: a case study" Trends and Applications in Information Systems and Technologies, vol.1, in Advances in Intelligent Systems and Computing series. Springer. DOI: 10.1007/978-3-030-72657-7_16'}  
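
The `intro_paper` and `additional_info` fields above are nested dictionaries, so the one-line-per-key loop prints them as raw `dict` reprs. A minimal sketch of a more readable rendering follows. The helper name `metadata_to_markdown` and the toy `example` mapping are hypothetical and only mimic the shape of the metadata shown above.

```python
from typing import Any, Mapping


def metadata_to_markdown(metadata: Mapping[str, Any], title: str = "Dataset Metadata") -> str:
    """Render a (possibly nested) metadata mapping as Markdown text."""
    lines = [f"### {title}"]
    for key, value in metadata.items():
        if isinstance(value, Mapping):
            # Nested mapping: one bold key followed by a bullet per sub-field.
            lines.append(f"**{key}:**")
            lines.append("")
            for sub_key, sub_value in value.items():
                # Collapse internal newlines so each bullet stays on one line.
                text = " ".join(str(sub_value).split())
                lines.append(f"- *{sub_key}:* {text}")
            lines.append("")
        else:
            # Two trailing spaces force a Markdown line break.
            lines.append(f"**{key}:** {value}  ")
    return "\n".join(lines)


# Toy mapping (hypothetical) that mimics the shape of the metadata shown above.
example = {
    "uci_id": 697,
    "intro_paper": {"year": 2021, "venue": "Trends and Applications\nin Information Systems"},
}
print(metadata_to_markdown(example))
```

In the notebook this could presumably be used as `Markdown(metadata_to_markdown(predict_students_dropout_and.metadata))`, assuming the metadata object behaves like a regular `dict`.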


**Dataset Citation:**  
@incollection{martins2021early,
 title={Early prediction of student’s performance in higher education: a case study},
 author={Martins, M. V. and Tolledo, D. and Machado, J. and Baptista, L. M. T. and Realinho, V.},
 booktitle={Trends and Applications in Information Systems and Technologies},
 volume={1},
 year={2021},
 publisher={Springer},
 series={Advances in Intelligent Systems and Computing},
 doi={10.1007/978-3-030-72657-7_16}
}


##### Quick inspection

In [27]:
x.head()

| | Marital Status | Application mode | Application order | Course | Daytime/evening attendance | Previous qualification | Previous qualification (grade) | Nacionality | Mother's qualification | Father's qualification | ... | Curricular units 1st sem (without evaluations) | Curricular units 2nd sem (credited) | Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (evaluations) | Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | Curricular units 2nd sem (without evaluations) | Unemployment rate | Inflation rate | GDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 |


In [28]:
y['Target'].unique()    

array(['Dropout', 'Graduate', 'Enrolled'], dtype=object)
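
`unique()` lists the three labels but not how often each one occurs. A small sketch of a per-class count and share is given here, using only the standard library. The function name `class_distribution` is hypothetical, and the labels in `toy_labels` are made up for illustration; they are not the dataset's real counts.

```python
from collections import Counter
from typing import Iterable


def class_distribution(labels: Iterable[str]) -> dict[str, tuple[int, float]]:
    """Return {label: (count, share)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, count / total)
        for label, count in counts.most_common()
    }


# Made-up labels for illustration only; these are NOT the dataset's real counts.
toy_labels = ["Graduate", "Dropout", "Graduate", "Enrolled", "Graduate", "Dropout"]
for label, (count, share) in class_distribution(toy_labels).items():
    print(f"{label:<10} {count:>3}  {share:.1%}")
```

In the notebook, `class_distribution(y['Target'])` should work because iterating over a pandas Series yields its values; `y['Target'].value_counts(normalize=True)` is the more usual pandas way to get the same shares.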

##### About the dataset

This dataset comes from a higher education institution and was compiled by merging records from several disjoint internal databases. It covers students enrolled in undergraduate programs such as agronomy, design, education, nursing, journalism, management, social service, and technologies.
Each record combines attributes known at enrollment (academic path, demographics, and socio-economic factors) with the student's academic performance at the end of the first and second semesters.
The dataset is intended for building classification models that predict whether a student will drop out or succeed academically. The task has three outcome classes (Dropout, Enrolled, Graduate) and, according to the dataset abstract, is strongly imbalanced towards one of them.
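
Because the classes are imbalanced, a random 80/20 split (the ratio the metadata reports the authors used) can leave the smallest class under-represented in the test set. The sketch below shows one way to keep class shares similar in both parts. It is an illustration only: the metadata does not say whether the authors stratified their split, `stratified_split_indices` is a hypothetical helper, and `toy_labels` is made-up data.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence


def stratified_split_indices(
    labels: Sequence[Hashable], test_fraction: float = 0.2, seed: int = 0
) -> tuple[list[int], list[int]]:
    """Split row positions into train/test lists, class by class."""
    rng = random.Random(seed)
    by_class: dict[Hashable, list[int]] = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    train: list[int] = []
    test: list[int] = []
    for positions in by_class.values():
        rng.shuffle(positions)
        # Round per class so every class keeps roughly the same test share.
        n_test = round(len(positions) * test_fraction)
        test.extend(positions[:n_test])
        train.extend(positions[n_test:])
    return sorted(train), sorted(test)


# Made-up imbalanced labels for illustration; not the dataset's real distribution.
toy_labels = ["Graduate"] * 10 + ["Dropout"] * 5 + ["Enrolled"] * 5
train_idx, test_idx = stratified_split_indices(toy_labels)
print(len(train_idx), len(test_idx))
```

The returned positions could then be applied with `x.iloc[train_idx]` and `y.iloc[train_idx]`. In practice, scikit-learn's `train_test_split` with its `stratify` argument is the more common choice; the pure-Python version here only makes the idea explicit.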