# Occupational Status Prediction using Machine Learning

This project implements a multi-model machine learning system that predicts an individual's occupational status from demographic, educational, and linguistic features. It was developed on data from the 2022 Adult Education Survey (AEDA2022), published by the Romanian National Institute of Statistics, and the author used it in the National Statistics Olympiad.

## Objective

The primary goal of this project is to analyze how formal education, non-formal learning, and language usage influence labor market integration. The model predicts categories such as "Employed", "Unemployed", "Student", and "Retired", using six supervised learning algorithms and a stacked ensemble.

## Methodology

A class-based machine learning pipeline keeps the design clean, reproducible, and easy to extend. The pipeline standardizes the numerical features and one-hot encodes the categorical ones, then runs a per-model hyperparameter search with Optuna, trains each model individually, and finally builds a stacked ensemble.

### Models Used

The project applies and compares multiple machine learning models suitable for structured/tabular data:

- XGBoost Classifier
- Random Forest Classifier
- Support Vector Machine (SVC)
- Logistic Regression
- Gradient Boosting Classifier
- AdaBoost Classifier

Each model is trained independently and evaluated on classification performance. A final **Stacked Ensemble Model** feeds the predictions of XGBoost, Random Forest, SVC, Gradient Boosting, and AdaBoost, together with the original features, into a Logistic Regression meta-learner.
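
A stacked ensemble of this kind can be sketched with scikit-learn's `StackingClassifier`. The example below is a reduced illustration under stated assumptions: it uses synthetic data from `make_classification` and only two base learners, not the survey data or the full model list.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data standing in for the survey features (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(kernel="linear", probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    passthrough=True,  # the meta-learner also sees the original features
    cv=5,              # base-model predictions come from 5-fold cross-validation
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked accuracy on held-out data: {stack_accuracy:.3f}")
```

Because the meta-learner is trained on out-of-fold predictions, it does not see base-model outputs for rows those models were fitted on.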

## Evaluation

The system evaluates each model using the following metrics:

- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
- Classification Report

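The stacked model serves as the final predictor. In the recorded run it reached 0.8433 test accuracy, against 0.8417 for the best single models (Logistic Regression and SVC). Precision, recall, and F1 are reported as support-weighted averages; the macro-averaged F1 stays near 0.4 because minority classes such as Disabled, Other, and Unemployed are rarely predicted correctly.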
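
The gap between accuracy, weighted F1, and macro F1 on imbalanced classes can be illustrated without any library. The sketch below uses ten hypothetical labels, not the survey data, and the helper `f1_per_class` is an illustrative name, not part of the project.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: (f1, support)} computed from two label lists."""
    support = Counter(y_true)
    scores = {}
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores

# Hypothetical labels: the minority class is never predicted.
y_true = ["Employed"] * 6 + ["Student"] * 2 + ["Unemployed"] * 2
y_pred = ["Employed"] * 6 + ["Student"] * 2 + ["Employed"] * 2

scores = f1_per_class(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / len(y_true)

# Accuracy is 0.80, weighted F1 about 0.71, macro F1 about 0.62.
print(f"accuracy={accuracy:.2f} weighted_f1={weighted_f1:.2f} macro_f1={macro_f1:.2f}")
```

Macro averaging gives every class the same weight, so a class that is never predicted pulls the score down even when it contributes few rows.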

## Dataset

- **Name**: Ancheta asupra Educației Adulților (AEDA2022)
- **Publisher**: Institutul Național de Statistică (INSSE), România
- **Official Report**: https://insse.ro/cms/files/Rapoarte%20de%20calitate/Educatie/Ancheta-asupra-educatiei-adultilor.pdf

## Author

- **Name**: Tihoc Andrei  

## Potential Enhancements

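- Cross-validated hyperparameter tuning (the current Optuna search scores each trial on the training set), for example with GridSearchCV or a cross-validated Optuna objective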
- SHAP-based model explainability
- Deployment as a Streamlit app or a FastAPI service
- Integration with real-time systems or APIs

## Disclaimer

This project is provided for **educational and research purposes only**.  
It is not intended to be used in real-world decision-making processes, such as hiring, profiling, or eligibility assessments.  
Predictions are based on publicly available, anonymized data and should not be interpreted as recommendations for individuals.

---


In [6]:
# Install Optuna into the notebook's active environment
%pip install optuna



In [7]:
# Import all required libraries
# This cell imports all key libraries needed for data manipulation, visualization,
# model building, preprocessing, hyperparameter tuning, and evaluation.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

import optuna

import warnings
warnings.filterwarnings('ignore')



In [8]:
# Define preprocessing and modeling pipeline class
# This class encapsulates all preprocessing, encoding, model training, and ensemble logic.

class OccupationalStatusModel:
    def __init__(self, df):
        self.df = df
        self.label_encoder = LabelEncoder()
        self.scaler = StandardScaler()
        self.pipeline = None
        self.models = {}
        self.trained_models = {}
        self.feature_names = None

    def preprocess(self):
        mappings = self.get_mappings()
        self.df['MAINSTAT'] = self.df['MAINSTAT'].map(mappings['mainstat'])
        self.df['HATLEVEL 1'] = self.df['HATLEVEL 1'].map(mappings['hatlevel'])
        self.df['HATFIELDniv1'] = self.df['HATFIELDniv1'].astype(str).map(mappings['hatfield'])

    def get_mappings(self):
        return {
            "mainstat": {
                10: "Employed", 20: "Unemployed", 32: "Retired", 33: "Disabled",
                31: "Student", 35: "Housework", 34: "Military", 36: "Other"
            },
            "hatlevel": {
                100: "Primary", 200: "Lower Secondary", 342: "Upper Secondary (partial)",
                343: "Upper Secondary (no tertiary access)", 344: "Upper Secondary (tertiary access)",
                349: "Upper Secondary (no access)", 352: "VET (partial)", 353: "VET (no tertiary access)",
                354: "VET (tertiary access)", 359: "VET (no access)", 392: "Unknown Secondary (partial)",
                393: "Unknown Secondary (no tertiary)", 394: "Unknown Secondary (tertiary)",
                399: "Unknown Secondary (no access)", 440: "Post-secondary general",
                450: "Post-secondary vocational", 490: "Post-secondary unknown",
                540: "Short Tertiary (general)", 550: "Short Tertiary (vocational)",
                590: "Short Tertiary (unknown)", 600: "Bachelor", 700: "Master", 800: "Doctorate"
            },
            "hatfield": {
                "1": "Education", "2": "Arts and Humanities", "3": "Social Sciences",
                "4": "Business and Law", "5": "Science and Math", "6": "ICT",
                "7": "Engineering", "8": "Agriculture", "9": "Health", "10": "Services"
            }
        }

    def prepare_data(self):
        features = ['VARSTA', 'HATLEVEL 1', 'FEDLEVEL', 'NFELESSON', 'NFEWORKSHOP', 'LANGUSED']
        X = self.df[features]
        y = self.df['MAINSTAT']

        categorical = ['HATLEVEL 1', 'FEDLEVEL']
        numerical = ['VARSTA', 'NFELESSON', 'NFEWORKSHOP', 'LANGUSED']

        preprocessor = ColumnTransformer([
            ('num', self.scaler, numerical),
            ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical)
        ])

        self.pipeline = Pipeline([
            ('preprocessor', preprocessor)
        ])

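        y_encoded = self.label_encoder.fit_transform(y)

        # Split first, then fit the scaler and encoder on the training rows only,
        # so that test-set statistics do not leak into preprocessing.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42
        )
        X_train = self.pipeline.fit_transform(X_train)
        X_test = self.pipeline.transform(X_test)
        self.feature_names = self.pipeline.named_steps['preprocessor'].get_feature_names_out()

        return X_train, X_test, y_train, y_test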

    def optimize_model(self, model_name, X_train, y_train):
        def objective(trial):
            if model_name == 'xgb':
                model = XGBClassifier(
                    max_depth=trial.suggest_int("max_depth", 3, 10),
                    learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3),
                    n_estimators=trial.suggest_int("n_estimators", 50, 200),
                    subsample=trial.suggest_float("subsample", 0.5, 1.0),
                    use_label_encoder=False,
                    eval_metric='mlogloss',
                    random_state=42
                )
            elif model_name == 'rf':
                model = RandomForestClassifier(
                    n_estimators=trial.suggest_int("n_estimators", 50, 200),
                    max_depth=trial.suggest_int("max_depth", 3, 15),
                    min_samples_split=trial.suggest_int("min_samples_split", 2, 10),
                    min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10),
                    random_state=42
                )
            elif model_name == 'svm':
                model = SVC(
                    C=trial.suggest_float("C", 0.1, 10.0),
                    kernel='linear',
                    probability=True,
                    random_state=42
                )
            elif model_name == 'logreg':
                model = LogisticRegression(
                    C=trial.suggest_float("C", 0.01, 10.0),
                    max_iter=1000
                )
            elif model_name == 'gb':
                model = GradientBoostingClassifier(
                    n_estimators=trial.suggest_int("n_estimators", 50, 200),
                    learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3),
                    max_depth=trial.suggest_int("max_depth", 3, 10),
                    random_state=42
                )
            elif model_name == 'ada':
                model = AdaBoostClassifier(
                    n_estimators=trial.suggest_int("n_estimators", 50, 200),
                    learning_rate=trial.suggest_float("learning_rate", 0.01, 1.0),
                    random_state=42
                )
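            else:
                raise ValueError(f"Unknown model name: {model_name}")

            # Note: each trial is scored on the same rows it was fitted on, so the
            # search rewards training fit rather than generalization.
            model.fit(X_train, y_train)
            preds = model.predict(X_train)
            return accuracy_score(y_train, preds)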

        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=30, show_progress_bar=False)
        print(f"Best trial for {model_name}: {study.best_trial.params}")
        # best_value is the score already recorded for the best trial; no refit is needed
        return study.best_value, study.best_trial.params

    def train_models(self, X_train, y_train):
        self.models = {
            'xgb': XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42),
            'rf': RandomForestClassifier(random_state=42),
            'svm': SVC(kernel='linear', probability=True, random_state=42),
            'logreg': LogisticRegression(max_iter=1000),
            'gb': GradientBoostingClassifier(random_state=42),
            'ada': AdaBoostClassifier(random_state=42)
        }

        for name, model in self.models.items():
            _, best_params = self.optimize_model(name, X_train, y_train)
            model.set_params(**best_params)
            print(f"\nTraining {name.upper()}...")
            model.fit(X_train, y_train)
            self.trained_models[name] = model

    def evaluate_models(self, X_test, y_test):
        results = {}
        for name, model in self.trained_models.items():
            y_pred = model.predict(X_test)
            acc = accuracy_score(y_test, y_pred)
            prec = precision_score(y_test, y_pred, average='weighted')
            rec = recall_score(y_test, y_pred, average='weighted')
            f1 = f1_score(y_test, y_pred, average='weighted')
            print(f"\n{name.upper()} Results:")
            print(f"Accuracy: {acc:.4f}, Precision: {prec:.4f}, Recall: {rec:.4f}, F1: {f1:.4f}")
            print(classification_report(y_test, y_pred, target_names=self.label_encoder.classes_))
            results[name] = acc
        return results

    def build_stacked_model(self):
        # Logistic regression is the meta-learner, so it is left out of the base estimators
        estimators = [(name, model) for name, model in self.trained_models.items() if name != 'logreg']
        stacked = StackingClassifier(
            estimators=estimators,
            final_estimator=LogisticRegression(max_iter=1000),
            passthrough=True,
            cv=5
        )
        return stacked

    def predict_stacked(self, stacked_model, X_train, X_test, y_train, y_test):
        stacked_model.fit(X_train, y_train)
        y_pred = stacked_model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print("\nSTACKED MODEL RESULTS")
        print(f"Accuracy: {acc:.4f}")
        print(classification_report(y_test, y_pred, target_names=self.label_encoder.classes_))
        return y_pred
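
The objective in `optimize_model` scores each trial on the training rows. One possible alternative, sketched here under assumptions, is to score candidates with cross-validation. The example uses synthetic data and a single hand-picked parameter set; `training_accuracy` and `cv_accuracy` are hypothetical helper names, not part of the class above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Synthetic three-class data standing in for the survey features (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

def training_accuracy(params):
    """Score a model on the rows it was fitted on (what the notebook does)."""
    model = RandomForestClassifier(random_state=42, **params).fit(X, y)
    return accuracy_score(y, model.predict(X))

def cv_accuracy(params, folds=5):
    """Score a model by mean accuracy over held-out folds."""
    model = RandomForestClassifier(random_state=42, **params)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()

params = {"n_estimators": 50, "max_depth": None}  # fully grown trees
train_acc = training_accuracy(params)
cv_acc = cv_accuracy(params)
print(f"training accuracy={train_acc:.3f}, cross-validated accuracy={cv_acc:.3f}")
```

Returning the cross-validated score from an Optuna objective would let the search compare parameter sets on held-out folds, at the cost of fitting each candidate several times.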



In [9]:
# Load Excel data
file_path = "Olimpiada Martie 2025.xlsx"
df = pd.read_excel(file_path, sheet_name="Date_seniori")

In [10]:
# Initialize and run model
model = OccupationalStatusModel(df)
model.preprocess()
X_train, X_test, y_train, y_test = model.prepare_data()
model.train_models(X_train, y_train)
results = model.evaluate_models(X_test, y_test)


[I 2025-04-08 07:45:59,288] A new study created in memory with name: no-name-3b9a71f7-8ec0-4d47-8488-4af2f0dcb01f
[I 2025-04-08 07:46:01,187] Trial 0 finished with value: 0.8604166666666667 and parameters: {'max_depth': 7, 'learning_rate': 0.09814729608847703, 'n_estimators': 127, 'subsample': 0.9691734896715636}. Best is trial 0 with value: 0.8604166666666667.
[I 2025-04-08 07:46:01,646] Trial 1 finished with value: 0.8410416666666667 and parameters: {'max_depth': 5, 'learning_rate': 0.03516059767654904, 'n_estimators': 91, 'subsample': 0.8625579367927099}. Best is trial 0 with value: 0.8604166666666667.
[I 2025-04-08 07:46:03,179] Trial 2 finished with value: 0.8675 and parameters: {'max_depth': 10, 'learning_rate': 0.14658763397025557, 'n_estimators': 169, 'subsample': 0.5040905117875293}. Best is trial 2 with value: 0.8675.
[I 2025-04-08 07:46:03,739] Trial 3 finished with value: 0.8427083333333333 and parameters: {'max_depth': 5, 'learning_rate': 0.050451452347426616, 'n_estimator

Best trial for xgb: {'max_depth': 7, 'learning_rate': 0.29708769890255043, 'n_estimators': 188, 'subsample': 0.933169452251674}

Training XGB...


[I 2025-04-08 07:47:04,714] A new study created in memory with name: no-name-309f1e76-fe5e-4e1d-a5c5-c36953c533fc
[I 2025-04-08 07:47:05,455] Trial 0 finished with value: 0.8347916666666667 and parameters: {'n_estimators': 180, 'max_depth': 11, 'min_samples_split': 9, 'min_samples_leaf': 10}. Best is trial 0 with value: 0.8347916666666667.
[I 2025-04-08 07:47:06,093] Trial 1 finished with value: 0.8360416666666667 and parameters: {'n_estimators': 136, 'max_depth': 12, 'min_samples_split': 3, 'min_samples_leaf': 6}. Best is trial 1 with value: 0.8360416666666667.
[I 2025-04-08 07:47:06,592] Trial 2 finished with value: 0.8352083333333333 and parameters: {'n_estimators': 119, 'max_depth': 12, 'min_samples_split': 2, 'min_samples_leaf': 9}. Best is trial 1 with value: 0.8360416666666667.
[I 2025-04-08 07:47:07,003] Trial 3 finished with value: 0.8360416666666667 and parameters: {'n_estimators': 88, 'max_depth': 11, 'min_samples_split': 10, 'min_samples_leaf': 5}. Best is trial 1 with valu

Best trial for rf: {'n_estimators': 200, 'max_depth': 14, 'min_samples_split': 4, 'min_samples_leaf': 1}

Training RF...


[I 2025-04-08 07:47:26,866] A new study created in memory with name: no-name-34170c10-929e-4213-9ad8-d6339e27d6a2
[I 2025-04-08 07:47:28,723] Trial 0 finished with value: 0.840625 and parameters: {'C': 7.393768252252545}. Best is trial 0 with value: 0.840625.
[I 2025-04-08 07:47:30,117] Trial 1 finished with value: 0.8402083333333333 and parameters: {'C': 2.7183886942922575}. Best is trial 0 with value: 0.840625.
[I 2025-04-08 07:47:31,514] Trial 2 finished with value: 0.8397916666666667 and parameters: {'C': 1.9203872356149405}. Best is trial 0 with value: 0.840625.
[I 2025-04-08 07:47:32,934] Trial 3 finished with value: 0.8404166666666667 and parameters: {'C': 3.472795743336113}. Best is trial 0 with value: 0.840625.
[I 2025-04-08 07:47:34,371] Trial 4 finished with value: 0.8404166666666667 and parameters: {'C': 2.9898711668516715}. Best is trial 0 with value: 0.840625.
[I 2025-04-08 07:47:35,666] Trial 5 finished with value: 0.8402083333333333 and parameters: {'C': 1.1547899252830

Best trial for svm: {'C': 7.393768252252545}

Training SVM...


[I 2025-04-08 07:48:19,771] A new study created in memory with name: no-name-a3a1baba-158d-4a66-a8ce-62e56eb58418
[I 2025-04-08 07:48:19,964] Trial 0 finished with value: 0.839375 and parameters: {'C': 7.696389212430215}. Best is trial 0 with value: 0.839375.
[I 2025-04-08 07:48:20,097] Trial 1 finished with value: 0.8383333333333334 and parameters: {'C': 2.9156153391409347}. Best is trial 0 with value: 0.839375.
[I 2025-04-08 07:48:20,303] Trial 2 finished with value: 0.839375 and parameters: {'C': 9.192958085557965}. Best is trial 0 with value: 0.839375.
[I 2025-04-08 07:48:20,476] Trial 3 finished with value: 0.839375 and parameters: {'C': 7.445299155601977}. Best is trial 0 with value: 0.839375.
[I 2025-04-08 07:48:20,608] Trial 4 finished with value: 0.8391666666666666 and parameters: {'C': 5.769834795718776}. Best is trial 0 with value: 0.839375.
[I 2025-04-08 07:48:20,747] Trial 5 finished with value: 0.8389583333333334 and parameters: {'C': 4.403547165627947}. Best is trial 0 w

Best trial for logreg: {'C': 5.834268606798109}

Training LOGREG...


[I 2025-04-08 07:48:24,843] A new study created in memory with name: no-name-de72503f-696b-4364-83ad-71e1574f2a5e
[I 2025-04-08 07:48:45,502] Trial 0 finished with value: 0.8760416666666667 and parameters: {'n_estimators': 158, 'learning_rate': 0.157005251956179, 'max_depth': 8}. Best is trial 0 with value: 0.8760416666666667.
[I 2025-04-08 07:48:51,282] Trial 1 finished with value: 0.8758333333333334 and parameters: {'n_estimators': 57, 'learning_rate': 0.21371616875055008, 'max_depth': 7}. Best is trial 0 with value: 0.8760416666666667.
[I 2025-04-08 07:48:59,234] Trial 2 finished with value: 0.875625 and parameters: {'n_estimators': 150, 'learning_rate': 0.27660168384120665, 'max_depth': 5}. Best is trial 0 with value: 0.8760416666666667.
[I 2025-04-08 07:49:26,312] Trial 3 finished with value: 0.8760416666666667 and parameters: {'n_estimators': 123, 'learning_rate': 0.1813248153556131, 'max_depth': 10}. Best is trial 0 with value: 0.8760416666666667.
[I 2025-04-08 07:49:39,737] Tri

Best trial for gb: {'n_estimators': 158, 'learning_rate': 0.157005251956179, 'max_depth': 8}

Training GB...


[I 2025-04-08 07:57:16,492] A new study created in memory with name: no-name-75182d17-ff49-4979-b60c-1091f8a4735b
[I 2025-04-08 07:57:16,847] Trial 0 finished with value: 0.8133333333333334 and parameters: {'n_estimators': 89, 'learning_rate': 0.5864894040377142}. Best is trial 0 with value: 0.8133333333333334.
[I 2025-04-08 07:57:17,305] Trial 1 finished with value: 0.8289583333333334 and parameters: {'n_estimators': 115, 'learning_rate': 0.5427497261768822}. Best is trial 1 with value: 0.8289583333333334.
[I 2025-04-08 07:57:17,873] Trial 2 finished with value: 0.8310416666666667 and parameters: {'n_estimators': 143, 'learning_rate': 0.8571632350516035}. Best is trial 2 with value: 0.8310416666666667.
[I 2025-04-08 07:57:18,629] Trial 3 finished with value: 0.8220833333333334 and parameters: {'n_estimators': 191, 'learning_rate': 0.5972819441776357}. Best is trial 2 with value: 0.8310416666666667.
[I 2025-04-08 07:57:18,914] Trial 4 finished with value: 0.8175 and parameters: {'n_est

Best trial for ada: {'n_estimators': 143, 'learning_rate': 0.8571632350516035}

Training ADA...

XGB Results:
Accuracy: 0.8283, Precision: 0.7843, Recall: 0.8283, F1: 0.8006
              precision    recall  f1-score   support

    Disabled       0.00      0.00      0.00         7
    Employed       0.84      0.93      0.88       774
   Housework       0.33      0.10      0.16        77
       Other       0.00      0.00      0.00        17
     Retired       0.85      0.86      0.85       173
     Student       0.94      0.94      0.94       120
  Unemployed       0.18      0.09      0.12        32

    accuracy                           0.83      1200
   macro avg       0.45      0.42      0.42      1200
weighted avg       0.78      0.83      0.80      1200


RF Results:
Accuracy: 0.8325, Precision: 0.7702, Recall: 0.8325, F1: 0.7919
              precision    recall  f1-score   support

    Disabled       0.00      0.00      0.00         7
    Employed       0.83      0.95      0.89

In [11]:
# Build and evaluate stacked model
stacked_model = model.build_stacked_model()
stk_pred = model.predict_stacked(stacked_model, X_train, X_test, y_train, y_test)


STACKED MODEL RESULTS
Accuracy: 0.8433
              precision    recall  f1-score   support

    Disabled       0.00      0.00      0.00         7
    Employed       0.83      0.96      0.89       774
   Housework       0.40      0.03      0.05        77
       Other       0.00      0.00      0.00        17
     Retired       0.85      0.87      0.86       173
     Student       0.96      0.97      0.96       120
  Unemployed       0.00      0.00      0.00        32

    accuracy                           0.84      1200
   macro avg       0.43      0.40      0.39      1200
weighted avg       0.78      0.84      0.80      1200



In [12]:
# Display model accuracy results
acc_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
acc_df = acc_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)
print("\n Model Accuracy Comparison Table:")
print(acc_df.to_string(index=False))



 Model Accuracy Comparison Table:
 Model  Accuracy
logreg  0.841667
   svm  0.841667
   ada  0.840000
    rf  0.832500
   xgb  0.828333
    gb  0.815833
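
For context, the accuracies above can be compared with a majority-class baseline. The sketch below takes the class supports from the classification reports and the accuracies from the recorded output; the baseline itself is an added point of comparison, not something the notebook computes.

```python
# Test-set class counts, copied from the "support" column of the reports.
support = {
    "Disabled": 7, "Employed": 774, "Housework": 77, "Other": 17,
    "Retired": 173, "Student": 120, "Unemployed": 32,
}

# Accuracies copied from the recorded output above.
reported = {
    "stacked": 0.8433, "logreg": 0.841667, "svm": 0.841667,
    "ada": 0.840000, "rf": 0.832500, "xgb": 0.828333, "gb": 0.815833,
}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total  # always predict the largest class

print(f"Baseline (always '{majority_class}'): {baseline_accuracy:.4f}")
for name, acc in reported.items():
    print(f"{name:>8}: {acc:.4f}  (+{acc - baseline_accuracy:.4f} over baseline)")
```

Always predicting "Employed" would already score 0.645 on this test split, so the models' gains are best read relative to that figure.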
