# Data analysis

In [6]:
# Installing required packages
!pip3 install pandas
!pip3 install xgboost
!pip3 install scikit-learn


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m


In [21]:
# Importing libraries
import pandas as pd

from xgboost import XGBRegressor # regression model for predicting a continuous target from the encoded features
from sklearn.preprocessing import OneHotEncoder # encodes categorical variables as numeric (0/1) columns
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV

from sklearn.datasets import make_classification
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel

### Load the data:

In [8]:
data = pd.read_csv("/Users/justina/Desktop/Data Science/data/retirement_data.csv")

### XGBoost

The code below loosely follows these tutorials:
1. https://xgboosting.com/one-hot-encode-categorical-features-for-xgboost/#:~:text=One%2Dhot%20encoding%20is%20a,before%20training%20an%20XGBoost%20model. [one-hot encoding for categorical variables].
2. https://xgboosting.com/encode-categorical-features-as-dummy-variables-for-xgboost/ [used for some interpretation of tutorial 1 (above)].
3. https://xgboost.readthedocs.io/en/latest/python/python_api.html# [contains information about the XGBRegressor and possible parameters].
4. https://dev.to/uche_4rm_germany/grid-and-randomized-hyperparameter-optimization-for-xgboost-algorithms-159k [hyperparameter tuning].

In [9]:
# [1] Separating features and target:

X = data.drop(["age_ret", "mergeid"], axis=1)  # drop the target (age of retirement) and the ID column; the ID carries no information for predicting retirement age.
y = data["age_ret"] # contains only the column of retirement age

In [None]:
X.shape # 19 columns

(22603, 19)

In [10]:
# [2] Identifying which columns are categorical and which are numerical:

categorical_c = X.select_dtypes(include=["object", "category"]).columns
numerical_c = X.select_dtypes(exclude=["object", "category"]).columns
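
To make the dtype-based split concrete, here is a minimal sketch on a made-up frame. The column names and values are hypothetical and are not taken from retirement_data.csv; the dtypes are set explicitly so the result does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical toy frame; not the real retirement data.
toy = pd.DataFrame({
    "country": pd.Series(["Austria", "Belgium", "Austria"], dtype="category"),
    "job": pd.Series(["teacher", "nurse", "clerk"], dtype="object"),
    "gender": [0, 1, 1],  # integers, so this column counts as numerical
})

# Same calls as in step [2], applied to the toy frame.
toy_categorical = toy.select_dtypes(include=["object", "category"]).columns
toy_numerical = toy.select_dtypes(exclude=["object", "category"]).columns

print(list(toy_categorical))
print(list(toy_numerical))
```

Columns whose categories are stored as integer codes would land in the numerical group, so it is worth checking the two lists by eye before encoding.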

In [11]:
# [3] Create ColumnTransformer for one-hot encoding (tutorial 1):

transformer = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), categorical_c)],
    remainder='passthrough')


# If transform fails because of categories that were not seen during fit, use OneHotEncoder(handle_unknown='ignore') instead.
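
A small sketch of what `handle_unknown='ignore'` changes, on hypothetical toy data (the "Denmark" row stands in for any category that appears only after fitting): with the default setting the encoder raises an error on an unseen category, while with `'ignore'` it encodes that row as all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the new row has a category never seen during fit.
seen = pd.DataFrame({"country": ["Austria", "Belgium"]})
unseen = pd.DataFrame({"country": ["Denmark"]})

# Default behaviour (handle_unknown="error"): transform raises a ValueError.
strict = OneHotEncoder().fit(seen)
try:
    strict.transform(unseen)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown="ignore" the unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(seen)
encoded_unseen = lenient.transform(unseen).toarray()

print(strict_failed)
print(encoded_unseen)
```

In this notebook the encoder is fitted on the full X before the train/test split, so unseen categories should not occur; the option matters once the encoder is fitted on the training data only.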

In [12]:
# [4] Perform one-hot encoding on categorical data:

X_cat = transformer.fit_transform(X[categorical_c])

In [13]:
# [5] Convert the encoded sparse matrix to a data frame, keeping the row index of X so the rows still line up in [6]:

X_cat = pd.DataFrame(X_cat.toarray(), columns=transformer.get_feature_names_out(categorical_c), index=X.index)

In [14]:
X_cat.shape # 102 columns

(22603, 102)

After this step we have 102 columns instead of the 19 in the original data set X.
One-hot encoding replaces each categorical column with one new binary column per unique category value. For example, the column "country" contains values such as "Austria" and "Belgium", so after one-hot encoding we get columns such as "encoder__country_Austria" and "encoder__country_Belgium", each holding only 0s and 1s. These are the values that will be used in the model.


In [15]:
# [6] Now we drop the original categorical columns and build a data frame that combines the encoded
# categorical columns ([5]) with the remaining numerical column (gender):

X_numeric = X.drop(categorical_c, axis=1) # leaves only one column, gender, which is already numeric.
X_transformed = pd.concat([X_numeric, X_cat], axis=1) # creates a new data frame with the columns for the XGBRegressor model.
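
One thing to keep in mind with `pd.concat(..., axis=1)` is that it matches rows by index label, not by position. The sketch below uses hypothetical toy frames to show what happens when the two indexes differ (for example after filtering rows) and one way to line them up again. With a freshly loaded CSV both frames normally share the default 0..n-1 index, so the step above works as intended.

```python
import pandas as pd

# Hypothetical toy frames: the numeric part kept a non-default index,
# the encoded part has the default index 0, 1, 2.
numeric_part = pd.DataFrame({"gender": [0, 1, 1]}, index=[10, 11, 12])
encoded_part = pd.DataFrame({"country_Austria": [1.0, 0.0, 1.0]})

# Index labels do not overlap, so concat produces 6 rows padded with NaN.
misaligned = pd.concat([numeric_part, encoded_part], axis=1)

# Giving the encoded frame the same index restores a row-by-row match.
aligned = pd.concat([numeric_part, encoded_part.set_axis(numeric_part.index)], axis=1)

print(misaligned.shape, aligned.shape)
```

Comparing the shape of the combined frame with the shape of X, as done in the next cell, is a quick way to catch this kind of mismatch.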

In [52]:
X_transformed.shape

(22603, 103)

Splitting the data into test and train:

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size = 0.2, random_state = 123)


In [None]:
# The parameters below are taken from: https://xgboosting.com/how-to-use-xgboost-xgbregressor/

# Define XGBRegressor model parameters
params = {
    'objective': 'reg:squarederror',
    'max_depth': 3, # Each tree can have up to 3 levels of splits. Lower values generalize better; higher values capture more complex patterns (risk of overfitting).
    'learning_rate': 0.1, # Also called "eta". Smaller values mean slower, more conservative learning; larger values mean faster learning.
    'n_estimators': 100, # More trees can improve accuracy but also increase training time and the risk of overfitting. A lower learning rate usually needs more trees.
    'subsample': 0.8, # 80% of the training rows are sampled for each tree. Helps reduce overfitting.
    'colsample_bytree': 0.8, # 80% of the features are sampled for each tree. Helps reduce overfitting.
    'random_state': 123 # reproducibility
}

Possible next steps:

1. Go over https://machinelearningmastery.com/xgboost-for-regression/ and try grid search / RandomizedSearchCV (some of the code below is for that, but there has not been a chance to work through it yet).
2. https://books.google.dk/books?hl=en&lr=&id=2tcDEAAAQBAJ&oi=fnd&pg=PP1&dq=xgbregressor+parameter+tuning.&ots=s5uRCntjiH&sig=4bBsG2F6wqzDk7Iw6H2ew9T4lRk&redir_esc=y#v=onepage&q=xgbregressor%20parameter%20tuning.&f=false This book introduces some of the concepts.
3. Feature importance calculation: https://stackabuse.com/bytes/get-feature-importance-from-xgbregressor-with-xgboost/
4. http://xgboost.readthedocs.io/en/latest/python/python_api.html# contains an explanation of each parameter and what we could adjust.

In [42]:
# Instantiate XGBRegressor with the parameters
model = XGBRegressor(**params)

In [43]:
model.fit(X_train, y_train)

In [44]:
y_pred = model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")

Mean Squared Error: 39.29
R-squared: 0.13
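
To read these two numbers in the units of the target, it helps to take the square root of the MSE and to compare it with a model that always predicts the mean. The sketch below only reuses the two rounded values printed above; it assumes that `age_ret` is measured in years, and the baseline figure is derived from the rounded values, so it is approximate.

```python
import math

reported_mse = 39.29  # Mean Squared Error printed above
reported_r2 = 0.13    # R-squared printed above

# RMSE is in the same units as the target (assumed to be years).
rmse = math.sqrt(reported_mse)

# R^2 = 1 - MSE / Var(y_test), so the two values imply the test-set variance.
implied_variance = reported_mse / (1 - reported_r2)

# RMSE of a baseline that always predicts the mean of y_test.
baseline_rmse = math.sqrt(implied_variance)

print(f"RMSE: {rmse:.2f}")
print(f"Implied mean-baseline RMSE: {baseline_rmse:.2f}")
```

Under these assumptions the model is off by a bit more than six years on average and improves only modestly on the mean baseline, which matches the low R-squared.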


In [None]:
# Wrap the regressor in a pipeline so that its hyperparameters can be tuned below
pipeline = Pipeline([
    ('regressor', XGBRegressor())
])

In [66]:
pipeline.fit(X_train, y_train)

In [68]:
pipeline.named_steps['regressor']

In [67]:
pipeline.score(X_test, y_test)

0.058133792131507045

In [61]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'transform_input', 'verbose', 'scaler', 'regressor', 'scaler__clip', 'scaler__copy', 'scaler__feature_range', 'regressor__objective', 'regressor__base_score', 'regressor__booster', 'regressor__callbacks', 'regressor__colsample_bylevel', 'regressor__colsample_bynode', 'regressor__colsample_bytree', 'regressor__device', 'regressor__early_stopping_rounds', 'regressor__enable_categorical', 'regressor__eval_metric', 'regressor__feature_types', 'regressor__feature_weights', 'regressor__gamma', 'regressor__grow_policy', 'regressor__importance_type', 'regressor__interaction_constraints', 'regressor__learning_rate', 'regressor__max_bin', 'regressor__max_cat_threshold', 'regressor__max_cat_to_onehot', 'regressor__max_delta_step', 'regressor__max_depth', 'regressor__max_leaves', 'regressor__min_child_weight', 'regressor__missing', 'regressor__monotone_constraints', 'regressor__multi_strategy', 'regressor__n_estimators', 'regressor__n_jobs', 'regressor__num_parallel_t
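
The keys above follow scikit-learn's naming convention for pipeline parameters: the step name, two underscores, then the parameter name of the estimator in that step. A minimal sketch of how such a key is used, with `DecisionTreeRegressor` as a lightweight stand-in for `XGBRegressor` (an assumption made only to keep the example small):

```python
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Stand-in pipeline with a single step called "regressor".
toy_pipeline = Pipeline([("regressor", DecisionTreeRegressor())])

# "<step name>__<parameter>" addresses a parameter of the estimator in that step.
toy_pipeline.set_params(regressor__max_depth=3)

print(toy_pipeline.named_steps["regressor"].max_depth)
print("regressor__max_depth" in toy_pipeline.get_params())
```

The same key format is what a hyperparameter search over a pipeline expects in its parameter grid.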

In [64]:
hyperparameter_grid = {
    'regressor__n_estimators': [100, 500, 1000, 2000],
    'regressor__max_depth': [3, 6, 9, 12],
    'regressor__learning_rate': [0.01, 0.03, 0.05, 0.1]
}


random_cv = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=hyperparameter_grid,
    cv=3,
    n_iter=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=5,
    return_train_score=True,
    random_state=42,
)



random_cv.fit(X_train, y_train)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV 1/3] END regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100;, score=(train=-6.446, test=-6.416) total time=   0.6s
[CV 2/3] END regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100;, score=(train=-6.437, test=-6.432) total time=   0.6s
[CV 1/3] END regressor__learning_rate=0.1, regressor__max_depth=6, regressor__n_estimators=100;, score=(train=-5.627, test=-6.234) total time=   0.7s
[CV 3/3] END regressor__learning_rate=0.1, regressor__max_depth=6, regressor__n_estimators=100;, score=(train=-5.554, test=-6.301) total time=   0.7s
[CV 2/3] END regressor__learning_rate=0.1, regressor__max_depth=6, regressor__n_estimators=100;, score=(train=-5.614, test=-6.272) total time=   0.7s
[CV 3/3] END regressor__learning_rate=0.01, regressor__max_depth=3, regressor__n_estimators=100;, score=(train=-6.415, test=-6.494) total time=   0.5s
[CV 1/3] END regressor__learning_rate