## 📥 Data Loading

The CSV file containing the socioeconomic indicators is loaded, and a preview of the DataFrame is displayed.

In [4]:
import pandas as pd

In [5]:
try:
    df = pd.read_csv(r'C:\Users\Soporte-NuCom\Desktop\DOCUMENTOS VARIOS\DRAGON\MATERIAS\FUNDAMENTOS CIENCIA DATOS\WWBI_InclusionFinanciera4\data\DataEngineering_curated\analisis_final\dataset_indicadores_latam_2013_2017_sin_venezuela.csv', encoding='latin-1')
    print(df.shape)
    display(df.head())
except FileNotFoundError:
    print("Error: File not found.")
except pd.errors.ParserError:
    print("Error: Could not parse the file. Check the file format or encoding.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

(80, 68)


Unnamed: 0,Country Name,Female to male wage ratio in the private sector (using mean),Female to male wage ratio in the private sector (using median),Female to male wage ratio in the public sector (using mean),Female to male wage ratio in the public sector (using median),"Females, as a share of private paid employees","Females, as a share of private paid employees by occupational group: Clerks","Females, as a share of private paid employees by occupational group: Elementary occupation","Females, as a share of private paid employees by occupational group: Managers","Females, as a share of private paid employees by occupational group: Professionals",...,"Public sector wage premium for females, by industry: Education (compared to paid wage employees)","Public sector wage premium for females, by industry: Health (compared to paid wage employees)","Public sector wage premium for females, by occupation: Medical workers (compared to paid wage employees)","Public sector wage premium for females, by occupation: Teachers (compared to paid wage employees)","Public sector wage premium for males, by industry: Education (compared to paid wage employees)","Public sector wage premium for males, by industry: Health (compared to paid wage employees)","Public sector wage premium for males, by occupation: Medical workers (compared to paid wage employees)","Public sector wage premium for males, by occupation: Teachers (compared to paid wage employees)","Public sector wage premium, by gender: Female (compared to all private employees)","Public sector wage premium, by gender: Male (compared to all private employees)"
0,Argentina,0.675478,0.652174,0.852742,0.833333,0.424759,,0.095658,0.313095,0.437079,...,0.041243,0.040041,-2.149905,-2.149905,-0.144038,0.0159,-2.149905,-2.149905,0.390397,0.017744
1,Bolivia,0.732018,0.686243,0.808738,0.861087,0.307266,0.550602,0.351888,0.358105,0.47555,...,0.169259,0.242015,0.38784,0.33433,0.116933,0.205743,0.367138,0.099831,0.319806,0.095851
2,Brazil,0.741663,0.764864,0.714107,0.729973,0.428903,0.620984,0.010419,0.392691,0.540284,...,0.048058,-0.204508,0.136523,0.069629,-0.116306,-0.223518,0.015326,-0.042621,0.170618,0.189598
3,Chile,0.744976,0.766667,0.807607,0.83868,0.42598,0.627146,0.52002,0.328217,0.499085,...,0.013377,-0.039521,0.021776,-0.005614,-0.220607,-0.056611,0.044908,-0.146293,0.158985,0.072197
4,Colombia,0.876958,0.917658,0.998386,1.023948,0.41373,0.572295,0.090592,0.557941,0.243795,...,0.659146,0.365168,,,0.251482,0.33644,,,0.849002,0.488014
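
The `Unnamed: 0` column in the preview is what pandas typically produces when a CSV was saved together with its index and is read back without `index_col`. The following is a minimal sketch of that behavior, using a small made-up CSV string in place of the real file; the column `indicator` and its values are illustrative assumptions, not data from the dataset.

```python
import io

import pandas as pd

# Tiny stand-in CSV (made-up values) whose first header field is empty,
# as happens when a DataFrame is saved together with its index.
csv_text = ",Country Name,indicator\n0,Argentina,0.5\n1,Bolivia,0.7\n"

# Default read: the unnamed index column becomes a regular data column.
df_default = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 uses that column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Whether to read the real file this way depends on whether that column carries any meaning; if it is only a saved row index, keeping it out of the features is usually the safer choice.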


## 🔀 Data preparation

### Subtask:
Prepare the data for modeling.

**Reasoning**:
Drop the 'Country Name' column, separate the target variable, and handle missing values in the features.

In [6]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

In [7]:
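# Drop the 'Country Name' identifier column.
# Note: the 'Unnamed: 0' index column from the CSV is not dropped here, so it stays in X as a feature.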
df = df.drop(columns=['Country Name'])

In [8]:
# Identify the target variable
target_column = 'Females, as a share of private paid employees by occupational group: Managers'
y = df[target_column]
X = df.drop(columns=[target_column])

In [9]:
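# Handle missing values using mean imputation.
# Note: the imputer is fitted on all rows before the train/test split, so test rows contribute to the column means.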
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
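
To make the imputation step concrete, here is a small sketch on a toy frame with made-up values (the columns `a` and `b` are hypothetical). It shows that `SimpleImputer` with `strategy='mean'` replaces each missing entry with the mean of the observed values in the same column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (made-up values) with one missing entry per column.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Each NaN is replaced by the mean of the observed values in its column.
print(imputer.statistics_)  # per-column means learned during fit
print(toy_imputed)
```

Only the feature matrix is imputed this way; the target `y` is left untouched by this step.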

## Data splitting

### Subtask:
Split the data into training and testing sets.

**Reasoning**:
Split the data into training and testing sets using train_test_split.

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
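
With 80 rows and `test_size=0.2`, the split leaves 64 rows for training and 16 for testing. The sketch below reproduces those sizes on synthetic data and shows an alternative ordering, assumed here for illustration rather than taken from the notebook, in which the imputer is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in with the same number of rows as the dataset (80).
X_demo = pd.DataFrame(rng.normal(size=(80, 4)), columns=list("abcd"))
X_demo.iloc[::7, 1] = np.nan  # inject some missing values into column "b"
y_demo = pd.Series(rng.normal(size=80))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the imputer on the training rows only, then reuse it on the test rows.
imp = SimpleImputer(strategy="mean")
X_tr_imp = pd.DataFrame(imp.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_imp = pd.DataFrame(imp.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_imp.shape, X_te_imp.shape)
```

Fitting on the training rows alone keeps the test rows from influencing the imputed values, which makes the test metrics a cleaner estimate of out-of-sample error.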

In [12]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

## 🔎 Feature Engineering

### Subtask:
Apply feature selection and polynomial transformation to the training and testing data.

**Reasoning**:
Apply feature selection using SelectKBest and then apply polynomial transformation to the selected features.

In [13]:
# Select top 10 features
selector = SelectKBest(score_func=f_regression, k=10)

In [14]:
# Fit the selector on the training data
X_train_selected = selector.fit_transform(X_train, y_train)

In [15]:
# Transform the testing data with the selector fitted on the training data
X_test_selected = selector.transform(X_test)
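
`SelectKBest` returns a plain array, so the names of the retained columns are lost unless they are recovered from the selector. A minimal sketch on synthetic data follows; the feature names `f0` to `f5` and the target construction are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic stand-in: 64 rows, 6 features; only "f0" and "f1" drive the target.
X_demo = pd.DataFrame(rng.normal(size=(64, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 3.0 * X_demo["f0"] - 2.0 * X_demo["f1"] + rng.normal(scale=0.1, size=64)

selector = SelectKBest(score_func=f_regression, k=2)
X_sel = selector.fit_transform(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected_names = X_demo.columns[selector.get_support()].tolist()
print(selected_names)
```

Applied to the notebook's own `selector` and `X_train`, the same mask would list which 10 indicators were kept.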

In [16]:
# Apply polynomial transformation
poly = PolynomialFeatures(degree=2, include_bias=False)

In [17]:
# Fit and transform the training data (named X_train_pca because PCA is applied to it in a later cell)
X_train_pca = poly.fit_transform(X_train_selected)

In [18]:
# Transform the testing data with the fitted polynomial transformer
X_test_pca = poly.transform(X_test_selected)

## Dimensionality reduction with PCA

PCA is used to reduce the dimensionality of the polynomial feature set while retaining 95% of the variance.

In [19]:

from sklearn.decomposition import PCA

In [20]:
# Apply PCA after polynomial transformation
pca = PCA(n_components=0.95)  # Retain 95% of variance
X_train_pca = pca.fit_transform(X_train_pca)
X_test_pca = pca.transform(X_test_pca)
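
Passing a float between 0 and 1 as `n_components` makes PCA keep the smallest number of components whose cumulative explained variance reaches that fraction. The following sketch, on synthetic data built from three assumed latent factors, shows how to inspect how many components were kept and how much variance they retain.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 64 rows, 20 columns driven by 3 latent factors plus small noise.
latent = rng.normal(size=(64, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(64, 20))

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo)

print(pca.n_components_)                     # number of components kept
print(pca.explained_variance_ratio_.sum())   # cumulative variance retained
```

The same two attributes on the notebook's `pca` object would show how many of the polynomial columns survive the reduction. PCA is sensitive to feature scale, and in this notebook it runs before the `StandardScaler` step of the pipelines.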

## Model training

### Subtask:
Train a Lasso and a RandomForestRegressor model using pipelines.


In [21]:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

In [22]:
# Create pipelines
lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),  # z-score standardization
    ('lasso', Lasso())  # alpha is tuned with GridSearchCV in the cells below
])

In [23]:

from sklearn.model_selection import GridSearchCV

In [24]:
# Define the parameter grid for Lasso
param_grid_lasso = {'lasso__alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10]}

In [25]:
# Create GridSearchCV object for Lasso
grid_search_lasso = GridSearchCV(lasso_pipeline, param_grid_lasso, scoring='neg_mean_squared_error', cv=5, verbose=4)

In [26]:
# Fit the GridSearchCV object to the training data
grid_search_lasso.fit(X_train_pca, y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits
[CV 1/5] END ...............lasso__alpha=1e-05;, score=-0.002 total time=   0.0s
[CV 2/5] END ...............lasso__alpha=1e-05;, score=-0.004 total time=   0.0s
[CV 3/5] END ...............lasso__alpha=1e-05;, score=-0.004 total time=   0.0s
[CV 4/5] END ...............lasso__alpha=1e-05;, score=-0.002 total time=   0.0s
[CV 5/5] END ...............lasso__alpha=1e-05;, score=-0.004 total time=   0.0s
[CV 1/5] END ..............lasso__alpha=0.0001;, score=-0.002 total time=   0.0s
[CV 2/5] END ..............lasso__alpha=0.0001;, score=-0.004 total time=   0.0s
[CV 3/5] END ..............lasso__alpha=0.0001;, score=-0.004 total time=   0.0s
[CV 4/5] END ..............lasso__alpha=0.0001;, score=-0.002 total time=   0.0s
[CV 5/5] END ..............lasso__alpha=0.0001;, score=-0.004 total time=   0.0s
[CV 1/5] END ...............lasso__alpha=0.001;, score=-0.003 total time=   0.0s
[CV 2/5] END ...............lasso__alpha=0.001;, 

In [27]:
# Get the best alpha value
best_alpha = grid_search_lasso.best_params_['lasso__alpha']

In [28]:
# Update the lasso_pipeline with the best alpha
lasso_pipeline.set_params(lasso__alpha=best_alpha)

In [29]:
print(f"Best alpha for Lasso: {best_alpha}")

Best alpha for Lasso: 0.001
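
In the cells above, feature selection, the polynomial expansion, and PCA are fitted once on the whole training split, and only the scaler and the Lasso are refitted inside each cross-validation fold. One possible alternative, sketched here on synthetic data and not taken from the notebook, is to place every preprocessing step inside the pipeline so that each fold refits all of them. The step names `select`, `poly`, and `pca`, and the `max_iter` setting, are assumptions for this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the training split: 64 rows, 20 features.
X_demo = rng.normal(size=(64, 20))
y_demo = 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] + rng.normal(scale=0.05, size=64)

# Same step order as the notebook: select -> polynomial -> PCA -> scale -> Lasso.
full_pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=10)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("pca", PCA(n_components=0.95)),
    ("scaler", StandardScaler()),
    ("lasso", Lasso(max_iter=10_000)),
])

param_grid = {"lasso__alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10]}
search = GridSearchCV(full_pipeline, param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE for the best alpha
```

With this layout the cross-validated score for each `alpha` reflects the whole preprocessing chain, at the cost of refitting it in every fold.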


In [30]:

rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestRegressor(max_depth=10, random_state=42))  # max_depth is fixed at 10; n_estimators is tuned below
])

In [31]:
# Define the parameter grid for Random Forest
param_grid_rf = {'rf__n_estimators': [50, 100, 200, 250, 300]}

In [32]:
# Create GridSearchCV object for Random Forest
grid_search_rf = GridSearchCV(rf_pipeline, param_grid_rf, scoring='neg_mean_squared_error', cv=5, verbose=4)

In [33]:
# Fit the GridSearchCV object to the training data
grid_search_rf.fit(X_train_pca, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5] END ..............rf__n_estimators=50;, score=-0.002 total time=   0.0s
[CV 2/5] END ..............rf__n_estimators=50;, score=-0.003 total time=   0.0s
[CV 3/5] END ..............rf__n_estimators=50;, score=-0.006 total time=   0.0s
[CV 4/5] END ..............rf__n_estimators=50;, score=-0.003 total time=   0.0s
[CV 5/5] END ..............rf__n_estimators=50;, score=-0.002 total time=   0.0s
[CV 1/5] END .............rf__n_estimators=100;, score=-0.002 total time=   0.0s
[CV 2/5] END .............rf__n_estimators=100;, score=-0.003 total time=   0.0s
[CV 3/5] END .............rf__n_estimators=100;, score=-0.005 total time=   0.0s
[CV 4/5] END .............rf__n_estimators=100;, score=-0.003 total time=   0.0s
[CV 5/5] END .............rf__n_estimators=100;, score=-0.002 total time=   0.0s
[CV 1/5] END .............rf__n_estimators=200;, score=-0.002 total time=   0.1s
[CV 2/5] END .............rf__n_estimators=200;, 

In [34]:
# Get the best n_estimators value
best_n_estimators = grid_search_rf.best_params_['rf__n_estimators']

In [35]:
# Update the rf_pipeline with the best n_estimators
rf_pipeline.set_params(rf__n_estimators=best_n_estimators)

In [36]:
print(f"Best n_estimators for Random Forest: {best_n_estimators}")

Best n_estimators for Random Forest: 200


In [37]:
# Train the pipelines
lasso_pipeline.fit(X_train_pca, y_train)
rf_pipeline.fit(X_train_pca, y_train)
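
`GridSearchCV` refits the best configuration on the full training data by default (`refit=True`), so `best_estimator_` can be used directly instead of calling `set_params` and `fit` again. The sketch below, on synthetic data with a shortened grid chosen for illustration, checks that both routes give the same predictions for a Lasso pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(64, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=64)

pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso())])
search = GridSearchCV(pipe, {"lasso__alpha": [1e-3, 1e-2, 1e-1]},
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already fitted on all of X_demo.
pred_from_search = search.best_estimator_.predict(X_demo)

# Equivalent manual route: copy the best parameters into the pipeline and fit it.
pipe.set_params(**search.best_params_)
pipe.fit(X_demo, y_demo)
pred_manual = pipe.predict(X_demo)

print(np.allclose(pred_from_search, pred_manual))
```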

## Model evaluation

### Subtask:
Evaluate the trained Lasso and Random Forest models and compare their performance.


**Reasoning**:
Evaluate the trained Lasso and Random Forest models using the test set and calculate the R-squared, MSE, and MAE for both models. Then compare their performance based on these metrics.

In [38]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [39]:
# Predict on the test set
y_pred_lasso = lasso_pipeline.predict(X_test_pca)
y_pred_rf = rf_pipeline.predict(X_test_pca)

In [40]:
# Calculate evaluation metrics for Lasso
r2_lasso = r2_score(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)

In [41]:
# Calculate evaluation metrics for Random Forest
r2_rf = r2_score(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)

In [42]:
# Print the evaluation metrics
print("Lasso Regression Metrics:")
print(f"R-squared: {r2_lasso}")
print(f"MSE: {mse_lasso}")
print(f"MAE: {mae_lasso}")

Lasso Regression Metrics:
R-squared: 0.8812600505289842
MSE: 0.0011366474192601225
MAE: 0.026544345318491376


In [43]:
print("\nRandom Forest Regression Metrics:")
print(f"R-squared: {r2_rf}")
print(f"MSE: {mse_rf}")
print(f"MAE: {mae_rf}")


Random Forest Regression Metrics:
R-squared: 0.7623956321235974
MSE: 0.0022744863270938976
MAE: 0.03941812790626174


In [44]:
# Compare the models
print("\nModel Comparison:")
if r2_lasso > r2_rf:
    print("Lasso Regression performs better based on R-squared.")
elif r2_rf > r2_lasso:
    print("Random Forest Regression performs better based on R-squared.")
else:
    print("Both models have the same R-squared.")


Model Comparison:
Lasso Regression performs better based on R-squared.


In [45]:
if mse_lasso < mse_rf:
    print("Lasso Regression performs better based on MSE.")
elif mse_rf < mse_lasso:
    print("Random Forest Regression performs better based on MSE.")
else:
    print("Both models have the same MSE.")

Lasso Regression performs better based on MSE.


In [46]:
if mae_lasso < mae_rf:
    print("Lasso Regression performs better based on MAE.")
elif mae_rf < mae_lasso:
    print("Random Forest Regression performs better based on MAE.")
else:
    print("Both models have the same MAE.")

Lasso Regression performs better based on MAE.
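
The three comparisons can also be collected into a single table. The sketch below copies the test-set metrics from the printed outputs above into a DataFrame and picks the better model per metric. Since the test set holds only 16 rows (20% of 80), the differences are best read as tentative.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs above.
results = pd.DataFrame(
    {
        "R2": [0.8812600505289842, 0.7623956321235974],
        "MSE": [0.0011366474192601225, 0.0022744863270938976],
        "MAE": [0.026544345318491376, 0.03941812790626174],
    },
    index=["Lasso", "Random Forest"],
)

# Higher is better for R2; lower is better for MSE and MAE.
best = {
    "R2": results["R2"].idxmax(),
    "MSE": results["MSE"].idxmin(),
    "MAE": results["MAE"].idxmin(),
}

print(results.round(4))
print(best)
```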
