# Machine Learning Model for Happiness Score Prediction
In this notebook, we will build a predictive model using selected features to predict the Happiness Score.

## Step 1: Import Libraries
We start by importing the necessary libraries.

In [31]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

## Step 2: Load and Explore the Data
We load the combined dataset and take a look at the first few rows.

In [32]:
# Load the data
combined_df = pd.read_csv('C:/Users/admin-ilinf/Documents/ETL/Workshop3_ETL/data/combined_df.csv')

# Display the first few rows
combined_df.head()

| Unnamed: 0 | Country | Happiness Rank | Score | GDP per capita | Social support | Healthy life expectancy | Freedom | Generosity | Perceptions of corruption | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.29678 | 0.41978 | 2015 |
| 1 | Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.4363 | 0.14145 | 2015 |
| 2 | Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.34139 | 0.48357 | 2015 |
| 3 | Norway | 4 | 7.522 | 1.459 | 1.33095 | 0.88521 | 0.66973 | 0.34699 | 0.36503 | 2015 |
| 4 | Canada | 5 | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.45811 | 0.32957 | 2015 |
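
Beyond `head()`, a few quick checks are usually worth running before modeling. The sketch below is only an illustration: `sample_df` is a hypothetical stand-in built from a few of the values shown above, and `summarize` is a helper name introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for combined_df, using a few values from the rows above.
sample_df = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Score": [7.587, 7.561, 7.527],
    "GDP per capita": [1.39651, 1.30232, 1.32548],
    "Year": [2015, 2015, 2015],
})

def summarize(df: pd.DataFrame) -> dict:
    """Collect basic facts worth checking before modeling."""
    return {
        "shape": df.shape,                                  # (rows, columns)
        "missing_per_column": df.isna().sum().to_dict(),    # NaN count per column
        "years": sorted(df["Year"].unique().tolist()),      # which years are present
    }

summary = summarize(sample_df)
print(summary)
```

On the real data, the same call would be `summarize(combined_df)`; any nonzero entry in `missing_per_column` would need handling before fitting a model.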


## Step 3: Select Features and Target
We will use the following features to predict the Happiness Score:
- **GDP per capita**
- **Healthy life expectancy**
- **Freedom**
- **Perceptions of corruption**
- **Generosity**

In [33]:
# Select features and target
features = combined_df[['GDP per capita', 'Healthy life expectancy', 'Freedom', 'Perceptions of corruption', 'Generosity']]
target = combined_df['Score']

## Step 4: Split the Data into Training and Testing Sets
We will split the data into 70% for training and 30% for testing.

In [34]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
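
To make the 70/30 split concrete, here is a small sketch on synthetic arrays (`X_demo` and `y_demo` are made-up placeholders, not the happiness data). It shows the resulting sizes and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Same arguments as the split above: 30% held out, fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Repeating the call with the same seed yields the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
print("Reproducible:", np.array_equal(X_te, X_te2))
```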

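## Step 5: Feature Selection
To check the choice of features, we fit a Random Forest on all available columns (categorical columns are one-hot encoded) and rank them by importance. This ranking is exploratory: it is fit on the full dataset, so it is not used to evaluate the model.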

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Preprocess the data: convert categorical columns into dummy variables.
# The saved row index ('Unnamed: 0') is dropped along with the target and rank,
# because it is ordered by rank within each year and would leak the target.
X = pd.get_dummies(combined_df.drop(columns=['Unnamed: 0', 'Score', 'Happiness Rank']), drop_first=True)
y = combined_df['Score']

# Use RandomForestRegressor to compute feature importances
model = RandomForestRegressor(random_state=42)
model.fit(X, y)

# Get the feature importances
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Show the most important features
print(feature_importance_df)
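
Impurity-based importances from `feature_importances_` are computed on the training data and can favor high-cardinality columns. As a complementary check, one option is permutation importance on a held-out set. The sketch below is a swapped-in technique, not the notebook's method, and runs on synthetic data (`X_syn`, `y_syn` are made up) where feature 0 is constructed to matter most.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 drives the target, feature 1 helps a little,
# feature 2 is pure noise.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))
y_syn = 3.0 * X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=42)

# Feature indices sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print("Ranking:", ranking)
print("Mean importances:", result.importances_mean)
```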


## Step 6: Train the Random Forest Model
We use a Random Forest Regressor to train our model.

In [35]:
# Create and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

## Step 7: Make Predictions and Evaluate the Model
We make predictions on the test set and evaluate the model's performance using Mean Squared Error (MSE) and R-squared (R²).

In [36]:
# Make predictions
y_pred = model.predict(X_test)

# Calculate MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared (R²): {r2}")

Mean Squared Error: 0.2982702445119376
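
For reference, both metrics can be computed by hand. The sketch below uses a few made-up values (`y_true`, `y_hat` are illustrative, not taken from the dataset) and checks the manual formulas against scikit-learn. It also computes the root mean squared error, which is in the same units as the score; for the MSE of about 0.298 reported above, that is roughly 0.546 points.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true scores and predictions, for illustration only.
y_true = np.array([7.5, 6.0, 5.2, 4.1])
y_hat = np.array([7.1, 6.3, 5.0, 4.6])

# MSE: average of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE: square root of the MSE, in the same units as the target.
rmse_manual = np.sqrt(mse_manual)

print(f"MSE: {mse_manual:.4f}, RMSE: {rmse_manual:.4f}, R-squared: {r2_manual:.4f}")
```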
R-squared (R²): 0.7611077459869162


## Step 8: Conclusion
We trained a Random Forest model to predict the Happiness Score from five selected features. On the 30% test set it reaches an MSE of about 0.298 and an R² of about 0.761, meaning it explains roughly 76% of the variance in the score. Further tuning and testing could improve the model.
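
One common way to pursue further optimization is a cross-validated grid search over hyperparameters. The following is a minimal sketch under stated assumptions: the data is synthetic (`make_regression` stands in for the five features), and the values in `param_grid` are illustrative, not tuned for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the happiness features (assumption: 5 numeric columns).
X_syn, y_syn = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Hypothetical grid; a real search would likely cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                                # 3-fold cross-validation on the training set
    scoring="neg_mean_squared_error",    # higher is better, so MSE is negated
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Test R-squared:", search.best_estimator_.score(X_te, y_te))
```

The test set is touched only once, after the search, so the reported score is not biased by the tuning itself.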

In [40]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Load the data
combined_df = pd.read_csv('C:/Users/admin-ilinf/Documents/ETL/Workshop3_ETL/data/combined_df.csv')

# Select the features and the target
X = combined_df[['GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom', 
                 'Generosity', 'Perceptions of corruption']]
y = combined_df['Score']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features (the scaler is fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Try several models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'SVR': SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
}

# Evaluate each model
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f'{name}:')
    print(f'  Mean Squared Error: {mse:.4f}')
    print(f'  R-squared (R²): {r2:.4f}\n')

# Apply polynomial transformations and evaluate
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

# Linear regression model with polynomial features
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_test_poly)
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)
print('Polynomial Regression (degree 2):')
print(f'  Mean Squared Error: {mse_poly:.4f}')
print(f'  R-squared (R²): {r2_poly:.4f}')


Linear Regression:
  Mean Squared Error: 0.3415
  R-squared (R²): 0.7265

Ridge Regression:
  Mean Squared Error: 0.3415
  R-squared (R²): 0.7265

Lasso Regression:
  Mean Squared Error: 0.3677
  R-squared (R²): 0.7055

Random Forest:
  Mean Squared Error: 0.2738
  R-squared (R²): 0.7807

Gradient Boosting:
  Mean Squared Error: 0.3071
  R-squared (R²): 0.7540

SVR:
  Mean Squared Error: 0.2656
  R-squared (R²): 0.7873

Polynomial Regression (degree 2):
  Mean Squared Error: 0.3256
  R-squared (R²): 0.7392


In [38]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Define the models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'ElasticNet Regression': ElasticNet(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'Extra Trees': ExtraTreesRegressor(random_state=42),
    'AdaBoost': AdaBoostRegressor(random_state=42),
    'SVR': SVR(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'KNeighbors': KNeighborsRegressor()
}

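# Train and evaluate each model
# Note: X_train_reduced and X_test_reduced are not defined in the cells shown here;
# they presumably come from an earlier cell that selected a reduced feature subset.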
for name, model in models.items():
    model.fit(X_train_reduced, y_train)
    y_pred = model.predict(X_test_reduced)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{name}: Mean Squared Error: {mse}, R-squared (R²): {r2}")


Linear Regression: Mean Squared Error: 0.3538175407748981, R-squared (R²): 0.7166184982233497
Ridge Regression: Mean Squared Error: 0.352019899050093, R-squared (R²): 0.7180582753766134
Lasso Regression: Mean Squared Error: 1.2491581361696804, R-squared (R²): -0.00048264370673489765
ElasticNet Regression: Mean Squared Error: 1.2491581361696804, R-squared (R²): -0.00048264370673489765
Random Forest: Mean Squared Error: 0.3003275091211226, R-squared (R²): 0.7594600302370809
Gradient Boosting: Mean Squared Error: 0.30982895893753887, R-squared (R²): 0.751850076496137
Extra Trees: Mean Squared Error: 0.27333333998664683, R-squared (R²): 0.7810803494891652
AdaBoost: Mean Squared Error: 0.333988846247573, R-squared (R²): 0.732499805919733
SVR: Mean Squared Error: 0.3130001052728515, R-squared (R²): 0.7493102244331604
Decision Tree: Mean Squared Error: 0.5145710868390104, R-squared (R²): 0.5878668789571009
KNeighbors: Mean Squared Error: 0.3081000396537959, R-squared (R²): 0.7532348120917908
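
The identical, near-zero R² for Lasso and ElasticNet is consistent with the default penalty (`alpha=1.0`) shrinking every coefficient to zero, which leaves a constant prediction equal to the training mean. This is a plausible explanation rather than something the notebook confirms. The sketch below reproduces the effect on synthetic data with features on a similar small scale (`X_syn`, `y_syn` are made up).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic features in [0, 1], similar in scale to the happiness indicators.
rng = np.random.default_rng(42)
X_syn = rng.uniform(0.0, 1.0, size=(200, 3))
y_syn = 5.0 + 2.0 * X_syn[:, 0] + rng.normal(scale=0.3, size=200)

# Default-strength penalties: strong relative to the feature scale.
strong_lasso = Lasso(alpha=1.0).fit(X_syn, y_syn)
strong_enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_syn, y_syn)

# A much weaker penalty keeps the informative coefficient.
weak_lasso = Lasso(alpha=0.01).fit(X_syn, y_syn)

print("Lasso(alpha=1.0) coefficients:     ", strong_lasso.coef_)
print("ElasticNet(alpha=1.0) coefficients:", strong_enet.coef_)
print("Lasso(alpha=0.01) coefficients:    ", weak_lasso.coef_)
print("Intercept vs. mean of y:", strong_lasso.intercept_, y_syn.mean())
```

If the same happens on the real data, a smaller `alpha` (for example the `alpha=0.1` used in the earlier comparison) or a cross-validated choice such as `LassoCV` would be worth trying.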


In [41]:
from sklearn.svm import SVR
import pickle

# Define and train the SVR model
# Note: this is a default SVR fit on the reduced features,
# not the tuned SVR (C=100, gamma=0.1) from the earlier comparison.
svr_model = SVR()
svr_model.fit(X_train_reduced, y_train)

# Save the SVR model to a .pkl file
with open('svr_model.pkl', 'wb') as file:
    pickle.dump(svr_model, file)

print("SVR model saved successfully to 'svr_model.pkl'.")


SVR model saved successfully to 'svr_model.pkl'.
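
To reuse a saved model later, it is loaded with `pickle.load` and called like the original estimator. The sketch below is a self-contained round trip on synthetic data (`X_syn`, `y_syn` are placeholders) and writes to a temporary directory rather than the notebook's working directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0.0, 1.5, size=(50, 5))
y_syn = 3.0 + X_syn.sum(axis=1)

fitted = SVR().fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "svr_model.pkl")

    # Save the fitted model.
    with open(model_path, "wb") as file:
        pickle.dump(fitted, file)

    # Load it back as a new object.
    with open(model_path, "rb") as file:
        loaded = pickle.load(file)

# The loaded model should reproduce the original predictions.
same_predictions = np.allclose(fitted.predict(X_syn), loaded.predict(X_syn))
print("Predictions match:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them. New inputs must also have the same columns, in the same order, as the training data.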
