## Machine Learning Model Development

- In this section, we develop and evaluate machine learning models that predict life expectancy from public health metrics, environmental factors, and demographic data. The process is systematic: data preparation, model training, and evaluation, followed by model tuning and a final evaluation.




In [8]:
# importing libraries and prepared dataset

import pandas as pd
from sklearn.model_selection import train_test_split

data = "/Users/alexandreribeiro/Documents/GitHub/final_project/notebooks/dataset_for_ml.csv"

df = pd.read_csv(data)

df.sample(5)

| | population_city | greenspacearea_km2 | AQI | adjusted_obesity_rate | adjusted_smoking_rate | adjusted_copd_rate | adjusted_depression_rate | adjusted_life_expectancy |
|---|---|---|---|---|---|---|---|---|
| 195 | 1.76886 | 0.827357 | -0.614014 | 2.072841 | 1.442063 | 1.678737 | 1.243903 | 1.710293 |
| 1794 | -0.405541 | 2.261801 | -2.092229 | 1.224708 | 1.272402 | 1.4787 | 1.089777 | 0.768165 |
| 339 | 1.094393 | 0.212659 | 1.338461 | 0.634357 | 0.925072 | 0.656073 | 0.627727 | 0.657804 |
| 601 | 0.546686 | 0.410107 | 2.662695 | -1.131691 | -1.123796 | -1.119047 | -1.080786 | -1.039814 |
| 977 | 0.087315 | 0.008053 | -0.213665 | 0.073358 | -0.077285 | -0.46844 | -0.644293 | 0.359496 |


#### Defining features and target variable

- The first step in developing a machine learning model is to define the features and the target variable. Here the target is adjusted life expectancy (`adjusted_life_expectancy`), and the features are the public health metrics, environmental factors, and demographic data listed in the cell below.

In [10]:
# Selected features

selected_features = ['population_city', 'greenspacearea_km2', 'AQI', 'adjusted_obesity_rate',
       'adjusted_smoking_rate', 'adjusted_copd_rate',
       'adjusted_depression_rate']

# Define the target variable (y) and the feature set (X)

y = df['adjusted_life_expectancy']  # Target variable
X = df[selected_features]  # Using the selected features

# Display the shapes of X and y

X.shape, y.shape

((3238, 7), (3238,))

#### Splitting the data into training and testing sets

- We split the dataset into training and testing sets so that each model is trained on one portion and evaluated on another portion it has not seen.

- `test_size=0.2`: 20% of the data is held out for testing, and the remaining 80% is used for training.
- `random_state=42`: fixing the random state makes the split reproducible.

In [12]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (2590, 7)
X_test shape: (648, 7)
y_train shape: (2590,)
y_test shape: (648,)
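
The sizes above follow from how `train_test_split` rounds a fractional `test_size`: the test set gets ceil(0.2 × 3238) = 648 rows and the training set gets the remaining 2590. The sketch below reproduces that arithmetic and illustrates what `random_state` does. It is a minimal illustration rather than part of the original notebook: `row_ids` is a hypothetical array of row numbers standing in for the real `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset: one id per row (3238 rows, as in X)
row_ids = np.arange(3238)

# Same seed twice, then a different seed
train_a, test_a = train_test_split(row_ids, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(row_ids, test_size=0.2, random_state=42)
train_c, test_c = train_test_split(row_ids, test_size=0.2, random_state=0)

print("Train/test sizes:", len(train_a), len(test_a))
print("Same seed, same test rows:", np.array_equal(test_a, test_b))
print("Different seed, same test rows:", np.array_equal(test_a, test_c))
```

Keeping the same seed across experiments means every model is scored on the same held-out rows, which makes their metrics directly comparable.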


#### Model Selection

- We will start with a few common regression models, because we are predicting a continuous target variable (adjusted life expectancy).

We’ll try these models:

- Linear Regression
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- Support Vector Regressor (SVR)

- After fitting each model, we’ll compare their performance on the test set (MAE, MSE, and R²) to choose the best one.
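
For reference, the three metrics used in the comparison have the standard definitions below, which is what scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` compute with their default arguments. The symbols are introduced here for illustration: $n$ is the number of test rows, $y_i$ the true value, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean of the true values in the test set.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
```

Lower MAE and MSE are better, while an R² closer to 1 is better. MSE penalizes large errors more heavily than MAE because the residuals are squared.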

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Define the models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "SVR": SVR()
}

# Dictionary to store the results
results = {}

# Train and evaluate each model
for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Store the results
    results[model_name] = {
        "MAE": mae,
        "MSE": mse,
        "R²": r2
    }

    print(f"{model_name} results:")
    print(f"  - MAE: {mae:.3f}")
    print(f"  - MSE: {mse:.3f}")
    print(f"  - R²: {r2:.3f}\n")

# Convert the results to a DataFrame for easier comparison
import pandas as pd
results_df = pd.DataFrame(results).T
print(results_df)

Linear Regression results:
  - MAE: 0.149
  - MSE: 0.048
  - R²: 0.957

Decision Tree results:
  - MAE: 0.065
  - MSE: 0.026
  - R²: 0.976

Random Forest results:
  - MAE: 0.065
  - MSE: 0.017
  - R²: 0.985

Gradient Boosting results:
  - MAE: 0.121
  - MSE: 0.031
  - R²: 0.972

SVR results:
  - MAE: 0.125
  - MSE: 0.039
  - R²: 0.965

                        MAE       MSE        R²
Linear Regression  0.148989  0.047983  0.956544
Decision Tree      0.065243  0.026446  0.976049
Random Forest      0.064775  0.016573  0.984990
Gradient Boosting  0.121096  0.030596  0.972290
SVR                0.125336  0.039029  0.964653
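
These scores come from a single 80/20 split, so the ranking between close models can depend on which rows happen to land in the test set. One common way to check this is k-fold cross-validation. The sketch below is not part of the original notebook: it uses synthetic data from `make_regression` as a stand-in for `X` and `y`, and the 5-fold setting, the two candidate models, and the names `X_syn`, `y_syn`, and `cv_summary` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real feature matrix and target (assumption)
X_syn, y_syn = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)

# Shuffled 5-fold split with a fixed seed so the folds are reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, estimator in candidates.items():
    # scoring="r2" returns one R² value per fold
    fold_scores = cross_val_score(estimator, X_syn, y_syn, cv=cv, scoring="r2")
    cv_summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: R² = {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```

With the real data, `X_syn` and `y_syn` would be replaced by the training portion, leaving the test set untouched for the final evaluation. The per-fold standard deviation indicates how much of a gap between two models is likely to be noise.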


In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import pandas as pd

# Define the models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "SVR": SVR()
}

# List to store the results (one dictionary per model)
results = []

# Train and evaluate each model
for model_name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    # Store the results in the list
    results.append({
        "Model": model_name,
        "MAE": mae,
        "MSE": mse,
        "R²": r2
    })

# Convert the results to a DataFrame for easier comparison
results_df = pd.DataFrame(results)
results_df

| | Model | MAE | MSE | R² |
|---|---|---|---|---|
| 0 | Linear Regression | 0.148989 | 0.047983 | 0.956544 |
| 1 | Decision Tree | 0.064775 | 0.026647 | 0.975867 |
| 2 | Random Forest | 0.064788 | 0.016654 | 0.984917 |
| 3 | Gradient Boosting | 0.12107 | 0.030654 | 0.972238 |
| 4 | SVR | 0.125336 | 0.039029 | 0.964653 |
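
The two runs agree exactly for Linear Regression and SVR but differ slightly for the tree-based models (for example, the Decision Tree MSE is 0.026446 in the first run and 0.026647 in the second). A likely explanation is that the tree-based estimators are created without a `random_state`, so their internal randomness changes from run to run. The sketch below shows, on synthetic data, that fixing the seed makes a fit repeatable. It is an illustration under assumptions: the data comes from `make_regression`, and `seeded_predictions` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption)
X_syn, y_syn = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

def seeded_predictions(seed):
    """Fit a random forest with a fixed seed and return its test predictions."""
    forest = RandomForestRegressor(n_estimators=50, random_state=seed)
    forest.fit(X_tr, y_tr)
    return forest.predict(X_te)

# Two independent fits with the same seed
preds_first = seeded_predictions(42)
preds_second = seeded_predictions(42)

identical = np.array_equal(preds_first, preds_second)
print("Identical predictions with the same seed:", identical)
```

Passing `random_state` to `DecisionTreeRegressor`, `RandomForestRegressor`, and `GradientBoostingRegressor` in the `models` dictionary would make the comparison table repeatable in the same way, although the exact numbers would differ from those shown above.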
