## Regression Model Evaluation

In this section, several regression models are trained to predict the "Happiness Score" from a set of features. The models are Linear Regression, Random Forest, Gradient Boosting, AdaBoost, Support Vector Regression (SVR), and K-Nearest Neighbors (KNN). One-hot encoding is applied to the "Country" column, missing values in the remaining feature columns are filled with the column mean, and the dataset is split into training and test sets using train_test_split. Each model is then trained and scored with the R2 Score on the test set, and the model with the highest score is kept as the best model.

### Process:
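1. **One-Hot Encoding**: The "Country" column is converted into indicator columns (0 or 1) with OneHotEncoder. All categories are kept (none is dropped), and handle_unknown='ignore' encodes a country that was not seen during training as all zeros. The remaining feature columns are mean-imputed, and "Year" is excluded from the model inputs.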
2. **Feature and Target Separation**: The features (X) and the target variable (y), which is the "Happiness Score," are separated.
3. **Data Split**: The dataset is divided into training (70%) and testing (30%) sets using train_test_split.
4. **Model Evaluation**: The following models are trained and evaluated:
   - **Linear Regression**
   - **Random Forest Regression**
   - **Gradient Boosting Regression**
   - **AdaBoost Regression**
   - **Support Vector Regression (SVR)**
   - **K-Nearest Neighbors Regression**
5. **R2 Score Calculation**: For each model, the R2 Score on the test set is calculated to measure its predictive power.
6. **Best Model Selection**: The model with the highest R2 Score is selected as the best model.
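
The encoding behaviour in step 1 can be illustrated on a toy example. This is a minimal sketch, not part of the notebook: the country names below are made up for illustration and do not come from Model_Data.csv. It shows that a category absent from the training data is encoded as a row of zeros when handle_unknown='ignore' is set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration; not taken from Model_Data.csv
train = pd.DataFrame({'Country': ['Chile', 'Japan', 'Chile', 'Norway']})
test = pd.DataFrame({'Country': ['Japan', 'Peru']})  # 'Peru' is unseen in train

# Same encoder settings as in the notebook's preprocessor
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# One indicator column per category seen during fit
categories = list(encoder.categories_[0])

# transform returns a sparse matrix by default; convert it for display
encoded = encoder.transform(test).toarray()

print(categories)
print(encoded)
```

The first test row gets a 1 in the "Japan" column, while the unseen "Peru" row contains only zeros, so the pipeline can still produce a prediction for it from the numeric features alone.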


In [None]:
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load the data
data = pd.read_csv('../data/Model_Data.csv')

# Define features and target variable
X = data.drop(columns=['Happiness Score'])
y = data['Happiness Score']

# Preprocess the data: one-hot encode 'Country' and mean-impute the other feature columns.
# No scaling is applied, and 'Year' is dropped (ColumnTransformer's default remainder='drop').
preprocessor = ColumnTransformer(
    transformers=[
        ('country', OneHotEncoder(handle_unknown='ignore'), ['Country']), 
        ('num', SimpleImputer(strategy='mean'), X.columns.difference(['Country', 'Year'])) 
    ])

# Create a dictionary of models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'AdaBoost': AdaBoostRegressor(random_state=42),
    'Support Vector Regression': SVR(),
    'K-Nearest Neighbors': KNeighborsRegressor()
}

# Split the data into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Variables to store the best model and its performance
best_model = None
best_r2 = float('-inf')  
best_model_name = "" 

# Train and evaluate each model
for model_name, model in models.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    
    pipeline.fit(X_train, y_train)
    
    y_pred = pipeline.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    
    print(f'\nModel: {model_name}')
    print(f'R^2 Score: {r2}')
    
    if r2 > best_r2:
        best_r2 = r2
        best_model = pipeline
        best_model_name = model_name

print(f"\nBest model: {best_model_name} with R2 Score = {best_r2:.4f}")



Model: Linear Regression
R^2 Score: 0.9495858523063235

Model: Random Forest
R^2 Score: 0.8486986222615859

Model: Gradient Boosting
R^2 Score: 0.842152498124545

Model: AdaBoost
R^2 Score: 0.7726456924117253

Model: Support Vector Regression
R^2 Score: 0.9237282170578003

Model: K-Nearest Neighbors
R^2 Score: 0.8937572721799011

Best model: Linear Regression with R2 Score = 0.9496
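
For reference, the R2 Score reported above is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand in plain Python. The function name r2_manual and the four sample values are hypothetical, chosen only to keep the arithmetic easy to follow; the notebook itself relies on sklearn's r2_score.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values, not taken from the dataset
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]

score = r2_manual(y_true, y_pred)
print(score)
```

Here SS_res is 0.5 and SS_tot is 5.0, so the score is 0.9. A score of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and negative values are possible for models that do worse than that. This sketch does not handle the edge case where all true values are identical (SS_tot would be zero).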


## Save the Best Model to a .pkl File

After evaluation, the best pipeline (the preprocessing steps plus the fitted regressor) is saved to a .pkl file with pickle so it can be reused later without retraining.

In [2]:
# Save the best model to a .pkl file
with open('../Model/Regression_Model.pkl', 'wb') as file:
    pickle.dump(best_model, file)

print("Model successfully saved to '../Model/Regression_Model.pkl'")


Model successfully saved to '../Model/Regression_Model.pkl'
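
To reuse the saved pipeline later, it can be read back with pickle.load. The following is a minimal sketch: the helper load_model and the variable new_data are hypothetical names introduced here, and the path assumes the code runs from the same directory as the save step. Because the pickled object is the whole pipeline, new data would need the same columns as X, and preprocessing is applied automatically before prediction.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Return the object stored in a pickle file (here, the fitted pipeline)."""
    with open(path, 'rb') as file:
        return pickle.load(file)


# Path taken from the save step; adjust it if the working directory differs
model_path = Path('../Model/Regression_Model.pkl')

if model_path.exists():
    loaded_model = load_model(model_path)
    # `new_data` is hypothetical: a DataFrame with the same columns as X
    # predictions = loaded_model.predict(new_data)
    print(type(loaded_model))
else:
    print(f'No saved model found at {model_path}')
```

Two cautions apply to pickle in general: only load files from a trusted source, since unpickling can execute arbitrary code, and load the model with the same scikit-learn version that was used to save it, since compatibility across versions is not guaranteed.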
