### Example:
- Which models use for training
- Comparison of selected models
- How is model performance

In [1]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Model Training and Evaluation

# XGBoost Model
1. More Data Preprocessing for this model:

- The target variable (price) is log-transformed to reduce skewness.
- Categorical variables are one-hot encoded.
- Numerical features are scaled to standardize their values.

2. Model Setup:

- The data is split into training and testing sets to evaluate model performance.
- An XGBoost regressor is used to predict the log-transformed price.

3. Hyperparameter Tuning:

- A range of hyperparameters for the model is tested using RandomizedSearchCV to find the best configuration.

4. Model Evaluation:

- The best model is trained, and performance is evaluated using RMSE (Root Mean Squared Error) and R² (coefficient of determination)

In [2]:
# Load the merged dataset
processed_df = pd.read_csv('merged_tourism_data.csv')

# Log transform 'price' to reduce skewness
processed_df['log_price'] = np.log1p(processed_df['price'])

# Drop original price column and keep log_price as target
processed_df.drop(columns=['price'], inplace=True)

# One-hot encode categorical variables
processed_df_encoded = pd.get_dummies(processed_df, drop_first=True)

# Feature and target variables
X = processed_df_encoded.drop('log_price', axis=1)  # Features
y = processed_df_encoded['log_price']  # Target (log-transformed price)

# Apply feature scaling to numerical features (e.g., 'Tourists')
numerical_features = ['Tourists']
scaler = StandardScaler()
X[numerical_features] = scaler.fit_transform(X[numerical_features])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter grid for tuning
param_dist = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0],
    'gamma': [0, 0.1, 0.2],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 1.0]
}

# Initialize the XGBoost model
xgboost_model = XGBRegressor(objective='reg:squarederror', random_state=42)

# Perform RandomizedSearchCV to find best parameters
random_search = RandomizedSearchCV(
    estimator=xgboost_model, param_distributions=param_dist,
    n_iter=100,  # Testing 100 random combinations
    cv=3, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1, random_state=42
)

# Fit the randomized search to a sample of the dataset to speed up tuning
sample_df = processed_df_encoded.sample(n=20000, random_state=42)
X_sample = sample_df.drop('log_price', axis=1)
y_sample = sample_df['log_price']
random_search.fit(X_sample, y_sample)

# Print best parameters
print("Best Parameters:", random_search.best_params_)

# Reinitialize the XGBoost model with the best found parameters
best_xgboost = XGBRegressor(**random_search.best_params_, objective='reg:squarederror', random_state=42)

# Use cross-validation to tune the model (this replaces RandomizedSearchCV)
cv_scores = cross_val_score(best_xgboost, X_train, y_train, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)

# Print the average cross-validation score
print(f"Cross-Validation MSE: {-cv_scores.mean()}")

# Train the best model without early stopping (for now)
best_xgboost.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_xgboost.predict(X_test)

# Calculate RMSE and R² for the best model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# Print results
print(f"Optimized XGBoost RMSE: {rmse}")
print(f"Optimized XGBoost R²: {r2}")


Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best Parameters: {'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0.5, 'n_estimators': 500, 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 0.2, 'colsample_bytree': 0.8}
Cross-Validation MSE: 0.5879523178720912
Optimized XGBoost RMSE: 0.7628190646895546
Optimized XGBoost R²: 0.09236845677041161


# Random Forest Model
1. Data Preprocessing specific for this model

- The dataset is encoded using one-hot encoding for categorical variables.
- The features (X) and target (y) are defined, with price as the target variable.

2. Data Split:

- The data is split into training (80%) and test (20%) sets for model evaluation.
- A sample of 10% of the training data is used for faster hyperparameter tuning.

3. Hyperparameter Tuning:

- A wide range of hyperparameters, such as the number of trees (n_estimators), tree depth (max_depth), and more, is tested using RandomizedSearchCV.
- The best hyperparameters are selected based on negative mean squared error (MSE).

4. Model Training and Evaluation:

- The best Random Forest model is trained on the full training data.
- Predictions are made on the test set, and performance is evaluated using RMSE (Root Mean Squared Error) and R² (coefficient of determination).

In [3]:
# Load the dataset
df = pd.read_csv('merged_tourism_data.csv')

# Apply one-hot encoding to categorical columns (if any)
df_encoded = pd.get_dummies(df, drop_first=True)

# Define features (X) and target (y)
X = df_encoded.drop(columns=['price']) 
y = df_encoded['price'] 

# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sample data
X_sample, _, y_sample, _ = train_test_split(X_train, y_train, test_size=0.9, random_state=42)

# Initialize the Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Expand the hyperparameter grid for tuning
rf_param_dist = {
    'n_estimators': [100, 200, 300, 500, 1000],
    'max_depth': [3, 5, 7, 10, 12, 15],  
    'max_features': ['sqrt', 'log2', None, 0.7, 0.8], 
    'min_samples_split': [2, 5, 10, 15], 
    'min_samples_leaf': [1, 2, 4, 6], 
    'bootstrap': [True, False] 
}

# Perform RandomizedSearchCV for Random Forest tuning with a larger hyperparameter space
rf_random_search = RandomizedSearchCV(
    estimator=rf_model, param_distributions=rf_param_dist,
    n_iter=100, cv=3, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1, random_state=42
)

# Fit the randomized search to a sample of the dataset to speed up tuning
rf_random_search.fit(X_sample, y_sample)

# Best parameters found for Random Forest
print("Random Forest Best Parameters:", rf_random_search.best_params_)

# Reinitialize the Random Forest model with the best found parameters
best_rf_model = RandomForestRegressor(**rf_random_search.best_params_, random_state=42)

# Cross-validation for Random Forest with the best parameters
rf_cv_scores = cross_val_score(best_rf_model, X_train, y_train, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)

# Print the average cross-validation score
print(f"Random Forest Cross-Validation MSE: {-rf_cv_scores.mean()}")

# Train the best Random Forest model
best_rf_model.fit(X_train, y_train)

# Make predictions on the test set
rf_y_pred = best_rf_model.predict(X_test)

# Calculate RMSE and R² for Random Forest
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_y_pred))
rf_r2 = r2_score(y_test, rf_y_pred)

# Print results for Random Forest
print(f"Random Forest RMSE: {rf_rmse}")
print(f"Random Forest R²: {rf_r2}")

NameError: name 'rf_model' is not defined

# Linear Regression (3 types)
1. Data Preprocessing

- One-hot Encoding: Categorical columns are transformed using one-hot encoding, so the model can handle them numerically.
- Feature Scaling: The data is scaled using StandardScaler to ensure features with different scales don't affect the model performance.

2. Model Setup

2. Testing three regularized linear models:

- Ridge Regression: Adds an L2 penalty (square of the magnitude of coefficients), which helps prevent overfitting.
- Lasso Regression: Adds an L1 penalty (absolute value of the coefficients), which can shrink some coefficients to zero and perform feature selection.
- ElasticNet Regression: A combination of Lasso and Ridge, balancing both L1 and L2 penalties.

3. Cross-Validation

- Cross-validation is used to assess the performance of each model and prevent overfitting.
- Mean Squared Error (MSE) is used as the evaluation metric, where lower values indicate better performance.

4. Model Training and Evaluation
- After cross-validation, the models are trained on the full training set, and predictions are made on the test set.
Performance metrics like RMSE (Root Mean Squared Error) and R² (coefficient of determination) are calculated for each model to evaluate how well the model fits the data.

In [None]:
# Load the dataset
df = pd.read_csv('merged_tourism_data.csv')

# Apply one-hot encoding to categorical columns (if any)
df_encoded = pd.get_dummies(df, drop_first=True)

# Define features (X) and target (y)
X = df_encoded.drop(columns=['price'])
y = df_encoded['price']

# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Try Ridge, Lasso, and ElasticNet regression
ridge_model = Ridge(alpha=5.0)
lasso_model = Lasso(alpha=0.1, max_iter=10000)
elasticnet_model = ElasticNet(alpha=0.1, l1_ratio=0.7, max_iter=10000)  # ElasticNet combines Lasso and Ridge

# Cross-validation for Ridge
ridge_cv_scores = cross_val_score(ridge_model, X_train_scaled, y_train, cv=3, scoring='neg_mean_squared_error')
# Cross-validation for Lasso
lasso_cv_scores = cross_val_score(lasso_model, X_train_scaled, y_train, cv=3, scoring='neg_mean_squared_error')
# Cross-validation for ElasticNet
elasticnet_cv_scores = cross_val_score(elasticnet_model, X_train_scaled, y_train, cv=3, scoring='neg_mean_squared_error')

# Print the average cross-validation scores
print(f"Ridge Cross-Validation MSE: {-ridge_cv_scores.mean()}")
print(f"Lasso Cross-Validation MSE: {-lasso_cv_scores.mean()}")
print(f"ElasticNet Cross-Validation MSE: {-elasticnet_cv_scores.mean()}")

# Train Ridge, Lasso, and ElasticNet models
ridge_model.fit(X_train_scaled, y_train)
lasso_model.fit(X_train_scaled, y_train)
elasticnet_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
ridge_y_pred = ridge_model.predict(X_test_scaled)
lasso_y_pred = lasso_model.predict(X_test_scaled)
elasticnet_y_pred = elasticnet_model.predict(X_test_scaled)

# Calculate RMSE and R² for Ridge, Lasso, and ElasticNet
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_y_pred))
ridge_r2 = r2_score(y_test, ridge_y_pred)

lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_y_pred))
lasso_r2 = r2_score(y_test, lasso_y_pred)

elasticnet_rmse = np.sqrt(mean_squared_error(y_test, elasticnet_y_pred))
elasticnet_r2 = r2_score(y_test, elasticnet_y_pred)

# Print results for Ridge, Lasso, and ElasticNet
print(f"Ridge RMSE: {ridge_rmse}")
print(f"Ridge R²: {ridge_r2}")

print(f"Lasso RMSE: {lasso_rmse}")
print(f"Lasso R²: {lasso_r2}")

print(f"ElasticNet RMSE: {elasticnet_rmse}")
print(f"ElasticNet R²: {elasticnet_r2}")

# Comparison of Selected Models
## XGBoost

- The XGBoost model had its hyperparameters tuned, but the performance was still pretty low, with a small R² (around 0.09).
- Why? XGBoost can handle complex problems, but this one might have too many factors affecting the price that the model just couldn’t capture well. Some important features might also be missing or need better engineering.

## Random Forest:

- The Random Forest model also had the best parameters tuned but still gave a really low R² (close to 0).
- Why? Random Forest might not be able to handle this data well because it’s set to shallow trees (max_depth=3). It also needs a lot of diverse data to learn from, and our dataset might not have enough of that variety.

## Ridge, Lasso, and ElasticNet
- All three of these models performed similarly, with high RMSE values (around 11,900).
- Why? These models are linear, which means they might be too simple for this problem. Prices depend on many non-linear factors like holidays and tourism trends, which these models can't capture very well.

# How is Model Performance?
Even though we tried multiple models and fine-tuned them, the results were still not great. Here’s why:
- Problem Complexity: Predicting prices on Airbnb is complicated. It’s influenced by a lot of things like holidays, the time of year, and local events, which the models might not fully capture.
- Feature Engineering: We did some basic data cleaning, but we might not have considered all the important features or done enough transformations to really improve predictions. For example, knowing the specific area of Bangkok or the type of tourist could help.
- Data: The datasets we have might not be large or detailed enough to get good results.

## Really it comes down to time constraint and not being able to find a good dataset and question to solve. This was the best we could do. However you have mentioned before that the process is more important than the result. Despite the poor accuraries, we did think we follow the processes well.