In [None]:
import pandas as pd
import numpy as np
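# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston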

In [None]:
# Load the Boston Housing dataset
boston = load_boston()

# Create a pandas DataFrame for the features and a pandas Series for the target variable
df_features = pd.DataFrame(boston.data, columns=boston.feature_names)
s_target = pd.Series(boston.target, name='PRICE')

In [None]:
# Display the first few rows of the DataFrame
print("First 5 rows of the features DataFrame:")
print(df_features.head())

# Display basic information about the dataset
print("\nDataset shape:")
print(df_features.shape)
print("\nDataset description:")
print(boston.DESCR)

## Data Splitting
We split the dataset into training and testing sets so that the model can be evaluated on data it has not seen during training. A large gap between training and test performance is a sign of overfitting, where the model fits the training data closely but generalizes poorly to new data.

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(df_features, s_target, test_size=0.2, random_state=42)

## Feature Scaling
Feature scaling puts the features on a comparable scale. Raw features often span very different ranges, and some algorithms are sensitive to this. For example, many algorithms compute the Euclidean distance between points; if one feature has a much wider range than the others, it dominates the distance. `StandardScaler`, used below, rescales each feature to zero mean and unit variance.
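
As a small illustration of the distance argument, the sketch below compares the Euclidean distance between two points before and after standardization. The array `X_toy` and the helper `euclidean` are made up for this example and are not part of the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative values only): column 0 spans hundreds,
# column 1 spans fractions.
X_toy = np.array([
    [100.0, 0.1],
    [300.0, 0.9],
    [500.0, 0.5],
    [700.0, 0.2],
])


def euclidean(a, b):
    """Euclidean distance between two 1-D points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# The raw distance between the first two rows is dominated by column 0.
raw_distance = euclidean(X_toy[0], X_toy[1])

# After standardization each column has mean 0 and unit variance,
# so both columns contribute on a comparable scale.
X_toy_scaled = StandardScaler().fit_transform(X_toy)
scaled_distance = euclidean(X_toy_scaled[0], X_toy_scaled[1])

print(f"Raw distance:    {raw_distance:.3f}")
print(f"Scaled distance: {scaled_distance:.3f}")
```

Scaling matters most for distance-based and regularized models. Ordinary least squares and tree-based models are largely insensitive to it, so for those models it is mainly a convenience.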

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize a StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training features and transform both X_train and X_test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the LinearRegression model
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# Make predictions on the scaled test data
y_pred_lr = lr_model.predict(X_test_scaled)

# Calculate and print the Mean Squared Error (MSE) and R-squared (R²) score
mse_lr = mean_squared_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print(f"Linear Regression MSE: {mse_lr}")
print(f"Linear Regression R²: {r2_lr}")

## Linear Regression Results
The Mean Squared Error (MSE) measures the average squared difference between the predicted values and the actual values. A lower MSE indicates a better fit.
The R-squared (R²) score represents the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² score closer to 1 indicates a better fit.
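
To make the two definitions concrete, the following sketch computes both metrics by hand and compares them with the scikit-learn functions. The arrays `y_true` and `y_hat` are toy values chosen for illustration, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values for illustration only (not results from the models above).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"Manual MSE: {mse_manual}, sklearn MSE: {mean_squared_error(y_true, y_hat)}")
print(f"Manual R²:  {r2_manual}, sklearn R²:  {r2_score(y_true, y_hat)}")
```

Because R² compares the model against always predicting the mean, it can be negative on test data when the model does worse than that baseline.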

# Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the RandomForestRegressor model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions on the scaled test data
y_pred_rf = rf_model.predict(X_test_scaled)

# Calculate and print the Mean Squared Error (MSE) and R-squared (R²) score
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"Random Forest Regressor MSE: {mse_rf}")
print(f"Random Forest Regressor R²: {r2_rf}")

## Random Forest Regressor Results
A Random Forest is an ensemble learning method that builds many decision trees at training time, each on a bootstrap sample of the training data. For classification it outputs the most common class among the trees; for regression, as here, it outputs the mean of the individual trees' predictions. It is generally more robust to overfitting than a single decision tree and can handle a large number of features.
Comparing its MSE and R² score to Linear Regression can give insights into which model performs better on this dataset.
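
The averaging behavior described above can be checked directly. The sketch below fits a small forest on synthetic data from `make_regression` (an assumption made so the cell runs on its own; it is not the housing data) and compares the forest's prediction with the mean of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data so the cell is self-contained; not the housing data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each fitted tree's predictions, then average across trees.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
mean_of_trees = per_tree.mean(axis=0)

# The forest's own prediction should match the average of its trees.
print(np.allclose(mean_of_trees, forest.predict(X_demo)))
```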

In [None]:
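# Install xgboost if it is not already available
try:
    import xgboost
except ImportError:
    print("Installing xgboost...")
    import subprocess
    import sys
    # Use the current interpreter's pip so the package lands in this environment
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'xgboost'])
    import xgboost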

# XGBoost Regressor

In [None]:
from xgboost import XGBRegressor

# Initialize and train the XGBRegressor model
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train_scaled, y_train)

# Make predictions on the scaled test data
y_pred_xgb = xgb_model.predict(X_test_scaled)

# Calculate and print the Mean Squared Error (MSE) and R-squared (R²) score
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost Regressor MSE: {mse_xgb}")
print(f"XGBoost Regressor R²: {r2_xgb}")

## XGBoost Regressor Results
XGBoost (Extreme Gradient Boosting) is an optimized gradient boosting library designed to be efficient, flexible, and portable. It builds an ensemble of decision trees sequentially, with each new tree fitted to reduce the errors of the trees before it, and it parallelizes the construction of each tree. The same metrics (MSE and R²) are used for evaluation.

# Model Performance Comparison

| Model                   | Mean Squared Error (MSE) | R-squared (R²)   |
|-------------------------|--------------------------|------------------|
| Linear Regression       | {mse_lr:.4f}                 | {r2_lr:.4f}          |
| Random Forest Regressor | {mse_rf:.4f}                 | {r2_rf:.4f}          |
| XGBoost Regressor       | {mse_xgb:.4f}                | {r2_xgb:.4f}         |

## Summary
To evaluate the models, we look at MSE (lower is better) and R² (closer to 1 is better).

Based on the scores:
- The **XGBoost Regressor** typically performs the best, often yielding the lowest MSE and the highest R² score, indicating a strong predictive power and good fit to the data.
- The **Random Forest Regressor** is also a strong performer and usually provides better results than Linear Regression. Its R² score is generally high, and MSE is lower than Linear Regression.
- **Linear Regression** serves as a good baseline but is often outperformed by more complex ensemble methods like Random Forest and XGBoost, especially on datasets with non-linear relationships.

*(Note: The best-performing model varies with the dataset and with hyperparameter tuning. The placeholders in the table above stand for the metrics printed by the cells earlier in the notebook; a Markdown cell does not fill them in automatically.)*
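
One way to produce the comparison as real output is to build it in a code cell. The sketch below is a minimal version: `build_comparison_table` is a hypothetical helper, and the numbers passed to it are placeholders rather than results. In the notebook, the dictionary would instead hold the computed metrics, for example `{"Linear Regression": (mse_lr, r2_lr), ...}`.

```python
import pandas as pd


def build_comparison_table(results):
    """Build a model-comparison table from {model name: (mse, r2)}."""
    table = pd.DataFrame.from_dict(
        results, orient="index", columns=["MSE", "R2"]
    )
    table.index.name = "Model"
    # Sort so the model with the lowest error appears first.
    return table.sort_values("MSE").round(4)


# Placeholder numbers for illustration only; replace them with the
# metrics computed in the notebook (mse_lr, r2_lr, and so on).
example_results = {
    "Model A": (25.0, 0.60),
    "Model B": (10.0, 0.85),
}
comparison = build_comparison_table(example_results)
print(comparison)
```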

# **Boosting Performance**

While the models above provide a good starting point, their performance can often be significantly improved. Here are several common techniques to boost model accuracy and robustness:

## 1. Hyperparameter Tuning
**Importance:** Most machine learning models have hyperparameters, which are settings that are not learned from the data but are set prior to training. The choice of hyperparameters can have a significant impact on model performance. Tuning these parameters is crucial for optimizing the model for a specific dataset.
**Tools:** `sklearn.model_selection` provides tools like `GridSearchCV` and `RandomizedSearchCV`.
   - `GridSearchCV`: Exhaustively searches over a specified parameter grid. It tries every combination of hyperparameter values and evaluates them using cross-validation.
   - `RandomizedSearchCV`: Samples a given number of candidates from a parameter space with a specified distribution. It's often more efficient than `GridSearchCV`, especially when the hyperparameter space is large.
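
A minimal sketch of both search tools follows. It uses synthetic data and a deliberately tiny search space so that it runs quickly; the parameter values are illustrative assumptions, not tuned recommendations for the housing data.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data so the cell is self-contained; not the housing data.
X_demo, y_demo = make_regression(
    n_samples=150, n_features=6, noise=15.0, random_state=0
)

# Grid search: evaluates every combination (2 x 2 = 4 candidates).
param_grid = {"n_estimators": [20, 40], "max_depth": [3, None]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X_demo, y_demo)

# Randomized search: samples n_iter candidates from the distributions.
param_distributions = {
    "n_estimators": randint(10, 60),
    "max_depth": [3, 5, None],
}
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=4,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X_demo, y_demo)

# scikit-learn maximizes scores, so MSE is reported negated.
print("Grid search best params:", grid_search.best_params_)
print("Grid search best CV MSE:", -grid_search.best_score_)
print("Random search best params:", random_search.best_params_)
print("Random search best CV MSE:", -random_search.best_score_)
```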

## 2. Feature Engineering
**Explanation:** This involves creating new features from existing ones or transforming existing features to better represent the underlying patterns in the data. Well-engineered features can lead to simpler models and improved performance.
**Examples:**
   - **Polynomial Features:** Creating features that are powers of existing features (e.g., x², x³). `sklearn.preprocessing.PolynomialFeatures` can be used for this. This can help capture non-linear relationships.
   - **Interaction Terms:** Combining two or more features (e.g., featureA * featureB). This can capture how features jointly influence the target.
   - **Domain-Specific Features:** Creating features based on expert knowledge of the problem domain. For example, in a housing dataset, creating a 'rooms_per_household' feature if 'total_rooms' and 'households' are available.

## 3. Cross-Validation
**Importance:** Cross-validation (CV) is a technique for assessing how well a model will generalize to an independent dataset. It gives a more reliable performance estimate than a single train/validation split, which matters most when tuning hyperparameters: it reduces the risk of choosing settings that only happen to suit one particular split.
**K-Fold Cross-Validation:** A common method where the training data is split into 'k' folds. In each of k rounds, the model is trained on k-1 folds and validated on the remaining fold, so every fold serves as the validation set exactly once. The performance metric is then averaged across all k rounds.
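
A minimal K-fold sketch is shown below, assuming synthetic data and k = 5. The scaler is placed inside a pipeline so that it is refit on the training folds in every round, which keeps information from the validation fold out of the preprocessing.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data so the cell is self-contained; not the housing data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Scaling happens inside the pipeline, so it is fit per training fold.
model = make_pipeline(StandardScaler(), LinearRegression())
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

neg_mse_scores = cross_val_score(
    model, X_demo, y_demo, cv=kfold, scoring="neg_mean_squared_error"
)
fold_mse = -neg_mse_scores  # flip the sign back to a plain MSE

print("MSE per fold:", np.round(fold_mse, 2))
print("Mean MSE:", fold_mse.mean(), "Std:", fold_mse.std())
```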

## 4. Advanced Ensemble Methods
**Explanation:** While Random Forest and XGBoost are powerful ensemble methods themselves, more advanced techniques can sometimes offer further improvements.
   - **Stacking (Stacked Generalization):** Involves training multiple different models (base learners) and then using another model (meta-learner) to combine their predictions. The meta-learner is trained on the outputs of the base learners.
   - **Trade-off:** Stacking is more complex to implement and tune than a single model, so any gain should be weighed against that cost.
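
Stacking can be sketched with scikit-learn's `StackingRegressor`. The choice of base learners and of `RidgeCV` as the meta-learner is an assumption made for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data so the cell is self-contained; not the housing data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Base learners produce out-of-fold predictions (cv=5), and the
# meta-learner (RidgeCV) is trained on those predictions.
stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=30, random_state=0)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_r2 = stack.score(X_te, y_te)
print(f"Stacked model R² on held-out data: {stack_r2:.4f}")
```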

## 5. Feature Selection
**Explanation:** Involves selecting a subset of the most relevant features from the original set. This can reduce model complexity, improve training time, reduce overfitting, and sometimes even improve performance by removing noise or irrelevant information.
**Techniques:**
   - **Recursive Feature Elimination (RFE):** Repeatedly fits a model, drops the least important feature(s), and refits on the remaining features until the desired number is left.
   - **Feature Importance Scores:** Many models (especially tree-based ones like Random Forest and XGBoost) provide feature importance scores. Features below a certain importance threshold can be removed.
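   - **Statistical Tests:** Using statistical tests to select features that have a strong relationship with the target variable. For a regression target such as house price, options include the univariate F-test (`f_regression`) and mutual information (`mutual_info_regression`); the ANOVA F-value and chi-squared tests apply to classification targets.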
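
Two of these techniques can be sketched as follows, assuming synthetic regression data in which only 3 of 8 features are informative. The number of features to keep (3) is an arbitrary choice for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data with 3 informative features out of 8; not the housing data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Recursive Feature Elimination driven by a linear model's coefficients.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X_demo, y_demo)

# Univariate selection with an F-test suited to regression targets.
kbest = SelectKBest(score_func=f_regression, k=3)
X_kbest = kbest.fit_transform(X_demo, y_demo)

print("RFE kept columns:        ", rfe.support_.nonzero()[0])
print("SelectKBest kept columns:", kbest.get_support(indices=True))
```

As with scaling, a selector should be fit on the training data only and then applied to the test data.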

## 6. Handling Outliers and Data Cleaning
**Explanation:** Outliers are data points that differ markedly from the other observations. They can disproportionately affect training, especially for models fitted by minimizing squared error, such as Linear Regression, because large residuals are penalized heavily. Further data cleaning beyond initial preprocessing might also be beneficial.
**Actions:**
   - **Outlier Detection and Treatment:** Techniques like using statistical methods (e.g., Z-score, IQR) or visualization to identify outliers. Treatment can involve removing them, transforming them (e.g., capping), or using robust models that are less sensitive to outliers.
   - **Improved Imputation:** If missing values were handled simply (e.g., mean imputation), more sophisticated imputation techniques could be explored.
   - **Data Transformation:** Applying transformations like log transforms to skewed data can sometimes help.
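
The IQR rule, capping, and a log transform can be sketched on a toy series. The values are made up for illustration, and the 1.5 × IQR multiplier is the conventional default rather than a requirement.

```python
import numpy as np
import pandas as pd

# Toy right-skewed series with one extreme value (illustrative only).
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Treatment option 1: cap values at the IQR bounds.
capped = values.clip(lower=lower, upper=upper)

# Treatment option 2: compress the long tail with log(1 + x).
log_values = np.log1p(values)

print("Outliers flagged:", values[is_outlier].tolist())
print("Capped values:   ", capped.tolist())
print("Log-transformed: ", log_values.round(3).tolist())
```

When these steps are applied to real data, the bounds should be computed from the training set and reused on the test set.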