In [5]:
#!pip install scikit-learn==1.3.0
from sklearn.datasets import fetch_california_housing
import pandas as pd
housing = fetch_california_housing(as_frame=True)
df = pd.DataFrame(data=housing.data, columns=housing.feature_names)
df['target'] = housing.target
display(df)

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [7]:
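from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Count missing values per column (the output below shows none)
print(df.isnull().sum())

# Two options for handling missing values, shown for illustration only;
# neither result is used later because the dataset has no missing values
imputer = SimpleImputer(strategy='mean')  # or 'median' or 'most_frequent'
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
df_dropped = df.dropna()

# Standardize every feature column; the target is left unscaled
# Note: the scaler is fit on the full dataset, before the train/test split
scaler = StandardScaler()
numerical_features = df.columns[:-1]  # all columns except the target
df_scaled = df.copy()  # keep the original df untouched
df_scaled[numerical_features] = scaler.fit_transform(df[numerical_features])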

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
target        0
dtype: int64


In [None]:
Preprocessing steps and their justifications for the California Housing dataset:

1. Handling Missing Values:
The first step is to check for missing values. df.isnull().sum() reports the number of missing entries in each column, and the output above shows zero for every column, so no imputation or row removal is actually required here. Had missing values been present, they could be filled with estimates such as the mean or median using SimpleImputer, or the affected rows or columns could be removed with dropna().
Reasoning: Missing values can cause errors during model training or lead to biased or inaccurate results. Handling them ensures data completeness and integrity for better model performance and reliability. Real-world datasets often do contain missing values, so this check remains a crucial step even though this dataset turned out to be complete.

2. Feature Scaling (Standardization):
The second step involves scaling numerical features using standardization. We applied the StandardScaler to transform the data by subtracting the mean and dividing by the standard deviation for each feature.
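Reasoning: The features in the California Housing dataset are on very different scales; in the rows shown above, Population runs into the thousands while MedInc stays in single digits. Standardization brings all features to a similar scale (mean of 0 and standard deviation of 1), preventing features with larger values from disproportionately influencing the model. This improves the performance and stability of algorithms sensitive to feature scales, such as k-nearest neighbors, support vector machines, and regularized linear models. Ordinary least-squares linear regression produces the same predictions with or without scaling, but its coefficients become directly comparable once the features are standardized.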
In conclusion, these preprocessing steps prepare the California Housing dataset for machine learning tasks:

Checking for missing values confirms data completeness and guards against the biases or errors that gaps would introduce.
Feature scaling (standardization) puts all features on a comparable scale, which matters for the scale-sensitive algorithms used later.
By performing these steps, the aim is to enhance the quality of the data and improve the accuracy and reliability of any subsequent analysis or model training.
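
The two preprocessing steps can also be chained so that the imputer and the scaler learn their statistics from the training rows only, which avoids leaking information from the test rows. The following is a minimal sketch under stated assumptions: it uses a small synthetic array as a stand-in for the housing features (the data, the injected missing values, and all `*_demo` names are hypothetical), and it swaps in scikit-learn's `Pipeline` rather than the separate calls used in the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features on very different scales (hypothetical data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[3.0, 1500.0], scale=[2.0, 1000.0], size=(200, 2))
X_demo[::25, 1] = np.nan  # inject a few missing values into the second column

X_train_demo, X_test_demo = train_test_split(X_demo, test_size=0.2, random_state=42)

# Impute first, then standardize; both steps are fit on the training rows only.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prepared = preprocess.fit_transform(X_train_demo)
X_test_prepared = preprocess.transform(X_test_demo)  # reuses the training statistics

print("NaNs left:", np.isnan(X_train_prepared).sum(), np.isnan(X_test_prepared).sum())
print("train mean:", X_train_prepared.mean(axis=0).round(6))
print("train std: ", X_train_prepared.std(axis=0).round(6))
```

The test rows are transformed with the training mean and standard deviation, so their own mean and standard deviation will generally be close to, but not exactly, 0 and 1.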

In [9]:
#Step 1: Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)  # Features
y = df_scaled['target']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

#Step 3: Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

#Step 4: Make predictions on the test set
y_pred = model.predict(X_test)

#Step 5: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.5558915986952442
R-squared: 0.575787706032451


In [None]:
Code explained:

Import Libraries: import LinearRegression, train_test_split, mean_squared_error, and r2_score for model creation, data splitting, and evaluation.
Split Data: the dataset is divided into training and testing sets using train_test_split. This is crucial to assess how well the model generalizes to unseen data.
Create and Train: Create a LinearRegression object and train it using the training data (X_train, y_train). The model learns the relationships between features and the target variable.
Make Predictions: Use the trained model to predict target values for the test set (X_test).
Evaluate Model: Evaluate the model's performance using metrics like Mean Squared Error (MSE) and R-squared. These metrics provide insights into the model's accuracy and goodness of fit.
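
To make the evaluation metrics concrete, here is a small worked sketch that computes MSE and R-squared by hand with NumPy, plus MAE, which is used later in the notebook. The four target values and predictions are made up for illustration and are not taken from the housing data.

```python
import numpy as np

# Hypothetical true values and predictions, chosen so the arithmetic is easy to follow.
y_true_demo = np.array([3.0, 2.5, 4.0, 1.5])
y_hat_demo = np.array([2.5, 2.5, 3.0, 2.0])

residuals_demo = y_true_demo - y_hat_demo          # [0.5, 0.0, 1.0, -0.5]

mse_demo = np.mean(residuals_demo ** 2)            # mean of squared errors -> 0.375
mae_demo = np.mean(np.abs(residuals_demo))         # mean of absolute errors -> 0.5

ss_res = np.sum(residuals_demo ** 2)               # unexplained variation -> 1.5
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total variation -> 3.25
r2_demo = 1.0 - ss_res / ss_tot                    # -> about 0.5385

print(f"MSE = {mse_demo:.4f}, MAE = {mae_demo:.4f}, R-squared = {r2_demo:.4f}")
```

These formulas give the same values as `mean_squared_error`, `mean_absolute_error`, and `r2_score` from `sklearn.metrics` on the same inputs.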

Reasoning why Linear Regression might be suitable for the California Housing dataset:

1. Linear Relationship: Linear Regression assumes a linear relationship between the features and the target variable. In the California Housing dataset, we expect features like median income, housing median age, and average rooms to have some degree of linear correlation with the median house value (target variable). This makes Linear Regression a potentially appropriate model for this dataset.

2. Interpretability: Linear Regression offers excellent interpretability. The coefficients associated with each feature provide insights into the direction and magnitude of their impact on the target variable. This is valuable for understanding the factors influencing housing prices.

3. Simplicity and Efficiency: Linear Regression is relatively simple to implement and computationally efficient, especially for datasets of moderate size like the California Housing dataset. This makes it a practical choice for initial exploration and analysis.

4. Baseline Model: Linear Regression often serves as a good baseline model. Even if it might not be the most accurate model, it can provide a benchmark against which to compare more complex models.

5. Data Characteristics: The California Housing dataset is relatively clean and well-structured. It has numerical features and a continuous target variable, which align with the requirements of Linear Regression.

However, this method assumes linearity, which might not perfectly capture all complexities in real-world housing data.
Outliers can significantly impact the model, so it's crucial to handle them appropriately.
Feature scaling (standardization) does not change the predictions of ordinary least squares, but it puts the coefficients on a common scale so that their magnitudes can be compared; it was applied in the preprocessing steps.
Despite these considerations, Linear Regression is a suitable starting point for modeling the California Housing dataset due to its simplicity, interpretability, potential for capturing linear relationships, and appropriateness for the dataset's characteristics. The R-squared of about 0.58 above shows that a linear fit explains a little over half of the variance in the target, which makes it a useful baseline for comparing more complex models.
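
The points about interpretability and scaling can be illustrated with a short sketch. It assumes a synthetic two-feature dataset (hypothetical data and `*_demo` names, not the housing data) and compares a `LinearRegression` fitted on raw features with one fitted on standardized features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one small-scale and one large-scale feature.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(4.0, 2.0, size=300),        # small-scale feature
    rng.normal(1400.0, 1000.0, size=300),  # large-scale feature
])
y_demo = 0.4 * X_demo[:, 0] + 0.0005 * X_demo[:, 1] + rng.normal(0.0, 0.1, size=300)

raw_model = LinearRegression().fit(X_demo, y_demo)

X_scaled_demo = StandardScaler().fit_transform(X_demo)
scaled_model = LinearRegression().fit(X_scaled_demo, y_demo)

pred_raw = raw_model.predict(X_demo)
pred_scaled = scaled_model.predict(X_scaled_demo)

# Ordinary least squares gives the same predictions either way...
print("same predictions:", np.allclose(pred_raw, pred_scaled))
# ...but only the standardized coefficients are comparable across features.
print("raw coefficients:         ", raw_model.coef_)
print("standardized coefficients:", scaled_model.coef_)
```

Each standardized coefficient is the raw coefficient multiplied by that feature's standard deviation, so it reads as the change in the target per one standard deviation of the feature.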

In [11]:
#Step 1: Import necessary libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)  # Features
y = df_scaled['target']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

#Step 3: Create and train the Decision Tree Regressor model
model = DecisionTreeRegressor(random_state=42)  #hyperparameters can be adjusted here
model.fit(X_train, y_train)

#Step 4: Make predictions on the test set
y_pred = model.predict(X_test)

#Step 5: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.4942716777366763
R-squared: 0.6228111330554302


In [None]:
Code explained:

Import Libraries: import DecisionTreeRegressor, train_test_split, mean_squared_error, and r2_score for model creation, data splitting, and evaluation.
Split Data: the dataset is split into training and testing sets using train_test_split to assess the model's generalization ability.
Create and Train: Create a DecisionTreeRegressor object and train it on the training data. Hyperparameters like max_depth, min_samples_split, and min_samples_leaf can be tuned to optimize performance.
Make Predictions: Use the trained model to predict target values for the test set.
Evaluate Model: Evaluate the model using metrics like Mean Squared Error (MSE) and R-squared to measure its accuracy and goodness of fit.

Reasoning why a Decision Tree Regressor might be suitable for the California Housing dataset:

1. Handling Non-linear Relationships: Unlike Linear Regression, Decision Trees can capture non-linear relationships between features and the target variable. This is beneficial for housing data, where factors like location and proximity to amenities might have non-linear impacts on prices.

2. Feature Interactions: Decision Trees automatically consider interactions between features when making predictions. This is important in housing data, where the combined effect of features (e.g., income and location) might be more significant than their individual effects.

3. Interpretability: Decision Trees are relatively easy to interpret. The tree structure visually represents the decision-making process, allowing us to understand how features contribute to predictions.

4. Handling Outliers: Decision Trees are less sensitive to outliers compared to Linear Regression. They partition data based on feature thresholds, making them more robust to extreme values.

5. No Feature Scaling: Decision Trees generally don't require feature scaling (standardization). This simplifies the preprocessing steps and can be advantageous in certain cases.

However, Decision Trees can be prone to overfitting, especially with complex trees. Techniques like pruning or setting hyperparameter limits are often used to address this.
Small changes in the data can lead to significant changes in the tree structure, potentially affecting stability.
Interpretability can decrease with increasing tree complexity.
Overall, Decision Tree Regressors are potentially suitable for the California Housing dataset because of their ability to handle non-linear relationships and feature interactions, their robustness to outliers, and their relative ease of interpretation. While overfitting and instability are potential concerns, techniques like pruning and hyperparameter tuning can mitigate these issues. The choice between Linear Regression and a Decision Tree Regressor depends on the specific characteristics of the dataset and the desired balance between interpretability and predictive accuracy. In some cases, ensemble methods (like Random Forests) that combine multiple decision trees might offer even better performance.
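
The overfitting concern and the effect of hyperparameter limits can be sketched as follows. This is an illustration on synthetic data (hypothetical, not the housing data), and the limits `max_depth=4` and `min_samples_leaf=10` are arbitrary example values rather than tuned settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy, non-linear regression problem.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.3, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# An unconstrained tree keeps splitting until it fits the training data almost perfectly.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
# Limiting depth and leaf size forces the tree to average over noise.
small_tree = DecisionTreeRegressor(
    max_depth=4, min_samples_leaf=10, random_state=42
).fit(X_tr, y_tr)

for label, tree in [("unconstrained", full_tree), ("limited", small_tree)]:
    print(
        f"{label:>13}: depth={tree.get_depth():2d}  "
        f"train R2={tree.score(X_tr, y_tr):.3f}  test R2={tree.score(X_te, y_te):.3f}"
    )
```

A large gap between the training and test scores of the unconstrained tree is the usual sign of overfitting; how much the limited tree helps depends on the data and the chosen limits.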

In [13]:
#Step 1: Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)  # Features
y = df_scaled['target']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

#Step 3: Create and train the Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

#Step 4: Make predictions on the test set
y_pred = model.predict(X_test)

#Step 5: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.25549776668540763
R-squared: 0.805024407701793


In [None]:
Code explained:

Import Libraries: import RandomForestRegressor, train_test_split, mean_squared_error, and r2_score for model creation, data splitting, and evaluation.
Split Data: split the dataset into training and testing sets using train_test_split to assess the model's generalization ability on unseen data.
Create and Train: create a RandomForestRegressor object and train it on the training data. Hyperparameters like n_estimators, max_depth, min_samples_split, and min_samples_leaf can be tuned to optimize performance.
Make Predictions: use the trained model to predict target values for the test set.
Evaluate Model: evaluate the model using metrics like Mean Squared Error (MSE) and R-squared to measure its accuracy and goodness of fit.

Reasoning why a Random Forest Regressor might be suitable for the California Housing dataset:

1. Handling Non-linearity and Interactions: Like Decision Trees, Random Forests excel at capturing non-linear relationships and interactions between features. This is crucial for housing data, where factors like location, proximity to amenities, and neighborhood characteristics can have complex effects on prices.

2. Reduced Overfitting: Random Forests mitigate overfitting, a common issue with Decision Trees, by averaging predictions from multiple trees. This ensemble approach improves generalization and makes the model more robust to noise in the data.

3. Feature Importance: Random Forests provide valuable insights into feature importance, helping us understand which factors are most influential in predicting housing prices. This can be useful for feature selection and gaining domain knowledge.

4. Robustness to Outliers: Random Forests are less sensitive to outliers compared to Linear Regression, as they rely on the collective wisdom of multiple trees.

5. Handling Missing Values: Some Random Forest implementations can handle missing values directly, without prior imputation. Whether this is supported depends on the library and its version, so the missing-value check from the preprocessing step should not be skipped on that basis.

However, Random Forests can be computationally more intensive than Linear Regression or single Decision Trees, especially with large datasets or many trees.
Interpretability is reduced compared to single Decision Trees, as the model combines predictions from multiple trees.
Overall, Random Forest Regressors are often a strong choice for regression tasks, including the California Housing dataset, due to their ability to handle non-linearity and interactions, reduce overfitting, provide feature importance, and offer robustness to outliers. While they might be computationally more demanding, their potential for improved accuracy and insights often outweighs this consideration.
In comparison to Linear Regression and Decision Trees:
Linear Regression: Might be too simplistic for capturing complex relationships in housing data (MSE 0.5559 above, against 0.2555 for the Random Forest).
Decision Tree: Prone to overfitting, which Random Forests address through ensembling (MSE 0.4943 for the single tree on the same split).
Therefore, Random Forest Regressor is often considered a more suitable and robust option for the California Housing dataset, especially when aiming for higher predictive accuracy and handling potential non-linearities and interactions.
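
The feature-importance point can be sketched with the `feature_importances_` attribute of a fitted `RandomForestRegressor`. The example below uses synthetic data in which one feature dominates by construction; the data and the feature names are hypothetical and only serve to show how the attribute is read.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: the target depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=400)
feature_names_demo = ["strong", "weak", "noise"]

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = dict(zip(feature_names_demo, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>6}: {value:.3f}")
```

On the notebook's data the same pattern would apply with `X_train.columns` in place of the hypothetical names. Impurity-based importances can be biased, so they are best treated as a rough guide.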

In [15]:
#Step 1: Import necessary libraries
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)  # Features
y = df_scaled['target']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

#Step 3: Create and train the Gradient Boosting Regressor model
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

#Step 4: Make predictions on the test set
y_pred = model.predict(X_test)

#Step 5: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.29399901242474274
R-squared: 0.7756433164710084


In [None]:
Code explained:

Import Libraries: import GradientBoostingRegressor, train_test_split, mean_squared_error, and r2_score for model creation, data splitting, and evaluation.
Split Data: split the dataset into training and testing sets using train_test_split to assess the model's performance on unseen data.
Create and Train: create a GradientBoostingRegressor object and train it on the training data. Hyperparameters like n_estimators, learning_rate, max_depth, and min_samples_split can be tuned to optimize performance.
Make Predictions: use the trained model to predict target values for the test set.
Evaluate Model: evaluate the model using metrics like Mean Squared Error (MSE) and R-squared to measure its accuracy and goodness of fit.

Reasoning why a Gradient Boosting Regressor might be suitable for the California Housing dataset:

1. Handling Non-linearity and Interactions: Similar to Random Forests, Gradient Boosting excels at capturing non-linear relationships and interactions between features. This is crucial for housing data, where factors like location, proximity to amenities, and neighborhood characteristics can have complex, interwoven effects on prices.

2. High Predictive Accuracy: Gradient Boosting is known for its high predictive accuracy and, once tuned, often matches or outperforms Linear Regression, Decision Trees, and Random Forests. With the untuned settings used above it reaches an MSE of 0.2940, clearly better than Linear Regression (0.5559) and the single Decision Tree (0.4943) but behind the Random Forest (0.2555), so tuning n_estimators, learning_rate, and max_depth is the natural next step.

3. Sequential Improvement: Gradient Boosting's sequential learning process allows it to iteratively improve predictions by focusing on errors made by previous trees. This leads to a more refined and accurate model.

4. Regularization: Gradient Boosting incorporates regularization techniques, such as shrinkage (learning rate) and subsampling, to prevent overfitting and improve generalization to unseen data.

5. Feature Importance: Like Random Forests, Gradient Boosting provides insights into feature importance, helping us understand which factors are most influential in predicting housing prices. This can be valuable for feature selection and gaining domain knowledge.

However, Gradient Boosting can be computationally more intensive than Linear Regression or single Decision Trees, especially with large datasets or many trees. Hyperparameter tuning might also require more computational resources.
Interpretability can be slightly reduced compared to single Decision Trees, as the model combines predictions from multiple trees in a complex way.
Overall, Gradient Boosting Regressors are often a top choice for regression tasks, including the California Housing dataset, due to their potential for high predictive accuracy, ability to handle non-linearity and interactions, sequential improvement, regularization, and feature importance. While they might be computationally more demanding, their potential for superior performance often outweighs this consideration.

In comparison to other algorithms:
Linear Regression: Likely too simplistic for capturing complex relationships in housing data.
Decision Tree: Prone to overfitting, which Gradient Boosting addresses through its ensemble approach and regularization.
Random Forest: Generally robust. Gradient Boosting can achieve higher accuracy thanks to its sequential learning process, but usually only after tuning; with the settings used in this notebook the Random Forest scored better (R-squared 0.8050 against 0.7756).
Therefore, Gradient Boosting Regressor is a highly suitable and powerful option for the California Housing dataset, especially when there is room for hyperparameter tuning and the goal is strong predictive performance on data with non-linearities and interactions.
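
The sequential improvement described above can be observed with `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below runs on synthetic data (hypothetical, not the housing data) with the same `n_estimators` and `learning_rate` as the notebook cell.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical non-linear regression problem with an interaction between features.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) * X_demo[:, 1] + rng.normal(0, 0.2, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, random_state=42
).fit(X_tr, y_tr)

# One MSE value per boosting stage: entry k-1 uses only the first k trees.
train_curve = [mean_squared_error(y_tr, p) for p in gbr.staged_predict(X_tr)]
test_curve = [mean_squared_error(y_te, p) for p in gbr.staged_predict(X_te)]

for stage in (1, 10, 50, 100):
    print(f"trees={stage:3d}  train MSE={train_curve[stage - 1]:.4f}  "
          f"test MSE={test_curve[stage - 1]:.4f}")
```

If the test curve flattens or rises while the training curve keeps falling, that is a hint to stop earlier or to lower the learning rate.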

In [17]:
#Step 1: Import necessary libraries
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)  # Features
y = df_scaled['target']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

#Step 3: Create and train the SVR model
model = SVR(kernel='rbf')
model.fit(X_train, y_train)

#Step 4: Make predictions on the test set
y_pred = model.predict(X_test)

#Step 5: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.3551984619989419
R-squared: 0.7289407597956462


In [None]:
Code explained:

Import Libraries: import SVR, train_test_split, mean_squared_error, and r2_score for model creation, data splitting, and evaluation.
Split Data: split the dataset into training and testing sets using train_test_split to assess the model's performance on unseen data.
Create and Train: create an SVR object and train it on the training data. Hyperparameters like kernel (e.g., 'linear', 'rbf', 'poly'), C (regularization parameter), and epsilon (tolerance for errors) can be tuned to optimize performance.
Make Predictions: use the trained model to predict target values for the test set.
Evaluate Model: evaluate the model using metrics like Mean Squared Error (MSE) and R-squared to measure its accuracy and goodness of fit.

Reasoning why a Support Vector Regressor (SVR) might be suitable for the dataset:

1. Handling Non-linearity: SVR, particularly with non-linear kernels like the Radial Basis Function (RBF) kernel, can effectively capture non-linear relationships between features and the target variable. This is crucial for housing data, where factors like location and proximity to amenities might have non-linear impacts on prices.

2. Regularization: SVR incorporates regularization through the C parameter; a smaller C means stronger regularization. This helps prevent overfitting and improves the model's generalization ability, which is important for ensuring that the model performs well on unseen data.

3. Handling Outliers: SVR is relatively robust to outliers because of its epsilon-insensitive loss: errors smaller than epsilon are ignored, and larger errors are penalized linearly rather than quadratically. This makes it less sensitive to extreme values than regression methods that minimize squared error.

4. Versatility: SVR offers flexibility through different kernel choices (e.g., linear, polynomial, RBF). This allows you to adapt the model to various data patterns and complexities.

5. Feature Scaling: While not strictly required, feature scaling (standardization) often improves the performance of SVR, especially with kernels like RBF. This is because SVR relies on distances between data points, and scaling ensures that features with larger values don't disproportionately influence the model.

However, SVR can be computationally more intensive than Linear Regression or Decision Trees, especially with large datasets. Hyperparameter tuning can also require more computational resources.
Interpretability can be more challenging compared to Linear Regression or Decision Trees, as the model's decision boundaries are defined by support vectors and kernel functions.
Overall, SVR is a potentially suitable option for the California Housing dataset due to its ability to handle non-linearity, incorporate regularization, handle outliers, offer versatility, and benefit from feature scaling. While it might be computationally more demanding and less interpretable than some other algorithms, its potential for capturing complex relationships and achieving good predictive performance makes it a valuable consideration.

In comparison to other algorithms:
Linear Regression: Might be too simplistic for capturing complex relationships in housing data.
Decision Tree: Prone to overfitting, while SVR addresses this through regularization.
Random Forest/Gradient Boosting: Both achieved lower error than SVR in this notebook (MSE 0.2555 and 0.2940 against 0.3552 with default SVR settings). SVR might still be preferred when a smooth prediction function is wanted, or after tuning C, epsilon, and the kernel parameters.
Therefore, SVR is worth exploring for the California Housing dataset, especially when aiming for a balance between handling non-linearity, preventing overfitting, and achieving good predictive accuracy. The choice between SVR and other algorithms ultimately depends on the specific characteristics of the data and the desired trade-offs between accuracy, interpretability, and computational cost.
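
SVR penalizes residuals with an epsilon-insensitive loss, which is where the epsilon hyperparameter and the behavior on extreme values come from. The sketch below writes that loss out in NumPy and compares it with squared error; the helper `epsilon_insensitive_loss` and the residual values are hypothetical and written for illustration, while `epsilon=0.1` matches scikit-learn's default for `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(residuals, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the absolute error outside it."""
    return np.maximum(0.0, np.abs(residuals) - epsilon)

# Hypothetical residuals: three small errors and two large ones.
residuals_demo = np.array([-3.0, -0.05, 0.0, 0.05, 3.0])

eps_loss = epsilon_insensitive_loss(residuals_demo, epsilon=0.1)
sq_loss = residuals_demo ** 2

for r, e, s in zip(residuals_demo, eps_loss, sq_loss):
    print(f"residual={r:+.2f}  epsilon-insensitive={e:.4f}  squared={s:.4f}")
```

The residuals of 3.0 cost 2.9 under the epsilon-insensitive loss but 9.0 under squared error, which is why a single extreme point pulls an SVR fit less than it pulls a least-squares fit.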

In [19]:
#Evaluating the performance of each algorithm using the Mean Squared Error (MSE)

#Step 1: Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)  #Features
y = df_scaled['target']  #Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Step 3: Train and evaluate each algorithm
#Train and Evaluate: iterate through each algorithm, train it on the training data, make predictions on the test data, and calculate the MSE. The results are stored in a dictionary.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42),
    "SVR": SVR(kernel='rbf'),
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    results[name] = mse

#Step 4: Print the results
for name, mse in results.items():
    print(f"{name}: MSE = {mse:.4f}")

Linear Regression: MSE = 0.5559
Decision Tree: MSE = 0.4943
Random Forest: MSE = 0.2555
Gradient Boosting: MSE = 0.2940
SVR: MSE = 0.3552


In [21]:
#Evaluating the performance of each algorithm using the Mean Absolute Error (MAE)

#Step 1: Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error  # Import MAE

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)
y = df_scaled['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Step 3: Train and evaluate each algorithm
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42),
    "SVR": SVR(kernel='rbf'),
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)  #Calculate MAE
    results[name] = mae

for name, mae in results.items():
    print(f"{name}: MAE = {mae:.4f}")

Linear Regression: MAE = 0.5332
Decision Tree: MAE = 0.4538
Random Forest: MAE = 0.3276
Gradient Boosting: MAE = 0.3717
SVR: MAE = 0.3978


In [None]:
#A lower MAE means a smaller average absolute difference between predicted and actual values, so the algorithm with the lowest MAE (here Random Forest, 0.3276) is the most accurate by this metric.
#These figures come from a single train/test split with default hyperparameters; tuning and cross-validation would give a more robust evaluation.
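
A cross-validated version of the MAE comparison could look like the sketch below. It is an illustration under stated assumptions: the data is synthetic (not the housing data), only two of the five models are included to keep it short, and 5 folds is an arbitrary but common choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical non-linear regression problem.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(300, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.2, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_mae = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so MAE is exposed as its negative.
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores
    print(f"{name}: MAE = {cv_mae[name].mean():.4f} +/- {cv_mae[name].std():.4f}")
```

Reporting the spread across folds alongside the mean shows whether a difference between two models is larger than the fold-to-fold variation.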

In [23]:
#Evaluating the performance of each algorithm using R-squared Score (R²)

#Step 1: Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score  # Import R-squared

#Step 2: Split data into training and testing sets
X = df_scaled.drop('target', axis=1)
y = df_scaled['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Step 3: Train and evaluate each algorithm
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42),
    "SVR": SVR(kernel='rbf'),
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)  # Calculate R-squared
    results[name] = r2

for name, r2 in results.items():
    print(f"{name}: R² = {r2:.4f}")

Linear Regression: R² = 0.5758
Decision Tree: R² = 0.6228
Random Forest: R² = 0.8050
Gradient Boosting: R² = 0.7756
SVR: R² = 0.7289


In [None]:
#A higher R-squared score indicates better goodness of fit, meaning the model explains a larger proportion of the variance in the target variable.
#Here Random Forest scores highest (0.8050), the same ranking as given by MSE and MAE.
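
The three evaluation cells above refit every model once per metric. A more compact alternative is to fit each model once and compute all metrics from the same predictions. The sketch below shows the pattern on synthetic data with two models only; the data and the names `demo_models`, `metric_functions`, and `summary` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical regression problem standing in for the housing data.
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}
metric_functions = {
    "MSE": mean_squared_error,
    "MAE": mean_absolute_error,
    "R2": r2_score,
}

summary = {}
for name, estimator in demo_models.items():
    y_hat = estimator.fit(X_tr, y_tr).predict(X_te)  # fit once, predict once
    summary[name] = {label: fn(y_te, y_hat) for label, fn in metric_functions.items()}

for name, scores in summary.items():
    print(name, {label: round(value, 4) for label, value in scores.items()})
```

Extending `demo_models` with the remaining estimators from the notebook would reproduce all three tables from a single training pass per model.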