### Import Libraries
This cell imports essential libraries for data manipulation, numerical operations, and plotting: `numpy`, `pandas`, `matplotlib.pyplot`, and `seaborn`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load Dataset
This cell loads the **original 1990 California Housing dataset** from **OpenML** using `fetch_openml`.  
Unlike the simplified version available in `sklearn.datasets`, this dataset includes the categorical
feature `ocean_proximity` and reflects the original census data.


In [None]:
from sklearn.datasets import fetch_openml
dataset = fetch_openml(name="california_housing", as_frame=True)

In [None]:
dataset

### Dataset Description
This cell prints the detailed description of the California Housing dataset, providing information about its features, target variable, and source.

In [None]:
print(dataset.DESCR)

### Separate Features and Target
This cell separates the dataset into input features (`X`) and the target variable (`y`),  
where the target represents **median house value (in USD)** for each district.


In [None]:
X,y = dataset.data,dataset.target

In [None]:
feature_names = dataset.feature_names
feature_names

In [None]:
target_name = dataset.target_names
target_name

In [None]:
X

In [None]:
y

In [None]:
X.head()

In [None]:
X.tail()

In [None]:
X.describe()

In [None]:
X.info()

### Split Data into Training and Test Sets
This cell splits the dataset into training and test sets using an **80/20 split**.
The test set is kept completely unseen during model training and feature engineering
to ensure an unbiased evaluation of model performance.


In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)
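
As a quick sanity check on how the split behaves, the sketch below uses a small toy DataFrame (hypothetical data, not the housing set) to show that `test_size=0.2` yields an 80/20 partition, that a fixed `random_state` reproduces the same partition, and that the original index labels are kept in shuffled order.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the housing features and target (hypothetical data).
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series(np.arange(100) * 2.0, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same seed reproduces the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)          # 80 training rows, 20 test rows
print(X_te.index.equals(X_te2.index))  # same seed, same test rows
print(X_tr.index.equals(y_tr.index))   # features and target stay aligned
```

Because the index labels survive the shuffle, any later step that combines features and target should rely on those labels rather than on row position.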

### Visualize Target Variable Distribution
This cell generates two histograms with Kernel Density Estimates (KDE) to visualize the distribution of the target variable (`y`) in both the training and test sets. This helps in understanding if the target distribution is consistent across the splits.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sns.histplot(y_train, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Target Variable (Train Set)')
axes[0].set_xlabel('Target Value')
axes[0].set_ylabel('Frequency')

sns.histplot(y_test, kde=True, ax=axes[1])
axes[1].set_title('Distribution of Target Variable (Test Set)')
axes[1].set_xlabel('Target Value')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

### Concatenate Training Features and Target (EDA Only)
This cell concatenates `X_train` and `y_train` **only for exploratory data analysis (EDA)**.
Despite its name, `df_train_test` contains training rows only.
This combined DataFrame is **not used for model training or feature engineering fitting**,
preventing any form of data leakage.


In [None]:
df_train_test = pd.concat([X_train,y_train],axis = 1)

### Histograms of All Features in Training Set
This cell generates histograms for all features in the `df_train_test` DataFrame. This provides a visual overview of the distribution of each variable in the training data, helping to identify skewness, outliers, and potential transformations needed.

In [None]:
df_train_test.hist(figsize = (12,10),bins = 50)
plt.show()

### Scatter Matrix Plot of Selected Numerical Attributes
This cell creates a scatter matrix for selected numerical features from the training data.
The plot helps visualize pairwise relationships, distributions, and potential non-linear patterns
among key housing attributes.


In [None]:
from pandas.plotting import scatter_matrix
attributes = ['housing_median_age',
              'total_rooms',
              'total_bedrooms',
              'population',
              'households',
              'median_income']
scatter_matrix(df_train_test[attributes],figsize = (12,14))
plt.show()

### Feature Engineering Class Definition
This cell defines a custom `FeatureEngineer` transformer that follows **scikit-learn's fit/transform API**.
All feature engineering steps are **fitted only on training data** and later applied to validation/test data,
ensuring no data leakage.

The transformer performs:
- **Missing value imputation** using KNNImputer for `total_bedrooms`
- **Feature creation**, including:
  - Income categories (`income_cat`)
  - Bedroom-to-room ratio (`bedrooms_per_house`, computed as `total_bedrooms / total_rooms`)
  - Log-transformed versions of skewed numerical features
- **One-hot encoding** of the `ocean_proximity` categorical feature
- **Removal of raw columns** once their engineered counterparts are created


In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer

class FeatureEngineer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        self.feature_names_out = None
        self.imputer = KNNImputer(n_neighbors=5) # Initialize KNNImputer

    def fit(self, X, y=None):
        # Fit the OneHotEncoder on the 'ocean_proximity' column
        self.encoder.fit(X[['ocean_proximity']])
        original_feature_names_out = self.encoder.get_feature_names_out(['ocean_proximity'])
        # Sanitize feature names for XGBoost compatibility
        cleaned_feature_names = [
            name.replace('<', '').replace('>', '').replace(' ', '_').replace('-', '_')
            for name in original_feature_names_out
        ]
        self.feature_names_out = cleaned_feature_names

        # Fit the KNNImputer on the 'total_bedrooms' column
        self.imputer.fit(X[['total_bedrooms']])
        return self

    def transform(self, X):
        X = X.copy()

        # Impute missing values in 'total_bedrooms' first
        X['total_bedrooms'] = self.imputer.transform(X[['total_bedrooms']])

        # Income category
        X['income_cat'] = pd.cut(
            X['median_income'],
            bins=[0,1.5,3,4.5,6,np.inf],
            labels=[1,2,3,4,5]
        ).astype(int)

        # Engineered features
        # Note: despite its name, 'bedrooms_per_house' is the bedroom-to-room ratio
        X['bedrooms_per_house'] = X['total_bedrooms'] / X['total_rooms']
        X['Log_population'] = np.log1p(X['population'])
        X['Log_total_rooms'] = np.log1p(X['total_rooms'])
        X['Log_total_bedrooms'] = np.log1p(X['total_bedrooms'])

        # One-hot encode 'ocean_proximity' using sklearn's OneHotEncoder
        ocean_proximity_encoded = self.encoder.transform(X[['ocean_proximity']])
        ocean_proximity_df = pd.DataFrame(ocean_proximity_encoded, columns=self.feature_names_out, index=X.index)

        # Drop the original 'ocean_proximity' column and concatenate the new encoded columns
        X = X.drop('ocean_proximity', axis=1)
        X = pd.concat([X, ocean_proximity_df], axis=1)

        # Drop raw columns
        X.drop(['total_rooms','total_bedrooms','population'], axis=1, inplace=True)

        return X
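
One detail of the imputation step is worth checking: the `KNNImputer` is fitted on `total_bedrooms` alone, so a row with a missing value has no other observed feature from which to measure distances. In that situation scikit-learn falls back to the column mean learned during `fit`. The sketch below illustrates this with toy values (hypothetical numbers); passing additional numeric columns to the imputer would be one way to get genuinely neighbor-based estimates.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy single-column frame with one missing value (hypothetical numbers).
toy = pd.DataFrame({"total_bedrooms": [100.0, 200.0, np.nan, 600.0]})

imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(toy[["total_bedrooms"]])

filled_value = imputed[2, 0]                  # value written into the gap
observed_mean = toy["total_bedrooms"].mean()  # mean of the observed values

print(filled_value, observed_mean)
```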

### Instantiate Feature Engineer
This cell initializes the `feature_engineer` object, which is an instance of our custom `FeatureEngineer` class. This object will encapsulate all the defined feature engineering steps, allowing for consistent application of transformations to the dataset.

In [None]:
feature_engineer = FeatureEngineer()

### Apply Feature Engineering to Training Data
This cell **fits and applies** the feature engineering pipeline on the training data (`X_train`).
All transformation parameters (imputation, encoding, feature creation) are learned exclusively
from the training set.


In [None]:
X_train_featured = feature_engineer.fit_transform(X_train)

### Create Featured Training Data for Analysis
This cell constructs a feature-engineered version of the training data **without refitting**
the feature engineering pipeline.  
The engineered features are concatenated with the target variable purely for
correlation analysis and visualization purposes.


In [None]:
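# Keep y_train's original index: pd.concat(axis=1) aligns rows by index label,
# and X_train_featured still carries the shuffled index from the split.
df_train_test_featured = pd.concat(
    [X_train_featured, y_train],
    axis=1
)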
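
`pd.concat(..., axis=1)` aligns rows by index label rather than by position, so the features and the target need matching indexes. The toy sketch below (hypothetical values and index labels) shows how resetting the index on only one side breaks the alignment and introduces missing values.

```python
import pandas as pd

# Features and target share a shuffled index, as after train_test_split.
features = pd.DataFrame({"median_income": [3.0, 5.0, 8.0]}, index=[17, 4, 42])
target = pd.Series(
    [150000.0, 250000.0, 400000.0], index=[17, 4, 42], name="median_house_value"
)

# Matching index labels: rows pair up as intended.
aligned = pd.concat([features, target], axis=1)

# Resetting only the target's index: labels no longer match.
misaligned = pd.concat([features, target.reset_index(drop=True)], axis=1)

print(aligned.shape, int(aligned.isna().sum().sum()))
print(misaligned.shape, int(misaligned.isna().sum().sum()))
```

In this toy case no labels overlap, so every row ends up half empty. With real data the labels can overlap partially, which silently pairs features with the wrong target values.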

### Visualize Income Category Distribution
This cell generates a bar plot to visualize the distribution of the `income_cat` feature in the `X_train_featured` DataFrame, showing the counts for each income category.

In [None]:
income_cat_counts = X_train_featured['income_cat'].value_counts().sort_index()
income_cat_counts.plot(kind='bar', figsize=(8, 6))
plt.title('Distribution of Income Categories')
plt.xlabel('Income Category')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
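
To make the category boundaries concrete, the sketch below applies the same `pd.cut` bins to a handful of made-up income values. With the default `right=True`, each bin includes its upper edge, so an income of exactly 1.5 falls in category 1 and 4.5 falls in category 3.

```python
import numpy as np
import pandas as pd

# Made-up incomes chosen to sit on and around the bin edges.
incomes = pd.Series([0.5, 1.5, 1.6, 4.5, 7.0], name="median_income")

income_cat = pd.cut(
    incomes,
    bins=[0, 1.5, 3, 4.5, 6, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

print(income_cat.tolist())
```

A value of exactly 0 would fall outside the first bin and become missing, which would make the `astype(int)` conversion fail.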

### Visualize District Locations (Longitude vs Latitude)
This cell creates a scatter plot of `longitude` against `latitude` from `df_train_test`, visualizing the geographical distribution of the housing districts in California.

In [None]:
df_train_test.plot(kind = 'scatter',x = 'longitude',y = 'latitude',grid = True,alpha = 0.3)
plt.title('Location of districts in California')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

### Visualize Median House Value and Population by Location
This cell generates a scatter plot of `longitude` against `latitude`, where marker size represents `population` and color indicates `median_house_value`. This helps to visualize the spatial distribution of house values and population density.

In [None]:
df_train_test.plot(kind = 'scatter',x = 'longitude',y = 'latitude',
             s= X_train['population']/100,label = 'population',
             c = 'median_house_value',cmap = 'jet',colorbar = True,
             sharex = False,figsize = (10,6),alpha = 0.8)
plt.grid(True)
plt.legend()
plt.show()

### Correlation Matrix Heatmap
This cell visualizes the Pearson correlation matrix of the engineered training features.
It provides insight into linear relationships and multicollinearity among engineered variables,
but does not imply causal relationships.


In [None]:
corr_matrix = X_train_featured.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Training Data')
plt.show()

### Correlation of Features with Target (Training Data Only)
This cell computes the correlation between engineered features and the target variable
using **training data only**.  
The results help identify features that exhibit strong linear association with
median house value.


In [None]:
corr_matrix_with_target = df_train_test_featured.corr()
correlations_with_target = corr_matrix_with_target['median_house_value'].sort_values(ascending=False)
plt.figure(figsize=(8, 5))
correlations_with_target.drop('median_house_value').plot(kind='bar')
plt.title('Correlation of Features with Target (median_house_value)')
plt.xlabel('Features')
plt.ylabel('Correlation Coefficient')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
corr_matrix = df_train_test_featured.corr()
corr_matrix['median_house_value'].sort_values(ascending = False)

### Distribution Comparison: population vs Log_population
This cell displays two histograms side-by-side, comparing the original `population` distribution from `df_train_test` with its log-transformed version, `Log_population`, from `df_train_test_featured`. This illustrates the effect of the `log1p` transformation on the data distribution.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8, 4))

df_train_test['population'].hist(ax=axes[0], bins=50)
axes[0].set_title('Population Distribution')
axes[0].set_xlabel('Population')
axes[0].set_ylabel('Frequency')

df_train_test_featured['Log_population'].hist(ax=axes[1], bins=50)
axes[1].set_title('LogPopulation Distribution')
axes[1].set_xlabel('Log(Population)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
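
The visual comparison can be backed by a number. The sketch below draws a right-skewed synthetic sample (a log-normal stand-in, not the real `population` column) and compares its skewness before and after `np.log1p`.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample standing in for a count-like feature.
rng = np.random.default_rng(0)
population_like = rng.lognormal(mean=7.0, sigma=0.8, size=5000)

skew_raw = skew(population_like)
skew_log = skew(np.log1p(population_like))

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to zero after the transform indicates a roughly symmetric distribution, which matches the shape seen in a log-scale histogram.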

### Distribution Comparison: total_rooms vs Log_total_rooms
This cell displays two histograms side-by-side, comparing the original `total_rooms` distribution from `df_train_test` with its log-transformed version, `Log_total_rooms`, from `df_train_test_featured`. This visualizes how the log transformation affects the distribution of total rooms.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8,4))

df_train_test['total_rooms'].hist(ax=axes[0], bins=50)
axes[0].set_title('Total Rooms Distribution')
axes[0].set_xlabel('Total Rooms')
axes[0].set_ylabel('Frequency')

df_train_test_featured['Log_total_rooms'].hist(ax=axes[1], bins=50)
axes[1].set_title('Log Total Rooms Distribution')
axes[1].set_xlabel('Log(Total Rooms)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

### Distribution Comparison: total_bedrooms vs Log_total_bedrooms
This cell displays two histograms side-by-side, comparing the original `total_bedrooms` distribution from `df_train_test` with its log-transformed version, `Log_total_bedrooms`, from `df_train_test_featured`. This shows the impact of log transformation on the distribution of total bedrooms.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(8,4))

df_train_test['total_bedrooms'].hist(ax=axes[0], bins=50)
axes[0].set_title('Total Bedrooms Distribution')
axes[0].set_xlabel('Total Bedrooms')
axes[0].set_ylabel('Frequency')

df_train_test_featured['Log_total_bedrooms'].hist(ax=axes[1], bins=50)
axes[1].set_title('Log Total Bedrooms Distribution')
axes[1].set_xlabel('Log(Total Bedrooms)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
X_train_featured

### Apply Feature Engineering to Test Data
This cell applies the `FeatureEngineer` transformation to the `X_test` DataFrame, creating the same new engineered features and dropping specified original columns as done for the training data. The result is stored in `X_test_featured`.

In [None]:
X_test_featured = feature_engineer.transform(X_test)

In [None]:
X_test_featured

### Model Definition and Training (XGBoost Regressor)
This cell initializes and trains an **XGBoost Regressor**, a powerful gradient-boosted
tree-based model well suited for structured/tabular data.
The model is trained using the engineered training features.


In [None]:
from xgboost import XGBRegressor
xgb_reg = XGBRegressor(random_state = 42)
xgb_reg.fit(X_train_featured,y_train)

### Cross-Validation Score Calculation
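This cell performs 5-fold cross-validation of the XGBoost Regressor on the training data using `neg_root_mean_squared_error` as the scoring metric. `cross_val_score` refits a fresh clone of the model on each fold, so the result does not depend on the earlier `fit` call. The scores are negated RMSE values (closer to zero is better), and they give a more robust performance estimate than a single train/validation split.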

In [None]:
from sklearn.model_selection import cross_val_score
xgb_cvs = cross_val_score(xgb_reg,X_train_featured,y_train,cv = 5,scoring = 'neg_root_mean_squared_error')
xgb_cvs

### Display Cross-Validation Mean and Standard Deviation
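This cell calculates and displays the mean and standard deviation of the cross-validation scores, providing a summary of the model's performance across different folds. Because the scorer is `neg_root_mean_squared_error`, the mean is negative; its magnitude is the average RMSE.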

In [None]:
f'{xgb_cvs.mean()} ± {xgb_cvs.std()}'
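
The sign convention is easy to trip over, so here is a minimal sketch on synthetic data. It uses `LinearRegression` as a stand-in model (an assumption for illustration only, not the notebook's XGBoost model) and negates the scores to report RMSE as a positive number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with noise of standard deviation 0.3.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

neg_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=5, scoring="neg_root_mean_squared_error",
)

# scikit-learn maximizes scores, so RMSE is returned negated; flip the sign.
rmse_scores = -neg_scores
print(f"{rmse_scores.mean():.3f} ± {rmse_scores.std():.3f}")
```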

### Predict on Test Data
This cell uses the trained XGBoost Regressor model (`xgb_reg`) to make predictions on the `X_test_featured` dataset. The predictions are stored in `xgb_predicted`.

In [None]:
xgb_predicted = xgb_reg.predict(X_test_featured)
xgb_predicted

### Calculate Root Mean Squared Error (RMSE)
This cell calculates the Root Mean Squared Error (RMSE) between the actual target values (`y_test`) and the predicted values (`xgb_predicted`) on the test set. RMSE is a common metric to measure the average magnitude of the errors.

In [None]:
from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y_test,xgb_predicted)

### Calculate Mean Absolute Error (MAE)
This cell calculates the Mean Absolute Error (MAE) between the actual target values (`y_test`) and the predicted values (`xgb_predicted`) on the test set. MAE measures the average magnitude of the errors without considering their direction.

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,xgb_predicted)

### Calculate R-squared (R2) Score
This cell calculates the R-squared (R2) score between the actual target values (`y_test`) and the predicted values (`xgb_predicted`) on the test set. R2 represents the proportion of the variance in the dependent variable that is predictable from the independent variables.

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test,xgb_predicted)
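
For reference, the three metrics above can be reproduced by hand. The sketch below uses four made-up value pairs and checks the manual formulas against scikit-learn; RMSE is computed as the square root of `mean_squared_error` so that it also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred

rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more heavily
mae = np.mean(np.abs(errors))          # average absolute error
ss_res = np.sum(errors ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(rmse, mae, r2)
print(np.sqrt(mean_squared_error(y_true, y_pred)),
      mean_absolute_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```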

### Hyperparameter Tuning with GridSearchCV (XGBoost)
This cell performs an exhaustive grid search to find the optimal hyperparameters for the XGBoost Regressor. It defines a parameter grid with various combinations of `n_estimators`, `max_depth`, `learning_rate`, `subsample`, and `colsample_bytree`. Cross-validation (`cv=5`) is used to evaluate each combination, with `neg_root_mean_squared_error` as the scoring metric. The best parameters and corresponding RMSE are then printed.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid_xgb = {
    'n_estimators': [800,1000],
    'max_depth': [3,4,5,6],
    'learning_rate': [ 0.1,0.2],
    'subsample': [ 0.8,0.9],
    'colsample_bytree': [0.8,0.9]
}

xgb_reg = XGBRegressor(random_state=42)

grid_search_xgb = GridSearchCV(xgb_reg, param_grid_xgb, cv=5,
                               scoring='neg_root_mean_squared_error',
                               return_train_score=True, n_jobs=-1, verbose=2)
grid_search_xgb.fit(X_train_featured, y_train)

print("Best parameters for XGBoost: ", grid_search_xgb.best_params_)
print("Best RMSE for XGBoost: ", -grid_search_xgb.best_score_)
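
Before launching an exhaustive search it helps to estimate its cost. The sketch below counts the candidates in the same grid with `ParameterGrid` and multiplies by the number of folds; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [800, 1000],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.1, 0.2],
    "subsample": [0.8, 0.9],
    "colsample_bytree": [0.8, 0.9],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 4 * 2 * 2 * 2
n_folds = 5
total_fits = n_candidates * n_folds            # fits during the search itself

print(n_candidates, total_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training data after the search.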

### Retrieve Best Estimator
This cell retrieves the best performing XGBoost Regressor model found during the `GridSearchCV`. This optimized model (`xgb_best_reg`) will be used for further evaluation and predictions.

In [None]:
xgb_best_reg = grid_search_xgb.best_estimator_

In [None]:
xgb_best_cvs = cross_val_score(xgb_best_reg,X_train_featured,y_train,cv = 10,scoring = 'neg_root_mean_squared_error')
xgb_best_cvs

In [None]:
f'{xgb_best_cvs.mean()} ± {xgb_best_cvs.std()}'

### Learning Curve Calculation
This cell calculates the learning curve for the XGBoost Regressor. It evaluates the model's performance on both training and cross-validation sets for varying sizes of the training data, helping to diagnose bias and variance.

In [None]:
from sklearn.model_selection import learning_curve

train_sizes = np.linspace(0.1, 1.0, 10)

train_sizes, train_scores, test_scores = learning_curve(
    xgb_best_reg, X_train_featured, y_train,
    cv=5, scoring='neg_root_mean_squared_error', train_sizes=train_sizes
)

### Plot Learning Curve
This cell plots the learning curve generated in the previous step. It visualizes the training score and cross-validation score against the training set size, along with their standard deviations, to assess model performance and identify potential overfitting or underfitting.

In [None]:
train_scores_mean = -train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
test_scores_mean = -test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)

plt.figure(figsize=(10, 6))
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1,
                 color="r")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")

plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1,
                 color="g")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")

plt.title('Learning Curve (XGBoost)')
plt.xlabel('Training Set Size')
plt.ylabel('RMSE')
plt.grid(True)
plt.legend(loc="best")
plt.show()

In [None]:
xgb_best_predicted = xgb_best_reg.predict(X_test_featured)
xgb_best_predicted

In [None]:
root_mean_squared_error(y_test,xgb_best_predicted)

In [None]:
mean_absolute_error(y_test,xgb_best_predicted)

In [None]:
r2_score(y_test,xgb_best_predicted)

### Calculate Confidence Interval for RMSE
This cell computes an approximate 95% confidence interval for the test-set Root Mean Squared Error (RMSE). It first computes the squared errors between the actual (`y_test`) and predicted (`xgb_best_predicted`) target values. Then, using `scipy.stats.t.interval`, it builds a t-distribution interval around the mean of these squared errors (the MSE), with the standard error of the mean as the scale. Taking the square root of both bounds converts the interval to RMSE units.

**Note:** This interval is approximate: it relies on the sample mean of the squared errors being roughly normally distributed, and should be interpreted accordingly.


In [None]:
from scipy import stats
confidence = 0.95
squared_errors = (y_test - xgb_best_predicted) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1, loc=np.mean(squared_errors),scale = stats.sem(squared_errors)))
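
The same calculation can be unpacked step by step. The sketch below runs it on synthetic actual and predicted values (hypothetical numbers, not the notebook's results) and confirms that the point estimate of the RMSE lies inside the interval.

```python
import numpy as np
from scipy import stats

# Synthetic targets and predictions with errors of about 40,000.
rng = np.random.default_rng(0)
y_true = rng.normal(loc=200_000, scale=50_000, size=1000)
y_pred = y_true + rng.normal(scale=40_000, size=1000)

squared_errors = (y_true - y_pred) ** 2
rmse = np.sqrt(squared_errors.mean())

# t-interval around the mean squared error, then square root of both bounds.
mse_low, mse_high = stats.t.interval(
    0.95,
    len(squared_errors) - 1,
    loc=squared_errors.mean(),
    scale=stats.sem(squared_errors),
)
rmse_low, rmse_high = np.sqrt(mse_low), np.sqrt(mse_high)

print(f"RMSE = {rmse:.0f}, 95% CI = [{rmse_low:.0f}, {rmse_high:.0f}]")
```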

### Plot Actual vs. Predicted Values
This cell generates a scatter plot comparing the actual target values (`y_test`) against the model's predicted values (`xgb_best_predicted`). A diagonal line representing perfect prediction is also included for reference, helping to visualize model accuracy.

In [None]:
plt.figure(figsize=(6, 6))
plt.scatter(y_test, xgb_best_predicted, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Target Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.grid(True)
plt.legend()
plt.show()

### Plot Residuals vs. Predicted Values
This cell generates a scatter plot of the residuals (difference between actual and predicted values) against the predicted values. This plot helps to diagnose heteroscedasticity and other patterns in the errors, indicating potential model biases.

In [None]:
residuals = y_test - xgb_best_predicted

plt.figure(figsize=(6, 6))
plt.scatter(xgb_best_predicted, residuals, alpha=0.7)
plt.hlines(y=0, xmin=xgb_best_predicted.min(), xmax=xgb_best_predicted.max(), colors='red', linestyles='--', label='Zero residual')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Residuals vs. Predicted Values')
plt.grid(True)
plt.legend()
plt.show()

In [None]:
feature_importances = xgb_best_reg.feature_importances_
features = X_train_featured.columns
importances_df = pd.Series(feature_importances, index=features).sort_values(ascending=False)

### Plot Feature Importances
This cell visualizes feature importances derived from the **optimized XGBoost Regressor**.
These importances are based on the model’s internal split criteria and provide a high-level
view of influential features.


In [None]:
plt.figure(figsize=(8,5))
importances_df.sort_values().plot.barh()
plt.title('Feature Importances from Optimised XGBoost Regressor')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.tight_layout()
plt.show()

### Calculate Permutation Importances
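This cell calculates permutation importances for the optimized XGBoost Regressor on the test set. Unlike the model's built-in, split-based feature importances, permutation importance measures the decrease in the model's score when a single feature is randomly shuffled. Since no `scoring` argument is passed, the score is the estimator's default (R² for regressors). This gives an estimate of feature contribution that is tied directly to predictive performance on held-out data.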

In [None]:
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(xgb_best_reg, X_test_featured, y_test, n_repeats=10, random_state=42)
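
A minimal sketch of the idea on synthetic data: only the first of two features drives the target, so shuffling it should hurt the score while shuffling the second should barely matter. `LinearRegression` is used as a stand-in model here (an assumption for illustration, not the notebook's XGBoost model).

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on column 0 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

# Fit on the first 200 rows, evaluate importances on the held-out 100.
model = LinearRegression().fit(X_demo[:200], y_demo[:200])
result = permutation_importance(
    model, X_demo[200:], y_demo[200:], n_repeats=10, random_state=42
)

# Mean drop in R^2 per feature when that feature is shuffled.
print(result.importances_mean)
```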

### Display Permutation Importances
This cell extracts and displays the mean permutation importances, sorted in descending order. This provides a numerical ranking of features based on their impact on model performance when shuffled.

In [None]:
perm_importance_mean = perm_importance.importances_mean
perm_importances_df = pd.Series(perm_importance_mean, index=X_test_featured.columns).sort_values(ascending=False)
perm_importances_df

### Plot Permutation Importances
This cell generates a bar plot of the permutation importances. This visual representation helps to understand the relative importance of each feature in the XGBoost model's predictions.

In [None]:
plt.figure(figsize=(8, 5))
perm_importances_df.plot.bar()
plt.title('Permutation Importances from XGBoost Regressor')
plt.xlabel('Features')
plt.ylabel('Importance (mean decrease in score)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### Prepare Data for Map Visualization
This cell creates a new DataFrame `df_map` containing the `latitude` and `longitude` columns from the test set, along with the actual and predicted median house values. This DataFrame is used for geographical visualization of results.

In [None]:
df_map = X_test[['latitude', 'longitude']].copy()
df_map['Actual_median_house_value'] = y_test
df_map['Predicted_median_house_value'] = xgb_best_predicted
df_map.head()

### Visualize Actual vs. Predicted Median House Value on Map
This cell generates two scatter plots side-by-side, visualizing the geographical distribution of actual and predicted median house values on a map. The size and color of the markers represent the house values, allowing for a visual comparison of model performance across locations.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey=True)

# Plot for Actual Median House Value
df_map.plot(kind='scatter', x='longitude', y='latitude',
             c='Actual_median_house_value', cmap='jet', colorbar=True,
             s=df_map['Actual_median_house_value']/10000, alpha=0.8, ax=axes[0])
axes[0].set_title('Actual Median House Value by Location')
axes[0].set_xlabel('Longitude')
axes[0].set_ylabel('Latitude')
axes[0].grid(True)

# Plot for Predicted Median House Value
df_map.plot(kind='scatter', x='longitude', y='latitude',
             c='Predicted_median_house_value', cmap='jet', colorbar=True,
             s=df_map['Predicted_median_house_value']/10000, alpha=0.8, ax=axes[1])
axes[1].set_title('Predicted Median House Value by Location')
axes[1].set_xlabel('Longitude')
axes[1].set_ylabel('Latitude')
axes[1].grid(True)

plt.tight_layout()
plt.show()

### Save Trained Model
This cell serializes the tuned XGBoost regression model (`xgb_best_reg`) to `California_house_prediction_xgboost_reg.pkl` using `joblib`.
Saving the model enables reuse for inference or deployment without retraining.
Only the regressor is saved here; new data must pass through the same fitted `FeatureEngineer` before prediction.

In [None]:
import joblib
joblib.dump(xgb_best_reg,'California_house_prediction_xgboost_reg.pkl')
print("XGBoost Regressor model dumped successfully!")
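
A saved model is only useful if it loads back and predicts identically. The sketch below shows a dump/load round trip with `joblib`, using a small `LinearRegression` and a temporary file as stand-ins (both are assumptions for illustration; the file name `demo_model.pkl` is hypothetical).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model trained on synthetic data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 0.5
model = LinearRegression().fit(X_demo, y_demo)

# Dump to a temporary file, load it back, and compare predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

The same pattern applies to a fitted transformer, so a feature-engineering object can be saved next to the model if both are needed at inference time.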