# Project 1
## CSCI 441
## Group 2
## Members: Chakong Lor, Zachary Sunder, Luis Aguilar
## Date: 10/4/2024

# *TEMPORARY CELL
### All prompts NEED to be answered before completion is acknowledged

Questions to be answered for framing the problem:
1. Define the objective in business terms.x
2. How will your solution be used?x
3. What are the current solutions/workarounds (if any)?x
4. How should you frame this problem (supervised/unsupervised,
 online/offline, etc.)?x
5. How should performance be measured?x
6. Is the performance measure aligned with the business objective?x
7. What would be the minimum performance needed to reach the
 business objective?x
8. What are comparable problems? Can you reuse experience or
 tools?x
9. Is human expertise available?x
10. How would you solve the problem manually?x
11. List the assumptions you (or others) have made so far.x
12. Verify assumptions if possiblex

Questions to be answered for Get the data:
 Note: automate as much as possible so you can easily get fresh data.
1. List the data you need and how much you need.x
2. Find and document where you can get that data.x
3. Check how much space it will take.x
4. Check legal obligations, and get authorization if necessary.x
5. Get access authorizations.x
6. Create a workspace (with enough storage space).x
7. Get the data.x
8. Convert the data to a format you can easily manipulate (without
 changing the data itself).x
9. Ensure sensitive information is deleted or protected (e.g.,
 anonymized).x
10. Check the size and type of data (time series, sample,
 geographical, etc.).x
11. Sample a test set, put it aside, and never look at it (no data
 snooping!)x

Questions to be answered for Explore the Data:
 Note: try to get insights from a field expert for these steps.
1. Create a copy of the data for exploration (sampling it down to a
 manageable size if necessary).x
2. Create a Jupyter notebook to keep a record of your data
 exploration.x
3. Study each attribute and its characteristics:
 Name
 Type (categorical, int/float, bounded/unbounded, text,
 structured, etc.)
 % of missing values
 Noisiness and type of noise (stochastic, outliers,
 rounding errors, etc.)
 Usefulness for the task
 Type of distribution (Gaussian, uniform, logarithmic,
 etc.)x
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to “Get the
 Data”).
10. Document what you have learned.

Questions to be answered for Prepare the Data: Zachary
 * Work on copies of the data (keep the original dataset intact).
 * Write functions for all data transformations you apply, for five
 reasons:
  * So you can easily prepare the data the next time you get a fresh dataset
  * So you can apply these transformations in future projects
  * To clean and prepare the test set
  * To clean and prepare new data instances once your solution is live
  * To make it easy to treat your preparation choices as
 hyperparameters
 1. Data cleaning:
 Fix or remove outliers (optional).
 Fill in missing values (e.g., with zero, mean, median…)
 or drop their rows (or columns).
 2. Feature selection (optional):
 Drop the attributes that provide no useful information for
 the task.
 3. Feature engineering, where appropriate:
 Discretize continuous features.
 Decompose features (e.g., categorical, date/time, etc.).
 Add promising transformations of features (e.g., log(x),
 sqrt(x), x², etc.).
 Aggregate features into promising new features.
 4. Feature scaling: Standardize or normalize features

Shortlist Promising Models: Zachary
 * If the data is huge, you may want to sample smaller training sets
 so you can train many different models in a reasonable time (be
 aware that this penalizes complex models such as large neural
 nets or Random Forests).
 * Once again, try to automate these steps as much as possible.
 1. Train many quick-and-dirty models from different categories
 (e.g., linear, naive Bayes, SVM, Random Forest, neural net, etc.)
 using standard parameters.
 2. Measure and compare their performance.
 * For each model, use N-fold cross-validation and compute
 the mean and standard deviation of the performance
 measure on the N folds.
 3. Analyze the most significant variables for each algorithm.
 4. Analyze the types of errors the models make.
 * What data would a human have used to avoid these
 errors?
 5. Perform a quick round of feature selection and engineering.
 6. Perform one or two more quick iterations of the five previous
 steps.
 7. Shortlist the top three to five most promising models, preferring
 models that make different types of errors.

Fine-Tune the System: Luis
* You will want to use as much data as possible for this step,
 especially as you move toward the end of fine-tuning.
* As always, automate what you can.
1. Fine-tune the hyperparameters using cross-validation:
 * Treat your data transformation choices as
 hyperparameters, especially when you are not sure about
 them (e.g., if you’re not sure whether to replace missing
 values with zeros or with the median value, or to just
 drop the rows).
 * Unless there are very few hyperparameter values to
 explore, prefer random search over grid search. If
 training is very long, you may prefer a Bayesian
 optimization approach (e.g., using Gaussian process
 priors, as described by Jasper Snoek et al.).
2. Try Ensemble methods. Combining your best models will often
 produce better performance than running them individually.
3. Once you are confident about your final model, measure its
 performance on the test set to estimate the generalization error.

Present Your Solution: Luis
 1. Document what you have done.
 2. Create a nice presentation.
 Make sure you highlight the big picture first.
 3. Explain why your solution achieves the business objective.
 4. Don’t forget to present interesting points you noticed along the
 way.
 Describe what worked and what did not.
 List your assumptions and your system’s limitations.
 5. Ensure your key findings are communicated through beautiful
 visualizations or easy-to-remember statements (e.g., “the median
 income is the number-one predictor of housing prices”).

Lastly: Luis\
Remember to write a summary at the bottom as to what we have learned.

### Prepare the Data

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

#### Data Cleaning

There are three options for handling missing values, such as those in the total_bedrooms attribute: remove the affected districts, remove the whole attribute, or set the missing values to some value (imputation). The third option is the least destructive because it discards no data, so it is the one we will use. All three options are listed below for reference:

In [None]:
# For reference only: each option returns a modified copy, so housing itself
# stays intact for the SimpleImputer approach used below
housing_option1 = housing.dropna(subset=["total_bedrooms"])  # option 1: remove
# the corresponding districts

housing_option2 = housing.drop("total_bedrooms", axis=1)  # option 2: remove the whole attribute

# option 3: set the missing values to some value (zero, the mean, the median,
# etc.). This is called imputation
median = housing["total_bedrooms"].median()
housing_option3 = housing.assign(
    total_bedrooms=housing["total_bedrooms"].fillna(median))

Instead of using the code above as is, we will use Scikit-Learn's SimpleImputer class. Its benefit is that it stores the median value of each feature, so the same values can later be used to impute missing values in the validation set, the test set, and any new data.

To use it, first create a SimpleImputer instance and specify that each attribute's missing values should be replaced with the median of that attribute:

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

Since the median can only be computed on numerical attributes, you need to create a copy of the data with only the numerical attributes (this excludes the text attribute ocean_proximity):

In [None]:
housing_num = housing.select_dtypes(include=[np.number])

Now it is possible to fit the imputer instance to the training data using the fit() method:

In [None]:
imputer.fit(housing_num)

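The imputer computes the median of each numerical attribute and stores the results in its statistics_ instance variable. Even if only total_bedrooms has missing values now, new data may have missing values elsewhere, so it is safer to fit the imputer on all the numerical attributes. The stored statistics match the medians computed directly with pandas: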

In [None]:
imputer.statistics_

In [None]:
housing_num.median().values

It is now possible to use this “trained” imputer to transform the training set by replacing missing values with the learned medians:

In [None]:
X = imputer.transform(housing_num)

Missing values can also be replaced with the mean (strategy="mean"), with the most frequent value (strategy="most_frequent"), or with a constant value (strategy="constant", fill_value=...). The last two strategies also support non-numerical data.
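As a quick sanity check of what the median strategy does, here is a minimal sketch on a made-up two-column DataFrame. The column names and values are illustrative assumptions, not taken from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({"rooms": [2.0, np.nan, 4.0, 10.0],
                    "age": [30.0, 20.0, np.nan, 40.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_filled = toy_imputer.fit_transform(toy)  # NumPy array with the NaNs replaced

# fit() ignores the NaNs when computing each column's median
print(toy_imputer.statistics_)
print(toy_filled)
```

The learned medians live in statistics_, so calling transform() on new data later reuses them instead of recomputing them.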

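By default, transform() returns a NumPy array, which has neither column names nor an index. The cells below inspect the fitted imputer and wrap X back into a DataFrame: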

In [None]:
imputer.feature_names_in_

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

In [None]:
null_rows_idx = housing.isnull().any(axis=1)  # rows with at least one missing value
housing_tr.loc[null_rows_idx].head()

In [None]:
imputer.strategy

In [None]:
#from sklearn import set_config
#
# set_config(transform_output="pandas")  # scikit-learn >= 1.2

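In [None]:
# outlier_pred comes from the IsolationForest cell further down; run that cell first
outlier_pred

To drop the outliers, uncomment and run the following code: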

In [None]:
#housing = housing.iloc[outlier_pred == 1]
#housing_labels = housing_labels.iloc[outlier_pred == 1]

#### Handling Text and Categorical Attributes

Up to this point we have only dealt with numerical attributes, but there are also many cases where the data will contain text attributes.

For example, the ocean_proximity attribute:

In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(8)

Since most machine learning algorithms prefer to work with numbers, let's first convert the text to numbers using Scikit-Learn's OrdinalEncoder class like so:

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_encoded[:8]

It is possible to get the list of categories using the categories_ instance variable. It is a list containing a 1D array of categories for each categorical attribute (in this case, a list containing a single array since there is just one categorical attribute):

In [None]:
ordinal_encoder.categories_

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is "INLAND" (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot vectors:

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_1hot

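By default, OneHotEncoder outputs a SciPy sparse matrix rather than a NumPy array, which saves memory when there are many categories. If you want to convert the sparse matrix to a (dense) NumPy array, just call the toarray() method: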

In [None]:
housing_cat_1hot.toarray()

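The outlier predictions (outlier_pred) used in the Data Cleaning section come from an IsolationForest fitted on the imputed numerical data X. Its fit_predict() method labels outliers as -1 and inliers as 1: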

In [None]:
from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(random_state=42)
outlier_pred = isolation_forest.fit_predict(X)

Instead of calling toarray(), you can set sparse_output=False when creating the OneHotEncoder, in which case it returns a regular (dense) NumPy array directly (note: the sparse hyperparameter was renamed to sparse_output in Scikit-Learn 1.2):

In [None]:
cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
cat_encoder.categories_

The Pandas function called get_dummies() converts each categorical feature into a one-hot representation, with one binary feature per category:

In [None]:
df_test = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})
pd.get_dummies(df_test)

The advantage of OneHotEncoder is that it remembers which categories it was trained on. This is very important because once your model is in production, it should be fed exactly the same features as during training: no more, no less. Look what our trained cat_encoder outputs when we make it transform the same df_test (using transform(), not fit_transform()):

In [None]:
cat_encoder.transform(df_test)

Since get_dummies() saw only two categories, it output two columns, whereas OneHotEncoder output one column per learned category, in the right order. Moreover, if you feed get_dummies() a DataFrame containing an unknown category (e.g., "<2H OCEAN"), it will happily generate a column for it:

In [None]:
df_test_unknown = pd.DataFrame({"ocean_proximity": ["<2H OCEAN", "ISLAND"]})
pd.get_dummies(df_test_unknown)

OneHotEncoder is smarter: it will detect the unknown category and raise an exception. If you prefer, you can set the handle_unknown hyperparameter to "ignore", in which case it will just represent the unknown category with zeros:

In [None]:
cat_encoder.handle_unknown = "ignore"
cat_encoder.transform(df_test_unknown)

When you fit any Scikit-Learn estimator using a DataFrame, the estimator stores the column names in the feature_names_in_ attribute. Scikit-Learn then ensures that any DataFrame fed to this estimator after that (e.g., to transform() or predict()) has the same column names. Transformers also provide a get_feature_names_out() method that you can use to build a DataFrame around the transformer’s output:

In [None]:
cat_encoder.feature_names_in_  # output: array(['ocean_proximity'], dtype=object)

In [None]:
cat_encoder.get_feature_names_out()

In [None]:
df_output = pd.DataFrame(cat_encoder.transform(df_test_unknown),
                         columns=cat_encoder.get_feature_names_out(),
                         index=df_test_unknown.index)

In [None]:
df_output

#### Feature Scaling

Feature scaling is one of the most important transformations that you need to apply to your data. Machine learning algorithms generally don’t perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Without any scaling, most models will be biased toward ignoring the median income and focusing more on the number of rooms.

The two main ways to get the attributes to have the same scale are min-max scaling (also known as normalization) and standardization.

Scikit-Learn provides a transformer called MinMaxScaler for min-max scaling. Its default range is 0 to 1; here we set feature_range to (-1, 1) instead, because neural networks tend to work best with zero-mean inputs:

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

In a similar way, Scikit-Learn provides a transformer called StandardScaler for standardization:

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
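To make the two scaling formulas concrete, here is a small sketch that computes them by hand with NumPy on a made-up one-column array and compares the results with the two Scikit-Learn scalers. The toy values and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy feature (hypothetical)

# Min-max scaling: shift and rescale so that values end up in [lo, hi]
lo, hi = -1.0, 1.0
manual_min_max = (x - x.min()) / (x.max() - x.min()) * (hi - lo) + lo

# Standardization: subtract the mean, then divide by the standard deviation
manual_std = (x - x.mean()) / x.std()

sk_min_max = MinMaxScaler(feature_range=(lo, hi)).fit_transform(x)
sk_std = StandardScaler().fit_transform(x)

print(np.allclose(manual_min_max, sk_min_max), np.allclose(manual_std, sk_std))
```

Unlike min-max scaling, standardization does not bound values to a fixed range, but it is much less affected by a single extreme value.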

When a feature’s distribution has a heavy tail (i.e., when values far from the mean are not exponentially rare), both min-max scaling and standardization will squash most values into a small range. Machine learning models generally don’t like this at all. Before you scale the feature, you should first transform it to shrink the heavy tail and, if possible, to make the distribution roughly symmetrical.

For example, a common way to do this for positive features with a heavy tail to the right is to replace the feature with its square root (or raise the feature to a power between 0 and 1). If the feature has a really long and heavy tail, such as a power law distribution, then replacing the feature with its logarithm may help.

For example, the population feature roughly follows a power law: districts with 10,000 inhabitants are only 10 times less frequent than districts with 1,000 inhabitants, not exponentially less frequent. Figure 2-17 shows how much better this feature looks when you compute its log: it’s very close to a Gaussian distribution (i.e., bell-shaped).

In [None]:
# extra code – this cell generates Figure 2–17
fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
housing["population"].hist(ax=axs[0], bins=50)
housing["population"].apply(np.log).hist(ax=axs[1], bins=50)
axs[0].set_xlabel("Population")
axs[1].set_xlabel("Log of population")
axs[0].set_ylabel("Number of districts")
save_fig("long_tail_plot")
plt.show()
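The effect of the log transform can also be checked numerically. The sketch below uses synthetic log-normal data as a stand-in for a heavy-tailed positive feature (it is not the real population column) and a hypothetical helper, skewness(), that measures how asymmetric a distribution is.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 for a symmetric distribution)."""
    z = (values - values.mean()) / values.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(42)
# Synthetic heavy-tailed data (assumption: log-normal, chosen only for illustration)
fake_population = rng.lognormal(mean=7.0, sigma=1.0, size=10_000)

skew_raw = skewness(fake_population)
skew_log = skewness(np.log(fake_population))
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness close to zero after the transform indicates a roughly symmetrical distribution, which is what scaling works best on.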

Another approach to handling heavy-tailed features is bucketizing the feature, that is, chopping its distribution into roughly equal-sized buckets and replacing each value with the index of the bucket it belongs to. Bucketizing can also help when a feature has a multimodal distribution, such as the housing_median_age feature. Another approach to transforming multimodal distributions is to add a feature for each of the modes (at least the main ones), representing the similarity between the housing median age and that particular mode. The similarity measure is typically computed using a radial basis function (RBF) like so:

In [None]:
from sklearn.metrics.pairwise import rbf_kernel

age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)
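For a single feature, the Gaussian RBF similarity to the value 35 is exp(-gamma * (x - 35)²). The following sketch computes it by hand on a few made-up ages (hypothetical values) and checks it against rbf_kernel().

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

ages = np.array([[15.0], [30.0], [35.0], [52.0]])  # toy ages (hypothetical)
gamma = 0.1

# Similarity is 1 at age 35 and decays as the age moves away from 35
manual_simil = np.exp(-gamma * (ages - 35.0) ** 2)
library_simil = rbf_kernel(ages, [[35.0]], gamma=gamma)

print(manual_simil.round(3).ravel())
```

A larger gamma makes the similarity drop off faster, so only ages very close to 35 get a value near 1.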

In [None]:
# extra code – this cell generates Figure 2–18

ages = np.linspace(housing["housing_median_age"].min(),
                   housing["housing_median_age"].max(),
                   500).reshape(-1, 1)
gamma1 = 0.1
gamma2 = 0.03
rbf1 = rbf_kernel(ages, [[35]], gamma=gamma1)
rbf2 = rbf_kernel(ages, [[35]], gamma=gamma2)

fig, ax1 = plt.subplots()

ax1.set_xlabel("Housing median age")
ax1.set_ylabel("Number of districts")
ax1.hist(housing["housing_median_age"], bins=50)

ax2 = ax1.twinx()  # create a twin axis that shares the same x-axis
color = "blue"
ax2.plot(ages, rbf1, color=color, label="gamma = 0.10")
ax2.plot(ages, rbf2, color=color, label="gamma = 0.03", linestyle="--")
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylabel("Age similarity", color=color)

plt.legend(loc="upper left")
save_fig("age_similarity_plot")
plt.show()

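The target values may also need to be transformed. One way is to scale the labels with a StandardScaler, train a model (here a LinearRegression) on the scaled labels, and then use the scaler's inverse_transform() method to bring the predictions back to the original scale: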

In [None]:
from sklearn.linear_model import LinearRegression

target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)
some_new_data = housing[["median_income"]].iloc[:5]  # pretend this is new data

scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)

In [None]:
predictions

A simpler way of doing this is to use Scikit-Learn's TransformedTargetRegressor class, which scales the labels during training and automatically applies the inverse transform to the predictions:

In [None]:
from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(LinearRegression(),
                                   transformer=StandardScaler())
model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)

In [None]:
predictions
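To confirm that the two approaches are equivalent, here is a sketch on synthetic data (the data and the variable names are made up for illustration) that runs both and compares the predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 15, size=(50, 1))                       # fake "income"
y_toy = 40_000 * X_toy[:, 0] + rng.normal(0, 10_000, size=50)  # fake "house value"

# Manual route: scale the labels, train, then invert the scaling on predictions
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y_toy.reshape(-1, 1))
manual_model = LinearRegression().fit(X_toy, y_scaled)
manual_pred = scaler.inverse_transform(manual_model.predict(X_toy[:5]))

# TransformedTargetRegressor performs the same steps internally
ttr = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
ttr.fit(X_toy, y_toy)
ttr_pred = ttr.predict(X_toy[:5])

print(np.allclose(manual_pred.ravel(), ttr_pred))
```

Note that the manual route returns a 2D column of predictions, while TransformedTargetRegressor returns a 1D array because it was trained on 1D labels.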

#### Custom Transformers

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom transformations, cleanup operations, or combining specific attributes.

For example, creating a log-transformer and applying it to the population feature:

In [None]:
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])
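The inverse_func argument is what lets a transformer like this undo itself. A small round-trip sketch on made-up values (not the housing data) shows the idea:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_tf = FunctionTransformer(np.log, inverse_func=np.exp)
values = np.array([[1.0], [10.0], [1000.0]])  # toy positive values (hypothetical)

logged = log_tf.fit_transform(values)          # applies np.log
restored = log_tf.inverse_transform(logged)    # applies np.exp

print(np.allclose(restored, values))
```

This matters, for example, when the transformer is applied to a target: predictions made in log space can be mapped back to the original units.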

Here’s a transformer that computes the same Gaussian RBF similarity measure as earlier:

In [None]:
rbf_transformer = FunctionTransformer(rbf_kernel,
                                      kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])

In [None]:
age_simil_35

Adding a feature that will measure the geographic similarity between each district and San Francisco:

In [None]:
sf_coords = 37.7749, -122.41
sf_transformer = FunctionTransformer(rbf_kernel,
                                     kw_args=dict(Y=[sf_coords], gamma=0.1))
sf_simil = sf_transformer.transform(housing[["latitude", "longitude"]])

In [None]:
sf_simil

Custom transformers can also combine features. For example, here is a FunctionTransformer that computes the ratio between input features 0 and 1:

In [None]:
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer.transform(np.array([[1., 2.], [3., 4.]]))

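FunctionTransformer is handy, but it cannot learn any parameters in fit(). For a trainable transformer, you need to write a custom class with three methods: fit() (which must return self), transform(), and fit_transform(). You get fit_transform() for free by adding TransformerMixin as a base class.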

As an example, here's a custom transformer class that acts similarly to StandardScaler:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):  # no *args or **kwargs!
        self.with_mean = with_mean

    def fit(self, X, y=None):  # y is required even though we don't use it
        X = check_array(X)  # checks that X is an array with finite float values
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]  # every estimator stores this in fit()
        return self  # always return self!

    def transform(self, X):
        check_is_fitted(self)  # looks for learned attributes (with trailing _)
        X = check_array(X)
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_
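The same contract can be seen in an even smaller form. The following sketch defines a hypothetical MeanCenterer class (not part of Scikit-Learn) that only implements fit() and transform(), and relies on TransformerMixin for fit_transform().

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):  # hypothetical example class
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)  # learned parameter (note the trailing _)
        return self                  # fit() must return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])  # toy data (hypothetical)
centerer = MeanCenterer()
centered = centerer.fit_transform(X_train)      # provided by TransformerMixin

# New data is shifted by the means learned on the training data
new_centered = centerer.transform([[2.0, 20.0]])
print(centered, new_centered)
```

The key point is that the statistics are learned once in fit() and then reused unchanged in every later call to transform().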

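A custom transformer can (and often does) use other estimators in its implementation. For example, the following transformer uses a KMeans clusterer in its fit() method to identify the main clusters in the training data, and then uses rbf_kernel() in its transform() method to measure how similar each sample is to each cluster center: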

In [None]:
from sklearn.cluster import KMeans

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

Putting this custom transformer into use:

In [None]:
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
                                           sample_weight=housing_labels)

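This creates a ClusterSimilarity transformer with 10 clusters, then calls fit_transform() with the latitude and longitude of every district in the training set, weighting each district by its median house value. To look at the first three rows, rounded to two decimal places, we call the following: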

In [None]:
similarities[:3].round(2)

The following plot shows the 10 cluster centers found by k-means, with the districts colored according to their geographic similarity to their closest cluster center:

In [None]:
# extra code – this cell generates Figure 2–19

housing_renamed = housing.rename(columns={
    "latitude": "Latitude", "longitude": "Longitude",
    "population": "Population",
    "median_house_value": "Median house value (ᴜsᴅ)"})
housing_renamed["Max cluster similarity"] = similarities.max(axis=1)

housing_renamed.plot(kind="scatter", x="Longitude", y="Latitude", grid=True,
                     s=housing_renamed["Population"] / 100, label="Population",
                     c="Max cluster similarity",
                     cmap="jet", colorbar=True,
                     legend=True, sharex=False, figsize=(10, 7))
plt.plot(cluster_simil.kmeans_.cluster_centers_[:, 1],
         cluster_simil.kmeans_.cluster_centers_[:, 0],
         linestyle="", color="black", marker="X", markersize=20,
         label="Cluster centers")
plt.legend(loc="upper right")
save_fig("district_cluster_plot")
plt.show()

#### Transformation Pipelines

Data transformation steps often need to be executed in the right order, and Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for numerical attributes, which will first impute then scale the input features:

In [None]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

There is also the option to use the make_pipeline() function instead; it takes transformers as positional arguments and creates a Pipeline using the names of the transformers’ classes, in lowercase and without underscores (e.g., "simpleimputer"):

In [None]:
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
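The automatically generated step names can be inspected through the pipeline's steps attribute. Here is a short sketch on a made-up one-column array; the toy values and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Each step is a (name, estimator) pair; the names come from the class names
step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)

# The NaN is first replaced by the median (2.0), then the column is standardized
X_toy = np.array([[1.0], [np.nan], [3.0]])
X_out = toy_pipeline.fit_transform(X_toy)
print(X_out.ravel().round(2))
```

The step names are also what you use to reach a step later, for example through the pipeline's named_steps attribute.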

In [None]:
from sklearn import set_config

set_config(display='diagram')

num_pipeline

Calling the pipeline’s fit_transform() method and looking at the output’s first two rows, rounded to two decimal places:

In [None]:
housing_num_prepared = num_pipeline.fit_transform(housing_num)
housing_num_prepared[:2].round(2)

Calling the pipeline’s get_feature_names_out() method to recover a nice DataFrame:

In [None]:
df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared, columns=num_pipeline.get_feature_names_out(),
    index=housing_num.index)

In [None]:
df_housing_num_prepared.head(2)  # extra code

To handle all columns with a single transformer that applies the appropriate transformations to each column, use Scikit-Learn's ColumnTransformer class. The following applies num_pipeline to the numerical attributes and cat_pipeline to the categorical attribute:

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

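Listing all the column names is not always convenient. Scikit-Learn's make_column_selector() selects all the features of a given type (such as numerical or categorical), and make_column_transformer() names the transformers automatically: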

In [None]:
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)

Applying ColumnTransformer to the housing data:

In [None]:
housing_prepared = preprocessing.fit_transform(housing)

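We can now create a single pipeline that performs all the transformations we have experimented with so far: imputing missing values, adding ratio features, log-transforming the heavy-tailed features, adding cluster similarity features, one-hot encoding the categorical feature, and scaling. The following code builds this pipeline: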

In [None]:
def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler())

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                                     StandardScaler())
preprocessing = ColumnTransformer([
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                               "households", "median_income"]),
        ("geo", cluster_simil, ["latitude", "longitude"]),
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
    ],
    remainder=default_num_pipeline)  # one column remaining: housing_median_age

Testing the functionality of the code:

In [None]:
housing_prepared = preprocessing.fit_transform(housing)
housing_prepared.shape

In [None]:
preprocessing.get_feature_names_out()

### Select and Train a Model

#### Training and Evaluating on the Training Set

Starting off with training a basic linear regression model:

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(housing, housing_labels)

Testing the linear regression model on the training set:

In [None]:
housing_predictions = lin_reg.predict(housing)
housing_predictions[:5].round(-2)  # -2 = rounded to the nearest hundred

And comparing against the actual values:

In [None]:
housing_labels.iloc[:5].values

Measuring this regression model’s RMSE on the whole training set by taking the square root of the result of Scikit-Learn’s mean_squared_error() function (the function's squared=False shortcut is deprecated in recent Scikit-Learn releases, so it is avoided here):

In [None]:
from sklearn.metrics import mean_squared_error

lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
lin_rmse
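The RMSE is simply the square root of the mean of the squared prediction errors. The sketch below works this out by hand on three made-up label and prediction pairs (hypothetical numbers) and compares the result with Scikit-Learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])  # made-up labels
y_pred = np.array([110.0, 190.0, 330.0])  # made-up predictions

# Errors are 10, -10 and 30, so the squared errors are 100, 100 and 900
manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
sklearn_rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(round(manual_rmse, 2))
```

Because the errors are squared before averaging, the single error of 30 dominates the result, which is why the RMSE is sensitive to large mistakes.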

The model is not performing as well as hoped, which is a sign that it is underfitting the training data. The three general options for fixing underfitting are: selecting a more powerful model, feeding the training algorithm with better features, or reducing the constraints on the model.

The last option does not apply here since the model isn't regularized, so we will go with the option of selecting a more powerful model. That model will be the DecisionTreeRegressor, which is capable of finding complex nonlinear relationships in the data, as shown in the code below:

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_labels)

Testing the newly trained DecisionTreeRegressor model on the training set:

In [None]:
housing_predictions = tree_reg.predict(housing)
# np.sqrt() avoids the squared=False argument, deprecated in recent Scikit-Learn releases
tree_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
tree_rmse

Before finalizing anything or touching the test set, we will move on to model validation.

#### Better Evaluation Using Cross-Validation

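This step checks whether the DecisionTreeRegressor model generalizes beyond the data it was trained on, without touching the test set.

One great way to do that is to use Scikit-Learn’s k-fold cross-validation feature. The following code splits the training set into 10 nonoverlapping subsets called folds, then trains and evaluates the model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores. Because Scikit-Learn's cross-validation expects a utility function (greater is better) rather than a cost function (lower is better), the scorer returns the negative RMSE, hence the minus sign in front of cross_val_score():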

In [None]:
from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)
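To show what cross_val_score() does under the hood, here is a sketch that runs the k-fold loop by hand on synthetic data and compares the per-fold RMSEs with the library result. The data, the choice of 5 folds, and the LinearRegression model are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(100, 1))                   # synthetic feature
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1.0, size=100)    # synthetic labels

kfold = KFold(n_splits=5)
manual_rmses = []
for train_idx, valid_idx in kfold.split(X_toy):
    # Train on 4 folds, evaluate on the held-out fold
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    errors = model.predict(X_toy[valid_idx]) - y_toy[valid_idx]
    manual_rmses.append(np.sqrt(np.mean(errors ** 2)))

library_rmses = -cross_val_score(LinearRegression(), X_toy, y_toy,
                                 scoring="neg_root_mean_squared_error", cv=kfold)

print(np.allclose(manual_rmses, library_rmses))
```

Every sample is used for evaluation exactly once, so the mean of the fold scores estimates performance and their standard deviation indicates how precise that estimate is.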

Showing the results:

In [None]:
pd.Series(tree_rmses).describe()

Showcasing the error stats for the LinearRegression model:

In [None]:
# extra code – computes the error stats for the linear model
lin_rmses = -cross_val_score(lin_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)
pd.Series(lin_rmses).describe()

After this cross-validation, it can be seen that the DecisionTreeRegressor model performs similarly to the LinearRegression model. Cross-validation provides valuable information about a model: not only an estimate of its performance, but also a measure of how precise that estimate is (its standard deviation). However, it is not always practical because it requires training the model several times.

The last model that we will be trying is the RandomForestRegressor model and the code is shown below:

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing,
                           RandomForestRegressor(random_state=42))
forest_rmses = -cross_val_score(forest_reg, housing, housing_labels,
                                scoring="neg_root_mean_squared_error", cv=10)

In [None]:
pd.Series(forest_rmses).describe()

Showcasing the RMSE for the RandomForestRegressor model on the training set:

In [None]:
forest_reg.fit(housing, housing_labels)
housing_predictions = forest_reg.predict(housing)
# np.sqrt() avoids the squared=False argument, deprecated in recent Scikit-Learn releases
forest_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
forest_rmse

After testing the new RandomForestRegressor model, it can be seen that it is much more accurate than the previous two. However, its RMSE on the training set is much lower than its cross-validation RMSE, which means that there is still a lot of overfitting.
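The gap between training error and cross-validation error is easy to reproduce on synthetic data. The sketch below uses a single unconstrained DecisionTreeRegressor as a stand-in (not the random forest above) on made-up noisy data, purely to illustrate what overfitting looks like in the numbers.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))                        # synthetic feature
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.5, size=200)       # noisy synthetic labels

tree = DecisionTreeRegressor(random_state=42)  # no depth limit: free to memorize
tree.fit(X_toy, y_toy)
train_rmse = np.sqrt(mean_squared_error(y_toy, tree.predict(X_toy)))

cv_rmses = -cross_val_score(tree, X_toy, y_toy,
                            scoring="neg_root_mean_squared_error", cv=10)

print(f"training RMSE: {train_rmse:.3f}, cross-validation RMSE: {cv_rmses.mean():.3f}")
```

A training error far below the cross-validation error means the model has memorized noise in the training data rather than learned a pattern that generalizes.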

Some potential solutions are to simplify the model, to constrain it (i.e., regularize it), or to get more training data. Before going deeper into any single model, though, the goal at this stage is to try models from various categories of machine learning algorithms, without spending too much time tweaking the hyperparameters, and to shortlist a few promising ones.