# Supervised Learning Project - Part I

**Note: this example is more advanced.**

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline

# The California Housing Dataset

The California Housing dataset is a popular dataset used in machine learning for regression analysis. It was originally compiled for a 1990 census study and has since been widely used for educational and benchmarking purposes in the field of data science and machine learning. Here's a detailed description:

### Overview of the California Housing Dataset
1. **Content**:
   - The dataset describes housing in California census block groups (often called districts), as gathered from the 1990 U.S. census.
   - It is often used for predicting housing prices based on various demographic and geographic attributes.

2. **Features**:
   - The scikit-learn version has eight numeric features, each describing a census block group:
     - `MedInc`: median income
     - `HouseAge`: median house age
     - `AveRooms`: average number of rooms per household
     - `AveBedrms`: average number of bedrooms per household
     - `Population`: block group population
     - `AveOccup`: average number of household members
     - `Latitude` and `Longitude`: block group location
   - These features are used to predict the median house value in the area.

3. **Target Variable**:
   - The target variable is the median house value of each block group. In scikit-learn's version it is expressed in units of $100,000.

4. **Usage**:
   - This dataset is commonly used for regression tasks in machine learning, where the goal is to predict the median house value based on other metrics.
   - It's a good dataset for beginners to practice regression techniques: it is small (20,640 rows), entirely numeric, and has no missing values.

### Availability and Loading
- The California Housing dataset is available in several machine learning libraries, including scikit-learn. In scikit-learn, it can be loaded using the `fetch_california_housing` function:

  ```python
  from sklearn.datasets import fetch_california_housing
  housing = fetch_california_housing()
  ```

### Applications
- **Educational Tool**: It's widely used for educational purposes to teach regression analysis.
- **Real-World Scenario**: The dataset provides a real-world scenario where regression techniques can be applied, making it a practical choice for hands-on learning.
- **Model Benchmarking**: It is often used to benchmark the performance of various regression models.

### Considerations
- **Data Preprocessing**: Depending on the version of the dataset, some preprocessing steps like feature scaling or normalization might be required to optimize the performance of certain machine learning models.
- **Geographical Data**: The inclusion of latitude and longitude allows for interesting geographical analyses but might require specific handling or domain knowledge for meaningful insights.
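
The scaling point can be sketched on made-up data. The two columns below are hypothetical stand-ins for an income-like and a population-like feature (they are not drawn from the real dataset); the scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (assumed ranges, not real data)
X_train = np.column_stack([rng.uniform(0.5, 15, 500), rng.uniform(3, 35000, 500)])
X_test = np.column_stack([rng.uniform(0.5, 15, 100), rng.uniform(3, 35000, 100)])

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)

# Apply the same statistics to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1; the test columns are close to that but not exact, because they reuse the training statistics.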


In [2]:
# Load the California Housing dataset
california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)

In [3]:
data

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |


## One-hot encoding and Binning data

### Binning with `pandas.cut()`

`pd.cut()` is a function from the Pandas library in Python, used to segment and sort data values into bins or categories. This function is particularly useful for converting a continuous variable into a categorical variable by categorizing the data into discrete intervals (bins). Here's a detailed description:

### Overview of `pd.cut()`
1. **Purpose**:
   - The primary purpose of `pd.cut()` is to divide the range of a continuous variable into intervals and assign these intervals to the corresponding data points.

2. **Functionality**:
   - It allows you to specify the number of bins to use or the specific bin edges.
   - Each bin can be a different size.
   - You can also label the bins with specific names.

### Key Features and Parameters
1. **Syntax**:
   ```python
   pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
   ```
   - `x`: The input array to be binned. It must be a one-dimensional array.
   - `bins`: Defines the bin edges. Can be an integer specifying the number of equal-width bins, or a sequence of bin edges.
   - `right`: Indicates whether bins include the rightmost edge or not.
   - `labels`: Specifies the labels for the returned bins.
   - `retbins`: Whether to return the bins or not.
   - `precision`: The precision at which to store and display the bin labels.
   - `include_lowest`: Whether the first interval should be left-inclusive or not.
   - `duplicates`: If bin edges are not unique, raise an error or drop non-unique bins.

2. **Usage**:
   - Useful for grouping continuous variables into categories for analysis.
   - Often used in data preprocessing for machine learning, data visualization, and statistical analysis.
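
A minimal sketch of how `right`, `include_lowest`, and `retbins` change the result. The `ages` values and the bin edges are made up for illustration.

```python
import pandas as pd

ages = pd.Series([0, 18, 18.5, 40, 65])  # hypothetical values

# Default right=True: intervals are (0, 18] and (18, 65], so 0 falls outside
closed_right = pd.cut(ages, bins=[0, 18, 65])

# include_lowest=True makes the first interval also contain its left edge
with_lowest = pd.cut(ages, bins=[0, 18, 65], include_lowest=True)

# right=False flips the intervals to [0, 18) and [18, 65), so 65 falls outside
closed_left = pd.cut(ages, bins=[0, 18, 65], right=False)

# retbins=True additionally returns the bin edges that were used
binned, edges = pd.cut(ages, bins=2, retbins=True)

print(closed_right.tolist())
print(closed_left.tolist())
print(edges)
```

Values that fall outside every interval become `NaN`, which is easy to miss when the edges are chosen by hand.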

### Example Usage
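```python
import pandas as pd

# Sample data (named `values` so it does not overwrite the notebook's `data` DataFrame)
values = pd.Series([0.1, 0.6, 0.2, 0.9, 0.15, 0.5])

# Split the observed range into 3 equal-width bins and name them
binned = pd.cut(values, bins=3, labels=["Low", "Medium", "High"])
print(binned)
```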

### Considerations
- **Choice of Bins**: The way you define the bins can significantly impact the analysis. Equal-width bins are simple but may not be appropriate for all data distributions.
- **Handling Outliers**: Be mindful of how outliers are treated and which bin they fall into.
- **Data Distribution**: Understanding the underlying distribution of the data is important to make meaningful binning decisions.
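
To illustrate the choice-of-bins point, the sketch below compares equal-width bins from `pd.cut` with equal-count bins from `pd.qcut` (a different pandas function, used here only for contrast) on a synthetic right-skewed sample. The lognormal parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed data (assumed distribution, for illustration only)
skewed = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=1000))

# Equal-width bins: most points pile up in the lowest interval
equal_width = pd.cut(skewed, bins=4)

# Quantile-based bins: each interval holds about the same number of points
equal_count = pd.qcut(skewed, q=4)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With skewed data, equal-width bins can leave the upper bins nearly empty, which is one reason to inspect the distribution before fixing the edges.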


In [4]:
# Feature engineering: bin MedInc into five income bands, then one-hot encode the bands
bins = [-np.inf, 1.5, 3, 4.5, 6, np.inf]
labels = [1, 2, 3, 4, 5]
data['median_income_binned'] = pd.cut(data['MedInc'], bins=bins, labels=labels)
data = pd.get_dummies(data, columns=['median_income_binned'], prefix='median_income_bin')
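
As a rough illustration of what this cell produces, the sketch below applies the same two calls to a small hypothetical frame (`toy` and its values are made up). It reuses the bin edges and labels from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical MedInc values; 3.0 sits exactly on a bin edge
toy = pd.DataFrame({"MedInc": [1.2, 2.9, 3.0, 5.1, 8.3]})

bins = [-np.inf, 1.5, 3, 4.5, 6, np.inf]
labels = [1, 2, 3, 4, 5]

# Intervals are right-closed by default, so 3.0 lands in band 2, i.e. (1.5, 3]
toy["median_income_binned"] = pd.cut(toy["MedInc"], bins=bins, labels=labels)

# One indicator column per band; the original MedInc column is kept
encoded = pd.get_dummies(toy, columns=["median_income_binned"], prefix="median_income_bin")

print(encoded.columns.tolist())
```

Because the binned column is categorical, `get_dummies` emits a column for every band, including band 3, which has no rows in this toy sample.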

In [5]:
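# Split the data into train (60%), validation (20%), and test (20%) sets.
# The searches below cross-validate on X_trainval, so X_train and X_val are not used by them.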
X_trainval, X_test, y_trainval, y_test = train_test_split(data, california.target, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)
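
The two-step split yields a 60/20/20 partition because the second call takes 25% of the remaining 80%. A quick check on placeholder arrays (the 1,000-row size is an arbitrary assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data, used only to count rows
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First split: hold out 20% as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```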

## Data Pipelines

`Pipeline()` in scikit-learn is a utility that helps in sequentially applying a list of transforms and a final estimator. Essentially, it chains together multiple steps in a machine learning process—such as data preprocessing, feature extraction, and model fitting—into a single, unified workflow. Here's a detailed description:

### Overview of `Pipeline()`
1. **Purpose**:
   - The primary purpose of a `Pipeline` is to assemble several steps that can be cross-validated together while setting different parameters.
   - It ensures that the same sequence of steps is applied during both training and prediction.

2. **Functionality**:
   - Each step of a pipeline is a tuple containing the name of the step and an instance of a transformer or estimator.
   - All steps except the last one must be transformers (i.e., they must have a `fit` and `transform` method). The last step can be a transformer or an estimator (i.e., it must have a `fit` method).

### Key Features and Parameters
1. **Syntax**:
   ```python
   from sklearn.pipeline import Pipeline
   pipeline = Pipeline(steps=[('name1', transform1), ('name2', transform2), ..., ('nameN', estimator)])
   ```
   - `steps`: A list of `(name, object)` tuples applied in the order given. Every step except the last must implement `fit` and `transform`; the last step is usually the final estimator.

2. **Usage**:
   - Commonly used to combine preprocessing steps (like scaling, dimensionality reduction) with a model like a classifier or regressor.
   - Simplifies the code and reduces the risk of forgetting a preprocessing step in prediction.

3. **Example Usage**:
   ```python
   from sklearn.preprocessing import StandardScaler
   from sklearn.decomposition import PCA
   from sklearn.ensemble import RandomForestClassifier

   pipeline = Pipeline(steps=[
       ('scaler', StandardScaler()),
       ('pca', PCA(n_components=2)),
       ('classifier', RandomForestClassifier())
   ])
   ```

4. **Benefits**:
   - **Convenience and encapsulation**: Only call `fit` and `predict` once on your data to fit a whole sequence of estimators.
   - **Joint parameter selection**: Grid search over parameters of all estimators in the pipeline at once.
   - **Safety**: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

5. **Grid Search Integration**:
   - Pipelines can be used with grid search to simultaneously adjust parameters for all transformers and the estimator.
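
Grid search addresses a pipeline's parameters with the `<step name>__<parameter>` convention. A minimal sketch, assuming a two-step pipeline whose step names (`scaler`, `ridge`) are chosen here for illustration:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Every tunable parameter is exposed as "<step name>__<parameter name>"
ridge_keys = sorted(k for k in pipe.get_params() if k.startswith("ridge__"))
print(ridge_keys)

# The same keys work with set_params and in a param_grid dictionary
pipe.set_params(ridge__alpha=10.0)
print(pipe.named_steps["ridge"].alpha)
```

A misspelled step name or parameter in a grid raises an error at fit time, so listing the keys from `get_params()` first is a cheap sanity check.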

### Considerations
- **Debugging**: While convenient, pipelines can sometimes be harder to debug due to their encapsulated nature.
- **Custom Transformers**: You can create custom transformers to include in the pipeline as long as they implement the `fit` and `transform` methods.
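
A custom transformer can be sketched as follows. `ClipToTrainingRange` is a hypothetical class written for this illustration, not part of scikit-learn; it remembers each column's minimum and maximum during `fit` and clips new data to that range in `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ClipToTrainingRange(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to the range seen in fit."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.min_, self.max_)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = X_train @ np.array([1.0, -2.0])

pipe = Pipeline([("clip", ClipToTrainingRange()), ("linear", LinearRegression())])
pipe.fit(X_train, y_train)

# An extreme row is clipped to the training range before it reaches the model
X_new = np.array([[100.0, -100.0]])
clipped = pipe.named_steps["clip"].transform(X_new)
print(clipped, pipe.predict(X_new))
```

Inheriting from `BaseEstimator` and `TransformerMixin` supplies `get_params`, `set_params`, and `fit_transform`, which is what lets the class take part in a pipeline and a parameter search.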


In [6]:
# Define the pipelines for each model
linear_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('linear', LinearRegression())
])

lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso())
])

ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])

elasticnet_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('elasticnet', ElasticNet())
])

In [7]:
# Define the hyperparameters to search over for each model
grid_linear_params = {
    'linear__fit_intercept': [True, False]
}

grid_lasso_params = {
    'lasso__alpha': [0.1, 1, 10],
    'lasso__max_iter': [100, 1000, 10000]
}

grid_ridge_params = {
    'ridge__alpha': [0.1, 1, 10],
    # max_iter only affects Ridge's iterative solvers; closed-form solvers ignore it
    'ridge__max_iter': [100, 1000, 10000]
}

grid_elasticnet_params = {
    'elasticnet__alpha': [0.1, 1, 10],
    'elasticnet__l1_ratio': [0.1, 0.5, 0.9],
    'elasticnet__max_iter': [100, 1000, 10000]
}
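
To get a feel for the cost of these grids before running them, the number of candidates can be counted with `ParameterGrid`. The sketch below repeats the ElasticNet grid from the cell above and assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

grid_elasticnet_params = {
    'elasticnet__alpha': [0.1, 1, 10],
    'elasticnet__l1_ratio': [0.1, 0.5, 0.9],
    'elasticnet__max_iter': [100, 1000, 10000]
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(grid_elasticnet_params))

# Each candidate is fitted once per cross-validation fold
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 27 135
```

The count grows multiplicatively with every added parameter, which is the main reason a randomized search becomes attractive for larger spaces.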

## Hyperparameter Search with `GridSearchCV()`

`GridSearchCV` is a class in scikit-learn, a popular Python library for machine learning. It's used for hyperparameter tuning, allowing you to find the best parameters for your machine learning model. Here's a detailed description:

### Overview of `GridSearchCV()`
1. **Purpose**:
   - The primary purpose of `GridSearchCV` is to perform an exhaustive search over specified parameter values for an estimator.
   - The goal is to find the combination of parameters that yields the best model performance, as measured by a specified evaluation metric.

2. **Functionality**:
   - It trains the model multiple times on a range of values for the hyperparameters and evaluates each combination using cross-validation.
   - The combination with the best mean cross-validation score is selected and exposed through `best_params_`, `best_score_`, and `best_estimator_`.

### Key Features and Parameters
1. **Syntax**:
   ```python
   from sklearn.model_selection import GridSearchCV
   grid_search = GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, refit=True, cv=None, ...)
   ```
   - `estimator`: The base model for which to search the hyperparameters.
   - `param_grid`: Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values.
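   - `scoring`: Metric used to score each candidate on the held-out cross-validation folds. If `None`, the estimator's own `score` method is used (R² for regressors).
   - `n_jobs`: Number of jobs to run in parallel (can speed up the grid search); `-1` uses all available processors.
   - `refit`: Whether to refit the estimator with the best parameters found, using all the data passed to `fit`.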
   - `cv`: Determines the cross-validation splitting strategy.

2. **Usage**:
   - Commonly used with models to tune hyperparameters like `C` and `gamma` in SVM, `max_depth` for Decision Trees, or `learning_rate` in neural networks.
   - Useful in almost all kinds of machine learning problems to boost model performance.

3. **Example Usage**:
   ```python
   from sklearn.model_selection import GridSearchCV
   from sklearn.svm import SVC

   param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}
   grid_search = GridSearchCV(SVC(), param_grid, cv=5)
   grid_search.fit(X_train, y_train)
   print(grid_search.best_params_)
   ```

4. **Benefits**:
   - **Comprehensive Search**: It provides a thorough approach to finding the optimal parameters.
   - **Improved Model Accuracy**: Helps in improving the model's performance by fine-tuning the parameters.
   - **Automation**: Automates the process of systematic parameter search and evaluation.

### Considerations
- **Computational Cost**: Can be computationally expensive, especially for large datasets and complex models.
- **Overfitting**: Because many candidates are compared on the same cross-validation folds, the best cross-validation score is an optimistic estimate. Report final performance on a separate test set.
- **Choice of Range and Scoring**: The ranges in `param_grid` and the choice of `scoring` metric significantly influence the effectiveness of the grid search.
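
The effect of `scoring` can be sketched on synthetic regression data. The dataset from `make_regression`, the small `alpha` grid, and the step names are assumptions made for this illustration only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumed, not the housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Rank candidates by mean squared error instead of the default R^2
search = GridSearchCV(
    pipe,
    {"ridge__alpha": [0.1, 1, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# cv_results_ holds one row per candidate
results = pd.DataFrame(search.cv_results_)[
    ["param_ridge__alpha", "mean_test_score", "rank_test_score"]
]
print(results)
print(search.best_params_, -search.best_score_)
```

scikit-learn always maximizes the score, so error metrics are exposed in negated form; the sign is flipped back when printing the best MSE.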


In [8]:
# Perform hyperparameter tuning with GridSearchCV
grid_search_linear = GridSearchCV(linear_pipeline, grid_linear_params, cv=5, n_jobs=-1)
grid_search_lasso = GridSearchCV(lasso_pipeline, grid_lasso_params, cv=5, n_jobs=-1)
grid_search_ridge = GridSearchCV(ridge_pipeline, grid_ridge_params, cv=5, n_jobs=-1)
grid_search_elasticnet = GridSearchCV(elasticnet_pipeline, grid_elasticnet_params, cv=5, n_jobs=-1)

grid_search_linear.fit(X_trainval, y_trainval)
grid_search_lasso.fit(X_trainval, y_trainval)
grid_search_ridge.fit(X_trainval, y_trainval)
grid_search_elasticnet.fit(X_trainval, y_trainval)

In [9]:
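# Define the candidate values that RandomizedSearchCV samples from for each model.
# Note: np.arange(0, 10, 0.01) includes alpha=0, which scikit-learn advises against for Lasso and ElasticNet.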
rand_lasso_params = {
    'lasso__alpha': np.arange(0, 10, 0.01),
}

rand_ridge_params = {
    'ridge__alpha': np.arange(0, 10, 0.01),
    'ridge__max_iter': [10, 100, 1000, 10000]
}

rand_elasticnet_params = {
    'elasticnet__alpha': np.arange(0, 10, 0.01),
    'elasticnet__l1_ratio': np.arange(0, 1, 0.01),
    'elasticnet__max_iter': [10, 100, 1000, 10000]
}
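
As an alternative to enumerating values with `np.arange`, the search space can be described with continuous distributions from `scipy.stats`. The sketch below is not what the notebook does; the ranges are assumptions, and `ParameterSampler` is used only to show which settings would be drawn.

```python
from scipy.stats import loguniform, uniform
from sklearn.model_selection import ParameterSampler

# Assumed ranges: alpha on a log scale, l1_ratio uniformly in [0.05, 0.95]
distributions = {
    "elasticnet__alpha": loguniform(1e-3, 10),
    "elasticnet__l1_ratio": uniform(loc=0.05, scale=0.9),
}

# Draw five settings; a fixed random_state makes the draws repeatable
draws = list(ParameterSampler(distributions, n_iter=5, random_state=42))
for params in draws:
    print(params)
```

A log-uniform distribution spends as many samples between 0.001 and 0.01 as between 1 and 10, which often suits regularization strengths better than an evenly spaced grid. It also keeps `alpha` strictly positive.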

## Hyperparameter Search with `RandomizedSearchCV()`

`RandomizedSearchCV` is a class in scikit-learn, a popular Python library for machine learning, used for hyperparameter tuning. Unlike `GridSearchCV`, which exhaustively tries all possible parameter combinations, `RandomizedSearchCV` samples a fixed number of parameter settings from specified distributions. Here's a detailed description:

### Overview of `RandomizedSearchCV`
1. **Purpose**:
   - The primary purpose of `RandomizedSearchCV` is to find the best parameters for a particular model, but instead of trying out all possible combinations, it randomly samples a given number of parameter combinations from the specified distributions.

2. **Functionality**:
   - This approach can be more efficient than `GridSearchCV`, especially when dealing with a large hyperparameter space or when each evaluation is very expensive.

### Key Features and Parameters
1. **Syntax**:
   ```python
   from sklearn.model_selection import RandomizedSearchCV
   randomized_search = RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, ...)
   ```
   - `estimator`: The base model to tune.
   - `param_distributions`: Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try. Distributions must provide a `rvs` method for sampling (such as those from scipy.stats.distributions).
   - `n_iter`: Number of parameter settings that are sampled. `n_iter` trades off runtime vs quality of the solution.
   - `scoring`: Metric used to score each sampled setting on the held-out cross-validation folds. If `None`, the estimator's own `score` method is used.
   - `n_jobs`: Number of jobs to run in parallel.
   - `refit`: Whether to refit the estimator with the best parameters found, using all the data passed to `fit`.
   - `cv`: Determines the cross-validation splitting strategy.

2. **Usage**:
   - Useful in scenarios where the parameter space is large, and it's computationally infeasible to try all combinations.
   - Often used to optimize the hyperparameters of machine learning models to enhance their performance.

3. **Example Usage**:
   ```python
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import RandomizedSearchCV
   from scipy.stats import randint

   param_distributions = {'n_estimators': randint(100, 200), 'max_depth': [None, 10, 20, 30]}
   randomized_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions, n_iter=100, cv=5)
   randomized_search.fit(X_train, y_train)
   print(randomized_search.best_params_)
   ```

4. **Benefits**:
   - **Efficiency**: More efficient than `GridSearchCV` especially for large hyperparameter spaces.
   - **Exploration**: Can explore a broader range of values and distributions for the hyperparameters.

### Considerations
- **Randomness**: The results can depend on the random seed due to the random nature of the parameter sampling.
- **Coverage**: It might not cover the entire parameter space as thoroughly as `GridSearchCV`.
- **Balance**: Requires balancing between `n_iter` (number of iterations) and the computational budget.
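
The roles of `n_iter` and `random_state` can be sketched on synthetic data. The dataset, the `alpha` range, and the value `n_iter=20` are assumptions chosen for this illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumed, not the housing data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)


def run_search(seed):
    """Run one randomized search with the given seed and return it fitted."""
    pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
    search = RandomizedSearchCV(
        pipe,
        {"ridge__alpha": loguniform(1e-3, 1e2)},
        n_iter=20,  # number of sampled settings, i.e. the compute budget
        cv=5,
        random_state=seed,
    )
    return search.fit(X, y)


first = run_search(seed=42)
second = run_search(seed=42)

# Exactly n_iter candidates are evaluated, and the same seed repeats the result
print(len(first.cv_results_["params"]))
print(first.best_params_ == second.best_params_)
```

Changing the seed changes which settings are sampled, so comparing results across a few seeds gives a sense of how sensitive the outcome is to the sampling.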


In [11]:
# Perform hyperparameter tuning with RandomizedSearchCV
# (n_iter is left at its default, so each search samples 10 parameter settings)
random_search_lasso = RandomizedSearchCV(lasso_pipeline, rand_lasso_params, cv=5, n_jobs=-1, random_state=42)
random_search_ridge = RandomizedSearchCV(ridge_pipeline, rand_ridge_params, cv=5, n_jobs=-1, random_state=42)
random_search_elasticnet = RandomizedSearchCV(elasticnet_pipeline, rand_elasticnet_params, cv=5, n_jobs=-1, random_state=42)

random_search_lasso.fit(X_trainval, y_trainval)
random_search_ridge.fit(X_trainval, y_trainval)
random_search_elasticnet.fit(X_trainval, y_trainval)

In [12]:
# Get the best model and its parameters from GridSearchCV
best_model_grid_linear = grid_search_linear.best_estimator_

best_model_lasso = grid_search_lasso.best_estimator_
best_params_lasso = grid_search_lasso.best_params_

best_model_ridge = grid_search_ridge.best_estimator_
best_params_ridge = grid_search_ridge.best_params_

best_model_elasticnet = grid_search_elasticnet.best_estimator_
best_params_elasticnet = grid_search_elasticnet.best_params_

In [13]:
# Get the best model and its parameters from RandomizedSearchCV
best_model_random_lasso = random_search_lasso.best_estimator_
best_params_random_lasso = random_search_lasso.best_params_

best_model_random_ridge = random_search_ridge.best_estimator_
best_params_random_ridge = random_search_ridge.best_params_

best_model_random_elasticnet = random_search_elasticnet.best_estimator_
best_params_random_elasticnet = random_search_elasticnet.best_params_

In [14]:
best_model_random_elasticnet

In [15]:
# Evaluate the models on the test set
y_pred_linear = best_model_grid_linear.predict(X_test)

y_pred_lasso = best_model_lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

y_pred_ridge = best_model_ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

y_pred_elasticnet = best_model_elasticnet.predict(X_test)
mse_elasticnet = mean_squared_error(y_test, y_pred_elasticnet)
mae_elasticnet = mean_absolute_error(y_test, y_pred_elasticnet)
r2_elasticnet = r2_score(y_test, y_pred_elasticnet)

y_pred_random_lasso = best_model_random_lasso.predict(X_test)
mse_random_lasso = mean_squared_error(y_test, y_pred_random_lasso)
mae_random_lasso = mean_absolute_error(y_test, y_pred_random_lasso)
r2_random_lasso = r2_score(y_test, y_pred_random_lasso)

y_pred_random_ridge = best_model_random_ridge.predict(X_test)
mse_random_ridge = mean_squared_error(y_test, y_pred_random_ridge)
mae_random_ridge = mean_absolute_error(y_test, y_pred_random_ridge)
r2_random_ridge = r2_score(y_test, y_pred_random_ridge)

y_pred_random_elasticnet = best_model_random_elasticnet.predict(X_test)
mse_random_elasticnet = mean_squared_error(y_test, y_pred_random_elasticnet)
mae_random_elasticnet = mean_absolute_error(y_test, y_pred_random_elasticnet)
r2_random_elasticnet = r2_score(y_test, y_pred_random_elasticnet)
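
The repeated metric calculations above could also be collected by a small helper. `regression_report` is a hypothetical function written for this sketch, and the toy predictions stand in for the notebook's `y_pred_*` arrays.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, predictions):
    """Return one row of MSE, MAE and R^2 per named prediction array."""
    rows = {
        name: {
            "MSE": mean_squared_error(y_true, y_pred),
            "MAE": mean_absolute_error(y_true, y_pred),
            "R^2": r2_score(y_true, y_pred),
        }
        for name, y_pred in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy targets and two toy "models" (assumed values, for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
predictions = {
    "perfect": y_true.copy(),
    "mean_only": np.full(4, y_true.mean()),
}

report = regression_report(y_true, predictions)
print(report.round(4))
```

A model that always predicts the mean scores an R² of 0, which gives a useful baseline when reading the results table.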

In [16]:
# Print the results
print("Results for best LinearRegression model:")
print("MSE: {:.4f}".format(mean_squared_error(y_test, y_pred_linear)))
print("MAE: {:.4f}".format(mean_absolute_error(y_test, y_pred_linear)))
print("R^2: {:.4f}".format(r2_score(y_test, y_pred_linear)))
print()

print("Results for best Lasso model (GridSearchCV):")
print("MSE: {:.4f}".format(mse_lasso))
print("MAE: {:.4f}".format(mae_lasso))
print("R^2: {:.4f}".format(r2_lasso))
print("Best parameters: {}".format(best_params_lasso))
print()

print("Results for best Ridge model (GridSearchCV):")
print("MSE: {:.4f}".format(mse_ridge))
print("MAE: {:.4f}".format(mae_ridge))
print("R^2: {:.4f}".format(r2_ridge))
print("Best parameters: {}".format(best_params_ridge))
print()

print("Results for best ElasticNet model (GridSearchCV):")
print("MSE: {:.4f}".format(mse_elasticnet))
print("MAE: {:.4f}".format(mae_elasticnet))
print("R^2: {:.4f}".format(r2_elasticnet))
print("Best parameters: {}".format(best_params_elasticnet))
print()

print("Results for best Lasso model (RandomizedSearchCV):")
print("MSE: {:.4f}".format(mse_random_lasso))
print("MAE: {:.4f}".format(mae_random_lasso))
print("R^2: {:.4f}".format(r2_random_lasso))
print("Best parameters: {}".format(best_params_random_lasso))
print()

print("Results for best Ridge model (RandomizedSearchCV):")
print("MSE: {:.4f}".format(mse_random_ridge))
print("MAE: {:.4f}".format(mae_random_ridge))
print("R^2: {:.4f}".format(r2_random_ridge))
print("Best parameters: {}".format(best_params_random_ridge))
print()

print("Results for best ElasticNet model (RandomizedSearchCV):")
print("MSE: {:.4f}".format(mse_random_elasticnet))
print("MAE: {:.4f}".format(mae_random_elasticnet))
print("R^2: {:.4f}".format(r2_random_elasticnet))
print("Best parameters: {}".format(best_params_random_elasticnet))
print()

Results for best LinearRegression model:
MSE: 0.5518
MAE: 0.5314
R^2: 0.5789

Results for best Lasso model (GridSearchCV):
MSE: 0.6790
MAE: 0.6216
R^2: 0.4818
Best parameters: {'lasso__alpha': 0.1, 'lasso__max_iter': 100}

Results for best Ridge model (GridSearchCV):
MSE: 0.5518
MAE: 0.5314
R^2: 0.5789
Best parameters: {'ridge__alpha': 0.1, 'ridge__max_iter': 100}

Results for best ElasticNet model (GridSearchCV):
MSE: 0.5824
MAE: 0.5637
R^2: 0.5556
Best parameters: {'elasticnet__alpha': 0.1, 'elasticnet__l1_ratio': 0.1, 'elasticnet__max_iter': 100}

Results for best Lasso model (RandomizedSearchCV):
MSE: 0.7429
MAE: 0.6579
R^2: 0.4331
Best parameters: {'lasso__alpha': 0.2}

Results for best Ridge model (RandomizedSearchCV):
MSE: 0.5518
MAE: 0.5314
R^2: 0.5789
Best parameters: {'ridge__max_iter': 10, 'ridge__alpha': 2.15}

Results for best ElasticNet model (RandomizedSearchCV):
MSE: 0.9501
MAE: 0.7644
R^2: 0.2750
Best parameters: {'elasticnet__max_iter': 1000, 'elasticnet__l1_ratio': 0