# Regression techniques 

## Table of Contents

### 1 Introduction to Regression:

* Understand the concept of regression in machine learning
* Differentiate between classification and regression tasks

### 2 Simple Linear Regression:

* Understand the concept of simple linear regression
* Implement simple linear regression using scikit-learn
* Evaluate the performance of a simple linear regression model

### 3 Multiple Linear Regression:

* Understand the concept of multiple linear regression
* Implement multiple linear regression using scikit-learn
* Evaluate the performance of a multiple linear regression model

### 4 Polynomial Regression:

* Understand the concept of polynomial regression
* Implement polynomial regression using scikit-learn
* Evaluate the performance of a polynomial regression model

### 5 Regularization Techniques:

* Understand the concept of overfitting and underfitting
* Learn about Ridge regression (L2 regularization)
* Learn about Lasso regression (L1 regularization)
* Implement Ridge and Lasso regression using scikit-learn
* Evaluate the performance of Ridge and Lasso regression models

### 6 Support Vector Regression (SVR):

* Understand the concept of support vector machines for regression
* Implement support vector regression using scikit-learn
* Evaluate the performance of a support vector regression model

### 7 Decision Tree Regression:

* Understand the concept of decision trees for regression
* Implement decision tree regression using scikit-learn
* Evaluate the performance of a decision tree regression model

### 8 Random Forest Regression:

* Understand the concept of ensemble learning and random forests for regression
* Implement random forest regression using scikit-learn
* Evaluate the performance of a random forest regression model

### 9 Model Evaluation Metrics for Regression:

* Learn about various evaluation metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE)
* Understand the concept of R-squared and Adjusted R-squared

### 10 Model Selection and Hyperparameter Tuning:

* Learn how to split datasets into training, validation, and testing sets
* Understand the concept of cross-validation
* Learn about grid search and randomized search for hyperparameter tuning
* Implement model selection and hyperparameter tuning using scikit-learn 

### 11 Assignment: Predicting House Prices using Regression Models

## 1 Introduction to Regression:

Regression is a supervised learning technique in machine learning that focuses on predicting continuous numerical outcomes based on input features. The main objective of regression is to establish a relationship between the independent variables (input features) and the dependent variable (output).

### Understand the concept of regression in machine learning:

In regression, we attempt to fit a model that best represents the relationship between the dependent and independent variables. This model is then used to make predictions on new, unseen data. Linear regression is a common type of regression where the relationship between the variables is modeled as a straight line.


Examples of regression problems include predicting housing prices, estimating stock prices, or forecasting product sales.

### Differentiate between classification and regression tasks:

While both classification and regression are types of supervised learning, they have different goals:

* **Classification**: 
    The objective is to predict a category or class label for a given input. Examples include spam detection, image recognition, and medical diagnosis. Classification models predict discrete outcomes.

    
* **Regression**: 
    The objective is to predict a continuous numerical value for a given input. Examples include predicting housing prices, estimating stock prices, and forecasting product sales. Regression models predict continuous outcomes.

In summary, regression is a supervised learning technique used to predict continuous numerical outcomes, while classification focuses on predicting discrete categorical outcomes. Both techniques play a crucial role in machine learning and can be applied to various real-world problems depending on the nature of the data and the desired output.

## 2 Simple Linear Regression:

Simple linear regression is a basic regression technique that models the relationship between a single independent variable (input feature) and a dependent variable (output) as a straight line. The goal is to find the best-fitting line that minimizes the error between predicted and actual values.

### Understand the concept of simple linear regression:
    
    
The equation for simple linear regression is:
    
    

y = b0 + b1 * x


where:

* y is the dependent variable (output)
* x is the independent variable (input feature)
* b0 is the y-intercept
* b1 is the slope of the line

The objective is to find the optimal values for b0 and b1 that minimize the error between the predicted values and the actual data points.
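
As a rough sketch of what finding the optimal b0 and b1 means, the snippet below computes them in closed form under the usual least-squares criterion (minimizing the sum of squared errors). The helper name `fit_simple_linear_regression` is hypothetical, and the toy data is the same as in this section's scikit-learn example.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
b0, b1 = fit_simple_linear_regression(xs, ys)
print("Intercept (b0):", b0)
print("Slope (b1):", b1)
```

Because the data lies exactly on the line y = 2x, the slope comes out as 2 and the intercept as 0.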

### Implement simple linear regression using scikit-learn:

Here's a simple example of implementing simple linear regression using scikit-learn:

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 7.888609052210118e-31


### Evaluate the performance of a simple linear regression model:

To evaluate the performance of a simple linear regression model, you can use various metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), or R-squared. In the example above, we used MSE as the evaluation metric. A lower MSE indicates a better fit of the model to the data. Other metrics may be more appropriate depending on the specific problem and data distribution.

## 3 Multiple Linear Regression:

Multiple linear regression is an extension of simple linear regression that models the relationship between multiple independent variables (input features) and a dependent variable (output). The goal is to find the best-fitting hyperplane that minimizes the error between predicted and actual values.

### Understand the concept of multiple linear regression:


The equation for multiple linear regression is:

    
y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn


where:

    
* y is the dependent variable (output)
* x1, x2, ..., xn are the independent variables (input features)
* b0 is the y-intercept
* b1, b2, ..., bn are the coefficients of the independent variables

The objective is to find the optimal values for b0, b1, b2, ..., bn that minimize the error between the predicted values and the actual data points.
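
To illustrate how the coefficients can be found, here is a minimal sketch that solves the least-squares problem directly with NumPy. The data is hypothetical and generated without noise from y = 1 + 2 * x1 + 3 * x2, so the recovered coefficients should match those values.

```python
import numpy as np

# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 (no noise)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1 + 2 * x1 + 3 * x2

# Design matrix with a leading column of ones for the intercept b0
A = np.column_stack([np.ones_like(x1), x1, x2])

# Solve the least-squares problem: minimize ||A @ b - y||^2
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs
print("b0, b1, b2:", b0, b1, b2)
print("Rank of design matrix:", rank)
```

The rank matters: if one feature is an exact linear function of another (for example x2 = x1 + 1), the design matrix loses rank and the individual coefficients are no longer uniquely determined, even though the predictions can still be exact.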

### Implement multiple linear regression using scikit-learn:

Here's a simple example of implementing multiple linear regression using scikit-learn:

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create sample data
data = {
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 3, 4, 5, 6],
    'y': [3, 5, 7, 9, 11]
}
df = pd.DataFrame(data)

X = df[['x1', 'x2']]
y = df['y']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 0.0


In [4]:
model.coef_
model.intercept_

8.881784197001252e-16

### Evaluate the performance of a multiple linear regression model:


Evaluating the performance of a multiple linear regression model is similar to simple linear regression. You can use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, or Adjusted R-squared to measure the performance. The choice of evaluation metric(s) depends on the specific problem and data distribution.



In the example above, we used MSE as the evaluation metric. However, you can also calculate other metrics using the `sklearn.metrics` module.

## 4 Polynomial Regression:

Polynomial regression is a type of regression analysis where the relationship between the independent variable(s) and the dependent variable is modeled as an nth-degree polynomial. Polynomial regression can model more complex relationships than simple linear regression.



The degree of a polynomial is the highest power to which the variable (usually x) is raised. For example, in a polynomial of degree 2, the highest power of x is x^2, that is, x multiplied by itself.


So, when we talk about an nth-degree polynomial, we mean a polynomial whose highest power of x is n. For example:


P(x) = 5x^3 + 2x^2 - 3x + 7


This is a third-degree polynomial, because the highest power of x is 3. "nth-degree" simply means that the degree is some unspecified non-negative integer n.
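
To make the idea of degree concrete, here is a small sketch that stores the example polynomial as a list of coefficients, reads off its degree (the highest power of x that appears), and evaluates it. The function name `evaluate_polynomial` is hypothetical.

```python
def evaluate_polynomial(coefficients, x):
    """Evaluate a polynomial using Horner's method.

    Coefficients are ordered from the highest power down to the constant term.
    """
    result = 0
    for c in coefficients:
        result = result * x + c
    return result

# P(x) = 5x^3 + 2x^2 - 3x + 7
coefficients = [5, 2, -3, 7]
# The degree is the highest power, assuming the leading coefficient is nonzero
degree = len(coefficients) - 1
print("Degree:", degree)
print("P(2) =", evaluate_polynomial(coefficients, 2))  # 5*8 + 2*4 - 3*2 + 7 = 49
```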

### Understand the concept of polynomial regression:

    
In polynomial regression, the relationship between the independent variable(s) and the dependent variable is modeled as an nth-degree polynomial. The equation for polynomial regression with a single independent variable is:

    
y = b0 + b1 * x + b2 * x^2 + ... + bn * x^n


where:

* y is the dependent variable (output)
* x is the independent variable (input feature)
* b0 is the constant term
* b1, b2, ..., bn are the coefficients of the independent variable's powers


The objective is to find the optimal values for b0, b1, b2, ..., bn that minimize the error between the predicted values and the actual data points.
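
Although the fitted curve is not a straight line, the model is still linear in its coefficients, so it can be fitted with ordinary least squares on the powers of x. The sketch below shows this with NumPy only; it is an illustration of the idea, not how scikit-learn implements it internally.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])  # y = x^2

degree = 2
# Columns are x^0, x^1, x^2: the model is linear in these expanded features
X_poly = np.vander(x, N=degree + 1, increasing=True)

# Ordinary least squares on the expanded features gives b0, b1, b2
coeffs, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print("b0, b1, b2:", np.round(coeffs, 6))
```

Since the data follows y = x^2 exactly, b2 should be close to 1 and the other two coefficients close to 0.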

### Implement polynomial regression using scikit-learn:

Here's a simple example of implementing polynomial regression using scikit-learn:

In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Create sample data
data = {
    'x': [1, 2, 3, 4, 5],
    'y': [1, 4, 9, 16, 25]
}
df = pd.DataFrame(data)

X = df[['x']]
y = df['y']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create PolynomialFeatures object with the desired degree
poly_features = PolynomialFeatures(degree=2)

# Transform the input features to polynomial features
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Create a linear regression model
model = LinearRegression()

# Fit the model on the transformed training data
model.fit(X_train_poly, y_train)

# Make predictions on the transformed testing data
y_pred = model.predict(X_test_poly)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 1.262177448353619e-29


### Evaluate the performance of a polynomial regression model:

Evaluating the performance of a polynomial regression model is similar to simple and multiple linear regression. You can use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, or Adjusted R-squared to measure the performance. The choice of evaluation metric(s) depends on the specific problem and data distribution.

## 5 Regularization Techniques:

Regularization techniques are used in regression models to prevent overfitting, improve generalization, and balance model complexity.

### Understand the concept of overfitting and underfitting:
    
* **Overfitting**: When a model performs well on the training data but poorly on the testing data, it is said to be overfitting. This occurs when the model captures the noise in the data, fitting it too closely, and thus fails to generalize well to new data.

    
* **Underfitting**: When a model performs poorly on both the training and testing data, it is said to be underfitting. This occurs when the model fails to capture the underlying patterns in the data, often due to excessive simplicity.
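
The difference can be seen by fitting polynomials of increasing degree to noisy data. The sketch below uses hypothetical data drawn from a quadratic trend; the exact numbers depend on the random seed, but a degree-1 fit typically underfits (high error on both sets), while a degree-9 fit drives the training error down without a matching improvement on the test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy data drawn from a quadratic trend
x_train = np.linspace(-3, 3, 12)
y_train = x_train ** 2 + rng.normal(scale=1.0, size=x_train.size)
x_test = np.linspace(-2.75, 2.75, 12)
y_test = x_test ** 2 + rng.normal(scale=1.0, size=x_test.size)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

results = {}  # degree -> (train MSE, test MSE)
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_test, np.polyval(coeffs, x_test)),
    )
    print(f"degree={degree}: train MSE={results[degree][0]:.3f}, "
          f"test MSE={results[degree][1]:.3f}")
```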



### Ridge regression (L2 regularization):

Ridge regression is a technique that adds an L2 penalty term to the linear regression objective function. This penalty term is the sum of the squared coefficients multiplied by a regularization parameter (alpha). Ridge regression helps prevent overfitting by shrinking the coefficients, thereby reducing model complexity.
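
The shrinking effect can be sketched with the closed-form ridge solution. The example below assumes centered data (so the intercept is left out of the penalty) and uses hypothetical random data; the helper `ridge_coefficients` is an illustration, not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 samples, 3 features, known coefficients plus noise
X = rng.normal(size=(30, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=30)

# Center the data so the intercept can be left out of the penalty
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(Xc, yc, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

norms = []
for alpha in (0.0, 1.0, 10.0, 100.0):
    w = ridge_coefficients(Xc, yc, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:>5}: coefficients={np.round(w, 3)}")
```

With alpha = 0 this reduces to ordinary least squares; as alpha grows, the overall size of the coefficient vector decreases.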

### Lasso regression (L1 regularization):


Lasso regression is another regularization technique that adds an L1 penalty term to the linear regression objective function. The L1 penalty term is the sum of the absolute values of the coefficients multiplied by a regularization parameter (alpha). Lasso regression not only helps prevent overfitting but can also induce sparsity in the model, setting some coefficients to zero and effectively performing feature selection.
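
Why the L1 penalty produces exact zeros can be seen from the soft-thresholding operator. In the special case of a single feature (or orthonormal features), the lasso solution is the least-squares coefficient shrunk toward zero by a threshold, and coordinate-descent solvers apply the same operation one coefficient at a time. The sketch below uses hypothetical coefficient values; how the threshold relates to alpha depends on how the objective is scaled.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it lies within it."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

# Hypothetical unregularized coefficients and an illustrative threshold
ols_coefficients = [2.0, -1.2, 0.3, -0.1]
threshold = 0.5
lasso_like = [soft_threshold(c, threshold) for c in ols_coefficients]
print(lasso_like)
```

The two small coefficients are set exactly to zero, which is the sparsity effect described above, while the larger ones are only reduced in magnitude.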

### Implement Ridge and Lasso regression using scikit-learn:

Here's a simple example of implementing Ridge and Lasso regression using scikit-learn:

In [4]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create sample data
data = {
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 3, 4, 5, 6],
    'y': [3, 5, 7, 9, 11]
}
df = pd.DataFrame(data)

X = df[['x1', 'x2']]
y = df['y']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Ridge and Lasso regression models
ridge = Ridge(alpha=1)
lasso = Lasso(alpha=1)

# Fit the models on the training data
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

# Make predictions on the testing data
y_pred_ridge = ridge.predict(X_test)
y_pred_lasso = lasso.predict(X_test)

# Evaluate the models
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
print("Mean Squared Error (Ridge):", mse_ridge)
print("Mean Squared Error (Lasso):", mse_lasso)


Mean Squared Error (Ridge): 0.018261504747991222
Mean Squared Error (Lasso): 0.32653061224489766


### Evaluate the performance of Ridge and Lasso regression models:

Evaluating the performance of Ridge and Lasso regression models is similar to other regression models. You can use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, or Adjusted R-squared to measure the performance. The choice of evaluation metric(s) depends on the specific problem and data distribution.

## 6 Support Vector Regression (SVR):

Support Vector Regression is an extension of Support Vector Machines (SVM) for solving regression problems. The goal of SVR is to find a function that best approximates the relationship between input features and output values while considering a margin of tolerance.

### Understand the concept of support vector machines for regression:


In SVR, the objective is to fit a function that is as flat as possible while keeping the prediction errors within a margin of tolerance. The width of this margin is set by a parameter called epsilon (ε): errors smaller than ε are ignored, while larger errors are penalized, with the parameter C controlling how strongly. The model's complexity is kept in check by penalizing large coefficients.


SVR can be used with different kernels, such as linear, polynomial, and radial basis function (RBF). The choice of kernel and its parameters can significantly impact the model's performance.
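
The role of epsilon can be illustrated with the epsilon-insensitive loss, which is zero for predictions inside the margin and grows linearly outside it. The function below is a minimal sketch of that loss for a single prediction; the example values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Loss is zero inside the epsilon margin and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

examples = [(10.0, 10.05), (10.0, 10.5), (10.0, 8.0)]
for y_true, y_pred in examples:
    loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
    print(f"y_true={y_true}, y_pred={y_pred}, loss={loss:.2f}")
```

The first prediction is off by 0.05, which is within epsilon, so it contributes no loss; the other two are penalized only for the part of the error that exceeds epsilon.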

### Implement support vector regression using scikit-learn:

Here's a simple example of implementing support vector regression using scikit-learn:

In [5]:
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create sample data
data = {
    'x': [1, 2, 3, 4, 5],
    'y': [1, 4, 9, 16, 25]
}
df = pd.DataFrame(data)

X = df[['x']]
y = df['y']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a support vector regression model (RBF kernel)
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)

# Fit the model on the training data
svr.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = svr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 50.44443474343727


### Evaluate the performance of a support vector regression model:

Evaluating the performance of a support vector regression model is similar to other regression models. You can use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, or Adjusted R-squared to measure the performance. The choice of evaluation metric(s) depends on the specific problem and data distribution.

## 7 Decision Tree Regression:

Decision Tree Regression is a type of regression model that uses a tree-like structure to represent the relationship between input features and output values. It recursively splits the data into subsets based on the input features, aiming to minimize the variance of the output values in each subset.

### Understand the concept of decision trees for regression:

A decision tree for regression works by recursively splitting the data into subsets based on the input features. The splitting criteria aim to minimize the variance of the output values in each subset. At each leaf node of the tree, the predicted value is the average of the output values in that subset. Decision trees can be prone to overfitting, especially when they grow too deep, and may require pruning or limiting the maximum depth to improve generalization.
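
The splitting step can be sketched for a single feature: try each threshold between neighboring values and keep the one that leaves the smallest total squared error (variance times count) in the two subsets. The helper names are hypothetical, and this covers only one split, not a full tree.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (variance times count)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sum_squared_error(left) + sum_squared_error(right)
        if total < best[1]:
            best = (threshold, total)
    return best

xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
threshold, total_sse = best_split(xs, ys)
print("Best threshold:", threshold)
print("Total squared error after split:", round(total_sse, 3))
```

For this data the best split is at x = 3.5, separating [1, 4, 9] from [16, 25]. A full tree repeats the same search inside each subset until a stopping rule such as a maximum depth is reached.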

### Implement decision tree regression using scikit-learn:

Here's a simple example of implementing decision tree regression using scikit-learn:

In [6]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create sample data
data = {
    'x': [1, 2, 3, 4, 5],
    'y': [1, 4, 9, 16, 25]
}
df = pd.DataFrame(data)

X = df[['x']]
y = df['y']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree regression model
dtree = DecisionTreeRegressor(max_depth=3, random_state=42)

# Fit the model on the training data
dtree.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = dtree.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 9.0


### Evaluate the performance of a decision tree regression model:

Evaluating the performance of a decision tree regression model is similar to other regression models. You can use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, or Adjusted R-squared to measure the performance. The choice of evaluation metric(s) depends on the specific problem and data distribution.

## 8 Random Forest Regression:

Random Forest Regression is an ensemble learning method that combines multiple decision trees to improve the model's performance and generalization. Each decision tree in the ensemble is built using a random subset of features and data points, which helps reduce the correlation between the trees and increase the model's robustness.

### Understand the concept of ensemble learning and random forests for regression:

Ensemble learning is a technique that combines multiple weak models to create a more powerful and accurate model. In the case of Random Forest Regression, multiple decision trees are combined to form an ensemble.



A random forest works by constructing multiple decision trees, each trained on a bootstrap sample of the data points, that is, a sample of the same size drawn with replacement. In addition, the splits within each tree can be restricted to a random subset of the input features. The final prediction of the random forest is obtained by averaging the predictions of all the individual decision trees in the ensemble.



This approach reduces the correlation between the trees and helps mitigate overfitting, making random forests more robust and accurate compared to individual decision trees.
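
The two core ingredients, bootstrapping and averaging, can be sketched in plain Python. This is a deliberately simplified illustration: each "tree" is a depth-1 stump on a single feature, and no feature subsampling takes place, whereas a real random forest grows deeper trees over many features. All function names are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree and return a predict(x) function."""
    pairs = sorted(zip(xs, ys))
    overall_mean = sum(ys) / len(ys)
    best = None  # (total squared error, threshold, left mean, right mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All sampled x values were identical: fall back to the mean
        return lambda x: overall_mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=50, seed=42):
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    # Ensemble prediction: average over all trees
    return sum(tree(x) for tree in trees) / len(trees)

xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
trees = fit_bagged_stumps(xs, ys)
print("Ensemble prediction at x=1:", round(predict(trees, 1), 3))
print("Ensemble prediction at x=5:", round(predict(trees, 5), 3))
```

Each stump on its own can only output two values, but because every stump sees a different bootstrap sample, the averaged prediction varies more smoothly across x.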

### Implement random forest regression using scikit-learn:

Here's a simple example of implementing random forest regression using scikit-learn:

In [7]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create sample data
data = {
    'x': [1, 2, 3, 4, 5],
    'y': [1, 4, 9, 16, 25]
}
df = pd.DataFrame(data)

X = df[['x']]
y = df['y']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a random forest regression model
rforest = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=42)

# Fit the model on the training data
rforest.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = rforest.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 0.008099999999999975


### Evaluate the performance of a random forest regression model:

Evaluating the performance of a random forest regression model is similar to other regression models. You can use metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, or Adjusted R-squared to measure the performance. The choice of evaluation metric(s) depends on the specific problem and data distribution.

## 9 Model Evaluation Metrics for Regression:

When evaluating the performance of regression models, various metrics can be used to measure how well the model fits the data and generalizes to new data points. Some common evaluation metrics for regression include Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, and Adjusted R-squared.

### Mean Absolute Error (MAE):


MAE measures the average absolute difference between the predicted values and the actual values. It is a simple and intuitive metric that gives an idea of the average magnitude of the errors.



Formula: MAE = (1/n) * Σ|y_true - y_pred|, 


where 
* n is the number of samples, 
* y_true are the true values, and 
* y_pred are the predicted values.

### Mean Squared Error (MSE):


MSE measures the average squared difference between the predicted values and the actual values. It emphasizes larger errors by squaring them, making it more sensitive to outliers than MAE.



Formula: MSE = (1/n) * Σ(y_true - y_pred)^2

### Root Mean Squared Error (RMSE):


RMSE is the square root of the MSE. It can be read as the typical size of the residuals, which are the differences between the predicted values and the actual values, and it equals their standard deviation when the residuals average to zero. Like MSE, it is sensitive to outliers but has the same unit as the original data, making it easier to interpret.


Formula: RMSE = √(MSE)
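
As a worked example of the three formulas so far, the snippet below computes MAE, MSE, and RMSE by hand for four hypothetical predictions.

```python
import math

y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 12]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]  # [1, 0, -1, -3]

mae = sum(abs(e) for e in errors) / n
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)

print("MAE:", mae)    # (1 + 0 + 1 + 3) / 4 = 1.25
print("MSE:", mse)    # (1 + 0 + 1 + 9) / 4 = 2.75
print("RMSE:", round(rmse, 4))
```

The single large error of 3 contributes 9 of the 11 squared-error units, which shows how MSE and RMSE weight large errors more heavily than MAE does.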

### R-squared:
    
    
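R-squared, also known as the coefficient of determination, measures the proportion of variance in the dependent variable that is predictable from the independent variables. On the training data of a linear model with an intercept, it ranges from 0 to 1, with higher values indicating a better fit. An R-squared value of 1 indicates that the model explains all the variability in the data, while a value of 0 indicates that the model explains none of it, doing no better than always predicting the mean. On unseen data, R-squared can also be negative, which means the model performs worse than always predicting the mean.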


Formula: R^2 = 1 - (Σ(y_true - y_pred)^2) / (Σ(y_true - y_mean)^2), 

where 

* y_mean is the mean of the true values.
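
The formula can be applied directly, as in the sketch below with hypothetical predictions. Note that when predictions are worse than simply predicting the mean, the formula returns a negative value.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, 5, 7, 9]

print(r_squared(y_true, [2, 5, 8, 12]))  # reasonable fit
print(r_squared(y_true, [6, 6, 6, 6]))   # always predicting the mean
print(r_squared(y_true, [9, 7, 5, 3]))   # worse than predicting the mean
```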

### Adjusted R-squared:


Adjusted R-squared is a modified version of R-squared that takes into account the number of independent variables (features) in the model. It is useful when comparing models with different numbers of features, as it penalizes models that include irrelevant features. Unlike R-squared, Adjusted R-squared can decrease when adding a new feature that does not improve the model's performance.



Formula: Adjusted R^2 = 1 - [(1 - R^2) * (n - 1) / (n - k - 1)], 
    
where 

* n is the number of samples, 
* k is the number of independent variables (features).
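
scikit-learn's `r2_score` returns plain R-squared, so Adjusted R-squared is usually computed by hand from the formula above. The helper below is a minimal sketch with hypothetical input values.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("Need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical values: the same R^2 obtained with different numbers of features
few_features = adjusted_r_squared(0.80, n_samples=50, n_features=5)
many_features = adjusted_r_squared(0.80, n_samples=50, n_features=20)
print(round(few_features, 4))
print(round(many_features, 4))
```

For the same R-squared of 0.80, the model that needed 20 features receives a noticeably lower adjusted score than the one that needed 5.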

### Compute evaluation metrics using scikit-learn:

The cell below reuses `y_test` and `y_pred` from the random forest example in section 8.

In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)


Mean Absolute Error: 0.08999999999999986
Mean Squared Error: 0.008099999999999975
Root Mean Squared Error: 0.08999999999999986
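R-squared: nan

Note: R-squared is `nan` here because `y_test` contains only one sample (20% of five rows). R-squared is not well defined for fewer than two samples, since the variance of the true values cannot be computed.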




## 10 Model Selection and Hyperparameter Tuning:

Model selection and hyperparameter tuning are essential steps in building effective machine learning models. These processes help ensure that the chosen model generalizes well to new data points and performs optimally.

### Split datasets into training, validation, and testing sets:

To evaluate a model's performance and tune hyperparameters, it's crucial to split the dataset into separate sets: training, validation, and testing. 

* The training set is used to fit the model.
* The validation set is used to tune hyperparameters and select the best model.
* The testing set is used to evaluate the final model's performance.
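
One common way to obtain the three sets is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split and a hypothetical dataset of 100 samples; the proportions are a choice, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 100 samples with a single feature
X = np.arange(100).reshape(-1, 1)
y = 2 * X.ravel() + 1

# First hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Then split the remaining 80% so that 25% of it (20% overall) is validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```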

### Understand the concept of cross-validation:

* Cross-validation is a technique used to assess the performance of a model by training and evaluating it on different subsets of the data. 
* The most common form of cross-validation is k-fold cross-validation, where the data is divided into k equal-sized folds. The model is trained on k-1 folds and evaluated on the remaining fold. This process is repeated k times, with each fold used as the validation set once. The average performance across all iterations is used as the model's performance estimate.
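
The fold bookkeeping can be sketched in plain Python. The generator below is a hypothetical helper that produces contiguous folds without shuffling; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, validation) in enumerate(folds):
    print(f"Fold {i}: train={train}, validation={validation}")
```

Every sample appears in exactly one validation fold and in the training set of all the other folds.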

### Grid search and randomized search for hyperparameter tuning:

Grid search and randomized search are two methods for hyperparameter tuning:

* **Grid search**: An exhaustive search over a specified range of hyperparameter values. The model is trained and evaluated for each combination of hyperparameter values, and the combination that yields the best performance is chosen.

* **Randomized search**: A more efficient alternative to grid search. Instead of trying every possible combination, a fixed number of random combinations of hyperparameter values are sampled. This approach can be faster and yield similar results compared to grid search, especially when dealing with a large number of hyperparameters.
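
The difference in cost can be sketched by listing the candidates each method would try. The grid below uses the same values as this section's grid search example; the randomized candidates are drawn with Python's `random` module purely for illustration (scikit-learn samples from the distributions you pass it), and the budget of 10 is an arbitrary assumption.

```python
import itertools
import random

param_grid = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}

# Grid search: every combination of the listed values
names = list(param_grid)
grid_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*(param_grid[name] for name in names))
]
print("Grid candidates:", len(grid_candidates))  # 4 * 4 * 3 = 48

# Randomized search: a fixed budget of randomly drawn combinations
rng = random.Random(42)
n_iter = 10
random_candidates = [
    {
        "n_estimators": rng.randrange(10, 200),
        "max_depth": rng.randrange(1, 30),
        "min_samples_split": rng.randrange(2, 10),
    }
    for _ in range(n_iter)
]
print("Random candidates:", len(random_candidates))
```

With 5-fold cross-validation, the grid requires 48 * 5 = 240 model fits, while the randomized search with this budget requires 10 * 5 = 50.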

### Implement model selection and hyperparameter tuning using scikit-learn:
    
Scikit-learn provides various tools for model selection and hyperparameter tuning, such as train_test_split, cross_val_score, GridSearchCV, and RandomizedSearchCV.
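
As a quick illustration of `cross_val_score`, the sketch below runs 5-fold cross-validation for a Ridge model. The synthetic data from `make_regression` and the choice of `alpha=1.0` are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only so the example is self-contained
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

model = Ridge(alpha=1.0)

# 5-fold cross-validation; scikit-learn maximizes scores, so MSE is negated
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold.round(2))
print("Mean MSE:", round(float(mse_per_fold.mean()), 2))
```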

### Example of using grid search for hyperparameter tuning:

In [10]:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

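# Load data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, substitute another dataset, e.g. fetch_california_housing().
data = load_boston()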
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Create a random forest regressor
rforest = RandomForestRegressor(random_state=42)

# Instantiate the grid search with cross-validation
grid_search = GridSearchCV(estimator=rforest, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Print the best combination of hyperparameters and their score
print("Best hyperparameters:", grid_search.best_params_)
print("Best score:", -grid_search.best_score_)



    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

Best hyperparameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 100}
Best score: 14.858836289320982


### Randomized search for hyperparameter tuning with scikit-learn

In [11]:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint

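# Load data
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, substitute another dataset, e.g. fetch_california_housing().
data = load_boston()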
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter distribution
param_dist = {
    'n_estimators': randint(10, 200),
    'max_depth': randint(1, 30),
    'min_samples_split': randint(2, 10)
}

# Create a random forest regressor
rforest = RandomForestRegressor(random_state=42)

# Instantiate the randomized search with cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
random_search = RandomizedSearchCV(estimator=rforest, param_distributions=param_dist, n_iter=50, cv=kfold, scoring='neg_mean_squared_error', random_state=42)

# Fit the randomized search to the data
random_search.fit(X_train, y_train)

# Print the best combination of hyperparameters and their score
print("Best hyperparameters:", random_search.best_params_)
print("Best score:", -random_search.best_score_)



Best hyperparameters: {'max_depth': 15, 'min_samples_split': 6, 'n_estimators': 33}
Best score: 13.878459555873864


## Assignment: Predicting House Prices using Regression Models

### Objective: 

The goal of this assignment is to apply the concepts learned about regression models to predict house prices using a given dataset. You will use various regression models, evaluate their performance, and tune their hyperparameters.

### Dataset: 

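Use the Boston Housing dataset, which was available in scikit-learn's datasets module (`load_boston`) up to version 1.1. It was removed in scikit-learn 1.2, so on newer versions use the California housing dataset (`fetch_california_housing`) instead.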

### Instructions:

* 1 Import necessary libraries and load the dataset.
* 2 Perform exploratory data analysis and preprocessing (e.g., check for missing values, visualize the data, etc.).
* 3 Split the dataset into training and testing sets.
* 4 Implement the following regression models:

    * Simple Linear Regression (choose an appropriate feature)
    * Multiple Linear Regression
    * Polynomial Regression
    * Ridge Regression
    * Lasso Regression
    * Support Vector Regression
    * Decision Tree Regression
    * Random Forest Regression
* 5 Evaluate the performance of each model using appropriate evaluation metrics, such as MAE, MSE, RMSE, R-squared, or Adjusted R-squared.
* 6 Perform cross-validation and hyperparameter tuning for the models that require it, using grid search or randomized search.
* 7 Compare the performance of the different models and discuss your findings.
* 8 Choose the best model based on the evaluation metrics and provide insights into its performance and predictions.


Submission: Prepare a report or Jupyter Notebook that includes your code, visualizations, results, and explanations for each step. The report should be well-organized, clear, and concise. 

## Solution