# Regression Algorithms in Machine Learning

Regression analysis is a set of statistical methods for estimating the relationship between a dependent variable and one or more independent variables. It shows how the typical value of the dependent variable changes when one independent variable varies while the others are held fixed. In machine learning, regression algorithms predict a continuous outcome variable (y) from one or more predictor variables (x).

## Common Types of Regression Algorithms

### 1. Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It is appropriate when that relationship is approximately linear.

```python
from sklearn.linear_model import LinearRegression

# Initialize and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
```
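The snippet above assumes `X_train` and `y_train` already exist. As a minimal self-contained sketch, the example below generates synthetic data from an assumed line (y = 3x + 2 plus Gaussian noise; these values are chosen purely for illustration) and checks that the fitted `coef_` and `intercept_` land close to the generating parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from an assumed ground-truth line y = 3x + 2 with small Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)

model = LinearRegression()
model.fit(X, y)

# coef_ holds one weight per feature; intercept_ is the constant term
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With low noise and a few hundred points, the estimates should sit near 3 and 2; more noise or fewer samples would widen the gap.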

### 2. Logistic Regression

Despite its name, logistic regression is used for classification problems, where the outcome is a categorical variable (binary or multi-class). It models the probability of each class rather than a continuous value.

```python
from sklearn.linear_model import LogisticRegression

# Initialize and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
```
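To illustrate the difference from a regressor, the sketch below fits a classifier on a tiny, made-up one-dimensional dataset (an assumption for illustration only) and inspects both the class probabilities from `predict_proba` and the hard labels from `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data (assumed for illustration): class 1 becomes more likely as x grows
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one column per class; each row sums to 1
proba = clf.predict_proba(X)
# predict returns the class with the highest probability
labels = clf.predict(X)

print(proba.round(2))
print(labels)
```

The output of `predict` is a class label, not a continuous value, which is why regression metrics are usually replaced by accuracy or a confusion matrix for this model.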

### 3. Decision Tree Regression

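Decision trees split the data into subsets, which are split further in turn, so that a prediction is reached through a sequence of questions about the features. A regression tree predicts a constant value within each final subset (leaf), which lets it capture non-linear relationships.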

```python
from sklearn.tree import DecisionTreeRegressor

# Initialize and fit the model
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
```
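A regression tree's prediction is piecewise constant. The sketch below, which assumes a toy sine-shaped target for illustration, limits the tree to depth 3 and confirms that it produces at most 2**3 = 8 distinct predicted values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Non-linear toy target (assumed for illustration): y = sin(x)
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X[:, 0])

# max_depth=3 allows at most 2**3 = 8 leaves
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

# Each leaf predicts a single constant, so the number of distinct
# predictions cannot exceed the number of leaves
pred = tree.predict(X)
n_levels = len(np.unique(pred))
print(f"leaves={tree.get_n_leaves()}, distinct predictions={n_levels}")
```

Leaving `max_depth` unset lets the tree grow until the leaves are pure, which tends to fit the training data very closely and can generalize poorly.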

### 4. Random Forest Regression

Random forest builds multiple decision trees, each trained on a random sample of the data, and averages their predictions to get a more accurate and stable result than any single tree.

```python
from sklearn.ensemble import RandomForestRegressor

# Initialize and fit the model
model = RandomForestRegressor()
model.fit(X_train, y_train)
```
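The averaging can be checked directly. This sketch, using synthetic data from `make_regression` (an illustrative choice), compares the forest's prediction with the mean of the predictions of its individual trees, which scikit-learn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used here only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Average the predictions of the individual fitted trees by hand
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# The forest's own prediction should match the hand-computed average
matches = np.allclose(manual_mean, forest.predict(X))
print(matches)
```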

### 5. Support Vector Regression (SVR)

Support Vector Regression applies the ideas behind support vector machines (SVMs) to regression. The model fits a function that ignores errors smaller than a threshold (epsilon) and penalizes only the deviations that exceed it.

```python
from sklearn.svm import SVR

# Initialize and fit the model
model = SVR()
model.fit(X_train, y_train)
```
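The threshold idea corresponds to the epsilon-insensitive loss. The helper below is a hypothetical illustration written in NumPy, not part of scikit-learn; it uses epsilon = 0.1, which is also the default value of the `epsilon` parameter of `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])

# Absolute errors are 0.05, 0.5, and 1.0; only the part above epsilon is penalized
loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(loss)
```

The first prediction falls inside the tube and costs nothing, while the other two are charged only for the part of the error beyond 0.1.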

### 6. Gradient Boosting Regression

Gradient boosting is an ensemble technique that builds models sequentially. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error, so each new model is fit to correct the errors of the ensemble built so far.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and fit the model
model = GradientBoostingRegressor()
model.fit(X_train, y_train)
```
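The sequential idea can be sketched by hand for the squared-error case, where the quantity each new tree fits is simply the current residual. This is a simplified illustration under assumed settings (toy sine data, depth-2 trees, learning rate 0.1), not the exact procedure inside `GradientBoostingRegressor`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumed for illustration): noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant model: the mean of the targets
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # Add a damped copy of the new tree's output to the running prediction
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The training error falls with every round; whether the error on held-out data keeps falling is a separate question and is why the number of rounds is usually tuned.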

### 7. XGBoost

XGBoost is an efficient and scalable implementation of gradient boosting. It has become incredibly popular because of its performance and speed.

```python
import xgboost as xgb

# Initialize and fit the model
model = xgb.XGBRegressor()
model.fit(X_train, y_train)
```

## Conclusion

These algorithms represent just a few of the many machine learning algorithms used in regression. Each has its strengths and weaknesses and may be suited to different types of data or problem domains. Understanding these algorithms and knowing when to apply each will equip you to tackle a wide array of regression problems.
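One practical way to decide which algorithm suits a dataset is to compare them under cross-validation. The sketch below is one possible setup, not a definitive benchmark: it assumes scikit-learn's bundled diabetes dataset, default hyperparameters, and 5-fold R-squared as the score.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Candidate models with mostly default settings (illustrative, untuned)
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

# Mean R-squared over 5 folds; higher is better
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>8}: {score:.3f}")
```

Averaging over several folds gives a steadier comparison than a single train/test split, whose result can swing noticeably on a dataset this small.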


In [4]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)

# Create a decision tree regressor object
regressor = DecisionTreeRegressor(random_state=0)

# Fit the regressor with training data
regressor.fit(X_train, y_train)

# Make predictions using the test set
y_pred = regressor.predict(X_test)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")


Mean Squared Error (MSE): 6891.797752808989
R-squared (R2): -0.34397344448845835
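The negative R-squared reported for the decision tree is not an error in the metric. R-squared is defined as 1 - SS_res / SS_tot, so it drops below zero whenever a model does worse than always predicting the mean of the targets. The small example below uses made-up numbers to show this, with a hand-written helper `r2_manual` (a hypothetical name) checked against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])            # worse than the mean

def r2_manual(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

r2_baseline = r2_manual(y_true, mean_baseline)
r2_bad = r2_manual(y_true, bad_pred)
print(r2_baseline, r2_bad)
```

Here SS_tot is 5, the mean baseline has SS_res of 5 (R-squared 0), and the reversed predictions have SS_res of 20 (R-squared -3).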


In [5]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)

# Create a random forest regressor object
regressor = RandomForestRegressor(random_state=0, n_estimators=100)

# Fit the regressor with the training data
regressor.fit(X_train, y_train)

# Make predictions using the test set
y_pred = regressor.predict(X_test)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")


Mean Squared Error (MSE): 3750.300122471911
R-squared (R2): 0.26865181564422547


In [7]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Load the breast cancer dataset
data = load_breast_cancer()

# The target is categorical (malignant vs. benign), so this is a classification task
X, y = data.data, data.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create a logistic regression classifier
classifier = LogisticRegression(max_iter=10000, random_state=0)

# Fit the classifier with the training data
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Calculate and print metrics. Accuracy is the natural metric for a classifier;
# with 0/1 labels, MSE equals the misclassification rate, so accuracy = 1 - MSE.
accuracy = accuracy_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")


Accuracy: 0.9474
Mean Squared Error (MSE): 0.05263157894736842
R-squared (R2): 0.7827881867259447


In [8]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)

# Create a linear regression object
regressor = LinearRegression()

# Train the model using the training sets
regressor.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regressor.predict(X_test)

# The coefficients
print('Coefficients: \n', regressor.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(y_test, y_pred))


Coefficients: 
 [ -35.55025079 -243.16508959  562.76234744  305.46348218 -662.70290089
  324.20738537   24.74879489  170.3249615   731.63743545   43.0309307 ]
Mean squared error: 3424.26
Coefficient of determination: 0.33


In [9]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)

# Create KNN regressor object
regressor = KNeighborsRegressor(n_neighbors=5)  # here, 5 is the number of neighbors

# Train the model using the training sets
regressor.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regressor.predict(X_test)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")


Mean Squared Error (MSE): 4243.422022471909
R-squared (R2): 0.1724878302420758


In [10]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Load the diabetes dataset
diabetes = load_diabetes()

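# It's good practice to scale the data for SVM models.
# Note: fitting the scaler on the full dataset before splitting lets test-set
# statistics leak into training; prefer fitting it on the training split only.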
scaler = StandardScaler()
X_scaled = scaler.fit_transform(diabetes.data)

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X_scaled, diabetes.target, test_size=0.2, random_state=0)

# Create SVR object
regressor = SVR(kernel='rbf')  # kernel can also be 'linear', 'poly', etc.

# Train the model using the training sets
regressor.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regressor.predict(X_test)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")


Mean Squared Error (MSE): 4470.939682846807
R-squared (R2): 0.12811948040601506
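The cell above scales the whole dataset before splitting it, so the scaler sees the test rows. A common alternative is to wrap the scaler and the model in a pipeline, which fits the scaler on the training split only. The sketch below is one way to do that with `make_pipeline`; its scores are not shown because they may differ slightly from the output above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training split only,
# then reuses those statistics to transform any later input
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

# make_pipeline names each step after its lowercased class name
scaler = pipeline.named_steps["standardscaler"]
uses_train_stats = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(uses_train_stats)
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refit inside every fold.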


In [11]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)

# Create Gradient Boosting Regressor object
regressor = GradientBoostingRegressor(random_state=0)

# Train the model using the training sets
regressor.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regressor.predict(X_test)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")


Mean Squared Error (MSE): 4071.9510461055215
R-squared (R2): 0.20592648398710256


In [12]:
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = load_diabetes()

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)

# Convert the dataset into DMatrix, the optimized data structure that XGBoost uses
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Specify the parameters
params = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the learning rate (step-size shrinkage applied to each new tree)
    'objective': 'reg:squarederror',  # squared-error loss for regression
    'eval_metric': 'rmse'  # evaluation metric for validation data
}

# Specify validation set to watch performance
watchlist = [(dtest, 'eval'), (dtrain, 'train')]

num_round = 50  # the number of training iterations

# Train the model; passing num_boost_round and evals by keyword keeps the call explicit
bst = xgb.train(params, dtrain, num_boost_round=num_round, evals=watchlist)

# Make predictions
y_pred = bst.predict(dtest)

# Calculate and print metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")


[0]	eval-rmse:124.75405	train-rmse:125.72972
[1]	eval-rmse:96.98714	train-rmse:95.93678
[2]	eval-rmse:79.56656	train-rmse:76.40035
[3]	eval-rmse:70.01961	train-rmse:63.83979
[4]	eval-rmse:64.08181	train-rmse:56.05293
[5]	eval-rmse:62.62043	train-rmse:50.80412
[6]	eval-rmse:61.49554	train-rmse:47.72474
[7]	eval-rmse:61.78865	train-rmse:45.38086
[8]	eval-rmse:62.51831	train-rmse:43.74025
[9]	eval-rmse:62.43861	train-rmse:42.50393
[10]	eval-rmse:62.37349	train-rmse:41.22457
[11]	eval-rmse:62.28799	train-rmse:40.64192
[12]	eval-rmse:62.40342	train-rmse:39.84432
[13]	eval-rmse:62.46278	train-rmse:39.03206
[14]	eval-rmse:62.62018	train-rmse:38.32677
[15]	eval-rmse:63.19474	train-rmse:37.73889
[16]	eval-rmse:63.00371	train-rmse:37.24875
[17]	eval-rmse:63.29954	train-rmse:36.68275
[18]	eval-rmse:63.72195	train-rmse:35.91110
[19]	eval-rmse:63.80423	train-rmse:35.67652
[20]	eval-rmse:63.94451	train-rmse:35.47408
[21]	eval-rmse:64.22036	train-rmse:35.24786
[22]	eval-rmse:64.65215	train-rmse:34.59

