# Outlier Robust Regression

* **Outlier-robust regression**, also known as **robust regression**, refers to regression techniques that are designed to handle outliers or influential observations in the dataset more effectively compared to traditional regression methods like ordinary least squares (OLS) regression. Outliers are data points that deviate significantly from the rest of the data and can have a disproportionate impact on the estimated regression model.

Here are some key characteristics of outlier-robust regression:

**1. Resilience to Outliers:** Robust regression methods are less sensitive to outliers in the data compared to OLS regression. They use robust estimation techniques that downweight or ignore the influence of outliers when fitting the regression model.

**2. Robust Estimators:** Robust regression estimators resist the effects of outliers and influential observations. Many of them rely on robust measures of central tendency (e.g., the median) and dispersion (e.g., the median absolute deviation) rather than means and variances, which a single extreme value can shift substantially.

**3. Huber Loss Function:** One common approach in robust regression is to use a loss function that is less sensitive to outliers than the squared error loss function used in OLS regression. The Huber loss function is a popular choice, as it combines the advantages of the least squares and absolute error loss functions.

**4. Iteratively Reweighted Least Squares (IRLS):** Many M-estimators, including Huber regression, can be fitted with IRLS: the model is refitted by weighted least squares repeatedly, and in each iteration observations with large residuals receive smaller weights.

**5. Applications:** Robust regression techniques are commonly used in various fields where outliers are common or influential, such as finance, economics, environmental science, and engineering. They are particularly useful when the underlying assumptions of OLS regression, such as normality and constant variance of residuals, are violated due to the presence of outliers.

* Overall, outlier-robust regression methods provide more reliable estimates of regression parameters and better model performance in the presence of outliers, making them a valuable tool for data analysis and modeling.
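The reweighting idea in points 3 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function name `huber_irls`, the threshold `delta=1.35`, the MAD-based scale estimate, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def huber_irls(X, y, delta=1.35, n_iter=50, tol=1e-8):
    """Fit a linear model by IRLS with Huber weights (illustrative sketch)."""
    X1 = np.column_stack([np.ones(len(X)), X])    # prepend an intercept column
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]  # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X1 @ beta
        # Robust scale: median absolute deviation, rescaled for normal errors
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(resid) / scale
        # Huber weights: 1 inside the threshold, delta / u outside it
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        # Weighted least squares via row scaling by sqrt(weight)
        beta_new = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data (an assumption for illustration): y = 2 + 3x plus noise,
# with every tenth response shifted upward by 40.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)
y[::10] += 40.0

ols_beta = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0]
irls_beta = huber_irls(x.reshape(-1, 1), y)
print("OLS  (intercept, slope):", ols_beta)
print("IRLS (intercept, slope):", irls_beta)
```

On data like this, the OLS intercept is pulled toward the shifted points, while the reweighted fit should stay close to the underlying line.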

## Huber Regressor

* The **Huber Regressor** is a robust regression technique that combines the best properties of **least squares** and **least absolute deviation (LAD) regression**. Instead of the residual sum of squares (RSS), it minimizes the Huber loss, which behaves like squared error for small residuals and like absolute error for large ones, making it less sensitive to outliers than ordinary least squares (OLS) regression.

Here are some key points about the Huber Regressor:

**1. Robustness:** The Huber loss function used in Huber regression is less sensitive to outliers than the squared error loss used in OLS regression. It achieves this by applying a quadratic loss to small residuals (those with absolute value below a threshold δ) and a linear loss to larger residuals, so large errors are penalized in proportion to their size rather than its square.

**2. Piecewise Loss:** The Huber loss is defined piecewise, switching from quadratic to linear at the threshold δ, and it is continuous and differentiable at the switch point. This lets it balance the robustness of LAD regression against the efficiency of OLS regression.

**3. Parameter δ:** The parameter δ sets the point at which the loss switches from quadratic to linear. It is a hyperparameter that must be specified; in scikit-learn's `HuberRegressor` it is called `epsilon`, defaults to 1.35, and is applied to residuals divided by an estimated scale. Smaller values of δ treat more points as outliers and give a more robust fit at some cost in efficiency, while larger values make the fit behave more like OLS.

**4. Optimization:** The Huber loss is convex and differentiable, so the coefficients can be found with gradient-based optimizers. scikit-learn's `HuberRegressor` uses L-BFGS, which is why the convergence warning in the output of cell 8 mentions `lbfgs`.

**5. Scalability:** Huber regression can handle large datasets and is scalable to high-dimensional feature spaces. It is commonly used in situations where there are potential outliers in the data and traditional regression techniques may be overly influenced by them.

**6. Applications:** Huber regression is widely used in various fields, including finance, economics, and engineering, where robustness to outliers is important for accurate modeling and prediction.

* Overall, the Huber Regressor is a powerful tool for robust regression that strikes a balance between robustness and efficiency, making it suitable for a wide range of practical applications.
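For reference, the Huber loss on a residual $r$ with threshold $\delta$ is commonly written as follows. This is the textbook form; scikit-learn's `HuberRegressor` uses an equivalent variant applied to scaled residuals, so constants may differ.

```latex
L_\delta(r) =
\begin{cases}
\tfrac{1}{2}\, r^{2}, & |r| \le \delta, \\[4pt]
\delta \left( |r| - \tfrac{1}{2}\,\delta \right), & |r| > \delta.
\end{cases}
```

Both branches equal $\tfrac{1}{2}\delta^{2}$ at $|r| = \delta$, and both have slope $\delta$ there, so the loss is continuous and differentiable at the switch point.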

In [1]:
# Importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import HuberRegressor, QuantileRegressor, RANSACRegressor, TheilSenRegressor
from sklearn.preprocessing import StandardScaler



In [2]:
# Loading the dataset
df = pd.read_csv("C:\\Users\\User\\Desktop\\Drive D\\New folder\\ML\\Completed\\AirBnb.csv")
df.head()

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2020-12-10 | 146.0 | 165.0 | 141.25 | 144.710007 | 144.710007 | 70447500 |
| 1 | 2020-12-11 | 146.550003 | 151.5 | 135.100006 | 139.25 | 139.25 | 26980800 |
| 2 | 2020-12-14 | 135.0 | 135.300003 | 125.160004 | 130.0 | 130.0 | 16966100 |
| 3 | 2020-12-15 | 126.690002 | 127.599998 | 121.5 | 124.800003 | 124.800003 | 10914400 |
| 4 | 2020-12-16 | 125.830002 | 142.0 | 124.910004 | 137.990005 | 137.990005 | 20409600 |


In [3]:
df.set_index("Date",inplace=True)
df

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2020-12-10 | 146.000000 | 165.000000 | 141.250000 | 144.710007 | 144.710007 | 70447500 |
| 2020-12-11 | 146.550003 | 151.500000 | 135.100006 | 139.250000 | 139.250000 | 26980800 |
| 2020-12-14 | 135.000000 | 135.300003 | 125.160004 | 130.000000 | 130.000000 | 16966100 |
| 2020-12-15 | 126.690002 | 127.599998 | 121.500000 | 124.800003 | 124.800003 | 10914400 |
| 2020-12-16 | 125.830002 | 142.000000 | 124.910004 | 137.990005 | 137.990005 | 20409600 |
| ... | ... | ... | ... | ... | ... | ... |
| 2024-01-08 | 137.309998 | 140.250000 | 136.610001 | 140.080002 | 140.080002 | 4179700 |
| 2024-01-09 | 138.520004 | 139.539993 | 137.789993 | 139.529999 | 139.529999 | 3560900 |
| 2024-01-10 | 139.199997 | 140.824997 | 138.699997 | 139.759995 | 139.759995 | 2492700 |
| 2024-01-11 | 140.710007 | 141.199997 | 137.550003 | 139.449997 | 139.449997 | 2383500 |


In [4]:
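# Features: Open, High, Low and Close.
# Note: in the rows shown, Close is identical to the target Adj Close,
# so scores very close to a perfect fit are expected from every model below.
x = df.drop(columns=['Adj Close','Volume'])
x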

| Date | Open | High | Low | Close |
|---|---|---|---|---|
| 2020-12-10 | 146.000000 | 165.000000 | 141.250000 | 144.710007 |
| 2020-12-11 | 146.550003 | 151.500000 | 135.100006 | 139.250000 |
| 2020-12-14 | 135.000000 | 135.300003 | 125.160004 | 130.000000 |
| 2020-12-15 | 126.690002 | 127.599998 | 121.500000 | 124.800003 |
| 2020-12-16 | 125.830002 | 142.000000 | 124.910004 | 137.990005 |
| ... | ... | ... | ... | ... |
| 2024-01-08 | 137.309998 | 140.250000 | 136.610001 | 140.080002 |
| 2024-01-09 | 138.520004 | 139.539993 | 137.789993 | 139.529999 |
| 2024-01-10 | 139.199997 | 140.824997 | 138.699997 | 139.759995 |
| 2024-01-11 | 140.710007 | 141.199997 | 137.550003 | 139.449997 |


In [5]:
y = df['Adj Close']
y

Date
2020-12-10    144.710007
2020-12-11    139.250000
2020-12-14    130.000000
2020-12-15    124.800003
2020-12-16    137.990005
                 ...    
2024-01-08    140.080002
2024-01-09    139.529999
2024-01-10    139.759995
2024-01-11    139.449997
2024-01-12    137.139999
Name: Adj Close, Length: 777, dtype: float64

In [6]:
#Splitting the dataset into training and testing
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [7]:
hbr = HuberRegressor()

In [8]:
hbr.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
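The warning above suggests two remedies: scaling the features or raising `max_iter`. A minimal sketch of both follows, assuming a small synthetic dataset in place of the AirBnb file; the names `X_demo`, `y_demo`, and `scaled_hbr` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for price-like features (values between 100 and 200);
# this is NOT the AirBnb data used in the notebook.
rng = np.random.default_rng(42)
X_demo = rng.uniform(100, 200, size=(300, 3))
y_demo = X_demo @ np.array([0.2, 0.3, 0.5]) + rng.normal(scale=0.1, size=300)

# Standardize the features before fitting, and raise max_iter above its default of 100
scaled_hbr = make_pipeline(StandardScaler(), HuberRegressor(max_iter=1000))
scaled_hbr.fit(X_demo, y_demo)

print("iterations used:", scaled_hbr.named_steps["huberregressor"].n_iter_)
print("R^2 on the demo data:", scaled_hbr.score(X_demo, y_demo))
```

Wrapping the scaler and the regressor in one pipeline means the scaling learned on the training data is reused automatically at prediction time.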


In [9]:
pred = hbr.predict(x_test)

In [10]:
r2_score(y_test,pred)

1.0

In [11]:
mean_squared_error(y_test,pred)

3.817007781028472e-15

## Quantile Regressor

* **Quantile Regression** is a statistical technique used to estimate the conditional quantiles of a response variable given certain predictor variables. Unlike traditional regression methods that focus on estimating the conditional mean of the response variable, quantile regression allows for the estimation of various quantiles, providing a more comprehensive understanding of the relationship between predictors and the response variable.

Here are some key points about Quantile Regression:

**1. Conditional Quantiles:** In Quantile Regression, instead of estimating the conditional mean of the response variable, we estimate its conditional quantiles. A quantile represents a specific value below which a certain proportion of the data falls. For example, the median represents the 50th percentile, while the 25th percentile represents the first quartile.

**2. Robustness:** Quantile Regression is more robust to outliers compared to traditional least squares regression, which focuses on estimating the conditional mean. By estimating quantiles, the model is less influenced by extreme values in the response variable.

**3. Flexibility:** Quantile Regression can be fitted at several quantile levels (typically one model per quantile), providing insights into the entire conditional distribution of the response variable. This makes it suitable for analyzing asymmetric or non-normal distributions.

**4. Interpretability:** The coefficients estimated in Quantile Regression represent the effects of predictor variables on specific quantiles of the response variable. This can provide valuable insights into how the relationship between predictors and the response variable varies across different parts of the distribution.

**5. Applications:** Quantile Regression has applications in various fields, including economics, finance, epidemiology, and environmental science. It is particularly useful when analyzing data with heteroscedasticity (varying spread) or when the distribution of the response variable is skewed.

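**6. Estimation Methods:** Quantile Regression can be estimated using various methods, including linear programming, (sub)gradient methods, and iteratively reweighted least squares (IRLS) approximations. The choice of method depends on the computational cost and the specific requirements of the analysis. scikit-learn's `QuantileRegressor` formulates the problem as a linear program, and the `solver='highs'` argument in cell 12 selects a HiGHS-based solver from SciPy.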

* Overall, Quantile Regression is a valuable tool for exploring the relationship between predictor variables and the conditional distribution of the response variable, providing insights that may be missed by focusing solely on the conditional mean.
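The link between quantiles and the loss being minimized can be made concrete with the pinball (quantile) loss. The sketch below uses a hypothetical helper `pinball_loss` and a synthetic skewed sample, and checks numerically that the constant prediction minimizing the loss at level `q` lands on the empirical `q`-quantile.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    # Under-predictions are weighted by q, over-predictions by (1 - q)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Skewed synthetic sample (an assumption for illustration)
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=5000)

# Search a grid of constant predictions for the one with the smallest loss
candidates = np.linspace(sample.min(), sample.max(), 2001)
for q in (0.1, 0.5, 0.9):
    losses = [pinball_loss(sample, c, q) for c in candidates]
    best = candidates[int(np.argmin(losses))]
    print(f"q={q}: minimizer={best:.3f}, empirical quantile={np.quantile(sample, q):.3f}")
```

Quantile regression applies the same loss with a linear function of the predictors in place of the constant.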

In [12]:
# The default quantile=0.5 makes this a median regression
qr = QuantileRegressor(solver='highs')

In [13]:
qr.fit(x_train,y_train)

In [14]:
pred = qr.predict(x_test)

In [15]:
r2_score(y_test,pred)

1.0

In [16]:
mean_squared_error(y_test,pred)

0.0

## RANSAC Regressor

* **RANSAC (RANdom SAmple Consensus)** is an iterative algorithm used for robust regression. It is particularly useful when a dataset contains outliers or noisy data points that can significantly affect the estimation of a regression model. RANSAC works by repeatedly fitting a model to a small random subset of the data, identifying the points consistent with that model (the "inliers"), and eventually selecting the model with the largest consensus set.

Here's how the RANSAC algorithm works:

**1. Initialization:** Choose a random subset of the data points (called the "minimal sample") to form an initial model. The minimal sample size is determined by the number of parameters in the model.

**2. Model Fitting:** Fit a model to the minimal sample using standard regression techniques (e.g., linear regression).

**3. Inlier Selection:** Determine which data points in the dataset are "inliers" according to the fitted model. This is typically done by calculating the residuals (the vertical distances between the data points and the model) and considering data points with residuals below a certain threshold as inliers.

**4. Model Evaluation:** Evaluate the quality of the fitted model based on the number of inliers it has. A good model should have a sufficient number of inliers.

**5. Model Refinement:** Optionally, refit the model using all of the inliers identified in the previous step to improve its accuracy.

**6. Iteration:** Repeat steps 1-5 for a fixed number of iterations or until certain convergence criteria are met.

**7. Model Selection:** After a predetermined number of iterations, select the model that has the highest number of inliers or the best overall performance.

* RANSAC is commonly used in computer vision, image processing, and other fields where robust estimation of geometric models (e.g., lines, planes, or other shapes) is necessary in the presence of outliers or noise.

* In summary, RANSAC is a powerful algorithm for robust regression that can effectively handle datasets with outliers or noisy data points by iteratively fitting models and identifying the most consistent subset of data.
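The steps above can be sketched for a single predictor in plain NumPy. This is an illustrative toy, not scikit-learn's `RANSACRegressor`: the function name `ransac_line`, the fixed residual threshold of 1.0, the trial count, and the synthetic data are all assumptions.

```python
import numpy as np

def ransac_line(x, y, n_trials=200, threshold=1.0, seed=0):
    """Illustrative RANSAC for y = a + b*x; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones_like(x), x])
    best_inliers = None
    for _ in range(n_trials):
        # Steps 1-2: fit a line exactly through a minimal sample of two points
        idx = rng.choice(len(x), size=2, replace=False)
        if x[idx[0]] == x[idx[1]]:
            continue  # vertical line, skip this sample
        coef = np.linalg.solve(X1[idx], y[idx])
        # Step 3: inliers are points whose absolute residual is below the threshold
        inliers = np.abs(y - X1 @ coef) < threshold
        # Steps 4 and 7: keep the candidate with the largest consensus set
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Step 5: refit by least squares on the best consensus set
    coef = np.linalg.lstsq(X1[best_inliers], y[best_inliers], rcond=None)[0]
    return coef, best_inliers

# Synthetic data (an assumption): y = 1 + 2x plus noise, 20% of responses replaced
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=200)
y[:40] = rng.uniform(-20, 40, 40)

coef, inliers = ransac_line(x, y)
print("intercept, slope:", coef)
print("inliers found:", int(inliers.sum()))
```

The result depends on the threshold and on the random samples drawn, which is why the sketch fixes a seed; in practice the threshold is usually tied to the expected noise level.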

In [17]:
rr = RANSACRegressor()

In [18]:
rr.fit(x_train,y_train)

In [19]:
pred = rr.predict(x_test)

In [20]:
r2_score(y_test,pred)

1.0

In [21]:
mean_squared_error(y_test,pred)

2.5036422411445118e-27

## Theil-Sen Regressor

* The **Theil-Sen** estimator, also known as **Sen's slope estimator** or the **Kendall robust line-fit method**, is a **non-parametric method** for robust linear regression. It is particularly useful when the assumption of normally distributed errors or homoscedasticity (constant variance) in traditional linear regression techniques is violated due to outliers or non-normality in the data.

Here's how the Theil-Sen estimator works:

**1. Median Slope:** For each pair of data points (x_i, y_i) and (x_j, y_j), calculate the slope (y_j - y_i) / (x_j - x_i) and take the median of all these slopes. This median slope estimate is robust to outliers because it is less influenced by extreme values.

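**2. Median Intercept:** Given the median slope m, calculate the median of the values y_i - m * x_i over all data points. This provides a robust estimate of the intercept.

**3. Final Model:** Use the median slope and median intercept to define the final robust linear regression model. With several predictors, as in this notebook, scikit-learn's `TheilSenRegressor` uses a multivariate generalization based on the spatial median of least-squares fits on subsets of the data.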

* The Theil-Sen estimator is robust because it relies on the median instead of the mean, making it less sensitive to outliers. It provides a compromise between the highly robust, but less efficient, estimators like the median and the efficient, but less robust, estimators like ordinary least squares (OLS) regression.

**Advantages of Theil-Sen estimator:**
* **Robustness:** It can handle datasets with outliers and non-normal errors without significantly affecting the estimation.

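* **Efficiency:** For simple linear regression it retains high asymptotic efficiency relative to OLS under normal errors, while tolerating up to about 29.3% contaminated observations (its breakdown point).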

**Disadvantages of Theil-Sen estimator:**
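* **Computational complexity:** With n observations there are n(n-1)/2 pairwise slopes, so the exact computation grows quadratically with the sample size and becomes expensive for large datasets; implementations often limit the number of subsets considered.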

* **Less efficient:** While it is more efficient than some other robust estimators, it is less efficient than ordinary least squares when the data are clean and normally distributed.

* Overall, the Theil-Sen estimator is a valuable tool for robust linear regression when dealing with real-world datasets that may contain outliers or non-normal errors.
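For a single predictor, the median-of-slopes procedure fits in a few lines. The sketch below is a minimal illustration with a hypothetical helper `theil_sen_line` and synthetic data; it takes the intercept as the median of `y - slope * x`, which is the usual convention, and it is not the multivariate algorithm behind scikit-learn's `TheilSenRegressor`.

```python
import numpy as np
from itertools import combinations

def theil_sen_line(x, y):
    """Theil-Sen fit for one predictor: median pairwise slope, then median intercept."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]  # pairs with equal x have no defined slope
    ]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return intercept, slope

# Synthetic data (an assumption): y = 5 + 0.5x plus noise, with outliers at one end
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 5.0 + 0.5 * x + rng.normal(scale=0.2, size=50)
y[-5:] += 30.0  # five large outliers at the right-hand end

intercept, slope = theil_sen_line(x, y)
ols_slope = np.polyfit(x, y, 1)[0]
print(f"Theil-Sen: intercept={intercept:.3f}, slope={slope:.3f}")
print(f"OLS slope: {ols_slope:.3f}")
```

Because the outliers sit at one end of the x range, they tilt the OLS line noticeably, while the median of the pairwise slopes should remain close to the underlying slope.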

In [22]:
tr = TheilSenRegressor()

In [23]:
tr.fit(x_train,y_train)

In [24]:
pred = tr.predict(x_test)

In [25]:
r2_score(y_test,pred)

1.0

In [26]:
mean_squared_error(y_test,pred)

1.3804984932556916e-26