# Outliers

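Outliers can have a significant impact on the results of an analysis: a few extreme values can shift the mean, inflate the variance, and distort a fitted model. Here’s a quick guide to detecting and dealing with outliers: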

### **Detecting Outliers**
#### a. Visual Inspection:
   - **Box Plots:** A box plot summarizes the distribution of the data through its median and quartiles. Points located beyond the "whiskers" (conventionally 1.5 × IQR from the edges of the box) are typically considered outliers.
   - **Scatter Plots:** Useful to see the spread of data and identify potential outliers in bivariate analysis.
   - **Histograms:** Helps to identify outliers in univariate analysis by visualizing the data distribution.

#### b. Statistical Methods:
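   - **Z-Score:** A Z-score is the number of standard deviations a data point lies from the mean. A high absolute Z-score (typically above 3) indicates that the data point is likely an outlier. Because the mean and standard deviation are themselves pulled by extreme values, this rule works best when outliers are few.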
   - **IQR Method:** Outliers can be identified by finding values that lie below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where Q1 and Q3 are the first and third quartiles, respectively, and IQR is the interquartile range (Q3 - Q1).
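
The two rules can disagree when several extreme values occur together, because those values inflate the mean and standard deviation that the Z-score relies on. The sketch below illustrates this with made-up numbers; `zscore_mask` and `iqr_mask` are hypothetical helper names, and the thresholds 3 and 1.5 are the conventional defaults described above.

```python
import numpy as np

def zscore_mask(values, threshold=3.0):
    """Flag points whose absolute Z-score exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

def iqr_mask(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up sample: 20 ordinary values plus three extreme ones
values = np.array([8, 9, 10, 11, 12] * 4 + [60, 65, 70], dtype=float)

z_flags = zscore_mask(values)
iqr_flags = iqr_mask(values)

print("Flagged by Z-score:", values[z_flags])
print("Flagged by IQR:    ", values[iqr_flags])
```

In this sample the three extreme values inflate the standard deviation enough that none of them exceeds a Z-score of 3, while the IQR fences still flag all three.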

### **Handling Outliers**
#### a. Removal:
   - **Truncation (Trimming):** Remove the outlier values from the dataset.
   - **Winsorizing:** Instead of removing outliers, cap them at a chosen threshold, such as a percentile of the data.
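
The difference between the two options can be sketched with NumPy. The sample values and the 5th/95th percentile thresholds below are illustrative assumptions, not recommended settings.

```python
import numpy as np

# Made-up sample with one very large value
values = np.array([8, 9, 10, 11, 12, 9, 10, 11, 10, 95], dtype=float)

# Illustrative thresholds (an assumption): the 5th and 95th percentiles
lower, upper = np.percentile(values, [5, 95])

# Truncation: drop everything outside the thresholds
trimmed = values[(values >= lower) & (values <= upper)]

# Winsorizing: keep every observation but cap it at the thresholds
winsorized = np.clip(values, lower, upper)

print("original size:  ", len(values))
print("trimmed size:   ", len(trimmed))
print("winsorized size:", len(winsorized))
print("largest winsorized value:", winsorized.max())
```

Truncation shrinks the dataset, whereas winsorizing keeps its size and only changes the extreme values.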

#### b. Transformation:
   - **Log Transformation:** Reduces the impact of outliers but requires all values to be positive.
   - **Box-Cox Transformation:** Generalizes the log and power transformations; it can stabilize variance and make the data more closely resemble a normal distribution. Like the log transformation, it requires strictly positive values.
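
As a rough illustration, both transformations can be applied to a right-skewed sample and compared by skewness. The log-normal data below is made up for the sketch, and `scipy.stats.boxcox` is used with its default behavior of estimating lambda from the data.

```python
import numpy as np
from scipy import stats

# Made-up right-skewed, strictly positive data
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Log transformation (valid only because every value is positive)
log_values = np.log(values)

# Box-Cox: with lmbda left unset, SciPy estimates lambda by maximum likelihood
boxcox_values, fitted_lambda = stats.boxcox(values)

print("skewness raw:    ", stats.skew(values))
print("skewness log:    ", stats.skew(log_values))
print("skewness Box-Cox:", stats.skew(boxcox_values))
print("estimated lambda:", fitted_lambda)
```

A skewness closer to zero after the transformation suggests that the long tail, and with it the influence of the largest values, has been reduced.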

#### c. Imputation:
   - Replace outlier values with statistical measures such as mean, median, or mode.

#### d. Robust Methods:
   - Use models and methods that are robust to outliers, such as robust regression methods, tree-based models, or ensemble methods.
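
One way to see the effect of a robust method is to compare ordinary least squares with scikit-learn's `HuberRegressor` on data where a few targets have been corrupted. The data and the choice of `HuberRegressor` are assumptions made for this sketch; other robust estimators would serve the same purpose.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Made-up data: y = 3x + 1 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 1 + rng.normal(0, 0.5, size=x.size)

# Corrupt the last five targets so they act as outliers
y[-5:] += 50

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("True slope:  3.0")
print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])
```

The Huber loss grows only linearly for large residuals, so the corrupted points pull the fitted slope less than they do under squared error.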

### Example
Here’s an example of detecting and handling outliers in Python, using `scipy.stats.zscore` for the Z-score method and pandas quantiles for the IQR method.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore

# Example data: 100 normally distributed values plus five injected extreme values
np.random.seed(0)
data = {'Value': np.random.normal(25, 5, 100).tolist() + [100, 105, 110, 150, 200]}
df = pd.DataFrame(data)

# Original DataFrame
original_df = df.copy()

# Detecting outliers using Z-score
df['Z_Score'] = zscore(df['Value'])
outliers_z_score = df[np.abs(df['Z_Score']) > 3]

# Detecting outliers using IQR method
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR))]

# Handling outliers: Removal
df_no_outliers_z = df[np.abs(df['Z_Score']) <= 3]
df_no_outliers_iqr = df[(df['Value'] >= (Q1 - 1.5 * IQR)) & (df['Value'] <= (Q3 + 1.5 * IQR))]

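# Handling outliers: Imputation (Replace with Median)
df_imputed = df.copy()
is_outlier = df_imputed.index.isin(outliers_z_score.index)
# Assign through .loc: where(..., inplace=True) on a selected column may not update df_imputed
df_imputed.loc[is_outlier, 'Value'] = df['Value'].median()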

# Create Subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
axs[0, 0].boxplot(original_df['Value'])
axs[0, 0].set_title('Original Data')

axs[0, 1].boxplot(df_no_outliers_z['Value'])
axs[0, 1].set_title('Data after Outlier Removal (Z-score)')

axs[1, 0].boxplot(df_no_outliers_iqr['Value'])
axs[1, 0].set_title('Data after Outlier Removal (IQR)')


axs[1, 1].boxplot(df_imputed['Value'])
axs[1, 1].set_title('Data after Outlier Imputation')

plt.tight_layout()
plt.show()



Note that outliers aren't merely statistical anomalies: they can also be values that are simply unrealistic, such as data-entry or measurement errors. It is therefore a good idea to inspect your data carefully before removing anything.
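
A rule-based plausibility check is one simple way to catch unrealistic values. In the sketch below, the `age` column and the 0 to 120 range are hypothetical and stand in for whatever domain knowledge applies to your data.

```python
import pandas as pd

# Hypothetical records; the column name and the valid range are assumptions
people = pd.DataFrame({'age': [34, 29, 41, -3, 57, 212, 38]})

# Rule-based check: ages outside 0-120 are treated as impossible
valid_age = people['age'].between(0, 120)

implausible = people[~valid_age]
cleaned = people[valid_age]

print(implausible)
```

Listing the rejected rows before dropping them makes it easier to tell a genuine extreme from an entry error.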

### **Handling Outliers in sklearn**

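Scikit-Learn provides several tools for handling outliers. Most of them follow the estimator API: the detectors expose `fit` and `predict` (or `fit_predict`) and label inliers as `1` and outliers as `-1`. `LocalOutlierFactor` is an exception in that it only offers `fit_predict` unless it is created with `novelty=True`.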

### 1. Robust Scalers
- **RobustScaler**: `RobustScaler` scales features using statistics that are robust to outliers: it subtracts the median and divides by the interquartile range. Note that, unlike the other tools listed here, `RobustScaler` is a transformer. It rescales data and does not detect or remove outliers.
- **Use Case**: Ideal for datasets with outliers, where you want to scale features without being influenced by them.
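
The following sketch compares `RobustScaler` with `StandardScaler` on a made-up feature that contains one extreme value, to show which statistics each scaler learns.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up single feature with one extreme value
X = np.array([[10.0], [11.0], [12.0], [13.0], [14.0], [500.0]])

robust = RobustScaler().fit(X)
standard = StandardScaler().fit(X)

# RobustScaler centers on the median and divides by the IQR
print("median / IQR:", robust.center_, robust.scale_)
# StandardScaler uses the mean and standard deviation, both pulled by 500
print("mean / std:  ", standard.mean_, standard.scale_)

# Every row is kept; only the scale of the values changes
X_robust = robust.transform(X)
X_standard = standard.transform(X)
```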

### 2. EllipticEnvelope
- **What It Does**: Fits a robust covariance estimate to the data, thus identifying the data points that are statistical outliers in a Gaussian distributed dataset.
- **Use Case**: Useful in detecting outliers when the data is assumed to have a Gaussian distribution.
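
A minimal sketch of `EllipticEnvelope` on made-up data follows. The Gaussian cloud, the five injected points, and the `contamination` value of 0.05 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Made-up data: a Gaussian cloud plus five injected far-away points
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(200, 2))
injected = np.array([[8, 8], [9, -9], [-10, 10], [12, 0], [0, -11]], dtype=float)
X = np.vstack([inliers, injected])

# contamination is a guess at the outlier share, not a known quantity
detector = EllipticEnvelope(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)   # 1 = inlier, -1 = outlier

print("Rows flagged:", int((labels == -1).sum()))
print("Labels of injected points:", labels[-5:])
```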

### 3. Isolation Forest
- **What It Does**: This algorithm isolates anomalies instead of profiling normal data points. It works well for high-dimensional datasets.
- **Use Case**: Effective for outlier detection, particularly for datasets where the number of outliers is expected to be low.

### 4. Local Outlier Factor (LOF)
- **What It Does**: Measures the local deviation of a given data point with respect to its neighbors. It considers outliers as those points that have a substantially lower density than their neighbors.
- **Use Case**: Particularly useful for anomaly detection in datasets where the density of the data is not uniform.
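
The sketch below applies `LocalOutlierFactor` to made-up data with one tight cluster, one loose cluster, and a single point placed away from the tight cluster. The cluster parameters and `n_neighbors=20` are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))    # tight cluster
sparse = rng.normal(loc=10.0, scale=2.0, size=(100, 2))  # loose cluster
isolated = np.array([[0.0, 5.0]])                        # away from the tight cluster
X = np.vstack([dense, sparse, isolated])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_   # more negative = more anomalous

print("Label of the isolated point:", labels[-1])
print("Score of the isolated point:", scores[-1])
```

Because the score compares each point with the density of its own neighborhood, points in the loose cluster are not penalized simply for being far apart from each other.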

### 5. One-Class SVM
- **What It Does**: This unsupervised algorithm learns a decision function for novelty detection, flagging data points that fall outside the region learned from the training data.
- **Use Case**: Best suited to novelty detection, where the training data is assumed to be largely free of outliers. The method is sensitive to outliers in the training set.
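
As a sketch of the novelty-detection workflow, the model below is fitted on made-up data assumed to be clean and then applied to two unseen points. The `nu` and `gamma` settings are illustrative choices.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Made-up training set assumed to contain only "normal" observations
rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))

# nu upper-bounds the fraction of training points treated as errors
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(X_train)

# Score previously unseen points: one typical, one far from the training data
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
predictions = model.predict(X_new)   # 1 = conforms, -1 = novelty

print(predictions)
```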

### Example of Using Isolation Forest:

```python
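import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative data (assumed): 200 Gaussian points plus three injected extremes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8, 8], [-9, 7], [10, -10]]])

# contamination is the expected share of outliers in the data
iso = IsolationForest(contamination=0.1, random_state=0)
outliers = iso.fit_predict(X)

# outliers will be an array of -1 for outliers and 1 for inliers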
```

### Important Points:
- **Parameter Tuning**: These methods often have parameters such as `contamination` (or `nu` for One-Class SVM) that specify the expected proportion of outliers in the data, which might need tuning based on the specific dataset.
- **Unsupervised Methods**: Most outlier detection methods in Scikit-Learn are unsupervised, meaning they don't require labeled data indicating which points are outliers.
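
To illustrate how much `contamination` matters, the sketch below fits `IsolationForest` with three different values on the same made-up data, which contains no injected outliers at all.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 2))   # made-up data with no injected outliers

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    iso = IsolationForest(contamination=contamination, random_state=0)
    labels = iso.fit_predict(X)
    flagged[contamination] = int((labels == -1).sum())

print(flagged)
```

The detector flags roughly the requested share of rows whether or not they are genuinely anomalous, so the value should reflect what you know about the data.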

### **Outlier removal in pipelines**

In Scikit-Learn, there isn't a direct, built-in way to remove outliers using tools like `IsolationForest` within a pipeline. The standard pipeline components in Scikit-Learn, such as transformers and estimators, are designed to transform or predict, but not to filter or remove data points.

However, you can create a custom transformer that incorporates outlier detection and data removal and place it in a Scikit-Learn pipeline. The caveat is that a transformer only sees `X`, so the target `y` is not filtered along with it.

Here's an example of how you might implement such a custom transformer:



In [1]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import IsolationForest

class OutlierRemover(BaseEstimator, TransformerMixin):
    """Drop the rows of X that an IsolationForest flags as outliers.

    Caveat: transform() only receives X, so y is not filtered alongside it.
    """

    def __init__(self, n_estimators=100, contamination='auto'):
        self.n_estimators = n_estimators
        self.contamination = contamination

    def fit(self, X, y=None):
        # Fitted attributes carry a trailing underscore by scikit-learn convention
        self.isolation_forest_ = IsolationForest(n_estimators=self.n_estimators,
                                                 contamination=self.contamination)
        self.isolation_forest_.fit(X)
        return self

    def transform(self, X, y=None):
        # Predict outliers (-1 for outliers, 1 for inliers)
        labels = self.isolation_forest_.predict(X)
        # Keep only the rows labelled as inliers
        return X[labels == 1]


In [2]:
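# Usage in a pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('remove_outliers', OutlierRemover()),
    ('classifier', LogisticRegression())
])

# Caveat: Pipeline passes y through unchanged, so pipeline.fit(X, y) raises a
# ValueError about inconsistent sample counts whenever transform() drops rows.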


### Important Points:
- **Custom Transformer**: `OutlierRemover` is a custom transformer that uses `IsolationForest` for outlier detection. It filters out outliers in the `transform` method.
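- **Pipeline Integration**: The transformer can be placed in a Scikit-Learn `Pipeline`, but `Pipeline` never filters `y` and also calls `transform` at prediction time. When rows are dropped ahead of a supervised step, `fit` fails with an inconsistent sample count, and predictions would cover fewer rows than were passed in. As written, the transformer is best suited to filtering `X` ahead of an unsupervised step or outside the pipeline.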
- **Caution**: Be cautious with automatically removing outliers, as they might be genuine data points that are important for the model to learn. Always inspect and understand your data before deciding to remove outliers.
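
Since a scikit-learn `Pipeline` passes `y` through its transformers unchanged, one possible workaround is to move the filtering into the final estimator, where `X` and `y` can be filtered together during `fit` while `predict` keeps every row. The class below is a hypothetical sketch under that assumption (the name `OutlierFilteredClassifier` and its parameters are made up here); it is not a built-in scikit-learn component.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class OutlierFilteredClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: drop training outliers, then fit a classifier."""

    def __init__(self, estimator, contamination='auto', random_state=None):
        self.estimator = estimator
        self.contamination = contamination
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.detector_ = IsolationForest(contamination=self.contamination,
                                         random_state=self.random_state)
        # Filter X and y with the same mask so they stay aligned
        keep = self.detector_.fit_predict(X) == 1
        self.n_removed_ = int((~keep).sum())
        self.estimator_ = clone(self.estimator).fit(X[keep], y[keep])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # No rows are dropped at prediction time
        return self.estimator_.predict(np.asarray(X))


# Made-up classification data for the demonstration
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('classifier', OutlierFilteredClassifier(LogisticRegression(), random_state=0)),
])

# Outlier removal is refitted on the training part of every fold
scores = cross_val_score(pipeline, X, y, cv=5)
print("Mean accuracy:", scores.mean())
```

Keeping the filter inside `fit` means test folds are never filtered, so the evaluation still covers every held-out row.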