Outliers
1. Univariate Outlier
2. Multivariate Outlier
3. Global Outlier
4. Point Outlier
5. Local Outlier
6. Contextual Outlier
7. Collective Outlier
8. Recurrent Outlier
9. Periodic Outlier

---------------------------------------------------
For each method, choose based on:               

- Data distribution (normal vs. skewed).        

- Dimensionality (univariate vs. multivariate). 

- Dataset size (small, medium, large).           

---------------------------------------------------

In [1]:

### 1. **Z-Score**
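from scipy.stats import zscore
import pandas as pd

threshold = 3  # common rule of thumb; tune for your data
# zscore needs the whole column; Series.apply would pass it one scalar at a time
df['z_score'] = zscore(df['column_name'])
outliers = df[df['z_score'].abs() > threshold]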

# **When to Use**:  
# - Use when the data is normally distributed.  
# - Detects points far from the mean in terms of standard deviation.
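The snippet above assumes an existing `df`. A minimal runnable sketch on synthetic data follows; the data, seed, and injected extremes are hypothetical and only there to make the example self-contained.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical data: 200 roughly normal values plus two injected extremes.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=200), [95.0, 5.0])
df = pd.DataFrame({"column_name": values})

# Standardise the whole column at once: (x - mean) / std.
df["z_score"] = zscore(df["column_name"])

threshold = 3  # rule of thumb, not a universal constant
outliers = df[df["z_score"].abs() > threshold]
print(outliers)
```

Note that extreme values inflate the mean and standard deviation they are measured against, so a very large outlier can hide smaller ones.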




In [None]:

### 2. **Interquartile Range (IQR)**

Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]

# **When to Use**:  
# - Use for skewed data.  
# - Non-parametric method, not reliant on data distribution.
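As a sketch of the IQR rule on skewed data, the example below wraps the bounds in a small helper. The function name `iqr_bounds`, the log-normal sample, and the injected value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal) with one injected extreme.
rng = np.random.default_rng(1)
df = pd.DataFrame({"column_name": rng.lognormal(mean=0.0, sigma=0.5, size=500)})
df.loc[len(df)] = 25.0  # appended as row 500


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower_bound, upper_bound = iqr_bounds(df["column_name"])
outliers = df[~df["column_name"].between(lower_bound, upper_bound)]
print(f"bounds=({lower_bound:.2f}, {upper_bound:.2f}), flagged={len(outliers)}")
```

On strongly skewed data the upper fence tends to flag part of the natural tail as well, so the multiplier `k` is worth tuning.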



In [None]:

### 3. **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
df['dbscan_labels'] = model.fit_predict(df[['column_x', 'column_y']])
outliers = df[df['dbscan_labels'] == -1]

# **When to Use**:  
# - For spatial or clustering-based outliers.  
# - Works well when clusters are dense and outliers are isolated.
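A runnable sketch of the DBSCAN approach on synthetic data is shown here. The two blobs, the isolated points, and the choice of `eps=0.3` after standardisation are assumptions; `eps` is a distance, so it depends on the scale of the features.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two dense blobs plus two isolated points (rows 200 and 201).
rng = np.random.default_rng(2)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(100, 2))
isolated = np.array([[2.5, 2.5], [10.0, -3.0]])
df = pd.DataFrame(
    np.vstack([cluster_a, cluster_b, isolated]),
    columns=["column_x", "column_y"],
)

# Put both features on a common scale before applying a distance threshold.
X = StandardScaler().fit_transform(df[["column_x", "column_y"]])
df["dbscan_labels"] = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# DBSCAN labels points that belong to no dense region as -1 (noise).
outliers = df[df["dbscan_labels"] == -1]
print(outliers)
```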



In [None]:
### 4. **Isolation Forest**
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.1)
# fit_predict returns labels (-1 = outlier, 1 = inlier), not continuous scores
df['anomaly_score'] = model.fit_predict(df[['column_x', 'column_y']])
outliers = df[df['anomaly_score'] == -1]

# **When to Use**:  
# - Effective for high-dimensional datasets.  
# - Isolation-based method, identifies anomalies as points that are easily isolated.
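The sketch below separates the two outputs of an Isolation Forest: the hard label from `fit_predict` and the continuous score from `decision_function`. The synthetic data, `n_estimators=200`, and `contamination=0.01` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical data: a Gaussian blob plus three far-away points (rows 300-302).
rng = np.random.default_rng(3)
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
extremes = np.array([[8.0, 8.0], [-9.0, 7.0], [7.5, -8.5]])
df = pd.DataFrame(np.vstack([inliers, extremes]), columns=["column_x", "column_y"])

features = df[["column_x", "column_y"]]
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
df["anomaly_label"] = model.fit_predict(features)         # -1 = anomaly, 1 = normal
df["anomaly_score"] = model.decision_function(features)   # lower = more anomalous

outliers = df[df["anomaly_label"] == -1]
print(df.nsmallest(5, "anomaly_score"))
```

Ranking by score is often more useful than the hard label, because `contamination` only decides where the cut-off falls.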



In [None]:
### 5. **Local Outlier Factor (LOF)**
from sklearn.neighbors import LocalOutlierFactor

model = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
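# fit_predict returns labels (-1 = outlier, 1 = inlier); the raw factors are in model.negative_outlier_factor_
df['lof_scores'] = model.fit_predict(df[['column_x', 'column_y']])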
outliers = df[df['lof_scores'] == -1]

# **When to Use**:  
# - Detects local anomalies in clusters.  
# - Suitable for datasets with varying densities.
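To illustrate what "local" means here, the following sketch builds one tight cluster and one loose cluster, then places a point just outside the tight one. The data layout and `contamination=0.02` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a dense cluster, a sparse cluster, and one point (row 300)
# that is close to the dense cluster in absolute terms but far relative to its density.
rng = np.random.default_rng(4)
dense = rng.normal(loc=[0, 0], scale=0.2, size=(150, 2))
sparse = rng.normal(loc=[6, 6], scale=1.5, size=(150, 2))
local_outlier = np.array([[1.5, 1.5]])
df = pd.DataFrame(
    np.vstack([dense, sparse, local_outlier]),
    columns=["column_x", "column_y"],
)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
df["lof_label"] = lof.fit_predict(df[["column_x", "column_y"]])
# negative_outlier_factor_ is the negated LOF; flip the sign so larger = more outlying.
df["lof_score"] = -lof.negative_outlier_factor_

outliers = df[df["lof_label"] == -1]
print(df.nlargest(5, "lof_score"))
```

A score near 1 means the point is about as dense as its neighbours; values well above 1 indicate a point that is sparser than its surroundings.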


In [None]:

### 6. **Elliptic Envelope**
from sklearn.covariance import EllipticEnvelope

model = EllipticEnvelope(contamination=0.1)
model.fit(df[['column_x', 'column_y']])
df['elliptic_env'] = model.predict(df[['column_x', 'column_y']])
outliers = df[df['elliptic_env'] == -1]

# **When to Use**:  
# - For Gaussian distributed data.  
# - Models data using covariance to detect deviations.
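A minimal sketch of the Elliptic Envelope on correlated Gaussian data follows. The covariance matrix, the two points placed against the correlation, and `contamination=0.01` are hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import EllipticEnvelope

# Hypothetical data: strongly correlated Gaussian features, plus two points
# (rows 300 and 301) that run against the correlation.
rng = np.random.default_rng(5)
cov = [[1.0, 0.8], [0.8, 1.0]]
inliers = rng.multivariate_normal(mean=[0, 0], cov=cov, size=300)
off_axis = np.array([[3.0, -3.0], [-3.0, 3.0]])
df = pd.DataFrame(np.vstack([inliers, off_axis]), columns=["column_x", "column_y"])

features = df[["column_x", "column_y"]]
model = EllipticEnvelope(contamination=0.01, random_state=0).fit(features)
df["elliptic_env"] = model.predict(features)             # -1 = outlier, 1 = inlier
df["robust_sq_distance"] = model.mahalanobis(features)   # squared Mahalanobis distance

outliers = df[df["elliptic_env"] == -1]
print(df.nlargest(5, "robust_sq_distance"))
```

The model fits a robust covariance estimate, so the injected points have less influence on the fitted ellipse than they would on a plain sample covariance.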

In [None]:
### 7. **One-Class SVM**
from sklearn.svm import OneClassSVM

model = OneClassSVM(kernel='rbf', nu=0.1, gamma=0.1)
# fit_predict returns labels (-1 = outlier, 1 = inlier), not continuous scores
df['svm_scores'] = model.fit_predict(df[['column_x', 'column_y']])
outliers = df[df['svm_scores'] == -1]

# **When to Use**:  
# - Good for nonlinear relationships.  
# - Models normal data points and identifies those that deviate significantly.
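One common way to use a One-Class SVM is to fit it on data believed to be normal and then score new points. The sketch below assumes that setup; the training data, the new points, and the use of a `StandardScaler` pipeline with `gamma="scale"` are illustrative choices, not part of the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hypothetical "normal" training data with features on very different scales.
rng = np.random.default_rng(6)
train = pd.DataFrame({
    "column_x": rng.normal(0, 1, 300),
    "column_y": rng.normal(0, 1000, 300),
})

# Hypothetical new observations: one ordinary point and two extremes.
new_points = pd.DataFrame({
    "column_x": [0.0, 8.0, 0.0],
    "column_y": [0.0, 0.0, 9000.0],
})

# The RBF kernel is distance-based, so scale the features before fitting.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", nu=0.1, gamma="scale"),
)
model.fit(train)

cols = ["column_x", "column_y"]
new_points["svm_label"] = model.predict(new_points[cols])  # -1 = outlier, 1 = inlier
print(new_points)
```

The `nu` parameter is an upper bound on the fraction of training points treated as errors, so it plays a role similar to `contamination` in the other estimators.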



In [None]:
### 8. **Mahalanobis Distance**
import numpy as np

cov_matrix = np.cov(df[['column_x', 'column_y']].values, rowvar=False)
inv_cov_matrix = np.linalg.inv(cov_matrix)
mean = df[['column_x', 'column_y']].mean(axis=0)
df['mahalanobis'] = df[['column_x', 'column_y']].apply(
    lambda row: np.sqrt((row - mean).T @ inv_cov_matrix @ (row - mean)),
    axis=1
)
threshold = 3  # e.g., 3; tune for your data and number of features
outliers = df[df['mahalanobis'] > threshold]

# **When to Use**:  
# - For multidimensional Gaussian distributions.  
# - Measures the distance from the mean using covariance.
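The row-wise `apply` above can be slow on large frames. A vectorised sketch is shown below, together with one possible way to pick the threshold: for Gaussian data the squared Mahalanobis distance approximately follows a chi-square distribution with as many degrees of freedom as there are features. The data and the 0.999 quantile are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Hypothetical correlated Gaussian data plus one injected point (row 400).
rng = np.random.default_rng(7)
inliers = rng.multivariate_normal([10, 20], [[4, 3], [3, 9]], size=400)
extreme = np.array([[18.0, 10.0]])
df = pd.DataFrame(np.vstack([inliers, extreme]), columns=["column_x", "column_y"])

X = df[["column_x", "column_y"]].to_numpy()
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# Squared distance for every row at once: diff_i^T @ inv_cov @ diff_i
df["mahalanobis_sq"] = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Chi-square cut-off on the squared distance (2 features -> 2 degrees of freedom).
threshold_sq = chi2.ppf(0.999, df=X.shape[1])
outliers = df[df["mahalanobis_sq"] > threshold_sq]
print(outliers)
```

Because the threshold applies to the squared distance, it is not directly comparable to a cut-off such as 3 on the unsquared distance.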



In [None]:
### 9. **Robust Random Cut Forest**
# Requires third-party libraries like Amazon's `sagemaker`.
from sagemaker import RandomCutForest

# **When to Use**:  
# - Handles streaming data and multidimensional data.



In [None]:
### 10. **Histogram-based Outlier Score (HBOS)**
from pyod.models.hbos import HBOS

model = HBOS()
model.fit(df[['column_x', 'column_y']])
df['hbos_scores'] = model.decision_function(df[['column_x', 'column_y']])
outliers = df[model.predict(df[['column_x', 'column_y']]) == 1]

# **When to Use**:  
# - Efficient for large datasets.  
# - Non-parametric, based on histograms.
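To show the idea behind the score without depending on `pyod`, here is a simplified NumPy sketch of a static-bin HBOS. It is not pyod's implementation: the helper name `hbos_scores`, the fixed `n_bins=10`, and the synthetic data are assumptions. Each feature gets its own histogram, bin heights are normalised so the tallest bin is 1, and a row's score is the sum of `-log(height)` over features.

```python
import numpy as np


def hbos_scores(X, n_bins=10):
    """Simplified HBOS: sum over features of -log(normalised bin height)."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        heights = counts / counts.max()  # tallest bin has height 1
        # Map each value to its bin index using the interior edges.
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
        scores += -np.log(heights[idx])
    return scores


# Hypothetical data: a Gaussian blob plus one extreme row appended last.
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(size=(500, 2)), [[10.0, 10.0]]])
scores = hbos_scores(X)
print("most anomalous row:", int(np.argmax(scores)))
```

Features are scored independently, which is what makes the method fast and also why it cannot see outliers that only show up in combinations of features.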



In [None]:
### 11. **K-Nearest Neighbors (KNN)**
from sklearn.neighbors import NearestNeighbors

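knn = NearestNeighbors(n_neighbors=5)
knn.fit(df[['column_x', 'column_y']])
# With no argument, kneighbors() queries the training points and excludes each point from its own neighbors
distances, indices = knn.kneighbors()
df['knn_outlier_score'] = distances.mean(axis=1)
threshold = df['knn_outlier_score'].quantile(0.95)  # e.g., flag the top 5%; tune per dataset
outliers = df[df['knn_outlier_score'] > threshold]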

# **When to Use**:  
# - For detecting distance-based outliers.  
# - Good for small to medium-sized datasets.
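A runnable sketch of a distance-based KNN score follows. It uses the distance to the k-th nearest neighbour rather than the mean distance, which is a common variant; the synthetic data and the 99th-percentile cut-off are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: a Gaussian blob plus two far-away points (rows 200 and 201).
rng = np.random.default_rng(9)
inliers = rng.normal(size=(200, 2))
extremes = np.array([[6.0, 6.0], [-7.0, 5.0]])
df = pd.DataFrame(np.vstack([inliers, extremes]), columns=["column_x", "column_y"])

features = df[["column_x", "column_y"]]
knn = NearestNeighbors(n_neighbors=5).fit(features)

# kneighbors() without an argument queries the training points and does not
# count a point as its own neighbour, so no distance is trivially zero.
distances, _ = knn.kneighbors()
df["knn_outlier_score"] = distances[:, -1]  # distance to the 5th nearest neighbour

threshold = df["knn_outlier_score"].quantile(0.99)  # hypothetical cut-off
outliers = df[df["knn_outlier_score"] > threshold]
print(outliers)
```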


In [None]:
### 12. **K-Means Clustering**
from sklearn.cluster import KMeans

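kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df['kmeans_labels'] = kmeans.fit_predict(df[['column_x', 'column_y']])
# Heuristic: treat the smallest cluster as the anomalous one; inspect cluster sizes before relying on it
anomaly_cluster = df['kmeans_labels'].value_counts().idxmin()
outliers = df[df['kmeans_labels'] == anomaly_cluster]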

# **When to Use**:  
# - Outliers are farthest from cluster centroids.  
# - Works well with clusterable datasets.
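The first point above describes a distance-to-centroid criterion. A sketch of that variant is given here; the three synthetic blobs, the stray point, and the 99th-percentile cut-off are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: three blobs plus one stray point between them (row 300).
rng = np.random.default_rng(10)
centers = np.array([[0, 0], [8, 0], [4, 7]])
blobs = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])
stray = np.array([[4.0, 2.5]])
df = pd.DataFrame(np.vstack([blobs, stray]), columns=["column_x", "column_y"])

features = df[["column_x", "column_y"]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
df["kmeans_labels"] = kmeans.labels_

# transform() gives the distance to every centroid; the minimum belongs to
# the centroid the point was assigned to.
df["centroid_distance"] = kmeans.transform(features).min(axis=1)

threshold = df["centroid_distance"].quantile(0.99)  # hypothetical cut-off
outliers = df[df["centroid_distance"] > threshold]
print(outliers)
```

When clusters differ a lot in spread, a separate threshold per cluster may work better than one global cut-off.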


In [None]:
### 13. **Local Correlation Integral (LOCI)**
# Requires custom implementation or specialized libraries.
# Pseudo-code, depends on specific library

# **When to Use**:  
# - Detects outliers based on local density correlations.  
# - Suitable for datasets with varying densities.



In **Seaborn**, you can visually check for outliers using various types of plots. Below are some common methods to detect outliers in your data:

### 1. **Boxplot** (Most Common Method)
A **boxplot** is a great tool to visually detect outliers. It displays the distribution of data through **quartiles** and shows any data points that fall outside the whiskers as potential outliers.

#### Code Example:
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")

# Sample data: seaborn's built-in "tips" example dataset (fetched on first use)
data = sns.load_dataset("tips")

# Boxplot for visualizing outliers in the 'total_bill' column
sns.boxplot(x=data['total_bill'])
plt.title('Boxplot for Total Bill')
plt.show()
```

#### How to Interpret:
- **Whiskers**: The lines extending from the box reach to the most extreme data points that still lie within 1.5 times the **interquartile range (IQR)** of the 1st and 3rd quartiles.
- **Outliers**: Points beyond the whiskers are treated as potential outliers and are plotted individually.
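The whisker rule can be reproduced numerically. The sketch below uses a small hypothetical array and NumPy's default linear percentile interpolation; a plotting library may compute quartiles slightly differently.

```python
import numpy as np

# Hypothetical sample, sorted for readability.
values = np.array([3.1, 9.6, 10.3, 12.5, 13.4, 15.0, 16.3, 17.8, 18.4, 20.3, 24.1, 48.3])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences,
# not at the fences themselves.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point.
fliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"fences:   ({lower_fence:.2f}, {upper_fence:.2f})")
print(f"whiskers: ({whisker_low}, {whisker_high})")
print(f"fliers:   {fliers}")
```

Here the upper whisker stops at 24.1, well short of the upper fence, and only 48.3 is drawn as an outlier.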

### 2. **Violin Plot**
A **violin plot** combines aspects of boxplot and density plot, providing a deeper understanding of the distribution. It’s helpful for detecting outliers in continuous data.

#### Code Example:
```python
# Violin plot for visualizing outliers in the 'total_bill' column
sns.violinplot(x=data['total_bill'])
plt.title('Violin Plot for Total Bill')
plt.show()
```

#### How to Interpret:
- The width of the violin at a given value indicates the density of the data there. Extreme values are not drawn as separate points; they show up as a long, thin tail stretching away from the main body of the violin.

### 3. **Scatter Plot**
For two or more continuous variables, a **scatter plot** can be helpful for identifying outliers that may appear far away from the rest of the data points.

#### Code Example:
```python
# Scatter plot for visualizing outliers in the 'total_bill' vs 'tip' columns
sns.scatterplot(x=data['total_bill'], y=data['tip'])
plt.title('Scatter Plot of Total Bill vs Tip')
plt.show()
```

#### How to Interpret:
- Outliers will appear as points far away from the majority of data points in the plot.

### 4. **Pairplot**
If you want to visualize the relationships between multiple variables and identify potential outliers in a pairwise manner, you can use **pairplot**.

#### Code Example:
```python
# Pairplot for visualizing relationships between multiple columns
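g = sns.pairplot(data[['total_bill', 'tip', 'size']])
# pairplot draws a grid of axes, so set a figure-level title instead of plt.title
g.figure.suptitle('Pairplot for Total Bill, Tip, and Size', y=1.02)
plt.show()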
```

#### How to Interpret:
- The diagonal shows the univariate distribution (with histograms), and the off-diagonal plots show relationships between pairs of variables.
- Outliers will appear as points far away from the general trend in any of the scatter plots.

### 5. **Swarm Plot**
A **swarm plot** draws every observation as its own point and nudges the points sideways so they do not overlap. This makes isolated values of a continuous variable easy to spot, either overall or within each category.

#### Code Example:
```python
# Swarm plot for visualizing outliers in the 'total_bill' column by 'time' category
sns.swarmplot(x='time', y='total_bill', data=data)
plt.title('Swarm Plot for Total Bill by Time')
plt.show()
```

#### How to Interpret:
- Outliers are displayed as points that are far away from the others on the plot.

### 6. **Histogram**
A **histogram** shows the frequency distribution of the data. While not specifically designed for outlier detection, it can give you an idea of whether data is spread across a large range, potentially indicating outliers.

#### Code Example:
```python
# Histogram for visualizing the distribution of the 'total_bill' column
sns.histplot(data['total_bill'], kde=True)
plt.title('Histogram for Total Bill')
plt.show()
```

#### How to Interpret:
- Outliers could appear as short, isolated bars separated from the bulk of the distribution by empty bins.

### Summary of Methods:
- **Boxplot**: Shows the distribution, quartiles, and potential outliers as points outside the whiskers.
- **Violin Plot**: Combines boxplot and density plot, providing more detailed information about the distribution.
- **Scatter Plot**: Helps visualize relationships between two continuous variables and detect points that are far from the rest.
- **Pairplot**: For visualizing pairwise relationships between multiple variables, useful for detecting outliers in multi-dimensional data.
- **Swarm Plot**: Shows every individual data point, optionally grouped by category, so isolated values stand out.
- **Histogram**: Useful for understanding the distribution and spotting outliers in a univariate dataset.

These visualizations can help you identify and confirm outliers before deciding on how to handle them (e.g., removing or transforming the outliers).
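As a sketch of the handling options mentioned above, the example below applies three common treatments to a small hypothetical series: dropping values outside the IQR fences, capping them at the fences, and applying a log transform. Which one is appropriate depends on why the extreme values exist.

```python
import numpy as np
import pandas as pd

# Hypothetical bills with one extreme value.
s = pd.Series([12.0, 14.5, 13.2, 15.1, 14.0, 13.7, 95.0], name="total_bill")

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = s[s.between(lower, upper)]        # option 1: drop rows outside the fences
capped = s.clip(lower=lower, upper=upper)   # option 2: cap values at the fences
logged = np.log1p(s)                        # option 3: compress the right tail

print(pd.DataFrame({"original": s, "capped": capped, "log1p": logged}))
print("rows kept after removal:", len(removed))
```

Removal changes the sample size, while capping and transforming keep every row, which matters when other columns in the same row are still valid.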