# Detecting Outliers with the IQR Method📊
Hey there, data explorer! 🌟 Now that you've gotten comfortable with datasets, let’s dive into how we can identify and handle outliers using the IQR method. Outliers can skew your analysis, but with this technique, we can easily spot and deal with them to keep our data clean and accurate! Let's get started with understanding how the Interquartile Range (IQR) helps us find those unexpected values in our dataset. 🚀

# AIM:
## 1. Load a dataset with missing values and noise.
## 2. Handle missing values by imputation (mean, median) or removal.
## 3. Remove noise by filtering outliers using Z-score or IQR method.

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
from scipy import stats

## 1. Load a dataset with missing values and noise
- **Noise in a dataset** refers to random, irrelevant, or erroneous data that does not contribute meaningfully to the analysis or prediction. It can arise from various sources and can negatively affect the performance of machine learning models.

In [3]:
np.random.seed(0)
data = {
 'A': np.random.randn(100),
 'B': np.random.randn(100),
 'C': np.random.randn(100)
}

In [4]:
# Introduce missing values at regular intervals
data['A'][::10] = np.nan  # every 10th value of A
data['B'][::15] = np.nan  # every 15th value of B
data['C'][::20] = np.nan  # every 20th value of C

In [5]:
# Convert to dataframe
df = pd.DataFrame(data)

In [6]:
print("Original dataset with missing values and noise:")
print(df.head(20))

Original dataset with missing values and noise:
           A         B         C
0        NaN       NaN       NaN
1   0.400157 -1.347759 -0.239379
2   0.978738 -1.270485  1.099660
3   2.240893  0.969397  0.655264
4   1.867558 -1.173123  0.640132
5  -0.977278  1.943621 -1.616956
6   0.950088 -0.413619 -0.024326
7  -0.151357 -0.747455 -0.738031
8  -0.103219  1.922942  0.279925
9   0.410599  1.480515 -0.098150
10       NaN  1.867559  0.910179
11  1.454274  0.906045  0.317218
12  0.761038 -0.861226  0.786328
13  0.121675  1.910065 -0.466419
14  0.443863 -0.268003 -0.944446
15  0.333674       NaN -0.410050
16  1.494079  0.947252 -0.017020
17 -0.205158 -0.155010  0.379152
18  0.313068  0.614079  2.259309
19 -0.854096  0.922207 -0.042257
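
Before choosing a strategy, it helps to know how much data is actually missing. Here is a minimal sketch that rebuilds the same synthetic dataset so it can run on its own; the name `missing_counts` is just an illustrative choice, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the same synthetic dataset so this snippet runs on its own
np.random.seed(0)
data = {
    'A': np.random.randn(100),
    'B': np.random.randn(100),
    'C': np.random.randn(100),
}
data['A'][::10] = np.nan
data['B'][::15] = np.nan
data['C'][::20] = np.nan
df = pd.DataFrame(data)

# Count missing values per column (illustrative variable name)
missing_counts = df.isna().sum()
print(missing_counts)
```

With the slicing used here, this should report 10, 7, and 5 missing values for columns A, B, and C.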


## 2. Handle missing values by imputation (mean, median) or removal
**Imputation** refers to the process of filling in missing or incomplete data in a dataset. It is commonly used in data preprocessing to handle **missing values** before applying machine learning algorithms, as many models cannot handle null or missing values directly.
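
To get a feel for the difference between mean and median imputation, here is a small sketch on a made-up series (the values are hypothetical, chosen so that one extreme value pulls the mean upward).

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: one missing value and one extreme value (100.0)
s = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

mean_value = s.mean()      # NaN is skipped: (1 + 2 + 3 + 100) / 4 = 26.5
median_value = s.median()  # NaN is skipped: median of [1, 2, 3, 100] = 2.5

mean_filled = s.fillna(mean_value)
median_filled = s.fillna(median_value)

print("Filled with mean:  ", mean_filled.iloc[3])    # 26.5
print("Filled with median:", median_filled.iloc[3])  # 2.5
```

The median fill stays close to the typical values, while the mean fill is dragged toward the extreme one. That is one reason the median is often preferred when a column may contain outliers.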

In [10]:
# Imputation by mean
df_mean_imputed = df.fillna(df.mean())
print("\nDataset after mean imputation:")
print(df_mean_imputed.head(20))


Dataset after mean imputation:
           A         B         C
0   0.110294  0.069359 -0.028013
1   0.400157 -1.347759 -0.239379
2   0.978738 -1.270485  1.099660
3   2.240893  0.969397  0.655264
4   1.867558 -1.173123  0.640132
5  -0.977278  1.943621 -1.616956
6   0.950088 -0.413619 -0.024326
7  -0.151357 -0.747455 -0.738031
8  -0.103219  1.922942  0.279925
9   0.410599  1.480515 -0.098150
10  0.110294  1.867559  0.910179
11  1.454274  0.906045  0.317218
12  0.761038 -0.861226  0.786328
13  0.121675  1.910065 -0.466419
14  0.443863 -0.268003 -0.944446
15  0.333674  0.069359 -0.410050
16  1.494079  0.947252 -0.017020
17 -0.205158 -0.155010  0.379152
18  0.313068  0.614079  2.259309
19 -0.854096  0.922207 -0.042257


In [11]:
# Imputation by median
df_median_imputed = df.fillna(df.median())
print("\nDataset after median imputation:")
print(df_median_imputed.head(20))


Dataset after median imputation:
           A         B         C
0   0.124294  0.017479 -0.024326
1   0.400157 -1.347759 -0.239379
2   0.978738 -1.270485  1.099660
3   2.240893  0.969397  0.655264
4   1.867558 -1.173123  0.640132
5  -0.977278  1.943621 -1.616956
6   0.950088 -0.413619 -0.024326
7  -0.151357 -0.747455 -0.738031
8  -0.103219  1.922942  0.279925
9   0.410599  1.480515 -0.098150
10  0.124294  1.867559  0.910179
11  1.454274  0.906045  0.317218
12  0.761038 -0.861226  0.786328
13  0.121675  1.910065 -0.466419
14  0.443863 -0.268003 -0.944446
15  0.333674  0.017479 -0.410050
16  1.494079  0.947252 -0.017020
17 -0.205158 -0.155010  0.379152
18  0.313068  0.614079  2.259309
19 -0.854096  0.922207 -0.042257


In [12]:
# Removal of rows with missing values
df_dropped = df.dropna()
print("\nDataset after removing rows with missing values:")
print(df_dropped.head(20))


Dataset after removing rows with missing values:
           A         B         C
1   0.400157 -1.347759 -0.239379
2   0.978738 -1.270485  1.099660
3   2.240893  0.969397  0.655264
4   1.867558 -1.173123  0.640132
5  -0.977278  1.943621 -1.616956
6   0.950088 -0.413619 -0.024326
7  -0.151357 -0.747455 -0.738031
8  -0.103219  1.922942  0.279925
9   0.410599  1.480515 -0.098150
11  1.454274  0.906045  0.317218
12  0.761038 -0.861226  0.786328
13  0.121675  1.910065 -0.466419
14  0.443863 -0.268003 -0.944446
16  1.494079  0.947252 -0.017020
17 -0.205158 -0.155010  0.379152
18  0.313068  0.614079  2.259309
19 -0.854096  0.922207 -0.042257
21  0.653619 -1.099401 -0.345982
22  0.864436  0.298238 -0.463596
23 -0.742165  1.326386  0.481481


## 3. Remove noise by filtering outliers using Z-score or IQR method

### Using Z-Score method
**A Z-score measures how many standard deviations a data point is from the mean of a dataset.** It is used primarily for standardizing data and for detecting outliers. It’s calculated using:

$$
Z = \frac{x - \mu}{\sigma}
$$

Where:

- **x** is the data point,

- **μ** is the mean, and

- **σ** is the standard deviation.

The sign of the result tells you where the value sits relative to the mean:

- **Z = 0:** The value is at the mean.

- **Z > 0:** The value is above the mean.

- **Z < 0:** The value is below the mean.

**Normalization**: Z-score standardization rescales each feature to a mean of 0 and a standard deviation of 1, which often helps models that are sensitive to feature scale.

**Outlier Detection**: Values with an absolute Z-score above 3 (Z > 3 or Z < -3) are commonly treated as outliers, because they lie more than three standard deviations from the mean.

Z-scores also make values from different datasets comparable, since they are all expressed in standard deviations.
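
As a quick check of the formula, here is a small sketch that computes Z-scores by hand on a hypothetical array and compares them with `stats.zscore`. It assumes SciPy's default of a population standard deviation (`ddof=0`), which matches NumPy's default `std()`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with mean 5 and population standard deviation 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()    # 5.0
sigma = x.std()  # 2.0 (population standard deviation, ddof=0)

z_manual = (x - mu) / sigma  # Z = (x - mu) / sigma, applied element-wise
z_scipy = stats.zscore(x)    # same calculation done by SciPy

print(z_manual)                         # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
print(np.allclose(z_manual, z_scipy))   # True
```

The value 9.0 gets Z = 2, meaning it sits two standard deviations above the mean, so it would not be flagged by the |Z| > 3 rule.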

- `np.abs(stats.zscore(...))`: Computes the absolute Z-score of every value, column by column.
- `(z_scores < 3).all(axis=1)`: Builds a boolean mask that is True for rows where every feature's absolute Z-score is below 3 (no outliers).
- `df_mean_imputed[...]`: Applies the mask to keep only the rows without outliers.

In [13]:
z_scores = np.abs(stats.zscore(df_mean_imputed))
df_no_outliers_zscore = df_mean_imputed[(z_scores < 3).all(axis=1)]
print("\nDataset after removing outliers using Z-score method:")
print(df_no_outliers_zscore.head(20))


Dataset after removing outliers using Z-score method:
           A         B         C
0   0.110294  0.069359 -0.028013
1   0.400157 -1.347759 -0.239379
2   0.978738 -1.270485  1.099660
3   2.240893  0.969397  0.655264
4   1.867558 -1.173123  0.640132
5  -0.977278  1.943621 -1.616956
6   0.950088 -0.413619 -0.024326
7  -0.151357 -0.747455 -0.738031
8  -0.103219  1.922942  0.279925
9   0.410599  1.480515 -0.098150
10  0.110294  1.867559  0.910179
11  1.454274  0.906045  0.317218
12  0.761038 -0.861226  0.786328
13  0.121675  1.910065 -0.466419
14  0.443863 -0.268003 -0.944446
15  0.333674  0.069359 -0.410050
16  1.494079  0.947252 -0.017020
17 -0.205158 -0.155010  0.379152
18  0.313068  0.614079  2.259309
19 -0.854096  0.922207 -0.042257


### Using IQR Method 

The **IQR (Interquartile Range)** method is used to detect **outliers** by calculating the range between the first quartile (Q1) and third quartile (Q3) of a dataset:
1. IQR = Q3 - Q1
2. Lower Bound = Q1 - 1.5 × IQR
3. Upper Bound = Q3 + 1.5 × IQR

Outliers are values below the lower bound or above the upper bound.

Because quartiles are barely affected by extreme values, the method is effective for identifying outliers in skewed or non-normal data.
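
The three steps above can be worked through on a tiny example. This is a minimal sketch on a hypothetical series; it assumes pandas' default linear interpolation for `quantile`.

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value (50)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1 = values.quantile(0.25)  # 3.25
q3 = values.quantile(0.75)  # 7.75
iqr = q3 - q1               # 4.5

lower = q1 - 1.5 * iqr      # 3.25 - 6.75 = -3.5
upper = q3 + 1.5 * iqr      # 7.75 + 6.75 = 14.5

# Values outside [lower, upper] are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())    # [50]
```

Only 50 falls outside the range from -3.5 to 14.5, so it is the single value flagged.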

- `(df_mean_imputed < ...) | (df_mean_imputed > ...)`: Creates a boolean mask where True represents outliers. Each comparison needs its own parentheses because `|` binds more tightly than `<` and `>`.
- `.any(axis=1)`: Checks if any feature in a row is an outlier (across columns).
- `~`: Inverts the boolean mask, keeping rows where no feature is an outlier.
- This results in a dataframe (`df_no_outliers_iqr`) that contains only rows where no feature exceeds the outlier thresholds.

In [14]:
# Quartiles and IQR are computed separately for each column
Q1 = df_mean_imputed.quantile(0.25)
Q3 = df_mean_imputed.quantile(0.75)
IQR = Q3 - Q1
# Keep only the rows where no value falls outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df_no_outliers_iqr = df_mean_imputed[~((df_mean_imputed < (Q1 - 1.5 * IQR)) |
                                       (df_mean_imputed > (Q3 + 1.5 * IQR))).any(axis=1)]
print("\nDataset after removing outliers using IQR method:")
print(df_no_outliers_iqr.head(20))


Dataset after removing outliers using IQR method:
           A         B         C
0   0.110294  0.069359 -0.028013
1   0.400157 -1.347759 -0.239379
2   0.978738 -1.270485  1.099660
3   2.240893  0.969397  0.655264
4   1.867558 -1.173123  0.640132
5  -0.977278  1.943621 -1.616956
6   0.950088 -0.413619 -0.024326
7  -0.151357 -0.747455 -0.738031
8  -0.103219  1.922942  0.279925
9   0.410599  1.480515 -0.098150
10  0.110294  1.867559  0.910179
11  1.454274  0.906045  0.317218
12  0.761038 -0.861226  0.786328
13  0.121675  1.910065 -0.466419
14  0.443863 -0.268003 -0.944446
15  0.333674  0.069359 -0.410050
16  1.494079  0.947252 -0.017020
17 -0.205158 -0.155010  0.379152
19 -0.854096  0.922207 -0.042257
20  0.110294  0.376426 -0.028013


## When to Use Z-Score:
- **Normal Distribution**: Best for data that is approximately normally distributed.
- **Feature Scaling**: Useful for standardizing features (mean = 0, SD = 1), especially in models like KNN, SVM, and Logistic Regression.
- **Outlier Detection**: Detects outliers based on how far data points are from the mean (absolute Z-score > 3).

## When to Use IQR:
- **Non-Normal or Skewed Data**: Best for data that is skewed or non-normal.
- **Robust Outlier Detection**: Less sensitive to extreme values, because it relies on quartiles (the 1.5 * IQR rule) rather than on the mean and standard deviation.
- **Non-Parametric**: Doesn't assume normality, so it works well with small or non-normal datasets.
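
To see the two rules disagree, here is a small sketch on a hypothetical sample of ten values with one extreme point. The sample is made up for illustration, and the thresholds (|Z| > 3 and 1.5 * IQR) are the ones used throughout this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample with one extreme value (40)
x = np.array([2, 3, 3, 4, 4, 5, 5, 6, 7, 40], dtype=float)

# Z-score rule: flag values more than 3 standard deviations from the mean
z = np.abs(stats.zscore(x))
z_flagged = x[z > 3]

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flagged = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print("Z-score flags:", z_flagged.tolist())  # []
print("IQR flags:    ", iqr_flagged.tolist())  # [40.0]
```

The extreme value inflates the very mean and standard deviation used to score it, so its Z-score ends up just under 3 and it slips through. The quartiles barely move, so the IQR rule still catches it.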