# Outlier Removal

## Setup

In [2]:
import pandas as pd

## 1. Outlier Removal Operations
- Outliers are unusual values in your dataset, and they can distort statistical analyses. 

### 1.1 `clip`
- If you want to cap outlying values at a lower and an upper bound instead of dropping them, one method is `clip` (available as `DataFrame.clip` and `Series.clip`).

In [5]:
data = {'A': [9, -3, 0, -1, 5]}
df = pd.DataFrame(data)
df.head()

Unnamed: 0,A
0,9
1,-3
2,0
3,-1
4,5


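- Example: cap the values that fall below the `.05` quantile (5th percentile) or above the `.95` quantile (95th percentile). Note that `clip` replaces these values with the bounds; it does not drop the rows.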

In [6]:
lower_bound, upper_bound = df['A'].quantile(.05), df['A'].quantile(.95)
print(lower_bound, upper_bound)

-2.6 8.2
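
The bounds `-2.6` and `8.2` are not values from the column; they come from linear interpolation between neighbouring sorted values, which is the default behaviour of `Series.quantile`. The sketch below reproduces the arithmetic by hand. The helper `linear_quantile` and the variable names are hypothetical, written only for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([9, -3, 0, -1, 5])
sorted_vals = np.sort(values.to_numpy())  # [-3, -1, 0, 5, 9]

def linear_quantile(sorted_vals, q):
    """Hypothetical helper: quantile by linear interpolation."""
    pos = q * (len(sorted_vals) - 1)          # fractional position in the sorted data
    lo = int(np.floor(pos))                   # index of the value just below
    hi = min(lo + 1, len(sorted_vals) - 1)    # index of the value just above
    frac = pos - lo                           # how far between the two neighbours
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])

# q = .05: position 0.2, so -3 + 0.2 * (-1 - (-3))
manual_lower = linear_quantile(sorted_vals, 0.05)
# q = .95: position 3.8, so 5 + 0.8 * (9 - 5)
manual_upper = linear_quantile(sorted_vals, 0.95)

print(manual_lower, manual_upper)
```

With only five rows, each bound sits just inside the minimum and maximum, so only the two extreme values are affected by the clip.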


In [7]:
# Assign the whole column so its dtype can change from int to float
df['A'] = df['A'].clip(lower=lower_bound, upper=upper_bound)

In [8]:
# As you can see, -3 becomes -2.6, and 9 becomes 8.2
df.head()

Unnamed: 0,A
0,8.2
1,-2.6
2,0.0
3,-1.0
4,5.0
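
`clip` keeps every row and only caps the extreme values. If the goal is to drop the outlying rows instead, one possible alternative is a boolean mask built with `Series.between`. The sketch below is a minimal comparison on the same data; the names `sample`, `in_range`, `filtered`, and `clipped` are illustrative and not part of the original notebook.

```python
import pandas as pd

sample = pd.DataFrame({'A': [9, -3, 0, -1, 5]})
lower, upper = sample['A'].quantile(.05), sample['A'].quantile(.95)

# between() is inclusive on both ends by default
in_range = sample['A'].between(lower, upper)

# Filtering drops the rows that fall outside the bounds
filtered = sample.loc[in_range, :]

# Clipping keeps all rows and caps the values at the bounds
clipped = sample['A'].clip(lower=lower, upper=upper)

print(len(filtered), len(clipped))
```

Which approach fits depends on the analysis: filtering shrinks the dataset, while clipping preserves the row count but changes the values.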


### 1.2 `isin`: Filter Rows by Whether a Column's Values Are in a List

In [9]:
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': ['a', 'b', 'c', 'd', 'e']})
df.head()

Unnamed: 0,col1,col2
0,1,a
1,2,b
2,3,c
3,4,d
4,5,e


- Example: keep only the rows whose `col2` value is NOT one of `'a'`, `'b'`, `'c'`. `isin` returns a boolean mask, and `~` inverts it.

In [14]:
excluded_list = ['a', 'b', 'c']
df = df.loc[~df['col2'].isin(excluded_list), :]

In [15]:
df.head()

Unnamed: 0,col1,col2
3,4,d
4,5,e
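
The same mask can serve both directions: without `~` it keeps the listed values, and with `~` it excludes them. The sketch below shows both on a fresh copy of the data. The names `letters`, `mask`, `kept_in_list`, and `kept_not_in_list` are illustrative, and the `reset_index(drop=True)` call is an optional extra step, not something the notebook above does.

```python
import pandas as pd

letters = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                        'col2': ['a', 'b', 'c', 'd', 'e']})
excluded_list = ['a', 'b', 'c']

# True where col2 is one of the listed values
mask = letters['col2'].isin(excluded_list)

# Keep only the rows whose value IS in the list
kept_in_list = letters.loc[mask, :]

# Keep only the rows whose value is NOT in the list;
# reset_index(drop=True) renumbers the remaining rows from 0
kept_not_in_list = letters.loc[~mask, :].reset_index(drop=True)

print(kept_in_list)
print(kept_not_in_list)
```

Without the reset, the filtered frame keeps its original index labels (3 and 4 in the output above), which matters if later code relies on positional labels.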
