Typically we have three strategies we can use to handle outliers. First, we can
drop them:


In [2]:
# Load library
import pandas as pd
# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]
# Filter observations
houses[houses['Bathrooms'] < 20]


    Price  Bathrooms  Square_Feet
0  534433        2.0         1500
1  392333        3.5         2500
2  293222        2.0         1500


Second, we can mark them as outliers and include that marker as a feature:

In [3]:
# Load library
import numpy as np
# Create feature based on boolean condition
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)
# Show data
houses


     Price  Bathrooms  Square_Feet  Outlier
0   534433        2.0         1500        0
1   392333        3.5         2500        0
2   293222        2.0         1500        0
3  4322032      116.0        48000        1


Finally, we can transform the feature to dampen the effect of the outlier:

In [4]:
# Log feature (np.log is applied element-wise to the whole column)
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])
# Show data
houses


     Price  Bathrooms  Square_Feet  Outlier  Log_Of_Square_Feet
0   534433        2.0         1500        0            7.313220
1   392333        3.5         2500        0            7.824046
2   293222        2.0         1500        0            7.313220
3  4322032      116.0        48000        1           10.778956


Similar to detecting outliers, there is no hard-and-fast rule for handling them.
How we handle them should be based on two aspects. First, we should consider
what makes them an outlier. If we believe they are errors in the data such as
from a broken sensor or a miscoded value, then we might drop the observation
or replace outlier values with NaN since we can’t trust those values. However,
if we believe the outliers are genuine extreme values (e.g., a mansion
with 200 bathrooms), then marking them as outliers or transforming their values
is more appropriate.
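The examples above cover dropping, marking, and transforming, but not replacing suspect values with NaN. The following is a minimal sketch of that option, assuming the same illustrative cutoff of 20 bathrooms used earlier; the names `suspect` and `cleaned` are hypothetical, and the cutoff is not a general rule:

```python
import pandas as pd

# Rebuild the example data so this snippet runs on its own
houses = pd.DataFrame({
    "Price": [534433, 392333, 293222, 4322032],
    "Bathrooms": [2, 3.5, 2, 116],
    "Square_Feet": [1500, 2500, 1500, 48000],
})

# Assumption: values at or above 20 bathrooms are treated as data errors
suspect = houses["Bathrooms"] >= 20

# Work on a copy so the original observations stay available
cleaned = houses.copy()

# Series.mask replaces values with NaN wherever the condition is True
cleaned["Bathrooms"] = houses["Bathrooms"].mask(suspect)

# Show data
print(cleaned)
```

Unlike dropping the row, this keeps the observation's other values, and the missing entry can later be imputed or handled like any other missing value.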
Second, how we handle outliers should be based on our goal for machine
learning. For example, if we want to predict house prices based on features of
the house, we might reasonably assume the price for mansions with over 100
bathrooms is driven by a different dynamic than regular family homes.
Furthermore, if we are training a model to use as part of an online home loan
web application, we might assume that our potential users will not include
billionaires looking to buy a mansion.
So what should we do if we have outliers? Think about why they are outliers,
have an end goal in mind for the data, and, most importantly, remember that not
making a decision to address outliers is itself a decision with implications.
One additional point: if you do have outliers, standardization might not be
appropriate because the mean and variance might be highly influenced by the
outliers. In this case, use a rescaling method more robust against outliers, like
**RobustScaler**.
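
As a rough illustration of the difference, the sketch below rescales the Square_Feet feature both ways using scikit-learn. It assumes scikit-learn is installed and uses the default settings of both scalers; with those defaults, RobustScaler centers on the median and scales by the interquartile range. The `comparison` frame is a hypothetical name used only to show the results side by side:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

# Same Square_Feet values as in the examples above
houses = pd.DataFrame({"Square_Feet": [1500, 2500, 1500, 48000]})

# Standardization: subtract the mean and divide by the standard deviation
standardized = StandardScaler().fit_transform(houses[["Square_Feet"]])

# Robust scaling (defaults): subtract the median and divide by the
# interquartile range
robust = RobustScaler().fit_transform(houses[["Square_Feet"]])

# Put both results next to the raw values for comparison
comparison = pd.DataFrame({
    "Square_Feet": houses["Square_Feet"],
    "Standardized": standardized.ravel(),
    "Robust": robust.ravel(),
})

# Show data
print(comparison)
```

With only four observations the interquartile range is itself affected by the extreme value, so the contrast is usually clearer on a larger dataset.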