# Workshop Outliers

In this brief workshop you will examine a data set and check it for outliers.

In [None]:
# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression


In [None]:
# Read in marketing data
df = pd.read_csv("data/marketing_AB.csv", index_col=0)
df

## 1. Detecting outliers and deciding what to do with them

Check the distribution of the numeric variables. Do the variables contain values that are flagged as outliers? Can you remove these outliers? Why (not)?
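
As a starting point, here is a minimal sketch of two common flagging rules: the 1.5 × IQR rule used by box plots and a z-score threshold. The helper names `iqr_outlier_mask` and `zscore_outlier_mask`, the cut-offs `k=1.5` and `threshold=3.0`, and the toy `sample` array are assumptions made for illustration; they are not part of the workshop data set.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def zscore_outlier_mask(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > threshold

# Toy data (made up): twenty ordinary values and one extreme value
sample = np.array([10, 11, 12, 13] * 5 + [95])

print("IQR rule flags:    ", sample[iqr_outlier_mask(sample)])
print("z-score rule flags:", sample[zscore_outlier_mask(sample)])
```

The same masks can be applied to a numeric column of `df`, for example `df['total ads'][iqr_outlier_mask(df['total ads'])]`. Note that both rules only flag values; whether a flagged value should be removed is a separate decision.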

In [None]:
# Your code goes here
df['total ads'].std()

## 2. Turning non-normal distributions into normal distributions

Values can be flagged as outliers for many reasons. A common one is that many outlier-detection rules assume that the data are approximately normally distributed. When that assumption does not hold, for example because the distribution is strongly skewed, a rule can flag many values that are in fact ordinary members of the distribution.
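
To illustrate the point, the sketch below applies the 1.5 × IQR rule to two simulated samples, one normal and one log-normal, and compares the fraction of values flagged. The sample size, seed, distribution parameters, and the helper `fraction_flagged_iqr` are all assumptions chosen for the illustration; the data are synthetic and unrelated to the marketing data set.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# Synthetic samples: one normal, one right-skewed (log-normal)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=n)
lognormal_sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)

def fraction_flagged_iqr(values, k=1.5):
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return mask.mean()

frac_normal = fraction_flagged_iqr(normal_sample)
frac_lognormal = fraction_flagged_iqr(lognormal_sample)

print(f"normal sample:     {frac_normal:.2%} flagged")
print(f"log-normal sample: {frac_lognormal:.2%} flagged")
```

Every value in both samples was drawn from its stated distribution, so none of them is an error or a contaminant. The skewed sample still gets a much larger share of its values flagged, all of them in the long right tail.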

A simple case is when the distribution is log-normal instead of normal. This case is "simple" because taking the logarithm of log-normally distributed values yields normally distributed values.
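
The effect of the transformation can be checked on simulated data. The sketch below draws a synthetic log-normal sample, takes its logarithm, and compares the skewness before and after using `scipy.stats.skew`. The seed, sample size, and parameters are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=1)

# Synthetic log-normal sample (all values are strictly positive)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)

# Taking the log of a log-normal variable gives a normal variable
logged = np.log(raw)

skew_raw = skew(raw)
skew_logged = skew(logged)

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_logged:.2f}")
```

A skewness close to zero after the transformation is consistent with a symmetric, normal-like shape, whereas the raw sample is strongly right-skewed. Real data are rarely exactly log-normal, so the improvement on a real column may be smaller.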

The plot below shows the shape of a log-normal probability density function.

Apply a log function to one of the variables in the data set and check whether the number of values flagged as outliers goes down.

In [None]:
# Plot the probability density function (PDF) of a log-normal distribution
from scipy.stats import lognorm
mu = 0
sigma = 1
x = np.linspace(0, 10, 1000)
pdf = lognorm.pdf(x, s=sigma, scale=np.exp(mu))
plt.plot(x, pdf, 'r', linewidth=2)
plt.title('Log-Normal distribution shape')
plt.show()

In [None]:
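# To apply a log function to a series: logdata = np.log(df['column'])
# Note: np.log returns -inf for 0 and NaN for negative values;
# np.log1p(x) computes log(1 + x) and can be used when the column contains zeros.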

# Your code goes here


What did you find? Did applying a log function to the values cause fewer values to be flagged as outliers?

Is applying a log function to data a valid operation? Why (not)?