# Detect and Delete Outliers with Optimus

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.

Be careful when studying outliers: an extreme value may be the result of a data glitch, or it may be a real data point and therefore not an outlier at all.

With Optimus you can analyze your data to check whether a value is an outlier. Let's see how.

In [1]:
# Import optimus
import optimus as op
# Import os for reading from local
import os

Deleting previous folder if exists...
Creation of checkpoint directory...
Done.


Let's create a `Utilities` object to read outliers.csv, the CSV file in the current folder.

In [2]:
tools = op.Utilities()

In [3]:
path = "file:///" + os.getcwd() + "/outliers.csv"
df = tools.read_dataset_csv(path, delimiter_mark=",", header="true")

In [4]:
df.show()

+----+---+
| num|idk|
+----+---+
|   1|  2|
|   2|  3|
|   3|  4|
|   4|  5|
|   5|  6|
|   6|  2|
|   7|  3|
|   8|  4|
|   9|  5|
|  10|  6|
|1000| 12|
+----+---+



From a quick inspection of the dataframe we can guess that the 1000 in the column 'num' may be an outlier. You can carry out a much more thorough investigation to confirm whether it actually is an outlier; if you need something like that, please check out [these articles and tutorials](http://www.datasciencecentral.com/profiles/blogs/11-articles-and-tutorials-about-outliers).

With Optimus you can also perform several analyses to check whether a value is an outlier. First, let's run some visual analysis. Remember to check the [Main Example](https://github.com/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb) for more.

In [5]:
profiler = op.DataFrameProfiler(df)

In [6]:
profiler.profiler()

0,1
Number of variables,2
Number of observations,11
Total Missing (%),0.0%
Total size in memory,0.0 B
Average record size in memory,0.0 B

0,1
Numeric,2
Categorical,0
Date,0
Text (Unique),0
Rejected,0

0,1
Distinct count,6
Unique (%),54.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.7273
Minimum,2
Maximum,12
Zeros (%),0.0%

0,1
Minimum,2.0
5-th percentile,2.0
Q1,3.0
Median,4.0
Q3,5.5
95-th percentile,9.0
Maximum,12.0
Range,10.0
Interquartile range,2.5

0,1
Standard deviation,2.7961
Coef of variation,0.59148
Kurtosis,2.2763
Mean,4.7273
MAD,1.8843
Skewness,1.6178
Sum,52
Variance,7.8182
Memory size,0.0 B

0,1
Distinct count,11
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,95.909
Minimum,1
Maximum,1000
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,1.5
Q1,3.5
Median,6.0
Q3,8.5
95-th percentile,505.0
Maximum,1000.0
Range,999.0
Interquartile range,5.0

0,1
Standard deviation,299.87
Coef of variation,3.1266
Kurtosis,6.0984
Mean,95.909
MAD,164.38
Skewness,2.8456
Sum,1055
Variance,89920
Memory size,0.0 B

Unnamed: 0,num,idk
0,1,2
1,2,3
2,3,4
3,4,5
4,5,6


From the profiler's quantile and descriptive statistics we can see that the 1000 value starts to look a lot like an outlier: for 'num', the mean (95.909) is far above both the median (6.0) and Q3 (8.5).
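As a rough cross-check of those quartiles, the sketch below applies Tukey's fences (the 1.5 × IQR rule of thumb). This is a swapped-in technique for illustration, not something the Optimus profiler is shown doing here. The values are retyped from the dataframe printed earlier, and `tukey_fences` is a hypothetical helper name.

```python
import statistics

# Values of the 'num' column, retyped from the dataframe shown earlier.
num = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    # method="inclusive" interpolates linearly between data points, which
    # reproduces the Q1 and Q3 reported by the profiler for this column.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = tukey_fences(num)
flagged = [x for x in num if x < lower or x > upper]
print(lower, upper, flagged)
```

With Q1 = 3.5 and Q3 = 8.5, the fences land at -4 and 16, so only 1000 falls outside them.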

## Outlier detection 

One of the most common ways of finding outliers in one-dimensional data is to mark as a potential outlier any point that is more than, say, two standard deviations from the mean (sample mean and sample standard deviation, here and in what follows). But outliers themselves have a strong effect on the mean and the standard deviation, which makes this technique unreliable.
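To make that effect concrete, here is a small sketch using only Python's standard library (not Optimus) on the 'num' values retyped from the dataframe above. It compares the sample mean and standard deviation with and without the suspicious value, and then computes how many standard deviations 1000 lies from the mean.

```python
import statistics

# Values of the 'num' column, retyped from the dataframe shown earlier.
num = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
without_1000 = [x for x in num if x != 1000]

for label, values in (("with 1000", num), ("without 1000", without_1000)):
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    print(label, round(statistics.mean(values), 3),
          round(statistics.stdev(values), 3))

mean_all = statistics.mean(num)
stdev_all = statistics.stdev(num)

# Distance of the suspicious value from the mean, in standard deviations.
z_score = (1000 - mean_all) / stdev_all
print(round(z_score, 3))
```

A single extreme value moves the mean from 5.5 to about 95.9 and the standard deviation from about 3.0 to about 299.9. As a result, 1000 sits only about three standard deviations from the mean, even though it is a hundred times larger than the next largest value.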

That's why Optimus implements the median absolute deviation from the median, commonly shortened to the median absolute deviation (MAD). It is the median of the absolute differences between each data point and the median. If you want more information on the subject, please read the article by Leys et al. on detecting outliers [here](http://www.sciencedirect.com/science/article/pii/S0022103113000668).

To instantiate the outlier detector class you just need to do:
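The definition can be sketched in a few lines of plain Python. This is an illustration of the general MAD rule, not Optimus's internal code: the function name `mad_outliers`, the cutoff of 2.5 and the scale constant 1.4826 (which makes MAD comparable to a standard deviation for normally distributed data) are assumptions, and Optimus may use different settings. Note also that the `MAD` row in the profiler output above (164.38 for 'num') matches the mean absolute deviation, which is a different statistic.

```python
import statistics

# Values of the 'num' column, retyped from the dataframe shown earlier.
num = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

def mad_outliers(values, threshold=2.5, scale=1.4826):
    """Return (median, MAD, outliers) using a median-based cutoff.

    threshold and scale are assumed, conventional choices; they are
    not taken from the Optimus source.
    """
    med = statistics.median(values)
    # MAD: median of the absolute deviations from the median.
    mad = statistics.median([abs(x - med) for x in values])
    cutoff = threshold * scale * mad
    outliers = [x for x in values if abs(x - med) > cutoff]
    return med, mad, outliers

med, mad, found = mad_outliers(num)
print(med, mad, found)
```

For this column the median is 6 and the MAD is 3, so the cutoff is roughly 11 units either side of the median. Unlike the mean and standard deviation, neither number changes much if 1000 is replaced by an even larger value.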

In [7]:
# Choose a column for analyzing
detector = op.OutlierDetector(df,"num")

With the `outliers()` method you can use MAD to detect whether there are outliers in your column:

In [8]:
detector.outliers()

[1000]

And with the `run()` method you can see which values are not outliers:

In [9]:
detector.run()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Finally, with the `delete_outliers()` method you can delete the existing outliers in your column. This modifies the dataframe held by the OutlierDetector (deleting the whole row that contains each outlier value), but the original dataframe that we read from disk remains intact.

In [10]:
print("Original dataframe (with outliers):")

df.show()

print("Dataframe without outliers:")

detector.delete_outliers().show()

Original dataframe (with outliers):
+----+---+
| num|idk|
+----+---+
|   1|  2|
|   2|  3|
|   3|  4|
|   4|  5|
|   5|  6|
|   6|  2|
|   7|  3|
|   8|  4|
|   9|  5|
|  10|  6|
|1000| 12|
+----+---+

Dataframe without outliers:
+---+---+
|num|idk|
+---+---+
|  1|  2|
|  2|  3|
|  3|  4|
|  4|  5|
|  5|  6|
|  6|  2|
|  7|  3|
|  8|  4|
|  9|  5|
| 10|  6|
+---+---+
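The row-wise behaviour shown above can be imitated in plain Python, without Spark. The sketch below is only an illustration under the same assumed MAD settings (cutoff 2.5, scale 1.4826); `drop_outlier_rows` is a hypothetical helper, not part of Optimus.

```python
import statistics

# Rows of (num, idk), retyped from the dataframe shown earlier.
rows = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 2),
        (7, 3), (8, 4), (9, 5), (10, 6), (1000, 12)]

def drop_outlier_rows(rows, col, threshold=2.5, scale=1.4826):
    """Return a new list without the rows whose value in `col` is a MAD outlier."""
    values = [row[col] for row in rows]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    cutoff = threshold * scale * mad  # assumed settings, see lead-in
    # Build a new list so the input rows are left untouched.
    return [row for row in rows if abs(row[col] - med) <= cutoff]

cleaned = drop_outlier_rows(rows, col=0)
print(len(rows), len(cleaned))
```

The whole `(1000, 12)` row is dropped, including its 'idk' value, while the input list keeps all eleven rows.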

