### What is an outlier?
An outlier is a data point that lies far from the other observations in a dataset, outside the overall distribution of the data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### What are the criteria to identify an outlier?

1. A data point that lies more than 1.5 times the interquartile range (IQR) above the 3rd quartile or below the 1st quartile.
2. A data point that lies more than 3 standard deviations from the mean, that is, one whose z-score has an absolute value greater than 3. Some analyses use a stricter cutoff of 2.

### Why do outliers exist in a dataset?

1. Variability in the data
2. An experimental measurement error

### What are the impacts of having outliers in a dataset?

1. They can distort the results of a statistical analysis.
2. They can shift the mean and inflate the standard deviation significantly.
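To make the second point concrete, here is a minimal sketch that compares the mean, median and standard deviation of a small sample with and without one extreme value. It uses Python's standard `statistics` module rather than NumPy. The `sample` values, the `summarize` helper and the cutoff of 100 are illustrative assumptions, not part of the analysis that follows.

```python
import statistics

# Hypothetical sample: eight typical readings plus one extreme value
sample = [10, 12, 11, 13, 12, 14, 11, 13, 100]
# Same sample with the extreme value removed (cutoff chosen by eye for this demo)
trimmed = [x for x in sample if x < 100]

def summarize(values):
    """Return (mean, median, population standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
    )

mean_all, median_all, std_all = summarize(sample)
mean_trim, median_trim, std_trim = summarize(trimmed)

print(f"with outlier:    mean={mean_all:.2f} median={median_all} std={std_all:.2f}")
print(f"without outlier: mean={mean_trim:.2f} median={median_trim} std={std_trim:.2f}")
```

The median stays at 12 in both cases, while the mean and the standard deviation change sharply once the extreme value is included.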

### Various ways of finding outliers
1. Scatter plots
2. Box plots
3. Z-score
4. Interquartile range (IQR)
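As a rough illustration of the box plot approach, the sketch below draws one with Matplotlib's `Axes.boxplot`. With its default setting `whis=1.5`, the whiskers reach the furthest data points within 1.5 times the IQR of the box, and anything beyond them is drawn as a flier. The `values` list is a made-up sample used only for this demonstration.

```python
import matplotlib.pyplot as plt

# Hypothetical sample with two obviously extreme values
values = [10, 12, 11, 13, 12, 14, 11, 13, 95, 100]

fig, ax = plt.subplots()
# Default whis=1.5: whiskers reach the furthest points within 1.5 * IQR of the box
box = ax.boxplot(values)
ax.set_ylabel("value")
ax.set_title("Points beyond the whiskers are drawn as fliers")

# The flier markers hold the points Matplotlib treated as outliers
fliers = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(fliers)
```

In a notebook with `%matplotlib inline` the figure is displayed automatically; in a plain script, `plt.show()` would be needed.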



In [None]:
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10, 14, 13, 15, 10]

## Detecting outliers using the Z-score

### Using the Z-score

Formula for the Z-score = (Observation - Mean) / Standard Deviation

z = (x - μ) / σ

The Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values. It is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point's score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean.

In terms of outlier detection, a data point can be considered an outlier if the absolute value of its Z-score is large (a common threshold is 3), because it is then unusually far from the mean.

Here's how you can calculate the Z-score and use it to detect outliers:

1. Calculate the mean (μ) and standard deviation (σ) of the dataset.
2. For each data point (x) in the dataset, calculate the Z-score using the formula: Z = (x - μ) / σ
3. If the absolute value of a data point's Z-score is greater than a chosen threshold (e.g., 3), it is considered an outlier.

This method assumes that the data follows a normal distribution, and might not work well with data that is not normally distributed.
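The three steps above can be sketched with NumPy array operations. This is a vectorized variant under stated assumptions, not the loop-based function described next: the name `zscore_outliers`, the `threshold` parameter and the zero-spread guard are illustrative choices, and `data` repeats the notebook's dataset values so the block runs on its own.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    mu = arr.mean()
    sigma = arr.std()  # population standard deviation (ddof=0), NumPy's default
    if sigma == 0:
        # All values identical: no point can be far from the mean
        return arr[:0]
    z = (arr - mu) / sigma
    return arr[np.abs(z) > threshold]

# Same values as the notebook's `dataset`
data = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
        10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
        14, 13, 15, 10]

flagged = zscore_outliers(data)
print(flagged)
```

On these values the three points 102, 107 and 108 are flagged. Their Z-scores are only slightly above 3, because the outliers themselves pull the mean up and inflate the standard deviation.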

#### Exploring the Z-score method for identifying outliers in our dataset

The function detect_outliers(data) takes a list of numerical data as input.

threshold=3 sets the Z-score at which a data point will be considered an outlier. A Z-score measures how many standard deviations a data point is from the mean. Here, any data point that is more than 3 standard deviations from the mean will be considered an outlier.

mean = np.mean(data) calculates the average value of the data.

std =np.std(data) calculates the standard deviation of the data. The standard deviation is a measure of how spread out the numbers in the data are.

The for loop iterates over each data point in the dataset. For each data point, it calculates the Z-score (z_score= (i - mean)/std), which is the distance from the mean in units of standard deviation.

if np.abs(z_score) > threshold: checks if the absolute value of the Z-score is greater than the threshold. If it is, the data point is considered an outlier and is added to the outliers list.

Finally, the function returns the outliers list, which contains all the outliers in the input data.

In [None]:
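def detect_outliers(data):
    # Collect outliers in a fresh list on every call so results
    # do not accumulate across repeated calls
    outliers=[]
    threshold=3
    mean = np.mean(data)
    std =np.std(data)

    for i in data:
        # Z-score: distance from the mean in units of standard deviation
        z_score= (i - mean)/std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers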

In [None]:
outlier_pt=detect_outliers(dataset)

In [None]:
outlier_pt

[102, 107, 108]

## Interquartile Range

The difference between the 75th percentile and 25th percentile values in a dataset

### Steps
#### 1. Arrange the data in increasing order
#### 2. Calculate the first (q1) and third (q3) quartiles
#### 3. Find the interquartile range: IQR = q3 - q1
#### 4. Find the lower bound: q1 - 1.5 * IQR
#### 5. Find the upper bound: q3 + 1.5 * IQR

Anything that lies outside the lower and upper bounds is an outlier.

The Interquartile Range (IQR) is a statistical measure used to describe the spread of data in a dataset. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a dataset, representing the range within which the central 50% of the data falls. 

The IQR is often used to identify outliers. A common rule of thumb is that a data point is considered an outlier if it is less than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR.

The advantage of the IQR is that it is largely unaffected by extreme values, because it depends only on the middle 50% of the data. This makes it a more robust measure of spread than the range or the standard deviation.
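The robustness claim can be checked with a short sketch that computes the range, standard deviation and IQR twice: once on all values and once with the three largest values removed. The helper name `spread` and the choice to trim exactly three values are assumptions made for this illustration; `data` repeats the notebook's dataset values.

```python
import numpy as np

# Same values as the notebook's `dataset`
data = np.array([11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19,
                 107, 10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15,
                 12, 10, 14, 13, 15, 10])
# Drop the three largest values (102, 107, 108) to see how each measure reacts
trimmed = np.sort(data)[:-3]

def spread(values):
    """Return (range, standard deviation, IQR) for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return values.max() - values.min(), values.std(), q3 - q1

range_all, std_all, iqr_all = spread(data)
range_trim, std_trim, iqr_trim = spread(trimmed)

print(f"all values: range={range_all} std={std_all:.2f} IQR={iqr_all}")
print(f"trimmed:    range={range_trim} std={std_trim:.2f} IQR={iqr_trim}")
```

The range drops from 98 to 9 and the standard deviation shrinks by more than a factor of ten, while the IQR only moves from 3.0 to 2.0.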

**1. The data is sorted in ascending order. This makes it easier to identify the quartiles.**

In [None]:
## Perform all the steps of IQR
sorted(dataset)

[10,
 10,
 10,
 10,
 10,
 11,
 11,
 12,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 13,
 13,
 13,
 14,
 14,
 14,
 14,
 14,
 14,
 15,
 15,
 15,
 15,
 15,
 17,
 19,
 102,
 107,
 108]

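**2. The first quartile (Q1) and the third quartile (Q3) are calculated. Q1 is the median of the lower half of the data (excluding the overall median when the number of data points is odd), and Q3 is the median of the upper half.**

`np.percentile` uses linear interpolation between neighbouring data points by default, so on other datasets it can return slightly different quartiles than the median-of-halves rule.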

In [None]:
quantile1, quantile3= np.percentile(dataset,[25,75])

In [None]:
print(quantile1,quantile3)

12.0 15.0
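The median-of-halves definition in step 2 and the `np.percentile` call are two different rules for computing quartiles. A minimal sketch comparing them on the same values, where the variable names (`lower_half`, `q1_halves`, and so on) are illustrative assumptions:

```python
import numpy as np

# Same values as the notebook's `dataset`, sorted
data = sorted([11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
               10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
               14, 13, 15, 10])

half = len(data) // 2
# For an odd count, the middle value is left out of both halves
lower_half = data[:half]
upper_half = data[-half:]

# Rule 1: quartiles as medians of the two halves
q1_halves = np.median(lower_half)
q3_halves = np.median(upper_half)

# Rule 2: NumPy's default, linear interpolation between data points
q1_interp, q3_interp = np.percentile(data, [25, 75])

print(q1_halves, q3_halves)
print(q1_interp, q3_interp)
```

Both rules give 12.0 and 15.0 here, but they are not guaranteed to agree on every dataset.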


**3. The Interquartile Range (IQR) is calculated as Q3 - Q1. This represents the range within which the central 50% of the data lies.**

In [None]:
## Find the IQR

iqr_value=quantile3-quantile1
print(iqr_value)

3.0


**4. The lower bound is calculated as Q1 - 1.5*IQR. Any data point below this value could be considered an outlier.**

**5. The upper bound is calculated as Q3 + 1.5*IQR. Any data point above this value could be considered an outlier.**

In [None]:
## Find the lower bound value and the upper bound value

lower_bound_val = quantile1 -(1.5 * iqr_value)
upper_bound_val = quantile3 +(1.5 * iqr_value)

In [None]:
print(lower_bound_val,upper_bound_val)

7.5 19.5
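With the bounds in hand, the remaining step is to pick out the values that fall outside them. The sketch below wraps the whole IQR procedure in one function. The name `iqr_outliers` and the multiplier parameter `k` are illustrative assumptions, and `data` repeats the notebook's dataset values so the block runs on its own.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the points strictly outside the two bounds
    outliers = arr[(arr < lower) | (arr > upper)]
    return lower, upper, outliers

# Same values as the notebook's `dataset`
data = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
        10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
        14, 13, 15, 10]

lower, upper, outliers = iqr_outliers(data)
print(lower, upper)
print(outliers)
```

On these values the bounds are 7.5 and 19.5, and the points 102, 107 and 108 are flagged, the same three points the Z-score method identified.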
