#### Outliers
    Data points that differ markedly from the rest of the data, often visible as a gap or abrupt change in the values

    Cons of Outliers
        - can reduce model accuracy
        - pull the mean (the center of the data) toward themselves
        - skew the distribution of the data
        - can lead to misleading results
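
As a small illustration of how an outlier pulls the center of the data toward itself, the sketch below compares the mean and the median of the ages used later in this notebook, with and without the value 50. The variable names are chosen for this example only.

```python
import numpy as np

# Ages used later in this notebook; 50 is the suspicious value
ages_with_outlier = np.array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50])
ages_without_outlier = ages_with_outlier[ages_with_outlier != 50]

# The mean is sensitive to extreme values, the median much less so
mean_with = np.mean(ages_with_outlier)
median_with = np.median(ages_with_outlier)
mean_without = np.mean(ages_without_outlier)
median_without = np.median(ages_without_outlier)

print(f"With outlier:    mean={mean_with:.2f}, median={median_with:.2f}")
print(f"Without outlier: mean={mean_without:.2f}, median={median_without:.2f}")
```

The mean moves by about two years (27.08 vs 25.00), while the median barely changes (25.50 vs 25.00).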

    Other names of Outliers
        - Deviants
        - Abnormalities
        - Anomalous points
        - Aberrant observations

    Types of Outliers
        - Univariate (an unusual value in a single variable/feature)
        - Multivariate (an unusual combination of values across several features, e.g. age, name, price)
        - Global Outlier (a value that is unusual relative to the whole dataset)
        - Point Outlier (a single data point far from the rest; can be univariate or multivariate)
        - Local Outlier (a point that differs from its immediate neighbors, even if it is not extreme for the whole dataset)
        - Contextual Outlier (a data point that is unusual only in a specific context, even if it looks normal compared to the whole dataset)
        - other types exist as well

    Causes of Outliers
        - Data Entry errors
        - Measurement errors
        - Experimental errors
        - Intentional outliers
        - Data Processing Errors
        - Sampling Errors

    Why is removing outliers necessary?
        - they may hold hidden clues about the data
        - they affect data quality
        - they can have a large impact on statistics and models
        - better decisions, better modelling, better analysis
        - prevents distorted visualizations

    How to handle Outliers:
        - Removing the outliers
        - Transforming and binning values
        - Imputation
        - Separate treatment
        - Robust statistical methods
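
Two of the options above, transforming (here: capping) and imputation, can be sketched as follows. This is a minimal example under stated assumptions: the 1.5 * IQR fences are just one common way to decide which values count as acceptable, and replacing flagged values with the median is only one possible imputation choice.

```python
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50], dtype=float)

# Assumed rule for "acceptable" values: the usual 1.5 * IQR fences
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1 (transforming): cap values at the fences instead of dropping them
capped = ages.clip(lower=lower, upper=upper)

# Option 2 (imputation): replace flagged values with the median
is_inlier = ages.between(lower, upper)
imputed = ages.where(is_inlier, ages.median())

print("Capped: ", capped.tolist())
print("Imputed:", imputed.tolist())
```

Both variants keep all twelve rows, which matters when dropping rows would lose information in other columns.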

    There are multiple methods to detect outliers; important ones are:
        - Z-Score
        - IQR
        - K-Means clustering

### Z-Score

In [15]:
import pandas as pd
import numpy as np

data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})

# Calculate the mean and standard deviation
mean = np.mean(data['Age'])
std = np.std(data['Age'])

# Calculate the Z-Score
data['Z-Score'] = (data['Age'] - mean) / std

print("----------------------------------------")
print(f"Here is the data with outliers:\n {data}")
print("----------------------------------------")

# Print the outliers (use the absolute Z-score so unusually low values are caught too)
print(f"Here are the outliers based on the z-score threshold, 3:\n {data[data['Z-Score'].abs() > 3]}")
print("----------------------------------------")

# Remove the outliers
data = data[data['Z-Score'].abs() <= 3]

# Print the data without outliers
print(f"Here is the data without outliers:\n {data}")

----------------------------------------
Here is the data with outliers:
     Age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
11   50  3.037793
----------------------------------------
Here are the outliers based on the z-score threshold, 3:
     Age   Z-Score
11   50  3.037793
----------------------------------------
Here is the data without outliers:
     Age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
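
One detail worth checking in the example above: `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below, with variable names chosen only for this illustration, shows that the choice decides whether the age 50 crosses the threshold of 3.

```python
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50])
mean_age = ages.mean()

# Population standard deviation (divides by n), as np.std does by default
std_population = ages.std(ddof=0)
# Sample standard deviation (divides by n - 1), the pandas default
std_sample = ages.std(ddof=1)

z_population = (50 - mean_age) / std_population
z_sample = (50 - mean_age) / std_sample

print(f"Z-score of 50 with ddof=0: {z_population:.4f}")
print(f"Z-score of 50 with ddof=1: {z_sample:.4f}")
```

With `ddof=1` the Z-score of 50 is about 2.91, so it would not be flagged at a threshold of 3. On small datasets it helps to state which definition is used.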


In [16]:
# We can also use scipy stats library

In [28]:
import numpy as np
from scipy import stats

data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]

# Calculate the Z-score for each data point
z_scores = np.abs(stats.zscore(data))
print(z_scores)

# Set a threshold for identifying outliers
threshold = 2.5
outliers = np.where(z_scores > threshold)[0] # Stores the indices of the outliers

# print the data
print("----------------------------------------")
print("Data:", data)
print("----------------------------------------")
print("Indices of Outliers:", outliers)
print("Outliers:", [data[i] for i in outliers])


# Remove outliers
data = [data[i] for i in range(len(data)) if i not in outliers]
print("----------------------------------------")
print("Data without outliers:", data)

[0.35584227 0.34959942 0.346478   0.34023515 0.3339923  0.32774946
 0.32150661 0.31526376 0.30902092 2.99968787]
----------------------------------------
Data: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]
----------------------------------------
Indices of Outliers: [9]
Outliers: [110.0]
----------------------------------------
Data without outliers: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0]


In [29]:
# IQR method
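
The IQR cell above has no code, so here is a minimal sketch of the method on the same ages as the Z-score example. It assumes the conventional fences of 1.5 * IQR below Q1 and above Q3; the multiplier and the variable names are choices made for this illustration.

```python
import pandas as pd

age_data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})

# First and third quartiles, and the interquartile range between them
q1 = age_data['Age'].quantile(0.25)
q3 = age_data['Age'].quantile(0.75)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (age_data['Age'] < lower_bound) | (age_data['Age'] > upper_bound)
iqr_outliers = age_data[is_outlier]
age_data_clean = age_data[~is_outlier]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print("Outliers:\n", iqr_outliers)
print("Data without outliers:\n", age_data_clean)
```

Because quartiles are not affected much by extreme values, this rule tends to be less sensitive to the outlier itself than a Z-score based on the mean and standard deviation.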

In [33]:
# K-MEANS CLUSTERING

# We group the data into clusters; a point that does not fit well into any cluster is treated as an outlier

In [35]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]

# Create a K-means model with two clusters (one is meant to hold the normal points, the other the outliers)
kmeans = KMeans(n_clusters=2, n_init=10)
kmeans.fit(data)

# Predict cluster labels
labels = kmeans.predict(data)

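# Identify outliers based on cluster labels
# Note: K-means label numbers are arbitrary and can differ between runs,
# so treating label 1 as the outlier cluster is illustrative only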
outliers = [data[i] for i, label in enumerate(labels) if label == 1]

# print data
print("Data:", data)
print("Outliers:", outliers)

# Remove outliers
data = [data[i] for i, label in enumerate(labels) if label == 0]
print("Data without outliers:", data)

Data: [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]
Outliers: [[2, 2], [3, 3], [3, 4]]
Data without outliers: [[30, 30], [31, 31], [32, 32]]
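
In the run above, half of the points end up labelled as outliers, because K-means only splits the data into groups and does not say which group is abnormal. One possible refinement, sketched below as an assumption rather than a standard recipe, is to flag only the points that land in very small clusters. The dataset, the `min_cluster_size` cutoff, and the variable names are all hypothetical choices for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points close together and one point far away
points = np.array([[2, 2], [3, 3], [3, 4], [2, 3], [30, 30]])

# Fix random_state so the run is reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Count how many points fall into each cluster
cluster_sizes = np.bincount(labels, minlength=2)

# Assumed rule: a cluster with fewer than min_cluster_size points holds outliers
min_cluster_size = 2
small_clusters = np.where(cluster_sizes < min_cluster_size)[0]
is_outlier = np.isin(labels, small_clusters)

kmeans_outliers = points[is_outlier].tolist()
kmeans_inliers = points[~is_outlier].tolist()

print("Outliers:", kmeans_outliers)
print("Data without outliers:", kmeans_inliers)
```

This avoids depending on which label number K-means happens to assign, since the decision is based on cluster size instead.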


# Practice

In [47]:
import pandas as pd
import numpy as np

# Example: ages with one obvious outlier
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})
print(data)

    Age
0    20
1    21
2    22
3    23
4    24
5    25
6    26
7    27
8    28
9    29
10   30
11   50


In [48]:
mean = np.mean(data['Age'])
std = np.std(data['Age'])

In [49]:
data['Z-Score'] = (data['Age'] - mean) / std

In [50]:
data

    Age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
11   50  3.037793


In [51]:
data[data['Z-Score'].abs() > 3]

    Age   Z-Score
11   50  3.037793


In [52]:
data = data[data['Z-Score'].abs() <= 3]

In [53]:
data

    Age   Z-Score
0    20 -0.938954
1    21 -0.806396
2    22 -0.673838
3    23 -0.541280
4    24 -0.408721
5    25 -0.276163
6    26 -0.143605
7    27 -0.011047
8    28  0.121512
9    29  0.254070
10   30  0.386628
