# Outliers

Outliers are data points that significantly differ from the rest of the dataset, often lying far from the central tendency (mean or median).
Left unhandled, outliers can distort statistical analysis, skew results, reduce model accuracy, and lead to incorrect conclusions by overemphasizing or underrepresenting certain patterns.
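
To make that distortion concrete, here is a minimal sketch that compares summary statistics with and without a single extreme value. The ages are the same ones used in the Z-score example further down; the variable names are only illustrative.

```python
import numpy as np

# Illustrative ages; the last value (50) sits far from the others
ages = np.array([20, 21, 23, 24, 25, 27, 29, 30, 31, 32, 33, 50])
ages_without = ages[ages < 50]

# Mean, median and population standard deviation, with and without the extreme value
mean_with, median_with, std_with = ages.mean(), np.median(ages), ages.std()
mean_without, median_without, std_without = (
    ages_without.mean(), np.median(ages_without), ages_without.std()
)

print(f"With 50:    mean={mean_with:.2f}, median={median_with:.1f}, std={std_with:.2f}")
print(f"Without 50: mean={mean_without:.2f}, median={median_without:.1f}, std={std_without:.2f}")
```

In this small sample the standard deviation reacts far more strongly to the single extreme value than the median does, which is why spread-based statistics are usually the first to be distorted.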

Outliers are also referred to by different names such as:
1. Anomalies
2. Deviations
3. Extremes
4. Aberrations
5. Exceptions
6. Irregularities

There are different types of outliers:
* **Univariate outliers**: Values that are extreme within the distribution of a single variable.
* **Multivariate outliers**: Observations whose combination of values across two or more variables is unusual, even if each value looks normal on its own.
* **Clustered outliers**: A small group of points that lie close to one another but far from the rest of the data.
* **Contextual outliers**: Values that are unusual only in a particular context, such as a time or location. For example, 25 °C is ordinary in summer but anomalous in winter.
* **Global outliers**: Data points that deviate significantly from the rest of the dataset as a whole.
* **Local outliers**: Data points that appear normal globally but behave anomalously within a local region of the data. The local outlier factor (LOF) algorithm is a common way to score them.
* **Recurrent outliers**: Data points that repeatedly appear as anomalies over time, often following a certain pattern or in response to a recurring event. They may occur at regular intervals but are still considered anomalies compared to the rest of the data.
* **Periodic outliers**: Similar to recurrent outliers in that they occur at regular time intervals, but more closely tied to a predictable cycle or seasonality, such as spikes in demand during holidays or unusual temperature readings during specific seasons.
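
The difference between a global and a contextual outlier can be sketched with pandas. The temperature readings below are made up for illustration, and `zscore` is a hypothetical helper: a reading of 25 is unremarkable across the whole dataset but stands out once it is compared only with other winter readings.

```python
import pandas as pd

# Hypothetical temperature readings (°C); 25 is ordinary overall but unusual for winter
df_temp = pd.DataFrame({
    "season": ["winter"] * 8 + ["summer"] * 8,
    "temp": [2, 3, 1, 4, 2, 3, 1, 25, 28, 30, 29, 31, 27, 30, 29, 28],
})

def zscore(s):
    # Population Z-score (ddof=0), the same convention np.std uses by default
    return (s - s.mean()) / s.std(ddof=0)

# Z-score against the whole dataset vs. against readings from the same season
df_temp["z_global"] = zscore(df_temp["temp"])
df_temp["z_in_season"] = df_temp.groupby("season")["temp"].transform(zscore)

global_outliers = df_temp[df_temp["z_global"].abs() > 2]
contextual_outliers = df_temp[df_temp["z_in_season"].abs() > 2]

print("Global outliers:", global_outliers["temp"].tolist())
print("Contextual outliers:", contextual_outliers["temp"].tolist())
```

The threshold of 2 is an assumption chosen for this toy data; the point is only that the same value can pass a global check and fail a per-context one.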




#### We will now detect and remove outliers with the Z-score method, using only NumPy and pandas.

In [8]:
# Step 1: Import the required libraries
import pandas as pd
import numpy as np

# Step 2: Create Dataframe
df = pd.DataFrame({"Age": [20, 21, 23, 24, 25, 27, 29, 30, 31, 32, 33, 50]})
df

    Age
0    20
1    21
2    23
3    24
4    25
5    27
6    29
7    30
8    31
9    32
10   33
11   50


In [9]:
# Step 3: Calculate the mean and standard deviation
mean = np.mean(df['Age'])
std = np.std(df['Age'])

# Step 4: Calculate the Z-Score
df['Z-Score'] = (df['Age'] - mean) / std
df

     Age   Z-Score
0    20 -1.148725
1    21 -1.017442
2    23 -0.754876
3    24 -0.623594
4    25 -0.492311
5    27 -0.229745
6    29  0.032821
7    30  0.164104
8    31  0.295386
9    32  0.426669
10   33  0.557952
11   50  2.789761
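
For reference, the calculation in Steps 3 and 4 can be written out as follows. This is a worked restatement of the code above, not an additional method; the symbols $\mu$, $\sigma$ and $n$ are notation introduced here.

$$
z_i = \frac{x_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = 28.75, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2} \approx 7.617
$$

For the age 50 this gives $z = (50 - 28.75) / 7.617 \approx 2.79$. Note that `np.std` divides by $n$ (population standard deviation), whereas pandas' `Series.std()` divides by $n - 1$ by default, so calling `df['Age'].std()` instead would produce slightly smaller Z-scores.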


In [10]:
# Step 5: Print The data
print("---------------------------------------")
print(f"Here is the data with outliers:\n {df}")
print("---------------------------------------")

---------------------------------------
Here is the data with outliers:
     Age   Z-Score
0    20 -1.148725
1    21 -1.017442
2    23 -0.754876
3    24 -0.623594
4    25 -0.492311
5    27 -0.229745
6    29  0.032821
7    30  0.164104
8    31  0.295386
9    32  0.426669
10   33  0.557952
11   50  2.789761
---------------------------------------


In [11]:

# Step 6: Print the outliers (use the absolute Z-Score so low-side outliers are caught too)
print(f"Here are outliers based on Z-Score threshold, 2:\n {df[df['Z-Score'].abs() > 2]}")
print("-----------------------------------")

Here are outliers based on Z-Score threshold, 2:
     Age   Z-Score
11   50  2.789761
-----------------------------------


In [12]:
# Step 7: Remove the outliers (keep rows whose absolute Z-Score is within the threshold)
df = df[df['Z-Score'].abs() <= 2]

# Step 8: Print the data without outliers
print(f"Here is the data without outliers:\n {df}")

Here is the data without outliers:
     Age   Z-Score
0    20 -1.148725
1    21 -1.017442
2    23 -0.754876
3    24 -0.623594
4    25 -0.492311
5    27 -0.229745
6    29  0.032821
7    30  0.164104
8    31  0.295386
9    32  0.426669
10   33  0.557952
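
The steps above can also be folded into one reusable function. This is a minimal sketch, not part of the original walkthrough: `split_zscore_outliers` is a hypothetical helper name, and it assumes the population standard deviation (`ddof=0`) to match `np.std`.

```python
import pandas as pd

def split_zscore_outliers(df, column, threshold=2.0):
    """Hypothetical helper: return (inliers, outliers) based on the absolute Z-score."""
    values = df[column]
    z = (values - values.mean()) / values.std(ddof=0)  # population std, as np.std uses
    is_outlier = z.abs() > threshold                   # two-sided check
    return df[~is_outlier], df[is_outlier]

ages = pd.DataFrame({"Age": [20, 21, 23, 24, 25, 27, 29, 30, 31, 32, 33, 50]})
inliers, outliers = split_zscore_outliers(ages, "Age", threshold=2.0)

print(outliers)
print(len(inliers), "rows kept")
```

Returning both parts leaves the original DataFrame untouched, so the outliers can be inspected before deciding whether to drop them.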


#### As you can see, the dataset contained one outlier, and it has been removed. Now we will use the NumPy and SciPy libraries to handle outliers.

In [13]:
# Import Libraries 
import numpy as np
from scipy import stats

In [16]:
# Sample Data
df = [2.5, 2.9, 3.1, 3.3, 3.7, 4.0, 4.4, 4.7, 4.9, 110.0]
df

[2.5, 2.9, 3.1, 3.3, 3.7, 4.0, 4.4, 4.7, 4.9, 110.0]

In [17]:
# Calculate Z-Scores for each data point
z_scores = np.abs(stats.zscore(df))
z_scores

array([0.37156492, 0.35902265, 0.35275151, 0.34648037, 0.3339381 ,
       0.32453139, 0.31198911, 0.30258241, 0.29631127, 2.99917172])

In [18]:
# Set a threshold for outliers
threshold = 2.5
outliers = np.where(z_scores > threshold)[0]

In [20]:
# Print the data
print("--------------------------------------")
print("Data: ", df)
print("--------------------------------------")

print("Indices of outliers: ", outliers)
print("Outliers: ", [df[i] for i in outliers])

--------------------------------------
Data:  [2.5, 2.9, 3.1, 3.3, 3.7, 4.0, 4.4, 4.7, 4.9, 110.0]
--------------------------------------
Indices of outliers:  [9]
Outliers:  [110.0]


In [21]:
# Remove the outliers
df = [df[i] for i in range(len(df)) if i not in outliers]
print("--------------------------------------------")
print("Data without outliers:", df)

--------------------------------------------
Data without outliers: [2.5, 2.9, 3.1, 3.3, 3.7, 4.0, 4.4, 4.7, 4.9]
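
A note on the threshold of 2.5 used above: with population Z-scores (the `ddof=0` default of `scipy.stats.zscore`), the largest possible |z| in a sample of size n is sqrt(n - 1). The sketch below checks this for the ten sample values with plain NumPy; the variable names are illustrative.

```python
import numpy as np

data = np.array([2.5, 2.9, 3.1, 3.3, 3.7, 4.0, 4.4, 4.7, 4.9, 110.0])
n = len(data)

# Population Z-scores (ddof=0), the same convention scipy.stats.zscore uses by default
z = np.abs((data - data.mean()) / data.std())

# Upper bound on |z| for any sample of size n
max_possible = np.sqrt(n - 1)

print(f"Largest |z| in the sample: {z.max():.4f}")
print(f"Upper bound sqrt(n - 1):   {max_possible:.4f}")
print("Flagged at threshold 3:", data[z > 3])
print("Flagged at threshold 2.5:", data[z > 2.5])
```

With only ten points, the common cutoff of 3 could never flag anything, no matter how extreme the value, so a lower threshold is needed for small samples.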


#### Another way to handle outliers is k-means clustering.

In [40]:
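# Import Libraries
import numpy as np
from sklearn.cluster import KMeans

# Sample Data
df = [[2, 2], [3, 3], [4, 4], [30, 30], [31, 31], [32, 32]]

# Create a kmeans model with two clusters (normal and outlier)
# random_state makes the run reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(df)

# Predict cluster labels
labels = kmeans.predict(df)

# Cluster numbers are arbitrary, so do not assume that label 1 means "outlier".
# Assumption for this toy data: the cluster with the larger centroid is the outlier group.
outlier_label = np.argmax(kmeans.cluster_centers_.sum(axis=1))

# Identify outliers based on cluster labels
outliers = [df[i] for i, label in enumerate(labels) if label == outlier_label]

# Print Data
print("Data: ", df)
print("Outliers: ", outliers)

# Remove outliers
df = [df[i] for i, label in enumerate(labels) if label != outlier_label]
print("Data without outliers: ", df)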

Data:  [[2, 2], [3, 3], [4, 4], [30, 30], [31, 31], [32, 32]]
Outliers:  [[30, 30], [31, 31], [32, 32]]
Data without outliers:  [[2, 2], [3, 3], [4, 4]]
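
A common variant, sketched here under stated assumptions, is to flag points that lie far from the centroid of their own cluster instead of treating a whole cluster as outliers. The data, the stray point `[15, 40]` and the `distance_threshold` value are all made up for illustration; only `KMeans`, `cluster_centers_` and `labels_` come from scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two tight groups plus one stray point, [15, 40]
points = np.array([[2, 2], [3, 3], [4, 4], [30, 30], [31, 31], [32, 32], [15, 40]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Distance from each point to the centroid of the cluster it was assigned to
centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(points - centroids, axis=1)

# Assumed cutoff for this toy data, not a general rule
distance_threshold = 10.0
is_outlier = distances > distance_threshold

print("Outliers:", points[is_outlier].tolist())
print("Data without outliers:", points[~is_outlier].tolist())
```

This approach does not depend on which label number a cluster receives, and it still works when the normal data forms more than one group. In practice the cutoff would have to be chosen from the distribution of distances, for example a high percentile.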
