# Outlier Detection Using the Z-score Technique

### 1. A z-score describes the position of a raw score in terms of its distance from the mean, measured in standard deviation units. The z-score is positive if the value lies above the mean and negative if it lies below the mean.

### 2. It is also known as a standard score, because standardizing the distribution allows scores on different kinds of variables to be compared.

### 3. The formula for calculating a z-score is z = (x - μ)/σ, where x is the raw score, μ is the population mean, and σ is the population standard deviation.

### 4. The z-score technique for outlier detection should be applied only to a column that is (approximately) normally distributed. ( https://www.simplypsychology.org/z-score.html )

### 5. In a normal distribution, most of the data points lie near the middle and only a few values lie in the tails on either side.

### 6. Note: Skewness is a measure of the asymmetry of a distribution. A distribution is asymmetrical when its left and right sides are not mirror images. A distribution can have right (or positive), left (or negative), or zero skewness. ( https://www.scribbr.com/statistics/skewness )
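
As a quick check of the formula in point 3, here is a minimal sketch that computes z-scores by hand with NumPy. The `scores` array is hypothetical and chosen only so the arithmetic is easy to follow; it is not taken from the placement dataset.

```python
import numpy as np

# Hypothetical raw scores, chosen only for illustration.
scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = scores.mean()     # population mean -> 5.0
sigma = scores.std()   # population standard deviation (NumPy default ddof=0) -> 2.0

# z = (x - mu) / sigma, applied element-wise
z = (scores - mu) / sigma

print(mu, sigma)
print(z)  # values above the mean are positive, values below are negative
```

The score 9.0 sits two standard deviations above the mean (z = 2.0), while 2.0 sits one and a half below it (z = -1.5).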



# The Empirical Rule
## For all normal distributions, about 68.3% of the observations fall within plus or minus one standard deviation of the mean, 95.4% within +/- two standard deviations, and 99.7% within +/- three standard deviations. This fact is sometimes referred to as the "empirical rule," a heuristic that describes where most of the data in a normal distribution will appear.

## This means that data falling outside +/- three standard deviations ("3-sigma") are rare occurrences and can be treated as outliers.
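
The percentages in the empirical rule can be reproduced from the standard normal CDF. The following is a small sketch using `scipy.stats.norm`; the variable name `coverage` is just an illustrative choice.

```python
from scipy.stats import norm

# Probability mass of a normal distribution that lies within
# k standard deviations of the mean, for k = 1, 2, 3.
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within +/-{k} sigma: {p:.2%}")
```

Roughly 0.3% of normally distributed values fall outside the 3-sigma band, which is why that band is a common cutoff for flagging outliers.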

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('dataset/placement.csv')


In [None]:
df.shape

In [None]:
#df.sample(5)  
df.head()

In [None]:
# Box plots are used to show distributions of numeric data values,
# especially when you want to compare them between multiple groups.
# They are built to provide high-level information at a glance,
# offering general information about a group of data's symmetry, 
# skew, variance, and outliers.
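plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
# Pass the column as x= explicitly so the box is drawn horizontally
# regardless of the seaborn version.
sns.boxplot(x=df['cgpa'])

plt.subplot(1,2,2)
sns.boxplot(x=df['placement_exam_marks'])

plt.show()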

In [None]:
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
# sns.distplot is deprecated; histplot with kde=True is the replacement.
sns.histplot(df['cgpa'], kde=True)
plt.subplot(1,2,2)
sns.histplot(df['placement_exam_marks'], kde=True)

plt.show()

In [None]:
df['cgpa'].skew()

In [None]:
df['placement_exam_marks'].skew()
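
To see how `.skew()` helps decide whether a column is a reasonable candidate for the z-score technique, here is a sketch on synthetic data. Both series are assumptions generated for illustration, not the placement columns: one is drawn from a normal distribution and one from an exponential (right-skewed) distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not the placement data).
symmetric = pd.Series(rng.normal(loc=7.0, scale=0.6, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=20.0, size=10_000))

skew_symmetric = symmetric.skew()   # expected to be close to 0
skew_right = right_skewed.skew()    # expected to be clearly positive

print("normal sample skew:     ", skew_symmetric)
print("exponential sample skew:", skew_right)
```

A skewness near zero suggests a roughly symmetric column where the 3-sigma cutoff is meaningful; a clearly positive or negative value suggests the z-score technique is a poor fit for that column.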

In [None]:
print("Mean value of cgpa",df['cgpa'].mean())
print("Std value of cgpa",df['cgpa'].std())
print("Min value of cgpa",df['cgpa'].min())
print("Max value of cgpa",df['cgpa'].max())

In [None]:
# Finding the boundary values
ub = df['cgpa'].mean() + 3*df['cgpa'].std()
lb = df['cgpa'].mean() - 3*df['cgpa'].std()
print("Highest allowed",ub)
print("Lowest allowed",lb)

In [None]:
# Finding the outliers (use the computed bounds instead of hard-coded values)
df[(df['cgpa'] > ub) | (df['cgpa'] < lb)]

## Trimming

In [None]:
# Trimming: keep only the rows inside the computed bounds

new_df = df[(df['cgpa'] <= ub) & (df['cgpa'] >= lb)]
new_df
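
The boundary-then-trim steps can be tried end to end without the CSV file. This sketch builds a synthetic `cgpa` column (an assumption: normal values plus two injected extremes) and applies the same mean +/- 3 standard deviations rule; the names `demo`, `lower`, `upper`, `outliers`, and `trimmed` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the 'cgpa' column (assumption), plus two injected extremes.
values = np.append(rng.normal(loc=7.0, scale=0.6, size=1000), [3.0, 11.0])
demo = pd.DataFrame({"cgpa": values})

# Boundary values: mean +/- 3 standard deviations.
mean, std = demo["cgpa"].mean(), demo["cgpa"].std()
lower, upper = mean - 3 * std, mean + 3 * std

# Rows outside the bounds are outliers; trimming keeps the rest.
outliers = demo[(demo["cgpa"] < lower) | (demo["cgpa"] > upper)]
trimmed = demo[(demo["cgpa"] >= lower) & (demo["cgpa"] <= upper)]

print("rows:", len(demo), "outliers:", len(outliers), "kept:", len(trimmed))
```

Trimming shrinks the dataset, so the number of rows kept plus the number of outliers always equals the original row count.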

In [None]:
# Approach 2

# Calculating the z-score
# Note: pandas .std() uses the sample standard deviation (ddof=1) by default.

df['cgpa_zscore'] = (df['cgpa'] - df['cgpa'].mean())/df['cgpa'].std()

In [None]:
df.head()

In [None]:
df[df['cgpa_zscore'] > 3]

In [None]:
df[df['cgpa_zscore'] < -3]

In [None]:
df[(df['cgpa_zscore'] > 3) | (df['cgpa_zscore'] < -3)]

In [None]:
# Trimming 
new_df = df[(df['cgpa_zscore'] < 3) & (df['cgpa_zscore'] > -3)]

In [None]:
new_df
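
Filtering on the z-score column and filtering on the mean +/- 3 standard deviations bounds are two ways of expressing the same rule. Here is a sketch on synthetic data (an assumption, not the placement data) that checks the two masks select the same rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic 'cgpa' values (assumption) with two injected extremes.
demo = pd.DataFrame({"cgpa": np.append(rng.normal(7.0, 0.6, 500), [3.0, 11.0])})

mean, std = demo["cgpa"].mean(), demo["cgpa"].std()

# Approach 1: compare raw values with mean +/- 3 * std.
mask_bounds = (demo["cgpa"] > mean + 3 * std) | (demo["cgpa"] < mean - 3 * std)

# Approach 2: compare the absolute z-score with 3.
z = (demo["cgpa"] - mean) / std
mask_z = z.abs() > 3

print("same rows flagged:", bool((mask_bounds == mask_z).all()))
print("number of outliers:", int(mask_z.sum()))
```

The z-score version has the practical advantage that the threshold (3) does not depend on the column's units, so the same filter can be reused across columns.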

## Capping

In [None]:
upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()

In [None]:
lower_limit

In [None]:
upper_limit

In [None]:
df['cgpa'] = np.where(
    df['cgpa']>upper_limit,
    upper_limit,
    np.where(
        df['cgpa']<lower_limit,
        lower_limit,
        df['cgpa']
    )
)
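
The nested `np.where` call above can also be written with `Series.clip`, which caps values at a lower and an upper limit in one step. This sketch uses a small hypothetical series and hypothetical limits to show that both forms give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical values and limits, chosen only for illustration.
s = pd.Series([4.0, 5.5, 7.0, 8.5, 9.9])
lower_limit, upper_limit = 5.0, 9.0

# Capping with nested np.where, as in the cell above.
capped_where = np.where(
    s > upper_limit,
    upper_limit,
    np.where(s < lower_limit, lower_limit, s),
)

# Equivalent capping with Series.clip.
capped_clip = s.clip(lower=lower_limit, upper=upper_limit)

print(capped_clip.tolist())
```

Unlike trimming, capping keeps every row: values beyond a limit are replaced by that limit, so the shape of the DataFrame does not change.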

In [None]:
df.shape

In [None]:
df['cgpa'].describe()

In [None]:
df

## Z-score using scipy.stats.zscore

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as sp

In [None]:
df = pd.read_csv('dataset/placement.csv')

In [None]:
df.head()

In [None]:
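# scipy.stats.zscore uses the population standard deviation (ddof=0) by default,
# so values differ slightly from the manual pandas calculation (ddof=1).
df['zscore'] = sp.zscore(df['cgpa'])
df['zscore']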

In [None]:
# Rows whose absolute z-score exceeds 3
abs_zscore = np.abs(sp.zscore(df['cgpa']))
df[abs_zscore > 3]
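
One detail worth checking when comparing the manual calculation with `scipy.stats.zscore`: pandas `.std()` defaults to `ddof=1` (sample standard deviation), while `scipy.stats.zscore` defaults to `ddof=0` (population standard deviation). The sketch below uses a small hypothetical series to show the difference and how passing `ddof=1` to SciPy makes the two agree.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical values, chosen only for illustration.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_pandas = (s - s.mean()) / s.std()                  # pandas std: ddof=1 by default
z_scipy = stats.zscore(s.to_numpy())                 # scipy zscore: ddof=0 by default
z_scipy_ddof1 = stats.zscore(s.to_numpy(), ddof=1)   # matches the pandas version

print("default settings agree:", np.allclose(z_pandas, z_scipy))
print("agree with ddof=1:     ", np.allclose(z_pandas, z_scipy_ddof1))
```

On a large column the two conventions give almost identical z-scores, but on small samples the gap can be enough to move a value across the threshold of 3.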