Outliers are extreme values that can distort data analysis and negatively affect machine learning models. Identifying and addressing them is crucial in data preprocessing.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('sample_dataset_1000.csv')

Handling duplicates

In [4]:
df_cleaned = df.drop_duplicates()

df_cleaned.head()

Unnamed: 0,ID,Name,Age,Gender,Salary,Joining Date,Is Active,Comments
0,1,Paul,25,Female,90041.86,2017-01-31,False,Creative
1,2,Paul,31,Male,90285.02,2019-05-29,False,Needs Improvement
2,3,Rachel,30,Male,84321.65,2022-12-14,False,Hardworking
3,4,Bob,45,Male,68189.99,2023-09-05,True,Fast Learner
4,5,Eve,34,Male,35428.35,2018-05-19,True,Dedicated
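As a minimal sketch of what `drop_duplicates` does, the toy frame below (hypothetical data, not taken from sample_dataset_1000.csv) shows the default behavior, where a row counts as a duplicate only if every column matches an earlier row. It also shows the `subset` and `keep` parameters for comparing on chosen columns only.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Name": ["Ann", "Ben", "Ben", "Cy"],
    "Age": [30, 41, 41, 25],
})

# Count fully duplicated rows before dropping them.
n_dupes = int(toy.duplicated().sum())

# Default: drop rows that repeat an earlier row in every column.
deduped = toy.drop_duplicates()

# Compare on the ID column only and keep the last occurrence of each ID.
deduped_by_id = toy.drop_duplicates(subset=["ID"], keep="last")

print(n_dupes, len(deduped), list(deduped_by_id.index))
```

Checking `duplicated().sum()` first is a cheap way to see whether the step removes anything at all.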


Handling missing values

In [5]:
# Fill missing ages with the column mean and assign the result back to the column.
# Do not combine inplace=True with assignment: in pandas 2.x, fillna(..., inplace=True)
# returns None, so the assignment would overwrite the whole 'Age' column with None.
df_cleaned['Age'] = df_cleaned['Age'].fillna(df_cleaned['Age'].mean())
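The two patterns that pandas recommends for filling a single column can be sketched on a small frame. The data and the names `toy`, `filled_a`, and `filled_b` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Salary": [50.0, 60.0, np.nan],
})

# Pattern 1: assign the Series returned by fillna back to the column.
filled_a = toy.copy()
filled_a["Age"] = filled_a["Age"].fillna(filled_a["Age"].mean())

# Pattern 2: pass a {column: value} mapping to DataFrame.fillna.
# Only the listed column is filled; 'Salary' keeps its missing value.
filled_b = toy.fillna({"Age": toy["Age"].mean()})

print(filled_a["Age"].tolist())
print(filled_b.isna().sum().to_dict())
```

Both patterns return new data rather than relying on `inplace=True`, which avoids the chained-assignment problem entirely.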


**Normalize and Scale the Data**

In data preprocessing, normalization and scaling put numerical features on comparable scales. This can improve the performance of machine learning models, particularly those that rely on distance metrics (such as k-nearest neighbors) or on gradient-based optimization. Neither technique makes a feature normally distributed; both only shift and rescale its values.

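**Normalization** transforms each feature to fall within a fixed range, usually 0 to 1, by subtracting the column minimum and dividing by the column range (maximum minus minimum). This keeps features with large raw values, such as Salary, from dominating features with small ones, such as Age. We’ll use MinMaxScaler to perform this transformation on our data.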
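To make the transformation concrete, here is a small sketch that applies the min-max formula by hand and compares it with MinMaxScaler. The `ages` array is made-up illustration data, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data, shaped (n_samples, n_features).
ages = np.array([[20.0], [30.0], [40.0], [60.0]])

# Min-max formula: (x - min) / (max - min), computed per column.
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

# MinMaxScaler with its default feature_range of (0, 1).
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses every other value toward one end of the range.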

**Scaling** is the broader term for adjusting the values of numeric columns so that they fit a particular range or distribution. A common method is standardization, which subtracts the column mean and divides by the column standard deviation so that the result has a mean of 0 and a standard deviation of 1 (this is what StandardScaler does). Scaling into a fixed range such as 0 to 1 is the min-max approach, done with MinMaxScaler.
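Standardization can be sketched the same way: compute it by hand and compare with StandardScaler. The `ages` array is again hypothetical. Note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, shaped (n_samples, n_features).
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

# Standardization formula: (x - mean) / std, computed per column.
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```

Unlike min-max scaling, the result has no fixed bounds, so unusually large or small values remain visible as large positive or negative numbers.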


Min-max scaling

In [7]:
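from sklearn.preprocessing import MinMaxScaler

# Rescale Age and Salary to the default range [0, 1].
# Note: this operates on df, not df_cleaned, so the de-duplication and
# imputation steps above are not carried forward into the scaled data.
scaler = MinMaxScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

df[['Age', 'Salary']].head()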

Unnamed: 0,Age,Salary
0,0.125,0.66793
1,0.275,0.670636
2,0.25,0.604271
3,0.625,0.424746
4,0.35,0.06015


Standardization (StandardScaler)

In [8]:
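from sklearn.preprocessing import StandardScaler

# Standardize Age and Salary to mean 0 and standard deviation 1.
# These columns already hold the min-max scaled values from the previous cell.
# Standardization is unaffected by that linear rescaling, so the result matches
# what standardizing the raw columns would give.
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
df[['Age', 'Salary']].head()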

Unnamed: 0,Age,Salary
0,-1.353353,0.580768
1,-0.843392,0.590183
2,-0.928385,0.359284
3,0.346519,-0.265329
4,-0.588411,-1.53385


**The Z-score** method finds outliers by measuring how far each value lies from the mean, expressed in standard deviations: z = (value - mean) / standard deviation.

A value whose absolute Z-score is greater than 3 is commonly treated as an outlier, whether it lies far above or far below the mean. The threshold of 3 is a convention rather than a fixed rule.
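A small worked example may help here. The sketch below computes Z-scores by hand with NumPy on made-up values (20 ordinary values plus one extreme value) and applies the threshold of 3. It uses the population standard deviation, which is also the default (ddof=0) in `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical data: 20 values near 50 and one extreme value.
values = np.array([48.0, 50.0, 52.0, 49.0, 51.0] * 4 + [200.0])

# Z-score: distance from the mean in units of standard deviation.
z = (values - values.mean()) / values.std()

# Use the absolute value so both very high and very low values are flagged.
outlier_mask = np.abs(z) > 3

print(values[outlier_mask])
```

Only the extreme value is flagged. With very small samples a single value cannot reach a Z-score of 3 at all, so the method is better suited to larger datasets.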

In [11]:
from scipy.stats import zscore
import numpy as np

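# Absolute Z-score of every value, computed per column.
# zscore propagates NaN by default, so a column that contains missing
# values yields all-NaN scores and flags no rows.
z_score = np.abs(zscore(df[['Age', 'Salary']]))

# Flag rows where any column has an absolute Z-score > 3 (outliers)
outliers = (z_score > 3).any(axis=1)

# Keep only the rows that were not flagged
df_no_outliers = df[~outliers]
df_no_outliers.head()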

Unnamed: 0,ID,Name,Age,Gender,Salary,Joining Date,Is Active,Comments
0,1,Paul,-1.353353,Female,0.580768,2017-01-31,False,Creative
1,2,Paul,-0.843392,Male,0.590183,2019-05-29,False,Needs Improvement
2,3,Rachel,-0.928385,Male,0.359284,2022-12-14,False,Hardworking
3,4,Bob,0.346519,Male,-0.265329,2023-09-05,True,Fast Learner
4,5,Eve,-0.588411,Male,-1.53385,2018-05-19,True,Dedicated
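The filtering step can also be wrapped in a small helper that reports how many rows it removed. This is a sketch under stated assumptions: `drop_zscore_outliers` is a hypothetical name, the toy data is made up, and the Z-scores are computed with pandas instead of `scipy.stats.zscore`, so that missing values are skipped rather than propagated.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(frame, columns, threshold=3.0):
    """Hypothetical helper: drop rows with |z| > threshold in any listed column."""
    values = frame[columns]
    # pandas skips NaN when computing mean and std; ddof=0 matches scipy's default.
    z = (values - values.mean()) / values.std(ddof=0)
    # A NaN score compares as False, so rows with missing values are kept.
    flagged = (z.abs() > threshold).any(axis=1)
    return frame.loc[~flagged], int(flagged.sum())

# Hypothetical toy data: some missing ages and one extreme salary.
toy = pd.DataFrame({
    "Age": [30.0, 31.0, 29.0, 32.0, np.nan] * 4 + [30.0],
    "Salary": [48.0, 50.0, 52.0, 49.0, 51.0] * 4 + [200.0],
})

kept, n_removed = drop_zscore_outliers(toy, ["Age", "Salary"])
print(len(toy), len(kept), n_removed)
```

Returning the count alongside the filtered frame makes it easy to confirm that the threshold removes a plausible number of rows.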
