2.1 Introduction to Outliers
Introduction:
Outliers are data points that significantly differ from other observations in a dataset. They may occur due to variability in the data, measurement errors, or special cases that represent significant deviations from the norm. Properly identifying and handling outliers is crucial for accurate data analysis and modeling, as they can skew results and impact the performance of statistical models.

Importance of Handling Outliers:

- Impact on Statistical Measures: Outliers can pull the mean and inflate the variance and other statistics that are sensitive to extreme values, leading to misleading conclusions.
- Model Performance: Outliers can distort the training of machine learning models, leading to poor performance or inaccurate predictions.
- Data Integrity: Handling outliers appropriately helps the dataset reflect true patterns and trends, which makes analyses and results more reliable.
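
To make the effect on summary statistics concrete, here is a minimal sketch that compares the mean and median of a small sample before and after one extreme reading is added. The values are made up for illustration and do not come from any dataset in this chapter.

```python
import numpy as np

# Hypothetical temperature-like readings (illustrative values only)
clean = np.array([22.0, 24.0, 25.0, 26.0, 28.0])
with_outlier = np.append(clean, 100.0)  # add one extreme reading

mean_clean, median_clean = clean.mean(), np.median(clean)
mean_out, median_out = with_outlier.mean(), np.median(with_outlier)

print(f"Mean:   {mean_clean:.2f} -> {mean_out:.2f}")
print(f"Median: {median_clean:.2f} -> {median_out:.2f}")
```

A single extreme reading moves the mean from 25.0 to 37.5, while the median only moves from 25.0 to 25.5. This is why the mean and statistics built on it are described as sensitive to outliers.
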

Types of Outliers:

- Univariate Outliers: Deviations in a single variable. Example: A temperature reading of 100°C in a dataset where most temperatures are between 20°C and 30°C.
- Multivariate Outliers: Deviations that only appear when several variables are considered together. Example: A combination of age and income that is rare for the population, even if each value looks plausible on its own.
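
The difference between the two types can be sketched with a small example. The code below uses the Mahalanobis distance, a technique this chapter does not otherwise cover, as one possible way to score how unusual a row is given the correlation between variables. The age and income values are hypothetical.

```python
import numpy as np

# Hypothetical (age, income in thousands) pairs; income tends to rise with age.
data = np.array([
    [25, 30], [30, 42], [35, 48], [40, 61],
    [45, 69], [50, 82], [55, 90],
    [27, 88],  # last row: young age paired with a high income
], dtype=float)

diff = data - data.mean(axis=0)

# Per-variable Z-scores: how unusual each value is on its own
z = diff / data.std(axis=0, ddof=1)

# Mahalanobis distance: how unusual each row is given the age-income correlation
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
distances = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

for row, z_row, d in zip(data, z, distances):
    print(row, np.round(z_row, 2), round(float(d), 2))
```

In this sketch the last row has modest Z-scores for age and income taken separately, yet it has the largest Mahalanobis distance, because the pairing runs against the trend in the other rows.
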

Common Causes of Outliers:

- Measurement Errors: Faulty instruments or flawed data collection procedures.
- Data Entry Errors: Typographical errors or incorrect data input.
- Variability in the Data: Natural variations in data, especially in large datasets.
- Special Cases: Genuine cases that are distinct from the norm but valid. For example, a high-value transaction in a financial dataset might be an outlier but not an error.

Approaches to Handling Outliers:

- Identification: Use statistical methods, visualizations, and domain knowledge to identify outliers.
- Handling: Depending on the context, outliers can be removed, transformed, or capped to reduce their impact on analyses and models.


2.2 Definition
Outliers: Outliers are data points that significantly differ from the majority of the data. They can arise from variability in the data or may indicate measurement errors. Proper handling of outliers is crucial as they can skew statistical analyses and models.

Goals
- Identify Outliers: Detect outliers using various statistical methods and visualization techniques.
- Handle Outliers: Decide on methods to manage outliers, including removal, transformation, or capping.

Description
Handling outliers involves identifying data points that deviate significantly from the rest of the data and deciding on appropriate actions to manage them. Common techniques include statistical methods (Z-score, IQR), visual methods (box plots), and more advanced methods (robust statistics).
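
As one illustration of the robust-statistics idea, the sketch below scores points with a modified Z-score built from the median and the median absolute deviation (MAD) instead of the mean and standard deviation. The function name `modified_z_scores`, the sample values, and the 3.5 cutoff are assumptions for this example. The 0.6745 constant and the 3.5 cutoff follow a commonly cited convention, not anything prescribed by this chapter.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # More than half the values equal the median; the score is undefined
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    return 0.6745 * (values - median) / mad

# Hypothetical sample with one extreme value
sample = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])
scores = modified_z_scores(sample)

# Flag points whose modified Z-score exceeds the assumed cutoff of 3.5
flagged = sample[np.abs(scores) > 3.5]
print(flagged)
```

Because the median and MAD barely move when one extreme value is present, the extreme value cannot hide itself by inflating the scale estimate, which is the main weakness of the mean-and-standard-deviation approach on small samples.
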

2.3 Techniques

1. Z-Score Method
   - Description: Calculates the Z-score for each data point to determine how many standard deviations away from the mean a data point is. Data points with Z-scores beyond a certain threshold (e.g., ±3) are considered outliers.
   - Example: (df['column'] - df['column'].mean()) / df['column'].std()

2. Interquartile Range (IQR) Method
   - Description: Identifies outliers based on the interquartile range (IQR = Q3 - Q1), where Q1 is the 25th percentile and Q3 is the 75th percentile. Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers.
   - Example: df[(df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR))]

3. Box Plot Visualization
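   - Description: Uses box plots to visually identify outliers as points outside the "whiskers" of the plot. By default the whiskers extend to the most extreme points within 1.5 * IQR of the quartiles, so this is a visual form of the IQR method.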
   - Example: sns.boxplot(x=df['column'])

4. Capping (Winsorization)
   - Description: Limits extreme values to a specified percentile range to reduce the impact of outliers.
   - Example: df['column'] = df['column'].clip(lower=lower_limit, upper=upper_limit)

5. Transformation
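   - Description: Applies transformations such as the logarithm or square root to compress large values and reduce the effect of outliers. These suit non-negative, right-skewed data; np.log1p computes log(1 + x), so it handles zeros.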
   - Example: df['column'] = np.log1p(df['column'])
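
The one-line examples above leave Q1, Q3, IQR, and the capping limits undefined. The following sketch puts the IQR, capping, and transformation steps together on a small hypothetical column. The sample values and the new column names `capped` and `log` are assumptions for illustration. The capping limits here reuse the IQR fences; percentile-based limits from `quantile` could be used instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: mostly between 20 and 30, with one extreme value
df = pd.DataFrame({"column": [21.0, 23.0, 24.0, 25.0, 26.0, 27.0, 29.0, 100.0]})

# IQR method: compute the quartiles and the fences
Q1 = df["column"].quantile(0.25)
Q3 = df["column"].quantile(0.75)
IQR = Q3 - Q1
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

# Rows outside the fences are flagged as outliers
outliers = df[(df["column"] < lower_limit) | (df["column"] > upper_limit)]
print(outliers)

# Capping: pull values outside the fences back to the nearest fence
df["capped"] = df["column"].clip(lower=lower_limit, upper=upper_limit)

# Transformation: log(1 + x) compresses large values (input must be > -1)
df["log"] = np.log1p(df["column"])

print(df)
```

Capping keeps every row but changes the extreme value, while the log transformation keeps the ordering of all values and only shrinks the gaps between them. Which one is appropriate depends on whether the extreme value is believed to be an error or a genuine case.
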


2.3.1 Introduction to Z-Score
The Z-score, also known as the standard score, describes how far a value lies from the mean of a group of values, measured in standard deviations. It can be positive or negative: a positive score means the value is above the mean, and a negative score means it is below the mean.

Using Z-scores to detect outliers involves calculating the Z-score for each data point and identifying those that lie beyond a certain threshold, commonly ±3 standard deviations from the mean. These data points are considered outliers as they deviate significantly from the rest of the dataset.
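
One caveat is worth checking before relying on the ±3 threshold: when the standard deviation is computed from the same small sample, no point can have a Z-score larger than (n - 1) / sqrt(n) in absolute value. The sketch below applies this to the seven non-missing Price values that appear in this section's Products.csv output. The variable names are illustrative.

```python
import numpy as np

# The seven non-missing Price values from this section's Products.csv output
prices = np.array([19.99, 29.99, 15.00, 9.99, 25.00, 39.99, 49.99])

n = len(prices)

# ddof=1 gives the sample standard deviation, matching the pandas .std() default
z_scores = (prices - prices.mean()) / prices.std(ddof=1)

# Largest |Z| any single point can reach in a sample of size n (with ddof=1)
max_possible_z = (n - 1) / np.sqrt(n)

print(f"Largest |Z| observed: {np.abs(z_scores).max():.2f}")
print(f"Largest |Z| possible for n={n}: {max_possible_z:.2f}")
```

For n = 7 the bound is about 2.27, so a threshold of 3 cannot flag any point in a sample of this size, whatever the values are. With small datasets, a lower threshold or a different method such as the IQR rule may be more informative.
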

import pandas as pd
import numpy as np

# Read the data from the specified location
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter02 Handling Outliers/Products.csv')

# Display the initial DataFrame
print("Initial DataFrame:")
print(df.to_string(index=False))

# Calculate Z-scores for the 'Price' column
mean_price = df['Price'].mean()
std_price = df['Price'].std()
df['Price_Z_Score'] = (df['Price'] - mean_price) / std_price

# Identify outliers
df_outliers_zscore = df[df['Price_Z_Score'].abs() > 3]

# Display the DataFrame with identified outliers
print("\nDataFrame with Outliers Identified by Z-Score:")
print(df_outliers_zscore.to_string(index=False))


Initial DataFrame:
 Product ID Product Name  Price    Category  Stock              Description
          1     Widget A  19.99 Electronics  100.0    A high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN
          3          NaN  15.00  Home Goods   50.0      Durable and stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient
          6     Widget F  25.00 Electronics    0.0 Latest technology widget
          7     Widget G    NaN     Kitchen  150.0     Multi-purpose widget
          8     Widget H  39.99     Kitchen   75.0          Premium quality
          9     Widget I    NaN Electronics    NaN        Advanced features
         10     Widget J  49.99 Electronics   60.0            Best in class

DataFrame with Outliers Identified by Z-Score:
Empty DataFrame
Columns: [Product ID, Product Name, Price, Category, Stock, Description, Price_Z_