### Z-score method

The z-score method is a simple way to identify and exclude data points that lie unusually far from the mean of a dataset, with distance measured in standard deviations. Here's a straightforward explanation of the method:

1. **Calculate the Mean and Standard Deviation**: First, find the mean (average) and standard deviation of your dataset. These values describe the central tendency and spread of your data, respectively.

2. **Compute the Z-Score for Each Data Point**: For each data point in your dataset, calculate its z-score using the formula:
   
   **Z-Score = (Data Point - Mean) / Standard Deviation**

   This formula tells you how many standard deviations a data point is away from the mean. A high absolute z-score indicates that a data point is far from the average, which suggests it might be an outlier.

3. **Set a Threshold**: Decide on a threshold value; a z-score of 2 or 3 is common. This threshold represents how far from the mean, in either direction, a data point can be before being considered an outlier. An absolute z-score greater than the threshold suggests that a data point is an outlier.

4. **Identify Outliers**: Go through your dataset and identify data points whose absolute z-scores are greater than the chosen threshold. These are your outliers.

5. **Remove or Handle Outliers**: You can choose to either remove these outliers from your dataset or handle them differently, depending on your analysis. Common approaches include removing them, transforming them, or imputing them with more typical values.

In essence, the z-score transformation helps you standardize your data by expressing each data point's deviation from the mean in terms of standard deviations. This makes it easier to spot data points that stand out as potential outliers, helping you make more informed decisions about how to handle them in your analysis.
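
The five steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function names `zscores` and `split_outliers` and the sample `readings` are made up for this example, and it assumes the population standard deviation (`statistics.pstdev`); swap in `statistics.stdev` if the sample standard deviation is preferred.

```python
from statistics import mean, pstdev


def zscores(data):
    """Return the z-score of every value in data."""
    mu = mean(data)
    sigma = pstdev(data)  # assumption: population SD; statistics.stdev gives the sample SD
    if sigma == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in data]
    return [(x - mu) / sigma for x in data]


def split_outliers(data, threshold=3.0):
    """Split data into (kept, outliers) using |z| > threshold."""
    scores = zscores(data)
    kept = [x for x, z in zip(data, scores) if abs(z) <= threshold]
    outliers = [x for x, z in zip(data, scores) if abs(z) > threshold]
    return kept, outliers


# Hypothetical sample: nine ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
kept, outliers = split_outliers(readings, threshold=2.0)
print(outliers)
```

On this ten-value sample the reading 95 has a z-score of about 2.998, so a threshold of 3 would not flag it. The extreme value inflates the very standard deviation it is measured against, which is why the example uses a threshold of 2.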

## Iterative z-score approach for removing more outliers

The z-score iterative method is an approach to remove outliers by repeatedly calculating z-scores and removing data points with z-scores exceeding a certain threshold until no more outliers are detected. Here's a step-by-step explanation of this method:

1. **Calculate the Mean and Standard Deviation**: Begin by calculating the mean (average) and standard deviation of your dataset. These values are used to compute the z-scores.

2. **Compute Initial Z-Scores**: Calculate the z-score for each data point in your dataset using the formula:
   
   **Z-Score = (Data Point - Mean) / Standard Deviation**

3. **Set a Threshold**: Decide on a threshold value; a z-score of 2 or 3 is typical. This threshold determines how far from the mean a data point can be before it's considered an outlier.

4. **Identify Outliers**: Go through your dataset and identify data points whose absolute z-scores are greater than the chosen threshold. These are your initial outliers.

5. **Remove Initial Outliers**: Remove these initial outliers from your dataset.

6. **Recalculate Mean and Standard Deviation**: After removing outliers, calculate the new mean and standard deviation for your updated dataset.

7. **Repeat Steps 2, 4, 5, and 6**: Recompute the z-scores of the remaining data points using the updated mean and standard deviation, identify the points whose absolute z-scores exceed the same threshold, and remove them.

8. **Stop When No Outliers Remain**: End the iterations when all remaining data points have z-scores within the threshold or when you reach a predefined maximum number of iterations. Be cautious not to overdo it: each pass typically shrinks the standard deviation, so points that were unremarkable in the original data can end up flagged, which leads to data loss and bias.

This iterative approach is useful when you want to be more aggressive in removing outliers or when your dataset contains multiple layers of outliers. A very extreme value inflates the standard deviation, which can hide milder outliers during the first pass; once it is removed, the recalculated mean and standard deviation expose them. However, it's important to use this method judiciously and monitor how much data each pass removes, as excessive outlier removal can distort your dataset.
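
The loop described above might look like the following sketch. The function name `iterative_zscore_filter`, the `max_iter` safeguard, and the sample `values` are assumptions made for illustration, and the population standard deviation is used as a simplifying choice.

```python
from statistics import mean, pstdev


def iterative_zscore_filter(data, threshold=3.0, max_iter=10):
    """Repeatedly drop values with |z| > threshold.

    Returns (kept, removed, passes), where passes counts the passes
    that removed at least one value.
    """
    kept = list(data)
    removed = []
    passes = 0
    for _ in range(max_iter):  # hard cap so the loop cannot strip the data indefinitely
        if len(kept) < 2:
            break
        mu = mean(kept)
        sigma = pstdev(kept)
        if sigma == 0:
            break  # all remaining values are identical
        flagged = [x for x in kept if abs(x - mu) / sigma > threshold]
        if not flagged:
            break  # no outliers left under the current statistics
        removed.extend(flagged)
        kept = [x for x in kept if abs(x - mu) / sigma <= threshold]
        passes += 1
    return kept, removed, passes


# Hypothetical sample with two "layers" of outliers: 200 and 30.
values = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 11, 30, 200]
kept, removed, passes = iterative_zscore_filter(values, threshold=3.0)
print(removed, passes)
```

In this sample the first pass removes only 200, because it inflates the standard deviation enough that 30 has a z-score below 0.1. After recalculating, 30 stands out and is removed in the second pass, and the third pass finds nothing to remove.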

### Drawbacks

While the z-score iterative method can be effective in identifying and removing outliers, it also has some drawbacks and limitations to consider:

1. **Loss of Data**: One of the most significant drawbacks is the potential loss of data. Repeatedly removing outliers can result in the removal of valid data points that are not truly outliers. This can lead to a loss of information and potentially bias your analysis.

2. **Sensitivity to Threshold Selection**: The effectiveness of the method depends on choosing an appropriate z-score threshold. Selecting a threshold that is too aggressive may remove data points that are not outliers, while a threshold that is too lenient may fail to identify important outliers.

3. **Iterative Nature**: The iterative process can be computationally expensive and time-consuming, especially for large datasets. It may require multiple passes through the data, making it less efficient than some other outlier detection methods.

4. **Potential Overfitting**: If you iteratively remove outliers until none are left, you risk overfitting your data to a specific threshold and potentially eliminating valuable information that was initially considered an outlier but is relevant to your analysis.

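5. **Assumption of Normality**: The usual thresholds of 2 or 3 are calibrated to a roughly normal distribution, where about 95% of values fall within 2 standard deviations of the mean and about 99.7% within 3. If your data is skewed or heavy-tailed, those proportions no longer hold, and the mean and standard deviation are themselves pulled by the extreme values, so the method may flag too many or too few points.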

6. **Ignoring Context**: The z-score method does not take into account the context or domain-specific knowledge. Some data points may be outliers for a good reason, and removing them without considering the context can lead to erroneous conclusions.

7. **Impact on Statistical Tests**: If you plan to perform statistical tests or modeling on the cleaned data, removing outliers can affect the assumptions of these tests and the validity of your results.

8. **Data Transformation**: Repeatedly removing outliers can alter the distribution and shape of your data, potentially requiring additional data transformations to make it suitable for analysis.

To mitigate these drawbacks, it's important to use the z-score iterative method cautiously and in conjunction with domain knowledge. Consider alternative outlier detection techniques, such as the IQR method or visual inspection of data plots, to complement the z-score method. Additionally, when removing outliers, document your approach and the reasons behind each removal to maintain transparency and rigor in your data analysis.
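
As one possible complement, the IQR method mentioned above can be sketched as follows. It assumes the common 1.5 × IQR rule for the fences, which is a convention and not something this section prescribes, and the function name `iqr_outliers` is hypothetical. Quartile definitions differ between tools, so the exact fences may vary slightly.

```python
from statistics import quantiles


def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points
    # (Python's default "exclusive" method is assumed here).
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample: nine ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
print(iqr_outliers(readings))
```

Because quartiles barely move when a single extreme value is added, this check does not depend on the mean or standard deviation that the outlier itself distorts.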

### Modified z-score method for non-normal distributions

The modified z-score method is a variation of the traditional z-score method that can be used for detecting outliers in datasets that are not normally distributed or have heavy-tailed distributions. Because it is built on the median instead of the mean, it is less sensitive to extreme values than the standard z-score method. Here's how it works:

1. **Calculate the Median and Median Absolute Deviation (MAD)**:
   - Calculate the median (M) of your dataset. The median is the middle value when the data is sorted, and it is less sensitive to outliers than the mean.
   - Calculate the median absolute deviation (MAD) of your dataset. MAD is the median of the absolute differences between each data point and the median (M). Mathematically, MAD = Median(|Data Point - M|).

2. **Calculate the Modified Z-Score for Each Data Point**:
   - Calculate the modified Z-score (Z) for each data point using the formula:
   
     **Z = 0.6745 * (Data Point - M) / MAD**

   The factor 0.6745 is used to make Z approximately equivalent to a standard Z-score for normally distributed data. This scaling factor ensures that Z has a similar interpretation to a standard Z-score.

3. **Set a Threshold**: Decide on a threshold value, typically around 2 or 3, similar to the standard Z-score method; a commonly cited recommendation (Iglewicz and Hoaglin) is 3.5. Data points whose absolute modified Z-scores are greater than this threshold are considered outliers.

4. **Identify Outliers**: Go through your dataset and identify data points whose absolute modified Z-scores are greater than the chosen threshold. These are your outliers.

5. **Remove or Handle Outliers**: As with the standard Z-score method, you can choose to either remove these outliers from your dataset or handle them differently based on your analysis needs.
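
The scaling factor in step 2 can be checked numerically. For normally distributed data, half of the values lie within about 0.6745 standard deviations of the median, so the MAD is roughly 0.6745 times the standard deviation, and multiplying by 0.6745 puts the modified Z-score on about the same scale as a standard Z-score. A short sketch using only the Python standard library (the variable name `q75` is illustrative):

```python
from statistics import NormalDist

# 75th percentile of the standard normal distribution:
# half of all values fall between -q75 and +q75.
q75 = NormalDist().inv_cdf(0.75)
print(round(q75, 4))  # 0.6745
```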

The key difference between the modified Z-score method and the standard Z-score method is the use of the median and MAD instead of the mean and standard deviation. Because extreme values have little influence on the median and MAD, the modified Z-score method is robust against outliers and works well with datasets that have non-normal or heavy-tailed distributions.

Remember that the choice of the threshold value is important and should be determined based on your specific dataset and analysis goals. A lower threshold is more aggressive and flags more data points as outliers, while a higher threshold is more conservative and flags fewer. It's often a good practice to visually inspect your data and consider domain knowledge when selecting an appropriate threshold.
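
The modified Z-score steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function names `modified_zscores` and `modified_zscore_outliers` and the sample `readings` are hypothetical, and the threshold of 3 is just one of the typical values mentioned above.

```python
from statistics import median


def modified_zscores(data):
    """Return 0.6745 * (x - median) / MAD for every value in data."""
    m = median(data)
    mad = median([abs(x - m) for x in data])
    if mad == 0:
        # Happens when more than half of the values are identical.
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - m) / mad for x in data]


def modified_zscore_outliers(data, threshold=3.0):
    """Return the values whose absolute modified z-score exceeds threshold."""
    scores = modified_zscores(data)
    return [x for x, z in zip(data, scores) if abs(z) > threshold]


# Hypothetical sample: nine ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
outliers = modified_zscore_outliers(readings, threshold=3.0)
print(outliers)
```

Here the reading 95 gets a modified Z-score above 100, while the largest absolute score among the other readings is about 2.02, so it is the only value flagged at a threshold of 3.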