### **1. Log Transformation**
#### **Description**:
- **Purpose**: To reduce skewness in data, especially for variables with a long right tail (e.g., income, prices).
- **How it works**: Takes the logarithm of data values, which compresses larger values more than smaller values.
- **Formula**:
  $
  x_{\text{log}} = \log(x + 1)
  $
  Adding `1` keeps the transform defined at `x = 0`, where a plain logarithm is undefined. Values must still satisfy `x > -1`.

#### **What it Does**:
- Converts multiplicative relationships into additive ones.
- Reduces the influence of very large values by pulling them much closer to the rest of the data.
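
The multiplicative-to-additive property can be checked numerically. The sketch below uses a made-up sample in which each value is 10 times the previous one; the variable names are illustrative only. Note that the property is exact for a plain `log` and only approximate for `log(x + 1)` when `x` is small.

```python
import numpy as np

# Hypothetical sample: each value is 10x the previous one (multiplicative steps).
x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# With a plain log, a constant ratio becomes a constant difference.
log_x = np.log(x)
steps = np.diff(log_x)  # every step equals log(10)

# log1p computes log(x + 1), which is defined at x = 0; expm1 inverts it.
with_zero = np.array([0.0, 1.0, 99.0])
transformed = np.log1p(with_zero)
recovered = np.expm1(transformed)

print(steps)
print(transformed)
print(recovered)
```

Because `np.expm1` undoes `np.log1p`, values predicted on the log scale can be mapped back to the original units.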

In [1]:
# Before transformation:
import pandas as pd
data = pd.DataFrame([1, 10, 100, 1000, 10000])
data.describe()

Unnamed: 0,0
count,5.0
mean,2222.2
std,4368.044093
min,1.0
25%,10.0
50%,100.0
75%,1000.0
max,10000.0


In [2]:
# After log transformation:
import numpy as np
log_transformed = np.log1p(data)  # log(x+1)
log_transformed.describe()

Unnamed: 0,0
count,5.0
mean,4.765072
std,3.411864
min,0.693147
25%,2.397895
50%,4.615121
75%,6.908755
max,9.21044


#### **Use Case**:
- Useful for skewed features like income (`$20,000`, `$100,000`, `$1,000,000`) or sales volume.

---

### **2. Square Root Transformation**
#### **Description**:
- **Purpose**: Similar to log transformation, but less aggressive. It is used to stabilize variance and normalize data distributions.
- **How it works**: Takes the square root of data values, which must be non-negative.
- **Formula**:
  $
  x_{\text{sqrt}} = \sqrt{x}
  $

#### **What it Does**:
- Reduces skewness for moderate right-tailed distributions.
- Less compressive than the logarithmic transformation.
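
One way to see "less compressive" is to compare skewness before and after each transform. This is a minimal sketch on a made-up, strongly right-skewed sample; `sample_skewness` is a hypothetical helper that computes the standard moment-based skewness (third central moment divided by the 1.5 power of the second).

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample spanning several orders of magnitude.
x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

skew_raw = sample_skewness(x)
skew_sqrt = sample_skewness(np.sqrt(x))
skew_log = sample_skewness(np.log1p(x))

print(f"raw:  {skew_raw:.3f}")
print(f"sqrt: {skew_sqrt:.3f}")
print(f"log:  {skew_log:.3f}")
```

On this sample the square root lowers the skewness only slightly, while the log brings it close to zero, which is why the square root tends to suit moderate skew.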

In [3]:
# Before transformation:
data = pd.DataFrame([1, 4, 9, 16, 25])
data.describe()

Unnamed: 0,0
count,5.0
mean,11.0
std,9.66954
min,1.0
25%,4.0
50%,9.0
75%,16.0
max,25.0


In [4]:
# After square root transformation:
sqrt_transformed = np.sqrt(data)
sqrt_transformed.describe()

Unnamed: 0,0
count,5.0
mean,3.0
std,1.581139
min,1.0
25%,2.0
50%,3.0
75%,4.0
max,5.0


#### **Use Case**:
- Useful for non-negative data that follows a quadratic relationship, or for counts and similar measurements (e.g., population density, rainfall).

---

### **3. Z-Scores or IQR Method**
#### **Description**:
- **Purpose**: Identify and handle outliers in a dataset.
- **How it works**:
  - **Z-Score**: Measures how far a data point is from the mean in terms of standard deviations.
    $
    Z = \frac{x - \mu}{\sigma}
    $
    Data points with \( |Z| > 3 \) are often considered outliers.
  - **IQR (Interquartile Range)**: Based on the range between the first quartile (Q1) and the third quartile (Q3).
    $
    \text{IQR} = Q3 - Q1
    $
    Outliers lie outside:
    $
    [Q1 - 1.5 \cdot \text{IQR}, Q3 + 1.5 \cdot \text{IQR}]
    $

#### **What it Does**:
- Z-Score identifies extreme deviations from the mean.
- IQR handles non-normal distributions effectively by focusing on quartiles.
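
Both rules can be written by hand in a few lines. The sketch below assumes NumPy only; `zscore_outliers` and `iqr_outliers` are hypothetical helper names, and the six-value sample is illustrative. It also shows a small-sample caveat: with `n` points and the population standard deviation, `|Z|` can never exceed `sqrt(n - 1)`, so a cutoff of 3 cannot flag anything when `n` is 6.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return values whose |z| exceeds threshold (population std, ddof=0)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Illustrative sample with one extreme value.
sample = [10, 12, 15, 18, 19, 200]

z_flagged = zscore_outliers(sample)                        # cutoff 3
z_flagged_loose = zscore_outliers(sample, threshold=2.0)   # looser cutoff
iqr_flagged = iqr_outliers(sample)

print(z_flagged)
print(z_flagged_loose)
print(iqr_flagged)
```

The extreme value inflates both the mean and the standard deviation, which masks it under the Z-score rule; the quartile-based fences are not affected in the same way.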

**Z-Score**:

In [5]:
from scipy.stats import zscore
data = pd.Series([10, 12, 15, 18, 19, 200])  # Contains an outlier
z_scores = zscore(data)
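# The usual cutoff is |z| > 3, but with only 6 points |z| cannot exceed
# sqrt(5) ≈ 2.24 (zscore uses the population std), so a cutoff of 1 is used here.
outliers = data[abs(z_scores) > 1]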
outliers

5    200
dtype: int64

**IQR**:

In [6]:
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))]
print(outliers)

5    200
dtype: int64


#### **Use Case**:
- Z-Score: For approximately normal data with enough observations for the mean and standard deviation to be stable.
- IQR: For skewed or non-normal distributions.

---

### **4. Capping**
#### **Description**:
- **Purpose**: Treat outliers by capping extreme values at a specified percentile or threshold.
- **How it works**:
  - Define caps for outliers using percentiles (e.g., 5th and 95th percentiles).
  - Replace values below the lower cap and above the upper cap with the respective caps.

#### **What it Does**:
- Limits the influence of extreme values while retaining the structure of the data.
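
Caps can come from percentiles or from any other threshold. As a rough sketch, the code below compares percentile caps with caps set at the IQR fences; using the fences as caps is an assumption made here for illustration, and `cap_values` is a hypothetical helper.

```python
import numpy as np

def cap_values(values, lower, upper):
    """Clip values into [lower, upper]; no rows are dropped."""
    values = np.asarray(values, dtype=float)
    return np.clip(values, lower, upper)

# Illustrative sample with one extreme value.
sample = np.array([10, 12, 15, 18, 19, 200], dtype=float)

# Option A: caps at the 5th and 95th percentiles.
p_low, p_high = np.percentile(sample, [5, 95])
capped_pct = cap_values(sample, p_low, p_high)

# Option B (assumption): caps at the IQR fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
capped_iqr = cap_values(sample, q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(capped_pct)
print(capped_iqr)
```

On a sample this small the 95th percentile is itself pulled up by the extreme value, so the percentile cap stays far from the bulk of the data, while the fence-based cap sits much closer to it.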

In [7]:
data.describe()

count      6.000000
mean      45.666667
std       75.685313
min       10.000000
25%       12.750000
50%       16.500000
75%       18.750000
max      200.000000
dtype: float64

In [8]:
lower_cap = data.quantile(0.05)  # 5th percentile
upper_cap = data.quantile(0.95)  # 95th percentile
data_capped = data.clip(lower=lower_cap, upper=upper_cap)
data_capped.describe()

count      6.000000
mean      38.208333
std       57.188377
min       10.500000
25%       12.750000
50%       16.500000
75%       18.750000
max      154.750000
dtype: float64

#### **Use Case**:
- Useful in regression models to reduce the influence of outliers without completely removing them.

---

### **Comparison and Recommendations**
| **Method**         | **Purpose**                          | **Best For**                                  | **Limitations**                             |
|---------------------|--------------------------------------|-----------------------------------------------|---------------------------------------------|
| **Log**            | Reduce skewness                     | Highly skewed data (e.g., income, sales)      | Plain log is undefined for zero or negative values; `log(x + 1)` handles zeros but still requires `x > -1`. |
| **Sqrt**           | Stabilize variance                  | Moderately skewed data (e.g., counts)         | Requires non-negative values; compresses less than log, so strong skew may remain. |
| **Z-Scores / IQR** | Detect and remove outliers           | Normal (Z-Scores) or non-normal (IQR) data    | May remove important outliers unintentionally. |
| **Capping**        | Reduce outlier impact               | Preventing extreme outlier influence          | Alters original data; choosing caps is subjective.|

---