# Week 9: Data Transformation ⚙️

## 1. What is Data Transformation?

**Data transformation** is the process of cleaning and converting raw data into a format that is more usable and suitable for analysis. This is a key step in the data pre-processing stage of data wrangling.

The main goals of transforming data are to:
* **Fix Skewness**: Correct data distributions that are asymmetrical.
* **Enhance Visualisation**: Make data easier to plot and understand visually.
* **Improve Interpretability**: Make the results of analysis easier to explain.
* **Meet Model Assumptions**: Ensure the data is compatible with the requirements of a specific statistical model or machine learning algorithm.

---

## 2. Data Normalisation: Creating a Common Scale

**Data normalisation** is a fundamental transformation technique that changes the values of numeric columns in a dataset to a common scale. This is extremely important when your features have different units or scales (e.g., age in years, income in dollars, and height in centimeters). Normalisation prevents features with larger ranges from dominating the analysis.

There are two main types of normalisation: **scaling** and **standardisation**.

### 2.1. Scaling (Rescaling to a Specific Interval)
Scaling techniques focus on adjusting the range of your data to fall within a specific interval, like [0, 1] or [-1, 1].

#### Min-Max Scaling
This is one of the simplest methods, rescaling features to fit within a `[0, 1]` range.
* **Formula**:
    $$x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
    
* **Pros**: Easy to implement and preserves the original shape of the distribution.
* **Cons**: Highly **sensitive to outliers**. A single extreme value can squash all other data points into a very narrow range, reducing their distinctiveness.
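
The formula and the outlier problem can be illustrated with a minimal sketch in plain Python. The function name `min_max_scale` and the sample values are hypothetical, chosen only for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 50, 60]
ages_scaled = min_max_scale(ages)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# A single extreme value pushes every other point close to 0.
ages_with_outlier = min_max_scale([20, 30, 40, 50, 1000])
```

In the second call the first four values all end up below 0.05, which is the "squashing" effect described above.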


#### MaxAbs Scaling
This method scales each feature by its maximum absolute value, resulting in a range of `[-1, 1]`.
* **Formula**:
    $$x_{scaled} = \frac{x}{\max(|x|)}$$
    
* **Pros**: It doesn't shift or center the data, so zero entries stay zero. This makes it useful for sparse datasets (data with many zeros), because the sparsity is preserved.
* **Cons**: Also sensitive to outliers.
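
As a rough sketch of the formula above, in plain Python (the name `max_abs_scale` and the sample data are assumptions made for this example):

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the list."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # Every value is zero, so there is nothing to scale.
        return [0.0 for _ in values]
    return [v / max_abs for v in values]


sparse_feature = [0, 0, -4, 0, 8, 0, 2]
sparse_scaled = max_abs_scale(sparse_feature)
# Zeros remain zeros and the signs are kept: -4 becomes -0.5 and 8 becomes 1.0.
```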

#### Robust Scaling (Outlier-Resistant) 💪
This scaler uses statistics that are robust to outliers: the **median** and the **Interquartile Range (IQR)**.
* **Formula**:
    $$x_{scaled} = \frac{x - x_{median}}{IQR(x)}$$
    where $IQR = Q3 - Q1$ (the difference between the 75th and 25th percentiles).
* **Pros**: The median and IQR are barely affected by extreme values, so it's **very effective at scaling data with outliers** without squashing the inlying data points. The outliers themselves remain in the data and simply map to large scaled values.
* **Cons**: Calculating quartiles can be more computationally intensive than calculating the mean and standard deviation.
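
The following is one possible sketch of robust scaling using Python's `statistics` module. Quartile conventions differ between tools; this example relies on the default method of `statistics.quantiles`, so other libraries may give slightly different $Q1$ and $Q3$ values. The function name and data are hypothetical.

```python
import statistics


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero, so robust scaling is undefined")
    return [(v - med) / iqr for v in values]


readings = [10, 20, 30, 40, 50, 60, 70, 80, 1000]
readings_scaled = robust_scale(readings)
# The inliers stay spread between roughly -0.8 and 0.6,
# while the outlier maps to a large value instead of compressing the rest.
```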


#### Log Scaling (Log Transform)
A logarithmic transform is useful for data that exhibits exponential growth or is highly right-skewed. It helps to stabilize the variance and make the data's distribution more "normal".
* **Formula**:
    $$x_{scaled} = \log(x)$$
    
* **Pros**: Very effective at reducing skewness.
* **Cons**: Can only be applied to **positive values** and can sometimes obscure small differences in the original data.
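
A small sketch of the log transform, assuming the natural logarithm and hypothetical income figures. It rejects non-positive input rather than guessing how to handle it.

```python
import math


def log_scale(values):
    """Apply the natural log to each value; input must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]


incomes = [1_000, 10_000, 100_000, 1_000_000]
incomes_logged = log_scale(incomes)

# Each tenfold jump in the raw data becomes the same additive step, log(10).
steps = [b - a for a, b in zip(incomes_logged, incomes_logged[1:])]
```

The evenly spaced `steps` show why exponential growth looks linear after the transform.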

### 2.2. Standardisation (Z-score Normalisation)
Standardisation is a different approach that rescales data to have a **mean ($\mu$) of 0** and a **standard deviation ($\sigma$) of 1**. The resulting value is called a Z-score.
* **Formula**:
    $$z = \frac{x - \mu}{\sigma}$$
    
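* **Pros**: Less sensitive to outliers than Min-Max scaling, because a single extreme value shifts the mean and standard deviation less than it shifts the minimum or maximum (although it still affects both). It suits algorithms that expect centred features with comparable variance. Note that standardisation changes only the location and scale of the data, not the shape of its distribution, so it does not make skewed data normal.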
* **Cons**: Doesn't scale the data to a specific, bounded range like Min-Max scaling does.
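
A minimal sketch of the Z-score formula follows. It assumes the population standard deviation (`statistics.pstdev`); some tools use the sample standard deviation instead, which gives slightly different values. The function name and heights are hypothetical.

```python
import statistics


def standardise(values):
    """Convert values to Z-scores using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero, so Z-scores are undefined")
    return [(v - mu) / sigma for v in values]


heights_cm = [150, 160, 170, 180, 190]
heights_z = standardise(heights_cm)

# After standardising, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding).
z_mean = statistics.fmean(heights_z)
z_std = statistics.pstdev(heights_z)
```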

---

## 3. Power Transformation

**Power transformations** are a family of functions used to make data more linear or to stabilize its variance. A key application is transforming a non-linear relationship between two variables into a linear one, which is often easier to model.

A well-known example is the **Box-Cox Transformation**, which can transform a continuous, strictly positive variable into an approximately normal distribution by finding an optimal exponent, $\lambda$.
* **Formula**:
    $$y = \begin{cases} \frac{x^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log(x), & \text{if } \lambda = 0 \end{cases}$$
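
The piecewise formula can be sketched directly in Python. This example takes $\lambda$ as a given argument (named `lam` here, a hypothetical parameter name) and does not search for the optimal value; in practice that estimation step is usually left to a statistics library.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transform for a fixed lambda; input must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


squares = [1, 4, 9, 16]
bc_half = box_cox(squares, 0.5)  # behaves like 2 * (sqrt(x) - 1)
bc_zero = box_cox(squares, 0)    # the log case

# As lambda approaches 0, the power branch approaches the log branch.
bc_near_zero = box_cox(squares, 1e-6)
```

With $\lambda = 0.5$ the quadratically growing input becomes evenly spaced, which is an instance of turning a non-linear pattern into a linear one.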
    

---

## 4. Data Discretisation (Binning)

**Discretisation** is the process of converting continuous numerical variables into discrete, categorical variables (or "bins"). This can help reduce noise, smooth data, and make it compatible with algorithms that require categorical inputs.

There are two common unsupervised approaches:
1.  **Equal-Width Binning**: Divides the data range into a predefined number of intervals, each with the same width. This method is simple but can be negatively affected by outliers and skewed data.
2.  **Equal-Depth (Frequency) Binning**: Divides the data into intervals that each contain approximately the same number of data points. This method handles skewed data much better.

After binning, the values within each bin can be replaced by the bin's **mean**, **median**, or **boundary values** to smooth the data.
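
Both binning strategies, followed by smoothing by bin means, might look like the sketch below. The function names and the small price list are assumptions made for illustration.

```python
import statistics


def equal_width_bins(values, k):
    """Split the value range into k intervals of identical width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, so the bin width is zero")
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum would land at index k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, k):
    """Split the sorted values into k bins whose sizes differ by at most one."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[statistics.fmean(b)] * len(b) if b else [] for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width_bins = equal_width_bins(prices, 3)  # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
depth_bins = equal_depth_bins(prices, 3)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
smoothed = smooth_by_means(depth_bins)
```

Note how the equal-width bins hold two, three, and four values, while the equal-depth bins hold three values each.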

---

## 5. Data Construction: Creating Better Features

### 5.1. Feature Subset Selection
Often, not all features in a dataset are useful. **Feature subset selection** is the process of reducing the dataset by removing irrelevant or redundant features. The goal is to find the minimum set of attributes that can still produce a good model.
Common methods include:
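* **Stepwise Forward Selection**: Start with no features and, at each step, add the feature that improves the model the most. Stop when no remaining feature helps.
* **Stepwise Backward Elimination**: Start with all features and, at each step, remove the feature that contributes the least. Stop when removing any further feature would hurt the model.
* **Decision Tree Induction**: Build a decision tree and keep only the features it uses for its splits; features that never appear in the tree are treated as irrelevant.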
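
Stepwise forward selection can be sketched as a greedy loop. Everything named here is hypothetical: `score` stands in for whatever model-quality measure is used (for example, validation accuracy), and `toy_score` with its `GAIN` table is a made-up scoring function in which `height_in` is redundant once `height_cm` is present.

```python
def forward_select(features, score, min_gain=0):
    """Greedily add the feature that raises score() the most, until none helps."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate each remaining feature alongside those already chosen.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score - best_score <= min_gain:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical usefulness of each feature, in arbitrary integer points.
GAIN = {"age": 30, "income": 20, "height_cm": 10, "height_in": 10, "id": 0}


def toy_score(subset):
    total = sum(GAIN[f] for f in subset)
    if "height_cm" in subset and "height_in" in subset:
        total -= 10  # the second height column adds no new information
    return total


chosen = forward_select(list(GAIN), toy_score)
# chosen == ["age", "income", "height_cm"]
```

The irrelevant `id` column and the redundant `height_in` column are never added, which matches the goal of finding a minimal useful set of attributes.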

### 5.2. Data Sampling
Sampling involves selecting a representative subset of data from a larger dataset. This is useful for reducing data volume, fixing class imbalances, or creating training and testing sets.

* **Simple Random Sample (SRS)**: Every data point has an equal chance of being selected. This can be done with or without replacement.
* **Stratified Sampling**: The data is first divided into subgroups (strata), and then a simple random sample is taken from each stratum. This ensures that all subgroups are fairly represented in the final sample, which is especially important for imbalanced datasets.
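
The sampling schemes above can be sketched with Python's `random` module. The helper `stratified_sample`, the 20% fraction, and the toy records are assumptions for this example; the seed is fixed only to make the sketch reproducible.

```python
import random
from collections import defaultdict

rng = random.Random(42)

population = list(range(100))
srs_without = rng.sample(population, k=10)  # without replacement: no duplicates
srs_with = rng.choices(population, k=10)    # with replacement: duplicates possible


def stratified_sample(records, key, fraction, rng):
    """Take a simple random sample of the same fraction from every stratum."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Keep at least one record per stratum so small groups are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Imbalanced toy dataset: 90 "neg" records and 10 "pos" records.
records = [("neg", i) for i in range(90)] + [("pos", i) for i in range(10)]
picked = stratified_sample(records, key=lambda r: r[0], fraction=0.2, rng=rng)
```

Here `picked` always contains 18 `"neg"` and 2 `"pos"` records, so the minority class keeps its 10% share, whereas a plain simple random sample of 20 records could by chance contain no `"pos"` records at all.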

# Data Transformation


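Q1. Which scaling method is robust to outliers?
Robust scaling. It centres on the median and scales by the IQR, and neither statistic is affected much by extreme values.

Q2. Which methods keep the original meaning of the data after scaling?
Every method except log scaling is a linear transformation, so the relative spacing between values is preserved: Min-Max scaling, MaxAbs scaling, Decimal scaling, and Robust scaling.

Q3. Which data sampling method is used in the Random Forest algorithm?
Simple random sampling with replacement (bootstrap sampling): each tree is trained on its own bootstrap sample of the training data.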