# Transformation in ETL Assignment


**Question 1 : Define Data Transformation in ETL and explain why it is important.**

**Ans :** Data Transformation in ETL is the process of converting raw data extracted from source systems into a clean, consistent, and meaningful format suitable for storage and analysis in a data warehouse.

**Importance :**

- Ensures data consistency and quality
- Converts data into a usable format for analytics
- Removes errors, duplicates, and inconsistencies
- Helps in accurate reporting and decision-making

**Question 2 : List any four common activities involved in Data Cleaning.**

**Ans :** Here are four common data cleaning activities:

**1. Handling missing values** – Filling missing data using mean/median/mode or removing incomplete records.

**2. Removing duplicate records** – Identifying and deleting repeated rows to avoid data redundancy.

**3. Correcting inconsistent data** – Standardizing formats (e.g., “M”, “male” → “Male”).

**4. Outlier detection and treatment** – Identifying abnormal values and correcting or removing them.

These steps help improve data quality before the remaining transformation steps in ETL.
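The four activities can be sketched in plain Python as below. This is a minimal illustration, not a fixed recipe: the sample `records`, the `GENDER_MAP` lookup, the `clean` function, and the `max_age=120` cut-off are all assumptions made for the example. Duplicates are removed first so that repeated rows do not skew the median used for filling.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
records = [
    {"id": 1, "gender": "M", "age": 25},
    {"id": 2, "gender": "female", "age": None},
    {"id": 2, "gender": "female", "age": None},  # exact duplicate of the row above
    {"id": 3, "gender": "male", "age": 31},
    {"id": 4, "gender": "F", "age": 240},        # implausible age (outlier)
    {"id": 5, "gender": "f", "age": 28},
]

GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def clean(rows, max_age=120):
    # Remove duplicate records, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # Correct inconsistent data: standardize the gender labels.
    for row in unique:
        label = row["gender"].strip().lower()
        row["gender"] = GENDER_MAP.get(label, row["gender"])

    # Treat outliers: an age outside 0..max_age is marked as missing.
    for row in unique:
        if row["age"] is not None and not 0 <= row["age"] <= max_age:
            row["age"] = None

    # Handle missing values: fill with the median of the known ages.
    known = [row["age"] for row in unique if row["age"] is not None]
    fill = median(known)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill
    return unique


cleaned = clean(records)
for row in cleaned:
    print(row)
```

Treating an implausible age as missing and then imputing it is only one possible outlier treatment; removing the row is an equally valid choice.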

**Question 3 : What is the difference between Normalization and Standardization?**

**Ans :**

**Normalization**

- Scales data into a fixed range, usually between 0 and 1
- Uses minimum and maximum values for transformation
- Highly sensitive to outliers
- Useful when features need to be on the same bounded scale
- Commonly used in algorithms like KNN and Neural Networks

**Standardization**

- Transforms data to have mean = 0 and standard deviation = 1
- Uses mean and standard deviation for scaling
- Less affected by outliers compared to normalization
- Suitable when data follows a normal distribution
- Commonly used in regression and clustering algorithms
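Both techniques come down to one formula each: normalization computes (x - min) / (max - min), and standardization computes (x - mean) / standard deviation. The sketch below applies them to a small made-up list; the function names and the sample `data` are assumptions for illustration, and the population standard deviation (`pstdev`) is used.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Normalization: (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    # Assumes the values are not all equal, otherwise hi - lo is zero.
    return [(x - lo) / (hi - lo) for x in values]


def z_score_standardize(values):
    """Standardization: (x - mean) / std, giving mean 0 and std 1."""
    mu, sigma = mean(values), pstdev(values)
    # Assumes the values are not all equal, otherwise sigma is zero.
    return [(x - mu) / sigma for x in values]


data = [10, 20, 30, 40, 50]
normalized = min_max_normalize(data)      # [0.0, 0.25, 0.5, 0.75, 1.0]
standardized = z_score_standardize(data)  # centered on 0, not bounded to [0, 1]

print(normalized)
print([round(z, 3) for z in standardized])
```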


**Question 4 : A dataset has missing values in the “Age” column. Suggest two techniques to handle this and explain when they should be used.**

**Ans :**

**1. Mean/Median Imputation**

- Replace missing age values with the mean or median of the column
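- Use the mean when the ages are roughly symmetric (close to normally distributed)
- Prefer the median when the column is skewed or contains outliers, because the median is not pulled toward extreme values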

**2. Deletion of Records (Row Removal)**

- Remove rows where the age value is missing
- Used when the number of missing values is very small
- Suitable when removing data does not affect overall analysis or results
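Both techniques can be sketched in a few lines of Python. The `ages` list and the helper names `impute` and `drop_missing` are hypothetical, chosen only for this example; `None` stands in for a missing value.

```python
from statistics import mean, median

# Hypothetical Age column; None marks a missing value, 95 is an extreme value.
ages = [22, 25, None, 31, 28, None, 95]


def impute(values, strategy="median"):
    """Replace missing entries with the median (default) or mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = median(known) if strategy == "median" else mean(known)
    return [fill if v is None else v for v in values]


def drop_missing(values):
    """Row removal: keep only the entries that have a value."""
    return [v for v in values if v is not None]


median_filled = impute(ages)                # gaps filled with 28
mean_filled = impute(ages, strategy="mean") # gaps filled with 40.2
remaining = drop_missing(ages)              # 5 of the 7 entries remain

print(median_filled)
print(mean_filled)
print(remaining)
```

The single value 95 pulls the mean up to 40.2 while the median stays at 28, which is why the median is the safer fill value here.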

**Question 5 : Convert the following inconsistent “Gender” entries into a standardized format (“Male”, “Female”): ["M", "male", "F", "Female", "MALE", "f"]**

**Ans :**

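Convert each entry to lowercase, then map it to the standard label:

```sql
-- table_name is a placeholder for the actual table.
UPDATE table_name
SET gender =
  CASE
    WHEN LOWER(gender) IN ('m', 'male') THEN 'Male'
    WHEN LOWER(gender) IN ('f', 'female') THEN 'Female'
    ELSE gender  -- leave unrecognized values unchanged
  END;
```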
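The same mapping can be sketched in Python when the entries are held in a list rather than a table. The `GENDER_MAP` dictionary and the `standardize_gender` helper are assumptions introduced for this example.

```python
# Lowercase spelling variants mapped to the two standard labels.
GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def standardize_gender(value):
    # Unrecognized values are returned unchanged, like the ELSE branch in the SQL.
    return GENDER_MAP.get(value.lower(), value)


entries = ["M", "male", "F", "Female", "MALE", "f"]
standardized = [standardize_gender(e) for e in entries]
print(standardized)
# ['Male', 'Male', 'Female', 'Female', 'Male', 'Female']
```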


**Question 6 : What is One-Hot Encoding? Give an example with the categories: “Red, Blue, Green”.**

**Ans :**
- One-Hot Encoding is a data transformation technique used to convert categorical data into a numerical format by creating separate binary columns for each category.
- Each column represents a category and contains a value of **1 if the category is present** and **0 otherwise**. This helps machine learning algorithms understand categorical variables.

**Example:** For a "Color" column with ["Red", "Blue", "Green"], **"Red" becomes [1, 0, 0], "Blue" becomes [0, 1, 0], and "Green" becomes [0, 0, 1]**.
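A minimal sketch of the encoding, assuming the column order Red, Blue, Green used in the example above. The `one_hot` helper is a hypothetical name, not a library function.

```python
def one_hot(value, categories):
    """Return a binary vector with 1 at the position of value and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


categories = ["Red", "Blue", "Green"]
encoded = {color: one_hot(color, categories) for color in categories}

for color, vector in encoded.items():
    print(color, vector)
# Red [1, 0, 0]
# Blue [0, 1, 0]
# Green [0, 0, 1]
```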

**Question 7 : Explain the difference between Data Integration and Data Mapping in ETL.**

**Ans :**

**Data Integration**

- Combines data from multiple source systems into a single, unified dataset
- Focuses on merging and consolidating data
- Works at a system or database level

Example: Combining customer data from CRM, ERP, and sales databases

**Data Mapping**

- Defines how source fields are matched to target fields
- Focuses on column-to-column relationships
- Works at a field or attribute level

Example: Mapping cust_id in source table to customer_id in target table

**Data Integration brings data together, while Data Mapping ensures data is placed correctly during ETL.**
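The two ideas can be shown side by side in a short Python sketch. Everything here is hypothetical: the `FIELD_MAPS` table, the source systems `crm` and `sales`, their column names, and the sample rows are invented for illustration. `map_record` performs the field-level mapping, and `integrate` merges the mapped records from all sources.

```python
# Hypothetical field mapping: source column -> target column, per source system.
FIELD_MAPS = {
    "crm": {"cust_id": "customer_id", "full_name": "name"},
    "sales": {"customer": "customer_id", "total": "order_total"},
}


def map_record(record, field_map):
    """Data mapping: rename source fields to their target field names."""
    return {field_map[k]: v for k, v in record.items() if k in field_map}


def integrate(sources):
    """Data integration: merge mapped records from all sources by customer_id."""
    unified = {}
    for system, rows in sources.items():
        for row in rows:
            mapped = map_record(row, FIELD_MAPS[system])
            unified.setdefault(mapped["customer_id"], {}).update(mapped)
    return unified


sources = {
    "crm": [{"cust_id": 101, "full_name": "Asha Rao"}],
    "sales": [{"customer": 101, "total": 250.0}],
}
result = integrate(sources)
print(result)
# {101: {'customer_id': 101, 'name': 'Asha Rao', 'order_total': 250.0}}
```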



**Question 8 : Explain why Z-score Standardization is preferred over Min-Max Scaling when outliers exist.**

**Ans :**
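- Z-score Standardization scales data using the mean and standard deviation, which are computed from every data point.
- An outlier still shifts the mean and inflates the standard deviation, but its influence is spread across all observations, so the scaling is less sensitive to extreme values.
- The output is not forced into a fixed range, so an outlier simply receives a large z-score.
- Min-Max Scaling depends only on the minimum and maximum values.
- A single outlier becomes the new minimum or maximum and stretches the range, compressing most data points into a small part of the [0, 1] interval.
- This can hide meaningful variation among the typical values.

Z-score Standardization is therefore preferred when outliers are present: the typical values keep more of their spread after scaling, which makes the scaled feature more useful for analysis.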
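One way to check this numerically is to measure how much a single outlier inflates the divisor that each method uses: the range for Min-Max Scaling and the standard deviation for Z-score Standardization. The sketch below uses made-up data (100 typical values plus one outlier of 10000) and the population standard deviation; the variable names are assumptions for the example.

```python
from statistics import pstdev

typical = list(range(50, 150))     # 100 typical values: 50, 51, ..., 149
with_outlier = typical + [10000]   # the same values plus one extreme outlier


def value_range(values):
    return max(values) - min(values)


# How many times larger each divisor becomes once the outlier is added.
range_inflation = value_range(with_outlier) / value_range(typical)
std_inflation = pstdev(with_outlier) / pstdev(typical)

print(f"range grows by a factor of about {range_inflation:.1f}")    # roughly 100
print(f"std dev grows by a factor of about {std_inflation:.1f}")    # roughly 34
```

In this sample the range grows by roughly 100 times and the standard deviation by roughly 34 times, so the typical values are squeezed less under Z-score Standardization. The standard deviation is still inflated, so the advantage is a matter of degree, and it shrinks when the dataset is very small.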