🔑 Common Terms in ML & Data Preprocessing
1. Continuous Values

Numbers that can take any value within a range.

Example: Sales = 123.45 or Temperature = 28.3°C.

Often need scaling before ML.

2. Discrete Values

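Numeric values that are countable and separate, usually whole numbers (no fractions).

Example: Number of Orders = 5. A label such as Category = Furniture is a categorical variable (see item 3), not a discrete number.

Kept as numeric counts when the magnitude matters; treated as categorical only when the numbers are really labels (for example, an ID code).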

3. Categorical Variables

Data stored as labels or groups instead of numbers.

Example: Segment = [Consumer, Corporate, Home Office].

Need encoding (One-Hot Encoding, Label Encoding) before ML.
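
A minimal sketch of both encodings in plain Python, using the Segment values from the example. The sample column and the alphabetical category order are assumptions made for illustration; in practice a library encoder would usually do this.

```python
# Hypothetical sample column built from the Segment example above.
segments = ["Consumer", "Corporate", "Home Office", "Consumer"]

# Fix the category order (alphabetical here) so the encoding is reproducible.
categories = sorted(set(segments))

# Label encoding: each category maps to one integer.
label_map = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_map[s] for s in segments]

# One-hot encoding: each category gets its own 0/1 column.
one_hot_encoded = [[1 if s == cat else 0 for cat in categories] for s in segments]

print(label_encoded)    # [0, 1, 2, 0]
print(one_hot_encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Label encoding implies an order (0 < 1 < 2) that Segment does not really have, so one-hot encoding is often the safer choice for unordered categories.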

4. On the Same Scale

When features are in a comparable numeric range.

Example:

Feature A = "Age" (20–70)

Feature B = "Income" (10,000–1,000,000)

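Without scaling, distance-based and gradient-based algorithms (for example k-NN, k-means, or regularized linear models) let Income dominate just because it has bigger numbers. Tree-based models are largely unaffected.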

Scaling makes them fairly comparable.
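
A rough illustration of the effect, assuming three made-up customers and the Age and Income ranges from the example. It compares Euclidean distances before and after min-max scaling (item 5); the customer values and helper names are hypothetical.

```python
import math

# Hypothetical customers as (age, income), within the ranges given above.
a = (25, 50_000)
b = (65, 50_500)   # very different age, almost the same income
c = (26, 60_000)   # almost the same age, different income

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def min_max(value, lo, hi):
    return (value - lo) / (hi - lo)

def scale(p):
    # Assumed ranges: Age 20-70, Income 10,000-1,000,000.
    age, income = p
    return (min_max(age, 20, 70), min_max(income, 10_000, 1_000_000))

# Raw distances: the income gap decides everything.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Scaled distances: the 40-year age gap now counts.
scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))

print(raw_ab < raw_ac)        # True: unscaled, b looks closer to a
print(scaled_ac < scaled_ab)  # True: scaled, c looks closer to a
```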

5. Normalization (Min-Max Scaling)

Rescales values into the range 0 to 1.

Formula:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$


Example: If Sales = 500, min=0, max=1000 → normalized value = 0.5.
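
A small sketch of min-max scaling in plain Python. The function name and the sample list are assumptions; the handling of a constant column (all values equal) is a choice made here, not a standard.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero.
        # Returning 0.0 everywhere is an arbitrary choice for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical Sales values with min=0 and max=1000, as in the example.
sales = [0, 250, 500, 1000]
scaled = min_max_scale(sales)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The min and max should come from the training data only and then be reused for test data, otherwise information from the test set leaks into preprocessing.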

6. Standardization (Z-score Scaling)

Converts data to have mean = 0, standard deviation = 1.

Formula:

$$z = \frac{x - \mu}{\sigma}$$


where μ = mean, σ = standard deviation.

Example: If Sales = 500, mean=300, std=100 → z-score = (500-300)/100 = 2.0.

Interpretation: "Sales is 2 standard deviations above the mean."
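
A minimal sketch of standardization with Python's `statistics` module. It assumes the population standard deviation (`pstdev`); some tools default to the sample standard deviation instead, which gives slightly different z-scores. The function name and sample list are hypothetical.

```python
import statistics

def standardize(values):
    """Return z-scores using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

# The worked example, with mean and std given directly.
z = (500 - 300) / 100
print(z)  # 2.0

# Hypothetical sample: after standardizing, mean is 0 and std is 1.
sales = [200, 300, 400]
z_scores = standardize(sales)
```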

7. Z-score

A standardized score showing how far a value is from the mean in terms of standard deviations.

Example: z=0 → exactly the mean; z=+2 → two standard deviations above the mean; z=-1 → one standard deviation below the mean.

8. Outlier

A data point that is very different from the majority.

Example: In Sales, most values < 5000, but one is 50,000 → outlier.

Important to detect because outliers can distort the mean, the min and max used for scaling, and the fitted model.
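
One common way to flag outliers is the 1.5 × IQR rule. This is one option among several (z-score thresholds are another), and the notes above do not prescribe it. A sketch, assuming made-up Sales values and a hypothetical helper name:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical Sales: most values below 5000, one at 50,000.
sales = [1200, 1500, 1800, 2100, 2500, 3000, 4200, 50000]
outliers = iqr_outliers(sales)
print(outliers)  # [50000]
```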

9. Feature Engineering

The process of transforming raw data into meaningful inputs for ML.

Includes binning, scaling, encoding, and creating new features (like Profit = Sales - Cost, or Profit Margin = (Sales - Cost) / Sales).
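
A short sketch of two of these steps: creating new features and binning. Here profit is computed as Sales - Cost and profit margin as profit divided by Sales. The order records and the bin thresholds are made up for illustration.

```python
# Hypothetical raw records.
orders = [
    {"sales": 200.0, "cost": 150.0},
    {"sales": 1200.0, "cost": 900.0},
    {"sales": 5000.0, "cost": 4500.0},
]

def sales_bin(sales):
    # Assumed thresholds; real ones depend on the data.
    if sales < 500:
        return "low"
    if sales < 2000:
        return "medium"
    return "high"

for order in orders:
    # New features derived from the raw columns.
    order["profit"] = order["sales"] - order["cost"]
    order["profit_margin"] = order["profit"] / order["sales"]
    # Binning: turn a continuous value into a category.
    order["sales_bin"] = sales_bin(order["sales"])

print([o["profit"] for o in orders])     # [50.0, 300.0, 500.0]
print([o["sales_bin"] for o in orders])  # ['low', 'medium', 'high']
```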

10. Target Variable (y)

The outcome you’re predicting.

Example: Predicting Sales (regression) or Segment (classification).

11. Feature / Independent Variable (X)

The input(s) used to predict the target.

Example: Customer Age, Region, Category → help predict Sales.

12. Overfitting

When a model memorizes training data instead of generalizing.

Good performance on training, bad on test data.

13. Underfitting

When a model is too simple and can’t capture patterns.

Bad on both training and test data.
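
A toy illustration of both failure modes, assuming a tiny made-up dataset where y is roughly 2x plus noise. A "memorizer" (1-nearest-neighbour lookup) stands in for an overfit model and a constant mean prediction stands in for an underfit one. Both models and the data are hypothetical.

```python
def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data: y is about 2x with some noise.
train = [(0, 0.5), (2, 3.2), (4, 8.9), (6, 11.4), (8, 16.8)]
test = [(1.5, 3.3), (3.5, 6.6), (5.5, 11.3), (7.5, 14.7)]

def memorizer(x):
    # Overfit stand-in: return the y of the closest training x, noise included.
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y

mean_y = sum(y for _, y in train) / len(train)

def constant(x):
    # Underfit stand-in: ignore x and always predict the training mean.
    return mean_y

memorizer_train, memorizer_test = mse(memorizer, train), mse(memorizer, test)
constant_train, constant_test = mse(constant, train), mse(constant, test)

print(memorizer_train)                           # 0.0: training data reproduced exactly
print(memorizer_test > memorizer_train)          # True: worse on unseen data
print(constant_train > 10, constant_test > 10)   # True True: poor on both
```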

14. Bias vs Variance

Bias: Error from making the model too simple (underfitting).

Variance: Error from making the model too sensitive to noise (overfitting).

Goal = balance (Bias-Variance Tradeoff).

15. Training vs Testing Data

Training data: Used to build the model.

Testing data: Used to check how well it works on unseen data.

Often split 70% train, 30% test (or similar).
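
A minimal sketch of a 70/30 split in plain Python. The helper name `split_train_test`, the seed, and the sample rows are assumptions; libraries such as scikit-learn provide a ready-made `train_test_split` for the same job.

```python
import random

def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(rows)       # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # hypothetical row IDs
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))  # 7 3
```

Shuffling before splitting matters when the rows are ordered (for example by date or by category), unless the order itself is what the model must respect, as in time series.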