# Lecture on Z-Score and StandardScaler

## 1. What is a Z-Score?

A **Z-score** (also known as a standard score) describes how far a value lies from the mean of a group of values, measured in standard deviations. A Z-score of **0** means the data point is equal to the mean. A Z-score of **1.0** means the data point is one standard deviation above the mean, and a Z-score of **-1.0** means it is one standard deviation below the mean.

### Why is the Z-score important?

- **Standardization:** It puts scores from distributions with different means and spreads on a common scale so they can be compared. For example, comparing a student's performance in two different subjects with different grading scales.
- **Outlier Detection:** Data points with a very high or very low Z-score (e.g., beyond $$\pm3$$) are often considered outliers.
- **Probability Calculation:** If the data are normally distributed, a Z-score can be looked up in a Z-table to find the probability of observing a value at or below that score.
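
The Z-table lookup mentioned above can also be done in code. The following is a minimal sketch, assuming the data are normally distributed and that SciPy is available; the Z-score of 1.5 is a hypothetical value chosen only for illustration.

```python
from scipy.stats import norm

z = 1.5  # hypothetical Z-score, chosen only for illustration

# P(Z <= z) under the standard normal distribution.
# This is the value a Z-table lists.
p_below = norm.cdf(z)

# P(Z > z): the survival function, equal to 1 - cdf.
p_above = norm.sf(z)

# Share of a normal distribution lying within 3 standard deviations of the mean.
p_within_3 = norm.cdf(3) - norm.cdf(-3)

print(f"P(Z <= {z}) = {p_below:.4f}")
print(f"P(Z >  {z}) = {p_above:.4f}")
print(f"P(-3 <= Z <= 3) = {p_within_3:.4f}")
```

Under the normality assumption, roughly 99.7% of values fall within $$\pm3$$ standard deviations, which is why that threshold is a common rule of thumb for outliers.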

## 2. The Formula

The formula for calculating a Z-score is:

$$ Z = \frac{x - \mu}{\sigma} $$

Where:
- $$x$$ is the raw score or data point.
- $$\mu$$ is the population mean.
- $$\sigma$$ is the population standard deviation.

## 3. Example 1: Simple Calculation

Let's say a class's average score on a test ($$\mu$$) is 75, and the standard deviation ($$\sigma$$) is 10. A student scores 90 on the test ($$x$$). What is their Z-score?

In [1]:
# Python code to calculate Z-score

# Define the variables
mu = 75       # Population mean
sigma = 10    # Population standard deviation
x = 90        # Raw score of the student

# Calculate the Z-score
z_score = (x - mu) / sigma

# Print the result
print(f"The Z-score for a score of {x} is: {z_score}")

# Interpretation of the result
# A Z-score of 1.5 means the student's score is 1.5 standard deviations above the mean.

The Z-score for a score of 90 is: 1.5


## 4. Example 2: Comparing Scores

Imagine a student gets a 65 on a math test and a 70 on a science test. At first glance, it seems they did better in science. However, let's consider the class statistics:

- **Math Test:** $$\mu_{math} = 60$$, $$\sigma_{math} = 5$$
- **Science Test:** $$\mu_{science} = 65$$, $$\sigma_{science} = 10$$

Let's use Z-scores to see which performance was relatively better.

In [2]:
# Python code to compare scores using Z-scores

# Math Test details
x_math = 65
mu_math = 60
sigma_math = 5

# Science Test details
x_science = 70
mu_science = 65
sigma_science = 10

# Calculate Z-score for Math
z_score_math = (x_math - mu_math) / sigma_math

# Calculate Z-score for Science
z_score_science = (x_science - mu_science) / sigma_science

# Print the results
print(f"Z-score for Math: {z_score_math}")
print(f"Z-score for Science: {z_score_science}")

# Compare the scores
if z_score_math > z_score_science:
    print("The student performed relatively better in Math.")
elif z_score_science > z_score_math:
    print("The student performed relatively better in Science.")
else:
    print("The student performed equally well relative to their peers in both subjects.")

Z-score for Math: 1.0
Z-score for Science: 0.5
The student performed relatively better in Math.


## 5. Z-Score with NumPy and SciPy

In a real-world scenario, you might have a large dataset. Libraries like **NumPy** and **SciPy** in Python can make Z-score calculations much more efficient.

In [3]:
# Import necessary libraries
import numpy as np
from scipy import stats

# Create a sample dataset (e.g., student scores)
data = np.array([78, 85, 92, 60, 75, 88, 95, 68, 80, 72])

# Method 1: Manual Calculation with NumPy
mean_np = np.mean(data)
std_dev_np = np.std(data)

# Calculate Z-scores for the entire array
z_scores_np = (data - mean_np) / std_dev_np

print("Z-scores calculated with NumPy:\n", z_scores_np)

# Method 2: Using SciPy's `stats.zscore` function
# This is a more direct and recommended way
z_scores_scipy = stats.zscore(data)

print("\nZ-scores calculated with SciPy:\n", z_scores_scipy)

# Interpretation: The Z-scores tell us how many standard deviations each score is from the mean of the dataset.

Z-scores calculated with NumPy:
 [-0.12451171  0.54593594  1.21638359 -1.84851994 -0.41184641  0.83327065
  1.50371829 -1.08229406  0.06704476 -0.69918112]

Z-scores calculated with SciPy:
 [-0.12451171  0.54593594  1.21638359 -1.84851994 -0.41184641  0.83327065
  1.50371829 -1.08229406  0.06704476 -0.69918112]
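
The two methods above agree because `np.std` and `stats.zscore` both default to the population standard deviation (`ddof=0`). If the data are a sample and you want the sample standard deviation instead, both accept a `ddof` argument. A minimal sketch, reusing the same scores:

```python
import numpy as np
from scipy import stats

data = np.array([78, 85, 92, 60, 75, 88, 95, 68, 80, 72])

# Population convention (ddof=0): variance divides by n.
# This is the default for both np.std and stats.zscore.
z_population = stats.zscore(data, ddof=0)

# Sample convention (ddof=1): variance divides by n - 1.
z_sample = stats.zscore(data, ddof=1)
z_sample_manual = (data - data.mean()) / data.std(ddof=1)

print("Population Z-scores:\n", z_population)
print("\nSample Z-scores:\n", z_sample)
print("\nSciPy and manual sample Z-scores match:",
      np.allclose(z_sample, z_sample_manual))
```

The sample Z-scores are slightly smaller in magnitude because dividing by n - 1 gives a slightly larger standard deviation. The difference shrinks as the dataset grows.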


## 6. Z-Score and Machine Learning: The StandardScaler

In machine learning, standardizing features is a common preprocessing step, especially for algorithms that are sensitive to feature scale. The `StandardScaler` class from the `scikit-learn` library does this by applying the Z-score formula to each feature (column) of a dataset independently.

### Understanding `fit`, `transform`, and `fit_transform`

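- **`.fit(data)`:** This method calculates the per-feature mean ($$\mu$$) and standard deviation ($$\sigma$$) of the data passed to it, which should be the training data. It **learns** the parameters needed for scaling and stores them on the scaler (as `mean_` and `scale_`). It does not modify the data itself.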

- **`.transform(data)`:** This method applies the standardization to the data using the mean and standard deviation learned during the `fit` step. It standardizes the data by subtracting the learned mean and dividing by the learned standard deviation.

- **`.fit_transform(data)`:** This is a convenience method that performs both `fit` and `transform` in one step. It is typically used on the **training data** to learn the parameters and then apply the transformation immediately.
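
The relationship between the three methods can be checked directly. This is a minimal sketch on a small hypothetical array `X` (not part of the lecture's dataset); it compares the two-step and one-step routes against the Z-score formula written out by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
X_before = X.copy()

# Two-step route: learn the parameters, then apply them.
scaler = StandardScaler()
scaler.fit(X)
X_two_step = scaler.transform(X)

# The same transformation written out with the learned parameters.
X_manual = (X - scaler.mean_) / scaler.scale_

# One-step route on a fresh scaler.
X_one_step = StandardScaler().fit_transform(X)

print("transform matches manual formula:", np.allclose(X_two_step, X_manual))
print("fit_transform matches fit + transform:", np.allclose(X_two_step, X_one_step))
print("scale_ is the population std (ddof=0):",
      np.allclose(scaler.scale_, X.std(axis=0)))
print("input array left unchanged:", np.array_equal(X, X_before))
```

All four checks should print `True` for this input.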

### Why separate `fit` and `transform`?

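It's crucial to `fit` the scaler **only** on the training data to prevent **data leakage**. If the test data contributed to the learned mean and standard deviation, information about the test set would leak into preprocessing and the evaluation would no longer reflect performance on truly unseen data. The parameters learned from the training set should be used to transform both the training data and the test data: use `fit_transform()` on the training data and only `transform()` on the test data.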

In [4]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler
import numpy as np

# Create a sample dataset (e.g., features for a machine learning model)
# This is our training data
train_data = np.array([
    [10000, 2],
    [12000, 3],
    [5000, 1],
    [8000, 2.5]
])

# Instantiate the StandardScaler object
scaler = StandardScaler()

# Use fit_transform() on the training data
# This learns the mean and std dev and then applies the transformation
train_scaled = scaler.fit_transform(train_data)

print("Original Training Data:\n", train_data)
print("\nScaled Training Data (using fit_transform):\n", train_scaled)

# Now, let's say we have new data to test our model
test_data = np.array([
    [11000, 2.8],
    [6000, 1.2]
])

# Use only transform() on the new test data
# Do NOT call fit() or fit_transform() here: refitting on the test data
# would replace the training statistics and cause data leakage
test_scaled = scaler.transform(test_data)

print("\nOriginal Test Data:\n", test_data)
print("\nScaled Test Data (using transform):\n", test_scaled)

# You can inspect the learned parameters
print("\nLearned Mean of the features:", scaler.mean_)
print("Learned Standard Deviation of the features:", scaler.scale_)

Original Training Data:
 [[1.0e+04 2.0e+00]
 [1.2e+04 3.0e+00]
 [5.0e+03 1.0e+00]
 [8.0e+03 2.5e+00]]

Scaled Training Data (using fit_transform):
 [[ 0.48336824 -0.16903085]
 [ 1.25675744  1.18321596]
 [-1.45010473 -1.52127766]
 [-0.29002095  0.50709255]]

Original Test Data:
 [[1.1e+04 2.8e+00]
 [6.0e+03 1.2e+00]]

Scaled Test Data (using transform):
 [[ 0.87006284  0.9127666 ]
 [-1.06341014 -1.2508283 ]]

Learned Mean of the features: [8.750e+03 2.125e+00]
Learned Standard Deviation of the features: [2.58602011e+03 7.39509973e-01]
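
To see what goes wrong when the scaler is refit on the test data, the sketch below reuses the same training and test arrays and compares the correct result with the anti-pattern. The variable names `correct` and `refit_on_test` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_data = np.array([[10000, 2], [12000, 3], [5000, 1], [8000, 2.5]])
test_data = np.array([[11000, 2.8], [6000, 1.2]])

# Correct: parameters come from the training data only.
scaler = StandardScaler().fit(train_data)
correct = scaler.transform(test_data)

# Anti-pattern (shown only for comparison): refitting on the test data.
# The test rows are now scaled by their own mean and standard deviation.
refit_on_test = StandardScaler().fit_transform(test_data)

print("Scaled with training statistics:\n", correct)
print("\nScaled after refitting on test data:\n", refit_on_test)
print("\nResults agree:", np.allclose(correct, refit_on_test))
```

With only two test rows, refitting forces every column to approximately +1 and -1 regardless of the actual values, so the test features no longer live on the same scale the model saw during training.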


## 7. Applications of Z-Score

Z-scores and standardization are widely used in various fields:

- **Finance:** Comparing the performance of different stocks or portfolios.
- **Healthcare:** Standardizing lab results (e.g., blood pressure, cholesterol levels) to identify abnormal readings.
- **Quality Control:** In manufacturing, Z-scores can monitor if products are within acceptable deviation from the standard specifications.
- **Social Sciences:** Analyzing survey data to compare responses across different groups.
- **Machine Learning:** Standardizing features for algorithms that are sensitive to the magnitude of the features (e.g., Support Vector Machines, K-Nearest Neighbors, neural networks).
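
Several of these applications, such as flagging abnormal lab results or out-of-specification parts, come down to finding values whose Z-score is far from zero. The following is a minimal sketch of that idea; the `readings` array is hypothetical data, and the threshold of 3 is the rule of thumb mentioned earlier rather than a universal standard.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements; the last value is deliberately far from the rest.
readings = np.array([10.1, 9.9, 10.0, 10.2, 9.8, 10.1,
                     9.9, 10.0, 10.3, 9.7, 10.0, 14.0])

# Z-score of every reading relative to the whole array (population std).
z = stats.zscore(readings)

# Flag readings more than `threshold` standard deviations from the mean.
threshold = 3.0
outlier_mask = np.abs(z) > threshold
outliers = readings[outlier_mask]

print("Z-scores:\n", np.round(z, 2))
print("\nFlagged outliers:", outliers)
```

One caveat: with the population standard deviation, the largest possible absolute Z-score in a dataset of n points is $$\sqrt{n - 1}$$, so in a sample of 10 or fewer points no value can exceed a threshold of 3. The outlier itself also inflates the mean and standard deviation it is measured against, so this approach works best on reasonably large datasets.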