# Data Preprocessing: Data Standardization

Normalization and standardization both adjust the scale of features in a dataset, but they serve different purposes and use different formulas.

Normalization (also called Min-Max scaling) rescales each feature to fit within a fixed range, typically [0, 1]. Standardization (also called Z-score normalization) rescales each feature to have a mean of 0 and a standard deviation of 1. Standardization only shifts and rescales the data; it does not change the shape of the distribution, so it will not turn non-Gaussian data into a bell curve.

Standardization is typically used when:
- The features are roughly Gaussian, or the algorithm works best with zero-centered features of comparable variance (e.g., regularized linear regression, logistic regression, SVM).
- The features have different units or scales and need to be compared on a common scale.
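
To make the difference between the two techniques concrete, here is a minimal sketch that applies both formulas to the same array. The values in `x` are made up for illustration and are not part of the dataset used later.

```python
import numpy as np

# Illustrative values only (an assumption, not taken from the dataset below).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-Max normalization: rescale to the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (Z-score): subtract the mean, divide by the standard deviation.
x_zscore = (x - x.mean()) / x.std()

print("Min-Max:", x_minmax)
print("Z-score:", x_zscore)
```

The Min-Max result is bounded by 0 and 1, while the Z-score result is centered on 0 and has no fixed bounds.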

## General Example on a Generated Dataset

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
x1 = np.array([0,1,2,3,4,5,6,7,8,9])

In [None]:
x1

In [None]:
x1 = x1.reshape(-1,1)

In [None]:
x1

In [None]:
x2 = np.array([201,220,233,243,257,269,272,283,299,339])

In [None]:
x2 = x2.reshape(-1,1)

In [None]:
x2

In [None]:
x3 = np.array([2000, 3550, 2350, 3940, 4000, 50000, 2677, 9765, 8876, 9776]).reshape(-1,1)
x3

In [None]:
X = np.hstack((x1,x2,x3))

In [None]:
X

In [None]:
y = np.array([10,20,30,40,50,60,70,80,90,100])
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

sc.fit(X_train) # Compute the mean and standard deviation to be used for later scaling.

X_train_scaled = sc.transform(X_train) # Scaling the data according to the calculated mean and standard deviation


In [None]:
X_train_scaled

In [None]:
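X_train_scaled.mean(axis=0)  # Per-feature means; each should be approximately 0

In [None]:
X_train_scaled.std(axis=0)  # Per-feature standard deviations; each should be approximately 1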

In [None]:
X_test_scaled = sc.transform(X_test)  # Reuse the training-set mean and standard deviation; do not refit on the test set

In [None]:
X_test_scaled
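
The cells above fit the scaler on the training split only and then reuse it for the test split. The sketch below illustrates what that means in practice, using a small hypothetical split (`X_train_demo` and `X_test_demo` are invented for this example). It inspects the statistics the scaler learned through its `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy split: two features, four training rows, two test rows.
X_train_demo = np.array([[0.0, 200.0], [1.0, 220.0], [2.0, 240.0], [3.0, 260.0]])
X_test_demo = np.array([[4.0, 280.0], [5.0, 300.0]])

scaler = StandardScaler().fit(X_train_demo)

# Statistics learned from the training rows only.
print("Learned means:", scaler.mean_)
print("Learned standard deviations:", scaler.scale_)

train_scaled = scaler.transform(X_train_demo)
test_scaled = scaler.transform(X_test_demo)

# The training columns are centered on 0 with unit standard deviation.
print("Train means:", train_scaled.mean(axis=0))
print("Train stds:", train_scaled.std(axis=0))

# The test columns generally are not, because they were scaled with
# the training statistics rather than their own.
print("Test means:", test_scaled.mean(axis=0))
```

Scaling the test set with the training statistics keeps information about the test data out of the preprocessing step.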

## Manual Implementation of Data Standardization

In [1]:
import numpy as np

def standardize_data(data):
    """
    Standardize the dataset to have a mean of 0 and a standard deviation of 1.

    Parameters:
    - data: numpy array of data to be standardized.

    Returns:
    - standardized_data: numpy array of standardized data.
    """
    mean = np.mean(data)
    std_dev = np.std(data)
    standardized_data = (data - mean) / std_dev
    return standardized_data

# Example usage
data = np.array([1, 2, 3, 4, 5])
standardized_data = standardize_data(data)
print("Standardized Data:", standardized_data)


Standardized Data: [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]


### **Explanation:**

- **`mean = np.mean(data)`:** Calculates the mean of the dataset.
- **`std_dev = np.std(data)`:** Calculates the standard deviation of the dataset.
- **`standardized_data = (data - mean) / std_dev`:** This line applies the standardization formula to each data point, rescaling the dataset to have a mean of 0 and a standard deviation of 1.
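
One detail worth knowing about `np.std` is that it computes the population standard deviation by default (`ddof=0`, dividing by N), which is also the estimator `StandardScaler` uses. Some other tools default to the sample standard deviation (`ddof=1`, dividing by N - 1), which gives slightly different standardized values. A short sketch of the difference:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

pop_std = np.std(data)             # ddof=0 (default): divide by N
sample_std = np.std(data, ddof=1)  # ddof=1: divide by N - 1

print("Population std:", pop_std)  # sqrt(2), about 1.4142
print("Sample std:", sample_std)   # sqrt(2.5), about 1.5811

# The choice changes the scale of the standardized values.
z_pop = (data - data.mean()) / pop_std
z_sample = (data - data.mean()) / sample_std
print("Z-scores (ddof=0):", z_pop)
print("Z-scores (ddof=1):", z_sample)
```

The gap between the two shrinks as the number of samples grows, so it mostly matters for small datasets like this one.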

## Standardization Using scikit-learn

In [2]:
from sklearn.preprocessing import StandardScaler
import numpy as np

def sklearn_standardize_data(data):
    """
    Standardize the dataset using sklearn's StandardScaler.

    Parameters:
    - data: numpy array of data to be standardized.

    Returns:
    - standardized_data: numpy array of standardized data.
    """
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(data.reshape(-1, 1)).flatten()
    return standardized_data

# Example usage
data = np.array([1, 2, 3, 4, 5])
standardized_data = sklearn_standardize_data(data)
print("Standardized Data (sklearn):", standardized_data)


Standardized Data (sklearn): [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]


### **Explanation:**

- **`StandardScaler()`:** Creates an instance of the `StandardScaler` class.
- **`fit_transform(data.reshape(-1, 1))`:** Fits the scaler to the data and transforms it in one step. The `reshape(-1, 1)` is necessary because `sklearn` expects a 2D array of shape `(n_samples, n_features)`; here the 1D array becomes a single feature column.
- **`flatten()`:** Converts the output back to a 1D array.

## Full Implementation with Multiple Methods

In [3]:
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize_data(data, method='manual'):
    """
    Standardize the dataset using either manual calculation or sklearn's StandardScaler.

    Parameters:
    - data: numpy array of data to be standardized.
    - method: string, either 'manual' for manual calculation or 'sklearn' for sklearn's StandardScaler.

    Returns:
    - standardized_data: numpy array of standardized data.
    """
    if method == 'manual':
        # Manual Standardization
        mean = np.mean(data)
        std_dev = np.std(data)
        standardized_data = (data - mean) / std_dev
    elif method == 'sklearn':
        # Standardization using sklearn
        scaler = StandardScaler()
        standardized_data = scaler.fit_transform(data.reshape(-1, 1)).flatten()
    else:
        raise ValueError("Unknown method. Use 'manual' or 'sklearn'.")

    return standardized_data

# Example usage
data = np.array([1, 2, 3, 4, 5])
method = 'manual'  # Choose between 'manual' and 'sklearn'
standardized_data = standardize_data(data, method)
print(f"Standardized Data ({method}):", standardized_data)


Standardized Data (manual): [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]


### **Explanation:**

- **`method` parameter:** This allows you to choose between manual standardization and using `sklearn`.
    - **`method='manual'`:** Uses the manual standardization method described above.
    - **`method='sklearn'`:** Uses the `sklearn` implementation of standardization.
- **`standardize_data` function:** This function handles both methods of standardization based on the `method` parameter, making it flexible for different use cases.
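
As a quick sanity check, the two approaches can be compared directly. This is a minimal, self-contained sketch that repeats the manual formula and the `StandardScaler` call inline rather than calling the function above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([1, 2, 3, 4, 5], dtype=float)

# Manual standardization with the population standard deviation.
manual = (data - data.mean()) / data.std()

# The same data through StandardScaler (needs a 2D column, then back to 1D).
via_sklearn = StandardScaler().fit_transform(data.reshape(-1, 1)).ravel()

# np.allclose tolerates tiny floating-point differences between the two paths.
agree = np.allclose(manual, via_sklearn)
print("Methods agree:", agree)
```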

## Handling Multidimensional Data

When dealing with multidimensional data (e.g., multiple features), you would typically standardize each feature independently.

In [4]:
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize_multidimensional_data(data, method='manual'):
    """
    Standardize each feature in a multidimensional dataset.

    Parameters:
    - data: 2D numpy array where each column represents a feature.
    - method: string, either 'manual' for manual calculation or 'sklearn' for sklearn's StandardScaler.

    Returns:
    - standardized_data: 2D numpy array of standardized data.
    """
    if method == 'manual':
        # Manual Standardization for each feature
        standardized_data = (data - np.mean(data, axis=0)) / np.std(data, axis=0)
    elif method == 'sklearn':
        # Standardization using sklearn
        scaler = StandardScaler()
        standardized_data = scaler.fit_transform(data)
    else:
        raise ValueError("Unknown method. Use 'manual' or 'sklearn'.")

    return standardized_data

# Example usage
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
method = 'manual'  # Choose between 'manual' and 'sklearn'
standardized_data = standardize_multidimensional_data(data, method)
print(f"Standardized Multidimensional Data ({method}):\n", standardized_data)


Standardized Multidimensional Data (manual):
 [[-1.41421356 -1.41421356]
 [-0.70710678 -0.70710678]
 [ 0.          0.        ]
 [ 0.70710678  0.70710678]
 [ 1.41421356  1.41421356]]


### **Explanation:**

- **Multidimensional Data:** Each feature (column) is standardized independently.
- **`axis=0`:** Ensures that the mean and standard deviation are calculated for each column (feature) separately.
- **`scaler.fit_transform(data)`:** The `sklearn` method automatically handles multidimensional data.
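
One edge case the manual formula does not cover is a constant feature: its standard deviation is 0, so the division produces `nan` values. The sketch below shows one possible guard, replacing a zero standard deviation with 1, which matches how `StandardScaler` treats zero-variance features. The array and the `safe_std` name are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second column is constant.
data = np.array([[1.0, 7.0], [2.0, 7.0], [3.0, 7.0]])

std = np.std(data, axis=0)

# Guard against division by zero: treat a zero standard deviation as 1,
# so a constant column is centered to 0 instead of becoming nan.
safe_std = np.where(std == 0, 1.0, std)
manual = (data - np.mean(data, axis=0)) / safe_std

# StandardScaler applies the same kind of guard internally.
via_sklearn = StandardScaler().fit_transform(data)

print("Manual with guard:\n", manual)
print("Matches sklearn:", np.allclose(manual, via_sklearn))
```

A constant column carries no information for most models, so dropping it before scaling is another reasonable option.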