# Why Preprocess?
---

Preprocessing is needed for a machine learning model to work efficiently and produce accurate results. Raw data is rarely ready to be used directly, so we first clean and prepare it; this process is called data preprocessing. Preprocessing can involve scaling numbers to a certain range, normalizing the data to avoid huge differences between values, or turning words into numbers that computers can work with.

---

### Preprocessing Techniques

### `Standardization or Mean Removal`

---

### Dataset Example: Standardization

Let’s say we have a small dataset of test scores for five students:

$70, 80, 90, 100, 110$

### Step 1: Find the mean (average)  
To calculate the mean:  

$\text{mean} = \frac{70 + 80 + 90 + 100 + 110}{5} = \frac{450}{5} = 90$

---
### Step 2: Subtract the mean from each value  
Now, we subtract 90 (the mean) from each score:

$70 - 90 = -20$  
$80 - 90 = -10$  
$90 - 90 = 0$  
$100 - 90 = 10$  
$110 - 90 = 20$

So the adjusted scores are:  

$-20, -10, 0, 10, 20$

---

### Step 3: Divide by the standard deviation  
Next, we need the standard deviation. Computed from the data, the population standard deviation is $\sqrt{\frac{(-20)^2 + (-10)^2 + 0^2 + 10^2 + 20^2}{5}} = \sqrt{200} \approx 14.14$. To keep the arithmetic simple, we round this to 15 for the rest of the example.

Now we divide each adjusted score by 15 (the standard deviation):

$\frac{-20}{15} \approx -1.33$  
$\frac{-10}{15} \approx -0.67$  
$\frac{0}{15} = 0$  
$\frac{10}{15} \approx 0.67$  
$\frac{20}{15} \approx 1.33$

So, the standardized scores are approximately:  

$-1.33, -0.67, 0, 0.67, 1.33$

These values are now centered around 0, and have no bias towards larger or smaller numbers, making it easier for machine learning algorithms to process the data.
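The three steps can be reproduced with NumPy. This is a minimal sketch that computes the population standard deviation (about 14.14) rather than using the rounded value of 15, so its z-scores differ slightly from the hand-worked ones.

```python
import numpy as np

scores = np.array([70, 80, 90, 100, 110], dtype=float)

mean = scores.mean()         # Step 1: the mean is 90.0
centered = scores - mean     # Step 2: -20, -10, 0, 10, 20
std = scores.std()           # population standard deviation (ddof=0), about 14.14
z_scores = centered / std    # Step 3: about -1.41, -0.71, 0, 0.71, 1.41

print("Mean:", mean)
print("Standard deviation:", round(std, 2))
print("Standardized scores:", np.round(z_scores, 2))
```

With the exact standard deviation, the standardized scores have a mean of 0 and a standard deviation of exactly 1.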

---

When we say the values are "centered around 0" and "have no bias towards larger or smaller numbers," we mean that, after standardization, the data is more balanced and easier for machine learning algorithms to handle. Here’s why:

### Centered around 0:
After subtracting the mean, the average of all the values becomes 0. Some values will be slightly below zero (negative), and some will be above zero (positive). This makes the data more evenly distributed around a central point, helping algorithms not to focus too much on larger numbers.

### No bias towards larger or smaller numbers:
Before standardization, one feature (like an income measured in tens of thousands) might have much larger values than another (like a test score between 70 and 110). This could cause a machine learning model to give more importance to the feature with bigger numbers, even though that might not be helpful. Standardization adjusts the scale of each feature so that they’re all in a similar range, and no feature dominates just because its numbers are larger.

In short, standardization ensures that all features are treated equally by the model, without any one feature skewing the results due to its size. This makes it easier for algorithms to learn patterns from the data fairly.
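To see how a large-valued feature can dominate, here is a small sketch using Euclidean distance. The four people and their age and income values are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is income in dollars.
people = np.array([[25, 50_000],
                   [60, 51_000],
                   [26, 53_000],
                   [45, 90_000]], dtype=float)

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw features: income differences swamp age differences.
raw_to_1 = distance(people[0], people[1])   # 35 years apart, $1,000 apart
raw_to_2 = distance(people[0], people[2])   # 1 year apart, $3,000 apart

# Standardize each column, then measure the same distances again.
standardized = (people - people.mean(axis=0)) / people.std(axis=0)
std_to_1 = distance(standardized[0], standardized[1])
std_to_2 = distance(standardized[0], standardized[2])

print("Raw:         ", round(raw_to_1, 1), round(raw_to_2, 1))
print("Standardized:", round(std_to_1, 2), round(std_to_2, 2))
```

On the raw data, person 1 looks closer to person 0 despite the 35-year age gap, because only income matters at that scale. After standardization, person 2 is the closer one.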



In [29]:
from sklearn import preprocessing
import numpy as np

In [23]:
data = np.array([[8,-3,4,-7],[4,-3,6,8],[6,6,-2,1],[5,6,7,8]])

In [25]:
print("Mean: ",data.mean(axis=0))
print('Standard Deviation: ',data.std(axis=0))

Mean:  [5.75 1.5  3.75 2.5 ]
Standard Deviation:  [1.47901995 4.5        3.49106001 6.18465844]


In [39]:
data_standardized= preprocessing.scale(data)

In [41]:
print("Mean standardized data: ",data_standardized.mean(axis=0))
print("Standard Deviation standardized data: ",data_standardized.std(axis=0))

Mean standardized data:  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 -2.77555756e-17]
Standard Deviation standardized data:  [1. 1. 1. 1.]


In [49]:
print("Mean standardized data: ", np.round(data_standardized.mean(axis=0), 2))
print("Standard Deviation standardized data: ", data_standardized.std(axis=0))

Mean standardized data:  [ 0.  0.  0. -0.]
Standard Deviation standardized data:  [1. 1. 1. 1.]


In [51]:
print(data_standardized)

[[ 1.52127766 -1.          0.07161149 -1.53605896]
 [-1.18321596 -1.          0.64450339  0.88929729]
 [ 0.16903085  1.         -1.64706421 -0.24253563]
 [-0.50709255  1.          0.93094934  0.88929729]]


### `Z-Score Standardization`
---
The `sklearn.preprocessing` package includes helpful tools for adjusting the features of your data to make them easier to work with. One common method is the `scale()` function, which performs z-score standardization. 

- **What is a z-score?**  
  A z-score tells you how far away a particular data point is from the average (mean) of the dataset. It shows this distance in terms of standard deviations, which are a measure of how spread out the data is.
  
  - **Positive z-scores** mean that the data point is above the average.
  - **Negative z-scores** mean that the data point is below the average.
  
In simpler terms, z-scores help you understand where a specific value stands compared to the average of the whole set. 
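The z-score definition maps onto one line of NumPy. The sketch below applies it column by column to the same array used in the cells above, assuming the population standard deviation (`ddof=0`, NumPy's default).

```python
import numpy as np

data = np.array([[8, -3, 4, -7],
                 [4, -3, 6, 8],
                 [6, 6, -2, 1],
                 [5, 6, 7, 8]])

# z = (x - mean) / std, computed separately for each column (axis=0).
z_scores = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.round(z_scores, 4))
```

The result matches the standardized array printed earlier, which suggests that `scale()` applies this same formula per column.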

---
### Importance of Standardization

Standardization is especially important when we don’t know the minimum and maximum values in our data. In such cases, we can’t use other methods that rely on knowing these boundaries. 

- **What happens after standardization?**  
  After standardization, the new values (z-scores) don’t have specific minimum or maximum limits. They’re simply adjusted around the average. 

- **How does it handle outliers?**  
  This technique is less affected by extreme values (outliers) than methods that rescale using the minimum and maximum, because a single extreme value does not fix the ends of the output range. The mean and standard deviation are still pulled by outliers, though, so very extreme values are worth checking before standardizing.

In summary, standardization with z-scores is a way to make data easier to compare and analyze, especially when we don’t know the limits of the data or when a few moderate outliers are present.


### `Data Scaling`

### Importance of Scaling Data in Machine Learning

When working with many machine learning algorithms, it is important to scale your data before training the model. Scaling, or rescaling, involves adjusting the range of the data features so that they have a common scale without distorting differences in the ranges of values.

---
#### Why Scale Data?

1. **Elimination of Units:**
   Scaling removes the units of measurement from the data. This means that all features are treated equally, regardless of their original units (e.g., meters, kilometers, pounds). This is particularly important when features have different units or scales, as it ensures that no single feature dominates the model due to its magnitude.

2. **Improved Comparability:**
   When data is scaled, it becomes easier to compare features from different locations or sources. For example, if you are comparing temperature data from two cities, scaling allows you to analyze how temperatures relate to each other without the influence of differing measurement systems or ranges.

3. **Enhanced Algorithm Performance:**
   Many machine learning algorithms trained with gradient descent (like linear regression and neural networks) converge faster and often perform better when the data is scaled. When features share a similar scale, the loss surface is less stretched along any one direction, so a single learning rate makes steady progress for every weight.

4. **Reduced Dominance of Large Values:**
   Scaling brings all features into a similar range, so a feature with large raw values no longer dominates algorithms that are sensitive to the scale of the input data. Scaling does not remove outliers, however, and min-max scaling in particular is sensitive to them.
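The "elimination of units" point can be checked directly: standardizing the same measurements expressed in two different units gives the same result. The heights below are hypothetical, and `standardize` is a helper defined only for this sketch.

```python
import numpy as np

def standardize(values):
    """Return z-scores using the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical heights in metres, and the same heights in centimetres.
heights_m = np.array([1.60, 1.72, 1.85, 1.91])
heights_cm = heights_m * 100

same = np.allclose(standardize(heights_m), standardize(heights_cm))
print("Same z-scores in both units:", same)
```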

---
### Common Scaling Techniques

- **Min-Max Scaling:** Rescales the data to a fixed range, usually [0, 1].
- **Standardization (Z-score normalization):** Rescales each feature to have a mean of 0 and a standard deviation of 1.

In summary, scaling data is an essential preprocessing step in machine learning that improves the performance and accuracy of models by ensuring that all features are on a comparable scale.


### Min-Max Scaler

The Min-Max Scaler is a popular technique for scaling features in machine learning. It transforms the data into a specified range, typically [0, 1]. This ensures that all features contribute equally to the model training process.

---
#### How Min-Max Scaling Works

The Min-Max scaling process involves the following steps:

1. **Identify Minimum and Maximum Values:**
   For each feature (column) in the dataset, identify the minimum value ($X_{min}$) and the maximum value ($X_{max}$).

2. **Apply the Min-Max Formula:**
   Use the following formula to scale each value ($X$) in the dataset:

   $X' = \frac{X - X_{min}}{X_{max} - X_{min}}$

   Here, $X'$ is the scaled value, $X$ is the original value, $X_{min}$ is the minimum value of the feature, and $X_{max}$ is the maximum value of the feature.

3. **Resulting Range:**
   After applying the formula, all the transformed values will lie within the range [0, 1]. This allows for easier comparison of features with different original scales.
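The steps above can be sketched in plain NumPy, using the same array as the standardization cells. One assumption to note: the sketch requires every column to contain at least two distinct values, otherwise the denominator is zero.

```python
import numpy as np

data = np.array([[8, -3, 4, -7],
                 [4, -3, 6, 8],
                 [6, 6, -2, 1],
                 [5, 6, 7, 8]])

# Step 1: minimum and maximum of each feature (column).
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Step 2: apply X' = (X - X_min) / (X_max - X_min) column by column.
scaled = (data - col_min) / (col_max - col_min)

# Step 3: every column now spans exactly [0, 1].
print(scaled)
```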

---
#### Advantages of Min-Max Scaling

- **Uniform Scale:** It brings all features into the same scale, which can improve the performance of algorithms that are sensitive to the scale of data.
- **Preserves Relationships:** Min-Max scaling maintains the relationships between the values, which is beneficial for many machine learning algorithms.

#### Disadvantages of Min-Max Scaling

- **Sensitive to Outliers:** Since the minimum and maximum values determine the scaling, outliers can significantly affect the scaled values. If an outlier is present, it can compress the range of other values.
- **Does Not Center the Data:** Min-Max scaling does not center the data around zero, which may not be ideal for algorithms that work best with zero-centered inputs.
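The outlier problem is easy to reproduce. In this sketch the feature values are hypothetical and `min_max` is a helper written only for the illustration.

```python
import numpy as np

def min_max(values):
    """Rescale a 1-D feature to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# Hypothetical feature, without and with a single extreme value.
clean = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 1000]

scaled_clean = min_max(clean)            # evenly spread: 0, 0.25, 0.5, 0.75, 1
scaled_outlier = min_max(with_outlier)   # first four values end up below 0.04

print(np.round(scaled_clean, 3))
print(np.round(scaled_outlier, 3))
```

A single value of 1000 takes the top of the range, and the other four values, which were evenly spread before, are squeezed into a narrow band near 0.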

In summary, the Min-Max Scaler is a straightforward and effective way to scale features, making it an essential tool in the data preprocessing phase of machine learning.


In [62]:
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

In [64]:
data_scaled = data_scaler.fit_transform(data)

In [66]:
print("Min: ",data.min(axis=0))
print("Max: ",data.max(axis=0))

Min:  [ 4 -3 -2 -7]
Max:  [8 6 7 8]


In [70]:
print("Min: ",data_scaled.min(axis=0))
print("Max: ",data_scaled.max(axis=0))

Min:  [0. 0. 0. 0.]
Max:  [1. 1. 1. 1.]


In [72]:
print(data_scaled)

[[1.         0.         0.66666667 0.        ]
 [0.         0.         0.88888889 1.        ]
 [0.5        1.         0.         0.53333333]
 [0.25       1.         1.         1.        ]]


### `Normalization`

# Normalization: L1, L2, and Max Norm

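Normalization is a technique used to rescale each feature vector (one sample, that is, one row of the data) so that vectors can be compared on a common scale. Unlike the scaling methods above, which work column by column, normalization works on each row independently. Different normalization techniques serve different purposes depending on the needs of the machine learning algorithm. Below are three common types of normalization.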

---

## 1. L1 Normalization

**When to Use**:  
- Use when you want the sum of the absolute values of the features in a vector to equal 1.
- Commonly used with sparse data, such as word counts in text classification, where each vector is turned into proportions, and with distance-based algorithms such as KNN.

**How to Use**:  
L1 normalization scales each value by dividing it by the sum of absolute values in the vector. The formula is as follows:

$ x' = \frac{x}{\sum |x|} $

- `x` is the original value.
- `x'` is the normalized value.
- The sum of the absolute values of the vector elements equals 1.

**Example**:  
For a vector $ x = [3, 4, 5] $:

1. $ \sum |x| = 3 + 4 + 5 = 12 $

$ x' = \frac{x}{12} $

The normalized values would be:

$ x' = [0.25, 0.33, 0.42] $
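The same calculation as a short NumPy sketch, using the vector from the example:

```python
import numpy as np

x = np.array([3, 4, 5], dtype=float)

l1_norm = np.abs(x).sum()   # sum of absolute values: 12.0
x_l1 = x / l1_norm          # about 0.25, 0.33, 0.42

print(np.round(x_l1, 2))
print(np.abs(x_l1).sum())   # 1.0, up to floating-point rounding
```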

---

## 2. L2 Normalization

**When to Use**:  
- Use when you want the length (Euclidean norm) of the vector to be 1.
- Often used in text classification, image processing, and algorithms where the direction of a vector matters more than its magnitude (for example, cosine similarity).

**How to Use**:  
L2 normalization scales each value in the vector by dividing it by the Euclidean norm (the square root of the sum of squared values). The formula is:

$ x' = \frac{x}{\sqrt{x_1^2 + x_2^2 + \dots + x_n^2}} = \frac{x}{\|x\|_2} $

- `x` is the original value.
- `x'` is the normalized value.
- The vector's L2 norm is the square root of the sum of squares of all values.

**Example**:  
For a vector $ x = [3, 4] $:

1. Calculate the L2 norm:

$ \|x\|_2 = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = 5 $

2. Normalize the vector:

$ x' = \frac{x}{5} = [0.6, 0.8] $
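In NumPy the two steps look like this; `np.linalg.norm` is used at the end only to confirm that the result has unit length.

```python
import numpy as np

x = np.array([3, 4], dtype=float)

l2_norm = np.sqrt(np.sum(x ** 2))   # step 1: Euclidean norm, 5.0
x_l2 = x / l2_norm                  # step 2: 0.6, 0.8

print(x_l2)
print(np.linalg.norm(x_l2))         # length of the normalized vector, 1.0 up to rounding
```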

---

## 3. Max Normalization

**When to Use**:  
- Use when you want to scale values based on the maximum value in the feature.
- Suitable when the feature’s range is unbounded, but you need to bring all values into proportion relative to the maximum value.
- Can be useful in cases where you don’t want to distort the original data too much, but still want to normalize.

**How to Use**:  
Max normalization scales each value in the vector by dividing it by the maximum value in the vector. The formula is:

$ x' = \frac{x}{\max(x)} $

- `x` is the original value.
- `x'` is the normalized value.
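- $ \max(x) $ is the maximum value in the vector. (For vectors with negative entries, scikit-learn's `normalize` divides by the maximum absolute value.)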

**Example**:  
For a feature $ x = [1, 3, 6, 9] $:

1. Find the maximum value:

$ \max(x) = 9 $

2. Normalize the vector:

$ x' = \frac{x}{9} $

The normalized values would be:

$ x' = [0.11, 0.33, 0.67, 1] $
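A NumPy sketch of the same example. It follows the formula as written, so it assumes the values are non-negative.

```python
import numpy as np

x = np.array([1, 3, 6, 9], dtype=float)

max_value = x.max()     # step 1: the maximum is 9.0
x_max = x / max_value   # step 2: about 0.11, 0.33, 0.67, 1.0

print(np.round(x_max, 2))
```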

---

## Summary

- **L1 Normalization**: Scales so that the sum of absolute values equals 1. Use it for sparse data or when the values should read as proportions of a whole.
- **L2 Normalization**: Scales so that the vector's length equals 1. Use it for distance-based algorithms where vector direction matters.
- **Max Normalization**: Scales so that the largest value in the vector equals 1. Use it when you want to maintain the relative proportions between values.


In [100]:
# Sample data
data = np.array([[4, 1, 2], [1, 3, 9], [5, 7, 3]])

# Apply L1 normalization
l1_normalized_data = preprocessing.normalize(data, norm='l1')

print("L1 Normalized Data:\n", l1_normalized_data)

L1 Normalized Data:
 [[0.57142857 0.14285714 0.28571429]
 [0.07692308 0.23076923 0.69230769]
 [0.33333333 0.46666667 0.2       ]]


In [102]:
# Apply L2 normalization
l2_normalized_data = preprocessing.normalize(data, norm='l2')

print("L2 Normalized Data:\n", l2_normalized_data)


L2 Normalized Data:
 [[0.87287156 0.21821789 0.43643578]
 [0.10482848 0.31448545 0.94345635]
 [0.5488213  0.76834982 0.32929278]]


In [98]:
# Apply Max normalization
max_normalized_data = preprocessing.normalize(data, norm='max')

print("Max Normalized Data:\n", max_normalized_data)


Max Normalized Data:
 [[1.         0.25       0.5       ]
 [0.11111111 0.33333333 1.        ]
 [0.71428571 1.         0.42857143]]


### `Binarization`

**Binarization** is a process used to convert numerical data into two categories: typically represented by **0** and **1**. This is especially useful when working with machine learning or image processing tasks where you want to simplify data for better recognition and analysis.

---

#### How Binarization Works
In binarization, each value is compared to a chosen threshold:
- If a value is above the threshold, it is converted to **1**.
- If a value is equal to or below the threshold, it is converted to **0**.

This process transforms the original data into a **Boolean vector** (a sequence of 0s and 1s), which is easier to work with in many cases.
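The thresholding rule is a single comparison in NumPy. In this sketch, which uses the small array from the normalization cells and a threshold of 3, values exactly equal to the threshold map to 0 because the comparison is strict.

```python
import numpy as np

data = np.array([[4, 1, 2],
                 [1, 3, 9],
                 [5, 7, 3]])
threshold = 3

# True where the value is strictly above the threshold, then cast to 0/1.
binarized = (data > threshold).astype(int)

print(binarized)
```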

---

#### Example in Image Processing
In **digital image processing**, binarization is often used to convert a color or grayscale image into a **binary image**. A binary image consists of only two colors, typically **black** and **white**. This is useful for object recognition, shape analysis, or character recognition, where the goal is to separate objects (like letters or shapes) from the background.

For example, in character recognition:
- The image of a letter on a white background is converted into just black and white pixels.
- The letter becomes **black** (1), and the background turns **white** (0).
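As a rough sketch of the image case, the patch below is a hypothetical 5x5 grayscale image (0 is black, 255 is white) with a dark vertical stroke on a light background. The threshold of 128 is an assumed cut-off, and dark pixels are mapped to 1 to follow the convention described above.

```python
import numpy as np

# Hypothetical grayscale patch: a dark stroke in the middle column.
image = np.array([[250, 245, 30, 248, 252],
                  [249, 251, 25, 247, 250],
                  [251, 250, 20, 246, 249],
                  [248, 252, 35, 250, 251],
                  [250, 249, 28, 251, 248]])

threshold = 128  # assumed cut-off between "dark" and "light"

# Dark pixels (the stroke) become 1, light pixels (the background) become 0.
binary = (image < threshold).astype(np.uint8)

print(binary)
```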

---

#### Applications:
- **Object Recognition**: Identifying specific shapes or objects within an image.
- **Character Recognition**: Recognizing letters and numbers (like in OCR – Optical Character Recognition).
- **Skeletonization**: Reducing an object to its simplest form for easier recognition later on.

---

By using binarization, it becomes easier to separate the object of interest (like a letter or shape) from the background, making further analysis and recognition more accurate and efficient.


In [111]:
data_binarized = preprocessing.Binarizer(threshold=3).transform(data)

In [113]:
print(data_binarized)

[[1 0 0]
 [0 0 1]
 [1 1 0]]


### `One Hot Encoding`

### What is One-Hot Encoding?

One-hot encoding is a way to represent categorical data (non-numerical or integer categories) in a format that machine learning algorithms can understand. Instead of using the actual values (like 0, 1, 2), we represent each possible value with a separate column. 

---

### Why One-Hot Encoding?

When dealing with categorical data, machine learning algorithms might incorrectly assume that the categories have some kind of numerical importance or order (like 2 is greater than 1). One-hot encoding helps prevent this by creating new columns, where each column represents one possible value from the original data. 

Each category is then represented by a "1" in the corresponding column and "0" in all the other columns.

---

### Example Data

Let's say you have a small dataset with four rows (data points) and three columns (features), as shown below:

In [2]:
import numpy as np

data = np.array([[1, 1, 2], 
                 [0, 2, 3], 
                 [1, 0, 1], 
                 [0, 1, 0]])

print(data)

[[1 1 2]
 [0 2 3]
 [1 0 1]
 [0 1 0]]


Each number in the array represents a category for a feature.

### Analyzing the Values:
- **First feature** has two possible values: `0` and `1`
- **Second feature** has three possible values: `0`, `1`, and `2`
- **Third feature** has four possible values: `0`, `1`, `2`, and `3`

---

So, to fully represent the data using one-hot encoding, we need a total of:
- 2 columns for the first feature
- 3 columns for the second feature
- 4 columns for the third feature

This gives us a total of **9 columns** to represent our original 3 features.

---

### How One-Hot Encoding Works:
For each feature:
- The **first feature** (values `0` and `1`) will be encoded into **2 columns**
- The **second feature** (values `0`, `1`, `2`) will be encoded into **3 columns**
- The **third feature** (values `0`, `1`, `2`, `3`) will be encoded into **4 columns**

Each data point will be represented by a vector where one value is `1` (indicating the category), and the rest are `0`.
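The idea can be sketched by hand in NumPy. This sketch assumes the categories of each feature are the sorted distinct values seen in that column; `one_hot` is a helper written only for this illustration and does not handle values it has not seen.

```python
import numpy as np

data = np.array([[1, 1, 2],
                 [0, 2, 3],
                 [1, 0, 1],
                 [0, 1, 0]])

# Sorted distinct values for each column: 2, 3 and 4 categories.
categories = [np.unique(data[:, j]) for j in range(data.shape[1])]

def one_hot(row):
    """Concatenate one indicator block per feature."""
    blocks = []
    for value, cats in zip(row, categories):
        block = np.zeros(len(cats))
        block[np.where(cats == value)[0][0]] = 1.0   # mark the matching category
        blocks.append(block)
    return np.concatenate(blocks)

print(one_hot([1, 2, 3]))   # [0. 1. 0. 0. 1. 0. 0. 0. 1.]
```

The encoded vector has 2 + 3 + 4 = 9 entries, with exactly one `1` inside each feature's block.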


In [10]:
from sklearn import preprocessing

# Initialize the OneHotEncoder
encoder = preprocessing.OneHotEncoder()

# Fit the encoder on our dataset
encoder.fit(data)

In [16]:
# Transform the data using the fitted encoder
encoded_vector = encoder.transform([[1, 1, 2]]).toarray()

# Display the encoded vector
print(encoded_vector)

[[0. 1. 0. 1. 0. 0. 0. 1. 0.]]


In [18]:
# Transform the data using the fitted encoder
encoded_vector = encoder.transform([[1, 2, 2]]).toarray()

# Display the encoded vector
print(encoded_vector)

[[0. 1. 0. 0. 1. 0. 0. 1. 0.]]


In [20]:
# Transform the data using the fitted encoder
encoded_vector = encoder.transform([[1, 2, 3]]).toarray()

# Display the encoded vector
print(encoded_vector)

[[0. 1. 0. 0. 1. 0. 0. 0. 1.]]


### Understanding the Output:
For the last input, `[1, 2, 3]`:
- The **first feature** (value `1`) is represented by the second column being `1`: `[0, 1]`
- The **second feature** (value `2`) is represented by the fifth column being `1`: `[0, 0, 1]`
- The **third feature** (value `3`) is represented by the ninth column being `1`: `[0, 0, 0, 1]`

Each feature is now represented in a way that machine learning algorithms can understand without implying any kind of numerical relationship between the categories.

---

### Summary:
One-hot encoding converts categorical data into binary columns (`0`s and `1`s), where each column represents a specific category. This approach is essential when using machine learning models to avoid incorrectly implying any numerical relationship between categories.


### `Label Encoding`

Label encoding is a method used to convert labels (the target/output in supervised learning) into a numerical format.

## Why do we need Label Encoding?
In supervised learning, the labels or outputs can be either **numbers** (which the algorithm can use directly) or **words** (like "cat", "dog", or "apple"). Since machine learning algorithms work with numbers, we need to convert these word labels into numbers.

## How does Label Encoding work?
Label encoding assigns a unique number to each word label. For example:

- "cat" → 0
- "dog" → 1
- "apple" → 2

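So, if your labels were ["cat", "dog", "apple"], the algorithm would be able to use the encoded numbers [0, 1, 2] instead. The specific numbers are arbitrary, as long as each label gets its own. scikit-learn's `LabelEncoder` assigns them in sorted order, so it would map "apple" → 0, "cat" → 1, "dog" → 2.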

Label encoding is useful when the labels are words, and you need to make them understandable for the machine learning algorithm.
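A minimal pure-Python sketch of the idea, assuming codes are assigned in alphabetical order. The numbering itself is arbitrary as long as each label gets a unique code.

```python
labels = ["cat", "dog", "apple", "dog"]

# One code per distinct label, assigned in alphabetical order (an assumption).
classes = sorted(set(labels))                       # ['apple', 'cat', 'dog']
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]        # [1, 2, 0, 2]
decoded = [classes[code] for code in encoded]       # back to the original words

print(to_code)
print(encoded)
print(decoded)
```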


In [27]:
label_encoder = preprocessing.LabelEncoder()

In [29]:
input_class = ['adidas','reebok','adidas','nike','reebok','skechers']

In [33]:
label_encoder.fit(input_class)
print("Class Mapping: ")
for i, item in enumerate(label_encoder.classes_):
    print(item,"-->",i)

Class Mapping: 
adidas --> 0
nike --> 1
reebok --> 2
skechers --> 3


In [41]:
labels =['reebok','skechers','nike']
encoded_labels = label_encoder.transform(labels)
print("Label =", labels)
print("Encoded Labels =", list(encoded_labels))

Label = ['reebok', 'skechers', 'nike']
Encoded Labels = [2, 3, 1]


In [43]:
encoded_labels = [2,1,0,3,1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("Encoded Labels =", encoded_labels)
print("Decoded Labels =", decoded_labels)

Encoded Labels = [2, 1, 0, 3, 1]
Decoded Labels = ['reebok' 'nike' 'adidas' 'skechers' 'nike']


## Label Encoding vs. One-Hot Encoding

### Label Encoding:
Label encoding is a method that transforms categorical data (like "apple", "banana", "cherry") into numbers. However, it can cause problems because it gives an order to the numbers (like 0, 1, 2), which may make the algorithm think that one label is "greater" or "less" than another. This can be an issue if the algorithm tries to do math with these numbers, which isn't appropriate for categories that don't have a natural order.

### Why One-Hot Encoding is Better:
One-hot encoding solves this problem by representing each category as a separate column with a binary value (0 or 1). This way, no order is implied, and each category is treated equally. It's like putting each category in its own "box."

### When to Use Label Encoding vs. One-Hot Encoding:
In most cases, label encoding is not the right choice when the categories don't have a natural order (like "cat," "dog," or "apple"). It can mistakenly give the model a sense of hierarchy or order between categories, which could lead to incorrect predictions or assumptions.

Instead, one-hot encoding is a better choice because it treats each category independently without implying any ranking or order. 

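However, if your labels do have a meaningful order (like "low," "medium," "high"), then label encoding might be appropriate, provided the numbers follow that order (low → 0, medium → 1, high → 2). Note that scikit-learn's `LabelEncoder` numbers classes alphabetically, which would give high → 0, low → 1, medium → 2, so an explicit mapping is needed in that case.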

### In Summary:
- **Use one-hot encoding** for categorical data without any natural order.
- **Use label encoding** if the categories have a clear order or ranking.
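For ordered categories, the codes only help if they follow the ranking. The sketch below uses an explicit mapping; the level names and their order are assumptions made for the example.

```python
# Assumed ranking of the categories, from lowest to highest.
ordered_levels = ["low", "medium", "high"]
level_to_code = {level: code for code, level in enumerate(ordered_levels)}

ratings = ["medium", "low", "high", "low"]
encoded = [level_to_code[r] for r in ratings]   # [1, 0, 2, 0]

# For comparison: alphabetical numbering does not respect the ranking.
alphabetical = {level: code for code, level in enumerate(sorted(ordered_levels))}

print(level_to_code)   # {'low': 0, 'medium': 1, 'high': 2}
print(encoded)
print(alphabetical)    # {'high': 0, 'low': 1, 'medium': 2}
```

With the explicit mapping, a larger code always means a higher level. The alphabetical version would place "high" below "low".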

### Advantages and Disadvantages:

- **Advantage:** One-hot encoding is **binary**, not ordinal, and the categories are completely separate in the data.
  
- **Disadvantage:** If you have many categories (high cardinality), the number of columns (or feature space) can become very large, which can make the data harder to manage.

### In short:
One-hot encoding is great for preventing problems caused by giving categories an order, but it can create a lot of new columns if there are many categories.
