In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import RobustScaler, MaxAbsScaler, QuantileTransformer

# Example dataset
data = {
    'Feature1': [10, 200, 30, 400, 50],
    'Feature2': [1, 5, 10, 15, 20],
    'Feature3': [1000, 500, 2000, 1500, 3000]
}

df = pd.DataFrame(data)

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
df_min_max_scaled = min_max_scaler.fit_transform(df)
df_min_max_scaled = pd.DataFrame(df_min_max_scaled, columns=df.columns)

# Standardization
standard_scaler = StandardScaler()
df_standard_scaled = standard_scaler.fit_transform(df)
df_standard_scaled = pd.DataFrame(df_standard_scaled, columns=df.columns)

# Robust Scaling
robust_scaler = RobustScaler()
df_robust_scaled = robust_scaler.fit_transform(df)
df_robust_scaled = pd.DataFrame(df_robust_scaled, columns=df.columns)

# MaxAbs Scaling
max_abs_scaler = MaxAbsScaler()
df_max_abs_scaled = max_abs_scaler.fit_transform(df)
df_max_abs_scaled = pd.DataFrame(df_max_abs_scaled, columns=df.columns)

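# Quantile Transformer Scaling (normal distribution)
# n_quantiles must not exceed the number of samples; the default of 1000
# would trigger a warning on this 5-row dataset
quantile_transformer = QuantileTransformer(n_quantiles=len(df), output_distribution='normal')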
df_quantile_scaled = quantile_transformer.fit_transform(df)
df_quantile_scaled = pd.DataFrame(df_quantile_scaled, columns=df.columns)

# Print scaled dataframes
print("Min-Max Scaled Data:")
print(df_min_max_scaled)
print("\nStandard Scaled Data:")
print(df_standard_scaled)
print("\nRobust Scaled Data:")
print(df_robust_scaled)
print("\nMaxAbs Scaled Data:")
print(df_max_abs_scaled)
print("\nQuantile Scaled Data:")
print(df_quantile_scaled)


Min-Max Scaled Data:
   Feature1  Feature2  Feature3
0  0.000000  0.000000       0.2
1  0.487179  0.210526       0.0
2  0.051282  0.473684       0.6
3  1.000000  0.736842       0.4
4  0.102564  1.000000       1.0

Standard Scaled Data:
   Feature1  Feature2  Feature3
0 -0.869803 -1.354113 -0.697486
1  0.421311 -0.765368 -1.278724
2 -0.733896 -0.029437  0.464991
3  1.780378  0.706494 -0.116248
4 -0.597989  1.442425  1.627467

Robust Scaled Data:
   Feature1  Feature2  Feature3
0 -0.235294      -0.9      -0.5
1  0.882353      -0.5      -1.0
2 -0.117647       0.0       0.5
3  2.058824       0.5       0.0
4  0.000000       1.0       1.5

MaxAbs Scaled Data:
   Feature1  Feature2  Feature3
0     0.025      0.05  0.333333
1     0.500      0.25  0.166667
2     0.075      0.50  0.666667
3     1.000      0.75  0.500000
4     0.125      1.00  1.000000

Quantile Scaled Data:
   Feature1  Feature2  Feature3
0 -5.199338 -5.199338 -0.674490
1  0.674490 -0.674490 -5.199338
2 -0.674490  0.000000  0.674490
3  5.199338  0.674490  0.000000
4  0.000000  5.199338  5.199338



## Choosing the Right Scaling Method

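- Min-Max Scaling: When you need features in a fixed range such as [0, 1] and the data has no extreme outliers. It is a linear rescaling, so the shape of each feature's distribution and the relative spacing of its values are retained, but a single outlier can compress all other values into a narrow band.

- Standardization: When the algorithm works best with zero-mean, unit-variance features, for example regularized linear models, SVMs, or PCA. It suits roughly normally distributed features best, but it does not make the data normally distributed.

- Robust Scaling: When the data contains outliers. It centers each feature on its median and scales by the interquartile range, so extreme values have little influence on the scaling parameters.

- MaxAbs Scaling: When dealing with sparse data. It divides each feature by its maximum absolute value without shifting it, so zero entries stay zero and sparsity is preserved.

- Quantile Transformer Scaling: When you want the data to follow a specific distribution (uniform or normal). The mapping is non-linear: it preserves the rank order of the values but distorts the distances between them.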

**Choose the scaling method that best fits the characteristics of your dataset and the requirements of your machine learning algorithm.**
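
To make the point about outliers concrete, here is a minimal sketch comparing Min-Max and Robust scaling on a single feature. The values are hypothetical and chosen only for illustration: four typical observations and one extreme outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical feature: four typical values and one extreme outlier
values = np.array([[10.0], [12.0], [14.0], [16.0], [1000.0]])

# Min-Max uses the minimum and maximum, so the outlier defines the range
min_max = MinMaxScaler().fit_transform(values).ravel()

# Robust scaling uses the median and interquartile range instead
robust = RobustScaler().fit_transform(values).ravel()

print("Min-Max:", np.round(min_max, 4))
print("Robust: ", np.round(robust, 4))
```

With Min-Max scaling the four typical values end up squeezed into less than 1% of the [0, 1] range, while Robust scaling keeps them evenly spread around zero. The outlier is still an outlier after Robust scaling; it just no longer dictates how the other values are scaled.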

## What is Sparse Data?

Sparse data refers to datasets where the majority of the values are zeros or are absent. This kind of data structure is common in various fields such as natural language processing, recommendation systems, and more. Here are some key characteristics and examples of sparse data:

### Characteristics of Sparse Data

- High Dimensionality: Sparse data often has a large number of dimensions (features or variables), but only a small fraction of the possible values are non-zero.
- Many Zero Entries: The matrix or dataset contains a high proportion of zero values compared to non-zero values.
- Storage Efficiency: Sparse data can be stored more efficiently using specialized data structures that only store the non-zero entries.

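
The storage-efficiency point can be checked directly. The sketch below builds a hypothetical 1000 x 1000 matrix with roughly 1% non-zero entries (the size and fill rate are assumptions for illustration) and compares the memory used by the dense array with the memory used by the three arrays of its CSR representation.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Hypothetical 1000 x 1000 matrix with at most 10,000 non-zero entries
dense = np.zeros((1000, 1000), dtype=np.float64)
rows = rng.integers(0, 1000, size=10_000)
cols = rng.integers(0, 1000, size=10_000)
dense[rows, cols] = 1.0

sparse = csr_matrix(dense)

# Density = fraction of entries that are actually stored (non-zero)
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])

# A dense array stores every entry; CSR stores only values plus index arrays
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"Density: {density:.4f}")
print(f"Dense storage: {dense_bytes:,} bytes")
print(f"CSR storage:   {sparse_bytes:,} bytes")
```

The saving grows with sparsity: at densities close to 50% the index arrays can make CSR larger than the dense array, so sparse formats pay off only when most entries are zero.
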
### Examples of Sparse Data

- Text Data in NLP: When representing text data using a bag-of-words or TF-IDF model, each document is represented as a vector of word counts or frequencies. Since any given document uses only a small subset of all possible words, the resulting matrix is sparse.
- Recommendation Systems: User-item rating matrices in recommendation systems are typically sparse because any given user has rated only a small fraction of all available items.
- One-Hot Encoding: When categorical variables are converted into binary vectors through one-hot encoding, the resulting vectors are sparse since only one element is '1' and the rest are '0'.
- Image Data: In some image processing or computer vision tasks, the data can be sparse if most of the pixels in an image are background (zeros) and only a few pixels contain meaningful information (non-zeros).

### Handling Sparse Data

Handling sparse data effectively is crucial for efficient storage and computation. Common techniques and data structures include:

- Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC): Storage formats that keep only the non-zero values plus compact index arrays. CSR stores the column index of each value and a pointer to where each row starts, which makes row slicing and matrix-vector products fast. CSC is the column-oriented counterpart.
- SciPy's Sparse Module: In Python, the scipy.sparse module provides classes and functions to create and manipulate sparse matrices.
- Sparse Data Structures: pandas offers sparse data structures such as pd.arrays.SparseArray and sparse column dtypes. NumPy arrays are always dense, so sparse numerical work in Python is usually done with scipy.sparse.
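
Sparsity also affects which scaler can be used. The sketch below, on a small hypothetical CSR matrix, shows that MaxAbsScaler keeps the zeros in place, whereas StandardScaler with its default centering rejects sparse input because subtracting the mean would turn the zeros into non-zero values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Hypothetical sparse feature matrix (rows = samples, columns = features)
X = csr_matrix(np.array([
    [0.0, 4.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, -5.0],
    [1.0, 8.0, 0.0],
]))

# MaxAbsScaler divides each column by its maximum absolute value,
# so zeros stay zero and the result is still sparse
X_scaled = MaxAbsScaler().fit_transform(X)
print("Stored values before/after:", X.nnz, X_scaled.nnz)
print(X_scaled.toarray())

# StandardScaler cannot center sparse input; it raises a ValueError
try:
    StandardScaler().fit_transform(X)
except ValueError as err:
    print("StandardScaler with centering failed:", err)

# Scaling by the standard deviation only (no centering) is allowed
X_std = StandardScaler(with_mean=False).fit_transform(X)
print("Stored values after variance scaling:", X_std.nnz)
```

Both MaxAbsScaler and StandardScaler(with_mean=False) only multiply each column by a constant, which is why the number of stored values does not change.
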
### Example in Python

Here's an example of creating and manipulating a sparse matrix using the scipy.sparse module in Python:

In [3]:
import numpy as np
from scipy.sparse import csr_matrix

# Example dense matrix
dense_matrix = np.array([
    [0, 0, 1, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 3],
    [0, 4, 0, 0]
])

# Convert dense matrix to CSR (Compressed Sparse Row) format
sparse_matrix = csr_matrix(dense_matrix)

# Print sparse matrix
print("Sparse Matrix (CSR):")
print(sparse_matrix)

# Accessing data in sparse matrix
print("\nData in Sparse Matrix:")
print(sparse_matrix.data)

# Converting sparse matrix back to dense format
dense_matrix_back = sparse_matrix.toarray()
print("\nConverted Back to Dense Matrix:")
print(dense_matrix_back)

Sparse Matrix (CSR):
  (0, 2)	1
  (1, 0)	2
  (2, 3)	3
  (3, 1)	4

Data in Sparse Matrix:
[1 2 3 4]

Converted Back to Dense Matrix:
[[0 0 1 0]
 [2 0 0 0]
 [0 0 0 3]
 [0 4 0 0]]
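
To see what the CSR format actually stores for the matrix above, the following sketch prints its three internal arrays and uses them to walk the non-zero entries row by row. The attribute names data, indices, and indptr are part of SciPy's csr_matrix; the loop itself is only an illustration of how the arrays fit together.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense_matrix = np.array([
    [0, 0, 1, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 3],
    [0, 4, 0, 0],
])
sparse_matrix = csr_matrix(dense_matrix)

# data:    the non-zero values, stored row by row
# indices: the column index of each value in data
# indptr:  row i occupies the slice indptr[i]:indptr[i + 1] of data/indices
print("data:   ", sparse_matrix.data)
print("indices:", sparse_matrix.indices)
print("indptr: ", sparse_matrix.indptr)

# Walk the matrix row by row using the three arrays
for i in range(sparse_matrix.shape[0]):
    start, end = sparse_matrix.indptr[i], sparse_matrix.indptr[i + 1]
    for col, value in zip(sparse_matrix.indices[start:end],
                          sparse_matrix.data[start:end]):
        print(f"row {i}, col {col}: {value}")
```

Because each row's entries are contiguous, extracting a row or multiplying by a vector only touches the stored values, which is why CSR is a common choice for row-oriented operations.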
