# Data Preparation

Once data has been loaded and some initial cleaning is done, it is often necessary to perform several additional rounds of data preparation. This can include scaling data, removing outliers, re-encoding variables, and imputing missing data. We'll cover these aspects below.

## Scaling Data

Many machine learning algorithms, such as k-means clustering and support vector machines, use distance metrics like Euclidean distance to compare points in the feature space. Features with large numeric ranges can dominate the distance computation, thereby affecting the algorithm's performance. Similarly, optimization algorithms like gradient descent converge more quickly when features are on similar scales.
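
To make the distance argument concrete, here is a minimal sketch with two made-up points. The points, the assumed feature ranges, and the helper name `squared_distance_shares` are illustrative and not part of the example that follows. The points differ by 80% of the range of the first feature but only 1% of the range of the second.

```python
import numpy as np

# Two hypothetical points: feature 0 lives on a 0-10 scale, feature 1 on a 0-1000 scale
a = np.array([1.0, 200.0])
b = np.array([9.0, 210.0])
feature_ranges = np.array([10.0, 1000.0])  # assumed ranges, for illustration only

def squared_distance_shares(p, q):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squared_diffs = (p - q) ** 2
    return squared_diffs / squared_diffs.sum()

raw_shares = squared_distance_shares(a, b)
scaled_shares = squared_distance_shares(a / feature_ranges, b / feature_ranges)

print("Raw shares:   ", raw_shares.round(4))
print("Scaled shares:", scaled_shares.round(4))
```

On the raw values, the second feature contributes about 61% of the squared distance even though the points are nearly identical along it. After dividing by the ranges, almost all of the distance comes from the first feature.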

### Working Example - Impact of Scaling on K-Means Clustering

Let's consider a simple synthetic dataset with two features, `X1` and `X2`. `X1` ranges from 0 to 10, and its values fall into two separate groups along the axis (0 to 4 and 6 to 10). `X2` ranges from 0 to 1000 and has no group structure, yet its large scale means it dominates the distance function. We'll cluster the data using k-means before and after scaling to see the difference.


In [None]:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
X1a = np.random.uniform(0, 4, 50)
X1b = np.random.uniform(6, 10, 50)
X1 = np.concatenate([X1a,X1b])
X2 = np.random.uniform(0, 1000, 100)
X = np.column_stack((X1, X2))

# Cluster without scaling
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('K-Means Clustering without Scaling')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()

# Scale data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster after scaling
kmeans = KMeans(n_clusters=2, random_state=0)
labels = kmeans.fit_predict(X_scaled)

plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.title('K-Means Clustering with Scaling')
plt.xlabel('X1 (scaled)')
plt.ylabel('X2 (scaled)')
plt.show()
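
The plots show the difference visually. One way to put a number on it is to compare the cluster labels against the two known `X1` groups using the adjusted Rand index, where values near 1 mean the clusters match the groups and values near 0 mean they are unrelated. The sketch below is a self-contained variant, not the cell above: it draws its own data with a separate random generator, and the names `true_groups` and `cluster_agreement`, as well as the choice of `n_init=10`, are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Rebuild data shaped like the example: two groups along x1, uniform noise along x2
rng = np.random.default_rng(0)
x1 = np.concatenate([rng.uniform(0, 4, 50), rng.uniform(6, 10, 50)])
x2 = rng.uniform(0, 1000, 100)
X_demo = np.column_stack((x1, x2))
true_groups = np.repeat([0, 1], 50)  # first 50 points are the low-x1 group

def cluster_agreement(data):
    """Adjusted Rand index between k-means labels and the known x1 groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return adjusted_rand_score(true_groups, labels)

ari_raw = cluster_agreement(X_demo)
ari_scaled = cluster_agreement(StandardScaler().fit_transform(X_demo))

print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

We would expect the scaled score to be much higher, since the unscaled clustering mostly splits the data along `x2`.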


### How to Scale

Not all distributions are created equal! It is important to examine the distribution of each feature before scaling it; otherwise, your scaling efforts might not yield any improvement.

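Two common types of distributions are normal distributions and heavily right-skewed distributions, such as power-law or exponential distributions. (The examples in this section use an exponential distribution as a simple skewed stand-in; strictly speaking, it is not a power law.) Both types are easy to recognize by plotting histograms.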

In [None]:
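import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Generate synthetic data
np.random.seed(0)  # fixed seed so the histograms are reproducible
n_samples = 1000

# Skewed feature: exponential distribution (stand-in for a long-tailed feature)
power_law_data = np.random.exponential(scale=10, size=n_samples)

# Log-scale the skewed feature; log1p computes log(1 + x), so zeros are handled safely
scaled_power_law = np.log1p(power_law_data)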

# Normal distribution
normal_data = np.random.normal(loc=50, scale=10, size=n_samples)
normal_data_2D = normal_data.reshape(-1, 1)

# Apply scaling
scaler = StandardScaler()
scaled_normal_data_2D = scaler.fit_transform(normal_data_2D)

# Convert back to 1D array
scaled_normal_data = scaled_normal_data_2D.ravel()


# Plotting
fig, axes = plt.subplots(2, 2, figsize=(12, 6))

# Plot power-law distribution
axes[0,0].hist(power_law_data, bins=30, color='blue', edgecolor='black')
axes[0,0].set_title("Power-law (Exponential) Distribution")
axes[0,0].set_xlabel("Value")
axes[0,0].set_ylabel("Frequency")

# Plot scaled power-law distribution
axes[1,0].hist(scaled_power_law, bins=30, color='blue', edgecolor='black')
axes[1,0].set_title("Log-Scaled Exponential Distribution")
axes[1,0].set_xlabel("Value")
axes[1,0].set_ylabel("Frequency")

# Plot normal distribution
axes[0,1].hist(normal_data, bins=30, color='green', edgecolor='black')
axes[0,1].set_title("Normal Distribution")
axes[0,1].set_xlabel("Value")
axes[0,1].set_ylabel("Frequency")

# Plot scaled normal distribution
axes[1,1].hist(scaled_normal_data, bins=30, color='green', edgecolor='black')
axes[1,1].set_title("Z-Scaled Normal Distribution")
axes[1,1].set_xlabel("Value")
axes[1,1].set_ylabel("Frequency")



plt.tight_layout()
plt.show()




Which method to use depends on the shape of the feature:

- **Z-scaling**: Use it when the feature roughly follows a normal distribution or when you don't have information about the distribution. It subtracts the mean and divides by the standard deviation, so the scaled feature has a mean of 0 and a standard deviation of 1.

- **Log-scaling**: Use it for features that follow a power-law or other long-tailed distribution. Taking the logarithm compresses large values far more than small ones, which reduces skew and helps equalize the ranges and variances across features.
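
The two methods can be written out directly in NumPy. This is a minimal sketch on freshly generated data; the variable names are illustrative. It checks that the manual z-score matches `StandardScaler` (which uses the population standard deviation, the same default as `np.std`) and that `np.log1p` can be inverted with `np.expm1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Z-scaling: subtract the mean, divide by the standard deviation
x = rng.normal(loc=50, scale=10, size=1000)
z_manual = (x - x.mean()) / x.std()  # np.std uses ddof=0 by default
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("Manual and StandardScaler agree:", np.allclose(z_manual, z_sklearn))
print(f"Mean: {z_manual.mean():.3f}, std: {z_manual.std():.3f}")

# Log-scaling: log(1 + x) compresses the long right tail and is defined at x = 0
skewed = rng.exponential(scale=10, size=1000)
logged = np.log1p(skewed)

print(f"Max before: {skewed.max():.1f}, max after: {logged.max():.2f}")
print("Invertible with expm1:", np.allclose(np.expm1(logged), skewed))
```

Note that log-scaling does not center the feature or give it unit variance, so it can make sense to apply z-scaling afterwards.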


### Exercise

In the following, I've created a sample dataset with an exponential feature and a normal feature. Try using the different scaling methods before running the classifier. How do your results change in each case?

1. You scale the exponential feature using a `StandardScaler`.
2. You scale the exponential feature using a log transform.

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PowerTransformer
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Create synthetic dataset
n_samples = 1000

# Feature 1: Power law (exponential) distribution
X1 = np.random.exponential(scale=5, size=n_samples)

# Feature 2: Normal distribution
X2 = np.random.normal(loc=50, scale=10, size=n_samples)

# Create labels: simple linear relation to X1 and X2
y = (X1 + 0.001 * X2 > 1).astype(int)

# Flip 10% of the labels at random to add label noise
flip_indices = np.random.choice(n_samples, size=int(0.1 * n_samples), replace=False)
y[flip_indices] = 1 - y[flip_indices]



# Combine features into single data array
X = np.column_stack((X1, X2))

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without scaling
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print(f"Logistic Regression without Scaling - Test Accuracy: {lr.score(X_test, y_test):.2f}")

# Plot original features
plt.figure(figsize=(12, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis')
plt.xlabel('Feature 1 (Power law)')
plt.ylabel('Feature 2 (Normal)')
plt.show()
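
As a possible starting point for the exercise, here is a sketch of a helper that rescales only the first column of a train/test pair. The function name `scale_first_column` and the toy arrays are hypothetical and not part of the dataset above. It fits the `StandardScaler` on the training split only and reuses those statistics on the test split, which is one common way to avoid leaking test information into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_first_column(X_train, X_test, method="standard"):
    """Return copies of both arrays with column 0 rescaled; other columns are untouched."""
    X_train_out = np.array(X_train, dtype=float)  # np.array copies by default
    X_test_out = np.array(X_test, dtype=float)
    if method == "standard":
        scaler = StandardScaler()
        # Fit on the training split only, then apply the same mean/std to the test split
        X_train_out[:, [0]] = scaler.fit_transform(X_train_out[:, [0]])
        X_test_out[:, [0]] = scaler.transform(X_test_out[:, [0]])
    elif method == "log":
        # log1p needs no fitting, so it is applied to each split independently
        X_train_out[:, 0] = np.log1p(X_train_out[:, 0])
        X_test_out[:, 0] = np.log1p(X_test_out[:, 0])
    else:
        raise ValueError(f"Unknown method: {method!r}")
    return X_train_out, X_test_out

# Toy arrays standing in for a train/test split
toy_train = np.array([[1.0, 50.0], [3.0, 60.0], [5.0, 40.0]])
toy_test = np.array([[2.0, 55.0]])

train_std, test_std = scale_first_column(toy_train, toy_test, method="standard")
train_log, test_log = scale_first_column(toy_train, toy_test, method="log")
print(train_std)
print(test_log)
```

To use it in the exercise, you could pass `X_train` and `X_test` through the helper, refit the logistic regression on the result, and compare the test accuracy against the unscaled baseline.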
