# 🔢 NumPy & Pandas for Machine Learning

This notebook covers essential NumPy and Pandas operations commonly used in machine learning interviews and data science workflows.

## 📋 Table of Contents
1. [Matrix Operations with NumPy](#matrix-operations)
2. [PCA Implementation](#pca-implementation)
3. [Cosine Similarity](#cosine-similarity)
4. [Data Preprocessing with Pandas](#data-preprocessing)
5. [Feature Engineering](#feature-engineering)
6. [Practice Problems](#practice-problems)
7. [Interview Tips](#interview-tips)

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.datasets import make_classification, load_iris
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

print("✅ All libraries imported successfully!")
print(f"📊 NumPy version: {np.__version__}")
print(f"🐼 Pandas version: {pd.__version__}")

## 🔢 Problem 1: Matrix Operations for ML

**Problem Statement**: Implement common matrix operations used in machine learning algorithms.

**Requirements**:
- Batch matrix multiplication with broadcasting
- Efficient computation using vectorization
- Memory-optimized operations for large datasets
- Numerical stability for edge cases
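
To make the numerical-stability requirement concrete, here is a minimal sketch comparing a direct softmax with the max-shifted form. The function names `naive_softmax` and `stable_softmax` are illustrative only and are not part of the solution class in this notebook.

```python
import numpy as np

def naive_softmax(x):
    # Direct formula; np.exp overflows to inf once inputs exceed roughly 709
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    # Subtracting the row max does not change the result mathematically,
    # but it keeps every exponent <= 0, so np.exp cannot overflow
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow warnings that the naive version triggers on purpose
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)   # inf / inf gives nan

stable = stable_softmax(logits)

print("naive contains NaN:", np.isnan(naive).any())
print("stable row sum:", stable.sum(axis=-1))
```

Because softmax is invariant to adding a constant to every logit, the shifted version returns the same probabilities as it would for `[0, 1, 2]`.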

In [None]:
class MatrixOperations:
    """Efficient matrix operations for machine learning."""
    
    @staticmethod
    def batch_matrix_multiply(A, B):
        """
        Multiply batches of matrices using broadcasting.
        A: (batch_size, m, k) or (m, k)
        B: (batch_size, k, n) or (k, n)
        Returns: (batch_size, m, n), or (m, n) when both inputs are 2-D
        """
        return np.matmul(A, B)
    
    @staticmethod
    def softmax(x, axis=-1):
        """
        Numerically stable softmax implementation.
        """
        # Subtract max for numerical stability
        x_shifted = x - np.max(x, axis=axis, keepdims=True)
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
    
    @staticmethod
    def euclidean_distance_matrix(X, Y=None):
        """
        Compute pairwise Euclidean distances efficiently.
        X: (n_samples_1, n_features)
        Y: (n_samples_2, n_features) or None
        Returns: (n_samples_1, n_samples_2)
        """
        if Y is None:
            Y = X
        
        # Using the identity: ||x-y||² = ||x||² + ||y||² - 2*x·y
        X_norm_sq = np.sum(X**2, axis=1, keepdims=True)
        Y_norm_sq = np.sum(Y**2, axis=1, keepdims=True)
        
        distances_sq = X_norm_sq + Y_norm_sq.T - 2 * np.dot(X, Y.T)
        
        # Ensure non-negative due to floating point errors
        distances_sq = np.maximum(distances_sq, 0)
        
        return np.sqrt(distances_sq)

# Test matrix operations
print("🧪 Testing Matrix Operations:")

# Test batch matrix multiplication
A = np.random.randn(5, 3, 4)
B = np.random.randn(5, 4, 2)
result = MatrixOperations.batch_matrix_multiply(A, B)
print(f"Batch matrix multiplication: {A.shape} × {B.shape} = {result.shape}")

# Test softmax
logits = np.array([[2.0, 1.0, 0.1], [1.0, 3.0, 0.2]])
probabilities = MatrixOperations.softmax(logits)
print(f"\nSoftmax test:")
print(f"Input: {logits}")
print(f"Output: {probabilities}")
print(f"Sum of probabilities: {np.sum(probabilities, axis=1)}")

# Test distance matrix
X = np.random.randn(100, 5)
distances = MatrixOperations.euclidean_distance_matrix(X[:10])  # 10x10 for display
print(f"\nDistance matrix shape: {distances.shape}")
print(f"Diagonal elements (should be ~0, up to floating-point error): {np.diag(distances)[:5]}")

print("\n✅ Matrix operations test completed!")

In [None]:
# Visualize matrix operations performance
import time

def benchmark_matrix_operations():
    """Benchmark different matrix operation approaches."""
    sizes = [100, 200, 500, 1000]
    vectorized_times = []
    loop_times = []
    
    for n in sizes:
        X = np.random.randn(n, 10)
        
        # Vectorized approach (full n x n distance matrix)
        start = time.perf_counter()
        _ = MatrixOperations.euclidean_distance_matrix(X)
        vectorized_time = time.perf_counter() - start
        vectorized_times.append(vectorized_time)
        
        # Loop-based approach, capped at a 50 x 50 block to keep the demo fast.
        # For n > 50 the loop timings therefore understate the true cost.
        start = time.perf_counter()
        distances_loop = np.zeros((n, n))
        for i in range(min(n, 50)):
            for j in range(min(n, 50)):
                distances_loop[i, j] = np.linalg.norm(X[i] - X[j])
        loop_time = time.perf_counter() - start
        loop_times.append(loop_time)
    
    return sizes, vectorized_times, loop_times

# Run benchmark
sizes, vec_times, loop_times = benchmark_matrix_operations()

# Plot results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.semilogy(sizes, vec_times, 'o-', label='Vectorized', alpha=0.8)
plt.semilogy(sizes, loop_times, 's-', label='Loop-based (limited)', alpha=0.8)
plt.xlabel('Matrix Size')
plt.ylabel('Time (seconds)')
plt.title('Performance Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

# Visualize softmax behavior
plt.subplot(1, 2, 2)
x = np.linspace(-5, 5, 100)
temperatures = [0.5, 1.0, 2.0, 5.0]

for temp in temperatures:
    # Create a simple 2-class softmax
    logits = np.column_stack([x, np.zeros_like(x)])
    probs = MatrixOperations.softmax(logits / temp, axis=1)
    plt.plot(x, probs[:, 0], label=f'Temperature: {temp}', alpha=0.8)

plt.xlabel('Input Value')
plt.ylabel('Probability')
plt.title('Softmax with Different Temperatures')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 🎯 Problem 2: PCA Implementation from Scratch

**Problem Statement**: Implement Principal Component Analysis using SVD for dimensionality reduction.

**Requirements**:
- Compute principal components using SVD
- Handle data centering and scaling
- Calculate explained variance ratio
- Efficient implementation for large datasets
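
As a sanity check on the SVD route, the sketch below verifies on toy data that the squared singular values of the centered matrix, divided by `n_samples - 1`, match the eigenvalues of the sample covariance matrix. The data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
Xc = X - X.mean(axis=0)                                   # center each feature

# Route 1: SVD of the centered data matrix
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_from_svd = s**2 / (Xc.shape[0] - 1)

# Route 2: eigendecomposition of the sample covariance matrix
cov = np.cov(Xc, rowvar=False)           # uses ddof=1 by default
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

print("variances match:", np.allclose(var_from_svd, eigvals))
# Principal directions agree only up to a sign flip per component
print("directions match:", np.allclose(np.abs(Vt), np.abs(eigvecs.T), atol=1e-6))
```

The SVD route avoids forming the covariance matrix explicitly, which is usually the better-conditioned choice.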

In [None]:
class PCAFromScratch:
    """Principal Component Analysis implementation using SVD."""
    
    def __init__(self, n_components=None):
        self.n_components = n_components
        self.components_ = None
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
        self.mean_ = None
        self.singular_values_ = None
    
    def fit(self, X):
        """
        Fit PCA model to data.
        X: (n_samples, n_features)
        """
        n_samples, n_features = X.shape
        
        if self.n_components is None:
            self.n_components = min(n_samples, n_features)
        
        # Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_
        
        # Perform SVD
        U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
        
        # Select top n_components
        self.components_ = Vt[:self.n_components]
        self.singular_values_ = s[:self.n_components]
        
        # Calculate explained variance
        self.explained_variance_ = (s[:self.n_components] ** 2) / (n_samples - 1)
        total_variance = np.sum((s ** 2)) / (n_samples - 1)
        self.explained_variance_ratio_ = self.explained_variance_ / total_variance
        
        return self
    
    def transform(self, X):
        """
        Transform data to lower dimensional space.
        X: (n_samples, n_features)
        Returns: (n_samples, n_components)
        """
        if self.components_ is None:
            raise ValueError("PCA model must be fitted first.")
        
        X_centered = X - self.mean_
        return X_centered @ self.components_.T
    
    def fit_transform(self, X):
        """Fit PCA model and transform data."""
        return self.fit(X).transform(X)
    
    def inverse_transform(self, X_transformed):
        """
        Transform data back to original space.
        X_transformed: (n_samples, n_components)
        Returns: (n_samples, n_features)
        """
        return X_transformed @ self.components_ + self.mean_

# Test PCA implementation
print("🧪 Testing PCA Implementation:")

# Generate sample data
np.random.seed(42)
n_samples, n_features = 1000, 20
X_original = np.random.randn(n_samples, n_features)

# Add some correlation to make PCA interesting
correlation_matrix = np.random.randn(n_features, n_features)
X = X_original @ correlation_matrix

# Apply PCA
pca = PCAFromScratch(n_components=5)
X_transformed = pca.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Transformed shape: {X_transformed.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {np.sum(pca.explained_variance_ratio_):.3f}")

# Test reconstruction
X_reconstructed = pca.inverse_transform(X_transformed)
reconstruction_error = np.mean((X - X_reconstructed) ** 2)
print(f"Reconstruction error (MSE): {reconstruction_error:.6f}")

print("\n✅ PCA test completed!")

In [None]:
# Visualize PCA results
# Use Iris dataset for better visualization
from sklearn.datasets import load_iris

iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Apply PCA to Iris dataset
pca_iris = PCAFromScratch(n_components=2)
X_iris_pca = pca_iris.fit_transform(X_iris)

plt.figure(figsize=(15, 5))

# Plot 1: Original features (first two)
plt.subplot(1, 3, 1)
scatter = plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris, alpha=0.7)
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('Original Features')
plt.colorbar(scatter)
plt.grid(True, alpha=0.3)

# Plot 2: PCA transformed
plt.subplot(1, 3, 2)
scatter = plt.scatter(X_iris_pca[:, 0], X_iris_pca[:, 1], c=y_iris, alpha=0.7)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA Transformed')
plt.colorbar(scatter)
plt.grid(True, alpha=0.3)

# Plot 3: Explained variance
plt.subplot(1, 3, 3)
components = range(1, len(pca_iris.explained_variance_ratio_) + 1)
plt.bar(components, pca_iris.explained_variance_ratio_, alpha=0.7)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Component')
plt.grid(True, alpha=0.3)

# Add values on bars
for i, v in enumerate(pca_iris.explained_variance_ratio_):
    plt.text(i + 1, v + 0.01, f'{v:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print(f"📊 PCA Results for Iris Dataset:")
print(f"Explained variance by PC1: {pca_iris.explained_variance_ratio_[0]:.3f}")
print(f"Explained variance by PC2: {pca_iris.explained_variance_ratio_[1]:.3f}")
print(f"Total explained variance: {np.sum(pca_iris.explained_variance_ratio_):.3f}")

## 🎯 Problem 3: Cosine Similarity Matrix

**Problem Statement**: Implement efficient cosine similarity computation for document/feature similarity.

**Requirements**:
- Compute pairwise cosine similarity
- Handle zero vectors gracefully
- Memory-efficient for large matrices
- Vectorized implementation
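
One possible way to meet the memory requirement is to process rows in chunks so the full similarity matrix never has to be held at once. The sketch below is only an illustration: the generator `cosine_similarity_chunked`, its `chunk_size` parameter, and the `eps` guard are assumptions, not part of the reference solution.

```python
import numpy as np

def cosine_similarity_chunked(X, Y, chunk_size=256, eps=1e-10):
    """Yield (start, block) pairs; block has shape (<= chunk_size, len(Y))."""
    # Normalize Y once; eps keeps all-zero rows from dividing by zero
    Y_unit = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + eps)
    for start in range(0, X.shape[0], chunk_size):
        X_chunk = X[start:start + chunk_size]
        X_unit = X_chunk / (np.linalg.norm(X_chunk, axis=1, keepdims=True) + eps)
        yield start, X_unit @ Y_unit.T

rng = np.random.default_rng(0)
docs = rng.random((1000, 50))
docs[0] = 0.0  # an all-zero row, to exercise the eps guard

# Keep only each row's best match instead of the full 1000 x 1000 matrix
best_match = np.empty(len(docs), dtype=int)
for start, block in cosine_similarity_chunked(docs, docs, chunk_size=200):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = -np.inf   # mask self-similarity
    best_match[start:start + len(block)] = block.argmax(axis=1)

print(best_match[:5])
```

Peak memory here scales with `chunk_size * len(Y)` rather than `len(X) * len(Y)`, at the cost of a Python-level loop over chunks.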

In [None]:
class SimilarityMetrics:
    """Collection of similarity metrics for ML applications."""
    
    @staticmethod
    def cosine_similarity_matrix(X, Y=None):
        """
        Compute pairwise cosine similarity matrix.
        X: (n_samples_1, n_features)
        Y: (n_samples_2, n_features) or None
        Returns: (n_samples_1, n_samples_2)
        """
        if Y is None:
            Y = X
        
        # Normalize vectors to unit length
        X_norm = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-10)
        Y_norm = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-10)
        
        # Compute cosine similarity using dot product
        similarity = X_norm @ Y_norm.T
        
        return similarity
    
    @staticmethod
    def pearson_correlation_matrix(X, Y=None):
        """
        Compute pairwise Pearson correlation matrix.
        """
        if Y is None:
            Y = X
        
        # Center the data
        X_centered = X - np.mean(X, axis=1, keepdims=True)
        Y_centered = Y - np.mean(Y, axis=1, keepdims=True)
        
        # Compute correlation using cosine similarity on centered data
        return SimilarityMetrics.cosine_similarity_matrix(X_centered, Y_centered)
    
    @staticmethod
    def jaccard_similarity_matrix(X, Y=None):
        """
        Compute Jaccard similarity for binary data.
        X, Y should be binary matrices.
        """
        if Y is None:
            Y = X
        
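        # Binarize via bool, then cast to integers so the matrix product
        # counts shared features (a bool @ bool product returns booleans,
        # not counts)
        X_bool = X.astype(bool).astype(np.int64)
        Y_bool = Y.astype(bool).astype(np.int64)
        
        # Compute intersection and union
        intersection = X_bool @ Y_bool.T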
        
        X_sum = np.sum(X_bool, axis=1, keepdims=True)
        Y_sum = np.sum(Y_bool, axis=1, keepdims=True)
        union = X_sum + Y_sum.T - intersection
        
        # Avoid division by zero
        jaccard = intersection / (union + 1e-10)
        
        return jaccard

# Test similarity metrics
print("🧪 Testing Similarity Metrics:")

# Generate sample documents (random non-negative vectors standing in for TF-IDF)
np.random.seed(42)
n_docs, n_terms = 100, 50
documents = np.random.exponential(0.5, (n_docs, n_terms))  # Sparse-like data
documents[documents < 0.1] = 0  # Make it sparse

# Compute cosine similarity
cosine_sim = SimilarityMetrics.cosine_similarity_matrix(documents)
print(f"Cosine similarity matrix shape: {cosine_sim.shape}")
print(f"Diagonal elements (self-similarity): {np.diag(cosine_sim)[:5]}")
print(f"Average similarity: {np.mean(cosine_sim):.3f}")

# Test with binary data for Jaccard
binary_data = (documents > 0.2).astype(int)
jaccard_sim = SimilarityMetrics.jaccard_similarity_matrix(binary_data)
print(f"\nJaccard similarity matrix shape: {jaccard_sim.shape}")
print(f"Average Jaccard similarity: {np.mean(jaccard_sim):.3f}")

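# Find most similar document pairs
# Keep only the upper triangle (k=1) so that self-similarity is excluded and
# each symmetric pair (i, j) / (j, i) is counted once. Zeroing the rest is
# safe here because all similarities are non-negative.
cosine_sim_no_diag = np.triu(cosine_sim, k=1)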

# Find top 3 most similar pairs
top_pairs = np.unravel_index(np.argpartition(cosine_sim_no_diag.ravel(), -3)[-3:], 
                             cosine_sim_no_diag.shape)
print(f"\nTop 3 most similar document pairs:")
for i in range(3):
    doc1, doc2 = top_pairs[0][i], top_pairs[1][i]
    similarity = cosine_sim[doc1, doc2]
    print(f"  Documents {doc1} and {doc2}: {similarity:.4f}")

print("\n✅ Similarity metrics test completed!")

In [None]:
# Visualize similarity matrices and performance
plt.figure(figsize=(15, 5))

# Plot 1: Cosine similarity heatmap (subset)
plt.subplot(1, 3, 1)
subset_size = 20
sns.heatmap(cosine_sim[:subset_size, :subset_size], 
            cmap='coolwarm', center=0, 
            square=True, cbar_kws={'label': 'Cosine Similarity'})
plt.title('Cosine Similarity Matrix (20x20 subset)')
plt.xlabel('Document ID')
plt.ylabel('Document ID')

# Plot 2: Jaccard similarity heatmap (subset)
plt.subplot(1, 3, 2)
sns.heatmap(jaccard_sim[:subset_size, :subset_size], 
            cmap='viridis', 
            square=True, cbar_kws={'label': 'Jaccard Similarity'})
plt.title('Jaccard Similarity Matrix (20x20 subset)')
plt.xlabel('Document ID')
plt.ylabel('Document ID')

# Plot 3: Similarity distribution
plt.subplot(1, 3, 3)
# Get upper triangle values (excluding diagonal)
upper_tri_indices = np.triu_indices_from(cosine_sim, k=1)
cosine_values = cosine_sim[upper_tri_indices]
jaccard_values = jaccard_sim[upper_tri_indices]

plt.hist(cosine_values, bins=50, alpha=0.7, label='Cosine', density=True)
plt.hist(jaccard_values, bins=50, alpha=0.7, label='Jaccard', density=True)
plt.xlabel('Similarity Value')
plt.ylabel('Density')
plt.title('Similarity Value Distributions')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Performance comparison
sizes = [50, 100, 200, 500]
times = []

for size in sizes:
    data_subset = documents[:size]
    
    start_time = time.time()
    _ = SimilarityMetrics.cosine_similarity_matrix(data_subset)
    elapsed_time = time.time() - start_time
    times.append(elapsed_time)

plt.figure(figsize=(8, 5))
plt.plot(sizes, times, 'o-', alpha=0.8, linewidth=2, markersize=8)
plt.xlabel('Number of Documents')
plt.ylabel('Computation Time (seconds)')
plt.title('Cosine Similarity Computation Performance')
plt.grid(True, alpha=0.3)

# Add annotations
for size, time_val in zip(sizes, times):
    plt.annotate(f'{time_val:.3f}s', 
                xy=(size, time_val), 
                xytext=(5, 5), 
                textcoords='offset points',
                fontsize=9)

plt.show()

## 🐼 Problem 4: Data Preprocessing Pipeline

**Problem Statement**: Create a comprehensive data preprocessing pipeline for real-world datasets.

**Requirements**:
- Handle missing values with different strategies
- Encode categorical variables efficiently
- Normalize and scale numerical features
- Create date/time features
- Pipeline should be reusable and configurable
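
A reusable pipeline typically learns its statistics from the training data and then applies them unchanged to new data. The sketch below illustrates that fit/transform split for median imputation; the `MedianImputer` class is a hypothetical helper written for this example, not part of the solution that follows.

```python
import numpy as np
import pandas as pd

class MedianImputer:
    """Hypothetical helper: learns column medians in fit(), reuses them in transform()."""

    def fit(self, df, columns):
        # Store one median per column, computed from the data passed to fit()
        self.medians_ = {col: df[col].median() for col in columns}
        return self

    def transform(self, df):
        out = df.copy()
        for col, value in self.medians_.items():
            out[col] = out[col].fillna(value)
        return out

train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"age": [np.nan, 90.0]})

imputer = MedianImputer().fit(train, ["age"])
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)  # filled with the training median, not the test median

print(test_filled["age"].tolist())  # [30.0, 90.0]
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.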

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
from typing import Dict, List, Optional

class DataPreprocessor:
    """Comprehensive data preprocessing pipeline for ML workflows."""
    
    def __init__(self):
        self.label_encoders = {}
        self.scaler = None
        self.feature_stats = {}
        self.is_fitted = False
    
    def handle_missing_values(self, df: pd.DataFrame, 
                            strategy: Dict[str, str]) -> pd.DataFrame:
        """
        Handle missing values with different strategies per column.
        
        strategy: dict mapping column names to strategies
        Strategies: 'drop', 'mean', 'median', 'mode', 'forward_fill', 'backward_fill', 'constant'
        """
        df_processed = df.copy()
        
        for col, method in strategy.items():
            if col not in df_processed.columns:
                continue
                
            # Assign results back instead of using inplace=True on a column
            # selection; fillna(method=...) is deprecated in favor of ffill()/bfill()
            if method == 'drop':
                df_processed = df_processed.dropna(subset=[col])
            elif method == 'mean' and df_processed[col].dtype in ['int64', 'float64']:
                fill_value = df_processed[col].mean()
                df_processed[col] = df_processed[col].fillna(fill_value)
                self.feature_stats[f'{col}_mean'] = fill_value
            elif method == 'median' and df_processed[col].dtype in ['int64', 'float64']:
                fill_value = df_processed[col].median()
                df_processed[col] = df_processed[col].fillna(fill_value)
                self.feature_stats[f'{col}_median'] = fill_value
            elif method == 'mode':
                mode_values = df_processed[col].mode()
                fill_value = mode_values.iloc[0] if not mode_values.empty else 'Unknown'
                df_processed[col] = df_processed[col].fillna(fill_value)
                self.feature_stats[f'{col}_mode'] = fill_value
            elif method == 'forward_fill':
                df_processed[col] = df_processed[col].ffill()
            elif method == 'backward_fill':
                df_processed[col] = df_processed[col].bfill()
            elif method == 'constant':
                df_processed[col] = df_processed[col].fillna('Missing')
        
        return df_processed
    
    def encode_categorical_features(self, df: pd.DataFrame, 
                                  categorical_cols: List[str],
                                  method: str = 'label') -> pd.DataFrame:
        """
        Encode categorical features.
        Methods: 'label', 'onehot'
        """
        df_processed = df.copy()
        
        for col in categorical_cols:
            if col not in df_processed.columns:
                continue
            
            if method == 'label':
                if col not in self.label_encoders:
                    self.label_encoders[col] = LabelEncoder()
                    df_processed[col] = self.label_encoders[col].fit_transform(df_processed[col].astype(str))
                else:
                    # Handle unseen categories
                    known_categories = set(self.label_encoders[col].classes_)
                    df_processed[col] = df_processed[col].astype(str)
                    
                    # Replace unknown categories with most frequent known category
                    unknown_mask = ~df_processed[col].isin(known_categories)
                    if unknown_mask.any():
                        most_frequent = df_processed[col][~unknown_mask].mode()
                        if not most_frequent.empty:
                            df_processed.loc[unknown_mask, col] = most_frequent.iloc[0]
                    
                    df_processed[col] = self.label_encoders[col].transform(df_processed[col])
            
            elif method == 'onehot':
                # Create dummy variables
                dummies = pd.get_dummies(df_processed[col], prefix=col, drop_first=True)
                df_processed = pd.concat([df_processed.drop(col, axis=1), dummies], axis=1)
        
        return df_processed
    
    def scale_features(self, df: pd.DataFrame, 
                      numerical_cols: List[str],
                      method: str = 'standard') -> pd.DataFrame:
        """
        Scale numerical features.
        Methods: 'standard', 'minmax', 'robust'
        """
        df_processed = df.copy()
        
        # Filter columns that actually exist and are numerical
        existing_numerical_cols = [col for col in numerical_cols 
                                 if col in df_processed.columns and 
                                 df_processed[col].dtype in ['int64', 'float64']]
        
        if not existing_numerical_cols:
            return df_processed
        
        if method == 'standard':
            if self.scaler is None:
                self.scaler = StandardScaler()
                df_processed[existing_numerical_cols] = self.scaler.fit_transform(
                    df_processed[existing_numerical_cols])
            else:
                df_processed[existing_numerical_cols] = self.scaler.transform(
                    df_processed[existing_numerical_cols])
        
        elif method == 'minmax':
            if self.scaler is None:
                self.scaler = MinMaxScaler()
                df_processed[existing_numerical_cols] = self.scaler.fit_transform(
                    df_processed[existing_numerical_cols])
            else:
                df_processed[existing_numerical_cols] = self.scaler.transform(
                    df_processed[existing_numerical_cols])
        
        elif method == 'robust':
            if self.scaler is None:
                self.scaler = RobustScaler()
                df_processed[existing_numerical_cols] = self.scaler.fit_transform(
                    df_processed[existing_numerical_cols])
            else:
                df_processed[existing_numerical_cols] = self.scaler.transform(
                    df_processed[existing_numerical_cols])
        
        return df_processed
    
    def create_date_features(self, df: pd.DataFrame, date_col: str) -> pd.DataFrame:
        """
        Extract useful features from datetime column.
        """
        df_processed = df.copy()
        
        if date_col in df_processed.columns:
            # Convert to datetime
            df_processed[date_col] = pd.to_datetime(df_processed[date_col], errors='coerce')
            
            # Extract features
            df_processed[f'{date_col}_year'] = df_processed[date_col].dt.year
            df_processed[f'{date_col}_month'] = df_processed[date_col].dt.month
            df_processed[f'{date_col}_day'] = df_processed[date_col].dt.day
            df_processed[f'{date_col}_dayofweek'] = df_processed[date_col].dt.dayofweek
            df_processed[f'{date_col}_quarter'] = df_processed[date_col].dt.quarter
            df_processed[f'{date_col}_is_weekend'] = (df_processed[date_col].dt.dayofweek >= 5).astype(int)
            df_processed[f'{date_col}_is_month_start'] = df_processed[date_col].dt.is_month_start.astype(int)
            df_processed[f'{date_col}_is_month_end'] = df_processed[date_col].dt.is_month_end.astype(int)
        
        return df_processed
    
    def fit_transform(self, df: pd.DataFrame, config: Dict) -> pd.DataFrame:
        """
        Full preprocessing pipeline.
        """
        df_processed = df.copy()
        
        # Handle missing values
        if 'missing_strategy' in config:
            df_processed = self.handle_missing_values(df_processed, config['missing_strategy'])
        
        # Create date features
        if 'date_columns' in config:
            for date_col in config['date_columns']:
                df_processed = self.create_date_features(df_processed, date_col)
        
        # Encode categorical features
        if 'categorical_columns' in config:
            df_processed = self.encode_categorical_features(
                df_processed, 
                config['categorical_columns'],
                config.get('categorical_method', 'label')
            )
        
        # Scale numerical features
        if 'numerical_columns' in config:
            df_processed = self.scale_features(
                df_processed,
                config['numerical_columns'],
                config.get('scaling_method', 'standard')
            )
        
        self.is_fitted = True
        return df_processed

# Create sample dataset for testing
print("🧪 Creating Sample Dataset for Preprocessing:")

np.random.seed(42)
n_samples = 1000

# Create sample data with various data types and missing values
sample_data = {
    'age': np.random.randint(18, 80, n_samples).astype(float),  # float so NaN can be assigned below
    'salary': np.random.lognormal(10, 1, n_samples),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_samples),
    'date': pd.date_range('2020-01-01', periods=n_samples, freq='D'),
    'score': np.random.normal(75, 15, n_samples),
    'binary_feature': np.random.choice([0, 1], n_samples)
}

# Create DataFrame
df_sample = pd.DataFrame(sample_data)

# Introduce missing values randomly
missing_indices = np.random.choice(n_samples, size=int(0.1 * n_samples), replace=False)
df_sample.loc[missing_indices[:50], 'age'] = np.nan
df_sample.loc[missing_indices[50:], 'category'] = np.nan

print(f"Original dataset shape: {df_sample.shape}")
print(f"Missing values per column:")
print(df_sample.isnull().sum())
print(f"\nData types:")
print(df_sample.dtypes)
print(f"\nFirst few rows:")
print(df_sample.head())

In [None]:
# Apply preprocessing pipeline
print("🔧 Applying Preprocessing Pipeline:")

preprocessor = DataPreprocessor()

# Define preprocessing configuration
config = {
    'missing_strategy': {
        'age': 'median',
        'category': 'mode',
        'salary': 'mean'
    },
    'date_columns': ['date'],
    'categorical_columns': ['category'],
    'numerical_columns': ['age', 'salary', 'score'],
    'categorical_method': 'label',
    'scaling_method': 'standard'
}

# Apply preprocessing
df_processed = preprocessor.fit_transform(df_sample, config)

print(f"Processed dataset shape: {df_processed.shape}")
print(f"Missing values after processing:")
print(df_processed.isnull().sum())
print(f"\nNew columns created:")
new_columns = set(df_processed.columns) - set(df_sample.columns)
print(list(new_columns))
print(f"\nProcessed data types:")
print(df_processed.dtypes)

print("\n✅ Data preprocessing completed!")

In [None]:
# Visualize preprocessing results
plt.figure(figsize=(16, 10))

# Plot 1: Before and after missing values
plt.subplot(2, 3, 1)
missing_before = df_sample.isnull().sum()
missing_after = df_processed.isnull().sum()

x = range(len(missing_before))
width = 0.35
plt.bar([i - width/2 for i in x], missing_before.values, width, label='Before', alpha=0.7)
plt.bar([i + width/2 for i in x], missing_after.values, width, label='After', alpha=0.7)
plt.xlabel('Columns')
plt.ylabel('Missing Values Count')
plt.title('Missing Values Before and After')
plt.xticks(x, missing_before.index, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Distribution of numerical features before scaling
plt.subplot(2, 3, 2)
numerical_cols = ['age', 'salary', 'score']
df_sample[numerical_cols].boxplot(ax=plt.gca())
plt.title('Numerical Features Before Scaling')
plt.xticks(rotation=45)
plt.ylabel('Original Values')

# Plot 3: Distribution of numerical features after scaling
plt.subplot(2, 3, 3)
scaled_numerical_cols = [col for col in numerical_cols if col in df_processed.columns]
if scaled_numerical_cols:
    df_processed[scaled_numerical_cols].boxplot(ax=plt.gca())
plt.title('Numerical Features After Scaling')
plt.xticks(rotation=45)
plt.ylabel('Scaled Values')

# Plot 4: Categorical encoding results
plt.subplot(2, 3, 4)
if 'category' in df_processed.columns:
    category_counts = df_processed['category'].value_counts()
    plt.bar(category_counts.index, category_counts.values, alpha=0.7)
    plt.xlabel('Encoded Category Values')
    plt.ylabel('Count')
    plt.title('Categorical Feature After Encoding')
    plt.grid(True, alpha=0.3)

# Plot 5: Date features created
plt.subplot(2, 3, 5)
date_features = [col for col in df_processed.columns if 'date_' in col and col != 'date']
if date_features:
    # Show distribution of one date feature
    feature_to_plot = 'date_month' if 'date_month' in date_features else date_features[0]
    df_processed[feature_to_plot].value_counts().sort_index().plot(kind='bar', alpha=0.7)
    plt.xlabel('Value')
    plt.ylabel('Count')
    plt.title(f'Distribution of {feature_to_plot}')
    plt.xticks(rotation=45)
    plt.grid(True, alpha=0.3)

# Plot 6: Feature correlation heatmap
plt.subplot(2, 3, 6)
# Select numerical columns for correlation
corr_cols = [col for col in df_processed.columns 
             if df_processed[col].dtype in ['int64', 'float64'] and not col.startswith('date_')]
if len(corr_cols) > 1:
    correlation_matrix = df_processed[corr_cols].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                square=True, fmt='.2f', cbar_kws={'label': 'Correlation'})
    plt.title('Feature Correlation Matrix')

plt.tight_layout()
plt.show()

# Summary statistics
print("📊 Preprocessing Summary:")
print(f"Original features: {len(df_sample.columns)}")
print(f"Final features: {len(df_processed.columns)}")
print(f"Features added: {len(df_processed.columns) - len(df_sample.columns)}")
print(f"Missing values eliminated: {df_sample.isnull().sum().sum() - df_processed.isnull().sum().sum()}")
print(f"Label encoders created: {len(preprocessor.label_encoders)}")
print(f"Scaler fitted: {preprocessor.scaler is not None}")

## 🏃‍♂️ Practice Problems

Let's practice some additional NumPy and Pandas problems commonly seen in ML interviews.

In [None]:
# Problem 5: Efficient K-Nearest Neighbors Distance Computation
def knn_distances(X_train, X_test, k=5):
    """
    Find k-nearest neighbors efficiently using vectorized operations.
    
    Time Complexity: O(n_test * n_train * n_features) for the distance matrix
    Space Complexity: O(n_test * n_train)
    """
    # Compute distance matrix
    distances = MatrixOperations.euclidean_distance_matrix(X_test, X_train)
    
    # Find k smallest distances and their indices
    k_nearest_indices = np.argpartition(distances, k-1, axis=1)[:, :k]
    k_nearest_distances = np.take_along_axis(distances, k_nearest_indices, axis=1)
    
    # Sort within k-nearest
    sort_indices = np.argsort(k_nearest_distances, axis=1)
    k_nearest_indices_sorted = np.take_along_axis(k_nearest_indices, sort_indices, axis=1)
    k_nearest_distances_sorted = np.take_along_axis(k_nearest_distances, sort_indices, axis=1)
    
    return k_nearest_distances_sorted, k_nearest_indices_sorted

# Test KNN implementation
print("🧪 Testing KNN Distance Computation:")

# Generate sample data
np.random.seed(42)
X_train = np.random.randn(1000, 10)
X_test = np.random.randn(100, 10)
k = 5

# Compute k-nearest neighbors
start_time = time.time()
knn_dists, knn_indices = knn_distances(X_train, X_test, k)
elapsed_time = time.time() - start_time

print(f"Computed {k}-NN for {len(X_test)} test points in {elapsed_time:.4f} seconds")
print(f"KNN distances shape: {knn_dists.shape}")
print(f"KNN indices shape: {knn_indices.shape}")
print(f"Sample distances for first test point: {knn_dists[0]}")
print(f"Sample indices for first test point: {knn_indices[0]}")

# Verify distances are sorted
is_sorted = np.all(np.diff(knn_dists, axis=1) >= 0)
print(f"Distances are properly sorted: {is_sorted}")

print("\n✅ KNN distance computation test completed!")

In [None]:
# Problem 6: Outlier Detection using IQR Method
def detect_outliers_iqr(data, columns=None, factor=1.5):
    """
    Detect outliers using Interquartile Range (IQR) method.
    
    Parameters:
    - data: pandas DataFrame
    - columns: list of columns to check (None for all numerical columns)
    - factor: IQR multiplier (typically 1.5)
    
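    Returns:
    - Tuple (df, outlier_info): a copy of the data with one boolean
      `<col>_is_outlier` flag column per checked column, and a dict of
      per-column statistics (count, percentage, bounds, outlier values).
      No rows are removed.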
    """
    df = data.copy()
    
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns.tolist()
    
    outlier_info = {}
    
    for col in columns:
        if col not in df.columns:
            continue
            
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        
        lower_bound = Q1 - factor * IQR
        upper_bound = Q3 + factor * IQR
        
        # Identify outliers
        outliers_mask = (df[col] < lower_bound) | (df[col] > upper_bound)
        
        outlier_info[col] = {
            'count': outliers_mask.sum(),
            'percentage': (outliers_mask.sum() / len(df)) * 100,
            'bounds': (lower_bound, upper_bound),
            'outlier_values': df.loc[outliers_mask, col].values
        }
        
        # Add outlier flag column
        df[f'{col}_is_outlier'] = outliers_mask
    
    return df, outlier_info

# Test outlier detection
print("🧪 Testing Outlier Detection:")

# Create sample data with outliers
np.random.seed(42)
n_samples = 1000

normal_data = np.random.normal(50, 10, n_samples)
# Inject some outliers
outlier_indices = np.random.choice(n_samples, size=50, replace=False)
normal_data[outlier_indices] = np.random.choice([0, 100], size=50)  # Extreme values

test_df = pd.DataFrame({
    'feature1': normal_data,
    'feature2': np.random.exponential(2, n_samples),  # Naturally skewed
    'feature3': np.random.normal(0, 1, n_samples)     # Standard normal
})

# Detect outliers
df_with_outliers, outlier_stats = detect_outliers_iqr(test_df)

print("Outlier Detection Results:")
for col, stats in outlier_stats.items():
    print(f"\n{col}:")
    print(f"  Outliers found: {stats['count']} ({stats['percentage']:.2f}%)")
    print(f"  Normal range: [{stats['bounds'][0]:.2f}, {stats['bounds'][1]:.2f}]")
    if stats['count'] > 0:
        print(f"  Sample outlier values: {stats['outlier_values'][:5]}")

print("\n✅ Outlier detection test completed!")

In [None]:
# Visualize outlier detection results
plt.figure(figsize=(15, 10))

features = ['feature1', 'feature2', 'feature3']
colors = ['red', 'blue', 'green']

for i, (feature, color) in enumerate(zip(features, colors)):
    # Box plot
    plt.subplot(2, 3, i + 1)
    
    # Normal data
    is_outlier = df_with_outliers[f'{feature}_is_outlier']
    normal_data = df_with_outliers.loc[~is_outlier, feature]
    outlier_data = df_with_outliers.loc[is_outlier, feature]
    
    plt.boxplot(df_with_outliers[feature], patch_artist=True, 
                boxprops=dict(facecolor=color, alpha=0.7))
    
    # Add outlier points
    if len(outlier_data) > 0:
        plt.scatter([1] * len(outlier_data), outlier_data, 
                   color='red', alpha=0.6, s=20, label=f'Outliers ({len(outlier_data)})')
    
    plt.title(f'{feature} - Box Plot with Outliers')
    plt.ylabel('Value')
    if len(outlier_data) > 0:
        plt.legend()
    
    # Histogram
    plt.subplot(2, 3, i + 4)
    
    plt.hist(normal_data, bins=50, alpha=0.7, color=color, label='Normal', density=True)
    if len(outlier_data) > 0:
        plt.hist(outlier_data, bins=20, alpha=0.7, color='red', label='Outliers', density=True)
    
    # Add bounds
    bounds = outlier_stats[feature]['bounds']
    plt.axvline(bounds[0], color='orange', linestyle='--', alpha=0.8, label='IQR Bounds')
    plt.axvline(bounds[1], color='orange', linestyle='--', alpha=0.8)
    
    plt.xlabel('Value')
    plt.ylabel('Density')
    plt.title(f'{feature} - Distribution with IQR Bounds')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics
print("📊 Outlier Detection Summary:")
total_outliers = sum(stats['count'] for stats in outlier_stats.values())
total_data_points = len(df_with_outliers) * len(features)
print(f"Total outliers detected: {total_outliers}")
print(f"Total data points examined: {total_data_points}")
print(f"Overall outlier rate: {(total_outliers / total_data_points) * 100:.2f}%")

## 💡 Interview Tips

### 🎯 NumPy Best Practices
1. **Vectorization over loops** - Prefer NumPy array operations over element-by-element Python loops
2. **Broadcasting** - Understand how NumPy handles operations on arrays of different shapes
3. **Memory efficiency** - Use views instead of copies when possible
4. **Numerical stability** - Handle edge cases like division by zero
5. **Data types** - Choose appropriate dtypes to save memory
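
The practices above can be illustrated with a short sketch. The array names (`X`, `num`, `den`) are made up for this example; only standard NumPy calls are used.

```python
import numpy as np

# Broadcasting: center each column without a Python loop.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
X_centered = X - X.mean(axis=0)           # (4, 3) minus (3,) broadcasts over rows

# Memory efficiency: basic slicing returns a view, fancy indexing returns a copy.
view = X[:2]
copy = X[[0, 1]]
shares_view = np.shares_memory(X, view)   # view reuses X's buffer
shares_copy = np.shares_memory(X, copy)   # copy owns new memory

# Numerical stability: skip the division wherever the denominator is zero.
num = np.array([1.0, 2.0, 3.0])
den = np.array([2.0, 0.0, 4.0])
safe_ratio = np.divide(num, den, out=np.zeros_like(num), where=den != 0)

# Data types: float32 uses half the memory of float64.
X32 = X.astype(np.float32)
print(X.nbytes, X32.nbytes)
```

Whether `float32` is precise enough depends on the computation, so treat the downcast as a trade-off rather than a default.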

### 🐼 Pandas Best Practices
1. **Vectorized operations** - Use pandas methods instead of apply() when possible
2. **Memory optimization** - Use categorical data types for string columns with few unique values
3. **Missing data handling** - Understand different strategies and their implications
4. **Chaining operations** - Use method chaining for readable data transformations
5. **Index usage** - Leverage indices for efficient data access
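
A minimal sketch of these practices on a toy frame. The frame `df_demo` and its columns are hypothetical, and median imputation is only one of several reasonable missing-data strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only.
df_demo = pd.DataFrame({
    "city": ["NY", "SF", "NY", "LA", "SF", "NY"],
    "price": [10.0, np.nan, 30.0, 40.0, 50.0, np.nan],
    "qty": [1, 2, 3, 4, 5, 6],
})

# Memory optimization: a low-cardinality string column stored as categorical.
city_cat = df_demo["city"].astype("category")

# Method chaining: impute, derive a column with vectorized arithmetic
# (no apply()), aggregate, and sort in one readable pipeline.
summary = (
    df_demo
    .assign(price=lambda d: d["price"].fillna(d["price"].median()))
    .assign(revenue=lambda d: d["price"] * d["qty"])
    .groupby("city", as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)

# Index usage: label-based lookup after setting an index.
ny_rows = df_demo.set_index("city").loc["NY"]
print(summary)
```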

### ⚡ Performance Tips
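- Use `@` (or `np.matmul()`) for matrix multiplication; `np.dot()` gives the same result for 2-D arrays but behaves differently on batched (3-D+) inputs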
- Preallocate arrays when possible
- Use `pd.cut()` and `pd.qcut()` for binning operations
- Leverage `pd.crosstab()` and `pd.pivot_table()` for aggregations
- Use `numba` or `cython` for performance-critical loops
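
The sketch below touches several of these tips on small, made-up inputs (`ages` and `bought` are hypothetical). It is meant to show the calls, not to measure speed.

```python
import numpy as np
import pandas as pd

# Matrix multiplication: `@` and np.dot agree for 2-D arrays.
A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
same_product = np.array_equal(A @ B, np.dot(A, B))

# Preallocation: fill a fixed-size array instead of growing a list.
squares = np.empty(5)
for i in range(5):
    squares[i] = i ** 2

# Binning: pd.cut uses explicit edges, pd.qcut uses quantiles.
demo = pd.DataFrame({
    "age": [5, 17, 18, 34, 35, 64, 65, 90],
    "bought": [0, 1, 1, 0, 1, 1, 0, 1],
})
demo["age_group"] = pd.cut(
    demo["age"],
    bins=[0, 18, 35, 65, 120],
    labels=["child", "young", "adult", "senior"],
    right=False,                      # intervals are [left, right)
)
demo["age_quartile"] = pd.qcut(demo["age"], q=4, labels=False)

# Aggregation: counts with crosstab, rates with pivot_table.
ct = pd.crosstab(demo["age_group"], demo["bought"])
rate = demo.pivot_table(index="age_group", values="bought",
                        aggfunc="mean", observed=True)
print(ct)
print(rate)
```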

### 🗃️ Common Data Operations
- **Reshaping**: `reshape()`, `transpose()`, `flatten()`
- **Aggregation**: `sum()`, `mean()`, `std()`, `groupby()`
- **Filtering**: Boolean indexing, `query()`, `where()`
- **Joining**: `merge()`, `concat()`, `join()`
- **Time series**: `resample()`, `rolling()`, `shift()`
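
As a quick reference, here is one small example per category. All data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Reshaping
a = np.arange(6)
m = a.reshape(2, 3)            # 2 rows, 3 columns
flat = m.T.flatten()           # transpose, then copy into 1-D

# Filtering: np.where as a vectorized if/else (caps values at 3).
capped = np.where(a > 3, 3, a)

# Joining: inner merge keeps only ids present in both frames.
left = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
right = pd.DataFrame({"id": [2, 3, 4], "y": ["b", "c", "d"]})
joined = left.merge(right, on="id", how="inner")

# Time series on a daily index
ts = pd.Series(np.arange(6, dtype=float),
               index=pd.date_range("2024-01-01", periods=6, freq="D"))
rolling_mean = ts.rolling(window=3).mean()   # first two values are NaN
lagged = ts.shift(1)                         # previous day's value
two_day_sum = ts.resample("2D").sum()        # aggregate into 2-day bins

# Aggregation
group_means = joined.groupby("y")["x"].mean()
print(two_day_sum)
```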

In [None]:
# Performance comparison: Different approaches to common operations
import timeit

def benchmark_operations():
    """Benchmark different approaches to common operations."""
    
    # Setup data
    n = 100000
    arr = np.random.randn(n)
    df = pd.DataFrame({'values': arr, 'groups': np.random.choice(['A', 'B', 'C'], n)})
    
    results = {}
    
    # 1. Sum of squares: vectorized vs loop
    def vectorized_sum_squares():
        return np.sum(arr ** 2)
    
    def loop_sum_squares():
        total = 0
        for x in arr[:1000]:  # Limited for demo
            total += x ** 2
        return total
    
    # timeit returns the total time for `number` calls, so divide by the
    # repeat count to get a comparable time per call.
    results['Sum of Squares'] = {
        'Vectorized': timeit.timeit(vectorized_sum_squares, number=1000) / 1000,
        'Loop (1K only)': timeit.timeit(loop_sum_squares, number=100) / 100
    }
    
    # 2. GroupBy operations: pandas vs manual
    def pandas_groupby():
        return df.groupby('groups')['values'].mean()
    
    def manual_groupby():
        groups = {}
        for group, value in zip(df['groups'], df['values']):
            if group not in groups:
                groups[group] = []
            groups[group].append(value)
        return {k: np.mean(v) for k, v in groups.items()}
    
    results['GroupBy Mean'] = {
        'Pandas': timeit.timeit(pandas_groupby, number=100) / 100,
        'Manual': timeit.timeit(manual_groupby, number=10) / 10  # Much slower
    }
    
    # 3. Boolean indexing vs query
    def boolean_indexing():
        return df[df['values'] > 0]
    
    def query_method():
        return df.query('values > 0')
    
    results['Filtering'] = {
        'Boolean Indexing': timeit.timeit(boolean_indexing, number=100) / 100,
        'Query Method': timeit.timeit(query_method, number=100) / 100
    }
    
    return results

# Run benchmarks
print("⚡ Running Performance Benchmarks:")
benchmark_results = benchmark_operations()

# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, (operation, times) in enumerate(benchmark_results.items()):
    methods = list(times.keys())
    exec_times = list(times.values())
    
    bars = axes[i].bar(methods, exec_times, alpha=0.7)
    axes[i].set_ylabel('Time (seconds)')
    axes[i].set_title(f'{operation} Performance')
    axes[i].tick_params(axis='x', rotation=45)
    
    # Add value labels
    for bar, time_val in zip(bars, exec_times):
        axes[i].text(bar.get_x() + bar.get_width()/2, bar.get_height() + bar.get_height()*0.01,
                    f'{time_val:.4f}s', ha='center', va='bottom', fontsize=9)
    
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Performance Results:")
for operation, times in benchmark_results.items():
    print(f"\n{operation}:")
    for method, time_val in times.items():
        print(f"  {method}: {time_val:.4f} seconds")
    
    # Calculate speedup
    if len(times) == 2:
        methods = list(times.keys())
        speedup = max(times.values()) / min(times.values())
        print(f"  Speedup: {speedup:.1f}x")

## 🎓 Summary

In this notebook, we covered:

✅ **Matrix Operations** - Efficient batch operations, softmax, distance matrices  
✅ **PCA Implementation** - Dimensionality reduction using SVD  
✅ **Similarity Metrics** - Cosine, Pearson, Jaccard similarities  
✅ **Data Preprocessing** - Complete pipeline for real-world data  
✅ **Outlier Detection** - IQR method for anomaly detection  
✅ **Performance Optimization** - Vectorization vs loops benchmarking  

### 🚀 Next Steps
1. Practice implementing these algorithms from memory
2. Try variations with different datasets
3. Move on to scikit-learn algorithm implementations
4. Apply preprocessing pipelines to real datasets

### 📚 Additional Practice
- Implement other dimensionality reduction techniques (t-SNE, UMAP)
- Create custom pandas aggregation functions
- Build efficient recommendation system components
- Implement time series preprocessing functions

### 🔑 Key Interview Points
- **Vectorization** is crucial for performance
- **Memory management** matters with large datasets
- **Numerical stability** prevents edge case failures
- **Pipeline thinking** enables reusable, maintainable code

**Ready for the next challenge! 💪**