## Handling Missing Values in Large-scale ML Pipelines:

**Task 1**: Impute with Mean or Median
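- Step 1: Load a dataset and, if it has no gaps, introduce some missing values (e.g., the California Housing dataset; Boston Housing was removed from scikit-learn in version 1.2).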
- Step 2: Identify columns with missing values.
- Step 3: Impute missing values using the mean or median of the respective columns.

In [6]:
# write your code from here
import pandas as pd
import numpy as np
import logging
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_data():
    try:
        data = fetch_california_housing(as_frame=True)
        df = data.frame
        logger.info("Data loaded successfully.")
        return df
    except Exception as e:
        logger.error("Failed to load dataset: %s", e)
        raise

def impute_data(df):
    try:
        imputer = SimpleImputer(strategy='mean')
        df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        logger.info("Mean imputation successful.")
        return df_imputed
    except ValueError as e:
        logger.error("ValueError during imputation: %s", e)
        raise

def knn_impute_data(df):
    try:
        imputer = KNNImputer(n_neighbors=3)
        df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
        logger.info("KNN imputation successful.")
        return df_imputed
    except ValueError as e:
        logger.error("ValueError during KNN imputation: %s", e)
        raise

def scale_data(df):
    try:
        scaler = StandardScaler()
        scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
        logger.info("Standardization successful.")
        return scaled_df
    except Exception as e:
        logger.error("Error in standardization: %s", e)
        raise

def minmax_scale_data(df):
    try:
        scaler = MinMaxScaler()
        scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
        logger.info("Min-Max scaling successful.")
        return scaled_df
    except Exception as e:
        logger.error("Error in Min-Max scaling: %s", e)
        raise

def robust_scale_data(df):
    try:
        scaler = RobustScaler()
        scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
        logger.info("Robust scaling successful.")
        return scaled_df
    except Exception as e:
        logger.error("Error in robust scaling: %s", e)
        raise

def remove_highly_correlated(df, threshold=0.9):
    try:
        corr_matrix = df.corr().abs()
        upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
        to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
        df_reduced = df.drop(columns=to_drop)
        logger.info("Highly correlated features removed: %s", to_drop)
        return df_reduced
    except Exception as e:
        logger.error("Error in correlation removal: %s", e)
        raise

def select_features_by_mi(X, y, top_k=5):
    try:
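        # Fix the seed: the estimator is randomized, so scores can vary between runs otherwise.
        mi = mutual_info_classif(X, y, random_state=0)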
        selected_features = X.columns[np.argsort(mi)[-top_k:]]
        logger.info("Selected top mutual information features: %s", list(selected_features))
        return X[selected_features]
    except Exception as e:
        logger.error("Error in mutual information feature selection: %s", e)
        raise

def apply_variance_threshold(df, threshold=0.01):
    try:
        selector = VarianceThreshold(threshold=threshold)
        reduced_df = pd.DataFrame(selector.fit_transform(df), columns=df.columns[selector.get_support()])
        logger.info("Low variance features removed.")
        return reduced_df
    except Exception as e:
        logger.error("Error in variance thresholding: %s", e)
        raise

def main():
    df = load_data()
    df_missing = df.copy()
    df_missing.iloc[0:10, 0] = np.nan  # introduce missing values

    df_imputed = impute_data(df_missing)
    df_knn_imputed = knn_impute_data(df_missing)

    df_scaled = scale_data(df_imputed)
    df_minmax_scaled = minmax_scale_data(df_imputed)
    df_robust_scaled = robust_scale_data(df_imputed)

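    # Split features from the target so the target never enters feature selection.
    X = df_scaled.drop(columns=['MedHouseVal'])
    # mutual_info_classif expects a discrete target, so binarize at the median.
    y = df_scaled['MedHouseVal'] > df_scaled['MedHouseVal'].median()

    X_uncorrelated = remove_highly_correlated(X)
    X_mi = select_features_by_mi(X, y)
    # Standardized features all have unit variance, so threshold the unscaled ones.
    X_vt = apply_variance_threshold(df_imputed.drop(columns=['MedHouseVal']))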

if __name__ == "__main__":
    main()

INFO:__main__:Data loaded successfully.
INFO:__main__:Mean imputation successful.
INFO:__main__:KNN imputation successful.
INFO:__main__:Standardization successful.
INFO:__main__:Min-Max scaling successful.
INFO:__main__:Robust scaling successful.
INFO:__main__:Highly correlated features removed: ['Longitude']
INFO:__main__:Selected top mutual information features: ['AveOccup', 'AveRooms', 'Latitude', 'Longitude', 'MedInc']
INFO:__main__:Low variance features removed.
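
The cell above imputes with the mean but does not show Step 2 (finding the affected columns) or the median option. A minimal sketch of both follows, using a small hypothetical DataFrame (the column names and values are invented for illustration, not taken from the housing data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for a real dataset.
df = pd.DataFrame({
    "rooms": [3.0, 4.0, np.nan, 5.0, 40.0],  # 40.0 is a deliberate outlier
    "age": [10.0, np.nan, 30.0, 20.0, 40.0],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],      # complete column
})

# Step 2: count missing values per column and keep only the affected columns.
missing_counts = df.isna().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Step 3: impute with the median, which is less sensitive to outliers than the mean.
median_imputer = SimpleImputer(strategy="median")
df_median = df.copy()
df_median[cols_with_missing] = median_imputer.fit_transform(df[cols_with_missing])

print(missing_counts)
print(df_median)
```

In this toy example the median of `rooms` is 4.5, whereas the mean would be pulled up to 13 by the single outlier, which is the usual argument for preferring the median on skewed columns.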


**Task 2**: Impute with the Most Frequent Value
- Step 1: Use the Titanic dataset and identify columns with missing values.
- Step 2: Impute categorical columns using the most frequent value.

In [7]:
# write your code from here
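
One possible way to approach this task is sketched below. To stay self-contained it does not download the Titanic data; instead it builds a small hypothetical frame with Titanic-style column names (`Embarked`, `Sex`, `Age`), and the values are invented. The mode is taken with pandas rather than scikit-learn, which is an assumption about tooling, not a requirement of the task.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic dataset (values are made up).
titanic_like = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S", "Q", "S"],
    "Sex": ["male", "female", "female", np.nan, "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
})

# Step 1: identify columns with missing values.
missing_counts = titanic_like.isna().sum()

# Step 2: fill each non-numeric column with its most frequent value (the mode).
categorical_cols = titanic_like.select_dtypes(exclude="number").columns
df_filled = titanic_like.copy()
for col in categorical_cols:
    most_frequent = df_filled[col].mode(dropna=True).iloc[0]
    df_filled[col] = df_filled[col].fillna(most_frequent)

print(missing_counts)
print(df_filled)
```

The numeric `Age` column is left untouched on purpose, since the most frequent value is rarely a sensible fill for a continuous feature.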

**Task 3**: Advanced Imputation - k-Nearest Neighbors
- Step 1: Implement KNN imputation using the KNNImputer from sklearn.
- Step 2: Compare the values KNN imputation fills in against those from simpler methods such as mean imputation.

In [8]:
# write your code from here
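
The comparison in Step 2 can be sketched on synthetic data where the true values are known. The setup below is an assumption made for illustration: two features with a near-linear relationship, 20 values hidden at random, and mean absolute error as the yardstick. On real data the gap between the methods may be smaller or even reversed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on x, so neighbours in x are informative about y.
x = rng.uniform(0, 10, size=200)
df_true = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 0.1, size=200)})

# Hide 20 values of y so the imputed values can be checked against the truth.
mask_idx = rng.choice(200, size=20, replace=False)
df_missing = df_true.copy()
df_missing.loc[mask_idx, "y"] = np.nan

knn_filled = KNNImputer(n_neighbors=3).fit_transform(df_missing)
mean_filled = SimpleImputer(strategy="mean").fit_transform(df_missing)

# Mean absolute error on the hidden entries only (column 1 is y).
true_vals = df_true.loc[mask_idx, "y"].to_numpy()
knn_mae = np.abs(knn_filled[mask_idx, 1] - true_vals).mean()
mean_mae = np.abs(mean_filled[mask_idx, 1] - true_vals).mean()

print(f"KNN MAE:  {knn_mae:.3f}")
print(f"Mean MAE: {mean_mae:.3f}")
```

KNN imputation helps here because the missing feature is predictable from the observed one; when features are unrelated, it tends to behave much like mean imputation at a higher computational cost.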

## Feature Scaling & Normalization Best Practices:

**Task 1**: Standardization
- Step 1: Standardize features using StandardScaler.
- Step 2: Observe how standardization affects data distribution.

In [9]:
# write your code from here
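
As a rough illustration of both steps, the sketch below standardizes a small hypothetical frame (invented column names and values) and tabulates the mean and standard deviation before and after.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
df = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 120_000.0],
    "rooms": [2.0, 3.0, 4.0, 7.0],
})

# Step 1: standardize each column to zero mean and unit variance.
scaler = StandardScaler()
df_std = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

# Step 2: compare summary statistics before and after.
# ddof=0 matches the population standard deviation that StandardScaler uses.
summary = pd.DataFrame({
    "mean_before": df.mean(),
    "std_before": df.std(ddof=0),
    "mean_after": df_std.mean(),
    "std_after": df_std.std(ddof=0),
})
print(summary)
```

Standardization only shifts and rescales each column, so the shape of the distribution (skew, outliers, ordering of values) is unchanged.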

**Task 2**: Min-Max Scaling
- Step 1: Scale features to lie between 0 and 1 using MinMaxScaler.
- Step 2: Compare with standardization.

In [10]:
# write your code from here
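
A minimal sketch of the comparison, again on a hypothetical toy frame with invented values: the same data is passed through `MinMaxScaler` and `StandardScaler`, and the resulting ranges are printed side by side.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on different scales.
df = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 120_000.0],
    "rooms": [2.0, 3.0, 4.0, 7.0],
})

# Step 1: min-max scaling maps each column onto [0, 1].
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Step 2: standardization centres each column on 0 with no fixed bounds.
df_std = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

ranges = pd.DataFrame({
    "minmax_min": df_minmax.min(),
    "minmax_max": df_minmax.max(),
    "std_min": df_std.min(),
    "std_max": df_std.max(),
})
print(ranges)
```

Min-max output is bounded, which suits models that expect inputs in a fixed range, while standardized output contains negative values and has no guaranteed upper limit.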

**Task 3**: Robust Scaling
- Step 1: Scale features using RobustScaler, which is useful for data with outliers.
- Step 2: Assess changes in data scaling compared to other scaling methods.

In [11]:
# write your code from here
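
To make the effect of outliers concrete, the sketch below scales one hypothetical column containing a single extreme value with all three scalers and measures how much spread the four ordinary values retain. The data and the "inlier spread" measure are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical column: four ordinary values and one outlier (100.0).
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 100.0]})

scaled = {
    "robust": RobustScaler().fit_transform(df)[:, 0],    # (x - median) / IQR
    "standard": StandardScaler().fit_transform(df)[:, 0],
    "minmax": MinMaxScaler().fit_transform(df)[:, 0],
}

# Spread of the four non-outlier values after each scaling.
inlier_spread = {name: arr[:4].max() - arr[:4].min() for name, arr in scaled.items()}

for name, arr in scaled.items():
    print(name, arr.round(3), "inlier spread:", round(inlier_spread[name], 3))
```

Because `RobustScaler` centres on the median and divides by the interquartile range, the outlier does not compress the ordinary values, whereas min-max scaling squeezes them into a narrow band near 0.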

## Feature Selection Techniques:
### Removing Highly Correlated Features:

**Task 1**: Correlation Matrix
- Step 1: Compute correlation matrix.
- Step 2: Remove one feature from each highly correlated pair (absolute correlation > 0.9).

In [12]:
# write your code from here
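
The two steps can be sketched on synthetic data as follows. The feature names `a`, `b`, and `c` are hypothetical; `b` is built as a noisy multiple of `a` so that exactly one pair exceeds the 0.9 threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic features: b is almost a copy of a, c is independent.
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

# Step 1: absolute correlation matrix.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Step 2: drop the later column of every pair above the threshold.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(df_reduced.columns))
```

Scanning the upper triangle means the column that appears later in the frame is the one removed, so column order determines which member of a pair survives.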

### Using Mutual Information & Variance Thresholds:

**Task 2**: Mutual Information
- Step 1: Compute mutual information between features and target.
- Step 2: Retain features with high mutual information scores.

In [13]:
# write your code from here
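
A small sketch of mutual information scoring on synthetic data is given below. It uses `mutual_info_regression` because the toy target is continuous; for a class label, `mutual_info_classif` would be the counterpart. The feature names and the choice of keeping only the top feature are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Synthetic features: only "signal" drives the target.
X = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_1": rng.normal(size=500),
    "noise_2": rng.normal(size=500),
})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=500)

# Step 1: estimate mutual information between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)

# Step 2: keep the highest-scoring features (top_k is a hypothetical cutoff).
top_k = 1
top_features = mi_scores.head(top_k).index.tolist()
X_selected = X[top_features]

print(mi_scores)
print("Selected:", top_features)
```

Setting `random_state` keeps the scores reproducible, since the estimator adds a small amount of random noise to continuous features.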

**Task 3**: Variance Threshold
- Step 1: Implement VarianceThreshold to remove features with low variance.
- Step 2: Analyze impact on feature space.

In [14]:
# write your code from here
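
The sketch below applies `VarianceThreshold` to a hypothetical frame with one constant column, one nearly constant column, and one varied column. The threshold of 0.2 is an arbitrary value chosen so that the first two are removed. Note that the selector is applied to unscaled data: after `StandardScaler` every feature has unit variance, so a variance filter would no longer discriminate between them.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features with very different variances.
df = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "almost_constant": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "varied": [1.0, 5.0, 2.0, 8.0, 3.0, 9.0],
})

# Step 1: fit the selector; features at or below the threshold are dropped.
selector = VarianceThreshold(threshold=0.2)
selector.fit(df)

kept_columns = df.columns[selector.get_support()]
df_reduced = df[kept_columns]

# Step 2: inspect the per-feature variances and the reduced feature space.
variances = pd.Series(selector.variances_, index=df.columns)
print(variances)
print("Kept:", list(kept_columns))
```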