## Handling Missing Values in Large-scale ML Pipelines:

**Task 1**: Impute with Mean or Median
- Step 1: Load a dataset and introduce missing values artificially (e.g., the California Housing dataset, which ships without any).
- Step 2: Identify columns with missing values.
- Step 3: Impute missing values using the mean or median of the respective columns.

In [10]:
```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer

# Load the California Housing data as a DataFrame
data = fetch_california_housing(as_frame=True)
df = data.frame.copy()

# Introduce missing values artificially in 10% of the 'MedInc' rows
df.loc[df.sample(frac=0.1, random_state=42).index, 'MedInc'] = np.nan

# Identify the columns that contain at least one missing value
missing_cols = df.columns[df.isnull().any()]

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy='median')
df[missing_cols] = imputer.fit_transform(df[missing_cols])
```
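
Step 3 leaves the choice between mean and median open. The sketch below illustrates the difference on a hypothetical toy column (`income` and its values are made up for illustration, not taken from the dataset above): a single extreme value pulls the mean far from the typical row, while the median stays put.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column: one extreme value (500) and one missing entry
toy = pd.DataFrame({"income": [30.0, 32.0, 35.0, 38.0, 500.0, np.nan]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

# The last row was missing; compare what each strategy put there
print("mean fill:  ", mean_filled[-1, 0])    # 127.0
print("median fill:", median_filled[-1, 0])  # 35.0
```

For skewed columns the median is usually the safer default. For roughly symmetric columns the two strategies give similar fills.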


**Task 2**: Impute with the Most Frequent Value
- Step 1: Use the Titanic dataset and identify columns with missing values.
- Step 2: Impute categorical columns using the most frequent value.

In [11]:
```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Identify categorical (object-typed) columns that contain missing values
cat_cols = df.select_dtypes(include=['object']).columns
missing_cat_cols = [col for col in cat_cols if df[col].isnull().any()]

# Replace each missing value with the most frequent value (mode) of its column
imputer = SimpleImputer(strategy='most_frequent')
df[missing_cat_cols] = imputer.fit_transform(df[missing_cat_cols])
```
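
To see what the imputer actually fills in without downloading anything, here is a minimal offline sketch. The toy frame is hypothetical; it only borrows two column names from the Titanic data. After fitting, `statistics_` holds the fill value chosen for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing categorical entries
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", np.nan, "Q", "S"],
    "Cabin": ["B5", np.nan, "C23", np.nan, "B5", np.nan],
}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# One fill value per column, in column order
fill_values = dict(zip(toy.columns, imputer.statistics_))
print(fill_values)
print(filled)
```

Note that `Cabin` is half missing in this toy frame, so after imputation most of its rows carry the same value. For columns like that, a separate "missing" category may be a better choice than the mode.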


**Task 3**: Advanced Imputation - k-Nearest Neighbors
- Step 1: Implement KNN imputation using the KNNImputer from sklearn.
- Step 2: Compare KNN imputation with simpler methods such as mean or median imputation.

In [12]:
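```python
import pandas as pd
from sklearn.impute import KNNImputer

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Use numeric columns only, leaving out the row identifier and the target,
# which should not influence the neighbor distances
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
num_cols = num_cols.drop(['PassengerId', 'Survived'])

# Fill each missing value with the average over the 5 most similar rows
imputer = KNNImputer(n_neighbors=5)
df[num_cols] = imputer.fit_transform(df[num_cols])
```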
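
Step 2 asks for a comparison with simpler methods. One way to make that concrete is to hide values whose truth is known and measure the error of each imputer. The sketch below does this on hypothetical synthetic data in which `y` is exactly `2 * x`; the names `truth`, `masked`, `hidden`, and `mae` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Synthetic data with a strong relationship between the two columns
truth = pd.DataFrame({"x": np.arange(1.0, 11.0)})
truth["y"] = 2 * truth["x"]

# Hide two known values of y
masked = truth.copy()
hidden = [0, 8]
masked.loc[hidden, "y"] = np.nan

def mae(imputer):
    """Mean absolute error of the imputed y values against the hidden truth."""
    filled = imputer.fit_transform(masked)
    return np.abs(filled[hidden, 1] - truth.loc[hidden, "y"].to_numpy()).mean()

knn_mae = mae(KNNImputer(n_neighbors=2))
mean_mae = mae(SimpleImputer(strategy="mean"))
print(f"KNN MAE:  {knn_mae:.2f}")
print(f"Mean MAE: {mean_mae:.2f}")
```

KNN comes out ahead here because the observed column is highly informative about the missing one. On data where features are weakly related, or on very different scales, the advantage can shrink or disappear, so the comparison is worth repeating on the real dataset.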

## Feature Scaling & Normalization Best Practices:

**Task 1**: Standardization
- Step 1: Standardize features using StandardScaler.
- Step 2: Observe how standardization affects data distribution.

In [13]:
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = {'Age': [25, 32, 47, 51, 62], 'Income': [50000, 64000, 120000, 110000, 150000]}
df = pd.DataFrame(data)

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```
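
For Step 2, a simple way to observe the effect is to compare summary statistics before and after scaling. This sketch repeats the cell's data so it runs on its own; `summary` is an illustrative name. `StandardScaler` uses the population standard deviation, hence `ddof=0`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Age': [25, 32, 47, 51, 62],
                   'Income': [50000, 64000, 120000, 110000, 150000]})
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Side-by-side statistics for each column
summary = pd.DataFrame({
    "mean_before": df.mean(),
    "std_before": df.std(ddof=0),
    "mean_after": df_scaled.mean(),
    "std_after": df_scaled.std(ddof=0),
})
print(summary.round(3))
```

After scaling, every column has mean 0 and standard deviation 1. The shape of each distribution is unchanged, because standardization only shifts and rescales the values.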

**Task 2**: Min-Max Scaling

- Step 1: Scale features to lie between 0 and 1 using MinMaxScaler.
- Step 2: Compare with standardization.

In [14]:
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = {'Age': [25, 32, 47, 51, 62], 'Income': [50000, 64000, 120000, 110000, 150000]}
df = pd.DataFrame(data)

# Rescale each column linearly so its minimum maps to 0 and its maximum to 1
scaler = MinMaxScaler()
df_minmax_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```
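
For the comparison in Step 2, the two scalers can be applied to the same data and placed side by side. This is a minimal sketch; `minmax`, `standard`, and `comparison` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'Age': [25, 32, 47, 51, 62],
                   'Income': [50000, 64000, 120000, 110000, 150000]})

minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
standard = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Columns are grouped under the name of the scaler that produced them
comparison = pd.concat({"minmax": minmax, "standard": standard}, axis=1)
print(comparison.round(3))
```

Both are linear transformations, so the ordering and relative spacing of the values are identical. What differs is the output range: min-max scaling is bounded to [0, 1], while standardized values are centered on 0 and unbounded.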


**Task 3**: Robust Scaling
- Step 1: Scale features using RobustScaler, which is useful for data with outliers.
- Step 2: Assess changes in data scaling compared to other scaling methods.

In [15]:
```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# The last row is an outlier in both columns
data = {'Age': [25, 32, 47, 51, 62, 150], 'Income': [50000, 64000, 120000, 110000, 150000, 1000000]}
df = pd.DataFrame(data)

# Center each column on its median and scale by its interquartile range
scaler = RobustScaler()
df_robust_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```
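
One way to approach Step 2 is to look at how much of the scaled range is left for the ordinary rows once the outlier is present. The sketch below uses the `Age` column from the cell above; `scaled_age` and `inlier_range` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Same Age values as above; the last one (150) is the outlier
df = pd.DataFrame({'Age': [25, 32, 47, 51, 62, 150]})

scalers = {
    "minmax": MinMaxScaler(),
    "standard": StandardScaler(),
    "robust": RobustScaler(),
}
scaled_age = pd.DataFrame(
    {name: scaler.fit_transform(df[['Age']]).ravel() for name, scaler in scalers.items()}
)

# Spread of the five non-outlier rows under each scaler
inliers = scaled_age.iloc[:5]
inlier_range = inliers.max() - inliers.min()
print(scaled_age.round(3))
print(inlier_range.round(3))
```

With min-max scaling the outlier claims most of the [0, 1] interval and the remaining rows are squeezed together. `RobustScaler` relies on the median and interquartile range, which the outlier barely affects, so the ordinary rows keep a wider spread.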

## Feature Selection Techniques:
### Removing Highly Correlated Features:

**Task 1**: Correlation Matrix
- Step 1: Compute correlation matrix.
- Step 2: Remove one feature from each highly correlated pair (absolute correlation > 0.9).

In [16]:
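```python
import pandas as pd
import numpy as np

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [5, 3, 6, 2, 1],
    'D': [1, 2, 2, 4, 5]
}
df = pd.DataFrame(data)

# Absolute correlations, so strong negative relationships count as well
corr_matrix = df.corr().abs()

# Keep only the upper triangle so each pair is examined once
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Drop any column that correlates above 0.9 with an earlier column
to_drop = [col for col in upper_tri.columns if any(upper_tri[col] > 0.9)]
df_reduced = df.drop(columns=to_drop)
```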
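
Before dropping anything, it can help to list which pairs exceed the threshold. The following sketch reuses the same data and upper-triangle mask; `high_pairs` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [5, 3, 6, 2, 1],
    'D': [1, 2, 2, 4, 5],
})

corr_matrix = df.corr().abs()
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Flatten the matrix into a (row, column) -> correlation Series and filter it
pairs = upper_tri.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs.round(3))

# Columns flagged by the same rule as above
to_drop = [col for col in upper_tri.columns if (upper_tri[col] > 0.9).any()]
print(to_drop)
```

Here `A`, `B`, and `D` all correlate strongly with one another, so `B` and `D` are flagged and `A` survives. The rule always keeps the column that appears first, which means the result depends on column order. When one of the correlated features is known to be more useful, reorder the columns or choose manually.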


### Using Mutual Information & Variance Thresholds:

**Task 2**: Mutual Information
- Step 1: Compute mutual information between features and target.
- Step 2: Retain features with high mutual information scores.

In [17]:
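```python
from sklearn.feature_selection import mutual_info_classif
import pandas as pd
import numpy as np

# Random features and a random binary target, all generated independently
np.random.seed(0)
X = pd.DataFrame({
    "feature1": np.random.rand(100),
    "feature2": np.random.rand(100),
    "feature3": np.random.randint(0, 2, 100),
    "feature4": np.random.rand(100)
})
y = np.random.randint(0, 2, 100)

# Column index 2 (feature3) is discrete; the others are treated as continuous
mi_scores = mutual_info_classif(X, y, discrete_features=[2])

# Keep features whose mutual information with the target exceeds 0.05
selected_features = X.columns[mi_scores > 0.05]
selected_features.tolist()
```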


[]
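
The empty list is consistent with how the data was built: every feature is random and independent of `y`, so no score clears the threshold. As a contrast, here is a hedged sketch with one feature deliberately constructed to track the target. The names `informative` and `noise`, the sample size, and the noise level are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)

X = pd.DataFrame({
    # Hypothetical feature: the target plus a little Gaussian noise
    "informative": y + rng.normal(0, 0.3, 500),
    # Hypothetical feature: unrelated to the target
    "noise": rng.random(500),
})

mi_scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(mi_scores.sort_values(ascending=False))
```

The informative feature receives a much higher score than the noise feature. Mutual information estimates are noisy on small samples, so a fixed cutoff such as 0.05 is best treated as a starting point rather than a rule.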

**Task 3**: Variance Threshold
- Step 1: Implement VarianceThreshold to remove features with low variance.
- Step 2: Analyze impact on feature space.

In [18]:
```python
from sklearn.feature_selection import VarianceThreshold
import pandas as pd
import numpy as np

np.random.seed(0)
X = pd.DataFrame({
    "feature1": np.random.rand(100),
    "feature2": np.random.rand(100),
    "feature3": np.random.randint(0, 2, 100),
    "feature4": np.ones(100) * 0.5  # constant column, zero variance
})

# Remove features whose variance does not exceed 0.01
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
X_selected.shape
```


(100, 3)
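
The output shape shows that one feature was removed, but not which one. For Step 2, `get_support()` returns a boolean mask over the input columns and `variances_` holds the variance computed for each. A minimal sketch on the same data, with `kept` and `dropped` as illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

np.random.seed(0)
X = pd.DataFrame({
    "feature1": np.random.rand(100),
    "feature2": np.random.rand(100),
    "feature3": np.random.randint(0, 2, 100),
    "feature4": np.ones(100) * 0.5,  # constant column
})

selector = VarianceThreshold(threshold=0.01).fit(X)

# Boolean mask: True for columns that pass the threshold
mask = selector.get_support()
kept = X.columns[mask].tolist()
dropped = X.columns[~mask].tolist()

# Rebuild a labeled DataFrame from the transformed array
X_selected = pd.DataFrame(selector.transform(X), columns=kept)

print(pd.Series(selector.variances_, index=X.columns).round(4))
print("kept:   ", kept)
print("dropped:", dropped)
```

Only the constant `feature4` is removed. Variance depends on the scale of each feature, so a single threshold is only meaningful when the features are on comparable scales.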