In [1]:
# 2.1 Data Preprocessing for Classification
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Sample dataset
data = {
    'Age': [25, 27, 29, np.nan, 32, 33, np.nan],
    'Salary': [50000, 52000, 54000, 58000, np.nan, 64000, 66000],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago', 'Los Angeles', np.nan]
}

df = pd.DataFrame(data)

# Splitting the dataset into features and target variable for illustration
X = df.dropna(subset=['City'])  # Dropping rows where 'City' is NaN for this example
y = [0, 1, 0, 1, 1, 0]  # Dummy target variable: one label per row left in X after dropna

# Splitting into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining preprocessing for numerical columns (impute missing values then scale)
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Defining preprocessing for categorical columns (impute missing values then apply one-hot encoding)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combining preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, ['Age', 'Salary']),
        ('cat', categorical_transformer, ['City'])
    ]
)

# Applying the preprocessing to the training data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# The preprocessed data is now ready for model training.
# The result here is a dense NumPy array; convert it back to a DataFrame if you need readable column names.

print(" Training Data:\n", X_train)
print(" Test Data:\n", X_test)

print("Preprocessed Training Data:\n", X_train_preprocessed)

print("Preprocessed Test Data:\n", X_test_preprocessed)


#  1. `pd.DataFrame.dropna(subset=['City'])`
# - Parameters:
#   - `subset`: Column names to consider for identifying rows with missing values. Rows with NaN in these columns get dropped.
#   - `how`: Whether a row is dropped when at least one of the checked values is NA (`'any'`) or only when all of them are NA (`'all'`). Default is `'any'`.
#   - `inplace`: If `True`, modify the DataFrame in place and return `None`. Default is `False`.

# - Alternatives:
#   - If you want to drop rows only where all specified columns are NaN, use `how='all'`.
#   - To modify the DataFrame itself instead of getting a new DataFrame back, use `inplace=True`; the call then returns `None`.
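
# A minimal sketch of the difference between `how='any'` and `how='all'`,
# assuming a small hypothetical frame (`demo_rows` is illustrative only and
# is not part of the dataset above).
import numpy as np
import pandas as pd

demo_rows = pd.DataFrame({
    'Age': [25.0, np.nan, np.nan],
    'City': ['New York', np.nan, 'Chicago'],
})
# Default how='any': drop a row if either 'Age' or 'City' is missing.
demo_any = demo_rows.dropna(subset=['Age', 'City'])
# how='all': drop a row only if both 'Age' and 'City' are missing.
demo_all = demo_rows.dropna(subset=['Age', 'City'], how='all')
# demo_any keeps row 0; demo_all keeps rows 0 and 2.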

#  2. `train_test_split(X, y, test_size=0.2, random_state=42)`
# - Parameters:
#   - `X, y`: Arrays or matrices containing the dataset to split.
#   - `test_size`: The share of the dataset to put in the test split. Either a float between 0 and 1 (fraction of the dataset) or an int (absolute number of test samples). Default is `None`, in which case it is set to the complement of `train_size`, or to 0.25 if `train_size` is also `None`.
#   - `random_state`: Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

# - Alternatives:
#   - `train_size`: The counterpart of `test_size` for the training split (float fraction or int count). If `None`, it is set to the complement of `test_size`.
#   - `shuffle`: Whether to shuffle the data before splitting. Default is `True`. Set `shuffle=False` to preserve row order, for example with time-series data; `stratify` must then be `None`.
#   - `stratify`: If not `None`, the data is split in a stratified fashion, using this array as the class labels, so each split keeps roughly the same class proportions as the full dataset.
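
# A minimal sketch of a stratified split, assuming a small hypothetical
# balanced dataset (`demo_X` and `demo_y` are illustrative names only).
# With `stratify=demo_y`, both splits keep the 50/50 class ratio.
import numpy as np
from sklearn.model_selection import train_test_split

demo_X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
demo_y = np.array([0, 1] * 5)          # 5 samples per class

demo_X_tr, demo_X_te, demo_y_tr, demo_y_te = train_test_split(
    demo_X, demo_y, test_size=0.4, stratify=demo_y, random_state=0
)
# The test split holds 2 samples of each class, the training split 3 of each.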

#  3. `SimpleImputer(strategy='mean')`
# - Parameters:
#   - `strategy`: The imputation strategy. Choices include `"mean"`, `"median"`, `"most_frequent"`, and `"constant"`. Default is `"mean"`.
#   - `fill_value`: When `strategy="constant"`, `fill_value` replaces all occurrences of missing values. Default is `None`, which means 0 for numerical data and `"missing_value"` for strings or object data.

# - Alternatives:
#   - Using `strategy="median"` is a good alternative for numerical data, especially when the data might have outliers that could heavily influence the mean.
#   - For categorical data, `strategy="most_frequent"` or `strategy="constant"` with a specified `fill_value` (like `"missing"` or `0`) can be useful.
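
# A minimal sketch comparing imputation strategies, assuming a hypothetical
# numeric column with one outlier and a hypothetical categorical column
# (all `demo_*` names are illustrative only).
import numpy as np
from sklearn.impute import SimpleImputer

demo_num = np.array([[1.0], [2.0], [np.nan], [100.0]])  # 100.0 acts as an outlier
demo_mean = SimpleImputer(strategy='mean').fit_transform(demo_num)
demo_median = SimpleImputer(strategy='median').fit_transform(demo_num)
# The mean fills the gap with (1 + 2 + 100) / 3, about 34.33;
# the median fills it with 2.0 and is not pulled up by the outlier.

demo_cat = np.array([['a'], [np.nan], ['b']], dtype=object)
demo_const = SimpleImputer(strategy='constant', fill_value='missing').fit_transform(demo_cat)
# The missing category becomes the literal string 'missing'.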

#  4. `StandardScaler()`
# - Parameters:
#   - `with_mean`: If `True`, center the data before scaling. Default is `True`.
#   - `with_std`: If `True`, scale the data to unit variance (or equivalently, unit standard deviation). Default is `True`.

# - Alternatives:
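#   - `MinMaxScaler`: Scales each feature to a given range, by default 0 to 1.
#   - `RobustScaler`: Useful if your data contains outliers. By default it centers each feature on its median and scales it by the interquartile range, so extreme values have less influence than with the mean and standard deviation.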
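
# A minimal sketch of how the three scalers treat the same column, assuming
# a hypothetical feature with one outlier (`demo_col` is illustrative only).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

demo_col = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
demo_standard = StandardScaler().fit_transform(demo_col)  # (x - mean) / std
demo_minmax = MinMaxScaler().fit_transform(demo_col)      # min maps to 0, max to 1
demo_robust = RobustScaler().fit_transform(demo_col)      # (x - median) / IQR
# Here the median is 3 and the IQR is 4 - 2 = 2, so the first four values
# become -1, -0.5, 0, 0.5 while the outlier stays far away at 48.5.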

#  5. `OneHotEncoder(handle_unknown='ignore')`
# - Parameters:
#   - `handle_unknown`: Determines what happens when the encoder encounters a category not seen during fit. Options include `"error"` (the default) and `"ignore"`. With `"ignore"`, the unknown category is encoded as all zeros.
#   - `sparse_output` (named `sparse` before scikit-learn 1.2): Whether the transformed output is a sparse matrix or a dense 2D array. Default is `True` (sparse).

# - Alternatives:
#   - `LabelEncoder`: Intended for encoding target labels (y), not input features (X).
#   - `OrdinalEncoder`: Transforms categorical features to ordinal integers. Useful when the categorical features have a natural order.
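
# A minimal sketch of two behaviours described above, assuming hypothetical
# city and size values (all `demo_*` names are illustrative only).
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# handle_unknown='ignore': a category not seen during fit becomes all zeros.
demo_ohe = OneHotEncoder(handle_unknown='ignore')
demo_ohe.fit(np.array([['Chicago'], ['New York']], dtype=object))
demo_unknown = demo_ohe.transform(np.array([['Boston']], dtype=object)).toarray()
# demo_unknown is [[0., 0.]]

# OrdinalEncoder with an explicit order: small < medium < large.
demo_sizes = np.array([['small'], ['large'], ['medium'], ['small']], dtype=object)
demo_ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
demo_codes = demo_ordinal.fit_transform(demo_sizes)
# demo_codes is [[0.], [2.], [1.], [0.]]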

#  6. `ColumnTransformer`
# - Parameters:
#   - `transformers`: List of transformers to apply. Each transformer is a tuple containing a name, transformer object, and column(s) to apply the transformer to.
#   - `remainder`: Determines what to do with the remaining columns not explicitly selected in `transformers`. Options are `'drop'` (default), `'passthrough'`, or a transformer object to apply.
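
# A minimal sketch of `remainder='passthrough'`, assuming a hypothetical
# frame with an extra 'Id' column (`demo_df` is illustrative only).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

demo_df = pd.DataFrame({'Age': [20.0, 30.0, 40.0], 'Id': [1, 2, 3]})
demo_ct = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['Age'])],
    remainder='passthrough',  # keep 'Id' unchanged instead of dropping it
)
demo_out = demo_ct.fit_transform(demo_df)
# demo_out has 2 columns: the scaled 'Age' first, then the untouched 'Id'.
# With the default remainder='drop', only the scaled 'Age' column would remain.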





#  `random_state=42` in `train_test_split`
# `train_test_split` from scikit-learn splits the dataset into training and test sets. Setting `random_state=42` seeds the random number generator that shuffles the rows, so the split is reproducible: running this code on the same data with the same seed yields the same training and test sets. This is useful for teaching and demonstrations, and for debugging or comparing runs where results must stay consistent.

#  `X_train_preprocessed = preprocessor.fit_transform(X_train)`
# This line applies the preprocessing steps defined in the `preprocessor` to the training data (`X_train`). The `preprocessor` is a `ColumnTransformer` that combines both numerical and categorical transformations:

# - For numerical columns (`'Age'`, `'Salary'`), it first imputes missing values using the mean (with `SimpleImputer(strategy='mean')`), and then standardizes the features (with `StandardScaler()`), which scales the features to have a mean of 0 and a standard deviation of 1.

# - For the categorical column (`'City'`), it first imputes missing values by replacing them with the most frequent category (with `SimpleImputer(strategy='most_frequent')`), and then applies one-hot encoding (with `OneHotEncoder(handle_unknown='ignore')`), which converts the categorical variable into a format that can be provided to machine learning algorithms (creating a binary column for each category).

# The `fit_transform` method on the `preprocessor` does two things: it first `fit`s the transformers to the training data, learning any necessary parameters (like the mean and standard deviation for scaling, or the categories for one-hot encoding), and then `transform`s the training data according to these parameters, outputting the preprocessed data ready for model training.

#  `X_test_preprocessed = preprocessor.transform(X_test)`
# Once the `preprocessor` has been fitted on the training data, its `transform` method is applied to the test data (`X_test`). No fitting happens here: the imputation values, scaling statistics, and one-hot categories all come from the training data, not from `X_test`. The test data is therefore preprocessed in exactly the same way as the training data, which the model needs in order to make meaningful predictions on it. This is also why the scaled test values in the printed output (for example, -4.30 for `Age`) are not centered on 0. Refitting the `preprocessor` on the test data would leak test-set information into preprocessing and make the evaluation overly optimistic.
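
# A minimal sketch of "fit on train, transform on test", assuming a
# hypothetical one-column dataset (all `demo_*` names are illustrative only).
import numpy as np
from sklearn.preprocessing import StandardScaler

demo_train = np.array([[10.0], [20.0], [30.0]])
demo_test = np.array([[40.0]])

demo_scaler = StandardScaler().fit(demo_train)  # learns mean 20 and std of about 8.165
demo_test_scaled = demo_scaler.transform(demo_test)
# The test value is scaled with the training statistics:
# (40 - 20) / 8.165 is about 2.449, not 0.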

# In this workflow, the preprocessed training and test data are NumPy arrays. If readability or later processing calls for a DataFrame, the arrays can be converted back into pandas DataFrames; the column names for the one-hot encoded variables then have to be specified manually or extracted from the fitted `preprocessor` (for example with `get_feature_names_out()` in recent scikit-learn versions).
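
# A minimal sketch of converting the preprocessed array back into a
# DataFrame, assuming scikit-learn 1.1 or later (where every step used here
# supports `get_feature_names_out`) and a hypothetical two-column frame
# (`demo_frame` and `demo_pre` are illustrative names only).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

demo_frame = pd.DataFrame({
    'Age': [25.0, np.nan, 32.0],
    'City': ['New York', 'Chicago', 'Chicago'],
})
demo_pre = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['Age']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['City']),
    ],
    sparse_threshold=0,  # always return a dense array
)
demo_array = demo_pre.fit_transform(demo_frame)
demo_named = pd.DataFrame(
    demo_array,
    columns=demo_pre.get_feature_names_out(),  # names come from the fitted transformer
    index=demo_frame.index,                    # keep the original row labels
)
# Column names: num__Age, cat__City_Chicago, cat__City_New York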

 Training Data:
     Age   Salary         City
5  33.0  64000.0  Los Angeles
2  29.0  54000.0     New York
4  32.0      NaN      Chicago
3   NaN  58000.0      Chicago
 Test Data:
     Age   Salary         City
0  25.0  50000.0     New York
1  27.0  52000.0  Los Angeles
Preprocessed Training Data:
 [[ 1.13227703  1.4985373   0.          1.          0.        ]
 [-1.58518785 -1.31122014  0.          0.          1.        ]
 [ 0.45291081  0.          1.          0.          0.        ]
 [ 0.         -0.18731716  1.          0.          0.        ]]
Preprocessed Test Data:
 [[-4.30265273 -2.43512311  0.          0.          1.        ]
 [-2.94392029 -1.87317162  0.          1.          0.        ]]
