# Data Preprocessing

In [1]:
%load_ext autoreload
%autoreload 2
%pdb on

Automatic pdb calling has been turned ON


## Create a copy of the data
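
Working on a copy keeps the raw data intact so that any cleaning step can be redone from scratch. A minimal sketch, assuming the data lives in a pandas DataFrame; `raw_df` and its columns are hypothetical stand-ins for the loaded dataset:

```python
import pandas as pd

# Hypothetical raw data standing in for the loaded dataset.
raw_df = pd.DataFrame({"age": [25, 32, None], "city": ["Oslo", "Lima", "Oslo"]})

# deep=True (the default) copies the underlying data,
# so edits to `df` do not propagate back to `raw_df`.
df = raw_df.copy(deep=True)
df.loc[0, "age"] = 99
```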

## Data Cleaning

Tasks:
- Convert columns to their suitable data types
- Handle missing values
    - Determine why values are missing:
        - Missing not at random: missingness depends on the unobserved value itself.
        - Missing at random: missingness depends only on observed variables.
        - Missing completely at random: missingness is unrelated to any variable.
    - Delete either rows or columns.
    - Impute with the mean, median, or mode.
- Investigate the source of outliers, then fix or remove them.
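
The tasks above could be sketched with pandas roughly as follows. The DataFrame, its column names, and the choice of the 1.5 × IQR rule for flagging outliers are all assumptions made for illustration; the right imputation and outlier strategy depends on why the values are missing or extreme.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, missing values, one extreme price.
df = pd.DataFrame({
    "price": ["10.5", "12.0", None, "11.0", "950.0"],
    "sold_on": ["2021-01-03", "2021-01-04", "2021-01-05", None, "2021-01-07"],
    "color": ["red", "blue", None, "red", "red"],
})

# Convert columns to suitable data types.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["sold_on"] = pd.to_datetime(df["sold_on"], errors="coerce")
df["color"] = df["color"].astype("category")

# Quantify missingness per column before deciding how to handle it.
missing_share = df.isna().mean()

# Impute: median for the numeric column, mode for the categorical one.
df["price"] = df["price"].fillna(df["price"].median())
df["color"] = df["color"].fillna(df["color"].mode()[0])

# Delete rows that still lack a date.
df = df.dropna(subset=["sold_on"])

# Flag candidate outliers with the 1.5 * IQR rule (one heuristic among many).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
```

Flagged rows are candidates for investigation, not automatic removal.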

## Implement data transforms

Options:
- Transform numerical columns by discretization.
- Transform categorical columns (one-hot; embeddings).
- Use `ColumnTransformer`
    - `ColumnTransformer` to apply different preprocessing to different columns [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/01_column_transformer.ipynb)]
    - Seven ways to select columns using `ColumnTransformer` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/02_select_columns.ipynb)]
    - Get the feature names output by a `ColumnTransformer` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/38_get_feature_names.ipynb)]
    - Pass through some columns and drop others in a `ColumnTransformer` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/42_passthrough_or_drop.ipynb)]
- Use `Pipeline`
    - `Pipeline` to chain together multiple steps [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/08_pipeline.ipynb)]
    - Examine the intermediate steps in a `Pipeline` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/13_examine_pipeline_steps.ipynb), [example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/30_examine_pipeline_steps.ipynb)]
    - `cross_val_score` and `GridSearchCV` on a `Pipeline` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/16_pipeline_cross_validation.ipynb)]
    - Save a model or `Pipeline` using `joblib` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/28_joblib.ipynb)]
    - Create an interactive diagram of a `Pipeline` in Jupyter [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/37_pipeline_diagram.ipynb)]
    - Access part of a `Pipeline` using slicing [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/48_pipeline_slicing.ipynb)]
    - `FunctionTransformer` to convert functions into transformers [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/33_function_transformer.ipynb)]
- Use encoders
    - Encode categorical features using `OneHotEncoder` or `OrdinalEncoder` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/06_encode_categorical_features.ipynb)]
    - Handle unknown categories with `OneHotEncoder` by encoding them as zeros [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/07_handle_unknown_categories.ipynb)]
    - `OrdinalEncoder` instead of `OneHotEncoder` with tree-based models [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/43_ordinal_encoding_for_trees.ipynb)]
    - Create feature interactions using `PolynomialFeatures` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/45_feature_interactions.ipynb)]
- Use imputers
    - Impute missing values using `KNNImputer` or `IterativeImputer` [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/11_new_imputers.ipynb)]
    - Add a missing indicator to encode "missingness" as a feature [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/09_add_missing_indicator.ipynb)]
    - `HistGradientBoostingClassifier` natively supports missing values [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/14_handle_missing_values.ipynb)]
    - Two ways to impute missing values for a categorical feature [[example](https://nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/27_impute_categorical_features.ipynb)]
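
Several of the options above can be combined in one object. The sketch below is one possible arrangement, not a recommended recipe: the toy data, the column names, and the choice of `LogisticRegression` as the final estimator are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
X = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54, 2, 27, 14, 41, 30],
    "fare": [7.3, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1, 16.7, 8.1],
    "port": ["S", "C", "S", "S", np.nan, "S", "C", "C", "S", "Q"],
})
y = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])

# Numeric columns: median imputation plus a missing indicator, then scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding.
# handle_unknown="ignore" encodes unseen categories as all zeros.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Apply different preprocessing to different columns
# (columns not listed are dropped by default).
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "fare"]),
    ("cat", categorical, ["port"]),
])

# Chain preprocessing and the estimator so both are refit inside each CV fold.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X, y, cv=2)

model.fit(X, y)
new_row = pd.DataFrame({"age": [29], "fare": [9.5], "port": ["Z"]})  # "Z" is unseen
pred = model.predict(new_row)

# Slicing returns the preprocessing part of the fitted pipeline.
X_transformed = model[:-1].transform(X)
```

Because imputation, scaling, and encoding live inside the `Pipeline`, cross-validation fits them on each training fold only.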

## Conduct Feature Engineering

#### Options
- Scale the values: min-max scaling or standardization.
- Normalize or transform distributions.
- Discretize continuous features.
- Encode categorical features: hashing, …
- Conduct feature crossing.
- Decompose features (categorical, datetime, ...)
- Add promising feature transformations ($\log(x)$, $\sqrt{x}$, $x^{2}$, ...)
- Aggregate features into promising new features

#### Tips
- Split time-correlated data by time into train/valid/test splits instead of splitting it randomly.
- If you oversample your data, do it after splitting.
- Scale and normalize your data after splitting to avoid data leakage.
- Use statistics from only the train split, instead of the entire data, to scale your features and handle missing values.
- Understand how your data is generated, collected, and processed. Involve domain experts if possible.
- Keep track of your data’s lineage.
- Understand feature importance to your model.
- Use features that generalize well.
- Remove features that are no longer useful from your models.
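
The splitting and scaling tips can be sketched as below. The synthetic data, the 70/15/15 proportions, and the `prepare` helper are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical time-stamped data with some missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=100, freq="D"),
    "x": rng.normal(loc=50.0, scale=10.0, size=100),
})
df.loc[rng.choice(100, size=10, replace=False), "x"] = np.nan

# Split by time: earliest rows for training, latest rows for testing.
df = df.sort_values("timestamp").reset_index(drop=True)
n = len(df)
train_end, valid_end = n * 70 // 100, n * 85 // 100
train = df.iloc[:train_end].copy()
valid = df.iloc[train_end:valid_end].copy()
test = df.iloc[valid_end:].copy()

# Compute imputation and scaling statistics from the train split only.
train_median = train["x"].median()
scaler = StandardScaler().fit(train[["x"]].fillna(train_median))


def prepare(split: pd.DataFrame) -> np.ndarray:
    """Impute and scale a split using train-split statistics (hypothetical helper)."""
    return scaler.transform(split[["x"]].fillna(train_median))


X_train, X_valid, X_test = prepare(train), prepare(valid), prepare(test)
```

The valid and test splits are transformed with the train statistics, so nothing about their distribution reaches the fitted scaler.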

## Check for data leakage 

Red flags:
- Splitting time-correlated data randomly instead of by time.
- Scaling before splitting.
- Filling in missing data with statistics from the test split.
- Poorly handling data duplication before splitting.
- Group leakage, where two items belong to the same conceptual “group” but one is in train while the other is in test.
- Leaking from the data generation process.
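
Two of these red flags, duplication and group leakage, can be checked mechanically. A minimal sketch, assuming a hypothetical `patient_id` column defines the groups:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: several rows per patient, with the last row an exact duplicate.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 4, 4],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.8],
    "label": [0, 0, 1, 1, 0, 0, 1, 1, 1],
})

# Remove exact duplicates before splitting, so copies cannot straddle the split.
df = df.drop_duplicates().reset_index(drop=True)

# Split by group so that every patient lands entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no group should appear on both sides.
shared_groups = set(train["patient_id"]) & set(test["patient_id"])
```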

<span style="color:gray">

## Preprocess for DL

- Feature engineer before normalizing, especially for small datasets.
- Normalize inputs by subtracting the mean and dividing by the standard deviation.
- Convert inputs to float32 tensors.
</span>
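
The normalization and dtype steps might look like this in NumPy. The random arrays stand in for real features, the small `eps` term is an added safeguard not mentioned above, and the resulting arrays would still need to be handed to the deep learning framework of choice to become tensors.

```python
import numpy as np

# Hypothetical feature matrices (float64 by default).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(80, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

# Per-feature statistics come from the train split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
eps = 1e-8  # assumption: guards against division by zero for constant features

# Subtract the mean, divide by the standard deviation, then cast to float32.
X_train_norm = ((X_train - mean) / (std + eps)).astype(np.float32)
X_test_norm = ((X_test - mean) / (std + eps)).astype(np.float32)
```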

---