## **Data Preprocessing For Machine Learning**

**Data Cleaning Is Not Data Preprocessing!**

It is worthwhile to know that data cleaning is different from data preprocessing. In the machine learning pipeline, data cleaning comes first. For emphasis, let me throw more light on data cleaning.

Data cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in raw data. The tasks you will most likely encounter in data cleaning include:
* Handling missing values.
* Removing duplicate rows or columns.
* Correcting inconsistencies (like typos and inconsistent cases).
* Handling outliers.
* Handling corrupt strings (Unicode problems).
* Handling irrelevant data, and more that you might have encountered or will likely encounter in your journey of working with data.
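
A few of the cleaning tasks listed above can be sketched with pandas. This is only a minimal illustration: the `raw` DataFrame, its column names, and the choice to label missing cities as "Unknown" are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data with typical problems: inconsistent case, stray
# whitespace, a missing value, and rows that are really duplicates.
raw = pd.DataFrame({
    "city": ["Lagos", "lagos ", "Abuja", "Abuja", None],
    "sales": [100.0, 100.0, 250.0, 250.0, 80.0],
})

cleaned = raw.copy()
# Correct inconsistencies: trim whitespace and use one consistent case.
cleaned["city"] = cleaned["city"].str.strip().str.title()
# Handle missing values: here we simply label them as "Unknown".
cleaned["city"] = cleaned["city"].fillna("Unknown")
# Remove duplicate rows, which only become visible once the text is consistent.
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Note the order: fixing the text first lets `drop_duplicates()` catch "Lagos" and "lagos " as the same record.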



---



**Other Conflicting Terms With "Data Preprocessing"**

It is important to note that as you progress in your learning, practice, and research, you will come across some big data lingo that might confuse you and leave you wondering what it means. Some of those terms include **data mining**, **data wrangling**, **data preparation**, **data transformation**, **data harmonization**, **data refinement**, **data shaping**, **data manipulation**, **data manicuring**, **data validation**, etc.

In your spare time, take the time to learn their meanings, their similarities, and how they differ from other related terms.

Two of those terms are differentiated from data preprocessing in the simple table below, to get us going before we move into data preprocessing proper.

| Feature       | Data Wrangling                                      | Data Preprocessing                               | Data Mining                                    |
|--------------|---------------------------------------------------|-------------------------------------------------|-----------------------------------------------|
| **Purpose**   | Preparing and structuring raw data               | Transforming data into a format suitable for ML | Extracting insights and patterns             |
| **Focus**     | Cleaning, transforming, and integrating data     | Encoding, scaling, and feature selection       | Finding hidden relationships in data (the complete ML pipeline)        |
| **Techniques** | Data cleaning, merging, reshaping               | Normalization, encoding, feature extraction    | Clustering, classification, regression, anomaly detection |
| **End Goal**  | Organized, structured data                       | Model ready dataset                            | Actionable insights for decision making      |


In the Nigerian setting, when you see people cooking many varieties of dishes in big sets of pots, the next thing that comes to mind is "What is the celebration?" That is, they are most likely cooking for a party or an event. It is the same thing when it comes to data preprocessing. So, in a layman's definition, data preprocessing is the preparing and cooking of your dataset for ML model building. I think this should be clear enough.

If simplicity feels like a scam to you, let's subscribe to a little bit of complexity in our definition.

Data preprocessing is the step in the machine learning pipeline where raw data is transformed into a structured, clean, and suitable format for model training. It ensures that data is compatible with machine learning algorithms by handling inconsistencies, scaling, encoding, and feature selection.



---


**Why Should We Preprocess Data Before Building The ML Model?**

**NOTE**: *I will introduce some new terms here. Don't worry if you don't understand them yet, but it is good to be aware of their existence. I will bold them for emphasis.*

Let's break it down. The effectiveness of our model is at the mercy of how well our features are engineered, and **feature engineering** is at the mercy of data preprocessing. Data preprocessing is crucial in machine learning because raw data is often incomplete, inconsistent, and noisy. Without proper preprocessing, **ML algorithms** may learn incorrect patterns, leading to poor performance.


So let's answer the question now.
1. Improves Data Quality: Even after cleaning, **features** might still contain errors, inconsistencies, and missing values that can affect the performance of the **model**. Preprocessing prepares the dataset for **feature selection** and **feature extraction**.

2. Enhances Model Performance: Most ML algorithms learn better when the features are **scaled**, **normalized**, and structured properly. Proper preprocessing ensures that features contribute meaningfully to **predictions**.

3. Helps Prevent **Overfitting**, **Underfitting**, **High Variance**, and **Bias**: For example, handling **imbalanced datasets** prevents the model from being biased towards the dominant **class**.

4. Ensures Algorithm Compatibility: Many ML algorithms require numeric inputs, which is why categorical variables need **encoding**. Some algorithms, like KNN and neural networks, are sensitive to scale and require normalization.

5. Improves Model Accuracy And Efficiency: Well-preprocessed data helps models learn better patterns, leading to improved **accuracy** and **generalization**. It also reduces **computational complexity** by eliminating **irrelevant features**.
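
To make the compatibility point concrete, here is a minimal sketch of encoding a categorical column and scaling numeric columns in one step with scikit-learn's `ColumnTransformer`. The DataFrame, its column names, and its values are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical column and two numeric columns
# measured on very different scales.
df = pd.DataFrame({
    "state": ["Lagos", "Kano", "Lagos", "Rivers"],
    "age": [25, 40, 31, 58],
    "income": [120_000, 450_000, 300_000, 900_000],
})

preprocess = ColumnTransformer(
    [
        # Encode the categorical column as numeric indicator columns.
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["state"]),
        # Put the numeric columns on a comparable scale (mean 0, std 1).
        ("scale", StandardScaler(), ["age", "income"]),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

X = preprocess.fit_transform(df)
print(X.shape)  # 3 indicator columns + 2 scaled columns
```

After this step every column is numeric and `age` and `income` have comparable magnitudes, which is what distance-based algorithms such as KNN need.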



---


**Technical Terms Used in Data Preprocessing**

| Term                        | Meaning                                                         | Implementation (Python)                                      | Use Case                                                      |
|-----------------------------|-----------------------------------------------------------------|-------------------------------------------------------------|--------------------------------------------------------------|
| **Missing Value Imputation** | Strategies for filling in missing data                        | `df.fillna()` or `KNNImputer()` (from `sklearn`)             | Ensures complete datasets for analysis or model training      |
| **Mean/Median/Mode Imputation** | Replacing missing values with the average, median, or mode    | `df.fillna(df.mean())`, `df.fillna(df.median())`             | Quick and simple imputation for numeric columns               |
| **KNN Imputation**           | Using similar data points to fill in missing values            | `KNNImputer()` (from `sklearn`)                              | Used when data is missing at random and can be inferred        |
| **Regression Imputation**    | Predicting missing values using a regression model            | Custom implementation using `LinearRegression()`             | Suitable for datasets with relationships between features     |
| **Multiple Imputation**      | Creating multiple plausible versions of data and combining them | `IterativeImputer()` (from `sklearn`)                        | Better for uncertain data with missing values                 |
| **Outlier Detection**        | Identifying and handling extreme values                        | `zscore()`, `IQR method`                                     | Prevents extreme values from distorting the model             |
| **Z-score Method**           | Identifying outliers based on standard deviation               | `zscore(df['col'])`                                          | Used for identifying outliers in normally distributed data    |
| **IQR Method**               | Identifying outliers based on interquartile range              | `IQR = df['col'].quantile(0.75) - df['col'].quantile(0.25)`  | Detects outliers in skewed or non-normal distributions        |
| **Box Plots**                | Visual representation of data distribution with outliers      | `sns.boxplot(x='col', data=df)`                              | Helps visually identify outliers                             |
| **Winsorizing/Clipping**     | Limiting extreme values to a specified threshold              | `winsorize()` (from `scipy.stats.mstats`), `df['col'].clip()` | Reduces the impact of outliers on model performance           |
| **Data Deduplication**       | Removing duplicate records                                    | `df.drop_duplicates()`                                       | Prevents duplicate data from affecting analysis or models     |
| **Noise Removal**            | Reducing random errors or noise in data                       | `df['col'].rolling(n).mean()` or `scipy.signal.savgol_filter()` | Improves signal-to-noise ratio in datasets                   |
| **Smoothing Techniques**     | Applying filters to reduce random noise in time-series data   | Moving average via `df['col'].rolling(n).mean()`, `savgol_filter()` (from `scipy.signal`) | Helps in smoothing time-series data or signals                |
| **Data Type Conversion**     | Changing data types for consistency                           | `df['col'] = df['col'].astype(int)`                          | Ensures consistency in feature formats (e.g., string to int) |
| **Data Transformation**      | Changing the format or structure of the data                  | `np.log()`, `boxcox()` (from `scipy.stats`)                  | Used for normalizing, stabilizing variance, or reducing skew |
| **Normalization**            | Scaling data to a specific range (usually 0 to 1)             | `MinMaxScaler()` (from `sklearn`)                            | Common for algorithms requiring bounded features (e.g., KNN) |
| **Min-Max Scaling**          | Scaling data between 0 and 1                                  | `MinMaxScaler().fit_transform(df)`                           | Ensures all features are in the same range                    |
| **Standardization**          | Scaling data to have a mean of 0 and standard deviation of 1 | `StandardScaler()` (from `sklearn`)                          | Often required for algorithms like linear regression, SVM    |
| **Robust Scaling**           | Scaling using median and IQR for robustness against outliers | `RobustScaler()` (from `sklearn`)                            | Handles datasets with many outliers                          |
| **Feature Scaling**          | A broader term covering normalization and standardization    | Combination of `MinMaxScaler()`, `StandardScaler()`          | Ensures uniform scaling of features                           |
| **Encoding Categorical Variables** | Converting categorical data into numeric format           | `pd.get_dummies()`, `LabelEncoder()`                         | Required for machine learning models to process categorical data |
| **One-Hot Encoding**         | Creating binary columns for each category                     | `pd.get_dummies()`                                           | Used when categorical variables have no ordinal relationship  |
| **Label Encoding**           | Assigning unique integers to categories                       | `LabelEncoder().fit_transform(df['col'])`                    | Intended for target labels; integers follow sorted order, not a natural order |
| **Ordinal Encoding**         | Assigning integers to categories based on order               | Custom mapping: `{category: index}`                           | Suitable for ordinal data where order matters                |
| **Target Encoding**          | Replacing categories with mean of the target variable         | `TargetEncoder()` (from `category_encoders`)                 | Used in modeling when categories correlate with target variable |
| **Discretization/Binning**   | Converting continuous variables into discrete bins            | `pd.cut()` or `KBinsDiscretizer()`                           | Used for reducing model complexity or creating categorical features |
| **Log Transformation**       | Applying a log function to reduce skewness                    | `np.log(df['col'])`                                          | Used for positively skewed data or stabilizing variance       |
| **Power Transformation**     | Box-Cox or Yeo-Johnson transformations for normalizing data  | `PowerTransformer()` (from `sklearn`)                        | Used for stabilizing variance and making data more Gaussian  |
| **Polynomial Features**      | Creating new features by raising existing features to higher powers | `PolynomialFeatures()` (from `sklearn`)                      | Used for capturing non-linear relationships                   |
| **Interaction Features**     | Creating new features by combining existing features         | `df['new_feature'] = df['feature1'] * df['feature2']`        | Used to capture interactions between features                 |
| **Data Augmentation**        | Generating new data points based on existing data            | `ImageDataGenerator()` (from `Keras`); text augmentation usually needs a custom function | Common in image and text processing tasks                    |
| **Data Integration**         | Merging data from multiple sources                           | `pd.merge()` or `pd.concat()`                               | Combines multiple datasets into a single source of truth     |
| **Feature Selection**        | Choosing relevant features for model training                | `SelectKBest()`, `RFE()` (from `sklearn`)                   | Reduces dimensionality and avoids overfitting                |
| **Filter Methods**           | Selecting features based on statistical measures             | `SelectKBest()` (from `sklearn`)                            | Selects features with the highest statistical relevance       |
| **Wrapper Methods**          | Using a model to evaluate feature subsets                     | `RFE()` (Recursive Feature Elimination)                      | Evaluates feature sets by recursively fitting a model        |
| **Embedded Methods**         | Built-in feature selection during model training            | `Lasso`, `DecisionTreeClassifier()`                          | Selects features while training models (e.g., L1 regularization) |
| **Dimensionality Reduction** | Reducing the number of features while retaining key information | `PCA()`, `LinearDiscriminantAnalysis()`, `TSNE()`           | Reduces data complexity and overfitting risks                 |
| **PCA (Principal Component Analysis)** | Finding principal components that capture data variance | `PCA()` (from `sklearn`)                                     | Used for feature extraction and noise reduction              |
| **LDA (Linear Discriminant Analysis)** | Maximizing separability between classes                   | `LinearDiscriminantAnalysis()` (from `sklearn`)              | Used for classification problems and supervised dimensionality reduction |
| **t-SNE (t-Distributed Stochastic Neighbor Embedding)** | Reducing dimensionality while preserving structure | `TSNE()` (from `sklearn`)                                   | Used for visualizing high-dimensional data                   |
| **Sampling**                 | Selecting a subset of data for processing                    | `train_test_split()`, `resample()`                          | Used when working with large datasets or for cross-validation |
| **Data Serialization**       | Converting data to a suitable format for storage or transmission | `pickle`, `json`, `parquet`                                 | Ensures data can be stored and loaded efficiently             |
| **JSON**                     | A lightweight data-interchange format                        | `json.load()` / `json.dump()`                               | Common format for storing and transmitting data              |
| **XML**                      | A markup language for representing structured data           | `xml.etree.ElementTree`                                     | Used for structured data storage and exchange                |
| **Pickle**                   | Python-specific format for serializing objects               | `pickle.dump()` / `pickle.load()`                           | Saves Python objects for future use or distributed computing  |
| **Parquet**                  | A columnar storage format                                    | `pyarrow.parquet`                                           | Efficient storage format, commonly used in big data          |
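
The IQR and clipping rows in the table can be combined into a short sketch. The series values below are made up, and the 1.5 × IQR fences are the usual rule of thumb rather than something the table prescribes.

```python
import pandas as pd

# Hypothetical column with one extreme value.
s = pd.Series([12, 14, 15, 15, 16, 18, 19, 95], name="col")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Conventional fences: 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = s[(s < lower) | (s > upper)]
# Clipping (a simple form of winsorizing) caps values at the fences.
clipped = s.clip(lower=lower, upper=upper)

print(outliers.tolist())
print(clipped.max())
```

Flagging and clipping are separate decisions: you can inspect `outliers` first and only then choose whether to cap, remove, or keep those values.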


**Hands-On Assignment**

Study the table above:
1. Explain why data cleaning and feature engineering seem to intertwine with data preprocessing. Give a list of the terms that are intertwined.
2. Some terms here are similar to terms you encountered in your introduction to statistics class. Can you identify them and their use cases from when you were learning statistics?

* **Hint:** *Peer discussions are permitted, but make sure you have studied and understood some of the terms in the table before asking for help.*



---

**Answer**
1. Data cleaning and feature engineering do intertwine with data preprocessing because their tasks overlap. For example, feature selection is used in data cleaning when you decide which columns to keep or work on, and it is also used in feature engineering when you create a new feature (or column) from already existing columns.

Some of the terms that are intertwined are:
* Feature selection
* Missing value imputation
* Data transformation, etc.


2. Terms that are familiar from statistics include:

**i** Scaling: This is the process of transforming numerical features so that they are on a comparable scale. The two commonly used types are:
* Normalization (min-max scaling): This rescales the values to fall between 0 and 1. It uses the minimum and maximum values of the data, so it is sensitive to outliers. The formula is `(df['column'] - df['column'].min()) / (df['column'].max() - df['column'].min())`.
* Standardization: This rescales the values to have a mean of 0 and a standard deviation of 1. It uses the mean and standard deviation of the data, not the minimum and maximum. The formula is `(x - mean) / std`.

**ii** Measures of Center: The mean, median, and mode summarize the typical value of a dataset, and they can also be used to fill in missing values for numerical and categorical data.

**iii** Measures of Spread: The standard deviation and the interquartile range describe how spread out the data is, and they are used to detect and deal with outliers.
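
The two scaling formulas can be checked by hand against scikit-learn. This is a minimal sketch on a made-up column; the DataFrame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric column.
df = pd.DataFrame({"column": [2.0, 4.0, 6.0, 8.0, 10.0]})
x = df["column"]

# Normalization (min-max): (x - min) / (max - min), result in [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): (x - mean) / std.
# ddof=0 (population std) matches what StandardScaler uses.
standardized = (x - x.mean()) / x.std(ddof=0)

# Both manual results agree with scikit-learn's scalers.
assert np.allclose(normalized, MinMaxScaler().fit_transform(df).ravel())
assert np.allclose(standardized, StandardScaler().fit_transform(df).ravel())

print(normalized.tolist())
```

Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is needed to reproduce `StandardScaler` exactly.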

---

In [19]:
# Note that the scikit-learn library is usually imported as "sklearn"


import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

In [20]:
from sklearn.datasets import load_diabetes

In [21]:
# Load the diabetes dataset
diabetes = load_diabetes()

# Put the features into a DataFrame
features = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Get the target values
target = diabetes.target


In [22]:
diabetes.data.shape

(442, 10)

In [23]:
features.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|-----|-----|-----|----|----|----|----|----|----|----|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [24]:
print(target)

[151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
  61.  92. 259.  53. 190. 142.  75. 142. 155. 225.  59. 104. 182. 128.
  52.  37. 170. 170.  61. 144.  52. 128.  71. 163. 150.  97. 160. 178.
  48. 270. 202. 111.  85.  42. 170. 200. 252. 113. 143.  51.  52. 210.
  65. 141.  55. 134.  42. 111.  98. 164.  48.  96.  90. 162. 150. 279.
  92.  83. 128. 102. 302. 198.  95.  53. 134. 144. 232.  81. 104.  59.
 246. 297. 258. 229. 275. 281. 179. 200. 200. 173. 180.  84. 121. 161.
  99. 109. 115. 268. 274. 158. 107.  83. 103. 272.  85. 280. 336. 281.
 118. 317. 235.  60. 174. 259. 178. 128.  96. 126. 288.  88. 292.  71.
 197. 186.  25.  84.  96. 195.  53. 217. 172. 131. 214.  59.  70. 220.
 268. 152.  47.  74. 295. 101. 151. 127. 237. 225.  81. 151. 107.  64.
 138. 185. 265. 101. 137. 143. 141.  79. 292. 178.  91. 116.  86. 122.
  72. 

#### **1. Handling Missing Values**

In most datasets, missing values may occur in the features. You can handle them with imputation techniques such as mean or median imputation.

In [25]:
# check for missing values

features.isna().sum()

age    0
sex    0
bmi    0
bp     0
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64

In [26]:
# take a snapshot of the dataset
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
dtypes: float64(10)
memory usage: 34.7 KB


In [27]:
features.values

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]], shape=(442, 10))

This dataset doesn't contain missing values, so we will create some artificially to practice on.

In [28]:
# Create missing values

features.loc[10, 'age'] = None
features.loc[50:56, 'bmi'] = None
features.loc[100:105, 'bp'] = None

In [29]:
# check for missing values again
features.isna().sum()

age    1
sex    0
bmi    7
bp     6
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64

In [30]:
# We will use SimpleImputer's fit_transform to fill in the missing values

# import the needed class (already imported at the top; repeated here for clarity)
from sklearn.impute import SimpleImputer

In [None]:
# Create an instance of the imputer class using "mean" as strategy for imputation
impute_mean = SimpleImputer(strategy='mean')

# create an instance of the imputer class using the median as strategy for imputation
impute_median = SimpleImputer(strategy='median')

In [35]:
features.age.nunique()

58

In [36]:
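# Fit each imputer on its column and fill in the missing values.
# fit_transform returns a 2D array, so ravel() flattens it back to one column.
features['age'] = impute_mean.fit_transform(features[['age']]).ravel()
features['bmi'] = impute_median.fit_transform(features[['bmi']]).ravel()
features['bp'] = impute_mean.fit_transform(features[['bp']]).ravel()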

In [37]:
features.isna().sum()

age    0
sex    0
bmi    0
bp     0
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64
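
To see what the two strategies actually fill in, here is a minimal sketch on a made-up column. The `toy` DataFrame is hypothetical and not part of the diabetes data; it includes one large value on purpose.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value and one large value.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan, 100.0]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy).ravel()
median_filled = SimpleImputer(strategy="median").fit_transform(toy).ravel()

# The statistic is computed from the observed (non-missing) values only.
print(mean_filled[3])    # mean of 1, 2, 3, 100
print(median_filled[3])  # median of 1, 2, 3, 100
```

The mean is pulled up by the large value while the median is not, which is one reason the median is often preferred for columns with extreme values.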

I believe we should be familiar with outliers by now. Here, I will introduce another method for handling them.

Again, let me throw more light on outliers. Have you seen a family where everyone is tall, except for one child who is "heightly challenged", I mean short? That child is an outlier.

So outliers are data points that are significantly different from other observations in your dataset. They can be unusually high or low values. Think of them as the odd ones out that don't seem to fit the general pattern of the data.

Outliers are not bad in themselves. For data analysis, clustering, building recommender systems, or anomaly detection, they can be very useful.


But when training a predictive model, they can introduce bias because the model gives too much weight to those extreme values. This hurts the performance of our model on new data.
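
The effect of a single outlier on a predictive model can be sketched with a linear regression. The data below is made up: it follows `y = 2x` exactly, and one target value is then corrupted on purpose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2x exactly.
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = 2.0 * X.ravel()

clean_slope = LinearRegression().fit(X, y).coef_[0]

# Corrupt a single target value so that it becomes an outlier.
y_outlier = y.copy()
y_outlier[-1] = 200.0  # instead of 20

outlier_slope = LinearRegression().fit(X, y_outlier).coef_[0]

print(round(clean_slope, 2), round(outlier_slope, 2))
```

One corrupted point out of ten is enough to move the fitted slope far away from 2, so predictions for ordinary points become much worse.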

In [38]:
# Using Z-Score to get outliers
z_scores = stats.zscore(features)       # Calculate the z-score for the whole dataframe

# let's print the z-score
print(z_scores)

[[ 0.79963488  1.06548848  1.30284864 ... -0.05449919  0.41853093
  -0.37098854]
 [-0.04436617 -0.93853666 -1.08246065 ... -0.83030083 -1.43658851
  -1.93847913]
 [ 1.79709067  1.06548848  0.93937293 ... -0.05449919  0.06015558
  -0.54515416]
 ...
 [ 0.87636225  1.06548848 -0.33279202 ... -0.23293356 -0.98564884
   0.32567395]
 [-0.96509458 -0.93853666  0.82578678 ...  0.55838411  0.93616291
  -0.54515416]
 [-0.96509458 -0.93853666 -1.53680528 ... -0.83030083 -0.08875225
   0.06442552]]
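
For intuition about what these numbers mean, here is a minimal sketch that computes z-scores by hand on a made-up column and checks them against `scipy.stats.zscore`. The values and the `|z| > 3` cutoff are assumptions for illustration; 3 is a common rule of thumb, not a fixed rule.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical column: mostly similar values plus one extreme value.
values = pd.Series([50.0] * 10 + [52.0] * 10 + [150.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is the default in scipy.stats.zscore.
manual_z = (values - values.mean()) / values.std(ddof=0)
assert np.allclose(manual_z, stats.zscore(values))

# A common rule of thumb flags |z| > 3 as a potential outlier.
is_outlier = manual_z.abs() > 3
print(values[is_outlier].tolist())
```

A z-score says how many standard deviations a value lies from the column mean, so a large absolute value marks a point that is far from the rest.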


**Removal Method**