# Data Preparation for Machine Learning

## Part II: Foundation

### Chapter 1: Data Preparation in a Machine Learning Project

#### Tutorial Overview
- Introduction to data preparation within a broader machine learning project context.
- Discusses the applied machine learning process and the role of data preparation.
- Explains how to choose data preparation techniques based on project needs.

#### Applied Machine Learning Process
1. **Define Problem**
   - Understanding the project goals and framing the prediction task (classification, regression, etc.).
   - Collecting relevant data and defining the prediction problem clearly.
2. **Prepare Data**
   - Transforming raw data into a suitable form for modeling.
   - Common tasks include cleaning data, selecting features, and transforming variables.
3. **Evaluate Models**
   - Designing a robust test harness to evaluate models.
   - Selecting performance metrics, establishing baselines, and using resampling techniques like k-fold cross-validation.
4. **Finalize Model**
   - Selecting the best model based on evaluation results.
   - Summarizing model performance and integrating the model into a production system.
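
The "Evaluate Models" step can be sketched as follows. This is a minimal illustration rather than a prescribed harness: the synthetic dataset, the `DummyClassifier` baseline, and the choice of logistic regression are all assumptions made for the example.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic binary classification problem (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# test harness: stratified 10-fold cross-validation scored by accuracy
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# baseline: always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X, y, scoring='accuracy', cv=cv)

# candidate model evaluated on exactly the same folds
model = LogisticRegression(max_iter=1000)
model_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

print('Baseline accuracy: %.3f' % mean(baseline_scores))
print('Model accuracy: %.3f' % mean(model_scores))
```

A model is only worth finalizing if it clearly beats the baseline under the same harness.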

#### What Is Data Preparation
- Transforming raw data to meet the requirements of machine learning algorithms.
- Includes tasks like data cleaning, feature selection, data transforms, feature engineering, and dimensionality reduction.
- Aims to expose the underlying structure of the problem to learning algorithms.
- Data preparation is needed because:
    + Machine learning algorithms require data to be numbers.
    + Some machine learning algorithms impose requirements on the data.
    + Statistical noise and errors in the data may need to be corrected.
    + Complex nonlinear relationships may be teased out of the data.
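
As a small illustration of the first reason, the sketch below turns a text column into numbers with one-hot encoding. The `colors` column is hypothetical, and one-hot encoding is only one of several possible encodings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical categorical column with three distinct labels
colors = np.array([['red'], ['green'], ['blue'], ['green']])

# one-hot encode: each label becomes its own binary column
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()

# categories are stored in sorted order: blue, green, red
print(encoder.categories_[0])
print(encoded)
```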

#### How to Choose Data Preparation Techniques
- Informed by the steps before and after data preparation in the project.
- Analysis of collected data, including summary statistics and data visualization, helps determine necessary transformations.
- The choice of algorithms and evaluation metrics also guides the selection of data preparation methods.
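
A minimal sketch of how summary statistics can point to the transforms that are needed. The small `data` array is hypothetical: its two columns sit on very different scales and one value is missing, which would suggest scaling and imputation respectively.

```python
import numpy as np

# hypothetical dataset: column 0 is small-scale, column 1 is large-scale with a missing value
data = np.array([
    [0.2, 1500.0],
    [0.5, 3200.0],
    [0.9, np.nan],
    [0.4, 2800.0],
])

# per-column summary statistics, ignoring missing values
col_min = np.nanmin(data, axis=0)
col_max = np.nanmax(data, axis=0)
col_mean = np.nanmean(data, axis=0)
n_missing = np.isnan(data).sum(axis=0)

for i in range(data.shape[1]):
    print('column %d: min=%.1f max=%.1f mean=%.1f missing=%d'
          % (i, col_min[i], col_max[i], col_mean[i], n_missing[i]))
```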

### Chapter 2: Why Data Preparation is So Important

#### Tutorial Overview
- Highlights the critical role of data preparation in predictive modeling projects.
- Emphasizes the need for transforming raw data into a suitable form for modeling.

#### What Is Data in Machine Learning
- Structured data consists of rows (examples) and columns (features).
- Each example has input and output elements relevant to the prediction task.
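
In code, this layout is usually a two-dimensional array that is split into inputs and an output. The values below are hypothetical, and placing the output in the last column is an assumption of the sketch, not a requirement.

```python
import numpy as np

# hypothetical structured dataset: 3 rows (examples), 2 input columns plus 1 output column
dataset = np.array([
    [5.1, 3.5, 0],
    [6.2, 2.9, 1],
    [4.7, 3.2, 0],
])

# inputs are all columns except the last; the output is the last column
X = dataset[:, :-1]
y = dataset[:, -1]

print(X.shape, y.shape)
```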

#### Raw Data Must Be Prepared
- Raw data often contains noise, errors, and complex relationships that need to be addressed.
- Preparing data involves correcting these issues to make it suitable for machine learning algorithms.
- Machine learning algorithms expect numbers.
- Machine learning algorithms have requirements:
    + The relationship between data and algorithms is reciprocal. Algorithms have specific expectations for data, necessitating appropriate data preparation to meet these requirements. Meanwhile, the characteristics of the data can inform which algorithms might be most effective.
- Model performance depends on data:
    + Complex data: raw data contains compressed, complex nonlinear relationships that may need to be exposed.
    + Messy data: raw data contains statistical noise, errors, missing values, and conflicting examples.
- We can think about getting the most out of a predictive modeling project in two ways:
    + **Focus on the model**: Minimally prepare the raw data and begin modeling. This puts the full onus on the model to tease out the relationships in the data and learn the mapping function from inputs to outputs as best it can. This may be a reasonable path through a project, but it may require a large dataset and a flexible and powerful machine learning algorithm with few expectations, such as random forest or gradient boosting.
    + **Focus on the data**: Alternatively, push the onus back onto the data and the data preparation process. This requires that each row of data best expresses the information content of the data for modeling. Just as data in a relational database is denormalized into rows and columns, data preparation can denormalize the complex structure inherent in each single observation. This is also a reasonable path. It may require more knowledge of the data than is available, but it allows good or even best modeling performance to be achieved almost irrespective of the machine learning algorithm used.
- Often a balance between these approaches is pursued on any given project: exploring powerful and flexible machine learning algorithms while also using data preparation to best expose the structure of the data to the learning algorithms. Data preparation is thus a path to better data and, in turn, better model performance.





#### Predictive Modeling Is Mostly Data Preparation
- With standardized machine learning algorithms, the majority of the effort in a project is spent on preparing the unique data for modeling.
- Proper data preparation is essential for achieving reliable and accurate predictive models.

### Chapter 3: Tour of Data Preparation Techniques

#### Tutorial Overview
- Introduction to various data preparation tasks.
- Overview of common techniques and their importance in a machine learning project.

#### Common Data Preparation Tasks
- **Data Cleaning**:
  - Identifying and correcting mistakes or errors in the data.
  - Handling missing values, outliers, and duplicates.
![Data Cleaning](../Photos/1.png)
- **Feature Selection**:
  - Identifying the most relevant input variables for the prediction task.
  - Removing irrelevant or redundant features to improve model performance.
![Feature Selection](../Photos/2.png)
- **Data Transforms**:
  - Changing the scale, type, or distribution of variables.
  - Common transforms include normalization, standardization, and encoding categorical variables.
![Data Transforms](../Photos/3.png)
![Data Transforms](../Photos/4.png)
- **Feature Engineering**:
  - Creating new variables from existing data to better represent the underlying structure of the problem.
  - Includes techniques like polynomial features and interaction terms.
- **Dimensionality Reduction**:
  - Reducing the number of input variables by projecting data into a lower-dimensional space.
  - Techniques include Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
 ![Dimensionality Reduction](../Photos/5.png)
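
The five tasks can be sketched with scikit-learn as shown below. The synthetic dataset, the specific classes chosen for each task, and their parameters are assumptions made for illustration. Each transform is fit on the full dataset purely to show the API; in an evaluated project it would be fit on the training data only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# synthetic data (assumed for illustration) with one value knocked out
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=1)
X[0, 0] = np.nan

# data cleaning: fill the missing value with the column mean
X_clean = SimpleImputer(strategy='mean').fit_transform(X)

# feature selection: keep the 5 features with the highest ANOVA F-score
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X_clean, y)

# data transform: rescale every selected feature to the range [0, 1]
X_scaled = MinMaxScaler().fit_transform(X_selected)

# feature engineering: add squared and pairwise interaction terms
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)

# dimensionality reduction: project onto the first 3 principal components
X_reduced = PCA(n_components=3).fit_transform(X_scaled)

print(X.shape, X_selected.shape, X_poly.shape, X_reduced.shape)
```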

### Chapter 4: Data Preparation Without Data Leakage

#### Tutorial Overview
- Discusses the importance of avoiding data leakage during data preparation.
- Presents techniques for properly preparing data without contaminating the test set.

#### Problem With Naive Data Preparation
- Applying data transforms to the entire dataset before splitting into training and test sets can cause data leakage.
- Data leakage leads to overly optimistic model performance estimates that do not generalize to new data.
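
A tiny sketch of where the leak comes from. The five-row feature and the manual split are hypothetical, chosen so that the largest value lands in the test set; `data_max_` is the per-feature maximum that `MinMaxScaler` learns during `fit`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical single feature whose largest value falls in the test set
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_train, X_test = X[:4], X[4:]

# naive: the scaler sees every row, so the test-set maximum shapes the transform
leaky = MinMaxScaler().fit(X)

# correct: the scaler sees only the training rows
clean = MinMaxScaler().fit(X_train)

print('max learned from all data: %.1f' % leaky.data_max_[0])
print('max learned from train only: %.1f' % clean.data_max_[0])
```

With the leak-free scaler, the test value maps outside the range [0, 1], which is an honest reflection of how unseen data can behave in production.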

#### Data Preparation With Train and Test Sets
- Data preparation steps are fit on the training set only, then applied to both the training and test sets.
- The test set remains unseen by the model until the final evaluation to ensure a fair assessment of performance.

```python
# naive approach: normalize the whole dataset before splitting it and evaluating the model

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# normalize the dataset (the scaler sees the future test rows, which causes leakage)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# make predictions on the test set
yhat = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))
```

```
Accuracy: 84.848
```


```python
# correct approach: split the data first, then fit the scaler on the training set only

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset using the statistics learned from the training dataset
X_test = scaler.transform(X_test)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# make predictions on the test set
yhat = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))
```

```
Accuracy: 85.152
```
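
The manual fit and transform steps can also be wrapped in a `Pipeline`, which fits the scaler on whatever data is passed to `fit` and reuses those statistics at prediction time. This is a sketch of an alternative arrangement, not a replacement for the listing above; it uses the same dataset and split, so it is expected to give the same estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# define dataset and split it before any preparation takes place
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# chain the scaler and the model so they are always fit together
pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('model', LogisticRegression())])

# fitting the pipeline fits the scaler on the training rows only
pipeline.fit(X_train, y_train)

# scoring applies the already-fitted scaler to the test rows, then predicts
accuracy = pipeline.score(X_test, y_test)
print('Accuracy: %.3f' % (accuracy * 100))
```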


#### Data Preparation With k-fold Cross-Validation
- Applying data preparation techniques within each fold during cross-validation.
- Ensures that the validation data remains unseen during training to avoid leakage.

```python
# naive data preparation for model evaluation with k-fold cross-validation

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# normalize the dataset (the scaler sees every fold, which causes leakage)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# define the model
model = LogisticRegression()

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
```

```
Accuracy: 85.300 (3.513)
```


```python
# correct data preparation for model evaluation with k-fold cross-validation

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# define the pipeline so the scaler is re-fit on the training folds of each split
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
```

```
Accuracy: 85.400 (3.489)
```


- As with the train-test example in the previous section, removing data leakage has resulted in a slight improvement in performance, whereas intuition might suggest a drop, given that data leakage often results in an optimistic estimate of model performance. Nevertheless, the examples demonstrate that data leakage can change the estimate of model performance, and that it is corrected by performing data preparation after the data is split.
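
To see how small the effect is relative to fold-to-fold variation, the two procedures can be scored on identical folds and compared directly. This is a sketch under the same dataset and settings as the listings above; the per-fold comparison itself is an addition for illustration.

```python
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# a fixed random_state makes both evaluations use the same 30 train/test splits
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# naive: scale once using all rows, then cross-validate the model alone
X_leaky = MinMaxScaler().fit_transform(X)
naive_scores = cross_val_score(LogisticRegression(), X_leaky, y, scoring='accuracy', cv=cv)

# correct: the pipeline re-fits the scaler inside every split
pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('model', LogisticRegression())])
clean_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)

# per-split difference between the two estimates
diff = clean_scores - naive_scores
print('Mean difference: %.3f (std of naive scores: %.3f)' % (mean(diff)*100, std(naive_scores)*100))
```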