<a href="https://colab.research.google.com/github/ML-Challenge/week5-preprocessing-and-tunning/blob/master/L1.Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>

This lesson covers the basics of how and when to perform data preprocessing, the essential step in any machine learning project where we get our data ready for modeling. Preprocessing comes into play after importing and cleaning the data and before fitting the machine learning model. We'll learn how to standardize the data so that it's in the right form for the model, create new features to make the most of the information in the dataset, and select the best features to improve the model fit. Finally, we'll practice preprocessing by getting a dataset on UFO sightings ready for modeling.

# Setup

In [1]:
# Download lesson datasets
# Required if you're using Google Colab
#!wget "https://github.com/ML-Challenge/week5-preprocessing-and-tunning/raw/master/datasets.zip"
#!unzip -o datasets.zip

In [None]:
# Import utils
# We'll be using this module throughout the lesson
import utils

In [None]:
# Import dependencies
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# Set the default size for all plots
plt.rcParams['figure.figsize'] = [11, 7]

# Introduction to Data Preprocessing

In this chapter we'll learn exactly what it means to `preprocess` data. We'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.
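
As a rough preview of those first steps, here is a minimal sketch that inspects column types, counts missing values, converts a column, and drops incomplete rows. The DataFrame and its column names (`city`, `visits`, `rating`) are made up for illustration and are not one of the lesson datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not one of the lesson datasets
df = pd.DataFrame({
    "city": ["Austin", "Boston", None, "Denver"],
    "visits": ["3", "7", "2", "5"],      # numbers stored as strings
    "rating": [4.5, np.nan, 3.0, 4.0],
})

# Inspect the type pandas inferred for each column
print(df.dtypes)

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Convert the string column to integers so it can be used numerically
df["visits"] = df["visits"].astype(int)

# Drop every row that contains at least one missing value
df_complete = df.dropna()
print(df_complete.shape)
```

Dropping rows is only one option. Whether to drop rows, drop columns, or fill values in depends on how much data is missing and where.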

## What is data preprocessing?

### Missing data - columns

### Missing data - rows

## Working with data types

### Exploring data types

### Converting a column type

## Class distribution

### Class imbalance

### Stratified sampling

# Standardizing Data

This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of the features. Standardization is a way to make the data fit these assumptions and improve the algorithm's performance.
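
To make the idea concrete, the sketch below applies two common techniques to a made-up DataFrame: a log transform for a high-variance column, followed by scikit-learn's `StandardScaler`. The column names (`income`, `age`) and values are assumptions chosen only to show the effect.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one very high-variance column
df = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 1_000_000.0],
    "age": [25.0, 32.0, 47.0, 51.0],
})

# Compare the variance of each column
print(df.var())

# Log normalization: compress the range of the high-variance column
df["income_log"] = np.log(df["income"])

# Scaling: transform each column to mean 0 and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["income_log", "age"]])
df_scaled = pd.DataFrame(scaled, columns=["income_log", "age"])

print(df_scaled.mean().round(6))
print(df_scaled.std(ddof=0).round(6))
```

Note that `StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed when checking the result with pandas.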

## What is data standardization?

### When to standardize

### Modeling without normalizing

## Log normalization

### Checking the variance

### Log normalization in Python

## Scaling data for feature comparison

### Scaling data - investigating columns

### Scaling data - standardizing columns

## Standardized data and modeling

### KNN on non-scaled data

### KNN on scaled data

# Feature Engineering

In this chapter we'll learn about feature engineering. We'll explore different ways to create new, more useful features from the ones already in the dataset. We'll see how to encode, aggregate, and extract information from both numerical and textual features.
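
The sketch below illustrates a few of these ideas with pandas: binary and one-hot encoding, averaging numerical columns, pulling the month out of a date, and extracting a number from a string. All column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not one of the lesson datasets
df = pd.DataFrame({
    "member": ["y", "n", "y"],
    "color": ["red", "blue", "red"],
    "run1": [10.0, 12.0, 9.0],
    "run2": [11.0, 13.0, 10.0],
    "date": ["2020-01-15", "2020-03-02", "2021-07-30"],
    "length": ["5 miles", "12 miles", "3 miles"],
})

# Binary encoding: map a two-valued column to 1/0
df["member_enc"] = (df["member"] == "y").astype(int)

# One-hot encoding: one indicator column per category
color_dummies = pd.get_dummies(df["color"], prefix="color")

# Aggregation: average several numerical columns row by row
df["run_mean"] = df[["run1", "run2"]].mean(axis=1)

# Datetime: parse the strings, then extract the month
df["month"] = pd.to_datetime(df["date"]).dt.month

# String extraction: capture the first run of digits as an integer
df["miles"] = df["length"].str.extract(r"(\d+)", expand=False).astype(int)

print(df[["member_enc", "run_mean", "month", "miles"]])
print(color_dummies)
```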

## What is feature engineering?

### Feature engineering knowledge test

### Identifying areas for feature engineering

## Encoding categorical variables

### Encoding categorical variables - binary

### Encoding categorical variables - one-hot

## Engineering numerical features

### Engineering numerical features - taking an average

### Engineering numerical features - datetime

## Text classification

### Engineering features from strings - extraction

### Engineering features from strings - tf/idf

### Text classification using tf/idf vectors

# Selecting features for modeling

This chapter goes over a few different techniques for selecting the most important features from the dataset. We'll learn how to drop redundant features, work with text vectors, and reduce the number of features in the dataset using principal component analysis (PCA).
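
As a small illustration of two of these techniques, the sketch below builds a synthetic dataset with one nearly redundant column, uses the correlation matrix to spot it, and then applies scikit-learn's `PCA`. The data is randomly generated and the column names (`a`, `a_copy`, `b`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data: "a_copy" is almost a rescaled duplicate of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=100),
    "b": b,
})

# Absolute pairwise correlations; values near 1 suggest redundancy
corr = df.corr().abs()
print(corr.round(2))

# Drop one column from the highly correlated pair
df_reduced = df.drop(columns=["a_copy"])

# PCA: project the three original columns onto two components
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print(pca.explained_variance_ratio_)
```

In practice the features are usually scaled before PCA so that no single column dominates the components. That step is omitted here to keep the example short.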

## Feature selection

### When to use feature selection

### Identifying areas for feature selection

## Removing redundant features

### Selecting relevant features

### Checking for correlated features

## Selecting features using text vectors

### Exploring text vectors, part 1

### Exploring text vectors, part 2

### Training Naive Bayes with feature selection

## Dimensionality reduction

### Using PCA

### Training a model with PCA

# Putting it all together

Now that we've learned all about preprocessing, we'll try these techniques out on a dataset that records information on UFO sightings.

## UFOs and preprocessing

### Checking column types

### Dropping missing data

## Categorical variables and standardization

### Extracting numbers from strings

### Identifying features for standardization

## Engineering new features

### Encoding categorical variables

### Features from dates

### Text vectorization

## Feature selection and modeling

### Modeling the UFO dataset, part 1

### Modeling the UFO dataset, part 2

## Congratulations!