# Checklist

In [2]:
# ![](important_guide_images/high_level_ml_pipeline.png) 
# OR
# <img src="important_guide_images/high_level_ml_pipeline.png"/>

<img src="important_guide_images/high_level_ml_pipeline.png"/>

# 1. Frame the problem and look at the big picture.

## 1.1. Frame the Problem

### 1.1.1. What exactly is the business objective? How do you intend to use and benefit from this model?

### 1.1.2. What does the current solution look like (if any)?

### 1.1.3. With the answers to the above questions, you are now ready to start designing your system. First, you need to frame the problem: is it supervised, unsupervised, or reinforcement learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?

## 1.2. Select a Performance Measure
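
As an illustration only, assuming the task turns out to be a regression problem, two common candidates are the root mean squared error (RMSE) and the mean absolute error (MAE). The helper names `rmse` and `mae` and the sample values are made up for this sketch.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to outliers than RMSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical labels and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 10.0]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print(rmse_value, mae_value)
```

A classification task would call for a different measure (accuracy, precision/recall, and so on), so the choice here depends on the answers from section 1.1.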

## 1.3. Check the Assumptions

## 1.4. Import Packages

# 2. Get the data (Data Retrieval)

## 2.1. Create the Workspace

## 2.2. Download the Data

### 2.2.1. Create a function to get the data from source systems to your system.

### 2.2.2. Run the function to fetch the latest data, or set up a scheduled job to do that automatically at regular intervals.

### 2.2.3. Write a function to load the data using Pandas
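
A minimal sketch of such a loader, assuming the data arrives as a CSV file. The function name `load_data`, the file name `data.csv`, and the sample rows are placeholders for this example, not part of any real project.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_data(data_dir, filename="data.csv"):
    """Read a CSV file from data_dir into a DataFrame."""
    csv_path = Path(data_dir) / filename
    return pd.read_csv(csv_path)


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data.csv").write_text("age,city\n31,Paris\n45,Lima\n")
    data = load_data(tmp)

print(data.shape)
```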

## 2.3. Take a Quick Look at the Data Structure

In [2]:
# Check for null values.
# Example:
# data.isnull().sum(), or use the missingno package to visualize them
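
The null check mentioned in the cell above could look like the following sketch. The small DataFrame is invented for illustration; in practice `data` would be whatever the loading step returned.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few missing entries.
data = pd.DataFrame({
    "age": [31, np.nan, 45, 29],
    "city": ["Paris", "Lima", None, "Oslo"],
    "income": [52000.0, 61000.0, 58000.0, 47000.0],
})

# Number of missing values per column.
null_counts = data.isnull().sum()

# Fraction of missing values per column, often easier to compare.
null_share = data.isnull().mean()

print(null_counts)
print(null_share)
```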

## 2.4. Create a Test Set
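
One simple way to do this is a seeded random split, sketched below. The function name `split_train_test`, the 20% ratio, and the seed are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def split_train_test(data, test_ratio=0.2, seed=42):
    """Shuffle row positions with a fixed seed and split off a test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]


# Made-up data: ten rows, two columns.
data = pd.DataFrame({"x": range(10), "y": range(10, 20)})
train_set, test_set = split_train_test(data)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the split reproducible between runs. If scikit-learn is available, its `train_test_split` function covers the same need.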

# 3. Explore (Discover and Visualize) the data to gain insights.
# EDA (Exploratory Data Analysis)

## Data Exploration - Univariate

When exploring our dataset and its features, we have many options available to us. We can explore each feature individually, or compare pairs of features and find the correlation between them. Let's start with some simple univariate (one feature) analysis.

Features can be of multiple types:
- **Nominal:** mutually exclusive categories with no order.
- **Ordinal:** categories where the order matters but the difference between values does not.
- **Interval:** a measurement where the difference between two values is meaningful.
- **Ratio:** has all the properties of an interval variable, and also has a clear definition of 0.0.

There are multiple ways of manipulating each feature type, but for simplicity, we'll define only two feature types:
- **Numerical:** any feature that contains numeric values.
- **Categorical:** any feature that contains categories, or text.
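
A rough sketch of that two-way split using pandas dtypes. The column names and values are invented for the example.

```python
import pandas as pd

# Made-up sample with two numeric columns and one text column.
data = pd.DataFrame({
    "age": [31, 45, 29],
    "income": [52000.0, 61000.0, 47000.0],
    "city": ["Paris", "Lima", "Oslo"],
})

# Numerical: any column with a numeric dtype.
numerical_cols = data.select_dtypes(include="number").columns.tolist()

# Categorical: everything else (categories or text).
categorical_cols = data.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols, categorical_cols)
```

Note that a numeric dtype does not guarantee a truly numerical feature: an integer-coded category such as a zip code would still need to be moved to the categorical list by hand.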

- First, draw a pair plot to see whether there is any relationship between the features.
- Second, plot the distribution of each feature to see whether it is normally distributed or skewed.
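
The two checks above can also be backed by numbers. The sketch below uses synthetic data (an assumption for the example) and computes pairwise correlations and per-feature skewness, which is roughly what the plots show visually.

```python
import numpy as np
import pandas as pd

# Synthetic features: one roughly symmetric, one clearly right-skewed.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "symmetric": rng.normal(size=1000),
    "right_skewed": rng.exponential(size=1000),
})

# Pairwise Pearson correlations: the numeric counterpart of a pair plot.
correlations = data.corr()

# Skewness near 0 suggests a symmetric distribution;
# clearly positive values indicate a long right tail.
skewness = data.skew()

print(correlations)
print(skewness)
```

For the plots themselves, `seaborn.pairplot(data)` and `data.hist()` are the usual calls, provided seaborn and matplotlib are installed.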

# 4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.

# Data Preparation

# Feature Cleaning, Engineering, and Imputation

**Cleaning:**
To clean our data, we'll need to work with:

- **Missing values:** Either omit elements from a dataset that contain missing values or impute them (fill them in).
- **Special values:** Numeric variables are endowed with several formalized special values including ±Inf, NA and NaN. Calculations involving special values often result in special values, and need to be handled/cleaned.
- **Outliers:** They should be detected, but not necessarily removed. Their inclusion in the analysis is a statistical decision.
- **Obvious inconsistencies:** A person's age cannot be negative, a man cannot be pregnant, and an under-aged person cannot possess a driver's license. Find the inconsistencies and plan for them.
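
A small sketch of three of these checks on made-up data: converting ±Inf to missing values, flagging an impossible age, and flagging (not removing) outliers. The 1.5 × IQR rule is one common convention, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample containing a negative age, an infinite income and an extreme income.
data = pd.DataFrame({
    "age": [25, -3, 40, 31, 38, 29],
    "income": [48000.0, 52000.0, np.inf, 50000.0, 51000.0, 400000.0],
})

# Special values: treat +Inf and -Inf as missing so they can be imputed or dropped later.
data = data.replace([np.inf, -np.inf], np.nan)

# Obvious inconsistencies: an age cannot be negative.
invalid_age = data["age"] < 0

# Outliers: flag values outside 1.5 * IQR, but leave the decision to remove them for later.
q1, q3 = data["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (data["income"] < q1 - 1.5 * iqr) | (data["income"] > q3 + 1.5 * iqr)

print(data[invalid_age])
print(data[is_outlier])
```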

**Engineering:**
There are multiple techniques for feature engineering:
- **Decompose:** Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.
- **Discretization:** We can discretize some of the continuous variables, since some algorithms run faster on discretized data. We are going to do both, and compare the results of the ML algorithms on the discretized and non-discretized datasets. We'll call these datasets:
    - dataset_bin => where continuous variables are discretized
    - dataset_con => where continuous variables are left continuous
- **Reframe Numerical Quantities:** Changing from grams to kilograms loses detail, which may be both desirable and more efficient for calculation.
- **Feature Crossing:** Creating new features as a combination of existing features. This could mean multiplying numerical features or combining categorical variables. It is a great way to add domain knowledge to the dataset.
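
The techniques above can be sketched on a tiny made-up dataset. Apart from `hour_of_the_day`, `part_of_day`, `dataset_bin` and `dataset_con`, all column names, bin edges and labels below are assumptions for illustration.

```python
import pandas as pd

# Made-up sample: a timestamp and two numeric measurements.
data = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-09-20T20:45:40Z", "2014-09-21T07:10:00Z"]),
    "weight_g": [1260.0, 480.0],
    "height_cm": [30.0, 12.0],
})

# Decompose: split the timestamp into simpler attributes.
data["hour_of_the_day"] = data["timestamp"].dt.hour
data["part_of_day"] = pd.cut(
    data["hour_of_the_day"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,  # intervals are [0, 6), [6, 12), ...
)

# Reframe numerical quantities: grams to kilograms, rounded to one decimal.
data["weight_kg"] = (data["weight_g"] / 1000).round(1)

# Feature crossing: combine two numerical features into a new one.
data["weight_x_height"] = data["weight_g"] * data["height_cm"]

# Discretization: keep a continuous copy and a binned copy for later comparison.
dataset_con = data.copy()
dataset_bin = data.copy()
dataset_bin["weight_g"] = pd.cut(
    dataset_bin["weight_g"],
    bins=[0, 500, 1000, 2000],
    labels=["light", "medium", "heavy"],
)

print(dataset_bin[["hour_of_the_day", "part_of_day", "weight_g"]])
```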
    
**Imputation:**
We can impute missing values in a number of different ways:
- **Hot-Deck:** One common form, last observation carried forward, sorts the dataset, finds each missing value, and imputes it with the cell value immediately before it.
- **Cold-Deck:** Selects donors from another dataset to complete missing data.
- **Mean-substitution:** Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
- **Regression:** A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
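
A minimal sketch of three of these techniques on invented data. Cold-deck imputation is left out because it needs a second donor dataset. The regression step uses a plain least-squares line from `numpy.polyfit` as a stand-in for whatever model would be used in practice.

```python
import numpy as np
import pandas as pd

# Made-up sample: y is missing in the third row.
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 10.0],
    "y": [2.0, 4.0, np.nan, 8.0, 20.0],
})

# Hot-deck (last observation carried forward): reuse the previous row's value.
hot_deck = data["y"].ffill()

# Mean substitution: fill with the mean of the observed values.
mean_sub = data["y"].fillna(data["y"].mean())

# Regression: fit y ~ x on the observed rows, then predict the missing ones.
observed = data["y"].notna()
slope, intercept = np.polyfit(data.loc[observed, "x"], data.loc[observed, "y"], deg=1)
regression = data["y"].copy()
regression.loc[~observed] = slope * data.loc[~observed, "x"] + intercept

print(hot_deck[2], mean_sub[2], regression[2])
```

On this sample the three methods give visibly different fills for the same gap, which is a reminder that the choice of imputation method is itself a modeling decision.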

## 4.1. Data Cleaning

## 4.2. Handling Text and Categorical Attributes
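
Two common encodings, sketched with pandas on made-up columns: one-hot encoding for an unordered category and integer codes for an ordered one. The column names and the category order are assumptions for the example.

```python
import pandas as pd

# Made-up sample: "city" has no natural order, "size" does.
data = pd.DataFrame({
    "city": ["Paris", "Lima", "Paris"],
    "size": ["small", "large", "medium"],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(data["city"], prefix="city")

# Ordinal encoding: map categories to integers that respect their order.
size_order = ["small", "medium", "large"]
data["size_code"] = pd.Categorical(
    data["size"], categories=size_order, ordered=True
).codes

print(one_hot)
print(data)
```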

## 4.3. Custom Transformers

## 4.4. Feature Scaling
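
The two most common scaling schemes, min-max scaling and standardization, can be sketched directly in pandas. The `income` column is made up for the example.

```python
import pandas as pd

# Made-up sample with a single numeric feature.
data = pd.DataFrame({"income": [40000.0, 50000.0, 60000.0]})

# Min-max scaling: rescale each column to the range [0, 1].
min_max = (data - data.min()) / (data.max() - data.min())

# Standardization: zero mean and unit (population) standard deviation.
standardized = (data - data.mean()) / data.std(ddof=0)

print(min_max)
print(standardized)
```

In a real workflow the minimum, maximum, mean and standard deviation should be computed on the training set only and then reused for the test set, so that no information leaks from the test data.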

## 4.5. Feature Selection

## 4.6. Transformation Pipelines

# 5. Explore many different models and short-list the best ones.

## 5.1. Training and Evaluating on the Training Set

# 6. Fine-tune your models and combine them into a great solution.

# 7. Present your solution.

# 8. Launch, monitor, and maintain your system.