# Data Preprocessing & Feature Engineering

<img src='https://miro.medium.com/max/1200/0*uoJhp9fB0xlgfLN7.png' width='700' height='600'>

# CHAPTER 1: Data Preprocessing

## CONTENTS:
* What is data preprocessing?
* Why is data preprocessing important?
* Techniques in data preprocessing
    * Data Cleaning
    * Data Integration
    * Data Reduction
    * Data Transformation
    * Data Discretization

## What is data preprocessing?
The term combines two words: *data* and *preprocessing*. Let's look at each.
* **Data:** Information that comes in different forms, such as:
    * Text
    * Image
    * Video
    * Audio

Data in these forms is generally not clean, so it has to be processed before it can be used.

* **Data Preprocessing:** The process of converting raw data into clean, meaningful data using different techniques.


## Why is data preprocessing important?
* Data in the real world is dirty. The common kinds of dirtiness are:
    * Incomplete
    * Noisy
    * Inconsistent
    * Duplicate
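
The four kinds of dirtiness listed above can be checked for with a few pandas calls. The snippet below is only a small sketch: the table, its column names, and the age limit of 120 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for this example.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Ben", "Chen", "Dara"],
    "age": [29, 41, 41, None, 230],                       # None = incomplete, 230 = noisy
    "city": ["Delhi", "Pune", "Pune", "delhi", "Pune"],   # "delhi" vs "Delhi" = inconsistent
})

# Incomplete: number of missing cells in each column.
missing_per_column = df.isna().sum()

# Duplicate: rows that are identical to an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Noisy: values outside a plausible range (the limit of 120 is an assumption).
implausible_ages = int((df["age"] > 120).sum())

# Inconsistent: spellings that differ only by letter case.
case_variants = df["city"].nunique() - df["city"].str.lower().nunique()

print(missing_per_column)
print(duplicate_rows, implausible_ages, case_variants)
```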


* Quality data, by contrast, has these properties:
    * Accuracy
    * Completeness
    * Believability
    * Interpretability
     
* Machine learning algorithms learn from whatever data they are given (*they learn like kids*), so they follow the rule:  
    **GIGO** = Garbage In, Garbage Out

## Techniques in data preprocessing:
* Major steps in data preprocessing are:  
    1. Data Cleaning  
    2. Data Integration  
    3. Data Reduction  
    4. Data Transformation  
    5. Data Discretization  

### 1. Data Cleaning:
Data cleaning fills in missing values, smooths out noise while identifying outliers, and corrects inconsistencies in the data.

<img src = 'https://startwithdata.co.uk/wp-content/uploads/2021/07/Screenshot-2021-07-28-at-10.08.54-1024x872.png' width="200" height="200">
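
Here is a minimal sketch of the three cleaning tasks using pandas. The data is hypothetical, and the choices made here (median imputation, the 1.5 × IQR rule for outliers) are just common options, not the only way to clean data.

```python
import pandas as pd

# Hypothetical data: one missing age, one implausible age, inconsistent city spellings.
df = pd.DataFrame({
    "age": [25.0, 31.0, None, 28.0, 30.0, 27.0, 250.0],
    "city": ["Delhi", "delhi ", "Pune", "PUNE", "Delhi", "Pune", "Delhi"],
})

# 1. Fill in missing values with the column median (less sensitive to outliers than the mean).
df["age"] = df["age"].fillna(df["age"].median())

# 2. Identify outliers with the 1.5 * IQR rule (one common heuristic).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age_is_outlier"] = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# 3. Correct inconsistencies: trim whitespace and use one capitalisation style.
df["city"] = df["city"].str.strip().str.title()

print(df)
```

The outlier is only flagged here, not removed. Whether to drop, cap, or keep it depends on the problem.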

### 2. Data Integration:
Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.

<img src = 'https://lh3.googleusercontent.com/kr1G5oTGXBGVCtKWy1oFduVs0eTlzG1US4vZdPaJOiyTZ_ltkgxXJVki6dg06odv_j3Hkj9U1iIHw6biURmcQbARduJXTpH42S8nPMk2uvETQV_qFSnbHdNbtxEazYcN63prNz6h=s0' width="300" height="300">
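
On a small scale, integration can be sketched as a pandas merge. The two tables and their column names below are hypothetical. Note that the key column is named differently in each source, which is a typical integration problem.

```python
import pandas as pd

# Two hypothetical sources that describe the same customers.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chen"],
})
orders = pd.DataFrame({
    "cust_id": [1, 1, 3, 4],            # same key, different column name
    "amount": [250.0, 120.0, 75.0, 60.0],
})

# Resolve the naming conflict so both sources share one key column.
orders = orders.rename(columns={"cust_id": "customer_id"})

# An outer join keeps rows from both sources; indicator=True adds a "_merge"
# column showing where each row came from.
merged = customers.merge(orders, on="customer_id", how="outer", indicator=True)

print(merged)
```

Rows marked `left_only` or `right_only` in `_merge` point to records that exist in only one source and may need a closer look.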

### 3. Data Reduction:
Data reduction shrinks the size of the data by aggregating it, eliminating redundant features, or clustering.


<img src = 'https://www.cohesity.com/wp-content/new_media/2020/09/datareduction_banner-1.png' width="500" height="500">
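
The sketch below shows two of these ideas with pandas: dropping a redundant feature and aggregating rows. The sensor table is hypothetical, and the correlation threshold of 0.95 is an assumption picked for the example.

```python
import pandas as pd

# Hypothetical readings; "temp_f" is just "temp_c" in another unit, so it is redundant.
readings = pd.DataFrame({
    "store": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "temp_c": [10, 20, 30, 40, 10, 20, 30, 40],
    "units": [2, 1, 1, 2, 4, 3, 3, 4],
})
readings["temp_f"] = readings["temp_c"] * 9 / 5 + 32

# Eliminate redundant features: drop a numeric column when it is almost
# perfectly correlated with a column that comes before it.
corr = readings.select_dtypes("number").corr().abs()
to_drop = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i, i] > 0.95).any()
]
reduced = readings.drop(columns=to_drop)

# Aggregate: one summary row per store instead of one row per reading.
per_store = reduced.groupby("store", as_index=False).agg(
    mean_temp_c=("temp_c", "mean"),
    total_units=("units", "sum"),
)

print(to_drop)
print(per_store)
```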

### 4. Data Transformation:
Data transformation converts or consolidates data into forms appropriate for ML model training. For example, normalization scales values to fall within a smaller range such as 0.0 to 1.0. Common transformations are:
* Aggregation
* Feature type conversion
* Normalization
* Attribute/feature construction

<img src = 'https://www.tibco.com/sites/tibco/files/media_entity/2022-04/data-transformation-diagram.svg' width="500" height="500">
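
Three of these transformations can be sketched in a few lines of pandas. The table and column names are hypothetical. Normalization here uses the min-max formula, x' = (x - min) / (max - min), which is one common way to scale values into the range 0.0 to 1.0.

```python
import pandas as pd

# Hypothetical raw table; "price" arrives as text instead of numbers.
df = pd.DataFrame({
    "price": ["100", "250", "400"],
    "area_sqft": [500, 1000, 2000],
})

# Feature type conversion: text -> numeric.
df["price"] = pd.to_numeric(df["price"])

# Normalization (min-max): x' = (x - min) / (max - min), so values fall in [0.0, 1.0].
for col in ["price", "area_sqft"]:
    col_min, col_max = df[col].min(), df[col].max()
    df[col + "_scaled"] = (df[col] - col_min) / (col_max - col_min)

# Attribute/feature construction: build a new feature from existing ones.
df["price_per_sqft"] = df["price"] / df["area_sqft"]

print(df)
```

scikit-learn's `MinMaxScaler` does the same scaling and is usually the better choice inside a training pipeline, because it remembers the min and max learned from the training data.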

### 5. Data Discretization:
Data discretization transforms numeric data by mapping values to interval or concept labels.
* It can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
* Discretization techniques include:
    * Binning
    * Histogram Analysis
    * Cluster Analysis
    * Decision-Tree Analysis
    * Correlation Analysis
    
<img src='https://revolution-computing.typepad.com/.a/6a010534b1db25970b017c3755470d970b-500wi' width="450" height="450">
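
Binning is the simplest of these techniques. The sketch below uses `pandas.cut` on a hypothetical list of ages. The interval edges and the labels are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([3, 15, 22, 37, 45, 61, 78], name="age")

# Binning with hand-picked edges and concept labels.
# Intervals are right-closed by default: (0, 12], (12, 19], (19, 59], (59, 120].
edges = [0, 12, 19, 59, 120]
labels = ["child", "teen", "adult", "senior"]
age_group = pd.cut(ages, bins=edges, labels=labels)

# Equal-width binning: split the observed range into 3 intervals of equal size
# and return the interval index (0, 1, 2) instead of a label.
equal_width = pd.cut(ages, bins=3, labels=False)

print(pd.DataFrame({"age": ages, "age_group": age_group, "bin": equal_width}))
```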

# CHAPTER 2: Feature Engineering

## CONTENTS:
* What is feature engineering?
* Why is feature engineering important?
* Process

## What is feature engineering?
First, let's look at the meaning of each word:
* **Feature:** It is an attribute or property shared by all the independent units on which analysis or prediction is to be done.
<img src='https://d2slcw3kip6qmk.cloudfront.net/marketing/blog/2019Q4/feature-driven-development/feature-driven-development-header.png' width='300' height='300'>
  
  
* **Engineering:** The practice of inventing solutions to problems.
<img src = 'https://www.creativefabrica.com/wp-content/uploads/2020/12/01/computer-engineering-concept-Graphics-6940051-1-1-580x386.jpg' width='300' height='300'>
  

* **Feature Engineering:** The process of creating new features, or extracting them from existing features, using domain knowledge in order to improve the performance of a machine learning model.
<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2021/05/64590automated-feature-engineering.png' width='300' height='300'>

## Why is Feature Engineering Important?
* Quality data helps improve the accuracy and performance of a machine learning model.
* Machine learning algorithms follow this rule (they learn like kids):
    * **Rule:**  ```GIGO = Garbage In, Garbage Out```

## What is the process of Feature Engineering?
The major steps of feature engineering are as follows:
* Brainstorming or testing features
* Deciding what features to create
* Creating features
* Checking how the features work with your model
* Improving your features if needed
* Going back to brainstorming and creating more features until the work is done

### Example:

| **DateTime**             | *Hour* | *Day* | *Month* | *Year* | *Day of week* |
|--------------------------|--------|-------|---------|--------|---------------|
| **07-Apr-2020 12.00.00** | 12     |   7   |   4     |   2020 |    2          |
| **10-Apr-2020 23.00.00** | 23     |   10  |   4     |   2020 |    5          |
| **15-Apr-2020 02.00.00** | 2      |   15  |   4     |   2020 |    3          |
| **25-Apr-2020 11.00.00** | 11     |   25  |   4     |   2020 |    6          |
| **28-Apr-2020 05.00.00** | 5      |   28  |   4     |   2020 |    2          |

 * **Here "DateTime" is the raw data, and the remaining columns are the features extracted from it. Day of week is numbered from Monday = 1 to Sunday = 7.**
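
The extraction in this example can be sketched with pandas. The format string is an assumption based on how the timestamps are written above. pandas numbers weekdays from Monday = 0, so the sketch adds 1 to get a Monday = 1 numbering.

```python
import pandas as pd

# The raw DateTime strings from the example.
raw = pd.Series([
    "07-Apr-2020 12.00.00",
    "10-Apr-2020 23.00.00",
    "15-Apr-2020 02.00.00",
    "25-Apr-2020 11.00.00",
    "28-Apr-2020 05.00.00",
], name="DateTime")

# Parse the text into real timestamps (day-month-year, dots between time parts).
ts = pd.to_datetime(raw, format="%d-%b-%Y %H.%M.%S")

# Extract one feature per column with the .dt accessor.
features = pd.DataFrame({
    "Hour": ts.dt.hour,
    "Day": ts.dt.day,
    "Month": ts.dt.month,
    "Year": ts.dt.year,
    "Day of week": ts.dt.dayofweek + 1,   # pandas: Monday = 0, so +1 gives Monday = 1
})

print(pd.concat([raw, features], axis=1))
```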

### Note:
Feature engineering is closely related to the *data transformation* step from the previous chapter, where attribute/feature construction was one of the listed tasks. That's why the prerequisites are the same for both.

# Prerequisites:

 Prerequisites for Data Preprocessing & Feature Engineering:
* **Python Libraries:**

    * NumPy
    * pandas
    * Matplotlib
    * Seaborn
    * scikit-learn
* **Mathematics:**

    * Statistics
    * Probability
    * Calculus
    * Linear Algebra
* **Software:**

    * Anaconda
    * Jupyter Notebook
    * Spyder

## Learning Platforms:
* YouTube
* Websites

# Save your work:

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
jovian.commit(project = 'Data Preprocessing and Feature Engineering')

[jovian] Updating notebook "sonihariom555/data-preprocessing-and-feature-engineering" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/sonihariom555/data-preprocessing-and-feature-engineering


'https://jovian.ai/sonihariom555/data-preprocessing-and-feature-engineering'