In [1]:
# Add the parent directory to the import path so that data_helper can be imported
import sys
sys.path.append("../")

from data_helper import generate_fake_data

# Data Cleaning & Preprocessing Techniques

This project is a simple walk-through of various data cleaning and preprocessing techniques to consider before doing data analysis.

### Why

Raw data often contains errors, missing values, inconsistencies, and other issues that can undermine the quality and reliability of any downstream analysis. Cleaning and preprocessing help ensure that the data you work with is accurate, complete, and in a format appropriate for the intended analysis or modeling task. This leads to more reliable results and a more efficient use of time and resources.

> **Note**: Data cleaning and preprocessing is an iterative process as new issues can be uncovered during analysis or modeling.

### What this IS NOT!

Data cleaning and preprocessing are not a way for you to introduce your own bias. Bias can creep in unintentionally, so apply each technique carefully to avoid overemphasizing or underemphasizing aspects of your data.

For example, if a dataset contains missing values, one common approach to address this is to impute the missing values. However, if the missing values are not random, but instead systematically related to other variables in the dataset, then the imputed values may introduce bias into the dataset.
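
To make this concrete, here is a minimal sketch using made-up income values (they are not from this project's dataset). It assumes that high incomes are the ones that go unreported, so the missingness depends on the value itself, and then applies mean imputation:

```python
from statistics import mean

# Hypothetical "true" incomes (in thousands); in practice these are unknown.
true_incomes = [30, 35, 40, 45, 50, 120, 150, 200]

# Assumption: incomes above 100 are never reported, so the values are
# missing systematically rather than at random.
observed = [x if x <= 100 else None for x in true_incomes]

# Mean imputation: fill each missing value with the mean of the observed values.
observed_mean = mean(x for x in observed if x is not None)
imputed = [observed_mean if x is None else x for x in observed]

print(mean(true_incomes))  # true mean: 83.75
print(mean(imputed))       # mean after imputation: 40
```

The imputed dataset looks complete, yet its mean is less than half of the true mean, because every filled-in value was drawn from the low-income group only.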

### Data Cleaning (DC)
**Data Cleaning** involves identifying and correcting errors and inconsistencies in the data, such as:

- incorrect formatting
- missing values
- duplicates
- outliers

The goal of data cleaning is to ensure that the data is **accurate**, **complete**, and **consistent**.

### Data Preprocessing (DP)
**Data Preprocessing**, on the other hand, involves transforming the data into a format that is more suitable for analysis or modeling. Data preprocessing techniques may include:

- transforming the data into a different representation
- scaling or standardizing variables
- encoding categorical variables 
- reducing the dimensionality of the data
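
As a rough illustration of two of the techniques listed above, the sketch below standardizes a numeric variable (z-scores) and one-hot encodes a categorical variable. The values are made up for the example, and it uses only the standard library so that it runs anywhere; in a notebook you would more likely reach for pandas or scikit-learn.

```python
from statistics import mean, pstdev

# Standardizing: rescale values to have mean 0 and standard deviation 1.
ages = [20, 30, 40, 50, 60]  # made-up values
mu, sigma = mean(ages), pstdev(ages)
standardized = [(a - mu) / sigma for a in ages]

# One-hot encoding: one 0/1 indicator per distinct category.
countries = ["Madagascar", "Chile", "Madagascar"]  # made-up values
categories = sorted(set(countries))  # ['Chile', 'Madagascar']
one_hot = [[int(c == cat) for cat in categories] for c in countries]

print(standardized)
print(categories)
print(one_hot)  # [[0, 1], [1, 0], [0, 1]]
```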

## DC Techniques: Incorrect Formatting
Incorrect formatting refers to data that fails to meet the expected or required format for its type. For example, if a dataset stores some dates as "dd/mm/yyyy" and others as "mm/dd/yyyy", the dates are incorrectly formatted. Likewise, if a gender column records the same category as "Man" in some rows and "Male" in others, that is also incorrect formatting.

There are various types of incorrect formatting. Let us go over the following:

- Column Headers
- Inconsistent Values
- Inconsistent Types
- Date Formats
- Text Casing
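
The date and gender examples described above can be sketched as follows. This is only an illustration: the mapping `GENDER_MAP` and the helpers `normalize_gender` and `parse_date` are hypothetical names, and the sketch assumes you already know which format each date source uses.

```python
from datetime import datetime

# Hypothetical mapping from raw labels to one canonical label per category.
GENDER_MAP = {
    "man": "Male", "male": "Male", "m": "Male",
    "woman": "Female", "female": "Female", "f": "Female",
}

def normalize_gender(value):
    """Return the canonical label, or the original value if it is unknown."""
    return GENDER_MAP.get(value.strip().lower(), value)

def parse_date(value, fmt):
    """Parse a date with an explicitly stated format and return it as ISO 8601."""
    return datetime.strptime(value, fmt).date().isoformat()

print(normalize_gender("Man"))               # Male
print(parse_date("27/11/1982", "%d/%m/%Y"))  # 1982-11-27
print(parse_date("11/27/1982", "%m/%d/%Y"))  # 1982-11-27
```

Note that a value such as "05/06/1977" is valid in both formats, so the correct format cannot be inferred from the value alone; it has to come from knowledge of where the data originated.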

### Column Headers
This refers to column headers that do not follow a single naming convention. Consider the following cell:

In [2]:
df = generate_fake_data()
df.head(2)

|   | ID | name | AgE | dateOfBirth | current_COUNTRY |
|---|---|---|---|---|---|
| 0 | 645b90bb4513d03db4809f0a | Glenn Vaughn | 62 | 1982-11-27 | Madagascar |
| 1 | 645b90bb4513d03db4809f0b | Jocelyn Shannon | 55 | 1977-06-05 | Sao Tome and Principe |


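Notice how the column headers above don't follow a single convention: `ID` is all uppercase, `AgE` mixes cases at random, `dateOfBirth` is camelCase, and `current_COUNTRY` combines snake_case with uppercase. This isn't ideal. Let's clean this up.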

In [18]:
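# Rename every column to lowercase snake_case.
# The new names must be listed in the same order as the existing columns.
df.columns = ["id", "name", "age", "date_of_birth", "current_country"]
df.head(2)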

|   | id | name | age | date_of_birth | current_country |
|---|---|---|---|---|---|
| 0 | 645b90bb4513d03db4809f0a | Glenn Vaughn | 62 | 1982-11-27 | Madagascar |
| 1 | 645b90bb4513d03db4809f0b | Jocelyn Shannon | 55 | 1977-06-05 | Sao Tome and Principe |


Yes! Notice how all of the columns now follow the same convention. That is:

- lowercase
- snake_case

This is much cleaner and follows Python naming conventions, which makes the columns easier to reference during analysis.
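
Typing every name by hand works for five columns but gets tedious on wider datasets. As a rough sketch, the renaming could be automated with a regular expression. The helper names `to_snake_case` and `clean_headers` are made up for this illustration and are not part of `data_helper`. Headers whose casing carries no usable word boundaries, such as `AgE`, need an explicit override:

```python
import re

def to_snake_case(name):
    """Convert a header such as 'dateOfBirth' or 'current_COUNTRY' to snake_case."""
    name = name.strip()
    # Insert "_" at each lowercase-to-uppercase boundary: dateOfBirth -> date_Of_Birth
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse spaces, hyphens, and other separators into a single "_"
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.lower().strip("_")

def clean_headers(headers, overrides=None):
    """Apply to_snake_case to each header, unless an explicit override is given."""
    overrides = overrides or {}
    return [overrides.get(h, to_snake_case(h)) for h in headers]

raw = ["ID", "name", "AgE", "dateOfBirth", "current_COUNTRY"]
print(clean_headers(raw, overrides={"AgE": "age"}))
# ['id', 'name', 'age', 'date_of_birth', 'current_country']
```

Assuming `df` still holds the raw headers, the result could be assigned with `df.columns = clean_headers(df.columns, overrides={"AgE": "age"})`. Without the override, `AgE` would come out as `ag_e`, so it is worth inspecting the result before relying on it.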

### Inconsistent Values

### Inconsistent Types

### Date Formats

### Text Casing

## DC Techniques: Missing Values

### Drop Values

### Impute Values

#### Mean

#### Median

#### Mode

## DC Techniques: Duplicates

## DC Techniques: Outliers

## DP Techniques: Transforming Representation

## The Remaining DP Techniques

- scaling or standardizing variables
- encoding categorical variables 
- reducing the dimensionality of the data