In [1]:
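# Add the parent directory to the import path so the src module can be accessed
import sys
sys.path.append("../")

# pandas is used directly later on (pd.DataFrame), so import it explicitly
import pandas as pd

from data_helper import *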

# Data Cleaning & Preprocessing Techniques

This project is a simple walk-through of various data cleaning and preprocessing techniques to consider before doing data analysis.

### Why

Raw data often contains errors, missing values, inconsistencies, and other issues that can undermine the quality and reliability of downstream analyses. Cleaning and preprocessing help ensure that the data is accurate, complete, and in a format appropriate for the intended analysis or modeling task. This leads to more reliable results and a more efficient use of time and resources.

> **Note**: Data cleaning and preprocessing are iterative, because new issues can surface during analysis or modeling.

### What this IS NOT!

Data cleaning and preprocessing are not a way for you to introduce your own bias. Bias can creep in unintentionally, so apply each technique carefully to avoid overemphasizing or underemphasizing aspects of your data.

For example, if a dataset contains missing values, one common approach to address this is to impute the missing values. However, if the missing values are not random, but instead systematically related to other variables in the dataset, then the imputed values may introduce bias into the dataset.
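
The effect described above can be illustrated with a small sketch. The incomes below are made-up numbers, and the assumption that only the highest earners left the field blank is purely hypothetical. It is meant only to show how mean imputation can hide a systematic gap rather than repair it.

```python
import pandas as pd

# Hypothetical incomes (in thousands); assume the two highest earners did not answer.
true_income = pd.Series([30.0, 40.0, 50.0, 120.0, 160.0])
observed = true_income.copy()
observed.iloc[[3, 4]] = float("nan")  # missingness depends on the value itself

# Mean imputation fills the gaps with the mean of the observed (lower) incomes.
imputed = observed.fillna(observed.mean())

print(true_income.mean())  # 80.0
print(imputed.mean())      # 40.0
```

The imputed column has no missing values, yet its mean is half the true mean, because the values that were missing were not a random sample.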

### Data Cleaning (DC)
**Data Cleaning** involves identifying and correcting errors and inconsistencies in the data such as:

- incorrect formatting
- missing values
- duplicates
- outliers

The goal of data cleaning is to ensure that the data is **accurate**, **complete**, and **consistent**.

### Data Preprocessing (DP)
**Data Preprocessing**, on the other hand, involves transforming the data into a format that is more suitable for analysis or modeling. Data preprocessing techniques may include:

- transforming the data into a different representation
- scaling or standardizing variables
- encoding categorical variables 
- reducing the dimensionality of the data

## DC Techniques: Incorrect Formatting
Incorrect formatting refers to data that fails to meet the expected or required format for its type. For example, if some dates in a dataset use the format "dd/mm/yyyy" while others use "mm/dd/yyyy", the column is incorrectly formatted. Likewise, if a gender column records some values as "Man" and others as "Male", that is also incorrect formatting. There are various types of incorrect formatting. Let us go over the following:

- Column Headers

- Inconsistent Values

- Inconsistent Types

- Text Casing
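
Before going through these, here is a rough sketch of the date example mentioned above. A string such as "04/02/1964" cannot be disambiguated on its own, so this sketch assumes each record carries a hypothetical `source` column that tells us which format it uses. The column names and the `formats` mapping are illustrative only.

```python
import pandas as pd

# Hypothetical records: the `source` column (an assumption) identifies the date format.
records = pd.DataFrame({
    "raw_date": ["04/02/1964", "10/27/1963", "27/10/1963"],
    "source": ["eu", "us", "eu"],
})
formats = {"eu": "%d/%m/%Y", "us": "%m/%d/%Y"}

# Parse each row with the format that matches its source.
records["date"] = [
    pd.to_datetime(raw, format=formats[src])
    for raw, src in zip(records["raw_date"], records["source"])
]

print(records["date"].dt.strftime("%Y-%m-%d").tolist())
# ['1964-02-04', '1963-10-27', '1963-10-27']
```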

### Column Headers
Column headers within a dataset sometimes follow different naming conventions. Consider the following cell:

In [2]:
users = generate_users()
users.head(2)

|   | ID | name | AgE | dateOfBirth | current_COUNTRY |
|---|----|------|-----|-------------|-----------------|
| 0 | 645b9bc3742ea75794fdb7f6 | Cassandra Spencer | 38 | 1964-02-04 | Israel |
| 1 | 645b9bc3742ea75794fdb7f7 | Amanda Fuller | 29 | 1963-10-27 | Tonga |


Notice how the columns in the above dataset don't adhere to the same conventions. This isn't ideal. Let's clean this up.

In [3]:
users.columns = ["id", "name", "age", "date_of_birth", "current_country"]
users.head(2)

|   | id | name | age | date_of_birth | current_country |
|---|----|------|-----|---------------|-----------------|
| 0 | 645b9bc3742ea75794fdb7f6 | Cassandra Spencer | 38 | 1964-02-04 | Israel |
| 1 | 645b9bc3742ea75794fdb7f7 | Amanda Fuller | 29 | 1963-10-27 | Tonga |


Yes! Notice how all of the columns now follow the same convention. That is:

- lowercase
- snake_case

This is much cleaner and adheres to Python naming conventions, which makes the columns easier to reference during analysis!
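
Assigning a list to `users.columns` depends on the columns arriving in a fixed order. As an alternative sketch, `DataFrame.rename` with an explicit mapping ties each new name to its old one. The `users_demo` frame below is a stand-in built for illustration, since `generate_users()` lives in the local helper module.

```python
import pandas as pd

# Stand-in for the `users` frame above, with the same mixed-convention headers.
users_demo = pd.DataFrame(
    [["645b9bc3742ea75794fdb7f6", "Cassandra Spencer", 38, "1964-02-04", "Israel"]],
    columns=["ID", "name", "AgE", "dateOfBirth", "current_COUNTRY"],
)

# Map each old header to its snake_case name; unlisted columns are left unchanged.
column_map = {
    "ID": "id",
    "AgE": "age",
    "dateOfBirth": "date_of_birth",
    "current_COUNTRY": "current_country",
}
users_demo = users_demo.rename(columns=column_map)

print(list(users_demo.columns))
# ['id', 'name', 'age', 'date_of_birth', 'current_country']
```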

### Inconsistent Values
Inconsistent values occur when a field contains different values that represent the same thing. For example:

In [4]:
states = generate_states()
states.state.value_counts()

MI          14
Michigan    14
Alabama     13
AL          13
Texas       13
TX          13
Ohio        10
OH          10
Name: state, dtype: int64

From real-world knowledge, we know that `Alabama` and `AL` refer to the same state. Left as is, each state would be split across two groups in any count or aggregation. Let's clean this up!

In [5]:
# Map full state names to their abbreviations; values that are already abbreviations stay as they are
state_map = {"Ohio": "OH", "Michigan": "MI", "Alabama": "AL", "Texas": "TX"}
states["state"] = states["state"].replace(state_map)
states.state.value_counts()

MI    28
AL    26
TX    26
OH    20
Name: state, dtype: int64

Great! Now every state uses its abbreviation, which makes the data more consistent!
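
A mapping only fixes the spellings it knows about. As a hedged follow-up sketch, the cleaned column can be checked against the set of expected abbreviations to surface anything the mapping missed. The `sample` series and the "Mich." spelling are invented for illustration.

```python
import pandas as pd

# Hypothetical sample; "Mich." is a spelling the mapping does not cover.
sample = pd.Series(["Ohio", "OH", "Michigan", "Mich.", "TX"], name="state")
state_map = {"Ohio": "OH", "Michigan": "MI", "Alabama": "AL", "Texas": "TX"}

cleaned = sample.replace(state_map)

# Anything that is not a known abbreviation still needs attention.
valid = set(state_map.values())
unexpected = cleaned[~cleaned.isin(valid)]

print(unexpected.tolist())  # ['Mich.']
```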

### Inconsistent Types
Inconsistent types occur when the values in a column do not share the same data type. For example:

In [6]:
vals = pd.DataFrame([True, "true", 1], columns=["is_true"])
vals

|   | is_true |
|---|---------|
| 0 | True |
| 1 | true |
| 2 | 1 |


Notice that the values above include a Boolean, a string, and an integer. We need to make the column's type consistent. Let's get started!

In [7]:
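# bool(x) treats every non-empty string (even "false") as True, so compare against known truthy spellings instead
truthy = {"true", "1"}
vals["is_true"] = vals["is_true"].map(lambda x: str(x).strip().lower() in truthy)
vals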

|   | is_true |
|---|---------|
| 0 | True |
| 1 | True |
| 2 | True |


Great! Now all of the values match the same type: `Boolean`!
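
The example column happens to contain only truthy values. Real columns usually mix in falsy spellings too, so here is a slightly fuller sketch. The `parse_bool` helper and the accepted spellings in `TRUTHY` and `FALSY` are assumptions chosen for illustration, not part of the helper module.

```python
import pandas as pd

# Accepted spellings are an assumption; extend them to match your data.
TRUTHY = {"true", "1", "yes"}
FALSY = {"false", "0", "no"}

def parse_bool(value):
    """Convert common boolean spellings to bool; raise on anything unrecognized."""
    text = str(value).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")

mixed = pd.Series([True, "true", 1, "false", 0, "No"])
parsed = mixed.map(parse_bool)

print(bool("false"))    # True: any non-empty string is truthy in Python
print(parsed.tolist())  # [True, True, True, False, False, False]
```

Raising on unknown values is a deliberate choice in this sketch: it surfaces unexpected spellings instead of silently coercing them.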

### Text Casing
Text casing issues arise when the same value appears in different cases, such as `MAN` and `man`. Casing can be normalized across a whole dataframe or for a specific column, and forcing values to adhere to one case will make analysis all the more merry!

In [8]:
genders = generate_gender()
genders.value_counts()

gender
WOMAN     25
MAN       19
woman     18
dnd       16
man       12
DND       10
dtype: int64

Take a look: the same value is counted separately depending on its casing. Let's fix this!

In [12]:
genders.gender = genders.gender.str.lower()
genders.value_counts()

gender
woman     43
man       31
dnd       26
dtype: int64

Now we have more accurate stats!

## DC Techniques: Missing Values

Missing values are entries that are absent from specific rows or columns. There are a variety of reasons for this, and how they are handled can have a real effect on the analysis! Missing value techniques include:

- Drop Values

- Impute Values

- Do Nothing!
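
As a quick preview of the first two options, here is a minimal sketch using pandas. The `people` frame and its values are made up for illustration, and which strategy is appropriate depends on why the data is missing.

```python
import pandas as pd

# Hypothetical data with gaps in a numeric and a categorical column.
people = pd.DataFrame({
    "age": [20.0, None, 30.0, 100.0],
    "country": ["Tonga", "Israel", None, "Tonga"],
})

# Drop: remove every row that has at least one missing value.
dropped = people.dropna()

# Impute: fill the gaps with a summary statistic of the observed values.
age_mean = people["age"].fillna(people["age"].mean())      # mean of 20, 30, 100 is 50.0
age_median = people["age"].fillna(people["age"].median())  # median of 20, 30, 100 is 30.0
country_mode = people["country"].fillna(people["country"].mode()[0])  # most frequent is "Tonga"

print(len(dropped))           # 2
print(age_mean.tolist())      # [20.0, 50.0, 30.0, 100.0]
print(age_median.tolist())    # [20.0, 30.0, 30.0, 100.0]
print(country_mode.tolist())  # ['Tonga', 'Israel', 'Tonga', 'Tonga']
```

Note how the single large age (100) pulls the mean well above the median, which is one reason the median is often preferred for skewed columns.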

### Drop Values

### Impute Values

#### Mean

#### Median

#### Mode

## DC Techniques: Duplicates

## DC Techniques: Outliers

## DP Techniques: Transforming Representation

## The Remaining DP Techniques

- scaling or standardizing variables
- encoding categorical variables 
- reducing the dimensionality of the data
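
The first two items can be sketched with pandas alone. The `features` frame below is invented for illustration, and this is only one of several reasonable ways to scale or encode. Dimensionality reduction is left out because it typically relies on an additional library.

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical column.
features = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "gender": ["woman", "man", "dnd"],
})

# Standardize: subtract the mean and divide by the sample standard deviation.
features["age_scaled"] = (features["age"] - features["age"].mean()) / features["age"].std()

# One-hot encode: one indicator column per category.
encoded = pd.get_dummies(features, columns=["gender"])

print(features["age_scaled"].tolist())  # [-1.0, 0.0, 1.0]
print(encoded.columns.tolist())
# ['age', 'age_scaled', 'gender_dnd', 'gender_man', 'gender_woman']
```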