# Week 8: Data Cleansing 🧼

## 1. What is Data Cleansing?

**Data cleansing**, also known as data cleaning, is a core part of data wrangling. It's the process of **detecting and correcting (or removing) corrupt or inaccurate records** from a dataset. In the age of big data, clean data is absolutely essential for effective machine learning models, statistical analysis, and business intelligence tools. The goal is to improve data quality, which leads to more trustworthy insights and decisions.

This process is a key component of the **Data Pre-processing** stage in the overall data wrangling workflow.

---

## 2. The Data Cleansing Process

Effective data cleansing follows a structured, iterative process to ensure thoroughness and prevent data loss.


The typical steps are:
1.  **Data Audit**: Review and diagnose the current state of data quality.
2.  **Defining Cleansing Goals**: Set clear, measurable objectives for the cleaning process.
3.  **Data Cleansing Plan**: Create a detailed strategy and timeline.
4.  **Backup Data**: **Crucially**, create a backup of the original data to prevent accidental loss.
5.  **Data Cleansing Operations**: Execute the actual cleaning tasks (e.g., removing duplicates, handling missing values).
6.  **Verification**: Check that the cleaning operations were successful and the data meets the defined quality standards.
7.  **Documentation & Reporting**: Record the steps taken and the outcomes.
8.  **Review**: Analyze the results and the effectiveness of the process.
9.  **Implementation of Preventative Measures**: Use insights from the audit to improve data entry processes and prevent future errors.

---

## 3. A Closer Look at the Process Steps

### Data Audit (The "What's Wrong?" Phase)
A **data audit** is the critical first step where you perform a comprehensive review of your data to understand its overall health.

* **Objectives**:
    * **Identify Quality Issues**: Find inaccuracies, inconsistencies, duplicates, and missing values.
    * **Assess Completeness**: Check if critical data is missing.
    * **Evaluate Consistency**: Ensure data is consistent across different systems.
    * **Check Compliance**: Verify that data management practices meet regulatory standards (like GDPR).
* **Methods**:
    * This can involve a mix of automated scanning using software tools and manual review for more complex issues.
    * Popular tools for this include Tableau Prep, Talend, Alteryx, SAS Data Management, and Google Cloud Dataprep.
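
As a rough illustration of the automated part of an audit, the sketch below uses pandas to count missing values, exact duplicate rows, inconsistent spellings, and out-of-range values. The dataset, the column names, and the 0 to 120 age range are all hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 29, 29, np.nan, 210],
    "country": ["AU", "au", "au", "NZ", None],
})

audit = {
    # Completeness: how many values are missing in each column?
    "n_rows": len(df),
    "missing_per_column": df.isna().sum().to_dict(),
    # Duplicates: rows that repeat an earlier row exactly.
    "n_duplicate_rows": int(df.duplicated().sum()),
    # Consistency: the same country written in different ways.
    "distinct_country_spellings": sorted(df["country"].dropna().unique()),
    # Accuracy: non-missing ages outside an assumed plausible range.
    "ages_out_of_range": int((~df["age"].between(0, 120) & df["age"].notna()).sum()),
}

for name, value in audit.items():
    print(f"{name}: {value}")
```

A summary like this only points at where problems are; deciding whether a flagged value is really wrong still needs the manual review mentioned above.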

### Defining Goals, Planning, and Backup
* **Defining Cleansing Goals** involves understanding the business requirements and setting specific, measurable targets for data quality.
* The **Data Cleansing Plan** is a formal document that outlines the strategies, tools, timeline, and monitoring procedures for the project.
* **Backing up data** is a non-negotiable safety net. It mitigates the risk of irreversible errors during the cleaning process and ensures operational continuity.
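
For an in-memory pandas workflow, one simple way to keep a backup is to clean a deep copy and leave the original DataFrame untouched. This is a minimal sketch with made-up data; the names `raw` and `working` are just for illustration.

```python
import pandas as pd

# Hypothetical raw data; in practice this would be loaded from the source system.
raw = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, None, 0.9]})

# Deep copy: `working` gets its own data, so edits to it do not touch `raw`.
working = raw.copy(deep=True)

# A cleaning step applied to the working copy only.
working.loc[working["score"].isna(), "score"] = 0.0

print(raw["score"].isna().sum())      # the original still records the gap
print(working["score"].isna().sum())  # the working copy has been filled
```

For anything beyond a short session, writing the original to disk (for example with `DataFrame.to_csv`) before cleaning would be a more durable safety net.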

---

## 4. Data Cleansing Operations

These are the hands-on tasks performed to fix data quality problems.

| Problem | Cleansing Operation |
| :--- | :--- |
| Duplicated records | Removing duplicates |
| Inaccurate data | Validating and correcting errors |
| Inconsistent data | Running consistency checks |
| Incomplete data | Filling missing values |
| Irrelevant or anomalous data | Filtering irrelevant records and handling outliers |

### Deep Dive: Removing Duplicates
There are many techniques to find and remove duplicate records:
* **Manual Review**: For very small datasets.
* **Sorting**: Sort data to bring duplicates next to each other for easier identification.
* **Database Queries (SQL)**: Use `GROUP BY` and `HAVING COUNT(*)>1` to identify duplicate records based on specific columns. You can then delete them, often by keeping the record with the minimum ID.
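* **Hashing Techniques**: A hash function converts a row's data into a fixed-length string (a hash). Identical rows always produce the same hash, so duplicates can be found by comparing hashes instead of whole rows. Different rows can occasionally share a hash (a collision), but this is extremely unlikely with a good hash function.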
* **Scripting**: Use libraries like **Pandas** in Python to easily identify and drop duplicate rows (`df.duplicated()` and `df.drop_duplicates()`).
* **Fuzzy Matching and Machine Learning**: Advanced techniques like fuzzy matching, record linkage, and text similarity measures can find "near duplicates" that exact-match methods miss (e.g., "John Smith" vs. "Jon Smith").
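
The scripting, hashing, and near-duplicate ideas above can be sketched in a few lines of Python. The data and the choice of `name` and `email` as the key columns are hypothetical, and `difflib.SequenceMatcher` from the standard library stands in here for a dedicated fuzzy-matching tool.

```python
import hashlib
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical records; rows 1 and 2 are exact duplicates on name and email.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["John Smith", "John Smith", "Jon Smith", "Ana Lee"],
    "email": ["js@example.com", "js@example.com", "js@example.com", "ana@example.com"],
})
key_cols = ["name", "email"]

# Scripting: flag exact duplicates on the key columns and keep the first one.
dup_mask = df.duplicated(subset=key_cols, keep="first")
deduped = df.drop_duplicates(subset=key_cols, keep="first")

# Hashing: identical key values always give the same SHA-256 digest.
def row_hash(row):
    joined = "|".join(str(row[c]) for c in key_cols)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

df["row_hash"] = df.apply(row_hash, axis=1)
hash_dups = df[df.duplicated(subset="row_hash", keep=False)]

# Near duplicates: a similarity ratio between 0 and 1 for two strings.
def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

near = similarity("John Smith", "Jon Smith")

print(deduped["id"].tolist())
print(hash_dups["id"].tolist())
print(round(near, 3))
```

Because the rows are ordered by `id`, `keep="first"` behaves like the "keep the minimum ID" rule described for SQL. Note that the exact-match methods leave "Jon Smith" in place; only the similarity score hints that it may be the same person.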

---

## 5. Handling Missing Data ❓

Missing data is a very common problem that can arise from equipment errors, users skipping survey questions, or changes in circumstances.

### 5.1. Why Missing Data is a Problem
Ignoring or mishandling missing data is risky because:
* Most standard statistical methods assume complete data.
* It can lead to **biased estimations** (e.g., an incorrect sample mean) and **incorrect inferences**, a classic "garbage in, garbage out" scenario.

### 5.2. Missing Data Mechanisms
To handle missing data correctly, we first need to understand *why* it's missing. There are three key mechanisms:

1.  **Missing Completely at Random (MCAR)**: The probability of a value being missing is completely random and unrelated to any other variable or the missing value itself. This is the ideal but rarest scenario.
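    * Formally, let $B$ be the missingness indicator, $Y_{obs}$ the observed data, and $Y_{miss}$ the missing data. Under MCAR the probability of missingness depends only on some unknown parameter $\eta$: $$p(B|Y_{obs}, Y_{miss}) = p(B|\eta)$$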

2.  **Missing at Random (MAR)**: The probability of a value being missing is related to *other observed variables* in the dataset, but not to the missing value itself. For example, men might be less likely to fill out a depression survey, so missingness on the "depression" variable is related to the "gender" variable.
    * The probability of missingness depends only on the observed data, $Y_{obs}$: $$p(B|Y_{obs}, Y_{miss}) = p(B|Y_{obs}, \eta)$$ 

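3.  **Missing Not at Random (MNAR)**: The probability of a value being missing is related to the value of that variable itself. For example, people with very high incomes might be less likely to disclose their income. This is the most difficult type to handle.
    * Here the probability of missingness still depends on the unobserved values $Y_{miss}$, so $p(B|Y_{obs}, Y_{miss})$ cannot be simplified as in the two cases above.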
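
To make the three mechanisms concrete, the following sketch simulates each one on synthetic data and compares the mean of the values that remain observed. Everything here is an assumption made for illustration: the variables `iq` and `rating`, the 30% missingness rate, and the cut-off of 90.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
iq = rng.normal(100, 15, size=n)               # fully observed variable
rating = 0.1 * iq + rng.normal(0, 2, size=n)   # variable that will lose values

# MCAR: every rating has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends only on the observed variable (iq).
mar_mask = iq < 90

# MNAR: missingness depends on the rating itself (the lowest 30% are lost).
mnar_mask = rating < np.quantile(rating, 0.3)

def observed_mean(mask):
    """Mean of the ratings that are still observed under a missingness mask."""
    return rating[~mask].mean()

print("full data:", round(rating.mean(), 2))
print("MCAR     :", round(observed_mean(mcar_mask), 2))
print("MAR      :", round(observed_mean(mar_mask), 2))
print("MNAR     :", round(observed_mean(mnar_mask), 2))
```

In this simulation the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled upward. The difference is that under MAR the cause (`iq`) is observed and can be used to correct the estimate, whereas under MNAR the cause is exactly the part of the data that is missing.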

### 5.3. Methods for Handling Missing Data: Deletion
These methods involve removing data.

* **List-wise Deletion**: Any row with one or more missing values is completely discarded.
    * **Pro**: It's simple and convenient.
    * **Con**: It can dramatically reduce the sample size and can produce biased results if the data is not MCAR.

* **Pairwise Deletion**: For a specific analysis (e.g., a correlation between two variables), only rows missing data for *those specific variables* are ignored.
    * **Pro**: It uses more of the available data than list-wise deletion.
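    * **Con**: It requires MCAR data, and because each statistic is computed from a different subset of rows, it can lead to inconsistent sample sizes and invalid covariance matrices (e.g., ones that are not positive semi-definite).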
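
The two deletion strategies can be compared directly in pandas. In this sketch (with made-up data and column names), `dropna()` performs list-wise deletion, while `DataFrame.corr()` excludes missing values pair by pair, which corresponds to pairwise deletion.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "y": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "z": [1.0, 1.5, 2.0, np.nan, 3.0, 3.5],
})

# List-wise deletion: drop every row that has at least one missing value.
listwise = df.dropna()
listwise_corr = listwise.corr()

# Pairwise deletion: corr() uses, for each pair of columns, all rows
# where both of those columns are present.
pairwise_corr = df.corr()

# Number of rows actually used for each pair of columns.
present = df.notna().astype(int)
pairwise_n = present.T @ present

print(len(listwise))   # rows left after list-wise deletion
print(pairwise_n)      # rows used per pair under pairwise deletion
```

List-wise deletion keeps only three of the six rows here, while each pairwise correlation is based on four. The price is that every cell of `pairwise_corr` comes from a different subset of rows.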

### 5.4. Methods for Handling Missing Data: Single Imputation
**Imputation** means filling in missing values with a replacement value.

* **Mean/Median Imputation**: Replace missing values with the mean or median of the non-missing values for that variable.
    * **Con**: This method artificially reduces the variance of the data and can distort relationships between variables.

* **Regression Imputation**: Use a regression model to predict the missing value based on other variables in the dataset.
    * **Con**: While better than mean imputation, it still artificially reduces variance because all the imputed points fall exactly on the regression line. The regression equation is: $$JP_i = \hat{\beta}_0 + \hat{\beta}_1 IQ_i$$ where $IQ_i$ is the observed predictor and $JP_i$ is the value imputed for case $i$.

* **Stochastic Regression Imputation**: This is an improvement on standard regression imputation. It first predicts the missing value and then **adds a random residual term** to that prediction.
    * **Pro**: This method restores the lost variability and can produce **unbiased parameter estimates** if the data is MAR, making it a much stronger choice. The equation becomes: $$JP_i = \hat{\beta}_0 + \hat{\beta}_1 IQ_i + z_i$$ where $z_i$ is a random value drawn from a normal distribution with mean zero and variance equal to the residual variance of the regression.
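
The effect of the three single-imputation methods on variance can be checked with a small simulation. This is a sketch on synthetic data, not the lecture's dataset: the variables `iq` and `jp` mirror the symbols in the equations above, and the coefficients, noise level, and the MAR rule `iq < 95` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
iq = rng.normal(100, 15, size=n)
jp = 2.0 + 0.1 * iq + rng.normal(0, 2, size=n)   # synthetic "true" values

missing = iq < 95          # MAR: missingness depends only on the observed iq
obs = ~missing

# Fit the regression on complete cases (polyfit returns slope, then intercept).
b1, b0 = np.polyfit(iq[obs], jp[obs], deg=1)
resid = jp[obs] - (b0 + b1 * iq[obs])
resid_sd = resid.std(ddof=2)   # two estimated coefficients

# Mean imputation: every gap gets the same value.
mean_imp = jp.copy()
mean_imp[missing] = jp[obs].mean()

# Regression imputation: every gap gets a point on the fitted line.
reg_imp = jp.copy()
reg_imp[missing] = b0 + b1 * iq[missing]

# Stochastic regression imputation: fitted value plus a random residual z_i.
stoch_imp = jp.copy()
z = rng.normal(0, resid_sd, size=missing.sum())
stoch_imp[missing] = b0 + b1 * iq[missing] + z

variances = {
    "true": jp.var(ddof=1),
    "mean": mean_imp.var(ddof=1),
    "regression": reg_imp.var(ddof=1),
    "stochastic": stoch_imp.var(ddof=1),
}
for name, value in variances.items():
    print(f"{name:>10}: {value:.2f}")
```

In this setup the variance after mean imputation is the smallest, regression imputation recovers part of it, and adding the residual term brings it closer to the variance of the true values.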

---

## 6. Handling Outliers 🎯

An **outlier** is a data point that deviates so much from other observations that it seems to have been generated by a different mechanism.

### 6.1. Why Do Outliers Matter?
Outliers aren't always errors; they can contain useful information about abnormal events like credit card fraud, system intrusion, or medical conditions. However, they can also negatively impact statistical analysis by:
* Increasing error variance and reducing the power of tests.
* Violating assumptions like normality.
* Biasing or distorting estimates like the **mean** and **standard deviation**.

### 6.2. Univariate Outlier Detection Methods
These methods are used to find outliers in a single variable.

* **The 3σ Edit Rule**: Assumes the data is normally distributed. Any data point that falls more than **three standard deviations** away from the mean is flagged as an outlier.
    * **Weakness**: The mean and standard deviation are themselves very sensitive to outliers, so this method can fail if outliers are present.

* **The Hampel Identifier (MAD Method)**: This is a more robust method that uses the **median** as the central point and the **Median Absolute Deviation (MAD)** as the measure of variation.
    * Since the median and MAD are less sensitive to extreme values, this method is better at detecting outliers without being skewed by them.

* **The Boxplot Rule (IQR Method)**: This is a very common and effective non-parametric method. It flags any data point as an outlier if it falls outside of the following range:
    * Lower Bound: $Q1 - 1.5 \times IQR$
    * Upper Bound: $Q3 + 1.5 \times IQR$
    * Where $Q1$ is the 25th percentile, $Q3$ is the 75th percentile, and $IQR = Q3 - Q1$.
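
The three univariate rules can be compared on a small made-up sample with one obvious extreme value. The threshold of 3 and the scaling constant 1.4826 used for the Hampel identifier are common conventions assumed here (the constant makes the MAD comparable to a standard deviation for normal data); they are not taken from the text above.

```python
import numpy as np

# Hypothetical sample: nine ordinary values and one extreme value (50.0).
x = np.array([10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 9.9, 10.3, 50.0])

# 3-sigma edit rule: distance from the mean in units of the standard deviation.
mean, sd = x.mean(), x.std(ddof=1)
three_sigma = np.abs(x - mean) > 3 * sd

# Hampel identifier: distance from the median in units of the scaled MAD.
median = np.median(x)
mad = np.median(np.abs(x - median))
hampel = np.abs(x - median) > 3 * 1.4826 * mad

# Boxplot rule: outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
boxplot = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("3-sigma flags:", x[three_sigma])
print("Hampel flags :", x[hampel])
print("Boxplot flags:", x[boxplot])
```

On this sample the 3σ rule flags nothing, because the extreme value inflates the mean and standard deviation that are used to judge it, while the Hampel identifier and the boxplot rule both flag 50.0. This is the weakness of the 3σ rule described above.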


### 6.3. Multivariate Outlier Detection Methods
These methods detect outliers in an n-dimensional space (i.e., by considering multiple variables at once).

* **Linear Models**: A regression model is fitted to the data. Points that have a large **residual** (i.e., are far from the fitted regression line or plane) are considered outliers.
* **Proximity-Based Models**: These methods are based on distance. Outliers are defined as points that are far from other points or located in sparse regions of the data space. This includes techniques like clustering and density-based approaches.
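
Both ideas can be sketched with NumPy on synthetic two-dimensional data containing one planted outlier. The robust residual threshold and the use of the distance to the third nearest neighbour are illustrative choices assumed for this example, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=50)
y[25] += 10.0   # planted outlier, far above the line

# Linear model: fit a line and look for unusually large residuals.
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)
centred = np.abs(resid - np.median(resid))
scale = 1.4826 * np.median(centred)        # robust estimate of residual spread
linear_outliers = centred > 3 * scale

# Proximity-based: distance from each point to its k-th nearest neighbour.
# (In practice the variables would usually be standardised first.)
points = np.column_stack([x, y])
diffs = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))
k = 3
kth_dist = np.sort(dist, axis=1)[:, k]     # column 0 is the distance to itself

print("flagged by residuals:", np.flatnonzero(linear_outliers))
print("most isolated point :", int(np.argmax(kth_dist)))
```

The planted point has both a large residual and the largest neighbour distance. On real data the two approaches can disagree, since a point can sit in a sparse region while still lying close to the fitted line.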

# Data Cleaning
Always make a backup of your main DataFrame before modifying it!  
You don't want to delete corrupt data only to find it could have been helpful.  

## Data audit
Some data quality metrics matter more to some companies than to others.  
E.g., Instagram won't care if a few entries are missing, but it would care about inaccurate data.

## Missing data
Know why the data is missing: Errors? Filter question?

Q1: Slide 31 - Which missing data mechanism does each of the 3 columns show?
* Column 1: Missing Completely At Random.
* Column 2: Missing At Random. Ratings are missing for the lower IQs (below 90), so missingness depends on the observed x value, not on the missing rating itself. This can be read directly from the observed data.
* Column 3: Missing Not At Random. The lower ratings (7 and 8) are the ones that are missing, so missingness depends on the y value itself. This can only be inferred indirectly, because the missing values are not observed.

Q2: How can you determine which imputation method is best for your project?  
It depends on the project, but simple statistics such as the mean or median are typically not as good as model-based methods such as regression or random forest imputation.

A good $R^2$ value for the assignment is 0.98 or 0.99.