# Data Cleaning:

Data cleaning (or data cleansing) is the process of identifying, correcting, and removing errors, inconsistencies, and inaccuracies in data to ensure its quality and usability for analysis.
It is a crucial step in data preprocessing to prepare raw data for further analysis or machine learning tasks.

- Improves Accuracy: Clean data ensures that your analysis or models are based on accurate and reliable information.
- Enhances Consistency: Consistent data across different sources or within a dataset is critical for meaningful comparisons and insights.
- Reduces Errors: Cleaning data helps to minimize errors that could lead to incorrect conclusions or predictions.

## Common Steps in Data Cleaning:

##### 1. Handling Missing Data:
- Identify Missing Values: Detect missing or null values in the dataset.
- Imputation: Fill missing values using methods like mean, median, mode, or more advanced techniques like K-Nearest Neighbors (KNN) imputation.
- Removal: If a large portion of a row or column has missing data, you might decide to remove it entirely.
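
As a minimal sketch of the three bullets above, assuming a small made-up DataFrame (the `height` and `city` columns are hypothetical and are not part of the dataset used later), missing values can be counted, imputed, or dropped with Pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "city": ["Paris", "Lima", None, "Lima"],
})

# Identify: count missing values per column.
missing_counts = df.isna().sum()

# Impute: median for the numeric column, mode for the categorical one.
df["height"] = df["height"].fillna(df["height"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Removal alternative: drop any row that still has a missing value.
complete_rows = df.dropna()
print(missing_counts)
print(df)
```

Median and mode are only two of the options listed above; which one fits depends on the column and on how much data is missing.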

##### 2. Removing Duplicates:
- Identify Duplicates: Find duplicate records in the dataset.
- Remove Duplicates: Drop the duplicate rows to avoid skewed analysis results.
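
A short sketch of both bullets, assuming a hypothetical table in which one row is repeated exactly:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the second.
df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["Ann", "Ben", "Ben", "Cy"]})

# Identify: True marks a row that repeats an earlier row.
dup_mask = df.duplicated()

# Remove: keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
print(dup_mask.tolist())
print(deduped)
```

By default both calls compare all columns; the `subset` parameter restricts the comparison to chosen columns when only some of them define a duplicate.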

##### 3. Correcting Inconsistencies:
- Standardizing Formats: Ensure that data formats (e.g., date formats, text case) are consistent across the dataset.
- Unifying Categories: Merge similar but differently labeled categories (e.g., "USA," "United States," "US" should be standardized to one label).
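
One possible way to unify labels is sketched below. The `country_map` dictionary is a hypothetical lookup written for this example, so a real dataset would need its own mapping:

```python
import pandas as pd

# Hypothetical toy data with stray spaces and mixed case.
df = pd.DataFrame({"country": [" usa", "United States", "US", "canada "]})

# Hypothetical mapping from normalized spellings to one canonical label.
country_map = {"usa": "USA", "united states": "USA", "us": "USA", "canada": "Canada"}

# Standardize the format first (trim whitespace, lower-case), then map.
normalized = df["country"].str.strip().str.lower()

# Labels that are not in the mapping keep their original value.
df["country"] = normalized.map(country_map).fillna(df["country"])
print(df)
```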

##### 4. Outlier Detection and Treatment:
- Identify Outliers: Detect data points that significantly differ from the rest of the dataset using statistical methods like Z-scores or the IQR method.
- Treat Outliers: Depending on the context, you can either remove outliers, cap them at a threshold, or analyze them separately.
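
The IQR method mentioned above can be sketched as follows, assuming the common 1.5 × IQR fences (the multiplier is a convention, not a fixed rule) and a made-up series of values:

```python
import pandas as pd

# Hypothetical toy data with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Identify: points outside the fences.
outliers = values[(values < lower) | (values > upper)]

# Treat (one option): cap values at the fences instead of removing them.
capped = values.clip(lower=lower, upper=upper)
print(outliers.tolist())
print(capped.tolist())
```

Whether to cap, remove, or keep such points depends on whether they are errors or genuine extreme observations.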

##### 5. Handling Incorrect Data:
- Identify Errors: Detect data entry errors, typos, or logically inconsistent values (e.g., a negative age).
- Correct Errors: Fix or remove incorrect data points based on domain knowledge or by referencing authoritative sources.
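
A rule-based check is one simple way to flag logically inconsistent values. The sketch below assumes that ages outside 0 to 120 are errors; that range is an illustrative assumption, not a universal rule:

```python
import pandas as pd

# Hypothetical toy data with a negative and an implausibly large age.
df = pd.DataFrame({"age": [34, -5, 27, 210]})

# Identify: ages outside the assumed valid range.
valid = df["age"].between(0, 120)

# Treat: mark invalid entries as missing so they can be imputed or reviewed later.
df["age"] = df["age"].where(valid)
print(df)
```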

##### 6. Converting Data Types:
- Data Type Consistency: Ensure that the data types (e.g., integers, floats, strings) are consistent and appropriate for each column.
- Conversion: Convert columns to the correct data types as needed.
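
As a small illustration, assuming hypothetical `price` and `joined` columns that were read in as text, the conversions might look like this:

```python
import pandas as pd

# Hypothetical toy data: numbers and dates stored as strings.
df = pd.DataFrame({
    "price": ["10.5", "7", "n/a"],
    "joined": ["2024-01-05", "2024-02-10", "2024-03-15"],
})

# errors="coerce" turns unparseable entries such as "n/a" into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# An explicit format avoids ambiguous day/month guessing.
df["joined"] = pd.to_datetime(df["joined"], format="%Y-%m-%d")
print(df.dtypes)
```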

##### 7. Data Normalization and Scaling:
- Normalization: Adjust the range of numerical data (e.g., scaling between 0 and 1) to ensure that different features are comparable.
- Standardization: Standardize features by removing the mean and scaling to unit variance.
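
Both transformations reduce to one line of arithmetic each. The sketch below applies them to a made-up NumPy array and uses the population standard deviation (NumPy's default, `ddof=0`):

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max normalization: rescale to the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean and divide by the standard deviation.
x_std = (x - x.mean()) / x.std()
print(x_minmax)
print(x_std)
```

In a modelling workflow, the minimum, maximum, mean, and standard deviation are usually computed on the training data only and then reused for new data.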

##### 8. Addressing Imbalanced Data:
- Balancing Classes: For classification tasks, check how the classes are distributed. If one class dominates, techniques like oversampling the minority class or undersampling the majority class can reduce the imbalance.
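
Random oversampling, one of the techniques named above, can be sketched with plain Pandas. The data and the `label` column are hypothetical, and dedicated libraries offer more refined methods:

```python
import pandas as pd

# Hypothetical toy data: 8 rows of class 0 and 2 rows of class 1.
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

counts = df["label"].value_counts()
minority_label = counts.idxmin()
minority = df[df["label"] == minority_label]

# Draw minority rows with replacement until both classes have the same size.
extra = minority.sample(n=counts.max() - counts.min(), replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)
print(balanced["label"].value_counts())
```

Resampling is normally applied to the training split only, so that evaluation data keeps its original class distribution.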


## Tools and Libraries for Data Cleaning:
#### Python Libraries: 
Pandas, NumPy, SciPy, and Scikit-learn offer powerful tools for data cleaning.
#### Data Cleaning Tools: 
OpenRefine, Trifacta, Talend, and DataWrangler are specialized tools designed for more extensive or complex data cleaning tasks.

## NumPy (import numpy as np):
NumPy is a library for numerical computing in Python. It provides support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

- `np` is the alias conventionally used to refer to NumPy functions and objects.

You can perform operations like array creation, mathematical calculations, linear algebra, random number generation, and more using NumPy.
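
A brief sketch of those operations, using arbitrary example values:

```python
import numpy as np

# Array creation: a 2x2 matrix.
a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Mathematical calculation: mean of each column.
col_means = a.mean(axis=0)

# Linear algebra: determinant of the matrix.
det = np.linalg.det(a)

# Random number generation with a fixed seed for reproducibility.
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=3)
print(col_means, det, sample.shape)
```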

## Pandas (import pandas as pd):

Pandas is a library for data manipulation and analysis in Python. It provides data structures like Series (1D) and DataFrame (2D) for handling and analyzing structured data.

- `pd` is the alias conventionally used to refer to Pandas functions and objects.

You can use Pandas to load data from various file formats (e.g., CSV, Excel), clean and transform data, perform data analysis, and visualize data.
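
As a small illustration, the sketch below reads CSV text from an in-memory string (a stand-in for a real file, with made-up contents) into a DataFrame and pulls out one column as a Series:

```python
import io
import pandas as pd

# Hypothetical CSV content; a real workflow would pass a file path instead.
csv_text = "name,score\nAnn,90\nBen,75\n"

df = pd.read_csv(io.StringIO(csv_text))  # DataFrame (2D)
scores = df["score"]                     # Series (1D)
mean_score = scores.mean()
print(df)
print(mean_score)
```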

In [20]:
import numpy as np
import pandas as pd

In [21]:
messy_data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Grace'],
    'age': [25, 30, np.nan, 'thirty-five', 22, 45, 'unknown'],
    'gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Other', 'Male'],
    'country': ['USA', 'Canada', 'Mexico', 'USA', 'Australia', 'Unknown', 'UK'],
    'monthly_salary': [50000, 75000, '60k', np.nan, 40000, 'unknown', 80000],
    'annual_income': ['60,000', '90,000', np.nan, '100,000', '40k', '75k', 'unknown']
}

# Import messy data into a Pandas DataFrame
messy_df = pd.DataFrame(messy_data)

print(messy_data)
print()
print(messy_df)

{'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Grace'], 'age': [25, 30, nan, 'thirty-five', 22, 45, 'unknown'], 'gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Other', 'Male'], 'country': ['USA', 'Canada', 'Mexico', 'USA', 'Australia', 'Unknown', 'UK'], 'monthly_salary': [50000, 75000, '60k', nan, 40000, 'unknown', 80000], 'annual_income': ['60,000', '90,000', nan, '100,000', '40k', '75k', 'unknown']}

      name          age  gender    country monthly_salary annual_income
0    Alice           25  Female        USA          50000        60,000
1      Bob           30    Male     Canada          75000        90,000
2  Charlie          NaN    Male     Mexico            60k           NaN
3    David  thirty-five    Male        USA            NaN       100,000
4    Emily           22  Female  Australia          40000           40k
5    Frank           45   Other    Unknown        unknown           75k
6    Grace      unknown    Male         UK          80000       unknown

In [22]:
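# Handling non-numeric data types and missing values:
# Map the spelled-out age to a number; anything else unparseable (e.g. 'unknown') becomes NaN
messy_df['age'] = pd.to_numeric(messy_df['age'].replace('thirty-five', 35), errors='coerce')
# Assign the result back rather than calling fillna(inplace=True) on a column selection
messy_df['age'] = messy_df['age'].fillna(messy_df['age'].median())
messy_df.fillna({'country': 'Unknown', 'gender': 'Other'}, inplace=True)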
print(messy_df)

      name   age  gender    country monthly_salary annual_income
0    Alice  25.0  Female        USA          50000        60,000
1      Bob  30.0    Male     Canada          75000        90,000
2  Charlie  30.0    Male     Mexico            60k           NaN
3    David  35.0    Male        USA            NaN       100,000
4    Emily  22.0  Female  Australia          40000           40k
5    Frank  45.0   Other    Unknown        unknown           75k
6    Grace  30.0    Male         UK          80000       unknown


In [23]:
# Handling inconsistent annual income format:
# Strip thousands separators, then scale only the values written with a 'k' suffix
income = messy_df['annual_income'].str.replace(',', '', regex=False)
is_thousands = income.str.endswith('k', na=False)
income = pd.to_numeric(income.str.rstrip('k'), errors='coerce')  # 'unknown' becomes NaN
messy_df['annual_income'] = income.where(~is_thousands, income * 1000)
print(messy_df)

      name   age  gender    country monthly_salary  annual_income
0    Alice  25.0  Female        USA          50000        60000.0
1      Bob  30.0    Male     Canada          75000        90000.0
2  Charlie  30.0    Male     Mexico            60k            NaN
3    David  35.0    Male        USA            NaN       100000.0
4    Emily  22.0  Female  Australia          40000        40000.0
5    Frank  45.0   Other    Unknown        unknown        75000.0
6    Grace  30.0    Male         UK          80000            NaN


In [24]:
# Remove dependent column
messy_df.drop('monthly_salary', axis=1, inplace=True)

# Convert 'Other' to 'Non-binary' (assign back instead of using inplace on a column selection)
messy_df['gender'] = messy_df['gender'].replace('Other', 'Non-binary')
print(messy_df)

      name   age      gender    country  annual_income
0    Alice  25.0      Female        USA        60000.0
1      Bob  30.0        Male     Canada        90000.0
2  Charlie  30.0        Male     Mexico            NaN
3    David  35.0        Male        USA       100000.0
4    Emily  22.0      Female  Australia        40000.0
5    Frank  45.0  Non-binary    Unknown        75000.0
6    Grace  30.0        Male         UK            NaN


In [25]:
print("ORIGINAL DATA:")
print(messy_data)
print()
print("CLEANED DATA:")
print(messy_df)

ORIGINAL DATA:
{'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Grace'], 'age': [25, 30, nan, 'thirty-five', 22, 45, 'unknown'], 'gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Other', 'Male'], 'country': ['USA', 'Canada', 'Mexico', 'USA', 'Australia', 'Unknown', 'UK'], 'monthly_salary': [50000, 75000, '60k', nan, 40000, 'unknown', 80000], 'annual_income': ['60,000', '90,000', nan, '100,000', '40k', '75k', 'unknown']}

CLEANED DATA:
      name   age      gender    country  annual_income
0    Alice  25.0      Female        USA        60000.0
1      Bob  30.0        Male     Canada        90000.0
2  Charlie  30.0        Male     Mexico            NaN
3    David  35.0        Male        USA       100000.0
4    Emily  22.0      Female  Australia        40000.0
5    Frank  45.0  Non-binary    Unknown        75000.0
6    Grace  30.0        Male         UK            NaN


## Summary:

In this code, we start by loading a dictionary of messy data into a Pandas DataFrame. The data contains missing values (`np.nan`), non-numeric entries (e.g., `"thirty-five"` for age and `"60k"` for salary), and inconsistent formats (e.g., `"60,000"` vs. `"40k"` for income). We convert the `age` column to numeric values, replacing `"thirty-five"` with `35` by hand and coercing entries that cannot be parsed (such as `"unknown"`) to `NaN`, then impute the missing ages with the median age. Any missing values in the `country` and `gender` columns are filled with `'Unknown'` and `'Other'`, respectively.

Next, we clean up the `annual_income` column by removing non-numeric characters and converting the income values to a consistent numeric format (e.g., converting `'75k'` to `75000`); entries that cannot be parsed stay missing. The `monthly_salary` column is dropped since it may be redundant, and the `gender` column's `'Other'` entries are standardized to `'Non-binary'` for consistency. After these cleaning steps, the cleaned data is printed, showing a more uniform and accurate DataFrame compared to the original messy data. The code demonstrates how to handle missing values, convert inconsistent data formats, and standardize categorical variables in a dataset.