# Data Cleaning 
## Automating Missing Values Handling with Python Functions

Missing data is one of the most common issues in datasets, and it can wreak havoc if not handled properly. Depending on your dataset and problem, you might choose to drop rows with missing values, fill them with default values, or use more advanced imputation techniques.

In [1]:
# Code example: Handling Missing Values

import pandas as pd
# Define a reusable function to handle missing values
def handle_missing_values(df, method='mean', fill_value=None):
    if method == 'drop':
        return df.dropna()
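    elif method == 'fill':
        # fillna() needs an explicit value; fail early with a clear message
        if fill_value is None:
            raise ValueError("fill_value is required when method='fill'")
        return df.fillna(fill_value)
    elif method == 'mean':
        # Work on a copy so the caller's DataFrame is not modified in place
        df = df.copy()
        numeric_cols = df.select_dtypes(include=['number']).columns
        df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean().round(2))
        return df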
    else:
        raise ValueError("Invalid method provided")

# Example dataset
data = {"Name": ['Joshua', 'Judith', None, 'Jude'],
        "Age": [25, None, 30, 22],
        "Salary": [500000, 60000, None, 450000]}
df = pd.DataFrame(data)

# Use the function to handle missing values by filling with the mean
cleaned_df = handle_missing_values(df, method='mean')
print(cleaned_df)

     Name    Age     Salary
0  Joshua  25.00  500000.00
1  Judith  25.67   60000.00
2    None  30.00  336666.67
3    Jude  22.00  450000.00
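
Note that the `mean` method only fills numeric columns, which is why the missing `Name` is still `None` in the output above. One possible way to cover both kinds of column is sketched below. The helper name `fill_by_dtype` and the `"Unknown"` placeholder are assumptions made for illustration, not part of the original function.

```python
import pandas as pd

def fill_by_dtype(df, text_placeholder="Unknown"):
    """Hypothetical helper: mean for numeric columns, a placeholder for the rest."""
    df = df.copy()  # leave the caller's DataFrame untouched
    numeric_cols = df.select_dtypes(include=["number"]).columns
    other_cols = df.columns.difference(numeric_cols)
    # Numeric columns: fill gaps with the column mean, rounded to 2 decimals
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean().round(2))
    # Everything else: fill gaps with a fixed placeholder (an assumed default)
    df[other_cols] = df[other_cols].fillna(text_placeholder)
    return df

# Same example data as above
data = {"Name": ["Joshua", "Judith", None, "Jude"],
        "Age": [25, None, 30, 22],
        "Salary": [500000, 60000, None, 450000]}
df = pd.DataFrame(data)

filled = fill_by_dtype(df)
print(filled)
```

Whether a placeholder such as `"Unknown"` is appropriate depends on the column. For identifiers like names, dropping the row may be the safer choice.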


## Removing Duplicates Efficiently

Duplicate rows are another common issue in datasets. While removing them seems straightforward, it can be tricky to ensure you're not accidentally deleting valuable data.

In [2]:
# Define a function to remove duplicates based on specific columns
def remove_duplicates(df, subset=None):
    return df.drop_duplicates(subset=subset)

# Example dataset with duplicates
data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 22],
        'Salary': [50000, 60000, 50000, 450000]}

df = pd.DataFrame(data)

# Remove duplicates based on the 'Name' column
cleaned_df = remove_duplicates(df, subset=['Name'])
print(cleaned_df)

    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000
3  David   22  450000


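In this example, we used the subset parameter to specify which columns to check for duplicates. Only the listed columns are compared, so a row is dropped whenever its Name matches an earlier row, even if Age or Salary differ. Leave subset as None to drop only rows that are identical in every column.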
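
To see how the choice of `subset` changes what gets removed, here is a minimal sketch using made-up data in which two different people share a name. The `people` DataFrame is hypothetical. `duplicated(keep=False)` is used to inspect every matching row before anything is dropped.

```python
import pandas as pd

# Hypothetical data: two different people who happen to share a name
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [25, 30, 41],
    "Salary": [50000, 60000, 72000],
})

# Inspect first: keep=False flags every row that shares a Name with another row
name_matches = people[people.duplicated(subset=["Name"], keep=False)]

# Comparing all columns keeps both Alices, because Age and Salary differ
full_row_dedup = people.drop_duplicates()

# Comparing only Name drops the second Alice, even though she is a different person
name_only_dedup = people.drop_duplicates(subset=["Name"])

print(name_matches)
print(len(full_row_dedup), len(name_only_dedup))
```

Reviewing `name_matches` before calling `drop_duplicates` is a cheap way to check that the rows about to be removed really are duplicates.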

## Transforming Data Types in a Pipeline

When working with messy data, it's common to encounter incorrect data types, such as numerical values stored as strings. You can automate this transformation process and integrate it into a data pipeline.

In [3]:
# Define a function to transform data types

def transform_data_types(df, col_types):
    df = df.copy()  # Avoid modifying the caller's DataFrame in place
    for col, dtype in col_types.items():
        df[col] = df[col].astype(dtype)  # Transform each column to its specified type
    return df

# Example dataset with incorrect data types
data = {
    'Name': ['Victor', 'Josh', 'Jonathan'],
    'Age': ['23', '30', '22'],  # All values are valid for integer conversion
    'Salary': ['50000', '60000', '45000']  # All values are valid for float conversion
}

df = pd.DataFrame(data)

# Specify the correct data types
col_types = {'Age': 'int', 'Salary': 'float'}

# Apply the transformation
cleaned_df = transform_data_types(df, col_types)
print(cleaned_df)


       Name  Age   Salary
0    Victor   23  50000.0
1      Josh   30  60000.0
2  Jonathan   22  45000.0


In this example, we create a reusable function to transform the data types of specific columns, ensuring that numeric data is treated correctly for further analysis.
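
The cleaning steps in this section can be chained into a single pipeline with `DataFrame.pipe`. The sketch below is one possible arrangement, not a definitive implementation. The helpers `to_numeric_columns` and `fill_numeric_with_mean` and the `raw` data are assumptions made for illustration. The type conversion uses `pd.to_numeric(errors="coerce")` instead of `astype`, so that unparseable strings become `NaN` rather than raising an error.

```python
import pandas as pd

def to_numeric_columns(df, cols):
    """Hypothetical helper: convert columns to numbers; bad strings become NaN."""
    df = df.copy()
    for col in cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

def remove_duplicates(df, subset=None):
    return df.drop_duplicates(subset=subset)

def fill_numeric_with_mean(df):
    """Hypothetical helper: fill gaps in numeric columns with the column mean."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include=["number"]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean().round(2))
    return df

# Hypothetical messy input: numbers stored as strings, one unparseable value,
# and one fully duplicated row
raw = pd.DataFrame({
    "Name": ["Victor", "Josh", "Josh", "Jonathan"],
    "Age": ["23", "30", "30", "n/a"],
    "Salary": ["50000", "60000", "60000", "45000"],
})

# Each step receives the output of the previous one
cleaned = (
    raw
    .pipe(to_numeric_columns, cols=["Age", "Salary"])
    .pipe(remove_duplicates)
    .pipe(fill_numeric_with_mean)
)
print(cleaned)
```

The order of the steps matters here. Converting types first turns `"n/a"` into `NaN`, removing the duplicated Josh row next keeps it from being counted twice in the mean, and the final step fills the missing Age with the mean of the remaining ages (26.5).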