In [3]:
import pandas as pd
import numpy as np

# Data Cleaning Techniques with Pandas and NumPy

## Basic Data Cleaning Operations

### Handling Missing Values

Pandas offers several methods to deal with missing data, such as removal and imputation.

Reference: [Working with missing data - User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data)

#### Fill missing values

[pandas.DataFrame.fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) fills NaN values with a specified value.

[pandas.DataFrame.ffill](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ffill.html) fills NaN values by propagating the last valid observation forward to the next valid one.

[pandas.DataFrame.bfill](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.bfill.html) fills NaN values by using the next valid observation to fill the gap.

In [34]:
# Example dataset
data = {
    'A': [1, np.nan, 3, 4, 5],
    'B': [6, 7, 8, np.nan, 10],
    'C': [11, 12, np.nan, np.nan, 15]
}
df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1.0,6.0,11.0
1,,7.0,12.0
2,3.0,8.0,
3,4.0,,
4,5.0,10.0,15.0


In [35]:
# Replace all NaN elements with 0s.
df_fillna_zeros = df.fillna(value=0)
df_fillna_zeros

Unnamed: 0,A,B,C
0,1.0,6.0,11.0
1,0.0,7.0,12.0
2,3.0,8.0,0.0
3,4.0,0.0,0.0
4,5.0,10.0,15.0


In [36]:
# Replace all NaN in column A with 0; in column B with its mean; in column C with its median
values = {
    'A':0,
    'B': df['B'].mean(),
    'C': df['C'].median()
}
df_fillna_dict_values = df.fillna(value=values)
df_fillna_dict_values

Unnamed: 0,A,B,C
0,1.0,6.0,11.0
1,0.0,7.0,12.0
2,3.0,8.0,12.0
3,4.0,7.75,12.0
4,5.0,10.0,15.0


In [37]:
# Fill NaN values by propagating the last valid observation forward
df_ffill = df.ffill()
df_ffill

Unnamed: 0,A,B,C
0,1.0,6.0,11.0
1,1.0,7.0,12.0
2,3.0,8.0,12.0
3,4.0,8.0,12.0
4,5.0,10.0,15.0
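
Backward fill is described above but not demonstrated. The following is a minimal sketch on the same example dataset; the name `df_bfill` is chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Same example dataset as in the fillna and ffill cells
df = pd.DataFrame({
    'A': [1, np.nan, 3, 4, 5],
    'B': [6, 7, 8, np.nan, 10],
    'C': [11, 12, np.nan, np.nan, 15],
})

# Fill each NaN with the next valid observation in the same column
df_bfill = df.bfill()
print(df_bfill)
```

Note that a NaN in the last row of a column would stay NaN after `bfill`, because there is no later observation to copy from; the same holds for `ffill` and the first row.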


#### Remove rows/columns with missing values

[pandas.DataFrame.dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) drops rows or columns that contain missing data.

The `how` parameter (`'any'` or `'all'`, default `'any'`) determines whether a row or column is removed when it has at least one NA value or only when all of its values are NA.

In [38]:
# Example dataset
data = {
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [6, 7, 8, np.nan, 10],
    'C': [11, 12, np.nan, np.nan, 15]
}
df = pd.DataFrame(data)
df

Unnamed: 0,A,B,C
0,1.0,6.0,11.0
1,,7.0,12.0
2,3.0,8.0,
3,,,
4,5.0,10.0,15.0


In [39]:
# Dropping rows with all missing values
df.dropna(how='all', inplace=True)
df

Unnamed: 0,A,B,C
0,1.0,6.0,11.0
1,,7.0,12.0
2,3.0,8.0,
4,5.0,10.0,15.0


In [40]:
# Dropping columns with any missing values
df.dropna(axis=1, inplace=True)
df

Unnamed: 0,B
0,6.0
1,7.0
2,8.0
4,10.0
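
The cells above modify `df` in place, so the two `how` settings are never compared on the same data. As a rough sketch, the comparison below rebuilds the example dataset and calls `dropna` without `inplace=True`; the names `rows_any` and `rows_all` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [6, 7, 8, np.nan, 10],
    'C': [11, 12, np.nan, np.nan, 15],
})

# how='any' (the default): drop a row if at least one value is missing
rows_any = df.dropna(how='any')

# how='all': drop a row only if every value is missing
rows_all = df.dropna(how='all')

print(rows_any.index.tolist())
print(rows_all.index.tolist())
```

Because neither call uses `inplace=True`, the original `df` keeps all five rows and both results can be inspected side by side.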


### Handle Duplicate data

Duplicate rows can be identified with Pandas's [duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method and removed with `drop_duplicates()`.

In [41]:
# Example dataset with duplicate rows
data = {
    'A': [1, 1, 2, 3, 4, 4],
    'B': ['a', 'a', 'b', 'c', 'd', 'd']
}
df = pd.DataFrame(data)
df

Unnamed: 0,A,B
0,1,a
1,1,a
2,2,b
3,3,c
4,4,d
5,4,d


In [42]:
# Identifying duplicate rows
duplicates = df.duplicated()
duplicates

0    False
1     True
2    False
3    False
4    False
5     True
dtype: bool

In [43]:
# Removing duplicate rows
df.drop_duplicates(inplace=True)
df

Unnamed: 0,A,B
0,1,a
2,2,b
3,3,c
4,4,d
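
Both `duplicated()` and `drop_duplicates()` keep the first occurrence by default. The sketch below shows the other two settings of the `keep` parameter on the same sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 3, 4, 4],
    'B': ['a', 'a', 'b', 'c', 'd', 'd'],
})

# keep='last': earlier copies are dropped, the final copy survives
kept_last = df.drop_duplicates(keep='last')

# keep=False: every row that has a duplicate is dropped
no_repeats = df.drop_duplicates(keep=False)

print(kept_last.index.tolist())
print(no_repeats.index.tolist())
```

`keep=False` can be handy when repeated rows are suspect and none of the copies should be trusted.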


### Replacing values

Replacing values is a common operation in data cleaning and preparation. Pandas provides the `replace()` method for this purpose.

#### Basic Replacement

If you want to replace all occurrences of a specific value in a DataFrame:

In [44]:
# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': ['one', 'two', 'three', 'four', 'five']
})

# Replacing value 5 with 50 in the entire DataFrame
df.replace(5, 50, inplace=True)
df

Unnamed: 0,A,B,C
0,1,50,one
1,2,6,two
2,3,7,three
3,4,8,four
4,50,9,five


#### Replacing Multiple Values

You can replace multiple values at once by passing a list of values and their replacements or by using a mapping dict.

In [45]:
# Using lists:
# df.replace([2,3],[20,30], inplace=True)

# Using mapping dict:
df.replace({2: 20, 3: 30}, inplace=True)
df

Unnamed: 0,A,B,C
0,1,50,one
1,20,6,two
2,30,7,three
3,4,8,four
4,50,9,five
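
The list form is commented out in the cell above. As a small check, assuming a fresh numeric DataFrame rather than the one modified so far, the two forms give the same result:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 6, 7, 8, 9]})

# Lists are matched by position: 2 -> 20, 3 -> 30
by_list = df.replace([2, 3], [20, 30])

# The mapping dict states the same pairs explicitly
by_dict = df.replace({2: 20, 3: 30})

print(by_list.equals(by_dict))
```

The dict form tends to be easier to read and harder to misalign when the number of replacements grows.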


#### Replacing Values in a Specific Column

If you want to replace values in a specific column, you can use the replace method on that column:

In [46]:
# Replacing values in column 'C'
df['C'] = df['C'].replace({'one': 'ONE', 'two': 'TWO'})
df


Unnamed: 0,A,B,C
0,1,50,ONE
1,20,6,TWO
2,30,7,three
3,4,8,four
4,50,9,five


#### Using Regular Expressions

The replace method can also work with regular expressions, which is very powerful for pattern-based replacement:

In [47]:
df.replace(to_replace=r'^f.*', value='STARTS WITH F', regex=True, inplace=True)
df

Unnamed: 0,A,B,C
0,1,50,ONE
1,20,6,TWO
2,30,7,three
3,4,8,STARTS WITH F
4,50,9,STARTS WITH F


The same approach replaces empty strings and whitespace-only strings with `np.nan`:

In [48]:
# introduce empty strings and strings with spaces
df.loc[3,'C'] = ''
df.loc[4,'C'] = '     '
df

Unnamed: 0,A,B,C
0,1,50,ONE
1,20,6,TWO
2,30,7,three
3,4,8,
4,50,9,


In [52]:
df.replace(to_replace=r'^\s*$', value=np.nan, regex=True, inplace=True)
df

Unnamed: 0,A,B,C
0,1,50,ONE
1,20,6,TWO
2,30,7,three
3,4,8,
4,50,9,
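
The output looks the same before and after the replacement because NaN is rendered as an empty cell here. One way to confirm that the values really became missing is to check with `isna()`. The sketch below uses a standalone Series (`s` and `cleaned` are illustrative names) with the same values as column 'C'.

```python
import numpy as np
import pandas as pd

s = pd.Series(['ONE', 'TWO', 'three', '', '     '])

# ^\s*$ matches strings that are empty or contain only whitespace
cleaned = s.replace(to_replace=r'^\s*$', value=np.nan, regex=True)

print(cleaned.isna().tolist())
```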


## Examples

### Handling Duplicate Customer Profiles by Email

In this example we will remove duplicate customer profiles based on their email, regardless of differences in their customer IDs or names. This method is especially useful in situations where the email address is a unique identifier for customer profiles.

In [66]:
# Sample DataFrame with customer profiles
data = {
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Ivan Petrov', 'Maria Ivanova', 'Georgi Dimitrov', 'Ivan Georgiev', 'Maria Ivanova'],
    'Email': ['ivan_petrov@example.com', 'maria_ivanova@example.com', 'georgi_dimitrov@example.com', 'ivan_georgiev@example.com', 'maria_ivanova@example.com']
}

df = pd.DataFrame(data)
print('Customer profiles DataFrame:')
print(df)

# Identifying duplicate rows based on the 'Email' column
duplicates_by_email = df.duplicated(subset=['Email'])

print("\nDuplicate Rows by Email (excluding first occurrence):")
print(duplicates_by_email)

# Removing duplicate rows based on 'Email', keeping the first occurrence (default)
df_cleaned = df.drop_duplicates(subset=['Email'])

print("\nDataFrame after removing duplicates by Email:")
df_cleaned

Customer profiles DataFrame:
   CustomerID             Name                        Email
0         101      Ivan Petrov      ivan_petrov@example.com
1         102    Maria Ivanova    maria_ivanova@example.com
2         103  Georgi Dimitrov  georgi_dimitrov@example.com
3         104    Ivan Georgiev    ivan_georgiev@example.com
4         105    Maria Ivanova    maria_ivanova@example.com

Duplicate Rows by Email (excluding first occurrence):
0    False
1    False
2    False
3    False
4     True
dtype: bool

DataFrame after removing duplicates by Email:


Unnamed: 0,CustomerID,Name,Email
0,101,Ivan Petrov,ivan_petrov@example.com
1,102,Maria Ivanova,maria_ivanova@example.com
2,103,Georgi Dimitrov,georgi_dimitrov@example.com
3,104,Ivan Georgiev,ivan_georgiev@example.com


### Handling Missing Values in Patient Records

In this example, we'll work with a dataset of patient records where some entries are missing the age or address. We'll see how to identify these missing entries and several strategies for handling them using Pandas.

In [73]:
# Sample DataFrame with patient records
data = {
    'PatientID': [1, 2, 3, 4, 5],
    'Name': ['Ivan Ivanov', 'Maria Popova', 'Georgi Georgiev', 'Sofia Petrova', 'Nikolai Nikolov'],
    'Age': [30, np.nan, 45, np.nan, 50],
    'Address': ['1000 Sofia', np.nan, '1500 Plovdiv', '1300 Varna', np.nan]
}

df = pd.DataFrame(data)
df

Unnamed: 0,PatientID,Name,Age,Address
0,1,Ivan Ivanov,30.0,1000 Sofia
1,2,Maria Popova,,
2,3,Georgi Georgiev,45.0,1500 Plovdiv
3,4,Sofia Petrova,,1300 Varna
4,5,Nikolai Nikolov,50.0,


#### Identifying Missing Values

In [68]:
# Identify rows with missing 'Age' or 'Address'
missing_age_or_address = df[df['Age'].isna() | df['Address'].isna()]
missing_age_or_address

Unnamed: 0,PatientID,Name,Age,Address
1,2,Maria Popova,,
3,4,Sofia Petrova,,1300 Varna
4,5,Nikolai Nikolov,50.0,
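
Besides listing the affected rows, it is often useful to count the gaps. A minimal sketch, assuming the same patient data (the 'Name' column is omitted for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PatientID': [1, 2, 3, 4, 5],
    'Age': [30, np.nan, 45, np.nan, 50],
    'Address': ['1000 Sofia', np.nan, '1500 Plovdiv', '1300 Varna', np.nan],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Number of rows with at least one missing value
rows_with_missing = int(df.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```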


#### Filling Missing Values with a Default Value


In [70]:
# Filling missing 'Age' with the median age and 'Address' with an 'Address Unknown' placeholder
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Address'] = df['Address'].fillna('Address Unknown')
df

Unnamed: 0,PatientID,Name,Age,Address
0,1,Ivan Ivanov,30.0,1000 Sofia
1,2,Maria Popova,45.0,Address Unknown
2,3,Georgi Georgiev,45.0,1500 Plovdiv
3,4,Sofia Petrova,45.0,1300 Varna
4,5,Nikolai Nikolov,50.0,Address Unknown


#### Dropping Rows with Missing Values

If missing data cannot be accurately imputed or filled, it might be best to exclude those records from analysis:

In [74]:
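# Dropping rows where either 'Age' or 'Address' is missing.
# Run this on the original DataFrame, before the fillna step above;
# once the gaps are filled, there is nothing left to drop.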
df_dropped = df.dropna(subset=['Age', 'Address'])
df_dropped

Unnamed: 0,PatientID,Name,Age,Address
0,1,Ivan Ivanov,30.0,1000 Sofia
2,3,Georgi Georgiev,45.0,1500 Plovdiv


### Handling Inconsistent Formats

In this example, we'll address a common data cleaning issue where dates are inconsistently entered in a dataset, with some records using the DD/MM/YYYY format and others using the MM/DD/YYYY format.

To solve the problem, we'll define a `correct_date_format` function that first attempts to parse each date string using the DD/MM/YYYY format. If that fails with a ValueError (indicating the date is probably in MM/DD/YYYY format), it tries the MM/DD/YYYY format instead. Once a format succeeds, the function converts the date to the standardized YYYY-MM-DD format. The `apply` method then applies this function to each value in the 'Date' column, creating a new 'Corrected Date' column with the corrected dates.

Keep in mind that this solution assumes all dates are valid and does not resolve ambiguous cases: a string such as 01/02/2023 parses successfully as DD/MM/YYYY (February 1st), so it is never tried as MM/DD/YYYY (January 2nd). In real-world scenarios, additional context or data validation might be necessary to accurately distinguish between formats for such cases.

In [76]:
# Sample data with inconsistent date formats
data = {
    'Event': ['Concert', 'Conference', 'Meeting', 'Workshop', 'Seminar'],
    'Date': ['12/05/2024', '05/15/2024', '23/06/2024', '07/20/2024', '10/11/2024']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Event,Date
0,Concert,12/05/2024
1,Conference,05/15/2024
2,Meeting,23/06/2024
3,Workshop,07/20/2024
4,Seminar,10/11/2024


In [77]:
from datetime import datetime

# Function to correct date formats
def correct_date_format(date_str):
    try:
        # Try parsing the date assuming DD/MM/YYYY format
        return datetime.strptime(date_str, "%d/%m/%Y").strftime("%Y-%m-%d")
    except ValueError:
        # If parsing fails, assume MM/DD/YYYY format
        return datetime.strptime(date_str, "%m/%d/%Y").strftime("%Y-%m-%d")

# Applying the function to correct date formats in the DataFrame
df['Corrected Date'] = df['Date'].apply(correct_date_format)
df


Unnamed: 0,Event,Date,Corrected Date
0,Concert,12/05/2024,2024-05-12
1,Conference,05/15/2024,2024-05-15
2,Meeting,23/06/2024,2024-06-23
3,Workshop,07/20/2024,2024-07-20
4,Seminar,10/11/2024,2024-11-10
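
Rows 0 and 4 in the output above are valid under both formats and were read as DD/MM/YYYY simply because that format is tried first. One possible way to surface such cases for manual review is sketched below; `is_ambiguous` is a hypothetical helper, not part of the original solution.

```python
from datetime import datetime

def is_ambiguous(date_str):
    """Return True if the string parses under both formats to different dates."""
    parsed = []
    for fmt in ("%d/%m/%Y", "%m/%d/%Y"):
        try:
            parsed.append(datetime.strptime(date_str, fmt))
        except ValueError:
            # This format does not fit the string; try the next one
            pass
    return len(parsed) == 2 and parsed[0] != parsed[1]

dates = ['12/05/2024', '05/15/2024', '23/06/2024', '07/20/2024', '10/11/2024']
flags = [is_ambiguous(d) for d in dates]
print(flags)
```

With a DataFrame, the same helper could be applied as `df['Date'].apply(is_ambiguous)` to add a flag column next to the corrected dates.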


### Handling Incorrect Values

Suppose a dataset that tracks employee information for a company contains incorrect values. In this case, the 'Years of Experience' field contains some unrealistic values due to data entry errors, such as negative numbers or excessively high values for experience years.
We'll assume that a valid range for years of experience is between 0 and 50 years, and will replace any value outside this range with the median of the valid values.

In [4]:
import pandas as pd

# Sample data with incorrect values in the 'Years of Experience' field
data = {
    'Employee Name': ['Ivan Ivanov', 'Maria Popova', 'Georgi Georgiev', 'Sofia Petrova', 'Nikolai Nikolov'],
    'Position': ['Software Developer', 'Project Manager', 'Data Analyst', 'UX Designer', 'HR Specialist'],
    'Years of Experience': [5, -2, 25, 3, 150]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Employee Name,Position,Years of Experience
0,Ivan Ivanov,Software Developer,5
1,Maria Popova,Project Manager,-2
2,Georgi Georgiev,Data Analyst,25
3,Sofia Petrova,UX Designer,3
4,Nikolai Nikolov,HR Specialist,150


In [11]:
# Calculate the median of valid values in 'Years of Experience' (0 to 50 inclusive):
valid_years_mask = (df['Years of Experience'] >= 0) & (df['Years of Experience'] <= 50)
valid_years_median = np.median(df.loc[valid_years_mask, 'Years of Experience']).astype(int)
print(f'valid_years_median = {valid_years_median}')

# Correcting incorrect 'Years of Experience' values
df['Corrected Years of Experience'] = (
    df['Years of Experience']
    .apply(lambda x: x if 0 <= x <= 50 else valid_years_median)
)
df
df

valid_years_median = 5


Unnamed: 0,Employee Name,Position,Years of Experience,Corrected Years of Experience
0,Ivan Ivanov,Software Developer,5,5
1,Maria Popova,Project Manager,-2,5
2,Georgi Georgiev,Data Analyst,25,25
3,Sofia Petrova,UX Designer,3,3
4,Nikolai Nikolov,HR Specialist,150,5
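
The same correction can also be written without `apply`. The following is a sketch of a vectorized alternative using `Series.between` and `Series.where`, assuming the 0 to 50 range stated above is inclusive; the variable names are illustrative.

```python
import pandas as pd

years = pd.Series([5, -2, 25, 3, 150], name='Years of Experience')

# between() is inclusive on both ends by default
valid = years.between(0, 50)

# Median of the valid values only, truncated to an integer
median_valid = int(years[valid].median())

# where() keeps values where the mask is True and substitutes elsewhere
corrected = years.where(valid, median_valid)
print(corrected.tolist())
```

On large datasets this form usually runs faster than a Python-level lambda, and the validity mask can be reused for reporting how many values were corrected.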
