# Data cleaning

In [18]:
import pandas as pd
import numpy as np

## Handling Missing Values

pandas treats `None` and `NaN` as missing values. Real-world data often marks missing entries with other placeholders, such as the strings `null`, `na`, `NA`, or the empty string `''`, and pandas does not recognize these as missing on its own.

To handle these representations uniformly, use the `replace()` method to convert them to `NaN`.

After that, you can use `isnull()`, `dropna()`, and `fillna()` as usual.

In [19]:
# Create a sample DataFrame with different representations of missing values
df = pd.DataFrame(
    {
        "A": [1, 2, None, 4, "null"],
        "B": [5, 2, 3, 4, "na"],
        "C": [1, "NA", None, 4, ""],
    }
)

# Replace different representations of missing values with np.nan
df.replace(["null", "na", "NA", ""], value=np.nan, inplace=True)

# Now you can use isnull(), dropna(), and fillna() as usual
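
As a quick check of the replacement step, the following sketch counts the missing values per column before and after converting the placeholders. The names `raw` and `cleaned` are illustrative and not part of the notebook; the sketch returns a new DataFrame instead of using `inplace=True` only so that both versions can be compared.

```python
import numpy as np
import pandas as pd

# Same sample data as above; `raw` and `cleaned` are illustrative names
raw = pd.DataFrame(
    {
        "A": [1, 2, None, 4, "null"],
        "B": [5, 2, 3, 4, "na"],
        "C": [1, "NA", None, 4, ""],
    }
)

# Before the replacement, only the real None entries are counted as missing
missing_before = raw.isnull().sum()

# Convert the string placeholders to NaN, returning a new DataFrame
cleaned = raw.replace(["null", "na", "NA", ""], np.nan)

# After the replacement, the former placeholders are counted as well
missing_after = cleaned.isnull().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

The counts should rise in every column that held a string placeholder, which confirms that pandas now sees those entries as missing.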

#### Identifying Missing Values with `isnull()`

The `isnull()` function helps identify missing values in a DataFrame. It returns a DataFrame of the same shape with `True` for missing values and `False` for non-missing values.

In [4]:
# Create a sample DataFrame with missing values
df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Identify missing values
df.isnull()

,A,B,C
0,False,True,False
1,False,False,True
2,True,False,True
3,False,False,False
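
The boolean result of `isnull()` is often summarized rather than read cell by cell. The sketch below, which reuses the same sample data, counts missing values per column and selects the rows that contain at least one missing value. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Number of missing values in each column
missing_per_column = df.isnull().sum()

# Boolean mask marking rows that contain at least one missing value
rows_with_missing = df.isnull().any(axis=1)

# Keep only the rows flagged by the mask
incomplete_rows = df[rows_with_missing]

print(missing_per_column.to_dict())
print(incomplete_rows.index.tolist())
```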


#### Dropping Rows with Missing Values using `dropna()`

The `dropna()` function removes rows or columns with missing values. By default, it drops rows with any missing values.

In [5]:
# Drop rows with any missing values
df.dropna()

,A,B,C
3,4.0,4.0,4.0
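
Dropping every row that has any missing value can discard a lot of data. As a sketch of some less aggressive options, the example below uses the `how`, `subset`, and `thresh` parameters of `dropna()` on the same sample data. The variable names are illustrative, and which option fits depends on the dataset.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Drop a row only if every value in it is missing (no such row here)
all_missing_dropped = df.dropna(how="all")

# Drop rows that have a missing value in column 'A' only
a_required = df.dropna(subset=["A"])

# Keep rows that have at least two non-missing values
at_least_two = df.dropna(thresh=2)

print(len(all_missing_dropped))
print(a_required.index.tolist())
print(at_least_two.index.tolist())
```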


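You can also drop columns with missing values by setting the `axis` parameter to 1. In this example every column contains at least one missing value, so all columns are dropped and only the index remains.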

In [6]:
df.dropna(axis=1)

0
1
2
3
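
To see `axis=1` keep a column, the sketch below adds a hypothetical column `D` with no missing values to the same sample data. The column `D` and the variable names are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Every original column has a missing value, so no columns survive
no_columns = df.dropna(axis=1)

# Hypothetical extra column 'D' with no missing values
df_with_d = df.assign(D=[10, 20, 30, 40])
only_d = df_with_d.dropna(axis=1)

print(no_columns.shape)
print(only_d.columns.tolist())
```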


#### Filling Missing Values using `fillna()`

The `fillna()` function replaces missing values with a specified value.

In [7]:
# Fill missing values with 0
df.fillna(0)

,A,B,C
0,1.0,0.0,1.0
1,2.0,2.0,0.0
2,0.0,3.0,0.0
3,4.0,4.0,4.0


You can also fill missing values with different values for each column by passing a dictionary.

In [8]:
# Fill missing values with different values for each column
df.fillna({"A": 0, "B": 1, "C": 2})

,A,B,C
0,1.0,1.0,1.0
1,2.0,2.0,2.0
2,0.0,3.0,2.0
3,4.0,4.0,4.0
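
Instead of hand-picked constants, the per-column fill values can be computed from the data. The sketch below fills each column with its own mean, which is one common choice and not necessarily the right one for every dataset. It relies on `fillna()` accepting a Series whose labels match the column names; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Column means, computed while skipping missing values (the pandas default)
column_means = df.mean()

# Each column is filled with the value stored under its own label
filled = df.fillna(column_means)

print(filled)
```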


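## Removing Duplicates

#### Removing Duplicate Rows using `drop_duplicates()`

The `drop_duplicates()` function removes duplicate rows. By default, it compares all columns and keeps the first occurrence of each row.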

In [9]:
# Create a sample DataFrame with duplicate rows
df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# Remove duplicate rows
df_duplicates.drop_duplicates()

,A,B,C
0,1,1,1
1,2,2,2
3,4,4,4
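
Before removing anything, it can help to inspect which rows pandas considers duplicates. The sketch below uses `duplicated()` to flag them and then `keep="last"` to retain the last occurrence instead of the first. The variable names are illustrative.

```python
import pandas as pd

df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# Boolean mask: True for rows that repeat an earlier row
is_duplicate = df_duplicates.duplicated()

# Keep the last occurrence of each duplicated row instead of the first
keep_last = df_duplicates.drop_duplicates(keep="last")

print(is_duplicate.tolist())
print(keep_last.index.tolist())
```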


You can also specify which columns to consider for identifying duplicates.

With `subset=["A"]`, pandas compares only column 'A': it keeps the first occurrence of each unique value in that column and removes later rows that repeat it. In the example below, row 1 is kept because it is the first occurrence of the value 2 in column 'A', and row 2 is removed because it repeats that value.

In [11]:
# Create a sample DataFrame with duplicate rows
df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# Remove duplicates based on column 'A'
df_duplicates.drop_duplicates(subset=["A"])

,A,B,C
0,1,1,1
1,2,2,2
3,4,4,4
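
In the example above, the duplicated rows match in every column, so `subset=["A"]` returns the same rows as a plain `drop_duplicates()`. The sketch below uses hypothetical data, where two rows share a value in 'A' but differ in 'B', to show a case where the two calls give different results. The DataFrame `df_partial` and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 share the same 'A' but differ in 'B'
df_partial = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 3, 4]})

# Comparing all columns: no two rows are identical, so nothing is removed
all_columns = df_partial.drop_duplicates()

# Comparing only column 'A': row 2 repeats the value 2 and is removed
by_a = df_partial.drop_duplicates(subset=["A"])

print(all_columns.index.tolist())
print(by_a.index.tolist())
```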
