# Data cleaning

In [54]:
import pandas as pd
import numpy as np

1. Handling Missing Values
2. Removing Duplicates
3. Handling Outliers

## 1. Handling Missing Values

Pandas treats `None` and `NaN` as missing values. Placeholder strings such as `null`, `na`, `NA`, or the empty string `''` are ordinary values as far as pandas is concerned, so they are not detected as missing.

To handle these representations uniformly, use the `replace()` function to convert them to `NaN`.

After that, you can use `isnull()`, `dropna()`, and `fillna()` as usual.

In [55]:
# Create a sample DataFrame with different representations of missing values
df = pd.DataFrame(
    {
        "A": [1, 2, None, 4, "null"],
        "B": [5, 2, 3, 4, "na"],
        "C": [1, "NA", None, 4, ""],
    }
)

# Replace different representations of missing values with np.nan
df.replace(["null", "na", "NA", ""], value=np.nan, inplace=True)

# Now you can use isnull(), dropna(), and fillna() as usual

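For a column that is meant to be purely numeric, an alternative sketch (a different technique from the `replace()` call above, not a substitute for it) is `pd.to_numeric` with `errors="coerce"`, which turns any value that cannot be parsed as a number into `NaN`. The names `raw` and `numeric` are only for illustration.

```python
import pandas as pd

# Same values as column "A" above: numbers, None, and the placeholder string "null"
raw = pd.Series([1, 2, None, 4, "null"])

# errors="coerce" converts anything that cannot be parsed as a number to NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
print(numeric.isna().sum())  # number of missing values after conversion
```

This only suits numeric columns; for text columns, an explicit `replace()` of the known placeholders is likely the safer choice.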


#### Identifying Missing Values with `isnull()`

The `isnull()` function helps identify missing values in a DataFrame. It returns a DataFrame of the same shape with `True` for missing values and `False` for non-missing values.

In [56]:
# Create a sample DataFrame with missing values
df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Identify missing values
df.isnull()

|   | A | B | C |
|---|---|---|---|
| 0 | False | True | False |
| 1 | False | False | True |
| 2 | True | False | True |
| 3 | False | False | False |
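Because `True` counts as 1 when summed, the boolean result of `isnull()` can be aggregated. The sketch below, using the same sample data, counts missing values per column and selects the rows that contain at least one missing value; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Number of missing values in each column (True is summed as 1)
missing_per_column = df.isnull().sum()

# Boolean mask: True for rows that have at least one missing value
rows_with_missing = df.isnull().any(axis=1)

print(missing_per_column)
print(df[rows_with_missing])
```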


#### Dropping Rows with Missing Values using `dropna()`

The `dropna()` function removes rows or columns with missing values. By default, it drops rows with any missing values.

In [57]:
# Drop rows with any missing values
df.dropna()

|   | A | B | C |
|---|---|---|---|
| 3 | 4.0 | 4.0 | 4.0 |


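You can also drop columns with missing values by setting the `axis` parameter to 1. In this DataFrame every column contains at least one missing value, so all three columns are dropped and only the index (0 to 3) remains.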

In [58]:
df.dropna(axis=1)

0
1
2
3
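Dropping on "any missing value" is often too aggressive, as the empty result above suggests. The following sketch shows three other `dropna()` parameters (`how`, `thresh`, and `subset`) on the same sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Drop a row only if every value in it is missing (no such row here)
all_missing_dropped = df.dropna(how="all")

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# Only look at column "A" when deciding which rows to drop
a_present = df.dropna(subset=["A"])

print(at_least_two)
print(a_present)
```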


#### Filling Missing Values using `fillna()`

The `fillna()` function replaces missing values with a specified value.

In [59]:
# Fill missing values with 0
df.fillna(0)

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 |
| 1 | 2.0 | 2.0 | 0.0 |
| 2 | 0.0 | 3.0 | 0.0 |
| 3 | 4.0 | 4.0 | 4.0 |


You can also fill missing values with different values for each column by passing a dictionary.

In [60]:
# Fill missing values with different values for each column
df.fillna({"A": 0, "B": 1, "C": 2})

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 |
| 1 | 2.0 | 2.0 | 2.0 |
| 2 | 0.0 | 3.0 | 2.0 |
| 3 | 4.0 | 4.0 | 4.0 |
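The per-column fill values do not have to be constants. As a sketch, passing `df.mean()` fills each column with its own mean, and `ffill()` carries the last valid value forward; both are shown on the same sample data, with illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# df.mean() is a Series indexed by column name, so each column gets its own mean
filled_mean = df.fillna(df.mean())

# Forward fill: propagate the previous valid value down each column
filled_ffill = df.ffill()

print(filled_mean)
print(filled_ffill)
```

Note that forward fill cannot fill a missing value in the first row (column B here), because there is no earlier value to carry forward.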


#### Filling Missing Values using `interpolate()`

The `interpolate()` method in pandas can be used to fill missing values using various interpolation methods.

In [61]:
# DataFrame that has missing values:
df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4, 5],
        "B": [5, np.nan, np.nan, 8, 10],
        "C": [1, 2, 3, np.nan, 5],
    }
)

df

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1.0 |
| 1 | 2.0 | NaN | 2.0 |
| 2 | NaN | NaN | 3.0 |
| 3 | 4.0 | 8.0 | NaN |
| 4 | 5.0 | 10.0 | 5.0 |


##### Linear Interpolation

Fills missing values by interpolating between the values before and after the missing value.

By default, the axis is 0, which means interpolation is done **column-wise**. 
This means that for each column, missing values are filled in based on the values in the rows above and below.

For example, in column C:

- row 2: C = 3
- row 3: C = NaN => 4
- row 4: C = 5

In [62]:
# vertically interpolate missing values
df.interpolate(method="linear")

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1.0 |
| 1 | 2.0 | 6.0 | 2.0 |
| 2 | 3.0 | 7.0 | 3.0 |
| 3 | 4.0 | 8.0 | 4.0 |
| 4 | 5.0 | 10.0 | 5.0 |
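The filled values in column B (6.0 and 7.0) can be reproduced by hand. The sketch below uses NumPy's `np.interp` on the known points of column B, treating the row positions as the x-axis; it is an illustration of the arithmetic, not of how pandas is implemented internally.

```python
import numpy as np

# Known (row position, value) pairs in column B: rows 0, 3 and 4
known_rows = np.array([0, 3, 4])
known_values = np.array([5.0, 8.0, 10.0])

# Linearly interpolate the two missing rows (1 and 2) between rows 0 and 3
filled = np.interp([1, 2], known_rows, known_values)

print(filled)
```

Rows 1 and 2 lie one third and two thirds of the way from 5 to 8, which gives 6 and 7.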


In [63]:
# horizontally interpolate missing values
df.interpolate(method="linear", axis=1)

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1.0 |
| 1 | 2.0 | 2.0 | 2.0 |
| 2 | NaN | NaN | 3.0 |
| 3 | 4.0 | 8.0 | 8.0 |
| 4 | 5.0 | 10.0 | 5.0 |
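In the horizontal result, row 2 keeps its leading `NaN` values while the trailing `NaN` in row 3 is filled with the last valid value (8.0). This follows from the default fill direction of `interpolate()`, which is forward. The sketch below reproduces both rows as standalone Series (the names are illustrative) and shows the `limit_direction="both"` option, which also fills leading gaps.

```python
import numpy as np
import pandas as pd

leading = pd.Series([np.nan, np.nan, 3.0])  # same pattern as row 2
trailing = pd.Series([4.0, 8.0, np.nan])    # same pattern as row 3

# Default direction is forward: leading NaNs have no earlier value and stay NaN
forward_leading = leading.interpolate(method="linear")

# A trailing NaN is filled with the last valid value
forward_trailing = trailing.interpolate(method="linear")

# limit_direction="both" also fills leading NaNs, using the first valid value
both_leading = leading.interpolate(method="linear", limit_direction="both")

print(forward_leading.tolist())
print(forward_trailing.tolist())
print(both_leading.tolist())
```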


##### Polynomial Interpolation

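Polynomial interpolation is a method to estimate values between known data points. It does this by fitting a polynomial curve to the existing data. In pandas, `method="polynomial"` requires the `order` argument and relies on SciPy being installed.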

- Order 1 (linear): Straight line connection between points.
- Order 2 (quadratic): Curved line (parabola) fitting the points.
- Higher orders: More complex curves to fit the data.

In [64]:
df.interpolate(method="polynomial", order=2)

|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1.0 |
| 1 | 2.0 | 5.5 | 2.0 |
| 2 | 3.0 | 6.5 | 3.0 |
| 3 | 4.0 | 8.0 | 4.0 |
| 4 | 5.0 | 10.0 | 5.0 |
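The values 5.5 and 6.5 in column B can be checked by hand. Column B has exactly three known points, so a quadratic through them is unique. The sketch below fits that quadratic with `np.polyfit` and evaluates it at the missing rows; this is an independent check of the arithmetic under that assumption, not a description of the routine pandas calls internally.

```python
import numpy as np

# Known (row position, value) pairs in column B
known_rows = np.array([0.0, 3.0, 4.0])
known_values = np.array([5.0, 8.0, 10.0])

# Fit a degree-2 polynomial through the three points: 0.25*x**2 + 0.25*x + 5
coeffs = np.polyfit(known_rows, known_values, deg=2)

# Evaluate the polynomial at the missing rows 1 and 2
filled = np.polyval(coeffs, [1.0, 2.0])

print(coeffs)
print(filled)
```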


##### Time Interpolation

If your DataFrame has a datetime index, you can use time interpolation to fill missing values based on time.

While both `method='linear'` and `method='time'` interpolate missing values, there's a key difference:

- `method='linear'`: Ignores the index and treats the values as equally spaced.
- `method='time'`: Takes the datetime index into account and interpolates in proportion to the time elapsed between the known data points.

With the evenly spaced daily index used below, both methods give the same result.

In [65]:
# Create a sample DataFrame with a datetime index and missing values
df_time = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4, 5],
        "B": [5, np.nan, np.nan, 8, 10],
        "C": [1, 2, 3, np.nan, 5],
    },
    index=pd.date_range("20230101", periods=5),
)

df_time

|   | A | B | C |
|---|---|---|---|
| 2023-01-01 | 1.0 | 5.0 | 1.0 |
| 2023-01-02 | 2.0 | NaN | 2.0 |
| 2023-01-03 | NaN | NaN | 3.0 |
| 2023-01-04 | 4.0 | 8.0 | NaN |
| 2023-01-05 | 5.0 | 10.0 | 5.0 |


In [66]:
df_time.interpolate(method="time")

|   | A | B | C |
|---|---|---|---|
| 2023-01-01 | 1.0 | 5.0 | 1.0 |
| 2023-01-02 | 2.0 | 6.0 | 2.0 |
| 2023-01-03 | 3.0 | 7.0 | 3.0 |
| 2023-01-04 | 4.0 | 8.0 | 4.0 |
| 2023-01-05 | 5.0 | 10.0 | 5.0 |
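Because the dates above are evenly spaced, the result matches plain linear interpolation. The difference only shows up with an irregular index. The sketch below uses a hypothetical three-point Series with a gap between the second and third dates to contrast the two methods.

```python
import numpy as np
import pandas as pd

# Irregular dates: 1 day between the first two points, 3 days between the last two
s = pd.Series(
    [1.0, np.nan, 5.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"]),
)

# Ignores the dates: the missing value is placed halfway between 1 and 5
linear_filled = s.interpolate(method="linear")

# Uses the dates: Jan 2 is 1/4 of the way from Jan 1 to Jan 5
time_filled = s.interpolate(method="time")

print(linear_filled)
print(time_filled)
```

Here the linear method fills the gap with 3.0, while the time method fills it with 2.0.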


---

## 2. Removing Duplicates

#### Removing Duplicate Rows using `drop_duplicates()`

In [67]:
# Create a sample DataFrame with duplicate rows
df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# Remove duplicate rows
df_duplicates.drop_duplicates()

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 |
| 3 | 4 | 4 | 4 |


You can also specify which columns to consider for identifying duplicates.

With `subset=["A"]`, pandas keeps the first occurrence of each unique value in column 'A' and removes subsequent duplicates. Row 1 is kept because it is the first occurrence of the value 2 in column 'A', and row 2 is removed because it repeats that value.

In [68]:
# Create a sample DataFrame with duplicate rows
df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# Remove duplicates based on column 'A'
df_duplicates.drop_duplicates(subset=["A"])

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 |
| 3 | 4 | 4 | 4 |
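Before removing anything, it can help to inspect which rows are considered duplicates. The sketch below uses `duplicated()` to flag them and the `keep` parameter of `drop_duplicates()` to change which occurrence survives; the variable names are illustrative.

```python
import pandas as pd

df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# True for every row that repeats an earlier row
is_duplicate = df_duplicates.duplicated()

# Keep the last occurrence instead of the first
keep_last = df_duplicates.drop_duplicates(keep="last")

# keep=False drops every row that has a duplicate, including the first occurrence
drop_all = df_duplicates.drop_duplicates(keep=False)

print(is_duplicate)
print(keep_last)
print(drop_all)
```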


---

## 3. Handling Outliers

In [69]:
# Create a sample DataFrame with outliers
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 100],  # 100 is an outlier
        "B": [5, 6, 7, 8, 200],  # 200 is an outlier
        "C": [1, 2, 3, 4, 5],
    }
)
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 5 | 1 |
| 1 | 2 | 6 | 2 |
| 2 | 3 | 7 | 3 |
| 3 | 4 | 8 | 4 |
| 4 | 100 | 200 | 5 |


#### Z-Score

The Z-score indicates **how many standard deviations an element is from the mean**. <br>
Values with a Z-score greater than 3 or less than -3 are often considered outliers.

In [70]:
from scipy import stats

# Calculate Z-scores for each value in the DataFrame
# Z-score indicates how many standard deviations an element is from the mean
z_scores = np.abs(stats.zscore(df))

# Print Z-scores
print("\nZ-scores:")
print(z_scores)



Z-scores:
          A         B         C
0  0.538285  0.519337  1.414214
1  0.512652  0.506418  0.707107
2  0.487019  0.493499  0.000000
3  0.461387  0.480580  0.707107
4  1.999343  1.999833  1.414214


**Step-by-Step Calculation**
1. Calculate the Mean of Column A:
- Sum all the values in column A: 1 + 2 + 3 + 4 + 100 = 110
- Divide this total by the number of values (5): 110 / 5 = 22
- The mean of column A is therefore 22.

2. Calculate the Standard Deviation of Column A:
- First, calculate the variance:
    - Subtract the mean from each value and square the result:<br>
        (1 - 22)² = (-21)² = 441<br>
        (2 - 22)² = (-20)² = 400<br>
        (3 - 22)² = (-19)² = 361<br>
        (4 - 22)² = (-18)² = 324<br>
        (100 - 22)² = (78)² = 6084<br>
        
    - Sum these squared differences: 441 + 400 + 361 + 324 + 6084 = 7610
    - Divide this total by the number of values (5) to get the population variance, which is what `stats.zscore` uses by default: 7610 / 5 = 1522
- Take the square root of the variance to get the standard deviation: √1522 ≈ 39.01
- The standard deviation of column A is therefore approximately 39.01.

3. Calculate the Z-score for the Value 1 in Column A:
- Subtract the mean from the value: 1 - 22 = -21
- Divide this result by the standard deviation: -21 / 39.01 ≈ -0.538
- The Z-score for the value 1 in column A is therefore approximately -0.538.

**Explanation of the Z-score**
- The Z-score of approximately -0.538 means that the value 1 is about 0.538 standard deviations below the mean of column A.
- In the context of Z-scores, a value close to 0 indicates that the value is near the mean, while a value further from 0 (either positive or negative) indicates that the value is further from the mean.

**Summary**<br>
The Z-score for the value 1 in column A is approximately -0.538, indicating it is 0.538 standard deviations below the mean of 22 for column A.
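The step-by-step calculation above can be reproduced in a few lines of NumPy. This is a sketch of the same arithmetic for column A; it divides by the number of values (population standard deviation), and also computes the sample standard deviation (`ddof=1`, the default of pandas' `std()`) to show that it gives a different number.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 100], dtype=float)  # column A

mean = values.mean()                      # 110 / 5
variance = ((values - mean) ** 2).mean()  # 7610 / 5 (population variance)
std = np.sqrt(variance)                   # population standard deviation
z = (values - mean) / std                 # signed Z-scores

# Sample standard deviation divides by n - 1 instead of n
sample_std = values.std(ddof=1)

print(mean, variance, round(std, 2))
print(np.round(z, 3))
print(round(sample_std, 2))
```

The signed Z-score for the value 1 is negative; the table printed earlier shows its absolute value because of the `np.abs` call.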

In [71]:
# Identify outliers: flag rows where any Z-score is greater than 1.5,
# i.e. any value more than 1.5 standard deviations away from the mean.
# The usual threshold of 3 cannot be reached here: with 5 rows and the
# population standard deviation, |Z| is at most sqrt(5 - 1) = 2.
outliers = (z_scores > 1.5).any(axis=1)

print("\nOutliers in rows identified using Z-score:")
print(df[outliers])


Outliers in rows identified using Z-score:
     A    B  C
4  100  200  5
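The threshold of 1.5 used here is lower than the common rule of 3. A likely reason is the sample size: with n values and the population standard deviation, no absolute Z-score can exceed sqrt(n - 1), which is 2 for five rows. The sketch below checks this on the same data, computing the Z-scores with pandas alone (`ddof=0`) rather than SciPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 100],
        "B": [5, 6, 7, 8, 200],
        "C": [1, 2, 3, 4, 5],
    }
)

# Absolute Z-scores using the population standard deviation (ddof=0)
z_scores = ((df - df.mean()) / df.std(ddof=0)).abs()

# Upper bound on |Z| for a sample of this size
max_possible = np.sqrt(len(df) - 1)

# A threshold of 3 flags nothing; 1.5 flags the last row
flagged_at_3 = (z_scores > 3).any(axis=1)
flagged_at_1_5 = (z_scores > 1.5).any(axis=1)

print(max_possible)
print(flagged_at_3.tolist())
print(flagged_at_1_5.tolist())
```

On larger datasets the bound is far above 3, so the conventional threshold becomes usable again.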


In [72]:
# Remove outliers using Z-score: Filter out rows identified as outliers
# Keep only the rows where all Z-scores are at most 1.5
df_no_outliers_z = df[~outliers]

print("\nDataFrame after removing outliers using Z-score:")
print(df_no_outliers_z)



DataFrame after removing outliers using Z-score:
   A  B  C
0  1  5  1
1  2  6  2
2  3  7  3
3  4  8  4


In [73]:
# Cap outliers to the 5th and 75th percentiles
# Calculate the 5th and 75th percentiles for each column
lower_bound = df.quantile(0.05)
upper_bound = df.quantile(0.75)

# Use the clip() method to cap values below the 5th percentile to the 5th percentile
# and values above the 75th percentile to the 75th percentile
# Assigns values outside boundary to boundary values.
df_capped = df.clip(lower=lower_bound, upper=upper_bound, axis=1)

print("\nOriginal DataFrame:")
print(df)

print("\nZ-scores:")
print(z_scores)

print("\nDataFrame after capping outliers to the 5th and 75th percentiles:")
print(df_capped)

clipped_z_scores = np.abs(stats.zscore(df_capped))

# Print Z-scores
print("\nClipped Z-scores:")
print(clipped_z_scores)


Original DataFrame:
     A    B  C
0    1    5  1
1    2    6  2
2    3    7  3
3    4    8  4
4  100  200  5

Z-scores:
          A         B         C
0  0.538285  0.519337  1.414214
1  0.512652  0.506418  0.707107
2  0.487019  0.493499  0.000000
3  0.461387  0.480580  0.707107
4  1.999343  1.999833  1.414214

DataFrame after capping outliers to the 5th and 75th percentiles:
     A    B    C
0  1.2  5.2  1.2
1  2.0  6.0  2.0
2  3.0  7.0  3.0
3  4.0  8.0  4.0
4  4.0  8.0  4.0

Clipped Z-scores:
          A         B         C
0  1.483328  1.483328  1.483328
1  0.759753  0.759753  0.759753
2  0.144715  0.144715  0.144715
3  1.049183  1.049183  1.049183
4  1.049183  1.049183  1.049183
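The lower bound of 1.2 for column A is not a value from the data. By default, `quantile()` interpolates linearly between the two nearest sorted values. The sketch below reproduces the 5th percentile of column A by hand; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3, 4, 100])  # column A
q = 0.05

# Fractional position of the quantile in the sorted data: 0.05 * (5 - 1) = 0.2
position = q * (len(a) - 1)
lower_idx = int(np.floor(position))
fraction = position - lower_idx

# Interpolate between the values at positions 0 and 1 (1 and 2)
sorted_values = np.sort(a.to_numpy())
manual = sorted_values[lower_idx] + fraction * (
    sorted_values[lower_idx + 1] - sorted_values[lower_idx]
)

print(manual)
print(a.quantile(q))
```

The 75th percentile falls exactly on position 3 of the sorted data, which is why the upper bound is the existing value 4.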
