# Data cleaning

In [48]:
import pandas as pd
import numpy as np

1. Handling Missing Values
2. Removing Duplicates
3. Handling Outliers
    - Z-Score (Standard deviations)
    - Interquartile Range (IQR)
    - Winsorization
4. Data Type Conversion
    - Convert Data Types
    - Parse Dates
5. String Operations
6. Handling Inconsistent Data
    - Standardize Categories
    - Correct Typos
7. Binning data
8. DateTime functionalities

## 1. Handling Missing Values

pandas treats `None` and `NaN` as missing values. Placeholder strings such as `null`, `na`, `NA`, or the empty string `''` are ordinary strings to pandas, so they are not detected as missing until you convert them.

To handle these representations uniformly, use the `replace()` function to convert them to `NaN`.

After that, you can use `isnull()`, `dropna()`, and `fillna()` as usual.

In [49]:
# Set the future option to avoid the warning
# "FutureWarning: Downcasting behavior in `replace` is deprecated"
pd.set_option("future.no_silent_downcasting", True)

# Create a sample DataFrame with different representations of missing values
df = pd.DataFrame(
    {
        "A": [1, 2, None, 4, "null"],
        "B": [5, 2, 3, 4, "na"],
        "C": [1, "NA", None, 4, ""],
    }
)

# Replace different representations of missing values with np.nan
df.replace(["null", "na", "NA", ""], value=np.nan, inplace=True)

# Explicitly infer objects to avoid the FutureWarning
# infer_objects() converts object-dtype columns to a more specific dtype (e.g., float) where possible.
df = df.infer_objects(copy=False)

# Now you can use isnull(), dropna(), and fillna() as usual
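
If the data comes from a file, placeholders can often be handled at read time instead. The sketch below assumes a small, made-up CSV string. `read_csv` already recognizes strings such as `null`, `NA`, and empty fields as missing by default, and `na_values` adds further placeholders (here the lowercase `na`).

```python
import io

import pandas as pd

# Hypothetical CSV content; "na" is a custom placeholder for missing data
csv_text = "A,B,C\n1,5,1\n2,na,NA\nnull,3,\n4,4,4\n"

# na_values adds "na" to the strings pandas already treats as missing on read
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=["na"])

# Count the missing values in each column
missing_per_column = df_csv.isna().sum()
print(missing_per_column)
```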

#### Identifying Missing Values with `isnull()`

The `isnull()` function helps identify missing values in a DataFrame. It returns a DataFrame of the same shape with `True` for missing values and `False` for non-missing values.

In [50]:
# Create a sample DataFrame with missing values
df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Identify missing values
df.isnull()

Unnamed: 0,A,B,C
0,False,True,False
1,False,False,True
2,True,False,True
3,False,False,False


#### Dropping Rows with Missing Values using `dropna()`

The `dropna()` function removes rows or columns with missing values. By default, it drops rows with any missing values.

In [51]:
# Drop rows with any missing values
df.dropna()

Unnamed: 0,A,B,C
3,4.0,4.0,4.0


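You can also drop columns with missing values by setting the `axis` parameter to 1. Here every column contains at least one missing value, so all columns are dropped and only the index remains.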

In [52]:
df.dropna(axis=1)

0
1
2
3
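
Dropping every row or column that has any missing value is often too aggressive. As a minimal sketch on the same sample data, `dropna()` also accepts `how`, `thresh`, and `subset` to narrow what gets removed:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Drop a row only if every value in it is missing (no such row here)
all_missing_dropped = df.dropna(how="all")

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# Only look at column "A" when deciding which rows to drop
a_present = df.dropna(subset=["A"])

print(at_least_two)
```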


#### Filling Missing Values using `fillna()`

The `fillna()` function replaces missing values with a specified value.

In [53]:
# Fill missing values with 0
df.fillna(0)

Unnamed: 0,A,B,C
0,1.0,0.0,1.0
1,2.0,2.0,0.0
2,0.0,3.0,0.0
3,4.0,4.0,4.0


You can also fill missing values with different values for each column by passing a dictionary.

In [54]:
# Fill missing values with different values for each column
df.fillna({"A": 0, "B": 1, "C": 2})

Unnamed: 0,A,B,C
0,1.0,1.0,1.0
1,2.0,2.0,2.0
2,0.0,3.0,2.0
3,4.0,4.0,4.0
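
Constant fill values are not the only option. The following sketch, using the same sample data, fills each column with its own mean and, as an alternative, carries the last valid value forward with `ffill()`. Which strategy is appropriate depends on the data, so treat both as illustrations only.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4], "B": [None, 2, 3, 4], "C": [1, None, None, 4]})

# Fill each column with its own mean (means are computed ignoring missing values)
filled_mean = df.fillna(df.mean())

# Forward fill: carry the last valid value down each column.
# A missing value in the first row has nothing before it and stays NaN.
filled_ffill = df.ffill()

print(filled_mean)
print(filled_ffill)
```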


#### Filling Missing Values using `interpolate()`

The `interpolate()` method in pandas can be used to fill missing values using various interpolation methods.

In [55]:
# DataFrame that has missing values:
df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4, 5],
        "B": [5, np.nan, np.nan, 8, 10],
        "C": [1, 2, 3, np.nan, 5],
    }
)

df

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,,2.0
2,,,3.0
3,4.0,8.0,
4,5.0,10.0,5.0


##### Linear Interpolation

Fills each missing value by interpolating linearly between the known values before and after it.

By default, the axis is 0, which means interpolation is done **column-wise**. 
This means that for each column, missing values are filled in based on the values in the rows above and below.

For example, in column `C`:

- row 2 = 3
- row 3 = NaN => 4
- row 4 = 5

In [56]:
# vertically interpolate missing values
df.interpolate(method="linear")

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,6.0,2.0
2,3.0,7.0,3.0
3,4.0,8.0,4.0
4,5.0,10.0,5.0


In [57]:
# horizontally interpolate missing values
df.interpolate(method="linear", axis=1)

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,2.0,2.0
2,,,3.0
3,4.0,8.0,8.0
4,5.0,10.0,5.0
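
In the row-wise result above, row 2 keeps its leading `NaN` values while the trailing `NaN` in row 3 is filled with 8. This follows from the default fill direction (forward). The sketch below reproduces both cases on standalone Series and shows `limit_direction="both"` as one way to fill leading gaps as well:

```python
import numpy as np
import pandas as pd

leading = pd.Series([np.nan, np.nan, 3.0])  # like row 2 above
trailing = pd.Series([4.0, 8.0, np.nan])    # like row 3 above

# Default direction is forward: leading NaNs stay, trailing NaNs take the last valid value
leading_default = leading.interpolate(method="linear")
trailing_default = trailing.interpolate(method="linear")

# limit_direction="both" also fills the leading NaNs with the first valid value
leading_both = leading.interpolate(method="linear", limit_direction="both")

print(leading_default.tolist(), trailing_default.tolist(), leading_both.tolist())
```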


##### Polynomial Interpolation

Polynomial interpolation estimates values between known data points by fitting a polynomial curve to the existing data. In pandas, `method="polynomial"` requires the `order` argument and depends on SciPy being installed.

- Order 1 (linear): Straight line connection between points.
- Order 2 (quadratic): Curved line (parabola) fitting the points.
- Higher orders: More complex curves to fit the data.

In [58]:
df.interpolate(method="polynomial", order=2)

Unnamed: 0,A,B,C
0,1.0,5.0,1.0
1,2.0,5.5,2.0
2,3.0,6.5,3.0
3,4.0,8.0,4.0
4,5.0,10.0,5.0


##### Time Interpolation

If your DataFrame has a datetime index, you can use time interpolation to fill missing values based on time.

While both `method='linear'` and `method='time'` interpolate missing values, there's a key difference:

- `method='linear'`: Ignores the index and treats the values as equally spaced.
- `method='time'`: Takes the temporal nature of the index into account (here, a daily date range) and interpolates based on the time intervals between the known data points.

In [59]:
# Create a sample DataFrame with a datetime index and missing values
df_time = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4, 5],
        "B": [5, np.nan, np.nan, 8, 10],
        "C": [1, 2, 3, np.nan, 5],
    },
    index=pd.date_range("20230101", periods=5),
)

df_time

Unnamed: 0,A,B,C
2023-01-01,1.0,5.0,1.0
2023-01-02,2.0,,2.0
2023-01-03,,,3.0
2023-01-04,4.0,8.0,
2023-01-05,5.0,10.0,5.0


In [60]:
df_time.interpolate(method="time")

Unnamed: 0,A,B,C
2023-01-01,1.0,5.0,1.0
2023-01-02,2.0,6.0,2.0
2023-01-03,3.0,7.0,3.0
2023-01-04,4.0,8.0,4.0
2023-01-05,5.0,10.0,5.0
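
The daily index above is evenly spaced, so `method="time"` and `method="linear"` agree on it. A minimal sketch with unevenly spaced, made-up dates shows where they differ:

```python
import numpy as np
import pandas as pd

# Unevenly spaced dates: 1 day, then 3 days between observations
idx = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# Ignores the index: the gap is filled with the midpoint of 0 and 4
linear_fill = s.interpolate(method="linear")

# Weights by elapsed time: 1 of 4 days has passed, so a quarter of the way from 0 to 4
time_fill = s.interpolate(method="time")

print(linear_fill.iloc[1], time_fill.iloc[1])
```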


---

## 2. Removing Duplicates

#### Removing Duplicate Rows using `drop_duplicates()`

In [61]:
# Create a sample DataFrame with duplicate rows
df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# Remove duplicate rows
df_duplicates.drop_duplicates()

Unnamed: 0,A,B,C
0,1,1,1
1,2,2,2
3,4,4,4


You can also specify which columns to consider for identifying duplicates.

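With `subset=["A"]`, pandas keeps the first occurrence of each unique value in column 'A' and removes subsequent duplicates. Row 1 is kept because it is the first occurrence of the value 2 in column 'A', and row 2 is removed because it is a duplicate. Because the duplicated rows in this example match in every column, the result is the same as without `subset`.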

In [62]:
# Create a sample DataFrame with duplicate rows
df_duplicates = pd.DataFrame({"A": [1, 2, 2, 4], "B": [1, 2, 2, 4], "C": [1, 2, 2, 4]})

# Remove duplicates based on column 'A'
df_duplicates.drop_duplicates(subset=["A"])

Unnamed: 0,A,B,C
0,1,1,1
1,2,2,2
3,4,4,4
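
The effect of `subset` is easier to see when rows agree in one column but differ in another. The sketch below uses a made-up DataFrame (`df_orders` is a hypothetical name) and also shows `duplicated()` for inspecting duplicates and `keep="last"` for retaining the last occurrence instead of the first:

```python
import pandas as pd

# Rows 1 and 2 share the same "A" value but differ in "B"
df_orders = pd.DataFrame({"A": [1, 2, 2, 4], "B": [10, 20, 21, 40]})

# Boolean mask marking rows whose "A" value has already been seen
dup_mask = df_orders.duplicated(subset=["A"])

# Keep the first or the last row for each value of "A"
first_kept = df_orders.drop_duplicates(subset=["A"])
last_kept = df_orders.drop_duplicates(subset=["A"], keep="last")

# Without subset, no full row is repeated, so nothing is dropped
full_rows = df_orders.drop_duplicates()

print(first_kept)
print(last_kept)
```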


---

## 3. Handling Outliers

In [63]:
# Create a sample DataFrame with outliers
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 100],  # 100 is an outlier
        "B": [5, 6, 7, 8, 200],  # 200 is an outlier
        "C": [1, 2, 3, 4, 5],
    }
)
df

Unnamed: 0,A,B,C
0,1,5,1
1,2,6,2
2,3,7,3
3,4,8,4
4,100,200,5


#### Z-Score

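The Z-score indicates **how many standard deviations an element is from the mean**. <br>
Values with a Z-score greater than 3 or less than -3 are often considered outliers. In a very small sample that threshold can never be reached: with the population standard deviation, no Z-score among n values can exceed √(n − 1) in absolute value, which is 2 for the five-row example here, so the example uses a lower threshold of 1.5.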


<img src="../pandas/images/standarddeviation.png" width="200">


In [64]:
from scipy import stats

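# Calculate Z-scores for each value in the DataFrame
# Z-score indicates how many standard deviations an element is from the mean
# stats.zscore uses the population standard deviation (ddof=0) by default;
# np.abs() drops the sign so values above and below the mean are treated alike
z_scores = np.abs(stats.zscore(df))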

# Print Z-scores
print("\nZ-scores:")
print(z_scores)



Z-scores:
          A         B         C
0  0.538285  0.519337  1.414214
1  0.512652  0.506418  0.707107
2  0.487019  0.493499  0.000000
3  0.461387  0.480580  0.707107
4  1.999343  1.999833  1.414214


**Step-by-Step Calculation**
1. Calculate the Mean of Column A:
- Sum all the values in column A: 1 + 2 + 3 + 4 + 100 = 110
- Divide this total by the number of values (5): 110 / 5 = 22
- The mean of column A is therefore 22.

2. Calculate the Standard Deviation of Column A:
- First, calculate the variance:
    - Subtract the mean from each value and square the result:<br>
        (1 - 22)² = (-21)² = 441<br>
        (2 - 22)² = (-20)² = 400<br>
        (3 - 22)² = (-19)² = 361<br>
        (4 - 22)² = (-18)² = 324<br>
        (100 - 22)² = (78)² = 6084<br>
        
    - Sum these squared differences: 441 + 400 + 361 + 324 + 6084 = 7610
    - Divide this total by the number of values (5): 7610 / 5 = 1522. This is the population variance, matching the `stats.zscore` default of `ddof=0`.
- Take the square root of the variance to get the standard deviation: √1522 ≈ 39.01
- The standard deviation of column A is therefore approximately 39.01.

3. Calculate the Z-score for the Value 1 in Column A:
- Subtract the mean from the value: 1 - 22 = -21
- Divide this result by the standard deviation: -21 / 39.01 ≈ -0.538
- The Z-score for the value 1 in column A is therefore approximately -0.538. The printed table shows 0.538285 because the code takes absolute values.

**Explanation of the Z-score**
- The Z-score of approximately -0.538 means that the value 1 is about 0.538 standard deviations below the mean of column A.
- In the context of Z-scores, a value close to 0 indicates that the value is near the mean, while a value further from 0 (either positive or negative) indicates that the value is further from the mean.

**Summary**<br>
The Z-score for the value 1 in column A is approximately -0.538, indicating it is 0.538 standard deviations below the mean of 22 for column A.
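
The hand calculation can be checked with pandas alone. This is a minimal sketch, not the method used in the cell above: it computes the Z-scores for column A with the population standard deviation (`ddof=0`) and, for comparison, with pandas' default sample standard deviation (`ddof=1`), which gives slightly smaller magnitudes.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 100]})

# Population standard deviation (ddof=0), as in the steps above
z_population = (df["A"] - df["A"].mean()) / df["A"].std(ddof=0)

# pandas' default for std() is the sample standard deviation (ddof=1)
z_sample = (df["A"] - df["A"].mean()) / df["A"].std()

print(z_population.round(3).tolist())
print(z_sample.round(3).tolist())
```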

In [65]:
# Identify outliers: Calculate Z-scores and identify rows with any Z-score greater than 1.5.
# This means any value that is more than 1.5 standard deviations away from the mean is considered an outlier
outliers = (z_scores > 1.5).any(axis=1)

print("\nOutliers in rows identified using Z-score:")
print(df[outliers])


Outliers in rows identified using Z-score:
     A    B  C
4  100  200  5


In [66]:
# Remove outliers using Z-score: Filter out rows identified as outliers
# Keep only the rows where all Z-scores are at most 1.5
df_no_outliers_z = df[~outliers]

print("\nDataFrame after removing outliers using Z-score:")
print(df_no_outliers_z)



DataFrame after removing outliers using Z-score:
   A  B  C
0  1  5  1
1  2  6  2
2  3  7  3
3  4  8  4


In [67]:
# Cap outliers to the 5th and 75th percentiles
# Calculate the 5th and 75th percentiles for each column
lower_bound = df.quantile(0.05)
upper_bound = df.quantile(0.75)

# clip() assigns values outside the boundaries to the boundary values:
# anything below the 5th percentile becomes the 5th percentile,
# anything above the 75th percentile becomes the 75th percentile
df_capped = df.clip(lower=lower_bound, upper=upper_bound, axis=1)

print("\nOriginal DataFrame:")
print(df)

print("\nZ-scores:")
print(z_scores)

print("\nDataFrame after capping outliers to the 5th and 75th percentiles:")
print(df_capped)

clipped_z_scores = np.abs(stats.zscore(df_capped))

# Print Z-scores
print("\nClipped Z-scores:")
print(clipped_z_scores)


Original DataFrame:
     A    B  C
0    1    5  1
1    2    6  2
2    3    7  3
3    4    8  4
4  100  200  5

Z-scores:
          A         B         C
0  0.538285  0.519337  1.414214
1  0.512652  0.506418  0.707107
2  0.487019  0.493499  0.000000
3  0.461387  0.480580  0.707107
4  1.999343  1.999833  1.414214

DataFrame after capping outliers to the 5th and 75th percentiles:
     A    B    C
0  1.2  5.2  1.2
1  2.0  6.0  2.0
2  3.0  7.0  3.0
3  4.0  8.0  4.0
4  4.0  8.0  4.0

Clipped Z-scores:
          A         B         C
0  1.483328  1.483328  1.483328
1  0.759753  0.759753  0.759753
2  0.144715  0.144715  0.144715
3  1.049183  1.049183  1.049183
4  1.049183  1.049183  1.049183


#### Interquartile Range (IQR)

- Measures the spread of the middle 50% of the data.
- Best suited for data that does not follow a normal distribution.
- Useful for skewed distributions and when you want to identify outliers based on the spread of the central portion of the data.
- Common threshold: 1.5 times the IQR for mild outliers, 3 times for extreme outliers.

<img src="../pandas/images/iqr.png" width="200">

The `df.quantile()` function in pandas is used to calculate the quantiles of the columns in a DataFrame.<br> 
Quantiles are values that divide a dataset into equal-sized, contiguous intervals.

**Calculate the 25th percentile (Q1) for each column**<br>
```
Q1 = df.quantile(0.25)
```
- **What it does**: This line calculates the 25th percentile (also known as the first quartile, Q1) for each column in the DataFrame df.
- **How it works**: The `quantile(0.25)` function computes the value below which 25% of the data in each column falls. This is useful for understanding the lower end of the data distribution.

**Calculate the 75th percentile (Q3) for each column**<br>
```
Q3 = df.quantile(0.75)
```
- **What it does**: This line calculates the 75th percentile (also known as the third quartile, Q3) for each column in the DataFrame df.
- **How it works**: The `quantile(0.75)` function computes the value below which 75% of the data in each column falls. This helps in understanding the upper end of the data distribution.

**Calculate the Interquartile Range (IQR) for each column**<br>
```
IQR = Q3 - Q1
```
- **What it does**: This line calculates the Interquartile Range (IQR) for each column in the DataFrame df.
- **How it works**: The IQR is the range between the 75th percentile (Q3) and the 25th percentile (Q1).<br> 
It measures the spread of the middle 50% of the data. The formula is `IQR = Q3 - Q1`. The IQR is useful for identifying outliers and understanding the variability of the data.

In [68]:
# Sample DataFrame
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 100],  # 100 is an outlier
        "B": [5, 6, 7, 8, 200],  # 200 is an outlier
        "C": [1, 2, 3, 4, 5],
    }
)

# Calculate the 25th percentile (Q1) for each column
Q1 = df.quantile(0.25)
# Calculate the 75th percentile (Q3) for each column
Q3 = df.quantile(0.75)

# Calculate the Interquartile Range (IQR) for each column
IQR = Q3 - Q1

# Print the results
print("Q1 (25th percentile):")
print(Q1)
print("\nQ3 (75th percentile):")
print(Q3)
print("\nIQR (Interquartile Range):")
print(IQR)

Q1 (25th percentile):
A    2.0
B    6.0
C    2.0
Name: 0.25, dtype: float64

Q3 (75th percentile):
A    4.0
B    8.0
C    4.0
Name: 0.75, dtype: float64

IQR (Interquartile Range):
A    2.0
B    2.0
C    2.0
dtype: float64


The 25th percentile (Q1) is not necessarily one of the values in the dataset.<br> 
Instead, it is a value below which 25% of the data falls.<br>
When calculating quantiles, pandas uses linear interpolation by default, which can result in a value that is not explicitly present in the dataset.

How the 25th percentile (Q1) is calculated for column "A":

**Data in Column "A"**<br>
[1, 2, 3, 4, 100]<br>

**Sorted Data**<br>
The data is already sorted:<br>
[1, 2, 3, 4, 100]<br>

**Position Calculation**<br>
To find the 25th percentile, we need to determine the zero-based position in the sorted list:<br>
`position = (n - 1) * percentile`

where n is the number of data points and percentile is 0.25 for the 25th percentile.<br>
For column "A":<br>
`position = (5 - 1) * 0.25 = 1`

**Interpolation**<br>
The position is 1, which means the 25th percentile lies exactly at the second value in the sorted list.<br>
If the position were not an integer, pandas would interpolate between the two nearest values.

**Result**<br>
The value at position 1 (second value in the sorted list) is 2. Therefore, the 25th percentile (Q1) for column "A" is 2.
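
The position rule and the interpolation step can be written out in a few lines. The helper below (`linear_quantile` is a hypothetical name, not a pandas function) is a sketch of the linear-interpolation rule described above, compared against `Series.quantile()`:

```python
import math

import pandas as pd


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values."""
    data = sorted(values)
    position = (len(data) - 1) * q  # zero-based, may be fractional
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return data[lower] + fraction * (data[upper] - data[lower])


column_a = [1, 2, 3, 4, 100]

q25 = linear_quantile(column_a, 0.25)  # position 1.0: exactly the second value
q05 = linear_quantile(column_a, 0.05)  # position 0.2: between the first two values

print(q25, q05, pd.Series(column_a).quantile(0.05))
```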




**How to use the IQR to filter out outliers and adjust your DataFrame:**

- `df < lower_bound` creates a boolean DataFrame where each element is True if it is less than the corresponding lower bound.
- `df > upper_bound` creates a boolean DataFrame where each element is True if it is greater than the corresponding upper bound.
- `|` (bitwise OR) combines these two boolean DataFrames.
- `.any(axis=1)` checks if any element in a row is True.
- `~` (bitwise NOT) inverts the boolean DataFrame, so True becomes False and vice versa.
- `df[...]` filters the DataFrame to keep only the rows where all elements are within the bounds.

In [69]:
# Calculate the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter the DataFrame to remove outliers
df_filtered = df[~((df < lower_bound) | (df > upper_bound)).any(axis=1)]

# Print the results
print("Original DataFrame:")
print(df)
print("\nFiltered DataFrame (without outliers):")
print(df_filtered)

Original DataFrame:
     A    B  C
0    1    5  1
1    2    6  2
2    3    7  3
3    4    8  4
4  100  200  5

Filtered DataFrame (without outliers):
   A  B  C
0  1  5  1
1  2  6  2
2  3  7  3
3  4  8  4


#### Winsorization

<img src="../pandas/images/winsorization.png" width="300">

Winsorization is a statistical technique that limits extreme values in data to reduce the effect of possibly spurious outliers.<br> 
All values above a chosen upper percentile are set to the value at that percentile, and all values below a chosen lower percentile are set to the value at that percentile.<br> 
This makes statistics computed from the data, such as the mean, less sensitive to outliers.

**Steps for Winsorization**
- **Determine the percentiles**: Decide the lower and upper percentiles to which you want to limit the data. Common choices are the `5th` and `95th` percentiles.
- **Calculate the percentile values**: Compute the values at these percentiles.
- **Replace extreme values**: Replace values below the lower percentile with the lower percentile value and values above the upper percentile with the upper percentile value.

In [70]:
# Sample DataFrame
df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 100],  # 100 is an outlier
        "B": [5, 6, 7, 8, 200],  # 200 is an outlier
        "C": [1, 2, 3, 4, 5],
    }
)

# Define the percentiles for Winsorization
lower_percentile = 0.05
upper_percentile = 0.95

# Calculate the values at these percentiles for each column
lower_bound = df.quantile(lower_percentile)
upper_bound = df.quantile(upper_percentile)

# Apply Winsorization
df_winsorized = df.copy()
df_winsorized = df_winsorized.apply(
    lambda x: np.where(x < lower_bound[x.name], lower_bound[x.name], x)
)
df_winsorized = df_winsorized.apply(
    lambda x: np.where(x > upper_bound[x.name], upper_bound[x.name], x)
)

# Convert the numpy arrays back to DataFrame
df_winsorized = pd.DataFrame(df_winsorized, columns=df.columns)

# Print the results
print("Original DataFrame:")
print(df)
print("\nWinsorized DataFrame:")
print(df_winsorized)

Original DataFrame:
     A    B  C
0    1    5  1
1    2    6  2
2    3    7  3
3    4    8  4
4  100  200  5

Winsorized DataFrame:
      A      B    C
0   1.2    5.2  1.2
1   2.0    6.0  2.0
2   3.0    7.0  3.0
3   4.0    8.0  4.0
4  80.8  161.6  4.8


**Explanation of the Winsorized Values**
- **For column "A"**:
    - The value `1` is below the 5th percentile (`1.2`), so it is replaced with `1.2`.
    - The value `100` is above the 95th percentile (`80.8`), so it is replaced with `80.8`.

- **For column "B"**:
    - The value `5` is below the 5th percentile (`5.2`), so it is replaced with `5.2`.
    - The value `200` is above the 95th percentile (`161.6`), so it is replaced with `161.6`.

- **For column "C"**:
    - The value `1` is below the 5th percentile (`1.2`), so it is replaced with `1.2`.
    - The value `5` is above the 95th percentile (`4.8`), so it is replaced with `4.8`.

**Calculate the lower and upper bounds:**<br>
`lower_bound = df.quantile(lower_percentile)`<br>
`upper_bound = df.quantile(upper_percentile)`<br>

**Apply Winsorization:**<br>
`df_winsorized = df.copy()`<br>
`df_winsorized = df_winsorized.apply(lambda x: np.where(x < lower_bound[x.name], lower_bound[x.name], x))`<br>
`df_winsorized = df_winsorized.apply(lambda x: np.where(x > upper_bound[x.name], upper_bound[x.name], x))`<br>
- `df_winsorized.apply(...)` applies a function to each column of the DataFrame.
- The function being applied is a lambda function: `lambda x: np.where(x < lower_bound[x.name], lower_bound[x.name], x)`.
- The lambda function:
    - `x` represents a column of the DataFrame.
    - `x < lower_bound[x.name]` is a condition that checks if each element in the column `x` is less than the lower bound for that column.
    - `np.where(condition, value_if_true, value_if_false)` is a NumPy function that returns an array where elements that meet the condition are replaced with `value_if_true`, and elements that do not meet the condition are replaced with `value_if_false`.
    - `lower_bound[x.name]` is the lower bound value for the column `x`. This value is obtained from the `lower_bound` Series, where `x.name` is the name of the column.
    - `x` is the original value in the column if the condition is not met.
- As a result, any value in the column that is less than the lower bound is replaced with the lower bound value.

**Convert the numpy arrays back to DataFrame:**<br>
`df_winsorized = pd.DataFrame(df_winsorized, columns=df.columns)`<br>
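
The same capping can also be expressed with `clip()`, which was used earlier for the 5th and 75th percentiles. This is an alternative sketch under the same percentile choices, not a replacement for the step-by-step version above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 100],
        "B": [5, 6, 7, 8, 200],
        "C": [1, 2, 3, 4, 5],
    }
)

lower_bound = df.quantile(0.05)
upper_bound = df.quantile(0.95)

# axis=1 aligns the Series of bounds with the column labels
df_clipped = df.clip(lower=lower_bound, upper=upper_bound, axis=1)

print(df_clipped)
```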

---

## 4. Data Type Conversion

#### Convert Data Types

The `astype()` method is used to cast a pandas object to a specified data type.

In [71]:
# Create a sample DataFrame
df = pd.DataFrame(
    {
        "A": ["1", "2", "3", "4"],
        "B": ["5.1", "6.2", "7.3", "8.4"],
        "C": ["True", "False", "True", "False"],
    }
)

# Convert columns to appropriate data types
df["A"] = df["A"].astype(int)  # Convert to integer
df["B"] = df["B"].astype(float)  # Convert to float
# astype(bool) would turn every non-empty string (including "False") into True,
# so map the strings to booleans explicitly
df["C"] = df["C"].map({"True": True, "False": False})  # Convert to boolean

print(df)
print(df.dtypes)

   A    B      C
0  1  5.1   True
1  2  6.2  False
2  3  7.3   True
3  4  8.4  False
A      int32
B    float64
C       bool
dtype: object
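
`astype()` raises an error as soon as one value cannot be converted. For messy input, one common option is `pd.to_numeric()` with `errors="coerce"`, sketched here on a made-up Series:

```python
import pandas as pd

raw = pd.Series(["1", "2", "three", "4.5"])

# errors="coerce" turns unparseable entries into NaN instead of raising
numbers = pd.to_numeric(raw, errors="coerce")

print(numbers)
```

The resulting `NaN` values can then be handled with the techniques from section 1.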


#### Parse Dates

The `pd.to_datetime()` function is used to convert columns to datetime objects.

In [72]:
# Create a sample DataFrame
df = pd.DataFrame(
    {
        "Date": ["2023-04-01", "2023-02-01", "2023-03-01"], 
        "Value": [10, 20, 30]
    }
)

# Convert the 'Date' column to datetime objects
df["Date"] = pd.to_datetime(df["Date"])

print(df)
print(df.dtypes)

        Date  Value
0 2023-04-01     10
1 2023-02-01     20
2 2023-03-01     30
Date     datetime64[ns]
Value             int64
dtype: object
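
When dates are not in ISO format, passing an explicit `format` avoids ambiguity between day-first and month-first strings. The sketch below uses made-up day-first dates and `errors="coerce"` so that an unparseable entry becomes `NaT` rather than raising:

```python
import pandas as pd

raw_dates = pd.Series(["01/04/2023", "15/02/2023", "not a date"])

# format pins day-first parsing; errors="coerce" yields NaT for bad entries
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

print(parsed)
```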


---

## 5. String Operations

The example below combines several string operations (stripping whitespace, converting to lowercase, and replacing substrings) on a single column.

In [73]:
# Create a sample DataFrame
df = pd.DataFrame(
    {"Text": [" Hello World ", " Pandas is great ", " Python is awesome "]}
)

# Perform string operations
df["Text"] = df["Text"].str.strip()  # Strip leading and trailing whitespace
df["Text"] = df["Text"].str.lower()  # Convert strings to lowercase
df["Text"] = df["Text"].str.replace("is", "was")  # Replace substrings

print(df)

                 Text
0         hello world
1    pandas was great
2  python was awesome
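
Note that a plain substring replacement also matches inside longer words. If only the standalone word should change, a regular expression with word boundaries is one option. A small sketch on a made-up string:

```python
import pandas as pd

text = pd.Series(["this is it"])

# Literal replacement touches every "is", including the one inside "this"
literal = text.str.replace("is", "was", regex=False)

# \b marks a word boundary, so only the standalone word "is" is replaced
whole_word = text.str.replace(r"\bis\b", "was", regex=True)

print(literal.iloc[0])
print(whole_word.iloc[0])
```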


---

## 6. Handling Inconsistent Data

#### Standardize Categories

The `replace()` method can be used to standardize categorical values in a DataFrame.
The example below replaces lowercase category values with their uppercase equivalents.

In [74]:
# Create a sample DataFrame
df = pd.DataFrame({"Category": ["A", "B", "a", "b", "C", "c"]})

# Standardize categories
df["Category"] = df["Category"].replace({"a": "A", "b": "B", "c": "C"})

print(df)

  Category
0        A
1        B
2        A
3        B
4        C
5        C


#### Correct Typos

You can use `replace()` or custom functions to correct typos in a DataFrame. In the example below, the inconsistent entries are lowercase variants of the city names.

In [75]:
# Create a sample DataFrame
df = pd.DataFrame(
    {
        "City": [
            "New York",
            "new york",
            "Los Angeles",
            "los angeles",
            "San Francisco",
            "san francisco",
        ]
    }
)

# Correct typos
df["City"] = df["City"].replace(
    {
        "new york": "New York",
        "los angeles": "Los Angeles",
        "san francisco": "San Francisco",
    }
)

print(df)

            City
0       New York
1       New York
2    Los Angeles
3    Los Angeles
4  San Francisco
5  San Francisco


**Example using a Custom Function:**

`correct_typos(city)`: A custom function that uses a dictionary to map incorrect city names to their correct forms.<br>
`df['City'].apply(correct_typos)`: Applies the custom function to each element in the 'City' column to correct typos.

In [76]:
# Create a sample DataFrame
df = pd.DataFrame(
    {
        "City": [
            "New York",
            "new york",
            "Los Angeles",
            "los angeles",
            "San Francisco",
            "san francisco",
        ]
    }
)

# Define a custom function to correct typos
def correct_typos(city):
    corrections = {
        "new york": "New York",
        "los angeles": "Los Angeles",
        "san francisco": "San Francisco",
    }
    return corrections.get(city.lower(), city)

# Apply the custom function to the 'City' column
df["City"] = df["City"].apply(correct_typos)

print(df)

            City
0       New York
1       New York
2    Los Angeles
3    Los Angeles
4  San Francisco
5  San Francisco
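
A fixed dictionary only catches variants that were anticipated. For genuine misspellings, fuzzy matching against a list of known values is one possible approach. The sketch below uses `difflib` from the Python standard library; the helper name `closest_city`, the misspelled inputs, and the `cutoff` of 0.8 are assumptions chosen for illustration.

```python
import difflib

import pandas as pd

KNOWN_CITIES = ["New York", "Los Angeles", "San Francisco"]


def closest_city(name, cutoff=0.8):
    """Return the most similar known city, or the input if nothing is close enough."""
    matches = difflib.get_close_matches(name, KNOWN_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else name


cities = pd.Series(["New Yrok", "Los Angles", "San Francisco", "Chicago"])
corrected = cities.apply(closest_city)

print(corrected.tolist())
```

Values with no sufficiently similar match, such as "Chicago" here, are left unchanged. A lower cutoff catches more typos but raises the risk of wrong corrections.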


---

## 7. Binning data

#### Binning Continuous Data into Discrete Intervals

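Binning involves converting continuous data into discrete intervals.<br> 
This can be done using `pd.cut()` for equal-width bins or explicit bin edges, or `pd.qcut()` for equal-frequency (quantile-based) bins.

Example using `pd.cut()` with explicit bin edges. Intervals are closed on the right by default, so an age of 18 falls in `(0, 18]` and is labeled `Child`: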

In [77]:
# Create a sample DataFrame
df = pd.DataFrame({"Age": [22, 25, 47, 35, 46, 55, 18, 21, 30, 42]})

# Bin ages into discrete intervals
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 18, 35, 50, 100],
    labels=["Child", "Young Adult", "Adult", "Senior"],
)

df

Unnamed: 0,Age,AgeGroup
0,22,Young Adult
1,25,Young Adult
2,47,Adult
3,35,Young Adult
4,46,Adult
5,55,Senior
6,18,Child
7,21,Young Adult
8,30,Young Adult
9,42,Adult


Example using `pd.qcut()`:

In [78]:
# Create a sample DataFrame
df = pd.DataFrame(
    {
        "Income": [
            30000,
            40000,
            50000,
            60000,
            70000,
            80000,
            90000,
            100000,
            110000,
            120000,
        ]
    }
)

# Bin incomes into quartiles
df["IncomeGroup"] = pd.qcut(
    df["Income"], q=4, labels=["Low", "Medium", "High", "Very High"]
)

df

Unnamed: 0,Income,IncomeGroup
0,30000,Low
1,40000,Low
2,50000,Low
3,60000,Medium
4,70000,Medium
5,80000,High
6,90000,High
7,100000,Very High
8,110000,Very High
9,120000,Very High


Combining feature creation with binning:

In [79]:
# Create a sample DataFrame
df = pd.DataFrame(
    {
        "Height": [150, 160, 170, 180, 190],
        "Weight": [50, 60, 70, 80, 90],
        "Age": [22, 25, 47, 35, 46],
    }
)

# Create a new feature: Body Mass Index (BMI)
df["BMI"] = df["Weight"] / (df["Height"] / 100) ** 2

# Round BMI to two decimal places
df["BMI"] = df["BMI"].round(2)

# Bin ages into discrete intervals
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 18, 35, 50, 100],
    labels=["Child", "Young Adult", "Adult", "Senior"],
)

df

Unnamed: 0,Height,Weight,Age,BMI,AgeGroup
0,150,50,22,22.22,Young Adult
1,160,60,25,23.44,Young Adult
2,170,70,47,24.22,Adult
3,180,80,35,24.69,Young Adult
4,190,90,46,24.93,Adult


---

## 8. DateTime functionalities

#### Formatting DateTime

Pandas allows you to format datetime objects in various ways using the `strftime()` method.

**Default Formatting**:<br>
ISO 8601 format (YYYY-MM-DD HH:MM:SS):<br>
`df['formatted_date'] = df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')`

**Specify custom formats using format codes**:<br>
`df['formatted_date'] = df['date'].dt.strftime('%d/%m/%Y')  # DD/MM/YYYY format`<br>
`df['formatted_time'] = df['date'].dt.strftime('%I:%M %p')  # 12-hour clock with AM/PM`

**Saving DateTime**:<br>
When saving datetime data to a file, you can specify the format.<br>
`df.to_csv('output.csv', date_format='%Y-%m-%d %H:%M:%S')`

In [80]:
# Create a sample DataFrame
data = {
    "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
    "value": [10, 20, 30, 40, 50],
}
df = pd.DataFrame(data)

# Convert to datetime
df["date"] = pd.to_datetime(df["date"])

# Formatting datetime
df["formatted_date"] = df["date"].dt.strftime("%d/%m/%Y")
df["formatted_time"] = df["date"].dt.strftime("%I:%M %p")

# Save to CSV with formatted date
df.to_csv("../datasets/output.csv", date_format="%Y-%m-%d %H:%M:%S")

df

Unnamed: 0,date,value,formatted_date,formatted_time
0,2023-01-01,10,01/01/2023,12:00 AM
1,2023-01-02,20,02/01/2023,12:00 AM
2,2023-01-03,30,03/01/2023,12:00 AM
3,2023-01-04,40,04/01/2023,12:00 AM
4,2023-01-05,50,05/01/2023,12:00 AM


#### Resampling DateTime

The `.resample()` method can be used to fill in missing dates and aggregate data (merging or combining data over a period of time) over a specified frequency.

**Original DataFrame**: <br>
The original DataFrame has data for the dates `2023-01-01`, `2023-01-03`, and `2023-01-05`, with missing data for `2023-01-02` and `2023-01-04`.

**Resampled DataFrame**: After resampling to a daily frequency:
- The dates `2023-01-02` and `2023-01-04` are added to the DataFrame.
- The values for these dates are `NaN` because there was no data for these dates in the original DataFrame.
- The values for the existing dates remain the same.

In [81]:
# Create a sample DataFrame with missing dates
data = {
    "date": ["2023-01-01", "2023-01-03", "2023-01-05"],
    "value": [10, 30, 50],
}
df = pd.DataFrame(data)

# Convert 'date' column to datetime
df["date"] = pd.to_datetime(df["date"])

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Set the 'date' column as the index
df.set_index("date", inplace=True)

# Resample the data to a daily frequency and calculate the mean
resampled_df = df.resample("D").mean()

# Display the resampled DataFrame
print("\nResampled DataFrame:")
print(resampled_df)

Original DataFrame:
        date  value
0 2023-01-01     10
1 2023-01-03     30
2 2023-01-05     50

Resampled DataFrame:
            value
date             
2023-01-01   10.0
2023-01-02    NaN
2023-01-03   30.0
2023-01-04    NaN
2023-01-05   50.0
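
The `NaN` rows introduced by resampling can then be filled with the techniques from section 1. The sketch below rebuilds the same three-row sample and shows two possible choices, time-based interpolation and forward fill:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 30, 50]},
    index=pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-05"]),
)

# Resampling to daily frequency inserts the missing dates as NaN rows
daily = df.resample("D").mean()

# Option 1: interpolate according to elapsed time
daily_interpolated = daily.interpolate(method="time")

# Option 2: carry the last known value forward
daily_ffilled = daily.ffill()

print(daily_interpolated)
print(daily_ffilled)
```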


#### Additional Time Series Calculations

The `diff()` method calculates the difference between consecutive elements in a Series. When applied to a datetime column, it computes the time difference between each date and the previous date.

**Why is `time_diff` always `1 days`?**
- The dates in the DataFrame are consecutive days: `2023-01-01`, `2023-01-02`, `2023-01-03`, `2023-01-04`, and `2023-01-05`.
- The difference between each consecutive date is exactly `1 day`.
- Therefore, the `time_diff` column shows `1 days` for each row except the first one, which has `NaT` because there is no previous date to compare with.

In [82]:
# Create a sample DataFrame
data = {
    "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"],
    "value": [10, 20, 30, 40, 50],
}
df = pd.DataFrame(data)

# Convert to datetime
df["date"] = pd.to_datetime(df["date"])

# Additional calculations
df["time_diff"] = df["date"].diff()  # Calculate the difference between dates
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

df

Unnamed: 0,date,value,time_diff,year,month,day
0,2023-01-01,10,NaT,2023,1,1
1,2023-01-02,20,1 days,2023,1,2
2,2023-01-03,30,1 days,2023,1,3
3,2023-01-04,40,1 days,2023,1,4
4,2023-01-05,50,1 days,2023,1,5


`shift()` Method:
- Shifts data forward or backward by a specified number of periods.
- Useful for creating lagged or lead variables.
- Example: `df["shifted"] = df["value"].shift(1)`

**1 Period**: `shift(1)` moves the values down by one row position, whatever the index type, and leaves the index unchanged.<br> 
With a daily datetime index, one row corresponds to one day, so each value appears one day later.<br>
To shift the datetime index itself by a time offset instead of moving the values, pass the `freq` argument.

In [83]:
# Create a sample DataFrame with a datetime index
data = {
    "date": pd.date_range(start="2023-01-01", periods=5, freq="D"),
    "value": [10, 20, 30, 40, 50],
}
df = pd.DataFrame(data)
df.set_index("date", inplace=True)

# Shift the 'value' column forward by 1 period (1 day)
df["shifted"] = df["value"].shift(1)

print("DataFrame with Shifted Values:")
print("'shifted' will take the value of the previous row.")
df

DataFrame with Shifted Values:
'shifted' will take the value of the previous row.


date,value,shifted
2023-01-01,10,
2023-01-02,20,10.0
2023-01-03,30,20.0
2023-01-04,40,30.0
2023-01-05,50,40.0
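
The direction of the shift and the `freq` argument are easy to mix up. As a small sketch on a made-up three-day Series: a positive period produces a lag (each row receives the previous row's value), a negative period produces a lead (each row receives the next row's value), and `freq` moves the index instead of the values.

```python
import pandas as pd

s = pd.Series(
    [10, 20, 30],
    index=pd.date_range(start="2023-01-01", periods=3, freq="D"),
)

lagged = s.shift(1)                 # values move down one row; index unchanged
lead = s.shift(-1)                  # values move up one row; index unchanged
index_moved = s.shift(1, freq="D")  # index moves one day later; values unchanged

print(lagged)
print(lead)
print(index_moved)
```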


`rolling()` Method:
- Performs calculations over a rolling window of a specified size.
- Useful for smoothing data and calculating rolling statistics like moving averages.
- Example: `df["rolling_mean"] = df["value"].rolling(window=3).mean()`

**Window Size**: The window size specifies the number of consecutive data points to include in each calculation.<br> 
For example, a window size of 3 means each calculation (e.g., mean) will be based on the current data point and the previous two data points.

In [84]:
# Create a sample DataFrame with a datetime index
data = {
    "date": pd.date_range(start="2023-01-01", periods=5, freq="D"),
    "value": [10, 20, 30, 40, 50],
}
df = pd.DataFrame(data)
df.set_index("date", inplace=True)

# Calculate the rolling mean with a window size of 3
df["rolling_mean"] = df["value"].rolling(window=3).mean()

print("DataFrame with Rolling Mean:")
df

DataFrame with Rolling Mean:


date,value,rolling_mean
2023-01-01,10,
2023-01-02,20,
2023-01-03,30,20.0
2023-01-04,40,30.0
2023-01-05,50,40.0


**NaN Values**: For the first two rows, there are not enough data points to fill the window size of 3, so the result is NaN.

**Rolling Mean Calculation:**<br>
**2023-01-01**: Only one data point (10), not enough to fill the window of 3, so the result is `NaN`.<br>
**2023-01-02**: Only two data points (10, 20), still not enough to fill the window of 3, so the result is `NaN`.<br>
**2023-01-03**: Three data points (10, 20, 30), the mean is calculated as `(10 + 20 + 30) / 3 = 20.0`.<br>
**2023-01-04**: Three data points (20, 30, 40), the mean is calculated as `(20 + 30 + 40) / 3 = 30.0`.<br>
**2023-01-05**: Three data points (30, 40, 50), the mean is calculated as `(30 + 40 + 50) / 3 = 40.0`.
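
If the leading `NaN` values are not wanted, `min_periods` lowers the number of observations required before a result is produced. With a datetime index, the window can also be given as a time offset such as `"3D"`. Both variants are sketched below on the same sample values:

```python
import pandas as pd

s = pd.Series(
    [10, 20, 30, 40, 50],
    index=pd.date_range(start="2023-01-01", periods=5, freq="D"),
)

# min_periods=1 computes a result as soon as one value is available
partial_mean = s.rolling(window=3, min_periods=1).mean()

# An offset string defines the window by elapsed time instead of row count
three_day_sum = s.rolling(window="3D").sum()

print(partial_mean.tolist())
print(three_day_sum.tolist())
```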

**Summary**
- `rolling()`: Useful for calculating moving averages, sums, or other statistics over a window of time. You might want to calculate rolling statistics (like mean, sum, etc.) on lagged data to understand trends and patterns.
- `shift()`: Useful for creating lagged features, aligning data, or comparing current values with past values. In time series forecasting, you often create lagged features to use past values as predictors for future values.

Combining these functions allows you to perform complex time series manipulations and feature engineering, which are essential for time series analysis, forecasting, and machine learning models.