## Data Cleaning Best Practices


In this tutorial, we will:

- Understand common data issues
- Learn best practices for cleaning datasets
- Handle date columns effectively

Let's dive in!

### 1. Importing Libraries and Loading Data


In [1]:
import pandas as pd
import numpy as np

# Example dataset with common data issues
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, 30, None, 40, 28],
    "City": ["New York", "Los Angeles", "Chicago", None, "Phoenix"],
    "Salary": ["50000", "45000", "40000", "NaN", "60000"],
    "Joining_Date": ["2020-01-15", "2019/07/10", None, "2021-03-25", "not a date"]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the dataset
print("Original Dataset:")
df

Original Dataset:


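|   | Name | Age | City | Salary | Joining_Date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000 | 2019/07/10 |
| 2 | Charlie | NaN | Chicago | 40000 | None |
| 3 | David | 40.0 | None | NaN | 2021-03-25 |
| 4 | None | 28.0 | Phoenix | 60000 | not a date |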


### 2. Identifying Missing Values


In [2]:
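# Count missing values per column. Salary reports 0 because its "NaN"
# entry is still a string at this point; Section 4 converts it to a real NaN.
print("\nMissing Values:")
df.isnull().sum()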


Missing Values:


Name            1
Age             1
City            1
Salary          0
Joining_Date    1
dtype: int64
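
The `Salary` count of 0 above can be misleading, because placeholder strings such as `"NaN"` are ordinary text to `isnull()`. Below is a minimal sketch of one way to surface them. The `placeholders` list and the standalone `salary` Series are illustrative assumptions, not part of the tutorial's DataFrame.

```python
import pandas as pd

# Standalone copy of the Salary values from the example dataset
salary = pd.Series(["50000", "45000", "40000", "NaN", "60000"], name="Salary")

# Hypothetical list of spellings that should count as "missing"
placeholders = ["NaN", "nan", "N/A", "NA", "null", ""]

# mask() replaces every entry where the condition is True with a real NaN
cleaned = salary.mask(salary.isin(placeholders))

print("Missing before:", int(salary.isnull().sum()))
print("Missing after: ", int(cleaned.isnull().sum()))
```

Which spellings belong in the list depends on where the data came from, so it is worth checking `value_counts()` on suspicious columns first.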

### 3. Handling Missing Values


In [3]:
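# Impute missing ages with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Label missing cities explicitly instead of guessing a value
df["City"] = df["City"].fillna("Unknown")
# Drop rows without a name. .copy() keeps the result independent of the
# original frame, which avoids SettingWithCopyWarning on later assignments.
df = df.dropna(subset=["Name"]).copy()
df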

|   | Name | Age | City | Salary | Joining_Date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000 | 2019/07/10 |
| 2 | Charlie | 30.75 | Chicago | 40000 | None |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 |


### 4. Correcting Data Types


In [4]:
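# Convert Salary from strings to floats; errors="coerce" turns anything
# that cannot be parsed as a number into NaN instead of raising an error
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")
df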


|   | Name | Age | City | Salary | Joining_Date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | 2019/07/10 |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | None |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 |
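
`errors="coerce"` is convenient, but it hides which entries failed to parse. The sketch below shows one way to list them by comparing the raw and converted columns. The sample values (`"45,000"`, `"n/a"`) are made up for illustration.

```python
import pandas as pd

# Made-up raw values: one clean number, two unparseable strings, one true gap
raw = pd.Series(["50000", "45,000", "n/a", None], name="Salary")

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN after conversion
lost = raw[converted.isna() & raw.notna()]

print("Values lost during conversion:")
print(lost)
```

Reviewing `lost` before moving on helps separate values that need a format fix (such as thousands separators) from values that are genuinely missing.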


### 5. Handling Duplicates


In [5]:
# Add a duplicate row for demonstration
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)
df

|   | Name | Age | City | Salary | Joining_Date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | 2019/07/10 |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | None |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 |
| 4 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 |


In [6]:
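# Drop duplicate rows (the first occurrence is kept). .copy() keeps the
# result independent, which avoids SettingWithCopyWarning on later assignments.
df = df.drop_duplicates().copy()
df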

|   | Name | Age | City | Salary | Joining_Date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | 2019/07/10 |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | None |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 |
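
`drop_duplicates()` with no arguments only removes rows that match in every column. When a record counts as a duplicate based on a key column alone, the `subset` and `keep` parameters are useful. A small sketch on a made-up `people` frame:

```python
import pandas as pd

# Made-up data: "Alice" appears twice with different cities
people = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice"],
    "city": ["New York", "Los Angeles", "Boston"],
})

# No row is an exact copy of another, so nothing is dropped here
exact = people.drop_duplicates()

# Treat rows with the same name as duplicates and keep the last one seen
by_name = people.drop_duplicates(subset=["name"], keep="last")

# Mark every row that shares its name with another row, for manual review
dup_mask = people.duplicated(subset=["name"], keep=False)

print(by_name)
print(dup_mask)
```

Inspecting `dup_mask` before dropping anything is a cheap way to confirm that the chosen key really identifies duplicates.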


### 6. Dealing with Outliers

In [7]:
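# Compute a z-score for each age and flag rows that lie more than
# 3 standard deviations from the mean
z_scores = (df["Age"] - df["Age"].mean()) / df["Age"].std()
outliers = df[z_scores.abs() > 3]
print("\nOutliers Detected:")
print(outliers)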


Outliers Detected:
Empty DataFrame
Columns: [Name, Age, City, Salary, Joining_Date]
Index: []
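
With only four rows, the z-score check above cannot flag anything: using the sample standard deviation, no value in a sample of size n can have |z| above (n - 1) / sqrt(n), which is 1.5 for n = 4. For small datasets, the interquartile range (IQR) rule is a common alternative. The sketch below compares the two on a standalone `ages` Series in which the value 120 is a made-up outlier.

```python
import pandas as pd

# Ages from the example plus one made-up extreme value (120)
ages = pd.Series([25.0, 30.0, 30.75, 40.0, 120.0], name="age")

# IQR rule: flag values more than 1.5 * IQR outside the middle 50%
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = ages[(ages < lower) | (ages > upper)]

# Z-score rule with the same threshold used in the tutorial
z = (ages - ages.mean()) / ages.std()
z_outliers = ages[z.abs() > 3]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

The 1.5 multiplier is a convention rather than a rule; a wider fence such as 3 * IQR flags only the most extreme values.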


### 7. Renaming Columns for Consistency


In [8]:
# Lowercase all column names so they follow a single convention
df.columns = df.columns.str.lower()
df

|   | name | age | city | salary | joining_date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | 2019/07/10 |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | None |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 |
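
Real column names often carry stray whitespace, spaces, or punctuation in addition to mixed case. A minimal sketch of a fuller normalization, assuming snake_case is the target convention (the `messy` frame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with inconsistent column names and no rows
messy = pd.DataFrame(columns=[" Name ", "Joining Date", "Annual-Salary"])

messy.columns = (
    messy.columns.str.strip()                      # remove surrounding whitespace
    .str.lower()                                   # lowercase everything
    .str.replace(r"[^0-9a-z]+", "_", regex=True)   # other characters become "_"
)

print(list(messy.columns))
```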


### 8. Handling Date Columns
Dates often come in different formats and need to be standardized.

In [9]:
# Example: The "joining_date" column has inconsistent formats and invalid values.
print("\nOriginal Joining_Date Column:")
df["joining_date"]


Original Joining_Date Column:


0    2020-01-15
1    2019/07/10
2          None
3    2021-03-25
Name: joining_date, dtype: object

In [10]:
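# Convert the "joining_date" column to datetime; errors="coerce" turns
# unparseable entries into NaT. Note that in the output below "2019/07/10"
# also becomes NaT: pandas 2.x infers a single format from the first value
# ("2020-01-15") and rejects strings that do not match it.
df["joining_date"] = pd.to_datetime(df["joining_date"], errors="coerce")
df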


|   | name | age | city | salary | joining_date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | NaT |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | NaT |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 |


In [11]:
# Display the converted date column
print("\nConverted Joining_Date Column:")
df["joining_date"]


Converted Joining_Date Column:


0   2020-01-15
1          NaT
2          NaT
3   2021-03-25
Name: joining_date, dtype: datetime64[ns]
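
In the converted column, Bob's `2019/07/10` was a valid date that ended up as `NaT` only because its separator differs from the first row's. One way to keep such values is to try each expected format in turn and fill the gaps from the next attempt. This is a sketch under the assumption that the formats in `known_formats` are the only ones present; the list and the standalone `raw` Series are illustrative.

```python
import pandas as pd

# Standalone copy of the original Joining_Date values
raw = pd.Series(["2020-01-15", "2019/07/10", None, "2021-03-25", "not a date"])

# Assumed set of formats that occur in the data
known_formats = ["%Y-%m-%d", "%Y/%m/%d"]

# Parse with the first format, then fill remaining NaT from later formats
parsed = pd.to_datetime(raw, format=known_formats[0], errors="coerce")
for fmt in known_formats[1:]:
    attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

print(parsed)
```

Only the truly missing and invalid entries remain `NaT` after this pass. pandas 2.0 and later also accept `format="mixed"` in `pd.to_datetime`, which infers the format per element, at the cost of less predictable parsing for ambiguous strings.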

In [12]:
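# Fill missing or invalid dates with a default value. The placeholder
# 2000-01-01 looks like a real date, so it also flows into any columns
# derived from joining_date.
df["joining_date"] = df["joining_date"].fillna(pd.Timestamp("2000-01-01"))
df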


|   | name | age | city | salary | joining_date |
|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | 2000-01-01 |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | 2000-01-01 |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 |
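
Once a default date is filled in, imputed rows are indistinguishable from real ones. A small sketch of one way to keep that information: record a boolean flag before filling. The column name `joining_date_imputed` and the standalone `dates` Series are assumptions made for illustration.

```python
import pandas as pd

# Standalone datetime column with two missing entries
dates = pd.Series(pd.to_datetime(["2020-01-15", None, None, "2021-03-25"]))

# Remember which rows were missing before they are overwritten
was_missing = dates.isna()
filled = dates.fillna(pd.Timestamp("2000-01-01"))

frame = pd.DataFrame({
    "joining_date": filled,
    "joining_date_imputed": was_missing,  # hypothetical flag column
})

print(frame)
```

Later analysis can then exclude flagged rows, for example with `frame[~frame["joining_date_imputed"]]`, instead of treating the year 2000 as a genuine joining year.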


In [13]:
# Add new columns for year, month, and day for analysis
df["joining_year"] = df["joining_date"].dt.year
df["joining_month"] = df["joining_date"].dt.month
df["joining_day"] = df["joining_date"].dt.day
df


|   | name | age | city | salary | joining_date | joining_year | joining_month | joining_day |
|---|---|---|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 | 2020 | 1 | 15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | 2000-01-01 | 2000 | 1 | 1 |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | 2000-01-01 | 2000 | 1 | 1 |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 | 2021 | 3 | 25 |


In [14]:
# Display the dataset after handling dates
print("\nDataset After Handling Dates:")
df


Dataset After Handling Dates:


|   | name | age | city | salary | joining_date | joining_year | joining_month | joining_day |
|---|---|---|---|---|---|---|---|---|
| 0 | Alice | 25.0 | New York | 50000.0 | 2020-01-15 | 2020 | 1 | 15 |
| 1 | Bob | 30.0 | Los Angeles | 45000.0 | 2000-01-01 | 2000 | 1 | 1 |
| 2 | Charlie | 30.75 | Chicago | 40000.0 | 2000-01-01 | 2000 | 1 | 1 |
| 3 | David | 40.0 | Unknown | NaN | 2021-03-25 | 2021 | 3 | 25 |
