
# Data Cleaning Techniques with Pandas

Cleaning your data is an essential first step in any data analysis process. Dirty data can mislead your models and visualizations. Below are the most common data cleaning techniques using Python and the Pandas library.

## 1. Handling Missing Data
Real-world data is often incomplete. You need strategies to deal with missing values effectively.

### Detecting Missing Values



In [2]:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Score': [90, 85, None, 95]
})

print(df)
print(df.isnull())        # True wherever data is missing
print(df.isnull().sum())  # Count of missing values per column

    Name   Age  Score
0  Alice  25.0   90.0
1    Bob   NaN   85.0
2   None  35.0    NaN
3  David  40.0   95.0
    Name    Age  Score
0  False  False  False
1  False   True  False
2   True  False   True
3  False  False  False
Name     1
Age      1
Score    1
dtype: int64
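Raw counts are easier to judge alongside the share of missing values and the rows they affect. A minimal sketch on the same sample frame; the names `missing_share` and `rows_with_gaps` are illustrative and not part of the original notebook:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Score': [90, 85, None, 95]
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = df.isnull().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

Here each column is 25% missing, and rows 1 and 2 are the ones with gaps.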


### Removing Missing Data
Remove rows where any value is missing:

In [3]:
df_cleaned = df.dropna()
df_cleaned

Unnamed: 0,Name,Age,Score
0,Alice,25.0,90.0
3,David,40.0,95.0


### Remove rows where all values are missing:


In [4]:
df_cleaned = df.dropna(how='all')
df_cleaned

Unnamed: 0,Name,Age,Score
0,Alice,25.0,90.0
1,Bob,,85.0
2,,35.0,
3,David,40.0,95.0


### Remove columns with missing values:

In [5]:
df_cleaned = df.dropna(axis=1)
df_cleaned

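Unnamed: 0
0
1
2
3

Every column in `df` has at least one missing value, so all three columns are dropped and only the index remains.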
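Dropping on "any" or "all" is often too coarse. As a sketch of two finer-grained options, `dropna()` also accepts `subset` (check only certain columns) and `thresh` (minimum number of non-missing values a row must have). The variable names below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Score': [90, 85, None, 95]
})

# Drop a row only when 'Age' is missing
by_age = df.dropna(subset=['Age'])
print(by_age)

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)
print(at_least_two)
```

With this data, `subset=['Age']` removes only row 1, while `thresh=2` removes only row 2, which has a single non-missing value.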


### Filling Missing Data

Fill with a constant:

In [6]:
df_filled = df.fillna(0)
df_filled

Unnamed: 0,Name,Age,Score
0,Alice,25.0,90.0
1,Bob,0.0,85.0
2,0,35.0,0.0
3,David,40.0,95.0
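Note that a single constant is applied to every column, so the missing `Name` above becomes `0` as well. One way to avoid that, sketched here with placeholder fill values chosen purely for illustration, is to pass a dictionary that maps each column to its own replacement:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Score': [90, 85, None, 95]
})

# Per-column fill values; 'Unknown' and 0 are illustrative placeholders
df_filled = df.fillna({'Name': 'Unknown', 'Age': 0, 'Score': 0})
print(df_filled)
```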


### Forward fill (propagate last valid value forward):




In [7]:
df_filled = df.ffill()  # replaces the deprecated fillna(method='ffill')
df_filled


Unnamed: 0,Name,Age,Score
0,Alice,25.0,90.0
1,Bob,25.0,85.0
2,Bob,35.0,85.0
3,David,40.0,95.0


### Backward fill (propagate next valid value backward):


In [8]:
df_filled = df.bfill()  # replaces the deprecated fillna(method='bfill')
df_filled


Unnamed: 0,Name,Age,Score
0,Alice,25.0,90.0
1,Bob,35.0,85.0
2,David,35.0,95.0
3,David,40.0,95.0


### Fill with mean/median:


In [9]:
# Assign the result back; fillna(..., inplace=True) on a selected column is deprecated
df['Age'] = df['Age'].fillna(df['Age'].mean())  # use .median() for a median fill
df


Unnamed: 0,Name,Age,Score
0,Alice,25.0,90.0
1,Bob,33.333333,85.0
2,,35.0,
3,David,40.0,95.0
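The same idea extends to several columns at once. The sketch below fills every numeric column with its own median, which is less sensitive to outliers than the mean. Selecting columns with `select_dtypes` is one possible approach, not the only one:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Score': [90, 85, None, 95]
})

# Pick the numeric columns, then fill each with its own median
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
print(df)
```

The non-numeric `Name` column is left untouched, so its missing value still needs a separate decision.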


## 2. Detecting and Fixing Incorrect or Inconsistent Data

Bad entries often sneak in, especially with strings.

### Common Issues:
- Typos (e.g., `"Calgray"` vs `"Calgary"`)

- Inconsistent casing (`"calgary"` vs `"Calgary"`, `"YES"` vs `"yes"`)

- Unexpected characters or symbols
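Before fixing anything, it helps to see which variants are actually present. A small sketch using `.unique()` and `.value_counts()` on made-up city values (the sample data is hypothetical):

```python
import pandas as pd

# Hypothetical raw values with mixed case and stray whitespace
cities = pd.Series(['Calgary', 'calgary', ' Toronto', 'Vancouver', 'CALGARY'])

# Distinct raw values reveal the inconsistent variants
print(cities.unique())

# Strip whitespace and normalize case, then count each category
normalized = cities.str.strip().str.title()
print(normalized.value_counts())
```

The raw series has five distinct values; after normalization only three remain.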

### Example: Fixing inconsistent categories

In [10]:
df = pd.DataFrame({'City': ['Calgary', 'calgary', 'Toronto', 'Vancouver', 'Calgary']})

# Standardize to lowercase
df['City'] = df['City'].str.lower()

# Capitalize consistently
df['City'] = df['City'].str.title()

df

Unnamed: 0,City
0,Calgary
1,Calgary
2,Toronto
3,Vancouver
4,Calgary
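Case normalization does not help with genuine misspellings. One simple option, sketched below, is an explicit correction mapping passed to `.replace()`. The misspellings and the `corrections` dictionary are invented for illustration, and in practice you would build it from the values that `.unique()` reveals:

```python
import pandas as pd

# Hypothetical data containing misspelled city names
cities = pd.Series(['Calgary', 'Calgray', 'Toronto', 'Tornto'])

# Map each known misspelling to its correct form (exact-match replacement)
corrections = {'Calgray': 'Calgary', 'Tornto': 'Toronto'}
fixed = cities.replace(corrections)
print(fixed)
```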


## 3. Handling Duplicate Data
Duplicate rows can bias your results or skew aggregates.

### Detecting Duplicates

In [12]:
df.duplicated()        # Boolean Series: True for each repeat of an earlier row
df.duplicated().sum()  # Number of duplicate rows

np.int64(2)

### Removing Duplicates

In [14]:
df_unique = df.drop_duplicates()
df_unique

Unnamed: 0,City
0,Calgary
2,Toronto
3,Vancouver


### Remove duplicates based on a specific column:



In [16]:
df_unique = df.drop_duplicates(subset='City')
df_unique

Unnamed: 0,City
0,Calgary
2,Toronto
3,Vancouver


### Keep the last occurrence:


In [17]:
df_unique = df.drop_duplicates(keep='last')
df_unique

Unnamed: 0,City
2,Toronto
3,Vancouver
4,Calgary
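With a single-column frame, `subset` makes no visible difference. The sketch below uses a hypothetical `orders` frame to show duplicates defined by a combination of columns, plus `keep=False`, which marks every member of a duplicate group so you can inspect them before deleting anything:

```python
import pandas as pd

# Hypothetical data: the same customer/city pair appears twice
orders = pd.DataFrame({
    'Customer': ['Ann', 'Ann', 'Ben', 'Ann'],
    'City': ['Calgary', 'Calgary', 'Toronto', 'Toronto'],
    'Amount': [10, 12, 7, 10]
})

# Show all rows that share a (Customer, City) pair, not only the repeats
all_dupes = orders[orders.duplicated(subset=['Customer', 'City'], keep=False)]
print(all_dupes)

# Keep one row per (Customer, City) pair, preferring the last one seen
deduped = orders.drop_duplicates(subset=['Customer', 'City'], keep='last')
print(deduped)
```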


## Summary

| Task                     | Method                              |
| ------------------------ | ----------------------------------- |
| Detect missing values    | `isnull()`, `notnull()`             |
| Remove missing data      | `dropna()`                          |
| Fill missing data        | `fillna()`                          |
| Detect incorrect values  | `.unique()`, `.value_counts()`      |
| Normalize strings        | `.str.lower()`, `.str.title()`      |
| Detect/Remove duplicates | `duplicated()`, `drop_duplicates()` |


## Best Practices
- Always inspect your data with `.info()` and `.describe()` first.

- Use `.isnull().sum()` to see how many values are missing in each column.

- Be cautious when dropping or filling data; either choice can unintentionally bias your dataset.

- Normalize string data before running analyses or merges.
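
The practices above can be combined into a short pass over a dataset. This is only a sketch on a hypothetical `raw` frame; the order of steps and the choice of a median fill are assumptions that should be adapted to your data:

```python
import pandas as pd

# Hypothetical raw data with inconsistent strings, gaps, and a duplicate
raw = pd.DataFrame({
    'City': ['Calgary', 'calgary ', 'Toronto', None],
    'Age': [25, 25, None, 40]
})

raw.info()              # dtypes and non-null counts
print(raw.describe())   # summary statistics for numeric columns
print(raw.isnull().sum())

cleaned = raw.copy()
# Normalize strings first so that duplicates become detectable
cleaned['City'] = cleaned['City'].str.strip().str.title()
# Fill numeric gaps with the median (an assumption; pick what suits your data)
cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].median())
# Drop rows with no city, then remove exact duplicates
cleaned = cleaned.dropna(subset=['City']).drop_duplicates()
print(cleaned)
```

Normalizing strings before calling `drop_duplicates()` matters here: `'Calgary'` and `'calgary '` only count as duplicates once they have been cleaned.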