# 🧹 **Data Cleaning & Preprocessing in Pandas**

Real-world data is often messy, but with the powerful tools Pandas provides, you can clean and transform data before diving into analysis. Let's walk through some essential techniques!

---

## 🚨 **Handling Missing Values**

### 🔍 **1. Check for Missing Data**
To identify missing values (NaNs), you can use:

```python
df.isnull()              # Returns True for NaNs
df.isnull().sum()        # Count of missing values per column
```
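
As a small sketch of how these checks look in practice, the snippet below builds a hypothetical `sample` DataFrame (not the tutorial's dataset) and computes the count and share of missing values per column, plus the rows that contain at least one gap. The variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Name": ["Alice", None, "Bob", "Dana"],
    "Age": [25.0, 28.0, np.nan, np.nan],
})

missing_counts = sample.isnull().sum()                 # missing values per column
missing_share = sample.isnull().mean() * 100           # percentage missing per column
rows_with_gaps = sample[sample.isnull().any(axis=1)]   # rows with at least one missing value

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```

Looking at the share as well as the count can help you decide whether dropping or filling is the more reasonable next step.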

### ❌ **2. Drop Missing Data**

If you want to remove rows or columns that contain missing values:

```python
df.dropna()              # Drop rows with *any* missing values
df.dropna(axis=1)        # Drop columns with missing values
```
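
`dropna()` also accepts parameters that narrow what gets dropped. The sketch below uses a hypothetical `sample` DataFrame to show `subset`, `how="all"`, and `thresh`; the data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Name": ["Alice", None, "Bob", None],
    "Age": [25.0, 28.0, np.nan, np.nan],
    "City": ["New York", "Delhi", "Mumbai", None],
})

only_named = sample.dropna(subset=["Name"])   # drop rows where Name is missing
not_empty = sample.dropna(how="all")          # drop rows where every value is missing
mostly_full = sample.dropna(thresh=2)         # keep rows with at least 2 non-missing values

print(only_named)
print(not_empty)
print(mostly_full)
```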

### ✅ **3. Fill Missing Data**

You can replace missing data with specific values or methods:

```python
df.fillna(0)                     # Replace NaN with 0
df["Age"].fillna(df["Age"].mean())  # Replace with mean value
df.ffill()      # Forward fill (propagate previous values)
df.bfill()      # Backward fill (propagate next values)
```
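
Note that `fillna()` returns a new object and leaves the original untouched unless you assign the result. A minimal sketch, assuming a hypothetical `sample` DataFrame, that fills each column with its own value by passing a dict:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Name": ["Alice", None, "Bob"],
    "Age": [20.0, np.nan, 30.0],
})

# A dict maps each column name to the value used for its missing entries.
filled = sample.fillna({"Name": "Unknown", "Age": sample["Age"].mean()})

print(sample["Age"].isnull().sum())  # the original still has its missing value
print(filled)
```

The `"Unknown"` placeholder is an assumption; pick whatever sentinel makes sense for your data.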

---

## 🔄 **Detecting & Removing Duplicates**

### 🧐 **1. Identify Duplicates**

Check for duplicates in your dataset:

```python
df.duplicated()          # Returns True for duplicate rows
```

### 🧹 **2. Remove Duplicates**

To remove duplicate rows:

```python
df.drop_duplicates()     # Remove duplicate rows
```

### 👀 **3. Check for Duplicates Based on Specific Columns**

You can also check duplicates for specific columns:

```python
df.duplicated(subset=["Name", "Age"])
```
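
Both `duplicated()` and `drop_duplicates()` take a `keep` parameter that controls which copy counts as the original. The following sketch uses a hypothetical `sample` DataFrame to compare the options:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "Age": [25, 30, 25, 26],
})

# Default keep="first": only later copies are flagged.
first_kept = sample.duplicated(subset=["Name", "Age"])
# keep=False: every row that has a duplicate is flagged.
all_marked = sample.duplicated(subset=["Name", "Age"], keep=False)
# keep="last": retain the last row seen for each Name.
deduped = sample.drop_duplicates(subset=["Name"], keep="last")

print(first_kept.tolist())
print(all_marked.tolist())
print(deduped)
```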

---

## ✂️ **String Operations with `.str`**

Pandas provides vectorized string methods that you can apply to entire columns. Here are some useful ones:

### 🔤 **1. Transform Text**

You can manipulate strings easily:

```python
df["Name"].str.lower()  # Convert all names to lowercase
df["City"].str.contains("delhi", case=False)  # Check if 'delhi' is in the city name (case-insensitive)
```

### 🔄 **2. Split Strings**

Split strings into lists:

```python
df["Email"].str.split("@")  # Split email addresses at "@"
```
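
If you would rather have separate columns than lists, `str.split()` accepts `expand=True`. A short sketch with a hypothetical `emails` Series; the column names `user` and `domain` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
emails = pd.Series(["alice@example.com", "bob@example.com", "charlie@example"])

parts = emails.str.split("@", expand=True)   # DataFrame with one column per piece
parts.columns = ["user", "domain"]           # hypothetical column names
domains = emails.str.split("@").str[1]       # take the second piece of each list

print(parts)
print(domains)
```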

### ✂️ **3. Chain Methods**

For more advanced clean-up, chain methods:

```python
df["Name"].str.strip().str.upper()  # Remove extra spaces and convert to uppercase
```
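
Chained `.str` calls pass missing values through untouched, so you usually do not need to drop them first. A minimal sketch on a hypothetical `names` Series:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
names = pd.Series(["  alice ", "BOB", None, " charlie"])

# Strip surrounding whitespace, then capitalize the first letter of each word.
cleaned = names.str.strip().str.title()

print(cleaned)
```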

---

## 🔢 **Type Conversions with `.astype()`**

Sometimes you need to convert the type of a column. Use `.astype()` for basic types, and `pd.to_datetime()` for date handling.

### 🧑‍🔬 **1. Convert Data Types**

Convert columns to the appropriate data types:

```python
df["Age"] = df["Age"].astype(int)
df["Category"] = df["Category"].astype("category")
```
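
One thing to watch for: `.astype(int)` raises an error if the column still contains missing values. The sketch below, using a hypothetical `ages` Series, shows the failure and one possible workaround, pandas' nullable `"Int64"` dtype (note the capital "I"). Dropping or filling the missing values first is another option.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
ages = pd.Series([25.0, np.nan, 30.0])

try:
    ages.astype(int)                     # NaN cannot be stored in a plain int column
except (ValueError, TypeError) as exc:
    print(f"astype(int) failed: {exc}")

nullable_ages = ages.astype("Int64")     # nullable integer dtype keeps the gap as <NA>
print(nullable_ages)
```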

### 📅 **2. Convert Dates with `pd.to_datetime()`**

`pd.to_datetime()` converts a column of date strings (or numbers) to datetimes. The basic call is shown here; for ambiguous layouts such as day-first dates, also pass `format` or `dayfirst=True`:

```python
df["Date"] = pd.to_datetime(df["Date"])
```

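This function can:

* Parse common date layouts (e.g., "YYYY-MM-DD", "MM/DD/YYYY"). A single column should normally use one layout; pandas 2.0+ accepts `format="mixed"` when it does not.
* Convert UNIX timestamps to datetime objects when you pass `unit` (e.g., `unit="s"`).
* Parse timezone offsets in the input, or convert everything to UTC with `utc=True`.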
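
To make these options concrete, here is a hedged sketch on a hypothetical `raw` Series of day-first strings. It assumes the dates follow a `DD-MM-YYYY` layout and uses `errors="coerce"` so unparseable entries become `NaT` instead of raising.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
raw = pd.Series(["01-05-2021", "15-06-2020", "not a date", None])

# Explicit day-month-year format; anything that does not match becomes NaT.
dates = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

# A UNIX timestamp in seconds, converted with unit="s".
epoch = pd.to_datetime(1609459200, unit="s")

print(dates)
print(epoch)
```

Stating the format explicitly avoids "01-05-2021" being read as January 5 when you meant May 1.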

### 🧐 **3. Check Column Data Types**

Check the data types of your columns:

```python
df.dtypes
```

---

## 🛠️ **Applying Functions**

Pandas provides flexible ways to apply custom functions to your data using `.apply()`, `.map()`, and `.replace()`.

### 🖋️ **1. Apply Custom Functions**

Use `.apply()` to run a function on each value of a column (or, on a DataFrame, on each row or column):

```python
df["Age Group"] = df["Age"].apply(lambda x: "Adult" if x >= 18 else "Minor")
```
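
For a simple two-way condition like this one, a vectorized alternative is `numpy.where`, which avoids calling a Python function per element. This is a sketch of that swapped-in technique on a hypothetical `ages` Series, not a replacement for `.apply()` in general:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
ages = pd.Series([25, 17, 30, 18])

via_apply = ages.apply(lambda x: "Adult" if x >= 18 else "Minor")
# np.where picks the first label where the condition holds, the second otherwise.
via_where = pd.Series(np.where(ages >= 18, "Adult", "Minor"), index=ages.index)

print(via_apply.tolist())
print(via_where.tolist())
```

Be aware that a missing age compares as `False` here, so it would land in the second label; handle missing values first if that matters.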

### 🗺️ **2. Map Values**

Use `.map()` for element-wise transformations:

```python
gender_map = {"M": "Male", "F": "Female"}
df["Gender"] = df["Gender"].map(gender_map)
```
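
Keep in mind that when you pass a dict to `.map()`, any value without a matching key becomes `NaN`. A small sketch with a hypothetical `gender` Series; the `"Unknown"` fallback label is an assumption:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
gender = pd.Series(["M", "F", "X"])
gender_map = {"M": "Male", "F": "Female"}

mapped = gender.map(gender_map)             # "X" has no key, so it becomes NaN
with_fallback = mapped.fillna("Unknown")    # hypothetical fallback label

print(mapped)
print(with_fallback)
```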

### 🔄 **3. Replace Specific Values**

Use `.replace()` to swap values:

```python
df["City"].replace({"Del": "Delhi", "Mum": "Mumbai"})  # Returns a new Series; assign it back to keep the change
```
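
Unlike `.map()` with a dict, `.replace()` leaves values that are not listed as they are. A minimal sketch on a hypothetical `city` Series, which also shows that the original is unchanged until you assign the result:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
city = pd.Series(["Del", "Mum", "Pune"])

replaced = city.replace({"Del": "Delhi", "Mum": "Mumbai"})

print(replaced.tolist())   # "Pune" is not in the dict, so it is kept as-is
print(city.tolist())       # the original Series is unchanged
```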

---

## 📚 **Summary**

Here's a quick recap of the essential tools for data cleaning:

* **Handling Missing Data:** Use `isnull()`, `fillna()`, and `dropna()` to handle missing values.
* **Detecting Duplicates:** Use `duplicated()` to find duplicates and `drop_duplicates()` to remove them.
* **String Operations:** Use `.str` to clean and manipulate string data.
* **Type Conversions:** Convert data types using `.astype()` and handle dates with `pd.to_datetime()`.
* **Applying Functions:** Transform columns with `.apply()`, `.map()`, and `.replace()`.

---

### 🚨 **Remember!**

Data cleaning is often said to take up around **80%** of the time in real projects. Get comfortable with these tools, and you'll save a lot of time in the long run!

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("data_cleaning_sample.csv")

In [4]:
df

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,,28.0,Delhi,F,eve@domain.com,
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,,Delhi,M,charlie@example,20-07-2021


In [5]:
df.isnull()           # Returns True for each null value and False otherwise

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,False,False,False,False,False,False
1,False,True,False,False,False,False
2,False,False,False,False,False,False
3,False,True,False,False,False,False
4,False,False,False,False,False,False
5,True,False,False,False,False,True
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,True,False,False,False,False


In [6]:
df.isnull().sum()      # Returns the number of null values in each column

Name         1
Age          3
City         0
Gender       0
Email        0
Join Date    1
dtype: int64

In [7]:
df.dropna()            # Returns a new DataFrame with only the rows that have no null values

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021


In [8]:
df.dropna(axis=1)       # Drops every column that contains at least one null value

Unnamed: 0,City,Gender,Email
0,New York,F,alice@example.com
1,Delhi,M,charlie@example
2,Los Angeles,M,bob@example.com
3,Delhi,M,charlie@example
4,Mumbai,M,david@example.com
5,Delhi,F,eve@domain.com
6,New York,F,alice@example.com
7,New York,F,alice@example.com
8,Delhi,M,charlie@example


In [9]:
df.fillna(0)        # Replace null with 0

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,0.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,0.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,0,28.0,Delhi,F,eve@domain.com,0
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,0.0,Delhi,M,charlie@example,20-07-2021


In [10]:
df["Age"].fillna(df["Age"].mean())      # Replace missing Age values with the mean of Age

0    25.000000
1    25.833333
2    30.000000
3    25.833333
4    22.000000
5    28.000000
6    25.000000
7    25.000000
8    25.833333
Name: Age, dtype: float64

In [11]:
df.fillna(method="ffill")     # Deprecated since pandas 2.1; df.ffill() is the replacement

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,25.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,30.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,David,28.0,Delhi,F,eve@domain.com,12-11-2019
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,25.0,Delhi,M,charlie@example,20-07-2021


In [12]:
df.fillna(method="bfill")     # Deprecated since pandas 2.1; df.bfill() is the replacement

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,30.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,22.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,Alice,28.0,Delhi,F,eve@domain.com,01-05-2021
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,,Delhi,M,charlie@example,20-07-2021


In [13]:
df.ffill()

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,25.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,30.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,David,28.0,Delhi,F,eve@domain.com,12-11-2019
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,25.0,Delhi,M,charlie@example,20-07-2021


In [14]:
df.bfill()

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,30.0,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
3,Charlie,22.0,Delhi,M,charlie@example,20-07-2021
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,Alice,28.0,Delhi,F,eve@domain.com,01-05-2021
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021
8,Charlie,,Delhi,M,charlie@example,20-07-2021


In [15]:
df.duplicated()           # True for each row that repeats an earlier row

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7     True
8     True
dtype: bool

In [16]:
df.drop_duplicates()      # Remove duplicate rows, keeping the first occurrence of each

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
1,Charlie,,Delhi,M,charlie@example,20-07-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
5,,28.0,Delhi,F,eve@domain.com,


In [17]:
df["Name"].str.lower()

0      alice
1    charlie
2        bob
3    charlie
4      david
5        NaN
6      alice
7      alice
8    charlie
Name: Name, dtype: object

In [18]:
df["City"].str.contains('delhi', case=False)     # Case-insensitive match

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8     True
Name: City, dtype: bool

In [19]:
df["Email"].str.split('@')

0    [alice, example.com]
1      [charlie, example]
2      [bob, example.com]
3      [charlie, example]
4    [david, example.com]
5       [eve, domain.com]
6    [alice, example.com]
7    [alice, example.com]
8      [charlie, example]
Name: Email, dtype: object

In [20]:
type(df["Email"].str.split('@')[0])

list

In [21]:
df2 = df.dropna().copy()

In [22]:
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25.0,New York,F,alice@example.com,01-05-2021
2,Bob,30.0,Los Angeles,M,bob@example.com,15-06-2020
4,David,22.0,Mumbai,M,david@example.com,12-11-2019
6,Alice,25.0,New York,F,alice@example.com,01-05-2021
7,Alice,25.0,New York,F,alice@example.com,01-05-2021


In [23]:
df2["Age"] = df2["Age"].astype(int)

In [24]:
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date
0,Alice,25,New York,F,alice@example.com,01-05-2021
2,Bob,30,Los Angeles,M,bob@example.com,15-06-2020
4,David,22,Mumbai,M,david@example.com,12-11-2019
6,Alice,25,New York,F,alice@example.com,01-05-2021
7,Alice,25,New York,F,alice@example.com,01-05-2021


In [25]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 7
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Name       5 non-null      object
 1   Age        5 non-null      int32 
 2   City       5 non-null      object
 3   Gender     5 non-null      object
 4   Email      5 non-null      object
 5   Join Date  5 non-null      object
dtypes: int32(1), object(5)
memory usage: 260.0+ bytes


In [26]:
df2["Age Group"] = df2["Age"].apply(lambda x: "Adult" if x >= 25 else "Minor")   # This cell uses a cutoff of 25, not 18

In [27]:
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date,Age Group
0,Alice,25,New York,F,alice@example.com,01-05-2021,Adult
2,Bob,30,Los Angeles,M,bob@example.com,15-06-2020,Adult
4,David,22,Mumbai,M,david@example.com,12-11-2019,Minor
6,Alice,25,New York,F,alice@example.com,01-05-2021,Adult
7,Alice,25,New York,F,alice@example.com,01-05-2021,Adult


In [28]:
gender_map = {"M": "Male", "F": "Female", "O": "Other"}
df2["Gender"] = df2["Gender"].map(gender_map)

In [29]:
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date,Age Group
0,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
2,Bob,30,Los Angeles,Male,bob@example.com,15-06-2020,Adult
4,David,22,Mumbai,Male,david@example.com,12-11-2019,Minor
6,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
7,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult


In [30]:
df2["City"] = df2["City"].replace({"Delhi": "New Delhi", "Mumbai": "New Mumbai"})

In [31]:
df2

Unnamed: 0,Name,Age,City,Gender,Email,Join Date,Age Group
0,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
2,Bob,30,Los Angeles,Male,bob@example.com,15-06-2020,Adult
4,David,22,New Mumbai,Male,david@example.com,12-11-2019,Minor
6,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
7,Alice,25,New York,Female,alice@example.com,01-05-2021,Adult
