#### Data Cleaning & Preprocessing
Real-world data is messy. Pandas provides tools to clean and transform it before analysis.

---

##### Handling Missing Values
---

**Check for Missing Data**

In [36]:
import pandas as pd

data = {
    'name':['Alice', 'Bob', None, 'Charlie', 'Viru'],
    'age':[20, 22, None, 33, 43],
    'city': ['Mumbai', 'Pune', None, 'Delhi', 'Mumbai']
}

df = pd.DataFrame(data)

print(df.isnull())         # True where a value is missing (NaN/None)
print(df.isnull().sum())   # Count missing values per column

    name    age   city
0  False  False  False
1  False  False  False
2   True   True   True
3  False  False  False
4  False  False  False
name    1
age     1
city    1
dtype: int64
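
To go from counts to the affected rows themselves, a boolean mask built from `isnull()` can be used for selection. A minimal sketch reusing the same sample data; the names `rows_with_missing` and `missing_share` are illustrative and not part of the original notebook:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Charlie', 'Viru'],
    'age': [20, 22, None, 33, 43],
    'city': ['Mumbai', 'Pune', None, 'Delhi', 'Mumbai'],
})

# Select rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Fraction of missing values per column (the mean of a boolean column)
missing_share = df.isnull().mean()
print(missing_share)
```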


**Drop Missing Data**

In [37]:
df.dropna()          # Drop rows with any missing value (returns a new DataFrame)
df.dropna(axis=1)    # Drop columns with any missing value; here every column has one, so none remain

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
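
Dropping on any missing value is often too aggressive. The sketch below shows the `how`, `subset`, and `thresh` parameters of `dropna()` on a slightly modified copy of the sample data (Bob's age is removed here purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Charlie', 'Viru'],
    'age': [20, None, None, 33, 43],   # Bob's age removed for this example
    'city': ['Mumbai', 'Pune', None, 'Delhi', 'Mumbai'],
})

# Drop a row only if every value in it is missing
all_missing_dropped = df.dropna(how='all')

# Drop rows where 'age' is missing, ignoring other columns
age_required = df.dropna(subset=['age'])

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

print(all_missing_dropped.index.tolist())
print(age_required.index.tolist())
print(at_least_two.index.tolist())
```

None of these calls modifies `df`; assign the result back (for example `df = df.dropna(subset=['age'])`) to keep it.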


**Fill Missing Data**

In [38]:
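# Each call returns a new object and leaves df unchanged; only the last result is displayed
df.fillna(0)                         # Replace missing values with 0
df['age'].fillna(df['age'].mean())   # Replace missing ages with the column mean
df.ffill()                           # Forward fill: carry the previous valid value down
df.bfill()                           # Backward fill: pull the next valid value up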

      name   age    city
0    Alice  20.0  Mumbai
1      Bob  22.0    Pune
2  Charlie  33.0   Delhi
3  Charlie  33.0   Delhi
4     Viru  43.0  Mumbai
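
A single fill value rarely suits every column. As a sketch, `fillna()` also accepts a dictionary mapping column names to fill values; the `'Unknown'` placeholder and the choice of the mean for `age` are illustrative assumptions, not recommendations from the original notebook:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Charlie', 'Viru'],
    'age': [20, 22, None, 33, 43],
    'city': ['Mumbai', 'Pune', None, 'Delhi', 'Mumbai'],
})

# Choose a fill value per column and store the result in a new variable
filled = df.fillna({
    'name': 'Unknown',
    'age': df['age'].mean(),   # mean of the non-missing ages
    'city': 'Unknown',
})
print(filled)
```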


---

**Detecting and Removing Duplicates**

`df.duplicated()` returns a boolean Series where *True* means the row duplicates an earlier row and *False* means it is the first occurrence (not a duplicate yet).

```Python
df.duplicated()         # True for rows that repeat an earlier row
df.drop_duplicates()    # Return a copy with duplicate rows removed
```

Check based on specific columns:
```Python
df.duplicated(subset=['name', 'age'])   # Column names are case-sensitive
```
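
The sample DataFrame above has no repeated rows, so here is a small sketch with a hypothetical `people` table that shows how `subset` and `keep` change the result:

```python
import pandas as pd

# Hypothetical data with a repeated (name, age) pair
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Alice'],
    'age': [20, 22, 20, 25],
    'city': ['Mumbai', 'Pune', 'Delhi', 'Mumbai'],
})

# No two rows are identical across all columns
full_row_dupes = people.duplicated()

# Rows 0 and 2 share the same name and age; row 2 is flagged
name_age_dupes = people.duplicated(subset=['name', 'age'])

# keep=False flags every member of a duplicate group
all_copies = people.duplicated(subset=['name', 'age'], keep=False)

# keep='last' retains the last occurrence instead of the first
keep_last = people.drop_duplicates(subset=['name', 'age'], keep='last')

print(name_age_dupes.tolist())
print(keep_last)
```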

---

**String Operations with `.str`**

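The `.str` accessor applies a string method to every element of a Series at once and returns a new pandas object (usually a Series). Missing values are passed through instead of raising an error: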

In [39]:
print(df['name'].str.lower())    # Convert all names to lowercase

print(df['city'].str.contains('Delhi', case=False))  # Check if 'Delhi' is in the city column, case-insensitive.

df['email'] = ['alice@gmail.com', 'bob@gmail.com', None, 'charlie@gmail.com', 'viru@gmail.com']

print(df['email'].str.split('@'))

0      alice
1        bob
2       None
3    charlie
4       viru
Name: name, dtype: object
0    False
1    False
2     None
3     True
4    False
Name: city, dtype: object
0      [alice, gmail.com]
1        [bob, gmail.com]
2                    None
3    [charlie, gmail.com]
4       [viru, gmail.com]
Name: email, dtype: object


We can also chain methods, such as `str.strip().str.upper()`, for more complex operations.
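
A short sketch of such a chain, using a hypothetical `cities` Series with stray whitespace and inconsistent casing. The `na=False` argument is one way to get a plain boolean mask when missing values are present:

```python
import pandas as pd

# Hypothetical messy input
cities = pd.Series(['  mumbai', 'Pune  ', None, ' delhi '])

# Strip surrounding whitespace, then upper-case; the missing value stays missing
cleaned = cities.str.strip().str.upper()
print(cleaned)

# Treat missing values as "no match" so the result can be used as a filter
has_delhi = cities.str.contains('delhi', case=False, na=False)
print(cities[has_delhi])
```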

---

**Type Conversions with `.astype()`**

Convert column data types:

In [40]:
import pandas as pd

df = pd.read_csv('data.csv')

df.head()

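# Note: Sales contains missing values. dropna() removes them before the cast, but the
# assignment re-aligns on the index, so those rows become NaN again and the column
# ends up as float64 (as df.info() reports).
df['Sales'] = df['Sales'].dropna().astype(int)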
df['Date'] = pd.to_datetime(df['Date'])
df['Category'] = df['Category'].astype('category')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      50 non-null     datetime64[ns]
 1   Category  50 non-null     category      
 2   Value     47 non-null     float64       
 3   Product   50 non-null     object        
 4   Sales     46 non-null     float64       
 5   Region    50 non-null     object        
dtypes: category(1), datetime64[ns](1), float64(2), object(2)
memory usage: 2.3+ KB
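
In the output above, `Sales` is still `float64` because a plain NumPy integer column cannot hold missing values. One option, sketched here on a small hypothetical Series and assuming the values are whole numbers, is pandas' nullable `Int64` dtype:

```python
import pandas as pd

# Hypothetical sales figures with one missing entry
sales = pd.Series([754.0, None, 398.0, 522.0], name='Sales')

# Nullable integer dtype keeps the missing entry as <NA>
sales_int = sales.astype('Int64')
print(sales_int)

# For comparison: dropna + astype(int) followed by assignment re-aligns
# on the index, which brings the NaN back and yields float64
frame = pd.DataFrame({'Sales': sales})
frame['Sales'] = frame['Sales'].dropna().astype(int)
print(frame['Sales'].dtype)
```

Note that `astype('Int64')` raises an error if a float has a fractional part, so this only applies when the data really is integer-valued.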


**Why is `pd.to_datetime()` special?**

Unlike `.astype()`, which performs a straightforward cast between data types (integers, floats, strings, and so on), `pd.to_datetime()` is designed to:
- Handle different date formats (e.g. "YYYY-MM-DD", "MM/DD/YYYY").
- Handle mixed input (e.g. date strings alongside `NaT` or other missing values).
- Convert integer timestamps (e.g. UNIX time) into datetime objects.
- Recognize timezones if provided.
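
The points above can be illustrated with a few small calls. This is a sketch on made-up inputs, using the `format`, `errors`, `unit`, and `utc` parameters of `pd.to_datetime()`:

```python
import pandas as pd

# Explicit format for month/day/year strings
us = pd.to_datetime(pd.Series(['01/31/2023', '02/01/2023']), format='%m/%d/%Y')

# Unparseable or missing entries become NaT instead of raising
mixed = pd.to_datetime(pd.Series(['2023-01-01', 'not a date', None]),
                       errors='coerce')

# Integer UNIX timestamps, interpreted as seconds since the epoch
from_unix = pd.to_datetime(pd.Series([1672531200]), unit='s')

# Strings with a UTC offset, normalized to UTC
with_tz = pd.to_datetime(pd.Series(['2023-01-01T12:00:00+05:30']), utc=True)

print(us, mixed, from_unix, with_tz, sep='\n')
```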

Check data types:
```Python
df.dtypes
```

---

**Applying Functions**

1. `.apply()` -> Apply a function to each element of a Series, or to each row or column of a DataFrame.

In [41]:
df.head()

# NaN >= 500 evaluates to False, so rows with missing Sales are also labelled "Bad"
df['group_sales'] = df['Sales'].apply(lambda x: "Good" if x >= 500 else "Bad")

df.head()

         Date  Category  Value   Product  Sales  Region  group_sales
0  2023-01-01         A   28.0  Product1  754.0    East         Good
1  2023-01-02         B   39.0  Product3  110.0   North          Bad
2  2023-01-03         C   32.0  Product2  398.0    East          Bad
3  2023-01-04         B    8.0  Product1  522.0    East         Good
4  2023-01-05         B   26.0  Product3  869.0   North         Good
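
Two variations on `.apply()` are sketched below on a small hypothetical table: a named function that treats missing sales separately (the `'Unknown'` label is an illustrative choice), and a row-wise call with `axis=1` that combines two columns:

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the data, with one missing Sales value
sales = pd.DataFrame({
    'Sales': [754.0, 110.0, np.nan],
    'Value': [28.0, 39.0, 32.0],
})

def label_sales(x):
    """Label a single sales figure, keeping missing values distinct."""
    if pd.isna(x):
        return 'Unknown'
    return 'Good' if x >= 500 else 'Bad'

# Series.apply: the function receives one element at a time
sales['group_sales'] = sales['Sales'].apply(label_sales)

# DataFrame.apply with axis=1: the function receives one row at a time
sales['sales_per_value'] = sales.apply(
    lambda row: row['Sales'] / row['Value'], axis=1
)
print(sales)
```

For simple arithmetic like the ratio above, the vectorized form `sales['Sales'] / sales['Value']` gives the same result and is usually faster; `axis=1` is mainly useful when the logic cannot be expressed column-wise.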


2. `.map()` -> Element-wise mapping for Series

In [42]:
product_map = {
    'Product1': 1,
    'Product2': 2,
    'Product3':3
}

df['Product'] = df['Product'].map(product_map)

df.head()

         Date  Category  Value  Product  Sales  Region  group_sales
0  2023-01-01         A   28.0        1  754.0    East         Good
1  2023-01-02         B   39.0        3  110.0   North          Bad
2  2023-01-03         C   32.0        2  398.0    East          Bad
3  2023-01-04         B    8.0        1  522.0    East         Good
4  2023-01-05         B   26.0        3  869.0   North         Good
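
One thing to watch with a dictionary passed to `.map()`: values that have no key in the dictionary become `NaN`. A small sketch with a hypothetical `'Product4'` entry, where the `-1` fallback code is an arbitrary illustrative choice:

```python
import pandas as pd

# 'Product4' is deliberately absent from the mapping
products = pd.Series(['Product1', 'Product2', 'Product4'])
product_map = {'Product1': 1, 'Product2': 2, 'Product3': 3}

# Unmapped values turn into NaN
mapped = products.map(product_map)
print(mapped)

# Optionally replace the gaps with a sentinel and restore an integer dtype
mapped_filled = mapped.fillna(-1).astype(int)
print(mapped_filled)
```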


3. `.replace()` -> Replace specific values

In [46]:
df['Region'] = df['Region'].replace({'East': 'E', 'North': 'N', 'West':'W', 'South':'S'})
df['Region']

0     E
1     N
2     E
3     E
4     N
5     W
6     E
7     W
8     W
9     W
10    N
11    W
12    S
13    E
14    W
15    N
16    S
17    W
18    W
19    E
20    S
21    S
22    N
23    W
24    E
25    W
26    N
27    N
28    N
29    S
30    W
31    W
32    S
33    E
34    W
35    W
36    E
37    N
38    S
39    W
40    E
41    E
42    W
43    E
44    E
45    W
46    S
47    W
48    N
49    N
Name: Region, dtype: object
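
With `.replace()`, values that do not appear in the dictionary are left as they are, so it can be worth checking for leftovers after recoding. A sketch with a hypothetical `'Central'` region that is not covered by the mapping:

```python
import pandas as pd

# 'Central' has no entry in the mapping
regions = pd.Series(['East', 'North', 'Central'])
codes = {'East': 'E', 'North': 'N', 'West': 'W', 'South': 'S'}

# Matching values are recoded; everything else passes through unchanged
replaced = regions.replace(codes)
print(replaced)

# Find values that were not recoded
unexpected = replaced[~replaced.isin(list(codes.values()))]
print(unexpected)
```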