import pandas as pd

# RECAP: What we learned in Chapter 4

## 1. Handling missing values

1. `.isnull()` -> Returns a df of the same shape in True/False form (True where a value is null, False otherwise).
2. `.isnull().sum()` -> Returns the number of null values in each column.
3. `.isnull().sum().sum()` -> Returns the total number of null values in the entire df.
4. `.notnull()`, `.notnull().sum()`, `.notnull().sum().sum()` -> The opposite of `.isnull()`: they flag and count the non-null values.
---
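
As a small sketch of the counting calls above, the snippet below builds a hypothetical DataFrame (the column names and values are made up for illustration) and counts its missing and present values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", None, "Meera"],
    "Age": [25, np.nan, 31, np.nan],
})

null_mask = df.isnull()                         # same shape as df, True where missing
nulls_per_column = df.isnull().sum()            # one count per column
total_nulls = int(df.isnull().sum().sum())      # single number for the whole df
total_present = int(df.notnull().sum().sum())   # count of non-null cells

print(nulls_per_column)
print(total_nulls, total_present)
```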

1. `.dropna()` -> Drops every row that contains at least one null value. (Returns a new df without those rows.)
2. `.dropna(inplace=True)` -> Use `inplace=True` to change the original df instead of returning a new one.
3. `.dropna(how='all')` -> Drops only the rows that are completely empty.
4. `.dropna(subset=[...])` -> Drops rows that have a null value in any of the specified columns.
---
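
The three `dropna` variants can be compared side by side. This is a minimal sketch on made-up data; the `Name` and `City` columns are assumptions for the example.

```python
import pandas as pd

# Hypothetical sample data: row 2 is completely empty.
df = pd.DataFrame({
    "Name": ["Asha", None, None, "Meera"],
    "City": ["Pune", "Delhi", None, None],
})

no_nulls = df.dropna()                      # keeps only rows without any null
not_all_empty = df.dropna(how="all")        # drops only the fully empty row
name_required = df.dropna(subset=["Name"])  # drops rows where Name is null

print(len(no_nulls), len(not_all_empty), len(name_required))
```

None of these calls changes `df` itself, because `inplace=True` is not used.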

1. `.fillna(0)` -> Fill with a constant value
2. `.fillna("XXXXXX")` -> Fill with a string  
   Match the fill value to the column type (use an int for an int column).  
   Real-world examples:  
   `df[...].fillna(df[...].mean())` - Fill with mean  
   `df[...].fillna(df[...].median())` - Fill with median  
3. `.ffill()` -> Forward fill (carry the previous valid value forward)
4. `.bfill()` -> Backward fill (carry the next valid value backward)
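
The fill strategies above can be sketched on a small, made-up DataFrame (the `Age` and `City` columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, np.nan],
    "City": ["Pune", None, "Delhi", None],
})

age_mean = df["Age"].fillna(df["Age"].mean())      # gaps become the column mean
age_median = df["Age"].fillna(df["Age"].median())  # gaps become the column median
city_filled = df["City"].fillna("Unknown")         # gaps become a constant string
age_ffill = df["Age"].ffill()                      # copy the previous valid value
age_bfill = df["Age"].bfill()                      # copy the next valid value

print(age_ffill.tolist())
print(age_bfill.tolist())
```

Note that `bfill` leaves the last row empty here, because there is no later value to copy from.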

## 2. Removing duplicates
`.drop_duplicates()`

1. `.duplicated()` -> Returns True for each row that repeats an earlier row (compares the full row).
2. `.duplicated().sum()` -> Returns the total number of duplicate rows.
---

1. `.drop_duplicates()` -> Removes duplicate rows and returns a fresh, clean df.
2. `.drop_duplicates(subset=["City"])` -> Checks for duplicates only in the specified column, so every row whose City repeats an earlier one is removed.
---

1. `.drop_duplicates(keep="first")` -> Default. Keeps the first occurrence of each duplicate.
2. `.drop_duplicates(keep="last")` -> Keeps the last occurrence of each duplicate.
3. `.drop_duplicates(keep=False)` -> Keeps none of them: every row that has a duplicate is removed.
---
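
To see how the `keep` options differ, here is a minimal sketch on made-up data in which the row `("A", "Pune")` appears three times:

```python
import pandas as pd

# Hypothetical sample data; rows 0, 2 and 4 are identical.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "C", "A"],
    "City": ["Pune", "Delhi", "Pune", "Goa", "Pune"],
})

dup_count = int(df.duplicated().sum())          # repeats of an earlier row
first_kept = df.drop_duplicates(keep="first")   # keeps rows 0, 1, 3
last_kept = df.drop_duplicates(keep="last")     # keeps rows 1, 3, 4
none_kept = df.drop_duplicates(keep=False)      # keeps rows 1, 3

print(dup_count)
print(list(first_kept.index), list(last_kept.index), list(none_kept.index))
```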

1. `.drop_duplicates(inplace=True)` -> Drops duplicates in place. The original df is modified and no new df is returned.
---

1. `.drop_duplicates().reset_index(drop=True)` -> Removes duplicates and resets the index. `drop=True` discards the old index instead of keeping it as a column.
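
A short sketch of de-duplicating on one column and then renumbering the rows. The data is made up, and `City` is simply the column chosen for the example:

```python
import pandas as pd

# Hypothetical sample data; "Pune" appears twice in City.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "City": ["Pune", "Delhi", "Pune", "Goa"],
})

unique_cities = df.drop_duplicates(subset=["City"])  # index is now 0, 1, 3
clean = unique_cities.reset_index(drop=True)         # index is 0, 1, 2 again

print(clean)
```

Without `drop=True`, the old index would be kept as an extra column named `index`.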

## 3. Changing data types

`.dtypes` -> Returns the data type of each column

---

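1. `df["Age"] = df["Age"].astype(int)` -> Converts the Age column to int. `.astype()` returns a new Series, so assign it back to keep the change.

2. To convert several columns of a df at once, pass a dict: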
```
df = df.astype({
    "Age": int,
    "Salary": float
})
```
3. Common target types:  
   integer -> `int`,  
   string -> `str`,  
   category -> `"category"`,  
   boolean -> `bool`

4. Error handling -> Use `pd.to_numeric(df["Age"], errors="coerce")`: values that cannot be converted become NaN (null) instead of raising an error

---
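
The conversions above might look like this in practice. This is a sketch on made-up data where both columns arrive as strings and one `Age` value is not a number:

```python
import pandas as pd

# Hypothetical sample data: numbers stored as text, plus one bad value.
df = pd.DataFrame({
    "Age": ["25", "31", "abc"],
    "Salary": ["1000", "2500.5", "300"],
})

# "abc" cannot be parsed, so errors="coerce" turns it into NaN.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Salary contains only valid numbers, so a plain astype is enough.
df = df.astype({"Salary": float})

print(df.dtypes)
print(df)
```

Calling `astype(int)` directly on the original `Age` column would raise an error because of `"abc"`, which is why `pd.to_numeric` is used first.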

## 4. Renaming Columns

1. Rename one column -> `df.rename(columns={"Old":"New"})`
2. Rename multiple columns -> 
```
df = df.rename(columns={
    "Old1": "New1",
    "Old2": "New2"
})
```
3. Rename inplace -> `df.rename(columns={"Old": "New"}, inplace=True)`
4. Rename all columns at once -> `df.columns = ["new1", "new2", "new3", ....]` (if the df contains 5 columns, the list must also contain 5 names, whether you change a name or not)
5. Auto change column names -> `df.columns = df.columns.str.replace(" ", "_")`: Spaces are replaced by underscore
6. `df.columns = df.columns.str.lower()`: All column names to lowercase
7. `df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")` : First strip then lower then replace the spaces with underscore
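
A minimal sketch of the renaming steps, using made-up column names with stray spaces and mixed case:

```python
import pandas as pd

# Hypothetical empty df; only the column names matter here.
df = pd.DataFrame(columns=[" First Name", "Last Name ", "AGE"])

# Rename a single column by mapping old name -> new name.
df = df.rename(columns={"AGE": "Age"})

# Normalize every name: trim, lowercase, spaces to underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(df.columns))
```

Stripping first matters: otherwise the leading and trailing spaces would also be turned into underscores.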

## 5. String operation

`df["..."].str.fun()`

---

1. Remove extra spaces  
   `.strip()` - Remove leading and trailing spaces  
   `.lstrip()` - Remove left side spaces / leading spaces  
   `.rstrip()` - Remove right side spaces / trailing spaces
---

2. Change letter case  
   `.lower()` - all lower case  
   `.upper()` - all upper case  
   `.title()` - first letter of each word capital, rest all lower case
---

3. String contains  
   `.contains("...")` - used for condition  
   ex. `df[df["Email"].str.contains("gmail")]`
---

4. Replace str values  
   `.replace("pune", "Pune", case=False)` - pune/PUNE/PUne... are replaced by Pune
---
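
The filtering and replacing steps can be sketched together. The data is made up, and `regex=True` is passed explicitly here as a cautious choice so that the `case=False` flag is honoured regardless of the pandas version's default:

```python
import pandas as pd

# Hypothetical sample data with inconsistent city spellings.
df = pd.DataFrame({
    "Email": ["a@gmail.com", "b@yahoo.com", "c@gmail.com"],
    "City": ["  pune ", "PUNE", "Delhi"],
})

# Boolean mask from .str.contains selects the matching rows.
gmail_users = df[df["Email"].str.contains("gmail")]

# Trim spaces, then replace any casing of "pune" with "Pune".
df["City"] = df["City"].str.strip().str.replace(
    "pune", "Pune", case=False, regex=True
)

print(gmail_users)
print(df["City"].tolist())
```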

5. Extract substring  
   `df["Gender"].str[:1]`
---

6. Length of each string  
   `df[".."].str.len()`
---

7. Split a string  
   `df[".."].str.split("@", expand=True)`
---

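8. Extract a particular part after split  
   `df[".."].str.split("@").str[0]` - part before the `@` in every row  
   `df[".."].str.split("@").str[1]` - part after the `@` in every row  
   (A plain `[0]` after `.str.split("@")` would select the row with index label 0, not the first part.)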
---
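
As a small illustration of splitting, the sketch below uses two made-up email addresses. It shows both `expand=True`, which returns a DataFrame of parts, and the `.str[i]` accessor, which picks one part from every row:

```python
import pandas as pd

# Hypothetical sample data.
emails = pd.Series(["asha@gmail.com", "ravi@yahoo.com"])

parts = emails.str.split("@", expand=True)   # DataFrame with columns 0 and 1
usernames = emails.str.split("@").str[0]     # part before "@" for each row
domains = emails.str.split("@").str[1]       # part after "@" for each row

print(parts)
print(usernames.tolist(), domains.tolist())
```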

9. Remove symbols and numbers  
    `df["..."].str.replace(r"[^a-zA-Z]", "", regex=True)`
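
A short sketch of the regex clean-up on made-up values. The pattern `[^a-zA-Z]` matches every character that is not an English letter, so digits, symbols and underscores are all removed:

```python
import pandas as pd

# Hypothetical sample data with digits and symbols mixed in.
names = pd.Series(["Asha123", "R@vi!", "Meera_01"])

letters_only = names.str.replace(r"[^a-zA-Z]", "", regex=True)
lengths = letters_only.str.len()   # length of each cleaned string

print(letters_only.tolist())
print(lengths.tolist())
```

Note that this pattern also removes spaces, so it suits single-word values better than full names.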

## 6. Date and Time conversion

1. Convert date column to DateTime:  
   `df["..."] = pd.to_datetime(df["..."])`
---

2. Extracting Year, Month and Day:  
   `df["..."].dt.year`  
   `df["..."].dt.month`  
   `df["..."].dt.day`
---

3. Extracting day name and month name:  
   `df["..."].dt.day_name()`  
   `df["..."].dt.month_name()`
---

4. Extract week number and quarter:  
   `df["..."].dt.isocalendar().week`  
   `df["..."].dt.quarter`
---
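
The `.dt` accessors can be tried on two made-up dates. The column names added below are assumptions for the example:

```python
import pandas as pd

# Hypothetical sample data stored as text.
df = pd.DataFrame({"Date": ["2023-01-20", "2023-05-10"]})
df["Date"] = pd.to_datetime(df["Date"])   # .dt only works on datetime columns

df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["DayName"] = df["Date"].dt.day_name()
df["MonthName"] = df["Date"].dt.month_name()
df["Week"] = df["Date"].dt.isocalendar().week   # ISO week number
df["Quarter"] = df["Date"].dt.quarter

print(df)
```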

5. Check min/max date:  
   `df["..."].min()` - (The column should be in datetime type)  
   `df["..."].max()` - (The column should be in datetime type)
---

6. Filter by date:  
   `df[df["..."] > "2023-05-10"]`  
   `df[(df["..."] > "2023-05-10") & (condition2)]`
---
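
A minimal sketch of checking the date range and filtering, on made-up data. The `Amount` column stands in for the second condition and is an assumption:

```python
import pandas as pd

# Hypothetical sample data; Date is already a datetime column.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-20", "2023-05-10", "2023-08-15"]),
    "Amount": [100, 250, 400],
})

earliest = df["Date"].min()
latest = df["Date"].max()

# A date string is compared against the datetime column directly.
after_may_10 = df[df["Date"] > "2023-05-10"]

# Each condition needs its own parentheses when combined with &.
filtered = df[(df["Date"] > "2023-01-31") & (df["Amount"] < 300)]

print(earliest, latest)
print(filtered)
```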

7. Custom date format conversion:  
   `pd.to_datetime(df["..."], format="%d-%m-%Y")`

   |Format|Meaning|
   |-|-|
   |%d|Day of month (01-31)|
   |%m|Month (01-12)|
   |%Y|Four-digit year|
   |%H|Hour (00-23)|
   |%M|Minute (00-59)|
   |%S|Second (00-59)|
---
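
The format codes can be combined to parse day-first text. This is a sketch with made-up values; without the explicit `format`, a string such as `10-05-2023` could be read as October 5 instead of May 10:

```python
import pandas as pd

# Hypothetical sample data in day-month-year order.
raw = pd.Series(["20-01-2023", "10-05-2023"])
dates = pd.to_datetime(raw, format="%d-%m-%Y")

# The same codes extend to a date that carries a time part.
stamp = pd.to_datetime("20-01-2023 10:30:00", format="%d-%m-%Y %H:%M:%S")

print(dates.tolist())
print(stamp)
```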


8. Date with time:  
   `pd.to_datetime("2023-01-20 10:30:00")`
---

9. Handle invalid dates:  
    `pd.to_datetime(df["..."], errors="coerce")`

---
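
As a sketch of handling invalid dates, the made-up Series below mixes one valid date with text and with a day that does not exist. With `errors="coerce"`, the bad entries become `NaT` (the datetime version of null) instead of raising an error:

```python
import pandas as pd

# Hypothetical sample data: one valid date and two invalid entries.
raw = pd.Series(["2023-01-20", "not a date", "2023-02-30"])

dates = pd.to_datetime(raw, errors="coerce")
invalid_count = int(dates.isna().sum())   # how many entries failed to parse

print(dates)
print(invalid_count)
```

The resulting `NaT` values can then be handled with the same `.dropna()` and `.fillna()` tools from section 1.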