In [1]:
import pandas as pd

# End of Chapter Recap

## 1. Adding & Removing columns

1. Create new columns:
    1. Direct assignment - `df["col_name"] = [....]` or `df["Bonus"] = df["Salary"] * 0.10`
    2. Using `assign()` - `df.assign(Col_Name=[...])`
    3. Using `np.where()` (requires `import numpy as np`) - `df["Level"] = np.where(df["Experience"] >= 3, "Senior", "Junior")`
    4. Using `apply()` - `df["Tax"] = df["Salary"].apply(lambda x: x * 0.05)`
2. Deleting columns:
    1. Single column - `df.drop("Col_name", axis=1)` (returns a new DataFrame)
    2. Multiple columns - `df.drop(["col1", "col2",...], axis=1)`
    3. To modify the original df - pass the extra parameter `inplace=True`, or assign the result back
    4. Using `del` - `del df["col"]` (deletes in place)
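
A minimal sketch of the column operations above. The sample data and the names `df2` and `trimmed` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Salary": [50000, 60000, 70000],
    "Experience": [1, 3, 5],
})

# Direct assignment adds the column to df itself.
df["Bonus"] = df["Salary"] * 0.10

# assign() returns a new DataFrame and leaves df untouched.
df2 = df.assign(Tax=df["Salary"] * 0.05)

# np.where picks one of two values per row based on a condition.
df2["Level"] = np.where(df2["Experience"] >= 3, "Senior", "Junior")

# drop() returns a new DataFrame; df2 still has both columns afterwards.
trimmed = df2.drop(["Bonus", "Tax"], axis=1)

# del removes the column from df2 itself.
del df2["Tax"]
```

Note that `trimmed` and `df2` differ at the end: only `del` (or `inplace=True`) changes the DataFrame it is called on.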

## 2. Sorting Data

1. `.sort_values()`:
    1. `df.sort_values("Col_Name")` - Ascending order
    2. `df.sort_values("Col_Name", ascending=False)` - Descending order
    3. `df.sort_values(["Col1", "col2",...])` - Multiple columns: sorts by col1 first, then by col2 to break ties, ...
    4. `df.sort_values(["Col1", "col2",...], ascending=[True, False])` - Multiple columns with a different order for each column
    5. `df.sort_values("col", na_position="first")` or `na_position="last"` - Puts NaN values first or last (default is last)
2. `.sort_index()` - Ascending and `.sort_index(ascending=False)` - Descending

Both methods return a new sorted DataFrame. To change the original df, use `inplace=True` or assign the result back.
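
A short sketch of the sorting calls above, using made-up data with one missing value and a shuffled index.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a NaN and an unsorted index.
df = pd.DataFrame(
    {"Dept": ["HR", "IT", "IT", "HR"], "Salary": [40000, np.nan, 55000, 45000]},
    index=[3, 1, 2, 0],
)

# Descending by Salary, with the NaN row placed first.
by_salary = df.sort_values("Salary", ascending=False, na_position="first")

# Dept ascending, then Salary descending within each Dept.
by_both = df.sort_values(["Dept", "Salary"], ascending=[True, False])

# Sort by index labels; df itself is unchanged because inplace is not used.
by_index = df.sort_index()
```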

## 3. Reindexing & Alignment

Reindexing: changing the order of rows & columns, adding new index labels & removing existing ones.  
1. `.reindex()` Rows:  
Parameters:
    1. List/Series of labels - `[0, 1, 2, 3,...]` or `["a", "b", "c",....]`
    2. fill_value - `fill_value=0` or `fill_value="!"` -- Fills the rows of newly added index labels with 0 / "!" instead of NaN.

2. `.reindex()` Columns:
Parameters:
    1. columns - `columns=["Name", "Marks"]`
    2. fill_value - `fill_value=0` or `fill_value="!"` -- Fills newly added columns with 0 / "!" instead of NaN.
3. Aligning Data:  
Suppose we have two Series, s1 & s2, and compute s1 + s2. The operation is applied only to rows whose index labels appear in both.  
All other rows become NaN.  
This means the data in the two Series is aligned by index, more precisely on the common index labels.
4. Align DataFrame:  
DataFrame alignment is done with the `.align()` method.  
`df1.align(df2)` -> This returns two DataFrames. Think of them as left and right: left comes from df1 & right from df2.  
So `left, right = df1.align(df2)`  
Here too the alignment is done on the basis of labels (by default on both rows and columns, using an outer join).  
Rows and columns that exist only in df2 are filled with NaN in left, and vice versa.
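
The reindexing and alignment behaviour described above can be sketched as follows. All data and variable names here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"Name": ["Asha", "Ben"], "Marks": [80, 90]}, index=["a", "b"])

# Row reindex: "c" is a new label, so its cells take fill_value.
rows = df.reindex(["b", "a", "c"], fill_value=0)

# Column reindex: "Name" is dropped, "Grade" is new and filled with "!".
cols = df.reindex(columns=["Marks", "Grade"], fill_value="!")

# Series arithmetic aligns on index labels; unmatched labels give NaN.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "c"])
total = s1 + s2

# align() reindexes both frames to the union of their labels (outer join by default).
df1 = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"y": [3, 4]}, index=["b", "c"])
left, right = df1.align(df2)
```

After `align()`, `left` and `right` have identical index and column labels, which is what makes later element-wise operations line up.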

## 4. Merging, Joining & Concatenating DataFrames

**Part 1: `pd.merge()` (SQL style joins)**
1. Inner join (default): `pd.merge(df1, df2, on="common_col")`
2. Left join : `pd.merge(df1, df2, on="common_col", how="left")`
3. Right join : `pd.merge(df1, df2, on="common_col", how="right")`
4. Outer join : `pd.merge(df1, df2, on="common_col", how="outer")`
5. Different key column names : `pd.merge(df1, df2, left_on="col_in_df1", right_on="col_in_df2")`
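
A small sketch of the join types above. The `employees` and `salaries` frames and the renamed `id` column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: emp_id 1 has no salary, emp_id 4 has no name.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [60000, 70000, 80000]})

# Inner join keeps only keys present in both frames.
inner = pd.merge(employees, salaries, on="emp_id")

# Left join keeps every row of employees; missing salaries become NaN.
left = pd.merge(employees, salaries, on="emp_id", how="left")

# Outer join keeps keys from both frames.
outer = pd.merge(employees, salaries, on="emp_id", how="outer")

# When the key columns have different names, use left_on / right_on.
pay = salaries.rename(columns={"emp_id": "id"})
mixed = pd.merge(employees, pay, left_on="emp_id", right_on="id")
```

With `left_on` / `right_on`, both key columns (`emp_id` and `id`) are kept in the result.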

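**Part 2: `.join()` (Index-based joining)**  
`.join()` is used when we want to join DataFrames on an index.  
By default it matches the index of df1 against the index of df2, and it performs a left join.  
With `on=`, a column of df1 is matched against the index of df2, so df2 has to be indexed by the common column first: `df2.set_index("common_col", inplace=True)`  
1. `df1.join(df2, on="common_col")`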
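
A hedged sketch of `.join()`. In pandas, `on=` names a column of the calling DataFrame, and that column is matched against the index of the other DataFrame. The `orders` and `customers` frames are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; customer "c9" has no matching row in customers.
orders = pd.DataFrame({"order_id": [101, 102, 103], "cust_id": ["c1", "c2", "c9"]})
customers = pd.DataFrame({"cust_id": ["c1", "c2"], "city": ["Pune", "Delhi"]})

# The other frame must carry the key in its index.
customers_idx = customers.set_index("cust_id")

# on="cust_id" is a column of orders (the caller); the default is a left join.
joined = orders.join(customers_idx, on="cust_id")

# If both frames are indexed by the key, no `on` is needed.
both = orders.set_index("cust_id").join(customers_idx, how="inner")
```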

**Part 3: `pd.concat()` (Stacking DataFrames)**  
1. Row-wise concatenation (default) -  
`pd.concat([df1, df2])` - Keeps the index labels as they were in df1/df2  
`pd.concat([df1, df2], ignore_index=True)` - Sets a fresh index  
The DataFrames are stacked row-wise: common columns are lined up, and columns missing from one DataFrame are filled with NaN.

2. Column-wise concatenation -  
`pd.concat([df1, df2], axis=1)`  
Rows are matched on the index; index labels missing from one DataFrame are filled with NaN.
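
A brief sketch of both directions of `pd.concat()`, with made-up frames that share only column `A`.

```python
import pandas as pd

# Hypothetical sample data: both frames have "A"; "B" and "C" are unique.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "C": [7, 8]})

# Row-wise: original index labels are kept, so 0 and 1 repeat.
stacked = pd.concat([df1, df2])

# Row-wise with a fresh 0..n-1 index; non-shared columns get NaN.
fresh = pd.concat([df1, df2], ignore_index=True)

# Column-wise: rows are matched on the index, columns are placed side by side.
side = pd.concat([df1, df2], axis=1)
```

Column-wise concatenation does not merge same-named columns, so `side` ends up with two columns called `A`.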

## 5. GroupBy & Aggregation

split->apply->combine  
1. Basic `groupby()` - One column
    1. `df.groupby("col1")["col2"].mean()`  
        Group by "col1", then apply the mean() aggregate function to "col2".
    2. Agg functions - mean, median, count, min, max, sum,...
2. Multiple agg functions
   1. `df.groupby("col1")["col2"].agg(["mean", "sum", "max",...])`  
      Multiple agg functions are applied in a single statement.
   2. ```
        df.groupby("col1").agg(
            col_name=("col2", "agg_fun"),
            ...
        )
      ```
3. `groupby()` multiple columns
    1. `df.groupby(["col1", ....])["col"].mean()`
    2. ```
        df.groupby(["col1", "col2"]).agg(
            col_name=("col3", "agg_fun"),
            ...
        )
        ```
4. `transform()`  
    `df["col_name"] = df.groupby("col1")["col2"].transform("mean")` - Returns the same number of rows as the original, with each row receiving its group's value.  
   The shape stays the same.
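
The groupby patterns above, sketched on made-up data. The output column names `avg_salary`, `headcount` and `Dept_Mean` are illustrative choices.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Dept": ["HR", "HR", "IT", "IT"],
    "City": ["Pune", "Delhi", "Pune", "Pune"],
    "Salary": [40000, 50000, 60000, 80000],
})

# One group column, one aggregation: a Series indexed by Dept.
mean_by_dept = df.groupby("Dept")["Salary"].mean()

# Several aggregations at once: one result column per function.
stats = df.groupby("Dept")["Salary"].agg(["mean", "sum", "max"])

# Named aggregation on multiple group columns.
named = df.groupby(["Dept", "City"]).agg(
    avg_salary=("Salary", "mean"),
    headcount=("Salary", "count"),
)

# transform() broadcasts the group result back to every original row.
df["Dept_Mean"] = df.groupby("Dept")["Salary"].transform("mean")
```

`mean_by_dept` has one row per group, while `Dept_Mean` has one value per original row, which is the difference between aggregating and transforming.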

## 6. Pivot tables & Crosstab

<hr>

```
pd.pivot_table(
    df,
    values = ".." / [List],
    index = ".." / [List],
    columns = ".." / [List],
    aggfunc = ".." / [List]
)
```  
Groups the data and aggregates the values into a table  
1. **Parameters:**
    1. df -> DataFrame that holds the data
    2. values -> Column(s) on which the aggregate function is applied
    3. index -> column or list of columns whose values become the rows of the result table
    4. columns -> column or list of columns whose values become the columns of the result table
    5. aggfunc -> single function or list of functions to apply to values
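
A minimal sketch of `pd.pivot_table()` with concrete arguments. The data and the column names `Dept`, `Gender` and `Salary` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; there is no male employee in IT.
df = pd.DataFrame({
    "Dept": ["HR", "HR", "IT", "IT"],
    "Gender": ["F", "M", "F", "F"],
    "Salary": [40000, 50000, 60000, 80000],
})

# Rows come from Dept, columns from Gender, cells hold the mean Salary.
table = pd.pivot_table(
    df,
    values="Salary",
    index="Dept",
    columns="Gender",
    aggfunc="mean",
)
```

Combinations that never occur in the data (here IT / M) show up as NaN in the result.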


<hr>

```
pd.crosstab(
    df[".."],
    df[".."],
    normalize = True / False / "index" / "columns"
)
```
It counts how often each pair of values occurs.  
For two columns it answers **"How many times does each pair occur?"**
1. **Parameters:**
    1. First column (its values become the rows)
    2. Second column (its values become the columns)
    3. normalize -> returns fractions instead of counts (multiply by 100 for percentages)
        1. True -> fraction of the overall total
        2. "index" -> fraction within each row
        3. "columns" -> fraction within each column
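
A short sketch of `pd.crosstab()` with and without `normalize`. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Dept": ["HR", "HR", "IT", "IT"],
    "Gender": ["F", "M", "F", "F"],
})

# Raw counts of each (Dept, Gender) pair; pairs that never occur count as 0.
counts = pd.crosstab(df["Dept"], df["Gender"])

# Each row sums to 1.
row_share = pd.crosstab(df["Dept"], df["Gender"], normalize="index")

# All cells together sum to 1.
overall = pd.crosstab(df["Dept"], df["Gender"], normalize=True)
```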

<hr>

## 7. Reshaping

1. `melt()` -> It converts columns into rows (wide to long).  
   - id_vars - Columns to keep as they are.  
   - value_vars - Columns to unpivot. If not specified, all columns not listed in id_vars are used.  
   - var_name - Name of the new column that holds the old column names.  
   - value_name - Name of the new column that holds the values.  
   - ignore_index - True: fresh index (default), False: keep the original index  
2. `stack()` -> It keeps the index as it is and moves the column labels into a new inner index level, so every column is turned into rows.  
   With single-level columns it returns a Series; use `.reset_index()` or `.to_frame()` if a DataFrame is needed.  
3. `unstack()` -> Moves the inner index level back into columns, converting the stacked Series back into the original df.  
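
A minimal sketch of the three reshaping methods on made-up marks data. The names `wide`, `long`, `indexed`, `stacked` and `restored` are illustrative.

```python
import pandas as pd

# Hypothetical wide-format data: one column per subject.
wide = pd.DataFrame({
    "Name": ["Asha", "Ben"],
    "Math": [80, 90],
    "Science": [85, 95],
})

# melt(): subject columns become rows, identified by Name.
long = pd.melt(
    wide,
    id_vars="Name",
    value_vars=["Math", "Science"],
    var_name="Subject",
    value_name="Marks",
)

# stack(): column labels move into an inner index level, giving a Series.
indexed = wide.set_index("Name")
stacked = indexed.stack()

# unstack(): the inner index level moves back into columns.
restored = stacked.unstack()
```

`melt()` works on ordinary columns, while `stack()` / `unstack()` work on the index, which is why `Name` is set as the index first.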