# Selection and Assignment



- **Selection (getter)** --> used to retrieve values from pandas objects.
- **Assignment (setter)** --> used to modify or update values.
- pandas uses the same API style for both selecting and assigning values, for consistency and flexibility.
- Selection starts simple and progresses to advanced indexing, including hierarchical indexing with `pd.MultiIndex`.


###  **Basic Selection from a Series**

A `pd.Series` allows selection by **position** or by **label**.

* It behaves like:
    * A **list** when selecting by position
    * A **dictionary** when selecting by label
* Using `[]` retrieves values by **index label**, not position (unless using default index).
* Passing a single value returns a **scalar**, while passing a list returns a **Series**.
* **Slicing** works similarly to Python lists for selecting ranges and subsets of data.
* **Index labels** can be integers, strings, or any hashable type.

---

### ⚠️ **Important Series Selection Notes:**

* When the index contains integers, selection is by **label**, not position.
    * This can cause subtle bugs if index values resemble positions.
* When index values are **duplicated**, selecting a label returns a **Series** instead of a scalar.

---

### **Basic Selection from a DataFrame**

Using `[]` on a DataFrame primarily selects **columns**, not rows.

* A column can be retrieved as:
    * A **Series** (single label)
    * A **DataFrame** (list of labels)
* A list of labels can also:
    * Select multiple columns
    * Reorder columns in the output
* Using **slicing** (`[:]`) on a DataFrame selects **rows**, not columns.

---

### **Position-Based Selection of a Series (`.iloc`)**

`.iloc` selects data strictly by **integer position**.

* **Eliminates ambiguity** between label-based and position-based indexing.
* Supports:
    * Single position
    * Multiple positions
    * Slices
* Always uses **numeric positions** (0-based indexing).

---

### **Position-Based Selection of a DataFrame (`.iloc`)**

`.iloc` on a DataFrame accepts two indexers:

1.  First $\rightarrow$ **row position**
2.  Second $\rightarrow$ **column position**

* Allows exact extraction of:
    * Single values
    * Entire rows or columns
    * Selected submatrices
* **Empty slices** (`:`) return all data along that axis, e.g., `df.iloc[:, 0]` for every row of the first column.
* **List-based selection** (e.g., `df.iloc[[0, 1], :]`) avoids automatic reduction to Series and preserves DataFrame shape.

---

### **Best Practice Rule**

| Usage Type | Recommended Method | Notes |
| :--- | :--- | :--- |
| ✅ **Position-based access** | Use **`.iloc`** | Strict integer-based selection. |
| ✅ **Label-based access** | Use **`.loc`** | Strict label-based selection. |
| ❌ **Ambiguous access** | **Avoid** `[]` when index labels are numeric | Can lead to bugs due to label vs. position confusion. |

---

### 🧠 **One-line summary:**

Selection retrieves data, **`.iloc`** removes ambiguity, and DataFrame indexing prioritizes **columns** over rows when using `[]`.

### Basic Selection from a `pd.Series`



Selection from a `pd.Series` involves retrieving values by **position** or by **label**.

This is similar to:
* Accessing by index in a **list**
* Accessing by key in a **dictionary**

A `pd.Series` behaves like a Python container (list, tuple, dict), so it supports the `[]` operator.

In [2]:
import pandas as pd
import numpy as np

ser = pd.Series(list("abc") * 3)
ser

Unnamed: 0,0
0,a
1,b
2,c
3,a
4,b
5,c
6,a
7,b
8,c


In Python, the `[]` operator is used to access values from a container, and in a `pd.Series` it works the same way by selecting data using the index.

In [3]:
ser[3]

'a'

Using `ser[3]`, pandas looks for the label `3` in the Series index and returns the corresponding value.
If you want the result as a `pd.Series` instead of a single value, pass the label inside a list so the index label is preserved along with the data.

In [4]:
ser[[3]]

Unnamed: 0,0
3,a


If you give pandas a list with more than one item, it will return all the matching values from the Series at once.


In [5]:
ser[[0, 2]]

Unnamed: 0,0
0,a
2,c


When a `pd.Series` uses the default index, slicing works just like slicing a Python list, allowing you to extract a range of elements based on position (start included, stop excluded).

In [6]:
ser[:3]

Unnamed: 0,0
0,a
1,b
2,c


Pandas allows negative slicing, so you can easily select elements from the end of a `pd.Series`.

In [7]:
ser[-4:]

Unnamed: 0,0
5,c
6,a
7,b
8,c


You can use start and stop values in slicing to select a specific range of positions in a `pd.Series` (start included, stop excluded).

In [8]:
ser[2:6]

Unnamed: 0,0
2,c
3,a
4,b
5,c


#### Step Slicing

Slicing can also include a step value, letting you pick elements at fixed intervals (for example, every third element within a selected range).

In [9]:
ser[1:8:3]

Unnamed: 0,0
1,b
4,b
7,b


#### Series with String Index

In [10]:
ser = pd.Series(range(3), index = ['Jack', 'Jill', 'Jayne'])
ser

Unnamed: 0,0
Jack,0
Jill,1
Jayne,2


#### Select by label

In [11]:
ser[['Jill']]

Unnamed: 0,0
Jill,1


#### Integer Index Trap (Label != Position)

In [12]:
ser = pd.Series(list("abc"), index = [2, 42, 21])
ser

Unnamed: 0,0
2,a
42,b
21,c


#### Select by Label (not position)

In [13]:
ser[2]

'a'

#### But slicing is still positional


In [14]:
ser[:2]

Unnamed: 0,0
2,a
42,b


#### Duplicate index returns series

In [15]:
ser = pd.Series(["apple","banana","orange"], index = [0,1,1])
ser

Unnamed: 0,0
0,apple
1,banana
1,orange


In [16]:
ser[1]

Unnamed: 0,0
1,banana
1,orange


Explanation:
Duplicate label → returns multiple values as a Series.
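
Because the return type depends on the index contents, code that expects a scalar can break silently. A minimal sketch of a guard follows; the helper name `lookup_scalar` is hypothetical and not part of pandas, and it assumes the Series does not itself store Series objects as values.

```python
import pandas as pd

ser = pd.Series(["apple", "banana", "orange"], index=[0, 1, 1])

def lookup_scalar(series, label):
    """Return the single value stored under `label`, or raise if the label repeats."""
    result = series.loc[label]
    if isinstance(result, pd.Series):
        raise ValueError(f"label {label!r} matches {len(result)} rows")
    return result

unique_value = lookup_scalar(ser, 0)    # label 0 occurs once, so a scalar comes back
index_is_unique = ser.index.is_unique   # False here, because label 1 repeats
```

Checking `ser.index.is_unique` up front is the cheaper option when you only need to know whether duplicates exist at all.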

### Basic Selection from `pd.DataFrame`



Using `[]` on a DataFrame:
- Selects columns, not rows.
- Selects rows only when given a slice.

#### Create a DataFrame

In [17]:
df = pd.DataFrame(
    np.arange(9).reshape(3,-1), columns = ["a","b","c"]
)
df

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


#### Selecting one column (Series)

In [18]:
df['a']

Unnamed: 0,a
0,0
1,3
2,6


#### Selecting one column (DataFrame)

In [19]:
df[["a"]]

Unnamed: 0,a
0,0
1,3
2,6


#### Selecting Multiple Columns

In [20]:
df[["a", "b"]]

Unnamed: 0,a,b
0,0,1
1,3,4
2,6,7


#### Slice Rows

In [21]:
df[:2]

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5


#### Reorder Columns

In [22]:
df[["a","b"]]
df[["b","a"]]

Unnamed: 0,b,a
0,1,0
1,4,3
2,7,6


#### Position based selection (`.iloc`) - Series

In [23]:
ser = pd.Series(
    ["apple","banana","orange"], index = [0, 1, 1]
)

In [24]:
ser.iloc[1]

'banana'

In [25]:
ser.iloc[[1]]

Unnamed: 0,0
1,banana


In [26]:
ser.iloc[[0, 2]]

Unnamed: 0,0
0,apple
1,orange


In [27]:
ser.iloc[:2]

Unnamed: 0,0
0,apple
1,banana


#### Position based selection (`.iloc`) - DataFrame

In [28]:
df = pd.DataFrame(
    np.arange(20).reshape(5, -1), columns = list("abcd")
)

In [29]:
df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


#### Single Value

In [30]:
print(df.iloc[2,2])

10


In [31]:
df.iloc[2,2]

np.int64(10)

#### Column

In [32]:
df.iloc[:, [0]]

Unnamed: 0,a
0,0
1,4
2,8
3,12
4,16


#### Row

In [33]:
df.iloc[[0], :]

Unnamed: 0,a,b,c,d
0,0,1,2,3


##### Flipping the order

In [34]:
df.iloc[0, :]

Unnamed: 0,0
a,0
b,1
c,2
d,3


In [35]:
df.iloc[:, [0]]
df.iloc[[0], :]


Unnamed: 0,a,b,c,d
0,0,1,2,3


#### Multi Selection

In [36]:
df.iloc[[0, 1], [-1, -2]]


Unnamed: 0,d,c
0,3,2
1,7,6


#### Empty Slicers

In [37]:
df.iloc[:, :]



Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [38]:
ser.iloc[:]

Unnamed: 0,0
0,apple
1,banana
1,orange


### Label based selection from Series



`.loc` is used to select data from a `pd.Series` by label, treating the index like dictionary keys rather than relying on position or order.

In [39]:
ser = pd.Series(["apple","banana","orange"], index=[0, 1,1])
ser

Unnamed: 0,0
0,apple
1,banana
1,orange


`pd.Series.loc` selects all rows whose index label matches the given value (for example, selecting all rows with label 1).


In [40]:
ser.loc[1]

Unnamed: 0,0
1,banana
1,orange


Pandas indexes are not limited to numbers; they can also use strings as labels.


In [41]:
ser = pd.Series([2, 2, 4], index=["dog", "cat", "human"], name="num_legs")
ser

Unnamed: 0,num_legs
dog,2
cat,2
human,4


In [42]:
ser.loc["dog"]

np.int64(2)

In [43]:
print(ser.loc["human"])

4


In [44]:
ser.loc[["dog", "cat"]]

Unnamed: 0,num_legs
dog,2
cat,2


You can use label-based slicing to select all rows from the start up to and including the label "cat".


In [45]:
ser.loc[:"cat"]

Unnamed: 0,num_legs
dog,2
cat,2


Label-based selection with `pd.Series.loc` is powerful, but it behaves differently from normal Python slicing and from `iloc`.  
Many users make mistakes by assuming all slicing works the same, so it is important to understand these differences before using `loc`.


In [46]:
values = ["Jack","Jill","Jayne"]
ser = pd.Series(values)
ser

Unnamed: 0,0
0,Jack
1,Jill
2,Jayne


In Python, slicing returns elements up to but not including the end position.


In [47]:
values[:2]

['Jack', 'Jill']

Slicing with `pd.Series.iloc` works the same way as Python list slicing, returning the same range of elements.


In [48]:
ser.iloc[:2]

Unnamed: 0,0
0,Jack
1,Jill


But slicing with `pd.Series.loc` produces a different result:

In [49]:
ser.loc[:2]

Unnamed: 0,0
0,Jack
1,Jill
2,Jayne


This happens because:

`.loc` works using labels, not positions.

pandas scans the index from the top and keeps including rows **until it finds the label you asked for**. When it finds that label, it *does not stop immediately*; it continues as long as the label stays the same, so the stop label itself is part of the result.

That is why when your index contains duplicate labels, all matching labels are included.

**Why Duplicates change the result**

When the label `2` appears more than once, pandas assumes you want all rows with that label included in the slice.

In [50]:
repeats_2 = pd.Series(range(5), index=[0, 1, 2, 2, 0])
repeats_2.loc[:2]


Unnamed: 0,0
0,0
1,1
2,2
2,3


Pandas:
- Starts at the beginning.
- Includes rows until it reaches label `2`.
- Keeps going while labels are still `2`.
- Stops once labels change again.


**Why `.loc` is not meant for numeric positioning.**

Using `.loc` with numbers can feel confusing when labels look like positions.

The key rule:

`.loc` = label logic
`.iloc` = position logic

If your index is numbers but you actually care about the row position rather than label meaning then you should use `.iloc` instead.

**Why `.loc` works better with strings.**

With human-readable labels like strings, `.loc` feels far more natural.

In [51]:
ser = pd.Series(
    range(4),
    index = ["zzz","xxx","xxx","yyy"]
)
ser

Unnamed: 0,0
zzz,0
xxx,1
xxx,2
yyy,3


In [52]:
ser.loc[:"xxx"]

Unnamed: 0,0
zzz,0
xxx,1
xxx,2


pandas logic:
- Start from the top.
- Include rows.
- Stop once the block of `xxx` labels is finished.

This feels intuitive because `xxx` is clearly a label, not a position.

**Why slicing can sometimes fail with `.loc`**

If index labels are duplicated and placed **out of order**, pandas cannot tell where the slice should stop.

In [53]:
ser = pd.Series(range(4), index=["zzz", "xxx", "yyy", "xxx"])
ser.loc[:"xxx"]


KeyError: "Cannot get right slice bound for non-unique label: 'xxx'"

Because:

 - `"xxx"` appears in multiple places,
 - But not in a block,
 - Pandas cannot determine the slice boundary clearly.
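
One possible way around this, sketched here under the assumption that the original row order does not need to be preserved, is to sort the index first so that the duplicate labels form one contiguous block. The variable names are illustrative only.

```python
import pandas as pd

ser = pd.Series(range(4), index=["zzz", "xxx", "yyy", "xxx"])

# The unsorted index has "xxx" in two separate places, so the slice bound is ambiguous.
try:
    ser.loc[:"xxx"]
    slice_failed = False
except KeyError:
    slice_failed = True

# Sorting groups the duplicate labels together, which makes the bound well defined.
sorted_ser = ser.sort_index(kind="stable")
up_to_xxx = sorted_ser.loc[:"xxx"]
```

After sorting, the slice returns both `"xxx"` rows. Note that the slice now follows alphabetical order, so it no longer means "everything up to the first `xxx` in the original data".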

**Final understanding**

Use this rule:

 - Use `.loc` when order does not matter, only labels matter.
 - Use `.iloc` when row number / position matters.
 - Never rely on numeric-looking labels behaving like positions in `.loc`.

### Label based selection from a DataFrame



Pandas allows selection by **labels** for both rows and columns using `.loc`.

Most users:
- use labels for columns
- use positions for rows

But pandas lets you use labels for both, which is a powerful and unique feature.

In tools like:
- SQL → you filter rows but don't label them
- Excel → row/column labels exist, but selection is limited

In pandas, both row names and column names are first-class and fully usable.


In [54]:
df = pd.DataFrame(
    [
        [24, 180, "blue"],
        [42, 166, "brown"],
        [22, 160, "green"]
    ], columns=["age","height","eye_color"],
        index = ["Jack","Jill","Jayne"]
)
df

Unnamed: 0,age,height,eye_color
Jack,24,180,blue
Jill,42,166,brown
Jayne,22,160,green


`.loc[row_label, column_label]` selects using names, not positions.

In [55]:
df.loc["Jayne", "eye_color"]

'green'

To select all rows from the column with the label `"age"`:



In [56]:
df.loc[:, "age"]

Unnamed: 0,age
Jack,24
Jill,42
Jayne,22


To select all columns from the row with the label `"Jack"`:

In [57]:
df.loc["Jack",:]

Unnamed: 0,Jack
age,24
height,180
eye_color,blue


To select all rows from the column with the label `"age"`, maintaining the `pd.DataFrame` shape:

In [58]:
df.loc[:,["age"]]

Unnamed: 0,age
Jack,24
Jill,42
Jayne,22


To select all columns from the row with the label `"Jack"`, maintaining the `pd.DataFrame` shape:

In [59]:
df.loc[["Jack"], :]

Unnamed: 0,age,height,eye_color
Jack,24,180,blue


To select both rows and columns using lists of labels:

In [60]:
df.loc[["Jack","Jill"], ["age","eye_color"]]

Unnamed: 0,age,eye_color
Jack,24,blue
Jill,42,brown


### Mixing position-based and label based selection

In pandas:
- `DataFrame.iloc` is used for **position-based** selection.
- `DataFrame.loc` is used for **label-based** selection.

Sometimes you want to mix these styles, for example:

- Select **rows by position**.
- Select **columns by label**.

This is common because:

- Columns usually have meaningful labels (e.g., `"age"`, `"height_cm"`).
- Rows are often just positional (0, 1, 2, …), and their order matters more than their labels.

In [61]:
import pandas as pd

df = pd.DataFrame([
    [24, 180, "blue"],
    [42, 166, "brown"],
    [22, 160, "green"],
], columns=["age", "height_cm", "eye_color"])

df


Unnamed: 0,age,height_cm,eye_color
0,24,180,blue
1,42,166,brown
2,22,160,green


This creates a DataFrame with:

- Default row index: `0, 1, 2` (position-based, via `RangeIndex`)
- Named columns: `"age"`, `"height_cm"`, `"eye_color"` (label-based)

The goal is to mix:

- Row selection by position (e.g., first two rows → positions `[0, 1]`)
- Column selection by label (e.g., `"age"` and `"eye_color"`)


#### Using `get_indexer` to convert labels to positions

To use `iloc` for everything (both rows and columns), we need column **positions**, not labels.

`df.columns` is an Index containing the column labels.  
We can convert labels to integer positions using:

- `Index.get_indexer([...])`


In [62]:
col_idxer = df.columns.get_indexer(["age", "eye_color"])
col_idxer


array([0, 2])

This returns an array of positions, for example:

- `"age"` → position `0`
- `"eye_color"` → position `2`

So `col_idxer` becomes:

```python
array([0, 2])
```


#### One-step mixed selection with `.iloc` and `get_indexer`

Now that we have:

- Row positions: `[0, 1]`
- Column positions: `col_idxer` (e.g., `[0, 2]`)

We can perform mixed selection in **one `.iloc` call**:


In [63]:
df.iloc[:, col_idxer]

Unnamed: 0,age,eye_color
0,24,blue
1,42,brown
2,22,green


In [64]:
df.iloc[[0, 1], col_idxer]

Unnamed: 0,age,eye_color
0,24,blue
1,42,brown


This selects:

- Rows at positions `0` and `1`
- Columns at positions in `col_idxer` → `"age"` and `"eye_color"`

Result:

- A DataFrame with two rows (`0`, `1`) and two columns (`"age"`, `"eye_color"`)

This approach uses only `.iloc` and no intermediate DataFrame.
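
One caveat worth guarding against: `Index.get_indexer` reports a label it cannot find as `-1` instead of raising, and `.iloc` treats `-1` as "the last column". The sketch below wraps the lookup in a hypothetical helper, `column_positions`, that refuses missing labels; the label `"weight_kg"` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[24, 180, "blue"], [42, 166, "brown"], [22, 160, "green"]],
    columns=["age", "height_cm", "eye_color"],
)

def column_positions(frame, labels):
    """Translate column labels to positions, refusing labels that do not exist."""
    positions = frame.columns.get_indexer(labels)
    missing = [label for label, pos in zip(labels, positions) if pos == -1]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return positions

subset = df.iloc[[0, 1], column_positions(df, ["age", "eye_color"])]

# Without the guard, the missing label silently becomes -1.
raw_positions = df.columns.get_indexer(["age", "weight_kg"])
```

Passing `raw_positions` straight to `.iloc` would return `"age"` and `"eye_color"` (the last column) without any error, which is rarely what was intended.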


#### Alternative: Two-step selection (label, then position)

Instead of using `get_indexer`, you can:

1. First select columns by label with `df[["age", "eye_color"]]`
2. Then select rows by position with `.iloc[[0, 1]]`

This is often easier to read:


In [65]:
df[["age", "eye_color"]].iloc[[0, 1]]

Unnamed: 0,age,eye_color
0,24,blue
1,42,brown


#### Performance comparison with `timeit`

To see the performance difference, we can benchmark both approaches using `timeit`.

##### 1. `get_indexer` approach (single `.iloc` call)


In [66]:
import timeit

def get_indexer_approach():
    col_idxer = df.columns.get_indexer(["age", "eye_color"])
    df.iloc[[0, 1], col_idxer]

timeit.timeit(get_indexer_approach, number=10_000)

4.550187574999995

##### 2. Two-step approach

In [67]:
two_step_approach = lambda: df[["age", "eye_color"]].iloc[[0, 1]]

timeit.timeit(two_step_approach, number=10_000)


5.640840591

In this run, the `get_indexer` version is faster.

Reason:

- pandas executes eagerly: it performs each step as you write it.
- In the two-step approach, `df[["age", "eye_color"]]` creates a temporary DataFrame.
- Then `.iloc[[0, 1]]` runs on that temporary object.
- In the `get_indexer` approach, everything is done in one `.iloc` call, with no intermediate DataFrame.
- So, fewer allocations → better performance, especially for big datasets.
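
Single `timeit.timeit` numbers can be noisy, so a comparison like this is easier to trust when you first confirm that both approaches return the same data and then take the best of several runs. The following is a rough sketch; the iteration counts are arbitrary choices, and absolute timings will differ between machines and pandas versions.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    [[24, 180, "blue"], [42, 166, "brown"], [22, 160, "green"]],
    columns=["age", "height_cm", "eye_color"],
)

def get_indexer_approach():
    col_idxer = df.columns.get_indexer(["age", "eye_color"])
    return df.iloc[[0, 1], col_idxer]

def two_step_approach():
    return df[["age", "eye_color"]].iloc[[0, 1]]

# Both approaches should select exactly the same rows and columns.
same_result = get_indexer_approach().equals(two_step_approach())

# The minimum over several repeats is less sensitive to background load.
best_single = min(timeit.repeat(get_indexer_approach, number=200, repeat=3))
best_two_step = min(timeit.repeat(two_step_approach, number=200, repeat=3))
```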

### `DataFrame.filter`

`pd.DataFrame.filter` selects rows or columns of a `pd.DataFrame` by their labels.

In [74]:
df = pd.DataFrame([
    [24,180,"blue"],
    [42, 166, "brown"],
    [22, 160,"green"]
                    ],
    columns = ["Age","height","eye_color"],
    index = ["Jack","Jill","Jayne"]
                  )
df

Unnamed: 0,Age,height,eye_color
Jack,24,180,blue
Jill,42,166,brown
Jayne,22,160,green


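By default, `pd.DataFrame.filter` selects columns matching the label argument(s), similar to `pd.DataFrame[]`. Unlike `[]`, labels that do not exist are silently skipped: here `"age"` does not match the column `"Age"`, so only `eye_color` is returned.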

In [75]:
df.filter(["age","eye_color"])

Unnamed: 0,eye_color
Jack,blue
Jill,brown
Jayne,green
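
The difference from `[]` is easy to miss, so here is a small side-by-side sketch using the same frame. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [[24, 180, "blue"], [42, 166, "brown"], [22, 160, "green"]],
    columns=["Age", "height", "eye_color"],
    index=["Jack", "Jill", "Jayne"],
)

# filter() keeps the labels it finds and ignores the rest ("age" != "Age").
filtered_columns = list(df.filter(["age", "eye_color"]).columns)

# [] with a list of labels raises as soon as one label is missing.
try:
    df[["age", "eye_color"]]
    bracket_raised = False
except KeyError:
    bracket_raised = True
```

This makes `filter` convenient for optional columns, but a poor choice when a typo in a column name should be reported as an error.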


#### Selecting rows using `pd.DataFrame.filter` with `axis=0`

By default, `pd.DataFrame.filter()` works on **columns**.
However, you can change this behavior using the `axis` argument.

- `axis=1` → filter columns (default)
- `axis=0` → filter rows

To filter **rows instead of columns**, you must explicitly set:

`axis=0`


In [76]:
df.filter(items=["Jack", "Jayne"], axis=0)


Unnamed: 0,Age,height,eye_color
Jack,24,180,blue
Jayne,22,160,green


You do not have to match the full label name exactly when filtering.  
Using `like=` lets you select any row or column whose name contains the given text.
For example, `like="_"` selects all labels that include an underscore.


In [78]:
df.filter(like="_")

Unnamed: 0,eye_color
Jack,blue
Jill,brown
Jayne,green
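
The same substring matching applies to row labels when `axis=0` is passed. A brief sketch on the same frame, where the substring `"Ja"` is an arbitrary choice for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [[24, 180, "blue"], [42, 166, "brown"], [22, 160, "green"]],
    columns=["Age", "height", "eye_color"],
    index=["Jack", "Jill", "Jayne"],
)

# Keep only the rows whose index label contains "Ja".
rows_with_ja = df.filter(like="Ja", axis=0)
matched_rows = rows_with_ja.index.tolist()
```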


You can use patterns to search labels instead of exact words.
With `regex=`, pandas can match labels based on rules like how they start or end.

In [79]:
df.filter(regex=r"^Ja.*(?<!e)$", axis=0)

Unnamed: 0,Age,height,eye_color
Jack,24,180,blue
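
To see why only `Jack` survives, it can help to test the pattern against the labels directly with Python's `re` module. This is just a sketch of the matching logic applied to plain strings, not of how pandas implements `filter` internally.

```python
import re

labels = ["Jack", "Jill", "Jayne"]

# ^Ja      -> the label must start with "Ja"
# .*       -> followed by any characters
# (?<!e)$  -> the end of the label must not be preceded by "e"
pattern = re.compile(r"^Ja.*(?<!e)$")

matched = [label for label in labels if pattern.search(label)]
```

`Jill` fails the `^Ja` prefix, and `Jayne` fails the negative lookbehind because it ends in `e`.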


### Selection by data type:

Pandas stores information about the data type of each column. You can use this to select specific types of data, such as only integers, only numbers, or exclude certain types.

In [80]:
df = pd.DataFrame([
    [0, 1.0, "2"],
    [4, 8.0, "16"],
], columns=["int_col", "float_col", "string_col"])

df

Unnamed: 0,int_col,float_col,string_col
0,0,1.0,2
1,4,8.0,16


Use `pd.DataFrame.select_dtypes` to select only integer columns:

In [81]:
df.select_dtypes("int")

Unnamed: 0,int_col
0,0
1,4


Multiple types can be selected

In [82]:
df.select_dtypes(include=["int","float"])

Unnamed: 0,int_col,float_col
0,0,1.0
1,4,8.0


By default the argument is treated as `include=`, but data types can also be excluded with the `exclude=` parameter:

In [83]:
df.select_dtypes(exclude=["int"])

Unnamed: 0,float_col,string_col
0,1.0,2
1,8.0,16
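
When the goal is "all numeric columns" rather than one specific type, `select_dtypes` also accepts the generic `"number"` selector. A short sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1.0, "2"], [4, 8.0, "16"]],
    columns=["int_col", "float_col", "string_col"],
)

# "number" covers both integer and floating-point columns.
numeric_columns = df.select_dtypes(include="number").columns.tolist()

# Excluding "number" leaves only the non-numeric columns.
other_columns = df.select_dtypes(exclude="number").columns.tolist()
```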


### Selection/Filtering via Boolean arrays

Boolean lists or arrays (masks) can be used to select a subset of rows.

In [84]:
mask = [True, False, True]
ser = pd.Series(range(3))
ser

Unnamed: 0,0
0,0
1,1
2,2


Passing `mask` to `pd.Series[]` returns each row where the corresponding mask value is `True`.

In [85]:
ser[mask]

Unnamed: 0,0
0,0
2,2


Normally, when you give a list inside `df[]`, pandas treats it as a list of column names. But when you give a list of True and False values instead, pandas understands it as a filter for rows, not columns.

In [86]:
df = pd.DataFrame(np.arange(6).reshape(3, -1))
df


Unnamed: 0,0,1
0,0,1
1,2,3
2,4,5


In [87]:
mask = [True, False, True]
df[mask]


Unnamed: 0,0,1
0,0,1
2,4,5


Here, pandas compares the Boolean list with the rows and returns only the rows where the mask is True.
So row 0 and row 2 are kept, and row 1 is removed.

In [88]:
col_mask = [True, False]
df.loc[mask, col_mask]


Unnamed: 0,0
0,0
2,4
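
The masks above are written out by hand, but in practice a mask is usually produced by a comparison on the data itself. A minimal sketch on the same frame, where the conditions are arbitrary examples:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, -1))

# A comparison on a column yields a boolean Series, which works as a row mask.
mask = df[0] != 2
kept_rows = df[mask].index.tolist()

# Conditions can be combined with & (and) or | (or); the parentheses are required.
combined = (df[0] > 0) & (df[1] < 5)
combined_rows = df[combined].index.tolist()
```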


### Selection with a MultiIndex – A single level

A `MultiIndex` lets you use more than one label level (for example: first name + last name).
Selection can feel confusing, especially with `[]`, so we use `.loc` for clarity.

In [89]:
index = pd.MultiIndex.from_tuples([
    ("John","Smith"),
    ("John","Doe"),
    ("Jane","Doe"),
    ("Stephen","Smith"),
], names=["first_name","last_name"])

ser = pd.Series(range(4), index=index)
ser

Unnamed: 0_level_0,Unnamed: 1_level_0,0
first_name,last_name,Unnamed: 2_level_1
John,Smith,0
John,Doe,1
Jane,Doe,2
Stephen,Smith,3


In [90]:
ser.loc["John"]


Unnamed: 0_level_0,0
last_name,Unnamed: 1_level_1
Smith,0
Doe,1


In [91]:
ser.loc[("Jane", "Doe")]

np.int64(2)

#### Selection with a MultiIndex – Multiple levels

To select deeper levels of a MultiIndex, you pass a tuple instead of a string.

In [94]:
ser.loc[(["Jane"], "Doe")]
# Using a list preserves the MultiIndex shape.

Unnamed: 0_level_0,Unnamed: 1_level_0,0
first_name,last_name,Unnamed: 2_level_1
Jane,Doe,2


In [95]:
ser.loc[[("John", "Smith"), ("Jane", "Doe")]]
# Selects multiple specific combinations.

Unnamed: 0_level_0,Unnamed: 1_level_0,0
first_name,last_name,Unnamed: 2_level_1
John,Smith,0
Jane,Doe,2


In [96]:
ser.loc[(slice(None), "Doe")]
# Selects all rows where the second level is "Doe" and removes that level in the result.

Unnamed: 0_level_0,0
first_name,Unnamed: 1_level_1
John,1
Jane,2


In [97]:
ser.loc[(slice(None), ["Doe"])]
# Keeps the MultiIndex structure while selecting.

Unnamed: 0_level_0,Unnamed: 1_level_0,0
first_name,last_name,Unnamed: 2_level_1
John,Doe,1
Jane,Doe,2


#### Understanding `slice(None)`
`slice(None)` means “select everything”, just like `[:]` in list slicing.

In [98]:
alist = list("abc")
alist[:]
alist[slice(None)]

# Both return the same output

['a', 'b', 'c']

Inside a tuple, `slice(None)` means “select all values for this level”.

#### Using `pd.IndexSlice`

Instead of writing `slice(None)` manually, pandas provides `pd.IndexSlice`, which lets you use the familiar `:` syntax.

In [99]:
ixsl = pd.IndexSlice
ser.loc[ixsl[:, ["Doe"]]]


Unnamed: 0_level_0,Unnamed: 1_level_0,0
first_name,last_name,Unnamed: 2_level_1
John,Doe,1
Jane,Doe,2


### Selection with a MultiIndex – DataFrame

MultiIndex works on both rows and columns.
`.loc` takes **two arguments**: one for rows, one for columns.

In [100]:
row_index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])

col_index = pd.MultiIndex.from_tuples([
    ("music", "favorite"),
    ("music", "last_seen_live"),
    ("art", "favorite"),
], names=["art_type", "category"])

df = pd.DataFrame([
   ["Swift", "Swift", "Matisse"],
   ["Mozart", "T. Swift", "Van Gogh"],
   ["Beatles", "Wonder", "Warhol"],
   ["Jackson", "Dylan", "Picasso"],
], index=row_index, columns=col_index)

df


Unnamed: 0_level_0,art_type,music,music,art
Unnamed: 0_level_1,category,favorite,last_seen_live,favorite
first_name,last_name,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
John,Smith,Swift,Swift,Matisse
John,Doe,Mozart,T. Swift,Van Gogh
Jane,Doe,Beatles,Wonder,Warhol
Stephen,Smith,Jackson,Dylan,Picasso


Rows and columns both use hierarchical labels.

In [101]:
row_idxer = (slice(None), "Smith")
col_idxer = (slice(None), "favorite")
df.loc[row_idxer, col_idxer]


Unnamed: 0_level_0,art_type,music,art
Unnamed: 0_level_1,category,favorite,favorite
first_name,last_name,Unnamed: 2_level_2,Unnamed: 3_level_2
John,Smith,Swift,Matisse
Stephen,Smith,Jackson,Picasso


Selects:

- All people whose last name is "Smith"

- All columns whose category is "favorite"

In [102]:
df.loc[(slice(None), "Smith"), (slice(None), "favorite")]


Unnamed: 0_level_0,art_type,music,art
Unnamed: 0_level_1,category,favorite,favorite
first_name,last_name,Unnamed: 2_level_2,Unnamed: 3_level_2
John,Smith,Swift,Matisse
Stephen,Smith,Jackson,Picasso


Same result without intermediate variables.
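
`pd.IndexSlice` works for DataFrames as well, once per axis. The sketch below rebuilds the same frame and checks that the `IndexSlice` form selects the same data as the `slice(None)` tuples; the variable names are illustrative.

```python
import pandas as pd

row_index = pd.MultiIndex.from_tuples([
    ("John", "Smith"), ("John", "Doe"), ("Jane", "Doe"), ("Stephen", "Smith"),
], names=["first_name", "last_name"])

col_index = pd.MultiIndex.from_tuples([
    ("music", "favorite"), ("music", "last_seen_live"), ("art", "favorite"),
], names=["art_type", "category"])

df = pd.DataFrame([
    ["Swift", "Swift", "Matisse"],
    ["Mozart", "T. Swift", "Van Gogh"],
    ["Beatles", "Wonder", "Warhol"],
    ["Jackson", "Dylan", "Picasso"],
], index=row_index, columns=col_index)

ixsl = pd.IndexSlice

# ixsl[:, "Smith"] builds the same tuple as (slice(None), "Smith").
with_slices = df.loc[(slice(None), "Smith"), (slice(None), "favorite")]
with_ixsl = df.loc[ixsl[:, "Smith"], ixsl[:, "favorite"]]
```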

### Item assignment with .loc and .iloc

Selection reads data. Assignment changes data.
 - `.loc` assigns by label.
 - `.iloc` assigns by position.

In [103]:
ser = pd.Series(range(3), index=list("abc"))
ser.loc["b"] = 42
ser


Unnamed: 0,0
a,0
b,42
c,2


In [104]:
ser.iloc[2] = -42
ser


Unnamed: 0,0
a,0
b,42
c,-42
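
The same pattern extends to a DataFrame, where the indexer takes a row part and a column part. A small sketch, using made-up data for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [24, 42, 22], "height": [180, 166, 160]},
    index=["Jack", "Jill", "Jayne"],
)

# Assign by label: row "Jill", column "age".
df.loc["Jill", "age"] = 43

# Assign by position: first row, second column.
df.iloc[0, 1] = 181
```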


### DataFrame Column assignment

You can add columns with `df["new_col"] = value`.

In [105]:
df = pd.DataFrame({"col1": [1, 2, 3]})
df["new_column1"] = 42
df


Unnamed: 0,col1,new_column1
0,1,42
1,2,42
2,3,42


In [106]:
df["new_column2"] = list("abc")
df["new_column3"] = pd.Series(["dog", "cat", "human"])
df


Unnamed: 0,col1,new_column1,new_column2,new_column3
0,1,42,a,dog
1,2,42,b,cat
2,3,42,c,human


In [107]:
df["should_fail"] = ["too few", "rows"]


ValueError: Length of values (2) does not match length of index (3)
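
The length check applies to lists and arrays. A `pd.Series` behaves differently: pandas aligns it to the DataFrame by index label, so a shorter or reordered Series is accepted and unmatched rows become missing values. A short sketch, where the column name `from_series` is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

# Only labels 2 and 0 are present, and they are given out of order.
df["from_series"] = pd.Series(["x", "y"], index=[2, 0])

values = df["from_series"].tolist()
```

Row `0` receives `"y"`, row `2` receives `"x"`, and row `1` is left as a missing value because the Series has no label `1`.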

#### Assigning columns in a MultiIndex DataFrame

To add a column under a hierarchy, use a tuple with `.loc`

In [109]:
df = pd.DataFrame({"col1": [1, 2, 3]})
df

Unnamed: 0,col1
0,1
1,2
2,3
