In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# 5.1 Introduction to pandas Data Structures

* A Series is a one-dimensional array-like object containing a sequence of values
* The simplest Series is formed from only an array of data:

In [3]:
obj = pd.Series([4, 7, -5, 3])

obj

0    4
1    7
2   -5
3    3
dtype: int64

- You can get the array representation and index object of the Series via its array and index attributes, respectively:

In [4]:
obj.array

<NumpyExtensionArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [6]:
obj.index

RangeIndex(start=0, stop=4, step=1)

- The result of the `.array` attribute is a `NumpyExtensionArray` (called `PandasArray` in older pandas versions), which usually wraps a NumPy array but can also contain special extension array types.

- Often, you'll want to create a Series with an index identifying each data point with a label:

In [7]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [8]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

- Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:


In [12]:
obj2["a"]

-5

In [15]:
obj2["d"] = 6

In [14]:
obj2[["c", "a", "d"]]

#Here ["c", "a", "d"] is interpreted as a list of indices, even though it contains strings instead of integers.

c    3
a   -5
d    6
dtype: int64

##### Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [16]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [17]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [18]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

* Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values.

In [19]:
"b" in obj2

True

In [20]:
"e" in obj2

False
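
The dictionary analogy can be pushed a little further. The following is a minimal sketch, using a hypothetical Series named `prices` (not part of the original notes), of a few dictionary-style operations that Series supports:

```python
import pandas as pd

# Hypothetical example Series, mirroring obj2 above
prices = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership tests look at the index labels, not at the values
has_label = "b" in prices        # "b" is a label
value_as_label = 7 in prices     # 7 is a value, not a label

# .get returns a default instead of raising KeyError for a missing label
missing = prices.get("e", 0)

# .keys() returns the index; .items() yields (label, value) pairs
labels = list(prices.keys())
pairs = list(prices.items())
```

Note that `in` checks labels only, which matches how `in` checks keys on a plain dictionary.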

- Should you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary:

In [22]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [23]:
#A Series can be converted back to a dictionary with its to_dict method:

obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

* When you are only passing a dictionary, the index in the resulting Series will respect the order of the keys according to the dictionary's keys method, which depends on the key insertion order. 
* You can override this by passing an index with the dictionary keys in the order you want them to appear in the resulting Series:

In [25]:
states = ["California", "Ohio", "Oregon", "Texas"]

obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

- Since no value for "California" was found, it appears as NaN (Not a Number), which pandas uses to mark missing or NA values. Since "Utah" was not included in `states`, it is excluded from the resulting object.
- I will use the terms “missing,” “NA,” and “null” interchangeably to refer to missing data. The `isna` and `notna` functions in pandas should be used to detect missing data:

In [27]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [28]:
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [29]:
#instance methods
obj4.isna()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
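
Because `isna` and `notna` return Boolean Series, they combine naturally with summing and filtering. A small sketch, assuming a hypothetical Series `pop` shaped like `obj4` above:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value, similar to obj4 above
pop = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                index=["California", "Ohio", "Oregon", "Texas"])

n_missing = pop.isna().sum()   # True counts as 1 when summed
present = pop[pop.notna()]     # keep only the entries that have data
```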

- A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [32]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [33]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [34]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

* Both the Series object itself and its index have a `name` attribute, which integrates with other areas of pandas functionality:



In [38]:
obj4.name = "population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
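
One place where the names show up is when a Series is turned into a DataFrame. This is a minimal sketch with a hypothetical two-row Series `pop`; `reset_index` is used here only to illustrate how the names carry over:

```python
import numpy as np
import pandas as pd

pop = pd.Series([np.nan, 35000.0], index=["California", "Ohio"])
pop.name = "population"
pop.index.name = "state"

# reset_index moves the index into a column named after index.name;
# the values land in a column named after the Series name
table = pop.reset_index()
columns = list(table.columns)
```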

- A Series’s index can be altered in place by assignment:

In [39]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [40]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
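
Assigning a new index only works when the number of labels matches the number of values. A short sketch, using a hypothetical Series `s`, of what happens otherwise:

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

error = None
try:
    s.index = ["Bob", "Steve", "Jeff"]   # 3 labels for 4 values
except ValueError as exc:
    error = str(exc)                     # pandas reports a length mismatch
```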

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index.

- There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:

In [43]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


- For large DataFrames, the `head` method selects only the first five rows:



In [44]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


- Similarly, `tail` returns the last five rows:

In [45]:
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


- If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [46]:
pd.DataFrame(data, columns=["year", "state", "pop"])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


- If you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result:

In [47]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [48]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

 - A column in a DataFrame can be retrieved as a Series either by dictionary-like notation or by using the dot attribute notation:

In [49]:
frame2["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [50]:
frame2.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64
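
Dot attribute notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The `pop` column used in these notes is a good illustration, since `DataFrame.pop` is a method. A small sketch with a hypothetical two-row `frame`:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

year_col = frame.year    # works: "year" does not clash with anything
pop_attr = frame.pop     # NOT the column: this is the DataFrame.pop method
pop_col = frame["pop"]   # bracket notation always returns the column
```

For this reason, bracket notation is the safer default in scripts.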

- Rows can also be retrieved by position or name with the special `iloc` and `loc` attributes:

In [52]:
frame2.loc[1]

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

In [53]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

- Columns can be modified by assignment. For example, the empty `debt` column could be assigned a scalar value or an array of values:

In [54]:
frame2["debt"] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,16.5
1,2001,Ohio,1.7,16.5
2,2002,Ohio,3.6,16.5
3,2001,Nevada,2.4,16.5
4,2002,Nevada,2.9,16.5
5,2003,Nevada,3.2,16.5


In [56]:
frame2["debt"] = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,0.0
1,2001,Ohio,1.7,1.0
2,2002,Ohio,3.6,2.0
3,2001,Nevada,2.4,3.0
4,2002,Nevada,2.9,4.0
5,2003,Nevada,3.2,5.0


- When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. 
- If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any index values not present:

In [59]:
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
frame2["debt"] = val

frame2

# Assigning a column that doesn’t exist will create a new column.

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,-1.2
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,-1.5
5,2003,Nevada,3.2,-1.7
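
The difference between assigning a list and assigning a Series can be sketched as follows, with a hypothetical three-row `frame`. The list is matched by position and must have the right length, while the Series is matched by label:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]})

list_failed = False
try:
    frame["debt"] = [1.0, 2.0]   # list: length must equal len(frame)
except ValueError:
    list_failed = True

# A Series is aligned on labels instead, so a shorter one is accepted
frame["debt"] = pd.Series([1.0, 2.0], index=[0, 2])
debt_missing = frame["debt"].isna().tolist()
```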


- The `del` keyword will delete columns like with a dictionary. As an example, I first add a new column of Boolean values where the `state` column equals `"Ohio"`:



In [64]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,-1.2,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,-1.5,False
5,2003,Nevada,3.2,-1.7,False


- The `del` statement can then be used to remove this column:

In [66]:
del frame2["eastern"]
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dictionary of dictionaries:

In [69]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
              "Nevada": {2001: 2.4, 2002: 2.9}}

If the nested dictionary is passed to the DataFrame, pandas will interpret the outer dictionary keys as the columns, and the inner keys as the row indices:

In [70]:
frame3 = pd.DataFrame(populations)
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [71]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,,2.4,2.9


The keys in the inner dictionaries are combined to form the index in the result. This isn’t true if an explicit index is specified:

In [72]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
2003,,


Dictionaries of Series are treated in much the same way:

In [73]:
pdata = {"Ohio": frame3["Ohio"][:-1],
         "Nevada": frame3["Nevada"][:2]}

pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


If a DataFrame’s `index` and `columns` have their `name` attributes set, these will also be displayed:

In [75]:
frame3.index.name = "year"
frame3.columns.name = "state"

frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


Unlike Series, DataFrame does not have a `name` attribute. DataFrame's `to_numpy` method returns the data contained in the DataFrame as a two-dimensional ndarray:

In [76]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

If the DataFrame’s columns are different data types, the data type of the returned array will be chosen to accommodate all of the columns:

In [77]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, -1.2],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -1.5],
       [2003, 'Nevada', 3.2, -1.7]], dtype=object)
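
To see how the result dtype is chosen, compare a purely numeric DataFrame with one that mixes numbers and strings. This is a small sketch with two hypothetical frames:

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
mixed = pd.DataFrame({"a": [1, 2], "s": ["x", "y"]})

numeric_dtype = numeric.to_numpy().dtype   # integers and floats upcast to float64
mixed_dtype = mixed.to_numpy().dtype       # numbers and strings fall back to object
```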

## Index Objects

- pandas’s Index objects are responsible for holding the *axis labels* (including a DataFrame's column names) and other metadata (like the axis name or names). 
- Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [80]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])

index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [81]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [82]:
index[1] = "d"

TypeError: Index does not support mutable operations
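
Since an Index cannot be changed in place, "modifying" one means building a new Index. A minimal sketch using the `delete` and `insert` methods, both of which return new objects:

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

# Drop position 1, then insert "d" at position 1; the original is untouched
new_index = index.delete(1).insert(1, "d")
```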

Immutability makes it safer to share Index objects among data structures:

In [83]:
labels = pd.Index(np.arange(3))
labels

Index([0, 1, 2], dtype='int64')

In [84]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [85]:
obj2.index is labels

True

In addition to being array-like, an Index also behaves like a fixed-size set:

In [86]:
frame3

state  Ohio  Nevada
year
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9


In [87]:
frame3.columns

Index(['Ohio', 'Nevada'], dtype='object', name='state')

In [89]:
"Ohio" in frame3.columns

True

In [90]:
2003 in frame3.index

False
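
Beyond membership tests, an Index also offers set-style methods. A short sketch with two hypothetical Index objects, `left` and `right`:

```python
import pandas as pd

left = pd.Index(["Ohio", "Nevada"])
right = pd.Index(["Nevada", "Utah"])

both = left.intersection(right)    # labels present in both
either = left.union(right)         # labels present in at least one
only_left = left.difference(right) # labels in left but not in right
```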

Unlike Python sets, a pandas Index can contain duplicate labels:

In [92]:
pd.Index(["foo", "foo", "bar", "bar"])

#Selections with duplicate labels will select all occurrences of that label.

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
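
The effect of duplicate labels on selection can be sketched like this, using a hypothetical Series `dup`:

```python
import numpy as np
import pandas as pd

dup = pd.Series(np.arange(4), index=["foo", "foo", "bar", "bar"])

unique = dup.index.is_unique   # False, because labels repeat
foo_rows = dup["foo"]          # a Series with both "foo" entries, not a scalar
```

Checking `is_unique` first is a simple way to find out whether a label lookup can return more than one row.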

#### Table for reference
[Some Index methods and properties](https://wesmckinney.com/book/pandas-basics#tbl-table_index_methods)

## 5.2 Essential Functionality

Fundamental mechanics of interacting with the data contained in a Series or DataFrame.

##### Reindexing
An important method on pandas objects is `reindex`, which means to create a *new object* with the values rearranged to align with the new index. Consider an example:



In [93]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling `reindex` on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [94]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

- For ordered data like time series, you may want to do some interpolation or filling of values when reindexing.
- The `method` option allows us to do this, using a method such as `ffill`, which *forward-fills* the values:



In [95]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [96]:
obj3.reindex(np.arange(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
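
Forward filling is not the only option. As a sketch, assuming the same kind of data as `obj3` (here a hypothetical Series `colors`), `method="bfill"` fills backward and `fill_value` supplies a constant:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# bfill takes the next valid value; nothing follows label 5, so it stays missing
back = colors.reindex(np.arange(6), method="bfill")

# fill_value supplies a constant for labels that were not present
filled = colors.reindex(np.arange(6), fill_value="unknown")
```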

With DataFrame, `reindex` can alter the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [97]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"], 
                     columns=["Ohio", "Texas", "California"])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [98]:
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed with the `columns` keyword:

In [100]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


Because `"Ohio"` was not in `states`, the data for that column is dropped from the result.

Another way to reindex a particular axis is to pass the new axis labels as a positional argument and then specify the axis to reindex with the `axis` keyword:

In [101]:
frame.reindex(states, axis="columns")

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


See [Table 5.3](https://wesmckinney.com/book/pandas-basics#tbl-table_reindex_function) for more about the arguments to `reindex`.



You can also reindex by using the `loc` operator, and many users prefer to always do it this way. This works only if all of the new index labels already exist in the DataFrame (whereas `reindex` will insert missing data for new labels):

In [102]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4
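
The difference between the two approaches shows up as soon as a label is missing. A minimal sketch, rebuilding a frame like the one above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

try:
    frame.loc[["a", "b"]]            # "b" is not an existing row label
    loc_failed = False
except KeyError:
    loc_failed = True

with_b = frame.reindex(["a", "b"])   # inserts an all-NaN row for "b"
```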


### Dropping Entries from an Axis

Dropping one or more entries from an axis is simple if you already have an index array or list without those entries, since you can use the `reindex` method or `.loc`-based indexing.

As that can require a bit of munging and set logic, the `drop` method will return a new object with the indicated value or values deleted from an axis:

In [103]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [105]:
new_obj = obj.drop("c")
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [106]:
obj.drop(["d", "c"])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [107]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Calling `drop` with a sequence of labels will drop values from the row labels (axis 0):

In [109]:
data.drop(index=["Colorado", "Ohio"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


To drop labels from the columns, instead use the `columns` keyword:

In [111]:
data.drop(columns=["two"])

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


You can also drop values from the columns by passing `axis=1` (which is like NumPy) or `axis="columns"`:

In [112]:
data.drop("two", axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [113]:
data.drop(["two", "four"], axis="columns")

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
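
Two details of `drop` are worth a quick sketch: it returns a new object rather than changing the original, and a missing label raises `KeyError` unless `errors="ignore"` is passed. The Series `s` here is hypothetical:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(3.), index=["a", "b", "c"])

dropped = s.drop("b")                  # new Series; s itself still has "b"
same = s.drop("z", errors="ignore")    # missing label is skipped, no KeyError
```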


### Indexing, Selection, and Filtering

Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:



In [114]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [115]:
obj["b"]

1.0

In [116]:
obj[1]

1.0

In [117]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [118]:
obj[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [119]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [120]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

While you can select data by label this way, the preferred way to select index values is with the special `loc` operator:

In [122]:
obj.loc[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

The reason to prefer `loc` is because of the different treatment of integers when indexing with `[]`. 

Regular `[]`-based indexing will treat integers as labels if the index contains integers, so the behavior differs depending on the data type of the index. For example:

In [123]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])

obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

obj1

2    1
0    2
1    3
dtype: int64

In [124]:
obj2

a    1
b    2
c    3
dtype: int64

In [125]:
obj1[[0, 1, 2]]

0    2
1    3
2    1
dtype: int64

In [127]:
obj2[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

When using `loc`, an expression like `obj2.loc[[0, 1]]` will fail because the index does **not** contain integers:

In [128]:
obj2.loc[[0, 1]]

KeyError: "None of [Index([0, 1], dtype='int64')] are in the [index]"

Since the `loc` operator indexes exclusively with **labels**, there is also an `iloc` operator that indexes exclusively with **integers**, so it works consistently *whether or not* the index contains integers:

In [129]:
obj1.iloc[[0, 1, 2]]

2    1
0    2
1    3
dtype: int64

In [130]:
obj2.iloc[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

You can also slice with labels; unlike normal Python slicing, the endpoint is inclusive:

In [131]:
obj2.loc["b":"c"]

b    2
c    3
dtype: int64
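
The contrast between label slices and position slices can be sketched side by side, using a hypothetical Series `s`:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = s.loc["b":"c"]     # label slice: includes the endpoint "c"
by_position = s.iloc[1:2]     # position slice: excludes position 2
```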

Assigning values using these methods modifies the corresponding section of the Series:

In [132]:
obj2.loc["b":"c"] = 5
obj2

a    1
b    5
c    5
dtype: int64

**NOTE:**
A common beginner error is to call `loc` or `iloc` like functions rather than "indexing into" them with square brackets.
The square bracket notation is used to enable slice operations and to allow for indexing on multiple axes with DataFrame objects.

Indexing into a DataFrame retrieves one or more columns either with a single value or sequence:

In [133]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [134]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [136]:
data[["three", "one"]]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


Indexing like this has a few special cases. The first is slicing or selecting data with a **Boolean** array:

In [137]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [138]:
data[data["three"] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


The row selection syntax `data[:2]` is provided as a convenience. Passing a single element or a list to the `[]` operator selects columns.

Another use case is indexing with a Boolean DataFrame, such as one produced by comparing every value with a scalar:

In [139]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


We can use this DataFrame to assign the value 0 to each location with the value `True`, like so:

In [141]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
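
Assignment through a Boolean DataFrame changes `data` in place. If the original should stay intact, one alternative (not used in these notes) is the `mask` method, which returns a new object. A minimal sketch with a hypothetical frame `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  columns=["one", "two", "three", "four"])

# Replace values where the condition is True; df itself is not modified
zeroed = df.mask(df < 5, 0)
```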


#### Selection on DataFrame with loc and iloc

- Like Series, DataFrame has special attributes `loc` and `iloc` for label-based and integer-based indexing, respectively.
- Since DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (`loc`) or integers (`iloc`).

As a first example, let's select a single row by label:

In [142]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [143]:
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

The result of selecting a single row is a Series with an index that contains the DataFrame's column labels. To select multiple rows, creating a new DataFrame, pass a sequence of labels:

In [146]:
data.loc[["Colorado", "New York"]]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
New York,12,13,14,15


You can combine both row and column selection in `loc` by separating the selections with a *comma*:

In [147]:
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

We'll then perform some similar selections with integers using `iloc`:

In [148]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [149]:
data.iloc[[2, 1]]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
Colorado,0,5,6,7


In [150]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [151]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


Both indexing functions work with slices in addition to single labels or lists of labels:

In [153]:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [154]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


A **Boolean** Series can be used with `loc` but not with `iloc`, which accepts only a plain Boolean array or list:

In [155]:
data.loc[data.three >= 2]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
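
The following sketch illustrates the restriction on `iloc`. It assumes a frame like the original `data`; the exact exception type can differ with the kind of index, so both plausible types are caught:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])
mask = df["three"] > 5               # Boolean Series carrying the row labels

by_loc = df.loc[mask]                # loc aligns the mask on labels

try:
    df.iloc[mask]                    # iloc rejects a labeled Boolean Series
    iloc_failed = False
except (ValueError, NotImplementedError):
    iloc_failed = True

by_iloc = df.iloc[mask.to_numpy()]   # a plain Boolean array is accepted
```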


##### Table

There are many ways to select and rearrange the data contained in a pandas object. For DataFrame, [Table 5.4](https://wesmckinney.com/book/pandas-basics#tbl-table_dataframe_loc_iloc) provides a short summary of many of them.
                                                                                                              
                                                                                                              
### Integer indexing pitfalls
Working with pandas objects indexed by integers can be a stumbling block for new users since they work differently from built-in Python data structures like lists and tuples. 
For example, you might not expect the following code to generate an error:

In [156]:
ser = pd.Series(np.arange(3.))

ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [157]:
ser[-1]

KeyError: -1

In this case, pandas could “fall back” on integer indexing, but it is difficult to do this in general without introducing subtle bugs into the user code. 

Here we have an index containing `0`, `1`, and `2`, but pandas does not want to guess what the user wants (label-based indexing or position-based):

In [158]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

On the other hand, with a noninteger index, there is no such ambiguity:

In [159]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])

ser2[-1]

2.0

If you have an axis index containing integers, data selection will always be label oriented. As I said above, if you use `loc` (for ***labels***) or `iloc` (for ***integers***) you will get exactly what you want:

In [160]:
ser.iloc[-1]

2.0

On the other hand, slicing with integers is always integer oriented:

In [161]:
ser[:2]

0    0.0
1    1.0
dtype: float64

As a result of these pitfalls, it is best to always prefer indexing with `loc` and `iloc` to avoid ambiguity.
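
A small sketch of why the explicit operators help, using a hypothetical Series `s` whose integer labels are deliberately out of order:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]       # the value whose label is 0
by_position = s.iloc[0]   # the first value, whatever its label
last = s.iloc[-1]         # negative positions work with iloc
```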


### Pitfalls with chained indexing
In the previous section we looked at how you can do flexible selections on a DataFrame using `loc` and `iloc`. These indexing attributes can also be used to modify DataFrame objects in place, but doing so requires some care.

For example, with the DataFrame `data` from above, we can assign to a column or row by label or integer position:

In [162]:
data.loc[:, "one"] = 1
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,1,9,10,11
New York,1,13,14,15


In [163]:
data.iloc[2] = 5
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,5,5,5,5
New York,1,13,14,15


In [164]:
data.loc[data["four"] > 5] = 3
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


A common gotcha for new pandas users is to chain selections when assigning, like this:

```
In [177]: data.loc[data.three == 5]["three"] = 6
<ipython-input-11-0ed1cf2155d5>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
```

Depending on the data contents, this may print a special `SettingWithCopyWarning`, which warns you that you are trying to modify a temporary value (the nonempty result of `data.loc[data.three == 5]`) instead of the original DataFrame `data`, which might be what you were intending. Here, `data` was unmodified:

In [165]:
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


In these scenarios, the fix is to rewrite the chained assignment to **use a single `loc` operation**:

In [166]:
data.loc[data.three == 5, "three"] = 6
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,6,5
New York,3,3,3,3


A good rule of thumb is to avoid chained indexing when doing assignments. There are other cases where pandas will generate `SettingWithCopyWarning` that have to do with chained indexing. 
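
When the goal really is to work on a separate subset rather than on the original, one common approach is to take an explicit copy first. This is a sketch with a hypothetical frame `df`, not something taken from the notes above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape((3, 2)), columns=["one", "two"])

subset = df[df["two"] > 1].copy()   # explicit, independent copy
subset["two"] = 0                   # modifies only the copy

original_two = df["two"].tolist()
subset_two = subset["two"].tolist()
```

Making the copy explicit states the intent clearly: changes to `subset` are not expected to reach `df`.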

## Arithmetic and Data Alignment

pandas can make it much simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. Let’s look at an example:


In [167]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
                index=["a", "c", "e", "f", "g"])

s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [168]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [169]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.
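
The union behavior can be checked on a smaller case. A minimal sketch with two hypothetical Series, `left` and `right`, that share only the label `"b"`:

```python
import pandas as pd

left = pd.Series([1.0, 2.0], index=["a", "b"])
right = pd.Series([10.0, 20.0], index=["b", "c"])

total = left + right        # index is the union: a, b, c
overlap = total.dropna()    # only "b" had a value on both sides
```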

In the case of DataFrame, alignment is performed on both rows and columns:

In [170]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                   index=["Ohio", "Texas", "Colorado"])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [171]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


Adding these returns a DataFrame with index and columns that are the unions of the ones in each DataFrame:

In [172]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


Since the `"c"` and `"e"` columns are not found in both DataFrame objects, they appear as **missing** in the result. The *same holds for the rows* with labels that are not common to both objects.

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:

In [None]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})

df1

Unnamed: 0,A
0,1
1,2

In [175]:
df2

Unnamed: 0,B
0,3
1,4


In [176]:
df1 + df2

Unnamed: 0,A,B
0,,
1,,


### Arithmetic methods with fill values

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other. Here is an example where we set a particular value to NA (null) by assigning `np.nan` to it:

In [177]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list("abcde"))

df2.loc[1, "b"] = np.nan

df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [178]:
df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0
