10/03/2023, 15:11 Exp02_notebook_2102620 - Colaboratory

> **Experiment No :** 02
>
> **Aim :** To understand how to perform Data Manipulation with Pandas
> Library
>
> **Theory :** The NumPy library and its *ndarray* object provide efficient
> storage and manipulation of dense typed arrays in Python. **Pandas**
> is a newer package built on top of NumPy that provides an efficient
> implementation of a **DataFrame** data structure. **DataFrames** are
> essentially multidimensional arrays with attached row and column
> labels, and often with heterogeneous types and/or missing data.
>
> Pandas not only provides a convenient storage interface for labeled
> data but also implements a number of powerful data operations familiar
> to users of both database frameworks and spreadsheet programs.
>
> NumPy's *ndarray* data structure is more suitable for clean,
> well-organized data typically seen in numerical computing tasks. While
> it serves this purpose very well, its limitations become clear when we
> need more flexibility (e.g., attaching labels to data, working with
> missing data, etc.) and when attempting operations that do not map
> well to element-wise broadcasting (e.g., groupings, pivots, etc.),
> each of which is an important piece of analyzing the less structured
> data available in many forms in the world around us.
>
> Pandas, and in particular its **Series** and **DataFrame** objects,
> builds on the NumPy array structure and provides efficient access to
> these sorts of "data munging" tasks that occupy much of a data
> scientist's time.
>
> The name Pandas refers to both *Panel Data* and *Python Data
> Analysis*; the library was created by Wes McKinney in 2008. Pandas
> lets us analyze large data sets and draw conclusions based on
> statistical theory. It can also clean messy data sets and make them
> readable and relevant, which matters in any successful data science
> pipeline.
>
> **Working :** These are the actual performance steps that students will
> carry out. Students need to execute every cell and note the output;
> they also need to add appropriate comments to every cell to indicate
> what concept it demonstrates.
>
> **Installing and Using Pandas**
>
> Installation of Pandas on your system requires NumPy to be installed.
>
> Use either of the following commands to install Pandas on your
> local machine:
>
> \$ pip3 install pandas
>
> OR


> \$ conda install pandas
>
> Once Pandas is installed, you can import it and check the version:
>
> import pandas  
> pandas.\_\_version\_\_
>
> '1.3.5'
>
> Just as we generally import NumPy under the alias np, we will import
> Pandas under the alias pd:
>
> import pandas as pd
>
> **Introducing Pandas Objects**
>
> At the very basic level, Pandas objects can be thought of as enhanced
> versions of NumPy structured arrays in which the rows and columns are
> identified with labels rather than simple integer indices.
>
> Pandas provides a host of useful tools, methods, and functionality on
> top of the basic data structures.
>
> Before we go any further, let's introduce the three fundamental
> Pandas data structures: the Series, DataFrame, and Index.
>
> We will start our code sessions with the standard NumPy and Pandas
> imports:
>
> \# import numpy and pandas  
> import numpy as np  
> import pandas as pd
>
> **The Pandas Series Object**
>
> A Pandas Series is a one-dimensional array of indexed data. It can be
> created from a list or array as follows:
>
> data = pd.Series(\[0.25, 0.5, 0.75, 1.0\])  
> data
>
> 0 0.25  
> 1 0.50  
> 2 0.75  
> 3 1.00  
> dtype: float64


> As we see in the output, the Series wraps both a sequence of values
> and a sequence of indices, which we can access with the values and
> index attributes. The values are simply a familiar NumPy array:
>
> data.values
>
> array(\[0.25, 0.5 , 0.75, 1. \])
>
> The index is an array-like object of type pd.Index, which we'll
> discuss in more detail momentarily.
>
> data.index
>
> RangeIndex(start=0, stop=4, step=1)
>
> As with a NumPy array, data can be accessed by the associated index
> via the familiar Python square-bracket notation:
>
> data\[1\]
>
> 0.5
>
> data\[1:3\]
>
> 1 0.50  
> 2 0.75  
> dtype: float64
>
> As we will see, though, the Pandas Series is much more general and
> flexible than the one-dimensional NumPy array that it emulates.
>
> **Series as generalized NumPy array**
>
> From what we've seen so far, it may look like the Series object is
> basically interchangeable with a one-dimensional NumPy array. The
> essential difference is the presence of the index: while the NumPy
> array has an *implicitly defined* integer index used to access the
> values, the Pandas Series has an *explicitly defined* index associated
> with the values.
>
> This explicit index definition gives the Series object additional
> capabilities. For example, the index need not be an integer, but can
> consist of values of any desired type. If we wish, we can use strings
> as an index:


> data = pd.Series(\[0.25, 0.5, 0.75, 1.0\],  
> index=\['a', 'b', 'c', 'd'\])  
> data
>
> a 0.25  
> b 0.50  
> c 0.75  
> d 1.00  
> dtype: float64
>
> And the item access works as expected:
>
> data\['b'\]
>
> 0.5
>
> We can even use non-contiguous or non-sequential indices:
>
> data = pd.Series(\[0.25, 0.5, 0.75, 1.0\],  
> index=\[2, 5, 3, 7\])  
> data
>
> 2 0.25  
> 5 0.50  
> 3 0.75  
> 7 1.00  
> dtype: float64
>
> data\[5\]
>
> 0.5
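
The explicit index described above can be sketched as follows. The variable names `by_letter` and `by_number` are illustrative and not part of the notebook. The sketch uses `.loc` for label lookups and `.iloc` for positional lookups so that the intent stays unambiguous even when the labels are integers:

```python
import pandas as pd

# String labels as the explicit index.
by_letter = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Non-contiguous, non-sequential integer labels.
by_number = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

print(by_letter.loc['b'])   # value stored under the label 'b'
print(by_number.loc[5])     # value stored under the label 5
print(by_number.iloc[0])    # first value by position, whatever its label
```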
>
> **Series as specialized dictionary**
>
> In this way, you can think of a Pandas Series a bit like a
> specialization of a Python dictionary. A dictionary is a structure
> that maps arbitrary keys to a set of arbitrary values, and a Series is
> a structure which maps typed keys to a set of typed values. This typing
> is important: just as the type-specific compiled code behind a NumPy
> array makes it more efficient than a Python list for certain
> operations, the type information of a Pandas Series makes it much more
> efficient than Python dictionaries for certain operations.
>
> The Series-as-dictionary analogy can be made even more clear by
> constructing a Series object directly from a Python dictionary:
>
> population_dict = {'California': 38332521,  
> 'Texas': 26448193,  
> 'New York': 19651127,  
> 'Florida': 19552860,  
> 'Illinois': 12882135}  
> population = pd.Series(population_dict)  
> population
>
> California 38332521  
> Texas 26448193  
> New York 19651127  
> Florida 19552860  
> Illinois 12882135  
> dtype: int64

> By default, a Series will be created where the index is drawn from the
> dictionary keys, in the order in which they appear in the dictionary.
> From here, typical dictionary-style item access can be performed:
>
> population\['California'\]
>
> 38332521
>
> Unlike a dictionary, though, the Series also supports array-style
> operations such as slicing:
>
> population\['California':'Illinois'\]
>
> California 38332521  
> Texas 26448193  
> New York 19651127  
> Florida 19552860  
> Illinois 12882135  
> dtype: int64
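
As a minimal sketch of the dictionary analogy (the names `same_order` and `subset` are illustrative), the following builds the Series from the dictionary, checks that the index follows the dictionary's key order, and takes a label-based slice, which includes both endpoints:

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

# The index follows the order of the keys in the dictionary.
same_order = list(population.index) == list(population_dict)

# A label-based slice includes both the start and the stop label.
subset = population.loc['Texas':'Florida']

print(same_order)
print(subset)
```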
>
> **Constructing Series objects**

> We've already seen a few ways of constructing a Pandas Series from
> scratch; all of them are some version of the following:
>
> \>\>\> pd.Series(data, index=index)
>
> where index is an optional argument, and data can be one of many
> entities.
>
> For example, data can be a list or NumPy array, in which case index
> defaults to an integer sequence:
>
> pd.Series(\[2, 4, 6\])
>
> 0 2  
> 1 4  
> 2 6  
> dtype: int64
>
> data can be a scalar, which is repeated to fill the specified index:


> pd.Series(5, index=\[100, 200, 300\])
>
> 100 5  
> 200 5  
> 300 5  
> dtype: int64
>
> data can be a dictionary, in which case index defaults to the
> dictionary keys in their insertion order:
>
> pd.Series({2:'a', 1:'b', 3:'c'})
>
> 2 a  
> 1 b  
> 3 c  
> dtype: object
>
> In each case, the index can be explicitly set if a different result is
> preferred:
>
> pd.Series({2:'a', 1:'b', 3:'c'}, index=\[3, 2\])
>
> 3 c  
> 2 a  
> dtype: object
>
> Notice that in this case, the Series is populated only with the
> explicitly identified keys.
>
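
The constructor variants above can be collected into one short sketch. The variable names are illustrative, and the last line goes slightly beyond the notebook by requesting a label that the dictionary does not contain, which yields a missing value:

```python
import numpy as np
import pandas as pd

from_array = pd.Series(np.array([2, 4, 6]))                  # default integer index 0, 1, 2
from_scalar = pd.Series(5, index=[100, 200, 300])            # scalar repeated for each label
from_dict = pd.Series({2: 'a', 1: 'b', 3: 'c'})              # dictionary keys become the index
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])   # only the requested keys are kept
missing = pd.Series({2: 'a', 1: 'b'}, index=[1, 4])          # label 4 has no entry, so it is NaN

print(subset)
print(missing)
```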
> **The Pandas DataFrame Object**

> The next fundamental structure in Pandas is the DataFrame. Like the
> Series object discussed in the previous section, the DataFrame can be
> thought of either as a generalization of a NumPy array, or as a
> specialization of a Python dictionary. We'll now take a look at each
> of these perspectives.
>
> **DataFrame as a generalized NumPy array**
>
> If a Series is an analog of a one-dimensional array with flexible
> indices, a DataFrame is an analog of a two-dimensional array with both
> flexible row indices and flexible column names. Just as you might
> think of a two-dimensional array as an ordered sequence of aligned
> one-dimensional columns, you can think of a DataFrame as a sequence of
> aligned Series objects. Here, by "aligned" we mean that they share the
> same index.
>
> To demonstrate this, let's first construct a new Series listing the
> area of each of the five states discussed in the previous section:


> area_dict = {'California': 423967, 'Texas': 695662,  
> 'New York': 141297, 'Florida': 170312,  
> 'Illinois': 149995}  
> area = pd.Series(area_dict)  
> area
>
> California 423967  
> Texas 695662  
> New York 141297  
> Florida 170312  
> Illinois 149995  
> dtype: int64
>
> Now that we have this along with the population Series from before, we
> can use a dictionary to construct a single two-dimensional object
> containing this information:
>
> states = pd.DataFrame({'population': population, 'area': area})  
> states

<table>
<thead>
<tr class="header"><th></th><th><strong>population</strong></th><th><strong>area</strong></th></tr>
</thead>
<tbody>
<tr class="odd"><td><strong>California</strong></td><td>38332521</td><td>423967</td></tr>
<tr class="even"><td><strong>Texas</strong></td><td>26448193</td><td>695662</td></tr>
<tr class="odd"><td><strong>New York</strong></td><td>19651127</td><td>141297</td></tr>
<tr class="even"><td><strong>Florida</strong></td><td>19552860</td><td>170312</td></tr>
<tr class="odd"><td><strong>Illinois</strong></td><td>12882135</td><td>149995</td></tr>
</tbody>
</table>

> Like the Series object, the DataFrame has an index attribute that
> gives access to the index labels:
>
> states.index
>
> Index(\['California', 'Texas', 'New York', 'Florida', 'Illinois'\],
> dtype='object')
>
> Additionally, the DataFrame has a columns attribute, which is an Index
> object holding the column labels:
>
> states.columns
>
> Index(\['population', 'area'\], dtype='object')
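
To illustrate what "aligned" means in practice, here is a minimal sketch using a two-state subset of the data. The deliberately different key order in `area` is an assumption made for illustration; the columns are matched on their index labels rather than on position:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'Texas': 695662, 'California': 423967})  # different key order on purpose

# Columns are aligned on the shared index labels, not on position.
states = pd.DataFrame({'population': population, 'area': area})

print(states.shape)
print(list(states.columns))
print(states.loc['Texas', 'area'])
```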

> Thus the DataFrame can be thought of as a generalization of a
> two-dimensional NumPy array, where both the rows and columns have a
> generalized index for accessing the data.


> **DataFrame as specialized dictionary**
>
> Similarly, we can also think of a DataFrame as a specialization of a
> dictionary. Where a dictionary maps a key to a value, a DataFrame maps
> a column name to a Series of column data. For example, indexing with
> the key 'area' returns the Series object containing the areas we saw
> earlier:
>
> states\['area'\]
>
> California 423967  
> Texas 695662  
> New York 141297  
> Florida 170312  
> Illinois 149995  
> Name: area, dtype: int64
>
> Notice the potential point of confusion here: in a two-dimensional
> NumPy array, data\[0\] will return the first *row*. For a DataFrame,
> data\['col0'\] will return the first *column*. Because of this, it is
> probably better to think about DataFrames as generalized dictionaries
> rather than generalized arrays, though both ways of looking at the
> situation can be useful. We'll explore more flexible means of indexing
> DataFrames later in this notebook.
>
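
The row-versus-column difference can be checked directly. This is a small sketch with made-up values; the column names `col0` and `col1` follow the wording of the paragraph above and are otherwise arbitrary:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row_of_array = arr[0]             # NumPy: a single index selects a row
first_column_of_frame = frame['col0']   # DataFrame: a single key selects a column
first_row_of_frame = frame.iloc[0]      # a row needs an explicit indexer

print(first_row_of_array)
print(first_column_of_frame.tolist())
print(first_row_of_frame.tolist())
```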
> **Constructing DataFrame objects**

A Pandas DataFrame can be constructed in a variety of ways. Here we'll
give several examples.

> **From a single Series object**
>
> A DataFrame is a collection of Series objects, and a single-column
> DataFrame can be constructed from a single Series :
>
> pd.DataFrame(population, columns=\['population'\])

<table>
<thead>
<tr class="header"><th></th><th><strong>population</strong></th></tr>
</thead>
<tbody>
<tr class="odd"><td><strong>California</strong></td><td>38332521</td></tr>
<tr class="even"><td><strong>Texas</strong></td><td>26448193</td></tr>
<tr class="odd"><td><strong>New York</strong></td><td>19651127</td></tr>
<tr class="even"><td><strong>Florida</strong></td><td>19552860</td></tr>
<tr class="odd"><td><strong>Illinois</strong></td><td>12882135</td></tr>
</tbody>
</table>

> **From a list of dicts**

> Any list of dictionaries can be made into a DataFrame. We'll use a
> simple list comprehension to create some data:
>
> data = \[{'a': i, 'b': 2 \* i}  
> for i in range(3)\]  
> pd.DataFrame(data)

<table>
<thead>
<tr class="header"><th></th><th><strong>a</strong></th><th><strong>b</strong></th></tr>
</thead>
<tbody>
<tr class="odd"><td><strong>0</strong></td><td>0</td><td>0</td></tr>
<tr class="even"><td><strong>1</strong></td><td>1</td><td>2</td></tr>
<tr class="odd"><td><strong>2</strong></td><td>2</td><td>4</td></tr>
</tbody>
</table>

> Even if some keys in the dictionary are missing, Pandas will fill them
> in with NaN (i.e., "not a number") values:
>
> pd.DataFrame(\[{'a': 1, 'b': 2}, {'b': 3, 'c': 4}\])

<table>
<thead>
<tr class="header"><th></th><th><strong>a</strong></th><th><strong>b</strong></th><th><strong>c</strong></th></tr>
</thead>
<tbody>
<tr class="odd"><td><strong>0</strong></td><td>1.0</td><td>2</td><td>NaN</td></tr>
<tr class="even"><td><strong>1</strong></td><td>NaN</td><td>3</td><td>4.0</td></tr>
</tbody>
</table>

> **From a dictionary of Series objects**

> As we saw before, a DataFrame can be constructed from a dictionary of
> Series objects as well:
>
> pd.DataFrame({'population': population,  
> 'area': area})

<table>
<thead>
<tr class="header"><th></th><th><strong>population</strong></th><th><strong>area</strong></th></tr>
</thead>
<tbody>
<tr class="odd"><td><strong>California</strong></td><td>38332521</td><td>423967</td></tr>
<tr class="even"><td><strong>Texas</strong></td><td>26448193</td><td>695662</td></tr>
<tr class="odd"><td><strong>New York</strong></td><td>19651127</td><td>141297</td></tr>
<tr class="even"><td><strong>Florida</strong></td><td>19552860</td><td>170312</td></tr>
<tr class="odd"><td><strong>Illinois</strong></td><td>12882135</td><td>149995</td></tr>
</tbody>
</table>

> **From a two-dimensional NumPy array**
>
> Given a two-dimensional array of data, we can create a DataFrame with
> any specified column
> and index names. If omitted, an integer index will be used for each:
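
No code cell for this case survives in this copy of the notebook, so the following is a sketch under stated assumptions rather than the original example. The array contents and the labels `foo`, `bar`, `a`, `b`, `c` are chosen purely for illustration:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)   # a small 3x2 array: [[0, 1], [2, 3], [4, 5]]

# Explicit column and index labels.
labeled = pd.DataFrame(values, columns=['foo', 'bar'], index=['a', 'b', 'c'])

# With both omitted, integer labels are used for rows and columns.
unlabeled = pd.DataFrame(values)

print(labeled)
print(unlabeled)
```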


> **From a NumPy structured array**
>
> A Pandas DataFrame operates much like a structured array in NumPy, and
> can be created directly from one:
>
> A = np.zeros(3, dtype=\[('A', 'i8'), ('B', 'f8')\])
>
> A
>
> array(\[(0, 0.), (0, 0.), (0, 0.)\], dtype=\[('A', '\<i8'), ('B',
> '\<f8')\])
>
> pd.DataFrame(A)

<table>
<thead>
<tr class="header"><th></th><th><strong>A</strong></th><th><strong>B</strong></th></tr>
</thead>
<tbody>
<tr class="odd"><td><strong>0</strong></td><td>0</td><td>0.0</td></tr>
<tr class="even"><td><strong>1</strong></td><td>0</td><td>0.0</td></tr>
<tr class="odd"><td><strong>2</strong></td><td>0</td><td>0.0</td></tr>
</tbody>
</table>

> **The Pandas Index Object**

> We have seen here that both the Series and DataFrame objects contain
> an explicit *index* that lets you reference and modify data. This
> Index object is an interesting structure in itself, and it can be
> thought of either as an *immutable array* or as an *ordered set*
> (technically a multi-set, as Index objects may contain repeated
> values). Those views have some interesting consequences in the
> operations available on Index objects. As a simple example, let's
> construct an Index from a list of integers:
>
> ind = pd.Index(\[2, 3, 5, 7, 11\])
>
> ind
>
> Int64Index(\[2, 3, 5, 7, 11\], dtype='int64')
>
> **Index as immutable array**
>
> The Index in many ways operates like an array. For example, we can use
> standard Python indexing notation to retrieve values or slices:
>
> ind\[1\]
>
> 3
>
> ind\[::2\]
>
> Int64Index(\[2, 5, 11\], dtype='int64')


> Index objects also have many of the attributes familiar from NumPy
> arrays:
>
> print(ind.size, ind.shape, ind.ndim, ind.dtype)
>
> 5 (5,) 1 int64
>
> One difference between Index objects and NumPy arrays is that indices
> are immutable; that is, they cannot be modified via the normal means,
> such as item assignment.
>
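
A minimal sketch of what immutability looks like in practice: assigning to an element of an Index raises a `TypeError`. The `try`/`except` wrapper and the `error_message` variable are added here only so that the cell runs to completion:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                  # item assignment is not supported on an Index
except TypeError as err:
    error_message = str(err)
else:
    error_message = None        # would mean the assignment unexpectedly succeeded

print(error_message)
print(list(ind))                # the Index is unchanged
```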
> This immutability makes it safer to share indices between multiple
> DataFrames and arrays, without the potential for side effects from
> inadvertent index modification.
>
> **Index as ordered set**
>
> Pandas objects are designed to facilitate operations such as joins
> across datasets, which depend on many aspects of set arithmetic. The
> Index object follows many of the conventions used by Python's built-in
> set data structure, so that unions, intersections, differences, and
> other combinations can be computed in a familiar way:
>
> indA = pd.Index(\[1, 3, 5, 7, 9\])
>
> indB = pd.Index(\[2, 3, 5, 7, 11\])
>
> indA & indB \# intersection
>
> \<ipython-input-41-4d2a3e5acbb8\>:1: FutureWarning: Index.\_\_and\_\_
> operating as a set op
>
> indA & indB \# intersection
>
> Int64Index(\[3, 5, 7\], dtype='int64')
>
> indA \| indB \# union
>
> \<ipython-input-42-a4c8ebc5c197\>:1: FutureWarning: Index.\_\_or\_\_
> operating as a set ope
>
> indA \| indB \# union
>
> Int64Index(\[1, 2, 3, 5, 7, 9, 11\], dtype='int64')
>
> indA ^ indB \# symmetric difference
>
> \<ipython-input-43-3b8ccf9eb8f2\>:1: FutureWarning: Index.\_\_xor\_\_
> operating as a set op
>
> indA ^ indB \# symmetric difference
>
> Int64Index(\[1, 2, 9, 11\], dtype='int64')
>


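> These operations may also be accessed via object methods, for example
> indA.intersection(indB), indA.union(indB), and
> indA.symmetric_difference(indB). As the FutureWarning messages above
> indicate, the operator forms are deprecated as set operations in this
> Pandas version, so the method forms are the safer choice.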
>
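
The method forms can be sketched as follows, using the same two indices. The result variable names are illustrative; `difference` is included as one more set operation beyond the three shown with operators above:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels present in both
either = indA.union(indB)                   # labels present in at least one
only_one = indA.symmetric_difference(indB)  # labels present in exactly one
only_a = indA.difference(indB)              # labels in indA but not in indB

print(list(common), list(either), list(only_one), list(only_a))
```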
> **Updated till this section .......**
>
> We'll start with the simple case of the one-dimensional Series object,
> and then move on to the more complicated two-dimensional DataFrame
> object.
>
> **Data Selection in Series**
>
> As we saw in the previous section, a Series object acts in many ways
> like a one-dimensional NumPy array, and in many ways like a standard
> Python dictionary. If we keep these two overlapping analogies in mind,
> it will help us to understand the patterns of data indexing and
> selection in these arrays.
>
> **Series as dictionary**
>
> Like a dictionary, the Series object provides a mapping from a
> collection of keys to a collection of values:
>
> import pandas as pd  
> data = pd.Series(\[0.25, 0.5, 0.75, 1.0\],  
> index=\['a', 'b', 'c', 'd'\])  
> data
>
> a 0.25  
> b 0.50  
> c 0.75  
> d 1.00  
> dtype: float64
>
> data\['b'\]
>
> 0.5
>
> We can also use dictionary-like Python expressions and methods to
> examine the keys/indices and values:
>
> 'a' in data
>
> True


> data.keys()
>
> Index(\['a', 'b', 'c', 'd'\], dtype='object')
>
> list(data.items())
>
> \[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)\]
>
> Series objects can even be modified with a dictionary-like syntax. Just
> as you can extend a dictionary by assigning to a new key, you can
> extend a Series by assigning to a new index value:
>
> data\['e'\] = 1.25  
> data
>
> a 0.25  
> b 0.50  
> c 0.75  
> d 1.00  
> e 1.25  
> dtype: float64
>
> This easy mutability of the objects is a convenient feature: under the
> hood, Pandas is making decisions about memory layout and data copying
> that might need to take place; the user generally does not need to
> worry about these issues.
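
The dictionary-like behaviour can be summarised in one short sketch. The variable names `has_a`, `has_value`, and `pairs` are illustrative. Note that membership tests look at the index labels, not at the stored values:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

has_a = 'a' in data          # membership tests the index labels
has_value = 0.25 in data     # values are not searched, so this is False
data['e'] = 1.25             # assigning to a new label extends the Series
data['a'] = 0.0              # assigning to an existing label overwrites the value
pairs = list(data.items())   # (label, value) pairs, like dict.items()

print(has_a, has_value)
print(pairs)
```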
>
> **Series as one-dimensional array**
>
> A Series builds on this dictionary-like interface and provides
> array-style item selection via the same basic mechanisms as NumPy
> arrays – that is, *slices*, *masking*, and *fancy indexing*.
>
> Examples of these are as follows:
>
> \# slicing by explicit index  
> data\['a':'c'\]
>
> a 0.25  
> b 0.50  
> c 0.75  
> dtype: float64
>
> \# slicing by implicit integer index  
> data\[0:2\]
>
> a 0.25  
> b 0.50  
> dtype: float64


> \# masking  
> data\[(data \> 0.3) & (data \< 0.8)\]
>
> b 0.50  
> c 0.75  
> dtype: float64
>
> \# fancy indexing  
> data\[\['a', 'e'\]\]
>
> a 0.25  
> e 1.25  
> dtype: float64
>
> Among these, slicing may be the source of the most confusion. Notice
> that when slicing with an explicit index (i.e., data\['a':'c'\]), the
> final index is *included* in the slice, while when slicing with an
> implicit index (i.e., data\[0:2\]), the final index is *excluded*
> from the slice.
>
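
The two slicing conventions can be compared side by side. This sketch spells them out with `.loc` (labels) and `.iloc` (positions) so the endpoint rule of each is explicit; the variable names are illustrative:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']     # label slice: the stop label 'c' is included
by_position = data.iloc[0:2]     # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```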
> **Indexers: loc, iloc, and ix**
>
> These slicing and indexing conventions can be a source of confusion.
> For example, if your Series has an explicit integer index, an indexing
> operation such as data\[1\] will use the explicit indices, while a
> slicing operation like data\[1:3\] will use the implicit Python-style
> index.
>
> data = pd.Series(\['a', 'b', 'c'\], index=\[1, 3, 5\])  
> data
>
> 1 a  
> 3 b  
> 5 c  
> dtype: object
>
> \# explicit index when indexing  
> data\[1\]
>
> 'a'
>
> \# implicit index when slicing  
> data\[1:3\]
>
> 3 b  
> 5 c  
> dtype: object
>
> Because of this potential confusion in the case of integer indexes,
> Pandas provides some special *indexer* attributes that explicitly
> expose certain indexing schemes. These are not functional methods, but
> attributes that expose a particular slicing interface to the data in
> the Series.
>
> First, the loc attribute allows indexing and slicing that always
> references the explicit index:
>
> data.loc\[1\]
>
> 'a'
>
> data.loc\[1:3\]
>
> 1 a  
> 3 b  
> dtype: object
>
> The iloc attribute allows indexing and slicing that always references
> the implicit Python-style index:
>
> data.iloc\[1\]
>
> 'b'
>
> data.iloc\[1:3\]
>
> 3 b  
> 5 c  
> dtype: object
>
> Older Pandas releases offered a third indexing attribute, ix, a hybrid
> of the two that for Series objects was equivalent to standard
> \[\]-based indexing. It was deprecated in Pandas 0.20 and removed in
> Pandas 1.0, so it is not available in the version used in this
> notebook; use loc and iloc instead.
>
> One guiding principle of Python code is that "explicit is better than
> implicit." The explicit nature of loc and iloc makes them very useful
> in maintaining clean and readable code; especially in the case of
> integer indexes, I recommend using them both to make code easier to
> read and understand, and to prevent subtle bugs due to the mixed
> indexing/slicing convention.
>
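
As a compact sketch of the recommendation above, the same Series with integer labels gives different answers depending on the indexer, and each indexer always means the same thing. The result variable names are illustrative:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = data.loc[1]           # the element whose label is 1
position_item = data.iloc[1]       # the element at position 1 (the second one)
label_slice = data.loc[1:3]        # labels 1 through 3, stop label included
position_slice = data.iloc[1:3]    # positions 1 and 2, stop position excluded

print(label_item, position_item)
print(list(label_slice), list(position_slice))
```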
> **Data Selection in DataFrame**
>
> Recall that a DataFrame acts in many ways like a two-dimensional or
> structured array, and in other ways like a dictionary of Series
> structures sharing the same index. These analogies can be helpful to
> keep in mind as we explore data selection within this structure.
>
> **DataFrame as a dictionary**


> The first analogy we will consider is the DataFrame as a dictionary of
> related Series objects.
>
> Let's return to our example of areas and populations of states:
>
> area = pd.Series({'California': 423967, 'Texas': 695662,  
> 'New York': 141297, 'Florida': 170312,  
> 'Illinois': 149995})  
> pop = pd.Series({'California': 38332521, 'Texas': 26448193,  
> 'New York': 19651127, 'Florida': 19552860,  
> 'Illinois': 12882135})  
> data = pd.DataFrame({'area':area, 'pop':pop})  
> data

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>California</strong></td>
<td>423967</td>
<td>38332521</td>
</tr>
<tr class="even">
<td><strong>Texas</strong></td>
<td>695662</td>
<td>26448193</td>
</tr>
<tr class="odd">
<td><strong>New York</strong></td>
<td>141297</td>
<td>19651127</td>
</tr>
<tr class="even">
<td><strong>Florida</strong></td>
<td>170312</td>
<td>19552860</td>
</tr>
<tr class="odd">
<td><strong>Illinois</strong></td>
<td>149995</td>
<td>12882135</td>
</tr>
</tbody>
</table>

> The individual Series that make up the columns of the DataFrame can be
> accessed via dictionary-style indexing of the column name:
>
> data\['area'\]
>
> California 423967  
> Texas 695662  
> New York 141297  
> Florida 170312  
> Illinois 149995  
> Name: area, dtype: int64
>
> Equivalently, we can use attribute-style access with column names that
> are strings:
>
> data.area
>
> California 423967  
> Texas 695662  
> New York 141297  
> Florida 170312  
> Illinois 149995  
> Name: area, dtype: int64

> This attribute-style column access actually accesses the exact same
> object as the dictionary-style access:


> data.area is data\['area'\]
>
> True
>
> Though this is a useful shorthand, keep in mind that it does not work
> for all cases! If the column names are not strings, or if the column
> names conflict with methods of the DataFrame , this attribute-style
> access is not possible. For example, the DataFrame has a pop() method,
> so data.pop will point to this rather than the "pop" column:
>
> data.pop is data\['pop'\]
>
> False
>
> In particular, you should avoid the temptation to try column
> assignment via attribute (i.e., use data\['pop'\] = z rather than
> data.pop = z ).
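
The name clash can be checked directly. The following is a minimal sketch that rebuilds the same two-column DataFrame; the helper names `area_matches`, `pop_is_method`, and `pop_column` are introduced here purely for illustration.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})

# 'area' does not clash with any DataFrame attribute, so both spellings agree
area_matches = data.area.equals(data['area'])

# 'pop' clashes with the DataFrame.pop method, so the attribute is the method
pop_is_method = callable(data.pop)

# Dictionary-style access still reaches the column
pop_column = data['pop']

print(area_matches, pop_is_method, type(pop_column).__name__)
```

Dictionary-style access works for every column name, which is why it is the safer default for both reading and assignment.
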
>
> Like with the Series objects discussed earlier, this dictionary-style
> syntax can also be used to modify the object, in this case adding a
> new column:
>
> data\['density'\] = data\['pop'\] / data\['area'\]  
> data

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
<th><strong>density</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>California</strong></td>
<td>423967</td>
<td>38332521</td>
<td>90.413926</td>
</tr>
<tr class="even">
<td><strong>Texas</strong></td>
<td>695662</td>
<td>26448193</td>
<td><blockquote>
<p>38.018740</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><strong>New York</strong></td>
<td>141297</td>
<td>19651127</td>
<td>139.076746</td>
</tr>
<tr class="even">
<td><strong>Florida</strong></td>
<td>170312</td>
<td>19552860</td>
<td>114.806121</td>
</tr>
<tr class="odd">
<td><strong>Illinois</strong></td>
<td>149995</td>
<td>12882135</td>
<td><blockquote>
<p>85.883763</p>
</blockquote></td>
</tr>
</tbody>
</table>

> This shows a preview of the straightforward syntax of
> element-by-element arithmetic between Series objects; we'll dig into
> this further in Operating on Data in Pandas.

> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image34.png"
> style="width:0.125in" /> DataFrame as two-dimensional array

> As mentioned previously, we can also view the DataFrame as an enhanced
> two-dimensional array. We can examine the raw underlying data array
> using the values attribute:
>
> data.values
>
> array(\[\[4.23967000e+05, 3.83325210e+07, 9.04139261e+01\],  
> \[6.95662000e+05, 2.64481930e+07, 3.80187404e+01\],  
> \[1.41297000e+05, 1.96511270e+07, 1.39076746e+02\],  
> \[1.70312000e+05, 1.95528600e+07, 1.14806121e+02\],  
> \[1.49995000e+05, 1.28821350e+07, 8.58837628e+01\]\])
>
> With this picture in mind, many familiar array-like observations can
> be done on the DataFrame itself. For example, we can transpose the
> full DataFrame to swap rows and columns:
>
> data.T

<table style="width:100%;">
<colgroup>
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><blockquote>
<p><strong>California</strong></p>
</blockquote></th>
<th><strong>Texas</strong></th>
<th><strong>New York</strong></th>
<th><strong>Florida</strong></th>
<th><strong>Illinois</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>area</strong></td>
<td>4.239670e+05</td>
<td>6.956620e+05</td>
<td>1.412970e+05</td>
<td>1.703120e+05</td>
<td>1.499950e+05</td>
</tr>
<tr class="even">
<td><strong>pop</strong></td>
<td>3.833252e+07</td>
<td>2.644819e+07</td>
<td>1.965113e+07</td>
<td>1.955286e+07</td>
<td>1.288214e+07</td>
</tr>
<tr class="odd">
<td><strong>density</strong></td>
<td>9.041393e+01</td>
<td>3.801874e+01</td>
<td>1.390767e+02</td>
<td>1.148061e+02</td>
<td>8.588376e+01</td>
</tr>
</tbody>
</table>

> When it comes to indexing of DataFrame objects, however, it is clear
> that the dictionary-style indexing of columns precludes our ability to
> simply treat it as a NumPy array. In particular, passing a single
> index to an array accesses a row:
>
> data.values\[0\]
>
> array(\[4.23967000e+05, 3.83325210e+07, 9.04139261e+01\])
>
> and passing a single "index" to a DataFrame accesses a column:
>
> data\['area'\]
>
> California 423967  
> Texas 695662  
> New York 141297  
> Florida 170312  
> Illinois 149995  
> Name: area, dtype: int64
>
> Thus for array-style indexing, we need another convention. Here Pandas
> again uses the loc and iloc indexers mentioned earlier (the hybrid ix
> indexer is no longer available as of pandas 1.0). Using the iloc
> indexer, we can index the underlying array as if it is a simple NumPy
> array (using the implicit Python-style index), but the DataFrame index
> and column labels are maintained in the result:
>
> data.iloc\[:3, :2\]

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>California</strong></td>
<td>423967</td>
<td>38332521</td>
</tr>
<tr class="even">
<td><strong>Texas</strong></td>
<td>695662</td>
<td>26448193</td>
</tr>
<tr class="odd">
<td><strong>New York</strong></td>
<td>141297</td>
<td>19651127</td>
</tr>
</tbody>
</table>

> Similarly, using the loc indexer we can index the underlying data in
> an array-like style but using the explicit index and column names:
>
> data.loc\[:'Illinois', :'pop'\]

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>California</strong></td>
<td>423967</td>
<td>38332521</td>
</tr>
<tr class="even">
<td><strong>Texas</strong></td>
<td>695662</td>
<td>26448193</td>
</tr>
<tr class="odd">
<td><strong>New York</strong></td>
<td>141297</td>
<td>19651127</td>
</tr>
<tr class="even">
<td><strong>Florida</strong></td>
<td>170312</td>
<td>19552860</td>
</tr>
<tr class="odd">
<td><strong>Illinois</strong></td>
<td>149995</td>
<td>12882135</td>
</tr>
</tbody>
</table>

> Older versions of Pandas also provided an ix indexer that allowed a
> hybrid of these two approaches. It was deprecated in pandas 0.20 and
> removed in pandas 1.0, so it is not demonstrated here. For integer
> indices, ix was subject to the same potential sources of confusion as
> discussed for integer-indexed Series objects.
>
> Any of the familiar NumPy-style data access patterns can be used
> within these indexers. For example, in the loc indexer we can combine
> masking and fancy indexing as in the following:
>
> data.loc\[data.density \> 100, \['pop', 'density'\]\]

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>pop</strong></th>
<th><strong>density</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>New York</strong></td>
<td>19651127</td>
<td>139.076746</td>
</tr>
<tr class="even">
<td><strong>Florida</strong></td>
<td>19552860</td>
<td>114.806121</td>
</tr>
</tbody>
</table>
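
The iloc and loc selections discussed in this part can be tried together. The sketch below rebuilds the states DataFrame from literal values (an assumption made so that the block runs on its own); the result names `first_rows`, `by_label`, and `dense` are illustrative.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# Positional selection: first three rows, first two columns
first_rows = data.iloc[:3, :2]

# Label-based selection: both slice endpoints are included
by_label = data.loc[:'Illinois', :'pop']

# Boolean mask on the rows combined with a list of column labels
dense = data.loc[data['density'] > 100, ['pop', 'density']]

print(first_rows)
print(by_label)
print(dense)
```

In every case the result keeps the row and column labels of the original DataFrame.
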

> Any of these indexing conventions may also be used to set or modify
> values; this is done in the standard way that you might be accustomed
> to from working with NumPy:
>
> data.iloc\[0, 2\] = 90  
> data

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
<th><strong>density</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>California</strong></td>
<td>423967</td>
<td>38332521</td>
<td>90.000000</td>
</tr>
<tr class="even">
<td><strong>Texas</strong></td>
<td>695662</td>
<td>26448193</td>
<td>38.018740</td>
</tr>
<tr class="odd">
<td><strong>New York</strong></td>
<td>141297</td>
<td>19651127</td>
<td>139.076746</td>
</tr>
<tr class="even">
<td><strong>Florida</strong></td>
<td>170312</td>
<td>19552860</td>
<td>114.806121</td>
</tr>
<tr class="odd">
<td><strong>Illinois</strong></td>
<td>149995</td>
<td>12882135</td>
<td>85.883763</td>
</tr>
</tbody>
</table>

> To build up your fluency in Pandas data manipulation, I suggest
> spending some time with a simple DataFrame and exploring the types of
> indexing, slicing, masking, and fancy indexing that are allowed by
> these various indexing approaches.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image40.png"
> style="width:0.125in" /> Additional indexing conventions
>
> There are a couple of extra indexing conventions that might seem at
> odds with the preceding discussion, but nevertheless can be very
> useful in practice. First, while *indexing* refers to columns,
> *slicing* refers to rows:
>
> data\['Florida':'Illinois'\]

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
<th><strong>density</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Florida</strong></td>
<td>170312</td>
<td>19552860</td>
<td>114.806121</td>
</tr>
<tr class="even">
<td><strong>Illinois</strong></td>
<td>149995</td>
<td>12882135</td>
<td><blockquote>
<p>85.883763</p>
</blockquote></td>
</tr>
</tbody>
</table>

> Such slices can also refer to rows by number rather than by index:
>
> data\[1:3\]

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
<th><strong>density</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Texas</strong></td>
<td>695662</td>
<td>26448193</td>
<td>38.018740</td>
</tr>
<tr class="even">
<td><strong>New York</strong></td>
<td>141297</td>
<td>19651127</td>
<td>139.076746</td>
</tr>
</tbody>
</table>

> Similarly, direct masking operations are also interpreted row-wise
> rather than column-wise:
>
> data\[data.density \> 100\]

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>area</strong></th>
<th><strong>pop</strong></th>
<th><strong>density</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>New York</strong></td>
<td>141297</td>
<td>19651127</td>
<td>139.076746</td>
</tr>
<tr class="even">
<td><strong>Florida</strong></td>
<td>170312</td>
<td>19552860</td>
<td>114.806121</td>
</tr>
</tbody>
</table>

> These two conventions are syntactically similar to those on a NumPy
> array, and while these may not precisely fit the mold of the Pandas
> conventions, they are nevertheless quite useful in practice.
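
The three bracket behaviours can be compared side by side. This is a small sketch on a DataFrame rebuilt from literal values (an assumption, so that the block is self-contained); the result names are illustrative.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

column = data['area']                      # a single label selects a column
label_rows = data['Florida':'Illinois']    # a label slice selects rows (end included)
position_rows = data[1:3]                  # an integer slice selects rows by position
masked_rows = data[data['density'] > 100]  # a Boolean mask filters rows

print(list(label_rows.index))
print(list(position_rows.index))
print(list(masked_rows.index))
```

Because plain brackets switch meaning between columns and rows depending on the argument, loc and iloc remain the clearer choice when the intent needs to be explicit.
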


> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image3.png"
> style="width:0.125in" /> Operating on Data in Pandas
>
> One of the essential pieces of NumPy is the ability to perform quick
> element-wise operations, both with basic arithmetic (addition,
> subtraction, multiplication, etc.) and with more sophisticated
> operations (trigonometric functions, exponential and logarithmic
> functions, etc.).
>
> Pandas inherits much of this functionality from NumPy, and NumPy's
> ufuncs (universal functions) are key to this.
>
> Pandas includes a couple of useful twists, however: for unary
> operations like negation and trigonometric functions, these ufuncs
> will *preserve index and column labels* in the output, and for binary
> operations such as addition and multiplication, Pandas will
> automatically *align indices* when passing the objects to the ufunc.
> This means that keeping the context of data and combining data from
> different sources, both potentially error-prone tasks with raw NumPy
> arrays, become far less error-prone with Pandas. We will additionally
> see that there are well-defined operations between one-dimensional
> Series structures and two-dimensional DataFrame structures.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image44.png"
> style="width:0.125in" /> Ufuncs: Index Preservation
>
> Because Pandas is designed to work with NumPy, any NumPy ufunc will
> work on Pandas Series and DataFrame objects. Let's start by defining a
> simple Series and DataFrame on which to demonstrate this:
>
> import pandas as pd  
> import numpy as np
>
> rng = np.random.RandomState(42)  
> ser = pd.Series(rng.randint(0, 10, 4))  
> ser
>
> 0 6  
> 1 3  
> 2 7  
> 3 4  
> dtype: int64
>
> df = pd.DataFrame(rng.randint(0, 10, (3, 4)),  
> columns=\['A', 'B', 'C', 'D'\])  
> df

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
<th><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>6</td>
<td>9</td>
<td>2</td>
<td>6</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>7</td>
<td>4</td>
<td>3</td>
<td>7</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>7</td>
<td>2</td>
<td>5</td>
<td>4</td>
</tr>
</tbody>
</table>

> If we apply a NumPy ufunc on either of these objects, the result will
> be another Pandas object *with the indices preserved:*
>
> np.exp(ser)
>
> 0 403.428793  
> 1 20.085537  
> 2 1096.633158  
> 3 54.598150  
> dtype: float64
>
> Or, for a slightly more complex calculation:
>
> np.sin(df \* np.pi / 4)

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
<th><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>-1.000000</td>
<td>7.071068e-01</td>
<td>1.000000</td>
<td>-1.000000e+00</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>-0.707107</td>
<td>1.224647e-16</td>
<td><blockquote>
<p>0.707107</p>
</blockquote></td>
<td>-7.071068e-01</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>-0.707107</td>
<td>1.000000e+00</td>
<td>-0.707107</td>
<td><blockquote>
<p>1.224647e-16</p>
</blockquote></td>
</tr>
</tbody>
</table>

> Any other NumPy ufunc can be used in a similar manner.
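
Index preservation is easiest to see with non-default labels. The sketch below uses fixed values instead of random draws and assumes string labels `'w'` to `'z'`, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A Series with string labels, so the preserved index is easy to spot
ser = pd.Series([6, 3, 7, 4], index=['w', 'x', 'y', 'z'])
exp_ser = np.exp(ser)              # still a Series, same labels

df = pd.DataFrame([[6, 9, 2, 6],
                   [7, 4, 3, 7],
                   [7, 2, 5, 4]], columns=list('ABCD'))
sin_df = np.sin(df * np.pi / 4)    # still a DataFrame, same rows and columns

print(exp_ser)
print(sin_df)
```

The ufunc computes on the underlying values, and Pandas wraps the result back into an object carrying the original labels.
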

> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image47.png"
> style="width:0.125in" /> UFuncs: Index Alignment

> For binary operations on two Series or DataFrame objects, Pandas will
> align indices in the process of performing the operation. This is very
> convenient when working with incomplete data, as we'll see in some of
> the examples that follow.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image48.png"
> style="width:0.125in" /> Index alignment in Series
>
> As an example, suppose we are combining two different data sources,
> and find only the top three US states by *area* and the top three US
> states by *population*:
>
> area = pd.Series({'Alaska': 1723337, 'Texas': 695662,  
> 'California': 423967}, name='area')  
> population = pd.Series({'California': 38332521, 'Texas': 26448193,  
> 'New York': 19651127}, name='population')


> Let's see what happens when we divide these to compute the population
> density:
>
> population / area
>
> Alaska NaN  
> California 90.413926  
> New York NaN  
> Texas 38.018740  
> dtype: float64
>
> The resulting array contains the *union* of indices of the two input
> arrays, which can be determined with the union method of these indices
> (the older area.index \| population.index spelling is deprecated as a
> set operation):
>
> area.index.union(population.index)
>
> Index(\['Alaska', 'California', 'New York', 'Texas'\], dtype='object')
>
> Any item for which one or the other does not have an entry is marked
> with NaN , or "Not a Number," which is how Pandas marks missing data
> (see further discussion of missing data in Handling Missing Data).
> This index matching is implemented this way for any of Python's
> built-in arithmetic expressions; any missing values are filled in with
> NaN by default:
>
> A = pd.Series(\[2, 4, 6\], index=\[0, 1, 2\])  
> B = pd.Series(\[1, 3, 5\], index=\[1, 2, 3\])  
> A + B
>
> 0 NaN  
> 1 5.0  
> 2 9.0  
> 3 NaN  
> dtype: float64

> If using NaN values is not the desired behavior, the fill value can be
> modified using appropriate object methods in place of the operators.
> For example, calling A.add(B) is equivalent to calling A + B , but
> allows optional explicit specification of the fill value for any
> elements in A or B that might be missing:
>
> A.add(B, fill_value=0)
>
> 0 2.0  
> 1 5.0  
> 2 9.0  
> 3 5.0  
> dtype: float64
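
The operator and the method can be compared on the same inputs. A minimal sketch follows; the names `plain_sum`, `filled_sum`, and `all_labels` are introduced here for illustration.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Labels 0 and 3 exist on one side only, so the operator yields NaN there
plain_sum = A + B

# With fill_value=0, the missing side is treated as 0 before adding
filled_sum = A.add(B, fill_value=0)

# The result is indexed by the union of both indexes
all_labels = A.index.union(B.index)

print(plain_sum)
print(filled_sum)
print(list(all_labels))
```

Note that fill_value applies only where exactly one operand is missing; a label absent from both inputs would not appear in the result at all.
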


> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image50.png"
> style="width:0.125in" /> Index alignment in DataFrame
>
> A similar type of alignment takes place for *both* columns and indices
> when performing operations on DataFrames:
>
> A = pd.DataFrame(rng.randint(0, 20, (2, 2)),  
> columns=list('AB'))  
> A

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1</td>
<td>11</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>5</td>
<td>1</td>
</tr>
</tbody>
</table>

> B = pd.DataFrame(rng.randint(0, 10, (3, 3)),  
> columns=list('BAC'))  
> B

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>B</strong></th>
<th><strong>A</strong></th>
<th><strong>C</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>4</td>
<td>0</td>
<td>9</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>5</td>
<td>8</td>
<td>0</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>9</td>
<td>2</td>
<td>6</td>
</tr>
</tbody>
</table>

> A + B

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1.0</td>
<td>15.0</td>
<td>NaN</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>13.0</td>
<td>6.0</td>
<td>NaN</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>

> Notice that indices are aligned correctly irrespective of their order
> in the two objects, and indices in the result are sorted. As was the
> case with Series , we can use the associated object's arithmetic
> method and pass any desired fill_value to be used in place of missing
> entries.
>
> Here we'll fill with the mean of all values in A (computed by first
> stacking the rows of A ):
>
> fill = A.stack().mean()  
> A.add(B, fill_value=fill)

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1.0</td>
<td>15.0</td>
<td>13.5</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>13.0</td>
<td>6.0</td>
<td>4.5</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>6.5</td>
<td>13.5</td>
<td>10.5</td>
</tr>
</tbody>
</table>

> The following table lists Python operators and their equivalent
> Pandas object methods:

<table>
<thead>
<tr class="header">
<th><strong>Python Operator</strong></th>
<th><strong>Pandas Method(s)</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>+</td>
<td>add()</td>
</tr>
<tr class="even">
<td>-</td>
<td>sub() , subtract()</td>
</tr>
<tr class="odd">
<td>*</td>
<td>mul() , multiply()</td>
</tr>
<tr class="even">
<td>/</td>
<td>truediv() , div() , divide()</td>
</tr>
<tr class="odd">
<td>//</td>
<td>floordiv()</td>
</tr>
<tr class="even">
<td>%</td>
<td>mod()</td>
</tr>
<tr class="odd">
<td>**</td>
<td>pow()</td>
</tr>
</tbody>
</table>

> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image55.png"
> style="width:0.125in" /> Ufuncs: Operations Between DataFrame and Series

> When performing operations between a DataFrame and a Series , the
> index and column alignment is similarly maintained. Operations between
> a DataFrame and a Series are similar to operations between a
> two-dimensional and one-dimensional NumPy array. Consider one common
> operation, where we find the difference of a two-dimensional array and
> one of its rows:
>
> A = rng.randint(10, size=(3, 4))  
> A
>
> array(\[\[3, 8, 2, 4\],  
> \[2, 6, 4, 8\],  
> \[6, 1, 3, 8\]\])
>
> A - A\[0\]
>
> array(\[\[ 0, 0, 0, 0\],  
> \[-1, -2, 2, 4\],  
> \[ 3, -7, 1, 4\]\])
>
> According to NumPy's broadcasting rules, subtraction between a
> two-dimensional array and one of its rows is applied row-wise.
>
> In Pandas, the convention similarly operates row-wise by default:
>
> df = pd.DataFrame(A, columns=list('QRST'))  
> df - df.iloc\[0\]

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>Q</strong></th>
<th><strong>R</strong></th>
<th><strong>S</strong></th>
<th><strong>T</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>-1</td>
<td>-2</td>
<td>2</td>
<td>4</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>3</td>
<td>-7</td>
<td>1</td>
<td>4</td>
</tr>
</tbody>
</table>

> If you would instead like to operate column-wise, you can use the
> object methods mentioned earlier, while specifying the axis keyword:
>
> df.subtract(df\['R'\], axis=0)

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Q</strong></th>
<th><strong>R</strong></th>
<th><strong>S</strong></th>
<th><strong>T</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>-5</td>
<td>0</td>
<td>-6</td>
<td>-4</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>-4</td>
<td>0</td>
<td>-2</td>
<td>2</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>5</td>
<td>0</td>
<td>2</td>
<td>7</td>
</tr>
</tbody>
</table>

> Note that these DataFrame / Series operations, like the operations
> discussed above, will automatically align indices between the two
> elements:
>
> halfrow = df.iloc\[0, ::2\]  
> halfrow
>
> Q 3  
> S 2  
> Name: 0, dtype: int64
>
> df - halfrow

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Q</strong></th>
<th><strong>R</strong></th>
<th><strong>S</strong></th>
<th><strong>T</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>0.0</td>
<td>NaN</td>
<td>0.0</td>
<td>NaN</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>-1.0</td>
<td>NaN</td>
<td>2.0</td>
<td>NaN</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>3.0</td>
<td>NaN</td>
<td>1.0</td>
<td>NaN</td>
</tr>
</tbody>
</table>

> This preservation and alignment of indices and columns means that
> operations on data in Pandas will always maintain the data context,
> which prevents the types of silly errors that might come up when
> working with heterogeneous and/or misaligned data in raw NumPy arrays.
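
The alignment behaviour described above can be sketched as follows. The values in `frame` are sample data chosen to be consistent with the tables shown earlier, and the names `frame`, `partial`, and `aligned` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Sample data (an assumption): values consistent with the tables above
frame = pd.DataFrame([[3, 8, 2, 4],
                      [2, 6, 4, 8],
                      [6, 1, 3, 8]], columns=list('QRST'))

# A Series whose labels are a reordered subset of the columns
partial = pd.Series({'S': 2, 'Q': 3})

# Pandas matches on labels, not position: Q pairs with Q, S with S;
# columns with no partner (R and T) become NaN
aligned = frame - partial
print(aligned)

# The same subtraction on raw NumPy arrays has no labels to align on,
# and shapes (3, 4) and (2,) cannot be broadcast together
try:
    frame.to_numpy() - partial.to_numpy()
    numpy_failed = False
except ValueError:
    numpy_failed = True
print(numpy_failed)
```

With labels attached, the order of entries in `partial` does not matter; without them, the operation is rejected outright or, for compatible shapes, silently pairs the wrong values.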
>
> Handling Missing Data
>
> The difference between data found in many tutorials and data in the
> real world is that real-world data is rarely clean and homogeneous. In
> particular, many interesting datasets will have some amount of data
> missing. To make matters even more complicated, different data sources
> may indicate missing data in different ways.

> In this section, we will discuss some general considerations for
> missing data, discuss how Pandas chooses to represent it, and
> demonstrate some built-in Pandas tools for handling missing data in
> Python. Here and throughout the book, we'll refer to missing data in
> general as *null*, *NaN*, or *NA* values.
>
> Trade-Offs in Missing Data Conventions
>
> There are a number of schemes that have been developed to indicate the
> presence of missing data in a table or DataFrame. Generally, they
> revolve around one of two strategies: using a *mask* that globally
> indicates missing values, or choosing a *sentinel value* that
> indicates a missing entry.
>
> In the masking approach, the mask might be an entirely separate
> Boolean array, or it may involve appropriation of one bit in the data
> representation to locally indicate the null status of a value.
>
> In the sentinel approach, the sentinel value could be some
> data-specific convention, such as indicating a missing integer value
> with -9999 or some rare bit pattern, or it could be a more global
> convention, such as indicating a missing floating-point value with NaN
> (Not a Number), a special value which is part of the IEEE
> floating-point specification.
>
> None of these approaches is without trade-offs: use of a separate mask
> array requires allocation of an additional Boolean array, which adds
> overhead in both storage and computation. A sentinel value reduces the
> range of valid values that can be represented, and may require extra
> (often non-optimized) logic in CPU and GPU arithmetic. Common special
> values like NaN are not available for all data types.
>
> As in most cases where no universally optimal choice exists, different
> languages and systems use different conventions. For example, the R
> language uses reserved bit patterns within each data type as sentinel
> values indicating missing data, while the SciDB system uses an extra
> byte attached to every cell to indicate an NA state.
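
To make the two strategies concrete, here is a minimal sketch using NumPy only. The `-9999` marker and the variable names are hypothetical, picked to mirror the example in the text.

```python
import numpy as np

# Sentinel approach: the marker lives inside the data itself.
# -9999 is a hypothetical convention, so it can never be a real reading.
readings = np.array([12, 7, -9999, 15])
valid_mean_sentinel = readings[readings != -9999].mean()

# Mask approach: a separate Boolean array flags the missing entry,
# at the cost of allocating and carrying that second array around.
mask = np.array([False, False, True, False])
masked = np.ma.masked_array(readings, mask=mask)
valid_mean_masked = masked.mean()

# Global sentinel: IEEE NaN, available only for floating-point data
as_float = np.array([12.0, 7.0, np.nan, 15.0])
valid_mean_nan = np.nanmean(as_float)

print(valid_mean_sentinel, valid_mean_masked, valid_mean_nan)
```

All three give the same mean of the valid entries; the difference lies in where the "missing" information is stored and what it costs.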
>
> Missing Data in Pandas
>
> The way in which Pandas handles missing values is constrained by its
> reliance on the NumPy package, which does not have a built-in notion
> of NA values for non-floating-point data types.
>
> Pandas could have followed R's lead in specifying bit patterns for
> each individual data type to indicate nullness, but this approach
> turns out to be rather unwieldy. While R contains four basic data
> types, NumPy supports *far* more than this: for example, while R has a
> single integer type, NumPy supports *fourteen* basic integer types
> once you account for available precisions, signedness, and endianness
> of the encoding. Reserving a specific bit pattern in all available
> NumPy types would lead to an unwieldy amount of overhead in
> special-casing various operations for various types, likely even
> requiring a new fork of the NumPy package. Further, for the smaller
> data types (such as 8-bit integers), sacrificing a bit to use as a
> mask will significantly reduce the range of values it can represent.
>
> NumPy does have support for masked arrays, that is, arrays that have
> a separate Boolean mask array attached for marking data as "good" or
> "bad." Pandas could have derived from this, but the overhead in
> storage, computation, and code maintenance makes that an unattractive
> choice.
>
> With these constraints in mind, Pandas chose to use sentinels for
> missing data, and further chose to use two already-existing Python
> null values: the special floating-point NaN value, and the Python None
> object. This choice has some side effects, as we will see, but in
> practice ends up being a good compromise in most cases of interest.
>
> None: Pythonic missing data
>
> The first sentinel value used by Pandas is None, a Python singleton
> object that is often used for missing data in Python code. Because it
> is a Python object, None cannot be used in any arbitrary NumPy/Pandas
> array, but only in arrays with data type 'object' (i.e., arrays of
> Python objects):
>
> import numpy as np  
> import pandas as pd
>
> vals1 = np.array(\[1, None, 3, 4\])  
> vals1
>
> array(\[1, None, 3, 4\], dtype=object)
>
> This dtype=object means that the best common type representation NumPy
> could infer for the contents of the array is that they are Python
> objects. While this kind of object array is useful for some purposes,
> any operations on the data will be done at the Python level, with much
> more overhead than the typically fast operations seen for arrays with
> native types:
>
> for dtype in \['object', 'int'\]:  
>     print("dtype =", dtype)  
>     %timeit np.arange(1E6, dtype=dtype).sum()  
>     print()
>
> dtype = object  
> 47.2 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> dtype = int  
> 863 µs ± 213 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>
> The use of Python objects in an array also means that if you perform
> aggregations like sum() or min() across an array with a None value,
> you will generally get an error:
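
The failing cell is not preserved in this copy of the notebook. A minimal sketch of what triggers the error, assuming the `vals1` array defined above (the `try`/`except` wrapper and the `message` variable are additions so that the snippet runs to completion):

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])   # dtype=object, as shown earlier

try:
    vals1.sum()                      # Python-level addition hits 1 + None
    message = None
except TypeError as err:
    message = str(err)

print(message)
```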

> This reflects the fact that addition between an integer and None is
> undefined.
>
> NaN: Missing numerical data
>
> The other missing data representation, NaN (acronym for *Not a
> Number*), is different; it is a special floating-point value recognized
> by all systems that use the standard IEEE floating-point
> representation:
>
> vals2 = np.array(\[1, np.nan, 3, 4\])
>
> vals2.dtype
>
> dtype('float64')
>
> Notice that NumPy chose a native floating-point type for this array:
> this means that unlike the object array from before, this array
> supports fast operations pushed into compiled code. You should be
> aware that NaN is a bit like a data virus: it infects any other object
> it touches.
>
> Regardless of the operation, the result of arithmetic with NaN will be
> another NaN :
>
> 1 + np.nan
>
> nan
>
> 0 \* np.nan
>
> nan
>
> Note that this means that aggregates over the values are well defined
> (i.e., they don't result in an error) but not always useful:
>
> vals2.sum(), vals2.min(), vals2.max()
>
> (nan, nan, nan)
>
> NumPy does provide some special aggregations that will ignore these
> missing values:
>
> np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
>
> (8.0, 1.0, 4.0)
>
> Keep in mind that NaN is specifically a floating-point value; there is
> no equivalent NaN value for integers, strings, or other types.

> NaN and None in Pandas
>
> NaN and None both have their place, and Pandas is built to handle the
> two of them nearly interchangeably, converting between them where
> appropriate:
>
> pd.Series(\[1, np.nan, 2, None\])
>
> 0 1.0  
> 1 NaN  
> 2 2.0  
> 3 NaN  
> dtype: float64
>
> For types that don't have an available sentinel value, Pandas
> automatically type-casts when NA values are present. For example, if
> we set a value in an integer array to np.nan, it will automatically be
> upcast to a floating-point type to accommodate the NA:
>
> x = pd.Series(range(2), dtype=int)  
> x
>
> 0 0  
> 1 1  
> dtype: int64
>
> x\[0\] = None  
> x
>
> 0 NaN  
> 1 1.0  
> dtype: float64
>
> Notice that in addition to casting the integer array to floating point,
> Pandas automatically converts the None to a NaN value. (Be aware that
> newer Pandas releases also provide nullable extension dtypes such as
> 'Int64', which hold missing values without casting to float; the
> default NumPy-backed integer dtypes still upcast as shown here.)
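
As a hedged illustration of the upcasting rule, the sketch below compares the default behaviour with the nullable `"Int64"` extension dtype available in recent Pandas releases. The variable names are illustrative.

```python
import pandas as pd

# Default NumPy-backed integers: a missing value forces a cast to float64
upcast = pd.Series([1, None, 3])
print(upcast.dtype)

# Nullable extension dtype: the integers stay integers and the gap is
# tracked by a separate mask rather than a NaN sentinel
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```

Note the capital "I" in `"Int64"`: the lowercase `"int64"` is the ordinary NumPy dtype and cannot hold a missing value.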
>
> While this type of magic may feel a bit hackish compared to the more
> unified approach to NA values in domain-specific languages like R, the
> Pandas sentinel/casting approach works quite well in practice and in
> my experience only rarely causes issues.

The following table lists the upcasting conventions in Pandas when NA
values are introduced:

<table>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="header">
<th><strong>Typeclass</strong></th>
<th><strong>Conversion When Storing NAs</strong></th>
<th><strong>NA Sentinel Value</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>floating</td>
<td><blockquote>
<p>No change</p>
</blockquote></td>
<td><blockquote>
<p>np.nan</p>
</blockquote></td>
</tr>
<tr class="even">
<td><blockquote>
<p>object</p>
</blockquote></td>
<td><blockquote>
<p>No change</p>
</blockquote></td>
<td><blockquote>
<p>None or np.nan</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><blockquote>
<p>integer</p>
</blockquote></td>
<td><blockquote>
<p>Cast to float64</p>
</blockquote></td>
<td><blockquote>
<p>np.nan</p>
</blockquote></td>
</tr>
<tr class="even">
<td><blockquote>
<p>boolean</p>
</blockquote></td>
<td><blockquote>
<p>Cast to object</p>
</blockquote></td>
<td><blockquote>
<p>None or np.nan</p>
</blockquote></td>
</tr>
</tbody>
</table>

> Keep in mind that in Pandas, string data is stored with an object
> dtype by default.

> Operating on Null Values
>
> As we have seen, Pandas treats None and NaN as essentially
> interchangeable for indicating missing or null values. To facilitate
> this convention, there are several useful methods for detecting,
> removing, and replacing null values in Pandas data structures. They
> are:
>
> - isnull(): Generate a Boolean mask indicating missing values
> - notnull(): Opposite of isnull()
> - dropna(): Return a filtered version of the data
> - fillna(): Return a copy of the data with missing values filled or
>   imputed
>
> We will conclude this section with a brief exploration and
> demonstration of these routines.
>
> Detecting null values
>
> Pandas data structures have two useful methods for detecting null
> data: isnull() and notnull() . Either one will return a Boolean mask
> over the data. For example:
>
> data = pd.Series(\[1, np.nan, 'hello', None\])
>
> data.isnull()
>
> 0 False  
> 1 True  
> 2 False  
> 3 True  
> dtype: bool
>
> Boolean masks can be used directly as a Series or
> DataFrame index:
>
> data\[data.notnull()\]
>
> 0 1  
> 2 hello  
> dtype: object
>
> The isnull() and notnull() methods produce similar Boolean results for
> DataFrames.
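
For a DataFrame, the same methods return a Boolean DataFrame of matching shape. A small sketch with made-up data (the column names `a` and `b` are hypothetical):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, np.nan, 3],
                      'b': ['x', None, 'z']})

# One Boolean per cell; None and NaN are both reported as null
mask = frame.isnull()

# Summing the mask counts the nulls in each column
nulls_per_column = mask.sum()

# Keep only the rows in which every cell is non-null
complete_rows = frame[frame.notnull().all(axis=1)]

print(mask)
print(nulls_per_column)
print(complete_rows)
```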
>
> Dropping null values
>
> In addition to the masking used before, there are the convenience
> methods, dropna() (which removes NA values) and fillna() (which fills
> in NA values). For a Series, the result is straightforward:

> data.dropna()
>
> 0 1  
> 2 hello  
> dtype: object
>
> For a DataFrame, there are more options. Consider the following
> DataFrame:
>
> df = pd.DataFrame(\[\[1, np.nan, 2\],  
> \[2, 3, 5\],  
> \[np.nan, 4, 6\]\])  
> df

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>2</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1.0</td>
<td>NaN</td>
<td>2</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>2.0</td>
<td>3.0</td>
<td>5</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>NaN</td>
<td>4.0</td>
<td>6</td>
</tr>
</tbody>
</table>

> We cannot drop single values from a DataFrame; we can only drop full
> rows or full columns. Depending on the application, you might want one
> or the other, so dropna() gives a number of options for a DataFrame.
>
> By default, dropna() will drop all rows in which *any* null value is
> present:
>
> df.dropna()

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>2</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>1</strong></td>
<td>2.0</td>
<td>3.0</td>
<td>5</td>
<td></td>
</tr>
</tbody>
</table>

> Alternatively, you can drop NA values along a different axis; axis=1
> (or, equivalently, axis='columns') drops all columns containing a null
> value:
>
> df.dropna(axis='columns')

<table>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>2</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>2</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>5</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>6</td>
</tr>
</tbody>
</table>

> But this drops some good data as well; you might rather be interested
> in dropping rows or columns with *all* NA values, or a majority of NA
> values. This can be specified through the how or thresh parameters,
> which allow fine control of the number of nulls to allow through.
>
> The default is how='any', such that any row or column (depending on
> the axis keyword) containing a null value will be dropped. You can
> also specify how='all', which will only drop rows/columns that are
> *all* null values:
>
> df\[3\] = np.nan
>
> df

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>2</strong></th>
<th><strong>3</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1.0</td>
<td>NaN</td>
<td>2</td>
<td>NaN</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>2.0</td>
<td>3.0</td>
<td>5</td>
<td>NaN</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>NaN</td>
<td>4.0</td>
<td>6</td>
<td>NaN</td>
</tr>
</tbody>
</table>

> df.dropna(axis='columns', how='all')

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>2</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1.0</td>
<td>NaN</td>
<td>2</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>2.0</td>
<td>3.0</td>
<td>5</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>NaN</td>
<td>4.0</td>
<td>6</td>
</tr>
</tbody>
</table>

> For finer-grained control, the thresh parameter lets you specify a
> minimum number of non-null values for the row/column to be kept:
>
> df.dropna(axis='rows', thresh=3)

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>2</strong></th>
<th><strong>3</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>1</strong></td>
<td>2.0</td>
<td>3.0</td>
<td>5</td>
<td>NaN</td>
<td></td>
</tr>
</tbody>
</table>

> Here the first and last rows have been dropped, because they contain
> only two non-null values.
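
The "majority of NA values" case mentioned earlier can be expressed by deriving `thresh` from the table's shape. This is a sketch under that interpretation; `frame` reproduces the DataFrame used above, and `min_count`, `mostly_full`, and `rows_with_col0` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1,      np.nan, 2, np.nan],
                      [2,      3,      5, np.nan],
                      [np.nan, 4,      6, np.nan]])

# Keep a column only if more than half of its entries are present
min_count = frame.shape[0] // 2 + 1          # 2 of the 3 rows
mostly_full = frame.dropna(axis='columns', thresh=min_count)

# subset restricts the null check to the listed columns: here a row is
# dropped only when its value in column 0 is missing
rows_with_col0 = frame.dropna(subset=[0])

print(mostly_full)
print(rows_with_col0)
```

Only the all-null column 3 falls below the threshold; columns 0 and 1 each keep two of three values and therefore survive.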
>
> Filling null values
>
> Sometimes rather than dropping NA values, you'd rather replace them
> with a valid value. This value might be a single number like zero, or
> it might be some sort of imputation or interpolation from the good
> values. You could do this in-place using the isnull() method as a
> mask, but because it is such a common operation Pandas provides the
> fillna() method, which returns a copy of the array with the null
> values replaced.
>
> Consider the following Series:

> data = pd.Series(\[1, np.nan, 2, None, 3\], index=list('abcde'))  
> data
>
> a 1.0  
> b NaN  
> c 2.0  
> d NaN  
> e 3.0  
> dtype: float64
>
> We can fill NA entries with a single value, such as zero:
>
> data.fillna(0)
>
> a 1.0  
> b 0.0  
> c 2.0  
> d 0.0  
> e 3.0  
> dtype: float64
>
> We can specify a forward-fill to propagate the previous value forward:
>
> \# forward-fill  
> data.fillna(method='ffill')
>
> a 1.0  
> b 1.0  
> c 2.0  
> d 2.0  
> e 3.0  
> dtype: float64
>
> Or we can specify a back-fill to propagate the next values backward:
>
> \# back-fill  
> data.fillna(method='bfill')
>
> a 1.0  
> b 2.0  
> c 2.0  
> d 3.0  
> e 3.0  
> dtype: float64
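
The same fills can also be written with the dedicated `ffill()` and `bfill()` methods, which recent Pandas releases favour over the `method=` keyword. The sketch below additionally shows the mask-based fill and the imputation and interpolation options mentioned in the text; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = data.ffill()     # same result as fillna(method='ffill')
backward = data.bfill()    # same result as fillna(method='bfill')

# Mask-based fill on a copy, equivalent to data.fillna(0)
manual = data.copy()
manual[manual.isnull()] = 0

# Simple imputation: replace nulls with the mean of the valid values
mean_filled = data.fillna(data.mean())

# Linear interpolation between neighbouring valid values
interpolated = data.interpolate()

print(interpolated)
```

Which strategy is appropriate depends on the data: forward-fill suits ordered observations, while a mean or an interpolated value may be a better guess for numeric measurements.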
>
> For DataFrames, the options are similar, but we can also specify an
> axis along which the fills take place:
>
> df

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>2</strong></th>
<th><strong>3</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1.0</td>
<td>NaN</td>
<td>2</td>
<td>NaN</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>2.0</td>
<td>3.0</td>
<td>5</td>
<td>NaN</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>NaN</td>
<td>4.0</td>
<td>6</td>
<td>NaN</td>
</tr>
</tbody>
</table>

> df.fillna(method='ffill', axis=1)

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>0</strong></th>
<th><strong>1</strong></th>
<th><strong>2</strong></th>
<th><strong>3</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1.0</td>
<td>1.0</td>
<td>2.0</td>
<td>2.0</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>2.0</td>
<td>3.0</td>
<td>5.0</td>
<td>5.0</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>NaN</td>
<td>4.0</td>
<td>6.0</td>
<td>6.0</td>
</tr>
</tbody>
</table>

> Notice that if a previous value is not available during a forward fill,
> the NA value remains.
>
> Hierarchical Indexing
>
> Up to this point we've been focused primarily on one-dimensional and
> two-dimensional data, stored in Pandas Series and DataFrame objects,
> respectively. Often it is useful to go beyond this and store
> higher-dimensional data, that is, data indexed by more than one or two
> keys. Older versions of Pandas provided Panel and Panel4D objects that
> natively handled three-dimensional and four-dimensional data (see
> <u>Aside: Panel Data</u>), but these have since been removed from the
> library. The far more common pattern in practice is to make use of
> *hierarchical indexing* (also known as *multi-indexing*) to
> incorporate multiple index *levels* within a single index. In this
> way, higher-dimensional data can be compactly represented within the
> familiar one-dimensional Series and two-dimensional DataFrame objects.
>
> In this section, we'll explore the direct creation of MultiIndex
> objects, considerations when indexing, slicing, and computing
> statistics across multiply indexed data, and useful routines for
> converting between simple and hierarchically indexed representations
> of your data.
>
> We begin with the standard imports:
>
> import pandas as pd  
> import numpy as np

> A Multiply Indexed Series
>
> Let's start by considering how we might represent two-dimensional data
> within a one-dimensional Series. For concreteness, we will consider a
> series of data where each point has a character and numerical key.
>
> The bad way
>
> Suppose you would like to track data about states from two different
> years. Using the Pandas tools we've already covered, you might be
> tempted to simply use Python tuples as keys:
>
> index = \[('California', 2000), ('California', 2010), ('New York',
> 2000), ('New York', 2010), ('Texas', 2000), ('Texas', 2010)\]  
> populations = \[33871648, 37253956,  
> 18976457, 19378102,  
> 20851820, 25145561\]  
> pop = pd.Series(populations, index=index)  
> pop
>
> (California, 2000) 33871648  
> (California, 2010) 37253956  
> (New York, 2000) 18976457  
> (New York, 2010) 19378102  
> (Texas, 2000) 20851820  
> (Texas, 2010) 25145561  
> dtype: int64
>
> With this indexing scheme, you can straightforwardly index or slice
> the series based on this multiple index:
>
> pop\[('California', 2010):('Texas', 2000)\]
>
> (California, 2010) 37253956  
> (New York, 2000) 18976457  
> (New York, 2010) 19378102  
> (Texas, 2000) 20851820  
> dtype: int64
>
> But the convenience ends there. For example, if you need to select all
> values from 2010, you'll need to do some messy (and potentially slow)
> munging to make it happen:
>
> pop\[\[i for i in pop.index if i\[1\] == 2010\]\]
>
> (California, 2010) 37253956  
> (New York, 2010) 19378102  
> (Texas, 2010) 25145561  
> dtype: int64

> This produces the desired result, but is not as clean (or as efficient
> for large datasets) as the slicing syntax we've grown to love in
> Pandas.
>
> The Better Way: Pandas MultiIndex
>
> Fortunately, Pandas provides a better way. Our tuple-based indexing is
> essentially a rudimentary multi-index, and the Pandas MultiIndex type
> gives us the type of operations we wish to have. We can create a
> multi-index from the tuples as follows:
>
> index = pd.MultiIndex.from_tuples(index)
>
> index
>
> MultiIndex(\[('California', 2000),  
> ('California', 2010),  
> ( 'New York', 2000),  
> ( 'New York', 2010),  
> ( 'Texas', 2000),  
> ( 'Texas', 2010)\],  
> )
>
> Notice that the MultiIndex contains multiple *levels* of indexing (in
> this case, the state names and the years), as well as multiple
> *labels* for each data point which encode these levels.
>
> If we re-index our series with this MultiIndex, we see the
> hierarchical representation of the data:
>
> pop = pop.reindex(index)
>
> pop
>
> California 2000 33871648  
> 2010 37253956  
> New York 2000 18976457  
> 2010 19378102  
> Texas 2000 20851820  
> 2010 25145561  
> dtype: int64

> Here the first two columns of the Series representation show the
> multiple index values, while the third column shows the data. Notice
> that some entries are missing in the first column: in this multi-index
> representation, any blank entry indicates the same value as the line
> above it.
>
> Now to access all data for which the second index is 2010, we can
> simply use the Pandas slicing notation:
>
> pop\[:, 2010\]

> California 37253956  
> New York 19378102  
> Texas 25145561  
> dtype: int64
>
> The result is a singly indexed array with just the keys we're
> interested in. This syntax is much more convenient (and the operation
> is much more efficient!) than the home-spun tuple-based multi-indexing
> solution that we started with. We'll now further discuss this sort of
> indexing operation on hierarchically indexed data.
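
A few other ways to select from the multiply indexed Series, sketched with the same population figures. The level names `state` and `year` are an addition for illustration; the notebook's index is unnamed.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])          # names are an assumption
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Cross-section on the inner level: same rows as pop[:, 2010]
year_2010 = pop.xs(2010, level='year')

# Partial indexing on the outer level returns a Series keyed by year
california = pop.loc['California']

# A full tuple key returns a single value
texas_2000 = pop.loc[('Texas', 2000)]

print(year_2010)
print(california)
print(texas_2000)
```

Naming the levels is optional, but it lets selections refer to `'year'` rather than to a positional level number.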
>
> MultiIndex as extra dimension
>
> You might notice something else here: we could easily have stored the
> same data using a simple DataFrame with index and column labels. In
> fact, Pandas is built with this equivalence in mind.
>
> The unstack() method will quickly convert a multiply indexed Series
> into a conventionally indexed DataFrame :
>
> pop_df = pop.unstack()  
> pop_df

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>2000</strong></th>
<th><strong>2010</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>California</strong></td>
<td>33871648</td>
<td>37253956</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>New York</strong></td>
<td>18976457</td>
<td>19378102</td>
</tr>
<tr class="odd">
<td><strong>Texas</strong></td>
<td>20851820</td>
<td>25145561</td>
</tr>
</tbody>
</table>

> Naturally, the stack() method provides the opposite operation:
>
> pop_df.stack()
>
> California 2000 33871648  
> 2010 37253956  
> New York 2000 18976457  
> 2010 19378102  
> Texas 2000 20851820  
> 2010 25145561  
> dtype: int64
>
> Seeing this, you might wonder why we would bother with hierarchical
> indexing at all. The reason is simple: just as we were able to use
> multi-indexing to represent two-dimensional data within a
> one-dimensional Series, we can also use it to represent data of three
> or more dimensions in a Series or DataFrame. Each extra level in a
> multi-index represents an extra dimension of data; taking advantage of
> this property gives us much more flexibility in the types of data we
> can represent. Concretely, we might want to add another column of
> demographic data for each state at each year (say, population under
> 18); with a MultiIndex this is as easy as adding another column to the
> DataFrame:
>
> pop_df = pd.DataFrame({'total': pop,  
> 'under18': \[9267089, 9284094, 4687374, 4318033, 5906301, 6879014\]})  
> pop_df

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th></th>
<th><strong>total</strong></th>
<th><strong>under18</strong></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td rowspan="2"><strong>California</strong></td>
<td><strong>2000</strong></td>
<td>33871648</td>
<td>9267089</td>
<td rowspan="6"></td>
</tr>
<tr class="even">
<td><strong>2010</strong></td>
<td>37253956</td>
<td>9284094</td>
</tr>
<tr class="odd">
<td rowspan="2"><strong>New York</strong></td>
<td><strong>2000</strong></td>
<td>18976457</td>
<td>4687374</td>
</tr>
<tr class="even">
<td><strong>2010</strong></td>
<td>19378102</td>
<td>4318033</td>
</tr>
<tr class="odd">
<td rowspan="2"><strong>Texas</strong></td>
<td><strong>2000</strong></td>
<td>20851820</td>
<td>5906301</td>
</tr>
<tr class="even">
<td><strong>2010</strong></td>
<td>25145561</td>
<td>6879014</td>
</tr>
</tbody>
</table>

> In addition, all the ufuncs and other functionality discussed earlier
> work with hierarchical indices as well. Here we compute the fraction
> of people under 18 by year, given the above data:
>
> f_u18 = pop_df\['under18'\] / pop_df\['total'\]  
> f_u18.unstack()

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>2000</strong></th>
<th><strong>2010</strong></th>
<th><blockquote>
<p><img
src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image77.png"
style="width:0.22222in;height:0.22222in" /></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>California</strong></td>
<td>0.273594</td>
<td>0.249211</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>New York</strong></td>
<td>0.247010</td>
<td>0.222831</td>
</tr>
<tr class="odd">
<td><strong>Texas</strong></td>
<td>0.283251</td>
<td>0.273568</td>
</tr>
</tbody>
</table>
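
The calculation above can be reproduced end to end in a small, self-contained sketch. It rebuilds the population figures shown in this section and confirms that element-wise division aligns on both index levels; the names `frac` and `by_year` are illustrative only and do not appear in the notebook.

```python
import pandas as pd

# Rebuild the two-level (state, year) index used in this section.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])

pop_df = pd.DataFrame(
    {'total': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
     'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]},
    index=index)

# Division is element-wise and aligned on the full (state, year) label.
frac = pop_df['under18'] / pop_df['total']

# unstack() moves the innermost level (year) into the columns.
by_year = frac.unstack()
print(by_year.round(6))
```

Because the alignment happens on labels rather than positions, the same division would still pair the right rows if one of the two columns were stored in a different order.
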

> This allows us to easily and quickly manipulate and explore even
> high-dimensional data.
>
> **Methods of MultiIndex Creation**
>
> The most straightforward way to construct a multiply indexed Series or
> DataFrame is to simply pass a list of two or more index arrays to the
> constructor. For example:
>
> df = pd.DataFrame(np.random.rand(4, 2),  
> index=\[\['a', 'a', 'b', 'b'\], \[1, 2, 1, 2\]\],  
> columns=\['data1', 'data2'\])  
> df

|       |       | **data1** | **data2** |
|-------|-------|-----------|-----------|
| **a** | **1** | 0.898453  | 0.175211  |
|       | **2** | 0.333352  | 0.764679  |
| **b** | **1** | 0.216291  | 0.780221  |
|       | **2** | 0.300422  | 0.918307  |

> The work of creating the MultiIndex is done in the background.
>
> Similarly, if you pass a dictionary with appropriate tuples as keys,
> Pandas will automatically recognize this and use a MultiIndex by
> default:
>
> data = {('California', 2000): 33871648,  
> ('California', 2010): 37253956,  
> ('Texas', 2000): 20851820,  
> ('Texas', 2010): 25145561,  
> ('New York', 2000): 18976457,  
> ('New York', 2010): 19378102}  
> pd.Series(data)
>
> California 2000 33871648  
> 2010 37253956  
> Texas 2000 20851820  
> 2010 25145561  
> New York 2000 18976457  
> 2010 19378102  
> dtype: int64
>
> Nevertheless, it is sometimes useful to explicitly create a
> MultiIndex; we'll see a couple of these methods here.
>
> **Explicit MultiIndex constructors**
>
> For more flexibility in how the index is constructed, you can instead
> use the class method constructors available in pd.MultiIndex. For
> example, as we did before, you can construct the MultiIndex from a
> simple list of arrays giving the index values within each level:
>
> pd.MultiIndex.from_arrays(\[\['a', 'a', 'b', 'b'\], \[1, 2, 1, 2\]\])
>
> MultiIndex(\[('a', 1),  
> ('a', 2),  
> ('b', 1),  
> ('b', 2)\],  
> )
>
> You can construct it from a list of tuples giving the multiple index
> values of each point:
>
> pd.MultiIndex.from_tuples(\[('a', 1), ('a', 2), ('b', 1), ('b', 2)\])
>
> MultiIndex(\[('a', 1),  
> ('a', 2),  
> ('b', 1),  
> ('b', 2)\],  
> )
>
> You can even construct it from a Cartesian product of single indices:
>
> pd.MultiIndex.from_product(\[\['a', 'b'\], \[1, 2\]\])
>
> MultiIndex(\[('a', 1),  
> ('a', 2),  
> ('b', 1),  
> ('b', 2)\],  
> )
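
As a quick check that the three class-method constructors really describe the same index, the following sketch builds each one and compares them. It also passes the `names` argument to one of them; the level names `letter` and `number` are made up for this illustration.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four (letter, number) pairs.
print(from_arrays.equals(from_tuples), from_tuples.equals(from_product))

# Any of the constructors also accepts names for the levels.
named = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                   names=['letter', 'number'])
print(list(named.names))
```

`from_product` is usually the most compact choice when every combination of values is present; `from_tuples` or `from_arrays` suit data where only some combinations occur.
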
>
> Similarly, you can construct the MultiIndex directly using its
> internal encoding by passing levels (a list of lists containing the
> available index values for each level) and codes (a list of lists of
> integer positions that reference those values; this argument was
> named labels before pandas 0.24).
>
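
The notebook does not show a cell for this constructor, so here is a minimal sketch of the internal encoding, assuming a pandas version that uses the `codes` keyword (0.24 or later). Each entry in `codes` is a position into the corresponding list in `levels`.

```python
import pandas as pd

# levels: the distinct values available at each level.
# codes: for each row, the integer position into the matching level.
idx = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                    codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

print(list(idx))  # [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```

This form is rarely written by hand, but it is what the other constructors produce internally, which is why they all compare equal.
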
> Any of these objects can be passed as the index argument when creating
> a Series or DataFrame, or be passed to the reindex method of an
> existing Series or DataFrame.
>
> **MultiIndex level names**
>
> Sometimes it is convenient to name the levels of the MultiIndex. This
> can be accomplished by passing the names argument to any of the above
> MultiIndex constructors, or by setting the names attribute of the
> index after the fact:
>
> pop.index.names = \['state', 'year'\]  
> pop
>
> state year  
> California 2000 33871648  
> 2010 37253956  
> New York 2000 18976457  
> 2010 19378102  
> Texas 2000 20851820  
> 2010 25145561  
> dtype: int64
>
> With more involved datasets, this can be a useful way to keep track of
> the meaning of various index values.
>
> **MultiIndex for columns**
>
> In a DataFrame, the rows and columns are completely symmetric, and
> just as the rows can have multiple levels of indices, the columns can
> have multiple levels as well. Consider the following, which is a
> mock-up of some (somewhat realistic) medical data:
>
> \# hierarchical indices and columns  
> index = pd.MultiIndex.from_product(\[\[2013, 2014\], \[1, 2\]\],  
> names=\['year', 'visit'\])  
> columns = pd.MultiIndex.from_product(\[\['Bob', 'Guido', 'Sue'\],
> \['HR', 'Temp'\]\],  
> names=\['subject', 'type'\])
>
> \# mock some data  
> data = np.round(np.random.randn(4, 6), 1)  
> data\[:, ::2\] \*= 10  
> data += 37
>
> \# create the DataFrame  
> health_data = pd.DataFrame(data, index=index, columns=columns)
> health_data

|          | **subject** | **Bob** |          | **Guido** |          | **Sue** |          |
|----------|-------------|---------|----------|-----------|----------|---------|----------|
|          | **type**    | **HR**  | **Temp** | **HR**    | **Temp** | **HR**  | **Temp** |
| **year** | **visit**   |         |          |           |          |         |          |
| **2013** | **1**       | 38.0    | 37.3     | 41.0      | 35.8     | 39.0    | 37.2     |
|          | **2**       | 45.0    | 36.0     | 47.0      | 35.9     | 41.0    | 36.8     |
| **2014** | **1**       | 44.0    | 37.7     | 44.0      | 37.0     | 29.0    | 35.2     |
|          | **2**       | 35.0    | 37.4     | 53.0      | 36.2     | 38.0    | 37.4     |

> Here we see where the multi-indexing for both rows and columns can
> come in *very* handy. This is fundamentally four-dimensional data,
> where the dimensions are the subject, the measurement type, the year,
> and the visit number. With this in place we can, for example, index
> the top-level column by the person's name and get a full DataFrame
> containing just that person's information:
>
> health_data\['Guido'\]

|          | **type**  | **HR** | **Temp** |
|----------|-----------|--------|----------|
| **year** | **visit** |        |          |
| **2013** | **1**     | 41.0   | 35.8     |
|          | **2**     | 47.0   | 35.9     |
| **2014** | **1**     | 44.0   | 37.0     |
|          | **2**     | 53.0   | 36.2     |

> For complicated records containing multiple labeled measurements
> across multiple times for many subjects (people, countries, cities,
> etc.) use of hierarchical rows and columns can be extremely
> convenient!
>
> **Indexing and Slicing a MultiIndex**
>
> Indexing and slicing on a MultiIndex is designed to be intuitive, and
> it helps if you think about the indices as added dimensions. We'll
> first look at indexing multiply indexed Series, and then multiply
> indexed DataFrames.
>
> **Multiply indexed Series**
>
> Consider the multiply indexed Series of state populations we saw
> earlier:
>
> pop
>
> state year  
> California 2000 33871648  
> 2010 37253956  
> New York 2000 18976457  
> 2010 19378102  
> Texas 2000 20851820  
> 2010 25145561  
> dtype: int64
>
> We can access single elements by indexing with multiple terms:
>
> pop\['California', 2000\]
>
> 33871648
>
> The MultiIndex also supports *partial indexing*, or indexing just one
> of the levels in the index. The result is another Series, with the
> lower-level indices maintained:
>
> pop\['California'\]
>
> year  
> 2000 33871648  
> 2010 37253956  
> dtype: int64
>
> Partial slicing is available as well, as long as the MultiIndex is
> sorted (see discussion in <u>Sorted and Unsorted Indices</u>):
>
> pop.loc\['California':'New York'\]
>
> state year  
> California 2000 33871648  
> 2010 37253956  
> New York 2000 18976457  
> 2010 19378102  
> dtype: int64
>
> With sorted indices, partial indexing can be performed on lower levels
> by passing an empty slice in the first index:
>
> pop\[:, 2000\]
>
> state  
> California 33871648  
> New York 18976457  
> Texas 20851820  
> dtype: int64

> Other types of indexing and selection work as well; for example,
> selection based on Boolean masks:
>
> pop\[pop \> 22000000\]
>
> state year  
> California 2000 33871648  
> 2010 37253956  
> Texas 2010 25145561  
> dtype: int64
>
> Selection based on fancy indexing also works:
>
> pop\[\['California', 'Texas'\]\]
>
> state year  
> California 2000 33871648  
> 2010 37253956  
> Texas 2000 20851820  
> 2010 25145561  
> dtype: int64
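
The selection patterns above can be collected into one runnable sketch. It uses the explicit `.loc` indexer throughout, plus `xs()` as an alternative way to take a cross-section by level name; the variable names are illustrative and are not part of the notebook.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

single = pop.loc['California', 2000]             # one element
one_state = pop.loc['California']                # partial index on the outer level
state_range = pop.loc['California':'New York']   # partial slice (index is sorted)
year_2000 = pop.loc[:, 2000]                     # empty slice on the outer level
year_2000_xs = pop.xs(2000, level='year')        # same cross-section by level name
large = pop[pop > 22000000]                      # Boolean mask

print(single)
print(year_2000)
```

Spelling the selections with `.loc` makes it explicit that the lookups are label-based, which can help when the index levels hold integers such as years.
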
>
> **Multiply indexed DataFrames**
>
> A multiply indexed DataFrame behaves in a similar manner. Consider our
> toy medical DataFrame from before:
>
> health_data

|          | **subject** | **Bob** |          | **Guido** |          | **Sue** |          |
|----------|-------------|---------|----------|-----------|----------|---------|----------|
|          | **type**    | **HR**  | **Temp** | **HR**    | **Temp** | **HR**  | **Temp** |
| **year** | **visit**   |         |          |           |          |         |          |
| **2013** | **1**       | 38.0    | 37.3     | 41.0      | 35.8     | 39.0    | 37.2     |
|          | **2**       | 45.0    | 36.0     | 47.0      | 35.9     | 41.0    | 36.8     |
| **2014** | **1**       | 44.0    | 37.7     | 44.0      | 37.0     | 29.0    | 35.2     |
|          | **2**       | 35.0    | 37.4     | 53.0      | 36.2     | 38.0    | 37.4     |

> Remember that columns are primary in a DataFrame, and the syntax used
> for multiply indexed Series applies to the columns. For example, we
> can recover Guido's heart rate data with a simple operation:
>
> health_data\['Guido', 'HR'\]
>
> year visit  
> 2013 1 41.0  
> 2 47.0  
> 2014 1 44.0  
> 2 53.0  
> Name: (Guido, HR), dtype: float64
>
> Also, as with the single-index case, we can use the loc and iloc
> indexers (the older ix indexer has been removed from pandas). For
> example:
>
> health_data.iloc\[:2, :2\]

|          | **subject** | **Bob** |          |
|----------|-------------|---------|----------|
|          | **type**    | **HR**  | **Temp** |
| **year** | **visit**   |         |          |
| **2013** | **1**       | 38.0    | 37.3     |
|          | **2**       | 45.0    | 36.0     |

> These indexers provide an array-like view of the underlying
> two-dimensional data, but each individual index in loc or iloc can be
> passed a tuple of multiple indices. For example:
>
> health_data.loc\[:, ('Bob', 'HR')\]
>
> year visit  
> 2013 1 38.0  
> 2 45.0  
> 2014 1 44.0  
> 2 35.0  
> Name: (Bob, HR), dtype: float64
>
> Working with slices within these index tuples is not especially
> convenient; trying to create a slice within a tuple, as in
> health_data.loc\[(:, 1), (:, 'HR')\], will lead to a syntax error.
>
> You could get around this by building the desired slice explicitly
> using Python's built-in slice() function, but a better way in this
> context is to use an IndexSlice object, which Pandas provides for
> precisely this situation. For example:
>
> idx = pd.IndexSlice  
> health_data.loc\[idx\[:, 1\], idx\[:, 'HR'\]\]

| **year** | **subject** | **Bob** | **Guido** | **Sue** |
|----------|-------------|---------|-----------|---------|
|          | **type**    | **HR**  | **HR**    | **HR**  |
|          | **visit**   |         |           |         |
| **2013** | **1**       | 31.0    | 32.0      | 35.0    |
| **2014** | **1**       | 30.0    | 39.0      | 61.0    |
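
To make the two workarounds concrete, the sketch below builds a stand-in for `health_data` and performs the same selection both ways. The random values are seeded here only so the example is repeatable, so the numbers will not match the tables in this notebook.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
health_data = pd.DataFrame(rng.normal(37, 1, size=(4, 6)).round(1),
                           index=index, columns=columns)

# Option 1: spell out each "take everything" slice with slice(None).
with_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]

# Option 2: pd.IndexSlice lets the usual colon syntax be used instead.
idx = pd.IndexSlice
with_idx = health_data.loc[idx[:, 1], idx[:, 'HR']]

print(with_slice.equals(with_idx))
```

Both forms select the first visit of each year and the heart-rate column of each subject; `IndexSlice` is simply easier to read once more than one level is involved.
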

> There are many ways to interact with data in multiply indexed Series
> and DataFrames, and as with many tools in this book the best way to
> become familiar with them is to try them out!
>
> **Rearranging Multi-Indices**
>
> One of the keys to working with multiply indexed data is knowing how
> to effectively transform the data. There are a number of operations
> that will preserve all the information in the dataset, but rearrange
> it for the purposes of various computations. We saw a brief example of
> this in the stack() and unstack() methods, but there are many more
> ways to finely control the rearrangement of data between hierarchical
> indices and columns, and we'll explore them here.
>
> **Sorted and unsorted indices**
>
> Earlier, we briefly mentioned a caveat, but we should emphasize it
> more here. *Many of the MultiIndex slicing operations will fail if the
> index is not sorted.* Let's take a look at this here.
>
> We'll start by creating some simple multiply indexed data where the
> indices are *not lexicographically sorted*:
>
> index = pd.MultiIndex.from_product(\[\['a', 'c', 'b'\], \[1, 2\]\])  
> data = pd.Series(np.random.rand(6), index=index)  
> data.index.names = \['char', 'int'\]  
> data
>
> char int  
> a 1 0.195712  
> 2 0.382111  
> c 1 0.310816  
> 2 0.013865  
> b 1 0.610929  
> 2 0.617758  
> dtype: float64
>
> If we try to take a partial slice of this index, it will result in an
> error:
>
> try:  
> data\['a':'b'\]  
> except KeyError as e:  
> print(type(e))  
> print(e)
>
> \<class 'pandas.errors.UnsortedIndexError'\>  
> 'Key length (1) was greater than MultiIndex lexsort depth (0)'
>
> Although it is not entirely clear from the error message, this is the
> result of the MultiIndex not being sorted. For various reasons,
> partial slices and other similar operations require the levels in the
> MultiIndex to be in sorted (i.e., lexicographical) order. Pandas
> provides a number of convenience routines to perform this type of
> sorting; examples are the sort_index() and sortlevel() methods of the
> DataFrame (sortlevel() has since been deprecated and removed from
> recent pandas releases in favor of sort_index()). We'll use the
> simplest, sort_index(), here:
>
> data = data.sort_index()  
> data
>
> char int  
> a 1 0.195712  
> 2 0.382111  
> b 1 0.610929  
> 2 0.617758  
> c 1 0.310816  
> 2 0.013865  
> dtype: float64
>
> With the index sorted in this way, partial slicing will work as
> expected:
>
> data\['a':'b'\]
>
> char int  
> a 1 0.195712  
> 2 0.382111  
> b 1 0.610929  
> 2 0.617758  
> dtype: float64
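
One way to avoid the error in the first place is to test whether the index is sorted before slicing. The sketch below assumes the `is_monotonic_increasing` property of the index is an acceptable check for this purpose, and uses fixed values instead of random ones so the result is predictable.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6.0), index=index)

# The labels run a, c, b, so the index is not in increasing order yet.
print(data.index.is_monotonic_increasing)

# Sort only when needed, then slice on the outer level.
if not data.index.is_monotonic_increasing:
    data = data.sort_index()

subset = data.loc['a':'b']
print(subset)
```

Sorting once right after the data is loaded or assembled is usually simpler than guarding every slice individually.
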

> **Stacking and unstacking indices**
>
> As we saw briefly before, it is possible to convert a dataset from a
> stacked multi-index to a simple two-dimensional representation,
> optionally specifying the level to use:
>
> pop.unstack(level=0)

| **state** | **California** | **New York** | **Texas** |
|-----------|----------------|--------------|-----------|
| **year**  |                |              |           |
| **2000**  | 33871648       | 18976457     | 20851820  |
| **2010**  | 37253956       | 19378102     | 25145561  |

> pop.unstack(level=1)

| **year**       | **2000** | **2010** |
|----------------|----------|----------|
| **state**      |          |          |
| **California** | 33871648 | 37253956 |
| **New York**   | 18976457 | 19378102 |
| **Texas**      | 20851820 | 25145561 |

> The opposite of unstack() is stack() , which here can be used to
> recover the original series:
>
> pop.unstack().stack()
>
> state year  
> California 2000 33871648  
> 2010 37253956  
> New York 2000 18976457  
> 2010 19378102  
> Texas 2000 20851820  
> 2010 25145561  
> dtype: int64
>
> **Index setting and resetting**
>
> Another way to rearrange hierarchical data is to turn the index labels
> into columns; this can be accomplished with the reset_index method.
> Calling this on the population Series will result in a DataFrame with
> a *state* and *year* column holding the information that was formerly
> in the index. For clarity, we can optionally specify the name of the
> data for the column representation:
>
> pop_flat = pop.reset_index(name='population')  
> pop_flat

|       | **state**  | **year** | **population** |
|-------|------------|----------|----------------|
| **0** | California | 2000     | 33871648       |
| **1** | California | 2010     | 37253956       |
| **2** | New York   | 2000     | 18976457       |
| **3** | New York   | 2010     | 19378102       |
| **4** | Texas      | 2000     | 20851820       |
| **5** | Texas      | 2010     | 25145561       |

> Often when working with data in the real world, the raw input data
> looks like this and it's useful to build a MultiIndex from the column
> values. This can be done with the set_index method of the DataFrame ,
> which returns a multiply indexed DataFrame :
>
> pop_flat.set_index(\['state', 'year'\])

| **state**      | **year** | **population** |
|----------------|----------|----------------|
| **California** | **2000** | 33871648       |
|                | **2010** | 37253956       |
| **New York**   | **2000** | 18976457       |
|                | **2010** | 19378102       |
| **Texas**      | **2000** | 20851820       |
|                | **2010** | 25145561       |

> In practice, I find this type of reindexing to be one of the more
> useful patterns when encountering real-world datasets.
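
The two methods are inverses of each other, which the following sketch checks by flattening the population Series and rebuilding it. The name `restored` is introduced here for illustration only.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Index levels become ordinary columns ...
pop_flat = pop.reset_index(name='population')

# ... and set_index() turns those columns back into a MultiIndex.
restored = pop_flat.set_index(['state', 'year'])['population']

print(list(pop_flat.columns))
print(restored.index.equals(pop.index))
```

The flat form is what a CSV file or database query typically delivers, so `set_index` with a list of column names is often the first step after loading such data.
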
>
> **Data Aggregations on Multi-Indices**
>
> We've previously seen that Pandas has built-in data aggregation
> methods, such as mean(), sum(), and max(). For hierarchically indexed
> data, these can be passed a level parameter that controls which subset
> of the data the aggregate is computed on. (Recent pandas releases
> deprecate this keyword, as the FutureWarning in the output below
> indicates; grouping with groupby(level=...) is the replacement.)
>
> For example, let's return to our health data:
>
> health_data

|          | **subject** | **Bob** |          | **Guido** |          | **Sue** |          |
|----------|-------------|---------|----------|-----------|----------|---------|----------|
|          | **type**    | **HR**  | **Temp** | **HR**    | **Temp** | **HR**  | **Temp** |
| **year** | **visit**   |         |          |           |          |         |          |
| **2013** | **1**       | 38.0    | 37.3     | 41.0      | 35.8     | 39.0    | 37.2     |
|          | **2**       | 45.0    | 36.0     | 47.0      | 35.9     | 41.0    | 36.8     |
| **2014** | **1**       | 44.0    | 37.7     | 44.0      | 37.0     | 29.0    | 35.2     |
|          | **2**       | 35.0    | 37.4     | 53.0      | 36.2     | 38.0    | 37.4     |

> Perhaps we'd like to average-out the measurements in the two visits
> each year. We can do this by naming the index level we'd like to
> explore, in this case the year:
>
> data_mean = health_data.mean(level='year')  
> data_mean
>
> \<ipython-input-173-af3ae0440116\>:1: FutureWarning: Using the level
> keyword in DataFra data_mean = health_data.mean(level='year')

<table>
<thead>
<tr class="header">
<th><strong>subject</strong></th>
<th colspan="2"><strong>Bob</strong></th>
<th colspan="2"><strong>Guido</strong></th>
<th colspan="2"><strong>Sue</strong></th>
</tr>
<tr class="odd">
<th><strong>type</strong></th>
<th><strong>HR</strong></th>
<th><strong>Temp</strong></th>
<th><strong>HR</strong></th>
<th><strong>Temp</strong></th>
<th><strong>HR</strong></th>
<th><strong>Temp</strong></th>
</tr>
<tr class="even">
<th><strong>year</strong></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>2013</strong></td>
<td>41.5</td>
<td>36.65</td>
<td>44.0</td>
<td>35.85</td>
<td>40.0</td>
<td>37.0</td>
</tr>
<tr class="even">
<td><strong>2014</strong></td>
<td>39.5</td>
<td>37.55</td>
<td>48.5</td>
<td>36.60</td>
<td>33.5</td>
<td>36.3</td>
</tr>
</tbody>
</table>

> By further making use of the axis keyword, we can take the mean among
> levels on the columns as well:
>
> data_mean.mean(axis=1, level='type')
>
> \<ipython-input-174-e9f5c76486ab\>:1: FutureWarning: Using the level
> keyword in DataFra  
> data_mean.mean(axis=1, level='type')

| **year** | **HR**    | **Temp**  |
|----------|-----------|-----------|
| **2013** | 41.833333 | 36.500000 |
| **2014** | 40.500000 | 36.816667 |

> Thus in two lines, we've been able to find the average heart rate and
> temperature measured among all subjects in all visits each year. This
> syntax is actually a shortcut to the GroupBy functionality, which we
> will discuss later. While this is a toy example, many real-world
> datasets have similar hierarchical structure.
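The FutureWarning printed with both calls indicates that the level keyword is deprecated in these aggregations, and newer Pandas releases no longer accept it. A minimal sketch of the same two steps written with groupby, assuming a small stand-in frame `health_demo` (hypothetical values, two subjects only) in place of health_data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for health_data: rows indexed by (year, visit),
# columns indexed by (subject, type).
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
columns = pd.MultiIndex.from_product([["Bob", "Guido"], ["HR", "Temp"]],
                                     names=["subject", "type"])
values = np.array([
    [40.0, 37.0, 50.0, 36.0],
    [42.0, 38.0, 52.0, 37.0],
    [30.0, 36.0, 60.0, 35.0],
    [34.0, 36.5, 62.0, 35.5],
])
health_demo = pd.DataFrame(values, index=index, columns=columns)

# Average the visits within each year (row-level grouping).
yearly_mean = health_demo.groupby(level="year").mean()

# Average over subjects for each measurement type (column-level grouping).
# Transposing first lets us group on the index, then transpose back.
type_mean = yearly_mean.T.groupby(level="type").mean().T

print(yearly_mean)
print(type_mean)
```

Grouping on the transposed frame is one way to average over a column level; it is a design choice made here for portability across Pandas versions, not the only option.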


> Aside: Panel Data
>
> Pandas has a few other fundamental data structures that we have not
> yet discussed, namely the pd.Panel and pd.Panel4D objects. These can
> be thought of, respectively, as three-dimensional and four-dimensional
> generalizations of the (one-dimensional) Series and (two-dimensional)
> DataFrame structures. Once you are familiar with indexing and
> manipulation of data in a Series and DataFrame, Panel and Panel4D are
> relatively straightforward to use. In particular, the ix, loc, and
> iloc indexers discussed earlier extend readily to these
> higher-dimensional structures.
>
> We won't cover these panel structures further in this text, as I've
> found in the majority of cases that multi-indexing is a more useful
> and conceptually simpler representation for higher-dimensional data.
> Additionally, panel data is fundamentally a dense data representation,
> while multi-indexing is fundamentally a sparse data representation. As
> the number of dimensions increases, the dense representation can
> become very inefficient for the majority of real-world datasets. For the
> occasional specialized application, however, these structures can be
> useful. If you'd like to read more about the Panel and Panel4D
> structures, see the references listed in .
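Recent Pandas releases no longer ship Panel, so the multi-index route is the practical one. A minimal sketch of holding three-dimensional data in a two-dimensional DataFrame, assuming a hypothetical array `cube` laid out as year × visit × subject:

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D data: 2 years x 2 visits x 3 subjects.
cube = np.arange(12, dtype=float).reshape(2, 2, 3)

# Fold the first two dimensions into a two-level row index.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=["year", "visit"])
frame = pd.DataFrame(cube.reshape(4, 3), index=index,
                     columns=["Bob", "Guido", "Sue"])

# Selecting one year recovers the corresponding 2-D slice of the cube.
year_2014 = frame.loc[2014]
print(year_2014)
```

Only the index combinations that actually occur need to be stored, which is what makes this representation sparse in the sense described above.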

> Combining Datasets: Concat and Append

> Some of the most interesting studies of data come from combining
> different data sources. These operations can involve anything from
> very straightforward concatenation of two different datasets, to more
> complicated database-style joins and merges that correctly handle any
> overlaps between the datasets. Series and DataFrames are built with
> this type of operation in mind, and Pandas includes functions and
> methods that make this sort of data wrangling fast and
> straightforward.
>
> Here we'll take a look at simple concatenation of Series and
> DataFrames with the pd.concat function; later we'll dive into more
> sophisticated in-memory merges and joins implemented in Pandas.
>
> We begin with the standard imports:
>
> import pandas as pd  
> import numpy as np
>
> For convenience, we'll define this function, which creates a DataFrame
> of a particular form that will be useful below:


> def make_df(cols, ind):  
> """Quickly make a DataFrame"""  
> data = {c: \[str(c) + str(i) for i in ind\]  
> for c in cols}  
> return pd.DataFrame(data, ind)
>
> \# example DataFrame  
> make_df('ABC', range(3))

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
<th><blockquote>
<p><img
src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image97.png"
style="width:0.23611in;height:0.22222in" /></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>A0</td>
<td>B0</td>
<td>C0</td>
<td rowspan="3"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>A1</td>
<td>B1</td>
<td>C1</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>A2</td>
<td>B2</td>
<td>C2</td>
</tr>
</tbody>
</table>

> In addition, we'll create a quick class that allows us to display
> multiple DataFrames side by side.
>
> The code makes use of the special \_repr_html\_ method, which IPython
> uses to implement its rich object display:
>
> class display(object):  
> """Display HTML representation of multiple objects"""  
> template = """\<div style="float: left; padding: 10px;"\>  
> \<p style='font-family:"Courier New", Courier,
> monospace'\>{0}\</p\>{1} \</div\>"""  
> def \_\_init\_\_(self, \*args):  
> self.args = args
>
> def \_repr_html\_(self):  
> return '\n'.join(self.template.format(a, eval(a).\_repr_html\_()) for
> a in self.args)
>
> def \_\_repr\_\_(self):  
> return '\n\n'.join(a + '\n' + repr(eval(a)) for a in self.args)
>
> The use of this will become clearer as we continue our discussion in
> the following section.
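The rich-display hook can be seen in isolation with a much smaller class. A minimal sketch, assuming a hypothetical `Highlight` wrapper (not part of the notebook) that IPython would render as bold HTML while plain consoles fall back to `__repr__`:

```python
import html


class Highlight:
    """Wrap a string so that IPython renders it in bold."""

    def __init__(self, text):
        self.text = text

    def _repr_html_(self):
        # IPython calls this method, if present, to obtain an HTML rendering.
        return "<b>{}</b>".format(html.escape(self.text))

    def __repr__(self):
        # Plain-text fallback used outside rich front ends.
        return "Highlight({!r})".format(self.text)


note = Highlight("a < b")
print(repr(note))
print(note._repr_html_())
```

The display class above follows the same pattern, except that it evaluates each string argument and joins the HTML of the resulting objects.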

> Recall: Concatenation of NumPy Arrays

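> Concatenation of Series and DataFrame objects is very similar to
> concatenation of NumPy arrays, which can be done via the
> np.concatenate function. Recall that with it, you can combine the
> contents of two or more arrays into a single array: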


> x = \[1, 2, 3\]  
> y = \[4, 5, 6\]  
> z = \[7, 8, 9\]  
> np.concatenate(\[x, y, z\])
>
> array(\[1, 2, 3, 4, 5, 6, 7, 8, 9\])
>
> The first argument is a list or tuple of arrays to concatenate.
> Additionally, it takes an axis keyword that allows you to specify the
> axis along which the result will be concatenated:
>
> x = \[\[1, 2\],  
> \[3, 4\]\]  
> np.concatenate(\[x, x\], axis=1)
>
> array(\[\[1, 2, 1, 2\],  
> \[3, 4, 3, 4\]\])
>
> Simple Concatenation with pd.concat
>
> Pandas has a function, pd.concat(), which has a similar syntax to
> np.concatenate but contains a number of options that we'll discuss
> momentarily:
>
> \# Signature in Pandas v0.18  
> pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,  
> keys=None, levels=None, names=None, verify_integrity=False,  
> copy=True)
>
> pd.concat() can be used for a simple concatenation of Series or
> DataFrame objects, just as np.concatenate() can be used for simple
> concatenations of arrays:
>
> ser1 = pd.Series(\['A', 'B', 'C'\], index=\[1, 2, 3\])  
> ser2 = pd.Series(\['D', 'E', 'F'\], index=\[4, 5, 6\])  
> pd.concat(\[ser1, ser2\])
>
> 1 A  
> 2 B  
> 3 C  
> 4 D  
> 5 E  
> 6 F  
> dtype: object
>
> It also works to concatenate higher-dimensional objects, such as
> DataFrames:
>
> df1 = make_df('AB', \[1, 2\])  
> df2 = make_df('AB', \[3, 4\])


> display('df1', 'df2', 'pd.concat(\[df1, df2\])')

> df1

|       | **A** | **B** |
|-------|-------|-------|
| **1** | A1    | B1    |
| **2** | A2    | B2    |

> df2

|       | **A** | **B** |
|-------|-------|-------|
| **3** | A3    | B3    |
| **4** | A4    | B4    |

> pd.concat(\[df1, df2\])

|       | **A** | **B** |
|-------|-------|-------|
| **1** | A1    | B1    |
| **2** | A2    | B2    |
| **3** | A3    | B3    |
| **4** | A4    | B4    |

> By default, the concatenation takes place row-wise within the
> DataFrame (i.e., axis=0). Like np.concatenate, pd.concat allows
> specification of an axis along which concatenation will take place.
> Consider the following example:
>
> We could have equivalently specified axis=1; here we've used the more
> intuitive axis='col'.
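A minimal sketch of such a column-wise concatenation, assuming two hypothetical frames `left` and `right` built with the make_df helper defined earlier. The axis is spelled out as 'columns' here, since newer Pandas releases may not accept the abbreviated 'col':

```python
import pandas as pd


def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)


# Two frames with the same index but different columns (hypothetical names).
left = make_df("AB", [0, 1])
right = make_df("CD", [0, 1])

# Concatenate along the column axis; axis=1 is the numeric equivalent.
side_by_side = pd.concat([left, right], axis="columns")
print(side_by_side)
```

Because both inputs share the index [0, 1], the rows line up and the result simply gains the extra columns.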
>
> Duplicate indices
>
> One important difference between np.concatenate and pd.concat is that
> Pandas concatenation *preserves indices*, even if the result will have
> duplicate indices! Consider this simple example:
>
> x = make_df('AB', \[0, 1\])  
> y = make_df('AB', \[2, 3\])  
> y.index = x.index \# make duplicate indices!
>
> display('x', 'y', 'pd.concat(\[x, y\])')

> x

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A0    | B0    |
| **1** | A1    | B1    |

> y

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A2    | B2    |
| **1** | A3    | B3    |

> pd.concat(\[x, y\])

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A0    | B0    |
| **1** | A1    | B1    |
| **0** | A2    | B2    |
| **1** | A3    | B3    |

> Notice the repeated indices in the result. While this is valid within
> DataFrames, the outcome is often undesirable. pd.concat() gives us a
> few ways to handle it.
>
> Catching the repeats as an error


> If you'd like to simply verify that the indices in the result of
> pd.concat() do not overlap, you can specify the verify_integrity flag.
> With this set to True, the concatenation will raise an exception if
> there are duplicate indices. Here is an example, where for clarity
> we'll catch and print the error message:
>
> try:  
> pd.concat(\[x, y\], verify_integrity=True)  
> except ValueError as e:  
> print("ValueError:", e)
>
> ValueError: Indexes have overlapping values: Int64Index(\[0, 1\],
> dtype='int64')
>
> Ignoring the index
>
> Sometimes the index itself does not matter, and you would prefer it to
> simply be ignored. This option can be specified using the ignore_index
> flag. With this set to True, the concatenation will create a new
> integer index for the result:
>
> display('x', 'y', 'pd.concat(\[x, y\], ignore_index=True)')

> x

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A0    | B0    |
| **1** | A1    | B1    |

> y

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A2    | B2    |
| **1** | A3    | B3    |

> pd.concat(\[x, y\], ignore_index=True)

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A0    | B0    |
| **1** | A1    | B1    |
| **2** | A2    | B2    |
| **3** | A3    | B3    |

> Adding MultiIndex keys

> Another option is to use the keys option to specify a label for the
> data sources; the result will be a hierarchically indexed object
> containing the data:
>
> display('x', 'y', "pd.concat(\[x, y\], keys=\['x', 'y'\])")


> x

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A0    | B0    |
| **1** | A1    | B1    |

> y

|       | **A** | **B** |
|-------|-------|-------|
| **0** | A2    | B2    |
| **1** | A3    | B3    |

> pd.concat(\[x, y\], keys=\['x', 'y'\])

|       |       | **A** | **B** |
|-------|-------|-------|-------|
| **x** | **0** | A0    | B0    |
|       | **1** | A1    | B1    |
| **y** | **0** | A2    | B2    |
|       | **1** | A3    | B3    |

> The result is a multiply indexed DataFrame, and we can use the
> hierarchical indexing tools discussed earlier to work with this data.
>
> Concatenation with joins
>
> In the simple examples we just looked at, we were mainly concatenating
> DataFrames with shared column names. In practice, data from different
> sources might have different sets of column names, and pd.concat
> offers several options in this case. Consider the concatenation of the
> following two DataFrames, which have some (but not all!) columns in
> common:
>
> df5 = make_df('ABC', \[1, 2\])  
> df6 = make_df('BCD', \[3, 4\])  
> display('df5', 'df6', 'pd.concat(\[df5, df6\])')

> df5

|       | **A** | **B** | **C** |
|-------|-------|-------|-------|
| **1** | A1    | B1    | C1    |
| **2** | A2    | B2    | C2    |

> df6

|       | **B** | **C** | **D** |
|-------|-------|-------|-------|
| **3** | B3    | C3    | D3    |
| **4** | B4    | C4    | D4    |

> pd.concat(\[df5, df6\])

|       | **A** | **B** | **C** | **D** |
|-------|-------|-------|-------|-------|
| **1** | A1    | B1    | C1    | NaN   |
| **2** | A2    | B2    | C2    | NaN   |
| **3** | NaN   | B3    | C3    | D3    |
| **4** | NaN   | B4    | C4    | D4    |

> By default, the entries for which no data is available are filled with
> NA values. To change this, we can specify one of several options for
> the join and join_axes parameters of the pd.concat function. By
> default, the join is a union of the input columns (join='outer'),
> but we can change this to an intersection of the columns using
> join='inner':
>
> display('df5', 'df6',  
> "pd.concat(\[df5, df6\], join='inner')")

> df5

|       | **A** | **B** | **C** |
|-------|-------|-------|-------|
| **1** | A1    | B1    | C1    |
| **2** | A2    | B2    | C2    |

> df6

|       | **B** | **C** | **D** |
|-------|-------|-------|-------|
| **3** | B3    | C3    | D3    |
| **4** | B4    | C4    | D4    |

> pd.concat(\[df5, df6\], join='inner')

|       | **B** | **C** |
|-------|-------|-------|
| **1** | B1    | C1    |
| **2** | B2    | C2    |
| **3** | B3    | C3    |
| **4** | B4    | C4    |


> Another option is to directly specify the index of the remaining
> columns using the join_axes argument, which takes a list of index
> objects. Here we'll specify that the returned columns should be the
> same as those of the first input:
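The join_axes argument belongs to the v0.18 signature quoted earlier and is not available in recent Pandas releases. A minimal sketch of one way to get the same effect, assuming it is acceptable to concatenate first and then reindex the columns to those of the first input:

```python
import pandas as pd


def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)


df5 = make_df("ABC", [1, 2])
df6 = make_df("BCD", [3, 4])

# Outer-concatenate, then keep only the columns of the first input.
# Rows coming from df6 have no 'A' value, so they receive NaN there.
result = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(result)
```

Column D is dropped because it does not appear in df5, while column A is kept and padded with missing values for the rows from df6.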
>
> The combination of options of the pd.concat function allows a wide
> range of possible behaviors when joining two datasets; keep these in
> mind as you use these tools for your own data.
>
> The append() method
>
> Because direct array concatenation is so common, Series and DataFrame
> objects have an append method that can accomplish the same thing in
> fewer keystrokes. For example, rather than calling pd.concat(\[df1,
> df2\]) , you can simply call df1.append(df2) :
>
> display('df1', 'df2', 'df1.append(df2)')

> df1

|       | **A** | **B** |
|-------|-------|-------|
| **1** | A1    | B1    |
| **2** | A2    | B2    |

> df2

|       | **A** | **B** |
|-------|-------|-------|
| **3** | A3    | B3    |
| **4** | A4    | B4    |

> df1.append(df2)

|       | **A** | **B** |
|-------|-------|-------|
| **1** | A1    | B1    |
| **2** | A2    | B2    |
| **3** | A3    | B3    |
| **4** | A4    | B4    |

> Keep in mind that unlike the append() and extend() methods of Python
> lists, the append() method in Pandas does not modify the original
> object; instead, it creates a new object with the combined data. It is
> also not a very efficient method, because it involves creation of a new
> index *and* data buffer. Thus, if you plan to do multiple append
> operations, it is generally better to build a list of DataFrames and
> pass them all at once to the concat() function.
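The list-then-concat pattern recommended here also works on Pandas 2.0 and later, where DataFrame.append is no longer available. A minimal sketch, assuming the pieces arrive one at a time in a loop (the loop and the name `pieces` are illustrative):

```python
import pandas as pd


def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)


# Collect the individual frames in an ordinary Python list ...
pieces = []
for i in range(3):
    pieces.append(make_df("AB", [i]))   # list.append, modifies in place

# ... and build the combined index and data buffer only once.
combined = pd.concat(pieces)
print(combined)
```

Only the final pd.concat call allocates a new DataFrame, whereas repeated DataFrame-level appends would copy all accumulated data on every step.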
>
> In the next section, we'll look at another more powerful approach to
> combining data from multiple sources, the database-style merges/joins
> implemented in pd.merge.

> Combining Datasets: Merge and Join


> One essential feature offered by Pandas is its high-performance,
> in-memory join and merge operations. If you have ever worked with
> databases, you should be familiar with this type of data interaction.
> The main interface for this is the pd.merge function, and we'll see a
> few examples of how this can work in practice.
>
> For convenience, we will start by redefining the display()
> functionality from the previous section:
>
> import pandas as pd  
> import numpy as np
>
> class display(object):  
> """Display HTML representation of multiple objects"""  
> template = """\<div style="float: left; padding: 10px;"\>  
> \<p style='font-family:"Courier New", Courier,
> monospace'\>{0}\</p\>{1} \</div\>"""  
> def \_\_init\_\_(self, \*args):  
> self.args = args
>
> def \_repr_html\_(self):  
> return '\n'.join(self.template.format(a, eval(a).\_repr_html\_()) for
> a in self.args)
>
> def \_\_repr\_\_(self):  
> return '\n\n'.join(a + '\n' + repr(eval(a)) for a in self.args)
>
> Relational Algebra
>
> The behavior implemented in pd.merge() is a subset of what is known as
> *relational algebra*, which is a formal set of rules for manipulating
> relational data, and forms the conceptual foundation of operations
> available in most databases. The strength of the relational algebra
> approach is that it proposes several primitive operations, which
> become the building blocks of more complicated operations on any
> dataset. With this lexicon of fundamental operations implemented
> efficiently in a database or other program, a wide range of fairly
> complicated composite operations can be performed.
>
> Pandas implements several of these fundamental building blocks in the
> pd.merge() function and the related join() method of Series and
> DataFrames. As we will see, these let you efficiently link data from
> different sources.
>
> Categories of Joins

> The pd.merge() function implements a number of types of joins: the
> *one-to-one*, *many-to-one*, and *many-to-many* joins. All three types
> of joins are accessed via an identical call to the pd.merge()
> interface; the type of join performed depends on the form of the input
> data. Here we will show simple examples of the three types of merges,
> and discuss detailed options further below.
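Since the kind of join is inferred from the data rather than requested, it can help to state the expectation explicitly. A minimal sketch using the validate parameter of pd.merge, assuming two hypothetical frames `employees` and `supervisors` in which the key repeats on the left side only:

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
supervisors = pd.DataFrame({
    "group": ["Accounting", "Engineering", "HR"],
    "supervisor": ["Carly", "Guido", "Steve"],
})

# 'Engineering' repeats on the left, so this is a many-to-one merge.
merged = pd.merge(employees, supervisors, on="group", validate="many_to_one")
print(merged)

# Claiming a one-to-one relationship makes pandas check key uniqueness
# on both sides and raise if the claim does not hold.
try:
    pd.merge(employees, supervisors, on="group", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError as err:
    one_to_one_ok = False
    print("MergeError:", err)
```

The check costs an extra pass over the keys, so it is best seen as a safeguard for cases where an unexpected duplicate would silently multiply rows.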
>
> One-to-one joins
>
> Perhaps the simplest type of merge expression is the one-to-one join,
> which is in many ways very similar to the column-wise concatenation
> seen earlier. As a concrete example, consider the following two
> DataFrames, which contain information on several employees in a company:
>
> df1 = pd.DataFrame({'employee': \['Bob', 'Jake', 'Lisa', 'Sue'\],  
> 'group': \['Accounting', 'Engineering', 'Engineering', 'HR'\]})  
> df2 = pd.DataFrame({'employee': \['Lisa', 'Bob', 'Jake', 'Sue'\],  
> 'hire_date': \[2004, 2008, 2012, 2014\]})  
> display('df1', 'df2')

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th>df1</th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th>df2</th>
<th><strong>employee</strong></th>
<th><strong>hire_date</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td><blockquote>
<p>Accounting</p>
</blockquote></td>
<td><strong>0</strong></td>
<td>Lisa</td>
<td>2004</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>Engineering</td>
<td><strong>1</strong></td>
<td>Bob</td>
<td>2008</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>Engineering</td>
<td><strong>2</strong></td>
<td>Jake</td>
<td>2012</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>HR</td>
<td><strong>3</strong></td>
<td>Sue</td>
<td>2014</td>
</tr>
</tbody>
</table>

> To combine this information into a single DataFrame , we can use the
> pd.merge() function:
>
> df3 = pd.merge(df1, df2)  
> df3

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th><strong>hire_date</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td>Accounting</td>
<td>2008</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>Engineering</td>
<td>2012</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>Engineering</td>
<td>2004</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>HR</td>
<td>2014</td>
</tr>
</tbody>
</table>

> The pd.merge() function recognizes that each DataFrame has an
> "employee" column, and automatically joins using this column as a key.
> The result of the merge is a new DataFrame that combines the
> information from the two inputs. Notice that the order of entries in
> each column is not necessarily maintained: in this case, the order of
> the "employee" column differs between df1


> and df2, and the pd.merge() function correctly accounts for this.
> Additionally, keep in mind that the merge in general discards the
> index, except in the special case of merges by index (see the
> left_index and right_index keywords, discussed momentarily).
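
The one-to-one behavior described above can be checked with a minimal sketch. The frames repeat the values shown in the tables; the custom index labels `a` to `d` and the name `df1_custom` are assumptions added only to make the discarded index visible.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# Hypothetical custom index, used only to show that merge does not keep it.
df1_custom = df1.copy()
df1_custom.index = ['a', 'b', 'c', 'd']

# The shared "employee" column is picked up as the key automatically.
merged = pd.merge(df1_custom, df2)
print(merged)
print(list(merged.index))  # a fresh integer index, not 'a'..'d'
```

Rows are matched by the value of the key, not by position, so the differing row order of df2 does not affect the pairing.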
>
> **Many-to-one joins**
>
> Many-to-one joins are joins in which one of the two key columns
> contains duplicate entries. For the many-to-one case, the resulting
> DataFrame will preserve those duplicate entries as appropriate.
> Consider the following example of a many-to-one join:
>
> df4 = pd.DataFrame({'group': \['Accounting', 'Engineering', 'HR'\],
> 'supervisor': \['Carly', 'Guido', 'Steve'\]}) display('df3', 'df4',
> 'pd.merge(df3, df4)')

<table style="width:100%;">
<colgroup>
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
</colgroup>
<thead>
<tr class="header">
<th>df3</th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th><strong>hire_date</strong></th>
<th>df4</th>
<th><strong>group</strong></th>
<th><strong>supervisor</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td><blockquote>
<p>Accounting</p>
</blockquote></td>
<td>2008</td>
<td><strong>0</strong></td>
<td>Accounting</td>
<td>Carly</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>Engineering</td>
<td>2012</td>
<td><strong>1</strong></td>
<td>Engineering</td>
<td>Guido</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>Engineering</td>
<td>2004</td>
<td rowspan="2"><strong>2</strong></td>
<td rowspan="2">HR</td>
<td rowspan="2">Steve</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>HR</td>
<td>2014</td>
</tr>
</tbody>
</table>

> The resulting DataFrame has an additional column with the "supervisor"
> information, where the information is repeated in one or more
> locations as required by the inputs.
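
Since the merged output is not shown above, here is a small sketch of the many-to-one case, rebuilding df3 and df4 from the values in the tables. It only illustrates the repetition of the "supervisor" value for the duplicated "Engineering" key.

```python
import pandas as pd

df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                    'hire_date': [2008, 2012, 2004, 2014]})
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})

# "group" is the only shared column, so it becomes the key.
# "Engineering" appears twice on the left, so "Guido" is repeated.
many_to_one = pd.merge(df3, df4)
print(many_to_one)
```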

> **Many-to-many joins**

> Many-to-many joins are a bit confusing conceptually, but are
> nevertheless well defined. If the key column in both the left and right
> array contains duplicates, then the result is a many-to-many merge.
> This will be perhaps most clear with a concrete example. Consider the
> following, where we have a DataFrame showing one or more skills
> associated with a particular group. By performing a many-to-many join,
> we can recover the skills associated with any individual person:
>
> df5 = pd.DataFrame({'group': \['Accounting', 'Accounting',  
> 'Engineering', 'Engineering', 'HR', 'HR'\], 'skills': \['math',
> 'spreadsheets', 'coding', 'linux', 'spreadsheets',
> 'organization'\]})  
> display('df1', 'df5', "pd.merge(df1, df5)")


<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th>df1</th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th>df5</th>
<th><strong>group</strong></th>
<th><strong>skills</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td>Accounting</td>
<td><strong>0</strong></td>
<td>Accounting</td>
<td>math</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>Engineering</td>
<td><strong>1</strong></td>
<td>Accounting</td>
<td>spreadsheets</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>Engineering</td>
<td><strong>2</strong></td>
<td>Engineering</td>
<td>coding</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>HR</td>
<td><strong>3</strong></td>
<td>Engineering</td>
<td>linux</td>
</tr>
<tr class="odd">
<td></td>
<td></td>
<td></td>
<td><strong>4</strong></td>
<td>HR</td>
<td>spreadsheets</td>
</tr>
<tr class="even">
<td></td>
<td></td>
<td></td>
<td><strong>5</strong></td>
<td>HR</td>
<td>organization</td>
</tr>
</tbody>
</table>

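> pd.merge(df1, df5)

|       | **employee** | **group**   | **skills**   |
|-------|--------------|-------------|--------------|
| **0** | Bob          | Accounting  | math         |
| **1** | Bob          | Accounting  | spreadsheets |
| **2** | Jake         | Engineering | coding       |
| **3** | Jake         | Engineering | linux        |
| **4** | Lisa         | Engineering | coding       |
| **5** | Lisa         | Engineering | linux        |
| **6** | Sue          | HR          | spreadsheets |
| **7** | Sue          | HR          | organization |

> These three types of joins can be used with other Pandas tools to
> implement a wide array of functionality. But in practice, datasets are
> rarely as clean as the one we're working with here. In the following
> section we'll consider some of the options provided by pd.merge() that
> enable you to tune how the join operations work.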
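
When it is unclear which of the three join types a merge will produce, the `validate` argument of pd.merge() can check the assumption. The following is a minimal sketch using the df1 and df5 values from above; the flag name `one_to_one_ok` is introduced only for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

# Both key columns contain duplicates, so a many-to-many check passes.
many_to_many = pd.merge(df1, df5, validate='many_to_many')
print(len(many_to_many))

# Asking pandas to enforce a one-to-one relationship fails for these inputs.
try:
    pd.merge(df1, df5, validate='one_to_one')
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

Each left row is paired with every right row that shares its key, which is why the result has more rows than either input.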

> **Specification of the Merge Key**

> We've already seen the default behavior of pd.merge(): it looks for
> one or more matching column names between the two inputs, and uses
> this as the key. However, often the column names will not match so
> nicely, and pd.merge() provides a variety of options for handling
> this.
>
> **The on keyword**
>
> Most simply, you can explicitly specify the name of the key column
> using the on keyword, which takes a column name or a list of column
> names:
>
> display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th>df1</th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th>df2</th>
<th><strong>employee</strong></th>
<th><strong>hire_date</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td>Accounting</td>
<td><strong>0</strong></td>
<td>Lisa</td>
<td>2004</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>Engineering</td>
<td><strong>1</strong></td>
<td>Bob</td>
<td>2008</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>Engineering</td>
<td><strong>2</strong></td>
<td>Jake</td>
<td>2012</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>HR</td>
<td><strong>3</strong></td>
<td>Sue</td>
<td>2014</td>
</tr>
</tbody>
</table>


> This option works only if both the left and right DataFrames have the
> specified column name.
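
A short sketch of that restriction, using the df1 and df2 values from above: merging on "employee" succeeds, while naming a column that exists in only one input raises a KeyError. The flag `missing_key_fails` is a name chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# "employee" exists in both inputs, so it can be named explicitly as the key.
explicit = pd.merge(df1, df2, on='employee')
print(explicit)

# "group" exists only in df1, so it cannot serve as the key for both sides.
try:
    pd.merge(df1, df2, on='group')
    missing_key_fails = False
except KeyError:
    missing_key_fails = True
print(missing_key_fails)
```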
>
> **The left_on and right_on keywords**
>
> At times you may wish to merge two datasets with different column
> names; for example, we may have a dataset in which the employee name
> is labeled as "name" rather than "employee". In this case, we can use
> the left_on and right_on keywords to specify the two column names:

> df3 = pd.DataFrame({'name': \['Bob', 'Jake', 'Lisa', 'Sue'\],
>
> 'salary': \[70000, 80000, 120000, 90000\]})
>
> display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee",
> right_on="name")')

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th>df1</th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th>df3</th>
<th><strong>name</strong></th>
<th><strong>salary</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td><blockquote>
<p>Accounting</p>
</blockquote></td>
<td><strong>0</strong></td>
<td>Bob</td>
<td>70000</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>Engineering</td>
<td><strong>1</strong></td>
<td>Jake</td>
<td>80000</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>Engineering</td>
<td><strong>2</strong></td>
<td>Lisa</td>
<td>120000</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>HR</td>
<td><strong>3</strong></td>
<td>Sue</td>
<td>90000</td>
</tr>
</tbody>
</table>
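
Because the merged output is not shown above, here is a minimal sketch of the left_on/right_on merge, rebuilding df1 and df3 from the values in the tables. The variable name `merged_names` is an assumption made for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Match df1["employee"] against df3["name"]; both key columns are kept.
merged_names = pd.merge(df1, df3, left_on='employee', right_on='name')
print(merged_names)
print(list(merged_names.columns))
```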

> The result has a redundant column that we can drop if desired, for
> example by using the drop() method of DataFrames:
>
> pd.merge(df1, df3, left_on="employee", right_on="name").drop('name',
> axis=1)

<table style="width:100%;">
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th><strong>salary</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td>Accounting</td>
<td>70000</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>Engineering</td>
<td>80000</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>Engineering</td>
<td>120000</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>HR</td>
<td>90000</td>
</tr>
</tbody>
</table>

> **The left_index and right_index keywords**

> Sometimes, rather than merging on a column, you would instead like to
> merge on an index. For example, your data might look like this:
>
> df1a = df1.set_index('employee')
>
> df2a = df2.set_index('employee')
>
> display('df1a', 'df2a')


<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th>df1a</th>
<th rowspan="2"><strong>group</strong></th>
<th>df2a</th>
<th rowspan="2"><strong>hire_date</strong></th>
</tr>
<tr class="odd">
<th><strong>employee</strong></th>
<th><strong>employee</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Bob</strong></td>
<td>Accounting</td>
<td><strong>Lisa</strong></td>
<td>2004</td>
</tr>
<tr class="even">
<td><strong>Jake</strong></td>
<td>Engineering</td>
<td><strong>Bob</strong></td>
<td>2008</td>
</tr>
<tr class="odd">
<td><strong>Lisa</strong></td>
<td>Engineering</td>
<td><strong>Jake</strong></td>
<td>2012</td>
</tr>
<tr class="even">
<td><strong>Sue</strong></td>
<td>HR</td>
<td><strong>Sue</strong></td>
<td>2014</td>
</tr>
</tbody>
</table>

> You can use the index as the key for merging by specifying the
> left_index and/or right_index flags in pd.merge():
>
> display('df1a', 'df2a',
>
> "pd.merge(df1a, df2a, left_index=True, right_index=True)")

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th>df1a</th>
<th rowspan="2"><strong>group</strong></th>
<th><blockquote>
<p>df2a</p>
</blockquote></th>
<th rowspan="2"><strong>hire_date</strong></th>
</tr>
<tr class="odd">
<th><strong>employee</strong></th>
<th><strong>employee</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Bob</strong></td>
<td><blockquote>
<p>Accounting</p>
</blockquote></td>
<td><strong>Lisa</strong></td>
<td>2004</td>
</tr>
<tr class="even">
<td><strong>Jake</strong></td>
<td>Engineering</td>
<td><strong>Bob</strong></td>
<td>2008</td>
</tr>
<tr class="odd">
<td><strong>Lisa</strong></td>
<td>Engineering</td>
<td><strong>Jake</strong></td>
<td>2012</td>
</tr>
<tr class="even">
<td><strong>Sue</strong></td>
<td>HR</td>
<td><strong>Sue</strong></td>
<td>2014</td>
</tr>
</tbody>
</table>

> For convenience, DataFrames implement the join() method, which
> performs a merge that defaults to joining on indices:
>
> display('df1a', 'df2a', 'df1a.join(df2a)')

<table style="width:100%;">
<colgroup>
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
</colgroup>
<thead>
<tr class="header">
<th colspan="2">df1a</th>
<th colspan="2">df2a</th>
<th colspan="3">df1a.join(df2a)</th>
</tr>
<tr class="header">
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th><strong>employee</strong></th>
<th><strong>hire_date</strong></th>
<th><strong>employee</strong></th>
<th><strong>group</strong></th>
<th><strong>hire_date</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Bob</strong></td>
<td>Accounting</td>
<td><strong>Lisa</strong></td>
<td>2004</td>
<td><strong>Bob</strong></td>
<td>Accounting</td>
<td>2008</td>
</tr>
<tr class="even">
<td><strong>Jake</strong></td>
<td>Engineering</td>
<td><strong>Bob</strong></td>
<td>2008</td>
<td><strong>Jake</strong></td>
<td>Engineering</td>
<td>2012</td>
</tr>
<tr class="odd">
<td><strong>Lisa</strong></td>
<td>Engineering</td>
<td><strong>Jake</strong></td>
<td>2012</td>
<td><strong>Lisa</strong></td>
<td>Engineering</td>
<td>2004</td>
</tr>
<tr class="even">
<td><strong>Sue</strong></td>
<td>HR</td>
<td><strong>Sue</strong></td>
<td>2014</td>
<td><strong>Sue</strong></td>
<td>HR</td>
<td>2014</td>
</tr>
</tbody>
</table>
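
The relationship between join() and an index-based pd.merge() can be sketched as follows, rebuilding df1a and df2a from the values above. Note that join() defaults to a left join while pd.merge() defaults to an inner join; for these inputs the key sets are identical, so both contain the same rows. The comparison sorts by index first because row order is not assumed to match.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

# Index-on-index merge and the join() shorthand.
by_merge = pd.merge(df1a, df2a, left_index=True, right_index=True)
by_join = df1a.join(df2a)

# Compare contents independently of row order.
same_rows = by_merge.sort_index().equals(by_join.sort_index())
print(by_join)
print(same_rows)
```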


> If you'd like to mix indices and columns, you can combine left_index
> with right_on or left_on with right_index to get the desired behavior:
>
> display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True,
> right_on='name')")

<table>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="header">
<th>df1a</th>
<th><strong>group</strong></th>
<th>df3</th>
<th><strong>name</strong></th>
<th><strong>salary</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>employee</strong></td>
<td></td>
<td><strong>0</strong></td>
<td>Bob</td>
<td><blockquote>
<p>70000</p>
</blockquote></td>
</tr>
<tr class="even">
<td><strong>Bob</strong></td>
<td><blockquote>
<p>Accounting</p>
</blockquote></td>
<td><strong>1</strong></td>
<td>Jake</td>
<td><blockquote>
<p>80000</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><strong>Jake</strong></td>
<td><blockquote>
<p>Engineering</p>
</blockquote></td>
<td><strong>2</strong></td>
<td>Lisa</td>
<td>120000</td>
</tr>
<tr class="even">
<td><strong>Lisa</strong></td>
<td><blockquote>
<p>Engineering</p>
</blockquote></td>
<td rowspan="2"><strong>3</strong></td>
<td rowspan="2">Sue</td>
<td rowspan="2"><blockquote>
<p>90000</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><strong>Sue</strong></td>
<td>HR</td>
</tr>
</tbody>
</table>
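
The merged output for the mixed case is not shown above, so here is a minimal sketch that rebuilds df1a and df3 from the earlier values and matches the index of one against the "name" column of the other. The variable name `mixed` is chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df1a = df1.set_index('employee')
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Left key: the "employee" index of df1a. Right key: the "name" column of df3.
mixed = pd.merge(df1a, df3, left_index=True, right_on='name')
print(mixed)
```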

> All of these options also work with multiple indices and/or multiple
> columns; the interface for

> **Specifying Set Arithmetic for Joins**

> In all the preceding examples we have glossed over one important
> consideration in performing a join: the type of set arithmetic used in
> the join. This comes up when a value appears in one key column but not
> the other. Consider this example:
>
> df6 = pd.DataFrame({'name': \['Peter', 'Paul', 'Mary'\], 'food':
> \['fish', 'beans', 'bread'\]}, columns=\['name', 'food'\])  
> df7 = pd.DataFrame({'name': \['Mary', 'Joseph'\],  
> 'drink': \['wine', 'beer'\]},  
> columns=\['name', 'drink'\])  
> display('df6', 'df7', 'pd.merge(df6, df7)')

<table>
<colgroup>
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
</colgroup>
<thead>
<tr class="header">
<th colspan="3">df6</th>
<th colspan="3">df7</th>
<th colspan="4">pd.merge(df6, df7)</th>
</tr>
<tr class="header">
<th></th>
<th><strong>name</strong></th>
<th><strong>food</strong></th>
<th></th>
<th><strong>name</strong></th>
<th><strong>drink</strong></th>
<th></th>
<th><strong>name</strong></th>
<th><strong>food</strong></th>
<th><strong>drink</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Peter</td>
<td>fish</td>
<td><strong>0</strong></td>
<td>Mary</td>
<td>wine</td>
<td><strong>0</strong></td>
<td>Mary</td>
<td>bread</td>
<td>wine</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Paul</td>
<td>beans</td>
<td><strong>1</strong></td>
<td>Joseph</td>
<td>beer</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Mary</td>
<td>bread</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>


> Here we have merged two datasets that have only a single "name" entry
> in common: Mary. By default, the result contains the *intersection* of
> the two sets of inputs; this is what is known as an *inner join*. We
> can specify this explicitly using the how keyword, which defaults to
> "inner" :
>
> pd.merge(df6, df7, how='inner')

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>name</strong></th>
<th><strong>food</strong></th>
<th><strong>drink</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Mary</td>
<td>bread</td>
<td>wine</td>
</tr>
</tbody>
</table>

> Other options for the how keyword are 'outer', 'left', and 'right'.
> An *outer join* returns a join over the union of the input columns,
> and fills in all missing values with NAs:
>
> display('df6', 'df7', "pd.merge(df6, df7, how='outer')")

<table style="width:100%;">
<colgroup>
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
</colgroup>
<thead>
<tr class="header">
<th rowspan="2">df6</th>
<th rowspan="2"><strong>name</strong></th>
<th rowspan="2"><strong>food</strong></th>
<th rowspan="2">df7</th>
<th rowspan="2"><strong>name</strong></th>
<th rowspan="2"><strong>drink</strong></th>
<th colspan="4"><blockquote>
<p>pd.merge(df6, df7, how='outer')</p>
</blockquote></th>
</tr>
<tr class="odd">
<th colspan="2"><strong>name</strong></th>
<th><strong>food</strong></th>
<th><blockquote>
<p><strong>drink</strong></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Peter</td>
<td>fish</td>
<td><strong>0</strong></td>
<td>Mary</td>
<td>wine</td>
<td><strong>0</strong></td>
<td>Peter</td>
<td>fish</td>
<td><blockquote>
<p>NaN</p>
</blockquote></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Paul</td>
<td>beans</td>
<td rowspan="3"><strong>1</strong></td>
<td rowspan="3">Joseph</td>
<td rowspan="3">beer</td>
<td><strong>1</strong></td>
<td>Paul</td>
<td><blockquote>
<p>beans</p>
</blockquote></td>
<td><blockquote>
<p>NaN</p>
</blockquote></td>
</tr>
<tr class="odd">
<td rowspan="2"><strong>2</strong></td>
<td rowspan="2">Mary</td>
<td rowspan="2">bread</td>
<td><strong>2</strong></td>
<td>Mary</td>
<td><blockquote>
<p>bread</p>
</blockquote></td>
<td><blockquote>
<p>wine</p>
</blockquote></td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Joseph</td>
<td>NaN</td>
<td><blockquote>
<p>beer</p>
</blockquote></td>
</tr>
</tbody>
</table>

> The *left join* and *right join* return joins over the left entries
> and right entries, respectively. For example:
>
> display('df6', 'df7', "pd.merge(df6, df7, how='left')")

<table style="width:100%;">
<colgroup>
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
</colgroup>
<thead>
<tr class="header">
<th rowspan="2">df6</th>
<th rowspan="2"><strong>name</strong></th>
<th rowspan="2"><strong>food</strong></th>
<th rowspan="2">df7</th>
<th rowspan="2"><strong>name</strong></th>
<th rowspan="2"><strong>drink</strong></th>
<th colspan="4"><blockquote>
<p>pd.merge(df6, df7, how='left')</p>
</blockquote></th>
</tr>
<tr class="odd">
<th colspan="2"><strong>name</strong></th>
<th><strong>food</strong></th>
<th><blockquote>
<p><strong>drink</strong></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Peter</td>
<td>fish</td>
<td><strong>0</strong></td>
<td>Mary</td>
<td>wine</td>
<td><strong>0</strong></td>
<td><blockquote>
<p>Peter</p>
</blockquote></td>
<td>fish</td>
<td><blockquote>
<p>NaN</p>
</blockquote></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Paul</td>
<td>beans</td>
<td rowspan="2"><strong>1</strong></td>
<td rowspan="2">Joseph</td>
<td rowspan="2">beer</td>
<td><strong>1</strong></td>
<td>Paul</td>
<td>beans</td>
<td><blockquote>
<p>NaN</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Mary</td>
<td>bread</td>
<td><strong>2</strong></td>
<td><blockquote>
<p>Mary</p>
</blockquote></td>
<td>bread</td>
<td><blockquote>
<p>wine</p>
</blockquote></td>
</tr>
</tbody>
</table>

> The output rows now correspond to the entries in the left input. Using
> how='right' works in a similar manner.
>
> All of these options can be applied straightforwardly to any of the
> preceding join types.
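
The four values of the how keyword can be compared side by side with a small sketch built from the df6 and df7 values above. The `indicator=True` argument of pd.merge() adds a `_merge` column recording where each row came from; the names `sizes` and `origin` are assumptions introduced for this example.

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

# Number of result rows for each kind of join.
sizes = {how: len(pd.merge(df6, df7, how=how))
         for how in ['inner', 'outer', 'left', 'right']}
print(sizes)

# Record whether each name was found on the left, the right, or both sides.
outer = pd.merge(df6, df7, how='outer', indicator=True)
origin = outer.set_index('name')['_merge'].astype(str).to_dict()
print(origin)
```

Only "Mary" appears in both inputs, which is why the inner join has a single row while the outer join has one row per distinct name.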


> **Overlapping Column Names: The suffixes Keyword**
>
> Finally, you may end up in a case where your two input DataFrames
> have conflicting column names. Consider this example:
>
> df8 = pd.DataFrame({'name': \['Bob', 'Jake', 'Lisa', 'Sue'\],
>
> 'rank': \[1, 2, 3, 4\]})
>
> df9 = pd.DataFrame({'name': \['Bob', 'Jake', 'Lisa', 'Sue'\],
>
> 'rank': \[3, 1, 4, 2\]})
>
> display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

<table style="width:100%;">
<colgroup>
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
<col style="width: 10%" />
</colgroup>
<thead>
<tr class="header">
<th rowspan="2">df8</th>
<th rowspan="2"><strong>name</strong></th>
<th rowspan="2"><strong>rank</strong></th>
<th rowspan="2">df9</th>
<th rowspan="2"><strong>name</strong></th>
<th rowspan="2"><strong>rank</strong></th>
<th colspan="4"><blockquote>
<p>pd.merge(df8, df9, on="name")</p>
</blockquote></th>
</tr>
<tr class="odd">
<th colspan="2"><strong>name</strong></th>
<th><strong>rank_x</strong></th>
<th><blockquote>
<p><strong>rank_y</strong></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Bob</td>
<td>1</td>
<td><strong>0</strong></td>
<td>Bob</td>
<td>3</td>
<td><strong>0</strong></td>
<td>Bob</td>
<td>1</td>
<td><blockquote>
<p>3</p>
</blockquote></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>Jake</td>
<td>2</td>
<td><strong>1</strong></td>
<td>Jake</td>
<td>1</td>
<td><strong>1</strong></td>
<td>Jake</td>
<td>2</td>
<td><blockquote>
<p>1</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Lisa</td>
<td>3</td>
<td><strong>2</strong></td>
<td>Lisa</td>
<td>4</td>
<td><strong>2</strong></td>
<td>Lisa</td>
<td>3</td>
<td><blockquote>
<p>4</p>
</blockquote></td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Sue</td>
<td>4</td>
<td><strong>3</strong></td>
<td>Sue</td>
<td>2</td>
<td><strong>3</strong></td>
<td>Sue</td>
<td>4</td>
<td><blockquote>
<p>2</p>
</blockquote></td>
</tr>
</tbody>
</table>

> Because the output would have two conflicting column names, the merge
> function automatically appends a suffix \_x or \_y to make the output
> columns unique. If these defaults are inappropriate, it is possible to
> specify a custom suffix using the suffixes keyword:
>
> display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=\["\_L",
> "\_R"\])')

| df8   | **name** | **rank** | df9   | **name** | **rank** |
|-------|----------|----------|-------|----------|----------|
| **0** | Bob      | 1        | **0** | Bob      | 3        |
| **1** | Jake     | 2        | **1** | Jake     | 1        |
| **2** | Lisa     | 3        | **2** | Lisa     | 4        |
| **3** | Sue      | 4        | **3** | Sue      | 2        |
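
The merged output is not shown above, so the following minimal sketch reproduces the suffixes call with the df8 and df9 values from the tables. It passes the suffixes as a tuple rather than a list, which is an equivalent choice made here; the variable name `ranked` is illustrative.

```python
import pandas as pd

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

# "rank" exists in both inputs but is not the key, so it gets a suffix per side.
ranked = pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))
print(ranked)
```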

> These suffixes work in any of the possible join patterns, and work also
> if there are multiple overlapping columns.
>
> For more information on these patterns, see where we dive a bit


> **Example: US States Data**
>
> Merge and join operations come up most often when combining data from
> different sources. Here we will consider an example of some data about
> US states and their populations. The data files can be found at:
>
> \# Following are shell commands to download the data  
> \# !curl -O
> https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population
> \# !curl -O
> https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
> \# !curl -O
> https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.cs
>
> Let's take a look at the three datasets, using the Pandas read_csv()
> function:
>
> pop = pd.read_csv('/content/sample_data/california_housing_test.csv')
> areas = pd.read_csv('/content/sample_data/mnist_test.csv')  
> abbrevs = pd.read_csv('/content/sample_data/mnist_train_small.csv')
>
> display('pop.head()', 'areas.head()', 'abbrevs.head()')


> pop.head()

|       | **longitude** | **latitude** | **housing_median_age** | **total_rooms** | **total_bedrooms** | **population** |
|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| **0** | -122.05       | 37.37        | 27.0                   | 3885.0          | 661.0              | 1537.0         |
| **1** | -118.30       | 34.26        | 43.0                   | 1510.0          | 310.0              | 809.0          |
| **2** | -117.81       | 33.78        | 27.0                   | 3589.0          | 507.0              | 1484.0         |
| **3** | -118.36       | 33.82        | 28.0                   | 67.0            | 15.0               | 49.0           |
| **4** | -119.67       | 36.33        | 19.0                   | 1241.0          | 244.0              | 850.0          |

> areas.head()
>
> [Output table: 5 rows × 785 columns. The first column is labelled 7 and every other value shown is 0.]
>
> abbrevs.head()
>
> [Output table: 5 rows × 785 columns. The first column is labelled 6 and every other value shown is 0.]
>
> Given this information, say we want to compute a relatively
> straightforward result: rank US states and territories by their 2010
> population density. We clearly have the data here to find this
> result, but we'll have to combine the datasets to find the result.
>
> We'll start with a many-to-one merge that will give us the full state
> name within the population DataFrame. We want to merge based on the
> state/region column of pop, and the abbreviation column of abbrevs.
> We'll use how='outer' to make sure no data is thrown away due to
> mismatched labels.
>
> Some of the population info is null; let's figure out which these
> are!
>
> It appears that all the null population values are from Puerto Rico
> prior to the year 2000; this is likely due to this data not being
> available from the original source.
>
> More importantly, we see also that some of the new state entries are
> also null, which means that there was no corresponding entry in the
> abbrevs key! Let's figure out which regions lack this match:
>
> We can quickly infer the issue: our population data includes entries
> for Puerto Rico (PR) and the United States as a whole (USA), while
> these entries do not appear in the state abbreviation key. We can fix
> these quickly by filling in appropriate entries:
>
> No more nulls in the state column: we're all set!
>
> Now we can merge the result with the area data using a similar
> procedure. Examining our results, we will want to join on the state
> column in both:
>
> Again, let's check for nulls to see if there were any mismatches:
>
> We see that our areas DataFrame does not contain the area of the United
> States as a whole. We
> could insert the appropriate value (using the sum of all state areas,
> for instance), but in this case
> we'll just drop the null values because the population density of the
> entire United States is not relevant to our current discussion:
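
The merge-and-clean steps described above can be sketched on a tiny stand-in dataset. This is only an illustration: the three frames are hypothetical, the numbers are made up, and the column names `ages`, `year`, `population`, and `area (sq. mi)` are assumptions (only `state/region`, `abbreviation`, and `state` are named in the text).

```python
import pandas as pd

# Hypothetical miniature versions of the three state tables (made-up values).
pop = pd.DataFrame({
    "state/region": ["AL", "AL", "PR", "PR", "USA"],
    "ages": ["total"] * 5,
    "year": [2010, 2000, 2010, 1999, 2010],
    "population": [500.0, 450.0, 300.0, None, 9000.0],
})
abbrevs = pd.DataFrame({"state": ["Alabama"], "abbreviation": ["AL"]})
areas = pd.DataFrame({"state": ["Alabama", "Puerto Rico"],
                      "area (sq. mi)": [50.0, 5.0]})

# Many-to-one merge; how="outer" keeps rows whose labels have no match.
merged = pd.merge(pop, abbrevs, how="outer",
                  left_on="state/region", right_on="abbreviation")
merged = merged.drop(columns="abbreviation")  # duplicates state/region

# Which columns contain nulls, and which regions lack a state name?
print(merged.isnull().any())
missing = merged.loc[merged["state"].isnull(), "state/region"].unique()

# Fill in the unmatched regions by hand.
merged.loc[merged["state/region"] == "PR", "state"] = "Puerto Rico"
merged.loc[merged["state/region"] == "USA", "state"] = "United States"

# Second merge on the shared "state" column, then drop rows with any null.
final = pd.merge(merged, areas, on="state", how="left")
final = final.dropna()
print(final)
```

In this toy data, `dropna()` removes both the USA row (no area) and the Puerto Rico row with a missing population.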
>
> Now we have all the data we need. To answer the question of interest,
> let's first select the portion of the data corresponding with the year
> 2010, and the total population. We'll use the
>
> Now let's compute the population density and display it in order.
> We'll start by re-indexing our data on the state, and then compute the
> result:
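
A minimal sketch of the selection and density computation follows. It assumes a merged frame with hypothetical columns `state`, `ages`, `year`, `population`, and `area (sq. mi)`, filled with made-up values, and it uses a plain Boolean mask for the selection, which may differ from the method used in the notebook.

```python
import pandas as pd

# Hypothetical merged table (made-up values, placeholder state names).
final = pd.DataFrame({
    "state": ["Alpha", "Alpha", "Beta", "Gamma"],
    "ages": ["total", "under18", "total", "total"],
    "year": [2010, 2010, 2010, 2010],
    "population": [1000.0, 250.0, 300.0, 50.0],
    "area (sq. mi)": [10.0, 10.0, 100.0, 500.0],
})

# Keep only the total population for 2010.
data2010 = final[(final["year"] == 2010) & (final["ages"] == "total")]

# Re-index on the state so the result is labelled by state name.
data2010 = data2010.set_index("state")

# Density in residents per square mile, densest first.
density = data2010["population"] / data2010["area (sq. mi)"]
density = density.sort_values(ascending=False)
print(density.head())
```

Calling `density.tail()` on the same Series shows the least dense entries.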
>
> The result is a ranking of US states plus Washington, DC, and Puerto
> Rico in order of their 2010 population density, in residents per
> square mile. We can see that by far the densest region in this dataset
> is Washington, DC (i.e., the District of Columbia); among states, the
> densest is New Jersey.
>
> We can also check the end of the list:
>
> We see that the least dense state, by far, is Alaska, averaging
> slightly over one resident per square mile.
>
> This type of messy data merging is a common task when trying to answer
> questions using real-world data sources. I hope that this example has
> given you an idea of the ways you can combine tools we've covered in
> order to gain insight from your data!

> Aggregation and Grouping

> An essential piece of analysis of large data is efficient summarization:
> computing aggregations like sum(), mean(), median(), min(), and
> max(), in which a single number gives insight into the nature of a
> potentially large dataset. In this section, we'll explore aggregations
> in Pandas, from simple operations akin to what we've seen on NumPy
> arrays, to more sophisticated operations based on the concept of a
> groupby.
>
> For convenience, we'll use the same display magic function that we've
> seen in previous sections:
>
> import numpy as np  
> import pandas as pd
>
> class display(object):  
> """Display HTML representation of multiple objects"""


> template = """\<div style="float: left; padding: 10px;"\>  
> \<p style='font-family:"Courier New", Courier,
> monospace'\>{0}\</p\>{1} \</div\>"""  
> def \_\_init\_\_(self, \*args):  
> self.args = args
>
> def \_repr_html\_(self):  
> return '\n'.join(self.template.format(a, eval(a).\_repr_html\_()) for
> a in self.args)
>
> def \_\_repr\_\_(self):  
> return '\n\n'.join(a + '\n' + repr(eval(a)) for a in self.args)
>
> Planets Data
>
> Here we will use the Planets dataset, available via the Seaborn
> package. It gives information on planets that astronomers have
> discovered around other stars (known as *extrasolar planets* or
> *exoplanets* for short). It can be downloaded with a simple Seaborn
> command:
>
> import seaborn as sns  
> planets = sns.load_dataset('planets')  
> planets.shape
>
> (1035, 6)
>
> planets.head()

|       | **method**      | **number** | **orbital_period** | **mass** | **distance** | **year** |
|-------|-----------------|------------|--------------------|----------|--------------|----------|
| **0** | Radial Velocity | 1          | 269.300            | 7.10     | 77.40        | 2006     |
| **1** | Radial Velocity | 1          | 874.774            | 2.21     | 56.95        | 2008     |
| **2** | Radial Velocity | 1          | 763.000            | 2.60     | 19.84        | 2011     |
| **3** | Radial Velocity | 1          | 326.030            | 19.40    | 110.62       | 2007     |
| **4** | Radial Velocity | 1          | 516.220            | 10.50    | 119.47       | 2009     |

> This has some details on the 1,000+ extrasolar planets discovered up
> to 2014.

> Simple Aggregation in Pandas

> For a Pandas Series, the aggregates return a single value:


> rng = np.random.RandomState(42)
>
> ser = pd.Series(rng.rand(5))
>
> ser
>
> 0 0.374540
>
> 1 0.950714
>
> 2 0.731994
>
> 3 0.598658
>
> 4 0.156019
>
> dtype: float64
>
> ser.sum()
>
> 2.811925491708157
>
> ser.mean()
>
> 0.5623850983416314
>
> For a DataFrame, by default the aggregates return results within each
> column:
>
> df = pd.DataFrame({'A': rng.rand(5),
>
> 'B': rng.rand(5)})
>
> df

|       | **A**    | **B**    |
|-------|----------|----------|
| **0** | 0.155995 | 0.020584 |
| **1** | 0.058084 | 0.969910 |
| **2** | 0.866176 | 0.832443 |
| **3** | 0.601115 | 0.212339 |
| **4** | 0.708073 | 0.181825 |

> df.mean()
>
> A 0.477888
>
> B 0.443420
>
> dtype: float64
>
> By specifying the axis argument, you can instead aggregate within each
> row:
>
> df.mean(axis='columns')
>
> 0 0.088290
>
> 1 0.513997
>
> 2 0.849309
>
> 3 0.406727


> 4 0.444949  
> dtype: float64
>
> Pandas Series and DataFrames include all of the common aggregates
> mentioned above; in addition, there is a convenience method describe()
> that computes several common aggregates for each column and returns
> the result.
>
> Let's use this on the Planets data, for now dropping rows with missing
> values:
>
> planets.dropna().describe()

|           | **number** | **orbital_period** | **mass**   | **distance** | **year**    |
|-----------|------------|--------------------|------------|--------------|-------------|
| **count** | 498.00000  | 498.000000         | 498.000000 | 498.000000   | 498.000000  |
| **mean**  | 1.73494    | 835.778671         | 2.509320   | 52.068213    | 2007.377510 |
| **std**   | 1.17572    | 1469.128259        | 3.636274   | 46.596041    | 4.167284    |
| **min**   | 1.00000    | 1.328300           | 0.003600   | 1.350000     | 1989.000000 |
| **25%**   | 1.00000    | 38.272250          | 0.212500   | 24.497500    | 2005.000000 |
| **50%**   | 1.00000    | 357.000000         | 1.245000   | 39.940000    | 2009.000000 |
| **75%**   | 2.00000    | 999.600000         | 2.867500   | 59.332500    | 2011.000000 |
| **max**   | 6.00000    | 17337.500000       | 25.000000  | 354.000000   | 2014.000000 |

> This can be a useful way to begin understanding the overall properties
> of a dataset. For example, we see in the year column that although
> exoplanets were discovered as far back as 1989, half of the
> exoplanets in this table were not discovered until 2009 or later. This
> is largely thanks to the *Kepler* mission, which is a space-based
> telescope specifically designed for finding eclipsing planets around
> other stars.
>
> The following table summarizes some other built-in Pandas
> aggregations:

| **Aggregation**  | **Description**                 |
|------------------|---------------------------------|
| count()          | Total number of items           |
| first(), last()  | First and last item             |
| mean(), median() | Mean and median                 |
| min(), max()     | Minimum and maximum             |
| std(), var()     | Standard deviation and variance |
| mad()            | Mean absolute deviation         |
| prod()           | Product of all items            |
| sum()            | Sum of all items                |

> These are all methods of DataFrame and Series objects.
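
A short sketch of several of these aggregations on a small hypothetical Series (made-up values, including one missing entry). One caveat that is not from the notebook: `mad()` was removed in pandas 2.0, so the mean absolute deviation is written out by hand here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
ser = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])

summary = {
    "count": ser.count(),    # counts non-null items only
    "min": ser.min(),
    "max": ser.max(),
    "median": ser.median(),
    "std": ser.std(),        # sample standard deviation (ddof=1)
    "var": ser.var(),        # sample variance (ddof=1)
    "prod": ser.prod(),
    "sum": ser.sum(),
}

# Mean absolute deviation computed manually instead of calling mad().
summary["mad"] = (ser - ser.mean()).abs().mean()

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

By default these methods skip missing values, which is why `count` reports four items rather than five.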


> To go deeper into the data, however, simple aggregates are often not
> enough. The next level of data summarization is the groupby operation,
> which allows you to quickly and efficiently compute aggregates on
> subsets of data.
>
> GroupBy: Split, Apply, Combine
>
> Simple aggregations can give you a flavor of your dataset, but often we
> would prefer to aggregate conditionally on some label or index: this
> is implemented in the so-called groupby operation. The name "group by"
> comes from a command in the SQL database language, but it is perhaps
> more illuminative to think of it in the terms first coined by Hadley
> Wickham of Rstats fame: *split, apply, combine*.
>
> Split, apply, combine
>
> A canonical example of this split-apply-combine operation, where the
> "apply" is a summation aggregation, is illustrated in this figure:
>
> This makes clear what the groupby accomplishes:
>
> - The *split* step involves breaking up and grouping a DataFrame
>   depending on the value of the specified key.
> - The *apply* step involves computing some function, usually an
>   aggregate, transformation, or filtering, within the individual groups.
> - The *combine* step merges the results of these operations into an
>   output array.
>
> While this could certainly be done manually using some combination of
> the masking,  
> aggregation, and merging commands covered earlier, an important
> realization is that *the intermediate splits do not need to be
> explicitly instantiated*. Rather, the GroupBy can (often) do this in a
> single pass over the data, updating the sum, mean, count, min, or
> other aggregate for each group along the way. The power of the GroupBy
> is that it abstracts away these steps: the user need not think about
> *how* the computation is done under the hood, but rather thinks about
> the *operation as a whole*.
>
> As a concrete example, let's take a look at using Pandas for the
> computation shown in this diagram. We'll start by creating the input
> DataFrame:
>
> df = pd.DataFrame({'key': \['A', 'B', 'C', 'A', 'B', 'C'\],  
> 'data': range(6)}, columns=\['key', 'data'\])  
> df

|       | **key** | **data** |
|-------|---------|----------|
| **0** | A       | 0        |
| **1** | B       | 1        |
| **2** | C       | 2        |
| **3** | A       | 3        |
| **4** | B       | 4        |
| **5** | C       | 5        |

> The most basic split-apply-combine operation can be computed with the
> groupby() method of DataFrames, passing the name of the desired key
> column:
>
> df.groupby('key')
>
> \<pandas.core.groupby.generic.DataFrameGroupBy object at
> 0x7fb5e61805e0\>
>
> Notice that what is returned is not a set of DataFrames, but a
> DataFrameGroupBy object. This object is where the magic is: you can
> think of it as a special view of the DataFrame, which is poised to
> dig into the groups but does no actual computation until the
> aggregation is applied.
>
> This "lazy evaluation" approach means that common aggregates can be
> implemented very efficiently in a way that is almost transparent to the
> user.
>
> To produce a result, we can apply an aggregate to this
> DataFrameGroupBy object, which will perform the appropriate
> apply/combine steps to produce the desired result:
>
> df.groupby('key').sum()

| **key** | **data** |
|---------|----------|
| **A**   | 3        |
| **B**   | 5        |
| **C**   | 7        |

> The sum() method is just one possibility here; you can apply virtually
> any common Pandas or NumPy aggregation function, as well as virtually
> any valid DataFrame operation, as we will see in the following
> discussion.
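
To make the split, apply, and combine steps concrete, the sketch below performs them by hand with Boolean masks and then compares the result with `groupby()`. The manual version is written only for illustration and is not how pandas implements the operation internally.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                   "data": range(6)})

# Manual version: split with a Boolean mask per key, apply sum(),
# and combine the per-key results into one Series.
manual = pd.Series({
    key: df.loc[df["key"] == key, "data"].sum()
    for key in sorted(df["key"].unique())
})

# The same computation expressed as a single groupby aggregation.
grouped = df.groupby("key")["data"].sum()

print(manual)
print(grouped)
```

Both approaches give the same per-key sums; the `groupby()` form avoids creating each intermediate subset explicitly.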
>
> The GroupBy object
>
> The GroupBy object is a very flexible abstraction. In many ways, you
> can simply treat it as if it's a collection of DataFrames, and it
> does the difficult things under the hood. Let's see some
> examples using the Planets data.
>
> Perhaps the most important operations made available by a GroupBy are
> *aggregate*, *filter*, *transform*, and *apply*. We'll discuss each of
> these more fully in "Aggregate, Filter, Transform, Apply", but
> before that let's introduce some of the other functionality that can
> be used with the basic GroupBy operation.
>
> Column indexing
>
> The GroupBy object supports column indexing in the same way as the
> DataFrame, and returns a modified GroupBy object. For example:
>
> planets.groupby('method')
>
> \<pandas.core.groupby.generic.DataFrameGroupBy object at
> 0x7fb5e61806a0\>
>
> planets.groupby('method')\['orbital_period'\]
>
> \<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fb5e6180ee0\>
>
> Here we've selected a particular Series group from the original
> DataFrame group by reference to its column name. As with the GroupBy
> object, no computation is done until we call some aggregate on the
> object:
>
> planets.groupby('method')\['orbital_period'\].median()
>
> method  
> Astrometry 631.180000  
> Eclipse Timing Variations 4343.500000  
> Imaging 27500.000000  
> Microlensing 3300.000000  
> Orbital Brightness Modulation 0.342887  
> Pulsar Timing 66.541900  
> Pulsation Timing Variations 1170.000000  
> Radial Velocity 360.200000  
> Transit 5.714932  
> Transit Timing Variations 57.011000  
> Name: orbital_period, dtype: float64
>
> This gives an idea of the general scale of orbital periods (in days)
> that each method is sensitive to.
>
> Iteration over groups
>
> The GroupBy object supports direct iteration over the groups,
> returning each group as a Series or DataFrame:


> for (method, group) in planets.groupby('method'):
>
> print("{0:30s} shape={1}".format(method, group.shape))
>
> Astrometry shape=(2, 6)  
> Eclipse Timing Variations shape=(9, 6)  
> Imaging shape=(38, 6)  
> Microlensing shape=(23, 6)  
> Orbital Brightness Modulation shape=(3, 6)  
> Pulsar Timing shape=(5, 6)  
> Pulsation Timing Variations shape=(1, 6)  
> Radial Velocity shape=(553, 6)  
> Transit shape=(397, 6)  
> Transit Timing Variations shape=(4, 6)
>
> This can be useful for doing certain things manually, though it is
> often much faster to use the built-in apply functionality, which we
> will discuss momentarily.
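
As a small illustration of the trade-off, the sketch below computes a per-group median twice: once with an explicit loop over the groups and once with the built-in aggregation. The frame is a hypothetical miniature with made-up values, not the Planets data.

```python
import pandas as pd

# Hypothetical miniature dataset (made-up values).
df = pd.DataFrame({
    "method": ["Transit", "Imaging", "Transit", "Imaging", "Transit"],
    "orbital_period": [3.0, 1000.0, 5.0, 3000.0, 4.0],
})

# Manual loop: each iteration yields the group label and its sub-DataFrame.
loop_result = {}
for method, group in df.groupby("method"):
    loop_result[method] = group["orbital_period"].median()

# Built-in aggregation: one call, no Python-level loop over the groups.
builtin_result = df.groupby("method")["orbital_period"].median()

print(loop_result)
print(builtin_result)
```

The two results agree; the loop is mainly useful when each group needs custom handling that no built-in method covers.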
>
> Dispatch methods
>
> Through some Python class magic, any method not explicitly implemented
> by the GroupBy object will be passed through and called on the groups,
> whether they are DataFrame or Series objects. For example, you can use
> the describe() method of DataFrames to perform a set of aggregations
> that describe each group in the data:
>
> planets.groupby('method')\['year'\].describe().unstack()
>
> method  
> count Astrometry 2.0  
> Eclipse Timing Variations 9.0  
> Imaging 38.0  
> Microlensing 23.0  
> Orbital Brightness Modulation 3.0  
> ...
>
> max Pulsar Timing 2011.0  
> Pulsation Timing Variations 2007.0  
> Radial Velocity 2014.0  
> Transit 2014.0  
> Transit Timing Variations 2014.0  
> Length: 80, dtype: float64
>
> Looking at this table helps us to better understand the data: for
> example, the vast majority of planets have been discovered by the
> Radial Velocity and Transit methods, though the latter only became
> common (due to new, more accurate telescopes) in the last decade. The
> newest methods seem to be Transit Timing Variations and Orbital
> Brightness Modulation, which were not used to discover a new planet
> until 2011.
>
> This is just one example of the utility of dispatch methods. Notice that
> they are applied *to each individual group*, and the results are then
> combined within GroupBy and returned. Again, any valid DataFrame/Series
> method can be used on the corresponding GroupBy object, which allows
> for some very flexible and powerful operations!
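
The shape of the describe() result is easier to see on a small example. The sketch below uses a hypothetical two-group frame with made-up years. Whether a given pandas version dispatches describe() or implements it directly on the GroupBy object is an implementation detail that this example does not depend on.

```python
import pandas as pd

# Hypothetical miniature dataset (made-up values).
df = pd.DataFrame({
    "method": ["Transit", "Transit", "Transit", "Imaging", "Imaging"],
    "year": [2010, 2012, 2014, 2004, 2008],
})

# One row per group, one column per summary statistic.
stats = df.groupby("method")["year"].describe()
print(stats)

# unstack() reshapes the table into one long Series
# indexed by (statistic, method).
long_form = stats.unstack()
print(len(long_form))
```

With two groups and eight statistics, the unstacked Series has sixteen entries, in the same way that ten methods give the length of 80 shown above.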


> Aggregate, filter, transform, apply
>
> The preceding discussion focused on aggregation for the combine
> operation, but there are more options available. In particular,
> GroupBy objects have aggregate(), filter(), transform(), and apply()
> methods that efficiently implement a variety of useful operations
> before combining the grouped data.
>
> For the purpose of the following subsections, we'll use this DataFrame:
>
> rng = np.random.RandomState(0)  
> df = pd.DataFrame({'key': \['A', 'B', 'C', 'A', 'B', 'C'\],  
> 'data1': range(6),  
> 'data2': rng.randint(0, 10, 6)},  
> columns=\['key', 'data1', 'data2'\])  
> df

|       | **key** | **data1** | **data2** |
|-------|---------|-----------|-----------|
| **0** | A       | 0         | 5         |
| **1** | B       | 1         | 0         |
| **2** | C       | 2         | 3         |
| **3** | A       | 3         | 3         |
| **4** | B       | 4         | 7         |
| **5** | C       | 5         | 9         |

> Aggregation

> We're now familiar with GroupBy aggregations with sum(), median(),
> and the like, but the aggregate() method allows for even more
> flexibility. It can take a string, a function, or a list thereof, and
> compute all the aggregates at once. Here is a quick example combining
> all these:

> df.groupby('key').aggregate(\['min', np.median, max\])

| key   | data1 min | data1 median | data1 max | data2 min | data2 median | data2 max |
|-------|-----------|--------------|-----------|-----------|--------------|-----------|
| **A** | 0         | 1.5          | 3         | 3         | 4.0          | 5         |
| **B** | 1         | 2.5          | 4         | 0         | 3.5          | 7         |
| **C** | 2         | 3.5          | 5         | 3         | 6.0          | 9         |
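
The list form of aggregate() can be sketched in a self-contained way as follows. This is a minimal illustration, not the notebook's own cell: the construction of `df` is an assumption reconstructed from the values displayed in the tables of this section, and `value_range` is a hypothetical custom aggregate added only to show that a plain Python function is accepted too.

```python
import pandas as pd

# Assumed reconstruction of the six-row frame shown in this section's tables
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [0, 1, 2, 3, 4, 5],
    'data2': [5, 0, 3, 3, 7, 9],
})

# A list of aggregations is applied to every non-key column; the result
# has two-level column labels such as ('data1', 'min').
multi = df.groupby('key').aggregate(['min', 'median', 'max'])


def value_range(s):
    """Hypothetical custom aggregate: largest minus smallest value."""
    return s.max() - s.min()


# A single function yields one value per group and per column.
ranges = df.groupby('key').aggregate(value_range)

print(multi)
print(ranges)
```

Individual cells of the two-level result can be read with a tuple, for example `multi.loc['A', ('data1', 'median')]`.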


> Another useful pattern is to pass a dictionary mapping column names to
> operations to be applied on that column:
>
> df.groupby('key').aggregate({'data1': 'min',  
> 'data2': 'max'})

| key   | data1 | data2 |
|-------|-------|-------|
| **A** | 0     | 5     |
| **B** | 1     | 7     |
| **C** | 2     | 9     |

> Filtering

> A filtering operation allows you to drop data based on the group
> properties. For example, we might want to keep all groups in which the
> standard deviation is larger than some critical value:
>
> def filter_func(x):  
> return x\['data2'\].std() \> 4
>
> display('df', "df.groupby('key').std()",
> "df.groupby('key').filter(filter_func)")

df

|       | key | data1 | data2 |
|-------|-----|-------|-------|
| **0** | A   | 0     | 5     |
| **1** | B   | 1     | 0     |
| **2** | C   | 2     | 3     |
| **3** | A   | 3     | 3     |
| **4** | B   | 4     | 7     |
| **5** | C   | 5     | 9     |

df.groupby('key').std()

| key   | data1   | data2    |
|-------|---------|----------|
| **A** | 2.12132 | 1.414214 |
| **B** | 2.12132 | 4.949747 |
| **C** | 2.12132 | 4.242641 |

df.groupby('key').filter(filter_func)

|       | key | data1 | data2 |
|-------|-----|-------|-------|
| **1** | B   | 1     | 0     |
| **2** | C   | 2     | 3     |
| **4** | B   | 4     | 7     |
| **5** | C   | 5     | 9     |

> The filter function should return a Boolean value specifying whether
> the group passes the filtering. Here, because group A does not have a
> standard deviation greater than 4, it is dropped from the result.
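
The filtering step can be reproduced with the short sketch below. The construction of `df` is an assumption rebuilt from the displayed tables, and the boolean-mask variant is an alternative formulation added for comparison, not something the notebook itself uses.

```python
import pandas as pd

# Assumed reconstruction of the frame shown in the tables above
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [0, 1, 2, 3, 4, 5],
    'data2': [5, 0, 3, 3, 7, 9],
})


def filter_func(group):
    # One Boolean per group: keep the group only if data2 is spread out enough
    return group['data2'].std() > 4


kept = df.groupby('key').filter(filter_func)

# Alternative: broadcast each group's std back onto its rows and mask
mask = df.groupby('key')['data2'].transform('std') > 4
same = kept.equals(df[mask])

print(kept)
print(same)
```

Note that filter() returns the surviving rows of the original frame, in their original order, rather than one row per group.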
>
> Transformation


> While aggregation must return a reduced version of the data,
> transformation can return some transformed version of the full data to
> recombine. For such a transformation, the output is the same shape as
> the input. A common example is to center the data by subtracting the
> group-wise mean:
>
> df.groupby('key').transform(lambda x: x - x.mean())
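
The same centering step can be written as a self-contained sketch. The construction of `df` is an assumption rebuilt from the tables in this section, and the numeric columns are selected explicitly so that the example does not depend on how a particular pandas version treats the grouping column. The final check, that every group mean becomes zero, is an addition for illustration.

```python
import pandas as pd

# Assumed reconstruction of the frame used throughout this section
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [0, 1, 2, 3, 4, 5],
    'data2': [5, 0, 3, 3, 7, 9],
})

# Subtract each group's mean from its own rows; the result keeps the
# original row index and has one row per input row.
centered = df.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())

# After centering, the mean within every group should be zero.
group_means = centered.groupby(df['key']).mean()

print(centered)
print(group_means)
```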

|       | data1 | data2 |
|-------|-------|-------|
| **0** | -1.5  | 1.0   |
| **1** | -1.5  | -3.5  |
| **2** | -1.5  | -3.0  |
| **3** | 1.5   | -1.0  |
| **4** | 1.5   | 3.5   |
| **5** | 1.5   | 3.0   |

> The apply() method

> The apply() method lets you apply an arbitrary function to the group
> results. The function should take a DataFrame, and return either a
> Pandas object (e.g., DataFrame, Series) or a scalar; the combine
> operation will be tailored to the type of output returned.
>
> For example, here is an apply() that normalizes the first column by the
> sum of the second:
>
> def norm_by_data2(x):  
> \# x is a DataFrame of group values  
> x\['data1'\] /= x\['data2'\].sum()  
> return x
>
> display('df', "df.groupby('key').apply(norm_by_data2)")

df

|       | key | data1 | data2 |
|-------|-----|-------|-------|
| **0** | A   | 0     | 5     |
| **1** | B   | 1     | 0     |
| **2** | C   | 2     | 3     |
| **3** | A   | 3     | 3     |
| **4** | B   | 4     | 7     |
| **5** | C   | 5     | 9     |

df.groupby('key').apply(norm_by_data2)

|       | key | data1    | data2 |
|-------|-----|----------|-------|
| **0** | A   | 0.000000 | 5     |
| **1** | B   | 0.142857 | 0     |
| **2** | C   | 0.166667 | 3     |
| **3** | A   | 0.375000 | 3     |
| **4** | B   | 0.571429 | 7     |
| **5** | C   | 0.416667 | 9     |


> apply() within a GroupBy is quite flexible: the only criterion is that
> the function takes a DataFrame and returns a Pandas object or scalar;
> what you do in the middle is up to you!
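
To illustrate how the combine step depends on what the function returns, here is a minimal sketch with one function returning a scalar and one returning a DataFrame. The construction of `df`, the helper `ratio_of_sums`, the column selection `cols`, and the use of `group_keys=False` are assumptions made for this illustration; the second function is a non-mutating variant of norm_by_data2 that works on a copy of each group.

```python
import pandas as pd

# Assumed reconstruction of the frame shown in the tables above
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [0, 1, 2, 3, 4, 5],
    'data2': [5, 0, 3, 3, 7, 9],
})
cols = ['data1', 'data2']  # operate on the numeric columns only


def ratio_of_sums(group):
    # Scalar per group: results are combined into a Series indexed by key
    return group['data1'].sum() / group['data2'].sum()


def norm_by_data2(group):
    # DataFrame per group: results are concatenated back together
    out = group.copy()
    out['data1'] = out['data1'] / out['data2'].sum()
    return out


ratios = df.groupby('key')[cols].apply(ratio_of_sums)
# group_keys=False keeps the original row labels on the combined result
normed = df.groupby('key', group_keys=False)[cols].apply(norm_by_data2)

print(ratios)
print(normed.sort_index())
```

Working on a copy inside the function avoids modifying the caller's data as a side effect, which is a design choice of this sketch rather than a requirement of apply().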
>
> Specifying the split key
>
> In the simple examples presented before, we split the DataFrame on a
> single column name. This is just one of many options by which the
> groups can be defined, and we'll go through some other options for
> group specification here.
>
> A list, array, series, or index providing the grouping keys
>
> The key can be any series or list with a length matching that of the
> DataFrame. For example:
>
> L = \[0, 1, 0, 1, 2, 0\]  
> display('df', 'df.groupby(L).sum()')

df

|       | key | data1 | data2 |
|-------|-----|-------|-------|
| **0** | A   | 0     | 5     |
| **1** | B   | 1     | 0     |
| **2** | C   | 2     | 3     |
| **3** | A   | 3     | 3     |
| **4** | B   | 4     | 7     |
| **5** | C   | 5     | 9     |

df.groupby(L).sum()

|       | data1 | data2 |
|-------|-------|-------|
| **0** | 7     | 17    |
| **1** | 4     | 3     |
| **2** | 4     | 7     |

> Of course, this means there's another, more verbose way of
> accomplishing the df.groupby('key') from before:
>
> display('df', "df.groupby(df\['key'\]).sum()")


df

|       | key | data1 | data2 |
|-------|-----|-------|-------|
| **0** | A   | 0     | 5     |
| **1** | B   | 1     | 0     |
| **2** | C   | 2     | 3     |
| **3** | A   | 3     | 3     |
| **4** | B   | 4     | 7     |
| **5** | C   | 5     | 9     |

df.groupby(df['key']).sum()

| key   | data1 | data2 |
|-------|-------|-------|
| **A** | 3     | 8     |
| **B** | 5     | 7     |
| **C** | 7     | 12    |

> A dictionary or series mapping index to group
>
> Another method is to provide a dictionary that maps index values to
> the group keys:
>
> df2 = df.set_index('key')  
> mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}  
> display('df2', 'df2.groupby(mapping).sum()')

df2

| key   | data1 | data2 |
|-------|-------|-------|
| **A** | 0     | 5     |
| **B** | 1     | 0     |
| **C** | 2     | 3     |
| **A** | 3     | 3     |
| **B** | 4     | 7     |
| **C** | 5     | 9     |

df2.groupby(mapping).sum()

| key           | data1 | data2 |
|---------------|-------|-------|
| **consonant** | 12    | 19    |
| **vowel**     | 3     | 8     |

> Any Python function

> Similar to mapping, you can pass any Python function that will input
> the index value and output the group:
>
> display('df2', 'df2.groupby(str.lower).mean()')

df2

| key   | data1 | data2 |
|-------|-------|-------|
| **A** | 0     | 5     |
| **B** | 1     | 0     |
| **C** | 2     | 3     |
| **A** | 3     | 3     |
| **B** | 4     | 7     |
| **C** | 5     | 9     |

df2.groupby(str.lower).mean()

| key   | data1 | data2 |
|-------|-------|-------|
| **a** | 1.5   | 4.0   |
| **b** | 2.5   | 3.5   |
| **c** | 3.5   | 6.0   |
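
The three kinds of split key discussed so far (a list, a dictionary, and a function) can be compared side by side in one runnable sketch. The construction of `df` is an assumption rebuilt from the displayed tables. For the list case the numeric columns are selected explicitly, since a recent pandas may otherwise also try to sum the string column.

```python
import pandas as pd

# Assumed reconstruction of the frame shown in the tables above
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data1': [0, 1, 2, 3, 4, 5],
    'data2': [5, 0, 3, 3, 7, 9],
})
df2 = df.set_index('key')

# 1. A list with one group label per row
by_list = df[['data1', 'data2']].groupby([0, 1, 0, 1, 2, 0]).sum()

# 2. A dictionary mapping index values to group labels
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
by_dict = df2.groupby(mapping).sum()

# 3. A function that is called on each index value
by_func = df2.groupby(str.lower).mean()

print(by_list)
print(by_dict)
print(by_func)
```

In cases 2 and 3 the key is derived from the index, which is why the frame is first re-indexed by 'key'.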


> A list of valid keys
>
> Further, any of the preceding key choices can be combined to group on
> a multi-index:
>
> df2.groupby(\[str.lower, mapping\]).mean()

| key   | key           | data1 | data2 |
|-------|---------------|-------|-------|
| **a** | **vowel**     | 1.5   | 4.0   |
| **b** | **consonant** | 2.5   | 3.5   |
| **c** | **consonant** | 3.5   | 6.0   |

> Grouping example

> As an example of this, in a couple of lines of Python code we can put
> all these together and count discovered planets by method and by
> decade:
>
> decade = 10 \* (planets\['year'\] // 10)
>
> decade = decade.astype(str) + 's'
>
> decade.name = 'decade'
>
> planets.groupby(\['method',
> decade\])\['number'\].sum().unstack().fillna(0)

| method \ decade                   | 1980s | 1990s | 2000s | 2010s |
|-----------------------------------|-------|-------|-------|-------|
| **Astrometry**                    | 0.0   | 0.0   | 0.0   | 2.0   |
| **Eclipse Timing Variations**     | 0.0   | 0.0   | 5.0   | 10.0  |
| **Imaging**                       | 0.0   | 0.0   | 29.0  | 21.0  |
| **Microlensing**                  | 0.0   | 0.0   | 12.0  | 15.0  |
| **Orbital Brightness Modulation** | 0.0   | 0.0   | 0.0   | 5.0   |
| **Pulsar Timing**                 | 0.0   | 9.0   | 1.0   | 1.0   |
| **Pulsation Timing Variations**   | 0.0   | 0.0   | 1.0   | 0.0   |
| **Radial Velocity**               | 1.0   | 52.0  | 475.0 | 424.0 |
| **Transit**                       | 0.0   | 0.0   | 64.0  | 712.0 |
| **Transit Timing Variations**     | 0.0   | 0.0   | 0.0   | 9.0   |

> This shows the power of combining many of the operations we've
> discussed up to this point when looking at realistic datasets. We
> immediately gain a coarse understanding of when and how planets have
> been discovered over the past several decades!


> Here I would suggest digging into these few lines of code, and
> evaluating the individual steps to make sure you understand exactly
> what they are doing to the result. It's certainly a somewhat
> complicated example, but understanding these pieces will give you the
> means to similarly explore your own data.
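
One way to evaluate the individual steps is to run them one at a time on a tiny table. The sketch below does that with a hypothetical five-row stand-in for the planets data (the rows are invented for illustration and are not taken from the real dataset); only the three columns the example uses are included.

```python
import pandas as pd

# Hypothetical miniature stand-in for the planets table
planets = pd.DataFrame({
    'method': ['Radial Velocity', 'Radial Velocity', 'Transit', 'Transit', 'Imaging'],
    'year': [1989, 2004, 2008, 2012, 2013],
    'number': [1, 2, 1, 3, 1],
})

# Step 1: floor each year to the start of its decade (1989 -> 1980)
decade = 10 * (planets['year'] // 10)

# Step 2: turn the number into a label such as '1980s' and name the Series
decade = decade.astype(str) + 's'
decade.name = 'decade'

# Step 3: group on a column name and a Series at once, then sum 'number'
counts = planets.groupby(['method', decade])['number'].sum()

# Step 4: move the inner index level (decade) into the columns
table = counts.unstack()

# Step 5: method/decade pairs with no rows come out as NaN; replace with 0
table = table.fillna(0)

print(counts)
print(table)
```

Printing `counts` before unstacking makes it easier to see that the final table is the same information laid out on a two-dimensional grid.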

> Pivot Tables

> We have seen how the GroupBy abstraction lets us explore relationships
> within a dataset. A *pivot table* is a similar operation that is
> commonly seen in spreadsheets and other programs that operate on
> tabular data. The pivot table takes simple column-wise data as input,
> and groups the entries into a two-dimensional table that provides a
> multidimensional summarization of the data. The difference between
> pivot tables and GroupBy can sometimes cause confusion; it helps me to
> think of pivot tables as essentially a *multidimensional* version of
> GroupBy aggregation.
>
> That is, you split-apply-combine, but both the split and the combine
> happen across not a one-dimensional index, but across a
> two-dimensional grid.
>
> Motivating Pivot Tables
>
> For the examples in this section, we'll use the database of passengers
> on the *Titanic*, available through the Seaborn library:
>
> import numpy as np  
> import pandas as pd  
> import seaborn as sns  
> titanic = sns.load_dataset('titanic')
>
> titanic.head()

|       | survived | pclass | sex    | age  | sibsp | parch | fare    | embarked | class | who   | adul… |
|-------|----------|--------|--------|------|-------|-------|---------|----------|-------|-------|-------|
| **0** | 0        | 3      | male   | 22.0 | 1     | 0     | 7.2500  | S        | Third | man   |       |
| **1** | 1        | 1      | female | 38.0 | 1     | 0     | 71.2833 | C        | First | woman |       |
| **2** | 1        | 3      | female | 26.0 | 0     | 0     | 7.9250  | S        | Third | woman |       |
| **3** | 1        | 1      | female | 35.0 | 1     | 0     | 53.1000 | S        | First | woman |       |
| **4** | 0        | 3      | male   | 35.0 | 0     | 0     | 8.0500  | S        | Third | man   |       |

> This contains a wealth of information on each passenger of that
> ill-fated voyage, including gender, age, class, fare paid, and much
> more.


> Pivot Tables by Hand
>
> To start learning more about this data, we might begin by grouping
> according to gender, survival status, or some combination thereof. If
> you have read the previous section, you might be tempted to apply a
> GroupBy operation. For example, let's look at survival rate by gender:
>
> titanic.groupby('sex')\[\['survived'\]\].mean()
>
| sex        | survived |
|------------|----------|
| **female** | 0.742038 |
| **male**   | 0.188908 |

> This immediately gives us some insight: overall, three of every four
> females on board survived, while only one in five males survived!
>
> This is useful, but we might like to go one step deeper and look at
> survival by both sex and, say, class. Using the vocabulary of GroupBy,
> we might proceed using something like this: we *group by* class and
> gender, *select* survival, *apply* a mean aggregate, *combine* the
> resulting groups, and then *unstack* the hierarchical index to reveal
> the hidden multidimensionality. In code:
>
> titanic.groupby(\['sex',
> 'class'\])\['survived'\].aggregate('mean').unstack()

| sex \ class | First    | Second   | Third    |
|-------------|----------|----------|----------|
| **female**  | 0.968085 | 0.921053 | 0.500000 |
| **male**    | 0.368852 | 0.157407 | 0.135447 |

> This gives us a better idea of how both gender and class affected
> survival, but the code is starting to look a bit garbled. While each
> step of this pipeline makes sense in light of the tools we've
> previously discussed, the long string of code is not particularly easy
> to read or use. This two-dimensional GroupBy is common enough that
> Pandas includes a convenience routine, pivot_table, which succinctly
> handles this type of multi-dimensional aggregation.
>
> Pivot Table Syntax
>
> Here is the equivalent to the preceding operation using the
> pivot_table method of DataFrames:
>
> titanic.pivot_table('survived', index='sex', columns='class')


| sex \ class | First    | Second   | Third    |
|-------------|----------|----------|----------|
| **female**  | 0.968085 | 0.921053 | 0.500000 |
| **male**    | 0.368852 | 0.157407 | 0.135447 |

> This is eminently more readable than the groupby approach, and
> produces the same result. As you might expect of an early 20th-century
> transatlantic voyage, the survival gradient favors both women and
> higher classes. First-class women survived with near certainty (hi,
> Rose!), while fewer than one in seven third-class men survived (sorry,
> Jack!).
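
The equivalence between the two approaches can be checked offline on a small table. The sketch below uses a hypothetical six-row passenger frame (invented values, not the real Titanic data) so that it runs without downloading anything; the `aggfunc='sum'` variant is added to show that the mean is only the default reduction.

```python
import pandas as pd

# Hypothetical miniature passenger table, for illustration only
passengers = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 0, 1, 0, 1, 1],
})

# The long form: group, select, aggregate, then unstack
by_hand = passengers.groupby(['sex', 'class'])['survived'].mean().unstack()

# The short form: one pivot_table call (aggfunc defaults to the mean)
pivoted = passengers.pivot_table('survived', index='sex', columns='class')

# Other reductions can be requested by name
counts = passengers.pivot_table('survived', index='sex', columns='class',
                                aggfunc='sum')

print(by_hand)
print(pivoted)
print(counts)
```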
>
> Multi-level pivot tables
>
> Just as in the GroupBy, the grouping in pivot tables can be specified
> with multiple levels, and via a number of options. For example, we
> might be interested in looking at age as a third dimension.
>
> We'll bin the age using the pd.cut function:
>
> age = pd.cut(titanic\['age'\], \[0, 18, 80\])  
> titanic.pivot_table('survived', \['sex', age\], 'class')

<table style="width:100%;">
<thead>
<tr class="header">
<th rowspan="2"><strong>sex</strong></th>
<th rowspan="2"><strong>age</strong></th>
<th colspan="3"><strong>class</strong></th>
</tr>
<tr class="odd">
<th><strong>First</strong></th>
<th><strong>Second</strong></th>
<th><strong>Third</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td rowspan="2"><strong>female</strong></td>
<td><strong>(0, 18]</strong></td>
<td>0.909091</td>
<td>1.000000</td>
<td>0.511628</td>
</tr>
<tr class="even">
<td><strong>(18, 80]</strong></td>
<td>0.972973</td>
<td>0.900000</td>
<td>0.423729</td>
</tr>
<tr class="odd">
<td rowspan="2"><strong>male</strong></td>
<td><strong>(0, 18]</strong></td>
<td>0.800000</td>
<td>0.600000</td>
<td>0.215686</td>
</tr>
<tr class="even">
<td><strong>(18, 80]</strong></td>
<td>0.375000</td>
<td>0.071429</td>
<td>0.133663</td>
</tr>
</tbody>
</table>
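
To see what pd.cut does on its own, here is a minimal sketch with a handful of made-up ages (the values are illustrative, not taken from the dataset). The bin edges are the same ones used above:

```python
import pandas as pd

# Hypothetical ages; 0 is included to show the open left edge of the first bin
ages = pd.Series([0, 5, 18, 19, 42, 80])

# Two right-closed bins: (0, 18] and (18, 80]
age_bins = pd.cut(ages, [0, 18, 80])

# Category codes: 0 for the first bin, 1 for the second, -1 for "no bin" (NaN)
print(age_bins.cat.codes.tolist())
```

Because the intervals are closed on the right, an age of exactly 18 falls in the first bin, while an age of 0 falls outside both bins and becomes a missing value.
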

> We can apply the same strategy when working with the columns as well;
> let's add info on the fare paid using pd.qcut to automatically compute
> quantiles:
>
> fare = pd.qcut(titanic\['fare'\], 2)  
> titanic.pivot_table('survived', \['sex', age\], \[fare, 'class'\])


<table style="width:100%;">
<thead>
<tr class="header">
<th colspan="2"><strong>fare</strong></th>
<th colspan="3"><strong>(-0.001, 14.454]</strong></th>
<th colspan="3"><strong>(14.454, 512.329]</strong></th>
</tr>
<tr class="odd">
<th colspan="2"><strong>class</strong></th>
<th><strong>First</strong></th>
<th><strong>Second</strong></th>
<th><strong>Third</strong></th>
<th><strong>First</strong></th>
<th><strong>Second</strong></th>
<th><strong>Third</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>female</strong></td>
<td><strong>(18, 80]</strong></td>
<td>NaN</td>
<td>0.880000</td>
<td>0.444444</td>
<td>0.972973</td>
<td>0.914286</td>
<td>0.391304</td>
</tr>
<tr class="even">
<td rowspan="2"><strong>male</strong></td>
<td><strong>(0, 18]</strong></td>
<td>NaN</td>
<td>0.000000</td>
<td>0.260870</td>
<td>0.800000</td>
<td>0.818182</td>
<td>0.178571</td>
</tr>
<tr class="odd">
<td><strong>(18, 80]</strong></td>
<td>0.0</td>
<td>0.098039</td>
<td>0.125000</td>
<td>0.391304</td>
<td>0.030303</td>
<td>0.192308</td>
</tr>
</tbody>
</table>

> The result is a four-dimensional aggregation with hierarchical
> indices.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image84.png"
> style="width:0.125in" /> Additional pivot table options
>
> The full call signature of the pivot_table method of DataFrames is as
> follows:

> \# call signature as of Pandas 0.18  
> DataFrame.pivot_table(data, values=None, index=None, columns=None,  
> aggfunc='mean', fill_value=None, margins=False,  
> dropna=True, margins_name='All')

> We've already seen examples of the first three arguments; here we'll
> take a quick look at the remaining ones. Two of the options,
> fill_value and dropna, have to do with missing data and are fairly
> straightforward; we will not show examples of them here.
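
For readers who do want to see one of these options in action, the following is a small sketch of fill_value only, using a hypothetical toy DataFrame in which one sex/class combination has no rows:

```python
import pandas as pd

# Hypothetical toy data: there are no male passengers in Third class
toy = pd.DataFrame({
    "sex": ["female", "female", "male"],
    "class": ["First", "Third", "First"],
    "survived": [1, 0, 1],
})

# Without fill_value, the empty male/Third cell is NaN
with_nan = toy.pivot_table("survived", index="sex", columns="class")

# fill_value replaces missing cells in the resulting table
filled = toy.pivot_table("survived", index="sex", columns="class",
                         fill_value=0)

print(with_nan)
print(filled)
```

Whether 0 is a sensible replacement depends on the aggregation; for a mean survival rate it may be misleading, so treat this as an illustration of the mechanism only.
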

The aggfunc keyword controls what type of aggregation is applied, which
is a mean by default.

> As in the GroupBy, the aggregation specification can be a string
> representing one of several common choices (e.g., 'sum', 'mean',
> 'count', 'min', 'max', etc.) or a function that implements an
> aggregation (e.g., np.sum(), min(), sum(), etc.). Additionally, it
> can be specified as a dictionary mapping a column to any of the above
> desired options:
>
> titanic.pivot_table(index='sex', columns='class',  
> aggfunc={'survived':sum, 'fare':'mean'})

<table style="width:100%;">
<thead>
<tr class="header">
<th></th>
<th colspan="3"><strong>fare</strong></th>
<th colspan="3"><strong>survived</strong></th>
</tr>
<tr class="odd">
<th><strong>class</strong></th>
<th><strong>First</strong></th>
<th><strong>Second</strong></th>
<th><strong>Third</strong></th>
<th><strong>First</strong></th>
<th><strong>Second</strong></th>
<th><strong>Third</strong></th>
</tr>
<tr class="even">
<th><strong>sex</strong></th>
<th colspan="6"></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>female</strong></td>
<td>106.125798</td>
<td>21.970121</td>
<td>16.118810</td>
<td>91</td>
<td>70</td>
<td>72</td>
</tr>
<tr class="even">
<td><strong>male</strong></td>
<td>67.226127</td>
<td>19.741782</td>
<td>12.661633</td>
<td>45</td>
<td>17</td>
<td>47</td>
</tr>
</tbody>
</table>

> Notice also here that we've omitted the values keyword; when
> specifying a mapping for aggfunc , this is determined automatically.
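
The dictionary form can be tried on a small table. The sketch below uses a hypothetical toy DataFrame and passes the aggregations as strings ('sum', 'mean'), which is an assumption of this example rather than the exact call used above:

```python
import pandas as pd

# Hypothetical toy data with one value column per aggregation
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "class": ["First", "First", "Third", "Third"],
    "survived": [1, 1, 0, 1],
    "fare": [100.0, 80.0, 10.0, 20.0],
})

# No values= argument: the dictionary keys select the columns to aggregate
summary = toy.pivot_table(index="sex", columns="class",
                          aggfunc={"survived": "sum", "fare": "mean"})

print(summary)
```

The result has a two-level column index: the outer level is the aggregated column (fare, survived) and the inner level is the class.
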
>
> At times it's useful to compute totals along each grouping. This can
> be done via the margins keyword:
>
> titanic.pivot_table('survived', index='sex', columns='class',
> margins=True)


| **sex**    | **First** | **Second** | **Third** | **All**  |
|------------|-----------|------------|-----------|----------|
| **female** | 0.968085  | 0.921053   | 0.500000  | 0.742038 |
| **male**   | 0.368852  | 0.157407   | 0.135447  | 0.188908 |
| **All**    | 0.629630  | 0.472826   | 0.242363  | 0.383838 |

> Here this automatically gives us information about the class-agnostic
> survival rate by gender, the gender-agnostic survival rate by class,
> and the overall survival rate of 38%. The margin label can be specified
> with the margins_name keyword, which defaults to "All".
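
As a small illustration of margins_name, the sketch below builds a hypothetical four-row table and labels the totals "Total" instead of the default "All":

```python
import pandas as pd

# Hypothetical toy data, one passenger per sex/class combination
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "class": ["First", "Third", "First", "Third"],
    "survived": [1, 1, 1, 0],
})

# margins=True appends a totals row and column; margins_name sets their label
table = toy.pivot_table("survived", index="sex", columns="class",
                        margins=True, margins_name="Total")

print(table)
```

The bottom-right cell is the overall mean of the survived column, here 0.75.
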

> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image125.png"
> style="width:0.125in" /> Example: Birthrate Data

> As a more interesting example, let's take a look at the freely
> available data on births in the United States, provided by the Centers
> for Disease Control (CDC). The data can be downloaded from the URL
> used in the commands below:
>
> \# shell command to download the data:  
> \# !curl -O
> https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
>
> births =
> pd.read_csv('https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/birt
>
> Taking a look at the data, we see that it's relatively simple–it
> contains the number of births grouped by date and gender:
>
> births.head()

<table style="width:100%;">
<colgroup>
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
<col style="width: 14%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>year</strong></th>
<th><strong>month</strong></th>
<th><strong>day</strong></th>
<th><strong>gender</strong></th>
<th><strong>births</strong></th>
<th><blockquote>
<p><img
src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image140.png"
style="width:0.22222in;height:0.22222in" /></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>1969</td>
<td>1</td>
<td>1.0</td>
<td>F</td>
<td>4046</td>
<td rowspan="5"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>1969</td>
<td>1</td>
<td>1.0</td>
<td>M</td>
<td>4440</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>1969</td>
<td>1</td>
<td>2.0</td>
<td>F</td>
<td>4454</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>1969</td>
<td>1</td>
<td>2.0</td>
<td>M</td>
<td>4548</td>
</tr>
<tr class="odd">
<td><strong>4</strong></td>
<td>1969</td>
<td>1</td>
<td>3.0</td>
<td>F</td>
<td>4548</td>
</tr>
</tbody>
</table>

> We can start to understand this data a bit more by using a pivot
> table. Let's add a decade column, and take a look at male and female
> births as a function of decade:


> births\['decade'\] = 10 \* (births\['year'\] // 10)  
> births.pivot_table('births', index='decade', columns='gender',
> aggfunc='sum')

| **decade** | **F**    | **M**    |
|------------|----------|----------|
| **1960**   | 1753634  | 1846572  |
| **1970**   | 16263075 | 17121550 |
| **1980**   | 18310351 | 19243452 |
| **1990**   | 19479454 | 20420553 |
| **2000**   | 18229309 | 19106428 |

> We immediately see that male births outnumber female births in every
> decade. To see this trend a bit more clearly, we can use the built-in
> plotting tools in Pandas to visualize the total number of births by
> year:
>
> %matplotlib inline  
> import matplotlib.pyplot as plt  
> import seaborn as sns  
> sns.set() \# use Seaborn styles  
> births.pivot_table('births', index='year', columns='gender',
> aggfunc='sum').plot()  
> plt.ylabel('total births per year');
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image142.png"
> style="width:4.09583in;height:2.90694in" />
>
> With a simple pivot table and plot() method, we can immediately see
> the annual trend in births by gender. By eye, it appears that over the
> past 50 years male births have outnumbered female births by around 5%.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image143.png"
> style="width:0.125in" /> Further data exploration
>
> Though this doesn't necessarily relate to the pivot table, there are a
> few more interesting features we can pull out of this dataset using
> the Pandas tools covered up to this point. We must start by cleaning
> the data a bit, removing outliers caused by mistyped dates (e.g., June
> 31st) or missing values (e.g., June 99th). One easy way to remove
> these all at once is to cut outliers; we'll do this via a robust
> sigma-clipping operation:
>
> quartiles = np.percentile(births\['births'\], \[25, 50, 75\])  
> mu = quartiles\[1\]  
> sig = 0.74 \* (quartiles\[2\] - quartiles\[0\])
>
> This final line is a robust estimate of the sample standard deviation,
> where the 0.74 comes from the interquartile range of a Gaussian
> distribution (you can learn more about sigma-clipping operations in a
> book I coauthored with Željko Ivezić, Andrew J. Connolly, and
> Alexander Gray). With these values, we can use the query() method to
> filter out rows with births outside the clipping range:
>
> births = births.query('(births \> @mu - 5 \* @sig) & (births \< @mu +
> 5 \* @sig)')
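
The clipping steps above can be tried on synthetic data. This is a sketch under stated assumptions: the values are randomly generated (Gaussian with a true spread of 2.0, plus three deliberately absurd outliers), and the DataFrame name df is hypothetical. The factor 0.74 is approximately 1/1.349, since the interquartile range of a Gaussian is about 1.349 standard deviations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: Gaussian counts plus a few absurd outliers
values = np.concatenate([
    rng.normal(loc=100.0, scale=2.0, size=10_000),
    [99999.0, -5000.0, 12345.0],
])
df = pd.DataFrame({"births": values})

# Median and interquartile range are barely affected by the outliers
quartiles = np.percentile(df["births"], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])

# Keep only rows within 5 robust standard deviations of the median
clipped = df.query("(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)")

print(round(sig, 2), len(df) - len(clipped))
```

The robust estimate stays close to the true spread even though the plain standard deviation of the same column would be inflated by the outliers.
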
>
> Next we set the day column to integers; previously it had been a
> string because some columns in the dataset contained the value
> 'null':
>
> \# set 'day' column to integer; it originally was a string due to nulls  
> births\['day'\] = births\['day'\].astype(int)
>
> \# create a datetime index from the year, month, day  
> births.index = pd.to_datetime(10000 \* births.year +  
> 100 \* births.month +  
> births.day, format='%Y%m%d')
>
> births\['dayofweek'\] = births.index.dayofweek
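
The date construction above packs year, month, and day into a single integer such as 19690101 and parses it with the '%Y%m%d' format. A minimal sketch with three hypothetical rows follows; casting the integer to a string before parsing is an assumption made here for robustness and is not part of the original code:

```python
import pandas as pd

# Hypothetical rows with the same column names as the births data
parts = pd.DataFrame({
    "year": [1969, 1969, 1988],
    "month": [1, 1, 12],
    "day": [1, 2, 31],
})

# 10000*year + 100*month + day gives an integer like 19690101
stamp = 10000 * parts.year + 100 * parts.month + parts.day

# Assumption: cast to str first so the format string is applied unambiguously
dates = pd.to_datetime(stamp.astype(str), format="%Y%m%d")

parts.index = pd.DatetimeIndex(dates)
parts["dayofweek"] = parts.index.dayofweek  # Monday=0 ... Sunday=6

print(parts)
```
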
>
> Using this we can plot births by weekday for several decades:
>
> import matplotlib.pyplot as plt  
> import matplotlib as mpl
>
> births.pivot_table('births', index='dayofweek',  
> columns='decade', aggfunc='mean').plot()  
> plt.gca().set_xticklabels(\['Mon', 'Tues', 'Wed', 'Thurs', 'Fri',
> 'Sat', 'Sun'\])  
> plt.ylabel('mean births by day');


> \<ipython-input-274-750f1844e3eb\>:6: UserWarning: FixedFormatter
> should only be used t plt.gca().set_xticklabels(\['Mon', 'Tues',
> 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'\])
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image144.png"
> style="width:4.2in;height:2.79306in" />
>
> Apparently births are slightly less common on weekends than on
> weekdays! Note that the 1990s and 2000s are missing because the CDC
> data contains only the month of birth starting in 1989.
>
> Another interesting view is to plot the mean number of births by the
> day of the *year*. Let's first group the data by month and day
> separately:
>
> births_by_date = births.pivot_table('births',  
> \[births.index.month, births.index.day\])  
> births_by_date.head()

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th></th>
<th><strong>births</strong></th>
<th><blockquote>
<p><img
src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image145.png"
style="width:0.23611in;height:0.22222in" /></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td rowspan="5"><strong>1</strong></td>
<td><strong>1</strong></td>
<td>4009.225</td>
<td rowspan="5"></td>
</tr>
<tr class="even">
<td><strong>2</strong></td>
<td>4247.400</td>
</tr>
<tr class="odd">
<td><strong>3</strong></td>
<td>4500.900</td>
</tr>
<tr class="even">
<td><strong>4</strong></td>
<td>4571.350</td>
</tr>
<tr class="odd">
<td><strong>5</strong></td>
<td>4603.625</td>
</tr>
</tbody>
</table>

> The result is a multi-index over months and days. To make this easily
> plottable, let's turn these months and days into a date by associating
> them with a dummy year variable (making sure to choose a leap year so
> February 29th is correctly handled!)
>
> from datetime import datetime  
> births_by_date.index = \[datetime(2012, month, day)  
> for (month, day) in births_by_date.index\]  
> births_by_date.head()

|                | **births** |
|----------------|------------|
| **2012-01-01** | 4009.225   |
| **2012-01-02** | 4247.400   |
| **2012-01-03** | 4500.900   |
| **2012-01-04** | 4571.350   |
| **2012-01-05** | 4603.625   |

> Focusing on the month and day only, we now have a time series
> reflecting the average number of births by date of the year. From
> this, we can use the plot method to plot the data. It reveals some
> interesting trends:
>
> \# Plot the results  
> fig, ax = plt.subplots(figsize=(12, 4))  
> births_by_date.plot(ax=ax);
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image148.png"
> style="width:6.7625in;height:2.45972in" />
>
> In particular, the striking feature of this graph is the dip in
> birthrate on US holidays (e.g., Independence Day, Labor Day,
> Thanksgiving, Christmas, New Year's Day), although this likely reflects
> trends in scheduled/induced births rather than some deep psychosomatic
> effect on natural births. Matplotlib's tools can be used to annotate
> this plot.
>
> Looking at this short example, you can see that many of the Python and
> Pandas tools we've seen to this point can be combined and used to gain
> insight from a variety of datasets. We will see some more
> sophisticated applications of these data manipulations in future
> sections!

> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image74.png"
> style="width:0.125in" /> Vectorized String Operations

> One strength of Python is its relative ease in handling and
> manipulating string data. Pandas builds on this and provides a
> comprehensive set of *vectorized string operations* that become an
> essential piece of the type of munging required when working with
> (read: cleaning up) real-world data. In this section, we'll walk
> through some of the Pandas string operations, and then take a look at
> using them to partially clean up a very messy dataset of recipes
> collected from the Internet.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image31.png"
> style="width:0.125in" /> Introducing Pandas String Operations
>
> We saw in previous sections how tools like NumPy and Pandas generalize
> arithmetic operations so that we can easily and quickly perform the
> same operation on many array elements. For example:
>
> import numpy as np  
> x = np.array(\[2, 3, 5, 7, 11, 13\])  
> x \* 2
>
> array(\[ 4, 6, 10, 14, 22, 26\])
>
> This *vectorization* of operations simplifies the syntax of operating
> on arrays of data: we no longer have to worry about the size or shape
> of the array, but just about what operation we want done. For arrays
> of strings, NumPy does not provide such simple access, and thus you're
> stuck using a more verbose loop syntax:
>
> data = \['peter', 'Paul', 'MARY', 'gUIDO'\]  
> \[s.capitalize() for s in data\]
>
> \['Peter', 'Paul', 'Mary', 'Guido'\]
>
> This is perhaps sufficient to work with some data, but it will break if
> there are any missing values. For example:
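
The failing case is not shown in this copy of the notebook. A plausible reconstruction, assuming the list was redefined with a None in the third position (which matches the Series printed further down), is sketched here:

```python
# Assumption: the same names as before, with one missing value inserted
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

try:
    capitalized = [s.capitalize() for s in data]
    failed = False
except AttributeError as err:
    # None has no capitalize() method, so the comprehension stops here
    failed = True
    print("list comprehension failed:", err)
```

The loop aborts at the first missing entry, so none of the remaining names are processed.
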
>
> Pandas includes features to address both this need for vectorized
> string operations and for correctly handling missing data via the str
> attribute of Pandas Series and Index objects containing strings. So,
> for example, suppose we create a Pandas Series with this data:
>
> import pandas as pd  
> names = pd.Series(data)  
> names
>
> 0 peter  
> 1 Paul  
> 2 None  
> 3 MARY  
> 4 gUIDO  
> dtype: object


> We can now call a single method that will capitalize all the entries,
> while skipping over any missing values:
>
> names.str.capitalize()
>
> 0 Peter  
> 1 Paul  
> 2 None  
> 3 Mary  
> 4 Guido  
> dtype: object
>
> Using tab completion on this str attribute will list all the
> vectorized string methods available to Pandas.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image149.png"
> style="width:0.125in" /> Tables of Pandas String Methods
>
> If you have a good understanding of string manipulation in Python,
> most of Pandas string syntax is intuitive enough that it's probably
> sufficient to just list a table of available methods; we will start
> with that here, before diving deeper into a few of the subtleties. The
> examples in this section use the following series of names:
>
> monte = pd.Series(\['Graham Chapman', 'John Cleese', 'Terry Gilliam',
> 'Eric Idle', 'Terry Jones', 'Michael Palin'\])
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image6.png"
> style="width:0.125in" /> Methods similar to Python string methods
>
> Nearly all Python's built-in string methods are mirrored by a Pandas
> vectorized string method. Here is a list of Pandas str methods that
> mirror Python string methods:

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th><blockquote>
<p>len()</p>
</blockquote></th>
<th><blockquote>
<p>lower()</p>
</blockquote></th>
<th>translate()</th>
<th><blockquote>
<p>islower()</p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>ljust()</td>
<td><blockquote>
<p>upper()</p>
</blockquote></td>
<td>startswith()</td>
<td><blockquote>
<p>isupper()</p>
</blockquote></td>
</tr>
<tr class="even">
<td>rjust()</td>
<td><blockquote>
<p>find()</p>
</blockquote></td>
<td><blockquote>
<p>endswith()</p>
</blockquote></td>
<td>isnumeric()</td>
</tr>
<tr class="odd">
<td>center()</td>
<td><blockquote>
<p>rfind()</p>
</blockquote></td>
<td><blockquote>
<p>isalnum()</p>
</blockquote></td>
<td>isdecimal()</td>
</tr>
<tr class="even">
<td>zfill()</td>
<td><blockquote>
<p>index()</p>
</blockquote></td>
<td><blockquote>
<p>isalpha()</p>
</blockquote></td>
<td><blockquote>
<p>split()</p>
</blockquote></td>
</tr>
<tr class="odd">
<td>strip()</td>
<td><blockquote>
<p>rindex()</p>
</blockquote></td>
<td><blockquote>
<p>isdigit()</p>
</blockquote></td>
<td><blockquote>
<p>rsplit()</p>
</blockquote></td>
</tr>
<tr class="even">
<td>rstrip()</td>
<td>capitalize()</td>
<td><blockquote>
<p>isspace()</p>
</blockquote></td>
<td>partition()</td>
</tr>
<tr class="odd">
<td>lstrip()</td>
<td><blockquote>
<p>swapcase()</p>
</blockquote></td>
<td><blockquote>
<p>istitle()</p>
</blockquote></td>
<td>rpartition()</td>
</tr>
</tbody>
</table>

> Notice that these have various return values. Some, like lower() ,
> return a series of strings:
>
> monte.str.lower()


> 0 graham chapman  
> 1 john cleese  
> 2 terry gilliam  
> 3 eric idle  
> 4 terry jones  
> 5 michael palin  
> dtype: object
>
> But some others return numbers:
>
> monte.str.len()
>
> 0 14  
> 1 11  
> 2 13  
> 3 9  
> 4 11  
> 5 13  
> dtype: int64
>
> Or Boolean values:
>
> monte.str.startswith('T')
>
> 0 False  
> 1 False  
> 2 True  
> 3 False  
> 4 True  
> 5 False  
> dtype: bool
>
> Still others return lists or other compound values for each element:
>
> monte.str.split()
>
> 0 \[Graham, Chapman\]  
> 1 \[John, Cleese\]  
> 2 \[Terry, Gilliam\]  
> 3 \[Eric, Idle\]  
> 4 \[Terry, Jones\]  
> 5 \[Michael, Palin\]  
> dtype: object
>
> We'll see further manipulations of this kind of series-of-lists object
> as we continue our discussion.
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image14.png"
> style="width:0.125in" /> Methods using regular expressions

> In addition, there are several methods that accept regular expressions
> to examine the content of each string element, and follow some of the
> API conventions of Python's built-in re module:

| **Method** | **Description**                                                        |
|------------|------------------------------------------------------------------------|
| match()    | Call re.match() on each element, returning a boolean.                  |
| extract()  | Call re.search() on each element, returning matched groups as strings. |
| findall()  | Call re.findall() on each element                                      |
| replace()  | Replace occurrences of pattern with some other string                  |
| contains() | Call re.search() on each element, returning a boolean                  |
| count()    | Count occurrences of pattern                                           |
| split()    | Equivalent to str.split(), but accepts regexps                         |
| rsplit()   | Equivalent to str.rsplit(), but accepts regexps                        |

> With these, you can do a wide range of interesting operations. For
> example, we can extract the first name from each by asking for a
> contiguous group of characters at the beginning of each element:
>
> monte.str.extract('(\[A-Za-z\]+)', expand=False)
>
> 0 Graham  
> 1 John  
> 2 Terry  
> 3 Eric  
> 4 Terry  
> 5 Michael  
> dtype: object
>
> Or we can do something more complicated, like finding all names that
> start and end with a consonant, making use of the start-of-string (^)
> and end-of-string (\$) regular expression characters:
>
> monte.str.findall(r'^\[^AEIOU\].\*\[^aeiou\]\$')
>
> 0 \[Graham Chapman\]  
> 1 \[\]  
> 2 \[Terry Gilliam\]  
> 3 \[\]  
> 4 \[Terry Jones\]  
> 5 \[Michael Palin\]  
> dtype: object
>
> The ability to concisely apply regular expressions across Series or
> DataFrame entries opens up many possibilities for analysis and
> cleaning of data.
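
A few of the other regular-expression methods from the table can be sketched on the same series of names. The patterns below are illustrative choices, not taken from the original notebook:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): boolean mask for names that start with "T"
starts_with_t = monte.str.contains(r'^T')

# count(): number of lowercase vowels in each name
vowel_counts = monte.str.count(r'[aeiou]')

# replace(): collapse each run of whitespace into an underscore
underscored = monte.str.replace(r'\s+', '_', regex=True)

print(starts_with_t.tolist())
print(vowel_counts.tolist())
print(underscored.tolist())
```

Passing regex=True to replace() makes the intent explicit, since the default treatment of the pattern has differed between Pandas versions.
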
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image110.png"
> style="width:0.125in" /> Miscellaneous methods


> Finally, there are some miscellaneous methods that enable other
> convenient operations:

| **Method**      | **Description**                                                   |
|-----------------|-------------------------------------------------------------------|
| get()           | Index each element                                                |
| slice()         | Slice each element                                                |
| slice_replace() | Replace slice in each element with passed value                   |
| cat()           | Concatenate strings                                               |
| repeat()        | Repeat values                                                     |
| normalize()     | Return Unicode form of string                                     |
| pad()           | Add whitespace to left, right, or both sides of strings           |
| wrap()          | Split long strings into lines with length less than a given width |
| join()          | Join strings in each element of the Series with passed separator  |
| get_dummies()   | Extract dummy variables as a DataFrame                            |

> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image150.png"
> style="width:0.125in" /> Vectorized item access and slicing

> The get() and slice() operations, in particular, enable vectorized
> element access from each array. For example, we can get a slice of the
> first three characters of each array using str.slice(0, 3). Note that
> this behavior is also available through Python's normal indexing
> syntax; for example, df.str.slice(0, 3) is equivalent to
> df.str\[0:3\]:
>
> monte.str\[0:3\]
>
> 0 Gra  
> 1 Joh  
> 2 Ter  
> 3 Eri  
> 4 Ter  
> 5 Mic  
> dtype: object
>
> Indexing via df.str.get(i) and df.str\[i\] is likewise similar.
>
> These get() and slice() methods also let you access elements of arrays
> returned by split(). For example, to extract the last name of each
> entry, we can combine split() and get():
>
> monte.str.split().str.get(-1)
>
> 0 Chapman  
> 1 Cleese  
> 2 Gilliam  
> 3 Idle  
> 4 Jones  
> 5 Palin  
> dtype: object
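
Several of the miscellaneous methods chain together in the same way. The sketch below is illustrative only; the variable names (initials, hyphenated, padded) are hypothetical and do not appear in the original notebook:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

words = monte.str.split()

# get() and [] indexing combined: first letter of first and last name
initials = words.str.get(0).str[0] + words.str.get(-1).str[0]

# join(): glue the list in each element back together with a separator
hyphenated = words.str.join('-')

# slice() then pad(): first three characters, left-padded to width 5
padded = monte.str.slice(0, 3).str.pad(5, side='left', fillchar='.')

print(initials.tolist())
print(hyphenated.tolist())
print(padded.tolist())
```
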


> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image19.png"
> style="width:0.125in" /> Indicator variables
>
> Another method that requires a bit of extra explanation is the
> get_dummies() method. This is useful when your data has a column
> containing some sort of coded indicator. For example, we might have a
> dataset that contains information in the form of codes, such as
> A="born in America," B="born in the United Kingdom," C="likes cheese,"
> D="likes spam":
>
> full_monte = pd.DataFrame({'name': monte,  
> 'info': \['B\|C\|D', 'B\|D', 'A\|C', 'B\|D', 'B\|C', 'B\|C\|D'\]})
> full_monte

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>name</strong></th>
<th><strong>info</strong></th>
<th><blockquote>
<p><img
src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image151.png"
style="width:0.22222in;height:0.23611in" /></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>Graham Chapman</td>
<td>B|C|D</td>
<td rowspan="6"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>John Cleese</td>
<td>B|D</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>Terry Gilliam</td>
<td>A|C</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>Eric Idle</td>
<td>B|D</td>
</tr>
<tr class="odd">
<td><strong>4</strong></td>
<td>Terry Jones</td>
<td>B|C</td>
</tr>
<tr class="even">
<td><strong>5</strong></td>
<td>Michael Palin</td>
<td>B|C|D</td>
</tr>
</tbody>
</table>

> The get_dummies() routine lets you quickly split-out these indicator
> variables into a DataFrame :
>
> full_monte\['info'\].str.get_dummies('\|')

<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
<th><strong>D</strong></th>
<th><blockquote>
<p><img
src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image152.png"
style="width:0.22222in;height:0.22222in" /></p>
</blockquote></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td rowspan="6"></td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr class="odd">
<td><strong>4</strong></td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr class="even">
<td><strong>5</strong></td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

> With these operations as building blocks, you can construct an endless
> range of string processing procedures when cleaning your data.

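As one possible next step, the indicator columns can be joined back to the names and used as a filter. The sketch below rebuilds `full_monte` with the names written out inline (an assumption, so that the block runs on its own); the choice of filtering on codes C and D is purely illustrative.

```python
import pandas as pd

full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
             'Eric Idle', 'Terry Jones', 'Michael Palin'],
    'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'],
})

# One 0/1 column per code found in the '|'-separated strings.
flags = full_monte['info'].str.get_dummies('|')

# Attach the indicators to the names, then keep rows with both C and D set.
tagged = full_monte[['name']].join(flags)
cheese_and_spam = tagged[(tagged['C'] == 1) & (tagged['D'] == 1)]

print(cheese_and_spam['name'].tolist())
```

Because the indicators are plain integer columns, they can also be summed (for example `flags.sum()`) to count how often each code occurs.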
> Example: Recipe Database
>
> These vectorized string operations become most useful in the process
> of cleaning up messy, real-world data. Here I'll walk through an
> example of that, using an open recipe database compiled from various
> sources on the Web. Our goal will be to parse the recipe data into
> ingredient lists, so we can quickly find a recipe based on some
> ingredients we have on hand.
>
> The scripts used to compile this can be found at , and the link to the
> current version of the database is found there as well.
>
> As of Spring 2016, this database is about 30 MB, and can be downloaded
> and unzipped with these commands:
>
> \# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz  
> \# !gunzip recipeitems-latest.json.gz
>
> The database is in JSON format, so we will try pd.read_json to read
> it:
>
> try:  
>     recipes = pd.read_json('http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz')  
> except ValueError as e:  
>     print("ValueError:", e)
>
> ValueError: Expected object or value
>
> Oops! We get a ValueError. When the downloaded file is read directly,
> the error mentions "trailing data." Searching for the text of this
> error on the Internet, it seems that it's due to using a file in which
> *each line* is itself valid JSON, but the full file is not. Let's
> check if this interpretation is true:
>
> Yes, apparently each line is a valid JSON, so we'll need to string
> them together. One way we can do this is to actually construct a
> string representation containing all these JSON entries, and then load
> the whole thing with pd.read_json :
>
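The two loading strategies can be sketched on a tiny in-memory stand-in for the recipe file. The two records below are hypothetical toy data, not rows from the real database. Option 1 follows the approach described above (build one JSON array string); option 2 uses the `lines=True` argument of `pd.read_json`, which is an alternative not used in the text.

```python
import io
import pandas as pd

# Toy stand-in for the recipe file: one JSON object per line (hypothetical data).
raw = io.StringIO(
    '{"name": "Toast", "ingredients": "bread\\nbutter"}\n'
    '{"name": "Tea", "ingredients": "water\\ntea leaves"}\n'
)

# Option 1: wrap the stripped lines in [ ... ] so the whole text is one JSON array.
lines = (line.strip() for line in raw)
as_array = "[{0}]".format(",".join(lines))
recipes_a = pd.read_json(io.StringIO(as_array))

# Option 2: let pandas treat each line as a separate record.
raw.seek(0)
recipes_b = pd.read_json(raw, lines=True)

print(recipes_a.shape, recipes_b.shape)
```

For a file on disk, the same idea would apply with `open(...)` in place of the `StringIO` object.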
> We see there are nearly 200,000 recipes, and 17 columns. Let's take a
> look at one row to see what we have:
>
> There is a lot of information there, but much of it is in a very messy
> form, as is typical of data scraped from the Web. In particular, the
> ingredient list is in string format; we're going to have to carefully
> extract the information we're interested in. Let's start by taking a
> closer look at the ingredients:

> The ingredient lists average 250 characters long, with a minimum of 0
> and a maximum of nearly 10,000 characters!
>
> Just out of curiosity, let's see which recipe has the longest
> ingredient list:
>
> That certainly looks like an involved recipe.
>
> We can do other aggregate explorations; for example, let's see how
> many of the recipes are for breakfast food:
>
> Or how many of the recipes list cinnamon as an ingredient:
>
> We could even look to see whether any recipes misspell the ingredient
> as "cinamon":
>
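Since the cells for these explorations are not shown, here is a minimal sketch of how they might look on a three-row toy table. The data and the column names `name`, `description`, and `ingredients` are assumptions made for illustration, not the real recipe database.

```python
import pandas as pd

# Hypothetical miniature of the recipe table.
recipes = pd.DataFrame({
    'name': ['Cinnamon toast', 'Herb omelette', 'Spiced cider'],
    'description': ['A quick breakfast treat', 'Savory breakfast classic',
                    'Warm winter drink'],
    'ingredients': ['bread\nbutter\ncinnamon\nsugar',
                    'eggs\nparsley\nsalt',
                    'apple cider\nCinamon stick'],
})

# Length of each ingredient string, and the recipe with the longest one.
lengths = recipes.ingredients.str.len()
longest = recipes.name[lengths.idxmax()]

# Count rows matching a regular expression; True counts as 1 when summed.
n_breakfast = recipes.description.str.contains('[Bb]reakfast').sum()
n_cinnamon = recipes.ingredients.str.contains('[Cc]innamon').sum()
n_misspelled = recipes.ingredients.str.contains('[Cc]inamon').sum()

print(longest, n_breakfast, n_cinnamon, n_misspelled)
```

On the full dataset, `lengths.describe()` would give the summary statistics (mean, minimum, maximum) quoted above.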
> This is the type of essential data exploration that is possible with
> Pandas string tools. It is data munging like this that Python really
> excels at.
>
> A simple recipe recommender
>
> Let's go a bit further, and start working on a simple recipe
> recommendation system: given a list of ingredients, find a recipe that
> uses all those ingredients. While conceptually straightforward, the
> task is complicated by the heterogeneity of the data: there is no easy
> operation, for example, to extract a clean list of ingredients from
> each row. So we will cheat a bit: we'll start with a list of common
> ingredients, and simply search to see whether they are in each
> recipe's ingredient list.
>
> For simplicity, let's just stick with herbs and spices for the time
> being:
>
> spice_list = \['salt', 'pepper', 'oregano', 'sage', 'parsley',
> 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin'\]
>
> We can then build a Boolean DataFrame consisting of True and False
> values, indicating whether this ingredient appears in the list:
>
> Now, as an example, let's say we'd like to find a recipe that uses
> parsley, paprika, and tarragon.
>
> We find only 10 recipes with this combination; let's use the index
> returned by this selection to discover the names of the recipes that
> have this combination:

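The recommender steps above (build a Boolean table, select rows, look up names) can be sketched on toy data. The three recipes below are hypothetical; only `spice_list` is taken from the text. The case-insensitive match via `re.IGNORECASE` is one reasonable choice, not necessarily the one used in the original cells.

```python
import re
import pandas as pd

# Hypothetical miniature recipe table.
recipes = pd.DataFrame({
    'name': ['Herbed chicken', 'Plain rice', 'Paprika potatoes'],
    'ingredients': ['chicken\nParsley\npaprika\ntarragon\nsalt',
                    'rice\nsalt',
                    'potatoes\npaprika\nparsley'],
})
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

# One Boolean column per spice: does the ingredient string mention it?
spice_df = pd.DataFrame({
    spice: recipes.ingredients.str.contains(spice, flags=re.IGNORECASE)
    for spice in spice_list
})

# Keep only rows where all three requested spices are present.
selection = spice_df.query('parsley & paprika & tarragon')

# Use the surviving index labels to look up the recipe names.
print(recipes.name[selection.index].tolist())
```

Note that this is plain substring matching, so a spice such as "sage" would also match inside a word like "sausage"; a more careful version might match on word boundaries.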
> Now that we have narrowed down our recipe selection by a factor of
> almost 20,000, we are in a position to make a more informed decision
> about what we'd like to cook for dinner.
>
> Going further with recipes
>
> Hopefully this example has given you a bit of a flavor (ba-dum!) for
> the types of data cleaning operations that are efficiently enabled by
> Pandas string methods. Of course, building a very robust recipe
> recommendation system would require a *lot* more work! Extracting full
> ingredient lists from each recipe would be an important piece of the
> task; unfortunately, the wide variety of formats used makes this a
> relatively time-consuming process. This points to the truism that in
> data science, cleaning and munging of real-world data often comprises
> the majority of the work, and Pandas provides the tools that can help
> you do this efficiently.
>
> Working with Time Series
>
> Pandas was developed in the context of financial modeling, so as you
> might expect, it contains a fairly extensive set of tools for working
> with dates, times, and time-indexed data. Date and time data comes in
> a few flavors, which we will discuss here:
>
> - *Time stamps* reference particular moments in time (e.g., July 4th,
>   2015 at 7:00am).
> - *Time intervals* and *periods* reference a length of time between a
>   particular beginning and end point; for example, the year 2015.
>   Periods usually reference a special case of time intervals in which
>   each interval is of uniform length and does not overlap (e.g.,
>   24-hour-long periods comprising days).
> - *Time deltas* or *durations* reference an exact length of time (e.g.,
>   a duration of 22.56 seconds).
>
> In this section, we will introduce how to work with each of these
> types of date/time data in Pandas. This short section is by no means a
> complete guide to the time series tools available in Python or Pandas,
> but instead is intended as a broad overview of how you as a user
> should approach working with time series. We will start with a brief
> discussion of tools for dealing with dates and times in Python, before
> moving more specifically to a discussion of the tools provided by
> Pandas. After listing some resources that go into more depth, we will
> review some short examples of working with time series data in Pandas.
>
> Dates and Times in Python
>
> The Python world has a number of available representations of dates,
> times, deltas, and timespans. While the time series tools provided by
> Pandas tend to be the most useful for data science applications, it is
> helpful to see their relationship to other packages used in Python.

> Native Python dates and times: datetime and dateutil
>
> Python's basic objects for working with dates and times reside in the
> built-in datetime module. Along with the third-party dateutil module,
> you can use it to quickly perform a host of useful functionalities on
> dates and times. For example, you can manually build a date using the
> datetime type:
>
> from datetime import datetime  
> datetime(year=2015, month=7, day=4)
>
> datetime.datetime(2015, 7, 4, 0, 0)
>
> Or, using the dateutil module, you can parse dates from a variety of
> string formats:
>
> from dateutil import parser  
> date = parser.parse("4th of July, 2015")  
> date
>
> datetime.datetime(2015, 7, 4, 0, 0)
>
> Once you have a datetime object, you can do things like printing the
> day of the week:
>
> date.strftime('%A')
>
> 'Saturday'
>
> In the final line, we've used one of the standard string format codes
> for printing dates ( "%A" ), which you can read about in the of
> Python's .
>
> Documentation of other useful date utilities can be found in . A
> related package to be aware of is , which contains tools for working
> with the most migraine-inducing piece of time series data: time zones.
>
> The power of datetime and dateutil lies in their flexibility and easy
> syntax: you can use these objects and their built-in methods to easily
> perform nearly any operation you might be interested in. Where they
> break down is when you wish to work with large arrays of dates and
> times: just as lists of Python numerical variables are suboptimal
> compared to NumPy-style typed numerical arrays, lists of Python
> datetime objects are suboptimal compared to typed arrays of encoded
> dates.
>
> Typed arrays of times: NumPy's datetime64
>
> The weaknesses of Python's datetime format inspired the NumPy team to
> add a set of native time series data types to NumPy. The datetime64
> dtype encodes dates as 64-bit integers, and thus allows arrays of
> dates to be represented very compactly. The datetime64 requires a
> very specific input format:
>
> import numpy as np  
> date = np.array('2015-07-04', dtype=np.datetime64)  
> date
>
> array('2015-07-04', dtype='datetime64\[D\]')
>
> Once we have this date formatted, however, we can quickly do
> vectorized operations on it:
>
> date + np.arange(12)
>
> array(\['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',
> '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11', '2015-07-12',
> '2015-07-13', '2015-07-14', '2015-07-15'\], dtype='datetime64\[D\]')
>
> Because of the uniform type in NumPy datetime64 arrays, this type of
> operation can be accomplished much more quickly than if we were
> working directly with Python's datetime objects.
>
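To make the contrast concrete, the sketch below builds the same twelve consecutive dates twice: once as a Python list of `datetime` objects and once as a single `datetime64[D]` array. It only shows that the two approaches agree; no timing claim is attached, since speed depends on array size and environment.

```python
from datetime import datetime, timedelta
import numpy as np

start = datetime(2015, 7, 4)

# Pure-Python approach: one datetime object per element, built in a loop.
py_dates = [start + timedelta(days=i) for i in range(12)]

# Typed-array approach: a single datetime64[D] array and one vectorized addition.
np_dates = np.datetime64('2015-07-04') + np.arange(12)

print(py_dates[-1].date(), np_dates[-1])
```

In a notebook, wrapping each construction in `%timeit` with a much larger range is a simple way to compare the two on your own machine.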
> One detail of the datetime64 and timedelta64 objects is that they are
> built on a *fundamental time unit*. Because the datetime64 object is
> limited to 64-bit precision, the range of encodable times is $2^{64}$ times
> this fundamental unit. In other words, datetime64 imposes a trade-off
> between *time resolution* and *maximum time span*.
>
> For example, if you want a time resolution of one nanosecond, you only
> have enough information to encode a range of $2^{64}$ nanoseconds, or just
> under 600 years. NumPy will infer the desired unit from the input; for
> example, here is a day-based datetime:
>
> np.datetime64('2015-07-04')
>
> numpy.datetime64('2015-07-04')
>
> Here is a minute-based datetime:
>
> np.datetime64('2015-07-04 12:00')
>
> numpy.datetime64('2015-07-04T12:00')
>
> Note that in current NumPy releases datetime64 values are
> timezone-naive: no time zone is attached to the result. You can force
> any desired fundamental unit using one of many format codes; for
> example, here we'll force a nanosecond-based time:

> np.datetime64('2015-07-04 12:59:59.50', 'ns')
>
> numpy.datetime64('2015-07-04T12:59:59.500000000')
>
> The following table, drawn from the , lists the available format codes
> along with the relative and absolute timespans that they can encode:

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th><strong>Code</strong></th>
<th><strong>Meaning</strong></th>
<th><strong>Time span (relative)</strong></th>
<th><strong>Time span (absolute)</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><blockquote>
<p>Y</p>
</blockquote></td>
<td><blockquote>
<p>Year</p>
</blockquote></td>
<td><blockquote>
<p>± 9.2e18 years</p>
</blockquote></td>
<td><blockquote>
<p>[9.2e18 BC, 9.2e18 AD]</p>
</blockquote></td>
</tr>
<tr class="even">
<td><blockquote>
<p>M</p>
</blockquote></td>
<td><blockquote>
<p>Month</p>
</blockquote></td>
<td><blockquote>
<p>± 7.6e17 years</p>
</blockquote></td>
<td><blockquote>
<p>[7.6e17 BC, 7.6e17 AD]</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><blockquote>
<p>W</p>
</blockquote></td>
<td><blockquote>
<p>Week</p>
</blockquote></td>
<td><blockquote>
<p>± 1.7e17 years</p>
</blockquote></td>
<td><blockquote>
<p>[1.7e17 BC, 1.7e17 AD]</p>
</blockquote></td>
</tr>
<tr class="even">
<td><blockquote>
<p>D</p>
</blockquote></td>
<td><blockquote>
<p>Day</p>
</blockquote></td>
<td><blockquote>
<p>± 2.5e16 years</p>
</blockquote></td>
<td><blockquote>
<p>[2.5e16 BC, 2.5e16 AD]</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><blockquote>
<p>h</p>
</blockquote></td>
<td><blockquote>
<p>Hour</p>
</blockquote></td>
<td><blockquote>
<p>± 1.0e15 years</p>
</blockquote></td>
<td><blockquote>
<p>[1.0e15 BC, 1.0e15 AD]</p>
</blockquote></td>
</tr>
<tr class="even">
<td><blockquote>
<p>m</p>
</blockquote></td>
<td><blockquote>
<p>Minute</p>
</blockquote></td>
<td><blockquote>
<p>± 1.7e13 years</p>
</blockquote></td>
<td><blockquote>
<p>[1.7e13 BC, 1.7e13 AD]</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><blockquote>
<p>s</p>
</blockquote></td>
<td><blockquote>
<p>Second</p>
</blockquote></td>
<td><blockquote>
<p>± 2.9e11 years</p>
</blockquote></td>
<td><blockquote>
<p>[2.9e11 BC, 2.9e11 AD]</p>
</blockquote></td>
</tr>
<tr class="even">
<td><blockquote>
<p>ms</p>
</blockquote></td>
<td><blockquote>
<p>Millisecond</p>
</blockquote></td>
<td><blockquote>
<p>± 2.9e8 years</p>
</blockquote></td>
<td><blockquote>
<p>[2.9e8 BC, 2.9e8 AD]</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><blockquote>
<p>us</p>
</blockquote></td>
<td>Microsecond</td>
<td><blockquote>
<p>± 2.9e5 years</p>
</blockquote></td>
<td>[290301 BC, 294241 AD]</td>
</tr>
<tr class="even">
<td><blockquote>
<p>ns</p>
</blockquote></td>
<td>Nanosecond</td>
<td><blockquote>
<p>± 292 years</p>
</blockquote></td>
<td><blockquote>
<p>[ 1678 AD, 2262 AD]</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><blockquote>
<p>ps</p>
</blockquote></td>
<td><blockquote>
<p>Picosecond</p>
</blockquote></td>
<td><blockquote>
<p>± 106 days</p>
</blockquote></td>
<td><blockquote>
<p>[ 1969 AD, 1970 AD]</p>
</blockquote></td>
</tr>
<tr class="even">
<td><blockquote>
<p>fs</p>
</blockquote></td>
<td>Femtosecond</td>
<td><blockquote>
<p>± 2.6 hours</p>
</blockquote></td>
<td><blockquote>
<p>[ 1969 AD, 1970 AD]</p>
</blockquote></td>
</tr>
<tr class="odd">
<td><blockquote>
<p>as</p>
</blockquote></td>
<td><blockquote>
<p>Attosecond</p>
</blockquote></td>
<td><blockquote>
<p>± 9.2 seconds</p>
</blockquote></td>
<td><blockquote>
<p>[ 1969 AD, 1970 AD]</p>
</blockquote></td>
</tr>
</tbody>
</table>

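The relative time spans in the table follow from simple arithmetic: a signed 64-bit integer can count roughly $2^{63}$ units on either side of the epoch. The sketch below reproduces that calculation for a few codes. The mean Gregorian year length used for the conversion is an assumption of this sketch, so the results are approximate.

```python
# Mean Gregorian year in seconds (an assumption used only for this estimate).
SECONDS_PER_YEAR = 365.2425 * 24 * 3600

# Seconds per unit for a few datetime64 codes.
unit_seconds = {'s': 1.0, 'ms': 1e-3, 'us': 1e-6, 'ns': 1e-9}

# A signed 64-bit integer counts up to 2**63 units on either side of the epoch.
half_range = 2 ** 63

spans = {}
for code, seconds in unit_seconds.items():
    spans[code] = half_range * seconds / SECONDS_PER_YEAR
    print(f"{code}: +/- {spans[code]:.3g} years")
```

Each step of 1000 in resolution costs a factor of 1000 in span, which is the resolution versus range trade-off described above.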
> For the types of data we see in the real world, a useful default is
> datetime64\[ns\] , as it can encode a useful range of modern dates
> with a suitably fine precision.
>
> Finally, we will note that while the datetime64 data type addresses
> some of the deficiencies of the built-in Python datetime type, it lacks
> many of the convenient methods and functions provided by datetime and
> especially dateutil.
>
> Dates and times in pandas: best of both worlds
>
> Pandas builds upon all the tools just discussed to provide a Timestamp
> object, which combines the ease-of-use of datetime and dateutil with
> the efficient storage and vectorized interface of numpy.datetime64 .
> From a group of these Timestamp objects, Pandas can construct a
> DatetimeIndex that can be used to index data in a Series or DataFrame
> ; we'll see many examples of this below.
>
> For example, we can use Pandas tools to repeat the demonstration from
> above. We can parse a flexibly formatted string date, and use format
> codes to output the day of the week:
>
> import pandas as pd  
> date = pd.to_datetime("4th of July, 2015")

> date
>
> Timestamp('2015-07-04 00:00:00')
>
> date.strftime('%A')
>
> 'Saturday'
>
> Additionally, we can do NumPy-style vectorized operations directly on
> this same object:
>
> date + pd.to_timedelta(np.arange(12), 'D')
>
> DatetimeIndex(\['2015-07-04', '2015-07-05', '2015-07-06',
> '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
> '2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'\],
> dtype='datetime64\[ns\]', freq=None)
>
> In the next section, we will take a closer look at manipulating time
> series data with the tools provided by Pandas.
>
> Pandas Time Series: Indexing by Time
>
> Where the Pandas time series tools really become useful is when you
> begin to *index data by timestamps*. For example, we can construct a
> Series object that has time-indexed data:
>
> index = pd.DatetimeIndex(\['2014-07-04', '2014-08-04',
> '2015-07-04', '2015-08-04'\])  
> data = pd.Series(\[0, 1, 2, 3\], index=index)  
> data
>
> 2014-07-04 0  
> 2014-08-04 1  
> 2015-07-04 2  
> 2015-08-04 3  
> dtype: int64
>
> Now that we have this data in a Series , we can make use of any of the
> Series indexing patterns we discussed in previous sections, passing
> values that can be coerced into dates:
>
> data\['2014-07-04':'2015-07-04'\]
>
> 2014-07-04 0  
> 2014-08-04 1  
> 2015-07-04 2  
> dtype: int64

> There are additional special date-only indexing operations, such as
> passing a year to obtain a slice of all data from that year:
>
> data\['2015'\]
>
> 2015-07-04 2  
> 2015-08-04 3  
> dtype: int64
>
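The same date-based selections can also be written with the explicit label indexer `.loc`, which some readers may find clearer about intent. The sketch below rebuilds the `data` Series from above; the month-level slice is an extra illustration not shown in the text.

```python
import pandas as pd

index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)

# Every entry dated in 2015.
in_2015 = data.loc['2015']

# Month-level slice; both endpoints are inclusive.
summer_2014 = data.loc['2014-07':'2014-08']

print(in_2015.tolist(), summer_2014.tolist())
```

Unlike ordinary Python slicing, label-based slices on a date index include the end label, so all of August 2014 is covered here.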
> Later, we will see additional examples of the convenience of
> dates-as-indices. But first, a closer look at the available time series
> data structures.
>
> Pandas Time Series Data Structures
>
> This section will introduce the fundamental Pandas data structures for
> working with time series data:
>
> - For *time stamps*, Pandas provides the Timestamp type. As mentioned
>   before, it is essentially a replacement for Python's native datetime ,
>   but is based on the more efficient numpy.datetime64 data type. The
>   associated Index structure is DatetimeIndex .
> - For *time periods*, Pandas provides the Period type. This encodes a
>   fixed-frequency interval based on numpy.datetime64 . The associated
>   index structure is PeriodIndex .
> - For *time deltas* or *durations*, Pandas provides the Timedelta type.
>   Timedelta is a more efficient replacement for Python's native
>   datetime.timedelta type, and is based on numpy.timedelta64 . The
>   associated index structure is TimedeltaIndex .
>
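A minimal sketch of the three scalar types side by side may help fix the distinction. The specific values (a 7:00am time stamp, the month of July 2015, a 22.56-second duration) are chosen here to echo the earlier examples and are otherwise arbitrary.

```python
import pandas as pd

stamp = pd.Timestamp('2015-07-04 07:00')   # a single moment in time
period = pd.Period('2015-07', freq='M')    # the whole month of July 2015
delta = pd.Timedelta(seconds=22.56)        # an exact duration

# A Period knows where it begins and ends.
print(period.start_time, period.end_time)

# Adding a Timedelta to a Timestamp gives another Timestamp.
later = stamp + delta
print(later)

# The time stamp falls inside the period.
inside = period.start_time <= stamp <= period.end_time
print(inside)
```

The corresponding index types (`DatetimeIndex`, `PeriodIndex`, `TimedeltaIndex`) hold many such values at once.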
> The most fundamental of these date/time objects are the Timestamp and
> DatetimeIndex objects. While these class objects can be invoked
> directly, it is more common to use the pd.to_datetime() function,
> which can parse a wide variety of formats. Passing a single date to
> pd.to_datetime() yields a Timestamp ; passing a series of dates by
> default yields a  
> DatetimeIndex :
>
> dates = pd.to_datetime(\[datetime(2015, 7, 3), '4th of July, 2015',
> '2015-Jul-6', '07-07-2015', '20150708'\])  
> dates
>
> DatetimeIndex(\['2015-07-03', '2015-07-04', '2015-07-06',
> '2015-07-07', '2015-07-08'\],  
> dtype='datetime64\[ns\]', freq=None)
>
> Any DatetimeIndex can be converted to a PeriodIndex with the
> to_period() function with the addition of a frequency code; here we'll
> use 'D' to indicate daily frequency:

> dates.to_period('D')
>
> PeriodIndex(\['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
>
> '2015-07-08'\],
>
> dtype='period\[D\]')
>
> A TimedeltaIndex is created, for example, when a date is subtracted
> from another:
>
> dates - dates\[0\]
>
> TimedeltaIndex(\['0 days', '1 days', '3 days', '4 days', '5 days'\],
>
> dtype='timedelta64\[ns\]', freq=None)
>
> Regular sequences: pd.date_range()
>
> To make the creation of regular date sequences more convenient, Pandas
> offers a few functions for this purpose: pd.date_range() for
> timestamps, pd.period_range() for periods, and pd.timedelta_range()
> for time deltas. We've seen that Python's range() and NumPy's
> np.arange() turn a startpoint, endpoint, and optional stepsize into a
> sequence. Similarly, pd.date_range() accepts a start date, an end
> date, and an optional frequency code to create a regular sequence of
> dates. By default, the frequency is one day:
>
> pd.date_range('2015-07-03', '2015-07-10')
>
> DatetimeIndex(\['2015-07-03', '2015-07-04', '2015-07-05',
> '2015-07-06',
>
> '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'\],
>
> dtype='datetime64\[ns\]', freq='D')
>
> Alternatively, the date range can be specified not with a start and
> endpoint, but with a startpoint and a number of periods:
>
> pd.date_range('2015-07-03', periods=8)
>
> DatetimeIndex(\['2015-07-03', '2015-07-04', '2015-07-05',
> '2015-07-06',
>
> '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10'\],
>
> dtype='datetime64\[ns\]', freq='D')
>
> The spacing can be modified by altering the freq argument, which
> defaults to D . For example, here we will construct a range of hourly
> timestamps:
>
> pd.date_range('2015-07-03', periods=8, freq='H')
>
> DatetimeIndex(\['2015-07-03 00:00:00', '2015-07-03 01:00:00',
>
> '2015-07-03 02:00:00', '2015-07-03 03:00:00',
>
> '2015-07-03 04:00:00', '2015-07-03 05:00:00',
>
> '2015-07-03 06:00:00', '2015-07-03 07:00:00'\],
>
> dtype='datetime64\[ns\]', freq='H')

> To create regular sequences of Period or Timedelta values, the very
> similar  
> pd.period_range() and pd.timedelta_range() functions are useful. Here
> are some monthly periods:
>
> pd.period_range('2015-07', periods=8, freq='M')
>
> PeriodIndex(\['2015-07', '2015-08', '2015-09', '2015-10', '2015-11',
> '2015-12', '2016-01', '2016-02'\],  
> dtype='period\[M\]')
>
> And a sequence of durations increasing by an hour:
>
> pd.timedelta_range(0, periods=10, freq='H')
>
> TimedeltaIndex(\['0 days 00:00:00', '0 days 01:00:00', '0 days
> 02:00:00', '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00', '0
> days 06:00:00', '0 days 07:00:00', '0 days 08:00:00', '0 days
> 09:00:00'\],  
> dtype='timedelta64\[ns\]', freq='H')
>
> All of these require an understanding of Pandas frequency codes, which
> we'll summarize in the next section.
>
> Frequencies and Offsets
>
> Fundamental to these Pandas time series tools is the concept of a
> frequency or date offset. Just as we saw the D (day) and H (hour)
> codes above, we can use such codes to specify any desired frequency
> spacing. The following table summarizes the main codes available:

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th><strong>Code</strong></th>
<th><strong>Description</strong></th>
<th><strong>Code</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>D</td>
<td>Calendar day</td>
<td>B</td>
<td>Business day</td>
</tr>
<tr class="even">
<td>W</td>
<td colspan="3">Weekly</td>
</tr>
<tr class="odd">
<td>M</td>
<td>Month end</td>
<td>BM</td>
<td>Business month end</td>
</tr>
<tr class="even">
<td>Q</td>
<td>Quarter end</td>
<td>BQ</td>
<td>Business quarter end</td>
</tr>
<tr class="odd">
<td>A</td>
<td>Year end</td>
<td>BA</td>
<td>Business year end</td>
</tr>
<tr class="even">
<td>H</td>
<td>Hours</td>
<td>BH</td>
<td>Business hours</td>
</tr>
<tr class="odd">
<td>T</td>
<td colspan="3">Minutes</td>
</tr>
<tr class="even">
<td>S</td>
<td colspan="3">Seconds</td>
</tr>
<tr class="odd">
<td>L</td>
<td colspan="3">Milliseconds</td>
</tr>
<tr class="even">
<td>U</td>
<td colspan="3">Microseconds</td>
</tr>
<tr class="odd">
<td>N</td>
<td colspan="3">Nanoseconds</td>
</tr>
</tbody>
</table>

> The monthly, quarterly, and annual frequencies are all marked at the
> end of the specified period. By adding an S suffix to any of these, they
> instead will be marked at the beginning:

<table>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="header">
<th><strong>Code</strong></th>
<th><strong>Description</strong></th>
<th><strong>Code</strong></th>
<th><strong>Description</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><blockquote>
<p>MS</p>
</blockquote></td>
<td><blockquote>
<p>Month start</p>
</blockquote></td>
<td>BMS</td>
<td>Business month start</td>
</tr>
<tr class="even">
<td><blockquote>
<p>QS</p>
</blockquote></td>
<td>Quarter start</td>
<td>BQS</td>
<td>Business quarter start</td>
</tr>
<tr class="odd">
<td><blockquote>
<p>AS</p>
</blockquote></td>
<td><blockquote>
<p>Year start</p>
</blockquote></td>
<td>BAS</td>
<td><blockquote>
<p>Business year start</p>
</blockquote></td>
</tr>
</tbody>
</table>

> Additionally, you can change the month used to mark any quarterly or
> annual code by adding a three-letter month code as a suffix:
>
> - Q-JAN , BQ-FEB , QS-MAR , BQS-APR , etc.
> - A-JAN , BA-FEB , AS-MAR , BAS-APR , etc.
>
> In the same way, the split-point of the weekly frequency can be
> modified by adding a three-letter weekday code:
>
> - W-SUN , W-MON , W-TUE , W-WED , etc.
>
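A short sketch can show what the start-anchored and suffixed codes do in practice. The start dates below are arbitrary choices for illustration; each call snaps forward to the first date that satisfies the requested anchor.

```python
import pandas as pd

# Month starts: the first of each month on or after the start date.
month_starts = pd.date_range('2015-01-15', periods=3, freq='MS')

# Quarter starts with the quarterly cycle anchored on March.
quarter_starts = pd.date_range('2015-01-01', periods=3, freq='QS-MAR')

# Weekly frequency anchored on Mondays.
mondays = pd.date_range('2015-07-01', periods=3, freq='W-MON')

print(month_starts)
print(quarter_starts)
print(mondays)
```

Checking attributes such as `.day`, `.month`, or `.dayofweek` on the result is a quick way to confirm that a code does what you expect.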
> On top of this, codes can be combined with numbers to specify other
> frequencies. For example, for a frequency of 2 hours 30 minutes, we
> can combine the hour ( H ) and minute ( T ) codes as follows:
>
> pd.timedelta_range(0, periods=9, freq="2H30T")
>
> TimedeltaIndex(\['0 days 00:00:00', '0 days 02:30:00', '0 days
> 05:00:00', '0 days 07:30:00', '0 days 10:00:00', '0 days 12:30:00', '0
> days 15:00:00', '0 days 17:30:00', '0 days 20:00:00'\],
> dtype='timedelta64\[ns\]', freq='150T')
>
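The alias strings accepted for hours and minutes have changed spelling across pandas releases (newer versions prefer lowercase forms such as `h` and `min`). One way to sidestep that, sketched below, is to pass a `Timedelta` object as the frequency instead of a string. This is an alternative to the code above, not a replacement for it.

```python
import pandas as pd

# A 2 hour 30 minute step expressed as an object rather than an alias string.
step = pd.Timedelta(hours=2, minutes=30)

steps = pd.timedelta_range(start='0 days', periods=9, freq=step)
print(steps)
```

The resulting index should match the nine values shown above, from 0 days 00:00:00 to 0 days 20:00:00.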
> All of these short codes refer to specific instances of Pandas time
> series offsets, which can be found in the pd.tseries.offsets module.
> For example, we can create a business day offset directly as follows:
>
> from pandas.tseries.offsets import BDay  
> pd.date_range('2015-07-01', periods=5, freq=BDay())
>
> DatetimeIndex(\['2015-07-01', '2015-07-02', '2015-07-03',
> '2015-07-06', '2015-07-07'\],  
> dtype='datetime64\[ns\]', freq='B')
>
> For more discussion of the use of frequencies and offsets, see the
> Pandas documentation.
>
> Resampling, Shifting, and Windowing


> The ability to use dates and times as indices to intuitively organize
> and access data is an important piece of the Pandas time series tools.
> The benefits of indexed data in general (automatic alignment during
> operations, intuitive data slicing and access, etc.) still apply, and
> Pandas provides several additional time series-specific operations.
>
> We will take a look at a few of those here, using some stock price
> data as an example. Because Pandas was developed largely in a finance
> context, it includes some very specific tools for financial data. For
> example, the accompanying pandas-datareader package (installable via
> conda install pandas-datareader) knows how to import financial data
> from a number of available sources, including Yahoo Finance, Google
> Finance, and others. Here we will load Google's closing price history:
>
> We can visualize this using the plot() method, after the normal
> Matplotlib setup boilerplate:
>
> %matplotlib inline  
> import matplotlib.pyplot as plt  
> import seaborn; seaborn.set()
>
> Resampling and converting frequencies
>
> One common need for time series data is resampling at a higher or
> lower frequency. This can be done using the resample() method, or the
> much simpler asfreq() method. The primary difference between the two
> is that resample() is fundamentally a *data aggregation*, while
> asfreq() is fundamentally a *data selection*.
>
> Taking a look at the Google closing price, let's compare what the two
> return when we down-sample the data. Here we will resample the data at
> the end of business year:
>
> Notice the difference: at each point, resample reports the *average of
> the previous year*, while asfreq reports the *value at the end of the
> year*.
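
The contrast between aggregation and selection can be sketched on synthetic data. The series below is a made-up stand-in for a closing-price history, not real market data, and the offset object pd.offsets.BYearEnd() is used here in place of a string code so that the example does not depend on which alias a given Pandas version accepts.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a closing-price series (not real market data):
# one value per business day, increasing by one each day.
days = pd.bdate_range("2015-01-01", "2017-12-29")
prices = pd.Series(np.arange(len(days), dtype=float), index=days)

year_end = pd.offsets.BYearEnd()  # offset object for "end of business year"

yearly_mean = prices.resample(year_end).mean()  # aggregation over each year
yearly_last = prices.asfreq(year_end)           # selection of the year-end value

print(pd.DataFrame({"resample": yearly_mean, "asfreq": yearly_last}))
```

In the printed table, the resample column holds each year's average while the asfreq column holds the single value recorded on the last business day of that year.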
>
> For up-sampling, resample() and asfreq() are largely equivalent,
> though resample has many more options available. In this case, the
> default for both methods is to leave the up-sampled points empty, that
> is, filled with NA values. Just as with the fillna() method
> discussed previously, asfreq() accepts a method argument to specify
> how values are imputed. Here, we will resample the business day data
> at a daily frequency (i.e., including weekends):
>
> The top panel is the default: non-business days are left as NA values
> and do not appear on the plot. The bottom panel shows the differences
> between two strategies for filling the gaps: forward-filling and
> backward-filling.
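
The three up-sampling behaviours can be compared side by side on a tiny example. The values below are invented for illustration; only the asfreq() calls mirror what the text describes.

```python
import numpy as np
import pandas as pd

# Ten business days (Mon 2023-01-02 through Fri 2023-01-13) of made-up values.
bdays = pd.bdate_range("2023-01-02", periods=10)
series = pd.Series(np.arange(10, dtype=float), index=bdays)

daily = series.asfreq("D")                        # weekend rows become NaN
daily_ffill = series.asfreq("D", method="ffill")  # weekend takes Friday's value
daily_bfill = series.asfreq("D", method="bfill")  # weekend takes Monday's value

print(pd.DataFrame({"default": daily, "ffill": daily_ffill, "bfill": daily_bfill}))
```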


> Time-shifts
>
> Another common time series-specific operation is shifting data in
> time. Pandas has two closely related methods for this: shift() and
> tshift(). In short, the difference between them is that shift()
> *shifts the data*, while tshift() *shifts the index*. In both cases,
> the shift is specified in multiples of the frequency. (tshift() was
> deprecated in pandas 1.1 and removed in 2.0; shift() with a freq
> argument gives the same index shift.)
>
> Here we will apply both shift() and tshift() with a shift of 900 days:
>
> We see here that shift(900) shifts the *data* by 900 days, pushing
> some of it off the end of the graph (and leaving NA values at the
> other end), while tshift(900) shifts the *index values* by 900 days.
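
A five-row example makes the difference easy to see. This is a sketch on invented values; it uses shift() with a freq argument for the index shift, since tshift() is no longer available in recent Pandas releases.

```python
import pandas as pd

idx = pd.date_range("2023-01-01", periods=5, freq="D")
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=idx)

shifted_data = s.shift(2)             # same index, values moved two rows later
shifted_index = s.shift(2, freq="D")  # same values, index moved two days later

print(pd.DataFrame({"original": s, "shift(2)": shifted_data}))
print(shifted_index)
```

The first result keeps the original dates and gains NA values at the start; the second keeps every value and simply relabels the dates.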
>
> A common context for this type of shift is in computing differences
> over time. For example, we use shifted values to compute the one-year
> return on investment for Google stock over the course of the dataset:
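
One way such a return could be computed is sketched below. The price series is synthetic (it simply rises by one unit per day), so the resulting numbers say nothing about any real stock; the point is only the shift-then-divide pattern.

```python
import numpy as np
import pandas as pd

# Made-up price that rises by one unit per calendar day (not real stock data).
days = pd.date_range("2020-01-01", periods=731, freq="D")
price = pd.Series(100.0 + np.arange(731), index=days)

# Moving the index back 365 days lines up each date with the price one year later.
price_next_year = price.shift(-365, freq="D")

# Percentage return for buying on each date and holding for 365 days.
# Dates without a matching value on both sides come out as NaN and are dropped.
roi = (100 * (price_next_year / price - 1)).dropna()

print(roi.head())
```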
>
> This helps us to see the overall trend in Google stock: thus far, the
> most profitable times to invest in Google have been (unsurprisingly, in
> retrospect) shortly after its IPO, and in the middle of the 2009
> recession.
>
> Rolling windows
>
> Rolling statistics are a third type of time series-specific operation
> implemented by Pandas.
>
> These can be accomplished via the rolling() method of Series and
> DataFrame objects.
>
> For example, here is the one-year centered rolling mean and standard
> deviation of the Google stock prices:
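
The same pattern can be tried on a short synthetic series. The window length of 5 is chosen only to keep the output readable; a one-year window on real daily data would follow the same calls.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2023-01-01", periods=10, freq="D")
values = pd.Series(np.arange(10, dtype=float), index=days)

window = values.rolling(5, center=True)  # 5-row window centered on each date
summary = pd.DataFrame({"input": values,
                        "rolling_mean": window.mean(),
                        "rolling_std": window.std()})
print(summary)
```

The first and last two rows are NA because a centered window of five values does not fit there.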
>
> As with group-by operations, the aggregate() and apply() methods can
> be used for custom rolling computations.
>
> Where to Learn More
>
> This section has provided only a brief summary of some of the most
> essential features of the time series tools in Pandas.
>
> Another excellent resource is the textbook by Wes McKinney (O'Reilly,
> 2012). Although it is now a few years old, it is an invaluable
> resource on the use of Pandas. In particular, this book emphasizes
> time series tools in the context of business and finance, and focuses
> much more on particular details of business calendars, time zones,
> and related topics.
>
> As always, you can also use the IPython help functionality to explore
> and try further options available to the functions and methods
> discussed here. I find this often is the best way to learn a new Python
> tool.
>
> Example: Visualizing Seattle Bicycle Counts
>
> As a more involved example of working with some time series data,
> let's take a look at bicycle counts on Seattle's Fremont Bridge. This
> data comes from an automated bicycle counter, installed in late 2012,
> which has inductive sensors on the east and west sidewalks of the
> bridge.
>
> As of summer 2016, the CSV can be downloaded as follows:
>
> \# !curl -o FremontBridge.csv
> https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessT
>
> Once this dataset is downloaded, we can use Pandas to read the CSV
> output into a DataFrame . We will specify that we want the Date as an
> index, and we want these dates to be automatically parsed:
>
> data =
> pd.read_csv('https://data.seattle.gov/api/views/65db-xm6k/rows.csv',
> index_col='Dat data.head()

<table>
<thead>
<tr class="header">
<th><strong>Date</strong></th>
<th><strong>Fremont Bridge Total</strong></th>
<th><strong>Fremont Bridge East Sidewalk</strong></th>
<th><strong>Fremont Bridge West Sidewalk</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>2012-10-03 00:00:00</strong></td>
<td>13.0</td>
<td>4.0</td>
<td>9.0</td>
</tr>
<tr class="even">
<td><strong>2012-10-03 01:00:00</strong></td>
<td>10.0</td>
<td>4.0</td>
<td>6.0</td>
</tr>
<tr class="odd">
<td><strong>2012-10-03 02:00:00</strong></td>
<td>2.0</td>
<td>1.0</td>
<td>1.0</td>
</tr>
</tbody>
</table>

> For convenience, we'll further process this dataset by shortening the
> column names and adding a "Total" column:
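
The loading and renaming steps can be sketched without network access. In the example below, the in-memory CSV is a hypothetical stand-in that reuses the three rows displayed in the table above; the short column names and the recomputed total are assumptions made for illustration, not necessarily what the notebook did.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV reusing the three rows shown in the table above.
csv_text = """Date,Fremont Bridge Total,Fremont Bridge East Sidewalk,Fremont Bridge West Sidewalk
2012-10-03 00:00:00,13.0,4.0,9.0
2012-10-03 01:00:00,10.0,4.0,6.0
2012-10-03 02:00:00,2.0,1.0,1.0
"""

# Use the Date column as the index and parse it into timestamps.
data = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

data.columns = ["Total", "East", "West"]     # assumed short names
data["Total"] = data["East"] + data["West"]  # recompute the total from both sidewalks
print(data)
```

Replacing io.StringIO(csv_text) with a file path or URL gives the same call pattern on the full dataset.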
>
> Now let's take a look at the summary statistics for this data:
>
> data.dropna().describe()


<table>
<thead>
<tr class="header">
<th></th>
<th><strong>Fremont Bridge Total</strong></th>
<th><strong>Fremont Bridge East Sidewalk</strong></th>
<th><strong>Fremont Bridge West Sidewalk</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>count</strong></td>
<td>90538.000000</td>
<td>90538.000000</td>
<td>90538.000000</td>
</tr>
<tr class="even">
<td><strong>mean</strong></td>
<td>105.941837</td>
<td>47.374892</td>
<td>58.566944</td>
</tr>
<tr class="odd">
<td><strong>std</strong></td>
<td>133.581904</td>
<td>60.933511</td>
<td>82.815485</td>
</tr>
<tr class="even">
<td><strong>min</strong></td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr class="odd">
<td><strong>25%</strong></td>
<td>13.000000</td>
<td>6.000000</td>
<td>7.000000</td>
</tr>
<tr class="even">
<td><strong>50%</strong></td>
<td>59.000000</td>
<td>27.000000</td>
<td>30.000000</td>
</tr>
<tr class="odd">
<td><strong>75%</strong></td>
<td>142.000000</td>
<td>65.000000</td>
<td>75.000000</td>
</tr>
<tr class="even">
<td><strong>max</strong></td>
<td>1097.000000</td>
<td>698.000000</td>
<td>850.000000</td>
</tr>
</tbody>
</table>

> Visualizing the data

We can gain some insight into the dataset by visualizing it. Let's start
by plotting the raw data:

> %matplotlib inline  
> import seaborn; seaborn.set()
>
> data.plot()  
> plt.ylabel('Hourly Bicycle Count');
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image166.png"
> style="width:4.31389in;height:2.79306in" />
>
> The roughly 90,000 hourly samples are far too dense for us to make
> much sense of. We can gain more insight by resampling the data to a
> coarser grid. Let's resample by week:
>
> weekly = data.resample('W').sum()  
> weekly.plot(style=\[':', '--', '-'\])  
> plt.ylabel('Weekly bicycle count');
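
To see what the weekly resample does to hourly rows, here is a sketch on synthetic counts (one rider per hour for two weeks). The data is invented; the resample('W').sum() call is the same one used above.

```python
import numpy as np
import pandas as pd

# Two full weeks of hourly timestamps starting Monday 2023-01-02,
# with a made-up count of one rider per hour.
hours = pd.date_range("2023-01-02", periods=14 * 24, freq=pd.Timedelta(hours=1))
counts = pd.Series(np.ones(len(hours)), index=hours)

# 'W' bins end on Sunday by default; each label is the closing Sunday.
weekly = counts.resample("W").sum()
print(weekly)
```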


<img
src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image168.png"
style="width:4.3875in;height:2.79299in" />


> This shows us some interesting seasonal trends: as you might expect,
> people bicycle more in the summer than in the winter, and even within
> a particular season the bicycle use varies from week to week (likely
> dependent on weather).
>
> Another handy way to aggregate the data is a rolling window, using
> the rolling() method (the older pd.rolling_mean() function has been
> removed from Pandas). Here we'll compute a 30-day rolling aggregate
> of our daily data, making sure to center the window. Note that the
> code calls sum(), so it plots a rolling total; swap in mean() for a
> true rolling mean:
>
> daily = data.resample('D').sum()  
> daily.rolling(30, center=True).sum().plot(style=\[':', '--', '-'\])  
> plt.ylabel('mean hourly count');
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image167.png"
> style="width:4.45972in;height:2.79305in" />
>
> The jaggedness of the result is due to the hard cutoff of the window.
> We can get a smoother version of a rolling mean using a window
> function, for example a Gaussian window. The following code specifies
> both the width of the window (we chose 50 days) and the width of the
> Gaussian within the window (we chose 10 days):
>
> daily.rolling(50, center=True,  
> win_type='gaussian').sum(std=10).plot(style=\[':', '--', '-'\]);


> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image169.png"
> style="width:4.28333in;height:2.79167in" />
>
> Digging into the data
>
> While these smoothed data views are useful to get an idea of the
> general trend in the data, they hide much of the interesting
> structure. For example, we might want to look at the average traffic as
> a function of the time of day. We can do this using the GroupBy
> functionality:
>
> by_time = data.groupby(data.index.time).mean()  
> hourly_ticks = 4 \* 60 \* 60 \* np.arange(6)  
> by_time.plot(xticks=hourly_ticks, style=\[':', '--', '-'\]);
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image170.png"
> style="width:3.95in;height:2.79305in" />
>
> The hourly traffic is a strongly bimodal distribution, with peaks around
> 8:00 in the morning and 5:00 in the evening. This is likely evidence
> of a strong component of commuter traffic crossing the bridge. This is
> further evidenced by the differences between the western sidewalk
> (generally used going toward downtown Seattle), which peaks more
> strongly in the morning, and the eastern sidewalk (generally used
> going away from downtown Seattle), which peaks more strongly in the
> evening.


> We also might be curious about how things change based on the day of
> the week. Again, we can do this with a simple groupby:
>
> by_weekday = data.groupby(data.index.dayofweek).mean()  
> by_weekday.index = \['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'\]  
> by_weekday.plot(style=\[':', '--', '-'\]);
>
> <img
> src="attachment:vertopal_8387cf7aac2f428eac22e76f17a5a0c9/media/image171.png"
> style="width:3.95in;height:2.61528in" />
>
> This shows a strong distinction between weekday and weekend totals,
> with around twice as many average riders crossing the bridge on Monday
> through Friday as on Saturday and Sunday.
>
> With this in mind, let's do a compound GroupBy and look at the hourly
> trend on weekdays versus weekends. We'll start by grouping by both a
> flag marking the weekend, and the time of day:
>
> weekend = np.where(data.index.weekday \< 5, 'Weekday', 'Weekend')  
> by_time = data.groupby(\[weekend, data.index.time\]).mean()
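
The shape of the compound result is easier to see on synthetic data. The sketch below invents two weeks of hourly counts (10 per hour on weekdays, 4 per hour on weekends) and applies the same two-key grouping; none of the numbers come from the bridge dataset.

```python
import numpy as np
import pandas as pd

# Two weeks of made-up hourly counts: 10 per hour on weekdays, 4 on weekends.
hours = pd.date_range("2023-01-02", periods=14 * 24, freq=pd.Timedelta(hours=1))
is_weekday = hours.weekday < 5
counts = pd.Series(np.where(is_weekday, 10.0, 4.0), index=hours)

# One label per row, then group on (label, time of day).
weekend = np.where(is_weekday, "Weekday", "Weekend")
by_time = counts.groupby([weekend, counts.index.time]).mean()

# The result has a two-level index; selecting the first level
# returns a series indexed by time of day.
print(by_time.loc["Weekday"].head(3))
print(by_time.loc["Weekend"].head(3))
```

Each of the two slices has 24 entries, one per hour, which is what makes a side-by-side comparison of the two profiles possible.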
>
> Now we'll use some of the Matplotlib tools to plot two panels side by
> side:
>
> The result is very interesting: we see a bimodal commute pattern
> during the work week, and a unimodal recreational pattern during the
> weekends. It would be interesting to dig through this data in more
> detail, and examine the effect of weather, temperature, time of year,
> and other factors on people's commuting patterns.

> High-Performance Pandas: eval() and query()

> As we've already seen in previous sections, the power of the PyData
> stack is built upon the ability of NumPy and Pandas to push basic
> operations into C via an intuitive syntax: examples are
> vectorized/broadcasted operations in NumPy, and grouping-type
> operations in Pandas. While these abstractions are efficient and
> effective for many common use cases, they often rely on the creation
> of temporary intermediate objects, which can cause undue overhead in
> computational time and memory use.
>
> As of version 0.13 (released January 2014), Pandas includes some
> experimental tools that allow you to directly access C-speed
> operations without costly allocation of intermediate arrays.
>
> These are the eval() and query() functions, which rely on the Numexpr
> package. In this notebook we will walk through their use and give some
> rules-of-thumb about when you might think about using them.
>
> Motivating query() and eval(): Compound Expressions
>
> We've seen previously that NumPy and Pandas support fast vectorized
> operations; for example, when adding the elements of two arrays:
>
> import numpy as np  
> rng = np.random.RandomState(42)  
> x = rng.rand(1000000)  
> y = rng.rand(1000000)  
> %timeit x + y
>
> 592 µs ± 66.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops
> each)
>
> As discussed previously, this is much faster than doing the addition
> via a Python loop or comprehension:
>
> %timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype,
> count=len(x))
>
> 167 ms ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> But this abstraction can become less efficient when computing compound
> expressions. For example, consider the following expression:
>
> mask = (x \> 0.5) & (y \< 0.5)
>
> Because NumPy evaluates each subexpression, this is roughly equivalent
> to the following:


> tmp1 = (x \> 0.5)  
> tmp2 = (y \< 0.5)  
> mask = tmp1 & tmp2
>
> In other words, *every intermediate step is explicitly allocated in
> memory*. If the x and y arrays are very large, this can lead to
> significant memory and computational overhead. The Numexpr library
> gives you the ability to compute this type of compound expression
> element by element, without the need to allocate full intermediate
> arrays. The Numexpr documentation has more details, but for the time
> being it is sufficient to say that the library accepts a *string*
> giving the NumPy-style expression you'd like to compute:
>
> import numexpr  
> mask_numexpr = numexpr.evaluate('(x \> 0.5) & (y \< 0.5)')  
> np.allclose(mask, mask_numexpr)
>
> True
>
> The benefit here is that Numexpr evaluates the expression in a way that
> does not use full-sized temporary arrays, and thus can be much more
> efficient than NumPy, especially for large arrays. The Pandas eval() and
> query() tools that we will discuss here are conceptually similar, and
> depend on the Numexpr package.
>
> pandas.eval() for Efficient Operations
>
> The eval() function in Pandas uses string expressions to efficiently
> compute operations using DataFrames. For example, consider the
> following DataFrames:
>
> import pandas as pd  
> nrows, ncols = 100000, 100  
> rng = np.random.RandomState(42)  
> df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in
> range(4))
>
> To compute the sum of all four DataFrames using the typical Pandas
> approach, we can just write the sum:
>
> %timeit df1 + df2 + df3 + df4
>
> 39.9 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> The same result can be computed via pd.eval by constructing the
> expression as a string:
>
> %timeit pd.eval('df1 + df2 + df3 + df4')


> 24.5 ms ± 888 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> The eval() version of this expression takes roughly 40% less time
> (24.5 ms versus 39.9 ms above) and uses much less memory, while giving
> the same result:
>
> np.allclose(df1 + df2 + df3 + df4,  
> pd.eval('df1 + df2 + df3 + df4'))
>
> True
>
> Operations supported by pd.eval()
>
> As of Pandas v0.16, pd.eval() supports a wide range of operations. To
> demonstrate these, we'll use the following integer DataFrames:
>
> df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100,
> 3))) for i in range(5))
>
> Arithmetic operators
>
> pd.eval() supports all arithmetic operators. For example:
>
> result1 = -df1 \* df2 / (df3 + df4) - df5  
> result2 = pd.eval('-df1 \* df2 / (df3 + df4) - df5')
> np.allclose(result1, result2)
>
> True
>
> Comparison operators
>
> pd.eval() supports all comparison operators, including chained
> expressions:
>
> result1 = (df1 \< df2) & (df2 \<= df3) & (df3 != df4)  
> result2 = pd.eval('df1 \< df2 \<= df3 != df4')  
> np.allclose(result1, result2)
>
> True
>
> Bitwise operators
>
> pd.eval() supports the & and \| bitwise operators:
>
> result1 = (df1 \< 0.5) & (df2 \< 0.5) \| (df3 \< df4)  
> result2 = pd.eval('(df1 \< 0.5) & (df2 \< 0.5) \| (df3 \< df4)')
> np.allclose(result1, result2)


> True
>
> In addition, it supports the use of the literal keywords "and" and
> "or" in Boolean expressions:
>
> result3 = pd.eval('(df1 \< 0.5) and (df2 \< 0.5) or (df3 \< df4)')  
> np.allclose(result1, result3)
>
> True
>
> Object attributes and indices
>
> pd.eval() supports access to object attributes via the obj.attr
> syntax, and indexes via the obj\[index\] syntax:
>
> result1 = df2.T\[0\] + df3.iloc\[1\]  
> result2 = pd.eval('df2.T\[0\] + df3.iloc\[1\]')  
> np.allclose(result1, result2)
>
> True
>
> Other operations
>
> Other operations such as function calls, conditional statements,
> loops, and other more involved constructs are currently *not*
> implemented in pd.eval(). If you'd like to execute these more
> complicated types of expressions, you can use the Numexpr library
> itself.
>
> DataFrame.eval() for Column-Wise Operations
>
> Just as Pandas has a top-level pd.eval() function, DataFrames have an
> eval() method that works in similar ways. The benefit of the eval()
> method is that columns can be referred to *by name*. We'll use this
> labeled array as an example:
>
> df = pd.DataFrame(rng.rand(1000, 3), columns=\['A', 'B', 'C'\])
> df.head()

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>0.375506</td>
<td>0.406939</td>
<td>0.069938</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>0.069087</td>
<td>0.235615</td>
<td>0.154374</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>0.677945</td>
<td>0.433839</td>
<td>0.652324</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>0.264038</td>
<td>0.808055</td>
<td>0.347197</td>
</tr>
<tr class="odd">
<td><strong>4</strong></td>
<td>0.589161</td>
<td>0.252418</td>
<td>0.557789</td>
</tr>
</tbody>
</table>

> Using pd.eval() as above, we can compute expressions with the three
> columns like this:
>
> result1 = (df\['A'\] + df\['B'\]) / (df\['C'\] - 1)
>
> result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
>
> np.allclose(result1, result2)
>
> True

> The DataFrame.eval() method allows much more succinct evaluation of
> expressions with the columns:
>
> result3 = df.eval('(A + B) / (C - 1)')
>
> np.allclose(result1, result3)
>
> True
>
> Notice here that we treat *column names as variables* within the
> evaluated expression, and the result is what we would wish.
>
> Assignment in DataFrame.eval()
>
> In addition to the options just discussed, DataFrame.eval() also
> allows assignment to any column. Let's use the DataFrame from before,
> which has columns 'A', 'B', and 'C':
>
> df.head()

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>0.375506</td>
<td>0.406939</td>
<td>0.069938</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>0.069087</td>
<td>0.235615</td>
<td>0.154374</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>0.677945</td>
<td>0.433839</td>
<td>0.652324</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>0.264038</td>
<td>0.808055</td>
<td>0.347197</td>
</tr>
<tr class="odd">
<td><strong>4</strong></td>
<td>0.589161</td>
<td>0.252418</td>
<td>0.557789</td>
</tr>
</tbody>
</table>

> We can use df.eval() to create a new column 'D' and assign to it a
> value computed from the other columns:
>
> df.eval('D = (A + B) / C', inplace=True)
>
> df.head()


<table>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
<th><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>0.375506</td>
<td>0.406939</td>
<td>0.069938</td>
<td>11.187620</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>0.069087</td>
<td>0.235615</td>
<td>0.154374</td>
<td>1.973796</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>0.677945</td>
<td>0.433839</td>
<td>0.652324</td>
<td>1.704344</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>0.264038</td>
<td>0.808055</td>
<td>0.347197</td>
<td>3.087857</td>
</tr>
<tr class="odd">
<td><strong>4</strong></td>
<td>0.589161</td>
<td>0.252418</td>
<td>0.557789</td>
<td>1.508776</td>
</tr>
</tbody>
</table>

> In the same way, any existing column can be modified:

> df.eval('D = (A - B) / C', inplace=True)  
> df.head()

<table>
<thead>
<tr class="header">
<th></th>
<th><strong>A</strong></th>
<th><strong>B</strong></th>
<th><strong>C</strong></th>
<th><strong>D</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>0</strong></td>
<td>0.375506</td>
<td>0.406939</td>
<td>0.069938</td>
<td>-0.449425</td>
</tr>
<tr class="even">
<td><strong>1</strong></td>
<td>0.069087</td>
<td>0.235615</td>
<td>0.154374</td>
<td>-1.078728</td>
</tr>
<tr class="odd">
<td><strong>2</strong></td>
<td>0.677945</td>
<td>0.433839</td>
<td>0.652324</td>
<td>0.374209</td>
</tr>
<tr class="even">
<td><strong>3</strong></td>
<td>0.264038</td>
<td>0.808055</td>
<td>0.347197</td>
<td>-1.566886</td>
</tr>
<tr class="odd">
<td><strong>4</strong></td>
<td>0.589161</td>
<td>0.252418</td>
<td>0.557789</td>
<td>0.603708</td>
</tr>
</tbody>
</table>

> Local variables in DataFrame.eval()

> The DataFrame.eval() method supports an additional syntax that lets it
> work with local Python variables. Consider the following:
>
> column_mean = df.mean(1)  
> result1 = df\['A'\] + column_mean  
> result2 = df.eval('A + @column_mean')  
> np.allclose(result1, result2)
>
> True
>
> The @ character here marks a *variable name* rather than a *column
> name*, and lets you efficiently evaluate expressions involving the two
> "namespaces": the namespace of columns, and the namespace of Python
> objects. Notice that this @ character is only supported by the
> DataFrame.eval() *method*, not by the pandas.eval() *function*, because
> the pandas.eval() function only has access to the one (Python)
> namespace.
>
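The two namespaces described above can be seen side by side in a small self-contained sketch. The DataFrame here is a hypothetical stand-in (random values, columns A, B, C) and `row_mean` is an illustrative variable name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the df used above: random values in columns A, B, C.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

row_mean = df.mean(axis=1)              # local Python variable: one value per row

explicit = df["A"] + row_mean           # plain pandas arithmetic
with_eval = df.eval("A + @row_mean")    # '@' looks row_mean up in the local scope

# pandas.eval() has no column namespace, so both the DataFrame and the
# local variable are referenced by their ordinary Python names, without '@'.
with_pd_eval = pd.eval("df.A + row_mean")

same = np.allclose(explicit, with_eval) and np.allclose(explicit, with_pd_eval)
```

All three expressions should produce the same Series; only the way names are resolved differs.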
> DataFrame.query() Method
>
> The DataFrame has another method based on evaluated strings, called
> the query() method. Consider the following:


> result1 = df\[(df.A \< 0.5) & (df.B \< 0.5)\]  
> result2 = pd.eval('df\[(df.A \< 0.5) & (df.B \< 0.5)\]')  
> np.allclose(result1, result2)
>
> True
>
> As with the example used in our discussion of DataFrame.eval(), this
> is an expression involving columns of the DataFrame. It cannot be
> expressed using the DataFrame.eval() syntax, however! Instead, for this
> type of filtering operation, you can use the query() method:
>
> result2 = df.query('A \< 0.5 and B \< 0.5')  
> np.allclose(result1, result2)
>
> True
>
> In addition to being a more efficient computation, compared to the
> masking expression this is much easier to read and understand. Note
> that the query() method also accepts the @ flag to mark local
> variables:
>
> Cmean = df\['C'\].mean()  
> result1 = df\[(df.A \< Cmean) & (df.B \< Cmean)\]  
> result2 = df.query('A \< @Cmean and B \< @Cmean')  
> np.allclose(result1, result2)
>
> True
>
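As a self-contained sketch of the comparison above, the following builds a hypothetical DataFrame of random values and checks that the masking expression and query() select the same rows. The name `threshold` is illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: random values in columns A, B, C.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((1000, 3)), columns=["A", "B", "C"])

threshold = df["C"].mean()    # a local variable, not a column

# Masking expression: builds boolean Series explicitly.
masked = df[(df["A"] < threshold) & (df["B"] < threshold)]

# query(): column names are bare, the local variable is marked with '@'.
queried = df.query("A < @threshold and B < @threshold")

rows_match = masked.index.equals(queried.index)
values_match = np.allclose(masked.to_numpy(), queried.to_numpy())
```

Comparing the index as well as the values guards against two results that happen to hold equal numbers in different rows.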
> Performance: When to Use These Functions
>
> When considering whether to use these functions, there are two
> considerations: *computation time* and *memory use*. Memory use is the
> most predictable aspect. As already mentioned, every compound
> expression involving NumPy arrays or Pandas DataFrames will result in
> implicit creation of temporary arrays. For example, this:
>
> x = df\[(df.A \< 0.5) & (df.B \< 0.5)\]
>
> is roughly equivalent to this:
>
> tmp1 = df.A \< 0.5  
> tmp2 = df.B \< 0.5  
> tmp3 = tmp1 & tmp2  
> x = df\[tmp3\]
>
> If the size of the temporary DataFrames is significant compared to
> your available system memory (typically several gigabytes) then it's a
> good idea to use an eval() or query() expression. You can check the
> approximate size of your array in bytes using this:
>
> df.values.nbytes
>
> 32000
>
> On the performance side, eval() can be faster even when you are not
> maxing out your system memory. The issue is how your temporary
> DataFrames compare to the size of the L1 or L2 CPU cache on your
> system (typically a few megabytes in 2016); if they are much bigger,
> then eval() can avoid some potentially slow movement of values between
> the different memory caches. In practice, I find that the difference in
> computation time between the traditional methods and the eval / query
> method is usually not significant; if anything, the traditional method
> is faster for smaller arrays! The benefit of eval / query is mainly in
> the saved memory, and the sometimes cleaner syntax they offer.
>
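To make the memory argument concrete, here is a rough sketch that measures the data and the temporaries created by the masking expression. The shape (1000 rows, 4 float64 columns) is an assumption chosen so that the total matches the 32000 bytes reported above; the values themselves are random placeholders.

```python
import numpy as np
import pandas as pd

# Assumed shape: 1000 rows x 4 float64 columns (A, B, C, D).
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((1000, 4)), columns=list("ABCD"))

data_bytes = df.to_numpy().nbytes    # 1000 * 4 * 8 bytes = 32000

# Each comparison materialises a full-length boolean Series (1 byte per row),
# and combining them creates a third one before any row is selected.
tmp1 = df["A"] < 0.5
tmp2 = df["B"] < 0.5
tmp3 = tmp1 & tmp2
temp_bytes = sum(t.to_numpy().nbytes for t in (tmp1, tmp2, tmp3))    # 3 * 1000 bytes
```

For a table this small the temporaries are negligible. The same accounting applied to data of several gigabytes is what makes eval() and query() worth considering.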
> We've covered most of the details of eval() and query() here; for more
> information on these, you can refer to the Pandas documentation. In
> particular, different parsers and engines can be specified for running
> these queries; for details on this, see the discussion within the
>
> **Post Experiment Question :**
>
> Note: Students need to answer these questions in typed form or create
> appropriate code cells to demonstrate the working.
>
> Q1. What is the role of the isnull and notnull functions in data preprocessing?
>
> --\> The isnull() and notnull() functions are useful in data
> preprocessing for detecting and handling missing values in a dataset.
> Here's a brief explanation of what each function does:
>
> isnull() function: This function is used to detect missing values
> (NaN, None, NaT, etc.) in a dataset. It returns a boolean mask
> indicating where the missing values are present in the data.
>
> notnull() function: This function is the opposite of the isnull()
> function. It returns a boolean mask indicating where the values are
> not missing.
>
> Both of these functions are commonly used in data preprocessing to
> handle missing values. For example, you can use isnull() to identify
> missing values in a dataset, and then decide how to handle those
> missing values (e.g., by imputing them with a mean value, dropping the
> rows with missing values, etc.). Similarly, you can use notnull() to
> filter out the rows with missing values or to perform operations only
> on the rows with valid values.
>
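A minimal sketch of the two functions on a small, made-up table follows. The column names and values are hypothetical placeholders chosen only to show the boolean masks.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; None and np.nan both count as missing.
readings = pd.DataFrame({
    "city": ["Pune", "Mumbai", None, "Nagpur"],
    "temp": [31.5, np.nan, 29.0, np.nan],
})

missing_mask = readings.isnull()           # True wherever a value is missing
missing_per_column = missing_mask.sum()    # number of missing values per column

# notnull() gives the opposite mask; here it keeps rows that have a temperature.
valid_temp_rows = readings[readings["temp"].notnull()]
```

Summing the mask is a common first step: it shows how much of each column is missing before deciding whether to drop or impute.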
> Q2. When will you use the fillna method in the data preprocessing step?
>
> --\> The fillna() method is used in data preprocessing to handle missing
> values in a dataset by filling them with some other value. Here are some
> scenarios where you might use the fillna() method:


> Imputing missing values: If you have missing values in your dataset,
> you can use the fillna() method to fill those missing values with some
> other value. One common strategy is to impute missing values with the
> mean, median, or mode of the column.
>
> Handling outliers: fillna() only acts on missing values, so it does not
> replace outliers by itself. If the outliers are first marked as missing
> (for example with mask() or where()), fillna() can then replace them
> with the median or with another chosen value.
>
> Creating new features: Before filling, you can use isnull() to create a
> binary feature that indicates whether a particular column had a missing
> value, so that this information is not lost once fillna() has
> overwritten the gaps.
>
> Encoding categorical variables: If you have categorical variables with
> missing values, you can use the fillna() method to fill those missing
> values with a new category label.
>
> It's important to note that filling missing values can affect the
> analysis of the data, so the fill method and its potential impact
> should be considered carefully. It's also important to check the
> distribution of the data before and after filling the missing values to
> ensure that the distribution is not significantly affected by the
> imputation method.
>
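The scenarios above can be sketched on a small, made-up table. The column names (`age`, `segment`, `age_was_missing`) and the values are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps in a numeric and a categorical column.
survey = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, np.nan, 40.0],
    "segment": ["a", None, "b", "a", None],
})

# Record which ages were missing before they are overwritten.
survey["age_was_missing"] = survey["age"].isnull()

# Numeric column: impute with the median of the observed values (35.0 here).
survey["age"] = survey["age"].fillna(survey["age"].median())

# Categorical column: give missing entries their own label.
survey["segment"] = survey["segment"].fillna("unknown")
```

Creating the indicator column first matters: once fillna() has run, imputed values can no longer be told apart from observed ones.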
> Q3. What does the parameter inplace=True mean?
>
> --\> The inplace=True parameter is an optional argument that can be used
> with many pandas methods, including the fillna() method. When you set
> inplace=True, the original dataframe is modified in place: the changes
> are made directly to the dataframe object itself, rather than being
> returned as a new dataframe.
>
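The difference can be checked with a short sketch on a one-column, made-up DataFrame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# Default behaviour: a new DataFrame is returned and df is left untouched.
filled_copy = df.fillna(0.0)
missing_after_copy = int(df["x"].isnull().sum())       # df still has its NaN

# inplace=True: df itself is modified, so nothing needs to be assigned back.
df.fillna(0.0, inplace=True)
missing_after_inplace = int(df["x"].isnull().sum())    # the NaN is gone from df
```

Without inplace=True, the usual pattern is to assign the result back, for example `df = df.fillna(0.0)`.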
> Q4. Name and explain any three values that the method parameter may
> have in the fillna method.
>
> --\> The fillna() method in pandas takes several optional parameters
> that control its behavior. Here are three values that the method
> parameter can take, along with a brief explanation of what they do:
>
> 1) method='ffill' (alias 'pad'): forward-fills missing values in the
> dataframe. Each missing value is filled with the value from the
> previous non-missing element in the same column.
>
> 2) method='bfill' (alias 'backfill'): backward-fills missing values in
> the dataframe. Each missing value is filled with the value from the
> next non-missing element in the same column.
>
> 3) method=None (the default): no neighbouring value is propagated;
> missing values are replaced with whatever is passed in the value
> argument.
>
> Note that method='nearest' is not accepted by fillna(). Filling from
> the closest non-missing element, whether it lies before or after the
> gap, is offered by interpolate(method='nearest') instead.
>
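Forward and backward filling can be compared on a short, made-up Series. This sketch calls the ffill() and bfill() methods directly rather than passing `method=` to fillna(), because that parameter is deprecated in pandas 2.1 and later; the filling behaviour is the same.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last valid value forward over each gap.
forward = s.ffill()

# Backward fill: pull the next valid value backward over each gap.
# The trailing NaN has no later value to copy, so it stays missing.
backward = s.bfill()
```

The leftover NaN after bfill() shows why a directional fill is often followed by a second pass, such as filling the remaining gaps with a constant.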
> Q5. Create an Excel file containing 4 real-valued numeric columns:
> latitude, longitude, avg_temp, avg_humidity. Manually add 30 records and
> save the file as weather.csv. Upload this to your Google Drive and load
> it into a Python variable weather_info. Show all statistical properties
> of the columns in this data set.
>
> **References :**
>
> The Pandas online documentation: This is the go-to source for complete
> documentation of the package. While the examples in the documentation
> tend to be small generated datasets, the description of the options is
> complete and generally very useful for understanding the use of various
> functions.
>
> Written by Wes McKinney (the original creator of Pandas), this book
> contains much more detail on the Pandas package than we had room for
> in this chapter. In particular, he takes a deep dive into tools for
> time series, which were his bread and butter as a financial consultant.
> The book also has many entertaining examples of applying Pandas to
> gain insight from real-world datasets. Keep in mind, though, that the
> book is now several years old, and the Pandas package has quite a few
> new features that this book does not cover (but be on the lookout for
> a new edition in 2017).
>
> Stack Overflow: Pandas has so many users that any question you have has
> likely been asked and answered on Stack Overflow. Using Pandas is a
> case where some Google-Fu is your best friend. Simply go to your
> favorite search engine and type in the question, problem, or error
> you're coming across; more than likely you'll find your answer on a
> Stack Overflow page.
>
> From PyCon to SciPy to PyData, many conferences have featured
> tutorials from Pandas developers and power users. The PyCon tutorials
> in particular tend to be given by very well-vetted presenters.
>
> **Conclusion :** Thus we have learned how to perform data manipulation
> using the Pandas library.
>
> Q5. Create an Excel file containing 4 real-valued numeric columns:
> latitude, longitude, avg_temp, avg_humidity. Manually add 30 records and
> save the file as weather.csv. Upload this to your Google Drive and load
> it into a Python variable weather_info. Show all statistical properties
> of the columns in this data set.
>
> from google.colab import drive  
> drive.mount('/content/drive')
>
> Mounted at /content/drive
>
> import pandas as pd  
> path = '/content/drive/My Drive/weather.csv'  
> weather_info = pd.read_csv(path)  
> weather_info.describe()


<table style="width:100%;">
<colgroup>
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
<col style="width: 16%" />
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Data.Precipitation</strong></th>
<th><strong>Date.Month</strong></th>
<th><strong>Date.Week of</strong></th>
<th><strong>Date.Year</strong></th>
<th><strong>Data.Temperature.T</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>count</strong></td>
<td>16743.000000</td>
<td>16743.000000</td>
<td>16743.000000</td>
<td>16743.000000</td>
<td>16743.000</td>
</tr>
<tr class="even">
<td><strong>mean</strong></td>
<td>0.579090</td>
<td>6.343128</td>
<td>15.650242</td>
<td><blockquote>
<p>2016.018933</p>
</blockquote></td>
<td>56.089</td>
</tr>
<tr class="odd">
<td><strong>std</strong></td>
<td>0.988057</td>
<td>3.490723</td>
<td>8.923425</td>
<td>0.136294</td>
<td>18.798</td>
</tr>
<tr class="even">
<td><strong>min</strong></td>
<td>0.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td><blockquote>
<p>2016.000000</p>
</blockquote></td>
<td>-27.000</td>
</tr>
<tr class="odd">
<td><strong>25%</strong></td>
<td>0.000000</td>
<td>3.000000</td>
<td>8.000000</td>
<td><blockquote>
<p>2016.000000</p>
</blockquote></td>
<td>44.000</td>
</tr>
<tr class="even">
<td><strong>50%</strong></td>
<td>0.190000</td>
<td>6.000000</td>
<td>16.000000</td>
<td><blockquote>
<p>2016.000000</p>
</blockquote></td>
<td>58.000</td>
</tr>
<tr class="odd">
<td><strong>75%</strong></td>
<td>0.750000</td>
<td>9.000000</td>
<td>24.000000</td>
<td><blockquote>
<p>2016.000000</p>
</blockquote></td>
<td>71.000</td>
</tr>
<tr class="even">
<td><strong>max</strong></td>
<td>20.890000</td>
<td>12.000000</td>
<td>31.000000</td>
<td><blockquote>
<p>2017.000000</p>
</blockquote></td>
<td>100.000</td>
</tr>
</tbody>
</table>
