# 7. Array-Oriented Programming with NumPy

### Objectives
In this chapter, you’ll:
* Learn what arrays are and how they differ from lists.
* Use the `numpy` module’s high-performance `ndarray`s.
* Compare list and `ndarray` performance with the IPython `%timeit` magic.
* Use `ndarray`s to store and retrieve data efficiently.
* Create and initialize `ndarray`s.
* Refer to individual `ndarray` elements.
* Iterate through `ndarray`s.
* Create and manipulate multidimensional `ndarray`s.
* Perform common `ndarray` manipulations.
* Create and manipulate pandas one-dimensional `Series` and two-dimensional `DataFrame`s.
* Customize `Series` and `DataFrame` indices.
* Calculate basic descriptive statistics for data in a `Series` and a `DataFrame`.
* Customize floating-point number precision in pandas output formatting.

# 7.1 Introduction

### **NumPy** (**Numerical Python**) Library
* First appeared in 2006 and is the **preferred Python array implementation**.
* High-performance, richly functional **_n_-dimensional array** type called **`ndarray`**. 
* **Written in C** and **up to 100 times faster than lists**.
* Critical in big-data processing, AI applications and much more. 
* According to `libraries.io`, **over 450 Python libraries depend on NumPy**. 
* Many popular data science libraries such as Pandas, SciPy (Scientific Python) and Keras (for deep learning) are built on or depend on NumPy. 

### Array-Oriented Programming
* **Functional-style programming** with **internal iteration** makes array-oriented manipulations concise and straightforward, and reduces the possibility of error.
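
To make the contrast concrete, here is a minimal sketch that doubles five values, first with an explicit loop (external iteration) and then with a single array expression that iterates internally. The variable names are illustrative and do not come from the chapter.

```python
import numpy as np

values = [2, 3, 5, 7, 11]

# External iteration: you write the loop and manage the result list yourself.
doubled_list = []
for value in values:
    doubled_list.append(value * 2)

# Internal iteration: NumPy applies the operation to every element for you.
doubled_array = np.array(values) * 2

print(doubled_list)   # [4, 6, 10, 14, 22]
print(doubled_array)  # [ 4  6 10 14 22]
```

The array version has no loop counter or accumulator to get wrong, which is the error reduction the bullet above refers to.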

# 7.2 Creating `array`s from Existing Data 
* Creating an array with the **`array`** function 
* Argument is an `array` or other iterable
* Returns a new `array` containing the argument’s elements

In [1]:
import numpy as np


In [2]:
numbers = np.array([2, 3, 5, 7, 11])

In [3]:
type(numbers)

numpy.ndarray

In [4]:
numbers

array([ 2,  3,  5,  7, 11])

### Multidimensional Arguments

In [5]:
np.array([[1, 2, 3], [4, 5, 6]])

array([[1, 2, 3],
       [4, 5, 6]])

# 7.3 `array` Attributes 
* An `array` object’s **attributes** enable you to discover information about its structure and contents

In [1]:
import numpy as np


In [2]:
integers = np.array([[1, 2, 3], [4, 5, 6]])

In [3]:
integers

array([[1, 2, 3],
       [4, 5, 6]])

In [5]:
floats = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

In [None]:
floats

* NumPy does not display trailing 0s

### Determining an `array`’s Element Type

In [None]:
integers.dtype

In [None]:
floats.dtype

* For performance reasons, NumPy is written in the C programming language and uses C’s data types
* [Other NumPy types](https://docs.scipy.org/doc/numpy/user/basics.types.html)

### Determining an `array`’s Dimensions
* **`ndim`** contains an `array`’s number of dimensions 
* **`shape`** contains a _tuple_ specifying an `array`’s dimensions

In [None]:
integers.ndim

In [None]:
floats.ndim

In [None]:
integers.shape

In [None]:
floats.shape

### Determining an `array`’s Number of Elements and Element Size
* View an `array`’s total number of elements with **`size`**
* View the number of bytes required to store each element with **`itemsize`**

In [None]:
integers.size

In [None]:
integers.itemsize

In [None]:
floats.size

In [None]:
floats.itemsize
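
The attributes above are related to one another, as the following short sketch suggests. It also uses `nbytes`, an `ndarray` attribute that this chapter does not cover, to show the total storage used by the elements. The exact `itemsize` of an integer `array` depends on your platform’s default integer type, so the sketch checks relationships rather than specific byte counts.

```python
import numpy as np

integers = np.array([[1, 2, 3], [4, 5, 6]])

# size is the product of the dimensions listed in shape
rows, columns = integers.shape
print(integers.size == rows * columns)  # True

# ndim is the number of entries in the shape tuple
print(integers.ndim == len(integers.shape))  # True

# nbytes (total bytes of element data) equals size * itemsize
print(integers.nbytes == integers.size * integers.itemsize)  # True
```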

### Iterating through a Multidimensional `array`’s Elements


In [7]:
for row in integers:
    print(f"row is {row}")
    for column in row:
        print(column, end='  ')
    print() 

row is [1 2 3]
1  2  3  
row is [4 5 6]
4  5  6  


* Iterate through a multidimensional `array` as if it were one-dimensional by using **`flat`**

In [None]:
for i in integers.flat:
    print(i, end='  ')

# 7.4 Filling `array`s with Specific Values
* Functions **`zeros`**, **`ones`** and **`full`** create `array`s containing  `0`s, `1`s or a specified value, respectively

In [1]:
import numpy as np

In [2]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

* For a tuple of integers, these functions return a multidimensional `array` with the specified dimensions

In [3]:
np.ones((2, 4), dtype=int)

array([[1, 1, 1, 1],
       [1, 1, 1, 1]])

In [4]:
np.full((3, 5), 13)

array([[13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13]])

# 7.5 Creating `array`s from Ranges 
* NumPy provides optimized functions for creating `array`s from ranges

### Creating Integer Ranges with `arange`

In [None]:
import numpy as np

In [None]:
np.arange(5)

In [None]:
np.arange(5, 10)

In [None]:
np.arange(10, 1, -2)

### Creating Floating-Point Ranges with `linspace` 
* Produce evenly spaced floating-point ranges with NumPy’s **`linspace`** function
* Ending value **is included** in the `array`

In [None]:
np.linspace(0.0, 1.0, num=5)
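
As a quick check of the inclusive ending value, the sketch below prints the `linspace` result and contrasts it with `arange`, which excludes its stop value just as the built-in `range` does. The variable name is illustrative.

```python
import numpy as np

evenly_spaced = np.linspace(0.0, 1.0, num=5)
print(evenly_spaced)      # [0.   0.25 0.5  0.75 1.  ]
print(evenly_spaced[-1])  # 1.0, the ending value is included

# arange excludes its stop value, like the built-in range
print(np.arange(0, 5))    # [0 1 2 3 4]
```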

### Reshaping an `array` 
* `array` method **`reshape`** transforms an array to a different number of dimensions
* New shape must have the **same** number of elements as the original

In [None]:
np.arange(1, 21).reshape(4, 5)

### Displaying Large `array`s 
* When displaying an `array` that has more than 1000 items (NumPy’s default print threshold), NumPy drops the middle rows, columns or both from the output

In [None]:
np.arange(1, 100001).reshape(4, 25000)

In [None]:
np.arange(1, 100001).reshape(100, 1000)

# 7.6 List vs. `array` Performance: Introducing `%timeit` 
* Most `array` operations execute **significantly** faster than corresponding list operations
* IPython **`%timeit` magic** command times the **average** duration of operations

### Timing the Creation of a List Containing Results of 6,000,000 Die Rolls 

In [None]:
import random

In [None]:
%timeit rolls_list = \
   [random.randrange(1, 7) for i in range(0, 6_000_000)]

* By default, `%timeit` executes a statement in a loop, and it runs the loop _seven_ times
* If you do not indicate the number of loops, `%timeit` chooses an appropriate value
* After executing the statement, `%timeit` displays the statement’s _average_ execution time, as well as the standard deviation of all the executions
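
`%timeit` works only inside IPython or Jupyter. Outside those environments, the standard library’s `timeit` module can approximate the same measurement. The sketch below is an assumption-laden stand-in, not a reproduction of what `%timeit` does internally: it uses a smaller workload than the chapter’s 6,000,000 rolls so that it finishes quickly, and it fixes the loop count at 10 rather than letting the tool choose one.

```python
import random
import statistics
import timeit

# Hypothetical, smaller workload than the chapter's 6,000,000 rolls.
statement = '[random.randrange(1, 7) for i in range(0, 10_000)]'
loops_per_run = 10

# repeat=7 mirrors %timeit's default of seven runs.
run_times = timeit.repeat(statement, globals={'random': random},
                          repeat=7, number=loops_per_run)

# Convert each run's total time into a per-loop time.
per_loop = [run_time / loops_per_run for run_time in run_times]

print(f'{statistics.mean(per_loop):.6f} s per loop '
      f'(std. dev. {statistics.stdev(per_loop):.6f} s, '
      f'{len(per_loop)} runs, {loops_per_run} loops each)')
```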

### Timing the Creation of an `array` Containing Results of 6,000,000 Die Rolls  

In [None]:
import numpy as np

In [None]:
%timeit rolls_array = np.random.randint(1, 7, 6_000_000)

### 60,000,000 and 600,000,000 Die Rolls  

In [None]:
%timeit rolls_array = np.random.randint(1, 7, 60_000_000)

In [None]:
%timeit rolls_array = np.random.randint(1, 7, 600_000_000)

### Other IPython Magics
IPython provides dozens of magics for a variety of tasks—for a complete list, see the IPython magics documentation. Here are a few helpful ones:
* **`%load`** to read code into IPython from a local file or URL.
* **`%save`** to save snippets to a file.
* **`%run`** to execute a .py file from IPython.
* **`%precision`** to change the default floating-point precision for IPython outputs.
* **`%cd`** to change directories without having to exit IPython first.
* **`%edit`** to launch an external editor—handy if you need to modify more complex snippets. 
* **`%history`** to view a list of all snippets and commands you’ve executed in the current IPython session.
* **`%ls`** to list a directory’s contents.
* **`%pwd`** to print the working directory.

# 7.7 `array` Operators
* `array` operators perform operations on **entire `array`s**. 
* Can perform arithmetic **between `array`s and scalar numeric values**, and **between `array`s of the same shape**.

### Arithmetic Operations with `array`s and Individual Numeric Values

In [None]:
import numpy as np

In [None]:
numbers = np.arange(1, 6)

In [None]:
numbers

In [None]:
numbers * 2

In [None]:
numbers ** 3

In [None]:
numbers  # numbers is unchanged by the arithmetic operators

In [None]:
numbers += 10

In [None]:
numbers

### Broadcasting 
* Normally, arithmetic operations require as operands two `array`s of the **same size and shape**. 
* When one operand is a scalar, NumPy performs the calculation as if the scalar were an `array` of the same shape as the other operand, so **`numbers * 2`** is equivalent to **`numbers * [2, 2, 2, 2, 2]`** for a 5-element array.
* Treating the scalar this way is called **broadcasting**. 
* Broadcasting also can be applied between `array`s of different sizes and shapes, enabling some concise and powerful manipulations.
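
The following sketch illustrates both forms of broadcasting described above: a scalar operand, and two `array`s of different shapes. The names `table` and `weights` are hypothetical and chosen only for this illustration.

```python
import numpy as np

numbers = np.arange(1, 6)

# A scalar operand behaves as if it were an array of the same shape.
scalar_result = numbers * 2
explicit_result = numbers * np.array([2, 2, 2, 2, 2])
print(np.array_equal(scalar_result, explicit_result))  # True

# Different shapes: a (2, 3) array times a (3,) array.
# The one-dimensional array is applied to each row of the 2D array.
table = np.array([[1, 2, 3], [4, 5, 6]])
weights = np.array([10, 100, 1000])
scaled = table * weights
print(scaled.shape)  # (2, 3)
print(scaled)
```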

### Arithmetic Operations Between `array`s 
* Can perform arithmetic operations and augmented assignments between `array`s of the _same_ shape

In [None]:
numbers2 = np.linspace(1.1, 5.5, 5)

In [None]:
numbers2

In [None]:
numbers * numbers2

### Comparing `array`s
* Can compare `array`s with individual values and with other `array`s
* Comparisons performed **element-wise**
* Produce `array`s of Boolean values in which each element’s `True` or `False` value indicates the comparison result

In [None]:
numbers

In [None]:
numbers >= 13

In [None]:
numbers2

In [None]:
numbers2 < numbers

In [None]:
numbers == numbers2

In [None]:
numbers == numbers

# 7.8 NumPy Calculation Methods
* By default, these methods **ignore the `array`’s shape** and **use all the elements in the calculations**. 
* Consider an `array` representing four students’ grades on three exams:

In [None]:
import numpy as np

In [None]:
grades = np.array([[87, 96, 70], [100, 87, 90],
                   [94, 77, 90], [100, 81, 82]])

In [None]:
grades

* Can use methods to calculate **`sum`**, **`min`**, **`max`**, **`mean`**, **`std`** (standard deviation) and **`var`** (variance)
* Each is a functional-style programming **reduction**

In [None]:
grades.sum()

In [None]:
grades.min()

In [None]:
grades.max()

In [None]:
grades.mean()

In [None]:
grades.std()

In [None]:
grades.var()

### Calculations by Row or Column

* You can perform calculations by column or row (or other dimensions in arrays with more than two dimensions)
* Each 2D+ array has [**one axis per dimension**](https://docs.scipy.org/doc/numpy-1.16.0/glossary.html)
* In a 2D array, **`axis=0`** indicates calculations should be **column-by-column**

In [None]:
grades.mean(axis=0)

*  In a 2D array, **`axis=1`** indicates calculations should be **row-by-row**

In [None]:
grades.mean(axis=1)

* [Other Numpy `array` Calculation Methods](https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html)
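
One way to remember the `axis` argument is that it names the dimension that gets collapsed. The sketch below, which reuses the chapter’s `grades` data with illustrative result names, checks the shape of each result.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90],
                   [94, 77, 90], [100, 81, 82]])

# axis=0 collapses the rows: one result per column (per exam)
exam_averages = grades.mean(axis=0)
print(exam_averages.shape)  # (3,)

# axis=1 collapses the columns: one result per row (per student)
student_averages = grades.mean(axis=1)
print(student_averages.shape)  # (4,)

# With no axis argument, every element contributes to a single result
print(grades.mean() == grades.sum() / grades.size)  # True
```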

# 7.9 Universal Functions
* Standalone [**universal functions** (**ufuncs**)](https://docs.scipy.org/doc/numpy/reference/ufuncs.html) perform **element-wise operations** using one or two `array` or array-like arguments (like lists)
* Each returns a **new `array`** containing the results
* Some ufuncs are called when you use `array` operators like `+` and `*`

* Create an `array` and calculate the square root of its values, using the **`sqrt` universal function**

In [None]:
import numpy as np

In [None]:
numbers = np.array([1, 4, 9, 16, 25, 36])

In [None]:
np.sqrt(numbers)

* Add two `array`s with the same shape, using the **`add` universal function**
* Equivalent to:
```python
numbers + numbers2
```

In [None]:
numbers2 = np.arange(1, 7) * 10

In [None]:
numbers2

In [None]:
np.add(numbers, numbers2)

### Broadcasting with Universal Functions
* Universal functions can use broadcasting, just like NumPy `array` operators

In [None]:
np.multiply(numbers2, 5)

In [None]:
numbers3 = numbers2.reshape(2, 3)

In [None]:
numbers3

In [None]:
numbers4 = np.array([2, 4, 6])

In [None]:
np.multiply(numbers3, numbers4)

### Other Universal Functions

| Category | NumPy universal functions |
| ---------- | ---------- |
| **_Math_** | `add`, `subtract`, `multiply`, `divide`, `remainder`, `exp`, `log`, `sqrt`, `power`, and more. |
| **_Trigonometry_** | `sin`, `cos`, `tan`, `hypot`, `arcsin`, `arccos`, `arctan`, and more. |
| **_Bit manipulation_** | `bitwise_and`, `bitwise_or`, `bitwise_xor`, `invert`, `left_shift` and `right_shift`. |
| **_Comparison_** | `greater`, `greater_equal`, `less`, `less_equal`, `equal`, `not_equal`, `logical_and`, `logical_or`, `logical_xor`, `logical_not`, `minimum`, `maximum`, and more. |
| **_Floating point_** | `floor`, `ceil`, `isinf`, `isnan`, `fabs`, `trunc`, and more. |

# 7.10 Indexing and Slicing 
* One-dimensional `array`s can be **indexed** and **sliced** like lists. 

### Indexing with Two-Dimensional `array`s
* To select an element in a two-dimensional `array`, specify a tuple containing the element’s row and column indices in square brackets

In [None]:
import numpy as np

In [None]:
grades = np.array([[87, 96, 70], [100, 87, 90],
                   [94, 77, 90], [100, 81, 82]])

In [None]:
grades

In [None]:
grades[0, 1]  # row 0, column 1

### Selecting a Subset of a Two-Dimensional `array`’s Rows
* To select a single row, specify only one index in square brackets

In [None]:
grades[1]

* Select multiple sequential rows with slice notation

In [None]:
grades[0:2]

* Select multiple non-sequential rows with a list of row indices

In [None]:
grades[[1, 3]]

### Selecting a Subset of a Two-Dimensional `array`’s Columns
* The **column index** also can be a specific **index**, a **slice** or a **list** 

In [None]:
grades[:, 0]

In [None]:
grades[:, 1:3]

In [None]:
grades[:, [0, 2]]

# 7.11 Views: Shallow Copies
* Views “see” the data in other objects, rather than having their own copies of the data
* Views are shallow copies
* `array` method **`view`** returns a **new** array object with a **view** of the original `array` object’s data

In [None]:
import numpy as np

In [None]:
numbers = np.arange(1, 6)

In [None]:
numbers

In [None]:
numbers2 = numbers.view()

In [None]:
numbers2

* Use built-in `id` function to see that `numbers` and `numbers2` are **different** objects

In [None]:
id(numbers)

In [None]:
id(numbers2)

* Modifying an element in the original `array`, also modifies the view and vice versa

In [None]:
numbers[1] *= 10

In [None]:
numbers2

In [None]:
numbers

In [None]:
numbers2[1] /= 10

In [None]:
numbers

In [None]:
numbers2

### Slice Views
* Slices also create views

In [None]:
numbers2 = numbers[0:3]

In [None]:
numbers2

In [None]:
id(numbers)

In [None]:
id(numbers2)

* Confirm that `numbers2` is a view of only the first three `numbers` elements: accessing `numbers2[3]` produces an `IndexError`

In [None]:
numbers2[3]

* Modify an element both `array`s share to show both are updated

In [None]:
numbers[1] *= 20

In [None]:
numbers

In [None]:
numbers2
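
Comparing `id` values shows only that two objects are distinct, not that they share data. As a complementary check, the sketch below uses the `ndarray` attribute `base` and the function `np.shares_memory`, neither of which appears in this chapter. The variable names are illustrative.

```python
import numpy as np

numbers = np.arange(1, 6)
view_of_numbers = numbers.view()
slice_of_numbers = numbers[0:3]

# A view's base attribute refers to the array that owns the data.
print(view_of_numbers.base is numbers)   # True
print(slice_of_numbers.base is numbers)  # True

# shares_memory reports whether two arrays overlap in memory.
print(np.shares_memory(numbers, slice_of_numbers))  # True

# A change made through the slice is visible in the original.
slice_of_numbers[0] = 99
print(numbers[0])  # 99
```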

# 7.12 Deep Copies
* When sharing **mutable** values, sometimes it’s necessary to create a **deep copy** of the original data
* Especially important in multi-core programming, where separate parts of your program could attempt to modify your data at the same time, possibly corrupting it

* **`array` method `copy`** returns a new array object with an independent copy of the original array's data

In [None]:
import numpy as np

In [None]:
numbers = np.arange(1, 6)

In [None]:
numbers

In [None]:
numbers2 = numbers.copy()

In [None]:
numbers2

In [None]:
numbers[1] *= 10

In [None]:
numbers

In [None]:
numbers2

### Module `copy`—Shallow vs. Deep Copies for Other Types of Python Objects
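
For Python objects other than `array`s, the standard library’s `copy` module provides the functions `copy` (shallow copy) and `deepcopy` (deep copy). The following minimal sketch, using a hypothetical nested list, shows the difference.

```python
import copy

original = [[1, 2, 3], [4, 5, 6]]

shallow = copy.copy(original)   # new outer list, same inner lists
deep = copy.deepcopy(original)  # new outer list and new inner lists

# Modify an inner list through the original.
original[0][0] = 100

print(shallow[0][0])  # 100, the shallow copy shares the inner lists
print(deep[0][0])     # 1, the deep copy is independent
```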

# 7.13 Reshaping and Transposing 

### `reshape` vs. `resize` 
* Method `reshape` returns a _view_ (shallow copy) of the original `array` with new dimensions
* Does _not_ modify the original `array`

In [None]:
import numpy as np

In [None]:
grades = np.array([[87, 96, 70], [100, 87, 90]])

In [None]:
grades

In [None]:
grades.reshape(1, 6)

In [None]:
grades

* Method `resize` modifies the original `array`’s shape

In [None]:
grades.resize(1, 6)

In [None]:
grades

### `flatten` vs. `ravel` 
* Can flatten a multidimensional array into a single dimension with methods **`flatten`** and **`ravel`**
* `flatten` _deep copies_ the original array’s data

In [None]:
grades = np.array([[87, 96, 70], [100, 87, 90]])

In [None]:
grades

In [None]:
flattened = grades.flatten()

In [None]:
flattened

In [None]:
grades

In [None]:
flattened[0] = 100

In [None]:
flattened

In [None]:
grades

* Method `ravel` produces a _view_ of the original `array`, which _shares_ the `grades` `array`’s data

In [None]:
raveled = grades.ravel()

In [None]:
raveled

In [None]:
grades

In [None]:
raveled[0] = 100

In [None]:
raveled

In [None]:
grades
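
The copy-versus-view distinction can also be checked directly with `np.shares_memory`, a NumPy function this chapter does not otherwise use. One caveat the sketch illustrates: `ravel` returns a view only when the array’s memory layout permits it. For a transposed array it has to copy.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90]])

flattened = grades.flatten()
raveled = grades.ravel()

print(np.shares_memory(grades, flattened))  # False, flatten copied the data
print(np.shares_memory(grades, raveled))    # True, ravel returned a view

# Raveling the transposed array cannot reuse the original layout,
# so in this case ravel returns a copy instead of a view.
raveled_transpose = grades.T.ravel()
print(np.shares_memory(grades, raveled_transpose))  # False
```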

### Transposing Rows and Columns
* Can quickly **transpose** an `array`’s rows and columns
    * “flips” the `array`, so the rows become the columns and the columns become the rows
* **`T` attribute** returns a transposed _view_ (shallow copy) of the `array`

In [None]:
grades.T

In [None]:
grades

### Horizontal and Vertical Stacking
* Can combine arrays by adding more columns or more rows—known as _horizontal stacking_ and _vertical stacking_

In [None]:
grades2 = np.array([[94, 77, 90], [100, 81, 82]])

* Combine `grades` and `grades2` with NumPy’s **`hstack` (horizontal stack) function** by passing a tuple containing the arrays to combine
* The extra parentheses are required because `hstack` expects one argument
* Adds more columns

In [None]:
np.hstack((grades, grades2))

* Combine `grades` and `grades2` with NumPy’s **`vstack` (vertical stack) function**
* Adds more rows

In [None]:
np.vstack((grades, grades2))
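
Checking the result shapes is a quick way to confirm which dimension grew. The sketch below also compares the results with `np.concatenate`, a more general NumPy function that this chapter does not cover. For two-dimensional arrays, it produces the same results when given `axis=1` or `axis=0`.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90]])
grades2 = np.array([[94, 77, 90], [100, 81, 82]])

side_by_side = np.hstack((grades, grades2))
stacked = np.vstack((grades, grades2))

print(side_by_side.shape)  # (2, 6), more columns
print(stacked.shape)       # (4, 3), more rows

# For 2D arrays, concatenate along an axis gives the same results.
print(np.array_equal(side_by_side, np.concatenate((grades, grades2), axis=1)))  # True
print(np.array_equal(stacked, np.concatenate((grades, grades2), axis=0)))       # True
```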

## 7.14.1 pandas `Series` 
* An enhanced one-dimensional `array`
* Supports custom indexing, including even non-integer indices like strings
* Offers additional capabilities that make it more convenient for many data-science oriented tasks
    * `Series` may have missing data
    * Many `Series` operations ignore missing data by default
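
As a small illustration of missing data, the sketch below builds a `Series` with one missing value, represented by `np.nan`. The data and the name `grades_with_gap` are hypothetical. The `skipna` keyword argument of `mean` controls whether missing values are ignored.

```python
import numpy as np
import pandas as pd

# np.nan marks the missing second grade (hypothetical data)
grades_with_gap = pd.Series([87, np.nan, 94])

print(grades_with_gap.count())  # 2, only the non-missing values are counted
print(grades_with_gap.mean())   # 90.5, the missing value is ignored

# Passing skipna=False makes the missing value propagate instead.
print(grades_with_gap.mean(skipna=False))  # nan
```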

### Creating a `Series` with Default Indices
* By default, a `Series` has integer indices numbered sequentially from 0

In [1]:
import pandas as pd

In [2]:
grades = pd.Series([87, 100, 94])

### Creating a `Series` with All Elements Having the Same Value
* First argument is the value that every element receives
* Second argument is a one-dimensional iterable object (such as a list, an `array` or a `range`) containing the `Series`’ indices
* Number of indices determines the number of elements

In [149]:
pd.Series(98.6, range(3))

0    98.6
1    98.6
2    98.6
dtype: float64

### Accessing a `Series`’ Elements

In [150]:
grades[0]

87

### Producing Descriptive Statistics for a Series
* `Series` provides many methods for common tasks including producing various descriptive statistics
* Each of these is a functional-style reduction

In [151]:
grades.count()

3

In [152]:
grades.mean()

93.66666666666667

In [153]:
grades.min()

87

In [154]:
grades.max()

100

In [155]:
grades.std()

6.506407098647712

* `Series` method **`describe`** produces all these stats and more
* The `25%`, `50%` and `75%` are **quartiles**:
    * `50%` represents the median of the sorted values.
    * `25%` represents the median of the first half of the sorted values.
    * `75%` represents the median of the second half of the sorted values.
* For the quartiles, if there are two middle elements, then their average is that quartile’s median

In [156]:
grades.describe()

count      3.000000
mean      93.666667
std        6.506407
min       87.000000
25%       90.500000
50%       94.000000
75%       97.000000
max      100.000000
dtype: float64
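
The quartiles can also be requested on their own with the `Series` method `quantile`, which is not otherwise used in this chapter. By default it interpolates linearly between neighboring sorted values. For this three-element `Series`, that gives the same numbers that `describe` reports.

```python
import pandas as pd

grades = pd.Series([87, 100, 94])

# Request the 25%, 50% and 75% quantiles in one call.
quartiles = grades.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The 50% quantile is the median.
print(quartiles[0.5] == grades.median())  # True
```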

### Creating a `Series` with Custom Indices
* Can specify custom indices with the `index` keyword argument

In [157]:
grades = pd.Series([87, 100, 94], index=['Wally', 'Eva', 'Sam'])

In [158]:
grades

Wally     87
Eva      100
Sam       94
dtype: int64

### Dictionary Initializers
* If you initialize a `Series` with a dictionary, its keys are the indices, and its values become the `Series`’ element values

In [159]:
grades = pd.Series({'Wally': 87, 'Eva': 100, 'Sam': 94})

In [160]:
grades

Wally     87
Eva      100
Sam       94
dtype: int64

### Accessing Elements of a `Series` Via Custom Indices
* Can access individual elements via square brackets containing a custom index value

In [161]:
grades['Eva']

100

* If custom indices are strings that could represent valid Python identifiers, pandas automatically adds them to the `Series` as attributes

In [162]:
grades.Wally

87

* **`dtype` attribute** returns the underlying `array`’s element type

In [163]:
grades.dtype

dtype('int64')

* **`values` attribute** returns the underlying `array`

In [164]:
grades.values

array([ 87, 100,  94])

### Creating a Series of Strings 
* In a `Series` of strings, you can use **`str` attribute** to call string methods on the elements

In [165]:
hardware = pd.Series(['Hammer', 'Saw', 'Wrench'])

In [166]:
hardware

0    Hammer
1       Saw
2    Wrench
dtype: object

* Call string method `contains` on each element
* Returns a `Series` containing `bool` values indicating the `contains` method’s result for each element
* The `str` attribute provides many string-processing methods that are similar to those in Python’s string type
    * https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

In [167]:
hardware.str.contains('a')

0     True
1     True
2    False
dtype: bool

* Use string method `upper` to produce a _new_ `Series` containing the uppercase versions of each element in `hardware`

In [168]:
hardware.str.upper()

0    HAMMER
1       SAW
2    WRENCH
dtype: object

## 7.14.2 `DataFrame`s
* Enhanced two-dimensional `array`
* Can have custom row and column indices
* Offers additional operations and capabilities that make it more convenient for many data-science oriented tasks
* Support missing data
* Each column in a `DataFrame` is a `Series`

### Creating a `DataFrame` from a Dictionary
* Create a `DataFrame` from a dictionary that represents student grades on three exams

In [1]:
import pandas as pd

In [2]:
grades_dict = {'Wally': [87, 96, 70], 'Eva': [100, 87, 90],
               'Sam': [94, 77, 90], 'Katie': [100, 81, 82],
               'Bob': [83, 65, 85]}

In [3]:
grades = pd.DataFrame(grades_dict)

* Pandas displays `DataFrame`s in tabular format with indices _left aligned_ in the index column and the remaining columns’ values _right aligned_

In [4]:
grades

|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| 0 | 87 | 100 | 94 | 100 | 83 |
| 1 | 96 | 87 | 77 | 81 | 65 |
| 2 | 70 | 90 | 90 | 82 | 85 |


### Customizing a `DataFrame`’s Indices with the `index` Attribute 
* Can use the **`index` attribute** to change the `DataFrame`’s indices from sequential integers to labels
* Must provide a one-dimensional collection that has the same number of elements as there are _rows_ in the `DataFrame`

In [5]:
grades.index = ['Test1', 'Test2', 'Test3']

In [6]:
grades

|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 87 | 100 | 94 | 100 | 83 |
| Test2 | 96 | 87 | 77 | 81 | 65 |
| Test3 | 70 | 90 | 90 | 82 | 85 |


### Accessing a `DataFrame`’s Columns 
* Can quickly and conveniently look at your data in many different ways, including selecting portions of the data
* Get `Eva`’s grades by name
* Displays her column as a `Series`

In [7]:
grades['Eva']

Test1    100
Test2     87
Test3     90
Name: Eva, dtype: int64

* If a `DataFrame`’s column-name strings are valid Python identifiers, you can use them as attributes

In [8]:
grades.Sam

Test1    94
Test2    77
Test3    90
Name: Sam, dtype: int64

### Selecting Rows via the `loc` and `iloc` Attributes
* `DataFrame`s support indexing capabilities with `[]`, but pandas documentation recommends using the attributes `loc`, `iloc`, `at` and `iat`
    * Optimized to access `DataFrame`s and also provide additional capabilities 
* Access a row by its label via the `DataFrame`’s **`loc` attribute**

In [9]:
grades.loc['Test1']

Wally     87
Eva      100
Sam       94
Katie    100
Bob       83
Name: Test1, dtype: int64

* Access rows by integer zero-based indices using the **`iloc` attribute** (the `i` in `iloc` means that it’s used with integer indices)

In [10]:
grades.iloc[1]

Wally    96
Eva      87
Sam      77
Katie    81
Bob      65
Name: Test2, dtype: int64

### Selecting Rows via Slices and Lists with the `loc` and `iloc` Attributes
* Index can be a _slice_
* When using slices containing **labels** with `loc`, the range specified **includes** the high index (`'Test3'`):

In [11]:
grades.loc['Test1':'Test3']

|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 87 | 100 | 94 | 100 | 83 |
| Test2 | 96 | 87 | 77 | 81 | 65 |
| Test3 | 70 | 90 | 90 | 82 | 85 |


* When using slices containing **integer indices** with `iloc`, the range you specify **excludes** the high index (`2`):

In [12]:
grades.iloc[0:2]

|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 87 | 100 | 94 | 100 | 83 |
| Test2 | 96 | 87 | 77 | 81 | 65 |


* Select _specific rows_ with a _list_ 

In [13]:
grades.loc[['Test1', 'Test3']]

|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 87 | 100 | 94 | 100 | 83 |
| Test3 | 70 | 90 | 90 | 82 | 85 |


In [14]:
grades.iloc[[0, 2]]

|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 87 | 100 | 94 | 100 | 83 |
| Test3 | 70 | 90 | 90 | 82 | 85 |


### Selecting Subsets of the Rows and Columns 
* View only `Eva`’s and `Katie`’s grades on `Test1` and `Test2`

In [15]:
grades.loc['Test1':'Test2', ['Eva', 'Katie']]

|  | Eva | Katie |
| :--- | ---: | ---: |
| Test1 | 100 | 100 |
| Test2 | 87 | 81 |


* Use `iloc` with a list and a slice to select the first and third tests and the first three columns for those tests

In [16]:
grades.iloc[[0, 2], 0:3]

|  | Wally | Eva | Sam |
| :--- | ---: | ---: | ---: |
| Test1 | 87 | 100 | 94 |
| Test3 | 70 | 90 | 90 |


### Boolean Indexing
* One of pandas’ more powerful selection capabilities is **Boolean indexing**
* Select all the A grades—that is, those that are greater than or equal to 90:
    * Pandas checks every grade to determine whether its value is greater than or equal to 90 and, if so, includes it in the new `DataFrame`.
    * Grades for which the condition is `False` are represented as **`NaN` (not a number)** in the new `DataFrame`
    * `NaN` is pandas’ notation for missing values

In [17]:
grades[grades >= 90]

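|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | NaN | 100.0 | 94.0 | 100.0 | NaN |
| Test2 | 96.0 | NaN | NaN | NaN | NaN |
| Test3 | NaN | 90.0 | 90.0 | NaN | NaN |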


* Select all the B grades in the range 80–89

In [18]:
grades[(grades >= 80) & (grades < 90)]

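|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 87.0 | NaN | NaN | NaN | 83.0 |
| Test2 | NaN | 87.0 | NaN | 81.0 | NaN |
| Test3 | NaN | NaN | NaN | 82.0 | 85.0 |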


* Pandas Boolean indices combine multiple conditions with the Python operator `&` (bitwise AND), _not_ the `and` Boolean operator
* For `or` conditions, use `|` (bitwise OR)
* NumPy also supports Boolean indexing for `array`s, but always returns a one-dimensional array containing only the values that satisfy the condition
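
The sketch below pulls these three points together on a hypothetical two-student subset of the grades: combining conditions with `&`, what happens if you use `and` instead, and the one-dimensional result that NumPy produces. `to_numpy` is the `DataFrame` method that returns the values as an `array`.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
                      index=['Test1', 'Test2', 'Test3'])

# Parentheses are required: & binds more tightly than >= and <.
b_grades = grades[(grades >= 80) & (grades < 90)]
print(b_grades)

# The and operator needs a single True/False value for each operand,
# which a DataFrame of Booleans cannot provide, so it raises ValueError.
try:
    grades[(grades >= 80) and (grades < 90)]
except ValueError as error:
    print('and fails with', type(error).__name__)

# The same kind of condition on a NumPy array yields a flat, 1D result.
array_grades = grades.to_numpy()
a_grades = array_grades[array_grades >= 90]
print(a_grades)  # [100  96  90]
```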

### Accessing a Specific `DataFrame` Cell by Row and Column
* `DataFrame` attributes **`at`** and **`iat`** get a single value from a `DataFrame`

In [19]:
grades.at['Test2', 'Eva']

87

In [20]:
grades.iat[2, 0]

70

* Can assign new values to specific elements

In [21]:
grades.at['Test2', 'Eva'] = 100

In [22]:
grades.at['Test2', 'Eva']

100

In [23]:
grades.iat[1, 2] = 87

In [24]:
grades.iat[1, 2]

87

### Descriptive Statistics
* A `DataFrame`’s **`describe` method** calculates basic descriptive statistics for the data and returns them as a `DataFrame`
* Statistics are calculated by column 

In [25]:
grades.describe()

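|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| count | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 84.333333 | 96.666667 | 90.333333 | 87.666667 | 77.666667 |
| std | 13.203535 | 5.773503 | 3.511885 | 10.692677 | 11.015141 |
| min | 70.0 | 90.0 | 87.0 | 81.0 | 65.0 |
| 25% | 78.5 | 95.0 | 88.5 | 81.5 | 74.0 |
| 50% | 87.0 | 100.0 | 90.0 | 82.0 | 83.0 |
| 75% | 91.5 | 100.0 | 92.0 | 91.0 | 84.0 |
| max | 96.0 | 100.0 | 94.0 | 100.0 | 85.0 |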


* Quick way to summarize your data
* Nicely demonstrates the power of array-oriented programming with a clean, concise functional-style call
* Can control the precision and other default settings with pandas’ **`set_option` function**

In [26]:
pd.set_option('display.precision', 2)

In [27]:
grades.describe()

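|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| count | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 84.33 | 96.67 | 90.33 | 87.67 | 77.67 |
| std | 13.2 | 5.77 | 3.51 | 10.69 | 11.02 |
| min | 70.0 | 90.0 | 87.0 | 81.0 | 65.0 |
| 25% | 78.5 | 95.0 | 88.5 | 81.5 | 74.0 |
| 50% | 87.0 | 100.0 | 90.0 | 82.0 | 83.0 |
| 75% | 91.5 | 100.0 | 92.0 | 91.0 | 84.0 |
| max | 96.0 | 100.0 | 94.0 | 100.0 | 85.0 |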


* For student grades, the most important of these statistics is probably the mean
* Can calculate that for each student simply by calling `mean` on the `DataFrame`

In [28]:
grades.mean()

Wally    84.33
Eva      96.67
Sam      90.33
Katie    87.67
Bob      77.67
dtype: float64

### Transposing the `DataFrame` with the `T` Attribute
* Can quickly **transpose** rows and columns—so the rows become the columns, and the columns become the rows—by using the **`T` attribute** to get a view

In [29]:
grades.T

|  | Test1 | Test2 | Test3 |
| :--- | ---: | ---: | ---: |
| Wally | 87 | 96 | 70 |
| Eva | 100 | 100 | 90 |
| Sam | 94 | 87 | 90 |
| Katie | 100 | 81 | 82 |
| Bob | 83 | 65 | 85 |


* Assume that rather than getting the summary statistics by student, you want to get them by test
* Call `describe` on `grades.T`

In [30]:
grades.T.describe()

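|  | Test1 | Test2 | Test3 |
| :--- | ---: | ---: | ---: |
| count | 5.0 | 5.0 | 5.0 |
| mean | 92.8 | 85.8 | 83.4 |
| std | 7.66 | 13.81 | 8.23 |
| min | 83.0 | 65.0 | 70.0 |
| 25% | 87.0 | 81.0 | 82.0 |
| 50% | 94.0 | 87.0 | 85.0 |
| 75% | 100.0 | 96.0 | 90.0 |
| max | 100.0 | 100.0 | 90.0 |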


* Get average of all the students’ grades on each test

In [31]:
grades.T.mean()

Test1    92.8
Test2    85.8
Test3    83.4
dtype: float64

### Sorting Rows by Their Indices
* Can sort a `DataFrame` by its rows or columns, based on their indices or values
* Sort the rows by their _indices_ in _descending_ order using **`sort_index`** and its keyword argument `ascending=False` 

In [32]:
grades.sort_index(ascending=False)

|  | Wally | Eva | Sam | Katie | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test3 | 70 | 90 | 90 | 82 | 85 |
| Test2 | 96 | 100 | 87 | 81 | 65 |
| Test1 | 87 | 100 | 94 | 100 | 83 |


### Sorting by Column Indices
* Sort columns into ascending order (left-to-right) by their column names
* **`axis=1` keyword argument** indicates that we wish to sort the _column_ indices, rather than the row indices
    * `axis=0` (the default) sorts the _row_ indices

In [33]:
grades.sort_index(axis=1)

|  | Bob | Eva | Katie | Sam | Wally |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 83 | 100 | 100 | 94 | 87 |
| Test2 | 65 | 100 | 81 | 87 | 96 |
| Test3 | 85 | 90 | 82 | 90 | 70 |


### Sorting by Column Values
* To view `Test1`’s grades in descending order so we can see the students’ names in highest-to-lowest grade order, call method **`sort_values`**
* `by` and `axis` arguments work together to determine which values will be sorted
    * In this case, we sort based on the column values (`axis=1`) for `Test1`

In [34]:
grades.sort_values(by='Test1', axis=1, ascending=False)

|  | Eva | Katie | Sam | Wally | Bob |
| :--- | ---: | ---: | ---: | ---: | ---: |
| Test1 | 100 | 100 | 94 | 87 | 83 |
| Test2 | 100 | 81 | 87 | 96 | 65 |
| Test3 | 90 | 82 | 90 | 70 | 85 |


* Might be easier to read the grades and names if they were in a column
* Sort the transposed `DataFrame` instead

In [35]:
grades.T.sort_values(by='Test1', ascending=False)

|  | Test1 | Test2 | Test3 |
| :--- | ---: | ---: | ---: |
| Eva | 100 | 100 | 90 |
| Katie | 100 | 81 | 82 |
| Sam | 94 | 87 | 90 |
| Wally | 87 | 96 | 70 |
| Bob | 83 | 65 | 85 |


* Since we’re sorting only `Test1`’s grades, we might not want to see the other tests at all
* Combine selection with sorting

In [36]:
grades.loc['Test1'].sort_values(ascending=False)

Katie    100
Eva      100
Sam       94
Wally     87
Bob       83
Name: Test1, dtype: int64

### Copy vs. In-Place Sorting
* `sort_index` and `sort_values` return a _copy_ of the original `DataFrame`
* Could require substantial memory in a big data application
* Can sort _in place_ by passing the keyword argument `inplace=True` 
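
The following sketch, using a hypothetical two-student subset of the grades, contrasts the two behaviors. Note that with `inplace=True` the method returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

grades = pd.DataFrame({'Wally': [87, 96, 70], 'Eva': [100, 87, 90]},
                      index=['Test1', 'Test2', 'Test3'])

# Default behavior: a sorted copy is returned and grades is unchanged.
sorted_copy = grades.sort_index(ascending=False)
print(list(grades.index))       # ['Test1', 'Test2', 'Test3']
print(list(sorted_copy.index))  # ['Test3', 'Test2', 'Test1']

# In-place sorting: grades itself is reordered and None is returned.
result = grades.sort_index(ascending=False, inplace=True)
print(result)                   # None
print(list(grades.index))       # ['Test3', 'Test2', 'Test1']
```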

# 7.17 Intro to Data Science: `pandas` Series and `DataFrames`
* NumPy’s `array` is optimized for homogeneous numeric data that’s accessed via integer indices
* Big data applications must support mixed data types, customized indexing, missing data, data that’s not structured consistently and data that needs to be manipulated into forms appropriate for the databases and data analysis packages you use
* **Pandas** is the most popular library for dealing with such data
* Two key collections 
    * **`Series`** for one-dimensional collections 
    * **`DataFrame`s** for two-dimensional collections
* NumPy and pandas are intimately related
    * `Series` and `DataFrame`s use `array`s “under the hood” 
    * `Series` and `DataFrame`s are valid arguments to many NumPy operations
    * `array`s are valid arguments to many `Series` and `DataFrame` operations
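
A brief sketch of this relationship follows. It passes a `Series` to the NumPy universal function `sqrt`, retrieves the underlying values with the `Series` method `to_numpy`, and initializes a `Series` from an `array`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

grades = pd.Series([87, 100, 94], index=['Wally', 'Eva', 'Sam'])

# A NumPy universal function accepts a Series and returns a Series
# that keeps the original index.
roots = np.sqrt(grades)
print(type(roots).__name__)  # Series
print(list(roots.index))     # ['Wally', 'Eva', 'Sam']

# to_numpy returns the values as a NumPy array.
print(type(grades.to_numpy()).__name__)  # ndarray

# An array is a valid initializer for a Series.
from_array = pd.Series(np.arange(1, 4))
print(from_array.tolist())   # [1, 2, 3]
```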