In this checkpoint, we'll get up to speed on the basics of working with data in NumPy and Pandas. NumPy is Python's premier scientific computing library, and Pandas is built on top of it. NumPy provides built-in data structures for working with arrays of data, while Pandas gives us *dataframes*, which are a convenient way of representing and interacting with tabular data.

Here's what we'll cover in this checkpoint:

* arrays
* element-wise functions
* aggregator functions
* dataframes
* selecting and grouping data


## Install NumPy if you haven't already

You should have installed NumPy on your local environment in the previous Unit. If you don't yet have NumPy installed, install it now with `pip install numpy`.

## Importing NumPy

Once NumPy is installed we have to import the package into our current environment to actually use it. We do this with an `import` statement. When importing a package you also have the opportunity to set an abbreviation for recalling its functions. Many packages have standard shorthand, which we will follow. The shorthand for `numpy` is simply `np`. You can import and set the abbreviation like this:


In [4]:
import numpy as np



Now that the package is installed on our machine and imported into the environment, we're ready to start working with NumPy.

Before we do that, however, it is worth making a note about where import statements belong. Import statements will work at any point in a script or in any cell of a notebook. However, [Python style](https://www.python.org/dev/peps/pep-0008/#imports) recommends that they appear at the beginning of the script or in the first cell (most notebooks use the first cell just for this purpose). This makes it easy to check that the necessary packages are installed and keeps track of them in a single place.


## Arrays

As we said above, NumPy is the fundamental package for storing and manipulating mathematical data in Python. NumPy primarily accomplishes this with a new data structure: the _array_.

A NumPy array can be thought of as a Python list with additional mathematical functionality and properties. Like nested lists, an array can have multiple dimensions. A one-dimensional array works like an ordered sequence of values. Arrays use bracket notation `[` `]` to access items by index, just like lists and strings. We can create an array by calling `np.array()` and passing in a sequence of values. A list, for example:



In [5]:
x = np.array([0, 1, 2, 3])
x

array([0, 1, 2, 3])
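
To make the bracket notation mentioned above concrete, here is a minimal sketch of reading items out of an array by position. The variable names (`first`, `middle`, `item`, and so on) are only illustrative.

```python
import numpy as np

x = np.array([0, 1, 2, 3])

# Single items are read by position, counting from zero.
first = x[0]        # 0
last = x[-1]        # 3 (negative positions count from the end)

# A slice selects a range of positions; the stop position is excluded.
middle = x[1:3]     # array([1, 2])

# A two-dimensional array takes one position per dimension: row, then column.
w = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
item = w[1, 2]      # 6: second row, third column
first_row = w[0]    # array([0, 1, 2, 3])

print(first, last, middle, item, first_row)
```

Indexing a two-dimensional array with a single position, as in `w[0]`, returns a whole row.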

Remember that you can run and re-run these code cells individually. A good shortcut for running a cell is pressing <kbd>shift</kbd> + <kbd>enter</kbd> from within the cell.

You can add multiple dimensions to your array either by passing a list of lists to `np.array()` or by reshaping the output of `np.arange()`. Here are two ways to generate the same thing:

In [6]:
w = np.array([[0, 1, 2, 3],[4, 5, 6, 7]])
w

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [7]:
y = np.arange(8).reshape(2, 4)
y

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

`np.arange()` ("arange" is short for "array range") works similarly to the basic Python `range()` function by generating a sequence of integers, starting at 0 by default and incrementing by 1. But instead of returning a `range` object it returns an array. We're taking advantage of that by calling the `.reshape()` [array method](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.reshape.html) to reshape the initial eight-item array into two rows of four items each.

Feel free to play with it. Using `arange()` and `.reshape()` is a common way to create arrays.
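
If you want a starting point for experimenting, the sketch below shows a few variations on `np.arange()` and `.reshape()`. The names `evens`, `flat`, `grid`, and `tall` are made up for this example.

```python
import numpy as np

# arange accepts start, stop, and step, just like range().
evens = np.arange(2, 10, 2)          # 2, 4, 6, 8

# .shape reports the size of each dimension as a tuple.
flat = np.arange(8)
grid = flat.reshape(2, 4)
print(flat.shape, grid.shape)        # (8,) (2, 4)

# Passing -1 asks NumPy to work out that dimension from the array's size.
tall = flat.reshape(-1, 2)
print(tall.shape)                    # (4, 2)
```

The new shape has to account for every item: eight items fit a 2 by 4 or a 4 by 2 layout, but not a 3 by 3 one.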


## Element-wise and Aggregator functions

Now that you've seen the basic data structure of NumPy we can introduce a few other functionalities of this package. NumPy's primary value is the ability to do more sophisticated arithmetic than basic Python can do out of the box. NumPy allows you to do these computations in two ways: with *element-wise* functions that process array elements one at a time and then return a new array, and with *aggregator* functions that process the array into a single value the function returns.

Let's play with a simple array and look at what's possible. First, some element-wise functions that return a new array:

In [8]:
x = np.array([0, 1, 2, 3])

# Square each value.
print(np.square(x))

# Square root of each value.
print(np.sqrt(x))

# Cosine of each value.
print(np.cos(x))


[0 1 4 9]
[0.         1.         1.41421356 1.73205081]
[ 1.          0.54030231 -0.41614684 -0.9899925 ]


Note that these functions return arrays of the same length as the input array, much as the built-in Python function `map()` produces one result per input item. Use element-wise functions when you want to transform each individual element in an array and get back a collection of all the results.
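
As a rough comparison, the sketch below squares the same values once with plain Python's `map()` and once with NumPy. It also shows that ordinary arithmetic operators act element-wise on arrays. The variable names are illustrative only.

```python
import numpy as np

x = np.array([0, 1, 2, 3])

# Plain Python: map() applies the function lazily, so we wrap it in list().
squares_list = list(map(lambda value: value ** 2, x.tolist()))

# NumPy: the element-wise function returns an array directly.
squares_array = np.square(x)

# Arithmetic operators on arrays are element-wise as well.
doubled_plus_one = x * 2 + 1

print(squares_list)       # [0, 1, 4, 9]
print(squares_array)      # [0 1 4 9]
print(doubled_plus_one)   # [1 3 5 7]
```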

Here are some aggregator functions that aggregate the elements of an array and return a single value:

In [9]:
x = np.array([0, 1, 2, 3])

# Find the maximum value.
print(np.max(x))

# Find the minimum value.
print(np.min(x))

# Find the mean of the input array.
print(np.mean(x))

# Find the standard deviation of the input array.
print(np.std(x))

3
0
1.5
1.118033988749895


Note that these aggregator functions return single values, rather than arrays. That is what an _aggregator_ does. It takes a set of multiple data values and condenses them (or aggregates them) into a single value according to some rule. So `np.min()` returns the minimum value of all the data given to it, `np.mean()` the mean, and so on.
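
On a multi-dimensional array, the same aggregators also accept an `axis` argument. In that case they return one value per column or per row rather than a single value for the whole array. This goes slightly beyond the examples above, so treat it as an optional sketch.

```python
import numpy as np

y = np.arange(8).reshape(2, 4)   # [[0, 1, 2, 3], [4, 5, 6, 7]]

# With no axis, the whole array collapses to one value.
total = np.sum(y)                # 28

# axis=0 aggregates down each column, giving one value per column.
column_sums = np.sum(y, axis=0)  # [4, 6, 8, 10]

# axis=1 aggregates across each row, giving one value per row.
row_means = np.mean(y, axis=1)   # [1.5, 5.5]

print(total, column_sums, row_means)
```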

These are some of the basic functions of NumPy, but there are many more. We'll continue to use the package throughout the course and learn more as we go. If you'd like to know more now you can look through the [NumPy documentation](https://docs.scipy.org/doc/).


## Importing Pandas

Pandas is probably the most heavily utilized package in Python for data scientists. Built on top of NumPy, it is essential to data manipulation, organization, and modeling. Here we'll introduce some of its core functionalities as well as its primary data structure: the data frame. First we have to import the package, which typically gets the abbreviation `pd`.

In [10]:
import pandas as pd

## The Data Frame

The *data frame* is like a NumPy array, with a few additional features like column names and row indexing. It is probably the primary way data scientists handle data. You can create a data frame in many different ways: from CSV files, by querying databases, or explicitly in code. For your first data frame, let's create a 2-dimensional NumPy array. Then, to create a data frame, use the `pd.DataFrame()` function and pass in the NumPy array:

In [11]:
my_array = np.array([['Montgomery','Yellowhammer state',52423],
                     ['Sacramento','Golden state',163707],
                     ['Oklahoma City','Sooner state',69960]])
df = pd.DataFrame(my_array)
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Montgomery | Yellowhammer state | 52423 |
| 1 | Sacramento | Golden state | 163707 |
| 2 | Oklahoma City | Sooner state | 69960 |


Now you have your first data frame!
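
One caveat is worth knowing here: a NumPy array holds a single data type, so mixing strings and integers in one array turns the integers into text. The sketch below inspects the array's type, converts the numeric column back to integers, and shows one alternative way to build a frame (from a dictionary of lists) that keeps each column's own type. The names `area` and `states` are made up for this example.

```python
import numpy as np
import pandas as pd

my_array = np.array([['Montgomery', 'Yellowhammer state', 52423],
                     ['Sacramento', 'Golden state', 163707],
                     ['Oklahoma City', 'Sooner state', 69960]])

# The array has one dtype for every element; kind 'U' means Unicode text.
print(my_array.dtype.kind)

df = pd.DataFrame(my_array)

# Column 2 holds text such as '52423', so convert it before doing arithmetic.
area = df[2].astype(int)
print(area.sum())             # 286090

# Building the frame from a dict of lists keeps each column's own type.
states = pd.DataFrame({
    'Capital': ['Montgomery', 'Sacramento', 'Oklahoma City'],
    'Area': [52423, 163707, 69960],
})
print(states['Area'].sum())   # 286090
```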

If you're familiar with Excel, much of Pandas may be familiar to you. Data frames are organized into rows and columns, both of which can be named. Columns are labeled with column names, rows with an index (integers starting at zero by default). You can set both column names and index labels explicitly during the creation of the data frame or after the fact. Let's set both for `df` from above.

In [12]:
df.columns = ['Capital', 'Nickname','Area']
df.index = ['Alabama', 'California', 'Oklahoma']
df

| | Capital | Nickname | Area |
|---|---|---|---|
| Alabama | Montgomery | Yellowhammer state | 52423 |
| California | Sacramento | Golden state | 163707 |
| Oklahoma | Oklahoma City | Sooner state | 69960 |


There, that looks better.

You can also set column and index names through the `columns=` and `index=` keyword arguments when you call the `pd.DataFrame()` function to initially construct the data frame.

In [13]:
df2 = pd.DataFrame(
    my_array,
    columns=['Capital', 'Nickname','Area'],
    index=['Alabama', 'California', 'Oklahoma'])
df2

| | Capital | Nickname | Area |
|---|---|---|---|
| Alabama | Montgomery | Yellowhammer state | 52423 |
| California | Sacramento | Golden state | 163707 |
| Oklahoma | Oklahoma City | Sooner state | 69960 |


Whichever method you use, we now have a data frame with labeled rows and columns. This will be useful for working with data frames, because it makes these elements easy to refer to by name and makes your code more natural to write and simpler to read.

<div class="note">Note: you're probably used to seeing a space around <code>=</code> when used for assignment and <code>==</code>, which is used for comparison. In Python, the custom is to <a href="https://www.python.org/dev/peps/pep-0008/#other-recommendations">omit spaces</a> around <code>=</code> with keyword arguments to improve readability and make it easy to distinguish keyword arguments from variable assignments. </div>

## Adding More Data

To show what data frames can really do we're going to need to make something a little bit bigger. Let's assemble a data frame with named columns via lists. You can create an empty data frame by calling the `pd.DataFrame()` function and passing in the index labels you'd like to use as row names, then add columns using `df['COLUMN_NAME'] = [LIST_OF_VALUES]`. For example:

In [14]:
# This list will become our row names.
names = ['George',
         'John',
         'Thomas',
         'James',
         'Andrew',
         'Martin',
         'William',
         'Zachary',
         'Millard',
         'Franklin']

# Create an empty data frame with named rows.
purchases = pd.DataFrame(index=names)

# Add our columns to the data frame one at a time.
purchases['country'] = ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US']
purchases['ad_views'] = [16, 42, 32, 13, 63, 19, 65, 23, 16, 77]
purchases['items_purchased'] = [2, 1, 0, 8, 0, 5, 7, 3, 0, 5]
purchases 

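| | country | ad_views | items_purchased |
|---|---|---|---|
| George | US | 16 | 2 |
| John | CAN | 42 | 1 |
| Thomas | CAN | 32 | 0 |
| James | US | 13 | 8 |
| Andrew | CAN | 63 | 0 |
| Martin | US | 19 | 5 |
| William | US | 65 | 7 |
| Zachary | US | 23 | 3 |
| Millard | CAN | 16 | 0 |
| Franklin | US | 77 | 5 |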


Let's say this is the purchase and browsing history for several users of an ecommerce website for a given year. `ad_views` is the number of ads they've viewed on the site and `items_purchased` is the number of items they've bought that year.

Now we have a data frame we can do something with. First note that you can call out a column as a series using *either* dot notation *or* bracket notation: `df.column_name` or `df['column_name']` both work. So `purchases['country']` returns the country of each user who visited the ecommerce website, as does `purchases.country`. Bracket notation is generally preferred, since it works for any column name, and we'll use bracket notation here.
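
To see why bracket notation is the safer habit, consider the sketch below. It uses a small hypothetical frame (not the `purchases` data) with two awkward column names: one containing a space, and one that shares its name with a data frame method.

```python
import pandas as pd

# A small hypothetical frame with two awkward column names.
df = pd.DataFrame({
    'country': ['US', 'CAN'],
    'ad views': [16, 42],      # contains a space
    'count': [2, 1],           # same name as the DataFrame.count() method
})

# For an ordinary name, both notations return the same Series.
print(df['country'].equals(df.country))   # True

# Bracket notation handles any column name.
print(df['ad views'].tolist())            # [16, 42]
print(df['count'].tolist())               # [2, 1]

# Dot notation finds the method instead of the column.
print(callable(df.count))                 # True
```

A name with a space cannot be written in dot notation at all, and `df.count` silently gives you the method rather than your data.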

Pandas also makes it very easy to create a new column out of our previous data. Let's say we want to create a column of the average items purchased per ad view, and call the column `items_purch_per_ad`. We can do that with this one-liner:

In [15]:
purchases['items_purch_per_ad'] = purchases['items_purchased'] / purchases['ad_views']
purchases

| | country | ad_views | items_purchased | items_purch_per_ad |
|---|---|---|---|---|
| George | US | 16 | 2 | 0.125 |
| John | CAN | 42 | 1 | 0.02381 |
| Thomas | CAN | 32 | 0 | 0.0 |
| James | US | 13 | 8 | 0.615385 |
| Andrew | CAN | 63 | 0 | 0.0 |
| Martin | US | 19 | 5 | 0.263158 |
| William | US | 65 | 7 | 0.107692 |
| Zachary | US | 23 | 3 | 0.130435 |
| Millard | CAN | 16 | 0 | 0.0 |
| Franklin | US | 77 | 5 | 0.064935 |


If we just want to _see_ those values and don't need to _store_ them as a new column in our data frame, we can evaluate that expression without assigning it to `purchases['items_purch_per_ad']`. It returns a series of labeled values giving the name and the purchases per ad for each user.

In [16]:
purchases['items_purchased'] / purchases['ad_views']

George      0.125000
John        0.023810
Thomas      0.000000
James       0.615385
Andrew      0.000000
Martin      0.263158
William     0.107692
Zachary     0.130435
Millard     0.000000
Franklin    0.064935
dtype: float64

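## Selecting from a dataframe

To start from a clean slate, the cell below rebuilds the `purchases` data frame from the previous section, without the derived `items_purch_per_ad` column.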


In [1]:
import pandas as pd
import numpy as np


names = ['George',
         'John',
         'Thomas',
         'James',
         'Andrew',
         'Martin',
         'William',
         'Zachary',
         'Millard',
         'Franklin']
purchases = pd.DataFrame(index=names)
purchases['country'] = ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US']
purchases['ad_views'] = [16, 42, 32, 13, 63, 19, 65, 23, 16, 77]
purchases['items_purchased'] = [2, 1, 0, 8, 0, 5, 7, 3, 0, 5]

A data frame is great in and of itself. As we've shown, it allows you to easily store data with clearly labeled rows and columns. However, sometimes you just want to work with a subset of that data. For that you will want to either select or group data.

We've actually already introduced the most basic form of indexing, or "selection": the bracketed selection of a column by name. Recall what we did before on our purchases data:

In [2]:
purchases['country']

George       US
John        CAN
Thomas      CAN
James        US
Andrew      CAN
Martin       US
William      US
Zachary      US
Millard     CAN
Franklin     US
Name: country, dtype: object

## Basic Selects with `.loc` and `.iloc`

[`.loc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) is a selector that indexes over rows and columns by label. It selects over the row index first, then the column name (if included). [`.iloc`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) does the same thing but by integer position instead of by label. For example, to select the row for `'George'` in our purchases data frame, we just pass the string `'George'` in to `purchases.loc` with bracket notation:

In [3]:
purchases.loc['George']

country            US
ad_views           16
items_purchased     2
Name: George, dtype: object

In [4]:
purchases.loc[:, 'country']

George       US
John        CAN
Thomas      CAN
James        US
Andrew      CAN
Martin       US
William      US
Zachary      US
Millard     CAN
Franklin     US
Name: country, dtype: object

The second example selects the entire `country` column. The `:` works just like it would when slicing a list or string, and selects all rows from start to finish of the data frame.

Lastly, to select George's country, we'd combine the two like this:

In [5]:
purchases.loc['George', 'country']

'US'
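
Beyond single labels, `.loc` also accepts lists of labels and label slices. The sketch below rebuilds the same `purchases` data so it can run on its own. The names `subset` and `first_three` are illustrative.

```python
import pandas as pd

names = ['George', 'John', 'Thomas', 'James', 'Andrew',
         'Martin', 'William', 'Zachary', 'Millard', 'Franklin']
purchases = pd.DataFrame({
    'country': ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US'],
    'ad_views': [16, 42, 32, 13, 63, 19, 65, 23, 16, 77],
    'items_purchased': [2, 1, 0, 8, 0, 5, 7, 3, 0, 5],
}, index=names)

# A list of labels selects several rows (or columns) at once.
subset = purchases.loc[['George', 'James'], ['country', 'ad_views']]
print(subset)

# Label slices include BOTH endpoints, unlike list slices.
first_three = purchases.loc['George':'Thomas', 'ad_views']
print(first_three.tolist())   # [16, 42, 32]
```

Note the difference from ordinary Python slicing: a label slice such as `'George':'Thomas'` includes the row for `'Thomas'`.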

As we mentioned above, you can also do integer indexing, as done on lists, over both rows and columns using `.iloc`. For example:

In [6]:
purchases.iloc[1:3, 1]

John      42
Thomas    32
Name: ad_views, dtype: int64

Note we used the slicing syntax again above, this time starting with the second row (position 1) and going up to, but not including, the fourth row (position 3). The final `1` then picks the second column, `ad_views`; the index does not count as a column.
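
To compare position-based and label-based selection side by side, here is a short sketch on the same `purchases` data, rebuilt so the block runs on its own. The variable names are illustrative.

```python
import pandas as pd

names = ['George', 'John', 'Thomas', 'James', 'Andrew',
         'Martin', 'William', 'Zachary', 'Millard', 'Franklin']
purchases = pd.DataFrame({
    'country': ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US'],
    'ad_views': [16, 42, 32, 13, 63, 19, 65, 23, 16, 77],
    'items_purchased': [2, 1, 0, 8, 0, 5, 7, 3, 0, 5],
}, index=names)

# Position 0 is the first row, so these two selections match.
same_row = purchases.iloc[0].equals(purchases.loc['George'])
print(same_row)                  # True

# Negative positions count from the end, as with lists.
last_row_name = purchases.iloc[-1].name
print(last_row_name)             # Franklin

# Integer slices exclude the stop position: rows 1 and 2, column 1.
views = purchases.iloc[1:3, 1]
print(views.tolist())            # [42, 32]

# Lists of positions pick out specific rows and columns.
corner = purchases.iloc[[0, 9], [0, 2]]
print(corner)
```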

## Conditional Selection

You can also use `.loc` for conditional selection, or selecting all the entries that meet a given criterion. This will use __lambda__, which is a construction that allows for defining anonymous (unnamed) functions inline. We use the lambda function to create a condition on the row or column.

<div class="note">Note: We'll introduce the lambda syntax below but won't use it much in the prep course and won't go deeply into it here. If you're interested in learning more about using <code>lambda</code> to create anonymous functions see the terse <a href="https://docs.python.org/3.6/tutorial/controlflow.html#lambda-expressions">Python documentation tutorial section on lambda</a>, or this <a href="https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/">more detailed tutorial</a>.</div>

Let's return once more to the purchases data frame. For our example, let's say we want all the columns for individuals who purchased more than one item. That ends up being a relatively simple line of code.


In [14]:
purchases.loc[lambda df: df['items_purchased'] > 1, :]

| | country | ad_views | items_purchased |
|---|---|---|---|
| George | US | 16 | 2 |
| James | US | 13 | 8 |
| Martin | US | 19 | 5 |
| William | US | 65 | 7 |
| Zachary | US | 23 | 3 |
| Franklin | US | 77 | 5 |


We are selecting rows, so the lambda is the first item in the brackets. `.loc` calls the lambda with the data frame itself, which the lambda receives as its parameter `df`. The body of the lambda is the condition that each row is tested against. The `, :` is the same slicing syntax as above and means we want all columns.
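
A lambda is just a function without a name, so the same selection can be written with an ordinary named function. The sketch below shows both forms giving the same result. The function name `bought_more_than_one` is made up for this example.

```python
import pandas as pd

names = ['George', 'John', 'Thomas', 'James', 'Andrew',
         'Martin', 'William', 'Zachary', 'Millard', 'Franklin']
purchases = pd.DataFrame({
    'country': ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US'],
    'ad_views': [16, 42, 32, 13, 63, 19, 65, 23, 16, 77],
    'items_purchased': [2, 1, 0, 8, 0, 5, 7, 3, 0, 5],
}, index=names)


def bought_more_than_one(df):
    """Return a boolean Series: True where a user bought more than one item."""
    return df['items_purchased'] > 1


# .loc calls whichever function it is given, passing in the data frame.
with_lambda = purchases.loc[lambda df: df['items_purchased'] > 1, :]
with_function = purchases.loc[bought_more_than_one, :]

print(with_lambda.equals(with_function))   # True
print(with_lambda.index.tolist())
```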

There is a simpler way to do this, using boolean logic, and it is also quite common.

In [15]:
purchases[purchases['items_purchased'] > 1]

| | country | ad_views | items_purchased |
|---|---|---|---|
| George | US | 16 | 2 |
| James | US | 13 | 8 |
| Martin | US | 19 | 5 |
| William | US | 65 | 7 |
| Zachary | US | 23 | 3 |
| Franklin | US | 77 | 5 |


This follows similar logic, but the lack of explicit indexing makes it slightly less robust: plain brackets select columns when given a name and rows when given a boolean series, whereas `.loc` always states rows first, then columns. This latter [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing) may be more common and is easily readable. Choose your tradeoffs wisely.
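
It can help to look at the boolean series on its own and to see how conditions combine. The sketch below rebuilds the `purchases` data so it runs by itself. The names `mask`, `us_heavy_viewers`, and `no_repeat_buyers` are illustrative.

```python
import pandas as pd

names = ['George', 'John', 'Thomas', 'James', 'Andrew',
         'Martin', 'William', 'Zachary', 'Millard', 'Franklin']
purchases = pd.DataFrame({
    'country': ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US'],
    'ad_views': [16, 42, 32, 13, 63, 19, 65, 23, 16, 77],
    'items_purchased': [2, 1, 0, 8, 0, 5, 7, 3, 0, 5],
}, index=names)

# The comparison itself produces one True/False value per row.
mask = purchases['items_purchased'] > 1
print(mask.head(3).tolist())               # [True, False, False]

# Combine conditions with & (and) or | (or); each needs its own parentheses.
us_heavy_viewers = purchases[(purchases['country'] == 'US') &
                             (purchases['ad_views'] > 20)]
print(us_heavy_viewers.index.tolist())     # ['William', 'Zachary', 'Franklin']

# ~ inverts a mask, keeping the rows where the condition is False.
no_repeat_buyers = purchases[~mask]
print(no_repeat_buyers.index.tolist())     # ['John', 'Thomas', 'Andrew', 'Millard']
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.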

## Groups

There's one last thing we'll introduce here, and that is grouping and aggregation. You can create groups in your data frame using the [`.groupby()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method and passing in the column name. Let's try it.

If we wanted to group by the country of the site user, all we'd have to do is:

In [16]:
purchases.groupby('country')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f900ac7a630>

But wait, when you run that line, it doesn't return your data any more. It returns a `DataFrameGroupBy` object, and the notebook displays only a reference to that object rather than its contents.

That's because if we want it to return something we have to do something with those groups. There are several methods that you can use here. Some are built in, like `.sum()` or `.count()`. For even greater possibilities you can use `.aggregate(numpy_function)`. Let's use this to find out which group has more ad views and purchases.

In [18]:
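purchases.groupby('country').aggregate(np.mean)

# Recent pandas versions may emit a FutureWarning for NumPy callables here;
# the string form .aggregate('mean') computes the same result.

# Don't want to take the mean of all columns? Try this:
# purchases.groupby('country')['column_name'].mean()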

| country | ad_views | items_purchased |
|---|---|---|
| CAN | 38.25 | 0.25 |
| US | 35.5 | 5.0 |


Now you can see the mean of each column for each country. It seems that Canadian visitors view slightly more ads per person but purchase far fewer items.
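
The built-in group methods mentioned earlier work the same way. The sketch below shows a few of them on the `purchases` data, rebuilt here so the block runs on its own. The variable names are illustrative.

```python
import pandas as pd

names = ['George', 'John', 'Thomas', 'James', 'Andrew',
         'Martin', 'William', 'Zachary', 'Millard', 'Franklin']
purchases = pd.DataFrame({
    'country': ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US'],
    'ad_views': [16, 42, 32, 13, 63, 19, 65, 23, 16, 77],
    'items_purchased': [2, 1, 0, 8, 0, 5, 7, 3, 0, 5],
}, index=names)

grouped = purchases.groupby('country')

# Total items purchased in each country.
items_by_country = grouped['items_purchased'].sum()
print(items_by_country)          # CAN 1, US 30

# Number of users (rows) in each country.
users_by_country = grouped.size()
print(users_by_country)          # CAN 4, US 6

# Several aggregations of one column at once, named by string.
summary = grouped['ad_views'].agg(['mean', 'max'])
print(summary)
```

Selecting a column before aggregating, as in `grouped['ad_views']`, limits the result to that column.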

These are the fundamentals of selecting data inside a data frame. For a deep dive, see the [pandas documentation on Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing).

