# Querying `Series`

Now that we have seen how to create a Series and how data is represented in one, we will want to apply operations like searching for values in the series. In this notebook we consider the structure of the Series, how to query and merge Series objects together, and the importance of thinking about parallelization when engaging in data science programming.

A pandas Series can be queried either by the index position or the index label. If you don't give the series an index when you create it, the position and the label are effectively the same values. To query by numeric location, starting at zero, use the iloc attribute. To query by the index label, use the loc attribute.

In [None]:
# Let's start with an example. We'll use students enrolled in classes coming from a dictionary
import pandas as pd
students_classes = {'Alice': 'Programming',
                   'Bob': 'Cryptography',
                   'Carol': 'Networking',
                   'Dave': 'Databases'}
s = pd.Series(students_classes)
s

In [None]:
# So, for this series, if we wanted to see the fourth entry we would use the iloc
# attribute with the parameter 3.
s.iloc[3]

In [None]:
# If we wanted to see what class Carol has, we would use the loc attribute with a parameter
# of 'Carol'.
s.loc['Carol']

__Important__ Note that `iloc` and `loc` are not methods, they are attributes. So you don't use parentheses to query them, but square brackets instead, which is called the indexing operator. This calls a getter or setter for an item depending on the context of its use.
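To make the getter and setter behaviour concrete, here is a small sketch that rebuilds the student series and uses square brackets both to read and to write. The replacement class names are made up for illustration.

```python
import pandas as pd

s = pd.Series({'Alice': 'Programming',
               'Bob': 'Cryptography',
               'Carol': 'Networking',
               'Dave': 'Databases'})

# Square brackets inside an expression read a value (getter)
carol_class = s.loc['Carol']   # query by label
fourth_class = s.iloc[3]       # query by position

# Square brackets on the left of an assignment write a value (setter)
s.loc['Carol'] = 'Security'    # hypothetical new class for Carol
s.iloc[3] = 'Data Mining'      # hypothetical new class for the fourth entry

print(carol_class, fourth_class)
print(s)
```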

This might seem confusing if you are used to languages, such as Java, where encapsulation of attributes, variables, and properties is common.

Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator directly on the series itself. For instance, if you pass in an integer parameter, the operator will infer that you want to query via the iloc attribute. That said, this facility can lead to confusion and will be removed in a future version of pandas.

In [None]:
s[3]

This also raises an important point:

__Nearly all machine learning libraries are being developed actively. APIs and language features are subject to change. Some of those changes can break existing code. Deprecation warnings require attention to avoid future difficulties. You have been warned!__

In [None]:
# If you pass in an object, it will query as if you wanted to use the label based loc attribute.
s['Carol']

So what happens if your index is a list of integers? This is a bit more complicated, because pandas can't determine automatically whether you intend to query by index position or index label. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the iloc or loc attributes directly.

Here's an example using classes and their class codes, where the classes are indexed by class code in the form of integers.

In [None]:
class_code = {99: 'Programming',
              100: 'Cryptography',
              101: 'Networking',
              102: 'Databases'}
s = pd.Series(class_code)

If we try to call s[0] we get a key error, because there's no item in the series with an index label of zero. Instead, we have to call iloc explicitly if we want the first item.

In [None]:
s[0] 

So, that didn't call s.iloc[0] underneath as one might expect. Instead, it generates an error.
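As a short sketch of the explicit alternative, the same class code series can be queried unambiguously with iloc for positions and loc for labels. The variable names here are only for illustration.

```python
import pandas as pd

s = pd.Series({99: 'Programming',
               100: 'Cryptography',
               101: 'Networking',
               102: 'Databases'})

first_by_position = s.iloc[0]   # the first item, whatever its label is
by_label = s.loc[99]            # the item whose index label is 99

# With an integer index, s[0] is treated as a label lookup and fails
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, by_label, lookup_failed)
```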

Now that we know how to get data out of the series, how about working with the data? A common task is to consider all of the values inside a series and do some sort of operation. This could be trying to find a certain number, summarizing the data, or transforming the data in some way.

A typical programmatic approach to this would be to iterate over all the items in the series and invoke the operation one is interested in. For instance, we could create a Series of integers representing student grades and compute the average grade.

In [None]:
grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total+=grade
print(total/len(grades))

This works, but it's slow. Modern computers can do many tasks simultaneously, especially, but not only, mathematical computation on collections of numbers.

Pandas and the underlying numpy libraries support a method of computation called vectorization. Vectorization works with most of the functions in the numpy library, including the sum function.

Here's how we would rewrite the code using the numpy `sum()` function. First we need to import the numpy module.

In [None]:
import numpy as np

# Then we just call np.sum and pass in an iterable item.
# In this case, our pandas Series.

total = np.sum(grades)
print(total/len(grades))
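The Series object also carries its own vectorized methods, so the same average can be sketched without calling numpy directly. This is just an alternative spelling of the computation above, using the `sum()` and `mean()` methods of Series.

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Vectorized sum, then divide by the number of grades
average_from_sum = grades.sum() / len(grades)

# Or let pandas compute the mean in one step
average = grades.mean()

print(average_from_sum, average)
```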

Both of these approaches produce the same value, but is one actually faster? The Jupyter Notebook has a magic function which can help.

First, let's create a big series of random numbers. This is done a lot when demonstrating techniques with Pandas.

In [None]:
numbers = pd.Series(np.random.randint(0,1000,10000))

Now let's look at the top five items in that series to make sure they actually seem random. We can do this with the `head()` method.

In [None]:
numbers.head()

We can also verify that the length of the series is correct using the len function.

In [None]:
len(numbers)

Ok, we're confident now that we have a big series. The interpreter has something called __magic functions__ that begin with a percent sign. If you type this sign and then hit the Tab key, you can see a list of the available magic functions. You could write your own magic functions too, but that is out of scope here.

A __cell magic function__ starts with two percent signs and applies to the code in the current notebook cell. The function we use is called `timeit`. This function will run our code a few times to determine, on average, how long it takes.

Let's run `timeit` with our original iterative code. You can give timeit the number of loops that you would like to run with the `-n` option. If you leave it out, timeit chooses a suitable number of loops automatically. For this purpose, 100 loops is sufficient. Note that in order to use a cell magic function, it has to be the first line in the cell.

In [None]:
%%timeit -n 100
total = 0
for number in numbers:
    total+=number

total/len(numbers)

Timeit ran the code and it doesn't seem to take very long. Will vectorization do better?

In [None]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

On my machine, vectorization is 10-15 times faster than the loop code. Students should look for vectorization opportunities, favor parallel computing features and start thinking in functional programming terms.
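Cell magics only exist inside IPython and Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard library `timeit` module can time two functions. The function names `loop_mean` and `vectorized_mean` are made up for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean():
    # Iterative version: visit every value one at a time
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

def vectorized_mean():
    # Vectorized version: let numpy sum the whole series at once
    return np.sum(numbers) / len(numbers)

# Run each function 100 times and report the total elapsed seconds
loop_seconds = timeit.timeit(loop_mean, number=100)
vector_seconds = timeit.timeit(vectorized_mean, number=100)

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```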

Vectorization allows the computer to apply one operation to many values at once, and with high performance chips, especially graphics cards, you can get dramatic speedups. Modern graphics cards can run thousands of operations in parallel.

A related feature in pandas and numpy is called _broadcasting_. With broadcasting, you can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every random number by 2, we could do so quickly using the += operator directly on the Series object.

In [None]:
# Let's look at the head of our series
numbers.head()

In [None]:
# And now let's just increase everything in the series by 2
numbers+=2
numbers.head()
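Broadcasting is not limited to `+=`. As a small sketch on a short series of grades (the values and variable names are chosen only for illustration), other arithmetic and comparison operators are applied to every element too, and the non-assigning forms return a new Series while leaving the original untouched.

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Arithmetic with a scalar is applied to every element and returns a new Series
curved = grades + 5

# Comparisons broadcast as well, producing a Series of booleans
passed = grades >= 70

print(curved.tolist())
print(passed.tolist())
print(grades.tolist())   # the original series is unchanged
```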

The procedural way of doing this would be to iterate through all of the items in the series and increase the values directly. Pandas does support iterating through a series much like a dictionary, allowing you to unpack values easily.

In [None]:
# We can use the items() method, which yields each label and value
for label, value in numbers.items():
    # Early versions of pandas used the set_value() method here.
    # Current versions use the iat or at indexers instead. iat takes an
    # integer position, which matches the label for this default index.
    numbers.iat[label] = value + 2
# And we can check the result of this computation
numbers.head()

So the result is the same, though you may notice a warning depending upon the version of
pandas being used. But if you find yourself iterating pretty much *any time* in pandas,
you should question whether you're doing things in the best possible way.

Let's take a look at some speed comparisons. First, let's try ten loops using the iterative approach.

In [None]:
%%timeit -n 10
# We'll create a fresh series of random items to work with
s = pd.Series(np.random.randint(0,1000,1000))
# And we'll just rewrite our loop from above.
for label, value in s.items():
    s.loc[label]= value+2

Now let's try that using the broadcasting approach.

In [None]:
%%timeit -n 10
# We need to recreate a series
s = pd.Series(np.random.randint(0,1000,1000))
# And we just broadcast with +=
s+=2

This is a dramatic improvement - over 100 times faster on my hardware. Not only is it significantly faster, but it is more concise and easier to read too. The typical mathematical operations you would expect are vectorized, and the numpy documentation outlines what it takes to create vectorized functions of your own. 
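One simple way to get a vectorized function of your own is to build it from operations that are already vectorized. The sketch below assumes a hypothetical helper called `curve` that adds bonus points to every grade and caps the result. It uses broadcasting for the addition and numpy's element-wise `minimum` for the cap, so no explicit loop is needed.

```python
import numpy as np
import pandas as pd

def curve(grades, points=5, cap=100):
    """Add bonus points to every grade and cap the result (hypothetical helper)."""
    # grades + points broadcasts the scalar; np.minimum compares element-wise
    return np.minimum(grades + points, cap)

grades = pd.Series([90, 98, 70, 60])
curved = curve(grades)
print(curved.tolist())
```

Because the function body only uses vectorized operations, it accepts a whole Series and returns a Series of the same length.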

One last note on using the indexing operators to access series data. The .loc attribute lets 
you not only modify data in place, but also add new data as well. If the value you pass in as 
the index doesn't exist, then a new entry is added. And keep in mind, indices can have mixed types. 
While it's important to be aware of the typing going on underneath, Pandas will automatically 
change the underlying NumPy types as appropriate.

In [None]:
# Here's an example using a Series of a few numbers. 
s = pd.Series([1, 2, 3])

# We could add some new value, maybe a university course
s.loc['Databases'] = 102

s

We see that mixed types for data values or index labels are no problem for Pandas. Since "Databases" is not in the original list of index values, `s.loc['Databases']` essentially creates a new element in the series, with the index named "Databases" and the value of 102.
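To see the typing that goes on underneath, we can inspect the index before and after adding the new label. This is a small sketch that repeats the example and prints the index dtype at each step.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.index.dtype)      # the default index holds integers

# Assigning to a label that does not exist adds a new entry
s.loc['Databases'] = 102

# The index now mixes integers and a string, so pandas changes its dtype
print(s.index.dtype)
print(s.index.tolist())
```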

_What happens if the index values are not unique?_

In this regard, a pandas Series operates differently to a relational database.

In [None]:
# Let's create a Series with students and the courses which they have taken
students_classes = pd.Series({'Alice': 'Programming',
                   'Bob': 'Cryptography',
                   'Carol': 'Networking',
                   'Dave': 'Databases'})
students_classes

Now let's create a Series just for some new student Eve, which lists all of the courses she has taken. We set the index to Eve for all the courses, and the data to be the names of the courses.

In [None]:
eve_classes = pd.Series(['Philosophy', 'Arts', 'Maths'], index=['Eve', 'Eve', 'Eve'])
eve_classes

Finally, we can append all of the data in this new Series to the first using the `pd.concat()` function.

In [None]:
# This code is deprecated!!
#all_students_classes = students_classes.append(eve_classes)

all_students_classes = pd.concat([students_classes, eve_classes])

# This creates a series which has our original people in it as well as all of Eve's courses
all_students_classes

There are a couple of important considerations when concatenating series. First, Pandas will take the series and try to infer the best data types to use. In this example, everything is a string, so there are no datatype inconsistencies. Second, the concat function returns a new series which is made up of the two appended together. This is a common pattern in pandas - by default returning a new object instead of modifying in place - and one you should come to expect.

Previously, it was possible to use `.append()` on the first series, but that could be confusing because it did not change the original series. It created a new one, which needed to be stored in a variable such as `all_students_classes`.

Finally, we see that when we query the appended series for Eve, we don't get a single value, but a series (the rows associated with Eve).

In [None]:
all_students_classes.loc['Eve']
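The sketch below rebuilds the concatenated series and contrasts the two cases: a unique label gives back a single value, while a repeated label gives back a Series. It also uses the `is_unique` attribute of the index to check for duplicates, and confirms that concatenation left the original series unchanged.

```python
import pandas as pd

students_classes = pd.Series({'Alice': 'Programming',
                              'Bob': 'Cryptography',
                              'Carol': 'Networking',
                              'Dave': 'Databases'})
eve_classes = pd.Series(['Philosophy', 'Arts', 'Maths'],
                        index=['Eve', 'Eve', 'Eve'])
all_students_classes = pd.concat([students_classes, eve_classes])

# The index contains 'Eve' three times, so it is not unique
print(all_students_classes.index.is_unique)

alice = all_students_classes.loc['Alice']   # unique label: a single value
eve = all_students_classes.loc['Eve']       # repeated label: a Series

print(type(alice).__name__, type(eve).__name__)
print(len(students_classes))                # the original still has 4 entries
```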

## Summary

We've seen how to query the Series, with `.loc` and `.iloc`, that the Series is an indexed data structure, how to merge two Series objects together with `pd.concat()`, and the importance of vectorization.

There is a lot more to Series, but the real strength of Pandas is its two-dimensional data structure, the DataFrame. The DataFrame is very similar to the series object, but includes multiple columns of data, and is the structure that you'll spend the majority of your time working with when cleaning and aggregating data.