# Querying a Series

In [1]:
# A pandas Series can be queried either by index position or by index label. If you don't
# give the Series an index when creating it, the position and the label are effectively
# the same values.

# To query by numeric position (starting at zero), we use the "iloc" attribute.
# To query by index label, we use the "loc" attribute.

import pandas as pd
students_classes = {'Alice' : 'Physics',
                   'Jack' : 'Chemistry',
                   'Molly' : 'English',
                   'Sam' : 'History'}

s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [2]:
# If we want the fourth entry, we would use the iloc attribute with parameter 3
s.iloc[3]

'History'

In [3]:
# If we want to see what class Molly has, we would use the loc attribute with a parameter
# of Molly
s.loc['Molly']

'English'

In [4]:
# iloc and loc are not methods, they are attributes. So we don't use parentheses to
# query them, but square brackets instead, which is called the indexing operator.
# In Python this calls the object's item getter or setter (__getitem__ or __setitem__),
# depending on the context of its use.
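
To make the get/set dispatch concrete, here is a minimal sketch using a toy container. The `Shelf` class and its `log` attribute are hypothetical names invented for illustration; they are not part of pandas, which implements the same two special methods on its `loc` and `iloc` indexers.

```python
class Shelf:
    """Toy container (hypothetical) that records how square brackets are dispatched."""

    def __init__(self):
        self._items = {}
        self.log = []

    def __getitem__(self, key):
        # Called for reads such as shelf['a']
        self.log.append(('get', key))
        return self._items[key]

    def __setitem__(self, key, value):
        # Called for assignments such as shelf['a'] = 1
        self.log.append(('set', key))
        self._items[key] = value


shelf = Shelf()
shelf['a'] = 1        # assignment context: dispatches to __setitem__
value = shelf['a']    # read context: dispatches to __getitem__
print(shelf.log)
```

The same square-bracket syntax therefore reads or writes depending on whether it appears on the left-hand side of an assignment.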

In [5]:
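# Pandas tries to make our code more readable and provides a sort of smart syntax using the
# indexing operator directly on the series itself.

# For instance, if we pass in an integer parameter and the index labels are not integers,
# the operator behaves as if we had queried via the iloc attribute. Newer pandas versions
# (2.1 onward) deprecate this positional fallback, so s.iloc[3] is the safer spelling.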
s[3]

'History'

In [6]:
# If we pass in a label, such as a string, it will query as if we had used the label-based loc attribute
s['Molly']

'English'

In [8]:
# What happens if our index is a list of integers?
# This is a bit complicated, because pandas can't determine automatically whether we intend
# to query by index position or by index label; with an integer index it treats the key as a
# label. So we need to be careful when using the indexing operator on the Series itself. The
# safer option is to be explicit and use the iloc or loc attributes directly.

# Example where the indexes are a list of integers
class_code = {99 : 'Physics',
             100 : 'Chemistry',
             101 : 'English',
             102 : 'History'}

s2 = pd.Series(class_code)

In [11]:
# If we try to call s2[0] we get a KeyError, because there is no item in the Series with an
# index label of zero. Instead, we have to call iloc explicitly if we want the first item.

# s2[0] --> error
s2.iloc[0]

'Physics'
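
The following sketch puts the three lookups side by side for an integer-labelled Series. It rebuilds `s2` so that it runs on its own; the names `lookup_failed`, `by_label`, and `by_position` are introduced here for illustration only.

```python
import pandas as pd

s2 = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'})

# With an integer index, the plain indexing operator treats 0 as a label,
# and no entry is labelled 0.
try:
    s2[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

by_label = s2.loc[99]      # explicit label lookup
by_position = s2.iloc[0]   # explicit positional lookup

print(lookup_failed, by_label, by_position)
```

Being explicit with `loc` and `iloc` removes the ambiguity regardless of the index type.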

## Working with data

In [12]:
# A common task is to consider all of the values inside a series and perform some sort
# of operation on them. This could be trying to find a certain number, summarizing data, or
# transforming the data in some way

In [14]:
# A typical programmatic approach to this would be to iterate over all the items in the series,
# and invoke the operation one is interested in.

# Example

grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total += grade
    
print(total / len(grades))

75.0


In [15]:
# This works, but it's slow.
# Modern computers can do many tasks simultaneously, especially tasks involving maths.

# Pandas and the underlying numpy library support a method of computation called vectorization.
# Vectorization works with most of the functions in the numpy library, including the sum function.

In [16]:
# Here's how we would really write the code, using the numpy sum function.

import numpy as np

# Then we just call np.sum and pass in an iterable item. In this case, our pandas Series

total = np.sum(grades)
print(total/len(grades))

75.0
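
As a small aside, a Series also exposes vectorized reductions as methods, so the same average can be computed without calling numpy directly. This sketch assumes the same four grades as above; `mean_via_numpy` and `mean_via_method` are illustrative names.

```python
import numpy as np
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# numpy function applied to the Series, as in the cell above
mean_via_numpy = np.sum(grades) / len(grades)

# Equivalent reduction provided by the Series itself
mean_via_method = grades.mean()

print(mean_via_numpy, mean_via_method)
```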


In [17]:
# Both methods produce the same value, but is one actually faster?
# The Jupyter Notebook has a magic function which can help us find out.

# First we create a big Series of random numbers. This is used a lot when demonstrating
# techniques with Pandas
numbers = pd.Series(np.random.randint(0, 1000, 10000))

# Let's check the top five items with the head() function
numbers.head()

0    922
1    640
2    770
3    753
4    333
dtype: int64

In [18]:
# We also verify the number of items in our Series
len(numbers)

10000

In [19]:
# The IPython interpreter has something called magic functions, which begin with a percent
# sign. If we type this sign and then hit the Tab key, we can see a list of the available magic
# functions.

# We are going to use what's called a cell magic function. These start with two percent signs
# (%%) and apply to all of the code in the current Jupyter cell.

# The function we are going to use is called timeit. This function will run our code several
# times to determine, on average, how long it takes

In [24]:
# Example
# We can give timeit the number of loops that we would like to run with the -n option. If we
# leave it out, timeit chooses a suitable number of loops on its own.
# In order to use a cell magic function, it has to be the first line in the cell

In [None]:
# For the loop first

In [23]:
%%timeit -n 500

total = 0
for number in numbers:
    total += number
    
total/len(numbers)

1.14 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)


In [25]:
# Now for the vectorization

In [26]:
%%timeit -n 100

total = np.sum(numbers)
total/len(numbers)

85.1 µs ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
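
Outside a notebook the `%%timeit` magic is not available, but a similar comparison can be sketched with Python's standard `timeit` module. The helper names `loop_mean` and `vectorized_mean` and the repeat count of 20 are assumptions made for this sketch; absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))


def loop_mean():
    # Plain Python loop over every value in the Series
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)


def vectorized_mean():
    # Single vectorized call into numpy
    return np.sum(numbers) / len(numbers)


runs = 20
loop_seconds = timeit.timeit(loop_mean, number=runs)
vector_seconds = timeit.timeit(vectorized_mean, number=runs)

print(f"loop:       {loop_seconds / runs:.6f} s per call")
print(f"vectorized: {vector_seconds / runs:.6f} s per call")
```

Both functions return the same mean; only the time taken to compute it differs.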


In [27]:
# Wow! This is a big difference in speed, and it demonstrates why we should be aware of
# parallel computing features and start thinking in functional programming terms.
# Put more simply, vectorization lets the computer apply one operation to many values at
# once, and with high-performance chips, especially graphics cards, we can get dramatic
# speedups. Modern graphics cards can run thousands of operations in parallel

In [28]:
# A related feature in pandas and numpy is called broadcasting. With broadcasting, we can
# apply an operation to every value in the series, changing the series.
# For instance, if we want to increase every random number by 2, we could do so quickly
# using the += operator directly on the Series object

# Let's check the head of our Series
numbers.head()

0    922
1    640
2    770
3    753
4    333
dtype: int64

In [29]:
# Let's increase everything in the Series by 2
numbers += 2
numbers.head()

0    924
1    642
2    772
3    755
4    335
dtype: int64
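
Broadcasting is not limited to `+=`. A minimal sketch on a tiny Series shows that the ordinary arithmetic operators return a new Series, while the augmented form updates the name it is applied to. The variable names here are illustrative.

```python
import pandas as pd

values = pd.Series([1, 2, 3])

shifted = values + 2    # new Series; values is left untouched
scaled = values * 10    # the scalar is applied to every element

values += 2             # augmented assignment updates values itself

print(shifted.tolist(), scaled.tolist(), values.tolist())
```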

In [36]:
# The procedural way of doing this would be to iterate through all of the items in the series
# and increase the values directly. Pandas does support iterating through a series much like
# a dictionary, allowing us to unpack values easily.

# We can use the items() function, which returns a label and a value for each entry.
# (Older pandas versions also offered iteritems(), which was removed in pandas 2.0.)
for label, value in numbers.items():
    # Now, for each item that is returned, let's call at[]
    numbers.at[label] = value + 2
    
numbers.head()

0    926
1    644
2    774
3    757
4    337
dtype: int64

In [37]:
# If we find ourselves iterating in pandas at pretty much any time, we should question
# whether we're doing things in the best possible way

In [40]:
# Let's do some speed comparison

In [None]:
# First we check with the loop

In [39]:
%%timeit -n 25

s3 = pd.Series(np.random.randint(0, 1000, 10000))

for label, value in s3.items():
    s3.at[label] = value + 2

51.6 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 25 loops each)


In [41]:
# Now we try using the broadcasting method

In [43]:
%%timeit -n 25

s4 = pd.Series(np.random.randint(0, 1000, 10000))
s4 += 2

378 µs ± 105 µs per loop (mean ± std. dev. of 7 runs, 25 loops each)


In [44]:
# The broadcasting method is way faster, and it's also more concise and even easier to read.
# The typical mathematical operations are vectorized, and the numpy documentation outlines
# what it takes to create vectorized functions of our own
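
One simple way to get a vectorized function of our own is to build it from operations that are already vectorized. The sketch below is an illustration under that assumption, not necessarily the approach the numpy documentation describes; `pass_fail` and its `threshold` parameter are hypothetical names.

```python
import numpy as np
import pandas as pd


def pass_fail(scores, threshold=70):
    """Label every score at once, without an explicit Python loop."""
    # The comparison is broadcast across the whole Series, and np.where
    # picks 'pass' or 'fail' element by element.
    labels = np.where(scores >= threshold, 'pass', 'fail')
    return pd.Series(labels, index=scores.index)


grades = pd.Series([90, 80, 70, 60])
result = pass_fail(grades)
print(result.tolist())
```

Because the function only combines a broadcast comparison with `np.where`, it scales to large Series in the same way as `+=` does.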

In [45]:
# One last note on using the indexing operators to access series data.
# The .loc attribute lets us not only modify data in place, but also add new data as well.
# If the value we pass in as the index doesn't exist, then a new entry is added. And keep
# in mind, indices can have mixed types. While it's important to be aware of the typing going on
# underneath, pandas will automatically change the underlying NumPy types as appropriate

In [46]:
# Example
s4 = pd.Series([1, 2, 3])

# We add a new value
s4.loc['History'] = 102

s4

0            1
1            2
2            3
History    102
dtype: int64
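
To see the automatic type change in action, this sketch adds two new labels through `.loc`: one with an integer value and one with a string value. The `'Art'` label and the `dtype_*` names are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
dtype_before = s.dtype

s.loc['History'] = 102       # new label, integer value
dtype_after_int = s.dtype

s.loc['Art'] = 'n/a'         # new label, string value
dtype_after_str = s.dtype

print(dtype_before, dtype_after_int, dtype_after_str)
print(s)
```

Once a string is stored alongside the integers, the values can no longer share a numeric type, so the Series falls back to the general `object` dtype.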

In [47]:
# Mixed types for data values or index labels are no problem for pandas. Since 'History'
# is not in the original list of indices, s4.loc['History'] essentially creates a new element
# in the series, with the index named 'History' and the value of 102

In [48]:
# In pandas, index values don't have to be unique, and this makes a pandas Series a little
# different conceptually than, for instance, a relational database

# We create a new Series
students_classes2 = pd.Series({
    'Alice' : 'Physics',
    'Jack' : 'Chemistry',
    'Molly' : 'English',
    'Sam' : 'History'})

students_classes2

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [49]:
# Now we create a Series for some new student Kelly, which lists all the courses she's taken.
# We'll set the index to Kelly and the data to be the names of courses
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index = ['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [50]:
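# We can append all of the data in this new Series to the first. Older pandas versions offered
# a Series.append() method for this; it was removed in pandas 2.0, so we use pd.concat() instead
all_students = pd.concat([students_classes2, kelly_classes])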
all_students

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [51]:
# Important considerations

# 1) Pandas will take the series and try to infer the best data types to use
# 2) Appending doesn't actually change the underlying Series objects; it instead
# returns a new Series which is made up of the two appended together. This is a common pattern
# in pandas - by default returning a new object instead of modifying in place

# If we print the original, we'll see that it hasn't changed
students_classes2

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [52]:
# Finally, we see that when we query the appended series for Kelly, we don't get a single
# value, but a series itself
all_students.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
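
Because the return type of `.loc` depends on whether a label is repeated, it can help to check for duplicates up front. The sketch below uses a shortened version of the combined Series so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

all_students = pd.Series(
    ['Physics', 'Philosophy', 'Arts', 'Math'],
    index=['Alice', 'Kelly', 'Kelly', 'Kelly'])

alice = all_students.loc['Alice']   # unique label: a single value
kelly = all_students.loc['Kelly']   # repeated label: a Series

# The index reports whether every label is unique
has_unique_labels = all_students.index.is_unique

# Passing a list of labels returns a Series even for a unique label
alice_as_series = all_students.loc[['Alice']]

print(type(alice).__name__, type(kelly).__name__, has_unique_labels)
```

Passing a list of labels is one way to get a consistent return type when the code cannot know in advance whether a label is repeated.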