# Series Data Structure

The Series is a cross between a list and a dictionary. The items are all stored in order, and there are labels with which we can retrieve them. An easy way to visualize this is as two columns of data: the first is the special index (like the keys in a dictionary) and the second is the actual data. It's important to note that the data column has a label of its own, which can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data.
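
As a small sketch of the "two columns" picture described above, the snippet below builds a Series and looks at its index, its values, and its name. The data and the name "course" are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; the name "course" is chosen only for illustration.
courses = pd.Series(
    ["Physics", "Chemistry", "English"],
    index=["Alice", "Jack", "Molly"],
    name="course",
)

print(courses.index.tolist())  # the "first column": the labels
print(courses.tolist())        # the "second column": the data
print(courses.name)            # the label of the data column itself

# A Series created without an explicit name has its name set to None.
unnamed = pd.Series([1, 2, 3])
print(unnamed.name)
```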

In [1]:
import pandas as pd

In [2]:
# We can create a Series by passing in a list of values. When we do this, pandas
# automatically assigns an index starting with zero and sets the name of the Series to None.

# One of the easiest ways to create a Series is to use an array-like object, like a list.
students = ['Alice', 'Jack', 'Molly']

# Now we just call the Series function in pandas and pass in the 'students' list
pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [3]:
# The result is a Series object
# Pandas has automatically identified the type of data in this Series as 'object' and set the
# dtype attribute as appropriate.
# The values are indexed with integers, starting at zero

In [4]:
# We can also use whole numbers, in which case the type would be int64
# Underneath, pandas stores Series values in a typed array using the NumPy library. This offers
# a significant speedup compared with traditional Python lists

numbers = [1, 2, 3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64
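
To make the "typed array underneath" point concrete, here is a minimal sketch that pulls the underlying NumPy array out of a Series with to_numpy(). The exact integer width reported can depend on the platform and library versions, so no specific output is claimed.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# to_numpy() returns the values as a NumPy array
values = numbers.to_numpy()

print(type(values))   # a numpy.ndarray
print(values.dtype)   # an integer dtype; the width may vary by platform
```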

In [5]:
# How do NumPy, and thus pandas, handle missing data?

# Underneath, pandas does some type conversion. If we create a list of strings and one
# element is None, pandas inserts it as None and uses the type 'object' for the
# underlying array

students2 = ['Alice', 'Jack', None]

pd.Series(students2)

0    Alice
1     Jack
2     None
dtype: object

In [6]:
# However, if we create a list of numbers (integers or floats) and put in the None type,
# pandas automatically converts this to a special floating point value designated as NaN,
# which stands for 'Not a Number'

numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

In [7]:
# Things to mention:
# 1) NaN is a different value from None
# 2) Underneath, pandas represents NaN as a floating point number, and because integers can be
# typecast to floats, pandas went and converted our integers to floats.

# So, when we're wondering why the list of integers we put into a Series is now floats, it's
# probably because there is some missing data

In [8]:
# None and NaN might be used by the data scientists in the same way, to denote missing data,
# but underneath these are not represented by Pandas in the same way

# NaN is NOT equivalent to None and when we try the equality test, the result is False.

import numpy as np

np.nan == None

False

In [9]:
# We can't do an equality test of NaN to itself either. When we do, the answer is always false

np.nan == np.nan

False

In [10]:
# Instead, we need to use special functions to test for the presence of NaN, such as the
# NumPy library's isnan()

np.isnan(np.nan)

True
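
Since np.isnan() only understands numeric input, a sketch of an alternative is pandas' own pd.isna() function and the Series.isna() method, which treat both None and NaN as missing. The Series below are small made-up examples.

```python
import numpy as np
import pandas as pd

# pd.isna() recognises both kinds of "missing" marker
print(pd.isna(np.nan))
print(pd.isna(None))

# np.isnan() does not accept None and raises a TypeError
try:
    np.isnan(None)
    isnan_accepts_none = True
except TypeError:
    isnan_accepts_none = False
print(isnan_accepts_none)

# On a Series, isna() returns a boolean mask, one entry per value
numbers = pd.Series([1, 2, None])
students = pd.Series(["Alice", "Jack", None])
print(numbers.isna().tolist())
print(students.isna().tolist())
```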

In [11]:
# To keep in mind: when we see NaN, its meaning is similar to None, but it's a numeric value
# and treated differently for efficiency reasons

In [12]:
# Other ways of creating pandas Series

# A Series can be created directly from dictionary data. If we do this, the index is
# automatically assigned to the keys of the dictionary that we provided.

students_courses = {'Alice' : 'Physics',
                  'Jack' : 'Chemistry',
                  'Molly' : 'English'}

s = pd.Series(students_courses)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [13]:
# Since it was string data, pandas set the data type of the Series to 'object'

In [14]:
# Once the series has been created, we can get the index object using the index attribute

s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [None]:
# A lot of things in pandas are implemented as NumPy arrays and have the dtype value set.
# This is true of indices, and here pandas inferred that we were using objects for the index

In [15]:
# My own example

test = {'Alice' : 3,
        5 : 'Chemistry',
        17.1 : 'English'}

test_s = pd.Series(test)

print(test_s)
print('\n')
print(test_s.index)

Alice            3
5        Chemistry
17.1       English
dtype: object


Index(['Alice', 5, 17.1], dtype='object')


In [16]:
# The dtype of object is not just for strings, but for arbitrary objects.
# Let's do an example with tuples

students2 = [('Alice', 'Brown'), ('Jack', 'White'), ('Molly', 'Green')]
pd.Series(students2)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [17]:
# Each of the tuples is stored in the series object, and the type is object

In [18]:
# We can also separate our index creation from the data by passing in the index as a list
# explicitly to the Series

courses = ['Physics', 'Chemistry', 'English']
names = ['Alice', 'Jack', 'Molly']

s2 = pd.Series(courses, index = names)
s2

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [20]:
# What happens if the values in our index list are not aligned with the keys in the
# dictionary used to create the Series?

# Then pandas overrides the automatic creation to favor only, and all of, the index values
# that we provided. So it will ignore all dictionary keys which are not in our index,
# and pandas will add None or NaN values for any index value we provide which is not in
# our dictionary key list

students_courses2 = {'Alice' : 'Physics',
                  'Jack' : 'Chemistry',
                  'Molly' : 'English'}

s3 = pd.Series(students_courses2, index = ['Alice', 'Molly', 'Sam'])
s3

Alice    Physics
Molly    English
Sam          NaN
dtype: object

In [21]:
# The result is that the Series object doesn't have Jack in it, even though he was in our
# original dataset, but it explicitly does have Sam in it as a missing value
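
A short sketch of how one might check this behaviour programmatically: test label membership on the index and use isna() to find the entries pandas filled in. The variable names mirror the cell above but are redefined here so the snippet runs on its own.

```python
import pandas as pd

students_courses = {"Alice": "Physics", "Jack": "Chemistry", "Molly": "English"}

# Only the labels listed in `index` survive; unknown labels become missing values
s3 = pd.Series(students_courses, index=["Alice", "Molly", "Sam"])

print("Jack" in s3.index)   # Jack was dropped
print("Sam" in s3.index)    # Sam was added
print(s3.isna().tolist())   # only Sam's entry is missing
```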

## Querying Series

In [22]:
# A pandas Series can be queried either by the **index position** or the **index label**. If
# we don't give an index to the Series when creating it, the position and the label are
# effectively the same values.

# To query by **numeric location**, starting at zero --> .iloc attribute
# To query by **index label** --> .loc attribute

students_classes = {'Alice' : 'Physics',
                  'Jack' : 'Chemistry',
                  'Molly' : 'English',
                   'Sam' : 'History'}

s4 = pd.Series(students_classes)
s4

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [24]:
# For the fourth entry, we'd use the **iloc** attribute with the parameter 3
s4.iloc[3]

'History'

In [25]:
# To see what class Molly has, we'd use the **loc** attribute with a parameter of Molly
s4.loc['Molly']

'English'

In [26]:
# Keep in mind that iloc and loc are NOT methods, they are attributes. So we don't use
# parentheses to query them, but square brackets instead, which is called the indexing operator.

# In Python this calls __getitem__ or __setitem__ on the object, depending on the context of its use.

In [27]:
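# Pandas tries to make our code more readable and provides a sort of smart syntax using the
# indexing operator directly on the Series itself

# If we pass in an integer parameter and the index labels are not integers, the operator has
# historically behaved as if we wanted to query via the iloc attribute. Recent pandas releases
# deprecate this positional fallback, so s4.iloc[3] is the safer spelling
s4[3]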

'History'

In [28]:
# If we pass in an object, it will query as if we wanted to use the label based loc attribute
s4['Molly']

'English'

In [29]:
# So : 

# Series.loc['x'] == Series['x']
# Series.iloc[2] == Series[2]

In [30]:
# What happens if our index is a list of integers?

# Pandas can't determine automatically whether we're intending to query by index position or
# index label. So we need to be more careful when using the indexing operator on the Series
# itself. The safer option is to be more explicit and use the iloc or loc attributes directly.

class_code = {99 : 'Physics',
             100 : 'Chemistry',
             101 : 'English',
             102 : 'History'}

s5 = pd.Series(class_code)

In [32]:
# If we try to call s5[0], we get a KeyError because there's no item in the class codes with
# an index label of zero. Instead we have to call iloc explicitly if we want the first item

# s5[0] --> KeyError

# Correct way to do it
s5.iloc[0]

'Physics'
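
The difference between the two explicit accessors on an integer index can be sketched as follows. The Series is rebuilt here so the snippet is self-contained, and the try/except is only there to show that label 0 does not exist.

```python
import pandas as pd

class_code = {99: "Physics", 100: "Chemistry", 101: "English", 102: "History"}
s5 = pd.Series(class_code)

# .loc uses the label, .iloc uses the position
by_label = s5.loc[99]
by_position = s5.iloc[0]
print(by_label, by_position)

# There is no label 0 in this index, so a label lookup fails
try:
    s5.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False
print(label_zero_exists)
```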

## Operating within Series

In [33]:
# A common task is to consider all of the values inside a Series and do some sort
# of operation: finding a certain number, summarizing the data, or transforming the data in some way.

In [34]:
# A typical programmatic approach to this would be to iterate over all the items in the Series,
# and invoke the operation one is interested in.

# Average student's grade

grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total += grade

print(total / len(grades))

75.0


In [35]:
# This works, but it's slow. Modern computers can do many tasks simultaneously, especially,
# but not only, tasks involving mathematics.

# Pandas and the underlying NumPy library support a method of computation called vectorization.
# Vectorization works with most of the functions in the NumPy library, including the sum function.

In [36]:
# Using the NumPy sum function

import numpy as np
total = np.sum(grades)

print(total / len(grades))

75.0


In [38]:
# Both methods return the same value, but is one actually faster?
# The Jupyter Notebook has a magic function which can help

# First, let's create a big Series of random numbers. This is often used when demonstrating
# techniques with Pandas

# 10000 integer values between 0 and 1000
numbers2 = pd.Series(np.random.randint(0, 1000, 10000))

print(len(numbers2))
print('\n')
print(numbers2.head())

10000


0    313
1    867
2    231
3     60
4    884
dtype: int64


In [40]:
# The IPython interpreter has something called magic functions, which begin with a % sign.

# In this case, we are going to use what's called a cell magic function. These start with
# two % signs and wrap the code in the current Jupyter cell.
# The function we're going to use is called timeit. This function will run our code a few times
# to determine, on average, how long it takes. We can give timeit the number of loops that we
# want to run with the -n option. If we don't, timeit chooses a suitable number automatically.

# In order to use a cell magic function, it has to be the first line in the cell

In [42]:
%%timeit -n 100

total = 0

for number in numbers2:
    total += number
    
total / len(numbers2)

1.19 ms ± 75.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [43]:
# Now let's try with vectorization

In [44]:
%%timeit -n 100

total = np.sum(numbers2)
total / len(numbers2)

63.7 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [45]:
# This is a huge difference

# Vectorization is the ability to apply a single operation to many values at once, instead of
# looping over them one at a time in Python.

# With high performance chips, especially graphics cards, we can get dramatic speedups. Modern
# graphics cards can run thousands of operations in parallel
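
Outside of a notebook the %%timeit magic is not available. A rough sketch of the same comparison using the standard-library timeit module is shown below. The helper names loop_mean and vectorized_mean are made up for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))


def loop_mean():
    # Plain Python loop over every value
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)


def vectorized_mean():
    # One vectorized NumPy call
    return np.sum(numbers) / len(numbers)


# Run each function 20 times and report the total elapsed seconds
loop_seconds = timeit.timeit(loop_mean, number=20)
vectorized_seconds = timeit.timeit(vectorized_mean, number=20)

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.4f} s")
```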

In [46]:
# Broadcasting

# With broadcasting, we can apply an operation to every value in the Series, changing the Series.
# For instance, if we wanted to increase every random value by 2, we could do so quickly
# using the += operator directly on the Series object

numbers2.head()

0    313
1    867
2    231
3     60
4    884
dtype: int64

In [47]:
numbers2 += 2
numbers2.head()

0    315
1    869
2    233
3     62
4    886
dtype: int64

In [50]:
# Procedural way

# This would be to iterate through all of the items in the Series and increase the values
# directly. Pandas does support iterating through a Series much like a dictionary, allowing
# us to unpack values easily

# We can use the items() function, which returns a label and value
# (older pandas versions called this iteritems(), which was removed in pandas 2.0)
for label, value in numbers2.items():
    # For the item returned, let's call the .at attribute
    numbers2.at[label] += 2
    
numbers2.head()

0    317
1    871
2    235
3     64
4    888
dtype: int64

In [51]:
# Let's do some speed comparisons

In [52]:
%%timeit -n 10

s6 = pd.Series(np.random.randint(0, 1000, 10000))

for label, value in s6.items():
    s6.at[label] += 2

87.7 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [53]:
# Now let's try using broadcasting methods

In [54]:
%%timeit -n 10

s7 = pd.Series(np.random.randint(0, 1000, 10000))

s7 += 2

353 µs ± 123 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [55]:
# It's way faster

# It's also more concise and easier to read

# The typical mathematical operators are vectorized, and the NumPy documentation outlines what
# it takes to create vectorized functions of our own
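
To illustrate that the usual operators are vectorized, here is a minimal sketch using a small made-up grades Series. Unlike +=, the plain operators return a new Series and leave the original untouched.

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Arithmetic is applied to every element and returns a new Series
curved = grades + 5

# Comparisons are vectorized too and give a boolean Series
passed = grades >= 70

print(curved.tolist())   # [95, 85, 75, 65]
print(passed.tolist())   # [True, True, True, False]
print(grades.tolist())   # [90, 80, 70, 60], the original is unchanged
```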

In [56]:
# The .loc attribute lets us not only modify data in place, but also add new data as well.
# If the value we pass in as the index doesn't exist, then a new entry is added.

# Keep in mind: indices can have mixed values

# Pandas will automatically change the underlying NumPy types as appropriate

In [57]:
s8 = pd.Series([1, 2, 3])

s8.loc['History'] = 102

s8

0            1
1            2
2            3
History    102
dtype: int64

In [58]:
# Mixed types for data values or index labels are no problem for Pandas
# Since 'History' is not in the original list of indices, s8.loc['History'] essentially creates
# a new element in the Series with the index label 'History' and the value of 102
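
The two behaviours of assignment through .loc, modifying an existing label versus adding a new one, can be sketched side by side. The Series below is a fresh, made-up example.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Label 0 already exists, so this modifies the value in place
s.loc[0] = 10

# Label 'History' does not exist, so this adds a new entry
s.loc["History"] = 102

print(s.index.tolist())  # [0, 1, 2, 'History']
print(len(s))            # 4
```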

In [59]:
# When index values are not unique

# In pandas, index values don't have to be unique, which makes a Series a bit different from a
# relational database table

students_classes2 = pd.Series({'Alice' : 'Physics',
                  'Jack' : 'Chemistry',
                  'Molly' : 'English',
                   'Sam' : 'History'})
students_classes2

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [60]:
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index = ['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [61]:
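# We append kelly_classes to the rest. Series.append() was removed in pandas 2.0, so we use
# pd.concat(), which works in both older and newer versions
all_students_classes = pd.concat([students_classes2, kelly_classes])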

all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [62]:
# Considerations when appending Series (with pd.concat(), or append() in older pandas versions)

# 1) Pandas will take the Series and try to infer the best data types to use
# 2) The operation doesn't actually change the underlying Series objects. Instead it
# returns a new Series which is made up of the two appended together.

# This last thing happens quite often in Pandas: by default returning a new object instead of
# modifying in place

# If we print the original Series, we see that nothing has changed
students_classes2

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object
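
A compact sketch of the "returns a new object" behaviour, using pd.concat() on two small made-up Series and checking that the inputs keep their original length:

```python
import pandas as pd

first = pd.Series({"Alice": "Physics", "Jack": "Chemistry"})
second = pd.Series(["Philosophy", "Arts"], index=["Kelly", "Kelly"])

# pd.concat() builds a new Series; neither input is modified
combined = pd.concat([first, second])

print(len(first), len(second), len(combined))  # 2 2 4
print(combined.index.tolist())
```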

In [63]:
# Finally, we see that when we query the appended Series for Kelly, we don't get a single value
# but a Series itself

all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
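
Because the return type of .loc depends on whether the label is unique, it can help to check the index first. The sketch below uses a small made-up Series and the Index.is_unique attribute.

```python
import pandas as pd

classes = pd.Series(
    ["Physics", "Philosophy", "Arts"],
    index=["Alice", "Kelly", "Kelly"],
)

# is_unique tells us whether any label is repeated
print(classes.index.is_unique)

# A unique label gives back a single value
alice_class = classes.loc["Alice"]
print(alice_class)

# A repeated label gives back a Series with every matching entry
kelly_classes = classes.loc["Kelly"]
print(kelly_classes.tolist())
```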