# Pandas Series

## This notebook introduces the pandas package for manipulating data. It covers Series creation, querying, vectorized operations and broadcasting, and combining Series.

In [1]:
import pandas as pd
%precision 2

'%.2f'

## Creating a Series using Pandas

In [2]:
#create a short series out of a list
fruit = ['apple', 'banana', 'orange']
pd.Series(fruit)

0     apple
1    banana
2    orange
dtype: object

In [3]:
#this works with integers as well
integers = [1,2,3,4,5]
pd.Series(integers)

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]:
#pandas builds a series on top of a typed numpy array
#pandas handles missing data differently depending on the type of the underlying array
#with strings the series keeps the object dtype and None is stored as-is
fruit = ['apple', 'banana', None]
pd.Series(fruit)

0     apple
1    banana
2      None
dtype: object

In [5]:
#with integers, None is converted to NaN and the series is upcast to float64
integers =[1,2,3,4,None]
pd.Series(integers)

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
dtype: float64

#### This is an important distinction because NaN is not the same as None. In fact, NaN does not even compare equal to itself, so you need special functions such as np.isnan to test for it, as shown below

In [6]:
import numpy as np
#np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
np.nan == np.nan

False

In [8]:
np.isnan(np.nan)

True
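For whole series, pandas offers its own missing-value test. The following is a minimal sketch, assuming the same two lists used earlier; the variable names `numbers` and `fruit_with_gap` are chosen here for illustration only.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, None])
fruit_with_gap = pd.Series(['apple', 'banana', None])

# isna() returns a boolean series that is True wherever a value is missing,
# whether the gap is stored as NaN (numeric data) or None (string data)
numbers_missing = numbers.isna()
fruit_missing = fruit_with_gap.isna()
print(numbers_missing.tolist())
print(fruit_missing.tolist())

# summing a boolean series counts the True entries
missing_count = int(numbers_missing.sum())
print(missing_count)
```

Unlike `np.isnan`, `isna()` also works on string data, so the same check can be used regardless of the underlying type.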

#### We can also pass pandas other data structures, such as a dictionary, and the keys will be used as the index of the Series

In [9]:
sports = {'Archery': 'Bhutan',
         'Golf': 'Scotland',
         'Sumo': 'Japan',
         'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [10]:
#get only the index object
s.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

In [11]:
#you can also explicitly pass in values and indices 
s = pd.Series(['apple','banana','orange'], index=['America', 'Canada', 'Mexico'])
s

America     apple
Canada     banana
Mexico     orange
dtype: object

In [12]:
#notice how pandas handles a mismatch between the dictionary keys and the index passed to a Series object
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s

Golf      Scotland
Sumo         Japan
Hockey         NaN
dtype: object
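Labels requested in the index but absent from the dictionary end up as NaN, and dictionary keys not listed in the index are dropped. One possible way to find and handle those gaps is sketched below; the fill value `'Unknown'` is an arbitrary placeholder chosen for this example, not a pandas default.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])

# labels that had no matching key in the dictionary
missing_labels = s[s.isna()].index.tolist()
print(missing_labels)

# dropna() returns a copy without the missing entries
cleaned = s.dropna()
print(cleaned)

# fillna() returns a copy with the gaps replaced by a value of your choice
filled = s.fillna('Unknown')
print(filled)
```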

## Querying a Pandas Series

### To look up an element by its numeric position, use the iloc attribute. To look up an element by its index label, use the loc attribute.

In [13]:
sports = {'Archery': 'Bhutan',
         'Golf': 'Scotland',
         'Sumo': 'Japan',
         'Taekwondo': 'South Korea'}
s = pd.Series(sports)
#query by numeric index
s.iloc[3]

'South Korea'

In [14]:
#query by index label
s.loc['Golf']

'Scotland'

In [15]:
#you can use a shorter syntax, but it is ambiguous if the index is just a list of integers
#recent pandas versions deprecate using [] with an integer position on a labeled index, so prefer iloc
s[2]

'Japan'
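To see why the short syntax is risky with an integer index, consider a small sketch. The series `s_int` and its labels are hypothetical, picked so that the labels do not coincide with the positions.

```python
import pandas as pd

# hypothetical series whose integer labels differ from the positions 0, 1, 2
s_int = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_label = s_int.loc[10]      # explicit label lookup
by_position = s_int.iloc[0]   # explicit position lookup
shorthand = s_int[10]         # with integer labels, [] is treated as a label lookup

# there is no label 0, so the shorthand fails even though position 0 exists
try:
    s_int[0]
    zero_found = True
except KeyError:
    zero_found = False

print(by_label, by_position, shorthand, zero_found)
```

Using `loc` and `iloc` explicitly removes the guesswork about which kind of lookup is intended.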

In [16]:
#we can use most numpy aggregate functions across a series, for example the sum function
#These functions use vectorization under the hood and are therefore faster than a normal loop
s=pd.Series([100.00, 101.00, 301.00, 404.00])
s

0    100.0
1    101.0
2    301.0
3    404.0
dtype: float64

In [17]:
#standard approach
tot = 0
for item in s:
    tot += item
print(tot)

906.0


In [18]:
import numpy as np
new_tot = np.sum(s)
print(new_tot)

906.0


In [22]:
s=pd.Series(np.random.randint(0,1000,10000))
s.head()

0     75
1    958
2      0
3    625
4    778
dtype: int64

In [23]:
len(s)

10000

In [28]:
%%timeit -n 100
#note that each run of the standard loop over the series takes on the order of milliseconds
summary = 0
for item in s:
    summary+=item

100 loops, best of 3: 1.43 ms per loop


In [29]:
%%timeit -n 100
#note that when using the vectorized numpy array function, the time is in microseconds
new_sum = np.sum(s)

100 loops, best of 3: 79.7 µs per loop
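The same comparison can be run outside a notebook, where the `%%timeit` magic is not available. This is a rough sketch using the standard library `timeit` module; the helper `loop_sum` and the repeat count of 20 are choices made for this example, and the measured times will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum(series):
    """Sum the values one at a time in a plain Python loop."""
    total = 0
    for item in series:
        total += item
    return total

# both approaches must agree before timing them is meaningful
loop_result = int(loop_sum(s))
vector_result = int(s.sum())

# time 20 runs of each approach (the repeat count is arbitrary)
loop_time = timeit.timeit(lambda: loop_sum(s), number=20)
vector_time = timeit.timeit(lambda: s.sum(), number=20)
print(f"loop: {loop_time:.4f}s  vectorized: {vector_time:.4f}s")
```

`s.sum()` is the Series method equivalent of `np.sum(s)` for this data, so either can be used.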


## Broadcasting

### With broadcasting you can apply an operation to every value in a series at once, without writing a loop

In [30]:
s.head()

0     75
1    958
2      0
3    625
4    778
dtype: int64

In [32]:
#add 2 to every item
s += 2
s.head()

0     77
1    960
2      2
3    627
4    780
dtype: int64

##### notice that the procedural approach below, which loops over every item, is longer to write, harder to read, and slower to run

In [34]:
%%timeit -n 10
s = pd.Series(np.random.randint(1,1000,100))
for label, value in s.items():
    s.loc[label] = value+2

10 loops, best of 3: 10.9 ms per loop


In [35]:
%%timeit -n 10
s = pd.Series(np.random.randint(1,1000,100))
s += 2

10 loops, best of 3: 425 µs per loop
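Broadcasting is not limited to in-place addition. The sketch below, which reuses the first five values shown earlier as a fixed list, illustrates that other operators broadcast the same way and that the non-in-place form returns a new series.

```python
import pandas as pd

s = pd.Series([75, 958, 0, 625, 778])

plus_two = s + 2      # returns a new series; s itself is unchanged
doubled = s * 2       # other arithmetic operators broadcast the same way
over_500 = s > 500    # comparisons broadcast to a boolean series

print(plus_two.tolist())
print(doubled.tolist())

# a boolean series can be used to filter the original values
large_values = s[over_500]
print(large_values.tolist())
```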


#### mixed types do not raise an error in pandas; the series is upcast to the object dtype

In [37]:
s = pd.Series([1,2,3])
s

0    1
1    2
2    3
dtype: int64

In [39]:
s.loc['fruit'] = 'orange'
s

0             1
1             2
2             3
fruit    orange
dtype: object
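The change of dtype has consequences worth checking. As a small sketch, the code below inspects the dtype before and after the assignment and then uses `pd.to_numeric` with `errors='coerce'` as one possible way to get back to numeric data; whether coercing is appropriate depends on the data at hand.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
dtype_before = s.dtype
print(dtype_before)

# adding a string under a new label forces the whole series to object
s.loc['fruit'] = 'orange'
dtype_after = s.dtype
print(dtype_after)

# one option for recovering numbers: values that cannot be parsed become NaN
numeric = pd.to_numeric(s, errors='coerce')
print(numeric)
```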

## Index values in a pandas Series do not need to be unique, which makes them different from primary keys in a standard relational database

In [48]:
#create two series. One with the original sports and countries, and one with countries who love soccer, using
# soccer for all indices
original_countries = pd.Series({'Archery': 'Bhutan',
                                'Golf': 'Scotland',
                                'Sumo': 'Japan',
                                'Taekwondo': 'South Korea'})
soccer_countries = pd.Series(['Germany', 'Argentina', 'Brazil', 'Netherlands'],
                             index=['Soccer', 'Soccer', 'Soccer', 'Soccer'])

In [49]:
#unique index
original_countries

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [50]:
#repeating index values
soccer_countries

Soccer        Germany
Soccer      Argentina
Soccer         Brazil
Soccer    Netherlands
dtype: object

In [52]:
#Series.append() was removed in pandas 2.0, so we combine the 2 using the pd.concat() function
all_sports = pd.concat([original_countries, soccer_countries])
all_sports

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Soccer           Germany
Soccer         Argentina
Soccer            Brazil
Soccer       Netherlands
dtype: object

In [55]:
#notice that the original series did not change
print(original_countries)
print()
print(soccer_countries)

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

Soccer        Germany
Soccer      Argentina
Soccer         Brazil
Soccer    Netherlands
dtype: object


In [60]:
#query the combined series for only soccer countries
all_sports.loc['Soccer']

Soccer        Germany
Soccer      Argentina
Soccer         Brazil
Soccer    Netherlands
dtype: object
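Because repeated labels change what a lookup returns, it can help to check an index for uniqueness first. A minimal sketch, rebuilding the two series from above and combining them with `pd.concat`; the variable name `dup_labels` is introduced here for illustration.

```python
import pandas as pd

original_countries = pd.Series({'Archery': 'Bhutan',
                                'Golf': 'Scotland',
                                'Sumo': 'Japan',
                                'Taekwondo': 'South Korea'})
soccer_countries = pd.Series(['Germany', 'Argentina', 'Brazil', 'Netherlands'],
                             index=['Soccer', 'Soccer', 'Soccer', 'Soccer'])
all_sports = pd.concat([original_countries, soccer_countries])

# is_unique reports whether any label repeats
print(original_countries.index.is_unique)
print(all_sports.index.is_unique)

# duplicated() marks every repeat of a label after its first occurrence
dup_labels = all_sports.index[all_sports.index.duplicated()].unique().tolist()
print(dup_labels)

# a unique label returns a single value, a repeated label returns a series
golf = all_sports.loc['Golf']
soccer = all_sports.loc['Soccer']
print(type(golf).__name__, type(soccer).__name__)
```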