In [6]:
import pandas as pd
import numpy as np 

students_classes = {'Alice': 'Physics',
                    'Jack': 'Chemistry',
                    'Molly': 'English',
                    'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [7]:
# iloc[] takes an integer position
s.iloc[3]

'History'

In [8]:
# loc[] takes an index label (key)
s.loc['Molly']

'English'

Notice that iloc and loc are not methods, they are attributes. So we don't query them with parentheses (), but with square brackets [], which is called the indexing operator.

In [9]:
s[3]  # == s.iloc[3]; this positional fallback is deprecated in recent pandas, so prefer s.iloc[3]

'History'

In [10]:
s['Molly']  # == s.loc['Molly']

'English'

In [11]:
# what if the index labels are integers?
# with plain square brackets, pandas treats an integer as an index label, not as a position
# the safer option is to use the iloc or loc attributes directly

class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'English',
              102: 'History'}
s = pd.Series(class_code)
s

99       Physics
100    Chemistry
101      English
102      History
dtype: object

In [12]:
s[0]  # raises KeyError: with an integer index, pandas looks for the label 0, which doesn't exist

KeyError: 0

In [13]:
s.iloc[0]  # correct one

'Physics'
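
A minimal sketch of the same point, rebuilding the class_code series from above: loc looks up by label, iloc looks up by position, and a label lookup for 0 fails because 0 is not one of the labels. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'})

by_label = s.loc[99]      # label lookup: the entry whose index label is 99
by_position = s.iloc[0]   # positional lookup: the first entry

# 0 is not one of the labels, so a label lookup for it raises KeyError
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```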

A typical way to work with data is to iterate over all the items in the series and apply the operation one is interested in to each.

In [14]:
grades = pd.Series([90, 80, 70, 60])
total = 0

for grade in grades:
    total += grade
print(total/len(grades))  # the average grade

75.0


In [15]:
# the above way is slow; we can use NumPy's sum function instead

total = np.sum(grades)
print(total/len(grades))

75.0


In [34]:
# but there is still a faster way, let's create a big series of random numbers

numbers = pd.Series(np.random.randint(0, 1000, 10000))  # 10000 random integers from 0 to 999 (the upper bound 1000 is exclusive)
numbers.head()  # look at the top 5 items in the series

0    316
1    610
2    969
3    780
4    416
dtype: int64

In [17]:
len(numbers)  # check the length

10000

Here we use what Jupyter calls a magic function; the one we use is timeit. It runs our code several times to determine, on average, how long it takes. When -n is not given, IPython chooses the number of loops automatically; -n sets it explicitly (100 loops in the cells that follow).

In [23]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number
    
total/len(numbers)

# running timeit with our original iterative code
# note: %%timeit must be the first line of the cell

1.13 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [24]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

69.4 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


This is a substantial difference in speed (roughly 16x in this run). It demonstrates why one should be aware of vectorized operations, which NumPy and pandas run in optimized compiled code, and start thinking in functional programming terms.
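
Outside a notebook, the same comparison can be sketched with the standard-library timeit module instead of the %%timeit magic. This is only an illustration: the function names loop_mean and vectorized_mean are made up here, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean():
    # explicit Python loop over every item
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

def vectorized_mean():
    # a single NumPy call over the whole series
    return np.sum(numbers) / len(numbers)

# run each function 20 times and report the total elapsed time
loop_time = timeit.timeit(loop_mean, number=20)
vector_time = timeit.timeit(vectorized_mean, number=20)
print(f"loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s for 20 runs each")
```

Both functions return the same average; only the time taken to compute it differs.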

Broadcasting

In [35]:
# with broadcasting we apply an operation to every value in the series at once

# to increase every number by 2, first check the original numbers
numbers.head()

0    316
1    610
2    969
3    780
4    416
dtype: int64

In [36]:
# whitespace around the operator doesn't change the result (numbers+=2 and numbers += 2 are the same)
# just don't split the operator itself into "+ =", which is a syntax error
numbers += 2
numbers.head()

0    318
1    612
2    971
3    782
4    418
dtype: int64

In [37]:
# pandas supports iterating through a series like a dictionary
# (items() and .at replace iteritems() and set_value(), which were removed in pandas 2.0 and 1.0)
for label, value in numbers.items():
    numbers.at[label] = value + 2
numbers.head()

0    320
1    614
2    973
3    784
4    420
dtype: int64

another speed comparison

In [40]:
%%timeit -n 10
# creating a new series 
s = pd.Series(np.random.randint(0,1000,1000))
for label, value in s.items():  # items() replaces iteritems(), which was removed in pandas 2.0
    s.loc[label] = value + 2

129 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [41]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,1000))
s+=2

261 µs ± 37 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The .loc attribute can also add new data. If the label we pass in doesn't exist in the index, a new entry is created. Keep in mind that indices can have mixed types, and pandas will automatically change the underlying NumPy dtype as appropriate.

In [43]:
s = pd.Series([1, 2, 3])
s.loc['add_numbers'] = 102  
s

# the label 'add_numbers' is not in the series, so pandas creates a new element
# with the label 'add_numbers' and the value 102

0                1
1                2
2                3
add_numbers    102
dtype: int64
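
A small sketch of the mixed-type index mentioned above, assuming the same three-element series: after the string label is added, the index holds both integers and a string, so its dtype becomes object. The variable names index_dtype_before and index_dtype_after are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
index_dtype_before = s.index.dtype   # default integer index

s.loc['add_numbers'] = 102           # new label, so a new entry is created

index_dtype_after = s.index.dtype    # labels are now a mix of ints and a string
print(index_dtype_before, '->', index_dtype_after)
print(s.loc['add_numbers'], len(s))
```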

What if the index values are not unique?

In [44]:
students_classes = pd.Series({'Alice': 'Physics',
                              'Jack': 'Chemistry',
                              'Molly': 'English',
                              'Sam': 'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [45]:
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [47]:
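# append all of the data in kelly_classes to students_classes
# pd.concat replaces students_classes.append(kelly_classes); Series.append was removed in pandas 2.0
all_students_classes = pd.concat([students_classes, kelly_classes])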
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

First, pandas takes the series and tries to infer the best data types to use. In this example everything is a string, so there are no problems here.

Second, appending doesn't change the underlying series objects (this holds for both the append method of older pandas and pd.concat in pandas 2.0+). Instead, it returns a new series made up of the two joined together. This is a common pattern in pandas: by default, operations return a new object instead of modifying one in place. By printing the original series, we can see that it hasn't changed.

In [48]:
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [51]:
# if we query the appended series for Kelly, we don't get a single value, but a series
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
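
To make the difference explicit, here is a minimal sketch with a smaller, made-up series: a label that appears once gives back a single value, while a repeated label gives back a series. Checking index.is_unique beforehand is one way to know which case to expect.

```python
import pandas as pd

classes = pd.Series(['Physics', 'Philosophy', 'Arts'],
                    index=['Alice', 'Kelly', 'Kelly'])

unique_result = classes.loc['Alice']    # label appears once: a single value
repeated_result = classes.loc['Kelly']  # label appears twice: a Series

print(classes.index.is_unique)
print(type(unique_result).__name__, type(repeated_result).__name__)
```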