# Writing Efficient Python Code

## Chapter 1: Foundations for Efficiencies

### Welcome!

#### Course overview
* Your code should be a tool used to gain insights
    * Not something that leaves you waiting for results
    
#### Defining efficient
* Writing *efficient* Python code
    * Minimal completion time *(fast runtime)*
    * Minimal resource consumption *(small memory footprint)*
    
#### Defining Pythonic
* Writing efficient *Python* code
    * Focus on readability
    * Using Python's constructs as intended (i.e., *Pythonic*)

In [None]:
# Sample input (assumed here so the cell runs on its own)
numbers = [1, 2, 3, 4]

# Non-Pythonic
doubled_numbers = []

for i in range(len(numbers)):
    doubled_numbers.append(numbers[i] * 2)
    
# Pythonic
doubled_numbers = [x * 2 for x in numbers]

#### The Zen of Python by Tim Peters

In [2]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [3]:
names = ['Jerry', 'Kramer', 'Elaine', 'George', 'Newman']

In [4]:
# Print the list created using the Non-Pythonic approach
i = 0
new_list= []
while i < len(names):
    if len(names[i]) >= 6:
        new_list.append(names[i])
    i += 1
print(new_list)

['Kramer', 'Elaine', 'George', 'Newman']


In [5]:
# Print the list created by looping over the contents of names
better_list = []
for name in names:
    if len(name) >= 6:
        better_list.append(name)
print(better_list)

['Kramer', 'Elaine', 'George', 'Newman']


In [6]:
# Print the list created by using list comprehension
best_list = [name for name in names if len(name) >= 6]
print(best_list)

['Kramer', 'Elaine', 'George', 'Newman']


### Building with built-ins

#### The Python Standard Library

* Python 3.6 Standard Library
    * Part of every standard Python installation
    
* Built-in types
    * `list`, `tuple`, `set`, `dict` and others
    
* Built-in functions
    * `print()`, `len()`, `range()`, `round()`, `enumerate()`, `map()`, `zip()` and others
    
* Built-in modules
    * `os`, `sys`, `itertools`, `collections`, `math`, and others
    
#### Built-in function: range()

Explicitly typing a list of numbers

In [7]:
# Not very efficient
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Using `range()` to create the same list

In [8]:
# range(start, stop)
nums = range(0, 11)

nums_list = list(nums)
print(nums_list)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In [9]:
# range(stop)
nums = range(11)

nums_list = list(nums)
print(nums_list)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


Using `range()` with a step value

In [10]:
even_nums = range(2, 11, 2)

even_nums_list = list(even_nums)
print(even_nums_list)

[2, 4, 6, 8, 10]


In [11]:
# Unpack a range object using the star character
nums_list2 = [*range(1,12,2)]
print(nums_list2)

[1, 3, 5, 7, 9, 11]


#### Built-in function: enumerate()

Creates an iterator of `(index, item)` pairs from an iterable

In [12]:
letters = ['a', 'b', 'c', 'd']

indexed_letters = enumerate(letters)

indexed_letters_list = list(indexed_letters)
print(indexed_letters_list)

[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]


Can specify a start value

In [13]:
letters = ['a', 'b', 'c', 'd']

indexed_letters = enumerate(letters, start=5)

indexed_letters_list = list(indexed_letters)
print(indexed_letters_list)

[(5, 'a'), (6, 'b'), (7, 'c'), (8, 'd')]


In [14]:
# Rewrite the for loop to use enumerate

indexed_names = []
for i,name in enumerate(names):
    index_name = (i,name)
    indexed_names.append(index_name) 
print(indexed_names)

[(0, 'Jerry'), (1, 'Kramer'), (2, 'Elaine'), (3, 'George'), (4, 'Newman')]


In [15]:
# Rewrite the above for loop using list comprehension

indexed_names_comp = [(i,name) for i,name in enumerate(names)]
print(indexed_names_comp)

[(0, 'Jerry'), (1, 'Kramer'), (2, 'Elaine'), (3, 'George'), (4, 'Newman')]


In [16]:
# Unpack an enumerate object with a starting index of one

indexed_names_unpack = [*enumerate(names, 1)]
print(indexed_names_unpack)

[(1, 'Jerry'), (2, 'Kramer'), (3, 'Elaine'), (4, 'George'), (5, 'Newman')]


#### Built-in function: map()

Applies a function to each element of an iterable

In [17]:
nums = [1.5, 2.3, 3.4, 4.6, 5.0]

rnd_nums = map(round, nums)

print(list(rnd_nums))

[2, 2, 3, 5, 5]


`map()` with a `lambda` (anonymous function)

In [18]:
nums = [1, 2, 3, 4, 5]

sqrd_nums = map(lambda x: x ** 2, nums)

print(list(sqrd_nums))

[1, 4, 9, 16, 25]


In [19]:
# Use map to apply str.upper to each element in names
names_map  = map(str.upper, names)

# Print the type of the names_map
print(type(names_map))

<class 'map'>


In [20]:
# Unpack names_map into a list
names_uppercase = [*names_map]

# Print the list created above
print(names_uppercase)

['JERRY', 'KRAMER', 'ELAINE', 'GEORGE', 'NEWMAN']


### The power of NumPy arrays

#### NumPy array overview
* Alternative to Python lists

In [21]:
nums_list = list(range(5))
nums_list

[0, 1, 2, 3, 4]

In [22]:
import numpy as np

nums_np = np.array(range(5))
nums_np

array([0, 1, 2, 3, 4])

In [23]:
# NumPy array homogeneity
nums_np_ints = np.array([1, 2, 3])
nums_np_ints

array([1, 2, 3])

In [24]:
nums_np_ints.dtype

dtype('int64')

In [25]:
# Coerced ints into floats
nums_np_floats = np.array([1, 2.5, 3])
nums_np_floats

array([1. , 2.5, 3. ])

In [26]:
nums_np_floats.dtype

dtype('float64')

#### NumPy array broadcasting
* Python lists don't support broadcasting

In [27]:
nums = [-2, -1, 0, 1, 2]
nums ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

* List approach: a loop or a list comprehension works, but neither is the most efficient option

In [28]:
# For loop (inefficient option)
sqrd_nums = []
for num in nums:
    sqrd_nums.append(num ** 2)
print(sqrd_nums)

[4, 1, 0, 1, 4]


In [29]:
# List comprehension (better option but not best)
sqrd_nums = [num ** 2 for num in nums]

print(sqrd_nums)

[4, 1, 0, 1, 4]


* NumPy array broadcasting for the win!
    * NumPy arrays vectorize operations, so they are performed on all elements of an object at once.
    * This allows us to efficiently perform calculations over entire arrays.

In [30]:
nums_np = np.array([-2, -1, 0, 1, 2])
nums_np ** 2

array([4, 1, 0, 1, 4])
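
As a minimal sketch of what broadcasting covers beyond squaring, the example below applies a scalar to a 1-D array and adds a 1-D row to a 2-D array. The names `grid` and `offsets` are made up for illustration and do not come from the course material.

```python
import numpy as np

nums_np = np.array([-2, -1, 0, 1, 2])

# Scalar broadcasting: the scalar is applied to every element
squared = nums_np ** 2
scaled = nums_np * 10

# 2-D with 1-D: the 1-D row is stretched across each row of the 2-D array
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])      # hypothetical example data
offsets = np.array([10, 20, 30])  # hypothetical example data
shifted = grid + offsets

print(squared)
print(scaled)
print(shifted)
```

No explicit loop appears in any of the three operations; NumPy performs the element-wise work internally.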

* NumPy array indexing capabilities are superior

In [31]:
# 2-D list
nums2 = [ [1, 2, 3],
          [4, 5, 6] ]

* Basic 2-D indexing (lists)

In [32]:
nums2[0][1]

2

In [33]:
[row[0] for row in nums2]

[1, 4]

In [34]:
# 2-D array
nums2_np = np.array(nums2)

* Basic 2-D indexing (arrays)

In [35]:
nums2_np[0, 1]

2

In [36]:
nums2_np[:, 0] # way easier to return columns

array([1, 4])

#### NumPy array boolean indexing

In [37]:
nums = [-2, -1, 0, 1, 2]
nums_np = np.array(nums)

* Boolean indexing

In [38]:
nums_np > 0

array([False, False, False,  True,  True])

In [39]:
nums_np[nums_np > 0]

array([1, 2])

* No boolean indexing for lists

In [40]:
# For loop (inefficient option)
pos = []
for num in nums:
    if num > 0:
        pos.append(num)
print(pos)

[1, 2]


In [41]:
# List comprehension (better option but not best)
pos = [num for num in nums if num > 0]
print(pos)

[1, 2]


## Chapter 2: Timing and profiling code

### Examining runtime

#### Why should we time our code?
* Allows us to pick the **optimal** coding approach
* Faster code == more efficient code!

#### How can we time our code?
* Calculate runtime with the IPython magic command `%timeit`
* **Magic commands:** enhancement on top of normal Python syntax
    * Prefixed by the "%" character
    * See all available magic commands with `%lsmagic`
    
#### Using %timeit
Code to be timed

In [42]:
import numpy as np

rand_nums = np.random.rand(1000)

In [43]:
%timeit rand_nums = np.random.rand(1000)

14.6 µs ± 990 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


#### Specifying number of runs/loops
Setting the number of runs (`-r`) and/or loops (`-n`)
* The number of runs represents how many iterations you'd like to use to estimate the runtime
* The number of loops represents how many times you'd like the code to be executed per run

In [44]:
# Set number of runs to 2 (-r2)
# Set number of loops to 10 (-n10)

%timeit -r2 -n10 rand_nums = np.random.rand(1000)

The slowest run took 4.70 times longer than the fastest. This could mean that an intermediate result is being cached.
75.6 µs ± 49.1 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
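
Outside IPython, the standard library's `timeit` module offers a rough equivalent of the runs and loops settings. The sketch below assumes a hypothetical helper, `make_rand_nums`, as a pure-Python stand-in for `np.random.rand(1000)`; it is not part of the course code.

```python
import random
import timeit

def make_rand_nums():
    """Hypothetical stand-in for np.random.rand(1000), using only the stdlib."""
    return [random.random() for _ in range(1000)]

# repeat=2 plays the role of -r2, number=10 the role of -n10
run_totals = timeit.repeat(make_rand_nums, repeat=2, number=10)

# Each entry is the total time for one run of 10 loops,
# so divide by the loop count to get a per-loop time
per_loop = [total / 10 for total in run_totals]
print(f"best per-loop time: {min(per_loop):.2e} s")
```

Unlike `%timeit`, `timeit.repeat` returns the total time per run rather than a formatted mean and standard deviation, so any summary statistics have to be computed by hand.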


#### Using %timeit in line and cell magic modes
Line magic (`%timeit`)

In [45]:
# Single line of code

%timeit nums = [x for x in range(10)]

1.02 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Cell magic (`%%timeit`)

In [83]:
%%timeit
# Multiple lines of code

nums = []
for x in range(10):
    nums.append(x)

1.74 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


#### Saving output
Saving the output to a variable (`-o`)

In [47]:
times = %timeit -o rand_nums = np.random.rand(1000)

15 µs ± 575 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [48]:
times.timings

[1.5319938079919665e-05,
 1.5169383739994374e-05,
 1.5436380750034006e-05,
 1.4591334670112701e-05,
 1.3795430900063366e-05,
 1.5614405939995777e-05,
 1.497780639998382e-05]

In [49]:
times.best

1.3795430900063366e-05

In [50]:
times.worst

1.5614405939995777e-05

#### Comparing times
Python data structures can be created using their formal names

In [51]:
formal_list = list()
formal_dict = dict()
formal_tuple = tuple()

Python data structures can be created using literal syntax

In [52]:
literal_list = []
literal_dict = {}
literal_tuple = ()

In [53]:
f_time = %timeit -o formal_dict = dict()

167 ns ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [54]:
l_time = %timeit -o literal_dict = {}

55 ns ± 7.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [55]:
# .average is in seconds, so multiply by 10**9 to report nanoseconds
diff = (f_time.average - l_time.average) * (10**9)
print(f'l_time better than f_time by {diff:.1f} ns')

l_time better than f_time by 112.4 ns


### Code profiling for runtime

#### Code profiling
* Detailed stats on frequency and duration of function calls
* Line-by-line analyses
* Package used: `line_profiler`

#### Code profiling: runtime

In [56]:
heroes = ['Batman', 'Superman', 'Wonder Woman']

hts = np.array([188.0, 191.0, 183.0])

wts = np.array([95.0, 101.0, 74.0])

In [57]:
def convert_units(heroes, heights, weights):
    
    # Convert heights (cm -> in) and weights (kg -> lb)
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    
    hero_data = {}
    
    for i, hero in enumerate(heroes):
        hero_data[hero] = (new_hts[i], new_wts[i])
        
    return hero_data

In [58]:
convert_units(heroes, hts, wts)

{'Batman': (74.01559999999999, 209.4389),
 'Superman': (75.19669999999999, 222.66661999999997),
 'Wonder Woman': (72.0471, 163.14188)}

Using `line_profiler` package

In [59]:
%load_ext line_profiler

Magic command for line-by-line times

In [60]:
%lprun -f convert_units convert_units(heroes, hts, wts)
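
When `line_profiler` is not installed, the standard library's `cProfile` gives function-level (not line-level) statistics. This is a different tool from the one these notes use, shown only as a sketch. It uses a condensed version of `convert_units` that reads heights from the `heights` argument, and plain lists in place of NumPy arrays so that it has no third-party dependencies.

```python
import cProfile
import io
import pstats

def convert_units(heroes, heights, weights):
    # Condensed version of the function above, for illustration only
    new_hts = [ht * 0.39370 for ht in heights]
    new_wts = [wt * 2.20462 for wt in weights]
    return {hero: (new_hts[i], new_wts[i]) for i, hero in enumerate(heroes)}

heroes = ['Batman', 'Superman', 'Wonder Woman']
hts = [188.0, 191.0, 183.0]   # plain lists, so the sketch needs no NumPy
wts = [95.0, 101.0, 74.0]

# Record every function call made while the profiler is enabled
profiler = cProfile.Profile()
profiler.enable()
hero_data = convert_units(heroes, hts, wts)
profiler.disable()

# Collect the report in a string buffer, sorted by cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('cumulative').print_stats()
print(report.getvalue())
```

The report lists call counts and total and cumulative time per function, which helps find the slow function before drilling into it line by line with `%lprun`.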

### Code profiling for memory usage

#### Quick and dirty approach

In [61]:
import sys

In [62]:
nums_list = [*range(1000)]
sys.getsizeof(nums_list)

9112

In [63]:
import numpy as np

nums_np = np.array(range(1000))
sys.getsizeof(nums_np)

8096
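
One caveat of the quick approach: `sys.getsizeof()` on a list reports the list object itself, not the integer objects it points to. The sketch below is a rough illustration of that difference. The combined figure is only an estimate, since CPython shares some small integer objects.

```python
import sys
import numpy as np

nums_list = [*range(1000)]
nums_np = np.array(range(1000))

# Container only: the list object and its internal array of references
list_container = sys.getsizeof(nums_list)

# Container plus every int object it references (rough estimate)
list_total = list_container + sum(sys.getsizeof(n) for n in nums_list)

# Raw data buffer of the array: number of elements times bytes per element
array_data = nums_np.nbytes

print(list_container, list_total, array_data)
```

The exact numbers depend on the Python version and platform, so no expected output is shown here.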

#### Code profiling: memory
* Detailed stats on memory consumption
* Line-by-line analyses
* Package used: `memory_profiler`
* Functions must be imported when using `memory_profiler`
    * `hero_funcs.py`

In [64]:
from hero_funcs import convert_units

In [65]:
%load_ext memory_profiler

%mprun -f convert_units convert_units(heroes, hts, wts)




#### %mprun output caveats
* Inspects memory by querying the operating system
* Results may differ between platforms and runs
    * Can still observe how each line of code compares to others based on memory consumption

## Chapter 3: Gaining efficiencies

### Efficiently combining, counting, and iterating

#### Pokémon Overview
* Trainers (collect Pokémon)
* Pokémon (fictional animal characters)

#### Combining objects

In [66]:
names = ['Bulbasaur', 'Charmander', 'Squirtle']
hps = [45, 39, 44]

In [67]:
combined = []

for i, pokemon in enumerate(names):
    combined.append((pokemon, hps[i]))

print(combined)

[('Bulbasaur', 45), ('Charmander', 39), ('Squirtle', 44)]


#### Combining objects with zip

In [68]:
names = ['Bulbasaur', 'Charmander', 'Squirtle']
hps = [45, 39, 44]

In [69]:
combined_zip = zip(names, hps)
print(type(combined_zip))

<class 'zip'>


In [70]:
combined_zip_list = [*combined_zip]
print(combined_zip_list)

[('Bulbasaur', 45), ('Charmander', 39), ('Squirtle', 44)]


#### The collections module
* Part of Python's Standard Library (built-in module)
* Specialized container datatypes
    * Alternatives to general purpose dict, list, set and tuple
* Notable:
    * `namedtuple`: tuple subclasses with named fields
    * `deque`: list-like container with fast appends and pops
    * `Counter`: dict for counting hashable objects
    * `OrderedDict`: dict that retains order of entries
    * `defaultdict`: dict that calls a factory function to supply missing values
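
A short sketch of three of the containers listed above. The sample Pokémon data and the names `Pokemon`, `party`, and `names_by_type` are invented for illustration.

```python
from collections import namedtuple, deque, defaultdict

# namedtuple: access fields by name instead of position
Pokemon = namedtuple('Pokemon', ['name', 'hp'])
bulbasaur = Pokemon(name='Bulbasaur', hp=45)

# deque: efficient appends and pops at both ends
party = deque(['Charmander', 'Squirtle'])
party.appendleft(bulbasaur.name)
last_out = party.pop()

# defaultdict: missing keys are created by calling the factory (list here)
names_by_type = defaultdict(list)
sample = [('Bulbasaur', 'Grass'), ('Charmander', 'Fire'), ('Vulpix', 'Fire')]
for name, poke_type in sample:
    names_by_type[poke_type].append(name)

print(bulbasaur.hp, list(party), last_out, dict(names_by_type))
```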
    
#### Counting with loop

In [71]:
# Small sample of Pokémon types (the full dataset has 720 entries)
poke_types = ['Grass', 'Dark', 'Fire', 'Fire']
type_counts = {}
for poke_type in poke_types:
    if poke_type not in type_counts:
        type_counts[poke_type] = 1
    else:
        type_counts[poke_type] += 1
print(type_counts)

{'Grass': 1, 'Dark': 1, 'Fire': 2}


#### collections.Counter()

In [72]:
# Small sample of Pokémon types (the full dataset has 720 entries)
poke_types = ['Grass', 'Dark', 'Fire', 'Fire']
from collections import Counter
type_counts = Counter(poke_types)

# Orders by highest to lowest counts
print(type_counts)

Counter({'Fire': 2, 'Grass': 1, 'Dark': 1})
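
Two `Counter` conveniences worth knowing, sketched on the same sample list: `most_common()` returns `(item, count)` pairs in descending order of count, and looking up an unseen key returns zero instead of raising `KeyError`.

```python
from collections import Counter

poke_types = ['Grass', 'Dark', 'Fire', 'Fire']
type_counts = Counter(poke_types)

# The single most frequent type, as a list of (item, count) pairs
top_type = type_counts.most_common(1)

# Types that were never seen count as zero
water_count = type_counts['Water']

print(top_type, water_count)
```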


#### The itertools module
* Part of Python's Standard Library (built-in module)
* Functional tools for creating and using iterators
* Notable:
    * Infinite iterators: `count`, `cycle`, `repeat`
    * Finite iterators: `accumulate`, `chain`, `zip_longest`, etc.
    * Combination generators: `product`, `permutations`, `combinations`
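
A brief sketch of a few of the iterators named above, using small made-up inputs. The `hps` list is deliberately one value short to show how `zip_longest` differs from `zip`.

```python
from itertools import chain, count, permutations, product, zip_longest

names = ['Bulbasaur', 'Charmander', 'Squirtle']
hps = [45, 39]   # deliberately one value short

# zip_longest pads the shorter input instead of truncating like zip
padded = [*zip_longest(names, hps, fillvalue=None)]

# chain iterates over several iterables as if they were one
all_items = [*chain(names, hps)]

# count is infinite, so pair it with a finite iterable via zip
numbered = [*zip(count(start=1), names)]

# product gives every ordered pairing across inputs;
# permutations gives ordered arrangements drawn from one input
pairs = [*product(['Fire', 'Water'], ['Low', 'High'])]
orderings = [*permutations(['Bug', 'Fire', 'Ghost'], 2)]

print(padded)
print(numbered)
print(len(pairs), len(orderings))
```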
    
#### Combinations with loop

In [73]:
poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']
combos = []

for x in poke_types:
    for y in poke_types:
        if x == y:
            continue
        if (x, y) not in combos and (y, x) not in combos:
            combos.append((x,y))
print(combos)

[('Bug', 'Fire'), ('Bug', 'Ghost'), ('Bug', 'Grass'), ('Bug', 'Water'), ('Fire', 'Ghost'), ('Fire', 'Grass'), ('Fire', 'Water'), ('Ghost', 'Grass'), ('Ghost', 'Water'), ('Grass', 'Water')]


#### itertools.combinations()

In [74]:
poke_types = ['Bug', 'Fire', 'Ghost', 'Grass', 'Water']
from itertools import combinations
combos_obj = combinations(poke_types, 2)
print(type(combos_obj))

<class 'itertools.combinations'>


In [75]:
combos = [*combos_obj]
print(combos)

[('Bug', 'Fire'), ('Bug', 'Ghost'), ('Bug', 'Grass'), ('Bug', 'Water'), ('Fire', 'Ghost'), ('Fire', 'Grass'), ('Fire', 'Water'), ('Ghost', 'Grass'), ('Ghost', 'Water'), ('Grass', 'Water')]


### Set theory
* Branch of Mathematics applied to collections of objects
    * i.e., `sets`
* Python has built-in `set` datatype with accompanying methods:
    * `intersection()`: all elements that are in both sets
    * `difference()`: all elements in one set but not the other
    * `symmetric_difference()`: all elements in exactly one set
    * `union()`: all elements that are in either set
* Fast membership testing
    * Check if a value exists in a sequence or not
    * Using the `in` operator
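
Each of the set methods listed above also has an operator form. The sketch below shows the correspondence on two small sets. One difference to keep in mind: the methods accept any iterable as their argument, while the operators require both operands to be sets.

```python
set_a = {'Bulbasaur', 'Charmander', 'Squirtle'}
set_b = {'Caterpie', 'Pidgey', 'Squirtle'}

both = set_a & set_b          # same as set_a.intersection(set_b)
only_a = set_a - set_b        # same as set_a.difference(set_b)
exactly_one = set_a ^ set_b   # same as set_a.symmetric_difference(set_b)
either = set_a | set_b        # same as set_a.union(set_b)

print(both, only_a, exactly_one, either, sep='\n')
```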
    
#### Comparing objects with loops

In [76]:
list_a = ['Bulbasaur', 'Charmander', 'Squirtle']
list_b = ['Caterpie', 'Pidgey', 'Squirtle']

In [77]:
# Extremely inefficient
in_common = []

for pokemon_a in list_a:
    for pokemon_b in list_b:
        if pokemon_a == pokemon_b:
            in_common.append(pokemon_a)
            
print(in_common)

['Squirtle']


In [78]:
set_a = set(list_a)
print(set_a)

{'Bulbasaur', 'Charmander', 'Squirtle'}


In [79]:
set_b = set(list_b)
print(set_b)

{'Caterpie', 'Squirtle', 'Pidgey'}


In [80]:
set_a.intersection(set_b)

{'Squirtle'}

#### Efficiency gained with set theory

In [82]:
%%timeit
in_common = []

for pokemon_a in list_a:
    for pokemon_b in list_b:
        if pokemon_a == pokemon_b:
            in_common.append(pokemon_a)

858 ns ± 71.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [84]:
%timeit in_common = set_a.intersection(set_b)

260 ns ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


#### Set method: difference

In [85]:
print(set_a)

{'Bulbasaur', 'Charmander', 'Squirtle'}


In [86]:
print(set_b)

{'Caterpie', 'Squirtle', 'Pidgey'}


In [87]:
set_a.difference(set_b)

{'Bulbasaur', 'Charmander'}

In [88]:
set_b.difference(set_a)

{'Caterpie', 'Pidgey'}

#### Set method: symmetric difference

In [89]:
set_a.symmetric_difference(set_b)

{'Bulbasaur', 'Caterpie', 'Charmander', 'Pidgey'}

#### Set method: union

In [90]:
set_a.union(set_b)

{'Bulbasaur', 'Caterpie', 'Charmander', 'Pidgey', 'Squirtle'}

#### Membership testing with sets

In [91]:
names_list = ['Abomasnow', 'Abra', 'Absol']
names_tuple = ('Abomasnow', 'Abra', 'Absol')
names_set = {'Abomasnow', 'Abra', 'Absol'}

In [92]:
%timeit 'Zubat' in names_list

117 ns ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [93]:
%timeit 'Zubat' in names_tuple

97.9 ns ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [94]:
%timeit 'Zubat' in names_set

68 ns ± 7.76 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
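
With only three names the gap between containers is small. The sketch below repeats the comparison on a larger, made-up collection using the standard library's `timeit`, where the linear scan of a list is expected to stand out against the hash lookup of a set. The 10,000 generated names are hypothetical.

```python
import timeit

# Hypothetical larger collection: 10,000 made-up names
names_list = [f'pokemon_{i}' for i in range(10_000)]
names_set = set(names_list)

# Searching for a missing value forces the list to scan every element
list_time = timeit.timeit(lambda: 'Zubat' in names_list, number=200)
set_time = timeit.timeit(lambda: 'Zubat' in names_set, number=200)

print(f"list: {list_time:.6f} s, set: {set_time:.6f} s")
```

Timings depend on the machine, so no expected output is given.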


#### Unique with sets
* set: collection of *distinct* elements

In [95]:
primary_types = ['Grass', 'Psychic', 'Dark', 'Bug']

In [96]:
unique_types = []

for prim_type in primary_types:
    if prim_type not in unique_types:
        unique_types.append(prim_type)

print(unique_types)

['Grass', 'Psychic', 'Dark', 'Bug']


In [97]:
unique_types_set = set(primary_types)

In [98]:
print(unique_types_set)

{'Dark', 'Psychic', 'Grass', 'Bug'}
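
Note that the set output above no longer follows the order of the original list. When order matters, one common alternative (not covered in these notes) is `dict.fromkeys()`, which relies on dictionaries preserving insertion order, a language guarantee from Python 3.7 onward. The duplicated entries in the sample list below are added for illustration.

```python
primary_types = ['Grass', 'Psychic', 'Dark', 'Bug', 'Grass', 'Dark']

# set() removes duplicates but does not promise any particular order
unique_unordered = set(primary_types)

# dict keys are unique and keep first-seen order (Python 3.7+)
unique_ordered = list(dict.fromkeys(primary_types))

print(unique_ordered)
```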


### Eliminating loops

#### Looping in Python
* Looping patterns:
    * `for` loop: iterate over sequence piece-by-piece
    * `while` loop: repeat loop as long as condition is met
    * "nested" loops: use one loop inside another loop
    * Costly!
    
#### Benefits of eliminating loops
* Fewer lines of code
* Better code readability
    * "Flat is better than nested"
* Efficiency gains

#### Eliminating loops with built-ins

In [99]:
# List of HP, Attack, Defense, Speed
poke_stats = [
    [90, 92, 75, 60],
    [25, 20, 15, 90],
    [65, 130, 60, 75]
]

# For loop approach
totals = []
for row in poke_stats:
    totals.append(sum(row))

# List comprehension
totals_comp = [sum(row) for row in poke_stats]

# Built-in map() function
totals_map = [*map(sum, poke_stats)]

#### Eliminate loops with NumPy

In [100]:
# Array of HP, Attack, Defense, Speed
import numpy as np

poke_stats = np.array([
    [90, 92, 75, 60],
    [25, 20, 15, 90],
    [65, 130, 60, 75]
])

In [101]:
avgs_np = poke_stats.mean(axis = 1)
print(avgs_np)

[79.25 37.5  82.5 ]
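
The same `axis` argument works for other reductions. As a sketch, the row totals computed earlier with a loop, a comprehension, and `map()` can also come from `.sum(axis=1)`, while `axis=0` reduces down each column instead.

```python
import numpy as np

poke_stats = np.array([
    [90, 92, 75, 60],
    [25, 20, 15, 90],
    [65, 130, 60, 75]
])

# axis=1 collapses the columns, giving one value per row (per Pokémon)
totals_np = poke_stats.sum(axis=1)

# axis=0 collapses the rows, giving one value per column (per stat)
stat_avgs = poke_stats.mean(axis=0)

print(totals_np)
print(stat_avgs)
```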


### Writing better loops

#### Lesson caveat
* Some of the following loops can be eliminated with techniques covered in previous lessons.
* Examples in this lesson are used for **demonstrative** purposes.

#### Writing better loops
* Understand what is being done with each loop iteration
* Move one-time calculations outside (above) the loop
* Use holistic conversions outside (below) the loop
* Anything that is done **once** should be outside the loop

#### Moving calculations above a loop

In [102]:
# Inefficient approach
import numpy as np

names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']
attacks = np.array([130, 70, 50, 50, 45])
for pokemon, attack in zip(names, attacks):
    total_attack_avg = attacks.mean()
    if attack > total_attack_avg:
        print(
            "{}'s attack: {} > average: {}!"
            .format(pokemon, attack, total_attack_avg)
        )

Absol's attack: 130 > average: 69.0!
Aron's attack: 70 > average: 69.0!


In [103]:
# Efficient approach
import numpy as np

names = ['Absol', 'Aron', 'Jynx', 'Natu', 'Onix']
attacks = np.array([130, 70, 50, 50, 45])
# Calculate total average once (outside the loop)
total_attack_avg = attacks.mean()
for pokemon, attack in zip(names, attacks):
    
    if attack > total_attack_avg:
        print(
            "{}'s attack: {} > average: {}!"
            .format(pokemon, attack, total_attack_avg)
        )

Absol's attack: 130 > average: 69.0!
Aron's attack: 70 > average: 69.0!


#### Using holistic conversions

In [104]:
# Inefficient approach
names = ['Pikachu', 'Squirtle', 'Articuno']
legend_status = [False, False, True]
generations = [1, 1, 1]
poke_data = []
for poke_tuple in zip(names, legend_status, generations):
    poke_list = list(poke_tuple)
    poke_data.append(poke_list)
print(poke_data)

[['Pikachu', False, 1], ['Squirtle', False, 1], ['Articuno', True, 1]]


In [105]:
# Efficient approach
names = ['Pikachu', 'Squirtle', 'Articuno']
legend_status = [False, False, True]
generations = [1, 1, 1]
poke_data_tuples = []
for poke_tuple in zip(names, legend_status, generations):
    poke_data_tuples.append(poke_tuple)
poke_data = [*map(list, poke_data_tuples)]
print(poke_data)

[['Pikachu', False, 1], ['Squirtle', False, 1], ['Articuno', True, 1]]
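
In line with the lesson caveat, this particular loop can be dropped altogether. A minimal sketch that feeds `zip()` straight into `map()`:

```python
names = ['Pikachu', 'Squirtle', 'Articuno']
legend_status = [False, False, True]
generations = [1, 1, 1]

# zip builds the tuples lazily; map converts each one to a list in a single pass
poke_data = [*map(list, zip(names, legend_status, generations))]
print(poke_data)
```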


## Chapter 4: Basic pandas optimizations

### Intro to pandas DataFrame iteration

#### Baseball stats

In [106]:
import pandas as pd

baseball_df = pd.read_csv('datasets/baseball_stats.csv')
baseball_df.head()

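| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 |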


#### Calculating win percentage

In [107]:
import numpy as np

def calc_win_perc(wins, games_played):
    win_perc = wins / games_played
    return np.round(win_perc, 2)

In [108]:
win_perc = calc_win_perc(50, 100)
win_perc

0.5

#### Adding win percentage to DataFrame

In [109]:
win_perc_list = []

for i in range(len(baseball_df)):
    row = baseball_df.iloc[i]
    wins = row['W']
    games_played = row['G']
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
    
baseball_df['WP'] = win_perc_list
baseball_df.head()

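| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | WP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 | 0.5 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 | 0.58 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 | 0.57 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 | 0.43 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 | 0.38 |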


#### Iterating with .iterrows()

In [110]:
# Takes about half the time of the .iloc approach above
win_perc_list = []

for i, row in baseball_df.iterrows():
    wins = row['W']
    games_played = row['G']
    win_perc = calc_win_perc(wins, games_played)
    win_perc_list.append(win_perc)
    
baseball_df['WP'] = win_perc_list
baseball_df.head()

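| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | WP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 | 0.5 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 | 0.58 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 | 0.57 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 | 0.43 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 | 0.38 |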


### Another iterator method: .itertuples()

#### Team wins data

In [111]:
team_wins_df = baseball_df[['Team', 'Year', 'W']]
print(team_wins_df.head())

  Team  Year   W
0  ARI  2012  81
1  ATL  2012  94
2  BAL  2012  93
3  BOS  2012  69
4  CHC  2012  61


In [113]:
for row_tuple in team_wins_df.head().iterrows():
    print(row_tuple)
    print(type(row_tuple[1]))

(0, Team     ARI
Year    2012
W         81
Name: 0, dtype: object)
<class 'pandas.core.series.Series'>
(1, Team     ATL
Year    2012
W         94
Name: 1, dtype: object)
<class 'pandas.core.series.Series'>
(2, Team     BAL
Year    2012
W         93
Name: 2, dtype: object)
<class 'pandas.core.series.Series'>
(3, Team     BOS
Year    2012
W         69
Name: 3, dtype: object)
<class 'pandas.core.series.Series'>
(4, Team     CHC
Year    2012
W         61
Name: 4, dtype: object)
<class 'pandas.core.series.Series'>


#### Iterating with .itertuples()
* Typically much more efficient than .iterrows() because of how the output is stored
* Since .iterrows() returns each row's values as a pandas Series, there's a bit more overhead

In [114]:
for row_namedtuple in team_wins_df.head().itertuples():
    print(row_namedtuple)

Pandas(Index=0, Team='ARI', Year=2012, W=81)
Pandas(Index=1, Team='ATL', Year=2012, W=94)
Pandas(Index=2, Team='BAL', Year=2012, W=93)
Pandas(Index=3, Team='BOS', Year=2012, W=69)
Pandas(Index=4, Team='CHC', Year=2012, W=61)


In [115]:
print(row_namedtuple.Index)

4


In [116]:
print(row_namedtuple.Team)

CHC


In [117]:
print(row_namedtuple.Year)

2012


In [118]:
print(row_namedtuple.W)

61


In [119]:
for row_tuple in team_wins_df.head().iterrows():
    print(row_tuple[1]['Team'])

ARI
ATL
BAL
BOS
CHC


In [120]:
for row_namedtuple in team_wins_df.head().itertuples():
    print(row_namedtuple['Team'])

TypeError: tuple indices must be integers or slices, not str

In [121]:
for row_namedtuple in team_wins_df.head().itertuples():
    print(row_namedtuple.Team)

ARI
ATL
BAL
BOS
CHC
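
To tie `.itertuples()` back to the win-percentage calculation, here is a minimal sketch on a hand-typed copy of the first five rows of `baseball_df` (only the `Team`, `W`, and `G` columns). The name `sample_df` is an assumption made for this example, and Python's built-in `round()` stands in for `calc_win_perc`.

```python
import pandas as pd

# First five rows of baseball_df, typed in by hand for a self-contained example
sample_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL', 'BOS', 'CHC'],
    'W': [81, 94, 93, 69, 61],
    'G': [162, 162, 162, 162, 162],
})

win_perc_list = []
for row in sample_df.itertuples():
    # Attribute access on the namedtuple replaces row['W'] and row['G']
    win_perc_list.append(round(row.W / row.G, 2))

sample_df['WP'] = win_perc_list
print(sample_df)
```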


### pandas alternative to looping

In [122]:
baseball_df.head()

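| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | WP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 | 0.5 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 | 0.58 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 | 0.57 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 | 0.43 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 | 0.38 |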


In [123]:
def calc_run_diff(runs_scored, runs_allowed):
    run_diff = runs_scored - runs_allowed
    return run_diff

#### Run differentials with a loop

In [124]:
run_diff_iterrows = []

for i, row in baseball_df.iterrows():
    run_diff = calc_run_diff(row['RS'], row['RA'])
    run_diff_iterrows.append(run_diff)
    
baseball_df['RD'] = run_diff_iterrows
baseball_df.head()

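| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | WP | RD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 | 0.5 | 46 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 | 0.58 | 100 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 | 0.57 | 7 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 | 0.43 | -72 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 | 0.38 | -146 |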


#### pandas .apply() method
* Takes a function and applies it to a DataFrame
    * Must specify an axis to apply along (`0` for columns; `1` for rows)
* Can be used with anonymous functions (`lambda` functions)
* Example:

In [126]:
baseball_df.head().apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis = 1
)

0     46
1    100
2      7
3    -72
4   -146
dtype: int64
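
To make the `axis` argument concrete, the sketch below runs `.apply()` both ways on a small hand-typed frame holding the `RS` and `RA` values from the first five rows. The name `runs_df` is an assumption made for this example.

```python
import pandas as pd

# RS and RA for the first five teams, typed in by hand
runs_df = pd.DataFrame({
    'RS': [734, 700, 712, 734, 613],
    'RA': [688, 600, 705, 806, 759],
})

# axis=0: the function receives one column (a Series) at a time
col_totals = runs_df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives one row (a Series) at a time
row_diffs = runs_df.apply(lambda row: row['RS'] - row['RA'], axis=1)

print(col_totals)
print(row_diffs)
```

With `axis=0` the result has one entry per column; with `axis=1` it has one entry per row, which is the shape needed for a new DataFrame column.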

#### Run differentials with .apply()

In [127]:
run_diffs_apply = baseball_df.apply(
    lambda row: calc_run_diff(row['RS'], row['RA']),
    axis = 1
)

baseball_df['RD'] = run_diffs_apply
baseball_df.head()

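| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | WP | RD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 | 0.5 | 46 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 | 0.58 | 100 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 | 0.57 | 7 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 | 0.43 | -72 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 | 0.38 | -146 |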


### Optimal pandas iterating

#### pandas internals
* Eliminating loops applies to using pandas as well
* pandas is built on NumPy
    * Take advantage of NumPy array efficiencies

In [128]:
baseball_df.head()

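| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | WP | RD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 | 0.5 | 46 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 | 0.58 | 100 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 | 0.57 | 7 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 | 0.43 | -72 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 | 0.38 | -146 |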


In [129]:
wins_np = baseball_df['W'].values
print(type(wins_np))

<class 'numpy.ndarray'>


In [130]:
print(wins_np)

[ 81  94  93 ... 103  84  60]


#### Power of vectorization
* Broadcasting (vectorizing) is extremely efficient!

In [131]:
baseball_df['RS'].values - baseball_df['RA'].values

array([  46,  100,    7, ...,  188,  110, -117])

#### Run differentials with arrays

In [132]:
run_diffs_np = baseball_df['RS'].values - baseball_df['RA'].values
baseball_df['RD'] = run_diffs_np
baseball_df.head()

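| | Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | WP | RD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 | 0.5 | 46 |
| 1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.32 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 | 0.58 | 100 |
| 2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 | 0.57 | 7 |
| 3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.26 | 0 | NaN | NaN | 162 | 0.331 | 0.428 | 0.43 | -72 |
| 4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.24 | 0 | NaN | NaN | 162 | 0.335 | 0.424 | 0.38 | -146 |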
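
The same vectorized pattern applies to the win percentage computed earlier with loops. A minimal sketch on a hand-typed copy of the first five rows (the name `sample_df` is an assumption made for this example):

```python
import numpy as np
import pandas as pd

# First five rows of baseball_df, typed in by hand for a self-contained example
sample_df = pd.DataFrame({
    'Team': ['ARI', 'ATL', 'BAL', 'BOS', 'CHC'],
    'W': [81, 94, 93, 69, 61],
    'G': [162, 162, 162, 162, 162],
})

# Element-wise division over the underlying NumPy arrays, no explicit loop
win_perc_np = np.round(sample_df['W'].values / sample_df['G'].values, 2)
sample_df['WP'] = win_perc_np

print(sample_df)
```

The resulting `WP` values match those produced by the `.iloc` and `.iterrows()` loops for these rows.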
