# Beginner Python Mistakes

## What?

At Galvanize, we see a lot of beginning Python programmers who are also beginning data scientists, so we have unique insight into some of the traps they fall into.

## So?

Avoiding these mistakes may not change the logic or execution time of your code (though in some cases it will), but it will let anybody reading it know that you're fluent in Python.

This is especially important in interviews, where you may not get a chance to overcome the first impression your code makes!

<img src="https://s3-us-west-2.amazonaws.com/galvanize.com-dev/galvanize-logo.svg" width="500">

# About me:

## Cary Goltermann
DSI Instructor

Galvanize

San Francisco

<a href="https://twitter.com/carygoltermann">@carygoltermann</a>

https://www.linkedin.com/in/carygoltermann

cary.goltermann@galvanize.com


### A dynamic learning community for technology. A place where anyone with the aptitude and drive to learn can get the skills necessary to work in technology.



### Part-time classes
* Web Development Foundations in JavaScript
* Data Science Fundamentals: Intro to Python
* Intro to Spark for Data Science


### Full-time classes
* Data Science Immersive 
* Web Development Immersive
* Masters in Data Science

# This presentation

### Available at

https://github.com/zipfian/python-anti-patterns

### Many thanks to Isaac Laughlin whose notebook, also included in the same repository, served as the basis for this presentation.

### Things to keep in mind about good Python code:
* Readability
* Extensibility
* Succinctness
* Speed

# Good Function Use
The hallmark of a good programmer is dividing a large, difficult problem into many simple sub-problems. In Python we do this with procedures, colloquially known as functions. (Python also has classes, but those are a talk in and of themselves.)

# Using `print` instead of `return`

In [None]:
def is_palindrome(word):
    word = word.replace(' ', '')
    word = word.lower()
    if word == word[::-1]:
        print("It's a palindrome")
    else:
        print("It's not a palindrome")
is_palindrome('Too Bad I Hid A Boot')

### How to recognize when you might be making this mistake
* You are using `print` inside a function whose primary purpose is something other than providing printed output for the user.
* The exception is debugging. Good programmers use print statements, among other tools, liberally when debugging.

### Why it's important to avoid
* You want to capture the output of your function so functions can be combined and reused effectively. `print` sends its output to stdout (the screen), where it is hard for other parts of your program to reach.
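One possible repair of the function above is sketched here, assuming callers want a boolean they can reuse. The function returns its result and leaves any printing to the caller. The `phrases` list is made up for illustration.

```python
def is_palindrome(word):
    """Return True if word reads the same forwards and backwards.

    Spaces and letter case are ignored.
    """
    normalized = word.replace(' ', '').lower()
    return normalized == normalized[::-1]


# The caller decides what to do with the result, including printing it.
phrases = ['Too Bad I Hid A Boot', 'Not a palindrome']
palindromes = [phrase for phrase in phrases if is_palindrome(phrase)]
print(palindromes)
```

Because the result is returned, the same function can be used as a filter, as it is here, without any change.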

# Proper abstraction
Just because you put some code in a function does not mean that you are using functions well. One of the main purposes of a function is to create a layer of abstraction around the process that the function performs. If the function isn't properly abstracted, it is only useful in a single context, and is therefore no better than if the code weren't in a function.

In [None]:
# Find row with highest number of 3s, take the average of that row's values with one 3 removed

from __future__ import division

data = [[1, 2, 3, 2, 3, 2, 1, 3, 1, 2, 2, 2],
        [2, 3, 2, 3, 3, 2, 1, 1, 1, 2, 3, 3],
        [1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 3, 3]]

def find_avg_highest(data):
    """Args: list of lists of numbers.
    """
    highest, row_idx = -1, -1
    for i, row in enumerate(data):
        if row.count(3) > highest:
            # Track the best count so far; otherwise the last row always wins.
            highest, row_idx = row.count(3), i
    return augmented_avg(data[row_idx])

def augmented_avg(row):
    """Args: list of numbers.
    """
    augmented_row_sum = sum(row) - 3
    return augmented_row_sum / len(row)

find_avg_highest(data)

In [None]:
# Solve it

In [None]:
# Solution

def find_avg_highest(data, n=3):
    """Args: list of lists of numbers.
    """
    highest, row_idx = -1, -1
    for i, row in enumerate(data):
        if row.count(n) > highest:
            # Track the best count so far; otherwise the last row always wins.
            highest, row_idx = row.count(n), i
    return augmented_avg(data[row_idx], n)

def augmented_avg(row, n, count=1):
    """Args: list of numbers.
    """
    augmented_row_sum = sum(row) - n*count
    return augmented_row_sum / len(row)

find_avg_highest(data)

### How to recognize when you might be making this mistake
* There are hard-coded values in a function.

### Why it's important
* Encourages code
    * extensibility
    * generalizability
    * reusability
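As a further sketch of the same idea, the two steps can be split into helpers that take the target value as a parameter. The names `row_with_most` and `avg_without` are hypothetical, and the code assumes Python 3 division. The built-in `max` with a `key` function replaces the hand-written search loop.

```python
data = [[1, 2, 3, 2, 3, 2, 1, 3, 1, 2, 2, 2],
        [2, 3, 2, 3, 3, 2, 1, 1, 1, 2, 3, 3],
        [1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 3, 3]]


def row_with_most(rows, value):
    """Return the first row containing the most occurrences of value."""
    return max(rows, key=lambda row: row.count(value))


def avg_without(row, value, count=1):
    """Average of row after subtracting `count` copies of value from its sum.

    As in the example above, the divisor stays len(row).
    """
    return (sum(row) - value * count) / len(row)


best_row = row_with_most(data, 3)
print(avg_without(best_row, 3))
```

Neither helper mentions the number 3, so the same code answers the question for any value.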

# Using Data Structures Well
Python has many convenient data structures baked into the language that make it powerful at succinctly expressing algorithms.

# Using (/not using) lists

Everybody who learns Python learns about lists, but sometimes they still fail to use them in places where they're appropriate!

In [None]:
# What does this look like?

import collections

def is_full_house(card1, card2, card3, card4, card5):
    """Args: five cards.
    
    Checks if a five card hand is a full house.
    """
    counts = collections.Counter([card1['value'], card2['value'], card3['value'], 
                                  card4['value'], card5['value']]).values()
    if 2 in counts and 3 in counts:
        return True
    return False

is_full_house({'value': 'k'}, {'value': 'k'}, {'value': 'k'},
              {'value': 'q'}, {'value': 'q'})

In [None]:
# Solve it

In [None]:
# Solution
def is_full_house(hand):
    """Args: list of five cards.
    
    Checks if a five card hand is a full house.
    """
    counts = collections.Counter([card['value'] for card in hand]).values()
    return 2 in counts and 3 in counts

is_full_house([{'value': 'k'}, {'value': 'k'}, {'value': 'k'},
               {'value': 'q'}, {'value': 'q'}])

### How to recognize when you might be making this mistake
* Note the `variable_<index>` pattern in the variables `card1`, `card2`, etc. -- this is exactly what lists are for!
* To use this pattern you need two pieces of information, the variable name and the index, which are the same two things required to look up an item in a list.
    
### Why it's important
* Easy extensibility/generalizability.
* **D**on't **R**epeat **Y**ourself (DRY)

# Using (/not using) dictionaries

Dictionaries, or dicts, are one of the things that makes Python so special. Using them liberally (and correctly) is a good way to signal code intention!

In [None]:
# What does this look like?

from __future__ import division

def wa_ca_averages(data):
    """Args: list of tuples with state and value.
    
    Compute the averages for WA and CA.
    """
    data_wa = [x[1] for x in data if x[0] == 'WA']
    avg_wa = sum(data_wa) / len(data_wa)
    data_ca = [x[1] for x in data if x[0] == 'CA']
    avg_ca = sum(data_ca) / len(data_ca)
    return avg_wa, avg_ca

wa_ca_averages([('WA', 1), ('WA', 3), ('CA', 2), ('CA', 3)])

In [None]:
# Solve it
state_averages([('WA', 1), ('WA', 3), ('CA', 2), ('CA', 3)])

In [None]:
def state_averages(data):
    """Args: list of tuples with state and value.
    
    Compute the average value for each state.
    """
    avgs = {}
    for state in {x[0] for x in data}:
        state_data = [x[1] for x in data if x[0] == state]
        avgs[state] = sum(state_data) / len(state_data)
    return avgs

state_averages([('WA', 1), ('WA', 3), ('CA', 2), ('CA', 3)])

In [None]:
# Better solution
from collections import defaultdict

def state_averages(data):
    """Args: list of tuples with state and value.
    
    Compute the average value for each state.
    """
    state_data = defaultdict(list)
    for state, value in data:
        state_data[state].append(value)
    
    state_avgs = {state: sum(values) / len(values) for state, values in state_data.items()}
    return state_avgs

state_averages([('WA', 1), ('WA', 3), ('CA', 2), ('CA', 3)])

### How to recognize when you might be making this mistake
* Note the `variable_<key>` identifier like `avg_ca` -- this is exactly what dicts are for.
* To use this pattern you need two pieces of information, `variable` and `key`, which are the same two things required for a dictionary.

### Why it's important
* Easy code extensibility/generalizability
* Parsimony
* DRY

# List comprehensions

List comprehensions are a very tidy way of doing things that would otherwise require a for loop. Experienced Python programmers use them routinely.

In [None]:
raw_data = "10:00am,60,Sapna;11:30am,30,Lin;2:00pm,60,Cary"
parsed_data = []
for event in raw_data.split(';'):
    schedule = []
    for data in event.split(','):
        schedule.append(data)
    if int(schedule[1]) > 30:
        parsed_data.append(schedule)
print(parsed_data)

In [None]:
# Solve it
raw_data = "10:00am,60,Sapna;11:30am,30,Lin;2:00pm,60,Cary"

In [None]:
# Solution
parsed_data = raw_data.split(';')
parsed_data = [event.split(',') for event in parsed_data]
parsed_data = [event for event in parsed_data if int(event[1]) > 30]
print(parsed_data)

### How to know if you're making this mistake
* Your for loop is preceded by initializing an empty list and ends with a `.append()`.

### Why it's important
* Readability: a list comprehension makes it much clearer that the point of the code is to transform a list.
* Flatness is nice. We love flatness.

# Overusing list comprehensions

You know how you sometimes do crazy things when you're in love? It's the same when you love list comprehensions.

In [None]:
parsed_data = [[x for x in event.split(',')] 
               for event in raw_data.split(';')
               if int(event.split(',')[1]) > 30]
print(parsed_data)

### How to know if you're making this mistake
* You are writing a nested list comprehension.
* Your list comprehension takes up many lines.
* You're not _exactly_ sure what the output of your list comprehension will be.

### Why it's important to keep in mind
* Clarity
* Demonstrates your ability to choose the correct approach among several equivalent options.
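One way to keep comprehensions readable is to move each step into a small named function, sketched below. The helper names `parse_event` and `is_long` are hypothetical and the 30-minute threshold is taken from the example above.

```python
raw_data = "10:00am,60,Sapna;11:30am,30,Lin;2:00pm,60,Cary"


def parse_event(event):
    """Split one 'time,minutes,name' record into its fields."""
    return event.split(',')


def is_long(event_fields, min_minutes=30):
    """True if the event's duration field exceeds min_minutes."""
    return int(event_fields[1]) > min_minutes


# Each comprehension does one thing, and each event is split only once.
events = [parse_event(event) for event in raw_data.split(';')]
long_events = [fields for fields in events if is_long(fields)]
print(long_events)
```

The result is the same as the nested version, but each line can be read and tested on its own.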

# Iteration
When using Python's data structures, you will inevitably need to iterate over their contents. Doing so Pythonically is a powerful signal of good code.

# Doing too much work to iterate
Python has powerful and expressive tools that allow you to both iterate over collections easily and communicate your code's intention while you do it.

In [None]:
words = ['cat', 'bat', 'rat', 'dad']

# C, JavaScript, MATLAB programmers
i = 0
while i < len(words):
    print(words[i])
    i += 1

In [None]:
for i in range(len(words)):
    print(words[i])
# But I don't care what index the word is at!

In [None]:
# Solve it

In [None]:
# Solution
for word in words:
    print(word)

### How to recognize when you might be making this mistake
* You have to index into an iterable to get the specific value you're interested in.
* You are explicitly incrementing indexes!
* You have a variable that you only use to index into an iterable.

### Why it's important
* Clarity
* Allows good descriptive variable names

# Not using enumerate!

In the previous example we didn't care about the index of the words, but what if we do care?

In [None]:
medals = ['Gold', 'Silver', 'Bronze']
for i in range(len(medals)):
    print('{} medal for {} place'.format(medals[i], i+1))

In [None]:
# Solve it

In [None]:
# Solution
for place, medal in enumerate(medals, start=1):
    print('{} medal for {} place'.format(medal, place))

### How to recognize when you might be making this mistake
* You are indexing into your list inside a for loop.
* You are adjusting the index for some other purpose inside the for loop.
* Your variable names have no meaning, like `i`, because they serve multiple purposes.

### Why it's important to avoid
* Code readability.
* Avoid off-by-one errors.

# Zipping
Sometimes we need to iterate over two related data structures in parallel.

In [None]:
people = ['Isaac', 'Cary', 'Lee']
favorite_foods = [['Italian', 'Indian'],
                  ['Sushiritto'],
                  ['French Fries', 'Mexican', 'Water']]
for i in range(len(people)):
    print('{} has {} favorite food(s).'.format(people[i], len(favorite_foods[i])))

In [None]:
# Solve it

In [None]:
# Solution
for person, foods in zip(people, favorite_foods):
    print('{} has {} favorite food(s).'.format(person, len(foods)))

### How to recognize when you might be making this mistake
* You are indexing into multiple lists in a for loop.

### Why it's important to avoid
* Code readability.
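One behavior of `zip` worth knowing is that it stops at the shortest input, so mismatched lists lose items silently. The sketch below assumes Python 3, where `itertools.zip_longest` is available, and shortens the `favorite_foods` list on purpose to show the difference.

```python
from itertools import zip_longest

people = ['Isaac', 'Cary', 'Lee']
favorite_foods = [['Italian', 'Indian'], ['Sushiritto']]  # one entry short

# zip stops at the shortest input, so 'Lee' is dropped without warning.
paired = list(zip(people, favorite_foods))

# zip_longest pads the shorter input, here with an empty list of foods.
padded = list(zip_longest(people, favorite_foods, fillvalue=[]))

for person, foods in padded:
    print('{} has {} favorite food(s).'.format(person, len(foods)))
```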

# Using the Correct Data Structure
Frequently we want our code to run fast. One of the best ways to gain speed is to use the correct data structure for a given problem.

# Checking membership
Frequently we want to check if something is a member of a certain set, e.g. for filtering.

In [None]:
stop_words = ['the', 'a', 'and', 'has', 'are', 'an', 'but', 'as',
              'though', 'while', 'in', 'also', 'on', 'with', 'upon']
text = ("A children's show has brothers. The brothers are named Burt and Ernie. "
        "While Burt and Ernie are brothers they are also friends. "
        "Burt and Ernie live on Sesame Street with Big Bird.")
# Remove every period (strip only trims the ends), then split on whitespace.
words = text.lower().replace('.', '').split()

%timeit [word for word in words if word not in stop_words]

In [None]:
# Solve it

In [None]:
# Solution
stop_words = {'the', 'a', 'and', 'has', 'are', 'an', 'but', 'as',
              'though', 'while', 'in', 'also', 'on', 'with', 'upon'}
text = ("A children's show has brothers. The brothers are named Burt and Ernie. "
        "While Burt and Ernie are brothers they are also friends. "
        "Burt and Ernie live on Sesame Street with Big Bird.")
# Remove every period (strip only trims the ends), then split on whitespace.
words = text.lower().replace('.', '').split()

%timeit [word for word in words if word not in stop_words]

### Let's see an extreme example of this

In [None]:
# In Python 3, range() is not a list, so build one explicitly for a fair comparison.
large_list = list(range(10000))
%timeit 8000 in large_list

In [None]:
large_set = set(range(10000))
%timeit 8000 in large_set
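The gap comes from how each structure looks things up: a list is scanned item by item, while a set uses a hash lookup. The same comparison can be run outside IPython with the standard `timeit` module, as in this sketch. The repetition count of 1000 is an arbitrary choice, and the timings will vary by machine.

```python
import timeit

large_list = list(range(10000))
large_set = set(large_list)

# Time 1000 membership checks against each structure.
list_seconds = timeit.timeit(lambda: 8000 in large_list, number=1000)
set_seconds = timeit.timeit(lambda: 8000 in large_set, number=1000)

print('list: {:.6f}s  set: {:.6f}s'.format(list_seconds, set_seconds))
```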

# NumPy arrays
Frequently in data science we need to work with arrays of numbers. In these cases it's much less performant to store and operate on our data in lists than in NumPy arrays.

In [None]:
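data = list(range(1000))
%timeit [x + 1 for x in data]
# Names assigned inside %timeit are not kept, so compute the result separately.
data_plus_one = [x + 1 for x in data]
print(data_plus_one[:10])

other_data = list(range(1, 1001))
%timeit [x + y for x, y in zip(data, other_data)]
data_sum = [x + y for x, y in zip(data, other_data)]
print(data_sum[:10])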

In [None]:
# Solve it

In [None]:
# Solution
import numpy as np

data = np.arange(1000)
%timeit data + 1
# Names assigned inside %timeit are not kept, so compute the result separately.
data_plus_one = data + 1
print(data_plus_one[:10])

other_data = np.arange(1, 1001)
%timeit data + other_data
data_sum = data + other_data
print(data_sum[:10])

# Doing dictionary iteration wrong

Dictionaries are one of the most important types in Python, so learning to use them according to best practices is a good idea!

In [None]:
pet_foods = {'cat': 'fish', 'dog': 'meat', 'lizard': 'crickets'}

for pet in pet_foods.keys():
    print('I have a {}'.format(pet))
for pet in pet_foods.keys():
    print('My {} eats {}'.format(pet, pet_foods[pet]))

In [None]:
# Solve it

In [None]:
# Solution
for pet in pet_foods:
    print('I have a {}'.format(pet))
for pet, food in pet_foods.items():  # .items() in Python 3, .iteritems() in Python 2
    print('My {} eats {}'.format(pet, food))

### How to know if you're making this mistake
* You're using `.keys()` to iterate over keys.
* You're using something other than `.items()` (`.iteritems()` in Python 2) to iterate over (key, value) tuples.

### Why it's important to avoid
* Demonstrates you know how to use the most important built-in types.
* More memory efficient in Python 2, where `.keys()` and `.items()` build full lists.
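Two related idioms are sketched below, reusing the `pet_foods` dictionary from above: asking for `.values()` when the keys are not needed, and sorting `.items()` when a predictable order matters.

```python
pet_foods = {'cat': 'fish', 'dog': 'meat', 'lizard': 'crickets'}

# Only the values are needed, so ask for them directly.
foods = sorted(pet_foods.values())

# sorted() on the (key, value) pairs orders them by key.
lines = ['My {} eats {}'.format(pet, food)
         for pet, food in sorted(pet_foods.items())]

print(foods)
print(lines)
```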

# Writing long lines!

PEP 8 is the document that describes stylistic conventions for Python, and it specifies a maximum line length of 79 characters. People vary in their adherence to these rules, but experienced programmers of all stripes have strategies for avoiding long lines.

In [None]:
import math

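# Coordinates are given in degrees; the math trig functions expect radians.
lat1 = math.radians(53.32055555555556)
lat2 = math.radians(53.31861111111111)
long1 = math.radians(-1.7297222222222221)
long2 = math.radians(-1.6997222222222223)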

bearing = (math.degrees(math.atan2(math.sin(long2-long1)*math.cos(lat2), math.cos(lat1)*math.sin(lat2)-math.sin(lat1)*math.cos(lat2)*math.cos(long2-long1))) + 360) % 360
print('Your bearing at this moment is {}, please continue in this direction until you arrive'.format(bearing))

In [None]:
# Allow access to functions without indicating namespace.
from math import degrees, atan2, sin, cos

# If arguments are too long, assign them to variables.
y = sin(long2 - long1) * cos(lat2)
# If variable definition is too long assign in multiple steps.
x = cos(lat1) * sin(lat2)
x -= sin(lat1) * cos(lat2) * cos(long2 - long1)
# Use the same name to represent work in progress...
bearing = atan2(y, x)
bearing = degrees(bearing)
bearing = (bearing + 360) % 360

# Create a string inside () with linebreaks, but no commas.
msg = ('Your bearing at this moment is {}, please '
       'continue in this direction until you arrive')

# Call methods after assigning to variable.
print(msg.format(bearing))

# As a last resort you can use line continuation \.


### How to know you're making this mistake
* If the line wraps, either your editing window is too small or you're over the limit.
* If the pep8 checker or your IDE tells you.

### Why it's important
* Readability
* Demonstrates concern for other users of your code -- something good programmers do by default.
* Professionalism: experienced coders do this regularly.
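Another option is to wrap the calculation in a function and break long expressions inside parentheses, which Python continues across lines without a backslash. This is a sketch only: the function name `initial_bearing` is hypothetical, and it assumes coordinates are passed in degrees.

```python
from math import atan2, cos, degrees, radians, sin


def initial_bearing(lat1, long1, lat2, long2):
    """Compass bearing in degrees from point 1 to point 2.

    Inputs are in degrees and are converted to radians for the trig calls.
    """
    lat1, long1, lat2, long2 = (
        radians(angle) for angle in (lat1, long1, lat2, long2)
    )
    y = sin(long2 - long1) * cos(lat2)
    # Parentheses allow the expression to span lines without a backslash.
    x = (cos(lat1) * sin(lat2)
         - sin(lat1) * cos(lat2) * cos(long2 - long1))
    return (degrees(atan2(y, x)) + 360) % 360


# Heading due east along the equator.
print(initial_bearing(0, 0, 0, 1))
```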

# Polluting your namespace

In [None]:
from scipy import *

### How to know you're making this mistake
* You type `from ... import *`

### Why it's important to avoid
* Later users of your code will see a function used and wonder where it came from.
* You may introduce pernicious bugs by filling your namespace with unknown things!
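The sketch below shows how such a bug can arise, using the standard-library modules `math` and `cmath` as stand-ins for larger packages. The two modules export many of the same names, so a second `import *` would silently replace functions from the first. Importing the modules themselves keeps every call unambiguous.

```python
import cmath
import math

# Public names that both modules define; a star import of each would clash.
shared_names = {name for name in dir(math) if not name.startswith('_')}
shared_names &= {name for name in dir(cmath) if not name.startswith('_')}
print(sorted(shared_names))

# With the module prefix, the reader can see which sqrt is meant.
real_root = math.sqrt(4)
complex_root = cmath.sqrt(-4)
print(real_root, complex_root)
```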

# Thank You

## Cary Goltermann
DSI Instructor

Galvanize

San Francisco

<a href="https://twitter.com/carygoltermann">@carygoltermann</a>

https://www.linkedin.com/in/carygoltermann

cary.goltermann@galvanize.com
