### Crash Course in Python (skipping the first part)

#### Sets
A set represents a collection of *distinct* elements.

In [1]:
s = set()
s.add(1)
s.add(2)
s.add(2) # s is still {1,2}
x = len(s) # equals 2
y = 2 in s # equals True
z = 3 in s # equals False

If we have a large collection of items that we want to use for a membership test, a set is more appropriate than a list:

In [3]:
stopwords_list = ["a","an","at"] + ["yet","you"]

"zip" in stopwords_list # False, but have to check every element

stopwords_set = set(stopwords_list) # a set is the better fit when we only need a yes/no membership test

"zip" in stopwords_set # very fast to check

False

The second reason to use a set is to find the distinct items in a collection:

In [4]:
item_list = [1,2,3,1,2,3]
num_items = len(item_list) # 6
item_set = set(item_list)  # {1,2,3} - these are the distinct values
num_distinct_items = len(item_set) # 3
distinct_item_list = list(item_set) # [1,2,3] - turning the distinct items into a list
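One caveat is worth a quick sketch: a set has no defined order, so `list(item_set)` is not guaranteed to come back as `[1,2,3]`. Assuming you want a predictable order, the two approaches below are common options (the variable names are illustrative, not part of the original example):

```python
item_list = [3, 1, 2, 3, 1, 2]

# sorted() accepts any iterable, including a set, and returns a new ordered list
distinct_sorted = sorted(set(item_list))

# dict keys keep insertion order (Python 3.7+), so this keeps first-seen order
distinct_in_order = list(dict.fromkeys(item_list))

print(distinct_sorted)
print(distinct_in_order)
```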

### Control Flow
You can perform an action conditionally using if:

In [5]:
if 1 > 2:
    message = "if only 1 were greater than two..."
elif 1 > 3:
    message = "elif stands for 'else if'"
else:
    message = "when all else fails use else (if you want to)"

You can also write a *ternary* (three-part) if-then-else on one line:

In [6]:
parity = "even" if x % 2 == 0 else "odd"

For more complex logic, use continue and break:

In [8]:
for x in range(10):
    if x == 3:
        continue # go immediately to the next iteration
    if x == 5:
        break # quit the loop entirely
    print (x)

0
1
2
4


### Truthiness
Booleans work the same in Python as in most other languages, except they're capitalized:

In [11]:
one_is_less_than_two = 1 < 2 # equals True
true_equals_false = True == False # equals False
print (one_is_less_than_two)
print(true_equals_false)

True
False


Python uses `None` to indicate a nonexistent value. It is similar to other languages' `null`.

In [12]:
x = None
print (x == None)
print (x is None)  # also True, more Pythonic

True
True


The following are all treated as `False` when used in a condition (they are "falsy"):

- False
- None
- [] (an empty list)
- {} (an empty dict)
- ""
- set()
- 0
- 0.0

Pretty much anything else gets treated as `True`. This allows you to use `if` to test for empty lists, empty strings, and so on.
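As a minimal sketch of that idea, the hypothetical helper `describe` below relies on truthiness alone to tell empty collections from non-empty ones:

```python
def describe(collection):
    """Return 'empty' or 'non-empty' using only the truthiness of the argument."""
    if collection:  # True for any non-empty list, string, dict or set
        return "non-empty"
    return "empty"

# [0] is a non-empty list, so it is truthy even though its only element is falsy
results = [describe([]), describe([0]), describe(""), describe("a")]
print(results)
```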

To ensure that a value is definitely a number (when it could possibly be None):

In [24]:
x = None
safe_x = x or 0 # `x or 0` evaluates to x when x is truthy and to 0 otherwise, so a None becomes 0.
                # Note that any other falsy x (0, 0.0, "", []) also becomes 0.

print(safe_x)

0
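Because `or` reacts to every falsy value and not only to `None`, it can be worth testing for `None` explicitly when other falsy values should pass through unchanged. A small sketch, where `or_zero` is a hypothetical helper name:

```python
def or_zero(x):
    """Hypothetical helper: replace only None with 0, keep everything else."""
    return 0 if x is None else x

a = None or 0      # 0
b = [] or 0        # 0, because an empty list is also falsy
c = or_zero([])    # [] is kept, since it is not None
d = or_zero(None)  # 0

print(a, b, c, d)
```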


### More Advanced Features

#### Sorting
The list method `sort` sorts a list "in place". The built-in function `sorted` returns a new list.

In [25]:
x = [4,1,2,3]
y = sorted(x) # y is [1,2,3,4], x is unchanged
x.sort() # now x is [1,2,3,4]


Specify `reverse=True` to sort elements from largest to smallest, and `key` to compare the results of a function instead of the elements themselves:

In [26]:
# sort the list by absolute value from largest to smallest
x = sorted([-4,1,-2,3], key=abs, reverse=True)
print(x)

[-4, 3, -2, 1]
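The `key` argument accepts any one-argument function. As a sketch with made-up data, the example below sorts words by length, and then uses a tuple key to break ties alphabetically:

```python
words = ["pear", "banana", "fig", "apple", "kiwi"]

# shortest first; words of equal length keep their original relative order
by_length = sorted(words, key=len)

# longest first
by_length_desc = sorted(words, key=len, reverse=True)

# a key may return a tuple: compare by length first, then alphabetically
by_length_then_alpha = sorted(words, key=lambda w: (len(w), w))

print(by_length)
print(by_length_desc)
print(by_length_then_alpha)
```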


### List Comprehensions

A list comprehension transforms a list into another list by choosing only certain elements, by transforming elements, or both:

In [33]:
even_numbers = [x for x in range(5) if x % 2 == 0] # [0,2,4]

squares = [x * x for x in range(5)] #[0, 1, 4, 9, 16]

even_squares = [x * x for x in even_numbers] # [0, 4, 16]

Similarly you can turn lists into dictionaries or sets:

In [34]:
square_dict = {x : x * x for x in range(5)} # {0:0, 1:1, 2:4, 3:9, 4:16}

square_set = {x * x for x in [1, -1]} # {1}

If you don't need the value from a list, it's conventional to use an underscore as the variable:

In [37]:
zeros = [0 for _ in even_numbers] # has the same length as even_numbers
print(zeros)

[0, 0, 0]


A list comprehension can include multiple fors:

In [40]:
pairs = [(x,y) for x in range(10) for y in range(10)] # 100 pairs (0,0) (0,1)....(9,8),(9,9)

Later `for`s can use the results of earlier ones: 

In [42]:
increasing_pairs = [(x,y) for x in range(10) for y in range(x+1, 10)]
print (increasing_pairs)

[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7), (2, 8), (2, 9), (3, 4), (3, 5), (3, 6), (3, 7), (3, 8), (3, 9), (4, 5), (4, 6), (4, 7), (4, 8), (4, 9), (5, 6), (5, 7), (5, 8), (5, 9), (6, 7), (6, 8), (6, 9), (7, 8), (7, 9), (8, 9)]
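It can help to read a comprehension with several `for`s as nested loops written left to right. The sketch below builds the same pairs both ways (the names ending in `_loop` and `_comp` are illustrative):

```python
# explicit nested loops: the outer loop corresponds to the first `for`
increasing_pairs_loop = []
for x in range(10):
    for y in range(x + 1, 10):
        increasing_pairs_loop.append((x, y))

# the equivalent comprehension, with the `for`s in the same order
increasing_pairs_comp = [(x, y) for x in range(10) for y in range(x + 1, 10)]

print(increasing_pairs_loop == increasing_pairs_comp)
print(len(increasing_pairs_comp))
```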


#### Generators and Iterators

Building a large list is wasteful when you only need its elements one at a time, or only the first few of them.
A *generator* is something you can iterate over, but whose values are produced only as needed.

One way to create generators is with functions and the `yield` keyword:

In [45]:
def lazy_range(n):
    """a lazy version of range"""
    i = 0
    while i < n: # only looking at values less than n.
        yield i
        i += 1

The following loop will consume the `yielded` values one at a time until none are left:

In [47]:
for i in lazy_range(10): # lazy_range(10) yields 0 through 9, one value per iteration
    i / 3                # each value is computed only when the loop asks for it
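To see the "only as needed" behaviour directly, you can pull values by hand with the built-in `next`. This is a minimal sketch that repeats the `lazy_range` definition so it runs on its own:

```python
def lazy_range(n):
    """a lazy version of range"""
    i = 0
    while i < n:
        yield i
        i += 1

gen = lazy_range(3)  # calling the function runs none of its body yet
first = next(gen)    # runs the body up to the first yield
second = next(gen)   # resumes right after the yield and runs to the next one
rest = list(gen)     # consumes whatever is left

print(first, second, rest)
```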

In Python 3 the built-in `range()` function is itself lazy.
Because a generator produces values only as they are requested, you can even create an infinite sequence:

In [48]:
def natural_numbers():
    """returns 1,2,3, ..."""
    n = 1
    while True:
        yield n
        n += 1
        
# Though you probably shouldn't iterate over it without using some kind of break logic. WHY?
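On the question above: a plain `for` loop over `natural_numbers()` never ends, because the generator never runs out of values. The sketch below shows two ways to stop, one with the standard library's `itertools.islice` and one with an explicit `break` (the cutoff of 50 is an arbitrary choice for illustration):

```python
from itertools import islice

def natural_numbers():
    """yields 1, 2, 3, ... without end"""
    n = 1
    while True:
        yield n
        n += 1

# islice stops after taking the requested number of items
first_five = list(islice(natural_numbers(), 5))

# or stop explicitly once a condition is met
squares_below_50 = []
for n in natural_numbers():
    if n * n >= 50:
        break
    squares_below_50.append(n * n)

print(first_five)
print(squares_below_50)
```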

The flip side of a generator is that you can only iterate through it once. If you need to iterate multiple times, you need to either recreate the generator each time or use a list.
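A short sketch of what "only once" looks like in practice, using a hypothetical generator function `count_up_to`:

```python
def count_up_to(n):
    """Hypothetical generator yielding 0, 1, ..., n - 1."""
    i = 0
    while i < n:
        yield i
        i += 1

gen = count_up_to(3)
first_pass = list(gen)   # consumes every value
second_pass = list(gen)  # the generator is exhausted, so this is empty

# option 1: recreate the generator for each pass
fresh_pass = list(count_up_to(3))

# option 2: store the values in a list and reuse the list
stored = list(count_up_to(3))
total, largest = sum(stored), max(stored)

print(first_pass, second_pass, fresh_pass, total, largest)
```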

#### Generator alternative

A second way to create a generator is a comprehension wrapped in parentheses:

In [51]:
lazy_evens_below_20 = (i for i in lazy_range(20) if i % 2 == 0)
print(lazy_evens_below_20)

# What is this used for? What have we stored in lazy_evens_below_20 and how can it be used?

<generator object <genexpr> at 0x7f95e40a8830>
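The printed object is the generator itself, not its values: nothing has been computed yet. A minimal sketch of how such a generator can be consumed, using the built-in `range` in place of `lazy_range` so the block runs on its own:

```python
lazy_evens_below_20 = (i for i in range(20) if i % 2 == 0)

first = next(lazy_evens_below_20)   # computes only the first even number
second = next(lazy_evens_below_20)  # then only the second

# functions such as sum, list, max or a for loop consume the rest
remaining_sum = sum(lazy_evens_below_20)

print(first, second, remaining_sum)
```

After the call to `sum`, the generator is exhausted, so a further `list(lazy_evens_below_20)` would be empty.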


### Randomness

Generating random numbers is important in Data Science. We will use the `random` module:

In [53]:
import random

# random.random() produces numbers uniformly between 0 and 1.
four_uniform_randoms = [random.random() for _ in range(4)]
print(four_uniform_randoms)

[0.5262103693956621, 0.12765972847670115, 0.0008494817011335254, 0.14652841728643295]


You can set the internal state with random.seed if you want to get reproducible results:

In [63]:
random.seed(8) # set the seed to 8
print (random.random()) 

random.seed(9) # set the seed to 9
print (random.random()) 

random.seed(9) # reset the seed to 9
print (random.random()) 

# random.random() will always give the same number if you use the same seed.

0.2267058593810488
0.46300735781502145
0.46300735781502145
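The same idea extends to whole sequences of random numbers. The sketch below wraps seeding in a hypothetical helper `three_randoms`, and also shows `random.Random`, which gives you a separate generator with its own seed instead of changing the module-wide state:

```python
import random

def three_randoms(seed):
    """Hypothetical helper: reseed the module generator, then draw three numbers."""
    random.seed(seed)
    return [random.random() for _ in range(3)]

run_a = three_randoms(9)
run_b = three_randoms(9)  # same seed, same three numbers
run_c = three_randoms(8)  # different seed, different numbers

# an independent generator that does not touch the module-level state
rng = random.Random(9)
first_from_rng = rng.random()

print(run_a == run_b)
print(run_a == run_c)
print(first_from_rng == run_a[0])
```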


random.randrange takes 1 or 2 arguments and returns an element chosen randomly from the corresponding range():

In [64]:
random.randrange(10) # choose randomly from range (10) = [0,1,...., 9]
random.randrange(3,6) # choose randomly from range (3,6) = [3,4,5]

4

random.shuffle randomly reorders the elements of a list:

In [70]:
up_to_ten = list(range(10))  # shuffle reorders in place, so it needs a mutable list; in Python 3 range returns an immutable range object
random.shuffle(up_to_ten)
print (up_to_ten)

# results change each time.

[9, 8, 0, 4, 1, 3, 5, 7, 2, 6]


If you want to randomly pick one element from a list, use `random.choice`:

In [72]:
my_best_friend = random.choice(["Alice", "Bob", "Charlie"])
print(my_best_friend)

Alice


To choose a sample of elements without replacement (no duplicates), use `random.sample`:

In [74]:
lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6)
print(winning_numbers)

[24, 50, 47, 56, 6, 18]


To choose a sample of elements *with* replacement (duplicates allowed), you can make multiple calls to `random.choice`:

In [76]:
# using list comprehension
four_with_replacement = [random.choice(range(10)) for _ in range(4)]
print(four_with_replacement)

[4, 3, 6, 4]
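As an alternative sketch, the standard library also provides `random.choices` (Python 3.6+), which draws `k` elements with replacement in a single call. The seed value here is arbitrary and only makes the run repeatable:

```python
import random

random.seed(0)  # arbitrary seed, only for repeatability

# k is keyword-only; duplicates are possible because sampling is with replacement
four_with_replacement = random.choices(range(10), k=4)

print(four_with_replacement)
```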


### Regular Expressions

Just a few examples:

In [79]:
import re

print(all([
        not re.match("a", "cat"), # cat doesn't start with an "a"
        re.search("a","cat"),     # cat has an 'a' in it
        not re.search("c","dog"), # 'dog' doesn't have a 'c' in it
        3 == len(re.split("[ab]", "carbs")),  # split on a or b to give ['c','r','s']
        "R-D-" == re.sub("[0-9]", "-", "R2D2")])) # replace digits with dashes
# All these statements are true

True
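The checks above only use the results as true or false values. As a brief sketch with a made-up input string, `re.search` and `re.match` actually return a match object (or `None`), which carries the matched text and its position:

```python
import re

m = re.search(r"[0-9]+", "order 66 shipped")
number = m.group()   # the text that matched
position = m.span()  # (start, end) indexes of the match

# re.match only looks at the beginning of the string, so this finds nothing
no_match = re.match(r"[0-9]+", "order 66 shipped")

# re.findall returns every non-overlapping match as a list of strings
all_digits = re.findall("[0-9]", "R2D2")

print(number, position, no_match, all_digits)
```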
