# Python for Big Data Engineering

## Whitespace Formatting

Many languages use curly braces to delimit blocks of code. Python uses indentation:

In [1]:
for i in [1, 2, 3, 4, 5]:
    print(i)                    # first line in "for i" block
    for j in [1, 2, 3, 4, 5]:
        print(j)                # first line in "for j" block
        print(i + j)            # last line in "for j" block
    print(i)                    # last line in "for i" block
# print("done looping")

1
1
2
2
3
3
4
4
5
5
6
1
2
1
3
2
4
3
5
4
6
5
7
2
3
1
4
2
5
3
6
4
7
5
8
3
4
1
5
2
6
3
7
4
8
5
9
4
5
1
6
2
7
3
8
4
9
5
10
5


In [2]:
long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 +
                           13 + 14 + 15 + 16 + 
                           17 + 
                           18 + 
                           19 + 20)

In [3]:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [4]:
easier_to_read_list_of_lists = [[1, 2, 3],
                                [4, 5, 6],
                                [7, 8, 9]]

In [5]:
two_plus_three = 2 + \
                 3
                
print(two_plus_three)

5


In [6]:
for i in [1, 2, 3, 4, 5]:

    
    
    
    # notice the blank line
    print(i)

1
2
3
4
5


## Modules

Certain features of Python are not loaded by default. These include both features that are included as part of the language as well as third-party features that you download yourself. In order to use these features, you’ll need to import the modules that contain them.

In [7]:
import re
my_regex = re.compile("[0-9]+", re.I)

In [8]:
import re as regexxx
my_regex = regexxx.compile("[0-9]+", regexxx.I)

In [9]:
from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()

In [10]:
match = 10
from re import *    # uh oh, re has a match function
print(match)        # "<function match at 0x10281e6a8>"

<function match at 0x0000021310B3C5E0>


In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
# Faker is a third-party package; until it is installed, either import below
# fails with "ModuleNotFoundError: No module named 'faker'".
# from faker import Faker
# import faker as Faker
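
The import styles above assume the module is present. As a minimal sketch (not part of the original notebook), the standard library's `importlib.util.find_spec` can be used to check whether a package such as `faker` is installed before importing it. The helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# `re` ships with Python, so it is always available
assert is_installed("re")

# a third-party package may or may not be present in this environment
if is_installed("faker"):
    from faker import Faker
    print("Faker is available")
else:
    print("Faker is missing; install it with pip or conda")
```

Checking first keeps an optional dependency from stopping the whole notebook with a `ModuleNotFoundError`.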

## Functions

A function is a rule for taking zero or more inputs and returning a corresponding output. In Python, we typically define functions using `def`:

In [15]:
def a(x):
    return x

def b(y):
    return y

def sum_of_two_number(x, y):
    return x + y

In [16]:
def double(x):
    """
    This is where you put an optional docstring that explains what the
    function does. For example, this function multiplies its input by 2.
    """
    return x * 2

In [20]:
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

apply_to_one

<function __main__.apply_to_one(f)>

In [18]:
my_double = double             # refers to the previously defined function
x = apply_to_one(my_double)    # equals 2

assert x == 2

In [None]:
y = apply_to_one(lambda x: x + 4)      # equals 5

assert y == 5

In [None]:
another_double = lambda x: 2 * x       # Don't do this

In [None]:
def another_double(x):
    """Do this instead"""
    return 2 * x

In [None]:
def my_print(message = "my default message"):
    print(message)

In [None]:
my_print("hello")   # prints 'hello'
my_print()          # prints 'my default message'

hello
my default message


In [None]:
def full_name(first = "What's-his-name", last = "Something"):
    return first + " " + last

In [None]:
full_name("Kholed", "Langsari")     # "Kholed Langsari"
full_name("Kholed")             # "Kholed Something"
full_name(last="Langsari")        # "What's-his-name Langsari"

"What's-his-name Langsari"

In [None]:
assert full_name("Kholed", "Langsari")     == "Kholed Langsari"
assert full_name("Kholed")                 == "Kholed Something"
assert full_name(last="Langsari")          == "What's-his-name Langsari"

## Strings

Strings can be delimited by single or double quotation marks (but the quotes have to match):

In [None]:
single_quoted_string = 'data engineering'
double_quoted_string = "data engineering"

In [None]:
tab_string = "\t"       # represents the tab character
len(tab_string)         # is 1

assert len(tab_string) == 1

In [None]:
not_tab_string = r"\t"  # represents the characters '\' and 't'
len(not_tab_string)     # is 2

assert len(not_tab_string) == 2

In [None]:
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
print(multi_line_string)

This is the first line.
and this is the second line
and this is the third line


In [None]:
first_name = "Kholed"
last_name = "Langsari"

In [None]:
full_name1 = first_name + " " + last_name             # string addition
full_name2 = "{0} {1}".format(first_name, last_name)  # string.format

In [None]:
full_name3 = f"{first_name} {last_name}"
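
f-strings also accept format specifications after a colon, which is handy when reporting counts and timings. The sketch below uses made-up values for illustration and also shows `split` and `join`, the usual way to break apart and reassemble delimited text.

```python
name = "Kholed Langsari"      # sample values, chosen only for illustration
rows_loaded = 1234567
elapsed_seconds = 3.14159

# ":," adds thousands separators, ":.2f" rounds to two decimal places
summary = f"{name}: loaded {rows_loaded:,} rows in {elapsed_seconds:.2f}s"
assert summary == "Kholed Langsari: loaded 1,234,567 rows in 3.14s"

# split a delimited string into a list, then join it back with another delimiter
fields = "name,age,city".split(",")
assert fields == ["name", "age", "city"]
assert "|".join(fields) == "name|age|city"
```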

## Exceptions

When something goes wrong, Python raises an *exception*. Unhandled, exceptions will cause your program to crash. You can handle them using `try` and `except`:

In [None]:
try:
    print(0 / 0)
except ZeroDivisionError:
    print("cannot divide by zero")

cannot divide by zero
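
A `try` statement can also carry `else` and `finally` clauses. The following is a small sketch, with a hypothetical helper `safe_divide` and a hypothetical `attempts` log, showing when each clause runs.

```python
attempts = []   # hypothetical log of every division that was attempted


def safe_divide(numerator, denominator, default=None):
    """Divide two numbers, returning `default` instead of raising on zero."""
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        return default
    else:
        # runs only when the try block raised no exception
        return result
    finally:
        # runs in every case, which makes it a good place for cleanup
        attempts.append((numerator, denominator))


assert safe_divide(6, 3) == 2.0
assert safe_divide(1, 0) is None
assert safe_divide(1, 0, default=0) == 0
assert len(attempts) == 3
```

Catching only `ZeroDivisionError` is deliberate: any other exception still propagates, so unrelated bugs are not silently hidden.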


## Lists

Probably the most fundamental data structure in Python is the `list`, which is simply an ordered collection (it is similar to what in other languages might be called an array, but with some added functionality): 

In [None]:
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [integer_list, heterogeneous_list, []]

list_length = len(integer_list)     # equals 3
list_sum    = sum(integer_list)     # equals 6

assert list_length == 3
assert list_sum == 6

In [None]:
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

zero = x[0]          # equals 0, lists are 0-indexed
one = x[1]           # equals 1
nine = x[-1]         # equals 9, 'Pythonic' for last element
eight = x[-2]        # equals 8, 'Pythonic' for next-to-last element
x[0] = -1            # now x is [-1, 1, 2, 3, ..., 9]

assert x == [-1, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
first_three = x[:3]                 # [-1, 1, 2]
three_to_end = x[3:]                # [3, 4, ..., 9]
one_to_four = x[1:5]                # [1, 2, 3, 4]
last_three = x[-3:]                 # [7, 8, 9]
without_first_and_last = x[1:-1]    # [1, 2, ..., 8]
copy_of_x = x[:]                    # [-1, 1, 2, ..., 9]

every_third = x[::3]                 # [-1, 3, 6, 9]
five_to_three = x[5:2:-1]            # [5, 4, 3]

assert every_third == [-1, 3, 6, 9]
assert five_to_three == [5, 4, 3]

In [None]:
1 in [1, 2, 3]    # True
0 in [1, 2, 3]    # False

False

In [None]:
x = [1, 2, 3]
x.extend([4, 5, 6])     # x is now [1, 2, 3, 4, 5, 6]

assert x == [1, 2, 3, 4, 5, 6]

In [None]:
x = [1, 2, 3]
y = x + [4, 5, 6]       # y is [1, 2, 3, 4, 5, 6]; x is unchanged

assert x == [1, 2, 3]
assert y == [1, 2, 3, 4, 5, 6]

In [None]:
x = [1, 2, 3]
x.append(0)      # x is now [1, 2, 3, 0]
y = x[-1]        # equals 0
z = len(x)       # equals 4

assert x == [1, 2, 3, 0]
assert y == 0
assert z == 4

In [None]:
x, y = [1, 2]    # now x is 1, y is 2

assert x == 1
assert y == 2

In [None]:
_, y = [1, 2]    # now y == 2, didn't care about the first element
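
Slicing is a convenient way to process a long list in fixed-size batches. The helper `chunks` below is a hypothetical name used only for this sketch. The last lines show starred unpacking, which collects the remaining elements into a list.

```python
def chunks(items, size):
    """Split a list into consecutive slices of at most `size` elements."""
    batches = []
    for start in range(0, len(items), size):
        batches.append(items[start:start + size])
    return batches


records = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
batches = chunks(records, 4)
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# a starred name collects "everything else" when unpacking
first, *rest = records
assert first == 0
assert rest == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```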

## Tuples

Tuples are lists’ immutable cousins. Pretty much anything you can do to a list that doesn’t involve modifying it, you can do to a tuple. You specify a tuple by using parentheses (or nothing) instead of square brackets:

In [None]:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3      # my_list is now [1, 3]

try:
    my_tuple[1] = 3
except TypeError:
    print("cannot modify a tuple")

cannot modify a tuple


In [None]:
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)     # sp is (5, 6)
s, p = sum_and_product(5, 10)  # s is 15, p is 50

In [None]:
x, y = 1, 2     # now x is 1, y is 2
x, y = y, x     # Pythonic way to swap variables; now x is 2, y is 1

assert x == 2
assert y == 1

## Dictionaries

Another fundamental data structure is a dictionary, which associates *values* with *keys* and allows you to quickly retrieve the value corresponding to a given key:

In [None]:
empty_dict = {}                     # Pythonic
empty_dict2 = dict()                # less Pythonic
grades = {"Kholed": 80, "Muhammad": 95}    # dictionary literal

In [None]:
kholed_grade = grades["Kholed"]        # equals 80


assert kholed_grade == 80

In [None]:
try:
    someone_grade = grades["Someone"]
except KeyError:
    print("no grade for Someone!")

no grade for Someone!


In [None]:
kholed_has_grade = "Kholed" in grades     # True
someone_has_grade = "Someone" in grades     # False


assert kholed_has_grade
assert not someone_has_grade

In [None]:
kholed_grade = grades.get("Kholed", 0)   # equals 80
someone_grade = grades.get("Someone", 0)   # equals 0
no_ones_grade = grades.get("No One")  # default default is None


assert kholed_grade == 80
assert someone_grade == 0
assert no_ones_grade is None

In [None]:
grades["Muhammad"] = 99                    # replaces the old value
grades["Someone"] = 100                  # adds a third entry
num_students = len(grades)            # equals 3

print(grades)
assert num_students == 3

{'Kholed': 80, 'Muhammad': 99, 'Someone': 100}


In [None]:
tweet = {
    "user" : "kholedlangsari",
    "text" : "Big Data Engineering with Python is Awesome",
    "retweet_count" : 100,
    "hashtags" : ["#data", "#engineering", "#dataengineering", "python", "#awesome", "#let'sgo"]
}

In [None]:
tweet_keys   = tweet.keys()     # iterable for the keys
tweet_values = tweet.values()   # iterable for the values
tweet_items  = tweet.items()    # iterable for the (key, value) tuples

"user" in tweet_keys            # True, but not Pythonic
"user" in tweet                 # Pythonic way of checking for keys
"kholedlangsari" in tweet_values      # True (slow but the only way to check)


assert "user" in tweet_keys
assert "user" in tweet
assert "kholedlangsari" in tweet_values
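
Two more dictionary idioms come up often: looping over `items()` and grouping values under a key with `setdefault`. The sketch below uses a small sample `grades` dictionary in the spirit of the one above.

```python
grades = {"Kholed": 80, "Muhammad": 95, "Someone": 100}   # sample data

# items() yields (key, value) pairs, in insertion order
high_scorers = []
for student, grade in grades.items():
    if grade >= 90:
        high_scorers.append(student)
assert high_scorers == ["Muhammad", "Someone"]

# setdefault returns the existing value, inserting the default first if needed
by_first_letter = {}
for student in grades:
    by_first_letter.setdefault(student[0], []).append(student)
assert by_first_letter == {"K": ["Kholed"], "M": ["Muhammad"], "S": ["Someone"]}
```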

### defaultdict

In [None]:
document = ["big", "data", "engineering", "with", "python"]

In [None]:
word_counts = {}
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

In [None]:
word_counts = {}
for word in document:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1

In [None]:
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1

In [None]:
from collections import defaultdict

word_counts = defaultdict(int)          # int() produces 0
for word in document:
    word_counts[word] += 1

In [None]:
dd_list = defaultdict(list)             # list() produces an empty list
dd_list[2].append(1)                    # now dd_list contains {2: [1]}

dd_dict = defaultdict(dict)             # dict() produces an empty dict
dd_dict["Kholed"]["City"] = "Yala"     # {"Kholed": {"City": "Yala"}}

dd_pair = defaultdict(lambda: [0, 0])
dd_pair[2][1] = 1                       # now dd_pair contains {2: [0, 1]}

## Counters

A `Counter` turns a sequence of values into a `defaultdict(int)`-like object mapping keys to counts:

In [None]:
document = ["big", "data", "engineering", "with", "big", "python"]

In [None]:
from collections import Counter
c = Counter([0, 1, 2, 0])          # c is (basically) {0: 2, 1: 1, 2: 1}

In [None]:
# recall, document is a list of words
word_counts = Counter(document)

In [None]:
# print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print(word, count)

big 2
data 1
engineering 1
with 1
python 1
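
Counters can also be combined, which is useful when word counts arrive in separate batches. A minimal sketch with two made-up batches:

```python
from collections import Counter

batch_1 = Counter(["big", "data", "big"])   # sample batches for illustration
batch_2 = Counter(["data", "python"])

total = batch_1 + batch_2                   # adds the counts key by key
assert total == Counter({"big": 2, "data": 2, "python": 1})

total.update(["python", "python"])          # update adds to the counts in place
assert total["python"] == 3
assert total["missing"] == 0                # unseen keys count as 0, no KeyError
assert total.most_common(1) == [("python", 3)]
```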


## Sets

Another useful data structure is the set, which represents a collection of *distinct* elements. You can define a set by listing its elements between curly braces:

In [None]:
primes_below_10 = {2, 3, 5, 7}

In [None]:
s = set()
s.add(1)       # s is now {1}
s.add(2)       # s is now {1, 2}
s.add(2)       # s is still {1, 2}
x = len(s)     # equals 2
y = 2 in s     # equals True
z = 3 in s     # equals False

In [None]:
hundreds_of_other_words = []  # required for the below code to run

stopwords_list = ["a", "an", "at"] + hundreds_of_other_words + ["yet", "you"]

"zip" in stopwords_list     # False, but have to check every element

stopwords_set = set(stopwords_list)
"zip" in stopwords_set      # very fast to check

False

In [None]:
item_list = [1, 2, 3, 1, 2, 3]
num_items = len(item_list)                # 6
item_set = set(item_list)                 # {1, 2, 3}
num_distinct_items = len(item_set)        # 3
distinct_item_list = list(item_set)       # [1, 2, 3]


assert num_items == 6
assert item_set == {1, 2, 3}
assert num_distinct_items == 3
assert distinct_item_list == [1, 2, 3]
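
Sets also support the usual mathematical operations. As an illustrative sketch, the two column sets below are hypothetical and stand in for an expected schema and the columns actually received:

```python
expected_columns = {"name", "age", "city", "zip"}     # hypothetical schema
received_columns = {"name", "age", "state", "zip"}    # hypothetical input

common = expected_columns & received_columns          # intersection
missing = expected_columns - received_columns         # difference
unexpected = received_columns - expected_columns
everything = expected_columns | received_columns      # union

assert common == {"name", "age", "zip"}
assert missing == {"city"}
assert unexpected == {"state"}
assert everything == {"name", "age", "city", "state", "zip"}
assert {"name", "age"} <= expected_columns            # subset test
```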

## Control Flow

As in most programming languages, you can perform an action conditionally using `if`:

In [None]:
if 1 > 2:
    message = "if only 1 were greater than two..."
elif 1 > 3:
    message = "elif stands for 'else if'"
else:
    message = "when all else fails use else (if you want to)"

In [None]:
parity = "even" if x % 2 == 0 else "odd"

In [None]:
x = 0
while x < 10:
    print(f"{x} is less than 10")
    x += 1

0 is less than 10
1 is less than 10
2 is less than 10
3 is less than 10
4 is less than 10
5 is less than 10
6 is less than 10
7 is less than 10
8 is less than 10
9 is less than 10


In [None]:
# range(10) is the numbers 0, 1, ..., 9
for x in range(10):
    print(f"{x} is less than 10")

0 is less than 10
1 is less than 10
2 is less than 10
3 is less than 10
4 is less than 10
5 is less than 10
6 is less than 10
7 is less than 10
8 is less than 10
9 is less than 10


In [None]:
for x in range(10):
    if x == 3:
        continue  # go immediately to the next iteration
    if x == 5:
        break     # quit the loop entirely
    print(x)

0
1
2
4


## Truthiness

Booleans in Python work as in most other languages, except that they’re capitalized:

In [None]:
one_is_less_than_two = 1 < 2          # equals True
true_equals_false = True == False     # equals False


assert one_is_less_than_two
assert not true_equals_false

In [None]:
x = None
assert x == None, "this is not the Pythonic way to check for None"
assert x is None, "this is the Pythonic way to check for None"

In [None]:
def some_function_that_returns_a_string():
    return ""

In [None]:
s = some_function_that_returns_a_string()
if s:
    first_char = s[0]
else:
    first_char = ""

In [None]:
first_char = s and s[0]

In [None]:
safe_x = x or 0

In [None]:
safe_x = x if x is not None else 0

In [None]:
all([True, 1, {3}])   # True, all are truthy
all([True, 1, {}])    # False, {} is falsy
any([True, 1, {}])    # True, True is truthy
all([])               # True, no falsy elements in the list
any([])               # False, no truthy elements in the list

False

## Sorting

Every Python list has a `sort` method that sorts it in place. If you don’t want to mess up your list, you can use the `sorted` function, which returns a new list:

In [None]:
x = [4, 1, 2, 3]
y = sorted(x)     # y is [1, 2, 3, 4], x is unchanged
x.sort()          # now x is [1, 2, 3, 4]

In [None]:

# sort the list by absolute value from largest to smallest
x = sorted([-4, 1, -2, 3], key=abs, reverse=True)  # is [-4, 3, -2, 1]

# sort the words and counts from highest count to lowest
wc = sorted(word_counts.items(),
            key=lambda word_and_count: word_and_count[1],
            reverse=True)
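
The `key` function may return a tuple, which sorts by several fields at once. The sketch below uses a made-up list of `(name, age)` pairs and also relies on the fact that Python's sort is stable.

```python
people = [("Alice", 30), ("Bob", 25), ("Charlie", 30), ("Debbie", 25)]  # sample

# oldest first; ties are broken alphabetically by name
by_age_then_name = sorted(people, key=lambda person: (-person[1], person[0]))
assert by_age_then_name == [("Alice", 30), ("Charlie", 30),
                            ("Bob", 25), ("Debbie", 25)]

# the sort is stable: people with equal ages keep their original order
by_age = sorted(people, key=lambda person: person[1])
assert by_age == [("Bob", 25), ("Debbie", 25), ("Alice", 30), ("Charlie", 30)]
```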

## List Comprehensions

Frequently, you’ll want to transform a list into another list by choosing only certain elements, by transforming elements, or both. The Pythonic way to do this is with *list comprehensions*:

In [None]:
even_numbers = [x for x in range(5) if x % 2 == 0]  # [0, 2, 4]
squares      = [x * x for x in range(5)]            # [0, 1, 4, 9, 16]
even_squares = [x * x for x in even_numbers]        # [0, 4, 16]


assert even_numbers == [0, 2, 4]
assert squares == [0, 1, 4, 9, 16]
assert even_squares == [0, 4, 16]

In [None]:
square_dict = {x: x * x for x in range(5)}  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
square_set  = {x * x for x in [1, -1]}      # {1}


assert square_dict == {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
assert square_set == {1}

In [None]:
zeros = [0 for _ in even_numbers]      # has the same length as even_numbers


assert zeros == [0, 0, 0]

In [None]:
pairs = [(x, y)
         for x in range(10)
         for y in range(10)]   # 100 pairs (0,0) (0,1) ... (9,8), (9,9)


assert len(pairs) == 100

In [None]:
increasing_pairs = [(x, y)                       # only pairs with x < y,
                    for x in range(10)           # range(lo, hi) equals
                    for y in range(x + 1, 10)]   # [lo, lo + 1, ..., hi - 1]


assert len(increasing_pairs) == 9 + 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1
assert all(x < y for x, y in increasing_pairs)
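
Two further patterns are common in data cleaning: flattening a list of lists, and filtering while transforming. Both are sketched below on small made-up inputs.

```python
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# the for clauses read left to right, like nested loops
flattened = [value for row in rows for value in row]
assert flattened == [1, 2, 3, 4, 5, 6, 7, 8, 9]

raw_words = [" Big", "DATA ", "", "  engineering  "]   # sample messy input

# drop empty strings, then strip whitespace and lowercase what remains
cleaned = [word.strip().lower() for word in raw_words if word.strip()]
assert cleaned == ["big", "data", "engineering"]
```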

## Automated Testing and assert

As data engineers, we’ll be writing a lot of code. How can we be confident our code is correct? One way is with *types* (discussed shortly), but another way is with *automated tests*.

In [None]:
assert 1 + 1 == 2
assert 1 + 1 == 2, "1 + 1 should equal 2 but didn't"

In [None]:
def smallest_item(xs):
    return min(xs)

assert smallest_item([10, 20, 5, 40]) == 5
assert smallest_item([1, 0, -1, 2]) == -1

In [None]:
def smallest_item(xs):
    assert xs, "empty list has no smallest item"
    return min(xs)
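
Assertions can be grouped into a test function that also checks the failure case. The sketch below repeats `smallest_item` so that it runs on its own; `test_smallest_item` is a hypothetical name. Note that Python skips `assert` statements when run with the `-O` flag, so they are for testing rather than for validating user input.

```python
def smallest_item(xs):
    assert xs, "empty list has no smallest item"
    return min(xs)


def test_smallest_item():
    assert smallest_item([10, 20, 5, 40]) == 5
    assert smallest_item([7]) == 7
    try:
        smallest_item([])
    except AssertionError as error:
        # the message passed to assert becomes the exception text
        assert str(error) == "empty list has no smallest item"
    else:
        raise AssertionError("expected an AssertionError for an empty list")


test_smallest_item()   # passes silently when every check holds
```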

## Object-Oriented Programming

Like many languages, Python allows you to define classes that encapsulate data and the functions that operate on them. We’ll use them sometimes to make our code cleaner and simpler. It’s probably simplest to explain them by constructing a heavily annotated example.

In [None]:
## define class
class CountingClicker:
    """A class can/should have a docstring, just like a function"""

    def __init__(self, count = 0):
        self.count = count

    def __repr__(self):
        return f"CountingClicker(count={self.count})"

    def click(self, num_times = 1):
        """Click the clicker some number of times."""
        self.count += num_times

    def read(self):
        return self.count

    def reset(self):
        self.count = 0

In [None]:
# create object
clicker = CountingClicker()
assert clicker.read() == 0, "clicker should start with count 0"
clicker.click()
clicker.click()
assert clicker.read() == 2, "after two clicks, clicker should have count 2"
clicker.reset()
assert clicker.read() == 0, "after reset, clicker should be back to 0"

In [None]:
# A subclass inherits all the behavior of its parent class.
class NoResetClicker(CountingClicker):
    # This class has all the same methods as CountingClicker

    # Except that it has a reset method that does nothing.
    def reset(self):
        pass

In [None]:
clicker2 = NoResetClicker()
assert clicker2.read() == 0
clicker2.click()
assert clicker2.read() == 1
clicker2.reset()
assert clicker2.read() == 1, "reset shouldn't do anything"
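
A subclass can also extend a parent method rather than replace it, by calling `super()`. In the sketch below, `CountingClicker` is a condensed copy of the class above and `CappedClicker` is a hypothetical subclass added for illustration.

```python
class CountingClicker:
    """Condensed copy of the class defined above."""

    def __init__(self, count=0):
        self.count = count

    def click(self, num_times=1):
        self.count += num_times

    def read(self):
        return self.count


class CappedClicker(CountingClicker):
    """A clicker that never counts past `limit` (hypothetical example)."""

    def __init__(self, limit, count=0):
        super().__init__(count)          # reuse the parent's initialization
        self.limit = limit

    def click(self, num_times=1):
        super().click(num_times)         # reuse the parent's behavior...
        self.count = min(self.count, self.limit)   # ...then apply the cap


capped = CappedClicker(limit=3)
capped.click(2)
assert capped.read() == 2
capped.click(5)
assert capped.read() == 3, "count should stop at the limit"
assert isinstance(capped, CountingClicker)
```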

## Iterables and Generators

One nice thing about a list is that you can retrieve specific elements by their indices. But you don’t always need this! A list of a billion numbers takes up a lot of memory. If you only want the elements one at a time, there’s no good reason to keep them all around. If you only end up needing the first several elements, generating the entire billion is hugely wasteful.

Often all we need is to iterate over the collection using `for` and `in`. In this case we can create *generators*, which can be iterated over just like lists but generate their values lazily on demand.

One way to create generators is with functions and the `yield` keyword:

In [None]:
def generate_range(n):
    i = 0
    while i < n:
        yield i   # every call to yield produces a value of the generator
        i += 1

In [None]:
for i in generate_range(10):
    print(f"i: {i}")

i: 0
i: 1
i: 2
i: 3
i: 4
i: 5
i: 6
i: 7
i: 8
i: 9


In [None]:
def natural_numbers():
    """returns 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

In [None]:
evens_below_20 = (i for i in generate_range(20) if i % 2 == 0)

In [None]:
# None of these computations *does* anything until we iterate
data = natural_numbers()
evens = (x for x in data if x % 2 == 0)
even_squares = (x ** 2 for x in evens)
even_squares_ending_in_six = (x for x in even_squares if x % 10 == 6)
# and so on

assert next(even_squares_ending_in_six) == 16
assert next(even_squares_ending_in_six) == 36
assert next(even_squares_ending_in_six) == 196
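
Because `natural_numbers()` never ends, converting it to a list would run forever. One way to take a finite prefix, sketched below, is `itertools.islice` from the standard library. The second half illustrates that a generator can be consumed only once.

```python
from itertools import islice


def natural_numbers():
    """yields 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1


# islice stops after 5 values, so the infinite generator is safe to use
first_five = list(islice(natural_numbers(), 5))
assert first_five == [1, 2, 3, 4, 5]

# a generator is exhausted after one pass
squares = (n * n for n in range(4))
assert list(squares) == [0, 1, 4, 9]
assert list(squares) == []
```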

In [None]:
names = ["Alice", "Bob", "Charlie", "Debbie"]

# not Pythonic
for i in range(len(names)):
    print(f"name {i} is {names[i]}")

# also not Pythonic
i = 0
for name in names:
    print(f"name {i} is {names[i]}")
    i += 1

# Pythonic
for i, name in enumerate(names):
    print(f"name {i} is {name}")

name 0 is Alice
name 1 is Bob
name 2 is Charlie
name 3 is Debbie
name 0 is Alice
name 1 is Bob
name 2 is Charlie
name 3 is Debbie
name 0 is Alice
name 1 is Bob
name 2 is Charlie
name 3 is Debbie


## Randomness

As we learn data engineering, we will frequently need to generate random numbers, which we can do with the `random` module:

In [None]:
import random
random.seed(10)  # this ensures we get the same results every time

four_uniform_randoms = [random.random() for _ in range(4)]
print(four_uniform_randoms)

# [0.5714025946899135,       # random.random() produces numbers
#  0.4288890546751146,       # uniformly between 0 and 1
#  0.5780913011344704,       # it's the random function we'll use
#  0.20609823213950174]      # most often

[0.5714025946899135, 0.4288890546751146, 0.5780913011344704, 0.20609823213950174]


In [None]:
random.seed(10)         # set the seed to 10
print(random.random())  # 0.57140259469
random.seed(10)         # reset the seed to 10
print(random.random())  # 0.57140259469 again

0.5714025946899135
0.5714025946899135


In [None]:
random.randrange(10)    # choose randomly from range(10) = [0, 1, ..., 9]
random.randrange(3, 6)  # choose randomly from range(3, 6) = [3, 4, 5]

4

In [None]:
up_to_ten = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
random.shuffle(up_to_ten)
print(up_to_ten)
# [7, 2, 6, 8, 9, 4, 10, 1, 3, 5]   (your results will probably be different)

[5, 6, 9, 2, 3, 7, 8, 4, 1, 10]


In [None]:
my_best_friend = random.choice(["Alice", "Bob", "Charlie"])     # "Bob" for me
print(my_best_friend)

Bob


In [None]:
lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6)  # [16, 36, 10, 6, 25, 9]

In [None]:
four_with_replacement = [random.choice(range(10)) for _ in range(4)]
print(four_with_replacement)  # [9, 4, 4, 2]

[2, 9, 5, 6]
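
`random.seed` sets the state of one shared, module-level generator. When separate parts of a pipeline need their own reproducible streams, one option is a dedicated `random.Random` instance per stream, as in this sketch (the variable names are illustrative):

```python
import random

rng_a = random.Random(10)   # two independent generators with the same seed
rng_b = random.Random(10)

sample_a = [rng_a.randrange(100) for _ in range(5)]
sample_b = [rng_b.randrange(100) for _ in range(5)]
assert sample_a == sample_b                      # same seed, same sequence
assert all(0 <= value < 100 for value in sample_a)

# sampling every element returns a shuffled copy and leaves the original alone
original = [1, 2, 3, 4, 5]
shuffled = rng_a.sample(original, k=len(original))
assert sorted(shuffled) == original
assert original == [1, 2, 3, 4, 5]
```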


## Regular Expressions

Regular expressions provide a way of searching text. They are incredibly useful, but also fairly complicated—so much so that there are entire books written about them. We will get into their details the few times we encounter them; here are a few examples of how to use them in Python:

In [None]:
import re

re_examples = [                        # all of these are true, because
    not re.match("a", "cat"),              #  'cat' doesn't start with 'a'
    re.search("a", "cat"),                 #  'cat' has an 'a' in it
    not re.search("c", "dog"),             #  'dog' doesn't have a 'c' in it
    3 == len(re.split("[ab]", "carbs")),   #  split on a or b to ['c','r','s']
    "R-D-" == re.sub("[0-9]", "-", "R2D2") #  replace digits with dashes
    ]

assert all(re_examples), "all the regex examples should be True"
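
Beyond yes/no matching, parentheses in a pattern capture pieces of the text. The sketch below parses a made-up log line. The line and its format are assumptions for illustration only.

```python
import re

log_line = "2024-01-15 ERROR job=load_users rows=1000"   # hypothetical log line

# each pair of parentheses captures one group
pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\w+)")
found = pattern.match(log_line)
assert found is not None
year, month, day, level = found.groups()
assert (year, month, day, level) == ("2024", "01", "15", "ERROR")

# with two groups, findall returns a list of (group1, group2) tuples
pairs = re.findall(r"(\w+)=(\w+)", log_line)
assert pairs == [("job", "load_users"), ("rows", "1000")]
```

Raw strings (`r"..."`) keep backslashes such as `\d` from being treated as string escapes.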

## zip and Argument Unpacking

Often we will need to `zip` two or more iterables together. The `zip` function transforms multiple iterables into a single iterable of tuples of corresponding elements:

In [None]:
list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]

# zip is lazy, so you have to do something like the following
[pair for pair in zip(list1, list2)]    # is [('a', 1), ('b', 2), ('c', 3)]


assert [pair for pair in zip(list1, list2)] == [('a', 1), ('b', 2), ('c', 3)]

In [None]:
pairs = [('a', 1), ('b', 2), ('c', 3)]
letters, numbers = zip(*pairs)

In [None]:
letters, numbers = zip(('a', 1), ('b', 2), ('c', 3))

In [None]:
def add(a, b): return a + b

add(1, 2)      # returns 3
try:
    add([1, 2])
except TypeError:
    print("add expects two inputs")
add(*[1, 2])   # returns 3

add expects two inputs


3
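
A frequent use of `zip` is pairing column names with a row of values to build a dictionary. The sketch below uses made-up values and also shows that `zip` stops at the shortest input.

```python
columns = ["name", "age", "city"]
values = ["Kholed", 30, "Yala"]          # sample row, for illustration

record = dict(zip(columns, values))      # pair each column with its value
assert record == {"name": "Kholed", "age": 30, "city": "Yala"}

# zip stops as soon as the shortest iterable runs out
assert list(zip([1, 2, 3], "ab")) == [(1, "a"), (2, "b")]

# zip(*rows) "unzips" rows back into columns
rows = [("a", 1), ("b", 2)]
letters, numbers = zip(*rows)
assert letters == ("a", "b")
assert numbers == (1, 2)
```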

## args and kwargs

Let’s say we want to create a higher-order function that takes as input some function `f` and returns a new function that for any input returns twice the value of `f`:

In [None]:
def doubler(f):
    # Here we define a new function that keeps a reference to f
    def g(x):
        return 2 * f(x)

    # And return that new function.
    return g

In [None]:
def f1(x):
    return x + 1

g = doubler(f1)
assert g(3) == 8,  "(3 + 1) * 2 should equal 8"
assert g(-1) == 0, "(-1 + 1) * 2 should equal 0"

In [None]:
def f2(x, y):
    return x + y

g = doubler(f2)
try:
    g(1, 2)
except TypeError:
    print("as defined, g only takes one argument")

as defined, g only takes one argument


In [None]:
def magic(*args, **kwargs):
    print("unnamed args:", args)
    print("keyword args:", kwargs)

magic(1, 2, key="word", key2="word2")

# prints
#  unnamed args: (1, 2)
#  keyword args: {'key': 'word', 'key2': 'word2'}

unnamed args: (1, 2)
keyword args: {'key': 'word', 'key2': 'word2'}


In [None]:
def other_way_magic(x, y, z):
    return x + y + z

x_y_list = [1, 2]
z_dict = {"z": 3}
assert other_way_magic(*x_y_list, **z_dict) == 6, "1 + 2 + 3 should be 6"

In [None]:
def doubler_correct(f):
    """works no matter what kind of inputs f expects"""
    def g(*args, **kwargs):
        """whatever arguments g is supplied, pass them through to f"""
        return 2 * f(*args, **kwargs)
    return g

g = doubler_correct(f2)
assert g(1, 2) == 6, "doubler should work now"
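
The same `*args, **kwargs` pass-through is the basis of decorators. As a sketch, the hypothetical `count_calls` below wraps any function and records how often it is called. `functools.wraps` is a standard-library helper that copies the wrapped function's name and docstring onto the wrapper.

```python
import functools


def count_calls(f):
    """Wrap f so that the wrapper records how many times it was called."""
    @functools.wraps(f)              # keep f's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return f(*args, **kwargs)    # pass every argument through unchanged
    wrapper.calls = 0
    return wrapper


@count_calls                         # same as: add = count_calls(add)
def add(x, y=0):
    """Add two numbers."""
    return x + y


assert add(1, 2) == 3
assert add(5) == 5
assert add(x=1, y=1) == 2
assert add.calls == 3
assert add.__name__ == "add"
```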

## Type Annotations

Python is a *dynamically typed* language. That means that in general it doesn’t care about the types of objects we use, as long as we use them in valid ways:

In [None]:
## dynamically typed
def add(a, b):
    return a + b

assert add(10, 5) == 15,                  "+ is valid for numbers"
assert add([1, 2], [3]) == [1, 2, 3],     "+ is valid for lists"
assert add("hi ", "there") == "hi there", "+ is valid for strings"

try:
    add(10, "five")
except TypeError:
    print("cannot add an int to a string")

cannot add an int to a string


In [None]:
## type-annotated (the hints are not enforced at runtime)
def add(a: int, b: int) -> int:
    return a + b

add(10, 5)           # you'd like this to be OK
add("hi ", "there")  # you'd like this to be not OK

'hi there'

In [None]:
# This is not in the book, but it's needed
# to make the `dot_product` stubs not error out.
from typing import List
Vector = List[float]

def dot_product(x, y): ...

# Vector is the List[float] alias defined at the top of this cell
def dot_product(x: Vector, y: Vector) -> float: ...

from typing import Union

def secretly_ugly_function(value, operation): ...

def ugly_function(value: int, operation: Union[str, int, float, bool]) -> int:
    ...

def total(xs: list) -> float:
    return sum(xs)

from typing import List  # note capital L

def total(xs: List[float]) -> float:
    return sum(xs)

# This is how to type-annotate variables when you define them.
# But this is unnecessary; it's "obvious" x is an int.
x: int = 5

values = []         # what's my type?
best_so_far = None  # what's my type?

from typing import Optional

values: List[int] = []
best_so_far: Optional[float] = None  # allowed to be either a float or None

lazy = True

# the type annotations in this snippet are all unnecessary
from typing import Dict, Iterable, Tuple

# keys are strings, values are ints
counts: Dict[str, int] = {'data': 1, 'engineering': 2}

# lists and generators are both iterable
if lazy:
    evens: Iterable[int] = (x for x in range(10) if x % 2 == 0)
else:
    evens = [0, 2, 4, 6, 8]

# tuples specify a type for each element
triple: Tuple[int, float, int] = (10, 2.3, 5)

from typing import Callable

# The type hint says that repeater is a function that takes
# two arguments, a string and an int, and returns a string.
def twice(repeater: Callable[[str, int], str], s: str) -> str:
    return repeater(s, 2)

def comma_repeater(s: str, n: int) -> str:
    n_copies = [s for _ in range(n)]
    return ', '.join(n_copies)

assert twice(comma_repeater, "type hints") == "type hints, type hints"

Number = int
Numbers = List[Number]

def total(xs: Numbers) -> Number:
    return sum(xs)
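
Annotations are stored on the function but are not checked when the code runs. Catching a mismatch requires a separate static checker such as mypy. The sketch below, with a hypothetical `mean_age` function, illustrates both points.

```python
from typing import Dict, List, Optional


def mean_age(records: List[Dict[str, int]]) -> Optional[float]:
    """Average of the 'age' values, or None for an empty list."""
    if not records:
        return None
    return sum(record["age"] for record in records) / len(records)


assert mean_age([{"age": 20}, {"age": 30}]) == 25.0
assert mean_age([]) is None

# the hints are kept as plain metadata on the function object
assert mean_age.__annotations__["return"] == Optional[float]


def add(a: int, b: int) -> int:
    return a + b


# Python itself does not enforce the hints, so this call still succeeds
assert add("hi ", "there") == "hi there"
```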

# An Application of Python for Big Data Engineering

## Writing and reading files in Python

### Working with data requires libraries

Install the `Faker` package:

`conda install -c conda-forge faker`

or

`pip install Faker`

### Writing and reading CSVs

Writing CSVs using Python's built-in `csv` library. The cell below creates 1,000 fake records and saves them in CSV format:

In [None]:
from faker import Faker
import csv

# newline='' leaves line endings to the csv module (avoids blank rows on Windows)
with open('data.csv','w',newline='') as output:
    fake=Faker()
    header=['name','age','street','city','state','zip','lng','lat']
    mywriter=csv.writer(output)
    mywriter.writerow(header)

    for r in range(1000):
        mywriter.writerow([fake.name(),fake.random_int(min=18, max=80, step=1),
                           fake.street_address(),
                           fake.city(),
                           fake.state(),
                           fake.zipcode(),
                           fake.longitude(),
                           fake.latitude()])
# leaving the with block closes the file and flushes every row to disk

If Faker is not installed, this cell fails with:

ModuleNotFoundError: No module named 'faker'

Reading CSVs

In [None]:
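import csv

with open('data.csv', newline='') as f:
    myreader=csv.DictReader(f)       # the header row supplies the keys
    headers=myreader.fieldnames      # calling next() here would skip the first record
    for row in myreader:
        print(row['name'])

Jenna Morgan
Michael Anderson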
Amanda Alvarez
Jacqueline Norris
Andrew Lopez
Paige Price
Tracy Peterson
Krista Bolton
Mark Martinez
Jamie Merritt
Adam Day
Leslie Jackson
Jeffery Grant
Ryan Ramirez
Steven Parker
Craig Castro
Carlos Cohen
Mark Cervantes
Emma Jackson
Sheila Hartman
Justin Holland
Robert Stevens MD
Shawn Cooper
Brenda Evans
Raymond Lee
Ronald Mendez
Michele Mcintosh
Courtney Clark
Paul Harris
Chad Wiggins
Maria Moore
Mr. Alejandro Salazar
Sarah Mccarthy MD
Cheryl Bennett
Ashley Ortiz
Crystal Stone
Michael Moore
Cynthia Morrison
Dean Cabrera
Rachel Bowman
Tyler Bell
James Mccormick
Carolyn Kelly
George Chandler
Dana Rodriguez
Justin Vaughn
Kenneth Jones
Frank Davis
Kaitlin Gallegos
Emma Gallagher
Joshua Rios
Debra Hardy
James Johnson
William Glass
Sarah Evans
Joseph Dodson
Ernest Carlson
Joseph Diaz
Joseph Smith
Joel Finley
Steven Scott DVM
Michael Joseph
Adam Hunt
Jennifer Lara
Ryan Jones
Eugene Murphy
Jessica Brooks
Samantha Stone DDS
Heather Woods
Stephen Mora
Deborah House
James Clar
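
The write-then-read round trip can be tried without Faker or a file on disk. The sketch below is an illustration using `io.StringIO` as an in-memory file and a few made-up rows. Note that every value read back by `csv.DictReader` is a string.

```python
import csv
import io

header = ["name", "age", "city"]
rows = [["Alice", 30, "Yala"], ["Bob", 25, "Pattani"]]   # made-up sample rows

buffer = io.StringIO(newline="")     # in-memory stand-in for a file
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

buffer.seek(0)                       # rewind before reading
reader = csv.DictReader(buffer)      # consumes the header row itself
assert reader.fieldnames == header

records = list(reader)
assert len(records) == 2             # no data row was skipped
assert records[0] == {"name": "Alice", "age": "30", "city": "Yala"}

# convert explicitly when numbers are needed
ages = [int(record["age"]) for record in records]
assert ages == [30, 25]
```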

### Reading and writing CSVs using pandas DataFrames

In [None]:
import pandas as pd
df=pd.read_csv('data.csv')

In [None]:
df.head(10)

Unnamed: 0,name,age,street,city,state,zip,lng,lat
0,Jenna Morgan,33,317 Amanda Pines,Port Douglas,Washington,90540,6.239324,82.924952
1,Michael Anderson,37,86897 Cooper Vista Apt. 803,Vanessaton,Mississippi,22343,-41.347978,-33.195403
2,Amanda Alvarez,18,55963 Davis Spring Apt. 360,Jasonland,Idaho,18603,145.126961,-14.411687
3,Jacqueline Norris,56,543 Mary Hill Apt. 994,New Andreburgh,Oklahoma,79900,24.885753,-79.665479
4,Andrew Lopez,18,137 Jerry Divide Apt. 254,East Erin,Pennsylvania,19686,6.003035,-66.235487
5,Paige Price,41,3378 Kevin Court,North Kevinton,Missouri,41034,-28.737702,17.581486
6,Tracy Peterson,21,0365 Garcia Spurs,New Dawnville,Minnesota,88926,130.223861,75.215893
7,Krista Bolton,40,6849 Steve Ports,Melindachester,Delaware,94093,-74.998645,-54.472779
8,Mark Martinez,64,40582 George Spurs Suite 399,Bennettside,Kentucky,40896,-139.039326,-60.89779
9,Jamie Merritt,41,267 Cooper Isle Suite 752,Port Kevin,Mississippi,55918,-144.798861,81.115103


### Writing JSON with Python

Write JSON using Python and the standard library

In [None]:
from faker import Faker
import json

fake=Faker()
alldata={}
alldata['records']=[]

for x in range(1000):
    data={"name":fake.name(),
          "age":fake.random_int(min=18, max=80, step=1),
          "street":fake.street_address(),
          "city":fake.city(),
          "state":fake.state(),
          "zip":fake.zipcode(),
          "lng":float(fake.longitude()),
          "lat":float(fake.latitude())}
    alldata['records'].append(data)

# the with block closes the file, so the JSON is fully written before it is read
with open('data.json','w') as output:
    json.dump(alldata,output)

Reading JSON

In [None]:
import json
with open("data.json",'r') as f:
    data=json.load(f)
print(type(data))
print(data['records'][0]['name'])

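If `data.json` is read before the file that wrote it has been closed or flushed, the file may be incomplete and `json.load` fails with an error such as:

JSONDecodeError: Expecting ',' delimiter: line 1 column 172179 (char 172178)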
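
A minimal round trip that avoids this problem is sketched below. It uses a temporary directory and two made-up records instead of Faker, and the file name `sample.json` is arbitrary. Each `with` block closes its file before the next step runs.

```python
import json
import os
import tempfile

alldata = {"records": [{"name": "Alice", "age": 30},      # made-up records
                       {"name": "Bob", "age": 25}]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.json")

    with open(path, "w") as f:       # the file is closed when the block ends
        json.dump(alldata, f)

    with open(path) as f:            # so the reader sees the complete document
        loaded = json.load(f)

assert loaded == alldata
assert loaded["records"][0]["name"] == "Alice"

# dumps and loads do the same round trip through a string
text = json.dumps(alldata, indent=2)
assert json.loads(text) == alldata
```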