# Ch. 2: A Crash Course in Python
Notes on "Data Science from Scratch" by Joel Grus

- Get the code and examples from the book for Chapters 3 - 24 [here](https://github.com/joelgrus/data-science-from-scratch)

## The Basics
- Getting Python
- The Zen of Python
- Whitespace Formatting
- Modules
- Arithmetic
- Functions
- Strings
- Exceptions
- Lists
- Tuples
- Dictionaries
    - <code>defaultdict</code>
    - <code>Counter</code>
- Sets
- Control Flow
- Truthiness

### Getting Python
- We recommend installing [Anaconda](https://store.continuum.io/cshop/anaconda/)
- Otherwise, you can install the following to get started:
    - [Python](https://www.python.org/)
    - Python package manager: [pip](https://pypi.python.org/pypi/pip)
    - Nicer Python shell: [IPython](http://ipython.org/)
    - Web app for interactive data sci: [Jupyter Notebook](http://jupyter.org/)
        - <code>pip install jupyter</code> (Python 2)
        - <code>pip3 install jupyter</code> (Python 3)
    - etc...

### The Zen of Python
- Python design principles: [The Zen of Python (PEP 20)](https://www.python.org/dev/peps/pep-0020/)
- Python code readability: [Style Guide for Python Code (PEP 8)](https://www.python.org/dev/peps/pep-0008/)

### Whitespace Formatting
In Python, whitespace is important. Other languages, like C++, use curly braces {} to delimit blocks of code.
<code> 
for( int i = 0; i < 5; i++ )
   {
       cout << i << endl;
   }</code>

In [None]:
# Python uses indentation to delimit blocks of code.
for i in range(5):
    print i
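
As a small sketch of how indentation nests (the variable names are illustrative, not from the book): each deeper block is indented one more level, and whitespace inside parentheses or brackets is ignored, so long expressions can span several lines. `print(...)` with a single argument is used so the snippet should run on both Python 2.7 and Python 3.

```python
# Each nested block is indented one level deeper than its parent.
pairs = []
for i in [1, 2]:
    for j in [10, 20]:
        pairs.append((i, j))  # belongs to the inner loop
    # this comment sits at the outer loop's indentation level

# Whitespace is ignored inside parentheses and brackets,
# which helps when an expression is too long for one line.
total = (1 + 2 + 3 +
         4 + 5 + 6)

print(pairs)
print(total)
```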

### Modules
- `import` modules (= libraries = packages) before using them, whether they come from the [Python 2.7 standard library](https://docs.python.org/2/library/) or are installed separately
- Python has a *huge* number and variety of extra packages; if one isn't already installed, install it with `pip` or `conda`, then import it:
    - [PyPI (Python Package Index)](https://pypi.python.org/pypi)
    - [Anaconda 2.5.0 Package List](https://docs.continuum.io/anaconda/pkg-docs)
        - Also, check out "Packages available in Anaconda" and "Managing Packages in Anaconda" from [here](https://docs.continuum.io/anaconda/index#packages-available-in-anaconda)

Let's try importing the [Matplotlib pyplot module](http://matplotlib.org/api/pyplot_api.html), which "provides a MATLAB-like plotting framework."

*Note*:
- Sometimes you'll want to use [IPython magic commands](http://ipython.readthedocs.org/en/stable/interactive/magics.html) in your Jupyter notebooks.
- Try not to do the following since you may inadvertently overwrite variables you've already defined: `from matplotlib.pyplot import *`

In [None]:
# Tip: use IPython magic command "%matplotlib inline" to display plots in notebook
%matplotlib inline 
import matplotlib.pyplot # Import module and reference by its super long name

matplotlib.pyplot.plot([1,2,3], [1,2,3]) # Must type the whole name to access the module's functions

In [None]:
import matplotlib.pyplot as plt # OR, alias it as a shorter, more-fun-to-type name

plt.plot([1,2,3], [1,2,3])

### Arithmetic
- In Python 2.7 (as in Fortran), `/` performs integer division when both operands are integers.

In [None]:
print 5 / 2

- To force floating point division, make at least one operand a float:

In [None]:
print type(5), type(2.0)
print 5 / 2.0

- Or, make floating point division default:

In [None]:
from __future__ import division
print 5 / 2   # now floating point division is default
print 5 // 2  # use double slash for integer division
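
The division rules above can be summarized in a short sketch that avoids the `__future__` import and should behave the same on Python 2.7 and Python 3. Note that `//` floors (rounds toward negative infinity) rather than truncating, which matters for negative operands.

```python
# Floor division: same result on Python 2.7 and Python 3.
print(5 // 2)        # 2
print(-5 // 2)       # -3: floors toward negative infinity

# True division: make at least one operand a float.
print(5 / 2.0)       # 2.5
print(float(5) / 2)  # 2.5

# divmod returns the floor quotient and the remainder together.
quotient, remainder = divmod(5, 2)
print(quotient)      # 2
print(remainder)     # 1
```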

### Functions
- Define functions with `def`

In [None]:
def double(x):
    """This is where you put an optional docstring that 
    explains what the function does.
    For example, this function multiplies its input by 2."""
    return x * 2

In [None]:
def apply_to_one(f):
    """Calls the function f with 1 as its argument"""
    return f(1)

In [None]:
my_double = double          # refers to the previously defined function
x = apply_to_one(my_double)
print x

#### `lambda` functions
- These are short, [anonymous functions](https://en.wikipedia.org/wiki/Anonymous_function)

In [None]:
y = apply_to_one(lambda x: x + 4)
print y

In [None]:
# but use `def` instead of assigning a lambda function to a variable
another_double = lambda x: 2 * x     # don't do this
def another_double(x): return 2 * x  # do this instead

#### default function arguments
- Give a parameter a default value; pass the argument only when you want a different value

In [None]:
def my_print(message="my default message"):
    print message
    
my_print("hello")

In [None]:
my_print()
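
Default arguments can also be supplied by name, which lets you skip earlier parameters. A minimal sketch (the `subtract` function is an illustration, not something defined earlier in these notes):

```python
def subtract(a=0, b=0):
    """Return a - b; both arguments default to 0."""
    return a - b

print(subtract(10, 5))  # 5
print(subtract(0, 5))   # -5
print(subtract(b=5))    # -5: a keeps its default of 0
```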

### Strings

In [None]:
# Delimit strings with single OR double quotation marks
single_quoted_string = 'data science'
double_quoted_string = "data science"
print single_quoted_string, double_quoted_string

In [None]:
# Use backslashes to encode special characters
tab_string = "\t"  # represents the tab character
len(tab_string)    # string length is 1 (not 2)

In [None]:
# Use raw strings to represent backslashes
not_tab_string = r"\t"  # represents the characters '\' and 't'
len(not_tab_string)     # now length is 2

In [None]:
# Create multiline strings using triple-quotes
multi_line_string = """This is the first line.
and this is the second line
and this is the third line"""
print multi_line_string

### Exceptions

In [None]:
# This raises an unhandled ZeroDivisionError, which stops execution
print 0 / 0

In [None]:
# This will handle the exception by printing an error message
try:
    print 0 / 0
except ZeroDivisionError:
    print "Cannot divide by zero"

Here's another example (taken from [here](http://www.python-course.eu/exception_handling.php)):

In [None]:
n = int(raw_input("Please enter a number: "))

In [None]:
while True:
    try:
        n = raw_input("Please enter an integer: ")
        n = int(n)
        break
    except ValueError:
        print("No valid integer! Please try again ...")
print "Great, you successfully entered an integer!"
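
The same `try`/`except` pattern can be wrapped in a function so that it is testable without keyboard input. This is only a sketch; `parse_int` and its `default` parameter are hypothetical names introduced here.

```python
def parse_int(text, default=None):
    """Return int(text), or default if text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        # int() raises ValueError for strings such as "abc" or "3.5"
        return default

print(parse_int("42"))      # 42
print(parse_int("abc"))     # None
print(parse_int("3.5", 0))  # 0
```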

## Any questions so far?

## Moving on to Python data structures...
- Lists
- Tuples
- Dictionaries
- Sets

### Lists
- Ordered [collections](https://en.wikipedia.org/wiki/Collection_%28abstract_data_type%29)
- Similar to an array in other languages, but holds heterogeneous data (e.g., floats + ints + strings)
- Note: use NumPy arrays ([here's](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html) a tutorial) for larger amounts of homogeneous data (e.g., just floats)
- Specify with brackets `[]`

In [None]:
integer_list = [1, 2, 3]
heterogeneous_list = ["string", 0.1, True]
list_of_lists = [ integer_list, heterogeneous_list, [] ]

print len(integer_list)  # get the length of a list
print sum(integer_list)  # get the sum of the elements in a list (if addition is defined for those elements)

In [None]:
# Use square brackets to get the nth element of a list (indexing starts at 0)
x = range(10)
print x
print x[0]
print x[1]
print x[-1]
print x[-2]

In [None]:
# Use square brackets to "slice" lists
print x[:3]   # up to but not including 3
print x[3:]   # 3 and up
print x[1:4]  # 1 up to but not including 4
print x[-3:]  # last 3
print x[1:-1] # without the first and last elements (0 and 9)
print x[:]    # all elements of the list
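
One consequence worth sketching: a slice always builds a new list, so `x[:]` is a shallow copy, while plain assignment only creates a second name for the same list. The variable names below are illustrative, and `list(range(10))` is used so the snippet should also run on Python 3.

```python
x = list(range(10))

alias = x              # second name for the same list object
shallow_copy = x[:]    # new list with the same elements

x[0] = -1              # modify the original in place

print(alias[0])        # -1: the alias sees the change
print(shallow_copy[0]) # 0: the copy does not
print(x[::3])          # [-1, 3, 6, 9]: an optional third value sets the stride
```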

In [None]:
# Use the `in` operator to check for list membership; use only for small lists or if run time is not a concern
print 1 in [1, 2, 3]
print 0 in [1, 2, 3]

In [None]:
# Concatenate lists like this
x = [1, 2, 3]

y = x + [4, 5, 6]   # creates a new list leaving "x" unchanged
print x
print y
print ""
x.extend([4, 5, 6]) # changes "x"
print x

In [None]:
# Append to lists like this
x = [1, 2, 3]
print x
x.append(0)
print x

In [None]:
# It's convenient (and common) to unpack lists like this
z = [1, 2]
x, y = z
print type(x), "x =", x 
print type(y), "y =", y 
print type(z), "z =", z 

In [None]:
# It's also common to use an underscore for a value you're going to throw away
z = [3, 4]
_, y = z   # discard the first element; y is now 4
print y
print z

### Tuples
- Immutable (can't be modified like a list)
- Specify with parentheses `()`, or with commas alone

In [None]:
my_list = [1, 2]
my_tuple = (1, 2)
other_tuple = 3, 4
my_list[1] = 3

print my_list

In [None]:
try:
    my_tuple[1] = 3
except TypeError:
    print "Cannot modify a tuple"

In [None]:
# Use tuples to return multiple values from functions
def sum_and_product(x, y):
    return (x + y), (x * y)

sp = sum_and_product(2, 3)
s, p = sum_and_product(5, 10)

print sp
print s 
print p

In [None]:
# Use tuples (and lists) for multiple assignments
x, y = 1, 2
print "x =", x
print "y =", y

In [None]:
x, y = y, x  # Pythonic way to swap variables
print "x =", x
print "y =", y

### Dictionaries
- Dictionaries associate *values* with *keys*
- Allow quick retrieval of the value for a given key
- Specify with curly braces `{}` or `dict()`

In [None]:
empty_dict = {}                    # Pythonic
empty_dict2 = dict()               # less Pythonic
grades = { "Joel": 80, "Tim": 95}

# Use square brackets to look up value(s) for a key
print grades["Joel"]

In [None]:
# KeyError exception raised if key not found
try: 
    kates_grade = grades["Kate"]
except KeyError:
    print "No grade for Kate!"

In [None]:
# Use `in` to check for existence of a key
joel_has_grade = "Joel" in grades
kate_has_grade = "Kate" in grades

print joel_has_grade
print kate_has_grade

In [None]:
# Use `get` method of dictionaries when you want to return a default value (rather than raise exception)
print grades.get("Joel", 0)
print grades.get("Kate", 0)
print grades.get("No One")   # with no default given, `get` returns None

In [None]:
# Use square brackets to assign key-value pairs, e.g., dict_name[key] = value
grades["Tim"] = 99
grades["Kate"] = 100

print "Number of students:", len(grades)

In [None]:
# Use dictionaries to represent structured data, such as in a tweet
tweet = {
    "user": "joelgrus",
    "text": "Data Science is Awesome",
    "retweet_count": 100,
    "hashtags": ["#data", "#science", "#datascience", "#awesome"]
}

In [None]:
print tweet.keys()    # list of keys
print tweet.values()  # list of values
print tweet.items()   # list of (key, value) tuples

In [None]:
print "user" in tweet.keys()       # list `in` is slow
print "user" in tweet              # dict `in` is fast (and more Pythonic)
print "joelgrus" in tweet.values() 

Dictionary keys must be hashable, so you *cannot* use lists as keys.  If that's needed, then:
- use a `tuple`, or
- represent the key as a string
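
A minimal sketch of this restriction and both workarounds; `point_names` and `coords` are hypothetical names used only for illustration.

```python
point_names = {}

# Tuples are hashable, so they work as dictionary keys.
point_names[(0, 0)] = "origin"

# Lists are mutable and unhashable, so using one as a key raises TypeError.
try:
    point_names[[1, 2]] = "a point"
    list_key_worked = True
except TypeError:
    list_key_worked = False

# Workarounds: convert the list to a tuple, or to a string.
coords = [1, 2]
point_names[tuple(coords)] = "a point"
point_names[str(coords)] = "same point, string key"

print(list_key_worked)      # False
print(point_names[(1, 2)])  # a point
```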

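#### `defaultdict`
Suppose you want to count the words in a `document` (here, a list of words). With a plain dictionary there are several ways to handle a word that hasn't been seen yet:

In [None]:
# Approach 1: check whether the key exists before incrementing
document = ["data", "science", "data"]  # hypothetical sample; substitute your own list of words
word_counts = {}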
for word in document:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

In [None]:
# Approach 2: try to increment, and handle the KeyError for new words
word_counts = {}
for word in document:
    try:
        word_counts[word] += 1
    except KeyError:
        word_counts[word] = 1

In [None]:
# Approach 3: use `get` with a default count of 0
word_counts = {}
for word in document:
    previous_count = word_counts.get(word, 0)
    word_counts[word] = previous_count + 1

Instead of these, use `defaultdict`.

In [None]:
from collections import defaultdict

word_counts = defaultdict(int)
for word in document:
    word_counts[word] += 1

In [None]:
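dd_list = defaultdict(list)            # list() produces an empty list
dd_list[2].append(1)                   # now dd_list contains {2: [1]}

dd_dict = defaultdict(dict)            # dict() produces an empty dict
dd_dict["Joel"]["City"] = "Seattle"    # {"Joel": {"City": "Seattle"}}

dd_pair = defaultdict(lambda: [0,0])   # each missing key gets its own [0, 0]
dd_pair[2][1] = 1                      # now dd_pair contains {2: [0, 1]}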
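
A `defaultdict` calls its factory function (here `int`, `list`, `dict`, or a `lambda`) whenever a missing key is looked up, and stores the result under that key. One side effect, sketched below with a hypothetical sample `document`, is that merely reading a missing key adds it, whereas `get` does not.

```python
from collections import defaultdict

document = ["data", "science", "data"]  # hypothetical sample input

word_counts = defaultdict(int)          # int() returns 0 for missing keys
for word in document:
    word_counts[word] += 1

print(word_counts["data"])              # 2
print("python" in word_counts)          # False: not looked up yet

missing = word_counts["python"]         # the lookup calls int() and stores 0
print(missing)                          # 0
print("python" in word_counts)          # True: the lookup inserted the key

print(word_counts.get("ruby", 0))       # 0
print("ruby" in word_counts)            # False: get() does not insert
```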

#### `Counter`

In [None]:
from collections import Counter
c = Counter([0, 1, 2, 0])
print c

In [None]:
word_counts = Counter(document)

# Print the 10 most common words and their counts
for word, count in word_counts.most_common(10):
    print word, count
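
For a version that runs end to end, here is a sketch with a hypothetical sample `document`. It also shows that a `Counter` returns 0 for a key it has never seen instead of raising `KeyError`.

```python
from collections import Counter

# Hypothetical sample input: a list of words
document = ["the", "cat", "and", "the", "hat", "and", "the", "bat"]

word_counts = Counter(document)

# most_common(n) returns (word, count) pairs, highest counts first
top_two = word_counts.most_common(2)
for word, count in top_two:
    print("%s %d" % (word, count))

print(word_counts["dog"])  # 0: missing keys count as zero, no KeyError
```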

### Sets

In [None]:
s = set()
s.add(1)
print "s is", s
s.add(2)
print "s is now", s
s.add(2)
print "s is still", s
print "There are", len(s), "elements in s."
print 2 in s
print 3 in s

In [None]:
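hundreds_of_other_words = []   # placeholder for the rest of the stopword list
stopwords_list = ["a", "an", "at"] + hundreds_of_other_words + ["yet", "you"]

"zip" in stopwords_list   # False, but has to check every element

stopwords_set = set(stopwords_list)
"zip" in stopwords_set    # False, and very fast to check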

In [None]:
item_list = [1, 2, 3, 1, 2, 3]
print len(item_list)
print set(item_list)
print len(set(item_list))
print len(list(set(item_list)))
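
Putting the two uses of sets together (fast membership tests and finding distinct items), here is a sketch that filters stopwords out of a word list. The abbreviated `stopwords` set and the `words` list are hypothetical sample data.

```python
stopwords = set(["a", "an", "at", "the", "yet", "you"])  # hypothetical, abbreviated

words = ["the", "data", "at", "a", "data", "conference"]

# Membership tests against a set take roughly constant time on average.
content_words = [w for w in words if w not in stopwords]
print(content_words)     # ['data', 'data', 'conference']

# Converting to a set drops duplicates (and loses the original order).
distinct = set(content_words)
print(len(distinct))     # 2
print(sorted(distinct))  # ['conference', 'data']
```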

### Control Flow

### Truthiness

## The Not-So-Basics
- Sorting
- List Comprehensions
- Generators and Iterators
- Randomness
- Regular Expressions
- Object-Oriented Programming
- Functional Tools
- <code>enumerate</code>
- <code>zip</code> and Argument Unpacking
- args and kwargs

### Sorting
- list `sort` method sorts in place; `sorted` function returns a new list

In [None]:
x = [4,1,2,3]
print sorted(x)
print x         # x list is still the same
x.sort()        
print x         # x list is now changed

In [None]:
print sorted(x, reverse=True)
x.sort(reverse=True)
print x
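
Both `sort` and `sorted` also accept a `key` function, whose result is compared in place of the element itself. A short sketch, where the `word_counts` dictionary is hypothetical sample data:

```python
x = [-4, 1, -2, 3]

# Sort by absolute value, largest first
by_abs = sorted(x, key=abs, reverse=True)
print(by_abs)    # [-4, 3, -2, 1]

# Sort (word, count) pairs by count, highest first
word_counts = {"data": 3, "science": 1, "python": 2}  # hypothetical counts
by_count = sorted(word_counts.items(),
                  key=lambda pair: pair[1],
                  reverse=True)
print(by_count)  # [('data', 3), ('python', 2), ('science', 1)]
```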

### List Comprehensions

### Generators and Iterators

### Randomness

### Regular Expressions

### Object-Oriented Programming

### Functional Tools

### `enumerate`

### `zip` and Argument Unpacking

### args and kwargs

## For Further Exploration
- Python
    - [Official Python.org tutorial](https://docs.python.org/2/tutorial) (good)
- IPython
    - [Official IPython.org tutorial](https://ipython.org/ipython-doc/2/interactive/tutorial.html) (not quite as good)
    - [IPython.org videos](https://ipython.org/videos.html) (better)
    - [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do?cmp=af-prog-books-videos-lp-na_afp_book_mckinney_cj_12307942_7040302) by Wes McKinney (original author of *pandas*)