<img src="https://www.assistancebeyondcrisis.org.au/wp-content/uploads/2017/05/logo-deloitte.png" width="150" align = "right">

# AnalyticsU - Python 201
Instructors: Peter Coiley, Brad Solomon, and Olivia Gebhardt

## Part 1: Intermediate Python Concepts

Before focusing on analytics-centric Python, you'll cover several intermediate-level Python concepts that extend to multiple domains besides analytics and data science:

1. **Comprehensions (list, set, dict)** - creating new data structures with idiomatic Python code
2. **Object-oriented programming (OOP)** - a programming paradigm that represents things as objects

These are sometimes referred to as **native** Python concepts because they are objects built into the Python interpreter or from the Python Standard Library, rather than being specific to a third-party Python package.

## Prerequisites: Imports

You'll need to access the following imports throughout this lesson:

In [None]:
import bisect
import statistics
from collections import Counter
from datetime import date, timedelta
from math import radians, asin, cos, isclose, sin, sqrt
from typing import Container, Optional

import pandas as pd

The first several lines here import from Python's Standard Library.  These are the "[batteries included](https://docs.python.org/3/library/)" part of any Python distribution.

The last import, `pandas`, is a third-party library.

_Note_: To make your code more readable in a large codebase, it is good practice to **organize import statements** in line with [Python's PEP 8](https://www.python.org/dev/peps/pep-0008/#imports) style guide.  Generally, that means using imports in the following order:

1. Standard library imports (such as `datetime` or `bisect`)
2. Related third party imports (such as `pandas`)
3. Local application/library specific imports (not shown above)

...with a blank line between each group.

_Troubleshooting_: If you're seeing `ModuleNotFoundError: No module named 'pandas'`, you will need to [install Pandas](https://pandas.pydata.org/docs/getting_started/install.html) from the command line through `conda` or `pip`.

## Comprehensions

In AnalyticsU Python101 and/or the [Python tutorial](https://docs.python.org/3/tutorial/), you were introduced to elementary data structures such as `list`, `tuple`, `set`, and `dict`.

You can [iterate](https://docs.python.org/3/library/stdtypes.html#typeiter) over the elements of these data structures using [control flow](https://docs.python.org/3/tutorial/controlflow.html) such as `for` and `while`, optionally appending or adding to a new data structure as a result. 

An alternative and sometimes more idiomatic way to iterate is to use a [comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).  These come in several forms:

| Type | Example |
| ---- | ------- |
| `list` comprehension | `[i ** 2 for i in range(5)]` |
| `set` comprehension | `{i[0].casefold() for i in ("sampson", "Shelby", "pat")}` |
| `dict` comprehension | `{name: len(name) for name in ("sampson", "Shelby", "pat")}` |

_Note_: The term before "comprehension" here refers to the data structure that is being _formed_, not to the data structure that is being iterated over.  In other words, a _list comprehension_ uses brackets `[ ... ]` to form a new list, and could be iterating over another `list`, a `dict`, or something else.
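
To make this note concrete, here is a short sketch in which each comprehension iterates over a different kind of object than the one it builds. The `prices` data and variable names are made up for illustration:

```python
prices = {"apple": 0.5, "kiwi": 0.25, "melon": 3.0}  # hypothetical data

# List comprehension that iterates over a dict's items
expensive = [name for name, price in prices.items() if price > 0.3]

# Set comprehension that iterates over a str
vowels = {ch for ch in "comprehension" if ch in "aeiou"}

# Dict comprehension that iterates over a list
lengths = {name: len(name) for name in expensive}

print(expensive)       # ['apple', 'melon']
print(sorted(vowels))  # ['e', 'i', 'o']
print(lengths)         # {'apple': 5, 'melon': 5}
```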

### List Comprehensions

Square each number in `range(5)`:

In [None]:
print([i ** 2 for i in range(5)])

Iterate over a `tuple` rather than `range` object:

In [None]:
print([i ** 2 for i in (2, 4, 8, 16)])

Find only even numbers over the range `[0, 19]`:

In [None]:
print([i for i in range(20) if i % 2 == 0])

Define a `list` of addresses:

In [None]:
addresses = [
    "Fort Meade, MD",
    "Baltimore, MD",
    "Fort Worth, TX",
    "Culpeper, VA",
    "Boise, ID",
    "Baltimore, MD",
]

Using a traditional `for`-loop, create a new `list` that contains only cities in `MD`:

In [None]:
only_maryland = []
for a in addresses:
    if a.endswith("MD"):
        only_maryland.append(a)

print(only_maryland)

The above can be condensed into a `list` comprehension:

In [None]:
only_maryland_comp = [a for a in addresses if a.endswith("MD")]

print(only_maryland_comp)

The input may be more involved, such as a `dict` mapping student names to vectors of grades:

In [None]:
import pandas as pd

grades = {
    "tom": [98.7, 94.2, 89.0],
    "luke": [85.7, 83.0, 89.0],
    "jenn": [99.1, 99.2, 100.0],
}

You can find the unioned, flat list of grades by iterating over the dictionary `values` with a [**nested** `list` comprehension](https://docs.python.org/3/tutorial/datastructures.html#nested-list-comprehensions):

In [None]:
all_grades = [g for v in grades.values() for g in v]
print(all_grades)
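
The `for` clauses in such a comprehension read left to right, in the same order as the equivalent nested loops. The sketch below redefines `grades` so that it runs on its own and checks that the two forms agree. It also shows `itertools.chain.from_iterable()` from the Standard Library as one alternative way to flatten:

```python
from itertools import chain

grades = {
    "tom": [98.7, 94.2, 89.0],
    "luke": [85.7, 83.0, 89.0],
    "jenn": [99.1, 99.2, 100.0],
}

flat_loop = []
for v in grades.values():  # outer loop: the first `for` clause
    for g in v:            # inner loop: the second `for` clause
        flat_loop.append(g)

flat_comp = [g for v in grades.values() for g in v]
flat_chain = list(chain.from_iterable(grades.values()))

print(flat_loop == flat_comp == flat_chain)  # True
```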

### Set Comprehensions

Can you find the set of _unique cities_ in Maryland from `addresses`?

In [None]:
only_maryland_cities = set()
for a in addresses:
    if a.endswith("MD"):
        city = a.partition(",")[0]
        only_maryland_cities.add(city)

print(only_maryland_cities)

You can achieve this in one line of code with a **set comprehension**, which looks like a list comprehension except that it uses curly braces `{ ... }` and removes duplicate entries from the result:

In [None]:
{a.partition(",")[0] for a in addresses if a.endswith("MD")}

_Note_: Unlike a `list` or `tuple`, a `set` has no concept of sortedness, and is most commonly used for fast **membership testing**.
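
As a small illustration of membership testing, the sketch below rebuilds the set of Maryland cities (repeating the `addresses` list so that it runs on its own) and checks whether particular cities are in it. Because a set is unordered, the sketch calls `sorted()` before printing:

```python
addresses = [
    "Fort Meade, MD",
    "Baltimore, MD",
    "Fort Worth, TX",
    "Culpeper, VA",
    "Boise, ID",
    "Baltimore, MD",
]

md_cities = {a.partition(",")[0] for a in addresses if a.endswith("MD")}

print("Baltimore" in md_cities)  # True
print("Boise" in md_cities)      # False
print(sorted(md_cities))         # ['Baltimore', 'Fort Meade']
```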

### Dict Comprehensions

A related form is a **dict comprehension**:

In [None]:
import statistics

grades = [
    ("tom", [98.7, 94.2, 89.0]),
    ("luke", [85.7, 83.0, 89.0]),
    ("jenn", [99.1, 99.2, 100.0]),
]

In [None]:
avg_grades = {}
for student, gradeset in grades:
    avg_grades[student] = round(statistics.mean(gradeset), 2)
    
print(avg_grades)

In [None]:
{student: round(statistics.mean(gradeset), 2) for student, gradeset in grades}
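
Like list comprehensions, dict comprehensions accept an optional `if` clause. As an illustrative sketch, the following keeps only students whose average is at least 90. The cutoff and the name `honor_roll` are assumptions made for this example, not part of the lesson data:

```python
import statistics

grades = [
    ("tom", [98.7, 94.2, 89.0]),
    ("luke", [85.7, 83.0, 89.0]),
    ("jenn", [99.1, 99.2, 100.0]),
]

# Keep a student only if their mean grade clears the (hypothetical) cutoff
honor_roll = {
    student: round(statistics.mean(gradeset), 2)
    for student, gradeset in grades
    if statistics.mean(gradeset) >= 90
}

print(honor_roll)  # {'tom': 93.97, 'jenn': 99.43}
```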

### Bonus: Generator Expressions

Related to comprehensions are [generator expressions](https://www.python.org/dev/peps/pep-0289/).  These let you avoid creating an intermediate `list` object in memory if you only need to extract a particular data point from it:

In [None]:
stockdata = [
    {"ticker": "GE", "pct_chg": -2.0},
    {"ticker": "GE", "pct_chg": 2.1},
    {"ticker": "INTC", "pct_chg": 0.1},
    {"ticker": "INTC", "pct_chg": 0.3},
    {"ticker": "INTC", "pct_chg": -2.9},
]

max_ge_increase = max(row["pct_chg"] for row in stockdata if row["ticker"] == "GE")
print(max_ge_increase)

As a second example, you can reuse `all_grades` from above to find the count of grades by 10-percent bands:

In [None]:
from collections import Counter

all_grades = [98.7, 94.2, 89.0, 85.7, 83.0, 89.0, 99.1, 99.2, 100.0]
print(Counter(i // 10 * 10 for i in all_grades))
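
To see why a generator expression avoids the intermediate `list`, it can help to compare the two directly. The following is a minimal sketch that relies on CPython's `sys.getsizeof()`. Exact sizes vary by Python version, so it only compares them:

```python
import sys

squares_list = [i ** 2 for i in range(10_000)]  # builds all 10,000 values up front
squares_gen = (i ** 2 for i in range(10_000))   # builds nothing until asked

# The list holds every element; the generator holds only its current state
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True

# Values are produced one at a time, on demand
print(next(squares_gen), next(squares_gen), next(squares_gen))  # 0 1 4

# A generator can be consumed only once: after sum() drains it, nothing is left
remaining = sum(squares_gen)
print(list(squares_gen))  # []
```

The trade-off is that a generator cannot be indexed or reused. If you need the values more than once, build a `list` instead.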

### Exercises: Comprehensions

#### Challenge 1

Given a sequence of numbers, find the sum of the elements that are greater than 20.

In [None]:
seq = [-21, 4, 15, 21, 25, 78, 19, 4]

The result should evaluate to **124**.

**Challenge 1**: [Solution Notebook](solutions/py201-lesson1-challenge1.ipynb)

#### Challenge 2

You have a bag containing magnets. Each magnet contains an individual letter of the alphabet.

Write a function `can_you_spell()` that returns `True` if a person's name can be spelled using letters from the bag, _without replacement_, and `False` otherwise.

The function signature should be:

In [None]:
from typing import Container

def can_you_spell(name: str, bag: Container[str]) -> bool: ...

Examples:

```python
can_you_spell("lynn", ["y", "n", "p", "g", "n", "l"])  # True
can_you_spell("lynn", ["y", "n", "p", "g", "l"])  # False
```

Hints:

- Use a **list comprehension** to iterate over the `bag` parameter.
- Use the built-in [`sorted()` function](https://docs.python.org/3/library/functions.html#sorted) and its `key` argument to attempt to arrange letters from the bag into the person's name.
- The solution can be written with one line of code.

**Challenge 2**: [Solution Notebook](solutions/py201-lesson1-challenge2.ipynb)

## Object-Oriented Programming

One way to help define **object-oriented programming** is to contrast it with what it is _not_.

**Functional programming** is a programming style that breaks programs down into functions that take inputs, produce outputs, use **immutable** data structures heavily, and do not change state of other objects.

### Functional Programming

An example of functional programming is to decompose the calculation of [sample standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) into its parts:

\begin{equation*}
s = {\sqrt {{\frac {1}{N-1}}\sum _{i=1}^{N}\left(x_{i}-{\bar {x}}\right)^{2}}}
\end{equation*}

where:

- ${\bar {x}}$ is the sample mean
- $x_{i}$ represents each observation
- $N$ is the number of observations
- $s$ is the resulting sample standard deviation
- $\sum _{i=1}^{N}\left(x_{i}-{\bar {x}}\right)^{2}$ is the **sum of squared deviations**

Here's how you can break this down into Python functions:

In [None]:
from math import sqrt

def sample_mean(seq) -> float:
    return sum(seq) / len(seq)

def squared_deviations(seq, mean) -> list:
    return [(i - mean) ** 2 for i in seq]

def sample_stdev(seq) -> float:
    mean = sample_mean(seq)
    devs = squared_deviations(seq, mean)
    result = sqrt(sum(devs) / (len(seq) - 1))
    return result

You've now added these three functions to the **global namespace**, but the variables defined within each function body (such as `devs = ...`) are **local** (internal) to those functions.

In [None]:
data = (727.7, 1086.5, 1091.0, 1361.3, 1490.5, 1956.1)
sd = sample_stdev(data)
print(f"Sample standard deviation of metabolic rate data: {sd:.2f}")

The `data` object is unchanged after being passed as a parameter to `sample_stdev`.
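
As a sanity check, the Standard Library's `statistics.stdev()` also computes the sample standard deviation, so the two results should agree to within floating-point tolerance. The sketch below recomputes the formula inline so that it runs on its own:

```python
import statistics
from math import isclose, sqrt

data = (727.7, 1086.5, 1091.0, 1361.3, 1490.5, 1956.1)

# Same formula as above: sum of squared deviations over (N - 1), then square root
mean = sum(data) / len(data)
manual_sd = sqrt(sum((x - mean) ** 2 for x in data) / (len(data) - 1))

print(isclose(manual_sd, statistics.stdev(data)))  # True
```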

### What Makes Functional Programming "Functional"?

- A program is broken into smaller, concise functions as individual building blocks.
- Functions don't attempt to change the state of (mutate) their arguments.
- The functional style is deterministic: a function is only dependent on its input (no global state) and produces a consistent output for each input value.
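
The contrast between mutating and non-mutating functions can be sketched as follows. Both function names and the sample scores are hypothetical, chosen only for illustration:

```python
def with_bonus(scores, bonus):
    """Functional style: build and return a new list; the input is untouched."""
    return [s + bonus for s in scores]


def add_bonus_in_place(scores, bonus):
    """Non-functional style: mutate the caller's list and return nothing."""
    for i in range(len(scores)):
        scores[i] += bonus


original = [70, 80, 90]
curved = with_bonus(original, 5)
print(original, curved)  # [70, 80, 90] [75, 85, 95]

add_bonus_in_place(original, 5)
print(original)  # [75, 85, 95]
```

After the second call, any other code that holds a reference to `original` sees the changed values, which is the kind of hidden state change the functional style avoids.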

### The Functional Style: More Reading

Aside from the `sample_stdev()` example, Python also has other features that let you write in a functional style, such as:

- [Built-in functions](https://docs.python.org/3/library/functions.html) such as `map()` and `filter()`
- Libraries and modules such as [`itertools`](https://docs.python.org/3/library/itertools.html) and [`functools`](https://docs.python.org/3/library/functools.html)
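
Here is a brief sketch of these tools working together. The numbers are arbitrary, and in everyday code a comprehension plus `sum()` would often be the more readable choice:

```python
from functools import reduce

nums = [3, 8, 12, 5, 10]  # arbitrary sample values

evens = list(filter(lambda n: n % 2 == 0, nums))  # keep items that pass a test
squares = list(map(lambda n: n ** 2, evens))      # transform each item
total = reduce(lambda a, b: a + b, squares)       # fold the items into one value

print(evens)    # [8, 12, 10]
print(squares)  # [64, 144, 100]
print(total)    # 308
```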

Further reading:

- **docs.python.org**: [Functional HOWTO](https://docs.python.org/3/howto/functional.html)
- **realpython.com**: [Functional Programming in Python](https://realpython.com/courses/functional-programming-python/)

### Intro to OOP

In this section, you'll build an `Employee` class. What should this class do for us?

- You should be able to create **instances** of the class to **hold data** for individual employees (level, name, home office, etc)
- It should let you associate the employee with a timesheet
- It should provide you some **instance methods** that derive additional insights about that employee

Let's get down to it and create an `Employee` **class**.

A class is like a **blueprint**.  You **instantiate** a class to make individual **instances** of the blueprint.
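
Before the full example, here is the blueprint idea at its smallest. The `Office` class is hypothetical and exists only for this sketch. Each instance made from it carries its own data:

```python
class Office:
    """Hypothetical blueprint for an office location."""

    def __init__(self, city, headcount):
        # Instance attributes: every Office object gets its own copy
        self.city = city
        self.headcount = headcount

    def hire(self, n=1):
        """Instance method: changes the state of this one Office only."""
        self.headcount += n
        return self.headcount


# Two instances built from the same blueprint
dc = Office("Washington", 100)
nyc = Office("New York", 250)

dc.hire(5)
print(dc.headcount, nyc.headcount)  # 105 250
```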

In [None]:
import bisect
from datetime import date, timedelta
from typing import Optional

import pandas as pd


class Employee(object):

    valid_levels = (
        "analyst",
        "consultant",
        "senior consultant",
        "manager",
        "senior manager",
        "partner",
        "principal",
        "managing director",
    )

    def promote(self) -> Optional[str]:
        """Give Employee a promotion and *return* their new level.
        
        If they can't be promoted any further, do nothing and return None.
        """
        position_index = self.valid_levels.index(self.level)
        if position_index == len(self.valid_levels) - 1:
            # Could not promote further. Time for vacation
            return None
        self.level = self.valid_levels[position_index + 1]
        return self.level

    def __init__(self, lastname, firstname, level):
        self.lastname = lastname
        self.firstname = firstname
        level = level.casefold()
        if level not in self.valid_levels:
            raise ValueError(f"Invalid level: {level}")
        self.level = level
        
        self._time_table = []

    def add_time_entry(self, dt: date, wbs_code: str, hours: float):
        """Add a single timesheet row for this Employee."""
        # Use bisect.insort_left() to maintain sortedness by (date, wbs_code)
        bisect.insort_left(self._time_table, (dt, wbs_code, hours))
        return self
    
    def timesheet(self, since: Optional[date] = None, until: Optional[date] = None) -> pd.DataFrame:
        """Generate a timesheet as a Pandas DataFrame."""
        df = pd.DataFrame(self._time_table, columns=["dt", "wbs_code", "hours"])
        pretty_frame = df.pivot_table(index="wbs_code", columns="dt", values="hours").fillna(0)
        return pretty_frame.loc[:, since: until]

    def current_week_hours(self) -> dict:
        """Summarize current-week hours per WBS code."""
        since = self.most_recent_sunday()
        return self.timesheet(since=since).sum(axis=1).to_dict()

    @staticmethod
    def most_recent_sunday() -> date:
        """Find the most recent Sunday that falls on or before today."""
        today = date.today()
        while today.weekday() != 6:
            today = today - timedelta(days=1)
        return today

    @property
    def is_ppmd(self) -> bool:
        """Is this person a PPMD?"""
        return self.level in ("partner", "principal", "managing director")
    
    def __str__(self):
        """Let str(x) return a useful string representation of the Employee."""
        return f"{self.lastname}, {self.firstname} [{self.__class__.__name__} - {self.level}]"

In [None]:
emp = Employee("Loite", "Del", level="Senior Manager")
print(emp)

In [None]:
emp.level

In [None]:
emp.is_ppmd

Del has been promoted:

In [None]:
# Alter some internal state for `emp` and return the resulting new level
emp.promote()

In [None]:
emp.level

In [None]:
emp.is_ppmd

Now Del needs to record some timesheet entries:

In [None]:
entries = [
    {"dt": date(2020, 8, 20), "wbs_code": "pto", "hours": 8.0},
    {"dt": date(2020, 8, 1), "wbs_code": "ced", "hours": 2.0},
    {"dt": date(2020, 8, 20), "wbs_code": "xyz", "hours": 9.0},
    {"dt": date(2020, 8, 17), "wbs_code": "ced", "hours": 2.0},
    {"dt": date(2020, 8, 17), "wbs_code": "abc", "hours": 11.5},
    {"dt": date(2020, 8, 18), "wbs_code": "gaa", "hours": 1.0},
    {"dt": date(2020, 8, 18), "wbs_code": "xyz", "hours": 7.0},
    {"dt": date(2020, 8, 19), "wbs_code": "ced", "hours": 2.0},
    {"dt": date(2020, 8, 16), "wbs_code": "ced", "hours": 2.0}
]
for e in entries:
    emp.add_time_entry(**e)

In [None]:
emp.timesheet()

In [None]:
emp.timesheet(since=date(2020, 8, 17))

In [None]:
emp.current_week_hours()

One concept that this example illustrates is [**composition**](https://realpython.com/inheritance-composition-python/): each `Employee` holds a `_time_table` list representing rows in a timesheet.
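
The `_time_table` list stays sorted because `add_time_entry()` inserts each row with `bisect.insort_left()`. A minimal standalone sketch of that behavior, using a few rows similar to the ones above, might look like this:

```python
import bisect
from datetime import date

rows = []
unsorted_entries = [
    (date(2020, 8, 20), "pto", 8.0),
    (date(2020, 8, 1), "ced", 2.0),
    (date(2020, 8, 17), "ced", 2.0),
    (date(2020, 8, 17), "abc", 11.5),
]

# Each insert finds the right position, so `rows` is sorted after every step
for entry in unsorted_entries:
    bisect.insort_left(rows, entry)

print([(r[0].day, r[1]) for r in rows])
# [(1, 'ced'), (17, 'abc'), (17, 'ced'), (20, 'pto')]
```

Tuples compare element by element, so rows are ordered by date first and by WBS code when the dates are equal.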

### Inheritance

The `Employee` class above is narrow-minded in that it only accounts for Deloitte's Traditionalist track levels.

We can create separate classes for `Traditionalist`, `Specialist`, and others through **inheritance**.

`Employee` becomes the **base class**.  `Traditionalist` and `Specialist` are the **child classes** that **inherit** from `Employee`.  `Employee` defines pieces that are common to its subclasses.  These will be inherited unless they are overridden in the subclass:

In [None]:
class Traditionalist(Employee):
    valid_levels = (
        "analyst",
        "consultant",
        "senior consultant",
        "manager",
        "senior manager",
        "partner",
        "principal",
        "managing director",      
    )


class Specialist(Employee):
    valid_levels = (
        "analyst",
        "specialist senior",
        "specialist master",
        "specialist leader",
        "managing director",
    )

In [None]:
joe = Specialist("Smith", "Joe", "specialist master")
print(joe)

In [None]:
try:
    jane = Traditionalist("Doe", "Jane", "specialist leader")
except Exception as e:
    print(e)

Inherited methods still behave the same:

In [None]:
joe.is_ppmd

In [None]:
joe.add_time_entry(dt=date(2020, 8, 20), wbs_code="hgi", hours=9.25).timesheet()
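
Why does overriding only `valid_levels` change what the inherited `__init__()` accepts? The lookup `self.valid_levels` starts at the instance's own class before falling back to the base class. The sketch below shows the mechanism with two stripped-down, hypothetical classes (`Staff` and `Advisor`), so that it does not redefine the classes above:

```python
class Staff:
    """Hypothetical base class; stands in for `Employee`."""

    valid_levels = ("analyst", "consultant")

    def __init__(self, level):
        # `self.valid_levels` is looked up on the instance's own class first
        if level not in self.valid_levels:
            raise ValueError(f"Invalid level: {level}")
        self.level = level


class Advisor(Staff):
    """Hypothetical child class that overrides only the class attribute."""

    valid_levels = ("advisor", "senior advisor")


a = Advisor("senior advisor")              # checked against Advisor.valid_levels
print(isinstance(a, Staff))                # True: an Advisor is also a Staff
print(Advisor.__init__ is Staff.__init__)  # True: __init__ is inherited as-is

try:
    Advisor("consultant")                  # valid for Staff, but not for Advisor
except ValueError as e:
    print(e)                               # Invalid level: consultant
```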

### OOP Example 2: Representing Geography

In this section, you'll continue with another exercise in OOP, but switch to building a new `Coordinates` class that represents a pair of geographical _(latitude, longitude)_ coordinates.

In [None]:
from math import radians, asin, cos, sin, sqrt

class IllegalCoordinatesError(Exception):
    """Raise this exception when you are passed a nonsensical coordinate value."""
    pass

class Coordinates(object):

    def __init__(self, lat: float, lng: float):
        """Make a new pair of coordinates."""
        self.lat = self.validate_lat(lat)
        self.lng = self.validate_lng(lng)
        
        # Functions from `math` expect coordinates expressed in radians, not degrees
        self._phi = radians(lat)
        self._lambda = radians(lng)
    
    def validate_lat(self, lat) -> float:
        """Validate that input latitude is within bounds."""
        if not (self.MIN_LAT <= lat <= self.MAX_LAT):
            raise IllegalCoordinatesError(f"lat must be in range ({self.MIN_LAT}, {self.MAX_LAT})")
        return lat

    def validate_lng(self, lng) -> float:
        """Validate that input longitude is within bounds."""
        if not (self.MIN_LNG <= lng <= self.MAX_LNG):
            raise IllegalCoordinatesError(f"lng must be in range ({self.MIN_LNG}, {self.MAX_LNG})")
        return lng
    
    # Limits on latitude and longitude, expressed in degrees
    MIN_LAT, MAX_LAT = -90, 90
    MIN_LNG, MAX_LNG = -180, 180

    def __str__(self):
        """Make a human-readable string representation of the coordinates pair."""
        return f"<Coordinates> ({self.lat}, {self.lng})"

    def distance_from(self, other):
        """Approximate distance in KM from this Coordinates instance to another."""
        raise NotImplementedError("Challenge problem #3.  Write me!")

    @classmethod
    def from_string(cls, coords: str):
        """Parse a string into a Coordinates object.
        
        Accepts coordinate strings delimited by:
        - Whitespace ->            '38.8977559 -77.0704521'
        - Comma ->                 '38.8977559,-77.0704521'
        - Comma-plus-whitespace -> '38.8977559,   -77.0704521'
        """
        raise NotImplementedError("Challenge problem #4.  Write me!")

What are some features of modelling a coordinate pair with a `Coordinates` class?

- **Data encapsulation** and **namespacing**: You no longer have a bunch of individual variables floating around.  Each `Coordinates` instance gets its own `.lat`, `.lng` (degrees) and `._phi`, `._lambda` (radians).
- **Validation**: You can hook implicit validation via `.validate_lat()` and `.validate_lng()` into the class's `.__init__()` to raise an early exception if inputs don't look right.
- **Extensibility**: You can add additional functionality just by defining new methods.

### Exercises: OOP

#### Challenge 3

In the cell above, implement the `Coordinates.distance_from()` **instance method** to determine the distance from one `Coordinates` instance to another, in kilometers.

Use the Haversine formula and trigonometric functions from the [`math`](https://docs.python.org/3/library/math.html) module:

\begin{equation*}
d = 2r \arcsin \sqrt{\sin^2 \frac{1}{2} (\phi_2 - \phi_1) + \cos{\phi_1} \cos{\phi_2} \sin^2 \frac{1}{2} (\lambda_2 - \lambda_1)}
\end{equation*}

where:

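- $d$ is the resulting distance between the two points
- $r$ is the radius of the sphere (for the Earth, approximately 6,371 km)
- $\phi_1$ and $\lambda_1$ are latitude and longitude for Point 1, respectively, in radians
- $\phi_2$ and $\lambda_2$ are latitude and longitude for Point 2, respectively, in radians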
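
As a quick sanity check of the formula, here is a worked example that assumes a mean Earth radius of $r \approx 6371$ km. Take Point 1 on the equator at $(0^\circ, 0^\circ)$ and Point 2 at the North Pole, $(90^\circ, 0^\circ)$. Then $\phi_2 - \phi_1 = \pi/2$, $\lambda_2 - \lambda_1 = 0$, and $\cos{\phi_2} = 0$, so the second term under the square root vanishes:

\begin{equation*}
d = 2r \arcsin \sqrt{\sin^2 \frac{\pi}{4}} = 2r \cdot \frac{\pi}{4} = \frac{\pi r}{2} \approx 10{,}008 \text{ km}
\end{equation*}

This is one quarter of the Earth's circumference, as you would expect for the distance from the equator to a pole.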

After adding the method body, the following comparison should hold:

In [None]:
from math import isclose


def test_distance_from() -> None:
    c1 = Coordinates(38.8977559, -77.0704521)
    c2 = Coordinates(34.9201086, -95.6922305)
    assert isclose(c1.distance_from(c2), 1713, abs_tol=5.0), "Distance off by > 5 km"

**Challenge 3**: [Solution Notebook](solutions/py201-lesson1-challenge3.ipynb)

#### Challenge 4

Implement the `Coordinates.from_string()` **classmethod** to let a user form a new coordinates pair from a `str` representing a coordinates pair.
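
If classmethods are new to you, the general pattern is an **alternative constructor**: the method receives the class itself as `cls`, parses its input, and returns `cls(...)`. The sketch below shows the pattern on a hypothetical `Temperature` class (not part of this lesson) and on `date.fromisoformat()` from the Standard Library:

```python
from datetime import date


class Temperature:
    """Hypothetical class used only to illustrate the pattern."""

    def __init__(self, degrees: float, unit: str):
        self.degrees = degrees
        self.unit = unit

    @classmethod
    def from_string(cls, text: str):
        """Parse a string such as '21.5 C' into a Temperature."""
        degrees, unit = text.split()
        # `cls` is the class itself, so here this is the same as Temperature(...)
        return cls(float(degrees), unit)


t = Temperature.from_string("21.5 C")
print(t.degrees, t.unit)  # 21.5 C

# The Standard Library uses the same pattern
print(date.fromisoformat("2020-08-20"))  # 2020-08-20
```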

In [None]:
def test_from_string() -> None:
    c1 = Coordinates(38.8977559, -77.0704521)
    c2 = Coordinates.from_string("38.8977559, -77.0704521")
    c3 = Coordinates.from_string("38.8977559 -77.0704521")
    c4 = Coordinates.from_string("38.8977559,-77.0704521")
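    # Compare attribute values: Coordinates defines no __eq__(), so `==` would test identity
    expected = (c1.lat, c1.lng)
    assert all((c.lat, c.lng) == expected for c in (c2, c3, c4))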

**Challenge 4**: [Solution Notebook](solutions/py201-lesson1-challenge4.ipynb)

## Conclusion

Here's a summary of what you covered in this tutorial:

- `list`, `set`, and `dict` comprehensions: Create new data structures through concise **Pythonic** syntax.
- Object-oriented programming: Objects **encapsulate data** and **provide functionality**.

## More Resources

Interested in diving deeper?  Here are some places to start:

- **docs.python.org**: [List comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
- **attrs.org**: [`attrs` - classes without boilerplate](https://www.attrs.org/en/stable/)
- **realpython.com**: [Object-oriented programming (OOP)](https://realpython.com/search?q=oop)
- **wikipedia.org**: [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula)
- **wikipedia.org**: [Standard deviation](https://en.wikipedia.org/wiki/Standard_deviation)
- **docs.python.org**: [PEP 8, Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/)

## Part 2: A Gentle Introduction to Machine Learning

### Popular ML packages
Below are some Python packages commonly used for ML, each with a brief description.

##### NumPy
A well known general-purpose package for array processing that assists in implementing solutions with linear algebra, Fourier transforms, and random numbers.

##### SciPy
An open source library, supported by a developer community, that assists with linear algebra, image optimization, Fourier transforms, and computational analytics.

##### Scikit-learn
Built on top of NumPy and SciPy, it has become the most popular Python ML library and has a wide array of supervised and unsupervised algorithms.

##### TensorFlow
Originally developed by Google for internal use and later released as open source, it is a popular machine learning library, especially for neural networks.

##### Keras
An open source library primarily used to build neural networks; it also includes features for analyzing images.

## Classification Example
As a follow-up to the model types discussed in the presentation, we will walk through how to create a classification model with a well-known data set on iris species. In total, there are three iris species that the flowers can be classified into: setosa, versicolor, and virginica.

The dataset we are importing has 150 samples, 3 possible labels, 4 features, and we will be using Scikit-learn for the model.

In [None]:
# 1. Load the data and separate it
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
# The variables of this dataset are sepal length, sepal width, petal length,
# petal width, and the iris class/species

iris

In [None]:
# Separate the labels from the features
import pandas as pd
x = iris.data
y = iris.target

# Put the data into a Pandas DataFrame to view it more easily
d = [{"sepal_length":row[0], 
      "sepal_width":row[1], 
      "petal_length":row[2], 
      "petal_width":row[3]} for row in x]
df = pd.DataFrame(d)
df["types"] = y

In [None]:
df.head(10)

### Note on test-train split
A typical test-train split assigns 70% of the dataset to the training group and 30% to the testing group. The training group is used to fit the model; the testing group is held back to evaluate it.
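
Conceptually, a split like this amounts to shuffling the rows and cutting the shuffled list in two. The following Standard Library sketch illustrates the idea on 150 stand-in row numbers. It is an illustration under those assumptions only, not a description of how scikit-learn implements its splitting:

```python
import random

samples = list(range(150))  # stand-ins for the 150 rows

rng = random.Random(0)      # fixed seed so the shuffle is repeatable
shuffled = samples[:]       # copy, so the original order is preserved
rng.shuffle(shuffled)

# First 30% of the shuffled rows go to testing, the rest to training
n_test = round(len(shuffled) * 0.3)
test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]

print(len(train_rows), len(test_rows))        # 105 45
print(set(train_rows).isdisjoint(test_rows))  # True
```

Shuffling first matters because datasets are often stored sorted by label. Cutting an unshuffled dataset could leave an entire class out of the training group.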

In [None]:
# Split the dataset into test and train data using the train_test_split function
# x_train has training features
# x_test has testing features
# y_train has training labels
# y_test has testing labels

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

### 2. Create the model

In [None]:
# There are many ways to build this classification model. 
# We are going to use a K Nearest Neighbor model.

from sklearn import neighbors
classifier = neighbors.KNeighborsClassifier()
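
To build some intuition for what a K Nearest Neighbor model does, here is a minimal pure-Python sketch: it labels a new point by majority vote among the `k` closest training points. The function name `knn_predict`, the two-feature training data, and the choice of `k=3` are all assumptions for illustration. This is not scikit-learn's implementation:

```python
from collections import Counter
from math import dist

# Hypothetical training data: (petal_length, petal_width) -> label
train = [
    ((1.4, 0.2), "setosa"),
    ((1.3, 0.2), "setosa"),
    ((1.5, 0.3), "setosa"),
    ((4.7, 1.4), "versicolor"),
    ((4.5, 1.5), "versicolor"),
    ((4.9, 1.5), "versicolor"),
]


def knn_predict(point, train, k=3):
    """Label `point` by majority vote among its k nearest training points."""
    # Sort training pairs by Euclidean distance to `point`, keep the k closest
    nearest = sorted(train, key=lambda pair: dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict((1.6, 0.25), train))  # setosa
print(knn_predict((4.6, 1.3), train))   # versicolor
```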

### 3. Train the model
So far, we've created the model but haven't put any data into it. Until we train the model, we will not be able to use it to predict which species of iris each flower belongs to.

In [None]:
# We will use the fit function in order to do this

classifier.fit(x_train,y_train)

### 4. Use the model to make predictions
Let's now use our trained model to make predictions on the test data set.

In [None]:
# We will use the predict function in order to do this

model_predictions = classifier.predict(x_test)

In [None]:
# Let's now use the accuracy_score function to evaluate our model

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, model_predictions))
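
Accuracy is simply the fraction of predictions that match the true labels. A by-hand sketch on a few made-up labels (not the iris results) shows the arithmetic:

```python
# Made-up labels for illustration only
y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

# True counts as 1 and False as 0, so sum() counts the matches
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(correct, accuracy)  # 6 0.75
```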

### Note: we could have used another classifier
We used a KNN model but we also could have used a Decision Tree. Here's the code for that.

In [None]:
from sklearn import tree
classifier_DT = tree.DecisionTreeClassifier()
classifier_DT.fit(x_train, y_train)
predictions_DT = classifier_DT.predict(x_test)

# Here is another way you could evaluate the model: a classification report
# (precision, recall, F1), which builds on the confusion matrix we discussed before

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions_DT,
      target_names = ["type0","type1","type2"]))
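
The precision and recall figures in such a report come from counting (actual, predicted) pairs. As a rough sketch on made-up labels (not the iris results), the counts can be tallied by hand with `collections.Counter`:

```python
from collections import Counter

# Made-up labels for illustration only
y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

# Count each (actual, predicted) pair; missing pairs count as 0
pairs = Counter(zip(y_true, y_pred))
labels = sorted(set(y_true))

# Rows are actual classes, columns are predicted classes
matrix = [[pairs[(actual, predicted)] for predicted in labels] for actual in labels]
for row in matrix:
    print(row)
# [2, 0, 0]
# [0, 2, 1]
# [0, 1, 2]

# Precision and recall for class 1, read straight off the matrix
precision_1 = matrix[1][1] / sum(row[1] for row in matrix)  # correct / all predicted as 1
recall_1 = matrix[1][1] / sum(matrix[1])                    # correct / all actually 1
```

Here both values work out to 2/3: class 1 was predicted three times (two correct) and occurred three times (two found).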