## Review

I have two dictionaries describing the annual cost of rent in NYC for the years 2005 and 2019. Each dictionary represents a separate year, and the keys of each dictionary are the following: `median`, `average`, and `us-average`.

How can I calculate the rate of change in average cost from 2005 to 2019 for NYC rent? How can I calculate the rate of change for the US average?

In [3]:
year_2005 = {"median": 1051, "average": 1129, "us-average": 910}
year_2019 = {"median": 1309, "average": 1390, "us-average": 1097}


18.642857142857142
13.357142857142858


## Solution

Below is the solution to this prompt.

We want to be able to apply the "slope formula" to the average cost of rent in 2019 and 2005. We accomplish this by accessing the average "value" of 2019 by writing out `year_2019["average"]`. The same applies to the average value of the year 2005: `year_2005["average"]`.

In [3]:
rate = (year_2019["average"] - year_2005["average"])/(2019 - 2005)
rate

18.642857142857142

We use the exact same line of code, except with the `us-average` key, to calculate the rate of change for the US average.

In [4]:
rate = (year_2019["us-average"] - year_2005["us-average"])/(2019 - 2005)
rate

13.357142857142858
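
Since both calculations have the same shape, they could be folded into one small helper. The sketch below is not part of the original solution: the function name `rate_of_change` and its parameters are made up for illustration, and the years default to the two used in this exercise.

```python
year_2005 = {"median": 1051, "average": 1129, "us-average": 910}
year_2019 = {"median": 1309, "average": 1390, "us-average": 1097}


def rate_of_change(start, end, key, start_year=2005, end_year=2019):
    """Average yearly change in `key` between two year dictionaries.

    Hypothetical helper: slope = (change in value) / (change in years).
    """
    return (end[key] - start[key]) / (end_year - start_year)


nyc_rate = rate_of_change(year_2005, year_2019, "average")
us_rate = rate_of_change(year_2005, year_2019, "us-average")
print(nyc_rate)
print(us_rate)
```

Passing the key as an argument means the same function also works for `median` without copying the formula a third time.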

## Type-Hinting Classes

If we want to type-hint a class that we created ourselves, we can either enclose the type name in a string, or import `annotations` from the “future” (`from __future__ import annotations`), as described in PEP 563:

https://peps.python.org/pep-0563/

In [None]:
# function attached to class “Fellow”
def in_common(self, other_fellow: 'Fellow'):
    new_list = []
    for person in self.familiars:
        if person in other_fellow.familiars:
            new_list.append(person)
    return new_list

In [None]:
from __future__ import annotations

# function attached to class “Fellow”
def in_common(self, other_fellow: Fellow):
    new_list = []
    for person in self.familiars:
        if person in other_fellow.familiars:
            new_list.append(person)
    return new_list
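
To see the string form at work inside an actual class body, here is a minimal runnable sketch. The `Fellow` class below is a stand-in: its constructor, the `name` attribute, and the sample people are assumptions made for this example, since the original class definition is not shown here.

```python
class Fellow:
    """Minimal stand-in for the Fellow class; attributes are assumed."""

    def __init__(self, name, familiars):
        self.name = name
        self.familiars = list(familiars)

    # "Fellow" is quoted because the class name is not bound yet
    # while the class body is still being executed.
    def in_common(self, other_fellow: "Fellow") -> list:
        new_list = []
        for person in self.familiars:
            if person in other_fellow.familiars:
                new_list.append(person)
        return new_list


ana = Fellow("Ana", ["Bo", "Cy", "Di"])
eli = Fellow("Eli", ["Cy", "Di", "Fay"])
shared = ana.in_common(eli)
print(shared)
```

Removing the quotes (without the `__future__` import) would typically raise a `NameError` when the method is defined on Python versions that evaluate annotations eagerly, because `Fellow` does not exist until the class statement finishes.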

## OOP in Data Engineering

We could potentially encapsulate data-engineering functionality (loading, cleaning, processing) inside of a class.

But make sure that this class is *effective* and *necessary*, which is where skills in complexity management come in.

It’s better to have a function that does ONE thing correctly, than a class that does a bunch of things half-...correctly.

Questions to ask yourself before engineering a pipeline object:

* Does this need to be a class?
  * Does this pipeline need “state” and “behavior”? Do I need to reuse this pipeline? Does this pipeline already exist?

* Does data need to be loaded constantly?
  * “Opening” the data could be resource-intensive.

* What input am I getting?
  * Data-type, format.

* What output am I generating?
  * Data-type, format.

* What guarantees am I given?
  * Is data already clean?

* What guarantees am I providing?
  * What will my pipeline do if data is dirty?
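
As an illustration of the "one function, one job" idea, here is a minimal sketch of a single cleaning step written as a plain function instead of a class. The name `drop_incomplete_rows` and the sample rows are hypothetical; the docstring spells out the input, the output, and the guarantee the function provides, mirroring the questions above.

```python
def drop_incomplete_rows(rows):
    """Return only the rows that have no missing values.

    Input:  a list of dicts mapping column name -> string value.
    Output: a new list; the input list is left unchanged.
    Guarantee: no returned row contains "" or None.
    """
    return [
        row for row in rows
        if all(value not in ("", None) for value in row.values())
    ]


# made-up rows, only for demonstration
raw = [
    {"day": "1", "price": "10.0"},
    {"day": "2", "price": ""},      # dirty: missing value
    {"day": "3", "price": "12.5"},
]
clean = drop_incomplete_rows(raw)
print(len(clean))
```

Because the function keeps no state, it can be tested on a handful of rows without touching any file.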

## Signature Practice

Write Python code that takes in a file name and appends a signature to the end of that file.

* Does this need to be a class?
  * No. There is no state of data that I am keeping track of.

* Does data need to be loaded constantly?
  * No. Loading once is good enough.

* What input am I getting?
  * A file name. Possibly a signature too.

* What output am I generating?
  * A newly created file with a signature.

* What guarantees am I given?
  * The file exists, possibly.

* What guarantees am I providing?
  * A signature will be attached to the file, if it exists.


In [9]:
def add_sig(filepath, sig):
    """A function to add on a signature to some file

    Parameters
    ----------
    filepath : str
        a string to the file we are modifying
    sig : str
        a string signature
    """
    with open(filepath, "a") as f:
        f.write(sig)

In [10]:
sig = """
PHONE: 555-555-5555
EMAIL: BOB@bob.com
"The best time to plant a tree was 
20 years ago, the second best time
is now." - Bob"""

add_sig("data/email.txt", sig)
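
One caveat: opening a file in `"a"` mode creates the file when it does not exist, so `add_sig` on its own does not enforce the "if it exists" guarantee. A minimal sketch of a stricter variant follows; the name `add_sig_if_exists` and the boolean return value are choices made for this example, not part of the original exercise.

```python
from pathlib import Path


def add_sig_if_exists(filepath, sig):
    """Append `sig` to the file only if it already exists.

    Returns True when the signature was written, False otherwise.
    """
    path = Path(filepath)
    if not path.is_file():
        # nothing to sign: do not create a new file by accident
        return False
    with path.open("a") as f:
        f.write(sig)
    return True
```

Calling `add_sig_if_exists("data/email.txt", sig)` would then leave the file system untouched whenever the path is wrong, and the return value tells the caller which case occurred.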

In [None]:
class Signature:
    """A file path paired with a signature that can be appended to that file.
    """
    def __init__(self, filepath, sig):
        self.filepath = filepath
        self.sig = sig
    
    def add_sig(self):
        with open(self.filepath, "a") as f:
            f.write(self.sig)
        return self

sig = """
PHONE: 555-555-5555
EMAIL: BOB@bob.com
"The best time to plant a tree was 
20 years ago, the second best time
is now." - Bob"""

test = Signature("data/email.txt", sig)
test.add_sig()

## Stock Cleaning Class

Write Python code that takes in a file containing stock data, cleans the data, and calculates the average of each column.

* Does this need to be a class?
  * We will be changing the state of this data, so yes.

* Does data need to be loaded constantly?
  * Only when cleaning. Afterwards it can be stored.

* What input am I getting?
  * A CSV file of floats, potentially missing data.

* What output am I generating?
  * Clean data, statistics.

* What guarantees am I given?
  * File will exist.

* What guarantees am I providing?
  * Data will be clean, and averages will be computed on columns.


In [12]:
import csv

class StockClean:
    def __init__(self, filepath):
        self.filepath = filepath
        self.data = []
    
    def clean(self):
        with open(self.filepath, "r") as infile:
            reader = csv.DictReader(infile)
            for r in reader:
                #print(r)
                continue


sc = StockClean("data/AMZN_err.csv")
sc.clean()
sc.data




In [11]:
import csv

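class StockClean:
    """Reads a CSV file of stock data, drops rows with missing values,
    and writes the clean rows to a new file.
    """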
    def __init__(self, file):
        # don't load data just yet!
        self.file = file
        # prepare a protected empty list to contain clean data
        self._clean_data = []

    def clean(self, outpath):
        # read in data and record only clean data
        new_data = []
        col_names = []
        with open(self.file, 'r') as infile:
            # read into DictReader
            reader = csv.DictReader(infile)
            for row in reader:
                # skip row if missing, otherwise append
                if "" in row.values():
                    continue
                new_data.append(row)
            # column names come from the header row, so this works
            # even when every data row is skipped or the file has no rows
            col_names = reader.fieldnames or []

        # write out data
        with open(outpath, 'w', newline='') as outfile:
            writer = csv.DictWriter(outfile, fieldnames=col_names)
            writer.writeheader()
            writer.writerows(new_data)
        
        # record in attribute
        self._clean_data = new_data



In [12]:
cleaner = StockClean("data/AMZN_err.csv")

# notice new name
cleaner.clean("data/AMZN_clean.csv")
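
The class above cleans and stores the data but does not yet compute the column averages that the prompt asks for. A minimal sketch of that step follows, written as a standalone function over the same list-of-dicts shape that `csv.DictReader` produces. The function name `column_averages` and the sample column names are assumptions, since the real file's columns are not shown here.

```python
def column_averages(rows):
    """Average every column whose values can all be read as floats.

    `rows` is a list of dicts as produced by csv.DictReader
    (column name -> string). Non-numeric columns are skipped.
    """
    if not rows:
        return {}
    averages = {}
    for col in rows[0].keys():
        try:
            values = [float(row[col]) for row in rows]
        except (TypeError, ValueError):
            # e.g. a date column: not numeric, so no average
            continue
        averages[col] = sum(values) / len(values)
    return averages


# made-up rows, only for demonstration
sample = [
    {"Date": "day-1", "Open": "10.0", "Close": "12.0"},
    {"Date": "day-2", "Open": "20.0", "Close": "14.0"},
]
print(column_averages(sample))
```

Inside the class, the same logic could be exposed as a method that reads from `self._clean_data` after `clean()` has run.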

## GitHub

Code is complex.

Untracked code is even more complex.

Code should be tracked:

* Why did this change happen?
* Who made this change?
* What did the code look like before?
* What are we trying to build?

Unshared code is even more complex (sometimes).

Development should not be a one-person task:

* We need fresh eyes.
* We need testers.
* We need domain experts.
* We need other developers.


How do we share & track code?

