# Ideas in Engineering

Welcome to Ideas in Engineering!  The goal of this notebook and the adjoining PowerPoint is to explain the principles of clean code in clear and simple terms.  We will be exploring some basic ideas in this tutorial:

* What is clean code?
* The ideology of writing a function
* Documentation - why and how
* An introduction to test driven development
* The power of classes
* Debugging with:
    * code.interact
    * IPython.embed
* How to and when to make a pull request
* The power of continuous integration
* The power of continuous deployment
* Automation - CI and CD together

## What is clean code?

Clean code is code that is clear and easily readable.  It doesn't just explain what's happening, it explains why it's happening.  The goal of code, especially in a language like Python, is to be clear and obvious.  We don't write Python because it's the fastest language in the world, but because it has the power to _optimize developer time_, which is usually a much more expensive resource than compute time.  That said, we should still performance tune our code once all the major functionality has been written.

Python in particular lends itself to being clean by default, because there should be one, and preferably only one, obvious way to do things.  This means most code blocks should be easily identifiable, no matter what the context.  So you should always be able to know _what_ the code does, but you may not necessarily understand _why_ the code does it.  The general rules of programming are as follows:

* each line should do at most one simple thing
* each function should be a single transformation or semantic action
* each class should be a combination of transformations and data which are semantically related
* ideally you should only have a few classes per file
* where possible you should reuse code

Let's see some examples of patterns of clean code:

In [4]:
def remove_whitespace(name: str) -> str:
    """
    Removes whitespace between words
    
    Parameters:
    * name - the string which may or may not
    have whitespace.
    
    Returns:
    A string without whitespace between characters.
    
    Examples:
    >>> remove_whitespace("Hello There")
    'HelloThere'
    >>> remove_whitespace("HelloThere")
    'HelloThere'
    """
    return "".join(name.split(" "))

def get_upper_case_indices(name: str) -> list:
    """
    Gets the indices of all upper case characters
    
    Parameters:
    * name - looks for uppercase 
    characters in this string
    
    Returns:
    A list of indices of the uppercase characters
    
    Examples:
    >>> get_upper_case_indices("HelloThere")
    [0, 5]
    >>> get_upper_case_indices("HelloThereFriends")
    [0, 5, 10]
    """
    upper_case_indices = []
    for index, letter in enumerate(name):
        if letter.isupper():
            upper_case_indices.append(index)
    return upper_case_indices

def get_lower_case_words(upper_case_indices: list, name: str) -> list:
    """
    Gets a list of the words, in lower case, split on uppercase
    characters
    
    Parameters:
    * upper_case_indices - a list of integers corresponding
    to upper case letters in the string
    * name - the string to split and process
    
    Returns:
    A list of words in lower case
    
    Examples:
    >>> get_lower_case_words([0, 5], "HelloThere")
    ['hello', 'there']
    >>> get_lower_case_words([0, 5, 10], "HelloThereFriends")
    ['hello', 'there', 'friends']
    """
    start = 0
    lower_case_words = []
    for index in upper_case_indices[1:]:
        lower_case_words.append(
            name[start:index].lower()
        )
        start = index
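    # Slice from `start` rather than the loop variable, which is unbound
    # when there are fewer than two upper case indices
    lower_case_words.append(
        name[start:].lower()
    )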
    return lower_case_words

def connect_words(lower_case_words: list) -> str:
    """
    Connects a list of words via a '_'
    
    Parameters:
    * lower_case_words - a list of lower case words
    
    Returns:
    A string of concatenated words, with '_' between
    each word.
    """
    return "_".join(lower_case_words)

def to_snake_case(name: str) -> str:
    """
    Takes a camel case string
    and makes it snake case

    Parameters:
    - name - the string to translate

    Returns:
    The snake cased string
    
    Example:
    >>> to_snake_case("HelloThere")
    'hello_there'
    >>> to_snake_case("hello_there")
    'hello_there'
    >>> to_snake_case("Hello There")
    'hello_there'
    """
    name = remove_whitespace(name)
    upper_case_indices = get_upper_case_indices(name)
    lower_case_words = get_lower_case_words(
        upper_case_indices, name
    )
    return connect_words(lower_case_words)

print(to_snake_case("HelloThereFriends"))
print(to_snake_case("Hello There Friends"))

hello_there_friends
hello_there_friends


We can take these methods and actually make a class, which offers a rich array of string processing functionality, for very little extra work:

In [1]:
class StringProcessor:
    """
    An object for processing strings.  The main methods of interest are:
    * to_camel_case
    * to_snake_case
    
    The preferred way to instantiate the class is as follows:
    >>> processor = StringProcessor()
    """
    def __init__(self):
        pass
    
    def remove_whitespace(self, name: str) -> str:
        """
        Removes whitespace between words

        Parameters:
        * name - the string which may or may not
        have whitespace.

        Returns:
        A string without whitespace between characters.

        Examples:
        >>> processor = StringProcessor()
        >>> processor.remove_whitespace("Hello There")
        'HelloThere'
        >>> processor.remove_whitespace("HelloThere")
        'HelloThere'
        """
        return "".join(name.split(" "))

    def get_upper_case_indices(self, name: str) -> list:
        """
        Gets the indices of all upper case characters

        Parameters:
        * name - looks for uppercase 
        characters in this string

        Returns:
        A list of indices of the uppercase characters

        Examples:
        >>> processor = StringProcessor()
        >>> processor.get_upper_case_indices("HelloThere")
        [0, 5]
        >>> processor.get_upper_case_indices("HelloThereFriends")
        [0, 5, 10]
        """
        upper_case_indices = []
        for index, letter in enumerate(name):
            if letter.isupper():
                upper_case_indices.append(index)
        return upper_case_indices

    def get_lower_case_words(self, upper_case_indices: list, name: str) -> list:
        """
        Gets a list of the words, in lower case, split on uppercase
        characters

        Parameters:
        * upper_case_indices - a list of integers corresponding
        to upper case letters in the string
        * name - the string to split and process

        Returns:
        A list of words in lower case

        Examples:
        >>> processor = StringProcessor()
        >>> processor.get_lower_case_words([0, 5], "HelloThere")
        ['hello', 'there']
        >>> processor.get_lower_case_words([0, 5, 10], "HelloThereFriends")
        ['hello', 'there', 'friends']
        """
        start = 0
        lower_case_words = []
        for index in upper_case_indices[1:]:
            lower_case_words.append(
                name[start:index].lower()
            )
            start = index
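        # Slice from `start` rather than the loop variable, which is unbound
        # when there are fewer than two upper case indices
        lower_case_words.append(
            name[start:].lower()
        )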
        return lower_case_words

    def connect_words(self, lower_case_words: list) -> str:
        """
        Connects a list of words via a '_'

        Parameters:
        * lower_case_words - a list of lower case words

        Returns:
        A string of concatenated words, with '_' between
        each word.
        
        Examples:
        >>> processor = StringProcessor()
        >>> processor.connect_words(['hello', 'there'])
        'hello_there'
        >>> processor.connect_words(['hello', 'there', 'friends'])
        'hello_there_friends'
        """
        return "_".join(lower_case_words)

    def to_snake_case(self, name: str) -> str:
        """
        Takes a camel case string
        and makes it snake case

        Parameters:
        - name - the string to translate

        Returns:
        The snake cased string

        Example:
        >>> processor = StringProcessor()
        >>> processor.to_snake_case("HelloThere")
        'hello_there'
        >>> processor.to_snake_case("hello_there")
        'hello_there'
        >>> processor.to_snake_case("Hello There")
        'hello_there'
        """
        name = self.remove_whitespace(name)
        upper_case_indices = self.get_upper_case_indices(name)
        if not upper_case_indices:
            return name
        lower_case_words = self.get_lower_case_words(
            upper_case_indices, name
        )
        return self.connect_words(lower_case_words)
    
    def split(self, name: str) -> list:
        """
        Split words on either "_" or " " 
        if present in name.
        
        Parameters:
        * name - the string to segment
        
        Returns:
        A tokenized list of words, separated
        by either "_" or whitespace
        
        Examples:
        >>> processor = StringProcessor()
        >>> processor.split("hello_there")
        ['hello', 'there']
        >>> processor.split("hello there")
        ['hello', 'there']
        >>> processor.split("hello there friends")
        ['hello', 'there', 'friends']
        """
        if "_" in name:
            return name.split("_")
        elif " " in name:
            return name.split(" ")
        else:
            return [name]
        
    def capitalize_words(self, words: list) -> list:
        """
        Takes in a list of words (strings) and
        capitalizes them.
        
        Parameters:
        * words - a list of words to capitalize
        
        Returns:
        A list of words that are capitalized.
        
        Examples:
        >>> processor = StringProcessor()
        >>> processor.capitalize_words(['hello', 'there'])
        ['Hello', 'There']
        >>> processor.capitalize_words(['hello', 'there', 'friends'])
        ['Hello', 'There', 'Friends']
        """
        capitalized_words = []
        for word in words:
            if word:
                capitalized_words.append(
                    word.capitalize()
                )
        return capitalized_words
    
    def to_camel_case(self, name: str) -> str:
        """
        Takes a string of words, either
        separated by "_" or whitespace and
        returns a camel cased string
        
        Parameters:
        * name - the string to camel case
        
        Returns:
        A camel cased string, with no whitespace
        
        Examples:
        >>> processor = StringProcessor()
        >>> processor.to_camel_case("hello there")
        'HelloThere'
        >>> processor.to_camel_case('hello_there')
        'HelloThere'
        """
        words = self.split(name)
        words = self.capitalize_words(words)
        return "".join(words)


There isn't much new here, except for a few new methods: `split`, `capitalize_words`, and `to_camel_case`!  Now we can call `to_camel_case` in addition to all the other functions (now called methods) we had before.  The nice thing about this is that we were able to carry functions such as `remove_whitespace` over from one example to the other.  This is the power of objects: we can group related functionality together.  This way the reader can better understand what's going on in general, and which methods are likely a good idea to call on related sets of objects.  Now let's take a closer look at documentation.

## Documentation - How and Why?

The why of a function is usually answered by documentation.  Documentation tells the story of why the code does what it does, and it provides a high level explanation of what the code is doing, so those who may not be familiar with the function can understand it at a glance.  More importantly, they don't have to read the actual code in the function to understand it.

Python comes with a very powerful built-in function called `help`, which prints the documentation attached to an object.  Let's look at an example right now!

For this example we'll be making use of numpy:

In [1]:
import numpy as np

help(np.mean)

Help on function mean in module numpy:

mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
    Compute the arithmetic mean along the specified axis.
    
    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.
    
    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.
    
        .. versionadded:: 1.7.0
    
        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean.  For integer inputs,

As you can see this documentation provides the following things:
* what the function does
* the expected parameters
* what the function returns
* an example of using the function
* nuance around the different function parameters
* notes and extra context for the function, motivating its use

Every function that's written should be this well documented, and this clear.  
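The same mechanism works for our own functions.  Below is a minimal sketch, using a hypothetical function called `shout` that is not part of the tutorial's code, showing that whatever we write in a docstring is what `help` displays and what `inspect.getdoc` returns:

```python
import inspect


def shout(text: str) -> str:
    """
    Upper cases a string and appends an exclamation mark.

    Parameters:
    * text - the string to shout

    Returns:
    The upper cased string followed by '!'

    Examples:
    >>> shout("hello")
    'HELLO!'
    """
    return text.upper() + "!"


# help() prints the signature followed by the docstring
help(shout)

# inspect.getdoc returns the docstring as a string, with indentation cleaned up
summary_line = inspect.getdoc(shout).splitlines()[0]
print(summary_line)
```

The first line of the docstring is what most readers see first, so it is worth making it a complete one-sentence summary.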

## An Introduction To Test Driven Development

The core of test driven development is about how to write code that is testable, and about writing the tests themselves.  The ultimate goal of test driven development is to make sure as few bugs as possible make it into production code, and that as few bugs as possible are introduced into an existing code base.

In theory, this process cuts down development time, because far less time should be spent fixing bugs.  Also, notionally, if bugs are found right after they are written, they are more likely to be fixable easily, because less code relies on the bugs in question.

How to write easily testable code:

* write many functions, because it's easy to test individual functions 
    * this informs _why_ we want our functions to do one and only one thing. 
    * also why each line should do as little as possible.  
    * As a result it's easy to test each function, because we only need to consider a few cases.  
    * And if there is a bug, it will be easy to find, because we can likely isolate the line that causes the issue. 
    * If there is a lot of logic per line, it can be _very_ hard to find the bug, because we might need to parse it apart into multiple logical pieces, just to figure out what's wrong.

* write unit tests
    * unit tests test the individual functions (or methods) in the code to ensure they do what's expected
    * usually you should only need between 1-4 tests to make sure a given function works

* write integration tests
    * integration tests combine multiple functions together, to make sure they all work in concert to produce a desired result
    * usually you only need 1-2 integration tests per set of functions. 
    * it's typically a good idea for integration tests to combine transformations across multiple files, to make sure they all work together.  
    * practically, this means you'd like to test multiple "main" functions that rely on one another with some dependency.
    * in this way, we are able to ensure that our code dependencies don't introduce bugs.
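To make the difference between the two kinds of test concrete, here is a small sketch.  The functions `strip_punctuation`, `count_words`, and `count_clean_words` are hypothetical and exist only for this illustration; the tests are plain functions with `assert` statements, in the style `pytest` collects.

```python
def strip_punctuation(text: str) -> str:
    """Removes a small, fixed set of punctuation characters."""
    for mark in ".,!?":
        text = text.replace(mark, "")
    return text


def count_words(text: str) -> int:
    """Counts whitespace-separated words."""
    return len(text.split())


def count_clean_words(text: str) -> int:
    """Combines the two transformations above."""
    return count_words(strip_punctuation(text))


# Unit tests: one function at a time, a handful of cases each
def test_strip_punctuation():
    assert strip_punctuation("Hello, there!") == "Hello there"
    assert strip_punctuation("Hello there") == "Hello there"


def test_count_words():
    assert count_words("Hello there friends") == 3
    assert count_words("") == 0


# Integration test: the functions working in concert.
# Without the punctuation step, the stray "," and "!" would count as words.
def test_count_clean_words():
    assert count_clean_words("Hello , there !") == 2
```

Each unit test only has to consider a few cases because each function does one thing; the integration test checks the one property that only shows up when the pieces are combined.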

### A practical look at test driven development

Now that we've discussed the motivation of testing, let's look at an example:

In [None]:
from string_processor import StringProcessor

def test_remove_whitespace():
    processor = StringProcessor()
    assert 'HelloThere' == processor.remove_whitespace("Hello There")
    assert 'HelloThere' == processor.remove_whitespace("HelloThere")

def test_get_upper_case_indices():
    processor = StringProcessor()
    assert [0, 5] == processor.get_upper_case_indices("HelloThere")
    assert [0, 5, 10] == processor.get_upper_case_indices("HelloThereFriends")

def test_get_lower_case_words():
    processor = StringProcessor()
    assert ['hello', 'there'] == processor.get_lower_case_words([0, 5], "HelloThere")
    assert ['hello', 'there', 'friends'] == processor.get_lower_case_words([0, 5, 10], "HelloThereFriends")

def test_connect_words():
    processor = StringProcessor()
    assert 'hello_there' == processor.connect_words(['hello', 'there'])
    assert 'hello_there_friends' == processor.connect_words(['hello', 'there', 'friends'])
    
def test_to_snake_case():
    processor = StringProcessor()
    assert 'hello_there' == processor.to_snake_case("HelloThere")
    assert 'hello_there' == processor.to_snake_case("hello_there")
    assert 'hello_there' == processor.to_snake_case("Hello There")

def test_split():
    processor = StringProcessor()
    assert ['hello', 'there'] == processor.split("hello_there")
    assert ['hello', 'there'] == processor.split("hello there")
    assert ['hello', 'there', 'friends'] == processor.split("hello there friends")
    
def test_capitalize_words():
    processor = StringProcessor()
    assert ['Hello', 'There'] == processor.capitalize_words(['hello', 'there'])
    assert ['Hello', 'There', 'Friends'] == processor.capitalize_words(['hello', 'there', 'friends'])
    
def test_to_camel_case():
    processor = StringProcessor()
    assert 'HelloThere' == processor.to_camel_case("hello there")
    assert 'HelloThere' == processor.to_camel_case('hello_there')


Above is the test file for the `StringProcessor` object we made earlier.  Notice that the tests mirror the examples that we created in our docstrings, and for good reason: a well designed test is intended to be a representative example for the code to execute.  If the examples are well chosen, the tests should encapsulate the intended functionality of the code.  Additionally, tests are a good place to look for hints about how the code should run, in the absence of examples in your docstrings.

We can run the tests with the following command:

`pytest [Test file name]`

In this example it would be:

`pytest test_string_processor.py`

And it should tell us whether the tests pass or not.
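Since the tests mirror the docstring examples, the examples themselves can also be executed.  The sketch below uses the standard library's `doctest` module on a standalone copy of `connect_words`; running `python -m doctest string_processor.py` should do the same for a whole file, assuming the class lives in a file with that name.

```python
import doctest


def connect_words(lower_case_words: list) -> str:
    """
    Connects a list of words via a '_'

    Examples:
    >>> connect_words(['hello', 'there'])
    'hello_there'
    >>> connect_words(['hello', 'there', 'friends'])
    'hello_there_friends'
    """
    return "_".join(lower_case_words)


# Runs the >>> lines in the docstring and reports any mismatch between
# the actual result and the line written underneath each example.
# Nothing is printed when every example passes.
doctest.run_docstring_examples(
    connect_words,
    {"connect_words": connect_words},  # names the examples are allowed to use
    name="connect_words",
)
```

This is also why the expected results in a docstring should be written exactly as the interpreter would print them, for example with single quotes around strings.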

Next let's talk about `mypy`.  Python lets you add type annotations to your functions, and the `mypy` static type checker verifies that your code respects them.  This way, you can ensure that your code is being used as expected, starting from your tests.  Additionally, these annotations make it obvious what types to pass to each of your function parameters.  This adds readability essentially for free.

Let's look at an example of a function annotated with static types that `mypy` can check:

In [12]:
def factorial(n: int) -> int:
    if n == 0:
        return 1
    else:
        return factorial(n-1)*n
    
factorial(5)

120

As you can see, this code tells us what type of parameter to pass to the code, and what the expected return type is.  If someone is interested in the type annotations of the function we can get this easily with:

In [15]:
def get_annotations(func):
    return func.__annotations__

get_annotations(factorial)

{'n': int, 'return': int}

In general the definition for how to annotate functions looks like the following:

```
def func(arg: arg_type, optarg: arg_type = default) -> return_type:
    ...
```
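As a concrete instance of that template, here is a sketch with a hypothetical helper called `repeat_word`, which has one required argument and one optional argument with a default:

```python
from typing import get_type_hints


def repeat_word(word: str, times: int = 2) -> str:
    """Hypothetical helper: repeats a word, separated by spaces."""
    return " ".join([word] * times)


print(repeat_word("hello"))      # uses the default for `times`
print(repeat_word("hello", 3))

# The annotations are available at runtime, but Python itself does not enforce them
print(get_type_hints(repeat_word))

# A static checker such as mypy is expected to flag a call like the one below,
# because a str is passed where an int is annotated:
# repeat_word("hello", "3")
```

The commented-out call is left disabled on purpose: it would also fail at runtime, but the point of static checking is to catch it before the code ever runs.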

We can even test to see if our static typing works on our test file with the following:

`mypy [test file]`

Let's look at our specific example:

`mypy test_string_processor.py`

Additionally we can run mypy on _any_ file, so we can even do this on the base file:

`mypy string_processor.py`

However, in this case nothing is likely to get flagged, because we are just defining a class, not using it.  It's best to call `mypy` anywhere that code is executed.

Type annotation reference:

* https://realpython.com/python-type-checking/

## The Power Of Classes

The next thing to discuss is the power of classes!

We've already seen classes a little bit, but what we haven't done is talk through their real power:

* inheritance
* composition

_def_ inheritance := Inheritance is the ability to inherit the methods and data of a parent class into a child class.  With inheritance, we get code reuse for very little effort.

Caution - there is a danger with doing more than two levels of inheritance!  If you do so, you can end up in what's called callback hell here (related to what is often called the fragile base class problem), where a bug introduced in an ancestor class propagates all the way down to the descendant class.  For this reason, no more than one level of inheritance is recommended, and no more than 2 levels should _ever_ be used.

_def_ composition := Composition is the notion of instantiating a class object in another class, so that you can make use of its methods and data without having to expose its methods publicly or rewrite them.
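A minimal sketch of the two definitions side by side.  The classes `Counter`, `LoudCounter`, and `Inventory` are hypothetical and only serve to show the shape of each approach:

```python
class Counter:
    """Hypothetical parent class: counts how many items it has seen."""

    def __init__(self) -> None:
        self.count = 0

    def add(self, item: str) -> None:
        self.count += 1


class LoudCounter(Counter):
    """Inheritance: a LoudCounter *is a* Counter and reuses `add` and `count`."""

    def report(self) -> str:
        return f"SEEN {self.count} ITEMS"


class Inventory:
    """Composition: an Inventory *has a* Counter and exposes only what it needs."""

    def __init__(self) -> None:
        self._counter = Counter()

    def stock(self, item: str) -> int:
        self._counter.add(item)
        return self._counter.count


loud = LoudCounter()
loud.add("apple")          # inherited from Counter
print(loud.report())

inventory = Inventory()
print(inventory.stock("apple"))
print(hasattr(inventory, "add"))  # False: Counter's methods are not exposed
```

With inheritance every public method of the parent becomes part of the child's interface; with composition the outer class decides what to expose.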

Let's get started!

### Time to Inherit Some Things

In [None]:
class StringTokenizer(StringProcessor):
    def __init__(self):
        pass
    
    def tokenize(self, name: str, split_on: str = '') -> list:
        r"""
        'Tokenize' the string, meaning
        return a list of semantically viable symbols
        or words.  
        
        Parameters:
        * name - the string to tokenize
        * split_on - the separator to split on; when empty,
        the string is split on any whitespace
        
        Returns:
        A list of tokens (usually words) with all excess
        white space removed
        
        Examples:
        >>> string_tokenizer = StringTokenizer()
        >>> string_tokenizer.tokenize("Hello There My Friends")
        ['Hello', 'There', 'My', 'Friends']
        >>> string_tokenizer.tokenize("Hello-There-My-Friends", split_on='-')
        ['Hello', 'There', 'My', 'Friends']
        >>> string_tokenizer.tokenize("Hello  There \nMy \tFriends")
        ['Hello', 'There', 'My', 'Friends']
        >>> string_tokenizer.tokenize("Hello_There_My_Friends", split_on='_')
        ['Hello', 'There', 'My', 'Friends']
        """
        tokens = self.split(name, split_on=split_on)
        return [self.clean_endings(token) for token in tokens]
    
    def clean_endings(self, word: str) -> str:
        r"""
        Strips whitespace off both ends of a word
        
        Parameters:
        * word - the string to clean
        
        Returns:
        A string without excess space
        
        Examples:
        >>> string_tokenizer = StringTokenizer()
        >>> string_tokenizer.clean_endings("  Hello ")
        'Hello'
        >>> string_tokenizer.clean_endings("  \nHello\t ")
        'Hello'
        """
        return word.strip()
        
    def split(self, name: str, split_on: str = '') -> list:
        """
        Splits a string into a list based on some typical
        cases.
        
        Parameters:
        * name - the string to split
        * split_on - the separator to split on; when empty,
        the string is split on any whitespace
        
        Returns:
        A list of strings, where each sub string
        is between the split character found.
        
        Examples:
        >>> string_tokenizer = StringTokenizer()
        >>> string_tokenizer.split("Hello there friends")
        ['Hello', 'there', 'friends']
        >>> string_tokenizer.split('Hello-there-friends', split_on='-')
        ['Hello', 'there', 'friends']
        """
        if split_on != '':
            return name.split(split_on)
        else:
            return name.split()

As you can see at the top of the class definition we do inheritance as follows:

`class [Class Name]([Parent Class Name]):`

For our example that's:

`class StringTokenizer(StringProcessor)`

Now the `StringTokenizer` has all the methods of the `StringProcessor`!  

Let's talk about the goal of this class, to make our tokenizing better.  

First we'll need to override the `split` method.  To do so we only need to give the new method the same name as the one in the parent class.

Then we define 2 _totally_ new methods:

* clean_endings
* tokenize

This highlights what inheritance is good for:

* establishing namespaces, so method names can be reused in different contexts.  This is because sometimes naming things is hard.  By giving the extra context of an object name, you can use the same method name multiple times!
* allowing for smaller classes overall - so you end up writing less code.  The less code you write, the less you have to maintain.
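One consequence of overriding is worth seeing in isolation: inherited methods that call `self.split` pick up the child's version.  The sketch below uses two hypothetical classes, `BaseSplitter` and `WhitespaceSplitter`, rather than the tutorial's own classes, to keep it short:

```python
class BaseSplitter:
    """Hypothetical parent class."""

    def split(self, text: str) -> list:
        # Parent behaviour: split on underscores only
        return text.split("_")

    def count_parts(self, text: str) -> int:
        # Calls whichever `split` belongs to the object's actual class
        return len(self.split(text))


class WhitespaceSplitter(BaseSplitter):
    """Child class that overrides `split` and inherits `count_parts`."""

    def split(self, text: str) -> list:
        # Same method name, different behaviour: split on any whitespace
        return text.split()


base = BaseSplitter()
child = WhitespaceSplitter()
print(base.count_parts("hello there friends"))   # 1
print(child.count_parts("hello there friends"))  # 3
```

This is convenient, but it also means an override can quietly change the behaviour of inherited methods that depend on it, so those methods deserve tests on the child class as well.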

### Time To Compose Some Things

In [2]:
import spacy

class NLPEngine:
    def __init__(self):
        self.spacy_nlp_sm = spacy.load("en_core_web_sm")
        self.spacy_nlp_md = None
        
    def get_tokens(self, doc: str) -> list:
        """
        Gets the tokens in the document
        
        Parameters:
        * doc - a string to process
        
        Returns:
        A list of tokens parsed from the text
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.get_tokens('Hello there friends')
        ['Hello', 'there', 'friends']
        """
        doc = self.spacy_nlp_sm(doc)
        return [token.text for token in doc]
    
    def get_pos(self, doc: str) -> list:
        """
        Gets the part of speech from 
        the tokens in the document.
        
        Parameters:
        * doc - a string to process
        
        Returns:
        A list of tuples of the form (token, part of speech)
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.get_pos('My dog spot runs fast!')
        [('My', 'DET'), ('dog', 'NOUN'), ('spot', 'NOUN'), ('runs', 'VERB'), ('fast', 'ADV'), ('!', 'PUNCT')]
        """
        doc = self.spacy_nlp_sm(doc)
        return [(token.text, token.pos_) for token in doc]
    
    def get_token_shape(self, doc: str) -> list:
        """
        Get the shape of all tokens in the document.
        
        The shape is the order of upper case and lower case
        letters per word.
        
        Parameters:
        * doc - the string to process
        
        Returns:
        A list of tuples of the form (token, shape)
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.get_token_shape('My dog spot runs fast')
        [('My', 'Xx'), ('dog', 'xxx'), ('spot', 'xxxx'), ('runs', 'xxxx'), ('fast', 'xxxx')]
        """
        doc = self.spacy_nlp_sm(doc)
        return [(token.text, token.shape_) for token in doc]
    
    def get_token_tags(self, doc: str) -> list:
        """
        Get the tags for each token.
        
        Parameters:
        * doc - the string to process
        
        Returns:
        A list of tuples of the form (token, tag)
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.get_token_tags('My dog spot runs fast')
        [('My', 'PRP$'), ('dog', 'NN'), ('spot', 'NN'), ('runs', 'VBZ'), ('fast', 'RB')]
        
        As you can see, this returns possessive pronoun, noun, noun, verb, adverb
        
        For a full list of tags see:
        https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
        """
        doc = self.spacy_nlp_sm(doc)
        return [(token.text, token.tag_) for token in doc]
    
    def get_token_lemmas(self, doc: str) -> list:
        """
        Get the lemmas for each token.
        A lemma is a normalized version of the token
        
        Parameters:
        * doc - the string to process
        
        Returns:
        A list of tuples of the form (token, lemma)
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.get_token_lemmas("My dog spot runs very fast.  He is the fastest doggy in the world")
        [('My', '-PRON-'), ('dog', 'dog'), ('spot', 'spot'), ('runs', 'run'), 
        ('very', 'very'), ('fast', 'fast'), ('.', '.'), (' ', ' '), 
        ('He', '-PRON-'), ('is', 'be'), ('the', 'the'), ('fastest', 'fast'), 
        ('doggy', 'doggy'), ('in', 'in'), ('the', 'the'), ('world', 'world')]
        
        The important thing here is the difference between 'fastest' and 'fast'.  It illustrates how
        lemmas work - a lemma is sort of like the base word, and something like 'fastest' is the word modified
        to make sense grammatically.  Many nlp systems don't need that grammatical detail, so we can drop it
        and treat lemmatization as a normalization of the word that throws out grammatical transforms.
        """
        doc = self.spacy_nlp_sm(doc)
        return [(token.text, token.lemma_) for token in doc]
    
    def get_token_ner_labels(self, doc: str) -> list:
        """
        Gets the named entity recognition labels for each entity found
        
        Parameters:
        * doc - the string to process
        
        Returns:
        A list of tuples of the form (entity text, ner_label)
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.get_token_ner_labels('I predict Google is going to buy Microsoft for one dollar')
        [('Google', 'ORG'), ('Microsoft', 'ORG'), ('one dollar', 'MONEY')]
        
        As you can see, the ner tagger knows that google and microsoft are organizations
        and that one dollar is money!
        """
        doc = self.spacy_nlp_sm(doc)
        return [(ent.text, ent.label_) for ent in doc.ents]
    
    def load_medium_language_model(self):
        """
        loads the medium language model into the
        spacy_nlp_md attribute which is initially set to
        None.
        
        Parameters:
        * None
        
        Returns:
        Nothing
        """
        if not self.spacy_nlp_md:
            self.spacy_nlp_md = spacy.load("en_core_web_md")
        
    def how_similar(self, word_one: str, word_two: str) -> float:
        """
        A measure of similarity for two words, based on
        their word vectors.  The closer to 1.0 you get,
        the more similar the two words are.
        
        By default spaCy estimates similarity as the cosine
        similarity of the two vectors, roughly:
        
        import math
        def cosine_similarity(first, second):
            dot_product = sum(a * b for a, b in zip(first, second))
            first_norm = math.sqrt(sum(a * a for a in first))
            second_norm = math.sqrt(sum(b * b for b in second))
            return dot_product / (first_norm * second_norm)
        
        Note: A word vector is a compact numeric 
        representation of a word.  It encodes
        features about a word in an R^n space.
        
        Parameters:
        * word_one - the first word, as a string
        * word_two - the second word, as a string
        
        Returns:
        The similarity score between the two words
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.how_similar('hamburger', 'hotdog')
        """
        self.load_medium_language_model()
        word_one = self.spacy_nlp_md(word_one)
        word_two = self.spacy_nlp_md(word_two)
        return word_one.similarity(word_two)
    
    def get_string_mode(self, tokens: list) -> str:
        """
        Gets the most frequently occurring string,
        known as the mode.
        
        Parameters:
        * tokens - a list of strings
        
        Returns:
        The most frequently occurring string in the
        list of strings.
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.get_string_mode(["en", "en", "fr"])
        'en'
        """
        unique_tokens = set(tokens)
        token_count = {}
        for token in unique_tokens:
            token_count[token] = tokens.count(token)
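        # Pick the key with the highest count; plain max(token_count)
        # would compare the strings themselves
        return max(token_count, key=token_count.get)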
    
    def language_detection(self, doc: str) -> str:
        """
        Returns the most likely language based on
        the language assigned to the most tokens.
        
        Because some words exist in multiple languages
        (English, French, and German overlap often, as
        do other languages), a single word can be valid
        in more than one language.
        
        Therefore we take the language seen most often
        across all tokens.  The likelihood of a tie should
        be very low, unless the text mixes multiple
        languages, in which case this method is inappropriate.
        
        Parameters:
        * doc - the text to process
        
        Returns:
        The most likely language used in the text
        
        Examples:
        >>> nlp = NLPEngine()
        >>> nlp.language_detection('Hello there friends')
        'en'
        """
        doc = self.spacy_nlp_sm(doc)
        langs = [token.lang_ for token in doc]
        return self.get_string_mode(langs)
    

The key power of composition is the ability to include other classes in your current class.  Because you only go one level deep in your dependency structure, you are guaranteed to avoid callback hell, which is a very useful feature.  

We need composition here for two reasons:

1. We don't want to reimplement all these methods, which are super useful
2. We need more than one language model, but only sometimes!  The medium language model is very big, and takes a while to load.  So we shouldn't load it unless we need a task that's actually going to use it, like getting the similarity between 2 words.
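
The second point can be sketched in a few lines.  This is a minimal illustration, not the notebook's actual engine: `SmallModel`, `LargeModel`, `Engine` and the toy `shared_letters` score are all hypothetical stand-ins for the real spaCy models.  The composing class holds both models, but only builds the expensive one the first time a method needs it.

```python
class SmallModel:
    """Hypothetical stand-in for a model that is cheap to load."""

    def tokenize(self, text: str) -> list:
        return text.split()


class LargeModel:
    """Hypothetical stand-in for a model that is expensive to load."""

    load_count = 0  # counts how many times the model was built

    def __init__(self):
        LargeModel.load_count += 1

    def shared_letters(self, first: str, second: str) -> int:
        # Toy "similarity": number of distinct letters both words share
        return len(set(first) & set(second))


class Engine:
    """Composes both models; the large one is loaded lazily."""

    def __init__(self):
        self.small = SmallModel()
        self.large = None  # not loaded until a method needs it

    def load_large_model(self):
        # Only build the large model once, on first use
        if self.large is None:
            self.large = LargeModel()

    def tokenize(self, text: str) -> list:
        return self.small.tokenize(text)

    def how_similar(self, first: str, second: str) -> int:
        self.load_large_model()
        return self.large.shared_letters(first, second)


engine = Engine()
tokens = engine.tokenize("hello there friends")    # large model still unloaded
score = engine.how_similar("hamburger", "hotdog")  # large model loaded here
```

Callers who only ever tokenize never pay the cost of the large model, and repeated calls to `how_similar` reuse the instance that was already loaded.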

## Debugging Your Code!

If possible, you should always use an IDE like PyCharm to debug your code.  VSCode is fine too.  But if you have some code on a server somewhere, or you really don't need all the extra setup, you can go the minimal route with the following methods:

* code.interact()
* IPython.embed()

These more or less do the same thing, but call different REPLs.  Let's look at a few examples:

In [3]:
def clean_number(number):
    return 0

def broken_function(number_one, number_two):
    number_two = clean_number(number_two)
    return number_one / number_two

broken_function(7, 19)

ZeroDivisionError: division by zero

As you can see, the error happens on line 6.  But does that mean that's where the bug is happening?  We should debug and explore to make sure!

In [None]:
import code

def clean_number(number):
    return 0

def broken_function(number_one, number_two):
    number_two = clean_number(number_two)
    code.interact(local=locals())
    return number_one / number_two

broken_function(7, 19)

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)


In : dir()


['__builtins__', 'number_one', 'number_two']

In : number_one


7

In : number_two


0

Ah ha!  We found the problem!  It looks like `clean_number` is setting `number_two` to `0`!  So all we should do is get rid of that call and then we are good!

In [1]:
import code

def clean_number(number):
    return 0

def broken_function(number_one, number_two):
    #number_two = clean_number(number_two)
    #code.interact(local=locals())
    return number_one / number_two

broken_function(7, 19)

0.3684210526315789

Now we get the right result!  

Yes, this was a _very_ contrived example.  But most debugging honestly comes down to hitting an error and then gathering enough context to figure out where it comes from.  Step back through the code until you find the source of the bug, then update the code to do what you expect.  

So `code.interact` is pretty nifty - it launches a vanilla Python REPL.  And you can specify what scope you want.

Here we asked for the local scope with `code.interact(local=locals())`.  This is great for figuring out what's happening inside a function or method.  We could have also asked for `code.interact(local=globals())`, which would have given us everything in global scope.  This is useful if we are working in global scope.  
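
If you want both scopes at once, one option is to merge the two dictionaries before handing them to the REPL.  The sketch below is an illustration under stated assumptions: `GLOBAL_SETTING`, `build_debug_namespace` and `inspect_division` are hypothetical names, and the `code.interact` call is left commented out so the cell runs without blocking.  It uses `code.InteractiveConsole.push` to run a single line in the merged namespace instead.

```python
import code

GLOBAL_SETTING = "production"  # hypothetical module-level variable


def build_debug_namespace(local_scope: dict) -> dict:
    """Merge module globals with a function's locals (locals win on clashes)."""
    return {**globals(), **local_scope}


def inspect_division(number_one, number_two):
    namespace = build_debug_namespace(locals())
    # While debugging, uncomment to open a REPL that sees both scopes:
    # code.interact(local=namespace)
    return namespace


namespace = inspect_division(7, 19)

# Run one line in that namespace without opening an interactive session
console = code.InteractiveConsole(locals=namespace)
console.push("result = number_one / number_two")
```

Anything assigned inside the console, like `result` here, ends up in the `namespace` dictionary, so you can inspect it after the session ends.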

In my opinion, the REPL that comes with Python is usually good enough, but sometimes you need to debug something _complex_.  In that case, I'd recommend the IPython REPL, which is a real upgrade and has a lot of extra nice-to-haves.  

Let's look at the syntax:

In [4]:
import IPython

def whatever():
    return "hi"

def call_ipython_repl():
    local_var_one = "Hello there"
    IPython.embed()
    
call_ipython_repl()

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: dir()
Out[1]: 
['In',
 'Out',
 '_dh',
 '_i',
 '_i1',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'exit',
 'get_ipython',
 'local_var_one',
 'quit']

In [2]: local_var_one
Out[2]: 'Hello there'

In [3]: whatever()
Out[3]: 'hi'

In [4]: exit()



## Making a PR

There are a few simple rules that I like to follow for making PRs:

1. Commit as little code as possible per PR; typically one function is all you should commit.  This puts less strain on the reviewer and makes it easier for them to understand what you are doing.  It also means the reviewer is more likely to read every line and therefore catch any bugs the first time.  If you make a reviewer review a lot of code at once, it is far more likely they will miss errors or bugs.  The goal of a PR review is to catch bugs that a linter or tests may miss.

2. Use conventional commits - a conventional commit is of the following form:

`git commit -m "
type: feature
files added: whatever.py
description: adding whatever.py that implements skynet.  This code will kill all humans, have fun!"
`

This is my convention for committing and has the following properties:

* type - I tend to use the following types: feature, refactor
* files added - what files were added with this commit, it's okay if the number is none, but specifying when you added a file is extremely helpful for rebasing and reflog.
* description - what you changed and possibly why - giving motivation for the commit makes rebasing much, much easier.  It also means the reviewer has a high level understanding of what you did, and possibly why, before reading the code.

Conventional commits are extremely helpful and aid the review process greatly.  You don't need to use my conventions, but please use a convention that describes what you are doing.  It will save everyone time, and it makes reviews easier on the reader, meaning your code will get reviewed more often.
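
If you want to keep messages consistent, the convention can be captured in a small helper.  This is only a sketch of the format described above: `build_commit_message` is a hypothetical function, and the allowed types are just the two I listed.

```python
def build_commit_message(commit_type: str, files_added: list, description: str) -> str:
    """Format a commit message using the type / files added / description convention."""
    allowed_types = {"feature", "refactor"}  # assumption: only the two types listed above
    if commit_type not in allowed_types:
        raise ValueError(f"unknown commit type: {commit_type}")
    # It's okay to add no files; say so explicitly
    files = ", ".join(files_added) if files_added else "none"
    return (
        f"type: {commit_type}\n"
        f"files added: {files}\n"
        f"description: {description}"
    )


message = build_commit_message(
    "feature",
    ["whatever.py"],
    "adding whatever.py that implements skynet.",
)
print(message)
```

The resulting string can be pasted into `git commit -m`, or written to a file and passed with `git commit -F`.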

Another reason conventional commits are so powerful is for `git bisect`.  This tool helps you track down bugs in the code base by binary searching through the commit history: you mark one commit as known good and one as known bad, and it checks out commits in between for you to test until it finds the commit that introduced the bug.  If your commit messages are clear, straightforward and descriptive, then it can be very, very easy to understand what that commit changed and to revert code to a working state before the bug was introduced.  Some bugs will get through, that's just the way code works.  But with `git bisect` and conventional commits, it's easy to tell where and when it happened.

Intro to git bisect: https://www.metaltoad.com/blog/beginners-guide-git-bisect-process-elimination
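
The search behind `git bisect` can be sketched in a few lines.  This is a simplified illustration, not git's actual implementation: it assumes a linear history ordered oldest to newest, where the first commit is known good and the last is known bad.  `find_first_bad_commit` and `is_bad` are hypothetical names, and the commit ids are made up.

```python
def find_first_bad_commit(commits: list, is_bad) -> str:
    """
    Binary search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad,
    and every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(commits[middle]):
            bad = middle    # the bug was introduced at or before middle
        else:
            good = middle   # the bug was introduced after middle
    return commits[bad]


# Made-up history in which the bug was introduced in commit "d4"
history = ["a1", "b2", "c3", "d4", "e5", "f6"]


def is_bad(commit: str) -> bool:
    # Stand-in for "check out this commit and run the tests"
    return history.index(commit) >= history.index("d4")


culprit = find_first_bad_commit(history, is_bad)
```

Because the range halves on every step, even a history of a thousand commits needs only about ten checks.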

## Continuous Integration

Continuous Integration or CI runs tests against the code, to ensure no code is merged into the master (or main) branch without passing all tests.  It also makes things easier on reviewers, because a CI pipeline will usually tell you which tests failed.  If a test fails then you will know immediately that the code isn't ready for production and then the writer can go fix what's broken.  No one writes perfect code, but with a CI pipeline, you can write code with as few bugs as possible.

Let's look at an example CircleCI pipeline:


```
# Python CircleCI 2.0 configuration file
#
# Check https://circleci.com/docs/2.0/language-python/ for more details
#
version: 2
jobs:
  build:
    docker:
      # specify the version you desire here
      # use `-browsers` suffix for selenium tests, e.g. `3.6.1-browsers`
      - image: circleci/python:3.6.1

    working_directory: ~/repo

    steps:
      - checkout
      - run:
          name: install dependencies
          command: |
            python3 -m venv venv
            . venv/bin/activate
            pip install -r requirements.txt

      # install the test runner into the venv created above
      - run:
          name: install pytest
          command: |
            . venv/bin/activate
            pip install pytest

      # run tests!
      - run:
          name: run tests
          command: |
            . venv/bin/activate
            python -m pytest tests
```

This is a fairly basic CI pipeline.  It does the following things (in order):

1. starts a Docker container from an image with Python 3.6.1 installed - image: circleci/python:3.6.1
2. installs dependencies
3. installs pytest
4. runs the tests

All you need to do is save this file as `config.yml` in a `.circleci` folder at the root directory of your GitHub repo, then sign into CircleCI and add the project.  Then all the tests should run!
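
For the `run tests` step to have something to run, the `tests` folder needs at least one test file.  Here is a minimal sketch, assuming a hypothetical file named `tests/test_cleaning.py`.  The `remove_whitespace` function is a stand-in defined inline so the example is self-contained; in a real project you would import it from your own package instead.

```python
# tests/test_cleaning.py (hypothetical file name)

def remove_whitespace(name: str) -> str:
    """Stand-in implementation: strips all whitespace characters."""
    return "".join(name.split())


def test_remove_whitespace_between_words():
    assert remove_whitespace("hello there") == "hellothere"


def test_remove_whitespace_leaves_clean_input_alone():
    assert remove_whitespace("hello") == "hello"
```

pytest collects files named `test_*.py` and functions named `test_*`, so `python -m pytest tests` picks these up without any extra configuration.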

## Continuous Deployment

The idea behind continuous deployment is related to, but separate from, continuous integration.  In CD your goal is to package up your code and send it somewhere, perhaps to run as a web service or to execute a job.  The important notion here is reproducibility.  By having a standard process for deploying your code (whatever that means), you can ensure that the way it was run will be the same every time.  This way if an error does occur during deployment, you'll be able to figure out what may have caused the error, because the steps for deployment can be manually reproduced.

## CI & CD

With CI & CD your code should be fully tested, fully reviewed and automatically deployed, enabling full automation.  This allows for a less error prone software development process.  Each step informs the next:

Clean code informs PR review and testing, PR review and testing inform CI, and CI informs CD.  All these steps work in a chain to ensure that code can be written quickly and correctly.  That means fewer hassles, more features, and clearer notions around how to run, and hopefully why to run, a piece of code.