# Exploratory Analysis in Pandas with Functions

How can we write clean code to organize messy analytical work?

Objectives:
1. Show how to break down a broad business problem into answerable data questions
2. Answer these questions in a diligent, clear, and reproducible way
3. Drive further exploration based on the answers to these questions

Number (2) above is where clean, effective Python code will save your ass

## The Business Problem: CRD Changes

An example of how a business partner might approach this, via **email**, **slack**, or **intake meeting**:

> We're facing a lot of uncertainty due to changing CRDs - it's difficult or impossible to arrange origin trucking, and some of our sailing/flight assignments are being invalidated. Can we get more visibility into how often this is happening?

It pays to follow up with the stakeholder here to get some context and figure out how to best approach the problem:
1. What's the specific business risk involved? The first example above outlines this fairly well.
2. What are the specific scenarios where we might care about this?
3. How can this information be _actionable_ for the end user?

Number (3) is especially important here - ultimately, whatever solution you provide should speak to some action that the stakeholder team can take based on the data.

Let's say you ask for clarification and get the following additional information:

> We're specifically interested in ocean shipments, because anecdotally it's really disrupting our fulfillment process. If a CRD changes and we can't use the original assignment, then we might end up with dead inventory which has a pretty direct financial impact. It also creates a bad experience for the client, who may end up with a longer transit time.

> If we knew how often CRDs were historically changing, we could get a sense of how much this is contributing to our overall fulfillment problems. We can start to "score" clients based on the likelihood that their CRDs will change, and either enforce better behavior or account for this in our fulfillment process. It would be really cool to actually _anticipate_ which CRDs will change and by how much, anecdotally this seems really unpredictable.

## Framing Analytical Questions

Great - now we have a better understanding of how the business is thinking about this problem. But where do we even start with actually pulling data?

While some of the suggestions from our stakeholder might be exciting - predicting CRD changes, scoring clients - here it pays to ***challenge your assumptions, start as simple as possible, and increase complexity in a logical and incremental fashion***

We can start by laying out some clearly defined questions that we should be able to answer with data:
- How often do shipment CRDs change?
- When a CRD does change, how much on average does it differ from the original or previous CRD?
- When do CRD changes typically occur, relative to quoting and relative to the CRD itself?

All of these should give us insight into the more general questions asked by our stakeholder (how much of a problem is this?) while also giving us a sense of how much further we can take this analysis. All of these questions can also be sliced on different dimensions, like client segment and trade-lane, to get more insight (if it's actionable).

***What data points would we need to answer these basic questions?***
- Each instance of a CRD being created or changed for a shipment
- The time that the creation/change took place
- The CRD at that time
- When the shipment was quoted

Unfortunately we don't have a true event-based data table for CRDs (if this has changed, awesome). But we _can_ back out the changes from the audits table. Let's grab the data and get started.

In [None]:
import pandas as pd
import numpy as np
from datetime import date

# plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.rcParams["figure.figsize"] = (11,6)
%config IPCompleter.greedy=True


### Installing internal packages
We'll want to use our internal Python library to access Snowflake. It's easy to install!

Check out the [PyPI instructions here](https://github.flexport.io/flexport/kimono/tree/master/astronomer/commonlib#step-6-installing-your-package) and navigate the [PyPI server here](http://10.70.168.13:6543/#/)

In [None]:
!pip3 install -i http://10.70.168.13:6543/simple/ analytics-utils==1.1.4 --trusted-host 10.70.168.13
from analytics_utils.utils import snowflake as sf

### Start with a query

The data has to come from somewhere...

Some general notes on SQL queries as part of EDA workflow:
- Code structure
    - You can extract the query into a separate file, but I like keeping it in the notebook so things are self-contained and easier to reference
    - I like defining the query as a standalone string that you can reference in a function. That way, you can edit the query separately from running it
- How to write the query
    - Avoid including too much complex logic/transformation in the query itself. Python code is generally easier to read and reason about (maybe just my opinion)
    - We also want to avoid having to re-run the query a million times
    - "Go wide" and include any data that you think you might need
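One way to avoid re-running a slow query is to cache the result in a local file and reload it on later runs. The sketch below is only an illustration: the helper name `load_or_run`, the pickle cache, and the `fake_query` stand-in are all assumptions, not part of our internal library.

```python
import os
import tempfile

import pandas as pd


def load_or_run(cache_path: str, run_query) -> pd.DataFrame:
    """
    Return the cached result at ``cache_path`` if it exists;
    otherwise call ``run_query()`` and cache its output (hypothetical helper)
    """
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    data = run_query()
    data.to_pickle(cache_path)
    return data


# stand-in for a slow query, so we can count how often it actually runs
calls = []

def fake_query() -> pd.DataFrame:
    calls.append(1)
    return pd.DataFrame({"audit_id": [1, 2], "leg_id": [10, 10]})


cache_path = os.path.join(tempfile.mkdtemp(), "crd_changes.pkl")
first = load_or_run(cache_path, fake_query)   # runs the "query" and writes the cache
second = load_or_run(cache_path, fake_query)  # reads the cache instead
print(len(calls))
```

The trade-off is staleness: if the query text or date range changes, the cache file has to be deleted or renamed.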



In [None]:
# here I define my query as a string

CRD_CHANGE_QUERY = (
    """
    SELECT 
      a.id as audit_id,
      a.auditable_id as leg_id,
      a.action,
      sl.shipment_id,
      a.created_at as changed_at,
      CASE WHEN ARRAY_SIZE(a.audited_changes:cargo_ready_date) = 2
        THEN a.audited_changes:cargo_ready_date[1]
        ELSE a.audited_changes:cargo_ready_date END as crd,
      q.quote_submitted_at,
      q.quote_accepted_at
    FROM core.audits as a
       JOIN legacy.bi_shipment_legs as sl
       ON (a.auditable_id = sl.leg_id 
           AND a.auditable_type in ('OperationalRoute::Leg', 'Leg')
           AND sl.from_origin_address)
       JOIN legacy.prep_quotes as q
       ON (sl.shipment_id = q.shipment_id and q.quote_accepted_at is not null)
    WHERE
      a.audited_changes:cargo_ready_date is not null
      AND a.audited_changes:cargo_ready_date != 'null'
      AND a.created_at BETWEEN '{start}' and '{end}'
    """)


# notice the bits in curly braces - these are arguments that we will later *interpolate* into the query

In [None]:
# next, I want to define a function to run a query with arguments
#  using a function here isolates the code necessary to pull data, so it's super easy to call later on

# first, let's set up a blank function
#  there's a lot of stuff here, but it doesn't actually do anything
def get_crd_changes(start: date, end: date) -> pd.DataFrame:
    """
    Return every instance of a CRD changing during the period between ``start`` and ``end``
    Pulling from the audits table
    """
    pass

### Anatomy of a function

1. Name: make it descriptive, even if it's verbose
2. Arguments: act as variables within the function. Here, we can put key pieces of configuration that we might want to change
    - Example: adding start and end dates to our query. Instead of changing the query itself, we can define arguments that are *interpolated* into the query text
3. Return something - in this case, a DataFrame
    - In the above example, we used `pass` as a placeholder body, so the function does nothing and returns `None`. This is very useful for sketching out functions before you actually write them

#### A note on type hints
You may notice the **type hints** above. These are a Python 3 feature (standardized in Python 3.5) that lets us annotate input and output types. Each argument's type follows a `:`, while the type of the value returned by the function follows `->` after the argument list.

Type hints are not required, nor are they enforced. There are, however, tools like [MyPy](http://mypy-lang.org/) that can be used to check code against type hints, and these are a good option for production code.

<a href="https://imgflip.com/i/4wtzfz"><img src="https://i.imgflip.com/4wtzfz.jpg" title="made at imgflip.com"/></a>

Check it out below:

In [None]:
# define a function to add two... things of unspecified type
def my_function(a, b):
    return a + b

In [None]:
# What will this return?
#my_function(1, 2)

In [None]:
# What will this return?
#my_function('a', 'b')

In [None]:
# What will this return?
#my_function('a', 1)

In [None]:
# let's rewrite this with type hints
def my_function_with_hints(a: int, b: int) -> int:
    return a + b

In [None]:
# However, does this actually do anything?
#my_function_with_hints('a', 'b')

#### Why bother with type hints anyway?

Type hints act as *rich documentation* of a function's intended usage. If you're collaborating with someone, or revisiting old code, you can quickly understand what to expect. Sure, it's not enforced, but it is helpful.

We most often see type hints in *production-level code*, where style guidelines may require their usage. So, is there any value in using them for ad-hoc analysis in notebooks, or is this just Tyler being a hardass (who has written production code)?

My *opinion* is emphatically yes! For all the reasons above, using type hints helps us write clean, well documented code without much additional effort. It also forces us to write functions with clear usage and avoid any unpleasant type flexibility that may be allowed by Python.
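To see hints acting as documentation rather than enforcement, here is a small sketch using the standard library's `typing.get_type_hints`. The function `add_ints` is a made-up example, not something from the analysis.

```python
from typing import get_type_hints


def add_ints(a: int, b: int) -> int:
    """Add two integers"""
    return a + b


# the hints are stored on the function, where tools (and people) can read them
hints = get_type_hints(add_ints)
print(hints)

# nothing checks them at runtime, so this still runs and concatenates strings
result = add_ints("a", "b")
print(result)  # ab
```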

#### A note on *docstrings*

Directly below a function's `def` line, you can include documentation in a triple-quoted string. This is referred to as a "docstring" and acts as another key piece of documentation around functions. Again, these are typically required as part of style guidelines for production code - but they can be useful for analytical work as well.

What should the docstring contain?
- A *concise* explanation of what the function does. If you can't concisely explain, it's likely that the function is doing too much!
- Explanation of how each argument of the function is used
- Explanation for the data returned
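As a rough illustration of those three pieces, here is a docstring in one common layout (an `Args:` / `Returns:` style). The function `build_query` and its `QUERY_TEMPLATE` are hypothetical, simplified stand-ins so the example runs without a database.

```python
from datetime import date

# hypothetical, simplified template - not the real CRD query
QUERY_TEMPLATE = "SELECT * FROM audits WHERE created_at BETWEEN '{start}' AND '{end}'"


def build_query(start: date, end: date) -> str:
    """
    Fill a date range into the query template

    Args:
        start: first date of the period to pull
        end: last date of the period to pull

    Returns:
        The query text with both dates interpolated
    """
    return QUERY_TEMPLATE.format(start=start, end=end)


query = build_query(date(2020, 1, 1), date(2020, 1, 10))
print(query)

# the docstring is what `help(build_query)` displays
print(build_query.__doc__)
```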

In [None]:
# time to actually write the function
def get_crd_changes(start: date, end: date) -> pd.DataFrame:
    """
    Return every instance of a CRD changing during the period between ``start`` and ``end``
    Pulling from the audits table
    """
    # interpolate arguments into the query
    formatted_query = CRD_CHANGE_QUERY.format(start=start, end=end)
    
    # use our snowflake package to run the query
    return sf.run_snowflake_query(formatted_query)

In [None]:
# `.format()` "just works" with dates, but let's make sure...
'{start}'.format(start=date(2020,1,1))

In [None]:
# use our function
# our use of arguments comes in handy here
#   we can run with a shorter date range to confirm that this is working
#   without waiting forever
crd_changes = get_crd_changes(date(2020, 1, 1), date(2020, 1, 10))

### Sanity checking the data

Prior to making any assertions, let's make sure that things look OK from a high level

- Is the _grain_ what I expected? 
    - It's a good practice to include some kind of primary key in your query, just so this is easier to reason about
    - the `.value_counts()` method is a lifesaver here
- Do columns take on the values that I expected?
    - Is anything missing more often than I would expect?
- Is the size in line with what I would expect (orders of magnitude)?
- Other specific pieces of logic that might make sense
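Here is a sketch of the grain and missingness checks on a tiny made-up frame, so you can see what a problem looks like. The `toy` data is invented for illustration; only the column names mirror our query.

```python
import pandas as pd

# invented data: audit_id 3 is duplicated and one crd is missing
toy = pd.DataFrame({
    "audit_id": [1, 2, 3, 3],
    "leg_id": [10, 10, 11, 11],
    "crd": ["2020-01-05", None, "2020-01-07", "2020-01-07"],
})

# grain: any audit_id that appears more than once means rows were duplicated
id_counts = toy["audit_id"].value_counts()
duplicated_ids = id_counts[id_counts > 1]
print(duplicated_ids)

# missingness: share of nulls in each column
null_share = toy.isnull().mean()
print(null_share)
```

A duplicated key usually points at a join that fans out, which is worth fixing in the query before going any further.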

In [None]:
# is the grain correct?
crd_changes.audit_id.value_counts().head()

In [None]:
# does everything look as expected?
crd_changes.sample(5)

In [None]:
# are the types correct?
crd_changes.dtypes

In [None]:
# is anything missing more often than we expect?
pd.isnull(crd_changes).mean()

In [None]:
# how much data was returned?
crd_changes.shape

In [None]:
# how does this compare to the number of shipments?
crd_changes.shipment_id.nunique()

In [None]:
# we need to fix the formatting of the `crd` column
#pd.to_datetime(crd_changes.crd)

In [None]:
# we ran into errors parsing CRD into date type... how many rows are affected?

### Another useful application for functions: isolating and compartmentalizing code
After sanity checking the data above, we found one thing we wanted to change - the formatting around the CRD column. We *could* go make that change in the query, which may or may not be cumbersome. To some extent it's up to you.

One disadvantage of putting *all* logic in the query (as mentioned above) is that you have to re-run the query any time you tweak the logic, which may slow down your iteration.

However, if you put this transformation in Python, you run into potential cell state issues. If I alter the `crd` column in place, that might cause problems if I run the cells out of order, and it becomes hard for me to keep track of the state of individual objects.

In these cases I like to write an additional function to isolate all code used to transform the data in Python. In this function, you could include:
- date and string formatting that is cumbersome in SQL
- aggregation and window functions that are possible, but cumbersome and difficult to read in SQL
- anything that you want to make configurable after the SQL query is run

This way, you can keep the execution of all data processing in two functions, and focus on the analysis below
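To make the cell state point concrete, here is a toy sketch of a transformation function that works on a copy, so the raw frame stays untouched no matter how often the cell is re-run. The function name `parse_crd` and the `raw` data are made up for illustration. It also shows one way to count rows that fail date parsing.

```python
import pandas as pd


def parse_crd(data: pd.DataFrame) -> pd.DataFrame:
    """
    Return a copy of ``data`` with ``crd`` stripped of double quotes
    and parsed to datetime; unparseable values become NaT
    """
    output = data.copy()
    output["crd"] = pd.to_datetime(output["crd"].str.strip('"'), errors="coerce")
    return output


# invented data: two quoted dates and one value that is not a date
raw = pd.DataFrame({"crd": ['"2020-01-05"', '"2020-01-07"', '"not a date"']})
parsed = parse_crd(raw)

# rows that could not be parsed show up as nulls
n_unparsed = parsed["crd"].isnull().sum()
print(n_unparsed)

# the input frame still holds the original quoted strings
print(raw["crd"].iloc[0])
```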

In [None]:
def format_change_data(data: pd.DataFrame) -> pd.DataFrame:
    """
    perform post-processing on the query output from ``get_crd_changes``
    - strip double quotes from CRD
    """
    output = data.copy()
    # use the copied input rather than the global `crd_changes`, so the function works on any frame
    output['crd'] = pd.to_datetime(output.crd.str.strip('"'), errors='coerce')
    
    return output

In [None]:
crd_formatted = format_change_data(crd_changes)
crd_formatted.crd.dtype

### Let's start actually answering our questions

First, ***how often do shipment CRDs change?***

Now that we know the data is formatted properly, let's go back and get a bigger sample

Then, let's think of ways to summarize this

In [None]:
# what fraction of legs have more than one CRD?
leg_crd_counts = crd_changes.groupby('leg_id')['audit_id'].count()
(leg_crd_counts > 1).mean()

In [None]:
# how often do CRDs change more than once?
leg_crd_counts.loc[leg_crd_counts > 1].hist(bins=np.arange(2, 10, 1))

In [None]:
(leg_crd_counts.value_counts() / len(leg_crd_counts)).head(6)

**When a CRD does change, how much on average does it differ from the original or previous CRD?**

We'll need to do some additional transformation:
- We'll need to add "first CRD" as a column
- We'll need to add "previous CRD" as a column

Let's walk through these transformations, then build a function to do this for us

**Blank function below**

In [None]:
def format_change_data_updated(data: pd.DataFrame) -> pd.DataFrame:
    """
    format columns from ``get_crd_changes``
    add columns for previous/first CRD, as well as differences
    """
    pass

In [None]:
# let's come up with the answer here

## My solution below

```











This space left intentionally blank














```

In [None]:
# let's add this logic to our processing function
# normally I wouldn't write a second function - I'd just update the first one
def format_change_data_updated(data: pd.DataFrame) -> pd.DataFrame:
    """
    perform post-processing on the query output from ``get_crd_changes``
    - strip double quotes from CRD
    - add first CRD information
    - add previous CRD information
    """
    # crd formatting
    output = data.copy()
    # use the copied input rather than the global `crd_changes`
    output['crd'] = pd.to_datetime(output.crd.str.strip('"'), errors='coerce')
    
    # add first and previous CRD information
    first_crd = output.sort_values('changed_at').groupby('leg_id')[['crd', 'changed_at']].first()
    previous_crd = output.sort_values('changed_at').groupby('leg_id').crd.shift(1)
    
    output = output \
        .join(previous_crd, how='inner', rsuffix='_prev') \
        .set_index('leg_id') \
        .join(first_crd, how='inner', rsuffix='_first') \
        .reset_index()
    
    output['crd_order'] = output.groupby('leg_id').changed_at.rank(method='min')
    
    # generate columns for differences
    output['difference_from_prev'] = (output.crd - output.crd_prev) / np.timedelta64(1, 'D')
    output['difference_from_first'] = (output.crd - output.crd_first) / np.timedelta64(1, 'D')
    
    # null the difference from first if it is the first
    output.loc[output.crd_order == 1, 'difference_from_first'] = np.nan

    return output

In [None]:
crd_comparison = format_change_data_updated(crd_changes)

In [None]:
crd_comparison.difference_from_first.describe()

In [None]:
crd_comparison.difference_from_first.hist(bins=np.arange(-20, 50, 1))

In [None]:
crd_comparison.groupby('crd_order').agg(dict(leg_id='count', difference_from_first='mean')).head(10)

In [None]:
crd_comparison.groupby('crd_order').difference_from_first.quantile([0.25, 0.5, 0.75]).unstack().iloc[:10].plot()