# 🐍.3 Pandas Data Transformations, Pt. 1
_Nate Robinson_

Today's lesson focuses on how we can use pandas to cut down on data-manipulation work, and introduces some common methods for transforming data.

In [None]:
import sys
import pandas as pd
import numpy as np

# Add the custom module directory to sys.path so that we can import
# the db_utils module!
sys.path.append('../../custom')

from db_utils import get_connection, validate_connection, get_data
pd.options.display.float_format = '{:,.2f}'.format

# Motivating Question

How can we manipulate data in `pandas` to answer our questions?
What are some common data manipulation steps we should feel comfortable with in pandas?
There's more than one way to skin a cat! 🙀

We're going to be working with some **Milestone Updates** data, which is somewhat messy but is the only source for many of the questions we'll need to answer.

*Note: the column `update_lead_hours` here is a calculation of the hours between when the update was made and when the milestone occurred. Negative means the milestone was created after it happened; positive means the milestone was created ahead of the actual event.*
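
As a quick illustration of that sign convention, here is a minimal sketch that recomputes a lead-hours column from two timestamps. The column names `milestone_at` and `updated_at` are hypothetical (the real query in `milestones.sql` may name and compute these differently), and the two rows are invented.

```python
import pandas as pd

# Hypothetical timestamps: when the milestone happened vs. when the update was logged
toy = pd.DataFrame({
    "milestone_at": pd.to_datetime(["2020-07-01 12:00", "2020-07-02 08:00"]),
    "updated_at": pd.to_datetime(["2020-07-01 18:00", "2020-07-01 20:00"]),
})

# Lead hours = milestone time minus update time, expressed in hours
toy["update_lead_hours"] = (
    (toy["milestone_at"] - toy["updated_at"]).dt.total_seconds() / 3600
)

print(toy["update_lead_hours"].tolist())
```

The first row was logged 6 hours after the event, so its lead is negative; the second was logged 12 hours ahead of the event, so its lead is positive.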

In [None]:
# We'll get started by establishing a database connection and pulling a predetermined dataset of milestone updates
# We're selecting for only July 2020 and only arrival/departure milestones to ignore T&T events like last free day.
conn, cur = get_connection()
df = get_data('milestones.sql','file', conn)

# Questions:

We've got four questions ahead of us. The first two we'll do here:

    1. What percentage of actual updates are human vs. automated?
    2. What (if any) is the improvement in update speed earned by automation and operations teams?
    
The latter two will be explored in Part 2 of this lesson:
    
    3. What milestones are frequently missing per given mode?
    4. What are the fastest updates per mode?
    
For all of the above, keep in mind that we'll need to handle outliers and extraneous or missing data.
   
## Data Exploration

First, let's take a look through our data to have an idea of what we're working with.

**Can we identify any issues we might have in attempting to answer our motivating questions?**
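
A few generic checks often surface these issues quickly. The sketch below runs them on a small invented frame; the column names mirror the ones used in this lesson, but the rows are made up and are not taken from the milestones dataset.

```python
import pandas as pd

# Invented rows that imitate the kinds of problems we might find
toy = pd.DataFrame({
    "shipment_mode": ["Ocean", "Air", "Truck", "Ocean"],
    "source": ["human", None, "crux", "no user found"],
    "update_lead_hours": [-12.5, -3.0, 40.0, -9000.0],
})

# How many nulls does each column contain?
null_counts = toy.isna().sum()

# Which values appear in a categorical column, nulls included?
source_counts = toy["source"].value_counts(dropna=False)

# Are there suspicious extremes in the numeric column?
lead_summary = toy["update_lead_hours"].describe()

print(null_counts, source_counts, lead_summary, sep="\n\n")
```

The same three calls can be pointed at `df` to look for nulls, unexpected categories, and extreme lead hours.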


In [None]:
# Question: explore the data and identify some issues we may run into with this analysis
df.head()

# 1. Percentage of Updates: Human vs. Automated

I'm solving this by chaining conditions together to **keep only air and ocean milestones** (that's all we want right now) and **only actual dates, not scheduled ones**.

From there we get a total count of rows, and finally determine the count of each value in a series before dividing by the total.

In [None]:
# One (less readable) way to filter data is to build boolean Series
# from conditional expressions and combine them, for example:
ocean_filter = df.shipment_mode == 'Ocean'
air_filter = df.shipment_mode == 'Air'
milestones_filter = df.update_date_type == 'actual'
final_filter = (ocean_filter | air_filter) & milestones_filter

# We'll stop to discuss what these above filters mean here.
total_rows = df[final_filter]['id'].count()
df[final_filter]['source'].value_counts()/total_rows

### Group Work Session

After examining this solution, work through your own solution using another way to cut the data. This could be solved using `iloc`, `loc`, or <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html">query which I suggest</a>. You can also use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html">groupby function</a> to circumvent `value_counts`.

Try to answer this question in the most Pythonic, efficient, or visually appealing way; for example, try answering this question with one line of code! Furthermore, think about how we could persist this result.
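
One possible shape of an answer, sketched on invented rows rather than the real `df`: `query` expresses the filter as a string, and `value_counts(normalize=True)` returns shares directly, so no separate total is needed. The `groupby` line is an alternative that avoids `value_counts` altogether. Treat this as one option among several, not the expected solution.

```python
import pandas as pd

# Invented rows; only the column names match the lesson's dataset
toy = pd.DataFrame({
    "id": range(1, 7),
    "shipment_mode": ["Ocean", "Air", "Truck", "Ocean", "Air", "Ocean"],
    "update_date_type": ["actual", "actual", "actual", "scheduled", "actual", "actual"],
    "source": ["human", "crux", "human", "human", "human", "no user found"],
})

# Keep actual air/ocean updates only
filtered = toy.query(
    "shipment_mode in ['Ocean', 'Air'] and update_date_type == 'actual'"
)

# Share of each source among the filtered rows
shares = filtered["source"].value_counts(normalize=True)

# The same shares via groupby instead of value_counts
by_group = filtered.groupby("source").size() / len(filtered)

print(shares)
```

To persist a result like this, assigning it to a variable and writing it out with `to_csv` is one simple route.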

# 2. Improvement in Update Speed for Automation

Well, we learned by answering the prior question that the majority of our updates are coming from humans, with a bit of our updates provided by `crux` or `inttra` or `cargosmart`. Nearly 30% fell into the `no user found` category. If you dig a little bit deeper, you'll find that we also have some null values present in this column. 

In this next challenge, we're going to do a handful of things:
    
    1. Use a separate CSV which maps "Human" users to bots, individuals or groups
    2. Using logic on the source and user type, determine whether the update came from a bot, individual, or shared account
    3. Drop null values
    4. Eliminate outlier data and calculate average update speed
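
The first three steps can be tried out on toy data before running them on the real frame. The sketch below is an assumption-laden illustration: the user names and the contents of the mapping are invented (the real `user_types.csv` may look different), and it classifies rows with vectorised `where`/`mask` calls, which is one possible alternative to a row-wise function.

```python
import pandas as pd

# Invented updates and an invented user/type mapping
updates = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "created_by": ["alice", "ops-team", "crux-bot", None, "bob"],
    "source": ["human", "no user found", "crux", "human", "other"],
})
user_types = pd.DataFrame({
    "created_by": ["alice", "ops-team"],
    "user_type": ["individual", "shared"],
})

# Step 1: a left merge keeps every update; unmapped users get a null user_type
merged = updates.merge(user_types, on="created_by", how="left")

# Step 3 (first pass): drop rows missing the columns our logic depends on
merged = merged.dropna(subset=["source", "created_by"])

# Step 2: bots are identified by source, everyone else by the mapped user_type
bot_sources = ["crux", "inttra", "cargosmart"]
mapped_sources = ["human", "no user found"]
user = merged["user_type"].where(merged["source"].isin(mapped_sources))
user = user.mask(merged["source"].isin(bot_sources), "bot")

# Step 3 (second pass): drop rows we could not classify
classified = merged.assign(categorical_user_type=user).dropna(
    subset=["categorical_user_type"]
)
print(classified[["id", "source", "categorical_user_type"]])
```

Here the row with a null `created_by` and the row with an unrecognised source both fall out, leaving one individual, one shared account, and one bot.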

In [None]:
# Retrieve user/type mapping
user_types = pd.read_csv('user_types.csv')

# Create a new column with user_type in our current DF
data = df[final_filter]
data = pd.merge(data, user_types, on='created_by', how='left')

# Drop rows where the created_by or source are null
data = data.dropna(how='any', subset=['source','created_by'])

In [None]:
# Define a function for determining what type of user we have
def get_user(row):
    """Return 'bot', the mapped user_type, or None if the source is unknown."""
    if row['source'] in ('crux', 'inttra', 'cargosmart'):
        return 'bot'
    # Both of these sources defer to the user_type from the CSV mapping
    if row['source'] in ('no user found', 'human'):
        return row['user_type']
    return None

# Apply this function to our dataframe to create a new column from our logic
data['categorical_user_type'] = data.apply(get_user, axis=1)

In [None]:
# We'll then want to drop all null values from this column, which we'll be using moving forward
data = data.dropna(how='any', subset=['categorical_user_type'])

# Now let's take a look at what we're working with
data['update_lead_hours'].describe()

## Removing Outliers

We see that the tail of update lead hours is very long. On average, our milestone updates are logged $313$ hours after occurring, with the median sitting at $20$ hours. For this analysis we can probably assume no "actual" updates should be happening before the fact, so we'll drop all values of $0$ or greater.

It seems like some of our large outliers are significantly skewing our data. I think 2 weeks is a reasonable upper bound on the time it should take to update a milestone, so we'll also drop all rows logged $336$ hours or more after the event (lead hours of $-336$ or below).
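
To see why a long tail pulls the mean so far from the median, and what the two cutoffs do, here is a small sketch on invented lead-hour values. The numbers are made up for illustration and are not drawn from the dataset.

```python
import pandas as pd

# Invented lead hours: four typical late updates, one extreme outlier,
# and one "actual" update logged before the event
lead = pd.Series([-2.0, -10.0, -20.0, -30.0, -5000.0, 12.0])

# The single extreme value drags the mean far away from the median
raw_mean = lead.mean()
raw_median = lead.median()

# Keep only updates logged after the event and within two weeks (336 hours)
window = lead[(lead < 0) & (lead > -336.0)]

print(raw_mean, raw_median, window.mean())
```

After filtering, the mean of the remaining values sits close to the median of the raw data, which is the effect we are after with the cutoffs described above.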

In [None]:
final_dataset = data[(data['update_lead_hours'] < 0) & (data['update_lead_hours'] > -336.0)]

# Our final dataset has significantly fewer rows
final_dataset['id'].count()

In [None]:
final_dataset.groupby(by='categorical_user_type')['update_lead_hours'].mean()

## Solution

This gives us the expected solution! Bots are significantly faster than human users, and accounts run by individuals are slower at making updates than shared (team) accounts.

### Group Work

We've walked through a single method for arriving at this solution. 
**Can you arrive at this solution through another method? Are there any other relevant measures we could calculate to provide more insight?**
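
As one starting point for the second question, `agg` can compute several measures per group in a single pass. The sketch below uses invented lead hours for each user type; the values are placeholders and say nothing about the real results.

```python
import pandas as pd

# Invented, already-filtered data: one row per update
toy = pd.DataFrame({
    "categorical_user_type": [
        "bot", "bot",
        "individual", "individual", "individual",
        "shared", "shared",
    ],
    "update_lead_hours": [-1.0, -3.0, -10.0, -30.0, -20.0, -5.0, -15.0],
})

# Count, mean, and median per user type in one call
summary = (
    toy.groupby("categorical_user_type")["update_lead_hours"]
    .agg(["count", "mean", "median"])
)
print(summary)
```

The count shows how much data sits behind each average, and the median is less sensitive than the mean to whatever outliers survive the filter.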