# Getting to know Pandas 

Pandas can be as simple or as complex as you need it to be. As an analysis toolkit, it's designed to be flexible and provide a wide range of functionality so that the same tool can be used for a variety of tasks. Because of this, it can be a little overwhelming at first. In this notebook we will introduce some of the essential pandas functionality and list a few best practices that will make learning pandas easier as you go.

By now, you should be comfortable with:

* Reading in a CSV file
* Inspecting the first five rows of your data
* Selecting columns / filtering rows
* Creating new columns from existing columns

If not, please review [AFU PDA 2 - Pandas basics](./AFU PDA 2 - Pandas basics.ipynb)

## What's covered here

In this notebook you will learn:

* Basic indexing and working with dates
* Reading data from multiple sources
* Merging data (joins/vlookup)
* Groupby, pivot_table, transform, melt

Along the way, you will also learn pandas best practices in how to write your code. For further reading on mastering pandas syntax, [Minimally sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428) is an excellent resource.

# Tutorial Overview
This tutorial is divided into 4 parts:

* Test your knowledge
* Basic indexing
* Slicing, merging, and grouping data
* Exercises: Your turn!

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option("display.expand_frame_repr", False)
pd.set_option("display.width", 160)

%matplotlib inline

# Test your knowledge
Before starting, try to complete the exercise below. This tests your knowledge of the topics covered in [AFU PDA 2 - Pandas basics](./AFU PDA 2 - Pandas basics.ipynb).

Step 1: Choose a file to load

In [5]:
# we want to load the sales_fake.csv in the data directory
directory = '../data/' 
## START YOUR CODE HERE
file_name = 'sales_fake.csv'
## END YOUR CODE HERE
path = directory + file_name

# check to make sure you have the right path
print(path)

../data/sales_fake.csv


Step 2: Read your file into a pandas data frame and view the top 5 rows

In [6]:
# remember, pandas has built-in methods for reading data. 
# If you can't remember which one to use, try pd.read<TAB> to view the available methods

## START YOUR CODE HERE
# read the csv
df = pd.read_csv(path)
# view the top 5 rows
# remember, each dataframe also has built-in methods for working with the data.
# if you can't remember which one to use, try df.<TAB> to view the available methods
df.head()
## END YOUR CODE HERE

Unnamed: 0,marketplace_id,order_day,order_week,gl,product_family,units,ops
0,1,2017-01-01,2017-01-07,Amazon_Pantry,Consumables,9460.696331,46576.224274
1,1,2017-01-02,2017-01-07,Amazon_Pantry,Consumables,13828.318507,61148.434296
2,1,2017-01-03,2017-01-07,Amazon_Pantry,Consumables,12308.348366,59960.632711
3,1,2017-01-04,2017-01-07,Amazon_Pantry,Consumables,11912.807101,58991.544654
4,1,2017-01-05,2017-01-07,Amazon_Pantry,Consumables,10817.883315,54032.017658


# Basic Indexing

Pandas dataframes are a collection of rows and columns, organized by an index. The index is analogous to row numbers in Excel or primary keys in SQL. Understanding and working with indexes allows us to easily filter, group, and manipulate our data. Additionally, pandas understands different types of indexes. Most notably, pandas has a DatetimeIndex, which makes working with dates very easy.

To begin, we will read in some data, inspect the index, and set a new index. After that, we will explore the functionality of our new index.

In [7]:
# use the same file from before
df = pd.read_csv(path)

In [8]:
print("What type of index?")
print("-> ", df.index, "\n")

print("Inspect some values:")
print("-> ", df.index.values, "\n")

# show the first 5 rows
df.head()

What type of index?
->  RangeIndex(start=0, stop=20605, step=1) 

Inspect some values:
->  [    0     1     2 ... 20602 20603 20604] 



Unnamed: 0,marketplace_id,order_day,order_week,gl,product_family,units,ops
0,1,2017-01-01,2017-01-07,Amazon_Pantry,Consumables,9460.696331,46576.224274
1,1,2017-01-02,2017-01-07,Amazon_Pantry,Consumables,13828.318507,61148.434296
2,1,2017-01-03,2017-01-07,Amazon_Pantry,Consumables,12308.348366,59960.632711
3,1,2017-01-04,2017-01-07,Amazon_Pantry,Consumables,11912.807101,58991.544654
4,1,2017-01-05,2017-01-07,Amazon_Pantry,Consumables,10817.883315,54032.017658


If I know a row's index label, I can pull that row with `.loc`:

In [9]:
print(df.loc[3])

marketplace_id                1
order_day            2017-01-04
order_week           2017-01-07
gl                Amazon_Pantry
product_family      Consumables
units                   11912.8
ops                     58991.5
Name: 3, dtype: object


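Or I can pull a whole range using `START:END` notation. This is known as "slicing". With `.loc`, the slice is label-based and includes both endpoints, so `[1:3]` returns the rows labelled 1, 2 and 3. (Position-based slicing, with `.iloc` or on a plain Python list, stops before the end value instead.)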

In [10]:
print(df.loc[1:3])

   marketplace_id   order_day  order_week             gl product_family         units           ops
1               1  2017-01-02  2017-01-07  Amazon_Pantry    Consumables  13828.318507  61148.434296
2               1  2017-01-03  2017-01-07  Amazon_Pantry    Consumables  12308.348366  59960.632711
3               1  2017-01-04  2017-01-07  Amazon_Pantry    Consumables  11912.807101  58991.544654
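A minimal sketch of the difference between label-based and position-based slicing, using a small made-up frame (`demo` is illustrative and not part of the sales data). With `.loc` the end label is included, which is why the output above contains rows 1, 2 and 3; `.iloc` follows the usual Python convention and stops before the end position.

```python
import pandas as pd

# hypothetical five-row frame with the default RangeIndex (labels 0..4)
demo = pd.DataFrame({"units": [10, 20, 30, 40, 50]})

by_label = demo.loc[1:3]      # label-based: both endpoints are included
by_position = demo.iloc[1:3]  # position-based: the end position is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```

On a default integer index the two look deceptively similar, so it helps to be explicit about which one you mean.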


An integer index alone isn't particularly useful, so let's set a datetime index. Before we do that, we first need to understand data types.

Basic data types are `float`, `integer`, and `string`. A more advanced data type that pandas understands is the `datetime`. This is similar to programs like Excel, which automatically try to detect a date column and treat it differently from a string or an integer.

Let's look at the data types in our current dataframe:

In [11]:
df.dtypes

marketplace_id      int64
order_day          object
order_week         object
gl                 object
product_family     object
units             float64
ops               float64
dtype: object

An `object` dtype in pandas generally means a string, or anything else pandas could not map to a more specific type. As you can see, order_day and order_week are objects, but we would like them to be dates.

In [12]:
# make a list of columns you want to change
columns_to_change = ['order_day', 'order_week']

# go through each column and update its dtype using pandas built-in function
for column in columns_to_change:
    df[column] = pd.to_datetime(df[column])

# check the output
df.dtypes

marketplace_id             int64
order_day         datetime64[ns]
order_week        datetime64[ns]
gl                        object
product_family            object
units                    float64
ops                      float64
dtype: object
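As an alternative to converting after loading, `pd.read_csv` accepts a `parse_dates` argument that converts the listed columns while reading. The sketch below assumes a tiny in-memory CSV (`sample_csv` is made up for illustration) with the same two date columns as the sales file.

```python
import io
import pandas as pd

# hypothetical stand-in for the sales file, held in memory
sample_csv = io.StringIO(
    "order_day,order_week,units\n"
    "2017-01-01,2017-01-07,9460.7\n"
    "2017-01-02,2017-01-07,13828.3\n"
)

# parse the two date columns while reading instead of afterwards
parsed = pd.read_csv(sample_csv, parse_dates=["order_day", "order_week"])
print(parsed.dtypes)
```

Both routes end with datetime columns; converting explicitly with `pd.to_datetime` is handy when the dataframe is already loaded.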

In [13]:
# a datetime column gets a whole new set of date methods through the .dt accessor
new_dt_column = df['order_day']
# .dt.day_name() replaces .dt.weekday_name, which was removed in pandas 1.0
new_dt_column.dt.day_name()[:5]

0       Sunday
1       Monday
2      Tuesday
3    Wednesday
4     Thursday
Name: order_day, dtype: object
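A short sketch of a few other things the `.dt` accessor offers, using three made-up dates rather than the sales data. The accessor is only available on datetime columns, which is why the conversion step matters.

```python
import pandas as pd

# hypothetical dates for illustration
dates = pd.Series(pd.to_datetime(["2017-01-01", "2017-01-02", "2017-07-04"]))

weekday_names = dates.dt.day_name().tolist()  # names of the weekdays
months = dates.dt.month.tolist()              # month numbers 1..12
quarters = dates.dt.quarter.tolist()          # calendar quarters 1..4

print(weekday_names, months, quarters)
```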

Now we have a column that pandas understands as a date. Since every row has an order date, let's index our data frame by date instead of by row number:

In [14]:
df = df.set_index('order_day')
df.head()

order_day,marketplace_id,order_week,gl,product_family,units,ops
2017-01-01,1,2017-01-07,Amazon_Pantry,Consumables,9460.696331,46576.224274
2017-01-02,1,2017-01-07,Amazon_Pantry,Consumables,13828.318507,61148.434296
2017-01-03,1,2017-01-07,Amazon_Pantry,Consumables,12308.348366,59960.632711
2017-01-04,1,2017-01-07,Amazon_Pantry,Consumables,11912.807101,58991.544654
2017-01-05,1,2017-01-07,Amazon_Pantry,Consumables,10817.883315,54032.017658


Now, instead of a row number, we have a date for our index. Let's revisit our index methods

In [15]:
# get data for a particular date
print(df.loc['2017-01-15'].head())
# get data for a slice
print(df.loc['2017-02-15':'2017-02-20'].head())
# get data for a month
print(df.loc['2017-07'].head())

            marketplace_id order_week                      gl product_family         units            ops
order_day                                                                                                
2017-01-15               1 2017-01-21           Amazon_Pantry    Consumables  12174.306544   57837.266537
2017-01-15               1 2017-01-21                 Apparel      Softlines  32991.826769  665125.652486
2017-01-15               1 2017-01-21                     Art      Hardlines      0.601876     115.510917
2017-01-15               1 2017-01-21  Art and Craft Supplies      Hardlines    183.130444    2103.539509
2017-01-15               1 2017-01-21              Automotive      Hardlines  10286.426235  316251.803210
            marketplace_id order_week             gl product_family         units           ops
order_day                                                                                      
2017-02-15               1 2017-02-18  Amazon_Pantry    Consumable
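To make the date-string lookups easier to reason about, here is a small sketch on a made-up daily series (`daily` is hypothetical). A partial string such as `"2017-02"` selects every row in that month, and a slice between two date strings includes both end dates.

```python
import pandas as pd

# hypothetical daily data covering all of 2017
idx = pd.date_range("2017-01-01", "2017-12-31", freq="D")
daily = pd.DataFrame({"units": range(len(idx))}, index=idx)

february = daily.loc["2017-02"]                 # every row in February
window = daily.loc["2017-02-15":"2017-02-20"]   # both end dates are included

print(len(february), len(window))
```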

Now that our index knows about dates, we can easily get calendar views of the data. For example, let's look at units by quarter and by month:

In [16]:
print(df.resample('Q')['units'].sum()) 
print(df.resample('M')['units'].sum())

order_day
2017-03-31    4.244779e+07
2017-06-30    4.499966e+07
2017-09-30    4.924478e+07
2017-12-31    6.415063e+07
Freq: Q-DEC, Name: units, dtype: float64
order_day
2017-01-31    1.468871e+07
2017-02-28    1.296491e+07
2017-03-31    1.479417e+07
2017-04-30    1.441700e+07
2017-05-31    1.550550e+07
2017-06-30    1.507717e+07
2017-07-31    1.665600e+07
2017-08-31    1.700644e+07
2017-09-30    1.558234e+07
2017-10-31    1.689117e+07
2017-11-30    2.105286e+07
2017-12-31    2.620659e+07
Freq: M, Name: units, dtype: float64
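A minimal resample sketch on made-up data. It uses the `"MS"` (month start) alias, an assumption chosen because the spelling of the month-end and quarter-end aliases has changed across pandas versions (recent releases favour `"ME"` and `"QE"` over `"M"` and `"Q"`). The result is therefore labelled by the first day of each month rather than the last.

```python
import pandas as pd

# hypothetical data: exactly one unit sold every day of Q1 2017
idx = pd.date_range("2017-01-01", "2017-03-31", freq="D")
daily = pd.DataFrame({"units": 1.0}, index=idx)

# group the daily rows into calendar months and sum each month
monthly = daily.resample("MS")["units"].sum()
print(monthly)
```

Each monthly total equals the number of days in that month, which is an easy way to check that the bins fall where you expect.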


# You try!

Using the same dataframe as above (you can make sure it's there by running all the cells up until now), complete the following steps to test your understanding:

Step 1: create a new dataframe using the date index

In [17]:
# filter to data for November and December and save it to a new df called nov_to_dec
nov_to_dec = df.loc['2017-11':'2017-12']

nov_to_dec.head()

order_day,marketplace_id,order_week,gl,product_family,units,ops
2017-11-01,1,2017-11-04,Amazon_Pantry,Consumables,11129.938755,57374.915375
2017-11-02,1,2017-11-04,Amazon_Pantry,Consumables,11036.026525,52465.788951
2017-11-03,1,2017-11-04,Amazon_Pantry,Consumables,10965.167199,48428.995346
2017-11-04,1,2017-11-04,Amazon_Pantry,Consumables,11401.431162,47517.117904
2017-11-05,1,2017-11-11,Amazon_Pantry,Consumables,13946.609159,69121.564445


Step 2: Reset the index of your new data frame so that order_day becomes a regular column again

In [18]:
# hint: nov_to_dec.<TAB> will show you the available methods
nov_to_dec_reindexed = nov_to_dec.reset_index()

nov_to_dec_reindexed.head()

Unnamed: 0,order_day,marketplace_id,order_week,gl,product_family,units,ops
0,2017-11-01,1,2017-11-04,Amazon_Pantry,Consumables,11129.938755,57374.915375
1,2017-11-02,1,2017-11-04,Amazon_Pantry,Consumables,11036.026525,52465.788951
2,2017-11-03,1,2017-11-04,Amazon_Pantry,Consumables,10965.167199,48428.995346
3,2017-11-04,1,2017-11-04,Amazon_Pantry,Consumables,11401.431162,47517.117904
4,2017-11-05,1,2017-11-11,Amazon_Pantry,Consumables,13946.609159,69121.564445


Step 3: Get the data only for "Consumables"

In [19]:
# hint: a boolean mask on the product_family column keeps only the matching rows
nov_to_dec_consumables = nov_to_dec[nov_to_dec["product_family"]=="Consumables"]

nov_to_dec_consumables.head()

order_day,marketplace_id,order_week,gl,product_family,units,ops
2017-11-01,1,2017-11-04,Amazon_Pantry,Consumables,11129.938755,57374.915375
2017-11-02,1,2017-11-04,Amazon_Pantry,Consumables,11036.026525,52465.788951
2017-11-03,1,2017-11-04,Amazon_Pantry,Consumables,10965.167199,48428.995346
2017-11-04,1,2017-11-04,Amazon_Pantry,Consumables,11401.431162,47517.117904
2017-11-05,1,2017-11-11,Amazon_Pantry,Consumables,13946.609159,69121.564445


Step 4: Sum units in nov_to_dec_consumables and divide by the sum of all units in nov_to_dec. This tells you what share of all units sold in November and December came from Consumables.

In [20]:
total_consumables_nov_to_dec = nov_to_dec_consumables['units'].sum() 

In [21]:
# hint: we still have a dataframe indexed by date, nov_to_dec. Try using that
total_units_nov_to_dec = nov_to_dec['units'].sum() 

In [22]:
# run this cell

print(
    "Consumables were {:.1%} of all units From November to December.".format(
        total_consumables_nov_to_dec / total_units_nov_to_dec
    )
)

Consumables were 21.3% of all units From November to December.


Normally, this isn't how you would go about getting an answer like this in pandas, but it does demonstrate how easy it is to work with data using slices and indexes. We could have accomplished the same thing with special functions, filtering, or groupbys. Groupby is what we will cover next!
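As a sketch of the groupby route mentioned above, the share of units for every product family can be computed in one pass. The `sales` frame here is made up for illustration; the same two lines would apply to a date-filtered dataframe.

```python
import pandas as pd

# hypothetical sales rows
sales = pd.DataFrame({
    "product_family": ["Consumables", "Hardlines", "Consumables", "Media"],
    "units": [10.0, 30.0, 10.0, 50.0],
})

# total units per family, then each family's share of the grand total
units_by_family = sales.groupby("product_family")["units"].sum()
share = units_by_family / units_by_family.sum()

print(share)
```

This gives the share for all families at once instead of filtering one family at a time.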

# Groupby's and aggregations

Earlier, we used resample to sum units across different calendar dimensions. This highlights one of pandas' most powerful features: **GroupBy** and **Aggregations**.

GroupBy provides a very flexible way to organize your data, and Aggregations create summary views. GroupBy/Aggregate in pandas is much like pivot tables in Excel, with lots of extras on top. Pandas also includes `pivot_table`, `melt`, and `transform` methods for shaping data. If interested, we cover those in more depth in [AFU PDA 3 - Pandas - Groupby, pivot_table, transform and melt](./AFU PDA 3 - Pandas - Groupby, pivot_table, transform and melt.ipynb).

At a high level, a GroupBy is a logical way to split your data and then apply some operation to each split before returning the results. This is known as Split-Apply-Combine:

<img src='img/split-apply-combine.PNG'>

To get the sum of units and OPS by each product_family, first create a groupby object (this is similar to a dataframe, except broken up into chunks as in the picture above).  Then, pick the columns you want to aggregate (units and ops in this case) and apply the .sum() method.  Pandas will automatically aggregate each product family "chunk" and combine them into a single dataframe.

GroupBy is one of pandas' most versatile features. Be sure to check out the pandas documentation on [Split-Apply-Combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).
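To make the three steps concrete, the sketch below spells out split, apply, and combine by hand on a made-up frame and compares the result with the built-in aggregation. The `sales` data and the `pieces` dictionary are illustrative only; in practice you would let pandas do all three steps in a single call.

```python
import pandas as pd

# hypothetical sales rows
sales = pd.DataFrame({
    "product_family": ["Media", "Consumables", "Media", "Consumables"],
    "units": [1.0, 2.0, 3.0, 4.0],
    "ops": [10.0, 20.0, 30.0, 40.0],
})

# split: the groupby yields one chunk per product family
# apply: sum the numeric columns of each chunk
pieces = {
    family: chunk[["units", "ops"]].sum()
    for family, chunk in sales.groupby("product_family")
}

# combine: put the per-family results back into a single dataframe
manual = pd.DataFrame(pieces).T

# the built-in aggregation performs the same three steps at once
builtin = sales.groupby("product_family")[["units", "ops"]].sum()

print(manual)
print(builtin)
```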

In [23]:
sales_by_product_family = df.groupby('product_family')[['units', 'ops']].sum()
print(sales_by_product_family)

                       units           ops
product_family                            
Consumables     4.901449e+07  9.719792e+08
Hardlines       1.084074e+08  3.384190e+09
Media           2.177083e+07  3.887892e+08
Softlines       2.165010e+07  6.960792e+08


In SQL, this is equivalent to:

`SELECT
    product_family
    ,SUM(units) AS units
    ,SUM(ops) AS ops
FROM
    df
GROUP BY
    product_family
;`

The above example involves using the .sum() method on a groupby object, but there is a more general approach that we recommend, using .aggregate().  In Python, and especially Pandas, there are many different ways to perform the same operation, and .aggregate() is a versatile method that can help narrow down your options.

In [24]:
sales_by_product_family = df.groupby('product_family')[['units', 'ops']].aggregate('sum')
print(sales_by_product_family)

                       units           ops
product_family                            
Consumables     4.901449e+07  9.719792e+08
Hardlines       1.084074e+08  3.384190e+09
Media           2.177083e+07  3.887892e+08
Softlines       2.165010e+07  6.960792e+08


This is particularly helpful if you want to use different aggregate functions for different columns, which you'll see in the next couple of examples.

Next, let's look at gl units by quarter, using the .resample() method:

In [25]:
gl_by_qtr = df.groupby(['gl'])[['units','ops']].resample('Q').sum().head(n=12)
gl_by_qtr

gl,order_day,units,ops
Amazon_Pantry,2017-03-31,982901.6,4718324.0
Amazon_Pantry,2017-06-30,932935.1,4540742.0
Amazon_Pantry,2017-09-30,1038855.0,4956196.0
Amazon_Pantry,2017-12-31,1024466.0,4786795.0
Apparel,2017-03-31,2849410.0,57255890.0
Apparel,2017-06-30,3180649.0,63105350.0
Apparel,2017-09-30,3388291.0,66679770.0
Apparel,2017-12-31,5245171.0,107782000.0
Art,2017-03-31,84.18909,14417.88
Art,2017-06-30,134.5359,17102.97


Let's look at a more general case without using resample:

In [26]:
column_list = ['product_family']
aggregations = {'units':['sum','mean'],'ops':['min','max']}

df.groupby(column_list).aggregate(aggregations)

,units,units,ops,ops
product_family,sum,mean,min,max
Consumables,49014490.0,8985.240642,0.090388,3381976.0
Hardlines,108407400.0,10219.404387,0.440191,6605131.0
Media,21770830.0,8431.769999,0.139558,1614658.0
Softlines,21650100.0,11045.968597,0.55656,2670796.0
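The dictionary form produces two-level column headers, as in the output above. If flat column names are easier to work with, pandas also supports named aggregation, where each keyword is the output column and each value is a `(column, function)` pair. A minimal sketch on made-up data (the output names `total_units`, `mean_units` and `max_ops` are arbitrary choices):

```python
import pandas as pd

# hypothetical sales rows
sales = pd.DataFrame({
    "product_family": ["Media", "Consumables", "Media", "Consumables"],
    "units": [1.0, 2.0, 3.0, 4.0],
    "ops": [10.0, 20.0, 30.0, 40.0],
})

# named aggregation: output_name=(input_column, aggregation)
summary = sales.groupby("product_family").agg(
    total_units=("units", "sum"),
    mean_units=("units", "mean"),
    max_ops=("ops", "max"),
)

print(summary)
```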


In [27]:
def summarize_data(df, cols, aggs):
    return df.groupby(cols).aggregate(aggs)

column_list = ['product_family', df.index.quarter]
aggregations = {'units':['min','max'],'ops':['sum','mean']}

summarize_data(df, column_list, aggregations)

,,units,units,ops,ops
product_family,order_day,min,max,sum,mean
Consumables,1,0.034692,50756.372305,207593900.0,152754.905636
Consumables,2,0.03468,52561.869777,221053200.0,167211.19872
Consumables,3,0.034708,106296.996434,240110500.0,176163.223037
Consumables,4,0.034667,113244.908337,303221600.0,214898.338796
Hardlines,1,0.219582,40986.010161,667380600.0,255505.596678
Hardlines,2,0.034649,43616.066234,753158900.0,284963.657763
Hardlines,3,0.35174,105444.099619,828770500.0,310633.603211
Hardlines,4,0.034984,255925.390692,1134880000.0,422674.009058
Media,1,0.034676,71345.396536,93446130.0,143763.276162
Media,2,0.034624,49615.001452,75746890.0,117619.393752


# Pivot a dataframe using the .pivot_table() function

Just as in Excel, we can pivot our data, which typically involves swapping rows with columns and applying an aggregate function:

In [28]:
df_pivot = pd.pivot_table(data=df, index=df.index, columns='product_family', values='units', aggfunc='sum')
df_pivot.head()

product_family,Consumables,Hardlines,Media,Softlines
order_day,,,,
2017-01-01,104467.010325,211688.484896,52140.007166,39835.061817
2017-01-02,137732.111356,279720.766123,69765.86296,50933.915968
2017-01-03,132243.179295,279593.050424,76703.362539,46512.437495
2017-01-04,130234.835836,264975.041811,75100.633735,47089.022667
2017-01-05,120421.207782,258053.916394,73082.137475,47399.367667


Here, the "index" argument determines the row grouping (just like "rows" in an Excel pivot table).  The "columns" and "values" arguments are also the same as their Excel counterparts.

Note: a dataframe's columns form an index just like its rows do. A side effect of .pivot_table() is that it gives the column index a name (here product_family, visible in the header above), so we'll use the line below to remove it:

In [29]:
df_pivot.columns = list(df_pivot.columns)
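A small sketch of what that line is cleaning up, using a made-up long-format frame. After pivoting, the column index carries the name of the pivoted column; `rename_axis(columns=None)` is one other way to clear it, shown here as an alternative to rebuilding the column list.

```python
import pandas as pd

# hypothetical long-format data
long_df = pd.DataFrame({
    "order_day": pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"]
    ),
    "product_family": ["Media", "Consumables", "Media", "Consumables"],
    "units": [1.0, 2.0, 3.0, 4.0],
})

wide = long_df.pivot_table(
    index="order_day", columns="product_family", values="units", aggfunc="sum"
)
print(wide.columns.name)   # the column index is named after the pivoted column

flat = wide.rename_axis(columns=None)  # clear the column index name
print(flat.columns.name)
```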

# Unpivot a dataframe using the .melt() function

Now we'll essentially undo the pivot transformation we just applied. Reversing a pivot is difficult in Excel, but it's one line in Pandas with .melt(). This function doesn't work very well with indices, so first we'll reset the order_day index so it becomes a column again:

In [30]:
df_pivot = df_pivot.reset_index()
df_pivot.head()

Unnamed: 0,order_day,Consumables,Hardlines,Media,Softlines
0,2017-01-01,104467.010325,211688.484896,52140.007166,39835.061817
1,2017-01-02,137732.111356,279720.766123,69765.86296,50933.915968
2,2017-01-03,132243.179295,279593.050424,76703.362539,46512.437495
3,2017-01-04,130234.835836,264975.041811,75100.633735,47089.022667
4,2017-01-05,120421.207782,258053.916394,73082.137475,47399.367667


In [31]:
df_unpivot = pd.melt(frame=df_pivot, id_vars=['order_day'], value_name='units', var_name='product_family')

# then set the index back to order_day
df_unpivot = df_unpivot.set_index('order_day')

df_unpivot.head()

order_day,product_family,units
2017-01-01,Consumables,104467.010325
2017-01-02,Consumables,137732.111356
2017-01-03,Consumables,132243.179295
2017-01-04,Consumables,130234.835836
2017-01-05,Consumables,120421.207782


We have now unpivoted product_family so it is a column again.  Keep in mind we could have done this all in one line, including resetting and setting the index, but we broke it out into steps for illustration.
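For reference, the one-line version could look like the chained sketch below. It runs on a small made-up wide frame (`wide` is hypothetical) rather than on `df_pivot`, but the three steps are the same: reset the index, melt, and set the index again.

```python
import pandas as pd

# hypothetical wide frame: one row per day, one column per product family
wide = pd.DataFrame(
    {"Consumables": [2.0, 4.0], "Media": [1.0, 3.0]},
    index=pd.to_datetime(["2017-01-01", "2017-01-02"]),
)
wide.index.name = "order_day"

# reset the index, unpivot, and restore the index in one chained expression
tidy = (
    wide.reset_index()
    .melt(id_vars=["order_day"], var_name="product_family", value_name="units")
    .set_index("order_day")
)

print(tidy)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps a one-liner readable.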

# Exercises: You try!

1) What is the GL with the highest OPS? (_hint:_ you can use the `df.groupby(cols).aggregate({})` syntax)

In [45]:
df.groupby('gl').aggregate({'ops':'sum'})['ops'].nlargest()

gl
PC                        4.221376e+08
Health & Personal Care    3.727017e+08
Home                      3.404439e+08
Apparel                   2.948230e+08
Wireless                  2.604885e+08
Name: ops, dtype: float64

2) What percentage of the Consumables OPS does Amazon Pantry have?

In [37]:
df[(df['gl']=='Amazon_Pantry') & (df['product_family']=='Consumables')]['ops'].sum() / df[df['product_family']=='Consumables']['ops'].sum()

0.019549860811198094

3) What date had the highest total OPS?

In [46]:
# returns the date of the single row with the largest ops; for daily totals use df.groupby(level=0)['ops'].sum().idxmax()
df['ops'].idxmax()

Timestamp('2017-11-27 00:00:00')

4) Create a pivot table of OPS with gl as the row headers and order_day as the columns

In [47]:
df.pivot_table(values= 'ops', index = 'gl', columns = 'order_day')

order_day,2017-01-01 00:00:00,2017-01-02 00:00:00,2017-01-03 00:00:00,2017-01-04 00:00:00,2017-01-05 00:00:00,2017-01-06 00:00:00,2017-01-07 00:00:00,2017-01-08 00:00:00,2017-01-09 00:00:00,2017-01-10 00:00:00,...,2017-12-22 00:00:00,2017-12-23 00:00:00,2017-12-24 00:00:00,2017-12-25 00:00:00,2017-12-26 00:00:00,2017-12-27 00:00:00,2017-12-28 00:00:00,2017-12-29 00:00:00,2017-12-30 00:00:00,2017-12-31 00:00:00
gl,,,,,,,,,,,,,,,,,,,,,
Amazon_Pantry,46576.224274,6.114843e+04,5.996063e+04,5.899154e+04,5.403202e+04,51928.496274,61172.924939,63383.543121,5.821150e+04,5.285262e+04,...,2.795247e+04,25863.896954,1.878236e+04,2.566206e+04,5.436277e+04,5.462908e+04,5.840137e+04,5.614708e+04,5.096925e+04,4.642429e+04
Apparel,518385.466484,6.623158e+05,6.288446e+05,6.768550e+05,6.406175e+05,596412.157884,691937.966129,714923.515652,6.934735e+05,6.222569e+05,...,8.238651e+05,580099.161833,4.978391e+05,6.823170e+05,9.522871e+05,9.590389e+05,9.082524e+05,8.510218e+05,7.684187e+05,6.921696e+05
Art,134.004846,1.942962e+02,8.720298e+01,9.745591e+01,1.552267e+02,167.495657,92.785449,141.812250,3.274877e+01,1.234564e+02,...,1.216920e+02,79.593028,2.953275e+02,7.179918e+01,5.830757e+01,9.089160e+01,1.483489e+01,1.316105e+02,2.083224e+01,4.951967e+01
Art and Craft Supplies,1973.414864,2.297702e+03,2.281373e+03,2.305350e+03,2.316957e+03,2197.718275,2209.776651,2809.462790,2.175329e+03,2.114215e+03,...,5.693392e+03,2806.706719,2.062221e+03,3.395585e+03,4.493052e+03,4.597432e+03,4.307272e+03,3.489073e+03,3.368008e+03,2.875328e+03
Automotive,281442.376425,3.855640e+05,3.808311e+05,3.813907e+05,3.611299e+05,340747.444892,326953.640067,344014.434532,3.733829e+05,3.680268e+05,...,3.832572e+05,328676.287172,2.635753e+05,3.167934e+05,5.220134e+05,4.952357e+05,4.710639e+05,4.459765e+05,4.090563e+05,3.223932e+05
Baby,194193.327805,2.552986e+05,2.402196e+05,2.155808e+05,2.170896e+05,186290.543770,212208.011196,231301.653721,2.216316e+05,2.067340e+05,...,1.802510e+05,137415.271657,1.041553e+05,1.298765e+05,2.559942e+05,2.604417e+05,2.564969e+05,2.524372e+05,2.400722e+05,1.954617e+05
Beauty,479415.488944,5.944935e+05,5.588277e+05,5.337757e+05,5.316271e+05,470699.957861,505548.552018,547890.746005,5.011247e+05,4.747617e+05,...,6.141837e+05,472066.410093,3.590891e+05,4.645807e+05,6.786474e+05,7.294482e+05,6.627847e+05,6.725486e+05,6.449613e+05,5.179577e+05
Biss,95245.396917,1.481909e+05,2.044894e+05,2.264700e+05,2.091792e+05,181901.653913,122703.969921,127961.078261,2.235601e+05,2.120468e+05,...,1.985145e+05,112350.462774,7.906367e+04,9.521223e+04,1.995270e+05,2.314848e+05,2.387126e+05,2.222787e+05,1.478823e+05,1.421530e+05
Books,640274.443833,9.549175e+05,1.197001e+06,1.182931e+06,1.081670e+06,927515.164528,738663.623642,853111.445407,1.403776e+06,1.416396e+06,...,6.939132e+05,414856.000504,3.247715e+05,4.111223e+05,7.017016e+05,7.810182e+05,7.487924e+05,7.591506e+05,6.406607e+05,5.251130e+05
Camera,228127.158454,3.066761e+05,3.256328e+05,3.111772e+05,3.115571e+05,272390.230867,242586.009897,300594.001482,2.975485e+05,3.143490e+05,...,4.201503e+05,286550.196238,2.267815e+05,3.501082e+05,4.463638e+05,4.142560e+05,3.962701e+05,3.718388e+05,3.129247e+05,2.833519e+05


# Combining multiple data sets with pandas using Merge

In [48]:
import pandas as pd
%matplotlib inline

# define variables as much as possible
# this makes your code more re-usable and readable
# remember, a variable is where a value is stored, a string is the value
# file_path: variable to tell me where to find the file
# holidays_url: url for a list of holidays on github

holidays_url = 'https://gist.githubusercontent.com/shivaas/4758439/raw/'


In [49]:
# the file has no header row, so by default pandas mistakes the first data row for column names
holidays = pd.read_csv(holidays_url)
holidays.head()

Unnamed: 0,1,2012-01-02,New Year Day
0,2,2012-01-16,Martin Luther King Jr. Day
1,3,2012-02-20,Presidents Day (Washingtons Birthday)
2,4,2012-05-28,Memorial Day
3,5,2012-07-04,Independence Day
4,6,2012-09-03,Labor Day


In [50]:
# the file has no header row, so pass header=None and supply the column names ourselves
holidays = pd.read_csv(holidays_url, header=None, names=['date','holiday'])

# check the data types of the holiday columns
holidays.head()
holidays.dtypes

date       object
holiday    object
dtype: object
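Judging by the preview above, each row of the holiday file appears to have three fields: an id, a date, and a name. The sketch below names all three explicitly on a two-line in-memory sample (copied from that preview) and parses the date while reading. The column name `id` and the use of `index_col` are assumptions made for illustration, not something the original file defines.

```python
import io
import pandas as pd

# two rows in the same shape as the holiday file preview
raw = io.StringIO(
    "1,2012-01-02,New Year Day\n"
    "2,2012-01-16,Martin Luther King Jr. Day\n"
)

sample_holidays = pd.read_csv(
    raw,
    header=None,                       # the file has no header row
    names=["id", "date", "holiday"],   # one name per field
    index_col="id",                    # keep the id out of the data columns
    parse_dates=["date"],              # convert the date while reading
)

print(sample_holidays.dtypes)
```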

In [51]:
holidays['date'] = pd.to_datetime(holidays['date'])
holidays = holidays.set_index('date')

In [52]:
sales_data_with_holidays = pd.merge(df, holidays, right_index=True, left_index=True, how='left')

In [53]:
# sales_data_with_holidays['holiday'] = sales_data_with_holidays['holiday'].fillna('No holiday')
sales_data_with_holidays.columns

Index(['marketplace_id', 'order_week', 'gl', 'product_family', 'units', 'ops', 'holiday'], dtype='object')

In [54]:
sales_data_with_holidays[~sales_data_with_holidays['holiday'].isnull()].head()

Unnamed: 0,marketplace_id,order_week,gl,product_family,units,ops,holiday
2017-01-02,1,2017-01-07,Amazon_Pantry,Consumables,13828.318507,61148.434296,New Year Day
2017-01-02,1,2017-01-07,Apparel,Softlines,33581.730644,662315.758715,New Year Day
2017-01-02,1,2017-01-07,Art,Hardlines,0.594952,194.296205,New Year Day
2017-01-02,1,2017-01-07,Art and Craft Supplies,Hardlines,182.589249,2297.702145,New Year Day
2017-01-02,1,2017-01-07,Automotive,Hardlines,11318.818093,385564.035064,New Year Day
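A compact sketch of the same left merge on made-up data, to show what happens to rows without a match. With `how='left'` every sales row is kept, and days that are not in the holiday table get a missing value, which can then be filled with a label such as "No holiday".

```python
import pandas as pd

# hypothetical sales rows indexed by date (two rows share 2017-01-02)
sales = pd.DataFrame(
    {"units": [5.0, 7.0, 9.0]},
    index=pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-02"]),
)

# hypothetical holiday table indexed by date
holiday_table = pd.DataFrame(
    {"holiday": ["New Year Day"]},
    index=pd.to_datetime(["2017-01-02"]),
)

# left merge on the two date indexes: all sales rows are kept
merged = pd.merge(sales, holiday_table, left_index=True, right_index=True, how="left")

# label the rows that had no matching holiday
labelled = merged.fillna({"holiday": "No holiday"})
print(labelled)
```

This is the pandas counterpart of a VLOOKUP: the holiday name is looked up for every sales row by its date.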


# Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.
* Describe three examples when Pandas would be better than using Excel directly.
* Complete the next example that uses Pandas to clean a dataset. 

# Further Reading
This section provides more resources on the topic if you are looking to go deeper.

## Books
* Python for Data Analysis, by Wes McKinney. http://shop.oreilly.com/product/0636920023784.do

## APIs
* Pandas. https://pandas.pydata.org/

## Articles
* Getting started with Pandas in 5 minutes, on Towards Data Science. https://medium.com/bhavaniravi/python-pandas-tutorial-92018da85a33
* My Pandas Cheat Sheet, on Towards Data Science. https://towardsdatascience.com/my-python-pandas-cheat-sheet-746b11e44368
    
# Summary

In this tutorial, you used Pandas for more advanced data analysis. Specifically, you learned:
* Basic indexing and working with dates
* Reading data from multiple sources
* Merging data (joins/vlookup)
* Groupby, pivot_table, transform, melt

# Next

In the next section, you will use Pandas to explore a dataset. 