# 6 - Pivot Table
In this sixth step, I'll show you how to reshape your data using a pivot table.

A pivot table gives us a condensed summary of the data.

We'll reshape the data so that we can see how much each customer spent in each category.

In [2]:
import pandas as pd 
import numpy as np

df = pd.read_json("customer_data.json", convert_dates=False)
df.head()

Unnamed: 0,amount,category,city,customer_id,date,frequently_bought_together,lat_lon,purchase,related_items,state,zip_code
0,24.64,household,Chicago,100191,1-Jan-14,towels,"41.86,-87.619",soap,towels,IL,60605
1,35.0,clothing,Dallas,100199,2-Jan-14,sandals,"32.924,-96.547",shorts,belts,TX,75089
2,89.72,outdoor,Philadelphia,100170,3-Jan-14,lawn bags,"40.002,-75.118",lawn_mower,shovels,PA,19019
3,51.32,electronics,Chicago,100124,4-Jan-14,headphones,"41.88,-87.63",laptop,headphones,IL,60603
4,81.75,outdoor,Philadelphia,100173,5-Jan-14,sponge,"39.953,-75.166",car wash,sponge,PA,19102


Taking a quick look using the <code>.head()</code> method, we can see all of the columns and the first few rows of the data.

For this example, let's just use the first 50 rows of the data. 

In [20]:
df_subset = df[0:50]
df_subset

Unnamed: 0,amount,category,city,customer_id,date,frequently_bought_together,lat_lon,purchase,related_items,state,zip_code
0,24.64,household,Chicago,100191,1-Jan-14,towels,"41.86,-87.619",soap,towels,IL,60605
1,35.0,clothing,Dallas,100199,2-Jan-14,sandals,"32.924,-96.547",shorts,belts,TX,75089
2,89.72,outdoor,Philadelphia,100170,3-Jan-14,lawn bags,"40.002,-75.118",lawn_mower,shovels,PA,19019
3,51.32,electronics,Chicago,100124,4-Jan-14,headphones,"41.88,-87.63",laptop,headphones,IL,60603
4,81.75,outdoor,Philadelphia,100173,5-Jan-14,sponge,"39.953,-75.166",car wash,sponge,PA,19102
5,29.16,outdoor,San Diego,100116,6-Jan-14,fertilizer,"33.143,-117.03",lawn mower,rakes,CA,92027
6,50.71,outdoor,Dallas,100105,7-Jan-14,bbq sauce,"32.745,-96.46",grill,grill cleaner,TX,75126
7,35.03,household,San Antonio,100148,8-Jan-14,spray bottles,"29.502,-98.306",household cleaner,spray bottles,TX,78109
8,30.55,appliances,Philadelphia,100118,9-Jan-14,pot holders,"39.953,-75.166",slow cooker,tupperware,PA,19102
9,92.01,electronics,Dallas,100106,10-Jan-14,camera lens,"32.917,-96.973",camera,lens cleaner,TX,75126


Let's take a look at the types for each column using the <code>.dtypes</code> attribute.

In [15]:
df_subset.dtypes

amount                        object
category                      object
city                          object
customer_id                    int64
date                          object
frequently_bought_together    object
lat_lon                       object
purchase                      object
related_items                 object
state                         object
zip_code                       int64
dtype: object

The <code>amount</code> column should be a numeric type, but Pandas has read it in as an <code>object</code>. Let's go ahead and change that column to a numeric <code>float</code> type using the <code>.astype()</code> method.

In [21]:
# Work on an explicit copy so the assignment below doesn't trigger a SettingWithCopyWarning
df_subset = df_subset.copy()
df_subset["amount"] = df_subset["amount"].astype(float)
df_subset.dtypes

amount                        float64
category                       object
city                           object
customer_id                     int64
date                           object
frequently_bought_together     object
lat_lon                        object
purchase                       object
related_items                  object
state                          object
zip_code                        int64
dtype: object

Now we can see that the <code>amount</code> column is a numeric <code>float</code> type. 
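
One thing to keep in mind is that <code>.astype(float)</code> raises an error if any value can't be parsed as a number. As a minimal sketch, assuming a made-up dataframe called <code>toy</code> with one malformed value (not something from <code>customer_data.json</code>), <code>pd.to_numeric</code> with <code>errors="coerce"</code> is one alternative that turns unparseable values into <code>NaN</code> instead.

```python
import pandas as pd

# Hypothetical toy data: amounts stored as strings, one of them malformed.
toy = pd.DataFrame({"amount": ["24.64", "35.0", "n/a"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
toy["amount"] = pd.to_numeric(toy["amount"], errors="coerce")

print(toy.dtypes)
print(toy)
```

Whether silently coercing bad values is acceptable depends on your data, so it's worth counting the resulting <code>NaN</code> values before moving on.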

We don't need all of the columns, just the <code>customer_id</code>, <code>category</code>, and <code>amount</code> columns.

Here's what that smaller dataframe would look like.

In [17]:
df_subset[["customer_id", "category", "amount"]]

Unnamed: 0,customer_id,category,amount
0,100191,household,24.64
1,100199,clothing,35.0
2,100170,outdoor,89.72
3,100124,electronics,51.32
4,100173,outdoor,81.75
5,100116,outdoor,29.16
6,100105,outdoor,50.71
7,100148,household,35.03
8,100118,appliances,30.55
9,100106,electronics,92.01


Let's finish up by creating our <code>pivot_table</code>.

We'll set the index to <code>customer_id</code>, the columns to <code>category</code>, and the values to <code>amount</code>. This will reshape the data so that we can see how much each customer spent in each category. Let's create this using a new dataframe called <code>df_pivot</code>.

The final important point before we reshape the data is the <code>aggfunc</code> parameter. Since customers probably made multiple purchases in the same category, we'll want to add up all of those purchases. We'll do that using NumPy's <code>sum</code> function. I've shortened the NumPy library name to <code>np</code>, so that's why I've set the <code>aggfunc</code> to <code>np.sum</code>.

In [23]:
# pivot table; aggregation function "sum"

df_pivot = df_subset.pivot_table(index="customer_id", columns="category", values="amount", aggfunc=np.sum)
print(df_pivot)

category     appliances  clothing  electronics  house  household  outdoor
customer_id                                                              
100102              NaN       NaN          NaN  70.66        NaN      NaN
100103              NaN       NaN        78.61    NaN        NaN      NaN
100105              NaN       NaN          NaN    NaN        NaN    50.71
100106              NaN       NaN       183.88    NaN        NaN      NaN
100109              NaN       NaN          NaN    NaN        NaN    31.79
100111              NaN       NaN          NaN    NaN        NaN    77.28
100116              NaN       NaN          NaN    NaN        NaN    71.07
100118            30.55       NaN          NaN    NaN        NaN      NaN
100120              NaN     86.29          NaN    NaN        NaN      NaN
100123            34.57       NaN          NaN    NaN        NaN      NaN
100124              NaN       NaN        89.93    NaN        NaN      NaN
100133              NaN     23.69     

Now we have a new dataframe showing how much each customer spent in each category. 
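
To see what the aggregation is doing, here's a minimal sketch on a tiny made-up dataframe (the names <code>toy</code> and <code>toy_pivot</code> and all of the values are hypothetical). Customer 1 has two <code>outdoor</code> purchases, which the pivot table adds together. I've passed the string <code>"sum"</code> here, which pandas treats as the same aggregation as <code>np.sum</code>.

```python
import pandas as pd

# Hypothetical toy data: customer 1 bought from "outdoor" twice.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "category": ["outdoor", "outdoor", "clothing"],
    "amount": [10.0, 5.5, 20.0],
})

# Rows sharing the same customer_id and category are summed into one cell.
toy_pivot = toy.pivot_table(index="customer_id", columns="category",
                            values="amount", aggfunc="sum")
print(toy_pivot)
```

The two <code>outdoor</code> purchases of 10.0 and 5.5 end up as a single cell of 15.5 for customer 1.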

There are a lot of <code>NaN</code> values because many customers didn't spend any money in certain categories.
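
If you'd rather see a zero than a missing value for those combinations, <code>pivot_table</code> accepts a <code>fill_value</code> argument. This is a small sketch on made-up data (the <code>toy</code> and <code>toy_filled</code> names are hypothetical). It assumes that "no purchase" really does mean 0 spent, which may not fit every analysis.

```python
import pandas as pd

# Hypothetical toy data: each customer bought from only one category.
toy = pd.DataFrame({
    "customer_id": [1, 2],
    "category": ["outdoor", "clothing"],
    "amount": [10.0, 20.0],
})

# fill_value=0 replaces the missing customer/category combinations with 0.
toy_filled = toy.pivot_table(index="customer_id", columns="category",
                             values="amount", aggfunc="sum", fill_value=0)
print(toy_filled)
```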

You should also note that there are both a <code>house</code> and a <code>household</code> column. We need to clean the data so that we have consistent strings before we reshape it. Look back at <strong>Step 3 - Consistent Strings</strong> to help you with that.
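
As a rough sketch of why that matters, here's a made-up example where I'm assuming <code>house</code> is just an inconsistent spelling of <code>household</code>. It uses <code>.replace()</code> with a dictionary, which may not be the same technique covered in Step 3, and the <code>toy</code> data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the same category spelled two different ways.
toy = pd.DataFrame({
    "customer_id": [1, 1],
    "category": ["house", "household"],
    "amount": [70.0, 30.0],
})

# Map the inconsistent label onto the one we want to keep (exact matches only).
toy["category"] = toy["category"].replace({"house": "household"})

toy_pivot = toy.pivot_table(index="customer_id", columns="category",
                            values="amount", aggfunc="sum")
print(toy_pivot)
```

After the cleanup there's only one <code>household</code> column, and both purchases are counted in it.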