# Combining DataFrames

In this notebook, you'll see two different ways to combine _pandas_ DataFrames.

In [1]:
import pandas as pd

# Merging DataFrames

First, we'll import a DataFrame containing information on the revenue and quantity for sales that occurred in the year 2012.

In [2]:
sales_2012 = pd.read_csv('../data/sales_2012.csv')
sales_2012.head()

Unnamed: 0,Sale_ID,Year,Quarter,Revenue,Quantity
0,1,2012,Q1 2012,59628.66,489
1,3,2012,Q1 2012,89940.48,147
2,4,2012,Q1 2012,165883.41,303
3,5,2012,Q1 2012,119822.2,1415
4,6,2012,Q1 2012,87728.96,352


Next, we'll bring in a dataset which shows product, retailer, and order method information.

In [3]:
products = pd.read_csv('../data/products.csv')
products.head()

Unnamed: 0,Sale_ID,Retailer_country,Order_method_type,Retailer_type,Product_line,Product_type,Product
0,1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set
1,2,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Double Flame
2,900000,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Dome
3,4,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2
4,5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite


Notice that these two DataFrames can be linked together through the `Sale_ID` column. Let's merge these together so that we can do some further analysis.

Recall that the general syntax for merging DataFrames in pandas is:

```pd.merge(left_dataframe, right_dataframe, how = <how to merge>, on = <column to merge on>)```
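As a quick illustration of what the `how` argument changes, here is a minimal sketch on two tiny made-up DataFrames. The names `left` and `right` and the values in them are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's CSV files
left = pd.DataFrame({'Sale_ID': [1, 2, 4],
                     'Product': ['Cook Set', 'Double Flame', 'Star Gazer 2']})
right = pd.DataFrame({'Sale_ID': [1, 3, 4],
                      'Revenue': [100.0, 200.0, 300.0]})

# 'inner' keeps only the Sale_IDs found in both DataFrames
inner = pd.merge(left, right, how='inner', on='Sale_ID')

# 'right' keeps every row of the right DataFrame;
# left-hand columns are NaN where no match exists (Sale_ID 3 here)
right_join = pd.merge(left, right, how='right', on='Sale_ID')

# 'outer' keeps every Sale_ID that appears on either side
outer = pd.merge(left, right, how='outer', on='Sale_ID')

print(right_join)
```

With `how = 'right'`, no row of the right DataFrame is dropped, which is why it suits a case where every sale should survive the merge even if its product details are missing.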

In [4]:
combined_data = pd.merge(products, sales_2012, how = 'right', on = 'Sale_ID')

In [5]:
combined_data.head()

Unnamed: 0,Sale_ID,Retailer_country,Order_method_type,Retailer_type,Product_line,Product_type,Product,Year,Quarter,Revenue,Quantity
0,1,United States,Fax,Outdoors Shop,Camping Equipment,Cooking Gear,TrailChef Deluxe Cook Set,2012,Q1 2012,59628.66,489
1,3,,,,,,,2012,Q1 2012,89940.48,147
2,4,United States,Fax,Outdoors Shop,Camping Equipment,Tents,Star Gazer 2,2012,Q1 2012,165883.41,303
3,5,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Lite,2012,Q1 2012,119822.2,1415
4,6,United States,Fax,Outdoors Shop,Camping Equipment,Sleeping Bags,Hibernator Extreme,2012,Q1 2012,87728.96,352


Looks like we have some missing values.

In [6]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33895 entries, 0 to 33894
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Sale_ID            33895 non-null  int64  
 1   Retailer_country   33894 non-null  object 
 2   Order_method_type  33894 non-null  object 
 3   Retailer_type      33894 non-null  object 
 4   Product_line       33894 non-null  object 
 5   Product_type       33894 non-null  object 
 6   Product            33894 non-null  object 
 7   Year               33895 non-null  int64  
 8   Quarter            33895 non-null  object 
 9   Revenue            33895 non-null  float64
 10  Quantity           33895 non-null  int64  
dtypes: float64(1), int64(3), object(7)
memory usage: 3.1+ MB


In [7]:
## How many values are we missing from each column?

combined_data.isnull().sum()

Sale_ID              0
Retailer_country     1
Order_method_type    1
Retailer_type        1
Product_line         1
Product_type         1
Product              1
Year                 0
Quarter              0
Revenue              0
Quantity             0
dtype: int64

Notice that the row with a `Sale_ID` of 3 seems to be missing all of its product information. Because we used a right merge, every sale is kept even when it has no match in the products data. Let's double-check that this sale really is absent from the products data.

In [8]:
products[products['Sale_ID'] == 3]

Unnamed: 0,Sale_ID,Retailer_country,Order_method_type,Retailer_type,Product_line,Product_type,Product
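Another way to find unmatched rows, sketched here on made-up data, is the `indicator` argument of `pd.merge`, which adds a `_merge` column recording where each row came from. The DataFrames `products_toy` and `sales_toy` are hypothetical stand-ins for `products` and `sales_2012`.

```python
import pandas as pd

# Hypothetical toy data standing in for products and sales_2012
products_toy = pd.DataFrame({'Sale_ID': [1, 2, 4],
                             'Product': ['A', 'B', 'C']})
sales_toy = pd.DataFrame({'Sale_ID': [1, 3, 4],
                          'Revenue': [10.0, 20.0, 30.0]})

# indicator=True adds a _merge column with the values
# 'both', 'left_only', or 'right_only'
checked = pd.merge(products_toy, sales_toy, how='right',
                   on='Sale_ID', indicator=True)

# Sales rows for which no matching product was found
unmatched = checked.loc[checked['_merge'] == 'right_only', 'Sale_ID']
print(unmatched.tolist())
```

This lists every unmatched `Sale_ID` at once, so there is no need to look them up one at a time.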


The filter returns no rows, so `Sale_ID` 3 is not in the products data. Now that the data is combined, we can start asking questions of it.

#### Question 1: Which product type generated the most total revenue in 2012?

In [9]:
combined_data.groupby('Product_type')['Revenue'].sum().sort_values(ascending = False)

Product_type
Eyewear                 2.073434e+08
Watches                 1.396313e+08
Tents                   1.368340e+08
Packs                   8.707664e+07
Sleeping Bags           7.678566e+07
Cooking Gear            7.037098e+07
Woods                   6.938677e+07
Irons                   5.437231e+07
Navigation              4.344437e+07
Tools                   3.406203e+07
Knives                  3.287776e+07
Binoculars              3.009162e+07
Lanterns                2.968577e+07
Rope                    2.865527e+07
Putters                 2.832562e+07
Safety                  2.250587e+07
Climbing Accessories    2.187649e+07
Golf Accessories        1.312712e+07
Insect Repellents       1.144256e+07
Sunscreen               1.046941e+07
First Aid               2.861514e+06
Name: Revenue, dtype: float64
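If only the top answer is needed rather than the full ranking, `idxmax` and `nlargest` are possible shortcuts. The sketch below uses a small made-up DataFrame named `toy`; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's sales figures
toy = pd.DataFrame({
    'Product_type': ['Eyewear', 'Tents', 'Eyewear', 'Watches'],
    'Revenue': [100.0, 250.0, 200.0, 150.0],
})

# Total revenue per product type
revenue_by_type = toy.groupby('Product_type')['Revenue'].sum()

# Label of the product type with the largest total
top_type = revenue_by_type.idxmax()

# The three largest totals, already sorted in descending order
top_three = revenue_by_type.nlargest(3)

print(top_type)
print(top_three)
```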

#### Question 2: What was our highest volume product?

In [10]:
combined_data.groupby('Product')['Quantity'].sum().sort_values(ascending = False)

Product
Zone                    1515204
TrailChef Water Bag     1119205
Granite Carabiner        833945
BugShield Extreme        817666
Single Edge              784149
                         ...   
Trail Star                 8434
Trail Master               7618
Polar Wave                 6654
Mountain Man Extreme       5484
Polar Extreme              3754
Name: Quantity, Length: 143, dtype: int64

What is Zone?

In [11]:
combined_data.loc[combined_data['Product'] == 'Zone']

Unnamed: 0,Sale_ID,Retailer_country,Order_method_type,Retailer_type,Product_line,Product_type,Product,Year,Quarter,Revenue,Quantity
52,54,United States,Telephone,Golf Shop,Personal Accessories,Eyewear,Zone,2012,Q1 2012,4574.3,149
126,128,United States,Telephone,Department Store,Personal Accessories,Eyewear,Zone,2012,Q1 2012,134472.6,4401
228,230,United States,Telephone,Outdoors Shop,Personal Accessories,Eyewear,Zone,2012,Q1 2012,86166.9,2802
291,293,United States,Telephone,Sports Store,Personal Accessories,Eyewear,Zone,2012,Q1 2012,35687.9,1169
401,403,United States,Web,Golf Shop,Personal Accessories,Eyewear,Zone,2012,Q1 2012,70464.7,2302
...,...,...,...,...,...,...,...,...,...,...,...
33605,33685,Italy,Web,Outdoors Shop,Personal Accessories,Eyewear,Zone,2012,Q4 2012,197792.3,6338
33639,33719,Italy,Web,Eyewear Store,Personal Accessories,Eyewear,Zone,2012,Q4 2012,31953.5,1043
33674,33754,Italy,Web,Sports Store,Personal Accessories,Eyewear,Zone,2012,Q4 2012,60664.2,1899
33785,34019,Spain,Web,Outdoors Shop,Personal Accessories,Eyewear,Zone,2012,Q4 2012,125030.4,4056


#### Question 3: For which retailer type do we have the highest sales quantity of Zone?

In [12]:
combined_data.loc[combined_data['Product'] == 'Zone'].groupby('Retailer_type')['Quantity'].sum().sort_values(ascending = False)

Retailer_type
Outdoors Shop       550211
Sports Store        450354
Department Store    215524
Golf Shop           170221
Eyewear Store       128894
Name: Quantity, dtype: int64

# Concatenating DataFrames

Notice that we also have access to sales data for 2013. Let's read it in.

In [13]:
sales_2013 = pd.read_csv('../data/sales_2013.csv')
sales_2013.head(2)

Unnamed: 0,Sale_ID,Year,Quarter,Revenue,Quantity
0,34129,2013,Q1 2013,19418.52,3102
1,34130,2013,Q1 2013,42304.32,794


This data looks to be formatted in the same way as our 2012 sales data.

In [14]:
sales_2012.head(2)

Unnamed: 0,Sale_ID,Year,Quarter,Revenue,Quantity
0,1,2012,Q1 2012,59628.66,489
1,3,2012,Q1 2012,89940.48,147


What if we want to combine these two DataFrames? In this case, we don't want to merge, as each record should still have its own row in the result. Instead, this is a time when we want to **concatenate**.

To concatenate DataFrames, we can pass the DataFrames that we want to combine as a list into the `pd.concat` function.

In [15]:
pd.concat([sales_2012, sales_2013])

Unnamed: 0,Sale_ID,Year,Quarter,Revenue,Quantity
0,1,2012,Q1 2012,59628.66,489
1,3,2012,Q1 2012,89940.48,147
2,4,2012,Q1 2012,165883.41,303
3,5,2012,Q1 2012,119822.20,1415
4,6,2012,Q1 2012,87728.96,352
...,...,...,...,...,...
32940,67147,2013,Q4 2013,10286.60,190
32941,67148,2013,Q4 2013,27512.00,181
32942,67149,2013,Q4 2013,5564.98,142
32943,67150,2013,Q4 2013,14309.44,1844


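Note that while the result has 66840 rows, the index value at the end of the DataFrame is only 32944. By default, `pd.concat` keeps each DataFrame's original index, so the labels start again at 0 for the 2013 rows. We can renumber the rows of the result by using the `ignore_index` argument.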

In [16]:
pd.concat([sales_2012, sales_2013], ignore_index = True)

Unnamed: 0,Sale_ID,Year,Quarter,Revenue,Quantity
0,1,2012,Q1 2012,59628.66,489
1,3,2012,Q1 2012,89940.48,147
2,4,2012,Q1 2012,165883.41,303
3,5,2012,Q1 2012,119822.20,1415
4,6,2012,Q1 2012,87728.96,352
...,...,...,...,...,...
66835,67147,2013,Q4 2013,10286.60,190
66836,67148,2013,Q4 2013,27512.00,181
66837,67149,2013,Q4 2013,5564.98,142
66838,67150,2013,Q4 2013,14309.44,1844
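To see why the repeated index labels can matter, here is a minimal sketch on two made-up DataFrames (`a` and `b` are hypothetical names). Without `ignore_index`, a single label can refer to more than one row.

```python
import pandas as pd

# Hypothetical toy data shaped like the sales tables
a = pd.DataFrame({'Sale_ID': [1, 3], 'Year': [2012, 2012]})
b = pd.DataFrame({'Sale_ID': [34129, 34130], 'Year': [2013, 2013]})

# Default behavior: each DataFrame keeps its own index labels (0, 1, 0, 1)
stacked = pd.concat([a, b])

# Label 0 now matches two rows, one from each DataFrame
rows_at_zero = stacked.loc[0]

# ignore_index=True discards the old labels and numbers the rows 0..3
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.is_unique)
print(renumbered.index.is_unique)
```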


We also have sales data for 2014. Rather than repeating the same code for each file, we can use a for loop to read in the three DataFrames and then combine them.

In [17]:
# Start with an empty list to hold the individual DataFrames
sales_dfs = []

for filename in ['../data/sales_2012.csv', '../data/sales_2013.csv', '../data/sales_2014.csv']:
    df = pd.read_csv(filename)
    sales_dfs.append(df)

In [18]:
sales_dfs

[       Sale_ID  Year  Quarter    Revenue  Quantity
 0            1  2012  Q1 2012   59628.66       489
 1            3  2012  Q1 2012   89940.48       147
 2            4  2012  Q1 2012  165883.41       303
 3            5  2012  Q1 2012  119822.20      1415
 4            6  2012  Q1 2012   87728.96       352
 ...        ...   ...      ...        ...       ...
 33890    34124  2012  Q4 2012   32090.31       189
 33891    34125  2012  Q4 2012   17007.36      1648
 33892    34126  2012  Q4 2012   23949.20      1938
 33893    34127  2012  Q4 2012   36899.06       179
 33894    34128  2012  Q4 2012   11011.28      1055
 
 [33895 rows x 5 columns],
        Sale_ID  Year  Quarter   Revenue  Quantity
 0        34129  2013  Q1 2013  19418.52      3102
 1        34130  2013  Q1 2013  42304.32       794
 2        34131  2013  Q1 2013  52266.32       824
 3        34132  2013  Q1 2013   5211.64      2659
 4        34133  2013  Q1 2013  51714.46       521
 ...        ...   ...      ...       ... 

In [19]:
sales = pd.concat(sales_dfs, ignore_index = True)

In [20]:
sales

Unnamed: 0,Sale_ID,Year,Quarter,Revenue,Quantity
0,1,2012,Q1 2012,59628.66,489
1,3,2012,Q1 2012,89940.48,147
2,4,2012,Q1 2012,165883.41,303
3,5,2012,Q1 2012,119822.20,1415
4,6,2012,Q1 2012,87728.96,352
...,...,...,...,...,...
88159,88471,2014,Q3 2014,30865.50,171
88160,88472,2014,Q3 2014,7485.29,191
88161,88473,2014,Q3 2014,12255.48,236
88162,88474,2014,Q3 2014,56448.00,1470
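The loop-then-concatenate pattern above can also be written more compactly with a list comprehension. The sketch below is one possible variant, not the notebook's own method. So that it runs without the notebook's data, it first writes three tiny hypothetical CSV files to a temporary folder; the file contents and the name `sales_toy` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write three tiny hypothetical CSV files to a temporary folder
tmp_dir = Path(tempfile.mkdtemp())
for year, sale_id in [(2012, 1), (2013, 2), (2014, 3)]:
    pd.DataFrame({'Sale_ID': [sale_id], 'Year': [year]}).to_csv(
        tmp_dir / f'sales_{year}.csv', index=False
    )

# Collect the matching files, sorted so they are read in year order
filenames = sorted(tmp_dir.glob('sales_*.csv'))

# Read every file into a list of DataFrames, then concatenate once
sales_toy = pd.concat([pd.read_csv(f) for f in filenames],
                      ignore_index=True)

print(sales_toy)
```

Building the list first and calling `pd.concat` once is generally preferable to concatenating inside the loop, since each call to `pd.concat` copies the data it combines.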
