# Combining DataFrames

In this notebook, you'll see two different ways to combine _pandas_ DataFrames.

In [None]:
import pandas as pd

# Merging DataFrames

First, we'll import a DataFrame containing information on the revenue and quantity for sales that occurred in the year 2012.

In [None]:
sales_2012 = pd.read_csv('../data/sales_2012.csv')
sales_2012.head()

Next, we'll bring in a dataset that contains product, retailer, and order method information.

In [None]:
products = pd.read_csv('../data/products.csv')
products.head()

Notice that these two DataFrames can be linked together through the `Sale_ID` column. Let's merge these together so that we can do some further analysis.

Recall that the syntax for merging DataFrames in pandas is:

```pd.merge(left_dataframe, right_dataframe, how = 'how to merge', on = 'column to merge on')```
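To see what the `how` argument does, here is a minimal sketch on two tiny, made-up DataFrames. The names `toy_products` and `toy_sales` and all of the values are hypothetical; only the `Sale_ID` key mirrors the real data.

```python
import pandas as pd

# Hypothetical toy data, not the contents of the real CSV files
toy_products = pd.DataFrame({
    'Sale_ID': [1, 2, 4],
    'Product': ['Zone', 'Lantern', 'Tent'],
})
toy_sales = pd.DataFrame({
    'Sale_ID': [1, 2, 3],
    'Quantity': [10, 5, 7],
})

# A right merge keeps every row of the right DataFrame (toy_sales).
# Sale_ID 3 has no match on the left, so its Product is missing (NaN).
# Sale_ID 4 appears only on the left, so it is dropped.
toy_merged = pd.merge(toy_products, toy_sales, how='right', on='Sale_ID')
print(toy_merged)
```

Changing `how` to `'inner'` would keep only the sales that have matching product information, while `'left'` would keep every row of `toy_products` instead.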

In [None]:
combined_data = pd.merge(products, sales_2012, how = 'right', on = 'Sale_ID')

In [None]:
combined_data.head()

Looks like we have some missing values.

In [None]:
combined_data.info()

In [None]:
## How many values are we missing from each column?

combined_data.isnull().sum()

Notice that we have a row with a `Sale_ID` of 3 that seems to be missing all of its product information. Let's double-check that this sale is not contained in the products data.

In [None]:
products[products['Sale_ID'] == 3]
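If we wanted to find every unmatched sale at once, rather than checking one `Sale_ID` at a time, one option is the `indicator` argument of `pd.merge`, which adds a `_merge` column recording where each row came from. The sketch below uses small, made-up DataFrames (`toy_products` and `toy_sales` are hypothetical names and values).

```python
import pandas as pd

# Hypothetical toy data, not the contents of the real CSV files
toy_products = pd.DataFrame({'Sale_ID': [1, 2],
                             'Product': ['Zone', 'Lantern']})
toy_sales = pd.DataFrame({'Sale_ID': [1, 2, 3],
                          'Quantity': [10, 5, 7]})

# indicator=True adds a _merge column with the values
# 'left_only', 'right_only', or 'both'
toy_checked = pd.merge(toy_products, toy_sales,
                       how='right', on='Sale_ID', indicator=True)

# Rows marked 'right_only' are sales with no matching product row
toy_unmatched = toy_checked[toy_checked['_merge'] == 'right_only']
print(toy_unmatched)
```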

Once combined, we can start asking questions of our data.

#### Question 1: Which product type generated the most total revenue in 2012?

In [None]:
# Try to fill in the code to answer this question

#### Question 2: What was our highest volume product?

In [None]:
combined_data.groupby('Product')['Quantity'].sum().sort_values(ascending = False)
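To make the steps of that chain easier to follow, here is a small sketch on made-up data. The DataFrame `toy` and its values are hypothetical; the column names `Product` and `Quantity` match the ones used above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Product': ['Zone', 'Tent', 'Zone', 'Lantern'],
    'Quantity': [10, 4, 6, 9],
})

# 1. groupby('Product') splits the rows into one group per product
# 2. ['Quantity'].sum() adds up the quantity within each group
# 3. sort_values(ascending=False) puts the largest total first
toy_totals = toy.groupby('Product')['Quantity'].sum().sort_values(ascending=False)
print(toy_totals)

# If we only need the name of the top product, idxmax() returns it directly
toy_top = toy_totals.idxmax()
print(toy_top)
```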

What is Zone?

In [None]:
combined_data.loc[combined_data['Product'] == 'Zone']

#### Question 3: For which retailer type do we have the highest sales quantity of Zone?

In [None]:
combined_data.loc[combined_data['Product'] == 'Zone'].groupby('Retailer_type')['Quantity'].sum().sort_values(ascending = False)

# Concatenating DataFrames

Notice that we also have access to sales data for 2013. Let's read it in.

In [None]:
sales_2013 = pd.read_csv('../data/sales_2013.csv')
sales_2013.head(2)

This data looks to be formatted in the same way as our 2012 sales data.

In [None]:
sales_2012.head(2)

What if we want to combine these two DataFrames? In this case, we don't want to merge, as each record should still have its own row in the result. Instead, this is a time when we want to **concatenate**.

To concatenate DataFrames, we can pass the DataFrames that we want to combine as a list into the `pd.concat` function.

In [None]:
pd.concat([sales_2012, sales_2013])

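Note that while we have 66840 rows, the index value at the end of the DataFrame is only 32944. This is because each DataFrame keeps its original index labels when concatenated. We can renumber the rows of the result from 0 by using the `ignore_index` argument.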

In [None]:
pd.concat([sales_2012, sales_2013], ignore_index = True)
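The effect of `ignore_index` is easier to see on a small scale. In this sketch, `toy_2012` and `toy_2013` are hypothetical DataFrames with made-up values.

```python
import pandas as pd

# Hypothetical toy data with the same columns in each DataFrame
toy_2012 = pd.DataFrame({'Sale_ID': [1, 2], 'Quantity': [10, 5]})
toy_2013 = pd.DataFrame({'Sale_ID': [3, 4, 5], 'Quantity': [7, 2, 8]})

# Without ignore_index, each DataFrame keeps its own index labels,
# so the labels 0 and 1 both appear twice in the result
toy_stacked = pd.concat([toy_2012, toy_2013])
print(toy_stacked.index.tolist())

# With ignore_index=True, the rows are renumbered starting from 0
toy_renumbered = pd.concat([toy_2012, toy_2013], ignore_index=True)
print(toy_renumbered.index.tolist())
```

Repeated index labels are not an error, but they can be surprising later: selecting `toy_stacked.loc[0]` returns two rows instead of one.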

We've also got sales for 2014. Let's see how we could read in all three years and concatenate them.

In [None]:
sales_dfs = []

sales_2012 = pd.read_csv('../data/sales_2012.csv')
sales_dfs.append(sales_2012)

sales_2013 = pd.read_csv('../data/sales_2013.csv')
sales_dfs.append(sales_2013)

sales_2014 = pd.read_csv('../data/sales_2014.csv')
sales_dfs.append(sales_2014)

sales = pd.concat(sales_dfs,
                 ignore_index = True)

In [None]:
sales

While the above code accomplishes what we want, perhaps there is a better way to write it.

Notice that we are reusing the same pattern of code three times:

```
sales_2012 = pd.read_csv('../data/sales_2012.csv')
sales_dfs.append(sales_2012)
```

This is the basic pattern:
```
df = pd.read_csv(filepath)
sales_dfs.append(df)
```

This can be used in a **for loop**, which is a way to direct Python to do the same thing multiple times.

A for loop needs two things:
1. a collection to iterate over
2. directions about what to do for each item in that collection

Here, the collection is the list of filepaths, and the directions are the basic pattern above.

In [None]:
# Start with an empty list to hold the individual DataFrames
sales_dfs = []

for filename in ['../data/sales_2012.csv',
                 '../data/sales_2013.csv',
                 '../data/sales_2014.csv']:
    df = pd.read_csv(filename)
    sales_dfs.append(df)

In [None]:
sales = pd.concat(sales_dfs, ignore_index = True)

In [None]:
sales
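One thing a loop makes easy is doing a little extra work on each DataFrame before it goes into the list. As a sketch, the example below records which year each row came from, so that the source is still identifiable after concatenating. To keep it runnable without the data files, it reads from in-memory CSV text; the `Year` column, the name `fake_files`, and all of the values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text standing in for the yearly files
fake_files = {
    2012: "Sale_ID,Quantity\n1,10\n2,5\n",
    2013: "Sale_ID,Quantity\n3,7\n",
    2014: "Sale_ID,Quantity\n4,2\n5,8\n",
}

toy_dfs = []

for year, csv_text in fake_files.items():
    # io.StringIO lets read_csv treat a string like a file
    df = pd.read_csv(io.StringIO(csv_text))
    # Hypothetical extra column recording the source year
    df['Year'] = year
    toy_dfs.append(df)

toy_all = pd.concat(toy_dfs, ignore_index=True)
print(toy_all)
```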

We'll look at a lot more examples of for loops in the next notebook.