# Combining Datasets: Concat and Append

In [1]:
import numpy as np
import pandas as pd

In [2]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)


class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

**Recall: Concatenation of NumPy Arrays**

In [None]:
np.concatenate([x, y], axis=1)
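
The call above assumes that x and y already exist. As a minimal sketch, here are two small arrays (the values are arbitrary and only for illustration) joined along each axis:

```python
import numpy as np

# Illustrative 2x2 arrays; the values are arbitrary
x = np.array([[1, 2],
              [3, 4]])
y = np.array([[5, 6],
              [7, 8]])

# axis=0 stacks the arrays vertically, axis=1 places them side by side
rows = np.concatenate([x, y], axis=0)   # shape (4, 2)
cols = np.concatenate([x, y], axis=1)   # shape (2, 4)
print(rows.shape, cols.shape)
```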

**Simple Concatenation with pd.concat**

In [11]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2',
        'pd.concat([df1, df2])',
        'pd.concat([df1, df2], axis="columns")')

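df1

| | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |

df2

| | A | B |
|---|---|---|
| 3 | A3 | B3 |
| 4 | A4 | B4 |

pd.concat([df1, df2])

| | A | B |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |

pd.concat([df1, df2], axis="columns")

| | A | B | A | B |
|---|---|---|---|---|
| 1 | A1 | B1 | NaN | NaN |
| 2 | A2 | B2 | NaN | NaN |
| 3 | NaN | NaN | A3 | B3 |
| 4 | NaN | NaN | A4 | B4 |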


**Catching the repeats as an error**  
If you'd like to simply verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag. With this set to True, the concatenation will raise an exception if there are duplicate indices.

In [None]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
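
To make the behaviour concrete, here is a hedged sketch with two illustrative frames that deliberately share the index labels 0 and 1. The names x and y and their contents are assumptions for this example, not data from the notebook.

```python
import pandas as pd

# Two illustrative frames that share the index labels 0 and 1
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Without the flag, the duplicate labels are kept silently
combined = pd.concat([x, y])
print(combined.index.tolist())

# With the flag, overlapping labels raise a ValueError
try:
    pd.concat([x, y], verify_integrity=True)
    raised = False
except ValueError as e:
    raised = True
    print("ValueError:", e)
```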

**Ignoring the index**

In [None]:
pd.concat([x, y], ignore_index=True)
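
A small sketch of the effect, again assuming two illustrative frames whose index labels collide: with ignore_index=True the original labels are discarded and the result gets a fresh integer index.

```python
import pandas as pd

# Illustrative frames with clashing index labels
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# The input labels are dropped; rows are renumbered from 0
result = pd.concat([x, y], ignore_index=True)
print(result)
```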

**Adding MultiIndex keys**  
Another option is to use the keys option to specify a label for each data source; the result is a hierarchically indexed DataFrame whose outer index level holds those labels:

In [None]:
pd.concat([x, y], keys=['x', 'y'])
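
As a sketch under the same assumptions (two illustrative frames with clashing labels), the keys become the outer level of a MultiIndex, so each source can be selected back out by its label:

```python
import pandas as pd

# Illustrative frames with clashing index labels
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Each key labels the rows that came from the corresponding input
result = pd.concat([x, y], keys=['x', 'y'])
print(result)

# Selecting on the outer level recovers the rows of one source
from_y = result.loc['y']
print(from_y)
```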

**Concatenation with joins**  
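By default, the join is a union of the input columns (join='outer'), and entries with no data are filled with NaN. With join='inner', only the columns common to all inputs are kept.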

In [None]:
pd.concat([x, y], join='inner')
# join_axes was removed in pandas 1.0; reindex to keep only x's columns
pd.concat([x, y]).reindex(columns=x.columns)
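
To see how the join options change the resulting columns, here is a sketch with two illustrative frames whose columns only partly overlap. The frames are assumptions made for this example; the reindex call is one way to keep exactly the columns of the first input on pandas versions that no longer accept join_axes.

```python
import pandas as pd

# Illustrative frames: x has columns A, B, C and y has B, C, D
x = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                 index=[1, 2])
y = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                 index=[3, 4])

outer = pd.concat([x, y])                 # union of columns, gaps become NaN
inner = pd.concat([x, y], join='inner')   # intersection of columns
like_x = pd.concat([x, y]).reindex(columns=x.columns)  # only x's columns

print(list(outer.columns), list(inner.columns), list(like_x.columns))
```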

**The append() method**  
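Because direct concatenation is so common, older versions of Pandas gave Series and DataFrame objects an **append** method that accomplished the same thing in fewer keystrokes. The method was deprecated in Pandas 1.4 and removed in Pandas 2.0, so current code should call pd.concat() instead.

In [None]:
# Equivalent to df1.append(df2) in Pandas versions before 2.0
pd.concat([df1, df2])

Keep in mind that neither **append()** nor pd.concat() modifies the original objects; each creates a new object with the combined data. Because every call copies the data, it is better to build a list of DataFrames and concatenate once than to combine them repeatedly in a loop.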
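
A short sketch of the point about new objects, using pd.concat (which works across Pandas versions) and frames shaped like df1 and df2 from earlier in this notebook:

```python
import pandas as pd

# Frames shaped like df1 and df2 above
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# The result is a new object; the inputs keep their original rows
combined = pd.concat([df1, df2])
print(len(df1), len(df2), len(combined))
```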

# Combining Datasets: Merge and Join  
The pd.merge() function implements three types of join: one-to-one, many-to-one, and many-to-many. Which one is performed depends on whether the key column contains duplicate entries in one or both of the inputs.

**Specification of the Merge Key**

In [None]:
pd.merge(df1, df2, on='employee')
pd.merge(df1, df2, left_on='employee', right_on='name')
pd.merge(df1, df2, left_on='employee', right_on='name').drop('name', axis=1)
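
The three calls above can be sketched with small illustrative frames. The names staff, hired, and pay and all of their values are hypothetical, chosen only to show how the key arguments behave.

```python
import pandas as pd

# Hypothetical example frames
staff = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                      'group': ['Accounting', 'Engineering', 'Engineering']})
hired = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                      'hire_date': [2004, 2008, 2012]})
pay = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa'],
                    'salary': [70000, 80000, 120000]})

# Same key column name in both frames
same_key = pd.merge(staff, hired, on='employee')

# Different key column names: both key columns survive the merge
both_keys = pd.merge(staff, pay, left_on='employee', right_on='name')

# Drop the redundant key column afterwards
tidy = both_keys.drop('name', axis=1)
print(tidy)
```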

Sometimes, rather than merging on a column, you would instead like to merge on an index:

In [None]:
pd.merge(df1, df2, left_index=True, right_index=True)
pd.merge(df1, df2, left_index=True, right_on='name')
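
A sketch of both index-based forms, assuming hypothetical frames in which the employee name has been moved into the index:

```python
import pandas as pd

# Hypothetical frames indexed by employee name
staff_i = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Engineering']},
                       index=pd.Index(['Bob', 'Jake', 'Lisa'], name='employee'))
hired_i = pd.DataFrame({'hire_date': [2008, 2012, 2004]},
                       index=pd.Index(['Bob', 'Jake', 'Lisa'], name='employee'))
# Hypothetical frame that keeps the name as an ordinary column
pay = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa'],
                    'salary': [70000, 80000, 120000]})

# Index on both sides
by_index = pd.merge(staff_i, hired_i, left_index=True, right_index=True)

# Index on the left, column on the right
mixed = pd.merge(staff_i, pay, left_index=True, right_on='name')
print(by_index)
print(mixed)
```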

For convenience, DataFrames implement the **join()** method, which performs a merge that defaults to joining on indices:

In [None]:
df1.join(df2)
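
One detail worth illustrating: join() defaults to a left join, whereas pd.merge() defaults to an inner join. The sketch below assumes two hypothetical frames whose indices only partly overlap.

```python
import pandas as pd

# Hypothetical frames; 'Lisa' and 'Sue' each appear in only one of them
staff_i = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Engineering']},
                       index=['Bob', 'Jake', 'Lisa'])
hired_i = pd.DataFrame({'hire_date': [2008, 2012, 2014]},
                       index=['Bob', 'Jake', 'Sue'])

# join() keeps every row of the calling frame (how='left' by default)
joined = staff_i.join(hired_i)

# merge() on the indices keeps only the shared labels (how='inner' by default)
merged = pd.merge(staff_i, hired_i, left_index=True, right_index=True)
print(joined)
print(merged)
```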

**Specifying Set Arithmetic for Joins**

In [None]:
pd.merge(df1, df2, how='inner')
pd.merge(df1, df2, how='outer')
pd.merge(df1, df2, how='left')
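
The effect of the how keyword is easiest to see when the key columns only partly overlap. The frames below are hypothetical and exist only to illustrate the three options.

```python
import pandas as pd

# Hypothetical frames sharing only the name 'Mary'
food = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                     'food': ['fish', 'beans', 'bread']})
drink = pd.DataFrame({'name': ['Mary', 'Joseph'],
                      'drink': ['wine', 'beer']})

inner = pd.merge(food, drink, how='inner')   # keys present in both
outer = pd.merge(food, drink, how='outer')   # keys present in either
left = pd.merge(food, drink, how='left')     # every key of the left frame

print(len(inner), len(outer), len(left))
```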

**Overlapping Column Names: The _suffixes_ Keywords**  
The merge function automatically appends a suffix \_x or \_y to make the output columns unique if the output would have two conflicting column names. It is possible to specify a custom suffix using the suffixes keyword:

In [None]:
pd.merge(df1, df2, on='name', suffixes=['_L', '_R'])
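
A minimal sketch of the default and custom suffixes, assuming two hypothetical frames that both carry a rank column:

```python
import pandas as pd

# Hypothetical frames with a conflicting 'rank' column
a = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [1, 2]})
b = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [3, 1]})

default = pd.merge(a, b, on='name')                        # rank_x, rank_y
custom = pd.merge(a, b, on='name', suffixes=('_L', '_R'))  # rank_L, rank_R

print(list(default.columns))
print(list(custom.columns))
```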

**Example: US States Data**

In [None]:
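# Pass axis by keyword; the positional form was removed in pandas 2.0
merged = merged.drop('abbreviation', axis=1)
merged.isnull().any()
merged.loc[merged['state'].isnull(), 'state/region'].unique()
final.dropna(inplace=True)
import numexpr  # optional: query() uses it to speed up evaluation when installed
# Keep the query result so the following lines can use it
data2010 = final.query('year == 2010 & ages == "total"')
data2010 = data2010.set_index('state')
density = data2010['population'] / data2010['area (sq. mi)']
density = density.sort_values(ascending=False)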