## Chapter 12 -- Data Handling -- DRAFT

In [1]:
import numpy as np
import pandas as pd
from numpy.random import randn
from pandas import Series, DataFrame, Index

## pandas merge analogy for SAS match-merge

Multiple pandas DataFrames can be joined with either pd.concat() or pd.merge().

The display() function is used to present a side-by-side comparison of the synthetically generated DataFrames, 'company' and 'finance'.

In [None]:
display("company", "finance")

The details for the merge() method are found <a href="http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra"> here </a>.  

The example below uses the pd.merge() method for a one-to-one join operation.  The 'finance' DataFrame lacks a row for Michael Morrison, who appears in the 'company' DataFrame.  Specifying the how='outer' option uses the union of values from both key columns.  Because every key in 'finance' also appears in 'company', how='left' returns the same rows in this case.

In [None]:
merged = pd.merge(company, finance, on='employee', how='outer')
display("company", "finance", "merged")
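The difference between how='outer' and how='left' can be sketched on a small scale. The DataFrames `company_demo` and `finance_demo` and their values are hypothetical stand-ins, not the chapter's 'company' and 'finance' data.

```python
import pandas as pd

# Hypothetical stand-ins for the chapter's 'company' and 'finance' DataFrames.
company_demo = pd.DataFrame({
    "employee": ["Ann", "Bob", "Cara"],
    "dept": ["Sales", "IT", "HR"],
})
finance_demo = pd.DataFrame({
    "employee": ["Cara", "Ann"],  # deliberately no row for Bob
    "salary": [52000, 61000],
})

# Outer join: union of the keys found in either DataFrame.
outer = pd.merge(company_demo, finance_demo, on="employee", how="outer")

# Left join: every key from the left DataFrame, matched where possible.
left = pd.merge(company_demo, finance_demo, on="employee", how="left")

# Bob has no match in finance_demo, so his salary is NaN in both results.
print(outer)
print(left)
```

Both results contain the same three employees here because every key in `finance_demo` also exists in `company_demo`. If `finance_demo` held an employee missing from `company_demo`, only the outer join would keep that row.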

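Unlike a SAS match-merge, which returns rows in BY-variable order, the merge() method does not guarantee that the resulting DataFrame is sorted by the key; the row order depends on the join type.  You can use the sort_values() method to sort explicitly, in this case on the key column, 'employee'.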

In [None]:
merged.sort_values("employee", inplace=True)
merged.info()

Note also that the sort_values() method does not set or reset the index; each row keeps its original index label.

In [None]:
merged.index
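A minimal sketch of this index behavior follows. The DataFrame `demo` and its values are hypothetical; reset_index(drop=True) is shown as one way to renumber the rows after sorting.

```python
import pandas as pd

# Hypothetical data, deliberately out of alphabetical order.
demo = pd.DataFrame({
    "employee": ["Cara", "Ann", "Bob"],
    "salary": [52000, 61000, 48000],
})

# Sorting reorders the rows, but each row keeps its original index label.
sorted_demo = demo.sort_values("employee")
print(list(sorted_demo.index))   # [1, 2, 0]

# reset_index(drop=True) replaces the old labels with 0..n-1.
renumbered = sorted_demo.reset_index(drop=True)
print(list(renumbered.index))    # [0, 1, 2]
```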

To reproduce the results from the merge() method above with SAS, you perform a match-merge.  The SAS example is illustrated below.

The 'company' and 'finance' data sets are first sorted by the 'name' key, which is common to both.  To complete the one-to-one merge, a new SAS data set is created using the MERGE statement with the BY variable 'name'.  The sorts are required; without them, the DATA step fails with an error.

Notice that with the merge() method, the input DataFrames do not need to be sorted, and no explicit reference to a sort key is made.

````
    79       proc sort data=company;
    80          by name;
    NOTE: 8 observations were read from "WORK.company"
    NOTE: Data set "WORK.company" has 8 observation(s) and 3 variable(s)
    81       
    82       proc sort data=finance;
    83          by name;
    NOTE: 7 observations were read from "WORK.finance"
    NOTE: Data set "WORK.finance" has 7 observation(s) and 3 variable(s)
    84       
    85       data merge_info;
    86          merge company finance;
    87          by name;

    NOTE: 8 observations were read from "WORK.company"
    NOTE: 7 observations were read from "WORK.finance"
    NOTE: Data set "WORK.merge_info" has 8 observation(s) and 5 variable(s)
````
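In contrast to the PROC SORT steps in the log above, the following sketch merges two inputs whose key values appear in different orders. The DataFrames `left_demo` and `right_demo` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical inputs; neither is sorted and the key orders differ.
left_demo = pd.DataFrame({
    "name": ["Cara", "Ann", "Bob"],
    "dept": ["HR", "Sales", "IT"],
})
right_demo = pd.DataFrame({
    "name": ["Bob", "Cara", "Ann"],
    "salary": [48000, 52000, 61000],
})

# merge() matches rows on the values in 'name', not on row position,
# so no prior sort is needed.
matched = pd.merge(left_demo, right_demo, on="name")

# Build a name -> salary lookup to confirm each row was paired correctly.
lookup = matched.set_index("name")["salary"].to_dict()
print(lookup)
```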

The pandas merge() method is similar to the concat() method.  The example below illustrates a column-wise concatenation using the axis=1 argument.  Unlike the merge() method, the concat() method returns the key column, 'employee', from both input DataFrames, resulting in duplicated columns.

In [None]:
frames = (company, finance)
merge_info = pd.concat(frames, axis=1)
display("company", "finance", "merge_info")
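The duplicated key column can be seen in a small sketch. The DataFrames `company_demo` and `finance_demo` are hypothetical. The sketch also shows that concat() with axis=1 aligns rows on the index rather than on the values in 'employee', so rows are paired by position when both inputs carry the default index.

```python
import pandas as pd

# Hypothetical inputs; note the 'employee' values are in a different order.
company_demo = pd.DataFrame({"employee": ["Ann", "Bob"], "dept": ["Sales", "IT"]})
finance_demo = pd.DataFrame({"employee": ["Bob", "Ann"], "salary": [48000, 61000]})

# Column-wise concatenation keeps every column from both inputs,
# so 'employee' appears twice.
side_by_side = pd.concat([company_demo, finance_demo], axis=1)
print(list(side_by_side.columns))  # ['employee', 'dept', 'employee', 'salary']

# Rows are aligned on the index (0, 1), not on the key values:
# the first row pairs Ann's department with Bob's salary.
first_row = side_by_side.iloc[0].tolist()
print(first_row)                   # ['Ann', 'Sales', 'Bob', 48000]
```

When the rows must be matched on key values, merge() is the safer choice.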

In [22]:
fruit = pd.DataFrame({'type':    ['banana', 'apple', 'grapes', 'oranges', 'cherries',
                                  'cranberry', 'mango', 'gooseberry'],
                      'id':      [11, 12, 13, 16, 15, 14, 18, 17],
                      'country': ['US', 'Canada', 'France', 'Spain', 'US', 'Canada', 'US', 'New Zealand']})

amount = pd.DataFrame({'id':        [17, 11, 12, 18, 14, 13, 15, 16, 21, 18, 20, 19],
                       'quantity':  [1700, 9700, 2750, 3900, 8000, 5000, 2200, 5500, 4000, 3200, 1200, 9900]})

In [23]:
display("fruit", "amount")

Unnamed: 0,country,id,type
0,US,11,banana
1,Canada,12,apple
2,France,13,grapes
3,Spain,16,oranges
4,US,15,cherries
5,Canada,14,cranberry
6,US,18,mango
7,New Zealand,17,gooseberry

Unnamed: 0,id,quantity
0,17,1700
1,11,9700
2,12,2750
3,18,3900
4,14,8000
5,13,5000
6,15,2200
7,16,5500
8,21,4000
9,18,3200
