# Pandas - Advanced

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('../Data/Height_Weight.csv')
df.head()

Unnamed: 0,Name,Height,Weight,Hometown
0,Ashley,155,140,Palo Alto
1,Robin,145,122,Fremont
2,Priyanka,152,131,Santa Clara
3,Youngchul,167,148,Cupertino
4,Aziz,161,139,San Francisco


## 1. Combining Datasets: Concat/Append/Join

Some of the most interesting studies of data come from combining different data sources. These operations can involve anything from very straightforward concatenation of two different datasets to more complicated database-style joins and merges that correctly handle overlaps between the datasets. `Series` and `DataFrame` objects are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.

### 1.1. Simple Concatenation
Pandas has a function, `pd.concat()`, whose syntax is similar to `np.concatenate()` but which offers a number of additional options, discussed below. `pd.concat()` can be used for a simple concatenation of Series or DataFrame objects, just as `np.concatenate()` can be used for simple concatenations of arrays.

In [4]:
s1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
s2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([s1, s2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [5]:
# Append row-wise (axis=0). The new row keeps its own index label 0;
# pass ignore_index=True to renumber the result from 0 to n-1.
temp = {'Name': ['Jay'], 'Height': [183.0], 'Weight': [165], 'Hometown': ['Thousand Oaks']}
pd.concat([df, pd.DataFrame(temp)], axis=0)

Unnamed: 0,Name,Height,Weight,Hometown
0,Ashley,155.0,140,Palo Alto
1,Robin,145.0,122,Fremont
2,Priyanka,152.0,131,Santa Clara
3,Youngchul,167.0,148,Cupertino
4,Aziz,161.0,139,San Francisco
5,Zoey,181.0,190,Hayward
0,Jay,183.0,165,Thousand Oaks


In [6]:
# append column wise
temp = pd.Series(np.linspace(4,20,6))
pd.concat([df, temp], axis=1)

Unnamed: 0,Name,Height,Weight,Hometown,0
0,Ashley,155,140,Palo Alto,4.0
1,Robin,145,122,Fremont,7.2
2,Priyanka,152,131,Santa Clara,10.4
3,Youngchul,167,148,Cupertino,13.6
4,Aziz,161,139,San Francisco,16.8
5,Zoey,181,190,Hayward,20.0


### 1.2. Duplicate indices

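One important difference between `np.concatenate()` and `pd.concat()` is that Pandas concatenation preserves indices, even if the result will have duplicate indices. Duplicate index labels can cause data-quality problems downstream, for example when a label-based lookup unexpectedly returns several rows.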

In [7]:
d1 = dict({'Name': ['Jay', 'Zoey'], 'Height': [183.0, 161], 'Weight': [165, 139], 'Hometown': ['Thousand Oaks', 'San Francisco']})
d2 = dict({'Name': ['Ashley', 'Zoey'], 'Height': [155.0, 161.0], 'Weight': [140, 139], 'Hometown': ['Palo Alto', 'San Francisco']})

df1 = pd.DataFrame(d1, index=[0, 1])
df2 = pd.DataFrame(d2, index=[2, 1])

pd.concat([df1, df2])

Unnamed: 0,Name,Height,Weight,Hometown
0,Jay,183.0,165,Thousand Oaks
1,Zoey,161.0,139,San Francisco
2,Ashley,155.0,140,Palo Alto
1,Zoey,161.0,139,San Francisco
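
One way to spot the repeated labels after the fact is to inspect the index of the result. The sketch below is a minimal illustration using two small hypothetical frames (`left` and `right`, which only mirror the shape of `df1` and `df2`), together with the `Index.is_unique` attribute and the `Index.duplicated()` method.

```python
import pandas as pd

# Hypothetical frames for illustration; index label 1 appears in both.
left = pd.DataFrame({'Name': ['Jay', 'Zoey']}, index=[0, 1])
right = pd.DataFrame({'Name': ['Ashley', 'Zoey']}, index=[2, 1])

combined = pd.concat([left, right])

# is_unique is False when at least one index label occurs more than once.
print(combined.index.is_unique)

# keep=False marks every copy of a repeated label, not only the later ones.
dup_mask = combined.index.duplicated(keep=False)
print(combined[dup_mask])
```

Whether such rows should be dropped, renumbered, or kept depends on what the index means in your data.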


#### Catching the repeats as an error

If you'd like to simply verify that the indices in the result of `pd.concat()` do not overlap, you can specify the `verify_integrity` flag. With this set to `True`, the concatenation will raise an exception if there are duplicate indices.

In [8]:
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Int64Index([1], dtype='int64')


#### Ignoring the index

Sometimes the index itself does not matter, and you would prefer it to simply be ignored. This option can be specified using the `ignore_index` flag. With this set to `True`, the concatenation will create a new integer index for the resulting DataFrame.

In [9]:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,Name,Height,Weight,Hometown
0,Jay,183.0,165,Thousand Oaks
1,Zoey,161.0,139,San Francisco
2,Ashley,155.0,140,Palo Alto
3,Zoey,161.0,139,San Francisco


#### Adding MultiIndex keys

Another option is to use the `keys` option to specify a label for each data source; the result will be a hierarchically indexed DataFrame containing the data.

In [10]:
pd.concat([df1, df2], keys=['df1', 'df2'])

Unnamed: 0,Unnamed: 1,Name,Height,Weight,Hometown
df1,0,Jay,183.0,165,Thousand Oaks
df1,1,Zoey,161.0,139,San Francisco
df2,2,Ashley,155.0,140,Palo Alto
df2,1,Zoey,161.0,139,San Francisco
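
Once the sources are labeled, the outer level of the index can be used to pull one source back out. The following is a small sketch under the assumption of two hypothetical frames `a` and `b`; it relies only on `pd.concat(..., keys=...)` and standard `.loc` indexing on a `MultiIndex`.

```python
import pandas as pd

# Hypothetical frames for illustration; label 1 is shared by both sources.
a = pd.DataFrame({'Name': ['Jay', 'Zoey']}, index=[0, 1])
b = pd.DataFrame({'Name': ['Zoey', 'Ashley']}, index=[1, 2])

keyed = pd.concat([a, b], keys=['df1', 'df2'])

# Selecting on the outer level returns every row that came from one source.
from_b = keyed.loc['df2']
print(from_b)

# A (source, label) tuple identifies a single row even though label 1 repeats.
zoey_in_b = keyed.loc[('df2', 1), 'Name']
print(zoey_in_b)
```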


### 1.3. Append DataFrame

Before pandas 2.0, Series and DataFrame objects had an `append()` method that accomplished the same thing in fewer keystrokes. The method was deprecated in pandas 1.4 and removed in 2.0, so on current versions use `pd.concat()` instead. Keep in mind that, unlike the `append()` and `extend()` methods of Python lists, the Pandas `append()` method does not modify the original object.

**Instead, it creates a new object with the combined data. It is also not a very efficient method, because it involves creating a new index and data buffer.** Thus, if you plan to do multiple append operations, it is generally better to build a list of DataFrames and pass them all at once to the `concat()` function.

In [11]:
# Requires pandas < 2.0; on pandas 2.0+ use pd.concat([df1, df2]) instead.
df1.append(df2)

Unnamed: 0,Name,Height,Weight,Hometown
0,Jay,183.0,165,Thousand Oaks
1,Zoey,161.0,139,San Francisco
2,Ashley,155.0,140,Palo Alto
1,Zoey,161.0,139,San Francisco
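
The advice to collect the pieces first and concatenate once can be sketched as follows. The `batches` list is a hypothetical stand-in for rows arriving one chunk at a time; only `list.append()` and `pd.concat()` are used.

```python
import pandas as pd

# Hypothetical chunks of rows, e.g. read from several files or API pages.
batches = [
    {'Name': ['Jay'], 'Height': [183.0]},
    {'Name': ['Zoey'], 'Height': [161.0]},
    {'Name': ['Ashley'], 'Height': [155.0]},
]

frames = []
for batch in batches:
    # list.append modifies the list in place and copies no DataFrame data.
    frames.append(pd.DataFrame(batch))

# A single concat builds the combined index and data buffer once.
result = pd.concat(frames, ignore_index=True)
print(result)
```

This pattern also works unchanged on pandas 2.0 and later, where `DataFrame.append()` is no longer available.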


### 1.4. Simple Merge

One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. Pandas implements these building blocks in the `pd.merge()` function and the related `join()` method of DataFrames. The `pd.merge()` function implements three types of joins: one-to-one, many-to-one, and many-to-many.

By default, `pd.merge()` joins on the column names that the two inputs share. You can name the key column explicitly with the `on` keyword. Often the key columns will not be named identically in both inputs; in that case, use the `left_on` and `right_on` keywords to name each side's key.

In [12]:
# One-to-one joins
df2 = pd.DataFrame({'Name': ['Ashley', 'Robin', 'Aziz', 'Zoey'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

print(df2)
temp = pd.merge(df, df2, on='Name')
temp

     Name        group
0  Ashley   Accounting
1   Robin  Engineering
2    Aziz  Engineering
3    Zoey           HR


Unnamed: 0,Name,Height,Weight,Hometown,group
0,Ashley,155,140,Palo Alto,Accounting
1,Robin,145,122,Fremont,Engineering
2,Aziz,161,139,San Francisco,Engineering
3,Zoey,181,190,Hayward,HR


In [13]:
# Many-to-one joins
df3 = pd.DataFrame({'depart': ['Accounting', 'Engineering', 'HR'],'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3)
pd.merge(temp, df3, left_on='group', right_on='depart')

        depart supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve


Unnamed: 0,Name,Height,Weight,Hometown,group,depart,supervisor
0,Ashley,155,140,Palo Alto,Accounting,Accounting,Carly
1,Robin,145,122,Fremont,Engineering,Engineering,Guido
2,Aziz,161,139,San Francisco,Engineering,Engineering,Guido
3,Zoey,181,190,Hayward,HR,HR,Steve
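
If you want Pandas to confirm which of the join types you are actually performing, `pd.merge()` accepts a `validate` argument. The sketch below uses two hypothetical frames (`people` and `supervisors`) that resemble the tables above; it is an illustration of the check, not part of the original analysis.

```python
import pandas as pd

# Hypothetical inputs: several people can share a group, each group has one supervisor.
people = pd.DataFrame({'Name': ['Ashley', 'Robin', 'Aziz'],
                       'group': ['Accounting', 'Engineering', 'Engineering']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                            'supervisor': ['Carly', 'Guido']})

# Passes: keys may repeat on the left but are unique on the right.
ok = pd.merge(people, supervisors, on='group', validate='many_to_one')
print(ok)

# Fails: 'Engineering' repeats on the left, so this is not a one-to-one merge.
try:
    pd.merge(people, supervisors, on='group', validate='one_to_one')
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print('MergeError:', err)
```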


Sometimes, rather than merging on a column, you would instead like to merge on an index, using the `left_index` and `right_index` keywords. If you'd like to mix indices and columns, you can combine `left_index` with `right_on`, or `left_on` with `right_index`, to get the desired behavior.

In [14]:
# Many-to-many joins
df4 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']}).set_index('group')
print(df4)

pd.merge(temp, df4, left_on='group', right_index=True)

                   skills
group                    
Accounting           math
Accounting   spreadsheets
Engineering        coding
Engineering         linux
HR           spreadsheets
HR           organization


Unnamed: 0,Name,Height,Weight,Hometown,group,skills
0,Ashley,155,140,Palo Alto,Accounting,math
0,Ashley,155,140,Palo Alto,Accounting,spreadsheets
1,Robin,145,122,Fremont,Engineering,coding
1,Robin,145,122,Fremont,Engineering,linux
2,Aziz,161,139,San Francisco,Engineering,coding
2,Aziz,161,139,San Francisco,Engineering,linux
3,Zoey,181,190,Hayward,HR,spreadsheets
3,Zoey,181,190,Hayward,HR,organization
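
The cell above mixes a column key with an index key. For the case where both inputs are indexed by the key, a minimal sketch might look like the following; the frames `groups_by_name` and `heights_by_name` are hypothetical, and `DataFrame.join()` is shown as the shorthand that joins on the index by default.

```python
import pandas as pd

names = pd.Index(['Ashley', 'Robin', 'Zoey'], name='Name')

# Hypothetical frames that are both indexed by Name.
groups_by_name = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR']}, index=names)
heights_by_name = pd.DataFrame({'Height': [155, 145, 181]}, index=names)

# Merge on the index of both inputs.
merged = pd.merge(groups_by_name, heights_by_name, left_index=True, right_index=True)
print(merged)

# join() aligns on the index by default, so no key arguments are needed here.
joined = groups_by_name.join(heights_by_name)
print(joined)
```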


### 1.5. General Join Tables on Keys

By default, the result contains only the rows whose keys appear in both inputs; this is known as an inner join. We can specify this explicitly using the `how` keyword, which defaults to `'inner'`. Other options for the `how` keyword are `'outer'`, `'left'`, and `'right'`. An outer join returns the union of the keys from both inputs and fills in missing values with NAs. The left and right joins keep every key from the left or the right input, respectively.

In [15]:
# inner join
pd.merge(df, df2, on='Name', how='inner')

Unnamed: 0,Name,Height,Weight,Hometown,group
0,Ashley,155,140,Palo Alto,Accounting
1,Robin,145,122,Fremont,Engineering
2,Aziz,161,139,San Francisco,Engineering
3,Zoey,181,190,Hayward,HR


In [16]:
# right join: keep every key from df2 (the right input)
pd.merge(df, df2, on='Name', how='right')

Unnamed: 0,Name,Height,Weight,Hometown,group
0,Ashley,155,140,Palo Alto,Accounting
1,Robin,145,122,Fremont,Engineering
2,Aziz,161,139,San Francisco,Engineering
3,Zoey,181,190,Hayward,HR


In [17]:
# outer join
pd.merge(df, df2, on='Name', how='outer')

Unnamed: 0,Name,Height,Weight,Hometown,group
0,Ashley,155,140,Palo Alto,Accounting
1,Robin,145,122,Fremont,Engineering
2,Priyanka,152,131,Santa Clara,
3,Youngchul,167,148,Cupertino,
4,Aziz,161,139,San Francisco,Engineering
5,Zoey,181,190,Hayward,HR
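
To see how the four options differ when neither input contains all the keys, here is a small sketch with two hypothetical frames, `people` and `groups` (the `'Sales'` entry is invented for illustration). It also passes `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical inputs: Priyanka has no group, Jay has no height.
people = pd.DataFrame({'Name': ['Ashley', 'Priyanka', 'Zoey'],
                       'Height': [155, 152, 181]})
groups = pd.DataFrame({'Name': ['Ashley', 'Zoey', 'Jay'],
                       'group': ['Accounting', 'HR', 'Sales']})

# Compare which names survive each kind of join.
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(people, groups, on='Name', how=how)
    print(how, '->', sorted(result['Name']))

# indicator=True labels each row as 'both', 'left_only', or 'right_only'.
outer = pd.merge(people, groups, on='Name', how='outer', indicator=True)
print(outer)
```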


## 2. Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations in which a single number gives insight into the nature of a potentially large dataset. I will explore aggregations in Pandas, from simple operations to more sophisticated operations based on the concept of a groupby.

The following table summarizes the built-in Pandas aggregations:

| Aggregation              | Description                                 |
|--------------------------|---------------------------------------------|
| ``count()``              | Total number of items                       |
| ``first()``, ``last()``  | First and last item                         |
| ``mean()``, ``median()`` | Mean and median                             |
| ``min()``, ``max()``     | Minimum and maximum                         |
| ``std()``, ``var()``     | Standard deviation and variance             |
| ``mad()``                | Mean absolute deviation (pandas < 2.0 only) |
| ``prod()``               | Product of all items                        |
| ``sum()``                | Sum of all items                            |
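
As a quick illustration of these methods, the sketch below builds a small frame, `df_demo`, from the height and weight values shown earlier (it is a stand-in so the example runs without the CSV file) and applies a few of the aggregations.

```python
import pandas as pd

# Stand-in for the Height/Weight columns of df, so the example is self-contained.
df_demo = pd.DataFrame({'Height': [155, 145, 152, 167, 161, 181],
                        'Weight': [140, 122, 131, 148, 139, 190]})

# A single aggregation on one column returns a scalar.
print(df_demo['Height'].max())

# Several aggregations at once return one row per aggregation.
summary = df_demo.agg(['min', 'max', 'mean'])
print(summary)

# describe() bundles count, mean, std, min, quartiles, and max.
print(df_demo.describe())
```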

### 2.1. GroupBy: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called `groupby` operation. Although the name "group by" comes from a command in the SQL database language, it is often more helpful to think of the `groupby` operation as three steps: split, apply, combine.

![title](../Data/Notebook_Images/Groupby.png)


This makes clear what the groupby accomplishes:

* The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
* The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
* The combine step merges the results of these operations into an output array.
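
The three steps can be written out by hand and compared with what `groupby()` returns. This is only a sketch to make the idea concrete, using a hypothetical frame `data`; Pandas does not necessarily perform the work this way internally.

```python
import pandas as pd

# Hypothetical data for illustration.
data = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                     'Height': [155, 145, 161, 181]})

# Split: one sub-frame per distinct key value.
pieces = {key: data[data['group'] == key] for key in data['group'].unique()}

# Apply: compute an aggregate within each piece.
means = {key: piece['Height'].mean() for key, piece in pieces.items()}

# Combine: collect the per-group results into one labeled object.
manual = pd.Series(means, name='Height').rename_axis('group')
print(manual)

# The same computation expressed with groupby.
grouped = data.groupby('group')['Height'].mean()
print(grouped)
```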

#### GroupBy object

The GroupBy object is a very flexible abstraction. The most basic split-apply-combine operation can be computed with the `groupby()` method of DataFrames, passing the name of the desired key column.

In [33]:
size_lvl = df2.groupby('group')

print('groupby object:', size_lvl, '\n')

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
for key, group in size_lvl:
    print('keys:', key, '\n', 'content:\n', group, '\n')

groupby object: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EE96716B00> 

keys: Accounting 
 content:
      Name       group
0  Ashley  Accounting 

keys: Engineering 
 content:
     Name        group
1  Robin  Engineering
2   Aziz  Engineering 

keys: HR 
 content:
    Name group
3  Zoey    HR 



In [38]:
# select specific group in the groupby object
size_lvl.get_group('Engineering')

Unnamed: 0,Name,group
1,Robin,Engineering
2,Aziz,Engineering


### 2.2. Aggregate, Filter, Transform, Apply

The preceding discussion focused on aggregation for the combine operation, but there are more options available. In particular, GroupBy objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data.
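
A brief sketch of the four methods follows, one call each. The frame `data` and the particular functions passed in are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
data = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                     'Height': [155, 145, 161, 181],
                     'Weight': [140, 122, 139, 190]})
g = data.groupby('group')

# aggregate: reduce each group to one row, here with a different function per column.
summary = g.aggregate({'Height': 'mean', 'Weight': 'max'})
print(summary)

# filter: keep or drop whole groups; here only groups with more than one member remain.
multi = g.filter(lambda x: len(x) > 1)
print(multi)

# transform: return a result with the same shape as the input; here heights are
# centered on their own group's mean.
centered = g['Height'].transform(lambda s: s - s.mean())
print(centered)

# apply: run an arbitrary function on each group; here the height range per group.
spread = g['Height'].apply(lambda s: s.max() - s.min())
print(spread)
```

Note the difference in output shape: `aggregate()` and this use of `apply()` return one entry per group, `transform()` returns one entry per original row, and `filter()` returns a subset of the original rows.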