# Introduction to Data Analysis with Python II


<img src="https://www.python.org/static/img/python-logo.png" style="width: 200px; float: right;"/>

## Data Wrangling: Clean, Transform, Merge, Reshape

Let's start by importing the necessary modules. Our usual suspects so far are `numpy`, `os`, and `pandas` of course. But we'll also be importing `pyplot` from `matplotlib` and we'll proceed to configure it right away so our graphics show up nicely:

In [None]:
import numpy as np
import os
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd

%matplotlib inline

## Combining and merging data sets

### Database-style DataFrame merges

Let's create a couple of quick dataframes from a dictionary as input to illustrate merges:

In [None]:
df1 = pd.DataFrame({
  'data1' : range(7),
  'key' : list('bbacaab')
})
df2 = pd.DataFrame({
  'data2' : range(20,23),
  'key' : list('abd')
})

Everything looks as expected:

In [None]:
df1

In [None]:
df2

Let's talk about merge.

By default, `.merge()` performs an [inner join](https://www.w3schools.com/sql/sql_join.asp) between the DataFrames, using the common columns as keys. In a database, an inner join returns the records that have matching values in both tables:

![Inner Join](https://drive.google.com/uc?export=view&id=1ONVclC3ZQbsblQG8zwutQZB3JuE6AKF3)

Even better, using a diagram that looks a bit more like a dataframe, the merge operation would be like this:

<div>
<img src="https://i.stack.imgur.com/YvuOa.png" alt="Better Inner Join" width="300"/>
</div>

In our case, we're merging on the basis of the `key` column. So merging `df1` with `df2` will yield a new dataframe based on the `df1` column structure, with an additional `data2` column holding the values from `df2` for the `key` values present in both dataframes:

In [None]:
df1.merge(df2)

Reminder: we're not modifying `df1` when invoking `merge()` on it; we're being handed a new dataframe instead:

In [None]:
df1

Inner merge implies that the cartesian product of the elements with common keys is returned. That is, if there are duplicates, it will return all the possible combinations.

In set theory, the cartesian product concept is easy. The cartesian product of sets $A$ and $B$ is $A \times B$, as shown in the image:

![Cartesian product of two sets](https://drive.google.com/uc?export=view&id=18ejPhzu4HQMgBl5omgfAYt67pGSTMa38)

To see what this yields when merging our dataframes, we first need to create a new dataframe with **duplicate** `key` entries:

In [None]:
df2_wdups = pd.DataFrame({
    'data2' : range(20,24),
    'key' : list('abda')
    })
df2_wdups

When merging `df1` with `df2_wdups` we can see that the cartesian product for the set containing the duplicate elements for a given key is there, as for example is the case for $(a)$ and values $(20,23)$:

In [None]:
df1.merge(df2_wdups)

So the cartesian product shown here for the key `a` corresponds to the combinations $[2, 4, 5] \times [20, 23]$: every `data1` value for `a` paired with every `data2` value for `a`.
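
A quick way to convince yourself of this is to count rows per key on each side and multiply. The sketch below rebuilds the two frames under the illustrative names `left` and `right` (they are not part of the notebook) so it can run on its own:

```python
import pandas as pd

# Illustrative copies of df1 and df2_wdups, renamed to keep the sketch standalone.
left = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
right = pd.DataFrame({'data2': range(20, 24), 'key': list('abda')})

merged = left.merge(right)  # inner join on the shared 'key' column

# Rows per key on each side; their product is the expected row count per key.
left_counts = left['key'].value_counts()
right_counts = right['key'].value_counts()
expected = (left_counts * right_counts).dropna().astype(int)

print(expected.sort_index())
print(merged['key'].value_counts().sort_index())
```

Keys that exist on only one side (`c` and `d` here) produce a missing product and are dropped, which matches what the inner join does with them.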

If the columns to join on don't have the same name, or we want to join on the index of the DataFrames, we'll need to make it explicit. Let's create a couple of new datasets to show this:

In [None]:
df3 = pd.DataFrame({
    'data1' : range(7),
    'lkey' : list('bbacaab')
    })
df4 = pd.DataFrame({
    'data2' : range(3),
    'rkey' : list('abd')
    })

In [None]:
df3

In [None]:
df4

In [None]:
df3.merge(df4,
          left_on='lkey',
          right_on='rkey')

Why are we naming left and right in this operation? Think about what we saw about merge being an implementation of a SQL JOIN operation.

Do you think order matters in these operations? Generally speaking, it does, because the cartesian product is **not** commutative. The same applies to the JOIN operation that merge performs, with the exception of the inner join, where we can see the commutative property taking place:

In [None]:
df4.merge(
    df3,left_on='rkey',
    right_on='lkey')

You can see, however, that changing the order of the elements of the operation does change the ordering of the result.

Let's now try an outer JOIN, which corresponds in set theory to this operation:

![full outer join](https://www.w3schools.com/sql/img_fulljoin.gif)

Or, again, using a more dataframe-oriented representation:

<div>
<img src="https://i.stack.imgur.com/euLoe.png" alt="Better Full Outer Join" width="300"/>
</div>

In [None]:
df1

In [None]:
df2

In [None]:
df1.merge(df2, how='outer')

That was easy according to the set theory diagram (and should be commutative as well), as we didn't have any duplicates in `df2`. What if we do?

In [None]:
df2_wdups

As expected, we'll get the cartesian product:

In [None]:
df1.merge(df2_wdups, how='outer')

We can do a left join as well, which returns all the rows from the left dataframe (`df1`) and the matched rows of the right dataframe (`df2`):

![left join](https://www.w3schools.com/sql/img_leftjoin.gif)

Or, once again:

<div>
<img src="https://i.stack.imgur.com/BECid.png" alt="Better Full Left Outer Join" width="300"/>
</div>

In [None]:
df1.merge(df2, how='left')

You can see that for existing keys in `df1` that are non existent in `df2` (as is the case of `c`) Pandas will fill the corresponding column (`data2`) with `NaN`.
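
To spot such unmatched rows programmatically, `merge()` accepts an `indicator` argument. A minimal sketch, rebuilding the two frames under the illustrative names `left` and `right` so it runs standalone:

```python
import pandas as pd

# Illustrative copies of the original df1 and df2.
left = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
right = pd.DataFrame({'data2': range(20, 23), 'key': list('abd')})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
flagged = left.merge(right, how='left', indicator=True)

# Rows whose key had no partner in the right frame.
unmatched = flagged[flagged['_merge'] == 'left_only']
print(unmatched)
```

Filtering on `_merge` is a handy sanity check after a join, since it separates genuine missing data from rows that simply had no match.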

If there are two columns with the same name that we do not join on, both will get transferred to the resulting DataFrame with a suffix.

Let's modify our dataframes `df1` and `df2` so they both have an extra column with the same name:

In [None]:
df1['X'] = 2
df1

In [None]:
df2['X'] = 42
df2

...and proceed to do an inner join on the column `key`:

In [None]:
df1.merge(df2, on='key')

We see the default naming convention appends the suffixes `_x` and `_y`. We can modify that by being explicit about suffixes:

In [None]:
df1.merge(df2, on='key', suffixes=['_left', '_right'])

### Merging on index

So far, we haven't explicitly defined an index for our dataframes or used one in our merging operations. Let's do it, first creating a new `df5` to practice with:

In [None]:
df5= pd.DataFrame({
    'g': range(4),
    'h': range(8,12)
    },
    index =list('abcd'))
df5

Our `df1` was:

In [None]:
df1

Let's do an inner join where we explicitly tell pandas to use the index of the right dataframe as its join key (option `right_index=True`):

In [None]:
df1.merge(df5, left_on='key', right_index=True)
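
The same column-to-index match can also be written with `DataFrame.join`, which joins against the other frame's index and takes the left column through `on`. A small sketch with illustrative frames `left` and `indexed` standing in for `df1` and `df5`:

```python
import pandas as pd

# Illustrative stand-ins for df1 and df5.
left = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
indexed = pd.DataFrame({'g': range(4), 'h': range(8, 12)}, index=list('abcd'))

via_merge = left.merge(indexed, left_on='key', right_index=True)
via_join = left.join(indexed, on='key', how='inner')

# Sort by the original row labels before comparing, in case row order differs.
print(via_merge.sort_index().equals(via_join.sort_index()))
```

Note that `join` defaults to a left join, so `how='inner'` is needed here to mirror the default behaviour of `merge`.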

### Concatenating along an axis

We all know merging and concatenating are **not** the same, but let's make it graphically clear:

Merging a dataframe:

![merge](https://miro.medium.com/max/1400/1*-uSHoxrzM57syqnKnms2iA.png)

Concatenating a dataframe on two different axes:

![concat](https://miro.medium.com/max/1400/1*0wu6DunCzPC4o9FIyRTW4w.png)

Now that we've cleared that up, we can start practicing dataframe concatenation:

In [None]:
df1

In [None]:
df5

In [None]:
pd.concat([df1, df5])

Remember that our default axis is axis 0 (operating by rows), and that is what happened here: pandas filled in with `NaN` the values it didn't have.
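
If the `NaN` padding is not wanted, `pd.concat` can keep only the columns shared by all inputs through `join='inner'`. A minimal sketch with two made-up frames, `top` and `bottom`:

```python
import pandas as pd

# Made-up frames that share only the 'key' column.
top = pd.DataFrame({'key': list('ab'), 'data1': [0, 1]})
bottom = pd.DataFrame({'key': list('cd'), 'g': [10, 11]})

outer = pd.concat([top, bottom])                # default join='outer': pad with NaN
inner = pd.concat([top, bottom], join='inner')  # keep shared columns only

print(outer)
print(inner)
```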

In [None]:
import numpy as np

a1 = np.arange(0,24).reshape(4,6)
a1

In [None]:
a2 = np.arange(25,37).reshape(4,3)
a2

In [None]:
a3 = np.concatenate([a1,a2], axis=1)
a3

In [None]:
s1 = pd.Series(range(4), index=list('abcd'))
s2 = pd.Series(range(10,13), index=list('lmn'))
s3 = pd.Series(range(40,43), index=list('xyz'))
print(f"{s1}\n{s2}\n{s3}")

In [None]:
pd.concat([s1,s2,s3])

In [None]:
result = pd.concat([s1,s2,s3], axis=1)
result

In [None]:
result = pd.concat([s1,s2,s3], axis=1, keys=['s1', 's2', 's3'])
result

In [None]:
pd.concat([df1,df2], ignore_index=True)

## On Time Performance Table, transtats.

Downloaded from `https://www.transtats.bts.gov/`. Here you have the instructions to download it by yourself, but this notebook takes care of it, so skip this extra step and go to the next cell.

### (Optional) Instructions for download

Input "On Time Performance" in search box, click on "Airline On-Time Performance Data" from the search results, then on the bottom right corner of "Reporting Carrier On-Time Performance (1987-present)" click "Download". In the next screen, click "Prezipped file", select the period (March and April 2020), and click "Download" once for each period, for a total of 2 zip files.

First, let's mount our Drive in Colab so we can store and persist the files we're going to download:

In [None]:
import os
drive_loc = '/content/gdrive'
files_loc = os.path.join(drive_loc, 'MyDrive', 'pdsfiles')

from google.colab import drive
drive.mount(drive_loc)

Now, get the data on airline on-time performance for March 2020:

In [None]:
!wget https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2020_3.zip -P {files_loc}

...and do the same for April 2020:

In [None]:
!wget https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2020_4.zip -P {files_loc}

Finally, let's unzip both files:

In [None]:
!cd {files_loc}; unzip -qq *_3.zip; unzip -qq *_4.zip

Let's check that our files are there:

In [None]:
!ls {files_loc}

### Take a look at the beginning of the readme file

Using the shell:

In [None]:
readme_loc = files_loc + '/readme.html'

In [None]:
! head {readme_loc}

The readme file is HTML. Luckily, we are working in an HTML environment.

### Display the contents of `readme.html` within Colab
Use [IPython.display](https://ipython.org/ipython-doc/3/api/generated/IPython.display.html):

In [None]:
from IPython.display import display, HTML
display(HTML(filename=readme_loc))

That's some very good documentation!

The files within the zip are double-quoted CSVs. They contain information on the timeliness of departures in the US, with one row per departure.

### Loading the data
Let's load one of the files into memory as a pandas dataframe. What functions do you need to use?

**Pro tip**: there is no need to decompress the whole file. Check out [zipfile.ZipFile](https://docs.python.org/3/library/zipfile.html)

First, open a connection to one of the files. Let's select the file for March 2020:



In [None]:
march_file = !cd {files_loc}; ls {files_loc}/*_3.zip

In [None]:
march_file

In [None]:
zip_file = march_file[0]

In [None]:
import os
import zipfile

zip_file_handle = zipfile.ZipFile(zip_file)

`zip_file_handle` is a connection to the compressed file, the `.zip`. We can use it to open a connection to one of the files it contains, which will behave like a normal uncompressed file opened with `open()`:

In [None]:
type(zip_file_handle)

In [None]:
zip_file_handle.namelist()

In [None]:
csv, readme = zip_file_handle.namelist()

In [None]:
csv_file = zip_file_handle.open(csv)

Now we're ready to load the file into memory as a pandas dataframe. Remember to close the connections to the files!

In [None]:
csv_file = zip_file_handle.open(csv)
df = pd.read_csv(csv_file)

csv_file.close()
zip_file_handle.close()

In [None]:
df.head()
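
Closing the handles by hand is easy to forget. As a sketch of an alternative, both `zipfile.ZipFile` and the file objects it opens can be used as context managers. The archive below is built in memory with made-up file names and contents, purely so the example runs without the downloaded data:

```python
import io
import zipfile

import pandas as pd

# Build a tiny zip archive in memory; names and contents are invented.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('flights.csv', 'Origin,Dest\nJFK,LAX\nSFO,ORD\n')
    archive.writestr('readme.html', '<html></html>')
buffer.seek(0)

# Both handles are closed automatically when their with blocks end.
with zipfile.ZipFile(buffer) as archive:
    csv_name = next(n for n in archive.namelist() if n.endswith('.csv'))
    with archive.open(csv_name) as csv_handle:
        sample = pd.read_csv(csv_handle)

print(sample)
```

Picking the member by its `.csv` extension also avoids relying on the order in which `namelist()` returns the files.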

#### Exercise

Load both March 2020 and April 2020 into a single DataFrame

In [None]:
def get_df_from_zip(zip_filepath):
  zip_file_handle = zipfile.ZipFile(zip_filepath)
  csv_filename, _ = zip_file_handle.namelist()
  csv_file = zip_file_handle.open(csv_filename)
  csv_df = pd.read_csv(csv_file)
  csv_file.close()
  zip_file_handle.close()
  return csv_df

In [None]:
april_file = !cd {files_loc}; ls {files_loc}/*_4.zip
df_otp = pd.concat([get_df_from_zip(march_file[0]), get_df_from_zip(april_file[0])])

Let's start examining the data: show the beginning of the file. How many records does it contain?

In [None]:
pd.options.display.max_columns = None

In [None]:
df_otp.head()

In [None]:
df_otp.shape

In [None]:
print(df_otp.size)
df_otp.size == df_otp.shape[0] * df_otp.shape[1]

In [None]:
df_otp.dtypes

#### Digression

Attention! Be careful not to assign to the names of built-ins or library functions: you will overwrite the original object.

In [None]:
pd.concat = df1

In [None]:
pd.concat([s1,s2])

You can delete the overwritten attribute, but that won't bring back the original value. If it is an object or function from a module, you'll need to `reload()` the module, because Python doesn't load an already imported module again when you `import` it a second time. `reload()` is also useful when you are actively developing your own module and want to load the latest definition of a function into memory.

In [None]:
del(pd.concat)

In [None]:
pd.concat

In [None]:
# The `imp` module is deprecated and was removed in Python 3.12; use importlib.
import importlib
importlib.reload(pd)

In [None]:
pd.concat
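
The same sequence can be tried safely on a small standard-library module. The sketch below uses `colorsys` purely as a stand-in, an arbitrary choice that has nothing to do with the notebook's data:

```python
import colorsys
import importlib

colorsys.rgb_to_hsv = None        # accidentally clobber a module function
del colorsys.rgb_to_hsv           # deleting the name does not bring it back
print(hasattr(colorsys, 'rgb_to_hsv'))

importlib.reload(colorsys)        # re-executes the module's source code
print(callable(colorsys.rgb_to_hsv))
```

A plain `import colorsys` in place of the reload would change nothing, because Python would return the already loaded (and damaged) module from its cache.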

## Data transformation

### Removing duplicates

Let's create a new dataframe that contains some duplicates, so we can learn how to remove them:

In [None]:
df6 = pd.DataFrame({
    'key1' : ['one'] * 3 + ['two'] * 4,
    'key2' : [1, 1, 2, 3, 3, 4, 4]
    })
df6

The `duplicated` method works by default on axis 0 and returns a boolean Series marking the duplicate entries (rows) in the dataframe:

In [None]:
df6.duplicated()

If you want to act on the duplicates (removing them), then Pandas has the `drop_duplicates` method:

In [None]:
df6.drop_duplicates()

In [None]:
df6.drop_duplicates(keep='last', subset = 'key1')

#### Exercise

Let's practice with our big dataset by doing some data cleansing/rationalization based on duplicates. Consider the following questions:

- How many individual airports are there in the OTP data?
- How many routes (combinations of origin / destination) are there?

How would you approach the solution?

*Hint: Remove duplicates with `subset`, then `count()`.*

In [None]:
df_otp.drop_duplicates(subset='OriginAirportID').count()['OriginAirportID']

In [None]:
df_otp.drop_duplicates(subset=['OriginAirportID','DestAirportID']).count()['OriginAirportID']
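
As a cross-check, the same counts can be obtained with `nunique()`. The sketch below uses a tiny made-up frame named `flights_sample` with the same two column names, so it runs without the downloaded data. It also counts airports that appear only as destinations, which the origin-based count above would miss:

```python
import pandas as pd

# Made-up sample standing in for the two OTP columns used above.
flights_sample = pd.DataFrame({
    'OriginAirportID': [1, 1, 2, 3, 3],
    'DestAirportID':   [2, 2, 1, 1, 4],
})

n_origins = flights_sample['OriginAirportID'].nunique()
n_routes = len(
    flights_sample[['OriginAirportID', 'DestAirportID']].drop_duplicates()
)

# Airports that appear on either end of a route.
n_airports = pd.concat([flights_sample['OriginAirportID'],
                        flights_sample['DestAirportID']]).nunique()

print(n_origins, n_routes, n_airports)
```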

### Renaming axis indexes

Let's add an explicit index to our last dataset (a refresher from the previous class):

In [None]:
df6.index = list('plfjdmh')
df6

### Discretization and binning

In [None]:
!wget http://bit.ly/ks-pds-csv8 -P {files_loc}

Let's rename our file to something more intuitive for future use:

In [None]:
import os
orig_filepath = os.path.join(files_loc,"ks-pds-csv8")
sales_data_filename = "sales_data.csv"
sales_data_filepath = os.path.join(files_loc, sales_data_filename)

In [None]:
!mv {orig_filepath} {sales_data_filepath}

Ok, let's load our data now and have a quick look at it:

In [None]:
raw_df = pd.read_csv(sales_data_filepath)

In [None]:
raw_df.head()

In [None]:
raw_df.describe()

Let's modify our dataframe a bit doing some group operations. Bear with me with this code for now as we'll be explaining how `groupby` works later on in this notebook:

In [None]:
df = raw_df.groupby(['account number', 'name'])['ext price'].sum().reset_index()

df

Now let's plot this information, with Seaborn providing the styling:

In [None]:
import seaborn as sns

sns.set_style('whitegrid')

df['ext price'].plot(kind='hist')

Here, the histogram is showing us 8 bins with data. What if we wanted to divide our customers into a different number of groups (or bins)? That’s what pandas `qcut` and `cut` are for.



Let's start with `qcut`. `qcut` is a Quantile-based discretization function. This basically means that qcut tries to divide up the underlying data into equal sized bins. The function defines the bins using percentiles based on the distribution of the data, not the actual numeric edges of the bins.

Having used the pandas `describe` function, you have already seen an example of the underlying concepts represented by `qcut`:



In [None]:
df['ext price'].describe()

Keep in mind the values for the 25%, 50% and 75% percentiles, we'll be seeing them again using `qcut`.

The simplest use of qcut is to define the number of quantiles and let pandas figure out how to divide up the data. Let's tell pandas to create 4 equal sized groupings of the data:

In [None]:
pd.qcut(df['ext price'], q=4)

The result is a categorical series representing the sales bins. Because we asked for quantiles with `q=4`, the bins match the percentiles from the `describe` function.

A common use case is to store the bin results back in the original dataframe for future analysis. So let's create 4 bins (aka quartiles) and 10 bins (aka deciles) and store the results back in the original dataframe:


In [None]:
df['quantile_ex_1'] = pd.qcut(df['ext price'], q=4)
df['quantile_ex_2'] = pd.qcut(df['ext price'], q=10, precision=0)

df.head()

We can see how the bins are very different between `quantile_ex_1` and `quantile_ex_2`. We also used `precision` to define how many decimal places to use when storing and displaying the bin edges.

The other interesting view is to see how the values are distributed across the bins using `value_counts`:

In [None]:
df['quantile_ex_1'].value_counts()

In [None]:
df['quantile_ex_2'].value_counts()

This illustrates a key concept. In each case, there are an equal number of observations in each bin. Pandas does the math behind the scenes to figure out how wide to make each bin. If you look closely, in `quantile_ex_1` the range of the first bin is 74,661.15 while the second bin is only 9,861.02 (110132 - 100271).
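
The link between `qcut` and the percentiles can be checked on synthetic data. The sketch below uses the integers 1 to 100 (an invented series, not the sales data) and compares the edges returned by `retbins=True` with the output of `quantile`:

```python
import pandas as pd

values = pd.Series(range(1, 101))   # synthetic data: 1, 2, ..., 100

binned, edges = pd.qcut(values, q=4, retbins=True)
quartiles = values.quantile([0, 0.25, 0.5, 0.75, 1])

print(edges)                                     # bin edges chosen by qcut
print(quartiles.to_numpy())                      # the same percentiles, computed directly
print(binned.value_counts(sort=False).tolist())  # observations per bin
```

Each bin ends up with the same number of observations, even though the edges themselves are derived from the data rather than chosen up front.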

One of the challenges with this approach is that the bin labels are not very easy to explain to an end user. For instance, if we wanted to divide our customers into 5 groups (aka quintiles) like an airline frequent flier approach, we can explicitly label the bins to make them easier to interpret.

In [None]:
bin_labels_5 = ['Bronze', 'Silver', 'Gold', 'Platinum', 'Diamond']
df['quantile_ex_3'] = pd.qcut(df['ext price'],
                              q=[0, .2, .4, .6, .8, 1],
                              labels=bin_labels_5)
df.head()

We just did some things a little differently. We explicitly defined the range of quantiles to use: `q=[0, .2, .4, .6, .8, 1]`.

But we also defined the labels `labels=bin_labels_5` to use when representing the bins.

Let’s check the distribution:

In [None]:
df['quantile_ex_3'].value_counts()

We now have an equal distribution of customers across the 5 bins and the results are displayed in an easy to understand manner.

One important item to keep in mind when using `qcut` is that the quantiles must all lie between 0 and 1. Here are some examples of distributions. In most cases it’s simpler to just define `q` as an integer:

- terciles: q=[0, 1/3, 2/3, 1] or q=3
- quintiles: q=[0, .2, .4, .6, .8, 1] or q=5
- sextiles: q=[0, 1/6, 1/3, .5, 2/3, 5/6, 1] or q=6

Now, how do we know what ranges are used to identify the different bins? We can use `retbins=True` to return the bin edges:

In [None]:
results, bin_edges = pd.qcut(df['ext price'],
                            q=[0, .2, .4, .6, .8, 1],
                            labels=bin_labels_5,
                            retbins=True)

results_table = pd.DataFrame(zip(bin_edges, bin_labels_5),
                            columns=['Threshold', 'Tier'])

In [None]:
results_table

Now, let's go with `cut`.

Many of the concepts we discussed above apply but there are a couple of differences with the usage of `cut`.

The major distinction is that `qcut` will calculate the size of each bin in order to make sure the distribution of data in the bins is equal. All bins will roughly have the same number of observations but the bin range will vary.

On the other hand, `cut` is used to specifically define the bin edges. There is no guarantee about the distribution of items in each bin. In fact, we can define bins in such a way that no items are included in a bin or nearly all items are in a single bin.

In real world examples, bins may be defined by business rules. For a frequent flier program, 25,000 miles is the silver level and that does not vary based on year to year variation of the data. If we want to define the bin edges (25,000 - 50,000, etc) we would use `cut` . We can also use `cut` to define bins that are of constant size and let pandas figure out how to define those bin edges.
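
The contrast is easiest to see on skewed data. This sketch uses a made-up series, `skewed`, with one large outlier and asks both functions for two bins:

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # invented values

# cut: two bins of equal width across the range 1..100.
equal_width = pd.cut(skewed, bins=2).value_counts(sort=False).tolist()

# qcut: two bins holding the same number of observations.
equal_count = pd.qcut(skewed, q=2).value_counts(sort=False).tolist()

print(equal_width)
print(equal_count)
```

With `cut`, the outlier stretches the range so that nearly everything lands in the first bin, while `qcut` splits the data at the median.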

Let's remove some columns to keep the examples short:

In [None]:
df = df.drop(columns = ['quantile_ex_1','quantile_ex_2', 'quantile_ex_3'])
df

For the first example, we can cut the data into 4 equal bin sizes. Pandas will perform the math behind the scenes to determine how to divide the data set into these 4 groups:



In [None]:
pd.cut(df['ext price'], bins=4)

Let’s look at the distribution:

In [None]:
pd.cut(df['ext price'], bins=4).value_counts()

The first thing to notice is that the bin ranges are all about 32,265 but that the distribution of bin elements is not equal.

The bins have a distribution of 12, 5, 2 and 1 item(s) in each bin. This is the essential difference between `cut` and `qcut`.

Interval notation:

![Intervals](https://pbpython.com/images/Interval_notation.png)

When using `cut`, we may be defining the exact edges of our bins so it is important to understand if the edges include the values or not. Depending on the data set and specific use case, this may or may not be a big issue. It can certainly be a subtle issue we do need to consider.

To bring it into perspective, when we present the results of our analysis to others, we will need to be clear whether an account with 70,000 in sales is a silver or gold customer.

Here is an example where we want to specifically define the boundaries of our 4 bins by defining the bins parameter.

In [None]:
cut_labels_4 = ['silver', 'gold', 'platinum', 'diamond']
cut_bins = [0, 70000, 100000, 130000, 200000]
df['cut_ex1'] = pd.cut(df['ext price'], bins=cut_bins, labels=cut_labels_4)

In [None]:
df.head()

Let's make sure we understand how the `cut_bins` are being processed when making the classification (closed or open intervals):

In [None]:
pd.cut(df['ext price'], bins=cut_bins)
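
By default `cut` builds intervals that are closed on the right, so a value exactly on an edge falls in the lower bin. The `right=False` option flips that. A minimal sketch with made-up values and only two tiers:

```python
import pandas as pd

sales = pd.Series([70000, 70001])   # invented amounts around the boundary
edges = [0, 70000, 100000]
tiers = ['silver', 'gold']

# Default: intervals (0, 70000] and (70000, 100000].
right_closed = pd.cut(sales, bins=edges, labels=tiers)

# right=False: intervals [0, 70000) and [70000, 100000).
left_closed = pd.cut(sales, bins=edges, labels=tiers, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

Which convention is correct depends on the business rule, so it is worth stating it explicitly when sharing results.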

## String manipulation

### String object methods

As a refresher of concepts you already saw in the introductory Python classes, you can get a list from a string by calling `split()`. This may sound basic, but it will be quite useful for some Data Science text and language processing code you'll see later on in the course:

In [None]:
string = 'this is some sentence'
string.split()

### Vectorized string functions in pandas

[Vectorized string functions in pandas](https://pandas.pydata.org/pandas-docs/stable/text.html) are grouped within the `.str` attribute of Series and Indexes. They have the same names as the regular Python string functions, but work on Series of strings.

We saw in the previous NumPy class (and in the first Pandas class) how both modules generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements.

This is called a *vectorization* of the operations, and simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.

Let's compare working with the array elements (for example, for capitalizing the animal names in this list):

In [None]:
animals = 'rhino giraffe molerat mantisshrimp cheetah mosquito whale'.split()
animals

Operating over the elements *in a pythonic way* involves applying a lambda function with `map`:

In [None]:
list(map(lambda st: st.capitalize(), animals))

But with vectorized operations, we can be *even more data science pythonic*!:

In [None]:
df1['animal'] = animals
df1

In [None]:
animals_series = df1['animal']
animals_series.str

In [None]:
animals_series.str.capitalize()

Now, we've just applied the `capitalize()` operation over the components of the series but by syntactically acting on the Series itself.

Let's see more examples of it:

In [None]:
animals_series.str.len()

In [None]:
animals_series.str.count('o')

In [None]:
animals_series.str.contains('m')

In [None]:
df1[animals_series.str.contains('m')]

Stray spaces are quite common in the text data we're trying to clean up. We can strip them with vectorized operations as well:

In [None]:
series_with_blanks = pd.Series(['SDF    ', ' RTTR     ', 'BL   '])
series_with_blanks

We can clean up the trailing blanks using `rstrip`:

In [None]:
series_with_blanks.str.rstrip()
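
Note that `rstrip` leaves the leading blank of the second entry in place. The `.str` accessor also offers `lstrip` and `strip`, mirroring the plain Python string methods. A short sketch on the same values, under the illustrative name `padded`:

```python
import pandas as pd

padded = pd.Series(['SDF    ', ' RTTR     ', 'BL   '])

print(padded.str.rstrip().tolist())   # trailing blanks only
print(padded.str.lstrip().tolist())   # leading blanks only
print(padded.str.strip().tolist())    # both ends
```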

#### Exercise


Let's come back to our On Time Performance (OTP) dataset. To practice with the recently explained concepts, do the following:

* Generate a list of the columns that have 'Origin' in their name
* Show a sample of the values that those columns take.

In [None]:
df_otp.columns.str.contains('Origin')

In [None]:
df_otp.columns[df_otp.columns.str.contains('Origin')]

In [None]:
df_otp[df_otp.columns[df_otp.columns.str.contains('Origin')]].sample(5)

So much redundant information! Let's jump ahead with this list of interesting columns:

```python
interesting_columns= ['FlightDate', 'DayOfWeek', 'Reporting_Airline', 'Tail_Number', 'Flight_Number_Reporting_Airline', 
                      'Origin', 'OriginCityName', 'OriginStateName', 'OriginCityMarketID',
                      'Dest', 'DestCityName', 'DestStateName', 'DestCityMarketID',
                      'DepTime', 'DepDelay', 'AirTime', 'Distance']

flights = df_otp[interesting_columns]
```

# Data Aggregation and Group Operations

## GroupBy mechanics

Group operations are sometimes also called split-apply-combine, which is a good description of the process:

- **Split**: data contained in a pandas object, whether a Series or DataFrame is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
- **Apply**: A function is then applied to each group, producing a new value.
- **Combine**: Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data.

![Split-Apply-Combine](https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png)
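
To make the three steps concrete, here is a sketch that performs them by hand on a made-up frame, `toy`, and compares the result with `groupby`. The manual version is only for illustration and is not how pandas implements grouping internally:

```python
import pandas as pd

toy = pd.DataFrame({'key': list('abcabc'), 'value': [1, 2, 3, 4, 5, 6]})

# Split: select the rows of each key. Apply: sum them. Combine: collect in a Series.
manual = pd.Series({
    key: toy.loc[toy['key'] == key, 'value'].sum()
    for key in sorted(toy['key'].unique())
})

# The same computation expressed with groupby.
grouped = toy.groupby('key')['value'].sum()

print(manual.to_dict())
print(grouped.to_dict())
```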

Let's get started with an example:

In [None]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)})
df

The most basic split-apply-combine operation can be computed with the `groupby()` method of DataFrames and Series, passing the name of the desired key column to group by. In this case, let's practice with a Series:

In [None]:
grouped = df['data1'].groupby(df['key1'])
grouped

We don't get a set of Series, but a SeriesGroupBy object. 

This object is where the magic is: you can think of it as a special view of the Series that is ready to dig into the groups but doesn't actually compute anything until an aggregation is applied.

This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.

Also, note that we're telling Pandas to group a Series using the criteria provided by another one (that is **not** the index of the Series we're grouping). This will be standard practice and you'll become more familiar with it as we move forward with this notebook.

Let's produce a result to see it:

In [None]:
grouped.sum()

As we expected, we obtain the sum of the rows that correspond to each key (`a` or `b`).

If we ask for the mean...

In [None]:
grouped.mean()

...what we get instead is the mean of the values of the Series (`data1`) in each of the groups (`a` and `b`).

Let's operate now on the dataframe itself:

In [None]:
means = df.groupby(['key1','key2','data2']).mean()
means

So we asked for the means grouping on `key1`, `key2` and `data2`, and that's what we're getting for `data1`. There are no differences between the original `data1` values and the means because each combination of the three grouping keys contains exactly one row.

Let's create a couple of arrays with data to be used by `groupby()` on the series for the `data1` column:

In [None]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

You see here the same mechanism mentioned before, working in this case with NumPy arrays instead of other Series coming from the same DataFrame, but the internal dynamics are the same.

If we consolidate the computation on the grouping by applying a function, we materialize a new dataframe with the result:

In [None]:
# numeric_only=True is needed on pandas 2.x, where text columns raise an error.
df.groupby('key1').mean(numeric_only=True)

In this case, we got the mean values for both non-`groupby` columns (that is, `data1` and `data2`).

**Question**: Where is `key2` column? Why is it not showing up?

To better understand how `groupby()` is generating the groups, let's compare the `head()` function over the dataframe or over the `DataFrameGroupBy` object:

In [None]:
df.groupby(['key1']).head(1)

What we're seeing here is the first row of each group produced by `groupby`. Compare that to asking for the first row of the original dataframe:

In [None]:
df.head(1)

If we ask for the first 3 rows of each group produced by `groupby`, we will see no difference from the first rows of the original dataframe. Why?

In [None]:
df.groupby('key1').head(3)

In [None]:
df.head()

Well, this is because although we're actually getting the first 3 rows of each group, the rows keep the same ordering as in the original dataset and, since no group has more than 3 rows, every row is returned.

We should now have a clearer intuition about how `groupby()` works. With this in mind, let's group by two keys and compute the mean over each group:

In [None]:
df.groupby(['key1', 'key2']).mean()

We can ask about the size of each group as well:

In [None]:
df.groupby(['key1', 'key2']).size()

### Iterating over groups

We can dig a bit deeper into what the grouping is doing by iterating over the groups and printing each one:

In [None]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

That was kind of easy to visualize without doing the iteration anyway. Let's do it again when we apply a multikey grouping:

In [None]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

As we mentioned before, `groupby()` generates a `DataFrameGroupBy` object:

In [None]:
df.groupby('key1')

Generating a list on it will give us access to the actual elements composing the object, that is, the groups themselves:

In [None]:
list(df.groupby('key1'))

Now, to have this even better, let's put it in a Python dictionary and ask about one of its keys:

In [None]:
pieces = dict(list(df.groupby('key1')))
pieces

In [None]:
type(pieces['b'])

So now we have a nicely organized structure where we can see that each member of the grouping is a dataframe to which the operation is then applied.
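
When only one group is needed, building the full dictionary is not required: the GroupBy object has a `get_group` method. A sketch on a small made-up frame, `toy`, that compares both routes:

```python
import pandas as pd

toy = pd.DataFrame({'key1': list('aabba'),
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

groups = toy.groupby('key1')
pieces = dict(list(groups))   # every group, keyed by its name

# get_group returns one sub-frame without materializing the others.
print(groups.get_group('b'))
print(groups.get_group('b').equals(pieces['b']))
```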

We can access the dataframe's data types with the property `dtypes`:

In [None]:
df.dtypes

And then, we can use this property to group our dataframe by columns instead of along the default axis 0. If we apply the same data structure transformations as before, we'll get the different dataframes in the group, now split by type:

In [None]:
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))

### Selecting a column or subset of columns

To select a column or a subset of columns from the `DataFrameGroupBy`, just pass its name as a string if we want a `SeriesGroupBy` or as an element of a list if we want a `DataFrameGroupBy` instead:

In [None]:
df.groupby('key1')

In [None]:
df.groupby('key1')['data1']

In [None]:
df.groupby('key1')[['data2']]

You can do the same for a Series, for example extracting the Series from a DataFrame just as we saw earlier in the notebook:


In [None]:
type(df['data1'])

In [None]:
df['data1'].groupby(df['key1'])

We asked for a `SeriesGroupBy` and this is what we got. It makes sense, because the data structure is 1-dimensional.

However, if we pass a non 1-dimensional grouper to the Series, this will not work:

In [None]:
df['data1'].groupby(df[['data1']])

Passing a list to our original dataframe will generate a new dataframe, though:

In [None]:
type(df[['data2']])

...so grouping it produces a `DataFrameGroupBy`:

In [None]:
df[['data2']].groupby(df['key1'])

So wrapping up, when we materialize an operation (like the mean in this case), we obtain only the columns explicitly mentioned:

In [None]:
df.groupby(['key1', 'key2'])[['data2']].mean()

We can do the same for Series:

In [None]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

So applying an operation on it will materialize a Series:

In [None]:
s_grouped.mean()