# <font face="times"><font size="6pt"><p style = 'text-align: center;'> The City University of New York, Queens College

<font face="times"><font size="6pt"><p style = 'text-align: center;'><b>Introduction to Computational Social Science</b><br/><br/>

<p style = 'text-align: center;'><font face="times"><b>Lesson 03 | Coding with Python II</b><br/><br/>


<p style = 'text-align: center;'><font face="times"><b>10 Checkpoints</b><br/><br/>


Source: https://www.coursera.org/learn/python-programming-introduction

***
# Begin Lesson 03

# Introduction to Pandas

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with both *relational* and *labeled* data easy. It is a fundamental high-level building block for doing practical, real-world data analysis in Python. 

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame.
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

### The plan of attack for this lesson
We are going to use `pandas` and `statsmodels` today to show how the data analysis can be done completely in Python. We will do so by using a number of modules, though `pandas` is going to feature heavily.

`Pandas` is a package that allows us to work with datasets in a similar manner as in R (with dataframes) and, according to their own website, has the objective of becoming **the most powerful and flexible open source data analysis / manipulation tool available in any language**. 

It's up to you to decide whether that is (already) true or not, but this tutorial will demonstrate some of its capabilities. We also use statsmodels to do some of the statistical analysis, in a workflow integrated with Pandas.

The dataset we use in this example is a subset of the data presented in
Trilling, D. (2013). *Following the news. Patterns of online and offline news use*. PhD thesis, University of Amsterdam. http://hdl.handle.net/11245/1.394551


### A note on importing packages
Because `pandas` is an outside package, we need to import it to use it in `python`. We'll do that with a number of other packages as well (statsmodels/numpy/matplotlib/etc.), since they give us some specific tools to work with. You don't have to worry about these yet. This time around we are only going to use `pandas` and `statsmodels`.

***
***

# `Pandas`
Normally, you'd have to install `pandas`, but [Anaconda](https://store.continuum.io/cshop/anaconda/) and [PythonAnywhere](https://pythonanywhere.com) come with `pandas` pre-installed, so all you have to do is `import` it in `Python`. We do this using the `import` command:

In [None]:
import pandas

Since writing out `pandas` is too much work for us to be bothered with, we'll instead call it something shorter in length. Let's reimport it, but now let's call it "pd" instead. 

In [None]:
import pandas as pd

***
***

Just as we did with `python` in the last two sessions, we need to lay out the basics for `pandas` before we can really launch in and make the data analysis magic happen. Luckily, a lot of learning `pandas` is just learning how to use series and dataframes. We'll go over one after the other here.

## Series

A **Series** is a single vector of data with an *index* that labels each element in the vector. For most intents and purposes this works like a "column" or "variable" in a dataset. But treating it as a vector gives `pandas` (much like R) more flexibility on the mathematical side of things.

We create a series using the `pd.Series()` command:

In [None]:
counts = pd.Series([632, 1638, 569, 115])
counts

If an index is not specified (you might want to use the ids that come with survey data, for instance), a default sequence of integers starting at zero (i.e., 0, 1, 2, ...) is assigned as the index.

When you create a series it actually creates two sets of information: (1) an array that contains the values of the `Series`, your data, and (2) the index, which is actually stored as a `pandas` `Index` object. See for yourself.

In [None]:
counts.values

In [None]:
counts.index

We can also assign meaningful labels to the index, if they are available:

In [None]:
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

bacteria

****
****

# Checkpoint 1 of 10

## Now You Try!

### First, create your own `Series` that mimics the above example, but doesn't copy it. In other words, use different numbers and use different labels that are not bacteria. Call your `Series`, `myTest`. 

### Next, with your new `Series`, return the values and index of the Series, just as shown above. 

****
****

***
It's worth taking a little detour at this point to talk a bit more about indices, since we can do a lot of really useful things with an index. You can, of course, use the index to refer to specific values in the `Series`:

In [None]:
bacteria['Actinobacteria']

We can also use it to return a list of True and False values that indicate which row has an index that ends with the word `bacteria`. Which rows would be True and which would be False? 

In [None]:
[name.endswith('bacteria') for name in bacteria.index]

We can go a step further and return all rows whose indices end with the word `bacteria.` Which rows will we return?

In [None]:
bacteria[[name.endswith('bacteria') for name in bacteria.index]]
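The list comprehension above can also be written with the `.str` accessor that `pandas` provides on an index of strings. The sketch below rebuilds the `bacteria` example so it runs on its own; the names `mask_list`, `mask_str`, `by_list`, and `by_str` are made up for illustration.

```python
import pandas as pd

# Rebuild the example Series so this cell runs on its own
bacteria = pd.Series([632, 1638, 569, 115],
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

# Mask built with a list comprehension, as in the cell above
mask_list = [name.endswith('bacteria') for name in bacteria.index]

# Same test applied to every label at once through the index's .str accessor
mask_str = bacteria.index.str.endswith('bacteria')

by_list = bacteria[mask_list]
by_str = bacteria[mask_str]
by_str
```

Both masks select the same rows; the `.str` version is simply shorter to type once you are comfortable with it.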

***
****
# Checkpoint 2 of 10

## Now you try! 

### First, return all rows from the `Series` called `bacteria`  that end with the letter `'s'`. 

### Next, return all rows from the `Series` that start with the letter `'P'`. Hint, try the method called `.startswith()`. 

***
****

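We can still use positional indexing if we wish, just like we saw with `python` so far. With a labeled index, the `.iloc` indexer makes it explicit that we mean a position rather than a label. (Plain `bacteria[0]` triggers a deprecation warning in recent versions of `pandas`.) What value would we get from the first position?

In [None]:
bacteria.iloc[0]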

We can also give a title to the index (kind of like a title to a column in a table) and a title to the value. 

In [None]:
bacteria.index.name = 'phylum' #Title for the index
bacteria.name = 'counts' #Title for the values
bacteria

****
****

# Checkpoint 3 of 10

## Now you try! 

### Recall your `Series` called `myTest`. Create titles for its `index` and its `name`. 

****
****

***
Back to `Series` more generally, but only briefly. The subsetting tools from `python` work with `Series`, but you can also use more mathematical conditions, thanks to `pandas`. Just to give an example, we can filter our `Series` based on a cutoff number. Which rows would you expect `python` to return if we only wanted rows whose values are greater than 1,000?

In [None]:
bacteria[bacteria>1000]

You can also filter on a range! Recall the logic controls we discussed earlier. 

Using the `&` for `and` and the `|` for `or`, we can filter on several conditions. Each condition goes in its own set of parentheses. 

For example, we can filter bacteria in a certain range, say greater than 500 but less than 1000. 

In [None]:
bacteria[(bacteria < 1000) & (bacteria > 500)]
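As a sketch of the range filter described above, the cell below writes it two ways: with two parenthesized comparisons joined by `&`, and with the `Series` method `.between()`. It assumes a version of `pandas` recent enough (1.3 or later) to accept `inclusive='neither'`; the names `in_range` and `in_range_between` are made up for illustration.

```python
import pandas as pd

# Rebuild the example Series so this cell runs on its own
bacteria = pd.Series([632, 1638, 569, 115],
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

# Each comparison gets its own parentheses because & binds more tightly than > and <
in_range = bacteria[(bacteria > 500) & (bacteria < 1000)]

# .between() expresses the same range; inclusive='neither' excludes both endpoints
in_range_between = bacteria[bacteria.between(500, 1000, inclusive='neither')]

in_range
```

Leaving out the parentheses around each comparison is a common source of confusing error messages, so it is worth making them a habit.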

****
****

# Checkpoint 4 of 10

## Now you try!

### Using your `Series` `myTest`, select only the rows whose values are greater than `300` but less than `650` just as you did in the above example for the `Series` `bacteria`. (Note: Answers will vary.)

### Now, instead of using the `&`, repeat what you just did with  `|`. 
### Why did you get these results? 

****
****

***
Because in `Python` there are countless ways of doing anything, we can also build our `Series` from a `Dictionary`. Recall that a `Dictionary` is a data structure made up of pairs, each of which contains a `Key` and a `Value`. Think of the `Key` as the `Index` and the `Value` as (you guessed it) the value. Let's build our dictionary first:

In [None]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
bacteria_dict

Now, let's turn it into a `Series` and title the index and the values just as we did before.

In [None]:
bacteria_dict_pd = pd.Series(bacteria_dict)
bacteria_dict_pd.index.name = 'phylum' #Title for the index
bacteria_dict_pd.name = 'counts' #Title for the values

Let's compare it to the original series we were working with.

In [None]:
bacteria_dict_pd

In [None]:
bacteria

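Notice that the two `Series` have all of the same indices and values. In recent versions of Python and `pandas`, a `Series` built from a `Dictionary` keeps the order in which the keys were written, so the order matches too; older versions sorted the keys alphabetically instead. If you want a different order, it is fairly easy to set, because `pandas` will order the series according to the index values you feed it. 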


In [None]:
bacteria_dict_ordered_pd = pd.Series(bacteria_dict, 
                                     index=['Actinobacteria','Bacteroidetes','Firmicutes','Proteobacteria'])


In [None]:
bacteria_dict_ordered_pd

****
****

# Checkpoint 5 of 10

## Now you try! 

### Create a `dictionary` that has some sort of labels (again, not bacteria!) with corresponding values. Call it `myTestDict2`. Make-up anything you'd like. 

### Convert this `dictionary` into a `Series`, just as you did before. Call the new `Series` `myTest2`. 

****
****

Take care to note that if you pass in an `Index` that does NOT correspond to the `Keys` in the original dictionary, you'll get missing values with this approach. In other words, it will treat it as missing data with a `NaN`, the "not a number" type for missing values. 

For example:

In [None]:
bacteria_dict

We will set this to be our new Index, but notice that "Cyanobacteria" is not one of the keys in the dictionary.

In [None]:
['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria']

In [None]:
bacteria_series_updated = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria_series_updated

Notice that we dropped "Bacteroidetes" and instead added an "empty" row for "Cyanobacteria".

We can use the `.isnull()` method to return a `Series` of True and False values indicating which rows are `NaN` (i.e., nulls).
Which row would come back True?

In [None]:
bacteria_series_updated.isnull()

This is helpful when we want to clean data and remove rows that have empty data. The basic strategy is to only keep those rows which return `False` when we use `.isnull()`.

How do we do this? The `~` (NOT) operator turns `True` into `False` and `False` into `True`. 

If we run the code below, which rows will we keep? (HINT: The rows that are NOT NULL)

In [None]:
bacteria_series_updated[~bacteria_series_updated.isnull()]
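The same clean-up can be sketched with two other built-in `Series` methods, `.notnull()` and `.dropna()`. The cell below rebuilds the example so it runs on its own; the names `kept_mask`, `kept_notnull`, and `kept_dropna` are made up for illustration.

```python
import pandas as pd

# Rebuild the example Series so this cell runs on its own
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638,
                 'Actinobacteria': 569, 'Bacteroidetes': 115}
bacteria_series_updated = pd.Series(
    bacteria_dict,
    index=['Cyanobacteria', 'Firmicutes', 'Proteobacteria', 'Actinobacteria'])

# 1. Flip the .isnull() mask with ~, as in the cell above
kept_mask = bacteria_series_updated[~bacteria_series_updated.isnull()]

# 2. Ask directly for the rows that are NOT null
kept_notnull = bacteria_series_updated[bacteria_series_updated.notnull()]

# 3. Let pandas drop the missing rows in one step
kept_dropna = bacteria_series_updated.dropna()

kept_dropna
```

All three return the same rows, so which one you use is mostly a matter of taste.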

****
****

# Checkpoint 6 of 10

## Now you try! 

### Modify the list below and add your own labels to the list that aren't already included in the designated spaces.

`['PUT_YOUR_OWN_LABEL_HERE','PUT_ANOTHER_LABEL_HERE','Proteobacteria','Actinobacteria']`

### Repeat the above steps and see what non-null rows you return. Why did you get these rows?

****
****

***
Remember, the main reason why `Series` are vectors is that it makes the math much simpler for the computer. We can, for instance, combine `Series` together and add their values. It does this by adding together values that have the same index. For instance, consider our original two `Series`:

In [None]:
bacteria

In [None]:
bacteria_series_updated

We want to add the rows that have the same `indices` together. That is, we want to add the value for "Firmicutes" together, or 632+632, to yield 1264. `pandas` will do this if we just tell it to add the two `Series`.

***HOWEVER***, recall that "Bacteroidetes" exists only in the `Series` named "bacteria", while "Cyanobacteria" exists only in "bacteria_series_updated"

So what happens when we try to add the values for *those* indices across the two `Series`?

In [None]:
bacteria+bacteria_series_updated

The missing values were propagated by addition. NaN + any number = NaN. The same would hold if we did subtraction, multiplication, or division. 

In [None]:
bacteria - bacteria_series_updated # The non-missing values should all be 0.

In [None]:
bacteria*bacteria_series_updated

In [None]:
bacteria/bacteria_series_updated # The non-missing values should all be 1.
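If you would rather treat a label that is missing from one `Series` as zero instead of letting `NaN` spread, the `.add()` method accepts a `fill_value` argument. The cell below is a minimal sketch that rebuilds both example `Series`; the names `total_plain` and `total_filled` are made up for illustration.

```python
import pandas as pd

# Rebuild both example Series so this cell runs on its own
bacteria = pd.Series([632, 1638, 569, 115],
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])
bacteria_series_updated = pd.Series(
    {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115},
    index=['Cyanobacteria', 'Firmicutes', 'Proteobacteria', 'Actinobacteria'])

# Plain addition: any label missing on either side becomes NaN
total_plain = bacteria + bacteria_series_updated

# .add() with fill_value=0: a value missing on ONE side is treated as 0;
# a label with no value on either side still comes out as NaN
total_filled = bacteria.add(bacteria_series_updated, fill_value=0)

total_filled
```

Here "Bacteroidetes" keeps its count from `bacteria`, while "Cyanobacteria" stays `NaN` because neither `Series` has a number for it.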

***
***
## DataFrames

Now that we've covered `Series`, we can combine them together to create a `DataFrame`. 

A `DataFrame` is a tabular data structure (technically, a matrix) made up of multiple `Series` that function like columns in a spreadsheet. The `DataFrame` is the central data structure for `pandas`. What makes it so useful and powerful to use is that we can store different types of variables all in one place. Run the next slab of code to see it in action:


In [None]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

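Notice the columns in the `DataFrame` are in a particular order. Recent versions of `pandas` keep the order of the dictionary keys (value, patient, and phylum), while older versions sorted them alphabetically (patient, phylum, and value). We can change the order by indexing them in the way we want, just like we could do with `Series`: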

In [None]:
data[['phylum','value','patient']] #phylum, value, and patient

***
Now, let's get down to analyzing the parts of a `DataFrame`. Like a `Series`, a `DataFrame` also has an `Index` to reference rows in the exact same way. 

However, it also has a second index, representing its columns. We can look at them using the `.columns` attribute.

What values should we expect?

In [None]:
data.columns

***
If we wish to access a column, we can do it in one of several ways:

In [None]:
data['phylum'] 

This returns the column as a `Series`. Using the `type()` function, we can prove this. 

In [None]:
type(data['phylum'] )

In [None]:
data.phylum

In [None]:
type(data.phylum)

However, we can also return it as a `DataFrame` with a double-bracket, `[[]]`. This may come in handy later when we're working with data. 

In [None]:
data[['phylum']]

In [None]:
type(data[['phylum']])

***
***

# Checkpoint 7 of 10

## Now you try! 

### Create your own `DataFrame` with your own made-up data. Call it `myDF`. It doesn't need to be as elaborate as the above example, two columns will suffice. One column should have some sort of label, and another column should have some numeric value.

1. First, use the `.columns` attribute to return the column names you selected for your `DataFrame`. 

2. Then, use the `type()` function to return the data types of your two columns. 

3. Finally, use the double-bracket notation to select one of the two columns you made. 

***
***

***
***
## Selecting specific data
Use the `.loc[row_index, col_index]` indexer to select subsets of data. The `row_index` and `col_index` can each be 
- a single label (e.g., 'phylum')
- a `list` or `array` of labels (e.g., ['patient', 'phylum'])
- a slice with **labels** (e.g., 'patient':'phylum'; in this case the slice is *inclusive* on each end)
- an array of boolean conditions (True/False)

Put a single "`:`" as an index if you want all rows/columns. For example,
```python
data.loc[:, 'phylum']
```
will return the entire 'phylum' column.

In [None]:
data.loc[0:5, 'patient']  # get rows 0 through 5 (inclusive) and column 'patient'

In [None]:
data.loc[ 3:5 , ['patient', 'phylum', 'value'] ]  # get rows 3 through 5 and the three specified columns

In [None]:
data.loc[ 3:5, 'patient':'phylum' ]  # get rows 3 through 5 and all columns between 'patient' and 'phylum' (inclusive)

***
We can also use the `.iloc` indexer, which selects rows by integer position rather than by label. 

In [None]:
data.iloc[3]

In [None]:
data.loc[3] # .loc selects by label; the row labels here are the integers 0-7, so this matches data.iloc[3]
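The difference between `.loc` (labels) and `.iloc` (positions) is easier to see when the row labels are not numbers. The cell below is a minimal sketch using a shortened, made-up version of `data`; the name `labeled` is hypothetical.

```python
import pandas as pd

# A shortened version of the example data so this cell runs on its own
data = pd.DataFrame({'value': [632, 1638, 569, 115],
                     'phylum': ['Firmicutes', 'Proteobacteria',
                                'Actinobacteria', 'Bacteroidetes']})

# Use the phylum names as row labels instead of 0, 1, 2, 3
labeled = data.set_index('phylum')

by_position = labeled.iloc[3]            # fourth row, counted from zero
by_label = labeled.loc['Bacteroidetes']  # row whose label is 'Bacteroidetes'

# Label slices include both ends; position slices stop before the end
rows_loc = len(data.loc[0:2])    # labels 0, 1, and 2
rows_iloc = len(data.iloc[0:2])  # positions 0 and 1 only

by_label
```

With the default integer labels the two indexers often return the same rows, which is why the slicing difference is the part most worth remembering.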

****
****

# Checkpoint 8 of 10

## Now you try! 

1. First, using the `DataFrame` `data` and the `.loc[]` indexer, select rows 2, 3, and 4 of the columns called `'phylum'` and `'value'`. 

2. Next, use the `.iloc[]` indexer and return the first row. (Hint: this won't be the number 1.)


***
***

***
***

## Dictionaries and DataFrames

Just like with the `Series`, we can use `Dictionaries` to create a `DataFrame`. 

NOTE: This is NOT something you'll be really ever doing in practice, but it's good to see. 

For example, let's take a single row from the `DataFrame` above and rewrite it as a `Dictionary` with the `dict()` function.

In [None]:
dict(data.iloc[3])

***

To save time, let's just run the code below, which takes the rows from the previous `DataFrame` and turns them into a `Dictionary` that is made up of other `Dictionaries`. 

In [None]:
data = pd.DataFrame({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
                    1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                    2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                    3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                    4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                    5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                    6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                    7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})

In [None]:
data

***
Oops, the rows are columns and the columns are rows. We can fix that with the transpose attribute `.T`. All it does is swap the columns for rows and rows for columns.


In [None]:
data = data.T
data

***
***

## DataFrames, View, and Copy
We have to talk about a weird quirk of `pandas` at this point. It's technical, but it's important to be aware of the quirk.

Say you assign particular values from a `DataFrame` to a variable. You might think you copied the values of the `DataFrame` to the variable, but in many versions of `pandas` this is just a linked **view** of the `DataFrame` (and not a copy of it). In other words, if you change something in this "copied" variable, **it will also change the `DataFrame`**. (The `pandas` documentation describes a Copy-on-Write mode, planned as the default from version 3.0, under which the original is left unchanged; the walkthrough below shows the older behavior.)

Let's test this out and create a `Series` called `vals` from the `patient` column in `data`.

In [None]:
vals = data.patient # We could also have done this as data["patient"]
vals

Notice that the value with the index 5 has a value 2. Let's change it to 0.

In [None]:
vals[5] = 0
vals

We see now that the value for the index 5 is now 0. However, we **also** changed the `DataFrame` as well. (Very trippy!)

Now, if we go back to `data` you'll notice that the patient variable for the fifth index is now 0.

In [None]:
data 

Instead, we'll have to create a ***copy*** of the patient `Series` with the `.copy()` method. 

In [None]:
vals = data.patient.copy()

Let's change the value back to 2. 

In [None]:
vals[5] = 2
vals

However, since this is a **copy**, the original `DataFrame` remains **unchanged**.

Let's bring up `data`. You'll notice that value remains `0`.

In [None]:
data 

A related safeguard is that the `Index` is immutable, meaning its elements cannot be changed one at a time. This is so that an `Index` can be shared between data structures without fear that it will be changed.

Case in point, try to access the index of a `DataFrame`. 

In [None]:
data.index

If we try to access the first element in the index, we should get a zero. 

In [None]:
data.index[0]

However, if we try to change that value of 0 to, say, 15, we get an error. 

In [None]:
data.index[0] = 15

***
***

## Modifying DataFrames

Of course, we can create or modify values in the `DataFrame` by assigning them directly to the `DataFrame`.

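For example, let's change the value in the `value` column at index 3 from 115 to 14. We use `.loc` with the row label and the column name so that the assignment is made directly on the `DataFrame`.

In [None]:
data.loc[3, 'value'] = 14 # Chained forms such as data['value'][3] = 14 can raise a warning or fail to update data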
data

We can also create a column with a single value. For example, let's create a column for the variable `year` and set it to 2018.

In [None]:
data['year'] = 2018
data

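***HOWEVER***, we cannot use the attribute indexing method to add a new column. For instance, this won't work: the assignment sets a plain Python attribute on the object, and no `treatment` column appears in the `DataFrame`.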

In [None]:
data.treatment = 1
data

In [None]:
data.treatment

Instead, we can create a `Series` and add it as a new column in the `DataFrame`. 

Let's create the `Series` `treatment` as having six observations, each having the value of 0 or 1. To start, let's set it up as a list of four 0s and two 1s, as shown below. 

In [None]:
[0]*4 + [1]*2

Now, let's turn this list of 1s and 0s into a `Series`

In [None]:
treatment = pd.Series([0]*4 + [1]*2)
treatment

When `pandas` assigns the `Series` `treatment` as a column of the `DataFrame`, it does so by aligning the two according to their `indices`.

Let's create a new column called `treatment` in the `DataFrame` and add the `Series` `treatment` to it. 

In [None]:
data['treatment'] = treatment
data

Notice, again, that the treatment values for the indices 6 and 7 are `NaN`, because the original `Series` didn't go that far. 
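Alignment by index also means the order of the incoming `Series` does not matter; only its labels do. The cell below is a small sketch with made-up numbers; the names `df` and `extra` are hypothetical.

```python
import pandas as pd

# A small DataFrame whose row labels are 0, 1, 2
df = pd.DataFrame({'value': [632, 1638, 569]})

# A Series whose labels are out of order and include a label (5) that df lacks
extra = pd.Series([10, 20, 30], index=[2, 0, 5])

# Values are matched to rows by label: row 0 gets 20, row 2 gets 10,
# row 1 has no match and becomes NaN, and label 5 is ignored
df['extra'] = extra
df
```

This is handy when the new values really are keyed by the same ids, and a source of surprise `NaN` values when they are not.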

***
Other Python data structures without an index can be added, such as a list. However, they need to be the same length as the `DataFrame`. 

Notice here, we get an error. 

In [None]:
month = ['Jan', 'Feb', 'Mar', 'Apr']
data['month'] = month


To fix this, we first need to make sure that whatever we add to the `DataFrame` has the same length as the `DataFrame` (and thus matches its `Index`).

The function `len()` returns the "length" of a data object. For a `DataFrame` this means the number of rows. So let's see how "long" our data structure needs to be. 

In [None]:
len(data)

So we need a data object that has 8 elements in it.

Let's use a list to do this. Let's create some random entry and extend it such that it has 8 elements. 

In [None]:
['Jan']*len(data)

Now, we can create a new column in the `DataFrame` and add our above list to it. 

In [None]:
data['month'] = ['Jan']*len(data)
data

***
So this is a good technique, but we don't need this column anymore. Let's delete it.

We can use the `del` statement to remove columns. Let's get rid of the `month` column. 

In [None]:
del data['month']

In [None]:
data

We can use the `.shape` attribute to tell us the number of rows and columns in the `DataFrame`. 

In [None]:
data.shape

We can also drop rows by their indices. Say we wanted to drop the first (index 0) and last (index 7) rows. 

In [None]:
data.drop([0, 7])

Remember, if you actually did want to drop these rows, you'd need to save the `DataFrame` back to itself, otherwise nothing changes: 

In [None]:
data
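To make the point about saving the result concrete, here is a minimal sketch with a cut-down, made-up version of `data`; the name `trimmed` is hypothetical.

```python
import pandas as pd

# A cut-down version of the example data so this cell runs on its own
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2]})

# .drop() returns a NEW DataFrame; data itself still has all 8 rows afterwards
data.drop([0, 7])

# To keep the result, assign it to a name (or back to data itself)
trimmed = data.drop([0, 7])

len(data), len(trimmed)
```

Writing `data = data.drop([0, 7])` would overwrite the original, so assigning to a new name is the safer choice while you are still exploring.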

You can also drop columns by their titles by switching the `axis` parameter to 1. (It's set to zero by default.)

In [None]:
data.drop(['year','phylum'], axis=1)

***
***

## Summary Statistics from DataFrames

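Let's quickly talk about summary statistics to close out this lesson. `Pandas` has a few summary functions that will meet most of your needs. Because `data` contains text columns, we pass `numeric_only=True` so that only the numeric columns are summarized; recent versions of `pandas` raise an error otherwise.

In [None]:
data.sum(numeric_only=True)  # return the sum of each numeric column

In [None]:
data.mean(numeric_only=True)  # the mean of each numeric column

In [None]:
data.median(numeric_only=True)  # median of each numeric column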

Or the `describe()` method gives you a quick summary of each column:

In [None]:
data.describe()

Note that the column for `value` isn't showing up, even though it's a numeric measure. Perhaps there's an issue with how the data are stored. 

To see by column how your data are stored in the `DataFrame` use the method `.dtypes`.

In [None]:
data.dtypes

It seems like `value` is stored as an object (e.g., `string`) and NOT AS an `int` or a `float`. This is a side effect of transposing the `DataFrame` earlier, when each column ended up holding a mix of types. 

So let's recast the column `value` from an object to an `integer` using the method called `.astype()`. Here we'll pass `int` into this method to tell `pandas` that we want to convert this column into `integers`. Remember, we'll have to save the column back onto itself. 

In [None]:
data["value"] = data["value"].astype(int)

Now, let's again use `.dtypes` to see if it worked.

In [None]:
data.dtypes

The column `value` should now be listed as an integer. Now we can re-run the `.describe()` method and get the statistics out for `value`.

In [None]:
data.describe()
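`.astype(int)` works here because every entry in `value` can be read as a whole number. When a column contains entries that cannot be converted, one option is `pd.to_numeric()` with `errors='coerce'`, which turns those entries into `NaN` instead of stopping with an error. The cell below is a sketch with made-up values; the names `raw` and `clean` are hypothetical.

```python
import pandas as pd

# Made-up column stored as text, with one entry that is not a number
raw = pd.Series(['632', '1638', 'n/a', '115'], dtype=object)

# errors='coerce' converts what it can and marks the rest as NaN
clean = pd.to_numeric(raw, errors='coerce')

clean
```

Because `NaN` is a floating-point value, the converted column comes back as floats rather than integers.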

***
***

# Checkpoint 8 of 10

## Now you try! 

1. Using the above `DataFrame` `data`, convert the column `value` back to a `string`. 

2. Next, use the attribute `.dtypes` to see if it worked. 

****
****

Usually, summaries are more useful when combined with `groupby` operations. A `groupby` operation lets you group the data by its value in one or more columns; then you can generate summaries per group. As before, `numeric_only=True` keeps the text column `phylum` out of the calculation:

In [None]:
data.groupby('patient').mean(numeric_only=True)

We can even do a `.groupby()` method using two columns. 

In [None]:
data.groupby(['patient','year']).mean(numeric_only=True)

Notice here that the column titles look a bit off, specifically `value` and `treatment`, which sit on a different line from `patient` and `year`. This is because the result is hierarchically indexed. We'll talk more about this later, but all it means is that either the index for the rows or columns (or both!) is nested within something else. 

It's fine for now, but if we want to save this to a CSV file or work with another `DataFrame`, it might get a bit annoying. To that end, let's use the `.reset_index()` method and "flatten" the index out. 

(Remember, you'll have to save these operations back to the `DataFrame` if you want to keep these changes.)

In [None]:
data.groupby(['patient','year']).mean(numeric_only=True).reset_index()
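As an alternative sketch, `.groupby()` accepts `as_index=False`, which keeps the grouping columns as ordinary columns so that no flattening step is needed afterwards. The cell below uses a cut-down, made-up version of `data` with only numeric columns; the names `flat_reset` and `flat_direct` are hypothetical.

```python
import pandas as pd

# A cut-down, all-numeric version of the example data so this cell runs on its own
data = pd.DataFrame({'value': [632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient': [1, 1, 1, 1, 2, 2, 2, 2],
                     'year': [2018] * 8})

# Group, summarize, then flatten the index afterwards
flat_reset = data.groupby(['patient', 'year']).mean().reset_index()

# Ask groupby to leave the grouping columns as regular columns from the start
flat_direct = data.groupby(['patient', 'year'], as_index=False).mean()

flat_direct
```

Both versions produce one row per patient and year with the mean of `value`.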

***
***

# Checkpoint 9 of 10

## Now you try! 

### Using the `DataFrame` `data`, use the `.groupby()` method to group the data by the column `year` and return the maximum value for each year. (Hint: Use the method `.max()` to get it.) 

### Be sure to also use `.reset_index()` method. 

***
***

***
***
## Using .agg() and .pipe() 

You can use the `.agg()` method to calculate different descriptive statistics for multiple columns at once. 

This includes the mean (using `mean`), the number of observations (using `count`), the minimum or maximum (using `min` or `max`, respectively), and the standard deviation (using `std`). 

Let's try it out! Let's group the `data` by the `treatment` column and get descriptive statistics for the column `value`. 

All you need to do is put in a list `[]` all of the various descriptions you'd like, as shown below. 

In [None]:
data.groupby("treatment").agg({
        "value": ["mean","std","count","min","max"]
    })

Note that `Python` code can get somewhat long. You can use a backslash at the end of a line to continue a statement on the next line, which lets you put each `pandas` method on its own line. 

For instance, let's rewrite the code above with backslashes. Just copy and paste what you have above but start a new line after each method using a backslash, making sure the period is at the beginning of each method. 

In [None]:
data \
    .groupby("treatment") \
    .agg({"value": ["mean","std","count","min","max"]})

Great! Now you have descriptive statistics for the column `value`, specifically its mean, standard deviation, frequency count, and minimum and maximum values. 

You'll notice that the columns look a bit odd though. This is called hierarchical indexing and it's a feature in `pandas`. 

While this is fine as is in `python`, we may want to manipulate these data in other ways or write it to a csv file. (More on this later.) This topic is a bit more complicated, so instead let's do some handwaving and "flatten" the columns, so that this is just one level of columns, instead of two. 

Recall that we used `.reset_index()` for the indices; however, this won't work here for columns. So, we'll have to improvise a bit. 

Below, I wrote a function that takes in a `DataFrame` and "flattens" its columns, if necessary. 

In [None]:
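def flatten_index(df):
    """Return a copy of df with two-level column names joined by underscores."""
    df_copy = df.copy()  # leave the caller's DataFrame untouched
    # Join each (column, statistic) pair, e.g. ('value', 'mean') -> 'value_mean'
    df_copy.columns = ['_'.join(col).rstrip('_') for col in df_copy.columns.values]
    return df_copy.reset_index()  # move the group labels back into a regular column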

We can run this in one of two ways. First we can run the code above and save it to a new `DataFrame` and then feed it into the above function. 

Or, we can use a new method called `.pipe()`. This method is great and easy to use, as it simplifies the amount of code you need to write. (This is similar to piping in `R`.) It's another way of taking a `DataFrame` and inputting into a function, but doing it all at once with one line of code. 

Let's try it both ways. 

First, let's repeat what we did above, save it to a new `DataFrame`, and input it into our custom function, `flatten_index`. 

In [None]:
new_data = data \
    .groupby("treatment") \
    .agg({"value": ["mean","std","count","min","max"]})

Now, let's check it out, just to make sure. 

In [None]:
new_data

Finally, let's pass the new `DataFrame` to our function and see what we get. 

In [None]:
flatten_index(new_data)

Great! Now we "flattened" our columns and appended the various statistics to the name of the column. 

Now, let's simplify the code and use the `pipe()` method. We add it at the very end and include it in the code just as we would with another method. The only thing we pass in is the name of the function we wrote (i.e., `flatten_index`). 

Let's save it to `new_data` and see what we get. 

In [None]:
new_data = data \
    .groupby("treatment") \
    .agg({"value": ["mean","std","count","min","max"]}) \
    .pipe(flatten_index)

In [None]:
new_data

The same result with one fewer step!
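Another way to end up with flat column names, sketched below, is the "named aggregation" form of `.agg()`, where each output column is given a name up front. It assumes `pandas` 0.25 or later. The data are a cut-down, made-up version of `data`, and the name `summary` and the output column names are hypothetical. The sketch also wraps the chain in parentheses, which is another way to split a long statement across lines without backslashes.

```python
import pandas as pd

# A cut-down version of the example data so this cell runs on its own
data = pd.DataFrame({'value': [632, 1638, 569, 14, 433, 1130],
                     'treatment': [0, 0, 0, 0, 1, 1]})

# Each keyword is the new column name; each tuple is (source column, statistic)
summary = (
    data
    .groupby('treatment')
    .agg(value_mean=('value', 'mean'),
         value_std=('value', 'std'),
         value_count=('value', 'count'),
         value_min=('value', 'min'),
         value_max=('value', 'max'))
    .reset_index()
)

summary
```

The column names match what `flatten_index` produces, so either route can feed the same later steps.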

***
***

# Checkpoint 10 of 10

## Now you try! 

### Using the `.agg()` and the `.pipe()` methods, repeat the same steps as above. However, instead of using the column `treatment`, use `phylum`.  

***
***