# Introduction to Python II: Dataframes

Nearly all researchers work with tabular data at one time or another. In this lesson, we will practice with **dataframes**, a Python data structure designed to work with tabular data, using the **Pandas library**. We will create our own dataframes, read in data from .csv files to a dataframe, subset and combine dataframes, and add or modify columns (variables) and rows (observations). We will also examine how these tasks work with messy and large datasets (i.e. millions of rows).


## Introduction to Python Series (Spring 2023)

All Courses on Tuesdays, 12:00 (Noon) - 1:30pm. You may take all four lessons or pick and choose. However, each lesson builds on the previous so knowing how to work with data frames, for example, will help you with the visualization lesson even if you can get by without this prior knowledge.

1. The Basics (Apr. 4)
2. **Dataframes (Apr. 25)**
3. Visualization (May 2)
4. Text Analysis (May 16)

Note: If you want to take the whole series, you will need to sign up for each course individually at: http://dartgo.org/RRADworkshops.

To learn more about Pandas, visit the webpage for the [Pandas library](https://pandas.pydata.org/).

For further practice with Python and Pandas dataframes, visit:

1. Software Carpentry's [Pandas DataFrames](http://swcarpentry.github.io/python-novice-gapminder/08-data-frames/index.html) lesson.
2. The **Pandas 1**, **Pandas 2**, and **Pandas 3** [tutorials offered by Constellate](https://constellate.org/tutorials).
3. Chapter 3, ["Data Analysis (Pandas)"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/01-Pandas-Basics-Part1.html) in Melanie Walsh's *Introduction to Cultural Analytics & Python*.

Note: All three platforms above offer excellent lessons for learning how to perform various other tasks in Python. Check them out.

Finally, don't worry about memorizing all the functions and syntax needed to work with Pandas. You can use the [Python Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) as a handy guide to remind you of some of these key functions.




<h2 style="text-align:center;font-size:300%;">Dataframes: The Basics</h2> 
  <img src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png" style="width:40%;">


A dataframe is a particular data structure that includes:
+ **rows** that record each observation of data
+ **columns** that record different attributes / measures / variables for each observation
+ **column labels** that describe the data recorded in each column
+ **index labels** that differentiate between observations

## I. Create a dataframe

1. First, we need to import [**Pandas**, Python's data analysis library](https://pandas.pydata.org/), which allows us to work with dataframes. It is almost universal practice among Python users to import Pandas under the name "pd", a useful abbreviation we can use when calling Pandas functions. We will also import **pathlib** and **glob** to help us work with file paths.

In [None]:
import pandas as pd
import pathlib, glob
from pathlib import Path

2. There are many ways to create a dataframe:
    + Perhaps the most common way is to load a dataset saved as a **.csv** (Comma Separated Values) file and read it in directly as a Pandas dataframe. We will introduce that method in the next section.
    + Import information from other files that do not contain data already assembled in a two-dimensional table with rows and columns. For example, you could write a Python script iterating over 1000s of text files and create a Pandas dataframe with basic information about each text (i.e. file names, number of words, etc.)
    + We can also create a dataframe from scratch:

In [None]:
# this example comes from Constellate's Pandas 1 tutorial
  ## https://lab.constellate.org/ilr-review-primary/notebooks/tdm-notebooks-2023-04-19T13%3A37%3A13.281Z/pandas-1.ipynb
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990,
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"]
                    })

3. In the above example, we are directly inputting a dataframe using the Python **dictionary** data structure. For more on Python dictionaries, [click here](https://www.w3schools.com/python/python_dictionaries.asp).

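*Dictionaries are beyond the scope of this lesson, but if you are curious, here is a brief explanation of how they work: a dictionary stores key-value pairs inside curly braces. In the example above, each key (such as "Year") becomes a column label, and the list paired with it supplies that column's values.*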
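As a rough illustration of how a dictionary turns into a dataframe, the sketch below uses a three-row excerpt of the World Cup data. The names `mini_dict` and `mini_df` are made up for this example.

```python
import pandas as pd

# A dictionary maps keys to values. Here each key is a column label
# and each value is the list of entries for that column.
# (mini_dict and mini_df are illustrative names, not part of the lesson data.)
mini_dict = {
    "Year": [2022, 2018, 2014],
    "Champion": ["Argentina", "France", "Germany"],
}

print(mini_dict["Champion"])     # look up a value by its key
print(list(mini_dict.keys()))    # the keys will become the column labels

# Passing the dictionary to pd.DataFrame builds one column per key
mini_df = pd.DataFrame(mini_dict)
print(mini_df.columns.tolist())
print(mini_df.shape)             # (number of rows, number of columns)
```

Note that every list in the dictionary must have the same length, since each position in the lists corresponds to one row.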

Let's view what the resulting dataframe looks like:

In [None]:
wcup

We will come back to this dataset below.

## II. Read a Dataframe

Besides the basic Men's World Cup dataset created above, in this lesson we will also be working with the following datasets:

+ The Gapminder dataset charting changing life expectancies and GDP values for countries over time (used by Software Carpentry).
<!--+ The Bellevue Almshouse dataset (used in [Walsh, *Intro to Cultural Analytics*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/01-Pandas-Basics-Part1.html) and created by Anelise Hanson Shrout. Original [link to data here](https://docs.google.com/spreadsheets/d/1uf8uaqicknrn0a6STWrVfVMScQQMtzYf5I_QyhB9r7I/edit#gid=2057113261). An [essay about this dataset is here](https://crdh.rrchnm.org/essays/v01-10-(re)-humanizing-data/).)-->
+ Hollywood Film Dialogue dataset (also used in Walsh. Original data from Hannah Anderson and Matt Daniels, ["Film Dialogue from 2,000 screenplays, Broken down by Gender and Age."](https://pudding.cool/2017/03/film-dialogue/))

4. To read data in, we will use the Pandas [**read_csv** function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). First, however, we need to create a path to the folder where our datasets are saved. We will use the **Path** class from the **pathlib** module.

In [None]:
gapminder_path = Path("~/shared/RR-workshop-data/gapminder").expanduser() 
gapminder_csvpath = Path(gapminder_path, "gapminder_all.csv")
csv_files = glob.glob(f"{gapminder_path}/*.csv")
print(csv_files)
#
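If the shared folder is not available, the same path-and-glob pattern can be tried on a throwaway folder. The sketch below is an illustration only: the temporary folder, the file names `demo_a.csv` and `demo_b.csv`, and their contents are all made up. It uses `Path.glob()`, a pathlib alternative to `glob.glob()`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a temporary folder holding two tiny CSV files (hypothetical names and values)
tmp_dir = Path(tempfile.mkdtemp())
Path(tmp_dir, "demo_a.csv").write_text("country,pop\nA,1\nB,2\n")
Path(tmp_dir, "demo_b.csv").write_text("country,pop\nC,3\n")

# Path.glob() yields Path objects; sorted() puts them in a stable order
csv_paths = sorted(tmp_dir.glob("*.csv"))
print([p.name for p in csv_paths])

# Any of these paths can be handed directly to pd.read_csv
demo_df = pd.read_csv(csv_paths[0])
print(demo_df)
```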

5. Now we can read in the **Gapminder dataset** using the Pandas read_csv function:

In [None]:
gapminder_df = pd.read_csv(gapminder_csvpath)

We can print out this dataframe in Jupyter Notebooks by simply typing its name as the last line of a cell (note: if you place any code below it, you need to wrap it in a **print()** command).

In [None]:
gapminder_df

Here is a quick summary of what you see above from Melanie Walsh's book:

There are a few important things to note about the DataFrame displayed here:

+ Index
    - The bolded ascending numbers in the very left-hand column of the DataFrame are called the Pandas Index. You can select rows based on the Index.
    - By default, the Index is a sequence of numbers starting with zero. However, you can change the Index to something else, such as one of the columns in your dataset.

+ Truncation
    - The DataFrame is truncated, signaled by the ellipses in the middle ... of every column.
    - The DataFrame is truncated because we set our default display settings to 100 rows. Anything more than 100 rows will be truncated. To display all the rows, we would need to alter Pandas’ default display settings yet again.

+ Rows x Columns
    - Pandas reports how many rows and columns are in this dataset at the bottom of the output (n rows x n columns).
    - This is very useful!

*In addition, I would also add that in the preview of the dataframe above - at least as viewed within JHub - you can click on a pen/highlighter icon to view the whole dataframe.*
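Two of the points above, changing the Index and changing the display settings, can be sketched on toy data. This is a minimal illustration under assumptions: the CSV text and the names `toy_df` and `long_df` are hypothetical, and `StringIO` simply stands in for a file on disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "country,continent,pop\nA,X,1\nB,X,2\nC,Y,3\n"

# index_col tells read_csv to use one of the columns as the Index
toy_df = pd.read_csv(StringIO(csv_text), index_col="country")
print(toy_df.index.tolist())
print(toy_df.loc["B", "pop"])    # rows can now be selected by country label

# option_context changes a display setting only inside the "with" block
long_df = pd.DataFrame({"n": range(200)})
with pd.option_context("display.max_rows", 10):
    short_view = repr(long_df)   # truncated text view
with pd.option_context("display.max_rows", None):
    full_view = repr(long_df)    # None means "no row limit"
print(len(short_view.splitlines()), "lines vs.", len(full_view.splitlines()), "lines")
```

Using `pd.option_context` rather than `pd.set_option` keeps the change temporary, so the rest of the notebook still displays dataframes with the usual truncation.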


## III. Get summary data to learn more about a dataframe

6. Now, often the first thing we want to do with a new dataset is to explore the size of the dataset and the type and range of data it contains. Run the following functions and then use hashtags ```#``` to add in comments about what each does.

In [None]:
gapminder_df.head() #what does the .head() method do to our dataframe?
                    ##explain: 

In [None]:
gapminder_df.head(12)  #

In [None]:
gapminder_df.tail() #

In [None]:
gapminder_df.shape #

In [None]:
gapminder_df.info() #

In [None]:
gapminder_df.sample(10) #

In [None]:
gapminder_df.describe() #

In [None]:
gapminder_df.columns

In [None]:
## We can also get summary information about individual columns. For example:

gapminder_df['gdpPercap_1987'].describe()

In [None]:
## or:
gapminder_df['continent'].value_counts()

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercises for Parts II and III</h3>

<p style="color:blue;">7. Take a look at the [Python Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf). Are there any other summary data you would like to retrieve from this dataset? Try it below:</p></div>

<div class="alert alert-info" role="alert"><p style="color:blue;">8. Now let's open the Film Dialogue dataset. Copy and paste the code from Part II, Numbers 4 and 5 below and then modify that code to open the film dialogue dataset. (Navigate to the film-dialogue folder using the folder directory on the left. You will need to identify the specific file path to and file name of the film dataframe).</p>
</div>

\*\***A note about data entry and humans (and things like gender, race, class, etc.)**: When people collect data, they often have to reduce complex phenomena to individual categories. This is problematic when performed on objects that do not neatly fit into pre-existing categories (or any categories for that matter). This is especially problematic when assigned to humans. How do you reduce humans to a racial or ethnic category when those categories are, to at least some extent, social inventions? When a person comes from a mixed background? How do you reduce people to two gender categories, when many people do not feel they are adequately represented by such categories? Keep this in mind when reviewing datasets that place people into hard and fast categories (as the above dataset does for the gender of movie characters).

<div class="alert alert-info" role="alert"><p style="color:blue;">9. Using some of the methods introduced in Part III and the [Python Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf), answer the following questions:</p>
<ul>
    <li style="color:blue;">How many rows and columns are there in this dataset?</li>
</ul>
</div>

<div class="alert alert-info" role="alert">
<ul>
    <li style="color:blue;">What is the range of dates for the films included in this dataset?</li>
</ul>
</div>

<div class="alert alert-info" role="alert">
<ul>
    <li style="color:blue;">How many men vs. women (** - see note above) are included in this dataset?</li>
</ul>
</div>

<div class="alert alert-info" role="alert">
<ul>
    <li style="color:blue;">What is the average age of film characters in this dataset?</li>

</ul>
</div>

## IV. Access Parts of a Data Frame

<div class="alert alert-success" role="alert"><p style="color:green">10. Let's take a look at our World Cup data again. Use the **.head()** method to output the first 5 rows of this dataset (wcup).</p></div>

11. We can use the **.iloc[..., ...]** indexer to extract particular columns, rows, or cells by their integer position. For example, to retrieve the first row or first column, we would do the following:

In [None]:
wcup.iloc[0]

In [None]:
wcup.iloc[:,0]

<div class="alert alert-success" role="alert"><p style="color:green">11b. Can you guess how to extract only the last column?</p></div>

<div class="alert alert-success" role="alert"><p style="color:green">11c. How would you retrieve the value found in the 2nd row of the 2nd column?</p></div>

12. We can use **.loc[..., ...]** to retrieve values, columns, or rows by their labels. For example, if we wanted to retrieve all info from the "Champion" column, we would simply run:

In [None]:
wcup.loc[:,"Champion"]

12b. At the moment, rows are only indexed by numbers (0, 1, 2 and so on). However, we can convert our "Year" column into an index:

In [None]:
wcup = wcup.set_index("Year")


Notice the slight change below:

In [None]:
wcup.head(3)

12c. Now we can search the dataframe by our index ("Year") and by column name. For example:

In [None]:
wcup.loc[1994, "Champion"]

*Note: for the code above, our index ("Year") has integers not strings. So running:*

```
wcup.loc["1994", "Champion"]
```

*will produce an error. Since 1994 is considered an integer here, we should leave the quotes out.*
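The sketch below reproduces this behavior on a hypothetical two-row dataframe (the name `toy` is made up for the example). It catches the error so the cell keeps running and prints the index's data type, which is a quick way to check whether quotes are needed.

```python
import pandas as pd

# Hypothetical two-row excerpt with "Year" as the index
toy = pd.DataFrame({"Year": [1994, 1998],
                    "Champion": ["Brazil", "France"]}).set_index("Year")

print(toy.index.dtype)              # an integer dtype, so labels are numbers
print(toy.loc[1994, "Champion"])    # integer label: works

try:
    toy.loc["1994", "Champion"]     # string label: not found in an integer index
except KeyError as err:
    print("KeyError:", err)
```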

12d. Like many things in Python, there are often multiple ways to accomplish the same goal. For example, we can also select particular columns by simply placing the names of a column within brackets (similar to indexing and slicing lists in Python).

In [None]:
# Use bracket notation to access the column 'Champion'
wcup['Champion']

In [None]:
# Access multiple columns
film_df[["title", "character", "age"]]
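One detail worth noticing in the two cells above is the difference between single and double brackets. A minimal sketch, assuming a hypothetical two-row dataframe named `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Film A", "Film B"], "age": [30, 45]})  # hypothetical rows

one_col = toy["title"]            # single brackets with one name -> a Series
one_col_df = toy[["title"]]       # double brackets (a list of names) -> a DataFrame
two_cols = toy[["age", "title"]]  # the list also determines the column order

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(two_cols.columns.tolist())
```

The outer pair of brackets does the selecting, while the inner pair is an ordinary Python list of column names.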

## V. Subset, Filter, and Sort a Dataframe

Frequently, when we are working with a dataset, we may want to subset, filter, or sort it before beginning our analysis.

Many of you may already be familiar with subsetting, filtering, and sorting datasets. But, just for review, this is what we mean by these terms:
+ **subset** - to subset a dataset is to create a smaller version of the dataset. For example, you may want to drop specific columns or rows to create a dataframe for your analysis

    <img src="https://pandas.pydata.org/docs/_images/03_subset_columns.svg" style="width:40%;">
    <img src="https://pandas.pydata.org/docs/_images/03_subset_rows.svg" style="width:40%;">
    
+ **filter** - filtering a dataframe is a specific type of subsetting. For example, you may want to drop all rows in a dataframe that contain missing values. Or keep only those rows that have a value that meets a particular condition.
+ **sort** - arrange the dataframe in a particular order. For example, you may want to sort a dataframe by a year or time column to arrange each row in chronological order. You may also sort a dataframe by multiple columns at once (i.e. sorting by year first and country second).
    <img src = "https://pythonexamples.org/images/python-pandas-dataframe-sort-by-index.svg" style = "width:40%">

Some examples of how you may want to subset, filter, or sort this lesson's dataframes: 

| Modification | World Cup dataset | Gapminder dataset | Film Dialogue dataset |
| :- | :- | :- | :- |
| **subset / filter** | create a df including only the 21st century men's World Cup champions | only review data from a set number of countries **or** for set range of years | keep only "title", "gender", and "proportion_of_dialogue" columns **or** only include data for film dialogue spoken by a character over 50 years old  |
| **sort** | Sort dataframe by year, but this time in ascending order | sort dataframe by life expectancy in 2007, in descending order | sort df by proportion of dialogue |



In [None]:
film_df.head(10)

### Vb. Subsetting / Filtering



13. We can create data subsets by using the same slicing functions described in Part IV and saving them as new dataframes. For example:

In [None]:
wcup_last5 = wcup.iloc[:5,]
wcup_last5
# note, the head function can accomplish the exact same thing
#i.e. wcup_last5 = wcup.head()

<div class="alert alert-success" role="alert"><p style="color:green">13b. Let's take a look at the film_df again by using the .head() method.</p></div>

In [None]:
film_subset = film_df[['release_year', 'title', 'proportion_of_dialogue']]
film_subset.head()

# notice: in the 1st line of code above, we not only selected 3 columns to keep but we also re-arranged their order!

In [None]:
#note: this does the same as above, only this time using the .loc[] method
film_subset = film_df.loc[:,['release_year', 'title', 'proportion_of_dialogue']]
film_subset.head()


14. We can drop specific columns using the **.drop()** method for Pandas dataframes. For example, if we want to drop the "Host" column from our World Cup dataset, we could run the following:

In [None]:
print(wcup.shape)
print(wcup.head(2))
wcup2 = wcup.drop(columns = "Host")  #note: we need to add "wcup2 =" to save the changes to our original wcup dataframe as "wcup2"
                                    ##you could also just replace the original "wcup" by writing wcup = wcup.drop(...)
                                    ## however, often it is helpful to keep both the original, full dataframe and the smaller, subsetted one
print(wcup2.shape)
print(wcup2.head(2))

There are many different ways to drop specific columns or rows using .drop(). See the [Pandas 2 Constellate lesson](https://lab.constellate.org/monist-language/notebooks/tdm-notebooks-2023-04-19T18%3A59%3A16.317Z/pandas-2.ipynb) for some additional examples.

15. Commonly, for example, we may want to drop observations (rows) containing null data (warning: you should always consider what this removal does to the representative nature of your dataset!). We can use the **.dropna()** method. See some examples below:

In [None]:
print(film_df.shape)
film_df_no_nas = film_df.dropna()
print(film_df_no_nas.shape)

15b. Temporarily or permanently changing a dataframe: Most Pandas methods return a modified copy and leave the original dataframe unchanged unless you assign the result to a variable. For example:

```
film_df.dropna()
```
just outputs a version of film_df with all NAs and null values removed, but does not save it to memory. If we wanted to replace the original "film_df" we can simply assign it the same name as follows:

```
film_df = film_df.dropna()
```
Or we can save it under a new name as we did above:
```
film_df_no_nas = film_df.dropna()
```

Finally, we can also use the "inplace" argument to do the same:

```
film_df.dropna(inplace = True)
```
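The three options can be compared side by side on toy data. This is a sketch under assumptions: the dataframe `toy` and its values are hypothetical and only stand in for `film_df`.

```python
import pandas as pd

# Hypothetical rows; None in a numeric column becomes a missing value (NaN)
toy = pd.DataFrame({"title": ["Film A", "Film B", "Film C"],
                    "age": [30, None, 45]})

toy.dropna()                     # the result is returned but not stored anywhere
print(toy.shape)                 # the original still has all three rows

toy_clean = toy.dropna()         # store the result under a new name
print(toy_clean.shape)

toy_copy = toy.copy()
toy_copy.dropna(inplace=True)    # modifies toy_copy itself and returns None
print(toy_copy.shape)
```

Assigning the result to a name is usually the easier pattern to follow when reading code later, because each step leaves a visible variable behind.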



15c. You can also drop columns with missing values (this is much rarer than dropping rows with missing values):

In [None]:
film_df_dropcols = film_df.dropna(axis = 1)    #for Pandas, rows are axis 0 and columns are axis 1
#film_df_dropcols = film_df.dropna(axis = "columns") # this does the same thing! 

15d. More commonly, you may want to remove rows that are missing values for specific columns. For example, if you want to analyze the age of characters / actors given speaking roles in films, you would want to remove any characters (rows) for which we lack age data.

In [None]:
print(film_df.shape)
film_df_age = film_df.dropna(subset = ["age"])   #passing a list of column names works across Pandas versions
film_df_age
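To see how the different `.dropna()` options compare, here is a small sketch on a hypothetical three-row dataframe (the name `toy` and all values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "age": [30, None, 45],        # one missing age
    "gross": [None, 12.0, 80.0],  # one missing gross value
})

rows_all = toy.dropna()                  # drop rows with ANY missing value
rows_age = toy.dropna(subset=["age"])    # drop rows only where "age" is missing
cols_any = toy.dropna(axis="columns")    # drop columns with ANY missing value

print(rows_all.shape, rows_age.shape, cols_any.shape)
```

Restricting the check with `subset` keeps "Film A" even though its gross value is missing, which is why it tends to discard less data than a plain `.dropna()`.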

### Vc. Filtering a dataframe by a condition

16a. To filter a dataframe by a condition, first we need to define the filter itself:

In [None]:
gapminder_df['continent'] == 'Africa'   #remember: "=" indicates an assignment and "==" indicates a comparison returning True if both sides are equal and False if not

16b. We can then assign this specific filter to a variable...

In [None]:
filt = (gapminder_df['continent'] == 'Africa')

16c. And use that filter to create a subset of our original dataset

In [None]:
gapminder_Africa = gapminder_df.loc[filt]
gapminder_Africa.tail(2)

16d. We can also filter by multiple conditions:

In [None]:
years_filt = (film_df['release_year'] >= 1990) & (film_df['release_year'] < 2000)
nineties_films = film_df.loc[years_filt]     #note: you cannot begin a variable name with a number, thus 90s_films would raise an error
nineties_films

We can also combine multiple filters to create an even smaller and more specific subset:

In [None]:
gross_filt = (film_df['gross'] > 800)
nineties_blockbusters = film_df.loc[years_filt & gross_filt]
nineties_blockbusters
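Filters can also be combined with `|` (or) and negated with `~` (not). The sketch below uses a hypothetical four-row dataframe named `toy`. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>=` or `<` in Python.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [1985, 1992, 1999, 2005],
    "gross": [900, 50, 1200, 300],
})  # hypothetical values

nineties = (toy["release_year"] >= 1990) & (toy["release_year"] < 2000)
big_gross = toy["gross"] > 800

either = toy.loc[nineties | big_gross]   # released in the 1990s OR gross over 800
not_nineties = toy.loc[~nineties]        # everything outside the 1990s
both = toy.loc[nineties & big_gross]     # both conditions at once

print(either["title"].tolist())
print(not_nineties["title"].tolist())
print(both["title"].tolist())
```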

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercise for Part V. Subsets and Filters</h3>

<p style="color:blue;">17. Filter and subset the film_df to create a new dataset with only films from **a decade of your choosing**.</p>
</div>

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercise for Part Vb: Subsetting</h3>

<p style="color:blue;">17b. Using the methods you learned above, what are some different ways you could go about subsetting and filtering the gapminder dataset so that it includes only data from Asia and only for the twenty-first century?</p>
</div>

Write some possible solutions here: 

* 
* 

<div class="alert alert-info" role="alert"><p style="color:blue;">17c. Implement one of your proposed solutions above to subset the gapminder dataset as described in 17b.</p></div>

### Vd. Sorting a dataframe

18. We can sort a dataframe by the values in one column or in multiple columns using the **.sort_values()** method.

In [None]:
film_df.sort_values(by = ['release_year'])

19. We can sort columns in descending order by adding the argument ```ascending = False``` (the default is "ascending = True").

In [None]:
film_df.sort_values(by = ['release_year'], ascending = False)

19b. When sorting by multiple columns, we need to specify which will be sorted in ascending rather than descending order using a list of the same length as the list of columns.

In [None]:
film_df.sort_values(by = ['release_year', 'title', 'character'], ascending = [False, True, True])
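A multi-column sort is easier to follow on a handful of rows. In this sketch the dataframe `toy` is hypothetical. It also shows that sorting keeps each row's original index label unless the index is reset.

```python
import pandas as pd

toy = pd.DataFrame({
    "release_year": [1999, 2005, 1999, 2005],
    "title": ["B", "A", "A", "B"],
})  # hypothetical values

# Newest year first; within the same year, titles in alphabetical order
sorted_toy = toy.sort_values(by=["release_year", "title"], ascending=[False, True])
print(sorted_toy)

# Rows keep their original index labels after sorting
print(sorted_toy.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 in their new order
sorted_reset = sorted_toy.reset_index(drop=True)
print(sorted_reset.index.tolist())
```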

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercise for Part Vd. Sorting</h3>

<p style="color:blue;">20. Sort the gapminder dataframe by first the "continent" column and then the "country" column, but in descending (not ascending) order for each.</p></div>

In [None]:
gapminder_df.sort_values(by = ['continent', 'country'], ascending = [False, False])

<div class="alert alert-info" role="alert"><p style="color:blue;">20b. Sort the Gapminder dataframe in three ways to answer the following three questions:</p>
<ul>
<li style = "color:blue;">Which five countries had the largest population in 2007?</li>
<li style = "color:blue;">Which five countries had the highest GDP per capita in 2007?</li>
<li style = "color:blue;">Which five countries had the highest life expectancy in 2007?</li>
</ul>
</div>

In [None]:
gapminder_df[["continent", "country", "pop_2007"]].sort_values(by = "pop_2007", ascending = False).head()

In [None]:
gapminder_df[["continent", "country", "gdpPercap_2007"]].sort_values(by = "gdpPercap_2007", ascending = False).head()

In [None]:
gapminder_df[["continent", "country", "lifeExp_2007"]].sort_values(by = "lifeExp_2007", ascending = False).head()

## VI. Modify a Dataframe

There are many ways you can modify an existing dataframe.

21. You can change column names. First, let's review the names of our columns

In [None]:
wcup.columns

In [None]:
wcup3 = wcup.rename(columns = {'Host': 'home_country'})
wcup3.head(2)

22. Change values in the dataframe:


In [None]:
wcup.loc[2022, "Champion"] = "Messi!!"
wcup

23. We can apply functions across an entire column using Pandas' **.apply()** method.

In [None]:
def make_uppercase(text):   #In Python you use "def" to define a function, give the function a name, and then you have the option of reading in additional arguments
    text_upper = text.upper()
    return text_upper

wcup['Host'].apply(make_uppercase)
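For simple string operations, Pandas also offers vectorized string methods through the `.str` accessor, which can stand in for a custom function passed to `.apply()`. A minimal sketch, using a stand-alone Series named `hosts` (a made-up name) with three host names copied from the World Cup data:

```python
import pandas as pd

hosts = pd.Series(["Qatar", "Russia", "Brazil"], name="Host")

def make_uppercase(text):
    # Same idea as the function above: return the text in capital letters
    return text.upper()

via_apply = hosts.apply(make_uppercase)   # call the function on each value
via_str = hosts.str.upper()               # built-in vectorized string method

print(via_apply.tolist())
print(via_str.tolist())
```

Both calls return a new Series and leave `hosts` unchanged. `.apply()` remains the more general tool, since it accepts any function you write.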

23b. The code above only temporarily created a new "Host" column all in uppercase. Modify the code cell below so that there is a new column to store the uppercase values from the Host column.

In [None]:
wcup['Host'].apply(make_uppercase)

24. You can also create a new column using operators. For example, in the Gapminder dataset, we may want to calculate population change from 2002 to 2007 by subtracting the former from the latter. We can do so as follows:

In [None]:
gapminder_df["pop_chg_02-07"] = gapminder_df["pop_2007"] - gapminder_df["pop_2002"]

#to better see the results, let's just output the relevant columns
gapminder_df[['country', 'continent', 'pop_2002', 'pop_2007', 'pop_chg_02-07']]

In [None]:
#we can also sort the values to identify those countries that lost population between 2002 and 2007
gapminder_df[['country', 'continent', 'pop_2002', 'pop_2007', 'pop_chg_02-07']].sort_values(by = "pop_chg_02-07")

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercises: Part VI Modifying a Dataframe</h3>

<p style="color:blue;">25. Following the code above, create a new column and then sort it to help you answer one of the following questions (you choose): </p>
<ul>
<li style="color:blue;">Which country's GDP per capita improved the most between 1952 and 2007?</li>
<li style="color:blue;">Which country's life expectancy improved the most between 1952 and 2007?</li>
</ul>
</div>

<div class="alert alert-info" role="alert"><p style="color:blue;">26. Using a custom made function and the .apply() method create a new column in the film_df with each movie character's name capitalized.</p></div>

## VII. Split-Apply-Combine

<img src = "https://www.oreilly.com/api/v2/epubs/9781783985128/files/graphics/5128OS_09_01.jpg" style="width:40%">

A commonly used data analysis strategy is **split-apply-combine**:
+ **split** the problem / data into manageable pieces
+ **apply** some calculations or analysis to the pieces
+ **combine** the parts back into a whole

For computational data science this means (from [Python Pandas documentation](https://pandas.pydata.org/docs/user_guide/groupby.html)):
```
    * Splitting the data into groups based on some criteria
    * Applying a function to each group independently
    * Combining the results into a data structure
```

Let's examine how this strategy may work for the datasets we have been working with. In our Gapminder dataset, for example, we may want to explore larger trends at the continent level. So we could:
+ **split** the overall dataset by continent, creating separate Africa, Asia, North America, South America, Oceania, and Europe datasets.
+ **apply** some calculations to each continental dataset; i.e. What is the average GDP and life expectancy of each continent in different years? Or what has been the average rate of change in these values from one five year interval to the next?
+ **combine** the data back together to compare changes in productivity (as measured by GDP) and health (as measured by life expectancy) by continent and over time.

Different applications apply this strategy in different ways. In Excel, you would use pivot tables; in SQL, the "group by" operator; and in R, you may use the plyr package.

With Pandas, we can use the **.groupby()**, **.agg()**, and **.apply()** functions to do this. 
+ **.groupby()** - *splits* a dataset into separate groups
+ **.agg()** - used to calculate multiple statistics per group in one calculation
+ **.apply()** - applies a function across an entire column (or row) in a Pandas dataframe 
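To make the three steps concrete before turning to the real data, the sketch below performs split-apply-combine by hand on a hypothetical five-row dataframe and then repeats it with `.groupby()`. The names `toy`, `pieces`, `means`, `by_hand`, and `shortcut` are made up for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "continent": ["X", "X", "Y", "Y", "Y"],
    "lifeExp": [50.0, 60.0, 70.0, 80.0, 90.0],
})  # hypothetical values

# split: one smaller dataframe per continent
pieces = {name: toy[toy["continent"] == name] for name in toy["continent"].unique()}

# apply: compute the mean life expectancy within each piece
means = {name: piece["lifeExp"].mean() for name, piece in pieces.items()}

# combine: gather the per-group results into one Series
by_hand = pd.Series(means, name="lifeExp")
print(by_hand)

# groupby performs the same three steps in a single line
shortcut = toy.groupby("continent")["lifeExp"].mean()
print(shortcut)
```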

Let's re-examine the contents of our Gapminder dataframe. 

27. Use the .head() method to output its first five rows and the .columns attribute to list its column names.

In [None]:
gapminder_df.head()

In [None]:
gapminder_df.columns

### VIIb: Groupby()

*The first few cells of this section are adapted from Constellate's [3rd Pandas tutorial](https://lab.constellate.org/practical-diabetes-methods/notebooks/tdm-notebooks-2023-04-20T12%3A35%3A07.477Z/pandas-3.ipynb).*

28. Groupby is a powerful function built into Pandas that you can use to summarize your data. Groupby splits the data into different groups on a variable of your choice. 

To learn more about using Pandas **.groupby()** to split-apply-combine data, see the [documentation here](https://pandas.pydata.org/docs/user_guide/groupby.html).

In [None]:
# Group the data by continent
gapminder_df.groupby('continent')

29. The groupby() method returns a GroupBy object which describes how the rows of the original dataset have been split by the selected variable. You can actually see how the rows of the original dataframe have been grouped using the ```groups``` attribute after applying ```groupby().```


In [None]:
# See how the rows have been grouped
gapminder_df.groupby('continent').groups
#Note: this dataset had already been sorted by continent.

30. Of course, we don't just stop at grouping data. Grouping data is just a step towards querying it. After we apply the .groupby() method, we can use different Pandas methods to query the data. For example, how do we get the number of rows (here, countries) in each continent?

In [None]:
# Create a series storing the number of rows (countries) in each continent
gapminder_df.groupby('continent').size()

We can then use the **.agg()** (aggregate) method to choose what functions we will **apply** to each group to allow us to **combine** the data back together.

The general formula for applying functions using the aggregate (.agg()) method:

```
df_name.groupby('col_name').agg({dict assigning a function to each column we want to aggregate})

```
where the format for each dictionary is as follows:
```
{'name_of_col1_to_keep': 'function_to_apply_to_this_col', 'name_of_col2_to_keep': 'function_to_apply_to_this_col'}
```

Some commonly used functions with groupby ([click here](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) for a full list):
+ .count(), .mean(), .min(), .max(), .sum()

31. With this in mind, examine the following line of code. What do you guess it does? Run it and see if you were correct.

In [None]:
gapminder_df.groupby('continent').agg({'lifeExp_2007':'mean'})

32. We can use the .agg() method to apply multiple functions to multiple columns:

In [None]:
cont_pop_chg = gapminder_df.groupby('continent').agg({'pop_2002':'sum', 'pop_2007': 'sum'})
cont_pop_chg

33. We can then use these two columns in the continental dataset to calculate percent population change between 2002 and 2007:

In [None]:
cont_pop_chg['pct_chg_02-07'] = 100 * (cont_pop_chg['pop_2007'] - cont_pop_chg['pop_2002']) / cont_pop_chg['pop_2002']
cont_pop_chg
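The dictionary passed to `.agg()` can also map one column to a list of functions, and Pandas supports "named aggregation" for choosing the output column names directly. A minimal sketch on hypothetical data (the dataframe `toy` and the output names `total_pop` and `mean_lifeExp` are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "continent": ["X", "X", "Y", "Y", "Y"],
    "pop": [10, 20, 5, 5, 10],
    "lifeExp": [50.0, 60.0, 70.0, 80.0, 90.0],
})  # hypothetical values

# A list of functions for one column produces two-level column labels
summary = toy.groupby("continent").agg({"pop": "sum", "lifeExp": ["mean", "max"]})
print(summary)

# Named aggregation: new_column_name=(source_column, function)
named = toy.groupby("continent").agg(
    total_pop=("pop", "sum"),
    mean_lifeExp=("lifeExp", "mean"),
)
print(named)
```

Named aggregation gives flat, descriptive column names, which can make follow-up calculations like the percent change above easier to read.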

<div class="alert alert-info" role="alert"><h3 style="color:blue;">Exercises Part VII: Split-Apply-Combine</h3></div>

<div class="alert alert-info" role="alert"><p style="color:blue;">34. Using what you learned above, group the gapminder_df by continent again, but this time calculate the average change in life expectancy for each continent between 1952 and 2007.</p></div>