# Unit 3.3: Advanced Pandas

This notebook is based on Anna-Lena Lamprecht's CoTaPP repository (https://github.com/annalenalamprecht/CoTaPP). Some modifications were made.

Last time we discussed file I/O and the basics of Pandas.

Now we take a look at more use cases of Pandas. Keep in mind that the lecture can only cover a few selected examples, so refer to the online documentation for the full reference.

Next time we will learn about data visualization with ```matplotlib```.

## Pandas

We already covered the most important things to know about Pandas in the last lecture: how to read content from CSV files, the ```DataFrame``` and ```Series``` data structures, indexing operations, and basic plotting and statistics methods for data frames and series. Please refer to the [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) for further details, as we cannot cover the library in depth in this course.

In this lecture we address some other important aspects: 

* handling of missing data

* concatenating/joining tables with Pandas

* grouping rows by a certain attribute

* correlations of variables

### Handling Missing Data

For various reasons it can happen that data are missing in a data frame. They might, for example, already have been missing in the input CSV file due to measurement faults, or have become unavailable because of computations that were not able to return a (good) result.
In Pandas, the value ```np.nan``` (technically of type ```float```) is the main marker for missing data. This can look as follows:

In [None]:
import pandas as pd

df = pd.read_csv("data/table-with-missing-data.csv", sep=",")
print(df)

By default, Pandas operations do not fail on ```NaN``` values. Reductions such as ```mean()``` carry out the computation on the available data in the data frame or series, while element-wise operations propagate ```NaN``` wherever a meaningful result cannot be derived. For example:


In [None]:
df['age'].mean()

In [None]:
(29+45+63+42+75+35)/6

### Challenge!

What do you expect to happen when you compute the average age?

![](img/activity_small.png) 

In [None]:
print(df.describe())

### Challenge!

What do you expect to happen when you add 1 to each person's age?

![](img/activity_small.png) 

In [None]:
df['age_next_year'] = df['age'] + 1
df.head()
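The two behaviors discussed above can also be reproduced without the CSV file. The following is a minimal sketch on a small made-up data frame (the name `people` and its values are assumptions for illustration, not the contents of the lecture file):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV data: one age is missing
people = pd.DataFrame({"name": ["Ann", "Ben", "Cleo"],
                       "age": [30.0, np.nan, 50.0]})

# Reductions such as mean() skip NaN by default (skipna=True),
# so only the two available ages are averaged
mean_age = people["age"].mean()

# With skipna=False the missing value propagates into the result
strict_mean = people["age"].mean(skipna=False)

# Element-wise arithmetic keeps NaN wherever the input was NaN
age_next_year = people["age"] + 1

print(mean_age)
print(strict_mean)
print(age_next_year.tolist())
```

Note that the mean is computed over two values, not three, which is easy to overlook when many entries are missing.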

If such behavior is not wanted, the data frame or series can be manipulated accordingly before applying the operations. One option is to remove rows or columns with missing data completely by using the ```dropna()``` method. The following example shows how to drop all rows where any data are missing, and how to drop only those rows where age or height data are missing:

In [None]:
print(df.dropna())
print("----------")
print(df.dropna(subset=["age", "height"]))
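Before dropping anything, it can help to count how many values are missing per column. The sketch below does this with ```isna().sum()``` and then contrasts the two ```dropna()``` variants on a made-up table (the name `table` and all values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table: rows 1, 2 and 3 each miss exactly one value
table = pd.DataFrame({
    "age": [29.0, np.nan, 63.0, 42.0],
    "height": [1.70, 1.82, np.nan, 1.65],
    "city": ["Utrecht", "Delft", "Leiden", None],
})

# Number of missing entries in each column
missing_per_column = table.isna().sum()

# Keep only rows without any missing value
all_complete = table.dropna()

# Keep rows where age and height are present; a missing city is tolerated
age_height_complete = table.dropna(subset=["age", "height"])

print(missing_per_column)
print(all_complete)
print(age_height_complete)
```

Here the `subset` variant keeps one row more than the plain call, because the row with a missing city still has both age and height.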

Another possibility is to replace the ```NaN``` values with other, more suitable values:

In [None]:
print("-----Replaced all missing values with 0-----")
print(df.fillna(0))
print("-----Replaced all missing ages and heights with 0-----")
print(df.fillna(value={"age":0, "height":0}))
print("-----Replaced all missing ages and heights with mean values-----")
print(df.fillna(value={"age": df["age"].mean(),
                       "height": df["height"].mean()}))

Extra: In some cases, Pandas’ ```interpolate()``` method can also be used to derive values for missing data. Replacing missing data with values should always be done with great care, as adding data to a data set risks producing distorted or even wrong results. In general, how to handle missing data depends on the specifics of the case at hand, but it is good to know the different options.
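As a minimal sketch of interpolation, the example below fills the gaps in a made-up series of readings (the name `readings` and its values are assumptions). With its default settings, ```interpolate()``` fills each gap linearly between the neighboring known values:

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one single gap and one double gap
readings = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

# Default method is linear: values are spaced evenly between known neighbors
filled = readings.interpolate()

print(filled.tolist())
```

Whether evenly spaced values are a sensible guess depends on the data; for ages of unrelated people, for instance, they would not be.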

### Grouping

Suppose you have a dataset containing information about students, including their names, grades, and subjects. You want to calculate the average grade for each subject. 

### Challenge!

How do you do this with the methods learnt so far?

![](img/activity_small.png) 

#### Pandas `groupby`

You can use the `groupby` method of a data frame to group the data by subject and then calculate the average grade for each group.

In [None]:
import pandas as pd

# Create a sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
    'Grade': [85, 92, 78, 90, 88, 95]
}

df = pd.DataFrame(data)

# Group the data by subject and calculate the average grade
grouped = df.groupby('Subject')['Grade'].mean()

print(grouped)
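For comparison, here is one possible way to get the same averages with only the methods from the previous lecture (one boolean mask per subject), followed by a `groupby` call that computes several statistics at once with `agg`. This is a sketch; the names `grades`, `manual` and `summary` are chosen here for illustration, and the data repeats the sample dataset above:

```python
import pandas as pd

grades = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
    'Grade': [85, 92, 78, 90, 88, 95]
})

# Manual approach: select the rows of each subject with a boolean mask
manual = {}
for subject in grades['Subject'].unique():
    manual[subject] = grades[grades['Subject'] == subject]['Grade'].mean()

# groupby approach: several aggregations per group in one call
summary = grades.groupby('Subject')['Grade'].agg(['mean', 'min', 'max'])

print(manual)
print(summary)
```

Both approaches give the same means; `groupby` simply avoids the explicit loop and scales better to many groups.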

We can also iterate over the groups in a loop, for example to apply more complicated computations to each group:

In [None]:
for lbl, grp in df.groupby('Subject'):
    print(lbl, type(grp))

We can see that the label ```lbl``` takes the unique values of the column "Subject", while each of the groups is of type ```DataFrame```. We might want to retrieve the name of the student who scored the highest grade in each subject:

In [None]:
for lbl, grp in df.groupby('Subject'):
    print(lbl, grp.sort_values('Grade', ascending=False)['Name'].tolist()[0])
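The same question can also be answered without an explicit loop. The sketch below uses `idxmax()`, which returns the row label of the maximum per group, and then looks up those rows with `loc`. It is one alternative among several, and the name `grades` is again chosen here for illustration:

```python
import pandas as pd

grades = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science', 'Science'],
    'Grade': [85, 92, 78, 90, 88, 95]
})

# Row label of the highest grade within each subject
best_idx = grades.groupby('Subject')['Grade'].idxmax()

# Look up the full rows for those labels
best_rows = grades.loc[best_idx, ['Subject', 'Name', 'Grade']]

print(best_rows)
```

Note that `idxmax()` reports only the first row if several students share the top grade.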

## Exercises

### 1. Analysis of the McDonald’s Menu (★★★★☆)
This exercise is a variation of one that Dr. Adrien Melquiond (Utrecht Bioinformatics Center) developed in the scope of another Python course. It uses the Pandas library to analyze the dataset in the file `mcdonalds_menu.csv`, which provides a nutrition analysis of every menu item on the US McDonald's menu (including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts). These data have been scraped from the McDonald's website. The assignment is about exploring how much fat and other nutrients are contained in McDonald’s food.

Write a program that reads the content of the file into a data frame, displays simple descriptive statistics about the numerical values in the data frame, and then answers the following questions (you might need Google’s help for some).

#### a. What do we have on the menu? 
How many different items do we have on the menu? Print the number of items per category. Which category is the most represented in this menu?

The output should look something like:

    Category
    Beef & Pork           15
    Beverages             27
    Breakfast             42
    Chicken & Fish        27
    Coffee & Tea          95
    Desserts               7
    Salads                 6
    Smoothies & Shakes    28
    Snacks & Sides        13

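In [None]:
# Load the menu first; the cells below rely on this data frame
df = pd.read_csv('data/mcdonalds_menu.csv')

In [None]:
# Count the items per category and show the most represented category
df.groupby('Category')['Item'].count().reset_index().sort_values('Item', ascending=False).head(1)

In [None]:
# Highest-calorie item in the 'Beef & Pork' category
df[df['Category'] == 'Beef & Pork'].sort_values('Calories', ascending=False).head(1)

In [None]:
# Highest-calorie item in each category
for lbl, grp in df.groupby('Category'):
    sub_grp = grp.sort_values('Calories', ascending=False).head(1)
    print(lbl, sub_grp['Item'].tolist()[0])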

#### b. What is the most fatty item for each category?
Background information: When it comes to fat, trans fats are really the ones to avoid. Trans fat is a byproduct of a process called hydrogenation that is used to turn healthy oils into solids and to prevent them from becoming rancid. It increases the amount of harmful LDL cholesterol in the bloodstream. Cholesterol can be either good (HDL) or bad (LDL), but chances are slim that we are talking about the good one here. Saturated fat is not necessarily bad, but a diet rich in saturated fat can drive up total cholesterol, with an increased risk of clogged arteries. Unsaturated fats are not reported in this table.

Create a subset data frame, called `grp_by_category`, that lists per category the maximal amount of 'Total Fat (% Daily Value)', 'Trans Fat', 'Saturated Fat (% Daily Value)' and 'Cholesterol (% Daily Value)'. Merge the data frames `menu` (the full menu data frame) and `grp_by_category`, and create a mask to select the items that correspond to the maximal 'Total Fat (% Daily Value)'. Be careful, you may end up with more than one fattest item per category. Repeat the same process to extract the fattest item in terms of 'Trans Fat' (make sure to select only items with 'Trans Fat' > 0). Sort the items in decreasing order of 'Trans Fat' and display the resulting data frame.
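The merge-and-mask pattern described above can be sketched on a toy table. The data frame names `menu` and `grp_by_category` follow the exercise text, but the items and values below are made up, and the `_max` suffix is just one possible way to tell the merged columns apart:

```python
import pandas as pd

# Hypothetical mini menu, not the real dataset
menu = pd.DataFrame({
    'Category': ['Burgers', 'Burgers', 'Salads', 'Salads'],
    'Item': ['Big Burger', 'Small Burger', 'Caesar Salad', 'Garden Salad'],
    'Total Fat (% Daily Value)': [60, 30, 20, 20],
})

# Maximum fat value per category, as a regular column-based data frame
grp_by_category = (menu.groupby('Category')[['Total Fat (% Daily Value)']]
                   .max()
                   .reset_index())

# Attach the category maximum to every item; the right-hand column
# gets the suffix '_max' because its name clashes with the left one
merged = menu.merge(grp_by_category, on='Category', suffixes=('', '_max'))

# Mask selecting the items that reach the maximum of their category
mask = (merged['Total Fat (% Daily Value)']
        == merged['Total Fat (% Daily Value)_max'])
fattest = merged[mask]

print(fattest[['Category', 'Item', 'Total Fat (% Daily Value)']])
```

In this toy table both salads tie for the maximum, which illustrates why more than one item per category can survive the mask.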

#### c. Is there anything healthy on the menu?
Search for items with 0 'Trans Fat' and 0 'Cholesterol (% Daily Value)', and at most 20 'Sugars' and 20 'Total Fat (% Daily Value)'. Sort the healthy items by calories in ascending order. Remove all the drinks (beverages, coffee & tea) from this healthy data frame.

The output should look something like:

![](img/mcdo_q3.png)
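Combining several conditions is done with boolean masks joined by `&`, and excluding categories can be done with `isin` and `~` (negation). The sketch below shows the pattern on a made-up table with only some of the columns; the items and values are assumptions, not the real menu:

```python
import pandas as pd

# Hypothetical mini menu, not the real dataset
menu = pd.DataFrame({
    'Category': ['Beverages', 'Salads', 'Snacks & Sides', 'Desserts'],
    'Item': ['Water', 'Side Salad', 'Apple Slices', 'Sundae'],
    'Calories': [0, 20, 15, 330],
    'Trans Fat': [0.0, 0.0, 0.0, 0.5],
    'Sugars': [0, 2, 3, 48],
})

# Each comparison needs its own parentheses when combined with &
nutrient_ok = (menu['Trans Fat'] == 0) & (menu['Sugars'] <= 20)

# True for every row whose category is not a drink category
not_a_drink = ~menu['Category'].isin(['Beverages', 'Coffee & Tea'])

healthy = menu[nutrient_ok & not_a_drink].sort_values('Calories')

print(healthy[['Item', 'Calories']])
```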

#### d. What are the 10 items that have the highest content of Vitamin C?
Citrus fruits are a rich source of vitamin C. For adults, the recommended dietary reference intake for vitamin C is 65 to 90 milligrams (mg) a day, and the upper limit is 2,000 mg a day. Show the 'Vitamin C (% Daily Value)' for the ten items that contain the highest amount of vitamin C.
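One way to pick the top entries of a column is `nlargest`, which is equivalent to sorting in descending order and taking the first rows. A minimal sketch on made-up values (the name `drinks`, the items and the numbers are assumptions), selecting the top two instead of the top ten:

```python
import pandas as pd

# Hypothetical items and values, not the real dataset
drinks = pd.DataFrame({
    'Item': ['Orange Juice', 'Fries', 'Smoothie', 'Cola'],
    'Vitamin C (% Daily Value)': [130, 30, 70, 0],
})

# The two rows with the highest vitamin C content, largest first
top2 = drinks.nlargest(2, 'Vitamin C (% Daily Value)')

print(top2)
```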

#### e. How do the nutrition features compare to each other?
Let's finally take a look at how one feature feeds into the other. Using `pandas.DataFrame.corr`, you can compute the pairwise correlation coefficients of the following columns in your data frame: 'Calories', 'Total Fat', 'Saturated Fat', 'Cholesterol', 'Sodium', 'Carbohydrates', 'Sugars', 'Protein'. What can you observe from the (anti)correlations of the nutritional metrics?
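To get a feeling for how to read the output of `corr`, the sketch below uses made-up numbers (the name `nutrition` and all values are assumptions) in which one column rises exactly in step with 'Calories' and another falls exactly in step with it. By default `corr` computes the Pearson correlation coefficient:

```python
import pandas as pd

# Hypothetical values constructed to be perfectly (anti)correlated
nutrition = pd.DataFrame({
    'Calories': [100, 200, 300, 400],
    'Total Fat': [5, 10, 15, 20],    # rises in step with Calories
    'Protein': [40, 30, 20, 10],     # falls as Calories rise
})

# Symmetric matrix of pairwise correlation coefficients
corr = nutrition.corr()

print(corr)
```

Values close to 1 indicate that two columns rise together, values close to -1 that one falls as the other rises, and values near 0 that there is no linear relationship. Real data will rarely show such extreme values.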

## Extras
The Anaconda website offers a number of *Learning Python For Data Science* [cheat sheets](https://www.anaconda.com/learning-python-data-science-cheat-sheets). Print out those that could be useful for quick reference when working on a project. In particular, that might be the cheat sheets about [Python basics](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf) and [Pandas basics](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf), but there are some more that you might find interesting.