---
# Pandas DataFrames - 15 min / 15 min exercises

**Learning objectives**:
- Select individual values from a Pandas dataframe.
- Select entire rows or entire columns from a dataframe.
- Select a subset of both rows and columns from a dataframe in a single operation.
- Select a subset of a dataframe by a single Boolean criterion.



A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series/DataFrames.

What makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its relational-database operations between DataFrames.

#### 1. Selecting values
To access the value at position `[i, j]` of a DataFrame, we have two options, depending on what `i` and `j` mean. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row therefore has a position inside the table as well as a label, which identifies its entry in the DataFrame (uniquely, as long as the index contains no duplicate labels).

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position.

In [None]:
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.iloc[0, 0])

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

In [None]:
print(data.loc["Albania", "gdpPercap_1952"])

Use `:` on its own to mean all columns or all rows.

In [None]:
print(data.loc["Albania", :])

We would get the same result by printing `data.loc["Albania"]` (without a second index).

In [None]:
print(data.loc[:, "gdpPercap_1952"])

- Would get the same result printing `data["gdpPercap_1952"]`.
- Would also get the same result printing `data.gdpPercap_1952` (not recommended, because it is easily confused with the `.` notation for methods).
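The three column selections can be compared side by side. The following is a minimal sketch on a small made-up table (the `toy` frame and its column names are hypothetical, not the gapminder data); each form returns the same Series.

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration.
toy = pd.DataFrame(
    {"gdp_1952": [1.0, 2.0], "gdp_1957": [3.0, 4.0]},
    index=["A", "B"],
)

via_loc = toy.loc[:, "gdp_1952"]   # label-based, all rows
via_brackets = toy["gdp_1952"]     # plain column lookup
via_attribute = toy.gdp_1952       # attribute access (not recommended)

# All three are Series with the same values and the same index.
print(type(via_loc).__name__)
print(via_loc.equals(via_brackets) and via_loc.equals(via_attribute))
```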

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])

In the above code, we discover that slicing using `loc` is inclusive at both ends, which differs from slicing using `iloc`, where slicing indicates everything up to but not including the final index.
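To make the difference concrete, here is a minimal sketch on a small made-up table (the `toy` frame, its labels, and its values are assumptions for illustration only). The same block of cells is selected once by label and once by position; note how the `iloc` stop values are one past the last position wanted.

```python
import pandas as pd

# Hypothetical toy table (not the gapminder data), indexed by label.
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

# Label-based slice: both end points ("x" and "y", "a" and "b") are included.
by_label = toy.loc["x":"y", "a":"b"]

# Position-based slice: the stop positions (3 and 2) are excluded,
# so this selects rows 1-2 and columns 0-1, the same block as above.
by_position = toy.iloc[1:3, 0:2]

print(by_label)
print(by_position)
print(by_label.equals(by_position))
```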

#### 2. Result of slicing can be used in further operations.

- Usually don’t just print a slice.
- All the statistical operators that work on entire dataframes work the same way on slices.
- E.g., calculate max of a slice.

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())


In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

Use comparisons to select data based on value.

In [None]:
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000?
print('\nWhere are values large?\n', subset > 10000)

#### 3. Select values or NaN using a Boolean mask.

In [None]:
mask = subset > 10000
print(subset[mask])
#print(subset[subset > 10000].describe())
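Indexing with a Boolean mask keeps the value where the mask is `True` and puts `NaN` where it is `False`. The sketch below uses a small made-up table (the `toy` frame and its values are hypothetical) to show that summary methods such as `max`, `mean` and `count` skip those `NaN` entries by default.

```python
import pandas as pd

# Hypothetical values chosen so that each column has one entry above 10000.
toy = pd.DataFrame(
    {"y1962": [8000.0, 12000.0], "y1967": [11000.0, 9000.0]},
    index=["A", "B"],
)

# Values that fail the comparison are replaced by NaN.
masked = toy[toy > 10000]
print(masked)

# NaN entries are ignored, so each statistic only sees the surviving value.
print(masked.max())
print(masked.mean())
print(masked.count())  # number of non-NaN entries per column
```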


#### 4. Group By: split-apply-combine

- Pandas' vectorized methods and grouping operations give users a lot of flexibility when analysing their data.

- For instance, let’s say we want a clearer view of how the European countries split themselves according to their GDP.

In [None]:
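# True where a country's GDP is above that year's mean over all countries
mask_higher = data > data.mean()
# Fraction of years in which each country was above the mean
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
wealth_score

# Sum the GDP values, per year, over the countries that share a wealth score
data.groupby(wealth_score).sum()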
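The split-apply-combine steps can be traced by hand on a tiny table. This is a sketch under stated assumptions: the `toy` frame, its labels and its values are made up so that the column means are easy to check (both are 5.5), and `score` plays the role of `wealth_score`.

```python
import pandas as pd

# Hypothetical four-row table; both columns have a mean of 5.5.
toy = pd.DataFrame(
    {"y1": [1.0, 2.0, 9.0, 10.0], "y2": [1.0, 8.0, 3.0, 10.0]},
    index=["A", "B", "C", "D"],
)

# Compare every value with the mean of its own column.
above_mean = toy > toy.mean()

# Fraction of columns in which each row is above the mean:
# A -> 0.0, B -> 0.5, C -> 0.5, D -> 1.0
score = above_mean.sum(axis=1) / len(toy.columns)

# Split the rows by score, then sum each group column by column.
grouped = toy.groupby(score).sum()
print(score)
print(grouped)

# Counting rows per group shows how the table was split.
print(toy.groupby(score).size())
```

Rows B and C share the score 0.5, so they land in the same group and their values are added together.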