# Summarizing Data

> A quote about summarizing data.
>
> \- The person that said it

## Applied Review

### Dictionaries

* The `dict` structure is used to represent **key-value pairs**
* Like a real dictionary, you look up a word (**key**) and get its definition (**value**)
* Below is an example:

```python
ethan = {
    'first_name': 'Ethan',
    'last_name': 'Swan',
    'alma_mater': 'Notre Dame',
    'employer': '84.51˚',
    'zip_code': 45208
}
```
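Looking up a value works much like looking up a word. The sketch below repeats the dictionary so it runs on its own; the `middle_name` key is a made-up example of a key that does not exist.

```python
# Same dictionary as above, repeated so this snippet is self-contained
ethan = {
    'first_name': 'Ethan',
    'last_name': 'Swan',
    'alma_mater': 'Notre Dame',
    'employer': '84.51˚',
    'zip_code': 45208
}

# Look up a value by its key with bracket notation
alma_mater = ethan['alma_mater']

# .get() returns a default instead of raising a KeyError for a missing key
# ('middle_name' is a hypothetical key used only for illustration)
middle_name = ethan.get('middle_name', 'unknown')

print(alma_mater)
print(middle_name)
```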

### DataFrame Structure

* We will start by importing the `flights` data set as a DataFrame:

In [1]:
import pandas as pd
flights_df = pd.read_csv('../data/flights.csv')

* Each DataFrame variable is a **Series** and can be accessed with bracket subsetting notation: `DataFrame['SeriesName']`
* The DataFrame has an **Index** that is visible on the far left side and can be used to slice the DataFrame
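
As a minimal sketch of both points, the example below uses a small made-up DataFrame (`toy_df` and its values are hypothetical stand-ins, not the real flights data):

```python
import pandas as pd

# Hypothetical stand-in for the flights data
toy_df = pd.DataFrame({
    'carrier': ['UA', 'DL', 'AA'],
    'distance': [1400, 762, 1089]
})

# Bracket subsetting returns a single column as a Series
distance = toy_df['distance']

# The default Index is 0, 1, 2, ...
# .loc selects rows by Index label, and the end label is included
first_two = toy_df.loc[0:1]

print(type(distance))
print(first_two)
```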

### Methods

* Methods are *operations* that are specific to Python classes
* These operations end in parentheses and *make something happen*
* An example of a method is `DataFrame.head()`
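
To see why the parentheses matter, here is a short sketch on a made-up DataFrame (`toy_df` is hypothetical and used only for illustration):

```python
import pandas as pd

# Hypothetical six-row DataFrame
toy_df = pd.DataFrame({'distance': [1400, 762, 1089, 719, 1065, 229]})

# With no argument, head() returns the first five rows
first_five = toy_df.head()

# Passing a number changes how many rows come back
first_two = toy_df.head(2)

# Without the parentheses nothing happens; we only get the method itself
not_called = toy_df.head

print(first_two)
```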

## General Model

### Window Operations

Yesterday we learned how to manipulate data across one or more variables within the row(s):

![series-plus-series.png](images/series-plus-series.png)

Note that we return the same number of elements that we started with. This is known as a **window function**, but you can also think of it as summarizing at the row level.

We could achieve this result with the following code:

```python
DataFrame['A'] + DataFrame['B']
```

We subset the two Series and then add them together using the `+` operator to achieve the sum.
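
A runnable version of the same idea, assuming a small made-up DataFrame with columns `A` and `B` (the values are hypothetical):

```python
import pandas as pd

# Hypothetical three-row DataFrame
toy_df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Element-wise addition: one result per row
row_sums = toy_df['A'] + toy_df['B']

print(row_sums)
# The result has as many elements as the DataFrame has rows
print(len(row_sums) == len(toy_df))
```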

### Summary Operations

However, sometimes we want to work with data across rows within a variable -- that is, collapse all of the values in one column into a single result rather than combine columns within each row.

![aggregate-series.png](images/aggregate-series.png)

Note that we return a single value representing some aggregation of the elements we started with. This is known as a **summary function**, but you can think of it as summarizing across rows.
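
As a rough sketch of the difference in shape, assuming a made-up column `A` (hypothetical values), three elements go in and one value comes out:

```python
import pandas as pd

# Hypothetical three-row DataFrame
toy_df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Aggregate column A across all rows into a single value
total_a = toy_df['A'].sum()

print(len(toy_df['A']))  # number of elements we started with
print(total_a)           # the single aggregated value
```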

This is what we are going to talk about next.

## Summarizing a Series

### Summary Methods

The easiest way to summarize a specific Series is by using bracket subsetting notation and the built-in Series methods:

In [3]:
flights_df['distance'].sum()

350217607

Note that a *single value* was returned because this is a **summary operation** -- we are summing the `distance` variable across all rows.

A Series has other summary methods as well:

In [4]:
flights_df['distance'].mean()

1039.9126036297123

In [5]:
flights_df['distance'].median()

872.0

In [6]:
flights_df['distance'].mode()

0    2475
dtype: int64
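
Unlike the other methods, `mode()` returns a Series rather than a single number, because several values can tie for most frequent. A minimal sketch with made-up distances (hypothetical values, not from the flights data):

```python
import pandas as pd

# Hypothetical distances where 200 and 950 each appear twice
distances = pd.Series([200, 200, 950, 950, 1400])

# Both tied values are returned, sorted in ascending order
modes = distances.mode()

# Take the first mode when a single value is needed
first_mode = modes.iloc[0]

print(modes)
```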

All of the above methods work on quantitative variables, but we also have methods for character variables:

In [12]:
flights_df['carrier'].value_counts()

UA    58665
B6    54635
EV    54173
DL    48110
AA    32729
MQ    26397
US    20536
9E    18460
WN    12275
VX     5162
FL     3260
AS      714
F9      685
YV      601
HA      342
OO       32
Name: carrier, dtype: int64

While the above isn't *technically* returning a single value, it's still a useful Series method.
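
If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses a made-up Series of carrier codes (hypothetical values):

```python
import pandas as pd

# Hypothetical carrier codes
carriers = pd.Series(['UA', 'UA', 'DL', 'UA', 'AA', 'DL'])

# Counts per unique value, most frequent first
counts = carriers.value_counts()

# Share of rows per unique value instead of raw counts
shares = carriers.value_counts(normalize=True)

print(counts)
print(shares)
```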

### Describe Method

There is also a method `describe()` that provides a lot of this information -- this is especially useful in exploratory data analysis.

In [8]:
flights_df['distance'].describe()

count    336776.000000
mean       1039.912604
std         733.233033
min          17.000000
25%         502.000000
50%         872.000000
75%        1389.000000
max        4983.000000
Name: distance, dtype: float64

Note that `describe()` will return different results depending on the `dtype` of the Series:

In [10]:
flights_df['carrier'].describe()

count     336776
unique        16
top           UA
freq       58665
Name: carrier, dtype: object
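
The two kinds of output can be compared side by side by looking at the labels each summary carries. This is a sketch on a small made-up DataFrame (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in with one text column and one numeric column
toy_df = pd.DataFrame({
    'carrier': ['UA', 'UA', 'DL'],
    'distance': [1400, 762, 1089]
})

# Numeric Series: count, mean, std, min, quartiles, max
numeric_summary = toy_df['distance'].describe()

# Text Series: count, unique, top, freq
text_summary = toy_df['carrier'].describe()

print(list(numeric_summary.index))
print(list(text_summary.index))
```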

## Summarizing a DataFrame