# Summarizing Data

In this lecture, we'll discuss how to descriptively *summarize* data. Descriptive data summarization is one of the fundamental processes of exploratory data analysis. The `pandas` package offers us a powerful suite of tools for creating summaries. 

In [4]:
#standard imports
import pandas as pd
import numpy as np

#read in data from csv
penguins=pd.read_csv("palmer_penguins.csv")

In this video, let's work with the columns "Species", "Region", "Island", "Culmen Length (mm)", and "Culmen Depth (mm)".

In [5]:
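cols=["Species", "Region", "Island", "Culmen Length (mm)", "Culmen Depth (mm)"]
#.copy() makes an independent data frame, so the column assignments below don't trigger a SettingWithCopyWarning
penguins=penguins[cols].copy()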

Shorten the species names by keeping only the first word of each.

In [6]:
penguins["Species"]=penguins["Species"].str.split().str.get(0)
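
To see what the chained string methods above do, here is a minimal sketch on a hand-built Series. The long species labels used here are an assumption about what the raw CSV contains; the exact strings may differ.

```python
import pandas as pd

# Assumed examples of the long species labels (not read from the CSV).
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)",
])

# .str.split() splits each string on whitespace into a list of words;
# .str.get(0) then picks the first word out of each list.
short = species.str.split().str.get(0)
print(short.tolist())
```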
                                                            

Shorten the names of the other columns.

In [7]:
penguins["Length"]=penguins["Culmen Length (mm)"]
penguins["Depth"]=penguins["Culmen Depth (mm)"]

penguins = penguins.drop(labels=["Culmen Depth (mm)","Culmen Length (mm)"],axis=1)
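
The copy-then-drop pattern above can also be written as a single `rename` call. The following is a sketch on a small stand-in data frame called `demo` (a hypothetical name, with two rows of values taken from the penguins data), not a change to the lecture's own code.

```python
import pandas as pd

# Small stand-in for the penguins data frame before renaming.
demo = pd.DataFrame({
    "Species": ["Adelie", "Adelie"],
    "Culmen Length (mm)": [39.1, 39.5],
    "Culmen Depth (mm)": [18.7, 17.4],
})

# rename returns a new data frame; columns not listed in the mapping are kept as-is.
renamed = demo.rename(columns={
    "Culmen Length (mm)": "Length",
    "Culmen Depth (mm)": "Depth",
})
print(renamed.columns.tolist())
```

Unlike copy-then-drop, `rename` keeps the columns in their original positions, which happens to give the same order here.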

In [8]:
penguins

    Species  Region     Island  Length  Depth
0    Adelie  Anvers  Torgersen    39.1   18.7
1    Adelie  Anvers  Torgersen    39.5   17.4
2    Adelie  Anvers  Torgersen    40.3   18.0
3    Adelie  Anvers  Torgersen     NaN    NaN
4    Adelie  Anvers  Torgersen    36.7   19.3
..      ...     ...        ...     ...    ...
339  Gentoo  Anvers     Biscoe     NaN    NaN
340  Gentoo  Anvers     Biscoe    46.8   14.3
341  Gentoo  Anvers     Biscoe    50.4   15.7
342  Gentoo  Anvers     Biscoe    45.2   14.8


# Simple Aggregation

Because the columns of a data frame behave a lot like NumPy arrays, we can use standard methods to compute summary statistics. Here are a few examples.

Let's focus on the `Length` column.

In [9]:
x=penguins["Length"]
x

0      39.1
1      39.5
2      40.3
3       NaN
4      36.7
       ... 
339     NaN
340    46.8
341    50.4
342    45.2
343    49.9
Name: Length, Length: 344, dtype: float64

In [10]:
#sum
np.sum(x), x.sum()

(15021.3, 15021.3)

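Note: `NaN` values are skipped by default. `np.sum(x)` gives the same result as `x.sum()` here because, when given a `pandas` Series, it defers to the Series' own `sum` method.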
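
The NaN-skipping behavior belongs to `pandas`, not to NumPy in general. The sketch below uses a five-value stand-in for the `Length` column (the first five entries shown above) to compare the options; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in for the Length column, including one missing value.
x = pd.Series([39.1, 39.5, 40.3, np.nan, 36.7])

skipped = x.sum()               # pandas skips NaN by default (skipna=True)
kept = x.sum(skipna=False)      # with skipna=False the NaN propagates
raw = np.sum(x.to_numpy())      # a plain NumPy array does not skip NaN
safe = np.nansum(x.to_numpy())  # NumPy's NaN-aware sum

print(skipped, kept, raw, safe)
```

So if you convert a column to a NumPy array before aggregating, you need the `nan`-prefixed NumPy functions to get the same answers.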

In [11]:
#mean and standard deviation
x.mean(), x.std()

(43.92192982456142, 5.459583713926532)

In [43]:
#sum of entries with length > 40
np.sum(x[x>40])

11291.899999999998

In [12]:
#number of entries with length > 40
np.sum(x>40)

242
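
The two cells above both rely on a boolean mask. As a sketch of why summing the mask counts entries, here is the same idea on a small stand-in Series:

```python
import numpy as np
import pandas as pd

x = pd.Series([39.1, 39.5, 40.3, np.nan, 36.7])

mask = x > 40          # boolean Series; any comparison with NaN is False
total = x[mask].sum()  # sum of the values where the mask is True
count = mask.sum()     # True counts as 1, so this is the number of matches
share = mask.mean()    # fraction of all rows (NaN rows included) that match

print(mask.tolist(), total, count, share)
```

Note that rows with missing values are counted as not matching, so `share` is a fraction of all rows rather than of the non-missing ones.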

It's also possible to aggregate the entire data frame at once, in which case `pandas` applies the specified function to each column separately. Numerical aggregations such as `mean()` only make sense for numeric columns: older versions of `pandas` silently skipped the non-numeric columns, while `pandas` 2.0 and later raise an error unless you pass `numeric_only=True`.

In [46]:
#count number of entries per column
penguins.count()

Species    344
Region     344
Island     344
Length     342
Depth      342
dtype: int64

In [15]:
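#mean of each numeric column (numeric_only=True is required in pandas 2.0+ when text columns are present)
penguins.mean(numeric_only=True)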

Length    43.92193
Depth     17.15117
dtype: float64

In [48]:
#max of each column
penguins.max()

Species       Gentoo
Region        Anvers
Island     Torgersen
Length          59.6
Depth           21.5
dtype: object
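
The `max()` result above includes the text columns because strings can be compared: the "maximum" of a text column is the entry that sorts last alphabetically. The sketch below contrasts this with `mean()`, using a small hypothetical frame `demo` built from a few of the values shown earlier.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Adelie"],
    "Island": ["Torgersen", "Biscoe", "Torgersen"],
    "Length": [39.1, 46.8, np.nan],
    "Depth": [18.7, 14.3, np.nan],
})

# mean is undefined for text, so restrict it to the numeric columns.
means = demo.mean(numeric_only=True)

# max works on every column; for text it returns the alphabetically last value.
maxima = demo.max()

print(means)
print(maxima)
```

Because the result of `max()` mixes strings and numbers, its dtype is `object`, as in the output above.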

It is technically possible to aggregate across columns (rather than rows) in `pandas`; however, doing so usually violates the [*tidy data* principles](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) and is not recommended. 
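
To make the distinction concrete, here is a sketch of what the `axis` argument controls, using the first three rows of `Length` and `Depth`. It is shown only for illustration; adding a length to a depth is rarely a meaningful quantity, which is the point of the caution above.

```python
import pandas as pd

demo = pd.DataFrame({
    "Length": [39.1, 39.5, 40.3],
    "Depth": [18.7, 17.4, 18.0],
})

per_column = demo.sum()     # default axis=0: one value per column
per_row = demo.sum(axis=1)  # axis=1: one value per row, mixing Length and Depth

print(per_column)
print(per_row)
```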

We've already seen `describe()`, a convenience method that computes summary statistics for the numeric columns.

In [50]:
penguins.describe()

          Length      Depth
count      342.0      342.0
mean    43.92193   17.15117
std     5.459584   1.974793
min         32.1       13.1
25%       39.225       15.6
50%        44.45       17.3
75%         48.5       18.7
max         59.6       21.5
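
By default `describe()` reports only the numeric columns. As a sketch of one way to summarize text columns as well, the `include="all"` option adds the rows `unique`, `top`, and `freq` for them. The frame `demo` below is a hypothetical three-row stand-in.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo"],
    "Length": [39.1, np.nan, 46.8],
})

numeric_summary = demo.describe()           # numeric columns only
full_summary = demo.describe(include="all") # text columns get count, unique, top, freq

print(numeric_summary)
print(full_summary)
```

Cells that do not apply to a column, such as the mean of `Species`, are filled with `NaN` in the combined table.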
