# How to Summarize Data in Python

## Learning Objectives
When exploring data, one of the most useful things we can do is summarize it so that we can understand it better. A common way to summarize data is to compute aggregations such as the mean, median, maximum, and minimum. These aggregations, also known as statistical measures, describe the general and specific characteristics of our data, which is why they are often called **descriptive statistics** or **summary statistics**. The pandas DataFrame provides several methods for computing descriptive statistics. By the end of this tutorial, you will have learned:

+ how to describe a DataFrame
+ how to get simple aggregations
+ how to get group-level aggregations

## How to Describe a DataFrame

In [8]:
import pandas as pd
washers = pd.read_csv("washers.csv")
washers

Unnamed: 0,ID,BrandName,ModelNumber,UPC,Configuration,Features,Market,Volume,IMEF,MinimumIMEF,EnergyUse,IWF,MaximumIWF,WaterUse,DateAvailable,DateCertified,Countries,MostEfficient
0,2342279,GE,GTW845C*N***,1,Top Load,"Gentle Cycle,Delayed Start,Sanitize Option",Residential,5.0,2.06,1.29,192,4.3,8.4,6368,8/5/19,7/31/19,"United States, Canada",No
1,2331684,GE,GUD27EE*N***,84691844198,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,United States,No
2,2331685,GE,GUD27EE*N***,7.57638E+11,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,Canada,No
3,2331687,GE,GUD27GE*N***,84691844181,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,United States,No
4,2331686,GE,GUD37EE*N***,7.57638E+11,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,Canada,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
256,2333514,Samsung,WF45R61**A*,8.87276E+11,Front Load,"Steam Cycle,Other",Residential,4.5,2.92,1.84,95,2.9,4.7,3807,2/1/19,1/15/19,"United States, Canada",Yes
257,2333513,Samsung,WF45R63**A*,8.87276E+11,Front Load,"Other,Steam Cycle",Residential,4.5,2.92,1.84,95,2.9,4.7,3807,2/1/19,1/15/19,"United States, Canada",Yes
258,2354715,Samsung,WF45T60**A*,8.87276E+11,Front Load,"Steam Cycle,Other",Residential,4.5,2.95,1.84,90,2.9,4.7,3807,2/20/20,2/5/20,"United States, Canada",Yes
259,2352574,Samsung,WF45T62**A*,8.87276E+11,Front Load,"Steam Cycle,Other",Residential,4.5,2.92,1.84,95,2.9,4.7,3807,2/24/20,12/4/19,"United States, Canada",Yes


## How to Get Simple Aggregations
The `describe()` method returns a statistical summary for each column in a DataFrame. The statistics it returns depend on the data type of the column. When a DataFrame mixes numeric and non-numeric columns, `describe()` summarizes only the numeric ones by default. For non-numeric columns, the method returns the following descriptive statistics:

| Name     | Description                                   |
|----------|-----------------------------------------------|
| `count`  | Number of non-missing values                  |
| `unique` | Number of unique non-missing values           |
| `top`    | Most commonly occurring value                 |
| `freq`   | Frequency of the most commonly occurring value |
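
To make the four statistics in the table concrete, here is a minimal sketch that uses a small made-up column rather than `washers.csv`. The `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data (not taken from washers.csv); None marks a missing value.
toy = pd.DataFrame({"BrandName": ["LG", "GE", "LG", "Samsung", None]})

# describe() on a non-numeric column reports count, unique, top and freq.
summary = toy["BrandName"].describe()
print(summary)

# The same figures computed one at a time.
count = toy["BrandName"].count()                 # non-missing values
unique = toy["BrandName"].nunique()              # distinct non-missing values
top = toy["BrandName"].mode().iloc[0]            # most common value
freq = toy["BrandName"].value_counts().iloc[0]   # how often the top value occurs
```

The missing value is excluded from every statistic, so `count` is 4 even though the column has 5 rows.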


In [7]:
washers.describe()

Unnamed: 0,ID,Volume,IMEF,MinimumIMEF,EnergyUse,IWF,MaximumIWF,WaterUse
count,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0
mean,2320802.0,4.374713,2.45682,1.591341,129.214559,3.598851,6.372797,4632.727969
std,15747.93,0.965866,0.380599,0.274261,43.85062,0.538265,1.845032,1292.693059
min,2300602.0,1.9,2.06,1.29,60.0,2.7,4.7,1728.0
25%,2310408.0,4.3,2.06,1.29,99.0,3.2,4.7,3852.0
50%,2310499.0,4.5,2.38,1.84,120.0,3.6,4.7,4429.0
75%,2332089.0,5.0,2.92,1.84,150.0,4.3,8.4,5632.0
max,2359624.0,6.2,3.1,1.84,311.0,4.3,8.4,7827.0


For numeric columns, the `describe()` method returns the following descriptive statistics:

|Name      |   Description  |
|-----------------|---------------------|
| `count`         | Number of non-missing values                       |
| `mean`       | Average of the non-missing values                   |
| `std`       | Standard deviation of the values   |
| `min`        | Smallest value                  |
| `25%`         | 25th percentile                       |
| `50%`       | 50th percentile (same as the median)                   |
| `75%`       | 75th percentile   |
| `max`        | Largest value                   |
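
The numeric statistics in the table can be checked on a small example. The sketch below assumes a made-up `toy` DataFrame with one missing value; the `percentiles` argument is an optional parameter of `describe()` that controls which percentiles are reported.

```python
import pandas as pd

# Hypothetical toy data (not taken from washers.csv); None becomes NaN.
toy = pd.DataFrame({"Volume": [1.0, 2.0, 3.0, 4.0, 5.0, None]})

# Default summary: count, mean, std, min, 25%, 50%, 75%, max.
summary = toy["Volume"].describe()
print(summary)

# The 50% row is the median of the non-missing values.
median = toy["Volume"].median()

# Request different percentiles by passing fractions between 0 and 1.
custom = toy["Volume"].describe(percentiles=[0.1, 0.5, 0.9])
print(custom)
```

Here `count` is 5 rather than 6 because the missing value is ignored.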


In [9]:
washers[['BrandName']].describe()

Unnamed: 0,BrandName
count,261
unique,22
top,LG
freq,50


In [10]:
washers[['Volume']].describe()

Unnamed: 0,Volume
count,261.0
mean,4.374713
std,0.965866
min,1.9
25%,4.3
50%,4.5
75%,5.0
max,6.2
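
When only one or two statistics are needed, each can be computed directly instead of through `describe()`. The following is a minimal sketch on made-up data; the `toy` DataFrame and its values are assumptions, while `mean()`, `median()`, `min()` and `max()` are standard pandas methods.

```python
import pandas as pd

# Hypothetical toy data (not taken from washers.csv).
toy = pd.DataFrame({
    "Volume": [2.0, 4.0, 4.5, 5.5],
    "EnergyUse": [100, 120, 140, 200],
})

# A single aggregation on one column returns a single number.
mean_volume = toy["Volume"].mean()
median_volume = toy["Volume"].median()
min_volume = toy["Volume"].min()
max_volume = toy["Volume"].max()

# On several columns, the same method returns one value per column.
column_means = toy[["Volume", "EnergyUse"]].mean()
print(column_means)
```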


In [12]:
washers[['BrandName']].value_counts()


BrandName,count
LG,50
GE,49
Samsung,47
Kenmore,30
Whirlpool,26
Maytag,18
Electrolux,7
Asko,4
Bosch,4
Miele,4
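
`value_counts()` counts how often each distinct value occurs and sorts the result from most to least frequent. As a small sketch on made-up data (the `toy` DataFrame is an assumption), the optional `normalize=True` argument returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical toy data (not taken from washers.csv).
toy = pd.DataFrame({"BrandName": ["LG", "GE", "LG", "Samsung", "GE", "LG"]})

# Raw counts, sorted in descending order.
counts = toy["BrandName"].value_counts()
print(counts)

# Share of rows per brand instead of counts.
shares = toy["BrandName"].value_counts(normalize=True)
print(shares)
```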


In [4]:
washers.head()

Unnamed: 0,ID,BrandName,ModelNumber,UPC,Configuration,Features,Market,Volume,IMEF,MinimumIMEF,EnergyUse,IWF,MaximumIWF,WaterUse,DateAvailable,DateCertified,Countries,MostEfficient
0,2342279,GE,GTW845C*N***,1.0,Top Load,"Gentle Cycle,Delayed Start,Sanitize Option",Residential,5.0,2.06,1.29,192,4.3,8.4,6368,8/5/19,7/31/19,"United States, Canada",No
1,2331684,GE,GUD27EE*N***,84691844198.0,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,United States,No
2,2331685,GE,GUD27EE*N***,757638000000.0,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,Canada,No
3,2331687,GE,GUD27GE*N***,84691844181.0,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,United States,No
4,2331686,GE,GUD37EE*N***,757638000000.0,Top Load,Gentle Cycle,Residential,3.9,2.06,1.29,140,4.3,8.4,4947,12/10/18,11/30/18,Canada,No


## How to Get Group-Level Aggregations

In [11]:
washers.groupby('BrandName')[['Volume']].mean()

BrandName,Volume
Amana,4.25
Asko,2.525
Beko,2.133333
Blomberg,2.3
Bosch,2.2
Crosley,4.4
Electrolux,3.785714
Fisher & Paykel,2.4
GE,4.328571
GE Adora,4.2
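
A group-level aggregation splits the rows by the values of one column, applies the aggregation to each group, and combines the results into one row per group. The sketch below illustrates this on made-up data (the `toy` DataFrame is an assumption) and compares one group's result with the same figure computed by filtering manually.

```python
import pandas as pd

# Hypothetical toy data (not taken from washers.csv).
toy = pd.DataFrame({
    "BrandName": ["GE", "GE", "LG", "LG", "LG"],
    "Volume": [4.0, 5.0, 4.5, 5.0, 5.5],
})

# One mean per brand; the group keys become the index of the result.
by_brand = toy.groupby("BrandName")[["Volume"]].mean()
print(by_brand)

# The GE row matches the mean of the GE rows selected by hand.
ge_mean = toy.loc[toy["BrandName"] == "GE", "Volume"].mean()
```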


In [13]:
washers.groupby('BrandName')[['Volume']].mean().sort_values(by='Volume')

BrandName,Volume
Beko,2.133333
Bosch,2.2
Gaggenau,2.2
Miele,2.3
Blomberg,2.3
Haier,2.4
Fisher & Paykel,2.4
Asko,2.525
Magic Chef,2.7
Electrolux,3.785714
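
`sort_values()` sorts in ascending order by default, so the smallest group means appear first. As a hedged sketch on made-up data (the brand names and volumes are assumptions), passing `ascending=False` reverses the order, and `nlargest()` is one way to keep only the top few groups.

```python
import pandas as pd

# Hypothetical toy data (not taken from washers.csv).
toy = pd.DataFrame({
    "BrandName": ["BrandA", "BrandA", "BrandB", "BrandC", "BrandC"],
    "Volume": [2.0, 3.0, 5.0, 4.0, 4.0],
})

means = toy.groupby("BrandName")[["Volume"]].mean()

# Largest average volume first.
largest_first = means.sort_values(by="Volume", ascending=False)
print(largest_first)

# Only the two brands with the largest average volume.
top_two = means.nlargest(2, "Volume")
print(top_two)
```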


In [14]:
washers.groupby('BrandName')[['Volume']].agg(['mean', 'median', 'min', 'max'])

,Volume,Volume,Volume,Volume
BrandName,mean,median,min,max
Amana,4.25,4.25,4.2,4.3
Asko,2.525,2.7,2.0,2.7
Beko,2.133333,2.0,1.9,2.5
Blomberg,2.3,2.5,1.9,2.5
Bosch,2.2,2.2,2.2,2.2
Crosley,4.4,4.5,4.2,4.5
Electrolux,3.785714,4.3,2.4,4.4
Fisher & Paykel,2.4,2.4,2.4,2.4
GE,4.328571,4.5,2.2,5.2
GE Adora,4.2,4.2,4.2,4.2
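
Passing a list of function names to `agg()` produces two-level column labels, with the column name on top and the statistic below. If flat column names are preferred, pandas also supports named aggregation, where each keyword maps an output name to a `(column, function)` pair. The sketch below uses made-up data; the `toy` DataFrame and the output names `mean_volume`, `max_volume` and `models` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data (not taken from washers.csv).
toy = pd.DataFrame({
    "BrandName": ["GE", "GE", "LG", "LG", "LG"],
    "Volume": [4.0, 5.0, 4.5, 5.0, 5.5],
})

# A list of functions yields two-level column labels.
multi = toy.groupby("BrandName")[["Volume"]].agg(["mean", "max"])
print(multi.columns.tolist())

# Named aggregation yields flat, self-describing column names.
stats = toy.groupby("BrandName").agg(
    mean_volume=("Volume", "mean"),
    max_volume=("Volume", "max"),
    models=("Volume", "count"),
)
print(stats)
```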
