# Pandas Cheat Sheet - Summarizing Data
Pandas offers many methods for summarizing data; this notebook looks at a few of them.

In [1]:
import pandas as pd

## The data
Let's create a DataFrame to use:

In [2]:
data = {
    'biscuit': [6,5,2,1,5,6,2],
    'scone': [4,8,9,8,2,8,6],
    'donut': [9,15,8,12,13,10,18],
    'muffin': [4,7,2,8,8,6,4],
}
index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df = pd.DataFrame(data, index=index)
df

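| | biscuit | scone | donut | muffin |
|---|---|---|---|---|
| Monday | 6 | 4 | 9 | 4 |
| Tuesday | 5 | 8 | 15 | 7 |
| Wednesday | 2 | 9 | 8 | 2 |
| Thursday | 1 | 8 | 12 | 8 |
| Friday | 5 | 2 | 13 | 8 |
| Saturday | 6 | 8 | 10 | 6 |
| Sunday | 2 | 6 | 18 | 4 |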


## General Summaries
These are some basic ways of summarizing either the whole DataFrame or specific columns:

In [3]:
df.columns  # Returns the column labels of the DataFrame

Index(['biscuit', 'scone', 'donut', 'muffin'], dtype='object')

In [4]:
df.dtypes  # Returns the data types of each column

biscuit    int64
scone      int64
donut      int64
muffin     int64
dtype: object

In [5]:
df.info()  # Prints a summary of the DataFrame, including memory usage and dtypes

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Monday to Sunday
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   biscuit  7 non-null      int64
 1   scone    7 non-null      int64
 2   donut    7 non-null      int64
 3   muffin   7 non-null      int64
dtypes: int64(4)
memory usage: 280.0+ bytes


In [6]:
len(df)  # Returns the number of rows in the DataFrame

7

In [7]:
df.shape  # Returns a tuple of (# of rows, # of columns)

(7, 4)

In [8]:
df['scone'].nunique()  # Returns number of unique values in the 'scone' column

5
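As a small extension of the `nunique()` call above, the sketch below counts distinct values for every column at once and shows how often each value occurs in one column. It rebuilds the same data so it runs on its own; the variable names are illustrative only.

```python
import pandas as pd

# Same bakery data as above, rebuilt so the snippet is self-contained.
df = pd.DataFrame(
    {
        'biscuit': [6, 5, 2, 1, 5, 6, 2],
        'scone': [4, 8, 9, 8, 2, 8, 6],
        'donut': [9, 15, 8, 12, 13, 10, 18],
        'muffin': [4, 7, 2, 8, 8, 6, 4],
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
)

unique_per_column = df.nunique()            # number of distinct values in each column
scone_values = df['scone'].unique()         # the distinct values, in order of first appearance
scone_counts = df['scone'].value_counts()   # how many times each distinct value occurs

print(unique_per_column)
print(scone_values)
print(scone_counts)
```

`value_counts()` sorts by frequency, so the most common value (here 8, which appears three times) comes first.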

In [9]:
df.describe()  # Descriptive statistics (count, mean, std, min, quartiles, max) for each numeric column

| | biscuit | scone | donut | muffin |
|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 |
| mean | 3.857143 | 6.428571 | 12.142857 | 5.571429 |
| std | 2.115701 | 2.572751 | 3.532165 | 2.299068 |
| min | 1.0 | 2.0 | 8.0 | 2.0 |
| 25% | 2.0 | 5.0 | 9.5 | 4.0 |
| 50% | 5.0 | 8.0 | 12.0 | 6.0 |
| 75% | 5.5 | 8.0 | 14.0 | 7.5 |
| max | 6.0 | 9.0 | 18.0 | 8.0 |
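Because `describe()` returns an ordinary DataFrame, individual statistics can be picked out of it with `.loc`. The following is a minimal sketch under that assumption; it rebuilds the same data, and the variable names are illustrative.

```python
import pandas as pd

# Same bakery data as above, rebuilt so the snippet is self-contained.
df = pd.DataFrame(
    {
        'biscuit': [6, 5, 2, 1, 5, 6, 2],
        'scone': [4, 8, 9, 8, 2, 8, 6],
        'donut': [9, 15, 8, 12, 13, 10, 18],
        'muffin': [4, 7, 2, 8, 8, 6, 4],
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
)

summary = df.describe()

mean_row = summary.loc['mean']                   # one statistic for every column (a Series)
mean_and_median = summary.loc[['mean', '50%']]   # several statistics at once (a DataFrame)
donut_max = summary.loc['max', 'donut']          # one statistic for one column (a scalar)

print(mean_and_median)
print(donut_max)
```

The `50%` row is the median, so it matches the output of `df.median()`.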


## Summary Functions
The examples below call these functions on a DataFrame, but the same functions are also available on other pandas objects (Series, GroupBy, etc.).

In [10]:
df.sum()  # Sum values of each column

biscuit    27
scone      45
donut      85
muffin     39
dtype: int64
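To illustrate the note at the start of this section, here is a rough sketch of `sum()` applied to a Series, across rows, and to a GroupBy. The weekday/weekend grouping key (`day_type`) is a hypothetical label invented for this example; it is not part of the original data.

```python
import numpy as np
import pandas as pd

# Same bakery data as above, rebuilt so the snippet is self-contained.
df = pd.DataFrame(
    {
        'biscuit': [6, 5, 2, 1, 5, 6, 2],
        'scone': [4, 8, 9, 8, 2, 8, 6],
        'donut': [9, 15, 8, 12, 13, 10, 18],
        'muffin': [4, 7, 2, 8, 8, 6, 4],
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
)

# Series: summing one column gives a single number.
donut_total = df['donut'].sum()

# DataFrame with axis=1: sum across the columns, giving one total per day.
daily_totals = df.sum(axis=1)

# GroupBy: label each day (hypothetical key), then sum within each group.
day_type = np.where(df.index.isin(["Saturday", "Sunday"]), "weekend", "weekday")
totals_by_day_type = df.groupby(day_type).sum()

print(donut_total)
print(daily_totals)
print(totals_by_day_type)
```

The other summary functions in this section (`mean()`, `median()`, `min()`, `max()`, and so on) can be swapped in for `sum()` in the same way.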

In [11]:
df.count()  # Count non-NA/null values of each column

biscuit    7
scone      7
donut      7
muffin     7
dtype: int64

In [12]:
df.median()  # Median value of each column

biscuit     5.0
scone       8.0
donut      12.0
muffin      6.0
dtype: float64

In [13]:
df.min()  # Minimum value of each column

biscuit    1
scone      2
donut      8
muffin     2
dtype: int64

In [14]:
df.max()  # Maximum value of each column

biscuit     6
scone       9
donut      18
muffin      8
dtype: int64

In [15]:
df.mean()  # Mean value of each column

biscuit     3.857143
scone       6.428571
donut      12.142857
muffin      5.571429
dtype: float64

In [16]:
df.var()  # Sample variance (ddof=1 by default) of each column

biscuit     4.476190
scone       6.619048
donut      12.476190
muffin      5.285714
dtype: float64

In [17]:
df.std()  # Sample standard deviation (ddof=1 by default) of each column

biscuit    2.115701
scone      2.572751
donut      3.532165
muffin     2.299068
dtype: float64
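The `var()` and `std()` results above are sample statistics: pandas divides by `n - 1` by default (`ddof=1`). The sketch below contrasts that with the population version (`ddof=0`) using only the `biscuit` values; the variable names are illustrative.

```python
import pandas as pd

# The 'biscuit' column from the DataFrame above, as a standalone Series.
biscuit = pd.Series([6, 5, 2, 1, 5, 6, 2], name='biscuit')

sample_var = biscuit.var()            # default ddof=1: divide by n - 1
population_var = biscuit.var(ddof=0)  # ddof=0: divide by n
sample_std = biscuit.std()            # square root of the sample variance

print(round(sample_var, 6))      # 4.47619
print(round(population_var, 6))  # 3.836735
print(round(sample_std, 6))      # 2.115701
```

With only seven rows the two variances differ noticeably, so it is worth checking which one a given analysis expects.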