In [1]:
import pandas as pd

### Pandas Describe

Pandas Describe will do all of the hard work for you. Well...most of it. Calling .describe() on your dataset will produce a series of descriptive statistics that allow you to get to know your data better. 

We will run through 3 examples:
1. Default Describe - Let's see what comes out by default
2. Including *all* columns via 'include'
3. Treating datetimes like numbers via *datetime_is_numeric=True*

But first, let's use our San Francisco Tree dataset as our DataFrame. You can download this dataset at the GitHub link below. Watch out, it's 193K rows.

In [2]:
df = pd.read_csv('../data/Street_Tree_List.csv', parse_dates=['PlantDate'])
df = df[['TreeID', 'qSpecies', 'PlantDate', 'DBH']]
df.rename(mapper={'DBH':"tree_depth"}, axis=1, inplace=True)

df.head()

| | TreeID | qSpecies | PlantDate | tree_depth |
|---|---|---|---|---|
| 0 | 46534 | Tree(s) :: | 2002-04-01 | |
| 1 | 121399 | Corymbia ficifolia :: Red Flowering Gum | NaT | |
| 2 | 85269 | Arbutus 'Marina' :: Hybrid Strawberry Tree | 2007-07-24 | |
| 3 | 121227 | Sequoia sempervirens :: Coast Redwood | NaT | |
| 4 | 45986 | Tree(s) :: | 2001-12-06 | |


### 1. Default Describe - Let's see what comes out by default

By default, .describe() returns a set of descriptive statistics. Let's see what they are.

You can see that although we have 4 columns in our dataset, only 2 of them are returned. This is because .describe() only summarizes the numeric columns by default.

In [3]:
df.describe()

| | TreeID | tree_depth |
|---|---|---|
| count | 193940.0 | 151614.0 |
| mean | 126960.027674 | 9.927665 |
| std | 79504.829131 | 29.318932 |
| min | 1.0 | 0.0 |
| 25% | 52836.75 | 3.0 |
| 50% | 121171.5 | 7.0 |
| 75% | 203348.25 | 12.0 |
| max | 262465.0 | 9999.0 |
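
To see the default behavior without downloading the full dataset, here is a minimal sketch on a tiny stand-in DataFrame. The name `toy` and its values are made up for illustration; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data standing in for the tree dataset
toy = pd.DataFrame({
    "TreeID": [1, 2, 3, 4],
    "qSpecies": ["Oak", "Oak", "Pine", "Elm"],
    "tree_depth": [3.0, 7.0, None, 12.0],
})

summary = toy.describe()

# Only the numeric columns make it into the default summary
print(list(summary.columns))

# 'count' skips missing values, so it can differ between columns
print(summary.loc["count", "TreeID"], summary.loc["count", "tree_depth"])
```

This is also why the two counts in the output above differ: `count` only tallies non-missing values in each column.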


### 2. Including all columns via 'include'

To include every column in the summary, set include='all'.

You'll notice that pandas fills in 'NaN' for statistics that do not apply to a column's type. For example, 'qSpecies' is a string column, so it does not have a 25th percentile.

In [4]:
df.describe(include='all')

| | TreeID | qSpecies | PlantDate | tree_depth |
|---|---|---|---|---|
| count | 193940.0 | 193940 | 68911 | 151614.0 |
| unique | | 571 | 8945 | |
| top | | Tree(s) :: | 2000-06-23 00:00:00 | |
| freq | | 11734 | 314 | |
| first | | | 1955-09-19 00:00:00 | |
| last | | | 2020-07-30 00:00:00 | |
| mean | 126960.027674 | | | 9.927665 |
| std | 79504.829131 | | | 29.318932 |
| min | 1.0 | | | 0.0 |
| 25% | 52836.75 | | | 3.0 |
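
The same idea can be checked on a small example. This is a sketch on made-up data (the `toy` DataFrame is hypothetical). It also shows the `exclude` parameter of .describe(), which is one way to summarize only the non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column, one string column
toy = pd.DataFrame({
    "TreeID": [1, 2, 3, 4],
    "qSpecies": ["Oak", "Oak", "Pine", "Elm"],
})

everything = toy.describe(include="all")

# A string column has no mean, so pandas fills that cell with NaN
print(pd.isna(everything.loc["mean", "qSpecies"]))

# A numeric column has no 'top' value, so that cell is NaN as well
print(pd.isna(everything.loc["top", "TreeID"]))

# Drop the numeric columns to summarize only the remaining ones
non_numeric = toy.describe(exclude=[np.number])
print(non_numeric.loc["top", "qSpecies"])  # most frequent species
```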


### 3. Treating datetimes like numbers via datetime_is_numeric=True

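Finally, let's end by calling .describe() on a Series. We'll do it on our 'PlantDate' column and see the difference between treating dates like objects and treating them like numbers.

Notice how the first example reports unique, top, freq, first and last, but no mean or percentiles. The second example gives us mean, min, max and the percentiles instead.

Note that datetime_is_numeric was added in pandas 1.1 and removed in pandas 2.0, where datetimes are always summarized like numbers.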

In [5]:
df['PlantDate'].describe()

count                   68911
unique                   8945
top       2000-06-23 00:00:00
freq                      314
first     1955-09-19 00:00:00
last      2020-07-30 00:00:00
Name: PlantDate, dtype: object

In [6]:
df['PlantDate'].describe(datetime_is_numeric=True)

count                            68911
mean     2000-12-02 22:19:59.122334464
min                1955-09-19 00:00:00
25%                1995-01-30 00:00:00
50%                2001-07-24 00:00:00
75%                2008-11-21 00:00:00
max                2020-07-30 00:00:00
Name: PlantDate, dtype: object
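
Because the datetime_is_numeric keyword only exists in some pandas releases, one possible way to get the numeric-style summary across versions is to try the keyword and fall back when it is rejected. This is a sketch, assuming pandas 1.1 or later; the `dates` Series is made-up data, not the tree dataset.

```python
import pandas as pd

# Hypothetical dates standing in for the 'PlantDate' column
dates = pd.Series(
    pd.to_datetime(["2000-01-01", "2000-01-03", "2000-01-05"]),
    name="PlantDate",
)

try:
    # pandas 1.1 through 1.5 accept this keyword
    summary = dates.describe(datetime_is_numeric=True)
except TypeError:
    # pandas 2.0+ removed the keyword and treats datetimes as numeric already
    summary = dates.describe()

# Either way we get min, percentiles, and max as timestamps
print(summary["min"], summary["50%"], summary["max"])
```

The try/except is only needed if your code has to run on both sides of the pandas 2.0 change. On a single known version, call .describe() the way that version expects.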