# Descriptive Statistics

Pandas provides many functions for summarizing data. Import Pandas before using them:

In [1]:
import pandas as pd

We also load the melon data set into this notebook:

In [2]:
melon = pd.read_csv('data/melon.csv', sep='\t')
melon.head()

  variety  yield
0       A  25.12
1       A  17.25
2       A  26.42
3       A  16.08
4       A  22.15


Now we can summarize the numeric columns of the data frame. Note that `describe()` ignores non-numeric columns such as `variety` by default:

In [3]:
print(melon.describe())

           yield
count  24.000000
mean   26.820417
std     8.493297
min    11.420000
25%    21.187500
50%    27.000000
75%    32.285000
max    43.320000
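
The default behavior of `describe()` can be illustrated with a minimal sketch. The data frame `demo` and its values are made up for illustration and are not the melon data. By default only the numeric column is summarized; passing `include="all"` additionally reports the count, the number of unique values, and the most frequent value for non-numeric columns.

```python
import pandas as pd

# Hypothetical toy data (illustrative values, not the melon data set)
demo = pd.DataFrame({
    "variety": ["A", "A", "B", "B"],
    "yield": [20.0, 22.0, 35.0, 37.0],
})

# Default: only numeric columns are summarized
numeric_summary = demo.describe()
print(numeric_summary)

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
full_summary = demo.describe(include="all")
print(full_summary)
```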


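Next, we compute descriptive statistics for each group; the column `variety` is the grouping factor. A Pandas data frame provides *methods*, that is, functions that belong to the object and are called by appending them to it with a dot, as in `melon.describe()`. Grouping works the same way: `groupby()` is called on the data frame, and the desired statistic is then called on the grouped result.

Note: the `describe()` method can also be applied to grouped data. It then reports the count, mean, standard deviation, minimum, quartiles, and maximum for each group in a single call.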
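
Several group-wise statistics can also be requested in one call. The following is a minimal sketch, assuming a toy data frame `demo` with made-up values in place of the melon data: `agg()` takes a list of statistic names and returns one column per statistic, while `describe()` on the grouped column returns its fixed set of summaries.

```python
import pandas as pd

# Hypothetical toy data (illustrative values, not the melon data set)
demo = pd.DataFrame({
    "variety": ["A", "A", "A", "B", "B", "B"],
    "yield": [18.0, 20.0, 22.0, 34.0, 36.0, 41.0],
})

# Group by variety and select the numeric column to summarize
grouped = demo.groupby("variety")["yield"]

# One column per requested statistic, one row per group
summary = grouped.agg(["mean", "var", "std", "median", "min", "max"])
print(summary)

# describe() on grouped data: count, mean, std, min, quartiles, max per group
print(grouped.describe())
```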

## Mean

In [4]:
print(melon.groupby(['variety']).mean())

             yield
variety           
A        20.490000
B        37.403333
C        19.491667
D        29.896667


## Variance

In [5]:
print(melon.groupby(['variety']).var())

             yield
variety           
A        22.037600
B        15.606427
C        30.914177
D         4.972427


## Standard Deviation

In [6]:
print(melon.groupby(['variety']).std())

            yield
variety          
A        4.694422
B        3.950497
C        5.560052
D        2.229894
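
By default, Pandas computes `var()` and `std()` with `ddof=1`, i.e. as sample statistics that divide by n - 1. The standard deviation is the square root of the variance; for variety A above, the square root of 22.0376 is approximately 4.6944. The sketch below checks this relationship on a toy data frame `demo` whose values are made up for illustration, and shows how `ddof=0` yields the population variance instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative values, not the melon data set)
demo = pd.DataFrame({
    "variety": ["A", "A", "A", "B", "B", "B"],
    "yield": [18.0, 20.0, 22.0, 34.0, 36.0, 41.0],
})
grouped = demo.groupby("variety")["yield"]

sample_var = grouped.var()            # default ddof=1: divides by n - 1
population_var = grouped.var(ddof=0)  # ddof=0: divides by n
sample_std = grouped.std()            # default ddof=1 as well

print(sample_var)
print(population_var)

# The standard deviation equals the square root of the variance
print(np.allclose(np.sqrt(sample_var), sample_std))
```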


## Median

In [7]:
print(melon.groupby(['variety']).median())

          yield
variety        
A        19.700
B        36.810
C        20.450
D        29.435


## Minimum

In [8]:
print(melon.groupby(['variety']).min())

         yield
variety       
A        15.92
B        31.98
C        11.42
D        27.58


## Maximum

In [9]:
print(melon.groupby(['variety']).max())

         yield
variety       
A        26.42
B        43.32
C        25.90
D        33.20


## Quantiles

In [10]:
print(melon.groupby(['variety']).quantile([0.25, 0.75]))

                yield
variety              
A       0.25  16.3725
        0.75  24.3775
B       0.25  35.5675
        0.75  39.4625
C       0.25  15.8625
        0.75  23.4100
D       0.25  28.1750
        0.75  31.3400
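
The quartiles are often combined into the interquartile range (IQR), the difference between the 0.75 and 0.25 quantiles. The following is a minimal sketch, assuming a toy data frame `demo` with made-up values: `unstack()` moves the quantile level of the result into columns so that the two quartiles can be subtracted per group.

```python
import pandas as pd

# Hypothetical toy data (illustrative values, not the melon data set)
demo = pd.DataFrame({
    "variety": ["A"] * 5 + ["B"] * 5,
    "yield": [10.0, 20.0, 30.0, 40.0, 50.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})

# Result is indexed by (variety, quantile)
quartiles = demo.groupby("variety")["yield"].quantile([0.25, 0.75])

# Move the quantile level into columns: one row per variety
wide = quartiles.unstack()

# Interquartile range per group
iqr = wide[0.75] - wide[0.25]
print(iqr)
```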


<font size="3"><div class="alert alert-warning"><b>Exercise 2.1:</b> <br> 

Plant growth is influenced by microbial activity in the soil, and soil respiration is an indicator of this activity. In an experiment, soil samples from two characteristic areas of a forest (gap = "clearing" and growth = "dense tree population") were analyzed for their carbon dioxide output. The amount of CO2 released was measured in mol CO2 g^{-1} soil hr^{-1} (Fierer, 1994, as cited in Samuels and Witmer, 2003, p. 289).

The first code cell below creates the data as a data frame.
    
Compute the descriptive statistics for both groups using the <tt>describe()</tt> method.

Note: the code that creates the data set may also be useful if you ever need to build your own data frame directly in Python.
</div>
</font>

In [11]:
import pandas as pd

# Treatment labels: 8 samples from the "growth" area followed by 7 from the "gap" area
soil_dict = {"treatment": ["growth"] * 8 + ["gap"] * 7,
             "response": [17, 20, 170, 315, 22, 190, 64, 22, 29, 13, 16, 15, 18, 14, 6]}

soil_pd = pd.DataFrame(data=soil_dict)

# Save the data as a tab-separated file (the data/ directory must already exist)
soil_pd.to_csv('data/soil_respiration.csv', index=False, sep="\t", header=True)

# Display the data frame as the cell output
soil_pd

<font size="3">
<b>Try it yourself:</b></font>

**Example Solution**

In [12]:
print(soil_pd.groupby(['treatment']).describe())

          response                                                        
             count        mean         std   min   25%   50%    75%    max
treatment                                                                 
gap            7.0   15.857143    6.914443   6.0  13.5  15.0   17.0   29.0
growth         8.0  102.500000  110.794533  17.0  21.5  43.0  175.0  315.0
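
The result of `describe()` on a grouped data frame has two-level column labels of the form (variable, statistic). A minimal sketch of two ways to pick out individual statistics, using the soil respiration data from above: index with the full label pair, or select the column before calling `describe()` to obtain flat column labels. The variable names `summary` and `flat` are chosen here for illustration.

```python
import pandas as pd

# Soil respiration data as defined in the exercise
soil_pd = pd.DataFrame({
    "treatment": ["growth"] * 8 + ["gap"] * 7,
    "response": [17, 20, 170, 315, 22, 190, 64, 22, 29, 13, 16, 15, 18, 14, 6],
})

# Columns are (variable, statistic) pairs
summary = soil_pd.groupby("treatment").describe()
print(summary[("response", "mean")])

# Selecting the column first gives flat column labels
flat = soil_pd.groupby("treatment")["response"].describe()
print(flat[["mean", "50%"]])
```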


Further reading: 
   * <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html>
   * <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html>