# Descriptive statistics

Here we show how to calculate the descriptive statistics of a data set, such as the mean and standard deviation.  We use the investment data from chapter 1 of the book.

First import some libraries we will need for data analysis.

In [1]:
import pandas as pd
import numpy as np
#import os
#import statsmodels.formula.api as smf

Now read the data file into a dataframe called 'df'.

In [2]:
df = pd.read_csv('investment.csv')
df.head()               

,year,dwellings,transport,machinery,intangible,other,investment,gdp,realgdp,realinv
0,1976,2566,1364,4157,170,4090,25640,,,
1,1977,5699,3248,9950,797,8657,28351,146973.0,652875.0,89594.0
2,1978,6325,4112,11709,760,9481,32387,169344.0,673990.0,91877.0
3,1979,7649,4758,13832,1020,11289,38548,199220.0,692087.0,94974.0
4,1980,8674,4707,15301,1250,13680,43612,233184.0,678013.0,90533.0


In [None]:
df.tail()

The data look OK so we can continue to calculate some summary measures.

## Create summary statistics
We use the 'describe' method to calculate a range of summary statistics for the data set.

In [3]:
df.describe()

,year,dwellings,transport,machinery,intangible,other,investment,gdp,realgdp,realinv
count,34.0,34.0,34.0,34.0,34.0,34.0,34.0,33.0,33.0,33.0
mean,1992.5,23369.852941,10008.647059,39225.176471,7396.264706,39845.264706,120236.176471,716427.0,965656.1,147556.212121
std,9.958246,14424.374355,4454.028003,19580.01302,5319.123614,24988.093152,66521.897744,404062.6,233913.4,48560.505523
min,1976.0,2566.0,1364.0,4157.0,170.0,4090.0,25640.0,146973.0,652875.0,82627.0
25%,1984.25,12003.75,6104.0,21673.75,2974.5,17943.0,60753.75,361758.0,755300.0,104575.0
50%,1992.5,21126.0,10598.5,38097.5,6643.0,36142.5,112550.5,654196.0,897777.0,136318.0
75%,2000.75,29203.0,14512.25,58474.25,11162.0,54297.75,170629.5,1021828.0,1170489.0,189336.0
max,2009.0,55767.0,16314.0,69411.0,17710.0,92808.0,249517.0,1445580.0,1364029.0,245053.0
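
The 'count' row differs between columns (34 for most, 33 for the last three) because 'describe' leaves missing values out of its calculations. The following is a minimal sketch of how this could be checked. It assumes only the five rows printed by `df.head()` above, retyped by hand into a small dataframe called `sample` (a name introduced here for illustration), rather than the full data file.

```python
import numpy as np
import pandas as pd

# First five rows of three columns, copied from the df.head() output above.
# 'sample' is an illustrative name, not part of the original notebook.
sample = pd.DataFrame({
    "year": [1976, 1977, 1978, 1979, 1980],
    "investment": [25640, 28351, 32387, 38548, 43612],
    "gdp": [np.nan, 146973.0, 169344.0, 199220.0, 233184.0],
})

missing_per_column = sample.isna().sum()   # number of NaN entries in each column
counts = sample.describe().loc["count"]    # the 'count' row reported by describe()

print(missing_per_column)
print(counts)
```

In this small sample, gdp has one missing value (1976), so its count is one lower than that of investment.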


Compare how easy this is with doing the same in Excel!  We have the mean, standard deviation and the quartiles of all the variables. The only downside is the formatting of the numbers: there are too many decimal places. Let us set them to display as integer values.

In [4]:
pd.set_option("display.precision", 0)   # full option name; the short form 'precision' can be ambiguous in newer pandas versions
df.describe()

,year,dwellings,transport,machinery,intangible,other,investment,gdp,realgdp,realinv
count,34,34,34,34,34,34,34,30.0,30.0,33
mean,1992,23370,10009,39225,7396,39845,120236,700000.0,1000000.0,147556
std,10,14424,4454,19580,5319,24988,66522,400000.0,200000.0,48561
min,1976,2566,1364,4157,170,4090,25640,100000.0,700000.0,82627
25%,1984,12004,6104,21674,2974,17943,60754,400000.0,800000.0,104575
50%,1992,21126,10598,38098,6643,36142,112550,700000.0,900000.0,136318
75%,2001,29203,14512,58474,11162,54298,170630,1000000.0,1000000.0,189336
max,2009,55767,16314,69411,17710,92808,249517,1000000.0,1000000.0,245053
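
The display option set above applies to every later output in the session. As a hedged alternative, the summary table itself can be rounded, or the option can be applied temporarily. The sketch below assumes a small dataframe called `sample` (an illustrative name) built from the first five rows of two columns shown earlier, not the full data set.

```python
import pandas as pd

# First five rows of two columns, copied from the df.head() output above.
# 'sample' is an illustrative name, not part of the original notebook.
sample = pd.DataFrame({
    "dwellings": [2566, 5699, 6325, 7649, 8674],
    "transport": [1364, 3248, 4112, 4758, 4707],
})

summary = sample.describe()

# Option 1: round the table and convert it to integers.
# This works here because the summary contains no missing values.
rounded = summary.round(0).astype(int)
print(rounded)

# Option 2: change the display precision only inside this block.
with pd.option_context("display.precision", 0):
    print(summary)
```

Rounding the table changes the stored values, whereas the display option only changes how they are printed.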


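The integer display is more readable, although in the output above the gdp and realgdp columns appear rounded to one significant figure (their count of 33 is shown as 30.0), so those two columns are better read from the earlier table.  We can display just a subset of the results as follows.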

In [5]:
df[['dwellings', 'transport']].describe()

,dwellings,transport
count,34,34
mean,23370,10009
std,14424,4454
min,2566,1364
25%,12004,6104
50%,21126,10598
75%,29203,14512
max,55767,16314


In [6]:
df.iloc[:,1:7].describe()

,dwellings,transport,machinery,intangible,other,investment
count,34,34,34,34,34,34
mean,23370,10009,39225,7396,39845,120236
std,14424,4454,19580,5319,24988,66522
min,2566,1364,4157,170,4090,25640
25%,12004,6104,21674,2974,17943,60754
50%,21126,10598,38098,6643,36142,112550
75%,29203,14512,58474,11162,54298,170630
max,55767,16314,69411,17710,92808,249517


Note that in this case we are using all rows of the data (the ":" before the comma in the command) and the columns given by the range 1:7. Recall Python's numbering system: the first column is numbered 0, so starting at 1 means we do not summarise the first column (year, which is not worth summarising).  The range 1:7 means up to but not including 7, so the last column selected is number 6, which is the seventh column in the data (investment).  As we say, you get used to this.
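
The numbering rule described above can be checked on a small example. This is a minimal sketch, assuming a two-row dataframe called `sample` (an illustrative name) typed in from the first two rows of `df.head()`. It also shows selection by column name with `.loc`, which, unlike `.iloc`, includes both ends of the range.

```python
import numpy as np
import pandas as pd

# First two rows of the first eight columns, copied from the df.head() output above.
# 'sample' is an illustrative name, not part of the original notebook.
sample = pd.DataFrame(
    [[1976, 2566, 1364, 4157, 170, 4090, 25640, np.nan],
     [1977, 5699, 3248, 9950, 797, 8657, 28351, 146973.0]],
    columns=["year", "dwellings", "transport", "machinery",
             "intangible", "other", "investment", "gdp"],
)

by_position = sample.iloc[:, 1:7]                    # positions 1 to 6; position 7 (gdp) is excluded
by_label = sample.loc[:, "dwellings":"investment"]   # label ranges include both end points

print(list(by_position.columns))
print(by_position.equals(by_label))
```

Selecting by name is longer to type but does not depend on counting column positions.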

## Declaring data to be time series

So far we have not explicitly declared our data to be time series.  Doing so makes the time dimension explicit and keeps tasks such as creating a lagged variable or calculating the growth rate of a variable simple.
First we declare our data to be time series, using the 'year' variable to indicate how time is measured.

In [7]:
df['year'] = pd.to_datetime(df['year'], format='%Y')     # This changes the variable 'year' to a 'timestamped'
                                                         # variable. 
df = df.set_index('year')                                # This sets 'year' as the index variable, i.e. the one 
                                                         # that measures time

Now we can easily create the lag of a variable and the (annual) difference. We'll measure the growth rate of investment by calculating the change in the log (averaged over all the years in the data set).  First we create the log of investment.

In [8]:
df["log_inv"] = np.log(df["investment"])                 # We use a numpy (np) function to calculate the natural log

Now we create the lag and also the difference of the log of investment.

In [9]:
df['log_inv-1'] = df['log_inv'].shift(1)                 # This creates a one period (year) lag
df['log_inv_diff'] = df['log_inv'].diff()                # This creates the difference between successive years

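The 'shift' and 'diff' methods work on the rows in the order in which they are stored, so they give the one-year lag and the annual difference here because the data are sorted by year, with one row per year.  (They do not strictly require the time series declaration; they would also run on the original dataframe.)  Now check the data to see it has worked.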

In [10]:
pd.set_option('display.precision', 3)                    # Use higher precision so we can see the log and change of log
df.head()

year,dwellings,transport,machinery,intangible,other,investment,gdp,realgdp,realinv,log_inv,log_inv-1,log_inv_diff
1976-01-01,2566,1364,4157,170,4090,25640,,,,10.152,,
1977-01-01,5699,3248,9950,797,8657,28351,146973.0,652875.0,89594.0,10.252,10.152,0.101
1978-01-01,6325,4112,11709,760,9481,32387,169344.0,673990.0,91877.0,10.386,10.252,0.133
1979-01-01,7649,4758,13832,1020,11289,38548,199220.0,692087.0,94974.0,10.56,10.386,0.174
1980-01-01,8674,4707,15301,1250,13680,43612,233184.0,678013.0,90533.0,10.683,10.56,0.123


Note two things.  First, the data are now presented slightly differently: the year is treated as the indexing variable, rather than as a variable in its own right.  Second, by lagging a variable we lose the first observation, and this is indicated by 'NaN' (not a number).

(Note also that you don't actually need to calculate the lag before you calculate the difference.  We could have omitted the former.)
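
The point in the note above can be verified directly: the difference equals the series minus its own lag. This is a minimal sketch, assuming the five investment values printed by `df.head()` and illustrative variable names (`investment`, `diff_manual`, `diff_direct`) that are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Investment for the first five years, copied from the df.head() output above.
years = pd.to_datetime(["1976", "1977", "1978", "1979", "1980"], format="%Y")
investment = pd.Series([25640, 28351, 32387, 38548, 43612],
                       index=years, name="investment")

log_inv = np.log(investment)

lag = log_inv.shift(1)           # previous year's value; the first entry is NaN
diff_manual = log_inv - lag      # difference built from the lag
diff_direct = log_inv.diff()     # difference computed in one step

print(diff_direct.round(3))
print(np.allclose(diff_manual, diff_direct, equal_nan=True))
```

Both routes give the same values, with 'NaN' in the first year because there is no earlier observation to subtract.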

Now calculate the average growth rate.

In [11]:
df['log_inv_diff'].describe()

count    33.000
mean      0.063
std       0.073
min      -0.163
25%       0.033
50%       0.071
75%       0.101
max       0.198
Name: log_inv_diff, dtype: float64

The mean of the log differences is 0.063, so investment grew by approximately 6.3% per annum, on average.
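
The change in the log is only an approximation to the percentage growth rate. As a hedged illustration, the mean log difference can be converted to an exact compound rate with exp(x) - 1. The sketch below assumes only the rounded mean of 0.063 reported in the output above; the variable names are illustrative.

```python
import math

mean_log_diff = 0.063                       # mean of log_inv_diff, taken from the output above

# exp(x) - 1 converts an average log change into a compound growth rate per year
exact_growth = math.expm1(mean_log_diff)

print(f"approximate growth rate: {mean_log_diff:.1%}")
print(f"compound growth rate:    {exact_growth:.1%}")
```

The compound rate comes out at about 6.5%, slightly above the 6.3% read directly from the log difference. The gap widens as growth rates get larger.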