# Stats

This notebook loads some data, reports simple descriptive statistics (means, standard deviations, etc.) and shows a number of useful plots (scatter plots, histograms, time series plots).

Most of the descriptive statistics use built-in Julia commands. The plots rely on the Plots package, and the pdf and quantiles come from the Distributions package (see https://github.com/JuliaStats/Distributions.jl).

## Load Packages

In [1]:
using Distributions      #distributions, random numbers, etc.
                         
include("printmat.jl")   #just a function for prettier matrix printing

println4Ps (generic function with 1 method)

In [2]:
using Plots

backend = "gr"              #"gr" (default), "pyplot" 

if backend == "pyplot"
    pyplot(size=(600,400))
else    
    gr(size=(600,400))
end

Plots.GRBackend()

# Load Data from Text File

The following is a portion of MyData.csv:

```
date,Mkt-RF,RF,SmallGrowth
197901,4.18,0.77,10.96
197902,-3.41,0.73,-2.09
197903,5.75,0.81,11.71
197904,0.05,0.8,3.27
```

In [3]:
xx   = readdlm("Data/MyData.csv",',',header=true)  #reading the csv file    
x    = xx[1]                                #xx[1] contains the data

println("Column headers: ",xx[2])           #xx[2] contains the headers
println("\nfirst four lines of x:") 
printmat(x[1:4,:])

Column headers: AbstractString["date" "Mkt-RF" "RF" "SmallGrowth"]

first four lines of x:
197901.000     4.180     0.770    10.960
197902.000    -3.410     0.730    -2.090
197903.000     5.750     0.810    11.710
197904.000     0.050     0.800     3.270
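
The same read can be tried without the file. The following is a minimal sketch, assuming Julia 1.x (where `readdlm` lives in the `DelimitedFiles` standard library) and using the four sample rows shown above as an in-memory string. The names `csv`, `data` and `header` are only for illustration.

```julia
using DelimitedFiles             # standard library in Julia 1.x; provides readdlm

# The four sample rows shown above, held in a string instead of a file
csv = """
date,Mkt-RF,RF,SmallGrowth
197901,4.18,0.77,10.96
197902,-3.41,0.73,-2.09
197903,5.75,0.81,11.71
197904,0.05,0.8,3.27
"""

# With header=true, readdlm returns a tuple: (data matrix, 1 x n header matrix)
(data, header) = readdlm(IOBuffer(csv), ',', header=true)

println(size(data))              # (4, 4)
println(vec(header))             # the four column names
```

Destructuring the tuple directly avoids the `xx[1]` and `xx[2]` indexing used in the cell above.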



## Creating Variables

In [4]:
ym  = x[:,1]                    #yearmonth, like 200712
Rme = x[:,2]                    #picking out the second column
Rf  = x[:,3]                   
R   = x[:,4] +                  # + 0.0 to illustrate that commands continue on the next line
      0.0                     
Re  = R - Rf                    #do R .- Rf if R has several columns (and Rf only one)

println("first 4 obs of Rme and Re")
printmat([Rme[1:4,:] Re[1:4,:]])

first 4 obs of Rme and Re
     4.180    10.190
    -3.410    -2.820
     5.750    10.900
     0.050     2.470
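
The comment about `R .- Rf` can be illustrated with a small sketch (written for Julia 1.x). The matrix `Rmat` is hypothetical: its first column holds the first two SmallGrowth returns from the file and its second column the market return (Mkt-RF plus RF) for the same months.

```julia
# Hypothetical 2 x 2 matrix of returns (two assets) and a vector of riskfree rates
Rmat = [10.96  4.95;
        -2.09 -2.68]
Rfv  = [0.77, 0.73]

# The dot broadcasts the single column Rfv across both columns of Rmat
Remat = Rmat .- Rfv

println(Remat)
```

The first column reproduces the first two values of Re printed above, and the second column gives back the first two values of Rme.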



# Some Descriptive Statistics

## Means and Standard Deviations

The next few cells estimate means, standard deviations, covariances and correlations of the variables Rme (US equity market excess return) and Re (excess returns for a segment of the market, small growth firms).

In [5]:
μ = mean([Rme Re],1)    #,1 to calculate average along a column, gives a row vector
σ = std([Rme Re],1)     #do \sigma[Tab] to get σ

println("            Rme       Re")
printlnPs("means: ",μ)  #for more stat functions, see the package StatsBase.jl
printlnPs("std:   ",σ)

            Rme       Re
means:      0.602     0.303
std:        4.604     8.572
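
The cell above passes the dimension as a positional argument (`mean(x,1)`), which is the older Julia syntax. As a minimal sketch, assuming Julia 1.x, the same calculation uses the `dims` keyword and needs the `Statistics` standard library. The matrix `z` holds only the first four rows of `[Rme Re]` printed earlier, so the results differ from the full-sample numbers.

```julia
using Statistics                  # standard library; provides mean and std

# First four rows of [Rme Re] as printed earlier in this notebook
z = [ 4.18  10.19;
     -3.41  -2.82;
      5.75  10.90;
      0.05   2.47]

μ4 = mean(z, dims=1)              # 1 x 2 row vector of column means
σ4 = std(z, dims=1)               # 1 x 2 row vector, divides by T-1 by default

println(μ4)
println(σ4)
```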


## Covariances and Correlations

In [6]:
println("\n","cov([Rme Re]): ")          
printmat(cov([Rme Re]))

println("\n","cor([Rme Re]): ")          
printmat(cor([Rme Re]))


cov([Rme Re]): 
    21.197    28.426
    28.426    73.475


cor([Rme Re]): 
     1.000     0.720
     0.720     1.000
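
As a check on the numbers above (assuming the usual sample definitions), the correlation is the covariance divided by the product of the two standard deviations, and the diagonal of the covariance matrix holds the squared standard deviations reported earlier:

$
\rho = \frac{\mathrm{Cov}(R^e_m,R^e)}{\sigma(R^e_m)\,\sigma(R^e)} = \frac{28.426}{\sqrt{21.197}\,\sqrt{73.475}} \approx \frac{28.426}{4.604 \times 8.572} \approx 0.720
$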



## OLS

A simple linear regression is
$
y = x b + u
$,
where $x=[1,R^e_m]$ and $b$ is a $2 \times 1$ vector of coefficients.

The first element of $b$ is the intercept and the second element is the slope coefficient.

The GLM package (not used here) has powerful regression methods. See https://github.com/JuliaStats/GLM.jl.
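
Before running the regression on the data, the two ways of computing OLS can be tried on a case with a known answer. This is a minimal sketch, assuming Julia 1.x. The data are made up so that y equals 2 + 3*x1 exactly, and all names (`x1`, `X`, `yh`, `bA`, `bB`) are hypothetical.

```julia
using LinearAlgebra, Statistics  # standard libraries: inv and var

# Hypothetical data with a known answer: yh = 2 + 3*x1 exactly
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X  = [ones(length(x1)) x1]       # T x 2 regressor matrix: constant and x1
yh = 2 .+ 3 .* x1

bA = inv(X' * X) * (X' * yh)     # textbook formula (X'X)^(-1) X'y
bB = X \ yh                      # least squares via the backslash operator

res = yh - X*bB                  # residuals, numerically zero in this case
R2h = 1 - var(res)/var(yh)       # so R2 is (numerically) 1 here

println([bA bB])                 # both columns: intercept 2, slope 3
```

Both columns agree. The backslash form avoids forming and inverting X'X explicitly, which is why it is usually preferred.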

In [7]:
c   = ones(size(Rme,1))         #a vector with ones, no. rows from variable, here ones(Rme) works too
x   = [c Rme]                   #x is a Tx2 matrix
y   = copy(Re)                  #to get standard OLS notation, copy to get an independent copy

b   = inv(x'x)*x'y              #OLS according to a textbook
b2  = x\y                       #also OLS, quicker and numerically more stable
u   = y - x*b                   #OLS residuals
R2a = 1 - var(u)/var(y)         #R2, called R2a since the name R2 is already taken

println("OLS coefficients, regressing Re on constant and Rme, different calculations")
printmat([b b2])                
printlnPs("R2: ",R2a) 
printlnPs("no. of observations: ",size(Re,1))

OLS coefficients, regressing Re on constant and Rme, different calculations
    -0.504    -0.504
     1.341     1.341

R2:      0.519
no. of observations:        388
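
With a constant and a single regressor, the OLS results can be cross-checked against the moments reported earlier in this notebook (small differences are due to rounding in the printed values):

$
\text{slope} = \frac{\mathrm{Cov}(R^e_m,R^e)}{\mathrm{Var}(R^e_m)} = \frac{28.426}{21.197} \approx 1.341
$

$
\text{intercept} = \mathrm{mean}(R^e) - \text{slope} \times \mathrm{mean}(R^e_m) = 0.303 - 1.341 \times 0.602 \approx -0.504
$

$
R^2 = \frac{\mathrm{Cov}(R^e_m,R^e)^2}{\mathrm{Var}(R^e_m)\,\mathrm{Var}(R^e)} = \frac{28.426^2}{21.197 \times 73.475} \approx 0.519
$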


# Drawing Random Numbers and Finding Critical Values

## Random Numbers: Independent Variables

In [8]:
T = 100
x = randn(T,2)    #T x 2 matrix, N(0,1) distribution

println("\n","mean and std of random draws: ")
mu    = mean(x,1)                 
sigma = std(x,1)
printmat([mu;sigma])

println("covariance and correlation matrices:")
printmat(cov(x))
printmat(cor(x))


mean and std of random draws: 
     0.069     0.117
     1.047     1.082

covariance and correlation matrices:
     1.095     0.133
     0.133     1.170

     1.000     0.117
     0.117     1.000



## Random Numbers: Correlated Variables

In [9]:
μ = [-1;10]
Σ = [1 0.5;
     0.5 2]

d = MvNormal(μ,Σ)       

T = 100
x = rand(d,T)'          #T x 2 matrix

println("\n","mean and std of random draws: ")
mu    = mean(x,1)                 
sigma = std(x,1)
printmat([mu;sigma])

println("covariance and correlation matrices:")
printmat(cov(x))
printmat(cor(x))


mean and std of random draws: 
    -0.926    10.099
     0.986     1.395

covariance and correlation matrices:
     0.972     0.567
     0.567     1.945

     1.000     0.412
     0.412     1.000
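
Correlated normal draws can also be built from independent N(0,1) draws and a Cholesky factor of Σ. The following is a minimal sketch, assuming Julia 1.x, of that textbook construction. It is not a description of how `MvNormal` works internally, and the names `μc`, `Σc`, `Lc`, `Tc`, `zc` and `xc` are hypothetical.

```julia
using LinearAlgebra, Random, Statistics   # standard libraries only

Random.seed!(123)                # fixed seed so the sketch is reproducible

μc = [-1.0, 10.0]
Σc = [1.0 0.5;
      0.5 2.0]

Lc = cholesky(Σc).L              # lower triangular factor, Lc*Lc' = Σc
Tc = 100_000                     # many draws, so sample moments are close to μc and Σc
zc = randn(Tc, 2)                # independent N(0,1) draws
xc = μc' .+ zc * Lc'             # each row is a draw from N(μc, Σc)

println(mean(xc, dims=1))        # close to μc'
println(cov(xc))                 # close to Σc
```

The correlation implied by Σ is 0.5/sqrt(1*2), roughly 0.354. The sample correlation of 0.412 printed above differs from it because that cell uses only T = 100 draws.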



## Quantiles ("critical values") of Distributions

In [10]:
N05     = quantile(Normal(0,1),0.05)            #from the Distributions package
Chisq05 = quantile(Chisq(5),0.95)

println("\n","5th percentile of N(0,1) and 95th of Chisquare(5)")      #lots of statistics functions
printmat([N05 Chisq05])


5th percentile of N(0,1) and 95th of Chisquare(5)
    -1.645    11.070
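
In terms of probabilities, the two numbers above are the values $q$ that solve $\Pr(X \le q) = p$ for the given distribution and probability $p$:

$
\Pr(z \le -1.645) = 0.05 \text{ for } z \sim N(0,1), \qquad \Pr(w \le 11.070) = 0.95 \text{ for } w \sim \chi^2_5
$

Since N(0,1) is symmetric around zero, the 95th percentile of N(0,1) is 1.645.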



# Statistical Plots

In [11]:
YearFrac = floor.(ym/100) + (mod.(ym,100)-1)/12    #year + (month-1)/12, simple but works

plot3a = plot(YearFrac,Rme,color=:blue,legend=false)
plot!(xlims=(1978,2012),ylims=(-25,25))
title!("Time series plot: monthly US equity market excess return")
ylabel!("%")
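
The conversion from yearmonth to a fractional year can be checked on a few values. This is a minimal sketch, assuming Julia 1.x (where the scalar operations need explicit dots), with hypothetical names `ymh`, `yearh`, `monthh` and `YearFrach`.

```julia
# yearmonth values in the format of the date column, like 197901
ymh = [197901.0, 197904.0, 197912.0, 198007.0]

yearh  = floor.(ymh ./ 100)                 # 1979, 1979, 1979, 1980
monthh = mod.(ymh, 100)                     # 1, 4, 12, 7
YearFrach = yearh .+ (monthh .- 1) ./ 12    # January maps to .0, July to .5

println(YearFrach)
```

January 1979 becomes 1979.0, April 1979 becomes 1979.25 and July 1980 becomes 1980.5, so each month is placed at its start.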

In [12]:
plot3b = scatter(Rme,Re,color=:blue,legend=false)
plot!([-40;60],[-40;60],color=:black)
plot!(xlims=(-40,40),ylims=(-40,60))
title!("Scatter plot: two monthly return series (and 45 degree line)")
xlabel!("Market excess return, %")
ylabel!("Excess returns on small growth stocks, %")

In [13]:
xGrid = -25:0.1:15
pdfX   = pdf(Normal(mean(Rme),std(Rme)),xGrid) #the N(μ,σ) pdf
                                        #"Distributions" wants σ, not σ^2

histogram(Rme,bins = -25:1:15,normalized=true,label="histogram")
plot!(xGrid,pdfX,linewidth=3,label="fitted N()")
title!("Histogram: monthly US equity market excess return")
xlabel!("Market excess return, %")
ylabel!("Density")
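
The fitted pdf can share an axis with the histogram only if the bars are scaled to have a total area of one. The following is a minimal sketch of that scaling, assuming that this is what `normalized=true` does in Plots, and of the N(μ,σ) pdf written out by hand. The sample `obs` and the helper `normpdf` are hypothetical.

```julia
# Hypothetical small sample and bin edges of width 1
obs   = [-1.5, -0.5, -0.2, 0.3, 0.4, 0.6, 1.2, 1.8]
edges = -2:1:2
width = step(edges)

# Count the observations in each bin [edge_i, edge_{i+1})
counts = [count(v -> edges[i] <= v < edges[i+1], obs) for i in 1:length(edges)-1]

# Density heights: share of observations per unit of the x-axis
density = counts ./ (length(obs) * width)

# N(μ,σ) pdf; the second parameter is the standard deviation σ, not σ^2
normpdf(v, μ, σ) = exp(-0.5*((v - μ)/σ)^2) / (σ*sqrt(2π))

println(counts)                  # [1, 2, 3, 2]
println(sum(density .* width))   # the bar areas sum to 1.0
```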

In [16]:
println("\n","end of program")


end of program
