# Doing statistics/econometrics in Julia 1.0

* author: Tianhao Zhao
* date: April 4, 2019
* copyright: free

-------------------

## Table of Contents
1. [Introduction](#introduction)
1. [What Packages Are Used?](#WhatPkgUsed)
1. [Descriptive Statistics](#descriptivestats)
1. [Distributions](#distributions)
1. [Sampling & Simulations](#sampling)
1. [Hypothesis Testing](#hypothesistesting)
1. [Linear Regression & GLM](#linearregression)
1. [Clustering, Classification, PCA/Factor and Others](#otherstatmodels)
    1. [Clustering](#clustering)
    1. [Classification](#classification)
    1. [PCA/Factor Analysis](#pcafactor)
    1. [Mixture Models](#mixturemodels)
1. [Econometrics Topics](#econometrictopics)
    1. [Time Series](#timeseries)
    1. [Panel Data Models](#panelmodel)
    1. [Quantile Models](#quantilemodels)
    1. [Structural Models](#structuralmodels)
    1. [Filters](#filters)
    1. [MCMC](#mcmc)
    1. [Bayesian Econometrics](#beyasianeconometrics)
1. [Develop Your Own Statistical Models in a Standard Workflow](#devyourownstatmodels)


## 1. Introduction <a name="introduction"></a>

In another [blog](190421_PrepareYourJuliaPkg.html), which explains how to prepare Julia 1.0 for economic research, I briefly discussed statistics and econometrics in Julia in the concluding section.
Although Julia still has a long way to go in econometrics, it can already handle a wide range of statistical tasks.
This blog discusses how to do basic statistics in Julia 1.0.
It relies on the `JuliaStats` project and its packages. This project was initiated by the Julia developers and forms the base of statistics in the Julia language.

In Section [2](#WhatPkgUsed), I provide a full list of the packages under the `JuliaStats` project, plus some other packages. We will use these packages throughout this blog.
In Section [3](#descriptivestats), I introduce how to do descriptive statistics in Julia, e.g. histograms and QQ-plots.
In Section [4](#distributions), I introduce how to work with the many statistical distributions available in Julia.
In Section [5](#sampling), I introduce how to sample from distributions or datasets.
In Section [6](#hypothesistesting), I introduce how to do hypothesis tests, e.g. two-sample F tests.
In Section [7](#linearregression), I introduce how to do basic regression analysis with the `MultivariateStats` and `GLM` packages.
In Section [8](#otherstatmodels), I introduce how to do other common statistical analyses such as clustering and PCA.
In Section [9](#econometrictopics), I discuss which functions are still required but not yet provided for more specific econometric research, e.g. quantile regression and panel data models.

<font color = red><b>This blog will be updated over time. It was not finished when published.</b></font>

## 2. What Packages Are Used? <a name="WhatPkgUsed"></a>

In this section, I list the packages used in the following sections. For how to install Julia packages, please read my other [blog](190421_PrepareYourJuliaPkg.html).
Please note that we do not call R/Python APIs via `RCall`/`PyCall` (except for plotting with `PyPlot`), since this blog is about statistics in Julia itself.
This blog also cannot cover every function of each package. Readers may want to read each package's documentation, which can be found by searching for the package name on GitHub.
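
For readers who want to follow along right away, the snippet below is a minimal sketch of installing a few of the packages with Julia's built-in package manager. The selection in `pkgs` is only an illustrative assumption (pick whatever you need from the table later in this section), and the call requires network access to the default package registry.

```julia
import Pkg  # Julia's built-in package manager (standard library)

# Illustrative selection only; extend or trim this list as needed.
pkgs = ["StatsBase", "Distributions", "DataFrames", "GLM", "HypothesisTests"]

# Download and install the packages into the active environment
# (requires network access).
Pkg.add(pkgs)

# Print the packages installed in the active environment.
Pkg.status()
```

After installation, each package is loaded with `using`, e.g. `using StatsBase`.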

`JuliaStats` is a project initiated by the Julia developers. Its website is: [JuliaStats.org](https://juliastats.github.io/).
This project aims to make Julia powerful in statistics (well ... not specifically in econometrics).
It has become the foundation of many ML/DL packages in Julia.
We focus on this project because `JuliaStats` plays a role similar to `SciPy` for Python or `stats` for R.
It is expected to be the foundation of statistics in Julia.

There are currently 15 packages under the `JuliaStats` project. The following table is adapted from the [documentation page](https://juliastats.github.io/) of `JuliaStats`:

|Package|Task|Mainly for|
|----|----|-----|
|StatsBase|Basic functionalities for statistics| Descriptive statistics; sampling; ranking; weights; correlation/auto-correlation |
|StatsModels|Interfaces for statistical models| R-style `@formula`; abstractions for statistical model development |
|DataFrames|Essential tools for tabular data| Data structure and operations for regression datasets |
|Distributions| Probability distributions | A large number of univariate/multivariate distributions; descriptive stats; moments/pdf/cdf/mgf, sampling, MLE |
|MultivariateStats|Multivariate statistical analysis | Matrix-based API for linear (e.g. LS, Lasso, Ridge) models, dimensionality reduction, scaling, linear discriminant analysis|
|HypothesisTests|Hypothesis tests| Parametric and Nonparametric tests|
|MLBase| Swiss Army knife for machine learning| Data preprocessing; classification; performance evaluation; model selection; cross-validation |
|Distances| Various distances between vectors |  |
|KernelDensity| Kernel density estimation| For univariate/multivariate/bivariate data; user customization of interpolation points/kernel/bandwidth |
|Clustering| Algorithms for data clustering| K-means/medoids; Affinity propagation; Performance evaluation etc.|
|<font color=red>GLM</font>| Generalized linear models| R-style API for LM/GLM |
|NMF| Nonnegative matrix factorization| A variety of NMF algorithms; NNDSVD |
|RegERMs| Lasso/Elastic Net linear and generalized linear models | `glmnet` coordinate descent algorithm; polynomial trend filtering; O(n) fused Lasso; Gamma Lasso |
|Klara| Markov Chain Monte Carlo (MCMC) |  Engine for Bayesian inference; samplers with latest techniques; ability to suspend and resume |
|TimeSeries| Time series analysis | Tools to represent, manipulate, and apply computation to time series data|
