# <span style="color:#2c061f"> Macro 318: Tutorial #2 </span>  

<br>

## <span style="color:#374045"> Data, Stats and Math with Julia </span>


#### <span style="color:#374045"> Lecturer: </span> <span style="color:#d89216"> <br> Dawie van Lill (dvanlill@sun.ac.za) </span>

# Introduction

In this tutorial we will start our discussion with how to work with data in Julia. We will then cover some basic statistics and in the last section move on to some fundamental ideas in mathematics (mostly related to calculus).  

Please note that working with data in Julia is going to be different from working with data in Stata. I am just showing basic principles here so that you are aware of them. You do not need to memorise everything in this notebook. It is simply here as a good reference to have if you want to do some useful data work for macroeconomics. 

If you are more comfortable with Stata for working with data then you can continue on that path. I am simply offering an alternative. 

In the job market there are a few languages that are used for data analysis. The most popular ones are Stata, R, Python and Julia. At this stage Julia is not the most popular for data work, but it shares similarities with Python. So if you know Julia well, it will be easy to pick up Python. Julia is more popular for work related to numerical / scientific computation, which we will cover in some of the future tutorials. 

As an aside, you might be wondering why we chose Julia for this course. There are several reasons, but primarily it is because the language is easy to learn and is similar in syntax to Python. It is also blazingly fast!

Why not learn Python then? Well, we considered this, but Julia is just a bit easier to get started with and easier to install for most people. And also, it is super fast! I also believe that it is a language that will be used a lot in economics in the future, with a lot of macroeconomists starting to use it for their modelling purposes. 

If you are interested in Python as an alternative to Julia you can always contact me and I can refer you to some resources. However, for most students it is more important to get the programming principles right without worrying too much about the language that they are using. 

In [12]:
import Pkg

In [21]:
Pkg.add("CategoricalArrays")
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("DataFramesMeta")
Pkg.add("GLM")
Pkg.add("Random")
Pkg.add("RDatasets")
Pkg.add("Statistics")

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
    Updating `~/.julia/environments/v1.7/Project.toml`
  [336ed68f] + CSV v0.9.11
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`
  No Changes to `~/.julia/environments/v1.7/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.7/Project.toml`

In [14]:
using CategoricalArrays
using CSV
using DataFrames
using DataFramesMeta
using GLM
using Random
using RDatasets
using Statistics

# Working with data

The primary package for working with data in Julia is `DataFrames.jl`. For a comprehensive tutorial series on this package I would recommend Bogumił Kamiński's [Introduction to DataFrames](https://github.com/bkamins/Julia-DataFrames-Tutorial).

# DataFrames basics

In this section we discuss basic principles from the DataFrames package. For the first topic we look at how to construct and access DataFrames. The fundamental object that we care about is the `DataFrame`. This is similar to a `dataframe` that you would find in R or in Pandas (Python).

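DataFrames are essentially tables, with the rows being observations and the columns indicating the variables. Unlike a matrix, each column of a DataFrame can hold a different type of data (numbers, strings, and so on). 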

## Constructors

The easiest thing to construct is an empty DataFrame. 

In [16]:
DataFrame() # empty DataFrame

You could also construct a DataFrame using keyword arguments, where each keyword becomes a column name. Notice the different types of the different columns. 

In [20]:
DataFrame(A = 2:5, B = randn(4), C = "Hello")

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |
| 1 | 2 | -0.653416 | Hello |
| 2 | 3 | -0.519066 | Hello |
| 3 | 4 | -2.09998 | Hello |
| 4 | 5 | 0.909537 | Hello |
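To make the point about column types concrete, here is a minimal sketch that builds the same kind of DataFrame and then asks Julia for the element type of each column. The variable name `df_kw` is just an illustrative choice.

```julia
using DataFrames

# The range 2:5 and randn(4) each have four elements;
# the single string "Hello" is repeated to fill all four rows.
df_kw = DataFrame(A = 2:5, B = randn(4), C = "Hello")

col_types = eltype.(eachcol(df_kw))   # element type of each column
dims = size(df_kw)                    # (number of rows, number of columns)
```

Here `col_types` should contain `Int64`, `Float64` and `String`, matching the second header row in the output above.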


One of the most common ways to use constructors is through arrays. 

In [41]:
commodities = ["crude", "gas", "gold", "silver"]
last_price = [4.2, 11.3, 12.1, missing] # notice that the last value is missing

df = DataFrame(commod = commodities, price = last_price) # give names to columns

| Row | commod | price |
|---|---|---|
| | String | Float64? |
| 1 | crude | 4.2 |
| 2 | gas | 11.3 |
| 3 | gold | 12.1 |
| 4 | silver | missing |
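The `Float64?` in the output is shorthand for a column that can hold either a `Float64` or `missing`. The sketch below shows one common way of dealing with such a column, assuming you simply want to ignore the missing entries. The names `mean_all`, `mean_obs` and `complete` are illustrative.

```julia
using DataFrames
using Statistics

# Same data as above: the price of silver is unknown
df_prices = DataFrame(commod = ["crude", "gas", "gold", "silver"],
                      price = [4.2, 11.3, 12.1, missing])

# Any calculation that touches a missing value returns missing
mean_all = mean(df_prices.price)

# skipmissing ignores the missing entries, so only three prices are averaged
mean_obs = mean(skipmissing(df_prices.price))

# dropmissing returns a new DataFrame without the rows that contain missing values
complete = dropmissing(df_prices)
```

Whether dropping or skipping missing values is appropriate depends on the data, so treat this as a starting point rather than a rule.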


One can also easily add a new row to an existing `DataFrame` using the `push!` function. This is equivalent to adding new observations to the variables. 

In [43]:
new_row = (commod = "nickel", price = 5.1)
push!(df, new_row)

| Row | commod | price |
|---|---|---|
| | String | Float64? |
| 1 | crude | 4.2 |
| 2 | gas | 11.3 |
| 3 | gold | 12.1 |
| 4 | silver | missing |
| 5 | nickel | 5.1 |
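The `!` at the end of `push!` is a Julia naming convention for functions that modify their argument. A small sketch, using a shortened version of the commodities data, shows that the DataFrame itself changes:

```julia
using DataFrames

df_small = DataFrame(commod = ["crude", "gas"], price = [4.2, 11.3])

n_before = nrow(df_small)
push!(df_small, (commod = "nickel", price = 5.1))   # modifies df_small in place
n_after = nrow(df_small)                             # one more row than before
```

Because `push!` changes the DataFrame in place, running the same notebook cell a second time appends the same row again, so it is worth checking `nrow` if the output looks longer than expected.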


One could also use array comprehensions to generate values for the DataFrame:

In [24]:
DataFrame([rand(3) for i in 1:3], [:x1, :x2, :x3]) # see how we named the columns

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.905493 | 0.453158 | 0.743097 |
| 2 | 0.683788 | 0.19378 | 0.511231 |
| 3 | 0.53654 | 0.230837 | 0.00504412 |


You can also create a DataFrame from a matrix:

In [27]:
x = DataFrame(rand(3, 3), :auto) # automatically assign column names

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.72952 | 0.353773 | 0.669856 |
| 2 | 0.703105 | 0.151893 | 0.567652 |
| 3 | 0.955099 | 0.695527 | 0.109053 |


Incidentally, you can convert the DataFrame into a matrix or array if you wish:

In [28]:
Matrix(x)

3×3 Matrix{Float64}:
 0.72952   0.353773  0.669856
 0.703105  0.151893  0.567652
 0.955099  0.695527  0.109053
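As a quick check that nothing is lost in the conversion, the sketch below goes from a matrix to a DataFrame and back again. The small matrix `m` is made up for illustration.

```julia
using DataFrames

m = [1.0 2.0; 3.0 4.0]          # a 2×2 matrix
x_df = DataFrame(m, :auto)      # columns are automatically named x1, x2
col_names = names(x_df)         # column names as strings
back = Matrix(x_df)             # convert back to a matrix

same = back == m                # the round trip returns the original values
```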

In the next section we talk about accessing the elements of a DataFrame as well as looking at some basic information about the DataFrame that we have on hand.  

## Accessing data

Once we have our data set up in a DataFrame, we are often going to want to know some basic things about the contents. Let us construct a relatively large DataFrame. Most of the time we will be working with large datasets in economics, with thousands of rows and columns. You might be used to working with data in Excel, so things might feel foreign right now. However, I promise that once you start working with data in a programming language such as R, Julia or Python, your productivity will greatly increase. You only need to get over that initial apprehension about learning something new. 

In [38]:
y = DataFrame(rand(1:10, 1000, 10), :auto);

We can get some basic summary statistics on the data in the DataFrame using the `describe` function. 

In [34]:
describe(y)

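| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Int64 | Float64 | Int64 | Int64 | DataType |
| 1 | x1 | 5.354 | 1 | 5.0 | 10 | 0 | Int64 |
| 2 | x2 | 5.45 | 1 | 5.0 | 10 | 0 | Int64 |
| 3 | x3 | 5.451 | 1 | 5.0 | 10 | 0 | Int64 |
| 4 | x4 | 5.653 | 1 | 6.0 | 10 | 0 | Int64 |
| 5 | x5 | 5.457 | 1 | 5.0 | 10 | 0 | Int64 |
| 6 | x6 | 5.629 | 1 | 6.0 | 10 | 0 | Int64 |
| 7 | x7 | 5.538 | 1 | 5.0 | 10 | 0 | Int64 |
| 8 | x8 | 5.547 | 1 | 6.0 | 10 | 0 | Int64 |
| 9 | x9 | 5.419 | 1 | 5.0 | 10 | 0 | Int64 |
| 10 | x10 | 5.455 | 1 | 6.0 | 10 | 0 | Int64 |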
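If you only care about a few statistics or a few columns, `describe` can be told which ones to report. The sketch below asks for the mean and standard deviation of the first two columns only. The seed value is arbitrary and is there so the random numbers are reproducible.

```julia
using DataFrames
using Random

Random.seed!(318)                                   # arbitrary seed for reproducibility
y_demo = DataFrame(rand(1:10, 1000, 10), :auto)

# Report only the mean and standard deviation, and only for columns x1 and x2
stats_tbl = describe(y_demo, :mean, :std; cols = [:x1, :x2])
```

The result is itself a DataFrame, so you can work with it like any other table.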


If we want to take a peek at the first few rows of the data we can use the `first` function. 

In [37]:
first(y, 5) # first 5 rows

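| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 8 | 2 | 3 | 4 | 1 | 8 | 8 | 9 | 4 | 1 |
| 2 | 5 | 9 | 7 | 1 | 3 | 10 | 7 | 9 | 6 | 2 |
| 3 | 8 | 2 | 5 | 5 | 10 | 6 | 8 | 4 | 6 | 7 |
| 4 | 3 | 6 | 9 | 4 | 8 | 7 | 1 | 1 | 6 | 10 |
| 5 | 6 | 5 | 2 | 8 | 4 | 2 | 2 | 7 | 6 | 9 |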
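A few other functions are handy for getting a feel for a DataFrame before you start working with it. This is a short sketch of the ones I would reach for first, using a DataFrame of the same shape as the one above.

```julia
using DataFrames

y_info = DataFrame(rand(1:10, 1000, 10), :auto)

dims = size(y_info)          # (rows, columns)
n_rows = nrow(y_info)        # number of observations
n_cols = ncol(y_info)        # number of variables
col_names = names(y_info)    # column names as strings
tail = last(y_info, 3)       # last 3 rows, the counterpart of first
```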


There are multiple ways to access particular columns of the DataFrame that we have created. The most obvious way is to use `y.col`, where `col` stands for the column name. This returns the column as a vector. 

In [40]:
y.x2;

Another way to access the column is to index with `!`, which returns the stored column itself without copying it (using `:` in place of `!` returns a copy):

In [46]:
y[!, :x2]; # or y[!, 2]
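The difference between indexing with `!` and with `:` matters once you start changing values. The sketch below uses a tiny made-up DataFrame to show that `!` gives you the stored column, while `:` gives you an independent copy.

```julia
using DataFrames

y_small = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6])

no_copy = y_small[!, :x2]    # the stored column itself
copied  = y_small[:, :x2]    # a fresh copy of the column

is_same_object = no_copy === y_small.x2   # true: same vector in memory
is_copy_object = copied === y_small.x2    # false: a different vector

copied[1] = 100     # only changes the copy, y_small is untouched
no_copy[1] = -1     # changes the column inside y_small as well
```

After running this, the first entry of `y_small.x2` is `-1`, not `100`. Using `:` is the safer choice when you do not want to modify the original data by accident.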

# Importing data

# Structuring data

# Data analysis

# Statistics

Statistics and data are inextricably linked. If you are working with data you MUST have some type of background in statistics. In this section we will cover some of the relevant topics for macroeconomists. You would have already covered basic ideas in econometrics and statistics with Marisa, so this is the natural evolution from that section. 
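As a small taste of what is to come, here is a minimal sketch of some descriptive statistics using the `Statistics` package loaded earlier. The data are simulated, and the relationship between `x_sim` and `y_sim` (a slope of 2 plus noise) is purely an assumption made for illustration.

```julia
using Statistics
using Random

Random.seed!(318)                        # arbitrary seed for reproducibility

x_sim = randn(500)                       # 500 draws from a standard normal
y_sim = 2 .* x_sim .+ randn(500)         # hypothetical variable related to x_sim

x_mean = mean(x_sim)                     # sample mean
x_var  = var(x_sim)                      # sample variance
x_std  = std(x_sim)                      # sample standard deviation
xy_cor = cor(x_sim, y_sim)               # correlation between the two series
```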

# Math fundamentals

Mathematics is so much easier when we get to use a computer. In this section I will introduce some of the basic mathematical theory that you need as a macroeconomist and then show you how that relates to programming. 
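To give one example of how calculus and programming connect, the sketch below approximates a derivative numerically using a central difference, $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$. The function `f`, the helper `central_diff` and the step size `h` are all illustrative choices and not something you need to memorise.

```julia
# Example function: f(x) = x^2, whose exact derivative is 2x
f(x) = x^2

# Central difference approximation to the derivative of g at the point x
central_diff(g, x; h = 1e-5) = (g(x + h) - g(x - h)) / (2h)

approx_deriv = central_diff(f, 3.0)   # numerical approximation at x = 3
exact_deriv  = 2 * 3.0                # analytical derivative at x = 3
```

The two values should agree to many decimal places, which is a useful sanity check whenever you derive something by hand.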