## <center><font color=navy>Big Data Economics</font></center>
### <center>An Overview</center>
#### <center>Ali Habibnia</center>

    
<center> Assistant Professor, Department of Economics, </center>
<center> and Division of Computational Modeling & Data Analytics at Virginia Tech</center>
<center> habibnia@vt.edu </center> 

 
<img src="images/tech3.jpg" alt="Drawing" width="350"/>

#### The image exemplifies the intersection and collaborative synergy among three pivotal technological domains:

1. **Artificial Intelligence and Machine Learning**: AI and ML can leverage Big Alternative Data for developing more sophisticated models and algorithms, which can lead to more accurate predictions and better decision-making.

2. **Big Data**: provides the raw information that allows AI models to make informed decisions. This data can be processed and analyzed using machine learning algorithms to uncover insights that were previously not possible with traditional data sources, enhancing research and development across various fields.

3. **High-Performance Computing (HPC)**: HPC provides the necessary computational power to handle the vast amounts of data and complex calculations required by AI and ML, thus speeding up research and enabling more complex simulations. 

The integration of Big Data, AI, and HPC creates a powerful ecosystem for advanced analytics, enabling the tackling of intricate problems and the extraction of profound insights. This triad fosters the capability for real-time analysis and decision-making, revolutionizing sectors such as finance, healthcare, and transportation by improving efficiency and outcomes.

* #### The concept of Big Data is often associated with the 10 V's:

<img src="images/bigdatavs.jpg" alt="Drawing" width="500"/>


1. **Visualization**: Represents the importance of presenting data in a manner that is easily and immediately understandable.

2. **Velocity**: Refers to the speed at which data is generated, processed, and analyzed.

3. **Variety**: Indicates the different types of data (structured, unstructured, and semi-structured) that are available for analysis.

4. **Variability**: Suggests that data flows can be highly inconsistent with periodic peaks.

5. **Volume**: Points to the vast amounts of data generated from various sources.

6. **Vulnerability**: Highlights the security concerns and risks associated with managing and storing large quantities of data.

7. **Validity**: Concerns the accuracy and correctness of data for the intended use.

8. **Volatility**: Describes how long data is valid and how quickly it becomes outdated.

9. **Veracity**: Addresses the quality and reliability of data.

10. **Value**: Emphasizes the worth of the data being collected and how it can be turned into a valuable resource.

## What is Big Data?

<img src="images/model.png" alt="Drawing"/>

Big Data is a term that varies in definition depending on the context and the person you're asking. In the field of econometrics, the concept of Big Data can be framed with respect to the dimensions of the dataset, namely the number of variables and observations.

* **Wild data** (unstructured, in contrast with Census surveys; e.g., Twitter)

* **Wide data** (a.k.a. Large-$p$ data because $p \gg N$)

* **Long data** (a.k.a. Large-$N$ data because $N$ is very large and may not even fit onto a single hard drive)

* **Complex model** (a.k.a. Large-Theta because the model/algorithm has many parameters)

<img src="images/mlp.png" alt="Drawing" width="500"/>


## Pillars of Big Data

* Foundations (basic calculus, linear algebra, probability analysis, and numerical optimization)

* Programming (for automation of data collection, manipulation, cleaning, visualization, and modeling)

* Visualization & exploration

* Machine learning (to capture nonlinearity and non-normality in data, to compress data, and to predict)

* Causal inference (to be able to make policy prescription)


## Impact of the Curse of Dimensionality on Regression Analysis

Before delving into the specific impacts, let's establish the general equations and notations used in regression analysis:

### Linear Regression: Basic Equations and Matrix Formulation

Linear regression is one of the foundational models in statistics, econometrics, and machine learning. It provides a simple yet powerful framework for modeling the relationship between a dependent variable and one or more explanatory variables.

#### 1. Scalar and Componentwise Representation

Consider a dataset with $n$ observations and $k$ explanatory variables. For observation $i$, the linear regression model is written as

<br>
<center> $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$ </center>
<br>
where

* $y_i$ is the dependent variable for observation $i$
* $x_{ij}$ denotes the $j$-th explanatory variable for observation $i$
* $\beta_0$ is the intercept
* $\beta_j$, for $j = 1, \ldots, k$, are the slope coefficients
* $\varepsilon_i$ is an unobserved error term capturing noise and omitted factors

The objective of linear regression is to estimate the parameter vector $(\beta_0, \beta_1, \ldots, \beta_k)$ such that the fitted values approximate the observed outcomes as closely as possible.

#### 2. Vector and Matrix Formulation

Stacking all $n$ observations together yields the compact matrix representation

$$
\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon}
$$

where $p = k + 1$ counts the intercept plus the $k$ slope coefficients, and

$$
\mathbf{y} \in \mathbb{R}^n,
\qquad
\mathbf{X} \in \mathbb{R}^{n \times p},
\qquad
\beta \in \mathbb{R}^p,
\qquad
\boldsymbol{\varepsilon} \in \mathbb{R}^n
$$

$$
\mathbf{y} =
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{pmatrix},
\quad
\mathbf{X} =
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1k} \\
1 & x_{21} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nk}
\end{pmatrix},
\quad
\beta =
\begin{pmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_k
\end{pmatrix},
\quad
\boldsymbol{\varepsilon} =
\begin{pmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_n
\end{pmatrix}
$$

A standard econometric formulation imposes the following assumptions.

Conditional mean zero (exogeneity):

$$
\mathbb{E}(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \mathbf{0}
$$

Homoskedasticity and no cross-sectional correlation:

$$
\operatorname{Var}(\boldsymbol{\varepsilon} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n
$$

Full column rank (existence and identification):

$$
\operatorname{rank}(\mathbf{X}) = p
$$

For exact finite sample normal theory inference, one may additionally assume i.i.d. Gaussian errors conditional on the regressors:

$$
\varepsilon_i \mid \mathbf{X} \sim \mathcal{N}(0,\sigma^2),
\qquad
\varepsilon_i \;\perp\!\!\!\perp\; \varepsilon_{i'}
\quad
\text{for } i \neq i'
$$

Equivalently, in vector notation:

$$
\boldsymbol{\varepsilon} \mid \mathbf{X}
\sim
\mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)
$$

In most econometric applications, exact normality is not required; the moment conditions above suffice for consistency and asymptotic inference.


## OLS Objective Function, Estimator, and Bias–Variance Properties

The ordinary least squares estimator minimizes the quadratic loss

$$
\hat{\beta}_{OLS}
=
\arg\min_{\beta \in \mathbb{R}^p}
L_n(\beta)
$$

where

$$
L_n(\beta)
=
(\mathbf{y} - \mathbf{X}\beta)^\top (\mathbf{y} - \mathbf{X}\beta)
=
\|\mathbf{y} - \mathbf{X}\beta\|^2
$$

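The first-order condition $\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\beta) = \mathbf{0}$ gives the normal equations $\mathbf{X}^\top \mathbf{X}\beta = \mathbf{X}^\top \mathbf{y}$, which under full column rank yield the closed-form solution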

$$
\hat{\beta}_{OLS}
=
(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}
$$

Define the residual vector

$$
\hat{\boldsymbol{\varepsilon}}
=
\mathbf{y} - \mathbf{X}\hat{\beta}_{OLS}
$$
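The closed-form estimator and the residual vector can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the coefficient values, and all variable names are illustrative assumptions and do not come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed, for reproducibility only
n, k = 200, 3                                # assumed sample size and number of regressors
X_raw = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), X_raw])     # prepend the intercept column: k + 1 columns
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # hypothetical true coefficients
y = X @ beta_true + rng.normal(size=n)

# Normal equations: solve (X'X) b = X'y instead of forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual vector; by the first-order condition it is orthogonal to every column of X
residuals = y - X @ beta_hat
print(beta_hat)
```

Solving the linear system is usually preferred to computing $(\mathbf{X}^\top \mathbf{X})^{-1}$ directly, because it is cheaper and numerically more stable.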

Under exogeneity, the estimator is unbiased:

$$
\text{Bias}(\hat{\beta}_{OLS})
=
\mathbb{E}(\hat{\beta}_{OLS} \mid \mathbf{X}) - \beta
=
\mathbf{0}
$$

The conditional variance of the estimator is

$$
\operatorname{Var}(\hat{\beta}_{OLS} \mid \mathbf{X})
=
(\mathbf{X}^\top \mathbf{X})^{-1}
\mathbf{X}^\top
\operatorname{Var}(\mathbf{y} \mid \mathbf{X})
\mathbf{X}
(\mathbf{X}^\top \mathbf{X})^{-1}
$$

Since

$$
\operatorname{Var}(\mathbf{y} \mid \mathbf{X})
=
\operatorname{Var}(\boldsymbol{\varepsilon} \mid \mathbf{X})
=
\sigma^2 \mathbf{I}_n
$$

we obtain

$$
\operatorname{Var}(\hat{\beta}_{OLS} \mid \mathbf{X})
=
\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}
$$


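The error variance $\sigma^2$ is estimated from the residuals:

$$
\hat{\sigma}^2
=
\frac{\hat{\boldsymbol{\varepsilon}}^\top \hat{\boldsymbol{\varepsilon}}}{n - p}
$$

- Accurate estimation of $\sigma^2$ is crucial for inference, and the formula shows how it depends on $n$ and $p$. Because $p$ already counts the intercept column of $\mathbf{X}$, the degrees-of-freedom correction is $n - p$ (equivalently $n - k - 1$).
- When $p$ is large relative to $n$, the denominator $n - p$ becomes small: few residual degrees of freedom remain, so $\hat{\sigma}^2$ becomes a very noisy estimate of $\sigma^2$.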
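As a rough illustration of how these quantities are computed in practice, the sketch below estimates $\sigma^2$ and the standard errors on simulated data. The data-generating values and variable names are assumptions made for the example. The degrees of freedom are taken as the number of rows minus the number of columns of `X` (intercept column included), which equals $n - k - 1$ in terms of the $k$ slope variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 2                                   # assumed dimensions
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
sigma_true = 2.0                                # hypothetical error standard deviation
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=sigma_true, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

dof = n - X.shape[1]                            # rows minus columns of X (= n - k - 1)
sigma2_hat = resid @ resid / dof                # estimate of sigma^2
cov_beta = sigma2_hat * XtX_inv                 # estimated Var(beta_hat | X)
std_err = np.sqrt(np.diag(cov_beta))            # standard errors of the coefficients
print(sigma2_hat, std_err)
```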

## Effect of Sample Size and Dimensionality on Estimator Variance

The curse of dimensionality significantly impacts regression analysis by increasing the potential for overfitting, causing issues with collinearity, necessitating larger sample sizes for statistical power, and leading to unreliable estimation of the variance of the OLS estimator and error variance $\sigma^2$. Techniques like regularization, dimensionality reduction, and careful model selection are crucial in high-dimensional settings.

Write

$$
\mathbf{X}^\top \mathbf{X}
=
n \left(\frac{\mathbf{X}^\top \mathbf{X}}{n}\right)
$$

Then

$$
\operatorname{Var}(\hat{\beta}_{OLS} \mid \mathbf{X})
=
\frac{\sigma^2}{n}
\left(\frac{\mathbf{X}^\top \mathbf{X}}{n}\right)^{-1}
$$

If $p$ is fixed and

$$
\frac{\mathbf{X}^\top \mathbf{X}}{n}
\to
\mathbf{\Sigma}
\quad \text{with } \mathbf{\Sigma} \text{ positive definite}
$$

the variance shrinks at rate $1/n$.

As $p$ increases, eigenvalues of $(\mathbf{X}^\top \mathbf{X})/n$ may approach zero, inflating the inverse and increasing estimator variance.
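Both effects can be seen in a small simulation. The sketch below assumes independent standard normal regressors (an illustrative choice, not something the notes specify) and reports how the variance of one coefficient changes when $n$ is quadrupled, and how the smallest eigenvalue of $\mathbf{X}^\top \mathbf{X}/n$ behaves as $p$ grows with $n$ fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0                                    # assumed error variance

def var_first_coef(n, p):
    """sigma^2 * [(X'X)^{-1}]_{11} for one simulated Gaussian design."""
    X = rng.normal(size=(n, p))
    return sigma2 * np.linalg.inv(X.T @ X)[0, 0]

def min_eig_scaled_gram(n, p):
    """Smallest eigenvalue of X'X / n for one simulated Gaussian design."""
    X = rng.normal(size=(n, p))
    return np.linalg.eigvalsh(X.T @ X / n)[0]   # eigvalsh returns ascending order

# Quadrupling n with p fixed should cut the variance by roughly a factor of 4
v_small = var_first_coef(1000, 3)
v_large = var_first_coef(4000, 3)
ratio = v_small / v_large

# With n fixed at 200, the smallest eigenvalue drifts toward zero as p grows
min_eigs = {p: min_eig_scaled_gram(200, p) for p in (5, 100, 190)}
print(ratio, min_eigs)
```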


## Large-$p$ Pathologies in OLS

### Non-existence and interpolation

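When

$$
p > n
$$

the matrix $\mathbf{X}^\top \mathbf{X}$ has rank at most $n < p$ and is therefore singular, so the OLS estimator is not uniquely defined: infinitely many coefficient vectors minimize the loss. Whenever $\operatorname{rank}(\mathbf{X}) = n$, which can already happen at $p = n$, the fitted model interpolates the data, achieving zero in-sample loss, typically with poor out-of-sample performance.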
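A minimal sketch of this non-uniqueness, under the assumption of a Gaussian design with more columns than rows and an outcome that is pure noise. The pseudoinverse is used here to pick the minimum-norm solution; that choice is an illustration, not a method prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 50                              # more regressors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                     # pure noise: there is nothing to learn

rank_X = np.linalg.matrix_rank(X)          # equals rank(X'X); at most n, hence below p

# Minimum-norm least-squares solution via the Moore-Penrose pseudoinverse
beta_mn = np.linalg.pinv(X) @ y
in_sample_loss = np.sum((y - X @ beta_mn) ** 2)

# Adding any null-space direction of X leaves the fitted values unchanged
null_vec = np.linalg.svd(X)[2][-1]         # last right singular vector: X @ null_vec is ~0
beta_alt = beta_mn + 10.0 * null_vec
alt_loss = np.sum((y - X @ beta_alt) ** 2)
print(rank_X, in_sample_loss, alt_loss)
```

Both coefficient vectors fit the noise perfectly in sample, which is why a zero training loss carries no information about predictive quality here.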

### Overfitting

The in-sample loss at the OLS solution is

$$
L_n(\hat{\beta}_{OLS})
=
\|\mathbf{y} - \mathbf{X}\hat{\beta}_{OLS}\|^2
$$

Overfitting occurs when this quantity is small while the expected out-of-sample prediction error for a new observation $(x_0, y_0)$,

$$
\mathbb{E}
\left[
(y_0 - x_0^\top \hat{\beta}_{OLS})^2
\mid \mathbf{X}
\right]
$$

is large.

- When $p$ approaches or exceeds $n$, $\mathbf{X}^\top\mathbf{X}$ can become ill-conditioned or non-invertible, leading to a model that fits the training data perfectly but fails to generalize.
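The gap between in-sample and out-of-sample error can be illustrated by fitting nested OLS models with a growing number of regressors. In this sketch the data-generating process (only the first five regressors matter, unit-variance Gaussian noise) and all names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_test, p_max = 100, 5000, 90
beta_true = np.zeros(p_max)
beta_true[:5] = 1.0                          # assumed: only the first 5 regressors matter

X_train = rng.normal(size=(n_train, p_max))
X_test = rng.normal(size=(n_test, p_max))
y_train = X_train @ beta_true + rng.normal(size=n_train)
y_test = X_test @ beta_true + rng.normal(size=n_test)

results = {}
for p in (5, 30, 60, 90):
    # Fit OLS using only the first p columns
    b, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
    train_mse = np.mean((y_train - X_train[:, :p] @ b) ** 2)
    test_mse = np.mean((y_test - X_test[:, :p] @ b) ** 2)
    results[p] = (train_mse, test_mse)
    print(f"p={p:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Because the models are nested, the training error can only fall as columns are added, while the test error is expected to rise sharply as $p$ gets close to the training sample size.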

### Collinearity

Let

$$
\mathbf{X}^\top \mathbf{X}
=
\mathbf{U}\Lambda\mathbf{U}^\top
$$

Then

$$
(\mathbf{X}^\top \mathbf{X})^{-1}
=
\mathbf{U}\Lambda^{-1}\mathbf{U}^\top
$$

Small eigenvalues in $\Lambda$ lead to large entries in $\Lambda^{-1}$, inflating

$$
\operatorname{Var}(\hat{\beta}_{OLS} \mid \mathbf{X})
=
\sigma^2 \mathbf{U}\Lambda^{-1}\mathbf{U}^\top
$$

- Collinearity refers to high linear dependency among predictor variables in the matrix $\mathbf{X}$.
- This leads to instability in $(\mathbf{X}^\top\mathbf{X})^{-1}$, reflected in inflated diagonal entries of $\operatorname{Var}(\hat{\beta}_{OLS} \mid \mathbf{X})$, indicating high variance of the coefficient estimates.
- Collinearity can make it difficult to discern the individual impact of predictors on $\mathbf{y}$.
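The eigenvalue argument can be reproduced with two regressors. The sketch below compares an independent pair with a nearly collinear pair; the noise scale `0.01` and the helper name `coef_variances` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 500, 1.0                              # assumed sample size and error variance
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                     # unrelated to x1
x2_collinear = x1 + 0.01 * rng.normal(size=n)     # nearly a copy of x1

def coef_variances(X):
    """Eigenvalues of X'X and the diagonal of sigma^2 * U diag(1/eig) U'."""
    eigvals, U = np.linalg.eigh(X.T @ X)          # X'X = U diag(eigvals) U'
    cov = sigma2 * (U / eigvals) @ U.T            # divides column j of U by eigvals[j]
    return eigvals, np.diag(cov)

eig_ind, var_ind = coef_variances(np.column_stack([x1, x2_indep]))
eig_col, var_col = coef_variances(np.column_stack([x1, x2_collinear]))

print("smallest eigenvalue:", eig_ind[0], "vs", eig_col[0])
print("Var(beta_1):       ", var_ind[0], "vs", var_col[0])
```

The near-duplicate regressor produces one eigenvalue close to zero, and the coefficient variances grow by orders of magnitude as a result.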

### Unreliable estimation of error variance

The classical estimator of $\sigma^2$ is

$$
\hat{\sigma}^2
=
\frac{\hat{\boldsymbol{\varepsilon}}^\top \hat{\boldsymbol{\varepsilon}}}{n - p}
$$

where $p$ counts all columns of $\mathbf{X}$, including the intercept. As $p$ grows toward $n$, the residual degrees of freedom $n - p$ shrink, causing instability in variance estimation and inference.

- In high-dimensional spaces, $\text{Var}(\hat{\beta}_{OLS}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ can become unreliable due to the increased variance of the estimator.
- When $p$ is close to or greater than $n$, the matrix $(\mathbf{X}^T\mathbf{X})$ is often of low rank and thus non-invertible or poorly conditioned, which significantly affects the reliability of the OLS estimator.
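The instability of the variance estimate can be made visible with a small Monte Carlo experiment. This is a sketch under assumed settings (Gaussian design, unit error variance, 500 replications); it uses the residual sum of squares divided by the number of rows minus the number of columns of `X`.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_reps, sigma = 100, 500, 1.0               # assumed simulation settings

def sigma2_hat_draws(p):
    """Simulate the sampling distribution of RSS / (n - p) for a given p."""
    draws = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.normal(size=(n, p))
        y = X @ np.ones(p) + rng.normal(scale=sigma, size=n)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        draws[r] = resid @ resid / (n - p)
    return draws

low_dim = sigma2_hat_draws(5)                  # p much smaller than n
high_dim = sigma2_hat_draws(90)                # p close to n

print("mean:", low_dim.mean(), high_dim.mean())
print("std: ", low_dim.std(), high_dim.std())
```

Both estimators are centered near the true value, but the spread is much wider when only a handful of residual degrees of freedom remain.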


## Hypothesis Testing, Type I and Type II Errors, and Power

For inference on $\beta_j$, consider

$$
H_0 : \beta_j = \beta_j^0
$$

The test statistic is

$$
T_{n,j}
=
\frac{\hat{\beta}_j - \beta_j^0}
{\sqrt{\hat{\sigma}^2 [(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}}}
$$

Under $H_0$ and fixed $p$,

$$
T_{n,j} \xrightarrow{d} \mathcal{N}(0,1)
$$

The null is rejected when

$$
|T_{n,j}| > z_{1-\alpha/2}
$$

Type I error satisfies

$$
\mathbb{P}(\text{reject } H_0 \mid H_0) \to \alpha
$$

Type II error is

$$
\beta_n
=
\mathbb{P}(\text{fail to reject } H_0 \mid H_1)
$$

Power is

$$
\text{Power}_n = 1 - \beta_n
$$

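or, in terms of probabilities,

$$
\text{Power}_n = \mathbb{P}(\text{reject } H_0 \mid H_1)
$$

which, for a specific alternative $\beta_j = \beta_j^0 + \Delta$ with $\Delta \neq 0$, is

$$
\mathbb{P}(\text{reject } H_0 \mid \beta_j = \beta^0_j + \Delta) 
$$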

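For a fixed alternative

$$
\beta_j = \beta_j^0 + \Delta
$$

the statistic is, for large $n$, approximately distributed as

$$
T_{n,j}
\overset{a}{\sim}
\mathcal{N}
\left(
\frac{\sqrt{n}\Delta}
{\sigma \sqrt{[(\mathbf{X}^\top \mathbf{X}/n)^{-1}]_{jj}}},
1
\right)
$$

Because the mean grows at rate $\sqrt{n}$, the rejection probability tends to one: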

$$
\beta_n \to 0
\quad \text{and} \quad
\text{Power}_n \to 1
$$

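For local alternatives

$$
\beta_j = \beta_j^0 + \frac{h}{\sqrt{n}}
$$

the $\sqrt{n}$ factors cancel and, with $\mathbf{X}^\top \mathbf{X}/n \to \mathbf{\Sigma}$,

$$
T_{n,j}
\xrightarrow{d}
\mathcal{N}
\left(
\frac{h}
{\sigma \sqrt{[\mathbf{\Sigma}^{-1}]_{jj}}},
1
\right)
$$

so, for $h \neq 0$, the power converges to a limit strictly between $\alpha$ and $1$.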
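The two regimes can be compared numerically using the normal approximation for $T_{n,j}$. The sketch below uses only the Python standard library; the function name `approx_power`, the default values, and the choice of setting the relevant diagonal entry `s_jj` of $(\mathbf{X}^\top \mathbf{X}/n)^{-1}$ to one are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def approx_power(delta, n, sigma=1.0, s_jj=1.0, alpha=0.05):
    """Two-sided rejection probability when T ~ N(sqrt(n)*delta/(sigma*sqrt(s_jj)), 1)."""
    z = std_normal.inv_cdf(1 - alpha / 2)
    shift = sqrt(n) * delta / (sigma * sqrt(s_jj))
    # P(T > z) + P(T < -z) for T ~ N(shift, 1)
    return (1 - std_normal.cdf(z - shift)) + std_normal.cdf(-z - shift)

sample_sizes = (100, 1_000, 10_000)

# Fixed alternative: the effect stays at 0.1, so power climbs toward one
fixed_alt = [approx_power(0.1, n) for n in sample_sizes]

# Local alternative: the effect shrinks like h / sqrt(n), so power stays constant
h = 2.0
local_alt = [approx_power(h / sqrt(n), n) for n in sample_sizes]

print(fixed_alt)
print(local_alt)
```

With `delta = 0` the function returns the significance level, which is a quick sanity check on the formula.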

## Large Sample Size, Significance, and Causal Interpretability

- Adequate sample size ($n$) is essential for reliable estimation. However, as $p$ increases, more data is required to maintain statistical power.
- The power of a test is the probability that it correctly rejects a false null hypothesis, i.e. one minus the probability of a Type II error. In high dimensions, the power to detect true effects is diminished due to the increased variance in coefficient estimates.

The significance level controls false rejections:

$$
\mathbb{P}(\text{reject } H_0 \mid H_0) = \alpha
$$

Increasing $n$ does not reduce $\alpha$; it reduces standard errors through

$$
\operatorname{Var}(\hat{\beta}_{OLS} \mid \mathbf{X})
=
\frac{\sigma^2}{n}
\left(\frac{\mathbf{X}^\top \mathbf{X}}{n}\right)^{-1}
$$

Thus, arbitrarily small effects become statistically detectable.
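To get a feel for the magnitudes, the sketch below computes the approximate sample size at which an effect of a given size is detected with 80% power. It relies on the usual normal approximation and ignores the negligible rejection probability in the opposite tail; the function name `required_n` and its defaults (unit error variance, `s_jj = 1`) are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n(delta, sigma=1.0, s_jj=1.0, alpha=0.05, power=0.80):
    """Approximate n for a two-sided z-test to reach the target power.

    Solves sqrt(n) * delta / (sigma * sqrt(s_jj)) = z_{1-alpha/2} + z_{power}.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma * sqrt(s_jj) / delta) ** 2)

# Shrinking the effect by a factor of 10 raises the required n by about 100
sample_sizes = {delta: required_n(delta) for delta in (0.5, 0.05, 0.005)}
print(sample_sizes)
```

The quadratic relationship means that very large datasets will flag effects far too small to matter economically, so statistical significance should be read alongside effect size.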

Causal interpretation, however, depends on identification. If

$$
\mathbb{E}(\boldsymbol{\varepsilon} \mid \mathbf{X}) \neq \mathbf{0}
$$

then $\hat{\beta}_{OLS}$ is inconsistent: it converges to a limit that differs from the causal parameter $\beta$, and this bias does not vanish as $n$ grows.
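An omitted confounder gives a simple illustration. In the sketch below, the confounder `w`, the coefficient values, and the function name are hypothetical; the regression of `y` on `x` alone violates exogeneity because `w` ends up in the error term and is correlated with `x`.

```python
import numpy as np

rng = np.random.default_rng(8)
beta, gamma = 1.0, 1.0                      # hypothetical causal effect and confounder effect

def short_regression_slope(n):
    """OLS slope of y on x alone when the confounder w is omitted."""
    w = rng.normal(size=n)                  # unobserved confounder
    x = w + rng.normal(size=n)              # regressor correlated with w: Var(x) = 2, Cov(x, w) = 1
    y = beta * x + gamma * w + rng.normal(size=n)
    return (x @ y) / (x @ x)                # no-intercept OLS; all variables have mean zero

slopes = {n: short_regression_slope(n) for n in (1_000, 100_000)}
plim = beta + gamma * 1.0 / 2.0             # beta + gamma * Cov(x, w) / Var(x)
print(slopes, plim)
```

The estimate settles ever more tightly around the biased limit rather than around the causal effect, so more data sharpens the wrong answer.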


### Summary Implication

Large $n$ increases detectability but not credibility.  
Large $p$ inflates variance and destabilizes inference.  
Their interaction motivates regularization and modern high-dimensional econometrics.


## Big Data: From Hypothesis Testing to Machine Learning

* Machine learning (ML) allows researchers to analyze data in novel ways. Computers today can process multiple sets of data in little time and, with the correct classification sets, recognize highly complex patterns among them. 

* Designed to simulate the interactions of biological neurons, “deep learning” uses artificial neural networks to discern features in successive layers of data while iterating on previously recognized trends. 


### Econometrics vs. Machine Learning

#### Goal of econometrics

* "The goal of econometrics is to find $\hat{\beta}$," where $\hat{\beta}$ denotes the estimated causal impact of $X$ on $y$.

* The primary statistical concern of econometrics is sampling error. In other words, the goal is to quantify the uncertainty around $\hat{\beta}$ due to randomness in the sampling of the population. 

* The goal is to make counterfactual predictions.

        What would happen to Amazon's profits if it changed its website layout?

We don't get to observe the world under these alternative policies, so we can't simply find the answers in the data. Knowing the counterfactual requires being able to measure a causal effect. Being able to measure a causal effect requires making assumptions. That's what economics is all about!


#### Goal of machine learning

* In contrast, the goal of machine learning is to come up with the best possible out-of-sample prediction. 

* In this sense, the primary concern of machine learning is $\hat{y}$, the predicted outcome.

* The goal is to make sure that the best prediction is made by tuning and validating many different kinds of models. This is what cross-validation is all about, and it is what machine learning practitioners obsess about.
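A bare-bones version of this tuning loop is sketched below: $K$-fold cross-validation is used to choose a ridge penalty from a small grid. Ridge regression, the penalty grid, the simulated data, and the helper name `ridge_fit` are all illustrative assumptions; in practice a library such as scikit-learn would typically handle the splitting and fitting.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, n_folds = 100, 40, 5
X = rng.normal(size=(n, p))
beta_true = np.concatenate([np.ones(5), np.zeros(p - 5)])   # hypothetical sparse truth
y = X @ beta_true + rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Ridge coefficients (X'X + lam * I)^{-1} X'y; lam = 0 reduces to OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Randomly partition the observations into disjoint validation folds
folds = np.array_split(rng.permutation(n), n_folds)
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]

cv_mse = {}
for lam in lambdas:
    fold_errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)      # everything not held out
        b = ridge_fit(X[train_idx], y[train_idx], lam)
        fold_errors.append(np.mean((y[val_idx] - X[val_idx] @ b) ** 2))
    cv_mse[lam] = float(np.mean(fold_errors))

best_lam = min(cv_mse, key=cv_mse.get)      # penalty with the lowest average held-out error
print(cv_mse, best_lam)
```

Each candidate is judged only on observations it never saw during fitting, which is what ties the procedure to out-of-sample prediction rather than in-sample fit.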

## Unlocking New Dimensions: The Pivotal Role of Alternative Data

- **Alternative data**: refers to non-traditional data sources that can provide additional insights beyond what is available through conventional data. Its advantages are high frequency, near-real-time availability, accuracy, and objectivity. Its main disadvantage is that the available indicators are merely proxies for the quantities that policymakers are interested in and need for policy design. Here are some examples of alternative data and variables:

    - **Textual data such as News Headlines**
    - **Digital footprints from social media**
    - **Mobile phone data**
    - **Satellite Imagery**: nighttime light measures, or luminosity, as proxies for economic activity and population distribution
    - **Search Engine Trends**
    - **Environmental, Social, and Governance (ESG) Data**


<img src="images/text.png" alt="Drawing" width="400"/>
<img src="images/satt2.jpg" alt="Drawing" width="400"/>
<img src="images/satt.jpg" alt="Drawing" width="400"/>
<img src="images/satt3.jpg" alt="Drawing" width="400"/>