## <center><font color=navy>Big Data Economics</font></center>
### <center>An Overview</center>
#### <center>Ali Habibnia</center>

    
<center> Assistant Professor, Department of Economics, </center>
<center> and Division of Computational Modeling & Data Analytics at Virginia Tech</center>
<center> habibnia@vt.edu </center> 

 
<img src="images/tech3.jpg" alt="Drawing" width="350"/>

#### The image illustrates the intersection of, and synergy among, three technological domains:

1. **Artificial Intelligence and Machine Learning**: AI and ML can leverage Big Alternative Data for developing more sophisticated models and algorithms, which can lead to more accurate predictions and better decision-making.

2. **Big Data**: Big Data provides the raw information that allows AI models to make informed decisions. Machine learning algorithms can process and analyze this data to uncover insights that traditional data sources could not provide, enhancing research and development across various fields.

3. **High-Performance Computing (HPC)**: HPC provides the necessary computational power to handle the vast amounts of data and complex calculations required by AI and ML, thus speeding up research and enabling more complex simulations. 

The integration of Big Data, AI, and HPC creates a powerful ecosystem for advanced analytics, enabling the tackling of intricate problems and the extraction of profound insights. This triad fosters the capability for real-time analysis and decision-making, revolutionizing sectors such as finance, healthcare, and transportation by improving efficiency and outcomes.

* #### The concept of Big Data is often associated with the 10 V's:

<img src="images/bigdatavs.jpg" alt="Drawing" width="500"/>


1. **Visualization**: Represents the importance of presenting data in a manner that is easily and immediately understandable.

2. **Velocity**: Refers to the speed at which data is generated, processed, and analyzed.

3. **Variety**: Indicates the different types of data (structured, unstructured, and semi-structured) that are available for analysis.

4. **Variability**: Suggests that data flows can be highly inconsistent with periodic peaks.

5. **Volume**: Points to the vast amounts of data generated from various sources.

6. **Vulnerability**: Highlights the security concerns and risks associated with managing and storing large quantities of data.

7. **Validity**: Concerns the accuracy and correctness of data for the intended use.

8. **Volatility**: Describes how long data is valid and how quickly it becomes outdated.

9. **Veracity**: Addresses the quality and reliability of data.

10. **Value**: Emphasizes the worth of the data being collected and how it can be turned into a valuable resource.

## What is Big Data?

<img src="images/model.png" alt="Drawing"/>

Big Data is a term that varies in definition depending on the context and the person you're asking. In the field of econometrics, the concept of Big Data can be framed with respect to the dimensions of the dataset, namely the number of variables and observations.

* **Wild data** (unstructured, in contrast with Census surveys; e.g., Twitter)

* **Wide data** (a.k.a. Large-P data because $p \gg N$)

* **Long data** (a.k.a. Large-N data because $N$ is very large and may not even fit onto a single hard drive)

* **Complex model** (a.k.a. Large-Theta because the model/algorithm has many parameters)

<img src="images/mlp.png" alt="Drawing" width="500"/>


## Pillars of Big Data

* Foundations (basic calculus, linear algebra, probability analysis, and numerical optimization)

* Programming (for automation of data collection, manipulation, cleaning, visualization, and modeling)

* Visualization & exploration

* Machine learning (to capture nonlinearity and non-normality in data, to compress data, and for prediction)

* Causal inference (to be able to make policy prescriptions)


## Impact of the Curse of Dimensionality on Regression Analysis

Before delving into the specific impacts, we first establish the general equations and notation used in regression analysis.

### Linear Regression: Basic Equations and Matrix Formulation

Linear regression is one of the foundational models in statistics, econometrics, and machine learning. It provides a simple yet powerful framework for modeling the relationship between a dependent variable and one or more explanatory variables.

#### 1. Scalar and Componentwise Representation

Consider a dataset with $n$ observations and $k$ explanatory variables. For observation $i$, the linear regression model is written as

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i
$$

where

- $y_i$ is the dependent variable
- $x_{ij}$ denotes the value of the $j$-th explanatory variable for observation $i$
- $\beta_0$ is the intercept
- $\beta_j$ for $j = 1, \ldots, k$ are slope coefficients
- $\varepsilon_i$ is an unobserved error term capturing noise and omitted factors

The objective of linear regression is to estimate the parameter vector $(\beta_0, \beta_1, \ldots, \beta_k)$ such that the fitted values approximate the observed outcomes as closely as possible.

#### 2. Vector and Matrix Formulation

Stacking all observations together yields a compact matrix representation. Define

$$
y =
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{pmatrix},
\quad
X =
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1k} \\
1 & x_{21} & \cdots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nk}
\end{pmatrix}
$$

$$
\beta =
\begin{pmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_k
\end{pmatrix},
\quad
\varepsilon =
\begin{pmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_n
\end{pmatrix}
$$

The linear regression model can then be written succinctly as

$$
y = X\beta + \varepsilon
$$

This formulation highlights that linear regression is a linear mapping from the parameter space to the outcome space.

#### 3. Ordinary Least Squares Estimation

The most common estimation method is Ordinary Least Squares (OLS), which chooses $\beta$ to minimize the sum of squared residuals

$$
\min_{\beta} (y - X\beta)'(y - X\beta)
$$

Provided that $X'X$ is invertible, the closed-form OLS estimator is

$$
\hat{\beta} = (X'X)^{-1}X'y
$$

This expression plays a central role in econometrics and statistical learning.
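
As a minimal sketch of the closed-form estimator, the snippet below simulates a small dataset and computes $\hat{\beta}$ in two ways: by solving the normal equations and by calling NumPy's least-squares routine. The sample size, the coefficient vector `beta_true`, and the noise level are illustrative assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 200, 3                                  # observations, explanatory variables (assumed)
X_raw = rng.normal(size=(n, k))                # simulated regressors
X = np.column_stack([np.ones(n), X_raw])       # prepend the intercept column of ones
beta_true = np.array([1.0, 2.0, -1.0, 0.5])    # hypothetical (beta_0, ..., beta_k)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed form: solve (X'X) b = X'y instead of forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, which avoids forming X'X at all
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_hat)
print("lstsq           :", beta_lstsq)
```

Both routes should agree up to floating-point error when $X'X$ is well conditioned; in practice the least-squares solver is usually the safer choice when regressors are nearly collinear.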

#### 4. Interpretation

- Each coefficient $\beta_j$ measures the marginal effect of $x_j$ on the expected value of $y$, holding other regressors fixed
- The matrix formulation clarifies issues such as identification, multicollinearity, and numerical stability
- Many extensions, including ridge regression, LASSO, generalized linear models, and neural networks, can be viewed as generalizations of this framework

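The key expressions are collected below in bold matrix notation, where $p$ denotes the number of explanatory variables ($p = k$ above) and the errors are assumed to be homoskedastic and normal:

$$
\mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon}
$$

$$
\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})
$$

$$
L_{OLS}(\beta) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \beta)^2 = \|\mathbf{y} - \mathbf{X}\beta\|^2
$$

$$
\hat{\beta}_{OLS} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

$$
\text{Bias}(\hat{\beta}_{OLS}) = \mathbb{E}[\hat{\beta}_{OLS}] - \beta
$$

$$
\text{Var}(\hat{\beta}_{OLS} \mid \mathbf{X}) = \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1}
$$

Because $\boldsymbol{\varepsilon}$ is unobserved, $\sigma^2$ is estimated from the residuals $\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \mathbf{X}\hat{\beta}_{OLS}$:

$$
\hat{\sigma}^2 = \frac{\hat{\boldsymbol{\varepsilon}}^T \hat{\boldsymbol{\varepsilon}}}{n - p - 1}
$$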
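
The variance and error-variance expressions above can be sketched numerically as follows. The example uses OLS residuals in place of the unobserved errors when estimating $\sigma^2$; the data-generating values (`n`, `p`, `sigma_true`, `beta_true`) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 500, 4                                   # observations, regressors (assumed)
sigma_true = 2.0                                # hypothetical error standard deviation
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.arange(p + 1, dtype=float)       # hypothetical coefficients 0, 1, ..., p
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat                        # residuals stand in for the errors

# Error-variance estimate with the degrees-of-freedom correction n - p - 1
sigma2_hat = resid @ resid / (n - p - 1)

# Estimated covariance matrix of beta_hat and the implied standard errors
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))

print("sigma^2 estimate:", sigma2_hat)
print("standard errors :", std_err)
```

With $n$ much larger than $p$, the estimate should land near the assumed $\sigma^2 = 4$; shrinking $n$ toward $p + 1$ makes the denominator small and the estimate erratic.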

### Overfitting
- In high-dimensional settings where $p$ is large, OLS can fit noise as well as signal.
- Overfitting occurs when training error is small but out-of-sample error is large.
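- When $p \ge n$, $\mathbf{X}^T \mathbf{X}$ is singular, so $(\mathbf{X}^T \mathbf{X})^{-1}$ does not exist and the OLS solution is not unique; when $p$ is close to $n$, the inverse exists but is numerically unstable.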
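
The gap between training and out-of-sample error can be illustrated with a small simulation. This is a sketch under assumed settings: only the first regressor affects $y$, the remaining columns are pure noise, and the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, p_max = 50, 1000, 45           # assumed sizes

def simulate(n):
    """Draw regressors and an outcome that depends only on the first column."""
    X = rng.normal(size=(n, p_max))
    y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)
    return X, y

def with_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

results = {}
for p in (1, 5, 20, 45):
    A_tr = with_intercept(X_tr[:, :p])          # use only the first p regressors
    A_te = with_intercept(X_te[:, :p])
    b, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    mse_train = np.mean((y_tr - A_tr @ b) ** 2)
    mse_test = np.mean((y_te - A_te @ b) ** 2)
    results[p] = (mse_train, mse_test)
    print(f"p={p:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```

Because the models are nested, training error cannot increase as $p$ grows, while test error typically deteriorates once the extra regressors start fitting noise.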

### Collinearity
- Collinearity refers to strong dependence among regressors in $\mathbf{X}$.
- It inflates the variance of $\hat{\beta}_{OLS}$ and obscures individual effects.
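
A rough way to see the variance inflation is to compare the relevant diagonal entry of $(\mathbf{X}^T \mathbf{X})^{-1}$, which scales the variance of a slope estimate, for uncorrelated and highly correlated regressors. The helper `slope_variance_factor` and the correlation values are hypothetical choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200                                          # assumed sample size

def slope_variance_factor(rho):
    """Return [(X'X)^{-1}]_{11} for x1 and the condition number of X,
    where x1 and x2 have population correlation rho."""
    z1 = rng.normal(size=n)
    z2 = rng.normal(size=n)
    x1 = z1
    x2 = rho * z1 + np.sqrt(1.0 - rho**2) * z2   # induces correlation rho with x1
    X = np.column_stack([np.ones(n), x1, x2])
    xtx_inv = np.linalg.inv(X.T @ X)
    return xtx_inv[1, 1], np.linalg.cond(X)

low_var, low_cond = slope_variance_factor(0.0)
high_var, high_cond = slope_variance_factor(0.99)

print(f"rho=0.00  variance factor={low_var:.4f}  cond(X)={low_cond:.1f}")
print(f"rho=0.99  variance factor={high_var:.4f}  cond(X)={high_cond:.1f}")
```

Multiplying the variance factor by $\sigma^2$ gives the variance of the slope on `x1`, so a larger factor means a noisier estimate of that individual effect.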

### Sample Size and Power Issues
- As $p$ increases, larger $n$ is required to maintain statistical power.
- High dimensionality increases both Type I and Type II error risks.

### Unreliable Estimation of Variance
- When $p$ is close to $n$, $(\mathbf{X}^T \mathbf{X})$ becomes poorly conditioned.
- This undermines the reliability of variance and inference.

### Estimation of Error Variance ($\sigma^2$)
- The estimator depends critically on degrees of freedom $n - p - 1$.
- Large $p$ leads to unstable or inflated variance estimates.

The curse of dimensionality increases overfitting risk, worsens collinearity, weakens inference, and motivates regularization and dimensionality reduction in modern regression analysis.
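
As one illustration of regularization, the sketch below applies ridge regression to a wide dataset with $p > n$, where $\mathbf{X}^T \mathbf{X}$ is rank deficient and OLS has no unique solution. The penalty `lam`, the sparse `beta_true`, and the omission of an intercept are simplifying assumptions; in practice the penalty would be tuned, for example by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 100                                   # wide data: more regressors than observations
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                              # hypothetical sparse signal
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# rank(X) <= min(n, p) = n < p, so X'X (p x p) cannot be inverted
rank_X = np.linalg.matrix_rank(X)
print("rank of X:", rank_X, "columns:", p)

# Ridge estimator: (X'X + lam * I)^{-1} X'y, well defined for any lam > 0
lam = 1.0                                        # assumed penalty, not tuned
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("first five ridge coefficients:", beta_ridge[:5])
```

Adding `lam * np.eye(p)` makes the matrix positive definite, which restores a unique solution at the cost of shrinking the coefficients toward zero.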

## Big Data: From Hypothesis Testing to Machine Learning

* Machine learning (ML) allows researchers to analyze data in novel ways. Computers today can process multiple sets of data quickly and, with the right classification sets, recognize highly complex patterns. 

* Designed to simulate the interactions of biological neurons, “deep learning” uses artificial neural networks to discern features in successive layers of data while iterating on previously recognized trends. 


### Econometrics vs. Machine Learning

#### Goal of econometrics

* "The goal of econometrics is to find $\hat{\beta}$", where $\hat{\beta}$ denotes the estimated causal impact of $X$ on $y$.

* The primary statistical concern of econometrics is sampling error. In other words, the goal is to quantify the uncertainty around $\hat{\beta}$ that arises from randomness in sampling from the population.

* The goal is to make counterfactual predictions.

        What would happen to Amazon's profits if it changed its website layout?

We don't get to observe the world under these alternative policies, so we can't simply find the answers in the data. Knowing the counterfactual requires measuring a causal effect. Measuring a causal effect requires making assumptions. That's what economics is all about!


#### Goal of machine learning

* In contrast, the goal of machine learning is to come up with the best possible out-of-sample prediction. 

* In this sense, the primary concern of machine learning is $\hat{y}$.

* The goal is to make sure that the best prediction is made by tuning and validating many different kinds of models. This is what cross-validation is all about, and it is what machine learning practitioners obsess about.
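
The tuning-and-validation loop described above can be sketched with a hand-rolled $K$-fold cross-validation. Here the candidate models are polynomial regressions of different degrees; the simulated data, the number of folds, and the helper names `poly_design` and `cv_mse` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 120                                          # assumed sample size
x = rng.uniform(-2, 2, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)   # hypothetical nonlinear relationship

# Fix the folds once so every candidate model is scored on the same splits
folds = np.array_split(rng.permutation(n), 5)

def poly_design(x, degree):
    """Design matrix with columns 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def cv_mse(degree):
    """Average out-of-fold mean squared error for a polynomial of the given degree."""
    errors = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        A_train = poly_design(x[train_idx], degree)
        A_test = poly_design(x[test_idx], degree)
        b, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        errors.append(np.mean((y[test_idx] - A_test @ b) ** 2))
    return float(np.mean(errors))

cv_scores = {d: cv_mse(d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)

for d, score in cv_scores.items():
    print(f"degree={d}  CV MSE={score:.3f}")
print("selected degree:", best_degree)
```

The selected degree is the one with the lowest average out-of-fold error, which is a direct estimate of the out-of-sample prediction quality that machine learning targets.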

## Unlocking New Dimensions: The Pivotal Role of Alternative Data

- **Alternative data**: non-traditional data sources that can provide insights beyond those available from conventional data. Its advantages are high frequency, near-real-time availability, accuracy, and objectivity. Its main disadvantage is that the available indicators are merely proxies for the quantities policymakers are interested in and need for policy design. Here are some examples of alternative data and variables:

    - **Textual data such as News Headlines**
    - **Digital footprints from social media**
    - **Mobile phone data**
    - **Satellite Imagery**: nighttime light measures, or luminosity, as proxies for economic activity and population distribution
    - **Search Engine Trends**
    - **Environmental, Social, and Governance (ESG) Data**


<img src="images/text.png" alt="Drawing" width="400"/>
<img src="images/satt2.jpg" alt="Drawing" width="400"/>
<img src="images/satt.jpg" alt="Drawing" width="400"/>
<img src="images/satt3.jpg" alt="Drawing" width="400"/>