## <center><font color=navy>Big Data Economics</font></center>
### <center>An Overview</center>
#### <center>Ali Habibnia</center>

    
<center> Assistant Professor, Department of Economics, </center>
<center> and Division of Computational Modeling & Data Analytics at Virginia Tech</center>
<center> habibnia@vt.edu </center> 

 
<img src="images/tech3.JPG" alt="Drawing" width="350"/>

#### The image exemplifies the intersection and collaborative synergy among three pivotal technological domains:

1. **Artificial Intelligence and Machine Learning**: AI and ML can leverage Big Alternative Data for developing more sophisticated models and algorithms, which can lead to more accurate predictions and better decision-making.

2. **Big Data**: Big Data provides the raw information that allows AI models to make informed decisions. This data can be processed and analyzed using machine learning algorithms to uncover insights that were not attainable with traditional data sources, enhancing research and development across various fields.

3. **High-Performance Computing (HPC)**: HPC provides the necessary computational power to handle the vast amounts of data and complex calculations required by AI and ML, thus speeding up research and enabling more complex simulations. 

The integration of Big Data, AI, and HPC creates a powerful ecosystem for advanced analytics, enabling the tackling of intricate problems and the extraction of profound insights. This triad fosters the capability for real-time analysis and decision-making, revolutionizing sectors such as finance, healthcare, and transportation by improving efficiency and outcomes.

* #### The concept of Big Data is often associated with the 10 V's:

<img src="images/bigdatavs.JPG" alt="Drawing" width="500"/>


1. **Visualization**: Represents the importance of presenting data in a manner that is easily and immediately understandable.

2. **Velocity**: Refers to the speed at which data is generated, processed, and analyzed.

3. **Variety**: Indicates the different types of data (structured, unstructured, and semi-structured) that are available for analysis.

4. **Variability**: Suggests that data flows can be highly inconsistent with periodic peaks.

5. **Volume**: Points to the vast amounts of data generated from various sources.

6. **Vulnerability**: Highlights the security concerns and risks associated with managing and storing large quantities of data.

7. **Validity**: Concerns the accuracy and correctness of data for the intended use.

8. **Volatility**: Describes how long data is valid and how quickly it becomes outdated.

9. **Veracity**: Addresses the quality and reliability of data.

10. **Value**: Emphasizes the worth of the data being collected and how it can be turned into a valuable resource.

## What is Big Data?

<img src="images/model.png" alt="Drawing"/>

Big Data is a term that varies in definition depending on the context and the person you're asking. In the field of econometrics, the concept of Big Data can be framed with respect to the dimensions of the dataset, namely the number of variables and observations.

* **Wild data** (unstructured data, in contrast with Census surveys; e.g., Twitter)

* **Wide data** (a.k.a. Large-P data, because $p \gg N$)

* **Long data** (a.k.a. Large-N data, because $N$ is very large and the data may not even fit onto a single hard drive)

* **Complex model** (a.k.a. Large-Theta, because the model/algorithm has many parameters)

<img src="images/mlp.png" alt="Drawing" width="500"/>


## Pillars of Big Data

* Foundations (basic calculus, linear algebra, probability analysis, and numerical optimization)

* Programming (for automation of data collection, manipulation, cleaning, visualization, and modeling)

* Visualization & exploration

* Machine learning (to capture nonlinearity and non-normality in data, to compress data, and to make predictions)

* Causal inference (to be able to make policy prescriptions)


## Impact of the Curse of Dimensionality on Regression Analysis

Before delving into the specific impacts, let's establish the general equations and notations used in regression analysis:


<br>
<br>
<center> $ \mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon} $ </center>
<br>
<center> $ \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}) $ </center>
<br>
<center> $ L_{OLS}(\hat{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \hat{\beta})^2 = \|\mathbf{y} - \mathbf{X}\hat{\beta}\|^2 $ </center>
<br>
<center> $ \hat{\beta}_{OLS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $ </center>
<br>
<center> $ \text{Bias} (\hat{\beta}_{OLS}) = \mathbb{E}[\hat{\beta}_{OLS}] - \beta $ </center>
<br>
<center> $ \text{Var} (\hat{\beta}_{OLS}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} $ </center>
<br>
<center> $ \hat{\sigma}^2 = \frac{\hat{\boldsymbol{\varepsilon}}^T \hat{\boldsymbol{\varepsilon}}}{n - p - 1}, \quad \hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \mathbf{X}\hat{\beta}_{OLS} $ </center>
<br>
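
The quantities above can be computed directly. The following is a minimal sketch on simulated data, not a definitive implementation: the sample size, the coefficient vector `beta`, the noise level `sigma`, and the random seed are all illustrative assumptions, and the design matrix includes an intercept column so that the residual degrees of freedom are $n - p - 1$.

```python
import numpy as np

# Simulated data; n, p, beta, sigma and the seed are illustrative assumptions.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
beta = np.array([1.0, 2.0, -1.0, 0.5])
sigma = 0.5
y = X @ beta + rng.normal(scale=sigma, size=n)

# OLS estimator: solve the normal equations (X^T X) beta_hat = X^T y.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Residuals, loss, error-variance estimate, and estimated covariance of beta_hat.
resid = y - X @ beta_hat
loss = float(resid @ resid)                 # L_OLS evaluated at beta_hat
sigma2_hat = loss / (n - p - 1)             # residual sum of squares / degrees of freedom
var_beta_hat = sigma2_hat * np.linalg.inv(XtX)

print("beta_hat:      ", np.round(beta_hat, 3))
print("sigma2_hat:    ", round(sigma2_hat, 3))
print("std. errors:   ", np.round(np.sqrt(np.diag(var_beta_hat)), 3))
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse just to obtain $\hat{\beta}_{OLS}$; the explicit inverse is used only for the covariance matrix.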

### Overfitting:
- In high-dimensional settings where $p$ is large, the OLS model $ \mathbf{y} = \mathbf{X}\beta + \boldsymbol{\varepsilon} $ can fit the noise in the data as well as the signal.
- Overfitting is indicated when $ L_{OLS}(\hat{\beta}) $ is small for the training data but large for new data.
- When $p$ approaches $n$, $\mathbf{X}^T\mathbf{X}$ becomes ill-conditioned, and when $p$ exceeds $n$ it is singular, so $(\mathbf{X}^T\mathbf{X})^{-1}$ is unstable or does not exist. The model can then fit the training data perfectly yet fail to generalize.
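
A small simulation can illustrate the gap between training and out-of-sample loss. This is a sketch under assumed settings: only the first column of a simulated design carries signal, the other columns are pure noise, and the sample sizes and seed are arbitrary choices.

```python
import numpy as np

# Illustrative assumptions: 50 training points, 2000 test points, up to 45 predictors.
rng = np.random.default_rng(42)
n_train, n_test, p_max = 50, 2000, 45

# Only the first column carries signal; the remaining columns are pure noise.
X_train = rng.normal(size=(n_train, p_max))
X_test = rng.normal(size=(n_test, p_max))
y_train = X_train[:, 0] + rng.normal(size=n_train)
y_test = X_test[:, 0] + rng.normal(size=n_test)

results = {}
for p in (1, 10, 25, 45):
    # Fit OLS using the first p columns only, so the models are nested.
    beta_hat, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
    train_mse = float(np.mean((y_train - X_train[:, :p] @ beta_hat) ** 2))
    test_mse = float(np.mean((y_test - X_test[:, :p] @ beta_hat) ** 2))
    results[p] = (train_mse, test_mse)
    print(f"p={p:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Because the models are nested, the training MSE can only fall as columns are added, while the test MSE reflects how much of that improvement was noise fitting.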


### Collinearity:
- Collinearity refers to high correlation among predictor variables in matrix $\mathbf{X}$.
- This leads to instability in $(\mathbf{X}^T\mathbf{X})^{-1}$, reflected in inflated values in the diagonal of $\text{Var}(\hat{\beta}_{OLS})$, indicating high variance of the coefficient estimates.
- Collinearity can make it difficult to discern the individual impact of predictors on $\mathbf{y}$.
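
To see how correlation between predictors inflates the diagonal of $(\mathbf{X}^T\mathbf{X})^{-1}$, consider a sketch with two simulated predictors. The helper name `inverse_gram_diagonal`, the sample size, and the correlation levels are hypothetical choices for illustration; the printed ratio $1/(1-\rho^2)$ is the usual variance inflation factor for the two-predictor case.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # illustrative sample size

def inverse_gram_diagonal(rho):
    """Diagonal of (X^T X)^{-1} for two unit-variance predictors with correlation near rho."""
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

results = {}
for rho in (0.0, 0.9, 0.99):
    results[rho] = inverse_gram_diagonal(rho)
    vif = 1.0 / (1.0 - rho**2)  # theoretical variance inflation factor
    print(f"rho={rho:4.2f}  diag of (X'X)^-1 = {results[rho]}  theoretical VIF = {vif:.1f}")
```

Since $\text{Var}(\hat{\beta}_{OLS}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$, larger diagonal entries translate directly into larger coefficient variances for the same $\sigma^2$.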


### Sample Size and Power Issues:
- Adequate sample size ($n$) is essential for reliable estimation. However, as $p$ increases, more data is required to maintain statistical power.
- The power of a test is the probability that it correctly rejects a false null hypothesis, that is, one minus the probability of a Type II error. In high dimensions, the power to detect true effects is diminished due to the increased variance in coefficient estimates.
- Type I error is the incorrect rejection of a true null hypothesis, while Type II error is failing to reject a false null hypothesis. High dimensionality can inflate both Type I and Type II errors, especially when $p$ is large relative to $n$.
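
The loss of power can be illustrated with a Monte Carlo sketch. To keep it short, it makes a simplifying assumption that is not in the text above: $\sigma = 1$ is treated as known, so a two-sided z-test with critical value 1.96 can be used. Under that assumption the rejection rate under the null stays near 5% and only power is affected. The function name `rejection_rate`, the effect size, and the sample sizes are hypothetical.

```python
import numpy as np

def rejection_rate(n, p, beta1, n_sims=500, seed=0):
    """Share of simulations in which H0: beta_1 = 0 is rejected at the 5% level.

    Assumes sigma = 1 is known, so the test statistic is standard normal under H0.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        X = rng.normal(size=(n, p))
        y = beta1 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        z_stat = beta_hat[0] / np.sqrt(XtX_inv[0, 0])
        rejections += int(abs(z_stat) > 1.96)
    return rejections / n_sims

power_low_p = rejection_rate(n=60, p=2, beta1=0.4)
power_high_p = rejection_rate(n=60, p=50, beta1=0.4)
size_high_p = rejection_rate(n=60, p=50, beta1=0.0)

print("power with p=2: ", power_low_p)
print("power with p=50:", power_high_p)
print("rejection rate under H0 with p=50:", size_high_p)
```

With the same $n$ and the same true effect, adding irrelevant predictors raises the variance of $\hat{\beta}_1$ and lowers the rejection rate when the effect is real.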


### Unreliable Estimation of Variance:
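- In high-dimensional spaces, the estimate of $\text{Var}(\hat{\beta}_{OLS}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ can become unreliable, because it depends both on the inverse of $\mathbf{X}^T\mathbf{X}$, which is unstable when predictors are numerous or correlated, and on an estimate of $\sigma^2$.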
- When $p$ is close to or greater than $n$, the matrix $(\mathbf{X}^T\mathbf{X})$ is often of low rank and thus non-invertible or poorly conditioned, which significantly affects the reliability of the OLS estimator.
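
The rank argument can be checked numerically. The sketch below uses a simulated Gaussian design with an arbitrary $n = 20$ and relies on the fact that $\mathbf{X}^T\mathbf{X}$ has the same rank as $\mathbf{X}$, which is at most $\min(n, p)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # illustrative sample size

summary = {}
for p in (5, 19, 20, 40):
    X = rng.normal(size=(n, p))
    # rank(X^T X) equals rank(X), which cannot exceed min(n, p).
    rank = int(np.linalg.matrix_rank(X))
    # Condition number of the p x p matrix X^T X; very large values signal near-singularity.
    cond = float(np.linalg.cond(X.T @ X))
    summary[p] = (rank, cond)
    print(f"p={p:2d}  rank={rank:2d} (needs {p} to invert X'X)  cond(X'X)={cond:.3g}")
```

When the rank is below $p$, the $p \times p$ matrix $\mathbf{X}^T\mathbf{X}$ has no inverse, so the closed-form OLS estimator and its variance formula are not defined.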


### Estimation of Error Variance ($\sigma^2$):
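- The accurate estimation of $\sigma^2$ is crucial for inference. The estimator $\hat{\sigma}^2 = \frac{\hat{\boldsymbol{\varepsilon}}^T \hat{\boldsymbol{\varepsilon}}}{n - p - 1}$, where $\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \mathbf{X}\hat{\beta}_{OLS}$ are the residuals, depends on both $n$ and $p$.
- When $p$ is large relative to $n$, the denominator $n-p-1$ (the residual degrees of freedom) becomes small, so $\hat{\sigma}^2$ rests on very little information and becomes highly variable; when $p + 1 \geq n$ it is undefined.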
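
The effect of a small denominator $n - p - 1$ can be examined by simulation. This sketch assumes the Gaussian linear model stated earlier with a true $\sigma^2 = 1$; the function name `sigma2_estimates` and all sizes are hypothetical. Under these assumptions the residual-based estimate stays centered near the true value, while its spread grows sharply as the degrees of freedom shrink.

```python
import numpy as np

def sigma2_estimates(n, p, n_sims=1000, seed=0):
    """Simulate RSS / (n - p - 1) repeatedly when the true error variance is 1."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_sims)
    for s in range(n_sims):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p predictors
        y = X[:, 1] + rng.normal(size=n)          # one true predictor, sigma = 1
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta_hat
        estimates[s] = resid @ resid / (n - p - 1)
    return estimates

low_p = sigma2_estimates(n=50, p=5)     # 44 residual degrees of freedom
high_p = sigma2_estimates(n=50, p=45)   # 4 residual degrees of freedom

print(f"p=5:  mean={low_p.mean():.3f}  std={low_p.std():.3f}")
print(f"p=45: mean={high_p.mean():.3f}  std={high_p.std():.3f}")
```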


The curse of dimensionality significantly impacts regression analysis by increasing the potential for overfitting, causing issues with collinearity, necessitating larger sample sizes for statistical power, and leading to unreliable estimation of the variance of the OLS estimator and error variance $\sigma^2$. Techniques like regularization, dimensionality reduction, and careful model selection are crucial in high-dimensional settings.
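
As one example of regularization, ridge regression adds a penalty $\lambda \|\beta\|^2$ to the OLS loss, which gives $\hat{\beta}_{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. The sketch below applies it to simulated wide data; the sparse coefficient vector, the value of `lam`, and the seed are illustrative assumptions, and in practice $\lambda$ would be tuned rather than fixed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 100                      # wide data: p > n
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                      # hypothetical sparse truth
y = X @ beta + rng.normal(size=n)

lam = 10.0                          # penalty strength, an arbitrary illustrative value

# X^T X is singular here (rank at most n < p), but X^T X + lam * I is positive definite.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Minimum-norm least-squares solution for comparison; it interpolates the training data.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

print("rank of X:", np.linalg.matrix_rank(X))
print("norm of ridge coefficients:   ", np.linalg.norm(beta_ridge))
print("norm of min-norm coefficients:", np.linalg.norm(beta_min_norm))
```

The ridge coefficients are shrunk toward zero relative to the interpolating solution, trading a little bias for a reduction in variance.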

## Big Data: From Hypothesis Testing to Machine Learning

* Machine learning (ML) allows researchers to analyze data in novel ways. Modern computers can process many datasets quickly and, given appropriately labeled training data, recognize highly complex patterns in them.

* Designed to simulate the interactions of biological neurons, “deep learning” uses artificial neural networks to discern features in successive layers of data while iterating on previously recognized trends. 


### Econometrics vs. Machine Learning

#### Goal of econometrics

* "The goal of econometrics is to find $\hat{\beta}$," where $\hat{\beta}$ denotes the estimated causal impact of $X$ on $y$.

* The primary statistical concern of econometrics is sampling error. In other words, the goal is to quantify the uncertainty around $\hat{\beta}$ due to randomness in the sampling of the population.

* The goal is to make counterfactual predictions.

        What would happen to Amazon's profits if it changed its website layout?

We don't get to observe the world under these alternative policies, so we can't simply find the answers in the data. Knowing the counterfactual requires being able to measure a causal effect. Being able to measure a causal effect requires making assumptions. That's what economics is all about!


#### Goal of machine learning

* In contrast, the goal of machine learning is to come up with the best possible out-of-sample prediction. 

* We summarize this by saying that the primary concern of machine learning is $\hat{y}$.

* The goal is to obtain the best prediction by tuning and validating many different kinds of models. This is what cross-validation is all about, and it is what machine learning practitioners obsess about.
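
A minimal sketch of k-fold cross-validation follows, using polynomial degree as the tuning choice. The data-generating function, the noise level, the number of folds, and the helper name `kfold_mse` are all hypothetical; any model family and loss could take their place.

```python
import numpy as np

# Simulated data: a nonlinear signal plus noise (illustrative assumption).
rng = np.random.default_rng(7)
n = 100
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

def kfold_mse(x, y, degree, k=5):
    """Average held-out MSE of a polynomial of the given degree over k folds."""
    indices = np.arange(len(x))              # data were drawn in random order already
    errors = []
    for held_out in np.array_split(indices, k):
        train = np.setdiff1d(indices, held_out)
        coefs = np.polyfit(x[train], y[train], degree)   # fit on the training folds
        pred = np.polyval(coefs, x[held_out])            # predict the held-out fold
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(errors))

cv_mse = {d: kfold_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_mse, key=cv_mse.get)

for d, err in cv_mse.items():
    print(f"degree={d}  CV MSE={err:.4f}")
print("degree with lowest CV error:", best_degree)
```

Each candidate model is scored only on data it did not see during fitting, so the comparison targets out-of-sample accuracy of $\hat{y}$ rather than in-sample fit.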

## Unlocking New Dimensions: The Pivotal Role of Alternative Data

- **Alternative data**: non-traditional data sources that can provide insights beyond what is available through conventional data. Its advantages are high frequency, near-real-time availability, accuracy, and objectivity. Its main disadvantage is that the available indicators are merely proxies for the quantities policymakers are interested in and need for policy design. Here are some examples of alternative data and variables:

    - **Textual data such as News Headlines**
    - **Digital footprints from social media**
    - **Mobile phone data**
    - **Satellite Imagery**: nighttime light measures, or luminosity, as proxies for economic activity and population distribution
    - **Search Engine Trends**
    - **Environmental, Social, and Governance (ESG) Data**


<img src="images/text.PNG" alt="Drawing" width="400"/>
<img src="images/satt2.jpg" alt="Drawing" width="400"/>
<img src="images/satt.jpg" alt="Drawing" width="400"/>
<img src="images/satt3.jpg" alt="Drawing" width="400"/>