# Chapter 1 Overview of Statistical Data Science 

## 1.1 What is data science?

Data science is an interdisciplinary field. 
- It uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
- It draws on techniques and theories from many fields, including mathematics, statistics, computer science, information science, and domain knowledge.
- It is related to data mining, machine learning and big data. 


**Remark** (Data science and statistics) People hold very different views on how the two terms relate.
- Everyone agrees that statistics is a crucial component of data science.
- Some argue that data science is not a new field, but rather another name for statistics.
- Some see data science as applied statistics.

Compared with _traditional_ statistics (e.g., in the 1970s), data science
- deals with new types of data (e.g., images, electronic health records),
- deals with huge datasets,
- and emphasizes prediction and action.



A crude schematic of a common data science project is shown below.
<img src="../Figures/Ch1/datasci1.jpg" alt="lm" style="width: 500px;"/>
We can roughly categorize tasks of a data scientist based on the schematic. 

| Task | Descriptions | Skills required |
| :-: | :- | :- |
| Visioning | To generate hypotheses or questions that are of interest  | Domain knowledge, self-learning, quantitative methods, etc. |
| Data acquisition | To gather data for verifying hypotheses via experiments, sample surveys, or data queries | Experimental designs, causal inference, query languages, etc. |
| Analysis | To analyze data to produce meaningful visualizations, build predictive models, test hypotheses, etc. | Domain knowledge, programming languages, statistical models, statistical theory, optimization, etc.  |
| Conclusion | To communicate the results of the analysis in writing or via presentations | Communication skills, writing skills, public speaking skills, etc.  |
| Management | To manage and oversee the project | Leadership, collaboration skills, tools for team projects, etc. |


This collection of notes focuses on the __Data__ part of the chain. 
<img src="../Figures/Ch1/datasci2.jpg" alt="lm" style="width: 500px;"/>
The transform-visualise-model step is in fact an iterative process: we need to reflect on the results of previous steps and revise our approach accordingly.
<img src="../Figures/Ch1/datasci3.jpg" alt="ds" style="width: 500px;"/>



## 1.2 Example: `wages`

We now turn to the `wages` data to see an example of data analysis. 

We fit a linear model with `lm()`, taking log income as the response and education as the single predictor, and store the fitted model as `mod_e`.
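A minimal sketch of such a fit is given here. The `wages` data are not reproduced in these notes, so the block simulates a stand-in data frame, `wages_sim`, with columns `income` and `education`. The simulated values and the coefficients used to generate them are assumptions for illustration only, not estimates from the real data.

```r
# Hypothetical stand-in for the `wages` data (simulated, not the real dataset)
set.seed(1)
n <- 500
education <- sample(8:20, n, replace = TRUE)               # years of education
log_income <- 8.5 + 0.10 * education + rnorm(n, sd = 0.8)  # assumed relationship
wages_sim <- data.frame(income = exp(log_income), education = education)

# Linear model: log income as the response, education as the single predictor
mod_e <- lm(log(income) ~ education, data = wages_sim)
coef(mod_e)  # intercept first, then the slope for education
```

With the real data, replacing `wages_sim` by `wages` should be all that is needed, provided the columns are named `income` and `education`.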

<img src="../Figures/Ch1/lm.jpg" alt="lm" style="width: 400px;"/>

The formula passed to `lm()` only needs to name the response (the variable on the y-axis) and the predictors (the variables on the x-axis). An intercept term is included by default; appending `-1` to the formula removes it.
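The effect of `-1` can be checked directly. The sketch below uses two short made-up vectors, `x` and `y` (hypothetical values, unrelated to `wages`), and compares the coefficient names of the two fits.

```r
# Made-up illustrative data (not from `wages`)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit_default <- lm(y ~ x)      # intercept included by default
fit_no_int  <- lm(y ~ x - 1)  # `- 1` removes the intercept

names(coef(fit_default))  # "(Intercept)" "x"
names(coef(fit_no_int))   # "x"
```

Writing the formula as `y ~ 0 + x` is an equivalent way to drop the intercept.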


<img src="../Figures/Ch1/formula.jpg" alt="lm" style="width: 400px;"/>

We can try to interpret the fitted coefficients.   
- The average log income is {{as.numeric(round(mod_e$coef[1],digits=3))}} for those with zero years of education.

- The average difference in log income is {{as.numeric(round(mod_e$coef[2],digits=3))}} for groups with one year difference in education. 

Neither statement makes a lot of sense. For instance, (1) no one in the `wages` dataset has zero years of education, and (2) differences in log income are not informative for a general audience. Therefore, we have two questions to think about.
- Why take the logarithm of income?
- How should we interpret the fitted results?
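
One standard way to approach the second question is to move the slope back to the original income scale. The derivation below is a sketch under the assumption that the fitted model has the form shown in the first line; the symbols $\beta_0$, $\beta_1$, $\varepsilon$ and $x$ are introduced here for illustration.

$$
\log(\text{income}) = \beta_0 + \beta_1 \,\text{education} + \varepsilon
$$

$$
\mathrm{E}[\log(\text{income}) \mid \text{education} = x + 1] - \mathrm{E}[\log(\text{income}) \mid \text{education} = x] = \beta_1
$$

$$
\frac{\exp\left(\mathrm{E}[\log(\text{income}) \mid \text{education} = x + 1]\right)}{\exp\left(\mathrm{E}[\log(\text{income}) \mid \text{education} = x]\right)} = e^{\beta_1} \approx 1 + \beta_1 \quad \text{for small } \beta_1
$$

Since the exponential of an average log income is a geometric mean, $e^{\beta_1}$ can be read as the ratio of geometric mean incomes between groups that differ by one year of education. For a small slope this is roughly a $100\,\beta_1$ percent difference, which is easier to communicate than a difference in log income.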



After model fitting, we usually proceed to hypothesis testing, predictive modeling, or model diagnostics. An experienced reader can certainly perform these tasks from scratch. However, it is best not to reinvent the wheel. For very common tasks, it is extremely likely that tools already exist. Here we use the `R` package `broom`.

`broom` provides three functions that work for most types of models (and can be extended to more):
1. `tidy()` - returns the model coefficients with their summary statistics (standard errors, test statistics, p-values): how much uncertainty is associated with each estimate?
2. `glance()` - returns model-level summaries such as $R^2$: how "good" is the model?
3. `augment()` - returns predictions, residuals, and other observation-level values.
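
The three functions can be tried on a small fit as follows. This is a sketch that assumes the `broom` package is installed. It uses a simulated stand-in, `wages_sim`, because the real `wages` data are not reproduced here; the simulated values are hypothetical.

```r
library(broom)  # assumes the broom package is installed

# Hypothetical stand-in for the `wages` data (simulated, not the real dataset)
set.seed(1)
n <- 500
education <- sample(8:20, n, replace = TRUE)
wages_sim <- data.frame(
  education = education,
  income = exp(8.5 + 0.10 * education + rnorm(n, sd = 0.8))
)
mod_e <- lm(log(income) ~ education, data = wages_sim)

coef_tbl <- tidy(mod_e)     # one row per coefficient: estimate, std.error, statistic, p.value
fit_tbl  <- glance(mod_e)   # one row for the whole model: r.squared, sigma, AIC, ...
obs_tbl  <- augment(mod_e)  # one row per observation: .fitted, .resid, ...
```

Each function returns a data frame, so the results can be filtered, joined, or plotted with the same tools used for the raw data.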

The low $R^2$ indicates that education explains only a small part of the variability in log income. We can include more predictors in the model.
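
A minimal sketch of such a comparison follows. The data are simulated, and the extra predictor `age` is an assumption made for illustration; the real `wages` data may offer different variables.

```r
# Hypothetical stand-in for the `wages` data (simulated); `age` is an assumed extra predictor
set.seed(2)
n <- 500
education <- sample(8:20, n, replace = TRUE)
age <- sample(25:60, n, replace = TRUE)
log_income <- 8.5 + 0.10 * education + 0.01 * age + rnorm(n, sd = 0.8)
wages_sim <- data.frame(income = exp(log_income), education = education, age = age)

mod_small <- lm(log(income) ~ education, data = wages_sim)
mod_large <- lm(log(income) ~ education + age, data = wages_sim)

r2_small <- summary(mod_small)$r.squared
r2_large <- summary(mod_large)$r.squared
c(small = r2_small, large = r2_large)  # the larger model's R^2 is never lower
```

For nested least-squares models, $R^2$ cannot decrease when a predictor is added, so a small increase is not by itself evidence that the new predictor is useful. The adjusted version, `summary(mod_large)$adj.r.squared`, penalizes the extra term.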

The $R^2$ does improve a bit, but it remains low. Maybe the linear model is not a good choice. It might be a good idea to look at the raw data.