![introduction](../img/introduction.png)

# What is Fitting models to Data?

**Goal:** How to fit **Statistical models** to **data** to help answer research questions.

## We do NOT fit data to Models. We fit Models to data.
* **Fit Models:** Specify models based on theory or subject knowledge, and then fit those models to the data that we collected.
* **Data:** Variables follow distributions and have certain relationships; the models that we fit to the data describe those distributions or relationships.
* Fit models -> Data

## Why do we fit models to data?
* **Estimate** distributional properties of variables, potentially conditional on other variables
    * Estimate the means, variances, and quantiles of those distributions
* Concisely **summarize relationships** between variables, and make inferential statements about those relationships
    * e.g. the relationship between a predictor and a dependent variable
* **Predict** values of variables of interest conditional on the values of other predictor variables, and characterize the prediction uncertainty
    * e.g. predicting the outcome of an election or a sporting event, or what will happen with the weather or the stock market; the focus is on predicting the values of certain outcomes of interest

### Focus 
* Focus on **parametric models:** estimating **parameters** that describe the distributions of variables
* Given data, we suggest that a **variable of interest** follows a certain **probability model**
    * e.g. we might assume that a continuous variable of interest, such as blood pressure, follows a normal distribution
    * This is an example of a parametric model
    * Estimate the parameters that define that model, e.g. the mean and the variance of that normally distributed variable
        * These parameters define the model, and we want to estimate their values in part to answer research questions
    * **Fitting a normal distribution to a given continuous variable is one way of fitting a model to the data we have collected**
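
As a minimal sketch of what "fitting a normal distribution" means in practice, the snippet below estimates the mean and variance of a continuous variable. The blood pressure readings are simulated, and the values 120 and 15 used to generate them are assumptions made up for illustration, not data from the course.

```python
import random
from statistics import NormalDist, mean, variance

# Hypothetical data: 500 simulated blood pressure readings.
# The generating mean (120) and standard deviation (15) are made up.
rng = random.Random(0)
blood_pressure = [rng.gauss(120, 15) for _ in range(500)]

# Estimate the two parameters that define the normal model.
mean_hat = mean(blood_pressure)      # estimate of the mean
var_hat = variance(blood_pressure)   # sample variance (n - 1 denominator)

# NormalDist.from_samples builds the fitted model from the same estimates.
fitted = NormalDist.from_samples(blood_pressure)

print(f"estimated mean:     {mean_hat:.2f}")
print(f"estimated variance: {var_hat:.2f}")
print(f"fitted model:       N({fitted.mean:.2f}, {fitted.stdev:.2f}^2)")
```

The two printed estimates are the fitted model: once the mean and variance are estimated, the normal distribution is fully specified.
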
### Fitting models to data.
We learnt in Course 2 how to estimate model parameters and their sampling variance in order to **make inference about the parameters** by testing hypotheses or generating confidence intervals.
* Example of specifying a probability model given a research question and estimating the parameters of that model
* The idea of **Assessing model fit:** Does the model seem to fit the observed data well?

### Example: Test Performance and Age
* Variable of interest: Test performance in the range of [0,8]
* Possible predictor to answer our research question: age, standardized with respect to the mean and standard deviation of age
* We want to know whether age can predict the value of test performance
* Furthermore, we believe that age has a **curvilinear** relationship with performance
    * For moderate values of age, performance is best
    * For smaller or larger values of age, performance tends to be worse
    * We have a working theory that defines this curvilinear relationship, and we want to collect data, fit a model to that data, estimate the parameters of that model, and test this working theory
    
**Goals:**
1. Estimate the **marginal mean** of performance across all ages (a descriptive objective)
    * i.e. estimate the average test performance regardless of age
2. Estimate mean performance **conditional** on age
    * i.e. estimate the relationship of age with mean test performance

### 2 different modelling approaches

![introduction](../img/test-performance-and-age.png)

1. **"Mean only" model** for test performance
    * We will assume test performance follows a **normal distribution** overall, defined by a particular mean and a particular variance (2 parameters).
    * So we are estimating the mean and variance of that normal distribution
    * We think that the normal distribution is a good model for the observed values of test performance, and we are only interested in modelling the overall mean.
2. **Conditional on age**, we believe performance again follows a normal distribution, where the mean is defined by a **quadratic function** of age, a + b * age + c * age^2 (3 parameters: a, b, and c), with variance sigma^2 (one parameter).
    * The variance sigma^2 is conditional on age: how variable is test performance given a particular value of age?
    * The quadratic function relates test performance to age and captures our theory of the curvilinear relationship between them
    * We expect to see a U-shaped or inverse-U-shaped relationship between age and test performance
    * This is a conditional model for test performance
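
The conditional model above can be sketched as an ordinary least-squares fit of performance on age and age^2. This is only an illustration: the data are simulated, the generating values (5.0, 0.25, -0.25, error variance 1.0) are assumptions, the simulated scores are not clipped to the [0, 8] range, and the helper names `solve_3x3` and `fit_quadratic` are hypothetical rather than anything used in the course.

```python
import random

def solve_3x3(A, y):
    """Solve A x = y for a 3x3 system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]  # augmented matrix
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_quadratic(age, perf):
    """Least-squares estimates of a, b, c and of the conditional variance sigma^2."""
    X = [[1.0, x, x * x] for x in age]  # design matrix: 1, age, age^2
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * y for row, y in zip(X, perf)) for i in range(3)]
    a, b, c = solve_3x3(XtX, Xty)
    residuals = [y - (a + b * x + c * x * x) for x, y in zip(age, perf)]
    # Residual variance: divide by n minus the 3 estimated coefficients.
    sigma2 = sum(r * r for r in residuals) / (len(age) - 3)
    return a, b, c, sigma2

# Simulated (hypothetical) data: standardized age and an inverse-U mean function.
rng = random.Random(1)
age = [rng.gauss(0, 1) for _ in range(400)]
perf = [5.0 + 0.25 * x - 0.25 * x * x + rng.gauss(0, 1.0) for x in age]

a_hat, b_hat, c_hat, sigma2_hat = fit_quadratic(age, perf)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}, c = {c_hat:.2f}, sigma^2 = {sigma2_hat:.2f}")
```

A negative estimate of c corresponds to the inverse-U shape; in practice a regression library would also report standard errors for a, b, and c.
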
    
![introduction](../img/data-performance.png)
    
In the first image, we see that the distribution looks fairly normal, and in the quantile-quantile (Q-Q) plot most of the points lie on the 45-degree line, which also suggests a normal distribution.
So the normal distribution is a reasonable model.


![introduction](../img/data-performance-age.png)    

We see that standardized age ranges from about -3 to 4, and the plot looks like an inverse U.

### Fitting the "Mean-Only" model
Fit a **regression model** to the performance data:

**Perf = m + e** (mean + error)

1st parameter: an unknown constant, m <br>
**m = marginal mean of test performance across all ages, regardless of age** <br>
**e = random error defining each observation's deviation from the overall mean m.** <br>
Not everyone is going to have test performance equal to the overall mean; there is going to be random variability around that overall mean, and the errors capture it.

2nd parameter: sigma^2. We assume that the errors are normally distributed with a mean of 0 and a variance of sigma^2.

![introduction](../img/mean-only-model.png)
![introduction](../img/mean-only-model-2.png)
We need to check the assumption that e ~ N(0, sigma^2). <br>
So we assess the fit.

![introduction](../img/assess-fit-mean-only-model.png)
Residuals: the estimated errors e
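
A rough sketch of this check, on simulated data, is to compute the residuals (each observation minus the estimated mean) and compare their ordered values with the quantiles a normal distribution would produce, which is the idea behind a Q-Q plot. The simulated scores, the generating values 5.0 and 1.5, and the helper `pearson` are assumptions for illustration only.

```python
import random
from statistics import NormalDist, mean, stdev

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical test performance scores (simulated, not the course data).
rng = random.Random(2)
perf = [rng.gauss(5.0, 1.5) for _ in range(300)]

# Mean-only model: Perf = m + e, so the residuals are perf - m_hat.
m_hat = mean(perf)
residuals = [y - m_hat for y in perf]
sigma_hat = stdev(residuals)

# Q-Q comparison: sorted residuals vs. quantiles of N(0, sigma_hat^2).
n = len(residuals)
observed = sorted(residuals)
theoretical = [sigma_hat * NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Points close to the 45-degree line give a correlation near 1.
qq_corr = pearson(observed, theoretical)
print(f"m_hat = {m_hat:.2f}, sigma_hat = {sigma_hat:.2f}, Q-Q correlation = {qq_corr:.3f}")
```

A Q-Q correlation well below 1, or a visibly curved Q-Q plot, would suggest that the normality assumption for the errors is questionable.
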

### Conditional Model
![introduction](../img/conditional-model.png)
a, b, and c are regression coefficients: they describe the relationship of age with performance
![introduction](../img/conditional-model-fit.png)

a: when age = 0, i.e. age equals the overall mean, we'd expect the test performance to be 5.11, with an SE of 0.10 <br>
b: the linear portion of this quadratic relationship is 0.24, with an SE of 0.06 <br>
c: describes the acceleration of performance as a function of age; in this case it is a deceleration of -0.26, with an SE of 0.03 <br>
If we look at the ratios of these estimates to their standard errors (to foreshadow testing hypotheses about regression coefficients), all of the coefficients appear to be non-zero <br>
Our estimate of sigma^2 is 1.29. This is a conditional variance: once we condition on age, how much unexplained variability is there in the test performance measures, as captured by the random errors? We see the curvilinear relationship in the fit of this model to the observed data. Visually it looks like a good fit, but we need to look at some diagnostics to see whether this fit is reasonable.
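
To make these estimates concrete, the sketch below plugs the reported values into the quadratic mean function and computes the estimate-to-SE ratios. The age at which predicted performance peaks, -b / (2c), is a derived quantity that follows from the quadratic form; it is not reported in the notes.

```python
# Estimates and standard errors reported for the fitted quadratic model.
a, b, c = 5.11, 0.24, -0.26
se_a, se_b, se_c = 0.10, 0.06, 0.03

def predicted_mean(age):
    """Predicted mean performance at a given standardized age."""
    return a + b * age + c * age ** 2

for age in (-2, -1, 0, 1, 2):
    print(f"age = {age:+d}: predicted mean performance = {predicted_mean(age):.2f}")

# The vertex of the quadratic: the standardized age with the highest predicted mean.
peak_age = -b / (2 * c)
print(f"predicted performance peaks at standardized age {peak_age:.2f}")

# Ratios of estimates to standard errors; values far from 0 suggest non-zero coefficients.
ratios = {"a": a / se_a, "b": b / se_b, "c": c / se_c}
for name, ratio in ratios.items():
    print(f"{name}: estimate / SE = {ratio:.1f}")
```
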


### Assessing the fit of the model
The dashed red line shows the predicted values from the fitted quadratic function.

![introduction](../img/assess-fit-model.png)

![introduction](../img/assess-fit-model-poor.png)

![introduction](../img/what-we-saw-1.png)


# Types of variables in statistical modelling

![introduction](../img/types-of-var.png)

![introduction](../img/dv-iv.png)

![introduction](../img/dv1.png)

![introduction](../img/iv1.png)

![introduction](../img/iv2.png)

![introduction](../img/control-variable.png)

In a non-randomized or observational design, the groups that define the independent variable may not be balanced. Randomization is a tool that we can use in study design to ensure that the values of all other variables that may be related to the dependent variable are equivalent, on average, between the two randomized groups (treatment and control).
<br>
We lose this control in an observational design.

![introduction](../img/control-var-2.png)

### Missing Data

![introduction](../img/missing-data.png)

![introduction](../img/missing-data-2.png)

# Different study designs generate different types of data: Implications for Modelling

![introduction](../img/where-data-come-from.png)

![introduction](../img/why-does-it-matter.png)

### Simple Random Sampling

![introduction](../img/srs.png)

![introduction](../img/srs-eg.png)

![introduction](../img/srs-eg-2.png)

### Cluster Sampling 

![introduction](../img/cluster-sample.png)

![introduction](../img/cluster-sample-eg.png)

### Longitudinal Data 

![introduction](../img/longitudinal-data.png)

### Dependent vs Independent data

![introduction](../img/dep-vs-indep.png)

# Objectives of model Fitting: Inference vs Prediction

### Two main objectives of Model Fitting

1. **Making Inference about relationships** between variables in a given data set

2. **Making predictions/forecasting future outcomes,** based on models estimated using historical data

## Making Inference 

![introduction](../img/making-inference-1.png)

![introduction](../img/making-inference-2.png)

![introduction](../img/making-inference-3.png)

![introduction](../img/making-inference-4.png)

![introduction](../img/making-inference-5.png)

![introduction](../img/making-inference-6.png)

![introduction](../img/making-inference-7.png)

![introduction](../img/making-inference-8.png)

![introduction](../img/making-inference-9.png)

![introduction](../img/making-inference-10.png)

![introduction](../img/making-inference-11.png)

## Making Predictions

![introduction](../img/making-predictions-1.png)

![introduction](../img/making-predictions-2.png)

![introduction](../img/making-predictions-3.png)

![introduction](../img/what-next.png)

In [1]:
50 + 25*30 + 2000*16

32800

# Mixed effects models: Is it time to go Bayesian by default?

[babies learning language](https://babieslearninglanguage.blogspot.com/2018/02/mixed-effects-models-is-it-time-to-go.html)

# Plotting Predictions and Prediction Uncertainty

![introduction](../img/overview.png)

![introduction](../img/blindly-fitting-models.png)

![introduction](../img/blindly-fitting-models-2.png)

![introduction](../img/blindly-fitting-models-3.png)

![introduction](../img/blindly-fitting-models-4.png)

![introduction](../img/estimation-uncertainty.png)

![introduction](../img/estimation-uncertainty-2.png)

## In video quizzes
### 1
Suppose that a researcher believes that a person’s income is a function of their age and years of education. The researcher collects measures of annual income, age, and years of education from a random sample of adults in a small community. The researcher wishes to fit a linear regression model that takes the following form: income = a + b × age + c × education + e, where the errors are assumed to follow a normal distribution with mean 0 and variance σ^2. Which of the following statements is true?


1. The predictors of interest are a, b, c, and e.


2. The predictors of interest are age, education, and σ^2.


3. The parameters of interest are age and education.


4. The parameters of interest are a, b, c, and σ^2.

Correct 
   * Answer: the correct answer is d). In fitting this model, the researcher wishes to estimate the coefficients a, b, and c (which collectively with the predictors, age and years of education, define the conditional mean of income), and the variance of the errors, σ^2.

### 2

Suppose that a researcher believes that a person’s income is a function of their age and years of education. The researcher collects measures of annual income, age, and years of education from a random sample of adults in a small community. The researcher wishes to fit a linear regression model that takes the following form: income = a + b × age + c × education + e, where the errors are assumed to follow a normal distribution with mean 0 and variance σ^2.

Which of the following statements is true?


1. The predictors of interest are a, b, c, and e.


2. The predictors of interest are age, education, and σ^2.


3. The parameters of interest are age and education.


4. The parameters of interest are a, b, c, and σ^2.

Correct 
   * Answer: the correct answer is d). In fitting this model, the researcher wishes to estimate the coefficients a, b, and c (which collectively with the predictors, age and years of education, define the conditional mean of income), and the variance of the errors, σ^2.

### 3

Recall from the last lecture that a researcher wishes to fit a linear regression model to a measure of annual income, modeling income as a function of age and education: income = a + b × age + c × education + e, where the errors are assumed to follow a normal distribution with mean 0 and variance σ^2.

How would one classify the variables being analyzed here?


1. Income is the observed independent variable, and age and education are the observed dependent variables.


2. Age, education, and e are the observed independent variables, and income is the observed dependent variable.


3. Income is the observed dependent variable, and age and education are the observed independent variables.

Correct 
   * Answer: c). The researcher wishes to fit a model to the income variable, modeling the mean of income (where the mean of that variable depends on the other variables) as a function of age and education. Recall that e is a random variable that is not actually observed; e is an error that captures the difference between the conditional mean of income based on the fitted model and the observed value of income. Recall also that a, b, and c are parameters, not variables, describing the relationships of age and education with income.

4. a, b, and c are the observed independent variables, and income is the observed dependent variable.

### 4

Suppose that the researcher from the last two lectures who is interested in modeling income only has a limited budget for data collection, and is unable to select a simple random sample of adults from the small community. Instead, the researcher draws a multistage sample, by first selecting blocks within the community at random, and then sampling households within the selected blocks. The researcher then proceeds to fit the regression model of interest, which recall is income = a + b × age + c × education + e, where the errors are assumed to follow a normal distribution with mean 0 and variance σ^2.


Which of the following statements is true? Select all that apply.


1. Observed income values within the same sampled block should be considered independent of each other because of the random sampling.

Un-selected is correct 

2. Because income values are clustered within blocks, the researcher should consider estimating additional parameters capturing the correlations between a, b, and c.

Un-selected is correct 

3. Because income values are clustered within blocks, the researcher should consider estimating additional parameters capturing the correlations of income values within the same sampled block.

Correct 
Answer: c). Income values within the same sampled cluster will tend to be correlated, and it is important to model these correlations; the correlations between the parameters being estimated are not the critical issue here. Furthermore, standard errors of the estimates (especially the intercept, which recall is a mean) will tend to become larger when accounting for the within-block correlations in income.

4. Standard errors of the estimated coefficients will tend to be smaller due to possible within-block correlations of the income values.

Un-selected is correct 
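
The point in the answer above, that standard errors tend to be larger once within-block correlation is accounted for, can be illustrated with a small simulation. Everything here is hypothetical: the number of blocks, the households per block, and the income values are made up, and the block-means calculation is a simple stand-in for a model that estimates within-block correlations, not the researcher's actual method.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)
n_blocks, per_block = 30, 20  # hypothetical sample design

# Incomes in the same block share a block effect, so they are correlated.
incomes_by_block = []
for _ in range(n_blocks):
    block_effect = rng.gauss(0, 5000)  # shared by every household in the block
    incomes_by_block.append(
        [40000 + block_effect + rng.gauss(0, 5000) for _ in range(per_block)]
    )

all_incomes = [y for block in incomes_by_block for y in block]

# Naive standard error of the mean: treats all 600 incomes as independent.
naive_se = stdev(all_incomes) / len(all_incomes) ** 0.5

# Cluster-aware standard error: with equal block sizes the overall mean equals
# the mean of the block means, so it is effectively based on 30 blocks.
block_means = [mean(block) for block in incomes_by_block]
cluster_se = stdev(block_means) / n_blocks ** 0.5

print(f"naive SE of mean income:         {naive_se:.0f}")
print(f"cluster-aware SE of mean income: {cluster_se:.0f}")
```

Ignoring the clustering understates the uncertainty in the estimated mean, which is why option 4 in the quiz is incorrect.
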

### 5

The researcher from the earlier lectures this week fits the regression model of interest to the data collected from the community, correctly accounting for the within-block correlations in the annual income values. Recall that the regression model is income = a + b × age + c × education + e, where the errors are assumed to follow a normal distribution with mean 0 and variance σ^2. The estimated parameters (standard errors in parentheses) are b = 25 (40), and c = 2000 (200). Which of the following inferential statements based on these results is correct? Select all that apply.


1. Age does not have a significant relationship with annual income when adjusting for years of education.

Correct 

2. Age has a significant positive relationship with annual income when adjusting for years of education.

Un-selected is correct 

3. Years of education has a significant positive relationship with income, where a one-year increase in education increases the expected mean income by 200 dollars.

Un-selected is correct

4. Years of education has a significant positive relationship with income, where a one-year increase in education increases the expected mean income by 2000 dollars.

Correct 

5. Years of education does not have a significant relationship with income.

Un-selected is correct 

Answers: a) and d). b) is not true, because the ratio of the estimated coefficient for age to its standard error is 25/40 = 5/8, which is small and does not suggest that this coefficient is different from zero (a ratio of 2 is generally considered “significant”). c) is not correct because the coefficient for years of education is 2000, not 200. d) is correct because the ratio of the estimated coefficient to its standard error is 10, and the estimate of the coefficient is equal to 2000. e) is not correct because of this large ratio.
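
The ratios used in this answer can be reproduced with a few lines of code. This is a sketch of the rule of thumb quoted above (a ratio of about 2 or more in absolute value is treated as "significant"); the dictionary layout and labels are just for illustration.

```python
# Estimates and standard errors from the question: (coefficient, SE).
estimates = {"age (b)": (25, 40), "education (c)": (2000, 200)}

ratios = {}
for name, (coef, se) in estimates.items():
    ratio = coef / se
    ratios[name] = ratio
    # Rule of thumb from the answer: |ratio| of about 2 or more is "significant".
    verdict = "significant" if abs(ratio) >= 2 else "not significant"
    print(f"{name}: {coef} / {se} = {ratio:.3f} -> {verdict}")
```
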

### 6

Recall that a regression model of interest is income = a + b × age + c × education + e. Suppose that all three estimates of the coefficients in the annual income model are a = 50, b = 25, and c = 2000. What is the expected mean income for a 30-year-old adult with 16 years of education?


$32,750


$32,800

Correct 
Answer: b). The expected mean income for this case is 50 + 25 x 30 + 16 x 2000 = $32,800.



$3,280


$32,000

### 7

Suppose that the estimated standard error of the expected mean income for the example adult from the fitted regression model in the previous lecture ($32,800) was $5,000. True or false: We have enough statistical evidence to reject a null hypothesis that the expected mean income for adults with these characteristics is $30,000.

(Hint: Recall one-sample t-tests from Course 2 in this specialization!)


True


False

Correct 
Answer: False. To test this null hypothesis, we could compute a t-statistic as follows: t = (32800 – 30000) / 5000 = 0.56. This is not enough evidence to reject the null hypothesis, regardless of the appropriate degrees of freedom for this test statistic (which have not been provided here).
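
The calculation in this answer can be sketched as follows. Because the degrees of freedom are not provided, the p-value below uses a standard normal approximation, which is an assumption added here for illustration; the cutoff of 2 is the rule of thumb mentioned in the answer to question 5.

```python
from statistics import NormalDist

estimate = 32800        # expected mean income from the fitted model
null_value = 30000      # value under the null hypothesis
standard_error = 5000   # estimated standard error of the expected mean

# Test statistic: how many standard errors the estimate is from the null value.
t_stat = (estimate - null_value) / standard_error

# Rough decision rule: a ratio of about 2 or more is considered significant.
reject_null = abs(t_stat) >= 2

# Approximate two-sided p-value, assuming a standard normal reference distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}, reject null: {reject_null}, approx. p-value = {p_value:.2f}")
```
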