# Linear Model: Mortality and Air Pollution

## Overview

In this lab you will build and analyze a linear model relating PM2.5 exposure to the excess death rate attributed to outdoor air pollution.

### Learning Objectives


*   Use software to compute a linear regression given data on PM2.5 exposure and excess deaths attributed to air pollution.
*   Determine the units of a slope in a linear model.
*   Use a slope in a model to create a description of proportional change.
*   Compute the output of a linear function for a given input and interpret the meaning of the result in the context of an application.
*   Use a slope to compute and reason about a proportional change.
*   Use a linear model to draw appropriate conclusions in an application and summarize the result in words.

## Introduction

Particulate matter (PM) is a constituent of outdoor (ambient) air
pollution. $\text{PM}_{2.5}$ refers to fine particulate matter with
particles less than 2.5 microns in diameter, about 30 times smaller than the diameter of a human hair. These particles are inhalable and harmful to human health. See the article "Burden of Cause-Specific Mortality Associated With PM2.5 Air Pollution in the United States" by Bowe B, Xie Y, Yan Y, and Al-Aly Z<a name="cite_ref-3"></a>[<sup>[3]</sup>](#cite_note-3) for more details.


Exposure to $\text{PM}_{2.5}$ varies by country. An interactive scatter plot of the death rates from particulate matter air pollution vs. $\text{PM}_{2.5}$ concentration in 2019 is shown below; it is also available at https://ourworldindata.org/grapher/death-rate-from-pm25-vs-pm25-concentration. The EPA maintains a [page](https://www.epa.gov/pm-pollution) with more resources on particulate matter pollution.


Let us define variables for the model:

*   Let $R$ denote the annual death rate due to outdoor air pollution. The
  variable $R$ has the units "deaths per 100,000 people." For
  example, if $R=15$ this means that, in a given year, there are 15
  deaths for every 100,000 people in the population.
*   Let $x$ denote the mean annual exposure to $\text{PM}_{2.5}$. The variable
  $x$ has the units "micrograms per cubic meter" ($\mu\text{g}/\text{m}^3$).

Below the interactive scatter plot is a second scatter plot with the data from 2020.
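
To make the units of $R$ defined above concrete, the short sketch below converts a rate in deaths per 100,000 people into a number of deaths for a whole population. The function name `deaths_from_rate` and the population of 2,000,000 are hypothetical, chosen only for illustration.

```python
def deaths_from_rate(rate_per_100k, population):
    """Convert a death rate (deaths per 100,000 people) into a number of deaths."""
    return rate_per_100k * population / 100_000

# Hypothetical example: R = 15 deaths per 100,000 people, population of 2,000,000
example_deaths = deaths_from_rate(15, 2_000_000)
print(example_deaths)  # 300.0
```

The same conversion applies to any rate expressed per 100,000 people: divide the population by 100,000 and multiply by the rate.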



## **1. Discuss with Your Group:**

- [ ]  View the data in the scatter plot below and discuss with your group: Try to describe in words how the death rate from air pollution varies with the exposure to PM2.5 particulate pollution.





In [1]:
#@title Death Rate From Particulate Matter Air Pollution vs PM2.5 Concentration, 2019
from IPython.display import IFrame
IFrame(src="https://archive.ourworldindata.org/20250624-125417/grapher/death-rate-from-pm25-vs-pm25-concentration.html",width="100%", height="600px")

## Creating a Linear Model

We've created a second scatter plot with data from 2020, along with a best fit line. The best fit line is the straight line that most closely follows the data. We can think of this linear model as a quantitative summary of the relationship between exposure to PM2.5 pollution and the mortality rate attributed to air pollution.



In [2]:
# @title Desmos Scatter Plot and Best Fit Line
from IPython.display import IFrame
IFrame(src="https://www.desmos.com/calculator/t7nctlbyh4?embed", width="100%", height="600px")

In [3]:
# @title Python Code for finding and plotting best fit line
import matplotlib.pyplot as plt
import numpy as np
from ipywidgets import interact
import ipywidgets as widgets


def FindBestFitLine(showbestfitline):
  datax = [46.09,15.71,25.55,6.72,9.08,25.15,19.7,14.91,30.58,8.25,10.93,21.73,18.94,58.5,42.38,23.73,14.61,11.22,28.65,51.05,7.34,23.95,22.88,26.60,18.91,12.18,7.6,17.15,58.47,29.65,24.14,39.81,6.57,42.32,34.43,49.14,23.28,34.81,14.18,11.93,28.57,14.31,49.50,16.12,21.24,13.51,14.11,26.65,9.07,35.92,20.67,19.83,17.41,16.66,54.86,20.03,35.38,32.35,6.15,18.78,27.31,12.42,4.9,9.6,29.95,58.36,17.04,10.29,54.24,14.41,6.56,24.66,9.58,21.64,43.77,50.21,25.46,21.35,20.31,14.14,5.11,48.39,17.88,32.29,38.18,8.17,18.60,14.66,17.41,12.84,28.77,20.03,24.39,11.26,53.72,24.38,22.32,11.6,19.23,23.44,42.36,32.97,9.22,8.67,12.75,23.7,16.19,12.01,56.78,11.72,10.73,70.82,9.15,14.99,12.11,14.76,9.58,29.65,17.98,21.32,20.08,32.32,19.99,6.06,45.72,10.89,6.49,16.81,85.12,56.53,29.93,26.12,9.71,6.06,39.58,43,7.033,26.36,11.47,17.31,10.09,27.04,20.29,17.96,8.45,7.19,75.66,14.85,11.27,31.29,8.59,23.76,23.93,12.55,10.69,27.34,53.15,63.74,22.51,8.5,43.22,13.87,15.39,14.29,13.95,24.53,23.75,25.94,30.74,9.58,19.96,45.83,27.31,5.64,9.06,24.71,37.05,25.08,31.01,51.67,12.48,25.60,24.18,21.61,19.6,5.92,33.81,14.9,36.31,9.91,7.81,8.63,10.57,31.96,14.08,15.26,20.80,34.83,24.31,19.49]
  dataR = [37,48.4,89.2,23.5,7.6,69.8,43.7,30.6,92.0,6.6,13.7,100.9,36.7,148,31.4,48.6,63.8,12.,49.1,37.3,6.8,66.3,48.4,72,62.5,22.5,14.6,69,35.7,15.2,38.5,45.7,4.9,94.2,28.5,42.1,26.4,99.1,22.6,10.1,70.4,18.1,62,39.2,47.3,27,32.7,19.1,10.7,71.7,58.0,55.0,44.7,30.8,236.5,35.9,118.5,31,5.4,74.1,14.9,70.8,2.1,7.8,107.1,50.2,52.6,12.7,78.2,24.8,10.6,71.1,17.6,38.1,39.1,51.7,87.7,24.4,41.5,39.5,2.3,82.1,64.6,73.0,152.8,8.4,19.3,16.0,35.0,9.8,58,89.5,22,17.6,76.3,54.8,62.1,30.5,41.5,44.2,27.8,94.6,22.1,8.1,11.5,14.9,50.2,23.2,37.3,14.6,39.1,81.3,22.9,36.7,38.3,36.8,12.7,91.8,82.7,91.5,15.4,78.7,73.6,34.8,54.6,12.8,4.9,16.1,38.7,65.8,29.9,120.5,28.3,3.4,115.9,77.2,29.5,87.1,16.5,35.7,22.0,37.2,57.1,41.2,8.8,7.1,108.2,45.3,36.7,13.7,20.8,48.9,64.0,38.6,6.5,49.5,132.1,38.1,82.2,17.2,32.5,15.8,44.7,17.7,23,15.8,68.6,27.7,25.6,8.9,46.1,90.9,68.7,3.0,7.0,121.8,65.2,20.7,51.1,41.8,32.4,65.1,67.9,65,103.7,12.6,21,67.7,112.3,11.5,9.5,13,16.8,125.9,25.9,42,51.3,82.4,35.1,24.3]

  m, b = np.polyfit(datax, dataR, 1) # Determine the slope and intercept of the best fit line through the given data.
  best_fit_line_x_values = np.arange(5, 85) # Create an array of x values from 5 to 84 to assist in graphing the best fit line.
  best_fit_line_y_values = m * best_fit_line_x_values + b # Compute the y value on the best fit line for each x value above.
    
  plt.xlabel("PM2.5 air pollution, mean annual exposure ($\\mu$g/m$^3$)")
  plt.ylabel("Death rate from outdoor particulate matter air pollution (deaths per 100,000 people)")
  plt.title("Death rate from outdoor particulate matter vs. PM2.5 concentration, 2020")
  plt.scatter(datax, dataR, color='blue', label='Data Points') #This line graphs the data as a scatter plot
  if showbestfitline:
    plt.plot(best_fit_line_x_values, best_fit_line_y_values, color='red', label=f'Best-Fit Line: R = {m:.2f}x + {b:.2f}') # Plot the points on the best fit line and connect them with line segments to draw the line.
  plt.legend()
  plt.show()



interact(FindBestFitLine,
         showbestfitline=widgets.Checkbox(value=True, description='Best Fit Line'));


**Note:** If you right-click on any code block and select _explain code_, this will generate an AI explanation of the code. We encourage you to use this feature throughout the semester to get a better understanding of what is going on behind the scenes.

## Background on the Best Fit Line

The best fit line is the straight line that most closely follows the data. It is typically found by linear regression using the least squares method. The differences between the observed values and the values predicted by the model are called residuals. Each residual corresponds to the vertical line segment from a data point to the line. The best fit line is the line that minimizes the sum of the squares of all the residuals. Finding a best-fit line is an **optimization problem** and an application of Calculus techniques we will learn in Math 140B.
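
As a rough illustration of the least squares idea, the sketch below fits a line to a small, made-up data set and checks that nudging the fitted slope or intercept makes the sum of squared residuals larger. The data values and the helper name `sum_squared_residuals` are hypothetical and are not part of the lab's data.

```python
import numpy as np

# Hypothetical toy data (not the lab's data set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_residuals(m, b):
    """Sum of squared vertical differences between the data and the line y = m*x + b."""
    residuals = y - (m * x + b)
    return np.sum(residuals ** 2)

# Least squares slope and intercept
m_fit, b_fit = np.polyfit(x, y, 1)
ssr_fit = sum_squared_residuals(m_fit, b_fit)

# Moving the slope or the intercept away from the fitted values increases the sum
ssr_steeper = sum_squared_residuals(m_fit + 0.1, b_fit)
ssr_higher = sum_squared_residuals(m_fit, b_fit + 0.1)

print(f"fitted line: y = {m_fit:.2f}x + {b_fit:.2f}")
print(ssr_fit < ssr_steeper, ssr_fit < ssr_higher)  # True True
```

Any other choice of slope and intercept gives a larger sum for this data, which is what "best fit" means in the least squares sense.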

<!-- ![bestfitlineR.png](bestfitlineR.png) -->


## Our Linear Model

Let us define variables for our linear model:

*   Let $R$ denote the annual death rate attributed to outdoor air pollution. The
  variable $R$ has the units "deaths per 100,000 people." For
  example, if $R=15$ this means that, in a given year, there are 15
  deaths for every 100,000 people in the population.
*   Let $x$ denote the mean annual exposure to $\text{PM}_{2.5}$. The variable
  $x$ has the units "micrograms per cubic meter" ($\mu\text{g}/\text{m}^3$).


Based on the available data, the best linear model relating exposure to $\text{PM}_{2.5}$ and the death rate from outdoor particulate matter is $R(x)=1.22x+16.88$.
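
One way to work with this model in code is to write it as a Python function. The sketch below does that and evaluates it at a few sample exposures; the sample values 10, 20, and 30 are arbitrary choices for illustration.

```python
def R(x):
    """Modeled death rate (deaths per 100,000 people) at exposure x (micrograms per cubic meter)."""
    return 1.22 * x + 16.88

# Sample exposures chosen only for illustration
for x in [10, 20, 30]:
    print(f"R({x}) = {R(x):.2f} deaths per 100,000 people")
```

Each output is a death rate, so it carries the same units as $R$.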


1. The World Health Organization guidelines recommend $\text{PM}_{2.5}$ exposures of 5 $\mu\text{g}/\text{m}^3$ or less<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1). Rewrite the formula for $R(x)$ in point-slope form using an $x$ value of 5 $\mu\text{g}/\text{m}^3$.

2. Determine the units of the slope and explain what the slope represents.
   
3. Use your model to compute $R(50)$ and write a one-sentence
   interpretation of this result. Your interpretation should be in the
   context of the application and should include units.  

4. Since the passage of the Clean Air Act<a name="cite_ref-2"></a>[<sup>[2]</sup>](#cite_note-2), the mean annual
   exposure to $\text{PM}_{2.5}$ in the U.S. has fallen from 33.5 $\mu\text{g}/\text{m}^3$ to 7.1 $\mu\text{g}/\text{m}^3$. Use this change in exposure to estimate the avoided deaths per year, based on the U.S. population. You'll need to use an estimate of the present U.S. population.

5. Write a 1-2 sentence summary of the results of your analysis from part 4.

6. Given that $R(x)=1.22x+16.88$, determine a formula for $R(x+\Delta x)$. Then determine a formula for $\Delta R=R(x+\Delta x)-R(x)$ in terms of $\Delta x$. That is, find a formula for $\Delta R(\Delta x)$.

7.  Use the formula found in part 6 to double-check your answer from part 4.

8.  The widget below will compute $\Delta R$ values for given $\Delta x$ values. Use the widget to explore the relationship between $\Delta R$ and $\Delta x$ and then write a 1-2 sentence summary of the results of your analysis.



## **2. Point-Slope Form:**

The World Health Organization guidelines recommend $\text{PM}_{2.5}$ exposures of 5 $\mu\text{g}/\text{m}^3$ or less<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1). Rewrite the formula for $R(x)$ in point-slope form using an $x$ value of 5 $\mu\text{g}/\text{m}^3$. You can use the interactive plot above to determine the value of $R$ when $x=5$ in the linear model.
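
As a reminder of how point-slope form works, the sketch below converts a hypothetical line, $y = 2x + 3$, into point-slope form at $x_0 = 4$ and checks that both forms give the same outputs. The line, the point, and the helper name `to_point_slope` are illustrative assumptions, not the model from this lab.

```python
def to_point_slope(m, b, x0):
    """Return the point (x0, y0) on y = m*x + b, for use in y = y0 + m*(x - x0)."""
    y0 = m * x0 + b
    return x0, y0

m, b = 2.0, 3.0  # hypothetical line y = 2x + 3
x0, y0 = to_point_slope(m, b, 4.0)
print(f"y = {y0} + {m}(x - {x0})")  # y = 11.0 + 2.0(x - 4.0)

# Both forms describe the same line, so they agree at every x
for x in [0.0, 1.5, 10.0]:
    assert m * x + b == y0 + m * (x - x0)
```

The slope is unchanged; only the reference point moves from the $y$-intercept to the chosen point on the line.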

## **3. Units of the Slope and Proportional Change**

Identify the slope in the linear model and determine the correct units of the slope. The slope in a linear model is a representation of proportional change. It tells us how a change in one variable is related to a change in the other. Use this idea to discuss the meaning of the slope in this particular example.



## **4. Proportional Change**

The slope in a linear model is a representation of proportional change. It tells us how a change in the dependent variable is related to a change in the independent variable. This relationship is reflected in the **slope formula**: 

$$m = \frac{\Delta R}{\Delta x}$$

Use algebra to rearrange the slope formula to determine the formula for $\Delta R$ in terms of the slope and $\Delta x$.

Use your formula to discuss the meaning of the slope in this model with your group.
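
The slope formula can also be checked numerically. The sketch below picks a few arbitrary pairs of exposures, computes $\Delta R / \Delta x$ from the model $R(x)=1.22x+16.88$ for each pair, and shows that the ratio is the same every time. The pairs of $x$ values are hypothetical choices.

```python
def R(x):
    """Modeled death rate (deaths per 100,000 people) at exposure x."""
    return 1.22 * x + 16.88

# Arbitrary pairs of exposures (x1, x2), chosen only for illustration
pairs = [(5, 10), (10, 40), (20, 21)]

slopes = []
for x1, x2 in pairs:
    delta_x = x2 - x1
    delta_R = R(x2) - R(x1)
    slopes.append(delta_R / delta_x)
    print(f"from x = {x1} to x = {x2}: delta R / delta x = {delta_R / delta_x:.2f}")
```

For a linear model the ratio does not depend on which two points are used, which is why a single number can describe the proportional change.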

## **5. Computation and Interpretation**

Use the linear model to compute $R(50)$ and write a one-sentence interpretation of this result. Your interpretation should be in the context of the application and should include units.

## **6. Using the Slope to Compute a Proportional Change**

The widget below will compute $\Delta R$ values for given $\Delta x$ values. Use the widget to explore the relationship between $\Delta R$ and $\Delta x$ and then write a 1-2 sentence summary of the results of your analysis.


In [None]:
# @title Python Code to compute changes in death rate by change in PM2.5 exposure
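def DeltaR(z):
    # Change in death rate (deaths per 100,000 people) for a change z in PM2.5 exposure (µg/m^3)
    return z * 1.22

Deltax = -26.4  # Change in PM2.5 exposure (µg/m^3); edit this value to explore

print(round(DeltaR(Deltax), 2))

-32.21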
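
To compare several changes in exposure at once, a loop can tabulate $\Delta R$ for a list of $\Delta x$ values. This is a minimal sketch; the names `SLOPE` and `delta_r` and the list of $\Delta x$ values are assumptions made for illustration.

```python
SLOPE = 1.22  # deaths per 100,000 people per microgram per cubic meter, from the model

def delta_r(delta_x):
    """Change in modeled death rate for a change delta_x in PM2.5 exposure."""
    return SLOPE * delta_x

# Hypothetical changes in exposure, in micrograms per cubic meter
for dx in [-20, -10, -5, 5, 10, 20]:
    print(f"delta x = {dx:+d}  ->  delta R = {delta_r(dx):+.2f} deaths per 100,000 people")
```

Doubling $\Delta x$ doubles $\Delta R$, and reversing the sign of $\Delta x$ reverses the sign of $\Delta R$.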


## **7. Applying the Model: California's Above-Average Exposure**

The EPA estimates that the mean annual $\text{PM}_{2.5}$ exposure in the U.S. is 8.6 $\mu\text{g}/\text{m}^3$<a name="cite_ref-5"></a>[<sup>[5]</sup>](#cite_note-5), but there is variation in exposure by geographic region. The mean annual $\text{PM}_{2.5}$ exposure in California is estimated to be 12.7 $\mu\text{g}/\text{m}^3$, partly due to the prevalence of wildfires.

The population of California is approximately 39 million people. Estimate the excess deaths in California due to the higher $\text{PM}_{2.5}$ relative to the U.S. average of 8.6 $\mu\text{g}/\text{m}^3$.
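
An estimate of this kind combines two steps: use the slope to turn a difference in exposure into a difference in death rate, then scale that rate to the size of the population. The sketch below shows the pattern for a hypothetical region with an exposure of 12.0 $\mu\text{g}/\text{m}^3$, a baseline of 10.0 $\mu\text{g}/\text{m}^3$, and 1,000,000 people. These numbers and the function name `excess_deaths` are illustrative assumptions, not the values for this problem.

```python
SLOPE = 1.22  # deaths per 100,000 people per microgram per cubic meter, from the model

def excess_deaths(exposure, baseline_exposure, population):
    """Estimate annual excess deaths from an exposure above a baseline level."""
    delta_x = exposure - baseline_exposure    # difference in exposure
    delta_rate = SLOPE * delta_x              # difference in deaths per 100,000 people
    return delta_rate * population / 100_000  # scale the rate to the whole population

# Hypothetical region: exposure 12.0 vs. baseline 10.0, population 1,000,000
estimate = excess_deaths(12.0, 10.0, 1_000_000)
print(round(estimate, 1))  # 24.4
```

A negative result from the same function would correspond to avoided deaths, since the exposure would then be below the baseline.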

## **8. Applying the Model: PM2.5 Exposure in India**

According to the data at https://ourworldindata.org/grapher/average-exposure-pm25-pollution, India had an average $\text{PM}_{2.5}$ exposure of 62 $\mu\text{g}/\text{m}^3$ in 1990 and 48.39 $\mu\text{g}/\text{m}^3$ in 2020. The population of India was 1.4 billion in 2020. Estimate the **avoided** $\text{PM}_{2.5}$-related deaths in 2020 in India due to the decrease in $\text{PM}_{2.5}$ exposure.

More details about the impact of $\text{PM}_{2.5}$ air pollution in India can be found in the article by Chatterjee, McDuffie, Smith, et al.<a name="cite_ref-4"></a>[<sup>[4]</sup>](#cite_note-4). This study found that cooking indoors with solid fuels contributed the largest share of air pollution-related mortality.



<a name="cite_note-1"></a>1. [^](#cite_ref-1)https://www.who.int/news-room/feature-stories/detail/what-are-the-who-air-quality-guidelines

<a name="cite_note-2"></a>2. [^](#cite_ref-2)https://www.epa.gov/clean-air-act-overview

<a name="cite_note-3"></a>3. [^](#cite_ref-3)Bowe B, Xie Y, Yan Y, Al-Aly Z. Burden of Cause-Specific Mortality Associated With PM2.5 Air Pollution in the United States. JAMA Netw Open. 2019;2(11):e1915834. doi:10.1001/jamanetworkopen.2019.15834 https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2755672

<a name="cite_note-4"></a>4. [^](#cite_ref-4)Deepangsu Chatterjee, Erin E. McDuffie, Steven J. Smith, Liam Bindle, Aaron van Donkelaar, Melanie S. Hammer, Chandra Venkataraman, Michael Brauer, and Randall V. Martin
Environmental Science & Technology 2023 57 (28), 10263-10275
DOI: 10.1021/acs.est.2c07641 https://pubs.acs.org/doi/10.1021/acs.est.2c07641

<a name="cite_note-5"></a>5. [^](#cite_ref-5) America's Health Rankings analysis of U.S. Environmental Protection Agency, United Health Foundation, AmericasHealthRankings.org, accessed 2025. https://www.americashealthrankings.org/explore/measures/air