<a href="https://colab.research.google.com/github/DavidSenseman/BIO5853/blob/main/Lesson_03_11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------

**COPYRIGHT NOTICE:** This Jupyterlab/Colab notebook is a companion supplement to the textbook _Principles of Biostatistics_ by M. Pagano, K. Gauvreau and H. Mattie (3rd ed) published in 2022 by CRC Press. It is designed to be used in conjunction with -- not as a standalone substitute for -- this textbook.  

This notebook is licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at
>http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 5853: Biostatistics**

##### **Module 3: Inference**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)


### Module 3 Material

* Part 3.1: Confidence Intervals
* Part 3.2: Hypothesis Testing
* Part 3.3: Comparison of Two Means
* Part 3.4: Analysis of Variance (ANOVA)
* Part 3.5: Nonparametric Methods
* Part 3.6: Inference on Proportions
* Part 3.7: Contingency Tables
* Part 3.8: Correlation
* Part 3.9: Simple Linear Regression
* Part 3.10: Multiple Linear Regression
* Part **3.11: Logistic Regression**
* Part 3.12: Survival Analysis

## Google CoLab Instructions

The following code checks whether the notebook is running in Google CoLab. If it is, running the code will map your GDrive to ```/content/drive``` and print the email address of the authenticated account.

In [None]:
# YOU MUST RUN THIS CODE CELL FIRST

try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

# **Part 3.11: Logistic Regression**

**Logistic regression** is a statistical method used to model the relationship between a binary dependent variable (an outcome that has two possible values, such as 0 or 1) and one or more independent variables (predictors). It is particularly useful when the outcome of interest is categorical, such as "disease" vs. "no disease" or "success" vs. "failure."

### Why is Logistic Regression Important for Biostatistics?

1. **Predicting Binary Outcomes**: In biostatistics, logistic regression is often used to predict the probability of a binary outcome, such as whether a patient has a disease or not based on various predictors like age, gender, and other risk factors.

2. **Odds Ratios**: Logistic regression provides odds ratios, which are measures of association between the predictors and the outcome. This helps in understanding the strength and direction of the relationship between variables.

3. **Handling Multiple Predictors**: It can handle multiple predictors, both continuous and categorical, making it versatile for complex biomedical data.

4. **Interpretable Results**: The results of logistic regression are relatively easy to interpret, which is crucial for making informed decisions in healthcare and medical research.

5. **Modeling Probabilities**: Unlike linear regression, logistic regression models probabilities that are constrained between 0 and 1, making it suitable for binary outcomes.

6. **Generalized Linear Models**: Logistic regression is a type of generalized linear model, which extends the linear model to allow for response variables that have error distribution models other than a normal distribution.


### **Introduction**

When studying linear regression, we estimate a population regression equation

$$ \mu_{y|x_1, x_2, \ldots, x_q} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q $$


by fitting a model of the form

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q + \epsilon
$$

The response $Y$ is continuous, and is assumed to follow a normal distribution. We are concerned
with predicting or estimating the mean value of the response corresponding to a given set of values for the explanatory variables.

There are many situations, however, in which the response of interest is _dichotomous_ rather than continuous. Examples of variables that assume only two possible values are disease status (disease is either present or absent) and survival following surgery (a patient is either alive or dead). In general,
the value 1 is used to represent a “success,” or the outcome we are most interested in, and 0 represents a “failure.” The mean of the dichotomous random variable Y, designated p, is the proportion of times that Y takes the value 1. Equivalently,

$$ p = P(Y=1) = P(\text{“success”}). $$

Just as we estimate the mean value of the response when Y is continuous, we would like to be able to estimate the probability p associated with a dichotomous response for various values of an explanatory variable. To do this, we use a technique known as _logistic regression_.
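
To make the link between the mean of a 0/1 variable and a proportion concrete, here is a minimal sketch. The list `outcomes` is made up for illustration and is not taken from the hyponatremia data.

```python
# Hypothetical 0/1 outcomes for ten subjects (1 = "success", 0 = "failure")
outcomes = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# The mean of a 0/1 variable ...
mean_y = sum(outcomes) / len(outcomes)

# ... equals the proportion of observations that are equal to 1
prop_success = outcomes.count(1) / len(outcomes)

print(f"Mean of Y: {mean_y:.2f}")
print(f"Proportion of successes: {prop_success:.2f}")
```

Both lines print the same number, which is why estimating $p$ plays the same role for a dichotomous response that estimating the mean plays for a continuous one.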



## **Dataset for this Lesson**

In this lesson we will use a single dataset, which we need to download from the course file server and store in a DataFrame.

### Example 1: Read Datafile

We will be using a datafile called `hyponatremia.csv` stored on the course HTTPS server. As the file is read, the data is stored in a DataFrame called `hypoDF`.

_Data Description:_

**Hyponatremia** is a condition where the sodium levels in your blood are abnormally low. Sodium is an essential electrolyte that helps regulate water balance in and around cells, maintain normal blood pressure, and support nerve and muscle function. When sodium levels drop, it can cause water to move into cells, leading to swelling. Hyponatremia is more likely in female marathon runners than in male runners due to several factors:

1. **Body Composition**: Women generally have a higher percentage of body fat and lower muscle mass compared to men. This can affect fluid and electrolyte balance during prolonged exercise.
2. **Hormonal Differences**: Female hormones, particularly estrogen, can influence fluid regulation and sodium balance in the body.
3. **Fluid Intake**: Women may be more likely to drink excessive amounts of water during a marathon, especially if they are concerned about dehydration. This can dilute sodium levels in the blood.

These factors combined make female marathon runners more susceptible to developing hyponatremia during long-distance races.

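This particular dataset contains one row per runner for the 488 Boston Marathon finishers described below. The columns used in this lesson are `hyponatremia` (1 = diagnosed, 0 = not diagnosed), `wt_gain` (weight gain during the race, in pounds) and `female` (1 = female, 0 = male).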

In [None]:
# Example 1: Read datafile

import pandas as pd

# Read datafile and create DataFrame
hypoDF = pd.read_csv(
    "https://biologicslab.co/BIO5853/data/hyponatremia.csv",
    index_col=0,
    sep=',',
    na_values=['NA','?'])

# Set max rows and max columns
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(hypoDF)

If the code is correct, you should see the following output:

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image01.png)

Note that some of the values are `NaN`, which stands for "Not a Number" and marks a missing value. We will have to take care of these `NaNs` when analyzing the DataFrame `hypoDF`.
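
As a rough illustration of what handling missing values involves, the sketch below counts and drops `NaNs` in a small made-up DataFrame. The name `toyDF` and its values are hypothetical stand-ins; only the column names mirror those in `hypoDF`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for hypoDF (values are made up)
toyDF = pd.DataFrame({
    'hyponatremia': [0, 1, 0, 1, 0],
    'wt_gain': [-2.0, 1.5, np.nan, 0.5, np.nan],
})

# Count the missing values in each column
n_missing = toyDF.isna().sum()
print(n_missing)

# Drop rows with a missing wt_gain; toyDF itself is left unchanged
cleanDF = toyDF.dropna(subset=['wt_gain'])
print(f"Rows before: {len(toyDF)}, rows after: {len(cleanDF)}")

# pandas skips NaN by default when averaging, but plain NumPy does not
print(f"pandas mean: {toyDF['wt_gain'].mean():.2f}")
print(f"NumPy mean:  {np.mean(toyDF['wt_gain'].values)}")
```

The last two lines show why missing values need attention: the same column can give a usable answer or `nan` depending on which function does the arithmetic.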

## **The Model**

Among marathon runners, hyponatremia (defined as a decrease in blood sodium level to a value less than or equal to 135 millimoles per liter) can cause life-threatening illness and, in extreme cases, death. In a sample of 488 adults who completed the Boston Marathon and who are considered to be representative of the larger population of runners who complete marathons, 62 were diagnosed with hyponatremia [287]. Let $Y$ be a dichotomous random variable for which the value 1 represents a diagnosis of hyponatremia in a runner and 0 no such diagnosis. We would estimate the probability that a runner develops hyponatremia by the sample proportion

$$ \hat{p}= \frac{62}{488} = 0.127 $$

Overall, 12.7% of runners in the sample are diagnosed with this condition.



### Example 2: Compute Probability

The code in the cell below shows how you would compute the $\hat{p}$ value shown in the cell above.

_Code Description:_

To compute the total number of runners with hyponatremia, the following code chunk is used:

```python
# Define the number of runners with hyponatremia
num_hyponatremia = sum(hypoDF['hyponatremia'])
```

This works because a runner diagnosed with hyponatremia is assigned the value `1` in the DataFrame, and every other runner is assigned `0`. Adding up all of the `1s` therefore gives the total number of runners diagnosed with hyponatremia.

In [None]:
# Example 2: Compute probability

import pandas as pd

# Define the number of runners with hyponatremia
num_hyponatremia = sum(hypoDF['hyponatremia'])

# Compute probability
prob_hyponatremia = num_hyponatremia/len(hypoDF['hyponatremia'])

# Print out the result
print(f"The predicted probability (p̂) is:{prob_hyponatremia:.3f}")

If the code is correct, you should see the following output:

~~~text
The predicted probability (p̂) is:0.127
~~~

### **Scatter Plot of Weight Gain and Hyponatremia**

We might suspect there are certain factors which affect the likelihood that a particular individual will develop hyponatremia. If we could classify a runner according to these characteristics, it might be possible to calculate a more informative estimate of their probability of developing hyponatremia.

Since it was hypothesized that excessive fluid consumption during the race might be associated with development of hyponatremia, for example, one factor of interest is a runner’s change in weight from the beginning of the marathon to its end. If the response Y were continuous, we would begin an analysis by constructing a scatter plot of the response versus the continuous explanatory variable.

A graph of hyponatremia versus weight gain in pounds is displayed in **Figure 19.1** for the 455 individuals for whom weight gain was measured. Note that all points lie on one of two parallel lines, depending on whether Y takes the value 0 or 1. There does appear to be a tendency for individuals who develop hyponatremia to have higher weight gain, on average, while those who do not develop hyponatremia have lower weight gain. There is a lot of overlap, however, and the nature of this is not clear from the graph.


  

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image02.png)

**FIGURE 19.1**  Diagnosis of hyponatremia versus weight gain during the race for the 455 marathon runners whose weight gain was measured

### Example 3: Two-Way Scatter Plot

The code in the cell below shows how to recreate **Figure 19.1** using Python.

_Code Description:_

1. **Import Libraries**:
    ```python
    import matplotlib.pyplot as plt
    import numpy as np
    ```
    - `matplotlib.pyplot` is used for creating plots.
    - `numpy` is imported for numerical operations, although it is not actually used in this snippet.

2. **Assuming `hypoDF` DataFrame**:
    ```python
    # Assuming hypoDF is already defined as a DataFrame
    x = hypoDF['wt_gain'].values
    y = hypoDF['hyponatremia'].values
    ```
    - `hypoDF` is assumed to be a DataFrame with columns `wt_gain` and `hyponatremia`.
    - `x` contains the values of weight gain during the race.
    - `y` contains the values indicating the diagnosis of hyponatremia.

3. **Create Plotting Environment**:
    ```python
    fig, ax = plt.subplots()
    ```
    - `fig` and `ax` are created for plotting. `fig` is the figure object, and `ax` is the axes object where the plot will be drawn.

4. **Define Color**:
    ```python
    color_1 = '#15466d'
    ```
    - A color is defined for the scatter plot points.

5. **Create Scatter Plot**:
    ```python
    scatter = ax.scatter(x, y, facecolors='none', edgecolors=color_1)
    ```
    - A scatter plot is created with `x` and `y` values.
    - `facecolors='none'` leaves the interior of each marker unfilled, so the points are drawn as hollow circles.
    - `edgecolors=color_1` sets the color of the point edges to `#15466d`.

6. **Set X-axis Limits**:
    ```python
    ax.set_xlim(-8, 6)
    ```
    - The x-axis limits are set from -8 to 6.

7. **Set Y-axis Ticks and Limits**:
    ```python
    ax.set_yticks([0, 1])
    ax.set_ylim(-0.1, 1.1)
    ```
    - The y-axis ticks are set to show only 0 and 1.
    - The y-axis limits are set from -0.1 to 1.1 to ensure ticks are visible.

8. **Add Labels**:
    ```python
    ax.set_xlabel('Weight gain during race (lb)', fontsize=12)
    ax.set_ylabel('Diagnosis of hyponatremia', fontsize=12)
    ```
    - X-axis and y-axis labels are added with a font size of 12.

9. **Display the Plot**:
    ```python
    plt.show()
    ```
    - The plot is displayed.

#### Summary

- This code creates a scatter plot with weight gain during the race on the x-axis and the diagnosis of hyponatremia on the y-axis.
- The points are displayed with no face color and colored edges.
- The x-axis ranges from -8 to 6, and the y-axis shows only the ticks for 0 and 1.
- Labels are added to both axes for clarity.


In [None]:
# Example 3: 2-way scatter plot

import matplotlib.pyplot as plt
import numpy as np

# Assuming hypoDF is already defined as a DataFrame
x = hypoDF['wt_gain'].values
y = hypoDF['hyponatremia'].values

# Create plotting environment
fig, ax = plt.subplots()

# Define color
color_1 = '#15466d'

# Create the scatter plot with hollow (unfilled) markers
scatter = ax.scatter(x, y, facecolors='none', edgecolors=color_1)

# Set x-axis limits
ax.set_xlim(-8,6)

# Set y-axis ticks to only 0 and 1
ax.set_yticks([0, 1])
# Set y-axis limits to ensure ticks are visible
ax.set_ylim(-0.1, 1.1)

# Adding labels
ax.set_xlabel('Weight gain during race (lb)', fontsize=12)
ax.set_ylabel('Diagnosis of hyponatremia', fontsize=12)

# Display the plot
plt.show()


If the code is correct, you should see the following plot:

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image03.png)



Since the two-way scatter plot is not particularly helpful, we might instead explore whether an association exists between a diagnosis of hyponatremia and weight gain during the race by arbitrarily dividing the runners who had their weight gain recorded into three groups with similar numbers of people in each category: those losing at least 1.36 pounds, those losing less than 1.36 pounds, but not gaining more than 0.01 pounds, and those gaining 0.01 pounds or more. We could then estimate the probability that an individual will develop hyponatremia in each of these subgroups individually.


![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image09.png)

The estimated probability of hyponatremia increases as the amount of weight gain increases, from a low of 0.038 for runners who lose the most weight to a high of 0.281 for those who gain weight. Since there does appear to be a relationship between these two variables, we would like to be able to use a runner’s weight gain during the race to help us predict the likelihood that they will develop hyponatremia.

### Example 4: Compute Probabilities

The code in the cell below recreates the table shown in the cell above. Note that it is not really feasible to reproduce the textbook tables _exactly_ as they appear after being typeset.

_Code Description:_

For this example, we will need to drop rows (runners) for which no measurement of `wt_gain` was recorded. These missing values are shown as `NaN`. Rather than drop rows from our DataFrame `hypoDF`, we will make a copy called simply `df` and drop the rows from this copy.

In [None]:
# Example 4: Compute probabilities

import pandas as pd

# Define function to categorize weight gain
def categorize_wt_gain(value):
    if value <= -1.36:
        return '<= -1.36'
    elif -1.36 <= value < 0.01:
        return '-1.36 to < 0.01'
    else:
        return '>= 0.01'

# Make a copy before dropping rows
df = hypoDF.copy()

# Drop rows (runners) with no weight gain measurements
df.dropna(subset=['wt_gain'], inplace=True)

# Apply the function to create a new column for categories
df['Weight Gain'] = df['wt_gain'].apply(categorize_wt_gain)

# Group by the weight gain category
grouped = df.groupby('Weight Gain')

# Calculate the number of runners in each category
num_runners = grouped.size()

# Calculate the number of runners with hyponatremia in each category
num_hyponatremia = grouped['hyponatremia'].sum()

# Calculate the probability of hyponatremia in each category
prob_hyponatremia = num_hyponatremia / num_runners

# Create a summary DataFrame
summary_df = pd.DataFrame({
    'No. Runners': num_runners,
    'No. with Hyponat.': num_hyponatremia,
    'Probability (p-hat)': prob_hyponatremia.round(3)
})

# Sort the DataFrame to ensure '<= -1.36' is first
sorted_summary_df = summary_df.reindex(['<= -1.36', '-1.36 to < 0.01', '>= 0.01'])

# Display the sorted summary DataFrame
print(sorted_summary_df)


If the code is correct, you should see the following output:

~~~text
                 No. Runners  No. with Hyponat.  Probability (p-hat)
Weight Gain                                                         
<= -1.36                 159                  6                0.038
-1.36 to < 0.01          161                 13                0.081
>= 0.01                  135                 38                0.281
~~~

The output matches the values shown in the textbook:

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image09.png)

Notice that the number of runners is now only 455 instead of the original 488 since we dropped runners (rows) that contained `NaN` for their `wt_gain`.


# **Logistic Function**

Our first strategy might be to fit a model of the form

$$ p = \beta_0 + \beta_1 x $$

where $x$ represents weight gain. This is simply the standard linear regression model in which $y$, the outcome of a continuous, normally distributed random variable, has been replaced by $p$. As before, $\beta_0$ is the intercept of the line and $\beta_1$ is its slope. On inspection, however, this model is not feasible. Since $p$ is a probability, it is restricted to taking values between `0` and `1`. The term $\beta_0 + \beta_1 x$, in contrast, could easily yield a value that lies outside this range.

We might try to solve this problem by fitting the model

$$ p = e^{\beta_0 + \beta_1 x} $$

This equation guarantees that the estimate of $p$ is positive. We would soon realize, however, that this model is also unsuitable. Although the term $e^{\beta_0 + \beta_1 x}$ cannot produce a negative estimate of $p$, it can result in a value that is greater than 1. To accommodate this additional constraint, we consider a model of the form

$$ p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} $$

The expression on the right, called a _logistic function_, cannot yield a value that is either negative or greater than 1. Consequently, it restricts the estimated value of p to the required range.
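
The short sketch below compares the three candidate models numerically. It plugs the fitted coefficients reported later in this lesson into all three formulas purely for illustration (they were estimated for the logistic form only), and the weight-gain values in `x` are arbitrary.

```python
import numpy as np

# Fitted coefficients reported later in this lesson (illustration only)
beta_0, beta_1 = -1.8849, 0.7284

# A few arbitrary weight-gain values (lb), including extreme ones
x = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

eta = beta_0 + beta_1 * x                      # linear predictor
p_linear = eta                                 # p = b0 + b1*x
p_exponential = np.exp(eta)                    # p = exp(b0 + b1*x)
p_logistic = np.exp(eta) / (1 + np.exp(eta))   # logistic function

for xi, pl, pe, pg in zip(x, p_linear, p_exponential, p_logistic):
    print(f"x = {xi:6.1f}  linear = {pl:8.3f}  "
          f"exponential = {pe:8.3f}  logistic = {pg:.3f}")
```

The linear form produces negative values, the exponential form produces values above 1, and only the logistic form stays between 0 and 1 for every `x`.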

Recall that if an event occurs with probability $p$, the odds in favor of the event are $p/(1 - p)$ to 1. Thus, if a success occurs with probability


$$ p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} $$

the odds in favor of success are


$$\frac{p}{1 - p} = \frac{e^{\beta_0 + \beta_1 x} / (1 + e^{\beta_0 + \beta_1 x})}{1 / (1 + e^{\beta_0 + \beta_1 x})} = e^{\beta_0 + \beta_1 x}
$$

Taking the natural logarithm of each side of this equation gives

$$ \ln \left[ \frac{p}{1-p} \right] = \ln \left[ e^{\beta_0 + \beta_1 x} \right] = \beta_0 + \beta_1 x $$

Thus, modeling the probability $p$ with a logistic function is equivalent to fitting a linear regression model in which the continuous response $y$ has been replaced by the logarithm of the odds of success for a dichotomous random variable. Instead of assuming that the relationship between $p$ and $x$ is linear, we assume that the relationship between $\ln[p/(1 - p)]$ and $x$ is linear. The technique of fitting a model of this form is known as **logistic regression**.
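
The derivation above can be checked numerically. The sketch below, which again borrows the fitted coefficients reported later in this lesson, computes $p$ from the logistic function and confirms that the log odds recover $\beta_0 + \beta_1 x$. The helper names `logistic` and `log_odds` are illustrative choices.

```python
import math

# Coefficients reported later in this lesson (illustration only)
beta_0, beta_1 = -1.8849, 0.7284

def logistic(x):
    """Probability p given by the logistic function."""
    eta = beta_0 + beta_1 * x
    return math.exp(eta) / (1 + math.exp(eta))

def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

for x in [-4, -2, 0, 2, 4]:
    p = logistic(x)
    print(f"x = {x:2d}  p = {p:.4f}  log odds = {log_odds(p):.4f}  "
          f"beta_0 + beta_1*x = {beta_0 + beta_1 * x:.4f}")
```

The last two columns agree for every `x`: the probabilities follow an S-shaped curve, but the log odds fall on a straight line.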


## **Fitted Equation**

In order to use a marathon runner’s weight gain during the race to help us predict the probability that they will develop hyponatremia, we fit the model

$$ \ln \left[ \frac{\hat{p}}{1-\hat{p}} \right] = \hat{\beta}_0 + \hat{\beta}_1 x $$

Although we divided weight gain into three categories when initially exploring its relationship with the outcome, we now use the original continuous measurement as the explanatory variable for the logistic regression model. As in linear regression, $\hat\beta_0$ and $\hat\beta_1$ are estimates of the population coefficients. However, we do not apply the method of least squares, which assumes that the response is continuous and normally distributed, to fit a logistic model. Instead, we use maximum likelihood estimation. Recall that this technique uses the information in a sample to find the parameter estimates that are most likely to have produced the observed data.
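
To give a feel for what maximum likelihood estimation does, here is a minimal sketch that maximizes the logistic log-likelihood with Newton-Raphson iterations. It uses simulated data, not the hyponatremia dataset, and it is one possible approach under those assumptions, not necessarily the algorithm any particular library uses. All variable names and the "true" coefficients are hypothetical.

```python
import numpy as np

# Simulated data (hypothetical; not the hyponatremia dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_p = 1 / (1 + np.exp(-(-1.0 + 0.8 * x)))
y = rng.binomial(1, true_p)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

def log_likelihood(beta):
    """Bernoulli log-likelihood of the logistic model."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# Newton-Raphson iterations, starting from beta = (0, 0)
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ beta)))
    gradient = X.T @ (y - p)                        # first derivatives
    hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivatives
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

print(f"Estimates: beta_0 = {beta[0]:.4f}, beta_1 = {beta[1]:.4f}")
print(f"Maximized log-likelihood: {log_likelihood(beta):.4f}")
```

Each iteration moves the estimates toward the values at which the gradient of the log-likelihood is zero, that is, the values under which the observed 0/1 outcomes are most probable.
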
For the sample of runners, the estimated logistic regression equation is


$$ \ln \left[ \frac{\hat{p}}{1-\hat{p}} \right] = -1.8849 + 0.7284x $$


The intercept $\hat\beta_0 = -1.8849$ is the estimated log odds of hyponatremia for a runner with weight gain equal to 0 pounds, or no change in weight at all. The coefficient of the explanatory variable, $\hat\beta_1 = 0.7284$, implies that for each one-pound increase in weight gain, the log odds that the runner develops hyponatremia increase by $0.7284$ on average. When the log odds increase, the odds of the outcome increase, and the probability $p$ increases as well. In order to determine whether this relationship is statistically significant, we test the null hypothesis that there is no relationship between $p$ and $x$,

$$H_0: \beta_1 = 0, $$

against the alternative

$$H_A: \beta_1 \neq 0. $$

If the null hypothesis is true, the probability of being diagnosed with hyponatremia is the same regardless of the amount of weight gain.

In order to conduct the test, we need to know the estimated standard error of $ \hat\beta_1$. Then, if $H_0$ is true and the sample size is sufficiently large, the test statistic

$$ z = \frac{\hat{\beta}_1}{\hat{\text{se}}(\hat{\beta}_1)} $$

can be assumed to follow a standard normal distribution.
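
As a minimal sketch of how this test statistic is turned into a p-value, the code below computes the two-sided tail area of the standard normal distribution using only the Python standard library. The coefficient and standard error are hypothetical round numbers chosen for illustration, and `two_sided_p_value` is an illustrative helper name.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal random variable Z."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficient and standard error, for illustration only
beta_hat = 0.50
se_beta_hat = 0.25

z = beta_hat / se_beta_hat
p_value = two_sided_p_value(z)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

A small p-value is evidence against $H_0$; the larger the ratio of the estimate to its standard error, the smaller the p-value.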

### Example 5: Estimate $\hat\beta_0$, $\hat\beta_1$ and $z$ Score

The code in the cell below illustrates how to find $\hat\beta_0$ and $\hat\beta_1$, as well as the test statistic

$$ z = \frac{\hat{\beta}_1}{\hat{\text{se}}(\hat{\beta}_1)} $$

shown in the preceding cell.


In [None]:
# Example 5: Estimate beta_0, beta_1 and z

import pandas as pd
import statsmodels.api as sm
import numpy as np

# Make a copy
df = hypoDF.copy()

# Drop rows (runners) with no weight gain measurements
df.dropna(subset=['wt_gain'], inplace=True)

# Define the predictor variable (X) and the response variable (y)
X = df['wt_gain']
y = df['hyponatremia']

# Add a constant to the predictor variables (to estimate the intercept)
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit(disp=False)

# Get the estimated parameters
beta_0, beta_1 = result.params

# Print header
print("----Estimating beta_0, beta_1 and z-----------")

print(f"Estimated beta_0: {beta_0:.4f}")
print(f"Estimated beta_1: {beta_1:.4f}")

# Define the logistic function to compute the probability p
def logistic_function(beta_0, beta_1, x):
    logit = beta_0 + beta_1 * x
    p = 1 / (1 + np.exp(-logit))
    return p

# Compute the predicted probability p for a given x value
x_value = 2
p_hat = logistic_function(beta_0, beta_1, x_value)

# Print probability
print(f"Predicted probability (p) for x = {x_value} is: {p_hat:.2f}")

# Calculate the standard error of beta_1
se_beta_1 = result.bse['wt_gain']

# Calculate the z-score for beta_1
z_score = beta_1 / se_beta_1

# Print z score
print(f"z-score for beta_1: {z_score:.2f}")


If the code is correct, you should see the following output:

~~~text
----Estimating beta_0, beta_1 and z-----------
Estimated beta_0: -1.8849
Estimated beta_1: 0.7284
Predicted probability (p) for x = 2 is: 0.39
z-score for beta_1: 6.60
~~~


These are the same parameters as shown in your textbook on pages 458-459.
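
Because the coefficients are on the log odds scale, it can help to exponentiate them. The sketch below uses the estimates printed in Example 5 to express the intercept as odds and as a probability, and the slope as the multiplicative change in the odds per pound gained. The variable names are illustrative.

```python
import math

# Fitted coefficients from Example 5
beta_0, beta_1 = -1.8849, 0.7284

# Odds and probability of hyponatremia at zero weight gain
odds_at_zero = math.exp(beta_0)
p_at_zero = odds_at_zero / (1 + odds_at_zero)

# Multiplicative change in the odds per one-pound increase in weight gain
odds_ratio_per_lb = math.exp(beta_1)

print(f"Odds at x = 0: {odds_at_zero:.3f}")
print(f"Probability at x = 0: {p_at_zero:.3f}")
print(f"Odds ratio per 1 lb gained: {odds_ratio_per_lb:.2f}")
```

Under this model, each additional pound gained multiplies the estimated odds of hyponatremia by the same factor, regardless of the starting weight gain.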

### Example 6: Plot Logistic Regression

If we calculate the estimated probability $\hat{p}$ for each observed value of weight gain $x$ in the dataset and plot $\hat{p}$ versus $x$, the result would be the curve in **Figure 19.2**. According to the logistic regression model, the estimated value of $p$ increases as weight gain increases. As previously noted, however, the relationship between $p$ and $x$ is not linear.

The code in the cell below recreates **Figure 19.2** in your textbook.

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image04.png)

**FIGURE 19.2**  Logistic regression of hyponatremia on weight gain: $\ln[\hat{p} / (1-\hat{p})] = -1.8849 + 0.7284x $


_Code Description:_

As you can see, the code needed to recreate **Figure 19.2** is fairly straightforward. It is offered here without further comment.

In [None]:
# Example 6:  Plot logistic regression

import matplotlib.pyplot as plt
import numpy as np

# Define the logistic function
def logistic_function(x):
    return 1 / (1 + np.exp(-(-1.8849 + 0.7284 * x)))

# Generate x values
x = np.linspace(-7, 5, 400)

# Calculate the corresponding y values
y = logistic_function(x)

# Create the plot
fig, ax = plt.subplots()
ax.plot(x, y, label='Logistic Regression')

# Set labels
ax.set_xlabel('Weight gain during race (lb)')
ax.set_ylabel('Estimated value of $p$')
#ax.set_title('Logistic Regression Plot')


# Show the plot
plt.show()


If the code is correct, you should see the following plot:

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image05.png)


![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image06.png)


# **Indicator Variables**

Like the linear regression model, the logistic regression model can include categorical explanatory variables in addition to continuous ones. Suppose we are interested in the relationship between hyponatremia and sex, categorized as female or male. We could begin by noting that the proportion of females with the outcome is 37/166 = 0.223, while the proportion of males with the outcome is
25/322 = 0.078. In the sample of marathon runners, the estimated probability of hyponatremia is higher for females than it is for males.

As a general rule, we prefer to include in a multivariable regression model


![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image07.png)


Now that we have created these indicator variables, we put them both into the logistic regression model at the same time. The results are shown in **Table 19.1**. Because the coefficients of both weight gain categories are positive, we know that the probability of hyponatremia is higher for runners who lose no more than 1.35 pounds during the race relative to those who lose at least 1.36 pounds, and higher for runners who gain weight relative to those who lose at least 1.36 pounds.

**TABLE 19.1**

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image08.png)


### Example 7: Create Contingency Table

The code in the cell below recreates the contingency table shown on page 461 in your textbook:

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image07.png)

_Code Description:_

To generate the contingency table, we do _not_ drop the `NaNs` since the `Total` is 488 runners.

In [None]:
# Example 7: Create Contingency Table

import pandas as pd

# Create new copy of DataFrame
df = hypoDF.copy()

# Create a new column 'Sex' based on the 'female' column using .loc
df.loc[:, 'Sex'] = df['female'].apply(lambda x: 'Female' if x == 1 else 'Male')

# Create a new column 'Hyponatremia' based on the 'hyponatremia' column using .loc
df.loc[:, 'Hyponatremia'] = df['hyponatremia'].apply(lambda x: 'Yes' if x == 1 else 'No')

# Create a contingency table and sort the index to have "Yes" above "No"
contingency_table = pd.crosstab(df['Hyponatremia'], df['Sex'], margins=True, margins_name='Total')
contingency_table = contingency_table.reindex(['Yes', 'No', 'Total'])

# Print header
print("-------2 X 2 Contingency Table -----------------------")
# Display the table
print(contingency_table)

# Compute the odds ratio
a = contingency_table.at['Yes', 'Female']
b = contingency_table.at['No', 'Female']
c = contingency_table.at['Yes', 'Male']
d = contingency_table.at['No', 'Male']

# Compute OR
odds_ratio = (a * d) / (b * c)
print("\n-------Odds Ratio -----------------------")
print(f"Odds Ratio (OR̂): {odds_ratio:.2f}")


If the code is correct, you should see the following output:

~~~text
-------2 X 2 Contingency Table -----------------------
Sex           Female  Male  Total
Hyponatremia                     
Yes               37    25     62
No               129   297    426
Total            166   322    488

-------Odds Ratio -----------------------
Odds Ratio (OR̂): 3.41
~~~

The output is identical to the contingency table in your textbook:

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image07.png)

### Example 8: Logistic Regression


The code in the cell below performs logistic regression and displays the output to recreate **TABLE 19.1** in your textbook on page 462.

In [None]:
# Example 8: Logistic Regression

import pandas as pd
import statsmodels.api as sm
import numpy as np

# Use a copy
df = hypoDF.copy()

# Drop NaNs
df.dropna(subset=['wt_gain'], inplace=True)

# Create the indicator variables for weight gain categories
df['Wt_Gain_Cat2'] = np.where((df['wt_gain'] > -1.36) & (df['wt_gain'] <= 0.00), 1, 0)
df['Wt_Gain_Cat3'] = np.where(df['wt_gain'] > 0.00, 1, 0)

# Define the predictor variables (Wt_Gain_Cat2 and Wt_Gain_Cat3) and the response variable (y)
X = df[['Wt_Gain_Cat2', 'Wt_Gain_Cat3']]
y = df['hyponatremia']

# Add a constant to the predictor variables (to estimate the intercept)
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit(disp=False)

# Extract and print the estimated parameters, standard error, test statistic, and p-value
params = result.params
stderr = result.bse
t_values = result.tvalues
p_values = result.pvalues

# Create a DataFrame to display the results in the desired order, rounded to match the table
results_df = pd.DataFrame({
    'Coefficient': params.round(4),
    'Standard Error': stderr.round(4),
    'Test Statistic': t_values.round(2),
    'p-value': p_values.round(3)
}).reindex(['Wt_Gain_Cat2', 'Wt_Gain_Cat3', 'const'])

# Rename the index to match the table
results_df.index = ['Weight gain category 2', 'Weight gain category 3', 'Intercept']

print("\nLogistic Regression Results:")
print(results_df)

# Compute predicted probabilities for each observation
predicted_probabilities = result.predict(X)
hypoDF.loc[:, 'predicted_probabilities'] = predicted_probabilities

If the code is correct, you should see the following output:

~~~text
Logistic Regression Results:
                        Coefficient  Standard Error  Test Statistic  p-value
Weight gain category 2       0.8064          0.5068            1.59    0.112
Weight gain category 3       2.3016          0.4581            5.02    0.000
Intercept                   -3.2387          0.4162           -7.78    0.000
~~~


![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image10.png)

The output from our logistic regression is identical to **TABLE 19.1**.
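
The `Test Statistic` and `p-value` columns can be recovered from the first two columns: each test statistic is the coefficient divided by its standard error (a Wald z statistic), and the p-value is the two-sided tail area of the standard normal distribution. The sketch below is only an illustration. It retypes the rounded values printed above instead of reading them from `result`, and the helper name `wald_test` is hypothetical, so the last digit may differ slightly from the statsmodels output.

```python
import math

def wald_test(coefficient, standard_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = coefficient / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Rounded (coefficient, standard error) pairs copied from the output above
table_19_1 = {
    'Weight gain category 2': (0.8064, 0.5068),
    'Weight gain category 3': (2.3016, 0.4581),
    'Intercept': (-3.2387, 0.4162),
}

for name, (coef, se) in table_19_1.items():
    z, p = wald_test(coef, se)
    print(f"{name:<24} z = {z:6.2f}   p = {p:.3f}")
```

Although statsmodels names the attribute `tvalues`, for a `Logit` fit these statistics are compared against the standard normal distribution, which is why the normal tail area reproduces the p-values.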

### Example 9: Scatter Plot

The code in the cell below recreates **FIGURE 19.3**, shown on page 463 in your textbook. This figure shows the log odds of hyponatremia within each quintile of weight gain versus quintile midpoints.

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image11.png)

**FIGURE 19.3** Observed log odds of hyponatremia within each quintile of weight gain versus quintile midpoints.

In [None]:
# Example 9: Scatter plot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Use a copy
df = hypoDF.copy()

# Drop NaNs
df.dropna(subset=['wt_gain'], inplace=True)

# Calculate quintiles of weight gain
df['wt_gain_quintile'] = pd.qcut(df['wt_gain'], q=5, labels=False) + 1

# Calculate the midpoint of each quintile
quintile_midpoints = df.groupby('wt_gain_quintile')['wt_gain'].apply(lambda x: (x.min() + x.max()) / 2)

# Compute the observed log odds of hyponatremia within each quintile
quintile_hyponatremia = df.groupby('wt_gain_quintile')['hyponatremia'].agg(['sum', 'count'])
quintile_hyponatremia['odds'] = quintile_hyponatremia['sum'] / (quintile_hyponatremia['count'] - quintile_hyponatremia['sum'])
quintile_hyponatremia['log_odds'] = np.log(quintile_hyponatremia['odds'])

# Plot observed log odds versus quintile midpoints
plt.plot(quintile_midpoints, quintile_hyponatremia['log_odds'], marker='o', linestyle=' ')
plt.xlabel('Midpoint of weight gain quintile (lb)')
plt.ylabel('Log odds of hyponatremia')
# plt.title('Observed Log Odds of Hyponatremia vs. Quintile Midpoints of Weight Gain')
plt.grid(False)
plt.show()


If the code is correct, you should see the following plot:

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image12.png)


**FIGURE 19.3** Observed log odds of hyponatremia within each quintile of weight gain versus quintile midpoints.
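
The quantity on the vertical axis is the natural log of the odds within each quintile, that is, the number of runners with hyponatremia divided by the number without it. The sketch below shows that arithmetic on its own. The counts are invented purely for illustration (they are not the marathon data), and the helper name `log_odds` is hypothetical.

```python
import math

def log_odds(cases, total):
    """Natural log of the odds: ln(cases / non-cases)."""
    non_cases = total - cases
    if cases <= 0 or non_cases <= 0:
        # The odds are 0 or undefined here, so the log odds cannot be plotted
        raise ValueError("log odds need at least one case and one non-case")
    return math.log(cases / non_cases)

# Hypothetical (cases, total) counts per quintile; NOT the marathon data
hypothetical_quintiles = {1: (2, 98), 2: (4, 98), 3: (10, 98), 4: (15, 98), 5: (30, 98)}

for quintile, (cases, total) in hypothetical_quintiles.items():
    print(f"Quintile {quintile}: log odds = {log_odds(cases, total):.3f}")
```

Raising an error for an empty group is a design choice made for this sketch. In Example 9, a quintile with no cases would instead produce a log odds of negative infinity from `np.log`.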

# **Multiple Logistic Regression**

Now that we have seen that both weight gain during the race and female sex are associated with the probability that a marathon runner will be diagnosed with hyponatremia, we might wonder whether including these two explanatory variables in the same model will improve our ability to predict $\hat{p}$. In other words, given that we have already accounted for weight gain, does knowing a runner’s sex further improve our ability to predict whether they will experience hyponatremia?

To model the probability $p$ as a function of the two explanatory variables, we fit a model of the form

$$ \ln \left[ \frac{p}{1-p} \right] = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 $$

where $x_1$ designates weight gain and $x_2$ represents sex. The estimated logistic regression equation based on the sample of runners is

$$ \ln \left[ \frac{p}{1-p} \right] = -2.3009 + 0.7026x_1 + 0.8695x_2 $$


As we see in Table 19.2, the coefficients of both weight gain and sex have decreased somewhat now that another explanatory variable has been added to the model. However, both are still significantly different from 0 at the 0.05 level. The coefficient of weight gain tells us that, holding sex constant, a one-pound increase in weight gain is associated with a 0.7026 increase in the log odds of hyponatremia.
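
To turn the estimated equation into something easier to interpret, the log odds can be converted to a probability with $p = 1 / (1 + e^{-\text{log odds}})$, and each coefficient can be exponentiated to give an odds ratio. The sketch below uses the rounded coefficients from the equation above. The constant names and the helper `predicted_probability` are hypothetical and are not part of the notebook's code.

```python
import math

# Rounded coefficients from the estimated equation above
B0, B_WT_GAIN, B_FEMALE = -2.3009, 0.7026, 0.8695

def predicted_probability(wt_gain, female):
    """Predicted probability of hyponatremia (female = 1 for women, 0 for men)."""
    log_odds = B0 + B_WT_GAIN * wt_gain + B_FEMALE * female
    return 1.0 / (1.0 + math.exp(-log_odds))

# Predicted probabilities for a few illustrative weight gains (pounds)
for female in (0, 1):
    for wt_gain in (-1.0, 0.0, 1.0):
        p = predicted_probability(wt_gain, female)
        print(f"female={female}  wt_gain={wt_gain:+.1f}  p = {p:.3f}")

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratio_wt_gain = math.exp(B_WT_GAIN)
odds_ratio_female = math.exp(B_FEMALE)
print(f"Odds ratio per pound gained: {odds_ratio_wt_gain:.2f}")
print(f"Odds ratio, female vs. male: {odds_ratio_female:.2f}")
```

Read this way, a 0.7026 increase in the log odds means that each additional pound gained roughly doubles the odds of hyponatremia when sex is held constant.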

### Example 10: Multiple Logistic Regression

The code in the cell below recreates **TABLE 19.2** on page 464 in your textbook. The table shows the results of logistic regression of hyponatremia on weight gain and female sex.

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image19.png)

In [None]:
# Example 10: Multiple logistic regression

import pandas as pd
import statsmodels.api as sm
import numpy as np

# Use a copy
df = hypoDF.copy()

# Drop NaNs
df.dropna(subset=['wt_gain'], inplace=True)


# Define the predictor variables (x) and the response variable (y)
X = df[['wt_gain', 'female']]
y = df['hyponatremia']

# Add a constant to the predictor variables (to estimate the intercept)
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit(disp=False)

# Extract and print the estimated parameters, standard error, test statistic, and p-value
params = result.params
stderr = result.bse
t_values = result.tvalues
p_values = result.pvalues

# Create a DataFrame to display the results (each column is rounded separately)
results_df = pd.DataFrame({
    'Coefficient': params.round(4),
    'Standard Error': stderr.round(4),
    'Test Statistic': t_values.round(2),
    'p-value': p_values.round(3)
})

# Rename the index to match the table and then reorder to move 'Intercept' to the last line
results_df.index = ['Intercept', 'Weight gain', 'Female sex']
results_df = results_df.reindex(['Weight gain', 'Female sex', 'Intercept'])

print("\nLogistic Regression Results:")
print(results_df)

# Compute predicted probabilities for each observation
predicted_probabilities = result.predict(X)
hypoDF.loc[:, 'predicted_probabilities'] = predicted_probabilities


As you can see, the values in the output are identical to the values in **TABLE 19.2**.


![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image19.png)

### Example 11: Evaluation of Logistic Regression Model

While evaluation of the logistic regression model is beyond the scope of this text, we note that one way to judge the goodness-of-fit of a model to the observed sample data is to stratify the sample into subgroups — as we do below, looking at categories of weight gain and sex — and compare the observed proportion of cases with hyponatremia within each of the subgroups to the predicted probability of hyponatremia based on the model.

The code in the cell below prints out the observed and predicted values shown in the following table:


![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image13.png)

In [13]:
# Example 11: Evaluation of logistic regression model

import pandas as pd
import numpy as np
import statsmodels.api as sm
from tabulate import tabulate

# Assuming hypoDF is your DataFrame with 'wt_gain', 'hyponatremia', and 'female'

# Use a copy
df = hypoDF.copy()

# Drop NaNs
df.dropna(subset=['wt_gain'], inplace=True)

# Define weight gain categories
def classify_wt_gain(value):
    if value <= -1.36:
        return '≤ -1.36'
    elif value <= 0.00:  # everything above -1.36 up to 0.00, so values between -1.36 and -1.35 are not misclassified
        return '-1.35 to 0.00'
    else:
        return '≥ 0.01'

# Apply the function to create a new weight gain category column
df['wt_gain_category'] = df['wt_gain'].apply(classify_wt_gain)

# Map the 'female' column to gender labels
df['gender'] = df['female'].map({1: 'Females', 0: 'Males'})

# Create a logistic regression model
X = df[['wt_gain', 'female']]
X = sm.add_constant(X)
y = df['hyponatremia']

# Fit the model
logit_model = sm.Logit(y, X).fit(disp=False)

# Predict probabilities
df['predicted_prob'] = logit_model.predict(X)

# Group by weight gain category and gender, then compute observed and predicted proportions
observed = df.groupby(['wt_gain_category', 'gender'])['hyponatremia'].mean().unstack()
predicted = df.groupby(['wt_gain_category', 'gender'])['predicted_prob'].mean().unstack()

# Combine observed and predicted proportions into a single DataFrame
table = pd.concat([observed.add_prefix('Observed '), predicted.add_prefix('Predicted ')], axis=1).reset_index()

# Rename columns
table.columns = ['Gain (pounds)',
                 'with Hyponatremia (Females)',
                 'with Hyponatremia (Males)',
                 'with Hyponatremia (Females)',
                 'with Hyponatremia (Males)']

# Reorder rows to move '≤ -1.36' to the top
row_order = ['≤ -1.36', '-1.35 to 0.00', '≥ 0.01']
table = table.set_index('Gain (pounds)').loc[row_order].reset_index()

# Round the values to 3 decimal places
table = table.round(3)

# Format the table for display
formatted_table = tabulate(table, headers='keys', tablefmt='grid', showindex=False)

# This is a KLUDGE! It is added only to make the table look more similar to the one in the textbook!
print("+-----Weight------|-----Observed Proportion-------|-----Observed Proportion-----|-----Predicted Proportion------|-----Predicted Proportion----+ ")

# Display the formatted table
print(formatted_table)



If the code is correct, you should see the following table:

~~~text
+-----Weight------|-----Observed Proportion-------|-----Observed Proportion-----|-----Predicted Proportion------|-----Predicted Proportion----+
+-----------------+-------------------------------+-----------------------------+-------------------------------+-----------------------------+
| Gain (pounds)   |   with Hyponatremia (Females) |   with Hyponatremia (Males) |   with Hyponatremia (Females) |   with Hyponatremia (Males) |
+=================+===============================+=============================+===============================+=============================+
| ≤ -1.36         |                         0.154 |                       0.015 |                         0.06  |                       0.022 |
+-----------------+-------------------------------+-----------------------------+-------------------------------+-----------------------------+
| -1.35 to 0.00   |                         0.129 |                       0.044 |                         0.144 |                       0.061 |
+-----------------+-------------------------------+-----------------------------+-------------------------------+-----------------------------+
| ≥ 0.01          |                         0.339 |                       0.233 |                         0.361 |                       0.2   |
+-----------------+-------------------------------+-----------------------------+-------------------------------+-----------------------------+
~~~

Note that this table lists the values for `Females` first, rather than the values for `Males`. There are also some small differences from the values in the textbook table because of small differences in the datasets. In most subgroups the predicted values are close to the observed values; the largest gap is for females in the lowest weight gain category (observed 0.154 versus predicted 0.06).
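
The three weight gain categories used in Example 11 could also be built with `pd.cut`, which assigns every value to exactly one right-closed interval and so leaves no gaps between categories. This is a sketch of an alternative, not the notebook's method. The `demo` DataFrame holds made-up values chosen to sit on and around the category boundaries, and the names `bin_edges` and `bin_labels` are hypothetical.

```python
import pandas as pd

# Made-up weight gains (pounds) placed on and around the category boundaries
demo = pd.DataFrame({'wt_gain': [-2.0, -1.36, -1.355, 0.0, 0.01, 1.5]})

# Right-closed bins: (-inf, -1.36], (-1.36, 0.00], (0.00, inf)
bin_edges = [float('-inf'), -1.36, 0.00, float('inf')]
bin_labels = ['≤ -1.36', '-1.35 to 0.00', '≥ 0.01']

# Each value falls into exactly one bin and receives that bin's label
demo['wt_gain_category'] = pd.cut(demo['wt_gain'], bins=bin_edges, labels=bin_labels)

print(demo)
```

Because `pd.cut` returns an ordered categorical, grouping by `wt_gain_category` keeps the rows in the order of `bin_labels`, which may remove the need to reorder the rows afterwards.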



![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image13.png)

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image14.png)

# **Interaction Terms**

In logistic regression, an **interaction term** is used to assess whether the effect of one predictor variable on the response variable changes depending on the level of another predictor variable. Essentially, it allows you to explore whether there is a combined effect of two variables that differs from their individual effects.

#### Why Use an Interaction Term?

- **Assess Combined Effects**: Sometimes, the relationship between a predictor and the outcome may not be the same at all levels of another predictor. For example, the effect of weight gain on the likelihood of hyponatremia might be different for males and females.
- **Improve Model Fit**: Including interaction terms can help in building more accurate predictive models by capturing complex relationships between variables.

#### How to Create an Interaction Term

To create an interaction term, you multiply the two variables of interest. For example, if you want to explore the interaction between `wt_gain` (weight gain) and `female` (gender), you create a new variable:

$$ \text{interaction} = \text{wt_gain} \times \text{female} $$



### Example 12: Logistic Regression with Interaction Term

The code in the cell below recreates **TABLE 19.4** in your textbook on page 468.

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image18.png)

This table shows the output of a logistic regression of hyponatremia on weight gain, female sex, and the weight gain X sex interaction.

_Code Description:_

The following code chunk creates the interaction term:

```python
# Create the interaction term
df['interaction'] = df['wt_gain'] * df['female']
```

The interaction term is then added to the definition of `X`:

```python
# Define the predictor variables (x) and the response variable (y)
X = df[['wt_gain', 'female', 'interaction']]
y = df['hyponatremia']
```

In [None]:
# Example 12: Logistic regression with interaction term

import pandas as pd
import statsmodels.api as sm
import numpy as np

# Use a copy
df = hypoDF.copy()

# Drop NaNs
df.dropna(subset=['wt_gain'], inplace=True)

# Create the interaction term
df['interaction'] = df['wt_gain'] * df['female']

# Define the predictor variables (x) and the response variable (y)
X = df[['wt_gain', 'female', 'interaction']]
y = df['hyponatremia']

# Add a constant to the predictor variables (to estimate the intercept)
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit(disp=False)

# Extract and print the estimated parameters, standard error, test statistic, and p-value
params = result.params
stderr = result.bse
t_values = result.tvalues
p_values = result.pvalues

# Create a DataFrame to display the results in the desired order
results_df = pd.DataFrame({
    'Coefficient': params.round(4),
    'Standard Error': stderr.round(4),
    'Test Statistic': t_values.round(2),
    'p-value': p_values.round(3)
})

# Rename the index to match the table and then reorder to move 'Intercept' to the last line
results_df.index = ['Intercept', 'Weight gain', 'Female sex', 'Weight gain X sex']
results_df = results_df.reindex(['Weight gain', 'Female sex', 'Weight gain X sex', 'Intercept'])

print("\nLogistic Regression Results:")
print(results_df)


If the code is correct, you should see the following output:

~~~text

Logistic Regression Results:
                   Coefficient  Standard Error  Test Statistic  p-value
Weight gain             0.6868          0.1503            4.57    0.000
Female sex              0.8550          0.3241            2.64    0.008
Weight gain X sex       0.0366          0.2301            0.16    0.874
Intercept              -2.2954          0.2359           -9.73    0.000
~~~

The output exactly matches the values in **TABLE 19.4** on page 468 in your textbook.
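
One way to read the interaction coefficient is as the difference between the weight gain slopes for females and males. With `female` coded 0 or 1, the slope for males is the `Weight gain` coefficient alone, and the slope for females is that coefficient plus the `Weight gain X sex` coefficient. The sketch below checks this using the rounded values from the output above. The constant names and the helper `log_odds` are hypothetical.

```python
import math

# Rounded coefficients copied from the output above
B0 = -2.2954
B_WT_GAIN = 0.6868
B_FEMALE = 0.8550
B_INTERACTION = 0.0366

def log_odds(wt_gain, female):
    """Log odds of hyponatremia under the model with an interaction term."""
    return (B0 + B_WT_GAIN * wt_gain + B_FEMALE * female
            + B_INTERACTION * wt_gain * female)

# Change in log odds per one-pound increase in weight gain, within each sex
slope_males = log_odds(1.0, 0) - log_odds(0.0, 0)
slope_females = log_odds(1.0, 1) - log_odds(0.0, 1)

print(f"Slope for males:   {slope_males:.4f}")
print(f"Slope for females: {slope_females:.4f}")
print(f"Difference:        {slope_females - slope_males:.4f}")
print(f"Odds ratio per pound, males:   {math.exp(slope_males):.2f}")
print(f"Odds ratio per pound, females: {math.exp(slope_females):.2f}")
```

The two slopes differ by only 0.0366, which is consistent with the interaction term's large p-value (0.874) in the output above.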

![____](https://biologicslab.co/BIO5853/images/module_03/lesson_03_11_image18.png)

## **Lesson Turn-in**

When you have completed and run all of the code cells, create a PDF of your notebook and upload the **PDF** to your Lesson_03_11 assignment in Canvas for grading.
