# Linear Regression Code Documentation

## Concept

Linear regression is a fundamental statistical method used to model the relationship between a continuous dependent variable (also called the response or target variable) and one or more independent variables (also called predictors, explanatory variables, or features). The purpose of linear regression is to establish a mathematical equation that best describes this relationship, allowing for interpretation and prediction.

### Mathematical Representation

In its simplest form, simple linear regression models the relationship between one independent variable and one dependent variable using the following equation: <dt> $$ \displaystyle Y = \beta_0 + \beta_1 X + \varepsilon $$ </dt>


    

Where:


Y → The dependent variable (response), which must be continuous

X → The independent variable (predictor), which can be continuous  or categorical

𝛽0 → The intercept, representing the expected value of Y when X = 0

𝛽1 → The regression coefficient, indicating how much Y changes for each unit increase in X.

ε → The error term, representing random variability in Y that is not explained by X.

When there are multiple independent variables, we use multiple linear regression, which extends the equation to:<dt> $$ \displaystyle Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon $$ </dt>

Where:



* X1,X2,…,Xn are the independent variables.

* 𝛽1,𝛽2,…,𝛽𝑛 are their respective regression coefficients.

### Estimation of the Coefficients

The coefficients in linear regression are estimated using the method of Ordinary Least Squares (OLS), which minimizes the sum of the squared differences between the observed and predicted values.
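As a minimal sketch of what OLS does, the snippet below recovers the intercept and slope of a simple regression with NumPy's least-squares solver. The data are synthetic and noise-free (generated from Y = 2 + 3X), chosen only so the expected coefficients are obvious; they are not part of the original notebook.

```python
import numpy as np

# Synthetic, noise-free data generated from Y = 2 + 3*X (illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix: a column of ones for the intercept, then the predictor
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the beta that minimizes ||y - X @ beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

With noisy data the same call returns the coefficients that minimize the sum of squared residuals rather than an exact fit.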



### Assumptions


**Linearity** → The relationship between independent and dependent variables must be linear.

**Independence** → Observations must be independent of each other.

**Homoscedasticity** → The variance of residuals (errors) should be constant across all levels of X.

**Normality of Errors** → Residuals should be approximately normally distributed.

**No Multicollinearity** → Independent variables should not be highly correlated with each other.
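One common way to screen for multicollinearity is the variance inflation factor (VIF). The sketch below computes it with NumPy as the diagonal of the inverse correlation matrix of the predictors. The three predictors are synthetic, and the frequently quoted rule of thumb that a VIF above roughly 5 to 10 signals a problem is a convention, not something taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors: x2 is almost a copy of x1, x3 is unrelated
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

# Correlation matrix of the predictors (columns are variables)
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)

# The VIF of each predictor is the matching diagonal entry of the inverse
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

Here `x1` and `x2` should show very large values, while `x3` should stay close to 1.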


### Interpreting Linear Regression

**1. Coefficients (β)**

In linear regression, the estimated coefficients (β) represent the change in the dependent variable for a one-unit increase in the predictor variable, assuming all other variables are held constant.

* If β > 0, an increase in the predictor variable is associated with an increase in the dependent variable.

* If β < 0, an increase in the predictor variable is associated with a decrease in the dependent variable.

* If β = 0, the predictor variable has no effect on the dependent variable.

**2. Confidence Intervals (CI)**


To determine the reliability of the estimated coefficient, we compute a confidence interval (usually 95%):

<dt> $$ CI = \left[\beta - 1.96 \cdot \sigma, \, \beta + 1.96 \cdot \sigma\right] $$ </dt>

Here σ denotes the standard error of the estimated coefficient β.

* If the interval does not include 0, the effect of the variable is statistically significant.

* If the interval includes 0, the effect may not be significant.
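The interval formula can be sketched in a few lines of plain Python. The helper name `wald_interval` and the numbers passed to it (an estimate of 0.5 with a standard error of 0.2) are hypothetical and serve only to illustrate the arithmetic.

```python
def wald_interval(estimate, std_err, z=1.96):
    """Return (lower, upper) bounds of a normal-approximation interval."""
    half_width = z * std_err
    return estimate - half_width, estimate + half_width


# Hypothetical coefficient and standard error
lower, upper = wald_interval(0.5, 0.2)

# The effect is flagged as significant when 0 lies outside the interval
includes_zero = lower <= 0 <= upper
print(f"95% CI: [{lower:.3f}, {upper:.3f}], includes 0: {includes_zero}")
```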

**3. p-value**

The p-value tests the null hypothesis that the coefficient is zero, meaning the predictor variable has no effect on the dependent variable.

* If p < 0.05, the effect is statistically significant.

* If p ≥ 0.05, there is insufficient evidence that the variable affects the dependent variable.
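For a test based on the normal distribution, the two-sided p-value follows from the ratio of the coefficient to its standard error. The sketch below uses only the standard library; the function name and the input values are hypothetical.

```python
import math


def two_sided_p_value(estimate, std_err):
    """Two-sided p-value of a z-test for H0: coefficient = 0."""
    z = estimate / std_err
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical coefficient 0.5 with standard error 0.2, so z = 2.5
p_value = two_sided_p_value(0.5, 0.2)
print(f"p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```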



### Advantages of Linear Regression

* **Simplicity**: Easy to implement and interpret.

* **Interpretability**: Coefficients have a clear and direct interpretation in terms of effect size.

* **Efficiency**: Computationally fast, especially for small to medium-sized datasets.

* **Good baseline**: Often serves as a solid baseline model before exploring more complex methods.

* **Analytical properties**: Well-understood statistical properties under standard assumptions.



### Limitations of Linear Regression


* **Linearity assumption**: Assumes a linear relationship between the predictors and the response variable, which may not hold in real-world data.

* **Sensitive to outliers**: Outliers can significantly influence the model and distort predictions.

* **Homoscedasticity required**: Assumes constant variance of the errors; violation of this can lead to inefficient estimates.

* **Normality of residuals**: Required for valid hypothesis testing and confidence intervals.

* **Multicollinearity**: High correlation between independent variables can inflate variance and make coefficients unstable.

* **Not suitable for categorical outcomes**: Cannot be used when the dependent variable is binary or categorical (logistic regression is more appropriate in such cases).

## Function details

### 1. Importing Required Libraries

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf



* numpy: Used for mathematical operations

* pandas: Handles data in a structured table format

* statsmodels.api: Provides core statistical tools, including Generalized Linear Models (GLMs).

* statsmodels.formula.api: Enables formula-based modeling, allowing us to specify the regression equation in a natural syntax

### 2. Defining the Linear Regression Function

To streamline the process, we define a function that fits a Generalized Linear Model (GLM) for linear regression and summarizes the key outputs

In [None]:
def execute_glm_regression(elr_dataframe_df, elr_outcome_str, elr_predictors_list,
                           model_type='linear', print_results=True, labels=False, reg_type="Multi"):
    """
    Performs a Generalized Linear Model (GLM) for linear or logistic regression and returns a summary.

    Parameters:
    - elr_dataframe_df: Pandas DataFrame containing the data.
    - elr_outcome_str: Name of the outcome (dependent) variable.
    - elr_predictors_list: List of predictor (independent) variables.
    - model_type: 'linear' for Gaussian (linear regression) or 'logistic' for Binomial (logistic regression).
    - print_results: If True, prints the result summary.
    - labels: (Optional) Dictionary mapping variable names to readable labels.
    - reg_type: 'uni' or 'multi' for renaming columns in the output.

    Returns:
    - summary_df: DataFrame with regression results.
    """


### 3. Selecting the Regression Type

The function determines whether to use a Gaussian (linear regression) or Binomial (logistic regression) model based on the model_type parameter

In [None]:
    if model_type.lower() == 'logistic':
        family = sm.families.Binomial()
    elif model_type.lower() == 'linear':
        family = sm.families.Gaussian()
    else:
        raise ValueError("model_type must be 'linear' or 'logistic'")

This section checks the model_type parameter to determine which statistical family to use:

* For logistic regression ('logistic'), it sets the family to sm.families.Binomial(), which is suitable for binary outcomes.

* For linear regression ('linear'), it sets the family to sm.families.Gaussian(), appropriate for continuous outcomes.

* If an unsupported model_type is passed, it raises a ValueError to alert you immediately about the invalid input.

### 4. Constructing the Regression Formula

A dynamic formula is built based on the specified outcome and predictors

In [None]:
    formula = elr_outcome_str + ' ~ ' + ' + '.join(elr_predictors_list)


* The formula string follows the format required by statsmodels: outcome ~ predictor1 + predictor2 + ...
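As a quick illustration, the same string concatenation can be run on its own. The two predictor names are taken from the example later in this document; the local variable names are arbitrary.

```python
outcome = "vital_rr"
predictors = ["demog_age", "demog_sex"]

# Join the predictors with ' + ' and place the outcome on the left of '~'
formula = outcome + " ~ " + " + ".join(predictors)

print(formula)  # vital_rr ~ demog_age + demog_sex
```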

### 5. Identifying and Formatting Categorical Variables

Categorical predictors must be properly converted before fitting the model

In [None]:
    categorical_vars = elr_dataframe_df.select_dtypes(include=['object', 'category']).columns.intersection(elr_predictors_list)
    for var in categorical_vars:
        elr_dataframe_df[var] = elr_dataframe_df[var].astype('category')


* It selects columns in the DataFrame that are of type object or category and are also listed in your predictors.

* Each identified column is then converted to the category data type.

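* Columns with the category dtype are dummy-coded by the formula interface, with one level kept as the reference so that the dummy columns are not perfectly collinear with the intercept. Note that the conversion is written back to the DataFrame that was passed in, so the caller's data is modified.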
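The selection and conversion step can be tried on a small table. The sketch below uses a made-up DataFrame and, as a deliberate variation on the function, works on a copy so the original data stays untouched. The text columns are created with an explicit `object` dtype so the example behaves the same regardless of the pandas default string type.

```python
import pandas as pd

# Made-up data; text columns are given an explicit object dtype
df = pd.DataFrame({
    "age": [30, 45, 60],
    "sex": pd.Series(["Male", "Female", "Male"], dtype=object),
    "site": pd.Series(["A", "B", "A"], dtype=object),
})
predictors = ["age", "sex"]

# Work on a copy so the caller's DataFrame is not modified
work = df.copy()

# Text or categorical columns that are also listed as predictors
categorical_vars = work.select_dtypes(
    include=["object", "category"]).columns.intersection(predictors)

for var in categorical_vars:
    work[var] = work[var].astype("category")

print(list(categorical_vars))
print(work.dtypes)
```

Only `sex` is converted: `site` is a text column but not a predictor, and `age` is numeric.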

### 6. Fitting the Linear Regression Model

The GLM model is fitted using the statsmodels package

In [None]:
    model = smf.glm(formula=formula, data=elr_dataframe_df, family=family)
    result = model.fit()


* The smf.glm function is called with the constructed formula, the dataset, and the chosen family.

* The model is then fitted using the .fit() method, which computes the coefficients and related statistics.

* This step applies Maximum Likelihood Estimation (MLE) to estimate the model parameters.



### 7. Extracting Regression Coefficients

Once the model is fitted, we extract coefficients, confidence intervals, and p-values:

In [None]:
    summary_table = result.summary2().tables[1].copy()


For **linear regression**, we focus on:

In [None]:
    summary_df = summary_table[['Coef.', '[0.025', '0.975]', 'P>|z|']].reset_index()
    summary_df = summary_df.rename(columns={'index': 'Study',
                                            'Coef.': 'Coefficient',
                                            '[0.025': 'LowerCI',
                                            '0.975]': 'UpperCI',
                                            'P>|z|': 'p-value'})


After fitting the model, the summary table containing key statistics is extracted:

* <code>result.summary2()</code> generates a detailed summary.

* The second table (indexed at 1) is copied to work with, as it contains the coefficients, confidence intervals, and p-values.

### 8. Mapping Variable Names to Readable Labels

If labels are provided, variable names are mapped to readable labels:



In [None]:
    if labels:
        def parse_variable_name(var_name):
            if var_name == 'Intercept':
                return labels.get('Intercept', 'Intercept')
            elif '[' in var_name:
                base_var = var_name.split('[')[0]
                level = var_name.split('[')[1].split(']')[0]
                base_var_name = base_var.replace('C(', '').replace(')', '').strip()
                label = labels.get(base_var_name, base_var_name)
                return f'{label} ({level})'
            else:
                var_name_clean = var_name.replace('C(', '').replace(')', '').strip()
                return labels.get(var_name_clean, var_name_clean)
        summary_df['Study'] = summary_df['Study'].apply(parse_variable_name)


This snippet remaps raw variable names to more reader-friendly labels if a labels dictionary is provided:

* The function parse_variable_name checks if a variable is the intercept or a categorical variable.

* For categorical variables, it extracts the base name and level, then applies the mapping.

* The Study column in the summary DataFrame is updated with these parsed names.
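To see the parsing on concrete names, the sketch below reuses the same logic as a standalone function. Unlike the nested version in the function, it takes `labels` as an explicit argument; that signature and the small label dictionary are assumptions made for illustration.

```python
def parse_variable_name(var_name, labels):
    """Map a model term name to a readable label (standalone variant)."""
    if var_name == 'Intercept':
        return labels.get('Intercept', 'Intercept')
    elif '[' in var_name:
        # Categorical term, e.g. 'demog_sex[T.Male]'
        base_var = var_name.split('[')[0]
        level = var_name.split('[')[1].split(']')[0]
        base_var_name = base_var.replace('C(', '').replace(')', '').strip()
        label = labels.get(base_var_name, base_var_name)
        return f'{label} ({level})'
    else:
        var_name_clean = var_name.replace('C(', '').replace(')', '').strip()
        return labels.get(var_name_clean, var_name_clean)


# Illustrative label dictionary
labels = {'demog_sex': 'Sex at birth', 'demog_age': 'Age'}

for term in ['Intercept', 'demog_age', 'demog_sex[T.Male]', 'C(site)[T.B]']:
    print(term, '->', parse_variable_name(term, labels))
```

The level still carries the `T.` prefix at this point, and terms without an entry in the dictionary fall back to their cleaned variable name.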

### 9. Reordering and Cleaning the Columns

This section organizes the DataFrame

In [None]:
    if model_type.lower() == 'logistic':
        summary_df = summary_df[['Study', 'OddsRatio', 'LowerCI', 'UpperCI', 'p-value']]
    else:
        summary_df = summary_df[['Study', 'Coefficient', 'LowerCI', 'UpperCI', 'p-value']]

    summary_df['Study'] = summary_df['Study'].str.replace('T.', '', regex=False)

* Columns are reordered to ensure a logical and consistent display.

* The string <code>'T.'</code>, which may appear in categorical variable names, is removed for clarity.
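Whether `'T.'` is matched literally depends on the `regex` argument of `Series.str.replace`, whose default changed from `True` to `False` in pandas 2.0. The sketch below, with made-up names, shows why passing it explicitly is the safer choice: as a regular expression, the dot matches any character.

```python
import pandas as pd

# Made-up term names; the first one contains a capital T outside the prefix
names = pd.Series(["Temperature [T.High]", "demog_sex[T.Male]"])

# Literal match: only the exact two characters "T." are removed
literal = names.str.replace("T.", "", regex=False)

# Regex match: "." is a wildcard, so "T" plus ANY character is removed
as_regex = names.str.replace("T.", "", regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

With the regex interpretation, the leading "Te" of "Temperature" is removed as well.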

### 10. Formatting and Rounding Numeric Values

For better readability, numerical values are rounded:

In [None]:
    for col in summary_df.columns[1:-1]:
        summary_df[col] = summary_df[col].round(3)
    summary_df['p-value'] = summary_df['p-value'].apply(lambda x: f'{x:.4f}')


* Coefficients and confidence intervals are rounded to 3 decimal places.

* p-values are formatted to 4 decimal places.
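The effect of this step can be checked on a tiny table. The values below are hypothetical. Two consequences are worth noting: the p-value column becomes a column of strings, and very small p-values are displayed as `0.0000`.

```python
import pandas as pd

# Hypothetical results table with the same column layout as summary_df
table = pd.DataFrame({
    "Study": ["var_a", "var_b"],
    "Coefficient": [0.30449, -0.01021],
    "LowerCI": [-0.00512, -0.20218],
    "UpperCI": [0.61349, 0.18177],
    "p-value": [0.05423, 0.00001],
})

# Round every column between the first (names) and the last (p-value)
for col in table.columns[1:-1]:
    table[col] = table[col].round(3)

# Format the p-values as fixed-width strings with 4 decimals
table["p-value"] = table["p-value"].apply(lambda x: f"{x:.4f}")

print(table)
```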

### 11. Removing the Intercept Row

The intercept row is removed from the summary:

In [None]:
    summary_df = summary_df[summary_df['Study'] != 'Intercept']


* The intercept is often not needed for interpretation.

### 12. Renaming Columns Based on Regression Type

To distinguish between univariate and multivariate models:

In [None]:
    if reg_type.lower() == 'uni':
        summary_df.rename(columns={
            'Coefficient': 'Coefficient (uni)',
            'LowerCI': 'LowerCI (uni)',
            'UpperCI': 'UpperCI (uni)',
            'p-value': 'p-value (uni)'
        }, inplace=True)
    elif reg_type.lower() == 'multi':
        summary_df.rename(columns={
            'Coefficient': 'Coefficient (multi)',
            'LowerCI': 'LowerCI (multi)',
            'UpperCI': 'UpperCI (multi)',
            'p-value': 'p-value (multi)'
        }, inplace=True)


A suffix is appended to each column name:

* 'uni': Univariate regression (each predictor tested separately).

* 'multi': Multivariate regression (all predictors tested together).
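The same renaming can be written more compactly with a dictionary comprehension. This is a sketch of an alternative, not the function's own code. It also behaves differently in one respect: it appends whatever lowercase `reg_type` it receives, whereas the function leaves the columns unchanged for values other than 'uni' and 'multi'.

```python
import pandas as pd

# Empty table with the column layout produced for linear regression
summary = pd.DataFrame(
    columns=["Study", "Coefficient", "LowerCI", "UpperCI", "p-value"])

reg_type = "Multi"
suffix = reg_type.lower()

# Append the suffix to every column except the name column
renamed = summary.rename(
    columns={c: f"{c} ({suffix})" for c in summary.columns if c != "Study"})

print(list(renamed.columns))
```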

### 13. Displaying and Returning the Results

The results are printed if print_results=True and then returned:

In [None]:
    if print_results:
        print(summary_df)

    return summary_df


The final DataFrame contains:

* Predictor names

* Coefficients

* Confidence intervals

* p-values

### Function

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def execute_glm_regression(elr_dataframe_df, elr_outcome_str, elr_predictors_list,
                           model_type='linear', print_results=True, labels=False, reg_type="Multi"):
    """
    Runs a Generalized Linear Model (GLM) for linear or logistic regression.

    Parameters:
    - elr_dataframe_df: Pandas DataFrame containing the data.
    - elr_outcome_str: Name of the outcome (dependent) variable.
    - elr_predictors_list: List of predictor variable names.
    - model_type: 'linear' for linear regression (Gaussian) or 'logistic' for logistic regression (Binomial).
    - print_results: If True, prints the results table.
    - labels: (Optional) Dictionary mapping variable names to readable labels.
    - reg_type: Regression type ('uni' or 'multi') used to rename the output columns.

    Returns:
    - summary_df: DataFrame with the model results.
    """

    # Select the family according to model_type
    if model_type.lower() == 'logistic':
        family = sm.families.Binomial()
    elif model_type.lower() == 'linear':
        family = sm.families.Gaussian()
    else:
        raise ValueError("model_type must be 'linear' or 'logistic'")

    # Build the formula
    formula = elr_outcome_str + ' ~ ' + ' + '.join(elr_predictors_list)

    # Convert categorical variables to the 'category' dtype
    categorical_vars = elr_dataframe_df.select_dtypes(include=['object', 'category']).columns.intersection(elr_predictors_list)
    for var in categorical_vars:
        elr_dataframe_df[var] = elr_dataframe_df[var].astype('category')

    # Fit the GLM
    model = smf.glm(formula=formula, data=elr_dataframe_df, family=family)
    result = model.fit()

    # Extract the results table
    summary_table = result.summary2().tables[1].copy()

    # For logistic regression, compute odds ratios; for linear regression, use the coefficients directly.
    if model_type.lower() == 'logistic':
        summary_table['Odds Ratio'] = np.exp(summary_table['Coef.'])
        summary_table['IC Low'] = np.exp(summary_table['[0.025'])
        summary_table['IC High'] = np.exp(summary_table['0.975]'])

        summary_df = summary_table[['Odds Ratio', 'IC Low', 'IC High', 'P>|z|']].reset_index()
        summary_df = summary_df.rename(columns={'index': 'Study',
                                                  'Odds Ratio': 'OddsRatio',
                                                  'IC Low': 'LowerCI',
                                                  'IC High': 'UpperCI',
                                                  'P>|z|': 'p-value'})
    else:
        summary_df = summary_table[['Coef.', '[0.025', '0.975]', 'P>|z|']].reset_index()
        summary_df = summary_df.rename(columns={'index': 'Study',
                                                  'Coef.': 'Coefficient',
                                                  '[0.025': 'LowerCI',
                                                  '0.975]': 'UpperCI',
                                                  'P>|z|': 'p-value'})

    # Map variable names to readable labels, if provided
    if labels:
        def parse_variable_name(var_name):
            if var_name == 'Intercept':
                return labels.get('Intercept', 'Intercept')
            elif '[' in var_name:
                base_var = var_name.split('[')[0]
                level = var_name.split('[')[1].split(']')[0]
                base_var_name = base_var.replace('C(', '').replace(')', '').strip()
                label = labels.get(base_var_name, base_var_name)
                return f'{label} ({level})'
            else:
                var_name_clean = var_name.replace('C(', '').replace(')', '').strip()
                return labels.get(var_name_clean, var_name_clean)
        summary_df['Study'] = summary_df['Study'].apply(parse_variable_name)

    # Reorder the columns
    if model_type.lower() == 'logistic':
        summary_df = summary_df[['Study', 'OddsRatio', 'LowerCI', 'UpperCI', 'p-value']]
    else:
        summary_df = summary_df[['Study', 'Coefficient', 'LowerCI', 'UpperCI', 'p-value']]

    # Remove the 'T.' prefix from categorical level names (literal match, not a regex)
    summary_df['Study'] = summary_df['Study'].str.replace('T.', '', regex=False)

    # Format the numeric values
    for col in summary_df.columns[1:-1]:
        summary_df[col] = summary_df[col].round(3)
    summary_df['p-value'] = summary_df['p-value'].apply(lambda x: f'{x:.4f}')


    # Remove the intercept row, if desired (optional)
    summary_df = summary_df[summary_df['Study'] != 'Intercept']

    # Rename the columns according to the regression type
    if reg_type.lower() == 'uni':
        if model_type.lower() == 'logistic':
            summary_df.rename(columns={
                'OddsRatio': 'OddsRatio (uni)',
                'LowerCI': 'LowerCI (uni)',
                'UpperCI': 'UpperCI (uni)',
                'p-value': 'p-value (uni)'
            }, inplace=True)
        else:
            summary_df.rename(columns={
                'Coefficient': 'Coefficient (uni)',
                'LowerCI': 'LowerCI (uni)',
                'UpperCI': 'UpperCI (uni)',
                'p-value': 'p-value (uni)'
            }, inplace=True)
    elif reg_type.lower() == 'multi':
        if model_type.lower() == 'logistic':
            summary_df.rename(columns={
                'OddsRatio': 'OddsRatio (multi)',
                'LowerCI': 'LowerCI (multi)',
                'UpperCI': 'UpperCI (multi)',
                'p-value': 'p-value (multi)'
            }, inplace=True)
        else:
            summary_df.rename(columns={
                'Coefficient': 'Coefficient (multi)',
                'LowerCI': 'LowerCI (multi)',
                'UpperCI': 'UpperCI (multi)',
                'p-value': 'p-value (multi)'
            }, inplace=True)

    if print_results:
        print(summary_df)

    return summary_df




## Example

### Importing

In this example, we predict the respiratory rate using age, sex, hypertension, diabetes, highest recorded temperature, and creatinine level as predictor variables.

In [None]:
from google.colab import files

uploaded = files.upload()

dictionary_df = pd.read_csv("vertex_dictionary.csv")
df = pd.read_csv('df_map.csv')
var_name_map = dict(zip(dictionary_df["field_name"], dictionary_df["field_label"]))

Saving vertex_dictionary.csv to vertex_dictionary.csv
Saving df_map.csv to df_map.csv



* We load the dataset (df_map.csv) and the variable dictionary (vertex_dictionary.csv) using pd.read_csv().

* The variable dictionary contains technical variable names (field_name) and their corresponding human-readable labels (field_label).

* We then create a mapping dictionary (var_name_map) using zip() and dict() to associate each technical name with its readable label.

* This mapping is later used to replace technical variable names in the regression output with more understandable labels for interpretation and presentation.

### Define response variable and predictors

In [None]:
outcome_variable = "vital_rr"
predictor_variables = ["demog_age", "demog_sex", "comor_hypertensi",
                        "comor_diabetes_yn", "vital_highesttem_c", "labs_creatinine_mgdl"]



* **outcome_variable** → response variable (vital_rr).

* **predictor_variables** → list of predictor variables

### Run the linear regression model

In [None]:
summary_df = execute_glm_regression(elr_dataframe_df=df,
                                    elr_outcome_str=outcome_variable,
                                    elr_predictors_list=predictor_variables,
                                    model_type="linear",
                                    print_results=True,
                                    labels=var_name_map,
                                    reg_type="Multi")

                                       Study  Coefficient (multi)  \
1                        Sex at birth (Male)                0.022   
2  Hypertension (physician diagnosed) (True)               -0.010   
3                   Diabetes mellitus (True)                0.304   
4                                        Age                0.003   
5                   Highest temperature (°C)                0.010   
6                         Creatinine (mg/dL)                0.020   

   LowerCI (multi)  UpperCI (multi) p-value (multi)  
1           -0.167            0.212          0.8175  
2           -0.202            0.182          0.9174  
3           -0.005            0.613          0.0542  
4           -0.001            0.007          0.1278  
5           -0.070            0.089          0.8109  
6           -0.633            0.673          0.9525  


**The execute_glm_regression() function is called with all required parameters:**
* **elr_dataframe_df** = df → The dataset.

* **elr_outcome_str** = outcome_variable → The dependent variable (vital_rr).

* **elr_predictors_list** = predictor_variables  → List of independent variables.

* **model_type** = "linear" → We specify **linear regression**.

* **print_results** = True → Displays the regression output.

* **labels** = var_name_map → The dictionary used to map variable names to readable labels.

* **reg_type** = "Multi" → Specifies **multivariate regression**.

## Interpreting results

### Analysis of the Results

* **Sex (Male) (demog_sex[Male])** → Coefficient close to 0 (0.022), wide confidence interval, and p-value 0.8175 (not significant).

* **Age (demog_age)** → Small coefficient (0.003), p-value 0.1278 (not significant).

* **Hypertension (comor_hypertensi)** → Negative coefficient (-0.010), p-value 0.9174 (not significant).

* **Diabetes (comor_diabetes_yn)** → Positive coefficient (0.304), but the confidence interval (-0.005 to 0.613) includes 0 and the p-value is 0.0542, just above the 0.05 threshold (not significant).

* **Highest Temperature (vital_highesttem_c)** → Coefficient 0.010, p-value 0.8109 (not significant).

* **Creatinine (labs_creatinine_mgdl)** → Coefficient 0.020, p-value 0.9525 (not significant).

In this case, none of the variables shows strong statistical evidence of an association with the outcome variable, as all p-values are greater than 0.05.
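The same check can be done programmatically. Because the p-value column is stored as formatted strings, it has to be converted back to numbers before comparing against a threshold. The sketch below re-enters the labels and p-values from the output shown above by hand; the 0.05 threshold is the one used throughout this document.

```python
import pandas as pd

# Labels and p-values copied from the regression output above
results = pd.DataFrame({
    "Study": [
        "Sex at birth (Male)",
        "Hypertension (physician diagnosed) (True)",
        "Diabetes mellitus (True)",
        "Age",
        "Highest temperature (°C)",
        "Creatinine (mg/dL)",
    ],
    "p-value (multi)": ["0.8175", "0.9174", "0.0542",
                        "0.1278", "0.8109", "0.9525"],
})

# Convert the formatted strings back to floats before filtering
p_values = results["p-value (multi)"].astype(float)
significant = results[p_values < 0.05]

print(f"{len(significant)} of {len(results)} predictors have p < 0.05")
```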

## Analyzing results with Forest Plot

The Forest Plot is a graphical method used to analyze the results table of linear regression. Each point represents an estimated coefficient (β), along with its lower and upper confidence intervals. This allows for a clear visualization of the direction (positive or negative) and the precision of the associations between the independent variables and the outcome.

In [None]:
import pandas as pd
import plotly.graph_objs as go


def fig_forest_plot(
        df, dictionary=None,
        title='Forest Plot',
        labels=['Study', 'OddsRatio', 'LowerCI', 'UpperCI'],
        graph_id='forest-plot', graph_label='', graph_about='',
        only_display=False):

    # Error Handling: check the required columns before using them
    if not set(labels).issubset(df.columns):
        print(df.columns)
        error_str = f'Dataframe must contain the following columns: {labels}'
        raise ValueError(error_str)

    # Ordering Values -> Ascending order of the point estimate
    df = df.sort_values(by=labels[1], ascending=True)

    # Prepare Data Traces
    traces = []

    # Add the point estimates as scatter plot points
    traces.append(
        go.Scatter(
            x=df[labels[1]],
            y=df[labels[0]],
            mode='markers',
            name=labels[1],
            marker=dict(color='blue', size=10))
    )

    # Add the confidence intervals as lines
    for index, row in df.iterrows():
        traces.append(
            go.Scatter(
                x=[row[labels[2]], row[labels[3]]],
                y=[row[labels[0]], row[labels[0]]],
                mode='lines',
                showlegend=False,
                line=dict(color='blue', width=2))
        )

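    # Line of no effect: 1 for odds ratios, 0 for regression coefficients
    # (inferred here from the name of the estimate column)
    no_effect = 1 if 'OddsRatio' in labels[1] else 0

    # Define layout
    layout = go.Layout(
        title=title,
        xaxis=dict(title='Coefficient'),
        yaxis=dict(
            title='', automargin=True, tickmode='array',
            tickvals=df[labels[0]].tolist(), ticktext=df[labels[0]].tolist()),
        shapes=[
            dict(
                type='line', x0=no_effect, y0=-0.5, x1=no_effect, y1=len(df[labels[0]])-0.5,
                line=dict(color='red', width=2)
            )],  # Line of no effect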
        margin=dict(l=100, r=100, t=100, b=50),
        height=600
    )

    return go.Figure(data=traces, layout=layout)

### Executing the Forest Plot

In [None]:
graph = fig_forest_plot(
    df = summary_df,
    labels = summary_df.columns.tolist(),
    only_display=True
)

graph.show()

## References



* To learn more about the numpy library, please go to the <a name='id_1'></a><a href='https://numpy.org/'>Numpy WebPage</a>
<br>

* To learn more about the pandas library, please go to the <a name='id_2'></a><a href='https://pandas.pydata.org/'>Pandas WebPage</a>
<br>

* To learn more about the statsmodels library, please go to the <a name='id_3'></a><a href='https://www.statsmodels.org/stable/index.html'>StatsModels WebPage</a>
<br>

* To learn more about the Pip command, please go to the <a name='pip'></a><a href='https://pypi.org/project/pip/'>Pip WebPage</a>
<br>
