# Conceptual and Statistical Introduction

## Linear models in statistics

Linear regression models the expected value of a continuous response variable as a linear combination of input features plus an intercept. It assumes that a one-unit change in a feature shifts the expected response by a constant amount, regardless of the feature's starting value.
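
Written out, the model described above takes the following form. The symbols are notation introduced here for illustration: $y$ is the response, $x_1, \dots, x_p$ are the input features, $\beta_0$ is the intercept, and $\beta_1, \dots, \beta_p$ are the coefficients.

```latex
\mathbb{E}[\, y \mid x_1, \dots, x_p \,] = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
```

Under this form, increasing $x_j$ by one unit while holding the other features fixed changes the expected response by $\beta_j$.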

## Geometric intuition

The regression line is the line that minimizes the sum of squared vertical distances (residuals) between the observed data points and the line. Each residual is the prediction error for one observation.
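
As a small illustration of this criterion, the sketch below computes the sum of squared residuals for two candidate lines on a five-point toy dataset. The data values, the helper name `sum_squared_residuals`, and the second candidate line are assumptions chosen for illustration.

```python
import numpy as np

# Toy data (assumed for illustration): one descriptor value per compound
# and its measured response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])


def sum_squared_residuals(x, y, slope, intercept):
    """Return the sum of squared vertical distances between the data and a line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))


# The line y = 2x passes through every point of this toy dataset
sse_exact = sum_squared_residuals(x, y, slope=2.0, intercept=0.0)

# A hypothetical alternative line that misses most of the points
sse_off = sum_squared_residuals(x, y, slope=1.5, intercept=1.0)

print(sse_exact, sse_off)
```

The line with the smaller sum is the better fit under the least-squares criterion.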

## Drug development relevance

Linear regression underlies quantitative structure–activity relationship (QSAR) modeling, dose–response modeling, and potency prediction. Coefficients quantify how sensitive a biological response is to changes in molecular properties.

# Linear Regression for Drug Development

## 1. Motivation

Many drug discovery problems involve predicting continuous outcomes such as IC50 values, binding affinity, or expression levels from molecular features.

## 2. Generating Example Data

A simple synthetic dataset is used to illustrate how linear regression learns relationships between input features and continuous outputs.

In [None]:
import numpy as np
# NumPy is used for numerical array creation and manipulation

import pandas as pd
# pandas is imported for consistency with real drug-dev datasets; it is not used in this toy example

X = np.array([[1], [2], [3], [4], [5]])
# Feature matrix representing a single molecular descriptor
# Each row corresponds to one compound

y = np.array([2, 4, 6, 8, 10])
# Target vector representing a continuous biological response
# Here, the relationship is perfectly linear for demonstration

X, y
# Display inputs and outputs to verify alignment
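
Real datasets usually arrive as tables rather than bare arrays. The following is a minimal sketch, assuming hypothetical column names `descriptor` and `response`, of holding the same toy data in a pandas DataFrame and recovering the arrays that scikit-learn expects.

```python
import numpy as np
import pandas as pd

# Same toy data as in the cell above
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Column names are hypothetical placeholders, not taken from a real dataset
df = pd.DataFrame({"descriptor": X.ravel(), "response": y})

# Double brackets keep the features two-dimensional: (n_compounds, n_features)
X_from_df = df[["descriptor"]].to_numpy()

# Single brackets give a one-dimensional target: (n_compounds,)
y_from_df = df["response"].to_numpy()

print(df)
print(X_from_df.shape, y_from_df.shape)
```

Selecting features with a list of column names keeps the feature matrix two-dimensional even when there is only one descriptor.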

## 3. Training the Linear Regression Model

The model learns coefficients that minimize the sum of squared residuals between predictions and observed values.

In [None]:
from sklearn.linear_model import LinearRegression
# LinearRegression implements ordinary least squares (OLS)

model = LinearRegression()
# Initializes the regression model object

model.fit(X, y)
# fit() estimates the slope (coef_) and intercept_ that minimize squared error
# Internally, this solves a least-squares optimization problem
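
To make the least-squares step concrete, the sketch below solves the same problem directly with NumPy's `np.linalg.lstsq`. It illustrates the optimization and is not necessarily the exact routine scikit-learn runs internally. The names `design`, `beta`, `slope`, and `intercept` are introduced here for illustration.

```python
import numpy as np

# Same toy data as above, cast to float for the linear algebra routine
X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# Prepend a column of ones so the intercept is estimated alongside the slope
design = np.column_stack([np.ones(len(X)), X])

# lstsq returns (solution, residuals, rank, singular_values);
# only the solution vector is needed here
beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope = beta

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the toy data lie exactly on a line, the recovered slope and intercept should agree with `model.coef_` and `model.intercept_` up to floating-point error.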

## 4. Making Predictions

Predictions are generated using the learned linear relationship between input features and output.

In [None]:
y_pred = model.predict(X)
# predict() computes y_pred = X @ coef_ + intercept_
# Each prediction is a weighted sum of features plus a baseline offset

y_pred
# Display predicted values to compare with true targets
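
The weighted-sum description can be checked by reproducing the predictions by hand. The sketch below refits the toy model so that it runs on its own; `y_manual` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data and model as above, refit so this example is self-contained
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression().fit(X, y)

# Manual weighted sum: matrix product of features and coefficients, plus the intercept
y_manual = X @ model.coef_ + model.intercept_

# Compare the manual result with predict()
print(np.allclose(y_manual, model.predict(X)))
```

The matrix product form also covers the case of several features, where `coef_` holds one weight per column of `X`.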

## 5. Interpreting Model Parameters

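The learned coefficients and intercept describe how the predicted biological response depends on each feature. Each coefficient gives the change in predicted response per unit change in its feature, so coefficient magnitudes are directly comparable only for features measured on the same scale.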

In [None]:
print(model.coef_)
# coef_ holds one slope per feature; here it is the slope of the regression line
# It quantifies how much the predicted response changes per unit change in the feature
# print() is used because a notebook cell echoes only its last expression

print(model.intercept_)
# intercept_ represents the predicted response when all features are zero
# In drug-dev terms, this can be viewed as a baseline activity level
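
Both interpretations can be verified numerically. The sketch below, which refits the toy model so that it runs on its own, predicts at two hypothetical descriptor values one unit apart and at zero. These inputs lie outside the training range and are used only to illustrate the arithmetic, not as a recommendation to extrapolate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data and model as above, refit so this example is self-contained
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression().fit(X, y)

# Predict at two hypothetical descriptor values that differ by exactly one unit
pred_low, pred_high = model.predict(np.array([[6.0], [7.0]]))

# For a linear model, this difference equals the fitted slope
unit_change = pred_high - pred_low

# Predicting with the feature set to zero recovers the intercept
baseline = model.predict(np.array([[0.0]]))[0]

print(unit_change, model.coef_[0])
print(baseline, model.intercept_)
```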