# Biol 359A | Statistical Tests: Linear Regression
### Spring 2025, Week 3
Objectives:
- Interact with real data
- Learn how to fit lines to data

In [None]:
# Import necessary libraries
from ipywidgets import interact, IntSlider, FloatSlider, Layout, Dropdown
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, f, f_oneway
from scipy import stats
import seaborn as sns
import statsmodels.stats.multicomp as mc
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison
import sklearn as sk
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

In [None]:
! rm -rf week3_anova/
! git clone https://github.com/BIOL359A-FoundationsOfQBio-Spr24/week3_anova.git
! cp -r week3_anova/* .
! ls

For today's lesson we will be working with real breast cancer data from the [Wisconsin Diagnostic Breast Cancer Database (WDBC)](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

Here is a summary of the data from the data source:
```
	Features are computed from a digitized image of a fine needle
	aspirate (FNA) of a breast mass.  They describe
	characteristics of the cell nuclei present in the image.
	A few of the images can be found at
	http://www.cs.wisc.edu/~street/images/

	Separating plane described above was obtained using
	Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
	Construction Via Linear Programming." Proceedings of the 4th
	Midwest Artificial Intelligence and Cognitive Science Society,
	pp. 97-101, 1992], a classification method which uses linear
	programming to construct a decision tree.  Relevant features
	were selected using an exhaustive search in the space of 1-4
	features and 1-3 separating planes.

	The actual linear program used to obtain the separating plane
	in the 3-dimensional space is that described in:
	[K. P. Bennett and O. L. Mangasarian: "Robust Linear
	Programming Discrimination of Two Linearly Inseparable Sets",
	Optimization Methods and Software 1, 1992, 23-34].

	This database is also available through the UW CS ftp server:
	ftp ftp.cs.wisc.edu
	cd math-prog/cpo-dataset/machine-learn/WDBC/
    
    Source:
    W.N. Street, W.H. Wolberg and O.L. Mangasarian
	Nuclear feature extraction for breast tumor diagnosis.
	IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science
	and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
```

What do all the column names mean?

- ID number
- Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1) - a measure of "complexity" of a 2D image.


Category distribution: 357 benign, 212 malignant

In [None]:
import clean_data

cancer_dataset = clean_data.generate_clean_dataframe()
cancer_dataset

### Fitting lines to data


Fitting a line to data, or linear regression, finds the best linear relationship between a feature (X) and an outcome (Y) by minimizing the sum of squared errors (SSE) between the predicted and actual values of Y. The fitted line can then be used to predict the outcome for a given value of the feature.

The slope also gives the direction of the relationship and how much Y changes, on average, for a one-unit change in X. Note that the slope alone does not measure how tightly the points cluster around the line; that is what the correlation coefficient or $R^2$ describes.
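To make "minimizing the SSE" concrete, here is a minimal sketch on made-up toy data (the numbers and the helper name `sse` are assumptions for illustration, not part of the course data or code). It fits a line with `np.polyfit` and shows that nudging either coefficient away from the fitted value makes the SSE larger.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2x + 1 with small noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

def sse(beta_0, beta_1, x, y):
    """Sum of squared errors between the line's predictions and y."""
    residuals = y - (beta_0 + beta_1 * x)
    return np.sum(residuals ** 2)

# np.polyfit with deg=1 returns the least-squares (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)
best_sse = sse(intercept, slope, x, y)

# Moving either coefficient away from the fitted value increases the SSE
sse_slope_shifted = sse(intercept, slope + 0.1, x, y)
sse_intercept_shifted = sse(intercept + 0.5, slope, x, y)

print(f"SSE at fitted line:       {best_sse:.4f}")
print(f"SSE with slope + 0.1:     {sse_slope_shifted:.4f}")
print(f"SSE with intercept + 0.5: {sse_intercept_shifted:.4f}")
```

Any other line through these points would give a larger SSE than the fitted one; the two shifted lines are just easy cases to check by eye.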

\begin{align*}
S_{xy} &= \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) \\
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\\
\text{Coefficients are:} \\
\hat{\beta}_1 &= \frac{S_{xy}}{S_{xx}} \\
\hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x}
\end{align*}
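As a quick sanity check, the formulas above can be worked through by hand on a tiny dataset. The five points below are made up for illustration, and the variable names are assumptions chosen to mirror the symbols. The sketch also compares the result with `np.polyfit` and shows that $S_{xy}$ divided by $n - 1$ is the sample covariance.

```python
import numpy as np

# Five made-up points, small enough to verify by hand
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar, y_bar = x.mean(), y.mean()          # sample means: 3.0 and 4.0

S_xy = np.sum((y - y_bar) * (x - x_bar))   # sum of cross-deviations: 6.0
S_xx = np.sum((x - x_bar) ** 2)            # sum of squared deviations: 10.0

beta_1_hat = S_xy / S_xx                   # slope: 0.6
beta_0_hat = y_bar - beta_1_hat * x_bar    # intercept: 2.2

# Cross-check against NumPy's least-squares polynomial fit (degree 1)
slope_np, intercept_np = np.polyfit(x, y, deg=1)

# S_xy is (n - 1) times the sample covariance, not the covariance itself
sample_cov = np.cov(x, y)[0, 1]            # uses ddof=1 by default

print(f"By hand:    slope = {beta_1_hat:.4f}, intercept = {beta_0_hat:.4f}")
print(f"np.polyfit: slope = {slope_np:.4f}, intercept = {intercept_np:.4f}")
print(f"S_xy / (n - 1) = {S_xy / (len(x) - 1):.4f}, np.cov = {sample_cov:.4f}")
```

Because the factor $n - 1$ appears in both the covariance and the variance, it cancels in the ratio, so $\hat{\beta}_1$ can equivalently be written as the sample covariance of X and Y divided by the sample variance of X.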



In [None]:
def calculate_beta_coefficients(x, y):
    """
    Calculate beta coefficients using least squares method.

    Parameters:
    x (array-like): Independent variable values
    y (array-like): Dependent variable values

    Returns:
    tuple: (beta_0, beta_1, S_xy, S_xx)
    """
    # Convert to float numpy arrays (also accepts lists and pandas Series)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Calculate means
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    S_xy = np.sum((y - y_mean) * (x - x_mean))
    S_xx = np.sum((x - x_mean) ** 2)

    beta_1 = S_xy / S_xx
    beta_0 = y_mean - beta_1 * x_mean

    return beta_0, beta_1, S_xy, S_xx

def plot_regression(x, y, beta_0, beta_1, feature_name, outcome_name):
    """
    Plot scatter points and the fitted regression line.

    Parameters:
    x (array-like): Independent variable values
    y (array-like): Dependent variable values
    beta_0 (float): Intercept coefficient
    beta_1 (float): Slope coefficient
    feature_name (str): Name of the independent variable
    outcome_name (str): Name of the dependent variable
    """
    plt.figure(figsize=(10, 6))

    # Plot scatter points
    plt.scatter(x, y, color='blue', alpha=0.6, label='Data points')

    # Plot regression line
    x_range = np.linspace(min(x), max(x), 100)
    y_pred = beta_0 + beta_1 * x_range
    plt.plot(x_range, y_pred, color='red', linewidth=2, label=f'Fitted line: y = {beta_0:.4f} + {beta_1:.4f}x')

    # Add labels and title
    plt.xlabel(feature_name)
    plt.ylabel(outcome_name)
    plt.title(f'Linear Regression: {outcome_name} vs {feature_name}')
    plt.grid(True, alpha=0.3)
    plt.legend()

    # Display equation on the plot
    equation = f"{outcome_name} = {beta_0:.4f} + {beta_1:.4f} × {feature_name}"
    plt.annotate(equation, xy=(0.05, 0.95), xycoords='axes fraction',
                 fontsize=12, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="gray", alpha=0.8))

    plt.tight_layout()
    plt.show()

# Convenience wrapper: fit, report, and plot in one call
def run_regression_analysis(data, feature_name, outcome_name):
    """
    Run the complete regression analysis on the given dataset.

    Parameters:
    data (DataFrame): Pandas DataFrame containing the data
    feature_name (str): Name of the independent variable column
    outcome_name (str): Name of the dependent variable column

    Returns:
    tuple: (beta_0, beta_1)
    """
    # Extract the variables
    x = data[feature_name]
    y = data[outcome_name]

    # Calculate beta coefficients
    beta_0, beta_1, S_xy, S_xx = calculate_beta_coefficients(x, y)

    # Print the results
    print(f"Sum of cross-deviations (S_xy): {S_xy:.4f}")
    print(f"Sum of squared deviations of X (S_xx): {S_xx:.4f}")
    print(f"Beta 1 (slope): {beta_1:.4f}")
    print(f"Beta 0 (intercept): {beta_0:.4f}")
    print(f"Regression equation: {outcome_name} = {beta_0:.4f} + {beta_1:.4f} × {feature_name}")

    # Plot the results
    plot_regression(x, y, beta_0, beta_1, feature_name, outcome_name)

    return beta_0, beta_1


# Regress mean perimeter on mean radius
feature_name = 'mean_radius'
outcome_name = 'mean_perimeter'
beta_0, beta_1 = run_regression_analysis(cancer_dataset, feature_name, outcome_name)