# Biol 359A | Statistical tests and comparing data

### Spring 2024, Week 2

Objectives:
- Run and interpret a t-test
- Understand p-values and common pitfalls
- Gain intuition about statistical tests and sample sizes

### Import statements

The packages used are as follows:
- `pandas` provides dataframes for data storage and manipulation
- `ipywidgets` provides dynamic notebook widgets (like sliders)
- `scipy` is a general scientific computing package
- `numpy` is a general math/matrices package
- `matplotlib` provides data visualization functionality
- `seaborn` is a data exploration visualization tool

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
import scipy.stats as stats
import seaborn as sns

ASSIGNMENT QUESTION:
- What is the name of someone in your group? If they could have any animal or mythical creature as a pet or friend, what would it be?

### Hypothesis Testing Overview:

Hypothesis testing is a method used to decide whether there is enough evidence to reject the null hypothesis.

### Key Concepts and Steps:

1. **Null Hypothesis ($H_0$):** This is our starting assumption, which typically suggests no effect or no difference. In mathematical terms, we might express it as $H_0: \bar{X} = \mu$, where $\bar{X}$ is the sample mean, and $\mu$ is the population mean we're comparing to. The null hypothesis is what you'd expect by default, without new evidence.

2. **Alternative Hypothesis ($H_a$):** This represents the alternative belief, essentially what we suspect might be true instead of the null hypothesis. It's expressed as $H_a: \bar{X} \neq \mu$ for a two-sided test, or $H_a: \bar{X} > \mu$ or $H_a: \bar{X} < \mu$ for one-sided tests. This hypothesis suggests there is an effect or a difference.

3. **Test Statistic:** A test statistic is a value calculated from the sample data, used to decide whether to reject the null hypothesis. It's formulated based on our hypotheses. For a mean difference test with unknown variance, it's often a _t-statistic_ given by:

    $$t = \frac{\bar{X} - \mu}{s_x/\sqrt{n}}$$

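    where $s_x$ is the sample standard deviation, and $n$ is the sample size. This statistic measures how far our sample mean $\bar{X}$ is from the population mean $\mu$, scaled by the standard error of the mean, $s_x/\sqrt{n}$. The one-sample test shown here assumes the observations are independent and approximately normally distributed. When two groups are compared, the classic (Student's) two-sample t-test additionally assumes that the two groups have equal variance.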

4. **Null Distribution:** This is the probability distribution of the test statistic under the assumption that the null hypothesis is true. For the _t-statistic_, this would be the _t-distribution_ with $n - 1$ degrees of freedom. The null distribution helps us understand what values of the test statistic are likely if the null hypothesis holds. When $H_0$ is true, the mean of the _t-distribution_ is 0.

5. **P-value:** The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the observed result, under the assumption that the null hypothesis is true. It quantifies how unusual our data is if $H_0$ were true. A low p-value indicates that such data would be very unlikely under the null hypothesis, suggesting evidence against $H_0$.

6. **Significance Level (α):** Before conducting the test, we choose a significance level (often 0.05), which is the threshold used to decide whether the p-value is low enough to reject the null hypothesis. If the p-value is less than $\alpha$, we reject $H_0$, concluding that our results are statistically significant.
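The steps above can be sketched in code. This is a minimal sketch, assuming simulated measurements (drawn from a normal distribution with a seeded random generator) and a hypothesized mean `mu = 10.0`; neither comes from the course data. It computes the t-statistic from the formula in step 3, converts it to a two-sided p-value using the t-distribution with $n - 1$ degrees of freedom, and checks the hand calculation against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Simulated measurements: an assumption for illustration, not real data
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.5, scale=2.0, size=30)

mu = 10.0      # hypothesized population mean under H0 (illustrative value)
alpha = 0.05   # significance level, chosen before looking at the data

# Test statistic: t = (x_bar - mu) / (s_x / sqrt(n))
n = sample.size
x_bar = sample.mean()
s_x = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
t_manual = (x_bar - mu) / (s_x / np.sqrt(n))

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test using scipy's built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu)

# Decision rule from step 6
reject_h0 = p_manual < alpha
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.4f}")
print(f"reject H0 at alpha = {alpha}: {reject_h0}")
```

The manual and `scipy` values should match, which confirms that the formula was applied as written. Comparing `p_manual` with `alpha` is the decision rule described in step 6.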


DISCUSSION QUESTION
- What is a mean difference test?

## Exploring real data

For today's lesson we will be working on real breast cancer data from the [Wisconsin Diagnostic Breast Cancer Database (WDBC)](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

Here is a summary of the data from the data source:
```
	Features are computed from a digitized image of a fine needle
	aspirate (FNA) of a breast mass.  They describe
	characteristics of the cell nuclei present in the image.
	A few of the images can be found at
	http://www.cs.wisc.edu/~street/images/

	Separating plane described above was obtained using
	Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
	Construction Via Linear Programming." Proceedings of the 4th
	Midwest Artificial Intelligence and Cognitive Science Society,
	pp. 97-101, 1992], a classification method which uses linear
	programming to construct a decision tree.  Relevant features
	were selected using an exhaustive search in the space of 1-4
	features and 1-3 separating planes.

	The actual linear program used to obtain the separating plane
	in the 3-dimensional space is that described in:
	[K. P. Bennett and O. L. Mangasarian: "Robust Linear
	Programming Discrimination of Two Linearly Inseparable Sets",
	Optimization Methods and Software 1, 1992, 23-34].

	This database is also available through the UW CS ftp server:
	ftp ftp.cs.wisc.edu
	cd math-prog/cpo-dataset/machine-learn/WDBC/
    
    Source:
    W.N. Street, W.H. Wolberg and O.L. Mangasarian 
	Nuclear feature extraction for breast tumor diagnosis.
	IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science
	and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
```

What do all the column names mean?

- ID number
- Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry 
- fractal dimension ("coastline approximation" - 1) - a measure of "complexity" of a 2D image.


Category distribution: 357 benign, 212 malignant

We will import and clean these data using a separate Python script called `clean_data.py`. That script loads the data into a pandas dataframe and removes the parts of the dataset we will not be using:

In [4]:
import clean_data

cancer_dataset = clean_data.generate_clean_dataframe()
cancer_dataset

ID,diagnosis,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension
842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871
842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667
84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999
84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744
84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883
...,...,...,...,...,...,...,...,...,...,...,...
926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623
926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533
926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648
927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016


In [25]:
cancer_dataset.index

MultiIndex([(  842302, 'M'),
            (  842517, 'M'),
            (84300903, 'M'),
            (84348301, 'M'),
            (84358402, 'M'),
            (  843786, 'M'),
            (  844359, 'M'),
            (84458202, 'M'),
            (  844981, 'M'),
            (84501001, 'M'),
            ...
            (  925291, 'B'),
            (  925292, 'B'),
            (  925311, 'B'),
            (  925622, 'M'),
            (  926125, 'M'),
            (  926424, 'M'),
            (  926682, 'M'),
            (  926954, 'M'),
            (  927241, 'M'),
            (   92751, 'B')],
           names=['ID', 'diagnosis'], length=569)

We can now use pandas dataframe functions to explore the data.

If we want to show the first five values in some column in the table:

In [6]:
cancer_dataset["mean_area"].head(5)

ID        diagnosis
842302    M            1001.0
842517    M            1326.0
84300903  M            1203.0
84348301  M             386.1
84358402  M            1297.0
Name: mean_area, dtype: float64

If we want to show the first five values for each diagnosis category:

In [7]:
cancer_dataset["mean_area"].groupby("diagnosis").head(5)

ID        diagnosis
842302    M            1001.0
842517    M            1326.0
84300903  M            1203.0
84348301  M             386.1
84358402  M            1297.0
8510426   B             566.3
8510653   B             520.0
8510824   B             273.9
854941    B             523.8
85713702  B             201.9
Name: mean_area, dtype: float64
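Because `diagnosis` is a level of the row index rather than a column, selecting one group works a little differently than usual. The sketch below rebuilds a tiny stand-in dataframe from the ten `mean_area` rows printed above (the name `sample` is an assumption; in the notebook you would use `cancer_dataset` directly) and shows two ways to pull out a single diagnosis group.

```python
import pandas as pd

# Ten (ID, diagnosis, mean_area) rows copied from the output above.
# This is a stand-in for the full 569-row cancer_dataset.
rows = [
    (842302, "M", 1001.0),
    (842517, "M", 1326.0),
    (84300903, "M", 1203.0),
    (84348301, "M", 386.1),
    (84358402, "M", 1297.0),
    (8510426, "B", 566.3),
    (8510653, "B", 520.0),
    (8510824, "B", 273.9),
    (854941, "B", 523.8),
    (85713702, "B", 201.9),
]
sample = pd.DataFrame(rows, columns=["ID", "diagnosis", "mean_area"]).set_index(
    ["ID", "diagnosis"]
)

# Option 1: cross-section on one index level (the 'diagnosis' level is dropped)
malignant = sample.xs("M", level="diagnosis")

# Option 2: boolean mask built from one index level (both levels are kept)
is_benign = sample.index.get_level_values("diagnosis") == "B"
benign = sample[is_benign]

print(malignant)
print(benign)
```

`xs` is convenient when you only need the values for one group, while the boolean mask keeps the full `(ID, diagnosis)` index, which helps if the groups will be recombined later.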

We can use the `pandas` package for calculating summary statistics:

In [8]:
cancer_dataset["mean_area"].describe()

count     569.000000
mean      654.889104
std       351.914129
min       143.500000
25%       420.300000
50%       551.100000
75%       782.700000
max      2501.000000
Name: mean_area, dtype: float64
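The same summary can be computed separately for each diagnosis by grouping on the index level first. The sketch below again uses only the ten `mean_area` rows printed earlier (an assumption made so the example is self-contained), so its numbers will differ from those of the full dataset.

```python
import pandas as pd

# Ten (ID, diagnosis, mean_area) rows copied from the earlier output.
# This is a stand-in for the full 569-row cancer_dataset.
rows = [
    (842302, "M", 1001.0),
    (842517, "M", 1326.0),
    (84300903, "M", 1203.0),
    (84348301, "M", 386.1),
    (84358402, "M", 1297.0),
    (8510426, "B", 566.3),
    (8510653, "B", 520.0),
    (8510824, "B", 273.9),
    (854941, "B", 523.8),
    (85713702, "B", 201.9),
]
sample = pd.DataFrame(rows, columns=["ID", "diagnosis", "mean_area"]).set_index(
    ["ID", "diagnosis"]
)

# One row of summary statistics per diagnosis category
summary = sample["mean_area"].groupby(level="diagnosis").describe()
print(summary[["count", "mean", "std"]])
```

The per-group count, mean, and standard deviation are the ingredients a two-sample comparison of means is built from.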

## Formulating hypotheses

To further our understanding of cancer, we aim to identify which characteristics of cell nuclei differ significantly between malignant (cancerous) and benign (non-cancerous) masses. Insight into these distinctions holds potential not only for enhancing diagnostic techniques but also for generating hypotheses regarding the onset and advancement of cancer.

The following Python code generates box plots to compare the distributions of various measured nuclear characteristics between malignant and benign cells. Box plots are invaluable for such comparisons because they succinctly summarize data distributions. Here's a brief guide to understanding the components of a box plot:

- The Box: Represents the interquartile range (IQR), capturing the middle 50 percent of the data. The bottom and top of the box mark the first (Q1) and third (Q3) quartiles, respectively, and the line inside the box identifies the median.
- The Whiskers: Extend from the box to the most extreme data points that still fall within Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, so they show the range of the data excluding potential outliers.
- Outliers: Often depicted as diamonds, these points lie outside the range covered by the whiskers, indicating data points that are unusually high or low.
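- The Vertical Line: Because the box plots here are drawn horizontally, the median appears as a vertical line inside the box. The mean is not drawn on the plot by default; it is listed in the summary statistics table beneath the plot.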
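The quantities a box plot draws can also be computed directly. The sketch below is a minimal illustration, assuming the five malignant `mean_area` values printed earlier in the notebook as input; five points are far too few for a meaningful plot, but they keep the arithmetic easy to follow. It uses the 1.5 × IQR rule described above.

```python
import numpy as np

# Five malignant mean_area values copied from the earlier notebook output
values = np.array([1001.0, 1326.0, 1203.0, 386.1, 1297.0])

# The box: first quartile, median, and third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"whiskers: [{whisker_low}, {whisker_high}]")
print(f"outliers: {outliers}")
```

For these five values, Q1 is 1001.0 and Q3 is 1297.0, so the IQR is 296.0 and the fences sit at 557.0 and 1741.0. The value 386.1 falls below the lower fence, so it would be drawn as an outlier, and the lower whisker would stop at 1001.0.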

By selecting different characteristics from the dropdown menu, you can explore the distribution of data for each nuclear characteristic across malignant and benign cells. This visual exploration can reveal how certain nuclear features vary between cancerous and non-cancerous cells, potentially highlighting factors critical to the diagnosis or understanding of cancer.

In [29]:
cancer_dataset_reset = cancer_dataset.reset_index()
cancer_dataset_reset.columns = ['ID', 'Diagnosis'] + list(cancer_dataset_reset.columns[2:])

# Function to plot the selected column
def plot_column(column):
    # Calculate summary statistics for each diagnosis
    stats_df = cancer_dataset_reset.groupby('Diagnosis')[column].describe()
    
    # Plotting
    plt.figure(figsize=(10, 8))
    
    # Plot boxplot
    sns.boxplot(y='Diagnosis', x=column, data=cancer_dataset_reset, orient='h')
    plt.title(f'Box plot of {column} by Diagnosis')
    plt.ylabel('Diagnosis')
    plt.xlabel(column)
    
    # Display summary statistics table below the boxplot
    table_ax = plt.table(cellText=stats_df.round(2).values,
                         colLabels=stats_df.columns,
                         rowLabels=stats_df.index,
                         cellLoc = 'center', rowLoc = 'center',
                         loc='bottom', bbox=[0, -0.5, 1, 0.3])
    plt.subplots_adjust(left=0.2, bottom=0.2)
    
    plt.show()

# Create a dropdown menu for selecting the column to plot
dropdown_columns = widgets.Dropdown(
    options=[col for col in cancer_dataset_reset.columns if col not in ['ID', 'Diagnosis']],
    description='Column:',
    disabled=False,
)

# Display the dropdown and plot the selected column
widgets.interact(plot_column, column=dropdown_columns)

