# Biol 359A | Descriptive statistics and comparing groups
### Spring 2022, Week 2
<hr>

Objectives:
-  Read basic Python syntax
-  Run and interpret a t-test
-  Gain intuition about statistical tests and sample sizes


In [None]:
!git clone https://github.com/BIOL359A-FoundationsOfQBio-Spr22/week2_statisticaltests
!mkdir -p ./data
!cp week2_statisticaltests/data/* ./data
!cp week2_statisticaltests/clean_data.py ./

### Import statements

Import statements are used to integrate *external code or packages* into our analysis. 

- `pandas`: Represents data as tables
- `seaborn`: Data exploration visualization tool
- `ipywidgets`: Notebook widgets that add user interfaces to notebooks
- `random`: Generate random numbers
- `numpy`: General math/matrices package
- `matplotlib`: Data visualization software
- `scipy`: General scientific computing

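Using `as` gives the package an alias: after `import pandas as pd`, we can write `pd` instead of `pandas`.
`import matplotlib.pyplot as plt` imports the submodule `pyplot` from `matplotlib`.
`from scipy.stats import ttest_ind as ttest` imports only the function `ttest_ind` from the submodule `scipy.stats` and aliases it as `ttest`.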
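
As a minimal sketch of these three import forms, the block below uses only modules from the Python standard library. The modules `math`, `os.path`, and `statistics`, and the path `folder/file.txt`, are stand-ins chosen for illustration; they are not used elsewhere in this notebook.

```python
# Illustrative only: standard-library modules stand in for the packages above.
import math as m                     # alias a package: `m` now refers to `math`
import os.path as osp                # import a submodule and alias it
from statistics import mean as avg   # import a single function and alias it

print(m.sqrt(16))                       # same as math.sqrt(16)
print(osp.basename("folder/file.txt"))  # same as os.path.basename(...)
print(avg([1, 2, 3, 4]))                # same as statistics.mean(...)
```

The alias changes only the name we type; the underlying package or function is the same.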

In [1]:
import pandas as pd
import seaborn as sns
import ipywidgets as widgets
import random
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind as ttest


sns.set_context("notebook")
sns.set_style("whitegrid")

### Global Variables

These variables are accessible from any part of the code. By convention, their names are written in `UPPER_SNAKE_CASE`.

In [2]:
# You can safely adjust these font sizes if the plots are hard to read
TITLE_FONT = 20
LABEL_FONT = 16
TICK_FONT = 16
FIG_SIZE = (15,15)

### Understanding our data

We are going to import a dataset using another Python script, `clean_data.py`. The script loads the file and does some basic cleaning to remove the parts of the dataset we aren't using.

In [3]:
import clean_data  # helper module that loads and cleans the dataset

cancer_dataset = clean_data.generate_clean_dataframe()
cancer_dataset

ID,diagnosis,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension
842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871
842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667
84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999
84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744
84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883
...,...,...,...,...,...,...,...,...,...,...,...
926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623
926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533
926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648
927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016


### We have a basic understanding of the structure of the data now. 

From the data source: Wisconsin Diagnostic Breast Cancer (WDBC)

```
	Features are computed from a digitized image of a fine needle
	aspirate (FNA) of a breast mass.  They describe
	characteristics of the cell nuclei present in the image.
	A few of the images can be found at
	http://www.cs.wisc.edu/~street/images/

	Separating plane described above was obtained using
	Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
	Construction Via Linear Programming." Proceedings of the 4th
	Midwest Artificial Intelligence and Cognitive Science Society,
	pp. 97-101, 1992], a classification method which uses linear
	programming to construct a decision tree.  Relevant features
	were selected using an exhaustive search in the space of 1-4
	features and 1-3 separating planes.

	The actual linear program used to obtain the separating plane
	in the 3-dimensional space is that described in:
	[K. P. Bennett and O. L. Mangasarian: "Robust Linear
	Programming Discrimination of Two Linearly Inseparable Sets",
	Optimization Methods and Software 1, 1992, 23-34].
    
    Source:
    W.N. Street, W.H. Wolberg and O.L. Mangasarian 
	Nuclear feature extraction for breast tumor diagnosis.
	IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science
	and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
```

What do all the column names mean?

- ID number
- Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry 
- fractal dimension ("coastline approximation" - 1)


Category distribution: 357 benign, 212 malignant

If we wanted to show the first five values in the table:

In [4]:
cancer_dataset["mean_area"].head(5)

ID        diagnosis
842302    M            1001.0
842517    M            1326.0
84300903  M            1203.0
84348301  M             386.1
84358402  M            1297.0
Name: mean_area, dtype: float64

If we wanted to show the first five values from the groups in the diagnosis column:

In [5]:
cancer_dataset["mean_area"].groupby("diagnosis").head(5)

ID        diagnosis
842302    M            1001.0
842517    M            1326.0
84300903  M            1203.0
84348301  M             386.1
84358402  M            1297.0
8510426   B             566.3
8510653   B             520.0
8510824   B             273.9
854941    B             523.8
85713702  B             201.9
Name: mean_area, dtype: float64
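
To make the grouping step easier to follow in isolation, here is a small sketch on a toy table. It assumes, as the output above suggests, that `diagnosis` is a level of the row index; the IDs and values below are made up and are not from the WDBC dataset.

```python
import pandas as pd

# Toy data (hypothetical IDs and values, for illustration only)
index = pd.MultiIndex.from_tuples(
    [(1, "M"), (2, "M"), (3, "B"), (4, "B"), (5, "B")],
    names=["ID", "diagnosis"],
)
toy_area = pd.Series([1000.0, 1200.0, 500.0, 300.0, 400.0],
                     index=index, name="mean_area")

# First two rows of each diagnosis group, kept in their original order
first_rows = toy_area.groupby(level="diagnosis").head(2)

# One summary value per diagnosis group
group_means = toy_area.groupby(level="diagnosis").mean()

print(first_rows)
print(group_means)
```

`head` returns rows from every group, while an aggregation such as `mean` collapses each group to a single value.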

If we wanted to calculate basic descriptive statistics:

In [6]:
def calculate_count(x):
    return len(x)

def calculate_mean(x):
    return np.sum(x) / len(x)

def calculate_variance(x):
    return calculate_mean((x - calculate_mean(x))**2)

def calculate_std(x):
    return np.sqrt(calculate_variance(x))

area = cancer_dataset["mean_area"]

print(f"count =  {calculate_count(area):.0f}")
print(f"mean =  {calculate_mean(area):.2f}")
print(f"var =  {calculate_variance(area):.2f}")
print(f"std =  {calculate_std(area):.2f}")


count =  569
mean =  654.89
var =  123625.90
std =  351.60
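
Written out as formulas, the helper functions above compute the following for $n$ values $x_1, \dots, x_n$ (the symbols $\bar{x}$, $\sigma^2$, and $\sigma$ are notation introduced here for illustration):

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\sigma = \sqrt{\sigma^2}
$$

For the output above, $\sqrt{123625.90} \approx 351.60$, which matches the printed standard deviation.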


`pandas` has us covered:

In [7]:
cancer_dataset["mean_area"].describe()

count     569.000000
mean      654.889104
std       351.914129
min       143.500000
25%       420.300000
50%       551.100000
75%       782.700000
max      2501.000000
Name: mean_area, dtype: float64

Notice a difference? The standard deviation is being calculated differently: `pandas` divides by $n - 1$ (the sample standard deviation), while our function divides by $n$.

In [8]:
def calculate_std_est(x):
    """sample standard deviation: divides by n - 1 (Bessel's correction)"""
    return np.sqrt(np.sum((x - calculate_mean(x))**2)/(len(x)-1))

print(f"unbiased std =  {calculate_std_est(area):.2f}")

unbiased std =  351.91
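
The same distinction appears as the `ddof` ("delta degrees of freedom") argument in `numpy` and `pandas`. The sketch below uses a small made-up array (not values from the dataset) to show that `np.std` divides by $n$ by default, while `pandas` divides by $n - 1$ by default.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; not taken from the cancer dataset
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values)        # ddof=0 by default: divide by n
sample_std = np.std(values, ddof=1)    # divide by n - 1
pandas_std = pd.Series(values).std()   # pandas defaults to ddof=1

print(f"numpy  (ddof=0):  {population_std:.4f}")
print(f"numpy  (ddof=1):  {sample_std:.4f}")
print(f"pandas (default): {pandas_std:.4f}")
```

The gap between the two versions shrinks as $n$ grows, which is why the difference in the 569-row dataset is small.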


What if we're interested in the relationship between variables?

In [16]:
###########################################
#                                         #
# You don't need to understand this code  #
#                                         #
###########################################


def calculate_correlation(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations"""
    covariance = calculate_mean((x - calculate_mean(x)) * (y - calculate_mean(y)))
    return covariance / np.sqrt(calculate_variance(x) * calculate_variance(y))

# Create scatter plots of the various features
@widgets.interact(x=list(cancer_dataset), y=list(cancer_dataset))
def make_scatterplot(x="mean_radius",y="mean_area"):
    colors=["#e28743", "#1e81b0"]

    corr = calculate_correlation(cancer_dataset[x], cancer_dataset[y])
    # Use the second color when the correlation exceeds 0.5
    index = int(corr > 0.5)
    color = colors[index]
    plt.title(r"correlation: $\rho = ${:.3f}".format(corr), color= color, size=TITLE_FONT)
    sns.scatterplot(data=cancer_dataset, x=x, y=y, alpha=0.5, color=color);


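
For reference, `calculate_correlation` appears to follow the usual Pearson correlation formula. As a rough check under that assumption, the sketch below applies the same arithmetic to two short made-up arrays and compares the result with `np.corrcoef`. The helper name `pearson_by_hand` is introduced here for illustration.

```python
import numpy as np

def pearson_by_hand(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x_centered = x - np.mean(x)
    y_centered = y - np.mean(y)
    covariance = np.mean(x_centered * y_centered)
    return covariance / np.sqrt(np.mean(x_centered**2) * np.mean(y_centered**2))

# Made-up values for illustration; not taken from the cancer dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

by_hand = pearson_by_hand(x, y)
from_numpy = np.corrcoef(x, y)[0, 1]
print(f"by hand: {by_hand:.4f}, numpy: {from_numpy:.4f}")
```

The two values should agree up to floating-point rounding.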

### Let's split the variables up by their category (also called its label).

Based on our available data, we are less interested in the descriptive statistics of each column as a whole than in how those statistics differ between the two diagnoses.

In [10]:
def compare_diagnoses_by_variable(variable: str, dataframe: pd.DataFrame = cancer_dataset):
    """Accepts a column name and returns descriptive statistics for each diagnosis"""
    return dataframe[variable].groupby("diagnosis").describe()

In [11]:
@widgets.interact(variable=list(cancer_dataset))
def comparison_wrapper(variable="mean_radius"):
    return compare_diagnoses_by_variable(variable)


### Let's run a t-test on these categories

In [12]:
@widgets.interact(variable=list(cancer_dataset))
def run_ttest(variable="mean_radius"):
    cat1 = cancer_dataset.xs("M", level=1)
    cat2 = cancer_dataset.xs("B", level=1)

    tstat, pvalue = ttest(cat1[variable], cat2[variable])
    print(f"p-value: {pvalue:.2e}")


In [13]:
@widgets.interact(variable=list(cancer_dataset), n=(3,100))
def run_ttest(variable="mean_radius", n=3):
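    seed = 1
    # Only the malignant sample is seeded, so it stays fixed for a given n;
    # the benign sample is redrawn every time the function runs.
    cat1 = cancer_dataset.xs("M", level=1).sample(n, random_state=seed)
    cat2 = cancer_dataset.xs("B", level=1).sample(n)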

    tstat, pvalue = ttest(cat1[variable], cat2[variable])
    print(f"p-value: {pvalue:.2e}")

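
To see why the p-value depends so strongly on `n`, it can help to hold the group means and standard deviations fixed and change only the sample size. The sketch below does this with `scipy.stats.ttest_ind_from_stats`. The means, standard deviations, and sample sizes are assumptions chosen for illustration, not values from the dataset.

```python
from scipy.stats import ttest_ind_from_stats

# Made-up summary statistics for two groups (illustrative only)
mean_a, std_a = 10.0, 3.0
mean_b, std_b = 12.0, 3.0

p_values = {}
for n in (3, 10, 30, 100):
    # Same means and standard deviations each time; only n changes
    result = ttest_ind_from_stats(mean_a, std_a, n, mean_b, std_b, n)
    p_values[n] = result.pvalue
    print(f"n = {n:>3}: p-value = {result.pvalue:.2e}")
```

With identical means and spreads, the same difference between groups goes from not significant at small `n` to highly significant at large `n`.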

### Considering the assumptions of Student's t-test, when is it not appropriate?
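
One assumption worth checking is equal variance in the two groups. `scipy.stats.ttest_ind` assumes equal variances by default (`equal_var=True`); passing `equal_var=False` runs Welch's t-test, which does not make that assumption. The sketch below compares the two on small made-up samples with very different spreads (the values are illustrative, not from the dataset).

```python
from scipy.stats import ttest_ind

# Made-up samples: different sizes and very different spreads
group_a = [10, 11, 9, 10, 10, 11, 9, 10]
group_b = [5, 20, 12, 1, 25]

student = ttest_ind(group_a, group_b)                  # assumes equal variances
welch = ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

Other assumptions, such as independent observations and approximately normal data within each group, are not checked by this sketch.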

### Intuition: Central limit theorem

You can ignore the code below! It is setting up an experiment.

In [14]:
def get_random_datasets(n, rng, repetitions=200):
    averages = []
    all_nums = []
    for _ in range(repetitions):
        # Draw n random numbers; `_` avoids shadowing the argument n
        nums = [generate_random_numbers(rng) for _ in range(n)]
        all_nums += nums
        averages.append(np.mean(nums))
    return all_nums, averages
    
def generate_histograms_clt(axs, n=10, rng="uniform"):
    """build 2x2 matrix of histograms"""

    all_nums_fixed, averages_fixed = get_random_datasets(10, rng)
    build_paired_histograms(averages_fixed, all_nums_fixed, 10, rng, axs, column=0)
    
    all_nums, averages = get_random_datasets(n, rng)
    build_paired_histograms(averages, all_nums, n, rng, axs, column=1)
    
def build_paired_histograms(averages, all_nums, n, rng, axs, column):
    colors=["#1e81b0", "#e28743"]
    color = colors[column]
    axs[0,column].set_title("random samples")
    axs[1,column].set_title("sample averages")
    axs[0,column].text(0.9, 0.9, f"n:{n}",
                       verticalalignment='bottom', horizontalalignment='right',
                       transform=axs[0,column].transAxes,
                       color=color, fontsize=LABEL_FONT)
    axs[1,column].text(0.9, 0.9, f"n:{n}",
                       verticalalignment='bottom', horizontalalignment='right',
                       transform=axs[1,column].transAxes,
                       color=color, fontsize=LABEL_FONT)
    sns.histplot(all_nums, ax=axs[0, column], color = color, stat="probability")
    sns.histplot(averages, bins=10, ax=axs[1, column], color = color, stat="probability")
    axs[1, column].set_xlim(0,10)
    
    
def generate_random_numbers(generator = "uniform"):
    """generate random numbers with a mean of 5"""
    if generator == "uniform": return random.uniform(0,10)
    elif generator == "exponential": return random.expovariate(1/5)
    elif generator == "normal": return random.gauss(5,2)
    
def format_plots(axs):
    for ax in axs.flat:
        title = ax.get_title()
        ax.set_title(title, fontweight="bold", size=LABEL_FONT)
        ax.set_ylabel('Proportion (Probability)', fontsize = LABEL_FONT)
        ax.set_xlabel('Number', fontsize = LABEL_FONT)
        ax.tick_params(labelsize=TICK_FONT)
        

In [15]:
@widgets.interact_manual(n=(3,100), generator=["uniform","exponential","normal"])
def demonstrate_clt(n=10, generator="exponential"):
    random.seed(10)
    fig, axs = plt.subplots(2, 2, figsize=FIG_SIZE, constrained_layout=True)
    fig.suptitle(f"Random number generator (distribution): {generator}",fontweight="bold", size=TITLE_FONT)
    generate_histograms_clt(axs, n, generator)
    format_plots(axs)

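
Alongside the bell shape predicted by the central limit theorem, the spread of the sample averages should shrink roughly as $\sigma / \sqrt{n}$. The sketch below, using only the standard library, checks this for an exponential generator with mean 5 (so $\sigma = 5$). The function name `sample_means` and the repetition count are choices made here for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, repetitions=2000):
    """Average n exponential draws (mean 5), repeated many times."""
    return [
        statistics.mean([random.expovariate(1 / 5) for _ in range(n)])
        for _ in range(repetitions)
    ]

for n in (5, 50):
    means = sample_means(n)
    spread = statistics.stdev(means)
    expected = 5 / n**0.5  # sigma / sqrt(n), with sigma = 5 for this distribution
    print(f"n = {n:>2}: observed spread = {spread:.3f}, sigma/sqrt(n) = {expected:.3f}")
```

The observed spread should land close to the $\sigma / \sqrt{n}$ value for each `n`, even though the individual draws are far from normally distributed.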