# t-test
## theory
> Main goal: test whether the means of two classes differ. Null hypothesis: the two classes have the same mean (no significant difference).
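
As a minimal sketch of what the test computes, the snippet below evaluates Welch's t-statistic (the unequal-variance variant used later in these notes) by hand and compares it with `scipy.stats.ttest_ind`. The synthetic samples `a` and `b` and the helper name `welch_t` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import scipy.stats as ss

# Synthetic 1-d samples (assumed data, for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=40)
b = rng.normal(loc=0.5, scale=2.0, size=60)

def welch_t(a, b):
    """Welch's t-statistic and two-sided p-value for two 1-d samples."""
    va = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * ss.t.sf(np.abs(t), df)
    return t, p

t_manual, p_manual = welch_t(a, b)
res = ss.ttest_ind(a, b, equal_var=False)
print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

A small p-value means that a mean difference this large would be unlikely if the null hypothesis were true.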

## scipy implementation
```doc
ss.ttest_ind(
    a,
    b,
    axis=0,
    equal_var=True,
    nan_policy='propagate',
    permutations=None,
    random_state=None,
    alternative='two-sided',
    trim=0,
)
Docstring:
Calculate the T-test for the means of *two independent* samples of scores.

This is a two-sided test for the null hypothesis that 2 independent samples
have identical average (expected) values. This test assumes that the
populations have identical variances by default.

Parameters
----------
a, b : array_like
    The arrays must have the same shape, except in the dimension
    corresponding to `axis` (the first, by default).
axis : int or None, optional
    Axis along which to compute test. If None, compute over the whole
    arrays, `a`, and `b`.
equal_var : bool, optional
    If True (default), perform a standard independent 2 sample test
    that assumes equal population variances [1]_.
    If False, perform Welch's t-test, which does not assume equal
    population variance [2]_.

nan_policy : {'propagate', 'raise', 'omit'}, optional
    Defines how to handle when input contains nan.
    The following options are available (default is 'propagate'):

      * 'propagate': returns nan
      * 'raise': throws an error
      * 'omit': performs the calculations ignoring nan values

    The 'omit' option is not currently available for permutation tests or
    one-sided asymptotic tests.

permutations : non-negative int, np.inf, or None (default), optional
    If 0 or None (default), use the t-distribution to calculate p-values.
    Otherwise, `permutations` is  the number of random permutations that
    will be used to estimate p-values using a permutation test. If
    `permutations` equals or exceeds the number of distinct partitions of
    the pooled data, an exact test is performed instead (i.e. each
    distinct partition is used exactly once). See Notes for details.


random_state : {None, int, `numpy.random.Generator`,
        `numpy.random.RandomState`}, optional

    If `seed` is None (or `np.random`), the `numpy.random.RandomState`
    singleton is used.
    If `seed` is an int, a new ``RandomState`` instance is used,
    seeded with `seed`.
    If `seed` is already a ``Generator`` or ``RandomState`` instance then
    that instance is used.

    Pseudorandom number generator state used to generate permutations
    (used only when `permutations` is not None).

alternative : {'two-sided', 'less', 'greater'}, optional
    Defines the alternative hypothesis.
    The following options are available (default is 'two-sided'):

      * 'two-sided'
      * 'less': one-sided
      * 'greater': one-sided

trim : float, optional
    If nonzero, performs a trimmed (Yuen's) t-test.
    Defines the fraction of elements to be trimmed from each end of the
    input samples. If 0 (default), no elements will be trimmed from either
    side. The number of trimmed elements from each tail is the floor of the
    trim times the number of elements. Valid range is [0, .5).

```

In [18]:
from sklearn.datasets import load_iris
import scipy.stats as ss
import numpy as np

data_iris = load_iris()
data_iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [28]:
iris_f = data_iris["feature_names"]
iris_data = data_iris["data"][:,:3]
iris_target = data_iris.target
iris_f[:3]

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']

In [29]:
from collections import Counter

c = Counter(iris_target)
c.most_common()

[(0, 50), (1, 50), (2, 50)]

In [31]:
# Split the samples by class label using boolean masks
class0 = iris_data[iris_target == 0].astype(np.float64)
class1 = iris_data[iris_target == 1].astype(np.float64)
class2 = iris_data[iris_target == 2].astype(np.float64)

In [33]:
# t-test for mean between class-0 and class-1
ss.ttest_ind(class0,class1,axis=0,equal_var=False)

Ttest_indResult(statistic=array([-10.52098627,   9.45497585, -39.49271939]), pvalue=array([3.74674261e-17, 2.48422790e-15, 9.93443296e-46]))

In [34]:
# t-test for mean between class-1 and class-2
ss.ttest_ind(class1,class2,axis=0,equal_var=False)

Ttest_indResult(statistic=array([ -5.62916526,  -3.20576075, -12.60377944]), pvalue=array([1.86614439e-07, 1.81948348e-03, 4.90028753e-22]))

In [35]:
# t-test for mean between class-0 and class-2
ss.ttest_ind(class0,class2,axis=0,equal_var=False)

Ttest_indResult(statistic=array([-15.38619582,   6.45034909, -49.98618626]), pvalue=array([3.96686727e-25, 4.57077142e-09, 9.26962759e-50]))

## notes
> 1. The t-test is univariate: it compares one feature at a time. With `axis=0`, `ttest_ind` runs a separate test on each column and returns one statistic and one p-value per feature.  <br>
> 2. Testing each feature separately ignores correlations between features. Equivalently, it treats the features as independent, so the joint pdf factors into the product of the marginal pdfs instead of being modeled as a single multivariate density.
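
The first note can be checked with a short sketch: calling `ttest_ind` with `axis=0` on 2-d arrays should give the same numbers as looping over the columns and testing each feature on its own. The arrays `x0` and `x1` are synthetic stand-ins, assumed here for illustration.

```python
import numpy as np
import scipy.stats as ss

# Two synthetic classes with 3 features each (assumed data)
rng = np.random.default_rng(1)
x0 = rng.normal(size=(50, 3))
x1 = rng.normal(loc=[0.0, 0.5, 2.0], size=(50, 3))

# One call: an independent Welch test per column
vectorized = ss.ttest_ind(x0, x1, axis=0, equal_var=False)

# Explicit loop: one 1-d test per feature
per_feature = [
    ss.ttest_ind(x0[:, j], x1[:, j], equal_var=False)
    for j in range(x0.shape[1])
]
loop_stats = np.array([r.statistic for r in per_feature])
loop_pvals = np.array([r.pvalue for r in per_feature])

print(np.allclose(vectorized.statistic, loop_stats))
print(np.allclose(vectorized.pvalue, loop_pvals))
```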

# scatter matrix and FDR
> these definitions use the variance (second-order central moment) and the mean (first-order raw moment)
<br>
> within-class scatter matrix
$$
    \mathbf{S_{\omega}} = \displaystyle\sum_{i=1}^{M}{p_i\Sigma_i}
$$
where
$p_i = \frac{n_i}{N}$ is the fraction of samples that belong to class $i$, and $M$ is the total number of classes <br>
and $\Sigma_i = E[(x-\mu_i){(x-\mu_i)}^T]$ is the covariance matrix of class $i$

> between-class scatter matrix 
$$
   \mathbf{S_b} = \displaystyle\sum_{i=1}^{M}{p_i(\mu_i-\mu_0){(\mu_i-\mu_0)}^T}
$$
where
$\mu_0 = \displaystyle\sum_{i=1}^{M}{p_i\mu_i}$ is the global mean vector <br>

> mixture scatter matrix
$$
    \mathbf{S_m} = E[(x-\mu_0){(x-\mu_0)}^T]
$$

> it is easy to prove that
$$
    \mathbf{S_m} =  \mathbf{S_{\omega}} +  \mathbf{S_b}
$$
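
The identity can also be checked numerically. The sketch below builds the three scatter matrices from their definitions, using biased ($1/n_i$) covariance estimates so that the decomposition holds exactly. The function name `scatter_matrices` and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_w, S_b, S_m) for samples X (n, d) and integer labels y (n,)."""
    n, d = X.shape
    mu0 = X.mean(axis=0)  # global mean, equal to sum_i p_i * mu_i
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        p = Xc.shape[0] / n              # class prior p_i = n_i / N
        mu = Xc.mean(axis=0)             # class mean mu_i
        centered = Xc - mu
        S_w += p * (centered.T @ centered) / Xc.shape[0]  # p_i * Sigma_i
        diff = (mu - mu0)[:, None]
        S_b += p * (diff @ diff.T)       # p_i * (mu_i - mu_0)(mu_i - mu_0)^T
    S_m = (X - mu0).T @ (X - mu0) / n    # mixture scatter
    return S_w, S_b, S_m

# Three synthetic classes of unequal size (assumed data)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, size=(30, 4)),
    rng.normal(loc=1.0, size=(50, 4)),
    rng.normal(loc=-2.0, size=(20, 4)),
])
y = np.repeat([0, 1, 2], [30, 50, 20])

S_w, S_b, S_m = scatter_matrices(X, y)
print(np.allclose(S_m, S_w + S_b))
```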

> in a two-class, one-dimensional problem with equal priors, $| \mathbf{S_{\omega}}|$ is proportional to ${\sigma_1}^2 + {\sigma_2}^2$ and $| \mathbf{S_{b}}|$ is proportional to ${(\mu_1 - \mu_2)}^2$ <br>
> thus we can define <font color=maroon><b>FDR (Fisher's Discriminant Ratio)</b></font>; for $M$ classes, the pairwise ratios are summed
$$
    FDR = \displaystyle\sum_{i=1}^{M}\displaystyle\sum_{j \neq i}^{M}\frac{{(\mu_i - \mu_j)}^2}{{\sigma_i}^2 + {\sigma_j}^2}
$$

which is easier to remember as
$$
    FDR \sim \frac{|\mathbf{S_b}|}{|\mathbf{S_{\omega}}|}
$$
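
A minimal sketch of the two-class FDR, computed per feature: the squared gap between the class means divided by the sum of the class variances. The helper name `fdr_two_class` and the synthetic samples are assumptions for illustration. A larger value suggests that the feature separates the two classes better.

```python
import numpy as np

def fdr_two_class(x_a, x_b):
    """Per-feature FDR for two classes; rows are samples, columns are features."""
    mean_gap = x_a.mean(axis=0) - x_b.mean(axis=0)
    return mean_gap ** 2 / (x_a.var(axis=0) + x_b.var(axis=0))

# Synthetic classes whose means differ by 0, 0.5 and 3 along the three features
rng = np.random.default_rng(0)
x_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(200, 3))
x_b = rng.normal(loc=[0.0, 0.5, 3.0], scale=1.0, size=(200, 3))

fdr = fdr_two_class(x_a, x_b)
ranking = np.argsort(fdr)[::-1]  # feature indices, most discriminative first
print(fdr)
print(ranking)
```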

## LDA (Linear Discriminant Analysis)
> a simple application of the FDR idea <br>
> implementation: `sklearn.discriminant_analysis`

```doc
LinearDiscriminantAnalysis(
    solver='svd',
    shrinkage=None,
    priors=None,
    n_components=None,
    store_covariance=False,
    tol=0.0001,
    covariance_estimator=None,
)
Docstring:     
Linear Discriminant Analysis.

A classifier with a linear decision boundary, generated by fitting class
conditional densities to the data and using Bayes' rule.

The model fits a Gaussian density to each class, assuming that all classes
share the same covariance matrix.

The fitted model can also be used to reduce the dimensionality of the input
by projecting it to the most discriminative directions, using the
`transform` method.


Parameters
----------
solver : {'svd', 'lsqr', 'eigen'}, default='svd'
    Solver to use, possible values:
      - 'svd': Singular value decomposition (default).
        Does not compute the covariance matrix, therefore this solver is
        recommended for data with a large number of features.
      - 'lsqr': Least squares solution.
        Can be combined with shrinkage or custom covariance estimator.
      - 'eigen': Eigenvalue decomposition.
        Can be combined with shrinkage or custom covariance estimator.

shrinkage : 'auto' or float, default=None
    Shrinkage parameter, possible values:
      - None: no shrinkage (default).
      - 'auto': automatic shrinkage using the Ledoit-Wolf lemma.
      - float between 0 and 1: fixed shrinkage parameter.

    This should be left to None if `covariance_estimator` is used.
    Note that shrinkage works only with 'lsqr' and 'eigen' solvers.

priors : array-like of shape (n_classes,), default=None
    The class prior probabilities. By default, the class proportions are
    inferred from the training data.

n_components : int, default=None
    Number of components (<= min(n_classes - 1, n_features)) for
    dimensionality reduction. If None, will be set to
    min(n_classes - 1, n_features). This parameter only affects the
    `transform` method.

store_covariance : bool, default=False
    If True, explicitly compute the weighted within-class covariance
    matrix when solver is 'svd'. The matrix is always computed
    and stored for the other solvers.

tol : float, default=1.0e-4
    Absolute threshold for a singular value of X to be considered
    significant, used to estimate the rank of X. Dimensions whose
    singular values are non-significant are discarded. Only used if
    solver is 'svd'.

covariance_estimator : covariance estimator, default=None
    If not None, `covariance_estimator` is used to estimate
    the covariance matrices instead of relying on the empirical
    covariance estimator (with potential shrinkage).
    The object should have a fit method and a ``covariance_`` attribute
    like the estimators in :mod:`sklearn.covariance`.
    if None the shrinkage parameter drives the estimate.

    This should be left to None if `shrinkage` is used.
    Note that `covariance_estimator` works only with 'lsqr' and 'eigen'
    solvers.

```

In [38]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import  classification_report
from sklearn.datasets import load_breast_cancer
from collections import Counter
import numpy as np
import warnings

warnings.filterwarnings("ignore")

data_cancer = load_breast_cancer()
data_cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [45]:
cc = Counter(data_cancer.target)
print(data_cancer.data.shape)
print(cc.most_common(5))

(569, 30)
[(1, 357), (0, 212)]


In [44]:
cancer_data = data_cancer.data
cancer_f = data_cancer.feature_names
cancer_target = data_cancer.target
cx_train, cx_test, cy_train, cy_test = train_test_split(cancer_data, cancer_target, 
                                                        test_size=0.2, random_state=42,
                                                        shuffle=True)
lda_clf = LinearDiscriminantAnalysis(n_components=1)
lda_clf.fit(cx_train, cy_train)
cy_pred = lda_clf.predict(cx_test)
print(classification_report(cy_test, cy_pred))

              precision    recall  f1-score   support

           0       0.97      0.91      0.94        43
           1       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
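
The docstring above notes that `n_components` only affects `transform`. As a hedged sketch of that use, the snippet below projects the breast cancer features onto the single discriminant axis and prints the per-class mean and spread along it. It fits on the full dataset for brevity, which is a simplification of the train/test split used in the cell above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)

# With two classes there is at most one discriminant direction
lda = LinearDiscriminantAnalysis(n_components=1)
Z = lda.fit(X, y).transform(X)  # shape (n_samples, 1)

# Per-class mean and standard deviation along the discriminant axis
for label in lda.classes_:
    z = Z[y == label, 0]
    print(label, z.mean(), z.std())
```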



In [48]:
lda_clf.coef_

array([[ 3.80341976e+00, -5.39212301e-02, -4.39436335e-01,
        -6.34042278e-03,  7.93926949e+00,  9.65027592e+01,
        -1.94072071e+01, -9.48433359e+01,  6.52894977e+00,
        -1.12179949e+02, -8.34003815e+00,  2.43731481e-01,
         1.59124190e-01,  2.40224154e-02, -3.48805766e+02,
         4.26007772e+01,  8.24576776e+01, -3.50357849e+02,
         2.30465093e+01,  5.81140729e+01, -4.13737425e+00,
        -1.85553514e-01,  1.68084303e-01,  1.85463710e-02,
        -2.55422830e+00, -1.47149164e+01, -1.18801082e+01,
         2.55875979e+01, -1.97016750e+01, -2.45735089e+01]])

In [49]:
lda_clf.intercept_

array([49.29932077])
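
The `coef_` and `intercept_` attributes shown above define a linear decision function. As a sketch, the snippet below rebuilds the scores by hand and compares them with `decision_function` and `predict`. It refits on the full dataset, so its coefficients will differ from the values printed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)
clf = LinearDiscriminantAnalysis().fit(X, y)

# Binary case: a single weight vector and a single intercept
scores_manual = (X @ clf.coef_.T + clf.intercept_).ravel()

# A positive score selects classes_[1], otherwise classes_[0]
pred_manual = clf.classes_[(scores_manual > 0).astype(int)]

print(np.allclose(scores_manual, clf.decision_function(X)))
print(np.array_equal(pred_manual, clf.predict(X)))
```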

In [53]:
lda_clf.means_

array([[1.74169231e+01, 2.14923077e+01, 1.15012959e+02, 9.75013609e+02,
        1.02531124e-01, 1.43885148e-01, 1.59455266e-01, 8.67635503e-02,
        1.93533136e-01, 6.26227219e-02, 6.00757988e-01, 1.20041598e+00,
        4.28259763e+00, 7.18094675e+01, 6.75819527e-03, 3.17857870e-02,
        4.18483432e-02, 1.50038935e-02, 2.06236627e-02, 3.97157988e-03,
        2.10274556e+01, 2.92200592e+01, 1.40713964e+02, 1.41022781e+03,
        1.44440769e-01, 3.71363373e-01, 4.51448994e-01, 1.81149467e-01,
        3.26636095e-01, 9.11269822e-02],
       [1.21680559e+01, 1.78216434e+01, 7.82140909e+01, 4.64910839e+02,
        9.17334615e-02, 7.98258741e-02, 4.72053007e-02, 2.55395140e-02,
        1.73751049e-01, 6.28359790e-02, 2.84577273e-01, 1.20402867e+00,
        2.01659545e+00, 2.13169266e+01, 7.12550350e-03, 2.20011573e-02,
        2.74909122e-02, 1.00562413e-02, 2.05438776e-02, 3.73115490e-03,
        1.34032587e+01, 2.33585664e+01, 8.72421678e+01, 5.61890210e+02,
        1.23904301e-01,

In [54]:
lda_clf.xbar_

array([1.41176352e+01, 1.91850330e+01, 9.18822418e+01, 6.54377582e+02,
       9.57440220e-02, 1.03619319e-01, 8.88981451e-02, 4.82798703e-02,
       1.81098681e-01, 6.27567692e-02, 4.02015824e-01, 1.20268681e+00,
       2.85825341e+00, 4.00712989e+01, 6.98907473e-03, 2.56354484e-02,
       3.28236723e-02, 1.18939407e-02, 2.05735121e-02, 3.82045560e-03,
       1.62351033e+01, 2.55356923e+01, 1.07103121e+02, 8.76987033e+02,
       1.31532132e-01, 2.52741802e-01, 2.74594569e-01, 1.14182222e-01,
       2.90502198e-01, 8.38678462e-02])