# ANOVA Test
================

## Overview
--------

The Analysis of Variance (ANOVA) test is a statistical technique that compares the means of three or more groups to determine whether at least one group mean differs from the others. It does this by comparing the variation between the group means with the variation within each group.

## Assumptions
------------

*   The observations are independent.
*   The observations within each group are approximately normally distributed.
*   The variance of the observations is equal across all groups (homogeneity of variance).
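
The normality and equal-variance assumptions can be checked before running the test. The following is a minimal sketch, assuming the illustrative exam scores from the manual calculation and using `scipy.stats.shapiro` and `scipy.stats.levene`; independence depends on how the data was collected and cannot be checked this way. With only five observations per group, these checks have little power, so treat the results as a rough guide.

```python
import numpy as np
from scipy import stats

# Illustrative exam scores (same values as in the manual calculation)
groups = {
    "A": np.array([80, 75, 90, 85, 78]),
    "B": np.array([70, 80, 75, 78, 72]),
    "C": np.array([90, 88, 85, 80, 92]),
}

# Shapiro-Wilk test per group; H0: the sample comes from a normal distribution
shapiro_p = {}
for name, values in groups.items():
    _, p_norm = stats.shapiro(values)
    shapiro_p[name] = p_norm
    print(f"Group {name}: Shapiro-Wilk p-value = {p_norm:.3f}")

# Levene test across groups; H0: all groups have equal variance
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene p-value = {levene_p:.3f}")
```

A p-value above the chosen significance level means the check found no evidence against the assumption.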

## Manual Calculation Example
---------------------------

Suppose we have three groups of exam scores with five observations each. The scores defined in Step 1 have the following sample means and standard deviations:

| Group | Mean | Standard Deviation | Sample Size |
| --- | --- | --- | --- |
| A | 81.6 | 5.94 | 5 |
| B | 75.0 | 4.12 | 5 |
| C | 87.0 | 4.69 | 5 |

### Step 1: Calculate the Grand Mean

The grand mean is the mean of all the observations.

```python
import numpy as np

# Define the data
group_a = np.array([80, 75, 90, 85, 78])
group_b = np.array([70, 80, 75, 78, 72])
group_c = np.array([90, 88, 85, 80, 92])

# Calculate the grand mean
grand_mean = np.mean(np.concatenate((group_a, group_b, group_c)))
print("Grand Mean:", grand_mean)
```

### Step 2: Calculate the Sum of Squares Between (SSB)

The sum of squares between (SSB) measures how far the group means lie from the grand mean, with each squared deviation weighted by the group size.

```python
# Calculate the sum of squares between (SSB); the factor 5 is the number of observations per group
ssb = 5 * ((np.mean(group_a) - grand_mean) ** 2 + (np.mean(group_b) - grand_mean) ** 2 + (np.mean(group_c) - grand_mean) ** 2)
print("Sum of Squares Between (SSB):", ssb)
```

### Step 3: Calculate the Sum of Squares Within (SSW)

The sum of squares within (SSW) measures the variation of the observations around their own group mean.

```python
# Calculate the sum of squares within (SSW)
ssw = np.sum((group_a - np.mean(group_a)) ** 2) + np.sum((group_b - np.mean(group_b)) ** 2) + np.sum((group_c - np.mean(group_c)) ** 2)
print("Sum of Squares Within (SSW):", ssw)
```
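
As a sanity check, SSB and SSW should add up to the total sum of squares (SST), the variation of all observations around the grand mean. The sketch below recomputes all three quantities from the same scores; the name `sst` is introduced here for illustration only.

```python
import numpy as np

groups = [
    np.array([80, 75, 90, 85, 78]),  # group A
    np.array([70, 80, 75, 78, 72]),  # group B
    np.array([90, 88, 85, 80, 92]),  # group C
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between-group variation: squared distance of each group mean from the grand mean,
# weighted by the group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: squared distance of each score from its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total variation: squared distance of each score from the grand mean
sst = ((all_scores - grand_mean) ** 2).sum()

print("SSB + SSW:", ssb + ssw)
print("SST:", sst)
```

For these scores the decomposition is 361.2 + 297.2 = 658.4.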

### Step 4: Calculate the Mean Square Between (MSB) and Mean Square Within (MSW)

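The mean square between (MSB) and mean square within (MSW) are calculated by dividing each sum of squares by its degrees of freedom. With k = 3 groups and N = 15 observations, the degrees of freedom are k - 1 = 2 between groups and N - k = 12 within groups.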

```python
# Calculate the mean square between (MSB) and mean square within (MSW)
msb = ssb / 2   # df_between = k - 1 = 3 - 1
msw = ssw / 12  # df_within = N - k = 15 - 3
print("Mean Square Between (MSB):", msb)
print("Mean Square Within (MSW):", msw)
```

### Step 5: Calculate the F-Statistic

The F-statistic is calculated by dividing the MSB by the MSW.

```python
# Calculate the F-statistic
f_statistic = msb / msw
print("F-Statistic:", f_statistic)
```

### Step 6: Determine the Critical F-Value or P-Value

The critical F-value or p-value can be determined using an F-distribution table or a statistical software package.

```python
# Compute the p-value from the F-distribution
from scipy.stats import f

# Define the degrees of freedom
df_between = 2   # k - 1
df_within = 12   # N - k

# The p-value is the upper-tail probability beyond the observed F-statistic
p_value = f.sf(f_statistic, df_between, df_within)
print("P-Value:", p_value)
```
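
The critical F-value mentioned above can be obtained from the same distribution with the inverse CDF. This is a minimal sketch assuming a significance level of 0.05; the names `alpha` and `f_critical` are illustrative.

```python
from scipy.stats import f

alpha = 0.05      # assumed significance level
df_between = 2    # k - 1
df_within = 12    # N - k

# The critical value is the point with probability alpha in the upper tail
f_critical = f.ppf(1 - alpha, df_between, df_within)
print("Critical F-value:", round(f_critical, 3))
```

An observed F-statistic larger than this critical value leads to the same decision as a p-value below `alpha`.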

If the p-value is less than the chosen significance level (e.g., 0.05), we reject the null hypothesis that all group means are equal.
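
The manual steps can be cross-checked against `scipy.stats.f_oneway`, which performs the same one-way ANOVA in a single call. The sketch below repeats the calculation with the degrees of freedom derived from the data rather than hard-coded.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([80, 75, 90, 85, 78]),  # group A
    np.array([70, 80, 75, 78, 72]),  # group B
    np.array([90, 88, 85, 80, 92]),  # group C
]

k = len(groups)                      # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Manual F-statistic and p-value
f_manual = (ssb / (k - 1)) / (ssw / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Same test via SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Manual:", f_manual, p_manual)
print("SciPy: ", f_scipy, p_scipy)
```

Both approaches should agree up to floating-point error; for these scores the F-statistic is about 7.29 and the p-value is below 0.05.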

In [1]:
# Import scipy.stats and pandas
from scipy import stats
import pandas as pd

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/aishwaryamate/Datasets/main/Iris.csv', index_col=0)
df

| Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | --- | --- | --- | --- | --- |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


In [4]:
df['SepalLengthCm']

Id
1      5.1
2      4.9
3      4.7
4      4.6
5      5.0
      ... 
146    6.7
147    6.3
148    6.5
149    6.2
150    5.9
Name: SepalLengthCm, Length: 150, dtype: float64

In [5]:
# One-way ANOVA across the four measurement columns (H0: all four column means are equal)
f, p = stats.f_oneway(df['SepalWidthCm'], df['SepalLengthCm'], df['PetalLengthCm'],
                      df['PetalWidthCm'])

In [6]:
p

np.float64(3.4996987081933735e-159)

In [7]:
if p < 0.05:
    print('Reject Null Hypothesis. At least one sample is different.')
else:
    print('Fail to reject Null Hypothesis.')

Reject Null Hypothesis. At least one sample is different.
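
The cell above treats the four measurement columns as the samples. A more common ANOVA setup compares a single measurement across groups, for example sepal length across species. The sketch below shows that pattern on a small hand-typed frame named `toy`; its values are illustrative only and are not the full Iris dataset, so the resulting statistics say nothing about the real data.

```python
import pandas as pd
from scipy import stats

# Small illustrative frame with the same column names as the Iris CSV (not the real dataset)
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0,
                      7.0, 6.4, 6.9, 5.5, 6.5,
                      6.3, 5.8, 7.1, 6.3, 6.5],
    "Species": ["Iris-setosa"] * 5 + ["Iris-versicolor"] * 5 + ["Iris-virginica"] * 5,
})

# One array of sepal lengths per species
samples = [values.to_numpy() for _, values in toy.groupby("Species")["SepalLengthCm"]]

# H0: mean sepal length is the same for every species
f_species, p_species = stats.f_oneway(*samples)
print("F-statistic:", f_species)
print("P-value:", p_species)
```

Applied to the notebook's `df`, the same `groupby` expression would split the real measurements by species.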


# Chi-Squared Test
=====================

## Overview
------------

The Chi-Squared test is a statistical test used to determine whether there is a significant association between two categorical variables. It is commonly used to test the independence of two variables and to determine whether observed frequencies in one or more categories are significantly different from expected frequencies.

## Assumptions
-------------

*   The data is randomly sampled from the population.
*   The data is categorical.
*   The categories are mutually exclusive.
*   The expected frequency in each category is at least 5 (a common rule of thumb).
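
The expected-frequency condition can be checked directly from a contingency table. The following is a minimal sketch using `scipy.stats.contingency.expected_freq` on a hypothetical 2x2 table of counts; the numbers are made up for illustration.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical observed counts (rows and columns are two categorical variables)
observed = np.array([[12, 8],
                     [3, 7]])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(observed)
print(expected)

# Rule of thumb: every expected count should be at least 5
rule_ok = bool((expected >= 5).all())
print("All expected frequencies >= 5:", rule_ok)
```

When the check fails, an exact test or merging sparse categories is often considered instead.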

## Types of Chi-Squared Tests
-----------------------------

*   **Goodness of Fit Test**: Used to determine whether observed frequencies in one or more categories are significantly different from expected frequencies.
*   **Test of Independence**: Used to determine whether there is a significant association between two categorical variables.
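
The notebook cells later in this section run a test of independence. For comparison, a goodness of fit test can be sketched with `scipy.stats.chisquare`. The die-roll counts below are hypothetical; the null hypothesis is that all six faces are equally likely.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts for faces 1..6 over 120 rolls
observed = np.array([18, 22, 16, 25, 19, 20])

# A fair die gives the same expected count for every face
expected = np.full(6, observed.sum() / 6)

gof_stat, gof_p = chisquare(f_obs=observed, f_exp=expected)
print("Chi-squared statistic:", gof_stat)
print("P-value:", gof_p)
```

Here the statistic is 2.5 on 5 degrees of freedom, which gives a p-value well above 0.05, so these counts are consistent with a fair die.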

## Formula
----------

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency and $E_i$ is the expected frequency in category $i$.
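
The formula can be applied by hand to a contingency table and compared with `chi2_contingency`. This sketch uses a hypothetical 2x2 table and passes `correction=False`, because SciPy applies Yates' continuity correction to 2x2 tables by default and the plain formula does not.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts
observed = np.array([[20, 30],
                     [30, 20]])

# Expected counts under independence: outer product of the margins over the grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

# Sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Same statistic from SciPy, without the continuity correction
chi2_scipy, p_scipy, dof, _ = chi2_contingency(observed, correction=False)

print("Manual:", chi2_manual)
print("SciPy: ", chi2_scipy, "dof:", dof, "p:", p_scipy)
```

Every expected count in this table is 25, so the statistic is 4 * (5^2 / 25) = 4.0 with 1 degree of freedom.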

## Interpretation
--------------

*   A small p-value (typically < 0.05) indicates that the observed frequencies differ significantly from the expected frequencies. In a test of independence, this suggests an association between the variables.
*   A large p-value (typically ≥ 0.05) indicates that the data do not provide sufficient evidence of a difference between the observed and expected frequencies. It does not prove that the variables are independent.

## Example Use Cases
--------------------

*   Testing the association between a disease and a risk factor.
*   Testing the effectiveness of a treatment.
*   Testing the association between a demographic variable and a behavior.

## Common Applications
----------------------

*   Medical research
*   Social sciences
*   Marketing research  


In [None]:
from scipy.stats import chi2_contingency

In [None]:
df = pd.read_csv('chi2.csv', index_col=0)
df

In [None]:
obs = pd.crosstab(index=df['Athlete'],columns=df['Smoker'])
obs

In [None]:
# Name the degrees of freedom dof so the DataFrame df is not overwritten
chi2, p, dof, expected = chi2_contingency(obs)

In [None]:
chi2, dof

In [None]:
p

In [None]:
if p < 0.05:
    print('Reject Null Hypothesis. Columns are dependent on each other.')
else:
    print('Fail to reject null hypothesis.')

In [None]:
# Upper-tail p-value for a chi-squared statistic of 12.6 with 1 degree of freedom
stats.chi2.sf(12.6, 1)
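
The cells above depend on a local `chi2.csv` file. For a self-contained run, the same steps can be sketched on a small hypothetical survey; the frame `survey` and its counts are made up for illustration and only reuse the `Athlete` and `Smoker` column names.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey of 40 people: 20 athletes and 20 non-athletes
survey = pd.DataFrame({
    "Athlete": ["Yes"] * 20 + ["No"] * 20,
    "Smoker": ["No"] * 16 + ["Yes"] * 4 + ["No"] * 6 + ["Yes"] * 14,
})

# Contingency table of observed counts
obs = pd.crosstab(index=survey["Athlete"], columns=survey["Smoker"])
print(obs)

# Test of independence (Yates' correction is applied by default for 2x2 tables)
chi2_stat, p_value, dof, expected = chi2_contingency(obs)
print("Chi-squared:", chi2_stat, "dof:", dof, "p-value:", p_value)

if p_value < 0.05:
    print("Reject Null Hypothesis. Columns are dependent on each other.")
else:
    print("Fail to reject null hypothesis.")
```

Using separate names for the statistic, the degrees of freedom, and the expected counts keeps the original DataFrame available after the test.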