<br>

### $\chi^2 \text{ Test}$  

$\chi^2 $ Test (chi squared test) is used to investigate the relationship between categorical variables by calculating the difference between observed and expected values. Simply put, how closely related is the expected data to the actual data and is the difference statistically significant, where small $\chi^2$ lends support to the data not being statistically significant and a large $\chi^2$ value is statistically significant. <br>

#### Types Of $\chi^2$ Test

There are three kinds of $\chi^2$ test:
* A $\chi^2$ test for homogeneity
* A $\chi^2$ test for association/independence 
* A $\chi^2$ test for goodness of fit 

The type of $\chi^2$ test is determined by the type of data you have (will detail later)

#### Conditions for $\chi^2$

* Random sample 
* Normality or Large Counts
* Independence - Sample with replacement or sample size 10 % or less than population 
<br><br>

#### $\chi^2$ Formula 

&emsp;$\chi^2 = \sum \dfrac{(\text{observed - expected})^2}{expected}$
<br><br>

#### $\chi^2$ Homogeneity Example 
Does age have an impact on sports preferences?<br>

* sample 1 - 265 people in their 30's
  * Each was asked to pick their favorite sport from a list of MLB, NFL, and NBA 
* sample 2 - 235 people in their 40's
  * Each was asked to pick their favorite sport from a list of MLB, NFL, and NBA 
<br>

This is a $chi^2$ test for homogeneity because we are comparing two distributions with categorical data
<br><br>

In [1]:

import sys 
sys.path.insert(0, '..')
from resources import datum 
import numpy as np 
from IPython.display import display, Markdown,  Math
from tabulate  import tabulate 
from scipy.stats import chi2

datum = datum.Data()

CL = 0.90
tail = 'two'
sample1_categories = 2 # sample 1 categories: 30s , 40s
sample2_categories = 3 # sample 2 categories mlb, nfl, nba 
df = (sample1_categories - 1) * (sample2_categories - 1)
alpha = ((1 - CL)/2) if tail == 'two' else 1 - CL

col0 = ["30's", "40's", "$\Sigma$"]
mlb = [60, 65]
nfl = [125, 100]
nba = [80, 70]
total = [sum([mlb[0], nfl[0], nba[0]]), sum([mlb[1], nfl[1], nba[1]])]
phat = [] # proportions 
phat.extend([total[0]/sum(total), total[1]/sum(total)])

# add sum 
mlb.append(sum(mlb))
nfl.append(sum(nfl))
nba.append(sum(nba))
total.append(sum(total))
phat.append((sum(phat) * 100))  

# add proportions 
col0.append('$\hat{p}$')
mlb.append(mlb[2]/total[2])
nfl.append(nfl[2]/total[2])
nba.append(nba[2]/total[2])
total.append('')
phat.append('')

# Hypothesis
msg = '<br><br>$H_0$: age does not effect sports preference<br><br>'
msg = msg + '$H_a$: age does effect sports preference<br>\r\r'
display(Markdown(msg))

# make dictionary for table 
data = {' ': col0, 'MLB': mlb, 'NFL': nfl, 'NBA': nba, '$\Sigma$': total, '$\hat{p}$': phat}

tab = tabulate(data, headers='keys', tablefmt='pipe', numalign='center', stralign='center')
header = '### Survey Data <br>\r'
footer = '\r\r<br>$\hat{p}$ = proportion of population\r\r'
notes = '<span style = "color:skyblue;">Notes:<br>'
notes = notes + '&emsp;$\star$ If age had NO impact on preference we would find:<br>\r'
notes = notes + '&emsp;&emsp;$\star$ Of the 125 people who prefer MLB, 53 pct should be in their 30s and 47 pct should be in their 40s<br>\r'
notes = notes + '&emsp;&emsp;$\star$ Of the 225 people who prefer NFL, 53 pct should be in their 30s and 47 pct should be in their 40s<br>\r'
notes = notes + '&emsp;&emsp;$\star$ Of the 150 people who prefer NBA, 53 pct should be in their 30s and 47 pct should be in their 40s<br>\r'

msg = header + tab + footer + notes 

display(Markdown(msg))

# ADD EXPECTED VALUES 

# calculate expected values 
ev_mlb30s = np.round((mlb[2]*total[0])/total[2])
ev_mlb40s = np.round((mlb[2]*total[1])/total[2])
ev_nfl30s = np.round((nfl[2]*total[0])/total[2])
ev_nfl40s = np.round((nfl[2]*total[1])/total[2])
ev_nba30s = np.round((nba[2]*total[0])/total[2])
ev_nba40s = np.round((nba[2]*total[1])/total[2])

# make expected value table 
ev_col0 = ["30's", "40's", "$\Sigma$"]
ev_mlb = [ev_mlb30s, ev_mlb40s]
ev_nfl = [ev_nfl30s, ev_nfl40s]
ev_nba = [ev_nba30s, ev_nba40s]
ev_total = [sum([ev_mlb[0], ev_nfl[0], ev_nba[0]]), sum([ev_mlb[1], ev_nfl[1], ev_nba[1]])]

# add totals
ev_mlb.append(sum(ev_mlb))
ev_nfl.append(sum(ev_nfl))
ev_nba.append(sum(ev_nba))
ev_total.append([sum(ev_total)])

# add proportions 
ev_mlb.append(ev_mlb[2]/ev_total[2])
ev_nfl.append(ev_nfl[2]/ev_total[2])
ev_nba.append(ev_nba[2]/ev_total[2])

data = {' ': col0, 'MLB': mlb, 'E[MLB]': ev_mlb, 'NFL': nfl, 'E[NFL]': ev_nfl, 'NBA': nba, 'E[NBA]': ev_nba, '$\Sigma$': total, '$\hat{p}$': phat}

msg = '\\displaystyle \\bold{Calculate~Expected~Values}\\\\~\\\\\\text{The formula for expected value:}\\\\~\\\\'
msg = msg + '\\qquad E[V] = \\dfrac{\\text{category 1 total} \\cdot \\text{ category 2 total}}{\\text{total population}}\\\\~\\\\~\\\\'
msg = msg + '\\qquad E[V] = \\dfrac{\\text{mlb total } \\cdot \\text{ 30s total}}{\\text{total population}}'
msg = msg + '\\qquad E[V] = \\dfrac{%s \\cdot %s}{%s} = %s\\\\~\\\\'
msg = msg + '\\qquad E[V] = \\dfrac{\\text{mlb total } \\cdot \\text{ 40s total}}{\\text{total population}}'
msg = msg + '\\qquad E[V] = \\dfrac{%s \\cdot %s}{%s} = %s\\\\~\\\\~\\\\'
#nfl
msg = msg + '\\qquad E[V] = \\dfrac{\\text{nfl total } \\cdot \\text{ 30s total}}{\\text{total population}}'
msg = msg + '\\qquad E[V] = \\dfrac{%s \\cdot %s}{%s} = %s\\\\~\\\\'
msg = msg + '\\qquad E[V] = \\dfrac{\\text{nfl total } \\cdot \\text{ 40s total}}{\\text{total population}}'
msg = msg + '\\qquad E[V] = \\dfrac{%s \\cdot %s}{%s} = %s\\\\~\\\\-\\\\'
#nba
msg = msg + '\\qquad E[V] = \\dfrac{\\text{nba total } \\cdot \\text{ 30s total}}{\\text{total population}}'
msg = msg + '\\qquad E[V] = \\dfrac{%s \\cdot %s}{%s} = %s\\\\~\\\\'
msg = msg + '\\qquad E[V] = \\dfrac{\\text{nba total } \\cdot \\text{ 40s total}}{\\text{total population}}'
msg = msg + '\\qquad E[V] = \\dfrac{%s \\cdot %s}{%s} = %s\\\\~\\\\-\\\\'
display(Math(msg%(
    #mlb 30s
    mlb[2], total[0], total[2], ev_mlb[0]
    #mlb 40s
    ,mlb[2], total[1], total[2], ev_mlb[1]
    #nfl 30s
    ,nfl[2], total[0], total[2], ev_nfl[0]
    #nfl 40s
    ,nfl[2], total[1], total[2], ev_nfl[1]
    #nba 30s
    ,nba[2], total[0], total[2], ev_nba[0]
    #nba 40s
    ,nba[2], total[1], total[2], ev_nba[1]
    )))




tab = tabulate(data, headers='keys', tablefmt='pipe', numalign='center', stralign='center')
header = '### Survey Data With Expected Values<br>\r'
footer = '\r\r<br>$\hat{p}$ = proportion of population<br>\r\r'
notes = '<span style = "color:skyblue;">Notes:<br>'
notes = notes + '&emsp;$\star$ Note the difference between categorical observed values and expected values<br>\r'
notes = notes + '&emsp;&emsp;$\star$ Ex: For the 30s age group mlb perference the observed is 60 but E[MLB] is 66</span><br><br></span>\r'

msg = header + tab + footer + notes 

display(Markdown(msg))

msg = 'The inference condition for homogeneity is met:<br>\r'
msg = msg + '* We assume the sample was random\r'
msg = msg + '* Normality condidtion met because all values are greater than 10%<br>\r'
msg = msg + '&emsp; * All expected values are greater than 5<br>\r'
msg = msg + '* Independence conditoin met because we assumed sample with replacement or kept sample size less than 10%\r<br><br>\r\r'
msg = msg + '<b>Calculate $\chi^2$ Test<b>\r'


display(Markdown(msg))

chi_mlb30s = (mlb[0] - ev_mlb[0])**2 / ev_mlb[0]
chi_nfl30s = (nfl[0] - ev_nfl[0])**2 / ev_nfl[0]
chi_nba30s = (nba[0] - ev_nba[0])**2 / ev_nba[0]
chi_mlb40s = (mlb[1] - ev_mlb[1])**2 / ev_mlb[1]
chi_nfl40s = (nfl[1] - ev_nfl[1])**2 / ev_nfl[1]
chi_nba40s = (nba[1] - ev_nba[1])**2 / ev_nba[1]
test_statistic = chi_mlb30s + chi_nfl30s + chi_nba30s + chi_mlb40s + chi_nfl40s + chi_nba40s

msg = '\\chi^2 = \\sum \\dfrac{(\\text{observed - expected})^2}{\\text{expected}} = '
msg = msg + '\\dfrac{(%s - %s)^2}{%s} + \\dfrac{(%s - %s)^2}{%s} + \\dfrac{(%s - %s)^2}{%s}\\\\~\\\\'
msg = msg + '\\qquad \\qquad+ ~\\dfrac{(%s - %s)^2}{%s} + \\dfrac{(%s - %s)^2}{%s} + \\dfrac{(%s - %s)^2}{%s} = %s'
msg = msg + '\\\\~\\\\~\\\\'

display(Math(msg%(
    #mlb30s
    mlb[0], ev_mlb[0], ev_mlb[0]
    #nfl30s
    ,nfl[0], ev_nfl[0], ev_nfl[0]
    #nba30s
    ,nba[0], ev_nba[0], ev_nba[0]
    #mlb40s
    ,mlb[1], ev_mlb[1], ev_mlb[1]
    #nfl40s
    ,nfl[1], ev_nfl[1], ev_nfl[1]
    #nba40s
    ,nba[1], ev_nba[1], ev_nba[1]
    # answer
    ,f'{test_statistic: .4f}'
)))

notes = '<span style = "color:skyblue;">Notes:<br>\r'
notes = notes + 'The larger the value of $\chi^2$, the more likely the chance of rejecting the Null Hypothesis'
notes = notes + '<br><br><br></span>\r'

display(Markdown(notes))

msg = '**Calculate Degrees of Freedom For Homogeneity**<br><br>\r'
msg = msg + '$\displaystyle \\text{df = (sample 1 categories - 1)} \cdot \\text{(sample2 categories - 1)}$<br><br>\r'
msg = msg + '$\\text{df = (30s + 40s - 1 )} \\cdot \\text{ (mlb + nfl + nba - 1)}= (1 + 1 - 1) \cdot (1 + 1 + 1 - 1)'
msg = msg + '= 1 \\cdot 2 = 2$<br><br>\r '

display(Markdown(msg))

critical_value = datum.get_chi_square_ctitical_value(alpha = alpha, df = df)

msg = '\\displaystyle \\chi^2 \\text{ critical value: } %s\\\\~\\\\'
msg = msg + '\\text{Using the critical value approach, at a 90 pct confidence level, in the upper tail of a two tail test}\\\\~\\\\'
msg = msg + '\\text{we would have to calculate a test statistic of at least 5.99 in order to reject the null hypothesis}\\\\~\\\\'
msg = msg + '\\text{Thefore with }\\chi^2 \\text{ test statistic of %s}\\\\~\\\\'
msg = msg + '\\qquad \\star \\text{ failed to reject the null hypothesis that age has no effect on sports preference}\\\\~\\\\'
msg = msg + '\\qquad \\star \\text{ failed to support the alternative hypothesis that age has an effect on sports preference}\\\\~\\\\'
msg = msg + '\\qquad \\star \\text{ we cannot conclude that age has any effect on the preference of sport }\\\\~\\\\'

display(Math(msg%(f'{critical_value: .4f}',f'{test_statistic: .4f}')))





<br><br>$H_0$: age does not effect sports preference<br><br>$H_a$: age does effect sports preference<br>

### Survey Data <br>|           |  MLB  |  NFL  |  NBA  |  $\Sigma$  |  $\hat{p}$  |
|:---------:|:-----:|:-----:|:-----:|:----------:|:-----------:|
|   30's    |  60   |  125  |  80   |    265     |    0.53     |
|   40's    |  65   |  100  |  70   |    235     |    0.47     |
| $\Sigma$  |  125  |  225  |  150  |    500     |    100.0    |
| $\hat{p}$ | 0.25  | 0.45  |  0.3  |            |             |<br>$\hat{p}$ = proportion of population<span style = "color:skyblue;">Notes:<br>&emsp;$\star$ If age had NO impact on preference we would find:<br>&emsp;&emsp;$\star$ Of the 125 people who prefer MLB, 53 pct should be in their 30s and 47 pct should be in their 40s<br>&emsp;&emsp;$\star$ Of the 225 people who prefer NFL, 53 pct should be in their 30s and 47 pct should be in their 40s<br>&emsp;&emsp;$\star$ Of the 150 people who prefer NBA, 53 pct should be in their 30s and 47 pct should be in their 40s<br>

<IPython.core.display.Math object>

### Survey Data With Expected Values<br>|           |  MLB  |  E[MLB]  |  NFL  |  E[NFL]  |  NBA  |  E[NBA]  |  $\Sigma$  |  $\hat{p}$  |
|:---------:|:-----:|:--------:|:-----:|:--------:|:-----:|:--------:|:----------:|:-----------:|
|   30's    |  60   |    66    |  125  |   119    |  80   |    80    |    265     |    0.53     |
|   40's    |  65   |    59    |  100  |   106    |  70   |    70    |    235     |    0.47     |
| $\Sigma$  |  125  |   125    |  225  |   225    |  150  |   150    |    500     |    100.0    |
| $\hat{p}$ | 0.25  |   0.25   | 0.45  |   0.45   |  0.3  |   0.3    |            |             |<br>$\hat{p}$ = proportion of population<br><span style = "color:skyblue;">Notes:<br>&emsp;$\star$ Note the difference between categorical observed values and expected values<br>&emsp;&emsp;$\star$ Ex: For the 30s age group mlb perference the observed is 60 but E[MLB] is 66</span><br><br></span>

The inference condition for homogeneity is met:<br>* We assume the sample was random* Normality condidtion met because all values are greater than 10%<br>&emsp; * All expected values are greater than 5<br>* Independence conditoin met because we assumed sample with replacement or kept sample size less than 10%<br><br><b>Calculate $\chi^2$ Test<b>

<IPython.core.display.Math object>

<span style = "color:skyblue;">Notes:<br>The larger the value of $\chi^2$, the more likely the chance of rejecting the Null Hypothesis<br><br><br></span>

**Calculate Degrees of Freedom For Homogeneity**<br><br>$\displaystyle \text{df = (sample 1 categories - 1)} \cdot \text{(sample2 categories - 1)}$<br><br>$\text{df = (30s + 40s - 1 )} \cdot \text{ (mlb + nfl + nba - 1)}= (1 + 1 - 1) \cdot (1 + 1 + 1 - 1)= 1 \cdot 2 = 2$<br><br> 

AttributeError: 'Data' object has no attribute 'get_chi_square_ctitical_value'

#### The Different Types Of $\chi^2$ Test
<br>

* <b>Association / Independence</b>
  * One group and multiple traits about the one group
  * Example: 
    * A group of families (one group) with at least one child (trait 1)
    * A family with 1 child has more pets (trait 2) than a family with multiple children
* <b>Homogeneity</b>
  * Multiple Groups
  * Example
    * Sample from people in their 30s and from people in their 40s compared to multiple types of sports
* <b>Goodness of Fit</b>
  * One Category that has a dependent variable 
  * Think of a one-way table
  * Example
    * A car dealership sales by month
    * Compare car sales by month. Are more cars sold in June than December? 
    * The expected values in the a Goodness of Fit Chi Square test will be the ame for all variables
      * If the dealership sales 1200 cars per year the E[V] for each month will be 100
      * If the dealership sales 2400 cars per year the E[V] for each month will be 200
<b></b>

The $\chi^2$ test process is the same for each type of test: 
1. Layout, understand the data, making sure the correct $\chi^2$ test is applied
2. Calculate the expected values
3. Calculate $\chi^2$ (test statistic)
4. Find the degrees of freedom
5. Calculate the $\chi^2$ critical value
<br>

<span style = 'color:crimson;font-weight:bold;size:102%'>
END
</span>

