<img src="https://i.imgur.com/czqBFBV.png">

Given the data table below, determine if there is a relationship between fitness level and smoking habits:


|  | Low fitness level | Medium-low fitness level | Medium-high fitness level | High fitness level |
|-------------|-----------|-------------|-------------|-------------|
| Never smoked | 113 | 113 | 110 | 159 |
| Former smokers | 119 |  135 | 172 | 190 |
| 1 to 9 cigarettes | 77 |  91 | 86 | 65 |
| >=10 cigarettes daily | 181 |  152 | 124 | 73 |

You don't have to fully solve for the number here (that would be pretty time-intensive for an interview setting), but lay out the steps you would take to solve such a problem.

### Null hypothesis: There is NO relationship between fitness level and smoking
### Alternate hypothesis: There IS a relationship between fitness level and smoking

Assumptions
- Significance level (α) is 0.05 (5%)

This solution follows the tutorial at https://towardsdatascience.com/gentle-introduction-to-chi-square-test-for-independence-7182a7414a95



In [6]:
import pandas as pd

# Creating Pandas df
df = pd.DataFrame(
    [
        [113,113,110,159],
        [119,135,172,190],
        [77,91,86,65],
        [181,152,124,73]
    ],
    # Rows are smoking habits and columns are fitness levels, matching the problem table
    index=["Never smoked", "Former smokers", "1 to 9 cigarettes", ">=10 cigarettes daily"],
    columns=["Low fitness level", "Medium-low fitness level", "Medium-high fitness level", "High fitness level"])
df



Unnamed: 0,Low fitness level,Medium-low fitness level,Medium-high fitness level,High fitness level
Never smoked,113,113,110,159
Former smokers,119,135,172,190
1 to 9 cigarettes,77,91,86,65
>=10 cigarettes daily,181,152,124,73


In [30]:
# Column totals give the sample size of each fitness group. The groups are not all the same size,
# so the expected counts have to be computed from the row and column totals.
df.sum()

Low fitness level            490
Medium-low fitness level     491
Medium-high fitness level    492
High fitness level           487
dtype: int64

\begin{equation}
\text{Expected Value}=\frac{\text{Row Total} \times \text{Column Total}}{\text{Table Total}}
\end{equation}

\begin{equation}
\text{Expected Value}=\frac{(113+113+110+159) \times (113+119+77+181)}{490+491+492+487}
\end{equation}

= 495 × 490 / 1960 = 123.75<br>
Repeat this calculation for every cell in the table.
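
The per-cell calculation above can be applied to the whole table at once. The following is a minimal sketch, assuming the counts from the problem table are re-entered as a NumPy array called `observed` (a name introduced here for illustration). It takes the outer product of the row and column totals and divides by the table total.

```python
import numpy as np

# Observed counts from the problem table
# (rows: smoking habit, columns: fitness level from low to high).
observed = np.array([
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
])

row_totals = observed.sum(axis=1)  # one total per smoking habit
col_totals = observed.sum(axis=0)  # one total per fitness level
table_total = observed.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / table_total
expected = np.outer(row_totals, col_totals) / table_total

# First cell: the same value as the hand calculation
print(expected[0, 0])
```

The resulting array should agree with the expected-count array that `chi2_contingency` returns.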


In [33]:
from scipy.stats import chi2_contingency

chi2_contingency(df) # chi2_contingency(df)[3][0][0] shows the 123.75 calculated earlier

(87.27274636300587,
 5.7306646048374425e-15,
 9,
 array([[123.75      , 124.00255102, 124.25510204, 122.99234694],
        [154.        , 154.31428571, 154.62857143, 153.05714286],
        [ 79.75      ,  79.9127551 ,  80.0755102 ,  79.26173469],
        [132.5       , 132.77040816, 133.04081633, 131.68877551]]))

In [34]:
chi2_contingency(df)[0] # This is the chi squared value

87.27274636300587

\begin{equation}
\chi^2=\sum\frac{(O-E)^2}{E}
\end{equation}

where O is the observed (actual) value and E is the expected value for each cell.
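
As a rough check on the formula, the statistic can also be computed directly from the observed and expected counts. This sketch assumes the table is re-entered as a NumPy array named `observed` (an illustrative name, not from the notebook) and takes the expected counts from `chi2_contingency` itself.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the problem table
# (rows: smoking habit, columns: fitness level from low to high).
observed = np.array([
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
])

# SciPy returns the statistic, p-value, degrees of freedom and expected counts.
stat, p_value, dof, expected = chi2_contingency(observed)

# Sum of (O - E)^2 / E over every cell of the table.
chi_squared = ((observed - expected) ** 2 / expected).sum()

print(chi_squared, stat)
```

The two printed values should match, since the continuity correction in `chi2_contingency` only applies when there is a single degree of freedom.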

In [35]:
chi2_contingency(df)[1] # This is the p-value

5.7306646048374425e-15

\begin{equation}
\text{Degrees of Freedom}=(\text{NumRows} - 1) \times (\text{NumColumns} - 1)
\end{equation}

For this 4 × 4 table, that is (4 - 1) × (4 - 1) = 9.

In [36]:
chi2_contingency(df)[2] # This is degrees of freedom

9
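
The degrees of freedom and the p-value reported above can be reproduced from the table shape and the chi-squared statistic alone. This is a small sketch under the assumption that the statistic is copied in by hand; the variable names are illustrative.

```python
from scipy.stats import chi2

num_rows, num_cols = 4, 4  # shape of the contingency table
dof = (num_rows - 1) * (num_cols - 1)

chi_squared = 87.27274636300587  # statistic reported by chi2_contingency above

# Survival function: probability of a statistic at least this large
# under a chi-squared distribution with `dof` degrees of freedom.
p_value = chi2.sf(chi_squared, dof)

print(dof, p_value)
```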

In [50]:
from scipy.stats import chi2

# Below code copied/pasted from https://towardsdatascience.com/gentle-introduction-to-chi-square-test-for-independence-7182a7414a95
chi, pval, dof, exp = chi2_contingency(df)
print('p-value is: ', pval)
significance = 0.05
p = 1 - significance
critical_value = chi2.ppf(p, dof)
print('chi=%.6f, critical value=%.6f\n' % (chi, critical_value))
if chi > critical_value:
    print("""At %.2f level of significance, we reject the null hypothesis and accept the alternate hypothesis.
There IS a relationship between smoking and fitness levels""" % (significance))
else:
    print("""At %.2f level of significance, we fail to reject the null hypothesis.
There is NO evidence of a relationship between fitness level and smoking""" % (significance))

p-value is:  5.7306646048374425e-15
chi=87.272746, critical value=16.918978

At 0.05 level of significance, we reject the null hypothesis and accept the alternate hypothesis.
There IS a relationship between smoking and fitness levels
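
Comparing the statistic with the critical value is equivalent to comparing the p-value with the significance level. The sketch below is not part of the tutorial code; it assumes the table is re-entered as a NumPy array named `observed` and shows that both decision rules lead to the same conclusion here.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the problem table
# (rows: smoking habit, columns: fitness level from low to high).
observed = np.array([
    [113, 113, 110, 159],
    [119, 135, 172, 190],
    [77, 91, 86, 65],
    [181, 152, 124, 73],
])

alpha = 0.05  # significance level from the assumptions
stat, p_value, dof, _ = chi2_contingency(observed)
critical_value = chi2.ppf(1 - alpha, dof)

# Two equivalent ways of deciding whether to reject the null hypothesis.
reject_by_p_value = p_value < alpha
reject_by_critical_value = stat > critical_value

print(reject_by_p_value, reject_by_critical_value)
```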
