# A broad study on `Chi^2` algorithm
<hr>
#### Before going into the broad discussion, let's first understand: what is the `Chi^2` algorithm? What does it do? Why do we need to learn it?

* __`Chi^2` (pronounced as kai-square) is an algorithm that helps us to understand the relationship between two [categorical](https://youtu.be/o8gs-zgPfp4) variables.__  
* __It helps us to compare what we actually observed with what we expected.__
* __We use it to accept or reject our [hypothesis](https://youtu.be/AYSbHbM7Wp0).__
* __It is also used for feature selection.__



### There are two types of hypothesis tests in the `Chi^2` algorithm
   * test for fitting/ goodness of fit
   * test for independence
  
  
#### `Chi^2` test for fitting or goodness of fit
The chi-squared __goodness-of-fit__ test is an analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution.
#### `Chi^2` test of independence
The Chi-Square test of independence is a statistical test to determine if there is a __significant relationship__ between 2 categorical variables.

<hr>

## `Chi^2` test for goodness of fit

Let's understand the scenario first:

Mr. Rahim is thinking about buying a restaurant. So he asks the owner for the distribution of the `number of customers` the restaurant gets each day. The owner gives him the distribution for 6 days, as the percentage of weekly customers on each day. But Mr. Rahim gets a little suspicious and decides to check how good the owner's distribution is!

So he starts observing the number of customers who come to the restaurant over a week, and finally collects the observed distribution. Both the owner's distribution and the observed distribution are shown below:

> __The above example is taken from [Khan Academy](https://youtu.be/2QeDRsxSF9M)__


In [1]:
import pandas as pd

data = pd.read_csv(r"C:\Users\DIU\Desktop\goodness_of_fit.csv")

data

Unnamed: 0,day,owners_distribution,observed_distribution
0,sat,10,30
1,sun,10,14
2,mon,15,34
3,tue,20,45
4,wed,30,57
5,thu,15,20


`Chi^2` has a standard distribution [table](https://en.wikipedia.org/wiki/Chi-squared_distribution#Table_of_%CF%872_values_vs_p-values). We need this table for both tests.

![chi2_distribution_table.png](http://res.cloudinary.com/nasir78526/image/upload/q_100/v1531935798/chi2_distribution_table_tztb6a.png)

The Equation of `Chi^2`:

\begin{equation}
\chi^2=\sum\frac{(observed - expected)^2}{expected}
\end{equation}

In the `data` dataframe we have the __owner's distribution__ and the __observed distribution__, but we do not have the expected values! The formula for finding the __expected__ values is:

> expected = total no. of observed customers * (owner's % for each day)

Let's find the expected values first.

In [2]:
data["expected"] = (sum(data["observed_distribution"]) * (data["owners_distribution"]/100)).astype(int)
data

Unnamed: 0,day,owners_distribution,observed_distribution,expected
0,sat,10,30,20
1,sun,10,14,20
2,mon,15,34,30
3,tue,20,45,40
4,wed,30,57,60
5,thu,15,20,30


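__There are mainly two hypotheses in `Chi^2`__

\begin{equation}
H_0 = \text{Null Hypothesis}\\
H_a = \text{Alternative Hypothesis}
\end{equation}


* __Null Hypothesis => the observed data are consistent with what we expected (here: the customers follow the owner's distribution; in a test of independence: there is no significant relationship between the specified features)__
* __Alternative Hypothesis => the reverse of the Null Hypothesis__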


__Now let's calculate the `chi-square` value and find out whether the null hypothesis, that the customers follow the `owners_distribution`, holds or not__

> `Correct means => Accepted (not rejected) => when the Chi-square value is less than the Critical Value`

> `Incorrect means => Rejected => when the Chi-square value is greater than the Critical Value`

| Calculating chi-square value by hand | sat | sun | mon | tue | wed | thu |
| --- | --- | --- | --- | --- | --- | --- |
| \begin{equation}Observed\end{equation} | 30 | 14 | 34 | 45 | 57 | 20 |
| \begin{equation}Expected\end{equation} | 20 | 20 | 30 | 40 | 60 | 30 |
| \begin{equation}(O-E)\end{equation} | 10 | -6 | 4 | 5 | -3 | -10 |
| \begin{equation}(O-E)^2\end{equation} | 100 | 36 | 16 | 25 | 9 | 100 |
| \begin{equation}\frac{(O-E)^2}{E}\end{equation} | 5 | 1.8 | 0.533 | 0.625 | 0.15 | 3.333 |
| \begin{equation}\sum\end{equation} | | | | | | __11.44__ |

In [3]:
# Same calculation using python manually

subtract =  data["observed_distribution"] - data["expected"]
subtract_sqr = subtract**2
division = subtract_sqr / data["expected"]
chi_square = division.sum()

print(round(chi_square, 3))

11.442


Now we need to check whether the `owner's distribution` is accepted or not. To do this we need some extra information:

* What is the `degree of freedom`?
* What is the significance level?
* What is the critical value?

__Answer:__

Degree of freedom = number of categories (days) - 1 = 6 - 1 = __5__

Significance level: 0.05 (_the significance level most used by statisticians_)

Critical value: we need to find the critical value from the chi^2 distribution [table](https://en.wikipedia.org/wiki/Chi-squared_distribution#Table_of_%CF%872_values_vs_p-values).

We need to look at where the degree of freedom intersects the significance level (the P value column of the table):
degree of freedom => 5 and significance level 0.05 intersect at 11.07

Hence, Critical Value = 11.07
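
The table lookup can also be reproduced numerically. The following is a minimal sketch, assuming `scipy` is installed; the variable names are illustrative and not part of the original notebook.

```python
from scipy import stats

significance_level = 0.05
degrees_of_freedom = 6 - 1  # six day categories minus one

# ppf is the inverse CDF: the value that leaves `significance_level`
# of the probability in the upper tail of the chi-square distribution
table_critical_value = stats.chi2.ppf(1 - significance_level, df=degrees_of_freedom)

print(round(table_critical_value, 2))  # 11.07
```

This should agree with the table entry to two decimal places, so either source can be used for the comparison.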

In [4]:
critical_value = 11.07

if chi_square < critical_value:
    print("Owner's distribution is correct, Accepted")
else:
    print("Owner's distribution is not correct, Rejected")

Owner's distribution is not correct, Rejected


### We can achieve the exact same thing by using `scipy`:

In [5]:
import scipy.stats as stats

# ddof stays at its default of 0: chisquare already uses k - 1 = 5 degrees of freedom
(chi_square, p) = stats.chisquare(data["observed_distribution"], data["expected"])
print ('Chi-square Value = %f, P-value = %f' % (chi_square, p))

alpha = 0.05  # significance level

Chi-square Value = 11.441667, P-value = 0.043293


In [6]:
# another way to check the observation
# Correct means => Accepted (not rejected) => p (resulting level) > alpha (significance level)
# Incorrect means => Rejected => p <= alpha

if p <= alpha:
    # we reject the null hypothesis in favour of the alternative hypothesis
    print ("Owner's distribution is not correct, Rejected")
else:
    # we fail to reject the null hypothesis
    print("Owner's distribution is correct, Accepted")

Owner's distribution is not correct, Rejected


## `Chi^2` test of independence

__Independence__ is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another.

For instance, the __month__ you were born probably doesn't tell you anything about which __web browser__ you use :p

So we'd expect birth month and browser preference to be __independent__.

On the other hand, your month of birth might be related to whether you __excelled__ at sports in school, so month of birth and sports performance might __not__ be __independent__.

* The chi-squared `test of independence` tests whether two categorical variables are independent.
* The test of independence is commonly used to determine whether variables like education, political views and other preferences vary based on demographic factors like gender, race and religion. 

> __The above content was collected from this [blog](http://hamelg.blogspot.com)__
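
To make the idea of independence concrete, here is a small sketch with hypothetical counts (they are made up for illustration and do not come from any dataset in this notebook). When two variables are independent, the joint proportion in every cell equals the product of its row proportion and its column proportion.

```python
import math

# Hypothetical counts: rows are two browsers, columns are two halves of the year
table = [
    [30, 60],  # browser A: born Jan-Jun, born Jul-Dec
    [20, 40],  # browser B: born Jan-Jun, born Jul-Dec
]

grand_total = sum(sum(row) for row in table)
row_share = [sum(row) / grand_total for row in table]
col_share = [sum(col) / grand_total for col in zip(*table)]

# Under independence: P(row and column) = P(row) * P(column) for every cell
is_independent = all(
    math.isclose(table[i][j] / grand_total, row_share[i] * col_share[j])
    for i in range(len(table))
    for j in range(len(table[0]))
)

print(is_independent)  # True
```

Real data almost never match this product exactly; the chi-squared test measures whether the gap is larger than chance alone would explain.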


Let's say there are a couple of herbs that people believe help to prevent the __flu__. To test this, we randomly assign people to three different groups: the first two groups take herb1 and herb2 respectively, and the third group does not take anything:

> __The above example was collected from [KhanAcademy](https://youtu.be/hpWdDmgsIRE)__

In [7]:
flu_dataset = pd.read_csv(r"C:\Users\DIU\Desktop\flu_dataset.csv")

copy_df = flu_dataset.copy()
flu_dataset

Unnamed: 0,status,herb1,herb2,noherb
0,sick,20,30,30
1,not_sick,100,110,90


__Now we need to find the totals, both column-wise and row-wise:__

In [8]:
# row-wise sums added as a new column called 'total'
flu_dataset["total"] = flu_dataset.iloc[:, 1:].sum(axis=1)

# column-wise sums added as a new row with the index 'Grand Total'
# (the 'status' strings get concatenated too, hence 'sicknot_sick' in the output)
flu_dataset = pd.concat([flu_dataset, pd.DataFrame(flu_dataset.sum(axis=0), columns=['Grand Total']).T])

flu_dataset

Unnamed: 0,status,herb1,herb2,noherb,total
0,sick,20,30,30,80
1,not_sick,100,110,90,300
Grand Total,sicknot_sick,120,140,120,380


> The main difference between `goodness of fit` and the `test of independence` is that in the `test of independence` we have to find an expected value for every cell of a two-dimensional table.

First we need to find the overall proportion of people getting sick or not sick:

> proportion getting `sick = 80/380 = 0.2105 ~= 21%`

> proportion getting `not sick = 300/380 = 0.7895 ~= 79%`

For each cell we need to find the expected value:

for `sick patients = overall proportion getting sick * total number of people in that group (herb1, herb2 or no herb)`

for `not sick patients = overall proportion getting not sick * total number of people in that group`

For example, the expected `frequency of sick patients who take herb1 = overall proportion getting sick * total number of people taking herb1`

> expected_sick_herb1 = 21% * 120 = 25.2

> expected_sick_herb2 = 21% * 140 = 29.4

> expected_sick_noherb = 21% * 120 = 25.2

> expected_notsick_herb1 = 79% * 120 = 94.8

> expected_notsick_herb2 = 79% * 140 = 110.6

> expected_notsick_noherb = 79% * 120 = 94.8

|status| herb1| herb2|noherb|total|
|------|------|------|------|-----|
|sick|20|30|30|80|
|__Exp. Freq.__|__25.2__|__29.4__|__25.2__|__21%__|
|not_sick|100|110|90|300|
|__Exp. Freq.__|__94.8__|__110.6__|__94.8__|__79%__|
|GrandTotal|120|140|120|380|
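
The expected frequencies in the table are built from the rounded percentages 21% and 79%. As a sketch of the same calculation without rounding, each expected count can be computed as `row total * column total / grand total`. The variable names below are illustrative and only the Python standard library is used.

```python
observed = [
    [20, 30, 30],    # sick:     herb1, herb2, noherb
    [100, 110, 90],  # not_sick: herb1, herb2, noherb
]

row_totals = [sum(row) for row in observed]        # [80, 300]
col_totals = [sum(col) for col in zip(*observed)]  # [120, 140, 120]
grand_total = sum(row_totals)                      # 380

# expected count of a cell = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
# [25.26, 29.47, 25.26]
# [94.74, 110.53, 94.74]
```

These values differ from the rounded ones only in the first decimal place, but the difference carries through to the chi-square statistic.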

Now we need to calculate the chi-square:
\begin{equation}
\chi^2 = \sum\frac{(Observed - Expected)^2}{Expected}\\
=\frac{(20-25.2)^2}{25.2} + \frac{(30-29.4)^2}{29.4}  + \frac{(30-25.2)^2}{25.2} + \frac{(100-94.8)^2}{94.8} + \frac{(110-110.6)^2}{110.6} + \frac{(90-94.8)^2}{94.8}\\
\approx 2.5311\\
\end{equation}

The small difference from the value `scipy` reports below (2.5258) comes from rounding the percentages to 21% and 79%.
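
The hand calculation relies on expected counts derived from rounded percentages. The sketch below repeats the sum with unrounded expected counts, using only plain Python; the names are illustrative.

```python
observed = [
    [20, 30, 30],    # sick:     herb1, herb2, noherb
    [100, 110, 90],  # not_sick: herb1, herb2, noherb
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square_exact = 0.0
for obs_row, row_total in zip(observed, row_totals):
    for obs, col_total in zip(obs_row, col_totals):
        exp = row_total * col_total / grand_total  # unrounded expected count
        chi_square_exact += (obs - exp) ** 2 / exp

print(round(chi_square_exact, 4))  # 2.5258
```

Keeping full precision in the expected counts gives a statistic of about 2.5258.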

### We can achieve the exact same thing in python using the `scipy.stats` library:

In [9]:
del copy_df["status"]
copy_df

Unnamed: 0,herb1,herb2,noherb
0,20,30,30
1,100,110,90


In [10]:
chiStats = stats.chi2_contingency(observed = copy_df)
print ('Chi-square Value = %f, p-value=%f' % (chiStats[0], chiStats[1]))

Chi-square Value = 2.525794, p-value=0.282834


__Now, to accept or reject the hypothesis we need to look at the chi-square distribution [table](https://en.wikipedia.org/wiki/Chi-squared_distribution#Table_of_%CF%872_values_vs_p-values).__

First things first: the significance level (alpha) for this problem is 10% = 0.10

and the degree of freedom for a contingency table = (number of rows - 1) * (number of columns - 1) = (2-1) * (3-1) = 2

__Now we need to find the critical value where the `degree of freedom => 2` intersects the `significance level 0.10`__

According to the chi-square distribution table, the `intersection / critical value is 4.61`

If the `chi-square` value is less than the `critical value`, the null hypothesis is accepted (not rejected), which means we treat the variables as independent.
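
For exactly 2 degrees of freedom the table entry can be checked by hand, because the upper-tail probability of the chi-square distribution has the closed form `P(X > x) = exp(-x / 2)`. The sketch below uses that special case with the standard library only. It does not generalise to other degrees of freedom, and the variable names are illustrative.

```python
import math

significance_level = 0.10
chi_square_value = 2.525794  # statistic reported by chi2_contingency above

# Solve exp(-x / 2) = significance_level for x to get the critical value
critical_value_df2 = -2 * math.log(significance_level)

# Upper-tail probability of the observed statistic (the p-value)
p_value_df2 = math.exp(-chi_square_value / 2)

print(round(critical_value_df2, 2))  # 4.61
print(round(p_value_df2, 4))         # 0.2828
```

Both numbers match the table entry and the p-value printed by `chi2_contingency`.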


In [11]:
significant_level = 0.10
degree_of_freedom = 2

critical_value = stats.chi2.ppf(q = 1 - significant_level, df = degree_of_freedom)

print("Critical Value: ", critical_value)

observe_chi_square = chiStats[0]

print("Observed Chi Value: ", observe_chi_square)

if observe_chi_square <= critical_value:
    # observed chi-square value is not in the critical region, therefore we accept (fail to reject) the null hypothesis
    print ('Null hypothesis Accepted (variables are Independent)')
else:
    # observed value is in the critical region, therefore we reject the null hypothesis
    print ('Null hypothesis Rejected (variables are related/dependent)')

Critical Value:  4.605170185988092
Observed Chi Value:  2.5257936507936507
Null hypothesis Accepted (variables are Independent)

