### Chi-Square Test

In [1]:
import pandas as pd
from scipy import stats
from scipy.stats import chi2
from scipy.stats import chi2_contingency

##### Code reference:
https://machinelearningmastery.com/chi-squared-test-for-machine-learning/

In [2]:
def chi_square_test(observed_values, prob):
    """
    Chi-square test of independence.

    * observed_values - observed counts in contingency-table format, e.g.
          [[10, 20, 30],
           [40, 50, 60]]
    * prob - confidence level (e.g. 0.95); the significance level is 1 - prob
    """
    stat, p, dof, expected = chi2_contingency(observed_values)
    print(f"Chisquare statistic \t: {round(stat,3)}")
    print(f"p-value \t\t: {round(p,3)}")
    print(f"Degreed of freedom\t: {dof}")
    print(f"\nExpected value\t\t:")
    print(expected)
    critical = chi2.ppf(prob, dof)
    if abs(stat) >= critical:
        result = 'From critical value : Dependent (reject H0)'
    else:
        result = 'From critical value : Independent (fail to reject H0)'
    
    # interpret p-value
    alpha = 1.0 - prob
    print('significance=%.3f, p=%.3f' % (alpha, p))
    if p <= alpha:
        result2 = 'From significance : Dependent (reject H0)'
    else:
        result2 = 'From significance : Independent (fail to reject H0)'
    print(result)
    return result,result2

In [3]:
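def chi_square_test_dof(observed, expected, dof, prob=0.95):
    """Goodness-of-fit test: compare the chi-square statistic with the critical value."""
    # The p-value returned by stats.chisquare is not used in this helper
    stat, _ = stats.chisquare(observed, expected)
    # Critical value at confidence level `prob` for `dof` degrees of freedom
    critical = chi2.ppf(prob, dof)
    if abs(stat) >= critical:
        result = 'Reject Null Hypothesis H0'
    else:
        result = 'Fail to reject Null Hypothesis H0'
    return result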
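
The helper above reports only the critical-value decision. As a minimal sketch, the hypothetical `gof_summary` below (not part of the original notebook) returns the statistic, the p-value, and the critical value together. It assumes a goodness-of-fit test in which the degrees of freedom equal the number of categories minus one, and it is illustrated with the card counts from Problem 1.

```python
from scipy import stats
from scipy.stats import chi2


def gof_summary(observed, expected, alpha=0.05):
    """Hypothetical helper: goodness-of-fit summary, assuming dof = k - 1."""
    dof = len(observed) - 1
    stat, p = stats.chisquare(observed, f_exp=expected)
    critical = chi2.ppf(1 - alpha, dof)  # upper-tail critical value
    return {
        "stat": float(stat),
        "p": float(p),
        "dof": dof,
        "critical": float(critical),
        "reject_h0": bool(p <= alpha),
    }


# Card counts from Problem 1 (spades, hearts, diamonds, clubs)
summary = gof_summary([404, 420, 400, 376], [400, 400, 400, 400])
print(summary)
```

Returning the numbers instead of a formatted string makes it easier to check that the p-value decision and the critical-value decision agree.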

### Problem 1
1. A poker-dealing machine is supposed to deal cards at random, as if from an infinite deck.

In a test, you counted 1600 cards, and observed the following:
<pre>
Spades        404
Hearts        420
Diamonds      400
Clubs         376
</pre>

Could it be that the suits are equally likely? Or are these discrepancies too large to be due to chance?

### Analysis:
If the four suits are equally likely, each suit is expected 1600/4 = 400 times out of 1600 cards:
<pre>
Type          Expected   Actual    difference
Spades        400        404        4
Hearts        400        420        20
Diamonds      400        400        0
Clubs         400        376        -24
</pre>
Since there are 4 categories and the total is fixed, only 3 of the counts can vary independently, so the degrees of freedom is <b>3</b>.

chi-square = sum((Actual - Expected)^2 / Expected)

           = 4^2/400 + 20^2/400 + 0^2/400 + (-24)^2/400
           = (16+400+0+576)/400
           = 992/400
           = 2.48
<b> Null Hypothesis: </b> The suits are equally likely; the observed counts follow the expected distribution.

<b> Alternate Hypothesis: </b> The suits are not equally likely.

From the chi-square distribution, the <b> critical value </b> for a <b> significance of 0.05 </b> with <b> 3 degrees of freedom </b> is <b> 7.815</b>.

Since the chi-square statistic (2.48) is less than the critical value, it does not fall in the critical region, so we fail to reject the null hypothesis.

<b> Answer </b>: Fail to reject the null hypothesis; the data are consistent with the suits being equally likely.
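
The critical value for 3 degrees of freedom does not have to come from a printed table. The sketch below looks it up with `chi2.ppf` and also converts the hand-computed statistic into a p-value with `chi2.sf`. The variable names are illustrative only.

```python
from scipy.stats import chi2

stat = 2.48    # hand-computed chi-square statistic for the four suits
dof = 3        # 4 suits - 1
alpha = 0.05   # significance level

critical = chi2.ppf(1 - alpha, dof)  # value exceeded with probability alpha under H0
p_value = chi2.sf(stat, dof)         # P(chi-square >= stat) under H0

print(f"critical value: {critical:.3f}")
print(f"p-value: {p_value:.3f}")
print("reject H0" if stat >= critical else "fail to reject H0")
```

Both views lead to the same decision: the statistic is below the critical value exactly when the p-value is above `alpha`.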

In [4]:
# Solving programmatically
cards_expected = [400,400,400,400]
cards_actual = [404,420,400,376]
result = chi_square_test_dof(cards_actual,cards_expected,dof=3)
print(result)

Fail to reject Null Hypothesis H0


### Problem 2
Same as before, but this time jokers are included, and you counted 1662 cards, with these results:-
<pre>
Spades        404
Hearts        420
Diamonds      400
Clubs         356
Jokers         82
</pre>
a) How many jokers would you expect out of 1662 random cards? How many of each suit?

b) Is it possible that the cards are really random? Or are the discrepancies too large?

### Analysis:
A standard deck has 13 cards of each suit and 2 jokers, 54 cards in total.
So out of 1662 cards, about 1662 * 13/54 ≈ 400 of each suit and 1662 * 2/54 ≈ 62 jokers are expected.
<pre>
Type          Expected   Actual    difference
Spades        400        404        4
Hearts        400        420        20
Diamonds      400        400        0
Clubs         400        356        -44
Jokers         62         82        20
</pre>
Since there are 5 categories and the total is fixed, only 4 of the counts can vary independently, so the degrees of freedom is <b>4</b>.

chi-square = sum((Actual - Expected)^2 / Expected)

           = 4^2/400 + 20^2/400 + 0^2/400 + (-44)^2/400 +(20^2)/62
           = (16+400+0+1936)/400+400/62
           = 12.33           
<b> Null Hypothesis: </b> The cards are dealt at random; the observed counts follow the expected distribution (suits equally likely, jokers in proportion 2 out of 54).

<b> Alternate Hypothesis: </b> The observed counts do not follow the expected distribution.

From the chi-square distribution, the <b> critical value </b> for a <b> significance of 0.05 </b> with <b> 4 degrees of freedom </b> is <b> 9.488</b>.

Since the chi-square statistic (12.33) is greater than the critical value, it falls in the critical region, so we reject the null hypothesis.

<b> Answer </b>: Reject the null hypothesis; the discrepancies are too large to be explained by chance alone.
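
The expected counts in the table are rounded to whole cards. As a rough check that the rounding does not drive the conclusion, the sketch below recomputes the expected counts without rounding, assuming a 54-card deck (13 cards per suit plus 2 jokers), and reruns the test.

```python
from scipy import stats
from scipy.stats import chi2

observed = [404, 420, 400, 356, 82]  # spades, hearts, diamonds, clubs, jokers
total = sum(observed)                # 1662 cards

# Assumed deck composition: 13 cards per suit plus 2 jokers = 54 cards
weights = [13, 13, 13, 13, 2]
expected = [total * w / sum(weights) for w in weights]

stat, p = stats.chisquare(observed, f_exp=expected)
critical = chi2.ppf(0.95, len(observed) - 1)

print([round(e, 2) for e in expected])
print(f"chi-square = {stat:.2f}, critical = {critical:.3f}, p-value = {p:.4f}")
```

With unrounded expected counts the statistic changes slightly but remains above the critical value.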

In [5]:
cards_expected = [400,400,400,400,62]
cards_actual = [404,420,400,356,82]
result = chi_square_test_dof(cards_actual,cards_expected,dof=4)
print(result)

Reject Null Hypothesis H0


### Problem 3
A genetics engineer was attempting to cross a tiger and a cheetah. 
She predicted a phenotypic outcome of the traits she was observing to be in the following ratio
4 stripes only : 3 spots only : 9 both stripes and spots.
When the cross was performed and she counted the individuals she found 50 with stripes only,
41 with spots only and 85 with both.
According to the Chi-Square test, did she get the predicted outcome? 

### Analysis:
The ratio was given as 4:3:9
Observed values are : 50,41,85
Total observed values = 50 + 41 + 85 = 176
Expected values  = (4/16) * 176 , (3/16) * 176, (9/16) * 176 = 44,33,99

<pre>
Type          Expected     Actual    difference
Stripes only        44        50        6
spots   only        33        41        8
both                99        85        -14
</pre>
Since there are 3 categories and the total is fixed, only 2 of the counts can vary independently, so the degrees of freedom is <b>2</b>.

chi-square = sum((Actual - Expected)^2 / Expected)

           = (6^2/44 + 8^2/33 + (-14)^2/99)
           = (36/44 + 64/33 + 196/99)
           = 0.82 + 1.94 + 1.98
           = 4.74

<b> Null Hypothesis: </b> The observed counts follow the predicted 4:3:9 ratio.

<b> Alternate Hypothesis: </b> The observed counts do not follow the predicted 4:3:9 ratio.

From the chi-square distribution, the <b> critical value </b> for a <b> significance of 0.05 </b> with <b> 2 degrees of freedom </b> is <b> 5.99</b>.

Since the chi-square statistic (4.74) is less than the critical value, we fail to reject the null hypothesis.

<b> Answer </b>: Fail to reject the null hypothesis; the observed counts are consistent with the predicted ratio.
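
The step from a ratio to expected counts, and from there to the per-category terms of the statistic, can be sketched in plain Python. `expected_from_ratio` is a hypothetical helper introduced here for illustration.

```python
def expected_from_ratio(ratio, total):
    """Hypothetical helper: split `total` into expected counts according to `ratio`."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]


observed = [50, 41, 85]  # stripes only, spots only, both
expected = expected_from_ratio([4, 3, 9], sum(observed))

# One (observed - expected)^2 / expected term per category
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
stat = sum(contributions)

print(expected)
print([round(c, 3) for c in contributions], round(stat, 2))
```

Listing the contributions separately shows which category deviates most from the prediction.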

In [6]:
# Solving programmatically
observed=[50,41,85]
expected = [44,33,99]
result = chi_square_test_dof(observed,expected,dof=2)
print(result)

Fail to reject Null Hypothesis H0


### Problem 4
In the garden pea, yellow cotyledon color is dominant to green and inflated pod shape is dominant to the constricted form. Considering both of these traits jointly in self-fertilized dihybrids, the progeny appeared in the following numbers:-

193 green inflated , 184 yellow constricted , 556 yellow inflated , 61 green constricted

Do these genes assort independently? Support your answer using Chi-square analysis.

Note:- Genes assort independently  if they follow the 9:3:3:1 rule ( on the 16 square Punnett Square) resulting from a dihybrid cross

### Analysis:
The expected ratio is 9:3:3:1 (yellow inflated : green inflated : yellow constricted : green constricted), out of 16 squares.
Observed values, in the same order: 556, 193, 184, 61
Total observed values = 556 + 193 + 184 + 61 = 994
Expected values  = (9/16) * 994, (3/16) * 994, (3/16) * 994, (1/16) * 994 = 559.125, 186.375, 186.375, 62.125

<pre>
            Type          Expected     Actual    difference (Actual - Expected)
   yellow inflated        559.125        556        -3.125
    green inflated        186.375        193         6.625
yellow constricted        186.375        184        -2.375
 green constricted         62.125         61        -1.125
</pre>
Since there are 4 categories and the total is fixed, only 3 of the counts can vary independently, so the degrees of freedom is <b>3</b>.

chi-square = sum((Actual - Expected)^2 / Expected)

           = (-3.125)^2/559.125 + (6.625)^2/186.375 + (-2.375)^2/186.375 + (-1.125)^2/62.125
           = 0.017 + 0.235 + 0.030 + 0.020
           = 0.30
           
<b> Null Hypothesis: </b> The genes assort independently; the observed counts follow the 9:3:3:1 ratio.

<b> Alternate Hypothesis: </b> The observed counts do not follow the 9:3:3:1 ratio.

From the chi-square distribution, the <b> critical value </b> for a <b> significance of 0.05 </b> with <b> 3 degrees of freedom </b> is <b> 7.815</b>.

Since the chi-square statistic (0.30) is less than the critical value, we fail to reject the null hypothesis.

<b> Answer </b>: Fail to reject the null hypothesis; the observed counts are consistent with independent assortment.
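
Because the phenotype order matters when pairing observed counts with ratio shares, it can help to keep each label next to its numbers. The sketch below does that with a plain dictionary (an illustrative layout, not part of the original notebook) and reports the p-value as well.

```python
from scipy import stats

# Phenotype -> (observed count, share of the 16-square Punnett square)
progeny = {
    "yellow inflated": (556, 9),
    "green inflated": (193, 3),
    "yellow constricted": (184, 3),
    "green constricted": (61, 1),
}

observed = [count for count, _ in progeny.values()]
total = sum(observed)
expected = [total * share / 16 for _, share in progeny.values()]

stat, p = stats.chisquare(observed, f_exp=expected)

for name, obs, exp in zip(progeny, observed, expected):
    print(f"{name:<20}{obs:>6}{exp:>10.3f}")
print(f"chi-square = {stat:.3f}, p-value = {p:.3f}")
```

A p-value far above 0.05 means the counts sit close to the 9:3:3:1 prediction.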

In [7]:
# Solving programmatically
observed=[556,193,184,61]
expected = [559.125,186.375,186.375,62.125]
result = chi_square_test_dof(observed,expected,dof=3)
print(result)

Fail to reject Null Hypothesis H0


### Problem 6
In the Titanic dataset, build a crosstab of `Embarked` and `Survived`. Using the chi-square test, determine whether the two variables are dependent or independent.

In [8]:
titanic = pd.read_csv(r'E:\SupervisedLearning\datasets\titanic.csv')

In [9]:
# pd.crosstab(titanic['PassengerId'],titanic['Survived'],margins=True)

In [13]:
observed = pd.crosstab(titanic['Embarked'],titanic['Survived']).values

In [14]:
observed

array([[ 75,  93],
       [ 47,  30],
       [427, 217]], dtype=int64)

In [16]:
chi_square_test(observed,0.90)

Chisquare statistic 	: 26.489
p-value 		: 0.0
Degreed of freedom	: 2

Expected value		:
[[103.7480315  64.2519685]
 [ 47.5511811  29.4488189]
 [397.7007874 246.2992126]]
significance=0.100, p=0.000
From critical value : Dependent (reject H0)


('From critical value : Dependent (reject H0)',
 'From significance : Dependent (reject H0)')
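
The expected table and the statistic printed above can be reproduced by hand from the row and column totals, which may help in judging whether the rejection is plausible. The sketch below uses only the observed array shown earlier. It also prints the p-value in scientific notation, since `round(p, 3)` displays any very small p-value as 0.0.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows are the three Embarked values, columns are Survived = 0 / 1
observed = np.array([[75, 93],
                     [47, 30],
                     [427, 217]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Under independence: expected = row total * column total / grand total
expected = row_totals * col_totals / observed.sum()

stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(stat, dof)

print(np.round(expected, 3))
print(f"chi-square = {stat:.3f}, dof = {dof}, p-value = {p_value:.2e}")
```

The first row contributes most of the statistic: more passengers from that port survived than independence would predict.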

# Doubts
<pre>
1) Need to get clarity on Problem 4
2) For Problem 6, is the conclusion really "Reject Null Hypothesis"?
</pre>