## Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey data are summarized in the following table:

| Gender | High School | Bachelors | Masters | Ph.d. | Total |
|--------|-------------|-----------|---------|-------|-------|
| Female | 60 | 54 | 46 | 41 | 201 |
| Male | 40 | 44 | 53 | 57 | 194 |
| Total | 100 | 98 | 99 | 98 | 395 |

## Question: Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education they have obtained?

In [2]:
print("Using the chi-square test of independence")
print("H0: gender is independent of education level")
print("H1: gender and education level are dependent")

import numpy as np
import pandas as pd
import scipy.stats as stats

# Observed counts per education level, in the order
# High School, Bachelors, Masters, Ph.d.
female = [60,54,46,41]
male = [40,44,53,57]
# Counts in long format: the four male counts first, then the four female counts
marks = male + female

sex =  ['Male','Male','Male','Male','Female','Female','Female','Female']
education = ['High School', 'Bachelors', 'Masters', 'Ph.d.','High School', 'Bachelors', 'Masters', 'Ph.d.']
# One row per (Sex, Education) pair; the "Marks" column holds the observed count
student = pd.DataFrame({"Sex":sex,"Education":education,"Marks":marks})

student 

Using the chi-square test of independence
H0: gender is independent of education level
H1: gender and education level are dependent


      Sex    Education  Marks
0    Male  High School     40
1    Male    Bachelors     44
2    Male      Masters     53
3    Male        Ph.d.     57
4  Female  High School     60
5  Female    Bachelors     54
6  Female      Masters     46
7  Female        Ph.d.     41


In [3]:
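# Pivot the long table into a Sex x Education contingency table;
# margins=True adds the row and column totals under the label "All"
df_cross = pd.crosstab(student.Sex, student.Education, values=student.Marks, aggfunc="sum", margins=True)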
df_cross

Education  Bachelors  High School  Masters  Ph.d.  All
Sex
Female            54           60       46     41  201
Male              44           40       53     57  194
All               98          100       99     98  395


In [23]:
df_cross.index.name = ""
df_cross.columns.name =""
observed = df_cross.iloc[0:2,0:4]
observed

        Bachelors  High School  Masters  Ph.d.
Female       54.0         60.0     46.0   41.0
Male         44.0         40.0     53.0   57.0


In [20]:
"""For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test. 
The main difference is we have to calculate the expected counts of each cell in a 2-dimensional table instead of a 
1-dimensional table. To get the expected count for a cell, multiply the row total for that cell by the column total 
for that cell and then divide by the total number of observations. We can quickly get the expected counts for all cells 
in the table by taking the row totals and column totals of the table, performing an outer product on them with the np.outer() 
function and dividing by the number of observations:"""

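As a quick check of the rule described above (row total times column total, divided by the number of observations), here is a minimal sketch in plain Python. It uses the totals from the crosstab; the names `row_totals`, `col_totals` and `expected_by_hand` are illustrative and do not appear elsewhere in the notebook.

```python
# Row and column totals copied from the crosstab above
row_totals = {"Female": 201, "Male": 194}
col_totals = {"Bachelors": 98, "High School": 100, "Masters": 99, "Ph.d.": 98}
n = 395  # total number of observations

# Expected count for each cell = row total * column total / n
expected_by_hand = {
    sex: {edu: r * c / n for edu, c in col_totals.items()}
    for sex, r in row_totals.items()
}

for sex, row in expected_by_hand.items():
    print(sex, {edu: round(value, 3) for edu, value in row.items()})
```

For example, the Female/Bachelors cell is 201 * 98 / 395, which is about 49.868.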

In [21]:
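# Outer product of the row totals and column totals, divided by the grand total
expected =  np.outer(df_cross["All"].iloc[0:2],
                     df_cross.loc["All"].iloc[0:4]) / df_cross.loc["All", "All"]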
expected = pd.DataFrame(expected)
expected.columns = ["Bachelors","High School","Masters","Ph.d."]
expected.index = ["Female","Male"]
expected

        Bachelors  High School    Masters      Ph.d.
Female  49.868354    50.886076  50.377215  49.868354
Male    48.131646    49.113924  48.622785  48.131646


In [26]:
(((observed-expected)**2)/expected).sum()


Bachelors      0.696974
High School    3.323588
Masters        0.774385
Ph.d.          3.211119
dtype: float64

In [27]:
print("Calling sum() twice adds up the per-column sums into a single chi-square statistic")

Calling sum() twice adds up the per-column sums into a single chi-square statistic


In [28]:
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)

8.006066246262538


In [29]:
critical = stats.chi2.ppf(q=0.95,df=3)
print("Critical value: ",critical)

p_value  = 1- stats.chi2.cdf(x = chi_squared_stat,df=3)

print("P value: ",p_value)

Critical value:  7.814727903251179
P value:  0.04588650089174717
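The cell above uses `df=3` without deriving it. For a contingency table the degrees of freedom are (rows - 1) * (columns - 1). The sketch below spells that out and applies both decision rules; the numeric values are copied from the outputs above, and the names `n_rows`, `n_cols`, `dof` and `alpha` are illustrative.

```python
# Table shape: 2 genders x 4 education levels (totals excluded)
n_rows, n_cols = 2, 4
dof = (n_rows - 1) * (n_cols - 1)

# Values copied from the notebook outputs above
chi_squared_stat = 8.006066246262538
critical = 7.814727903251179
p_value = 0.04588650089174717
alpha = 0.05  # 5% level of significance

# Two equivalent decision rules
reject_by_critical = chi_squared_stat > critical
reject_by_p_value = p_value < alpha

print("Degrees of freedom:", dof)
print("Reject H0 (critical value rule):", reject_by_critical)
print("Reject H0 (p-value rule):", reject_by_p_value)
```

Both rules agree here, which is expected because the critical value is the 95th percentile of the same chi-square distribution used for the p-value.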


In [32]:


stats.chi2_contingency(observed=observed)

(8.006066246262538,
 0.045886500891747214,
 3,
 array([[49.86835443, 50.88607595, 50.37721519, 49.86835443],
        [48.13164557, 49.11392405, 48.62278481, 48.13164557]]))

In [35]:
"""The output shows the chi-square statistic (8.006), the p-value (0.0459) and the degrees of freedom (3), followed by the expected counts. The critical value for 3 degrees of freedom at the 5% level is 7.815. Since 8.006 > 7.815 (equivalently, p = 0.0459 < 0.05), we reject the null hypothesis and conclude that gender and education level are not independent at the 5% level of significance."""

## Using the following data, perform a one-way analysis of variance using α = .05. Write up the results in APA format.
## Group1: 51, 45, 33, 45, 67
## Group2: 23, 43, 23, 43, 45
## Group3: 56, 76, 74, 87, 56

In [39]:
print("This is a one-way ANOVA problem, since we need to compare the means of more than two groups at once")

import scipy.stats as stats
Group1 = [51, 45, 33, 45, 67]
Group2 = [23, 43, 23, 43, 45]
Group3 = [56, 76, 74, 87, 56]
# Perform the one-way ANOVA (H0: all group means are equal)

statistic, pvalue = stats.f_oneway(Group1,Group2,Group3)

print("\nF Statistic value {} , \n p-value {}".format(statistic,pvalue))

if pvalue < 0.05:
    print('\npvalue < 0.05')
    print("\nThe test result suggests that the group means are not all equal, since the result is significant at the 5% level. \nHere the p-value returned is {} which is < 0.05".format(pvalue))
else:
    print('pvalue >= 0.05')
    

This is a one-way ANOVA problem, since we need to compare the means of more than two groups at once

F Statistic value 9.747205503009463 , 
 p-value 0.0030597541434430556

pvalue < 0.05

The test result suggests that the group means are not all equal, since the result is significant at the 5% level. 
Here the p-value returned is 0.0030597541434430556 which is < 0.05
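The question also asks for an APA-style write-up, which needs the two degrees of freedom that `f_oneway` does not return. The following is a minimal sketch, assuming the usual one-way ANOVA decomposition into between-group and within-group sums of squares. The p-value is copied from the output above rather than recomputed, and the names `groups`, `apa` and `p_text` are illustrative.

```python
import statistics

groups = {
    "Group1": [51, 45, 33, 45, 67],
    "Group2": [23, 43, 23, 43, 45],
    "Group3": [56, 76, 74, 87, 56],
}
p_value = 0.0030597541434430556  # copied from the f_oneway output above

all_values = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups.values() for x in g)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # N - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

# APA style drops the leading zero of the p-value
p_text = f"{p_value:.3f}".lstrip("0")
apa = f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_text}"
print(apa)

# Descriptive statistics usually reported alongside the test
for name, g in groups.items():
    print(f"{name}: M = {statistics.mean(g):.2f}, SD = {statistics.stdev(g):.2f}")
```

The printed test line can then be placed in a sentence such as "A one-way ANOVA showed a significant effect of group, F(2, 12) = 9.75, p = .003."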


## Calculate the F-test statistic for the two data sets 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.

In [42]:

print("F test =  Variance of Treatment")
print("         ----------------------")
print("           Variance of Error")
            

F test =  Variance of Treatment
         ----------------------
           Variance of Error


In [44]:

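Set1 = [10, 20, 30, 40, 50]

Set2 = [5, 10, 15, 20, 25]

# Import under its own name so it does not shadow scipy.stats (imported as `stats` above)
import statistics

# Sample variances (n - 1 in the denominator)
var1 = statistics.variance(Set1)
var2 = statistics.variance(Set2)

# F statistic: ratio of the two sample variances
F_test = var1/var2

print("F-test value is:", F_test)

F-test value is: 4.0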
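The F ratio on its own does not say whether the two variances differ significantly. As a hedged follow-up, the sketch below compares the ratio with an F distribution with (n1 - 1, n2 - 1) degrees of freedom using `scipy.stats.f.sf`. It assumes both samples come from normal populations; the names `f_ratio`, `p_one_sided` and `p_two_sided` are illustrative.

```python
import statistics
from scipy import stats as sp_stats

set1 = [10, 20, 30, 40, 50]
set2 = [5, 10, 15, 20, 25]

# Ratio of sample variances, as in the cell above
f_ratio = statistics.variance(set1) / statistics.variance(set2)

# Degrees of freedom for numerator and denominator
df1 = len(set1) - 1
df2 = len(set2) - 1

# One-sided p-value: P(F >= f_ratio) under H0 of equal variances
p_one_sided = sp_stats.f.sf(f_ratio, df1, df2)
# Two-sided p-value: double the smaller tail, capped at 1
p_two_sided = min(1.0, 2 * min(p_one_sided, 1 - p_one_sided))

print("F ratio:", f_ratio)
print("One-sided p-value:", round(p_one_sided, 3))
print("Two-sided p-value:", round(p_two_sided, 3))
```

With only five observations per set, the one-sided p-value is about 0.104, so a ratio of 4.0 is not significant at the 5% level.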
