## Problem Statement 1:
Is gender independent of education level? A random sample of 395 people was
surveyed, and each person was asked to report the highest education level they
had obtained. The resulting survey data are summarized in the following table:

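|        | High School | Bachelors | Masters | Ph.D. | Total |
|--------|-------------|-----------|---------|-------|-------|
| Female | 60          | 54        | 46      | 41    | 201   |
| Male   | 40          | 44        | 53      | 57    | 194   |
| Total  | 100         | 98        | 99      | 98    | 395   |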

Question: Are gender and education level dependent at the 5% level of significance? In
other words, given the data collected above, is there a relationship between the
gender of an individual and the level of education that they have obtained?

***

#### Chi-square test statistic: χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected frequency under the null hypothesis of independence, and the sum runs over every cell of the table.

#### E = (Row Total × Column Total) / Sample Size
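
As a quick check of the expected-frequency formula, the expected count for a single cell can be computed by hand. The sketch below uses the Female / High School cell from the table above; the variable names are illustrative only.

```python
# Expected count for the Female / High School cell under independence.
row_total = 201      # Female row total
col_total = 100      # High School column total
sample_size = 395    # overall sample size

expected_female_hs = row_total * col_total / sample_size
print(round(expected_female_hs, 3))  # 50.886
```
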

In [1]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[60,54,46,41,201], [40,44,53,57,194],[100,98,99,98,395]])
                          ,columns=['HighSchool','Bachelors','Masters','PhD','Row_Total'])
df.rename(index={0:'Female',1:'Male',2:'Col_Total'}, inplace=True)
df

,HighSchool,Bachelors,Masters,PhD,Row_Total
Female,60,54,46,41,201
Male,40,44,53,57,194
Col_Total,100,98,99,98,395


In [2]:
expected =  np.outer(df["Row_Total"][0:2],
                     df.loc["Col_Total"][0:4]) / 395

expected = pd.DataFrame(expected)

expected.columns = ['HighSchool','Bachelors','Masters','PhD']
expected.index = ["Female","Male"]

expected

,HighSchool,Bachelors,Masters,PhD
Female,50.886076,49.868354,50.377215,49.868354
Male,49.113924,48.131646,48.622785,48.131646


In [3]:
observed = df.iloc[0:2,0:4]
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print("Chi-Square Statistic= {:.3f}".format(chi_squared_stat))

Chi-Square Statistic= 8.006


We call .sum() twice: the first call returns the sum of each column, and the second adds those column sums together, giving the total over the entire 2D table.
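
The same total can be reproduced without pandas, which makes the cell-by-cell sum explicit. This is a minimal sketch using plain Python lists; names such as `observed_counts` and `chi2_manual` are chosen here for illustration and do not appear in the notebook.

```python
# Observed counts from the survey table (rows: Female, Male).
observed_counts = [
    [60, 54, 46, 41],
    [40, 44, 53, 57],
]

row_totals = [sum(row) for row in observed_counts]
col_totals = [sum(col) for col in zip(*observed_counts)]
grand_total = sum(row_totals)

# Add (O - E)^2 / E over every cell of the table.
chi2_manual = 0.0
for i, row in enumerate(observed_counts):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi2_manual += (o - e) ** 2 / e

print(round(chi2_manual, 3))  # 8.006
```
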

In [4]:
import scipy.stats as stats
crit = stats.chi2.ppf(q = 0.95, # Critical value for a 5% significance level (95th percentile)
                      df = 3)   # df = (rows - 1) * (columns - 1) = 1 * 3

print("Critical value")
print(crit)
p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=3)
print("P value")
print(p_value)

Critical value
7.814727903251179
P value
0.04588650089174717


#### The critical value of χ² with 3 degrees of freedom is 7.815. Since 8.006 > 7.815 (equivalently, p = 0.046 < 0.05), we reject the null hypothesis of independence and conclude that education level and gender are dependent at the 5% level of significance.
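
The value df = 3 used above follows from the table shape: a contingency table with r rows and c columns has (r − 1)(c − 1) degrees of freedom. The sketch below derives it and, as a cross-check that needs only the standard library, evaluates the p-value with a closed-form survival function that holds specifically for 3 degrees of freedom. The function name `chi2_sf_3df` is hypothetical and is not a SciPy API.

```python
import math

n_rows, n_cols = 2, 4                 # Female/Male x four education levels
dof = (n_rows - 1) * (n_cols - 1)     # degrees of freedom for independence

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df.

    Closed form valid only for df = 3:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

chi2_stat = 8.006066246262538         # statistic computed earlier
p_value_check = chi2_sf_3df(chi2_stat)

print(dof)                            # 3
print(round(p_value_check, 4))        # 0.0459
```
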

In [5]:
# Alternative: stats.chi2_contingency returns the chi-square statistic, p-value, degrees of freedom and expected counts in one call
stats.chi2_contingency(observed= observed)

(8.006066246262538,
 0.045886500891747214,
 3,
 array([[50.88607595, 49.86835443, 50.37721519, 49.86835443],
        [49.11392405, 48.13164557, 48.62278481, 48.13164557]]))

The output shows the chi-square statistic, the p-value, and the degrees of freedom, followed by the expected counts. These match the manual calculation above, so the conclusion is unchanged: since 8.006 > 7.815, we reject the null hypothesis and conclude that education level depends on gender at the 5% level of significance.

***

## Problem Statement 2: 
 
Using the following data, perform a one-way analysis of variance using α = .05. Write up the results in APA format.
 
 
[Group1: 51, 45, 33, 45, 67]  
[Group2: 23, 43, 23, 43, 45]  
[Group3: 56, 76, 74, 87, 56] 

In [6]:
data = {'Group1':[51,45,33,45,67], 
        'Group2':[23,43,23,43,45], 
        'Group3':[56,76,74,87,56]} 
data = pd.DataFrame(data)

In [7]:
data

,Group1,Group2,Group3
0,51,23,56
1,45,43,76
2,33,23,74
3,45,43,87
4,67,45,56


#### Step 1: Calculate all the means

In [8]:
means=data.mean()
print("Means of the groups:\n",means)
grand_mean=means.mean()
print("Grand mean= ",round(grand_mean,2))

Means of the groups:
 Group1    48.2
Group2    35.4
Group3    69.8
dtype: float64
Grand mean=  51.13
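
Averaging the group means gives the grand mean here only because all three groups have the same size. The sketch below illustrates the distinction using the same data plus a made-up unbalanced example; the `unbalanced` lists are hypothetical and not part of the problem.

```python
from statistics import mean

groups = [
    [51, 45, 33, 45, 67],
    [23, 43, 23, 43, 45],
    [56, 76, 74, 87, 56],
]

mean_of_means = mean(mean(g) for g in groups)
pooled_mean = mean(x for g in groups for x in g)   # mean of all 15 values

print(round(mean_of_means, 2), round(pooled_mean, 2))  # 51.13 51.13

# Hypothetical unbalanced groups: the two quantities no longer agree,
# and the pooled mean is the one the sums of squares should use.
unbalanced = [[10, 20], [30, 40, 50, 60]]
unbalanced_mean_of_means = mean(mean(g) for g in unbalanced)
unbalanced_pooled_mean = mean(x for g in unbalanced for x in g)
print(f"{unbalanced_mean_of_means:.1f} {unbalanced_pooled_mean:.1f}")  # 30.0 35.0
```
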


#### Step 2: Specify the rejection region
α = 0.05
Rejection criterion: F > F critical (0.05)
That is, if the calculated value of F exceeds the critical value of F from the tables, we reject the null hypothesis that all group means are equal.

#### Step 3: Calculate the Sum of Squares
The total sum of squares splits into a between-groups part and a within-groups part:
SS(total) = SS(between) + SS(within)
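
The decomposition can be checked numerically in a few lines. The following is a minimal sketch in plain Python; the function name `sum_of_squares_partition` is an assumption made for illustration. It weights each group's squared deviation from the grand mean by the group size, which has the same effect as summing a constant squared deviation over the five rows of each group.

```python
from statistics import mean

def sum_of_squares_partition(groups):
    """Return (SS_between, SS_within, SS_total) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_total = sum((x - grand) ** 2 for x in all_values)
    return ss_between, ss_within, ss_total

groups = [
    [51, 45, 33, 45, 67],
    [23, 43, 23, 43, 45],
    [56, 76, 74, 87, 56],
]
ssb_check, ssw_check, sst_check = sum_of_squares_partition(groups)
print(round(ssb_check, 1), round(ssw_check, 1), round(sst_check, 1))
# 3022.9 1860.8 4883.7
```
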

In [9]:
data['g1_sst']=(data.Group1-grand_mean)**2
data['g2_sst']=(data.Group2-grand_mean)**2
data['g3_sst']=(data.Group3-grand_mean)**2

In [10]:
data

,Group1,Group2,Group3,g1_sst,g2_sst,g3_sst
0,51,23,56,0.017778,791.484444,23.684444
1,45,43,76,37.617778,66.151111,618.351111
2,33,23,74,328.817778,791.484444,522.884444
3,45,43,87,37.617778,66.151111,1286.417778
4,67,45,56,251.751111,37.617778,23.684444


In [11]:
SST=data.g1_sst.sum()+data.g2_sst.sum()+data.g3_sst.sum()
print("Total Sum of Squares, SS(Total)= ",SST)

Total Sum of Squares, SS(Total)=  4883.733333333334


In [12]:
data['g1_ssw']=(data.Group1-data.Group1.mean())**2
data['g2_ssw']=(data.Group2-data.Group2.mean())**2
data['g3_ssw']=(data.Group3-data.Group3.mean())**2
data

,Group1,Group2,Group3,g1_sst,g2_sst,g3_sst,g1_ssw,g2_ssw,g3_ssw
0,51,23,56,0.017778,791.484444,23.684444,7.84,153.76,190.44
1,45,43,76,37.617778,66.151111,618.351111,10.24,57.76,38.44
2,33,23,74,328.817778,791.484444,522.884444,231.04,153.76,17.64
3,45,43,87,37.617778,66.151111,1286.417778,10.24,57.76,295.84
4,67,45,56,251.751111,37.617778,23.684444,353.44,92.16,190.44


In [13]:
ssw=data.g1_ssw.sum()+data.g2_ssw.sum()+data.g3_ssw.sum()
print("Sum of Squares within groups, SSW= ",ssw)

Sum of Squares within groups, SSW=  1860.8


In [14]:
data['g1_ssb']=(data.Group1.mean()-grand_mean)**2
data['g2_ssb']=(data.Group2.mean()-grand_mean)**2
data['g3_ssb']=(data.Group3.mean()-grand_mean)**2
data

,Group1,Group2,Group3,g1_sst,g2_sst,g3_sst,g1_ssw,g2_ssw,g3_ssw,g1_ssb,g2_ssb,g3_ssb
0,51,23,56,0.017778,791.484444,23.684444,7.84,153.76,190.44,8.604444,247.537778,348.444444
1,45,43,76,37.617778,66.151111,618.351111,10.24,57.76,38.44,8.604444,247.537778,348.444444
2,33,23,74,328.817778,791.484444,522.884444,231.04,153.76,17.64,8.604444,247.537778,348.444444
3,45,43,87,37.617778,66.151111,1286.417778,10.24,57.76,295.84,8.604444,247.537778,348.444444
4,67,45,56,251.751111,37.617778,23.684444,353.44,92.16,190.44,8.604444,247.537778,348.444444


In [15]:
ssb=data.g1_ssb.sum()+data.g2_ssb.sum()+data.g3_ssb.sum()
print("Sum of Squares between groups, SSB= ",ssb)

Sum of Squares between groups, SSB=  3022.933333333333


In [16]:
sst=ssb+ssw
print(sst)

4883.733333333333


In [17]:
print("Sum of Squares within groups, SSW= ",ssw)
print("Sum of Squares between groups, SSB= ",ssb)
print("Total Sum of Squares, SST= ",sst)

Sum of Squares within groups, SSW=  1860.8
Sum of Squares between groups, SSB=  3022.933333333333
Total Sum of Squares, SST=  4883.733333333333


#### Step 4: Calculate the Degrees of Freedom

In [18]:
n=15 # total number of observations (3 groups x 5 values)
k=3 # number of groups

dft=n-1
dfw =n-k 
dfb =k-1
print("df(total)= "+str(dft)+"\tdf(within)= "+str(dfw)+"\tdf(between)= "+str(dfb))



df(total)= 14	df(within)= 12	df(between)= 2


#### Step 5: Calculate the Mean Squares
Mean Square(between) = SS(between) / df(between)
Mean Square(within) = SS(within) / df(within)

In [19]:
MSbetween=ssb/dfb
MSwithin=ssw/dfw
print("Mean Square Between--- MS(between)= "+str(MSbetween)+"\nMean Square Within--- MS(within)= "+str(MSwithin))

Mean Square Between--- MS(between)= 1511.4666666666665
Mean Square Within--- MS(within)= 155.06666666666666


#### Step 6: Calculate the F Statistic
F=MS(between)/MS(within)

In [20]:
F=MSbetween/MSwithin
print("F statistic = ",F)

F statistic =  9.747205503009457


#### Step 7: Looking up F from table and stating Conclusion

From the table of the F distribution, the critical value of F at the 0.05 significance level with df1 = 2 (between groups) and df2 = 12 (within groups) is:

F = 3.89

Since the calculated value of F (9.75) is greater than the tabulated value, we reject the null hypothesis and conclude that at least two of the group means are significantly different from each other.
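
The tabulated critical value can also be reproduced without a table. When the numerator has exactly 2 degrees of freedom, the upper-tail probability of the F distribution has the closed form (1 + 2F/df2)^(−df2/2), which can be inverted for the critical value. The sketch below relies on that special case only; the helper names are hypothetical, and for general degrees of freedom a library routine such as `scipy.stats.f` would be the usual choice.

```python
def f_sf_df1_2(f_value, df2):
    """Upper-tail probability of F with df1 = 2 (closed form for this case only)."""
    return (1 + 2 * f_value / df2) ** (-df2 / 2)

def f_crit_df1_2(alpha, df2):
    """Critical value of F with df1 = 2, obtained by inverting f_sf_df1_2."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)

f_stat = 9.747205503009457      # F statistic from Step 6
f_crit = f_crit_df1_2(0.05, 12)
p_val = f_sf_df1_2(f_stat, 12)

print(round(f_crit, 2))         # 3.89
print(round(p_val, 4))          # 0.0031
print(f_stat > f_crit)          # True
```
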

Effect size

η² = SSB/SST = 3022.9/4883.7 = 0.62
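
The effect size is a single division of the sums of squares computed in Step 3. A short sketch with the values hard-coded from the outputs above:

```python
ss_between = 3022.933333333333   # SSB from Step 3
ss_total = 4883.733333333333     # SST from Step 3

eta_squared = ss_between / ss_total
print(round(eta_squared, 2))     # 0.62
```

Read as a proportion, this says that roughly 62% of the total variability in the scores lies between the groups rather than within them.
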

#### APA writeup

A one-way analysis of variance showed a significant difference among the three group means, F(2, 12) = 9.75, p = .003, η² = .62.

In [21]:
# Alternative: stats.f_oneway returns the F statistic and its p-value directly
stats.f_oneway([51,45,33,45,67],[23,43,23,43,45],[56,76,74,87,56])

F_onewayResult(statistic=9.747205503009463, pvalue=0.0030597541434430556)

***

## Problem Statement 3: 
 
Calculate the F-test statistic (the ratio of the two sample variances) for the data sets 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.

In [22]:
set1=np.array([10,20,30,40,50])
set2=np.array([5,10,15,20,25])
print("Set1: ",set1)
print("Set2: ",set2)

Set1:  [10 20 30 40 50]
Set2:  [ 5 10 15 20 25]


In [23]:
mean1=set1.mean()
mean2=set2.mean()
print("Mean of set1: ",mean1)
print("Mean of set2: ",mean2)
stdev1=np.std(set1,ddof=1)
stdev2=np.std(set2,ddof=1)
print("\nStandard Deviation of set1: ",stdev1)
print("Standard Deviation of set2: ",stdev2)
var1=(stdev1)**2
var2=(stdev2)**2
print(" \nVariance of set1: ",var1)
print("Variance of set2: ",var2)
F_test=var1/var2
print("\nF_Test= Variance of set1/Variance of set2 = 250/62.5")
print("F Test value is ",F_test)


Mean of set1:  30.0
Mean of set2:  15.0

Standard Deviation of set1:  15.811388300841896
Standard Deviation of set2:  7.905694150420948
 
Variance of set1:  250.0
Variance of set2:  62.5

F_Test= Variance of set1/Variance of set2 = 250/62.5
F Test value is  4.0
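
The same ratio can be obtained with the standard library, and for this particular case the significance can be checked by hand. With five values in each set, both variances have 4 degrees of freedom, and for df1 = df2 = 4 the cumulative probability of the F distribution reduces to 3y² − 2y³ with y = F/(F + 1). The sketch below uses that special case; it is an illustration under these assumptions rather than a general-purpose F-test, and the variable names `f_ratio` and `p_upper` are chosen here.

```python
from statistics import variance

set1 = [10, 20, 30, 40, 50]
set2 = [5, 10, 15, 20, 25]

var1 = variance(set1)            # sample variance (n - 1 in the denominator)
var2 = variance(set2)
f_ratio = var1 / var2            # larger variance in the numerator

# Closed form valid only when both degrees of freedom equal 4.
y = f_ratio / (f_ratio + 1)
p_upper = 1 - (3 * y**2 - 2 * y**3)

print(f_ratio)                   # 4.0
print(round(p_upper, 3))         # 0.104
```

An upper-tail probability of about 0.10 is above 0.05, so under these assumptions a variance ratio of 4.0 with 4 and 4 degrees of freedom would not be judged significant at the 5% level.
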
