# Problem 1:

Is gender independent of education level? A random sample of 395 people was
surveyed, and each person was asked to report the highest education level they
had obtained. The survey data are summarized in the table built in the cells below.  
Question: Are gender and education level dependent at the 5% level of significance? In
other words, given the data collected above, is there a relationship between the
gender of an individual and the level of education that they have obtained?

In [1]:
import scipy.stats 
import math
import pandas as pd
import numpy as np

In [2]:
data = {
    'Gender' : ['Female', 'Male'],
    'High School' : [60, 40],
    'Bachelors' : [54, 44],
    'Masters' : [46, 53],
    'PhD' : [41, 57]
}

df = pd.DataFrame(data)
df

| | Gender | High School | Bachelors | Masters | PhD |
|---|---|---|---|---|---|
| 0 | Female | 60 | 54 | 46 | 41 |
| 1 | Male | 40 | 44 | 53 | 57 |


##### Hypothesis
H0 : Gender and education level are independent.  
H1 : Gender and education level are dependent.

In [3]:
df.loc[len(df.index)] = ['Total', df['High School'].sum(), df['Bachelors'].sum(), df['Masters'].sum(), df['PhD'].sum()]
df['Total'] = df['High School'] + df['Bachelors'] + df['Masters'] + df['PhD']
df

| | Gender | High School | Bachelors | Masters | PhD | Total |
|---|---|---|---|---|---|---|
| 0 | Female | 60 | 54 | 46 | 41 | 201 |
| 1 | Male | 40 | 44 | 53 | 57 | 194 |
| 2 | Total | 100 | 98 | 99 | 98 | 395 |


In [4]:
data = [list(df['High School'][:-1]), list(df['Bachelors'][:-1]), list(df['Masters'][:-1]), list(df['PhD'][:-1])]
print(data)

[[60, 40], [54, 44], [46, 53], [41, 57]]


In [5]:
# Chi-square test of independence on the observed counts.
# chi2_contingency returns (statistic, p-value, degrees of freedom, expected counts),
# so a single call gives both the statistic and the degrees of freedom.
chi_calculated, _, Dof, _ = scipy.stats.chi2_contingency(data)

# As per the question, the level of significance is 5%
p = 0.05
chi_critical = scipy.stats.chi2.ppf(1-p, df=Dof)

print(Dof)
print(f"The chi-square calculated value is {chi_calculated}.")
print(f"The value of chi-square calculated should be less than {chi_critical} in order to fail to reject the null hypothesis.")

3
The chi-square calculated value is 8.006066246262538.
The value of chi-square calculated should be less than 7.814727903251179 in order to fail to reject the null hypothesis.


##### Conclusion
The calculated chi-square (8.0061) is greater than chi-critical (7.8147).  
Hence the null hypothesis is rejected.  
At the 5% level of significance, there is enough evidence to conclude that gender and education level are dependent.
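
To make the statistic less of a black box, the sketch below recomputes it by hand: expected counts under independence are (row total × column total) / n, and the statistic sums (observed − expected)² / expected over all cells. The variable names (`observed`, `expected`, `chi_sq`) are illustrative and not part of the notebook above.

```python
import numpy as np
from scipy import stats

# Observed counts from the survey (rows: Female, Male;
# columns: High School, Bachelors, Masters, PhD).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)
n = observed.sum()

# Expected count under independence: row total * column total / n
expected = row_totals * col_totals / n

# Pearson chi-square statistic and its degrees of freedom
chi_sq = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution
p_value = stats.chi2.sf(chi_sq, dof)

print(f"chi-square = {chi_sq:.4f}, dof = {dof}, p-value = {p_value:.4f}")
```

Comparing the p-value with 0.05 leads to the same decision as comparing the statistic with the critical value.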

# Problem 2:

Using the following data, perform a one-way analysis of variance using α = .05. Write
up the results in APA format.

In [6]:
data = {
    'Group1' : [51, 45, 33, 45, 67],
    'Group2' : [23, 43, 23, 43, 45],
    'Group3' : [56, 76, 74, 87, 56]
}
df = pd.DataFrame(data)
df

| | Group1 | Group2 | Group3 |
|---|---|---|---|
| 0 | 51 | 23 | 56 |
| 1 | 45 | 43 | 76 |
| 2 | 33 | 23 | 74 |
| 3 | 45 | 43 | 87 |
| 4 | 67 | 45 | 56 |


###### Hypothesis
H0 : The population means of all the groups are the same.  
H1 : The population mean of at least one group differs from the others.

In [7]:
# Calculation of Degree of freedom
Dof_between = len(df.columns) - 1
Dof_within = df.size - len(df.columns)
Dof_total = Dof_between + Dof_within

print("Dof_between :",Dof_between)
print("Dof_within :",Dof_within)
print("Dof_total :",Dof_total)

Dof_between : 2
Dof_within : 12
Dof_total : 14


In [8]:
# Mean_calculation

Group1_mean = np.mean(df['Group1'])
Group2_mean = np.mean(df['Group2'])
Group3_mean = np.mean(df['Group3'])

grand_mean = df.sum()[:].sum()/df.size

print("Group1_mean :",Group1_mean)
print("Group2_mean :",Group2_mean)
print("Group3_mean :",Group3_mean)
print("Grand_mean :",grand_mean)

Group1_mean : 48.2
Group2_mean : 35.4
Group3_mean : 69.8
Grand_mean : 51.13333333333333


In [9]:
df1 = df.apply(lambda x: (x-grand_mean)**2)
df1

| | Group1 | Group2 | Group3 |
|---|---|---|---|
| 0 | 0.017778 | 791.484444 | 23.684444 |
| 1 | 37.617778 | 66.151111 | 618.351111 |
| 2 | 328.817778 | 791.484444 | 522.884444 |
| 3 | 37.617778 | 66.151111 | 1286.417778 |
| 4 | 251.751111 | 37.617778 | 23.684444 |


In [10]:
SST = df1.sum()[:].sum()
print("Sum of squares total (SST) :",SST)

Sum of squares total (SST) : 4883.733333333334


In [11]:
SSB = 0

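# Per-group means, computed once; .iloc gives explicit positional access
group_means = df.mean()
for i in range(len(group_means)):
    SSB += ((group_means.iloc[i] - grand_mean)**2)*len(df.iloc[:,i])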
    
print("Sum of squares between group (SSB) :", SSB)

Sum of squares between group (SSB) : 3022.933333333333


In [12]:
# Calculating SSW as we know SST = SSB + SSW
SSW = SST - SSB
print("Sum of squares within group (SSW) :", SSW)

Sum of squares within group (SSW) : 1860.8000000000006
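
The identity SST = SSB + SSW can be checked by computing SSW directly from the deviations of each observation from its own group mean. The following is a minimal sketch with illustrative lowercase names (`sst`, `ssb`, `ssw_direct`) that rebuilds the data so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Group1': [51, 45, 33, 45, 67],
    'Group2': [23, 43, 23, 43, 45],
    'Group3': [56, 76, 74, 87, 56],
})

grand_mean = df.to_numpy().mean()

# Total: squared deviation of every observation from the grand mean
sst = ((df - grand_mean) ** 2).to_numpy().sum()

# Between: group size * squared deviation of each group mean from the grand mean
ssb = (len(df) * (df.mean() - grand_mean) ** 2).sum()

# Within: squared deviation of each observation from its own group mean
# (df - df.mean() subtracts each column's mean from that column)
ssw_direct = ((df - df.mean()) ** 2).to_numpy().sum()

print("SST :", round(sst, 4))
print("SSB :", round(ssb, 4))
print("SSW :", round(ssw_direct, 4))
print("SSB + SSW :", round(ssb + ssw_direct, 4))
```

The directly computed SSW should agree with the value obtained by subtraction, up to floating-point rounding.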


In [13]:
# Calculating f-ratio
f_ratio_cal = (SSB/Dof_between)/(SSW/Dof_within)

# The question specifies a 5% level of significance. The ANOVA F test is
# upper-tailed, so the critical value is the (1 - p) quantile, not (1 - p/2).
f_critical = scipy.stats.f.ppf(1-p, dfn = Dof_between, dfd = Dof_within)

print(f"The calculated f-ratio is {f_ratio_cal}.")
print(f"The calculated f-ratio should be less than {f_critical:.4f} in order to fail to reject the null hypothesis.")

The calculated f-ratio is 9.747205503009454.
The calculated f-ratio should be less than 3.8853 in order to fail to reject the null hypothesis.


##### Conclusion
The calculated f-ratio (9.7472) is greater than f_critical (3.8853).  
Hence the null hypothesis is rejected.  
We conclude that, at the 5% level of significance, the population mean of at least one group differs from the others.
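
As a cross-check on the hand computation, SciPy's `scipy.stats.f_oneway` runs the same one-way ANOVA and also returns a p-value, so the decision can be made by comparing the p-value with α. This is a sketch, with `alpha` and the group lists as illustrative names.

```python
from scipy import stats

group1 = [51, 45, 33, 45, 67]
group2 = [23, 43, 23, 43, 45]
group3 = [56, 76, 74, 87, 56]

# One-way ANOVA: returns the F statistic and its upper-tail p-value
f_stat, p_value = stats.f_oneway(group1, group2, group3)

alpha = 0.05
print(f"F = {f_stat:.4f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The F statistic should match the f-ratio calculated above from SSB and SSW.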

### APA Report

###### ANOVA table

In [14]:
new_data = {
    'source' : ['group', 'error', 'total'],
    'SS' : [round(SSB,2), round(SSW,2), round(SST,2)],
    'df' : [Dof_between, Dof_within, Dof_total],
    'MS' : [round(SSB/Dof_between,2), round(SSW/Dof_within,2), round(SST/Dof_total,2)],
    'F' : [round(f_ratio_cal,2), np.nan, np.nan]
}
df2 = pd.DataFrame(new_data)
df2

| | source | SS | df | MS | F |
|---|---|---|---|---|---|
| 0 | group | 3022.93 | 2 | 1511.47 | 9.75 |
| 1 | error | 1860.80 | 12 | 155.07 | NaN |
| 2 | total | 4883.73 | 14 | 348.84 | NaN |


###### Effect size

In [15]:
η2 = round(SSB/SST,2)
print("Effect Size =",η2)

Effect Size = 0.62


###### APA Writeup

A one-way ANOVA showed a significant difference among the three group means, F(2, 12) = 9.75, p < .05, η² = .62.
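
APA style usually reports the exact p-value rather than only an inequality. The sketch below formats such a string from the F-ratio, the degrees of freedom, and η²; `apa_anova` is a hypothetical helper written for this illustration, not a SciPy or pandas function.

```python
from scipy import stats

def apa_anova(f_value, dfn, dfd, eta_sq):
    """Format a one-way ANOVA result as an APA-style string (hypothetical helper)."""
    # Upper-tail p-value of the F distribution
    p_value = stats.f.sf(f_value, dfn, dfd)
    if p_value < 0.001:
        p_text = "p < .001"
    else:
        # APA drops the leading zero for values that cannot exceed 1
        p_text = "p = " + f"{p_value:.3f}".lstrip("0")
    eta_text = f"{eta_sq:.2f}".lstrip("0")
    return f"F({dfn}, {dfd}) = {f_value:.2f}, {p_text}, η² = {eta_text}"

# Values taken from the ANOVA computed above: F-ratio, SSB and SST
report = apa_anova(9.747205503009454, 2, 12,
                   3022.933333333333 / 4883.733333333334)
print(report)
```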

# Problem 3:

Calculate the F test for the given data: 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.

In [16]:
data = {
    'Group1' : [10, 20, 30, 40, 50],
    'Group2' : [5, 10, 15, 20, 25],
}

df = pd.DataFrame(data)
df

| | Group1 | Group2 |
|---|---|---|
| 0 | 10 | 5 |
| 1 | 20 | 10 |
| 2 | 30 | 15 |
| 3 | 40 | 20 |
| 4 | 50 | 25 |


###### Hypothesis
H0 : The population means of both groups are the same.  
H1 : The population means of the two groups are not the same.

In [17]:
# Calculation of Degree of freedom
Dof_between = len(df.columns) - 1
Dof_within = df.size - len(df.columns)
Dof_total = Dof_between + Dof_within

print("Dof_between :",Dof_between)
print("Dof_within :",Dof_within)
print("Dof_total :",Dof_total)

Dof_between : 1
Dof_within : 8
Dof_total : 9


In [18]:
# Mean_calculation

Group1_mean = np.mean(df['Group1'])
Group2_mean = np.mean(df['Group2'])

grand_mean = df.sum()[:].sum()/df.size

print("Group1_mean :",Group1_mean)
print("Group2_mean :",Group2_mean)
print("Grand_mean :",grand_mean)

Group1_mean : 30.0
Group2_mean : 15.0
Grand_mean : 22.5


In [19]:
df1 = df.apply(lambda x: (x-grand_mean)**2)
df1

| | Group1 | Group2 |
|---|---|---|
| 0 | 156.25 | 306.25 |
| 1 | 6.25 | 156.25 |
| 2 | 56.25 | 56.25 |
| 3 | 306.25 | 6.25 |
| 4 | 756.25 | 6.25 |


In [20]:
SST = df1.sum()[:].sum()
print("Sum of squares total (SST) :",SST)

Sum of squares total (SST) : 1812.5


In [21]:
SSB = 0

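# Per-group means, computed once; .iloc gives explicit positional access
group_means = df.mean()
for i in range(len(group_means)):
    SSB += ((group_means.iloc[i] - grand_mean)**2)*len(df.iloc[:,i])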
    
print("Sum of squares between group (SSB) :", SSB)

Sum of squares between group (SSB) : 562.5


In [22]:
# Calculating SSW as we know SST = SSB + SSW
SSW = SST - SSB
print("Sum of squares within group (SSW) :", SSW)

Sum of squares within group (SSW) : 1250.0


In [23]:
# Calculating f-ratio
f_ratio_cal = (SSB/Dof_between)/(SSW/Dof_within)

# A 5% level of significance is used, as in the previous problem. The ANOVA F test
# is upper-tailed, so the critical value is the (1 - p) quantile, not (1 - p/2).
f_critical = scipy.stats.f.ppf(1-p, dfn = Dof_between, dfd = Dof_within)

print(f"The calculated f-ratio is {f_ratio_cal}.")
print(f"The calculated f-ratio should be less than {f_critical:.4f} in order to fail to reject the null hypothesis.")

The calculated f-ratio is 3.6.
The calculated f-ratio should be less than 5.3177 in order to fail to reject the null hypothesis.


##### Conclusion
The calculated f-ratio (3.6) is less than f_critical (5.3177).  
Hence we fail to reject the null hypothesis.  
At the 5% level of significance, there is not enough evidence to conclude that the population means of the two groups differ.
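
With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic equals the square of the t statistic and the p-values coincide. The sketch below illustrates this with `scipy.stats.f_oneway` and `scipy.stats.ttest_ind`; the variable names are illustrative.

```python
from scipy import stats

group1 = [10, 20, 30, 40, 50]
group2 = [5, 10, 15, 20, 25]

# One-way ANOVA on the two groups
f_stat, f_p = stats.f_oneway(group1, group2)

# Two-sample t-test with pooled variance (equal_var=True is the default)
t_stat, t_p = stats.ttest_ind(group1, group2)

print(f"F = {f_stat:.4f}, p-value = {f_p:.4f}")
print(f"t = {t_stat:.4f}, t squared = {t_stat**2:.4f}, p-value = {t_p:.4f}")
```

Both p-values are above 0.05, which agrees with the decision not to reject the null hypothesis.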

### APA Report

###### ANOVA table

In [24]:
new_data = {
    'source' : ['group', 'error', 'total'],
    'SS' : [round(SSB,2), round(SSW,2), round(SST,2)],
    'df' : [Dof_between, Dof_within, Dof_total],
    'MS' : [round(SSB/Dof_between,2), round(SSW/Dof_within,2), round(SST/Dof_total,2)],
    'F' : [round(f_ratio_cal,2), np.nan, np.nan]
}
df2 = pd.DataFrame(new_data)
df2

| | source | SS | df | MS | F |
|---|---|---|---|---|---|
| 0 | group | 562.5 | 1 | 562.50 | 3.6 |
| 1 | error | 1250.0 | 8 | 156.25 | NaN |
| 2 | total | 1812.5 | 9 | 201.39 | NaN |


In [25]:
η2 = round(SSB/SST,2)
print("Effect Size =",η2)

Effect Size = 0.31


###### APA Writeup

A one-way ANOVA showed no significant difference between the two group means, F(1, 8) = 3.60, p > .05, η² = .31.