# ASSIGNMENT 18

In [1]:
# Q1. Is gender independent of education level? A random sample of 395 people were
# surveyed and each person was asked to report the highest education level they
# obtained. The data that resulted from the survey is summarized in the following table:
#          High School  Bachelors  Masters  Ph.d.  Total
# Female            60         54       46     41    201
# Male              40         44       53     57    194
# Total            100         98       99     98    395
# Question: Are gender and education level dependent at 5% level of significance? In
# other words, given the data collected above, is there a relationship between the gender
# of an individual and the level of education that they have obtained?
'''
Here in this question, we are asked to perform a Non-Parametric Test.
NULL HYPOTHESIS:
    There is no relationship between the gender of an individual 
    and the level of education that they have obtained
ALTERNATE HYPOTHESIS:
    There is a relationship between the gender of an individual 
    and the level of education that they have obtained

Using the chi-square test of independence (both variables are categorical):
    χ² = Σ (O − E)² / E
χ² = test statistic
O = observed frequencies
E = expected frequencies
χ² is compared with the critical value χ²(α) with degrees of freedom = (rows − 1)(cols − 1).
The null hypothesis is rejected when:
             χ² > χ²(α)

GIVEN TABLE OF OBSERVED VALUES

         HighSchool  Bachelors  Masters  Ph.d.  Total
Female           60         54       46     41    201
Male             40         44       53     57    194
Total           100         98       99     98    395

We need to find the expected values. 
'''
import numpy as np
import pandas as pd
# Defining the table as a dataframe
df = pd.DataFrame({'HighSchool': [60,40,100],
                       'Bachelors': [54,44,98],
                       'Masters': [46,53,99], 
                       'Ph.d.': [41,57,98],
                       'Total': [201,194,395],
                        'Sex': ['Female','Male','Total']})
# Table with the observed values
df_obs=df.groupby('Sex').sum()
print("The table with observed values :\n",df_obs)
COLUMN_NAMES=['HighSchool','Bachelors','Masters','Ph.d.']
# Creating a dataframe to contain the expected values
df_exp=pd.DataFrame(columns=COLUMN_NAMES,index=['Female','Male'])
df_exp=df_exp.astype(float)
'''
Calculating expected value for the female-Highschool :
Exp_val = [Total no.of HighSchool students(both male and female)/Total sample size]*Total no.of females
        = (100/395)*201
Similarly expected values are to be calculated for all the categories
'''
# Populating the table with the expected values
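# E = (row total * column total) / grand total
# .loc[row, col] is used instead of chained indexing, which may assign to a copy
for col in COLUMN_NAMES:
    for row in ['Female', 'Male']:
        df_exp.loc[row, col] = round(df_obs.loc[row, 'Total'] * df_obs.loc['Total', col] / df_obs.loc['Total', 'Total'], 3)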
# Printing the Table with expected values
print("\n\nThe table with expected values :\n",df_exp)
'''
Computing the chi-square test statistic:
        χ² = Σ (O − E)² / E
'''
# Computing chi-square value as test_stat
test_stat=0
for col in COLUMN_NAMES:
    for row in ['Female', 'Male']:
        test_stat += ((df_obs.loc[row, col] - df_exp.loc[row, col])**2) / df_exp.loc[row, col]

test_stat = round(test_stat,3)
print("\n\nThe computed chi-square value is: ",test_stat)
'''
Degrees of freedom (DF) = (rows - 1) * (cols - 1)
In this case: rows = 2 ['Female','Male'] and cols = 4 ['HighSchool','Bachelors','Masters','Ph.d.']
hence,
DF = (2-1) * (4-1) = 3
Level of significance = 5% = 0.05 [GIVEN]
Thus,
Confidence level = 95% = 0.95
We use scipy.stats.chi2.ppf(q,df) to find the critical value.
'''
import scipy.stats as st
crit = st.chi2.ppf(q=0.95,df=3)
print("\n\nThe critical value is: ",crit,"\n\n")
# Testing the hypothesis by comparing the computed value and the critical value
print("CONCLUSION:")
if(test_stat>crit):
    print("Null Hypothesis is rejected-education level depends on gender at a 5% level of significance")
else:
    print("Null Hypothesis is not rejected-no evidence that education level depends on gender at a 5% level of significance")

The table with observed values :
         HighSchool  Bachelors  Masters  Ph.d.  Total
Sex                                                 
Female          60         54       46     41    201
Male            40         44       53     57    194
Total          100         98       99     98    395


The table with expected values :
         HighSchool  Bachelors  Masters   Ph.d.
Female      50.886     49.868   50.377  49.868
Male        49.114     48.132   48.623  48.132


The computed chi-square value is:  8.006


The critical value is:  7.814727903251179 


CONCLUSION:
Null Hypothesis is rejected-education level depends on gender at a 5% level of significance
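
As a cross-check on the manual calculation above, the same test can be run with `scipy.stats.chi2_contingency`. This is a minimal sketch and not part of the original solution: the variable names (`observed`, `chi2_stat`, `p_value`) are illustrative, and the table is re-entered without the Total row and column because the function derives the margins itself.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts only (no totals).
# Rows: Female, Male. Columns: HighSchool, Bachelors, Masters, Ph.d.
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

# correction=False: Yates' continuity correction only matters for 2x2 tables
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("Chi-square statistic:", round(chi2_stat, 3))
print("Degrees of freedom:", dof)
print("p-value:", round(p_value, 4))
print("Expected frequencies:\n", np.round(expected, 3))

# Decision rule based on the p-value instead of the critical value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis at the 5% level")
else:
    print("Fail to reject the null hypothesis at the 5% level")
```

Comparing the p-value with α is equivalent to comparing the statistic with the critical value, so this should lead to the same conclusion as the manual computation.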


In [2]:
# Q2. Using the following data, perform a oneway analysis of variance using α=.05. Write up
# the results in APA format.
# [Group1: 51, 45, 33, 45, 67]
# [Group2: 23, 43, 23, 43, 45]
# [Group3: 56, 76, 74, 87, 56]
'''
A one-way ANOVA compares the means of two or more independent (unrelated) groups using the F-distribution; here there are three groups.
NULL HYPOTHESIS:
    The means of all three groups are equal
ALTERNATE HYPOTHESIS:
    At least one group mean differs from the others
    
'''
# GIVEN GROUPS
Group1 = [51, 45, 33, 45, 67]
Group2 = [23, 43, 23, 43, 45]
Group3 = [56, 76, 74, 87, 56]

# Function to calculate mean
def mean(x):
    return sum(x) / len(x)
m1 = mean(Group1)
m2 = mean(Group2)
m3 = mean(Group3)
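m = [m1,m2,m3]
# Overall (grand) mean; averaging the group means is valid here
# only because all three groups have the same size
o_mean = mean(m)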

print("BETWEEN GROUP:")

# Between-group sum of squares: group size (5) times the squared
# deviations of the group means from the overall mean
sb = 5*((m1-o_mean)**2+(m2-o_mean)**2+(m3-o_mean)**2)
print("The Between-Group sum of squared differences(sb): ",sb)
'''
The between-group degrees of freedom is one less than the number of groups.
Thus dfg = 3-1 = 2
'''
dfg = 2
print("The Between-Group degrees of freedom(dfg): ",dfg)
# Between-group mean square value is sb/dfg
msb = sb/dfg
print("The Between-Group mean square value(msb): ",msb)

print("\n\nWITHIN GROUP:")
# Within group sum of squared differences
sw1 = 0
for i in range(len(Group1)):
    sw1 += (Group1[i]-m1)**2

sw2 = 0
for i in range(len(Group2)):
    sw2 += (Group2[i]-m2)**2
sw3 = 0
for i in range(len(Group3)):
    sw3 += (Group3[i]-m3)**2

sw = sw1+sw2+sw3
print("The Within-Group sum of squared differences(sw): ",sw)
# Within-group degrees of freedom = no. of groups x (no. of values per group - 1)
dfw = 3*(5-1)
# Within-group mean square value is sw/dfw
msw = sw/dfw
print("The Within-Group degrees of freedom(dfw): ",dfw)
print("The Within-Group sum of mean square value(msw): ",msw)
# Calculating the F-ratio
F_ratio = msb/msw
print("\n\nThe F-Ratio is: ",F_ratio)
import scipy.stats
# Given α=0.05, hence q=1-0.05
F_crit = scipy.stats.f.ppf(q=1-0.05, dfn=2, dfd=12)
print("The critical value is: ",F_crit)
print("\n\nCONCLUSION:")
if(F_ratio>F_crit):
    print("Null Hypothesis is rejected")
    print("There is strong evidence that the expected values in the three groups differ")
else:
    print("There is no evidence that the expected values in the three groups differ")

BETWEEN GROUP:
The Between-Group sum of squared differences(sb):  3022.933333333333
The Between-Group degrees of freedom(dfg):  2
The Between-Group mean square value(msb):  1511.4666666666665


WITHIN GROUP:
The Within-Group sum of squared differences(sw):  1860.8
The Within-Group degrees of freedom(dfw):  12
The Within-Group sum of mean square value(msw):  155.06666666666666


The F-Ratio is:  9.747205503009457
The critical value is:  3.8852938346523933


CONCLUSION:
Null Hypothesis is rejected
There is strong evidence that the expected values in the three groups differ
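
The question also asks for an APA-style write-up. The sketch below cross-checks the manual F-ratio with `scipy.stats.f_oneway` and assembles a one-line summary in the usual APA pattern `F(df_between, df_within) = value, p = value`. It is an illustration and not part of the original solution; the names `f_stat`, `p_value`, `p_text`, and `apa_line` are made up here.

```python
from scipy.stats import f_oneway

Group1 = [51, 45, 33, 45, 67]
Group2 = [23, 43, 23, 43, 45]
Group3 = [56, 76, 74, 87, 56]
groups = [Group1, Group2, Group3]

# One-way ANOVA across the three groups
f_stat, p_value = f_oneway(*groups)

# Degrees of freedom: (k - 1) between groups, (N - k) within groups
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)

# APA style drops the leading zero of p and reports very small values as "< .001"
if p_value < 0.001:
    p_text = "p < .001"
else:
    p_text = "p = " + f"{p_value:.3f}".lstrip("0")

apa_line = f"F({df_between}, {df_within}) = {f_stat:.2f}, {p_text}"
print(apa_line)
```

A fuller APA report would normally also give each group's mean and standard deviation, and those can be added in the same way.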


In [3]:
# Q3. Calculate F Test for given 10, 20, 30, 40, 50 and 5,10,15, 20, 25.
'''
The F-test for equality of two variances is a small-sample test.
F = (Larger estimate of population variance) / (Smaller estimate of population variance)
The variance ratio = S1^2 / S2^2

NULL HYPOTHESIS:
    The two variances are equal
ALTERNATE HYPOTHESIS:
    The two variances are unequal

GIVEN VALUES:
I - 10, 20, 30, 40, 50
II - 5,10,15, 20, 25
Thus the ratio of the variances of both the samples is to be calculated
'''
# Function to calculate mean
def mean(x):
    return sum(x) / len(x)
# Function to calculate the sample variance (n - 1 in the denominator), rounded to 2 decimals
def variance(x):
    n = len(x)
    x_bar = mean(x)
    return round(sum((x_i - x_bar)**2 for x_i in x) / (n - 1), 2)
# Taking the given values in separate lists
Lst_1 = [10, 20, 30, 40, 50]
Lst_2 = [5,10,15, 20, 25]
# Calculating the variances of the sets
var_1 = variance(Lst_1)
var_2 = variance(Lst_2)

# Defining a function that takes two sample variances and prints the F-Test value (larger variance / smaller variance)
def f_test(x,y):
    if(x>y):
        f_test_val = x/y
    else:
        f_test_val = y/x
    print("The F-Test Value is: {}".format(f_test_val)) 
# Performing the F-Test
f_test(var_1,var_2)

The F-Test Value is: 4.0
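
The cell above stops at the variance ratio. To complete the test, the ratio can be compared with a critical value from the F-distribution. The sketch below assumes α = 0.05 and a two-sided alternative, neither of which is stated in the question, and the names `f_value`, `f_crit`, and `p_value` are illustrative.

```python
import numpy as np
from scipy import stats

sample_1 = [10, 20, 30, 40, 50]
sample_2 = [5, 10, 15, 20, 25]

# Sample variances (ddof=1 gives the n - 1 denominator)
var_1 = np.var(sample_1, ddof=1)
var_2 = np.var(sample_2, ddof=1)

# Put the larger variance in the numerator and track the matching degrees of freedom
if var_1 >= var_2:
    f_value, dfn, dfd = var_1 / var_2, len(sample_1) - 1, len(sample_2) - 1
else:
    f_value, dfn, dfd = var_2 / var_1, len(sample_2) - 1, len(sample_1) - 1

alpha = 0.05  # assumed significance level, not given in the question

# Two-sided test: use the upper alpha/2 quantile and double the upper-tail probability
f_crit = stats.f.ppf(1 - alpha / 2, dfn, dfd)
p_value = min(1.0, 2 * stats.f.sf(f_value, dfn, dfd))

print("F value:", f_value)
print("Critical value:", f_crit)
print("p-value:", p_value)

if f_value > f_crit:
    print("Null Hypothesis is rejected: the variances differ")
else:
    print("Null Hypothesis is not rejected: no evidence that the variances differ")
```

With only five observations per sample, the test has little power, so failing to reject does not establish that the variances are equal.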
