# Q1
# **`Problem Statement 1:`**

- Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey data are summarized in the following table:


| Gender/Edu | High-School | Bachelors |  Masters | Ph.d.  |  Total  |
|------------|-------------|-----------|----------|--------|---------|
| Female     |       60    |     54    |     46   |   41   |   201   |
| Male       |       40    |     44    |     53   |   57   |   194   |
| Total      |      100    |     98    |     99   |   98   |   395   |


- Question: Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education they have obtained?

#### **`Solution :- `**

In [8]:
import numpy as np
import pandas as pd
import scipy.stats as stats

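# Observed counts per education level, in the order High School, Bachelors, Masters, Ph.d.
female_list = [60,54,46,41]
male_list = [40,44,53,57]
# "marks" holds the head counts (not exam marks), male rows first to match the labels below
marks = male_list + female_list

gender = ['Male','Male','Male','Male','Female','Female','Female','Female']
edu = ['High School', 'Bachelors', 'Masters', 'Ph.d.','High School', 'Bachelors', 'Masters', 'Ph.d.']
df_edu = pd.DataFrame({"Gender":gender,"Edu":edu,"Marks":marks})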


In [14]:
df_edu.head()

|   | Gender | Edu         | Marks |
|---|--------|-------------|-------|
| 0 | Male   | High School |   40  |
| 1 | Male   | Bachelors   |   44  |
| 2 | Male   | Masters     |   53  |
| 3 | Male   | Ph.d.       |   57  |
| 4 | Female | High School |   60  |


In [18]:
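# aggfunc='sum' gives the same result as np.sum here and avoids a deprecation warning in recent pandas versions
table = pd.pivot_table(df_edu, values='Marks', index='Gender', columns='Edu', aggfunc='sum', margins=True)
table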

| Gender | Bachelors | High School | Masters | Ph.d. | All |
|--------|-----------|-------------|---------|-------|-----|
| Female |     54    |      60     |    46   |   41  | 201 |
| Male   |     44    |      40     |    53   |   57  | 194 |
| All    |     98    |     100     |    99   |   98  | 395 |


In [21]:
table.rename(columns={'All':'Row_Total'}, inplace = True)
table.index = ["Female","Male","Col_Total"]

#### Chi-Square test of independence 
- To Test:
   - H0 : The two categorical variables are independent.
   - H1 : The two categorical variables are dependent.

#### Step 1: Observed frequencies
- The observed frequencies are the counts in the body of the contingency table, that is, the number of people in each subgroup without the row totals, column totals and grand total:

In [26]:
obs_table = table.iloc[0:2,0:4]   
obs_table

|        | Bachelors | High School | Masters | Ph.d. |
|--------|-----------|-------------|---------|-------|
| Female |     54    |      60     |    46   |   41  |
| Male   |     44    |      40     |    53   |   57  |


#### Step 2: Expected frequencies
- Remember that for the Chi-square test of independence we need to determine whether the observed counts differ significantly from the counts we would expect if there were no association between the two variables.
- We have the observed counts (see the table above), so we now need to compute the expected counts in the case the variables were independent. 
- These expected frequencies are computed for each subgroup one by one with the following formula:

 
![image](https://journal.ahima.org/wp-content/uploads/2015/10/Expected-Cell-Frequency-Equation.png)

In [28]:
exp_freq =  np.outer(table['Row_Total'][0:2], table.loc["Col_Total"][0:4]) / 395.0
exp_freq = pd.DataFrame(exp_freq)
exp_freq.columns = ["Bachelors","High School","Masters","Ph.d."]
exp_freq.index = ["Female","Male"]

- In the formula above, obs. stands for observations. Given our table of observed frequencies, the expected frequencies computed for each subgroup are shown below:


In [29]:
exp_freq

|        | Bachelors | High School | Masters   | Ph.d.     |
|--------|-----------|-------------|-----------|-----------|
| Female | 49.868354 | 50.886076   | 50.377215 | 49.868354 |
| Male   | 48.131646 | 49.113924   | 48.622785 | 48.131646 |
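
As a cross-check, the same expected counts can be obtained without building the outer product by hand. The sketch below assumes that the installed SciPy version provides `scipy.stats.contingency.expected_freq`; the array name `observed` and the variable `female_hs_by_hand` are introduced here only for illustration, with columns in the same order as the table above.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts; rows: Female, Male
# columns: Bachelors, High School, Masters, Ph.d.
observed = np.array([[54, 60, 46, 41],
                     [44, 40, 53, 57]])

# Expected counts under independence, computed from the marginal totals
expected = expected_freq(observed)

# One cell by hand: Female / High School = row total * column total / grand total
female_hs_by_hand = 201 * 100 / 395

print(np.round(expected, 4))
print(round(female_hs_by_hand, 4))
```

The hand-computed cell should agree with the Female / High School entry, and the expected counts should sum to the grand total of 395.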


#### Step 3: Test statistic
- We have the observed and expected frequencies. We now need to compare them to determine whether they differ significantly. The discrepancy between the observed and expected frequencies, referred to as the test statistic and denoted χ², is computed as follows:
![image](https://www.thoughtco.com/thmb/ns7d4DC1AqVGme2p1-WYqC26r_s=/768x0/filters:no_upscale():max_bytes(150000):strip_icc()/latex_ac74fec08532861eb5f8b87226ebf396-5c59a6fcc9e77c00016b4195.jpg)

In [30]:
# Sum (observed - expected)^2 / expected over every cell of the table
chi_squared_stat = (((obs_table-exp_freq)**2) / exp_freq).sum().sum()

print("Chi-square : ",chi_squared_stat)

Chi-square :  8.006066246262538
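
The manual calculation can be checked against SciPy's built-in routine. This is a minimal sketch, assuming `scipy.stats.chi2_contingency` is acceptable as a reference; the array `observed` is re-entered here so that the block runs on its own.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts; rows: Female, Male
# columns: Bachelors, High School, Masters, Ph.d.
observed = np.array([[54, 60, 46, 41],
                     [44, 40, 53, 57]])

# Yates' continuity correction is only applied when dof == 1 (2x2 tables),
# so the default settings do not alter the statistic for this 2x4 table
chi2_stat, p_val, dof, expected = chi2_contingency(observed)

print("Chi-square statistic:", round(chi2_stat, 4))
print("Degrees of freedom  :", dof)
print("p-value             :", round(p_val, 4))
```

The statistic returned here should match the value obtained from the observed and expected tables above.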


#### Step 4: Critical value
- The test statistic alone is not enough to decide between independence and dependence of the two variables. It must be compared to a critical value to determine whether the difference is large or small.
- The critical value can be found in the statistical table of the Chi-square distribution and depends on the significance level, denoted α, and the degrees of freedom, denoted df. The significance level is usually set to 5%. The degrees of freedom for a Chi-square test of independence are found as follows:

   **`df = (number of rows − 1) ⋅ (number of columns − 1)`**

- In our example, the degrees of freedom is thus 


In [35]:
df = ( 2 - 1) * ( 4 - 1 )
print("Degrees of Freedom: " ,df)

Degrees of Freedom:  3


- since there are two rows and four columns in the contingency table 


In [38]:
cri_val = stats.chi2.ppf(q = 0.95, df = 3)   # 95% confidence
print("Critical value : ",round(cri_val, 4))

p_value = 1 - stats.chi2.cdf(x = chi_squared_stat, df=3)
print("P value        :  ",round(p_value,4))
print()

Critical value :  7.8147
P value        :   0.0459



- With α = 0.05 and df = 3, the critical value of the Chi-square distribution is `7.8147`.
- The p-value of the test statistic is `0.0459`.

#### Step 5: Conclusion and interpretation
- Now that we have the test statistic and the critical value, we can compare them to check whether the null hypothesis of independence of the variables is rejected or not. In our example,

    **`test statistic = 8.006 > critical value = 7.8147`**

- therefore we **`reject the null hypothesis`** and conclude that gender and education level are dependent at the 5% level of significance. The p-value leads to the same decision, since 0.0459 < 0.05.
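
The two decision rules (critical value and p-value) are equivalent, which can be verified with a short sketch. The statistic is hard-coded from Step 3 so that the block is self-contained; the names `reject_by_critical_value` and `reject_by_p_value` are illustrative.

```python
from scipy import stats

alpha = 0.05
chi_squared_stat = 8.006066246262538  # value computed in Step 3
dof = 3                               # (2 - 1) * (4 - 1)

# Rule 1: compare the statistic with the (1 - alpha) quantile
critical_value = stats.chi2.ppf(1 - alpha, dof)
reject_by_critical_value = chi_squared_stat > critical_value

# Rule 2: compare the p-value with alpha (sf is the survival function, 1 - cdf)
p_value = stats.chi2.sf(chi_squared_stat, dof)
reject_by_p_value = p_value < alpha

print("Reject H0 (critical value rule):", reject_by_critical_value)
print("Reject H0 (p-value rule)       :", reject_by_p_value)
```

Both rules always give the same answer for a given α, because the p-value falls below α exactly when the statistic exceeds the critical value.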

# Q2
# **`Problem Statement 2:`**
- Using the following data, perform a one-way analysis of variance using α = .05. Write up the results in APA format.

      [Group1: 51, 45, 33, 45, 67]
      [Group2: 23, 43, 23, 43, 45]
      [Group3: 56, 76, 74, 87, 56]



#### **`Solution :- `**

- ANOVA is a statistical inference test that lets you compare multiple groups at the same time.
- The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one categorical variable. 
- It essentially answers the question: do any of the group means differ from one another? 

In [39]:
import scipy.stats as stats

Group1 = [51, 45, 33, 45, 67]
Group2 = [23, 43, 23, 43, 45]
Group3 = [56, 76, 74, 87, 56]

#### **`ANOVA`**
- The `scipy` library provides a function for one-way ANOVA tests, `scipy.stats.f_oneway()`.


In [43]:
statistic, pvalue = stats.f_oneway(Group1, Group2, Group3)

print("F Statistic :   {} ".format(statistic))
print("p-value     :   {}  ".format(pvalue))


F Statistic :   9.747205503009463 
p-value     :   0.0030597541434430556  
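
To see where the F statistic comes from, it can be rebuilt from the between-group and within-group sums of squares. This is a sketch under the standard one-way ANOVA definitions; the names `ss_between`, `ss_within` and `f_manual` are introduced here for illustration and are not part of the original solution.

```python
import numpy as np

groups = [np.array([51, 45, 33, 45, 67]),
          np.array([23, 43, 23, 43, 45]),
          np.array([56, 76, 74, 87, 56])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group variation: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)   # df_between = k - 1 = 2
ms_within = ss_within / (n - k)     # df_within  = n - k = 12
f_manual = ms_between / ms_within

print("F (manual):", f_manual)
```

The manual value should agree with the statistic reported by `stats.f_oneway()` up to floating-point rounding.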


In [44]:
# Use the ANOVA p-value (pvalue), not p_value from the Chi-square test in Q1
if pvalue < 0.05:
    print("H0 is rejected")
else:
    print("Failed to reject H0")

H0 is rejected
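
The problem also asks for the result in APA format. The sketch below assembles such a line from the test output; the formatting choices (two decimals for F, three for p, no leading zero on p) are an assumption about common APA conventions, and the dictionary `groups` and the string `apa` are illustrative names.

```python
import numpy as np
import scipy.stats as stats

groups = {"Group1": [51, 45, 33, 45, 67],
          "Group2": [23, 43, 23, 43, 45],
          "Group3": [56, 76, 74, 87, 56]}

f_stat, p_val = stats.f_oneway(*groups.values())

df_between = len(groups) - 1
df_within = sum(len(v) for v in groups.values()) - len(groups)

# Assumed APA convention: drop the leading zero of the p-value
p_text = f"{p_val:.3f}".lstrip("0")
apa = f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_text}"
print(apa)

# Descriptive statistics that usually accompany the F test in a write-up
for name, values in groups.items():
    print(f"{name}: M = {np.mean(values):.2f}, SD = {np.std(values, ddof=1):.2f}")
```

A sentence built from this output would state that there was a significant effect of group at the α = .05 level, followed by the F line and the group means.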


# Q3
# **`Problem Statement 3:`**
- Calculate the F-test statistic for the two samples 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.


#### **`Solution :- `**

- The F-statistic is simply:

![image](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRDYhndizrofVIphFkNxs4D97W9LZQRdpv7wQ&usqp=CAU)

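- where s1_sq is the variance of sample 1 and s2_sq is the variance of sample 2. Each sample variance is built from the squared deviations from the sample mean, so we start with the means: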

In [87]:
#stats.f_oneway([10, 20, 30, 40, 50],[5,10,15, 20, 25])

In [88]:
x1 = [10, 20, 30, 40, 50]
x2 = [5, 10, 15, 20, 25]
x_bar_1 = np.mean(x1)
x_bar_2 = np.mean(x2)

In [89]:
# Sum of squared deviations from the mean, one accumulator per sample
num1 = 0
num2 = 0
for x in x1:
    num1 += (x - x_bar_1)**2

for x in x2:
    num2 += (x - x_bar_2)**2

- Calculate `s1_sq` and `s2_sq`

![image](https://miro.medium.com/max/666/0*ovSFlxj9RJMgtQoX.png)

In [90]:
n1 = len(x1)
n2 = len(x2)

s1_sq = num1 / (n1 - 1)
s2_sq = num2 / (n2 - 1)

In [91]:
F_Test = s1_sq / s2_sq
print("F Test for given 10, 20, 30, 40, 50 and 5,10,15, 20, 25 is : " , F_Test)

F Test for given 10, 20, 30, 40, 50 and 5,10,15, 20, 25 is :  4.0
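
The loop-based variances can be checked with NumPy, and the ratio can be turned into a p-value. This is a sketch, assuming a one-sided test of whether the first variance is larger than the second; the original solution stops at the ratio, so the p-value step and the names `f_ratio`, `dfn`, `dfd` and `p_one_sided` are additions for illustration.

```python
import numpy as np
from scipy import stats

x1 = [10, 20, 30, 40, 50]
x2 = [5, 10, 15, 20, 25]

# ddof=1 gives the sample variance (divides by n - 1)
s1_sq = np.var(x1, ddof=1)
s2_sq = np.var(x2, ddof=1)
f_ratio = s1_sq / s2_sq

# Degrees of freedom: numerator n1 - 1, denominator n2 - 1
dfn, dfd = len(x1) - 1, len(x2) - 1

# Assumed one-sided alternative: variance of x1 greater than variance of x2
p_one_sided = stats.f.sf(f_ratio, dfn, dfd)

print("s1_sq:", s1_sq, " s2_sq:", s2_sq)
print("F ratio:", f_ratio)
print("One-sided p-value:", p_one_sided)
```

Because every value in the first sample is exactly twice the corresponding value in the second, the variance ratio is 2² = 4.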
